Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

How to split out my dataset ? 1

Status
Not open for further replies.

irinnew

Technical User
Mar 19, 2005
81
US
Hi experts,

As I mentioned in the previous post, I have a long file with two columns. One of the columns is a program’s name and another one is a number of participants applied.
I evaluated the number of members in different buckets: small-small, medium-small and etc and created a chart.
Now I would like to use a different approach and to split out the dataset into 6 equal size groups. In other words, I need to split dataset by the equal number of programs field. For example I need 6 equal groups

Program Number of participants
__________________________________
A 1
B 1
C 2
D 4
E 4
F 5
G 8
H 10
I 12
J 12
K 14
L 14
M 14
N 20
O 20
P 20
Q 100
R 115
S 120


It would be:
1st Grp:
A 1
B 1
C 2

2nd grp:
D 4
E 4
F 5

3rd grp:
G 8
H 10
I 12

4th
J 12
K 14
L 14

And so on. If it is not even …then the last group could slightly differ LIKE THE LAST ONE.
P 20
Q 100
R 115
S 120


I have no idea how to divide the dataset. Hopefully somebody can help me.
Thank you in advance!

Irin
 
This needs to be brokwn down into stages.
1 - Find out how many records are in the dataset. There are several ways of doing this, one of which was discussed in this forum recently.
Code:
data _null_;
   set  dset nobs=count;
   call symput('obscount',trim(left(put(count,10.))));
   stop; 
run;
2 - If you know you want 6 equal groups then you can divide this number by 6 and round it to the nearest whole number. Then in your datastep to split you would do something like this
Code:
data dset1
     dset2
     dset3
     ...
     dset6;
  set big_data;

  retain count 0 dset 1;

  if dset=1 then output dset1;
  else if dset=2 then output dset2;
  ...
  else output dset6;

  if count = &obscount then do;
     count = 0;
     dset + 1;
  end;

  count + 1;
run;

That should do it, or at least it'll be close to what you need. There's probably a more elegant way, but I'm too busy to think about it ATM. Sorry.
 

Chris,

I am not sure what ampersand in &obscount means. Is is a part of macro or kind of a substitution variable?


Log gave me the following error....


if count=&obscount then do;
--
180
ERROR 22-322: Syntax error, expecting one of the following: a name, a quoted string,
a numeric constant, a datetime constant, a missing value, bitstring, INPUT, PUT.

ERROR 180-322: Statement is not valid or it is used out of proper order.

What does it mean?
 
OOPS!
I forgot to include "how many records " part! Please disregard my previous message. But...This is the whole code:

libname num 'c:\Irin\SAS Datasets';
data dset;
set num.program;
run;

data _null_;
set dset nobs=count;
call symput('obscount', trim(left(put(count,10.))));
stop;
run;

data dset1
dset2
dset3
dset4
dset5
dset6;
set dset;

retain count 0 dset 1;

if dset=1 then output dset1;
else if dset=2 then output dset2;
else if dset=3 then output dset3;
else if dset=4 then output dset4;
else if dset=5 then output dset5;
else output dset6;

if count=&obscount then do;
count=0;
dset + 1;
end;
count + 1;
run;
--------------------------------------------
Although I have no error log anymore my log is giving me the following:
NOTE: There were 2000 observations read from the data set WORK.DSET.
NOTE: The data set WORK.DSET1 has 2000 observations and 4 variables.
NOTE: The data set WORK.DSET2 has 0 observations and 4 variables.
NOTE: The data set WORK.DSET3 has 0 observations and 4 variables.
NOTE: The data set WORK.DSET4 has 0 observations and 4 variables.
NOTE: The data set WORK.DSET5 has 0 observations and 4 variables.
NOTE: The data set WORK.DSET6 has 0 observations and 4 variables.

Thefore my dset1 have the same # of records as dset. All others dset2,3,4,5,6, are empty...

Also I am not sure what count 10 in trim(left(put(count,10)means? Should I leave it as it is or to replace it with the # of observations for each of 6 groups?

What I am doing wrong?
 
Sorry, what you need to do is divide &obscount by 6. Change the first step to

Code:
data _null_;
    set dset nobs=count;
    splitcount = round(count/6,1);
    call symput('obscount', splitcount);
    stop;
run;
That should work. The "round" is to round it to the nearest whole number.

call symput, if you aren't familiar with it, sets up a macro variable.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top