Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Count unique codes by group

Status
Not open for further replies.

cte1

Technical User
Aug 16, 2010
9
US
I am trying to count the number of unique codes in a group. Each group consist of multiple rows and the codes are pipe delimited in each cell. A sample of my dataset looks like this:

GROUP CODES
1 |231|322|414|
1 |231|322|
2 |231|
2 |231|114|
2
3
3

So in GROUP 1 there are 3 unique codes (231, 322, and 414) and in GROUP 2 there are 2 unique codes (231 and 114) There are 0 codes in GROUP 3. Can anyone tell me how to get SAS to give me an output dataset like this:

Group Count
1 3
2 2
3 0

Thanks in advance.

Chris
 
You just count the number of "|" and subtract 1?

data x;
set master;
rcount =(length(Name)-length(compress(Name,"|"))) - 1;
run;

compress removes the '|' from whatever the variable is leaving you with the number of '|'. Since there is an extra '|' just subtract one. There would be other ways to do this, but this was the easiest I could think of.
 
Oh, and if some don't have the '|', you could do the index function to make sure and then use an If, then statement.

So, basically:

if index(name,'|') > 0 then rcount = (statement above); else 0;
 
In SAS 9 there is a countw function that counts the 'words' in your string. All you have to do is set the delimiter to the pipe '|' char.

ex
Code:
total = countw(codes,'|');

Hope this helps..
Klaz
 
Thanks for the help. I really appreciate it. That worked great for counting the number of codes within each cell. The problem is I need to know the number of unique codes across multiple observations (or rows). See example above. Any ideas?

 
I figured that you would want the unique codes count too (and I was curious) so I experimented with some code.

Code:
data YouroutputDS(keep=outstr wrdtot agestr total_unq);
  set inputds end = last;
  *** INITIALIZE YOUR OUTPUT STRING VAR ***;
  length outstr $500;
  *** SET UP AN ARRAY WITH 5000 FIELDS ***;
  array strsort (*) $500 strs1-strs5000 ;
  *** COUNT THE ITEMS IN YOUR STRING (CODES) ***;
  wrdtot = countw(CODE,'|');
  do i=1 to wrdtot;
    strsort(i) = scan(CODE,i,'|');
  end;
  *** THIS IS A SORT FUNCTION ***;
  *** THIS FUNCTION SHOULD HAVE A UNIQUE OPTION ***;
  *** WE SHOULD PETITION SAS :)                 ***;
  call sortc(of strsort (*));
  *** CHECK FOR UNIQUE CODES ***;
  flag = 0;
  do i=1 to dim(strsort);
    *** WE DONT CHECK ON THE FIRST POSITION ***;
    if i gt 1 and trim(strsort(i)) ne '' then do;
     if trim(strsort(i))= trim(strsort(i-1)) then do;
           *** BLANK OUT ALL DUPS ***;
	   strsort(i)='';
       call sortc(of strsort (*));
	 end;
	end;
  end;

  CALL CATX('|', OF strs1-strs5000);
  outstr = strsort(1);
  total_unq = countw(outstr,'|');
run;
I hope this helps...
Klaz
 
Thanks Klaz. I ran the code, however it is only giving the number of unique codes within each cell. For example, if my dataset looks like this:

CODE
|191|
|191|
|187|189|191|278|934|
|187|189|191|278|
|191|278|

When I run the code I get a dataset that looks like this:

outstr wrdtot total_unq
|191| 2 1
|191| 2 1
|187|189|191|278|934| 6 5
|187|189|191|278| 5 4
|191|278| 3 2

It is giving the total unique codes within each cell but not the total unique codes accross the group. In this case there are 6 unique codes within this group (or within these 5 rows).

I really appreciate your help. Do you have any other ideas?

Thanks,
Chris
 
Sorry, there was an error in my last post. The example should look like this.

CODE
|190|
|191|
|187|189|191|278|934|
|187|189|191|278|
|191|278|

When I run the code I get a dataset that looks like this:

outstr wrdtot total_unq
|190| 2 1
|191| 2 1
|187|189|191|278|934| 6 5
|187|189|191|278| 5 4
|191|278| 3 2

It is giving the total unique codes within each cell but not the total unique codes accross the group. In this case there are 6 unique codes within this group (or within these 5 rows).

Thanks again,
Chris
 
Yes the code only gives you the unique count in each cell (record). If you wanted to combine rows and then get calculate the unique codes you have two options.
1) Use the retain statement and 'grow' your code string until all codes are in one record. Then run the script you have.

2) Use Proc transpose to transpose your data values into a vertical structure (1 column for all your values by groupID) then use Proc SQL to count with a distinct option.

Personaly, I would use the second option as the first may run out of room if you have a few hundred codes in a group.

Klaz
 
Here is an example of the Proc Transpose method.

Code:
data f2;
  set your_ds end = last;
  
  array strsort (*) $500 strs1-strs5000 ;
  wrdtot = countw(code,'|');
  do i=1 to wrdtot;
    strsort(i) = scan(code,i,'|');
  end;
run;
*** DATA MUST BE SORTED BY ID VAR ***;
proc transpose data =f2 out =f3;
by id;
var strs1-strs5000;
run;

*** COUNTS NON-MISSING VALUES ***;
proc sql;
  create table f4 as
  select id , count(distinct col1) as unq_tot
  from f3
  group by id;
quit;

I hope this helps you.
Klaz
 
Hi Klaz,

First, thanks so much for you help. The code you posted ran fine but it is only giving me the unique (or distinct) count for col1, which is not the distinct count across the entire group. In the sql statement I need the distinct count across all of the columns (i.e. col1, col2, col3, etc.). Any ideas?

Thanks again,

Chris
 
I thought that you wanted the unique total across all your columns. If you don't all you have to do is in the PROC SQL step add a group by varname.

Look at the transposed dataset and you will find a column that has the original transposed variable name. Add that column to both the select and group statements.

select id, thetransposedcolumnname, count(distinct col1) as...

group by id, thetransposedcolumnname

change the above variable name to the correct one (I can't recall off-hand what that name is)

klaz
 
Hi Klaz,

Thanks again for getting back to me. I do want the unique total across all columns. The code is not giving that result. Here is the code that you sent that I ran:

data f2;
set cltest end = last;
array strsort (*) $500 strs1-strs5000 ;
wrdtot = countw(code,'|');
do i=1 to wrdtot;
strsort(i) = scan(code,i,'|');
end;
run;
proc sort data=f2; by id; run;
proc transpose data =f2 out =f3;
by id;
var strs1-strs5000;
run;
*** COUNTS NON-MISSING VALUES ***;
proc sql;
create table f4 as
select id , count(distinct col1) as unq_tot
from f3
group by id;
quit;

The input data file, cltest, is available here:


For example there are 11 unique CODES for ID 601 in the cltest dataset. The codes are 178, 181, 186, 278, 204, 816, 208, 212, 214, 176, and 202. However the dataset f4 only shows 9 unique codes for ID 601, as it is only counting the unique observations in the first transposed column (col1).

Sorry for so many questions. This stuff seems so easy for you. If you have any ideas to fix this problem it would really help me out a lot. Again, I really appreciate all of your help.

thanks,
chris
 
Hi Chris,

Hopefully this is what you are looking for (one of many ways):

First transposing, then getting the unique group/counts and using proc summary to perform the counts. The completetypes and classdata options allow proc summary to retrieve the groups that have 0 codes.

Code:
data have ;
   infile cards missover ;
   input group codes :$20. ;
cards ;
1 |231|322|414|
1 |231|322|
2 |231|
2 |231|114|
3      
3       
3        
;;;;;
proc print;run;
data long (keep=group code);
   set have ;
   do _n_=1 to countW(codes,'|')-1;
      code =input(scan(codes,_n_,'|'),best.) ;
      output;
      end ;
   run;
proc sort data=long out=unique nodupkey ;
   by group code ;
   run;
proc summary data=unique missing completetypes classdata=have nway ;
   class group ;
   output out= want(keep=group count) n(code)=count ;
   run;
proc print;run;
 
That worked perfectly! Thank you so much.

Chris
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top