Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Include an additional variable in a proc sql group by statement

Status
Not open for further replies.

Jubinell

Technical User
Oct 21, 2008
14
JP
Hi all,

I have 2 datasets: b is the main dataset, a contains the list of relevant unique personIDs that I want to pull data from b based on. In particular, I'm pulling the minimum date grouped by personID. I also want to restrict the minimum date for class "A" persons only and exclude missing dates.

Here is trouble: in the table I want to list (alongside personID and minimum date) the subclass associated with the minimum date. My table should have unique personIDs that look something like this:

personID FirstDate FirstDateSubClass

However, the table that I got contains duplicate personIDs even though it returned the minimum date correctly. I think SAS first subset dataset b using the "having" statement, then displays the min date for however many lines there are remaining in the b dataset as opposed to grouping them to one line per personID.



Here is the code I have so far:

*******
proc sql;
create table mytable as
select
a.personID,
b.personID,
min(b.date) as firstdate,
b.subclass as firstdatesubclass
from
a left join
b
on a.personID = b.personID
group by b.personID
having b.date ne . and b.class = 'A';
quit;

*************

How do I correct this code to return one unique personID per line containing minimum date of that personID and the subclass of that minimum date?
 
Hi Jubinell,

I think you are confusing what the group by clause in SQL does. The group by is for performing a calculation on a specific group of data. As you are looking for the minimum date for each personid, the group clause you are using is valid. I think you just need to use the distinct clause to eliminate the duplicate records.

Code:
proc sql;
create table mytable as
select distinct
        a.personID,
        b.personID,
        min(b.date) as firstdate,
        b.subclass as firstdatesubclass
from    
        a left join
        b
        on a.personID = b.personID
group by b.personID
having b.date ne . and b.class = 'A';
quit;

If you just want to use table a to identify which records to pull out of table b you might find an in-line view more efficient; I prefer the way it looks.

Code:
proc sql;
   create table mytable as
   select distinct personID, min(date) as firstdate,
           subclass as firstdatesubclass
   from b
   where personid in (select personid from a)
   group by personID
   having date ne . and class = 'A';
   quit;

 
wow...your code is way more efficient than mine. I think it was slick to pull only matching personids using the in() statement. Thank you!

One thing I've been puzzled for a while and haven't found a satisfying answer. What's the difference between the where and the having statements? I am under the impression that you'll need the having statement to refer to a newly calculated variable, but is this all? Is it wrong to use having when the variables referred to are variables from the original datasets?

Another thing, how do you put your codes in the code box? (-:
 
Yes, where and having are both for subsetting. There are times when both will work such as your example here. In such a case I would use the where clause over the having clause as the variable exists in the dataset.

So my code should have looked like:

Code:
proc sql;
   create table mytable as
   select distinct personID, min(date) as firstdate,
           subclass as firstdatesubclass
   from b
   where personid in (select personid from a) and 
      date ne . and class = 'A'
   group by personID;
   quit;

As a rule of thumb, if you want to subset on a summary function use the group by, otherwise use the where clause.
Here's a few examples of using summary functions in SQL.

Code:
data large;
input id val @@;
cards;
1 23 1 34 2 343 3 45 3 44 
4 55 4 561 5 33
;run;

data small;
input id @@;
cards;
2 3 4 5
;run;

proc sql;
   ** No group by clause will return min of entire dataset;
   Title 'Subsetting only on having';
   select distinct l.id, min(val) as mins
   from large as l 
   where l.id in (select id from small)
   having calculated mins lt 40;

   ** Group by and having will return records filtered by id;
   Title 'Subsetting only on having';
   select distinct l.id, min(val) as mins
   from large as l 
   where l.id in (select id from small)
   group by l.id
   having calculated mins lt 40;

   ** Using summary functions in the where clause will return error;
   Title 'Subsetting summary functions on where clause';
   select distinct l.id, min(val) as mins
   from large as l 
   where l.id in (select id from small) and min(val) lt 40;
quit;

And finally, if you put your code between code tags, they will be boxed (remove spaces in the tags) "[ code] put code here [ /code]" HTH
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top