Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations IamaSherpa on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

What the difference 2

Status
Not open for further replies.

sophie123

Programmer
Nov 2, 2004
10
US
What the difference between the noduplicate command and the no dup keys command.
 
Where do you find a command named 'noduplicate' in SAS? I know of a NODUP command but not a 'noduplicate' command. The nodup works almost like the the nodupkey command only its the option in a function call or in an AF object. The NODUPKEY is usually reserved for use in a PROC such as the Proc Sort.

Are there any others that can explain this too.
Klaz
 
NODUPKEY eliminates the observations with the duplicate BY values in the sort

NODUPLICATE eliminates duplicate data observations from the sort

Check out
this compares NODUPEKEY, NODUPLICATE and actually talks about why it is easier/better (my opinion, ok it compares the stuff for us old school SAS folks) to do this sort of thing in SQL - from the person who eliminated SAS processing with SQL in just a few weeks... (sorry)

This help Sophie?
 
If you are deduping it is best (in SAS) to use a data step so that you can store the dupes as well to look at them. You can also set up the order of the records so that you can make sure you keep the records you want. For instance to dedupe a file of client records, keeping only the most recent record:-
Code:
proc sort data=dset1;
  by client_id rec_date;
run;

data dset2(drop=kept)
     dupes;
  set dset1;
  by client_id;

  if last.client_id then 
  do;
     output dset2;
     kept='Y';
  end;
  if not(first.client_id and last.client_id) then output dupes;
run;
This will give you one file (dset2) containing unique client records, where we know that the record kept is the most recent record. We also have a second dataset (dupes) which contains all records which weren't unique on the original dataset. Sometimes this is useful to view potential issues with the original data. You also have a record of which records were kept and which were dropped.

Using just Proc Sort with nodup or nodupkey can give you slightly unstable results, and also if your by group when using nodup does not encompass enough variables, you can still end up with duplicate records in the output dataset, because SAS just compares each record read in with the record that was last read out, so it needs to be pre-sorted first.
 
Thanks everyone. Yes Jymm this was helpful. This was an interview question so I think the short def is what they were after. I use NODUPKEY because we usally we have a specific key that can be used. I never used the NODUPLICATE command.
 
For extra credit you can comment that the nodup/noduplicate method does not always get rid of duplicates. To guarantee that it works you actually need to list every variable in the by statement to ensure that duplicate records would be placed next to each other and will therefore be eliminated correctly. Or, as the SAS manual says:-

Because NODUPRECS checks only consecutive observations, some nonconsecutive duplicate observations may remain in the output data set. You can remove all duplicates with this option by sorting on all variables.

Also, here's something peculiar. In the SAS8 online doco, NODUPLICATES is listed in the index, and takes you to the NODUPRECS option. The Noduplicates option does not appear anywhere within the doco....
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top