Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations IamaSherpa on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Please help! Merging data sets

Status
Not open for further replies.

cosmid

Programmer
Feb 14, 2008
73
US
Hi,

I have a task that requires me to merge 4 different data sets. Names of the data sets are A, B, C and D. They all have a common variable ID, age, hours, lname, fname. They each got their own unique variables such as uniqueA, uniqueB, etc. The objective of the task is to merge all 4 data sets into 1 master data sets ABCD.

I realized that with merge you will loose the value of the variable if they are the same. For example,

data ABCD;
merge A B C D;
by ID;
run;

with the above code, if the value of age is 26 in A, 28 in B, 29 in C, and 30 in D, in the master data set ABCD, the value for age will be 30. My question is, how do I create a master data set with all four data sets A, B, C, and D?
 
Ok, let me explain again to see if I have this right. Data set A, B, and C are data from 2001, 2002, and 2003 respectively. Data set D is all background information data. I guess I should use set instead of merge on data set A, B, and C first right? And then merge them with D?

Data ABC;
set A B C;
by ID;
run;

Data ABCD;
merge D ABC;
by ID;
run;

merge and set is so confusing. this is my first project that involves merging few large data files. Thanks for your help and time!
 
You want all the data in the 4 data sets in 1 big data set without losing any values?

You can stack them vertically like this:

Code:
data want;
set A B C D;
run;

So everything in Column1B will be under Column1A..etc.

Hopefully that gets you closer to what you need.

~Dave
 
Yeah. But that would be using set. I suppose to merge them together. Is there anyway SAS can do this?
 
I think I understand now. Do the data sets have a lot of variables? If you only need a few you can use the rename option during the merge to keep the data from overwriting itself.

Code:
data want;       
merge TableA (rename=(var1=Var1A var2=Var2A var3=Var3A))
	  TableB (rename=(var1=Var1B var2=Var2B var3=Var3B)); 
by id;
run;

But that is not something I would really want to do if you have to rename a lot of variables.

I'm not sure if there is a way to rename all the variables in a dataset at once...for example, add an "B", or "C" to the end of the variable name.
 
Here is a way to rename all the variables in a datastep (using SQL and proc datasets), that should save your fingers a little :)

This example uses the sashelp.class dataset.

Code:
* Make copy of sashelp.class in work for demo;
proc copy in=sashelp out=work;
   select class;
   run;

* Create code to be used in proc datasets;
* and store in macro variable;
proc sql noprint;
   select cats(name,'=',name,'_B')
   into: names separated by ' '
   from dictionary.columns
   where libname='WORK' and memname='CLASS';
   quit;

* Rename all variables in work.class;
proc datasets lib=work nolist;
   modify class;
   rename &names;
   run;

* Merge original sashelp.class and modified;
* work.class to show both are kept;
data merged;
   merge class sashelp.class;
   run;

proc print;run;
 
dblan:

There are about 20 variables total. 8 out of the 20 variables are common variables, like age, hours, etc. I am not suppose to change variable names. My objective is to create a master file that combine all 4 data sets into a master data set. But with merge, new values will overwrite old ones, I need to ask my supervisor if I understand it correctly. I don't think there is a way of doing this except of creating a cartesan product.
 
Sorry guys. I should had made things clear. I am not suppose to rename variables. Actually, I am required to change all the different variables in different data sets into one common variable. If hours is called HR in data set A, and HRs in data set B, Hour in C, and Hours in ABC, and for the master data set, I am supposed to name all hours variable to Hours.

In other words, by default, the variable names are different from the beginning because they use different variable names each year to reference the same thing. I suppose to create a new master data set the combine all the years and change the variable names into 1 single common variable. But the problem I am having is that the values of the variables will be overwritten by the last data set in the merge statement. I am not even sure if this task is possible to do.

thanks again for the help guys!
 
Yeah. I just got an email. I think I am suppose to do a set not a merge. The instructions weren't clear. But the question still exists. Is it possible to merge 4 different data sets into 1 master data set without replacing any values if the variables are the same? It is not possible right?

Data set A: ID, age, hours
Data set B: ID, name, hours
Data set C: ID, dept, hours
Data set ABC: ID, hours

To merge the 4 data sets together into a master data set ABCD, the values for hours will be replaced by the last data set used in the merge statement right? I am not talking about set. The new master data set should be:

Data set ABCD: ID, age, name, dept, hours

Thanks
 
Is hours the same value in each dataset for each id?

As in would Set A, ID 1 have an hours value of 2 while Set B ID 1 has an hours value of 4?

If they have the same value, then it shouldn't matter if they are overwriting. If they don't have the same value, then you'll just have to keep the one you want and not the others.

ie:

Data ABCD;
merge A (in=a keep=id age)
B (in=b keep=id name)
C (in=c keep=id dept hours);
by id;
run;
 
They all have different values. So you can only keep 1 final value right?
 
How about posting some test data and an example of what you would like the output to look like. Although this may be a bit more work upfront, it will ensure you get a more tailored answer, it also takes a lot of the guess work for the rest of us.
 
Some test data would make it more clear...

From what you're saying you have an hours field in each dataset that may differ from dataset to dataset.

Do you want to keep each seperate data set's version of the hours field in your merged data set or a specific one? if it's a specific one then you can use the code I put in above.

If you want to keep all of them then you'll just need to rename them in your merge.

Data ABCD;
merge A (in=a keep=id age hours rename=(hours=hoursA))
B (in=b keep=id name hours rename=(hours=hoursB))
C (in=c keep=id dept hours rename=(hours=hoursC));
by id;
run;
 
If I'm understanding correctly, cosmid's goal is to not rename the columns. So there would be 4 variables named "hours".

Is there a reason why the names need to be the same? In theory, if they are the same name it would make impossible to query data from the variable. (for example, if you had 4 variables named "Hours" and you had a statement (where=(Hours>4))...SAS wouldn't know which 1 of the 4 to use.)

In fact, if you try to import a table where the variable names are the same, SAS will automatically add a 1,2,3,etc to the end of the name. (hours1, hours2, hours3)

If they are needed for presentation only, try adding label names that match what you need. You can choose to display those in a proc print.

PROC PRINT DATA=ABCD label;;
RUN;

~Dave
 
Wow! Thanks for taking the time guys! I really appreciate the inputs. But you guys can ignore this thread now. I don't think the objective is possible to do. It didn't make sense logically. Instead of using a merge, the solution was to use a set statement instead. Thanks again for the time and help!
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top