Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations SkipVought on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Merge Vs Proc sql 1

Status
Not open for further replies.

dimpu79

Programmer
May 8, 2004
8
0
0
US
Hi,
While merging large datasets, say 3 to 4, is it better to use merge or proc sql? Considering performance problems, space etc...
Thanks in advance
 
When you say large would 5 million be in that league? PROC SQL is quicker on the real large ones that are sorted. This means that the SQL proc utilizes the sort index and only reads the records that it needs. The data step (merge) will read all the records and then only use the ones it needs.
So if you plan to read through the large Dataset (not too much processing) they should be the same. Although I think that the SQL acts faster (if its pre-sorted).
see this article on google groups:

Klaz
 
Thank you Klaz. Your answers are helpful to me.
 
dont get me wrong - I used SAS alot to do this kind of processing (note past tense) --- depending on the database that you are pulling from why not create a view of the data (oracle or MS SQL) and let the processing be done on the database server end before SAS even touches it.

I will guarentee you that a join in SQL in either SQL 2k or Oracle will blow SAS's (proc SQL or merge) time out of the water... and coupled with the SQL being a view of the database things can be optimized with indexes that would make even accessing the biggest database fly...

In other words - SAS was really built to do the Statistical processing of data and is not really built to do data manipulation (although I used it for that too).

my 2 cents
 
One thing to note though when using SAS to access data from Oracle:-

This is good:-

proc sql;

connect to oracle as oracle
(user="&dbuser" pass="&dbpass" path="&dbpath" );

* Licences *;
create table licence as
select *
from connection to oracle
(select *
from dw_table
where status = 'COMPLETE'
);
quit;

This is bad:-

libname prod oracle user="&dbuser"pw="&dbpass" path="&dbpath" preserve_col_names=yes;

data licence;
set prod.dw_table(where=(status = 'COMPLETE'));
run;

The reason for this is that Oracle will process the whole table or view in the second instance to create temporary data for SAS to access (ie the where clause is not passed up to Oracle) whereas in the first example, the whole query is executed on the Oracle database. This can have HUGE impact on the speed of your queries if the tables/views are large.

I think I may stick this one in the FAQ section, it's something we're trying to educate our users about at the moment.

 
Aside from performance times, you won't always get the same result from proc SQL, merge.

If there are duplicate values in the fields you are joining/merging on for more than one fo the datasets, proc sql will provide the more predictable result.
 
Yep. SAS does not handle Cartesian Joins, nor does it even give an ERROR or WARNING. What you get is a "NOTE: More that one dataset contains repeats of by values". If you ever see this in your log, look really carefully at what you are doing becaus the results are almost certainly wrong. IF you absolutely HAVE to do a cartesian join, then you have to do it through Proc SQL, but 99% of the time, you don't want to do a cartesian join.
 
I found a new option the other day which may help people joining SAS Datsets to Oracle or other DBMS tables, the DBKEY option.

In the example below, the library DWH is actually an Oracle table.

Code:
proc sql;
  create table emp_address as
    select a.*
          ,compbl(b.ADDRESS_LINE_1_TX || '/' || b.ADDRESS_LINE_2_TX || '/' || b.ADDRESS_LINE_3_TX  || '/' 
          || b.SUBURB_NM  || '/' || put(b.POST_CD,4.)) as mailing_address
    from emp_details  a
        ,dwh.address(dbkey=address_id dbnullkeys=NO)  b
    where a.address_id = b.address_id
    ;

When processing this, multiple queries are sent to Oracle, one of each record in the emp_details dataset to retrieve one single record from the address table. This is alot faster for small SAS Datasets than not using the option, as without DBKEY the whole address table is retrieved. This has made a HUGE difference in processing time on some of my queries. DBNULLKEYS=NO tells SAS that it is not expecting any records to have null values in the key field.
Look up the DBKEY and DBNULLKEYS options in the SAS doco for more information.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top