Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

PROC SQL Matching Problem

Status
Not open for further replies.

JerichoHill

Programmer
Apr 25, 2006
2
US
Here's the issue.

I have two files.

SANBEN is a listing of surnames in a place of interest.
HISPANIC80 is a listing of partial and full Hispanic surnames.

I'm currently trying to port over this surname analysis program into SAS from a very old version of DBASE.

The following code works, but is not seemingly matching up the smaller partial surnames.

proc sql;
create table ASN_SanBen as
select *
from SanBen
where SanBen.surname IN
(select Hispanic80.ssname
from asn.Hispanic80);
quit;

By partial surname, I mean like

BADA would match to BADA and BADABOOM

The error I get is

ERROR: Subquery evaluated to more than one row.

I assume this is because, where the HISPANIC80 has both Bada and Badaboom, the SANBEN list has Badaboom.

This is what I've been trying
proc sql;
create table ASN_SanBen as
select *
from SanBen
where SanBen.surname CONTAINS
(select Hispanic80.ssname
from asn.Hispanic80);
quit;

I've also tried ?, and a few other things.

I would appreciate any thoughts on this matter. Thanks
 
In theory, the error message there is saying that there are duplicates on surname on your ssname field in asn.Hispanic80, I would double check this.

Also, SAS will only do a match on the complete name, so BADA will only match to BADA, not BADABOOM. If you want to match it, you'll have to create a shortened form of the surname field on both sets and match on that.

You may also want to look up the "IN:" and "=:" operator to see if that might help you out.
 
Thanks Chris.

The =: or In: operator doesn't seem to function in SQL, but here's how I did manage to reduce my error rate (not identifying a spanish surname the previous Dbase program did) from 3% to 1%.

I went to look at what surnames were being missed. It seemed that a pattern had emerged that surnames with a space or a hyphen were not being identified.

Eliminating the hyphen doesn't do anything, but removing the spaces from a name like DE LA SOUL works because the ssname list doesnt have space (it appears as DELASOUL).

That netted me hundreds more of observations for this particular data problem when I used the COMPRESS function to eliminate spaces in the surnames.

Based on your comments, It appears I can't go much further than this. The error is actually BADA hitting both BADA AND BADABOOM, it doesn't like the double return, I think. While I would love to search each surname for the substring(ssname), according to you SAS wont do this conditional match.

Bummer. But, with an error rate of 1%, I'm reasonably satisfied. If anyone else can help figure out how to eliminate the subquery to more than 1 row error, I'd be very appreciative.

Again, thanks Chris.
 
Why not try an iterative method?
First do exact matches, then create a second copy of your SSNAME file which where the surnames are truncated to say 4 characters or so, create a similar field on your main data, and join on this field again, maybe adding a second flag etc. Soemthing like this
Code:
proc sql;
        create table ASN_SanBen1 as
        select a.*
              ,b.flag  as match1
        from SanBen  a
           LEFT JOIN
             asn.Hispanic80   b
          ON a.surname = b.ssname
        ;

        create table ASN_SanBen2 as
        select a.*
              ,b.flag  as match2
        from ASN_SanBen1   a
           LEFT JOIN
             asn.Hispanic80
          ON SanBen.short_surname = b.short_ssname
        ;
quit;
You could keep doing this with varying lengths of name, adding the flag on for each attempt, and this would give you a level of confidence for each match (ie flag1 would be 100% confident, each higher flag would be successively less confident).
You could then afterwards do a final report on the table shoing how many you hit on each level, pick a point at which you're getting a good number of matches without sacrificing accuracy, and you'll be laughing.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top