Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations IamaSherpa on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Match parts of a character variable against a name list

Status
Not open for further replies.

Jubinell

Technical User
Oct 21, 2008
14
JP
Hi,

I have a dataset with a variable called description. A description may contain zero, one or many names matching names in a list called namelist. I'm trying to search description for these names and, when found, create new variables containing them called name1, name2, name3, etc. For the matching algorithm I'm using regular expression. I'm open to other algorithms but I'd like to keep the flexibility of the regular expression.

How to go about this? I got started but got stuck. Here are my codes. I don't think I am using symput correctly. Furthermore, I'm not even sure that I need to be dealing so much with the macro facility here. Perhaps it's better to stick with the datastep.

Code:
%macro mysearch;
%data outdata;
%set indata;
%let p = 0;
%do n = 1 %to 30/*number of names in name list*/;
    match = %match(%scan(&namelist, &n),description);
    call symput('q', match);
    %let p = %eval(&p + &q);
    if %match(%scan(&namelist, &n),description) ne 0 and if &p > 0 and if name&p ne "" then name&p = %scan(&namelist, &n)";
%end;
run;
%mend; %mysearch;


%macro match(match_text, match_in);
%let match = prxmatch(prxparse("/&match_text\b/"),lowcase(&match_in));
&match
%mend;

 
Could you show us some example data and what you want the output to look like?
Cheers.
Chris.

Chris
Business Analyst, Code Monkey, Data Wrangler.
SAS Guru.
 
Hi Chris,

So let's say my dataset has 1 variable called description. Let's just say I have 5 observations:

Code:
peter and john go to school
john is a boy
i like john, mark and peter
i go to school
peter goes to school


the list that i have, called name list, contains the following:

Code:
peter
john
mark

I'd like the macro to return new variables name1, name2 and name3 as follows:

Code:
#Obs Name1 Name2 Name3
1    peter   john
2    john
3    john peter mark
4
5  peter

The order of what name is in name1, name2 and name3 is not important. What is important is that all names from namelist are captured. Also...I'd like to have a macro system setup that automate the naming of nam1-3, since the real list is a lot bigger than that (say 100 names to match against).


Thank you in advance for your help.

 
OK, that makes it pretty clear.
I think I can see a way to do it, but I'm having trouble finding the time to thrash out the code.
In the meantime, here's the steps I would do.

1 - Read in your list of comparison names and use this to create a format using the "proc format cntlin=" method. Set it up so that anything else returns a value you can test for (say "X").
2 - process the data using a do loop to extract each word using SCAN and loading each word, via the format into a temp array.
3 - If the word is not "X" it gets loaded, else you iterate it.
4 - create a new text string out of the array variables
5 - exract out each word (which will now only contain the words you want) into a different array, keep these variables in the output.
That should do it.

Another alternative:-
1 - read each word in the text string into an array, and output each one.
2 - Transpose the data, keeping a key on the data to identify the original record it came from.
3 - Join this to your list of words, dropping any record that don't match.
4 - number the records by the original key (ie if you have 3 words from a record originally id'd as 206 you want to have 3 records id'd 206, with another variable containing 1, 2 and 3.
5 - Transpose the data back again BY the original key, using the number as the ID variable.

Both these methods should do it for you.


Chris
Business Analyst, Code Monkey, Data Wrangler.
SAS Guru.
 
Interesting problem. Here's a solution using a hash as the lookup table. You can decide whether you need the merge step.

Code:
data have;
   infile cards missover dlm=' ';
   input  wordlist $1-27;
   i=_n_; *To merge on;
   cards;
peter and john go to school
john is a boy
i like john, mark and peter
i go to school
peter goes to school
   ;run;

data namelist;
   input name$ @@;
   cards;
peter john mark
   ;run;

data findme(keep=wordlist name x i);
   if 0 then set namelist;
   length x $20;
   call missing(x);
   declare hash hh(dataset:'namelist');
   hh.definekey('name');
   hh.definedata('name');
   hh.definedone();
   do i=1 by 1 until(eof);
      set have end=eof;
      do j=1 to countw(wordlist);
         x = scan(wordlist,j,', ');
         if ^hh.find(key: x) then output;
         end;
      end;
   run;

proc transpose data=findme out=trans(drop=_:) prefix=name;
   by i;
   var name;
   run;

data want(drop=i);
   merge have(in=a keep=i wordlist) trans;
   by i;
   if a;
   run;

proc print;run;
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top