Assigning ID number to multiple cases.

RicoUnsuave · Aug 29, 2008

Hi,

I have a data file with a contact name, contact address, and contact phone number information. There are duplicates within each of those fields. If there are records that are duplicate on *any* of the fields, I want to assign them a "family id." Here is an example of what I am trying to accomplish. Hope that makes sense.

name address phone Family id
john doe 1122 Main 512-441-7777 1
john doe 1734 main 512-476-7778 1
April Showers 1122 Main 512-444-5959 1
Cookie Sheets 2600 McGregor 512-441-7777 1
Mary Jolly 1200 State 734-777-7777 2
Mary Jolly 1200 State 734-777-7777 2
Joseph Jolly 1200 State 734-777-7777 2

bryanbayfield · Sep 1, 2008

Can you please clarify ... the first two records match on name, but "Family ID" is 1, yet the last 3 records match on name, address and phone number and have a Family ID of 2.

Is it 2 if there's a match on any of the fields with another observation, and 1 if not?

RicoUnsuave · Sep 2, 2008

Bryan,

Sorry for the confusion. What I want to do is assign a unique family id to each case that match on either of the three variables.

In the example I provided instead of assigning family ids 1 and 2, I could have assigned 1001 and 1002 or whatever. The id number is used to identify the cases as belonging to a particular "family.

bryanbayfield · Sep 2, 2008

You know what, that's actually a bit trickier than you'd think at first glance.

Perhaps you could duplicate the dataset, and then do a cartesian join between the two, and delete rows where there wasn't a match on any of the three fields.

I'm not sure, will have to think about it. Difficult.

bryanbayfield · Sep 2, 2008

OK, think this one will do it ...

You will obviously be creating a composite surname variable from "name" as this is what identifies a family. That is your third variable along with address and telephone.

First sort by surname, then create a new dataset from that using by processing on the surname variable. When first.surname=1 increment surname count by 1. Now do this for address and telephone so you end up having a dataset with three count variables in it.

Now you do a new dataset based on the one you just created. Use 3 retained character variable to store previous unique count variable values in, delimited by spaces (don't forget to use put() when concatenating them on). If indexw(concatenated_var_n,cnt_var_n)>0 then that count value has previously appeared. If indexw()=0 for all 3 then that family hasn't previously appeared, so increment your family variable by one.

That should do it, unless you have so many different families that the previosu values concatenated together are too long for a single char variable.

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Assigning ID number to multiple cases.

RicoUnsuave

Programmer

bryanbayfield

Technical User

RicoUnsuave

Programmer

bryanbayfield

Technical User

bryanbayfield

Technical User

Similar threads

Part and Inventory Search

Sponsor