Splitting name into component parts

Status
Not open for further replies.

grega

Programmer
Feb 2, 2000
932
GB
I'm currently faced with a problem where I have a person's name as a single string. No stringent format is applied to this name, i.e. it may be MR JOE BLOGGS, J BLOGGS ESQ, MRS JOANNE BLOGGS F.R.C.S, MR.J.BLOGGS ... you see what I'm getting at.

I need to convert this string into its component parts (Title, Forename, Surname, Suffix) and am having some success with awk, but it's proving difficult to get it to match every conceivable pattern of name.

Has anyone else had experience of this sort of exercise? I don't want someone to do it for me (unless it's been done already) but any ideas would be appreciated.

Regards,

Greg.
 
Sorry grega, you have no chance of doing it:
no regexp, no rules, nothing :(
The only way is to force users to enter the data properly!
-----------
when they don't ask you anymore where they come from, and they don't tell you anymore where they go ... you're getting older!
 
Very difficult IMO. You would have to parse each name against every possible separator in turn, in case the first one tried didn't match.
Of course it wouldn't work if a name was written with multiple separator types.
You could write one function for whitespace splitting, one for period splitting, etc., and use the return value from split to determine whether a name split had been successful.
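
As a rough sketch of that try-each-separator idea (in Python rather than awk, purely for illustration): the separator list, the `min_parts` threshold and the function name `split_name` are all assumptions of mine, not something taken from the original data.

```python
import re

# Candidate separators, tried in order. The list and its order are assumptions.
SEPARATORS = [r"\s+", r"\.", r","]

def split_name(raw, min_parts=2):
    """Return (pattern, parts) for the first separator that yields at least
    min_parts non-empty pieces, or (None, [name]) if none of them does."""
    text = raw.strip()
    for pattern in SEPARATORS:
        parts = [p for p in re.split(pattern, text) if p]
        if len(parts) >= min_parts:
            # The piece count plays the role of awk's split() return value.
            return pattern, parts
    return None, [text]

print(split_name("MR JOE BLOGGS"))
print(split_name("MR.J.BLOGGS"))
```

Note that a mixed-separator name such as MRS JOANNE BLOGGS F.R.C.S succeeds on whitespace and leaves F.R.C.S as one piece, which is exactly the limitation mentioned above.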



 
In this case it is really hard to specify all the rules. If you can get them all then this is feasible. Don't forget about commas (last name, comma, first name), apostrophes, multiple middle names, omitted grammar, extra suffixes (e.g. JR, III) and hyphenated names. Most likely some rules will be missed and will have to be added downstream.
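
To make a few of those rules concrete, here is a minimal Python sketch. The `TITLES` and `SUFFIXES` tables are deliberately tiny and hypothetical, the function names are mine, and it only covers titles, trailing suffixes, period-separated names and the "surname, forename" comma form. It is nowhere near a complete rule set.

```python
import re

# Illustrative lookup tables only (assumptions); real data will need far more.
TITLES = {"MR", "MRS", "MS", "MISS", "DR"}
SUFFIXES = {"ESQ", "JR", "SR", "II", "III", "FRCS"}

def tokens(text):
    """Split on whitespace and commas; split on periods only when the word
    is not a known title or suffix once its periods are removed."""
    out = []
    for word in re.split(r"[\s,]+", text.strip().upper()):
        collapsed = word.replace(".", "")
        if collapsed in TITLES or collapsed in SUFFIXES:
            out.append(collapsed)          # e.g. "F.R.C.S" -> "FRCS"
        else:
            out.extend(p for p in word.split(".") if p)
    return out

def parse_name(raw):
    head, comma, tail = raw.partition(",")
    toks, tail_toks = tokens(head), tokens(tail)
    if comma and tail_toks and tail_toks[0] not in SUFFIXES:
        toks = tail_toks + toks            # "BLOGGS, JOE": surname came first
    else:
        toks = toks + tail_toks            # "JOE BLOGGS, JR" or no comma
    title = toks.pop(0) if toks and toks[0] in TITLES else ""
    suffix = []
    while len(toks) > 1 and toks[-1] in SUFFIXES:
        suffix.insert(0, toks.pop())
    surname = toks.pop() if toks else ""
    return {"title": title, "forename": " ".join(toks),
            "surname": surname, "suffix": " ".join(suffix)}

print(parse_name("MRS JOANNE BLOGGS F.R.C.S"))
print(parse_name("MR.J.BLOGGS"))
```

Hyphenated surnames survive here only because hyphens are never treated as separators. Apostrophes, multi-word surnames and anything missing from the tables will still need rules of their own.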

Can your system handle this kind of inaccuracy? Co-workers here standardize data with a coding application which gets best fit matches (40-90%). Then they make corrections by hand on uncoded data.

Dealing with multiple aliases (e.g. 'J. Brown', 'Jim G. Brown', 'Brown, JG') for the same individual is another big issue.

Cheers,
ND [smile]

bigoldbulldog@hotmail.com
 
Thanks all. I agree it is a difficult proposition. I think our approach will be to build a set of rules which will convert a large proportion of the data successfully, and the remainder will probably need to be manually examined to check. We're dealing with 1.1M records here :)
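
A rough sketch of that convert-what-you-can, review-the-rest flow, in Python for illustration. `triage` and `simple_parse` are hypothetical names, and `simple_parse` is only a stand-in for whatever rule set is eventually built. The point is that any rule function returning `None` for "not confident" can be plugged in.

```python
def triage(records, parse):
    """Split records into (parsed, needs_review) using a parse callable
    that returns a dict on success or None when no rule applies."""
    parsed, needs_review = [], []
    for rec in records:
        result = parse(rec)
        if result is None:
            needs_review.append(rec)       # leave for manual examination
        else:
            parsed.append((rec, result))
    return parsed, needs_review

def simple_parse(raw):
    """Stand-in rule (an assumption): accept only 'FORENAME SURNAME'."""
    words = raw.split()
    if len(words) == 2 and all(w.isalpha() for w in words):
        return {"forename": words[0], "surname": words[1]}
    return None

records = ["JOE BLOGGS", "MR.J.BLOGGS", "JOANNE BLOGGS"]
parsed, needs_review = triage(records, simple_parse)
print(f"{len(parsed)} of {len(records)} converted; to review: {needs_review}")
```

Keeping the unparsed records in a separate list makes it easy to see which patterns are worth a new rule before resorting to manual checks.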

Greg.
 