Splitting name into component parts

grega · Dec 16, 2002

I'm currently faced with a problem where I have a person's name as a single string. No strict format is applied to this name, e.g. it may be MR JOE BLOGGS, J BLOGGS ESQ, MRS JOANNE BLOGGS F.R.C.S, MR.J.BLOGGS ... you see what I'm getting at.

I need to convert this string into its component parts (Title, Forename, Surname, Suffix) and am having some success with awk, but it's proving difficult to get it to match every conceivable pattern of name.

Has anyone else had experience of this sort of exercise? I don't want someone to do it for me (unless it's been done already) but any ideas would be appreciated.

Regards,

Greg.

jamisar · Dec 16, 2002

Sorry grega, but you have no chance of doing this: no regexp, no rules, nothing will cover it.

The only way is to force users to enter the data properly!

marsd · Dec 16, 2002

Very difficult IMO. It would require you to parse the names looking for all possible separators in case one didn't match. Of course, it wouldn't work if a name was created with multiple separator types. You could write one function for whitespace splitting, one for period splitting, etc., and use the return value from split to determine when a name split had been successful.
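As a rough illustration of the "try each separator in turn" idea, here is a minimal Python sketch (not awk, and not anything from the original poster's scripts). The `SEPARATORS` list, the `split_name` name and the `min_parts` threshold are all assumptions made up for this example.

```python
import re

# Separators to try, in order of preference (an assumption; extend as needed).
SEPARATORS = [r"\s+", r"\.", r","]

def split_name(raw, min_parts=2):
    """Return (separator, parts) for the first separator that yields at
    least min_parts non-empty pieces, or (None, [raw]) if none does."""
    text = raw.strip()
    for sep in SEPARATORS:
        # Drop empty strings produced by leading/trailing separators.
        parts = [p for p in re.split(sep, text) if p]
        if len(parts) >= min_parts:
            return sep, parts
    # No separator produced a usable split.
    return None, [text]
```

As noted above, this breaks down when a name mixes separator types: "MR. J BLOGGS" would split on whitespace and leave the period attached to "MR.".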

bigoldbulldog · Dec 16, 2002

In this case it is really hard to specify all the rules. If you can get them all then this is feasible. Don't forget about commas (last name, comma, first name), apostrophes, multiple middle names, omitted grammar, extra suffixes (e.g. JR, III) and hyphenated names. Most likely some rules will be missed and will have to be added downstream.
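To give a feel for two of those rules (comma ordering and trailing suffixes), here is a small Python sketch. The `SUFFIXES` set and the `reorder_comma_name` function are hypothetical, and the suffix list is nowhere near complete.

```python
# Illustrative suffix list only; real data will need many more entries.
SUFFIXES = {"JR", "SR", "II", "III", "IV", "ESQ"}

def reorder_comma_name(raw):
    """Return (tokens, suffix), rewriting 'SURNAME, FORENAMES' so the
    forenames come first and pulling off a trailing suffix if present."""
    head, sep, tail = raw.strip().partition(",")
    if not sep:
        # No comma: keep the tokens in the order given.
        tokens = head.split()
    elif tail.strip().rstrip(".").upper() in SUFFIXES:
        # "JOE BLOGGS, JR": the comma introduces a suffix, not a forename.
        return head.split(), tail.strip()
    else:
        # "BLOGGS, JOE": move the surname to the end.
        tokens = tail.split() + head.split()
    suffix = ""
    if len(tokens) > 1 and tokens[-1].rstrip(".").upper() in SUFFIXES:
        suffix = tokens.pop()
    return tokens, suffix
```

Because the tokens are only split on whitespace, apostrophes and hyphens inside a name such as O'Brien-Smith are left alone.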

Can your system handle this kind of inaccuracy? Co-workers here standardize data with a coding application which gets best fit matches (40-90%). Then they make corrections by hand on uncoded data.

Dealing with multiple aliases (e.g. 'J. Brown', 'Jim G. Brown', 'Brown, JG') for the same individual is another big issue.

Cheers,
ND

bigoldbulldog@hotmail.com

grega · Dec 18, 2002

Thanks all. I agree it is a difficult proposition. I think our approach will be to build a set of rules which will convert a large proportion of the data successfully, and the remainder will probably need to be examined manually. We're dealing with 1.1M records here.
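For anyone reading later, the general shape of that approach could look something like the Python sketch below (the actual work here is in awk, so treat this purely as an illustration). The `TITLES` and `SUFFIXES` sets, the function names and the "exactly two remaining tokens" rule are all assumptions chosen to keep the example short.

```python
# Illustrative lookup sets; a real rule set would be much larger.
TITLES = {"MR", "MRS", "MISS", "MS", "DR"}
SUFFIXES = {"ESQ", "JR", "SR", "F.R.C.S"}

def _key(token):
    # Normalise a token for lookup: drop trailing periods, upper-case.
    return token.rstrip(".").upper()

def parse_name(raw):
    """Return a dict of name parts, or None if no rule fits."""
    tokens = raw.split()
    if len(tokens) == 1:
        # No whitespace at all, e.g. "MR.J.BLOGGS": fall back to periods.
        tokens = [t for t in raw.strip().split(".") if t]
    title = tokens.pop(0) if tokens and _key(tokens[0]) in TITLES else ""
    suffix = tokens.pop() if tokens and _key(tokens[-1]) in SUFFIXES else ""
    if len(tokens) != 2:
        # Deliberately conservative: anything else goes to manual review.
        return None
    return {"title": title, "forename": tokens[0],
            "surname": tokens[1], "suffix": suffix}

def convert(records):
    """Split records into (parsed, needs_review)."""
    parsed, needs_review = [], []
    for raw in records:
        result = parse_name(raw)
        if result is None:
            needs_review.append(raw)
        else:
            parsed.append(result)
    return parsed, needs_review
```

Returning None rather than guessing keeps the doubtful records in a separate pile, so new rules can be added as patterns show up in the review list.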

Greg.
