One approach for this is:
1 - Replace all possible symbol characters with a space
ID=1, name = "john o'dowd" becomes ID=1, name = "john o dowd"
2 - Split each name into individual words, one word per row, and flag duplicates where needed
ID=1, name = "john o dowd" becomes
ID=1, split = 1, name = "john"
ID=1, split = 2, name = "o"
ID=1, split = 3, name = "dowd"
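To make steps 1 and 2 concrete, here is a minimal sketch in Python rather than SQL, just to show the logic. The function name split_name, the lower-casing, and the tuple layout (ID, split, name, duplicate flag) are my own assumptions for illustration, not part of any library.

```python
import re

def split_name(record_id, name):
    """Hypothetical helper: one (id, split, word, is_duplicate) row per word."""
    # Step 1: replace every symbol (anything that is not a letter, digit
    # or whitespace) with a space. Lower-casing is an extra assumption.
    cleaned = re.sub(r"[^\w\s]|_", " ", name.lower())
    rows, seen = [], set()
    # Step 2: one row per word, numbered from 1, flagging repeated words.
    for pos, word in enumerate(cleaned.split(), start=1):
        rows.append((record_id, pos, word, word in seen))
        seen.add(word)
    return rows

for row in split_name(1, "john o'dowd"):
    print(row)
```

In SQL the same result would be a table with one row per ID and split number.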
The output of step 2 becomes the basis for further validation.
Validation 1
1 - Order all splits alphabetically within each ID
2 - Concatenate all splits
Once the strings are concatenated you can compare them directly. If they don't match, you can also use similarity functions;
pairs with a high enough score (normally .96 or more) may be the same.
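As a rough illustration of validation 1, the sketch below sorts the words, concatenates them and compares the results. It is Python, not SQL, and the names name_key and validation_1 are made up for the example. difflib.SequenceMatcher is only a stand-in for whatever similarity function you end up using, so its scores will not necessarily line up with the .96 threshold mentioned above.

```python
import re
from difflib import SequenceMatcher

def name_key(name):
    """Hypothetical helper: sorted words of the name glued into one string."""
    words = re.sub(r"[^\w\s]|_", " ", name.lower()).split()
    return "".join(sorted(words))

def validation_1(name_a, name_b, threshold=0.96):
    key_a, key_b = name_key(name_a), name_key(name_b)
    if key_a == key_b:
        return "exact"
    # Fallback: similarity score between the two concatenated strings.
    score = SequenceMatcher(None, key_a, key_b).ratio()
    return "similar" if score >= threshold else "no match"

print(validation_1("john o'dowd", "dowd, john o"))
print(validation_1("john smith", "mary jones"))
```

Because the words are sorted before concatenation, the order in which the name parts were typed no longer matters.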
Validation 2
For those that do not match in validation 1:
remove duplicate splits, if any, and perform validation 1 again.
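A small sketch of validation 2, again in Python only to show the idea. The helper name_key and its drop_duplicates parameter are hypothetical; the point is simply that repeated words are collapsed before sorting and concatenating.

```python
import re

def name_key(name, drop_duplicates=False):
    """Hypothetical helper: sorted, concatenated words of a name."""
    words = re.sub(r"[^\w\s]|_", " ", name.lower()).split()
    if drop_duplicates:
        # Validation 2: keep each distinct word only once.
        words = set(words)
    return "".join(sorted(words))

a, b = "john john smith", "john smith"
print(name_key(a) == name_key(b))
print(name_key(a, drop_duplicates=True) == name_key(b, drop_duplicates=True))
```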
Validation 3
For those that do not match in validation 2:
some cases may still fail because one of the names has a single letter in the middle (an initial, for example) that prevents an exact match and also keeps the similarity score too low.
Concatenate all the splits again, ignoring splits of only 1 character, and repeat validation 1.
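Validation 3 could look something like the sketch below (Python, for illustration only). The helper name_key and its min_length parameter are my own invention; setting min_length to 2 drops the single-character splits before the strings are rebuilt.

```python
import re

def name_key(name, min_length=1):
    """Hypothetical helper: sorted, concatenated words of at least min_length chars."""
    words = re.sub(r"[^\w\s]|_", " ", name.lower()).split()
    # Validation 3: ignore splits shorter than min_length (e.g. initials).
    kept = [w for w in words if len(w) >= min_length]
    return "".join(sorted(kept))

a, b = "john a smith", "john smith"
print(name_key(a) == name_key(b))
print(name_key(a, min_length=2) == name_key(b, min_length=2))
```

Note that this also drops the "o" in a name like "john o dowd", so matches found only at this stage are worth flagging for a manual check.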
Other things to do:
Replace all known variations of names like "o'connor" and "o'donahue" with the version without the "'" before doing the replacement mentioned in point 1.
Replace zeros with the letter O, and eventually other look-alike characters in the same way if you wish to dig deeper.
This way, if one name is "o'connor" in one field and "oconnor" in the other, they will match, and you can still flag them as having a difference for users to verify which version of the name is the correct one.
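Here is one possible sketch of that pre-cleaning, in Python for illustration. The names normalize and compare are hypothetical. The regular expression is a shortcut I am assuming in place of a real list of known name variations, and replacing every zero with "o" is deliberately blunt, so adjust both to your data.

```python
import re

def normalize(name):
    """Hypothetical helper: pre-clean, then sort and concatenate the words."""
    name = name.lower().replace("0", "o")    # zeros typed instead of the letter O
    name = re.sub(r"\bo'(?=\w)", "o", name)  # o'connor -> oconnor
    words = re.sub(r"[^\w\s]|_", " ", name).split()
    return "".join(sorted(words))

def compare(name_a, name_b):
    matched = normalize(name_a) == normalize(name_b)
    # Matched, but the raw values differ: flag for a user to verify.
    needs_review = matched and name_a != name_b
    return matched, needs_review

print(compare("sean o'connor", "sean oconnor"))
print(compare("sean oconnor", "sean oconnor"))
```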
All the steps mentioned above will require you to search the net for ways of splitting strings, converting the individual splits into rows, and converting rows back into a single column holding the concatenation of the rows that belong together.
Keywords like PIVOT, UNPIVOT and "FOR XML PATH" will help you in your search.
Regards
Frederico Fonseca
SysSoft Integrated Ltd
FAQ219-2884
FAQ181-2886