Quantify how similar two text fields are 1

HardJeans · Dec 27, 2007

I'm working on cleaning up some of the data our company has in its databases, and I need to quantify how similar two text fields are.
e.g. Database1 has a table with employee information, one of the columns has name stored as "Susan Smith"
Database2 has a table with employee information, one of the columns has a name stored as "Suzy Smith"

It is the same person, and failing all other methods to connect database1 record to database2 record, I need a way to find out the most likely record to database2 it is. A human will then go through and try to verify, but I need a starting point.

Essentially, I'm looking for some way for inputs of "George Lucas" and "George Lucaz" to return something similar to 91.66% match or "Lucas, George" returning 92.31%. Is there any such method to get this info?

--
"I'm not talking to myself, I'm just the only one who's listening." - JCS

HardJeans · Dec 27, 2007

Sorry, I'm using SQL Server 2000 (though I hope to be on 2005 sometime next year).

--
"I'm not talking to myself, I'm just the only one who's listening." - JCS

RiverGuy · Dec 27, 2007

If you are using SQL Server 2005 Enterprise Edition, you might have a look at the fuzzy matching transformations it has for SSIS. I haven't used it, but it's worth a look. Barring that, I wouldn't recommend trying to write your own routine to do the matching. You'd be better off purchasing a commercial software package for that function or paying a firm which specializes in matching to do the work for you.

ESquared · Dec 27, 2007

This forum's FAQ, a better SoundEx function, might get you started. My suggestion is to break each field at every space into separate words using a numbers table, then compare every soundex of each row's words against every soundex of every other row's words. Careful grouping and counting will reveal those that have a good soundex match.

Code:

drop table emp1
drop table emp2
drop table #temp1
drop table #temp2
drop table #results

/*
-- If you do not have a numbers table, create one:
create table Numbers (Num int identity(1,1) primary key clustered)
insert Numbers DEFAULT VALUES
while scope_identity() < 2000 insert Numbers DEFAULT VALUES
*/

create table Emp1 (FullName varchar(50))
create table Emp2 (FullName varchar(50))

insert Emp1 select 'Susan  Smith' insert Emp2 select 'Suzy Smith'

insert Emp1 select 'George Lucas' insert Emp2 select 'George  Lucaz'
insert Emp1 select 'Bronson,  James' insert Emp2 select 'Jimmy Bronson'
                                     insert Emp2 select 'Jason Bronson'

insert Emp1 select 'Dick Jones' insert Emp2 select 'Jones, Richard'
                                insert Emp2 select 'Jones,David'
insert Emp1 select ' Dan Jones' insert Emp2 select 'Jones, Don '

insert Emp1 select 'Brian Duke' insert Emp2 select 'Duck, Bryan'

insert Emp1 select 'Olaf Olafson' insert Emp2 select 'Olafsen, Olfa'


update Emp1 set fullname = ltrim(rtrim(fullname)) where fullname like ' %' or fullname like '% '
update Emp2 set fullname = ltrim(rtrim(fullname)) where fullname like ' %' or fullname like '% '

while exists (select * from Emp1 where FullName like '%  %')
   update Emp1 set FullName = Replace(FullName, '  ', ' ') where FullName like '%  %'

while exists (select * from Emp2 where FullName like '%  %')
   update Emp2 set FullName = Replace(FullName, '  ', ' ') where FullName like '%  %'

while exists (select * from Emp1 where FullName like '%,[^ ]%')
   update Emp1 set FullName = Replace(Replace(FullName, ',', ', '), ',  ', ', ') where FullName like '%,[^ ]%'

while exists (select * from Emp2 where FullName like '%,[^ ]%')
   update Emp2 set FullName = Replace(Replace(FullName, ',', ', '), ',  ', ', ') where FullName like '%,[^ ]%'

select
   E.FullName,
   Pos = N.Num - Len(Replace(Left(E.FullName, N.Num), ' ', '')) + 1,
   Word = Replace(Substring(E.FullName, N.Num, charindex(' ', E.FullName + ' ', N.Num) - 1), ',', ''),
   Sound = dbo.SoundexAlphaFunction(Replace(Substring(E.FullName, N.Num, charindex(' ', E.FullName + ' ', N.Num) - 1), ',', ''))
into
   #Temp1
from
   Emp1 E
   inner join Numbers N ON Len(E.FullName) >= N.Num AND Substring(' ' + E.FullName, N.Num, 1) = ' '

select
   E.FullName,
   Pos = N.Num - Len(Replace(Left(E.FullName, N.Num), ' ', '')) + 1,
   Word = Replace(Substring(E.FullName, N.Num, charindex(' ', E.FullName + ' ', N.Num) - 1), ',', ''),
   Sound = dbo.SoundexAlphaFunction(Replace(Substring(E.FullName, N.Num, charindex(' ', E.FullName + ' ', N.Num) - 1), ',', ''))
into
   #Temp2
from
   Emp2 E
   inner join Numbers N ON Len(E.FullName) >= N.Num AND Substring(' ' + E.FullName, N.Num, 1) = ' '

select
   Name1 = E1.FullName,
   Pos1 = E1.Pos,
   Name2 = E2.FullName,
   Pos2 = E2.Pos,
   Score = 100.
INTO
   #Results
from
   #Temp1 E1
   INNER JOIN #Temp2 E2 ON E1.Sound = E2.Sound

insert #results
select
   Name1 = E1.FullName,
   Pos1 = E1.Pos,
   Name2 = E2.FullName,
   Pos2 = E2.Pos,
   Score = Sum(50. / Case when Len(E1.Sound) > len(e2.Sound) then Len(e1.Sound) else len(e2.Sound) end)
from
   #Temp1 E1
   INNER JOIN Numbers N1 ON Len(E1.Sound) >= N1.Num
   INNER JOIN (
      #Temp2 E2
      INNER JOIN Numbers N2 ON Len(E2.Sound) >= N2.Num
   ) ON Substring(E1.Sound, N1.Num, 1) = Substring(E2.Sound, N2.Num, 1)
group by
   E1.FullName,
   E1.Pos,
   E2.FullName,
   E2.Pos

select
   name1,
   name2,
   totalscore = sum(score) / max(case when pos1 > pos2 then pos1 else pos2 end)
from
   #results
group by
   name1,
   name2
order by
   name1,
   totalscore desc

[COLOR=black #e0e0e0]For SQL and technical ideas, visit my blog, Squared Thoughts.

_{The best part about anything that has cheese is the cheese.}[/color]

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Quantify how similar two text fields are 1

HardJeans

Technical User

HardJeans

Technical User

RiverGuy

Programmer

ESquared

Programmer

Similar threads

Part and Inventory Search

Sponsor