Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Is There A Way to Relate Similiar Data Between Two Tables?

Status
Not open for further replies.

VAST39

Programmer
Sep 21, 2012
26
US
Hi guys,

I am trying to join two employer tables with similar data, but no primary key exists between them. Employer table A contains the employer id. Employer Table B does not have the employer id. I want to merge the two tables so I can update employer table B with the employer ids.


Since I don't have a primary key to join the tables, I can only compare and join their employer names. The problem is most names are similar between the two, but not exact.

For example:


Table A - ASHLEY COUNTY CONSERVATION DIST
Table b - ASHLEY CO. CONSERVATION DISTRICT

There is no pattern to which record names are abbreviated or not.

I know I can use wildcards and the LIKE command, but that would only work for a single record. I need to match up over 1,000 records. I really don't see a way to do this. Can anyone help?





 
Thanks for the article link! I'm excited to try this out. I also had a thought after posting. If I could find some way to run an update query which would search and change the abbreviations to full words. Like change DIST to DISTRICT or SCH to SCHOOL. Then I would get more matches. But how would I tell SQL to search and change words that are a portion of the text content? If the only content in the field was DIST then it would be easy to change.
 
But how do you know that its not ASHLEY COMPANY CONSERVATION DISTRICT?

My personal view on this....

If you need 100% accuracy then there is only one way to do it.

Do a straight match (maybe change Ltd to LIMITED?). Then see whats left. Lets assume there would be less than half, maybe a quarter. That leaves you with maybe as low as 250 records to try and match manually. Surely that wouldnt take too long.

I know its tedious work - but i have done it before because i felt it was the only way to get 100% Match.

Dan



----------------------------------------

Be who you are and say what you feel, because those who mind don't matter and those who matter don't mind - Bernard Baruch

Computer Science is no more about computers than astronomy is about telescopes - EW Dijkstra
----------------------------------------
 
Just to clarify - my view is that no other system can be 100% accurate. At the end of the day its common sense - can you program that?

----------------------------------------

Be who you are and say what you feel, because those who mind don't matter and those who matter don't mind - Bernard Baruch

Computer Science is no more about computers than astronomy is about telescopes - EW Dijkstra
----------------------------------------
 
its common sense - can you program that?

I was going to try, but now it just seems foolish. [wink]

I've done similar things in the past. The entire time you are doing this, think to yourself, would it be quicker to do this manually? With 1,000 rows, maybe not. With 10 rows, absolutely yes. The tipping point is somewhere in between.

I do have a couple suggestions:

Ignore the Soudex and Difference function. It only works well for short strings, and you don't have short strings.

To see what I mean:

Code:
Declare @Temp Table(Name VarChar(100))

Insert Into @Temp Values('ASHLEY COUNTY CONSERVATION DIST')
Insert Into @Temp Values('ASHLEY CO. CONSERVATION DISTRICT')
Insert into @Temp Values('ASHLEY COUNTY WHORE HOUSE')
Insert into @Temp Values('ASHLEY osdjfkljsdf ;klsjdf ;skdjf slk;djf sdf')
Insert Into @Temp Values('Assly')

Select Name, Soundex(Name)
From   @Temp

Instead, do a google search on "sql server lenvenshtein". Levenshtein is slower but a lot more accurate. Of course, it would help if you have other columns in both tables that represent the same data (for example zip code).

Regardless, whatever you decide to do, be aware that you could end up spending more time trying to code a solution to match them up than it would take to do it manually.

-George
Microsoft SQL Server MVP
My Blogs
SQLCop
twitter
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
George got there before me. Soundex is decent for a word at a time, but not for longer strings. My son has very successfully used an algorithm that gives you a percentage similarity and then lets you decide how similar is similar enough. Pretty sure he was using the Levenshtein technique George mentioned. Here's one article:

[URL unfurl="true"]http://www.codeproject.com/Articles/11157/An-improvement-on-capturing-similarity-between-str[/url]

Tamar
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top