Advanced users! Please How to Get rid of Homonyms on fields of Table

wakkara · Jun 4, 2008

Hi there!!
Sorry my English!!

Well i have a big issue to solve on a small table on a database With near 23,000 rows.

I'll start from the beginning...

I have developed a small system that registers users and some other stuff as well.

The point is that the people who is in charge of register the users, on the system.

Most of the time they "Repeat" or miss spell the Name of the users and of course
it causes some trouble when triyng to find them with a query.

Example:

1 Michelle Smith Smith
2 Michelle Smiht Smith

And that was caused because i didnt add a proper code to avoid the repeting when adding a user.

Now i have implemented the LEVENSHTEIN Distance to avoid this repeatings.

How LEVENSHTEIN function works? ..well:
SELECT * FROM pacientes WHERE LEVENSHTEIN("Smith",AppPacV) <= 2;

AND it will show using the example before:

NomPacV AppPacV ApmPacV
1 Michelle Smith Smith
2 Michelle Smiht Smith

Any way
i have still near 23,000 users to verify the homonyms on the table so i could delete them from it.

I had a few ideas but they didnt work out as i thougth they would.

1. Use a Stored procedure with levenshtein trying this:
- Copy the whole table on a new one and rename it so i could have 2 tables with exactly the same data.

TABLE 1: pacientes TABLE 2: pacientess

- Use a query to verify the Distance between them:
SELECT p.* FROM pacientes p, pacientess q WHERE LEVENSHTEIN(p.NomPacV,q.NomPacV) <= 2;

But it takes to long to finish the query (I waited 40min and got tired).

2. Using VB6
Keep on 2 recordset each of the Tables above using Visual Basic 6 AND a Levenshtein Function written in VB (Wich is faster than Mysql).
- Do a nested Loop with the recordsets and make a comparison.

For n = 0 to size of Recodset1

For s =0 to size of Recodset2

A = Levenshtein(Recodset1.Field1,Recodset2.Field1)
B = Levenshtein(Recodset1.Field2,Recodset2.Field2)
C = Levenshtein(Recodset1.Field3,Recodset2.Field3)

If A >= 2 AND B >= 2 AND C >= 2 then
SAVE the ID of the registry
End if
s++
Loop
n++
s++
Loop

Eitherway it will take near 30 hours to make this, (i estimate this number cause in one registry it takes 6 seconds to calculate the levenshtein distance).

Is there a way to make it faster, or what im doing wrong.

Please im quite desperate here

audiopro · Jun 5, 2008

I am not sure if it is the correct way of doing it but I always store names, addresses and such information in upper case and make all query information upper case.

Keith

http://www.studiosoft.co.uk

hoinz · Jun 5, 2008

Hello,

a few thoughts about it:

1) As you have found out yourself, SQL is not a good language for text processing. Any compiler language should be much faster.

2) If I correctly understand, you have a one-time task. And in this case 30 hours should be acceptable, imho. Just start your program on Friday in the evening, and when you return to work on Monday, it will be done. Or am I misunderstanding?

3) If I correctly understand again, you are currently comparing each pair of rows (rowA, rowB) twice, first dist(rowA,rowB), and later on dist(rowB,rowA).
I'd do something like:
for A=1 to size_of_table
for B=A+1 to size_of_table
compare rowA and rowB
....

4) From what you wrote I am not sure whether you are going to rely on the computer for eliminating duplicates. You can't know if there actually are two persons with very similar names, or even the same name, imho. You might consider to create a list that will have to be verified by a human.

5) Not sure if you really have to copy your data to another table, just for looping through it. You should be able to use nested loops on one table.
On the other hand, for testing purposes, if you are going to delete data, using a copy of your data first is wise for sure.

6) Levenshtein Distance seems a good approach to me. You might consider to code another sophisticated algorithm by yourself, but I doubt that you will get away with much less time in the end.

7) An idea for your computing of A, B, C as distance of field1, field2 and field3:
I don't exactly understand what you are doing, but it might be faster if you:
* compute A = Lev....
* check if A>=2
* decide whether it is necessary at all to compute B and C
* if so, then the same for B
Reducing the numbers of calls of Levenshtein should speed up the program, I think.

regards

wakkara · Jun 5, 2008

Thnks for your quick answers.

To hoinz----------

First fo all...Thnks for taking the time and effort to read, analyze and answer everithing a posted.

1. Well Yes you are right

2. And yes its a one time task, that for now it takes a long time to acomplish. I didnt know tasks like such were so slow.

3.I dont think i am doing it twice.

Lets see if i can explain my self better on this one.

============================================================
For n = 0 to size of Table ONE /* that means the first for will run all the rows of the table ONE*/

For S = 0 to size of Table TWO /*For each row on table ONE, the second one will run from the first to the last row*/

------------------------------------
/*If field from table table ONE has a Distance Less than 2 with the Field at the moment from table TWO, so i save the ID of the table ONE, and later i will delete or do some task with the registry.*/
------------------------------------
s++//Next row on table TWO
loop//Loop Second FOR

s=0 //If the second FOR is over make s =0 so it can start over
n++ //Next row on table ONE
loop//Loop first FOR
============================================================
ANOTHER EXAMPLE:

TABLE 1 TABLE 2
Row1T1 Row1T2
Row2T1 Row2T2
Row3T1 Row3T2
Row4T1 Row4T2
Row5T1 Row5T2

This is how my nested loop will work

When the first FOR is in the Row1T1 it will compare with all the rows on table 2,
When the first FOR is in the Row2T1 it will compare with all the rows on table 2,................And so on

So I can tell that my nested loop is working as expected.

4. Yes you are right with this code im not eliminating any registry so far. I will get the Rows "Duplicated" and then comparing with other variables on the table ill make the decision of deleting or something else.

And again you are right, i might use some Human processing here to make better desicions.

5. Well i copied the table just to use the levenshtein distance on Mysql, but you are absolutely right if i use Recordsets, on VB6 i can populate both recordsets with the same table

good one.

6. The code i used for levenshtein in Visual basic its taken from here

http://www.vbforums.com/showthread.php?t=505053

AND i was reading it and i dont think i could make a faster way to do this.

7. Well i compute, as you said, A,B AND C cause i need to compare Name LasteName and Last-LastName (As you know in Mexico we have name and two lastnames).
So A represents the distance beteween the name of the recordsets,
B Represent the Lastname and so on...

SO..........

IN THE END I SEE I HAVE NO OTHER CHOICE BUT TO WAIT UNTIL MY 90 BRITHDAY TO SEE THIS WORKING....MMMM.............

Ok i have some time left.

And if there is someone overthere that know a better way to do this, ill invite him a great bottle of something =).

I think it should work cause Google uses something similar to suggest words when you missspell one on the search field, and they do it sooo fast......is that becouse of their Turbo servers, or just becouse they have a clever programers......?

hoinz · Jun 6, 2008

Hello wakkara,

a few more thoughts:

3. I am not sure if I understand correctly. My reason for thinking that you are doing it twice is that you said you copy Table1 to Table2. Or am I wrong here?
If you just copy tables, then the rows will be the same, Row1T1 = Row1T2, Row2T1 = Row2T2, and so on.
First you compare, lets say, Row1T1 to Row5T2, which is the same as comparing Row1T1 to Row5T1.
And there is no reason for comparing Row5T1 to Row1T1 later on. Or what am I missing here?
(I am assuming here that both tables are ordered the same way. If they aren't, you should be able to order them.)

7. I still think you are making things much too complicated.
Let's assume you have got these two names:
1 Michelle Smiht Smith
2 Jose Rodriguez Alvaro
So first you compute Levenshtein distance of 'Michelle' and 'Jose', and you find out that they are very different.
But why on earth are you so eager to find out how different the last names are ???
You already know that these two names cannot represent the same person.

hope this helps

wakkara · Jun 6, 2008

Ok thanks for your incredible and usefull ideas

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Advanced users! Please How to Get rid of Homonyms on fields of Table

wakkara

Programmer

audiopro

Programmer

hoinz

MIS

wakkara

Programmer

hoinz

MIS

wakkara

Programmer

Similar threads

Part and Inventory Search

Sponsor