Hi there!!
Sorry my English!!
Well i have a big issue to solve on a small table on a database With near 23,000 rows.
I'll start from the beginning...
I have developed a small system that registers users and some other stuff as well.
The point is that the people who is in charge of register the users, on the system.
Most of the time they "Repeat" or miss spell the Name of the users and of course
it causes some trouble when triyng to find them with a query.
Example:
1 Michelle Smith Smith
2 Michelle Smiht Smith
And that was caused because i didnt add a proper code to avoid the repeting when adding a user.
Now i have implemented the LEVENSHTEIN Distance to avoid this repeatings.
How LEVENSHTEIN function works? ..well:
SELECT * FROM pacientes WHERE LEVENSHTEIN("Smith",AppPacV) <= 2;
AND it will show using the example before:
NomPacV AppPacV ApmPacV
1 Michelle Smith Smith
2 Michelle Smiht Smith
Any way
i have still near 23,000 users to verify the homonyms on the table so i could delete them from it.
I had a few ideas but they didnt work out as i thougth they would.
1. Use a Stored procedure with levenshtein trying this:
- Copy the whole table on a new one and rename it so i could have 2 tables with exactly the same data.
TABLE 1: pacientes TABLE 2: pacientess
- Use a query to verify the Distance between them:
SELECT p.* FROM pacientes p, pacientess q WHERE LEVENSHTEIN(p.NomPacV,q.NomPacV) <= 2;
But it takes to long to finish the query (I waited 40min and got tired).
2. Using VB6
Keep on 2 recordset each of the Tables above using Visual Basic 6 AND a Levenshtein Function written in VB (Wich is faster than Mysql).
- Do a nested Loop with the recordsets and make a comparison.
For n = 0 to size of Recodset1
For s =0 to size of Recodset2
A = Levenshtein(Recodset1.Field1,Recodset2.Field1)
B = Levenshtein(Recodset1.Field2,Recodset2.Field2)
C = Levenshtein(Recodset1.Field3,Recodset2.Field3)
If A >= 2 AND B >= 2 AND C >= 2 then
SAVE the ID of the registry
End if
s++
Loop
n++
s++
Loop
Eitherway it will take near 30 hours to make this, (i estimate this number cause in one registry it takes 6 seconds to calculate the levenshtein distance).
Is there a way to make it faster, or what im doing wrong.
Please im quite desperate here
Sorry my English!!
Well i have a big issue to solve on a small table on a database With near 23,000 rows.
I'll start from the beginning...
I have developed a small system that registers users and some other stuff as well.
The point is that the people who is in charge of register the users, on the system.
Most of the time they "Repeat" or miss spell the Name of the users and of course
it causes some trouble when triyng to find them with a query.
Example:
1 Michelle Smith Smith
2 Michelle Smiht Smith
And that was caused because i didnt add a proper code to avoid the repeting when adding a user.
Now i have implemented the LEVENSHTEIN Distance to avoid this repeatings.
How LEVENSHTEIN function works? ..well:
SELECT * FROM pacientes WHERE LEVENSHTEIN("Smith",AppPacV) <= 2;
AND it will show using the example before:
NomPacV AppPacV ApmPacV
1 Michelle Smith Smith
2 Michelle Smiht Smith
Any way
i have still near 23,000 users to verify the homonyms on the table so i could delete them from it.
I had a few ideas but they didnt work out as i thougth they would.
1. Use a Stored procedure with levenshtein trying this:
- Copy the whole table on a new one and rename it so i could have 2 tables with exactly the same data.
TABLE 1: pacientes TABLE 2: pacientess
- Use a query to verify the Distance between them:
SELECT p.* FROM pacientes p, pacientess q WHERE LEVENSHTEIN(p.NomPacV,q.NomPacV) <= 2;
But it takes to long to finish the query (I waited 40min and got tired).
2. Using VB6
Keep on 2 recordset each of the Tables above using Visual Basic 6 AND a Levenshtein Function written in VB (Wich is faster than Mysql).
- Do a nested Loop with the recordsets and make a comparison.
For n = 0 to size of Recodset1
For s =0 to size of Recodset2
A = Levenshtein(Recodset1.Field1,Recodset2.Field1)
B = Levenshtein(Recodset1.Field2,Recodset2.Field2)
C = Levenshtein(Recodset1.Field3,Recodset2.Field3)
If A >= 2 AND B >= 2 AND C >= 2 then
SAVE the ID of the registry
End if
s++
Loop
n++
s++
Loop
Eitherway it will take near 30 hours to make this, (i estimate this number cause in one registry it takes 6 seconds to calculate the levenshtein distance).
Is there a way to make it faster, or what im doing wrong.
Please im quite desperate here