Percent Match Comparing Two Strings

Status
Not open for further replies.

Elysium

Programmer
Aug 7, 2002
212
US
Has anyone attempted to create a function that will show how closely two strings match? In particular, if sWord1 = "Drs Vision Center" and sWord2 = "Doctors Vision Center", I want to be able to say that the two strings match, although not 100%. They differ because of the Drs != Doctors, but really it is the same. I thought about doing a replace function that would substitute Drs for Doctors and then I would get a 100% match. But then the equation gets really muddy when I compare two strings such as "FL Pest Control Co, Inc" to "Florida Pest Control Company, Inc."

I appreciate any guidance with this. Thank you.
 
Elysium,

I don't know of a way to do a fuzzy compare simply. The rules would be quite complex.

I'd be more apt to modify a COPY of the source data, by building a replace table after carefully verifying that what I am changing is what I intend to change like...

"Co," = "Company,"
"FL " = "Florida "

Then, when I had all my replace rules in my replace table, I'd process the copy.
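
A minimal sketch of the replace-table idea, assuming the rules live in a two-column array rather than a real database table. `ApplyReplaceRules` and `DemoReplaceRules` are hypothetical names, and the two rules are the ones quoted above. The matching is plain substring replacement, so a rule like "FL " would also hit the tail of a longer word; verify each rule against the data first.

```vb
Option Explicit

' Hypothetical helper: applies each find/replace pair to a copy of the name.
' rules is a 2-D array: rules(i, 0) = text to find, rules(i, 1) = replacement.
Public Function ApplyReplaceRules(ByVal sName As String, rules As Variant) As String
    Dim i As Long
    For i = LBound(rules, 1) To UBound(rules, 1)
        ' vbTextCompare makes the search case-insensitive (an assumption).
        sName = Replace(sName, rules(i, 0), rules(i, 1), 1, -1, vbTextCompare)
    Next
    ApplyReplaceRules = sName
End Function

Public Sub DemoReplaceRules()
    Dim rules(0 To 1, 0 To 1) As String
    rules(0, 0) = "Co,": rules(0, 1) = "Company,"
    rules(1, 0) = "FL ": rules(1, 1) = "Florida "
    ' Prints: Florida Pest Control Company, Inc
    Debug.Print ApplyReplaceRules("FL Pest Control Co, Inc", rules)
End Sub
```

Because the function works on its `ByVal` copy of the string, the source data is left untouched, which matches the advice to process a copy.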

:)


Skip,
 
You need to be more precise about what constitutes a partial match. For example, does

Doctors Vision Center

really match

Center For Doctors With Vision?

It's not just a question of "Drs" Vs. "Doctors" or "Co" Vs. "Company". Those are basically issues of constraining the user to acceptable forms of certain words as SkipVought implies.

The more general issue is ... Is 'closeness' determined by the matching of certain words? The order of the words? The same words in the same positions?

 
Since these are business names, I am going to assume that the order won't change much, unless one string begins with the word "the" and the other string doesn't. To answer your question about "Center For Doctors With Vision", I believe I found a solution for that by using the Levenshtein Distance algorithm. This would tell me that 25 characters would need to be changed in order to achieve a perfect match. If the length of sWord1 is 14, then I would know that the two strings are worlds apart because the number of changes exceeds the length of sWord1.

The hitch I see in this method lies in the word "Incorporated". Its abbreviation is 9 characters shorter, which is probably going to cause some problems.
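
For reference, a sketch of the textbook Levenshtein distance in VB, with each insert, delete, or substitute costing 1. The `PercentMatch` scoring rule (100 * (1 - distance / longer length)) is an assumption of mine, not something agreed in this thread, and the comparison is case-sensitive unless the module uses Option Compare Text.

```vb
' Dynamic-programming Levenshtein distance between s and t.
Public Function Levenshtein(ByVal s As String, ByVal t As String) As Long
    Dim m As Long, n As Long, i As Long, j As Long
    Dim cost As Long, best As Long
    Dim d() As Long
    m = Len(s)
    n = Len(t)
    ReDim d(0 To m, 0 To n)
    ' Turning a prefix into the empty string costs one delete per character.
    For i = 0 To m: d(i, 0) = i: Next
    For j = 0 To n: d(0, j) = j: Next
    For i = 1 To m
        For j = 1 To n
            If Mid$(s, i, 1) = Mid$(t, j, 1) Then cost = 0 Else cost = 1
            best = d(i - 1, j - 1) + cost                           ' match or substitute
            If d(i - 1, j) + 1 < best Then best = d(i - 1, j) + 1   ' delete
            If d(i, j - 1) + 1 < best Then best = d(i, j - 1) + 1   ' insert
            d(i, j) = best
        Next
    Next
    Levenshtein = d(m, n)
End Function

' Assumed scoring rule: distance scaled by the longer of the two lengths.
Public Function PercentMatch(ByVal s As String, ByVal t As String) As Double
    Dim longest As Long
    longest = Len(s)
    If Len(t) > longest Then longest = Len(t)
    If longest = 0 Then
        PercentMatch = 100
    Else
        PercentMatch = 100# * (1 - Levenshtein(s, t) / longest)
    End If
End Function
```

With this rule, "Drs Vision Center" against "Doctors Vision Center" has a distance of 4 over a longer length of 21, which is roughly 81%. That shows the abbreviation hitch in numbers: one shortened word is enough to pull an otherwise identical name below a 90% cutoff.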

I am still looking ....
 
As I understand it, the question isn't really about strings; it's about the interpretation of the contents of strings. It's a problem that, in the current state of technology, is perhaps impossible to solve unless you can be certain that the contents of the strings are restricted to a very small semantic domain (e.g., company names, medical terms, etc.). Your question is, it seems, as much about what the contents of the string mean as about the characters it contains.

If you can be certain that when the string contains "Drs" it is an abbreviation of "Doctors" rather than a misspelling of "Dry" (in English) or "Das" (in German), then all would be fine (but you'd have to be certain). And would "drs" (with a lower-case 'd') be OK?

I think you'd need a good dictionary plus some linguistic analysis to really get to grips with this problem.


________________________________________
[hippy]Roger J Coult; Grimsby, UK
In the game of life the dice have an odd number of sides.
 
Woja, you're right. I do have to make assumptions on rules, i.e. Dry vs. Drs. My task is to examine the content of two strings, and based on a percent match (something over 90%), I can reasonably assume that the names are equal. For those that aren't matched, I will look at them myself using a side-by-side comparison. But, I wanted to reduce the original amount substantially.

Thanks to all for the input!!
 
There are some string parsing processes which simply attempt to show the longest common substring between inputs. If you combined this approach with a couple of other processes (replacing common abbreviations and discarding trivial words, like "and", "or", "Inc", etc.), I think you could get quite a bit closer. If you then applied the match on each word, it might even get to the end point.
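
A rough sketch of the "discard trivial words, then match word by word" idea. `StripNoiseWords` and `WordOverlapPercent` are hypothetical names, and the noise-word list (and, or, inc, the) is only an assumed starting point. The word match is exact and ignores word order, so it will not catch abbreviations by itself.

```vb
' Hypothetical helper: lower-cases the name and drops punctuation and trivial words.
Public Function StripNoiseWords(ByVal sName As String) As String
    Const NOISE As String = " and or inc the "   ' assumed noise-word list
    Dim words() As String
    Dim result As String
    Dim w As String
    Dim i As Long

    ' Treat commas and periods as spaces, then split on spaces.
    sName = Replace(Replace(sName, ",", " "), ".", " ")
    words = Split(LCase$(Trim$(sName)), " ")
    For i = LBound(words) To UBound(words)
        w = words(i)
        If Len(w) > 0 Then
            If InStr(1, NOISE, " " & w & " ") = 0 Then
                result = result & " " & w
            End If
        End If
    Next
    StripNoiseWords = Trim$(result)
End Function

' Percentage of the remaining words in a that also appear in b.
Public Function WordOverlapPercent(ByVal a As String, ByVal b As String) As Double
    Dim wa() As String
    Dim i As Long, hits As Long
    a = StripNoiseWords(a)
    b = " " & StripNoiseWords(b) & " "
    If Len(a) = 0 Then Exit Function   ' nothing left to compare, returns 0
    wa = Split(a, " ")
    For i = LBound(wa) To UBound(wa)
        If InStr(1, b, " " & wa(i) & " ") > 0 Then hits = hits + 1
    Next
    WordOverlapPercent = 100# * hits / (UBound(wa) - LBound(wa) + 1)
End Function
```

For example, `StripNoiseWords("The Doctors Vision Center, Inc.")` returns "doctors vision center", so that name scores 100 against "Doctors Vision Center". Note that the score is taken relative to the words in the first argument only.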




MichaelRed
m.red@att.net

Searching for employment in all the wrong places
 
This compares two strings and returns an LCS string with a "." wherever the strings do not match, sized to the longer string. You might change it to use some other filler character. It is a start.
Code:
Public Function ComputeLCS(ByVal x As String, ByVal y As String) As String
    Dim i As Long
    Dim j As Long
    Dim m As Long
    Dim n As Long
    Dim k As Long
    Dim T() As Long
    Dim a As Long
    Dim b As Long
    Dim w As String
    m = Len(x)
    n = Len(y)
    ReDim T(m, n)
    
    ' Fill the table: T(i, j) is the LCS length of Left$(x, i) and Left$(y, j).
    For i = 1 To m
        For j = 1 To n
            If Mid$(x, i, 1) = Mid$(y, j, 1) Then
                T(i, j) = T(i - 1, j - 1) + 1
            Else
                a = T(i, j - 1)
                b = T(i - 1, j)
                If a > b Then
                    T(i, j) = a
                Else
                    T(i, j) = b
                End If
            End If
        Next
    Next
    ' Debug aid: dump the table to the Immediate window (safe to remove).
    For i = 1 To m
        For j = 1 To n
            If T(i, j) = 0 Then
                Debug.Print "   ";
            Else
                Debug.Print Right("0" & CStr(T(i, j)), 2) & " ";
            End If
        Next
        Debug.Print " "
    Next
    '*****
    '* Trace Back
    i = m
    j = n
    k = m
    If k < n Then k = n
    w = String(k, ".")
    Do While (i > 0 And j > 0)
        If T(i, j) = T(i - 1, j - 1) + 1 And Mid$(x, i, 1) = Mid$(y, j, 1) Then
            'Mid$(w, Len(w) - k + 1, 1) = Mid$(x, i, 1)
            Mid$(w, k, 1) = Mid$(x, i, 1)
            i = i - 1
            j = j - 1
            k = k - 1
        ElseIf T(i - 1, j) > T(i, j - 1) Then
            If k = i Then k = k - 1
            i = i - 1
        Else
            If k = j Then k = k - 1
            j = j - 1
        End If
    Loop
    ComputeLCS = w
End Function
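
To turn that masked string into a percentage, one option is to count the positions that are not the filler. This is only a sketch: `MaskPercent` is a hypothetical helper, and it assumes "." never occurs in the names themselves (it would miscount "Inc.", for instance), so pick a different filler character if your data contains periods.

```vb
' Hypothetical helper: share of positions in the mask that are not the "." filler.
Public Function MaskPercent(ByVal sMask As String) As Double
    Dim i As Long, hits As Long
    If Len(sMask) = 0 Then Exit Function   ' empty mask, returns 0
    For i = 1 To Len(sMask)
        If Mid$(sMask, i, 1) <> "." Then hits = hits + 1
    Next
    MaskPercent = 100# * hits / Len(sMask)
End Function
```

It would be called as `MaskPercent(ComputeLCS(sWord1, sWord2))`. A mask such as "D....rs Vision Center" has 17 matched positions out of 21, which is about 81%.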
