Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gkittelson on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

SQL statement for approximate string matching

Status
Not open for further replies.

bghh

Programmer
Sep 13, 2010
3
ZA
Hello,

I'm trying to compose an SQL-query which does approximate string matching using n-grams.
This is the problem that I'm trying to solve: we have a system on which you can register your company and users within that company. However, before a new company is registered, we would like to do a check to see whether that company already exists in the database. In this case there is a good possibility that the user didn't type the company name exactly as in the existing record in the database.

The idea that I'm trying to implement is this: from the input string (the new company name) we generate a number of n-grams. Let us let n be 3. Then the 3-grams of the string "approximate" for example are "app", "ppr", "pro", "rox", "oxi", "xim", "ima", "mat", and "ate". I would like to do a select-statement which returns (or counts) all the company names in the COMPANY table that contain a certain number of these 3-grams.
More generally, we are looking for a SQL-statement that selects all rows that, given n conditions, satisfy at least i of these n conditions.
How would we do this in a SQL select-expression?

---------------------------------------------------

The idea that I got is to use a SQL-query of the following form:
Code:
SELECT ... FROM ... WHERE (CASE WHEN #CONDITION1# THEN 1 ELSE 0) + (CASE WHEN #CONDITION2# THEN 1 ELSE 0) + ... + (CASE WHEN #CONDITIONn# THEN 1 ELSE 0)>=i
.
For this specific scenario the SQL-query would look something like
Code:
SELECT COMPANYFULLNAME FROM COMPANY WHERE (CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%mic%' THEN 1 ELSE 0) + (CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%icr%' THEN 1 ELSE 0) + (CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%cro%' THEN 1 ELSE 0)>=3
.
The problem is that DB2 finds syntax errors in this SQL. It says
Code:
SQL0104N  An unexpected token ")" was found following "%mic%' THEN 1 ELSE 0".  Expected tokens may include:  "END
It seems as though DB2 doesn't like the parentheses around the CASEs. However, when I remove these parentheses, I get a different syntax error:
Code:
SQL0104N  An unexpected token ">=" was found following "%cro%' THEN 1 ELSE 0".  Expected tokens may include:  "END
Is there a way to write this SQL-statement so that DB2 accepts it?
Or should I take a different approach on this.

I'll appreciate any help.
Thanks in advance.
 


hi,

The WHERE clause is looking for an expression, not a value.

For instance in the WHERE, you might use a CASE decode to return a field name or a field value, but eventually you need...
Code:
WHERE [i]SomeField SomeOperator SomeValue[/i]


Skip,

[glasses]Just traded in my old subtlety...
for a NUANCE![tongue]
 
I believe the form of an expression in a WHERE clause is a bit more general, such as
Code:
WHERE #expression1# #operator# #expression2#

But man, I just figured out what I did wrong in that SQL-query, and indeed, it was just a simple syntax error.
The syntax of the CASE construct is of the form "CASE WHEN #condition# THEN #expression1# ELSE #expression2# END". I clean forgot the "END" at the end. When the syntax error told me that an END token was expected, I thought that it meant the end of the query.
The correct form of the type of SQL-query that I'm implementing is
Code:
SELECT ... FROM ... WHERE CASE WHEN #CONDITION1# THEN 1 ELSE 0 [b]END[/b] + CASE WHEN #CONDITION2# THEN 1 ELSE 0 [b]END[/b] + ... + CASE WHEN #CONDITIONn# THEN 1 ELSE 0 [b]END[/b]>=i
And the example SQL-query instantiation is
Code:
SELECT COMPANYFULLNAME FROM COMPANY WHERE CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%mic%' THEN 1 ELSE 0 [b]END[/b] + CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%icr%' THEN 1 ELSE 0 [b]END[/b] + CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%cro%' THEN 1 ELSE 0 [b]END[/b]>=2
And yes, this query works perfectly!
Thanks for the help!
 
We can simplyfy the same query as below:
SELECT
(CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%mic%' THEN 1
WHEN LOWER(COMPANYFULLNAME) LIKE '%icr%' THEN 1 WHEN LOWER(COMPANYFULLNAME) LIKE '%cro%' THEN 1 ELSE 0 END)
FROM COMPANY;
 



Yes you were correct. Now I see what you are attempting.

Skip,

[glasses]Just traded in my old subtlety...
for a NUANCE![tongue]
 
Well Marc,
the SOUNDEX function (or DIFFERENCE function, which is based on SOUNDEX) works quite well in some cases. One specific case where it doesn't work well is when the two strings that are compared are similar, but they don't start with the same letters; for instance "Apple Co" and "The Apple Co" is a poor match according to DIFFERENCE.
However a pair of strings like "Microblah Pty Ltd" and "Mikrobla" is a good match according to DIFFERENCE.
The solution above with the CASEs works well, but if performance becomes an issue, I'll consider using DIFFERENCE.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top