SQL statement for approximate string matching

bghh · Sep 13, 2010

Hello,

I'm trying to compose an SQL-query which does approximate string matching using n-grams.
This is the problem that I'm trying to solve: we have a system on which you can register your company and users within that company. However, before a new company is registered, we would like to do a check to see whether that company already exists in the database. In this case there is a good possibility that the user didn't type the company name exactly as in the existing record in the database.

The idea that I'm trying to implement is this: from the input string (the new company name) we generate a number of n-grams. Let us let n be 3. Then the 3-grams of the string "approximate" for example are "app", "ppr", "pro", "rox", "oxi", "xim", "ima", "mat", and "ate". I would like to do a select-statement which returns (or counts) all the company names in the COMPANY table that contain a certain number of these 3-grams.
More generally, we are looking for a SQL-statement that selects all rows that, given n conditions, satisfy at least i of these n conditions.
How would we do this in a SQL select-expression?

---------------------------------------------------

The idea that I got is to use a SQL-query of the following form:

Code:

SELECT ... FROM ... WHERE (CASE WHEN #CONDITION1# THEN 1 ELSE 0) + (CASE WHEN #CONDITION2# THEN 1 ELSE 0) + ... + (CASE WHEN #CONDITIONn# THEN 1 ELSE 0)>=i

.
For this specific scenario the SQL-query would look something like

Code:

SELECT COMPANYFULLNAME FROM COMPANY WHERE (CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%mic%' THEN 1 ELSE 0) + (CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%icr%' THEN 1 ELSE 0) + (CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%cro%' THEN 1 ELSE 0)>=3

.
The problem is that DB2 finds syntax errors in this SQL. It says

Code:

SQL0104N  An unexpected token ")" was found following "%mic%' THEN 1 ELSE 0".  Expected tokens may include:  "END

It seems as though DB2 doesn't like the parentheses around the CASEs. However, when I remove these parentheses, I get a different syntax error:

Code:

SQL0104N  An unexpected token ">=" was found following "%cro%' THEN 1 ELSE 0".  Expected tokens may include:  "END

Is there a way to write this SQL-statement so that DB2 accepts it?
Or should I take a different approach on this.

I'll appreciate any help.
Thanks in advance.

SkipVought · Sep 13, 2010

hi,

The WHERE clause is looking for an expression, not a value.

For instance in the WHERE, you might use a CASE decode to return a field name or a field value, but eventually you need...

Code:

WHERE [i]SomeField SomeOperator SomeValue[/i]

Skip,
_{Just traded in my old subtlety...

for a NUANCE!}

bghh · Sep 13, 2010

I believe the form of an expression in a WHERE clause is a bit more general, such as

Code:

WHERE #expression1# #operator# #expression2#

But man, I just figured out what I did wrong in that SQL-query, and indeed, it was just a simple syntax error.
The syntax of the CASE construct is of the form "CASE WHEN #condition# THEN #expression1# ELSE #expression2# END". I clean forgot the "END" at the end. When the syntax error told me that an END token was expected, I thought that it meant the end of the query.
The correct form of the type of SQL-query that I'm implementing is

Code:

SELECT ... FROM ... WHERE CASE WHEN #CONDITION1# THEN 1 ELSE 0 [b]END[/b] + CASE WHEN #CONDITION2# THEN 1 ELSE 0 [b]END[/b] + ... + CASE WHEN #CONDITIONn# THEN 1 ELSE 0 [b]END[/b]>=i

And the example SQL-query instantiation is

Code:

SELECT COMPANYFULLNAME FROM COMPANY WHERE CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%mic%' THEN 1 ELSE 0 [b]END[/b] + CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%icr%' THEN 1 ELSE 0 [b]END[/b] + CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%cro%' THEN 1 ELSE 0 [b]END[/b]>=2

And yes, this query works perfectly!
Thanks for the help!

delta403 · Sep 14, 2010

We can simplyfy the same query as below:
SELECT
(CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%mic%' THEN 1
WHEN LOWER(COMPANYFULLNAME) LIKE '%icr%' THEN 1 WHEN LOWER(COMPANYFULLNAME) LIKE '%cro%' THEN 1 ELSE 0 END)
FROM COMPANY;

SkipVought · Sep 14, 2010

Yes you were correct. Now I see what you are attempting.

Skip,
_{Just traded in my old subtlety...

for a NUANCE!}

MarcLodge · Sep 14, 2010

bghh,
Can you use the SOUNDEX function?

Marc

bghh · Sep 14, 2010

Well Marc,
the SOUNDEX function (or DIFFERENCE function, which is based on SOUNDEX) works quite well in some cases. One specific case where it doesn't work well is when the two strings that are compared are similar, but they don't start with the same letters; for instance "Apple Co" and "The Apple Co" is a poor match according to DIFFERENCE.
However a pair of strings like "Microblah Pty Ltd" and "Mikrobla" is a good match according to DIFFERENCE.
The solution above with the CASEs works well, but if performance becomes an issue, I'll consider using DIFFERENCE.

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

SQL statement for approximate string matching

bghh

Programmer

SkipVought

Programmer

bghh

Programmer

delta403

Programmer

SkipVought

Programmer

MarcLodge

Programmer

bghh

Programmer

Similar threads

Part and Inventory Search

Sponsor