Hello,
I'm trying to compose an SQL-query which does approximate string matching using n-grams.
This is the problem that I'm trying to solve: we have a system on which you can register your company and users within that company. However, before a new company is registered, we would like to do a check to see whether that company already exists in the database. In this case there is a good possibility that the user didn't type the company name exactly as in the existing record in the database.
The idea that I'm trying to implement is this: from the input string (the new company name) we generate a number of n-grams. Let us let n be 3. Then the 3-grams of the string "approximate" for example are "app", "ppr", "pro", "rox", "oxi", "xim", "ima", "mat", and "ate". I would like to do a select-statement which returns (or counts) all the company names in the COMPANY table that contain a certain number of these 3-grams.
More generally, we are looking for a SQL-statement that selects all rows that, given n conditions, satisfy at least i of these n conditions.
How would we do this in a SQL select-expression?
---------------------------------------------------
The idea that I got is to use a SQL-query of the following form:
.
For this specific scenario the SQL-query would look something like
.
The problem is that DB2 finds syntax errors in this SQL. It says
It seems as though DB2 doesn't like the parentheses around the CASEs. However, when I remove these parentheses, I get a different syntax error:
Is there a way to write this SQL-statement so that DB2 accepts it?
Or should I take a different approach on this.
I'll appreciate any help.
Thanks in advance.
I'm trying to compose an SQL-query which does approximate string matching using n-grams.
This is the problem that I'm trying to solve: we have a system on which you can register your company and users within that company. However, before a new company is registered, we would like to do a check to see whether that company already exists in the database. In this case there is a good possibility that the user didn't type the company name exactly as in the existing record in the database.
The idea that I'm trying to implement is this: from the input string (the new company name) we generate a number of n-grams. Let us let n be 3. Then the 3-grams of the string "approximate" for example are "app", "ppr", "pro", "rox", "oxi", "xim", "ima", "mat", and "ate". I would like to do a select-statement which returns (or counts) all the company names in the COMPANY table that contain a certain number of these 3-grams.
More generally, we are looking for a SQL-statement that selects all rows that, given n conditions, satisfy at least i of these n conditions.
How would we do this in a SQL select-expression?
---------------------------------------------------
The idea that I got is to use a SQL-query of the following form:
Code:
SELECT ... FROM ... WHERE (CASE WHEN #CONDITION1# THEN 1 ELSE 0) + (CASE WHEN #CONDITION2# THEN 1 ELSE 0) + ... + (CASE WHEN #CONDITIONn# THEN 1 ELSE 0)>=i
For this specific scenario the SQL-query would look something like
Code:
SELECT COMPANYFULLNAME FROM COMPANY WHERE (CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%mic%' THEN 1 ELSE 0) + (CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%icr%' THEN 1 ELSE 0) + (CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%cro%' THEN 1 ELSE 0)>=3
The problem is that DB2 finds syntax errors in this SQL. It says
Code:
SQL0104N An unexpected token ")" was found following "%mic%' THEN 1 ELSE 0". Expected tokens may include: "END
Code:
SQL0104N An unexpected token ">=" was found following "%cro%' THEN 1 ELSE 0". Expected tokens may include: "END
Or should I take a different approach on this.
I'll appreciate any help.
Thanks in advance.