
Optimise NEAR with an extra table? (newbie question)

Status
Not open for further replies.

BillGodfrey

Programmer
Aug 2, 2003
3
GB
Hello. Here's a newbie question. I'm developing a CGI program using C++ and MySQL.

I have a table of audio CDs. One field holds the title and artist, VARCHAR(80). This is a human-readable ASCII text string and may contain all sorts of punctuation in mysterious places.

I'd like to allow searching on this field and to support near matches. To this end, instead of searching the title field directly, the user's search would run against a normalised title field instead.

This "normalised title" is the same title text, but lower-cased, any excess punctation removed and common words (the a etc) removed. (So "FOObar!!" becomes "foobar") For good measure, some suitable mis-spellings could be added at a later date.

The plan is that the user's typed-in search request would also be normalised and split into space-separated words. For each word, I'd perform...

SELECT cd_key FROM cd WHERE normtitle LIKE '% .... %'
(The .... replaced by the single-word search term. MySQL's pattern-match operator for % wildcards is LIKE; there is no NEAR operator.)

Each returned cd_key (an auto incrementing BIGINT) would be added (in the C++ world) to an array (or some STL container) keeping a total of how many times this cd_key has been returned in a search. This would be sorted and the highest scoring titles would be returned to the user.

First question: Is this okay so far?

The only thing worrying me is the substring search. Say there are a million rows in this table. Is it going to search each row one by one for each word that the user types in?

It occurs to me that I could make the search a bit quicker by keeping an extra index table instead of the extra normtitle field. (Note, I'm talking about a new separate table which happens to perform an indexing function. I'm not talking about the indexing which MySQL routinely performs.)

When a new CD is added, the title is again normalised as before, but also split into words. Each normalised-title-word becomes a new entry in the new index table, mapping to the cd_key it refers to.
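The extra table might be sketched in SQL as below, reusing the table and column names from the query further down (word_index, norm_title_word); the column sizes and example values are illustrative assumptions:

```sql
-- One row per (word, CD) pair; the index on norm_title_word
-- lets each single-word lookup avoid a full table scan.
CREATE TABLE word_index (
    norm_title_word VARCHAR(80) NOT NULL,
    cd_key          BIGINT      NOT NULL,
    INDEX (norm_title_word)
);

-- When a CD is added, insert one row per normalised-title word,
-- e.g. normalised title 'best queen' for cd_key 307:
INSERT INTO word_index (norm_title_word, cd_key) VALUES ('best', 307);
INSERT INTO word_index (norm_title_word, cd_key) VALUES ('queen', 307);
```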

Searching for a CD title now involves the same normalisation of the user's search terms, but now my query, performed once for each word the user gives, is...

SELECT cd_key FROM word_index WHERE norm_title_word='....'

The cd_keys returned are again counted up and sorted as before.

Second question: Is this a better way?
Third question: Is there an even better way than either of these suggestions?

Many thanks for any help you may be able to offer.

Bill, learning.

