
Removing common words from a search index

Status
Not open for further replies.

audiopro

Programmer
Apr 1, 2004
I have an array of words which is created from picture descriptions in a table.

I want to remove common words such as the, and, but, with, etc., which are simply not words users will search for.

What is the most efficient way to do it?

I thought of comparing each new word against a list before adding it to the array, but hoped there might be a more elegant solution.

Keith
 
Hi

Note that you may not need to remove them. If you store the descriptions in a database and search them with full-text search, stop words are skipped in certain modes. (For MySQL, see the Full-Text Stopwords documentation.)

Otherwise, personally I would do it like this:
Perl:
# Stop words to leave out of the index
my @stop = qw{ the and but with };
my $text = 'I want to remove common words such as the, and, but, with etc.  which are just not words users will search for.';

# Keep each run of 2 or more word characters that is not in @stop
my @word = grep { my $x = $_; ! grep { $_ eq $x } @stop } $text =~ m/\w{2,}/g;
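
With a longer stop list, the inner grep rescans the whole list for every word. A minimal sketch of the same filter using a hash lookup instead, assuming case-insensitive matching is wanted (the version above compares case-sensitively). The names %stop and @words are only illustrative:

```perl
use strict;
use warnings;

# Build the lookup hash once; each membership test is then one hash access.
my %stop = map { $_ => 1 } qw( the and but with );

my $text = 'I want to remove common words such as the, and, but, with etc. '
         . 'which are just not words users will search for.';

# Assumption: words are lowercased first so that "The" and "the" match alike.
my @words = grep { !$stop{$_} } map { lc $_ } $text =~ m/\w{2,}/g;

print join(' ', @words), "\n";
```

For a stop list of four words the difference is negligible; the hash mainly pays off once the list runs to hundreds of entries.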

Feherke.
 
Hi

By the way, in my example I treated any sequence of 2 or more word characters as a word. Depending on the context,
  • you may want to raise the minimum length limit to 3,
  • you may want to set a maximum length to skip ASCII art and the like,
  • you may want to exclude words with more than 50% digits,
  • you may want to exclude words present in more than 50% of the descriptions.
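
The four rules in the list above could be sketched roughly like this. The sample descriptions, the length limits and the names candidate_words, %doc_count and @index_words are all made up for illustration, so treat it as a starting point rather than a finished indexer:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the picture descriptions.
my @descriptions = (
    'Red barn at sunset, taken in 2004',
    'Barn owl in flight ' . ('_' x 20),      # trailing ASCII-art rule
    'Sunset over an old barn, ref 123456a',
);

my ($min_len, $max_len) = (3, 15);           # assumed limits, tune to taste

# Return the distinct words of one description that pass the per-word rules.
sub candidate_words {
    my ($text) = @_;
    my %seen;
    return grep {
        my $digits = () = /\d/g;             # number of digit characters in the word
        length($_) >= $min_len
            && length($_) <= $max_len
            && $digits * 2 <= length($_)     # drop words with more than 50% digits
            && !$seen{$_}++                  # count a word once per description
    } map { lc $_ } $text =~ m/\w+/g;
}

# Count in how many descriptions each surviving word appears.
my %doc_count;
$doc_count{$_}++ for map { candidate_words($_) } @descriptions;

# Drop words present in more than 50% of the descriptions.
my $limit = @descriptions / 2;
my @index_words = grep { $doc_count{$_} <= $limit } sort keys %doc_count;

print "@index_words\n";
```

Here "barn" and "sunset" are dropped for appearing in most descriptions, while "2004", "123456a" and the underscore run fail the per-word rules.
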
What I am trying to say is that stop words are just one factor if you want to fine-tune your indexing and searching. The databases' full-text searches already handle some of these settings.


Feherke.
 
Thanks for the reply, sorry I was a bit slow responding.
Prompted by your suggestion, I found the easiest solution was to reject words less than 4 characters long and to accept only the 26 letters of the alphabet.
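
For anyone reading later, that rule might look something like the sketch below. It is not necessarily the code actually used. It assumes upper and lower case letters both count and that tokens mixing letters with digits are rejected whole, and the sample text is made up:

```perl
use strict;
use warnings;

my $text = 'A photo of the old harbour wall at dusk, file scan0042.';

# Keep only whole words made of 4 or more ASCII letters. The \b anchors
# reject mixed tokens such as "scan0042" instead of keeping "scan".
my @words = map { lc $_ } $text =~ m/\b[A-Za-z]{4,}\b/g;

print "@words\n";
```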

Keith
 