Retrieving 5 most occurring words 1

vpekulas · Dec 20, 2003

I'm just wandering if it is possible to retrieve words that are
included in the text field most often. So for example content:

Hello, how are you ? Is that it ? How you doing ? How is that ?

Would produce words:

how (3 times)
is (2 times)
that (2 times)

Is this doable without using stored proc ?

"Taxes are the fees we pay for civilized society" G.W.

r937 · Dec 20, 2003

yes, i think it is, however, you may not want to attempt it with sql, because it's, well, ugly

see this article --

How to do looping in SQL

http://searchdatabase.com/ateQuestionNResponse/0,289625,sid13_cid544808_tax285649,00.html

(registration may be required, but it's free)

at the bottom, there's some sql by Joe Celko for parsing a comma-delimited string of values

you could adapt that for space-delimited, and add the grouping necessary to do counts

good luck

rudy

http://r937.com/

vpekulas · Dec 21, 2003

Sorry mate, the URL is dead

What was the article title ?

"Taxes are the fees we pay for civilized society" G.W.

Golom · Dec 21, 2003

The code that Rudy is refering to is like this

Code:

SELECT I1.KeyCol, Mid$(&quot;,&quot; & I1.instring & &quot;,&quot;, S1.num + 1 , S2.num -
S1.num - 1) AS SubString

FROM InputStrings AS I1, integers AS S1, integers AS S2

WHERE     Mid$(&quot;,&quot; & I1.instring & &quot;,&quot; ,S1.num , 1 ) = ','
      and Mid$(&quot;,&quot; & I1.instring & &quot;,&quot; ,S2.num , 1 ) = ','
      and S1.num < S2.num
      and InStr(1,Mid$(&quot;,&quot; & I1.instring & &quot;,&quot;, S1.num + 1 , S2.num -
S1.num - 1), &quot;,&quot;)=0;

Where "integers" is a table with integer values from 1 to 2 more than the length of the longest string that you are trying to parse.
If you run this on

Code:

Keycol instring

Code:

2             abc,def,ghi
3             a,b,cde,fg,hij

It will produce

Code:

KeyCol SubString

Code:

3             a
2             abc
3             b
2             def
3             cde
3             fg
2             ghi
3             hij

vpekulas · Dec 21, 2003

Golom, your code is assuming that I know what words I want to find in the text. However I don't, what I need is
for the code to find the most occuring words, it can be any word, not just what I tell it to find. Any help there ?

"Taxes are the fees we pay for civilized society" G.W.

Golom · Dec 21, 2003

No ... the code makes no assumptions about what words to find ... but it does assume that the words are separated by commas. If your words are separated by something else (space for example) then change "," to " " in the SQL provided.

The code is intended to parse the records from the delimiter-separated form to a separate record for each sub-string (i.e. word). Suppose you were to save that as a query (or view depending on your DBMS) called "qryMyStrings" (for example), then

Code:

Select [Substring], Count(*) As [Occurences]
From   qryMyStrings
Group By [SubString]
Order By 1, 2

Should give you the result in the form

Code:

Substring Occurences

Code:

    a                1
    abc              1
    b                1
    cde              1
    def              1
    fg               1
    ghi              1 
    hij              1

Using my test values.

Golom · Dec 21, 2003

Here's a version that does it all in one query using a space as the word separator.

Code:

SELECT TOP 5 
   Mid$(&quot; &quot; & I1.instring & &quot; &quot;, S1.num + 1 , S2.num - S1.num - 1) AS [Word], 
   Count(*) As [Occurences]

FROM InputStrings AS I1, intsmall AS S1, intsmall AS S2

WHERE 
  Mid$(&quot; &quot; & I1.instring & &quot; &quot; ,S1.num , 1 ) = ' '
  and Mid$(&quot; &quot; & I1.instring & &quot; &quot; ,S2.num , 1 ) = ' '
  and S1.num < S2.num
  and InStr(1,Mid$(&quot; &quot; & I1.instring & &quot; &quot;, S1.num + 1 , S2.num - S1.num - 1), &quot; &quot;)=0
  and Len ( Mid$(&quot; &quot; & I1.instring & &quot; &quot;, S1.num + 1 , S2.num - S1.num - 1) ) > 0

GROUP BY Mid$(&quot; &quot; & I1.instring & &quot; &quot;, S1.num + 1 , S2.num - S1.num - 1)

ORDER BY 2 DESC, 1

r937 · Dec 21, 2003

Golom, thanks

i know how long it takes to set up test cases and the time you invested on this is appreciated

i'm bookmarking this thread, if only for my own reference

;-)

Golom · Dec 21, 2003

Rudy ...

Perhaps you recall that we had a discussion about this some time ago because I was having some problem getting Joe's original code to run. These test cases were developed back then based on Joe's original query.

I'm not nearly smart enough to be able to do something like this myself.

vpekulas · Dec 21, 2003

Either way guys, thank yuo so much.
I appreciate your help with this. I tink I'll be able to do somthing with your examples.

"Taxes are the fees we pay for civilized society" G.W.

r937 · Dec 21, 2003

> "I'm not nearly smart enough
> to be able to do something like this myself."

fabulous comment

i feel exactly the same way about myself

we are tall only because we stand on the shoulders of giants

celko has inspired me immensely

rudy

dnoeth · Dec 21, 2003

Me too...
Celko probably inspired most of his readers, i just wish he always posted code which really worked ;-)

What i didn't like about Celko's code was the usage of two cross joins to the sequence table. The following code is from an example i usually use in my trainings:

SELECT
groupcol
,SUBSTRING(string
FROM n
FOR (POSITION(',' IN SUBSTRING(string || ',' FROM n))) - 1
) AS word
FROM Strings s, (SELECT n FROM nums WHERE n BETWEEN 1 AND 247) tmp
WHERE
(n <= CHAR_LENGTH(string) AND SUBSTRING(string FROM n - 1 FOR 1) = ',')
OR
n = 1;

The following is less readable, but gets rid of all string concats. So this turned out to be the fastest version on my DBMS, Teradata:

SELECT
groupcol
,SUBSTRING(string
FROM n
FOR (CASE
WHEN (POSITION(',' IN SUBSTRING(string FROM n))) = 0
THEN 64000
ELSE (POSITION(',' IN SUBSTRING(string FROM n)))
END) - 1) AS word
FROM Strings s, (SELECT n FROM nums WHERE n BETWEEN 1 AND 247) tmp
WHERE
(n <= CHAR_LENGTH(string) AND SUBSTRING(string FROM n - 1 FOR 1) = ',')
OR
n = 1;

And as an additional advantage the WHERE-clause can now easily be enhanced for more than one delimiter character using a string or IN-list or table of delimiter chars:

WHERE
(n <= CHAR_LENGTH(string)
AND POSITION(SUBSTRING(string FROM n - 1 FOR 1) IN '!,;*') > 0)
OR
n = 1;

Dieter

r937 · Dec 21, 2003

wow

thanks, dieter

i'll have to build my own test cases just try your queries, but i intend to do just that

not every database (including in particular the ones i use) support some of your functions, so i'm looking forward to rewriting them, if just for the fact that doing so will make me understand how they work

you don't by any chance have a web site, do you?

rudy

dnoeth · Dec 22, 2003

Hi Rudy,
this is pure Standard SQL92, so at least POSITION and SUBSTRING are supported and even Oracle now knows CASE ;-)

But the posted SQL doesn't allow delimiter lists, this is only possible with SQL:1999, i forgot to copy this one:

SELECT
groupcol
,SUBSTRING(string
FROM n
FOR (COALESCE(MAX

OVER (PARTITION BY groupcol
ORDER BY n ROWS BETWEEN 1 FOLLOWING and 1 FOLLOWING)
, 64000)
) - n - 1
) AS word
,ROW_NUMBER() OVER (PARTITION BY groupcol ORDER BY n) AS pos
FROM Strings s, (SELECT n FROM nums WHERE n BETWEEN 1 AND 247) tmp
WHERE
(n <= CHAR_LENGTH(string) AND SUBSTRING(string FROM n - 1 FOR 1) = ',')
OR
n = 1;

Currently Teradata, Oracle and DB2 support those OLAP funuctions.

I'll post a test case from a private MCT trainer forum by Itzik Ben-Gan for MS SQL:

CREATE TABLE Strings
(
groupcol INT NOT NULL,
string VARCHAR(1000) NOT NULL
)

INSERT INTO Strings VALUES(1, 'a')
INSERT INTO Strings VALUES(2, 'b,c')
INSERT INTO Strings VALUES(3, 'd,e,f')
INSERT INTO Strings VALUES(4, 'g,h,i,g')

-- Permanent Auxiliary table - Nums
CREATE TABLE Nums
(
n INT NOT NULL PRIMARY KEY
)

SET NOCOUNT ON

DECLARE
@rc AS INT,
@max AS INT

SET @rc = 1
SET @max = 8000

INSERT INTO Nums VALUES(1)

WHILE @rc * 2 < @max
BEGIN
INSERT INTO Nums
SELECT n + @rc
FROM Nums

SET @rc = @rc + @@rowcount
END

INSERT INTO Nums
SELECT n + @rc
FROM Nums
WHERE n + @rc <= @max

--Itzik's solution
SELECT
groupcol,
SUBSTRING(string, n, CHARINDEX(',', string + ',', n) - n) AS word
FROM Strings JOIN Nums
ON n <= LEN(string) AND SUBSTRING(',' + string, n, 1) = ','

--mine had a longer source code, but was a bit faster ;-)
select
groupcol,
substring(string, n, case when (charindex(',', string, n)) = 0
then 8000
else (charindex(',', string, n))
end - n) as word
from Strings s, nums
where
n <= Len(string)
and
( n = 1
or
substring(string, n - 1, 1) = ','
)
order by groupcol, n

If you have access to the SQLmag archives, that's Itzik's article:

http://www.sqlmag.com/Articles/Index.cfm?ArticleID=25678

btw, i still don't have a web site...

Dieter

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Retrieving 5 most occurring words 1

vpekulas

Programmer

r937

Technical User

vpekulas

Programmer

Golom

Programmer

vpekulas

Programmer

Golom

Programmer

Golom

Programmer

r937

Technical User

Golom

Programmer

vpekulas

Programmer

r937

Technical User

dnoeth

Instructor

r937

Technical User

dnoeth

Instructor

Similar threads

Part and Inventory Search

Sponsor