Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gkittelson on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Retrieving 5 most occurring words 1

Status
Not open for further replies.

vpekulas

Programmer
Jan 8, 2002
154
CA
I'm just wandering if it is possible to retrieve words that are
included in the text field most often. So for example content:

Hello, how are you ? Is that it ? How you doing ? How is that ?

Would produce words:

how (3 times)
is (2 times)
that (2 times)

Is this doable without using stored proc ?

"Taxes are the fees we pay for civilized society" G.W.
 
yes, i think it is, however, you may not want to attempt it with sql, because it's, well, ugly

see this article --

How to do looping in SQL
(registration may be required, but it's free)

at the bottom, there's some sql by Joe Celko for parsing a comma-delimited string of values

you could adapt that for space-delimited, and add the grouping necessary to do counts

good luck


rudy
 
Sorry mate, the URL is dead :( What was the article title ?

"Taxes are the fees we pay for civilized society" G.W.
 
The code that Rudy is refering to is like this
Code:
SELECT I1.KeyCol, Mid$("," & I1.instring & ",", S1.num + 1 , S2.num -
S1.num - 1) AS SubString

FROM InputStrings AS I1, integers AS S1, integers AS S2

WHERE     Mid$("," & I1.instring & "," ,S1.num , 1 ) = ','
      and Mid$("," & I1.instring & "," ,S2.num , 1 ) = ','
      and S1.num < S2.num
      and InStr(1,Mid$(&quot;,&quot; & I1.instring & &quot;,&quot;, S1.num + 1 , S2.num -
S1.num - 1), &quot;,&quot;)=0;
Where &quot;integers&quot; is a table with integer values from 1 to 2 more than the length of the longest string that you are trying to parse.
If you run this on
Code:
Keycol    instring
Code:
2             abc,def,ghi
3             a,b,cde,fg,hij
It will produce
Code:
KeyCol    SubString
Code:
3             a
2             abc
3             b
2             def
3             cde
3             fg
2             ghi
3             hij
 
Golom, your code is assuming that I know what words I want to find in the text. However I don't, what I need is
for the code to find the most occuring words, it can be any word, not just what I tell it to find. Any help there ?

&quot;Taxes are the fees we pay for civilized society&quot; G.W.
 
No ... the code makes no assumptions about what words to find ... but it does assume that the words are separated by commas. If your words are separated by something else (space for example) then change &quot;,&quot; to &quot; &quot; in the SQL provided.

The code is intended to parse the records from the delimiter-separated form to a separate record for each sub-string (i.e. word). Suppose you were to save that as a query (or view depending on your DBMS) called &quot;qryMyStrings&quot; (for example), then
Code:
Select [Substring], Count(*) As [Occurences]
From   qryMyStrings
Group By [SubString]
Order By 1, 2
Should give you the result in the form
Code:
Substring       Occurences
Code:
    a                1
    abc              1
    b                1
    cde              1
    def              1
    fg               1
    ghi              1 
    hij              1
Using my test values.
 
Here's a version that does it all in one query using a space as the word separator.
Code:
SELECT TOP 5 
   Mid$(&quot; &quot; & I1.instring & &quot; &quot;, S1.num + 1 , S2.num - S1.num - 1) AS [Word], 
   Count(*) As [Occurences]

FROM InputStrings AS I1, intsmall AS S1, intsmall AS S2

WHERE 
  Mid$(&quot; &quot; & I1.instring & &quot; &quot; ,S1.num , 1 ) = ' '
  and Mid$(&quot; &quot; & I1.instring & &quot; &quot; ,S2.num , 1 ) = ' '
  and S1.num < S2.num
  and InStr(1,Mid$(&quot; &quot; & I1.instring & &quot; &quot;, S1.num + 1 , S2.num - S1.num - 1), &quot; &quot;)=0
  and Len ( Mid$(&quot; &quot; & I1.instring & &quot; &quot;, S1.num + 1 , S2.num - S1.num - 1) ) > 0

GROUP BY Mid$(&quot; &quot; & I1.instring & &quot; &quot;, S1.num + 1 , S2.num - S1.num - 1)

ORDER BY 2 DESC, 1
 
Golom, thanks

i know how long it takes to set up test cases and the time you invested on this is appreciated

i'm bookmarking this thread, if only for my own reference

;-)
 
Rudy ...

Perhaps you recall that we had a discussion about this some time ago because I was having some problem getting Joe's original code to run. These test cases were developed back then based on Joe's original query.

I'm not nearly smart enough to be able to do something like this myself.
 
Either way guys, thank yuo so much.
I appreciate your help with this. I tink I'll be able to do somthing with your examples.

&quot;Taxes are the fees we pay for civilized society&quot; G.W.
 
> &quot;I'm not nearly smart enough
> to be able to do something like this myself.&quot;

fabulous comment

i feel exactly the same way about myself

we are tall only because we stand on the shoulders of giants

celko has inspired me immensely


rudy
 
Me too...
Celko probably inspired most of his readers, i just wish he always posted code which really worked ;-)

What i didn't like about Celko's code was the usage of two cross joins to the sequence table. The following code is from an example i usually use in my trainings:

SELECT
groupcol
,SUBSTRING(string
FROM n
FOR (POSITION(',' IN SUBSTRING(string || ',' FROM n))) - 1
) AS word
FROM Strings s, (SELECT n FROM nums WHERE n BETWEEN 1 AND 247) tmp
WHERE
(n <= CHAR_LENGTH(string) AND SUBSTRING(string FROM n - 1 FOR 1) = ',')
OR
n = 1;



The following is less readable, but gets rid of all string concats. So this turned out to be the fastest version on my DBMS, Teradata:

SELECT
groupcol
,SUBSTRING(string
FROM n
FOR (CASE
WHEN (POSITION(',' IN SUBSTRING(string FROM n))) = 0
THEN 64000
ELSE (POSITION(',' IN SUBSTRING(string FROM n)))
END) - 1) AS word
FROM Strings s, (SELECT n FROM nums WHERE n BETWEEN 1 AND 247) tmp
WHERE
(n <= CHAR_LENGTH(string) AND SUBSTRING(string FROM n - 1 FOR 1) = ',')
OR
n = 1;


And as an additional advantage the WHERE-clause can now easily be enhanced for more than one delimiter character using a string or IN-list or table of delimiter chars:

WHERE
(n <= CHAR_LENGTH(string)
AND POSITION(SUBSTRING(string FROM n - 1 FOR 1) IN '!,;*') > 0)
OR
n = 1;

Dieter
 



wow


thanks, dieter

i'll have to build my own test cases just try your queries, but i intend to do just that

not every database (including in particular the ones i use) support some of your functions, so i'm looking forward to rewriting them, if just for the fact that doing so will make me understand how they work

you don't by any chance have a web site, do you?



rudy
 
Hi Rudy,
this is pure Standard SQL92, so at least POSITION and SUBSTRING are supported and even Oracle now knows CASE ;-)

But the posted SQL doesn't allow delimiter lists, this is only possible with SQL:1999, i forgot to copy this one:

SELECT
groupcol
,SUBSTRING(string
FROM n
FOR (COALESCE(MAX(n) OVER (PARTITION BY groupcol
ORDER BY n ROWS BETWEEN 1 FOLLOWING and 1 FOLLOWING)
, 64000)
) - n - 1
) AS word
,ROW_NUMBER() OVER (PARTITION BY groupcol ORDER BY n) AS pos
FROM Strings s, (SELECT n FROM nums WHERE n BETWEEN 1 AND 247) tmp
WHERE
(n <= CHAR_LENGTH(string) AND SUBSTRING(string FROM n - 1 FOR 1) = ',')
OR
n = 1;

Currently Teradata, Oracle and DB2 support those OLAP funuctions.


I'll post a test case from a private MCT trainer forum by Itzik Ben-Gan for MS SQL:

CREATE TABLE Strings
(
groupcol INT NOT NULL,
string VARCHAR(1000) NOT NULL
)

INSERT INTO Strings VALUES(1, 'a')
INSERT INTO Strings VALUES(2, 'b,c')
INSERT INTO Strings VALUES(3, 'd,e,f')
INSERT INTO Strings VALUES(4, 'g,h,i,g')

-- Permanent Auxiliary table - Nums
CREATE TABLE Nums
(
n INT NOT NULL PRIMARY KEY
)

SET NOCOUNT ON

DECLARE
@rc AS INT,
@max AS INT

SET @rc = 1
SET @max = 8000

INSERT INTO Nums VALUES(1)

WHILE @rc * 2 < @max
BEGIN
INSERT INTO Nums
SELECT n + @rc
FROM Nums

SET @rc = @rc + @@rowcount
END

INSERT INTO Nums
SELECT n + @rc
FROM Nums
WHERE n + @rc <= @max

--Itzik's solution
SELECT
groupcol,
SUBSTRING(string, n, CHARINDEX(',', string + ',', n) - n) AS word
FROM Strings JOIN Nums
ON n <= LEN(string) AND SUBSTRING(',' + string, n, 1) = ','


--mine had a longer source code, but was a bit faster ;-)
select
groupcol,
substring(string, n, case when (charindex(',', string, n)) = 0
then 8000
else (charindex(',', string, n))
end - n) as word
from Strings s, nums
where
n <= Len(string)
and
( n = 1
or
substring(string, n - 1, 1) = ','
)
order by groupcol, n


If you have access to the SQLmag archives, that's Itzik's article:


btw, i still don't have a web site...

Dieter
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top