
awk and SQL server?


ericb123

MIS
Mar 3, 2008
Please excuse the newbie question, I'm very new to AWK.

I have a large directory of HTML files that I would like to search for keywords, then put the results of the search into a SQL Server database.

Is this possible, and if so, are there any samples out there on how to do it?

Any help is greatly appreciated, thanks!
 
Thanks for the reply. Someone had recommended AWK, but admittedly, I'm a complete novice to it.

I have a SQL database with a table that lists keywords I want to search for. Then I have a directory structure on my server that has HTML files:
Folder/subfolder/HTML files

So I'm trying to find an *automated* way to search these HTML files for the list of keywords, and then to put the search results into another SQL table, so my flow is something like this:

Select * from table.keywords

search through all HTML files in the directory

insert results (file name, link, date, keyword found, etc.) into a SQL table

My client searches the web for specific web pages, and when found, they save these HTML files in these folders. I'm trying to build a database of just the ones they need based on a list of keywords.

Any suggestions? Thanks again.
 
This may give you some ideas.

Assuming you have used an SQL query to dump the list of keywords into a plain text file named keywords:

Code:
find /home/www -type f -name '*.html' | awk '
        BEGIN {
                nkeywords=0
                # the >0 test stops on EOF *and* on error; a bare getline
                # would loop forever if the keywords file were missing
                while ((getline line < "keywords") > 0) { keyword[nkeywords++] = line }
                close("keywords")
        }
        {
                file=$0
                found=""
                # create temporary hash from keyword array
                for (i=0; i<nkeywords; i++) {
                        tempkeyword[keyword[i]]=1
                }
                while ((getline line < file) > 0) {
                        # note: match() treats each keyword as a regex
                        for (word in tempkeyword) {
                                if (match(line,word)) {
                                        found=found word " "
                                        # found this word, no point in
                                        # searching for it again in this file
                                        delete tempkeyword[word]
                                }
                        }
                }
                if (found != "") { print file,found }
                close(file)
        }
'

This is a brute force approach, but it should list each file name followed by any keywords that were matched in it (if any were found). Naturally you will want to change the output syntax so that it is suitable for insertion into your database, and to obtain the additional fields that you were looking for...
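If you go this route, one possible way to load the results is to have a second awk stage turn the "file keyword keyword ..." lines into INSERT statements. The results table name and its columns here are just assumptions for illustration; adjust them to your schema:

```shell
# Hypothetical follow-on: convert "file kw1 kw2 ..." lines (the output of
# the script above) into INSERT statements.  The "results" table and its
# column names are made up for the example.
printf 'page1.html foo bar\n' |
awk -v q="'" '{
        for (i = 2; i <= NF; i++)
                printf "INSERT INTO results (file_name, keyword) VALUES (%s%s%s, %s%s%s);\n",
                        q, $1, q, q, $i, q
}'
```

The generated statements could then be piped to a SQL Server client such as sqlcmd, or written to a file and loaded in a batch.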

Annihilannic.
 
How about just using grep?

Code:
grep -il -f keywords /home/www/*.html


----------------------------------------------------------------------------
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
 
I guess because that wouldn't tell you which keywords it had found in which file, so it wouldn't be a lot of help as data to drive a search function.

Annihilannic.
 

The grep utility will give the file name and the line of text where it found any one of the keywords; just remove the 'l' option:
Code:
grep -i -f keywords /home/www/*.html
PS: keywords is a file containing the keywords to search for.
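For the curious, here is a quick self-contained illustration of that output format (file:matching-line) using throwaway files in a scratch directory:

```shell
# Demo of grep's output with -f: each hit prints as file:matching-line.
# Uses a scratch directory so nothing real is touched.
dir=$(mktemp -d) && cd "$dir"
printf 'hello world\n' > a.html
printf 'nothing here\n' > b.html
printf 'world\n' > keywords
grep -i -f keywords ./*.html      # prints: ./a.html:hello world
```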


 
Yeah... but that still won't tell you which keyword(s) resulted in the successful match (without extra parsing of the line)... so it isn't possible to insert the required information into the "keyword -> filename" database without extra processing.

GNU grep's -o option might however give some usable output for this.
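For example (assuming GNU grep), something along these lines would print one file:keyword pair per match, deduplicated:

```shell
# Assumes GNU grep: -o prints each match on its own line, -H forces the
# filename prefix, and sort -u collapses repeated hits of the same keyword.
grep -ioH -f keywords /home/www/*.html | sort -u
```

Note that with -i each match is printed in the case it appears in the file, so mixed-case hits of the same keyword will survive sort -u as separate lines.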

Annihilannic.
 