
awk and SQL server?


ericb123

MIS
Mar 3, 2008
Please excuse the newbie question, I'm very new to AWK.

I have a large directory of HTML files that I would like to search for keywords, then put the results of the search into a SQL Server database.

Is this possible, and if so, are there any samples out there on how to do it?

Any help is greatly appreciated, thanks!
 
Thanks for the reply. Someone had recommended AWK, but admittedly, I'm a complete novice to it.

I have a SQL database with a table that lists keywords I want to search for. Then I have a directory structure on my server that has HTML files:
Folder/subfolder/HTML files

So I'm trying to find an *automated* way to search these HTML files for the list of keywords, and then to put the search results into another SQL table, so my flow is something like this:

Select * from table.keywords

search through all HTML files in the directory

insert results (file name, link, date, keyword found, etc.) into a SQL table

My client searches the web for specific web pages, and when found, they save these HTML files in these folders. I'm trying to build a database of just the ones they need based on a list of keywords.

Any suggestions? Thanks again.
 
This may give you some ideas.

Assuming you have used an SQL query to dump the list of keywords into a plain text file named keywords:

Code:
find /home/www -type f -name '*.html' | awk '
        BEGIN {
                nkeywords=0
                # the >0 test stops on EOF *and* on error; a bare getline
                # would loop forever if the keywords file were missing
                while ((getline line < "keywords") > 0) { keyword[nkeywords++] = line }
                close("keywords")
        }
        {
                file=$0
                found=""
                # create temporary hash from keyword array
                for (i=0; i<nkeywords; i++) {
                        tempkeyword[keyword[i]]=1
                }
                while ((getline line < file) > 0) {
                        # note: match() treats each keyword as a regex
                        for (word in tempkeyword) {
                                if (match(line,word)) {
                                        found=found word " "
                                        # found this word, no point in
                                        # searching for it again in this file
                                        delete tempkeyword[word]
                                }
                        }
                }
                if (found != "") { print file,found }
                close(file)
        }
'

This is a brute force approach, but it should list each file name followed by any keywords that were matched in it (if any were found). Naturally you will want to change the output syntax so that it is suitable for insertion into your database, and to obtain the additional fields that you were looking for...
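If you go this route, one possible way to load the results is to have a second awk stage turn the "file keyword keyword ..." lines into INSERT statements. The results table name and its columns here are just assumptions for illustration; adjust them to your schema:

```shell
# Hypothetical follow-on: convert "file kw1 kw2 ..." lines (the output of
# the script above) into INSERT statements.  The "results" table and its
# column names are made up for the example.
printf 'page1.html foo bar\n' |
awk -v q="'" '{
        for (i = 2; i <= NF; i++)
                printf "INSERT INTO results (file_name, keyword) VALUES (%s%s%s, %s%s%s);\n",
                        q, $1, q, q, $i, q
}'
```

The generated statements could then be piped to a SQL Server client such as sqlcmd, or written to a file and loaded in a batch.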

Annihilannic.
 
How about just using grep?

Code:
grep -il -f keywords /home/www/*.html


----------------------------------------------------------------------------
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
 
I guess because that wouldn't tell you which keywords it had found in which file, so it wouldn't be a lot of help as data to drive a search function.

Annihilannic.
 

The grep utility will give the file name and the line of text where it found any one of the keywords; just remove the 'l' option:
Code:
grep -i -f keywords /home/www/*.html
PS: keywords is a file containing the keywords to search for.
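For the curious, here is a quick self-contained illustration of that output format (file:matching-line) using throwaway files in a scratch directory:

```shell
# Demo of grep's output with -f: each hit prints as file:matching-line.
# Uses a scratch directory so nothing real is touched.
dir=$(mktemp -d) && cd "$dir"
printf 'hello world\n' > a.html
printf 'nothing here\n' > b.html
printf 'world\n' > keywords
grep -i -f keywords ./*.html      # prints: ./a.html:hello world
```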


 
Yeah... but that still won't tell you which keyword(s) resulted in the successful match (without extra parsing of the line)... so it isn't possible to insert the required information into the "keyword -> filename" database without extra processing.

GNU grep's -o option might however give some usable output for this.
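For example (assuming GNU grep), something along these lines would print one file:keyword pair per match, deduplicated:

```shell
# Assumes GNU grep: -o prints each match on its own line, -H forces the
# filename prefix, and sort -u collapses repeated hits of the same keyword.
grep -ioH -f keywords /home/www/*.html | sort -u
```

Note that with -i each match is printed in the case it appears in the file, so mixed-case hits of the same keyword will survive sort -u as separate lines.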

Annihilannic.
 