
indexing documents

Status
Not open for further replies.

glendacom

MIS
Oct 23, 2002
US
I have approximately 1000 documents (Word, PowerPoint, Excel, Project, PDF) that I need to index. I want to search each document and generate a list of keywords (the important words, excluding articles, pronouns, conjunctions, and prepositions). I then want to put this list of keywords in a SQL Server table with links to the documents containing each keyword. It would be nice to assign a weight factor to each keyword (e.g. does it occur 100 times or 1 time?), but at this point I would settle for just having a list of the words in each document. Does anyone have an idea of where to start?
 
I would look at full-text indexing in SQL Server and let it do the work for you.
 
To do this with SQL Server, you will be responsible for writing the code that pulls the text out of the files and puts it into SQL Server. This means using an interface that understands each file format.
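As a rough idea of the "putting it in SQL Server" half, here is a minimal sketch in Python. It assumes the text has already been pulled out of each file and reduced to a word-to-count mapping. The table and column names (`documents`, `keywords`, `occurrences`) and the function names are made up for illustration, and `sqlite3` is used only as a stand-in so the sketch runs anywhere; against SQL Server you would swap in a SQL Server driver and adjust the DDL to T-SQL.

```python
import sqlite3

def create_tables(conn):
    """Create one table of documents and one of per-document keywords."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            doc_id INTEGER PRIMARY KEY,
            path   TEXT NOT NULL UNIQUE      -- link back to the file
        )""")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS keywords (
            doc_id      INTEGER NOT NULL REFERENCES documents(doc_id),
            word        TEXT    NOT NULL,
            occurrences INTEGER NOT NULL,    -- simple weight factor
            PRIMARY KEY (doc_id, word)
        )""")

def store_keywords(conn, path, counts):
    """Insert one document and its word counts; return the new doc_id."""
    cur = conn.execute("INSERT INTO documents (path) VALUES (?)", (path,))
    doc_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO keywords (doc_id, word, occurrences) VALUES (?, ?, ?)",
        [(doc_id, word, n) for word, n in counts.items()],
    )
    conn.commit()
    return doc_id

def documents_containing(conn, word):
    """Return (path, occurrences) rows for a keyword, heaviest first."""
    rows = conn.execute(
        """SELECT d.path, k.occurrences
             FROM keywords k
             JOIN documents d ON d.doc_id = k.doc_id
            WHERE k.word = ?
            ORDER BY k.occurrences DESC, d.path""",
        (word,),
    )
    return rows.fetchall()
```

Keeping the count in its own column means the weight factor costs nothing extra: a lookup by keyword can sort the matching documents by how often the word occurs.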

One other option might be to use a full-text search engine outside of SQL Server. However, finding one that understands all your file formats and is within your budget might be difficult. An example can be found at
 
afsearch looks promising and is definitely within my budget, but I would really like to end up with a list of words for each document in a SQL Server table. I have no idea where to even start to take a document (Word, for example), extract ALL of its words (not just ones I specify, although I would like to specify ones NOT to list), and load them into a table.
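Once the plain text is out of a document, listing every word except the ones you exclude is the easy part. The following is a minimal sketch in Python, not a finished indexer: `STOP_WORDS` and `keyword_counts` are hypothetical names, the stop list is only a starting sample, and the regular expression assumes English-style words made of letters, digits, and apostrophes.

```python
import re
from collections import Counter

# Hypothetical stop list: extend it with any words that should NOT be listed.
STOP_WORDS = {
    "a", "an", "the",                       # articles
    "and", "or", "but",                     # conjunctions
    "of", "in", "on", "to", "for", "with",  # prepositions
    "at", "by",
    "i", "you", "he", "she", "it", "we", "they",  # pronouns
}

def keyword_counts(text, stop_words=STOP_WORDS):
    """Return a Counter of every word in text that is not a stop word."""
    # Lower-case first so "Index" and "index" count as the same keyword.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Skip stop words and single characters (stray letters, lone apostrophes).
    return Counter(w for w in words if w not in stop_words and len(w) > 1)
```

Calling `keyword_counts(text).items()` gives (word, count) pairs ready to be loaded into a table, and the count doubles as the weight factor.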

Can anyone point me in the right direction?
 
Word and Excel provide COM components to get at their data. At a guess, I would say Project and PowerPoint do too. However, Microsoft does not recommend using them in a server environment. There are tools to extract the text from a PDF; however, you will have to purchase those modules from Adobe. I attempted to extract data from a PDF a year or so ago and it wasn't much fun. They may have improved it since then, though.

Chris.
 