searchable pdf / word docs

Status
Not open for further replies.

1DMF

Programmer
Jan 18, 2005
8,795
GB
How would I go about building a search facility that could include the content of Word and PDF documents?

Do I need to consider some form of meta data that is stored in the DB with the document?

Would it not be practical to do a real-time search of a folder containing a bunch of Word and PDF docs?

All input gratefully received.

1DMF



"In complete darkness we are all the same, it is only our knowledge and wisdom that separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"

Free Dance Music Downloads
 
I've never actually coded anything like this, but I've thought about it a few times in response to some user requirements.

I'd always imagined having some kind of many-to-many relationship between the words you parse out of the documents, and the documents themselves (on a DBMS). That way you'd be able to search for a number of keywords and the document that had the most matches (as a count) would be the one you'd want to see at the top of the list. It also means that you'd be able to expend the CPU cycles (once) to parse and analyse the documents as they arrive, and the search cost would be lower. Kind of like a poor man's Google...
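As a rough illustration of the ranking idea, here is a minimal Python sketch. It assumes the word counts have already been parsed out of each document; the `index` dictionary, the file names, and the `rank_documents` function are all made up for the example and stand in for the DBMS tables.

```python
# Hypothetical in-memory stand-in for the document-word relationship:
# document name -> {word: occurrences}. Names and counts are invented.
index = {
    "holiday-policy.doc": {"holiday": 4, "policy": 2, "staff": 1},
    "expenses.pdf": {"expenses": 5, "policy": 1},
    "handbook.pdf": {"staff": 3, "holiday": 1, "expenses": 1, "policy": 1},
}


def rank_documents(index, keywords):
    """Return (document, matched_keywords) pairs, best match first."""
    wanted = {k.lower() for k in keywords}
    scores = {}
    for doc, words in index.items():
        # Count how many of the requested keywords appear in this document.
        matches = len(words.keys() & wanted)
        if matches:
            scores[doc] = matches
    # Most matches first; ties are broken by name so the order is stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


for doc, matches in rank_documents(index, ["holiday", "expenses", "policy"]):
    print(doc, matches)
```

Here the score is simply the number of distinct keywords matched. Weighting by how often each word occurs would be an easy variation.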

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
If those documents are permanent and there are hundreds or thousands of them, then you should go with including their full text content in a DB.
However, not all PDFs contain text (some are just scanned pages); for those you can only use an OCR application, which cannot be fully automatic. Also, writing your own Word or PDF text extraction routine is not a simple task (different versions...).

 
prex1 is right about the scanned-content PDFs, although many scanning applications have OCR, which adds machine-readable text based on the content of the scanned document to the PDF.

I was thinking of a workflow which went something like:
[ol 1]
[li]parse the text of the document using split or some other regex-based mechanism[/li]
[li]normalise the 'words' by lower-casing, punctuation removal etc.[/li]
[li]exclude one, two, and some common three and four letter words like and, is, when, etc.[/li]
[li]use a hash with the word as a key and a count as a value, zip through the parsed words to get a count for this document[/li]
[li]add the document locator (file name, URL etc.) and document ID to the 'document' table[/li]
[li]for each item in your hash, check if 'word' exists on the word table, and if not, insert it[/li]
[li]add a row to the document-word table with the document ID, the word ID, and the count[/li]
[/ol]
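The steps above could be sketched roughly as follows in Python, using an in-memory SQLite database as a stand-in for the real DBMS. This is only an illustration under several assumptions: the text has already been extracted from the Word or PDF file, and the table names, column names, stop-word list, and function names are all invented for the example.

```python
import re
import sqlite3
from collections import Counter

# Illustrative stop-word list only; a real one would be much longer.
STOP_WORDS = {"and", "the", "when", "with", "that", "this", "from"}


def count_words(text):
    """Steps 1-4: split, normalise, drop short and common words, count."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w for w in words if len(w) > 2 and w not in STOP_WORDS)


def create_schema(conn):
    """Create the three hypothetical tables behind the many-to-many link."""
    conn.executescript("""
        CREATE TABLE document (id INTEGER PRIMARY KEY, locator TEXT UNIQUE);
        CREATE TABLE word (id INTEGER PRIMARY KEY, term TEXT UNIQUE);
        CREATE TABLE document_word (
            document_id INTEGER REFERENCES document(id),
            word_id INTEGER REFERENCES word(id),
            occurrences INTEGER,
            PRIMARY KEY (document_id, word_id)
        );
    """)


def index_document(conn, locator, text):
    """Steps 5-7: store the document, any new words, and the counts."""
    cur = conn.execute("INSERT INTO document (locator) VALUES (?)", (locator,))
    doc_id = cur.lastrowid
    for term, occurrences in count_words(text).items():
        # Insert the word only if it is not already on the word table.
        conn.execute("INSERT OR IGNORE INTO word (term) VALUES (?)", (term,))
        word_id = conn.execute(
            "SELECT id FROM word WHERE term = ?", (term,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO document_word VALUES (?, ?, ?)",
            (doc_id, word_id, occurrences),
        )
    conn.commit()
    return doc_id


conn = sqlite3.connect(":memory:")
create_schema(conn)
index_document(conn, "docs/example.pdf", "The cat and the other cat, is it a Cat?")
```

The parsing cost is paid once, when the document arrives. A search then only has to join the word and document-word tables.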

I think that if your DBMS supports full-text indexing of text and VARCHAR fields, you might be able to just slap it in the table and let the DBMS take care of points 1 through 7 for you. But you might not want to store the full text on the DB. And it's not very challenging either, is it? [smile]

Steve

 
hmm some interesting food for thought.

[quote]I think that if your DBMS supports full-text indexing of text and VARCHAR fields, you might be able to just slap it in the table and let the DBMS take care of points 1 through 7 for you.[/quote]

Do you mean store the document as a BLOB and the DBMS will index the text? Or are there now data types for PDF / DOC / XLS etc.?

What am I slapping into the table?

If this is an option and not difficult to do, KISS is always better than a challenge, isn't it?

Why make life difficult for yourself? The rewrite of the entire web app is challenging enough; anything that can give powerful enhancements with as little effort as possible is my kind of solution!

The PDFs & Word docs in question are mainly textual and there are fewer than 500 docs in total, so I'm sure I could use a parser, collect the words in each document and index accordingly, but if MS SQL 2008 R2 can do the donkey work for me, then that'll be a bonus.

 
Well, a quick Bing and you could be on to something, Stevexff.

Looks like I might need to get an iFilter from Adobe to enable the full-text indexing / search to work, though apparently Word / XLS are built into MS SQL 2008.



 
1DMF

Not a BLOB, but a CLOB (character large object) for the text. I think M$ SQL Server might even support an indexable text column type. You will need the filter to pull the text out of the PDF but you will only have to do that once per document when you store it. Might even be practical to store the text of the PDF and its file location or URL only on the table; once you've found the PDF with the index, then you can return that and the user can request the PDF by clicking on the URL.

You get the idea, anyway...

Steve

 
Well, according to the link I posted, you store the PDF as a BLOB, import the iFilter into SQL Server and then use a special join to perform a 'real-time' search of the PDF...

Code:
-- To list the document types your SQL Server installation can index, use:

SELECT *
FROM sys.fulltext_document_types

-- After installing the iFilters on the server, run the following
-- to load the filters into the full-text search engine:

EXEC sys.sp_fulltext_service 'load_os_resources', 1;
GO

EXEC sys.sp_fulltext_service 'update_languages', NULL;

-- Then you can search the file content using CONTAINS or CONTAINSTABLE this way
-- ("your search text" is a placeholder; a multi-word term must be in double quotes):

SELECT [ID],[Name],[FileContent]
FROM [MyDatabase].[dbo].[Files]
INNER JOIN 
CONTAINSTABLE ([MyDatabase].[dbo].[Files], 
([Name], [FileContent]), 
'ISABOUT( FORMSOF (INFLECTIONAL, "your search text") WEIGHT(0.9))', 
language 'English') AS res
ON res.[KEY]=[ID]

I have the iFilter installed and loaded into the full-text search system, so I just need to store a PDF and try a search, and hopefully that'll be job done :)

 