Extracting Text from PDF

Status
Not open for further replies.

jerasi

Programmer
Jun 19, 2000
141
CA
I need to extract all the text from PDF files that are uploaded to my site. I want to insert the extracted text into the database so I can run searches on it.
Does anyone know a good web control that will do this for me? I have been looking for a long time and couldn't find anything suitable.
Thank you.
 
jerasi - have you considered first reading the file into an XML file and then processing from there?
 
I don't know of any way to read a PDF file into XML. Do you?
 
You can save pdf files as text and use them as you need from there.

Jim
 
jerasi - Jim's solution is probably your best shot. I mentioned XML only because I was reading a discussion this morning in which PDF files were converted to XML and then processed from there. Let us know which course you take and we'll help with the specifics if we can.
 
The process is like this:
1. User goes to the website and uploads a PDF document
2. The pdf document gets saved to the docs directory

The user wants to be able to search the text in the files by keyword.
How do I extract the text from the PDF document so I can insert it into the database directly from the web app?

Thank you.
 
jerasi: didn't mean to abandon you in midstream - had to take off yesterday afternoon - I'm going to dedicate a bit of time this morning to looking for a reference or two on this - this is a good problem - let's see if we can't work towards a solution - someone else may chime in as well during the morning - hang in there!
 
If the objective is to search text in PDF files
and IIS and SQL Server are used,
then try this:

Enable Indexing Service.
Get the PDF iFilter from the Adobe web site so that PDF files can be indexed.
A FREETEXT search will then find files containing the keywords.

 
jerasi: I did a cursory review and it doesn't appear readily obvious which path may be the correct one here. Technically you could convert the pdf to a Text file, store it or its path in a database and continue developing the type of search you want to perform.

The easiest way I can see would be to simply store the PDF itself and index it with Indexing Service and the Adobe iFilter, or something similar.

On the other hand, Jim's suggestion that you convert the pdf first to a text file would make them more readily available for searching using ASP.NET. I do not see any reason to store the file itself in a database, only perhaps a reference to its path. Let us know how things go and perhaps someone else will drop in and give you their 2 cents on this. Nice project by the way -- one I am sure will be well worth the effort in solving (would be nice to see a brief on how this is accomplished).
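
For what it's worth, the "convert to text, store a path, search from ASP.NET" idea could be sketched roughly as below. This is only a sketch under assumptions: the Documents table, its FilePath and BodyText columns, and the PdfTextStore module are hypothetical names, the database is assumed to be SQL Server, and the text is assumed to have been extracted already by whatever converter or third-party control ends up being used (that step is not shown).

```vbnet
Imports System.Data
Imports System.Data.SqlClient

Public Module PdfTextStore

    ' Hypothetical schema: Documents(FilePath NVARCHAR(400), BodyText NVARCHAR(MAX)).

    ' Saves the path of an uploaded PDF together with its already-extracted text.
    Public Sub SaveDocument(ByVal connString As String, ByVal filePath As String, ByVal bodyText As String)
        Dim sql As String = "INSERT INTO Documents (FilePath, BodyText) VALUES (@path, @body)"
        Using cn As New SqlConnection(connString)
            Using cmd As New SqlCommand(sql, cn)
                ' Parameters keep user-supplied values out of the SQL text.
                cmd.Parameters.AddWithValue("@path", filePath)
                cmd.Parameters.AddWithValue("@body", bodyText)
                cn.Open()
                cmd.ExecuteNonQuery()
            End Using
        End Using
    End Sub

    ' Returns the paths of all documents whose stored text contains the keyword.
    Public Function FindDocuments(ByVal connString As String, ByVal keyword As String) As DataTable
        Dim sql As String = "SELECT FilePath FROM Documents WHERE BodyText LIKE @pattern"
        Dim results As New DataTable()
        Using cn As New SqlConnection(connString)
            Using adapter As New SqlDataAdapter(sql, cn)
                ' LIKE wildcards typed by the user (%, _, [) are not escaped in this sketch.
                adapter.SelectCommand.Parameters.AddWithValue("@pattern", "%" & keyword & "%")
                adapter.Fill(results) ' Fill opens and closes the connection itself.
            End Using
        End Using
        Return results
    End Function

End Module
```

A LIKE scan is the simplest thing that works on a small table; with many documents, a full-text index on the text column would probably be the better fit.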
 
yuhui, that's an interesting solution. I will check if the web hoster will allow me to install iFilter.
Also, you mentioned iFilter and FREETEXT. What would the query look like?
Thank you all for your help, it is greatly appreciated.

 
Well, that was a quick "No" from the web hoster :) The iFilter solution sounded good but I can't use it.
I will be able to use a third party control if I can find one that does a pdf to text conversion.
 
iFilter is free, but Indexing Service will consume CPU on the server.

As for the query:

' Phrase search on the document contents.
strQuery = "SELECT DocTitle, Filename, Size, Path, URL, DocComments FROM SCOPE() WHERE CONTAINS(' """ & SearchText & """ ')"

' Extra search: also match the phrase in the document comments.
strQuery &= " OR CONTAINS(DocComments, ' """ & SearchText & """ ')"

' Catalog is the name of the Indexing Service catalog.
Dim connString As String = "Provider=MSIDXS.1;Integrated Security .='';Data Source='" & Catalog & "'"

' Run the query through the Indexing Service OLE DB provider and load the results.
Dim cn As New System.Data.OleDb.OleDbConnection(connString)
Dim adapter As New System.Data.OleDb.OleDbDataAdapter(strQuery, cn)
Dim testDataSet As New System.Data.DataSet
adapter.Fill(testDataSet)
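
One thing to watch in the query above: SearchText is concatenated straight into the query string, so a quote typed by the user can break the query. A minimal sketch of cleaning the input first follows. The IndexQueryBuilder module and BuildQuery function are hypothetical names, and dropping double quotes while doubling single quotes is a conservative assumption about what the query text needs, not a documented rule.

```vbnet
Public Module IndexQueryBuilder

    ' Builds the Indexing Service query text for a phrase search.
    Public Function BuildQuery(ByVal searchText As String) As String
        ' Double quotes delimit the phrase, so drop them from the input.
        ' Single quotes delimit the CONTAINS argument, so double them.
        Dim safeText As String = searchText.Replace("""", "").Replace("'", "''")

        Return "SELECT DocTitle, Filename, Size, Path, URL, DocComments FROM SCOPE() " & _
               "WHERE CONTAINS(' """ & safeText & """ ') " & _
               "OR CONTAINS(DocComments, ' """ & safeText & """ ')"
    End Function

End Module
```

The returned string can then be passed to the OleDbDataAdapter in place of strQuery.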
 