Full text search of PDF without the text layer

Status
Not open for further replies.

xcerv

IS-IT--Management
Feb 18, 2009
15
CZ
I have a question about how to make simple PDFs without a text layer (or images) searchable in Content Server.
Our users store all sorts of documents in the system. Some of them are simple PDFs or images that cannot be converted to text during indexing. However, users would still like to search the contents of these documents, without any work on their side of course [smile]

Are there any possibilities for converting these documents to something with searchable content? Renditions?

Users also store emails from their Exchange mailboxes (mostly by dragging the email from the mailbox and storing it on the server through the Enterprise Connect client integration in Outlook), and the emails can also contain these documents. Can these be converted too? Or are these documents handled differently?

Thank you for any answers or ideas.
xcerv
 
Hello. When a PDF is not text searchable, how can a program search it?
Livelink uses commercial conversion engines (code) developed by Stellent, or its own, to do document conversion. Renditions are meant to turn electronic, already searchable documents into other forms, e.g. PDF, so that users can open them without Word. There, too, you would need to purchase something from Adlib if you wanted any conversion such as OCR.
In your Outlook-to-EC scenario, the email is turned into a standard Livelink object. If the MIME type of the object is allowed by your organization for FT indexing, it will get FT indexed. BTW, for all objects that are not FT indexed, Livelink will peel off any metadata it knows about, such as create time, author, some advanced MS properties, etc.
Typically, people dealing with content that is not FT searchable, such as TIFF, attach meaningful metadata either at scan time or by having a person fill out the metadata when adding the document. Livelink has an easy enough to use metadata search to then find the object.
If image content needs to be text searched, you either have to convert it before it enters Livelink or recondition it after it has entered Livelink. If you want a no-frills experience, I would convert the images before loading, using commercial or free software.
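The indexing decision described above can be sketched roughly as follows. This is only an illustration and not Livelink code: `FT_INDEXED_MIME_TYPES`, `StoredObject` and `index_plan` are hypothetical names, and the MIME types in the allowlist are just examples of what an organization might enable.

```python
from dataclasses import dataclass, field

# Hypothetical allowlist: MIME types the organization has enabled for FT indexing.
FT_INDEXED_MIME_TYPES = {"application/pdf", "application/msword", "text/plain"}


@dataclass
class StoredObject:
    name: str
    mime_type: str
    extracted_text: str = ""  # whatever text a conversion engine could pull out
    metadata: dict = field(default_factory=dict)  # e.g. create time, author


def index_plan(obj: StoredObject) -> str:
    """Return 'full-text' when the object can be FT indexed, else 'metadata-only'."""
    allowed = obj.mime_type in FT_INDEXED_MIME_TYPES
    has_text = bool(obj.extracted_text.strip())
    if allowed and has_text:
        return "full-text"
    # Image-only PDFs, TIFFs and the like can still be found through metadata search.
    return "metadata-only"
```

Under these assumptions, a scanned PDF with no extractable text falls into the metadata-only branch even though its MIME type is on the allowlist, which matches the situation in the original question.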

Livelink does not do anything special with MIME types. If an Outlook/msg MIME type is detected, it will send that when you request the object; if an Excel document is requested, it will send application/vnd.ms-excel (or the new, hugely unwriteable MIME type) when you try to open it. That is how all DMSs do it.

Livelink has a built-in "View as HTML" feature which, like Google Docs, tries to convert a document when you load it. It has been around for 20 years; nowadays I think it is no longer free from OT.

Well, if I called the wrong number, why did you answer the phone?
James Thurber, New Yorker cartoon caption, June 5, 1937
Certified OT Developer, Livelink ECM Champion 2008, Livelink ECM Champion 2010
 
Hello and thanks Appu,

You answered as I was expecting. The problem is that there are lots of documents (mostly PDFs) that need to be stored in Livelink but have not been OCRed, so they have no text layer. We get them mostly by email from outside the company. If we want them searchable in Livelink, users have to convert them somehow to the PDF/A type so they can be indexed and FT searchable. I was looking for some automated function that would do this for the users.
Renditions are not good for several reasons, mostly because they would not distinguish between the simple PDF and PDF/A types and would create a rendition for both, so the data would double. Or at least I think so; maybe someone can clarify this.
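One way to avoid treating every PDF the same would be a pre-load step that picks out only the PDFs lacking a text layer and sends just those to OCR. The sketch below is an assumption-laden illustration, not a Livelink feature: `pdfs_needing_ocr` is a hypothetical helper, `extract_text` is a caller-supplied function (for example a wrapper around whatever PDF library is available), and the `min_chars` threshold is an arbitrary guess.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def pdfs_needing_ocr(
    paths: Iterable[Path],
    extract_text: Callable[[Path], str],  # hypothetical: returns a PDF's embedded text
    min_chars: int = 20,                  # arbitrary cut-off for "has a text layer"
) -> List[Path]:
    """Return only the PDFs whose embedded text is (almost) empty."""
    todo = []
    for path in paths:
        if path.suffix.lower() != ".pdf":
            continue  # other file types are out of scope for this sketch
        text = extract_text(path)
        if len(text.strip()) < min_chars:
            todo.append(path)  # likely scanned or image-only: candidate for OCR
    return todo
```

Only the files returned here would go through OCR before being loaded, so PDFs that are already searchable are stored once and unchanged.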

Regards
xcerv
 