Text From PDF

PCHomepage · Oct 27, 2013

I've searched here and other sites and found lots of functions that are supposed to be able to extract text from a PDF but haven't yet found one to suit my needs. Some extract gibberish while others extract nothing at all. The one that did the best job was a huge set of functions and classes that extracted only the first page's text and ignored the rest, although what it did give was reasonably accurate.

The one below is much simpler and seems to get it all, but it also returns much of the PDF's internal code. It also seems limited to "FlateDecode" text, while I need it to handle more types of PDF. Right now, though, just getting clean text from it is the main thing, but being able to keep only "real" words by checking them against a MySQL dictionary table would be even better! Any ideas?

Code:

function pdf2text($datastream) {
	// Accept either raw PDF data or the path to a PDF file
	if (strlen ($datastream) < 1000 && file_exists ($datastream)) $datastream = file_get_contents ($datastream); //get the data from file
	if (!trim ($datastream)) {
		echo "Error: there is no PDF data or file to process.";
		return '';
	}
	$result = '';

	if (preg_match_all ('/<<[^>]*FlateDecode[^>]*>>\s*stream(.+)endstream/Uis', $datastream, $m)) foreach ($m[1] as $chunk) {
		// Skip streams that cannot be inflated instead of passing false on to the regexes
		$chunk = @gzuncompress (ltrim ($chunk));
		if ($chunk === false) continue;
		//If there are [] in the data, then extract all stuff within (), or just extract () from the data directly
		$a = preg_match_all ('/\[([^\]]+)\]/', $chunk, $m2) ? $m2[1] : array ($chunk); //get all the stuff within []
		foreach ($a as $subchunk) if (preg_match_all ('/\(([^\)]+)\)/', $subchunk, $m3)) $result .= join ('', $m3[1]); //within ()
	}
	else echo "Error: there is no FlateDecode text in this PDF file to process.";
	return $result;
}
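
One way the decompression step could be made more forgiving is sketched below. This is only a sketch: inflatePdfStream is a made-up helper name, and the fallback to raw deflate is an assumption for streams whose zlib header is missing or damaged, not something the function above does. It returns null instead of failing when a stream cannot be inflated.

```php
<?php
// Hypothetical helper (not part of the function above): returns the
// inflated stream bytes, or null when the data cannot be decompressed.
function inflatePdfStream($raw)
{
    // The stream data starts after the line break that follows "stream"
    $raw = ltrim($raw, "\r\n");

    // FlateDecode data is normally zlib-wrapped, which gzuncompress() expects
    $out = @gzuncompress($raw);
    if ($out === false) {
        // Assumption: try raw deflate in case the zlib header is absent
        $out = @gzinflate($raw);
    }
    return $out === false ? null : $out;
}
```

A caller would skip any stream for which this returns null rather than feeding the result to the regular expressions.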

jpadie · Oct 27, 2013

Getting text from a PDF is not straightforward because it largely depends on how the PDF was created. In some cases the PDF is just an image of the page, even though it looks like text. For those you need OCR.

A number of years ago I had a similar requirement for a very complex app that sat on a wordpress backend. I have included the class below.

I used this class for automatic filing of emails. My code would grab all new emails every couple of minutes and first parse the body for keywords. If no match against known keywords was found, it would then extract all the text from attachments and do the same. If a match was found the email was extracted and made into a wordpress post (with the attachments as wordpress uploads to that post) and filed with a category equating to the matched keyword (or 'uncategorised' if no match).

The code relies on a number of unix tools which can be installed as follows

Code:

apt-get update
apt-get upgrade
apt-get install docx2txt catdoc poppler-utils

It also relies (for the OCR piece) on ABBYY's binary. I tried many different freeware libraries such as tesseract (and ocropus, gocr, etc.) and could not get anything like the reliability of ABBYY. See here:

http://www.ocr4linux.com/en:pricing.

Ocrfeeder looks like a possible freeware alternative that I did not test. Invoke it like this:

Code:

ocrfeeder-cli -i input1.jpg input2.jpg -f html -o output.htm

An alternative might be to use google docs. You would need programmatically to upload the file to gDrive. You'd need to set a cron job to retest the scanning process a minute or so after upload and retrieve the text. I've not looked but it can't be more than 20 mins work to rough up some code to test this.

There are also some other online APIs that can do the same. Not suitable for me because of security concerns, but possibly OK for other apps.

The code below is used like so (it can be used for doc, rtf, docx, txt, pdf etc. In fact it can be used with anything, but it will return a zero-length string when it fails).

Looking at the code again, I see that I never really finished the zip aspects: for non-xlsx/docx files it would be better to extract the files from the zip and then resubmit them to the class. I also notice that I never installed an xlsx2txt tool. You can find these online and then edit the class to point to the necessary binary.

Code:

$textExtractor = new TextExtractor( '/path/to/file.pdf' );
$text = $textExtractor->getOutput();
if($text === false || $text === ''):
  //nothing returned
else:
  //text is in $text.  nb some files will be comma delimited so you can't use spaces to search for word boundaries
endif;

Code:

<?php
class textExtractor{
	private $textOutput = '';
	private $inputFile;
	private $mime;
	public function __construct($inputFile){
		$this->inputFile = $inputFile;
		if (!is_file($this->inputFile)){
			$this->textOutput = false;
		} else {
			$this->extractText();
		}
	}
	
	public function getOutput(){
		return $this->textOutput;
	}
	
	private function extractText(){
		$this->getMimeType();
		
		switch ($this->mime){
			case 'application/vnd.openxmlformats-officedocument.wordprocessingml.document':
				$this->docx2txt();
			break;
			
			case 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet':
				$this->xlsx2txt();
			break;
			
			case 'application/x-zip':
				$info = pathinfo($this->inputFile);
				switch (strtolower($info['extension'])):
					case 'docx':
						$this->docx2txt();
						break;
					case 'xlsx':
						$this->xlsx2txt();
						break;
					default:
						$this->textOutput = '';
				endswitch;
			break;
			case 'application/ms-word':
				$this->catDoc();
			break;
			
			case 'image/tiff':
			case 'image/x-tiff':
			case 'image/png':
			case 'image/jpeg':
			case 'image/jpg':
				$this->ocr();
			break;
			case 'application/excel':
			case 'application/vnd.ms-excel':
			case 'application/x-excel':
			case 'application/x-msexcel':
				$this->xls2csv();
			break;
			case 'application/pdf':
				$this->pdftotext();
			break;
			case 'text/html':
			case 'message/rfc822\0117bit':
			case 'text/plain':
			default:
				$this->textOutput = $this->adjust(strip_tags(file_get_contents($this->inputFile), '<table><tr><th><td><br>'));
				$this->textOutput = preg_replace('/<br\/?>/ims',"\n",$this->textOutput);
		}
	}
	
	
	private function getMimeType(){
		$cmd = 'file --mime ';
		$arg = escapeshellarg($this->inputFile);
		exec ($cmd . $arg, $output, $return);
		list($fileName, $output) = explode(':', $output[0]);
		list($mime, $charset) = explode (';', $output);
		$this->mime = trim($mime);
	}
	
	private function docx2txt(){
		$cmd = 'perl /usr/bin/docx2txt.pl ' . escapeshellarg($this->inputFile) .' -';
		exec($cmd, $output);
		$this->textOutput = implode ("\n", $output);
	}
	
	private function xlsx2txt(){
		$this->textOutput = '';
	}
	
	private function catDoc(){
		$cmd = 'catdoc -w ' . escapeshellarg($this->inputFile);
		exec($cmd, $output);
		$this->textOutput = implode("\n", $output);
	}
	
	private function xls2csv(){
		$cmd = 'xls2csv -fY-m-d ' . escapeshellarg($this->inputFile);
		exec ($cmd, $output);
		$this->textOutput = implode("\n", $output);
	}
	
	private function ocr(){
		//need to use abbyy
		$cmd = 'abbyyocr -if ' . escapeshellarg($this->inputFile) .' -f Text -c ' . '2>&1';
		exec ($cmd, $output, $return);
		$this->textOutput = implode("\n", $output);
	}
	
	private function pdftotext(){
		$cmd = 'pdftotext -layout -enc UTF-8 ';
		$cmd = $cmd . escapeshellarg($this->inputFile) . ' - 2>&1' ;
		exec ($cmd, $output );
		$output = implode("\n", $output);
		if (strlen($output) > 10){
			$this->textOutput = $output;
		} else {
			$this->ocr();
		}
	}
	
	private function adjust($string){
		$string = str_replace("\r\n", "\n", $string);
		$string = str_replace("\r", "\n", $string);
		return $string;
	}
}
?>

To compare the extracted text against a MySQL table of keywords, I think I'd do this:

Code:

$textExtractor = new TextExtractor( '/path/to/file.pdf' );
$text = $textExtractor->getOutput();
if($text === false || $text === ''):
  //nothing returned
else:
  //text is in $text.  nb some files will be comma delimited so you can't use spaces to search for word boundaries
 $sql = <<<SQL
SELECT   keywordID, keyword
FROM     keywordtable
WHERE    ? LIKE concat('%', keyword, '%')
SQL;
 //$pdo is assumed to be an existing PDO connection
 $stmt = $pdo->prepare( $sql );
 $stmt->execute( array($text) );
 $results = $stmt->fetchAll( PDO::FETCH_OBJ );
 if(count($results) == 0):
  echo 'no match';
 else:
  echo "matches \n";
  print_r($results);
 endif;
endif;
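
If the keyword list is small enough to load into an array, the same comparison could be roughed out in plain PHP. This is a sketch only: matchKeywords is a hypothetical helper, and unlike the LIKE query above it matches whole words rather than substrings.

```php
<?php
// Hypothetical helper: returns the keywords that occur in $text as whole words.
function matchKeywords($text, array $keywords)
{
    $found = array();
    foreach ($keywords as $keyword) {
        // \b treats commas, tabs and line breaks as word boundaries, not
        // just spaces, so comma-delimited output is handled too
        $pattern = '/\b' . preg_quote($keyword, '/') . '\b/i';
        if (preg_match($pattern, $text) === 1) {
            $found[] = $keyword;
        }
    }
    return $found;
}
```

For example, calling it with the text "Cadillac,1955,show program" and the keywords cadillac, program and opera would return cadillac and program.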

PCHomepage · Oct 27, 2013

Thanks a bundle! I'll check it out in the morning as it's quite late (or early) here now. These PDFs are basically images but they have been OCRed in Acrobat and I'm simply trying to fetch some keywords although I am also looking for a way to OCR images for the same purpose. Even if it's not perfect, it doesn't matter much. Oddly, though, the little function I posted works on the OCRed PDFs but it crashes on a PDF that was created directly from InDesign where it has actual text in it.

PCHomepage · Oct 27, 2013

I forgot to say that this is on a hosted server so I cannot install anything.

jpadie · Oct 27, 2013

I don't really understand what you mean by OCR'd in Acrobat. Does Acrobat store the text in the image format of the PDF? That would be odd. More likely it stores the text in a metadata file. But I'm not an expert in PDF.

If you can't install stuff yourself you're on the wrong platform or with the wrong host. There are cloud providers that will now sell you a reasonably spec'd virtual box for 5 USD per month. Much better to use one of those.

If you cannot install any third-party tools yourself then I have no other solution for you. It is possible that there are PHP classes that can disassemble the PDF themselves, but from memory those that I trialled years ago were very inaccurate and returned the internal workings of the PDF as well as the actual text (if they returned anything at all).

I strongly recommend developing only for platforms over which you have proper control.

PCHomepage · Oct 27, 2013

OCRing a PDF? It's a common thing to do when scanning in hard copy. In this case, these are old show programs, reviews and the like from the '40s and '50s, so OCRing is the only way to get any extractable text into the PDF. Otherwise it just contains an image. PDFs are self-contained so yes, it stores the text as metadata, which is apparent when opening a PDF in Notepad or some other text editor. It is my understanding that there are differences between the PDF versions but what those differences are I'm not sure. However, since I'm the one creating the PDFs, they should all be the same version.

The little bit of code I posted above can retrieve what I need but it is also retrieving some of the non-text (probably the codes that lay out the PDF), and the words it does get are not separated but run together. Here is a partial extract, where the PDF of a show program from 1955 starts with an advertisement for Cadillac:

0-Cadillacisl11u\(;hIllorethananewmodel.Itisawhollyrestyledandcompletelyre\255engineeredmotorcar-71CWfromitsmoregrilletoisLROanceofouth\:tPa'hamana\(WhirlingOramaofMIhrFoarth\\255:holila.KoriNajala13.YMA'-MIl::Anelanrolklor1.CalliinilodIIndio\(Indian"aThehanlship'nndofIndianlife...,,...,.,.,,,AdobeUCSI.t\225AdobeUCSbNewJerse,3SWilliamSt.Newark,N.J.MArhl4-5151YouNeverHeardItSoGood!FOTAmazing,life\267likeReproductionofBroadcastandRecordedMusic.

Acrobat's ability to OCR is rudimentary at best and because the OCRing is done on old, yellowed material, it might be struggling and adding things like square brackets or other symbols rather than letters, and PDFs use some of these to separate the code, thus confusing the ability to extract only the text.
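
Some of the stray symbols in that extract, such as \( and \255, look like the backslash escapes that PDF allows inside a ( ) string rather than OCR noise. A pattern like /\(([^\)]+)\)/ also stops at an escaped \) and so cuts such strings short. A minimal sketch of decoding the escapes follows; decodePdfLiteral is a made-up name, and the result is raw bytes in whatever encoding the font uses (for instance \255 is byte 0xAD, a soft hyphen in Latin-1 style encodings).

```php
<?php
// Hypothetical helper: decodes backslash escapes found inside a PDF ( ) string.
function decodePdfLiteral($s)
{
    $map = array('n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\x08", 'f' => "\x0C");
    return preg_replace_callback(
        '/\\\\(?:([0-7]{1,3})|(\r\n|.))/s',
        function ($m) use ($map) {
            if ($m[1] !== '') {
                return chr(octdec($m[1]) & 0xFF);   // octal character code such as \255
            }
            $e = $m[2];
            if ($e === "\r\n" || $e === "\n" || $e === "\r") {
                return '';                          // backslash + line break joins two lines
            }
            // \n \r \t \b \f map to control characters; for \( \) \\ and
            // anything else the backslash is simply dropped
            return isset($map[$e]) ? $map[$e] : $e;
        },
        $s
    );
}
```

Running each extracted string through something like this before joining them should remove the backslash sequences, although it does nothing about genuine OCR mistakes.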

A basic PDF has a specific structure like the following, but I'm not totally sure whether an OCRed document stores the text in the same way as a PDF that was created from a document containing text, such as one from Word or InDesign, for example: (from Introduction to PDF)

Code:

%PDF-1.7

1 0 obj  % entry point
<<
  /Type /Catalog
  /Pages 2 0 R
>>
endobj

2 0 obj
<<
  /Type /Pages
  /MediaBox [ 0 0 200 200 ]
  /Count 1
  /Kids [ 3 0 R ]
>>
endobj

3 0 obj
<<
  /Type /Page
  /Parent 2 0 R
  /Resources <<
    /Font <<
      /F1 4 0 R 
    >>
  >>
  /Contents 5 0 R
>>
endobj

4 0 obj
<<
  /Type /Font
  /Subtype /Type1
  /BaseFont /Times-Roman
>>
endobj

5 0 obj  % page content
<<
  /Length 44
>>
stream
BT
70 50 TD
/F1 12 Tf
(Hello, world!) Tj
ET
endstream
endobj

xref
0 6
0000000000 65535 f 
0000000010 00000 n 
0000000079 00000 n 
0000000173 00000 n 
0000000301 00000 n 
0000000380 00000 n 
trailer
<<
  /Size 6
  /Root 1 0 R
>>
startxref
492
%%EOF
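
Going by the structure above, the visible text sits between BT and ET and is shown by the Tj operator (or TJ for an array of strings). A rough sketch of reading only those strings from one already-decompressed content stream is below. textFromContentStream is a hypothetical name, and the sketch assumes simple ( ) strings without nested parentheses or hex strings. It puts a space between separate show operators, which may help with words running together.

```php
<?php
// Hypothetical helper: collects the strings shown by Tj / TJ inside BT ... ET.
function textFromContentStream($content)
{
    $literal = '\((?:\\\\.|[^\\\\()])*\)';               // ( ... ) with backslash escapes
    $pattern = '/(' . $literal . ')\s*Tj'                // (string) Tj
             . '|\[((?:' . $literal . '|[^\]()])*)\]\s*TJ/s'; // [ (a) -120 (b) ] TJ
    $out = array();
    preg_match_all('/\bBT\b(.*?)\bET\b/s', $content, $blocks);
    foreach ($blocks[1] as $block) {
        $words = array();
        preg_match_all($pattern, $block, $ops, PREG_SET_ORDER);
        foreach ($ops as $op) {
            if ($op[1] !== '') {
                $words[] = substr($op[1], 1, -1);        // strip the outer ( )
                continue;
            }
            // TJ array: the pieces belong to one run of text, so join without spaces
            $piece = '';
            preg_match_all('/' . $literal . '/s', isset($op[2]) ? $op[2] : '', $parts);
            foreach ($parts[0] as $p) {
                $piece .= substr($p, 1, -1);
            }
            $words[] = $piece;
        }
        if ($words) {
            $out[] = implode(' ', $words);               // one line per text object
        }
    }
    return implode("\n", $out);
}
```

Fed the content stream of object 5 above, this should give back just the "Hello, world!" string. Backslash escapes are left as they are in the returned text.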

As far as installing third-party tools, I can install them onto my Web sites but not into the OS and I have never had need to do so. This project is part of an established site and I don't want to reinvent the wheel to be able to add this feature nor is there any reason to do so. PHP is fully capable of doing what I need even if there are third-party tools that might do it better.

As always, I appreciate your help. However, if you don't know the answer, maybe someone else who reads this forum does. I saw this question posted before but never a resolution and I find it impossible to believe that there isn't one!

jpadie · Oct 27, 2013

What you posted above is a PDF made of text.
What you are talking about is a PDF of images that Acrobat has OCR'd. I did not know that Acrobat stored the recognised text in metadata within the file itself.

If it is in the file you can get it. Post a link to an Acrobat'd file and I will have a look at how to unpack the file.

But bear in mind that installing Abbyy would cost 150 dollars and buying a virtual server from Digital Ocean would cost 5 dollars. So if you account for your time at all you have an available proven working solution for 155 USD. If I were pricing for my own project that equates to a half hour of my time. No-brainer from a cost-time-benefit scenario. And much safer and more streamlined than relying on Acrobat scans and (poor quality) OCR.

PCHomepage · Oct 27, 2013

Thank you. The PDF is not currently online, nor will it be available online until this issue has been solved. However, I've sent it to you directly by email (hopefully it's not too large), although keep in mind that it is just one example and I have many more to go. However, the process does not require Abbyy nor any other third-party application. It is purely a PHP issue and, yes, the metadata from the OCR process is stored internally in the PDF file. That's the whole point of a PDF: to be portable.

There may also be internal information about paper size and number of pages that would be good to capture but it's not all that important. I'm not sure if it was clear that the file is being processed during an upload into MySQL from the file's data stream.
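
For the page count and paper size, a rough sketch along these lines might be enough. pdfPageInfo is a made-up helper, and it only works when the page dictionaries are stored as plain text in the file, as in the sample structure earlier; PDFs that pack their objects into compressed object streams would return nothing useful.

```php
<?php
// Hypothetical helper: reads the page count and the first /MediaBox
// from PDF data whose page objects are stored uncompressed.
function pdfPageInfo($data)
{
    $info = array('pages' => 0, 'mediaBox' => null);

    // Count "/Type /Page" entries but not "/Type /Pages" (page-tree nodes)
    $info['pages'] = (int) preg_match_all('/\/Type\s*\/Page(?![A-Za-z])/', $data, $m);

    // MediaBox is [left bottom right top] in points (1/72 inch by default)
    $box = '/\/MediaBox\s*\[\s*([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)\s*\]/';
    if (preg_match($box, $data, $m)) {
        $info['mediaBox'] = array_map('floatval', array_slice($m, 1));
    }
    return $info;
}
```

Since the file arrives as a data stream during upload, the same string that is passed to the text extraction could be passed here before it is written to MySQL.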

jpadie · Oct 27, 2013

nothing received.
