PCHomepage
Programmer
I've searched here and on other sites and found lots of functions that are supposed to extract text from a PDF, but I haven't yet found one that suits my needs. Some extract gibberish while others extract nothing at all. The one that did the best job was a huge set of functions and classes that extracted only the first page's text and ignored the rest, although what it did give was reasonably accurate.
The one below is much simpler and seems to get all of the text, but it also returns much of the PDF's internal code. It also seems limited to "FlateDecode" text, while I need it to be more versatile in the types of PDFs it can read. Right now, though, just getting clear text out of it is the main thing, but being able to keep only "real" words found in a MySQL dictionary table would be even better! Any ideas?
Code:
function pdf2text($datastream) {
    // A short string that names an existing file is treated as a path; load the file's contents
    if (strlen($datastream) < 1000 && file_exists($datastream)) $datastream = file_get_contents($datastream);
    if (!trim($datastream)) {
        echo "Error: there is no PDF data or file to process.";
        return '';
    }
    // Find every stream whose dictionary declares the FlateDecode filter
    if (!preg_match_all('/<<[^>]*FlateDecode[^>]*>>\s*stream(.+)endstream/Uis', $datastream, $m)) {
        echo "Error: there is no FlateDecode text in this PDF file to process.";
        return '';
    }
    $result = '';
    foreach ($m[1] as $chunk) {
        $chunk = @gzuncompress(ltrim($chunk)); // returns false if the stream is not valid zlib data
        if ($chunk === false) continue; // skip streams that cannot be inflated
        // If there are [] in the data, extract the () strings inside them; otherwise extract () from the data directly
        $a = preg_match_all('/\[([^\]]+)\]/', $chunk, $m2) ? $m2[1] : array($chunk); // everything within []
        foreach ($a as $subchunk) {
            if (preg_match_all('/\(([^\)]+)\)/', $subchunk, $m3)) $result .= join('', $m3[1]); // everything within ()
        }
    }
    return $result;
}
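
As a rough sketch of the dictionary idea, the filtering step might look something like the following. The helper name keep_real_words is hypothetical, and a plain PHP array stands in for the MySQL dictionary table, since the table and column names aren't given here. In practice the word list would presumably be loaded from that table first. This only illustrates the filtering and is not a tested solution for real PDFs.

```php
<?php
// Hypothetical helper: keep only the tokens that appear in a known word list.
// $dictionary is an assumption standing in for words loaded from the MySQL
// dictionary table; a plain array is used so the example runs on its own.
function keep_real_words(string $text, array $dictionary): string {
    // Build a lowercase lookup set so each membership check is a hash lookup
    $lookup = array_fill_keys(array_map('strtolower', $dictionary), true);
    // Split on anything that is not a letter or an apostrophe
    $tokens = preg_split("/[^A-Za-z']+/", $text, -1, PREG_SPLIT_NO_EMPTY);
    $kept = array();
    foreach ($tokens as $token) {
        // Compare case-insensitively but keep the token's original casing
        if (isset($lookup[strtolower($token)])) $kept[] = $token;
    }
    return implode(' ', $kept);
}

// Example: PDF operators such as BT, Tf and ET are dropped, real words are kept
$words = array('the', 'quick', 'brown', 'fox');
echo keep_real_words('The quick BT /F1 12 Tf brown fox ET', $words);
// prints "The quick brown fox"
```

The output of pdf2text() could be passed straight in as $text. One caveat with this approach is that any legitimate word missing from the dictionary (names, numbers, abbreviations) would be dropped as well.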