Extracting Text from a PDF

bdichiara · Feb 14, 2007

I found a function on PHP.net that extracts the text from a PDF, but it also returns a whole mess of strange characters, font copyright notices, and references to VeriSign, Microsoft, Arial fonts, etc.

Here's the function:

Code:

function pdf2string($sourcefile)
{
   # Read the whole PDF as binary data
   $content = @file_get_contents($sourcefile);
   if( $content === false ) { return -1; }

   # Locate all text hidden within the stream and endstream tags
   $searchstart = 'stream';
   $searchend = 'endstream';
   $pdfdocument = "";

   $startpos = 0;
   # Iterate through each stream block
   while( ($pos = strpos($content, $searchstart, $startpos)) !== false )
   {
     # Find the endstream tag that closes this block
     $datastart = $pos + strlen($searchstart);
     $pos2 = strpos($content, $searchend, $datastart);
     if( $pos2 === false ) { break; }

     # Extract the compressed bytes between the tags, dropping the line break after "stream"
     $textsection = ltrim(substr($content, $datastart, $pos2 - $datastart), "\r\n");
     $data = @gzuncompress($textsection);
     if( $data === false )
     {
         # Retry without the line break that usually precedes "endstream"
         $data = @gzuncompress(rtrim($textsection, "\r\n"));
     }

     # Streams that are not zlib-compressed fail to uncompress; skip them
     if( $data !== false )
     {
         # Clean up text via a special function
         $pdfdocument .= ExtractText($data);
     }

     # Move our PDF pointer past the section we just read
     $startpos = $pos2 + strlen($searchend);
   }

   return $pdfdocument;
}

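function ExtractText($postScriptData)
{
   $plainText = '';
   $textStart = 0;
   $length = strlen($postScriptData);

   # Walk through every "(...)" string in the uncompressed stream
   while( $textStart < $length && ($textStart = strpos($postScriptData, '(', $textStart)) !== false )
   {
     # Find the closing parenthesis, skipping any that are escaped as "\)"
     $textEnd = strpos($postScriptData, ')', $textStart + 1);
     while( $textEnd !== false && $postScriptData[$textEnd - 1] == '\\' )
     {
         $textEnd = strpos($postScriptData, ')', $textEnd + 1);
     }
     if( $textEnd === false ) { break; }

     $plainText .= substr($postScriptData, $textStart + 1, $textEnd - $textStart - 1);
     if( substr($postScriptData, $textEnd + 1, 1) == ']' ) // This adds quite some additional spaces between the words
     {
         $plainText .= ' ';
     }

     # Continue searching after the string we just read
     $textStart = $textEnd + 1;
   }

   return stripslashes($plainText);
}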

Is there a better way to do this, or a way to clear out all the garbage and leave JUST the text? I'm trying to put the text from PDF files into my MySQL database so that it's searchable.

_______________
_brian.
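
One possible way to cut down the garbage, sketched here under assumptions: the function above runs ExtractText() on every stream that uncompresses, including embedded fonts and metadata, which is a likely source of the font copyright notices. In a PDF content stream, text is drawn inside text objects that begin with the BT operator and end with ET, so a stream with no BT ... ET pair is probably not page text. The helper name looksLikeTextStream() is hypothetical and is not part of the PHP.net function, and the check is only a heuristic: binary data can contain those letters by chance.

```php
<?php
// Hypothetical helper (an assumption, not part of the original function):
// guess whether an uncompressed stream is page content that draws text.
function looksLikeTextStream($data)
{
    if (!is_string($data) || $data === '') {
        return false;
    }
    // A text object starts with the BT operator...
    if (preg_match('/\bBT\b/', $data, $match, PREG_OFFSET_CAPTURE) !== 1) {
        return false;
    }
    // ...and must be closed by an ET operator somewhere after it.
    $afterBT = $match[0][1] + 2;
    return preg_match('/\bET\b/', $data, $unused, 0, $afterBT) === 1;
}
```

Inside pdf2string() this could guard the call, for example by appending ExtractText($data) only when looksLikeTextStream($data) returns true.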

bdichiara · Feb 14, 2007

Maybe even a way to return only standard keyboard characters such as letters, numbers and standard symbols.

_______________
_brian.
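
A minimal sketch of that idea, assuming "standard keyboard characters" means printable ASCII (space through tilde) plus tabs and line breaks. The function name keepKeyboardCharacters() is hypothetical. It replaces everything else with a space and then collapses the whitespace that is left over.

```php
<?php
// Hypothetical helper (an assumption, not from the thread): keep only
// printable ASCII characters and whitespace.
function keepKeyboardCharacters($text)
{
    // Replace each run of bytes outside space..tilde, tab, CR, LF with one space.
    $clean = preg_replace('/[^\x20-\x7E\t\r\n]+/', ' ', $text);
    // Collapse the runs of whitespace left behind and trim the ends.
    return trim(preg_replace('/\s+/', ' ', $clean));
}
```

It could be applied to the return value of pdf2string() before the text is stored in MySQL. Note that this also drops accented and other non-ASCII letters, and it will not remove garbage that is itself made of ordinary characters, such as font copyright notices.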
