Search Directory within files!

Status
Not open for further replies.

FOR111

Programmer
Sep 29, 2005
103
MT
Hi all,

I have a problem to solve, and I'm hoping you guys can help me out!

I run an intranet, and one of its main areas is the Document Manager. This application asks the users (the internal development team) to upload a file version and enter certain information (such as name, version, free text etc.) to go with it. The file is uploaded to a particular directory.

Various types of files are uploaded, mainly .doc, .xls, .pdf and .txt.

I was also asked to create two search functions. The first is a simple search where the user enters one or more keywords and the function looks for them in the Free Text section. That was simple enough, since I only have to search the database.

The second one is where the problem lies. I was asked to make a search where users enter a keyword or a set of keywords and the function goes through the directory and searches WITHIN the files.

For this I used the Linux grep command. The problem was that it only returned matches in the .doc and .txt files!

How, or with what, can I search all the files, especially the PDF files, and write the matching filenames to a simple file?


Thanks for your help
For
 
You should be able to grep in PDFs, although the results can be odd at times. You'll have real trouble, though, if they were created on the cheap as embedded images: more chance of finding Yoda, you have.
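
One likely reason for the odd results: many PDF generators compress the page content (Flate, i.e. zlib), so the words are not stored as plain bytes for grep to match. The snippet below is only an illustration of that effect, not a real PDF; the sample text is made up, and it assumes PHP's zlib extension is available.

```php
<?php
// Illustration only: a made-up string compressed the same way a
// Flate-encoded PDF content stream would be (zlib format).
$pageText   = 'Sales report: the sales report covers the intranet team. '
            . 'The sales report is updated weekly.';
$compressed = gzcompress($pageText);

// A raw byte search, which is what grep does, on the compressed bytes.
$foundRaw = (strpos($compressed, 'sales report') !== false);

// The same search after decompressing the stream.
$foundInflated = (strpos(gzuncompress($compressed), 'sales report') !== false);

var_dump($foundRaw, $foundInflated);
```

PDFs whose streams are stored uncompressed, or that use plain text operators, can still match directly, which would explain why grep works on some files and not on others.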

______________________________________________________________________
There's no present like the time, they say. - Henry's Cat.
 
Thanks KarveR,

Though I would like a more elaborate solution, if it's possible!

Thanks once again
Nick
 
This I got from the PHP website:
(see the user comments)

I don't know if it works, but maybe it will help you.

Code:
<?php
function pdf2string($sourcefile) {

   $fp = fopen($sourcefile, 'rb');
   $content = fread($fp, filesize($sourcefile));
   fclose($fp);

   $searchstart = 'stream';
   $searchend = 'endstream';
   $pdfText = '';
   $pos = 0;
   $pos2 = 0;
   $startpos = 0;

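   while ($pos !== false && $pos2 !== false) {

       $pos = strpos($content, $searchstart, $startpos);
       $pos2 = ($pos === false) ? false : strpos($content, $searchend, $pos + 1);

       if ($pos !== false && $pos2 !== false) {

           // The data starts after the "stream" keyword and its line ending
           // (CRLF or LF). Compare against characters, not integers.
           $dataStart = $pos + strlen($searchstart);
           if (substr($content, $dataStart, 2) === "\r\n") {
               $dataStart += 2;
           } else if ($content[$dataStart] === "\n") {
               $dataStart++;
           }

           // The data ends before the line ending that precedes "endstream".
           $dataEnd = $pos2;
           if (substr($content, $dataEnd - 2, 2) === "\r\n") {
               $dataEnd -= 2;
           } else if ($content[$dataEnd - 1] === "\n") {
               $dataEnd--;
           }

           $textsection = substr($content, $dataStart, $dataEnd - $dataStart);
           // Returns false for streams that are not zlib-compressed;
           // pdfExtractText() then contributes an empty string.
           $data = @gzuncompress($textsection);
           $pdfText .= pdfExtractText($data);
           $startpos = $pos2 + strlen($searchend);

       }
   }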

   return preg_replace('/(\s)+/', ' ', $pdfText);

}

function pdfExtractText($psData){

   if (!is_string($psData)) {
       return '';
   }

   $text = '';

   // Handle brackets in the text stream that could be mistaken for
   // the end of a text field. I'm sure you can do this as part of the 
   // regular expression, but my skills aren't good enough yet.
   $psData = str_replace('\)', '##ENDBRACKET##', $psData);
   $psData = str_replace('\]', '##ENDSBRACKET##', $psData);

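   // Match against $psData (the function's parameter); the original
   // referenced an undefined variable here.
   preg_match_all(
       '/(T[wdcm*])[\s]*(\[([^\]]*)\]|\(([^\)]*)\))[\s]*Tj/si', 
       $psData, 
       $matches
   );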
   for ($i = 0; $i < sizeof($matches[0]); $i++) {
       if ($matches[3][$i] != '') {
           // Run another match over the contents.
           preg_match_all('/\(([^)]*)\)/si', $matches[3][$i], $subMatches);
           foreach ($subMatches[1] as $subMatch) {
               $text .= $subMatch;
           }
       } else if ($matches[4][$i] != '') {
           $text .= ($matches[1][$i] == 'Tc' ? ' ' : '') . $matches[4][$i];
       }
   }

   // Translate special characters and put back brackets.
   $trans = array(
       '...'                => '&hellip;',
       '\205'                => '&hellip;',
       '\221'                => chr(145),
       '\222'                => chr(146),
       '\223'                => chr(147),
       '\224'                => chr(148),
       '\226'                => '-',
       '\267'                => '&bull;',
       '\('                => '(',
       '\['                => '[',
       '##ENDBRACKET##'    => ')',
       '##ENDSBRACKET##'    => ']',
       chr(133)            => '-',
       chr(141)            => chr(147),
       chr(142)            => chr(148),
       chr(143)            => chr(145),
       chr(144)            => chr(146),
   );
   $text = strtr($text, $trans);

   return $text;
}
 
Thanks Recor,

I already took a look at this! The thing is that, since I have a directory full of files with different extensions, I just wanted one function which takes care of the search. This could also be a Linux bash function... it doesn't make a difference!

Thanks very much for your time
Nick
 
Ok,

Why don't you make a function like this (I'm just writing it as an outline):

Code:
function searchInFiles($directory)
{
  $searchResults = array();

  // Put all files in $directory in an array
  $fileArray = getFilesInArray($directory);

  // Walk through the array and pick a search function by extension
  for ($i = 0; $i < count($fileArray); $i++)
  {
   $ext = getExtensionFromFile($fileArray[$i]);

   // Use == to compare; a single = would assign and always be true
   if ($ext == 'doc')
   {
     $searchResults[] = searchInWordDocument($fileArray[$i]);
   }

   if ($ext == 'pdf')
   {
     $searchResults[] = searchInPDFDocument($fileArray[$i]);
   }

   // etc. for the other extensions
  }
  return $searchResults;
}

Of course you should also define the functions getExtensionFromFile, getFilesInArray, searchInPDFDocument, searchInWordDocument etc.
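
As a rough sketch, two of those helpers could look something like this. The function names are the ones used above, but the bodies are only one possible implementation; they assume the files sit directly in one directory (no subfolders) and that the extension is a reliable indicator of the file type.

```php
<?php
// Hypothetical implementations of two helpers named in the outline above.

// Return the lower-cased extension of a file name ("" if there is none).
function getExtensionFromFile($filename)
{
    return strtolower(pathinfo($filename, PATHINFO_EXTENSION));
}

// Return the paths of all regular files directly inside $directory.
function getFilesInArray($directory)
{
    $files = array();
    if (!is_dir($directory)) {
        return $files; // missing or invalid directory: nothing to search
    }
    $entries = scandir($directory);
    if ($entries === false) {
        return $files; // directory could not be read
    }
    foreach ($entries as $entry) {
        $path = rtrim($directory, '/') . '/' . $entry;
        // is_file() skips ".", ".." and any subdirectories
        if (is_file($path)) {
            $files[] = $path;
        }
    }
    return $files;
}
```

Lower-casing the extension means uploads named like "Spec.PDF" still end up in the pdf branch of searchInFiles.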
 
hmmm,

You've got a point! OK, I can give it a go, and I'll let you know how it went. Thanks again

Nick
 