perl code running out of 2 GB VM limit

Status
Not open for further replies.

MoshiachNow

IS-IT--Management
Feb 6, 2002
1,851
IL
Hi,

I run my Perl code, compiled to an .exe, on an XP machine.
The code basically parses PDF files and extracts info.
(The sub runs as a Perl thread: "use threads".)
I undef all values at the start of every new file; however, the program crashes after about 2 hours, when it reaches 2 GB of virtual memory (it keeps accumulating VM in spite of my undefs).
Is there any workaround for this problem?

The code is below.
=======================================================
sub processPDF {
$file=shift;
undef $Transparency;
#return if ($file !~ /.*\.pdf$/i || $file =~ /MAP-CA Datasheet|Xerox Calendar_Print|HP_User_Guid|41178_PUBL01|zeev\\datasheet|ARCHMOV|ARCA|TRVL_|\#|AAA|Castle|705P01094_edoc|Jennifer Blauvelt|winback|Business_Cards_Print_Allops/);
return if ($file !~ /.*\.pdf$/i);
$filename = $file;
my $filename3 = $file;
($path,$filename) = $filename3 =~ m/(.*\\)(.*)/;
my $filenameshort = $filename;
$filenameshort =~ m!(.*?\.)pdf!;
$filenameshort = $1;

#sleep 1;
undef $version;
undef $NOofPages;
undef $MediaBox;
undef $counter9;
undef $length;
undef $producer;
undef $creator;
undef $pdf;

$counter=0;
#print STDOUT "version=$version,NOofPages=$NOofPages,width=$width,height=$height\nfile=$file\n";

#Redirect STDERR to file
open STDERR, ">$TEMP\\k" or warn "Can't redirect STDERR: $!";
#print STDOUT "HERE01\n";

$pdf=PDF->new($file);
#print STDOUT "HERE1\n";
#print STDOUT "=$file=\n";
$NOofPages = $pdf->Pages;
$creator=$pdf->GetInfo("Creator");
$producer=$pdf->GetInfo("Producer");
$creator = "Unknown" if ($creator =~ /[\x00-\x09\x0B\x0C\x0E-\x1F\x7F-\xFF]/ || ! $creator );
$producer = "Unknown" if ($producer =~ /[\x00-\x09\x0B\x0C\x0E-\x1F\x7F-\xFF]/ || ! $producer );
$version=$pdf->Version;
#print STDOUT "creator=$creator,producer=$producer,NOofPages=$NOofPages\n";
close STDERR;

#my $pdf = CAM::pDF->new("$file") ; #establish No of pages
#$NOofPages = $pdf->numPages() ;
#print STDOUT "NOofPages=$NOofPages\n";
#sleep 1;
#print STDOUT "version=$version,NOofPages=$NOofPages,width=$width,height=$height\nfile=$file\n";

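open(FILE, $file) or die "Can't open `$filename': $!";
binmode FILE; #PDFs are binary: without this, Windows text mode stops at Ctrl-Z and translates CRLF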
#%PDF-1.3
#/Count 234
#/N 345
#print STDOUT "HERE1\n";
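#read() returns fixed-size chunks and ignores $/, so no record separator or chomp is needed
$lastchunk = ''; #do not carry the last chunk of the previous file into this one
while (read FILE, $buffer, 2048) {
$line = $lastchunk . $buffer; #prepend the previous chunk so a match spanning a chunk boundary is still seen
$lastchunk = $buffer;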
$line =~ tr/\0-\037\177-\377/\//d;
#next if ($line =~ /\0-\037\177-\377/);
#print STDOUT "line=$line\n";
#return if ($line =~ /GTS_PPMLVDXVersion/); #it's actualy a VDX file ...

#print STDOUT "===creator=$creator\n";
if ($line =~ m!/Creator! && ($creator =~ /Unknown/ || $creator !~ /\w/)) { #establish creator if CAM::PDF did not; parentheses needed because && binds tighter than ||
#print STDOUT "HERE7\n";
#<</CreationDate(D:20110421165627+02'00')/Author(Istituto Poligrafico e Zecca dello Stato)/Creator(IPZS)/Producer(SecurePaper WebService 3-Heights\(TM\) PDF Producer 1.7.4.1 \( #<</ModDate(D:20101206135406-05'00')/CreationDate(D:20101206132902-05'00')/Title(Uncharted Cours)/Creator(QuarkXPress\(R\) 7.5)/Producer(QuarkXPress\(R\) 7.5)>>
#print STDOUT "line=$line=\n";
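($creator) = $line =~ m!/Creator\s*\((.*?)\)!; #list assignment avoids picking up a stale $1 when the match fails
$creator = "Unknown" unless defined $creator;
if ($producer =~ /Unknown/ || $producer !~ /\w/) {
($producer) = $line =~ m!/Producer\s*\((.*?)\)!;
$producer = "Unknown" unless defined $producer;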
#chomp $NOofPages;#chop $NOofPages;
}
} elsif ($line =~ m!CreatorTool! && $creator =~ /Unknown/) {
#<xmp:CreatorTool>IPZS</xmp:CreatorTool>
$creator =$line ;
$creator =~ m!CreatorTool>\s*(.*?)<.*!;
$creator =$1;
}

#print STDOUT "line=$line=\n";
#Creator="Exstream Dialogue CodeName Utopia.011p">
#print STDOUT "NOofPages=$NOofPages\n" unless ! $NOofPages;
if ($line =~ m!/Type\s*/Pages\s*/Count\s*\d+\s*/Kids\[[\w\s]*]!) { #establish number of pages if CAM::pDF did not
#if ($line =~ m!\/MediaBox!) { #establish number of pages if CAM::pDF did not
# 32384 0 obj << /Type /Pages /Count 20002 /Kids[5390 0 R 32385 0 R 59394 0 R] /MediaBox[0 0 596 842] /Resources 1 0 R >> endobj
#5942 0 obj << /Type /Pages /Count 2044 /Kids[1079 0 R 5943 0 R] /MediaBox[0 0 596 842] /Resources 5 0 R >> endobj
#<< /Type /Pages /Count 20002 /Kids[5537 0 R 32532 0 R 59541 0 R] /MediaBox
#rint STDOUT "HERE1\n";
#print STDOUT "line=$line=\n";
#sleep 30;
($NOofPages) = $line =~ m!/Type\s*/Pages\s*/Count\s*(\d+)\s*/Kids\[[\w\s]*]!; #same pattern as the test above, so the capture cannot fail and leave a stale $1
#rint STDOUT "NOofPages=$NOofPages\n";
#chomp $NOofPages;#chop $NOofPages;
} elsif ($line =~ m!<< /Type\s*/Pages\s*/Count\s+\d+\s*>>! && ! $NOofPages) { #establish number of pages if CAM::pDF did not
#<< /Type /Pages /Kids [ 1 0 R 54 0 R 100 0 R 145 0 RR 4195 0 R 4240 0 R 4285 0 R ] /Count 96 >>
#rint STDOUT "HERE2\n";
#print STDOUT "line=$line=\n";
$NOofPages=$line ;
$NOofPages =~ m!<< /Type /Pages.*/Count\s+(\d+)\s*>>!;
$NOofPages=$1;
#print STDOUT "NOofPages=$NOofPages\n";
} elsif ($line =~ m!<<\s*/Count\s*\d+\s*/Type/Pages/! && ! $NOofPages) { #establish number of pages if CAM::pDF did not
#<</Count 64/Type/Pages/Kids[643 0 R 644 0 R 645 0 R 646 0 R 647 0 R 648 0 R 649 0 R]>>
#print STDOUT "HERE3\n";
#print STDOUT "line=$line=\n";
$NOofPages=$line ;
$NOofPages =~ m!<<\s*/Count\s*(\d+)\s*/Type/Pages/!;
$NOofPages=$1;
#print STDOUT "NOofPages=$NOofPages\n";
#chomp $NOofPages;#chop $NOofPages;
#} elsif ($line =~ m!<<.(/Count\s+|/N\s+)\d+!) { #establish number of pages if CAM::pDF did not
#/Count 568
#print STDOUT "HERE2\n";
#print STDOUT "line=$line=\n";
#$NOofPages=$line ;
#$NOofPages =~ m!<<.(/Count\s+|/N\s+)(\d+)!;
#$NOofPages=$2;
#chomp $NOofPages;#chop $NOofPages;
} elsif ($line =~ m!<<\s*/Count\s*\d+\s*/Kids\[.*?\]/Type/Pages>>! && ! $NOofPages) { #establish number of pages if CAM::pDF did not
#<</Count 64/Type/Pages/Kids[643 0 R 644 0 R 645 0 R 646 0 R 647 0 R 648 0 R 649 0 R]>>
#print STDOUT "HERE3\n";
#print STDOUT "line=$line=\n";
$NOofPages=$line ;
$NOofPages =~ m!<<\s*/Count\s*(\d+)\s*/Kids!;
$NOofPages=$1;
} elsif ($line =~ m!egGr:nrpages[=>]'*\d+['<]! && ! $NOofPages) { #establish number of pages if CAM::pDF did not
#print STDOUT "HERE4\n";
# xmlns:egInk=' egGr:nrpages='66'
# <egGr:nrpages>96</egGr:nrpages>
#print STDOUT "line=$line=\n";
$NOofPages=$line ;
$NOofPages =~ m!egGr:nrpages[=>]'*(\d+)['<]!;
$NOofPages=$1;
#print STDOUT "NOofPages=$NOofPages\n";
#chomp $NOofPages;#chop $NOofPages;
} elsif ($line =~ m!<<.*/Producer! && $producer =~ /Unknown/) { #establish Producer if CAM::pDF did not
# <</Producer(iText 2.1.4 \(by lowagie.com\))/ModDate(D:20110502202242+02'00')/CreationDate(D:20110502202240+02'00')>>
# 79079 0 obj<</Producer(iTextSharp 4.1.2 \(based on iText 2.1.2u\))/ModDate(D:20110510141924-07'00')/CreationDate(D:20110510141924-07'00')>>
#<</CreationDate(D:20110421165627+02'00')/Author(Istituto Poligrafico e Zecca dello Stato)/Creator(IPZS)/Producer(SecurePaper WebService 3-Heights\(TM\) PDF Producer 1.7.4.1 \( #print STDOUT "line=$line=\n";
#print STDOUT "HERE5\n";
$producer=$line ;
$producer =~ m!<<.*/Producer\s*\((.*?)\).*!;
$producer=$1;
#chomp $NOofPages;#chop $NOofPages;
} elsif ($line =~ m!pdf:producer! && $producer =~ /Unknown/) { #establish Producer if CAM::pDF did not
#print STDOUT "HERE6\n";
# <pdf:producer>SecurePaper WebService 3-Heights(TM) PDF Producer 1.7.4.1 ( #/Producer (Foxit Phantom Printer Version 2.0.1.0823)
#print STDOUT "line=$line=\n";
($producer) = $line =~ m!pdf:producer\s*[(>](.*?)[)<]!; #a | inside a character class is a literal, so it is dropped
$producer = "Unknown" unless defined $producer;
#chomp $NOofPages;#chop $NOofPages;
}

#/MediaBox [ 0 0 667 491 ]
#print STDOUT "HERE3,MediaBox=$MediaBox=\n"; sleep 1;
#print STDOUT "line=$line=\n";
if ($line =~ m/\/MediaBox/ && ! $MediaBox) { #establish BoundingBox
#print STDOUT "line=$line=\n";
$MediaBox=$line ;
($width,$height) = $MediaBox =~ m!\/MediaBox\s*\[\s*[-]*\S+\s*[-]*\S+\s*(\S+)\s*(\S+)\s*\]!i;
$width = $width * 35277 / 100000;
$width =~ s/(\d+\.\d).*/$1/;
$height = $height * 35277 / 100000;
$height =~ s/(\d+\.\d).*/$1/;
#print STDOUT "width=$width,height=$height\n";
}

if ($line =~ m/Transparency/) { #establish Transparency
$Transparency="YES";
}


#print STDOUT "version=$version,NOofPages=$NOofPages,MediaBox=$MediaBox=\n";
#last if ($version && $NOofPages && $MediaBox && $producer !~ /Unknown/ && $creator !~ /Unknown/ && $producer && $creator);
}

$/ = "\n"; #set INPUT_RECORD_SEPARATOR
close FILE;
#print STDOUT "version=$version,NOofPages=$NOofPages,width=$width,height=$height\nfile=$file\n";
#sleep 1;
$producer =~ s/[\\\/\(\)\[\]\@\{\}-]/ /g;
$creator =~ s/[\\\/\(\)\[\]\@\{\}-]/ /g;

print STDOUT "version=$version,creator=$creator,producer=$producer\nNOofPages=$NOofPages,Transparency=$Transparency,width=$width,height=$height\n";
#print STDOUT "minimum=$minimum\n";
return if ($NOofPages && $NOofPages < $minimum); #$pages is never set in this sub; the page count is in $NOofPages

#sleep 1;
#return unless $NOofPages;

#$filename =~ s!\$!!; #remove $ from path
#print STDOUT "NEwFilename=$filename=\n";

unless ($remove && $counter != 0) {
#the $move and non-$move cases print the same log line
print "$filename\t$version\t$creator\t$producer\t$NOofPages\t$Transparency\t$width\t$height\t$path";
print "\tfile:///$filenamexcel\t" unless ($counter == 0);
print "\n" ; #print to log;
}
}

Long live king Moshiach !
 
There are memory leaks related to PAR compiling and hashes/hash refs (I can't find where I read that, but I had the same problem). I ended up writing a wrapper program that calls my code (per PDF, kind of doing the same thing), so all the memory gets released back to the system when each run exits, since the wrapper itself does almost nothing. I still limited it to about 100 PDFs per run; then it dies and is re-run via Task Scheduler.
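
A rough sketch of that kind of wrapper, in case it helps. This is not my actual code: the worker script name process_pdf.pl, the batches and run_in_batches helpers, and taking the file list from the command line are all assumptions for illustration. The idea is only that each batch runs in its own process, so whatever the worker leaks goes back to the OS when that process exits.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a list of files into batches of at most $size entries.
sub batches {
    my ($files, $size) = @_;
    die "batch size must be positive\n" unless $size > 0;
    my @queue = @$files;
    my @out;
    push @out, [ splice(@queue, 0, $size) ] while @queue;
    return @out;
}

# Run the worker command once per batch, each time as a separate process.
# Memory used by a worker is returned to the OS when that process exits.
# Returns the number of batches whose worker exited with a non-zero status.
sub run_in_batches {
    my ($cmd, $files, $size) = @_;
    my $failed = 0;
    for my $batch (batches($files, $size)) {
        my $status = system(@$cmd, @$batch);
        if ($status != 0) {
            warn "worker failed (status $status) on batch starting with $batch->[0]\n";
            $failed++;
        }
    }
    return $failed;
}

# Hypothetical usage: process_pdf.pl is an assumed worker script that takes
# PDF paths as arguments and handles them one by one (e.g. via processPDF).
if (@ARGV) {
    my @pdfs = grep { /\.pdf$/i } @ARGV;
    my $failed = run_in_batches([ $^X, 'process_pdf.pl' ], \@pdfs, 100);
    exit($failed ? 1 : 0);
}
```

The batch size of 100 just mirrors the limit I mentioned; a smaller batch trades more process start-ups for a lower memory peak.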

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 