How do I get a PDF file's page size and page count programmatically?

Status
Not open for further replies.

rsevero

Programmer
Aug 25, 2001
5
BR
I need to get the page size of a PDF file programmatically, with Acrobat or Ghostscript. How can I get it?

And its page count?

I am running FreeBSD.

TIA,

Rodrigo Severo
 
PDFs are organized as a hierarchy of objects. To get the number of pages, you want the root page tree dictionary, which the document catalog points to through its /Pages entry.

Just open the file in a good text editor, and search for the /Pages entry. That will give you the object and generation numbers of that dictionary, for example "19 0".

That tells you to search for "19 0 obj", which is an object definition followed by a dictionary. That dictionary will contain the /Count entry, which is your page count.
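A rough Python sketch of that two-step lookup, for anyone who wants it automated. It assumes the catalog and page tree objects are stored as plain, uncompressed text in the file; the function name `page_count` and the catalog line in `SAMPLE` are made up for illustration, and a file that keeps its objects in compressed object streams would need a real PDF parser instead.

```python
import re

def page_count(pdf_bytes):
    """Follow the /Pages reference to its object and return that object's /Count."""
    # "/Pages 19 0 R" gives object number 19, generation 0.
    ref = re.search(rb"/Pages\s+(\d+)\s+(\d+)\s+R", pdf_bytes)
    if ref is None:
        raise ValueError("no /Pages reference found")
    obj_num, gen_num = ref.groups()
    # Find "19 0 obj ... endobj"; the lookbehind keeps "119 0 obj" from matching.
    pattern = rb"(?<!\d)" + obj_num + rb"\s+" + gen_num + rb"\s+obj(.*?)endobj"
    obj = re.search(pattern, pdf_bytes, re.DOTALL)
    if obj is None:
        raise ValueError("page tree object not found")
    count = re.search(rb"/Count\s+(\d+)", obj.group(1))
    if count is None:
        raise ValueError("no /Count entry in page tree object")
    return int(count.group(1))

SAMPLE = (
    # Hypothetical catalog object, added only so the lookup has a starting point.
    b"1 0 obj<</Type/Catalog/Pages 19 0 R>>\nendobj\n"
    b"19 0 obj<</Count 6/Kids[29 0 R 1 0 R 4 0 R 7 0 R 10 0 R 13 0 R]/Type/Pages>>\nendobj\n"
)

print(page_count(SAMPLE))  # prints 6
```

Note that "/Type/Pages" is not mistaken for the reference, because the pattern requires whitespace and two numbers after "/Pages".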

You can also search for the /MediaBox and /CropBox entries, which are followed by arrays. Each array contains 4 numbers: the x and y coordinates of the lower-left corner followed by those of the upper-right corner, in PostScript points (72 points per inch). MediaBox is the paper size; CropBox is the region the page is clipped to when displayed or printed.
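The points-to-inches arithmetic can be sketched in Python as follows. This is only an illustration: `box_inches` is a hypothetical helper, and it assumes the box is written inline as four plain numbers rather than as a reference to another object.

```python
import re

POINTS_PER_INCH = 72

def box_inches(page_dict, key=b"/MediaBox"):
    """Return (width, height) in inches for the first `key` array found, or None."""
    m = re.search(re.escape(key) + rb"\s*\[\s*([^\]]+?)\s*\]", page_dict)
    if m is None:
        return None
    # The array is: lower-left x, lower-left y, upper-right x, upper-right y.
    llx, lly, urx, ury = (float(v) for v in m.group(1).decode("ascii").split())
    return ((urx - llx) / POINTS_PER_INCH, (ury - lly) / POINTS_PER_INCH)

page = b"<</Type/Page/MediaBox[0 0 612 792]/CropBox[0 0 612 792]>>"
print(box_inches(page))               # prints (8.5, 11.0)
print(box_inches(page, b"/CropBox"))  # prints (8.5, 11.0)
```

So a MediaBox of [0 0 612 792] works out to 612 / 72 = 8.5 inches wide and 792 / 72 = 11 inches tall.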

So you can determine page count and page dimensions with basic file I/O.

Here are some snippets from a 6-page PDF on 8.5x11 inch paper:

Code:
29 0 obj<</Contents 38 0 R/Type/Page/Parent 19 0 R/Rotate 0/MediaBox[0 0 612 792]/CropBox[0 0 612 792]/Resources 30 0 R>>
endobj

Code:
19 0 obj<</Count 6/Kids[29 0 R 1 0 R 4 0 R 7 0 R 10 0 R 13 0 R]/Type/Pages>>
endobj

Thomas D. Greer
 
First of all, thanks for your answer.

As I need an automated process, I might try some regexes to find the info, but I wonder if there isn't a more appropriate way to get it, i.e., some code (possibly PostScript) that would extract this info from the PDF file.

As far as I know, there are binary PostScript files. Aren't there binary PDF files as well? If so, I believe the regex solution wouldn't work for the binary ones.

Rodrigo Severo
 
PostScript code won't help you here, unless you want to use PostScript for file I/O.

PDFs can contain binary streams, yes, that's true. But that's irrelevant to what you're after.
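To illustrate why, here is a small Python sketch that searches the raw bytes instead of decoded text, so binary stream data passes through harmlessly. The helper names `read_pdf` and `find_counts` are invented for this example, and it still assumes the dictionaries themselves are stored uncompressed.

```python
import re

def read_pdf(path):
    """Read the whole file as bytes; "rb" means no text decoding is attempted."""
    with open(path, "rb") as fh:
        return fh.read()

def find_counts(pdf_bytes):
    """Return every /Count value found in the raw bytes."""
    return [int(n) for n in re.findall(rb"/Count\s+(\d+)", pdf_bytes)]

# A stream full of non-text bytes sitting next to a plain-text dictionary.
data = (
    b"5 0 obj<</Length 4>>stream\n\xff\xfe\x00\x80\nendstream\nendobj\n"
    b"19 0 obj<</Count 6/Type/Pages>>\nendobj\n"
)
print(find_counts(data))  # prints [6]
```

A file with nested page tree nodes will report more than one /Count; the one in the object that the catalog's /Pages entry points to is the total.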

I know of several C and .NET kits for creating PDFs programmatically, i.e., through API calls, but not for extracting information from them.

You can check here: I haven't used their tools.

For direct PDF manipulation and data extraction, I use C# or VB to read the data out as I've suggested.



Thomas D. Greer
 