Posted on 2010-01-24 22:43:22.067753-08 by binarybits
Manipulating XObjects
Hi, I'm trying to detect black rectangles in PDF files, and I'm having trouble figuring out whether CAM::PDF can do what I need. I'm trying to read a PDF that looks (in relevant part) like this:
/XObject <</Fm1 5 0 R>>
And subsequently...
5 0 obj
<</Matrix [1.0 0.0 0.0 1.0 -71.0 -179.48]
  /Subtype /Form
  /Length 207
  /Resources <</ProcSet [/PDF] >>
  /FormType 1
  /BBox [71.0 179.48 526.605 227.572]
  /Type /XObject
>>
stream
0 g
1 0 0 1 0 0 cm
474.7041 205.6601 m
526.1046 205.6601 l
526.1046 227.072 l
474.7041 227.072 l
474.7041 205.6601 l
f
endstream
endobj
I've figured out how to get this far:
my $pdf = CAM::PDF->new($infile);
my @properties = $pdf->getPropertyNames($pagenum);
my $property = $pdf->getProperty($pagenum, 'Fm1');
But then I'm stuck. When I run this code, I get a list in @properties that includes Fm1, the XObject I want to examine, and $property winds up with a data structure that seems like it might have the data I want in the "StreamData" field. But the data seems to be encoded somehow, and I'm not sure how to extract the contents.

I've spent a couple of hours poring over the CAM::PDF documentation and source code and haven't been able to make any progress. Am I missing something? Or is this not something CAM::PDF does?

Thanks for a great library!
Posted on 2010-01-25 16:34:10.596366-08 by cdolan in response to 12237
Re: Manipulating XObjects

The method you want is $pdf->getPageContent($pagenum), which returns the decoded stream content. You can either work with that content as plain text, perhaps followed by setPageContent(), or get a parsed representation via getPageContentTree().

Chris

Posted on 2010-01-25 16:46:24.797258-08 by binarybits in response to 12240
Re: Manipulating XObjects

I played around with getPageContent, but it didn't seem to do what I wanted. For example, consider the following code:

my $pdf = CAM::PDF->new($infile);
my $content = $pdf->getPageContent($pagenum);
print $content;

If run on a particular PDF I'm trying to process, it produces the following output:

q 1 0 0 1 70 178.4801025 cm /Fm1 Do Q

If I'm reading the PDF spec correctly, the "/Fm1 Do" is an instruction to display the XObject named "Fm1" (not sure if that's the right terminology). My question is: is there a way to access the contents of this XObject? I think...

$pdf->getProperty($pagenum, 'Fm1');

...gives me a data structure describing this XObject, but I'm not sure how to extract the actual drawing commands, which is what I really need.

I hope that's clear. Thanks again!

-Tim
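
For anyone following along: the names invoked with "Do" can be pulled out of the page content with a small regular expression. This is only a sketch. xobject_names_in_content is a made-up helper, not part of CAM::PDF, and it assumes the operators are separated by whitespace and that no text string in the stream happens to contain something that looks like "/Name Do".

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): return the names of all
# XObjects invoked with the "Do" operator in a decoded content stream,
# in order of appearance.
sub xobject_names_in_content {
    my ($content) = @_;

    # A PDF name starts with "/" and runs until whitespace or a delimiter.
    # Assumption: names do not use "#xx" escapes and never appear inside
    # string literals; a real parser would have to handle both.
    my @names = $content =~ m|/([^\s/\[\]<>(){}%]+)\s+Do\b|g;
    return @names;
}

# The page content quoted in the post above.
my $content = 'q 1 0 0 1 70 178.4801025 cm /Fm1 Do Q';
print join(', ', xobject_names_in_content($content)), "\n";
```

Each name returned can then be passed to getProperty($pagenum, $name), as in the call above.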

Posted on 2010-01-25 17:58:47.689684-08 by cdolan in response to 12241
Re: Manipulating XObjects

Oh, I see, I misunderstood your original request. It's not at all obvious how to get the xstream.

This might do the trick, but I have not tested it:

my $property = $pdf->getProperty($pagenum, 'Fm1');
# check $property for undef here
my $propval = $pdf->getValue($property);
# check that $propval is a 'dictionary' node, perhaps
my $xobject = $pdf->decodeOne($propval);
Posted on 2010-01-26 07:11:07.976209-08 by binarybits in response to 12242
Re: Manipulating XObjects

Thanks, that was super helpful! In case anyone has this problem in the future, here's the working code I came up with (this is inside a loop, hence the $_ and next):

my $property = $pdf->getProperty($pagenum, $_);
next if(!defined($property));
my $propval = $pdf->getValue($property);
my $type = $propval->{Type}->{value};
next if(!defined($type) || $type ne 'XObject');
# decodeOne() expects a hash with key "type" and value "dictionary,"
# so that's what we're going to give it.
my %dictionary = ('type' => 'dictionary', 'value' => $propval);
my $content = $pdf->decodeOne(\%dictionary);
my $pagetree = CAM::PDF::Content->new($content);

As you can see it's a bit of a hack. Ideally there'd be a getParseTreeFromXObject() function (or something) that takes a page number and resource name and returns the parse tree of the associated xstream. I'd submit a patch but I'm not sure I understand the CAM::PDF internals well enough to produce something usable.
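
A helper along those lines could be sketched as follows. Both function names are hypothetical (nothing like them ships with CAM::PDF), and the sketch only repackages the calls from the working code above: getProperty(), getValue(), decodeOne(), and CAM::PDF::Content->new(). Treat it as untested against real files.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): return the decoded stream of
# the XObject named $name in the resources of page $pagenum, or undef if
# the property is missing or is not an XObject.
sub get_xobject_content {
    my ($pdf, $pagenum, $name) = @_;

    my $property = $pdf->getProperty($pagenum, $name);
    return if !defined $property;

    my $propval = $pdf->getValue($property);
    return if ref $propval ne 'HASH';

    # Only XObjects are of interest; skip fonts and other resources.
    my $type = $propval->{Type} ? $propval->{Type}->{value} : undef;
    return if !defined $type || $type ne 'XObject';

    # decodeOne() expects a node of type "dictionary", so wrap the hash.
    my %dictionary = (type => 'dictionary', value => $propval);
    return $pdf->decodeOne(\%dictionary);
}

# Hypothetical helper: parse tree of the same XObject, or undef.
sub get_xobject_tree {
    my ($pdf, $pagenum, $name) = @_;

    my $content = get_xobject_content($pdf, $pagenum, $name);
    return if !defined $content;

    require CAM::PDF::Content;    # loaded only when a tree is requested
    return CAM::PDF::Content->new($content);
}
```

Inside the loop above, this would reduce to something like: my $pagetree = get_xobject_tree($pdf, $pagenum, $_) or next;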

Thanks again for your help. This saved me a ton of time and frustration.

-Tim
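
For the original goal of detecting black rectangles, the decoded drawing commands can be scanned directly. The sketch below is one possible approach, not something from CAM::PDF: black_rectangles and rectangle_bbox are hypothetical names. It assumes whitespace-separated operators, looks only at the g and rg fill colours, and ignores cm transforms and q/Q state, so the boxes it reports are in the form's own coordinate space.

```perl
use strict;
use warnings;

# Scan a decoded content stream and return the bounding boxes
# [xmin, ymin, xmax, ymax] of axis-aligned rectangles filled with black.
sub black_rectangles {
    my ($content) = @_;
    my (@operands, @path, @found);
    my $fill_is_black = 1;    # the initial fill colour in PDF is black

    for my $token (split ' ', $content) {
        if ($token =~ /^[-+]?(?:\d+\.?\d*|\.\d+)$/) {
            push @operands, $token;    # numbers wait for their operator
            next;
        }
        if ($token eq 'g') {           # gray fill: 0 means black
            $fill_is_black = (@operands == 1 && $operands[0] == 0);
        }
        elsif ($token eq 'rg') {       # RGB fill: 0 0 0 means black
            my $nonzero = grep { $_ != 0 } @operands;
            $fill_is_black = (@operands == 3 && $nonzero == 0);
        }
        elsif ($token eq 'm' && @operands == 2) {   # start a new path
            @path = ([@operands]);
        }
        elsif ($token eq 'l' && @operands == 2) {   # line to a point
            push @path, [@operands];
        }
        elsif ($token eq 're' && @operands == 4) {  # x y width height
            my ($x, $y, $w, $h) = @operands;
            @path = ([$x, $y], [$x + $w, $y], [$x + $w, $y + $h], [$x, $y + $h]);
        }
        elsif ($token eq 'f' || $token eq 'F' || $token eq 'f*') {
            my $box = rectangle_bbox(@path);
            push @found, $box if $fill_is_black && $box;
            @path = ();
        }
        elsif ($token eq 'n' || $token eq 'S' || $token eq 's') {
            @path = ();                # path ended without a fill
        }
        @operands = ();                # every operator consumes its operands
    }
    return @found;
}

# Return [xmin, ymin, xmax, ymax] if the points trace an axis-aligned
# rectangle, otherwise nothing.
sub rectangle_bbox {
    my @points = @_;

    # Drop an explicit closing point that repeats the starting point.
    pop @points
        if @points == 5
        && $points[0][0] == $points[4][0]
        && $points[0][1] == $points[4][1];
    return if @points != 4;

    my (%xs, %ys);
    for my $i (0 .. 3) {
        my ($p, $q) = @points[$i, ($i + 1) % 4];
        my $same_x = $p->[0] == $q->[0];
        my $same_y = $p->[1] == $q->[1];
        return if !($same_x xor $same_y);   # edge must be horizontal or vertical
        $xs{ $p->[0] + 0 } = 1;
        $ys{ $p->[1] + 0 } = 1;
    }
    return if keys(%xs) != 2 || keys(%ys) != 2;

    my ($xmin, $xmax) = sort { $a <=> $b } keys %xs;
    my ($ymin, $ymax) = sort { $a <=> $b } keys %ys;
    return [$xmin, $ymin, $xmax, $ymax];
}

# The Fm1 stream quoted at the top of the thread.
my $stream = '0 g 1 0 0 1 0 0 cm 474.7041 205.6601 m 526.1046 205.6601 l '
           . '526.1046 227.072 l 474.7041 227.072 l 474.7041 205.6601 l f';
print "black rectangle: @$_\n" for black_rectangles($stream);
```

To place a box on the page, the form's /Matrix and whatever cm is in effect when "/Fm1 Do" runs would still need to be applied; that step is left out here.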
