Posted on 2010-11-13 10:28:56.366477-08 by binarybits
Hexstring

Hi,

Does anyone have advice on converting "hexstring" objects to ASCII? I'm using CAM::PDF to extract individual "Tj" and "TJ" commands. Sometimes they have a "string" argument that's just plain ASCII, in which case everything works great. In other cases, though, the string is of type "hexstring," and printing those characters out gives me garbage. This seems to happen for lines that have "smart quotes." The pdftotext tool handles the conversion just fine, so I don't think it's a problem with the PDF. I'm assuming this means that the strings are in some non-ASCII encoding. Does anyone have pointers to code for converting these strings to ASCII?

Thanks a lot.

Posted on 2010-11-13 11:14:31.902043-08 by cdolan in response to 13053
Re: Hexstring
hexstring is almost always used for non-ASCII data in a PDF, so trying to convert it to ASCII is almost certainly a lossy operation.
Posted on 2010-11-13 11:43:20.699619-08 by binarybits in response to 13054
Re: Hexstring
Thanks for the prompt response! The strings in question are English text, and I don't care very much about non-alphanumeric characters, so lossy conversion would be fine.
Posted on 2010-11-13 12:24:59.962066-08 by cdolan in response to 13055
Re: Hexstring
I see, you are working with unparsed page content? There are three approaches: 1) use getPageContentTree(), which is a parsed representation of the page. But it's very strict and may not be able to parse all syntax. 2) pull out just your Tj and send it through parseHexstring(). 3) use pack('H*', $string). But see the source of parseHexstring() for a detail about zero padding.
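
A minimal sketch of approach 3, for readers who want to see the padding detail spelled out. The helper name hex_to_bytes is hypothetical and not part of CAM::PDF. It assumes the input is the text of one hexstring, with or without its angle brackets. In PDF syntax, whitespace inside a hexstring is ignored and an odd number of digits is treated as if a final 0 followed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): turn the text of a PDF
# hexstring such as "<48656C6C6F>" into the raw bytes it encodes.
sub hex_to_bytes {
    my ($hex) = @_;
    $hex =~ s/^\s*<//;                  # drop the opening delimiter, if present
    $hex =~ s/>\s*$//;                  # drop the closing delimiter, if present
    $hex =~ s/\s+//g;                   # whitespace inside a hexstring is ignored
    $hex .= '0' if length($hex) % 2;    # odd digit count: assume a trailing 0
    return pack 'H*', $hex;
}

print hex_to_bytes('<48656C6C6F>'), "\n";               # prints "Hello"
print unpack('H*', hex_to_bytes('<901FA>')), "\n";      # prints "901fa0"
```

The result is a byte string. Whether those bytes are readable text depends on the font's encoding, so this step alone may still print garbage.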
Posted on 2010-11-13 14:56:33.216348-08 by binarybits in response to 13056
Re: Hexstring

I'm using getPageContentTree(). Here's a highly simplified version of what I'm doing.

my $pagetree = $pdf->getPageContentTree($pagenum);
my @stack = ([@{$pagetree->{blocks}}]);

# Here I'm omitting code to traverse the @stack getting $nodes.

my $block  = shift @{$node};
my $opname = $block->{name};
my @args   = @{$block->{args}};   # assumed: the operator's arguments (not shown in the original post)

if ($opname eq 'TJ') {
    if (@args == 1 && $args[0]->{type} eq 'array') {
        my @strings = @{$args[0]->{value}};
        foreach my $element (@strings) {
            if ($element->{type} eq 'string') {
                print "String: " . $element->{value} . "\n";
            }
            elsif ($element->{type} eq 'hexstring') {
                print "Hexstring: " . $element->{value} . "\n";
            }
        }
    }
}

This works perfectly for 'string' blocks. But at least on the PDF I'm using for testing, it produces garbage for 'hexstring' blocks. Does that mean that my testing PDF is using a weird character set?

Posted on 2010-11-13 15:16:36.701168-08 by cdolan in response to 13057
Re: Hexstring
Yes, it could be a weird charset. PDF allows the creator to use arbitrary encodings if they are mapped to an embedded font. This is not widely used because it breaks text searching. But that's another possible explanation. Or maybe your terminal just doesn't like Unicode? Can you share some sample hex data (as hex not as binary) or a sample PDF?
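
One way to produce "hex not binary" sample data is sketched below. The helper name bytes_to_hex is hypothetical, and it assumes the value is a byte string (no characters above 255).

```perl
use strict;
use warnings;

# Hypothetical helper: render raw bytes as space-separated hex pairs so
# they can be pasted into a post without terminal or encoding damage.
sub bytes_to_hex {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

print bytes_to_hex("\x00\x26\x00\xB6"), "\n";   # prints "00 26 00 B6"
```

Dumping the bytes this way also takes the terminal out of the picture, which helps separate a display problem from an encoding problem.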
Posted on 2010-11-14 14:34:01.035137-08 by binarybits in response to 13058
Re: Hexstring

Sure! The PDF I'm having trouble with is this one:

http://ia700100.us.archive.org/21/items/gov.uscourts.nysd.321800/gov.uscourts.nysd.321800.28.0.pdf

For example, if you look at page 10, the phrase "transfers designed to obscure the illicit origin of the USD" is a regular ASCII string. The phrase ' - thereby "cleaning" the money - before' is a hexstring, probably because of the smart quotes. And then the phrase "transferring it to clients seeking to purchase USD" is a normal ASCII string again.

I tried the same code on a few other PDFs and didn't see any similar problems, so it's possible that this specific PDF is screwed up.

Thanks again for your help.

Posted on 2010-11-14 16:57:26.96398-08 by cdolan in response to 13059
Re: Hexstring

This snippet:

[<0026>-2<0052004F00520050>-3<0045004C004400B6>5<00560003>] TJ

translates to "Colombia's" in a PDF viewer. That text is in font "/F4" which is described as:

26 0 obj<</Type/Font/DescendantFonts[27 0 R]/ToUnicode 31 0 R/BaseFont/Times#20New#20Roman/Subtype/Type0/Encoding/Identity-H>>

That in turn refers to a compressed Unicode mapping table in object 31. Alas, CAM::PDF does not support mapping tables (I didn't even know they existed until now!)
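
A rough sketch of how such a mapping table could be applied by hand, outside CAM::PDF. Everything here is an assumption made for illustration: the bfchar entries are inferred from how the snippet above renders, not read from object 31 of the real PDF, and the function names are hypothetical. Because code 0052 appears twice in the snippet, this table decodes it as "Colombia’s" followed by a space. Real ToUnicode tables may also contain bfrange sections and multi-character targets, which this sketch ignores.

```perl
use strict;
use warnings;

# Hypothetical excerpt of a decompressed ToUnicode CMap. These entries are
# inferred from the TJ snippet, NOT taken from the actual PDF.
my $cmap_text = <<'END_CMAP';
10 beginbfchar
<0003> <0020>
<0026> <0043>
<0044> <0061>
<0045> <0062>
<004C> <0069>
<004F> <006C>
<0050> <006D>
<0052> <006F>
<0056> <0073>
<00B6> <2019>
endbfchar
END_CMAP

# Collect "<source code> <Unicode code point>" pairs from bfchar sections.
sub parse_bfchar {
    my ($text) = @_;
    my %map;
    while ($text =~ /beginbfchar(.*?)endbfchar/gs) {
        my $section = $1;
        while ($section =~ /<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/g) {
            $map{ lc $1 } = chr hex $2;
        }
    }
    return \%map;
}

# Identity-H uses two-byte codes, so read four hex digits at a time.
# Codes missing from the table become U+FFFD (replacement character).
sub decode_hexstring {
    my ($hex, $map) = @_;
    my $out = '';
    for my $code ($hex =~ /([0-9A-Fa-f]{4})/g) {
        $out .= exists $map->{ lc $code } ? $map->{ lc $code } : "\x{FFFD}";
    }
    return $out;
}

my $map  = parse_bfchar($cmap_text);
my $tj   = '[<0026>-2<0052004F00520050>-3<0045004C004400B6>5<00560003>] TJ';
my $text = join '', map { decode_hexstring($_, $map) } $tj =~ /<([0-9A-Fa-f]+)>/g;

binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

The numbers between the hexstrings in a TJ array are kerning adjustments, so the sketch skips them and only decodes what sits inside angle brackets.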

Posted on 2010-11-15 12:01:43.967674-08 by binarybits in response to 13060
Re: Hexstring
OK, good to know I'm not crazy. Thanks for checking on it!