Posted on 2010-12-02 23:08:02.313142-08 by buga
Get formatted text from a pdf file
Hi, I need to get the text of a pdf file grouped by font and font size. I used the following code to list all available fonts:
foreach my $fontname (sort $pdf->getFontNames(1)) {
    my $font = $pdf->getFont(1, $fontname);
    print $fontname."\n";
}
I was also able to get the complete text of a pdf page with:
my $page1 = $pdf->getPageContent(1);
print $pdf->getPageText(1);
How can I get the text together with its font? I need to find keywords in headings or bold text, so it would be nice if there were a way to get all text set in OPBaseFont0, all in OPExtFont0, all in OPExtFont1 and so on... I would be grateful for some help with this. Thank you
Posted on 2010-12-03 22:28:05.504271-08 by cdolan in response to 13089
Re: Get formatted text from a pdf file
You can do it, but it's not straightforward. You need to search the page content for constructs like this example:
BT
216 0 0 -216 142 291 Tm /F1.0 1 Tf (E) Tj
216 0 0 -216 231.754 291 Tm (m) Tj
216 0 0 -216 398.289 291 Tm (p) Tj
216 0 0 -216 510.402 291 Tm (lo) Tj
216 0 0 -216 679.996 291 Tm (y) Tj
216 0 0 -216 776.816 291 Tm (e) Tj
216 0 0 -216 887.242 291 Tm (e) Tj
ET
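All text rendering is enclosed in "BT ... ET" (begin text, end text). The "Tf" command takes two preceding arguments: the name of the font ("/F1.0") and the font size ("1"). Each "Tm" command sets the text matrix, which here both scales the glyphs (216) and positions them, and each "Tj" command then emits the string in the parentheses before it. In the example above, joining those strings in order (E, m, p, lo, y, e, e) gives "Employee".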
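As a rough starting point, here is a minimal sketch of that search in plain Perl. It only does string matching on the page content, so it assumes simple streams: literal "(...)" strings shown with "Tj", no "TJ" arrays, no hex strings, and no text that itself contains the words BT or ET. The function name text_by_font and the sample data are made up for illustration and are not part of CAM::PDF.

```perl
use strict;
use warnings;

# Collect the strings shown by Tj, keyed by the font most recently
# selected with Tf. Hypothetical helper, not a CAM::PDF method.
sub text_by_font {
    my ($content) = @_;
    my %text_for;
    my $font = '(none)';    # the selected font persists between text objects

    # Visit each BT ... ET text object in turn.
    while ($content =~ /\bBT\b(.*?)\bET\b/gs) {
        my $block = $1;

        # Match either "/Name size Tf" or "(string) Tj", in stream order.
        while ($block =~ m{/(\S+)\s+[-\d.]+\s+Tf\b|\(((?:\\.|[^\\()])*)\)\s*Tj\b}g) {
            if (defined $1) {
                $font = $1;
            }
            else {
                (my $str = $2) =~ s/\\([()\\])/$1/g;    # unescape \( \) and \\
                $text_for{$font} .= $str;
            }
        }
    }
    return \%text_for;
}

# Sample content shaped like the stream quoted above.
my $sample = <<'END';
BT
216 0 0 -216 142 291 Tm /F1.0 1 Tf (E) Tj
216 0 0 -216 231.754 291 Tm (m) Tj
216 0 0 -216 398.289 291 Tm (p) Tj
216 0 0 -216 510.402 291 Tm (lo) Tj
216 0 0 -216 679.996 291 Tm (y) Tj
216 0 0 -216 776.816 291 Tm (e) Tj
216 0 0 -216 887.242 291 Tm (e) Tj
ET
END

my $text_for = text_by_font($sample);
printf "%s: %s\n", $_, $text_for->{$_} for sort keys %$text_for;
# prints "F1.0: Employee"

# With a CAM::PDF object, the input would be the page content string:
# my $text_for = text_by_font($pdf->getPageContent(1));
```

Because the strings are simply concatenated, word spacing that the PDF expresses through positioning rather than space characters is lost. Real documents often need the Tm coordinates as well to decide where words and lines break.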
Posted on 2010-12-06 01:47:31.765348-08 by buga in response to 13093
Re: Get formatted text from a pdf file
Hi, thanks for your reply. I am new to CAM::PDF. In general I understand what you mean, but I don't know how to get the BT ... ET code you posted. I also don't see how you get the word "Employee" out of that. Can you provide a small Perl script that reads a pdf file and gets the text together with its formatting? I hope I'm not asking too much, but for now I don't know where to start. I also haven't found a good tutorial yet. I uploaded an example file here http://pdfcast.org/download/cam-pdf-example-pdf.pdf Thanks for your help!