The epiphany came when I was trying to extract usable information from a bunch of documents.
Some people insist on distributing essential information in PDF format, making it very hard to make use of said information.
Now, I have never really made it past the table of contents of Adobe's PDF Reference, and I can't really figure out many of the available Perl modules dealing with PDFs. I know what goes into a PDF document (basically boxes with coordinates), but, just as I have never written a web server in PostScript, I haven't been able to go into this in depth.
One of the problems with utilities that naively convert PDF to text is that they usually do a straightforward translation of the layout, which does a number on the order in which the text comes out. The location of an object on the page and its position in the object stream don't correspond very reliably to each other.
At first, I was very frustrated … Then, I realized the value of the XML output option. With this option, the PDF document is output as page and text elements. For example:
<page number="6" position="absolute" top="0" left="0" height="918" width="1188"> … <text top="176" left="109" width="125" height="15" font="2">DATA RECORD </text>
This is extremely useful when trying to extract information. First, if the entity producing the document used consistent styling, the font attribute of the text elements can be used to select items of interest. However, multi-column documents are still a pain.
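As a rough illustration of selecting by styling, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The sample is modeled on the excerpt above; the second text element, the meaning of font "2", and the helper name `texts_with_font` are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Sample modeled on the excerpt above. The second <text> element is invented
# so there is something to filter out.
SAMPLE = """<page number="6" position="absolute" top="0" left="0" height="918" width="1188">
<text top="176" left="109" width="125" height="15" font="2">DATA RECORD </text>
<text top="200" left="109" width="300" height="12" font="0">Some body text</text>
</page>"""


def texts_with_font(page, font_id):
    """Return the stripped text of every <text> element whose font attribute matches."""
    return [
        "".join(el.itertext()).strip()  # itertext() also picks up nested markup such as <b>
        for el in page.iter("text")
        if el.get("font") == font_id
    ]


page = ET.fromstring(SAMPLE)
headings = texts_with_font(page, "2")  # assumption: font "2" marks the items of interest
print(headings)
```

Which font id corresponds to which kind of item depends entirely on the document, so it has to be found by inspecting the output first.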
The key to my epiphany lies in sorting the text elements using a (page, column, line) ordering: Text on page 5 should come before text on page 7. Text in column one comes before text in column three. Text on line five in column two comes before text on line two in column three … See what I did there?
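That ordering is just a lexicographic comparison of tuples. A minimal Python sketch, assuming each item already carries a column number (the `Item` type and the sample values are hypothetical) and using the top offset as a stand-in for the line:

```python
from typing import NamedTuple


class Item(NamedTuple):
    page: int
    column: int  # assumed to be known already for each item
    top: int     # vertical offset on the page; smaller means higher up
    text: str


items = [
    Item(7, 1, 100, "page 7, column 1"),
    Item(5, 3, 40, "page 5, column 3, line 2"),
    Item(5, 2, 120, "page 5, column 2, line 5"),
    Item(5, 1, 300, "page 5, column 1"),
]


def reading_order(items):
    """Sort by page first, then column, then vertical position."""
    return sorted(items, key=lambda it: (it.page, it.column, it.top))


ordered = reading_order(items)
for it in ordered:
    print(it.text)
```

Note that line five of column two sorts ahead of line two of column three even though it sits lower on the page, which is exactly the point.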
At first you might think it is OK to define columns using the left attribute of the text elements. The problem arises when some attributes of the data you want to extract are defined in section headers that can appear in the middle of a column. People will usually center the text in those headers (for visual aesthetic reasons), and therefore the headers will appear to be in a later column than the data items that follow.
This may seem obvious right now, but the solution came to me only after looking at the following plot:
That is, I need a mapping of ranges of left margins to columns.
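One way to express such a mapping, sketched in Python with the standard `bisect` module. The boundary values below are hypothetical; in practice they would be read off the distribution of left offsets for the document at hand.

```python
from bisect import bisect_right

# Hypothetical boundaries: the left offsets at which columns 2 and 3 begin.
# Anything to the left of the first boundary belongs to column 1.
COLUMN_STARTS = [400, 800]


def column_for(left):
    """Map a left offset to a 1-based column number via its range."""
    return bisect_right(COLUMN_STARTS, left) + 1


# A data item flush with the column margin and a centered header further
# to the right both land in column 1, because they share the same range.
print(column_for(109), column_for(250), column_for(400), column_for(950))
```

Because a whole range of left offsets maps to one column, a centered header no longer jumps ahead of the data items that follow it.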
Once that mapping is defined, text elements can be sorted into a natural reading order, and information can be extracted using usual