Download all pieces of the full text of TPP, and combine them into a single PDF document

TPP is a treaty negotiated in secret.

A few days ago, the full text was posted on the U.S. Trade Representative's web site, and on the web site of the New Zealand Ministry of Foreign Affairs and Trade.

I haven't compared the contents of the two sites to see if they are identical.

The first thing I noticed was that both governments had decided to post the so-called "full text" of the treaty in bits and pieces. On the U.S. site, there are 238 individual pieces. New Zealand offers the 30 main documents (without annexes) in one handy ZIP archive, but that's still 30 files.

I had the strange idea that having a single, easily searchable document would have been more helpful to citizens trying to stay informed about the international commitments their governments are making.

So, I put together a quick script to download the pieces from the U.S. Trade Representative's web site, and combine them into a single document using PDF::Reuse. Strangely, HTTP::Tiny, LWP, and curl ran into errors downloading individual documents from the web site (I did not investigate), so I used GNU wget (see gist).
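For illustration only, here is a minimal sketch of how the download step could be driven from Perl. It is not the script from the gist: the helper name wget_command and the example.com URLs are placeholders I made up, and it assumes the list of document URLs has already been collected in table-of-contents order. Each local file name gets a three-digit sequence prefix so that the pieces sort in the right order later.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helper: build the wget command line for the Nth document.
# The local file name is the last URL path segment, prefixed with a
# three-digit sequence number (e.g. "007-Chapter.pdf").
sub wget_command {
    my ($seq, $url) = @_;
    my ($name) = ( $url =~ m{ ([^/]+) \z }x );
    my $file = sprintf '%03d-%s', $seq, $name;
    return (wget => '-q', '-O', $file, $url);
}

# Placeholder URLs, NOT the real locations of the TPP documents.
my @urls = (
    'https://example.com/tpp/Preamble.pdf',
    'https://example.com/tpp/Chapter-1.pdf',
);

my $seq = 0;
for my $url (@urls) {
    my @cmd = wget_command(++$seq, $url);
    print "@cmd\n";
    # Uncomment to actually download:
    # system(@cmd) == 0 or warn "wget failed for $url\n";
}
```

Passing the command to system as a list avoids going through the shell, so odd characters in a URL or file name do not need quoting.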

Now, I must admit, I posted the script without checking whether the final document matched what was expected. It turns out some of the documents in the collection had issues. For example, trying to extract a page using GraphicsMagick gave the following output:

**** Warning: can't process font stream, loading font by the name.
**** Error reading a content stream. The page may be incomplete.
**** File did not complete the page properly and may be damaged.

**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> Adobe PDF Library 11.0 <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.

I only checked this after realizing that combining all the pieces with PDF::Reuse had resulted in a document just about 300 pages long, rather than the six thousand or so pages everyone was talking about.
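A shortfall like that is easy to catch by comparing page counts. The following is a rough sketch, not something from my original scripts: it assumes pdfinfo is on the PATH, and the helper name page_count_from_pdfinfo is made up for this example. It sums the page counts of the files named on the command line.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helper: pull the page count out of pdfinfo's text output,
# which contains a line such as "Pages:          12".
# Returns undef if no such line is found.
sub page_count_from_pdfinfo {
    my ($output) = @_;
    my ($pages) = ( $output =~ /^ Pages: \s+ ([0-9]+) /xm );
    return $pages;
}

my $total = 0;

for my $file (@ARGV) {
    # List-form pipe open: the file name is never interpreted by a shell.
    open my $fh, '-|', 'pdfinfo', $file
        or die "Cannot run pdfinfo on '$file': $!\n";
    my $output = do { local $/; <$fh> };
    close $fh;

    my $pages = page_count_from_pdfinfo($output // '');
    if (defined $pages) {
        $total += $pages;
    }
    else {
        warn "No page count found for '$file'\n";
    }
}

print "Total pages: $total\n";
```

Running it once over the downloaded pieces and once over the combined file should give the same total. If the combined file comes up short, something was lost along the way.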

Again, I just wanted everything in a single document; I am not interested in why the myriad bits and pieces did not fit together correctly. It was time to turn to the venerable Ghostscript. I decided to use it to extract all the pages from the individual documents, and then combine them page by page, instead of dealing with the components at the document level.

Here is a quick script to extract individual pages from all the documents. This assumes you used a technique similar to the one in my original download script, where I prefixed the name of each downloaded file with a three-digit number indicating the order of its appearance in the table of contents.

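#!/usr/bin/env perl

use strict;
use warnings;
use feature 'say';

# List the numbered PDFs in the parent directory. Listing the directory
# itself (rather than ../*.pdf) yields bare file names, so the sequence
# number can be matched at the start of each name.
my @docs = sort {
    no warnings 'numeric';  # "001-foo.pdf" compares by its numeric prefix
    $a <=> $b;
} map s/\s+\z//r, grep /^[0-9]+.*\.pdf$/, `ls ..`;

for my $doc (@docs) {
    my ($seq) = ( $doc =~ /\A ([0-9]+) /x );
    # pdfinfo prints a line such as "Pages:          12"
    my ($pages) = map /^ Pages [^0-9]+ ([0-9]+) /x, `pdfinfo '../$doc'`;

    unless ($pages) {
        warn "Could not determine page count for '$doc'\n";
        next;
    }

    # Write each page to its own file, named <sequence>-<page>.pdf
    for my $page (1 .. $pages) {
        my @cmd = (
            gs => qw(-sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER),
            "-dFirstPage=$page",
            "-dLastPage=$page",
            sprintf('-sOutputFile=%03d-%04d.pdf', $seq, $page),
            "../$doc"
        );
        say "@cmd";
        system @cmd;
    }
}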

Now, this is obviously not going to be a paragon of speed, but it still finished in a reasonable amount of time while I was attending to something else. At the end, there were 6,526 single-page documents in my working directory.

Despite the fact that gs printed out a bunch of warnings during the process, PDF::Reuse had no problem putting them together:

#!/usr/bin/env perl

use strict;
use warnings;

use PDF::Reuse;

my @pages = sort map s/\s+\z//r, grep /^[0-9]+/, `ls *.pdf`;

prFile('TPP.pdf');

prDoc($_) for @pages;

prEnd();

PS: You can comment on this post on /r/perl.

PPS: FYI, as an economist, I am a proponent of free trade. You don't need international negotiations for free trade. As a citizen who wants to stay informed about the issues, I want every government to be transparent about the commitments they are making.

PPPS: You can find the resulting PDF file at https://j.mp/TPP-20151108.