Fixing malformed XML with Perl's XML::Parser

I have more experience than I would like to recount being in the middle of a data processing pipeline. I often have to acquire data sets which have been produced using deity knows what kind of COBOL written several decades ago, then passed through several layers to produce what seems to be the cool data format of recent times. Owing to the fact that I am both an economist and a developer, I am often also the end user of said data sets, so I have a vested interest in getting those data sets into usable shape. I have had to resort to using JScript on locked-down Windows Servers to process awful lookup tables put together as if the largest hard drives were still 44 MB Miniscribes.

Sometimes, after having jumped through 1600 bureaucratic steps to actually get access to the data, you find yourself dealing with malformed data files, and the upstream is either unresponsive to constructive criticism, or worse. In these circumstances, it falls on you to fix others' mistakes, and keep things moving, because it is not your job, say, to enforce the XML standard.

A good way of doing so is to separate the preprocessing of the nasty input from your main program. This has two benefits. First, you can get rid of the preprocessing step if the upstream ever gets around to fixing anything. Second, and more importantly, you separate the cost of fixing other people's malformed data from the cost of your own processing.

Especially in business environments where the time spent processing such data files can have immediate bottom-line consequences, being able to attach a separate, individually identifiable cost to fixing the malformed data set can be a very good way to motivate others to pressure upstream to pressure their developers to fix things.

So, I had instant sympathy for Stack Overflow user disruptiveglow, who faces the task of parsing XML files where some tags have attribute values with unescaped < characters in them. In this case, adding a separate preprocessing step to the workflow would enable the abused developer to at least quantify the cost of dealing with malformed data, so that the business can make an intelligent decision on whether to spend resources on getting the upstream to fix the issue.

Of course, if you have someone on the other side who can respond to a quick, friendly email and take care of the issue, there is no need to go through this at all. This presumes that civilized ways of handling the issue have not worked ;-)

Economically, so long as the fixed cost of developing the preprocessor plus the marginal cost of preprocessing each document multiplied by the number of documents is less than the cost of getting upstream to fix the data, it is rational for a business to keep ingesting malformed XML.

The business side needs to understand that the cost of developer time is not just the money it spends on the developer. The cost of using the developer's time to put together the preprocessor is whatever profit the business would forego by not tasking the developer to work on the most profitable alternative.

And, of course, the cost of using computing resources is not just what the business pays in dollars and cents. It is whatever profit they could have made by tasking the computing resources for something else.
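The break-even condition can be written down in a few lines. This is only a sketch: the subroutine name and every figure are made up for illustration, and each figure is meant to be an opportunity cost in the sense described above, not a line from an invoice.

```perl
use strict;
use warnings;

# True if it is cheaper to keep preprocessing than to get upstream to fix
# the data. All four costs are hypothetical and in the same currency unit.
sub preprocessing_is_cheaper {
    my (%cost) = @_;
    my $keep_ingesting =
        $cost{fixed} + $cost{per_document} * $cost{documents};
    return $keep_ingesting < $cost{upstream_fix};
}

my %cost = (
    fixed        => 4_000,    # developing the preprocessor
    per_document => 0.05,     # preprocessing one document
    documents    => 20_000,   # documents expected over the horizon
    upstream_fix => 15_000,   # getting upstream to fix the format
);

printf "Keep preprocessing: %s\n",
    preprocessing_is_cheaper(%cost) ? 'yes' : 'no';
```

With these invented numbers, preprocessing costs 5,000 against 15,000 for the upstream fix. The answer flips once the document count or the per-document cost grows enough, which is exactly why the per-document cost needs to be measured.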

A developer cannot make these calls. However, a developer can communicate concrete information so that business people can make informed decisions. It is the business people who know whether pressuring upstream to fix the data format may lead to months of stonewalling, a canceled subscription plan, or anything that might involve lawyers.

Keeping these considerations in mind, along with the fact that, as Sobrique points out, the standard treats malformed XML as a fatal error, what can one do to preprocess it to remove trivial issues that interfere with the pipeline?

Regular Expressions

The OP's immediate instinct was to reach for regular expressions. Well, now you have two problems!

Subverting XML::Parser

Instead of going crazy trying to figure out some regular expression pattern you will not recognize a day later, I would recommend using a proper XML parser and catching failures. If a failure occurred at the specific character in the specific attribute of the specific tag you are trying to fix, then fix the input, and restart the XML parser. Once the parser is able to go through the document without any failures, save the fixed input with a new name, and make a record of the resources used in fixing it. Generally, for business purposes, just the time spent is enough. Then, explain how long it took you to develop the fix, and how long it takes for each malformed document to be fixed. The business side should know what resources are being used for the main business purpose, and what is being used to fix others' errors.
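Recording the time spent does not need much machinery. Here is a minimal sketch using the core Time::HiRes module. The timed helper and the stand-in fixer are hypothetical names for illustration, and a real pipeline would write the record somewhere more durable than STDERR.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Run a callback and return its result together with the elapsed
# wall-clock time in seconds.
sub timed {
    my ($code, @args) = @_;
    my $t0      = [gettimeofday];
    my $result  = $code->(@args);
    my $elapsed = tv_interval($t0);   # interval from $t0 until now
    return ($result, $elapsed);
}

# Hypothetical stand-in for the real fixer: it returns its input unchanged.
my $fixer = sub { my ($xml) = @_; return $xml };

my ($fixed, $elapsed) = timed($fixer, '<document/>');
printf STDERR "%s\tfixed in %.3f s\n", 'bad.xml', $elapsed;
```

One line per document in a log like this is enough to add up what the malformed input costs over a month.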

At first, I reached for XML::Twig in my answer to this question, but, actually, we don't need anything more than XML::Parser:

If neither Style nor Handlers are specified, then parsing just checks the document for being well-formed.
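As a quick illustration of that behavior, the following sketch builds a parser with no options and reports whether a string is well-formed. The well_formed_error name is my own, and the exact wording of the error message comes from libexpat, so it may differ between versions.

```perl
use strict;
use warnings;
use XML::Parser;

# Return undef if $xml is well-formed, otherwise the parser's error message.
sub well_formed_error {
    my ($xml) = @_;
    my $parser = XML::Parser->new;    # no Style, no Handlers
    return if eval { $parser->parse($xml); 1 };
    return $@;
}

for my $doc ('<tag v="1 &lt; 2" />', '<tag v="1 < 2" />') {
    my $err = well_formed_error($doc);
    print "$doc: ", (defined $err ? "malformed: $err" : 'well-formed'), "\n";
}
```

The second document has a raw < inside an attribute value, so the parser dies and the message says where it gave up.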

So, the job of the script will be to:

  1. Ingest the XML into a string
  2. Try to parse it using XML::Parser
  3. If an exception occurs, check whether it is an invalid token exception, record its position, and check whether the character at that position is <. If so, replace it with &lt; and parse again

Strictly speaking, we do not know if the error did happen within the value of the specific attribute of the specific tag in which we are interested, but I am assuming the conditions of the OP's question hold: The XML documents are malformed only in this one specific way.

So, I put together a script that will do this. Obviously, I haven't tested this very much, so it likely has some bugs. The script accepts the following arguments:

--input (required) : Malformed XML file

--output (required) : Output file

--limit (default = 1,000) : Maximum number of correction iterations. Just in case something gets stuck. Adjust on the basis of how many misplaced < characters you expect.
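The argument handling is not shown in this post. A minimal sketch of what a parse_argv could look like with the core Getopt::Long module follows. This is my assumption, not the code from the full script: it returns plain strings, whereas the main method below evidently expects the input option to be an object with a slurp method, so the caller would still have to wrap it.

```perl
use strict;
use warnings;
use Carp qw(croak);
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical parse_argv: takes an array ref of command line arguments
# and returns a hash ref with the keys input, output and limit.
sub parse_argv {
    my ($argv) = @_;
    my %opt = (limit => 1_000);    # default number of correction iterations

    GetOptionsFromArray($argv, \%opt, 'input=s', 'output=s', 'limit=i')
        or croak 'Cannot parse command line';

    for my $required (qw(input output)) {
        croak "--$required is required" unless defined $opt{$required};
    }
    croak '--limit must be positive' unless $opt{limit} > 0;

    return \%opt;
}

my $opt = parse_argv([qw(--input=bad.xml --output=good.xml)]);
print "$opt->{input} -> $opt->{output} (limit $opt->{limit})\n";
```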

Here is the main method of the script. It is important not to do any encoding conversion on the input to avoid confusing libexpat.

sub run {
    my $argv = shift;
    my $opt = parse_argv($argv);
    my $xml = $opt->{input}->slurp(iomode => '<:raw');

    my $tries;

    TRY:
    for ($tries = 0; $tries < $opt->{limit}; $tries += 1) {
        my $pos = try_parse( \$xml );
        last TRY if not defined ($pos);

        $xml = join('&lt;',
            substr($xml, 0, $pos),
            substr($xml, $pos + 1)
        );
    }

    if ($tries < $opt->{limit}) {
        warn "'$opt->{input}' well-formed after $tries corrections\n";
    }
    else {
        croak "'$opt->{input}' still malformed after $tries corrections\n";
    }

    file($opt->{output})->spew(iomode => '>:raw', $xml);

    return;
}
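The loop parses the document again after every replacement rather than collecting all positions first. One reason is that the parser stops at the first error. Another is that each fix swaps one byte for four, so every later offset moves by three. A small sketch of the correction step on its own, using four-argument substr instead of join (escape_lt_at is a name I made up for this example):

```perl
use strict;
use warnings;

# Replace the single byte at offset $pos with the entity &lt; and
# return the modified copy.
sub escape_lt_at {
    my ($xml, $pos) = @_;
    substr($xml, $pos, 1, '&lt;');
    return $xml;
}

my $bad = '<tag v="a < b < c" />';

# Offset of the first stray <, skipping the one that opens the tag.
my $first = index($bad, '<', 1);
my $once  = escape_lt_at($bad, $first);

print "$once\n";
printf "next stray < moved from %d to %d\n",
    index($bad, '<', $first + 1), index($once, '<', 1);
```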

The try_parse subroutine does the work and book-keeping:

sub try_parse {
    my $xml = shift;
    my $parser = XML::Parser->new;
    return if eval {
        $parser->parse( $$xml );
        1;
    };
    my $err = $@;
    if ($err =~ m{invalid \s+ token .+? byte \s+ ([0-9]+) }x ) {
        my $pos = $1;
        if ( substr($$xml, $pos, 1) eq '<' ) {
            return $pos;
        }
    }
    croak "Unexpected XML::Parser exception: '$err'";
}

I tried the script on a simple document containing 70 errors:

<?xml version="1.0" encoding="utf-8"?>
<document>
    <tag v="< â©αΓ" />
    <tag v="< â©αΓ" />
    <tag v="< â©αΓ" />
    <tag v="< â©αΓ" />
    <tag v="< â©αΓ" />
    ...

The script takes about half a second on my ancient laptop to turn that into:

<?xml version="1.0" encoding="utf-8"?>
<document>
    <tag v="&lt; â©αΓ" />
    <tag v="&lt; â©αΓ" />
    <tag v="&lt; â©αΓ" />
    <tag v="&lt; â©αΓ" />
    <tag v="&lt; â©αΓ" />
    ...

using

timethis perl fixlt.pl --input=bad.xml --output=good.xml
...
'bad.xml' well-formed after 70 corrections
...
TimeThis :  Elapsed Time :  00:00:00.406

PS: The full script is available on GitHub.

PPS: No comments for now, but you can discuss this post on Reddit.