Highlight Perl source code using PPI::HTML

On my blog, I use the excellent highlight.js library to apply syntax highlighting to source code in the browser. This lets me copy and paste source code directly into the post (enclosed in a [% FILTER html %] block) instead of having to transform it somehow. It also keeps the number of tag-enclosed pieces of text to a minimum, so the original DOM stays simple which, one hopes, means faster downloads and faster initial rendering.

A recent Stack Overflow question introduced me to the PPI::HTML module, which uses the amazing PPI module to parse Perl source code and associate CSS classes with the various elements.

If you ask the module to produce a complete HTML page, it will also embed the relevant CSS in the page, and will produce a pretty, colorful document. By default, the class names are rather verbose, and the module offers limited flexibility, but PPI::HTML::CodeFolder provides some enhancements that may be useful.
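
For reference, asking for a complete page looks roughly like the sketch below. The page, line_numbers, and colors options are used here as I understand them from the PPI::HTML documentation. The sample source string and the two colors are made up for illustration.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use PPI;
use PPI::HTML;

# A made-up snippet to highlight; any Perl source would do.
my $source = <<'END_PERL';
my $greeting = "Hello, world";
print "$greeting\n";
END_PERL

# page => 1 requests a whole HTML document rather than a fragment.
# colors maps CSS class names to colors (an illustrative subset only).
my $highlighter = PPI::HTML->new(
    page         => 1,
    line_numbers => 1,
    colors       => {
        keyword => '#0000FF',
        double  => '#999999',
    },
);

# html() is given a reference to the source text, as in the script below.
my $page = $highlighter->html(\$source);
die "PPI::HTML could not highlight the source\n" unless defined $page;

print $page;
```

Redirecting the output to a file and opening it in a browser should be enough to see the result.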

What if you wanted to produce a self-contained chunk of syntax-highlighted Perl without depending on external CSS or JavaScript? In that case, you can resort to a somewhat grungy technique I use when generating HTML email: post-process the HTML to replace class attributes on elements with style attributes.

Here is an example Perl script which generates a syntax-highlighted version of its own source code:

#!/usr/bin/env perl

use strict;
use warnings;

use PPI;
use PPI::HTML;
use HTML::TokeParser::Simple;

my %colors = (
    cast => '#339999',
    comment => '#008080',
    core => '#FF0000',
    double => '#999999',
    heredoc_content => '#FF0000',
    interpolate => '#883333',
    keyword => '#0000FF',
    line_number => '#666666',
    literal => '#999999',
    magic => '#0099FF',
    match => '#9900FF',
    number => '#990000',
    operator => '#DD7700',
    pod => '#008080',
    pragma => '#990000',
    regex => '#9900FF',
    single => '#664444',
    substitute => '#9900FF',
    transliterate => '#9900FF',
    word => '#40c080',
);

my $highlighter = PPI::HTML->new(line_numbers => 0);
my $html = $highlighter->html(\ do { local $/; open 0; <0> });

print qq{<pre style="background-color:#fff;color:#000">},
      map_class_to_style($html, \%colors),
      qq{</pre>\n}
;

sub map_class_to_style {
    my $html = shift;
    my $colors = shift;

    my $parser = HTML::TokeParser::Simple->new(string => $html);
    my $out;

    while (my $token = $parser->get_token) {
        next if $token->is_tag('br');
        my $class = $token->get_attr('class');
        if ($class) {
            $token->delete_attr('class');
            if (defined(my $color = $colors->{$class})) {
                # shave off some characters if possible
                $color =~ s{
                    \A \#
                    ([[:xdigit:]])\1
                    ([[:xdigit:]])\2
                    ([[:xdigit:]])\3
                    \z
                }{#$1$2$3}x;
                $token->set_attr(style => "color:$color");
            }
        }
        $out .= $token->as_is;
    }
    $out;
}

And the output, in a rather distasteful color scheme, I admit:

The original script is 1,690 bytes. The syntax-highlighted chunk above, on the other hand, is 8,599 bytes, roughly five times the size (an increase of about 409%).
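
As a rough check of the size figures, and of what the color-shortening step in the script can save, here is a small sketch in core Perl. The helper names percent_increase and shorten_hex are hypothetical, introduced only for this illustration. The substitution is the same one map_class_to_style applies.

```perl
use strict;
use warnings;

# Hypothetical helper: percentage growth from $before to $after bytes.
sub percent_increase {
    my ($before, $after) = @_;
    return ($after - $before) / $before * 100;
}

# Hypothetical helper wrapping the substitution used in the script:
# #RRGGBB becomes #RGB only when each pair repeats a single hex digit.
sub shorten_hex {
    my ($color) = @_;
    $color =~ s{
        \A \#
        ([[:xdigit:]]) \1
        ([[:xdigit:]]) \2
        ([[:xdigit:]]) \3
        \z
    }{#$1$2$3}x;
    return $color;
}

# Byte counts quoted in the post.
printf "%.1f%% increase\n", percent_increase(1_690, 8_599);

# Colors taken from the color table in the script.
for my $color ('#FF0000', '#339999', '#40c080') {
    printf "%s -> %s\n", $color, shorten_hex($color);
}
```

A color such as #40c080 is left alone because its digit pairs do not repeat, so the saving is three bytes per styled element at best.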

PS: You can discuss this post on /r/perl.