Splitting long lines is slow

A. Sinan Unur

I received thoughtful responses to my earlier post Large gzipped files, long lines, extracting columns etc.

I am not going to address everything in detail, but for the time being, let's stick with the phase where I dissect the files. Once I decide what to keep, then I can think about whether putting them in a database makes sense. For now, we are talking about roughly 1,000 compressed files, taking roughly 20 GB compressed, and 100 GB uncompressed. But, keep in mind that we are talking about one static snapshot I can play with rather than the entire universe of data.

I can think of many kinds of analyses on these data. A good analogy are web server logs. You can analyze them request-by-request. You can aggregate them on a resource level. You can aggregate requests by response code. Or, you can try to reconstruct users' paths through your site. I just do not know up-front what kind of analysis is a good fit for what we want to learn from these logs. I would like to try everything I can think of and see what I can visualize.

As I mentioned in my previous post, it takes about 3 seconds to decompress one of the files using gunzip and it takes less than a second to go through it line by line using PerlIO::gunzip.

So, about a second per compressed file is the best I can expect on this hardware.

Now, if you just add a split to the mix, the processing time for each file shoots up to about 20 seconds.

So, I got a little obsessed with improving the speed of extracting the columns I want.

Sure, it helped to generate an anchored pattern that captured the fields in which I was interested, but I wanted to know what I could gain by living dangerously.

So, I ran for Inline::C to write a routine that would work if it got ASCII lines with semicolon separated fields:

use Inline C => <<'END_C'

/* This code 'works' only in a limited set of circumstances! */

SV *
extract_fields(char *line, AV *wanted_fields)
{
    int ch;
    IV current_field = 0;
    IV wanted_field = -1;

    char *cursor = line;
    char *field_begin = line;
    char *save_field_begin;

    STRLEN field_len = 0;
    IV i_wanted = 0;
    IV n_wanted = av_len(wanted_fields);

    AV *ret = newAV();

    while (i_wanted <= n_wanted) {
        SV **p_wanted = av_fetch(wanted_fields, i_wanted,0);
        if (!(*p_wanted)) {
            croak("av_fetch returned NULL pointer");
        }
        wanted_field = SvIV(*p_wanted);

        while ((ch = *(cursor++))) {

            if (!((ch == ';') || (ch == '\n'))) {
                continue;
            }

            field_len = cursor - field_begin - 1;
            save_field_begin = field_begin;
            field_begin = cursor;

            current_field += 1;

            if (current_field != wanted_field) {
                continue;
            }

            av_push(ret, newSVpvn(save_field_begin, field_len));
            break;
        }
        i_wanted += 1;
    }

    return newRV_noinc((SV *) ret);
}

END_C
;

With this function, the following code only takes about 1.5 seconds to work through an average file:

sub process_file {
    my ($filename, $wanted_fields) = @_;

    open my $in_fh, '<:gzip', $filename
        or die "Cannot open '$filename': $!";

    while (my $line = <$in_fh>) {
        my $vals = extract_fields($line, $wanted_fields);
    }

    close $in_fh
        or die "Cannot close '$filename': $!";

    return;
}

Using the extract_fields function above reduced the processing time for an individual file down to about 1.5 seconds. That's about 10% of the time taken when using split in the loop, and about half the time used with a pre-constructed capturing regular expression.

And, performance does not seem to deteriorate as quickly as it does with the regular expression method when extracting fields towards the end of the string.

Of course, I do not know the first thing about how to traverse a string UTF8 character by UTF8 character ;-)