How to sum data from multiple files in Perl?

Stack Overflow is not that much fun any more, but I still look there to see if there are any interesting questions. Sometimes, relatively ordinary-looking questions lead to useful insight. One such example is the question titled Perl : Adding 2 files line by line.

In that question, a Perl beginner is working with two files, each of which holds a single column of numbers, and the two files have different numbers of observations. The programmer wanted to sum the files row-wise, passing the remaining rows of the longer file through unchanged once the shorter one runs out. If one file has n1 observations and the other has n2, the resulting data set will therefore have max(n1, n2) observations.
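For instance, with two hypothetical input files (the names and values below are made up for illustration), the rows are paired up and the leftover rows of the longer file are carried through:

    a.txt    b.txt    output
    1        10       11
    2        20       22
    3                 3
    4                 4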

As is typical for a beginner, the OP’s attempt involved reading each file into a separate array, and the answer s/he accepted follows the same pattern. For toy examples this is not horrible, but it has the drawback of making the memory footprint of the program proportional to the combined size of the input files, which clearly won’t work when dealing with large files.
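A rough sketch of that approach (my own reconstruction, not the accepted answer verbatim) looks like this, with the two file names taken from the command line:

#!/usr/bin/env perl

use strict;
use warnings;

@ARGV == 2 or die "Usage: $0 file1 file2\n";
my ($file1, $file2) = @ARGV;

# Slurp both columns into memory -- this is the part that does not scale.
open my $fh1, '<', $file1 or die "Cannot open '$file1': $!";
my @col1 = <$fh1>;
close $fh1;

open my $fh2, '<', $file2 or die "Cannot open '$file2': $!";
my @col2 = <$fh2>;
close $fh2;

# Walk out to the length of the longer column, treating missing rows as 0.
my $n = @col1 > @col2 ? scalar @col1 : scalar @col2;

for my $i (0 .. $n - 1) {
    my $sum = ($col1[$i] // 0) + ($col2[$i] // 0);
    print "$sum\n";
}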

Chris Charley offered a somewhat better solution to the problem:

use strict;
use warnings;

my @sums;
my $i = 0;
while (my $num = <>) {
    $sums[$i++] += $num;
    $i = 0 if eof;    # plain eof (no parentheses) is true at the end of each individual file read via <>
}

print "\n" for @sums;

It is a good use of both the diamond operator and Perl’s eof. However, the @sums array still grows with the longest input file, so the memory footprint remains proportional to the size of the largest input. If you have two files that each take 1 GB of memory when read into an array, and you only have 2 GB of memory available to perl, you are going to have difficulties. Even if you have oodles of memory, there may be other processes competing for it, or you may want to be able to run multiple instances of this program for separate sets of data files. On my 64-bit Windows system, an array containing 30,000,000 numbers takes about a gigabyte of memory, so you can run into limits pretty easily.
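One way to check a figure like that on your own system is with Devel::Size and Number::Bytes::Human; here is a rough throwaway script (integers generated from a range are only an approximation of numbers read from a file, so treat the result as a ballpark):

#!/usr/bin/env perl

use strict;
use warnings;

use Devel::Size qw( total_size );
use Number::Bytes::Human qw( format_bytes );

# Build an array of 30,000,000 numbers and report how much memory it occupies.
my @numbers = (1 .. 30_000_000);

print format_bytes( total_size( \@numbers ) ), "\n";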

In these kinds of situations, you’d be better off writing a program whose memory footprint depends on the number of files rather than the size of the largest file. The idea is to read one observation from each file that still has data, sum those observations, and print the result before moving on to the next row. It sounds simple, but judging by the number of times I have found processes bottlenecked by habitually reading entire data sets into memory, surprisingly few people think of it.

So, here is a short program that will sum the rows of any number of files your system allows you to have open. The script expects the names of the files on the command line.

#!/usr/bin/env perl

use strict;
use warnings;

use Carp qw( croak );
use List::Util qw( sum );
use Path::Tiny;

run( @ARGV );

sub run {
    # Build one reader closure per input file named on the command line.
    my @readers = map make_reader( $_ ), @_;

    # Keep going for as long as at least one reader still produces a line.
    while (my @obs = grep defined, map $_->(), @readers) {
        print sum(@obs), "\n";
    }

    return;
}

sub make_reader {
    my $fname = shift;

    # openr opens the file read-only and throws if it cannot.
    my $fhandle = path( $fname )->openr;
    my $is_readable = 1;

    # Return a closure that yields one line per call, and nothing once
    # the file has been exhausted and closed.
    return sub {
        return unless $is_readable;

        my $line = <$fhandle>;
        return $line if defined $line;

        close $fhandle
            or croak "Cannot close '$fname': $!";

        $is_readable = 0;
        return;
    };
}
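Assuming the script is saved as, say, sum_columns.pl (the name is my own), you would run it with the data files as arguments and redirect the output wherever you need it:

    $ perl sum_columns.pl a.txt b.txt c.txt > sums.txt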

All the useful action happens in make_reader. For each filename passed on the command line, we use make_reader to open the file (Path::Tiny’s openr opens a file in read-only mode and croaks on failure). make_reader then returns a closure which attempts to read a line from that file handle each time it is invoked; once we reach the end of the file, or some other error occurs, it returns an undefined value.
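To see the protocol in isolation, here is how a single reader behaves (the file name is made up for the example):

    my $next_line = make_reader( 'a.txt' );

    while (defined( my $line = $next_line->() )) {
        print "read: $line";
    }

    # Every call from here on returns undef; the handle is already closed.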

In the main loop:

    while (my @obs = grep defined, map $_->(), @readers) {
        print sum(@obs), "\n";
    }

we select just the defined values returned by each reader. If there are no defined values left, the while loop terminates and we are done. In all probability (check $! after readline if you are paranoid, or if you are reading from sockets and the like), this means we have exhausted all the files and printed all the row-wise sums successfully.
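If you do want that extra check, perldoc -f readline shows the idiom: test eof before reading, and treat an undef from readline before end-of-file as an error. Adapted to the reader closure, the sketch would look roughly like this:

    return sub {
        return unless $is_readable;

        # An undef from readline before eof indicates a read error.
        unless ( eof $fhandle ) {
            defined( my $line = readline $fhandle )
                or croak "Error reading '$fname': $!";
            return $line;
        }

        close $fhandle
            or croak "Cannot close '$fname': $!";

        $is_readable = 0;
        return;
    };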

At any point in time, we only hold the @readers and @obs arrays, and n lines from n files in memory. That means, even if you are dealing with a hundred files each of which contains a billion rows, you are not going to run out of memory. It can also mean that you can run similar processes for many clients (i.e. people who are paying you to do stuff for them) at a time. In most circumstances, the reads will not touch the disk very often, because the operating system will already have enough memory to cache the hot files. So, any way you look at it, it’s a win to conserve memory even in this age of abundance (we have come a long way — I can’t believe what we used to be able to do with 64K let alone 640K ;-).

You can discuss this post on r/perl.