Can Parallel::ForkManager speed up a seemingly IO bound task?

This post is motivated by a Stack Overflow question titled Faster way to do “perl -anle ‘print $F[1]’ *manyfiles* > result” (‘cut’ fails). I cautiously suggested there that reading the files in parallel might help. brian d foy pointed out that it might not do much, given that the task seems to be IO bound. I admit I don’t understand much about filesystem caches, but I thought reading input in parallel might make better use of them, so I decided to test whether that made sense. A preliminary check on Windows seemed to show that using Parallel::ForkManager with two processes reduced the run time by 40%. But we all know that Windows is a little funky when it comes to forking, so I rebooted into Linux and tried there.

The tests were run on my aging laptop with an ancient Core Duo processor and 2 GB of physical memory. No, I still haven’t replaced it, although I do also use a newer Mac. Both perls were 5.14.2. The Windows system was XP Professional SP3 and the Linux system was Arch Linux with the latest updates. I am only going to show results from the Linux runs below.

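First, I generated ten files with 1,000,000 lines each using the following short script. It writes one file's worth of lines to standard output, so it has to be run once per file, with the output redirected to the names 01 through 10 that the second script expects: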

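#!/usr/bin/env perl

use strict; use warnings;

# Print 1,000,000 lines, each holding two fields of twenty 'a's.
for (1 .. 1_000_000) {
    my $str = '';
    # Roughly one line in five gets 0 to 9 leading spaces.
    if (0.2 > rand) {
        $str .= ' ' x rand(10);
    }
    $str .= 'a' x 20 . ' ' . 'a' x 20;
    print $str, "\n";
}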
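
Since the generator has to produce ten separate files, here is a minimal sketch of a variant that writes all of them in one go. The helper names make_line and write_file are hypothetical, and so is the command-line line count. The file names 01 through 10 are taken from the processing script. This is not necessarily how the files for the test were actually produced.

```perl
#!/usr/bin/env perl

use strict; use warnings;

# Build one line in the same shape as the generator above
# (hypothetical helper, not part of the original script).
sub make_line {
    my $str = '';
    $str .= ' ' x rand(10) if 0.2 > rand;
    return $str . 'a' x 20 . ' ' . 'a' x 20 . "\n";
}

# Write $count generated lines to $path (hypothetical helper).
sub write_file {
    my ($path, $count) = @_;
    open my $out, '>', $path or die "Cannot write '$path': $!";
    print {$out} make_line() for 1 .. $count;
    close $out or die "Cannot close '$path': $!";
}

# Only generate when a line count is given on the command line,
# for example: perl generate_all.pl 1000000
if (my $count = shift @ARGV) {
    write_file($_, $count) for '01' .. '10';
}
```

Run with 1000000 as the argument, this should leave ten files named 01 through 10 in the current directory.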

I then used the following script to read all the files and capture the second field:

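#!/usr/bin/env perl

use strict; use warnings;

use Parallel::ForkManager;

# Maximum number of child processes to run at once.
my ($maxproc) = @ARGV;
die "Usage: $0 maxproc\n" unless defined $maxproc;

my @files = ('01' .. '10');

my $pm = Parallel::ForkManager->new($maxproc);

for my $file (@files) {
    # The parent gets the child's pid and moves on to the next file;
    # the child gets 0 and falls through to do the work.
    my $pid = $pm->start and next;
    my $ret = open my $h, '<', $file;

    unless ($ret) {
        warn "Cannot open '$file': $!";
        $pm->finish(1);  # exits the child with a non-zero status
        next;            # only reached when no child was forked
    }

    # Print the second whitespace-separated field of each line.
    while (my $line = <$h>) {
        next unless $line =~ /^\s*\S+\s+(\S+)/;
        print "$1\n";
    }

    $pm->finish;
}

$pm->wait_all_children;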
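
To make the field extraction concrete, here is a small sketch that applies the same regular expression to a few sample lines. The helper name second_field and the sample strings are made up for illustration. Leading whitespace is skipped, much as the -a switch does in the original one-liner. One difference from the one-liner, as far as I can tell, is that lines without a second field are skipped here, whereas print $F[1] would print an empty line for them.

```perl
#!/usr/bin/env perl

use strict; use warnings;

# Return the second whitespace-separated field of $line,
# or undef if the line has fewer than two fields.
# (Hypothetical helper wrapping the regex used above.)
sub second_field {
    my ($line) = @_;
    return $line =~ /^\s*\S+\s+(\S+)/ ? $1 : undef;
}

# Illustrative inputs: plain, with leading spaces, and with one field only.
for my $line ("foo bar baz\n", "   foo bar\n", "lonely\n") {
    my $field = second_field($line);
    print defined $field ? "$field\n" : "(no second field)\n";
}
```

The first two sample lines both yield bar, and the third has no second field.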

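Here are the results. Before each run, I flushed pending writes with sync and dropped the kernel's caches as root, so every run starts cold: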

# sync
# echo 3 > /proc/sys/vm/drop_caches
$ /usr/bin/time -f '%Uuser %Ssystem %Eelapsed %PCPU' ./process.pl 1 > output
24.44user 0.93system 0:29.08elapsed 87%CPU

$ rm output
# sync
# echo 3 > /proc/sys/vm/drop_caches
$ /usr/bin/time -f '%Uuser %Ssystem %Eelapsed %PCPU' ./process.pl 2 > output
24.95user 0.91system 0:18.31elapsed 141%CPU

$ rm output
# sync
# echo 3 > /proc/sys/vm/drop_caches
$ /usr/bin/time -f '%Uuser %Ssystem %Eelapsed %PCPU' ./process.pl 4 > output
24.70user 0.88system 0:17.45elapsed 146%CPU

$ rm output
# sync
# echo 3 > /proc/sys/vm/drop_caches
$ /usr/bin/time -f '%Uuser %Ssystem %Eelapsed %PCPU' ./process.pl 1 > output
25.31user 0.95system 0:29.72elapsed 88%CPU

The results were consistent through the handful of runs I tried.
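
The elapsed times above can be turned into speedup figures with a little arithmetic. The sketch below does that, taking the first single-process run as the baseline (my choice here; the second one, at 29.72 seconds, would give slightly larger numbers). The helper names speedup and reduction_pct are hypothetical.

```perl
#!/usr/bin/env perl

use strict; use warnings;

# Elapsed seconds copied from the timings above, keyed by process count.
# The value for 1 is the first of the two single-process runs.
my %elapsed = (1 => 29.08, 2 => 18.31, 4 => 17.45);

# How many times faster $time is than $base.
sub speedup {
    my ($base, $time) = @_;
    return $base / $time;
}

# Percentage of wall-clock time saved relative to $base.
sub reduction_pct {
    my ($base, $time) = @_;
    return 100 * (1 - $time / $base);
}

for my $n (sort { $a <=> $b } keys %elapsed) {
    printf "%d process(es): %.2fs elapsed, %.2fx speedup, %.0f%% less wall time\n",
        $n, $elapsed{$n},
        speedup($elapsed{1}, $elapsed{$n}),
        reduction_pct($elapsed{1}, $elapsed{$n});
}
```

With these numbers, two processes come out at roughly a 1.59x speedup (about 37% less elapsed time) and four processes at roughly 1.67x (about 40% less).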

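So, even if a task seems IO bound, it may pay to use all the cores you have. In this case the single-process run already spent about 24 of its 29 seconds in user CPU time (87% CPU), so the job was less IO bound than it looked, and the second core cut elapsed time to about 18 seconds. Going from two processes to four gained less than a second more on this two-core machine.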