C++: Walking the filesystem with Boost

Last time, I left off after putting together a short but complete program that processed command line arguments and showed a help message.

Eventually, the program is supposed to walk directory trees, collecting information on various files to help me find duplicates in my photo collection.

By default, that program displayed the help message if no arguments were given. With the objective of the program in mind, I decided, instead, that it should just process the directory tree starting from the current directory.

Of course, the user should be able to specify a different directory, or several top-level directories to traverse. Therefore, I added another option specification:

(
    "dirs,d",
    po::value< std::vector< std::string > >()
        ->default_value(std::vector< std::string >(1, "."), ".")
        ->multitoken(),
    "Search under these directories. Defaults to current directory."
)

This says the default value for --dirs is a vector holding a single string ".". The second "." provides the textual representation of the default value so it can be printed.

With this specification in place, if the user does not specify a --dirs argument, it defaults to the current directory. The user can specify multiple --dirs arguments as in file-stats-collector --dirs /dir/one --dirs two/three/four, or list all directories after a single --dirs: file-stats-collector --dirs /dir/one two/three/four.

But, I did not want to have to specify --dirs at all. For that, one needs to create a positional argument description object, specifying which positional arguments correspond to which named arguments in the options_description we are using:

po::positional_options_description pos_desc;
pos_desc.add("dirs", -1);

This directs the parser to consider all non-named (i.e. positional) command line arguments to belong to the --dirs option. This way, I can invoke the command as file-stats-collector /dir/one two/three/four.

To enable this magic, we need to switch to using a command_line_parser instance:

try {
    po::store(
        po::command_line_parser(argc, argv).
            options(desc).positional(pos_desc).run(),
        args
    );
}

If there are any directories to process, and we are guaranteed to have at least one unless the --help option was specified, we call a function to walk said directories:

void
process_program_options(const int argc, const char *const argv[])
{
    po::options_description desc("Allowed options");
    desc.add_options()
        (
            "dirs,d",
            po::value< std::vector< std::string > >()
                ->default_value(std::vector< std::string >(1, "."), ".")
                ->multitoken(),
            "Search under these directories. Defaults to current directory."
        )
        (
            "help,h",
            po::value< std::string >()
                ->implicit_value("")
                ->notifier(
                    [&desc](const std::string& topic)
                    {
                        show_help(desc, topic);
                    }
                ),
            "Show help. If given, show help on the specified topic."
        )
    ;

    po::positional_options_description pos_desc;
    pos_desc.add("dirs", -1);

    po::variables_map args;

    try {
        po::store(
            po::command_line_parser(argc, argv).
                options(desc).positional(pos_desc).run(),
            args
        );
    }
    catch (po::error const& e) {
        std::cerr << e.what() << '\n';
        exit( EXIT_FAILURE );
    }
    po::notify(args);

    if (args.count("dirs")) {
        walk_dirs(args["dirs"].as< std::vector< std::string > >());
    }

    return;
}

walk_dirs uses recursive_directory_iterator from Boost.Filesystem. At this point, it just counts the number of plain files under the given directories:

void
walk_dirs(const std::vector< std::string >& dirs)
{
    size_t n_files(0);
    for (const auto& dir : dirs) {
        try {
            auto walker = fs::recursive_directory_iterator(fs::path(dir));
            for (const auto& entry : walker) {
                if (fs::is_regular_file(entry)) {
                    n_files += 1;
                }
            }
        }
        catch (const std::exception& x) {
            std::cerr << "Error accessing " << dir << "\n\t" << x.what() << '\n';
        }
    }
    std::cout << n_files << " files\n";
}

I am repeatedly amazed at how self-documenting simple, modern C++ code has become.

A test run:

file-stats-collector C:\some\path
71923 files

From a cold boot, it took one minute and 57 seconds to count those files on my clunky Windows 8 laptop.

That is about 0.0016 seconds per file.

All subsequent runs clocked in at under 10 seconds, reducing that time to less than 0.00014 seconds per file, an improvement of more than 91%. Caches are wonderful things.
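The per-file figures are just elapsed time divided by file count. Here is a small sketch of that arithmetic, using the 117 seconds and 71923 files from the test run. The warm time is taken as exactly 10 seconds, which is an assumption: the runs were only reported as "under 10 seconds", so it is an upper bound. The function and constant names are made up for illustration.

```cpp
#include <cstddef>

// Figures from the test run above.
constexpr double cold_seconds = 117.0;    // one minute and 57 seconds
constexpr double warm_seconds = 10.0;     // assumption: upper bound on the warm runs
constexpr std::size_t n_files = 71923;

// Average wall-clock seconds spent per file.
double
seconds_per_file(double total_seconds, std::size_t file_count)
{
    return total_seconds / static_cast< double >(file_count);
}

// Percentage by which the elapsed time dropped going from `before` to `after`.
double
percent_reduction(double before, double after)
{
    return 100.0 * (before - after) / before;
}
```

Without intermediate rounding, seconds_per_file(cold_seconds, n_files) comes out near 0.0016, seconds_per_file(warm_seconds, n_files) near 0.00014, and percent_reduction(cold_seconds, warm_seconds) near 91.5.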

A comparable Perl program:

#!/usr/bin/env perl

use strict;
use warnings;

use File::Find;

run(\@ARGV);

sub run {
    my $argv = shift;
    for my $dir (@$argv) {
        sub {
            my $count = 0;
            find(
                sub {
                    -f $File::Find::name
                        or return;
                    $count += 1;
                },
                $_[0]
            );
            print "$count files\n";
        }->($dir);
    }
}

took 2 minutes and 14 seconds from a cold boot. This time went down to 27 seconds in repeated runs. So, in this case, Perl’s overhead seems to be about 17 seconds, and the IO bottleneck adds about a minute and 47 seconds. Hmmm.

I close this post with a puzzle. For this particular tree, the Perl program above consistently counts 26 more plain files than the Boost.Filesystem version. Why is that? I do not know yet, but it would be good to figure it out.
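One way to start on the puzzle would be to have each program write out the files it counts, and then diff the two lists. The sketch below covers the C++ side only. It is written against C++17 std::filesystem rather than Boost.Filesystem so that it stands alone. That is a substitution made here for illustration, and the two libraries may not classify every entry identically, so for a real comparison the same loop would have to be ported to Boost.Filesystem. list_regular_files and write_file_list are hypothetical names.

```cpp
#include <algorithm>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Collect the path of every entry under root that reports as a regular file.
// Errors are reported through error_code instead of exceptions; the first
// iteration error ends the walk, and whatever was found so far is returned.
std::vector< std::string >
list_regular_files(const fs::path& root)
{
    std::vector< std::string > paths;
    std::error_code ec;
    fs::recursive_directory_iterator it(root, ec);
    const fs::recursive_directory_iterator end;

    while (!ec && it != end) {
        std::error_code status_ec;
        if (it->is_regular_file(status_ec)) {
            // generic_string() uses '/' separators on every platform.
            paths.push_back(it->path().generic_string());
        }
        it.increment(ec);
    }

    // Sort so that two listings can be compared line by line.
    std::sort(paths.begin(), paths.end());
    return paths;
}

// Write the sorted list to out_file, one path per line.
// Returns false if the output stream ended up in a failed state.
bool
write_file_list(const fs::path& root, const fs::path& out_file)
{
    std::ofstream out(out_file);
    for (const auto& path : list_regular_files(root)) {
        out << path << '\n';
    }
    return static_cast< bool >(out);
}
```

On the Perl side, printing $File::Find::name for each counted file and sorting the output would give the list to compare against. The output file should live outside the tree being walked, otherwise it shows up in its own listing.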