Descriptive Stats with C++ and Boost

Boost is a collection of mostly header-only libraries that let mere mortals like me free-ride on years of accumulated wisdom from talented C++ programmers. It is also rather straightforward to use on Windows, which is a benefit for people like me who do stuff on all three major operating systems.

One Boost library I hadn't used at all was Boost.Accumulators. The other day, I was wondering how much work it would take to write a small program to calculate simple descriptive statistics on a column of numerical data. Something like the following Perl script:

#!/usr/bin/env perl

use strict;
use warnings;

use Statistics::Descriptive;

run(@ARGV);

sub run {
    local @ARGV = @_;

    my @stats_to_methods = (
        [ qw(min min) ],
        [ qw(mean mean) ],
        [ qw(max max) ],
        [ qw(range sample_range) ],
        [ qw(var variance) ],
        [ qw(sd standard_deviation) ],
        [ qw(N count) ],
    );

    my $stats = Statistics::Descriptive::Sparse->new;
    while (my $line = <>) {
        my ($num) = ($line =~ /(\S+)/);
        $stats->add_data(0 + $num);
    }

    for my $s ( @stats_to_methods ) {
        my ($v, $m) = @$s;
        printf("%s= %g\n", $v, $stats->$m)
    }

    return 0;
}

But, I don't like touching Statistics::Descriptive, because it calculates the mean and variance using naive methods, as can be seen in the examples below:

C:\...\stats> perl descriptive.pl
10000000004
10000000007
10000000013
10000000016
^Z
min= 1e+010
mean= 1e+010
max= 1e+010
range= 12
var= 21845.3
sd= 147.802
N= 4

and

C:\...\stats> perl descriptive.pl
100000000004
100000000007
100000000013
100000000016
^Z
min= 1e+011
mean= 1e+011
max= 1e+011
range= 12
var= 0
sd= 0
N= 4

Before proceeding further, I should note that programmer and engineer types seem to be confused about the meaning of the word variance. I am not going to go too deeply into this, but it is worth mentioning because Boost.Accumulators and Statistics::Descriptive calculate different (but, of course, related) numbers.

Statistics::Descriptive calculates the unbiased sample estimator of the population variance, assuming each observation was drawn iid from a population: the sum of squared deviations from the mean divided by N - 1. Boost.Accumulators, however, calculates the variance of the random variable that takes on the values in the provided set of numbers with given (in this case equal) probabilities: the sum of squared deviations from the mean divided by N. Assuming the set of numbers constitutes an iid sample of observations from a larger population, we can obtain the estimator of the population variance from the number calculated by Boost.Accumulators simply by multiplying it by N/(N - 1).

In addition, Boost.Accumulators uses the term sample to refer to what we would normally call an observation or data point.

I needed to mention that because you'll see that adjustment applied in the small demo below:

#include <iostream>
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics/stats.hpp>
#include <boost/accumulators/statistics/variance.hpp>

namespace bacc = boost::accumulators;

static void
demo(const double base)
{
    bacc::accumulator_set<
        double, bacc::stats<bacc::tag::variance(bacc::lazy)> > acc;

    acc(base + 4);
    acc(base + 7);
    acc(base + 13);
    acc(base + 16);

    auto n = bacc::count(acc);
    auto f = static_cast<double>(n) / (n - 1);

    std::cout << bacc::variance(acc)*f << '\n';

    return;
}


int
main(int argc, char *argv[])
{
    demo(0);
    demo(1e8);
    demo(1e9);
    demo(1e11);
    return 0;
}

I compiled this using the command line cl -IC:\opt\boost -O2 -EHsc ex.cpp and ran it:

30
29.3333
-170.667
0

At first, I was shocked that even a Boost library did not calculate variance correctly. Looking at the source code, I realized they should have used naive rather than lazy to describe the calculation method. Simply removing the lazy calculation method from the declaration, as in:

    bacc::accumulator_set<
        double,
        bacc::stats<bacc::tag::variance>
    > acc;

gives much better results:

30
30
30
30

With that hurdle out of the way, it didn't take much effort to come up with the following program. Note that std::map stores keys in sorted order (and the default order suits me), so there is no need to consider a vector of pairs. That's what I would have done if the names of the statistics did not sort into the order in which I wanted to print them.

#include <cctype>
#include <cmath>
#include <functional>
#include <iostream>
#include <map>
#include <stdexcept>
#include <string>

#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics/count.hpp>
#include <boost/accumulators/statistics/max.hpp>
#include <boost/accumulators/statistics/mean.hpp>
#include <boost/accumulators/statistics/min.hpp>
#include <boost/accumulators/statistics/variance.hpp>
#include <boost/accumulators/statistics/stats.hpp>

namespace bacc = boost::accumulators;

int
main(int argc, char *argv[])
{
    bacc::accumulator_set< double, bacc::stats<
            bacc::tag::min,
            bacc::tag::max,
            bacc::tag::mean,
            bacc::tag::variance > > acc;

    std::map<std::string, std::function<double()> > stats {
        { "min",   [&acc] () { return bacc::min(acc);  }},
        { "mean",  [&acc] () { return bacc::mean(acc); }},
        { "max",   [&acc] () { return bacc::max(acc);  }},
        { "range", [&acc] () { return bacc::max(acc) - bacc::min(acc); }},
        { "var",   [&acc] () {
                                 auto n = bacc::count(acc);
                                 double f = static_cast<double>(n) / (n - 1);
                                 return f * bacc::variance(acc);
                             }},
        { "sd",    [&stats] () { return std::sqrt(stats["var"]()); }},
    };

    std::string line;
    while (std::getline(std::cin, line)) {
        if (line.empty()) {
            continue;
        }
        try {
            size_t pos;
            double obs = std::stod(line, &pos);
            if ( line[pos] && !std::isspace(static_cast<unsigned char>(line[pos])) ) {
                throw std::invalid_argument(line);
            }
            acc(obs);
        }
        catch ( ... ) {
            std::cerr << "Failed to parse line:\n\t'" << line << "'\n";
            continue;
        }
    }

    for (auto s : stats) {
        std::cout << s.first << "= " << s.second() << '\n';
    }

    std::cout << "N = " << bacc::count(acc) << '\n';

    return 0;
}

A lot of this program is white-space, because I like oodles of it. There are also a lot of #include statements. The line count also went up because I wanted to make sure, with the least amount of effort, that typos such as 1..1 were not parsed as the two numbers 1. and .1. The program does only one thing: when given a set of numbers, it prints out some simple statistics. I decided I did want the estimator of the population variance:

C:\...\stats> cat example.data
4       1000000000004
7       1000000000007
13      1000000000013
16      1000000000016

I can then get my descriptive stats like:

C:\...\stats> cut -f 1 example.data | ex
max= 16
mean= 10
min= 4
range= 12
sd= 5.47723
var= 30
N = 4

or

C:\...\stats> cut -f 2 example.data | ex
max= 1e+12
mean= 1e+12
min= 1e+12
range= 12
sd= 5.47723
var= 30
N = 4

Wait ... I should have the median in there, too, shouldn't I?

Doing so requires adding one header:

#include <boost/accumulators/statistics/median.hpp>

one more item to the accumulator set:

    bacc::accumulator_set< double, bacc::stats<
            bacc::tag::min,
            bacc::tag::max,
            bacc::tag::mean,
            bacc::tag::median,
            bacc::tag::variance > > acc;

and one more entry in the lookup table for stats:

        { "median", [&acc] () { return bacc::median(acc); }},

which leads me to my next disappointment:

C:\...\stats> cut -f 1 example.data | ex
max= 16
mean= 10
median= 13
min= 4
range= 12
sd= 5.47723
var= 30
N = 4

As a programmer, you can see why you would end up with 13 as the median of the set of observations {4, 7, 13, 16}: programmers, especially those who come from the C tradition, think in terms of half-closed intervals [a, b). So, when you declare 13 to be the median, you get two subsamples, {4, 7} and {13, 16}. Isn't that enough?

No.

The median of a set with an even number of elements is the midpoint between the maximum of the lower half of the observations and the minimum of the upper half. That is, the median of this sample is, by definition, the midpoint between 7 and 13: 10.

Indeed, here is what Stata tells me:

. list x

     +----+
     |  x |
     |----|
  1. |  4 |
  2. |  7 |
  3. | 13 |
  4. | 16 |
     +----+

and

. summarize x, detail

                              x
-------------------------------------------------------------
      Percentiles      Smallest
 1%            4              4
 5%            4              7
10%            4             13       Obs                   4
25%          5.5             16       Sum of Wgt.           4

50%           10                      Mean                 10
                        Largest       Std. Dev.      5.477226
75%         14.5              4
90%           16              7       Variance             30
95%           16             13       Skewness              0
99%           16             16       Kurtosis           1.36

Here is what R tells me:

> x <- c(4, 7, 13, 16)
> median(x)
[1] 10

Here is what I get in Octave:

>> x = [4, 7, 13, 16];
>> median(x)
ans =  10

You can also ask WolframAlpha.

Now, I am not saying 10 is the median of this sample because all these programs say so. I am saying all these programs say 10 is the median of this sample because it by definition is the median of this sample.

Not 13.

Now, it turns out I get the correct answer if I instead tell Boost.Accumulators to calculate the median using the density estimation method, i.e. by estimating the median from a histogram:

    bacc::accumulator_set<double,
        bacc::stats<bacc::tag::median(bacc::with_density) > >
            acc_median( bacc::density_cache_size = 4, bacc::density_num_bins = 4 );

This, of course, is sensitive to both the cache size and the number of bins, so the accumulator has to be tailored to the actual number of observations you have. I understand why changing the number of bins affects the median calculation, but I am having a hard time understanding why changing the size of the density cache to 100 should result in the median being calculated as 1.15883e+228. I have an inkling the authors of the library and I are not using the word cache to mean the same thing.

In any case, the problem with the default median calculation seems to me to stem from an incorrect implementation of the original P2 algorithm. The first step in the algorithm is to calculate five values from the initial set of observations (we have only four, but that's OK). These values are:

q1 = minimum of the observations so far
q2 = current estimate of the first quartile
q3 = current estimate of the median
q4 = current estimate of the third quartile
q5 = maximum of the observations so far

So, with our sample of {4, 7, 13, 16}, these values become:

q1 = 4
q2 = 5.5
q3 = 10
q4 = 14.5
q5 = 16

If you come asking for the median at this point, you should get the correct answer, assuming the P2 algorithm was implemented correctly. But, when using Boost.Accumulators, you don't.

Adding a few more observations doesn't make things better. E.g.:

C:\...\stats> ex
4
4
4
7
13
16
16
16
^Z
max= 16
mean= 10
median= 6
min= 4
range= 12
sd= 5.78174
var= 33.4286
N = 8

The correct median is still 10, but now I get 6. Looking at p_square_quantile.hpp, I am reasonably certain where the problem is, but I am not confident I could fix the library without breaking something else elsewhere.

In fact, you might object to the small number of observations and tell me everything would work fine with a sufficiently large sample. To check that, I generated a sample with 2,000 observations:

perl -E "say for (4)x999; say for qw(7 13); say for (16)x999" > n2000.data

and I got:

C:\...\stats> ex < n2000.data
max= 16
mean= 10
median= 7.98084
min= 4
range= 12
sd= 5.99925
var= 35.991
N = 2000

I love C. I have wanted to love C++ for a long time, and starting with C++11, I have been feeling I can. I am not easily daunted by pages and pages of warning and error messages when I forget a namespace specifier: over the years, I have learned to spot what caused the avalanche. And I admire the dedication of every single C++ library writer. Boost itself is a monumental achievement.

It is, however, rather disappointing to find flies like this in the ointment. These things prevent one from confidently building on small steps.

The fact is, the test suite for the median in Boost.Accumulators consists solely of checking whether the median of 100,000 draws from N(1,1) is close enough to 1. Well, OK, sure, that matters, but it also matters whether your algorithm works in other situations.

I don't really have a conclusion. My little command line utility will not print a median, because I like the elegance of it as it stands: look, this is C++, and I have in there a mapping from strings to functions returning doubles, implemented as lambdas closing over my accumulator:

std::map<std::string, std::function<double()> > stats {
    { "min",    [&acc] () { return bacc::min(acc);    }},
    { "mean",   [&acc] () { return bacc::mean(acc);   }},
    { "max",    [&acc] () { return bacc::max(acc);    }},
    { "range",  [&acc] () { return bacc::max(acc) - bacc::min(acc); }},
    { "var",    [&acc] () {
                             auto n = bacc::count(acc);
                             double f = static_cast<double>(n) / (n - 1);
                             return f * bacc::variance(acc);
                         }},
    { "sd",    [&stats] () { return std::sqrt(stats["var"]()); }},
};

I mean, how neat is that?

You can discuss this post on HN or on r/programming.