Know what you are testing: The case of the test for median in Boost.Accumulators C++ Library

A few days ago, I wrote about my adventures with the Boost.Accumulators library. Since then, I have spent some time staring at the test code for the median algorithms included with that library. The provided test demonstrates many problems created by flawed attempts at doing TDD. Here is the relevant code:

void test_stat()
{
    // two random number generators
    double mu = 1.;
    boost::lagged_fibonacci607 rng;
    boost::normal_distribution<> mean_sigma(mu,1);
    boost::variate_generator<boost::lagged_fibonacci607&, boost::normal_distribution<> > normal(rng, mean_sigma);

    accumulator_set<double, stats<tag::median(with_p_square_quantile) > > acc;
    accumulator_set<double, stats<tag::median(with_density) > >
        acc_dens( density_cache_size = 10000, density_num_bins = 1000 );
    accumulator_set<double, stats<tag::median(with_p_square_cumulative_distribution) > >
        acc_cdist( p_square_cumulative_distribution_num_cells = 100 );

    for (std::size_t i=0; i<100000; ++i)
    {
        double sample = normal();
        acc(sample);
        acc_dens(sample);
        acc_cdist(sample);
    }

    BOOST_CHECK_CLOSE(1., median(acc), 1.);
    BOOST_CHECK_CLOSE(1., median(acc_dens), 1.);
    BOOST_CHECK_CLOSE(1., median(acc_cdist), 3.);
}

Terminology clarification: The author of the code uses the word sample to refer to individual data points. S/he is probably thinking of usages like samples/second that are common in other fields, but in statistics, the word sample refers to a set of observations. An observation is a point in some n-dimensional space whose axes correspond to the variables under consideration.

In the code above, the author is initializing a pseudo-random number generator and then transforming its output to produce draws from the normal distribution with unit mean and standard deviation. That immediately stood out to me: Why not use the standard normal, N(0,1)?

The code then initializes the three methods of calculating the median provided by Boost.Accumulators (of course, you are always free to store the observations in a std::vector and then sort it to find the median, but you really don’t want to do that very often with large amounts of data).
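For reference, here is a minimal sketch of that brute-force approach (my own helper, not something the library's test uses): sort a copy of the data and take the middle element, or the average of the two middle elements.

#include <algorithm>
#include <vector>

// Exact median of a non-empty data set: sort a copy, take the midpoint.
// O(n log n) time, and it needs every observation in memory -- which is
// precisely what the accumulator-based estimators are designed to avoid.
double exact_median(std::vector<double> v)
{
    std::sort(v.begin(), v.end());
    const std::size_t n = v.size();
    if (n % 2 == 1)
        return v[n / 2];
    return v[n / 2 - 1] / 2 + v[n / 2] / 2;
}

(std::nth_element would avoid the full sort, but the memory requirement is the real cost here.)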

It then generates 100,000 draws from N(1,1) and tests if all three accumulators yield medians close enough to 1.0.

This test raises a number of questions. First and foremost is “what is the third argument to BOOST_CHECK_CLOSE and why is it 3.0 for one check and not the others?” The Boost.Test documentation provides the answer:

BOOST_WARN_CLOSE(left, right, tolerance);
BOOST_CHECK_CLOSE(left, right, tolerance);
BOOST_REQUIRE_CLOSE(left, right, tolerance);

Last third parameter defines the tolerance for the comparison in percentage units.

I see, so, for some reason, the test code is telling me that median(with_p_square_cumulative_distribution) can be ±3% of the correct value. No, I don’t see. But, let’s leave that aside for now.
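To make the percentage semantics concrete, here is a small illustrative snippet of my own (not from the library's test suite), relying only on the documented behavior quoted above:

#define BOOST_TEST_MODULE tolerance_demo
#include <boost/test/included/unit_test.hpp>

BOOST_AUTO_TEST_CASE(tolerance_is_in_percent)
{
    // The third argument is a relative tolerance expressed in percent,
    // not an absolute difference.
    BOOST_CHECK_CLOSE(1.0, 1.009, 1.0);   // passes: ~0.9% relative difference
    BOOST_CHECK_CLOSE(1.0, 1.02,  3.0);   // passes: ~2% is within 3%
    // BOOST_CHECK_CLOSE(1.0, 1.02, 1.0); // would fail: ~2% exceeds 1%
}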

The percentage tolerance business explains why we are using N(1,1) instead of N(0,1): with a true median of 0, any nonzero estimate is infinitely far off in relative terms, so a percentage tolerance would be useless.

The really important questions center around the purpose of this code. Is it:

  1. testing lagged_fibonacci607?
  2. testing normal_distribution?
  3. testing variate_generator?
  4. or, is it testing the three methods of calculating the median of a data set?

That is, if this test fails, do we immediately know why it failed? And, can this test ever fail?

Let’s rewind: We are pseudo-randomly generating a sample of 100,000 observations and checking whether its median is close enough to 1.0. Let’s leave aside the vagaries of generating random doubles for a second and imagine we are dealing with real numbers. The domain of the normal density curve is (-∞,+∞), so purely through random chance we will get samples of 100,000 draws from N(1,1) whose median is not exactly 1.0. What is the probability of the median of such a sample not being in (0.99, 1.01)? For that, we need to consult the sampling distribution of the sample median. For a sample this large, the sampling distribution of the median is approximately normal, centered at the true median of 1.0, with a standard deviation of about 0.004. That is, the 1% tolerance used above corresponds to about 2.5 standard deviations around the true median in the sampling distribution.
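For the curious, the 0.004 figure comes from the standard large-sample approximation for the standard deviation of the sample median, sd ≈ 1/(2·f(m)·√n), where f is the N(1,1) density evaluated at the true median m = 1. A small sketch of the arithmetic:

#include <cmath>
#include <cstdio>

int main()
{
    const double pi = std::acos(-1.0);
    const double n = 100000.0;
    // The N(1,1) density at its median m = 1 is the peak of the bell curve: 1/sqrt(2*pi)
    const double f_at_median = 1.0 / std::sqrt(2.0 * pi);
    // Large-sample standard deviation of the sample median
    const double sd = 1.0 / (2.0 * f_at_median * std::sqrt(n));
    std::printf("sd of the sample median: %.6f\n", sd);       // ~0.003963
    std::printf("0.01 corresponds to %.2f sd\n", 0.01 / sd);  // ~2.52
    return 0;
}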

What then is the probability of observing a set of 100,000 draws with a median greater than 1% away from 1.0?

> 2 * pnorm(.99, mean=1, sd=0.004)
[1] 0.01241933

That is, this test should fail 1.24% of the time solely due to sampling variation if it is drawing a fresh random sample of 100,000 observations from N(1,1) each time it is run.

For example, consider the following Perl script which picks 1,000 samples of 100,000 random draws from N(1, 1).

#!/usr/bin/env perl

use v5.24;
use warnings;

use constant SAMPLE_SIZE => 100_000;
use constant NSAMPLES => 1_000;

use Math::Random::NormalDistribution;
my $generator = rand_nd_generator(1, 1);

my $nextreme = 0;

for (1 .. NSAMPLES) {
    my $m = median([map $generator->(), 1 .. SAMPLE_SIZE]);
    $nextreme += (abs($m - 1) >= 0.01);
}
printf "%.2f%% of sample medians were more than 0.01 away from 1.0\n", $nextreme/NSAMPLES;

sub median {
    my $s = shift;
    @$s = sort { $a <=> $b } @$s;
    if ( @$s % 2 ) {
        return $s->[ @$s / 2 ];
    }
    return $s->[ @$s / 2 - 1 ]/2 + $s->[ @$s / 2 ]/2;
}

Three runs of the script above generated 10, 14, and 11 out of 1,000 medians more than 1% away from the true median of 1.0.

In this case, you can think of each of those 1,000 medians as someone running the test suite. If the code is correct, and we are generating a fresh independent sample of 100,000 draws from N(1,1) each time the test code is run, we should expect about 12 or so spurious test failures per 1,000 runs due to sampling error.

Given that there don’t seem to be any reports of such failures, we know that either 1) the test code in Boost.Accumulators is not drawing a fresh set of 100,000 points each time it’s run, or 2) people are not reporting test failures, or 3) practically no one is testing Boost.Accumulators.

If we assume that the Boost.Accumulators library is indeed being tested by a sufficient number of people, that they are diligent about reporting test failures, and that the code really is drawing a fresh sample each time the test is run, then the absence of the expected 1.24% failure rate would point to a problem with one or more of the components lagged_fibonacci607, normal_distribution, and variate_generator.

To investigate, I decided to modify the test program to calculate the actual median of the generated sample, and output that along with the medians computed using the three methods provided by Boost.Accumulators. This gave me:

with_p_square_quantile:                0.999735
with_density:                          0.999192
with_p_square_cumulative_distribution: 1.02558

sort and midpoint: 0.999327

Ran it again:

with_p_square_quantile:                0.999735
with_density:                          0.999192
with_p_square_cumulative_distribution: 1.02558

sort and midpoint: 0.999327

OK, I am going to guess that the program produces the exact same set of 100,000 draws every time it is run. Of course, I could have tried to go through the code to figure this out, but this was much quicker than pinpointing the exact spot I needed to look at.
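A quick way to convince yourself of that guess, as a sketch of my own: a default-constructed boost::lagged_fibonacci607 starts from a fixed default seed, so two such generators, and for that matter two runs of the same program, produce identical sequences.

#include <iostream>
#include <boost/random/lagged_fibonacci.hpp>

int main()
{
    // Both generators are default-constructed, hence identically seeded.
    // Every run of this program prints the same three pairs of equal numbers.
    boost::lagged_fibonacci607 a;
    boost::lagged_fibonacci607 b;
    for (int i = 0; i < 3; ++i)
        std::cout << a() << '\t' << b() << '\n';
    return 0;
}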

So, if all this test is ever going to use is the same sample of 100,000 draws from N(1,1), why is it jumping through all these hoops, setting up a pseudo-random number generator and a variate generator?

The obvious alternative is to bundle a file containing the 100,000 numbers this process generates. Maybe Boost maintainers care about the final size of the downloadable distribution. The method used in this test code trades off CPU time on testers’ machines against distribution size.

Why is the code content with an approximate check? If each time it is run, it is going to calculate the median of the same set of numbers, would it not make sense to compare the numbers produced by each of these methods to known good results? After all, it is not that difficult to store 100,000 doubles in a std::vector and find the actual median of that sample.

The test is comparing the median calculated using each of these methods to the population median. There is absolutely no reason to do this. This is not a hypothesis test. The goodness of a given algorithm must not be measured against the population median: It must be measured against the actual median of the sample fed to the algorithm. The test should store all the generated numbers in a std::vector, sort it, and get the actual median of the actual set of numbers used (note that, as the output above shows, this particular test would still pass if we compared the medians calculated to the true median of the sample, 0.999327).
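Here is a rough sketch of what that could look like; it mirrors the structure of the test above, but the variable names and the final comparison are mine, and I show only the P2 accumulator to keep it short:

#include <algorithm>
#include <cstdio>
#include <vector>
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics/stats.hpp>
#include <boost/accumulators/statistics/median.hpp>
#include <boost/random.hpp>

namespace bacc = boost::accumulators;

int main()
{
    boost::lagged_fibonacci607 rng;
    boost::normal_distribution<> mean_sigma(1., 1.);
    boost::variate_generator<boost::lagged_fibonacci607&,
                             boost::normal_distribution<> > normal(rng, mean_sigma);

    bacc::accumulator_set<double,
        bacc::stats<bacc::tag::median(bacc::with_p_square_quantile)> > acc;

    std::vector<double> data;
    data.reserve(100000);
    for (std::size_t i = 0; i < 100000; ++i)
    {
        const double x = normal();
        data.push_back(x);
        acc(x);
    }

    // The exact median of the sample that was actually fed to the accumulator
    std::sort(data.begin(), data.end());
    const double exact = data[49999] / 2 + data[50000] / 2;

    std::printf("p^2 estimate: %f\texact sample median: %f\n",
                bacc::median(acc), exact);
    // In the real test, a BOOST_CHECK_CLOSE against 'exact' would go here,
    // instead of a check against the population median of 1.0.
    return 0;
}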

I would have approached this problem from a different angle though: I would have started with an externally validated case where we know what the algorithm is supposed to do and checked if the algorithm actually does what it is supposed to do. A good test case is contained in the original paper that introduced the P2 algorithm. The authors work through the algorithm with a small set of 20 observations, demonstrating what the algorithm should produce at each step.

It would be good to ensure that the Boost.Accumulators implementation of the P2 algorithm matches the working of the algorithm as it was described in the source paper. Starting from small, known, validated scenarios is more in line with the underlying philosophy of TDD than throwing a soup of numbers at an algorithm and checking if what comes out sorta kinda looks right according to a squishy criterion.

So, let’s look at what happens when we try that:

#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics/stats.hpp>
#include <boost/accumulators/statistics/median.hpp>

namespace bacc = boost::accumulators;

int main(void)
{
    bacc::accumulator_set<double,
        bacc::stats<bacc::tag::median(bacc::with_p_square_quantile)> > acc;

    // See http://www.cse.wustl.edu/~jain/papers/psqr.htm

    // First five observations
    acc(0.02);
    acc(0.5);
    acc(0.74);
    acc(3.39);
    acc(0.83);

    const std::vector<std::pair<double, double> > jain_chlamtac {
        { 22.37, 0.74 },
        { 10.15, 0.74 },
        { 15.43, 2.18 },
        { 38.62, 4.75 },
        { 15.92, 4.75 },
        { 34.60, 9.28 },
        { 10.28, 9.28 },
        {  1.47, 9.28 },
        {  0.40, 9.28 },
        {  0.05, 6.30 },
        { 11.39, 6.30 },
        {  0.27, 6.30 },
        {  0.42, 6.30 },
        {  0.09, 4.44 },
        { 11.37, 4.44 },
    };

    for (auto p: jain_chlamtac)
    {
        acc(p.first);
        std::cout << "calculated= " << bacc::median(acc)
            << "\texpected= " << p.second << '\n';
    }

    return 0;
}

Let’s run it and check the output:

calculated= 0.74        expected= 0.74
calculated= 0.74        expected= 0.74
calculated= 2.06167     expected= 2.18
calculated= 4.55176     expected= 4.75
calculated= 4.55176     expected= 4.75
calculated= 9.15196     expected= 9.28
calculated= 9.15196     expected= 9.28
calculated= 9.15196     expected= 9.28
calculated= 9.15196     expected= 9.28
calculated= 6.17976     expected= 6.3
calculated= 6.17976     expected= 6.3
calculated= 6.17976     expected= 6.3
calculated= 6.17976     expected= 6.3
calculated= 4.24624     expected= 4.44
calculated= 4.24624     expected= 4.44

Clearly, there is something wrong with the implementation of the P2 algorithm in Boost.Accumulators.

Update: The discrepancies shown above were due to a typo in the original paper.

PS: You can discuss this post on r/programming or HN.

PPS: Feel free to contact me if you need someone to take a look at your data processing pipeline or analyze your codebase or help you make sense of large amounts of data or need a large collection of documents parsed etc.