How you average numbers matters

Every now and then, I end up having to explain to skeptical people why it matters how their programs treat the numbers they ingest. With IEEE 754 and doubles, people seem to think that one can just will nilly add a bunch of numbers and average them, and get reliably accurate results.

In fact, things are better now than they used to be even as recently as 20 years ago. But, given enough numbers, and enough operations, it is still possible to run into interesting inaccuracies. This is a fact of life, and not an earth-shattering observation, but, sometimes, it does take a bit of effort to explain to someone who is hearing about this stuff for the first time, and is absolutely sure of his/her programming prowess.

To illustrate this, I am first going to start with a slightly contrived example so that we can easily, and independently, calculate the expected magnitudes, and assess the accuracy of the computer's calculations.

We start with the data vector:

my @data = (1_000_000_000.1, 1.1) x 50_000;

That is, we have a vector of one hundred thousand numbers. Odd-indexed elements are rather large, and even indexed ones are relatively small. Clearly, this is a contrived example, but it helps verify things arithmetically.

The arithmetic average of this variable is precisely 500,000,000.6.

Let's see what our computer tells us:

my @data = (1_000_000_000.1, 1.1) x 50_000;
printf "Naive mean:                  %f\n", (sum @data) / @data;

Output:

Naive mean:                  500000000.600916

Well, that's surely close enough. After all, who cares about the fourth and fifth digits after the decimal point when you are dealing with magnitudes in the five hundred millions. Some floating point error is to be expected. Etc, etc, so on and so forth. You have heard this before.

So by adding large numbers and small numbers in alternating order, we lost some accuracy.

What if we added all the small numbers first? Would that improve things?

Well, having enough memory to keep not just one but two copies of the original data set is not a luxury one can always afford, but, in this case, the array is small enough (I say that remembering fully when no one in her right mind would call an array of 100,000 elements small enough) that we can give it a shot:

sub asc_sum { sum sort {$a <=> $b} @_ }

sub asc_mean { asc_sum(@_) / @_ }

printf "Mean of ascending sequence:  %f\n", asc_mean(@data);

Output:

Mean of ascending sequence:  500000000.600458

Well, that's better. The error is only half what it was.

For the converse, let's check out what happens if we sort the data in descending order first:

sub dsc_sum { sum sort {$b <=> $a} @_ }

sub dsc_mean { dsc_sum(@_) / @_ }

printf "Mean of descending sequence: %f\n", dsc_mean(@data);

Output:

Mean of descending sequence: 500000000.601239

Ouch! The error is now 35% larger than the naive mean, and 171% larger than the mean of the increasing sequence.

Of course, properly written software deals with this. To see how, note that the difference between the mean of the first N elements and the mean of the first N + 1 elements is simply

( xN + 1 - ) / (N + 1)

where denotes the average of the first N elements.

Based on that, we have our stable_mean function:

sub stable_mean {
    my $x̄ = $_[0];
    for my $i (1 .. $#_) {
        $x̄ += ($_[$i] - $x̄) / ($i + 1);
    }
    $x̄;
}

printf "Stable mean:                 %f\n",
    stable_mean(@data);
printf "Stable mean of ascending:    %f\n",
    stable_mean(sort {$a <=> $b} @data);
printf "Stable mean of descending:   %f\n",
    stable_mean(sort {$b <=> $a} @data);

Output:

Stable mean:                 500000000.600000
Stable mean of ascending:    500000000.600002
Stable mean of descending:   500000000.599997

Just look at that accuracy!

Now, in the real world, you have programs that ingest untold amounts of data. They sum numbers, divide them, multiply them, do unspeakable things to them in the name of "big data". Very few of the people who consider themselves C++ wizars, or F# philosophers, or C# ninjas actually know that one needs to pay attention to how you torture the data. Otherwise, by the time you add, divide, multiply, subtract, and raise to the nth power you might be reporting mush and not data.

One saving grace of the real world is the fact that a given variable is unlikely to contain values with such an extreme range. On the other hand, in the real world, one hardly ever works with just a single variable, and one can hardly every verify the results of individual summations independently.

Anyway, the point of this post was not to make a grand statement, but to illustrate with a simple example that when using floating point, the way numbers are added up, and divided, matters.

The example is independent of programming language. What I called stable_mean above is part of the online algorithm for calculating variance on Wikipedia. I learned about it close to three decades ago, trying to debug a Fortran program. Until I decided to check a few things while writing this post, I did not know it even had a name ;-)

As an example of the dangers of this sort thing, consider the following short example:

#!/usr/bin/env perl
use utf8;
use strict;
use warnings;
use Statistics::Descriptive;

my $K = 10_000_000_000;
my @D = (4, 7, 13, 16);
my @X = map $K + $_, @D;

my $stat = Statistics::Descriptive::Full->new;
$stat->add_data(@X);

my $μ = $stat->mean;
my @S = map { ($_ - $μ)**2 } @X;

printf "Variance calculated using Statistics::Descriptive: %f\n",
    $stat->variance;
printf "Variance calculated using stable mean:             %f\n",
    (@S * stable_mean(@S))/(@S - 1);

sub stable_mean {
    my $x̄ = $_[0];
    for my $i (1 .. $#_) {
        $x̄ += ($_[$i] - $x̄) / ($i + 1);
    }
    $x̄;
}

Output:

Variance calculated using Statistics::Descriptive: 21845.333333
Variance calculated using stable mean:             30.000000

In case you were wondering, the correct variance is 30.

PS: You can discuss this post on /r/perl.