Bleach your files faster!

This is a quick follow-up to my previous post “Another optimization story”. It was inspired by Greg Foletta’s “A Tale Of Two Optimisations” and the subsequent commentary on HN.

Briefly, we are trying to find ways to encode bytes using four whitespace characters and decode them back.

Encoding is done half a nibble at a time and corresponds to the mapping:

  x |            y
--------------------------------
  0 | '\t' (0x09, 0000_1001)
  1 | '\n' (0x0a, 0000_1010)
  2 | '\r' (0x0d, 0000_1101)
  3 | ' '  (0x20, 0010_0000)

That means, given a byte containing a half nibble, denoted by c, we can use (c * c) + 14 * (c & (c >> 1)) + 0x09 to do the encoding because:

    c   | c² + 9
----------------
    0   |   9
    1   |  10
    2   |  13
    3   |  18

In case c == 3, and only in that case, we need to add 14, and we’d like to do this without using a conditional, which is where the (c & (c >> 1)) term comes in: Both bits of the half nibble are set if and only if it is equal to 3.

This part was already in the original post, but I moved things around a little.
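
Wrapped in a function, the expression above looks something like this (a sketch only; the function name is mine and the code that splits a byte into half nibbles is omitted):

static uint8_t
encode_half_nibble(const uint8_t c)
{
    /* c is 0..3: c * c + 0x09 yields 9, 10, 13, or 18, and
       14 * (c & (c >> 1)) adds the missing 14 only when both bits
       are set, i.e. when c == 3 */
    return (uint8_t)(c * c + 14 * (c & (c >> 1)) + 0x09);
}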

On the decoding side, we have the inverse mapping:

          x            |    y
---------------------------------------
'\t' (0x09, 0000_1001) |   0 (00)
'\n' (0x0a, 0000_1010) |   1 (01)
'\r' (0x0d, 0000_1101) |   2 (10)
' '  (0x20, 0010_0000) |   3 (11)

Using integer arithmetic, x/10 almost gets us there:

          x            |    x/10
---------------------------------------
'\t' (0x09, 0000_1001) |   0 (00)
'\n' (0x0a, 0000_1010) |   1 (01)
'\r' (0x0d, 0000_1101) |   1 (01)
' '  (0x20, 0010_0000) |   3 (11)

In case x is 0x0d, we need x/10 + 1. Luckily, 0x0d is the only x value with bit 2 set, so x/10 + !!(x & 4) does the trick without using a conditional.

Division tends to be more expensive than multiplication, so we can replace x/10 with d*x where d is an approximation of 1/10. Luckily, we only need this to work for a very small number of small values, so we can approximate 1/10 by 7/64 (using ceil(64/10) = 7) and replace x/10 with (x * 7) >> 6. You can verify that:

          x            | (7 * x) >> 6
-------------------------------------------
'\t' (0x09, 0000_1001) |    0 (00)
'\n' (0x0a, 0000_1010) |    1 (01)
'\r' (0x0d, 0000_1101) |    1 (01)
' '  (0x20, 0010_0000) |    3 (11)

holds. That means we can do all our arithmetic in 8 bits (the largest intermediate value is 7 * 0x20 = 224) without needing to upcast/downcast anything.

Hence, mapping a whitespace character to half a nibble now becomes ((7 * x) >> 6) + !!(x & 4).
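
As a function, the decoding step is then simply (again a sketch, with an illustrative name):

static uint8_t
decode_ws(const uint8_t x)
{
    /* (7 * x) >> 6 approximates x / 10 for the four whitespace codes;
       !!(x & 4) adds the +1 needed only for '\r' (0x0d) */
    return (uint8_t)(((7 * x) >> 6) + !!(x & 4));
}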

With these changes in place, encoding a 256 MiB file into 1,024 MiB of output on a T9900 on Windows 10 64-bit took:

C:\> timethis "wse < test.data | wsd > NUL"

TimeThis :  Command Line :  wse < test.data > NUL
TimeThis :  Elapsed Time :  00:00:02.187

Decoding the resulting 1,024 MiB output file on the same system took:

C:\> timethis "wsd < test.encoded > NUL"

TimeThis :  Command Line :  wsd < test.encoded > NUL
TimeThis :  Elapsed Time :  00:00:01.852

This basically concludes the matter of replacing the lookup table with a function without having to resort to fitting polynomials. We can indeed, and the resulting straightforward implementation gives me about twice the performance on this ancient T9900 compared to the 8th-generation i7 laptop.

This led me to ask: Can I bleach files faster?

On and off, I tried to figure out whether I could use SSE/AVX instructions in a similarly elegant fashion, but didn’t get anywhere (see also lifthrasiir’s solution on HN).

Another approach is to ask whether multi-threading would help. While this task is mostly IO-bound, dividing the work can make things go faster once caches are populated. To that end, I first wrote an implementation using the C11 threads library before realizing Visual Studio did not yet support it. So, I converted the code to Franken-C++ to test the idea. I was not disappointed: With two threads, I got:

TimeThis :  Command Line :  wse < test.data > NUL
TimeThis :  Elapsed Time :  00:00:01.404

which represents approximately 35% improvement in time and 55% improvement in encoding throughput. Decoding performance improved as well:

TimeThis :  Command Line :  wsd < test.encoded > NUL
TimeThis :  Elapsed Time :  00:00:01.505

This corresponds to about a 19% improvement in time and a 23% improvement in throughput.

Pipeline performance got worse on the dual-core T9900, however, since each stage was now running two threads, which reduced the parallelism available across the pipeline as a whole.
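
Conceptually, the two-thread split is nothing more than the following sketch (the chunked stdio reading and writing of the real implementation is omitted, the names are illustrative, and the half-nibble order within a byte is an assumption):

#include <cstddef>
#include <cstdint>
#include <thread>

static uint8_t
encode_half_nibble(const uint8_t c)
{
    return (uint8_t)(c * c + 14 * (c & (c >> 1)) + 0x09);
}

// Expand each input byte into four whitespace characters.
static void
encode_bytes(const uint8_t* in, size_t n, uint8_t* out)
{
    for (size_t i = 0; i < n; ++i) {
        const uint8_t b = in[i];
        out[4 * i + 0] = encode_half_nibble((b >> 6) & 3);
        out[4 * i + 1] = encode_half_nibble((b >> 4) & 3);
        out[4 * i + 2] = encode_half_nibble((b >> 2) & 3);
        out[4 * i + 3] = encode_half_nibble(b & 3);
    }
}

// Halve the input between two threads; since every input byte maps to
// exactly four output bytes, both output offsets are known up front.
// (out must have room for 4 * n bytes.)
static void
encode_buffer_mt(const uint8_t* in, size_t n, uint8_t* out)
{
    const size_t half = n / 2;
    std::thread worker(encode_bytes, in, half, out);
    encode_bytes(in + half, n - half, out + 4 * half);
    worker.join();
}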

While this has been fun (I have never been one to enjoy crossword puzzles) and I like the neat equations above, this is clearly not as performant as it could be.

In the HN thread, there was one comment that pointed out a very simple and straightforward replacement for a lookup table or a switch statement:

clang does something smart with the switch, equivalent to this C code:

unsigned char lookup2_encode(const unsigned char dibit) {
    return ' \r\n\t' >> (dibit * 8);
}

For clarity, I rewrote that as follows (note that we are in the Franken-C++ world with the threads now):

static uint8_t
half_nibble_to_ws(const uint8_t& c)
{
    // Pack the four whitespace characters into a 32-bit constant,
    // one per byte, and select the byte indexed by c.
    static const uint32_t x = (' ' << 24) |
        ('\r' << 16) |
        ('\n' << 8) |
        '\t';

    return (x >> (8 * c)) & 0xff;
}

I doubt the masking is necessary, but I am not sure, so it stays.

How about the decoding part? My first thought was to position the half-nibbles in a 32-bit integer at offsets corresponding to the byte values of the characters. There are problems with that: First, ' ' has code 32, and, of course, shifting a 32-bit integer left or right by 32 bits is not going to give us much (keep in mind that we are trying to avoid conditionals and special cases). Second, the code for '\t' is 9 while the code for '\n' is 10, and we can’t place 00 at bit 9 and 01 at bit 10 for obvious reasons. So, I went for the simplest solution: Place the half-nibbles at twice the difference between each character and the tab. That is:

      x      | offset (value)
-----------------------------------
'\t' (0x09)  |    0 (00)
'\n' (0x0a)  |    2 (01)
'\r' (0x0d)  |    8 (10)
' '  (0x20)  |   46 (11)

which means we need a 64-bit integer to store this map:

static uint8_t
ws_to_half_nibble(const uint8_t& ws)
{
    // Each 2-bit value sits at offset 2 * (character - '\t') in the
    // 64-bit constant; '\t' contributes the implicit 0 at offset 0.
    static const uint64_t x = (3ULL << 46) | (2 << 8) | (1 << 2);
    return (x >> (2 * (ws - '\t'))) & 3;
}
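
For completeness, reassembling a byte is then just four such lookups (a sketch with an illustrative name; the half-nibble order is assumed to mirror the encoder):

static uint8_t
ws_to_byte(const uint8_t ws[4])
{
    // Rebuild one byte from four whitespace characters, most
    // significant half nibble first (the order is an assumption).
    return (uint8_t)((ws_to_half_nibble(ws[0]) << 6) |
                     (ws_to_half_nibble(ws[1]) << 4) |
                     (ws_to_half_nibble(ws[2]) << 2) |
                      ws_to_half_nibble(ws[3]));
}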

Encoding the same 256 MiB file now took:

TimeThis :  Command Line :  wse < test.data > NUL
TimeThis :  Elapsed Time :  00:00:00.933

which is a 34% improvement on the nice, elegant-looking (in the eye of the beholder) expression I had come up with to replace the OP’s polynomial.

Decoding the resulting 1,024 MiB file took:

TimeThis :  Command Line :  wsd < test.encoded > NUL
TimeThis :  Elapsed Time :  00:00:01.457

which seems to represent a consistent 3% improvement over the expression

((7 * x) >> 6) + !!(x & 4)

I formulated above. I still like that expression (approximating division by 10 with a multiplication by 7 followed by a right shift of 6 bits, plus the straight-line handling of the special case, just feels more elegant than shuffling bits around), but, oh well.

I should note that I stuck with the stdio functions (fread/fwrite), because using istream::read and ostream::write made the overall operation approximately 30% slower.

The current implementation is available on GitHub. You can discuss this post on HN.