# Another optimization story

I quickly skimmed through “A Tale of Two Optimizations” a couple of days ago on a phone after noticing the title on HN. As the author mentions in the repo:

I’m not sure there are many useful applications of this program. But it was a fun little project with a small scope.

Indeed, it is not easy these days to find fun, well defined projects to exercise our programming muscles just for the fun of it these days: Leetcode is everywhere and StackOverflow is not really that much fun for me any more.

I was a little struck by what seemed to be low throughput:

on my Intel i7-8650U laptop, running over a file that’s cached in memory and outputting to /dev/null, the encoding / decoding process runs at 258MiB/s.

Of course, it is unreasonable to expect to achieve max memory bandwidth, but I thought it ought to be possible to do better than that. Among the optimization attempts, this one caught my eye:

I had an idea about using mathematical functions (rather than the lookup table) to perform the encoding/decoding.

If the lookup table is small enough to fit in 4 bytes, it will probably stay in the fastest cache available to the CPU so that is probably not a huge concern. Also, as some commenters on HN noted, compilers can come up with neat tricks.

I was surprised that the author went the way of fitting a polynomial using linear regression to the four points and using its inverse to speed things up. I would not expect much of a return there: Even if the mapping in one direction might help, the inverse mapping is unlikely to be symmetrically performant. In fact, the author did run two separate regressions, presumably for that reason.

It can be a good idea to see if you can replace lookups with computation, but it is worth looking for something simpler than polynomials. It is not obvious a mapping that is simple enough exists, especially one that avoids branching.

The encoding scheme chosen by the author is straightforward:

``   char encode_lookup_tbl[] = { '\t', '\n', '\r', ' ' };``

We want a function that will map the values `(0, 1, 2, 3)` to whitespace characters `('\t', '\n', '\r', ' ')`. At first blush, it looks like there is no obvious systematic relationship we can exploit:

``````  x  |  y
-------------
0  |   9
1  |  10
2  |  13
3  |  32``````

which is what I think led the author to consider fitting polynomials. This is where representation starts to matter. We are used to looking at numbers in decimal notation, but that can hide patterns that might otherwise be useful to us. After all, all we need is to map four values on one side to four values on the other side. Let’s look at it in hex:

``````  x  |  y
-------------
0  |  0x09
1  |  0x0a
2  |  0x0d
3  |  0x20``````

OK, this already gives me something. Three of the values have their high nibble set to zero and the other one has its low nibble set to zero. To be frank, I stared at this for a while to figure out if I can come up with some simple arithmetic, but that went nowhere. That’s when I thought I might gain more insight by looking at the bit patterns. So, I expressed both sides in binary:

``````   x  |     y
-----------------
00  |  00001001
01  |  00001010
10  |  00001101
11  |  00100000``````

Look at that! The first three `x` values map straight to bits 1 and 2 in the `y` values.

This actually made the decoding, that is, going from whitespace values to half-nibble values rather straightforward:

``````uint8_t
ws_to_half_nibble(uint8_t ws)
{
return ((ws >> 1) & 3) | (ws >> 4) | (ws >> 5);
}``````

The `(ws >> 1) & 3` part maps the tab, newline, and carriage return characters to 0, 1, and 2, respectively.

If the whitespace value is space (`0x20`, `00100000`), then `(ws >> 4) | (ws >> 5)` gives me 3 (`11` in binary).

Therefore, decoding becomes:

``````uint8_t
ws_decode(const uint8_t* buf)
{
return (ws_to_half_nibble(buf) << 6) |
(ws_to_half_nibble(buf) << 4) |
(ws_to_half_nibble(buf) << 2) |
ws_to_half_nibble(buf)
;
}``````

I did not unroll for performance: This just seemed just as clear if not more so as having a `for` loop with computed indexes.

Finding a similarly straightforward mapping in the opposite direction, one that maps `0`, `1`, `2`, and `3` to `\t`, `\n`, `\r`, `' '`, respectively felt harder for me until I decided to use offsets from the tab character value in the `y` column:

``````   x  |         Δ
----------------------------
00  |  00000000 ( 0, 0x00)
01  |  00000001 ( 1, 0x01)
10  |  00000100 ( 4, 0x04)
11  |  00010111 (23, 0x17)``````

Once again, the first three values seem straightforward to map: Just square the half-nibble. How do we special-case `3` without introducing branching? The answer turns out to look trivial: `3&times;3` is 9. `23 - 9 = 14`. So, just add `14` to the result of squaring if bit `0` and bit `1` are both set. Therefore, encoding a byte into four whitespace characters becomes:

``````void
ws_encode(uint8_t c, uint8_t* buf)
{
for (int i = 0; i < 4; ++i) {
const uint8_t b = (c >> 2 * (3 - i)) & 3; // move half nibble into position
const uint8_t b0 = b & 1;
const uint8_t b1 = (b >> 1);
buf[i] = '\t' + (b * b) + 14 * (b0 & b1);
}
}``````

Hmmm … while actually elegant, I am not sure if the squaring is doing me any favors here. An alternative way to write this (in the context of this specific problem) is:

``````void
ws_encode(uint8_t c, uint8_t* buf)
{
for (int i = 0; i < 4; ++i) {
const uint8_t b = (c >> 2 * (3 - i)) & 3; // move half nibble into position
const uint8_t b0 = b & 1;
const uint8_t b1 = b & 2;
buf[i] = '\t' + b0 + 2 * b1 + 18 * (b0 & (b1 >> 1));
}
}``````

This then answers my original question of whether there is a way to replace the lookup tables with functions which have no branching for both encoding and decoding.

I am not sure how this would affect overall performance as I decided to limit the time I spent on this post by focusing solely on the task of expressing the mapping of half-nibbles to the chosen whitespace characters The specific order in which the original lookup table was constructed happened to be very helpful in coming up with relatively simple expressions.