Bits of a board
Most of my early education in software performance optimization came from writing programs that played chess.
This was in the immediate aftermath of the famous 1997 match between Deep Blue, a purpose-built computer from IBM, and the then (human) world champion, Garry Kasparov. While Deep Blue's hardware was bespoke, the commodity hardware available to consumers was becoming powerful enough that a thoughtful hobbyist with a compiler could piece together extremely strong chess engines and compete with other engines over the rapidly growing Internet.
One of the first choices a chess engine programmer must make is how to represent the state of the game, and in particular, the position of the pieces on the board. Intuitively, an array structure of some sort is a reasonable approximation of a chessboard. Elements of the array correspond to squares on the board, where each square might have a state such as "white pawn," "black king," or "empty." Generating moves and detecting conditions such as check involve navigating this array and querying the contents of each element.
That representation of a board can take an engine quite far. A less obvious, but unusually powerful, technique is to represent the chessboard as a set of individual bits. A chessboard is a square, eight squares on a side, for a total of 64 squares: a power of two. This matches nicely with modern CPUs, which operate on words of 64 bits.
Consider a 64-bit word in which each bit corresponds to a boolean property of a single square on the chessboard. Perhaps the first bit corresponds to the state of square A1, the second to B1, and so on, with the last bit representing H8.
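With that A1-through-H8 ordering, a square's bit follows directly from its file and rank. A minimal sketch (the function name and layout are illustrative, not from any particular engine):

```c
#include <stdint.h>

/* Map a square to its bit, assuming bit 0 = A1, bit 1 = B1, ..., bit 63 = H8.
   file and rank both run 0-7 (file 0 is the A file; rank 0 is rank 1). */
uint64_t square_bit(int file, int rank) {
    return 1ULL << (rank * 8 + file);
}
```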
If using a single bit per square on the board, it's not possible to express which piece occupies a square. With six distinct pieces (pawn, knight, bishop, rook, queen, king), each coming in two colors (white, black), plus the notion of an empty square, we'd need several bits to express the state of a single square. To keep our data fitting nicely in 64-bit words, we'll instead use several 64-bit words. For example, one word would express "position of the black pawns," where a set bit indicates a black pawn on the square, and an unset bit indicates the lack of one.
This technique is called a "bitboard" representation.
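In code, a board then becomes a handful of words, one per piece type and color. A minimal sketch, assuming the square-to-bit mapping above (the field names are mine, not a standard):

```c
/* One 64-bit word per piece type and color; a set bit means a piece
   of that kind occupies the corresponding square. */
typedef struct {
    uint64_t white_pawns, white_knights, white_bishops,
             white_rooks, white_queens, white_king;
    uint64_t black_pawns, black_knights, black_bishops,
             black_rooks, black_queens, black_king;
} Board;
```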
The benefits of this technique become clearer when we have to ask questions beyond "what piece is on this square?" Consider a task like determining "does black occupy light-square diagonals?" Here, we'd need to verify either that there's a queen on the board (who could move to the light diagonals, if not already on a light square), or that there's at least one bishop and that bishop is on a light-colored square (as bishops can never change the color of their square). With an array alone, this is quite a chore, and array-based engines often need to track piece counts in a separate array.
With bitboards, however, we simply OR the bitboards for the black queens and black bishops, yielding a bitboard of black's diagonal-sliding pieces, and then AND that result with a pre-computed bitboard of all the light-colored squares. If that result has any bits set - in other words, is non-zero - then we know black has a diagonal-sliding piece (either a bishop or queen) somewhere on the light diagonals. Compare this series of computations - an OR, an AND, and a comparison against zero - with repeatedly indexing into an array in main memory to locate the pieces.
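Using the Board sketch above, the whole query might look like this. The light-squares mask is the one piece of precomputed data we need (a sketch; LIGHT_SQUARES assumes the bit 0 = A1 ordering, under which A1 is a dark square):

```c
#include <stdbool.h>

/* Mask of the 32 light squares: with bit 0 = A1 (dark), the low byte
   covering rank 1 is 0xAA, the next byte covering rank 2 is 0x55, etc. */
#define LIGHT_SQUARES 0x55AA55AA55AA55AAULL

/* Does black have a bishop or queen on a light square? */
bool black_on_light_diagonals(const Board *b) {
    uint64_t sliders = b->black_bishops | b->black_queens;
    return (sliders & LIGHT_SQUARES) != 0;
}
```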
Similarly, a question like "how many white pieces are on the board?" becomes as simple as ORing all the white piece bitboards together and then counting the number of bits set to 1. This operation is called a "population count," or "popcount" for short. An efficient popcount operation is critical for fast computation, but is interestingly non-trivial.
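In terms of the Board sketch above, the whole computation is one OR-reduction followed by a single popcount (a sketch; popcount stands in for whichever implementation we settle on below):

```c
int popcount(uint64_t n);  /* any of the implementations below */

int white_piece_count(const Board *b) {
    uint64_t white = b->white_pawns | b->white_knights | b->white_bishops
                   | b->white_rooks | b->white_queens | b->white_king;
    return popcount(white);
}
```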
Implementing Popcount
The obvious method
While the modern C23 standard includes this operation in stdbit.h, the C99 standard (against which I did most of my work in this era) does not have a population count as a standard function. As most ambitious novices do, I started out implementing population count myself in perhaps the slowest possible manner:
```c
#include <stdint.h>

int popcount_slow(uint64_t n) {
    int count = 0;
    while (n > 0) {
        if (n & 1) {  /* is the lowest bit set? */
            count++;
        }
        n >>= 1;      /* move the next bit into the lowest position */
    }
    return count;
}
```

Here, we shift the input by one bit at a time, checking whether the lowest bit is set. Our loop will run a number of times proportional to the position of the highest set bit. For example, if the most-significant bit is the only bit set, we won't reach zero until we've shifted that bit all the way across the word.
A bit better
We can improve on this using a method attributed to Brian Kernighan, where instead of shifting by one bit at a time, we repeatedly AND our input with itself minus one. Each pass extinguishes the lowest set bit (for example, 0b0110 & 0b0101 = 0b0100), ensuring that we only loop as many times as we have bits set.
```c
int popcount_better(uint64_t n) {
    int count = 0;
    while (n > 0) {
        n &= (n - 1);  /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

In doing so, the amount of iteration is reduced, particularly for "sparse" words with few bits set, and the conditional branch inside the loop is removed. This remains completely portable between processors as written. We can improve further, but from here on we start exploiting CPU-specific behaviors and features.
Putting the silicon to work
The GNU C Compiler (GCC) and compilers that follow similar conventions offer a useful __builtin_popcountll operation. While not part of the C standard, it informs the compiler that you're looking for the best version of popcount for the target hardware. If your target processor offers nothing, you might get something like the popcount_better above. However, many processors now offer implementations of the operation with dedicated circuits, or otherwise have characteristics our compiler knows how to utilize for performance.
Many server- and professional workstation-oriented hardware makers offered a hardware popcount relatively early: both DEC's Alpha series and Sun Microsystems' UltraSPARC processors had popcount instructions in the 1990s.
For consumer-oriented hardware, which thrifty university students such as myself typically owned, Intel didn't introduce a dedicated instruction until 2008 with SSE 4.2, first seen with their Nehalem microarchitecture. At this point I was busying myself with non-chess work, and sadly missed this update. However, it's never too late to run a fun benchmark!
We can try out this extension with this very simple popcount function:
```c
int popcount_fast(uint64_t n) {
    return __builtin_popcountll(n);
}
```

This alone won't emit code using the instruction, however - we will need to inform our compiler that it's okay to use implementation-specific details. With GCC on x64 platforms, this is the -mpopcnt compiler flag:

```
gcc -mpopcnt popcount.c
```
And indeed, our resulting binary uses it (shown here with no compiler optimization, for clarity):
```
<+0>:  push   %rbp
<+1>:  mov    %rsp,%rbp
<+4>:  mov    %rdi,-0x8(%rbp)
<+8>:  xor    %eax,%eax
<+10>: popcnt -0x8(%rbp),%rax
<+16>: pop    %rbp
<+17>: ret
```

At offset +10 we see our popcnt, giving our result with a single instruction to the CPU.
But what if we don't set -mpopcnt? The compiler still does what it can to optimize, but the popcnt instruction isn't used:
```
<+0>:  push   %rbp
<+1>:  mov    %rsp,%rbp
<+4>:  sub    $0x10,%rsp
<+8>:  mov    %rdi,-0x8(%rbp)
<+12>: mov    -0x8(%rbp),%rdi
<+16>: call   0x15c0 <__popcountdi2>
<+21>: leave
<+22>: ret
```

Here, at offset +16, we're invoking a software function, __popcountdi2, which is provided by the compiler (LLVM has their own similar implementation).
No doubt that these were carefully optimized by experts. How close in performance are they to a dedicated hardware instruction?
On your marks!
No performance article is complete without benchmarks.
The Method
I wrote a performance test that calls our various popcount implementations on a deterministic sequence of 64-bit integers with randomly distributed bits (itself perhaps interesting enough for another blog post). This is a deliberate attempt to confound my CPU's branch predictor.
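A small deterministic generator in the style of splitmix64 fits the bill: it produces the same sequence on every run, yet spreads set bits across the whole word. A sketch (the constants are the standard splitmix64 ones; the seed and wrapper are illustrative):

```c
#include <stdint.h>

static uint64_t prng_state = 0x0123456789ABCDEFULL;  /* arbitrary fixed seed */

/* splitmix64: deterministic and fast, with output bits distributed
   uniformly across all 64 positions. */
uint64_t next_random_word(void) {
    uint64_t z = (prng_state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}
```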
The branch predictor makes an informed guess about which direction an "if" will go, wagering that we can save some CPU work if we guess correctly. A predictable sequence of consecutive integers is a noticeable pattern to modern CPUs and might trigger unintentional performance optimization that only works on that particular sequence, fouling our test. For example, the n & 1 check in our first implementation is trivially predictable if we know the input sequence alternates between even and odd numbers, but outside of deliberate benchmarks it's unlikely we'd have such a fortuitous series of inputs. Further, this randomization lets us spread "1"s across the entire 64 bits instead of having them packed at one end, which is what we'd get with consecutive integers starting from zero, unless we're willing to run an exhaustive test on all 2^64 inputs.
The Numbers
The related timing code is here, compiled for this test with GCC 14.2.0:
- popcount.c
- Makefile to compile the above in a few versions.
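The Makefile presumably builds the four timing variants along these lines (a sketch of the flag combinations, not the Makefile's literal contents):

```
gcc -O0 popcount.c -o popcount_O0
gcc -O3 popcount.c -o popcount_O3
gcc -O0 -mpopcnt popcount.c -o popcount_O0_popcnt
gcc -O3 -mpopcnt popcount.c -o popcount_O3_popcnt
```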
The -Ox flag, where x is a number 0 through 3, tells the compiler how aggressively it should look for performance optimizations, with 3 being the most aggressive. I tested at 0 (the default level) and 3. These were all timed on my antiquated-but-reliable 2011-era Sandy Bridge Xeon E3-1270 @ 3.40 GHz, calculating the popcount of 100,000,000 random-ish bit patterns:
| Method | Time with -O0 | Time with -O3 |
|---|---|---|
| popcount_slow() | 31.73s | 4.72s |
| popcount_better() | 9.01s | 2.40s |
| popcount_fast(), with __popcountdi2 | 0.61s | 0.29s |
| popcount_fast(), with hardware popcnt | 0.36s | 0.07s |
Using the hardware implementation is about ninety times faster than the slowest pure-software implementation with no compiler optimization flag applied, and "only" about 67 times faster with -O3.
I found it surprising that the software __popcountdi2 was so close to the hardware popcnt, a tribute to some very clever engineering. Reading more about this method reveals that the implementation exploits other capabilities of modern processors.
This is an example of "SIMD Within a Register," or SWAR. SIMD refers to "single-instruction, multiple data" operations, where a CPU applies one operation to multiple pieces of data roughly simultaneously, often using very wide, special-purpose registers offered by extensions on certain CPUs. Here, we use a plain 64-bit register, but in a clever manner: the 64-bit value is logically partitioned into multiple independent inputs (e.g. 32 inputs of 2 bits each, or 16 inputs of 4 bits each, and so on), and the math is performed concurrently on those smaller sections of the register.
This exploits the fact that popcount, unlike addition or multiplication, doesn't depend on the positions of the bits in a number, only on how many are set. The count in one subset of a word doesn't affect the count in adjacent subsets, making divide-and-conquer approaches straightforward and effective.
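Here is a sketch of the classic divide-and-conquer SWAR popcount, in the spirit of (though not copied from) __popcountdi2:

```c
/* SWAR popcount: treat the word as lanes of 2, then 4, then 8 bits,
   summing neighboring lanes in parallel at each step. */
int popcount_swar(uint64_t n) {
    n = n - ((n >> 1) & 0x5555555555555555ULL);      /* pairwise 2-bit sums */
    n = (n & 0x3333333333333333ULL)
      + ((n >> 2) & 0x3333333333333333ULL);          /* 4-bit lane sums */
    n = (n + (n >> 4)) & 0x0F0F0F0F0F0F0F0FULL;      /* 8-bit lane sums */
    return (int)((n * 0x0101010101010101ULL) >> 56); /* add up all bytes */
}
```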
While not a popcount-specific operation, SWAR is a technique which enables a blazingly fast popcount by taking full advantage of the CPU.
Also interesting is the performance of the "better" (Kernighan method) popcount when tested with both -mpopcnt and -O3 enabled, not shown in the table above to keep it small. This will be the subject of a later post.
Beware of rabbit holes
Observing the narrow performance gap between __popcountdi2 and the hardware popcnt is a reminder of the trade-offs between using general-purpose CPU features to their fullest potential and dedicating hardware on the CPU to squeeze a bit more performance out of key operations.
But back to my chess engine: I don't think I can blame any in-game blunders on a slow popcount. First we make it correct, and only then (if ever) do we make it fast.