As I’ve described in part 1 & part 2 of this series, I’ve recently overhauled an internal data structure we use at Work® to start using platform dependent intrinsics.

If you’ve not read the previous posts, I suggest you do so, as a lot of what is discussed here relies on the code and issues presented there…

As a reminder, this series is made in 3 parts:

All of the code (C# & C++) is published under the bitgoo github repo.

C++ vs. C#

I think I’ve mentioned this somewhere before: I started working on better versions of my bitmap search function way before CoreCLR intrinsics were even imagined. This led me to start to tinkering with C++ code where I tried out most of my ideas. When CoreCLR 3.0 became real enough, I ported the C++ code back to C# (which surprisingly consisted of a couple of search and replace operations, no more…).

As such, having two close implementations begs performing a head-to-head comparison. After some additional work, I had basic google benchmark and google test suites up and running1
I’ll cut right to the chase and present a relative comparison between C++ and C# for the last version we ran in our previous post, The C# method is POPCNTAndBMI2Unrolled and the C++ one is POPCNTAndBMI2Unrolled2:

Method N C# Mean (ns) C++ Mean (ns) C++/C# Ratio
POPCNTAndBMI2Unrolled 1 2.249 3.338 148.42%
POPCNTAndBMI2Unrolled 4 10.904 11.037 101.22%
POPCNTAndBMI2Unrolled 16 50.368 43.786 86.93%
POPCNTAndBMI2Unrolled 64 208.272 202.366 97.16%
POPCNTAndBMI2Unrolled 256 1,580.026 1,493.020 94.49%
POPCNTAndBMI2Unrolled 1024 21,282.905 11,520.900 54.13%
POPCNTAndBMI2Unrolled 4096 255,186.977 133,976.543 52.50%
POPCNTAndBMI2Unrolled 16384 3,730,420.068 1,754,421.485 47.03%
POPCNTAndBMI2Unrolled 65536 56,939,817.593 26,613,731.568 46.74%

There are a few things that stand out from this comparison:

I’ll cut to the chase and answer this last question directly, then, proceed to explain the underlying relevant basics (tl;dr: it’s not so basic) of CPU pipelining and register renaming in order for the explanation to stick for people reading this that are not familiar with those terms/concepts.

The bottom line is: there is a bug in the CPU! There is a well known (even if very cryptic) erratum about this bug, and compiler developers are more or less generally aware of this issue and have been working around it for the better part of the last 5 years.

False Dependencies

So what is this mysterious CPU bug all about? The JIT was producing what should be, according to the processor documentation, pretty good code:

1
2
3
4
5
6
7
8
9
10
11
12
BEGIN_POPCNT_UROLLED_LOOP:
popcnt   rsi, qword ptr [rcx]
sub      rdx, esi
popcnt   rsi, qword ptr [rcx+8]
sub      rdx, esi
popcnt   rsi, qword ptr [rcx+16]
sub      rdx, esi
popcnt   rsi, qword ptr [rcx+24]
sub      rdx, esi
add      rcx, 32
cmp      rdx, 256
jge SHORT BEGIN_POPCNT_UROLLED_LOOP

What we see above is an excerpt from POPCNTAndBMI2Unrolled method’s assembly code, and more specifically the unrolled loop that does 4 POPCNT instructions in succession.

Even if you are not an assembly guru, it’s pretty clear we have 4 pairs of POPCNT + SUB instructions, where:

The high-level explanation of the bug goes like this:

  1. The CPU should have detected that each POPCNT + SUB instruction pair is effectively independent of the previous pair (inside our unrolled loop and between the loop’s iterations). In other words: although all 4 pairs are using the same destination register (rsi), each such pair is really not dependent on the previous value of rsi.
  2. This dependency analysis, performed by the CPU, should have enabled it to use an internal optimization called register-renaming (more on that later).
  3. Had register renaming been triggered the CPU could have processed our POPCNT instructions with a higher degree of parallelism: In other words, our CPU, would run a few POPCNT instructions in parallel at any given moment. This would lead to better perf or better IPC (Instruction-Per-Cycle ratio).
  4. In reality, the bug is causing the CPU to delay the processing of each such pair of instructions for a few cycles, per pair, introducing a lot of “garbage time” inside the CPU, where it’s stalling, doing less work than it should, leading to the slowdown we are seeing.

Terminology wise, this sort of bug is called a false-dependency bug: In our case, the CPU wrongfully introduces a dependency between the different POPCNT instructions on their destination register, it thinks each POPCNT instruction is not only writing into rsi but also reading from it! (it does no such thing)
Given that this false dependency now exists, it is preventing the CPU from using register-renaming to execute the code more efficiently.

I will first focus on describing how compilers have been working around this, and afterward, I will describe in much more detail how the CPU employs register renaming to improve the throughput of the pipeline when the bug does not exist or is worked around.

Working Around False Dependencies

As I’ve mentioned, this bug has been around for quite some time: It was reported somewhere in 2014 and is unfortunately still persistent to this day on most Intel CPUs, at least when it comes to the POPCNT instruction2.

Luckily, compiler developers have been able to work around this issue with relative ease by generating extra code that breaks the aforementioned false-dependency. As far as I can tell, the people who originally wrote the workarounds were Intel developers, so they had a very good understanding of the exact nature of this false-dependency. What they opted to do was make compilers introduce a single-byte instruction that clears the lower 32-bits of the destination register. In our case, this comes in the form of a xor esi, esi instruction. This is the shortest way (instruction length-wise) in x86 CPUs to zero out a register. This instruction is a well known special case in the CPU since it “knows” the future value of the destination register (0) even without executing it, or knowing what its original value ever was. It appears the Intel engineers knew that the dependency is not the entire 64-bit register (rsi) but only on the lower 32-bit part of that register (esi3) and took advantage of this understanding to introduce a single byte fix into the instruction stream, which is relatively very cheap.

The correct x86 assembly, generated by a fixed JIT or compiler should look like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
BEGIN_POPCNT_UROLLED_LOOP:
xor      esi, esi				; This breaks the dependency
popcnt   rsi, qword ptr [rcx]
sub      rdx, esi
xor      esi, esi				; This breaks the dependency
popcnt   rsi, qword ptr [rcx+8]
sub      rdx, esi
xor      esi, esi				; This breaks the dependency
popcnt   rsi, qword ptr [rcx+16]
sub      rdx, esi
xor      esi, esi				; This breaks the dependency
popcnt   rsi, qword ptr [rcx+24]
sub      rdx, esi
add      rcx, 32
cmp      rdx, 256
jge SHORT BEGIN_POPCNT_UROLLED_LOOP

This short piece of code is the sort of code that gcc/clang would generate for POPCNT to work-around the bug. When read out of context, it looks silly… it appears like the compiler generated useless code, to begin with, and you’ll find people wondering about this publicly in stackoverflow and other forums from time to time, or worse yet: trying to “fix” it. But for most in-production x86 CPUs (e.g. all the ones that suffer from this false-dependency) this code will substantially outperform the original code we saw above…

Update: CoreCLR does the right thing

I originally started writing part 3 after I found this issue with the JIT, and submitting an issue, thinking I would finish writing this post before anyone would fix the underlying issue. I was wrong on both counts: Writing this post became an ever-growing challenge as I attempted to explain pipelines and register-renaming for the uninitiated (below), while Fei Peng fixed the issue in a matter of two weeks (Thanks!).

What CoreCLR now does (since commit 6957b4f) is always introduce the xor dest, dest workaround/dependency breaker for the 3 affected instructions LZCNT, TZCNT, POPCNT. This is not the optimal solution since the JIT will introduce this both for CPUs afflicted with this bug (specific Intel CPUs) as well as CPUs that don’t have this bug (All AMD CPUs and newer Intel CPUs).
From the discussion, it’s clear that this path was chosen for simplicity’s sake: it would require more infrastructure both to detect the correct CPU family inside the JIT, and introduce questions around what should the JIT do in case of AOT (Ahead Of Time) compilation, as well as require more testing infrastructure than what is currently in place on the one hand, while the one byte fix is very cheap even for CPUs that are not affected.

Let’s see if this CoreCLR fix does anything to our unmodified piece of code…:

Method N Mean (ns) Scaled To “buggy” CoreCLR Scaled to C++
POPCNTAndBMI2Unrolled 1 2.170 0.96 0.65
POPCNTAndBMI2Unrolled 4 11.910 1.09 1.08
POPCNTAndBMI2Unrolled 16 55.016 1.09 1.26
POPCNTAndBMI2Unrolled 64 225.156 1.08 1.11
POPCNTAndBMI2Unrolled 256 1,637.336 1.04 1.10
POPCNTAndBMI2Unrolled 1024 11,698.421 0.55 1.02
POPCNTAndBMI2Unrolled 4096 149,247.146 0.58 1.11
POPCNTAndBMI2Unrolled 16384 1,904,945.748 0.51 1.09
POPCNTAndBMI2Unrolled 65536 27,712,720.427 0.49 1.04

It sure does! It appears now that the unrolled version is running roughly 85-101% faster for higher bit counts than it did with the previous, unfixed CoreCLR!. When compared to C++, performance is now pretty close and consistent for the important parts of the benchmark. If you consider, for a moment, that we got here by making the JIT spill out an extra, supposedly useless instruction, this makes the achievement that much more impressive :), as before, here is the JITDump with the newly fixed JIT in place.

Now, we can really see just how much of a profound effect this false-dependency had on performance. In theory, this might be the right time to finish this post, however, I couldn’t let it go without attempting to explain the underlying CPU internals of how and why the false-dependency had such a deep effect on performance. For readers well aware of how CPU pipelines operate and how they interact with the register renaming functionality on a modern super-scalar out-of-order CPU this is a good time to stop reading.
What follows is me trying to explain how the CPU tries to handle loops of code effectively, and how register renaming plays an important role in that.

The love/hate story that is tight loops in CPUs

It takes very little imagination to realize that CPUs spend a lot their processing time executing loops (or the same machine code multiple times, in this context).
We need to remember that CPUs achieve remarkable throughput (e.g. instructions per cycles, or IPC) even though the table, in some ways, is set against them:

With that in mind, let’s take the same, short piece of assembly code, which was generated by the JIT for our last unrolled attempt at, and see how it theoretically executes on a Skylake CPU.

Visualizing our loop

Without any additional fanfare, lets introduce the following visualization:

iaca-popcnt

I created this diagram by prettifying a trace file generated by a little known tool made by Intel called IACA, which stands for Intel Architecture Code Analyzer. IACA takes a piece of machine code + target CPU family and produces a textual trace file that can help us see better what the CPU does, at every cycle of a relatively short loop.
If you dislike having to use commercial (non-OSS) tools, please note that there is a similar tool by the LLVM project called llvm-mca, and you can even use it from the infamous compiler-explorer.

Let’s try to break this diagram down:

It is important to note that IACA, while being Intel’s own tool is not aware of the Intel CPU bug I just described. It is simulating what that processor should have done with NO false dependency…

While all the various states of the instruction within the pipeline are interesting I will give some more meaning to specific states:

mnemonic Meaning
d Dispatched to execution: The CPU has completed decoding and waiting for the instruction’s dependencies to be ready. Execution will begin in the next cycle
e Executing: The instruction is being executed, often in multiple cycles within a specific execution port (unit) inside the CPU
w Writeback: The instruction’s result is being written back to a register in the register-file (more on this below), where it will be available for other instructions that might have a dependency on that instruction
R Retired: The temporary register used during the execution/writeback has to be written back to the “real” destination register, according to the original order of the program code, this is called retirement, after which the CPUs internal, temporary register is free again (more on this below)

I encourage you to try to follow this execution trace for a couple of instructions. I like to stare at these things for hours, trying to tell a story in my own head in the form of “what is the CPU thinking now” for each and every cycle. There is much we could say about this, but I will highlight a few remarkable things:

In case you are interested in digging much more deeper than I can afford to go into this within this post, I suggest you read Modern Microprocessors: A 90-Minute Guide! to get more detailed information about pipelines, super-scalar CPUs, everything I try to cover here, and more.

For this post, I will focus on one key aspect that lies in the root of how the CPU manages to do so many things at the same time: register renaming.

Instruction Dependencies

Let’s look at the code again, this time adding arrows between the various instructions, marking their interdependencies.

popcnt-deps

If we interpret this code naively (and wrongly), we see that rsi is being used in each and every instruction of this code fragment, this could lead us to assume that the heavy usage of rsi is generating a long dependency chain:

This naive dependency analysis pretty much contradicts the output we saw come out of IACA in the previous diagram without further explanation. It would seem impossible for the CPU to run so many things in parallel where every instruction here seems to have a dependency through the use of the rsi register.
Moreover, both our original C# and C++ code did not force the JIT/compiler to re-use the same register over and over. It could have allocated 4 different registers and used them to generate code where each POPCNT + SUB pair would be independent of the previous one, so why didn’t it do so?
Well, it turns out there is no need to! The JIT/compiler is doing exactly what it needs to be doing, it is just us, that need to learn about a very important concept in modern processors called register renaming.

Register Renaming

To understand why anyone would need something like register renaming, we first need to understand that CPU designers are stuck between a rock and a hard place:

These contradicting requirements led CPU designers to decouple the idea of register names and register storage: modern CPUs have many more (hundreds) or physical registers (storage) in their register-file than they have names for our software to use. This is where register renaming enters the scene.

What CPU designers have been doing, for quite a long time now (before 1967, believe it or not!) is really remarkable: they have been employing a really neat trick that effectively gets the best of both worlds (i.e. satisfy both requirements) at the cost of more complexity, more power usage, and more stages in the pipeline (hence also a little slowdown in the execution of a single instruction) to achieve better pipeline utilization at the global scale.

This optimization, named “Register renaming”, accomplishes just that: by analyzing when a register is being written (write-only, not read-write) to, the CPU “understands” that the previous value of that register is no longer required for the execution of instructions reading/writing to that same register from that moment onwards, even if previous instructions have not completed (or started) execution! What this really means, is that if we go back to the naive (now you see why) dependency analysis we did in the previous section, it’s clear that each POPCNT + SUB pair are actually completely independent of each other because they begin with overwriting rsi! In other words, each POPCNT having written to rsi is considered to be breaking the dependency chain from that moment onwards. What the CPU does, therefore, is continuously re-map named registers to different register locations on the register-file, according to the real dependency chain, and use that newly allocated location within the register file (hence the initial “Allocation” stage at the IACA diagram above) until the dependency chain is broken again (e.g. the same register is written to again).
I cannot emphasize how important of a tool this is for the CPU. Register renaming allows it to schedule multiple instructions to execute concurrently, either at different stages of the same pipeline or in parallel in different execution ports (pipelines) that exist in a super-scalar CPU. Moreover, this optimization achieves this while keeping the machine code small and easy to decode, since there are very few bits allocated for register names!

How big of a deal is this? How good is the CPU in using this renaming trick? To best answer this from a practical standpoint, I think, we can take a look into the disparity between how many register names exist, for example, in the x64 architecture, that number being 16, and how many physical register storage space there is on the register-file, for example, on an Intel Skylake CPU: 180 (!).

After the temporary (renamed) register has finished its job for a given instruction chain, we are still, unfortunately, not entirely done with it. Understand, that the CPU cannot look too far into the incoming instruction stream (mostly a few dozen bytes), and it can not know, with certainty, if the last written value it just wrote to a renamed register will not be required by some future part of the code it hasn’t seen yet, hundreds of instructions in the future. This brings us to the last phase of register renaming, which is retirement: The CPU must still write the last value for our symbolic register (rsi) back to the canonical location of that register (a.k.a the “real” register), in case future instructions that have not been loaded/decoded will attempt to read that value.
Moreover, this retirement phase must be performed exactly in program order for the program to continue operating as its original intention was.

Wrapping up: clearing the register for the rescue

So going back to our false-dependency bug, we can now hopefully understand the underlying issue and the fix armed with our new knowledge:

Our Intel CPU wrongly misunderstands our POPCNT instruction, when it comes to its dependency analysis: It “thinks” our usage of rsi is not only writing to it but also reading from it.
This is the false-dependency at the root of this issue. We cannot see this with IACA, but we can understand it conceptually: If the CPU (wrongfully) “thinks” that our second POPCNT has to READ the previous rsi value, then no register renaming can occur at that point, and the second POPCNT instruction cannot execute in parallel to the first one, it needs to wait for the completion of the first POPCNT and basically stall for a few precious cycles, in order for the previous rsi to be written back somewhere. Naturally this is true for every unrolled POPCNT in our loop and also between loop iterations.
This alone is enough to cause the perf drop we saw originally with the C# code before CoreCLR was patched. Once the xor esi,esi dependency breaker is added to the instruction stream, we are basically “informing” the CPU that we really are not dependent on the previous value of rsi and we allow it to perform register renaming from that point onwards. It still wrongfully thinks that POPCNT reads from rsi but thanks to our otherwise seemingly superfluous xor, this is an already renamed rsi and the pipeline stall is averted.

I think it is pretty clear by now, although we barely scratched the surface of CPU internals, that CPUs are very complex, and that in the race to extract more performance out of code, today’s out-of-order, super-scalar CPUs go into extreme lengths to find ways to parallelize machine code execution.
It should be also clear that it’s important to be able to empathize with the machine and understand the true nature of its inner workings to really be able to deal the weirdness we experience as we try to make stuff go faster.

It would be great if all we needed to do was keep compiler and hardware developers well fed and well paid so we could do our job without needing to know any of this, and to a great extent, this statement is true. But more often than not, extreme performance requires deeper understanding.

  1. As a side note, after not doing serious C++ work for years, coming back to it and discovering sanitizers, cmake, google test & benchmark was a very pleasant surprise. I distinctly remember the surprise of writing C++ and not having violent murderous thoughts at the same time. 

  2. Apparently Intel has fixed the bug (according to reports) for the LZCNT and TZCNT instructions on Skylake processors, but not so for the POPCNT instruction for reasons unknown to practically anyone. 

  3. yes, x86 registers are weird in that way, where some 64 bit registers have additional symbolic names referring to their lower 32, 16, and both 8 bit parts of their lower 16 bits, don’t ask. 

  4. µop or micro-op, is a low-level hardware operation. The CPU Front-End is responsible for reading the x86 machine code and decoding them into one or more µops.