<h1 id="this-goes-to-eleven-pt-5">This Goes to Eleven (Pt. 5/∞)</h1>
<p><em>damageboy · 2020-02-02 · <a href="https://bits.houmus.org/2020-02-02/this-goes-to-eleven-pt5">bits.houmus.org</a></em></p>
<p>I ended up going down the rabbit hole re-implementing array sorting with AVX2 intrinsics, and there’s no reason I should go down alone.</p>
<p>Since there’s a lot to go over here, I’ll split it up into a few parts:</p>
<ol>
<li>In <a href="/2020-01-28/this-goes-to-eleven-pt1">part 1</a>, we start with a refresher on <code class="highlighter-rouge">QuickSort</code> and how it compares to <code class="highlighter-rouge">Array.Sort()</code>.</li>
<li>In <a href="/2020-01-29/this-goes-to-eleven-pt2">part 2</a>, we go over the basics of vectorized hardware intrinsics, vector types, and go over a handful of vectorized instructions we’ll use in part 3. We still won’t be sorting anything.</li>
<li>In <a href="/2020-01-30/this-goes-to-eleven-pt3">part 3</a>, we go through the initial code for the vectorized sorting, and start seeing some payoff. We finish agonizing courtesy of the CPU’s branch predictor, throwing a wrench into our attempts.</li>
<li>In <a href="/2020-02-01/this-goes-to-eleven-pt4">part 4</a>, we go over a handful of optimization approaches that I attempted trying to get the vectorized partition to run faster, seeing what worked and what didn’t.</li>
<li>In this part, we’ll take a deep dive into how to deal with memory alignment issues.</li>
<li>In part 6, we’ll take a pause from vectorized partitioning to get rid of almost 100% of the remaining scalar code by implementing small, constant-size array sorting with yet more AVX2 vectorization.</li>
<li>In part 7, we’ll circle back and try to deal with a nasty slowdown left in our vectorized partitioning code.</li>
<li>In part 8, I’ll tell you the sad story of a very twisted optimization I managed to pull off while failing miserably at the same time.</li>
<li>In part 9, I’ll try some algorithmic improvements to milk those last drops of perf, or at least those that I can think of, from this code.</li>
</ol>
<h2 id="trying-to-squeeze-some-more-vectorized-juice">(Trying) to squeeze some more vectorized juice</h2>
<p>I thought it would be nice to show a bunch of things I ended up trying to improve performance.
I tried to keep most of these experiments in separate implementations, both the ones that yielded positive results and the failures. These can be seen in the original repo under the <a href="https://github.com/damageboy/VxSort/tree/research/VxSortResearch/Unstable/AVX2/Happy">Happy</a> and <a href="https://github.com/damageboy/VxSort/tree/research/VxSortResearch/Unstable/AVX2/Sad">Sad</a> folders.</p>
<p>While some worked and some didn’t, I think a bunch of these are worth mentioning, so here goes:</p>
<h3 id="aligning-our-expectations">Aligning our expectations</h3>
<center>
<object style="margin: auto; width: 90%" type="image/svg+xml" data="../assets/images/computer-architecture-caches-are-evil-quote.svg"></object>
</center>
<p>This quote, taken from Hennessy and Patterson’s <a href="https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-811905-1">“Computer Architecture: A Quantitative Approach, 6th Edition”</a>, traces all the way back to the fathers of modern-day computing in 1946. It can be taken as a foreboding warning of the pains awaiting anyone who deals with the complexity of memory hierarchies.</p>
<p>With modern computer hardware, CPUs <em>might</em> access memory more efficiently when it is naturally aligned: in other words, when the <em>address</em> we use is a multiple of some magical constant. The constant is classically the machine word size: 4/8 bytes on 32/64-bit machines. These constants are related to how the CPU is physically wired and constructed internally. Historically, older processors were very limited, either disallowing non-aligned memory access outright or severely penalizing it. To this day, very simple micro-controllers (like the ones you might find in IoT devices, for example) will exhibit such limitations around memory alignment, essentially forcing memory access to conform to multiples of 4/8 bytes. With more modern (read: more expensive) CPUs, these requirements have become increasingly relaxed. Most programmers can simply afford to <em>ignore</em> this issue. The last decade or so worth of modern processors is oblivious to this problem per-se, as long as we access memory within a <strong>single cache-line</strong>, or 64 bytes on almost any modern-day processor.</p>
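<p>To make “naturally aligned” concrete: an address is aligned to a (power-of-two) size exactly when its low bits are zero. Here is a tiny sketch of that check (my own illustration, not from the VxSort code):</p>

```csharp
// An address is aligned to `alignment` (a power of two) iff its
// low log2(alignment) bits are all zero
static bool IsAligned(ulong address, uint alignment) =>
    (address & (alignment - 1)) == 0;

// IsAligned(0x1000, 32) == true   (4096 is a multiple of 32)
// IsAligned(0x1004, 32) == false  (...but it is still 4-byte aligned)
// IsAligned(0x1004, 4)  == true
```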
<p>What is this cache-line? I’m actively fighting my internal inclination here: I <strong>won’t turn</strong> this post into a detour about computer micro-architecture. Caches have been covered elsewhere ad nauseam by far more talented writers, and I’d never do the topic justice anyway. Instead, I’ll just do the obligatory one-paragraph reminder, where we recall that CPUs don’t directly communicate with RAM, as it is dead slow; instead, they read and write from internal, on-die, special/fast memory called caches. Caches contain partial copies of RAM. Caches are faster, smaller, and organized in multiple levels (L1/L2/L3 caches, to name them), where each successive level is usually larger in size and slightly slower in terms of latency. When the CPU is instructed to access memory, it instead communicates with the cache units, but it never does so in small units. Even when our code is reading a <em>single byte</em>, the CPU will communicate with its cache subsystem in a unit-of-work known as a cache-line. In theory, every CPU model may have its own definition of a cache-line, but in practice, the last 15 years of processors seem to have converged on 64 bytes as that golden number.</p>
<p>Now, what happens when, let’s say, our read operations end up <strong>crossing</strong> cache-lines?</p>
<center>
<object style="margin: auto; width: 90%" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/cacheline-boundaries.svg"></object>
</center>
<p>As mentioned, the unit-of-work, as far as the CPU is concerned, is a 64-byte cache-line. Therefore, such reads literally cause the CPU to issue <em>two</em> read operations downstream, ultimately directed at the cache units<sup id="fnref:0" role="doc-noteref"><a href="#fn:0" class="footnote">1</a></sup>. These cache-line crossing reads <em>do</em> have a sustained effect on performance<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">2</a></sup>. But how often do they occur? Let’s consider this by way of example:<br />
Imagine we are processing a single array sequentially, reading 32-bit integers (4 bytes) at a time; if, for some reason, our starting address is <em>not</em> divisible by 4, cross cache-line reads would occur at a rate of <code class="highlighter-rouge">4/64</code> or <code class="highlighter-rouge">6.25%</code> of reads. Even this paltry rate of cross cache-line reads usually remains in the <em>realm of theory</em>, since we have the memory allocator and compiler working in tandem, behind the scenes, to make this go away:</p>
<ul>
<li>The default allocator <em>always</em> returns memory aligned at least to machine word size.</li>
<li>The compiler/JIT use padding bytes within our classes/structs in-between members, as needed, to ensure that individual members are aligned to 4/8 bytes.</li>
</ul>
<p>So far, I’ve told you why/when you <em>shouldn’t</em> care about alignment. This was my way of both easing you into the topic and helping you feel OK if this is news to you. You really can afford <em>not to think</em> about this without paying any penalty, for the most part. Unfortunately, this <strong>stops</strong> being true for <code class="highlighter-rouge">Vector256&lt;T&gt;</code>-sized reads, which are 32 bytes wide (256 bits / 8). And this is <em>doubly not true</em> for our partitioning problem:</p>
<ul>
<li>The memory handed to us for partitioning/sorting is rarely aligned to 32-bytes, except by dumb luck.<br />
The allocator, when allocating an array of 32-bit integers, simply doesn’t care about 32-<strong>byte</strong> alignment.</li>
<li>Even if it were magically aligned to 32-bytes, it would do us little good; once a <em>single</em> partition operation is complete, further sub-divisions, inherent to QuickSort, are determined by the (random) new placement of the last pivot we used.<br />
There is no way we will get lucky enough that <em>every partition</em> will be 32-byte aligned.</li>
</ul>
<p>Now that it is clear that we won’t be 32-byte aligned, we finally realize that as we go over the array sequentially (left to right and right to left, as we do), issuing <strong>unaligned</strong> 32-byte reads on top of 64-byte cache-lines, we end up reading across cache-lines on every <strong>other</strong> read, or at a rate of 50%! This just escalated from “…generally not a problem” to “Houston, we have a problem” very quickly.</p>
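<p>The 6.25% and 50% figures above are easy to sanity-check with a small simulation (a hypothetical helper, not part of the original code) that walks sequential reads and counts how many straddle a 64-byte boundary:</p>

```csharp
// Fraction of sequential reads of `readSize` bytes, starting at byte
// `offset` within a cache-line, that straddle a 64-byte boundary
static double SplitLoadRate(int readSize, int offset)
{
    const int CACHE_LINE = 64;
    const int N = 1024; // enough reads to cover every phase of the pattern
    var splits = 0;
    for (var i = 0; i < N; i++) {
        var start = (offset + i * readSize) % CACHE_LINE;
        if (start + readSize > CACHE_LINE)
            splits++;
    }
    return (double) splits / N;
}

// SplitLoadRate(4, 1)  == 0.0625 : misaligned 4-byte reads split 6.25% of the time
// SplitLoadRate(32, 4) == 0.5    : unaligned 32-byte reads split every other read
// SplitLoadRate(32, 0) == 0.0    : 32-byte aligned reads never split
```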
<p>You’ve endured a lot of hand-waving so far; let’s see if we can get some damning evidence for all of this by launching <code class="highlighter-rouge">perf</code>, this time tracking the oddly specific <code class="highlighter-rouge">mem_inst_retired.split_loads</code> HW counter:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="rouge-code"><pre><span class="nv">$ COMPlus_PerfMapEnabled</span><span class="o">=</span>1 perf record <span class="nt">-Fmax</span> <span class="nt">-e</span> mem_inst_retired.split_loads <span class="se">\</span>
./Example <span class="nt">--type-list</span> DoublePumpJedi <span class="nt">--size-list</span> 100000 <span class="se">\</span>
<span class="nt">--max-loops</span> 1000 <span class="nt">--no-check</span>
<span class="nv">$ </span>perf report <span class="nt">--stdio</span> <span class="nt">-F</span> overhead,sym | <span class="nb">head</span> <span class="nt">-20</span>
<span class="c"># To display the perf.data header info, please use --header/--header-only options.</span>
<span class="c"># Event count (approx.): 87102613</span>
<span class="c"># Overhead Symbol</span>
86.68% <span class="o">[</span>.] ...DoublePumpJedi::VectorizedPartitionInPlace<span class="o">(</span>int32<span class="k">*</span>,int32<span class="k">*</span><span class="o">)</span>
5.74% <span class="o">[</span>.] ...DoublePumpJedi::Sort<span class="o">(</span>int32<span class="k">*</span>,int32<span class="k">*</span>,int32<span class="o">)</span>
2.99% <span class="o">[</span>.] __memmove_avx_unaligned_erms
</pre></td></tr></tbody></table></code></pre></div></div>
<p>We ran the same sort operation <code class="highlighter-rouge">1,000</code> times and got <code class="highlighter-rouge">87,102,613</code> split-loads, with <code class="highlighter-rouge">86.68%</code> attributed to our partitioning function. This means <code class="highlighter-rouge">(87102613 * 0.8668) / 1000</code> or <code class="highlighter-rouge">75,500</code> split-loads <em>per sort</em> of <code class="highlighter-rouge">100,000</code> elements. To seal the deal, we need to figure out how many vector loads per sort we perform in the first place; luckily, I have statistics collection code embedded in the implementation, so I can generate an answer quickly with this command:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="nv">$ </span>./Example <span class="nt">--type-list</span> DoublePumpJedi <span class="se">\</span>
<span class="nt">--size-list</span> 100000 <span class="nt">--max-loops</span> 10000 <span class="se">\</span>
<span class="nt">--no-check</span> <span class="nt">--stats-file</span> jedi-100k-stats.json
</pre></td></tr></tbody></table></code></pre></div></div>
<p>And in return, I get this beautiful thing back:</p>
<table style="margin-bottom: 0em" class="notice--info">
<tr>
<td style="border: none; padding-top: 0; padding-bottom: 0; vertical-align: top"><span class="uk-label">Note</span></td>
<td style="border: none; padding-top: 0; padding-bottom: 0"><div>
<p>These numbers are vastly different than the ones we last saw in the end of the 3<sup>rd</sup> post, for example. There is a good reason for this: We’ve spent the previous post tweaking the code in a few considerable ways:</p>
<ul>
<li>Changing the cut-off point for vectorized sorting from 16 ⮞ 40, thereby reducing the number of vectorized partitions we’re performing in the first place.</li>
<li>Changing the permutation entry loading code to read 8-byte values from memory, rather than full 32-byte <code class="highlighter-rouge">Vector256&lt;int&gt;</code> entries,
cutting the number of <code class="highlighter-rouge">Vector256&lt;int&gt;</code> loads by half.</li>
</ul>
</div>
</td>
</tr>
</table>
<div>
<!-- <button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button> -->
<table class="table datatable" data-json="../_posts/jedi-stats.json" data-id-field="name" data-pagination="false" data-intro="Each row in this table contains statistics collected & averaged out of thousands of runs with random data" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="MethodName" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">Method<br />Name</span>
</th>
<th data-field="ProblemSize" data-sortable="true" data-value-type="int" data-filter-control="select">
<div data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="bottom" class="rotated-header-container">
<div class="rotated-header">Size</div>
</div>
</th>
<th data-field="MaxDepthScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="The maximal depth of recursion reached while sorting" data-position="top" class="rotated-header-container">
<div class="rotated-header">Max</div>
<div class="rotated-header">Depth</div>
</div>
</th>
<th data-field="NumPartitionOperationsScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="# of partitioning operations per sort" data-position="bottom" class="rotated-header-container">
<div class="rotated-header">Part</div>
<div class="rotated-header">itions</div>
</div>
</th>
<th data-field="NumVectorizedLoadsScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="# of vectorized load operations" data-position="top" class="rotated-header-container">
<div class="rotated-header">Vector</div>
<div class="rotated-header">Loads</div>
</div>
</th>
<th data-field="NumVectorizedStoresScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="# of vectorized store operations" data-position="bottom" class="rotated-header-container">
<div class="rotated-header">Vector</div>
<div class="rotated-header">Stores</div>
</div>
</th>
<th data-field="NumPermutationsScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="# of vectorized permutation operations" data-position="top" class="rotated-header-container">
<div class="rotated-header">Vector</div>
<div class="rotated-header">Permutes</div>
</div>
</th>
<th data-field="AverageSmallSortSizeScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="For hybrid sorting, the average size that each small sort operation was called with (e.g. InsertionSort)" data-position="bottom" class="rotated-header-container">
<div class="rotated-header">Small</div>
<div class="rotated-header">Sort</div>
<div class="rotated-header">Size</div>
</div>
</th>
<th data-field="NumScalarComparesScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="How many branches were executed in each sort operation that were based on the unsorted array elements" data-position="top" class="rotated-header-container">
<div class="rotated-header">Data</div>
<div class="rotated-header">Based</div>
<div class="rotated-header">Branches</div>
</div>
</th>
<th data-field="PercentSmallSortCompares" data-sortable="true" data-value-type="float2-percentage">
<div data-intro="What percent of<br/>⬅<br/>branches happened as part of small-sorts" data-position="bottom" class="rotated-header-container">
<div class="rotated-header">Small</div>
<div class="rotated-header">Sort</div>
<div class="rotated-header">Branches</div>
</div>
</th>
</tr>
</thead>
</table>
</div>
<p>In total, we perform <code class="highlighter-rouge">173,597</code> vector loads per sort operation of <code class="highlighter-rouge">100,000</code> elements in <code class="highlighter-rouge">4,194</code> partitioning calls. Assuming our array is aligned to 4-bytes to begin with (which C#’s allocator does very reliably), every partitioning call has a <code class="highlighter-rouge">4/32</code> or <code class="highlighter-rouge">12.5%</code> chance of ending up 32-byte aligned: in other words, <code class="highlighter-rouge">21,700</code> of the total vector reads should be aligned by sheer chance, which leaves <code class="highlighter-rouge">173597-21700</code> or <code class="highlighter-rouge">151,897</code> that should be <em>unaligned</em>, of which, I claim, ½ would cause split-loads: <code class="highlighter-rouge">50%</code> of <code class="highlighter-rouge">151,897</code> is <code class="highlighter-rouge">≈75,949</code>, while we measured <code class="highlighter-rouge">75,500</code> with <code class="highlighter-rouge">perf</code>! I don’t know how your normal day goes, but in mine, reality and my hallucinations rarely go hand-in-hand like this.</p>
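<p>For the skeptical, here is that back-of-the-envelope math spelled out in code (the constants are simply the numbers from the stats table and the <code class="highlighter-rouge">perf</code> run above):</p>

```csharp
const long totalVectorLoads = 173_597;   // per sort of 100,000 elements
const double alignedByChance = 4.0 / 32; // 12.5% of partitions start 32-byte aligned

// Loads issued from partitions that happen to be unaligned
var unalignedLoads = totalVectorLoads * (1 - alignedByChance); // ≈ 151,897

// Half of those straddle a cache-line boundary
var predictedSplits = unalignedLoads / 2;                      // ≈ 75,949

// perf measured ≈ 75,500 split-loads per sort: a suspiciously good match
```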
<p>Fine, we now <strong>know</strong> we have a problem. The first step was acknowledging/accepting reality: Our code does indeed generate a lot of split memory operations. Let’s consider our memory access patterns when reading/writing with respect to alignment, and see if we can do something about it:</p>
<ul>
<li>For writing, we’re all over the place: we always advance the write pointers according to how the data was partitioned, e.g. it is completely data-dependent, and there is little we can say about our write addresses. In addition, as it happens, Intel CPUs, like almost all other modern CPUs, employ another common trick in the form of <a href="https://en.wikipedia.org/wiki/Write_combining">store buffers, or write-combining buffers (WCBs)</a>. I’ll refrain from describing them here, but the bottom line is that we both can’t and don’t need to care about the writing side of our algorithm.</li>
<li>For reading, the situation is entirely different: We <em>always</em> advance the read pointers by 8 elements (32-bytes) on the one hand, and we even have a special intrinsic: <code class="highlighter-rouge">Avx.LoadAlignedVector256() / VMOVDQA</code><sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote">3</a></sup> that helps us ensure that our reading is properly aligned to 32-bytes.</li>
</ul>
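<p>For reference, here is what the two load flavors look like in C#. This is just an illustrative sketch (requires compiling with <code class="highlighter-rouge">/unsafe</code>); note that VMOVDQA faults on a misaligned address, which conveniently doubles as a runtime assertion that our alignment logic actually works:</p>

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe class Loads
{
    // VMOVDQU: accepts any address, but may split across two cache-lines
    public static Vector256<int> Unaligned(int* p) => Avx.LoadVector256(p);

    // VMOVDQA: requires p to be 32-byte aligned, faults otherwise
    public static Vector256<int> Aligned(int* p) => Avx.LoadAlignedVector256(p);
}
```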
<h4 id="aligning-to-cpu-cache-lines-1">Aligning to CPU Cache-lines: :+1:</h4>
<p>With this lengthy introduction out of the way, it’s time we do something about these cross cache-line reads. Initially, I got “something” working quickly: remember that we needed to deal with the <em>remainder</em> of the array (when we have fewer than 8 elements) anyway. In the original code at the end of the 3<sup>rd</sup> post, we did so right after our vectorized loop. If we move that scalar code from the end of the function to its beginning, while also modifying it to perform scalar partitioning until both <code class="highlighter-rouge">readLeft</code>/<code class="highlighter-rouge">readRight</code> pointers are aligned to 32 bytes, our work is complete. There is a slight wrinkle in this otherwise simple approach:</p>
<ul>
<li>Previously, we had anywhere between <code class="highlighter-rouge">0-7</code> elements left as a remainder for scalar partitioning per partition call.
<ul>
<li><code class="highlighter-rouge">3.5</code> elements on average.</li>
</ul>
</li>
<li>Aligning from the edges of our partition with scalar code means we will now have <code class="highlighter-rouge">0-7</code> elements per-side…
<ul>
<li>So <code class="highlighter-rouge">3.5 x 2 == 7</code> elements on average.</li>
</ul>
</li>
</ul>
<p>In other words, doing this sort of inwards pre-alignment optimization is not a clean win: We end up with more scalar work than before (which is unfortunate), but on the other hand, we can change the vector loading code to use <code class="highlighter-rouge">Avx.LoadAlignedVector256()</code> and <em>know for sure</em> that we will no longer be causing the CPU to issue a single cross cache-line read (the latter being the performance boost).<br />
It’s understandable if, while reading this, your gut reaction is to think that adding 3.5 scalar operations per side doesn’t sound like much of a trade-off, but we have to consider that:</p>
<ul>
<li>Each scalar comparison comes with a likely branch misprediction, as discussed before, so it has a higher cost than what you might be initially pricing in.</li>
<li>More importantly: we can’t forget that this is a recursive function, with ever <em>decreasing</em> partition sizes. If you go back to the initial stats we collected in previous posts, you’ll be quickly reminded that we partition upwards of 340k times for 1 million element arrays, so this scalar work both piles up, and represents a larger portion of our workload as the partition sizes decrease…</li>
</ul>
<p>I won’t bother showing the entire code listing for <a href="https://github.com/damageboy/VxSort/blob/research/VxSortResearch/Unstable/AVX2/Happy/B5_1_DoublePumpAligned.cs"><code class="highlighter-rouge">B5_1_DoublePumpAligned.cs</code></a>, but I will show the rewritten scalar partition block, which is now tasked with aligning our pointers before we go full vectorized partitioning. Originally it was right after the double-pumped loop and looked like this:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre> <span class="c1">// ...</span>
<span class="k">while</span> <span class="p">(</span><span class="n">readLeft</span> <span class="p"><</span> <span class="n">readRight</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">v</span> <span class="p">=</span> <span class="p">*</span><span class="n">readLeft</span><span class="p">++;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">v</span> <span class="p"><=</span> <span class="n">pivot</span><span class="p">)</span> <span class="p">{</span>
<span class="p">*</span><span class="n">tmpLeft</span><span class="p">++</span> <span class="p">=</span> <span class="n">v</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="p">*--</span><span class="n">tmpRight</span> <span class="p">=</span> <span class="n">v</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>The aligned variant, with the alignment code now at the top of the function, looks like this:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
</pre></td><td class="rouge-code"><pre> <span class="k">const</span> <span class="kt">ulong</span> <span class="n">ALIGN</span> <span class="p">=</span> <span class="m">32</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">ulong</span> <span class="n">ALIGN_MASK</span> <span class="p">=</span> <span class="n">ALIGN</span> <span class="p">-</span> <span class="m">1</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(((</span><span class="kt">ulong</span><span class="p">)</span> <span class="n">readLeft</span> <span class="p">&</span> <span class="n">ALIGN_MASK</span><span class="p">)</span> <span class="p">!=</span> <span class="m">0</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">nextAlign</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span> <span class="p">*)</span> <span class="p">(((</span><span class="kt">ulong</span><span class="p">)</span> <span class="n">readLeft</span> <span class="p">+</span> <span class="n">ALIGN</span><span class="p">)</span> <span class="p">&</span> <span class="p">~</span><span class="n">ALIGN_MASK</span><span class="p">);</span>
<span class="k">while</span> <span class="p">(</span><span class="n">readLeft</span> <span class="p"><</span> <span class="n">nextAlign</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">v</span> <span class="p">=</span> <span class="p">*</span><span class="n">readLeft</span><span class="p">++;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">v</span> <span class="p"><=</span> <span class="n">pivot</span><span class="p">)</span> <span class="p">{</span>
<span class="p">*</span><span class="n">tmpLeft</span><span class="p">++</span> <span class="p">=</span> <span class="n">v</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="p">*--</span><span class="n">tmpRight</span> <span class="p">=</span> <span class="n">v</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">Debug</span><span class="p">.</span><span class="nf">Assert</span><span class="p">(((</span><span class="kt">ulong</span><span class="p">)</span> <span class="n">readLeft</span> <span class="p">&</span> <span class="n">ALIGN_MASK</span><span class="p">)</span> <span class="p">==</span> <span class="m">0</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(((</span><span class="kt">ulong</span><span class="p">)</span> <span class="n">readRight</span> <span class="p">&</span> <span class="n">ALIGN_MASK</span><span class="p">)</span> <span class="p">!=</span> <span class="m">0</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">nextAlign</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span> <span class="p">*)</span> <span class="p">((</span><span class="kt">ulong</span><span class="p">)</span> <span class="n">readRight</span> <span class="p">&</span> <span class="p">~</span><span class="n">ALIGN_MASK</span><span class="p">);</span>
<span class="k">while</span> <span class="p">(</span><span class="n">readRight</span> <span class="p">></span> <span class="n">nextAlign</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">v</span> <span class="p">=</span> <span class="p">*--</span><span class="n">readRight</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">v</span> <span class="p"><=</span> <span class="n">pivot</span><span class="p">)</span> <span class="p">{</span>
<span class="p">*</span><span class="n">tmpLeft</span><span class="p">++</span> <span class="p">=</span> <span class="n">v</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="p">*--</span><span class="n">tmpRight</span> <span class="p">=</span> <span class="n">v</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">Debug</span><span class="p">.</span><span class="nf">Assert</span><span class="p">(((</span><span class="kt">ulong</span><span class="p">)</span> <span class="n">readRight</span> <span class="p">&</span> <span class="n">ALIGN_MASK</span><span class="p">)</span> <span class="p">==</span> <span class="m">0</span><span class="p">);</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>What it does now is check, for each side, whether alignment is necessary, and then proceed to align the pointer while also partitioning the stray elements into the temporary memory.</p>
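<p>The two <code class="highlighter-rouge">nextAlign</code> computations above rely on the classic power-of-two rounding trick. Isolated into a standalone sketch (with a hypothetical <code class="highlighter-rouge">Align</code> helper, using the same constants as the listing):</p>

```csharp
static class Align
{
    const ulong ALIGN = 32;
    const ulong ALIGN_MASK = ALIGN - 1; // 0b1_1111

    // Round down: clear the low 5 bits (the right-hand side's nextAlign)
    public static ulong Down(ulong address) => address & ~ALIGN_MASK;

    // Round up to the *next* 32-byte boundary (the left-hand side's
    // nextAlign); like the listing above, this assumes `address` is not
    // already aligned -- an aligned address would be bumped a full 32 bytes
    public static ulong Up(ulong address) => (address + ALIGN) & ~ALIGN_MASK;
}

// Align.Down(0x1007) == 0x1000
// Align.Up(0x1007)   == 0x1020
```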
<p>Where do we end up performance-wise with this optimization?</p>
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#50d405d8-6a9a-4b68-9b7f-20445b335308'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-stats"></i> Scaling</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks</a></li>
<li><a href="#"><i class="glyphicon glyphicon-info-sign"></i> Setup</a></li>
</ul>
<ul id="50d405d8-6a9a-4b68-9b7f-20445b335308" class="uk-switcher uk-margin">
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Performance scale: Array.Sort (solid gray) is always 100%, and the other methods are scaled relative to it" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
Jedi, 1 , 1 , 1 , 1 , 1 , 1
Aligned, 1.082653616, 1.091733385, 0.958578753, 0.959159569, 0.964604818, 0.980102965
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3}
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3}
}
]
},
"options": {
"title": { "text": "AVX2 Aligned Sorting - Scaled to Jedi", "display": true },
"scales": {
"yAxes": [{
"ticks": {
"fontFamily": "Indie Flower",
"min": 0.90,
"callback": "ticksPercent"
},
"scaleLabel": {
"labelString": "Scaling (%)",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":true,"labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
Jedi, 18.3938 ,20.7342 ,24.6347 ,26.9067 ,23.9922 ,25.5122
Aligned, 19.9128, 22.6363, 23.6143, 25.8078, 23.143, 25.0046
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3}
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3}
}
]
},
"options": {
"title": { "text": "AVX2 Jedi Sorting + Aligned - log(Time/N)", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"min": 15,
"max": 28,
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt5_1_Int32_-report.datatable.json" data-id-field="name" data-pagination="false" data-page-list="[9, 18]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result is. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nv">BenchmarkDotNet</span><span class="o">=</span>v0.12.0, <span class="nv">OS</span><span class="o">=</span>clear-linux-os 32120
Intel Core i7-7700HQ CPU 2.80GHz <span class="o">(</span>Kaby Lake<span class="o">)</span>, 1 CPU, 4 logical and 4 physical cores
.NET Core <span class="nv">SDK</span><span class="o">=</span>3.1.100
<span class="o">[</span>Host] : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
Job-DEARTS : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
<span class="nv">InvocationCount</span><span class="o">=</span>3 <span class="nv">IterationCount</span><span class="o">=</span>15 <span class="nv">LaunchCount</span><span class="o">=</span>2
<span class="nv">UnrollFactor</span><span class="o">=</span>1 <span class="nv">WarmupCount</span><span class="o">=</span>10
<span class="nv">$ </span><span class="nb">grep</span> <span class="s1">'stepping\|model\|microcode'</span> /proc/cpuinfo | <span class="nb">head</span> <span class="nt">-4</span>
model : 158
model name : Intel<span class="o">(</span>R<span class="o">)</span> Core<span class="o">(</span>TM<span class="o">)</span> i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>The whole attempt ends up as a mediocre improvement, so it would seem:</p>
<ul>
<li>We’re seeing a speedup/improvement in the higher element counts.</li>
<li>We seem to be slowing down in the lower problem sizes, due to the higher scalar operation count.</li>
</ul>
<p>It’s kind of a mixed bag, and perhaps slightly unimpressive at first glance. However, when we stop to remember that we somehow managed to speed up the function while doubling the amount of scalar work done, the interpretation of the results becomes more nuanced: The pure benefit from alignment itself is larger than what the results are showing right now since it’s being masked, to some extent, by the extra scalar work we tacked on. If only there was a way we could skip that scalar work altogether… If only there was a way… If only…</p>
</div>
<h3 id="re-partitioning-overlapping-regions-1-1">(Re-)Partitioning overlapping regions: :+1: :+1:</h3>
<p>Next up is a different optimization approach to the same problem, and a natural progression from the last one. At the risk of sounding pompous, I think I <em>might</em> have found something here that no-one has done before in the context of partitioning<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote">4</a></sup>: The basic idea here is we get rid of all (ok, ok, <em>almost all</em>) scalar partitioning in our vectorized code path. If we can partition and align the edges of the segment we are about to process with vectorized code, we would be reducing the total number of instructions executed. At the same time, we would be retaining more of the speed-up that was lost with the alignment optimization above. This would have a double-whammy compounded effect. But how?</p>
<object style="margin: auto" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/overlap-partition-with-hint.svg"></object>
<p>We could go about it the other way around! Instead of aligning <em>inwards</em> in each respective direction, we could align <strong><em>outwards</em></strong> and enlarge the partitioned segment to include a few more (up to 7) elements on the outer rims of each partition and <u>re-partition</u> them using the new pivot we’ve just selected. If this works, we end up doing both 100% aligned reads and eliminating all scalar work in one optimization! This might <em>sound simple</em> and <strong>safe</strong>, but this is the sort of humbling experience that QuickSort is quick at dispensing (sorry, I had to…) to people trying to nudge it the wrong way. At some point, I was finally able to screw my own head on properly with respect to this re-partitioning attempt and figure out precisely which critical constraints we must respect for this to work.</p>
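<p>To put numbers on those “up to 7 elements”, here is a minimal sketch (the naming is mine, not VxSort’s) of how many extra elements each edge would need to “borrow” so that both edge reads land on a 32-byte boundary, assuming 4-byte elements and a 32-byte-aligned array base:</p>

```csharp
using System;

public static class OutwardAlignment
{
    // Illustrative constants; the real code does this with pointer arithmetic.
    const int AlignBytes  = 32;                       // Vector256<int> spans 32 bytes
    const int ElemsPerVec = AlignBytes / sizeof(int); // 8 x 32-bit elements

    // firstElem/lastElem are element offsets (first and one-past-last) measured
    // from a 32-byte-aligned array base. Returns how many extra elements each
    // side must borrow (0..7) so both edge vector loads become aligned.
    public static (int left, int right) OverlapElements(int firstElem, int lastElem)
    {
        int left  = firstElem % ElemsPerVec;
        int right = (ElemsPerVec - lastElem % ElemsPerVec) % ElemsPerVec;
        return (left, right);
    }
}
```

<p>For example, a segment spanning elements 3..13 would borrow 3 elements on the left and 3 on the right; a segment already spanning 8..16 would borrow none.</p>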
<table style="margin-bottom: 0em" class="notice--info">
<tr>
<td style="border: none; padding-top: 0; padding-bottom: 0; vertical-align: top"><span class="uk-label">Note</span></td>
<td style="border: none; padding-top: 0; padding-bottom: 0"><div>
<p>This is a slightly awkward optimization when you consider that I’m suggesting we should <strong>partition more data</strong> in order to <em>speed up</em> our code. This sounds bonkers, unless we dig deep within for some mechanical empathy: not all work is equal in the eyes of the CPU. When we are executing scalar partitioning on <em>n</em> elements, we are really telling the CPU to execute <em>n</em> branches, comparisons, and memory accesses, which are completely data-dependent. The CPU “hates” this sort of work. It has to guess what happens next, and will do so no better than flipping a coin, or 50%, for truly random data. What’s worse, as mentioned before, whenever the CPU mispredicts, there’s a price to pay in the form of a full pipeline flush which roughly costs us 14-15 cycles on a modern CPU. Paying this <strong>once</strong>, is roughly equivalent to partitioning 2 x 8 element vectors with our vectorized partition block! This is the reason that doing “more” might be faster.</p>
</div>
</td>
</tr>
</table>
<p>Back to the constraints. There’s one thing we can <strong>never</strong> do: move a pivot that was previously partitioned. I (now) call them “buried pivots” (since they’re in their final resting place, get it?); everyone knows, you don’t move around dead bodies, that’s always the first bad thing that happens in a horror movie. There’s our motivation: not being the stupid person who dies first. That’s about it. It sounds simple, but it requires some more serious explanation: When a previous partition operation is complete, the pivot used during that operation is moved to its final resting place. Its new position is used to subdivide the array, and is effectively stored throughout numerous call stacks of our recursive function. There’s a baked-in assumption here that all data left/right of that buried pivot is smaller/larger than it. And that assumption must <strong>never</strong> be broken. If we intend to <strong>re-partition</strong> data to the left and right of a given partition, as part of this overlapping alignment effort, we need to consider that this extra data might already contain buried pivots, and we cannot, under any circumstances, ever move them again.<br />
In short: Buried pivots stay buried where we left them, or bad things happen.</p>
<p>When we call our partitioning operation, we have to consider what initially looks like an asymmetry of the left and right edges of our to-be-partitioned segment:</p>
<ul>
<li>For the left side:
<ul>
<li>There might not be additional room on the left with extra data to read from.
<ul>
<li>We are too close to the edge of the array on the left side!<br />
This happens for all partitions starting at the left-edge of the entire array.</li>
</ul>
</li>
<li>Since we always partition left of any buried pivot first, then right of it, we know for a fact that all elements left of “our” partition at any given moment are sorted, i.e. they are all buried pivots, and we can’t re-order them.</li>
<li><em>Important:</em> We also know that each of those values is smaller than or equal to whatever pivot value we <em>will select</em> for the current partitioning operation.</li>
</ul>
</li>
<li>For the right side, it is almost the same set of constraints:
<ul>
<li>There might not be additional room on the right with extra data to read from.
<ul>
<li>We are too close to the edge of the array on the right side!<br />
This happens for all partitions ending on the right-edge of the entire array.</li>
</ul>
</li>
<li>The immediate value to our right side is a buried pivot, and all other values to its right are larger-than-or-equal to it.</li>
<li>There might be additional pivots immediately to our right as well.</li>
<li><em>Important:</em> We also know that each of those values is larger-than-or-equal to whatever pivot value we <em>will select</em> for the current partitioning operation.</li>
</ul>
</li>
</ul>
<p>All this information is hard to integrate at first, but what it boils down to is that whenever we load up the left overlapping vector, there are anywhere between 1 and 7 elements we are <strong>not</strong> allowed to reorder on the <em>left side</em>, and when we load the right overlapping vector, there are, again, anywhere between 1 and 7 elements we are <strong>not</strong> allowed to re-order on <em>that right side</em>. That’s the challenge; the good news is that all those overlapping elements are also guaranteed to be smaller/larger than whatever pivot we end up selecting from our original (sans overlap) partition. This knowledge gives us the edge we need: We know in advance that the extra elements will generate predictable comparison results compared to <em>any</em> pivot <em>within</em> our partition.</p>
<p>What we need are permutation entries that are <strong><em>stable</em></strong>. I’m coining this phrase freely as I’m going along:<br />
Stable partitioning means that the partitioning operation <strong>must not</strong> <em>reorder</em> values that need to go on the left amongst themselves (their internal ordering is preserved). Likewise, it <strong>must not</strong> reorder the values that go on the right amongst themselves. If we manage to do this, we’re in the clear: The combination of stable permutation and predictable comparison results means that the overlapping elements will stay put while other elements will be partitioned properly on both edges of our overlapping partition. After this weird permutation, we just need to forget we ever read those extra elements, and the whole thing just… works? … yes!</p>
<p>Let’s start with cementing this idea of what stable partitioning is: Up to this point, there was no such requirement, and the initial partition tables I generated failed to satisfy this requirement.
Here’s a simple example of stable/unstable permutation entries; let’s imagine we partition the following values around a pivot value of 500:</p>
<table>
<thead>
<tr>
<th>Bit</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
</tr>
</thead>
<tbody>
<tr>
<td><code class="highlighter-rouge">Vector256<T></code> Value</td>
<td>99</td>
<td>100</td>
<td>666</td>
<td>101</td>
<td>102</td>
<td>777</td>
<td>888</td>
<td>999</td>
</tr>
<tr>
<td>Mask</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>Unstable Permutation</td>
<td>0</td>
<td>1</td>
<td><strong>7</strong></td>
<td>2</td>
<td>3</td>
<td><strong>6</strong></td>
<td><strong>5</strong></td>
<td><strong>4</strong></td>
</tr>
<tr>
<td>Unstable Result</td>
<td>99</td>
<td>100</td>
<td>101</td>
<td>102</td>
<td><strong>999</strong></td>
<td><strong>888</strong></td>
<td><strong>777</strong></td>
<td><strong>666</strong></td>
</tr>
<tr>
<td>Stable Permutation</td>
<td>0</td>
<td>1</td>
<td>4</td>
<td>2</td>
<td>3</td>
<td>5</td>
<td>6</td>
<td>7</td>
</tr>
<tr>
<td>Stable Result</td>
<td>99</td>
<td>100</td>
<td>101</td>
<td>102</td>
<td>666</td>
<td>777</td>
<td>888</td>
<td>999</td>
</tr>
</tbody>
</table>
<p>In the above example, the unstable permutation is a perfectly <em><u>valid</u></em> permutation for general case partitioning. It successfully partitions the sample vector around the pivot value of 500, but the 4 elements marked in bold are re-ordered with respect to each other when compared to the original array. In the stable permutation entry, the internal ordering amongst the partitioned groups is <em>preserved</em>.</p>
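<p>The stability property can be checked mechanically. Here is a small sketch (names are mine) that applies a permutation the way the table above lists it: lane <em>i</em> of the source moves to lane <em>perm[i]</em> of the result (note this is the inverse of the gather-style <code class="highlighter-rouge">dst[i] = src[perm[i]]</code> convention that the <code class="highlighter-rouge">VPERMD</code>/<code class="highlighter-rouge">PermuteVar8x32</code> instruction itself uses). It then tests whether the permutation is a stable partition for a given mask:</p>

```csharp
using System;
using System.Linq;

public static class StablePermutationDemo
{
    // Scatter-apply a permutation: lane i of the source goes to lane perm[i].
    public static int[] Scatter(int[] src, int[] perm)
    {
        var dst = new int[src.Length];
        for (int i = 0; i < src.Length; i++)
            dst[perm[i]] = src[i];
        return dst;
    }

    // A permutation is a *stable* partition for a given mask when the result is
    // exactly: all mask==0 lanes in their original order, followed by all
    // mask==1 lanes in their original order. Where() preserves order, which is
    // exactly the property we want to check against.
    public static bool IsStablePartition(int[] src, int[] mask, int[] perm)
    {
        var expected = src.Where((_, i) => mask[i] == 0)
                          .Concat(src.Where((_, i) => mask[i] == 1));
        return expected.SequenceEqual(Scatter(src, perm));
    }
}
```

<p>Running this against the two entries from the table flags the unstable one and accepts the stable one.</p>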
<p>Armed with new, stable permutation entries, we can proceed with this overlapping re-partitioning hack: The idea is to find the optimal alignment point on the left and on the right (assuming one is available, i.e. there is enough room on that side), read that data with the <code class="highlighter-rouge">LoadAlignedVector256</code> intrinsic, and partition it into the temporary area. The final twist: We need to keep tabs on how many elements <em>do not belong</em> to this partition (i.e. they originate from our overlap gymnastics), and remember not to copy them back into our partition at the end of the function, relying on our stable partitioning to keep them grouped at the edges of the temporary buffer we’re copying from… To my amazement, that was kind of it. It just works! (I’ve conveniently ignored a small edge-case here in words, but not in the code :).</p>
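<p>The bookkeeping above can be modeled in scalar code. This is a sketch of the idea only (the real code does this with 8-element vectors and permutation tables, and the names are mine); it also sidesteps the pivot-valued-ties edge case alluded to above by assuming the borrowed right-side elements are strictly larger than the pivot:</p>

```csharp
using System;
using System.Linq;

public static class OverlinedModel
{
    // 1. Widen [left, right) by the borrowed overlap elements on each side,
    // 2. stably partition the widened range into a temporary buffer,
    // 3. copy back only the middle, leaving the borrowed edges untouched.
    public static void PartitionWithOverlap(int[] a, int left, int right,
                                            int leftOverlap, int rightOverlap,
                                            int pivot)
    {
        int lo = left - leftOverlap, hi = right + rightOverlap;
        var widened = a.Skip(lo).Take(hi - lo).ToArray();

        // LINQ's Where() preserves order, so this is a *stable* partition:
        var tmp = widened.Where(v => v <= pivot)
                         .Concat(widened.Where(v => v > pivot))
                         .ToArray();

        // Stability guarantees the borrowed elements ended up grouped at the
        // outer edges of tmp, so skipping them on copy-back leaves them put:
        Array.Copy(tmp, leftOverlap, a, left, right - left);
    }
}
```

<p>With a buried pivot on each rim (e.g. <code class="highlighter-rouge">[2, 3 | 9, 4, 12, 8 | 20, 30]</code> partitioned around 8 with a 2-element overlap on each side), the middle gets partitioned while positions 0–1 and 6–7 come out untouched.</p>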
<p>The end result is super delicate. If you feel you’ve got it, skip this paragraph, but if you need an alternative view on how this works, here it is: I’ve just described how to partition the initial 2x8 elements (8 on each side); out of those initial 8, We <em>always</em> have a subset that must <strong>never</strong> be reordered (the overlap), and a subset we need to re-order, as is normal, with respect to some pivot. We know that whatever <em>possible</em> pivot value <em>might</em> be selected from our internal partition, it will always be larger/smaller than the elements in the overlapping areas. Knowing that, we can rely on having stable permutation entries that <strong>do not</strong> reorder those extra elements. In the end, we read extra elements, feed them through our partitioning machine, but ignore the extra overlapping elements and avoid <em>all</em> scalar partitioning thanks to this scheme.</p>
<p>In the end, we literally get to eat our cake and keep it whole: For the 99% case we <strong>kill</strong> scalar partitioning all-together, doing <em>zero</em> scalar work, at the same time aligning everything to <code class="highlighter-rouge">Vector256<T></code> size and being nice to our processor. Just to make this victory a tiny touch sweeter, even the <em>initial</em> 2x8 partially overlapping vectors are read using aligned reads!
I named this approach “overligned” (overlap + align) in my code-base; it is available in full in <a href="https://github.com/damageboy/VxSort/blob/research/VxSortResearch/Unstable/AVX2/Happy/B5_2_DoublePumpOverlined.cs"><code class="highlighter-rouge">B5_2_DoublePumpOverlined.cs</code></a>. It implements this overlapping alignment approach, with some extra small points for consideration:</p>
<ul>
<li>When it is <strong>impossible</strong> to align outwards, we fall back to the alignment mechanic introduced in the previous section.<br />
This is uncommon: Going back to the statistical data we collected about random-data sorting in the 3<sup>rd</sup> post, we anticipate a recursion depth of around 40 when sorting 1M elements and ~340K partitioning calls. We will have <em>at least</em> 40x2 (for both sides) such cases where we align inwards for that 1M case, as an example. This is small change compared to the <code class="highlighter-rouge">340K - 80</code> calls we can optimize with outward alignment, but it does mean we have to keep that old code lying around.</li>
<li>Once we calculate for a given partition how much alignment is required on each side, we can cache that calculation recursively for the entire depth of the recursive call stack: This again reduces the overhead we are paying for this alignment strategy.
In the code you’ll see I’m squishing two 32-bit integers into a 64-bit value I call <code class="highlighter-rouge">alignHint</code>, and I keep reusing one half of that 64-bit value without recalculating the alignment <em>amount</em>; if we’ve made it this far, let’s shave a few more cycles off while we’re here.</li>
</ul>
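<p>The <code class="highlighter-rouge">alignHint</code> packing can be sketched along these lines (the field layout and helper names are my own guesses at the spirit of the code, not VxSort’s actual implementation):</p>

```csharp
using System;

public static class AlignHintDemo
{
    // Pack two 32-bit alignment amounts into one 64-bit hint:
    // low 32 bits = left alignment, high 32 bits = right alignment.
    public static long Pack(int leftAlign, int rightAlign) =>
        (uint)leftAlign | ((long)rightAlign << 32);

    public static int Left(long hint)  => (int)(uint)hint;
    public static int Right(long hint) => (int)(hint >> 32);

    // When recursing into the left half, the left alignment stays valid and
    // only the right half of the hint needs replacing (and vice versa):
    public static long ReplaceRight(long hint, int newRight) =>
        (hint & 0xFFFFFFFFL) | ((long)newRight << 32);
}
```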
<p>There’s another small optimization I tacked on to this version, which I’ll discuss immediately after providing the results:</p>
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#3d6bdb20-d0b7-4c05-ae7d-d6aa78662bad'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-stats"></i> Scaling</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks</a></li>
<li><a href="#"><i class="glyphicon glyphicon-info-sign"></i> Setup</a></li>
</ul>
<ul id="3d6bdb20-d0b7-4c05-ae7d-d6aa78662bad" class="uk-switcher uk-margin">
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Performance scale: Array.Sort (solid gray) is always 100%, and the other methods are scaled relative to it" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,64K,100K,1M,1.5M,10M
Jedi, 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1
Overlined, 1.012312, 0.995069647, 0.904921232, 0.905092554, 0.915092554, 0.9212314, 0.929801383, 0.960170878
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3}
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3}
}
]
},
"options": {
"title": { "text": "AVX2 Overlined Sorting - Scaled to Jedi", "display": true },
"scales": {
"yAxes": [{
"ticks": {
"fontFamily": "Indie Flower",
"min": 0.88,
"callback": "ticksPercent"
},
"scaleLabel": {
"labelString": "Scaling (%)",
"display": true
}
}]
},
"annotation": {
"annotations": [{
"drawTime": "afterDatasetsDraw",
"type": "line",
"mode": "vertical",
"scaleID": "x-axis-0",
"value": "1.5M",
"borderColor": "#666666",
"borderWidth": 2,
"borderDash": [5, 5],
"borderDashOffset": 5,
"label": {
"yAdjust": 5,
"backgroundColor": "rgba(255, 0, 0, 0.75)",
"fontFamily": "Indie Flower",
"fontSize": 14,
"content": "L3 Cache Size",
"enabled": true
}
},
{
"drawTime": "afterDatasetsDraw",
"type": "line",
"mode": "vertical",
"scaleID": "x-axis-0",
"value": "64K",
"borderColor": "#666666",
"borderWidth": 2,
"borderDash": [5, 5],
"borderDashOffset": 5,
"label": {
"yAdjust": 65,
"backgroundColor": "rgba(255, 0, 0, 0.75)",
"fontFamily": "Indie Flower",
"fontSize": 14,
"content": "L2 Cache Size",
"enabled": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
Jedi, 19.4547, 20.8907, 23.8802, 24.7229, 22.8053, 25.7011
Overlined, 20.092, 20.7878, 21.6097, 22.6238, 21.2044, 24.6774
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3}
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3}
}
]
},
"options": {
"title": { "text": "AVX2 Jedi Sorting + Overlined - log(Time/N)", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"min": 15,
"max": 28,
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt5_2_Int32_-report.datatable.json" data-id-field="name" data-pagination="false" data-page-list="[9, 18]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result it. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nv">BenchmarkDotNet</span><span class="o">=</span>v0.12.0, <span class="nv">OS</span><span class="o">=</span>clear-linux-os 32120
Intel Core i7-7700HQ CPU 2.80GHz <span class="o">(</span>Kaby Lake<span class="o">)</span>, 1 CPU, 4 logical and 4 physical cores
.NET Core <span class="nv">SDK</span><span class="o">=</span>3.1.100
<span class="o">[</span>Host] : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
Job-DEARTS : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
<span class="nv">InvocationCount</span><span class="o">=</span>3 <span class="nv">IterationCount</span><span class="o">=</span>15 <span class="nv">LaunchCount</span><span class="o">=</span>2
<span class="nv">UnrollFactor</span><span class="o">=</span>1 <span class="nv">WarmupCount</span><span class="o">=</span>10
<span class="nv">$ </span><span class="nb">grep</span> <span class="s1">'stepping\|model\|microcode'</span> /proc/cpuinfo | <span class="nb">head</span> <span class="nt">-4</span>
model : 158
model name : Intel<span class="o">(</span>R<span class="o">)</span> Core<span class="o">(</span>TM<span class="o">)</span> i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>This is much better! The improvement is much more pronounced here, and we have a lot to consider:</p>
<ul>
<li>The performance improvements are not spread evenly throughout the range of problem sizes.</li>
<li>I’ve conveniently included two vertical markers; per my specific machine model, they show the sizes of the L2/L3 caches translated to the <code class="highlighter-rouge">#</code> of 32-bit elements in our array.</li>
<li>It can be clearly seen that as long as we’re sorting roughly within the size of our L2-L3 cache size range, this optimization pays in spades: we’re seeing ~10% speedup in runtime in many cases!</li>
<li>It is also clear that as we progress outside the size of the L2 into the L3 cache size, and ultimately exhaust the size of our caches entirely, the returns on this optimization diminish gradually.</li>
<li>While not shown here, since I’ve lost access to that machine, on older Intel/AMD machines, where only one load operation can be executed by the processor at any given time (Example: Intel Broadwell processors), this can lead to an improvement of 20% in total runtime; this should make sense: the fewer load ports the CPU has, the better this split-load reducing technique performs.</li>
<li>Another thing to consider is that in future variations of this code, when I finally get access to, and the ability to use, AVX-512 with its 64-byte wide registers, the effects of this optimization will be much more pronounced again, for a different reason: with vector registers spanning 64 bytes each, split-loading becomes a bigger problem (every single unaligned read becomes a split-load), so removing it is even more important.</li>
</ul>
</div>
<p>As the problem size goes beyond the size of the L2 cache, we are hit with the realities of CPU cache latency numbers. As a service to the reader, here is a visual representation of the <a href="https://www.7-cpu.com/cpu/Skylake_X.html">latency numbers for a Skylake-X CPU</a> running at 4.3 GHz:</p>
<center>
<object style="margin: auto; width: 90%" type="image/svg+xml" data="../assets/images/latency.svg"></object>
</center>
<p>The small number of cycles we tack on to memory operations as the penalty for split-loading (7 in this diagram) is very real when compared to regular L1/L2 cache latency. But once we compare it to L3 or RAM latency, it becomes abundantly clear why we are seeing diminishing returns for this optimization; the penalty is simply too small to notice at those work points.</p>
<p>Finally, for this optimization, we must never forget our motto of trusting no one and nothing. Let’s double-check what the current state of affairs is as far as <code class="highlighter-rouge">perf</code> is concerned:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="rouge-code"><pre><span class="nv">$ </span>perf record <span class="nt">-Fmax</span> <span class="nt">-e</span> mem_inst_retired.split_loads <span class="se">\</span>
./Example <span class="nt">--type-list</span> DoublePumpOverlined <span class="nt">--size-list</span> 100000 <span class="se">\</span>
<span class="nt">--max-loops</span> 1000 <span class="nt">--no-check</span>
<span class="nv">$ </span>perf report <span class="nt">--stdio</span> <span class="nt">-F</span> overhead,sym | <span class="nb">head</span> <span class="nt">-20</span>
<span class="c"># To display the perf.data header info, please use --header/--header-only options.</span>
<span class="c"># Samples: 129 of event 'mem_inst_retired.split_loads'</span>
<span class="c"># Event count (approx.): 12900387</span>
<span class="c"># Overhead Symbol</span>
30.23% <span class="o">[</span>.] DoublePumpOverlined...::Sort<span class="o">(</span>int32<span class="k">*</span>,int32<span class="k">*</span>,int64,int32<span class="o">)</span>
28.68% <span class="o">[</span>.] DoublePumpOverlined...::VectorizedPartitionInPlace<span class="o">(</span>int32<span class="k">*</span>,int32<span class="k">*</span>,int64<span class="o">)</span>
13.95% <span class="o">[</span>.] __memmove_avx_unaligned_erms
0.78% <span class="o">[</span>.] JIT_MemSet_End
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Seems like this moved the needle, and then some. We started with <code class="highlighter-rouge">86.68%</code> of <code class="highlighter-rouge">87,102,613</code> split-loads in our previous version of vectorized partitioning, and now we have <code class="highlighter-rouge">28.68%</code> of <code class="highlighter-rouge">12,900,387</code>. In other words: <code class="highlighter-rouge">(0.2868 * 12900387) / (0.8668 * 87102613)</code> gives us <code class="highlighter-rouge">4.9%</code>, or a <code class="highlighter-rouge">95.1%</code> reduction of split-load events for this version.
Not an entirely unpleasant experience.</p>
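<p>For intuition on what this counter is tallying: a 32-byte (AVX2-sized) load straddles two 64-byte cache lines exactly when its starting offset within a line leaves fewer than 32 bytes before the line ends. Here is a minimal sketch of that condition, in C rather than this series’ C# (the function name is mine, not anything from the code above):</p>

```c
#include <assert.h>
#include <stdint.h>

/* A 32-byte load starting at address p straddles two 64-byte cache
   lines exactly when its offset within the line is greater than 32,
   i.e. fewer than 32 bytes remain before the line boundary. */
static int is_split_load(uintptr_t p)
{
    return (p & 63u) > 32u;
}
```

<p>This is why aligning the read pointers to 32 bytes (so the offset within a line is always 0 or 32) makes the split-load count collapse.</p>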
<h4 id="sub-optimization--converting-branches-to-arithmetic-1">Sub-optimization: Converting branches to arithmetic: :+1:</h4>
<p>By this time, my code contained quite a few branches to deal with various edge cases around alignment, and I pulled another rabbit out of the optimization hat that is worth mentioning: We can convert simple branches into arithmetic operations. Many times, we end up having branches with super simple code behind them; here’s a real example I used to have in my code, as part of some early version of overlinement, which we’ll try to optimize:</p>
<div>
<div class="stickemup">
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="kt">int</span> <span class="n">leftAlign</span><span class="p">;</span>
<span class="p">...</span> <span class="c1">// Calculate left align here...</span>
<span class="k">if</span> <span class="p">(</span><span class="n">leftAlign</span> <span class="p"><</span> <span class="m">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">readLeft</span> <span class="p">+=</span> <span class="m">8</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
</div>
<p>This looks awfully friendly, and it is, unless <code class="highlighter-rouge">leftAlign</code>, and therefore the entire branch, is determined by random data we read from the array, making the CPU mispredict this branch more often than we’d care for. In my case, I had two branches like this, and each of them was happening at a rate of <code class="highlighter-rouge">1/8</code>. So enough for me to care. The good news is that we can re-write this, entirely in C#, and replace the potential misprediction with a constant, predictable (and often shorter!) data dependency. Let’s start by inspecting the re-written “branch”:</p>
</div>
<div>
<div class="stickemup">
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre><span class="kt">int</span> <span class="n">leftAlign</span><span class="p">;</span>
<span class="p">...</span> <span class="c1">// Calculate left align here...</span>
<span class="c1">// Signed arithmetic FTW</span>
<span class="kt">var</span> <span class="n">leftAlignMask</span> <span class="p">=</span> <span class="n">leftAlign</span> <span class="p">>></span> <span class="m">31</span><span class="p">;</span>
<span class="c1">// the mask is now either all 1s or all 0s depending on leftAlign's sign!</span>
<span class="n">readLeft</span> <span class="p">+=</span> <span class="m">8</span> <span class="p">&</span> <span class="n">leftAlignMask</span><span class="p">;</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
</div>
<p>By taking the same value we were comparing to 0 and right-shifting it, we are performing an arithmetic right shift. This takes the top bit, which is either <code class="highlighter-rouge">0/1</code> depending on <code class="highlighter-rouge">leftAlign</code>’s sign, and propagates it throughout the entire 32-bit value, which is then assigned to the <code class="highlighter-rouge">leftAlignMask</code> variable. We’ve essentially taken what was previously the comparison result driving the branch (the sign bit) and transformed it into a mask. We then proceed to use the mask to control the outcome of the <code class="highlighter-rouge">+= 8</code> operation, effectively turning it into <em>either</em> a <code class="highlighter-rouge">+= 8</code> -or- a <code class="highlighter-rouge">+= 0</code> operation, depending on the value of the mask!<br />
This turns out to be quite an effective way (again, for simple branches only) of converting a potential misprediction event costing us 15 cycles into a constant, 100% predictable 3-4 cycle data dependency for the CPU: It can be thought of as a “signaling” mechanism where we tell the CPU not to speculate on the result of the branch, but instead complete the <code class="highlighter-rouge">readLeft +=</code> statement only after waiting for the right-shift (<code class="highlighter-rouge">>> 31</code>) and bitwise-and (<code class="highlighter-rouge">&</code>) operations to propagate through its pipeline.</p>
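<p>As a sanity check, here is the same trick as a self-contained C sketch (names are mine; also note that right-shifting a negative signed integer is technically implementation-defined in C, though it is an arithmetic shift on every mainstream compiler), showing the branchy and branchless forms agree:</p>

```c
#include <assert.h>
#include <stdint.h>

/* Branchy original: add 8 only when leftAlign is negative. */
static int bump_branchy(int leftAlign, int readLeft)
{
    if (leftAlign < 0)
        readLeft += 8;
    return readLeft;
}

/* Branchless: the arithmetic right shift smears the sign bit into an
   all-ones/all-zeros mask, which then gates the +8. */
static int bump_branchless(int leftAlign, int readLeft)
{
    int32_t mask = (int32_t)leftAlign >> 31; /* -1 if negative, else 0 */
    return readLeft + (8 & mask);
}
```

<p>Both functions compute the same result for any input; only the branchless one does it with a fixed-latency data dependency instead of a speculated branch.</p>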
<table style="margin-bottom: 0em" class="notice--info">
<tr>
<td style="border: none; padding-top: 0; padding-bottom: 0; vertical-align: top"><span class="uk-label">Note</span></td>
<td style="border: none; padding-top: 0; padding-bottom: 0"><div>
<p>I referred to this as an old geezer’s optimization since modern processors already support this internally in the form of the <code class="highlighter-rouge">CMOV</code> instruction, which is more versatile, faster, and takes up fewer bytes in the instruction stream, while having the same “do not speculate on this” effect on the CPU. <em>The only issue</em> is that we don’t have <code class="highlighter-rouge">CMOV</code> in the CoreCLR JIT (Mono’s JIT, peculiarly, does support this, both with the internal JIT and naturally with LLVM…).<br />
As a side note to this side note, I’ll add that this is such an old-dog trick that LLVM even detects such code and de-optimizes it back into a “normal” branch, and then proceeds to optimize it again into <code class="highlighter-rouge">CMOV</code>, which I think is just a very cool thing, regardless :)</p>
</div>
</td>
</tr>
</table>
</div>
<p>I ended up replacing about 5-6 super simple/small branches this way. I won’t show direct performance numbers for this, as this is already part of the overlined version; I can’t say it improved performance considerably for my test runs, but it did reduce the jitter of those runs, which can be seen in the reduced error bars and tighter confidence intervals shown in the benchmark results above.</p>
<h3 id="coming-to-terms-with-bad-speculation">Coming to terms with bad speculation</h3>
<p>At the end of part 3, we came to a hard realization that our code is badly speculating inside the CPU. Even after simplifying the branch code in our loop in part 4, the bad speculation remained there, staring at us persistently. If you recall, we experienced a lot of bad-speculation effects when sorting the data with our vectorized code, and profiling using hardware counters showed us that while <code class="highlighter-rouge">InsertionSort</code> was the cause of most of the bad-speculation events (41%), our vectorized code was still responsible for 32% of them. Let’s try to think about that mean nasty branch, stuck there, in the middle of our beautiful loop:</p>
<div>
<div class="stickemup">
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre><span class="kt">int</span><span class="p">*</span> <span class="n">nextPtr</span><span class="p">;</span>
<span class="k">if</span> <span class="p">((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">writeRight</span> <span class="p">-</span> <span class="p">(</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readRight</span> <span class="p"><</span> <span class="n">N</span> <span class="p">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">int</span><span class="p">))</span> <span class="p">{</span>
<span class="n">nextPtr</span> <span class="p">=</span> <span class="n">readRight</span><span class="p">;</span>
<span class="n">readRight</span> <span class="p">-=</span> <span class="n">N</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">nextPtr</span> <span class="p">=</span> <span class="n">readLeft</span><span class="p">;</span>
<span class="n">readLeft</span> <span class="p">+=</span> <span class="n">N</span><span class="p">;</span>
<span class="p">}</span>
<span class="nf">PartitionBlock</span><span class="p">(</span><span class="n">nextPtr</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="n">pBase</span><span class="p">,</span> <span class="k">ref</span> <span class="n">writeLeft</span><span class="p">,</span> <span class="k">ref</span> <span class="n">writeRight</span><span class="p">);</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
</div>
<p>Long story short: We ended up sneaking a data-based branch back into our code in the form of this side-selection logic. Whenever we try to pick the side we’ll read from next, we put the CPU in a tough spot. We’re asking it to speculate on something it <em>can’t possibly speculate on successfully</em>. Our question is: “Oh CPU, CPU in the socket, which side is closer to being over-written of them all?”, to which the answer is completely data-driven. In other words, it depends on how the last round(s) of partitioning mutated the pointers involved in the comparison. It might sound like an easy thing for the CPU to check, but we have to remember it is attempting to execute ~100 or so instructions into the future, as it is required to speculate on the result: the previous rounds of partitioning have not yet been fully executed, internally. The CPU guesses, at best, based on stale data, and we know, as the grand designers of this mess, that its best guess is no better here than flipping a coin. Quite sad. You have to admit it is ironic that we managed to do this whole big circle around our own tails just to come back to having a branch misprediction based on the random array data. Mis-predicting here seems unavoidable. Or is it?</p>
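<p>Before trying anything, a back-of-the-envelope check suggests attacking this branch should be worthwhile. Using the rough figures from earlier (~15 cycles per misprediction vs. a ~3-4 cycle shift-and-mask data dependency; ballpark numbers, not measurements), a coin-flip branch loses on expected branch cost alone:</p>

```c
#include <assert.h>

/* Expected (amortized) per-execution cost of a conditional branch:
   misprediction rate times misprediction penalty, in cycles. */
static double expected_branch_cost(double miss_rate, double miss_penalty_cycles)
{
    return miss_rate * miss_penalty_cycles;
}
```

<p>With a 50% miss rate, <code class="highlighter-rouge">0.5 * 15 = 7.5</code> cycles per iteration, comfortably above the 3-4 cycle dependency; a well-predicted branch (say, 5% misses) would be cheaper left alone. That napkin math is exactly why the attempt below seemed worth making, and it also only accounts for branch cost, not for any extra instructions the replacement drags in.</p>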
<h4 id="replacing-the-branch-with-arithmetic--1">Replacing the branch with arithmetic: :-1:</h4>
<p>Could we replace this branch with arithmetic, just like we’ve done a couple of paragraphs above? Yes, we can.
Consider this alternative version:</p>
</div>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="kt">var</span> <span class="n">readRightMask</span> <span class="p">=</span>
<span class="p">(((</span><span class="kt">byte</span><span class="p">*)</span> <span class="n">writeRight</span> <span class="p">-</span> <span class="p">(</span><span class="kt">byte</span><span class="p">*)</span> <span class="n">readRight</span> <span class="p">-</span> <span class="n">N</span><span class="p">*</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">int</span><span class="p">)))</span> <span class="p">>></span> <span class="m">63</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">readLeftMask</span> <span class="p">=</span> <span class="p">~</span><span class="n">readRightMask</span><span class="p">;</span>
<span class="c1">// If readRightMask is 0, we pick the left side</span>
<span class="c1">// If readLeftMask is 0, we pick the right side</span>
<span class="kt">var</span> <span class="n">readRightMaybe</span> <span class="p">=</span> <span class="p">(</span><span class="kt">ulong</span><span class="p">)</span> <span class="n">readRight</span> <span class="p">&</span> <span class="p">(</span><span class="kt">ulong</span><span class="p">)</span> <span class="n">readRightMask</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">readLeftMaybe</span> <span class="p">=</span> <span class="p">(</span><span class="kt">ulong</span><span class="p">)</span> <span class="n">readLeft</span> <span class="p">&</span> <span class="p">(</span><span class="kt">ulong</span><span class="p">)</span> <span class="n">readLeftMask</span><span class="p">;</span>
<span class="nf">PartitionBlock</span><span class="p">((</span><span class="kt">int</span> <span class="p">*)</span> <span class="p">(</span><span class="n">readLeftMaybe</span> <span class="p">+</span> <span class="n">readRightMaybe</span><span class="p">),</span>
<span class="n">P</span><span class="p">,</span> <span class="n">pBase</span><span class="p">,</span> <span class="k">ref</span> <span class="n">writeLeft</span><span class="p">,</span> <span class="k">ref</span> <span class="n">writeRight</span><span class="p">);</span>
<span class="kt">var</span> <span class="n">postFixUp</span> <span class="p">=</span> <span class="p">-</span><span class="m">32</span> <span class="p">&</span> <span class="n">readRightMask</span><span class="p">;</span>
<span class="n">readRight</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span> <span class="p">*)</span> <span class="p">((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readRight</span> <span class="p">+</span> <span class="n">postFixUp</span><span class="p">);</span>
<span class="n">readLeft</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span> <span class="p">*)</span> <span class="p">((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readLeft</span> <span class="p">+</span> <span class="n">postFixUp</span> <span class="p">+</span> <span class="m">32</span><span class="p">);</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>What the code above does, apart from causing a nauseating headache, is take the same branches-to-arithmetic concept from the previous section and use it to get rid of that nasty branch: We take the comparison result, turn it into a negative/positive number, then proceed to generate masks from it, which we use to execute the code that used to reside under the branch.</p>
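<p>The core of the selection trick can be sketched as a self-contained C function (an illustration with names and layout of my own choosing, not the actual partitioning code; as before, the arithmetic right shift of a negative value is implementation-defined in C but behaves as needed on mainstream compilers):</p>

```c
#include <assert.h>
#include <stdint.h>

/* Branchless two-way select between pointers: turn a signed comparison
   into an all-ones/all-zeros mask, AND each candidate pointer with its
   mask, and OR the results; exactly one operand is non-zero, so the OR
   yields the selected pointer. */
static int *select_side(int *readLeft, int *readRight,
                        unsigned char *writeRight, intptr_t bytesNeeded)
{
    /* negative => right side is about to be overwritten => pick right */
    intptr_t diff = ((unsigned char *)writeRight - (unsigned char *)readRight)
                    - bytesNeeded;
    uintptr_t rightMask = (uintptr_t)(diff >> (sizeof(intptr_t) * 8 - 1));
    uintptr_t leftMask  = ~rightMask;
    return (int *)(((uintptr_t)readRight & rightMask) |
                   ((uintptr_t)readLeft  & leftMask));
}

/* Exercise both outcomes on a scratch buffer. */
static int select_side_check(void)
{
    int buf[64];
    int *readLeft  = buf;
    int *readRight = buf + 48;
    /* 48 bytes of headroom on the right, 32 needed: pick the left side */
    if (select_side(readLeft, readRight, (unsigned char *)(buf + 60), 32) != readLeft)
        return 0;
    /* only 16 bytes of headroom on the right: pick the right side */
    if (select_side(readLeft, readRight, (unsigned char *)(buf + 52), 32) != readRight)
        return 0;
    return 1;
}
```

<p>The post-fix-up of <code class="highlighter-rouge">readLeft</code>/<code class="highlighter-rouge">readRight</code> in the C# version follows the same pattern: a single mask gates which of the two pointers actually advances.</p>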
<p>I don’t want to dig too deep into this. While it’s technically sound, and does what we need it to do, it’s more important to focus on how it performs:</p>
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#8fc63529-869b-4c7d-aab5-2c2d25a929f2'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-stats"></i> Scaling</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks</a></li>
<li><a href="#"><i class="glyphicon glyphicon-info-sign"></i> Setup</a></li>
</ul>
<ul id="8fc63529-869b-4c7d-aab5-2c2d25a929f2" class="uk-switcher uk-margin">
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Performance scale: Array.Sort (solid gray) is always 100%, and the other methods are scaled relative to it" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
Overlined, 1 , 1 , 1 , 1 , 1 , 1
Branchless, 0.87253937, 0.951842168, 1.104715689, 1.140662148, 1.253573179, 1.379499062
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3}
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3}
}
]
},
"options": {
"title": { "text": "AVX2 Branchless Sorting - Scaled to Overlined", "display": true },
"scales": {
"yAxes": [{
"ticks": {
"fontFamily": "Indie Flower",
"min": 0.80,
"callback": "ticksPercent"
},
"scaleLabel": {
"labelString": "Scaling (%)",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
Overlined, 20.3199,21.0354,21.6787,23.0622,23.246,24.7603
Branchless, 17.7252,20.0221,23.9488,26.3062,29.1405,34.1567
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3}
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3}
}
]
},
"options": {
"title": { "text": "AVX2 Branchless Sorting - log(Time/N)", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"min": 15,
"max": 35,
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt5_3_Int32_-report.datatable.json" data-id-field="name" data-pagination="false" data-page-list="[9, 18]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result it. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nv">BenchmarkDotNet</span><span class="o">=</span>v0.12.0, <span class="nv">OS</span><span class="o">=</span>clear-linux-os 32120
Intel Core i7-7700HQ CPU 2.80GHz <span class="o">(</span>Kaby Lake<span class="o">)</span>, 1 CPU, 4 logical and 4 physical cores
.NET Core <span class="nv">SDK</span><span class="o">=</span>3.1.100
<span class="o">[</span>Host] : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
Job-DEARTS : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
<span class="nv">InvocationCount</span><span class="o">=</span>3 <span class="nv">IterationCount</span><span class="o">=</span>15 <span class="nv">LaunchCount</span><span class="o">=</span>2
<span class="nv">UnrollFactor</span><span class="o">=</span>1 <span class="nv">WarmupCount</span><span class="o">=</span>10
<span class="nv">$ </span><span class="nb">grep</span> <span class="s1">'stepping\|model\|microcode'</span> /proc/cpuinfo | <span class="nb">head</span> <span class="nt">-4</span>
model : 158
model name : Intel<span class="o">(</span>R<span class="o">)</span> Core<span class="o">(</span>TM<span class="o">)</span> i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>Look, I’m not here to sugar-coat it: This looks like an unmitigated disaster. But I claim that it is one we can learn a lot from in the future.
With the exception of sorting <code class="highlighter-rouge"><= 100</code> elements, the situation gets progressively worse as the problem grows.</p>
<p>To double-check that everything is sound, I ran <code class="highlighter-rouge">perf</code> recording the <code class="highlighter-rouge">instructions</code>, <code class="highlighter-rouge">branches</code> and <code class="highlighter-rouge">branch-misses</code> events for both versions for sorting <code class="highlighter-rouge">100,000</code> elements.</p>
<p>The command line used was this:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre><span class="nv">$ </span>perf record <span class="nt">-F</span> max <span class="nt">-e</span> instructions,branches,branch-misses <span class="se">\</span>
./Example <span class="nt">--type-list</span> DoublePumpOverlined <span class="se">\</span>
<span class="nt">--size-list</span> 100000 <span class="nt">--max-loops</span> 1000 <span class="nt">--no-check</span>
<span class="nv">$ </span>perf record <span class="nt">-F</span> max <span class="nt">-e</span> instructions,branches,branch-misses <span class="se">\</span>
./Example <span class="nt">--type-list</span> DoublePumpBranchless <span class="se">\</span>
<span class="nt">--size-list</span> 100000 <span class="nt">--max-loops</span> 1000 <span class="nt">--no-check</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>If you’re one of those sick people who likes to look into other people’s sorrows, here is a <a href="https://gist.github.com/damageboy/79368e350364348c6ca476492a63f052">gist with the full results</a>; if you’re more normal, and to keep things simple, I’ve processed the results and present them here in table form:</p>
<center>
<object style="margin: auto; width: 90%" type="image/svg+xml" data="../assets/images/overlined-branchless-counters.svg"></object>
</center>
</div>
<p>This is pretty amazing if you think about it:</p>
<ul>
  <li>The number of branches was cut in half: This makes sense, as the loop control itself is a branch instruction after all, so it remains even in the <code class="highlighter-rouge">Branchless</code> variant.</li>
<li>The branches that remain in the <code class="highlighter-rouge">branchless</code> version are all easy to predict, and we see that the <code class="highlighter-rouge">branch-misses</code> counter shows us those are down to nothing.<br />
This means that there is no mistake: We succeeded in a targeted assassination of that branch; however, there was a lot of collateral damage…</li>
  <li>The verbiage of the branchless code, expressed in the <code class="highlighter-rouge">instructions</code> counter, is definitely costing us something here:<br />
The number of executed instructions inside our partition loop has gone up by 17%, which is a lot.</li>
</ul>
<p>The slowdown we’ve measured here is directly related to NOT having <code class="highlighter-rouge">CMOV</code> available to us through the CoreCLR JIT, but I really don’t think that is the entire story here. It’s hard to express this in words, but
the slope at which the branchless code is slowing down compared to the previous version is very suspicious in my eyes.<br />
There is an expression we use a lot in Hebrew for this sort of situation: “The operation was successful, but the patient died”. There is no question that this is one of those moments.
This failure to accelerate the sorting operation, and specifically the way it fails, increasingly so as the problem size grows, is very telling in my eyes.
I have an idea of why this is and how we might be able to work around it. But, for today, our time is up. I’ll try to get back to this much, much later in this series,
and hopefully, we’ll all be wiser for it.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:0" role="doc-endnote">
      <p>Remember that the CPU knows nothing about the relationship between two different cache-lines. They might actually be on a page boundary as well, which means they might be in two different DRAM chips, or perhaps a single split-line access causes our poor CPU to communicate with a different socket, where another memory controller is responsible for reading the memory from its own DRAM modules! <a href="#fnref:0" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:1" role="doc-endnote">
      <p>Most modern Intel CPUs can actually address the L1 cache units twice per cycle, at least when it comes to reading data, by virtue of having two load-ports. That means they can actually request two cache-lines at the same time! But this still causes more load on the cache and bus. In our case, we must also remember we will be reading an additional cache-line for our permutation entry… <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
      <p>This specific AVX2 intrinsic will actually fail if/when used on non-aligned addresses. But it is important to note that it seems it won’t actually run faster than the load intrinsic we’ve used previously, <code class="highlighter-rouge">Avx2.LoadDquVector256</code>, as long as the actual addresses we pass to both instructions are 32-byte aligned. In other words, it’s very useful for debugging alignment issues, but not that critical to actually call this intrinsic! <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
      <p>I could be wrong about that last statement, but I couldn’t find anything quite like this discussed anywhere, and believe me, I’ve searched. If anyone can point me to someone doing this before, I’d really love to hear about it; there might be more good stuff to read about there… <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<hr />
<p><a href="https://bits.houmus.org/2020-02-01/this-goes-to-eleven-pt4">This Goes to Eleven (Pt. 4/∞)</a> — 2020-02-01</p>
<p>I ended up going down the rabbit hole re-implementing array sorting with AVX2 intrinsics, and there’s no reason I should go down alone.</p>
<p>Since there’s a lot to go over here, I’ll split it up into a few parts:</p>
<ol>
<li>In <a href="/2020-01-28/this-goes-to-eleven-pt1">part 1</a>, we start with a refresher on <code class="highlighter-rouge">QuickSort</code> and how it compares to <code class="highlighter-rouge">Array.Sort()</code>.</li>
<li>In <a href="/2020-01-29/this-goes-to-eleven-pt2">part 2</a>, we go over the basics of vectorized hardware intrinsics, vector types, and go over a handful of vectorized instructions we’ll use in part 3. We still won’t be sorting anything.</li>
  <li>In <a href="/2020-01-30/this-goes-to-eleven-pt3">part 3</a>, we go through the initial code for the vectorized sorting, and start seeing some payoff. We finish agonizing courtesy of the CPU’s branch predictor, throwing a wrench into our attempts.</li>
<li>In this part, we go over a handful of optimization approaches that I attempted trying to get the vectorized partition to run faster, seeing what worked and what didn’t.</li>
<li>In <a href="/2020-02-02/this-goes-to-eleven-pt5">part 5</a>, we’ll take a deep dive into how to deal with memory alignment issues.</li>
<li>In part 6, we’ll take a pause from the vectorized partitioning, to get rid of almost 100% of the remaining scalar code, by implementing small, constant size array sorting with yet more AVX2 vectorization.</li>
<li>In part 7, We’ll circle back and try to deal with a nasty slowdown left in our vectorized partitioning code</li>
<li>In part 8, I’ll tell you the sad story of a very twisted optimization I managed to pull off while failing miserably at the same time.</li>
<li>In part 9, I’ll try some algorithmic improvements to milk those last drops of perf, or at least those that I can think of, from this code.</li>
</ol>
<h2 id="squeezing-some-more-juice">Squeezing some more juice</h2>
<p>I thought it would be nice to show a bunch of things I ended up trying to improve performance.
I tried to keep most of these experiments in separate implementations, both the ones that yielded positive results and the failures. These can be seen in the original repo under the <a href="https://github.com/damageboy/VxSort/tree/research/VxSortResearch/Unstable/AVX2/Happy">Happy</a> and <a href="https://github.com/damageboy/VxSort/tree/research/VxSortResearch/Unstable/AVX2/Sad">Sad</a> folders.</p>
<p>While some worked, and some didn’t, I think a bunch of these were worth mentioning, so here goes:</p>
<h3 id="dealing-with-small-jit-hiccups-1">Dealing with small JIT hiccups: :+1:</h3>
<p>One of the more surprising things I’ve discovered during the optimization journey was that the JIT could be coaxed into generating much better code, specifically around pointer arithmetic. With the basic version we got working by the end of the <a href="/2020-01-30/this-goes-to-eleven-pt3">3<sup>rd</sup> post</a>, I started turning my attention to the body of the main loop, which is where I presume we spend most of our execution time. I quickly encountered some red-flag-raising assembly code, specifically for this single line of code, which we’ve briefly discussed before:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre><span class="k">if</span> <span class="p">(</span><span class="n">readLeft</span> <span class="p">-</span> <span class="n">writeLeft</span> <span class="p"><=</span>
<span class="n">writeRight</span> <span class="p">-</span> <span class="n">readRight</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// ...</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c1">// ...</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>It looks innocent enough, but here’s the freely commented x86 asm code for it:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
</pre></td><td class="rouge-code"><pre><span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span><span class="nb">rdx</span> <span class="c1">; ✓ copy readLeft</span>
<span class="nf">sub</span> <span class="nb">rax</span><span class="p">,</span><span class="nv">r12</span> <span class="c1">; ✓ subtract writeLeft</span>
<span class="nf">mov</span> <span class="nb">rcx</span><span class="p">,</span><span class="nb">rax</span> <span class="c1">; ✘ wat?</span>
<span class="nf">sar</span> <span class="nb">rcx</span><span class="p">,</span><span class="mh">3Fh</span> <span class="c1">; ✘ wat?1?</span>
<span class="nf">and</span> <span class="nb">rcx</span><span class="p">,</span><span class="mi">3</span> <span class="c1">; ✘ wat?!?!?</span>
<span class="nf">add</span> <span class="nb">rax</span><span class="p">,</span><span class="nb">rcx</span> <span class="c1">; ✘ wat!?!@#</span>
<span class="nf">sar</span> <span class="nb">rax</span><span class="p">,</span><span class="mi">2</span> <span class="c1">; ✘ wat#$@#$@</span>
<span class="nf">mov</span> <span class="nb">rcx</span><span class="p">,[</span><span class="nb">rbp</span><span class="o">-</span><span class="mh">58h</span><span class="p">]</span> <span class="c1">; ✓✘ copy writeRight, but from stack?</span>
<span class="nf">mov</span> <span class="nv">r8</span><span class="p">,</span><span class="nb">rcx</span> <span class="c1">; ✓✘ in the loop body?!?!?, Oh lordy!</span>
<span class="nf">sub</span> <span class="nv">r8</span><span class="p">,</span><span class="nb">rsi</span> <span class="c1">; ✓ subtract readRight</span>
<span class="nf">mov</span> <span class="nv">r10</span><span class="p">,</span><span class="nv">r8</span> <span class="c1">; ✘ wat?</span>
<span class="nf">sar</span> <span class="nv">r10</span><span class="p">,</span><span class="mh">3Fh</span> <span class="c1">; ✘ wat?!?</span>
<span class="nf">and</span> <span class="nv">r10</span><span class="p">,</span><span class="mi">3</span> <span class="c1">; ✘ wat!?!@#</span>
<span class="nf">add</span> <span class="nv">r8</span><span class="p">,</span><span class="nv">r10</span> <span class="c1">; ✘ wat#$@#$@</span>
<span class="nf">sar</span> <span class="nv">r8</span><span class="p">,</span><span class="mi">2</span> <span class="c1">; ✘ wat^!#$!#$</span>
<span class="nf">cmp</span> <span class="nb">rax</span><span class="p">,</span><span class="nv">r8</span> <span class="c1">; ✓ finally, compare!</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>It’s not every day that we get to see two JIT issues with one line of code. I know some people might take this as a bad sign, but in my mind this is great! To me this feels like digging for oil in Texas in the early 20s…
We’ve practically hit the ground with a pickaxe accidentally, only to see black liquid seeping out almost immediately!</p>
<h4 id="jit-bug-1-variable-not-promoted-to-register">JIT Bug 1: variable not promoted to register</h4>
<p>One super weird thing that we see happening here is the difference in the asm code that copies <code class="highlighter-rouge">writeRight</code> on <span class="uk-label">L8-9</span> from the <em>stack</em> (<code class="highlighter-rouge">[rbp-58h]</code>) before performing the subtraction when compared to <span class="uk-label">L1</span> where a conceptually similar copy is performed for <code class="highlighter-rouge">readLeft</code> from a register (<code class="highlighter-rouge">rdx</code>). The code merely tries to subtract two pairs of pointers, but the generated machine code is weird: 3 out of 4 pointers were correctly lifted out of the stack into registers outside the body of the loop (<code class="highlighter-rouge">readLeft</code>, <code class="highlighter-rouge">writeLeft</code>, <code class="highlighter-rouge">readRight</code>), but the 4<sup>th</sup> one, <code class="highlighter-rouge">writeRight</code>, is the designated black-sheep of the family and is being continuously read from the stack (and later in that loop body is also written back to the stack, to make things worse).<br />
There is no good reason for this, and this clearly smells! What do we do?</p>
<p>For one thing, I’ve <a href="https://github.com/dotnet/runtime/issues/35495">opened up an issue</a> about this weirdness. The issue itself shows just how finicky the JIT is regarding this one variable, and (un)surprisingly, by fudging around the setup code this can be easily worked around for now.<br />
As a refresher, here’s the original setup code I presented in the previous post, just before we enter the loop body:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="rouge-code"><pre><span class="k">unsafe</span> <span class="kt">int</span><span class="p">*</span> <span class="nf">VectorizedPartitionInPlace</span><span class="p">(</span><span class="kt">int</span><span class="p">*</span> <span class="n">left</span><span class="p">,</span> <span class="kt">int</span><span class="p">*</span> <span class="n">right</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">var</span> <span class="n">writeLeft</span> <span class="p">=</span> <span class="n">left</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">writeRight</span> <span class="p">=</span> <span class="n">right</span> <span class="p">-</span> <span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">;</span> <span class="c1">// <- Why the hate?</span>
<span class="kt">var</span> <span class="n">tmpLeft</span> <span class="p">=</span> <span class="n">_tempStart</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">tmpRight</span> <span class="p">=</span> <span class="n">_tempEnd</span> <span class="p">-</span> <span class="n">N</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">pBase</span> <span class="p">=</span> <span class="n">Int32PermTables</span><span class="p">.</span><span class="n">IntPermTablePtr</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">P</span> <span class="p">=</span> <span class="n">Vector256</span><span class="p">.</span><span class="nf">Create</span><span class="p">(</span><span class="n">pivot</span><span class="p">);</span>
<span class="nf">PartitionBlock</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpLeft</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpRight</span><span class="p">);</span>
<span class="nf">PartitionBlock</span><span class="p">(</span><span class="n">right</span> <span class="p">-</span> <span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpLeft</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpRight</span><span class="p">);</span>
<span class="kt">var</span> <span class="n">readLeft</span> <span class="p">=</span> <span class="n">left</span> <span class="p">+</span> <span class="n">N</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">readRight</span> <span class="p">=</span> <span class="n">right</span> <span class="p">-</span> <span class="m">2</span><span class="p">*</span><span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">;</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>And here’s a simple fix: moving the pointer declaration closer to the loop body seems to convince the JIT that we can all be friends once more:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="rouge-code"><pre><span class="k">unsafe</span> <span class="kt">int</span><span class="p">*</span> <span class="nf">VectorizedPartitionInPlace</span><span class="p">(</span><span class="kt">int</span><span class="p">*</span> <span class="n">left</span><span class="p">,</span> <span class="kt">int</span><span class="p">*</span> <span class="n">right</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// ... omitted for brevity</span>
<span class="kt">var</span> <span class="n">tmpLeft</span> <span class="p">=</span> <span class="n">_tempStart</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">tmpRight</span> <span class="p">=</span> <span class="n">_tempEnd</span> <span class="p">-</span> <span class="n">N</span><span class="p">;</span>
<span class="nf">PartitionBlock</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpLeft</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpRight</span><span class="p">);</span>
<span class="nf">PartitionBlock</span><span class="p">(</span><span class="n">right</span> <span class="p">-</span> <span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpLeft</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpRight</span><span class="p">);</span>
<span class="kt">var</span> <span class="n">writeLeft</span> <span class="p">=</span> <span class="n">left</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">writeRight</span> <span class="p">=</span> <span class="n">right</span> <span class="p">-</span> <span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">;</span> <span class="c1">// <- Oh, so now we're cool?</span>
<span class="kt">var</span> <span class="n">readLeft</span> <span class="p">=</span> <span class="n">left</span> <span class="p">+</span> <span class="n">N</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">readRight</span> <span class="p">=</span> <span class="n">right</span> <span class="p">-</span> <span class="m">2</span><span class="p">*</span><span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">;</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>The asm is <em>slightly</em> cleaner:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
</pre></td><td class="rouge-code"><pre><span class="nf">mov</span> <span class="nv">r8</span><span class="p">,</span><span class="nb">rax</span> <span class="c1">; ✓ copy readLeft</span>
<span class="nf">sub</span> <span class="nv">r8</span><span class="p">,</span><span class="nv">r15</span> <span class="c1">; ✓ subtract writeLeft</span>
<span class="nf">mov</span> <span class="nv">r9</span><span class="p">,</span><span class="nv">r8</span> <span class="c1">; ✘ wat?</span>
<span class="nf">sar</span> <span class="nv">r9</span><span class="p">,</span><span class="mh">3Fh</span> <span class="c1">; ✘ wat?1?</span>
<span class="nf">and</span> <span class="nv">r9</span><span class="p">,</span><span class="mi">3</span> <span class="c1">; ✘ wat?!?!?</span>
<span class="nf">add</span> <span class="nv">r8</span><span class="p">,</span><span class="nv">r9</span> <span class="c1">; ✘ wat!?!@#</span>
<span class="nf">sar</span> <span class="nv">r8</span><span class="p">,</span><span class="mi">2</span> <span class="c1">; ✘ wat#$@#$@</span>
<span class="nf">mov</span> <span class="nv">r9</span><span class="p">,</span><span class="nb">rsi</span> <span class="c1">; ✓ copy writeRight</span>
<span class="nf">sub</span> <span class="nv">r9</span><span class="p">,</span><span class="nb">rcx</span> <span class="c1">; ✓ subtract readRight</span>
<span class="nf">mov</span> <span class="nv">r10</span><span class="p">,</span><span class="nv">r9</span> <span class="c1">; ✘ wat?1?</span>
<span class="nf">sar</span> <span class="nv">r10</span><span class="p">,</span><span class="mh">3Fh</span> <span class="c1">; ✘ wat?!?!?</span>
<span class="nf">and</span> <span class="nv">r10</span><span class="p">,</span><span class="mi">3</span> <span class="c1">; ✘ wat!?!@#</span>
<span class="nf">add</span> <span class="nv">r9</span><span class="p">,</span><span class="nv">r10</span> <span class="c1">; ✘ wat#$@#$@</span>
<span class="nf">sar</span> <span class="nv">r9</span><span class="p">,</span><span class="mi">2</span> <span class="c1">; ✘ wat^%#^#@!</span>
<span class="nf">cmp</span> <span class="nv">r8</span><span class="p">,</span><span class="nv">r9</span> <span class="c1">; ✓ finally, compare!</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>It doesn’t look like much, but we’ve managed to remove two memory accesses from the loop body (the read, shown above, and a symmetrical write to the same stack variable/location towards the end of the loop).
It’s also clear, at least from my comments that I’m not entirely pleased yet, so let’s move on to…</p>
<h4 id="jit-bug-2-not-optimizing-pointer-difference-comparisons">JIT bug 2: not optimizing pointer difference comparisons</h4>
<p>Calling this one a bug might be a stretch, but in the world of the JIT, sub-optimal code generation can be considered just that. The original code performing the comparison is making the JIT (wrongfully) think that we want to perform <code class="highlighter-rouge">int *</code> arithmetic for <code class="highlighter-rouge">readLeft - writeLeft</code> and <code class="highlighter-rouge">writeRight - readRight</code>. In other words: the JIT emits code subtracting both pointer pairs, generating a <code class="highlighter-rouge">byte *</code> difference for each pair, which is great (I marked that with checkmarks in the listings). Then it goes on to generate extra code converting those differences into <code class="highlighter-rouge">int *</code> units: lots of extra arithmetic operations. This is simply useless: we just care if one side is larger than the other. What the JIT is doing here is similar in spirit to converting two distance measurements taken in <code class="highlighter-rouge">cm</code> to <code class="highlighter-rouge">km</code> just to compare which one is greater.<br />
To work around this disappointing behaviour, I wrote this instead:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre><span class="k">if</span> <span class="p">((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readLeft</span> <span class="p">-</span> <span class="p">(</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">writeLeft</span> <span class="p"><=</span>
<span class="p">(</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">writeRight</span> <span class="p">-</span> <span class="p">(</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readRight</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// ...</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c1">// ...</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>By doing this sort of seemingly useless casting 4 times, we get the following asm generated:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="nf">mov</span> <span class="nb">rcx</span><span class="p">,</span> <span class="nb">rdi</span> <span class="c1">; ✓ copy readLeft</span>
<span class="nf">sub</span> <span class="nb">rcx</span><span class="p">,</span> <span class="nv">r12</span> <span class="c1">; ✓ subtract writeLeft</span>
<span class="nf">mov</span> <span class="nv">r9</span><span class="p">,</span> <span class="nb">rsi</span> <span class="c1">; ✓ copy writeRight</span>
<span class="nf">sub</span> <span class="nv">r9</span><span class="p">,</span> <span class="nv">r13</span> <span class="c1">; ✓ subtract readRight</span>
<span class="nf">cmp</span> <span class="nb">rcx</span><span class="p">,</span> <span class="nv">r9</span> <span class="c1">; ✓ compare</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>It doesn’t take a degree in reverse-engineering asm code to figure out this was a good idea.<br />
Casting each pointer to <code class="highlighter-rouge">byte *</code> coerces the JIT to do our bidding and just perform a simpler comparison.</p>
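<p>To convince ourselves the two forms really agree, here’s a tiny stand-alone check (purely illustrative; the pointer offsets are made up and nothing here comes from the repo): both differences scale by the same <code class="highlighter-rouge">sizeof(int)</code> factor, so comparing them in <code class="highlighter-rouge">byte *</code> units yields the same verdict as comparing them in <code class="highlighter-rouge">int *</code> units.</p>

```csharp
// Illustrative stand-alone check (not from the VxSort repo; pointer offsets
// are made up). Both differences scale by the same sizeof(int) factor, so a
// byte*-unit comparison gives the same verdict as an int*-unit comparison.
using System;

unsafe class ByteDiffCheck
{
    static void Main()
    {
        var arr = new int[128];
        fixed (int* p = arr)
        {
            int* readLeft   = p + 10,  writeLeft = p + 3;
            int* writeRight = p + 120, readRight = p + 100;

            // int* units: what made the JIT emit the extra sar/and/add dance
            bool elementUnits = readLeft - writeLeft <= writeRight - readRight;
            // byte* units: a plain sub+sub+cmp
            bool byteUnits = (byte*)readLeft   - (byte*)writeLeft <=
                             (byte*)writeRight - (byte*)readRight;

            if (elementUnits != byteUnits) throw new Exception("mismatch");
            Console.WriteLine("ok");
        }
    }
}
```

<p>(Compile with <code class="highlighter-rouge">AllowUnsafeBlocks</code> enabled.)</p>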
<h4 id="jit-bug-3-updating-the-write-pointers-more-efficiently">JIT Bug 3: Updating the <code class="highlighter-rouge">write*</code> pointers more efficiently</h4>
<p>I discovered another missed opportunity in the pointer update code at the end of our inlined partitioning block. When we update the two <code class="highlighter-rouge">write*</code> pointers, our intention is to update two <code class="highlighter-rouge">int *</code> values with the result of the <code class="highlighter-rouge">PopCount</code> intrinsic:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="kt">var</span> <span class="n">popCount</span> <span class="p">=</span> <span class="nf">PopCount</span><span class="p">(</span><span class="n">mask</span><span class="p">);</span>
<span class="n">writeLeft</span> <span class="p">+=</span> <span class="m">8U</span> <span class="p">-</span> <span class="n">popCount</span><span class="p">;</span>
<span class="n">writeRight</span> <span class="p">-=</span> <span class="n">popCount</span><span class="p">;</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Unfortunately, the JIT isn’t smart enough to see that it would be wiser to left shift <code class="highlighter-rouge">popCount</code> once by <code class="highlighter-rouge">2</code> (e.g. convert to <code class="highlighter-rouge">byte *</code> distance) and reuse that left-shifted value <strong>twice</strong> while mutating the two pointers.
Again, uglifying the originally clean code into the following god-awful mess gets the job done:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="kt">var</span> <span class="n">popCount</span> <span class="p">=</span> <span class="nf">PopCount</span><span class="p">(</span><span class="n">mask</span><span class="p">)</span> <span class="p"><<</span> <span class="m">2</span><span class="p">;</span>
<span class="n">writeRight</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span> <span class="p">*)</span> <span class="p">((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">writeRight</span> <span class="p">-</span> <span class="n">popCount</span><span class="p">);</span>
<span class="n">writeLeft</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span> <span class="p">*)</span> <span class="p">((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">writeLeft</span> <span class="p">+</span> <span class="m">8</span><span class="p">*</span><span class="m">4U</span> <span class="p">-</span> <span class="n">popCount</span><span class="p">);</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>I’ll skip the asm this time. It’s pretty clear from the C# that we pre-left shift (or multiply by 4) the <code class="highlighter-rouge">popCount</code> result before mutating the pointers.
We’re now generating slightly denser code by eliminating a silly instruction from a hot loop.</p>
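<p>As a quick sanity check, here’s a stand-alone snippet (illustrative only, not repo code) verifying that the pre-shifted <code class="highlighter-rouge">byte *</code> updates land both pointers exactly where the original <code class="highlighter-rouge">int *</code> arithmetic would, for every possible <code class="highlighter-rouge">PopCount</code> result (0–8):</p>

```csharp
// Illustrative stand-alone check (not repo code): for every PopCount result
// 0..8, the pre-shifted byte* form moves both write pointers to exactly the
// same place as the original int* arithmetic.
using System;

unsafe class PopCountUpdateCheck
{
    static void Main()
    {
        var arr = new int[64];
        fixed (int* p = arr)
        {
            for (uint count = 0; count <= 8; count++)
            {
                // original form: two separate pointer-scaling operations
                int* wl1 = p + 16 + (8U - count);
                int* wr1 = p + 48 - count;

                // micro-opt form: shift into byte units once, reuse twice
                uint popCount = count << 2;
                int* wl2 = (int*)((byte*)(p + 16) + 8 * 4U - popCount);
                int* wr2 = (int*)((byte*)(p + 48) - popCount);

                if (wl1 != wl2 || wr1 != wr2) throw new Exception("mismatch");
            }
            Console.WriteLine("ok");
        }
    }
}
```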
<p>All 3 of these workarounds can be seen on my repo in the <a href="https://github.com/damageboy/VxSort/tree/research">research branch</a>. I kept this pretty much as-is under <a href="https://github.com/damageboy/VxSort/blob/research/VxSortResearch/Unstable/AVX2/Happy/B4_1_DoublePumpMicroOpt.cs"><code class="highlighter-rouge">B4_1_DoublePumpMicroOpt.cs</code></a>.
Time to see whether all these changes help in terms of performance:</p>
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#f26d55b8-f3f3-45ad-a052-56e3d7306828'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-stats"></i> Scaling</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks</a></li>
<li><a href="#"><i class="glyphicon glyphicon-info-sign"></i> Setup</a></li>
</ul>
<ul id="f26d55b8-f3f3-45ad-a052-56e3d7306828" class="uk-switcher uk-margin">
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Performance scale: Array.Sort (solid gray) is always 100%, and the other methods are scaled relative to it" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
Naive, 1 , 1 , 1 , 1 , 1 , 1
MicroOpt, 1.01, 0.93, 0.93, 0.93, 0.89 , 0.87
<!--
{
"data" : {
"datasets" : [ {
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 30, "hachureGap": 3 }
}]
},
"options": {
"title": { "text": "AVX2 Micro-optimized Sorting - Scaled to AVX2 Naive Sorting", "display": true },
"scales": {
"yAxes": [{
"ticks": {
"fontFamily": "Indie Flower",
"min": 0.84,
"callback": "ticksPercent"
},
"scaleLabel": {
"labelString": "Scaling (%)",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
Naive , 21.2415, 26.0040, 30.7502, 31.4513, 27.4290, 30.6499
MicroOpt, 21.3374, 23.9888, 28.4617, 29.1356, 24.4974, 26.8152
<!--
{
"data" : {
"datasets" : [ {
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 30, "hachureGap": 3 }
}]
},
"options": {
"title": { "text": "AVX2 Naive+Micro-optimized Sorting - log(Time/N)", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"min": 20,
"max": 35,
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt4_1_Int32_-report.datatable.json" data-id-field="name" data-pagination="false" data-page-list="[9, 18]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result it. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="rouge-code"><pre><span class="nv">BenchmarkDotNet</span><span class="o">=</span>v0.12.0, <span class="nv">OS</span><span class="o">=</span>clear-linux-os 32120
Intel Core i7-7700HQ CPU 2.80GHz <span class="o">(</span>Kaby Lake<span class="o">)</span>, 1 CPU, 4 logical and 4 physical cores
.NET Core <span class="nv">SDK</span><span class="o">=</span>3.1.100
<span class="o">[</span>Host] : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
Job-DEARTS : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
<span class="nv">InvocationCount</span><span class="o">=</span>3 <span class="nv">IterationCount</span><span class="o">=</span>15 <span class="nv">LaunchCount</span><span class="o">=</span>2
<span class="nv">UnrollFactor</span><span class="o">=</span>1 <span class="nv">WarmupCount</span><span class="o">=</span>10
<span class="nv">$ </span><span class="nb">grep</span> <span class="s1">'stepping\|model\|microcode'</span> /proc/cpuinfo | <span class="nb">head</span> <span class="nt">-4</span>
model : 158
model name : Intel<span class="o">(</span>R<span class="o">)</span> Core<span class="o">(</span>TM<span class="o">)</span> i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>This is quite a bit better! I’ve artificially set the y-axis here to a narrow range of 80%-105% so that the differences would become more apparent. The improvement is <em>very</em> measurable. Too bad we had to uglify the code to get here, but such is life. Our results just improved by another ~7-14% across the board.<br />
If this is the going rate for ugly, I’ll bite the bullet :)</p>
<p>I did not include any statistics collection tab for this version since there is no algorithmic change involved.</p>
</div>
<h3 id="selecting-a-better-cut-off-threshold-for-scalar-sorting-1">Selecting a better cut-off threshold for scalar sorting: :+1:</h3>
<p>I briefly mentioned this at the end of the 3<sup>rd</sup> post: While it made sense to start with the same threshold that <code class="highlighter-rouge">Array.Sort</code> uses (<code class="highlighter-rouge">16</code>) to switch from partitioning into small array sorting, there’s no reason to assume this is the optimal threshold for <em>our</em> partitioning function: Given that the dynamics have changed with vectorized partitioning, the optimal cut-off point probably needs to move too.<br />
In theory, we should retest the cut-off point after every optimization that succeeds in moving the needle; I won’t do this after every optimization, but I will do so again for the final version. In the meantime, let’s see how playing with the cut-off point changes the results: We’ll try <code class="highlighter-rouge">24</code>, <code class="highlighter-rouge">32</code>, <code class="highlighter-rouge">40</code>, <code class="highlighter-rouge">48</code> on top of <code class="highlighter-rouge">16</code>, and see what comes out on top:</p>
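<p>For reference, the shape of the dispatch being tuned looks roughly like this. It’s a minimal scalar sketch: <code class="highlighter-rouge">Partition</code> and <code class="highlighter-rouge">InsertionSort</code> here are stand-ins for the real vectorized partition and small-array code, and <code class="highlighter-rouge">Cutoff</code> is the knob under test:</p>

```csharp
// Minimal scalar sketch of the cut-off dispatch (names and the scalar
// Partition/InsertionSort are stand-ins; only the Cutoff mechanics mirror
// the real code). Below the threshold we bail out to insertion sort.
using System;

class CutoffSortSketch
{
    const int Cutoff = 40; // one of the candidate values being benchmarked

    static void Sort(int[] a, int lo, int hi)
    {
        if (lo >= hi) return;
        if (hi - lo + 1 <= Cutoff)
        {
            InsertionSort(a, lo, hi); // scalar small-array fallback
            return;
        }
        int p = Partition(a, lo, hi); // stand-in for the vectorized partition
        Sort(a, lo, p - 1);
        Sort(a, p + 1, hi);
    }

    static void InsertionSort(int[] a, int lo, int hi)
    {
        for (int i = lo + 1; i <= hi; i++)
        {
            int v = a[i], j = i - 1;
            while (j >= lo && a[j] > v) { a[j + 1] = a[j]; j--; }
            a[j + 1] = v;
        }
    }

    static int Partition(int[] a, int lo, int hi)
    {
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++)
            if (a[j] <= pivot) { (a[i], a[j]) = (a[j], a[i]); i++; }
        (a[i], a[hi]) = (a[hi], a[i]);
        return i;
    }

    static void Main()
    {
        var rnd = new Random(42);
        var a = new int[1000];
        for (int i = 0; i < a.Length; i++) a[i] = rnd.Next();
        Sort(a, 0, a.Length - 1);
        for (int i = 1; i < a.Length; i++)
            if (a[i - 1] > a[i]) throw new Exception("not sorted");
        Console.WriteLine("sorted");
    }
}
```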
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#324b4f6c-1fea-4605-8dc7-6abf1826ec74'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-stats"></i> Scaling</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks</a></li>
<li><a href="#"><i class="glyphicon glyphicon-info-sign"></i> Setup</a></li>
</ul>
<ul id="324b4f6c-1fea-4605-8dc7-6abf1826ec74" class="uk-switcher uk-margin">
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Performance scale: Array.Sort (solid gray) is always 100%, and the other methods are scaled relative to it" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
MicroOpt_24,0.823310023,0.882747579,0.914373696,0.902330475,0.958166708,0.971168474
MicroOpt_32,0.817715618,0.766905542,0.839337033,0.850782566,0.973364241,0.9561571
MicroOpt_40,0.761305361,0.749485401,0.837020549,0.842011671,0.95013881,0.958056824
MicroOpt_48,0.758041958,0.75722345,0.823212214,0.839358026,0.966057806,0.962200074
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(33,33,220,.5)",
"hidden": "true",
"rough": { "fillStyle": "hachure", "hachureAngle": 30, "hachureGap": 12 }
},
{
"backgroundColor": "rgba(220,33,33,.5)",
"hidden": "true",
"rough": { "fillStyle": "hachure", "hachureAngle": 90, "hachureGap": 12 }
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3 }
},
{
"backgroundColor": "rgba(33,220,220,.5)",
"hidden": "true",
"rough": { "fillStyle": "hachure", "hachureAngle": 120, "hachureGap": 12 }
}
]
},
"options": {
"title": { "text": "AVX2 Sorting - Cut-off Tuning", "display": true },
"scales": {
"yAxes": [{
"ticks": {
"fontFamily": "Indie Flower",
"min": 0.70,
"callback": "ticksPercent"
},
"scaleLabel": {
"labelString": "Scaling (%)",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
MicroOpt_24,17.6195,25.5307,25.4022,26.5767,23.3013,25.6154
MicroOpt_32,17.3879,22.0054,25.9392,26.6394,23.3355,25.6553
MicroOpt_40,17.3027,23.2386,26.1287,26.3959,23.4568,25.7346
MicroOpt_48,17.0937,23.5973,25.6651,26.3667,23.2584,25.6
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(33,33,220,.5)",
"hidden": "true",
"rough": { "fillStyle": "hachure", "hachureAngle": 30, "hachureGap": 12 }
},
{
"backgroundColor": "rgba(220,33,33,.5)",
"hidden": "true",
"rough": { "fillStyle": "hachure", "hachureAngle": 90, "hachureGap": 12 }
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3 }
},
{
"backgroundColor": "rgba(33,220,220,.5)",
"hidden": "true",
"rough": { "fillStyle": "hachure", "hachureAngle": 120, "hachureGap": 12 }
}
]
},
"options": {
"title": { "text": "AVX2 Sorting - Cut-off Tuning - log(Time/N)", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"min": 15,
"max": 30,
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt4_2_Int32_-report.datatable.json" data-id-field="name" data-pagination="true" data-page-list="[5, 10, 15, 20]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result it. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nv">BenchmarkDotNet</span><span class="o">=</span>v0.12.0, <span class="nv">OS</span><span class="o">=</span>clear-linux-os 32120
Intel Core i7-7700HQ CPU 2.80GHz <span class="o">(</span>Kaby Lake<span class="o">)</span>, 1 CPU, 4 logical and 4 physical cores
.NET Core <span class="nv">SDK</span><span class="o">=</span>3.1.100
<span class="o">[</span>Host] : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
Job-DEARTS : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
<span class="nv">InvocationCount</span><span class="o">=</span>3 <span class="nv">IterationCount</span><span class="o">=</span>15 <span class="nv">LaunchCount</span><span class="o">=</span>2
<span class="nv">UnrollFactor</span><span class="o">=</span>1 <span class="nv">WarmupCount</span><span class="o">=</span>10
<span class="nv">$ </span><span class="nb">grep</span> <span class="s1">'stepping\|model\|microcode'</span> /proc/cpuinfo | <span class="nb">head</span> <span class="nt">-4</span>
model : 158
model name : Intel<span class="o">(</span>R<span class="o">)</span> Core<span class="o">(</span>TM<span class="o">)</span> i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>I’ve pulled a little trick with these charts: by default, I’ve <em>hidden</em> everything but one of the cut-off points: <code class="highlighter-rouge">40</code>, that being the best new cut-off point, at least in my opinion. If you care to follow my reasoning process, I suggest you slowly click (or touch) the <code class="highlighter-rouge">24</code>, <code class="highlighter-rouge">32</code>, and <code class="highlighter-rouge">48</code> series/titles in the legend. This will add them back into the chart, one by one; stop to appreciate what you are seeing. Once you do, I think it’s easy to see that:</p>
<ul>
<li>The initial value we started off with, <code class="highlighter-rouge">16</code> (the baseline for this series of benchmarks), is undoubtedly the <em>worst possible</em> cut-off for vectorized partitioning…<br />
<em>All of the other cut-off points have scaling values below 100%</em>, hence they are faster.</li>
<li><code class="highlighter-rouge">24</code> does not do us a world of good here either: it’s clearly always the next-worst option.</li>
<li><code class="highlighter-rouge">32</code> is pretty good, except at the lower edge of the chart, where the higher cut-off points seem to provide better value.</li>
<li>For the most part, using either <code class="highlighter-rouge">40</code> or <code class="highlighter-rouge">48</code> as the cut-off point seems to be the right way to go. These two cover the least area in the chart. In other words, they both provide the best improvement, on average, for our scenario.</li>
</ul>
<p>I ended up voting for <code class="highlighter-rouge">40</code>. There’s no good reason I can give for this except for (perhaps wrong) instinct. Lest we forget, another small pat on the back is in order: we’ve managed to speed up our sorting code with an improvement ranging from 5-25% throughout the entire spectrum, which is cause for a small celebration in itself.</p>
<p>To be completely honest, there is another, ulterior motive for showing the effect of changing the small-sorting threshold so early in this series. By doing so, we can sense where this trail will take us on our journey: it’s pretty clear that we will end up with two equally important implementations, each handling a large part of the total workload for sorting:</p>
<ul>
<li>The vectorized partitioning will be tasked with the initial heavy lifting: taking large arrays and breaking them down into many small, unsorted, yet completely distinct groups of elements.<br />
To put it plainly: taking a million elements and splitting them up into 10,000-20,000 groups of ~50-100 elements each, which do not cross over each other; that way, we can use…</li>
<li>Small-sorting, which will end up doing a final pass: taking each small group of ~50-100 elements and sorting it in place, before moving on to the next group.</li>
</ul>
<p>Given that we will always start with partitioning before concluding with small-sorting, we end up with a complete solution. Just as importantly, we can optimize <em>each</em> of the two parts making up our solution <em>independently</em>, in the coming posts.</p>
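<p>To make that division of labor concrete, here is a minimal scalar sketch of the overall shape we are converging on. This is C with hypothetical names; the real code partitions with AVX2 intrinsics, picks pivots more carefully, and manages temporary memory, none of which is shown here:</p>

```c
#include <stddef.h>

/* A minimal, scalar sketch of the two-phase structure described above.
 * Names are made up; plain scalar stand-ins replace the vectorized
 * partition and the fancier small-sort. */
enum { SMALL_SORT_CUTOFF = 40 }; /* the cut-off we just settled on */

static void small_sort(int *lo, int *hi) { /* insertion sort stand-in */
    for (int *i = lo + 1; i <= hi; i++) {
        int v = *i, *j = i;
        while (j > lo && j[-1] > v) { *j = j[-1]; j--; }
        *j = v;
    }
}

static int *partition(int *lo, int *hi) { /* scalar stand-in, pivot = *hi */
    int pivot = *hi, *w = lo, t;
    for (int *r = lo; r < hi; r++)
        if (*r < pivot) { t = *r; *r = *w; *w = t; w++; }
    t = *hi; *hi = *w; *w = t;
    return w; /* pivot now sits in its final position */
}

void quicksort(int *lo, int *hi) {
    if (hi - lo + 1 <= SMALL_SORT_CUTOFF) { small_sort(lo, hi); return; }
    int *p = partition(lo, hi);
    quicksort(lo, p - 1);
    quicksort(p + 1, hi);
}
```

The point is the first branch: every range at or below the cut-off bypasses partitioning entirely, which is exactly why tuning that constant moves the needle so much.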
</div>
<h3 id="explicit-prefetching--1">Explicit Prefetching: :-1:</h3>
<p>I tried using prefetch intrinsics to give the CPU early hints as to where we are reading memory from.</p>
<p>Generally speaking, explicit prefetching can be used to make sure the CPU reads some data from memory into the cache <em>ahead of the actual time</em> we require it, so that the CPU never needs to wait for memory, which is very slow. The bottom line is that having to wait for RAM is a death sentence (200-300 CPU cycles), but even having to wait for the L2 cache (14 cycles) when your entire loop’s throughput is around 9 cycles is unpleasant. With prefetch intrinsics, we can explicitly instruct the CPU to prefetch specific cache lines all the way into the L1 cache, or alternatively specify the target level as L2 or L3.</p>
<p>Just because we can do something, doesn’t mean we should: do we actually need to prefetch? CPU designers know all of the above just as much as we do, and the CPU already attempts to prefetch data based on complex and obscure heuristics. You might be tempted to think: “oh, what’s so bad about doing it anyway?”. Well, quite a lot, to be honest: when we explicitly tell the CPU to prefetch data, we’re wasting both instruction cache and decode+fetch bandwidth. Those might be better used for executing our computation.<br />
So the bottom line remains somewhat hazy, but we can try to set up some ground rules that are probably true in 2020:</p>
<ul>
<li>CPUs can prefetch data when we traverse memory sequentially.</li>
<li>They do so regardless of the traversal direction (increasing/decreasing addresses).</li>
<li>They can successfully figure out the <em>stride</em> we use, when it is constant.</li>
<li>They do so by building up a history of our reads, per call-site.</li>
</ul>
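<p>As a toy illustration of the access pattern those rules describe (plain C, deliberately unrelated to our partitioning code), a fixed-stride read loop like this is exactly what the hardware prefetcher is built to detect: one call-site, a constant stride, and a predictable direction:</p>

```c
/* Constant-stride traversal: after a few iterations, the hardware
 * prefetcher has all the history it needs to run ahead of us,
 * with no explicit prefetch hints required. */
long sum_strided(const int *a, int n, int stride) {
    long s = 0;
    for (int i = 0; i < n; i += stride)
        s += a[i];
    return s;
}
```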
<p>With all that in mind, it is quite likely that prefetching in our case would do little good: our partitioning code pretty much hits every point in the previous list. But even so, you can never really tell without either trying it out or inspecting memory-related performance counters. The latter turns out to be <a href="https://gist.github.com/travisdowns/90a588deaaa1b93559fe2b8510f2a739">more complicated than you’d think</a>, and sometimes it’s just easier to try something out than to attempt to measure it ahead of time. In our case, prefetching the <em>writable</em> memory <strong>makes no sense</strong>: our loop mostly reads from the same addresses just before writing to them in the next iteration or two, so I focused on trying to prefetch the next read addresses.</p>
<p>Whenever I modified <code class="highlighter-rouge">readLeft</code>, <code class="highlighter-rouge">readRight</code>, I immediately added code like this:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
</pre></td><td class="rouge-code"><pre><span class="kt">int</span> <span class="p">*</span> <span class="n">nextPtr</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readLeft</span> <span class="p">-</span> <span class="p">(</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">writeLeft</span><span class="p">)</span> <span class="p"><=</span>
    <span class="p">((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">writeRight</span> <span class="p">-</span> <span class="p">(</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readRight</span><span class="p">))</span> <span class="p">{</span>
<span class="n">nextPtr</span> <span class="p">=</span> <span class="n">readLeft</span><span class="p">;</span>
<span class="n">readLeft</span> <span class="p">+=</span> <span class="m">8</span><span class="p">;</span>
<span class="c1">// Trying to be clever here,</span>
<span class="c1">// If we are reading from the left at this iteration,</span>
<span class="c1">// we are likely to read from right in the next iteration</span>
<span class="n">Sse</span><span class="p">.</span><span class="nf">Prefetch0</span><span class="p">((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readRight</span> <span class="p">-</span> <span class="m">64</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">nextPtr</span> <span class="p">=</span> <span class="n">readRight</span><span class="p">;</span>
<span class="n">readRight</span> <span class="p">-=</span> <span class="m">8</span><span class="p">;</span>
<span class="c1">// Same as above, only the other way around:</span>
<span class="c1">// After reading from the right, it's likely</span>
<span class="c1">// that our next read will be on the left side</span>
<span class="n">Sse</span><span class="p">.</span><span class="nf">Prefetch0</span><span class="p">((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readLeft</span> <span class="p">+</span> <span class="m">64</span><span class="p">);</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>This tells the CPU we are about to use data in <code class="highlighter-rouge">readLeft + 64</code> (the next cache-line from the left) and <code class="highlighter-rouge">readRight - 64</code> (the next cache-line from the right) in the following iterations.</p>
<p>While this looks great on paper, the real-world results of this were unnoticeable for me, and even slightly negative. For the most part, it appears that the CPUs I used for testing did a good job without me constantly telling them to do what they had already been doing on their own… Still, it was worth a shot.</p>
<h3 id="simplifying-the-branch-1">Simplifying the branch :+1:</h3>
<p>I’m kind of ashamed of this particular optimization: I had literally been staring at this line of code, and optimizing around it, for months without stopping to really think about what it was I was <strong>really trying</strong> to do. Let’s go back to our re-written branch from a couple of paragraphs ago:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre><span class="k">if</span> <span class="p">((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readLeft</span> <span class="p">-</span> <span class="p">(</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">writeLeft</span><span class="p">)</span> <span class="p"><=</span>
<span class="p">(</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">writeRight</span> <span class="p">-</span> <span class="p">(</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readRight</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// ...</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c1">// ...</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>I’ve been describing this condition in both animated and code form in the previous part, explaining how, for my double-pumping to work, I have to figure out which side we <em>must</em> read from <strong>next</strong>, so that we never end up overwriting data before we’ve had a chance to read and partition it. All of this happens in the name of performing in-place partitioning. However, I’ve been over-complicating the actual condition!<br />
At some admittedly late stage, it hit me, so let’s play this out step-by-step:</p>
<ol>
<li>We always start with the setup I’ve previously described, where we make <code class="highlighter-rouge">8</code> elements worth of space available on <strong>both</strong> sides, by partitioning them away into the temporary memory.</li>
<li>When we get into the main partitioning loop, we pick one specific side to read from: so far, this has always been the left side (it doesn’t <em>really</em> matter which side it is, but it arbitrarily ended up being the <em>left</em> side due to the condition being <code class="highlighter-rouge"><=</code> rather than <code class="highlighter-rouge"><</code>).</li>
<li>Given all of the above, we always <em>start</em> reading from the left, thereby increasing the “breathing space” on the left side from <code class="highlighter-rouge">8</code> to <code class="highlighter-rouge">16</code> elements temporarily.</li>
<li>Once our trusty ole’ partitioning block is done, we can pause and reason on how both sides now look:
<ul>
<li>The left side either has:
<ul>
<li><code class="highlighter-rouge">8</code> elements of space (in the less likely, yet possible case that all elements read from it were smaller than the selected pivot) -or-</li>
<li>It has more than <code class="highlighter-rouge">8</code> elements of “free” space.</li>
</ul>
</li>
<li>
<p>In the first case, where the left side is now back to 8 elements of free space, the right side also has <code class="highlighter-rouge">8</code> elements of free space, since nothing was written on that side!</p>
</li>
<li>In all other cases, the left side has <em>more</em> than <code class="highlighter-rouge">8</code> elements of free space, and the right side has less than <code class="highlighter-rouge">8</code> elements of free space, by definition.</li>
</ul>
</li>
<li>Since these are the true dynamics, why should we even bother comparing <strong>both</strong> heads and tails of each respective side?</li>
</ol>
<p>The answer to that last question is: <strong>We don’t have to!</strong><br />
We could simplify the branch by comparing only the distance between the right side’s write and read pointers, checking whether it is smaller than the magic number <code class="highlighter-rouge">8</code>!
This new condition serves the original <em>intent</em> (which is: “don’t end up overwriting unread data”) just as well as the more complicated branch we used before…<br />
When the right side has fewer than <code class="highlighter-rouge">8</code> elements of free space, we <em>have to</em> read from the right side in the next round, since it is in danger of being overwritten; otherwise, the only other option is that both sides are back at <code class="highlighter-rouge">8</code> elements each, and we should go back to reading from the left side, essentially returning to our starting setup condition as described in (1). It’s kind of silly, and I really feel bad that it took me 4 months or so to see this. The new condition ends up being much simpler to encode and execute:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre><span class="kt">int</span><span class="p">*</span> <span class="n">nextPtr</span><span class="p">;</span>
<span class="k">if</span> <span class="p">((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">writeRight</span> <span class="p">-</span> <span class="p">(</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readRight</span> <span class="p"><</span> <span class="n">N</span> <span class="p">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">int</span><span class="p">))</span> <span class="p">{</span>
<span class="c1">// ...</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c1">// ...</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>This branch is just as “correct” as the previous one, but it is less taxing in a few ways:</p>
<ul>
<li>Fewer instructions to decode and execute.<br />
We’ve saved an additional 5 bytes worth of opcodes from the main loop!</li>
<li>Fewer data dependencies for the CPU to potentially wait for.<br />
(The CPU doesn’t have to wait for the <code class="highlighter-rouge">writeLeft</code>/<code class="highlighter-rouge">readLeft</code> pointer mutation and subtraction to complete)</li>
</ul>
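<p>Incidentally, the claim that the simpler condition can never let us overwrite unread data is easy to check mechanically. Here is a tiny, self-contained C model (counters only; a hypothetical sketch, not the real partitioning code) of the free-space bookkeeping described in steps 1-5 above:</p>

```c
/* Toy model of the double-pumped partition loop: track only the
 * free-slot counts on each side, in elements. Reading a block of 8
 * frees 8 more slots on the side we read from; the partition then
 * writes k elements to the left and 8-k to the right. We drive the
 * left/right split with a cheap LCG and check that the simplified
 * branch ("read right iff its free space dropped below 8") never
 * lets a write clobber unread data. */
enum { B = 8 }; /* elements per vectorized block */

int model_holds(unsigned seed, int iters) {
    int leftFree = B, rightFree = B; /* state after the initial setup */
    for (int i = 0; i < iters; i++) {
        if (rightFree < B) rightFree += B; /* must read from the right */
        else               leftFree  += B; /* otherwise read from the left */
        seed = seed * 1664525u + 1013904223u;
        int k = (int)(seed % (B + 1));     /* 0..8 elements go left */
        leftFree  -= k;
        rightFree -= B - k;
        if (leftFree < 0 || rightFree < 0)  return 0; /* overwrote unread data */
        if (leftFree + rightFree != 2 * B)  return 0; /* invariant broken */
    }
    return 1;
}
```

Running this over many random split sequences never trips either check, which matches the reasoning above: the two free-space counters always sum to 16 elements between iterations, so testing only the right side is enough.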
<p>Naturally, this ends up slightly faster, and we can verify this with BDN once again:</p>
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#fcd8c2f3-b377-44eb-9b18-5964ad2a69b4'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-stats"></i> Scaling</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks</a></li>
<li><a href="#"><i class="glyphicon glyphicon-info-sign"></i> Setup</a></li>
</ul>
<ul id="fcd8c2f3-b377-44eb-9b18-5964ad2a69b4" class="uk-switcher uk-margin">
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Performance scale: Array.Sort (solid gray) is always 100%, and the other methods are scaled relative to it" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
MicroOpt_40,1,1,1,1,1,1
SimpleBranch,1.01220256253813,0.946321321321321,0.982688056091031,0.938806414898963,1.00465999238207,0.962359905144129
<!--
{
"data" : {
"datasets" : [ {
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3 }
}
]
},
"options": {
"title": { "text": "AVX2 SimpleBranch Sorting - Scaled to MicroOpt_40", "display": true },
"scales": {
"yAxes": [{
"ticks": {
"fontFamily": "Indie Flower",
"min": 0.92,
"callback": "ticksPercent"
},
"scaleLabel": {
"labelString": "Scaling (%)",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
MicroOpt_40,16.3891,21.3124,24.0181,26.2096,23.1979,26.4655
SimpleBranch,16.5929,20.168,23.6023,24.6058,23.306,25.4694
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3 }
}
]
},
"options": {
"title": { "text": "AVX2 MicroOpt_40 + SimplerBranch - log(Time/N)", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"min": 15,
"max": 27,
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt4_3_Int32_-report.datatable.json" data-id-field="name" data-pagination="true" data-page-list="[5, 10, 15, 20]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result it. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nv">BenchmarkDotNet</span><span class="o">=</span>v0.12.0, <span class="nv">OS</span><span class="o">=</span>clear-linux-os 32120
Intel Core i7-7700HQ CPU 2.80GHz <span class="o">(</span>Kaby Lake<span class="o">)</span>, 1 CPU, 4 logical and 4 physical cores
.NET Core <span class="nv">SDK</span><span class="o">=</span>3.1.100
<span class="o">[</span>Host] : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
Job-DEARTS : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
<span class="nv">InvocationCount</span><span class="o">=</span>3 <span class="nv">IterationCount</span><span class="o">=</span>15 <span class="nv">LaunchCount</span><span class="o">=</span>2
<span class="nv">UnrollFactor</span><span class="o">=</span>1 <span class="nv">WarmupCount</span><span class="o">=</span>10
<span class="nv">$ </span><span class="nb">grep</span> <span class="s1">'stepping\|model\|microcode'</span> /proc/cpuinfo | <span class="nb">head</span> <span class="nt">-4</span>
model : 158
model name : Intel<span class="o">(</span>R<span class="o">)</span> Core<span class="o">(</span>TM<span class="o">)</span> i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>There’s not a lot to say about this, but I’ll point out a couple of things:</p>
<ul>
<li>There is a seemingly very slight slowdown around the 100 and 1M element marks. It’s authentic and repeatable in my tests, and I honestly don’t know why it happens yet. We spend a total of around 1.6μs for every 100-element sort, which might not initially sound like a lot of time, but at 2.8GHz, that amounts to ~4500 cycles, give or take. For the case of 1M elements, this phenomenon is even more peculiar; but such is life.</li>
<li>Otherwise, there is an improvement, even if a modest one, of roughly 2%-4% in most cases. It does look like this version of our code is better, at the end of the day.</li>
</ul>
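<p>For the record, the cycle estimate above is just straight multiplication; a one-liner (with the figures from the text plugged in as assumptions) makes it explicit:</p>

```c
/* total time (ns) * clock rate (GHz, i.e. cycles per ns) = total cycles.
 * E.g. ~1,600 ns for a 100-element sort at 2.8 GHz is ~4,480 cycles,
 * which is where the "~4500 cycles give or take" figure comes from. */
double cycles_spent(double total_ns, double ghz) {
    return total_ns * ghz;
}
```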
<p>One interesting question that I personally did not know the answer to beforehand was: would this reduce branch mispredictions? There’s no reason to expect it to, since our input data, being random, is driving the outcome of this branch. However, if I’ve learned one thing throughout this long ordeal, it is that there are always things you don’t even know that you don’t know. Any way of verifying our pet theories is a welcome opportunity to learn some humility.</p>
</div>
<p>Let’s fire up <code class="highlighter-rouge">perf</code> to inspect what its counters tell us about the two versions (each result is in a separate tab below):</p>
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#123fd8ba-23ef-4144-b66c-871bc8f5aa58'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-list-alt"></i> CutOff@40</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> SimpleBranch</a></li>
</ul>
<ul id="123fd8ba-23ef-4144-b66c-871bc8f5aa58" class="uk-switcher uk-margin">
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="rouge-code"><pre><span class="nv">$ COMPlus_PerfMapEnabled</span><span class="o">=</span>1 perf record <span class="nt">-F</span> max <span class="nt">-e</span> branch-misses <span class="se">\</span>
./Example <span class="nt">--type-list</span> DoublePumpMicroOptCutOff_40 <span class="nt">--size-list</span> 1000000 <span class="nt">--no-check</span>
...
<span class="nv">$ </span>perf report <span class="nt">--stdio</span> <span class="nt">-F</span> overhead,sym | <span class="nb">head</span> <span class="nt">-15</span>
<span class="c"># Samples: 403K of event 'branch-misses'</span>
<span class="c"># Event count (approx.): 252554012</span>
43.73% <span class="o">[</span>.] ... DoublePumpMicroOptCutoff_40::InsertionSort<span class="o">(</span>...<span class="o">)</span>
25.51% <span class="o">[</span>.] ... DoublePumpMicroOptCutoff_40+VxSortInt32::VectorizedPartitionInPlace<span class="o">(</span>...<span class="o">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="rouge-code"><pre><span class="nv">$ COMPlus_PerfMapEnabled</span><span class="o">=</span>1 perf record <span class="nt">-F</span> max <span class="nt">-e</span> branch-misses <span class="se">\</span>
./Example <span class="nt">--type-list</span> DoublePumpSimpleBranch <span class="nt">--size-list</span> 1000000 <span class="nt">--no-check</span>
...
<span class="nv">$ </span>perf report <span class="nt">--stdio</span> <span class="nt">-F</span> overhead,sym | <span class="nb">head</span> <span class="nt">-15</span>
<span class="c"># Samples: 414K of event 'branch-misses'</span>
<span class="c"># Event count (approx.): 241513903</span>
41.11% <span class="o">[</span>.] ... DoublePumpSimpleBranch::InsertionSort<span class="o">(</span>...<span class="o">)</span>
26.59% <span class="o">[</span>.] ... DoublePumpSimpleBranch+VxSortInt32::VectorizedPartitionInPlace<span class="o">(</span>...<span class="o">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>Here we’re comparing the same two versions we’ve just benchmarked, with a specific focus on the branch-misses HW counter. We can take this opportunity both to appreciate how these results compare to the ones we recorded at the end of the previous post, and to see how they compare to each other.</p>
<p>Compared to our <code class="highlighter-rouge">DoublePumpedNaive</code> implementation of yester-post, it would appear that the “burden of guilt” when it comes to branch mispredictions has shifted towards <code class="highlighter-rouge">InsertionSort</code> by 3-4%. This is to be expected: We were using a cut-off point of <code class="highlighter-rouge">16</code> previously, and we’ve just upped it to <code class="highlighter-rouge">40</code> in the previous section, so it makes sense for <code class="highlighter-rouge">InsertionSort</code> to perform more work in this new balance, taking a larger share of the branch-misses.</p>
<p>When comparing the <a href="https://github.com/damageboy/VxSort/blob/research/VxSortResearch/Unstable/AVX2/Happy/B4_2_DoublePumpMicroOptCutoff.cs#L798"><code class="highlighter-rouge">DoublePumpMicroOptCutOff_40</code></a> and <a href="https://github.com/damageboy/VxSort/blob/research/VxSortResearch/Unstable/AVX2/Happy/B4_3_DoublePumpSimpleBranch.cs"><code class="highlighter-rouge">DoublePumpSimpleBranch</code></a> versions, which differ only in that nasty branch at the top of our main loop, the results look mostly similar. First, we have to acknowledge that <code class="highlighter-rouge">perf</code> is a statistical tool that works by collecting samples of HW counters, so we’re not going to get an exact count of anything, even when running the same code time after time. With that said, once we calculate how many of the total branch misses are attributed to the function we actually changed, it comes to <code class="highlighter-rouge">64,426,528</code> misses for the previous version vs. <code class="highlighter-rouge">64,218,546</code> for the newer, simpler branch. That difference doesn’t amount to enough to call this a branch-misprediction win. So it would seem we gained a bit with smaller code, but not by lowering the frequency of mispredictions.</p>
</div>
<h3 id="packing-the-permutation-table-1st-attempt-1">Packing the Permutation Table, 1<sup>st</sup> attempt: :+1:</h3>
<p>Ever since I started with this little time-succubus of a project, I was really annoyed at the way I was encoding the permutation tables. To me, wasting 8kb worth of data, or more specifically, wasting 8kb worth of precious L1 cache in the CPU for the permutation entries, was tantamount to a cardinal sin. My emotional state aside, the situation is even more horrid when you stop to consider that out of each 32-byte permutation entry, we were only really using 3 bits x 8 elements, or 24 bits of usable data. To be completely honest, I probably made this into a bigger problem in my head, imagining how the performance was suffering, than it really is, but we don’t always get to choose our made-up enemies. Sometimes they choose us.</p>
<p>My first attempt at packing the permutation entries was to try and use a specific Intel intrinsic called <a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.avx2.converttovector256int32?view=netcore-3.1"><code class="highlighter-rouge">ConvertToVector256Int32</code> / <code class="highlighter-rouge">VPMOVZXBD</code></a>. This intrinsic can read a 64-bit value directly from memory while also expanding it into 8x32-bit values inside a <code class="highlighter-rouge">Vector256<T></code> register. If nothing else, it buys me an excuse to do this:</p>
<p><img src="/assets/images/yodawg.jpg" alt="Yo Dawg" /></p>
<p>More seriously though, the basic idea was that I would go back to the permutation entries and re-encode them as 64 bits (8x8 bits) per single entry instead of the 256 bits I’ve been using thus far. This encoding reduces the size of the entire permutation table from 8kb to 2kb, which is a nice start.<br />
Unfortunately, this initial attempt went south when I got hit by a <a href="https://github.com/dotnet/runtime/issues/12835">JIT bug</a>. When I tried to circumvent that bug, the results didn’t look better, they were slightly worse, so I left the code in a sub-optimal state and forgot about it. Luckily, I did revisit this at a later stage, after the bug was fixed, and to my delight, once the JIT was encoding this instruction correctly and efficiently, things started working smoothly.</p>
<p>I ended up encoding a second permutation table, and by using the correct <code class="highlighter-rouge">ConvertToVector256Int32</code> we are kind of better off:</p>
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#057c19d6-b6ca-48e8-942e-5115c37b39cc'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-stats"></i> Scaling</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N (Intel)</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N (AMD)</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks (Intel)</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks (AMD)</a></li>
<li><a href="#"><i class="glyphicon glyphicon-info-sign"></i> Setup</a></li>
</ul>
<ul id="057c19d6-b6ca-48e8-942e-5115c37b39cc" class="uk-switcher uk-margin">
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Performance scale: Array.Sort (solid gray) is always 100%, and the other methods are scaled relative to it" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
SimpleBranch,1,1,1,1,1,1
Packed Intel,1.013605442,1.016909534,1.001868534,0.984072719,0.997337839,0.997892526
Packed AMD,0.896395352,0.813863407,0.919215529,0.916898529,0.926463363,0.981186383
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(0,113,197,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3 }
},
{
"backgroundColor": "rgba(237,28,36,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 30, "hachureGap": 3 }
}
]
},
"options": {
"title": { "text": "AVX2 Packed Permutation Table Sorting - Scaled to SimpleBranch", "display": true },
"scales": {
"yAxes": [{
"ticks": {
"fontFamily": "Indie Flower",
"min": 0.80,
"callback": "ticksPercent"
},
"scaleLabel": {
"labelString": "Scaling (%)",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,
SimpleBranch,16.1715,20.1069,24.2436,25.7728,24.4249,26.6617
Packed,16.3927,20.4471,24.2889,25.3623,24.3599,26.6055
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(0,113,197,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3 }
}
]
},
"options": {
"title": { "text": "AVX2 SimplerBranch + Packed - log(Time/N) on Intel", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"min": 15.0,
"max": 27,
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,
SimpleBranch,10.1852,13.3196,18.9534,22.6299,23.7335,24.3677
Packed,9.13,10.7383,18.8766,20.7494,21.9882,23.9092
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(237,28,36,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3 }
}
]
},
"options": {
"title": { "text": "AVX2 SimplerBranch + Packed - log(Time/N) on AMD", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"min": 8.0,
"max": 26,
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt4_4_Int32_-report.datatable.intel.json" data-id-field="name" data-pagination="true" data-page-list="[5, 10, 15, 20]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result it. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt4_4_Int32_-report.datatable.amd.json" data-id-field="name" data-pagination="true" data-page-list="[5, 10, 15, 20]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result it. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nv">BenchmarkDotNet</span><span class="o">=</span>v0.12.0, <span class="nv">OS</span><span class="o">=</span>clear-linux-os 32120
Intel Core i7-7700HQ CPU 2.80GHz <span class="o">(</span>Kaby Lake<span class="o">)</span>, 1 CPU, 4 logical and 4 physical cores
.NET Core <span class="nv">SDK</span><span class="o">=</span>3.1.100
<span class="o">[</span>Host] : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
Job-DEARTS : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
<span class="nv">InvocationCount</span><span class="o">=</span>3 <span class="nv">IterationCount</span><span class="o">=</span>15 <span class="nv">LaunchCount</span><span class="o">=</span>2
<span class="nv">UnrollFactor</span><span class="o">=</span>1 <span class="nv">WarmupCount</span><span class="o">=</span>10
<span class="nv">$ </span><span class="nb">grep</span> <span class="s1">'stepping\|model\|microcode'</span> /proc/cpuinfo | <span class="nb">head</span> <span class="nt">-4</span>
model : 158
model name : Intel<span class="o">(</span>R<span class="o">)</span> Core<span class="o">(</span>TM<span class="o">)</span> i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>These results bring a new, unwarranted dimension into our lives: CPU vendors and model-specific quirks. Up until now, I’ve been testing my various optimizations on three different processor models I had at hand: Intel Kaby Lake, Intel Broadwell, and AMD Ryzen. Every attempt I’ve presented here netted positive results on all three test beds, even if to different degrees, so I opted to focus on the Intel Kaby-Lake results to reduce the information overload.<br />
Now is the first time we see uneven results: the two results I included represent the two extremes of the performance spectrum; the newer Intel Kaby-Lake processors are not affected by this optimization. When I set out to implement it, I came into this with eyes wide open: I knew that, all in all, the CPU would be doing roughly the same work for the permutation entry loading per se. I was gunning for a 2<sup>nd</sup>-order effect: freeing up 6KB of L1 data-cache is no small saving, given that its total size is 32KB in all of my tested CPUs.</p>
<p>What we see from the Intel Kaby-Lake results can basically be summarized as: newer Intel CPUs <em>probably</em> have a very efficient prefetch unit, one that performs well enough that we can’t feel or see the benefit of having more L1 room afforded by packing the permutation table more tightly. With AMD CPUs, and older Intel CPUs (like Intel Broadwell, not shown here), freeing up the L1 cache does make a substantial dent in the total runtime.</p>
<p>All in all, while this is a slightly more complex scenario to reason about, we’re left with one rather new CPU that is neither helped nor hurt by this optimization, and other, older/different CPUs where this is a very substantial win. As such, I decided to keep it in the code-base going forward.</p>
</div>
<h3 id="packing-the-permutation-table-2nd-attempt--1">Packing the Permutation Table, 2<sup>nd</sup> attempt: :-1:</h3>
<p>Next, I tried to pack the permutation table even further, going from 2kb to 1kb of memory, by squeezing each entry’s eight 3-bit indices into a single 32-bit value.
The packing is the easy part, but how would we unpack these 32-bit compressed entries all the way back to a full 256-bit vector? Why, with yet more intrinsics, of course.
With this, my ultra-packed permutation table now looked like this:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre><span class="n">ReadOnlySpan</span><span class="p"><</span><span class="kt">byte</span><span class="p">></span> <span class="n">BitPermTable</span> <span class="p">=></span> <span class="k">new</span> <span class="kt">byte</span><span class="p">[]</span>
<span class="p">{</span>
<span class="m">0</span><span class="n">b10001000</span><span class="p">,</span> <span class="m">0</span><span class="n">b11000110</span><span class="p">,</span> <span class="m">0</span><span class="n">b11111010</span><span class="p">,</span> <span class="m">0</span><span class="n">b00000000</span><span class="p">,</span> <span class="c1">// 0</span>
<span class="c1">// ...</span>
<span class="m">0</span><span class="n">b01100011</span><span class="p">,</span> <span class="m">0</span><span class="n">b01111101</span><span class="p">,</span> <span class="m">0</span><span class="n">b01000100</span><span class="p">,</span> <span class="m">0</span><span class="n">b00000000</span><span class="p">,</span> <span class="c1">// 7</span>
<span class="c1">// ...</span>
<span class="m">0</span><span class="n">b00010000</span><span class="p">,</span> <span class="m">0</span><span class="n">b10011101</span><span class="p">,</span> <span class="m">0</span><span class="n">b11110101</span><span class="p">,</span> <span class="m">0</span><span class="n">b00000000</span><span class="p">,</span> <span class="c1">// 170</span>
<span class="c1">// ...</span>
<span class="m">0</span><span class="n">b10001000</span><span class="p">,</span> <span class="m">0</span><span class="n">b11000110</span><span class="p">,</span> <span class="m">0</span><span class="n">b11111010</span><span class="p">,</span> <span class="m">0</span><span class="n">b00000000</span><span class="p">,</span> <span class="c1">// 255</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>And my unpacking code now uses <a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.bmi2.x64.parallelbitdeposit?view=netcore-3.1#System_Runtime_Intrinsics_X86_Bmi2_X64_ParallelBitDeposit_System_UInt64_System_UInt64_"><code class="highlighter-rouge">ParallelBitDeposit</code> / <code class="highlighter-rouge">PDEP</code></a>, which I happened to cover in more detail in a <a href="/2018-08-19/netcoreapp3.0-intrinsics-in-real-life-pt2#pdep---parallel-bit-deposit">previous post</a>:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre><span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="nf">GetBitPermutation</span><span class="p">(</span><span class="kt">uint</span> <span class="p">*</span><span class="n">pBase</span><span class="p">,</span> <span class="k">in</span> <span class="kt">uint</span> <span class="n">mask</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">const</span> <span class="kt">ulong</span> <span class="n">magicMask</span> <span class="p">=</span>
<span class="m">0</span><span class="n">b00000111_00000111_00000111_00000111_00000111_00000111_00000111_00000111</span><span class="p">;</span>
<span class="k">return</span> <span class="n">Avx2</span><span class="p">.</span><span class="nf">ConvertToVector256Int32</span><span class="p">(</span>
<span class="n">Vector128</span><span class="p">.</span><span class="nf">CreateScalarUnsafe</span><span class="p">(</span>
<span class="n">Bmi2</span><span class="p">.</span><span class="n">X64</span><span class="p">.</span><span class="nf">ParallelBitDeposit</span><span class="p">(</span><span class="n">pBase</span><span class="p">[</span><span class="n">mask</span><span class="p">],</span> <span class="n">magicMask</span><span class="p">)).</span><span class="nf">AsByte</span><span class="p">());</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>What does this little monstrosity do, exactly? We <strong>pack</strong> the permutation bits (remember, we need just 3 bits per element, and we have 8 elements, so 24 bits per permutation vector in total) into a single 32-bit value; then, whenever we need to expand it into a full-blown vector, we:</p>
<ul>
<li>Unpack the 32-bit values into a 64-bit value using <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=pdep&expand=1532,4152"><code class="highlighter-rouge">ParallelBitDeposit</code></a> from the <code class="highlighter-rouge">BMI2</code> intrinsics extensions.</li>
<li>Convert (move) the 64-bit value into the lower 64-bits of a 128-bit SIMD register using <code class="highlighter-rouge">Vector128.CreateScalarUnsafe</code>.</li>
<li>Go back to using a different variant of <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_cvtepi8_epi32&expand=1532"><code class="highlighter-rouge">ConvertToVector256Int32</code></a> (<code class="highlighter-rouge">VPMOVZXBD</code>), one that takes 8-bit elements from a 128-bit wide register and expands them into integers in a 256-bit register.</li>
</ul>
<p>In short, we chain 2 extra instructions compared to our 2kb permutation table, but save an additional 1kb of cache. Was it worth it?<br />
I wish I could say with a complete and assured voice that it was, but the truth is that it had very little positive effect, if any:</p>
<p>While we do save 1kb of precious L1 cache, the extra instructions end up delaying and costing us more than whatever perf we’re gaining from the extra cache space.<br />
To make things even worse, I <a href="https://github.com/dotnet/runtime/issues/786">later discovered</a> that on AMD processors, the very same intrinsic I’m relying upon here, <code class="highlighter-rouge">PDEP</code>, is some sort of a bastardized instruction. It’s not really implemented with proper circuitry at the CPU level, but rather as a plain loop inside the processor. As the discussion I linked to shows, it can take hundreds of cycles(!) depending on the provided mask value. For now, we can simply chalk this attempt up as a failure.</p>
<h3 id="skipping-some-permutations--1">Skipping some permutations: :-1:</h3>
<p>There are common cases where performing the permutation is completely unneeded, which means that almost the entire permutation block can be skipped:</p>
<ul>
<li>No need to load the permutation entry</li>
<li>Or perform the permutation</li>
</ul>
<p>To be precise, there are exactly 9 such cases in the permutation table: whenever all the <code class="highlighter-rouge">1</code> bits are already grouped in the upper (MSB) part of the <code class="highlighter-rouge">mask</code> value in our permutation block. The values are:</p>
<ul>
<li><code class="highlighter-rouge">0b11111111</code></li>
<li><code class="highlighter-rouge">0b11111110</code></li>
<li><code class="highlighter-rouge">0b11111100</code></li>
<li><code class="highlighter-rouge">0b11111000</code></li>
<li><code class="highlighter-rouge">0b11110000</code></li>
<li><code class="highlighter-rouge">0b11100000</code></li>
<li><code class="highlighter-rouge">0b11000000</code></li>
<li><code class="highlighter-rouge">0b10000000</code></li>
<li><code class="highlighter-rouge">0b00000000</code></li>
</ul>
<p>I thought it might be a good idea to detect those cases. I first tried a switch statement, and when that failed to speed things up, comparing the number of trailing zeros to (<code class="highlighter-rouge">8</code> - population count). While both methods did technically work, the additional branch and its associated mispredictions didn’t make this worthwhile or yield any positive result. The simpler code, which always permutes, did just as well if not slightly better.<br />
Of course, these results have to be taken with a grain of salt, since they depend on us sorting random data. There might be other situations, where such branches are predicted correctly, in which this could save a lot of cycles. But for now, let’s just drop it and move on…</p>
<h3 id="getting-intimate-with-x86-for-fun-and-profit-1">Getting intimate with x86 for fun and profit: :+1:</h3>
<p>I know the title sounds cryptic, but x86 is just weird, and I wanted to make sure you’re mentally geared for some weirdness in our journey to squeeze a bit of extra performance. We need to remember that this is a 40+ year-old CISC processor made in an entirely different era:</p>
<p><img src="/assets/images/your-fathers-lea.svg" alt="Your Father's LEA" /></p>
<p>This last optimization trick repeats the same spiel I’ve been going through throughout this post: trimming the fat around our code. We’ll try to generate slightly denser code in our vectorized block. The idea here is to trigger the JIT to encode the pointer-update code at the end of our vectorized partitioning block with the space-efficient <code class="highlighter-rouge">LEA</code> instruction.</p>
<p>To better explain this, we’ll start by going back to the last 3 lines of code I presented at the top of <em>this</em> post, as part of the so-called micro-optimized version. Here is the C#:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre> <span class="c1">// end of partitioning block...</span>
<span class="kt">var</span> <span class="n">popCount</span> <span class="p">=</span> <span class="n">PopCnt</span><span class="p">.</span><span class="nf">PopCount</span><span class="p">(</span><span class="n">mask</span><span class="p">);</span>
<span class="n">writeRight</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">*)</span> <span class="p">((</span><span class="kt">byte</span><span class="p">*)</span> <span class="n">writeRight</span> <span class="p">-</span> <span class="n">popCount</span><span class="p">);</span>
<span class="n">writeLeft</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">*)</span> <span class="p">((</span><span class="kt">byte</span><span class="p">*)</span> <span class="n">writeLeft</span> <span class="p">+</span> <span class="p">(</span><span class="m">8U</span> <span class="p"><<</span> <span class="m">2</span><span class="p">)</span> <span class="p">-</span> <span class="n">popCount</span><span class="p">);</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>If we look at the corresponding disassembly for this code, it looks quite verbose. Here it is with some comments, and with the machine-code bytes on the right-hand side:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="rouge-code"><pre><span class="c1">;var popCount = PopCnt.PopCount(mask);</span>
<span class="nf">popcnt</span> <span class="nb">r8d</span><span class="p">,</span><span class="nb">r8d</span> <span class="c1">; F3450FB8C0</span>
<span class="nf">shl</span> <span class="nb">r8d</span><span class="p">,</span><span class="mi">2</span> <span class="c1">; 41C1E002</span>
<span class="c1">;writeRight = (int*) ((byte*) writeRight - popCount);</span>
<span class="nf">mov</span> <span class="nb">r9d</span><span class="p">,</span><span class="nb">r8d</span> <span class="c1">; 458BC8</span>
<span class="nf">sub</span> <span class="nb">rcx</span><span class="p">,</span><span class="nv">r9</span> <span class="c1">; 492BC9</span>
<span class="c1">;writeLeft = (int*) ((byte*) writeLeft + (8U << 2) - popCount);</span>
<span class="nf">add</span> <span class="nv">r12</span><span class="p">,</span><span class="mh">20h</span> <span class="c1">; 4983C420</span>
<span class="nf">mov</span> <span class="nb">r8d</span><span class="p">,</span><span class="nb">r8d</span> <span class="c1">; 458BC0</span>
<span class="nf">sub</span> <span class="nv">r12</span><span class="p">,</span><span class="nv">r8</span> <span class="c1">; 4D2BE0</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>If we count the bytes, everything after the <code class="highlighter-rouge">PopCount</code> instruction is taking <code class="highlighter-rouge">20</code> bytes in total: <code class="highlighter-rouge">4 + 3 + 3 + 4 + 3 + 3</code> to complete both pointer updates.</p>
<p>The motivation behind what I’m about to show is that we can replace all of this code with a <strong>much</strong> shorter sequence, taking advantage of x86’s wacky memory addressing, by tweaking the C# code <em>ever</em> so slightly. This, in turn, will nudge the C# JIT (which is already aware of these x86 shenanigans and is perfectly capable of generating the more compact machine code) to do so when it encounters the right constructs at the MSIL/bytecode level.<br />
We succeed here <em>if and when</em> we end up using one <code class="highlighter-rouge">LEA</code> instruction for each pointer modification.</p>
<p>What is <code class="highlighter-rouge">LEA</code> you ask? <strong>L</strong>oad <strong>E</strong>ffective <strong>A</strong>ddress is an instruction that exposes the full extent of x86’s memory addressing capabilities in a single instruction. It allows us to encode rather complicated mathematical/address calculations with a minimal number of bytes, abusing the CPU’s address generation units (AGUs), while storing the result of that calculation back into a register.</p>
<p>But what can the AGUs do for us? We need to learn just enough about them before we attempt to milk some performance out of them through <code class="highlighter-rouge">LEA</code>. Out of curiosity, I went back in time to find out <em>when</em> the memory addressing scheme was defined/last changed. To my surprise, I found out it was <em>much later</em> than what I had originally thought: Intel last <em>expanded</em> the memory addressing semantics as late as <strong>1986</strong>! Of course this was later expanded again by AMD when they introduced <code class="highlighter-rouge">amd64</code> to propel x86 from the 32-bit dark-ages into the brave world of 64-bit processing, but that was merely a machine-word expansion, not a functional change. I’m happy I researched this bit of history for this post because I found <a href="/assets/images/230985-001_80386_Programmers_Reference_Manual_1986.pdf">this scanned 80386 manual</a>:</p>
<center>
<div>
<p><a href="../assets/images/230985-001_80386_Programmers_Reference_Manual_1986.pdf"><img src="/assets/images/80386-manual.png" alt="80386" /></a></p>
</div>
</center>
<p>In this reference manual, the “new” memory addressing semantics are described in section <code class="highlighter-rouge">2.5.3.2</code> on page <code class="highlighter-rouge">2-18</code>, reprinted here for some of its 1980s era je ne sais quoi:</p>
<p><img src="/assets/images/x86-effective-address-calculation-transparent.png" alt="x86-effective-address-calculation" /></p>
<p>Figure <code class="highlighter-rouge">2-10</code> in the original manual does a good job explaining the components and machinery that go into a memory address calculation in x86. Here it is together with my plans to abuse it:</p>
<ul>
<li>Segment register: This is an odd over-engineered 32-bit era remnant. It’s mostly never used, so let’s skip it in this context.</li>
<li><strong>Base register</strong>: This will be our pointer that we want to modify: <code class="highlighter-rouge">writeLeft</code> and <code class="highlighter-rouge">writeRight</code>.</li>
<li><strong>Index</strong>: Basically some offset to the base: In our case the <code class="highlighter-rouge">PopCount</code> result, in some form.<br />
The index has to be <em>added</em> (<code class="highlighter-rouge">+</code>) to the base register. The operation will always be an addition; of course nothing prevents us from adding a negative number…</li>
<li><strong>Scale</strong>: The <code class="highlighter-rouge">PopCount</code> result needs to be multiplied by 4, we’ll do it with the scale.
The scale is <em>limited</em> to be one of <code class="highlighter-rouge">1/2/4/8</code>, but <em>for us</em> this is not a limitation, since multiplication by <code class="highlighter-rouge">4</code> is exactly what we need.</li>
<li><strong>Displacement</strong>: Some other constant we can tack on to the address calculation. The displacement can be 8/32 bits and is also always used with an <em>addition</em> (<code class="highlighter-rouge">+</code>) operation.</li>
</ul>
<p>There’s a key point I need to stress here: while the mathematical operations performed by <code class="highlighter-rouge">LEA</code> are always addition, we can take advantage of how two’s-complement addition/subtraction works to effectively turn this so-called addition into a subtraction.</p>
<p>The actual code change is, for lack of better words, underwhelming, but without all this preamble it wouldn’t make a lot of sense. Here it is, in all its glory:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre> <span class="c1">// ...</span>
<span class="kt">var</span> <span class="n">popCount</span> <span class="p">=</span> <span class="p">-</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span> <span class="n">PopCnt</span><span class="p">.</span><span class="nf">PopCount</span><span class="p">(</span><span class="n">mask</span><span class="p">);</span>
<span class="n">writeRight</span> <span class="p">+=</span> <span class="n">popCount</span><span class="p">;</span>
<span class="n">writeLeft</span> <span class="p">+=</span> <span class="n">popCount</span> <span class="p">+</span> <span class="m">8</span><span class="p">;</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>“Surely, you must be joking, Mr. @damageboy!”, I can almost hear you think. But really, this is it. By casting to long and <em>pre-negating</em> the <code class="highlighter-rouge">PopCount</code> result (see that little minus sign?) and reverting back to simpler pointer advancement code, without all the pre-left-shifting pizzazz from the beginning of this post, we get this beautiful, packed assembly code automatically generated for us:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre><span class="nf">popcnt</span> <span class="nb">rdi</span><span class="p">,</span><span class="nb">rdi</span> <span class="c1">; F3480FB8FF</span>
<span class="nf">neg</span> <span class="nb">rdi</span> <span class="c1">; 48F7DF</span>
<span class="nf">lea</span> <span class="nb">rax</span><span class="p">,[</span><span class="nb">rax</span><span class="o">+</span><span class="nb">rdi</span><span class="o">*</span><span class="mi">4</span><span class="p">]</span> <span class="c1">; 488D04B8</span>
<span class="nf">lea</span> <span class="nv">r15</span><span class="p">,[</span><span class="nv">r15</span><span class="o">+</span><span class="nb">rdi</span><span class="o">*</span><span class="mi">4</span><span class="o">+</span><span class="mh">20h</span><span class="p">]</span> <span class="c1">; 4D8D7CBF20</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>The new version is taking <code class="highlighter-rouge">3 + 4 + 5</code> or <code class="highlighter-rouge">12</code> bytes in total, to complete both pointer updates. So it’s clearly denser. It is important to point out that this reduces the time taken by the CPU to fetch and decode these instructions. Internally, the CPU still has to perform the same calculations as before. I’ll refrain from digressing into the mechanics of x86’s frontend, backend, and all that jazz, as it is out of scope for this blog post, so let’s just be happy with what we have.</p>
<p>Before we forget, though, does it improve performance?</p>
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#3cd2c1d5-4b4b-4f73-9603-4b138aef5ef7'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-stats"></i> Scaling</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks</a></li>
<li><a href="#"><i class="glyphicon glyphicon-info-sign"></i> Setup</a></li>
</ul>
<ul id="3cd2c1d5-4b4b-4f73-9603-4b138aef5ef7" class="uk-switcher uk-margin">
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Performance scale: Array.Sort (solid gray) is always 100%, and the other methods are scaled relative to it" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
Packed,1,1,1,1,1,1
Jedi,1.013855422,0.938475624,1.00941461,0.992908734,0.955117129,0.96278825
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3 }
}
]
},
"options": {
"title": { "text": "AVX2 Jedi Sorting - Scaled to Packed", "display": true },
"scales": {
"yAxes": [{
"ticks": {
"fontFamily": "Indie Flower",
"min": 0.92,
"callback": "ticksPercent"
},
"scaleLabel": {
"labelString": "Scaling (%)",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
Packed,16.6008,21.8446,24.2283,25.616,25.3775,27.6628
Jedi,16.8279,20.501,24.4564,25.4344,24.2384,26.6334
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3 }
}
]
},
"options": {
"title": { "text": "AVX2 Jedi Sorting + Packed - log(Time/N)", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"min": 15,
"max": 28,
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt4_5_Int32_-report.datatable.json" data-id-field="name" data-pagination="true" data-page-list="[5, 10, 15, 20]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result it. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nv">BenchmarkDotNet</span><span class="o">=</span>v0.12.0, <span class="nv">OS</span><span class="o">=</span>clear-linux-os 32120
Intel Core i7-7700HQ CPU 2.80GHz <span class="o">(</span>Kaby Lake<span class="o">)</span>, 1 CPU, 4 logical and 4 physical cores
.NET Core <span class="nv">SDK</span><span class="o">=</span>3.1.100
<span class="o">[</span>Host] : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
Job-DEARTS : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
<span class="nv">InvocationCount</span><span class="o">=</span>3 <span class="nv">IterationCount</span><span class="o">=</span>15 <span class="nv">LaunchCount</span><span class="o">=</span>2
<span class="nv">UnrollFactor</span><span class="o">=</span>1 <span class="nv">WarmupCount</span><span class="o">=</span>10
<span class="nv">$ </span><span class="nb">grep</span> <span class="s1">'stepping\|model\|microcode'</span> /proc/cpuinfo | <span class="nb">head</span> <span class="nt">-4</span>
model : 158
model name : Intel<span class="o">(</span>R<span class="o">)</span> Core<span class="o">(</span>TM<span class="o">)</span> i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>All in all, this might not look like much, but it is real: another small, unevenly spread 3-4% improvement across the sorting spectrum, if you disregard the weirdness around 10K elements. I do realize it may not look super impressive, but here’s a spoiler: a few blog posts down the road, we’ll get to unroll our loops, you know, that place where all optimization efforts end up going. When we do get there, every byte we shave off this main loop body will pay in spades. In other words, while some of the optimizations may appear minor, I have a different metric in mind when it comes to improving the loop body, even by a single percent, while we’re still not unrolling it. That’s one of those places where a little experience affords better foresight.</p>
</div>
<p>I have to come clean here: I’ve left some pennies on the floor. We could still go one step further and get rid of one more 3-byte instruction in the loop. Alas, I’ve made an executive decision not to do so in this blog post: for one, this post has already become quite long, and I doubt a substantial number of the people who started reading it are still with us, with a beating pulse. Moreover, this specific optimization would not really shine at this moment. As such, I’ll come back to it once we get to unrolling this loop.</p>
<h2 id="weve-come-a-long-way-baby">We’ve Come a Long Way, Baby!</h2>
<p>We’ve done quite a lot to optimize the vectorized partitioning so far. All these incremental improvements pile up when you multiply them on top of one another.</p>
<p>Don’t believe me? Here’s one last group of charts and data tables to show what distance we’ve travelled in a single blog post:</p>
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#95c61e0c-ed74-4a51-9393-6468adfa0452'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-stats"></i> Scaling</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks</a></li>
<li><a href="#"><i class="glyphicon glyphicon-info-sign"></i> Setup</a></li>
</ul>
<ul id="95c61e0c-ed74-4a51-9393-6468adfa0452" class="uk-switcher uk-margin">
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Performance scale: Array.Sort (solid gray) is always 100%, and the other methods are scaled relative to it" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
Naive,1,1,1,1,1,1
Jedi,0.70862069,0.717993202,0.795472874,0.783355194,0.824350492,0.82130157
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(218,165,32,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3 }
}
]
},
"options": {
"title": { "text": "AVX2 end of Blog Pt. 4 - Scaled to end of Pt. 3", "display": true },
"scales": {
"yAxes": [{
"ticks": {
"fontFamily": "Indie Flower",
"min": 0.65,
"callback": "ticksPercent"
},
"scaleLabel": {
"labelString": "Scaling (%)",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
Naive,23.2032,28.2439,30.9998,32.4093,29.5396,32.2364
Jedi,16.4365,20.2787,24.6595,25.388,24.351,26.4758
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(218,165,32,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3 }
}
]
},
"options": {
"title": { "text": "AVX2 end of Pt. 4 + end of Pt. 3 - log(Time/N)", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"min": 15,
"max": 33,
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt4_6_Int32_-report.datatable.json" data-id-field="name" data-pagination="true" data-page-list="[5, 10, 15, 20]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result it. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nv">BenchmarkDotNet</span><span class="o">=</span>v0.12.0, <span class="nv">OS</span><span class="o">=</span>clear-linux-os 32120
Intel Core i7-7700HQ CPU 2.80GHz <span class="o">(</span>Kaby Lake<span class="o">)</span>, 1 CPU, 4 logical and 4 physical cores
.NET Core <span class="nv">SDK</span><span class="o">=</span>3.1.100
<span class="o">[</span>Host] : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
Job-DEARTS : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
<span class="nv">InvocationCount</span><span class="o">=</span>3 <span class="nv">IterationCount</span><span class="o">=</span>15 <span class="nv">LaunchCount</span><span class="o">=</span>2
<span class="nv">UnrollFactor</span><span class="o">=</span>1 <span class="nv">WarmupCount</span><span class="o">=</span>10
<span class="nv">$ </span><span class="nb">grep</span> <span class="s1">'stepping\|model\|microcode'</span> /proc/cpuinfo | <span class="nb">head</span> <span class="nt">-4</span>
model : 158
model name : Intel<span class="o">(</span>R<span class="o">)</span> Core<span class="o">(</span>TM<span class="o">)</span> i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>We can see that we’ve managed to trim a lot of excess fat off this little monster of ours. It’s shaping up to be one mean sorting machine, for sure. Comparing to where we were at the end of the previous blog post:</p>
<ul>
<li>We have a more pronounced effect for these optimizations in the lower end of the spectrum, cutting down an additional 30% of the runtime for anything below <code class="highlighter-rouge">1000</code> elements.</li>
<li>Above <code class="highlighter-rouge">1000</code> elements, we’ve “only” succeeded in reducing the runtime by 20%. Then again, it’s 20% off of tens and hundreds of milliseconds of total runtime, which is nothing to snicker at.</li>
</ul>
<p>Next up, we’ll have to take on what is a non-trivial problem of dealing with memory alignment, in the scope of a complicated partitioning algorithm like QuickSort.</p>
</div>
<h1><a href="https://bits.houmus.org/2020-01-30/this-goes-to-eleven-pt3">This Goes to Eleven (Part. 3/∞)</a> (2020-01-30)</h1>
<p>Since there’s a lot to go over here, I’ve split it up into a few parts:</p>
<ol>
<li>In <a href="/2020-01-28/this-goes-to-eleven-pt1">part 1</a>, we start with a refresher on <code class="highlighter-rouge">QuickSort</code> and how it compares to <code class="highlighter-rouge">Array.Sort()</code>.</li>
<li>In <a href="/2020-01-29/this-goes-to-eleven-pt2">part 2</a>, we go over the basics of vectorized hardware intrinsics, vector types, and go over a handful of vectorized instructions we’ll use in part 3. We still won’t be sorting anything.</li>
<li>In this part, we go through the initial code for the vectorized sorting, and start seeing some payoff. We finish agonizing courtesy of the CPU’s branch predictor, throwing a wrench into our attempts.</li>
<li>In part 4, we go over a handful of optimization approaches that I attempted trying to get the vectorized partitioning to run faster. We’ll see what worked and what didn’t.</li>
<li>In part 5, we’ll see how we can almost get rid of all the remaining scalar code- by implementing small-constant size array sorting. We’ll use, drum roll…, yet more AVX2 vectorization.</li>
<li>Finally, in part 6, I’ll list the outstanding stuff/ideas I have for getting more juice and functionality out of my vectorized code.</li>
</ol>
<h2 id="unstable-vectorized-partitioning--quicksort">Unstable Vectorized Partitioning + QuickSort</h2>
<p>It’s time we mash all the new knowledge we picked up in the last posts about SIMD registers, instructions, and <code class="highlighter-rouge">QuickSort</code>ing into something useful. Here’s the plan:</p>
<ul>
<li>Vectorized in-place partitioning:
<ul>
<li>First, we learn to take 8-element blocks, or units of <code class="highlighter-rouge">Vector256<int></code>, and partition them with AVX2 intrinsics.</li>
<li>Then we take Berlin: We reuse our block to partition an entire array with a method I named double-pumping, suitable for processing large arrays in-place with this vectorized block.</li>
</ul>
</li>
<li>Once we’ve covered vectorized partitioning, we finish up with some innocent glue-code wrapping the whole thing to look like a proper <code class="highlighter-rouge">Array.Sort</code> replacement.</li>
</ul>
<p>Now that we’re finally doing our own thing, it’s time to address a baby elephant hiding in the room: stable vs. unstable sorting. I should probably explain: one possible way to categorize sorting algorithms is with respect to their stability: do they reorder <em>equal</em> values as they appear in the original input data or not? Stable sorting does not reorder, while unstable sorting provides no such guarantee.<br />
Stability <em>might</em> be critical for certain tasks, for example:</p>
<ul>
<li>When sorting an array of structs/classes according to a key embedded as a member, while providing a non-default <code class="highlighter-rouge">IComparer<T></code> or <code class="highlighter-rouge">Comparison<T></code>, we might care about preserving the order of the containing type.</li>
<li>Similarly, when sorting pairs of arrays: keys and values, reordering both arrays according to the sorted order of the keys, while preserving the ordering of values for equal keys.</li>
</ul>
<p>At the same time, stable sorting is a non-issue when:</p>
<ul>
<li>Sorting arrays of simple primitives; stability is meaningless:<br />
(what would a “stable sort” of the array <code class="highlighter-rouge">[7, 7, 7]</code> even mean?)</li>
<li>At other times, we <em>know</em> for a fact that our keys are unique. There is no unstable sorting for unique keys.</li>
<li>Lastly, sometimes, <em>we just don’t care</em>. We’re fine if our data gets reordered.</li>
</ul>
<p>In the .NET/C# world, one could say that the landscape regarding sorting is a little unstable (pun intended):</p>
<ul>
<li>
<p><a href="https://docs.microsoft.com/en-us/dotnet/api/system.array.sort?view=netcore-3.1"><code class="highlighter-rouge">Array.Sort</code></a> is unstable, as is clearly stated in the remarks section:</p>
<blockquote>
<p>This implementation performs an unstable sort; that is, if two elements are equal, their order might not be preserved.</p>
</blockquote>
</li>
<li>
<p>On the other hand, <a href="https://docs.microsoft.com/en-us/dotnet/api/system.linq.enumerable.orderby?view=netcore-3.1"><code class="highlighter-rouge">Enumerable.OrderBy</code></a> is stable:</p>
<blockquote>
<p>This method performs a stable sort; that is, if the keys of two elements are equal, the order of the elements is preserved.</p>
</blockquote>
</li>
</ul>
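<p>To make that difference concrete, here is a small demonstration of the two documented behaviors; the tuples and their values are made up purely for illustration:</p>

```csharp
using System;
using System.Linq;

static class StabilityDemo
{
    public static void Demo()
    {
        // Two entries share the key 7; a stable sort must keep "first" before "second".
        var items = new[] {
            (Key: 7, Value: "first"),
            (Key: 3, Value: "only"),
            (Key: 7, Value: "second"),
        };

        // Enumerable.OrderBy is documented as stable: equal keys keep their input order.
        var stable = items.OrderBy(t => t.Key).Select(t => t.Value).ToArray();
        Console.WriteLine(string.Join(",", stable));
        // only,first,second -- always

        // Array.Sort makes no such promise: "only" will end up first,
        // but the two Key == 7 entries may come back in either order.
        Array.Sort(items, (a, b) => a.Key.CompareTo(b.Key));
    }
}
```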
<p>In general, what I came up with in my full repo/NuGet package are algorithms capable of both stable and unstable sorting, with two caveats:</p>
<ul>
<li>Stable sorting is considerably slower than unstable sorting (But still faster than <code class="highlighter-rouge">Array.Sort</code>).</li>
<li>Stable sorting is less elegant/fun to explain.</li>
</ul>
<p>Given all this, and the fact that I am only presenting pure primitive sorting anyway, where there is no notion of stability to begin with, I will be describing my unstable sorting approach in this series. It doesn’t take a lot of imagination to get from here to the stable variant, but I’m not going to address it in these posts. It is also important to note that, in general, when there is doubt about whether stability is a requirement (e.g., for key/value, <code class="highlighter-rouge">IComparer&lt;T&gt;</code>/<code class="highlighter-rouge">Comparison&lt;T&gt;</code>, or non-primitive sorting), we should err on the side of safety and go for stable sorting.</p>
<h3 id="avx2-partitioning-block">AVX2 Partitioning Block</h3>
<p>Let’s start with this “simple” block, describing what we do with moving pictures.</p>
<table style="margin-bottom: 0em" class="notice--info">
<tr>
<td style="border: none; padding-top: 0; padding-bottom: 0; vertical-align: top"><span class="uk-label">Hint</span></td>
<td style="border: none; padding-top: 0; padding-bottom: 0">From here on, the following icon means I have a thingy that animates:
<object style="margin: auto; vertical-align: middle;" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/play.svg"></object><br />
Click/Touch/Hover <b>inside</b> means: <i class="glyphicon glyphicon-play"></i><br />
Click/Touch/Hover <b>outside</b> means: <i class="glyphicon glyphicon-pause"></i>
</td>
</tr>
</table>
<object class="animated-border" width="100%" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/block-unified-with-hint.svg"></object>
<p>Here is the same block, in more traditional code form:</p>
<div>
<div class="stickemup">
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
</pre></td><td class="rouge-code"><pre><span class="kt">var</span> <span class="n">P</span> <span class="p">=</span> <span class="n">Vector256</span><span class="p">.</span><span class="nf">Create</span><span class="p">(</span><span class="n">pivot</span><span class="p">);</span> <span class="c1">// Outside any loop, top-level in the function</span>
<span class="p">...</span>
<span class="p">[</span><span class="nf">MethodImpl</span><span class="p">(</span><span class="n">MethodImplOptions</span><span class="p">.</span><span class="n">AggressiveInlining</span><span class="p">)]</span>
<span class="k">static</span> <span class="k">unsafe</span> <span class="k">void</span> <span class="nf">PartitionBlock</span><span class="p">(</span><span class="kt">int</span> <span class="p">*</span><span class="n">dataPtr</span><span class="p">,</span> <span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="n">P</span><span class="p">,</span>
<span class="k">ref</span> <span class="kt">int</span><span class="p">*</span> <span class="n">writeLeft</span><span class="p">,</span> <span class="k">ref</span> <span class="kt">int</span><span class="p">*</span> <span class="n">writeRight</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">data</span> <span class="p">=</span> <span class="n">Avx2</span><span class="p">.</span><span class="nf">LoadDquVector256</span><span class="p">(</span><span class="n">dataPtr</span><span class="p">);</span>
<span class="kt">var</span> <span class="n">mask</span> <span class="p">=</span> <span class="p">(</span><span class="kt">uint</span><span class="p">)</span> <span class="n">Avx</span><span class="p">.</span><span class="nf">MoveMask</span><span class="p">(</span>
<span class="n">Avx2</span><span class="p">.</span><span class="nf">CompareGreaterThan</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">P</span><span class="p">).</span><span class="nf">AsSingle</span><span class="p">());</span>
<span class="n">data</span> <span class="p">=</span> <span class="n">Avx2</span><span class="p">.</span><span class="nf">PermuteVar8x32</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>
        <span class="n">Avx2</span><span class="p">.</span><span class="nf">LoadDquVector256</span><span class="p">(</span><span class="n">PermTablePtr</span> <span class="p">+</span> <span class="n">mask</span> <span class="p">*</span> <span class="m">8</span><span class="p">));</span>
<span class="n">Avx</span><span class="p">.</span><span class="nf">Store</span><span class="p">(</span><span class="n">writeLeft</span><span class="p">,</span> <span class="n">data</span><span class="p">);</span>
<span class="n">Avx</span><span class="p">.</span><span class="nf">Store</span><span class="p">(</span><span class="n">writeRight</span><span class="p">,</span> <span class="n">data</span><span class="p">);</span>
    <span class="kt">var</span> <span class="n">popCount</span> <span class="p">=</span> <span class="n">Popcnt</span><span class="p">.</span><span class="nf">PopCount</span><span class="p">(</span><span class="n">mask</span><span class="p">);</span>
    <span class="n">writeRight</span> <span class="p">-=</span> <span class="n">popCount</span><span class="p">;</span>
    <span class="n">writeLeft</span> <span class="p">+=</span> <span class="m">8</span> <span class="p">-</span> <span class="n">popCount</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
</div>
<p>There’s a lot of cheese here; let’s break this down:</p>
<div class="divTable">
<div class="divTableBody">
<div class="divTableRow">
<div class="divTableCell"><span class="uk-label">L1</span></div>
<div class="divTableCell">
<p>Broadcast the pivot value to a vector I’ve named <code class="highlighter-rouge">P</code>. We’re merely creating 8-copies of the selected pivot value in a SIMD register.<br />
Technically, this isn’t really part of the block, as this happens only <em>once</em> per partitioning function call! It’s included here for context.</p>
</div>
</div>
<div class="divTableRow">
<div class="divTableCell"><span class="uk-label">L3-5</span></div>
<div class="divTableCell">
<p>We wrap our block in a static function. We aggressively inline it in strategic places throughout the rest of the code.<br />
This may look like an odd signature, but think of its purpose: We avoid copy-pasting code while also avoiding any performance penalty.</p>
</div>
</div>
<div class="divTableRow">
<div class="divTableCell"><span class="uk-label">L6</span></div>
<div class="divTableCell">
          <p>Load up data from somewhere in our array. <code class="highlighter-rouge">dataPtr</code> points to some unpartitioned data. <code class="highlighter-rouge">data</code> will be loaded with the data we intend to partition, and that’s the important bit.</p>
</div>
</div>
<div class="divTableRow">
<div class="divTableCell"><span class="uk-label">L7-8</span></div>
<div class="divTableCell">
<p>Perform an 8-way comparison using <code class="highlighter-rouge">CompareGreaterThan</code>, then proceed to convert/compress the 256-bit result into an 8-bit value using the <code class="highlighter-rouge">MoveMask</code> intrinsic.<br />
The goal here is to generate a <strong>scalar</strong> <code class="highlighter-rouge">mask</code> value, that contains a single <code class="highlighter-rouge">1</code> bit for every comparison where the corresponding data element was <em>greater-than</em> the pivot value and <code class="highlighter-rouge">0</code> bits for all others. If you are having a hard time following <em>why</em> this does this, you need to head back to the <a href="/2020-01-29/this-goes-to-eleven-pt2">2<sup>nd</sup> post</a> and read up on these two intrinsics/watch their animations.</p>
</div>
</div>
<div class="divTableRow">
<div class="divTableCell"><span class="uk-label">L9-10</span></div>
<div class="divTableCell">
<p>Permute the loaded data according to a permutation vector; A-ha! A twist in the plot!<br />
<code class="highlighter-rouge">mask</code> contains 8 bits, from LSB to MSB, describing where each element belongs (left/right). We could, of course, loop over those bits and perform 8 branches to determine which side each element belongs to, but that would be a terrible mistake. Instead, we’re going to use the <code class="highlighter-rouge">mask</code> as an <em>index</em> into a lookup-table for permutation values!<br />
This is one of the reasons it was critical to use <code class="highlighter-rouge">MoveMask</code> in the first place. Without it, we would not have a scalar value we could use as an index to our table. Pretty neat, no?<br />
With the permutation operation done, we’ve grouped all the <em>smaller-than-or-equal</em> values on one side of our <code class="highlighter-rouge">data</code> vector (the “left” side) and all the <em>greater-than</em> values on the other side (the “right” side).<br />
I’ve conveniently glossed over the actual values in the permutation lookup-table that <code class="highlighter-rouge">PermTablePtr</code> points to; I’ll address this a couple of paragraphs below.</p>
</div>
</div>
</div>
</div>
<p>Partitioning is now practically complete: That is, our <code class="highlighter-rouge">data</code> vector is neatly partitioned. Except that the data is still “stuck” inside our vector. We need to write its contents back to memory. Here comes a small complication: our <code class="highlighter-rouge">data</code> vector now contains values belonging <em>both</em> to the left and right sides of the original array. We did separate them <strong>within</strong> the vector, but we’re not done until each side is written back to memory, on both ends of our array.</p>
<div class="divTable">
<div class="divTableBody">
<div class="divTableRow">
<div class="divTableCell"><span class="uk-label">L11-12</span></div>
<div class="divTableCell">
<p>Store the permuted vector to both sides of the array. There is no cheap way to write <em>portions</em> of a vector to each respective end, so we write the <strong>entire</strong> partitioned vector to both the <em>left</em> <strong>and</strong> <em>right</em> sides of the array.<br />
At any given moment, we have two write pointers pointing to where we need to write to <strong>next</strong> on either side: <code class="highlighter-rouge">writeLeft</code> and <code class="highlighter-rouge">writeRight</code>. How those are initialized and maintained will be dealt with further down where we start calling this block, but for now, let’s assume these pointers initially point to somewhere where it is <strong>safe</strong> to write <em>at least</em> an entire <code class="highlighter-rouge">Vector256<T></code> and move on.</p>
</div>
</div>
<div class="divTableRow">
<div class="divTableCell"><span class="uk-label">L13-15</span></div>
<div class="divTableCell">
          <p>Book-keeping time: We just wrote 8 elements to each side, and each side had a trail of unwanted data tacked to it. We didn’t care for it while we were writing it, because we knew we’re about to update the same write pointers in such a way that the <em>next</em> write operations will <strong>overwrite</strong> the trailing/unwanted data that doesn’t belong to each respective side!<br />
The vector gods are smiling at us: We have the <code class="highlighter-rouge">PopCount</code> intrinsic to lend us a hand here. We issue <code class="highlighter-rouge">PopCount</code> on the same <code class="highlighter-rouge">mask</code> variable (again, <code class="highlighter-rouge">MoveMask</code> was worth its weight in gold here) and get a count of how many bits in <code class="highlighter-rouge">mask</code> were <code class="highlighter-rouge">1</code>. This accounts for how many values <strong>inside</strong> the vector were <em>greater-than</em> the pivot value and belong to the right side.<br />
This “happens” to be the amount by which we want to <em>decrease</em> the <code class="highlighter-rouge">writeRight</code> pointer (<code class="highlighter-rouge">writeRight</code> is “advanced” by decrementing it; this may seem weird for now, but will become clearer when we discuss the outer loop!)<br />
Finally, we adjust the <code class="highlighter-rouge">writeLeft</code> pointer: <code class="highlighter-rouge">popCount</code> contains the number of <code class="highlighter-rouge">1</code> bits; the number of <code class="highlighter-rouge">0</code> bits is by definition, <code class="highlighter-rouge">8 - popCount</code> since <code class="highlighter-rouge">mask</code> had 8 bits of content in it, to begin with. This accounts for how many values in the register were <em>less-than-or-equal</em> the pivot value and grouped on the left side of the register.</p>
</div>
</div>
</div>
</div>
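<p>If the vectorized steps are still a bit dizzying, here is a scalar model of the very same block: compare → mask → permute → double store → pointer bookkeeping. This is for building intuition only; the names are mine, it uses indices instead of pointers, and it emulates the permutation directly rather than through the lookup table (so the order <em>within</em> each group may differ from the real table’s, which, as noted, doesn’t matter):</p>

```csharp
using System;
using System.Numerics;

static class BlockModel
{
    // Scalar stand-in for the 8-element vectorized partition block.
    public static void PartitionBlockScalar(int[] array, int readOffset, int pivot,
                                            ref int writeLeft, ref int writeRight)
    {
        // CompareGreaterThan + MoveMask: one bit per element, 1 == "> pivot".
        uint mask = 0;
        for (var i = 0; i < 8; i++)
            if (array[readOffset + i] > pivot)
                mask |= 1u << i;

        // PermuteVar8x32 (via the lookup table, in the real code): smaller-or-equal
        // values end up on the left of the vector, greater-than values on the right.
        var permuted = new int[8];
        int lo = 0, hi = 7;
        for (var i = 0; i < 8; i++)
            if ((mask & (1u << i)) == 0) permuted[lo++] = array[readOffset + i];
            else                         permuted[hi--] = array[readOffset + i];

        // Two full 8-element stores, one at each end; each side gets a trail of
        // unwanted data that the *next* store to that side will overwrite.
        Array.Copy(permuted, 0, array, writeLeft, 8);
        Array.Copy(permuted, 0, array, writeRight, 8);

        // PopCount bookkeeping: advance each write pointer by the number of
        // elements that actually belong to its side.
        var popCount = BitOperations.PopCount(mask);
        writeRight -= popCount;
        writeLeft  += 8 - popCount;
    }
}
```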
<p>This was a full 8-element wise partitioning block, and it’s worth noting a thing or two about it:</p>
<ul>
<li>It is completely branch-less(!): We’ve given the CPU a nice juicy block with no need to speculate on what code gets executed next. It sure looks pretty when you consider the number of branches our scalar code would execute for the same amount of work. Don’t pop a champagne bottle quite yet though, we’re about to run into a wall full of thorny branches in a second, but sure feels good for now.</li>
  <li>If we want to execute multiple copies of this block, the main dependency from one block to the next is the mutation of the <code class="highlighter-rouge">writeLeft</code> and <code class="highlighter-rouge">writeRight</code> pointers. It’s unavoidable given we set out to perform in-place sorting (well, I couldn’t avoid it, maybe you can!), but worth mentioning nonetheless. If you need a reminder about how these data-dependencies can change the dynamics of efficient execution, you can read up on when I tried my best to go at it battling with <a href="/2018-08-20/netcoreapp3.0-intrinsics-in-real-life-pt3"><code class="highlighter-rouge">PopCount</code> to run screaming fast</a>; If nothing else, you’ll get a clearer understanding of how the CPU extracts data-flows from our code.</li>
</ul>
<p>I thought it would be nice to wrap up the discussion of this block by showing off that the JIT is relatively well-behaved in this case with the generated x64 asm:<br />
Anyone who has followed the C# code can use the intrinsics table from the previous post and read the assembly code without further help. Also, it becomes clear how this is a 1:1 translation of C# code. Congratulations: It’s 2020, and we’re x86 assembly programmers again!</p>
</div>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nf">vmovd</span> <span class="nv">xmm1</span><span class="p">,</span><span class="nb">r15d</span> <span class="c1">; Broadcast</span>
<span class="nf">vbroadcastd</span> <span class="nv">ymm1</span><span class="p">,</span><span class="nv">xmm1</span> <span class="c1">; pivot</span>
<span class="nf">...</span>
<span class="nf">vlddqu</span> <span class="nv">ymm0</span><span class="p">,</span> <span class="nv">ymmword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rax</span><span class="p">]</span> <span class="c1">; load 8 elements</span>
<span class="nf">vpcmpgtd</span> <span class="nv">ymm2</span><span class="p">,</span> <span class="nv">ymm0</span><span class="p">,</span> <span class="nv">ymm1</span> <span class="c1">; compare</span>
<span class="nf">vmovmskps</span> <span class="nb">ecx</span><span class="p">,</span> <span class="nv">ymm2</span> <span class="c1">; movemask into scalar reg</span>
<span class="nf">mov</span> <span class="nb">r9d</span><span class="p">,</span> <span class="nb">ecx</span> <span class="c1">; copy to r9</span>
<span class="nf">shl</span> <span class="nb">r9d</span><span class="p">,</span> <span class="mh">0x3</span> <span class="c1">; *= 8</span>
<span class="nf">vlddqu</span>      <span class="nv">ymm2</span><span class="p">,</span> <span class="nv">ymmword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rdx</span><span class="o">+</span><span class="nb">r9</span><span class="o">*</span><span class="mi">4</span><span class="p">]</span> <span class="c1">; load permutation</span>
<span class="nf">vpermd</span> <span class="nv">ymm0</span><span class="p">,</span> <span class="nv">ymm2</span><span class="p">,</span> <span class="nv">ymm0</span> <span class="c1">; permute</span>
<span class="nf">vmovdqu</span> <span class="nv">ymmword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nv">r12</span><span class="p">],</span> <span class="nv">ymm0</span> <span class="c1">; store left</span>
<span class="nf">vmovdqu</span> <span class="nv">ymmword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nv">r8</span><span class="p">],</span> <span class="nv">ymm0</span> <span class="c1">; store right</span>
<span class="nf">popcnt</span> <span class="nb">ecx</span><span class="p">,</span> <span class="nb">ecx</span> <span class="c1">; popcnt</span>
<span class="nf">...</span> <span class="c1">; update writeLeft/writeRight pointers</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<h2 id="permutation-lookup-table">Permutation lookup table</h2>
<p>If you made it this far, you are owed an explanation of the permutation lookup table. Let’s see what’s in it:</p>
<ul>
<li>The table needs to have 2<sup>8</sup> elements for all possible mask values.</li>
<li>Each element ultimately needs to be a <code class="highlighter-rouge">Vector256<int></code> because that’s what the permutation intrinsic expects from us, so 8 x 4 bytes = 32 bytes per element.
<ul>
      <li>That’s a whopping 8 KB of lookup data in total (!).</li>
</ul>
</li>
<li>The values inside are <a href="https://github.com/damageboy/VxSort/blob/research/TestBlog/PermutationTableTests.cs#L20">pre-generated</a> so that they would reorder the data <em>inside</em> a <code class="highlighter-rouge">Vector256<int></code> according to our wishes: all values that got a corresponding <code class="highlighter-rouge">1</code> bit in the mask go to one side (right side), and the elements with a <code class="highlighter-rouge">0</code> go to the other side (left side). There’s no particular required order amongst the grouped elements since we’re merely partitioning around a pivot value, nothing more, nothing less.</li>
</ul>
<p>Here are 4 sample values from the generated permutation table that I’ve copy-pasted so we can get a feel for it:</p>
<div>
<div class="stickemup">
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre><span class="k">static</span> <span class="n">ReadOnlySpan</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="n">PermTable</span> <span class="p">=></span> <span class="k">new</span><span class="p">[]</span> <span class="p">{</span>
<span class="m">0</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="m">3</span><span class="p">,</span> <span class="m">4</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="m">6</span><span class="p">,</span> <span class="m">7</span><span class="p">,</span> <span class="c1">// 0 => 0b00000000</span>
<span class="c1">// ...</span>
<span class="m">3</span><span class="p">,</span> <span class="m">4</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="m">6</span><span class="p">,</span> <span class="m">7</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="c1">// 7 => 0b00000111</span>
<span class="c1">// ...</span>
<span class="m">0</span><span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="m">4</span><span class="p">,</span> <span class="m">6</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">3</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="m">7</span><span class="p">,</span> <span class="c1">// 170 => 0b10101010</span>
<span class="c1">// ...</span>
<span class="m">0</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="m">3</span><span class="p">,</span> <span class="m">4</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="m">6</span><span class="p">,</span> <span class="m">7</span><span class="p">,</span> <span class="c1">// 255 => 0b11111111</span>
<span class="p">};</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
</div>
<ul>
<li>For <code class="highlighter-rouge">mask</code> values 0, 255 the entries are trivial: All <code class="highlighter-rouge">mask</code> bits were either <code class="highlighter-rouge">1</code> or <code class="highlighter-rouge">0</code> so there’s nothing we need to do with the data, we just leave it as is, the “null” permutation vector: <code class="highlighter-rouge">[0, 1, 2, 3, 4, 5, 6, 7]</code> achieves just that.</li>
<li>When <code class="highlighter-rouge">mask</code> is <code class="highlighter-rouge">0b00000111</code> (decimal 7), the 3 lowest bits of the <code class="highlighter-rouge">mask</code> are <code class="highlighter-rouge">1</code>, they represent elements that need to go to the right side of the vector (e.g., elements that were <code class="highlighter-rouge">> pivot</code>), while all other values need to go to the left (<code class="highlighter-rouge"><= pivot</code>). The permutation vector: <code class="highlighter-rouge">[3, 4, 5, 6, 7, 0, 1, 2]</code> does just that.</li>
<li>The checkered bit pattern for the <code class="highlighter-rouge">mask</code> value <code class="highlighter-rouge">0b10101010</code> (decimal 170) calls to move all the even elements to one side and the odd elements to the other… You can see that <code class="highlighter-rouge">[0, 2, 4, 6, 1, 3, 5, 7]</code> does the work here.</li>
</ul>
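<p>For reference, entries like the samples above can be reproduced with a few lines of scalar code. This is only a sketch of the generation logic, not the repo’s actual pre-generation code, though it happens to keep the same relative ordering as the sample entries shown:</p>

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class PermTableGen
{
    // Generate one 8-entry permutation vector for a given 8-bit mask:
    // indices whose mask bit is 0 (<= pivot) come first, in order,
    // followed by indices whose mask bit is 1 (> pivot), in order.
    public static int[] GenerateEntry(byte mask)
    {
        var left  = new List<int>();
        var right = new List<int>();
        for (var i = 0; i < 8; i++)
            if ((mask & (1 << i)) == 0)
                left.Add(i);
            else
                right.Add(i);
        return left.Concat(right).ToArray();
    }

    // The full 256-entry table (8 KB when stored as Vector256<int> data):
    public static int[][] GenerateTable() =>
        Enumerable.Range(0, 256).Select(m => GenerateEntry((byte) m)).ToArray();
}
```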
<table style="margin-bottom: 0em" class="notice--warning">
<tr>
<td style="border: none; padding-top: 0; padding-bottom: 0; vertical-align: top"><span class="uk-label uk-label-warning">Note</span></td>
<td style="border: none; padding-top: 0; padding-bottom: 0"><div>
<p>The permutation table signature provided here is technically a lie: The <a href="https://github.com/damageboy/VxSort/blob/research/VxSortResearch/PermutationTables/Int32PermTables.cs#L12">actual code</a> uses <code class="highlighter-rouge">ReadOnlySpan<byte></code> as the table’s type, with the <code class="highlighter-rouge">int</code> values encoded as individual bytes in little-endian encoding. This is a C# 7.3 specific optimization where we get to treat the address of this table as a constant at JIT time. Kevin Jones (<a href="https://twitter.com/vcsjones">@vcsjones</a>) did a wonderful job of <a href="https://vcsjones.dev/2019/02/01/csharp-readonly-span-bytes-static/">digging into it</a>.<br />
We <strong>must</strong> use a <code class="highlighter-rouge">ReadOnlySpan<byte></code> for the optimization to trigger: Not reading <em>that</em> fine-print cost me two nights of my life chasing what I was <em>sure</em> had to be a GC/JIT bug. Normally, it would be a <strong>bad</strong> idea to store a <code class="highlighter-rouge">ReadOnlySpan<int></code> as a <code class="highlighter-rouge">ReadOnlySpan<byte></code>: we are forced to choose between little/big-endian encoding <em>at compile-time</em>. This runs up against the fact that in C# we compile once and debug (and occasionally run :) everywhere. Therefore, we have to <em>assume</em> our binaries might run on both little/big-endian machines where the CPU might not match the encoding we chose.<br />
<strong>In this case</strong>, praise the vector deities, blessed be their name and all that they touch, this is a <em>non-issue</em>: The entire premise is <strong>x86</strong> specific. This means that this code will <strong>never</strong> run on a big-endian machine. We can simply assume little endianness here till the end of all times.</p>
</div>
</td>
</tr>
</table>
</div>
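<p>The shape of that <code class="highlighter-rouge">ReadOnlySpan&lt;byte&gt;</code> trick, stripped down to a minimal illustration (the bytes here are just the first, identity entry of the table, hand-encoded as little-endian <code class="highlighter-rouge">int</code>s; the real 8 KB table is generated, not typed out):</p>

```csharp
using System;
using System.Runtime.InteropServices;

static class SpanTrick
{
    // A ReadOnlySpan<byte> property over a constant array initializer compiles
    // into a pointer straight into the assembly's static data: no allocation,
    // no copying, and the JIT gets to treat the address as a constant.
    static ReadOnlySpan<byte> PermTableBytes => new byte[] {
        0, 0, 0, 0,   1, 0, 0, 0,   2, 0, 0, 0,   3, 0, 0, 0, // 0, 1, 2, 3
        4, 0, 0, 0,   5, 0, 0, 0,   6, 0, 0, 0,   7, 0, 0, 0, // 4, 5, 6, 7
        // ... 255 more 32-byte entries in the real table
    };

    // Reinterpret the little-endian bytes as one 8-int permutation entry;
    // safe here only because this code path is x86-specific, as noted above.
    public static ReadOnlySpan<int> Entry(int mask) =>
        MemoryMarshal.Cast<byte, int>(PermTableBytes).Slice(mask * 8, 8);
}
```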
<p>We’ve covered the basic layout of the permutation table. We’ll go back to it once we start optimization efforts in earnest on the 4<sup>th</sup> post, but for now, we can move on to the loop surrounding our vectorized partition block.</p>
<h2 id="double-pumped-loop">Double Pumped Loop</h2>
<p>Armed with a vectorized partitioning block, it’s time to hammer our unsorted array with it, but there’s a wrinkle: In-place sorting. This brings a new challenge to the table: If you followed the previous section carefully, you might have noticed it already. For every <code class="highlighter-rouge">Vector256<int></code> we read, we ended up writing that same vector twice to both ends of the array. You don’t have to be a math wizard to figure out that if we end up writing 16 elements for every 8 we read, that doesn’t sound very in-placy, to begin with. Moreover, this extra writing would have to overwrite data that we have <em>not read yet</em>.<br />
Initially, it would seem, we’ve managed to position ourselves between a rock and a hard place.</p>
<p>But all is not lost! In reality, we immediately adjust the next write positions on both sides in such a way that their <strong>sum</strong> advances by 8. In other words, we are at risk of overwriting unread data only temporarily while we store the data back. I ended up adopting a tricky approach: We will need to continuously make sure we have at least 8 elements (the size of our block) of free space on <em>both</em> sides of the array so we could, in turn, perform a full, efficient 8-element write to both ends without overwriting a single bit of data we haven’t read yet.</p>
<p>Here’s a visual representation of the mental model I was in while debugging/making this work (I’ll note I had the same facial expressions as this poor Charmander while writing and debugging that code):</p>
<video controls="" playsinline="" loop="" preload="auto" width="100%">
<source src="../talks/intrinsics-sorting-2019/fire.webm" type="video/webm" />
<source src="../talks/intrinsics-sorting-2019/fire.mp4" type="video/mp4" />
<img src="../talks/intrinsics-sorting-2019/fire.gif " alt="" />
</video>
<p><br /></p>
<p>Funny, right? It’s closer to what I actually do than I’d like to admit! I fondly named this approach in my code as “double-pumped partitioning”. It pumps values into/out of <strong>both</strong> ends of the array at all times. I’ve left it pretty much intact in the repo under the name <a href="https://github.com/damageboy/VxSort/blob/research/VxSortResearch/Unstable/AVX2/Happy/00_DoublePumpNaive.cs"><code class="highlighter-rouge">DoublePumpNaive</code></a>, in case you want to dig through the full code. Like all good things in life, it comes in 3 parts:</p>
<ul>
<li>Prime the pump (make some initial room inside the array).</li>
<li>Loop over the data in 8-element chunks executing our vectorized code block.</li>
<li>Finally, go over the last remaining data elements (e.g. the last remaining <code class="highlighter-rouge">< 8</code> block of unpartitioned data) and partition them using scalar code. This is a very common and unfortunate pattern we find in vectorized code, as we need to finish off with just a bit of scalar work.</li>
</ul>
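<p>Those three parts can be modeled with safe, scalar C#: a 4-element “block” instead of a <code class="highlighter-rouge">Vector256&lt;int&gt;</code>, plain indices instead of pointers, and lists standing in for the temporary stack space. All names here are mine; this is for building intuition, not the actual implementation:</p>

```csharp
using System;
using System.Collections.Generic;

static class DoublePumpModel
{
    const int N = 4; // block width; 8 (one Vector256<int>) in the real code

    // In-place partition of 'a' around 'pivot'; returns the boundary index
    // (the count of elements <= pivot).
    public static int Partition(int[] a, int pivot)
    {
        // Models the temporary space we "prime the pump" into.
        var tmpLeft  = new List<int>();
        var tmpRight = new List<int>();
        void Stash(int v) { if (v <= pivot) tmpLeft.Add(v); else tmpRight.Add(v); }

        int readLeft  = 0, readRight  = a.Length;
        int writeLeft = 0, writeRight = a.Length;

        // Part 1: make initial room by partitioning one block per side into tmp.
        for (var i = 0; i < N && readLeft < readRight; i++) Stash(a[readLeft++]);
        for (var i = 0; i < N && readLeft < readRight; i++) Stash(a[--readRight]);

        // Part 2: main loop. Always read from the side with less free ("gray")
        // room, so neither side's writes can ever overtake its unread data.
        var block = new int[N];
        while (readRight - readLeft >= N) {
            int from;
            if (readLeft - writeLeft <= writeRight - readRight) {
                from = readLeft; readLeft += N;   // left has less room: read left
            } else {
                readRight -= N; from = readRight; // right has less room: read right
            }
            // Read the whole block first (the vector load), *then* store:
            Array.Copy(a, from, block, 0, N);
            foreach (var v in block) {
                if (v <= pivot) a[writeLeft++]  = v;
                else            a[--writeRight] = v;
            }
        }

        // Part 3: the last < N leftovers also go through the temporary space...
        while (readLeft < readRight) Stash(a[readLeft++]);

        // ...and everything stashed is copied back into the remaining gap.
        foreach (var v in tmpLeft)  a[writeLeft++]  = v;
        foreach (var v in tmpRight) a[--writeRight] = v;
        return writeLeft; // == writeRight: the partition boundary
    }
}
```

<p>Note how the side-selection comparison (<code class="highlighter-rouge">readLeft - writeLeft &lt;= writeRight - readRight</code>) is exactly the “smaller gray area” rule from the animation below.</p>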
<p>Let’s start with another visual aid I ended up doing to better explain this; note the different color codes and legend I’ve provided here, and try to watch a few loops noticing the various color transitions, this will become useful as you parse the text and code below:</p>
<div>
<div class="stickemup">
<object class="animated-border" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/double-pumped-loop-with-hint.svg"></object>
</div>
<object style="margin-top: 2em" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/double-pumped-loop-legend.svg"></object>
<ul>
<li>Each rectangle is 8-elements wide.
<ul>
      <li>Except for the middle one, which represents the last group of up to 8 elements that need to be partitioned. In vectorized parlance, this is often called the “remainder problem”.</li>
</ul>
</li>
<li>We want to partition the entire array, in-place, or turn it from <span style="padding: 1px; border: 1px solid black; border-radius: 2px; background-color: #db9d00ff">orange</span> into the green/red colors:
<ul>
<li><span style="padding: 1px; border: 1px solid black; border-radius: 2px; background-color: #bbe33d">Green</span>: for smaller-than-or-equal to the pivot values, on the left side.</li>
      <li><span style="padding: 1px; border: 1px solid black; border-radius: 2px; background-color: #c9211e; color: white">Red</span>: for greater-than the pivot values, on the right side.</li>
</ul>
</li>
<li>Initially we “prime the pump”, or make some room inside the array, by partitioning into some temporary memory, marked as the 3x8-element blocks in <span style="padding: 1px; border: 1px solid black; border-radius: 2px; background-color: #f67eec">purple</span>:
<ul>
<li>We allocate this temporary space somewhere on the stack; We’ll discuss why this isn’t really a big deal below.</li>
<li>We read one vector’s worth of elements from the left and execute our partitioning block into the temporary space.</li>
<li>We repeat the process for the right side.</li>
<li>At this stage, one vector on each edge has already been partitioned, and their color is now <span style="padding: 1px; border: 1px solid black; border-radius: 2px; background-color:#b2b2b2ff">gray</span>, which represents data/area within our array we can freely <em>write</em> into.</li>
</ul>
</li>
<li>From here-on, we’re in the main loop: this could go on for millions of iterations, even though in this animation we only see 4 iterations in total:
<ul>
<li>In every round, we <em>choose</em> where we read from next: From the left <em>-or-</em> right side of the <span style="padding: 1px; border: 1px solid black; border-radius: 2px; background-color: #db9d00ff">orange</span> area?<br />
How? Easy-peasy: Whichever side has a <strong>smaller</strong> <span style="padding: 1px; border: 1px solid black; border-radius: 2px; background-color:#b2b2b2ff">gray</span> area!
<ul>
<li><em>Intuition</em>: The gray area represents the distance between the head (read) and tail (write) pointers we set up for each side, the smaller the distance/area is, the more likely that our next 8-element partition <em>might</em> end with us overwriting that side’s head with the tail.</li>
<li><strong>We really don’t want that to happen…</strong></li>
<li>We read from the only side <em>where this might happen next</em>, thereby adding 8 more elements of breathing space to that side just in time before we cause a meltdown. (you can see this clearly in the animation as each orange block turns gray <em>after</em> we read it, <em>but before</em> we write to both sides…)</li>
</ul>
</li>
<li>We partition the data inside the <code class="highlighter-rouge">Vector256<int></code> we just read and write it to the next write position on each side.</li>
<li>We advance each write pointer according to how much of that register was red/green, we’ve discussed the how of it when we toured the vectorized block. Here you can see the end result reflected in how the red portion of the written copy on the left-hand side turns into gray, and the green portion on the right-hand side turns into gray correspondingly.<br />
<strong>Remember</strong>: We’ve seen the code in detail when we previously discussed the partitioning block; I repeat it here since it is critical for understanding how the whole process clicks together.</li>
</ul>
</li>
<li>For the finishing touch:
<ul>
<li>Left with less than 8 elements, we partition with plain old scalar code the few remaining elements, into the temporary memory area again.</li>
<li>We copy back each side of the temporary area back to the main array, and we’re done!</li>
<li>We take the pivot value, left untouched all this time at the right edge of our segment, and move it to where the new boundary is.</li>
</ul>
</li>
</ul>
<p>Let’s go over it again, in more detail, this time with code:</p>
</div>
<h3 id="setup-make-some-room">Setup: Make some room!</h3>
<p>What I eventually opted for was to read from <em>one</em> area and write to <em>another</em> area in the same array. But we need to make some spare room inside the array for this. How?</p>
<p>We cheat! (¯\_(ツ)_/¯), but not really: we allocate some temporary space on the stack, by using the relatively new <code class="highlighter-rouge">ref struct</code> feature in C# in combination with <code class="highlighter-rouge">fixed</code> arrays. Here’s why this isn’t really cheating in any reasonable person’s book:</p>
<ul>
<li>Stack allocation doesn’t put pressure on the GC, and its allocation is super fast/slim.</li>
<li>We allocate <em>once</em> at the top of our entire sort operation and reuse that space while recursing.</li>
<li>“Just a bit” is really just a bit: For our 8-element partition block we need room for 1 x 8-element vector on <strong>each</strong> side of the array, so we allocate a total of 2 x 8 integers. In addition, we allocate 8 more elements for handling the remainder (well technically, 7 would be enough, but I’m not a monster, I like round numbers just like the next person), so a total of 96 bytes. Not too horrid.</li>
</ul>
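<p>The size arithmetic works out as follows, using the same constant names that appear in <code class="highlighter-rouge">VxSortInt32</code> further down:</p>

```csharp
using System;

// Mirrors the constants defined further down in VxSortInt32:
// one vector of slack per side, plus room for up to 8 remainder elements.
const int N = 8;  // Vector256<int>.Count for 32-bit elements
const int SLACK_PER_SIDE_IN_VECTORS  = 1;
const int SLACK_PER_SIDE_IN_ELEMENTS = SLACK_PER_SIDE_IN_VECTORS * N;
const int TMP_SIZE_IN_ELEMENTS       = 2 * SLACK_PER_SIDE_IN_ELEMENTS + N;

Console.WriteLine(TMP_SIZE_IN_ELEMENTS);                // 24 elements
Console.WriteLine(TMP_SIZE_IN_ELEMENTS * sizeof(int));  // 96 bytes
```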
<p>Here’s the signature + setup code:</p>
<div>
<div class="stickemup">
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
</pre></td><td class="rouge-code"><pre><span class="k">unsafe</span> <span class="kt">int</span><span class="p">*</span> <span class="nf">VectorizedPartitionInPlace</span><span class="p">(</span><span class="kt">int</span><span class="p">*</span> <span class="n">left</span><span class="p">,</span> <span class="kt">int</span><span class="p">*</span> <span class="n">right</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">var</span> <span class="n">N</span> <span class="p">=</span> <span class="n">Vector256</span><span class="p"><</span><span class="n">T</span><span class="p">>.</span><span class="n">Count</span><span class="p">;</span> <span class="c1">// Treated by JIT as constant!</span>
<span class="kt">var</span> <span class="n">writeLeft</span> <span class="p">=</span> <span class="n">left</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">writeRight</span> <span class="p">=</span> <span class="n">right</span> <span class="p">-</span> <span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">tmpLeft</span> <span class="p">=</span> <span class="n">_tempStart</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">tmpRight</span> <span class="p">=</span> <span class="n">_tempEnd</span> <span class="p">-</span> <span class="n">N</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">pivot</span> <span class="p">=</span> <span class="p">*</span><span class="n">right</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">P</span> <span class="p">=</span> <span class="n">Vector256</span><span class="p">.</span><span class="nf">Create</span><span class="p">(</span><span class="n">pivot</span><span class="p">);</span>
<span class="nf">PartitionBlock</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpLeft</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpRight</span><span class="p">);</span>
<span class="nf">PartitionBlock</span><span class="p">(</span><span class="n">right</span> <span class="p">-</span> <span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpLeft</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpRight</span><span class="p">);</span>
<span class="kt">var</span> <span class="n">readLeft</span> <span class="p">=</span> <span class="n">left</span> <span class="p">+</span> <span class="n">N</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">readRight</span> <span class="p">=</span> <span class="n">right</span> <span class="p">-</span> <span class="m">2</span><span class="p">*</span><span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">;</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
</div>
<p>The function accepts two parameters: <code class="highlighter-rouge">left</code>, <code class="highlighter-rouge">right</code> pointing to the edges of the partitioning task we were handed. The selected pivot is “passed” in an unconventional way: the caller (The top-level sort function) is responsible for <strong>moving</strong> it to the right edge of the array before calling the partitioning function. In other words, we start executing the function expecting the pivot to be already selected and placed at the right edge of the segment (e.g., <code class="highlighter-rouge">right</code> points to it). This is a remnant of my initial copy-pasting of CoreCLR code, and to be honest, I don’t care enough to change it.</p>
<p>We start by setting up various pointers we’ll be using on <span class="uk-label">L5-8</span>: The <code class="highlighter-rouge">writeLeft</code> and <code class="highlighter-rouge">writeRight</code> pointers point into the internal edges of our array (excluding the last element, which is pointing to the selected pivot), while the <code class="highlighter-rouge">tmpLeft</code> and <code class="highlighter-rouge">tmpRight</code> pointers point into the internal edges of the temporary space.<br />
One recurring pattern is that the right-side pointers point one vector’s worth of elements <strong>left</strong> of their respective edge. This makes sense given that we will be using vectorized write operations that take a pointer to memory and write 8 elements at a time; the pointers are set up to account for that asymmetry.</p>
<table style="margin-bottom: 0em" class="notice--info">
<tr>
<td style="border: none; padding-top: 0; padding-bottom: 0; vertical-align: top"><span class="uk-label">Note</span></td>
<td style="border: none; padding-top: 0; padding-bottom: 0"><div>
<p>I’m using a “variable” (<code class="highlighter-rouge">N</code>) on <span class="uk-label">L3</span> instead of <code class="highlighter-rouge">Vector256<int>.Count</code>. There’s a reason for those double quotes: the right-hand expression is treated as a constant at JIT time. Furthermore, once we initialize <code class="highlighter-rouge">N</code> with its value and <em>never</em> modify it, the JIT treats <code class="highlighter-rouge">N</code> as a constant as well! So really, I get to use a short/readable name and pay no penalty for it.</p>
</div>
</td>
</tr>
</table>
<p>We proceed to partition a single 8-element vector on <em>each</em> side on <span class="uk-label">L13-14</span>, with our good-ole’ partitioning block, <strong>straight into</strong> that temporary space through the pointers we just set up. It is important to remember that having done that, we don’t care about the original contents of the area we just read from anymore: we’re free to write up to one <code class="highlighter-rouge">Vector256<T></code> to each edge of the array in the future. We’ve made enough room inside our array available for writing in-place while partitioning.</p>
<p>We finish the setup on <span class="uk-label">L16-17</span> by initializing read pointers for every side (<code class="highlighter-rouge">readLeft</code>, <code class="highlighter-rouge">readRight</code>); An alternative way to think about these pointers is that each side gets its own head (read) and tail (write) pointers. We will be continuously reading from <strong>one</strong> of the heads and writing to <strong>both</strong> tails from now on.</p>
<p>The setup ends with <code class="highlighter-rouge">readLeft</code> pointing a single <code class="highlighter-rouge">Vector256<int></code> <em>right</em> of <code class="highlighter-rouge">left</code>, and <code class="highlighter-rouge">readRight</code> pointing 1 element + 2x<code class="highlighter-rouge">Vector256<int></code> <em>left</em> of <code class="highlighter-rouge">right</code>. The setup of <code class="highlighter-rouge">readRight</code> might initially seem peculiar, but is easily explained:</p>
<ul>
<li><code class="highlighter-rouge">right</code> itself points to the selected pivot; we’re not going to (re-)partition it, so we skip that element (this explains the <code class="highlighter-rouge">- 1</code>).</li>
<li>As with the <code class="highlighter-rouge">tmpRight</code> and <code class="highlighter-rouge">writeRight</code> pointers, when we read/write using <code class="highlighter-rouge">Avx2.LoadDquVector256</code>/<code class="highlighter-rouge">Avx.Store</code> we always have to supply the <em>start</em> address to read from or write to!<br />
Since there is no way to read/write to the “left” of the pointer, we pre-decrement that pointer by <code class="highlighter-rouge">2*N</code> to account for the data that was already partitioned and to prepare it for the next read.</li>
</ul>
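<p>To make the initial layout concrete, here’s the same pointer setup modeled with element offsets instead of raw pointers (the segment length of 40 is a hypothetical example; any segment above the small-sort threshold behaves the same):</p>

```csharp
using System;

// Element offsets mirroring the pointer setup above: left = 0, and `right`
// is the offset of the selected pivot at the right edge of the segment.
const int N = 8;  // Vector256<int>.Count
int left = 0, right = 40;
int writeLeft = left,     writeRight = right - N - 1;
int readLeft  = left + N, readRight  = right - 2 * N - 1;

// After priming the pump, each side has exactly one vector's worth of
// writable slack between its read (head) and write (tail) pointers:
Console.WriteLine(readLeft - writeLeft);    // 8
Console.WriteLine(writeRight - readRight);  // 8
```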
<h3 id="loop">Loop</h3>
<p>Here’s the same loop we saw in the animation with our vectorized block smack in its middle, in plain-old C#:</p>
</div>
<div>
<div class="stickemup">
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre> <span class="k">while</span> <span class="p">(</span><span class="n">readRight</span> <span class="p">>=</span> <span class="n">readLeft</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="p">*</span><span class="n">nextPtr</span><span class="p">;</span>
<span class="k">if</span> <span class="p">((</span><span class="n">readLeft</span> <span class="p">-</span> <span class="n">writeLeft</span><span class="p">)</span> <span class="p"><=</span> <span class="p">(</span><span class="n">writeRight</span> <span class="p">-</span> <span class="n">readRight</span><span class="p">))</span> <span class="p">{</span>
<span class="n">nextPtr</span> <span class="p">=</span> <span class="n">readLeft</span><span class="p">;</span>
<span class="n">readLeft</span> <span class="p">+=</span> <span class="n">N</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">nextPtr</span> <span class="p">=</span> <span class="n">readRight</span><span class="p">;</span>
<span class="n">readRight</span> <span class="p">-=</span> <span class="n">N</span><span class="p">;</span>
<span class="p">}</span>

<span class="nf">PartitionBlock</span><span class="p">(</span><span class="n">nextPtr</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="k">ref</span> <span class="n">writeLeft</span><span class="p">,</span> <span class="k">ref</span> <span class="n">writeRight</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">readRight</span> <span class="p">+=</span> <span class="n">N</span><span class="p">;</span>
<span class="n">tmpRight</span> <span class="p">+=</span> <span class="n">N</span><span class="p">;</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
</div>
<p>This is the heart of the partitioning operation and where we spend most of the time sorting the array. Looks quite boring, eh?</p>
<p>This loop is all about calling our good ole’ partitioning block on the entire array. We re-use the same block on <span class="uk-label">L11</span>, but here, for the first time, we actually use it as an in-place partitioning block, since we are both reading and writing to the same array.<br />
While the runtime of the loop is dominated by the partitioning block, the interesting bit is that beefy condition on <span class="uk-label">L3</span> that we described/animated before: it calculates the distance between each head and tail on both sides and compares them to determine which side has less space left, or which side is closer to being overwritten. Given that the <strong>next</strong> read will happen from the side we choose here, we’ve just added 8 more integers worth of <em>writing</em> space to that same endangered side, thereby eliminating the risk of overwriting.<br />
While it might be easy to read in terms of correctness or motivation, this is a very <em>sad line of code</em>, as it will haunt us in the next posts!</p>
<p>Finally, as we exit the loop once there are <code class="highlighter-rouge">< 8</code> elements left (remember that we pre-decremented <code class="highlighter-rouge">readRight</code> by <code class="highlighter-rouge">N</code> elements before the loop), we are done with all vectorized work for this partitioning call. As such, this is as good a time as any to re-adjust both <code class="highlighter-rouge">readRight</code> and <code class="highlighter-rouge">tmpRight</code>, which were pre-decremented by <code class="highlighter-rouge">N</code> elements, making them ready-to-go for the final step of handling the remainder with scalar partitioning, on <span class="uk-label">L13-14</span>.</p>
<h3 id="handling-the-remainder-and-finishing-up">Handling the remainder and finishing up</h3>
<p>Here’s the final piece of this function:</p>
</div>
<div>
<div class="stickemup">
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
</pre></td><td class="rouge-code"><pre> <span class="k">while</span> <span class="p">(</span><span class="n">readLeft</span> <span class="p"><</span> <span class="n">readRight</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">v</span> <span class="p">=</span> <span class="p">*</span><span class="n">readLeft</span><span class="p">++;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">v</span> <span class="p"><=</span> <span class="n">pivot</span><span class="p">)</span> <span class="p">{</span>
<span class="p">*</span><span class="n">tmpLeft</span><span class="p">++</span> <span class="p">=</span> <span class="n">v</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="p">*--</span><span class="n">tmpRight</span> <span class="p">=</span> <span class="n">v</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>

<span class="kt">var</span> <span class="n">leftTmpSize</span> <span class="p">=</span> <span class="p">(</span><span class="kt">uint</span><span class="p">)</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="p">(</span><span class="n">tmpLeft</span> <span class="p">-</span> <span class="n">_tempStart</span><span class="p">);</span>
<span class="n">Unsafe</span><span class="p">.</span><span class="nf">CopyBlockUnaligned</span><span class="p">(</span><span class="n">writeLeft</span><span class="p">,</span> <span class="n">_tempStart</span><span class="p">,</span> <span class="n">leftTmpSize</span> <span class="p">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">int</span><span class="p">));</span>
<span class="n">writeLeft</span> <span class="p">+=</span> <span class="n">leftTmpSize</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">rightTmpSize</span> <span class="p">=</span> <span class="p">(</span><span class="kt">uint</span><span class="p">)</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="p">(</span><span class="n">_tempEnd</span> <span class="p">-</span> <span class="n">tmpRight</span><span class="p">);</span>
<span class="n">Unsafe</span><span class="p">.</span><span class="nf">CopyBlockUnaligned</span><span class="p">(</span><span class="n">writeLeft</span><span class="p">,</span> <span class="n">tmpRight</span><span class="p">,</span> <span class="n">rightTmpSize</span> <span class="p">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">int</span><span class="p">));</span>
<span class="nf">Swap</span><span class="p">(</span><span class="n">writeLeft</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="k">return</span> <span class="n">writeLeft</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
</div>
<p>Finally, we come out of the loop once we have fewer than 8 elements to partition (1-7 elements). We can’t use vectorized code here, so we drop to plain-old scalar partitioning on <span class="uk-label">L1-8</span>. To keep things simple, we partition these last elements straight into the temporary area. This is the reason we allocated 8 more elements in the temporary area in the first place.</p>
<p>Once we’re done with this remainder nuisance, we copy the already partitioned data from the temporary area back into the array, into the area left between <code class="highlighter-rouge">writeLeft</code> and <code class="highlighter-rouge">writeRight</code>; it’s a quick 64-96 byte copy in two operations, performed on <span class="uk-label">L10-14</span>, and we are nearly done. We still need to move the pivot <em>back</em> to the newly calculated pivot position (remember the caller placed it on the right edge of the array as part of pivot selection) and report this position back as the return value for this to officially be christened as an AVX2 partitioning function.</p>
</div>
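<p>To see how all the pieces click together end-to-end, here’s a scalar model of the whole function: array indices instead of pointers, and the 8-element vectorized block replaced by an equivalent scalar stand-in (<code class="highlighter-rouge">PartitionInPlaceModel</code> and <code class="highlighter-rouge">Block</code> are hypothetical names for this sketch, not the real code). Like the real code, it skips the element at <code class="highlighter-rouge">right - 1</code>, which the caller’s median-of-three selection is guaranteed to have left ≥ pivot:</p>

```csharp
using System;
using System.Linq;

// Scalar model of the in-place double-pumped partition described above.
// `right` is the index of the pivot; the caller must ensure a[right-1] >= pivot.
static int PartitionInPlaceModel(int[] a, int left, int right)
{
    const int N = 8;               // stands in for Vector256<int>.Count
    var tmp = new int[2 * N + N];  // one vector of slack per side + remainder room
    int pivot = a[right];
    int writeLeft = left, writeRight = right - N - 1;
    int tmpLeft = 0, tmpRight = tmp.Length - N;

    // Scalar stand-in for PartitionBlock: "load" 8 elements first, then write
    // the <= pivot ones at the left tail and the rest at the right tail.
    void Block(int from, int[] dst, ref int wl, ref int wr)
    {
        var buf = new int[N];
        Array.Copy(a, from, buf, 0, N);
        int k = buf.Count(v => v <= pivot);   // how many go left
        int lo = wl, hi = wr + k;             // > pivot lands at [wr+k, wr+N)
        foreach (var v in buf)
            if (v <= pivot) dst[lo++] = v; else dst[hi++] = v;
        wl += k;      // advance left tail by the "smaller" count
        wr -= N - k;  // retreat right tail by the "larger" count
    }

    // Prime the pump: one block from each edge goes into the temporary space
    Block(left,          tmp, ref tmpLeft, ref tmpRight);
    Block(right - N - 1, tmp, ref tmpLeft, ref tmpRight);

    int readLeft = left + N, readRight = right - 2 * N - 1;
    while (readRight >= readLeft) {
        int next;  // always read from whichever side has less writable slack
        if (readLeft - writeLeft <= writeRight - readRight) {
            next = readLeft;  readLeft  += N;
        } else {
            next = readRight; readRight -= N;
        }
        Block(next, a, ref writeLeft, ref writeRight);
    }
    readRight += N;
    tmpRight  += N;

    while (readLeft < readRight) {  // scalar remainder into the temp space
        int v = a[readLeft++];
        if (v <= pivot) tmp[tmpLeft++] = v; else tmp[--tmpRight] = v;
    }

    // Copy both halves of the temp space back, then restore the pivot
    Array.Copy(tmp, 0, a, writeLeft, tmpLeft);
    writeLeft += tmpLeft;
    Array.Copy(tmp, tmpRight, a, writeLeft, tmp.Length - tmpRight);
    (a[writeLeft], a[right]) = (a[right], a[writeLeft]);
    return writeLeft;
}

// Usage: 40 pseudo-random elements; park the largest value at right - 1
// (emulating what median-of-three guarantees), pivot already at a[right].
var rnd = new Random(42);
var data = Enumerable.Range(0, 40).Select(_ => rnd.Next(100)).ToArray();
int last = data.Length - 1;
int maxIdx = Array.IndexOf(data, data.Max());
(data[last - 1], data[maxIdx]) = (data[maxIdx], data[last - 1]);
int p = PartitionInPlaceModel(data, 0, last);
Console.WriteLine(data.Take(p).All(v => v <= data[p]) &&
                  data.Skip(p + 1).All(v => v >= data[p]));  // True
```

<p>Note that the model preserves the real code’s quirk of never touching <code class="highlighter-rouge">a[right - 1]</code>: since median-of-three left a value ≥ pivot there, it can safely stay in the right-hand partition for the recursive calls to deal with.</p>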
<h2 id="pretending-were-arraysort">Pretending we’re Array.Sort</h2>
<p>Now that we have a proper partitioning function, it’s time to string it into a quick-sort like dispatching function: This will be the entry point to our sort routine:</p>
<div>
<div class="stickemup">
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
</pre></td><td class="rouge-code"><pre><span class="k">public</span> <span class="k">static</span> <span class="k">class</span> <span class="nc">DoublePumpNaive</span>
<span class="p">{</span>
<span class="k">public</span> <span class="k">static</span> <span class="k">unsafe</span> <span class="k">void</span> <span class="n">Sort</span><span class="p"><</span><span class="n">T</span><span class="p">>(</span><span class="n">T</span><span class="p">[]</span> <span class="n">array</span><span class="p">)</span> <span class="k">where</span> <span class="n">T</span> <span class="p">:</span> <span class="n">unmanaged</span><span class="p">,</span> <span class="n">IComparable</span><span class="p"><</span><span class="n">T</span><span class="p">></span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">array</span> <span class="p">==</span> <span class="k">null</span><span class="p">)</span>
<span class="k">throw</span> <span class="k">new</span> <span class="nf">ArgumentNullException</span><span class="p">(</span><span class="k">nameof</span><span class="p">(</span><span class="n">array</span><span class="p">));</span>
<span class="k">fixed</span> <span class="p">(</span><span class="n">T</span><span class="p">*</span> <span class="n">p</span> <span class="p">=</span> <span class="p">&</span><span class="n">array</span><span class="p">[</span><span class="m">0</span><span class="p">])</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="k">typeof</span><span class="p">(</span><span class="n">T</span><span class="p">)</span> <span class="p">==</span> <span class="k">typeof</span><span class="p">(</span><span class="kt">int</span><span class="p">))</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">pi</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">*)</span> <span class="n">p</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">sorter</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">VxSortInt32</span><span class="p">(</span><span class="n">startPtr</span><span class="p">:</span> <span class="n">pi</span><span class="p">,</span> <span class="n">endPtr</span><span class="p">:</span> <span class="n">pi</span> <span class="p">+</span> <span class="n">array</span><span class="p">.</span><span class="n">Length</span> <span class="p">-</span> <span class="m">1</span><span class="p">);</span>
<span class="n">sorter</span><span class="p">.</span><span class="nf">Sort</span><span class="p">(</span><span class="n">pi</span><span class="p">,</span> <span class="n">pi</span> <span class="p">+</span> <span class="n">array</span><span class="p">.</span><span class="n">Length</span> <span class="p">-</span> <span class="m">1</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">const</span> <span class="kt">int</span> <span class="n">SLACK_PER_SIDE_IN_VECTORS</span> <span class="p">=</span> <span class="m">1</span><span class="p">;</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
</div>
<p>Most of this is pretty dull code:</p>
<ul>
<li>We start with a top-level static class <code class="highlighter-rouge">DoublePumpNaive</code> containing a single <code class="highlighter-rouge">Sort</code> entry point accepting a normal managed array.</li>
<li>We special-case <code class="highlighter-rouge">typeof(int)</code>, relying on generic type elision, newing up a <code class="highlighter-rouge">VxSortInt32</code> struct and finally calling its internal <code class="highlighter-rouge">.Sort()</code> method to initiate the recursive sorting.
<ul>
<li>This is as good a time as any to remind, again, that for the time being, I only implemented vectorized sorting when <code class="highlighter-rouge">T</code> is <code class="highlighter-rouge">int</code>. To fully replace <code class="highlighter-rouge">Array.Sort()</code>, more tweaked versions of this code will eventually have to be written to support unsigned integers, both larger and smaller than 32 bits, as well as floating-point types.</li>
</ul>
</li>
</ul>
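<p>The generic-type-elision trick deserves a tiny illustration (<code class="highlighter-rouge">Describe</code> is a hypothetical stand-in, not part of the sorting code): the JIT compiles a separate method body per value-type <code class="highlighter-rouge">T</code>, so the <code class="highlighter-rouge">typeof</code> comparison below is resolved at JIT time and the untaken branch is eliminated entirely:</p>

```csharp
using System;

// Because each value-type instantiation gets its own JITted body, the
// typeof(T) == typeof(int) check costs nothing at run time: the branch is
// a JIT-time constant, and the dead path is dropped from the generated code.
static string Describe<T>() where T : unmanaged
    => typeof(T) == typeof(int) ? "vectorized int path" : "scalar fallback";

Console.WriteLine(Describe<int>());     // vectorized int path
Console.WriteLine(Describe<double>());  // scalar fallback
```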
<p>Continuing on to <code class="highlighter-rouge">VxSortInt32</code> itself:</p>
</div>
<div>
<div class="stickemup">
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
</pre></td><td class="rouge-code"><pre>
<span class="k">internal</span> <span class="k">unsafe</span> <span class="k">ref</span> <span class="k">struct</span> <span class="nc">VxSortInt32</span>
<span class="p">{</span>
<span class="k">const</span> <span class="kt">int</span> <span class="n">SLACK_PER_SIDE_IN_ELEMENTS</span> <span class="p">=</span> <span class="n">SLACK_PER_SIDE_IN_VECTORS</span> <span class="p">*</span> <span class="m">8</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">int</span> <span class="n">TMP_SIZE_IN_ELEMENTS</span> <span class="p">=</span> <span class="m">2</span> <span class="p">*</span> <span class="n">SLACK_PER_SIDE_IN_ELEMENTS</span> <span class="p">+</span> <span class="m">8</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">int</span> <span class="n">SMALL_SORT_THRESHOLD_ELEMENTS</span> <span class="p">=</span> <span class="m">16</span><span class="p">;</span>
<span class="k">readonly</span> <span class="kt">int</span><span class="p">*</span> <span class="n">_startPtr</span><span class="p">,</span> <span class="n">_endPtr</span><span class="p">,</span>
<span class="n">_tempStart</span><span class="p">,</span> <span class="n">_tempEnd</span><span class="p">;</span>
<span class="k">fixed</span> <span class="kt">int</span> <span class="n">_temp</span><span class="p">[</span><span class="n">TMP_SIZE_IN_ELEMENTS</span><span class="p">];</span>
<span class="k">public</span> <span class="nf">VxSortInt32</span><span class="p">(</span><span class="kt">int</span><span class="p">*</span> <span class="n">startPtr</span><span class="p">,</span> <span class="kt">int</span><span class="p">*</span> <span class="n">endPtr</span><span class="p">)</span> <span class="p">:</span> <span class="k">this</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">_startPtr</span> <span class="p">=</span> <span class="n">startPtr</span><span class="p">;</span>
<span class="n">_endPtr</span> <span class="p">=</span> <span class="n">endPtr</span><span class="p">;</span>
<span class="k">fixed</span> <span class="p">(</span><span class="kt">int</span><span class="p">*</span> <span class="n">pTemp</span> <span class="p">=</span> <span class="n">_temp</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_tempStart</span> <span class="p">=</span> <span class="n">pTemp</span><span class="p">;</span>
<span class="n">_tempEnd</span> <span class="p">=</span> <span class="n">pTemp</span> <span class="p">+</span> <span class="n">TMP_SIZE_IN_ELEMENTS</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
</div>
<p>This is where the real top-level sorting entry point for 32-bit signed integers is:</p>
<ul>
<li>This struct contains a bunch of constants and members that are initialized for a single sort-job/call and immediately discarded once sorting is complete.</li>
<li>There’s a seemingly nasty little bit hiding in plain sight, where we exfiltrate an interior pointer obtained inside a <code class="highlighter-rouge">fixed</code> block and store it for the lifetime of the struct, outside of the <code class="highlighter-rouge">fixed</code> block.
<ul>
<li>This is generally a no-no, since, in theory, we don’t have a guarantee that the struct won’t be boxed/stored inside a managed object on a heap where the GC is free to move our memory around.</li>
<li>In this case, we <em>are ensuring</em> that instances of <code class="highlighter-rouge">VxSortInt32</code> are never promoted to the managed heap by declaring it as a <a href="https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/keywords/ref#ref-struct-types"><code class="highlighter-rouge">ref struct</code></a>.</li>
<li>The motivation behind this is to ensure that the <code class="highlighter-rouge">fixed</code> temporary memory resides close to the other struct fields, taking advantage of <a href="https://en.wikipedia.org/wiki/Locality_of_reference">locality of reference</a>.</li>
</ul>
</li>
</ul>
</div>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
</pre></td><td class="rouge-code"><pre> <span class="k">internal</span> <span class="k">void</span> <span class="nf">Sort</span><span class="p">(</span><span class="kt">int</span><span class="p">*</span> <span class="n">left</span><span class="p">,</span> <span class="kt">int</span><span class="p">*</span> <span class="n">right</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">var</span> <span class="n">length</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="p">(</span><span class="n">right</span> <span class="p">-</span> <span class="n">left</span> <span class="p">+</span> <span class="m">1</span><span class="p">);</span>
<span class="kt">int</span><span class="p">*</span> <span class="n">mid</span><span class="p">;</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">length</span><span class="p">)</span> <span class="p">{</span>
<span class="k">case</span> <span class="m">0</span><span class="p">:</span>
<span class="k">case</span> <span class="m">1</span><span class="p">:</span>
<span class="k">return</span><span class="p">;</span>
<span class="k">case</span> <span class="m">2</span><span class="p">:</span>
<span class="nf">SwapIfGreater</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="k">return</span><span class="p">;</span>
<span class="k">case</span> <span class="m">3</span><span class="p">:</span>
<span class="n">mid</span> <span class="p">=</span> <span class="n">right</span> <span class="p">-</span> <span class="m">1</span><span class="p">;</span>
<span class="nf">SwapIfGreater</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">mid</span><span class="p">);</span>
<span class="nf">SwapIfGreater</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="nf">SwapIfGreater</span><span class="p">(</span><span class="n">mid</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Go to insertion sort below this threshold</span>
<span class="k">if</span> <span class="p">(</span><span class="n">length</span> <span class="p"><=</span> <span class="n">SMALL_SORT_THRESHOLD_ELEMENTS</span><span class="p">)</span> <span class="p">{</span>
<span class="nf">InsertionSort</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Compute median-of-three, of:</span>
<span class="c1">// the first, mid and one before last elements</span>
<span class="n">mid</span> <span class="p">=</span> <span class="n">left</span> <span class="p">+</span> <span class="p">((</span><span class="n">right</span> <span class="p">-</span> <span class="n">left</span><span class="p">)</span> <span class="p">/</span> <span class="m">2</span><span class="p">);</span>
<span class="nf">SwapIfGreater</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">mid</span><span class="p">);</span>
<span class="nf">SwapIfGreater</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span> <span class="p">-</span> <span class="m">1</span><span class="p">);</span>
<span class="nf">SwapIfGreater</span><span class="p">(</span><span class="n">mid</span><span class="p">,</span> <span class="n">right</span> <span class="p">-</span> <span class="m">1</span><span class="p">);</span>
<span class="c1">// Pivot is mid, place it in the right hand side</span>
<span class="nf">Swap</span><span class="p">(</span><span class="n">mid</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="kt">var</span> <span class="n">boundary</span> <span class="p">=</span> <span class="nf">VectorizedPartitionInPlace</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="nf">Sort</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">boundary</span> <span class="p">-</span> <span class="m">1</span><span class="p">);</span>
<span class="nf">Sort</span><span class="p">(</span><span class="n">boundary</span> <span class="p">+</span> <span class="m">1</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Lastly, we have the <code class="highlighter-rouge">Sort</code> method for the <code class="highlighter-rouge">VxSortInt32</code> struct. Most of this is code I blatantly copied from <a href="https://github.com/dotnet/coreclr/blob/master/src/System.Private.CoreLib/shared/System/Collections/Generic/ArraySortHelper.cs#L182"><code class="highlighter-rouge">ArraySortHelper<T></code></a>. What it does is:</p>
<ul>
<li>Special case for lengths of 0-3.</li>
<li>When length <code class="highlighter-rouge"><= 16</code> we just go straight to <code class="highlighter-rouge">InsertionSort</code> and skip all the recursive jazz (go back to post 1 if you want to know why <code class="highlighter-rouge">Array.Sort()</code> does this).</li>
<li>When we have <code class="highlighter-rouge">>= 17</code> elements, we go to vectorized partitioning:
<ul>
<li>We do median of 3 pivot selection.</li>
<li>Swap that pivot so that it resides on the right-most index of the partition.</li>
</ul>
</li>
<li>Call <code class="highlighter-rouge">VectorizedPartitionInPlace</code>, which we’ve seen before.
<ul>
<li>We conveniently take advantage of the fact that we have <code class="highlighter-rouge">InsertionSort</code> to cover us for the small partitions, and our partitioning code can always assume that it can prime the pump with at least two vectors’ worth of vectorized partitioning without additional checks…</li>
</ul>
</li>
<li>Recurse to the left.</li>
<li>Recurse to the right.</li>
</ul>
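<p>Stripped of the intrinsics and the syntax highlighting, the control flow described above is just a hybrid QuickSort driver. Here is a minimal scalar C++ sketch of the same shape; the names and the scalar <code class="highlighter-rouge">partition</code> stand-in are illustrative only, since the real code calls <code class="highlighter-rouge">VectorizedPartitionInPlace</code> at that point:</p>

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative constant; the series sticks with Array.Sort's threshold of 16 for now.
constexpr std::ptrdiff_t SMALL_SORT_THRESHOLD_ELEMENTS = 16;

static void swap_if_greater(int* a, int* b) {
    if (*a > *b) std::swap(*a, *b);
}

static void insertion_sort(int* left, int* right) {
    for (int* i = left + 1; i <= right; ++i) {
        int v = *i;
        int* j = i - 1;
        while (j >= left && *j > v) { j[1] = j[0]; --j; }
        j[1] = v;
    }
}

// Scalar stand-in for VectorizedPartitionInPlace: partitions around the
// pivot parked at *right, returns the pivot's final position (the boundary).
static int* partition(int* left, int* right) {
    int pivot = *right;
    int* store = left;
    for (int* i = left; i < right; ++i)
        if (*i < pivot) std::swap(*store++, *i);
    std::swap(*store, *right);
    return store;
}

static void sort(int* left, int* right) {
    std::ptrdiff_t length = right - left + 1;
    switch (length) {
        case 0: case 1: return;
        case 2: swap_if_greater(left, right); return;
        case 3: {
            int* mid = right - 1;
            swap_if_greater(left, mid);
            swap_if_greater(left, right);
            swap_if_greater(mid, right);
            return;
        }
    }
    if (length <= SMALL_SORT_THRESHOLD_ELEMENTS) { insertion_sort(left, right); return; }
    // Median-of-three of the first, mid and one-before-last elements...
    int* mid = left + (right - left) / 2;
    swap_if_greater(left, mid);
    swap_if_greater(left, right - 1);
    swap_if_greater(mid, right - 1);
    // ...then park the pivot on the right-most index of the partition.
    std::swap(*mid, *right);
    int* boundary = partition(left, right);
    sort(left, boundary - 1);   // recurse to the left
    sort(boundary + 1, right);  // recurse to the right
}
```

<p>Note that the <code class="highlighter-rouge"><= 16</code> case is the one feeding <code class="highlighter-rouge">InsertionSort</code> all those small partitions we end up profiling later in this post.</p>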
<h2 id="initial-performance">Initial Performance</h2>
<p>Are we fast yet?</p>
<p>Yes! This is by no means the end; on the contrary, this is only a rather impressive beginning. We finally have something working, and it is not even entirely unpleasant, if I may say so:</p>
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#503d6f7d-7740-4997-968f-b1462f12e371'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-stats"></i> Scaling</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Stats</a></li>
<li><a href="#"><i class="glyphicon glyphicon-info-sign"></i> Setup</a></li>
</ul>
<ul id="503d6f7d-7740-4997-968f-b1462f12e371" class="uk-switcher uk-margin">
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Performance scale: Array.Sort (solid gray) is always 100%, and the other methods are scaled relative to it" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
ArraySort, 1 , 1 , 1 , 1 , 1 , 1
DoublePumpedNaive, 1.67, 0.77, 0.6, 0.50, 0.39 , 0.36
<!--
{
"data" : {
"datasets" : [ {
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(33,33,220,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 30, "hachureGap": 6 }
}]
},
"options": {
"title": { "text": "AVX2 Naive Sorting - Scaled to Array.Sort", "display": true },
"scales": {
"yAxes": [{
"ticks": {
"fontFamily": "Indie Flower",
"min": 0.2,
"callback": "ticksPercent"
},
"scaleLabel": {
"labelString": "Scaling (%)",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
ArraySort , 19.9202, 35.4067, 52.3293, 64.6518, 70.5598, 81.0416
DoublePumpedNaive, 35.4138, 26.9828, 31.5477, 32.1774, 27.8901, 29.4917
<!--
{
"data" : {
"datasets" : [ {
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(33,33,220,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 30, "hachureGap": 6 }
}]
},
"options": {
"title": { "text": "Array.Sort + AVX2 Naive Sorting - log(Time/N)", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt3_Int32_-report.datatable.json" data-id-field="name" data-pagination="false" data-page-list="[9, 18]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result is. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/unmanaged-vs-doublepumpednaive-stats.json" data-id-field="name" data-pagination="false" data-intro="Each row in this table contains statistics collected & averaged out of thousands of runs with random data" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="MethodName" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">Method<br />Name</span>
</th>
<th data-field="ProblemSize" data-sortable="true" data-value-type="int" data-filter-control="select">
<div data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="bottom" class="rotated-header-container">
<div class="rotated-header">Size</div>
</div>
</th>
<th data-field="MaxDepthScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="The maximal depth of recursion reached while sorting" data-position="top" class="rotated-header-container">
<div class="rotated-header">Max</div>
<div class="rotated-header">Depth</div>
</div>
</th>
<th data-field="NumPartitionOperationsScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="# of partitioning operations per sort" data-position="bottom" class="rotated-header-container">
<div class="rotated-header">Part</div>
<div class="rotated-header">itions</div>
</div>
</th>
<th data-field="NumVectorizedLoadsScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="# of vectorized load operations" data-position="top" class="rotated-header-container">
<div class="rotated-header">Vector</div>
<div class="rotated-header">Loads</div>
</div>
</th>
<th data-field="NumVectorizedStoresScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="# of vectorized store operations" data-position="bottom" class="rotated-header-container">
<div class="rotated-header">Vector</div>
<div class="rotated-header">Stores</div>
</div>
</th>
<th data-field="NumPermutationsScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="# of vectorized permutation operations" data-position="top" class="rotated-header-container">
<div class="rotated-header">Vector</div>
<div class="rotated-header">Permutes</div>
</div>
</th>
<th data-field="AverageSmallSortSizeScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="For hybrid sorting, the average size that each small sort operation was called with (e.g. InsertionSort)" data-position="bottom" class="rotated-header-container">
<div class="rotated-header">Small</div>
<div class="rotated-header">Sort</div>
<div class="rotated-header">Size</div>
</div>
</th>
<th data-field="NumScalarComparesScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="How many branches were executed in each sort operation that were based on the unsorted array elements" data-position="top" class="rotated-header-container">
<div class="rotated-header">Data</div>
<div class="rotated-header">Based</div>
<div class="rotated-header">Branches</div>
</div>
</th>
<th data-field="PercentSmallSortCompares" data-sortable="true" data-value-type="float2-percentage">
<div data-intro="What percent of<br/>⬅<br/>branches happened as part of small-sorts" data-position="bottom" class="rotated-header-container">
<div class="rotated-header">Small</div>
<div class="rotated-header">Sort</div>
<div class="rotated-header">Branches</div>
</div>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nv">BenchmarkDotNet</span><span class="o">=</span>v0.12.0, <span class="nv">OS</span><span class="o">=</span>clear-linux-os 32120
Intel Core i7-7700HQ CPU 2.80GHz <span class="o">(</span>Kaby Lake<span class="o">)</span>, 1 CPU, 4 logical and 4 physical cores
.NET Core <span class="nv">SDK</span><span class="o">=</span>3.1.100
<span class="o">[</span>Host] : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
Job-DEARTS : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
<span class="nv">InvocationCount</span><span class="o">=</span>3 <span class="nv">IterationCount</span><span class="o">=</span>15 <span class="nv">LaunchCount</span><span class="o">=</span>2
<span class="nv">UnrollFactor</span><span class="o">=</span>1 <span class="nv">WarmupCount</span><span class="o">=</span>10
<span class="nv">$ </span><span class="nb">grep</span> <span class="s1">'stepping\|model\|microcode'</span> /proc/cpuinfo | <span class="nb">head</span> <span class="nt">-4</span>
model : 158
model name : Intel<span class="o">(</span>R<span class="o">)</span> Core<span class="o">(</span>TM<span class="o">)</span> i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>We’re off to a very good start:</p>
<ul>
<li>
<p>We can see that as soon as we hit 1000 element arrays (even earlier, in earnest), we already outperform <code class="highlighter-rouge">Array.Sort</code> (77% runtime), and by the time we get to 1M / 10M element arrays, we see speed-ups north of 2.5x (39%, 36% runtime) over the scalar C++ code!</p>
</li>
<li>
<p>While <code class="highlighter-rouge">Array.Sort</code> is behaving like we would expect from a <code class="highlighter-rouge">QuickSort</code>-like function: it is slowing down at a rate you’d expect given its \(\mathcal{O}(n\log{}n)\) complexity, our own <code class="highlighter-rouge">DoublePumpedNaive</code> is peculiar: the time spent sorting every single element starts going up as we increase <code class="highlighter-rouge">N</code>, then goes down a bit and back up. Huh? It actually improves as we sort more data? Quite unreasonable, unless we remind ourselves that we are executing a mix of scalar insertion sort and vectorized code. Where are we actually spending more CPU cycles, though? We’ll run some profiling sessions in a minute, to get a better idea of what’s going on.</p>
</li>
</ul>
<p>If you recall, in the first post in this series, I presented some statistics about what is going on inside our sort routine. This is a perfect time to switch to the statistics tab, where I’ve beefed up the table with some vectorized counters that didn’t make sense before with the scalar version. From here we can learn a few interesting facts:</p>
<ul>
<li>The number of partitioning operations / small sorts is practically the same
<ul>
<li>You could ask yourself, or me, why they are not <strong>exactly</strong> the same?
To which I’d answer:
<ul>
<li>The thresholds are 16 vs. 17, which has some effect.</li>
<li>We have to remember that the resulting partitions from each implementation end up looking slightly different because of the double pumping + temporary memory shenanigans. Once the partitions look different, the following pivots selected are different, and the whole sort mechanic looks slightly different.</li>
</ul>
</li>
</ul>
</li>
<li>We are doing a lot of vectorized work:
<ul>
<li>Loading two vectors per 8 elements (1 data vector + 1 permutation vector)</li>
<li>Storing two vectors (left+right) for every vector read</li>
<li>In a weird coincidence, this means we perform the same number of vectorized loads and stores for every test case.<br />
In future posts, I will discard one of these columns to reduce the information overload…</li>
<li>Finally, lest we forget, we perform compares/permutations at exactly half of the load/store rate.</li>
</ul>
</li>
<li>All of this is helping us by reducing the number of scalar comparisons, but there’s still quite a lot of it left too:
<ul>
<li>We continue to do scalar partitioning inside <code class="highlighter-rouge">VectorizedPartitionInPlace</code>, as part of handling the remainder that doesn’t fit into a <code class="highlighter-rouge">Vector256<int></code>.</li>
<li>We are still executing scalar comparisons as part of small-sorting, inside the insertion sort, at an alarming rate:
<ul>
<li>The absolute number of comparisons is quite high: We’re still doing millions of data-based branches.</li>
<li>It is also clear from the counters that the overwhelming majority of these are from <code class="highlighter-rouge">InsertionSort</code>: If we focus on the 1M/10M cases here, we see that <code class="highlighter-rouge">InsertionSort</code> went up from accounting for 28.08%/24.60% of scalar comparisons in the <code class="highlighter-rouge">Unmanaged</code> (scalar) test-case all the way to 66.4%/62.74% in the vectorized <code class="highlighter-rouge">DoublePumpedNaive</code> version. Of course this rise is merely in percent terms, but clearly we will have to deal with this if we intend to make this thing fast(er).</li>
</ul>
</li>
</ul>
</li>
</ul>
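<p>To get a feel for why <code class="highlighter-rouge">InsertionSort</code> dominates these branch counters, we can instrument a scalar insertion sort and count every data-dependent comparison it executes. This C++ sketch is purely illustrative; it is not the instrumentation that produced the table above:</p>

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Count every data-dependent comparison insertion sort executes; each one
// is a branch the CPU must predict based on the unsorted data itself.
static uint64_t insertion_sort_compares(std::vector<int> v) {
    uint64_t compares = 0;
    for (std::size_t i = 1; i < v.size(); ++i) {
        int key = v[i];
        std::size_t j = i;
        // the comma operator bumps the counter on every evaluated compare
        while (j > 0 && (++compares, v[j - 1] > key)) {
            v[j] = v[j - 1];
            --j;
        }
        v[j] = key;
    }
    return compares;
}
```

<p>A random 16-element partition averages around n²/4 ≈ 60 such compares, and a reverse-sorted one hits the full n(n-1)/2 = 120; multiply that by the roughly hundreds of thousands of small partitions a 10M-element sort produces, and the 62-66% figures above stop being surprising.</p>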
<p>This is but the beginning of our profiling journey, but we are already learning a complicated truth: Right now, as fast as this is already going, the scalar code we use for insertion sort will always put an upper limit on how fast we can possibly go by optimizing the <em>vectorized code</em> we’ve gone over so far, <em>unless</em> we get rid of <code class="highlighter-rouge">InsertionSort</code> altogether, replacing it with something better. But first things first, we must remain focused: 65% of instructions executed are still spent doing vectorized partitioning; that is the biggest target in our sights!</p>
</div>
<p>As promised, it’s time we profile the code to see what’s really up: We can fire up the venerable Linux <code class="highlighter-rouge">perf</code> tool, through a simple test binary/project I’ve coded up which allows me to execute some dummy sorting by selecting the sort method I want to invoke and specify some parameters for it through the command line, for example:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre><span class="nv">$ </span><span class="nb">cd</span> ~/projects/public/VxSort/Example
<span class="nv">$ </span>dotnet publish <span class="nt">-c</span> release <span class="nt">-o</span> linux-x64 <span class="nt">-r</span> linux-x64
<span class="c"># Run DoublePumpedNaive with 1,000,000 elements x 100 times</span>
<span class="nv">$ </span>./linux-x64/Example <span class="nt">--type-list</span> DoublePumpedNaive <span class="nt">--size-list</span> 1000000
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Here we call the <code class="highlighter-rouge">DoublePumpedNaive</code> implementation we’ve been discussing from the beginning of this post with 1M elements, and sort the random data 100 times to generate some heat in case global warming is not cutting it for you.<br />
I know that calling <code class="highlighter-rouge">dotnet publish ...</code> seems superfluous, but trust<sup id="fnref:0" role="doc-noteref"><a href="#fn:0" class="footnote">1</a></sup> me and go with me on this one:</p>
<ul class="uk-tab" data-uk-switcher="{connect:'#0022d19b-dd68-4bb7-a13e-8acabcb4c12f'}">
<li class="uk-active"><a href="#">1M</a></li>
<li><a href="#">10K</a></li>
</ul>
<ul id="0022d19b-dd68-4bb7-a13e-8acabcb4c12f" class="uk-switcher uk-margin">
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre><span class="nv">$ COMPlus_PerfMapEnabled</span><span class="o">=</span>1 perf record <span class="nt">-F</span> max <span class="nt">-e</span> instructions ./Example <span class="se">\</span>
<span class="nt">--type-list</span> DoublePumpedNaive <span class="nt">--size-list</span> 1000000
...
<span class="nv">$ </span>perf report <span class="nt">--stdio</span> <span class="nt">-F</span> overhead,sym | <span class="nb">head</span> <span class="nt">-15</span>
...
<span class="c"># Overhead Symbol</span>
65.66% <span class="o">[</span>.] ... ::VectorizedPartitionInPlace<span class="o">(</span>int32<span class="k">*</span>,int32<span class="k">*</span>,int32<span class="k">*</span><span class="o">)[</span>Optimized]
22.43% <span class="o">[</span>.] ... ::InsertionSort<span class="o">(!!</span>0<span class="k">*</span>,!!0<span class="k">*</span><span class="o">)[</span>Optimized]
5.43% <span class="o">[</span>.] ... ::QuickSortInt<span class="o">(</span>int32<span class="k">*</span>,int32<span class="k">*</span>,int32<span class="k">*</span>,int32<span class="k">*</span><span class="o">)[</span>OptimizedTier1]
4.00% <span class="o">[</span>.] ... ::Memmove<span class="o">(</span>uint8&,uint8&,uint64<span class="o">)[</span>OptimizedTier1]
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre><span class="nv">$ COMPlus_PerfMapEnabled</span><span class="o">=</span>1 perf record <span class="nt">-F</span> max <span class="nt">-e</span> instructions ./Example <span class="se">\</span>
<span class="nt">--type-list</span> DoublePumpedNaive <span class="nt">--size-list</span> 10000
...
<span class="nv">$ </span>perf report <span class="nt">--stdio</span> <span class="nt">-F</span> overhead,sym | <span class="nb">head</span> <span class="nt">-15</span>
...
<span class="c"># Overhead Symbol</span>
54.59% <span class="o">[</span>.] ... ::VectorizedPartitionInPlace<span class="o">(</span>int32<span class="k">*</span>,int32<span class="k">*</span>,int32<span class="k">*</span><span class="o">)[</span>Optimized]
29.87% <span class="o">[</span>.] ... ::InsertionSort<span class="o">(!!</span>0<span class="k">*</span>,!!0<span class="k">*</span><span class="o">)[</span>Optimized]
7.02% <span class="o">[</span>.] ... ::QuickSortInt<span class="o">(</span>int32<span class="k">*</span>,int32<span class="k">*</span>,int32<span class="k">*</span>,int32<span class="k">*</span><span class="o">)[</span>OptimizedTier1]
5.23% <span class="o">[</span>.] ... ::Memmove<span class="o">(</span>uint8&,uint8&,uint64<span class="o">)[</span>OptimizedTier1]
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
<p>This is a trimmed summary of a <code class="highlighter-rouge">perf</code> session recording performance metrics, specifically: the number of instructions executed for running a 1M element sort 100 times, followed by running a 10K element sort, 10K times. I was shocked when I saw this for the first time, but we’re starting to understand the previous oddities we saw with the <code class="highlighter-rouge">Time/N</code> column!<br />
We’re spending upwards of 20% of our time doing scalar insertion sorting! I lured you here with promises of vectorized sorting and yet, somehow, “only” 65% of the time is spent doing “vectorized” work (which also has some scalar partitioning, if we’re honest). Not only that, but as the size of the array decreases, the percentage of time spent in scalar code <em>increases</em> (from 22.43% to 29.87%), which should not surprise us anymore.<br />
Before anything else, let me clearly state that this is not necessarily a bad thing! As the size of the partition decreases, the <em>benefit</em> of doing vectorized partitioning decreases in general, and even more so for our AVX2 partitioning, which has non-trivial start-up overhead. We shouldn’t care about the amount of time we’re spending on scalar code per se, but the amount of time taken to sort the entire array.<br />
The decision to go with scalar insertion-sort or stick to vectorized code is controlled by the threshold I mentioned before, which is still sitting there at <code class="highlighter-rouge">16</code>. We’re only beginning our optimization phase in the next post, so for now, we’ll stick with the threshold selected for <code class="highlighter-rouge">Array.Sort</code> by the CoreCLR developers; this is the “correct” starting point both in terms of allowing us to compare apples-to-apples, and also because I am a firm believer in doing very incremental modifications for this sort of work.<br />
Having said that, this is definitely something we will tweak later for our particular implementation.</p>
<h2 id="finishing-off-with-a-sour-taste">Finishing off with a sour taste</h2>
<p>I’ll end this post with a not so easy pill to swallow: let’s re-run <code class="highlighter-rouge">perf</code> and measure a different aspect of our code: Let’s see how the code is behaving in terms of top-level performance counters. The idea here is to use counters that our CPU is already capable of collecting at the hardware level, with almost no performance impact, to see where/if we’re hurting. What I’ll do before invoking <code class="highlighter-rouge">perf</code> is use a Linux utility called <a href="https://github.com/lpechacek/cpuset"><code class="highlighter-rouge">cset</code></a> which can be <a href="https://stackoverflow.com/a/13076880/9172">used to</a> evacuate all user threads and (almost all) kernel threads from a given physical CPU core, using <a href="https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/cgroup-v1/cpusets.rst">cpusets</a>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre><span class="nv">$ </span><span class="nb">sudo </span>cset shield <span class="nt">--cpu</span> 3 <span class="nt">-k</span> on
cset: <span class="nt">--</span><span class="o">></span> activating shielding:
cset: moving 638 tasks from root into system cpuset...
<span class="o">[==================================================]</span>%
cset: kthread shield activated, moving 56 tasks into system cpuset...
<span class="o">[==================================================]</span>%
cset: <span class="k">**</span><span class="o">></span> 38 tasks are not movable, impossible to move
cset: <span class="s2">"system"</span> cpuset of CPUSPEC<span class="o">(</span>0-2<span class="o">)</span> with 667 tasks running
cset: <span class="s2">"user"</span> cpuset of CPUSPEC<span class="o">(</span>3<span class="o">)</span> with 0 tasks running
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Once we have “shielded” a single CPU core, we execute the <code class="highlighter-rouge">Example</code> binary we used before in much the same way, this time collecting different top-level hardware statistics using the following <code class="highlighter-rouge">perf</code> command line:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="rouge-code"><pre><span class="nv">$ </span>perf <span class="nb">stat</span> <span class="nt">-a</span> <span class="nt">--topdown</span> <span class="nb">sudo </span>cset shield <span class="nt">-e</span> ./Example <span class="se">\</span>
<span class="nt">--type-list</span> DoublePumpedNaive <span class="nt">--size-list</span> 1000000
cset: <span class="nt">--</span><span class="o">></span> last message, executed args into cpuset <span class="s2">"/user"</span>, new pid is: 16107
Performance counter stats <span class="k">for</span> <span class="s1">'system wide'</span>:
retiring bad speculation frontend bound backend bound
...
S0-C3 1 37.6% 32.3% 16.9% 13.2%
3.221968791 seconds <span class="nb">time </span>elapsed
</pre></td></tr></tbody></table></code></pre></div></div>
<p>I’m purposely showing only the statistics collected for our shielded core since we know we only care about that core in the first place.</p>
<p>Here is some bad news: core #3 is really not having a good time running our code. <code class="highlighter-rouge">perf --topdown</code> is essentially screaming from the top of its lungs with that <code class="highlighter-rouge">32.3%</code> under the <code class="highlighter-rouge">bad speculation</code> column. This might seem like an innocent metric if you haven’t done this sort of thing before (in which case, read the info box below), but this is <strong>really bad</strong>. In plain English and <a href="https://easyperf.net/blog/2019/02/09/Top-Down-performance-analysis-methodology">without getting into the intricacies of top-down performance analysis</a>, this metric represents cycles where the CPU isn’t doing useful work because of an earlier mis-speculation. Here, the mis-speculation is mis-predicted branches. The penalty for <em>each</em> such mis-predicted branch is an entire flush of the pipeline (hence the wasted time), which costs us around 14-15 cycles on modern Intel CPUs.</p>
<table style="margin-bottom: 0em" class="notice--info">
<tr>
<td style="border: none; padding-top: 0; padding-bottom: 0; vertical-align: top"><span class="uk-label">Note</span></td>
<td style="border: none; padding-top: 0; padding-bottom: 0"><div>
<p>We have to remember that efficient execution on modern CPUs means keeping the CPU pipeline as busy as possible; this is quite a challenge given its length is about 15 stages, and the CPU itself is super-scalar (For example: an <a href="https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Individual_Core">Intel Skylake CPU has 8 ports</a> that can execute some instruction every cycle!). If, for example, all instructions in the CPU have a constant latency in cycles, this means it <em>has</em> to process 100+ instructions into “the future” while it’s just finishing up with a current one to avoid doing nothing. That’s enough of a challenge for regular code, but what should it do when it sees a branch? It could attempt to execute <strong>both</strong> branches, which quickly becomes a fool’s errand if there are even more branches close by. What CPU designers did was opt for speculative execution: add complex machinery to <em>predict</em> if a branch will be taken and speculatively execute the next instruction according to the prediction. But the predictor isn’t all-knowing, and it will mis-predict, and then we end up paying a huge penalty: The CPU will have to push those mis-predicted instructions through the pipeline, flushing the results out as if the whole thing never happened. This is why the rate of mis-prediction is a life-and-death matter when it comes to performance.</p>
</div>
</td>
</tr>
</table>
<p>Wait, I sense some optimistic thoughts all across the internet… maybe it’s not our precious vectorized, so-called branch-less code? Maybe we can chalk it all up to that mean scalar <code class="highlighter-rouge">InsertionSort</code> function doing those millions and millions of scalar comparisons? We are, after all, using it to sort small partitions, which we’ve already measured at more than 20% of the total run-time. Let’s see this again with <code class="highlighter-rouge">perf</code>, <em>this time</em> focusing on the <code class="highlighter-rouge">branch-misses</code> HW counter, and try to figure out how the mis-predictions are distributed amongst our call-stacks:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
</pre></td><td class="rouge-code"><pre><span class="nv">$ </span><span class="nb">export </span><span class="nv">COMPlus_PerfMapEnabled</span><span class="o">=</span>1 <span class="c"># Make perf speak to the JIT</span>
<span class="c"># Record some performance information:</span>
<span class="nv">$ </span>perf record <span class="nt">-F</span> max <span class="nt">-e</span> branch-misses ./Example <span class="se">\</span>
<span class="nt">--type-list</span> DoublePumpedNaive <span class="nt">--size-list</span> 1000000
...
<span class="nv">$ </span>perf report <span class="nt">--stdio</span> <span class="nt">-F</span> overhead,sym | <span class="nb">head</span> <span class="nt">-17</span>
...
40.97% <span class="o">[</span>.] ...::InsertionSort<span class="o">(!!</span>0<span class="k">*</span>,!!0<span class="k">*</span><span class="o">)[</span>Optimized]
32.30% <span class="o">[</span>.] ...::VectorizedPartitionInPlace<span class="o">(</span>int32<span class="k">*</span>,int32<span class="k">*</span>,int32<span class="k">*</span><span class="o">)[</span>Optimized]
9.64% <span class="o">[</span>.] ...::Memmove<span class="o">(</span>uint8&,uint8&,uint64<span class="o">)[</span>OptimizedTier1]
9.64% <span class="o">[</span>.] ...::QuickSortInt<span class="o">(</span>int32<span class="k">*</span>,int32<span class="k">*</span>,int32<span class="k">*</span>,int32<span class="k">*</span><span class="o">)[</span>OptimizedTier1]
5.62% <span class="o">[</span>.] ...::VectorizedPartitionOnStack<span class="o">(</span>int32<span class="k">*</span>,int32<span class="k">*</span>,int32<span class="k">*</span><span class="o">)[</span>Optimized]
...
</pre></td></tr></tbody></table></code></pre></div></div>
<p>No such luck. While <code class="highlighter-rouge">InsertionSort</code> is definitely starring here with 41% <em>of the</em> branch-misprediction events, we still have <strong>32%</strong> of the bad speculation coming from our own new vectorized code. This is a red flag as far as we’re concerned: it means that our vectorized code still contains a lot of mis-predicted branches. Given that we’re in the business of sorting (random data) and the high rate of recorded mis-predictions, the only logical conclusion is that we have data-dependent branches. Another thing to keep in mind is that the resulting pipeline flush is a large penalty to pay, given that our entire 8-element partition block has a throughput of around 8-9 cycles. That means we are taking that 15-cycle pan-to-the-face way too often to feel good about ourselves.</p>
<p>I’ll finish this post here. We have a <strong>lot of work</strong> cut out for us. This is nowhere near over.<br />
In the next post, I’ll try to give the current vectorized code a good shakeup. After all, it’s still our biggest target in terms of the number of instructions executed, and 2<sup>nd</sup> when it comes to branch mis-predictions. Once we finish squeezing that lemon for all its performance juice in the 4<sup>th</sup> post, we’ll turn our focus to the <code class="highlighter-rouge">InsertionSort</code> function in the 5<sup>th</sup> post, and we’ll see if we can appease the performance gods to make that part of the sorting effort faster.<br />
In the meantime, if you’re up for a small challenge, you can go back to the vectorized partitioning function and try to figure out what is causing all those nasty branch mis-predictions. We’ll be dealing with it head-on in the next post.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:0" role="doc-endnote">
<p>For some, <code class="highlighter-rouge">perf</code> wasn’t in the mood to show me function names without calling <code class="highlighter-rouge">dotnet publish</code> and using the resulting binary, and I didn’t care enough to investigate further… <a href="#fnref:0" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<hr />
<p><strong>This Goes to Eleven (Part 2/∞)</strong> (2020-01-29)</p>
<p>Since there’s a lot to go over here, I’ve split it up into no less than 6 parts:</p>
<ol>
<li>In <a href="/2020-01-28/this-goes-to-eleven-pt1">part 1</a>, we start with a refresher on <code class="highlighter-rouge">QuickSort</code> and how it compares to <code class="highlighter-rouge">Array.Sort()</code>.</li>
<li>In this part, we go over the basics of vectorized hardware intrinsics, vector types, and go over a handful of vectorized instructions we’ll use in part 3. We still won’t be sorting anything.</li>
<li>In <a href="/2020-01-30/this-goes-to-eleven-pt3">part 3</a>, we go through the initial code for the vectorized sorting, and start seeing some payoff. We finish agonizing courtesy of the CPU’s branch predictor, throwing a wrench into our attempts.</li>
<li>In part 4, we go over a handful of optimization approaches that I attempted trying to get the vectorized partitioning to run faster. We’ll see what worked and what didn’t.</li>
<li>In part 5, we’ll see how we can almost get rid of all the remaining scalar code- by implementing small-constant size array sorting. We’ll use, drum roll…, yet more AVX2 vectorization.</li>
<li>Finally, in part 6, I’ll list the outstanding stuff/ideas I have for getting more juice and functionality out of my vectorized code.</li>
</ol>
<h2 id="intrinsics--vectorization">Intrinsics / Vectorization</h2>
<p>I’ll start by repeating my own words from the first <a href="/2018-08-18/netcoreapp3.0-intrinsics-in-real-life-pt1#the-whatwhy-of-intrinsics">blog post where I discussed intrinsics</a> in the CoreCLR 3.0 alpha days:</p>
<blockquote>
<p>Processor intrinsics are a way to directly embed specific CPU instructions via special, fake method calls that the JIT replaces at code-generation time. Many of these instructions are considered exotic, and normal language syntax cannot map them cleanly.<br />
The general rule is that a single intrinsic “function” becomes a single CPU instruction.</p>
</blockquote>
<p>You can go and re-read that introduction if you care for a more general and gentle introduction to processor intrinsics. For this series, we are going to focus on vectorized intrinsics in Intel processors. This is the largest group of CPU specific intrinsics in our processors, and I want to start by showing this by the numbers. I gathered some statistics by processing Intel’s own <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/files/data-3.4.6.xml">data-3.4.6.xml</a>. This XML file is part of the <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/">Intel Intrinsics Guide</a>, an invaluable resource on intrinsics in itself, and the “database” behind the guide. What I learned was that:</p>
<ul>
<li>There are no less than 1,218 intrinsics in Intel processors<sup id="fnref:0" role="doc-noteref"><a href="#fn:0" class="footnote">1</a></sup>!
<ul>
<li>Those can be combined in 6,180 different ways (according to operand sizes and types).</li>
<li>They’re grouped into 67 different categories/groups, these groups loosely correspond to various generations of CPUs as more and more intrinsics were gradually added.</li>
</ul>
</li>
<li>More than 94% are vectorized hardware intrinsics, which we’ll define more concretely below.</li>
</ul>
<p>That last point is super-critical: CPU intrinsics, at least in 2020, are overwhelmingly about being able to execute vectorized instructions. That’s really why you <em>should</em> be paying attention to them in the first place. Sure, there’s additional stuff in there if you’re a kernel developer, writing crypto code, or covering some other niche case, but vectorization is why you’re really here, whether you knew it or not.</p>
<p>In C#, we’ve mostly shied away from having intrinsics until CoreCLR 3.0 came along, where intrinsic support became official/complete, championed by <a href="https://twitter.com/tannergooding">@tannergooding</a> as well as others from Microsoft and Intel. As single-threaded performance has virtually stopped improving, more programming languages have started adding intrinsics support (Go, Rust, Java, and now C#) so that developers in those languages have access to these specialized, much more efficient instructions. CoreCLR 3.0 does not support all 1,218 intrinsics that I found, but a more modest 226 intrinsics in <a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86?view=netcore-3.0&viewFallbackFrom=dotnet-plat-ext-3.0">15 different classes</a> for x86 Intel and AMD processors. Each class is filled with many static functions, all of which are unique processor intrinsics representing a 1:1 mapping to Intel group/code names. As C# developers, we roughly get access to everything Intel incorporated in their processors manufactured from 2014 onwards<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">2</a></sup>, and for AMD processors, from 2015 onwards.</p>
<p>What are these vectorized intrinsics?<br />
We need to cover a few base concepts specific to that category of intrinsics before we can start explaining specific intrinsics/instructions:</p>
<ul>
<li>What are vectorized intrinsics, and why have they become so popular.</li>
<li>How vectorized intrinsics interact with specialized vectorized <em>registers</em>.</li>
<li>How those registers are reflected as, essentially, new primitive types in CoreCLR 3.0.</li>
</ul>
<h3 id="simd-what--why">SIMD What & Why</h3>
<p>I’m going to use vectorization and SIMD interchangeably from here-on, but for the first and last time, let’s spell out what SIMD is: <strong>S</strong>ingle <strong>I</strong>nstruction <strong>M</strong>ultiple <strong>D</strong>ata is really a simple idea when you think about it. A lot of code ends up doing “stuff” in loops, usually, processing vectors of data one element at a time. SIMD instructions bring a simple new idea to the table: The CPU adds special instructions that can do arithmetic, bit-operations, comparisons and many other types of generalized operations on “vectors”, e.g. process multiple elements per instruction.</p>
<p>The benefit of this approach to computing is much greater efficiency: with vectorized intrinsics, a single instruction processes, for example, 8 data elements at once, so we reduce the amount of time the CPU spends decoding instructions for the same amount of work; furthermore, most vectorized instructions operate <em>independently</em> on the various <strong>elements</strong> of the vector and complete in the same number of CPU cycles as the equivalent non-vectorized (or scalar) instruction. In short, in the land of CPU feature economics, vectorization is considered a high bang-for-buck feature: you get a lot of <em>potential</em> performance for relatively few transistors added to the CPU, as long as people are willing to adapt their code (e.g. rewrite it) to use these new intrinsics, or compilers somehow magically manage to auto-vectorize the code (spoiler: there are tons of problems with that too)<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote">3</a></sup>.</p>
<p>Another equally important thing to embrace and understand about vectorized intrinsics is what they don’t and cannot provide: branching. It’s pretty much impossible to even imagine what a vectorized branch instruction would mean; these two concepts don’t begin to mix. Appropriately, a substantial part of vectorizing code is forcing oneself to accomplish the given task without using branching. As we will see, branching begets unpredictability at the CPU level, and unpredictability is our enemy when we want to go fast.</p>
<p>Of course, I’m grossly over-romanticizing vectorized intrinsics and their benefits: there are also many non-trivial overheads involved, both in adding them to our processors and in using them in our code. However, all in all, in the grand picture of CPU/performance economics, adding and using vectorized instructions is still, compared to other potential improvements, quite cheap, under the assumption that programmers are willing to make the effort to re-write and maintain vectorized code.</p>
<h4 id="simd-registers">SIMD registers</h4>
<p>After our short introduction to vectorized intrinsics, we need to discuss SIMD registers, and how this piece of the puzzle fits the grand picture: Teaching our CPU to execute 1,000+ vectorized instructions is just part of the story, these instructions need to somehow operate on our data. Do all of these instructions simply take a pointer to memory and run wild with it? The short answer is: <strong>No</strong>. For the <em>most</em> part, CPU instructions dealing with vectorization (with a few notable exceptions) use special registers inside our CPU that are called SIMD registers. This is analogous to scalar (regular, non-vectorized) code we write in any programming language: while some instructions read and write directly to memory, and occasionally some instruction will accept a memory address as an operand, most instructions are register ↔ register only.</p>
<p>Just like scalar CPU registers, SIMD registers have a constant bit-width. On Intel these come in 64-, 128-, 256-, and more recently 512-bit widths. Unlike scalar registers, though, SIMD registers end up <em>containing multiple</em> data-elements of another primitive type. The same register can and will be used to process a wide range of primitive data-types, depending on which instruction is using it, as we will shortly demonstrate.</p>
<p>For now, this is all I care to explain about SIMD registers at the CPU level: we need to be aware of their existence (we’ll see them in disassembly dumps anyway), and since we are dealing with high-performance code, we kind of need to know how many of them exist inside our CPU.</p>
<h4 id="simd-intrinsic-types-in-c">SIMD Intrinsic Types in C#</h4>
<p>We’ve touched lightly upon SIMD intrinsics and how they operate (e.g. accept and modify) on SIMD registers. Time to figure out how we can fiddle with everything in C#; we’ll start with the types:</p>
<table>
<thead>
<tr>
<th>C# Type</th>
<th style="text-align: center">x86 Registers</th>
<th style="text-align: center">Width (bits)</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.vector64?view=netcore-3.0"><code class="highlighter-rouge">Vector64<T></code></a></td>
<td style="text-align: center"><code class="highlighter-rouge">mm0-mm7</code></td>
<td style="text-align: center">64</td>
</tr>
<tr>
<td><a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.vector128?view=netcore-3.0"><code class="highlighter-rouge">Vector128<T></code></a></td>
<td style="text-align: center"><code class="highlighter-rouge">xmm0-xmm15</code></td>
<td style="text-align: center">128</td>
</tr>
<tr>
<td><a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.vector256?view=netcore-3.0"><code class="highlighter-rouge">Vector256<T></code></a></td>
<td style="text-align: center"><code class="highlighter-rouge">ymm0-ymm15</code></td>
<td style="text-align: center">256</td>
</tr>
</tbody>
</table>
<p>These are primitive vector value-types recognized by the JIT while it is generating machine code. We should try and think about these types just like we think about other special-case primitive types such as <code class="highlighter-rouge">int</code> or <code class="highlighter-rouge">double</code>, with one exception: these vector types all accept a generic parameter <code class="highlighter-rouge"><T></code>, which may seem a little odd for a primitive type at first glance, until we remember that their purpose is to contain <em>other</em> primitive types (there’s a reason they put the word “Vector” in there…); moreover, this generic parameter can’t just be any type or even value-type we’d like… It is limited to the types supported on our CPU and its vectorized intrinsics.</p>
<p>Let’s take <code class="highlighter-rouge">Vector256<T></code>, which I’ll be using exclusively in this series, as an example; <code class="highlighter-rouge">Vector256<T></code> can be used <strong>only</strong> with the following primitive types:</p>
<table class="fragment">
<thead><th style="border: none"><code>typeof(T)</code></th>
<th />
<th style="border: none"># Elements</th>
<th style="border: none"></th>
<th style="border: none">Element Width (bits)</th>
</thead>
<tbody>
<tr><td style="border: none"><code>byte / sbyte</code></td> <td style="border: none">➡</td><td style="border: none">32</td><td style="border: none">x</td><td style="border: none">8b</td></tr>
<tr><td style="border: none"><code>short / ushort</code></td><td style="border: none">➡</td> <td style="border: none">16</td><td style="border: none">x</td><td style="border: none">16b</td></tr>
<tr><td style="border: none"><code>int / uint</code></td> <td style="border: none">➡</td> <td style="border: none">8</td><td style="border: none">x</td><td style="border: none">32b</td></tr>
<tr><td style="border: none"><code>long / ulong</code></td> <td style="border: none">➡</td> <td style="border: none">4</td><td style="border: none">x</td><td style="border: none">64b</td></tr>
<tr><td style="border: none"><code>float</code></td><td style="border: none">➡</td> <td style="border: none">8</td><td style="border: none">x</td><td style="border: none">32b</td></tr>
<tr><td style="border: none"><code>double</code></td> <td style="border: none">➡</td> <td style="border: none">4</td><td style="border: none">x</td><td style="border: none">64b</td></tr>
</tbody>
</table>
<p>No matter which of the supported primitive types we choose, we’ll end up with a total of 256 bits, or the underlying SIMD register width.<br />
Now that we’ve kind of figured out how vector types/registers are represented in C#, let’s perform some operations on them.</p>
<h3 id="a-few-vectorized-instructions-for-the-road">A few Vectorized Instructions for the road</h3>
<p>Armed with this new understanding and knowledge of <code class="highlighter-rouge">Vector256<T></code> we can move on and start learning a few vectorized instructions.</p>
<p>Chekhov famously said: “If in the first act you have hung a pistol on the wall, then in the following one it should be fired. Otherwise, don’t put it there”. Here are seven loaded AVX2 pistols; rest assured they are about to fire in the next act. I’m obviously not going to explain all 1,000+ intrinsics mentioned before, if only not to piss off Anton Chekhov. We will <strong>thoroughly</strong> explain the ones needed to get this party going.<br />
Here’s the list of what we’re going to go over:</p>
<table>
<thead>
<tr>
<th style="text-align: left">x64 asm</th>
<th style="text-align: center">Intel</th>
<th style="text-align: right">CoreCLR</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">vbroadcastd</code></td>
<td style="text-align: center"><a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_broadcastd_epi32&expand=542"><code class="highlighter-rouge">_mm256_broadcastd_epi32</code></a></td>
<td style="text-align: right"><a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.vector256.create?view=netcore-3.0#System_Runtime_Intrinsics_Vector256_Create_System_Int32_"><code class="highlighter-rouge">Vector256.Create(int)</code></a></td>
</tr>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">vlddqu</code></td>
<td style="text-align: center"><a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_lddqu_si256&expand=3296"><code class="highlighter-rouge">_mm256_lddqu_si256</code></a></td>
<td style="text-align: right"><a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.avx.loaddquvector256?view=netcore-3.0#System_Runtime_Intrinsics_X86_Avx_LoadDquVector256_System_Int32__"><code class="highlighter-rouge">Avx.LoadDquVector256</code></a></td>
</tr>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">vmovdqu</code></td>
<td style="text-align: center"><a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_storeu_si256&expand=5654"><code class="highlighter-rouge">_mm256_storeu_si256</code></a></td>
<td style="text-align: right"><a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.avx.store?view=netcore-3.0#System_Runtime_Intrinsics_X86_Avx_Store_System_Int32__System_Runtime_Intrinsics_Vector256_System_Int32__"><code class="highlighter-rouge">Avx.Store</code></a></td>
</tr>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">vpcmpgtd</code></td>
<td style="text-align: center"><a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_cmpgt_epi32&expand=900"><code class="highlighter-rouge">_mm256_cmpgt_epi32</code></a></td>
<td style="text-align: right"><a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.avx2.comparegreaterthan?view=netcore-3.0#System_Runtime_Intrinsics_X86_Avx2_CompareGreaterThan_System_Runtime_Intrinsics_Vector256_System_Int32__System_Runtime_Intrinsics_Vector256_System_Int32__"><code class="highlighter-rouge">Avx2.CompareGreaterThan</code></a></td>
</tr>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">vmovmskps</code></td>
<td style="text-align: center"><a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_movemask_ps&expand=3870"><code class="highlighter-rouge">_mm256_movemask_ps</code></a></td>
<td style="text-align: right"><a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.avx.movemask?view=netcore-3.0#System_Runtime_Intrinsics_X86_Avx_MoveMask_System_Runtime_Intrinsics_Vector256_System_Single__"><code class="highlighter-rouge">Avx.MoveMask</code></a></td>
</tr>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">popcnt</code></td>
<td style="text-align: center"><a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_popcnt_u32&expand=4378"><code class="highlighter-rouge">_mm_popcnt_u32</code></a></td>
<td style="text-align: right"><a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.popcnt.popcount?view=netcore-3.0#System_Runtime_Intrinsics_X86_Popcnt_PopCount_System_UInt32_"><code class="highlighter-rouge">Popcnt.PopCount</code></a></td>
</tr>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">vpermd</code></td>
<td style="text-align: center"><a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_permutevar8x32_epi32&expand=4201"><code class="highlighter-rouge">_mm256_permutevar8x32_epi32</code></a></td>
<td style="text-align: right"><a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.avx2.permutevar8x32?view=netcore-3.0#System_Runtime_Intrinsics_X86_Avx2_PermuteVar8x32_System_Runtime_Intrinsics_Vector256_System_Int32__System_Runtime_Intrinsics_Vector256_System_Int32__"><code class="highlighter-rouge">Avx2.PermuteVar8x32</code></a></td>
</tr>
</tbody>
</table>
<p>I understand that for first time readers, this list looks like I’m just name-dropping lots of fancy code names to make myself sound smart, but the unfortunate reality is that we <em>kind of need</em> to know all of these, and here is why: On the right column I’ve provided the actual C# Intrinsic function we will be calling in our code and linked to their docs. But here’s a funny thing: There is no “usable” documentation on Microsoft’s own docs regarding most of these intrinsics. All those docs do is simply point back to the Intel C/C++ intrinsic name, which I’ve also provided in the middle column, with links to the real documentation, the sort that actually explains what the instruction does with pseudo code. Finally, since we’re practically writing assembly code anyways, and I can guarantee we’ll end up inspecting JIT’d code down the road, I provided the x86 assembly opcodes for our instructions as well.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote">4</a></sup>
Now, what does each of these do? Let’s find out…</p>
<table style="margin-bottom: 0em" class="notice--info">
<tr>
<td style="border: none"><span class="uk-label">Hint</span></td>
<td style="border: none">From here-on, The following icon means I have a thingy that animates: <object style="margin: auto; position: relative; top: 1.1em" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/play.svg"></object><br />
Click/Touch/Hover <b>inside</b> means: <i class="glyphicon glyphicon-play"></i><br />
Click/Touch/Hover <b>outside</b> means: <i class="glyphicon glyphicon-pause"></i>
</td>
</tr>
</table>
<h4 id="vector256createint-value">Vector256.Create(int value)</h4>
<div>
<div class="stickemup">
<object class="animated-border" width="100%" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/inst-animations/vbroadcast-with-hint.svg"></object>
</div>
<p>We start with a couple of simple instructions, and nothing is simpler than this first one: this intrinsic accepts a single scalar value and simply “broadcasts” it to an entire SIMD register. This is how you’d use it:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre><span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="n">someVector256</span> <span class="p">=</span> <span class="n">Vector256</span><span class="p">.</span><span class="nf">Create</span><span class="p">(</span><span class="m">0x42</span><span class="p">);</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>This will load up <code class="highlighter-rouge">someVector256</code> with 8 copies of <code class="highlighter-rouge">0x42</code> once executed, and in x64 assembly, the JIT will produce something quite simple:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="nf">vmovd</span> <span class="nv">xmm0</span><span class="p">,</span> <span class="nb">rax</span> <span class="c1">; 3 cycle latency / 1 cycle throughput</span>
<span class="nf">vpbroadcastd</span> <span class="nv">ymm0</span><span class="p">,</span> <span class="nv">xmm0</span> <span class="c1">; 3 cycle latency / 1 cycle throughput</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>This specific intrinsic is translated into two Intel opcodes, since there is no single instruction that performs this directly.</p>
</div>
<h4 id="avx2loaddquvector256--avxstore">Avx2.LoadDquVector256 / Avx.Store</h4>
<div>
<div class="stickemup">
<object class="animated-border" width="100%" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/inst-animations/lddqu-with-hint.svg"></object>
</div>
<p>Next up we have a couple of brain-dead simple intrinsics that we use to read from memory into SIMD registers and, conversely, store from SIMD registers back to memory. These are amongst the most common intrinsics out there, as you can imagine:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="kt">int</span> <span class="p">*</span><span class="n">ptr</span> <span class="p">=</span> <span class="p">...;</span> <span class="c1">// Get some pointer to a big enough array</span>
<span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="n">data</span> <span class="p">=</span> <span class="n">Avx</span><span class="p">.</span><span class="nf">LoadDquVector256</span><span class="p">(</span><span class="n">ptr</span><span class="p">);</span>
<span class="p">...</span>
<span class="n">Avx</span><span class="p">.</span><span class="nf">Store</span><span class="p">(</span><span class="n">ptr</span><span class="p">,</span> <span class="n">data</span><span class="p">);</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>And in x64 assembly:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="nf">vlddqu</span> <span class="nv">ymm1</span><span class="p">,</span> <span class="nv">ymmword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">]</span> <span class="c1">; 5 cycle latency + cache/memory</span>
<span class="c1">; 0.5 cycle throughput</span>
<span class="nf">vmovdqu</span> <span class="nv">ymmword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">],</span> <span class="nv">ymm1</span> <span class="c1">; Same as above</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>I only included an SVG animation for <code class="highlighter-rouge">LoadDquVector256</code>, but you can use your imagination and visualize how <code class="highlighter-rouge">Store</code> simply does the same thing in reverse.</p>
</div>
<h4 id="comparegreaterthan">CompareGreaterThan</h4>
<div>
<div class="stickemup">
<object class="animated-border" width="100%" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/inst-animations/vpcmpgtd-with-hint.svg"></object>
</div>
<p><code class="highlighter-rouge">CompareGreaterThan</code> does an <em>n</em>-way, element-by-element <em>greater-than</em> (<code class="highlighter-rouge">></code>) comparison between two <code class="highlighter-rouge">Vector256<T></code> instances. In our case where <code class="highlighter-rouge">T</code> is really <code class="highlighter-rouge">int</code>, this means comparing 8 integers in one go, instead of performing 8 comparisons serially!</p>
<p>But where is the result? In a new <code class="highlighter-rouge">Vector256<int></code> of course! The resulting vector contains 8 results for the corresponding comparisons between the elements of the first and second vectors. For each position where the element in the first vector is <em>greater-than</em> (<code class="highlighter-rouge">></code>) the corresponding element in the second, the matching element in the result vector gets a <code class="highlighter-rouge">-1</code> value (all bits set), or <code class="highlighter-rouge">0</code> otherwise.<br />
Calling this is rather simple:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="n">data</span><span class="p">,</span> <span class="n">comperand</span><span class="p">;</span>
<span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="n">result</span> <span class="p">=</span>
<span class="n">Avx2</span><span class="p">.</span><span class="nf">CompareGreaterThan</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">comperand</span><span class="p">);</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>And in x64 assembly, this is pretty simple too:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="nf">vpcmpgtd</span> <span class="nv">ymm2</span><span class="p">,</span> <span class="nv">ymm1</span><span class="p">,</span> <span class="nv">ymm0</span> <span class="c1">; 1 cycle latency</span>
<span class="c1">; 0.5 cycle throughput</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
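<p>To make the comparison semantics concrete, here is a small, hypothetical example of my own (the values are made up, and naturally, running it requires an AVX2-capable CPU):</p>

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

var data      = Vector256.Create(1, 5, 3, 9, 7, 2, 8, 4);
var comparand = Vector256.Create(5); // broadcast 5 into all 8 elements
var result    = Avx2.CompareGreaterThan(data, comparand);
// result == <0, 0, 0, -1, -1, 0, -1, 0>:
// -1 (all bits set) wherever data[i] > 5, and 0 everywhere else
```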
</div>
<h4 id="movemask">MoveMask</h4>
<div>
<div class="stickemup">
<object class="animated-border" width="100%" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/inst-animations/vmovmskps-with-hint.svg"></object>
</div>
<p>Another intrinsic which will prove to be very useful is the ability to extract some bits from a vectorized register into a normal, scalar one. <code class="highlighter-rouge">MoveMask</code> does just this. This intrinsic takes the top-most (most significant) bit from every element and moves it into our scalar result:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="n">data</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">result</span> <span class="p">=</span> <span class="n">Avx</span><span class="p">.</span><span class="nf">MoveMask</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="nf">AsSingle</span><span class="p">());</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>There’s an oddity here, as you can tell by that awkward <code class="highlighter-rouge">.AsSingle()</code> call; try to ignore it if you can, or hit this footnote<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote">5</a></sup> if you can’t. The assembly instruction here is exactly as simple as you would think:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="nf">vmovmskps</span> <span class="nb">rax</span><span class="p">,</span> <span class="nv">ymm2</span> <span class="c1">; 5 cycle latency</span>
<span class="c1">; 1 cycle throughput</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
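<p>Here is a hypothetical example of my own showing how an 8-element comparison result collapses into a single scalar bitmask (again, the values are made up, and an AVX2-capable CPU is assumed):</p>

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

var data = Vector256.Create(1, 5, 3, 9, 7, 2, 8, 4);
var gt   = Avx2.CompareGreaterThan(data, Vector256.Create(5));
int mask = Avx.MoveMask(gt.AsSingle());
// Bit i of mask is the MSB of element i:
// elements 3, 4 and 6 were greater than 5, so mask == 0b0101_1000
```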
</div>
<h4 id="popcount">PopCount</h4>
<p><code class="highlighter-rouge">PopCount</code> is a very powerful intrinsic, which <a href="/2018-08-19/netcoreapp3.0-intrinsics-in-real-life-pt2">I’ve covered extensively before</a>: <code class="highlighter-rouge">PopCount</code> returns the number of <code class="highlighter-rouge">1</code> bits in a 32/64 bit primitive.<br />
In C#, we would use it as follows:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="kt">uint</span> <span class="n">result</span> <span class="p">=</span> <span class="n">Popcnt</span><span class="p">.</span><span class="nf">PopCount</span><span class="p">(</span><span class="m">0</span><span class="n">b0000111100110011</span><span class="p">);</span>
<span class="c1">// result == 8</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>And in x64 assembly code:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="nf">popcnt</span> <span class="nb">rax</span><span class="p">,</span> <span class="nb">rdx</span> <span class="c1">; 3 cycle latency</span>
<span class="c1">; 1 cycle throughput</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>In this series, <code class="highlighter-rouge">PopCount</code> is the only intrinsic I use that is not purely vectorized<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote">6</a></sup>.</p>
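<p>Chaining the two previous intrinsics into <code class="highlighter-rouge">PopCount</code> is where the payoff starts to show; here's a hypothetical example of mine counting, in a handful of instructions, how many of 8 elements exceed some value (an AVX2-capable CPU is assumed):</p>

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

var data   = Vector256.Create(1, 5, 3, 9, 7, 2, 8, 4);
var gt     = Avx2.CompareGreaterThan(data, Vector256.Create(5));
int mask   = Avx.MoveMask(gt.AsSingle());
uint count = Popcnt.PopCount((uint) mask);
// count == 3: exactly 3 of the 8 elements were greater than 5
```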
<h4 id="permutevar8x32">PermuteVar8x32</h4>
<div>
<div class="stickemup">
<object class="animated-border" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/inst-animations/vpermd-with-hint.svg"></object>
</div>
<p><code class="highlighter-rouge">PermuteVar8x32</code> accepts two vectors, source and permutation, and performs a permutation operation <strong>on</strong> the source value <em>according to the order provided</em> in the permutation value. If this sounds confusing, go straight to the visualization above…</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="n">data</span><span class="p">,</span> <span class="n">perm</span><span class="p">;</span>
<span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="n">result</span> <span class="p">=</span> <span class="n">Avx2</span><span class="p">.</span><span class="nf">PermuteVar8x32</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">perm</span><span class="p">);</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>While technically speaking, both the <code class="highlighter-rouge">data</code> and <code class="highlighter-rouge">perm</code> parameters are of type <code class="highlighter-rouge">Vector256<int></code> and can contain any integer value in their elements, only the 3 least significant bits in <code class="highlighter-rouge">perm</code> are taken into account for permutation of the elements in <code class="highlighter-rouge">data</code>.<br />
This should make sense, as we are permuting an 8-element vector, so we need 3 bits (2<sup>3</sup> == 8) in every permutation element to figure out which element goes where… In x64 assembly this is:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="nf">vpermd</span> <span class="nv">ymm1</span><span class="p">,</span> <span class="nv">ymm2</span><span class="p">,</span> <span class="nv">ymm1</span> <span class="c1">; 3 cycle latency</span>
<span class="c1">; 1 cycle throughput</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
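<p>A hypothetical example of my own, to make the permutation semantics concrete; here we reverse an 8-element vector (the values and the permutation are made up, and an AVX2-capable CPU is assumed):</p>

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

var data   = Vector256.Create(10, 20, 30, 40, 50, 60, 70, 80);
var perm   = Vector256.Create(7, 6, 5, 4, 3, 2, 1, 0);
var result = Avx2.PermuteVar8x32(data, perm);
// result == <80, 70, 60, 50, 40, 30, 20, 10>
// Since only the 3 least significant bits of each perm element matter,
// a perm of <15, 14, 13, 12, 11, 10, 9, 8> produces the exact same result.
```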
</div>
<h3 id="thats-it-for-now">That’s it for now</h3>
<p>This post was all about laying the groundwork before this whole mess comes together.<br />
Remember, we’re re-implementing QuickSort with AVX2 intrinsics in this series, which for the most part, means re-implementing the partitioning function from our scalar code listing in the previous post.<br />
I’m sure wheels are turning in many heads now as you are trying to figure out what comes next…<br />
I think this is as good a time as any to end this post and leave you with a suggestion: Try to take a piece of paper or your favorite text editor, and see if you can cobble these instructions together into something that can partition numbers given a selected pivot.</p>
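<p>If you'd like a nudge before moving on, here is one possible shape such a partition block could take. To be clear: this is a rough sketch of my own, not the actual code from the next post, and <code class="highlighter-rouge">pivot</code>, the pointers, and <code class="highlighter-rouge">lookupTable</code> (a hypothetical, precomputed table mapping each of the 256 possible masks to a permutation that shuffles the smaller-than elements to the front) are all assumed:</p>

```csharp
// Rough sketch only: assumes an unsafe context, an AVX2-capable CPU, and a
// hypothetical precomputed `lookupTable` of 256 permutation vectors.
var P    = Vector256.Create(pivot);
var data = Avx.LoadDquVector256(readPtr);             // read 8 elements
var gt   = Avx2.CompareGreaterThan(data, P);          // 8-way comparison
int mask = Avx.MoveMask(gt.AsSingle());               // 8-bit scalar mask
data = Avx2.PermuteVar8x32(data, lookupTable[mask]);  // smaller-than first
Avx.Store(writePtr, data);                            // write all 8 back
writePtr += 8 - Popcnt.PopCount((uint) mask);         // keep only the
                                                      // smaller-than prefix
```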
<p>When you’re ready, head on to the <a href="/2020-01-30/this-goes-to-eleven-pt3">next post</a> to see how the whole thing comes together, and how fast we can get it to run with a basic version…</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:0" role="doc-endnote">
<p>To be clear, some of these intrinsics are only available in unreleased processors, and even among those that are released in the wild, there is no single processor supporting all of these… <a href="#fnref:0" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:1" role="doc-endnote">
<p>CoreCLR supports roughly everything up to and including the AVX2 intrinsics, which were introduced with the Intel Haswell processor, near the end of 2013. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>In general, auto-vectorizing compilers are a huge subject in their own right, but the bottom line is that without completely changing the syntax and concepts of our programming language, there is very little that an auto-vectorizing compiler can do with existing code; making one that really works often involves designing a programming language with vectorization baked into it from day one. I really recommend reading <a href="https://pharr.org/matt/blog/2018/04/30/ispc-all.html">this series about Intel’s attempt</a> in this space if you are into this sort of thing. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Now, If I was in my annoyed state of mind, I’d bother to mention that <a href="https://github.com/dotnet/corefx/issues/2209#issuecomment-317124449">I personally always thought</a> that introducing 200+ functions with already established names (in C/C++/rust) and forcing everyone to learn new names whose only saving grace is that they look BCL<em>ish</em> to begin with was not the friendliest move on Microsoft’s part, and that trying to give C# names to the utter mess that Intel created in the first place was a thankless effort that would only annoy everyone more, and would eventually run up against the inhumane names Intel went for (Yes, I’m looking at you <code class="highlighter-rouge">LoadDquVector256</code>, you are not looking very BCL-ish to me with the <code class="highlighter-rouge">Dqu</code> slapped in the middle there : (╯°□°)╯︵ ┻━┻)… But thankfully, I’m not in my annoyed state. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>While this looks like we’re really doing “something” with our <code class="highlighter-rouge">Vector256<int></code> and somehow casting it to single-precision floating point values, let me assure you, this is just smoke and mirrors: The intrinsic simply accepts only floating point values (32/64 bit ones), so we have to “cast” the data to <code class="highlighter-rouge">Vector256<float></code>, or alternatively call <code class="highlighter-rouge">.AsSingle()</code> before calling <code class="highlighter-rouge">MoveMask</code>. Yes, this is super awkward from a pure C# perspective, but in reality, the JIT understands these shenanigans and really ignores them completely. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>By the way, although this intrinsic neither accepts nor returns one of the SIMD registers/types, and is considered a non-vectorized intrinsic as far as classification goes, as far as I’m concerned, bit-level intrinsic functions that operate on scalar registers are just as “vectorized” as their “pure” vectorized sisters, since they mostly deal with scalar values as vectors of bits. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>

This Goes to Eleven (Part 1/∞), 2020-01-28, https://bits.houmus.org/2020-01-28/this-goes-to-eleven-pt1
<h1 id="lets-do-this">Let’s do this</h1>
<p>Let’s get in the ring and show what AVX/AVX2 intrinsics can really do for a non-trivial problem, and even discuss potential improvements that future CoreCLR versions could bring to the table.</p>
<p>Everyone needs to sort arrays, once in a while, and many algorithms we take for granted rely on doing so. We think of it as a <em>solved</em> problem, one where nothing <em>further</em> can be done about it in 2020, except for waiting for newer, marginally faster machines to pop up<sup id="fnref:0" role="doc-noteref"><a href="#fn:0" class="footnote">1</a></sup>. However, that is not the case, and while I’m not the first to have thought about it, nor the best at implementing it, if you join me in this rather long journey, we’ll end up with a replacement function for <code class="highlighter-rouge">Array.Sort</code>, written in pure C#, that outperforms CoreCLR’s C++<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote">2</a></sup> code by a factor north of 10x on most modern Intel CPUs, and north of 11x on my laptop.<br />
Sounds interesting? If so, down the rabbit hole we go…</p>
<table style="margin-bottom: 0em" class="notice--warning">
<tr>
<td style="border: none;vertical-align: top"><span class="uk-label">Note</span></td>
<td style="border: none"><div>
<p>In the final days before posting this series, Intel started seeding a CPU microcode update that is/was affecting the performance of the released version of CoreCLR 3.0/3.1 quite considerably. I managed to stir up a <a href="https://twitter.com/damageboy/status/1194751035136450560">small commotion</a> as this was unraveling in my benchmarks. As it happened, my code was (not coincidentally) less affected by this change, while CoreCLR’s <code class="highlighter-rouge">Array.Sort()</code> <a href="https://github.com/dotnet/coreclr/issues/27877">took a 20% nosedive</a>. Let it never be said I’m anything less than chivalrous, for I rolled back the microcode update, and for this <strong>entire</strong> series, I’m going to run against a much faster version of <code class="highlighter-rouge">Array.Sort()</code> than what you, the reader, are probably using, assuming you update your machine from time to time. For the technically inclined, here’s a whole footnote<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote">3</a></sup> on how to double-check what your machine is actually running. I also opened two issues in the CoreCLR repo about attempting to mitigate this both in CoreCLR’s C++ code and separately in the JIT. If/when there is movement on those fronts, the microcode you’re running will become less of an issue to begin with, but for now, this just adds another level of unwarranted complexity to our lives.</p>
</div>
</td>
</tr>
</table>
<p>A while back now, I was reading the post by Stephen Toub about <a href="https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-core-3-0/">Improvements in CoreCLR 3.0</a>, and it became apparent that hardware intrinsics were common to many of these, and that so many parts of CoreCLR could still be sped up with these techniques, that one thing led to another, and I decided an attempt to apply hardware intrinsics to a larger problem than I had previously done myself was in order. To see if I could rise to the challenge, I decided to take on array sorting and see how far I can go.</p>
<p>What I came up with eventually would become a re-write of <code class="highlighter-rouge">Array.Sort()</code> with AVX2 hardware intrinsics. Fortunately, choosing sorting and focusing on QuickSort makes for a great blog post series, since:</p>
<ul>
<li>Everyone should be familiar with the domain and even the original (sorting is the bread and butter of learning computer science, really, and QuickSort is the queen of all sorting algorithms).</li>
<li>It’s relatively easy to explain/refresh on the original.</li>
<li>If I can make it there, I can make it anywhere.</li>
<li>I had no idea how to do it.</li>
</ul>
<p>I started with searching various keywords and found an interesting paper titled: <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1009.7773&rep=rep1&type=pdf">Fast Quicksort Implementation Using AVX Instructions</a> by Shay Gueron and Vlad Krasnov. That title alone made me think this is about to be a walk in the park. While initially promising, it wasn’t good enough as a drop-in replacement for <code class="highlighter-rouge">Array.Sort</code> for reasons I’ll shortly go into. I ended up having a lot of fun expanding on their basic approach. <a href="https://github.com/dotnet/runtime/pull/33152#issuecomment-596405021"><del>I will submit a proper pull-request to start a discussion with CoreCLR devs about integrating this code into the main dotnet repository</del></a><sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote">4</a></sup>, but for now, let’s talk about sorting.</p>
<p>Since there’s a lot to go over here, I’ve split it up into no less than 6 parts:</p>
<ol>
<li>In this part, we start with a refresher on QuickSort and how it compares to <code class="highlighter-rouge">Array.Sort()</code>. If you don’t need a refresher, skip it and get right down to part 2 and onwards. I recommend skimming through, mostly because I’ve got excellent visualizations which should be in the back of everyone’s mind as we deal with vectorization & optimization later.</li>
<li>In <a href="/2020-01-29/this-goes-to-eleven-pt2">part 2</a>, we go over the basics of vectorized hardware intrinsics, vector types, and go over a handful of vectorized instructions we’ll use in part 3. We still won’t be sorting anything.</li>
<li>In <a href="/2020-01-30/this-goes-to-eleven-pt3">part 3</a>, we go through the initial code for the vectorized sorting, and we’ll start seeing some payoff. We finish agonizing courtesy of the CPU’s Branch Predictor, throwing a wrench into our attempts.</li>
<li>In part 4, we go over a handful of optimization approaches that I attempted trying to get the vectorized partitioning to run faster. We’ll see what worked and what didn’t.</li>
<li>In part 5, we’ll see how we can almost get rid of all the remaining scalar code- by implementing small-constant size array sorting. We’ll use, drum roll…, yet more AVX2 vectorization.</li>
<li>Finally, in part 6, I’ll list the outstanding stuff/ideas I have for getting more juice and functionality out of my vectorized code.</li>
</ol>
<h2 id="quicksort-crash-course">QuickSort Crash Course</h2>
<p>QuickSort is deceptively simple.<br />
No, it really is.<br />
In 20 lines of C#, or whatever language you prefer, you can sort numbers. Lots of them, and incredibly fast. However, try and change something about it; nudge it in the wrong way, and it will quickly turn around and teach you a lesson in humility. It is hard to improve on it without breaking any of the tenets it is built upon.</p>
<h3 id="in-words">In words</h3>
<p>Before we discuss any of that, let’s describe QuickSort in words, code, pictures, and statistics:</p>
<ul>
<li>It uses a <em>divide-and-conquer</em> approach.
<ul>
<li>In other words, it’s recursive.</li>
<li>It performs \(\mathcal{O}(n\log{}n)\) comparisons to sort <em>n</em> items.</li>
</ul>
</li>
<li>It performs an in-place sort.</li>
</ul>
<p>That last point, referring to in-place sorting, sounds simple and neat, and it sure is from the perspective of the user: no additional memory allocation needs to occur regardless of how much data they’re sorting. While that’s great, I’ve spent days trying to overcome the correctness and performance challenges that arise from it, specifically in the context of vectorization. It is also essential to remain in-place since I intend for this to become a <em>drop-in</em> replacement for <code class="highlighter-rouge">Array.Sort</code>.</p>
<p>More concretely, QuickSort works like this:</p>
<ol>
<li>Pick a pivot value.</li>
<li><strong>Partition</strong> the array around the pivot value.</li>
<li>Recurse on the left side of the pivot.</li>
<li>Recurse on the right side of the pivot.</li>
</ol>
<p>Picking a pivot could be a mini-post in itself, but again, in the context of competing with <code class="highlighter-rouge">Array.Sort</code> we don’t need to dive into it, we’ll copy whatever CoreCLR does, and get on with our lives.<br />
CoreCLR uses a pretty standard scheme of median-of-three for pivot selection, which can be summed up as: “Let’s sort the three elements in the first, middle, and last positions, then pick the middle one of those three as the pivot”.</p>
<p><strong>Partitioning</strong> the array is where we spend most of the execution time: we take our selected pivot value and rearrange the array segment that was handed to us such that all numbers <em>smaller-than</em> the pivot are in the beginning or <strong>left</strong>, in no particular order amongst themselves. Then comes the <em>pivot</em>, in its <strong>final</strong> resting position, and following it are all elements <em>greater-than</em> the pivot, again in no particular order amongst themselves.</p>
<p>After partitioning is complete, we recurse to the left and right of the pivot, as previously described.</p>
<p>That’s all there is to it: this gets millions, even billions, of numbers sorted in-place, as efficiently as we know how to, 60+ years after its invention.</p>
<p class="notice--info">Bonus trivia points for those who are still here with me: <a href="https://en.wikipedia.org/wiki/Tony_Hoare">Tony Hoare</a>, who invented QuickSort back in the early 60s also took responsibility for inventing the <code class="highlighter-rouge">null</code> pointer concept. So I guess there really is no good without evil in this world.</p>
<h3 id="in-code">In code</h3>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
</pre></td><td class="rouge-code"><pre><span class="k">void</span> <span class="nf">QuickSort</span><span class="p">(</span><span class="kt">int</span><span class="p">[]</span> <span class="n">items</span><span class="p">)</span> <span class="p">=></span> <span class="nf">QuickSort</span><span class="p">(</span><span class="n">items</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="n">items</span><span class="p">.</span><span class="n">Length</span> <span class="p">-</span> <span class="m">1</span><span class="p">);</span>
<span class="k">void</span> <span class="nf">QuickSort</span><span class="p">(</span><span class="kt">int</span><span class="p">[]</span> <span class="n">items</span><span class="p">,</span> <span class="kt">int</span> <span class="n">left</span><span class="p">,</span> <span class="kt">int</span> <span class="n">right</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">left</span> <span class="p">==</span> <span class="n">right</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">pivot</span> <span class="p">=</span> <span class="nf">PickPivot</span><span class="p">(</span><span class="n">items</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">pivotPos</span> <span class="p">=</span> <span class="nf">Partition</span><span class="p">(</span><span class="n">items</span><span class="p">,</span> <span class="n">pivot</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="nf">QuickSort</span><span class="p">(</span><span class="n">items</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">pivotPos</span><span class="p">);</span>
<span class="nf">QuickSort</span><span class="p">(</span><span class="n">items</span><span class="p">,</span> <span class="n">pivotPos</span> <span class="p">+</span> <span class="m">1</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">PickPivot</span><span class="p">(</span><span class="kt">int</span><span class="p">[]</span> <span class="n">items</span><span class="p">,</span> <span class="kt">int</span> <span class="n">left</span><span class="p">,</span> <span class="kt">int</span> <span class="n">right</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">var</span> <span class="n">mid</span> <span class="p">=</span> <span class="n">left</span> <span class="p">+</span> <span class="p">((</span><span class="n">right</span> <span class="p">-</span> <span class="n">left</span><span class="p">)</span> <span class="p">/</span> <span class="m">2</span><span class="p">);</span>
<span class="nf">SwapIfGreater</span><span class="p">(</span><span class="k">ref</span> <span class="n">items</span><span class="p">[</span><span class="n">left</span><span class="p">],</span> <span class="k">ref</span> <span class="n">items</span><span class="p">[</span><span class="n">mid</span><span class="p">]);</span>
<span class="nf">SwapIfGreater</span><span class="p">(</span><span class="k">ref</span> <span class="n">items</span><span class="p">[</span><span class="n">left</span><span class="p">],</span> <span class="k">ref</span> <span class="n">items</span><span class="p">[</span><span class="n">right</span><span class="p">]);</span>
<span class="nf">SwapIfGreater</span><span class="p">(</span><span class="k">ref</span> <span class="n">items</span><span class="p">[</span><span class="n">mid</span><span class="p">],</span> <span class="k">ref</span> <span class="n">items</span><span class="p">[</span><span class="n">right</span><span class="p">]);</span>
    <span class="k">return</span> <span class="n">items</span><span class="p">[</span><span class="n">mid</span><span class="p">];</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">Partition</span><span class="p">(</span><span class="kt">int</span><span class="p">[]</span> <span class="n">array</span><span class="p">,</span> <span class="kt">int</span> <span class="n">pivot</span><span class="p">,</span> <span class="kt">int</span> <span class="n">left</span><span class="p">,</span> <span class="kt">int</span> <span class="n">right</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">left</span> <span class="p"><</span> <span class="n">right</span><span class="p">)</span> <span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">array</span><span class="p">[</span><span class="n">left</span><span class="p">]</span> <span class="p"><</span> <span class="n">pivot</span><span class="p">)</span> <span class="n">left</span><span class="p">++;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">array</span><span class="p">[</span><span class="n">right</span><span class="p">]</span> <span class="p">></span> <span class="n">pivot</span><span class="p">)</span> <span class="n">right</span><span class="p">--;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">left</span> <span class="p"><=</span> <span class="n">right</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">t</span> <span class="p">=</span> <span class="n">array</span><span class="p">[</span><span class="n">left</span><span class="p">];</span>
<span class="n">array</span><span class="p">[</span><span class="n">left</span><span class="p">++]</span> <span class="p">=</span> <span class="n">array</span><span class="p">[</span><span class="n">right</span><span class="p">];</span>
<span class="n">array</span><span class="p">[</span><span class="n">right</span><span class="p">--]</span> <span class="p">=</span> <span class="n">t</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">left</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>I did say it is deceptively simple, and grasping how QuickSort really works sometimes feels like trying to lift sand through your fingers; to that end, I’ve included two more visualizations of QuickSort, which are derivatives of the amazing work done by <a href="https://observablehq.com/@mbostock">Michael Bostock (@mbostock)</a> with <a href="https://d3js.org/">d3.js</a>.</p>
<h3 id="visualizing-quicksorts-recursion">Visualizing QuickSort’s recursion</h3>
<p>One thing that we have to keep in mind is that the same data is partitioned over-and-over again, many times, with ever-shrinking partition sizes until we end up having a partition size of 2 or 3, in which case we can trivially sort the partition as-is and return.</p>
<p>To help see this better, we’ll use this way of visualizing arrays and their intermediate states in QuickSort:</p>
<div>
<div class="stickemup">
<p><img src="/talks/intrinsics-sorting-2019/quicksort-mbostock/quicksort-vis-legend.svg" alt="QuickSort Legend" /></p>
</div>
<p>Here, we see an unsorted array of 200 elements (in the process of getting sorted).<br />
The different sticks represent numbers in the [-45°..+45°] range, and the angle of each individual stick represents its value, as I hope it is easy to discern.<br />
We represent the pivots with <strong>two</strong> colors:</p>
<ul>
<li><span style="color: red"><strong>Red</strong></span> for the currently selected pivot at a given recursion level.</li>
<li><span style="color: green"><strong>Green</strong></span> for previous pivots that have already been partitioned around in previous rounds/levels of the recursion.</li>
</ul>
<p>Our ultimate goal is to go from the messy image above to the visually appeasing one below:</p>
</div>
<p><img src="/talks/intrinsics-sorting-2019/quicksort-mbostock/quicksort-vis-sorted.svg" alt="QuickSort Sorted" /></p>
<p>What follows is a static (e.g., non-animated) visualization that shows how pivots are randomly selected at each level of recursion and how, by the next step, the unsorted segments around them become partitioned until we finally have a completely sorted array. Here is how the whole thing looks:</p>
<p class="notice--info">These visuals are auto-generated in Javascript + d3.js, so feel free to hit that “Reload” button and/or change the number of elements in the array if you feel you want to see a new set of random sticks sorted.</p>
<iframe src="../talks/intrinsics-sorting-2019/quicksort-mbostock/qs-static-reload.html" scrolling="no" style="width:1600px; max-width: 100%;background: transparent;" allowfullscreen=""></iframe>
<p>I encourage you to look at this and try to explain to yourself what QuickSort “does” here, at every level. What you can witness here is the interaction between pivot selection, where it “lands” in the next recursion level (or row), and future pivots to its left and right and in the next levels of recursion. We also see how, with every level of recursion, the partition sizes decrease until, finally, every element is a pivot, which means sorting is complete.</p>
<h3 id="visualizing-quicksorts-comparisonsswaps">Visualizing QuickSort’s Comparisons/Swaps</h3>
<p>While the above visualization really does a lot to help understand <strong>how</strong> QuickSort works, I also wanted to leave you with an impression of the total amount of work done by QuickSort:</p>
<div>
<div class="stickemup">
<iframe src="../talks/intrinsics-sorting-2019/quicksort-mbostock/qs-animated-playpause.html" scrolling="no" style="width:1600px; height: 250px; max-width: 100%;background: transparent;" allowfullscreen=""></iframe>
</div>
<p>Above is an <strong>animation</strong> of the whole process as it goes over the same array, slowly and recursively going from an unsorted mess to a completely sorted array.</p>
<p>We can witness just how many comparisons and swap operations need to happen for a 200 element QuickSort to complete successfully. There’s genuinely a lot of work that needs to happen per element (when considering how we re-partition virtually all elements again and again) for the whole thing to finish.</p>
</div>
<h3 id="arraysort-vs-quicksort">Array.Sort vs. QuickSort</h3>
<p>It’s important to note that <code class="highlighter-rouge">Array.Sort</code> uses a couple more tricks to get better performance and avoid certain dark spots that come with QuickSort. I would be irresponsible if I didn’t mention those, since in the later posts I borrow at least one idea from its play-book and improve upon it with intrinsics.</p>
<p><code class="highlighter-rouge">Array.Sort</code> isn’t strictly QuickSort; it is a variation on it called <a href="https://en.wikipedia.org/wiki/Introsort">Introspective Sort</a>, invented by <a href="https://en.wikipedia.org/wiki/David_Musser">David Musser</a> in 1997. What it roughly does is combine Quick-Sort, Heap-Sort, and Insertion-Sort by dynamically switching between them: more specifically, it starts with quick-sort and <em>may</em> switch to heap-sort if the recursion depth goes beyond a specific threshold, while also switching to insertion-sort if the size of the partition drops below a different threshold. This hybrid approach is a clever way of mitigating the two biggest shortcomings in quick-sort alone:</p>
<ul>
<li>QuickSort is notorious for degenerating into \(\mathcal{O}(n^2)\) for various edge-cases input sequences. I won’t go very deeply into this, but think about an array that is made up of a single repeated number. In such an extreme case, partitioning results in a bad separation around the pivot (e.g. one sub-partition will always have a size of <code class="highlighter-rouge">0</code>) for each partitioning attempt, and the whole thing goes south very quickly.
<ul>
<li>Introspective-sort mitigates such bad cases by tracking the current recursion depth vs. an acceptable worst-case depth (usually \(2 \cdot (\lfloor \log_{2}(n) \rfloor + 1)\)). Once the measured/actual depth crosses over that threshold, introspective-sort switches internally from partitioning/quick-sort to heap-sort which deals with such cases better, on average.</li>
</ul>
</li>
<li>Lastly, once the partition is small enough, introspective-sort switches to using insertion-sort. This is a critical improvement when we consider that recursive calls are never cheap (even more so for the code I’ll present later in this series). In CoreCLR/C#, where this threshold was selected to be 16 elements, this hybrid approach manages to replace up to 3 levels of recursive calls (or \(2^{n+1}-1 = 2^{4}-1 = 15\) partitioning calls on average) with a <strong>single</strong> call to insertion-sort, which is very effective for these small input sizes anyway. The impact of this optimization, where recursion is replaced with simpler loop-based code, cannot be overstated.</li>
</ul>
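<p>To make the switching logic concrete, here is a compact, illustrative sketch of introspective sort in plain C#. This is <em>not</em> CoreCLR’s actual code: it uses a simple Lomuto partition and helper names of my own invention, but the depth-limit and small-partition switches follow the scheme described above:</p>

```csharp
using System;

public static class IntroSortSketch
{
    // Same small-partition cutoff CoreCLR uses
    const int SmallSortThreshold = 16;

    public static void Sort(int[] a)
    {
        if (a.Length < 2) return;
        // Acceptable worst-case depth: 2 * (floor(log2(n)) + 1)
        var depthLimit = 2 * (FloorLog2(a.Length) + 1);
        Sort(a, 0, a.Length - 1, depthLimit);
    }

    static int FloorLog2(int n) { var r = 0; while (n > 1) { n >>= 1; r++; } return r; }

    static void Sort(int[] a, int lo, int hi, int depthLimit)
    {
        while (hi - lo + 1 > SmallSortThreshold)
        {
            // Recursion going too deep? This input is degenerate: bail to heap-sort
            if (depthLimit-- == 0) { HeapSort(a, lo, hi); return; }
            var p = Partition(a, lo, hi);
            Sort(a, p + 1, hi, depthLimit); // recurse into the right side,
            hi = p - 1;                     // loop over the left one
        }
        // Small partition: a single insertion-sort call replaces
        // the last few levels of recursion
        InsertionSort(a, lo, hi);
    }

    static int Partition(int[] a, int lo, int hi)
    {
        var pivot = a[hi];
        var i = lo;
        for (var j = lo; j < hi; j++)
            if (a[j] < pivot) Swap(a, i++, j);
        Swap(a, i, hi);
        return i;
    }

    static void InsertionSort(int[] a, int lo, int hi)
    {
        for (var i = lo + 1; i <= hi; i++) {
            var t = a[i];
            var j = i - 1;
            while (j >= lo && a[j] > t) { a[j + 1] = a[j]; j--; }
            a[j + 1] = t;
        }
    }

    static void HeapSort(int[] a, int lo, int hi)
    {
        var n = hi - lo + 1;
        for (var i = n / 2 - 1; i >= 0; i--) SiftDown(a, lo, i, n);
        for (var i = n - 1; i > 0; i--) { Swap(a, lo, lo + i); SiftDown(a, lo, 0, i); }
    }

    static void SiftDown(int[] a, int lo, int root, int n)
    {
        while (true) {
            var c = 2 * root + 1;
            if (c >= n) return;
            if (c + 1 < n && a[lo + c + 1] > a[lo + c]) c++;
            if (a[lo + root] >= a[lo + c]) return;
            Swap(a, lo + root, lo + c);
            root = c;
        }
    }

    static void Swap(int[] a, int i, int j) { var t = a[i]; a[i] = a[j]; a[j] = t; }
}
```

<p>The depth limit is computed once up front, every level of recursion “spends” one unit of it, and only pathological inputs ever exhaust it.</p>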
<p>As mentioned, I ended up borrowing this last idea for my code as the issues around smaller partition sizes are exacerbated by using vectorized intrinsics in the following posts.</p>
<p>For the unfriendly cases I mentioned before, I have no vectorized approach yet (OK, I kind of do, but I have no intention of making this a 9-post blog series :). However, I have no problem admitting to this while weaseling my way out of this pit of despair in the most direct way: use the same logic that introspective-sort uses for switching to heap-sort (triggered when the depth exceeds some dynamically computed threshold) and, in turn, switch to… <code class="highlighter-rouge">Array.Sort</code>; we let <em>it</em> stumble a bit with the same input until it gives up and switches internally to heap-sort. It’s slightly nasty, but it works…</p>
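<p>A minimal sketch of that escape hatch, with a scalar Lomuto partition standing in for the vectorized one (all names here are my own, for illustration):</p>

```csharp
using System;

public static class DepthLimitedSort
{
    public static void Sort(int[] a)
    {
        if (a.Length < 2) return;
        // Same worst-case depth bound introspective-sort uses
        var depthLimit = 2 * (FloorLog2(a.Length) + 1);
        Sort(a, 0, a.Length - 1, depthLimit);
    }

    static int FloorLog2(int n) { var r = 0; while (n > 1) { n >>= 1; r++; } return r; }

    static void Sort(int[] a, int lo, int hi, int depthLimit)
    {
        if (hi <= lo) return;
        if (depthLimit == 0) {
            // Give up on this segment: hand it to Array.Sort, and let *it*
            // stumble a bit more before switching internally to heap-sort.
            Array.Sort(a, lo, hi - lo + 1);
            return;
        }
        var p = Partition(a, lo, hi); // stand-in for the vectorized partition
        Sort(a, lo, p - 1, depthLimit - 1);
        Sort(a, p + 1, hi, depthLimit - 1);
    }

    static int Partition(int[] a, int lo, int hi)
    {
        var pivot = a[hi];
        var i = lo;
        for (var j = lo; j < hi; j++)
            if (a[j] < pivot) { (a[i], a[j]) = (a[j], a[i]); i++; }
        (a[i], a[hi]) = (a[hi], a[i]);
        return i;
    }
}
```

<p>With an all-equal input, every partition degenerates and the recursion marches straight down one side, but the depth limit caps the damage after \(\mathcal{O}(\log n)\) levels.</p>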
<h2 id="comparing-scalar-variants">Comparing Scalar Variants</h2>
<p>With all this new information, this is a good time to measure how a couple of different scalar (i.e., non-vectorized) versions compare to <code class="highlighter-rouge">Array.Sort</code>. I’ll show some results generated using <a href="https://benchmarkdotnet.org/">BenchmarkDotNet</a> (BDN) with:</p>
<ul>
<li><code class="highlighter-rouge">Array.Sort()</code> as the baseline.</li>
<li><a href="https://github.com/damageboy/VxSort/blob/research/VxSortResearch/Unstable/Scalar/Managed.cs"><code class="highlighter-rouge">Managed</code></a> as the code I’ve just presented above.
<ul>
<li>This version is just basic QuickSort using regular/safe C#. With this version, every time we access an array element, the JIT inserts bounds-checking machine code around our actual access that ensures the CPU does not read/write outside the memory region owned by the array.</li>
</ul>
</li>
<li><a href="https://github.com/damageboy/VxSort/blob/research/VxSortResearch/Unstable/Scalar/Unmanaged.cs"><code class="highlighter-rouge">Unmanaged</code></a> as an alternative/faster version to <code class="highlighter-rouge">Scalar</code> where:
<ul>
<li>The code uses native pointers and unsafe semantics (using C#‘s new <code class="highlighter-rouge">unmanaged</code> constraint, neat!).</li>
<li>We switch to <code class="highlighter-rouge">InsertionSort</code> (again, copy-pasted from CoreCLR) when below 16 elements, just like <code class="highlighter-rouge">Array.Sort</code> does.</li>
</ul>
</li>
</ul>
<p>I’ve prepared this last version to show that with unsafe code + <code class="highlighter-rouge">InsertionSort</code>, we can remove most of the performance gap between C# and C++ for this type of code. That gap mainly stems from bounds-checking, which the JIT cannot elide for these sorts of random-access patterns, and from the missing jump-to-<code class="highlighter-rouge">InsertionSort</code> optimization.</p>
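<p>To get a feel for what bounds-check elision means, here is a small illustrative pair of loops (not code from the benchmarked versions): the first indexes safely, in a shape the JIT <em>can</em> prove safe; the second uses <code class="highlighter-rouge">Unsafe.Add</code> to sidestep bounds checks entirely, much like the native-pointer <code class="highlighter-rouge">Unmanaged</code> variant does. In QuickSort’s partition loops, the indices move in data-dependent ways, so the JIT has to keep a check on every access:</p>

```csharp
using System;
using System.Runtime.CompilerServices;

public static class BoundsCheckDemo
{
    // Safe indexing in a pattern the JIT recognizes: i is provably inside
    // [0, a.Length), so the per-access bounds check gets elided here.
    public static long SumChecked(int[] a)
    {
        long s = 0;
        for (var i = 0; i < a.Length; i++)
            s += a[i];
        return s;
    }

    // Pointer-style traversal: Unsafe.Add performs no bounds checks at all,
    // which is essentially what the int*-based Unmanaged variant relies on.
    // (Assumes a non-empty array; with great power comes zero safety.)
    public static long SumUnchecked(int[] a)
    {
        long s = 0;
        ref var r = ref a[0]; // one bounds check here, none inside the loop
        for (var i = 0; i < a.Length; i++)
            s += Unsafe.Add(ref r, i);
        return s;
    }
}
```

<p>Both produce identical results; the difference only shows up in the generated machine code, and — for access patterns the JIT can’t reason about — in the benchmarks.</p>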
<table style="margin-bottom: 0em" class="notice--info">
<tr>
<td style="border: none;vertical-align: top"><span class="uk-label">Note</span></td>
<td style="border: none"><div>
<p>Throughout this series, I’ll benchmark each sorting method with various array sizes (BDN parameter: <code class="highlighter-rouge">N</code>): \(10^i_{i=1\cdots7}\). I’ve added a custom column to the BDN report: <code class="highlighter-rouge">Time / N</code>. This represents the time spent sorting <em>per element</em> in the array and, as such, is very useful for comparing results on a more uniform scale.<br />
In addition, I will only start with purely random and unique sets of values, as that is the classical input type on which I want to focus for this series.<br />
When I actually get to submitting a PR, I will have to show more test cases and prove that the whole thing doesn’t crumble once the input is less than optimal, but that is <em>outside of the scope</em> of this series.</p>
</div>
</td>
</tr>
</table>
<p>Here are the results in the form of charts and tables. I’ve included a handy large button you can press to get a quick tour of what each tab contains; what we have here is:</p>
<ol>
<li>A chart scaling the performance of various implementations being compared to <code class="highlighter-rouge">Array.Sort</code> as a ratio.</li>
<li>A chart showing time spent sorting a single element in an array of N elements (Time / N).</li>
<li>BDN results in a friendly table form.</li>
<li>Statistics/Counters that teach us about what is actually going on under the hood.</li>
</ol>
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#e34157f6-a85d-4a6d-9972-3d77cd7e5f87'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-stats"></i> Scaling</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Statistics</a></li>
<li><a href="#"><i class="glyphicon glyphicon-info-sign"></i> Setup</a></li>
</ul>
<ul id="e34157f6-a85d-4a6d-9972-3d77cd7e5f87" class="uk-switcher uk-margin">
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Performance scale: Array.Sort (solid gray) is always 100%, and the other methods are scaled relative to it" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
ArraySort,1,1,1,1,1,1
Scalar,2.04,1.57,1.33,1.12,1.09,1.11
Unmanaged,1.75,1.01,0.99,0.97,0.93,0.95
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "solid", "hachureAngle": -30, "hachureGap": 7 }
},
{
"backgroundColor": "rgba(220,33,33,.6)",
"rough": { "fillStyle": "hachure", "hachureAngle": 15, "hachureGap": 6 }
},
{
"backgroundColor": "rgba(33,33,220,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": -45, "hachureGap": 6 }
}]
},
"options": {
"title": { "text": "Scalar Sorting - Scaled to Array.Sort", "display": true },
"scales": {
"yAxes": [{
"ticks": {
"min": 0.8,
"fontFamily": "Indie Flower",
"callback": "ticksPercent"
},
"scaleLabel": {
"labelString": "Scaling (%)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower"}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
ArraySort,12.1123,30.5461,54.641,60.4874,70.7539,80.8431
Scalar,24.7385,47.8796,72.7528,67.7419,77.3906,89.7593
Unmanaged,21.0955,30.9692,54.3112,58.9577,65.7222,76.8631
<!--
{
"data" : {
"datasets" : [
{ "backgroundColor":"rgba(66,66,66,0.35)", "rough": { "fillStyle": "solid", "hachureGap": 6 } },
{ "backgroundColor":"rgba(33,220,33,.6)", "rough": { "fillStyle": "hachure", "hachureAngle": 15, "hachureGap": 6 } },
{ "backgroundColor":"rgba(33,33,220,.9)", "rough": { "fillStyle": "hachure", "hachureAngle": -45, "hachureGap": 6 } }
]
},
"options": {
"title": { "text": "Scalar Sorting - log(Time/N)", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower"}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt1_Int32_-report.datatable.json" data-id-field="name" data-pagination="true" data-page-list="[9, 18]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result it. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/scalar-vs-unmanaged-stats.json" data-id-field="name" data-pagination="true" data-page-list="[9, 18]" data-intro="Each row in this table contains statistics collected & averaged out of thousands of runs with random data" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="MethodName" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">Method<br />Name</span>
</th>
<th data-field="ProblemSize" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">Problem<br />Size</span>
</th>
<th data-field="MaxDepthScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<span data-intro="The maximal depth of recursion reached while sorting" data-position="top">Max<br />Depth</span>
</th>
<th data-field="NumPartitionOperationsScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<span data-intro="# of partitioning operations for each sort" data-position="top">#<br />Part-<br />itions</span>
</th>
<th data-field="AverageSmallSortSizeScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<span data-intro="For hybrid sorting, the average size that each small sort operation was called with (e.g. InsertionSort)" data-position="top">
Avg.<br />Small<br />Sorts<br />Size
</span>
</th>
<th data-field="NumScalarComparesScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<span data-intro="How many branches were executed in each sort operation that were based on the unsorted array elements" data-position="top">
# Data-<br />Based<br />Branches
</span>
</th>
<th data-field="PercentSmallSortCompares" data-sortable="true" data-value-type="float2-percentage">
<span data-intro="What percent of<br/>⬅<br/>branches happened as part of small-sorts" data-position="top">
% Small<br />Sort<br />Data-<br />Based<br />Branches
</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nv">BenchmarkDotNet</span><span class="o">=</span>v0.12.0, <span class="nv">OS</span><span class="o">=</span>clear-linux-os 32120
Intel Core i7-7700HQ CPU 2.80GHz <span class="o">(</span>Kaby Lake<span class="o">)</span>, 1 CPU, 4 logical and 4 physical cores
.NET Core <span class="nv">SDK</span><span class="o">=</span>3.1.100
<span class="o">[</span>Host] : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
Job-DEARTS : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
<span class="nv">InvocationCount</span><span class="o">=</span>3 <span class="nv">IterationCount</span><span class="o">=</span>15 <span class="nv">LaunchCount</span><span class="o">=</span>2
<span class="nv">UnrollFactor</span><span class="o">=</span>1 <span class="nv">WarmupCount</span><span class="o">=</span>10
<span class="nv">$ </span><span class="nb">grep</span> <span class="s1">'stepping\|model\|microcode'</span> /proc/cpuinfo | <span class="nb">head</span> <span class="nt">-4</span>
model : 158
model name : Intel<span class="o">(</span>R<span class="o">)</span> Core<span class="o">(</span>TM<span class="o">)</span> i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>Surprisingly<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">5</a></sup>, the unmanaged C# version is running slightly faster than <code class="highlighter-rouge">Array.Sort</code>, but with one caveat: it only outperforms the C++ version for large inputs. Otherwise, everything is as expected: the purely <code class="highlighter-rouge">Managed</code> variant is just slow, and the <code class="highlighter-rouge">Unmanaged</code> one is mostly on par with <code class="highlighter-rouge">Array.Sort</code>.<br />
These C# implementations were written to <strong>verify</strong> that we can get to <code class="highlighter-rouge">Array.Sort</code> <em>like</em> performance in C#, and they do just that. Running 5% faster for <em>some</em> input sizes will not cut it for me; I want it <em>much</em> faster. An equally important reason for re-implementing these basic versions is that we can now sprinkle <em>statistics-collecting-code</em> magic fairy dust<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote">6</a></sup> on them so that we have even more numbers to dig into in the “Statistics” tab: These counters will assist us in deciphering and comparing future results and implementations. In this post they serve us by establishing a baseline. We can see, per each <code class="highlighter-rouge">N</code> value (with some commentary):</p>
<ul>
<li>The maximal recursion depth. Note that:
<ul>
<li>The unmanaged version, like CoreCLR’s <code class="highlighter-rouge">Array.Sort</code> switches to <code class="highlighter-rouge">InsertionSort</code> for the last couple of recursion levels, therefore, its maximal depth is smaller.</li>
</ul>
</li>
<li>The total number of partitioning operations performed.
<ul>
<li>Same as above: less recursion ⮚ fewer partitioning calls.</li>
</ul>
</li>
<li>The average size of what I colloquially refer to as “small-sort” operations performed (e.g., <code class="highlighter-rouge">InsertionSort</code> for the <code class="highlighter-rouge">Unmanaged</code> variant).
<ul>
<li>The <code class="highlighter-rouge">Managed</code> version doesn’t have any of this, so it’s just 0.</li>
<li>In the <code class="highlighter-rouge">Unmanaged</code> version, we see a consistent value of 9.x: Given that we special case 1,2,3 in the code and 16 is the upper limit, 9.x seems like a reasonable outcome here.</li>
</ul>
</li>
<li>The number of branch operations that were user-data dependent. This one may be hard to relate to at first, but it will become apparent from the 3<sup>rd</sup> post onwards why this is a crucial number to track. For now, a definition: this statistic counts <em>how many</em> times our code did an <code class="highlighter-rouge">if</code> or a <code class="highlighter-rouge">while</code> or any other branch operation <em>whose condition depended on unsorted user-supplied data</em>!
<ul>
<li>The numbers boggle the mind; this is the first time we get to show how much work is involved.</li>
<li>What’s even more surprising is that for the <code class="highlighter-rouge">Unmanaged</code> variant the number is even higher (well, only surprising if you don’t know anything about how <code class="highlighter-rouge">InsertionSort</code> works…), and yet this version seems to run faster… I have an entire post dedicated to just this part of the problem in this series, so let’s make note of this for now; already we see peculiar things.</li>
</ul>
</li>
<li>Finally, I’ve also included a statistic here that shows what percent of those data-based branches came from small-sort operations. Again, this was 0% for the <code class="highlighter-rouge">Managed</code> variant, but we can see that a large part of those compares are now coming from those last few levels of recursion that were converted to <code class="highlighter-rouge">InsertionSort</code>…</li>
</ul>
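<p>To make the “data-based branches” statistic more tangible, here is a sketch of how such a counter can be wired into <code class="highlighter-rouge">InsertionSort</code> (the counter placement is my own illustration; the actual <code class="highlighter-rouge">Stats</code> build works along similar lines): every evaluation of the inner-loop comparison inspects unsorted user data, so it counts as one data-based branch:</p>

```csharp
using System;

public static class CountingInsertionSort
{
    // Sorts a[] in place and returns how many branches were taken whose
    // condition depended on the (unsorted) data itself.
    public static long Sort(int[] a)
    {
        long dataBasedBranches = 0;
        for (var i = 1; i < a.Length; i++) {
            var t = a[i];
            var j = i - 1;
            while (j >= 0) {                 // index check: not data-based
                dataBasedBranches++;         // comparing a[j] vs. t: data-based!
                if (a[j] <= t) break;
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = t;
        }
        return dataBasedBranches;
    }
}
```

<p>For an already-sorted input of \(n\) elements this reports exactly \(n-1\) data-based branches; for a reverse-sorted one, \(n(n-1)/2\) — which is precisely why the <code class="highlighter-rouge">Unmanaged</code> variant’s branch count climbs once <code class="highlighter-rouge">InsertionSort</code> enters the picture.</p>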
<p>Some of these statistics will remain pretty much the same for the rest of this series, regardless of what we do next in future versions, while others radically change; We’ll observe and make use of these as key inputs in helping us to figure out how/why something worked, or not!</p>
</div>
<h2 id="all-warmed-up">All Warmed Up?</h2>
<p>We’ve spent quite some time polishing our foundations concerning QuickSort and <code class="highlighter-rouge">Array.Sort</code>. I know lengthy introductions are somewhat dull, but I think the time spent on this post will pay off with dividends when we next encounter our actual implementation in the 3<sup>rd</sup> post and later on. This might also be the time to confess that just doing the leg-work to provide this refresher helped me come up with at least one super non-trivial optimization, which I think I’ll keep the lid on all the way until the 6<sup>th</sup> and final post. So never underestimate the importance of “just” covering the basics.</p>
<p>Before we write vectorized code, we need to pick up some know-how specific to vectorized intrinsics and introduce a few select intrinsics we’ll be using, so this is an excellent time to break off this post, grab a fresh cup of coffee and head to the <a href="/2020-01-29/this-goes-to-eleven-pt2">next post</a>.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:0" role="doc-endnote">
<p>Which is increasingly taking <a href="https://github.com/damageboy/analyze-spec-benchmarks#integer">more and more</a> time to happen, due to the end of Dennard scaling and the slow-down of Moore’s law… <a href="#fnref:0" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Since CoreCLR 3.0 was released, a <a href="https://github.com/dotnet/coreclr/pull/27700">PR</a> providing a span-based version of this has been merged into the 5.0 master branch, but I’ll ignore this for the time being as it doesn’t seem to matter in this context. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>You can grab your microcode signature in one of the following ways: On Windows, the easiest way is to install and run the excellent HWiNFO64 application, which will show you the microcode signature. On Linux, <code class="highlighter-rouge">grep -i microcode /proc/cpuinfo</code> does the trick, and on macOS, <code class="highlighter-rouge">sysctl -a | grep -i microcode</code> will get the job done. Unfortunately you’ll have to consult your specific CPU model to figure out the before/after signature, and I can’t help you there, except to point out that the microcode update in question came out on November 13<sup>th</sup> and is about mitigating the JCC errata. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>I came, I Tried, <a href="https://github.com/dotnet/runtime/pull/33152#issuecomment-596405021">I Folded</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:1" role="doc-endnote">
<p>Believe it or not, I pretty much wrote every other version featured in this series <em>before</em> I wrote the <code class="highlighter-rouge">Unmanaged</code> one, so I really was quite surprised that it ended up being slightly faster than <code class="highlighter-rouge">Array.Sort</code> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>I have a special build configuration called <code class="highlighter-rouge">Stats</code> which compiles in a bunch of calls to various conditionally compiled functions that bump assorted counters and, finally, dump it all to JSON, which eventually makes its way into these posts (if you dig deep, you can get the actual JSON files :) <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1 id="unsafe-bounds-checking">Unsafe Bounds Checking</h1>
<p>I thought I’d write a really short post on a nifty technique/trick I came up with while trying to debug my own horrible unsafe code for vectorized sorting. I don’t think I’ve seen it used/shown before, and it really saved me tons of time.
It all boils down to a combination of:</p>
<ul>
<li><code class="highlighter-rouge">using static</code></li>
<li><code class="highlighter-rouge">#if DEBUG</code></li>
<li>Local functions in C#</li>
</ul>
<p>Imagine this is our starting point:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
</pre></td><td class="rouge-code"><pre><span class="k">unsafe</span> <span class="k">void</span> <span class="nf">GenerateRollingSum</span><span class="p">(</span><span class="kt">int</span> <span class="p">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">int</span> <span class="n">lengthInVectors</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// This get's folded as a constant by the</span>
<span class="c1">// JIT and I hate typing this all over the place</span>
<span class="kt">var</span> <span class="n">N</span> <span class="p">=</span> <span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">>.</span><span class="n">Count</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">acc</span> <span class="p">=</span> <span class="n">Avx</span><span class="p">.</span><span class="nf">LoadDquVector256</span><span class="p">(</span><span class="n">p</span><span class="p">);</span>
<span class="kt">var</span> <span class="n">pEnd</span> <span class="p">=</span> <span class="n">p</span> <span class="p">+</span> <span class="n">lengthInVectors</span> <span class="p">*</span> <span class="n">N</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">pRead</span> <span class="p">=</span> <span class="n">p</span> <span class="p">+</span> <span class="m">1</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">pWrite</span> <span class="p">=</span> <span class="n">p</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">p</span> <span class="p"><</span> <span class="n">pEnd</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">data</span> <span class="p">=</span> <span class="n">Avx</span><span class="p">.</span><span class="nf">LoadDquVector256</span><span class="p">(</span><span class="n">p</span><span class="p">);</span>
<span class="n">acc</span> <span class="p">=</span> <span class="n">Avx</span><span class="p">.</span><span class="nf">Add</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">acc</span><span class="p">);</span>
<span class="n">Avx</span><span class="p">.</span><span class="nf">Store</span><span class="p">(</span><span class="n">pWrite</span><span class="p">,</span> <span class="n">acc</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>I’m providing here a very <strong>wrong</strong> implementation, obviously, for the purpose of this post. Keen eyes will immediately notice that this method is going to make us very unhappy, as it is writing partially into the same memory it is about to read in the next iteration. It’s definitely not going to work. But at the same time, it’s important to note that it isn’t going to crash or generate any exception; it just won’t do its job.</p>
<p>Unfortunately for me, I’ve managed to write many variations of this bug, so I had to come up with something that would negate my in-built idiocy. Here’s what I normally write with code like this these days:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
</pre></td><td class="rouge-code"><pre><span class="c1">// We import all the static methods in Avx</span>
<span class="k">using</span> <span class="nn">static</span> <span class="n">System</span><span class="p">.</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Intrinsics</span><span class="p">.</span><span class="n">X86</span><span class="p">.</span><span class="n">Avx</span><span class="p">;</span>
<span class="k">unsafe</span> <span class="k">void</span> <span class="nf">GenerateRollingSum</span><span class="p">(</span><span class="kt">int</span> <span class="p">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">int</span> <span class="n">lengthInVectors</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// This get's folded as a constant by the</span>
<span class="c1">// JIT and I hate typing this all over the place</span>
<span class="kt">var</span> <span class="n">N</span> <span class="p">=</span> <span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">>.</span><span class="n">Count</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">acc</span> <span class="p">=</span> <span class="nf">LoadDquVector256</span><span class="p">(</span><span class="n">p</span><span class="p">);</span>
<span class="kt">var</span> <span class="n">pEnd</span> <span class="p">=</span> <span class="n">p</span> <span class="p">+</span> <span class="n">lengthInVectors</span> <span class="p">*</span> <span class="n">N</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">pRead</span> <span class="p">=</span> <span class="n">p</span> <span class="p">+</span> <span class="m">1</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">pWrite</span> <span class="p">=</span> <span class="n">p</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">p</span> <span class="p"><</span> <span class="n">pEnd</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">data</span> <span class="p">=</span> <span class="nf">LoadDquVector256</span><span class="p">(</span><span class="n">p</span><span class="p">);</span>
<span class="n">acc</span> <span class="p">=</span> <span class="n">Avx</span><span class="p">.</span><span class="nf">Add</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">acc</span><span class="p">);</span>
<span class="nf">Store</span><span class="p">(</span><span class="n">pWrite</span><span class="p">,</span> <span class="n">acc</span><span class="p">);</span>
<span class="p">}</span>
<span class="cp">#if DEBUG
</span> <span class="c1">// "Hijack" LoadDquVector256 under DEBUG configuration</span>
<span class="c1">// and assert for various constraint violations</span>
<span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="nf">LoadDquVector256</span><span class="p">(</span><span class="kt">int</span> <span class="p">*</span><span class="n">ptr</span><span class="p">)</span> <span class="p">{</span>
<span class="n">Debug</span><span class="p">.</span><span class="nf">Assert</span><span class="p">((</span><span class="n">ptr</span> <span class="p">+</span> <span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">)</span> <span class="p"><</span> <span class="n">p</span> <span class="p">+</span> <span class="n">lengthInVectors</span> <span class="p">*</span> <span class="n">N</span><span class="p">,</span>
<span class="s">"Reading past end of array"</span><span class="p">);</span>
<span class="c1">// Finally call the real LoadDquVector256()</span>
<span class="k">return</span> <span class="n">Avx</span><span class="p">.</span><span class="nf">LoadDquVector256</span><span class="p">(</span><span class="n">ptr</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">// "Hijack" LoadDquVector256 under DEBUG configuration</span>
<span class="c1">// and assert for various constraint violations</span>
<span class="k">void</span> <span class="nf">Store</span><span class="p">(</span><span class="kt">int</span> <span class="p">*</span><span class="n">ptr</span><span class="p">,</span> <span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="n">data</span><span class="p">)</span> <span class="p">{</span>
<span class="n">Debug</span><span class="p">.</span><span class="nf">Assert</span><span class="p">((</span><span class="n">ptr</span> <span class="p">+</span> <span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">)</span> <span class="p"><</span> <span class="n">p</span> <span class="p">+</span> <span class="n">lengthInVectors</span> <span class="p">*</span> <span class="n">N</span><span class="p">,</span>
<span class="s">"Writing past end of array"</span><span class="p">);</span>
<span class="n">Debug</span><span class="p">.</span><span class="nf">Assert</span><span class="p">((</span><span class="n">ptr</span> <span class="p">+</span> <span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">)</span> <span class="p"><</span> <span class="n">pRead</span><span class="p">,</span>
<span class="s">"Writing will overwrite unread data"</span><span class="p">);</span>
<span class="c1">// Finally call the real Store()</span>
<span class="n">Avx</span><span class="p">.</span><span class="nf">Store</span><span class="p">(</span><span class="n">ptr</span><span class="p">,</span> <span class="n">data</span><span class="p">);</span>
<span class="p">}</span>
<span class="cp">#endif
</span><span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>As you can see, this is a nifty way to abuse <code class="highlighter-rouge">using static</code> statements with local functions. We override the <code class="highlighter-rouge">LoadDquVector256()</code> / <code class="highlighter-rouge">Store()</code> intrinsics only in <code class="highlighter-rouge">DEBUG</code> builds, so they incur no performance hit in <code class="highlighter-rouge">RELEASE</code>. And because they are defined as local functions, they can perform some in-depth <code class="highlighter-rouge">Debug.Assert()</code>ing based on the internal state of the enclosing function; without defining them as local functions we would not be able to do so…</p>
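<p>The same idea carries over to other languages. Here is a minimal scalar C++ sketch of it (all names below are my own illustration, not from any real codebase): in debug builds, loads and stores go through checked wrappers that <code class="highlighter-rouge">assert()</code> the same constraints, and defining <code class="highlighter-rouge">NDEBUG</code> compiles them down to the raw accesses:</p>

```cpp
#include <cassert>

// A scalar stand-in for Vector256<int>.Count
constexpr int N = 8;
// Next position the loop has yet to read; used by the write-side check
static const int* g_read_cursor = nullptr;

// Debug-only checked wrappers: with NDEBUG defined, the asserts vanish,
// mirroring how the C# local functions only exist under DEBUG.
static void load_chunk(const int* p, const int* end, int* out) {
    assert(p + N <= end && "Reading past end of array");
    for (int i = 0; i < N; i++) out[i] = p[i];
}

static void store_chunk(int* p, const int* end, const int* v) {
    assert(p + N <= end && "Writing past end of array");
    assert(p + N <= g_read_cursor && "Writing would overwrite unread data");
    for (int i = 0; i < N; i++) p[i] = v[i];
}

// In-place rolling sum over N-int chunks: chunk k becomes the element-wise
// sum of input chunks 0..k. Each chunk is read before it is overwritten,
// so the checked wrappers stay quiet.
static void rolling_sum(int* p, int length_in_chunks) {
    const int* end = p + length_in_chunks * N;
    int acc[N] = {0};
    for (int k = 0; k < length_in_chunks; k++) {
        int data[N];
        g_read_cursor = p + (k + 1) * N; // everything before this is consumed
        load_chunk(p + k * N, end, data);
        for (int i = 0; i < N; i++) acc[i] += data[i];
        store_chunk(p + k * N, end, acc);
    }
}
```

<p>The wrappers cost nothing in release builds, yet in a debug run they would trip on the first store that clobbers data a later iteration still needs to read.</p>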
<p>This isn’t useful exclusively for vectorized code; it applies to any potentially tricky code. I hope you find it useful! I don’t think I’ve seen this approach in the wild before.</p>damageboydans@houmus.orghttps://bits.houmus.orgUnsafe Bounds CheckingHacking CoreCLR on Linux with CLion2019-05-01T05:26:28+00:002019-05-01T05:26:28+00:00https://bits.houmus.org/2019-05-01/hacking-coreclr-on-linux-with-clion<h2 id="whatwhy">What/Why?</h2>
<p>Being a regular Linux user whenever I can, I was looking for a decent setup that would let me grok, and then hack on, CoreCLR’s C++ code.</p>
<p>CoreCLR, namely the C++ code that implements the runtime (GC, JIT and more) is a BIG project, and trying to peel through its layers for the first time is no easy task for sure. While there are many great resources available for developers that want to read about the runtime such as the <a href="https://github.com/dotnet/coreclr/blob/master/Documentation/botr/README.md">BotR</a>, for me, there really is no replacement for reading the code and trying to reason about what/how it gets stuff done, preferably during a debug session, with a very focused task/inquiry at hand. For this reason, I really wanted a proper IDE for the huge swaths of C++ code, and I couldn’t think of anything else but <a href="https://www.jetbrains.com/clion/">JetBrains’ own CLion IDE</a> under Linux (and macOS, which I’m not a user of).<br />
With my final setup, I really can do non-trivial navigation on the code base such as:</p>
<video width="900" controls="">
<source src="../assets/images/clion-coreclr.webm" type="video/webm" />
</video>
<h2 id="loading-coreclr-with-clion-navigation">Loading CoreCLR with CLion Navigation</h2>
<p>CoreCLR is a beast of a project, and getting it to parse properly under CLion requires some non-trivial setup, so I thought I’d document my process here for other people to see and maybe even improve upon…</p>
<p>Generally speaking, all the puzzle pieces should fit, since the CoreCLR build-system is 95% made up of running <code class="highlighter-rouge">cmake</code> to generate standard GNU makefiles and then building the whole thing with said makefiles; the other 5% is scripts wrapping the <code class="highlighter-rouge">cmake</code> build-system. At the same time, CLion builds upon <code class="highlighter-rouge">cmake</code> to bootstrap its own internal project representation, <em>provided</em> that it can invoke <code class="highlighter-rouge">cmake</code> just like the normal build would.</p>
<p>Here’s what I did to get everything working:</p>
<ol>
<li>First, we’ll clone and perform a single build of CoreCLR by <a href="https://github.com/dotnet/coreclr/blob/master/Documentation/building/linux-instructions.md#environment">following the instructions</a>. What I did on my Ubuntu machine consisted of:
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="nv">$ </span><span class="nb">sudo </span>apt <span class="nb">install </span>cmake llvm-3.9 clang-3.9 lldb-3.9 liblldb-3.9-dev libunwind8 libunwind8-dev gettext libicu-dev liblttng-ust-dev libcurl4-openssl-dev libssl-dev libnuma-dev libkrb5-dev
<span class="nv">$ </span>./build.sh checked
</pre></td></tr></tbody></table></code></pre></div> </div>
</li>
<li>Once the build is over, you should have everything under the <code class="highlighter-rouge">bin/Product/Linux.x64.Checked</code> folder, like so:
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nv">$ </span><span class="nb">ls </span>bin/Product/Linux.x64.Checked
bin libcoreclr.so netcoreapp2.0
coreconsole libcoreclrtraceptprovider.so PDB
corerun libdbgshim.so sosdocsunix.txt
createdump libmscordaccore.so SOS.NETCore.dll
crossgen libmscordbi.so SOS.NETCore.pdb
gcinfo libprotononjit.so superpmi
IL libsosplugin.so System.Globalization.Native.a
ilasm libsos.so System.Globalization.Native.so
ildasm libsuperpmi-shim-collector.so System.Private.CoreLib.dll
inc libsuperpmi-shim-counter.so System.Private.CoreLib.ni.<span class="o">{</span>fe21e59b-7903-49b4-b2d3-67de152c1d7d<span class="o">}</span>.map
lib libsuperpmi-shim-simple.so System.Private.CoreLib.xml
libclrgc.so Loader
libclrjit.so mcs
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>Now that an initial build has completed, we can be sure that the scripts which generate a few headers, essential for the rest of the compilation process, have run, and that CLion will be able to find all the necessary source code once we teach it how to…</p>
</li>
<li>
<p>CLion needs to invoke <code class="highlighter-rouge">cmake</code> with the same arguments that the build scripts use. To sniff out the <code class="highlighter-rouge">cmake</code> command line we’ll use an *nix old-timer’s trick to generate traces for the <code class="highlighter-rouge">build.sh</code> run: <code class="highlighter-rouge">bash -x</code>. Unfortunately, nothing is ever so simple in life, and CoreCLR’s <code class="highlighter-rouge">build.sh</code> script doesn’t directly invoke <code class="highlighter-rouge">cmake</code>, so we will need to make this <code class="highlighter-rouge">-x</code> parameter sticky, or recursive. I found no better way to do this than the following somewhat convoluted procedure:<br />
First, we generate a wrapper script for <code class="highlighter-rouge">build.sh</code>, which we’ll call <code class="highlighter-rouge">build-wrapper.sh</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre><span class="nb">echo</span> <span class="s2">"export SHELLOPTS && ./build.sh </span><span class="se">\$</span><span class="s2">@"</span> <span class="o">></span> build-wrapper.sh
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>After we have our wrapper in place, we run it instead of <code class="highlighter-rouge">build.sh</code> like this:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="nv">$ </span>bash <span class="nt">-x</span> ./build-wrapper.sh checked
... <span class="c"># omitted</span>
+ /usr/bin/cmake <span class="nt">-G</span> <span class="s1">'Unix Makefiles'</span> <span class="nt">-DCMAKE_BUILD_TYPE</span><span class="o">=</span>CHECKED <span class="nt">-DCMAKE_INSTALL_PREFIX</span><span class="o">=</span>/home/dmg/projects/public/coreclr/bin/Product/Linux.x64.Checked <span class="nt">-DCMAKE_USER_MAKE_RULES_OVERRIDE</span><span class="o">=</span> <span class="nt">-DCLR_CMAKE_PGO_INSTRUMENT</span><span class="o">=</span>0 <span class="nt">-DCLR_CMAKE_OPTDATA_PATH</span><span class="o">=</span>/home/dmg/.nuget/packages/optimization.linux-x64.pgo.coreclr/99.99.99-master-20190716.1 <span class="nt">-DCLR_CMAKE_PGO_OPTIMIZE</span><span class="o">=</span>1 <span class="nt">-S</span> /home/dmg/projects/public/coreclr <span class="nt">-B</span> /home/dmg/projects/public/coreclr/bin/obj/Linux.x64.Checked
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>Boom! We’ve hit the jackpot. For folks following along who are feeling a bit shaky, I’ve isolated the exact part we’re after below:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre><span class="nt">-G</span> <span class="s1">'Unix Makefiles'</span> <span class="nt">-DCMAKE_BUILD_TYPE</span><span class="o">=</span>CHECKED <span class="nt">-DCMAKE_INSTALL_PREFIX</span><span class="o">=</span>/home/dmg/projects/public/coreclr/bin/Product/Linux.x64.Checked <span class="nt">-DCMAKE_USER_MAKE_RULES_OVERRIDE</span><span class="o">=</span> <span class="nt">-DCLR_CMAKE_PGO_INSTRUMENT</span><span class="o">=</span>0 <span class="nt">-DCLR_CMAKE_OPTDATA_PATH</span><span class="o">=</span>/home/dmg/.nuget/packages/optimization.linux-x64.pgo.coreclr/99.99.99-master-20190716.1 <span class="nt">-DCLR_CMAKE_PGO_OPTIMIZE</span><span class="o">=</span>1 <span class="nt">-S</span> /home/dmg/projects/public/coreclr <span class="nt">-B</span> /home/dmg/projects/public/coreclr/bin/obj/Linux.x64.Checked
</pre></td></tr></tbody></table></code></pre></div> </div>
</li>
<li>
<p>The “hard” part is over. It’s a series of boring clicks from here on. It’s time to open up CLion and get this show on the road:
We’ll start by defining a clang-3.9-based toolchain, since CLion defaults to the gcc toolchain on Linux, while CoreCLR needs clang-3.9 to build itself:<img src="/assets/images/clion-toolchains-coreclr.png" alt="clion-toolchains-coreclr" /></p>
</li>
<li>
<p>With a toolchain setup, we need to tell <code class="highlighter-rouge">cmake</code> about our build configuration, so we set it up like so:
<img src="/assets/images/clion-cmake-coreclr.png" alt="clion-cmake-coreclr" /></p>
<p>I’ve highlighted all the text boxes you’ll need to set. I’ll go over the less trivial stuff:</p>
<ul>
<li>The command line option we just set aside in (3) goes into the <code class="highlighter-rouge">CMake options</code> field.<br />
Unfortunately CLion doesn’t like single quotes (weird…), so I’ve had to change the <code class="highlighter-rouge">-G 'Unix Makefiles'</code> into <code class="highlighter-rouge">-G "Unix Makefiles"</code> (notice the use of double quotes).</li>
<li>It would be wise to share the same build folder that our initial command-line build used; moreover, we might end up going back and forth between CLion and the command line, so I override the “Generation Path” setting with the value <code class="highlighter-rouge">bin/obj/Linux.x64.Checked</code>. This is again extracted from the same command line we set aside before; you’ll find it (in my case) towards the end, specified right after the <code class="highlighter-rouge">-B</code> switch.</li>
<li>For the build options, I’ve specified <code class="highlighter-rouge">-j 8</code>. This option controls how many parallel builds (compilers) are launched during the build process. A good default is to set it to 2x the number of physical cores your machine has, so in my case that means using <code class="highlighter-rouge">-j 8</code>.</li>
</ul>
</li>
<li>That’s it! Let CLion do its thing while grinding your machine to a halt, and once it’s done you can start navigating and building the CoreCLR project like a first-class citizen of the civilized world :)</li>
</ol>
<h2 id="debugging-coreclr-from-clion">Debugging CoreCLR from CLion</h2>
<p>Once CLion understands the CoreCLR project structure, we can take it up a notch and try to debug CoreCLR by launching “something” with a breakpoint set.</p>
<p>Let’s try to debug the JIT as an example for a useful scenario.</p>
<ol>
<li>First we need a console application:
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
</pre></td><td class="rouge-code"><pre> <span class="nv">$ </span><span class="nb">cd</span> /tmp/
<span class="nv">$ </span>dotnet new console <span class="nt">-n</span> clion_dbg_sample
The template <span class="s2">"Console Application"</span> was created successfully.
Processing post-creation actions...
Running <span class="s1">'dotnet restore'</span> on clion_dbg_sample/clion_dbg_sample.csproj...
Restore completed <span class="k">in </span>54.39 ms <span class="k">for</span> /tmp/clion_dbg_sample/clion_dbg_sample.csproj.
Restore succeeded.
<span class="nv">$ </span><span class="nb">cd </span>clion_dbg_sample
<span class="nv">$ </span>dotnet publish <span class="nt">-c</span> release <span class="nt">-o</span> linux-x64 <span class="nt">-r</span> linux-x64
Microsoft <span class="o">(</span>R<span class="o">)</span> Build Engine version 16.3.0+0f4c62fea <span class="k">for</span> .NET Core
Copyright <span class="o">(</span>C<span class="o">)</span> Microsoft Corporation. All rights reserved.
Restore completed <span class="k">in </span>66.26 ms <span class="k">for</span> /tmp/clion_dbg_sample/clion_dbg_sample.csproj.
clion_dbg_sample -> /tmp/clion_dbg_sample/bin/release/netcoreapp3.0/linux-x64/clion_dbg_sample.dll
clion_dbg_sample -> /tmp/clion_dbg_sample/linux-x64/
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>Now we have a console application published in some folder, in my case it’s <code class="highlighter-rouge">/tmp/clion_dbg_sample/linux-x64</code></p>
</li>
<li>
<p>Next we will setup a new configuration under CLion:<br />
<img src="/assets/images/clion-edit-configurations-coreclr.png" alt="" /></p>
</li>
<li>
<p>Now we define a <strong>new</strong> configuration:<br />
<img src="/assets/images/clion-select-executable-coreclr.png" alt="" />
We provide a name (I’ve used the same name as my test program: <code class="highlighter-rouge">clion_dbg_sample</code>), select “All targets” as the Target, and under Executable choose “Select other…” to provide a custom path to <code class="highlighter-rouge">corerun</code>. The reason behind this is that we need to run <code class="highlighter-rouge">corerun</code> from a directory that actually contains the entire product: JIT, GC and everything else.</p>
</li>
<li>
<p>The path we provide is to the <code class="highlighter-rouge">corerun</code> executable that resides in the <code class="highlighter-rouge">bin/Product/Linux.x64.Checked</code> folder:
<img src="/assets/images/clion-custom-executable-coreclr.png" alt="" /></p>
</li>
<li>
<p>Finally we provide our sample project from before to the <code class="highlighter-rouge">corerun</code> executable. This is what my final configuration looks like:<br />
<img src="/assets/images/clion-sample-configuration-final-coreclr.png" alt="" /></p>
</li>
<li>
<p>It’s time to set a breakpoint and launch. As a generic sample, I’ll navigate to <code class="highlighter-rouge">compiler.cpp</code> and find the <code class="highlighter-rouge">jitNativeCode</code> method. It’s pretty much one of the top-level functions in the JIT, and therefore a good candidate for us. If we set a breakpoint in that method and launch our newly created configuration, we should hit it in no time:
<img src="/assets/images/clion-debug-jit-coreclr.png" alt="" /></p>
</li>
<li>We’re done! If you really want to figure out what to do next, it’s probably a good time to hit the <a href="https://github.com/dotnet/coreclr/blob/master/Documentation/botr/README.md">BotR</a>, namely the <a href="https://github.com/dotnet/coreclr/blob/master/Documentation/botr/ryujit-overview.md">RyuJit Overview</a> and <a href="https://github.com/dotnet/coreclr/blob/master/Documentation/botr/ryujit-tutorial.md">RyuJit Tutorial</a> pages that contain a more detailed overview of the JIT. Alternatively, if you’re a “get your hands dirty” sort of person, you can also do some warm-up exercises for your fingers and start hitting that step-into keyboard shortcut. You’re debugging the JIT as we speak!</li>
</ol>
<p>I hope this ends up helping someone who wants to get started digging into the JIT somewhere other than Windows. I also personally have a strong preference for CLion, as I really think it’s a much faster and more powerful option than all the other stuff I’ve tried thus far. At any rate, it’s the only viable option for Linux/macOS people.</p>
<p>Have fun! Let me know on <a href="https://twitter.com/damageboy">twitter</a> if you’re encountering any difficulties or you think I can make anything clearer…</p>damageboydans@houmus.orghttps://bits.houmus.orgWhat/Why?.NET Core 3.0 Intrinsics in Real Life - (Part 3/3)2018-08-20T15:26:28+00:002018-08-20T15:26:28+00:00https://bits.houmus.org/2018-08-20/netcoreapp3.0-intrinsics-in-real-life-pt3<p>As I’ve described in <a href="/2018-08-18/netcoreapp3.0-intrinsics-in-real-life-pt1">part 1</a> & <a href="/2018-08-19/netcoreapp3.0-intrinsics-in-real-life-pt2">part 2</a> of this series, I’ve recently overhauled an internal data structure we use at Work<sup>®</sup> to start using <a href="https://github.com/dotnet/designs/blob/master/accepted/platform-intrinsics.md">platform dependent intrinsics</a>.</p>
<p>If you’ve not read the previous posts, I suggest you do so, as a lot of what is discussed here relies on the code and issues presented there…</p>
<p>As a reminder, this series is made in 3 parts:</p>
<ul>
<li><a href="/2018-08-18/netcoreapp3.0-intrinsics-in-real-life-pt1">The data-structure/operation that we’ll optimize and basic usage of intrinsics</a>.</li>
<li><a href="/2018-08-19/netcoreapp3.0-intrinsics-in-real-life-pt2">Using intrinsics more effectively</a></li>
<li>The C++ version(s) of the corresponding C# code, and what I learned from them (this post).</li>
</ul>
<p>All of the code (C# & C++) is published under the <a href="https://github.com/damageboy/bitgoo">bitgoo github repo</a>.</p>
<h2 id="c-vs-c">C++ vs. C#</h2>
<p>I think I’ve mentioned this somewhere before: I started working on better versions of my bitmap search function way before CoreCLR intrinsics were even imagined. This led me to start to tinkering with C++ code where I tried out most of my ideas. When CoreCLR 3.0 became real enough, I ported the C++ code back to C# (which surprisingly consisted of a couple of search and replace operations, no more…).</p>
<p>As such, having two close implementations begs performing a head-to-head comparison.
After some additional work, I had basic <a href="https://github.com/google/benchmark">google benchmark</a> and <a href="https://github.com/google/googletest">google test</a> suites up and running<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">1</a></sup><br />
I’ll cut right to the chase and present a relative comparison between C++ and C# for the last version we ran in our previous post, The C# method is <code class="highlighter-rouge">POPCNTAndBMI2Unrolled</code> and the C++ one is <code class="highlighter-rouge">POPCNTAndBMI2Unrolled2</code>:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>N</th>
<th>C# Mean (ns)</th>
<th>C++ Mean (ns)</th>
<th>C++/C# Ratio</th>
</tr>
</thead>
<tbody>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>1</td>
<td>2.249</td>
<td>3.338</td>
<td>148.42%</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>4</td>
<td>10.904</td>
<td>11.037</td>
<td>101.22%</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>16</td>
<td>50.368</td>
<td>43.786</td>
<td>86.93%</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>64</td>
<td>208.272</td>
<td>202.366</td>
<td>97.16%</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>256</td>
<td>1,580.026</td>
<td>1,493.020</td>
<td>94.49%</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>1024</td>
<td>21,282.905</td>
<td>11,520.900</td>
<td>54.13%</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>4096</td>
<td>255,186.977</td>
<td>133,976.543</td>
<td>52.50%</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>16384</td>
<td>3,730,420.068</td>
<td>1,754,421.485</td>
<td>47.03%</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>65536</td>
<td>56,939,817.593</td>
<td>26,613,731.568</td>
<td>46.74%</td>
</tr>
</tbody>
</table>
<p>There are a few things that stand out from this comparison:</p>
<ul>
<li>The percentage differences in the low bit counts (1,4) should be ignored, they are minuscule in absolute terms and within the margin of error.</li>
<li>C# is doing pretty well up to 256 bits, when we <strong>don’t</strong> execute the unrolled loop; it’s basically neck and neck with C++.</li>
<li>Sweet mercy, what is going on from 1024 bits and onwards, inside the unrolled loop? Why is there such a big difference in what is a relatively optimized (and equivalent) piece of code between the two languages?</li>
</ul>
<p>I’ll cut to the chase and answer this last question directly, then proceed to explain the relevant underlying basics (<em>tl;dr</em>: they’re not so basic) of CPU pipelining and register renaming, so that the explanation sticks for readers who are not familiar with those terms/concepts.</p>
<p>The bottom line is: there is a bug in the CPU! There is a well known (even if very cryptic) <a href="https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf">erratum</a> about this bug, and compiler developers are more or less generally aware of this issue and have been <em>working around</em> it for the better part of the last 5 years.</p>
<h3 id="false-dependencies">False Dependencies</h3>
<p>So what is this mysterious CPU bug all about? The JIT was producing what should be, according to the processor documentation, pretty good code:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="rouge-code"><pre><span class="nl">BEGIN_POPCNT_UROLLED_LOOP:</span>
<span class="nf">popcnt</span> <span class="nb">rsi</span><span class="p">,</span> <span class="kt">qword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rcx</span><span class="p">]</span>
<span class="nf">sub</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nb">rsi</span>
<span class="nf">popcnt</span> <span class="nb">rsi</span><span class="p">,</span> <span class="kt">qword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rcx</span><span class="o">+</span><span class="mi">8</span><span class="p">]</span>
<span class="nf">sub</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nb">rsi</span>
<span class="nf">popcnt</span> <span class="nb">rsi</span><span class="p">,</span> <span class="kt">qword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rcx</span><span class="o">+</span><span class="mi">16</span><span class="p">]</span>
<span class="nf">sub</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nb">rsi</span>
<span class="nf">popcnt</span> <span class="nb">rsi</span><span class="p">,</span> <span class="kt">qword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rcx</span><span class="o">+</span><span class="mi">24</span><span class="p">]</span>
<span class="nf">sub</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nb">rsi</span>
<span class="nf">add</span> <span class="nb">rcx</span><span class="p">,</span> <span class="mi">32</span>
<span class="nf">cmp</span> <span class="nb">rdx</span><span class="p">,</span> <span class="mi">256</span>
<span class="nf">jge</span> <span class="nv">SHORT</span> <span class="nv">BEGIN_POPCNT_UROLLED_LOOP</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>What we see above is an excerpt from <code class="highlighter-rouge">POPCNTAndBMI2Unrolled</code> method’s assembly code, and more specifically the unrolled loop that does 4 <code class="highlighter-rouge">POPCNT</code> instructions in succession.</p>
<p>Even if you are not an assembly guru, it’s pretty clear we have 4 pairs of <code class="highlighter-rouge">POPCNT</code> + <code class="highlighter-rouge">SUB</code> instructions, where:</p>
<ul>
<li>Each <code class="highlighter-rouge">POPCNT</code> instruction is <strong>reading</strong> from successive memory addresses and <strong>writing</strong> their result temporarily into a register <em>named</em> <code class="highlighter-rouge">rsi</code>.</li>
<li>This temporary value is then subtracted using <code class="highlighter-rouge">SUB</code> from another register which represents our good old C# variable <code class="highlighter-rouge">n</code> (the target-bit count).</li>
</ul>
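<p>For readers who prefer C over assembly, here is a rough C++ rendition of that unrolled loop. This is my own sketch, not the bitgoo source: the function name is mine, GCC/Clang’s <code class="highlighter-rouge">__builtin_popcountll</code> stands in for the raw <code class="highlighter-rouge">POPCNT</code> intrinsic, and the bottom-tested loop from the excerpt is folded into a plain <code class="highlighter-rouge">while</code>:</p>

```cpp
#include <cstdint>

// Mirrors the unrolled loop above: each iteration consumes four 64-bit
// words, subtracting their set-bit counts from n (the remaining target-bit
// count), for as long as n >= 256 -- the cmp rdx, 256 / jge pair.
static const uint64_t* skip_words(const uint64_t* p, int64_t& n) {
    while (n >= 256) {
        n -= __builtin_popcountll(p[0]); // popcnt rsi, [rcx]    / sub
        n -= __builtin_popcountll(p[1]); // popcnt rsi, [rcx+8]  / sub
        n -= __builtin_popcountll(p[2]); // popcnt rsi, [rcx+16] / sub
        n -= __builtin_popcountll(p[3]); // popcnt rsi, [rcx+24] / sub
        p += 4;                          // add rcx, 32 (bytes)
    }
    return p;
}
```

<p>In principle, the four popcounts in an iteration are independent of one another, which is exactly the parallelism the false dependency described below ends up destroying.</p>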
<p>The <em>high-level</em> explanation of the bug goes like this:</p>
<ol>
<li>The CPU <em>should</em> have <strong>detected</strong> that each <code class="highlighter-rouge">POPCNT</code> + <code class="highlighter-rouge">SUB</code> instruction <em>pair</em> is effectively <em>independent</em> of the previous pair (inside our unrolled loop and <em>between</em> the loop’s iterations). In other words: although all 4 pairs are using the same destination register (<code class="highlighter-rouge">rsi</code>), each such pair is really not dependent on the previous value of <code class="highlighter-rouge">rsi</code>.</li>
<li>This dependency analysis, performed by the CPU, <em>should</em> have <em>enabled</em> it to use an internal optimization called register-renaming (more on that later).</li>
<li><em>Had</em> register renaming been triggered the CPU could have processed our <code class="highlighter-rouge">POPCNT</code> instructions with a higher degree of parallelism: In other words, our CPU, would run a few <code class="highlighter-rouge">POPCNT</code> instructions in <strong>parallel</strong> at any given moment. This would lead to better perf or better IPC (Instruction-Per-Cycle ratio).</li>
<li>In reality, the bug is causing the CPU to delay the processing of each such pair of instructions for a few cycles, per pair, introducing a lot of “garbage time” inside the CPU, where it’s stalling, doing less work than it should, leading to the slowdown we are seeing.</li>
</ol>
<p>Terminology wise, this sort of bug is called a <em>false-dependency</em> bug: In our case, the CPU wrongfully introduces a dependency between the different <code class="highlighter-rouge">POPCNT</code> instructions on their destination register, it <em>thinks</em> each <code class="highlighter-rouge">POPCNT</code> instruction is <strong>not only writing</strong> into <code class="highlighter-rouge">rsi</code> but <strong>also reading</strong> from it! (it does no such thing)<br />
With this false dependency in place, the CPU is prevented from using register renaming to execute the code more efficiently.</p>
<p>I will first focus on describing how compilers have been working around this, and afterward, I will describe in much more detail how the CPU employs register renaming to improve the throughput of the pipeline when the bug does not exist <em>or</em> is worked around.</p>
<h3 id="working-around-false-dependencies">Working Around False Dependencies</h3>
<p>As I’ve mentioned, this bug has been around for quite some time: It was reported <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011">somewhere in 2014</a> and is unfortunately still persistent to this day on most Intel CPUs, at least when it comes to the <code class="highlighter-rouge">POPCNT</code> instruction<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote">2</a></sup>.</p>
<p>Luckily, compiler developers have been able to work around this issue with relative ease by generating <em>extra code</em> that <strong><em>breaks</em></strong> the aforementioned false-dependency. As far as I can tell, the people who originally wrote the workarounds were Intel developers, so they had a very good understanding of the exact nature of this false-dependency. What they opted to do was make compilers introduce a two-byte instruction that clears the lower 32 bits of the <em>destination</em> register. In our case, this comes in the form of a <code class="highlighter-rouge">xor esi, esi</code> instruction. This is the shortest way (instruction length-wise) on x86 CPUs to zero out a register. This instruction is a well-known special case in the CPU, since it “knows” the future value of the destination register (0) without executing it, or knowing what its original value ever was. It appears the Intel engineers <em>knew</em> that the dependency is not on the entire 64-bit register (<code class="highlighter-rouge">rsi</code>) but only on the lower 32-bit part of that register (<code class="highlighter-rouge">esi</code><sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote">3</a></sup>) and took advantage of this understanding to introduce a two-byte fix into the instruction stream, which is relatively cheap.</p>
<p>The correct x86 assembly, generated by a fixed JIT or compiler should look like this:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
</pre></td><td class="rouge-code"><pre><span class="nl">BEGIN_POPCNT_UNROLLED_LOOP:</span>
<span class="nf">xor</span> <span class="nb">esi</span><span class="p">,</span> <span class="nb">esi</span> <span class="c1">; This breaks the dependency</span>
<span class="nf">popcnt</span> <span class="nb">rsi</span><span class="p">,</span> <span class="kt">qword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rcx</span><span class="p">]</span>
<span class="nf">sub</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nb">rsi</span>
<span class="nf">xor</span> <span class="nb">esi</span><span class="p">,</span> <span class="nb">esi</span> <span class="c1">; This breaks the dependency</span>
<span class="nf">popcnt</span> <span class="nb">rsi</span><span class="p">,</span> <span class="kt">qword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rcx</span><span class="o">+</span><span class="mi">8</span><span class="p">]</span>
<span class="nf">sub</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nb">rsi</span>
<span class="nf">xor</span> <span class="nb">esi</span><span class="p">,</span> <span class="nb">esi</span> <span class="c1">; This breaks the dependency</span>
<span class="nf">popcnt</span> <span class="nb">rsi</span><span class="p">,</span> <span class="kt">qword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rcx</span><span class="o">+</span><span class="mi">16</span><span class="p">]</span>
<span class="nf">sub</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nb">rsi</span>
<span class="nf">xor</span> <span class="nb">esi</span><span class="p">,</span> <span class="nb">esi</span> <span class="c1">; This breaks the dependency</span>
<span class="nf">popcnt</span> <span class="nb">rsi</span><span class="p">,</span> <span class="kt">qword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rcx</span><span class="o">+</span><span class="mi">24</span><span class="p">]</span>
<span class="nf">sub</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nb">rsi</span>
<span class="nf">add</span> <span class="nb">rcx</span><span class="p">,</span> <span class="mi">32</span>
<span class="nf">cmp</span> <span class="nb">rdx</span><span class="p">,</span> <span class="mi">256</span>
<span class="nf">jge</span> <span class="nv">SHORT</span> <span class="nv">BEGIN_POPCNT_UNROLLED_LOOP</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>This short piece of code is what gcc/clang would generate for <code class="highlighter-rouge">POPCNT</code> to work around the bug. Read out of context, it looks silly… it appears as if the compiler generated useless code, and you’ll find people publicly wondering about this on StackOverflow and other forums from time to time, or worse yet: trying to “fix” it. But on most in-production x86 CPUs (i.e. all the ones that suffer from this false-dependency) this code will substantially outperform the original code we saw above…</p>
<h2 id="update-coreclr-does-the-right-thing">Update: CoreCLR does the right thing</h2>
<p>I originally started writing part 3 after I found this issue with the JIT and submitted <a href="https://github.com/dotnet/coreclr/issues/19555">an issue</a>, thinking I would finish writing this post before anyone would fix the underlying issue. I was wrong on both counts: writing this post became an ever-growing challenge as I attempted to explain pipelines and register-renaming for the uninitiated (below), while <a href="https://github.com/dotnet/coreclr/pull/19772">Fei Peng fixed the issue</a> in a matter of two weeks (Thanks!).</p>
<p>What CoreCLR now does (since commit <a href="https://github.com/dotnet/coreclr/pull/19772/commits/6957b4f44f0917209df89499b7c4071bb0bc1941">6957b4f</a>) is <strong>always</strong> introduce the <code class="highlighter-rouge">xor dest, dest</code> workaround/dependency breaker for the 3 affected instructions: <code class="highlighter-rouge">LZCNT</code>, <code class="highlighter-rouge">TZCNT</code>, and <code class="highlighter-rouge">POPCNT</code>. This is <em>not the optimal</em> solution, since the JIT will introduce it both for CPUs afflicted with this bug (specific Intel CPUs) and for CPUs that <strong>don’t</strong> have it (all AMD CPUs and newer Intel CPUs).<br />
From the discussion, it’s clear that this path was chosen for simplicity’s sake: detecting the correct CPU family inside the JIT would require more infrastructure, raise questions about what the JIT should do in the case of AOT (Ahead Of Time) compilation, and require more testing infrastructure than is currently in place, while the two-byte fix is very cheap even on CPUs that are not affected.</p>
<p>Let’s see if this CoreCLR fix does anything to our unmodified piece of code…:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>N</th>
<th style="text-align: right">Mean (ns)</th>
<th style="text-align: right">Scaled To “buggy” CoreCLR</th>
<th style="text-align: right">Scaled to C++</th>
</tr>
</thead>
<tbody>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>1</td>
<td style="text-align: right">2.170</td>
<td style="text-align: right">0.96</td>
<td style="text-align: right">0.65</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>4</td>
<td style="text-align: right">11.910</td>
<td style="text-align: right">1.09</td>
<td style="text-align: right">1.08</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>16</td>
<td style="text-align: right">55.016</td>
<td style="text-align: right">1.09</td>
<td style="text-align: right">1.26</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>64</td>
<td style="text-align: right">225.156</td>
<td style="text-align: right">1.08</td>
<td style="text-align: right">1.11</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>256</td>
<td style="text-align: right">1,637.336</td>
<td style="text-align: right">1.04</td>
<td style="text-align: right">1.10</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>1024</td>
<td style="text-align: right">11,698.421</td>
<td style="text-align: right">0.55</td>
<td style="text-align: right">1.02</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>4096</td>
<td style="text-align: right">149,247.146</td>
<td style="text-align: right">0.58</td>
<td style="text-align: right">1.11</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>16384</td>
<td style="text-align: right">1,904,945.748</td>
<td style="text-align: right">0.51</td>
<td style="text-align: right">1.09</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>65536</td>
<td style="text-align: right">27,712,720.427</td>
<td style="text-align: right">0.49</td>
<td style="text-align: right">1.04</td>
</tr>
</tbody>
</table>
<p>It sure does! It appears that the unrolled version now runs roughly 85-101% faster for higher bit counts than it did with the previous, unfixed CoreCLR! When compared to C++, performance is now pretty close and consistent for the important parts of the benchmark. If you consider, for a moment, that we got here by making the JIT spill out <em>an extra, supposedly useless</em> instruction, the achievement becomes that much more impressive :). As before, <a href="https://gist.github.com/damageboy/0266018efbbf0a8478aa4d50de1c894f">here is the JITDump</a> with the newly fixed JIT in place.</p>
<p>Now, we can really see just how much of a profound effect this false-dependency had on performance. In theory, this might be the right time to finish this post, however, I couldn’t let it go without attempting to explain the underlying CPU internals of <em>how and why</em> the false-dependency had such a deep effect on performance. For readers well aware of how CPU pipelines operate and how they interact with the register renaming functionality on a modern super-scalar out-of-order CPU this is a good time to stop reading.<br />
What follows is me trying to explain how the CPU tries to handle loops of code effectively, and how register renaming plays an important role in that.</p>
<h2 id="the-lovehate-story-that-is-tight-loops-in-cpus">The love/hate story that is tight loops in CPUs</h2>
<p>It takes very little imagination to realize that CPUs spend a lot of their processing time executing loops (or, in this context: executing the same machine code multiple times). <br />
We need to remember that CPUs achieve remarkable throughput (e.g. instructions per cycles, or IPC) even though the table, in some ways, is set <strong>against</strong> them:</p>
<ul>
<li>A modern CPU will often have a dozen or so stages in their pipeline (examples: 14 in Skylake, 19 in AMD Ryzen)
<ul>
<li>This means a single instruction would take about 14 cycles on my CPU from start to finish if we were only executing that instruction and waiting for it to complete!</li>
</ul>
</li>
<li>The CPU attempts to handle multiple instructions in different stages of the pipeline, but it may become <em>stalled</em> (i.e. do no work) when it needs to wait for a previous instruction to advance through the pipeline enough to have its result ready (this is generally referred to as instruction dependencies).</li>
<li>To improve the utilization of CPU caches (L1/2/3 caches) and the memory bus, most modern processors artificially limit the number of register <strong>names</strong> they support in instructions (it seems that as of 2018 everyone has settled on 16 general-purpose registers, except for PowerPC at 32)
<ul>
<li>That way instructions take up fewer bits and can be read more quickly over these highly subscribed resources (caches and memory bus).</li>
<li>The flip side of this design decision is that compilers cannot generate code that uses many different registers, which in turn leads them to generate more code fragments that are dependent on each other because of the limited register names available to them.</li>
</ul>
</li>
</ul>
<p>With that in mind, let’s take the same, short piece of assembly code, which was generated by the JIT for our last unrolled attempt, and see how it theoretically executes on a Skylake CPU.</p>
<h2 id="visualizing-our-loop">Visualizing our loop</h2>
<p>Without any additional fanfare, let’s introduce the following visualization:</p>
<p><img src="/assets/images/iaca-popcnt-retirement.svg" alt="iaca-popcnt" /></p>
<p>I created this diagram by prettifying a trace file generated by a little known tool made by Intel called <a href="https://software.intel.com/en-us/articles/intel-architecture-code-analyzer">IACA</a>, which stands for <strong>I</strong>ntel <strong>A</strong>rchitecture <strong>C</strong>ode <strong>A</strong>nalyzer. IACA takes a piece of machine code + target CPU family and produces a textual trace file that can help us see better what the CPU does, at every cycle of a relatively short loop.<br />
If you dislike having to use commercial (non-OSS) tools, please note that there is a similar tool by the LLVM project called <a href="https://llvm.org/docs/CommandGuide/llvm-mca.html">llvm-mca</a>, and you can even use it from the <a href="https://godbolt.org/z/baOZWy">infamous compiler-explorer</a>.</p>
<p>Let’s try to break this diagram down:</p>
<ul>
<li>The leftmost column contains the loop counter; I’ve limited the trace to 2 iterations [0, 1] of that loop, to keep everything compact.</li>
<li>Next, the instruction counter <em>within</em> its respective loop. Clearly we have 11 instructions per loop.</li>
<li>Next, the disassembly, where we can see 4 <code class="highlighter-rouge">POPCNT</code> instructions and they are interleaved with 4 subtractions of each <code class="highlighter-rouge">POPCNT</code> result from the register <code class="highlighter-rouge">rdx</code></li>
<li>Next we see how the instructions are broken down into µops<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote">4</a></sup>:<br />
For now, we will simply note that every <code class="highlighter-rouge">POPCNT</code>, having been encoded as an instruction that both reads from memory AND calculates the population count, was broken down into two µops:
<ul>
<li>A load µop (<code class="highlighter-rouge">TYPE_LOAD</code>) loading the data from its respective pointer.</li>
<li>An operation µop (<code class="highlighter-rouge">TYPE_OP</code>) performing the actual <code class="highlighter-rouge">POPCNT</code>ing into our destination register (<code class="highlighter-rouge">rsi</code>).</li>
</ul>
</li>
<li>Then comes the real kicker: IACA <strong>simulates</strong> what a Skylake CPU (specifically) <em>should</em> be doing at every cycle of those two loop iterations and provides us with critical insight into the state each instruction is in at every cycle (relative to the beginning of the first loop). These states are described by the coded symbols in each box, which I will shortly describe in more detail.</li>
</ul>
<p class="notice--warning">It is important to note that IACA, while being Intel’s <em>own tool</em> is <strong>not</strong> aware of the Intel CPU bug I just described. It is simulating what that processor <em>should have</em> done with NO false dependency…</p>
<p>While all the various states of the instruction within the pipeline are interesting I will give some more meaning to specific states:</p>
<table>
<thead>
<tr>
<th>mnemonic</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>d</td>
<td>Dispatched to execution: The CPU has completed decoding and is waiting for the instruction’s dependencies to be ready. Execution will begin in the next cycle</td>
</tr>
<tr>
<td>e</td>
<td>Executing: The instruction is being executed, often in multiple cycles within a specific execution port (unit) inside the CPU</td>
</tr>
<tr>
<td>w</td>
<td>Writeback: The instruction’s result is being written back to a register in the register-file (more on this below), where it will be available for other instructions that might have a dependency on that instruction</td>
</tr>
<tr>
<td>R</td>
<td>Retired: The temporary register used during execution/writeback is written back to the “real” destination register, in the original order of the program code; this is called retirement, after which the CPU’s internal, temporary register is free again (more on this below)</td>
</tr>
</tbody>
</table>
<p>I encourage you to try to follow this execution trace for a couple of instructions. I like to stare at these things for hours, trying to tell a story in my own head in the form of “what is the CPU thinking now” for each and every cycle. There is much we could say about this, but I will highlight a few remarkable things:</p>
<ul>
<li>I’ve highlighted the <code class="highlighter-rouge">R</code> symbol/stage with a <span style="color:red"><strong>red-ellipse</strong></span>. For our purposes here, this represents the final stage of each instruction. To me, it’s very impressive to see how all of these instructions terminate execution either 0 or 1 cycles apart of each other.</li>
<li>By the time the first instruction (<code class="highlighter-rouge">POPCNT</code>) reaches the <code class="highlighter-rouge">R</code> (retired) state at cycle 14, when it’s done, we are <em>already</em> executing, in some pipeline stage or another, all instructions from the next 4 iterations of this unrolled loop (I’ve limited the visualization to only 2 iterations for brevity, but you get the hang of it).
<ul>
<li>The processor is already (speculatively) executing loads from memory to satisfy our <code class="highlighter-rouge">POPCNT</code> instructions in loop iterations 1,2,3 before the first iteration has even completed running, and without even knowing for sure our loop would actually execute for that amount of iterations.</li>
<li>Quantitatively speaking: We have roughly 4 iterations of an 11 instruction loop (> 40 instructions) all running in parallel inside one core(!) of our processor. This is possible both because of the length of the pipeline (14 stages for this specific processor) and the fact that internally, the processor has multiple units or ports capable of running various instructions in parallel. This is often referred to as a super-scalar CPU.</li>
</ul>
</li>
</ul>
<p>In case you are interested in digging much deeper than I can afford to go within this post, I suggest you read <a href="http://www.lighterra.com/papers/modernmicroprocessors/">Modern Microprocessors: A 90-Minute Guide!</a> to get more detailed information about pipelines, super-scalar CPUs, everything I try to cover here, and more.</p>
<p>For this post, I will focus on one key aspect that lies in the root of how the CPU manages to do so many things at the same time: register renaming.</p>
<h3 id="instruction-dependencies">Instruction Dependencies</h3>
<p>Let’s look at the code again, this time adding arrows between the various instructions, marking their interdependencies.</p>
<p><img src="/assets/images/popcnt-dependencies.svg" alt="popcnt-deps" /></p>
<p>If we interpret this code naively (and wrongly), we see that <code class="highlighter-rouge">rsi</code> is being used in each and every instruction of this code fragment, this could lead us to assume that the heavy usage of <code class="highlighter-rouge">rsi</code> is generating a long dependency chain:</p>
<ul>
<li>The <code class="highlighter-rouge">POPCNT</code> is writing into <code class="highlighter-rouge">rsi</code>.</li>
<li><code class="highlighter-rouge">rsi</code> is then used as a source for the subtraction from <code class="highlighter-rouge">rdx</code>, so naturally, the <code class="highlighter-rouge">sub</code> instruction cannot proceed before <code class="highlighter-rouge">rsi</code> has the value of <code class="highlighter-rouge">POPCNT</code>.</li>
<li>The next <code class="highlighter-rouge">POPCNT</code> is again writing to <code class="highlighter-rouge">rsi</code> but would seemingly be unable to write before the previous <code class="highlighter-rouge">sub</code> has finished.</li>
<li>After four such operations, we loop (in turquoise) again and we are again taking a dependency on <code class="highlighter-rouge">rsi</code> at the beginning of the loop.</li>
</ul>
<p>This naive dependency analysis pretty much contradicts the output we saw come out of IACA in the previous diagram without further explanation. It would seem impossible for the CPU to run so many things in parallel where every instruction here seems to have a dependency through the use of the <code class="highlighter-rouge">rsi</code> register.<br />
Moreover, both our original C# and C++ code did not force the JIT/compiler to re-use the same register over and over. It could have allocated 4 different registers and used them to generate code where each <code class="highlighter-rouge">POPCNT</code> + <code class="highlighter-rouge">SUB</code> pair would be independent of the previous one, so why didn’t it do so?<br />
Well, it turns out there is no need to! The JIT/compiler is doing exactly what it needs to do; it is just us who need to learn about a very important concept in modern processors called register renaming.</p>
<h3 id="register-renaming">Register Renaming</h3>
<p>To understand why anyone would need something like register renaming, we first need to understand that CPU designers are stuck between a rock and a hard place:</p>
<ul>
<li>On one hand they want to be able to read our program code as fast as possible, from memory 🡒 cache 🡒 instruction decoder (a.k.a CPU front end), this requirement leads down a path where they have to severely <em>limit</em> the number of register <em>names</em> available for machine code, since fewer register names leads to more compact instructions (fewer bits) in memory and more efficient utilization of memory buses and caches.</li>
<li>On the other hand, they would like to give compilers / JIT engines as much flexibility as possible in using as many registers as they want (possibly hundreds) without needing to move their contents into memory (or more realistically: CPU cache) just because they ran out of registers names.</li>
</ul>
<p>These contradicting requirements led CPU designers to decouple the idea of register names from register storage: modern CPUs have many more (hundreds of) physical registers (storage) in their register-file than they have names for our software to use. This is where register renaming enters the scene.</p>
<p>What CPU designers have been doing, for quite a long time now (<a href="https://ieeexplore.ieee.org/document/5392015">before 1967</a>, believe it or not!) is really remarkable: they have been employing a really neat trick that effectively gets the best of both worlds (i.e. satisfies both requirements) at the cost of more complexity, more power usage, and more stages in the pipeline (hence also a slight slowdown in the execution of a single instruction), to achieve better pipeline utilization at the global scale.</p>
<p>This optimization, named “register renaming”, accomplishes just that: by analyzing <em>when</em> a register is being <strong>written to</strong> (write-only, not read-write), the CPU “understands” that the previous value of that register is <em>no longer required</em> for the execution of instructions reading/writing that same register from that moment onwards, even if previous instructions have not completed (or started) execution! What this really means is that if we go back to the naive (now you see why) dependency analysis we did in the previous section, it’s clear that each <code class="highlighter-rouge">POPCNT</code> + <code class="highlighter-rouge">SUB</code> pair is actually completely <strong>independent</strong> of the others, because each pair begins by overwriting <code class="highlighter-rouge">rsi</code>! In other words, each <code class="highlighter-rouge">POPCNT</code>, having written to <code class="highlighter-rouge">rsi</code>, is considered to break the dependency chain from that moment onwards.
What the CPU does, therefore, is continuously re-map <em>named</em> registers to different register <em>locations</em> in the register-file, according to the real dependency chain, and use that newly <strong>allocated</strong> location within the register file (hence the initial “Allocation” stage in the IACA diagram above) until the dependency chain is broken again (e.g. the same register is written to again).<br />
I cannot emphasize enough how important a tool this is for the CPU. Register renaming allows it to schedule multiple instructions to execute concurrently, either at different stages of the same pipeline or in parallel in different execution ports (pipelines) that exist in a super-scalar CPU. Moreover, this optimization achieves all this while keeping the machine code small and easy to decode, since very few bits need to be allocated for register names!</p>
<p class="notice--info">How big of a deal is this? How good is the CPU at using this renaming trick? To best answer this from a practical standpoint, I think, we can take a look at the disparity between how many register <em>names</em> exist, for example, in the x64 architecture, that number being 16, and how <em>much physical register storage</em> there is in the register-file, for example, on an Intel Skylake CPU: 180 (!).</p>
<p>After the temporary (renamed) register has finished its job for a given instruction chain, we are still, unfortunately, not <em>entirely</em> done with it. Understand that the CPU cannot look too far into the incoming instruction stream (mostly a few dozen bytes), and it cannot know, with certainty, whether the last value it just wrote to a renamed register will be required by some future part of the code it hasn’t seen yet, hundreds of instructions in the future. This brings us to the last phase of register renaming, which is retirement: The CPU must still write the last value of our <em>symbolic</em> register (<code class="highlighter-rouge">rsi</code>) back to the canonical location of that register (a.k.a the “real” register), in case future instructions that have not been loaded/decoded attempt to read that value.<br />
Moreover, this retirement phase must be performed exactly in program order for the program to continue operating as its original intention was.</p>
<h3 id="wrapping-up-clearing-the-register-for-the-rescue">Wrapping up: clearing the register for the rescue</h3>
<p>So going back to our false-dependency bug, we can now hopefully understand the underlying issue and the fix armed with our new knowledge:</p>
<p>Our Intel CPU wrongly misunderstands our <code class="highlighter-rouge">POPCNT</code> instruction, when it comes to its dependency analysis: It <strong>“thinks”</strong> our usage of <code class="highlighter-rouge">rsi</code> is not only writing to it but also reading from it.<br />
This is the false-dependency at the root of this issue. We cannot see this with IACA, but we can understand it conceptually: If the CPU (wrongfully) “thinks” that our second <code class="highlighter-rouge">POPCNT</code> has to READ the previous <code class="highlighter-rouge">rsi</code> value, then no register renaming can occur at that point, and the second <code class="highlighter-rouge">POPCNT</code> instruction cannot execute in parallel to the first one, it needs to wait for the completion of the first <code class="highlighter-rouge">POPCNT</code> and basically stall for a few precious cycles, in order for the previous <code class="highlighter-rouge">rsi</code> to be written back somewhere. Naturally this is true for every unrolled <code class="highlighter-rouge">POPCNT</code> in our loop and <em>also</em> between loop iterations.<br />
This alone is enough to cause the perf drop we saw originally with the C# code before CoreCLR was patched. Once the <code class="highlighter-rouge">xor esi,esi</code> dependency breaker is added to the instruction stream, we are basically “informing” the CPU that we really are not dependent on the previous value of <code class="highlighter-rouge">rsi</code> and we allow it to perform register renaming from that point onwards. It still wrongfully thinks that <code class="highlighter-rouge">POPCNT</code> reads from <code class="highlighter-rouge">rsi</code> but thanks to our otherwise seemingly superfluous <code class="highlighter-rouge">xor</code>, this is an <em>already renamed</em> <code class="highlighter-rouge">rsi</code> and the pipeline stall is averted.</p>
<p>I think it is pretty clear by now, although we barely scratched the surface of CPU internals, that CPUs are very complex, and that in the race to extract more performance out of code, today’s out-of-order, super-scalar CPUs go to extreme lengths to find ways to parallelize machine code execution.<br />
It should also be clear that it’s important to be able to <a href="https://mechanical-sympathy.blogspot.com/2011/07/why-mechanical-sympathy.html">empathize with the machine</a> and understand the true nature of its inner workings to really be able to deal with the weirdness we experience as we try to make stuff go faster.</p>
<p>It would be great if all we needed to do was keep compiler and hardware developers well fed and well paid so we could do our job without needing to know any of this, and to a great extent, this statement is true. But more often than not, extreme performance requires deeper understanding.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>As a side note, after not doing serious C++ work for years, coming back to it and discovering sanitizers, cmake, google test & benchmark was a very pleasant surprise. I distinctly remember the surprise of writing C++ and not having violent murderous thoughts at the same time. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Apparently Intel has fixed the bug (according to reports) for the <code class="highlighter-rouge">LZCNT</code> and <code class="highlighter-rouge">TZCNT</code> instructions on Skylake processors, but not so for the <code class="highlighter-rouge">POPCNT</code> instruction for reasons unknown to practically anyone. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>yes, x86 registers are weird in that way, where <em>some</em> 64 bit registers have additional symbolic names referring to their lower 32, 16, and both 8 bit parts of their lower 16 bits, don’t ask. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>µop or micro-op, is a low-level hardware operation. The CPU Front-End is responsible for reading the x86 machine code and decoding them into one or more µops. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<p>damageboy (dans@houmus.org), <a href="https://bits.houmus.org">https://bits.houmus.org</a></p>
<h2>.NET Core 3.0 Intrinsics in Real Life - (Part 2/3)</h2>
<p>2018-08-19T15:26:28+00:00, <a href="https://bits.houmus.org/2018-08-19/netcoreapp3.0-intrinsics-in-real-life-pt2">https://bits.houmus.org/2018-08-19/netcoreapp3.0-intrinsics-in-real-life-pt2</a></p>
<p>As I’ve described in <a href="/2018-08-18/netcoreapp3.0-intrinsics-in-real-life-pt1">part 1</a> of this series, I’ve recently overhauled an internal data structure we use at Work<sup>®</sup> to start using <a href="https://github.com/dotnet/designs/blob/master/accepted/platform-intrinsics.md">platform dependent intrinsics</a>.</p>
<p>If you’ve not read part 1 yet, I suggest you do so, since we continue right where we left off…</p>
<p>As a reminder, this series is made in 3 parts:</p>
<ul>
<li><a href="/2018-08-18/netcoreapp3.0-intrinsics-in-real-life-pt1">The data-structure/operation that we’ll optimize and basic usage of intrinsics</a>.</li>
<li>Using intrinsics more effectively (this post).</li>
<li><a href="/2018-08-20/netcoreapp3.0-intrinsics-in-real-life-pt3">The C++ version(s) of the corresponding C# code, and what I learned from them</a>.</li>
</ul>
<p>All of the code (C# & C++) is published under the <a href="https://github.com/damageboy/bitgoo">bitgoo github repo</a>.</p>
<h3 id="pdep---parallel-bit-deposit">PDEP - Parallel Bit Deposit</h3>
<p>We’re about to twist our heads with a bit of a challenge. For me, this was a lot of fun, since I got to play with something I knew <em>nothing</em> about, which turned out to be very useful, and not only for this specific task, but in general.</p>
<p>We’re going to optimize a subset of this method’s performance “spectrum”: lower bit counts.<br />
If you go back to the previous iteration of the code, you can clearly see that apart from the one 64-bit <code class="highlighter-rouge">POPCNT</code> loop up at the top, the ratio between instructions executed and bits processed for low values of <code class="highlighter-rouge">N</code> doesn’t look too good. I summed up the instruction counts from the JIT Dump linked above:</p>
<ul>
<li>The 64-bit <code class="highlighter-rouge">POPCNT</code> loop takes 10 instructions, split into two fragments of the function, processing 64 bits each iteration.</li>
<li>The rest of the code (31 instructions not including the <code class="highlighter-rouge">ret</code>!) is spent processing the last <= 64 bits, executing a single time.</li>
</ul>
<p>While just counting instructions isn’t the best profiling metric in the world, it’s still very revealing…<br />
Wouldn’t it be great if we could do something to improve that last, long code fragment?
Guess what…<br />
Yes we can, using a weird little instruction called <a href="https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets#Parallel_bit_deposit_and_extract"><code class="highlighter-rouge">PDEP</code></a>, whose description (copy-pasted from <a href="https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf">Intel’s bible of instructions</a>, page 922) goes like this:</p>
<blockquote>
<p>PDEP uses a mask in the second source operand (the third operand) to transfer/scatter contiguous low order bits in the first source operand (the second operand) into the destination (the first operand). PDEP takes the low bits from the first source operand and deposit them in the destination operand at the corresponding bit locations that are set in the second source operand (mask). All other bits (bits not set in mask) in destination are set to zero.</p>
</blockquote>
<p>Luckily, it comes with a diagram that makes it more digestible:</p>
<p><img src="/assets/images/pdep.svg" alt="PDEP" /></p>
<p>I know this might be a bit intimidating at first, but what <code class="highlighter-rouge">PDEP</code> can do for us, in my own words, is this: take a single 64-bit value (<code class="highlighter-rouge">SRC1</code>) and a mask of bits (<code class="highlighter-rouge">SRC2</code>), and copy (“deposit”) the least-significant bits of <code class="highlighter-rouge">SRC1</code> (from right to left in the diagram) into the destination register, at the positions of the <code class="highlighter-rouge">1</code> bits in the mask (<code class="highlighter-rouge">SRC2</code>).<br />
It definitely takes time to wrap your head around how/what can be done with this, and there are many more applications than just this bit-searching. Right after I read a <a href="http://palms.ee.princeton.edu/PALMSopen/hilewitz06FastBitCompression.pdf">paper</a> about <code class="highlighter-rouge">PDEP</code> (which, from what I gathered, was the inspiration for having these primitives in our processors, and an extremely good paper for those willing to dive deeper), I felt like a hammer in search of a nail, wanting to apply this somewhere, until I remembered I had <em>this</em> little thing I needed (e.g. this function), and I tried using it, still in C++, about 2 years ago…<br />
It took me a good day of goofing around with this on a white-board (I actually started with its sister instruction <code class="highlighter-rouge">PEXT</code>) until I finally saw <em>a</em> solution…
<u>*Note*</u>: There might be other solutions, better than what I came up with, and if anyone reading this finds one, I would love to hear about it!</p>
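<p>Since the C# snippets in this post are rendered as highlighted HTML, here is a minimal software model of what <code class="highlighter-rouge">PDEP</code> does, written as a Python sketch (the function name <code>pdep</code> is mine, not a real API): it scans the mask from least- to most-significant bit and deposits successive low-order bits of the source at each set mask position.</p>

```python
def pdep(src: int, mask: int) -> int:
    """Software model of the BMI2 PDEP instruction (64-bit)."""
    result = 0
    src_bit = 0  # index of the next low-order bit of src to deposit
    for pos in range(64):
        if (mask >> pos) & 1:
            # Deposit the next low bit of src at this set mask position.
            result |= ((src >> src_bit) & 1) << pos
            src_bit += 1
    return result

# Low bits of src are scattered to the set positions of the mask:
print(hex(pdep(0b101, 0b1111_0000)))  # → 0x50
```

<p>The real instruction does all of this in hardware in a few cycles; the model is only here to pin down the semantics.</p>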
<p>For those of you who don’t like spoilers, this might be a good time to grab a piece of paper and try to figure out how <code class="highlighter-rouge">PDEP</code> could help us in processing the last 64 bits, where we know our target bit is hiding…</p>
<p>If you are ready for the solution, I’ll just show the one-liner C# expression that takes the <strong>31</strong> instructions we saw the JIT emit for handling those last < 64 bits in our bitmap all the way down to <strong>13</strong> instructions, and just as importantly: with <strong>0</strong> branching:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="c1">// Where:</span>
<span class="c1">// n is the # of the target bit we are searching for</span>
<span class="c1">// value is the 64 bits when we know for sure that n is "hiding" within</span>
<span class="kt">var</span> <span class="n">offsetOfNthBit</span> <span class="p">=</span> <span class="nf">TrailingZeroCount</span><span class="p">(</span>
<span class="nf">ParallelBitDeposit</span><span class="p">(</span><span class="m">1U</span><span class="n">L</span> <span class="p"><<</span> <span class="p">(</span><span class="n">n</span> <span class="p">-</span> <span class="m">1</span><span class="p">),</span> <span class="k">value</span><span class="p">);</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>It’s not trivial to see how/why this works just from reading the code, so lets break this down, for an imaginary case of a 16-bit <code class="highlighter-rouge">PDEP</code> and assorted registers, for simplicity:</p>
<p>As an example, let’s pretend we are looking for the offset (position) of the 8<sup>th</sup> <code class="highlighter-rouge">1</code> bit.<br />
We pass two operands to <code class="highlighter-rouge">ParallelBitDeposit()</code>:<br />
The <code class="highlighter-rouge">SRC1</code> operand has the value of <code class="highlighter-rouge">1</code> left shifted by the bit number we are searching for minus 1, so for our case of <code class="highlighter-rouge">n = 8</code>, we shift a single <code class="highlighter-rouge">1</code> bit 7 bits to the left, ending up with:</p>
<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre><span class="err">0b_0000_0000_1000_0000</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Our “fake” 16-bit <code class="highlighter-rouge">SRC1</code> now has a single <code class="highlighter-rouge">1</code> bit in the <strong>position</strong> that equals our target-bit <strong>count</strong> (this last emphasis is important!).
Remember that by this point in our search function, we have made sure our <code class="highlighter-rouge">n</code> is within the range <code class="highlighter-rouge">1..64</code>, so <code class="highlighter-rouge">n-1</code> can only be <code class="highlighter-rouge">0..63</code>: we can never shift by a negative number of bits, or by more than the size of the register (this can be seen more easily in the full code listing below).</p>
<p>As for <code class="highlighter-rouge">SRC2</code>, We load it up with our remaining portion of the bitmap, whose n<sup>th</sup> lit bit position we are searching for, so with careful mashing of the keyboard, I came up with these random bits:</p>
<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre><span class="err">0b_0001_0111_0011_0110</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>This is what executing <code class="highlighter-rouge">PDEP</code> with these two operands does:</p>
<p><img src="/assets/images/pdep-bitsearch-example-animated.svg" alt="PDEP" /></p>
<p>By now, we’ve managed to generate a temporary value where only our original target-bit remains lit, in its original position, so thanks for that, <code class="highlighter-rouge">PDEP</code>! In a way, we’ve managed to tweak <code class="highlighter-rouge">PDEP</code> into a custom masking opcode, capable of masking out the first <code class="highlighter-rouge">n-1</code> lit bits…<br />
Finally, all that remains is to use the BMI1 <code class="highlighter-rouge">TZCNT</code> instruction to count the number of <code class="highlighter-rouge">0</code> bits leading up to our deposited <code class="highlighter-rouge">1</code> bit marker. That number ends up being the offset of the n<sup>th</sup> lit bit in the original bitmap! Cool, eh?</p>
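<p>The 16-bit walk-through above can be checked mechanically. Below is a Python sketch (using a software stand-in for the intrinsic, with names of my own choosing) that reproduces the trick: depositing a single <code class="highlighter-rouge">1</code>, pre-shifted to position <code>n-1</code>, with the bitmap as the mask leaves only the n<sup>th</sup> set bit lit, and a trailing-zero count then recovers its offset.</p>

```python
def pdep(src, mask):
    """Software stand-in for the BMI2 PDEP instruction."""
    result, src_bit = 0, 0
    for pos in range(64):
        if (mask >> pos) & 1:
            result |= ((src >> src_bit) & 1) << pos
            src_bit += 1
    return result

def nth_set_bit_offset(value, n):
    """Offset of the n-th (1-based) set bit of value, via the PDEP trick."""
    marker = pdep(1 << (n - 1), value)          # only the n-th set bit survives
    return (marker & -marker).bit_length() - 1  # TZCNT of the lone marker bit

value = 0b0001_0111_0011_0110  # the example bitmap from the text
print(nth_set_bit_offset(value, 8))  # → 12
```

<p>The example bitmap has exactly 8 set bits, and the highest of them sits at offset 12, which is what the sketch reports.</p>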
<p>Let’s look at the final code for this function:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
</pre></td><td class="rouge-code"><pre><span class="k">using</span> <span class="nn">static</span> <span class="n">System</span><span class="p">.</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Intrinsics</span><span class="p">.</span><span class="n">X86</span><span class="p">.</span><span class="n">Popcnt</span><span class="p">;</span>
<span class="k">using</span> <span class="nn">static</span> <span class="n">System</span><span class="p">.</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Intrinsics</span><span class="p">.</span><span class="n">X86</span><span class="p">.</span><span class="n">Bmi1</span><span class="p">;</span>
<span class="k">using</span> <span class="nn">static</span> <span class="n">System</span><span class="p">.</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Intrinsics</span><span class="p">.</span><span class="n">X86</span><span class="p">.</span><span class="n">Bmi2</span><span class="p">;</span>
<span class="k">public</span> <span class="k">static</span> <span class="k">unsafe</span> <span class="kt">int</span> <span class="nf">POPCNTAndBMI2</span><span class="p">(</span><span class="kt">ulong</span><span class="p">*</span> <span class="n">bits</span><span class="p">,</span> <span class="kt">int</span> <span class="n">numBits</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">var</span> <span class="n">p64</span> <span class="p">=</span> <span class="n">bits</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">prevN</span><span class="p">;</span>
<span class="k">do</span> <span class="p">{</span>
<span class="n">prevN</span> <span class="p">=</span> <span class="n">n</span><span class="p">;</span>
<span class="n">n</span> <span class="p">-=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="nf">PopCount</span><span class="p">(*</span><span class="n">p64</span><span class="p">);</span>
<span class="n">p64</span><span class="p">++;</span>
<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">n</span> <span class="p">></span> <span class="m">0</span><span class="p">);</span>
<span class="n">p64</span><span class="p">--;</span>
<span class="c1">// Here, we know for sure that 1 .. prevN .. 64 (including)</span>
<span class="kt">var</span> <span class="n">pos</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="nf">TrailingZeroCount</span><span class="p">(</span>
<span class="nf">ParallelBitDeposit</span><span class="p">(</span><span class="m">1U</span><span class="n">L</span> <span class="p"><<</span> <span class="p">(</span><span class="n">prevN</span> <span class="p">-</span> <span class="m">1</span><span class="p">),</span> <span class="p">*</span><span class="n">p64</span><span class="p">));</span>
<span class="k">return</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="p">((</span><span class="n">p64</span> <span class="p">-</span> <span class="n">bits</span><span class="p">)</span> <span class="p"><<</span> <span class="m">6</span><span class="p">)</span> <span class="p">+</span> <span class="n">pos</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
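<p>For readers who want to poke at the control flow (the <code class="highlighter-rouge">prevN</code> bookkeeping in particular) without firing up .NET, here is a Python model of the same algorithm; <code>pdep</code> is a software stand-in for the intrinsic, and a list of integers plays the role of the <code class="highlighter-rouge">ulong*</code> bitmap. This is a sketch of my own, not the shipped code.</p>

```python
def pdep(src, mask):
    """Software stand-in for the BMI2 PDEP instruction."""
    result, src_bit = 0, 0
    for pos in range(64):
        if (mask >> pos) & 1:
            result |= ((src >> src_bit) & 1) << pos
            src_bit += 1
    return result

def find_nth_set_bit(words, n):
    """Global offset of the n-th (1-based) set bit across 64-bit words."""
    i = 0
    while True:
        prev_n = n                     # n before subtracting this word
        n -= bin(words[i]).count("1")  # POPCNT
        if n <= 0:
            break                      # the target bit hides in words[i]
        i += 1
    marker = pdep(1 << (prev_n - 1), words[i])
    return i * 64 + ((marker & -marker).bit_length() - 1)

words = [0x0, 0xFF00, 0x1]  # set bits at global offsets 72..79 and 128
print(find_nth_set_bit(words, 9))  # → 128
```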
<p>With the code out of the way, it’s time to see whether the whole thing paid off:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>N</th>
<th style="text-align: right">Mean (ns)</th>
<th style="text-align: right">Scaled to “POPCNTAndBMI1”</th>
</tr>
</thead>
<tbody>
<tr>
<td>POPCNTAndBMI2</td>
<td>1</td>
<td style="text-align: right">2.232</td>
<td style="text-align: right">0.95</td>
</tr>
<tr>
<td>POPCNTAndBMI2</td>
<td>4</td>
<td style="text-align: right">9.497</td>
<td style="text-align: right">0.62</td>
</tr>
<tr>
<td>POPCNTAndBMI2</td>
<td>16</td>
<td style="text-align: right">40.259</td>
<td style="text-align: right">0.34</td>
</tr>
<tr>
<td>POPCNTAndBMI2</td>
<td>64</td>
<td style="text-align: right">193.253</td>
<td style="text-align: right">0.19</td>
</tr>
<tr>
<td>POPCNTAndBMI2</td>
<td>256</td>
<td style="text-align: right">1,581.082</td>
<td style="text-align: right">0.32</td>
</tr>
<tr>
<td>POPCNTAndBMI2</td>
<td>1024</td>
<td style="text-align: right">23,174.989</td>
<td style="text-align: right">0.51</td>
</tr>
<tr>
<td>POPCNTAndBMI2</td>
<td>4096</td>
<td style="text-align: right">341,087.341</td>
<td style="text-align: right">0.82</td>
</tr>
<tr>
<td>POPCNTAndBMI2</td>
<td>16384</td>
<td style="text-align: right">4,979,229.288</td>
<td style="text-align: right">0.95</td>
</tr>
<tr>
<td>POPCNTAndBMI2</td>
<td>65536</td>
<td style="text-align: right">76,144,935.381</td>
<td style="text-align: right">0.98</td>
</tr>
</tbody>
</table>
<p>Oh boy, did it ever! Results are much better for the lower counts of <code class="highlighter-rouge">N</code>:</p>
<ul>
<li>As expected, the scaling improved, with the <em>peak improvement</em> at <code class="highlighter-rouge">N==64</code>: a 400% speedup compared to the previous version!</li>
<li>As N grows beyond 64, this version’s performance resembles the previous version’s more and more (duh!).</li>
</ul>
<p>All in all, everything looks as we would have expected so far…<br />
Again, for those interested, here’s a <a href="https://gist.github.com/9b049a464dc66237500454ed367a79aa">gist</a> of the JITDump, for your pleasure.</p>
<h3 id="loop-unrolling">Loop Unrolling</h3>
<p>A common optimization technique we haven’t used up to this point, is <a href="https://en.wikipedia.org/wiki/Loop_unrolling">loop unrolling/unwinding</a>:</p>
<blockquote>
<p>The goal of loop unwinding is to increase a program’s speed by reducing
or eliminating instructions that control the loop, such as <a href="https://en.wikipedia.org/wiki/Pointer_arithmetic">pointer arithmetic</a> and “end of loop” tests on each iteration;[<a href="https://en.wikipedia.org/wiki/Loop_unrolling#cite_note-1">1]</a> reducing branch penalties; as well as hiding latencies including the delay in reading data from memory.[<a href="https://en.wikipedia.org/wiki/Loop_unrolling#cite_note-2">2]</a> To eliminate this <a href="https://en.wikipedia.org/wiki/Computational_overhead">computational overhead</a>, loops can be re-written as a repeated sequence of similar independent statements.[<a href="https://en.wikipedia.org/wiki/Loop_unrolling#cite_note-3">3]</a></p>
</blockquote>
<p>By now, we’re left with only one loop, so clearly the target of loop unrolling is the <code class="highlighter-rouge">POPCNT</code> loop.<br />
After all, we are potentially going over thousands of bits, and by shoving more <code class="highlighter-rouge">POPCNT</code> instructions in between the looping instructions, we can theoretically drive the CPU harder.<br />
Not only that, but modern (in this case x86/x64) CPUs are notorious for having internal parallelism that comes in many shapes and forms. For <code class="highlighter-rouge">POPCNT</code> specifically, we know from <a href="https://www.agner.org/optimize/instruction_tables.pdf">Agner Fog’s Instruction Tables</a> that:</p>
<ul>
<li>Intel Skylake can execute certain <code class="highlighter-rouge">POPCNT</code> instructions on two different execution ports, with a single <code class="highlighter-rouge">POPCNT</code> latency of 3 cycles, and a reciprocal throughput of 1 cycle, so a latency of <code class="highlighter-rouge">x + 2</code> cycles as a best case, where <code class="highlighter-rouge">x</code> is the number of <strong>continuous independent</strong> <code class="highlighter-rouge">POPCNT</code> instructions.</li>
<li>AMD Ryzen can execute up to 4 <code class="highlighter-rouge">POPCNT</code> instructions in 1 cycle, with a latency of 1 cycle, for <strong>continuous independent</strong> <code class="highlighter-rouge">POPCNT</code> instructions, which is even more impressive (I’ve not yet been able to verify this somewhat extravagant claim…).</li>
</ul>
<p>These numbers were measured on real CPUs, with very specific benchmarks that measure single independent instructions. They should <strong>not</strong> be taken as a target performance for <strong>our</strong> code, since we are attempting to solve a real-life problem, which isn’t limited to a single instruction and has at least SOME dependency between the different instructions and branching logic on top of that.<br />
But the numbers do give us at least one thing: motivation to unroll our <code class="highlighter-rouge">POPCNT</code> loop and try to get more work out of the CPU by issuing independent <code class="highlighter-rouge">POPCNT</code> on different parts of our bitmap.</p>
<p>Here’s the code that does this:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
</pre></td><td class="rouge-code"><pre><span class="k">using</span> <span class="nn">static</span> <span class="n">System</span><span class="p">.</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Intrinsics</span><span class="p">.</span><span class="n">X86</span><span class="p">.</span><span class="n">Popcnt</span><span class="p">;</span>
<span class="k">using</span> <span class="nn">static</span> <span class="n">System</span><span class="p">.</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Intrinsics</span><span class="p">.</span><span class="n">X86</span><span class="p">.</span><span class="n">Bmi1</span><span class="p">;</span>
<span class="k">using</span> <span class="nn">static</span> <span class="n">System</span><span class="p">.</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Intrinsics</span><span class="p">.</span><span class="n">X86</span><span class="p">.</span><span class="n">Bmi2</span><span class="p">;</span>
<span class="k">public</span> <span class="k">static</span> <span class="k">unsafe</span> <span class="kt">int</span> <span class="nf">POPCNTAndBMI2Unrolled</span><span class="p">(</span><span class="kt">ulong</span><span class="p">*</span> <span class="n">bits</span><span class="p">,</span> <span class="kt">int</span> <span class="n">numBits</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">var</span> <span class="n">p64</span> <span class="p">=</span> <span class="n">bits</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(;</span> <span class="n">n</span> <span class="p">>=</span> <span class="m">256</span><span class="p">;</span> <span class="n">p64</span> <span class="p">+=</span> <span class="m">4</span><span class="p">)</span> <span class="p">{</span>
<span class="n">n</span> <span class="p">-=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="p">(</span>
<span class="nf">PopCount</span><span class="p">(</span><span class="n">p64</span><span class="p">[</span><span class="m">0</span><span class="p">])</span> <span class="p">+</span>
<span class="nf">PopCount</span><span class="p">(</span><span class="n">p64</span><span class="p">[</span><span class="m">1</span><span class="p">])</span> <span class="p">+</span>
<span class="nf">PopCount</span><span class="p">(</span><span class="n">p64</span><span class="p">[</span><span class="m">2</span><span class="p">])</span> <span class="p">+</span>
<span class="nf">PopCount</span><span class="p">(</span><span class="n">p64</span><span class="p">[</span><span class="m">3</span><span class="p">]));</span>
<span class="p">}</span>
<span class="kt">var</span> <span class="n">prevN</span> <span class="p">=</span> <span class="n">n</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">n</span> <span class="p">></span> <span class="m">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">prevN</span> <span class="p">=</span> <span class="n">n</span><span class="p">;</span>
<span class="n">n</span> <span class="p">-=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="nf">PopCount</span><span class="p">(*</span><span class="n">p64</span><span class="p">);</span>
<span class="n">p64</span><span class="p">++;</span>
<span class="p">}</span>
<span class="n">p64</span><span class="p">--;</span>
<span class="kt">var</span> <span class="n">pos</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="nf">TrailingZeroCount</span><span class="p">(</span>
<span class="nf">ParallelBitDeposit</span><span class="p">(</span><span class="m">1U</span><span class="n">L</span> <span class="p"><<</span> <span class="p">(</span><span class="n">prevN</span> <span class="p">-</span> <span class="m">1</span><span class="p">),</span> <span class="p">*</span><span class="n">p64</span><span class="p">));</span>
<span class="k">return</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="p">((</span><span class="n">p64</span> <span class="p">-</span> <span class="n">bits</span><span class="p">)</span> <span class="p">*</span> <span class="m">64</span><span class="p">)</span> <span class="p">+</span> <span class="n">pos</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>We had to change the code flow to account for the unrolled loop, but all in all this is pretty straightforward, so let’s see how it performs:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>N</th>
<th style="text-align: right">Mean (ns)</th>
<th style="text-align: right">Scaled to POPCNTAndBMI2</th>
</tr>
</thead>
<tbody>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>1</td>
<td style="text-align: right">2.249</td>
<td style="text-align: right">1.04</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>4</td>
<td style="text-align: right">10.904</td>
<td style="text-align: right">1.15</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>16</td>
<td style="text-align: right">50.368</td>
<td style="text-align: right">1.11</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>64</td>
<td style="text-align: right">208.272</td>
<td style="text-align: right">1.13</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>256</td>
<td style="text-align: right">1,580.026</td>
<td style="text-align: right">0.99</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>1024</td>
<td style="text-align: right">21,282.905</td>
<td style="text-align: right">0.92</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>4096</td>
<td style="text-align: right">255,186.977</td>
<td style="text-align: right">0.74</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>16384</td>
<td style="text-align: right">3,730,420.068</td>
<td style="text-align: right">0.77</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>65536</td>
<td style="text-align: right">56,939,817.593</td>
<td style="text-align: right">0.76</td>
</tr>
</tbody>
</table>
<p>There are a few interesting things going on here:</p>
<ul>
<li>For low bit-counts (<code class="highlighter-rouge">N <= 64</code>) we can see a drop in performance compared to the previous version. That is totally acceptable: we’ve made the code longer and more branch-y, all in order to gain some serious ground on the other side of this benchmark (also, in reality, no one ever complains that your code used to take 193ns, but is now taking 208ns :).</li>
<li>In other words: the drop is not horrible, and we hope to make up enough for it on higher bit counts.</li>
<li>And we are making up for it, kind of… We can see a 33%-ish speedup for <code class="highlighter-rouge">N >= 4096</code>.</li>
</ul>
<p>For those interested, here’s the <a href="https://gist.github.com/c73959ad3dfe31e5d65e6bf273f53211">JITDump</a> of this version.</p>
<p>In theory, we should be happy, pack our bags, and call it a day! We’ve done it, we’ve squeezed every last bit we could hope to.<br />
<strong>Except we really didn’t…</strong><br />
While it might not be clear from these results alone, the loop unrolling hit an unexpected snag: the performance improvement is actually disappointing.<br />
How can I tell? Well, that’s simple: <strong>I’m cheating!</strong><br />
I’ve already written the equivalent C++ code as part of this whole effort (to be honest, I wrote the C++ code two years before C# intrinsics were a thing), and I’ve seen where unrolled <code class="highlighter-rouge">POPCNT</code> can go, and this is not it.<br />
Not <em>yet</em> at least.</p>
<p>From my C++ attempts, I know we should have seen a ~100% speedup in high bit-counts with loop unrolling, but we are seeing much less than that.</p>
<p>To understand why though, and what is really going on here, you’ll have to wait for the next post, where we cover some of the C++ code, and possibly learn more about processors than we cared to know…</p>
<h2 id="mid-journey-conclusions">Mid-Journey Conclusions</h2>
<p>We’ve taken our not-so-bad code from the end of the first post and improved upon it quite a lot!<br />
I hope you’ve seen how trying to think outside the box, and finding creative ways to compound various intrinsics provided by the CPU can really pay off in performance, and even simplicity.</p>
<p>Alongside the positive things, we must also not forget that there are some negative sides to working with intrinsics, which, by now, you might have begun sensing:</p>
<ul>
<li>You’ll need to map which CPUs your users are using, and which CPU intrinsics are supported on each model (even within a single architecture, such as Intel/AMD x64 you’ll see great variation throughout different models!).</li>
<li>You’ll sometimes need to write cryptic implementation-selection code that uses the provided <code class="highlighter-rouge">.IsHardwareAccelerated</code> properties (for example detecting <code class="highlighter-rouge">BMI1</code>-only CPUs vs. <code class="highlighter-rouge">BMI1</code> + <code class="highlighter-rouge">BMI2</code> ones) to steer the JIT into the “best” implementation, while praying to the powers that be that the JIT will be intelligent enough to elide the un-needed code at generation time, and still inline the resulting code.</li>
<li>Due to having multiple implementations, architecture-specific <em>testing</em> becomes a new requirement.
This might sound basic to a C++ developer, but less so for C#/CLR developers; it means you need access to x86 (both 32- and 64-bit), arm32, and arm64 test agents, and need to run tests on <strong>all of them</strong> to be able to sleep calmly at night.</li>
</ul>
<p>All of these are considerations to be taken seriously when weighing intrinsics, especially if you work outside of Microsoft (where there are considerably more resources for testing, and greater impact for using intrinsics, at the same time).</p>
<p>In the <a href="/2018-08-20/netcoreapp3.0-intrinsics-in-real-life-pt3">next and final post</a>, we’ll explore the performance bug I uncovered, and how generally C# compares to C++ for this sort of code…</p>damageboydans@houmus.orghttps://bits.houmus.orgAs I’ve described in part 1 of this series, I’ve recently overhauled an internal data structure we use at Work® to start using platform dependent intrinsics..NET Core 3.0 Intrinsics in Real Life - (Part 1/3)2018-08-18T15:26:28+00:002018-08-18T15:26:28+00:00https://bits.houmus.org/2018-08-18/netcoreapp3.0-intrinsics-in-real-life-pt1<p>I’ve recently overhauled an internal data structure we use at Work<sup>®</sup> to start using <a href="https://github.com/dotnet/designs/blob/master/accepted/platform-intrinsics.md">platform dependent intrinsics</a>- the anticipated feature (for speed junkies like me, that is) which was released in preview form as part of CoreCLR 2.1:
What follows is sort of a travel log of what I did and how the new CoreCLR functionality fares compared to writing C++ code, when processor intrinsics are involved.</p>
<p>This series will contain 3 parts:</p>
<ul>
<li>The data-structure/operation that we’ll optimize and basic usage of intrinsics (this post).</li>
<li><a href="/2018-08-19/netcoreapp3.0-intrinsics-in-real-life-pt2">Using intrinsics more effectively</a>.</li>
<li><a href="/2018-08-20/netcoreapp3.0-intrinsics-in-real-life-pt3">The C++ version(s) of the corresponding C# code, and what I learned from them</a>.</li>
</ul>
<p>All of the code (C# & C++) is published under the <a href="https://github.com/damageboy/bitgoo">bitgoo github repo</a>, with build/run scripts in case someone wants to play with it and/or use it as a starting point for humiliating me with better versions.</p>
<p>In order to keep people motivated:</p>
<ul>
<li>By the end of this post, we’ll already start using intrinsics, and see considerable speedup in our execution time</li>
<li>By the end of the 2<sup>nd</sup> post, we will already see a <strong>300%</strong> speed-up compared to my current .NET Core 2.1 production code, and:</li>
<li>By the end of the 3<sup>rd</sup> post I hope to show how with some fixing in the JIT, we can probably get another 100%-ish improvement on top of <strong>that</strong>, bringing us practically to C++ territory<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">1</a></sup></li>
</ul>
<h2 id="the-whatwhy-of-intrinsics">The What/Why of Intrinsics</h2>
<p>Processor intrinsics are a way to directly embed specific CPU instructions via special, fake method calls that the JIT replaces at code-generation time. Many of these instructions are considered exotic, and normal language syntax cannot map them cleanly.<br />
The general rule is that a single intrinsic “function” becomes a single CPU instruction.</p>
<p>Intrinsics are not really new to the CLR, and staples of .NET rely on having them around. For example, practically all of the methods in the <a href="https://docs.microsoft.com/en-us/dotnet/api/system.threading.interlocked?view=netframework-4.7.2"><code class="highlighter-rouge">Interlocked</code></a> class in <code class="highlighter-rouge">System.Threading</code> are essentially intrinsics, even if not referred to as such in the documentation. The same holds true for a vast set of vectorized mathematical operations exposed through the types in <a href="https://docs.microsoft.com/en-us/dotnet/api/system.numerics?view=netframework-4.7.2"><code class="highlighter-rouge">System.Numerics</code></a>.</p>
<p>The recent, new effort to introduce more intrinsics in CoreCLR tries to provide additional processor specific intrinsics that deal with a wide range of interesting operations from sped-up cryptographic functions, random number generation to fused mathematical operations and various CPU/cache synchronization primitives.</p>
<p>Unlike the previous cases mentioned, the new intrinsic wrappers in .NET Core don’t shy away from providing <em>model and architecture specific</em> intrinsics, even in cases where only a small portion of actual CPUs might support them. In addition, a <code class="highlighter-rouge">.IsHardwareAccelerated</code> property was sprinkled all over the BCL classes providing intrinsics, to allow runtime discovery of what the CPU supports.</p>
<p>On the performance/latency side, which is the focus of this series, we often find that intrinsics can replace tens of CPU instructions with one or two, while possibly also eliminating branches (sometimes more important than using fewer instructions…). This is compounded by the fact that the simplified instruction stream makes it possible for a modern CPU to “see” the dependencies between instructions (or lack thereof!) more clearly, and safely attempt to run multiple instructions in parallel even inside a <strong>single CPU core</strong>.</p>
<p>While there are also some downsides to using intrinsics, I’ll discuss some of those at the end of the second post; by then, I hope my warnings will fall on more welcoming ears.<br />
Personally, I’m more than ready to take that plunge, so with that long preamble out of the way, let’s describe our starting point:</p>
<h2 id="the-bitmap-getnthbitoffset">The Bitmap, GetNthBitOffset()</h2>
<p>To keep it short, I’m purposely going to ignore the broader context that the code we are about to discuss is a key part of (If there is interest, I may write a separate post about it).
For now, let’s accept that we have a god-given assignment in the form of a function that we really want to optimize the hell out of, without stopping to ask “Why?”.</p>
<h3 id="the-bitmap">The Bitmap</h3>
<p>This is dead simple: we have a bitmap which is potentially thousands or tens of thousands of bits long, which we will store somewhere as an <code class="highlighter-rouge">ulong[]</code>:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="k">const</span> <span class="kt">int</span> <span class="n">THIS_MANY_BITS</span> <span class="p">=</span> <span class="m">66666</span><span class="p">;</span>
<span class="kt">ulong</span><span class="p">[]</span> <span class="n">bits</span> <span class="p">=</span> <span class="k">new</span> <span class="kt">ulong</span><span class="p">[(</span><span class="n">THIS_MANY_BITS</span> <span class="p">/</span> <span class="m">64</span><span class="p">)</span> <span class="p">+</span> <span class="m">1</span><span class="p">];</span> <span class="c1">// enough room for everyone</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>The <code class="highlighter-rouge">bits</code> array in the sample above is continuously being mutated, and as bits go, this is going to be in the form of bits being turned on and off in no particular order, so imagine:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="kt">var</span> <span class="n">r</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">Random</span><span class="p">((</span><span class="kt">int</span><span class="p">)(</span><span class="n">DateTime</span><span class="p">.</span><span class="n">Now</span><span class="p">.</span><span class="n">Ticks</span> <span class="p">%</span> <span class="kt">int</span><span class="p">.</span><span class="n">MaxValue</span><span class="p">));</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">var</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p"><</span> <span class="n">bits</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
<span class="n">bits</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="p">=</span> <span class="k">unchecked</span><span class="p">(((</span><span class="kt">ulong</span><span class="p">)</span><span class="n">r</span><span class="p">.</span><span class="nf">Next</span><span class="p">())</span> <span class="p"><<</span> <span class="m">32</span> <span class="p">|</span> <span class="p">((</span><span class="kt">ulong</span><span class="p">)</span> <span class="n">r</span><span class="p">.</span><span class="nf">Next</span><span class="p">()));</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<h3 id="the-search-method">The Search Method</h3>
<p>We’re about to describe one of the two methods that I optimized.
I chose this particular method since it was the more challenging one to optimize. But before describing it, a short disclaimer is in order:</p>
<p>The method is implemented with <code class="highlighter-rouge">unsafe</code> and <code class="highlighter-rouge">ulong *</code> rather than the managed/safe variants (<code class="highlighter-rouge">ulong[]</code> or <code class="highlighter-rouge">Span<ulong></code>). The reason I’m using <code class="highlighter-rouge">unsafe</code> is that for this type of code, which makes up a double-digit percentage of our CPU time, bounds-checking can be very destructive for performance; additionally, in the context of this series, where I’m about to compare C# with C++, it gives us an apples-to-apples comparison, since C++ is normally compiled without bounds-checking.</p>
<p>With that out of the way, let’s inspect the method signature:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre><span class="k">unsafe</span> <span class="kt">int</span> <span class="nf">GetNthBitOffset</span><span class="p">(</span><span class="kt">ulong</span> <span class="p">*</span><span class="n">bits</span><span class="p">,</span> <span class="kt">int</span> <span class="n">numBits</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">);</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>This method runs over the entire bitmap until it finds the n<sup>th</sup> bit with the value <code class="highlighter-rouge">1</code>, or as I will refer to it here-on, our <em>target-bit</em>, and returns its bit offset within the bitmap as its return value.
For brevity we <em>assume</em> that incoming values of <code class="highlighter-rouge">n</code> are never below <code class="highlighter-rouge">1</code> or above the number of <code class="highlighter-rouge">1</code> bits in the bitmap.</p>
<p>Here’s a super naive implementation that achieves this:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
</pre></td><td class="rouge-code"><pre><span class="k">public</span> <span class="k">static</span> <span class="k">unsafe</span> <span class="kt">int</span> <span class="nf">Naive</span><span class="p">(</span><span class="kt">ulong</span><span class="p">*</span> <span class="n">bits</span><span class="p">,</span> <span class="kt">int</span> <span class="n">numBits</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">var</span> <span class="n">b</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>
<span class="kt">var</span> <span class="k">value</span> <span class="p">=</span> <span class="p">*</span><span class="n">bits</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">leftInULong</span> <span class="p">=</span> <span class="m">64</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="p"><</span> <span class="n">numBits</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">((</span><span class="k">value</span> <span class="p">&</span> <span class="m">0x1U</span><span class="n">L</span><span class="p">)</span> <span class="p">==</span> <span class="m">0x1U</span><span class="n">L</span><span class="p">)</span>
<span class="n">i</span><span class="p">++;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="p">==</span> <span class="n">n</span><span class="p">)</span>
<span class="k">break</span><span class="p">;</span>
<span class="k">value</span> <span class="p">>>=</span> <span class="m">1</span><span class="p">;</span>
<span class="n">leftInULong</span><span class="p">--;</span>
<span class="n">b</span><span class="p">++;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">leftInULong</span> <span class="p">!=</span> <span class="m">0</span><span class="p">)</span> <span class="c1">// Still more bits left in this ulong?</span>
<span class="k">continue</span><span class="p">;</span>
<span class="k">value</span> <span class="p">=</span> <span class="p">*(++</span><span class="n">bits</span><span class="p">);</span> <span class="c1">// Load a new 64 bit value </span>
<span class="n">leftInULong</span> <span class="p">=</span> <span class="m">64</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">b</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<h3 id="initial-performance">Initial Performance</h3>
<p>This implementation is obviously pretty bad, performance-wise. <em>But wait</em>: there are lots of ways to improve upon it: bit-twiddling hacks, LUTs, and what we’re here for: processor intrinsics.</p>
<p>Our next step is to start measuring this, and we’ll move on to better and better versions of this method, until we exhaust <em>my</em> abilities to make this go any faster.<br />
Using everyone’s favorite CLR microbenchmarking tool, <a href="https://benchmarkdotnet.org/">BDN</a>, I wrote a small harness that preallocates a huge array of bits, fills it up with random values (roughly 50% <code class="highlighter-rouge">0</code>/<code class="highlighter-rouge">1</code>), then executes the benchmark(s) over this array looking for all the offsets of <strong>lit</strong> bits <strong>up to</strong> <code class="highlighter-rouge">N</code>, where <code class="highlighter-rouge">N</code> is parametrized to be: 1, 4, 16, 64, 256, 1024, 4096, 16384, 65536.
The benchmark code looks roughly like this:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="k">const</span> <span class="kt">int</span> <span class="n">KB</span> <span class="p">=</span> <span class="m">1024</span><span class="p">;</span>
<span class="p">[</span><span class="nf">Params</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">4</span><span class="p">,</span> <span class="m">16</span><span class="p">,</span> <span class="m">64</span><span class="p">,</span> <span class="m">256</span><span class="p">,</span> <span class="m">1</span><span class="p">*</span><span class="n">KB</span><span class="p">,</span> <span class="m">4</span><span class="p">*</span><span class="n">KB</span><span class="p">,</span> <span class="m">16</span><span class="p">*</span><span class="n">KB</span><span class="p">,</span> <span class="m">64</span><span class="p">*</span><span class="n">KB</span><span class="p">)]</span>
<span class="k">public</span> <span class="kt">int</span> <span class="n">N</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
<span class="k">protected</span> <span class="k">unsafe</span> <span class="kt">ulong</span> <span class="p">*</span><span class="n">_bits</span><span class="p">;</span>
<span class="p">...</span>
<span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span>
<span class="k">public</span> <span class="k">unsafe</span> <span class="kt">int</span> <span class="nf">Naive</span><span class="p">()</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">sum</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">var</span> <span class="n">i</span> <span class="p">=</span> <span class="m">1</span><span class="p">;</span> <span class="n">i</span> <span class="p"><=</span> <span class="n">N</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
<span class="n">sum</span> <span class="p">+=</span> <span class="n">GetNthBitOffset</span><span class="p">.</span><span class="nf">Naive</span><span class="p">(</span><span class="n">_bits</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
<span class="k">return</span> <span class="n">sum</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>For those in the know, I’m NOT using BDN’s <code class="highlighter-rouge">OperationsPerInvoke()</code> to normalize for <code class="highlighter-rouge">N</code> since the benchmark is looping over the entire bitmap, and the performance varies wildly throughout the loop.</p>
<p>Running this gives us the following results:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>N</th>
<th style="text-align: right">Mean (ns)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Naive</td>
<td>1</td>
<td style="text-align: right">1.185</td>
</tr>
<tr>
<td>Naive</td>
<td>4</td>
<td style="text-align: right">35.308</td>
</tr>
<tr>
<td>Naive</td>
<td>16</td>
<td style="text-align: right">605.021</td>
</tr>
<tr>
<td>Naive</td>
<td>64</td>
<td style="text-align: right">6,368.355</td>
</tr>
<tr>
<td>Naive</td>
<td>256</td>
<td style="text-align: right">99,448.636</td>
</tr>
<tr>
<td>Naive</td>
<td>1024</td>
<td style="text-align: right">2,057,984.353</td>
</tr>
<tr>
<td>Naive</td>
<td>4096</td>
<td style="text-align: right">68,728,413.667</td>
</tr>
<tr>
<td>Naive</td>
<td>16384</td>
<td style="text-align: right">1,365,698,984.333</td>
</tr>
<tr>
<td>Naive</td>
<td>65536</td>
<td style="text-align: right">22,669,217,647.333</td>
</tr>
</tbody>
</table>
<p>A couple of comments about these results:</p>
<ol>
<li>Small numbers of bits actually work out OK-ish given how bad the code is.</li>
<li>Yes, finding all the offsets of the first 64k lit bits (so 64K calls times average length of 64K bits processed per call<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote">2</a></sup>) takes a whopping 22+ seconds…</li>
</ol>
<h3 id="prepping-the-machine--clr--environmental-information">Prepping the Machine / CLR + Environmental information</h3>
<p>Here is the BDN environmental data about my machine:</p>
<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre><span class="py">BenchmarkDotNet</span><span class="p">=</span><span class="s">v0.11.0, OS=ubuntu 18.04</span>
<span class="err">Intel</span> <span class="err">Core</span> <span class="err">i7-7700HQ</span> <span class="err">CPU</span> <span class="err">2.80GHz</span> <span class="err">(Sky</span> <span class="err">Lake),</span> <span class="err">1</span> <span class="err">CPU,</span> <span class="err">4</span> <span class="err">logical</span> <span class="err">and</span> <span class="err">4</span> <span class="err">physical</span> <span class="err">cores</span>
<span class="err">.NET</span> <span class="err">Core</span> <span class="py">SDK</span><span class="p">=</span><span class="s">3.0.100-alpha1-20180720-2</span>
<span class="nn">[Host]</span> <span class="err">:</span> <span class="err">.NET</span> <span class="err">Core</span> <span class="err">3.0.0-preview1-26814-05</span> <span class="err">(CoreCLR</span> <span class="err">4.6.26814.06,</span> <span class="err">CoreFX</span> <span class="err">4.6.26814.01),</span> <span class="err">64bit</span> <span class="err">RyuJIT</span>
<span class="err">ShortRun</span> <span class="err">:</span> <span class="err">.NET</span> <span class="err">Core</span> <span class="err">?</span> <span class="err">(CoreCLR</span> <span class="err">4.6.26814.06,</span> <span class="err">CoreFX</span> <span class="err">4.6.26814.01),</span> <span class="err">64bit</span> <span class="err">RyuJIT</span>
<span class="py">Job</span><span class="p">=</span><span class="s">ShortRun Toolchain=3.0.100-alpha1-20180720-2 IterationCount=3 </span>
<span class="py">LaunchCount</span><span class="p">=</span><span class="s">1 WarmupCount=3 </span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Keen eyes will notice I’m running this with .NET Core 3.0 pre-alpha / preview.
While this is completely unnecessary for the code we’ve seen so far, the next variations will actually depend on having .NET Core 3.0 around, so I ran the whole benchmark set with 3.0.</p>
<p>I’m using an excellent <a href="https://github.com/damageboy/bitcrap/blob/master/prep.sh">prep.sh</a> originally prepared by <a href="https://www.alexgallego.org/">Alexander Gallego</a> that basically kills the Turbo effect on modern CPUs by pinning the min/max frequencies to the base clock of the machine (i.e. what you would get when running at 100% CPU on all cores).</p>
<p>My laptop has an <a href="https://ark.intel.com/products/97185/Intel-Core-i7-7700HQ-Processor-6M-Cache-up-to-3_80-GHz">Intel i7 Skylake processor model 7700HQ</a> with a base frequency of 2.8Ghz, so I ran the following commands on my laptop as <code class="highlighter-rouge">root</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="nb">source </span>prep.sh <span class="c"># to get the bash functions used below</span>
cpu_enable_performance_cpupower_state
cpu_set_min_frequencies 2800000
cpu_set_max_frequencies 2800000
cpu_available_frequencies <span class="c"># should print 2800000 for all 4 cores, in my case</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>This is done so that the numbers presented here are applicable for multi-core machines running this code on all cores, and so that very short benchmarks don’t get skewed results compared to longer benchmarks due to CPU frequency scaling.</p>
<h2 id="popcount-without-popcnt">PopCount() without POPCNT</h2>
<p>Now that we have the initial code out of the way, we’re not going to look at it anymore. The next version will use bit-twiddling hacks in order to count larger groups of bits much faster.</p>
<p>We’ll introduce two pure C# functions that implement <a href="https://en.wikipedia.org/wiki/Hamming_weight">population counts</a>:</p>
<blockquote>
<p>The <strong>Hamming weight</strong> of a <a href="https://en.wikipedia.org/wiki/String_(computer_science)">string</a> is the number of symbols that are different from the zero-symbol of the <a href="https://en.wikipedia.org/wiki/Alphabet">alphabet</a> used. It is thus equivalent to the <a href="https://en.wikipedia.org/wiki/Hamming_distance">Hamming distance</a> from the all-zero string of the same length. For the most typical case, a string of <a href="https://en.wikipedia.org/wiki/Bit">bits</a>, this is the number of 1’s in the string, or the <a href="https://en.wikipedia.org/wiki/Digit_sum">digit sum</a> of the <a href="https://en.wikipedia.org/wiki/Binary_numeral_system">binary representation</a> of a given number and the <a href="https://en.wikipedia.org/wiki/Taxicab_geometry"><em>ℓ</em>₁ norm</a> of a bit vector. In this binary case, it is also called the <strong>population count</strong>,[<a href="https://en.wikipedia.org/wiki/Hamming_weight#cite_note-Warren_2013-1">1]</a> <strong>popcount</strong>, <strong>sideways sum</strong>,[<a href="https://en.wikipedia.org/wiki/Hamming_weight#cite_note-Knuth_2009-2">2]</a> or <strong>bit summation</strong>.[<a href="https://en.wikipedia.org/wiki/Hamming_weight#cite_note-HP-16C_1982-3">3]</a></p>
</blockquote>
<p>Ultimately, one of the key processor intrinsics we will use is… <code class="highlighter-rouge">POPCNT</code> which does exactly this, as a single instruction at the processor level, but for now, we will implement a <code class="highlighter-rouge">PopCount()</code> method without those intrinsics, for 64/32 bit inputs.<br />
Apart from <code class="highlighter-rouge">PopCount()</code> we will also define a <code class="highlighter-rouge">TrailingZeroCount()</code><sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote">3</a></sup> method that counts trailing zero bits. I chose an implementation that uses <code class="highlighter-rouge">PopCount()</code> internally.<br />
Here are the two <code class="highlighter-rouge">PopCount()</code> and <code class="highlighter-rouge">TrailingZeroCount()</code> methods, shamelessly stolen from around the interwebs via <a href="https://github.com/hcs0/Hackers-Delight/blob/master/pop.c.txt">Hacker’s Delight</a>:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
</pre></td><td class="rouge-code"><pre><span class="k">public</span> <span class="k">class</span> <span class="nc">HackersDelight</span>
<span class="p">{</span>
<span class="k">public</span> <span class="k">static</span> <span class="kt">int</span> <span class="nf">PopCount</span><span class="p">(</span><span class="kt">ulong</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">b</span> <span class="p">-=</span> <span class="p">(</span><span class="n">b</span> <span class="p">>></span> <span class="m">1</span><span class="p">)</span> <span class="p">&</span> <span class="m">0x5555555555555555</span><span class="p">;</span>
<span class="n">b</span> <span class="p">=</span> <span class="p">(</span><span class="n">b</span> <span class="p">&</span> <span class="m">0x3333333333333333</span><span class="p">)</span> <span class="p">+</span> <span class="p">((</span><span class="n">b</span> <span class="p">>></span> <span class="m">2</span><span class="p">)</span> <span class="p">&</span> <span class="m">0x3333333333333333</span><span class="p">);</span>
<span class="n">b</span> <span class="p">=</span> <span class="p">(</span><span class="n">b</span> <span class="p">+</span> <span class="p">(</span><span class="n">b</span> <span class="p">>></span> <span class="m">4</span><span class="p">))</span> <span class="p">&</span> <span class="m">0x0f0f0f0f0f0f0f0f</span><span class="p">;</span>
<span class="k">return</span> <span class="k">unchecked</span><span class="p">((</span><span class="kt">int</span><span class="p">)</span> <span class="p">((</span><span class="n">b</span> <span class="p">*</span> <span class="m">0x0101010101010101</span><span class="p">)</span> <span class="p">>></span> <span class="m">56</span><span class="p">));</span>
<span class="p">}</span>
<span class="k">public</span> <span class="k">static</span> <span class="kt">int</span> <span class="nf">PopCount</span><span class="p">(</span><span class="kt">uint</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">b</span> <span class="p">-=</span> <span class="p">(</span><span class="n">b</span> <span class="p">>></span> <span class="m">1</span><span class="p">)</span> <span class="p">&</span> <span class="m">0x55555555</span><span class="p">;</span>
<span class="n">b</span> <span class="p">=</span> <span class="p">(</span><span class="n">b</span> <span class="p">&</span> <span class="m">0x33333333</span><span class="p">)</span> <span class="p">+</span> <span class="p">((</span><span class="n">b</span> <span class="p">>></span> <span class="m">2</span><span class="p">)</span> <span class="p">&</span> <span class="m">0x33333333</span><span class="p">);</span>
<span class="n">b</span> <span class="p">=</span> <span class="p">(</span><span class="n">b</span> <span class="p">+</span> <span class="p">(</span><span class="n">b</span> <span class="p">>></span> <span class="m">4</span><span class="p">))</span> <span class="p">&</span> <span class="m">0x0f0f0f0f</span><span class="p">;</span>
<span class="k">return</span> <span class="k">unchecked</span><span class="p">((</span><span class="kt">int</span><span class="p">)</span> <span class="p">((</span><span class="n">b</span> <span class="p">*</span> <span class="m">0x01010101</span><span class="p">)</span> <span class="p">>></span> <span class="m">24</span><span class="p">));</span>
<span class="p">}</span>
<span class="k">public</span> <span class="k">static</span> <span class="kt">int</span> <span class="nf">TrailingZeroCount</span><span class="p">(</span><span class="kt">uint</span> <span class="n">x</span><span class="p">)</span> <span class="p">=></span> <span class="nf">PopCount</span><span class="p">(~</span><span class="n">x</span> <span class="p">&</span> <span class="p">(</span><span class="n">x</span> <span class="p">-</span> <span class="m">1</span><span class="p">));</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>These methods can quickly, and <strong>without</strong> a single branch instruction, count the lit bits in 64/32-bit words, using just 12 arithmetic operations, most of them simple bit operations, and only one (!) multiplication.</p>
<p>With our bit-twiddling optimized functions implemented and out of the way, let’s put them to good use in a new implementation, and make a few changes in the flow of the code:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
</pre></td><td class="rouge-code"><pre><span class="k">using</span> <span class="nn">static</span> <span class="n">BitGoo</span><span class="p">.</span><span class="n">HackersDelight</span><span class="p">;</span>
<span class="k">public</span> <span class="k">static</span> <span class="k">unsafe</span> <span class="kt">int</span> <span class="nf">NoIntrisics</span><span class="p">(</span><span class="kt">ulong</span><span class="p">*</span> <span class="n">bits</span><span class="p">,</span> <span class="kt">int</span> <span class="n">numBits</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// (1)</span>
<span class="kt">var</span> <span class="n">p64</span> <span class="p">=</span> <span class="n">bits</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">prevN</span><span class="p">;</span>
<span class="k">do</span> <span class="p">{</span>
<span class="n">prevN</span> <span class="p">=</span> <span class="n">n</span><span class="p">;</span>
<span class="n">n</span> <span class="p">-=</span> <span class="nf">PopCount</span><span class="p">(*</span><span class="n">p64</span><span class="p">);</span>
<span class="n">p64</span><span class="p">++;</span>
<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">n</span> <span class="p">></span> <span class="m">0</span><span class="p">);</span>
<span class="c1">// (2)</span>
<span class="kt">var</span> <span class="n">p32</span> <span class="p">=</span> <span class="p">(</span><span class="kt">uint</span> <span class="p">*)</span> <span class="p">(</span><span class="n">p64</span> <span class="p">-</span> <span class="m">1</span><span class="p">);</span>
<span class="n">n</span> <span class="p">=</span> <span class="n">prevN</span> <span class="p">-</span> <span class="nf">PopCount</span><span class="p">(*</span><span class="n">p32</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="p">></span> <span class="m">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">prevN</span> <span class="p">=</span> <span class="n">n</span><span class="p">;</span>
<span class="n">p32</span><span class="p">++;</span>
<span class="p">}</span>
<span class="c1">// (3)</span>
<span class="kt">var</span> <span class="n">prevValue</span> <span class="p">=</span> <span class="p">*</span><span class="n">p32</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">pos</span> <span class="p">=</span> <span class="p">(</span><span class="n">p32</span> <span class="p">-</span> <span class="p">(</span><span class="kt">uint</span><span class="p">*)</span> <span class="n">bits</span><span class="p">)</span> <span class="p">*</span> <span class="m">32</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">prevN</span> <span class="p">></span> <span class="m">0</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">bp</span> <span class="p">=</span> <span class="nf">TrailingZeroCount</span><span class="p">(</span><span class="n">prevValue</span><span class="p">)</span> <span class="p">+</span> <span class="m">1</span><span class="p">;</span>
<span class="n">pos</span> <span class="p">+=</span> <span class="n">bp</span><span class="p">;</span>
<span class="n">prevN</span><span class="p">--;</span>
<span class="n">prevValue</span> <span class="p">>>=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="n">bp</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="p">(</span><span class="n">pos</span> <span class="p">-</span> <span class="m">1</span><span class="p">);</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Our new approach to solving this goes like this (comments correspond to blocks of the code above):</p>
<ol>
<li>As long as we <strong>still</strong> need to look for <em>any</em> <code class="highlighter-rouge">1</code> bits, we loop, calling <code class="highlighter-rouge">PopCount()</code> until we finally consume more bits than we were tasked with… At that stage, our <code class="highlighter-rouge">p64</code> pointer is pointing one <code class="highlighter-rouge">ulong</code> beyond the <code class="highlighter-rouge">ulong</code> containing our target-bit, and <code class="highlighter-rouge">prevN</code> contains the number of consumed <code class="highlighter-rouge">1</code> bits that was still correct one <code class="highlighter-rouge">ulong</code> before.</li>
<li>Once we’re out of the loop, we know that our target-bit is hiding somewhere <em>within</em> that last 64-bit <code class="highlighter-rouge">ulong</code>, so we use a single 32-bit <code class="highlighter-rouge">PopCount()</code> to figure out whether it’s within the first or second 32-bit word making up <em>that</em> 64-bit word, and update the bit-counts / <code class="highlighter-rouge">p32</code> pointer accordingly.</li>
<li>Now that we know <code class="highlighter-rouge">p32</code> is pointing at the 32-bit word containing our target-bit, we find it by using <code class="highlighter-rouge">TrailingZeroCount()</code> and right-shifting in a loop until we reach the target bit’s position within the word, finally returning the offset once we’re done.</li>
</ol>
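<p>The three numbered blocks above can be sketched end-to-end as a small, self-contained method. This is a simplified reconstruction for illustration, not the actual BitGoo code: the name <code class="highlighter-rouge">GetNthBitOffset</code>, the array-indexed (rather than pointer-based) traversal, and the software helpers are all my own stand-ins, and <code class="highlighter-rouge">n</code> is assumed to be a valid 1-based count of set bits present in the bitmap:</p>

```csharp
using System;

static class NthBitSketch
{
    // Software fallbacks standing in for the hand-rolled helpers from the post.
    static int PopCount(ulong v)
    {
        var c = 0;
        while (v != 0) { v &= v - 1; c++; } // Kernighan: clear lowest set bit
        return c;
    }

    static int TrailingZeroCount(uint v)
    {
        if (v == 0) return 32;
        var c = 0;
        while ((v & 1) == 0) { v >>= 1; c++; }
        return c;
    }

    // Returns the bit offset of the n-th (1-based) set bit in the bitmap.
    public static int GetNthBitOffset(ulong[] bits, int n)
    {
        var i = 0;
        var prevN = n;
        // (1) Consume whole 64-bit words until the current word overshoots.
        while (true)
        {
            var consumed = PopCount(bits[i]);
            if (consumed >= prevN) break;
            prevN -= consumed;
            i++;
        }
        // (2) Decide which 32-bit half of that final word holds the target bit.
        var lo = (uint) bits[i];
        var pos = (long) i * 64;
        var inLow = PopCount(lo);
        uint word;
        if (prevN > inLow) { prevN -= inLow; word = (uint)(bits[i] >> 32); pos += 32; }
        else word = lo;
        // (3) Walk the remaining set bits inside the 32-bit word.
        while (prevN > 0)
        {
            var bp = TrailingZeroCount(word) + 1;
            pos += bp;
            prevN--;
            word >>= bp;
        }
        return (int)(pos - 1);
    }
}
```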
<p>Let’s take a look at how this version fares:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>N</th>
<th style="text-align: right">Mean (ns)</th>
<th style="text-align: right">Scaled to “Naive”</th>
</tr>
</thead>
<tbody>
<tr>
<td>NoIntrinsics</td>
<td>1</td>
<td style="text-align: right">5.247</td>
<td style="text-align: right">4.19</td>
</tr>
<tr>
<td>NoIntrinsics</td>
<td>4</td>
<td style="text-align: right">43.919</td>
<td style="text-align: right">0.79</td>
</tr>
<tr>
<td>NoIntrinsics</td>
<td>16</td>
<td style="text-align: right">429.974</td>
<td style="text-align: right">0.58</td>
</tr>
<tr>
<td>NoIntrinsics</td>
<td>64</td>
<td style="text-align: right">2,986.498</td>
<td style="text-align: right">0.44</td>
</tr>
<tr>
<td>NoIntrinsics</td>
<td>256</td>
<td style="text-align: right">16,492.408</td>
<td style="text-align: right">0.16</td>
</tr>
<tr>
<td>NoIntrinsics</td>
<td>1024</td>
<td style="text-align: right">112,049.075</td>
<td style="text-align: right">0.06</td>
</tr>
<tr>
<td>NoIntrinsics</td>
<td>4096</td>
<td style="text-align: right">1,058,565.813</td>
<td style="text-align: right">0.02</td>
</tr>
<tr>
<td>NoIntrinsics</td>
<td>16384</td>
<td style="text-align: right">13,714,191.734</td>
<td style="text-align: right">0.010</td>
</tr>
<tr>
<td>NoIntrinsics</td>
<td>65536</td>
<td style="text-align: right">206,236,218.000</td>
<td style="text-align: right">0.009</td>
</tr>
</tbody>
</table>
<p>Quite an improvement already! To be fair, our starting point being so low helped a lot, but it’s an improvement nonetheless.
As a side note, this is essentially the code I’m running on our own bitmaps in production right now, since intrinsics aren’t available to me there yet.</p>
<p>If there’s one column our focus should gravitate towards, it’s the “Scaled” column on the right of the table. Each result here is scaled to its corresponding <code class="highlighter-rouge">Naive</code> version:</p>
<ul>
<li>For any bit length &lt; 16, the old version runs faster, but only marginally so in absolute terms.</li>
<li>Once we hit <code class="highlighter-rouge">N == 16</code> and upwards, the landscape changes dramatically and our bit-twiddling <code class="highlighter-rouge">PopCount()</code> starts paying off big-time: the speedup at 64 is already &gt; 100%, climbing all the way to an 11,100% speedup @ 64K.</li>
</ul>
<h2 id="coreclr--architecture-dependent-intrinsics">CoreCLR & Architecture Dependent Intrinsics</h2>
<p>Let us remind ourselves where things stand at the time of writing this post, when it comes to using intrinsics in CoreCLR:</p>
<ul>
<li>.NET Core 2.1 was released on May 30<sup>th</sup> 2018, with Intrinsics released as a “preview” feature:
<ul>
<li>The 2.1 JIT kind of knows how to handle <em>some</em> intrinsics.</li>
<li>To actually use them, we need to use the dotnet-core myget feed and install an experimental nuget package that provides the API surface for the intrinsics.</li>
<li>No commitments were made that things would be stable/working.</li>
</ul>
</li>
<li>.NET Core 3.0 is the official (so far?) target release for intrinsics support in .NET Core:
<ul>
<li>Considerably more intrinsics are supported than what was available with 2.1.</li>
<li>No extra nuget package is required (intrinsics are part of the SDK).</li>
<li>Work is still being very actively done to add more intrinsics and improve the quality of what is already there.</li>
</ul>
</li>
</ul>
<p>As we require intrinsics that were not available with 2.1, the code in the <a href="https://github.com/damageboy/bitgoo">repo</a> targets a pre-alpha1 version of .NET Core 3.0 (i.e. <code class="highlighter-rouge">netcoreapp3.0</code>).</p>
<p>For people wanting to run this code, it’s relatively easy to do so, and non-destructive to your current setup:</p>
<ol>
<li>
<p>Go to the <a href="https://github.com/dotnet/core-sdk#installers-and-binaries">Installers and Binaries</a> section of the core-sdk project.</p>
</li>
<li>
<p>The left-most column contains .NET Core master branch builds (3.0.x Runtime).</p>
</li>
<li>
<p>Download the appropriate installer in <code class="highlighter-rouge">.zip</code> / <code class="highlighter-rouge">.tar.gz</code> form: I used the <a href="https://dotnetcli.blob.core.windows.net/dotnet/Sdk/master/dotnet-sdk-latest-linux-x64.tar.gz">linux</a> one, but the <a href="https://dotnetcli.blob.core.windows.net/dotnet/Sdk/master/dotnet-sdk-latest-win-x64.zip">windows</a> / <a href="https://dotnetcli.blob.core.windows.net/dotnet/Sdk/master/dotnet-sdk-latest-osx-x64.tar.gz">osx</a> ones should be just as good.</p>
</li>
<li>
<p>unzip/untar the installer somewhere (*Nix users beware: Microsoft does this entirely inhumane thing of packaging the contents of their distribution at the top level of the <code class="highlighter-rouge">.tar.gz</code>, so be sure to <code class="highlighter-rouge">mkdir dotnet; tar -C dotnet -xf /path/to/where/you/downloaded/the/tar.gz</code> to avoid heart-ache).</p>
</li>
<li>
<p>Adjust your <code class="highlighter-rouge">PATH</code> env. to find the <code class="highlighter-rouge">dotnet</code> executable in the new folder you just unzipped to, before anywhere else. (I did this locally in my terminal session).</p>
</li>
<li>
<p>You should now be able to <code class="highlighter-rouge">dotnet restore|build|run|test</code> the BitGoo project(s).</p>
</li>
<li>
<p>Just to be on the safe side, here is what <code class="highlighter-rouge">dotnet --info</code> prints for me:</p>
<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre><span class="err">.NET</span> <span class="err">Core</span> <span class="err">SDK</span> <span class="err">(reflecting</span> <span class="err">any</span> <span class="err">global.json):</span>
<span class="err">Version:</span> <span class="err">3.0.100-alpha1-20180720-2</span>
<span class="err">Commit:</span> <span class="err">82bd85d0a9</span>
<span class="err">Runtime</span> <span class="err">Environment:</span>
<span class="err">OS</span> <span class="err">Name:</span> <span class="err">ubuntu</span>
<span class="err">OS</span> <span class="err">Version:</span> <span class="err">18.04</span>
<span class="err">OS</span> <span class="err">Platform:</span> <span class="err">Linux</span>
<span class="err">RID:</span> <span class="err">ubuntu.18.04-x64</span>
<span class="err">...</span> <span class="c"># No one really cares that much
</span></pre></td></tr></tbody></table></code></pre></div> </div>
</li>
</ol>
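<p>For *nix users, steps 1–6 above boil down to something like the following session sketch (the URL is the Linux one from step 3; adjust for your OS, and keep in mind that <code class="highlighter-rouge">dotnet-sdk-latest</code> is a moving target):</p>

```shell
# Download the latest master-branch SDK build (Linux x64 here).
curl -LO https://dotnetcli.blob.core.windows.net/dotnet/Sdk/master/dotnet-sdk-latest-linux-x64.tar.gz

# The tarball has no top-level directory, so extract it into a fresh one.
mkdir dotnet
tar -C dotnet -xf dotnet-sdk-latest-linux-x64.tar.gz

# Put the new dotnet first on PATH for this shell session only.
export PATH="$PWD/dotnet:$PATH"

# Sanity check: this should report a 3.0.100-alpha1 SDK, as shown below.
dotnet --info
```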
<h2 id="using-popcnt--tzcnt">Using POPCNT & TZCNT</h2>
<p>The next step will be to replace our bit-twiddling <code class="highlighter-rouge">PopCount()</code> code with the <code class="highlighter-rouge">PopCount()</code> intrinsic provided by <code class="highlighter-rouge">System.Runtime.Intrinsics.X86.Popcnt</code> class in the 3.0 BCL, which should be replaced by a single CPU <code class="highlighter-rouge">POPCNT</code> instruction by the JIT at runtime.
In addition, we will also use the <code class="highlighter-rouge">BMI1</code> (<strong>B</strong>it <strong>M</strong>anipulation <strong>I</strong>ntrinsics <strong>1</strong>) <code class="highlighter-rouge">TrailingZeroCount()</code> intrinsic which maps to the <code class="highlighter-rouge">TZCNT</code> instruction.</p>
<p>These instructions do exactly what our previous hand-written implementation did, except that it’s all done with dedicated circuitry in our CPUs: it takes up fewer instructions in the instruction stream, runs faster, and can be parallelized internally inside the processor.
I was very careful in the last post / code-sample to use the exact same function name(s) as the intrinsics provided by the 3.0 BCL, so the code change really comes down to adjusting the two top <code class="highlighter-rouge">using static</code> statements:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre><span class="k">using</span> <span class="nn">static</span> <span class="n">System</span><span class="p">.</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Intrinsics</span><span class="p">.</span><span class="n">X86</span><span class="p">.</span><span class="n">Popcnt</span><span class="p">;</span>
<span class="k">using</span> <span class="nn">static</span> <span class="n">System</span><span class="p">.</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Intrinsics</span><span class="p">.</span><span class="n">X86</span><span class="p">.</span><span class="n">Bmi1</span><span class="p">;</span>
<span class="c1">// Rest of the code is the same...</span>
</pre></td></tr></tbody></table></code></pre></div></div>
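<p>One caveat worth making explicit before declaring victory: <code class="highlighter-rouge">POPCNT</code> and <code class="highlighter-rouge">BMI1</code> are not guaranteed to exist on every x86 CPU, so the BCL convention is to guard intrinsic calls with the JIT-time <code class="highlighter-rouge">IsSupported</code> property. Here’s a hedged sketch of that dispatch pattern, written against the final 3.0 API shape (<code class="highlighter-rouge">Popcnt.X64</code> for the 64-bit variant; the pre-alpha surface differed slightly), with fallback bodies along the lines of the bit-twiddling versions from the previous post:</p>

```csharp
using System;
using System.Runtime.Intrinsics.X86;

static class PopCountDispatch
{
    public static int PopCount(ulong value)
    {
        // JIT-time constant: on supporting CPUs this compiles down to POPCNT.
        if (Popcnt.X64.IsSupported)
            return (int) Popcnt.X64.PopCount(value);
        // Portable fallback: clear the lowest set bit until none remain.
        var count = 0;
        while (value != 0) { value &= value - 1; count++; }
        return count;
    }

    public static int TrailingZeroCount(uint value)
    {
        // Maps to the BMI1 TZCNT instruction when available.
        if (Bmi1.IsSupported)
            return (int) Bmi1.TrailingZeroCount(value);
        // TZCNT semantics: a zero input yields the operand size.
        if (value == 0) return 32;
        var count = 0;
        while ((value & 1) == 0) { value >>= 1; count++; }
        return count;
    }
}
```

<p>Because <code class="highlighter-rouge">IsSupported</code> is a constant as far as the JIT is concerned, the untaken branch is eliminated entirely from the generated machine code, so the guard costs nothing at runtime.</p>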
<p>That’s it! We’re using intrinsics, all done!<br />
If you are having a hard time trusting me, here’s a <a href="https://github.com/damageboy/bitgoo/blob/master/csharp/BitGoo/GetNthBitOffset.POPCNTAndBMI1.cs">link to the complete code</a>.
Here are the results, this time scaled to the <code class="highlighter-rouge">NoIntrinsics()</code> version:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>N</th>
<th style="text-align: right">Mean (ns)</th>
<th style="text-align: right">Scaled to “NoIntrinsics”`</th>
</tr>
</thead>
<tbody>
<tr>
<td>POPCNTAndBMI1</td>
<td>1</td>
<td style="text-align: right">2.358</td>
<td style="text-align: right">0.44</td>
</tr>
<tr>
<td>POPCNTAndBMI1</td>
<td>4</td>
<td style="text-align: right">15.318</td>
<td style="text-align: right">0.35</td>
</tr>
<tr>
<td>POPCNTAndBMI1</td>
<td>16</td>
<td style="text-align: right">128.712</td>
<td style="text-align: right">0.31</td>
</tr>
<tr>
<td>POPCNTAndBMI1</td>
<td>64</td>
<td style="text-align: right">916.033</td>
<td style="text-align: right">0.27</td>
</tr>
<tr>
<td>POPCNTAndBMI1</td>
<td>256</td>
<td style="text-align: right">5,005.190</td>
<td style="text-align: right">0.30</td>
</tr>
<tr>
<td>POPCNTAndBMI1</td>
<td>1024</td>
<td style="text-align: right">44,606.327</td>
<td style="text-align: right">0.39</td>
</tr>
<tr>
<td>POPCNTAndBMI1</td>
<td>4096</td>
<td style="text-align: right">408,871.712</td>
<td style="text-align: right">0.39</td>
</tr>
<tr>
<td>POPCNTAndBMI1</td>
<td>16384</td>
<td style="text-align: right">5,205,533.285</td>
<td style="text-align: right">0.39</td>
</tr>
<tr>
<td>POPCNTAndBMI1</td>
<td>65536</td>
<td style="text-align: right">76,186,499.286</td>
<td style="text-align: right">0.37</td>
</tr>
</tbody>
</table>
<p>OK, now we’re talking…<br />
There can be no doubt that we have SOMETHING working: we can see a very substantial improvement across the board, for every value of <code class="highlighter-rouge">N</code>! <br />
There are still some weird things happening here that I cannot fully explain at this stage, namely how the scaling becomes relatively worse as <code class="highlighter-rouge">N</code> increases, but there is generally little to complain about.</p>
<p>For those who need to see assembly code to feel convinced, I’ve uploaded JITDumps to a <a href="https://gist.github.com/b4500d6b7157051551346107786ae4fa">gist</a>, where you can clearly see the various <code class="highlighter-rouge">POPCNT</code> / <code class="highlighter-rouge">TZCNT</code> instructions throughout the ASM code (scroll to the end of the dump…).</p>
<h3 id="whats-next">What’s Next?</h3>
<p>We’ve reached pretty far, and I hope it was interesting even if a bit introductory.<br />
In the next post, we’ll continue iterating on this task, introducing new intrinsics in the process, and encounter some “interesting” quirks.</p>
<p>If you feel like you’re up for it, the next post is <a href="/2018-08-19/netcoreapp3.0-intrinsics-in-real-life-pt2">here</a>…</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Worry not, I reported and <a href="https://github.com/dotnet/coreclr/issues/19555">opened an issue on CoreCLR</a> before even starting to write this post, and plan to do a deep-dive into it in the 3rd post. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Since our bitmap is filled with roughly 50% <code class="highlighter-rouge">0</code>/<code class="highlighter-rouge">1</code> values, searching for 64K lit bits means going over roughly 128K bits, as an example. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>The TrailingZeroCount() method I’ve used here is, from independent testing, the fastest for C#. There are others, but they either depend on having a compiler that can emit CMOV instructions (which CoreCLR doesn’t yet), or on using LUTs (Look-Up Tables), which I dislike since they tend to win benchmarks while losing in the bigger scope of wherever the code is actually used, so I have a semi-religious bias against them. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>