Trumping Array.Sort with AVX2 Intrinsics (Part 4/6)

28 minute read

I ended up going down the rabbit hole re-implementing Array.Sort with AVX2 intrinsics, and there’s no reason I should store all the agony inside (to be honest: I had a lot of fun with this). I should probably attempt to have a serious discussion with CoreCLR / CoreFX people about starting a slow process that would end with integrating this code into the main C# repos, but for now, let’s get in the ring and show what AVX/AVX2 intrinsics can really do for a non-trivial problem, and even discuss potential improvements that future CoreCLR versions could bring to the table.

Since there’s a lot to go over here, I’ll split it up into a few parts:

  1. In part 1, we did a short refresher on QuickSort and how it compares to Array.Sort. If you don’t need a refresher, you can skip it and get right down to part 2 and onwards, although I really recommend skimming through it, mostly because I’ve got really good visualizations that should be in the back of everyone’s mind as we deal with vectorization & optimization later.
  2. In part 2, we went over the basics of vectorized HW intrinsics, discussed vector types and a handful of vectorized instructions we’d actually be using in part 3, but we still did not sort anything.
  3. In part 3 we go through the initial code for the vectorized sorting and we’ll finally start seeing some payoff. We’ll finish with some agony courtesy of CPU’s Branch Predictor, just so we don’t get too cocky.
  4. In this part, we go over a handful of optimization approaches that I attempted while trying to get the vectorized partitioning to run faster; we’ll see what worked and what didn’t.
  5. In part 5, we’ll see how we can almost get rid of 100% of the remaining scalar code, by implementing small-constant size array sorting. We’ll use, drum roll…, yet more AVX2 vectorization and gain a considerable amount of performance / efficiency in the process.
  6. Finally, in part 6, I’ll list the outstanding stuff / ideas I have for getting more juice and functionality out of my vectorized code.

(Trying) to squeeze some more vectorized juice

I thought it would be nice to show a bunch of things I ended up trying to improve performance. Some worked, some not, but all were worth mentioning, so here goes:

Dealing with small JIT hiccups :+1:

One of the more annoying things I’ve discovered during this optimization process was that the JIT isn’t really ready to do proper optimizations with pointer arithmetic.
Consider the two following pieces of code, which we’ve shown before:

int * nextPtr;
if (readLeft   - writeLeft <= 
    writeRight - readRight) {
    nextPtr = readLeft;
    readLeft += 8;
} else {
    nextPtr = readRight;
    readRight -= 8;
}
While this was what I originally wrote, my new disillusionment with the JIT means I have to write this:

int * nextPtr;
if ((byte *) readLeft   - (byte *) writeLeft <= 
    (byte *) writeRight - (byte *) readRight) {
    nextPtr = readLeft;
    readLeft += 8;
} else {
    nextPtr = readRight;
    readRight -= 8;
}
Why bother casting 4 int *s to byte *s? What does this serve?
The original code made the JIT (wrongfully) think we actually care about int * arithmetic for readLeft - writeLeft and writeRight - readRight. In other words: the JIT generated code to take the numerical pointer differences, and then generated extra code to convert them to int * differences: lots of >> 2 operations. However, this is useless: we only care whether one side is larger than the other; we don’t care if that is decided with byte * or int * distances… So by forcefully casting each pointer to byte * we are “telling” the JIT that the comparison can be made without the superfluous right-shifts.
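To see why the casts are harmless, here’s a small C sketch (function names are mine, purely illustrative) of the two formulations. Scaling both sides of a comparison by the same positive constant never changes its outcome, so the byte-wise form returns the same answer while letting the compiler skip the `>> 2` scaling of each pointer difference:

```c
#include <assert.h>
#include <stddef.h>

/* Decide which side has less remaining space, as in the partition loop.
   The element-wise form makes the compiler scale each pointer difference
   by sizeof(int) (an arithmetic shift); the byte-wise form compares raw
   address distances directly. Both return the same boolean answer. */
int pick_left_elementwise(int *readLeft, int *writeLeft,
                          int *writeRight, int *readRight) {
    return (readLeft - writeLeft) <= (writeRight - readRight);
}

int pick_left_bytewise(int *readLeft, int *writeLeft,
                       int *writeRight, int *readRight) {
    return ((char *) readLeft - (char *) writeLeft) <=
           ((char *) writeRight - (char *) readRight);
}
```

Any decent C compiler will drop the shifts for the second form, which is exactly the transformation we had to spell out by hand for the JIT.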

The same pattern (albeit slightly more convoluted) re-surfaces here:

var popCount = PopCount(mask);
writeLeft += 8U - popCount;
writeRight -= popCount;

Here, the popCount result is used to advance the two int * write pointers. Unfortunately, the JIT isn’t smart enough to see that it would be wiser to left-shift popCount once by 2 (e.g. convert it to a byte * distance) and reuse that value twice. Rewriting this rather clean code into this god-awful mess:

var popCount = PopCount(mask) << 2;
writeRight = (int *) ((byte *) writeRight - popCount);
writeLeft  = (int *) ((byte *) writeLeft + 8*4U - popCount);

Helped with generating slightly denser code by eliminating silly instructions from a hot loop. But does it help?

| Method | N | Mean (µs) | Time / N (ns) | Ratio |
| ------ | - | --------- | ------------- | ----- |
| AVX2DoublePumpedNaive | 100 | 3.043 | 30.4290 | 1.00 |
| AVX2DoublePumpedMicroOpt | 100 | 3.076 | 30.7578 | 1.15 |
| AVX2DoublePumpedNaive | 1000 | 26.041 | 26.0415 | 1.00 |
| AVX2DoublePumpedMicroOpt | 1000 | 23.257 | 23.2569 | 0.91 |
| AVX2DoublePumpedNaive | 10000 | 325.880 | 32.5880 | 1.00 |
| AVX2DoublePumpedMicroOpt | 10000 | 312.971 | 31.2971 | 0.96 |
| AVX2DoublePumpedNaive | 100000 | 3,510.946 | 35.1095 | 1.00 |
| AVX2DoublePumpedMicroOpt | 100000 | 3,327.012 | 33.2701 | 0.95 |
| AVX2DoublePumpedNaive | 1000000 | 27,700.134 | 27.7001 | 1.00 |
| AVX2DoublePumpedMicroOpt | 1000000 | 26,130.626 | 26.1306 | 0.94 |
| AVX2DoublePumpedNaive | 10000000 | 298,068.352 | 29.8068 | 1.00 |
| AVX2DoublePumpedMicroOpt | 10000000 | 275,455.395 | 27.5455 | 0.92 |

Sure does! The improvement is very measurable. Too bad we had to uglify the code to get here, but such is life. Our results just improved by another ~4-9% across the board.
If this is the going rate for ugly, I’ll bite the bullet :)

Selecting a good InsertionSort threshold :+1:

As mentioned before, it’s not clear that the currently selected threshold for InsertionSort, which is 16 elements + 1 (for the pivot), or 17 in total, is the best for our mix of partitioning. I tried 24, 32, 40 and 48 on top of 16, and this is what came out, in the form of an area bump chart:

This sort of visualization is great at showing, for each exact value of N, which threshold performed best, i.e. which was positioned lowest in the chart.
What’s abundantly clear is that 16 isn’t cutting it. And while for small/medium-sized arrays 40/48 looks like a good threshold, in my eyes the clear overall winner here is 32, as it performs pretty well for the smaller array sizes, and exceedingly well for larger arrays.

Aligning to CPU Cache-lines :+1:

Our memory access patterns are very different for reading/writing with respect to alignment:

  • For writing, we’re all over the place: we always advance the write pointers according to how the data was partitioned, e.g. it is data-dependent, and there is little we can say about our write addresses. Also, Intel CPUs don’t really have a special opcode for writing aligned data, so we don’t care.
  • For reading, the situation is different: we always advance the read pointers by 8 elements (32 bytes), and we actually have a special intrinsic for aligned loads: Avx.LoadAlignedVector256().

Alignment, in general, is not critical on modern x64 processors, although some people still believe in this myth, probably due to bad experiences a decade ago. What we should be careful about, however, is when our read operations end up crossing cache-lines (which are 64 bytes on almost all modern HW). This literally causes the CPU to issue two read operations directed at the cache units. This sort of cache-line-crossing read does have a sustained effect1.
When we process an array using 4-byte reads, and our data is not 4-byte aligned (which is rarely the case anyway with most allocators, including C#’s), these cross cache-line reads would happen at a rate of 4/64, or 6.25% of reads. Again, it is important to stress that this is at best a theoretical discussion, given that almost all allocators align to machine-word size (4/8 bytes).
The same does not hold true for Vector256<T>-sized reads, which are 32 bytes wide: when we issue cache-line-unaligned reads, we end up reading across cache-lines at a rate of 50%! Not cool.
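These crossing rates can be sanity-checked with a tiny C sketch (my own, not from the partitioning code) that walks sequential reads of a given width from a given starting byte offset and counts how many straddle a 64-byte line:

```c
/* Count how many of n consecutive reads of `width` bytes, starting at
   byte offset `base` within 64-byte cache-lines, straddle a line
   boundary. A read crosses when it starts close enough to the end of
   a line that it spills into the next one. */
static int count_crossings(unsigned base, unsigned width, int n) {
    int crossings = 0;
    for (int i = 0; i < n; i++) {
        unsigned offset = (base + (unsigned) i * width) % 64;
        if (offset + width > 64)
            crossings++;
    }
    return crossings;
}
```

With a misaligned base, 4-byte reads cross on 4 out of every 64 reads (6.25%), while 32-byte reads cross on every other read (50%); with a 32-byte-aligned base, neither crosses at all.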

Can we do something about it? We sure can, and not at a great cost: remember that we need to deal with the remainder of the array anyway, and we’ve been doing that towards the end of our partitioning code thus far. We can move that code from the end of our partitioning function to the beginning, and also modify it to partition with scalar code until both the readLeft/readRight pointers are aligned to 32 bytes.
This means we would potentially do a little more scalar work on the one hand, but on the other we can change all of the loading code to use Avx.LoadAlignedVector256() and know for sure that we will no longer be causing the CPU to issue cross cache-line reads.

I won’t bother showing the full code listing for AVX2DoublePumpedAligned, it’s available here, but I will show the rewritten scalar partition block. Originally it was right after the main double-pumped loop and looked like this:

    while (readLeft < readRight) {
        var v = *readLeft++;

        if (v <= pivot) {
            *tmpLeft++ = v;
        } else {
            *--tmpRight = v;
        }
    }

The re-written loop, with its alignment logic, is now at the top of the function, looking like this:

            const ulong ALIGN = 32;
            const ulong ALIGN_MASK = ALIGN - 1;

            if (((ulong) readLeft & ALIGN_MASK) != 0) {
                var nextAlign = (int *) (((ulong) readLeft + ALIGN) & ~ALIGN_MASK);
                while (readLeft < nextAlign) {
                    var v = *readLeft++;
                    if (v <= pivot) {
                        *tmpLeft++ = v;
                    } else {
                        *--tmpRight = v;
                    }
                }
            }
            Debug.Assert(((ulong) readLeft & ALIGN_MASK) == 0);

            if (((ulong) readRight & ALIGN_MASK) != 0) {
                var nextAlign = (int *) ((ulong) readRight & ~ALIGN_MASK);
                while (readRight > nextAlign) {
                    var v = *--readRight;
                    if (v <= pivot) {
                        *tmpLeft++ = v;
                    } else {
                        *--tmpRight = v;
                    }
                }
            }
            Debug.Assert(((ulong) readRight & ALIGN_MASK) == 0);

What it does now is check whether alignment is necessary, and then proceed to align each side while partitioning it.
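The two rounding expressions are worth internalizing; here is the same arithmetic as a standalone C sketch (uintptr_t stands in for the raw pointer bits). Note that the round-up form always advances by a full 32 bytes when the address is already aligned, which is exactly why the C# code guards each block with an `(addr & ALIGN_MASK) != 0` check first:

```c
#include <stdint.h>

#define ALIGN 32u
#define ALIGN_MASK (ALIGN - 1u)

/* Round down to the previous 32-byte boundary (right-hand side). */
static uintptr_t align_down(uintptr_t p) {
    return p & ~(uintptr_t) ALIGN_MASK;
}

/* Round up to the next 32-byte boundary (left-hand side).
   If p is already aligned this still moves 32 bytes forward,
   hence the alignment check before calling it. */
static uintptr_t align_up(uintptr_t p) {
    return (p + ALIGN) & ~(uintptr_t) ALIGN_MASK;
}
```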

Where do we end up, performance-wise, with this optimization? (To be clear, these results are compared to a baseline version that now uses 32 as the InsertionSort threshold):

| Method | N | Mean (ns) | Time / N (ns) | Ratio |
| ------ | - | --------- | ------------- | ----- |
| AVX2DoublePumpedMicroOpt | 100 | 895.9 | 8.9585 | 1.00 |
| AVX2DoublePumpedAligned | 100 | 2,879.6 | 28.7956 | 3.21 |
| AVX2DoublePumpedMicroOpt | 1000 | 19,093.2 | 19.0932 | 1.00 |
| AVX2DoublePumpedAligned | 1000 | 25,468.0 | 25.4680 | 1.34 |
| AVX2DoublePumpedMicroOpt | 10000 | 278,365.7 | 27.8366 | 1.00 |
| AVX2DoublePumpedAligned | 10000 | 272,146.9 | 27.2147 | 0.98 |
| AVX2DoublePumpedMicroOpt | 100000 | 2,806,369.0 | 28.0637 | 1.00 |
| AVX2DoublePumpedAligned | 100000 | 2,614,231.4 | 26.1423 | 0.93 |
| AVX2DoublePumpedMicroOpt | 1000000 | 24,250,413.3 | 24.2504 | 1.00 |
| AVX2DoublePumpedAligned | 1000000 | 23,771,945.4 | 23.7719 | 0.98 |
| AVX2DoublePumpedMicroOpt | 10000000 | 266,767,552.8 | 26.6768 | 1.00 |
| AVX2DoublePumpedAligned | 10000000 | 258,396,465.8 | 26.4285 | 0.97 |

I know it doesn’t seem like the most impressive improvement, but we have to remember that before, we only hit the scalar code when we had fewer than 8 elements left (but more than 0), which meant 3.5 iterations of scalar partitioning per partition operation on average. With this new aligned code, we are now aligning up to 7 elements on each side per partition operation, i.e. a maximum of 14, or 7 elements on average per partition operation. So we somehow managed to speed up the function by around 2% while doubling the amount of scalar work done! This means that the pure benefit from alignment is larger than what the results are showing right now, since it’s being masked, to some extent, by the extra scalar work we tacked on. If only there was a way we could skip that scalar work altogether…

(Re-)Partitioning overlapping regions

This is a very cool optimization and a natural progression from the last one. At the risk of sounding pompous, I think I might have found something here that no one has done before in the context of partitioning. I could be wrong about that last statement, but I couldn’t find anything quite like this discussed anywhere, and believe me, I’ve searched. If anyone can point me to someone doing this before, I’d really love to hear about it; there might be more good stuff to read about there.

The basic idea here is that we get rid of all (ok, ok, almost all) scalar partitioning in our vectorized code. If we can partition and align the edges of the segment we are about to process with vectorized code, we reduce the total number of instructions executed, while at the same time retaining more of the speed-up that was lost with the alignment optimization we did before. This would have a double-whammy, compounded effect. But how? We go about it the other way around! Instead of aligning inwards in each respective direction, we could simply enlarge the partitioned segment to include a few more (up to 7 more) elements on the outer side of each partition and re-partition them using the new pivot we’ve just selected. This might sound simple and safe, but before we give it the OK, there are critical constraints we must respect for this not to completely mess up the entire sort mechanics.

Make note that this is a very awkward optimization, considering that I’m suggesting we should partition more data in order to speed up our code. This sounds bonkers, unless we dig deep within ourselves and find some mechanical empathy: we need to remind ourselves that not all work is equal in the eyes of the CPU. When we do scalar partitioning on n elements, we are really telling the CPU to execute n branches, which are completely data-dependent. To put it simply: the CPU “hates” this sort of work. It has to guess what happens next, and will do so no better than flipping a coin, i.e. at a success rate of roughly 50% for truly random data. What’s worse, as mentioned at the end of part 3, whenever the CPU mis-predicts, there’s a huge penalty to pay in the form of a full pipeline flush, which costs us roughly 14-15 cycles on a modern CPU. Paying this penalty once is roughly equivalent to partitioning 2 × 8-element vectors in full with our branch-less vectorized partition block! This is why doing “more” work might be faster: what we think of as more is actually less, once we empathize with and understand the CPU.

Back to the constraints, though: there’s one thing we can never do, and that is move a pivot that was previously partitioned. I fondly call these “buried pivots” (since they’re in their final resting place, get it?); as everyone knows, you don’t move around dead bodies, that’s always the first thing that goes wrong in a horror movie. That’s about it. It sounds simple, but it requires explanation: when a previous partition operation finishes, the pivot used during that operation is moved to its final position in the sorted array. Moreover, all subsequent partitioning operations have their left/right edges calculated according to that final position of the pivot. We can not, under any circumstances, ever move an already-placed pivot.

When we call our partitioning operation, we have to consider the asymmetry of the left and right edges of our segment:

  • For the left side:

    • There might not be additional room on the left to read from for an 8-element partition operation!
      • In other words, we are too close to the edge of the array on the left side!
    • Since we always partition first to the left, then to the right, we know for a fact that all of the elements to our left are completely sorted, e.g. they are all previously buried pivots, and we can’t move them.
    • We also know that each of those values is smaller than or equal to whatever pivot value we select for our own partitioning operation.
  • For the right side:

    • There might not be additional room on the right to read from for an 8-element partition operation!
      • In other words, we are too close to the edge of the array on the right side!
    • The immediate value to our right is a pivot, and all other values to its right are larger than or equal to it, so we can’t move it from its position.
    • We also know that each of those values is larger than or equal to whatever pivot value we select for our own partitioning operation.

All this information is hard to process, but what it boils down to is that we need to be very careful when we partition, or more accurately when we permute our data in the vectorized partition block. We need permutation entries that are “stable”; I’m coining this term freely as I go along: our permutation table entries must be stable on the left and on the right, e.g. they cannot reorder the values that go to the left amongst themselves (their relative ordering must be preserved), and likewise they cannot reorder, amongst themselves, the values that go to the right-hand side of the register.

Up to this point, there was no such requirement, and the initial partition tables I generated failed to satisfy this new requirement.

Here’s a simple example of stable/unstable permutation entries; let’s imagine we compared to a pivot value of 500:

| Bit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | - | - | - | - | - | - | - | - |
| Value | 99 | 100 | 666 | 101 | 102 | 777 | 888 | 999 |
| Mask | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 |
| Unstable Permutation | 0 | 1 | 7 | 2 | 3 | 6 | 5 | 4 |
| Unstable Result | 99 | 100 | 101 | 102 | **999** | **888** | **777** | **666** |
| Stable Permutation | 0 | 1 | 4 | 2 | 3 | 5 | 6 | 7 |
| Stable Result | 99 | 100 | 101 | 102 | 666 | 777 | 888 | 999 |

In the above example, the unstable permutation is a perfectly valid permutation: it successfully partitions the sample vector around the pivot value of 500, but the 4 elements I marked in bold are re-ordered with respect to each other when compared to the original array. In the stable permutation entry, the internal ordering amongst each partitioned group is preserved.
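Generating a stable entry from a comparison mask is mechanical. Here’s a C sketch (my own reconstruction of the idea, not the actual table generator) that assigns destination lanes exactly as in the table above: elements with a 0 mask bit are packed leftward in source order, elements with a 1 bit rightward in source order:

```c
/* perm[src] = destination lane for an 8-element stable partition.
   mask bit i == 1 means element i is greater than the pivot.
   Elements <= pivot fill lanes 0..k-1 in source order; elements
   > pivot fill lanes k..7 in source order (k = count of 0 bits). */
static void stable_perm(unsigned mask, int perm[8]) {
    int ones = 0;
    for (int i = 0; i < 8; i++)
        ones += (mask >> i) & 1;

    int left = 0, right = 8 - ones;
    for (int i = 0; i < 8; i++) {
        if ((mask >> i) & 1)
            perm[i] = right++;   /* goes to the right group, in order */
        else
            perm[i] = left++;    /* goes to the left group, in order */
    }
}
```

For the example mask 0b11100100 this produces exactly the “Stable Permutation” row: {0, 1, 4, 2, 3, 5, 6, 7}.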

After I rewrote the code that generates the permutation entries, I proceeded with my overlapping re-partitioning hack: the idea was that I would find the optimal alignment point on the left and on the right (assuming one was available, e.g. there was enough room on that side), read that data with our good ole LoadDquVector256 intrinsic, and partition it into the temporary area. But there is one additional twist: we need to remember how many elements do not belong to this partition (e.g. originate from our overlap hack) and remember not to copy them back at the end of the function.

Prefetching :-1:

I tried using prefetch intrinsics to give the CPU early hints as to where we are reading memory from.

Generally speaking, prefetching should be used to make sure the CPU reads data from memory into its caches ahead of the time it is actually needed, so that the CPU never stalls waiting for memory, which is very slow. We can prefetch all the way into the L1 cache, or specify the target level as L2 or L3. As a quick reminder, here are the memory hierarchy latencies for an Intel Skylake-X processor, although the same ratios are pretty much true for all modern processors:

| Hierarchy Level | Latency (cycles) | Ratio (to previous level) |
| --------------- | ---------------- | ------------------------- |
| L1 Data Cache | 4-5 | 1.0 |
| L2 Cache | 14 | 3.5 |
| L3 Cache | 68 (@ 3.6Ghz) | 4.8 |
| RAM | 79 cycles (constant) + 180 (50 ns @ 3.6Ghz) | 3.8 |

The bottom line is that having to wait for RAM is a death sentence, but even having to wait for L2 cache (14 cycles) when your entire loop’s throughput is around 6-7 cycles is really unpleasant.
But do we actually need to prefetch? Well, there is no clear-cut answer other than trying it out. CPU designers know all this just as well as we do, and the CPU already attempts to prefetch data on its own, so it’s very hard to know when it might need our help. Adding prefetch instructions also puts more load on the CPU, as we’re adding more instructions to decode & execute, while the CPU might already be doing the same work without us telling it. This is the key consideration we have to keep in mind when trying to figure out whether prefetching is a good idea. To make matters worse, the answer can also be CPU-model specific…
In our case, prefetching the writable memory made no sense, as our loop mostly reads the same data it is just about to overwrite within the next iteration or two, so I focused on trying to prefetch the next read addresses.

Whenever I modified readLeft or readRight, I immediately added code like this:

int * nextPtr;
if ((byte *) readLeft   - (byte *) writeLeft <= 
    (byte *) writeRight - (byte *) readRight) {
    nextPtr = readLeft;
    readLeft += 8;
    // Trying to be clever here:
    // if we are reading from the left on this iteration,
    // we are likely to read from the right on the next one
    Sse.Prefetch0((byte *) readRight - 64);
} else {
    nextPtr = readRight;
    readRight -= 8;
    // Same as above
    Sse.Prefetch0((byte *) readLeft + 64);
}
This tells the CPU we are about to use data in readLeft + 64 (the next cache-line from the left) and readRight - 64 (the next cache-line from the right) in the following iterations.

While this looks great on paper, the real-world results of this were unnoticeable for me, and even slightly negative. I think this is related to the fact that on the 2 CPUs I tried this on, the prefetching unit inside the CPU was already doing well without my generous help…
Still it was worth a shot.

Packing the Permutation Table :-1:

The following attempt yielded mixed results. On some CPU models it did better, on others worse; but all in all, I still think it’s interesting that it didn’t do worse overall, and I haven’t given up on it completely.

The original permutation table takes up 32 bytes per entry × 2⁸ (256) entries ⮞ 8KB in total. Just to be clear: that’s a lot! For comparison, our entire CPU L1 data-cache is normally 32KB, and I’d sure rather have it store the actual data we’re sorting than my lookup table, right?

Well, not all is lost. We can do something semi-clever here that takes the lookup table down to 4 bytes per entry, an 8x improvement.


How? Well, with intrinsics, of course! If nothing else, it was worth it just so I could do this:

Yo Dawg

My optimized permutation table and vector loading code looks like this:

ReadOnlySpan<byte> BitPermTable => new byte[] {
    0b10001000, 0b11000110, 0b11111010, 0b00000000, // 0
    // ...
    0b01100011, 0b01111101, 0b01000100, 0b00000000, // 7    
    // ...
    0b00010000, 0b10011101, 0b11110101, 0b00000000, // 170
    // ...
    0b10001000, 0b11000110, 0b11111010, 0b00000000, // 255
};

Vector256<int> GetBitPermutation(uint *pBase, in uint mask)
{
    // 3 permutation bits deposited into the low bits of each byte
    const ulong magicMask =
        0b00000111_00000111_00000111_00000111_00000111_00000111_00000111_00000111;
    return Avx2.ConvertToVector256Int32(
        Vector128.CreateScalarUnsafe(
            Bmi2.X64.ParallelBitDeposit(pBase[mask], magicMask)).AsByte());
}

What does this little monstrosity do, exactly? We pack the permutation bits (remember, we need just 3 bits per element, and we have 8 elements, so 24 bits per permutation vector in total) into a single 32-bit value; then, whenever we need to permute, we:

  • Unpack the 32-bit value into a 64-bit value using ParallelBitDeposit from the BMI2 intrinsics extension.
    In a stroke of luck, I’ve already thoroughly covered it back in my PopCount series here.
  • Convert (move) it to a 128-bit SIMD register using Vector128.CreateScalarUnsafe.
  • Use yet another cryptic intrinsic, ConvertToVector256Int32 (VPMOVSXBD), that takes 8-bit elements from a 128-bit wide register and expands them into integers in a 256-bit register.
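To make the unpacking step concrete, here’s a software model in C of what the PDEP step computes for one packed entry (the deposit mask spreads 3 bits into the low bits of each of 8 bytes; the function name and round-trip test are mine):

```c
#include <stdint.h>

/* Model of ParallelBitDeposit(packed, 0x0707070707070707):
   peel 3 bits at a time off the 24-bit packed entry and deposit
   them into the low 3 bits of each of the 8 result bytes.
   VPMOVSXBD then widens those 8 bytes into 8 32-bit lanes. */
static uint64_t spread_3bit_indices(uint32_t packed) {
    uint64_t out = 0;
    for (int lane = 0; lane < 8; lane++)
        out |= (uint64_t) ((packed >> (3 * lane)) & 0x7u) << (8 * lane);
    return out;
}
```

Packing is the exact inverse: OR each 3-bit lane index into the 32-bit entry at bit position 3 × lane.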

In short, we chain 3 extra instructions, but save 7KB of cache. Was it worth it?
I wish I could say with complete assurance that it was, but the truth is that it had only a very small effect. While we end up using 1KB of cache instead of 8KB, the extra instructions still cost us quite a lot. I still think this optimization might do some good, but for it to make a bigger splash we would need to be in a situation where there is more pressure on the cache, for the extra latency to be worth it. I will refer to this in the last post, when I discuss other types of sorting that I would still need/want to support, where I also think this packing approach might still prove useful…

Skipping some permutations :-1:

There are very common cases where permuting (and loading the permutation vector) is completely unneeded; to be exact, there are exactly 8 such cases in the permutation table: whenever all the 1 bits are already grouped in the upper (MSB) part of the mask:

  • 0b00000000
  • 0b11111110
  • 0b11111100
  • 0b11111000
  • 0b11110000
  • 0b11100000
  • 0b11000000
  • 0b10000000

I thought it might be a good idea to detect those cases using a switch statement or some other intrinsics-based code. While it did work, the extra branch and associated branch mis-prediction made it not worthwhile; it didn’t yield any positive result. The simpler code, which always permutes, did just as well. Oh well, it was worth the attempt…
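For what it’s worth, the 8 masks above don’t even need a lookup: a mask whose 1 bits are all packed at the MSB end has an inverted low byte of the form 0…01…1, which is one less than a power of two. One way to test for that, sketched in C (my own formulation; note it also accepts the all-ones mask, which is equally fine to leave unpermuted):

```c
/* An 8-bit comparison mask needs no permutation when all its 1 bits
   (elements greater than the pivot) already sit at the MSB end.
   For such masks the inverted low byte y is of the form 0...01...1,
   i.e. one less than a power of two, so y & (y + 1) == 0. */
static int already_partitioned(unsigned mask) {
    unsigned y = ~mask & 0xFFu;
    return (y & (y + 1)) == 0;
}
```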

Reordering instructions :-1:

I also tried reordering some instructions so that they would happen sooner inside the loop body. For example: moving the PopCounting to happen sooner (immediately after we calculate the mask).

None of these attempts helped, and I think the reason is that the CPU already re-orders instruction execution on its own. So while it sounds logical that this should help, changing the code makes no difference when the CPU is already doing the same thing all by itself, without our generous help.

Mitigating the bad speculation :+1:

I postponed answering the last question I raised at the end of part 3 for last. If you recall, we experienced a lot of bad-speculation effects when sorting the data with our vectorized code, and profiling with hardware counters showed us that while InsertionSort was the cause of most of the bad-speculation events (41%), our vectorized code was still responsible for 32% of them. I’ve been writing again and again that our vectorized code is branch-less, so why would branch-less code cause bad speculation? Shouldn’t we be doing no speculation at all?

Oh, I wish it were true. Remember this little gem, which we have to use in every iteration for us to successfully partition in-place?

int * nextPtr;
if ((byte *) readLeft   - (byte *) writeLeft <= 
    (byte *) writeRight - (byte *) readRight) {
    nextPtr = readLeft;
    readLeft += 8;
} else {
    nextPtr = readRight;
    readRight -= 8;
}

This is it! Whenever we pick a side to read from next, this is where we put the CPU in a tough spot. We’re asking it to speculate on something it can’t possibly speculate on successfully. Our question is: “Oh CPU, CPU in the socket, which side is closer to being over-written of them all?”, to which the answer is completely data-driven! In other words, it depends on how the last round(s) of partitioning mutated all 4 pointers involved in the comparison. While this might sound like an easy thing for the CPU to check, it is actually required to speculate the answer, since every time the CPU is asked this question it is still in the middle of processing a few of the previous iterations of this very same hot-loop, due to the length of the pipeline and the nature of speculative execution. So the CPU guesses, at best, on stale data, and we know, as the grand designers of this mess, that in reality, at least for random data, its best guess is no better than flipping a coin. Quite sad.

Mis-predicting here is unavoidable. Or at least, I have no idea how to avoid it in C# with the current JIT in August 2019 (but oh, just you wait for part 6, I have something in store for you there…, hint hint, wink wink).

But not all is lost.

We can still do something about it! We can unroll the code!
Unrolling loop code is a common technique usually employed to reduce loop overhead (maintaining the loop counters and checking the loop end-condition) by doing more actual work per iteration (e.g. invoking the same code multiple times per iteration). It’s classically thought of as a complexity/overhead trade-off: we write somewhat more complex code, but end up with less overhead from the “fluff” code of maintaining a loop counter and checking it. The concept is clear: we do more work per iteration, thereby paying less overhead per unit of work done.

It should probably be noted that normally, unrolling this hot-loop wouldn’t do much good, since the ratio of instructions attributed to overhead (loop-control) vs. actual work isn’t so skewed that we should immediately run to unroll. However, I propose a different way of thinking about this: we shouldn’t count the instructions attributed to work/overhead, but rather the cycles attributed to work/overhead. With that in mind, it is clear that even a single branch instruction, when frequently mis-speculated, is very expensive, cycle-wise.

The same unrolling technique can be used to mitigate our bad speculation, to some extent. While we can’t fix the rate of mis-speculation, we can change the ratio between the penalty paid per mis-speculation and the work done: we can, for example, use a loop body that processes 4 vectorized blocks per iteration. We would still speculate poorly, but we would take 4x fewer of these branches, since we re-write the code to decide which side to read from once per every 4 blocks.
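To illustrate just the control-flow change (and nothing else), here’s a stripped-down C sketch of the idea: the data-dependent side selection runs once per 4 blocks instead of once per block, so for n elements the hard-to-predict branch executes n/32 times instead of n/8. The “work” here is only a checksum and the side pick is an arbitrary stand-in; the real code obviously partitions, and keeps stricter margins:

```c
#define BLOCK 8

/* Simplified shape of the unrolled double-pumped read loop: decide
   which side to read from once, then consume four 8-element blocks
   from that side before deciding again. `branches` counts how many
   times the unpredictable side-selection branch executes. */
static long consume_double_pumped(const int *a, int n, int *branches) {
    int left = 0, right = n;            /* [left, right) still unread */
    long sum = 0;
    *branches = 0;

    while (right - left >= 4 * BLOCK) {
        const int *p;
        ++*branches;                    /* the one unpredictable branch */
        if ((left & 1) == 0) {          /* stand-in for the data-dependent pick */
            p = a + left;
            left += 4 * BLOCK;
        } else {
            p = a + right - 4 * BLOCK;
            right -= 4 * BLOCK;
        }
        for (int i = 0; i < 4 * BLOCK; i++)  /* 4 blocks of "work" */
            sum += p[i];
    }
    while (left < right)                /* scalar remainder */
        sum += a[left++];
    return sum;
}
```

Every element is still consumed exactly once; only the frequency of the unpredictable branch changes.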

I won’t show the entire code listing for this, as it really ended up blowing up and complicating the code; you can look into it here. What I will show is where we are after unrolling by 4:

Out of juice?

Well, I’m personally out of ideas about how to further optimize the vectorized code, for now.

I kept telling myself this when this blog post was half its current size, but this journey of optimizing this particular part of the code, the partitioning functions, appears to have come to an end.

Let’s show where we are, when compared to the original Array.Sort I set out to beat in the beginning of the first post, when we were all still young and had a long future ahead of us :)

We are now running at almost 4x the speed of the original Array.Sort!

It is also interesting to take a look at the various profiles we showed before:

Not bad, all in all. We are now partitioning using vectorized code pretty quickly, and this is a good time to finally end this post.
In the next post we will move on to replacing InsertionSort. Right now, this is the last big chunk of scalar code we are still running, and after all the previous optimization efforts it now takes up around half of the time we actually spend sorting. Can we make it? Stay tuned!

In the last post, I will address a way to avoid branching at this point entirely: an optimization technique that converts the branch into a data dependency, which is the way to go for badly speculating code.
This optimization relies on a widely available Intel opcode called CMOV (conditional move). While C++/Go/Rust compilers can emit it (mostly due to their use of LLVM as an optimizing compiler back-end), at the time of writing, August 2019, we, as C# developers, don’t have a JIT that supports this feature while also supporting AVX2 intrinsics (Mono has some limited CMOV support, but no intrinsics support), and no intrinsics for it are available in C#. I will show how/what could be done with it in the last blog post in this series, but for now, having the JIT magically solve our mis-prediction woes for us is off the table.

So what’s so special about SIMD registers?

Again, not much. According to the specific CPU we’re running our code on, we’ll get access to a different set of vectorized registers, varying in their size / width:

[Table: vector register files by width, bits 511…256 / 255…128 / 127…0]

In this table, which I’ve conveniently taken and adapted from Wikipedia, you can see the various vectorized registers in the Intel world of CPUs.

The somewhat small part shaded in gray is the actual set of registers available to us through CoreCLR 3.0: 16 registers that are either 128 or 256 bits wide (depending on whether our CPU has SSE / AVX support).

The rest of the table depicts what is / could be available to us were we C++ / Rust developers on the best hardware Intel has to offer.
I know it immediately feels like we, as C# devs, have been shortchanged, since all those nice plump 512-bit registers are for us to see but not use, but in reality AVX-512 still hasn’t caught on for mere mortals: practically no desktop/mobile CPU supports it, and even with servers / workstations, you need to shell out serious change to get access to these registers and (more importantly!) the instructions that come with them.

To sum this up: as C# developers, we get access to 16 architectural 256-bit wide registers. These can later be mapped onto many more physical registers by the CPU’s own register renaming (which I’ve written about in the past).

  1. Most modern Intel CPUs can actually address the L1 cache units twice per cycle, which means they can ask it to read two cache-lines at the same time. But this still causes more load on the cache and bus, and we must not forget that we will also be reading an additional cache-line for our permutation block…