Trumping Array.Sort with AVX2 Intrinsics (Part 2/6)

16 minute read

I ended up going down the rabbit hole re-implementing Array::Sort with AVX2 intrinsics, and there’s no reason I should store all the agony inside (to be honest: I had a lot of fun with this), let’s get in the ring and show what AVX/AVX2 intrinsics can really do for a non-trivial problem, and even discuss potential improvements that future CoreCLR versions could bring to the table.

Since there’s a lot to over go over here, I’ll split it up into a few parts:

  1. In part 1, we did a short refresher on QuickSort and how it compares to Array::Sort. If you don’t need any refresher, you can skip over it and get right down to part 2 and onwards , although I really recommend skimming through, mostly because I’ve got really good visualizations for that should be in the back of everyone’s mind as we’ll be dealing with vectorization & optimization later.
  2. In this part, we go over the basics of Vectorized HW Intrinsics, discussed vector types, and a handful of vectorized instructions we’ll actually be using in part 3, but we still weren’t sorting anything.
  3. In part 3 we go through the initial code for the vectorized sorting and we’ll finally start seeing some payoff. We’ll finish with some agony courtesy of CPU’s Branch Predictor, just so we don’t get too cocky.
  4. In part 4, we go over a handful of optimization approaches that I attempted trying to get the vectorized partition to run faster, we’ll see what worked and what didn’t.
  5. In part 5, we’ll see how we can almost get rid of 100% of the remaining scalar code, by implementing small-constant size array sorting. We’ll use, drum roll…, yet more AVX2 vectorization and gain a considerable amount of performance / efficiency in the process.
  6. Finally, in part 6, I’ll list the outstanding stuff / ideas I have for getting more juice and functionality out of my vectorized code.

Intrinsics / Vectorization

I’ll start by repeating my own words from the first blog post where I discussed intrinsics in CoreCLR 3.0:

Processor intrinsics are a way to directly embed specific CPU instructions via special, fake method calls that the JIT replaces at code-generation time. Many of these instructions are considered exotic, and normal language syntax does cannot map them cleanly.
The general rule is that a single intrinsic “function” becomes a single CPU instruction.

You can go and re-read that introduction if you care for a more general and gentle introduction to processor intrinsics. For this series, we are going to focus on vectorized intrinsics in Intel processors. This is the largest group of CPU intrinsics in our processors, and I want to start by showing this by the numbers. I gathered some statistics by processing Intel’s own data-3.4.6.xml. This XML file is part of the Intel Intrinsics Guide, an invaluable resource on intrinsics in itself, and the “database” behind the guide. What I learned was that:

  • There are no less than 1,218 intrinsics in Intel processors1!
    • Those can be combined in 6,180 different ways (according to operand sizes and types).
    • They’re grouped into 67 different categories/groups, these groups loosely correspond to various generations of CPUs as more and more intrinsics were gradually added.
  • More than 94% are vectorized hardware intrinsics, which we’ll define more concretely below.

That last point is supercritical: CPU intrinsics, at least in 2019, are overwhelmingly about being able to execute vectorized instructions. That’s really why you should be paying them attention in the first place. There is additional stuff in there: if you’re a kernel developer, or writing crypto code, or some other niche-cases, but vectorization is why you are really here, whether you knew it or not.

In C#, we’ve mostly shied away from having intrinsics until CoreCLR 3.0 came along, but as single-threaded performance has virtually stopped improving, more and more languages started adding intrinsics support (go, rust, Java and now C#) so developers in those languages would have access to these specialized, much more efficient vectorized instructions. CoreCLR 3.0 does not support all 1,218 intrinsics that I found, but a more modest 226 intrinsics in 15 different classes. Each class is filled with many static functions, all of which are processor intrinsics, and represent a 1:1 mapping to Intel group/code names. As C# developers, we roughly get everything that Intel incorporated in their processors manufactured from 2014 and onwards2, and for AMD processors, from 2015 onwards.

So what are these vectorized intrinsics?
We need to go over a few base concepts before we can start explaining specific intrinsics:

  • What are vectorized intrinsics, and why have they become so popular.
  • How vectorized intrinsics interact with specialized vectorized registers.
  • How those registers are reflected in CoreCLR 3.0.

SIMD What & Why

I’m going to use vectorization and SIMD interchangeably from here-on, but for the first and last time, let’s define exactly what SIMD is: Single Instruction Multiple Data is really a simple idea if you think about it… A lot of code ends up doing “stuff” in loops, usually, processing vectors of data one element at a time. SIMD instructions bring a simple new idea to the table: The CPU adds special instructions that can do arithmetic, logic and many other types of generalized operations on “vectors”.

The benefit of using this approach to computing is that it allows for much greater efficiency: When we use vectorized intrinsics we end up executing the same number of instructions to process, for example, 8 data elements per instruction. Therefore, we reduce the amount of time the CPU spends decoding instructions for the same amount of work; furthermore, most vectorized instructions operate independently on the various elements of the vector and complete their operation at the same number of CPU cycles as the equivalent non-vectorized (or scalar) instruction. In short, in the land of CPU feature economics, vectorization is considered a high bang-for-buck feature: You can get a lot of potential performance for relatively little transistors added to the CPU, as long as people are willing to adapt their code (e.g. rewrite it) to use your new intrinsics, or compilers somehow magically manage to auto-vectorize the code3.

Of course, I’m grossly over-romanticizing vectorized intrinsics and their benefits: There are also many non-trivial overheads involved both using them and adding them to our processors. But all in all, the grand picture of CPU economics remains the same, adding vectorized instructions is still, compared to other potential improvements, quite cheap, under the assumption that programmers are willing to make the effort to re-write and maintain the vectorized code.

SIMD registers

After our short introduction to vectorized intrinsics, we need to introduce SIMD registers, and how this piece of the puzzle fits the grand picture: Teaching our CPU to execute 1,000+ vectorized instructions is just part of the story, these instructions need to somehow operate on data in memory. Do all of these instructions simply take a pointer to memory and run wild with it? The short answer is: No. For the most part, CPU instructions dealing with vectorization use special registers inside our CPU that are called SIMD registers. This is analogous to scalar (regular, non-vectorized) code we write in any programming language: while some instructions read and write directly to memory, and occasionally some instructions accept a memory address as an operand, most instructions are register ↔ register only.

Just like scalar CPU registers, SIMD registers have a constant bit-width. In Intel these come at 64, 128, 256 and recently even 512 bit wide registers. Being SIMD registers, they will end up containing multiple data-elements of a primitive type. The same register can and will be used to process a wide-range of primitive data-types, depending on which instruction is using it, as we will shortly demonstrate.

For now, this is all I care to explain about SIMD Registers at the CPU level: just like you don’t necessarily need to know how many scalar registers exist and the details surrounding how the JIT generates code to use them efficiently for a given CPU, you can skip attaining that knowledge when starting with vectorization as well. We will circle back later and give more attention to these details when, invariably, our vectorized code doesn’t perform as quickly as we’d like it to, but for now we just need the basics.

SIMD Intrinsic Types in C#

We’ve discussed SIMD intrinsics and mentioned that those operate (e.g. accept and modify) on SIMD registers. To work with SIMD intrinsics in C# we need to get acquainted with the C# abstraction of the very same registers we just mentioned: These are primitive vector types that the JIT recognizes and knows how to work with, In C# those would be:

These are special value-types recognized by the JIT while it is busy generating machine code. We should think about these types just like we think about other special-case primitive types such as int or double. These vector types all accept a generic parameter <T>. This parameter can’t just be anything we’d like. It is limited to the types supported on our CPU and its vectorized intrinsics. Also, the numerical part of the type name (64/128/256) denotes the number of bits in the vectorized type, or the actual width of the SIMD register in bits.

Let’s take Vector256<T>, which I’ll be using exclusively in this series, as an example; Vector256<T> can be used with the following primitive types:

typeof(T) # elements element width (bits)
byte / sbyte 32x8b
short / ushort 16x16b
int / uint 8x32b
long / ulong 4x64b
float 8x32b
double 4x64b

No matter what type of the supported primitive types we’ll choose, we’ll end up with 256 bits wide in total, or the underlying SIMD register in question.

A few Vectorized Instructions for the road

Armed with this new understanding and knowledge of Vector256<T> we can move on and start learning a few vectorized instructions.

Chekhov famously said: “If in the first act you have hung a pistol on the wall, then in the following one it should be fired. Otherwise, don’t put it there”. Here are seven loaded AVX2 guns, and let me assure you they are about to fire in the next act. We’re obviously not going to explain all 1,000+ intrinsics mentioned before, if only not to piss off Anton Chekhov. We will thoroughly explain the ones needed to get this party going.
Here’s the list of what we’re going to go over:

x64 asm Intel CoreCLR
vbroadcastd _mm256_broadcastd_epi32 Vector256.Create(int)
vlddqu _mm256_lddqu_si256 Avx.LoadDquVector256
vmovdqu _mm256_storeu_si256 Avx.Store
vpcmpgtd _mm256_cmpgt_epi32 Avx2.CompareGreaterThan
vmovmskps _mm256_movemask_ps Avx.MoveMask
popcnt _mm_popcnt_u32 Popcnt.PopCount
vpermd _mm256_permutevar8x32_epi32 Avx2.PermuteVar8x32

I understand that for first time readers, this list looks like I’m just name-dropping lots of fancy code names to make myself sound smart, but the unfortunate reality is that we kind of need to know all of these, and here is why: On the right column I’ve provided the actual C# Intrinsic function we will be calling in our code and linked to their docs. But here’s a funny thing: There is no “usable” documentation on Microsoft’s own docs regarding these intrinsics. All they do is simply point back to the Intel name, which I’ve also provided in the middle column, with links to the real documentation, the sort that actually explains what the instruction does. Finally, since we’re practically writing assembly code anyways, and I can guarantee we’ll end up inspecting JIT’d code, I provided the x64 assembly opcodes for our instructions as well.4 Now, What does each of these do? Let’s find out…

As luck would have it, I was blessed with the ultimate power of wielding SVG animations, so most of these instructions will be accompanied by an animation visually explaining them.
Hint: These animations are triggered by your mouse pointer / finger-touch inside them. The animations will immediately stop/freeze once the pointer is out of the drawing area, and resume again when inside…
Use this wisely. From here-on, I’ll use the following icon when I have a thingy that animates:

Vector256.Create(int value)

We’ll start with a couple of simple instructions, and nothing is more simple than this first one: This intrinsic accepts a single scalar value and simply “broadcasts” it to an entire SIMD register, this is how you’d use it:

Vector256<int> someVector256 = Vector256.Create(0x42);

This will load up someVector256 with 8 copies of 0x42 once executed, and in x64 assembly, the JIT will produce something quite simple:

vmovd  xmm0, rax          ; 3 cycle latency / 1 cycle throughput
vpbroadcastd ymm0, xmm0   ; 3 cycle latency / 1 cycle throughput

This specific intrinsic is translated into two intel opcodes, since there is no direct single instruction that performs this. Here is what it looks like visually:

LoadDquVector256 / Store

Another couple of brain dead simple intrinsics are the ones we use to read/write from memory into SIMD registers and conversely store from SIMD registers back to memory. These are amongst the most common intrinsics out there:

int *ptr = ...; // Get some pointer to a big enough array

Vector256<int> data = Avx.LoadDquVector256(ptr);
Avx.Store(ptr, data);

And in x64 assembly:

vlddqu ymm1, ymmword ptr [rdi]  ; 5 cycle latency + cache/memory
                                ; 0.5 cycle throughput
vmovdqu ymmword ptr [rdi], ymm1 ; Same as above

I only included an SVG animation for LoadDquVector256, but you can use your imagination and visualize how Store simply does the same thing in reverse:


CompareGreaterThan does an n-way, elemet-by-element greater-than (>) comparison between the elements of two Vector256<T> instances. In our case where T is really int, this means comparing 8 integers in one go, instead of performing 8 comparisons serially!

But where is the result? In a new Vector256<int> of course! The resulting vector contains 8 results for the corresponding comparisons between the elements of the first and second vectors. Each position where the element in the first vector was greater-than (>) the second vector, the corresponding element in the result vector gets a -1 value, or 0 otherwise.
Calling this is rather simple:

Vector256<int> data, comperand;
Vector256<int> result =
    Avx2.CompareGreaterThan(data, comperand);

And in x64 assembly, this is pretty simple too:

vpcmpgtd ymm2, ymm1, ymm0 ; 1 cycle latency
                          ; 0.5 cycle throughput

Visually this does:


Another intrinsic which will prove to be very useful is the ability to extract some bits from a vectorized register into a normal, scalar one. MoveMask does just this. This intrinsic takes the top-level (MSB) bit from every element and moves it into our scalar result:

Vector256<int> data;
int result = Avx.MoveMask(data.AsSingle());

There’s an oddity here, as you can tell by that awkward .AsSingle() call, try to ignore it if you can, or hit this footnote5 if you can’t. The assembly instruction here is exactly as simple as you would think:

vmovmskps rax, ymm2  ; 5 cycle latency
                     ; 1 cycle throughput



PopCount is a very simple, useful intrinsic, which I’ve covered extensively before: PopCount returns the number of 1 bits in a 32/64 bit primitive.
In C#, we would use it as follows::

int result = PopCnt.PopCount(0b0000111100110011);
// result == 8

And in x64 assembly code:

popcnt rax, rdx  ; 3 cycle latency
                 ; 1 cycle throughput

In this series, PopCount is the only intrinsic I use that is not purely vectorized6.


PermuteVar8x32 accepts two vectors: source, permutation and performs a permutation operation on the source value according to the order provided in the permutation value. If this sounds confusing go straight to the visualization below…

Vector256<int> data, perm;
Vector256<int> result = Avx2.PermuteVar8x32(data, perm);

While technically speaking, both the data and perm parameters are of type Vector256<int> and can contain any integer value in their elements, only the 3 least significant bits in perm are taken into account for permutation of the elements in data.
This should make sense, as we are permuting an 8-element vector, so we need 3 bits (23 == 8) in every permutation element to figure out which element goes where… In x64 assembly this is:

vpermd ymm1, ymm2, ymm1 ; 3 cycles latency
                        ; 1 cycles throughput

And with our trusty animations, what goes on under the hood becomes that much clearer:

That’s it for now

This post was all about laying the groundwork before this whole mess comes together.
Remember, we’re re-implementing QuickSort here, which for the most part, means re-implementing the partitioning function from our scalar code in the previous post with AVX2 intrinsics.
I’m sure wheels are turning in many heads now as you are trying to figure out what comes next…
I think it might be a good time as any to end this post and leave you with a suggestion: Try to take a piece of paper or your favorite text editor, and see if you can some cobble up these instructions into something that can partition numbers given a selected pivot.

When you’re ready, head on to the next post to see how the whole thing comes together, and how fast we can get it to run with a basic version…

  1. To be clear, some of these are intrinsics in unreleased processors, and even of those that are all released in the wild, there is no single processor support all of these… 

  2. CoreCLR supports roughly everything up to and including the AVX2 intrinsics, which were introduced with the Intel Haswell processor, near the end of 2013. 

  3. In general, auto-vectorizing compilers are a huge subject in their own, but the bottom line is that without completely changing the syntax and concepts of our programming language, there is very little that an auto-vectorizing compiler can do with existing code, and making one that reall works often involves designing programming language with vectorization baked into them from day one. 

  4. Now, If I was in my annoyed state of mind, I’d bother to mention that I personally always thought that introducing 200+ functions with already established names (in C/C++/rust) and forcing everyone to learn new names whose only saving grace is that they look BCLish to begin with was not the friendliest move on Microsoft’s part, and that trying to give C# names to the utter mess that Intel created in the first place was a thankless effort that would only annoy everyone more, and would eventually run up against the inhumane names Intel went for (Yes, I’m looking at you LoadDquVector256, you are not looking very BCL-ish to me with the Dqu slapped in the middle there : (╯°□°)╯︵ ┻━┻)… But thankfully, I’m not in my annoyed state. 

  5. While this looks like we’re really doing “something” with our Vector256<int> and somehow casting it do single-precision floating point values, let me assure you, this is just smoke and mirrors: The intrinsic simply accepts only floating point values (32/64 bit ones), so we have to “cast” the data to Vector256<float>, or alternatively call .AsSingle() before calling MoveMask. Yes, this is super awkward from a pure C# perspective, but in reality, the JIT understands these shenanigans and really ignores them completely. 

  6. By the way, although this intrinsic doesn’t accept nor return one of the SIMD registers / types, and considered to be a non-vectorized intrinsic as far as classification goes, as far as I’m concerned bit-level intrinsic functions that operate on scalar registers are just as “vectorized” as their “pure” vectorized sisters, as they mostly deal with scalar values as vectors of bits.