Trumping Array.Sort with AVX2 Intrinsics (Part 1/6)
Let’s do this
Everyone needs to sort arrays, once in a while. We think of it as a solved problem, and that nothing further can be done about it in 2019, except to wait for newer, marginally faster machines to pop up^{1}. However, that is not the case, and if you join me in this rather long journey, we’ll end up with a replacement function for Array.Sort, written in C#, that can outperform CoreCLR’s own C++ based code by a factor of 5x.
Sometime in the future, after some additional tricks land inside the JIT, the same concepts in this series can take us even further, to roughly a 10x performance bump.
Sounds interesting? If so, down the rabbit hole we go…
As I was reading Stephen Toub’s post about Improvements in CoreCLR 3.0, it became apparent that hardware intrinsics were common to many of those improvements, and that many parts of CoreCLR could still be improved with these techniques. One thing led to another, and I decided to attempt to apply hardware intrinsics to a larger problem than I had previously tackled myself, to see if I could rise to the challenge: I decided to take on array sorting and see how far I could go.
What I came up with eventually would become a rewrite of QuickSort / Array.Sort() with AVX2 hardware intrinsics. Fortunately, choosing sorting, and specifically QuickSort, makes for a great blog post series, since:
 Everyone should be familiar with the domain and even the original (It is the bread and butter of learning computer science).
 It’s relatively easy to explain/refresh on it.
 If I can make it there, I can make it anywhere.
 I had no idea how to do it.
I started googling and found an interesting paper titled: Fast Quicksort Implementation Using AVX Instructions by Shay Gueron and Vlad Krasnov. That title alone made me think this would be a walk in the park, but while promising, it wasn’t good enough as a drop-in replacement for Array.Sort for reasons I’ll shortly go into. I ended up (or rather, still am, to be honest) having a lot of fun expanding on their ideas. I will submit a proper pull-request to start a discussion with CoreCLR devs about integrating this code into the main CoreCLR repo, but for now, let’s get in the ring and show what AVX/AVX2 intrinsics are capable of for this non-trivial problem.
Since there’s a lot to go over here, I’ve split it up into no less than 6 parts:
 In this part, we do a short refresher on QuickSort and how it compares to Array.Sort(). If you don’t need any refresher, you can skip over it and get right down to part 2 and onwards, although I recommend skimming through, mostly because I’ve got excellent visualizations which should be in the back of everyone’s mind as we deal with vectorization & optimization later.
 In part 2, we go over the basics of vectorized hardware intrinsics, discuss vector types, and a handful of vectorized instructions we’ll use in part 3, but we still won’t be sorting anything.
 In part 3, we go through the initial code for the vectorized sorting, and we’ll finally start seeing some payoff. We’ll finish with some agony courtesy of the CPU’s Branch Predictor, which will throw a wrench into our attempts.
 In part 4, we go over a handful of optimization approaches that I attempted trying to get the vectorized partitioning to run faster. We’ll see what worked and what didn’t.
 In part 5, we’ll see how we can almost get rid of all the remaining scalar code by implementing small, constant-size array sorting. We’ll use… drum roll… yet more AVX2 vectorization, and gain a considerable amount of performance/efficiency in the process.
 Finally, in part 6, I’ll list the outstanding stuff/ideas I have for getting more juice and functionality out of my vectorized code.
QuickSort Crash Course
QuickSort
is deceivingly simple.
No, it really is.
In 20 lines of C#, or whatever language you prefer, you can sort numbers. Lots of them, and incredibly fast. However, try to change something about it, and it quickly becomes apparent just how tricky it is to improve on it without breaking any of the tenets it is built upon.
In words
Before we discuss any of that, let’s describe QuickSort
in words, then in code:
 It uses a divideandconquer approach.
 In other words, it’s recursive.
 It performs an average of O(n log n) comparisons for n items.
 It performs an inplace sort.
That last point, referring to in-place sorting, sounds simple and neat, and it sure is from the perspective of the user: no additional memory allocation needs to occur, regardless of how much data they’re sorting. While that’s great, I’ve spent days trying to overcome the correctness and performance challenges that arise from it, specifically in the context of vectorization. It is also essential to remain in-place since I intend for this to become a drop-in replacement for Array.Sort.
More concretely, QuickSort
works like this:
 Pick a pivot value.
 Partition the array around the pivot value.
 Recurse on the left side of the pivot.
 Recurse on the right side of the pivot.
Picking a pivot could be a mini-post in itself, but again, in the context of competing with Array.Sort we don’t necessarily need to dive into it; we’ll copy whatever CoreCLR does, and get on with our lives.
CoreCLR uses a pretty standard scheme of median-of-three for pivot selection, which can be summed up as: “Let’s sort these 3 elements: first, middle and last, then pick the middle one of those three”.
Partitioning the array is where we spend most of the time sorting: we take our pivot value, whatever it is, and rearrange the segment that was handed to us such that all numbers smaller-than the pivot are in the beginning, or left (in no particular order). Then comes the pivot, in its final resting position, and following it are all elements greater-than the pivot, again in no particular order amongst themselves.
After partitioning is complete, we recurse to the left and right of the pivot, as previously described.
That’s all there is: this gets millions of numbers sorted, inplace, as efficiently as we know how to do 60+ years after its invention.
Bonus trivia points for those who are still here with me: Tony Hoare, who invented QuickSort back in the early 60s, also took responsibility for inventing the null pointer concept. So I guess there really is no good without evil in this world…
In code
void QuickSort(int[] items) => QuickSort(items, 0, items.Length - 1);

void QuickSort(int[] items, int left, int right)
{
    if (left == right) return;
    int pivot = PickPivot(items, left, right);
    int pivotPos = Partition(items, pivot, left, right);
    QuickSort(items, left, pivotPos);
    QuickSort(items, pivotPos + 1, right);
}
int PickPivot(int[] items, int left, int right)
{
    var mid = left + ((right - left) / 2);
    SwapIfGreater(ref items[left], ref items[mid]);
    SwapIfGreater(ref items[left], ref items[right]);
    SwapIfGreater(ref items[mid],  ref items[right]);
    var pivot = items[mid];
    return pivot;
}
int Partition(int[] array, int pivot, int left, int right)
{
    while (left <= right) {
        while (array[left] < pivot) left++;
        while (array[right] > pivot) right--;

        if (left <= right) {
            var t = array[left];
            array[left++]  = array[right];
            array[right--] = t;
        }
    }
    return left;
}
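To see what a single partitioning operation accomplishes in isolation, here is a small self-contained harness around the partition loop (a compact copy of the routine, with the post-conditions described in words above checked explicitly):

```csharp
using System;
using System.Linq;

class Program
{
    // Self-contained copy of the partition loop: everything smaller-than the
    // pivot ends up to the left of the returned boundary, everything
    // greater-than it to the right (pivot-equal elements may land on either side).
    static int Partition(int[] array, int pivot, int left, int right)
    {
        while (left <= right)
        {
            while (array[left] < pivot) left++;
            while (array[right] > pivot) right--;

            if (left <= right)
            {
                var t = array[left];
                array[left++]  = array[right];
                array[right--] = t;
            }
        }
        return left;
    }

    static void Main()
    {
        var a = new[] { 7, 2, 9, 4, 5, 1, 8, 3, 6 };
        int boundary = Partition(a, pivot: 5, left: 0, right: a.Length - 1);

        // Post-condition: [0..boundary) <= pivot, [boundary..end] >= pivot
        bool leftOk  = a.Take(boundary).All(x => x <= 5);
        bool rightOk = a.Skip(boundary).All(x => x >= 5);
        Console.WriteLine(leftOk && rightOk ? "partitioned" : "FAILED");
    }
}
```

Note that the two halves are only partitioned, not sorted; sorting them is exactly what the recursive calls are for.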
I did say it is deceptively simple, and grasping how QuickSort really works sometimes feels like trying to lift sand through your fingers, so I’ve decided to include two more visualizations of QuickSort, which are derivatives of the amazing work done by Michael Bostock (@mbostock) with D3.
Visualizing QuickSort’s recursion
One thing that we have to keep in mind is that the same data is partitioned over and over again, many times, with ever-shrinking partition sizes, until we end up having a partition size of 2 or 3, in which case we can trivially sort the partition as-is and return.
To help see this better, we’ll use this way of visualizing arrays in QuickSort:
Here, we see an unsorted array of 200 elements (in the process of getting sorted).
The different sticks represent numbers in the [-45°..+45°] range, and the angle of each stick represents its value, as I hope is easy to discern.
The pivots are represented with two colors:
 Red for the currently selected pivot at a given recursion level.
 Green for previous pivots that have already been partitioned around in previous rounds/levels of the recursion.
Our ultimate goal is to go from the image above to the image below:
What follows is a static (i.e., non-animated) visualization that shows how pivots are selected at each level of recursion and how, by the next step, the unsorted segments around them become partitioned, until we finally have a completely sorted array. Here is how the whole thing looks:
These visuals are auto-generated in JavaScript + D3, so feel free to hit that “Reload” button if you want to see a new set of random sticks sorted.
I encourage you to look at this and try to explain to yourself what QuickSort “does” here, at every step. What you can see is the interaction between each selected pivot and where it “lands” in the next recursion level (or row), how new pivots are then selected to its left and right, and how, at each successive level of recursion, we have twice as many pivots (in total) to partition around, with ever-shrinking partition sizes.
Visualizing QuickSort’s Comparisons/Swaps
While the above visualization really does a lot to help understand how QuickSort
works, I also wanted to leave you with an impression of the total amount of work done by QuickSort
, here is an animation of the whole process as it goes over the same array, slowly and recursively going from an unsorted mess to a completely sorted array:
We can witness just how many comparison and swap operations need to happen for a 200-element QuickSort to complete successfully. There’s genuinely a lot of work that needs to happen per element (considering how we re-partition virtually all elements again and again) for the whole thing to finish.
Array.Sort vs QuickSort
It’s important to note that Array.Sort
uses a couple more tricks to get better performance. I would be irresponsible if I didn’t mention them, since in later posts I will borrow at least one idea from its playbook and improve upon it with intrinsics…
Array.Sort
isn’t QuickSort
; it is a variation on it called Introspective Sort, invented by David Musser in 1997. What it roughly does is combine QuickSort, HeapSort, and Insertion Sort, switching dynamically between them according to the recursion depth and the partition size. This last trick, where it switches to “Insertion Sort” on small partitions, is critical, both for the general case and for intrinsics/vectorization. It is beneficial because it replaces (up to) the last 4 levels of recursion (for partition sizes <= 16) with a single call to an insertion sort, trading the overhead associated with recursion for simpler loop-based code that ultimately runs slightly faster on such small partitions.
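A rough sketch of that dispatch logic follows, for illustration only: the 16-element threshold matches the text above, the depth limit is Musser’s original 2·⌊log₂(n)⌋, and none of this is CoreCLR’s actual code.

```csharp
using System;
using System.Linq;

static class IntroSortSketch
{
    const int SmallSortThreshold = 16; // matches the partition size mentioned above

    public static void Sort(int[] items)
    {
        if (items.Length < 2) return;
        // Musser's depth limit: past this many levels, QuickSort is degenerating
        int depthLimit = 2 * (int)Math.Log2(items.Length);
        Sort(items, 0, items.Length - 1, depthLimit);
    }

    static void Sort(int[] items, int lo, int hi, int depthLimit)
    {
        if (hi - lo + 1 <= SmallSortThreshold)
        {
            InsertionSort(items, lo, hi); // small partition: skip the recursion
            return;
        }
        if (depthLimit == 0)
        {
            HeapSort(items, lo, hi); // pathological input: HeapSort guarantees O(n log n)
            return;
        }
        // Median-of-three pivot selection + Hoare-style partition, as shown earlier
        int mid = lo + ((hi - lo) / 2);
        SwapIfGreater(items, lo, mid);
        SwapIfGreater(items, lo, hi);
        SwapIfGreater(items, mid, hi);
        int pivot = items[mid];
        int i = lo, j = hi;
        while (i <= j)
        {
            while (items[i] < pivot) i++;
            while (items[j] > pivot) j--;
            if (i <= j) { (items[i], items[j]) = (items[j], items[i]); i++; j--; }
        }
        Sort(items, lo, j, depthLimit - 1);
        Sort(items, i, hi, depthLimit - 1);
    }

    static void InsertionSort(int[] a, int lo, int hi)
    {
        for (int i = lo + 1; i <= hi; i++)
        {
            int v = a[i], j = i - 1;
            while (j >= lo && a[j] > v) { a[j + 1] = a[j]; j--; }
            a[j + 1] = v;
        }
    }

    static void HeapSort(int[] a, int lo, int hi)
    {
        int n = hi - lo + 1;
        for (int i = n / 2 - 1; i >= 0; i--) SiftDown(a, lo, i, n);
        for (int end = n - 1; end > 0; end--)
        {
            (a[lo], a[lo + end]) = (a[lo + end], a[lo]);
            SiftDown(a, lo, 0, end);
        }
    }

    static void SiftDown(int[] a, int lo, int root, int n)
    {
        while (true)
        {
            int child = 2 * root + 1;
            if (child >= n) return;
            if (child + 1 < n && a[lo + child] < a[lo + child + 1]) child++;
            if (a[lo + root] >= a[lo + child]) return;
            (a[lo + root], a[lo + child]) = (a[lo + child], a[lo + root]);
            root = child;
        }
    }

    static void SwapIfGreater(int[] a, int i, int j)
    {
        if (a[i] > a[j]) (a[i], a[j]) = (a[j], a[i]);
    }
}

class Program
{
    static void Main()
    {
        var rnd = new Random(666);
        var a = Enumerable.Range(0, 10_000).Select(_ => rnd.Next()).ToArray();
        var expected = (int[])a.Clone();
        Array.Sort(expected);
        IntroSortSketch.Sort(a);
        Console.WriteLine(a.SequenceEqual(expected) ? "sorted" : "FAILED");
    }
}
```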
As mentioned, I ended up borrowing this idea for my code as the issues around smaller partition sizes are exacerbated when using vectorized intrinsics in the following posts.
Comparing Scalar Variants
With all this new information, this is a good time to measure how a couple of different scalar (i.e. non-vectorized) versions compare to Array.Sort. I’ll show some results generated using BenchmarkDotNet (BDN) with:

 Array.Sort() as the baseline.
 Scalar as the code I’ve just presented above. This version uses regular/safe C#; every time we access an array element, the JIT inserts bounds-check machine code around the actual access, ensuring the CPU does not read/write outside the memory region owned by the array.
 ScalarUnmanaged as an alternative/faster version of Scalar, where:
   The code uses native pointers and unsafe semantics (using C#’s new unmanaged constraint, neat!).
   We switch to InsertionSort (again, copy-pasted from CoreCLR) when below 16 elements, just like Array.Sort does.
I’ve prepared this last version to show that with unsafe code + InsertionSort, we can remove most of the performance gap between C# and C++ for this type of code, a gap that mainly stems from bounds-checking the JIT cannot elide for these sorts of random-access patterns.
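To make the bounds-checking discussion more concrete, here is a sketch of what a pointer-based InsertionSort can look like. This is illustrative code in the spirit of the Unmanaged variant, not the actual benchmarked implementation, and it requires compiling with unsafe code enabled (`<AllowUnsafeBlocks>true</AllowUnsafeBlocks>`):

```csharp
using System;

static class UnsafeSorts
{
    // In the safe version, every items[j] access carries a JIT-inserted
    // bounds check; with raw pointers, those checks disappear entirely,
    // at the cost of giving up the safety net.
    public static unsafe void InsertionSort(int* left, int* right)
    {
        for (int* i = left + 1; i <= right; i++)
        {
            int v = *i;
            int* j = i - 1;
            while (j >= left && *j > v)
            {
                *(j + 1) = *j;
                j--;
            }
            *(j + 1) = v;
        }
    }
}

class Program
{
    static unsafe void Main()
    {
        var a = new[] { 5, 3, 8, 1, 9, 2, 7 };
        // Pin the array so the GC can't move it while we hold raw pointers
        fixed (int* p = a)
            UnsafeSorts.InsertionSort(p, p + a.Length - 1);
        Console.WriteLine(string.Join(",", a));
    }
}
```

Note the `fixed` statement: raw pointers into a managed array are only valid while the array is pinned.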
Note that for this series, we’ll benchmark each sorting method with various array sizes (BDN parameter: N): 100, 1,000, 10,000, 100,000, 1,000,000, 10,000,000. I’ve also added a custom column to the BDN report: Time / N. This represents the time spent sorting per element in the array and, as such, is very useful for comparing results on a more uniform scale.
Here are the results, in the form of charts showing: (1) the ratio (scaling) of the various implementations being compared, (2) the time spent sorting a single element in an array of N elements (Time / N), and finally (3) various BDN results in table form, if you’re more into tables:

| Method    | N        | Mean (µs)   | Time / N (ns) | Ratio |
|-----------|----------|-------------|---------------|-------|
| ArraySort | 100      | 1.211       | 12.1123       | 1.00  |
| Scalar    | 100      | 2.474       | 24.7385       | 2.04  |
| Unmanaged | 100      | 2.110       | 21.0955       | 1.75  |
| ArraySort | 1000     | 30.546      | 30.5461       | 1.00  |
| Scalar    | 1000     | 47.880      | 47.8796       | 1.57  |
| Unmanaged | 1000     | 30.969      | 30.9692       | 1.01  |
| ArraySort | 10000    | 546.410     | 54.6410       | 1.00  |
| Scalar    | 10000    | 727.528     | 72.7528       | 1.33  |
| Unmanaged | 10000    | 543.112     | 54.3112       | 0.99  |
| ArraySort | 100000   | 6,048.737   | 60.4874       | 1.00  |
| Scalar    | 100000   | 6,774.185   | 67.7419       | 1.12  |
| Unmanaged | 100000   | 5,895.774   | 58.9577       | 0.97  |
| ArraySort | 1000000  | 70,753.876  | 70.7539       | 1.00  |
| Scalar    | 1000000  | 77,390.590  | 77.3906       | 1.09  |
| Unmanaged | 1000000  | 65,722.162  | 65.7222       | 0.93  |
| ArraySort | 10000000 | 808,430.600 | 80.8431       | 1.00  |
| Scalar    | 10000000 | 897,593.437 | 89.7593       | 1.11  |
| Unmanaged | 10000000 | 768,630.685 | 76.8631       | 0.95  |
Surprisingly^{2}, the unmanaged C# version runs slightly faster than Array.Sort, but with one caveat: it only outperforms the C++ version for large inputs. Otherwise, everything is as expected: the purely Scalar variant is just slow, and the Unmanaged one is mostly on par with Array.Sort. Again, these C# implementations were written to verify that we can reach Array.Sort-like performance in C#, and they do just that. I don’t know about you, but I did not wake up this morning to get Array.Sort running 5% faster; I want it much faster. Another side benefit of having these implementations is that we can sprinkle statistics-collecting magic dust on them and obtain high-level sorting statistics that will assist us in deciphering and comparing future results and implementations. Let’s show what we have collected for both the Scalar and Unmanaged variants and describe what each statistic tells us:

Scalar:

| Stat                   | 100 | 1000  | 10000  | 100000    | 1000000    | 10000000    |
|------------------------|-----|-------|--------|-----------|------------|-------------|
| NumPartitionOperations | 33  | 342   | 3,428  | 34,285    | 342,812    | 3,428,258   |
| NumSmallSorts          | 0   | 0     | 0      | 0         | 0          | 0           |
| SmallSortSize          | N/A | N/A   | N/A    | N/A       | N/A        | N/A         |
| NumScalarCompares      | 422 | 6,559 | 89,198 | 1,128,145 | 13,698,171 | 155,534,152 |
Unmanaged:

| Stat                   | 100 | 1000  | 10000   | 100000    | 1000000    | 10000000    |
|------------------------|-----|-------|---------|-----------|------------|-------------|
| NumPartitionOperations | 8   | 94    | 951     | 9,521     | 95,214     | 952,250     |
| NumSmallSorts          | 8   | 87    | 873     | 8,739     | 87,359     | 873,843     |
| SmallSortSize          | 10  | 9.2   | 9.16    | 9.15      | 9.15       | 9.15        |
| NumScalarCompares      | 696 | 9,275 | 116,365 | 1,399,684 | 16,414,849 | 182,703,336 |
Here, we can see, per each N value:

 The total number of partitioning operations performed.
 The total number of small-sort operations performed (e.g., InsertionSort for the Unmanaged variant).
 The average size of each small-sort operation (e.g., how many elements InsertionSort sorted per invocation).
 The number of scalar comparison+branch operations that were data-dependent; this one is a bit hard to relate to at first, but it becomes apparent why it is a crucial number to track in the 3^{rd} post. For now, we’ll simply define it: this statistic counts how many times our code did an if or a while or any other branch whose condition depended on unsorted data!
It’s easy to see the differences in collected statistics between both versions: the number of partition operations changes once we introduce InsertionSort, and so does the number of data-dependent scalar compares. Some of these statistics remain pretty much the same, regardless of what we do next in future versions, while others change radically; we’ll observe and make use of these as key inputs in helping us figure out how/why something worked, or not!
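As an illustration of how such statistics might be gathered (a hypothetical sketch, not the author’s actual instrumentation), we can thread counters through the scalar partition code and bump one on every data-dependent comparison:

```csharp
using System;

// Hypothetical statistics sink; names mirror the tables above.
static class Stats
{
    public static long NumPartitionOperations;
    public static long NumScalarCompares;
}

static class InstrumentedSort
{
    // Every element comparison whose outcome depends on the (unsorted)
    // data is funneled through this helper, which counts it.
    static bool Count(bool compareResult)
    {
        Stats.NumScalarCompares++;
        return compareResult;
    }

    public static int Partition(int[] array, int pivot, int left, int right)
    {
        Stats.NumPartitionOperations++;
        while (left <= right)
        {
            while (Count(array[left] < pivot)) left++;
            while (Count(array[right] > pivot)) right--;
            if (left <= right)
            {
                var t = array[left];
                array[left++]  = array[right];
                array[right--] = t;
            }
        }
        return left;
    }
}

class Program
{
    static void Main()
    {
        var a = new[] { 3, 1, 2 };
        InstrumentedSort.Partition(a, pivot: 2, left: 0, right: a.Length - 1);
        Console.WriteLine($"partitions={Stats.NumPartitionOperations} compares={Stats.NumScalarCompares}");
    }
}
```

Even this tiny 3-element partition performs 5 data-dependent compares, which hints at why this counter grows so quickly in the tables above.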
All Warmed Up
We’ve spent quite some time polishing our foundations concerning QuickSort. I know lengthy introductions are somewhat dull, but I think the time spent on this post will pay off when we next encounter our actual implementation in the 3^{rd} post.
Before that, we need to pick up some knowledge that is specific to vectorized intrinsics and introduce a few select intrinsics we’ll be using, so, this is an excellent time to break off this post, grab a fresh cup of coffee and head to the next post.

Which is increasingly taking more and more time to happen, due to Dennard scaling and the slowdown of Moore’s law… ↩

Believe it or not, I pretty much wrote every other version featured in this series before I wrote the Unmanaged one, so I really was quite surprised that it ended up being slightly faster than Array.Sort ↩