Why You’re Here
From: "Computer Architecture: A Quantitative Approach", 6th Edition:
"The reason processor performance is sub-linear with transistor count is [because] it's limited by unpredictability: branch predictability, data predictability, instruction predictability."
A single core can have hundreds of instructions in flight at any given moment…
To do this, the CPU has to guess the target of branches!
It usually does a good job
But when it fails we get penalized
~15 cycles for a single mis-prediction!
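To feel that penalty, here's a standalone sketch of mine (not from the deck; the method name, sizes and threshold are illustrative): the same branchy loop over sorted vs. shuffled data does identical work, but the shuffled run mispredicts roughly half its branches and typically measures noticeably slower.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

// NOTE: CountAbove is a made-up demo method, not part of the talk's code.
static long CountAbove(int[] data, int threshold)
{
    long count = 0;
    foreach (var v in data)
        if (v >= threshold)   // the branch the predictor has to guess
            count++;
    return count;
}

var rng      = new Random(42);
var shuffled = Enumerable.Range(0, 1_000_000).Select(_ => rng.Next(256)).ToArray();
var sorted   = (int[]) shuffled.Clone();
Array.Sort(sorted);

// Identical work and identical result; only branch predictability differs
var sw = Stopwatch.StartNew();
var c1 = CountAbove(sorted, 128);     // predictable: false,false,…,true,true
var tSorted = sw.ElapsedMilliseconds;

sw.Restart();
var c2 = CountAbove(shuffled, 128);   // unpredictable: ~50% mispredictions
var tShuffled = sw.ElapsedMilliseconds;

Console.WriteLine($"sorted: {tSorted}ms, shuffled: {tShuffled}ms, equal: {c1 == c2}");
```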
Now that I’ve got you scared motivated enough…
Let’s get busy!
Intrinsics in English
A way to directly embed specific CPU instructions via special, fake method calls that the JIT replaces at code-generation time
Used to expose processor functionality that doesn’t map well to the language:
Atomic operations
System-Programming (e.g. kernel mode)
Crypto instructions
Niche instructions
Vectorization
- Instructions that work on vectors
Vectorization 101
Usually 1-3 CPU cycles!
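As a 101-level sketch of what that buys us (assumptions: an AVX2-capable CPU, .NET Core 3.0+, and a project compiled with unsafe enabled; `Add8` is my name, not a framework API): a single vpaddd adds 8 ints at once instead of 8 scalar adds.

```csharp
using System;
using System.Runtime.Intrinsics.X86;

static unsafe int[] Add8(int[] a, int[] b)
{
    var result = new int[8];
    fixed (int* pa = a, pb = b, pr = result)
    {
        // One vpaddd processes all 8 lanes in one go
        var sum = Avx2.Add(Avx.LoadVector256(pa), Avx.LoadVector256(pb));
        Avx.Store(pr, sum);
    }
    return result;
}

if (Avx2.IsSupported)
    Console.WriteLine(string.Join(",",
        Add8(new[] { 1, 2, 3, 4, 5, 6, 7, 8 },
             new[] { 10, 20, 30, 40, 50, 60, 70, 80 })));
```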
Math is hard, let’s go sorting!
I needed to sort numbers, and really fast.
"Eh, I'll rewrite Array.Sort* QuickSort with intrinsics... How bad could it be?"
5 months later, and I'm still at it...
But,
It’s 6x faster
In this talk: 2.7x
Why QuickSort?
Universally known
Non-trivial use of intrinsics
Pretty close to Array.Sort*
Refresher
QuickSort uses a divide-and-conquer approach
It’s recursive
Has average O(n log n) comparisons for n items
Performs an in-place sort
Pick a pivot value
Partition the array around the pivot value
Recurse on the left side of the pivot
Recurse on the right side of the pivot
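The four steps above, as a minimal scalar sketch (my simplification: the pivot is just the middle element, which is not how Array.Sort actually picks it):

```csharp
using System;

static void QuickSort(int[] array, int left, int right)
{
    if (left >= right) return;                      // 0 or 1 elements: done
    var pivot = array[(left + right) / 2];          // 1. pick a pivot value
    var mid = Partition(array, pivot, left, right); // 2. partition around it
    QuickSort(array, left, mid - 1);                // 3. recurse on the left
    QuickSort(array, mid, right);                   // 4. recurse on the right
}

static int Partition(int[] array, int pivot, int left, int right)
{
    while (left <= right) {
        while (array[left] < pivot) left++;
        while (array[right] > pivot) right--;
        if (left <= right) {
            (array[left], array[right]) = (array[right], array[left]);
            left++; right--;
        }
    }
    return left;
}

var data = new[] { 3, 7, 1, 9, 2, 8, 5, 4, 6, 0 };
QuickSort(data, 0, data.Length - 1);
Console.WriteLine(string.Join(",", data));
```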
Visualizing QuickSort
To better grasp how and why it works, we’ll use visualizations made by @mbostock.
By the numbers
Stat          |  100 |    1K |    10K |      100K |         1M |         10M
Max Depth     |    8 |    14 |     21 |        28 |         35 |          42
# Partitions  |   33 |   342 |  3,428 |    34,285 |    342,812 |   3,428,258
# Comparisons |  422 | 6,559 | 89,198 | 1,128,145 | 13,698,171 | 155,534,152
int Partition(int[] array, int pivot, int left, int right)
{
    while (left <= right) {
        // Scan inwards for a pair of out-of-place elements
        while (array[left]  < pivot) left++;
        while (array[right] > pivot) right--;

        if (left <= right) {
            // Swap them and keep going
            var t = array[left];
            array[left++]  = array[right];
            array[right--] = t;
        }
    }
    return left; // first index of the right-hand partition
}
Everything stays in place
Move 3 from left to right
4/4 split
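For example (my toy input, not the deck's): partitioning 8 elements around a pivot of 5 with the Partition method above, copied into the snippet so it runs standalone.

```csharp
using System;

// Partition as shown above, repeated here to make the demo self-contained
static int Partition(int[] array, int pivot, int left, int right)
{
    while (left <= right) {
        while (array[left] < pivot) left++;
        while (array[right] > pivot) right--;
        if (left <= right) {
            var t = array[left];
            array[left++] = array[right];
            array[right--] = t;
        }
    }
    return left;
}

var data = new[] { 5, 9, 1, 7, 3, 8, 2, 6 };
var boundary = Partition(data, pivot: 5, left: 0, right: data.Length - 1);

// boundary == 3: data[0..2] are < 5, data[3..7] are >= 5, all in place
Console.WriteLine($"{boundary}: [{string.Join(",", data)}]");
```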
var P = Vector256.Create(pivot);
...
var current = Avx2.LoadDquVector256(nextPtr);
var mask = (uint) Avx.MoveMask(
    Avx2.CompareGreaterThan(current, P).AsSingle());
current = Avx2.PermuteVar8x32(current,
    Avx2.LoadDquVector256(PermTablePtr + mask * 8));
Avx.Store(writeLeft, current);
Avx.Store(writeRight, current);
var popCount = Popcnt.PopCount(mask);
writeRight -= popCount;
writeLeft += 8 - popCount;
We generate a vectorized pivot, once per partition
Load 8 elements from somewhere
Compare to pivot, cast to Vector256<float> (because ¯\_(ツ)_/¯)
Generate an 8-bit mask from the comparison result
Load a permutation vector from the table (next slides!)
Permute the data (partition)
Store 8 elements to the left
Store 8 elements to the right
Count 1 bits ⮚ how many elements are > pivot
Advance right by popCount
Advance left by 8 - popCount
8.5 cycle throughput
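To demystify the table lookup, here's a scalar model of mine (illustrative only: the real table stores 8 ints per mask at PermTablePtr + mask * 8, and its actual lane ordering within each half may differ). Each entry is a dest→source mapping, which is what PermuteVar8x32 consumes to gather ≤-pivot lanes to the front and >-pivot lanes to the back.

```csharp
using System;
using System.Linq;

// Model of one permutation-table entry for an 8-bit comparison mask
// (bit i set == lane i is greater than the pivot). entry[dest] = source,
// matching PermuteVar8x32's dest -> source convention.
static int[] PermutationEntry(uint mask)
{
    var entry = new int[8];
    int lo = 0, hi = 7;
    for (var lane = 0; lane < 8; lane++)
        if ((mask & (1u << lane)) == 0)
            entry[lo++] = lane;  // <= pivot: next slot from the left
        else
            entry[hi--] = lane;  // >  pivot: next slot from the right
    return entry;
}

// Scalar equivalent of PermuteVar8x32(data, entry)
static int[] Permute(int[] data, int[] entry) =>
    entry.Select(src => data[src]).ToArray();

var data = new[] { 5, 9, 1, 7, 3, 8, 2, 6 };  // pivot = 5
var mask = 0b10101010u;                       // odd lanes hold values > 5
var partitioned = Permute(data, PermutationEntry(mask));

// partitioned == { 5, 1, 3, 2, 6, 8, 7, 9 }: 4 lanes <= 5, then 4 lanes > 5
Console.WriteLine(string.Join(",", partitioned));
```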
OK, So now the vector is partitioned, what’s next?
The outer loop
The idea is that we handle 8 ints in every iteration
Then write the full 8-element vector to both the left + right sides
How can we keep everything in place?
Remember allocation is bad
stackalloc to the rescue
We “cheat” just a bit: ¯\(ツ)/¯
stackalloc Vector256<int>[2]
Total temp memory: 64 bytes
Constant
No matter how much we sort
Outer loop
Yeah yeah, are we fast yet?
In summary
We can and should re-approach even age-old problems to find ways
to increase the predictability of our code by using intrinsics!
This is now available to us in C#/.NET as part of CoreCLR 3.0.