Why You’re Here
From: "Computer Architecture: A Quantitative Approach", 6th Edition:
"The reason processor performance is sub-linear with transistor count is [because] it's limited by unpredictability: branch predictability, data predictability, instruction predictability."
A single core can have hundreds of instructions in flight at any given moment…
To do this, the CPU has to guess the target of branches!
It usually does a good job
But when it fails we get penalized
~15 cycles for a single mis-prediction!
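To feel that penalty, here's a standalone sketch of mine (not from the deck; the method name, sizes and threshold are illustrative): the same branchy loop over sorted vs. shuffled data does identical work, but the shuffled run mispredicts roughly half its branches and typically measures noticeably slower.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

// NOTE: CountAbove is a made-up demo method, not part of the talk's code.
static long CountAbove(int[] data, int threshold)
{
    long count = 0;
    foreach (var v in data)
        if (v >= threshold)   // the branch the predictor has to guess
            count++;
    return count;
}

var rng      = new Random(42);
var shuffled = Enumerable.Range(0, 1_000_000).Select(_ => rng.Next(256)).ToArray();
var sorted   = (int[]) shuffled.Clone();
Array.Sort(sorted);

// Identical work and identical result; only branch predictability differs
var sw = Stopwatch.StartNew();
var c1 = CountAbove(sorted, 128);     // predictable: false,false,…,true,true
var tSorted = sw.ElapsedMilliseconds;

sw.Restart();
var c2 = CountAbove(shuffled, 128);   // unpredictable: ~50% mispredictions
var tShuffled = sw.ElapsedMilliseconds;

Console.WriteLine($"sorted: {tSorted}ms, shuffled: {tShuffled}ms, equal: {c1 == c2}");
```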
Now that I’ve got you scared motivated enough…
Let’s get busy!
Intrinsics in English
A way to directly embed specific CPU instructions via special, fake method calls that the JIT replaces at code-generation time
Used to expose processor functionality that doesn’t map well to the language:
Atomic operations
System-Programming (e.g. kernel mode)
Crypto instructions
Niche instructions
Vectorization
- Instructions that work on vectors
Vectorization 101
Usually 1-3 CPU cycles!
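As a 101-level sketch of what that buys us (assumptions: an AVX2-capable CPU, .NET Core 3.0+, and a project compiled with unsafe enabled; `Add8` is my name, not a framework API): a single vpaddd adds 8 ints at once instead of 8 scalar adds.

```csharp
using System;
using System.Runtime.Intrinsics.X86;

static unsafe int[] Add8(int[] a, int[] b)
{
    var result = new int[8];
    fixed (int* pa = a, pb = b, pr = result)
    {
        // One vpaddd processes all 8 lanes in one go
        var sum = Avx2.Add(Avx.LoadVector256(pa), Avx.LoadVector256(pb));
        Avx.Store(pr, sum);
    }
    return result;
}

if (Avx2.IsSupported)
    Console.WriteLine(string.Join(",",
        Add8(new[] { 1, 2, 3, 4, 5, 6, 7, 8 },
             new[] { 10, 20, 30, 40, 50, 60, 70, 80 })));
```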
Math is hard, let’s go sorting!
I needed to sort numbers, and really fast.
"Eh, I'll rewrite Array.Sort* QuickSort with intrinsics... How bad could it be?"
5 months later, and I'm still at it...
But,
It’s 6x faster
In this talk: 2.7x
Why QuickSort?
Universally known
Non-trivial use of intrinsics
Pretty close to Array.Sort*
Refresher
QuickSort uses a divide-and-conquer approach
It’s recursive
Has average O(n log n) comparisons for n items
Performs an in-place sort
Pick a pivot value
Partition the array around the pivot value
Recurse on the left side of the pivot
Recurse on the right side of the pivot
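The four steps above, as a minimal scalar sketch (my simplification: the pivot is just the middle element, which is not how Array.Sort actually picks it):

```csharp
using System;

static void QuickSort(int[] array, int left, int right)
{
    if (left >= right) return;                      // 0 or 1 elements: done
    var pivot = array[(left + right) / 2];          // 1. pick a pivot value
    var mid = Partition(array, pivot, left, right); // 2. partition around it
    QuickSort(array, left, mid - 1);                // 3. recurse on the left
    QuickSort(array, mid, right);                   // 4. recurse on the right
}

static int Partition(int[] array, int pivot, int left, int right)
{
    while (left <= right) {
        while (array[left] < pivot) left++;
        while (array[right] > pivot) right--;
        if (left <= right) {
            (array[left], array[right]) = (array[right], array[left]);
            left++; right--;
        }
    }
    return left;
}

var data = new[] { 3, 7, 1, 9, 2, 8, 5, 4, 6, 0 };
QuickSort(data, 0, data.Length - 1);
Console.WriteLine(string.Join(",", data));
```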
Visualizing QuickSort
To better grasp how and why it works, we’ll use visualizations made by @mbostock.
By the numbers
Stat          |  100 |    1K |    10K |      100K |         1M |         10M
Max Depth     |    8 |    14 |     21 |        28 |         35 |          42
# Partitions  |   33 |   342 |  3,428 |    34,285 |    342,812 |   3,428,258
# Comparisons |  422 | 6,559 | 89,198 | 1,128,145 | 13,698,171 | 155,534,152
int Partition(int[] array, int pivot, int left, int right)
{
    while (left <= right) {
        // Scan inwards for a pair of out-of-place elements
        while (array[left]  < pivot) left++;
        while (array[right] > pivot) right--;

        if (left <= right) {
            // Swap them and keep going
            var t = array[left];
            array[left++]  = array[right];
            array[right--] = t;
        }
    }
    return left; // first index of the right-hand partition
}
Everything stays in place
Move 3 from left to right
4/4 split
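For example (my toy input, not the deck's): partitioning 8 elements around a pivot of 5 with the Partition method above, copied into the snippet so it runs standalone.

```csharp
using System;

// Partition as shown above, repeated here to make the demo self-contained
static int Partition(int[] array, int pivot, int left, int right)
{
    while (left <= right) {
        while (array[left] < pivot) left++;
        while (array[right] > pivot) right--;
        if (left <= right) {
            var t = array[left];
            array[left++] = array[right];
            array[right--] = t;
        }
    }
    return left;
}

var data = new[] { 5, 9, 1, 7, 3, 8, 2, 6 };
var boundary = Partition(data, pivot: 5, left: 0, right: data.Length - 1);

// boundary == 3: data[0..2] are < 5, data[3..7] are >= 5, all in place
Console.WriteLine($"{boundary}: [{string.Join(",", data)}]");
```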
var P = Vector256.Create(pivot);
...
var current = Avx2.LoadDquVector256(nextPtr);
var mask = (uint) Avx.MoveMask(
    Avx2.CompareGreaterThan(current, P).AsSingle());
current = Avx2.PermuteVar8x32(current,
    Avx2.LoadDquVector256(PermTablePtr + mask * 8));
Avx.Store(writeLeft, current);
Avx.Store(writeRight, current);
var popCount = Popcnt.PopCount(mask);
writeRight -= popCount;
writeLeft += 8 - popCount;
We generate a vectorized pivot, once per partition
Load 8 elements from somewhere
Compare to pivot, cast to Vector256<float> (because ¯\_(ツ)_/¯)
Generate an 8-bit mask from the comparison result
Load a permutation vector from the table (next slides!)
Permute the data (partition)
Store 8 elements to the left
Store 8 elements to the right
Count 1 bits ⮚ how many elements are > pivot
Advance right by popCount
Advance left by 8 - popCount
8.5 cycle throughput
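To demystify the table lookup, here's a scalar model of mine (illustrative only: the real table stores 8 ints per mask at PermTablePtr + mask * 8, and its actual lane ordering within each half may differ). Each entry is a dest→source mapping, which is what PermuteVar8x32 consumes to gather ≤-pivot lanes to the front and >-pivot lanes to the back.

```csharp
using System;
using System.Linq;

// Model of one permutation-table entry for an 8-bit comparison mask
// (bit i set == lane i is greater than the pivot). entry[dest] = source,
// matching PermuteVar8x32's dest -> source convention.
static int[] PermutationEntry(uint mask)
{
    var entry = new int[8];
    int lo = 0, hi = 7;
    for (var lane = 0; lane < 8; lane++)
        if ((mask & (1u << lane)) == 0)
            entry[lo++] = lane;  // <= pivot: next slot from the left
        else
            entry[hi--] = lane;  // >  pivot: next slot from the right
    return entry;
}

// Scalar equivalent of PermuteVar8x32(data, entry)
static int[] Permute(int[] data, int[] entry) =>
    entry.Select(src => data[src]).ToArray();

var data = new[] { 5, 9, 1, 7, 3, 8, 2, 6 };  // pivot = 5
var mask = 0b10101010u;                       // odd lanes hold values > 5
var partitioned = Permute(data, PermutationEntry(mask));

// partitioned == { 5, 1, 3, 2, 6, 8, 7, 9 }: 4 lanes <= 5, then 4 lanes > 5
Console.WriteLine(string.Join(",", partitioned));
```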
OK, So now the vector is partitioned, what’s next?
The outer loop
The idea is that we handle 8 ints in every iteration
Then write the full 8-element vector to both the left + right sides
How can we keep everything in place?
Remember allocation is bad
stackalloc to the rescue
We “cheat” just a bit: ¯\(ツ)/¯
stackalloc Vector256<int>[2]
Total temp memory: 64 bytes
Constant
No matter how much we sort
Outer loop
Yeah yeah, are we fast yet?
In summary
We can and should re-approach even age-old problems to find ways
to increase the predictability of our code by using intrinsics!
This is now available to us in C#/.NET as part of CoreCLR 3.0.