CoreCLR 3.0 Intrinsics

Guide

When you see the following thingy, you can touch/hover above it to trigger animations that are part of this deck:

Why You’re Here

From: "Computer Architecture: A Quantitative Approach, 6^th Edition

So: Here is the last 40 years of single threaded performance improvemnets. After a first happy couple of decades, at 2003, we’re start seeing an ever increasing slowdown in this rate, even thogh transistor density has been doubling all the way till 2015 until we reach our current time, which is the dark ages at 3.5% per year.

Now, no one knows for sure what the future holds, But I think we can all agree that the present sucks.

Dennard observes that transistor dimensions are scaled by 30% (0.7x) every technology generation, thus reducing their area by 50%. This reduces the delay by 30% (0.7x) and therefore increases operating frequency by about 40% (1.4x). Finally, to keep the electric field constant, voltage is reduced by 30%, reducing energy by 65% and power (at 1.4x frequency) by 50%.[note 1] Therefore, in every technology generation the transistor density doubles, the circuit becomes 40% faster, and power consumption (with twice the number of transistors) stays the same.
Amdahl’s law can be formulated in the following way: ${\displaystyle S_{\text{latency}}(s)={\frac {1}{(1-p)+{\frac {p}{s}}}}}$

where:
- $S_{\text{latency}}$ is the theoretical speedup of the execution of the whole task;
- s is the speedup of the part of the task that benefits from improved system resources;
- p is the proportion of execution time that the part benefiting from improved resources originally occupied.

"The reason processor performance is sub-linear with transistor count is [because] it's limited by unpredictability: Branch predictability, Data-predictability, Instruction predictability."

Jim Keller From: Moore’s Law is Not Dead

Branch Prediction

A single core executes hundreds of instructions at any given moment…
To do this, CPUs guess the target of branches!
- It usually does a good job
But when it fails we get penalized
- ~15 cycles for a single mis-prediction!

Now that I’ve got you ~~scared~~ motivated enough…

Let’s get busy!

Intrinsics in English

A way to directly embed specific CPU instructions via special, fake method calls that the JIT replaces at code-generation time

Used to expose processor functionality that doesn’t map well to the language:

Atomic operations

System-Programming (e.g. kernel mode)

Crypto instructions

Niche instructions

Vectorization

- Instructions that work on vectors

Vectorization 101

Usually 1-3 CPU cycles!

Math is hard, let’s go sorting!

I needed to sort numbers, and really fast.

"Eh, I'll rewrite Array.Sort^*. QuickSort. With intrinsics... How bad could it be?"

6 months later, and I'm still at it...

But,

It’s 9x faster

Why QuickSort?

Universally known
Non-trivial use of intrinsics
Pretty close to Array.Sort^*

Refresher

QuickSort uses a divide-and-conquer approach
- It’s recursive
Has average O(n log n) comparisons for n items
Performs an in-place sort

Pick a pivot value
Partition the array around the pivot value
Recurse on the left side of the pivot
Recurse on the right side of the pivot

int Partition(int[] array, int pivot, int left, int right)
{
    while (left <= right) {
        while (array[left] < pivot) left++;
        while (array[right] > pivot) right--;

        if (left <= right) {
            var t = array[left];
            array[left++]  = array[right];
            array[right--] = t;
        }
    }
    return left;
}

Branches, Branches Everywhere! 👎 Unpredictable 👎 👍 Predictable 👍

By the numbers

Stat	100	1K	10K	100K	1M	10M
Max Depth	8	14	21	28	35	42
# Partitions	33	342	3,428	34,285	342,812	3,428,258
# Comparisons	422	6,559	89,198	1,128,145	13,698,171	155,534,152

Plan

Redo Partition with vectorized intrinsics.

What intrinsics do we use?
How do they work?

How?

How can an instruction operate on a vector?

Does it operate directly on memory?

Generally: No!

Vectors Types / Registers

These intructions operate on special vector types that are supported at the CPU level: registers

Vector registers have constant size (in bits).

SIMD registers in CoreCLR 3.0

C# vectorized intrinsics accept and return these types:

`CoreCLR`	`Intel`
`Vector64<T>`	`mm0-mm7`
`Vector128<T>`	`xmm0-xmm15`
`Vector256<T>`	`ymm0-ymm15`

Where T is some primitive type.

Example:

Vector256<T> can be:

`byte / sbyte`	⮚	32 x 8b	== 256b
`short / ushort`	⮚	16 x 16b	== 256b
`int / uint`	⮚	8 x 32b	== 256b
`long / ulong`	⮚	4 x 64b	== 256b
`float`	⮚	8 x 32b	== 256b
`double`	⮚	4 x 64b	== 256b

Vectorized Partition Block

We're going to partition 8 x ints at a time

inside a Vector256

Load ⮚ Compare ⮚ Permute ⮚ Store

With no branching(!)

Now what?

mask tells us which element goes where!
We could loop over the bits in the mask
- Back to square one: 8-branches
I did not fly all the way to Moscow for this!

Famous Quote

Give me a lever long enough and a fulcrum on which to place it, and I shall move the world

- Synagoge, Book VIII, 340 A.D.

Less Famous Quote

Give me vectorized intrinsics and a large enough look-up table, and I can make anything 4x faster

- Intel Software Optimization Guide, 2003 A.D.

Permutation Tables

There are 256 possible mask values (2⁸)
We can precompute all permutations in a table
Each permutation entry will provide the correct order for a given mask
The table is simply part of the source code

8-Way Permute

C#:

Vector256<int> data, perm;
Vector256<int> result = Avx2.PermuteVar8x32(data, perm);

asm:

vpermd ymm1, ymm2, ymm1 ; 3 cycle latency, 1 cycle throughput

static int[] PermTable => new[] {
    0, 1, 2, 3, 4, 5, 6, 7,     // 0   => 0b00000000
    // ...
    3, 4, 5, 6, 7, 0, 1, 2,     // 7   => 0b00000111
    // ...
    0, 2, 4, 6, 1, 3, 5, 7,     // 170 => 0b10101010
    // ...
    0, 1, 2, 3, 4, 5, 6, 7,     // 255 => 0b11111111
};

Everything stays in place Move 3 from left to right 4/4 split

var P = Vector256.Create(pivot);
...
var current = Avx2.LoadDquVector256(nextPtr);
var mask = (uint) Avx.MoveMask(
    Avx2.CompareGreaterThan(current, P).AsSingle()));
current = Avx2.PermuteVar8x32(current,
    LoadDquVector256(PermTablePtr + mask * 8));
Avx.Store(writeLeft, current);
Avx.Store(writeRight, current);
var popCount = PopCnt.PopCount(mask);
writeRight -= popCount;
writeLeft  += 8 - popCount;

We generate a vectorized pivot, once per partition Load 8 elements from somewhere. Compare to pivot, cast to Vector256<float> (because ¯\(ツ)/¯) Generate an 8-bit mask from the comparison result Load permutation vector from table Permute data (partition) Store 8 elements to the left. Store 8 elements to the right. Count 1 bits ⮚ How many are elemenets are > than pivot. Advance right by popCount. Advance left by 8 - popCount. 8.5 cycle throughput

OK, So now the vector is partitioned, what’s next?

stackalloc to the rescue

We “cheat” just a bit: ¯\(ツ)/¯
stackalloc Vector256<int>[3]
Total temp memory: 96 bytes
- Constant
- No matter how much we sort

Outer loop

Yeah yeah, are we fast yet?

What’s the Speed Limit?

There’s tons of stuff we could still do:

Inspect MSIL
Inspect asm code
Use HW counters
Vectorization Tweaks
Special code for small arrays

MSIL

C# compiler uses definite assignment analysis
- CS0165: Use of unassigned local variable…
But tells JIT to initialize locals regardless
a.k.a .locals init

  .method private hidebysig static int32*
    VectorizedPartitionInPlace(
      int32* left,
      int32* right,
      int32* tmp
    ) cil managed
  {
    .maxstack 3
    .locals init (
      [0] int32 pivot,
      ...
      [22] int32 v,
      [23] valuetype [System.Runtime]System.ReadOnlySpan`1<int32> V_23
    )

Why?

There’s a bunch of ways to remove this
I chose Fody.LocalsInit
by @Lucas_Trz
- Checkout Fody!

5min hack

2-3% Improvement

Inspecting ASM

while (readRight >= readLeft) {
    int *nextPtr;
    if (readLeft   - writeLeft <=
        writeRight - readRight) {
        nextPtr = readLeft;
        readLeft += 8;
    } else {
        nextPtr = readRight;
        readRight -= 8;
    }
    var current = Avx.LoadDquVector256(nextPtr);
    //...
}

Anything left to partition? Which side is closer to being overwritten? Pick left Pick right

ASM Code

mov rcx, rdi    ; rdi -> readLeft
sub rcx, r12    ; r12 -> writeLeft
mov r8, rcx
sar r8, 0x3f
and r8, 0x3
add rcx, r8
sar rcx, 0x2
mov r9, rsi     ; rsi -> writeRight
sub r9, r13     ; r13 -> readRight
mov r10, r9
sar r10, 0x3f
and r10, 0x3
add r9, r10
sar r9, 0x2
cmp rcx, r9

Load, Subtract… Compare What the…?

Look at all that arithmetic… Why?

The JIT is not optimizing the int pointer math when comparing differences...

We can get past this!

Before:

if (readLeft   - writeLeft <=
    writeRight - readRight) {
    // ...
} else {
    // ...
}

After:

if ((byte *) readLeft   - (byte *) writeLeft) <=
    (byte *) writeRight - (byte *) readRight)) {
    // ...
} else {
    // ...
}

Much better

mov rcx, rdi    ; rdi -> readLeft
sub rcx, r12    ; r12 -> writeLeft
mov r9, rsi     ; rsi -> writeRight
sub r9, r13     ; r13 -> readRight
cmp rcx, r9

Ugly, But Effective

6-9% Improvement!

HW Counters

CPUs have HW counters that give stats!

$ perf stat -a --topdown ...
 Performance counter stats for 'system wide':
        retiring   bad speculation   frontend bound  backend bound
S0-C3 1    37.6%             37.3%            16.9%      13.2%

Almost 40% of all branches are mis-predicted

Bad Speculation is Bad

That “pick a side” logic is super bad for perf
- We snuck in branch unpredictability!
- 100% random data ⮚ Flip a coin
- Every mis-prediction costs us 15 cycles
- Remember our block is 8 cycles!
- We’re doing nothing 50% of the time!

Branch ⮚ Arithmetic

What if we can make the CPU run the same code, for both branches?

Turn the branch into a data dependency!

Before:

int x, y;
if (x > y) {
    ptr += 8;
}

After:

int x, y;
// + => 0, - => 0xFFFFFFFF
var condAsMask = (y - x) >> 31;
//Always perform the addition, sometimes with 0!
ptr += 8 & condAsMask;

Poor-man’s CMOV

This is a age-old technique of replacing badly speculated branches with simple arithmetic.

It only works for simple branches.

CoreCLR will hopefully learn to cmov in the future.

Unroll the code

Change the ratio between work/branches predicted
CPU prediction will continue to suck
- But once every N partitioning operations
We need to allocate more temporary space:
- 2*N*Vector256<int>
- 2*4*8*4 == 256 bytes

while (readRight >= readLeft) {
    int *nextPtr;
    if (readLeft   - writeLeft <=
        writeRight - readRight) {
        nextPtr = readLeft;
        readLeft += 8*4;
    } else {
        nextPtr = readRight;
        readRight -= 8*4;
    }
    var L0 = LoadDquVector256(nextPtr + 0*8);
    var L1 = LoadDquVector256(nextPtr + 1*8);
    var L2 = LoadDquVector256(nextPtr + 2*8);
    var L3 = LoadDquVector256(nextPtr + 3*8);

    var m0 = (uint) MoveMask(CompareGreaterThan(L0, P).AsSingle());
    var m1 = (uint) MoveMask(CompareGreaterThan(L1, P).AsSingle());
    var m2 = (uint) MoveMask(CompareGreaterThan(L2, P).AsSingle());
    var m3 = (uint) MoveMask(CompareGreaterThan(L3, P).AsSingle());

    L0 = PermuteVar8x32(L0, GetIntPermutation(pBase, m0));
    L1 = PermuteVar8x32(L1, GetIntPermutation(pBase, m1));
    L2 = PermuteVar8x32(L2, GetIntPermutation(pBase, m2));
    L3 = PermuteVar8x32(L3, GetIntPermutation(pBase, m3));

    var pc0 = PopCount(m0);
    var pc1 = PopCount(m1);
    var pc2 = PopCount(m2);
    var pc3 = PopCount(m3);

    Store(writeRight, L0);
    writeRight -= pc0;
    Store(writeRight, L1);
    writeRight -= pc1;
    Store(writeRight, L2);
    writeRight -= pc2;
    Store(writeRight, L3);
    writeRight -= pc3;

    pc0 = 8 - pc0;
    pc1 = 8 - pc1;
    pc2 = 8 - pc2;
    pc3 = 8 - pc3;

    Store(writeLeft, L0);
    writeLeft +=  pc0);
    Store(writeLeft, L1);
    writeLeft += pc1);
    Store(writeLeft, L2);
    writeLeft += pc2);
    Store(writeLeft, L3);
    writeLeft += pc3);
}

Can you spot the difference? Load x4 Compare+MoveMask x4 Permute x4 PopCount x4 Store on the right x4 8 - PopCount x4 Store Left x4

Show me the money!

Do we have less branch mis-predictions now?
- Not really!
- (Not in percent)
- But less branches in total
Does it run faster though?

Sure Does!

20-30% Improvement!

Vectorization Tweaks

Let’s kill two perf hogs with one crazy opt!

But what are we fixing?

The Remainder problem

All vectorized code needs to deal with remainders.

length % 8 != 0

Between 1-7 elements, handled with pure scalar code!

Cacheline Boundaries

When reading memory, we really read from cache.

Single int ⮚ full cacheline: 64 bytes

What if our data is across two cachelines?

The CPU needs to ask for 2(!) cache-lines

Alignment

Normally, we shouldn’t care!

The JIT & allocator work to minimize this!
- “Stuff” is aligned to pointer size: 4/8 bytes
- When not: ⁴⁄₆₄ ⮚ 6.25% rate
  of cross-cacheline reads
But what about Vector256? (32-bytes)
Is this true for partitioning?

No love!

Not a single pointer is naturally aligned to Vector256.

With partitioning, the data dictates the alignment

50% of our reads are reading 2 cachelines!!!

Two birds, one stone

Let’s fix both of these issues!

By doing more work!

The safe approach is to align inward
- Align with scalar code
- No more remainder at the end
- But this means more scalar work:
  1-7 elements on each side!
But what if we align outwards?
- It’s legal (But not trivial!)
- 100% vectorized^*!!!

Totally worth it!

10%-20% improvement

^*When in cache

Small array sorting

Another common trick is to sort very small partitions directly without quicksort.

Array.Sort uses this.

Can we? With Vectors?

Yes we can!

Bitonic Sort

Parallel Sorting Algorithm
- Used a lot in GPUs
O(n * log²(n)) comparisons
Generates 2 monotonic series
- Increasing/Decreasing
- Bitonic
No branches

Final Speedup

In summary

We can and should re-approach even on age old problems to find ways to increase the predictablity of our code by using instrinsics!

This is now available to us in C#/.NET as part of CoreCLR 3.0.

Be nice to your CPUs!