CoreCLR 3.0 Intrinsics


Why You’re Here

From: "Computer Architecture: A Quantitative Approach, 6th Edition

"The reason processor performance is sub-linear with transistor count is [because] it's limited by unpredictability: Branch predictability, Data-predictability, Instruction predictability."

Jim Keller From: Moore’s Law is Not Dead

Branch Prediction

  • A single core executes hundreds of instructions at any given moment…
  • To do this, CPUs guess the target of branches!
    • It usually does a good job
  • But when it fails we get penalized
    • ~15 cycles for a single mis-prediction!

Now that I’ve got you scared… I mean, motivated enough…

Let’s get busy!

Intrinsics in English

A way to directly embed specific CPU instructions via special, fake method calls that the JIT replaces at code-generation time

Used to expose processor functionality that doesn’t map well to the language:

  • Atomic operations
  • System-Programming (e.g. kernel mode)
  • Crypto instructions
  • Niche instructions
  • Vectorization
    • Instructions that work on vectors

Vectorization 101

A single vector instruction operates on a whole vector of elements at once, usually in 1-3 CPU cycles!
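For a taste, here is a minimal sketch (values and variable names are mine, not the deck's) of adding 8 ints with a single AVX2 instruction:

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// One vpaddd instruction adds all 8 lanes at once,
// instead of 8 separate scalar additions.
Vector256<int> a   = Vector256.Create(1, 2, 3, 4, 5, 6, 7, 8);
Vector256<int> b   = Vector256.Create(10, 20, 30, 40, 50, 60, 70, 80);
Vector256<int> sum = Avx2.Add(a, b);
Console.WriteLine(sum); // <11, 22, 33, 44, 55, 66, 77, 88>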

Math is hard, let’s go sorting!

I needed to sort numbers, and really fast.

"Eh, I'll rewrite Array.Sort*. QuickSort. With intrinsics... How bad could it be?"

6 months later, and I'm still at it...

But,

It’s 9x faster

Why QuickSort?

  • Universally known
  • Non-trivial use of intrinsics
  • Pretty close to Array.Sort*

Refresher

  • QuickSort uses a divide-and-conquer approach
    • It’s recursive
  • Has average O(n log n) comparisons for n items
  • Performs an in-place sort
  1. Pick a pivot value
  2. Partition the array around the pivot value
  3. Recurse on the left side of the pivot
  4. Recurse on the right side of the pivot
// Scalar partition: returns the boundary between the two sides
int Partition(int[] array, int pivot, int left, int right)
{
    while (left <= right) {
        // Skip elements that are already on the correct side of the pivot
        while (array[left] < pivot) left++;
        while (array[right] > pivot) right--;

        if (left <= right) {
            // Swap the two out-of-place elements
            var t = array[left];
            array[left++]  = array[right];
            array[right--] = t;
        }
    }
    return left;
}
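For completeness, a hedged sketch (not shown in the deck) of the recursive driver around Partition, matching steps 1-4 above:

void QuickSort(int[] array, int left, int right)
{
    if (left >= right) return;                            // nothing left to sort
    var pivot    = array[(left + right) / 2];             // 1. pick a pivot value
    var boundary = Partition(array, pivot, left, right);  // 2. partition around it
    QuickSort(array, left, boundary - 1);                 // 3. recurse on the left side
    QuickSort(array, boundary, right);                    // 4. recurse on the right side
}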

Branches, branches everywhere! The data-dependent comparisons against the pivot are 👎 unpredictable 👎; the left <= right loop conditions are 👍 predictable 👍.

By the numbers

Stat           100    1K     10K     100K       1M          10M
Max Depth      8      14     21      28         35          42
# Partitions   33     342    3,428   34,285     342,812     3,428,258
# Comparisons  422    6,559  89,198  1,128,145  13,698,171  155,534,152

Plan

Redo Partition with vectorized intrinsics.

  • What intrinsics do we use?
  • How do they work?

How?

How can an instruction operate on a vector?

Does it operate directly on memory?

Generally: No!

Vectors Types / Registers

These instructions operate on special vector types that are supported at the CPU level: registers.

Vector registers have constant size (in bits).

SIMD registers in CoreCLR 3.0

C# vectorized intrinsics accept and return these types:

CoreCLR Intel
Vector64<T> mm0-mm7
Vector128<T> xmm0-xmm15
Vector256<T> ymm0-ymm15

Where T is some primitive type.

Example:

Vector256<T> can be:

byte / sbyte 32 x 8b == 256b
short / ushort 16 x 16b == 256b
int / uint 8 x 32b == 256b
long / ulong 4 x 64b == 256b
float 8 x 32b == 256b
double 4 x 64b == 256b
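A quick sketch confirming those element counts from C# (the output values in the comments are mine):

using System;
using System.Runtime.Intrinsics;

Console.WriteLine(Vector256<int>.Count);    // 8  (8 x 32b == 256b)
Console.WriteLine(Vector256<byte>.Count);   // 32 (32 x 8b == 256b)
Console.WriteLine(Vector256<double>.Count); // 4  (4 x 64b == 256b)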

Vectorized Partition Block

  • We're going to partition 8 x ints at a time
    • inside a Vector256
  • Load ⮚ Compare ⮚ Permute ⮚ Store
  • With no branching(!)

Now what?

  • mask tells us which element goes where!
  • We could loop over the bits in the mask
    • Back to square one: 8 branches
  • I did not fly all the way to Moscow for this!

Famous Quote

Give me a lever long enough and a fulcrum on which to place it, and I shall move the world

- Synagoge, Book VIII, 340 A.D.

Less Famous Quote

Give me vectorized intrinsics and a large enough look-up table, and I can make anything 4x faster

- Intel Software Optimization Guide, 2003 A.D.

Permutation Tables

  • There are 256 possible mask values (2⁸)
  • We can precompute all permutations in a table
  • Each permutation entry will provide the correct order for a given mask
  • The table is simply part of the source code

8-Way Permute

C#:

Vector256<int> data, perm;
Vector256<int> result = Avx2.PermuteVar8x32(data, perm);

asm:

vpermd ymm1, ymm2, ymm1 ; 3 cycle latency, 1 cycle throughput

static int[] PermTable => new[] {
    0, 1, 2, 3, 4, 5, 6, 7,     // 0   => 0b00000000
    // ...
    3, 4, 5, 6, 7, 0, 1, 2,     // 7   => 0b00000111
    // ...
    0, 2, 4, 6, 1, 3, 5, 7,     // 170 => 0b10101010
    // ...
    0, 1, 2, 3, 4, 5, 6, 7,     // 255 => 0b11111111
};

Mask 0: everything stays in place. Mask 7: the 3 elements that are > pivot move from the left of the vector to the right. Mask 170: a 4/4 split.
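A toy usage sketch (the concrete values are mine) of how one table entry partitions a vector:

// Say pivot == 6: the first three lanes (9, 8, 7) are greater than it,
// so CompareGreaterThan + MoveMask produces mask == 0b00000111 == 7.
var data = Vector256.Create(9, 8, 7, 1, 2, 3, 4, 5);
var perm = Vector256.Create(3, 4, 5, 6, 7, 0, 1, 2);   // PermTable entry for mask 7
var partitioned = Avx2.PermuteVar8x32(data, perm);
// partitioned == <1, 2, 3, 4, 5, 9, 8, 7>: smaller-or-equal elements first, greater last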

var P = Vector256.Create(pivot);
...
var current = Avx.LoadDquVector256(nextPtr);
var mask = (uint) Avx.MoveMask(
    Avx2.CompareGreaterThan(current, P).AsSingle());
current = Avx2.PermuteVar8x32(current,
    Avx.LoadDquVector256(PermTablePtr + mask * 8));
Avx.Store(writeLeft, current);
Avx.Store(writeRight, current);
var popCount = Popcnt.PopCount(mask);
writeRight -= popCount;
writeLeft  += 8 - popCount;

Step by step:

  • We generate a vectorized pivot, once per partition
  • Load 8 elements from somewhere
  • Compare to pivot, cast to Vector256<float> (because ¯\_(ツ)_/¯)
  • Generate an 8-bit mask from the comparison result
  • Load the permutation vector from the table
  • Permute the data (partition)
  • Store all 8 elements to the left
  • Store all 8 elements to the right
  • Count the 1 bits ⮚ how many elements are > than the pivot
  • Advance writeRight by -popCount
  • Advance writeLeft by 8 - popCount

All in all: ~8.5 cycles of throughput per 8-element block

OK, So now the vector is partitioned, what’s next?

stackalloc to the rescue

  • We “cheat” just a bit: ¯\_(ツ)_/¯
  • stackalloc Vector256<int>[3] (see the sketch below)
  • Total temp memory: 96 bytes
    • Constant
    • No matter how much we sort
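A minimal sketch (assuming an unsafe context) of that “cheat”:

unsafe
{
    // 3 x Vector256<int> == 3 x 32 bytes == 96 bytes of scratch space,
    // allocated on the stack: the same amount no matter how large the
    // array being partitioned is.
    var tmp = stackalloc Vector256<int>[3];
    // ... the partition spills a handful of leftover elements into tmp ...
}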

Outer loop

Yeah yeah, are we fast yet?

Speedup vs. Array.Sort    100    1K     10K    100K   1M     10M
ArraySort                 1.00   1.00   1.00   1.00   1.00   1.00
AVX2DoublePumpedNaive     1.15   1.67   1.72   2.56   2.70   2.70

What’s the Speed Limit?

There’s tons of stuff we could still do:

  • Inspect MSIL
  • Inspect asm code
  • Use HW counters
  • Vectorization Tweaks
  • Special code for small arrays

MSIL

  • C# compiler uses definite assignment analysis
    • CS0165: Use of unassigned local variable…
  • But it tells the JIT to zero-initialize locals regardless
  • a.k.a. .locals init
  .method private hidebysig static int32*
    VectorizedPartitionInPlace(
      int32* left,
      int32* right,
      int32* tmp
    ) cil managed
  {
    .maxstack 3
    .locals init (
      [0] int32 pivot,
      ...
      [22] int32 v,
      [23] valuetype [System.Runtime]System.ReadOnlySpan`1<int32> V_23
    )

Why?

5min hack

2-3% Improvement
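The deck doesn't show the hack itself; as one hedged illustration of the idea (this uses the [SkipLocalsInit] attribute, which arrived after CoreCLR 3.0 in .NET 5, so the original 5-minute hack presumably patched the IL instead):

using System.Runtime.CompilerServices;

// Suppresses .locals init for this method, so the JIT no longer
// zeroes the (large) stack frame on every call.
[SkipLocalsInit]
static unsafe int* VectorizedPartitionInPlace(int* left, int* right, int* tmp)
{
    // ... partition body ...
    return left; // placeholder
}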

Inspecting ASM

while (readRight >= readLeft) {
    int *nextPtr;
    if (readLeft   - writeLeft <=
        writeRight - readRight) {
        nextPtr = readLeft;
        readLeft += 8;
    } else {
        nextPtr = readRight;
        readRight -= 8;
    }
    var current = Avx.LoadDquVector256(nextPtr);
    //...
}

Anything left to partition? (the outer while). Which side is closer to being overwritten? (the if) ⮚ pick the left or the right side accordingly.

ASM Code

mov rcx, rdi    ; rdi -> readLeft
sub rcx, r12    ; r12 -> writeLeft
mov r8, rcx
sar r8, 0x3f
and r8, 0x3
add rcx, r8
sar rcx, 0x2
mov r9, rsi     ; rsi -> writeRight
sub r9, r13     ; r13 -> readRight
mov r10, r9
sar r10, 0x3f
and r10, 0x3
add r9, r10
sar r9, 0x2
cmp rcx, r9

Load, subtract… compare… What the…?

Look at all that arithmetic… Why?

The JIT is not optimizing the int-pointer math when comparing the differences: subtracting two int*s produces an element count, so each byte difference gets divided by 4 (all that sar/and/add above) even though we only compare the two results...

We can get past this!

Before:

if (readLeft   - writeLeft <=
    writeRight - readRight) {
    // ...
} else {
    // ...
}

After:

if ((byte *) readLeft   - (byte *) writeLeft <=
    (byte *) writeRight - (byte *) readRight) {
    // ...
} else {
    // ...
}

Much better

mov rcx, rdi    ; rdi -> readLeft
sub rcx, r12    ; r12 -> writeLeft
mov r9, rsi     ; rsi -> writeRight
sub r9, r13     ; r13 -> readRight
cmp rcx, r9

Ugly, But Effective

6-9% Improvement!

HW Counters

CPUs have HW counters that give stats!

$ perf stat -a --topdown ...
 Performance counter stats for 'system wide':
        retiring   bad speculation   frontend bound  backend bound
S0-C3 1    37.6%             37.3%            16.9%      13.2%

Almost 40% of the pipeline's work is wasted on bad speculation (mis-predicted branches)!

Bad Speculation is Bad

  • That “pick a side” logic is super bad for perf
    • We snuck in branch unpredictability!
    • 100% random data ⮚ Flip a coin
    • Every mis-prediction costs us ~15 cycles
    • Remember: our block is ~8.5 cycles!
    • ~50% mis-prediction x 15 cycles ≈ 7.5 wasted cycles per block
    • We’re doing nothing ~50% of the time!

Branch ⮚ Arithmetic

What if we can make the CPU run the same code, for both branches?

Turn the branch into a data dependency!

Before:

int x, y;
if (x > y) {
    ptr += 8;
}

After:

int x, y;
// + => 0, - => 0xFFFFFFFF
var condAsMask = (y - x) >> 31;
//Always perform the addition, sometimes with 0!
ptr += 8 & condAsMask;

Poor-man’s CMOV

This is an age-old technique of replacing badly speculated branches with simple arithmetic.

It only works for simple branches.

CoreCLR will hopefully learn to cmov in the future.

Unroll the code

  • Change the ratio between work/branches predicted
  • CPU prediction will continue to suck
    • But only once every N partitioning blocks
  • We need to allocate more temporary space:
    • 2*N*Vector256<int>
    • 2*4*8*4 == 256 bytes
while (readRight >= readLeft) {
    int *nextPtr;
    if (readLeft   - writeLeft <=
        writeRight - readRight) {
        nextPtr = readLeft;
        readLeft += 8*4;
    } else {
        nextPtr = readRight;
        readRight -= 8*4;
    }
    var L0 = LoadDquVector256(nextPtr + 0*8);
    var L1 = LoadDquVector256(nextPtr + 1*8);
    var L2 = LoadDquVector256(nextPtr + 2*8);
    var L3 = LoadDquVector256(nextPtr + 3*8);

    var m0 = (uint) MoveMask(CompareGreaterThan(L0, P).AsSingle());
    var m1 = (uint) MoveMask(CompareGreaterThan(L1, P).AsSingle());
    var m2 = (uint) MoveMask(CompareGreaterThan(L2, P).AsSingle());
    var m3 = (uint) MoveMask(CompareGreaterThan(L3, P).AsSingle());

    L0 = PermuteVar8x32(L0, GetIntPermutation(pBase, m0));
    L1 = PermuteVar8x32(L1, GetIntPermutation(pBase, m1));
    L2 = PermuteVar8x32(L2, GetIntPermutation(pBase, m2));
    L3 = PermuteVar8x32(L3, GetIntPermutation(pBase, m3));

    var pc0 = PopCount(m0);
    var pc1 = PopCount(m1);
    var pc2 = PopCount(m2);
    var pc3 = PopCount(m3);

    Store(writeRight, L0);
    writeRight -= pc0;
    Store(writeRight, L1);
    writeRight -= pc1;
    Store(writeRight, L2);
    writeRight -= pc2;
    Store(writeRight, L3);
    writeRight -= pc3;

    pc0 = 8 - pc0;
    pc1 = 8 - pc1;
    pc2 = 8 - pc2;
    pc3 = 8 - pc3;

    Store(writeLeft, L0);
    writeLeft += pc0;
    Store(writeLeft, L1);
    writeLeft += pc1;
    Store(writeLeft, L2);
    writeLeft += pc2;
    Store(writeLeft, L3);
    writeLeft += pc3;
}


Can you spot the difference? It's the same flow, just x4: Load x4 ⮚ Compare+MoveMask x4 ⮚ Permute x4 ⮚ PopCount x4 ⮚ Store on the right x4 ⮚ 8 - PopCount x4 ⮚ Store on the left x4

Show me the money!

  • Do we have less branch mis-predictions now?
    • Not really!
    • (Not in percent)
    • But fewer branches in total
  • Does it run faster though?

Sure Does!

20-30% Improvement!

Vectorization Tweaks

Let’s kill two perf hogs with one crazy opt!

But what are we fixing?

The Remainder problem

All vectorized code needs to deal with remainders.

length % 8 != 0

Between 1 and 7 leftover elements, handled with pure scalar code!
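A generic sketch (not the deck's actual code) of the usual shape this takes:

int i = 0;
// Vectorized main loop: full groups of 8 ints
for (; i + 8 <= length; i += 8)
{
    // ... process 8 elements with Vector256 ...
}
// Scalar fallback for the 1-7 leftover elements
for (; i < length; i++)
{
    // ... process one element at a time ...
}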

Cacheline Boundaries

When reading memory, we really read from cache.

Reading a single int still fetches a full cacheline: 64 bytes

What if our data is across two cachelines?

The CPU needs to ask for 2(!) cache-lines

Alignment

Normally, we shouldn’t care!

  • The JIT & allocator work to minimize this!
    • “Stuff” is aligned to pointer size: 4/8 bytes
    • When not: 4/64 ⮚ 6.25% rate
      of cross-cacheline reads
  • But what about Vector256? (32-bytes)
  • Is this true for partitioning?

No love!

Not a single pointer is naturally aligned to Vector256.

With partitioning, the data dictates the alignment

50% of our reads are reading 2 cachelines!!!

Two birds, one stone

Let’s fix both of these issues!

By doing more work!

  • The safe approach is to align inward
    • Align with scalar code
    • No more remainder at the end
    • But this means more scalar work:
      1-7 elements on each side!
  • But what if we align outwards? (sketched below)
    • It’s legal (but not trivial!)
    • 100% vectorized*!!!
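A hedged sketch (the pointer names are mine, and this glosses over the hard parts) of what “aligning outwards” means for the reads:

// Round the leftmost read down and the rightmost read up to a 32-byte
// (Vector256-sized) boundary: every load is now aligned and there is no
// scalar remainder. The extra elements that get read this way are what
// makes it "legal but not trivial" to handle correctly.
int* alignedLeft  = (int*)((ulong)readLeft  & ~31UL);
int* alignedRight = (int*)(((ulong)readRight + 31) & ~31UL);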

Totally worth it!

10%-20% improvement

*When in cache

Small array sorting

Another common trick is to sort very small partitions directly without quicksort.

Array.Sort uses this.

Can we? With Vectors?

Yes we can!

Bitonic Sort

  • Parallel Sorting Algorithm
    • Used a lot in GPUs
  • O(n * log²(n)) comparisons
  • Generates 2 monotonic series
    • Increasing/Decreasing
    • Bitonic
  • No branches (see the compare-exchange sketch below)
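A tiny sketch (mine, not the deck's full sorter) of the branchless building block bitonic sort is made of:

Vector256<int> a = Vector256.Create(5, 3, 8, 1, 9, 2, 7, 4);
Vector256<int> b = Vector256.Create(6, 6, 6, 6, 6, 6, 6, 6);

// A vectorized compare-exchange: every lane is min/max'ed against its
// partner with no branches. A full bitonic sorter is a fixed network of
// these, plus shuffles to pair up the right lanes.
Vector256<int> lo = Avx2.Min(a, b);  // per-lane smaller values
Vector256<int> hi = Avx2.Max(a, b);  // per-lane larger values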

Final Speedup

Speedup vs. Array.Sort                            100    1K     10K    100K   1M     10M
ArraySort                                         1.00   1.00   1.00   1.00   1.00   1.00
AVX2DoublePumpedOverlinedUnrolledWithBitonicSort  2.04   4.80   7.67   8.73   9.02   8.93

In summary

We can and should re-approach even age-old problems to find ways to increase the predictability of our code by using intrinsics!

This is now available to us in C#/.NET as part of CoreCLR 3.0.

Be nice to your CPUs!

Links

github.com/damageboy

Twitter: @damageboy

Blog