CoreCLR 3.0 Intrinsics


Why You’re Here

From: "Computer Architecture: A Quantitative Approach, 6th Edition

"The reason processor performance is sub-linear with transistor count is [because] it's limited by unpredictability: Branch predictability, Data-predictability, Instruction predictability."

Jim Keller From: Moore’s Law is Not Dead

Branch Prediction

  • A single core executes hundreds of instructions at any given moment…
  • To do this, the CPU has to guess the target of branches!
    • It usually does a good job
  • But when it fails we get penalized
    • ~15 cycles for a single mis-prediction!

Now that I’ve got you scared motivated enough…

Let’s get busy!

Intrinsics in English

A way to directly embed specific CPU instructions via special, fake method calls that the JIT replaces at code-generation time

Used to expose processor functionality that doesn’t map well to the language:

  • Atomic operations
  • System-Programming (e.g. kernel mode)
  • Crypto instructions
  • Niche instructions
  • Vectorization
    • Instructions that work on vectors

Vectorization 101

A single vector instruction operates on multiple elements at once, and usually takes just 1-3 CPU cycles!
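As a rough illustration (a snippet of mine, not from the deck): one AVX2 instruction adds 8 ints in a single go, where the scalar loop needs 8 separate adds and 8 loop branches. It uses the Vector256<T> types covered a few slides later:

C#:

// using System.Runtime.Intrinsics.X86;
static unsafe void Add8(int* a, int* b, int* result)
{
    // Scalar: 8 additions, 8 loop branches
    // for (var i = 0; i < 8; i++) result[i] = a[i] + b[i];

    // Vectorized: 2 loads, 1 add, 1 store
    var va = Avx.LoadVector256(a);
    var vb = Avx.LoadVector256(b);
    Avx.Store(result, Avx2.Add(va, vb));
}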

Math is hard, let’s go sorting!

I needed to sort numbers, and really fast.

"Eh, I'll rewrite Array.Sort*. QuickSort. With intrinsics... How bad could it be?"

5 months later, and I'm still at it...

But,

It’s 6x faster

In this talk: 2.7x

Why QuickSort?

  • Universally known
  • Non-trivial use of intrinsics
  • Pretty close to Array.Sort*

Refresher

  • QuickSort uses a divide-and-conquer approach
    • It’s recursive
  • Has average O(n log n) comparisons for n items
  • Performs an in-place sort
  1. Pick a pivot value
  2. Partition the array around the pivot value
  3. Recurse on the left side of the pivot
  4. Recurse on the right side of the pivot
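Put together, the four steps map onto a driver like this minimal sketch (mine, not Array.Sort's actual code); it calls the scalar Partition routine shown a couple of slides below:

C#:

// Minimal sketch of the recursive driver; Partition is the scalar
// version shown later in this deck.
static void QuickSort(int[] array, int left, int right)
{
    if (left >= right)
        return;                                          // 0/1 elements: already sorted
    var pivot = array[(left + right) / 2];               // 1. pick a pivot value
    var boundary = Partition(array, pivot, left, right); // 2. partition around it
    QuickSort(array, left, boundary - 1);                // 3. recurse on the left side
    QuickSort(array, boundary, right);                   // 4. recurse on the right side
}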

Visualizing QuickSort

To better grasp how and why it works, we'll use visualizations made by @mbostock.

By the numbers

Stat           100    1K     10K     100K       1M          10M
Max Depth      8      14     21      28         35          42
# Partitions   33     342    3,428   34,285     342,812     3,428,258
# Comparisons  422    6,559  89,198  1,128,145  13,698,171  155,534,152

int Partition(int[] array, int pivot, int left, int right)
{
    while (left <= right) {
        // Scan from the left for an element >= pivot,
        // and from the right for an element <= pivot
        while (array[left]  < pivot) left++;
        while (array[right] > pivot) right--;

        // left points at an element >= pivot, right at one <= pivot: swap them
        if (left <= right) {
            var t = array[left];
            array[left++]  = array[right];
            array[right--] = t;
        }
    }
    return left; // the boundary: first element of the ">= pivot" right side
}

Branches, branches everywhere! The data-dependent comparisons against the pivot are 👎 unpredictable 👎, while the loop-bound checks (left <= right) are 👍 predictable 👍.

Plan

Redo Partition with vectorized intrinsics.

  • What intrinsics do we use?
  • How do they work?

How?

How can an instruction operate on a vector?

Does it operate directly on memory?

Generally: No!

Vectors Types / Registers

These instructions operate on special vector types that are supported at the CPU level: registers.

Vector registers have constant size (in bits).

SIMD registers in CoreCLR 3.0

C# vectorized intrinsics accept and return these types: Vector64<T>, Vector128<T>, and Vector256<T>,
where T is some primitive type.

Example:

Vector256<T> can be:

byte / sbyte 32 x 8b == 256b
short / ushort 16 x 16b == 256b
int / uint 8 x 32b == 256b
long / ulong 4 x 64b == 256b
float 8 x 32b == 256b
double 4 x 64b == 256b
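
For example (a snippet of mine, not from the deck), the element count follows directly from dividing 256 bits by the element size:

C#:

Console.WriteLine(Vector256<int>.Count);    // 8 (8 x 32b == 256b)
Console.WriteLine(Vector256<double>.Count); // 4 (4 x 64b == 256b)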

7 for the road

For this talk we need 7 intrinsics. Let’s go over them!

Create

  • Accepts a single primitive value
  • Returns a vector where all elements are set to that value
  • Scalar ⮚ Vector

C#:

Vector256<int> someVector256 = Vector256.Create(0x42);

asm:

vmovd  xmm0, eax          ; 3 cycle latency
                          ; 1 cycle throughput
vpbroadcastd ymm0, xmm0   ; 3 cycle latency
                          ; 1 cycle throughput

Load / Store

  • Takes a pointer to an array of primitives
  • Load reads a vector with copied data from memory
  • Store writes a vector into memory

C#:

int *ptr = ...; // Get some pointer to a big enough array
Vector256<int> data = Avx2.LoadDquVector256(ptr);
Avx.Store(ptr, data);

asm:

vlddqu  ymm1,  [rdi] ; 5 cycle latency +
vmovdqu [r12], ymm1  ; cache/memory
                     ; 0.5 cycle throughput

Compare Greater Than

  • Compares 2 vectors element by element
  • Returns a 3rd vector where:
    • Greater-than elements are marked with -1 (all bits set)
    • Less-than or equal elements are marked with 0

C#:

Vector256<int> data, comparand;
Vector256<int> result =
    Avx2.CompareGreaterThan(data, comparand);

asm:

vpcmpgtd ymm2, ymm1, ymm0 ; 1 cycle latency
                          ; 0.5 cycle throughput

Move Mask

  • Set each bit of the result based on the MSB of the corresponding 32-bit element
  • Reverse of broadcast, for the MSB
  • Vector ⮚ Scalar

C#:

Vector256<int> data;
int result = Avx.MoveMask(data.AsSingle());

asm:

vmovmskps rax, ymm2  ; 5 cycle latency
                     ; 1 cycle throughput

Population Count

  • Counts # of ‘1’ bits in a 32/64 bit primitive

C#:


uint result = Popcnt.PopCount(0b0000111100110011);
// result == 8

asm:

popcnt rax, rdx  ; 3 cycle latency
                 ; 1 cycle throughput

8-Way Permute

  • Accepts two vectors: source, permutation
  • Permutes the source according to the permutation order

C#:

Vector256<int> data, perm;
Vector256<int> result = Avx2.PermuteVar8x32(data, perm);

asm:

vpermd ymm1, ymm2, ymm1 ; 3 cycle latency
                        ; 1 cycle throughput

Vectorized Partition Block

  • We're going to partition 8 x ints at a time
    • inside a Vector256
  • Load ⮚ Compare ⮚ Permute ⮚ Store

Now what?

  • mask tells us which element goes where!
  • We could loop over the bits in the mask
    • Back to square one: 8 branches
  • I did not fly all the way to Poland for this!

Permutation Tables

  • There are 256 possible mask values (2⁸)
  • We can precompute all permutations in a table
  • Each permutation entry will provide the correct order for a given mask
  • The table is simply part of the source code
static int[] PermTable => new[] {
    0, 1, 2, 3, 4, 5, 6, 7,     // 0   => 0b00000000
    // ...
    3, 4, 5, 6, 7, 0, 1, 2,     // 7   => 0b00000111
    // ...
    0, 2, 4, 6, 1, 3, 5, 7,     // 170 => 0b10101010
    // ...
    0, 1, 2, 3, 4, 5, 6, 7,     // 255 => 0b11111111
};

Row annotations: mask 0 ⮚ everything stays in place; mask 7 ⮚ move 3 elements from left to right; mask 170 ⮚ a 4/4 split.
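To see where each row comes from, here is a hedged sketch of a generator (mine, not the deck's actual table generator): for a given mask, the indices of the '0' bits (elements ≤ pivot) come first, in order, followed by the indices of the '1' bits (elements > pivot):

C#:

// For a given 8-bit mask, build the permutation entry:
// lanes whose mask bit is 0 (<= pivot) keep their relative order and go first,
// lanes whose mask bit is 1 (> pivot) go last.
static int[] PermutationEntry(int mask)
{
    var entry = new int[8];
    var next = 0;
    for (var i = 0; i < 8; i++)
        if ((mask & (1 << i)) == 0) entry[next++] = i;
    for (var i = 0; i < 8; i++)
        if ((mask & (1 << i)) != 0) entry[next++] = i;
    return entry; // e.g. mask 0b00000111 => 3, 4, 5, 6, 7, 0, 1, 2
}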

var P = Vector256.Create(pivot);
...
var current = Avx2.LoadDquVector256(nextPtr);
var mask = (uint) Avx.MoveMask(
    Avx2.CompareGreaterThan(current, P).AsSingle());
current = Avx2.PermuteVar8x32(current,
    Avx2.LoadDquVector256(PermTablePtr + mask * 8));
Avx.Store(writeLeft, current);
Avx.Store(writeRight, current);
var popCount = Popcnt.PopCount(mask);
writeRight -= popCount;
writeLeft  += 8 - popCount;

  • We generate a vectorized pivot, once per partition
  • Load 8 elements from somewhere
  • Compare to pivot, cast to Vector256<float> (because ¯\_(ツ)_/¯)
  • Generate an 8-bit mask from the comparison result
  • Load permutation vector from table (previous slides!)
  • Permute data (partition)
  • Store 8 elements to the left
  • Store 8 elements to the right
  • Count 1 bits ⮚ how many elements are > than the pivot
  • Advance right by popCount
  • Advance left by 8 - popCount

8.5 cycle throughput

OK, So now the vector is partitioned, what’s next?

The outer loop

  • The idea is we handle 8 ints every iteration
  • But then we write 8 elements to both the left and the right
  • How can we keep everything in place, without overwriting data we haven’t read yet?
  • Remember: allocation is bad

stackalloc to the rescue

  • We “cheat” just a bit: ¯\_(ツ)_/¯
  • stackalloc Vector256<int>[2]
  • Total temp memory: 64 bytes
    • Constant
    • No matter how much we sort

Outer loop
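The deck's actual outer loop is in-place and "double-pumped": it reads from both ends and uses the 64 bytes of stack space for slack, so writes only ever land on already-read slots. As a simplified, hedged sketch of how the partition block above drives a whole partition, here is a version of mine that writes into a scratch buffer instead, assuming the length is a positive multiple of 8 and that PermTablePtr points at the permutation table from before:

C#:

// Simplified sketch, NOT the talk's in-place double-pumped loop:
// partitions `length` ints (a positive multiple of 8) around `pivot`,
// writing into a scratch buffer of the same size, then copies back.
// Returns the boundary: the number of elements <= pivot.
static unsafe int PartitionSketch(int* ptr, int length, int pivot,
                                  int* scratch, int* PermTablePtr)
{
    var P = Vector256.Create(pivot);
    var writeLeft  = scratch;               // <= pivot elements grow upward from here
    var writeRight = scratch + length - 8;  // >  pivot elements grow downward from here

    for (var read = ptr; read < ptr + length; read += 8) {
        var current = Avx2.LoadDquVector256(read);
        var mask = (uint) Avx.MoveMask(
            Avx2.CompareGreaterThan(current, P).AsSingle());
        current = Avx2.PermuteVar8x32(current,
            Avx2.LoadDquVector256(PermTablePtr + mask * 8));
        Avx.Store(writeLeft,  current);      // keeps the <= pivot prefix
        Avx.Store(writeRight, current);      // keeps the >  pivot suffix
        var popCount = (int) Popcnt.PopCount(mask);
        writeRight -= popCount;
        writeLeft  += 8 - popCount;
    }

    var boundary = (int) (writeLeft - scratch);
    Buffer.MemoryCopy(scratch, ptr, length * sizeof(int), length * sizeof(int));
    return boundary;
}

The deck's version gets rid of the scratch buffer by writing back into the space that has already been read from both ends, which is exactly where the 64 stack-allocated bytes of slack come in.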

Yeah yeah, are we fast yet?

N                      100    1K     10K    100K   1M     10M
ArraySort              1.00   1.00   1.00   1.00   1.00   1.00
AVX2DoublePumpedNaive  1.15   1.67   1.72   2.56   2.70   2.70

(speedup relative to Array.Sort; higher is better)

In summary

We can and should re-approach even age-old problems to find ways to increase the predictability of our code by using intrinsics!

This is now available to us in C#/.NET as part of CoreCLR 3.0.

Be nice to your CPUs!

Links

github.com/damageboy

Twitter: @damageboy

Blog