When you see the following thingy, you can touch/hover above it to trigger animations that are part of this deck:

From: "Computer Architecture: A Quantitative Approach, 6^{th} Edition

"The reason processor performance is sub-linear with transistor count is [because] it's limited byunpredictability: Branch predictability, Data-predictability, Instruction predictability."

- A single core executes hundreds of instructions at any given moment…
- To do this, the CPU has to guess the target of branches!
- It usually does a good job

- But when it fails we get penalized
- ~15 cycles for a single mis-prediction!

Now that I’ve got you ~~scared~~ motivated enough…

Let’s get busy!

A way to directly embed **specific** CPU instructions via special, *fake* method calls that the JIT replaces at code-generation time

Used to expose processor functionality that *doesn’t* map well to the language:

- Atomic operations
- System-Programming (e.g. kernel mode)
- Crypto instructions
- Niche instructions
**Vectorization** - Instructions that work on vectors

Usually 1-3 CPU cycles!

I needed to sort numbers, and really fast.

"Eh, I'll rewrite Array.Sort^{*}.
QuickSort.
With intrinsics...
How bad could it be?"

*5 months later, and I'm still at it...*

- Universally known
- Non-trivial use of intrinsics
- Pretty close to
`Array.Sort`

^{*}

- QuickSort uses a
*divide-and-conquer*approach- It’s recursive

- Has average O(
*n*log*n*) comparisons for*n*items - Performs an in-place sort

- Pick a
*pivot*value - Partition the array around the pivot value
- Recurse on the left side of the pivot
- Recurse on the right side of the pivot

To grasp better how/why it works, we’ll use visualizations made by @mbostock, where:

Stat | 100 | 1K | 10K | 100K | 1M |
10M |
---|---|---|---|---|---|---|

Max Depth | 8 | 14 | 21 | 28 | 35 |
42 |

# Partitions | 33 | 342 | 3,428 | 34,285 | 342,812 |
3,428,258 |

# Comparisons | 422 | 6,559 | 89,198 | 1,128,145 | 13,698,171 |
155,534,152 |

```
int Partition(int[] array, int pivot, int left, int right)
{
while (left <= right) {
while (array[left] < pivot) left++;
while (array[right] > pivot) right--;
if (left <= right) {
var t = array[left];
array[left++] = array[right];
array[right--] = t;
}
}
return left;
}
```

Branches, Branches Everywhere! 👎 Unpredictable 👎 👍 Predictable 👍

Redo `Partition`

with vectorized intrinsics.

- What intrinsics do we use?
- How do they work?

How can an instruction operate on a vector?

Does it operate *directly* on memory?

Generally: **No!**

These intruction operate on *special* vector types that are supported at the CPU level: registers

Vector registers have constant size (in bits).

C# vectorized intrinsics accept and return these types:

Where `T`

is some primitive type.

`Vector256<T>`

can be:

`byte / sbyte` | ⮚ | 32 x 8b | == 256b |

`short / ushort` | ⮚ | 16 x 16b | == 256b |

`int / uint` |
⮚ | 8 x 32b | == 256b |

`long / ulong` | ⮚ | 4 x 64b | == 256b |

`float` | ⮚ | 8 x 32b | == 256b |

`double` | ⮚ | 4 x 64b | == 256b |

For this talk we need 7 intrinsics, Let’s go over them!

- Accepts a single primitive value
- Returns a vector where all elements ⮘ value
- Scalar ⮚ Vector

C#:

```
Vector256<int> someVector256 = Vector256.Create(0x42);
```

asm:

```
vmovd xmm0, rax ; 3 cycle latency
; 1 cycle throughput
vpbroadcastd ymm0, xmm0 ; 3 cycle latency
; 1 cycle throughput
```

- Takes a pointer to an array of primitives
- Load reads a vector with copied data
*from*memory - Store writes a vector
*into*memory

C#:

```
int *ptr = ...; // Get some pointer to a big enough array
Vector256<int> data = Avx2.LoadDquVector256(ptr);
Avx.Store(ptr, data);
```

asm:

```
vlddqu ymm1, [rdi] ; 5 cycle latency +
vmovdqu [r12], ymm1 ; cache/memory
; 0.5 cycle throughput
```

- Compares 2 vectors element by element
- Returns a 3
^{rd}vector where:*Greater than*elements are marked with`-1`

*Smaller than -or- equal*are marked as`0`

C#:

```
Vector256<int> data, comparand;
Vector256<int> result =
Avx2.CompareGreaterThan(data, comparand);
```

asm:

```
vpcmpgtd ymm2, ymm1, ymm0 ; 1 cycle latency
; 0.5 cycle throughput
```

- Set each bit of the result based on the MSB of the corresponding 32-bit element
- Reverse of broadcat, for the MSB
- Vector ⮚ Scalar

C#:

```
Vector256<int> data;
int result = Avx.MoveMask(data.AsSingle());
```

asm:

```
vmovmskps rax, ymm2 ; 5 cycle latency
; 1 cycle throughput
```

- Counts # of ‘1’ bits in a 32/64 bit primitive

C#:

```
int result = PopCnt.PopCount(0b0000111100110011);
// result == 8
```

asm:

```
popcnt rax, rdx ; 3 cycle latency
; 1 cycle throughput
```

- Accepts two vectors: source, permutation
- Permutes the source according to the permutation order

C#:

```
Vector256<int> data, perm;
Vector256<int> result = Avx2.PermuteVar8x32(data, perm);
```

asm:

```
vpermd ymm1, ymm2, ymm1 ; 3 cycles latency
; 1 cycles throughput
```

- We're going to partition 8 x
`int`

s at a time - inside a Vector256
- Load ⮚ Compare ⮚ Permute ⮚ Store

`mask`

tells us which element goes where!- We could loop over the bits in the mask
- Back to square one: 8-branches

- I did not fly all the way to Poland for this!

- There are 256 possible mask values (2
^{8}) - We can precompute all permutations in a table
- Each permutation entry will provide the correct order for a given mask
- The table is simply part of the source code

```
static int[] PermTable => new[] {
0, 1, 2, 3, 4, 5, 6, 7, // 0 => 0b00000000
// ...
3, 4, 5, 6, 7, 0, 1, 2, // 7 => 0b00000111
// ...
0, 2, 4, 6, 1, 3, 5, 7, // 170 => 0b10101010
// ...
0, 1, 2, 3, 4, 5, 6, 7, // 255 => 0b11111111
};
```

Everything stays in place Move 3 from left to right 4/4 split

```
var P = Vector256.Create(pivot);
...
var current = Avx2.LoadDquVector256(nextPtr);
var mask = (uint) Avx.MoveMask(
Avx2.CompareGreaterThan(current, P).AsSingle()));
current = Avx2.PermuteVar8x32(current,
LoadDquVector256(PermTablePtr + mask * 8));
Avx.Store(writeLeft, current);
Avx.Store(writeRight, current);
var popCount = PopCnt.PopCount(mask);
writeRight -= popCount;
writeLeft += 8 - popCount;
```

We generate a vectorized pivot, once per partition
Load 8 elements from somewhere.
Compare to pivot, cast to `Vector256<float>`

(because `¯\`

)
Generate an 8-bit mask from the comparison result
Load permutation vector from table (next slides!)
Permute data (partition)
Store 8 elements to the left.
Store 8 elements to the right.
Count 1 bits ⮚ How many are elemenets are > than pivot.
Advance right by popCount.
Advance left by 8 - popCount.
8.5 cycle throughput*(ツ)*/¯

- The idea is we handle 8 ints every iteration
- Then write 8
**both**to left + right - How can we keep everything in place?
- Remember allocation is bad

- We “cheat” just a bit:
`¯\`

*(ツ)*/¯ `stackalloc Vector256<int>[2]`

- Total temp memory: 64 bytes
- Constant
- No matter how much we sort

We can and *should* re-approach even age old problems to find ways
to increase the predictablity of our code by using instrinsics!

This is now available to us in C#/.NET as part of CoreCLR 3.0.

Be nice to your CPUs!