CoreCLR 3.0 Intrinsics

(Are they any good?)


getent passwd $USER

dmg:*:666:666:Dan Shechter:/home/dmg:/usr/bin/zsh

CTO of a high-frequency trading* firm that trades global markets from inside exchanges.

Also a *nix programmer who likes low-level & perf work and who's been around the block: Windows/Linux kernel programming, hypervisors

---

## Today

- What?
- Why (now)?
- Take something we all respect
- (Re)build it together using intrinsics
- Oh yeah, learn intrinsics while at it
- Profit!
- Q&A

---

## Wikipedia

[Intrinsic function](https://en.wikipedia.org/wiki/Intrinsic_function):
> ...an intrinsic function is a function available for use in a given programming language whose implementation is handled specially by the JIT compiler.
--

Traditionally, intrinsics are used to expose processor functionality that *doesn't* map well to the language:

- Atomic operations (quick example below)
- System programming (e.g. kernel mode)
- Crypto instructions
- Niche instructions
- Vectorization
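For instance, as an illustrative aside that isn't part of the original deck: atomic operations surface in C# through `System.Threading.Interlocked`, which the JIT already treats as an intrinsic.

```csharp
using System.Threading;

int location = 0;
// The JIT emits a single `lock cmpxchg` instruction here instead of a method call:
int original = Interlocked.CompareExchange(ref location, 42, 0); // set to 42 if currently 0
```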
--

## Where do intrinsics come from?

CPU designers add large groups of intrinsics and expose their availability through CPU feature flags.

Intel currently supports:

- 1,218 distinct intrinsics
- In 6,180 (!) combinations
- Grouped in 67 code-names / feature flags

--

## In C#
- Not new! (since .NET 1.1)
- Very limited until CoreCLR 3.0 came along...
--

## Namespaces

- Common stuff is in `System.Runtime.Intrinsics`
- x86 stuff is in `System.Runtime.Intrinsics.X86`
- arm64 stuff is in `System.Runtime.Intrinsics.Arm.Arm64`

--

## Detection

Each class / group has a:

```csharp
public static bool IsSupported { get; }
```

that tests availability on the CPU @ runtime.

- Special check recognized during code-gen (JIT!)
- So in reality: ZERO cost
- Already nicer than C++...

---

## Why should I care?

Performance gains flowing from silicon are shrinking:

--

When all else fails, we look for exotic sources of perf; intrinsics provide an answer...

This includes [CoreCLR itself](https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-core-3-0/):

> ...Further, .NET Core 3.0 includes new hardware intrinsics that allow a properly-motivated developer to eke out the best possible performance on supporting hardware, utilizing extensions like AVX or SSE that can compare well more than 8 bytes at a time. Many of the improvements in .NET Core 3.0 come from utilizing these techniques.

--

## CoreCLR 3.0 vs. 2.1

--

Now that I've got you ~~scared~~ motivated enough...

Let's get busy!

---

## Math is hard, let's go sorting!

We're going to redo **QuickSort** with intrinsics/vectorization.

Why QuickSort?
- Universally known
- Non-trivial use of intrinsics/vectorization
- It's pretty close to Array.Sort*
--

## Reminder

- QuickSort uses a *divide-and-conquer* approach
- It's recursive
- Has average O(*n* log *n*) comparisons for *n* items
- Performs an in-place sort

--

1. Pick a pivot value
2. Partition the array around the pivot value
3. Recurse on the left side of the pivot
4. Recurse on the right side of the pivot

--

## Visualizing QuickSort

To better grasp how/why it works, we'll use visualizations made by @mbostock, where:

--

--

--

```csharp
void QuickSort(int[] items, int left, int right)
{
    if (left == right) return;
    int pivot = PickPivot(items, left, right);
    int pivotPos = Partition(items, pivot, left, right);
    QuickSort(items, left, pivotPos);
    QuickSort(items, pivotPos + 1, right);
}
```

--

```csharp
int Partition(int[] array, int pivot, int left, int right)
{
    while (left <= right) {
        while (array[left]  < pivot) left++;
        while (array[right] > pivot) right--;

        if (left <= right) {
            var t = array[left];
            array[left++]  = array[right];
            array[right--] = t;
        }
    }
    return left;
}
```

--

# Compared to Array.Sort

--

## Same

Array.Sort in CoreCLR is *similar* to QuickSort:

- Picks a pivot
  - Median of {first, middle, last} elements
- Uses the same divide-and-conquer approach
- And recursion pattern

--

## But different

Technically it's called [IntroSort](https://en.wikipedia.org/wiki/Introsort):

- Hybrid partitioning:
  - HeapSort, QuickSort, Insertion Sort
  - Depending on depth / partition size
  - Better worst-case behaviour
- In CoreCLR, it's implemented in C++
  - Better code quality vs. JIT
  - No bounds checking

--

## BDN Time

--

---

# Step aside scalar, SIMD is coming

Here we go...

--

## Plan

We'll write `Partition` to use AVX+AVX2 vectorization/SIMD.

- What is SIMD?
- Which intrinsics do we pick?
- How exactly do they work?

--

## SIMD

We've mentioned that CPUs have 1000s of instructions that have to do with vectorization.

Also referred to as SIMD instructions / intrinsics:

Single Instruction Multiple Data
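As a quick illustration (my sketch, not a slide from the deck; the array values are made up): one AVX2 instruction adds eight 32-bit integers in a single go, where scalar code needs eight separate additions.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

int[] a = { 1, 2, 3, 4, 5, 6, 7, 8 };
int[] b = { 10, 20, 30, 40, 50, 60, 70, 80 };
int[] sum = new int[8];

if (Avx2.IsSupported)
{
    unsafe
    {
        fixed (int* pa = a, pb = b, ps = sum)
        {
            // Single instruction (vpaddd), multiple data (8 x 32-bit lanes):
            Avx.Store(ps, Avx2.Add(Avx.LoadVector256(pa), Avx.LoadVector256(pb)));
        }
    }
}
else
{
    for (int i = 0; i < 8; i++) sum[i] = a[i] + b[i]; // scalar: 8 separate adds
}

Console.WriteLine(string.Join(", ", sum)); // 11, 22, 33, ..., 88
```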

--

## How?

How can an instruction operate on a vector? Does it operate on memory?

--

## SIMD Vectors

SIMD instructions operate on vector types that are supported at the CPU level: registers

SIMD registers have constant size in bits.

CoreCLR 3.0 supports SIMD instructions that use 64/128/256-bit wide registers.

--

## Vectors in C#

C# vectorized intrinsics accept and return these types:

- [`Vector64<T>`](https://github.com/dotnet/coreclr/blob/master/src/System.Private.CoreLib/shared/System/Runtime/Intrinsics/Vector64_1.cs)
- [`Vector128<T>`](https://github.com/dotnet/coreclr/blob/master/src/System.Private.CoreLib/shared/System/Runtime/Intrinsics/Vector128_1.cs)
- [`Vector256<T>`](https://github.com/dotnet/coreclr/blob/master/src/System.Private.CoreLib/shared/System/Runtime/Intrinsics/Vector256_1.cs)

Where `T` is some primitive type.

--

Example: `Vector256<T>` can be:

Integer:
| `T`            | Elements |
|----------------|----------|
| byte / sbyte   | 32 x 8b  |
| short / ushort | 16 x 16b |
| int / uint     | 8 x 32b  |
| long / ulong   | 4 x 64b  |

Floating point:

| `T`    | Elements |
|--------|----------|
| float  | 8 x 32b  |
| double | 4 x 64b  |
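A quick way to see these counts at runtime; this snippet is illustrative and not part of the original slides:

```csharp
using System;
using System.Runtime.Intrinsics;

// 256 bits divided by the element width gives the element count:
Console.WriteLine(Vector256<byte>.Count);   // 32
Console.WriteLine(Vector256<int>.Count);    // 8
Console.WriteLine(Vector256<double>.Count); // 4
```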
--

## Vector registers
(Register file diagram: `zmm0`-`zmm31` are 512-bit registers; bits 255:0 of each are aliased as `ymm0`-`ymm31`, and bits 127:0 as `xmm0`-`xmm31`. Plain AVX/AVX2 code, which is what we use here, sees `ymm0`-`ymm15` / `xmm0`-`xmm15`.)
--

For this talk we need:

| x64 asm       | Intel                         | CoreCLR                   |
|---------------|-------------------------------|---------------------------|
| `vbroadcastd` | `_mm256_broadcastd_epi32`     | `Vector256.Create(int)`   |
| `vlddqu`      | `_mm256_lddqu_si256`          | `Avx.LoadDquVector256`    |
| `vpcmpgtd`    | `_mm256_cmpgt_epi32`          | `Avx2.CompareGreaterThan` |
| `vmovmskps`   | `_mm256_movemask_ps`          | `Avx.MoveMask`            |
| `popcnt`      | `_mm_popcnt_u32`              | `Popcnt.PopCount`         |
| `vpermd`      | `_mm256_permutevar8x32_epi32` | `Avx2.PermuteVar8x32`     |
| `vmovdqu`     | `_mm256_storeu_si256`         | `Avx.Store`               |
--

## Vector256.Create()

- Accepts a single primitive value
- Returns a vector where all elements contain the value

--

C#:

```csharp
Vector256<int> someVector256 = Vector256.Create(0x42);
```

asm:

```x86asm
vmovd xmm0, rax         ; 3 cycle latency
                        ; 1 cycle throughput
vpbroadcastd ymm0, xmm0 ; 3 cycle latency
                        ; 1 cycle throughput
```

--

## LoadDquVector256

- Accepts a pointer to an array of supported primitives
- Returns a vector with copied data from the array

--

C#:

```csharp
int *ptr = ...; // Get some pointer to a big enough array
Vector256<int> data = Avx2.LoadDquVector256(ptr);
```

asm:

```x86asm
vlddqu ymm1, ymmword ptr [rdi] ; 5 cycle latency +
                               ; cache/memory
                               ; 0.5 cycle throughput
```

--

## CompareGreaterThan

- Compares 2 vectors element by element
- Returns a 3rd vector where:
  - *Greater than* elements are marked with `-1`
  - *Smaller than -or- equal* are marked as `0`

--

C#:

```csharp
Vector256<int> data, comparand;
Vector256<int> result = Avx2.CompareGreaterThan(data, comparand);
```

asm:

```x86asm
vpcmpgtd ymm2, ymm1, ymm0 ; 1 cycle latency
                          ; 0.5 cycle throughput
```

--

## MoveMask

- Sets each bit of the result based on the most significant bit of the corresponding 32-bit element
- Technically, expects 32-bit floating point only

--

C#:

```csharp
Vector256<int> data;
int result = Avx.MoveMask(data.AsSingle());
```

asm:

```x86asm
vmovmskps rax, ymm2 ; 5 cycle latency
                    ; 1 cycle throughput
```

--

## PopCount

- Returns the number of '1' bits in a 32/64 bit primitive

C#:

```csharp
uint result = Popcnt.PopCount(0b0000111100110011); // result == 8
```

asm:

```x86asm
popcnt rax, rdx ; 3 cycle latency
                ; 1 cycle throughput
```

--

## Avx2.PermuteVar8x32

- Accepts two vectors: source, permutation
- Permutes the source according to the permutation order

--

C#:

```csharp
Vector256<int> data, perm;
Vector256<int> result = Avx2.PermuteVar8x32(data, perm);
```

asm:

```x86asm
vpermd ymm1, ymm2, ymm1 ; 3 cycle latency
                        ; 1 cycle throughput
```

--

## Vectorized Partition Block
- We're going to partition 8 x `int`s at a time
  - inside a `Vector256<int>`
- Load ➡ Compare ➡ Permute ➡ Store (worked example below)
  - The result is written to both sides of the array
  - Then advance the next write pos for each side
- With no branching(!)
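To make the flow concrete before the real code on the next slide, here's a small self-contained trace of one 8-element block; the sample values, the hard-coded permutation vector and the `PartitionBlockDemo` name are mine, not the deck's:

```csharp
// Illustrative sketch: partition one block of 8 ints around pivot = 5, branch-free.
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

unsafe class PartitionBlockDemo
{
    static void Main()
    {
        if (!Avx2.IsSupported || !Popcnt.IsSupported) return;

        Span<int> block = stackalloc int[] { 1, 8, 2, 9, 3, 10, 4, 11 };
        var P = Vector256.Create(5); // vectorized pivot

        fixed (int* ptr = block)
        {
            var current  = Avx.LoadDquVector256(ptr);
            var gt       = Avx2.CompareGreaterThan(current, P); // -1 lanes where element > 5
            var mask     = (uint) Avx.MoveMask(gt.AsSingle());  // 0b10101010 == 170
            // The real code looks this up in a 256-entry table; entry 170 is {0,2,4,6,1,3,5,7}:
            var perm     = Vector256.Create(0, 2, 4, 6, 1, 3, 5, 7);
            var result   = Avx2.PermuteVar8x32(current, perm);  // [1,2,3,4,8,9,10,11]
            var popCount = Popcnt.PopCount(mask);               // 4 elements go right
            Console.WriteLine($"{result}, left advances by {8 - popCount}, right by {popCount}");
        }
    }
}
```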
--

```csharp
var P = Vector256.Create(pivot);
...
var current = Avx2.LoadDquVector256(nextPtr);
var mask = (uint) Avx.MoveMask(
    Avx2.CompareGreaterThan(current, P).AsSingle());
current = Avx2.PermuteVar8x32(current,
    Avx2.LoadDquVector256(PermTablePtr + mask * 8));
Avx.Store(writeLeft, current);
Avx.Store(writeRight, current);
var popCount = Popcnt.PopCount(mask);
writeRight -= popCount;
writeLeft += 8 - popCount;
```

- We generate a vectorized pivot, once per partition
- Load 8 elements from somewhere.
- Compare to pivot, cast to `Vector256<float>` (because ¯\\_(ツ)_/¯)
- Generate an 8-bit mask from the comparison result
- Load permutation vector from table (next slides!)
- Permute data (partition)
- Store 8 elements to the left.
- Store 8 elements to the right.
- Count 1 bits ➡ how many elements are > pivot.
- Advance right by popCount.
- Advance left by 8 - popCount.

--

```x86asm
vmovd xmm1,r15d                    ; Broadcast
vbroadcastd ymm1,xmm1              ; pivot
...
vlddqu ymm0, ymmword ptr [rax]     ; load 8 elements
vpcmpgtd ymm2, ymm0, ymm1          ; compare
vmovmskps ecx, ymm2                ; movemask into scalar reg
mov r9d, ecx                       ; copy to r9
shl r9d, 0x3                       ; *= 8
vlddqu ymm2, qword ptr [rdx+r9d*4] ; load permutation
vpermd ymm0, ymm2, ymm0            ; permute
vmovdqu ymmword ptr [r12], ymm0    ; store left
vmovdqu ymmword ptr [r8], ymm0     ; store right
popcnt ecx, ecx                    ; popcnt
shl ecx, 0x2                       ; pointer
mov r9d, ecx                       ; arithmetic
neg r9d                            ; for += 8 - popCount
add r9d, 0x20                      ;
add r12, r9                        ; Update writeLeft pos
sub r8, rcx                        ; Update writeRight pos
```

- We generate a vectorized pivot, once per partition
- Load 8 elements from somewhere.
- Compare to pivot, cast to `Vector256<float>` (because ¯\\_(ツ)_/¯)
- Generate an 8-bit mask from the comparison result
- Load permutation vector from table
- Permute data (partition)
- Store 8 elements to the left.
- Store 8 elements to the right.
- Count 1 bits ➡ how many elements are > pivot.
- Advance left by 8 - popCount.
- Advance right by popCount.

--

## Permutation Tables

- We need precomputed permutation tables
- 256 entries, one for every possible mask (2^8)
- We pre-generate them as part of the source code
- Let's look at a few entries, to get a feel for them

--

```csharp
static ReadOnlySpan<int> PermTable => new[] {
    0, 1, 2, 3, 4, 5, 6, 7,     // 0   => 0b00000000
    // ...
    3, 4, 5, 6, 7, 0, 1, 2,     // 7   => 0b00000111
    // ...
    0, 2, 4, 6, 1, 3, 5, 7,     // 170 => 0b10101010
    // ...
    0, 1, 2, 3, 4, 5, 6, 7,     // 255 => 0b11111111
};
```

- Everything stays in place
- Move 3 from left to right
- 4/4 split

--

## The outer loop

- The idea is we read 8 at every iteration
- Then write 8 **both** to left + right
- How can we keep everything in place?
- Remember: allocation is bad

--

## stackalloc to the rescue

- We cheat just a bit: ¯\\_(ツ)_/¯
- `stackalloc Vector256<int>[4]`
- Total temp memory: 128 bytes
- Constant, regardless of how much we're sorting

--

## Now what
- We read ahead 2 vectors from left + 2 from right
- Partition them separately
- Now we can read ahead of where we write
- Every iteration we decide to read from left/right
  - According to which "head" is closest to being overwritten
--

```csharp
while (readRight >= readLeft) {
    int *nextPtr;
    if (readLeft - writeLeft <= writeRight - readRight) {
        nextPtr = readLeft;
        readLeft += 8;
    } else {
        nextPtr = readRight;
        readRight -= 8;
    }
    var current = Avx.LoadDquVector256(nextPtr);
    //...
}
```

- Anything left to partition?
- Which side is closer to being overwritten?
  - Pick left
  - Pick right

--

## Yeah yeah, are we fast yet?

--

## Can we go faster?

- That "pick a side" is actually super bad for perf
- The branch is 100% random, data dependent
- Not fun!
- How bad?