<h1 id="this-goes-to-eleven-pt-5">This Goes to Eleven (Pt. 5/∞)</h1>
<p><em>damageboy · 2020-02-02 · <a href="https://bits.houmus.org/2020-02-02/this-goes-to-eleven-pt5">bits.houmus.org</a></em></p>
<p>I ended up going down the rabbit hole re-implementing array sorting with AVX2 intrinsics, and there’s no reason I should go down alone.</p>
<p>Since there’s a lot to go over here, I’ll split it up into a few parts:</p>
<ol>
<li>In <a href="/2020-01-28/this-goes-to-eleven-pt1">part 1</a>, we start with a refresher on <code class="highlighter-rouge">QuickSort</code> and how it compares to <code class="highlighter-rouge">Array.Sort()</code>.</li>
<li>In <a href="/2020-01-29/this-goes-to-eleven-pt2">part 2</a>, we go over the basics of vectorized hardware intrinsics, vector types, and go over a handful of vectorized instructions we’ll use in part 3. We still won’t be sorting anything.</li>
<li>In <a href="/2020-01-30/this-goes-to-eleven-pt3">part 3</a>, we go through the initial code for the vectorized sorting, and start seeing some payoff. We finish agonizing courtesy of the CPU’s branch predictor, throwing a wrench into our attempts.</li>
<li>In <a href="/2020-02-01/this-goes-to-eleven-pt4">part 4</a>, we go over a handful of optimization approaches that I attempted trying to get the vectorized partition to run faster, seeing what worked and what didn’t.</li>
<li>In this part, we’ll take a deep dive into how to deal with memory alignment issues.</li>
<li>In part 6, we’ll take a pause from vectorized partitioning to get rid of almost 100% of the remaining scalar code by implementing small, constant-size array sorting with yet more AVX2 vectorization.</li>
<li>In part 7, we’ll circle back and try to deal with a nasty slowdown left in our vectorized partitioning code.</li>
<li>In part 8, I’ll tell you the sad story of a very twisted optimization I managed to pull off while failing miserably at the same time.</li>
<li>In part 9, I’ll try some algorithmic improvements to milk those last drops of perf, or at least those that I can think of, from this code.</li>
</ol>
<h2 id="trying-to-squeeze-some-more-vectorized-juice">(Trying) to squeeze some more vectorized juice</h2>
<p>I thought it would be nice to show a bunch of things I ended up trying to improve performance.
I tried to keep most of these experiments in separate implementations, both the ones that yielded positive results and the failures. These can be seen in the original repo under the <a href="https://github.com/damageboy/VxSort/tree/research/VxSortResearch/Unstable/AVX2/Happy">Happy</a> and <a href="https://github.com/damageboy/VxSort/tree/research/VxSortResearch/Unstable/AVX2/Sad">Sad</a> folders.</p>
<p>While some worked and some didn’t, I think a bunch of these are worth mentioning, so here goes:</p>
<h3 id="aligning-our-expectations">Aligning our expectations</h3>
<center>
<object style="margin: auto; width: 90%" type="image/svg+xml" data="../assets/images/computer-architecture-caches-are-evil-quote.svg"></object>
</center>
<p>This quote, taken from Hennessy and Patterson’s <a href="https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-811905-1">“Computer Architecture: A Quantitative Approach, 6th Edition”</a>, traces all the way back to the fathers of modern-day computing in 1946. It can be taken as a foreboding warning of the pains awaiting anyone who deals with the complexity of memory hierarchies.</p>
<p>With modern computer hardware, CPUs <em>might</em> access memory more efficiently when it is naturally aligned: in other words, when the <em>address</em> we use is a multiple of some magical constant. The constant is classically the machine word size: 4/8 bytes on 32/64-bit machines. These constants are related to how the CPU is physically wired and constructed internally. Historically, older processors were very limited, either disallowing non-aligned memory access outright or severely penalizing it. To this day, very simple micro-controllers (like the ones you might find in IoT devices, for example) will exhibit such limitations around memory alignment, essentially forcing memory access to conform to multiples of 4/8 bytes. With more modern (read: more expensive) CPUs, these requirements have become increasingly relaxed. Most programmers can simply afford to <em>ignore</em> this issue. The last decade or so worth of modern processors is oblivious to this problem per-se, as long as we access memory within a <strong>single cache-line</strong>, or 64 bytes on almost any modern-day processor.</p>
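<p>To make “naturally aligned” concrete: an address is aligned to a (power-of-two) size exactly when its low bits are zero. Here is a tiny sketch of that check (my own illustration, not from the VxSort code):</p>

```csharp
// An address is aligned to `alignment` (a power of two) iff its
// low log2(alignment) bits are all zero
static bool IsAligned(ulong address, uint alignment) =>
    (address & (alignment - 1)) == 0;

// IsAligned(0x1000, 32) == true   (4096 is a multiple of 32)
// IsAligned(0x1004, 32) == false  (...but it is still 4-byte aligned)
// IsAligned(0x1004, 4)  == true
```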
<p>What is this cache-line? I’m actively fighting my internal inclination here: I <strong>won’t turn</strong> this post into a detour about computer micro-architecture. Caches have been covered elsewhere ad nauseam by far more talented writers, and I’d never do the topic justice anyway. Instead, I’ll just do the obligatory one-paragraph reminder, where we recall that CPUs don’t directly communicate with RAM, as it is dead slow; instead, they read and write from internal, on-die, special/fast memory called caches. Caches contain partial copies of RAM. Caches are faster, smaller, and organized in multiple levels (L1/L2/L3 caches, to name them), where each successive level is usually larger in size and slightly slower in terms of latency. When the CPU is instructed to access memory, it instead communicates with the cache units, but it never does so in small units. Even when our code is reading a <em>single byte</em>, the CPU will communicate with its cache subsystem in a unit-of-work known as a cache-line. In theory, every CPU model may have its own definition of a cache-line, but in practice, the last 15 years of processors seem to have converged on 64 bytes as that golden number.</p>
<p>Now, what happens when, let’s say, our read operations end up <strong>crossing</strong> cache-lines?</p>
<center>
<object style="margin: auto; width: 90%" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/cacheline-boundaries.svg"></object>
</center>
<p>As mentioned, the unit-of-work, as far as the CPU is concerned, is a 64-byte cache-line. Therefore, such reads literally cause the CPU to issue <em>two</em> read operations downstream, ultimately directed at the cache units<sup id="fnref:0" role="doc-noteref"><a href="#fn:0" class="footnote">1</a></sup>. These cache-line crossing reads <em>do</em> have a sustained effect on performance<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">2</a></sup>. But how often do they occur? Let’s consider this by way of example:<br />
Imagine we are processing a single array sequentially, reading 32-bit integers (4 bytes) at a time; if, for some reason, our starting address is <em>not</em> divisible by 4, cross cache-line reads would occur at a rate of <code class="highlighter-rouge">4/64</code> or <code class="highlighter-rouge">6.25%</code> of reads. Even this paltry rate of cross cache-line reads usually remains in the <em>realm of theory</em>, since we have the memory allocator and compiler working in tandem, behind the scenes, to make this go away:</p>
<ul>
<li>The default allocator <em>always</em> returns memory aligned at least to machine word size.</li>
<li>The compiler/JIT use padding bytes within our classes/structs in-between members, as needed, to ensure that individual members are aligned to 4/8 bytes.</li>
</ul>
<p>So far, I’ve told you why/when you <em>shouldn’t</em> care about alignment. This was my way of both easing you into the topic and helping you feel OK if this is news to you. You really can afford <em>not to think</em> about this without paying any penalty, for the most part. Unfortunately, this <strong>stops</strong> being true for <code class="highlighter-rouge">Vector256&lt;T&gt;</code>-sized reads, which are 32 bytes wide (256 bits / 8). And this is <em>doubly not true</em> for our partitioning problem:</p>
<ul>
<li>The memory handed to us for partitioning/sorting is rarely aligned to 32-bytes, except by dumb luck.<br />
The allocator, when allocating an array of 32-bit integers, simply doesn’t care about 32-<strong>byte</strong> alignment.</li>
<li>Even if it were magically aligned to 32-bytes, it would do us little good; once a <em>single</em> partition operation is complete, further sub-divisions, inherent to QuickSort, are determined by the (random) new placement of the last pivot we used.<br />
There is no way we will get lucky enough that <em>every partition</em> will be 32-byte aligned.</li>
</ul>
<p>Now that it is clear that we won’t be 32-byte aligned, we finally realize that as we go over the array sequentially (left to right and right to left, as we do), issuing <strong>unaligned</strong> 32-byte reads on top of 64-byte cache-lines, we end up reading across cache-lines on every <strong>other</strong> read, or at a rate of 50%! This just escalated from “…generally not a problem” to “Houston, we have a problem” very quickly.</p>
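<p>The 6.25% and 50% figures above are easy to sanity-check with a small simulation (a hypothetical helper, not part of the original code) that walks sequential reads and counts how many straddle a 64-byte boundary:</p>

```csharp
// Fraction of sequential reads of `readSize` bytes, starting at byte
// `offset` within a cache-line, that straddle a 64-byte boundary
static double SplitLoadRate(int readSize, int offset)
{
    const int CACHE_LINE = 64;
    const int N = 1024; // enough reads to cover every phase of the pattern
    var splits = 0;
    for (var i = 0; i < N; i++) {
        var start = (offset + i * readSize) % CACHE_LINE;
        if (start + readSize > CACHE_LINE)
            splits++;
    }
    return (double) splits / N;
}

// SplitLoadRate(4, 1)  == 0.0625 : misaligned 4-byte reads split 6.25% of the time
// SplitLoadRate(32, 4) == 0.5    : unaligned 32-byte reads split every other read
// SplitLoadRate(32, 0) == 0.0    : 32-byte aligned reads never split
```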
<p>You’ve endured a lot of hand-waving so far; let’s see if we can get some damning evidence for all of this by launching <code class="highlighter-rouge">perf</code>, this time tracking the oddly specific <code class="highlighter-rouge">mem_inst_retired.split_loads</code> HW counter:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="rouge-code"><pre><span class="nv">$ COMPlus_PerfMapEnabled</span><span class="o">=</span>1 perf record <span class="nt">-Fmax</span> <span class="nt">-e</span> mem_inst_retired.split_loads <span class="se">\</span>
./Example <span class="nt">--type-list</span> DoublePumpJedi <span class="nt">--size-list</span> 100000 <span class="se">\</span>
<span class="nt">--max-loops</span> 1000 <span class="nt">--no-check</span>
<span class="nv">$ </span>perf report <span class="nt">--stdio</span> <span class="nt">-F</span> overhead,sym | <span class="nb">head</span> <span class="nt">-20</span>
<span class="c"># To display the perf.data header info, please use --header/--header-only options.</span>
<span class="c"># Event count (approx.): 87102613</span>
<span class="c"># Overhead Symbol</span>
86.68% <span class="o">[</span>.] ...DoublePumpJedi::VectorizedPartitionInPlace<span class="o">(</span>int32<span class="k">*</span>,int32<span class="k">*</span><span class="o">)</span>
5.74% <span class="o">[</span>.] ...DoublePumpJedi::Sort<span class="o">(</span>int32<span class="k">*</span>,int32<span class="k">*</span>,int32<span class="o">)</span>
2.99% <span class="o">[</span>.] __memmove_avx_unaligned_erms
</pre></td></tr></tbody></table></code></pre></div></div>
<p>We ran the same sort operation <code class="highlighter-rouge">1,000</code> times and got <code class="highlighter-rouge">87,102,613</code> split-loads, with <code class="highlighter-rouge">86.68%</code> attributed to our partitioning function. This means <code class="highlighter-rouge">(87102613 * 0.8668) / 1000</code> or <code class="highlighter-rouge">75,500</code> split-loads <em>per sort</em> of <code class="highlighter-rouge">100,000</code> elements. To seal the deal, we need to figure out how many vector loads per sort we perform in the first place; luckily, I have statistics collection code embedded in the implementation, so I can generate an answer quickly with this command:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="nv">$ </span>./Example <span class="nt">--type-list</span> DoublePumpJedi <span class="se">\</span>
<span class="nt">--size-list</span> 100000 <span class="nt">--max-loops</span> 10000 <span class="se">\</span>
<span class="nt">--no-check</span> <span class="nt">--stats-file</span> jedi-100k-stats.json
</pre></td></tr></tbody></table></code></pre></div></div>
<p>And in return, I get this beautiful thing back:</p>
<table style="margin-bottom: 0em" class="notice--info">
<tr>
<td style="border: none; padding-top: 0; padding-bottom: 0; vertical-align: top"><span class="uk-label">Note</span></td>
<td style="border: none; padding-top: 0; padding-bottom: 0"><div>
<p>These numbers are vastly different than the ones we last saw in the end of the 3<sup>rd</sup> post, for example. There is a good reason for this: We’ve spent the previous post tweaking the code in a few considerable ways:</p>
<ul>
<li>Changing the cut-off point for vectorized sorting from 16 ⮞ 40, thereby reducing the number of vectorized partitions we’re performing in the first place.</li>
<li>Changing the permutation entry loading code to read 8-byte values from memory, rather than full 32-byte <code class="highlighter-rouge">Vector256&lt;int&gt;</code> entries,
cutting the number of <code class="highlighter-rouge">Vector256&lt;int&gt;</code> loads by half.</li>
</ul>
</div>
</td>
</tr>
</table>
<div>
<!-- <button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button> -->
<table class="table datatable" data-json="../_posts/jedi-stats.json" data-id-field="name" data-pagination="false" data-intro="Each row in this table contains statistics collected & averaged out of thousands of runs with random data" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="MethodName" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">Method<br />Name</span>
</th>
<th data-field="ProblemSize" data-sortable="true" data-value-type="int" data-filter-control="select">
<div data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="bottom" class="rotated-header-container">
<div class="rotated-header">Size</div>
</div>
</th>
<th data-field="MaxDepthScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="The maximal depth of recursion reached while sorting" data-position="top" class="rotated-header-container">
<div class="rotated-header">Max</div>
<div class="rotated-header">Depth</div>
</div>
</th>
<th data-field="NumPartitionOperationsScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="# of partitioning operations per sort" data-position="bottom" class="rotated-header-container">
<div class="rotated-header">Part</div>
<div class="rotated-header">itions</div>
</div>
</th>
<th data-field="NumVectorizedLoadsScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="# of vectorized load operations" data-position="top" class="rotated-header-container">
<div class="rotated-header">Vector</div>
<div class="rotated-header">Loads</div>
</div>
</th>
<th data-field="NumVectorizedStoresScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="# of vectorized store operations" data-position="bottom" class="rotated-header-container">
<div class="rotated-header">Vector</div>
<div class="rotated-header">Stores</div>
</div>
</th>
<th data-field="NumPermutationsScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="# of vectorized permutation operations" data-position="top" class="rotated-header-container">
<div class="rotated-header">Vector</div>
<div class="rotated-header">Permutes</div>
</div>
</th>
<th data-field="AverageSmallSortSizeScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="For hybrid sorting, the average size that each small sort operation was called with (e.g. InsertionSort)" data-position="bottom" class="rotated-header-container">
<div class="rotated-header">Small</div>
<div class="rotated-header">Sort</div>
<div class="rotated-header">Size</div>
</div>
</th>
<th data-field="NumScalarComparesScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="How many branches were executed in each sort operation that were based on the unsorted array elements" data-position="top" class="rotated-header-container">
<div class="rotated-header">Data</div>
<div class="rotated-header">Based</div>
<div class="rotated-header">Branches</div>
</div>
</th>
<th data-field="PercentSmallSortCompares" data-sortable="true" data-value-type="float2-percentage">
<div data-intro="What percent of<br/>⬅<br/>branches happened as part of small-sorts" data-position="bottom" class="rotated-header-container">
<div class="rotated-header">Small</div>
<div class="rotated-header">Sort</div>
<div class="rotated-header">Branches</div>
</div>
</th>
</tr>
</thead>
</table>
</div>
<p>In total, we perform <code class="highlighter-rouge">173,597</code> vector loads per sort operation of <code class="highlighter-rouge">100,000</code> elements in <code class="highlighter-rouge">4,194</code> partitioning calls. Assuming our array is aligned to 4-bytes to begin with (which C#’s allocator does very reliably), every partitioning call has a <code class="highlighter-rouge">4/32</code> or <code class="highlighter-rouge">12.5%</code> chance of ending up 32-byte aligned: in other words, <code class="highlighter-rouge">21,700</code> of the total vector reads should be aligned by sheer chance, which leaves <code class="highlighter-rouge">173597-21700</code> or <code class="highlighter-rouge">151,897</code> that should be <em>unaligned</em>, of which, I claim, ½ would cause split-loads: <code class="highlighter-rouge">50%</code> of <code class="highlighter-rouge">151,897</code> is <code class="highlighter-rouge">≈75,949</code>, while we measured <code class="highlighter-rouge">75,500</code> with <code class="highlighter-rouge">perf</code>! I don’t know how your normal day goes, but in mine, reality and my hallucinations rarely go hand-in-hand like this.</p>
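<p>For the skeptical, here is that back-of-the-envelope math spelled out in code (the constants are simply the numbers from the stats table and the <code class="highlighter-rouge">perf</code> run above):</p>

```csharp
const long totalVectorLoads = 173_597;   // per sort of 100,000 elements
const double alignedByChance = 4.0 / 32; // 12.5% of partitions start 32-byte aligned

// Loads issued from partitions that happen to be unaligned
var unalignedLoads = totalVectorLoads * (1 - alignedByChance); // ≈ 151,897

// Half of those straddle a cache-line boundary
var predictedSplits = unalignedLoads / 2;                      // ≈ 75,949

// perf measured ≈ 75,500 split-loads per sort: a suspiciously good match
```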
<p>Fine, we now <strong>know</strong> we have a problem. The first step was acknowledging/accepting reality: Our code does indeed generate a lot of split memory operations. Let’s consider our memory access patterns when reading/writing with respect to alignment, and see if we can do something about it:</p>
<ul>
<li>For writing, we’re all over the place: we always advance the write pointers according to how the data was partitioned, e.g. it is completely data-dependent, and there is little we can say about our write addresses. In addition, as it happens, Intel CPUs, like almost all other modern CPUs, employ another common trick in the form of <a href="https://en.wikipedia.org/wiki/Write_combining">store buffers, or write-combining buffers (WCBs)</a>. I’ll refrain from describing them here, but the bottom line is that we both can’t and don’t need to care about the writing side of our algorithm.</li>
<li>For reading, the situation is entirely different: We <em>always</em> advance the read pointers by 8 elements (32-bytes) on the one hand, and we even have a special intrinsic: <code class="highlighter-rouge">Avx.LoadAlignedVector256() / VMOVDQA</code><sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote">3</a></sup> that helps us ensure that our reading is properly aligned to 32-bytes.</li>
</ul>
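<p>For reference, here is what the two load flavors look like in C#. This is just an illustrative sketch (requires compiling with <code class="highlighter-rouge">/unsafe</code>); note that VMOVDQA faults on a misaligned address, which conveniently doubles as a runtime assertion that our alignment logic actually works:</p>

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe class Loads
{
    // VMOVDQU: accepts any address, but may split across two cache-lines
    public static Vector256<int> Unaligned(int* p) => Avx.LoadVector256(p);

    // VMOVDQA: requires p to be 32-byte aligned, faults otherwise
    public static Vector256<int> Aligned(int* p) => Avx.LoadAlignedVector256(p);
}
```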
<h4 id="aligning-to-cpu-cache-lines-1">Aligning to CPU Cache-lines: :+1:</h4>
<p>With this lengthy introduction out of the way, it’s time we do something about these cross cache-line reads. Initially, I got “something” working quickly: remember that we needed to deal with the <em>remainder</em> of the array (when we have fewer than 8 elements) anyway. In the original code at the end of the 3<sup>rd</sup> post, we did so right after our vectorized loop. If we move that scalar code from the end of the function to its beginning, while also modifying it to perform scalar partitioning until both <code class="highlighter-rouge">readLeft</code>/<code class="highlighter-rouge">readRight</code> pointers are aligned to 32 bytes, our work is complete. There is a slight wrinkle in this otherwise simple approach:</p>
<ul>
<li>Previously, we had anywhere between <code class="highlighter-rouge">0-7</code> elements left as a remainder for scalar partitioning per partition call.
<ul>
<li><code class="highlighter-rouge">3.5</code> elements on average.</li>
</ul>
</li>
<li>Aligning from the edges of our partition with scalar code means we will now have <code class="highlighter-rouge">0-7</code> elements per-side…
<ul>
<li>So <code class="highlighter-rouge">3.5 x 2 == 7</code> elements on average.</li>
</ul>
</li>
</ul>
<p>In other words, doing this sort of inwards pre-alignment optimization is not a clean win: We end up with more scalar work than before (which is unfortunate), but on the other hand, we can change the vector loading code to use <code class="highlighter-rouge">Avx.LoadAlignedVector256()</code> and <em>know for sure</em> that we will no longer be causing the CPU to issue a single cross cache-line read (the latter being the performance boost).<br />
It’s understandable if, while reading this, your gut reaction is to think that adding 3.5 scalar operations per side doesn’t sound like much of a trade-off, but we have to consider that:</p>
<ul>
<li>Each scalar comparison comes with a likely branch misprediction, as discussed before, so it has a higher cost than what you might be initially pricing in.</li>
<li>More importantly: we can’t forget that this is a recursive function, with ever <em>decreasing</em> partition sizes. If you go back to the initial stats we collected in previous posts, you’ll be quickly reminded that we partition upwards of 340k times for 1 million element arrays, so this scalar work both piles up, and represents a larger portion of our workload as the partition sizes decrease…</li>
</ul>
<p>I won’t bother showing the entire code listing for <a href="https://github.com/damageboy/VxSort/blob/research/VxSortResearch/Unstable/AVX2/Happy/B5_1_DoublePumpAligned.cs"><code class="highlighter-rouge">B5_1_DoublePumpAligned.cs</code></a>, but I will show the rewritten scalar partition block, which is now tasked with aligning our pointers before we go full vectorized partitioning. Originally it was right after the double-pumped loop and looked like this:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre> <span class="c1">// ...</span>
<span class="k">while</span> <span class="p">(</span><span class="n">readLeft</span> <span class="p"><</span> <span class="n">readRight</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">v</span> <span class="p">=</span> <span class="p">*</span><span class="n">readLeft</span><span class="p">++;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">v</span> <span class="p"><=</span> <span class="n">pivot</span><span class="p">)</span> <span class="p">{</span>
<span class="p">*</span><span class="n">tmpLeft</span><span class="p">++</span> <span class="p">=</span> <span class="n">v</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="p">*--</span><span class="n">tmpRight</span> <span class="p">=</span> <span class="n">v</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>The aligned variant, with the alignment code now at the top of the function, looks like this:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
</pre></td><td class="rouge-code"><pre> <span class="k">const</span> <span class="kt">ulong</span> <span class="n">ALIGN</span> <span class="p">=</span> <span class="m">32</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">ulong</span> <span class="n">ALIGN_MASK</span> <span class="p">=</span> <span class="n">ALIGN</span> <span class="p">-</span> <span class="m">1</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(((</span><span class="kt">ulong</span><span class="p">)</span> <span class="n">readLeft</span> <span class="p">&</span> <span class="n">ALIGN_MASK</span><span class="p">)</span> <span class="p">!=</span> <span class="m">0</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">nextAlign</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span> <span class="p">*)</span> <span class="p">(((</span><span class="kt">ulong</span><span class="p">)</span> <span class="n">readLeft</span> <span class="p">+</span> <span class="n">ALIGN</span><span class="p">)</span> <span class="p">&</span> <span class="p">~</span><span class="n">ALIGN_MASK</span><span class="p">);</span>
<span class="k">while</span> <span class="p">(</span><span class="n">readLeft</span> <span class="p"><</span> <span class="n">nextAlign</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">v</span> <span class="p">=</span> <span class="p">*</span><span class="n">readLeft</span><span class="p">++;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">v</span> <span class="p"><=</span> <span class="n">pivot</span><span class="p">)</span> <span class="p">{</span>
<span class="p">*</span><span class="n">tmpLeft</span><span class="p">++</span> <span class="p">=</span> <span class="n">v</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="p">*--</span><span class="n">tmpRight</span> <span class="p">=</span> <span class="n">v</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">Debug</span><span class="p">.</span><span class="nf">Assert</span><span class="p">(((</span><span class="kt">ulong</span><span class="p">)</span> <span class="n">readLeft</span> <span class="p">&</span> <span class="n">ALIGN_MASK</span><span class="p">)</span> <span class="p">==</span> <span class="m">0</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(((</span><span class="kt">ulong</span><span class="p">)</span> <span class="n">readRight</span> <span class="p">&</span> <span class="n">ALIGN_MASK</span><span class="p">)</span> <span class="p">!=</span> <span class="m">0</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">nextAlign</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span> <span class="p">*)</span> <span class="p">((</span><span class="kt">ulong</span><span class="p">)</span> <span class="n">readRight</span> <span class="p">&</span> <span class="p">~</span><span class="n">ALIGN_MASK</span><span class="p">);</span>
<span class="k">while</span> <span class="p">(</span><span class="n">readRight</span> <span class="p">></span> <span class="n">nextAlign</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">v</span> <span class="p">=</span> <span class="p">*--</span><span class="n">readRight</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">v</span> <span class="p"><=</span> <span class="n">pivot</span><span class="p">)</span> <span class="p">{</span>
<span class="p">*</span><span class="n">tmpLeft</span><span class="p">++</span> <span class="p">=</span> <span class="n">v</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="p">*--</span><span class="n">tmpRight</span> <span class="p">=</span> <span class="n">v</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">Debug</span><span class="p">.</span><span class="nf">Assert</span><span class="p">(((</span><span class="kt">ulong</span><span class="p">)</span> <span class="n">readRight</span> <span class="p">&</span> <span class="n">ALIGN_MASK</span><span class="p">)</span> <span class="p">==</span> <span class="m">0</span><span class="p">);</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>What it does now is check, for each side, whether alignment is necessary, and then proceed to align the pointer while also partitioning the stray elements into the temporary memory.</p>
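<p>The two <code class="highlighter-rouge">nextAlign</code> computations above rely on the classic power-of-two rounding trick. Isolated into a standalone sketch (with a hypothetical <code class="highlighter-rouge">Align</code> helper, using the same constants as the listing):</p>

```csharp
static class Align
{
    const ulong ALIGN = 32;
    const ulong ALIGN_MASK = ALIGN - 1; // 0b1_1111

    // Round down: clear the low 5 bits (the right-hand side's nextAlign)
    public static ulong Down(ulong address) => address & ~ALIGN_MASK;

    // Round up to the *next* 32-byte boundary (the left-hand side's
    // nextAlign); like the listing above, this assumes `address` is not
    // already aligned -- an aligned address would be bumped a full 32 bytes
    public static ulong Up(ulong address) => (address + ALIGN) & ~ALIGN_MASK;
}

// Align.Down(0x1007) == 0x1000
// Align.Up(0x1007)   == 0x1020
```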
<p>Where do we end up performance-wise with this optimization?</p>
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#50d405d8-6a9a-4b68-9b7f-20445b335308'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-stats"></i> Scaling</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks</a></li>
<li><a href="#"><i class="glyphicon glyphicon-info-sign"></i> Setup</a></li>
</ul>
<ul id="50d405d8-6a9a-4b68-9b7f-20445b335308" class="uk-switcher uk-margin">
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Performance scale: Array.Sort (solid gray) is always 100%, and the other methods are scaled relative to it" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
Jedi, 1 , 1 , 1 , 1 , 1 , 1
Aligned, 1.082653616, 1.091733385, 0.958578753, 0.959159569, 0.964604818, 0.980102965
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3}
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3}
}
]
},
"options": {
"title": { "text": "AVX2 Aligned Sorting - Scaled to Jedi", "display": true },
"scales": {
"yAxes": [{
"ticks": {
"fontFamily": "Indie Flower",
"min": 0.90,
"callback": "ticksPercent"
},
"scaleLabel": {
"labelString": "Scaling (%)",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":true,"labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
Jedi, 18.3938 ,20.7342 ,24.6347 ,26.9067 ,23.9922 ,25.5122
Aligned, 19.9128, 22.6363, 23.6143, 25.8078, 23.143, 25.0046
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3}
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3}
}
]
},
"options": {
"title": { "text": "AVX2 Jedi Sorting + Aligned - log(Time/N)", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"min": 15,
"max": 28,
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt5_1_Int32_-report.datatable.json" data-id-field="name" data-pagination="false" data-page-list="[9, 18]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result is. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nv">BenchmarkDotNet</span><span class="o">=</span>v0.12.0, <span class="nv">OS</span><span class="o">=</span>clear-linux-os 32120
Intel Core i7-7700HQ CPU 2.80GHz <span class="o">(</span>Kaby Lake<span class="o">)</span>, 1 CPU, 4 logical and 4 physical cores
.NET Core <span class="nv">SDK</span><span class="o">=</span>3.1.100
<span class="o">[</span>Host] : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
Job-DEARTS : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
<span class="nv">InvocationCount</span><span class="o">=</span>3 <span class="nv">IterationCount</span><span class="o">=</span>15 <span class="nv">LaunchCount</span><span class="o">=</span>2
<span class="nv">UnrollFactor</span><span class="o">=</span>1 <span class="nv">WarmupCount</span><span class="o">=</span>10
<span class="nv">$ </span><span class="nb">grep</span> <span class="s1">'stepping\|model\|microcode'</span> /proc/cpuinfo | <span class="nb">head</span> <span class="nt">-4</span>
model : 158
model name : Intel<span class="o">(</span>R<span class="o">)</span> Core<span class="o">(</span>TM<span class="o">)</span> i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>The whole attempt ends up as a mediocre improvement, so it would seem:</p>
<ul>
<li>We’re seeing a speedup/improvement in the higher element counts.</li>
<li>We seem to be slowing down in the lower problem sizes, due to the higher scalar operation count.</li>
</ul>
<p>It’s kind of a mixed bag, and perhaps slightly unimpressive at first glance. However, when we stop to remember that we somehow managed to speed up the function while doubling the amount of scalar work done, the interpretation of the results becomes more nuanced: The pure benefit from alignment itself is larger than what the results are showing right now since it’s being masked, to some extent, by the extra scalar work we tacked on. If only there was a way we could skip that scalar work altogether… If only there was a way… If only…</p>
</div>
<h3 id="re-partitioning-overlapping-regions-1-1">(Re-)Partitioning overlapping regions: :+1: :+1:</h3>
<p>Next up is a different optimization approach to the same problem, and a natural progression from the last one. At the risk of sounding pompous, I think I <em>might</em> have found something here that no-one has done before in the context of partitioning<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote">4</a></sup>: The basic idea here is we get rid of all (ok, ok, <em>almost all</em>) scalar partitioning in our vectorized code path. If we can partition and align the edges of the segment we are about to process with vectorized code, we would be reducing the total number of instructions executed. At the same time, we would be retaining more of the speed-up that was lost with the alignment optimization above. This would have a double-whammy compounded effect. But how?</p>
<object style="margin: auto" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/overlap-partition-with-hint.svg"></object>
<p>We could go about it the other way around! Instead of aligning <em>inwards</em> in each respective direction, we could align <strong><em>outwards</em></strong> and enlarge the partitioned segment to include a few more (up to 7) elements on the outer rims of each partition and <u>re-partition</u> them using the new pivot we’ve just selected. If this works, we end up doing both 100% aligned reads and eliminating all scalar work in one optimization! This might <em>sound simple</em> and <strong>safe</strong>, but this is the sort of humbling experience that QuickSort is quick at dispensing (sorry, I had to…) to people trying to nudge it the wrong way. At some point, I was finally able to screw my own head on properly with respect to this re-partitioning attempt and figure out precisely which critical constraints we must respect for this to work.</p>
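<p>To put numbers on those “up to 7 elements”, here is a minimal sketch (the naming is mine, not VxSort’s) of how many extra elements each edge would need to “borrow” so that both edge reads land on a 32-byte boundary, assuming 4-byte elements and a 32-byte-aligned array base:</p>

```csharp
using System;

public static class OutwardAlignment
{
    // Illustrative constants; the real code does this with pointer arithmetic.
    const int AlignBytes  = 32;                       // Vector256<int> spans 32 bytes
    const int ElemsPerVec = AlignBytes / sizeof(int); // 8 x 32-bit elements

    // firstElem/lastElem are element offsets (first and one-past-last) measured
    // from a 32-byte-aligned array base. Returns how many extra elements each
    // side must borrow (0..7) so both edge vector loads become aligned.
    public static (int left, int right) OverlapElements(int firstElem, int lastElem)
    {
        int left  = firstElem % ElemsPerVec;
        int right = (ElemsPerVec - lastElem % ElemsPerVec) % ElemsPerVec;
        return (left, right);
    }
}
```

<p>For example, a segment spanning elements 3..13 would borrow 3 elements on the left and 3 on the right; a segment already spanning 8..16 would borrow none.</p>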
<table style="margin-bottom: 0em" class="notice--info">
<tr>
<td style="border: none; padding-top: 0; padding-bottom: 0; vertical-align: top"><span class="uk-label">Note</span></td>
<td style="border: none; padding-top: 0; padding-bottom: 0"><div>
<p>This is a slightly awkward optimization when you consider that I’m suggesting we should <strong>partition more data</strong> in order to <em>speed up</em> our code. This sounds bonkers, unless we dig deep within for some mechanical empathy: not all work is equal in the eyes of the CPU. When we are executing scalar partitioning on <em>n</em> elements, we are really telling the CPU to execute <em>n</em> branches, comparisons, and memory accesses, which are completely data-dependent. The CPU “hates” this sort of work. It has to guess what happens next, and will do so no better than flipping a coin, or 50%, for truly random data. What’s worse, as mentioned before, whenever the CPU mispredicts, there’s a price to pay in the form of a full pipeline flush which roughly costs us 14-15 cycles on a modern CPU. Paying this <strong>once</strong>, is roughly equivalent to partitioning 2 x 8 element vectors with our vectorized partition block! This is the reason that doing “more” might be faster.</p>
</div>
</td>
</tr>
</table>
<p>Back to the constraints. There’s one thing we can <strong>never</strong> do: move a pivot that was previously partitioned. I (now) call them “buried pivots” (since they’re in their final resting place, get it?); everyone knows, you don’t move around dead bodies, that’s always the first bad thing that happens in a horror movie. There’s our motivation: not being the stupid person who dies first. That’s about it. It sounds simple, but it requires some more serious explanation: When a previous partition operation is complete, the pivot used during that operation is moved to its final resting place. Its new position is used to subdivide the array, and is effectively stored throughout numerous call stacks of our recursive function. There’s a baked-in assumption here that all data left/right of that buried pivot is smaller/larger than it. And that assumption must <strong>never</strong> be broken. If we intend to <strong>re-partition</strong> data to the left and right of a given partition, as part of this overlapping alignment effort, we need to consider that this extra data might already contain buried pivots, and we cannot, under any circumstances, ever move them again.<br />
In short: Buried pivots stay buried where we left them, or bad things happen.</p>
<p>When we call our partitioning operation, we have to consider what initially looks like an asymmetry of the left and right edges of our to-be-partitioned segment:</p>
<ul>
<li>For the left side:
<ul>
<li>There might not be additional room on the left with extra data to read from.
<ul>
<li>We are too close to the edge of the array on the left side!<br />
This happens for all partitions starting at the left-edge of the entire array.</li>
</ul>
</li>
<li>Since we always partition left of any buried pivot first, then right of it, we know for a fact that all elements left of “our” partition at any given moment are sorted, i.e. they are all buried pivots, and we can’t re-order them.</li>
<li><em>Important:</em> We also know that each of those values is smaller than or equal to whatever pivot value we <em>will select</em> for the current partitioning operation.</li>
</ul>
</li>
<li>For the right side, it is almost the same set of constraints:
<ul>
<li>There might not be additional room on the right with extra data to read from.
<ul>
<li>We are too close to the edge of the array on the right side!<br />
This happens for all partitions ending on the right-edge of the entire array.</li>
</ul>
</li>
<li>The immediate value to our right side is a buried pivot, and all other values to its right are larger-than-or-equal to it.</li>
<li>There might be additional pivots immediately to our right as well.</li>
<li><em>Important:</em> We also know that each of those values is larger-than-or-equal to whatever pivot value we <em>will select</em> for the current partitioning operation.</li>
</ul>
</li>
</ul>
<p>All this information is hard to integrate at first, but what it boils down to is that whenever we load up the left overlapping vector, there are anywhere between 1 and 7 elements we are <strong>not</strong> allowed to reorder on the <em>left side</em>, and when we load the right overlapping vector, there are, again, anywhere between 1 and 7 elements we are <strong>not</strong> allowed to re-order on <em>that right side</em>. That’s the challenge; the good news is that all those overlapping elements are also guaranteed to be smaller/larger than whatever pivot we end up selecting from our original (sans overlap) partition. This knowledge gives us the edge we need: We know in advance that the extra elements will generate predictable comparison results compared to <em>any</em> pivot <em>within</em> our partition.</p>
<p>What we need are permutation entries that are <strong><em>stable</em></strong>. I’m coining this phrase freely as I’m going along:<br />
Stable partitioning means that the partitioning operation <strong>must not</strong> <em>reorder</em> values that need to go on the left amongst themselves (their internal ordering is preserved). Likewise, it <strong>must not</strong> reorder the values that go on the right amongst themselves. If we manage to do this, we’re in the clear: The combination of stable permutation and predictable comparison results means that the overlapping elements will stay put while other elements will be partitioned properly on both edges of our overlapping partition. After this weird permutation, we just need to forget we ever read those extra elements, and the whole thing just… works? … yes!</p>
<p>Let’s start with cementing this idea of what stable partitioning is: Up to this point, there was no such requirement, and the initial partition tables I generated failed to satisfy this requirement.
Here’s a simple example of stable/unstable permutation entries; let’s imagine we partition the following values around a pivot value of 500:</p>
<table>
<thead>
<tr>
<th>Bit</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
</tr>
</thead>
<tbody>
<tr>
<td><code class="highlighter-rouge">Vector256<T></code> Value</td>
<td>99</td>
<td>100</td>
<td>666</td>
<td>101</td>
<td>102</td>
<td>777</td>
<td>888</td>
<td>999</td>
</tr>
<tr>
<td>Mask</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>Unstable Permutation</td>
<td>0</td>
<td>1</td>
<td><strong>7</strong></td>
<td>2</td>
<td>3</td>
<td><strong>6</strong></td>
<td><strong>5</strong></td>
<td><strong>4</strong></td>
</tr>
<tr>
<td>Unstable Result</td>
<td>99</td>
<td>100</td>
<td>101</td>
<td>102</td>
<td><strong>999</strong></td>
<td><strong>888</strong></td>
<td><strong>777</strong></td>
<td><strong>666</strong></td>
</tr>
<tr>
<td>Stable Permutation</td>
<td>0</td>
<td>1</td>
<td>4</td>
<td>2</td>
<td>3</td>
<td>5</td>
<td>6</td>
<td>7</td>
</tr>
<tr>
<td>Stable Result</td>
<td>99</td>
<td>100</td>
<td>101</td>
<td>102</td>
<td>666</td>
<td>777</td>
<td>888</td>
<td>999</td>
</tr>
</tbody>
</table>
<p>In the above example, the unstable permutation is a perfectly <em><u>valid</u></em> permutation for general case partitioning. It successfully partitions the sample vector around the pivot value of 500, but the 4 elements marked in bold are re-ordered with respect to each other when compared to the original array. In the stable permutation entry, the internal ordering amongst the partitioned groups is <em>preserved</em>.</p>
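<p>The stability property can be checked mechanically. Here is a small sketch (names are mine) that applies a permutation the way the table above lists it: lane <em>i</em> of the source moves to lane <em>perm[i]</em> of the result (note this is the inverse of the gather-style <code class="highlighter-rouge">dst[i] = src[perm[i]]</code> convention that the <code class="highlighter-rouge">VPERMD</code>/<code class="highlighter-rouge">PermuteVar8x32</code> instruction itself uses). It then tests whether the permutation is a stable partition for a given mask:</p>

```csharp
using System;
using System.Linq;

public static class StablePermutationDemo
{
    // Scatter-apply a permutation: lane i of the source goes to lane perm[i].
    public static int[] Scatter(int[] src, int[] perm)
    {
        var dst = new int[src.Length];
        for (int i = 0; i < src.Length; i++)
            dst[perm[i]] = src[i];
        return dst;
    }

    // A permutation is a *stable* partition for a given mask when the result is
    // exactly: all mask==0 lanes in their original order, followed by all
    // mask==1 lanes in their original order. Where() preserves order, which is
    // exactly the property we want to check against.
    public static bool IsStablePartition(int[] src, int[] mask, int[] perm)
    {
        var expected = src.Where((_, i) => mask[i] == 0)
                          .Concat(src.Where((_, i) => mask[i] == 1));
        return expected.SequenceEqual(Scatter(src, perm));
    }
}
```

<p>Running this against the two entries from the table flags the unstable one and accepts the stable one.</p>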
<p>Armed with new, stable permutation entries, we can proceed with this overlapping re-partitioning hack: The idea is to find the optimal alignment point on the left and on the right (assuming one is available, i.e. there is enough room on that side), read that data with the <code class="highlighter-rouge">LoadAlignedVector256</code> intrinsic, and partition it into the temporary area. The final twist: We need to keep tabs on how many elements <em>do not belong</em> to this partition (i.e. they originate from our overlap gymnastics), and remember not to copy them back into our partition at the end of the function, relying on our stable partitioning to keep them grouped at the edges of the temporary buffer we’re copying from… To my amazement, that was kind of it. It just works! (I’ve conveniently ignored a small edge-case here in words, but not in the code :).</p>
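<p>The bookkeeping above can be modeled in scalar code. This is a sketch of the idea only (the real code does this with 8-element vectors and permutation tables, and the names are mine); it also sidesteps the pivot-valued-ties edge case alluded to above by assuming the borrowed right-side elements are strictly larger than the pivot:</p>

```csharp
using System;
using System.Linq;

public static class OverlinedModel
{
    // 1. Widen [left, right) by the borrowed overlap elements on each side,
    // 2. stably partition the widened range into a temporary buffer,
    // 3. copy back only the middle, leaving the borrowed edges untouched.
    public static void PartitionWithOverlap(int[] a, int left, int right,
                                            int leftOverlap, int rightOverlap,
                                            int pivot)
    {
        int lo = left - leftOverlap, hi = right + rightOverlap;
        var widened = a.Skip(lo).Take(hi - lo).ToArray();

        // LINQ's Where() preserves order, so this is a *stable* partition:
        var tmp = widened.Where(v => v <= pivot)
                         .Concat(widened.Where(v => v > pivot))
                         .ToArray();

        // Stability guarantees the borrowed elements ended up grouped at the
        // outer edges of tmp, so skipping them on copy-back leaves them put:
        Array.Copy(tmp, leftOverlap, a, left, right - left);
    }
}
```

<p>With a buried pivot on each rim (e.g. <code class="highlighter-rouge">[2, 3 | 9, 4, 12, 8 | 20, 30]</code> partitioned around 8 with a 2-element overlap on each side), the middle gets partitioned while positions 0–1 and 6–7 come out untouched.</p>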
<p>The end result is super delicate. If you feel you’ve got it, skip this paragraph, but if you need an alternative view on how this works, here it is: I’ve just described how to partition the initial 2x8 elements (8 on each side); out of those initial 8, We <em>always</em> have a subset that must <strong>never</strong> be reordered (the overlap), and a subset we need to re-order, as is normal, with respect to some pivot. We know that whatever <em>possible</em> pivot value <em>might</em> be selected from our internal partition, it will always be larger/smaller than the elements in the overlapping areas. Knowing that, we can rely on having stable permutation entries that <strong>do not</strong> reorder those extra elements. In the end, we read extra elements, feed them through our partitioning machine, but ignore the extra overlapping elements and avoid <em>all</em> scalar partitioning thanks to this scheme.</p>
<p>In the end, we literally get to eat our cake and keep it whole: For the 99% case we <strong>kill</strong> scalar partitioning all-together, doing <em>zero</em> scalar work, at the same time aligning everything to <code class="highlighter-rouge">Vector256<T></code> size and being nice to our processor. Just to make this victory a tiny touch sweeter, even the <em>initial</em> 2x8 partially overlapping vectors are read using aligned reads!
I named this approach “overligned” (overlap + align) in my code-base; it is available in full in <a href="https://github.com/damageboy/VxSort/blob/research/VxSortResearch/Unstable/AVX2/Happy/B5_2_DoublePumpOverlined.cs"><code class="highlighter-rouge">B5_2_DoublePumpOverlined.cs</code></a>. It implements this overlapping alignment approach, with some extra small points for consideration:</p>
<ul>
<li>When it is <strong>impossible</strong> to align outwards, we fall back to the alignment mechanic introduced in the previous section.<br />
This is uncommon: Going back to the statistical data we collected about random-data sorting in the 3<sup>rd</sup> post, we anticipate a recursion depth of around 40 when sorting 1M elements and ~340K partitioning calls. We will have <em>at least</em> 40x2 (for both sides) such cases where we align inwards for that 1M case, as an example. This is small change compared to the <code class="highlighter-rouge">340K - 80</code> calls we can optimize with outward alignment, but it does mean we have to keep that old code lying around.</li>
<li>Once we calculate for a given partition how much alignment is required on each side, we can cache that calculation recursively for the entire depth of the recursive call stack: This again reduces the overhead we are paying for this alignment strategy.
In the code you’ll see I’m squishing two 32-bit integers into a 64-bit value I call <code class="highlighter-rouge">alignHint</code>, and I keep reusing one half of that 64-bit value without recalculating the alignment <em>amount</em>; if we’ve made it this far, let’s shave a few more cycles off while we’re here.</li>
</ul>
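<p>The <code class="highlighter-rouge">alignHint</code> packing can be sketched along these lines (the field layout and helper names are my own guesses at the spirit of the code, not VxSort’s actual implementation):</p>

```csharp
using System;

public static class AlignHintDemo
{
    // Pack two 32-bit alignment amounts into one 64-bit hint:
    // low 32 bits = left alignment, high 32 bits = right alignment.
    public static long Pack(int leftAlign, int rightAlign) =>
        (uint)leftAlign | ((long)rightAlign << 32);

    public static int Left(long hint)  => (int)(uint)hint;
    public static int Right(long hint) => (int)(hint >> 32);

    // When recursing into the left half, the left alignment stays valid and
    // only the right half of the hint needs replacing (and vice versa):
    public static long ReplaceRight(long hint, int newRight) =>
        (hint & 0xFFFFFFFFL) | ((long)newRight << 32);
}
```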
<p>There’s another small optimization I tacked on to this version, which I’ll discuss immediately after providing the results:</p>
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#3d6bdb20-d0b7-4c05-ae7d-d6aa78662bad'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-stats"></i> Scaling</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks</a></li>
<li><a href="#"><i class="glyphicon glyphicon-info-sign"></i> Setup</a></li>
</ul>
<ul id="3d6bdb20-d0b7-4c05-ae7d-d6aa78662bad" class="uk-switcher uk-margin">
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Performance scale: Array.Sort (solid gray) is always 100%, and the other methods are scaled relative to it" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,64K,100K,1M,1.5M,10M
Jedi, 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1
Overlined, 1.012312, 0.995069647, 0.904921232, 0.905092554, 0.915092554, 0.9212314, 0.929801383, 0.960170878
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3}
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3}
}
]
},
"options": {
"title": { "text": "AVX2 Overlined Sorting - Scaled to Jedi", "display": true },
"scales": {
"yAxes": [{
"ticks": {
"fontFamily": "Indie Flower",
"min": 0.88,
"callback": "ticksPercent"
},
"scaleLabel": {
"labelString": "Scaling (%)",
"display": true
}
}]
},
"annotation": {
"annotations": [{
"drawTime": "afterDatasetsDraw",
"type": "line",
"mode": "vertical",
"scaleID": "x-axis-0",
"value": "1.5M",
"borderColor": "#666666",
"borderWidth": 2,
"borderDash": [5, 5],
"borderDashOffset": 5,
"label": {
"yAdjust": 5,
"backgroundColor": "rgba(255, 0, 0, 0.75)",
"fontFamily": "Indie Flower",
"fontSize": 14,
"content": "L3 Cache Size",
"enabled": true
}
},
{
"drawTime": "afterDatasetsDraw",
"type": "line",
"mode": "vertical",
"scaleID": "x-axis-0",
"value": "64K",
"borderColor": "#666666",
"borderWidth": 2,
"borderDash": [5, 5],
"borderDashOffset": 5,
"label": {
"yAdjust": 65,
"backgroundColor": "rgba(255, 0, 0, 0.75)",
"fontFamily": "Indie Flower",
"fontSize": 14,
"content": "L2 Cache Size",
"enabled": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
Jedi, 19.4547, 20.8907, 23.8802, 24.7229, 22.8053, 25.7011
Overlined, 20.092, 20.7878, 21.6097, 22.6238, 21.2044, 24.6774
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3}
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3}
}
]
},
"options": {
"title": { "text": "AVX2 Jedi Sorting + Overlined - log(Time/N)", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"min": 15,
"max": 28,
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt5_2_Int32_-report.datatable.json" data-id-field="name" data-pagination="false" data-page-list="[9, 18]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result it. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nv">BenchmarkDotNet</span><span class="o">=</span>v0.12.0, <span class="nv">OS</span><span class="o">=</span>clear-linux-os 32120
Intel Core i7-7700HQ CPU 2.80GHz <span class="o">(</span>Kaby Lake<span class="o">)</span>, 1 CPU, 4 logical and 4 physical cores
.NET Core <span class="nv">SDK</span><span class="o">=</span>3.1.100
<span class="o">[</span>Host] : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
Job-DEARTS : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
<span class="nv">InvocationCount</span><span class="o">=</span>3 <span class="nv">IterationCount</span><span class="o">=</span>15 <span class="nv">LaunchCount</span><span class="o">=</span>2
<span class="nv">UnrollFactor</span><span class="o">=</span>1 <span class="nv">WarmupCount</span><span class="o">=</span>10
<span class="nv">$ </span><span class="nb">grep</span> <span class="s1">'stepping\|model\|microcode'</span> /proc/cpuinfo | <span class="nb">head</span> <span class="nt">-4</span>
model : 158
model name : Intel<span class="o">(</span>R<span class="o">)</span> Core<span class="o">(</span>TM<span class="o">)</span> i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>This is much better! The improvement is much more pronounced here, and we have a lot to consider:</p>
<ul>
<li>The performance improvements are not spread evenly throughout the range of problem sizes.</li>
<li>I’ve conveniently included two vertical markers; per my specific machine model, they show the sizes of the L2/L3 caches translated to the <code class="highlighter-rouge">#</code> of 32-bit elements in our array.</li>
<li>It can be clearly seen that as long as we’re sorting roughly within the size of our L2-L3 cache size range, this optimization pays in spades: we’re seeing ~10% speedup in runtime in many cases!</li>
<li>It is also clear that as we progress outside the size of the L2 into the L3 cache size, and ultimately exhaust the size of our caches entirely, the returns on this optimization diminish gradually.</li>
<li>While not shown here, since I’ve lost access to that machine, on older Intel/AMD machines, where only one load operation can be executed by the processor at any given time (Example: Intel Broadwell processors), this can lead to an improvement of 20% in total runtime; this should make sense: the fewer load ports the CPU has, the better this split-load reducing technique performs.</li>
<li>Another thing to consider is that in future variations of this code, when I finally get access to, and the ability to use, AVX-512 with its 64-byte wide registers, the effects of this optimization will be much more pronounced again, for a different reason: with vector registers spanning 64 bytes each, split-loading becomes a bigger problem (every single unaligned read becomes a split-load), so removing it is even more important.</li>
</ul>
</div>
<p>As the problem size goes beyond the size of the L2 cache, we are hit with the realities of CPU cache latency numbers. As a service to the reader, here is a visual representation of the <a href="https://www.7-cpu.com/cpu/Skylake_X.html">latency numbers for a Skylake-X CPU</a> running at 4.3 GHz:</p>
<center>
<object style="margin: auto; width: 90%" type="image/svg+xml" data="../assets/images/latency.svg"></object>
</center>
<p>The small number of cycles we tack on to memory operations as the penalty for split-loading (7 in this diagram) is very real when compared to regular L1/L2 cache latency. But once we compare it to L3 or RAM latency, it becomes abundantly clear why we are seeing diminishing returns for this optimization; the penalty is simply too small to notice at those work points.</p>
<p>Finally, for this optimization, we must never forget our motto of trusting no one and nothing. Let’s double-check what the current state of affairs is as far as <code class="highlighter-rouge">perf</code> is concerned:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="rouge-code"><pre><span class="nv">$ </span>perf record <span class="nt">-Fmax</span> <span class="nt">-e</span> mem_inst_retired.split_loads <span class="se">\</span>
./Example <span class="nt">--type-list</span> DoublePumpOverlined <span class="nt">--size-list</span> 100000 <span class="se">\</span>
<span class="nt">--max-loops</span> 1000 <span class="nt">--no-check</span>
<span class="nv">$ </span>perf report <span class="nt">--stdio</span> <span class="nt">-F</span> overhead,sym | <span class="nb">head</span> <span class="nt">-20</span>
<span class="c"># To display the perf.data header info, please use --header/--header-only options.</span>
<span class="c"># Samples: 129 of event 'mem_inst_retired.split_loads'</span>
<span class="c"># Event count (approx.): 12900387</span>
<span class="c"># Overhead Symbol</span>
30.23% <span class="o">[</span>.] DoublePumpOverlined...::Sort<span class="o">(</span>int32<span class="k">*</span>,int32<span class="k">*</span>,int64,int32<span class="o">)</span>
28.68% <span class="o">[</span>.] DoublePumpOverlined...::VectorizedPartitionInPlace<span class="o">(</span>int32<span class="k">*</span>,int32<span class="k">*</span>,int64<span class="o">)</span>
13.95% <span class="o">[</span>.] __memmove_avx_unaligned_erms
0.78% <span class="o">[</span>.] JIT_MemSet_End
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Seems like this moved the needle, and then some. We started with <code class="highlighter-rouge">86.68%</code> of <code class="highlighter-rouge">87,102,613</code> split-loads in our previous version of vectorized partitioning, and now we have <code class="highlighter-rouge">28.68%</code> of <code class="highlighter-rouge">12,900,387</code>. In other words: <code class="highlighter-rouge">(0.2868 * 12900387) / (0.8668 * 87102613)</code> gives us <code class="highlighter-rouge">4.9%</code>, or a <code class="highlighter-rouge">95.1%</code> reduction of split-load events for this version.
Not an entirely unpleasant experience.</p>
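<p>For intuition on what this counter is tallying: a 32-byte (AVX2-sized) load straddles two 64-byte cache lines exactly when its starting offset within a line leaves fewer than 32 bytes before the line ends. Here is a minimal sketch of that condition, in C rather than this series’ C# (the function name is mine, not anything from the code above):</p>

```c
#include <assert.h>
#include <stdint.h>

/* A 32-byte load starting at address p straddles two 64-byte cache
   lines exactly when its offset within the line is greater than 32,
   i.e. fewer than 32 bytes remain before the line boundary. */
static int is_split_load(uintptr_t p)
{
    return (p & 63u) > 32u;
}
```

<p>This is why aligning the read pointers to 32 bytes (so the offset within a line is always 0 or 32) makes the split-load count collapse.</p>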
<h4 id="sub-optimization--converting-branches-to-arithmetic-1">Sub-optimization: Converting branches to arithmetic: :+1:</h4>
<p>By this time, my code contained quite a few branches to deal with various edge cases around alignment, and I pulled another rabbit out of the optimization hat that is worth mentioning: We can convert simple branches into arithmetic operations. Many times, we end up having branches with super simple code behind them; here’s a real example I used to have in my code, as part of some early version of overlinement, which we’ll try to optimize:</p>
<div>
<div class="stickemup">
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="kt">int</span> <span class="n">leftAlign</span><span class="p">;</span>
<span class="p">...</span> <span class="c1">// Calculate left align here...</span>
<span class="k">if</span> <span class="p">(</span><span class="n">leftAlign</span> <span class="p"><</span> <span class="m">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">readLeft</span> <span class="p">+=</span> <span class="m">8</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
</div>
<p>This looks awfully friendly, and it is, unless <code class="highlighter-rouge">leftAlign</code>, and therefore the entire branch, is determined by random data we read from the array, making the CPU mispredict this branch more often than we’d care for. In my case, I had two branches like this, and each of them was happening at a rate of <code class="highlighter-rouge">1/8</code>. So enough for me to care. The good news is that we can re-write this, entirely in C#, and replace the potential misprediction with a constant, predictable (and often shorter!) data dependency. Let’s start by inspecting the re-written “branch”:</p>
</div>
<div>
<div class="stickemup">
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre><span class="kt">int</span> <span class="n">leftAlign</span><span class="p">;</span>
<span class="p">...</span> <span class="c1">// Calculate left align here...</span>
<span class="c1">// Signed arithmetic FTW</span>
<span class="kt">var</span> <span class="n">leftAlignMask</span> <span class="p">=</span> <span class="n">leftAlign</span> <span class="p">>></span> <span class="m">31</span><span class="p">;</span>
<span class="c1">// the mask is now either all 1s or all 0s depending on leftAlign's sign!</span>
<span class="n">readLeft</span> <span class="p">+=</span> <span class="m">8</span> <span class="p">&</span> <span class="n">leftAlignMask</span><span class="p">;</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
</div>
<p>By taking the same value we were comparing to 0 and right-shifting it, we are performing an arithmetic right shift. This takes the top bit, which is either <code class="highlighter-rouge">0/1</code> depending on <code class="highlighter-rouge">leftAlign</code>’s sign, and propagates it throughout the entire 32-bit value, which is then assigned to the <code class="highlighter-rouge">leftAlignMask</code> variable. We’ve essentially taken what was previously the comparison result driving the branch (the sign bit) and transformed it into a mask. We then proceed to use the mask to control the outcome of the <code class="highlighter-rouge">+= 8</code> operation, effectively turning it into <em>either</em> a <code class="highlighter-rouge">+= 8</code> -or- a <code class="highlighter-rouge">+= 0</code> operation, depending on the value of the mask!<br />
This turns out to be quite an effective way (again, for simple branches only) of converting a potential misprediction event costing us 15 cycles into a constant, 100% predictable 3-4 cycle data dependency for the CPU: It can be thought of as a “signaling” mechanism where we tell the CPU not to speculate on the result of the branch, but instead complete the <code class="highlighter-rouge">readLeft +=</code> statement only after waiting for the right-shift (<code class="highlighter-rouge">>> 31</code>) and bitwise-and (<code class="highlighter-rouge">&</code>) operations to propagate through its pipeline.</p>
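<p>As a sanity check, here is the same trick as a self-contained C sketch (names are mine; also note that right-shifting a negative signed integer is technically implementation-defined in C, though it is an arithmetic shift on every mainstream compiler), showing the branchy and branchless forms agree:</p>

```c
#include <assert.h>
#include <stdint.h>

/* Branchy original: add 8 only when leftAlign is negative. */
static int bump_branchy(int leftAlign, int readLeft)
{
    if (leftAlign < 0)
        readLeft += 8;
    return readLeft;
}

/* Branchless: the arithmetic right shift smears the sign bit into an
   all-ones/all-zeros mask, which then gates the +8. */
static int bump_branchless(int leftAlign, int readLeft)
{
    int32_t mask = (int32_t)leftAlign >> 31; /* -1 if negative, else 0 */
    return readLeft + (8 & mask);
}
```

<p>Both functions compute the same result for any input; only the branchless one does it with a fixed-latency data dependency instead of a speculated branch.</p>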
<table style="margin-bottom: 0em" class="notice--info">
<tr>
<td style="border: none; padding-top: 0; padding-bottom: 0; vertical-align: top"><span class="uk-label">Note</span></td>
<td style="border: none; padding-top: 0; padding-bottom: 0"><div>
<p>I referred to this as an old geezer’s optimization since modern processors already support this internally in the form of the <code class="highlighter-rouge">CMOV</code> instruction, which is more versatile, faster, and takes up fewer bytes in the instruction stream, while having the same “do not speculate on this” effect on the CPU. <em>The only issue</em> is that we don’t have <code class="highlighter-rouge">CMOV</code> in the CoreCLR JIT (Mono’s JIT, peculiarly, does support this, both with the internal JIT and naturally with LLVM…).<br />
As a side note to this side note, I’ll add that this is such an old-dog trick that LLVM even detects such code and de-optimizes it back into a “normal” branch, and then proceeds to optimize it again into <code class="highlighter-rouge">CMOV</code>, which I think is just a very cool thing, regardless :)</p>
</div>
</td>
</tr>
</table>
</div>
<p>I ended up replacing about 5-6 super simple/small branches this way. I won’t show direct performance numbers for this, as this is already part of the overlined version; I can’t say it improved performance considerably for my test runs, but it did reduce the jitter of those runs, which can be seen in the reduced error bars and tighter confidence intervals shown in the benchmark results above.</p>
<h3 id="coming-to-terms-with-bad-speculation">Coming to terms with bad speculation</h3>
<p>At the end of part 3, we came to a hard realization that our code is badly speculating inside the CPU. Even after simplifying the branch code in our loop in part 4, the bad speculation remained there, staring at us persistently. If you recall, we experienced a lot of bad-speculation effects when sorting the data with our vectorized code, and profiling using hardware counters showed us that while <code class="highlighter-rouge">InsertionSort</code> was the cause of most of the bad-speculation events (41%), our vectorized code was still responsible for 32% of them. Let’s try to think about that mean nasty branch, stuck there, in the middle of our beautiful loop:</p>
<div>
<div class="stickemup">
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre><span class="kt">int</span><span class="p">*</span> <span class="n">nextPtr</span><span class="p">;</span>
<span class="k">if</span> <span class="p">((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">writeRight</span> <span class="p">-</span> <span class="p">(</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readRight</span> <span class="p"><</span> <span class="n">N</span> <span class="p">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">int</span><span class="p">))</span> <span class="p">{</span>
<span class="n">nextPtr</span> <span class="p">=</span> <span class="n">readRight</span><span class="p">;</span>
<span class="n">readRight</span> <span class="p">-=</span> <span class="n">N</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">nextPtr</span> <span class="p">=</span> <span class="n">readLeft</span><span class="p">;</span>
<span class="n">readLeft</span> <span class="p">+=</span> <span class="n">N</span><span class="p">;</span>
<span class="p">}</span>
<span class="nf">PartitionBlock</span><span class="p">(</span><span class="n">nextPtr</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="n">pBase</span><span class="p">,</span> <span class="k">ref</span> <span class="n">writeLeft</span><span class="p">,</span> <span class="k">ref</span> <span class="n">writeRight</span><span class="p">);</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
</div>
<p>Long story short: We ended up sneaking a data-based branch back into our code in the form of this side-selection logic. Whenever we try to pick the side we’ll read from next, we put the CPU in a tough spot. We’re asking it to speculate on something it <em>can’t possibly speculate on successfully</em>. Our question is: “Oh CPU, CPU in the socket, which side is closer to being over-written of them all?”, to which the answer is completely data-driven. In other words, it depends on how the last round(s) of partitioning mutated the pointers involved in the comparison. It might sound like an easy thing for the CPU to check, but we have to remember it is attempting to execute ~100 or so instructions into the future, as it is required to speculate on the result: the previous rounds of partitioning have not yet been fully executed, internally. The CPU guesses, at best, based on stale data, and we know, as the grand designers of this mess, that its best guess is no better here than flipping a coin. Quite sad. You have to admit it is ironic that we managed to do this whole big circle around our own tails just to come back to having a branch misprediction based on the random array data. Mis-predicting here seems unavoidable. Or is it?</p>
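<p>Before trying anything, a back-of-the-envelope check suggests attacking this branch should be worthwhile. Using the rough figures from earlier (~15 cycles per misprediction vs. a ~3-4 cycle shift-and-mask data dependency; ballpark numbers, not measurements), a coin-flip branch loses on expected branch cost alone:</p>

```c
#include <assert.h>

/* Expected (amortized) per-execution cost of a conditional branch:
   misprediction rate times misprediction penalty, in cycles. */
static double expected_branch_cost(double miss_rate, double miss_penalty_cycles)
{
    return miss_rate * miss_penalty_cycles;
}
```

<p>With a 50% miss rate, <code class="highlighter-rouge">0.5 * 15 = 7.5</code> cycles per iteration, comfortably above the 3-4 cycle dependency; a well-predicted branch (say, 5% misses) would be cheaper left alone. That napkin math is exactly why the attempt below seemed worth making, and it also only accounts for branch cost, not for any extra instructions the replacement drags in.</p>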
<h4 id="replacing-the-branch-with-arithmetic--1">Replacing the branch with arithmetic: :-1:</h4>
<p>Could we replace this branch with arithmetic, just like we’ve done a couple of paragraphs above? Yes, we can.
Consider this alternative version:</p>
</div>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="kt">var</span> <span class="n">readRightMask</span> <span class="p">=</span>
<span class="p">(((</span><span class="kt">byte</span><span class="p">*)</span> <span class="n">writeRight</span> <span class="p">-</span> <span class="p">(</span><span class="kt">byte</span><span class="p">*)</span> <span class="n">readRight</span> <span class="p">-</span> <span class="n">N</span><span class="p">*</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">int</span><span class="p">)))</span> <span class="p">>></span> <span class="m">63</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">readLeftMask</span> <span class="p">=</span> <span class="p">~</span><span class="n">readRightMask</span><span class="p">;</span>
<span class="c1">// If readRightMask is 0, we pick the left side</span>
<span class="c1">// If readLeftMask is 0, we pick the right side</span>
<span class="kt">var</span> <span class="n">readRightMaybe</span> <span class="p">=</span> <span class="p">(</span><span class="kt">ulong</span><span class="p">)</span> <span class="n">readRight</span> <span class="p">&</span> <span class="p">(</span><span class="kt">ulong</span><span class="p">)</span> <span class="n">readRightMask</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">readLeftMaybe</span> <span class="p">=</span> <span class="p">(</span><span class="kt">ulong</span><span class="p">)</span> <span class="n">readLeft</span> <span class="p">&</span> <span class="p">(</span><span class="kt">ulong</span><span class="p">)</span> <span class="n">readLeftMask</span><span class="p">;</span>
<span class="nf">PartitionBlock</span><span class="p">((</span><span class="kt">int</span> <span class="p">*)</span> <span class="p">(</span><span class="n">readLeftMaybe</span> <span class="p">+</span> <span class="n">readRightMaybe</span><span class="p">),</span>
<span class="n">P</span><span class="p">,</span> <span class="n">pBase</span><span class="p">,</span> <span class="k">ref</span> <span class="n">writeLeft</span><span class="p">,</span> <span class="k">ref</span> <span class="n">writeRight</span><span class="p">);</span>
<span class="kt">var</span> <span class="n">postFixUp</span> <span class="p">=</span> <span class="p">-</span><span class="m">32</span> <span class="p">&</span> <span class="n">readRightMask</span><span class="p">;</span>
<span class="n">readRight</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span> <span class="p">*)</span> <span class="p">((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readRight</span> <span class="p">+</span> <span class="n">postFixUp</span><span class="p">);</span>
<span class="n">readLeft</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span> <span class="p">*)</span> <span class="p">((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readLeft</span> <span class="p">+</span> <span class="n">postFixUp</span> <span class="p">+</span> <span class="m">32</span><span class="p">);</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>What the code above does, apart from causing a nauseating headache, is take the same branches-to-arithmetic concept from the previous section and use it to get rid of that nasty branch: We take the comparison result, turn it into a negative/positive number, then proceed to generate masks from it, which we use to execute the code that used to reside under the branch.</p>
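<p>The core of the selection trick can be sketched as a self-contained C function (an illustration with names and layout of my own choosing, not the actual partitioning code; as before, the arithmetic right shift of a negative value is implementation-defined in C but behaves as needed on mainstream compilers):</p>

```c
#include <assert.h>
#include <stdint.h>

/* Branchless two-way select between pointers: turn a signed comparison
   into an all-ones/all-zeros mask, AND each candidate pointer with its
   mask, and OR the results; exactly one operand is non-zero, so the OR
   yields the selected pointer. */
static int *select_side(int *readLeft, int *readRight,
                        unsigned char *writeRight, intptr_t bytesNeeded)
{
    /* negative => right side is about to be overwritten => pick right */
    intptr_t diff = ((unsigned char *)writeRight - (unsigned char *)readRight)
                    - bytesNeeded;
    uintptr_t rightMask = (uintptr_t)(diff >> (sizeof(intptr_t) * 8 - 1));
    uintptr_t leftMask  = ~rightMask;
    return (int *)(((uintptr_t)readRight & rightMask) |
                   ((uintptr_t)readLeft  & leftMask));
}

/* Exercise both outcomes on a scratch buffer. */
static int select_side_check(void)
{
    int buf[64];
    int *readLeft  = buf;
    int *readRight = buf + 48;
    /* 48 bytes of headroom on the right, 32 needed: pick the left side */
    if (select_side(readLeft, readRight, (unsigned char *)(buf + 60), 32) != readLeft)
        return 0;
    /* only 16 bytes of headroom on the right: pick the right side */
    if (select_side(readLeft, readRight, (unsigned char *)(buf + 52), 32) != readRight)
        return 0;
    return 1;
}
```

<p>The post-fix-up of <code class="highlighter-rouge">readLeft</code>/<code class="highlighter-rouge">readRight</code> in the C# version follows the same pattern: a single mask gates which of the two pointers actually advances.</p>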
<p>I don’t want to dig too deep into this. While it’s technically sound, and does what we need it to do, it’s more important to focus on how it performs:</p>
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#8fc63529-869b-4c7d-aab5-2c2d25a929f2'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-stats"></i> Scaling</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks</a></li>
<li><a href="#"><i class="glyphicon glyphicon-info-sign"></i> Setup</a></li>
</ul>
<ul id="8fc63529-869b-4c7d-aab5-2c2d25a929f2" class="uk-switcher uk-margin">
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Performance scale: Array.Sort (solid gray) is always 100%, and the other methods are scaled relative to it" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
Overlined, 1 , 1 , 1 , 1 , 1 , 1
Branchless, 0.87253937, 0.951842168, 1.104715689, 1.140662148, 1.253573179, 1.379499062
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3}
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3}
}
]
},
"options": {
"title": { "text": "AVX2 Branchless Sorting - Scaled to Overlined", "display": true },
"scales": {
"yAxes": [{
"ticks": {
"fontFamily": "Indie Flower",
"min": 0.80,
"callback": "ticksPercent"
},
"scaleLabel": {
"labelString": "Scaling (%)",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
Overlined, 20.3199,21.0354,21.6787,23.0622,23.246,24.7603
Branchless, 17.7252,20.0221,23.9488,26.3062,29.1405,34.1567
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3}
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3}
}
]
},
"options": {
"title": { "text": "AVX2 Branchless Sorting - log(Time/N)", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"min": 15,
"max": 35,
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt5_3_Int32_-report.datatable.json" data-id-field="name" data-pagination="false" data-page-list="[9, 18]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result it. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nv">BenchmarkDotNet</span><span class="o">=</span>v0.12.0, <span class="nv">OS</span><span class="o">=</span>clear-linux-os 32120
Intel Core i7-7700HQ CPU 2.80GHz <span class="o">(</span>Kaby Lake<span class="o">)</span>, 1 CPU, 4 logical and 4 physical cores
.NET Core <span class="nv">SDK</span><span class="o">=</span>3.1.100
<span class="o">[</span>Host] : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
Job-DEARTS : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
<span class="nv">InvocationCount</span><span class="o">=</span>3 <span class="nv">IterationCount</span><span class="o">=</span>15 <span class="nv">LaunchCount</span><span class="o">=</span>2
<span class="nv">UnrollFactor</span><span class="o">=</span>1 <span class="nv">WarmupCount</span><span class="o">=</span>10
<span class="nv">$ </span><span class="nb">grep</span> <span class="s1">'stepping\|model\|microcode'</span> /proc/cpuinfo | <span class="nb">head</span> <span class="nt">-4</span>
model : 158
model name : Intel<span class="o">(</span>R<span class="o">)</span> Core<span class="o">(</span>TM<span class="o">)</span> i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>Look, I’m not here to sugar-coat it: This looks like an unmitigated disaster. But I claim that it is one we can learn a lot from in the future.
With the exception of sorting <code class="highlighter-rouge"><= 100</code> elements, the situation gets progressively worse as the problem grows.</p>
<p>To double-check that everything is sound, I ran <code class="highlighter-rouge">perf</code> recording the <code class="highlighter-rouge">instructions</code>, <code class="highlighter-rouge">branches</code> and <code class="highlighter-rouge">branch-misses</code> events for both versions for sorting <code class="highlighter-rouge">100,000</code> elements.</p>
<p>The command line used was this:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre><span class="nv">$ </span>perf record <span class="nt">-F</span> max <span class="nt">-e</span> instructions,branches,branch-misses <span class="se">\</span>
./Example <span class="nt">--type-list</span> DoublePumpOverlined <span class="se">\</span>
<span class="nt">--size-list</span> 100000 <span class="nt">--max-loops</span> 1000 <span class="nt">--no-check</span>
<span class="nv">$ </span>perf record <span class="nt">-F</span> max <span class="nt">-e</span> instructions,branches,branch-misses <span class="se">\</span>
./Example <span class="nt">--type-list</span> DoublePumpBranchless <span class="se">\</span>
<span class="nt">--size-list</span> 100000 <span class="nt">--max-loops</span> 1000 <span class="nt">--no-check</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>If you’re one of those sick people who likes to look into other people’s sorrows, here is a <a href="https://gist.github.com/damageboy/79368e350364348c6ca476492a63f052">gist with the full results</a>; if you’re more normal, and to keep things simple, I’ve processed the results and present them here in table form:</p>
<center>
<object style="margin: auto; width: 90%" type="image/svg+xml" data="../assets/images/overlined-branchless-counters.svg"></object>
</center>
</div>
<p>This is pretty amazing if you think about it:</p>
<ul>
  <li>The number of branches was cut in half: This makes sense, as the loop control itself is a branch instruction after all, so it remains even in the <code class="highlighter-rouge">Branchless</code> variant.</li>
<li>The branches that remain in the <code class="highlighter-rouge">branchless</code> version are all easy to predict, and we see that the <code class="highlighter-rouge">branch-misses</code> counter shows us those are down to nothing.<br />
This means that there is no mistake: We succeeded in a targeted assassination of that branch; however, there was a lot of collateral damage…</li>
  <li>The verbiage of the branchless code, expressed in the <code class="highlighter-rouge">instructions</code> counter, is definitely costing us something here:<br />
The number of executed instructions inside our partition loop has gone up by 17%, which is a lot.</li>
</ul>
<p>The slowdown we’ve measured here is directly related to NOT having <code class="highlighter-rouge">CMOV</code> available to us through the CoreCLR JIT, but I really don’t think that is the entire story here. It’s hard to express this in words, but
the slope at which the branchless code is slowing down compared to the previous version is very suspicious in my eyes.<br />
There is an expression we use a lot in Hebrew for this sort of situation: “The operation was successful, but the patient died”. There is no question that this is one of those moments.
This failure to accelerate the sorting operation, and specifically the way it fails, increasingly so as the problem size grows, is very telling in my eyes.
I have an idea of why this is and how we might be able to work around it. But, for today, our time is up. I’ll try to get back to this much, much later in this series,
and hopefully, we’ll all be wiser for it.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:0" role="doc-endnote">
      <p>Remember that the CPU knows nothing about the relationship between two different cache-lines. They might actually be on a page boundary as well, which means they might be in two different DRAM chips, or perhaps a single split-line access causes our poor CPU to communicate with a different socket, where another memory controller is responsible for reading the memory from its own DRAM modules! <a href="#fnref:0" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:1" role="doc-endnote">
      <p>Most modern Intel CPUs can actually address the L1 cache units twice per cycle, at least when it comes to reading data, by virtue of having two load-ports. That means they can actually request two cache-lines at the same time! But this still causes more load on the cache and bus. In our case, we must also remember we will be reading an additional cache-line for our permutation entry… <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
      <p>This specific AVX2 intrinsic will actually fail if/when used on non-aligned addresses. But it is important to note that it seems it won’t actually run faster than the load intrinsic we’ve used previously, <code class="highlighter-rouge">Avx2.LoadDquVector256</code>, as long as the actual addresses we pass to both instructions are 32-byte aligned. In other words, it’s very useful for debugging alignment issues, but not that critical to actually call this intrinsic! <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
      <p>I could be wrong about that last statement, but I couldn’t find anything quite like this discussed anywhere, and believe me, I’ve searched. If anyone can point me to someone doing this before, I’d really love to hear about it; there might be more good stuff to read about there… <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<hr />
<p><a href="https://bits.houmus.org/2020-02-01/this-goes-to-eleven-pt4">This Goes to Eleven (Pt. 4/∞)</a> — 2020-02-01</p>
<p>I ended up going down the rabbit hole re-implementing array sorting with AVX2 intrinsics, and there’s no reason I should go down alone.</p>
<p>Since there’s a lot to go over here, I’ll split it up into a few parts:</p>
<ol>
<li>In <a href="/2020-01-28/this-goes-to-eleven-pt1">part 1</a>, we start with a refresher on <code class="highlighter-rouge">QuickSort</code> and how it compares to <code class="highlighter-rouge">Array.Sort()</code>.</li>
<li>In <a href="/2020-01-29/this-goes-to-eleven-pt2">part 2</a>, we go over the basics of vectorized hardware intrinsics, vector types, and go over a handful of vectorized instructions we’ll use in part 3. We still won’t be sorting anything.</li>
  <li>In <a href="/2020-01-30/this-goes-to-eleven-pt3">part 3</a>, we go through the initial code for the vectorized sorting, and start seeing some payoff. We finish agonizing courtesy of the CPU’s branch predictor, throwing a wrench into our attempts.</li>
<li>In this part, we go over a handful of optimization approaches that I attempted trying to get the vectorized partition to run faster, seeing what worked and what didn’t.</li>
<li>In <a href="/2020-02-02/this-goes-to-eleven-pt5">part 5</a>, we’ll take a deep dive into how to deal with memory alignment issues.</li>
<li>In part 6, we’ll take a pause from the vectorized partitioning, to get rid of almost 100% of the remaining scalar code, by implementing small, constant size array sorting with yet more AVX2 vectorization.</li>
<li>In part 7, We’ll circle back and try to deal with a nasty slowdown left in our vectorized partitioning code</li>
<li>In part 8, I’ll tell you the sad story of a very twisted optimization I managed to pull off while failing miserably at the same time.</li>
<li>In part 9, I’ll try some algorithmic improvements to milk those last drops of perf, or at least those that I can think of, from this code.</li>
</ol>
<h2 id="squeezing-some-more-juice">Squeezing some more juice</h2>
<p>I thought it would be nice to show a bunch of things I ended up trying to improve performance.
I tried to keep most of these experiments in separate implementations, both the ones that yielded positive results and the failures. These can be seen in the original repo under the <a href="https://github.com/damageboy/VxSort/tree/research/VxSortResearch/Unstable/AVX2/Happy">Happy</a> and <a href="https://github.com/damageboy/VxSort/tree/research/VxSortResearch/Unstable/AVX2/Sad">Sad</a> folders.</p>
<p>While some worked, and some didn’t, I think a bunch of these were worth mentioning, so here goes:</p>
<h3 id="dealing-with-small-jit-hiccups-1">Dealing with small JIT hiccups: :+1:</h3>
<p>One of the more surprising things I’ve discovered during the optimization journey was that the JIT could be coaxed into generating much better code, specifically around pointer arithmetic. With the basic version we got working by the end of the <a href="/2020-01-30/this-goes-to-eleven-pt3">3<sup>rd</sup> post</a>, I started turning my attention to the body of the main loop, which is where I presume we spend most of our execution time. I quickly encountered some red-flag-raising assembly code, specifically for this single line of code, which we’ve briefly discussed before:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre><span class="k">if</span> <span class="p">(</span><span class="n">readLeft</span> <span class="p">-</span> <span class="n">writeLeft</span> <span class="p"><=</span>
<span class="n">writeRight</span> <span class="p">-</span> <span class="n">readRight</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// ...</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c1">// ...</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>It looks innocent enough, but here’s the freely commented x86 asm code for it:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
</pre></td><td class="rouge-code"><pre><span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span><span class="nb">rdx</span> <span class="c1">; ✓ copy readLeft</span>
<span class="nf">sub</span> <span class="nb">rax</span><span class="p">,</span><span class="nv">r12</span> <span class="c1">; ✓ subtract writeLeft</span>
<span class="nf">mov</span> <span class="nb">rcx</span><span class="p">,</span><span class="nb">rax</span> <span class="c1">; ✘ wat?</span>
<span class="nf">sar</span> <span class="nb">rcx</span><span class="p">,</span><span class="mh">3Fh</span> <span class="c1">; ✘ wat?1?</span>
<span class="nf">and</span> <span class="nb">rcx</span><span class="p">,</span><span class="mi">3</span> <span class="c1">; ✘ wat?!?!?</span>
<span class="nf">add</span> <span class="nb">rax</span><span class="p">,</span><span class="nb">rcx</span> <span class="c1">; ✘ wat!?!@#</span>
<span class="nf">sar</span> <span class="nb">rax</span><span class="p">,</span><span class="mi">2</span> <span class="c1">; ✘ wat#$@#$@</span>
<span class="nf">mov</span> <span class="nb">rcx</span><span class="p">,[</span><span class="nb">rbp</span><span class="o">-</span><span class="mh">58h</span><span class="p">]</span> <span class="c1">; ✓✘ copy writeRight, but from stack?</span>
<span class="nf">mov</span> <span class="nv">r8</span><span class="p">,</span><span class="nb">rcx</span> <span class="c1">; ✓✘ in the loop body?!?!?, Oh lordy!</span>
<span class="nf">sub</span> <span class="nv">r8</span><span class="p">,</span><span class="nb">rsi</span> <span class="c1">; ✓ subtract readRight</span>
<span class="nf">mov</span> <span class="nv">r10</span><span class="p">,</span><span class="nv">r8</span> <span class="c1">; ✘ wat?</span>
<span class="nf">sar</span> <span class="nv">r10</span><span class="p">,</span><span class="mh">3Fh</span> <span class="c1">; ✘ wat?!?</span>
<span class="nf">and</span> <span class="nv">r10</span><span class="p">,</span><span class="mi">3</span> <span class="c1">; ✘ wat!?!@#</span>
<span class="nf">add</span> <span class="nv">r8</span><span class="p">,</span><span class="nv">r10</span> <span class="c1">; ✘ wat#$@#$@</span>
<span class="nf">sar</span> <span class="nv">r8</span><span class="p">,</span><span class="mi">2</span> <span class="c1">; ✘ wat^!#$!#$</span>
<span class="nf">cmp</span> <span class="nb">rax</span><span class="p">,</span><span class="nv">r8</span> <span class="c1">; ✓ finally, compare!</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>It’s not every day that we get to see two JIT issues with one line of code. I know some people might take this as a bad sign, but in my mind this is great! To me this feels like digging for oil in Texas in the early 20s…
We’ve practically hit the ground with a pickaxe accidentally, only to see black liquid seeping out almost immediately!</p>
<h4 id="jit-bug-1-variable-not-promoted-to-register">JIT Bug 1: variable not promoted to register</h4>
<p>One super weird thing that we see happening here is the difference in the asm code that copies <code class="highlighter-rouge">writeRight</code> on <span class="uk-label">L8-9</span> from the <em>stack</em> (<code class="highlighter-rouge">[rbp-58h]</code>) before performing the subtraction when compared to <span class="uk-label">L1</span> where a conceptually similar copy is performed for <code class="highlighter-rouge">readLeft</code> from a register (<code class="highlighter-rouge">rdx</code>). The code merely tries to subtract two pairs of pointers, but the generated machine code is weird: 3 out of 4 pointers were correctly lifted out of the stack into registers outside the body of the loop (<code class="highlighter-rouge">readLeft</code>, <code class="highlighter-rouge">writeLeft</code>, <code class="highlighter-rouge">readRight</code>), but the 4<sup>th</sup> one, <code class="highlighter-rouge">writeRight</code>, is the designated black-sheep of the family and is being continuously read from the stack (and later in that loop body is also written back to the stack, to make things worse).<br />
There is no good reason for this, and this clearly smells! What do we do?</p>
<p>For one thing, I’ve <a href="https://github.com/dotnet/runtime/issues/35495">opened up an issue</a> about this weirdness. The issue itself shows just how finicky the JIT is regarding this one variable, and (un)surprisingly, by fudging around the setup code this can be easily worked around for now.<br />
As a refresher, here’s the original setup code I presented in the previous post, just before we enter the loop body:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="rouge-code"><pre><span class="k">unsafe</span> <span class="kt">int</span><span class="p">*</span> <span class="nf">VectorizedPartitionInPlace</span><span class="p">(</span><span class="kt">int</span><span class="p">*</span> <span class="n">left</span><span class="p">,</span> <span class="kt">int</span><span class="p">*</span> <span class="n">right</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">var</span> <span class="n">writeLeft</span> <span class="p">=</span> <span class="n">left</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">writeRight</span> <span class="p">=</span> <span class="n">right</span> <span class="p">-</span> <span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">;</span> <span class="c1">// <- Why the hate?</span>
<span class="kt">var</span> <span class="n">tmpLeft</span> <span class="p">=</span> <span class="n">_tempStart</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">tmpRight</span> <span class="p">=</span> <span class="n">_tempEnd</span> <span class="p">-</span> <span class="n">N</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">pBase</span> <span class="p">=</span> <span class="n">Int32PermTables</span><span class="p">.</span><span class="n">IntPermTablePtr</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">P</span> <span class="p">=</span> <span class="n">Vector256</span><span class="p">.</span><span class="nf">Create</span><span class="p">(</span><span class="n">pivot</span><span class="p">);</span>
<span class="nf">PartitionBlock</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpLeft</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpRight</span><span class="p">);</span>
<span class="nf">PartitionBlock</span><span class="p">(</span><span class="n">right</span> <span class="p">-</span> <span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpLeft</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpRight</span><span class="p">);</span>
<span class="kt">var</span> <span class="n">readLeft</span> <span class="p">=</span> <span class="n">left</span> <span class="p">+</span> <span class="n">N</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">readRight</span> <span class="p">=</span> <span class="n">right</span> <span class="p">-</span> <span class="m">2</span><span class="p">*</span><span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">;</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>And here’s a simple fix: moving the pointer declaration closer to the loop body seems to convince the JIT that we can all be friends once more:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="rouge-code"><pre><span class="k">unsafe</span> <span class="kt">int</span><span class="p">*</span> <span class="nf">VectorizedPartitionInPlace</span><span class="p">(</span><span class="kt">int</span><span class="p">*</span> <span class="n">left</span><span class="p">,</span> <span class="kt">int</span><span class="p">*</span> <span class="n">right</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// ... omitted for brevity</span>
<span class="kt">var</span> <span class="n">tmpLeft</span> <span class="p">=</span> <span class="n">_tempStart</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">tmpRight</span> <span class="p">=</span> <span class="n">_tempEnd</span> <span class="p">-</span> <span class="n">N</span><span class="p">;</span>
<span class="nf">PartitionBlock</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpLeft</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpRight</span><span class="p">);</span>
<span class="nf">PartitionBlock</span><span class="p">(</span><span class="n">right</span> <span class="p">-</span> <span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpLeft</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpRight</span><span class="p">);</span>
<span class="kt">var</span> <span class="n">writeLeft</span> <span class="p">=</span> <span class="n">left</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">writeRight</span> <span class="p">=</span> <span class="n">right</span> <span class="p">-</span> <span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">;</span> <span class="c1">// <- Oh, so now we're cool?</span>
<span class="kt">var</span> <span class="n">readLeft</span> <span class="p">=</span> <span class="n">left</span> <span class="p">+</span> <span class="n">N</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">readRight</span> <span class="p">=</span> <span class="n">right</span> <span class="p">-</span> <span class="m">2</span><span class="p">*</span><span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">;</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>The asm is <em>slightly</em> cleaner:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
</pre></td><td class="rouge-code"><pre><span class="nf">mov</span> <span class="nv">r8</span><span class="p">,</span><span class="nb">rax</span> <span class="c1">; ✓ copy readLeft</span>
<span class="nf">sub</span> <span class="nv">r8</span><span class="p">,</span><span class="nv">r15</span> <span class="c1">; ✓ subtract writeLeft</span>
<span class="nf">mov</span> <span class="nv">r9</span><span class="p">,</span><span class="nv">r8</span> <span class="c1">; ✘ wat?</span>
<span class="nf">sar</span> <span class="nv">r9</span><span class="p">,</span><span class="mh">3Fh</span> <span class="c1">; ✘ wat?1?</span>
<span class="nf">and</span> <span class="nv">r9</span><span class="p">,</span><span class="mi">3</span> <span class="c1">; ✘ wat?!?!?</span>
<span class="nf">add</span> <span class="nv">r8</span><span class="p">,</span><span class="nv">r9</span> <span class="c1">; ✘ wat!?!@#</span>
<span class="nf">sar</span> <span class="nv">r8</span><span class="p">,</span><span class="mi">2</span> <span class="c1">; ✘ wat#$@#$@</span>
<span class="nf">mov</span> <span class="nv">r9</span><span class="p">,</span><span class="nb">rsi</span> <span class="c1">; ✓ copy writeRight</span>
<span class="nf">sub</span> <span class="nv">r9</span><span class="p">,</span><span class="nb">rcx</span> <span class="c1">; ✓ subtract readRight</span>
<span class="nf">mov</span> <span class="nv">r10</span><span class="p">,</span><span class="nv">r9</span> <span class="c1">; ✘ wat?1?</span>
<span class="nf">sar</span> <span class="nv">r10</span><span class="p">,</span><span class="mh">3Fh</span> <span class="c1">; ✘ wat?!?!?</span>
<span class="nf">and</span> <span class="nv">r10</span><span class="p">,</span><span class="mi">3</span> <span class="c1">; ✘ wat!?!@#</span>
<span class="nf">add</span> <span class="nv">r9</span><span class="p">,</span><span class="nv">r10</span> <span class="c1">; ✘ wat#$@#$@</span>
<span class="nf">sar</span> <span class="nv">r9</span><span class="p">,</span><span class="mi">2</span> <span class="c1">; ✘ wat^%#^#@!</span>
<span class="nf">cmp</span> <span class="nv">r8</span><span class="p">,</span><span class="nv">r9</span> <span class="c1">; ✓ finally, compare!</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>It doesn’t look like much, but we’ve managed to remove two memory accesses from the loop body (the read, shown above, and a symmetrical write to the same stack variable/location towards the end of the loop).
It’s also clear, at least from my comments that I’m not entirely pleased yet, so let’s move on to…</p>
<h4 id="jit-bug-2-not-optimizing-pointer-difference-comparisons">JIT bug 2: not optimizing pointer difference comparisons</h4>
<p>Calling this one a bug might be a stretch, but in the world of the JIT, sub-optimal code generation can be considered just that. The original code performing the comparison is making the JIT (wrongfully) think that we want to perform <code class="highlighter-rouge">int *</code> arithmetic for <code class="highlighter-rouge">readLeft - writeLeft</code> and <code class="highlighter-rouge">writeRight - readRight</code>. In other words: the JIT emits code subtracting both pointer pairs, generating a <code class="highlighter-rouge">byte *</code> difference for each pair, which is great (I marked that with checkmarks in the listings). Then it goes on to generate extra code converting those differences into <code class="highlighter-rouge">int *</code> units: lots of extra arithmetic operations. This is simply useless: we just care if one side is larger than the other. What the JIT is doing here is similar in spirit to converting two distance measurements taken in <code class="highlighter-rouge">cm</code> to <code class="highlighter-rouge">km</code> just to compare which one is greater.<br />
To work around this disappointing behaviour, I wrote this instead:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre><span class="k">if</span> <span class="p">((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readLeft</span> <span class="p">-</span> <span class="p">(</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">writeLeft</span> <span class="p"><=</span>
<span class="p">(</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">writeRight</span> <span class="p">-</span> <span class="p">(</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readRight</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// ...</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c1">// ...</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>By doing this sort of seemingly useless casting 4 times, we get the following asm generated:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="nf">mov</span> <span class="nb">rcx</span><span class="p">,</span> <span class="nb">rdi</span> <span class="c1">; ✓ copy readLeft</span>
<span class="nf">sub</span> <span class="nb">rcx</span><span class="p">,</span> <span class="nv">r12</span> <span class="c1">; ✓ subtract writeLeft</span>
<span class="nf">mov</span> <span class="nv">r9</span><span class="p">,</span> <span class="nb">rsi</span> <span class="c1">; ✓ copy writeRight</span>
<span class="nf">sub</span> <span class="nv">r9</span><span class="p">,</span> <span class="nv">r13</span> <span class="c1">; ✓ subtract readRight</span>
<span class="nf">cmp</span> <span class="nb">rcx</span><span class="p">,</span> <span class="nv">r9</span> <span class="c1">; ✓ compare</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>It doesn’t take a degree in reverse-engineering asm code to figure out this was a good idea.<br />
Casting each pointer to <code class="highlighter-rouge">byte *</code> coerces the JIT to do our bidding and just perform a simpler comparison.</p>
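<p>To convince ourselves the two forms really agree, here’s a tiny stand-alone check (purely illustrative; the pointer offsets are made up and nothing here comes from the repo): both differences scale by the same <code class="highlighter-rouge">sizeof(int)</code> factor, so comparing them in <code class="highlighter-rouge">byte *</code> units yields the same verdict as comparing them in <code class="highlighter-rouge">int *</code> units.</p>

```csharp
// Illustrative stand-alone check (not from the VxSort repo; pointer offsets
// are made up). Both differences scale by the same sizeof(int) factor, so a
// byte*-unit comparison gives the same verdict as an int*-unit comparison.
using System;

unsafe class ByteDiffCheck
{
    static void Main()
    {
        var arr = new int[128];
        fixed (int* p = arr)
        {
            int* readLeft   = p + 10,  writeLeft = p + 3;
            int* writeRight = p + 120, readRight = p + 100;

            // int* units: what made the JIT emit the extra sar/and/add dance
            bool elementUnits = readLeft - writeLeft <= writeRight - readRight;
            // byte* units: a plain sub+sub+cmp
            bool byteUnits = (byte*)readLeft   - (byte*)writeLeft <=
                             (byte*)writeRight - (byte*)readRight;

            if (elementUnits != byteUnits) throw new Exception("mismatch");
            Console.WriteLine("ok");
        }
    }
}
```

<p>(Compile with <code class="highlighter-rouge">AllowUnsafeBlocks</code> enabled.)</p>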
<h4 id="jit-bug-3-updating-the-write-pointers-more-efficiently">JIT Bug 3: Updating the <code class="highlighter-rouge">write*</code> pointers more efficiently</h4>
<p>I discovered another missed opportunity in the pointer update code at the end of our inlined partitioning block. When we update the two <code class="highlighter-rouge">write*</code> pointers, our intention is to update two <code class="highlighter-rouge">int *</code> values with the result of the <code class="highlighter-rouge">PopCount</code> intrinsic:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="kt">var</span> <span class="n">popCount</span> <span class="p">=</span> <span class="nf">PopCount</span><span class="p">(</span><span class="n">mask</span><span class="p">);</span>
<span class="n">writeLeft</span> <span class="p">+=</span> <span class="m">8U</span> <span class="p">-</span> <span class="n">popCount</span><span class="p">;</span>
<span class="n">writeRight</span> <span class="p">-=</span> <span class="n">popCount</span><span class="p">;</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Unfortunately, the JIT isn’t smart enough to see that it would be wiser to left shift <code class="highlighter-rouge">popCount</code> once by <code class="highlighter-rouge">2</code> (e.g. convert to <code class="highlighter-rouge">byte *</code> distance) and reuse that left-shifted value <strong>twice</strong> while mutating the two pointers.
Again, uglifying the originally clean code into the following god-awful mess gets the job done:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="kt">var</span> <span class="n">popCount</span> <span class="p">=</span> <span class="nf">PopCount</span><span class="p">(</span><span class="n">mask</span><span class="p">)</span> <span class="p"><<</span> <span class="m">2</span><span class="p">;</span>
<span class="n">writeRight</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span> <span class="p">*)</span> <span class="p">((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">writeRight</span> <span class="p">-</span> <span class="n">popCount</span><span class="p">);</span>
<span class="n">writeLeft</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span> <span class="p">*)</span> <span class="p">((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">writeLeft</span> <span class="p">+</span> <span class="m">8</span><span class="p">*</span><span class="m">4U</span> <span class="p">-</span> <span class="n">popCount</span><span class="p">);</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>I’ll skip the asm this time. It’s pretty clear from the C# that we pre-left shift (or multiply by 4) the <code class="highlighter-rouge">popCount</code> result before mutating the pointers.
We’re now generating slightly denser code by eliminating a silly instruction from a hot loop.</p>
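<p>As a quick sanity check, here’s a stand-alone snippet (illustrative only, not repo code) verifying that the pre-shifted <code class="highlighter-rouge">byte *</code> updates land both pointers exactly where the original <code class="highlighter-rouge">int *</code> arithmetic would, for every possible <code class="highlighter-rouge">PopCount</code> result (0–8):</p>

```csharp
// Illustrative stand-alone check (not repo code): for every PopCount result
// 0..8, the pre-shifted byte* form moves both write pointers to exactly the
// same place as the original int* arithmetic.
using System;

unsafe class PopCountUpdateCheck
{
    static void Main()
    {
        var arr = new int[64];
        fixed (int* p = arr)
        {
            for (uint count = 0; count <= 8; count++)
            {
                // original form: two separate pointer-scaling operations
                int* wl1 = p + 16 + (8U - count);
                int* wr1 = p + 48 - count;

                // micro-opt form: shift into byte units once, reuse twice
                uint popCount = count << 2;
                int* wl2 = (int*)((byte*)(p + 16) + 8 * 4U - popCount);
                int* wr2 = (int*)((byte*)(p + 48) - popCount);

                if (wl1 != wl2 || wr1 != wr2) throw new Exception("mismatch");
            }
            Console.WriteLine("ok");
        }
    }
}
```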
<p>All 3 of these workarounds can be seen on my repo in the <a href="https://github.com/damageboy/VxSort/tree/research">research branch</a>. I kept this pretty much as-is under <a href="https://github.com/damageboy/VxSort/blob/research/VxSortResearch/Unstable/AVX2/Happy/B4_1_DoublePumpMicroOpt.cs"><code class="highlighter-rouge">B4_1_DoublePumpMicroOpt.cs</code></a>.
Time to see whether all these changes help in terms of performance:</p>
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#f26d55b8-f3f3-45ad-a052-56e3d7306828'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-stats"></i> Scaling</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks</a></li>
<li><a href="#"><i class="glyphicon glyphicon-info-sign"></i> Setup</a></li>
</ul>
<ul id="f26d55b8-f3f3-45ad-a052-56e3d7306828" class="uk-switcher uk-margin">
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Performance scale: Array.Sort (solid gray) is always 100%, and the other methods are scaled relative to it" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
Naive, 1 , 1 , 1 , 1 , 1 , 1
MicroOpt, 1.01, 0.93, 0.93, 0.93, 0.89 , 0.87
<!--
{
"data" : {
"datasets" : [ {
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 30, "hachureGap": 3 }
}]
},
"options": {
"title": { "text": "AVX2 Micro-optimized Sorting - Scaled to AVX2 Naive Sorting", "display": true },
"scales": {
"yAxes": [{
"ticks": {
"fontFamily": "Indie Flower",
"min": 0.84,
"callback": "ticksPercent"
},
"scaleLabel": {
"labelString": "Scaling (%)",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
Naive , 21.2415, 26.0040, 30.7502, 31.4513, 27.4290, 30.6499
MicroOpt, 21.3374, 23.9888, 28.4617, 29.1356, 24.4974, 26.8152
<!--
{
"data" : {
"datasets" : [ {
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 30, "hachureGap": 3 }
}]
},
"options": {
"title": { "text": "AVX2 Naive+Micro-optimized Sorting - log(Time/N)", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"min": 20,
"max": 35,
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt4_1_Int32_-report.datatable.json" data-id-field="name" data-pagination="false" data-page-list="[9, 18]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result it. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="rouge-code"><pre><span class="nv">BenchmarkDotNet</span><span class="o">=</span>v0.12.0, <span class="nv">OS</span><span class="o">=</span>clear-linux-os 32120
Intel Core i7-7700HQ CPU 2.80GHz <span class="o">(</span>Kaby Lake<span class="o">)</span>, 1 CPU, 4 logical and 4 physical cores
.NET Core <span class="nv">SDK</span><span class="o">=</span>3.1.100
<span class="o">[</span>Host] : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
Job-DEARTS : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
<span class="nv">InvocationCount</span><span class="o">=</span>3 <span class="nv">IterationCount</span><span class="o">=</span>15 <span class="nv">LaunchCount</span><span class="o">=</span>2
<span class="nv">UnrollFactor</span><span class="o">=</span>1 <span class="nv">WarmupCount</span><span class="o">=</span>10
<span class="nv">$ </span><span class="nb">grep</span> <span class="s1">'stepping\|model\|microcode'</span> /proc/cpuinfo | <span class="nb">head</span> <span class="nt">-4</span>
model : 158
model name : Intel<span class="o">(</span>R<span class="o">)</span> Core<span class="o">(</span>TM<span class="o">)</span> i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>This is quite a bit better! I’ve artificially set the y-axis here to a narrow range of 80%-105% so that the differences would become more apparent. The improvement is <em>very</em> measurable. Too bad we had to uglify the code to get here, but such is life. Our results just improved by another ~7-14% across the board.<br />
If this is the going rate for ugly, I’ll bite the bullet :)</p>
<p>I did not include any statistics collection tab for this version since there is no algorithmic change involved.</p>
</div>
<h3 id="selecting-a-better-cut-off-threshold-for-scalar-sorting-1">Selecting a better cut-off threshold for scalar sorting: :+1:</h3>
<p>I briefly mentioned this at the end of the 3<sup>rd</sup> post: While it made sense to start with the same threshold that <code class="highlighter-rouge">Array.Sort</code> uses (<code class="highlighter-rouge">16</code>) to switch from partitioning into small array sorting, there’s no reason to assume this is the optimal threshold for <em>our</em> partitioning function: Given that the dynamics have changed with vectorized partitioning, the optimal cut-off point probably needs to move too.<br />
In theory, we should retest the cut-off point after every optimization that succeeds in moving the needle; I won’t do this after every optimization, but I will do so again for the final version. In the meantime, let’s see how playing with the cut-off point changes the results: We’ll try <code class="highlighter-rouge">24</code>, <code class="highlighter-rouge">32</code>, <code class="highlighter-rouge">40</code>, <code class="highlighter-rouge">48</code> on top of <code class="highlighter-rouge">16</code>, and see what comes out on top:</p>
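<p>For reference, the shape of the dispatch being tuned looks roughly like this. It’s a minimal scalar sketch: <code class="highlighter-rouge">Partition</code> and <code class="highlighter-rouge">InsertionSort</code> here are stand-ins for the real vectorized partition and small-array code, and <code class="highlighter-rouge">Cutoff</code> is the knob under test:</p>

```csharp
// Minimal scalar sketch of the cut-off dispatch (names and the scalar
// Partition/InsertionSort are stand-ins; only the Cutoff mechanics mirror
// the real code). Below the threshold we bail out to insertion sort.
using System;

class CutoffSortSketch
{
    const int Cutoff = 40; // one of the candidate values being benchmarked

    static void Sort(int[] a, int lo, int hi)
    {
        if (lo >= hi) return;
        if (hi - lo + 1 <= Cutoff)
        {
            InsertionSort(a, lo, hi); // scalar small-array fallback
            return;
        }
        int p = Partition(a, lo, hi); // stand-in for the vectorized partition
        Sort(a, lo, p - 1);
        Sort(a, p + 1, hi);
    }

    static void InsertionSort(int[] a, int lo, int hi)
    {
        for (int i = lo + 1; i <= hi; i++)
        {
            int v = a[i], j = i - 1;
            while (j >= lo && a[j] > v) { a[j + 1] = a[j]; j--; }
            a[j + 1] = v;
        }
    }

    static int Partition(int[] a, int lo, int hi)
    {
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++)
            if (a[j] <= pivot) { (a[i], a[j]) = (a[j], a[i]); i++; }
        (a[i], a[hi]) = (a[hi], a[i]);
        return i;
    }

    static void Main()
    {
        var rnd = new Random(42);
        var a = new int[1000];
        for (int i = 0; i < a.Length; i++) a[i] = rnd.Next();
        Sort(a, 0, a.Length - 1);
        for (int i = 1; i < a.Length; i++)
            if (a[i - 1] > a[i]) throw new Exception("not sorted");
        Console.WriteLine("sorted");
    }
}
```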
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#324b4f6c-1fea-4605-8dc7-6abf1826ec74'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-stats"></i> Scaling</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks</a></li>
<li><a href="#"><i class="glyphicon glyphicon-info-sign"></i> Setup</a></li>
</ul>
<ul id="324b4f6c-1fea-4605-8dc7-6abf1826ec74" class="uk-switcher uk-margin">
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Performance scale: Array.Sort (solid gray) is always 100%, and the other methods are scaled relative to it" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
MicroOpt_24,0.823310023,0.882747579,0.914373696,0.902330475,0.958166708,0.971168474
MicroOpt_32,0.817715618,0.766905542,0.839337033,0.850782566,0.973364241,0.9561571
MicroOpt_40,0.761305361,0.749485401,0.837020549,0.842011671,0.95013881,0.958056824
MicroOpt_48,0.758041958,0.75722345,0.823212214,0.839358026,0.966057806,0.962200074
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(33,33,220,.5)",
"hidden": "true",
"rough": { "fillStyle": "hachure", "hachureAngle": 30, "hachureGap": 12 }
},
{
"backgroundColor": "rgba(220,33,33,.5)",
"hidden": "true",
"rough": { "fillStyle": "hachure", "hachureAngle": 90, "hachureGap": 12 }
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3 }
},
{
"backgroundColor": "rgba(33,220,220,.5)",
"hidden": "true",
"rough": { "fillStyle": "hachure", "hachureAngle": 120, "hachureGap": 12 }
}
]
},
"options": {
"title": { "text": "AVX2 Sorting - Cut-off Tuning", "display": true },
"scales": {
"yAxes": [{
"ticks": {
"fontFamily": "Indie Flower",
"min": 0.70,
"callback": "ticksPercent"
},
"scaleLabel": {
"labelString": "Scaling (%)",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
MicroOpt_24,17.6195,25.5307,25.4022,26.5767,23.3013,25.6154
MicroOpt_32,17.3879,22.0054,25.9392,26.6394,23.3355,25.6553
MicroOpt_40,17.3027,23.2386,26.1287,26.3959,23.4568,25.7346
MicroOpt_48,17.0937,23.5973,25.6651,26.3667,23.2584,25.6
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(33,33,220,.5)",
"hidden": "true",
"rough": { "fillStyle": "hachure", "hachureAngle": 30, "hachureGap": 12 }
},
{
"backgroundColor": "rgba(220,33,33,.5)",
"hidden": "true",
"rough": { "fillStyle": "hachure", "hachureAngle": 90, "hachureGap": 12 }
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3 }
},
{
"backgroundColor": "rgba(33,220,220,.5)",
"hidden": "true",
"rough": { "fillStyle": "hachure", "hachureAngle": 120, "hachureGap": 12 }
}
]
},
"options": {
"title": { "text": "AVX2 Sorting - Cut-off Tuning - log(Time/N)", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"min": 15,
"max": 30,
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt4_2_Int32_-report.datatable.json" data-id-field="name" data-pagination="true" data-page-list="[5, 10, 15, 20]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result it. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nv">BenchmarkDotNet</span><span class="o">=</span>v0.12.0, <span class="nv">OS</span><span class="o">=</span>clear-linux-os 32120
Intel Core i7-7700HQ CPU 2.80GHz <span class="o">(</span>Kaby Lake<span class="o">)</span>, 1 CPU, 4 logical and 4 physical cores
.NET Core <span class="nv">SDK</span><span class="o">=</span>3.1.100
<span class="o">[</span>Host] : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
Job-DEARTS : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
<span class="nv">InvocationCount</span><span class="o">=</span>3 <span class="nv">IterationCount</span><span class="o">=</span>15 <span class="nv">LaunchCount</span><span class="o">=</span>2
<span class="nv">UnrollFactor</span><span class="o">=</span>1 <span class="nv">WarmupCount</span><span class="o">=</span>10
<span class="nv">$ </span><span class="nb">grep</span> <span class="s1">'stepping\|model\|microcode'</span> /proc/cpuinfo | <span class="nb">head</span> <span class="nt">-4</span>
model : 158
model name : Intel<span class="o">(</span>R<span class="o">)</span> Core<span class="o">(</span>TM<span class="o">)</span> i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>I’ve pulled a little trick with these charts: by default, I’ve <em>hidden</em> everything but one of the cut-off points: <code class="highlighter-rouge">40</code>, that being the best new cut-off point, at least in my opinion. If you care to follow my reasoning process, I suggest you slowly click (or touch) the <code class="highlighter-rouge">24</code>, <code class="highlighter-rouge">32</code>, and <code class="highlighter-rouge">48</code> series/titles in the legend. This will add them back into the chart, one by one; stop to appreciate what you are seeing. Once you do, I think it’s easy to see that:</p>
<ul>
<li>The initial value we started off with, <code class="highlighter-rouge">16</code> (the baseline for this series of benchmarks), is undoubtedly the <em>worst possible</em> cut-off for vectorized partitioning…<br />
<em>All of the other cut-off points have scaling values below 100%</em>, hence they are faster.</li>
<li><code class="highlighter-rouge">24</code> does not do us a world of good here either: it’s clearly always the next-worst option.</li>
<li><code class="highlighter-rouge">32</code> is pretty good, except at the lower edge of the chart, where the higher cut-off points seem to provide better value.</li>
<li>For the most part, using either <code class="highlighter-rouge">40</code> or <code class="highlighter-rouge">48</code> as the cut-off point seems to be the right way to go. These two cover the least area in the chart. In other words, they both provide the best improvement, on average, for our scenario.</li>
</ul>
<p>I ended up voting for <code class="highlighter-rouge">40</code>. There’s no good reason I can give for this except for (perhaps wrong) instinct. Lest we forget, another small pat on the back is in order: we’ve managed to speed up our sorting code with an improvement ranging from 5-25% throughout the entire spectrum, which is cause for a small celebration in itself.</p>
<p>To be completely honest, there is another, ulterior motive for showing the effect of changing the small-sorting threshold so early in this series. By doing so, we can sense where this trail will take us on our journey: it’s pretty clear that we will end up with two equally important implementations, each handling a large part of the total workload for sorting:</p>
<ul>
<li>The vectorized partitioning will be tasked with the initial heavy lifting: taking large arrays and breaking them down into many small, unsorted, yet completely distinct groups of elements.<br />
To put it plainly: taking a million elements and splitting them up into 10,000-20,000 groups of ~50-100 elements each, which do not cross over each other; that way, we can use…</li>
<li>Small-sorting, which will end up doing a final pass: taking each small group of ~50-100 elements and sorting it in place, before moving on to the next group.</li>
</ul>
<p>Given that we will always start with partitioning before concluding with small-sorting, we end up with a complete solution. Just as importantly, we can optimize <em>each</em> of the two parts making up our solution <em>independently</em>, in the coming posts.</p>
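<p>To make that division of labor concrete, here is a minimal scalar sketch of the overall shape we are converging on. This is C with hypothetical names; the real code partitions with AVX2 intrinsics, picks pivots more carefully, and manages temporary memory, none of which is shown here:</p>

```c
#include <stddef.h>

/* A minimal, scalar sketch of the two-phase structure described above.
 * Names are made up; plain scalar stand-ins replace the vectorized
 * partition and the fancier small-sort. */
enum { SMALL_SORT_CUTOFF = 40 }; /* the cut-off we just settled on */

static void small_sort(int *lo, int *hi) { /* insertion sort stand-in */
    for (int *i = lo + 1; i <= hi; i++) {
        int v = *i, *j = i;
        while (j > lo && j[-1] > v) { *j = j[-1]; j--; }
        *j = v;
    }
}

static int *partition(int *lo, int *hi) { /* scalar stand-in, pivot = *hi */
    int pivot = *hi, *w = lo, t;
    for (int *r = lo; r < hi; r++)
        if (*r < pivot) { t = *r; *r = *w; *w = t; w++; }
    t = *hi; *hi = *w; *w = t;
    return w; /* pivot now sits in its final position */
}

void quicksort(int *lo, int *hi) {
    if (hi - lo + 1 <= SMALL_SORT_CUTOFF) { small_sort(lo, hi); return; }
    int *p = partition(lo, hi);
    quicksort(lo, p - 1);
    quicksort(p + 1, hi);
}
```

The point is the first branch: every range at or below the cut-off bypasses partitioning entirely, which is exactly why tuning that constant moves the needle so much.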
</div>
<h3 id="explicit-prefetching--1">Explicit Prefetching: :-1:</h3>
<p>I tried using prefetch intrinsics to give the CPU early hints as to where we are reading memory from.</p>
<p>Generally speaking, explicit prefetching can be used to make sure the CPU reads some data from memory into the cache <em>ahead of the actual time</em> we require it, so that the CPU never needs to wait for memory, which is very slow. The bottom line is that having to wait for RAM is a death sentence (200-300 CPU cycles), but even having to wait for the L2 cache (14 cycles) when your entire loop’s throughput is around 9 cycles is unpleasant. With prefetch intrinsics, we can explicitly instruct the CPU to prefetch specific cache lines all the way into the L1 cache, or alternatively specify the target level as L2 or L3.</p>
<p>Just because we can do something, doesn’t mean we should: do we actually need to prefetch? CPU designers know all of the above just as much as we do, and the CPU already attempts to prefetch data based on complex and obscure heuristics. You might be tempted to think: “oh, what’s so bad about doing it anyway?”. Well, quite a lot, to be honest: when we explicitly tell the CPU to prefetch data, we’re wasting both instruction cache and decode+fetch bandwidth. Those might be better used for executing our computation.<br />
So the bottom line remains somewhat hazy, but we can try to set up some ground rules that are probably true in 2020:</p>
<ul>
<li>CPUs can prefetch data when we traverse memory sequentially.</li>
<li>They do so regardless of the traversal direction (increasing/decreasing addresses).</li>
<li>They can successfully figure out the <em>stride</em> we use, when it is constant.</li>
<li>They do so by building up a history of our reads, per call-site.</li>
</ul>
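<p>As a toy illustration of the access pattern those rules describe (plain C, deliberately unrelated to our partitioning code), a fixed-stride read loop like this is exactly what the hardware prefetcher is built to detect: one call-site, a constant stride, and a predictable direction:</p>

```c
/* Constant-stride traversal: after a few iterations, the hardware
 * prefetcher has all the history it needs to run ahead of us,
 * with no explicit prefetch hints required. */
long sum_strided(const int *a, int n, int stride) {
    long s = 0;
    for (int i = 0; i < n; i += stride)
        s += a[i];
    return s;
}
```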
<p>With all that in mind, it is quite likely that prefetching in our case would do little good: our partitioning code pretty much hits every point in the previous list. But even so, you can never really tell without either trying it out or inspecting memory-related performance counters. The latter turns out to be <a href="https://gist.github.com/travisdowns/90a588deaaa1b93559fe2b8510f2a739">more complicated than you’d think</a>, and sometimes it’s just easier to try something out than to attempt to measure it ahead of time. In our case, prefetching the <em>writable</em> memory <strong>makes no sense</strong>: our loop mostly reads from the same addresses just before writing to them in the next iteration or two, so I focused on trying to prefetch the next read addresses.</p>
<p>Whenever I modified <code class="highlighter-rouge">readLeft</code>, <code class="highlighter-rouge">readRight</code>, I immediately added code like this:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
</pre></td><td class="rouge-code"><pre><span class="kt">int</span> <span class="p">*</span> <span class="n">nextPtr</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readLeft</span> <span class="p">-</span> <span class="p">(</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">writeLeft</span><span class="p">)</span> <span class="p"><=</span>
    <span class="p">((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">writeRight</span> <span class="p">-</span> <span class="p">(</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readRight</span><span class="p">))</span> <span class="p">{</span>
<span class="n">nextPtr</span> <span class="p">=</span> <span class="n">readLeft</span><span class="p">;</span>
<span class="n">readLeft</span> <span class="p">+=</span> <span class="m">8</span><span class="p">;</span>
<span class="c1">// Trying to be clever here,</span>
<span class="c1">// If we are reading from the left at this iteration,</span>
<span class="c1">// we are likely to read from right in the next iteration</span>
<span class="n">Sse</span><span class="p">.</span><span class="nf">Prefetch0</span><span class="p">((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readRight</span> <span class="p">-</span> <span class="m">64</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">nextPtr</span> <span class="p">=</span> <span class="n">readRight</span><span class="p">;</span>
<span class="n">readRight</span> <span class="p">-=</span> <span class="m">8</span><span class="p">;</span>
<span class="c1">// Same as above, only the other way around:</span>
<span class="c1">// After reading from the right, it's likely</span>
<span class="c1">// that our next read will be on the left side</span>
<span class="n">Sse</span><span class="p">.</span><span class="nf">Prefetch0</span><span class="p">((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readLeft</span> <span class="p">+</span> <span class="m">64</span><span class="p">);</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>This tells the CPU we are about to use data in <code class="highlighter-rouge">readLeft + 64</code> (the next cache-line from the left) and <code class="highlighter-rouge">readRight - 64</code> (the next cache-line from the right) in the following iterations.</p>
<p>While this looks great on paper, the real-world results of this were unnoticeable for me, and even slightly negative. For the most part, it appears that the CPUs I used for testing did a good job without me constantly telling them to do what they had already been doing on their own… Still, it was worth a shot.</p>
<h3 id="simplifying-the-branch-1">Simplifying the branch :+1:</h3>
<p>I’m kind of ashamed of this particular optimization: I had literally been staring at this line of code, and optimizing around it, for months without stopping to really think about what it was I was <strong>really trying</strong> to do. Let’s go back to our re-written branch from a couple of paragraphs ago:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre><span class="k">if</span> <span class="p">((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readLeft</span> <span class="p">-</span> <span class="p">(</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">writeLeft</span><span class="p">)</span> <span class="p"><=</span>
<span class="p">(</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">writeRight</span> <span class="p">-</span> <span class="p">(</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readRight</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// ...</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c1">// ...</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>I’ve been describing this condition in both animated and code form in the previous part, explaining how, for my double-pumping to work, I have to figure out which side we <em>must</em> read from <strong>next</strong>, so that we never end up overwriting data before we’ve had a chance to read and partition it. All of this happens in the name of performing in-place partitioning. However, I’ve been over-complicating the actual condition!<br />
At some admittedly late stage, it hit me, so let’s play this out step-by-step:</p>
<ol>
<li>We always start with the setup I’ve previously described, where we make <code class="highlighter-rouge">8</code> elements worth of space available on <strong>both</strong> sides, by partitioning them away into the temporary memory.</li>
<li>When we get into the main partitioning loop, we pick one specific side to read from: so far, this has always been the left side (it doesn’t <em>really</em> matter which side it is, but it arbitrarily ended up being the <em>left</em> side due to the condition being <code class="highlighter-rouge"><=</code> rather than <code class="highlighter-rouge"><</code>).</li>
<li>Given all of the above, we always <em>start</em> reading from the left, thereby increasing the “breathing space” on the left side from <code class="highlighter-rouge">8</code> to <code class="highlighter-rouge">16</code> elements temporarily.</li>
<li>Once our trusty ole’ partitioning block is done, we can pause and reason on how both sides now look:
<ul>
<li>The left side either has:
<ul>
<li><code class="highlighter-rouge">8</code> elements of space (in the less likely, yet possible case that all elements read from it were smaller than the selected pivot) -or-</li>
<li>It has more than <code class="highlighter-rouge">8</code> elements of “free” space.</li>
</ul>
</li>
<li>
<p>In the first case, where the left side is now back to 8 elements of free space, the right side also has <code class="highlighter-rouge">8</code> elements of free space, since nothing was written on that side!</p>
</li>
<li>In all other cases, the left side has <em>more</em> than <code class="highlighter-rouge">8</code> elements of free space, and the right side has less than <code class="highlighter-rouge">8</code> elements of free space, by definition.</li>
</ul>
</li>
<li>Since these are the true dynamics, why should we even bother comparing <strong>both</strong> heads and tails of each respective side?</li>
</ol>
<p>The answer to that last question is: <strong>We don’t have to!</strong><br />
We could simplify the branch by comparing only the distance between the right side’s write and read pointers, checking whether it is smaller than the magic number <code class="highlighter-rouge">8</code>!
This new condition serves the original <em>intent</em> (which is: “don’t end up overwriting unread data”) just as well as the more complicated branch we used before…<br />
When the right side has fewer than <code class="highlighter-rouge">8</code> elements of free space, we <em>have to</em> read from the right side in the next round, since it is in danger of being overwritten; otherwise, the only other option is that both sides are back at <code class="highlighter-rouge">8</code> elements each, and we should go back to reading from the left side, essentially returning to our starting setup condition as described in (1). It’s kind of silly, and I really feel bad that it took me 4 months or so to see this. The new condition ends up being much simpler to encode and execute:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre><span class="kt">int</span><span class="p">*</span> <span class="n">nextPtr</span><span class="p">;</span>
<span class="k">if</span> <span class="p">((</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">writeRight</span> <span class="p">-</span> <span class="p">(</span><span class="kt">byte</span> <span class="p">*)</span> <span class="n">readRight</span> <span class="p"><</span> <span class="n">N</span> <span class="p">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">int</span><span class="p">))</span> <span class="p">{</span>
<span class="c1">// ...</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c1">// ...</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>This branch is just as “correct” as the previous one, but it is less taxing in a few ways:</p>
<ul>
<li>Fewer instructions to decode and execute.<br />
We’ve saved an additional 5 bytes worth of opcodes from the main loop!</li>
<li>Fewer data dependencies for the CPU to potentially wait for.<br />
(The CPU doesn’t have to wait for the <code class="highlighter-rouge">writeLeft</code>/<code class="highlighter-rouge">readLeft</code> pointer mutation and subtraction to complete)</li>
</ul>
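<p>Incidentally, the claim that the simpler condition can never let us overwrite unread data is easy to check mechanically. Here is a tiny, self-contained C model (counters only; a hypothetical sketch, not the real partitioning code) of the free-space bookkeeping described in steps 1-5 above:</p>

```c
/* Toy model of the double-pumped partition loop: track only the
 * free-slot counts on each side, in elements. Reading a block of 8
 * frees 8 more slots on the side we read from; the partition then
 * writes k elements to the left and 8-k to the right. We drive the
 * left/right split with a cheap LCG and check that the simplified
 * branch ("read right iff its free space dropped below 8") never
 * lets a write clobber unread data. */
enum { B = 8 }; /* elements per vectorized block */

int model_holds(unsigned seed, int iters) {
    int leftFree = B, rightFree = B; /* state after the initial setup */
    for (int i = 0; i < iters; i++) {
        if (rightFree < B) rightFree += B; /* must read from the right */
        else               leftFree  += B; /* otherwise read from the left */
        seed = seed * 1664525u + 1013904223u;
        int k = (int)(seed % (B + 1));     /* 0..8 elements go left */
        leftFree  -= k;
        rightFree -= B - k;
        if (leftFree < 0 || rightFree < 0)  return 0; /* overwrote unread data */
        if (leftFree + rightFree != 2 * B)  return 0; /* invariant broken */
    }
    return 1;
}
```

Running this over many random split sequences never trips either check, which matches the reasoning above: the two free-space counters always sum to 16 elements between iterations, so testing only the right side is enough.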
<p>Naturally, this ends up slightly faster, and we can verify this with BDN once again:</p>
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#fcd8c2f3-b377-44eb-9b18-5964ad2a69b4'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-stats"></i> Scaling</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks</a></li>
<li><a href="#"><i class="glyphicon glyphicon-info-sign"></i> Setup</a></li>
</ul>
<ul id="fcd8c2f3-b377-44eb-9b18-5964ad2a69b4" class="uk-switcher uk-margin">
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Performance scale: Array.Sort (solid gray) is always 100%, and the other methods are scaled relative to it" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
MicroOpt_40,1,1,1,1,1,1
SimpleBranch,1.01220256253813,0.946321321321321,0.982688056091031,0.938806414898963,1.00465999238207,0.962359905144129
<!--
{
"data" : {
"datasets" : [ {
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3 }
}
]
},
"options": {
"title": { "text": "AVX2 SimpleBranch Sorting - Scaled to MicroOpt_40", "display": true },
"scales": {
"yAxes": [{
"ticks": {
"fontFamily": "Indie Flower",
"min": 0.92,
"callback": "ticksPercent"
},
"scaleLabel": {
"labelString": "Scaling (%)",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
MicroOpt_40,16.3891,21.3124,24.0181,26.2096,23.1979,26.4655
SimpleBranch,16.5929,20.168,23.6023,24.6058,23.306,25.4694
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3 }
}
]
},
"options": {
"title": { "text": "AVX2 MicroOpt_40 + SimplerBranch - log(Time/N)", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"min": 15,
"max": 27,
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt4_3_Int32_-report.datatable.json" data-id-field="name" data-pagination="true" data-page-list="[5, 10, 15, 20]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result it. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nv">BenchmarkDotNet</span><span class="o">=</span>v0.12.0, <span class="nv">OS</span><span class="o">=</span>clear-linux-os 32120
Intel Core i7-7700HQ CPU 2.80GHz <span class="o">(</span>Kaby Lake<span class="o">)</span>, 1 CPU, 4 logical and 4 physical cores
.NET Core <span class="nv">SDK</span><span class="o">=</span>3.1.100
<span class="o">[</span>Host] : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
Job-DEARTS : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
<span class="nv">InvocationCount</span><span class="o">=</span>3 <span class="nv">IterationCount</span><span class="o">=</span>15 <span class="nv">LaunchCount</span><span class="o">=</span>2
<span class="nv">UnrollFactor</span><span class="o">=</span>1 <span class="nv">WarmupCount</span><span class="o">=</span>10
<span class="nv">$ </span><span class="nb">grep</span> <span class="s1">'stepping\|model\|microcode'</span> /proc/cpuinfo | <span class="nb">head</span> <span class="nt">-4</span>
model : 158
model name : Intel<span class="o">(</span>R<span class="o">)</span> Core<span class="o">(</span>TM<span class="o">)</span> i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>There’s not a lot to say about this, but I’ll point out a couple of things:</p>
<ul>
<li>There is a seemingly very slight slowdown around the 100 and 1M element marks. It’s authentic and repeatable in my tests, and I honestly don’t know why it happens yet. We spend a total of around 1.6μs for every 100-element sort, which might not initially sound like a lot of time, but at 2.8GHz, that amounts to ~4500 cycles, give or take. For the case of 1M elements, this phenomenon is even more peculiar; but such is life.</li>
<li>Otherwise, there is an improvement, even if a modest one, of roughly 2%-4% in most cases. It does look like this version of our code is better, at the end of the day.</li>
</ul>
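<p>For the record, the cycle estimate above is just straight multiplication; a one-liner (with the figures from the text plugged in as assumptions) makes it explicit:</p>

```c
/* total time (ns) * clock rate (GHz, i.e. cycles per ns) = total cycles.
 * E.g. ~1,600 ns for a 100-element sort at 2.8 GHz is ~4,480 cycles,
 * which is where the "~4500 cycles give or take" figure comes from. */
double cycles_spent(double total_ns, double ghz) {
    return total_ns * ghz;
}
```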
<p>One interesting question that I personally did not know the answer to beforehand was: would this reduce branch mispredictions? There’s no reason to expect it to, since our input data, being random, is driving the outcome of this branch. However, if I’ve learned one thing throughout this long ordeal, it is that there are always things you don’t even know that you don’t know. Any way of verifying our pet theories is a welcome opportunity to learn some humility.</p>
</div>
<p>Let’s fire up <code class="highlighter-rouge">perf</code> to inspect what its counters tell us about the two versions (each result is in a separate tab below):</p>
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#123fd8ba-23ef-4144-b66c-871bc8f5aa58'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-list-alt"></i> CutOff@40</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> SimpleBranch</a></li>
</ul>
<ul id="123fd8ba-23ef-4144-b66c-871bc8f5aa58" class="uk-switcher uk-margin">
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="rouge-code"><pre><span class="nv">$ COMPlus_PerfMapEnabled</span><span class="o">=</span>1 perf record <span class="nt">-F</span> max <span class="nt">-e</span> branch-misses <span class="se">\</span>
./Example <span class="nt">--type-list</span> DoublePumpMicroOptCutOff_40 <span class="nt">--size-list</span> 1000000 <span class="nt">--no-check</span>
...
<span class="nv">$ </span>perf report <span class="nt">--stdio</span> <span class="nt">-F</span> overhead,sym | <span class="nb">head</span> <span class="nt">-15</span>
<span class="c"># Samples: 403K of event 'branch-misses'</span>
<span class="c"># Event count (approx.): 252554012</span>
43.73% <span class="o">[</span>.] ... DoublePumpMicroOptCutoff_40::InsertionSort<span class="o">(</span>...<span class="o">)</span>
25.51% <span class="o">[</span>.] ... DoublePumpMicroOptCutoff_40+VxSortInt32::VectorizedPartitionInPlace<span class="o">(</span>...<span class="o">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="rouge-code"><pre><span class="nv">$ COMPlus_PerfMapEnabled</span><span class="o">=</span>1 perf record <span class="nt">-F</span> max <span class="nt">-e</span> branch-misses <span class="se">\</span>
./Example <span class="nt">--type-list</span> DoublePumpSimpleBranch <span class="nt">--size-list</span> 1000000 <span class="nt">--no-check</span>
...
<span class="nv">$ </span>perf report <span class="nt">--stdio</span> <span class="nt">-F</span> overhead,sym | <span class="nb">head</span> <span class="nt">-15</span>
<span class="c"># Samples: 414K of event 'branch-misses'</span>
<span class="c"># Event count (approx.): 241513903</span>
41.11% <span class="o">[</span>.] ... DoublePumpSimpleBranch::InsertionSort<span class="o">(</span>...<span class="o">)</span>
26.59% <span class="o">[</span>.] ... DoublePumpSimpleBranch+VxSortInt32::VectorizedPartitionInPlace<span class="o">(</span>...<span class="o">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>Here we’re comparing the same two versions we’ve just benchmarked, with a specific focus on the branch-misses HW counter. We can take this opportunity both to appreciate how these results compare to the ones we recorded at the end of the previous post, and to see how they compare to each other.</p>
<p>Compared to our <code class="highlighter-rouge">DoublePumpedNaive</code> implementation of yester-post, it would appear that the “burden of guilt” when it comes to branch mispredictions has shifted towards <code class="highlighter-rouge">InsertionSort</code> by 3-4%. This is to be expected: We were using a cut-off point of <code class="highlighter-rouge">16</code> previously, and we’ve just upped it to <code class="highlighter-rouge">40</code> in the previous section, so it makes sense for <code class="highlighter-rouge">InsertionSort</code> to perform more work in this new balance, taking a larger share of the branch-misses.</p>
<p>When comparing the <a href="https://github.com/damageboy/VxSort/blob/research/VxSortResearch/Unstable/AVX2/Happy/B4_2_DoublePumpMicroOptCutoff.cs#L798"><code class="highlighter-rouge">DoublePumpMicroOptCutOff_40</code></a> and <a href="https://github.com/damageboy/VxSort/blob/research/VxSortResearch/Unstable/AVX2/Happy/B4_3_DoublePumpSimpleBranch.cs"><code class="highlighter-rouge">DoublePumpSimpleBranch</code></a> versions, which differ only in that nasty branch at the top of our main loop, the results look mostly similar. First, we have to acknowledge that <code class="highlighter-rouge">perf</code> is a statistical tool that works by collecting samples of HW counters, so we’re not going to get an exact count of anything, even when running the same code time after time. With that said, once we calculate how many of the total branch misses are attributed to the function we actually changed, it comes to <code class="highlighter-rouge">64,426,528</code> misses for the previous version vs. <code class="highlighter-rouge">64,218,546</code> for the newer, simpler branch. That difference doesn’t amount to enough to call this a branch-misprediction win. So it would seem we gained a bit with smaller code, but not by lowering the frequency of mispredictions.</p>
</div>
<h3 id="packing-the-permutation-table-1st-attempt-1">Packing the Permutation Table, 1<sup>st</sup> attempt: :+1:</h3>
<p>Ever since I started with this little time-succubus of a project, I was really annoyed at the way I was encoding the permutation tables. To me, wasting 8kb worth of data, or more specifically, wasting 8kb worth of precious L1 cache in the CPU for the permutation entries, was tantamount to a cardinal sin. My emotional state aside, the situation is even more horrid when you stop to consider that out of each 32-byte permutation entry, we were only really using 3 bits x 8 elements, or 24 bits of usable data. To be completely honest, I probably made this into a bigger problem in my head, imagining how the performance was suffering, than it really is, but we don’t always get to choose our made-up enemies. Sometimes they choose us.</p>
<p>My first attempt at packing the permutation entries was to try and use a specific Intel intrinsic called <a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.avx2.converttovector256int32?view=netcore-3.1"><code class="highlighter-rouge">ConvertToVector256Int32</code> / <code class="highlighter-rouge">VPMOVZXBD</code></a>. This intrinsic can read a 64-bit value directly from memory while also expanding it into 8x32-bit values inside a <code class="highlighter-rouge">Vector256<T></code> register. If nothing else, it buys me an excuse to do this:</p>
<p><img src="/assets/images/yodawg.jpg" alt="Yo Dawg" /></p>
<p>More seriously though, the basic idea was that I would go back to the permutation entries and re-encode them as 64 bits (8x8 bits) per single entry instead of the 256 bits I’ve been using thus far. This encoding reduces the size of the entire permutation table from 8kb to 2kb, which is a nice start.<br />
Unfortunately, this initial attempt went south when I got hit by a <a href="https://github.com/dotnet/runtime/issues/12835">JIT bug</a>. When I tried to circumvent that bug, the results didn’t look better, they were slightly worse, so I left the code in a sub-optimal state and forgot about it. Luckily, I did revisit this at a later stage, after the bug was fixed, and to my delight, once the JIT was encoding this instruction correctly and efficiently, things started working smoothly.</p>
<p>I ended up encoding a second permutation table, and by using the correct <code class="highlighter-rouge">ConvertToVector256Int32</code> we are kind of better off:</p>
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#057c19d6-b6ca-48e8-942e-5115c37b39cc'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-stats"></i> Scaling</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N (Intel)</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N (AMD)</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks (Intel)</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks (AMD)</a></li>
<li><a href="#"><i class="glyphicon glyphicon-info-sign"></i> Setup</a></li>
</ul>
<ul id="057c19d6-b6ca-48e8-942e-5115c37b39cc" class="uk-switcher uk-margin">
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Performance scale: Array.Sort (solid gray) is always 100%, and the other methods are scaled relative to it" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
SimpleBranch,1,1,1,1,1,1
Packed Intel,1.013605442,1.016909534,1.001868534,0.984072719,0.997337839,0.997892526
Packed AMD,0.896395352,0.813863407,0.919215529,0.916898529,0.926463363,0.981186383
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(0,113,197,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3 }
},
{
"backgroundColor": "rgba(237,28,36,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 30, "hachureGap": 3 }
}
]
},
"options": {
"title": { "text": "AVX2 Packed Permutation Table Sorting - Scaled to SimpleBranch", "display": true },
"scales": {
"yAxes": [{
"ticks": {
"fontFamily": "Indie Flower",
"min": 0.80,
"callback": "ticksPercent"
},
"scaleLabel": {
"labelString": "Scaling (%)",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,
SimpleBranch,16.1715,20.1069,24.2436,25.7728,24.4249,26.6617
Packed,16.3927,20.4471,24.2889,25.3623,24.3599,26.6055
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(0,113,197,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3 }
}
]
},
"options": {
"title": { "text": "AVX2 SimplerBranch + Packed - log(Time/N) on Intel", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"min": 15.0,
"max": 27,
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,
SimpleBranch,10.1852,13.3196,18.9534,22.6299,23.7335,24.3677
Packed,9.13,10.7383,18.8766,20.7494,21.9882,23.9092
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(237,28,36,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3 }
}
]
},
"options": {
"title": { "text": "AVX2 SimplerBranch + Packed - log(Time/N) on AMD", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"min": 8.0,
"max": 26,
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt4_4_Int32_-report.datatable.intel.json" data-id-field="name" data-pagination="true" data-page-list="[5, 10, 15, 20]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result it. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt4_4_Int32_-report.datatable.amd.json" data-id-field="name" data-pagination="true" data-page-list="[5, 10, 15, 20]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result it. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nv">BenchmarkDotNet</span><span class="o">=</span>v0.12.0, <span class="nv">OS</span><span class="o">=</span>clear-linux-os 32120
Intel Core i7-7700HQ CPU 2.80GHz <span class="o">(</span>Kaby Lake<span class="o">)</span>, 1 CPU, 4 logical and 4 physical cores
.NET Core <span class="nv">SDK</span><span class="o">=</span>3.1.100
<span class="o">[</span>Host] : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
Job-DEARTS : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
<span class="nv">InvocationCount</span><span class="o">=</span>3 <span class="nv">IterationCount</span><span class="o">=</span>15 <span class="nv">LaunchCount</span><span class="o">=</span>2
<span class="nv">UnrollFactor</span><span class="o">=</span>1 <span class="nv">WarmupCount</span><span class="o">=</span>10
<span class="nv">$ </span><span class="nb">grep</span> <span class="s1">'stepping\|model\|microcode'</span> /proc/cpuinfo | <span class="nb">head</span> <span class="nt">-4</span>
model : 158
model name : Intel<span class="o">(</span>R<span class="o">)</span> Core<span class="o">(</span>TM<span class="o">)</span> i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>These results bring a new, unwarranted dimension into our lives: CPU vendors and model-specific quirks. Up until now, I’ve been testing my various optimizations on three different processor models I had at hand: Intel Kaby Lake, Intel Broadwell, and AMD Ryzen. Every attempt I’ve presented here netted positive results on all three test beds, even if to different degrees, so I opted to focus on the Intel Kaby-Lake results to reduce the information overload.<br />
Now is the first time we see uneven results: the two results I included represent the two extremes of the performance spectrum; the newer Intel Kaby-Lake processors are not affected by this optimization. When I set out to implement it, I came into this with eyes wide open: I knew that, all in all, the CPU would be doing roughly the same work for the permutation entry loading per se. I was gunning for a 2<sup>nd</sup>-order effect: freeing up 6KB of L1 data-cache is no small saving, given that its total size is 32KB in all of my tested CPUs.</p>
<p>What we see from the Intel Kaby-Lake results can basically be summarized as: newer Intel CPUs <em>probably</em> have a very efficient prefetch unit, one that performs well enough that we can’t feel or see the benefit of having more L1 room afforded by packing the permutation table more tightly. With AMD CPUs, and older Intel CPUs (like Intel Broadwell, not shown here), freeing up the L1 cache does make a substantial dent in the total runtime.</p>
<p>All in all, while this is a slightly more complex scenario to reason about, we’re left with one rather new CPU that is neither helped nor hurt by this optimization, and other, older/different CPUs where this is a very substantial win. As such, I decided to keep it in the code-base going forward.</p>
</div>
<h3 id="packing-the-permutation-table-2nd-attempt--1">Packing the Permutation Table, 2<sup>nd</sup> attempt: :-1:</h3>
<p>Next, I tried to pack the permutation table even further, going from 2kb to 1kb of memory, by squeezing each entry’s eight 3-bit indices into a single 32-bit value.
The packing is the easy part, but how would we unpack these 32-bit compressed entries all the way back to a full 256-bit vector? Why, with yet more intrinsics, of course.
With this, my ultra-packed permutation table now looked like this:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre><span class="n">ReadOnlySpan</span><span class="p"><</span><span class="kt">byte</span><span class="p">></span> <span class="n">BitPermTable</span> <span class="p">=></span> <span class="k">new</span> <span class="kt">byte</span><span class="p">[]</span>
<span class="p">{</span>
<span class="m">0</span><span class="n">b10001000</span><span class="p">,</span> <span class="m">0</span><span class="n">b11000110</span><span class="p">,</span> <span class="m">0</span><span class="n">b11111010</span><span class="p">,</span> <span class="m">0</span><span class="n">b00000000</span><span class="p">,</span> <span class="c1">// 0</span>
<span class="c1">// ...</span>
<span class="m">0</span><span class="n">b01100011</span><span class="p">,</span> <span class="m">0</span><span class="n">b01111101</span><span class="p">,</span> <span class="m">0</span><span class="n">b01000100</span><span class="p">,</span> <span class="m">0</span><span class="n">b00000000</span><span class="p">,</span> <span class="c1">// 7</span>
<span class="c1">// ...</span>
<span class="m">0</span><span class="n">b00010000</span><span class="p">,</span> <span class="m">0</span><span class="n">b10011101</span><span class="p">,</span> <span class="m">0</span><span class="n">b11110101</span><span class="p">,</span> <span class="m">0</span><span class="n">b00000000</span><span class="p">,</span> <span class="c1">// 170</span>
<span class="c1">// ...</span>
<span class="m">0</span><span class="n">b10001000</span><span class="p">,</span> <span class="m">0</span><span class="n">b11000110</span><span class="p">,</span> <span class="m">0</span><span class="n">b11111010</span><span class="p">,</span> <span class="m">0</span><span class="n">b00000000</span><span class="p">,</span> <span class="c1">// 255</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>And my unpacking code now uses <a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.bmi2.x64.parallelbitdeposit?view=netcore-3.1#System_Runtime_Intrinsics_X86_Bmi2_X64_ParallelBitDeposit_System_UInt64_System_UInt64_"><code class="highlighter-rouge">ParallelBitDeposit</code> / <code class="highlighter-rouge">PDEP</code></a>, which I happened to cover in more detail in a <a href="/2018-08-19/netcoreapp3.0-intrinsics-in-real-life-pt2#pdep---parallel-bit-deposit">previous post</a>:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre><span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="nf">GetBitPermutation</span><span class="p">(</span><span class="kt">uint</span> <span class="p">*</span><span class="n">pBase</span><span class="p">,</span> <span class="k">in</span> <span class="kt">uint</span> <span class="n">mask</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">const</span> <span class="kt">ulong</span> <span class="n">magicMask</span> <span class="p">=</span>
<span class="m">0</span><span class="n">b00000111_00000111_00000111_00000111_00000111_00000111_00000111_00000111</span><span class="p">;</span>
<span class="k">return</span> <span class="n">Avx2</span><span class="p">.</span><span class="nf">ConvertToVector256Int32</span><span class="p">(</span>
<span class="n">Vector128</span><span class="p">.</span><span class="nf">CreateScalarUnsafe</span><span class="p">(</span>
<span class="n">Bmi2</span><span class="p">.</span><span class="n">X64</span><span class="p">.</span><span class="nf">ParallelBitDeposit</span><span class="p">(</span><span class="n">pBase</span><span class="p">[</span><span class="n">mask</span><span class="p">],</span> <span class="n">magicMask</span><span class="p">)).</span><span class="nf">AsByte</span><span class="p">());</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>What does this little monstrosity do, exactly? We <strong>pack</strong> the permutation bits (remember, we need just 3 bits per element, and we have 8 elements, so 24 bits per permutation vector in total) into a single 32-bit value; then, whenever we need to expand it into a full-blown vector, we:</p>
<ul>
<li>Unpack the 32-bit values into a 64-bit value using <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=pdep&expand=1532,4152"><code class="highlighter-rouge">ParallelBitDeposit</code></a> from the <code class="highlighter-rouge">BMI2</code> intrinsics extensions.</li>
<li>Convert (move) the 64-bit value into the lower 64-bits of a 128-bit SIMD register using <code class="highlighter-rouge">Vector128.CreateScalarUnsafe</code>.</li>
<li>Go back to using a different variant of <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_cvtepi8_epi32&expand=1532"><code class="highlighter-rouge">ConvertToVector256Int32</code></a> (<code class="highlighter-rouge">VPMOVZXBD</code>), one that takes 8-bit elements from a 128-bit wide register and expands them into integers in a 256-bit register.</li>
</ul>
<p>In short, we chain 2 extra instructions compared to our 2kb permutation table, but save an additional 1kb of cache. Was it worth it?<br />
I wish I could say with a complete and assured voice that it was, but the truth is that it had very little positive effect, if any:</p>
<p>While we do save 1kb of precious L1 cache, the extra instructions end up delaying and costing us more than whatever perf we’re gaining from the extra cache space.<br />
To make things even worse, I <a href="https://github.com/dotnet/runtime/issues/786">later discovered</a> that on AMD processors, the very same intrinsic I’m relying upon here, <code class="highlighter-rouge">PDEP</code>, is some sort of a bastardized instruction. It’s not really implemented with proper circuitry at the CPU level, but rather as a plain loop inside the processor. As the discussion I linked to shows, it can take hundreds of cycles(!) depending on the provided mask value. For now, we can simply chalk this attempt up as a failure.</p>
<h3 id="skipping-some-permutations--1">Skipping some permutations: :-1:</h3>
<p>There are common cases where performing the permutation is completely unneeded, which means that almost the entire permutation block can be skipped:</p>
<ul>
<li>No need to load the permutation entry</li>
<li>Or perform the permutation</li>
</ul>
<p>To be precise, there are exactly 9 such cases in the permutation table: whenever all the <code class="highlighter-rouge">1</code> bits are already grouped in the upper (MSB) part of the <code class="highlighter-rouge">mask</code> value in our permutation block. The values are:</p>
<ul>
<li><code class="highlighter-rouge">0b11111111</code></li>
<li><code class="highlighter-rouge">0b11111110</code></li>
<li><code class="highlighter-rouge">0b11111100</code></li>
<li><code class="highlighter-rouge">0b11111000</code></li>
<li><code class="highlighter-rouge">0b11110000</code></li>
<li><code class="highlighter-rouge">0b11100000</code></li>
<li><code class="highlighter-rouge">0b11000000</code></li>
<li><code class="highlighter-rouge">0b10000000</code></li>
<li><code class="highlighter-rouge">0b00000000</code></li>
</ul>
<p>I thought it might be a good idea to detect those cases. I first tried a switch statement, and when that failed to speed things up, comparing the number of trailing zeros to (<code class="highlighter-rouge">8</code> - population count). While both methods did technically work, the additional branch and its associated mispredictions didn’t make this worthwhile or yield any positive result. The simpler code, which always permutes, did just as well if not slightly better.<br />
Of course, these results have to be taken with a grain of salt, since they depend on us sorting random data. There might be other situations, where such branches are predicted correctly, in which this could save a lot of cycles. But for now, let’s just drop it and move on…</p>
<h3 id="getting-intimate-with-x86-for-fun-and-profit-1">Getting intimate with x86 for fun and profit: :+1:</h3>
<p>I know the title sounds cryptic, but x86 is just weird, and I wanted to make sure you’re mentally geared for some weirdness in our journey to squeeze a bit of extra performance. We need to remember that this is a 40+ year-old CISC processor made in an entirely different era:</p>
<p><img src="/assets/images/your-fathers-lea.svg" alt="Your Father's LEA" /></p>
<p>This last optimization trick repeats the same spiel I’ve been going through throughout this post: trimming the fat around our code. We’ll try to generate slightly denser code in our vectorized block. The idea here is to trigger the JIT to encode the pointer-update code at the end of our vectorized partitioning block with the space-efficient <code class="highlighter-rouge">LEA</code> instruction.</p>
<p>To better explain this, we’ll start by going back to the last 3 lines of code I presented at the top of <em>this</em> post, as part of the so-called micro-optimized version. Here is the C#:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre> <span class="c1">// end of partitioning block...</span>
<span class="kt">var</span> <span class="n">popCount</span> <span class="p">=</span> <span class="n">PopCnt</span><span class="p">.</span><span class="nf">PopCount</span><span class="p">(</span><span class="n">mask</span><span class="p">);</span>
<span class="n">writeRight</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">*)</span> <span class="p">((</span><span class="kt">byte</span><span class="p">*)</span> <span class="n">writeRight</span> <span class="p">-</span> <span class="n">popCount</span><span class="p">);</span>
<span class="n">writeLeft</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">*)</span> <span class="p">((</span><span class="kt">byte</span><span class="p">*)</span> <span class="n">writeLeft</span> <span class="p">+</span> <span class="p">(</span><span class="m">8U</span> <span class="p"><<</span> <span class="m">2</span><span class="p">)</span> <span class="p">-</span> <span class="n">popCount</span><span class="p">);</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>If we look at the corresponding disassembly for this code, it looks quite verbose. Here it is with some comments, and with the machine-code bytes on the right-hand side:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="rouge-code"><pre><span class="c1">;var popCount = PopCnt.PopCount(mask);</span>
<span class="nf">popcnt</span> <span class="nb">r8d</span><span class="p">,</span><span class="nb">r8d</span> <span class="c1">; F3450FB8C0</span>
<span class="nf">shl</span> <span class="nb">r8d</span><span class="p">,</span><span class="mi">2</span> <span class="c1">; 41C1E002</span>
<span class="c1">;writeRight = (int*) ((byte*) writeRight - popCount);</span>
<span class="nf">mov</span> <span class="nb">r9d</span><span class="p">,</span><span class="nb">r8d</span> <span class="c1">; 458BC8</span>
<span class="nf">sub</span> <span class="nb">rcx</span><span class="p">,</span><span class="nv">r9</span> <span class="c1">; 492BC9</span>
<span class="c1">;writeLeft = (int*) ((byte*) writeLeft + (8U << 2) - popCount);</span>
<span class="nf">add</span> <span class="nv">r12</span><span class="p">,</span><span class="mh">20h</span> <span class="c1">; 4983C420</span>
<span class="nf">mov</span> <span class="nb">r8d</span><span class="p">,</span><span class="nb">r8d</span> <span class="c1">; 458BC0</span>
<span class="nf">sub</span> <span class="nv">r12</span><span class="p">,</span><span class="nv">r8</span> <span class="c1">; 4D2BE0</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>If we count the bytes, everything after the <code class="highlighter-rouge">PopCount</code> instruction is taking <code class="highlighter-rouge">20</code> bytes in total: <code class="highlighter-rouge">4 + 3 + 3 + 4 + 3 + 3</code> to complete both pointer updates.</p>
<p>The motivation behind what I’m about to show is that we can replace all of this code with a <strong>much</strong> shorter sequence, taking advantage of x86’s wacky memory addressing, by tweaking the C# code <em>ever</em> so slightly. This, in turn, will nudge the C# JIT (which is already aware of these x86 shenanigans and is perfectly capable of generating the more compact machine code) to do so when it encounters the right constructs at the MSIL/bytecode level.<br />
We succeed here <em>if and when</em> we end up using one <code class="highlighter-rouge">LEA</code> instruction for each pointer modification.</p>
<p>What is <code class="highlighter-rouge">LEA</code> you ask? <strong>L</strong>oad <strong>E</strong>ffective <strong>A</strong>ddress is an instruction that exposes the full extent of x86’s memory addressing capabilities in a single instruction. It allows us to encode rather complicated mathematical/address calculations with a minimal number of bytes, abusing the CPU’s address generation units (AGUs), while storing the result of that calculation back into a register.</p>
<p>But what can the AGUs do for us? We need to learn just enough about them before we attempt to milk some performance out of them through <code class="highlighter-rouge">LEA</code>. Out of curiosity, I went back in time to find out <em>when</em> the memory addressing scheme was defined/last changed. To my surprise, I found out it was <em>much later</em> than what I had originally thought: Intel last <em>expanded</em> the memory addressing semantics as late as <strong>1986</strong>! Of course this was later expanded again by AMD when they introduced <code class="highlighter-rouge">amd64</code> to propel x86 from the 32-bit dark-ages into the brave world of 64-bit processing, but that was merely a machine-word expansion, not a functional change. I’m happy I researched this bit of history for this post because I found <a href="/assets/images/230985-001_80386_Programmers_Reference_Manual_1986.pdf">this scanned 80386 manual</a>:</p>
<center>
<div>
<p><a href="../assets/images/230985-001_80386_Programmers_Reference_Manual_1986.pdf"><img src="/assets/images/80386-manual.png" alt="80386" /></a></p>
</div>
</center>
<p>In this reference manual, the “new” memory addressing semantics are described in section <code class="highlighter-rouge">2.5.3.2</code> on page <code class="highlighter-rouge">2-18</code>, reprinted here for some of its 1980s era je ne sais quoi:</p>
<p><img src="/assets/images/x86-effective-address-calculation-transparent.png" alt="x86-effective-address-calculation" /></p>
<p>Figure <code class="highlighter-rouge">2-10</code> in the original manual does a good job explaining the components and machinery that go into a memory address calculation in x86. Here it is together with my plans to abuse it:</p>
<ul>
<li>Segment register: This is an odd over-engineered 32-bit era remnant. It’s mostly never used, so let’s skip it in this context.</li>
<li><strong>Base register</strong>: This will be our pointer that we want to modify: <code class="highlighter-rouge">writeLeft</code> and <code class="highlighter-rouge">writeRight</code>.</li>
<li><strong>Index</strong>: Basically some offset to the base: In our case the <code class="highlighter-rouge">PopCount</code> result, in some form.<br />
The index has to be <em>added</em> (<code class="highlighter-rouge">+</code>) to the base register. The operation will always be an addition; of course nothing prevents us from adding a negative number…</li>
<li><strong>Scale</strong>: The <code class="highlighter-rouge">PopCount</code> result needs to be multiplied by 4, we’ll do it with the scale.
The scale is <em>limited</em> to be one of <code class="highlighter-rouge">1/2/4/8</code>, but <em>for us</em> this is not a limitation, since multiplication by <code class="highlighter-rouge">4</code> is exactly what we need.</li>
<li><strong>Displacement</strong>: Some other constant we can tack on to the address calculation. The displacement can be 8/32 bits and is also always used with an <em>addition</em> (<code class="highlighter-rouge">+</code>) operation.</li>
</ul>
<p>There’s a key point I need to stress here: while the mathematical operations performed by <code class="highlighter-rouge">LEA</code> are always addition, we can take advantage of how two’s-complement addition/subtraction works to effectively turn this so-called addition into a subtraction.</p>
<p>The actual code change is, for lack of better words, underwhelming, but without all this preamble it wouldn’t make a lot of sense. Here it is, in all its glory:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre> <span class="c1">// ...</span>
<span class="kt">var</span> <span class="n">popCount</span> <span class="p">=</span> <span class="p">-</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span> <span class="n">PopCnt</span><span class="p">.</span><span class="nf">PopCount</span><span class="p">(</span><span class="n">mask</span><span class="p">);</span>
<span class="n">writeRight</span> <span class="p">+=</span> <span class="n">popCount</span><span class="p">;</span>
<span class="n">writeLeft</span> <span class="p">+=</span> <span class="n">popCount</span> <span class="p">+</span> <span class="m">8</span><span class="p">;</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>“Surely, you must be joking, Mr. @damageboy!”, I can almost hear you think. But really, this is it. By casting to long and <em>pre-negating</em> the <code class="highlighter-rouge">PopCount</code> result (see that little minus sign?) and reverting back to simpler pointer advancement code, without all the pre-left-shifting pizzazz from the beginning of this post, we get this beautiful, packed assembly code automatically generated for us:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre><span class="nf">popcnt</span> <span class="nb">rdi</span><span class="p">,</span><span class="nb">rdi</span> <span class="c1">; F3480FB8FF</span>
<span class="nf">neg</span> <span class="nb">rdi</span> <span class="c1">; 48F7DF</span>
<span class="nf">lea</span> <span class="nb">rax</span><span class="p">,[</span><span class="nb">rax</span><span class="o">+</span><span class="nb">rdi</span><span class="o">*</span><span class="mi">4</span><span class="p">]</span> <span class="c1">; 488D04B8</span>
<span class="nf">lea</span> <span class="nv">r15</span><span class="p">,[</span><span class="nv">r15</span><span class="o">+</span><span class="nb">rdi</span><span class="o">*</span><span class="mi">4</span><span class="o">+</span><span class="mh">20h</span><span class="p">]</span> <span class="c1">; 4D8D7CBF20</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>The new version is taking <code class="highlighter-rouge">3 + 4 + 5</code> or <code class="highlighter-rouge">12</code> bytes in total, to complete both pointer updates. So it’s clearly denser. It is important to point out that this reduces the time taken by the CPU to fetch and decode these instructions. Internally, the CPU still has to perform the same calculations as before. I’ll refrain from digressing into the mechanics of x86’s frontend, backend, and all that jazz, as it is out of scope for this blog post, so let’s just be happy with what we have.</p>
<p>Before we forget, though, does it improve performance?</p>
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#3cd2c1d5-4b4b-4f73-9603-4b138aef5ef7'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-stats"></i> Scaling</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks</a></li>
<li><a href="#"><i class="glyphicon glyphicon-info-sign"></i> Setup</a></li>
</ul>
<ul id="3cd2c1d5-4b4b-4f73-9603-4b138aef5ef7" class="uk-switcher uk-margin">
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Performance scale: Array.Sort (solid gray) is always 100%, and the other methods are scaled relative to it" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
Packed,1,1,1,1,1,1
Jedi,1.013855422,0.938475624,1.00941461,0.992908734,0.955117129,0.96278825
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3 }
}
]
},
"options": {
"title": { "text": "AVX2 Jedi Sorting - Scaled to Packed", "display": true },
"scales": {
"yAxes": [{
"ticks": {
"fontFamily": "Indie Flower",
"min": 0.92,
"callback": "ticksPercent"
},
"scaleLabel": {
"labelString": "Scaling (%)",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
Packed,16.6008,21.8446,24.2283,25.616,25.3775,27.6628
Jedi,16.8279,20.501,24.4564,25.4344,24.2384,26.6334
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(33,220,33,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3 }
}
]
},
"options": {
"title": { "text": "AVX2 Jedi Sorting + Packed - log(Time/N)", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"min": 15,
"max": 28,
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt4_5_Int32_-report.datatable.json" data-id-field="name" data-pagination="true" data-page-list="[5, 10, 15, 20]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result it. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nv">BenchmarkDotNet</span><span class="o">=</span>v0.12.0, <span class="nv">OS</span><span class="o">=</span>clear-linux-os 32120
Intel Core i7-7700HQ CPU 2.80GHz <span class="o">(</span>Kaby Lake<span class="o">)</span>, 1 CPU, 4 logical and 4 physical cores
.NET Core <span class="nv">SDK</span><span class="o">=</span>3.1.100
<span class="o">[</span>Host] : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
Job-DEARTS : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
<span class="nv">InvocationCount</span><span class="o">=</span>3 <span class="nv">IterationCount</span><span class="o">=</span>15 <span class="nv">LaunchCount</span><span class="o">=</span>2
<span class="nv">UnrollFactor</span><span class="o">=</span>1 <span class="nv">WarmupCount</span><span class="o">=</span>10
<span class="nv">$ </span><span class="nb">grep</span> <span class="s1">'stepping\|model\|microcode'</span> /proc/cpuinfo | <span class="nb">head</span> <span class="nt">-4</span>
model : 158
model name : Intel<span class="o">(</span>R<span class="o">)</span> Core<span class="o">(</span>TM<span class="o">)</span> i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>All in all, this might not look like much, but it is real: another small, unevenly spread 3-4% improvement across the sorting spectrum, if you disregard the weirdness around 10K elements. I do realize it may not look super impressive, but here’s a spoiler: a few blog posts down the road, we’ll get to unroll our loops, you know, that place where all optimization efforts end up going. When we do get there, every byte we shave off this main loop body will pay in spades. In other words, while some of the optimizations may appear minor, I have a different metric in mind when it comes to improving the loop body, even by a single percent, while we’re still not unrolling it. That’s one of those places where a little experience affords better foresight.</p>
</div>
<p>I have to come clean here: I’ve left some pennies on the floor. We could still go one step further and get rid of one more 3-byte instruction in the loop. Alas, I’ve made an executive decision not to do so in this blog post: for one, this post has already become quite long, and I doubt a substantial number of the people who started reading it are still with us, with a beating pulse. Moreover, this specific optimization would not really shine at this moment. As such, I’ll come back to it once we get to unrolling this loop.</p>
<h2 id="weve-come-a-long-way-baby">We’ve Come a Long Way, Baby!</h2>
<p>We’ve done quite a lot to optimize the vectorized partitioning so far. All these incremental improvements pile up when you multiply them on top of one another.</p>
<p>Don’t believe me? Here’s one last group of charts and data tables to show what distance we’ve travelled in a single blog post:</p>
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#95c61e0c-ed74-4a51-9393-6468adfa0452'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-stats"></i> Scaling</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks</a></li>
<li><a href="#"><i class="glyphicon glyphicon-info-sign"></i> Setup</a></li>
</ul>
<ul id="95c61e0c-ed74-4a51-9393-6468adfa0452" class="uk-switcher uk-margin">
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Performance scale: Array.Sort (solid gray) is always 100%, and the other methods are scaled relative to it" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
Naive,1,1,1,1,1,1
Jedi,0.70862069,0.717993202,0.795472874,0.783355194,0.824350492,0.82130157
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(218,165,32,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3 }
}
]
},
"options": {
"title": { "text": "AVX2 end of Blog Pt. 4 - Scaled to end of Pt. 3", "display": true },
"scales": {
"yAxes": [{
"ticks": {
"fontFamily": "Indie Flower",
"min": 0.65,
"callback": "ticksPercent"
},
"scaleLabel": {
"labelString": "Scaling (%)",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
Naive,23.2032,28.2439,30.9998,32.4093,29.5396,32.2364
Jedi,16.4365,20.2787,24.6595,25.388,24.351,26.4758
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(218,165,32,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 60, "hachureGap": 3 }
}
]
},
"options": {
"title": { "text": "AVX2 end of Pt. 4 + end of Pt. 3 - log(Time/N)", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"min": 15,
"max": 33,
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt4_6_Int32_-report.datatable.json" data-id-field="name" data-pagination="true" data-page-list="[5, 10, 15, 20]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result it. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nv">BenchmarkDotNet</span><span class="o">=</span>v0.12.0, <span class="nv">OS</span><span class="o">=</span>clear-linux-os 32120
Intel Core i7-7700HQ CPU 2.80GHz <span class="o">(</span>Kaby Lake<span class="o">)</span>, 1 CPU, 4 logical and 4 physical cores
.NET Core <span class="nv">SDK</span><span class="o">=</span>3.1.100
<span class="o">[</span>Host] : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
Job-DEARTS : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
<span class="nv">InvocationCount</span><span class="o">=</span>3 <span class="nv">IterationCount</span><span class="o">=</span>15 <span class="nv">LaunchCount</span><span class="o">=</span>2
<span class="nv">UnrollFactor</span><span class="o">=</span>1 <span class="nv">WarmupCount</span><span class="o">=</span>10
<span class="nv">$ </span><span class="nb">grep</span> <span class="s1">'stepping\|model\|microcode'</span> /proc/cpuinfo | <span class="nb">head</span> <span class="nt">-4</span>
model : 158
model name : Intel<span class="o">(</span>R<span class="o">)</span> Core<span class="o">(</span>TM<span class="o">)</span> i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>We can see that we’ve managed to trim a lot of excess fat off this little monster of ours. It’s shaping up to be one mean sorting machine, for sure. Comparing to where we were at the end of the previous blog post:</p>
<ul>
<li>We have a more pronounced effect for these optimizations in the lower end of the spectrum, cutting down an additional 30% of the runtime for anything below <code class="highlighter-rouge">1000</code> elements.</li>
<li>Above <code class="highlighter-rouge">1000</code> elements, we’ve “only” succeeded in reducing the runtime by 20%. Then again, it’s 20% off of tens and hundreds of milliseconds of total runtime, which is nothing to snicker at.</li>
</ul>
<p>Next up, we’ll have to take on what is a non-trivial problem of dealing with memory alignment, in the scope of a complicated partitioning algorithm like QuickSort.</p>
</div>
<h1><a href="https://bits.houmus.org/2020-01-30/this-goes-to-eleven-pt3">This Goes to Eleven (Part. 3/∞)</a> (2020-01-30)</h1>
<p>Since there’s a lot to go over here, I’ve split it up into a few parts:</p>
<ol>
<li>In <a href="/2020-01-28/this-goes-to-eleven-pt1">part 1</a>, we start with a refresher on <code class="highlighter-rouge">QuickSort</code> and how it compares to <code class="highlighter-rouge">Array.Sort()</code>.</li>
<li>In <a href="/2020-01-29/this-goes-to-eleven-pt2">part 2</a>, we go over the basics of vectorized hardware intrinsics, vector types, and go over a handful of vectorized instructions we’ll use in part 3. We still won’t be sorting anything.</li>
<li>In this part, we go through the initial code for the vectorized sorting, and start seeing some payoff. We finish agonizing courtesy of the CPU’s branch predictor, throwing a wrench into our attempts.</li>
<li>In part 4, we go over a handful of optimization approaches that I attempted trying to get the vectorized partitioning to run faster. We’ll see what worked and what didn’t.</li>
<li>In part 5, we’ll see how we can almost get rid of all the remaining scalar code- by implementing small-constant size array sorting. We’ll use, drum roll…, yet more AVX2 vectorization.</li>
<li>Finally, in part 6, I’ll list the outstanding stuff/ideas I have for getting more juice and functionality out of my vectorized code.</li>
</ol>
<h2 id="unstable-vectorized-partitioning--quicksort">Unstable Vectorized Partitioning + QuickSort</h2>
<p>It’s time we mash all the new knowledge we picked up in the last posts about SIMD registers, instructions, and <code class="highlighter-rouge">QuickSort</code>ing into something useful. Here’s the plan:</p>
<ul>
<li>Vectorized in-place partitioning:
<ul>
<li>First, we learn to take 8-element blocks, or units of <code class="highlighter-rouge">Vector256<int></code>, and partition them with AVX2 intrinsics.</li>
<li>Then we take Berlin: We reuse our block to partition an entire array with a method I named double-pumping, suitable for processing large arrays in-place with this vectorized block.</li>
</ul>
</li>
<li>Once we’ve covered vectorized partitioning, we finish up with some innocent glue-code wrapping the whole thing to look like a proper <code class="highlighter-rouge">Array.Sort</code> replacement.</li>
</ul>
<p>Now that we’re finally doing our own thing, it’s time to address a baby elephant hiding in the room: stable vs. unstable sorting. I should probably explain: one possible way to categorize sorting algorithms is with respect to their stability: do they reorder <em>equal</em> values as they appear in the original input data or not? Stable sorting does not reorder, while unstable sorting provides no such guarantee.<br />
Stability <em>might</em> be critical for certain tasks, for example:</p>
<ul>
<li>When sorting an array of structs/classes according to a key embedded as a member, while providing a non-default <code class="highlighter-rouge">IComparer<T></code> or <code class="highlighter-rouge">Comparison<T></code>, we might care about preserving the order of the containing type.</li>
<li>Similarly, when sorting pairs of arrays: keys and values, reordering both arrays according to the sorted order of the keys, while preserving the ordering of values for equal keys.</li>
</ul>
<p>At the same time, stable sorting is a non-issue when:</p>
<ul>
<li>Sorting arrays of simple primitives; stability is meaningless:<br />
(what would a “stable sort” of the array <code class="highlighter-rouge">[7, 7, 7]</code> even mean?)</li>
<li>At other times, we <em>know</em> for a fact that our keys are unique. There is no unstable sorting for unique keys.</li>
<li>Lastly, sometimes, <em>we just don’t care</em>. We’re fine if our data gets reordered.</li>
</ul>
<p>In the .NET/C# world, one could say that the landscape regarding sorting is a little unstable (pun intended):</p>
<ul>
<li>
<p><a href="https://docs.microsoft.com/en-us/dotnet/api/system.array.sort?view=netcore-3.1"><code class="highlighter-rouge">Array.Sort</code></a> is unstable, as is clearly stated in the remarks section:</p>
<blockquote>
<p>This implementation performs an unstable sort; that is, if two elements are equal, their order might not be preserved.</p>
</blockquote>
</li>
<li>
<p>On the other hand, <a href="https://docs.microsoft.com/en-us/dotnet/api/system.linq.enumerable.orderby?view=netcore-3.1"><code class="highlighter-rouge">Enumerable.OrderBy</code></a> is stable:</p>
<blockquote>
<p>This method performs a stable sort; that is, if the keys of two elements are equal, the order of the elements is preserved.</p>
</blockquote>
</li>
</ul>
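<p>To make that difference concrete, here is a small demonstration of the two documented behaviors; the tuples and their values are made up purely for illustration:</p>

```csharp
using System;
using System.Linq;

static class StabilityDemo
{
    public static void Demo()
    {
        // Two entries share the key 7; a stable sort must keep "first" before "second".
        var items = new[] {
            (Key: 7, Value: "first"),
            (Key: 3, Value: "only"),
            (Key: 7, Value: "second"),
        };

        // Enumerable.OrderBy is documented as stable: equal keys keep their input order.
        var stable = items.OrderBy(t => t.Key).Select(t => t.Value).ToArray();
        Console.WriteLine(string.Join(",", stable));
        // only,first,second -- always

        // Array.Sort makes no such promise: "only" will end up first,
        // but the two Key == 7 entries may come back in either order.
        Array.Sort(items, (a, b) => a.Key.CompareTo(b.Key));
    }
}
```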
<p>In general, what I came up with in my full repo/NuGet package are algorithms capable of both stable and unstable sorting, with two caveats:</p>
<ul>
<li>Stable sorting is considerably slower than unstable sorting (But still faster than <code class="highlighter-rouge">Array.Sort</code>).</li>
<li>Stable sorting is less elegant/fun to explain.</li>
</ul>
<p>Given all this, and the fact that I am only presenting pure primitive sorting anyway, where there is no notion of stability to begin with, I will be describing my unstable sorting approach in this series. It doesn’t take a lot of imagination to get from here to the stable variant, but I’m not going to address it in these posts. It is also important to note that, in general, when there is doubt about whether stability is a requirement (e.g., for key/value, <code class="highlighter-rouge">IComparer&lt;T&gt;</code>/<code class="highlighter-rouge">Comparison&lt;T&gt;</code>, or non-primitive sorting), we should err on the side of safety and go for stable sorting.</p>
<h3 id="avx2-partitioning-block">AVX2 Partitioning Block</h3>
<p>Let’s start with this “simple” block, describing what we do with moving pictures.</p>
<table style="margin-bottom: 0em" class="notice--info">
<tr>
<td style="border: none; padding-top: 0; padding-bottom: 0; vertical-align: top"><span class="uk-label">Hint</span></td>
<td style="border: none; padding-top: 0; padding-bottom: 0">From here on, the following icon means I have a thingy that animates:
<object style="margin: auto; vertical-align: middle;" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/play.svg"></object><br />
Click/Touch/Hover <b>inside</b> means: <i class="glyphicon glyphicon-play"></i><br />
Click/Touch/Hover <b>outside</b> means: <i class="glyphicon glyphicon-pause"></i>
</td>
</tr>
</table>
<object class="animated-border" width="100%" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/block-unified-with-hint.svg"></object>
<p>Here is the same block, in more traditional code form:</p>
<div>
<div class="stickemup">
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
</pre></td><td class="rouge-code"><pre><span class="kt">var</span> <span class="n">P</span> <span class="p">=</span> <span class="n">Vector256</span><span class="p">.</span><span class="nf">Create</span><span class="p">(</span><span class="n">pivot</span><span class="p">);</span> <span class="c1">// Outside any loop, top-level in the function</span>
<span class="p">...</span>
<span class="p">[</span><span class="nf">MethodImpl</span><span class="p">(</span><span class="n">MethodImplOptions</span><span class="p">.</span><span class="n">AggressiveInlining</span><span class="p">)]</span>
<span class="k">static</span> <span class="k">unsafe</span> <span class="k">void</span> <span class="nf">PartitionBlock</span><span class="p">(</span><span class="kt">int</span> <span class="p">*</span><span class="n">dataPtr</span><span class="p">,</span> <span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="n">P</span><span class="p">,</span>
<span class="k">ref</span> <span class="kt">int</span><span class="p">*</span> <span class="n">writeLeft</span><span class="p">,</span> <span class="k">ref</span> <span class="kt">int</span><span class="p">*</span> <span class="n">writeRight</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">data</span> <span class="p">=</span> <span class="n">Avx2</span><span class="p">.</span><span class="nf">LoadDquVector256</span><span class="p">(</span><span class="n">dataPtr</span><span class="p">);</span>
<span class="kt">var</span> <span class="n">mask</span> <span class="p">=</span> <span class="p">(</span><span class="kt">uint</span><span class="p">)</span> <span class="n">Avx</span><span class="p">.</span><span class="nf">MoveMask</span><span class="p">(</span>
<span class="n">Avx2</span><span class="p">.</span><span class="nf">CompareGreaterThan</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">P</span><span class="p">).</span><span class="nf">AsSingle</span><span class="p">());</span>
<span class="n">data</span> <span class="p">=</span> <span class="n">Avx2</span><span class="p">.</span><span class="nf">PermuteVar8x32</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>
        <span class="n">Avx2</span><span class="p">.</span><span class="nf">LoadDquVector256</span><span class="p">(</span><span class="n">PermTablePtr</span> <span class="p">+</span> <span class="n">mask</span> <span class="p">*</span> <span class="m">8</span><span class="p">));</span>
<span class="n">Avx</span><span class="p">.</span><span class="nf">Store</span><span class="p">(</span><span class="n">writeLeft</span><span class="p">,</span> <span class="n">data</span><span class="p">);</span>
<span class="n">Avx</span><span class="p">.</span><span class="nf">Store</span><span class="p">(</span><span class="n">writeRight</span><span class="p">,</span> <span class="n">data</span><span class="p">);</span>
    <span class="kt">var</span> <span class="n">popCount</span> <span class="p">=</span> <span class="n">Popcnt</span><span class="p">.</span><span class="nf">PopCount</span><span class="p">(</span><span class="n">mask</span><span class="p">);</span>
    <span class="n">writeRight</span> <span class="p">-=</span> <span class="n">popCount</span><span class="p">;</span>
    <span class="n">writeLeft</span> <span class="p">+=</span> <span class="m">8</span> <span class="p">-</span> <span class="n">popCount</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
</div>
<p>There’s a lot of cheese here; let’s break this down:</p>
<div class="divTable">
<div class="divTableBody">
<div class="divTableRow">
<div class="divTableCell"><span class="uk-label">L1</span></div>
<div class="divTableCell">
<p>Broadcast the pivot value to a vector I’ve named <code class="highlighter-rouge">P</code>. We’re merely creating 8-copies of the selected pivot value in a SIMD register.<br />
Technically, this isn’t really part of the block, as this happens only <em>once</em> per partitioning function call! It’s included here for context.</p>
</div>
</div>
<div class="divTableRow">
<div class="divTableCell"><span class="uk-label">L3-5</span></div>
<div class="divTableCell">
<p>We wrap our block in a static function. We aggressively inline it in strategic places throughout the rest of the code.<br />
This may look like an odd signature, but think of its purpose: We avoid copy-pasting code while also avoiding any performance penalty.</p>
</div>
</div>
<div class="divTableRow">
<div class="divTableCell"><span class="uk-label">L6</span></div>
<div class="divTableCell">
          <p>Load up data from somewhere in our array. <code class="highlighter-rouge">dataPtr</code> points to some unpartitioned data. <code class="highlighter-rouge">data</code> will be loaded with the data we intend to partition, and that’s the important bit.</p>
</div>
</div>
<div class="divTableRow">
<div class="divTableCell"><span class="uk-label">L7-8</span></div>
<div class="divTableCell">
<p>Perform an 8-way comparison using <code class="highlighter-rouge">CompareGreaterThan</code>, then proceed to convert/compress the 256-bit result into an 8-bit value using the <code class="highlighter-rouge">MoveMask</code> intrinsic.<br />
The goal here is to generate a <strong>scalar</strong> <code class="highlighter-rouge">mask</code> value, that contains a single <code class="highlighter-rouge">1</code> bit for every comparison where the corresponding data element was <em>greater-than</em> the pivot value and <code class="highlighter-rouge">0</code> bits for all others. If you are having a hard time following <em>why</em> this does this, you need to head back to the <a href="/2020-01-29/this-goes-to-eleven-pt2">2<sup>nd</sup> post</a> and read up on these two intrinsics/watch their animations.</p>
</div>
</div>
<div class="divTableRow">
<div class="divTableCell"><span class="uk-label">L9-10</span></div>
<div class="divTableCell">
<p>Permute the loaded data according to a permutation vector; A-ha! A twist in the plot!<br />
<code class="highlighter-rouge">mask</code> contains 8 bits, from LSB to MSB, describing where each element belongs (left/right). We could, of course, loop over those bits and perform 8 branches to determine which side each element belongs to, but that would be a terrible mistake. Instead, we’re going to use the <code class="highlighter-rouge">mask</code> as an <em>index</em> into a lookup-table for permutation values!<br />
This is one of the reasons it was critical to use <code class="highlighter-rouge">MoveMask</code> in the first place. Without it, we would not have a scalar value we could use as an index to our table. Pretty neat, no?<br />
With the permutation operation done, we’ve grouped all the <em>smaller-than-or-equal</em> values on one side of our <code class="highlighter-rouge">data</code> vector (the “left” side) and all the <em>greater-than</em> values on the other side (the “right” side).<br />
I’ve conveniently glossed over the actual values in the permutation lookup-table that <code class="highlighter-rouge">PermTablePtr</code> points to; I’ll address this a couple of paragraphs below.</p>
</div>
</div>
</div>
</div>
<p>Partitioning is now practically complete: That is, our <code class="highlighter-rouge">data</code> vector is neatly partitioned. Except that the data is still “stuck” inside our vector. We need to write its contents back to memory. Here comes a small complication: our <code class="highlighter-rouge">data</code> vector now contains values belonging <em>both</em> to the left and right sides of the original array. We did separate them <strong>within</strong> the vector, but we’re not done until each side is written back to memory, on both ends of our array.</p>
<div class="divTable">
<div class="divTableBody">
<div class="divTableRow">
<div class="divTableCell"><span class="uk-label">L11-12</span></div>
<div class="divTableCell">
<p>Store the permuted vector to both sides of the array. There is no cheap way to write <em>portions</em> of a vector to each respective end, so we write the <strong>entire</strong> partitioned vector to both the <em>left</em> <strong>and</strong> <em>right</em> sides of the array.<br />
At any given moment, we have two write pointers pointing to where we need to write to <strong>next</strong> on either side: <code class="highlighter-rouge">writeLeft</code> and <code class="highlighter-rouge">writeRight</code>. How those are initialized and maintained will be dealt with further down where we start calling this block, but for now, let’s assume these pointers initially point to somewhere where it is <strong>safe</strong> to write <em>at least</em> an entire <code class="highlighter-rouge">Vector256<T></code> and move on.</p>
</div>
</div>
<div class="divTableRow">
<div class="divTableCell"><span class="uk-label">L13-15</span></div>
<div class="divTableCell">
          <p>Book-keeping time: We just wrote 8 elements to each side, and each side had a trail of unwanted data tacked to it. We didn’t care for it while we were writing it, because we knew we’re about to update the same write pointers in such a way that the <em>next</em> write operations will <strong>overwrite</strong> the trailing/unwanted data that doesn’t belong to each respective side!<br />
The vector gods are smiling at us: We have the <code class="highlighter-rouge">PopCount</code> intrinsic to lend us a hand here. We issue <code class="highlighter-rouge">PopCount</code> on the same <code class="highlighter-rouge">mask</code> variable (again, <code class="highlighter-rouge">MoveMask</code> was worth its weight in gold here) and get a count of how many bits in <code class="highlighter-rouge">mask</code> were <code class="highlighter-rouge">1</code>. This accounts for how many values <strong>inside</strong> the vector were <em>greater-than</em> the pivot value and belong to the right side.<br />
This “happens” to be the amount by which we want to <em>decrease</em> the <code class="highlighter-rouge">writeRight</code> pointer (<code class="highlighter-rouge">writeRight</code> is “advanced” by decrementing it; this may seem weird for now, but will become clearer when we discuss the outer loop!)<br />
Finally, we adjust the <code class="highlighter-rouge">writeLeft</code> pointer: <code class="highlighter-rouge">popCount</code> contains the number of <code class="highlighter-rouge">1</code> bits; the number of <code class="highlighter-rouge">0</code> bits is by definition, <code class="highlighter-rouge">8 - popCount</code> since <code class="highlighter-rouge">mask</code> had 8 bits of content in it, to begin with. This accounts for how many values in the register were <em>less-than-or-equal</em> the pivot value and grouped on the left side of the register.</p>
</div>
</div>
</div>
</div>
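<p>If the vectorized steps are still a bit dizzying, here is a scalar model of the very same block: compare → mask → permute → double store → pointer bookkeeping. This is for building intuition only; the names are mine, it uses indices instead of pointers, and it emulates the permutation directly rather than through the lookup table (so the order <em>within</em> each group may differ from the real table’s, which, as noted, doesn’t matter):</p>

```csharp
using System;
using System.Numerics;

static class BlockModel
{
    // Scalar stand-in for the 8-element vectorized partition block.
    public static void PartitionBlockScalar(int[] array, int readOffset, int pivot,
                                            ref int writeLeft, ref int writeRight)
    {
        // CompareGreaterThan + MoveMask: one bit per element, 1 == "> pivot".
        uint mask = 0;
        for (var i = 0; i < 8; i++)
            if (array[readOffset + i] > pivot)
                mask |= 1u << i;

        // PermuteVar8x32 (via the lookup table, in the real code): smaller-or-equal
        // values end up on the left of the vector, greater-than values on the right.
        var permuted = new int[8];
        int lo = 0, hi = 7;
        for (var i = 0; i < 8; i++)
            if ((mask & (1u << i)) == 0) permuted[lo++] = array[readOffset + i];
            else                         permuted[hi--] = array[readOffset + i];

        // Two full 8-element stores, one at each end; each side gets a trail of
        // unwanted data that the *next* store to that side will overwrite.
        Array.Copy(permuted, 0, array, writeLeft, 8);
        Array.Copy(permuted, 0, array, writeRight, 8);

        // PopCount bookkeeping: advance each write pointer by the number of
        // elements that actually belong to its side.
        var popCount = BitOperations.PopCount(mask);
        writeRight -= popCount;
        writeLeft  += 8 - popCount;
    }
}
```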
<p>This was a full 8-element wise partitioning block, and it’s worth noting a thing or two about it:</p>
<ul>
<li>It is completely branch-less(!): We’ve given the CPU a nice juicy block with no need to speculate on what code gets executed next. It sure looks pretty when you consider the number of branches our scalar code would execute for the same amount of work. Don’t pop a champagne bottle quite yet though, we’re about to run into a wall full of thorny branches in a second, but sure feels good for now.</li>
  <li>If we want to execute multiple copies of this block, the main dependency from one block to the next is the mutation of the <code class="highlighter-rouge">writeLeft</code> and <code class="highlighter-rouge">writeRight</code> pointers. It’s unavoidable given we set out to perform in-place sorting (well, I couldn’t avoid it, maybe you can!), but worth mentioning nonetheless. If you need a reminder about how these data-dependencies can change the dynamics of efficient execution, you can read up on when I tried my best to go at it battling with <a href="/2018-08-20/netcoreapp3.0-intrinsics-in-real-life-pt3"><code class="highlighter-rouge">PopCount</code> to run screaming fast</a>; If nothing else, you’ll get a clearer understanding of how the CPU extracts data-flows from our code.</li>
</ul>
<p>I thought it would be nice to wrap up the discussion of this block by showing off that the JIT is relatively well-behaved in this case with the generated x64 asm:<br />
Anyone who has followed the C# code can use the intrinsics table from the previous post and read the assembly code without further help. Also, it becomes clear how this is a 1:1 translation of C# code. Congratulations: It’s 2020, and we’re x86 assembly programmers again!</p>
</div>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nf">vmovd</span> <span class="nv">xmm1</span><span class="p">,</span><span class="nb">r15d</span> <span class="c1">; Broadcast</span>
<span class="nf">vbroadcastd</span> <span class="nv">ymm1</span><span class="p">,</span><span class="nv">xmm1</span> <span class="c1">; pivot</span>
<span class="nf">...</span>
<span class="nf">vlddqu</span> <span class="nv">ymm0</span><span class="p">,</span> <span class="nv">ymmword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rax</span><span class="p">]</span> <span class="c1">; load 8 elements</span>
<span class="nf">vpcmpgtd</span> <span class="nv">ymm2</span><span class="p">,</span> <span class="nv">ymm0</span><span class="p">,</span> <span class="nv">ymm1</span> <span class="c1">; compare</span>
<span class="nf">vmovmskps</span> <span class="nb">ecx</span><span class="p">,</span> <span class="nv">ymm2</span> <span class="c1">; movemask into scalar reg</span>
<span class="nf">mov</span> <span class="nb">r9d</span><span class="p">,</span> <span class="nb">ecx</span> <span class="c1">; copy to r9</span>
<span class="nf">shl</span> <span class="nb">r9d</span><span class="p">,</span> <span class="mh">0x3</span> <span class="c1">; *= 8</span>
<span class="nf">vlddqu</span>      <span class="nv">ymm2</span><span class="p">,</span> <span class="nv">ymmword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rdx</span><span class="o">+</span><span class="nb">r9</span><span class="o">*</span><span class="mi">4</span><span class="p">]</span> <span class="c1">; load permutation</span>
<span class="nf">vpermd</span> <span class="nv">ymm0</span><span class="p">,</span> <span class="nv">ymm2</span><span class="p">,</span> <span class="nv">ymm0</span> <span class="c1">; permute</span>
<span class="nf">vmovdqu</span> <span class="nv">ymmword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nv">r12</span><span class="p">],</span> <span class="nv">ymm0</span> <span class="c1">; store left</span>
<span class="nf">vmovdqu</span> <span class="nv">ymmword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nv">r8</span><span class="p">],</span> <span class="nv">ymm0</span> <span class="c1">; store right</span>
<span class="nf">popcnt</span> <span class="nb">ecx</span><span class="p">,</span> <span class="nb">ecx</span> <span class="c1">; popcnt</span>
<span class="nf">...</span> <span class="c1">; update writeLeft/writeRight pointers</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<h2 id="permutation-lookup-table">Permutation lookup table</h2>
<p>If you made it this far, you are owed an explanation of the permutation lookup table. Let’s see what’s in it:</p>
<ul>
<li>The table needs to have 2<sup>8</sup> elements for all possible mask values.</li>
<li>Each element ultimately needs to be a <code class="highlighter-rouge">Vector256<int></code> because that’s what the permutation intrinsic expects from us, so 8 x 4 bytes = 32 bytes per element.
<ul>
      <li>That’s a whopping 8 KB of lookup data in total (!).</li>
</ul>
</li>
<li>The values inside are <a href="https://github.com/damageboy/VxSort/blob/research/TestBlog/PermutationTableTests.cs#L20">pre-generated</a> so that they would reorder the data <em>inside</em> a <code class="highlighter-rouge">Vector256<int></code> according to our wishes: all values that got a corresponding <code class="highlighter-rouge">1</code> bit in the mask go to one side (right side), and the elements with a <code class="highlighter-rouge">0</code> go to the other side (left side). There’s no particular required order amongst the grouped elements since we’re merely partitioning around a pivot value, nothing more, nothing less.</li>
</ul>
<p>Here are 4 sample values from the generated permutation table that I’ve copy-pasted so we can get a feel for it:</p>
<div>
<div class="stickemup">
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre><span class="k">static</span> <span class="n">ReadOnlySpan</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="n">PermTable</span> <span class="p">=></span> <span class="k">new</span><span class="p">[]</span> <span class="p">{</span>
<span class="m">0</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="m">3</span><span class="p">,</span> <span class="m">4</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="m">6</span><span class="p">,</span> <span class="m">7</span><span class="p">,</span> <span class="c1">// 0 => 0b00000000</span>
<span class="c1">// ...</span>
<span class="m">3</span><span class="p">,</span> <span class="m">4</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="m">6</span><span class="p">,</span> <span class="m">7</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="c1">// 7 => 0b00000111</span>
<span class="c1">// ...</span>
<span class="m">0</span><span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="m">4</span><span class="p">,</span> <span class="m">6</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">3</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="m">7</span><span class="p">,</span> <span class="c1">// 170 => 0b10101010</span>
<span class="c1">// ...</span>
<span class="m">0</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="m">3</span><span class="p">,</span> <span class="m">4</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="m">6</span><span class="p">,</span> <span class="m">7</span><span class="p">,</span> <span class="c1">// 255 => 0b11111111</span>
<span class="p">};</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
</div>
<ul>
<li>For <code class="highlighter-rouge">mask</code> values 0, 255 the entries are trivial: All <code class="highlighter-rouge">mask</code> bits were either <code class="highlighter-rouge">1</code> or <code class="highlighter-rouge">0</code> so there’s nothing we need to do with the data, we just leave it as is, the “null” permutation vector: <code class="highlighter-rouge">[0, 1, 2, 3, 4, 5, 6, 7]</code> achieves just that.</li>
<li>When <code class="highlighter-rouge">mask</code> is <code class="highlighter-rouge">0b00000111</code> (decimal 7), the 3 lowest bits of the <code class="highlighter-rouge">mask</code> are <code class="highlighter-rouge">1</code>, they represent elements that need to go to the right side of the vector (e.g., elements that were <code class="highlighter-rouge">> pivot</code>), while all other values need to go to the left (<code class="highlighter-rouge"><= pivot</code>). The permutation vector: <code class="highlighter-rouge">[3, 4, 5, 6, 7, 0, 1, 2]</code> does just that.</li>
<li>The checkered bit pattern for the <code class="highlighter-rouge">mask</code> value <code class="highlighter-rouge">0b10101010</code> (decimal 170) calls to move all the even elements to one side and the odd elements to the other… You can see that <code class="highlighter-rouge">[0, 2, 4, 6, 1, 3, 5, 7]</code> does the work here.</li>
</ul>
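<p>For reference, entries like the samples above can be reproduced with a few lines of scalar code. This is only a sketch of the generation logic, not the repo’s actual pre-generation code, though it happens to keep the same relative ordering as the sample entries shown:</p>

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class PermTableGen
{
    // Generate one 8-entry permutation vector for a given 8-bit mask:
    // indices whose mask bit is 0 (<= pivot) come first, in order,
    // followed by indices whose mask bit is 1 (> pivot), in order.
    public static int[] GenerateEntry(byte mask)
    {
        var left  = new List<int>();
        var right = new List<int>();
        for (var i = 0; i < 8; i++)
            if ((mask & (1 << i)) == 0)
                left.Add(i);
            else
                right.Add(i);
        return left.Concat(right).ToArray();
    }

    // The full 256-entry table (8 KB when stored as Vector256<int> data):
    public static int[][] GenerateTable() =>
        Enumerable.Range(0, 256).Select(m => GenerateEntry((byte) m)).ToArray();
}
```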
<table style="margin-bottom: 0em" class="notice--warning">
<tr>
<td style="border: none; padding-top: 0; padding-bottom: 0; vertical-align: top"><span class="uk-label uk-label-warning">Note</span></td>
<td style="border: none; padding-top: 0; padding-bottom: 0"><div>
<p>The permutation table signature provided here is technically a lie: The <a href="https://github.com/damageboy/VxSort/blob/research/VxSortResearch/PermutationTables/Int32PermTables.cs#L12">actual code</a> uses <code class="highlighter-rouge">ReadOnlySpan<byte></code> as the table’s type, with the <code class="highlighter-rouge">int</code> values encoded as individual bytes in little-endian encoding. This is a C# 7.3 specific optimization where we get to treat the address of this table as a constant at JIT time. Kevin Jones (<a href="https://twitter.com/vcsjones">@vcsjones</a>) did a wonderful job of <a href="https://vcsjones.dev/2019/02/01/csharp-readonly-span-bytes-static/">digging into it</a>.<br />
We <strong>must</strong> use a <code class="highlighter-rouge">ReadOnlySpan<byte></code> for the optimization to trigger: Not reading <em>that</em> fine-print cost me two nights of my life chasing what I was <em>sure</em> had to be a GC/JIT bug. Normally, it would be a <strong>bad</strong> idea to store a <code class="highlighter-rouge">ReadOnlySpan<int></code> as a <code class="highlighter-rouge">ReadOnlySpan<byte></code>: we are forced to choose between little/big-endian encoding <em>at compile-time</em>. This runs up against the fact that in C# we compile once and debug (and occasionally run :) everywhere. Therefore, we have to <em>assume</em> our binaries might run on both little/big-endian machines where the CPU might not match the encoding we chose.<br />
<strong>In this case</strong>, praise the vector deities, blessed be their name and all that they touch, this is a <em>non-issue</em>: The entire premise is <strong>x86</strong> specific. This means that this code will <strong>never</strong> run on a big-endian machine. We can simply assume little endianness here till the end of all times.</p>
</div>
</td>
</tr>
</table>
</div>
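<p>The shape of that <code class="highlighter-rouge">ReadOnlySpan&lt;byte&gt;</code> trick, stripped down to a minimal illustration (the bytes here are just the first, identity entry of the table, hand-encoded as little-endian <code class="highlighter-rouge">int</code>s; the real 8 KB table is generated, not typed out):</p>

```csharp
using System;
using System.Runtime.InteropServices;

static class SpanTrick
{
    // A ReadOnlySpan<byte> property over a constant array initializer compiles
    // into a pointer straight into the assembly's static data: no allocation,
    // no copying, and the JIT gets to treat the address as a constant.
    static ReadOnlySpan<byte> PermTableBytes => new byte[] {
        0, 0, 0, 0,   1, 0, 0, 0,   2, 0, 0, 0,   3, 0, 0, 0, // 0, 1, 2, 3
        4, 0, 0, 0,   5, 0, 0, 0,   6, 0, 0, 0,   7, 0, 0, 0, // 4, 5, 6, 7
        // ... 255 more 32-byte entries in the real table
    };

    // Reinterpret the little-endian bytes as one 8-int permutation entry;
    // safe here only because this code path is x86-specific, as noted above.
    public static ReadOnlySpan<int> Entry(int mask) =>
        MemoryMarshal.Cast<byte, int>(PermTableBytes).Slice(mask * 8, 8);
}
```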
<p>We’ve covered the basic layout of the permutation table. We’ll go back to it once we start optimization efforts in earnest on the 4<sup>th</sup> post, but for now, we can move on to the loop surrounding our vectorized partition block.</p>
<h2 id="double-pumped-loop">Double Pumped Loop</h2>
<p>Armed with a vectorized partitioning block, it’s time to hammer our unsorted array with it, but there’s a wrinkle: In-place sorting. This brings a new challenge to the table: If you followed the previous section carefully, you might have noticed it already. For every <code class="highlighter-rouge">Vector256<int></code> we read, we ended up writing that same vector twice to both ends of the array. You don’t have to be a math wizard to figure out that if we end up writing 16 elements for every 8 we read, that doesn’t sound very in-placy, to begin with. Moreover, this extra writing would have to overwrite data that we have <em>not read yet</em>.<br />
Initially, it would seem, we’ve managed to position ourselves between a rock and a hard place.</p>
<p>But all is not lost! In reality, we immediately adjust the next write positions on both sides in such a way that their <strong>sum</strong> advances by 8. In other words, we are at risk of overwriting unread data only temporarily while we store the data back. I ended up adopting a tricky approach: We will need to continuously make sure we have at least 8 elements (the size of our block) of free space on <em>both</em> sides of the array so we could, in turn, perform a full, efficient 8-element write to both ends without overwriting a single bit of data we haven’t read yet.</p>
<p>Here’s a visual representation of the mental model I was in while debugging/making this work (I’ll note I had the same facial expressions as this poor Charmander while writing and debugging that code):</p>
<video controls="" playsinline="" loop="" preload="auto" width="100%">
<source src="../talks/intrinsics-sorting-2019/fire.webm" type="video/webm" />
<source src="../talks/intrinsics-sorting-2019/fire.mp4" type="video/mp4" />
<img src="../talks/intrinsics-sorting-2019/fire.gif " alt="" />
</video>
<p><br /></p>
<p>Funny, right? It’s closer to what I actually do than I’d like to admit! I fondly named this approach in my code as “double-pumped partitioning”. It pumps values into/out of <strong>both</strong> ends of the array at all times. I’ve left it pretty much intact in the repo under the name <a href="https://github.com/damageboy/VxSort/blob/research/VxSortResearch/Unstable/AVX2/Happy/00_DoublePumpNaive.cs"><code class="highlighter-rouge">DoublePumpNaive</code></a>, in case you want to dig through the full code. Like all good things in life, it comes in 3 parts:</p>
<ul>
<li>Prime the pump (make some initial room inside the array).</li>
<li>Loop over the data in 8-element chunks executing our vectorized code block.</li>
<li>Finally, go over the last remaining data elements (e.g. the last remaining <code class="highlighter-rouge">< 8</code> block of unpartitioned data) and partition them using scalar code. This is a very common and unfortunate pattern we find in vectorized code, as we need to finish off with just a bit of scalar work.</li>
</ul>
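<p>Those three parts can be modeled with safe, scalar C#: a 4-element “block” instead of a <code class="highlighter-rouge">Vector256&lt;int&gt;</code>, plain indices instead of pointers, and lists standing in for the temporary stack space. All names here are mine; this is for building intuition, not the actual implementation:</p>

```csharp
using System;
using System.Collections.Generic;

static class DoublePumpModel
{
    const int N = 4; // block width; 8 (one Vector256<int>) in the real code

    // In-place partition of 'a' around 'pivot'; returns the boundary index
    // (the count of elements <= pivot).
    public static int Partition(int[] a, int pivot)
    {
        // Models the temporary space we "prime the pump" into.
        var tmpLeft  = new List<int>();
        var tmpRight = new List<int>();
        void Stash(int v) { if (v <= pivot) tmpLeft.Add(v); else tmpRight.Add(v); }

        int readLeft  = 0, readRight  = a.Length;
        int writeLeft = 0, writeRight = a.Length;

        // Part 1: make initial room by partitioning one block per side into tmp.
        for (var i = 0; i < N && readLeft < readRight; i++) Stash(a[readLeft++]);
        for (var i = 0; i < N && readLeft < readRight; i++) Stash(a[--readRight]);

        // Part 2: main loop. Always read from the side with less free ("gray")
        // room, so neither side's writes can ever overtake its unread data.
        var block = new int[N];
        while (readRight - readLeft >= N) {
            int from;
            if (readLeft - writeLeft <= writeRight - readRight) {
                from = readLeft; readLeft += N;   // left has less room: read left
            } else {
                readRight -= N; from = readRight; // right has less room: read right
            }
            // Read the whole block first (the vector load), *then* store:
            Array.Copy(a, from, block, 0, N);
            foreach (var v in block) {
                if (v <= pivot) a[writeLeft++]  = v;
                else            a[--writeRight] = v;
            }
        }

        // Part 3: the last < N leftovers also go through the temporary space...
        while (readLeft < readRight) Stash(a[readLeft++]);

        // ...and everything stashed is copied back into the remaining gap.
        foreach (var v in tmpLeft)  a[writeLeft++]  = v;
        foreach (var v in tmpRight) a[--writeRight] = v;
        return writeLeft; // == writeRight: the partition boundary
    }
}
```

<p>Note how the side-selection comparison (<code class="highlighter-rouge">readLeft - writeLeft &lt;= writeRight - readRight</code>) is exactly the “smaller gray area” rule from the animation below.</p>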
<p>Let’s start with another visual aid I ended up doing to better explain this; note the different color codes and legend I’ve provided here, and try to watch a few loops noticing the various color transitions, this will become useful as you parse the text and code below:</p>
<div>
<div class="stickemup">
<object class="animated-border" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/double-pumped-loop-with-hint.svg"></object>
</div>
<object style="margin-top: 2em" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/double-pumped-loop-legend.svg"></object>
<ul>
<li>Each rectangle is 8-elements wide.
<ul>
      <li>Except for the middle one, which represents the last group of up to 8 elements that need to be partitioned. In vectorized parlance, this is often called the “remainder problem”.</li>
</ul>
</li>
<li>We want to partition the entire array, in-place, or turn it from <span style="padding: 1px; border: 1px solid black; border-radius: 2px; background-color: #db9d00ff">orange</span> into the green/red colors:
<ul>
<li><span style="padding: 1px; border: 1px solid black; border-radius: 2px; background-color: #bbe33d">Green</span>: for smaller-than-or-equal to the pivot values, on the left side.</li>
      <li><span style="padding: 1px; border: 1px solid black; border-radius: 2px; background-color: #c9211e; color: white">Red</span>: for greater-than the pivot values, on the right side.</li>
</ul>
</li>
<li>Initially we “prime the pump”, or make some room inside the array, by partitioning into some temporary memory, marked as the 3x8-element blocks in <span style="padding: 1px; border: 1px solid black; border-radius: 2px; background-color: #f67eec">purple</span>:
<ul>
<li>We allocate this temporary space somewhere on the stack; We’ll discuss why this isn’t really a big deal below.</li>
<li>We read one vector’s worth of elements from the left and execute our partitioning block into the temporary space.</li>
<li>We repeat the process for the right side.</li>
<li>At this stage, one vector on each edge has already been partitioned, and their color is now <span style="padding: 1px; border: 1px solid black; border-radius: 2px; background-color:#b2b2b2ff">gray</span>, which represents data/area within our array we can freely <em>write</em> into.</li>
</ul>
</li>
<li>From here-on, we’re in the main loop: this could go on for millions of iterations, even though in this animation we only see 4 iterations in total:
<ul>
<li>In every round, we <em>choose</em> where we read from next: From the left <em>-or-</em> right side of the <span style="padding: 1px; border: 1px solid black; border-radius: 2px; background-color: #db9d00ff">orange</span> area?<br />
How? Easy-peasy: Whichever side has a <strong>smaller</strong> <span style="padding: 1px; border: 1px solid black; border-radius: 2px; background-color:#b2b2b2ff">gray</span> area!
<ul>
<li><em>Intuition</em>: The gray area represents the distance between the head (read) and tail (write) pointers we set up for each side, the smaller the distance/area is, the more likely that our next 8-element partition <em>might</em> end with us overwriting that side’s head with the tail.</li>
<li><strong>We really don’t want that to happen…</strong></li>
<li>We read from the only side <em>where this might happen next</em>, thereby adding 8 more elements of breathing space to that side just in time before we cause a meltdown. (you can see this clearly in the animation as each orange block turns gray <em>after</em> we read it, <em>but before</em> we write to both sides…)</li>
</ul>
</li>
<li>We partition the data inside the <code class="highlighter-rouge">Vector256<int></code> we just read and write it to the next write position on each side.</li>
<li>We advance each write pointer according to how much of that register was red/green, we’ve discussed the how of it when we toured the vectorized block. Here you can see the end result reflected in how the red portion of the written copy on the left-hand side turns into gray, and the green portion on the right-hand side turns into gray correspondingly.<br />
<strong>Remember</strong>: We’ve seen the code in detail when we previously discussed the partitioning block; I repeat it here since it is critical for understanding how the whole process clicks together.</li>
</ul>
</li>
<li>For the finishing touch:
<ul>
<li>Left with less than 8 elements, we partition with plain old scalar code the few remaining elements, into the temporary memory area again.</li>
<li>We copy back each side of the temporary area back to the main array, and we’re done!</li>
<li>We take the pivot value, left untouched all this time at the right edge of our segment, and move it to where the new boundary is.</li>
</ul>
</li>
</ul>
<p>Let’s go over it again, in more detail, this time with code:</p>
</div>
<h3 id="setup-make-some-room">Setup: Make some room!</h3>
<p>What I eventually opted for was to read from <em>one</em> area and write to <em>another</em> area in the same array. But we need to make some spare room inside the array for this. How?</p>
<p>We cheat! (¯\_(ツ)_/¯), but not really: we allocate some temporary space on the stack, by using the relatively new <code class="highlighter-rouge">ref struct</code> feature in C# in combination with <code class="highlighter-rouge">fixed</code> arrays. Here’s why this isn’t really cheating in any reasonable person’s book:</p>
<ul>
<li>Stack allocation doesn’t put pressure on the GC, and its allocation is super fast/slim.</li>
<li>We allocate <em>once</em> at the top of our entire sort operation and reuse that space while recursing.</li>
<li>“Just a bit” is really just a bit: For our 8-element partition block we need room for 1 x 8-element vector on <strong>each</strong> side of the array, so we allocate a total of 2 x 8 integers. In addition, we allocate 8 more elements for handling the remainder (well technically, 7 would be enough, but I’m not a monster, I like round numbers just like the next person), so a total of 96 bytes. Not too horrid.</li>
</ul>
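<p>The size arithmetic works out as follows, using the same constant names that appear in <code class="highlighter-rouge">VxSortInt32</code> further down:</p>

```csharp
using System;

// Mirrors the constants defined further down in VxSortInt32:
// one vector of slack per side, plus room for up to 8 remainder elements.
const int N = 8;  // Vector256<int>.Count for 32-bit elements
const int SLACK_PER_SIDE_IN_VECTORS  = 1;
const int SLACK_PER_SIDE_IN_ELEMENTS = SLACK_PER_SIDE_IN_VECTORS * N;
const int TMP_SIZE_IN_ELEMENTS       = 2 * SLACK_PER_SIDE_IN_ELEMENTS + N;

Console.WriteLine(TMP_SIZE_IN_ELEMENTS);                // 24 elements
Console.WriteLine(TMP_SIZE_IN_ELEMENTS * sizeof(int));  // 96 bytes
```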
<p>Here’s the signature + setup code:</p>
<div>
<div class="stickemup">
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
</pre></td><td class="rouge-code"><pre><span class="k">unsafe</span> <span class="kt">int</span><span class="p">*</span> <span class="nf">VectorizedPartitionInPlace</span><span class="p">(</span><span class="kt">int</span><span class="p">*</span> <span class="n">left</span><span class="p">,</span> <span class="kt">int</span><span class="p">*</span> <span class="n">right</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">var</span> <span class="n">N</span> <span class="p">=</span> <span class="n">Vector256</span><span class="p"><</span><span class="n">T</span><span class="p">>.</span><span class="n">Count</span><span class="p">;</span> <span class="c1">// Treated by JIT as constant!</span>
<span class="kt">var</span> <span class="n">writeLeft</span> <span class="p">=</span> <span class="n">left</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">writeRight</span> <span class="p">=</span> <span class="n">right</span> <span class="p">-</span> <span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">tmpLeft</span> <span class="p">=</span> <span class="n">_tempStart</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">tmpRight</span> <span class="p">=</span> <span class="n">_tempEnd</span> <span class="p">-</span> <span class="n">N</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">pivot</span> <span class="p">=</span> <span class="p">*</span><span class="n">right</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">P</span> <span class="p">=</span> <span class="n">Vector256</span><span class="p">.</span><span class="nf">Create</span><span class="p">(</span><span class="n">pivot</span><span class="p">);</span>
<span class="nf">PartitionBlock</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpLeft</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpRight</span><span class="p">);</span>
<span class="nf">PartitionBlock</span><span class="p">(</span><span class="n">right</span> <span class="p">-</span> <span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpLeft</span><span class="p">,</span> <span class="k">ref</span> <span class="n">tmpRight</span><span class="p">);</span>
<span class="kt">var</span> <span class="n">readLeft</span> <span class="p">=</span> <span class="n">left</span> <span class="p">+</span> <span class="n">N</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">readRight</span> <span class="p">=</span> <span class="n">right</span> <span class="p">-</span> <span class="m">2</span><span class="p">*</span><span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">;</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
</div>
<p>The function accepts two parameters: <code class="highlighter-rouge">left</code>, <code class="highlighter-rouge">right</code> pointing to the edges of the partitioning task we were handed. The selected pivot is “passed” in an unconventional way: the caller (The top-level sort function) is responsible for <strong>moving</strong> it to the right edge of the array before calling the partitioning function. In other words, we start executing the function expecting the pivot to be already selected and placed at the right edge of the segment (e.g., <code class="highlighter-rouge">right</code> points to it). This is a remnant of my initial copy-pasting of CoreCLR code, and to be honest, I don’t care enough to change it.</p>
<p>We start by setting up various pointers we’ll be using on <span class="uk-label">L5-8</span>: The <code class="highlighter-rouge">writeLeft</code> and <code class="highlighter-rouge">writeRight</code> pointers point into the internal edges of our array (excluding the last element, which is pointing to the selected pivot), while the <code class="highlighter-rouge">tmpLeft</code> and <code class="highlighter-rouge">tmpRight</code> pointers point into the internal edges of the temporary space.<br />
One recurring pattern is that the right-side pointers point one vector’s worth of elements <strong>left</strong> of their respective edge. This makes sense given that we will be using vectorized write operations that take a pointer to memory and write 8 elements at a time; the pointers are set up to account for that asymmetry.</p>
<table style="margin-bottom: 0em" class="notice--info">
<tr>
<td style="border: none; padding-top: 0; padding-bottom: 0; vertical-align: top"><span class="uk-label">Note</span></td>
<td style="border: none; padding-top: 0; padding-bottom: 0"><div>
<p>I’m using a “variable” (<code class="highlighter-rouge">N</code>) on <span class="uk-label">L3</span> instead of <code class="highlighter-rouge">Vector256<int>.Count</code>. There’s a reason for those double quotes: the right-hand expression is treated as a constant at JIT time. Furthermore, once we initialize <code class="highlighter-rouge">N</code> with its value and <em>never</em> modify it, the JIT treats <code class="highlighter-rouge">N</code> as a constant as well! So really, I get to use a short/readable name and pay no penalty for it.</p>
</div>
</td>
</tr>
</table>
<p>We proceed to partition a single 8-element vector on <em>each</em> side on <span class="uk-label">L13-14</span>, with our good-ole’ partitioning block, <strong>straight into</strong> that temporary space through the pointers we just set up. It is important to remember that having done that, we don’t care about the original contents of the area we just read from anymore: we’re free to write up to one <code class="highlighter-rouge">Vector256<T></code> to each edge of the array in the future. We’ve made enough room inside our array available for writing in-place while partitioning.</p>
<p>We finish the setup on <span class="uk-label">L16-17</span> by initializing read pointers for every side (<code class="highlighter-rouge">readLeft</code>, <code class="highlighter-rouge">readRight</code>); An alternative way to think about these pointers is that each side gets its own head (read) and tail (write) pointers. We will be continuously reading from <strong>one</strong> of the heads and writing to <strong>both</strong> tails from now on.</p>
<p>The setup ends with <code class="highlighter-rouge">readLeft</code> pointing a single <code class="highlighter-rouge">Vector256<int></code> <em>right</em> of <code class="highlighter-rouge">left</code>, and <code class="highlighter-rouge">readRight</code> pointing 1 element + 2x<code class="highlighter-rouge">Vector256<int></code> <em>left</em> of <code class="highlighter-rouge">right</code>. The setup of <code class="highlighter-rouge">readRight</code> might initially seem peculiar, but is easily explained:</p>
<ul>
<li><code class="highlighter-rouge">right</code> itself points to the selected pivot; we’re not going to (re-)partition it, so we skip that element (this explains the <code class="highlighter-rouge">- 1</code>).</li>
<li>As with the <code class="highlighter-rouge">tmpRight</code> and <code class="highlighter-rouge">writeRight</code> pointers, when we read/write using <code class="highlighter-rouge">Avx2.LoadDquVector256</code>/<code class="highlighter-rouge">Avx.Store</code> we always have to supply the <em>start</em> address to read from or write to!<br />
Since there is no way to read/write to the “left” of the pointer, we pre-decrement that pointer by <code class="highlighter-rouge">2*N</code> to account for the data that was already partitioned and to prepare it for the next read.</li>
</ul>
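<p>To make the initial layout concrete, here’s the same pointer setup modeled with element offsets instead of raw pointers (the segment length of 40 is a hypothetical example; any segment above the small-sort threshold behaves the same):</p>

```csharp
using System;

// Element offsets mirroring the pointer setup above: left = 0, and `right`
// is the offset of the selected pivot at the right edge of the segment.
const int N = 8;  // Vector256<int>.Count
int left = 0, right = 40;
int writeLeft = left,     writeRight = right - N - 1;
int readLeft  = left + N, readRight  = right - 2 * N - 1;

// After priming the pump, each side has exactly one vector's worth of
// writable slack between its read (head) and write (tail) pointers:
Console.WriteLine(readLeft - writeLeft);    // 8
Console.WriteLine(writeRight - readRight);  // 8
```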
<h3 id="loop">Loop</h3>
<p>Here’s the same loop we saw in the animation with our vectorized block smack in its middle, in plain-old C#:</p>
</div>
<div>
<div class="stickemup">
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre> <span class="k">while</span> <span class="p">(</span><span class="n">readRight</span> <span class="p">>=</span> <span class="n">readLeft</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="p">*</span><span class="n">nextPtr</span><span class="p">;</span>
<span class="k">if</span> <span class="p">((</span><span class="n">readLeft</span> <span class="p">-</span> <span class="n">writeLeft</span><span class="p">)</span> <span class="p"><=</span> <span class="p">(</span><span class="n">writeRight</span> <span class="p">-</span> <span class="n">readRight</span><span class="p">))</span> <span class="p">{</span>
<span class="n">nextPtr</span> <span class="p">=</span> <span class="n">readLeft</span><span class="p">;</span>
<span class="n">readLeft</span> <span class="p">+=</span> <span class="n">N</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">nextPtr</span> <span class="p">=</span> <span class="n">readRight</span><span class="p">;</span>
<span class="n">readRight</span> <span class="p">-=</span> <span class="n">N</span><span class="p">;</span>
<span class="p">}</span>

<span class="nf">PartitionBlock</span><span class="p">(</span><span class="n">nextPtr</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="k">ref</span> <span class="n">writeLeft</span><span class="p">,</span> <span class="k">ref</span> <span class="n">writeRight</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">readRight</span> <span class="p">+=</span> <span class="n">N</span><span class="p">;</span>
<span class="n">tmpRight</span> <span class="p">+=</span> <span class="n">N</span><span class="p">;</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
</div>
<p>This is the heart of the partitioning operation and where we spend most of the time sorting the array. Looks quite boring, eh?</p>
<p>This loop is all about calling our good ole’ partitioning block on the entire array. We re-use the same block on <span class="uk-label">L11</span>, but here, for the first time, we actually use it as an in-place partitioning block, since we are both reading and writing to the same array.<br />
While the runtime of the loop is dominated by the partitioning block, the interesting bit is that beefy condition on <span class="uk-label">L3</span> that we described/animated before: it calculates the distance between each head and tail on both sides and compares them to determine which side has less space left, or which side is closer to being overwritten. Given that the <strong>next</strong> read will happen from the side we choose here, we’ve just added 8 more integers worth of <em>writing</em> space to that same endangered side, thereby eliminating the risk of overwriting.<br />
While it might be easy to read in terms of correctness or motivation, this is a very <em>sad line of code</em>, as it will haunt us in the next posts!</p>
<p>Finally, as we exit the loop once there are <code class="highlighter-rouge">< 8</code> elements left (remember that we pre-decremented <code class="highlighter-rouge">readRight</code> by <code class="highlighter-rouge">N</code> elements before the loop), we are done with all vectorized work for this partitioning call. As such, this is as good a time as any to re-adjust both <code class="highlighter-rouge">readRight</code> and <code class="highlighter-rouge">tmpRight</code>, which were pre-decremented by <code class="highlighter-rouge">N</code> elements, making them ready-to-go for the final step of handling the remainder with scalar partitioning, on <span class="uk-label">L13-14</span>.</p>
<h3 id="handling-the-remainder-and-finishing-up">Handling the remainder and finishing up</h3>
<p>Here’s the final piece of this function:</p>
</div>
<div>
<div class="stickemup">
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
</pre></td><td class="rouge-code"><pre> <span class="k">while</span> <span class="p">(</span><span class="n">readLeft</span> <span class="p"><</span> <span class="n">readRight</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">v</span> <span class="p">=</span> <span class="p">*</span><span class="n">readLeft</span><span class="p">++;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">v</span> <span class="p"><=</span> <span class="n">pivot</span><span class="p">)</span> <span class="p">{</span>
<span class="p">*</span><span class="n">tmpLeft</span><span class="p">++</span> <span class="p">=</span> <span class="n">v</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="p">*--</span><span class="n">tmpRight</span> <span class="p">=</span> <span class="n">v</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>

<span class="kt">var</span> <span class="n">leftTmpSize</span> <span class="p">=</span> <span class="p">(</span><span class="kt">uint</span><span class="p">)</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="p">(</span><span class="n">tmpLeft</span> <span class="p">-</span> <span class="n">_tempStart</span><span class="p">);</span>
<span class="n">Unsafe</span><span class="p">.</span><span class="nf">CopyBlockUnaligned</span><span class="p">(</span><span class="n">writeLeft</span><span class="p">,</span> <span class="n">_tempStart</span><span class="p">,</span> <span class="n">leftTmpSize</span> <span class="p">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">int</span><span class="p">));</span>
<span class="n">writeLeft</span> <span class="p">+=</span> <span class="n">leftTmpSize</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">rightTmpSize</span> <span class="p">=</span> <span class="p">(</span><span class="kt">uint</span><span class="p">)</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="p">(</span><span class="n">_tempEnd</span> <span class="p">-</span> <span class="n">tmpRight</span><span class="p">);</span>
<span class="n">Unsafe</span><span class="p">.</span><span class="nf">CopyBlockUnaligned</span><span class="p">(</span><span class="n">writeLeft</span><span class="p">,</span> <span class="n">tmpRight</span><span class="p">,</span> <span class="n">rightTmpSize</span> <span class="p">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">int</span><span class="p">));</span>
<span class="nf">Swap</span><span class="p">(</span><span class="n">writeLeft</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="k">return</span> <span class="n">writeLeft</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
</div>
<p>Finally, we come out of the loop once we have fewer than 8 elements to partition (1-7 elements). We can’t use vectorized code here, so we drop to plain-old scalar partitioning on <span class="uk-label">L1-8</span>. To keep things simple, we partition these last elements straight into the temporary area. This is the reason we allocated 8 more elements in the temporary area in the first place.</p>
<p>Once we’re done with this remainder nuisance, we copy the already partitioned data from the temporary area back into the array, into the area left between <code class="highlighter-rouge">writeLeft</code> and <code class="highlighter-rouge">writeRight</code>; it’s a quick 64-96 byte copy in two operations, performed on <span class="uk-label">L10-14</span>, and we are nearly done. We still need to move the pivot <em>back</em> to the newly calculated pivot position (remember the caller placed it on the right edge of the array as part of pivot selection) and report this position back as the return value for this to officially be christened as an AVX2 partitioning function.</p>
</div>
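<p>To see how all the pieces click together end-to-end, here’s a scalar model of the whole function: array indices instead of pointers, and the 8-element vectorized block replaced by an equivalent scalar stand-in (<code class="highlighter-rouge">PartitionInPlaceModel</code> and <code class="highlighter-rouge">Block</code> are hypothetical names for this sketch, not the real code). Like the real code, it skips the element at <code class="highlighter-rouge">right - 1</code>, which the caller’s median-of-three selection is guaranteed to have left ≥ pivot:</p>

```csharp
using System;
using System.Linq;

// Scalar model of the in-place double-pumped partition described above.
// `right` is the index of the pivot; the caller must ensure a[right-1] >= pivot.
static int PartitionInPlaceModel(int[] a, int left, int right)
{
    const int N = 8;               // stands in for Vector256<int>.Count
    var tmp = new int[2 * N + N];  // one vector of slack per side + remainder room
    int pivot = a[right];
    int writeLeft = left, writeRight = right - N - 1;
    int tmpLeft = 0, tmpRight = tmp.Length - N;

    // Scalar stand-in for PartitionBlock: "load" 8 elements first, then write
    // the <= pivot ones at the left tail and the rest at the right tail.
    void Block(int from, int[] dst, ref int wl, ref int wr)
    {
        var buf = new int[N];
        Array.Copy(a, from, buf, 0, N);
        int k = buf.Count(v => v <= pivot);   // how many go left
        int lo = wl, hi = wr + k;             // > pivot lands at [wr+k, wr+N)
        foreach (var v in buf)
            if (v <= pivot) dst[lo++] = v; else dst[hi++] = v;
        wl += k;      // advance left tail by the "smaller" count
        wr -= N - k;  // retreat right tail by the "larger" count
    }

    // Prime the pump: one block from each edge goes into the temporary space
    Block(left,          tmp, ref tmpLeft, ref tmpRight);
    Block(right - N - 1, tmp, ref tmpLeft, ref tmpRight);

    int readLeft = left + N, readRight = right - 2 * N - 1;
    while (readRight >= readLeft) {
        int next;  // always read from whichever side has less writable slack
        if (readLeft - writeLeft <= writeRight - readRight) {
            next = readLeft;  readLeft  += N;
        } else {
            next = readRight; readRight -= N;
        }
        Block(next, a, ref writeLeft, ref writeRight);
    }
    readRight += N;
    tmpRight  += N;

    while (readLeft < readRight) {  // scalar remainder into the temp space
        int v = a[readLeft++];
        if (v <= pivot) tmp[tmpLeft++] = v; else tmp[--tmpRight] = v;
    }

    // Copy both halves of the temp space back, then restore the pivot
    Array.Copy(tmp, 0, a, writeLeft, tmpLeft);
    writeLeft += tmpLeft;
    Array.Copy(tmp, tmpRight, a, writeLeft, tmp.Length - tmpRight);
    (a[writeLeft], a[right]) = (a[right], a[writeLeft]);
    return writeLeft;
}

// Usage: 40 pseudo-random elements; park the largest value at right - 1
// (emulating what median-of-three guarantees), pivot already at a[right].
var rnd = new Random(42);
var data = Enumerable.Range(0, 40).Select(_ => rnd.Next(100)).ToArray();
int last = data.Length - 1;
int maxIdx = Array.IndexOf(data, data.Max());
(data[last - 1], data[maxIdx]) = (data[maxIdx], data[last - 1]);
int p = PartitionInPlaceModel(data, 0, last);
Console.WriteLine(data.Take(p).All(v => v <= data[p]) &&
                  data.Skip(p + 1).All(v => v >= data[p]));  // True
```

<p>Note that the model preserves the real code’s quirk of never touching <code class="highlighter-rouge">a[right - 1]</code>: since median-of-three left a value ≥ pivot there, it can safely stay in the right-hand partition for the recursive calls to deal with.</p>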
<h2 id="pretending-were-arraysort">Pretending we’re Array.Sort</h2>
<p>Now that we have a proper partitioning function, it’s time to string it into a quick-sort like dispatching function: This will be the entry point to our sort routine:</p>
<div>
<div class="stickemup">
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
</pre></td><td class="rouge-code"><pre><span class="k">public</span> <span class="k">static</span> <span class="k">class</span> <span class="nc">DoublePumpNaive</span>
<span class="p">{</span>
<span class="k">public</span> <span class="k">static</span> <span class="k">unsafe</span> <span class="k">void</span> <span class="n">Sort</span><span class="p"><</span><span class="n">T</span><span class="p">>(</span><span class="n">T</span><span class="p">[]</span> <span class="n">array</span><span class="p">)</span> <span class="k">where</span> <span class="n">T</span> <span class="p">:</span> <span class="n">unmanaged</span><span class="p">,</span> <span class="n">IComparable</span><span class="p"><</span><span class="n">T</span><span class="p">></span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">array</span> <span class="p">==</span> <span class="k">null</span><span class="p">)</span>
<span class="k">throw</span> <span class="k">new</span> <span class="nf">ArgumentNullException</span><span class="p">(</span><span class="k">nameof</span><span class="p">(</span><span class="n">array</span><span class="p">));</span>
<span class="k">fixed</span> <span class="p">(</span><span class="n">T</span><span class="p">*</span> <span class="n">p</span> <span class="p">=</span> <span class="p">&</span><span class="n">array</span><span class="p">[</span><span class="m">0</span><span class="p">])</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="k">typeof</span><span class="p">(</span><span class="n">T</span><span class="p">)</span> <span class="p">==</span> <span class="k">typeof</span><span class="p">(</span><span class="kt">int</span><span class="p">))</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">pi</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">*)</span> <span class="n">p</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">sorter</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">VxSortInt32</span><span class="p">(</span><span class="n">startPtr</span><span class="p">:</span> <span class="n">pi</span><span class="p">,</span> <span class="n">endPtr</span><span class="p">:</span> <span class="n">pi</span> <span class="p">+</span> <span class="n">array</span><span class="p">.</span><span class="n">Length</span> <span class="p">-</span> <span class="m">1</span><span class="p">);</span>
<span class="n">sorter</span><span class="p">.</span><span class="nf">Sort</span><span class="p">(</span><span class="n">pi</span><span class="p">,</span> <span class="n">pi</span> <span class="p">+</span> <span class="n">array</span><span class="p">.</span><span class="n">Length</span> <span class="p">-</span> <span class="m">1</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">const</span> <span class="kt">int</span> <span class="n">SLACK_PER_SIDE_IN_VECTORS</span> <span class="p">=</span> <span class="m">1</span><span class="p">;</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
</div>
<p>Most of this is pretty dull code:</p>
<ul>
<li>We start with a top-level static class <code class="highlighter-rouge">DoublePumpNaive</code> containing a single <code class="highlighter-rouge">Sort</code> entry point accepting a normal managed array.</li>
<li>We special-case <code class="highlighter-rouge">typeof(int)</code>, relying on generic type elision, newing up a <code class="highlighter-rouge">VxSortInt32</code> struct and finally calling its internal <code class="highlighter-rouge">.Sort()</code> method to initiate the recursive sorting.
<ul>
<li>This is as good a time as any to remind, again, that for the time being, I only implemented vectorized sorting when <code class="highlighter-rouge">T</code> is <code class="highlighter-rouge">int</code>. To fully replace <code class="highlighter-rouge">Array.Sort()</code>, more tweaked versions of this code will eventually have to be written to support unsigned integers, both larger and smaller than 32 bits, as well as floating-point types.</li>
</ul>
</li>
</ul>
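<p>The generic-type-elision trick deserves a tiny illustration (<code class="highlighter-rouge">Describe</code> is a hypothetical stand-in, not part of the sorting code): the JIT compiles a separate method body per value-type <code class="highlighter-rouge">T</code>, so the <code class="highlighter-rouge">typeof</code> comparison below is resolved at JIT time and the untaken branch is eliminated entirely:</p>

```csharp
using System;

// Because each value-type instantiation gets its own JITted body, the
// typeof(T) == typeof(int) check costs nothing at run time: the branch is
// a JIT-time constant, and the dead path is dropped from the generated code.
static string Describe<T>() where T : unmanaged
    => typeof(T) == typeof(int) ? "vectorized int path" : "scalar fallback";

Console.WriteLine(Describe<int>());     // vectorized int path
Console.WriteLine(Describe<double>());  // scalar fallback
```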
<p>Continuing on to <code class="highlighter-rouge">VxSortInt32</code> itself:</p>
</div>
<div>
<div class="stickemup">
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
</pre></td><td class="rouge-code"><pre>
<span class="k">internal</span> <span class="k">unsafe</span> <span class="k">ref</span> <span class="k">struct</span> <span class="nc">VxSortInt32</span>
<span class="p">{</span>
<span class="k">const</span> <span class="kt">int</span> <span class="n">SLACK_PER_SIDE_IN_ELEMENTS</span> <span class="p">=</span> <span class="n">SLACK_PER_SIDE_IN_VECTORS</span> <span class="p">*</span> <span class="m">8</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">int</span> <span class="n">TMP_SIZE_IN_ELEMENTS</span> <span class="p">=</span> <span class="m">2</span> <span class="p">*</span> <span class="n">SLACK_PER_SIDE_IN_ELEMENTS</span> <span class="p">+</span> <span class="m">8</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">int</span> <span class="n">SMALL_SORT_THRESHOLD_ELEMENTS</span> <span class="p">=</span> <span class="m">16</span><span class="p">;</span>
<span class="k">readonly</span> <span class="kt">int</span><span class="p">*</span> <span class="n">_startPtr</span><span class="p">,</span> <span class="n">_endPtr</span><span class="p">,</span>
<span class="n">_tempStart</span><span class="p">,</span> <span class="n">_tempEnd</span><span class="p">;</span>
<span class="k">fixed</span> <span class="kt">int</span> <span class="n">_temp</span><span class="p">[</span><span class="n">TMP_SIZE_IN_ELEMENTS</span><span class="p">];</span>
<span class="k">public</span> <span class="nf">VxSortInt32</span><span class="p">(</span><span class="kt">int</span><span class="p">*</span> <span class="n">startPtr</span><span class="p">,</span> <span class="kt">int</span><span class="p">*</span> <span class="n">endPtr</span><span class="p">)</span> <span class="p">:</span> <span class="k">this</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">_startPtr</span> <span class="p">=</span> <span class="n">startPtr</span><span class="p">;</span>
<span class="n">_endPtr</span> <span class="p">=</span> <span class="n">endPtr</span><span class="p">;</span>
<span class="k">fixed</span> <span class="p">(</span><span class="kt">int</span><span class="p">*</span> <span class="n">pTemp</span> <span class="p">=</span> <span class="n">_temp</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_tempStart</span> <span class="p">=</span> <span class="n">pTemp</span><span class="p">;</span>
<span class="n">_tempEnd</span> <span class="p">=</span> <span class="n">pTemp</span> <span class="p">+</span> <span class="n">TMP_SIZE_IN_ELEMENTS</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
</div>
<p>This is where the real top-level sorting entry point for 32-bit signed integers is:</p>
<ul>
<li>This struct contains a bunch of constants and members that are initialized for a single sort-job/call and immediately discarded once sorting is complete.</li>
<li>There’s a seemingly nasty little bit hiding in plain sight, where we exfiltrate an interior pointer obtained inside a <code class="highlighter-rouge">fixed</code> block and store it for the lifetime of the struct, outside of the <code class="highlighter-rouge">fixed</code> block.
<ul>
<li>This is generally a no-no, since, in theory, we don’t have a guarantee that the struct won’t be boxed/stored inside a managed object on a heap where the GC is free to move our memory around.</li>
<li>In this case, we <em>are ensuring</em> that instances of <code class="highlighter-rouge">VxSortInt32</code> are never promoted to the managed heap by declaring it as a <a href="https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/keywords/ref#ref-struct-types"><code class="highlighter-rouge">ref struct</code></a>.</li>
<li>The motivation behind this is to ensure that the <code class="highlighter-rouge">fixed</code> temporary memory resides close to the other struct fields, taking advantage of <a href="https://en.wikipedia.org/wiki/Locality_of_reference">locality of reference</a>.</li>
</ul>
</li>
</ul>
</div>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
</pre></td><td class="rouge-code"><pre> <span class="k">internal</span> <span class="k">void</span> <span class="nf">Sort</span><span class="p">(</span><span class="kt">int</span><span class="p">*</span> <span class="n">left</span><span class="p">,</span> <span class="kt">int</span><span class="p">*</span> <span class="n">right</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">var</span> <span class="n">length</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="p">(</span><span class="n">right</span> <span class="p">-</span> <span class="n">left</span> <span class="p">+</span> <span class="m">1</span><span class="p">);</span>
<span class="kt">int</span><span class="p">*</span> <span class="n">mid</span><span class="p">;</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">length</span><span class="p">)</span> <span class="p">{</span>
<span class="k">case</span> <span class="m">0</span><span class="p">:</span>
<span class="k">case</span> <span class="m">1</span><span class="p">:</span>
<span class="k">return</span><span class="p">;</span>
<span class="k">case</span> <span class="m">2</span><span class="p">:</span>
<span class="nf">SwapIfGreater</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="k">return</span><span class="p">;</span>
<span class="k">case</span> <span class="m">3</span><span class="p">:</span>
<span class="n">mid</span> <span class="p">=</span> <span class="n">right</span> <span class="p">-</span> <span class="m">1</span><span class="p">;</span>
<span class="nf">SwapIfGreater</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">mid</span><span class="p">);</span>
<span class="nf">SwapIfGreater</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="nf">SwapIfGreater</span><span class="p">(</span><span class="n">mid</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Go to insertion sort below this threshold</span>
<span class="k">if</span> <span class="p">(</span><span class="n">length</span> <span class="p"><=</span> <span class="n">SMALL_SORT_THRESHOLD_ELEMENTS</span><span class="p">)</span> <span class="p">{</span>
<span class="nf">InsertionSort</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Compute median-of-three, of:</span>
<span class="c1">// the first, mid and one before last elements</span>
<span class="n">mid</span> <span class="p">=</span> <span class="n">left</span> <span class="p">+</span> <span class="p">((</span><span class="n">right</span> <span class="p">-</span> <span class="n">left</span><span class="p">)</span> <span class="p">/</span> <span class="m">2</span><span class="p">);</span>
<span class="nf">SwapIfGreater</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">mid</span><span class="p">);</span>
<span class="nf">SwapIfGreater</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span> <span class="p">-</span> <span class="m">1</span><span class="p">);</span>
<span class="nf">SwapIfGreater</span><span class="p">(</span><span class="n">mid</span><span class="p">,</span> <span class="n">right</span> <span class="p">-</span> <span class="m">1</span><span class="p">);</span>
<span class="c1">// Pivot is mid, place it in the right hand side</span>
<span class="nf">Swap</span><span class="p">(</span><span class="n">mid</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="kt">var</span> <span class="n">boundary</span> <span class="p">=</span> <span class="nf">VectorizedPartitionInPlace</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="nf">Sort</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">boundary</span> <span class="p">-</span> <span class="m">1</span><span class="p">);</span>
<span class="nf">Sort</span><span class="p">(</span><span class="n">boundary</span> <span class="p">+</span> <span class="m">1</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Lastly, we have the <code class="highlighter-rouge">Sort</code> method for the <code class="highlighter-rouge">VxSortInt32</code> struct. Most of this is code I blatantly copied from <a href="https://github.com/dotnet/coreclr/blob/master/src/System.Private.CoreLib/shared/System/Collections/Generic/ArraySortHelper.cs#L182"><code class="highlighter-rouge">ArraySortHelper<T></code></a>. What it does is:</p>
<ul>
<li>Special case for lengths of 0-3.</li>
<li>When length <code class="highlighter-rouge"><= 16</code> we just go straight to <code class="highlighter-rouge">InsertionSort</code> and skip all the recursive jazz (go back to post 1 if you want to know why <code class="highlighter-rouge">Array.Sort()</code> does this).</li>
<li>When we have <code class="highlighter-rouge">>= 17</code> elements, we go to vectorized partitioning:
<ul>
<li>We do median of 3 pivot selection.</li>
<li>Swap that pivot so that it resides on the right-most index of the partition.</li>
</ul>
</li>
<li>Call <code class="highlighter-rouge">VectorizedPartitionInPlace</code>, which we’ve seen before.
<ul>
<li>We conveniently take advantage of the fact that we have <code class="highlighter-rouge">InsertionSort</code> to cover us for the small partitions, and our partitioning code can always assume that it can prime the pump with at least two vectors’ worth of vectorized partitioning without additional checks…</li>
</ul>
</li>
<li>Recurse to the left.</li>
<li>Recurse to the right.</li>
</ul>
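<p>Stripped of the intrinsics and the syntax highlighting, the control flow described above is just a hybrid QuickSort driver. Here is a minimal scalar C++ sketch of the same shape; the names and the scalar <code class="highlighter-rouge">partition</code> stand-in are illustrative only, since the real code calls <code class="highlighter-rouge">VectorizedPartitionInPlace</code> at that point:</p>

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative constant; the series sticks with Array.Sort's threshold of 16 for now.
constexpr std::ptrdiff_t SMALL_SORT_THRESHOLD_ELEMENTS = 16;

static void swap_if_greater(int* a, int* b) {
    if (*a > *b) std::swap(*a, *b);
}

static void insertion_sort(int* left, int* right) {
    for (int* i = left + 1; i <= right; ++i) {
        int v = *i;
        int* j = i - 1;
        while (j >= left && *j > v) { j[1] = j[0]; --j; }
        j[1] = v;
    }
}

// Scalar stand-in for VectorizedPartitionInPlace: partitions around the
// pivot parked at *right, returns the pivot's final position (the boundary).
static int* partition(int* left, int* right) {
    int pivot = *right;
    int* store = left;
    for (int* i = left; i < right; ++i)
        if (*i < pivot) std::swap(*store++, *i);
    std::swap(*store, *right);
    return store;
}

static void sort(int* left, int* right) {
    std::ptrdiff_t length = right - left + 1;
    switch (length) {
        case 0: case 1: return;
        case 2: swap_if_greater(left, right); return;
        case 3: {
            int* mid = right - 1;
            swap_if_greater(left, mid);
            swap_if_greater(left, right);
            swap_if_greater(mid, right);
            return;
        }
    }
    if (length <= SMALL_SORT_THRESHOLD_ELEMENTS) { insertion_sort(left, right); return; }
    // Median-of-three of the first, mid and one-before-last elements...
    int* mid = left + (right - left) / 2;
    swap_if_greater(left, mid);
    swap_if_greater(left, right - 1);
    swap_if_greater(mid, right - 1);
    // ...then park the pivot on the right-most index of the partition.
    std::swap(*mid, *right);
    int* boundary = partition(left, right);
    sort(left, boundary - 1);   // recurse to the left
    sort(boundary + 1, right);  // recurse to the right
}
```

<p>Note that the <code class="highlighter-rouge"><= 16</code> case is the one feeding <code class="highlighter-rouge">InsertionSort</code> all those small partitions we end up profiling later in this post.</p>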
<h2 id="initial-performance">Initial Performance</h2>
<p>Are we fast yet?</p>
<p>Yes! This is by no means the end; on the contrary, this is only a rather impressive beginning. We finally have something working, and it is not even entirely unpleasant, if I may say so:</p>
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#503d6f7d-7740-4997-968f-b1462f12e371'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-stats"></i> Scaling</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Stats</a></li>
<li><a href="#"><i class="glyphicon glyphicon-info-sign"></i> Setup</a></li>
</ul>
<ul id="503d6f7d-7740-4997-968f-b1462f12e371" class="uk-switcher uk-margin">
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Performance scale: Array.Sort (solid gray) is always 100%, and the other methods are scaled relative to it" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
ArraySort, 1 , 1 , 1 , 1 , 1 , 1
DoublePumpedNaive, 1.67, 0.77, 0.6, 0.50, 0.39 , 0.36
<!--
{
"data" : {
"datasets" : [ {
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(33,33,220,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 30, "hachureGap": 6 }
}]
},
"options": {
"title": { "text": "AVX2 Naive Sorting - Scaled to Array.Sort", "display": true },
"scales": {
"yAxes": [{
"ticks": {
"fontFamily": "Indie Flower",
"min": 0.2,
"callback": "ticksPercent"
},
"scaleLabel": {
"labelString": "Scaling (%)",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
ArraySort , 19.9202, 35.4067, 52.3293, 64.6518, 70.5598, 81.0416
DoublePumpedNaive, 35.4138, 26.9828, 31.5477, 32.1774, 27.8901, 29.4917
<!--
{
"data" : {
"datasets" : [ {
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "hachure", "hachureAngle": -30, "hachureGap": 9, "fillWeight": 0.3 }
},
{
"backgroundColor": "rgba(33,33,220,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": 30, "hachureGap": 6 }
}]
},
"options": {
"title": { "text": "Array.Sort + AVX2 Naive Sorting - log(Time/N)", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower","fontSize":16}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt3_Int32_-report.datatable.json" data-id-field="name" data-pagination="false" data-page-list="[9, 18]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result is. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/unmanaged-vs-doublepumpednaive-stats.json" data-id-field="name" data-pagination="false" data-intro="Each row in this table contains statistics collected & averaged out of thousands of runs with random data" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="MethodName" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">Method<br />Name</span>
</th>
<th data-field="ProblemSize" data-sortable="true" data-value-type="int" data-filter-control="select">
<div data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="bottom" class="rotated-header-container">
<div class="rotated-header">Size</div>
</div>
</th>
<th data-field="MaxDepthScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="The maximal depth of recursion reached while sorting" data-position="top" class="rotated-header-container">
<div class="rotated-header">Max</div>
<div class="rotated-header">Depth</div>
</div>
</th>
<th data-field="NumPartitionOperationsScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="# of partitioning operations per sort" data-position="bottom" class="rotated-header-container">
<div class="rotated-header">Part</div>
<div class="rotated-header">itions</div>
</div>
</th>
<th data-field="NumVectorizedLoadsScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="# of vectorized load operations" data-position="top" class="rotated-header-container">
<div class="rotated-header">Vector</div>
<div class="rotated-header">Loads</div>
</div>
</th>
<th data-field="NumVectorizedStoresScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="# of vectorized store operations" data-position="bottom" class="rotated-header-container">
<div class="rotated-header">Vector</div>
<div class="rotated-header">Stores</div>
</div>
</th>
<th data-field="NumPermutationsScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="# of vectorized permutation operations" data-position="top" class="rotated-header-container">
<div class="rotated-header">Vector</div>
<div class="rotated-header">Permutes</div>
</div>
</th>
<th data-field="AverageSmallSortSizeScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="For hybrid sorting, the average size that each small sort operation was called with (e.g. InsertionSort)" data-position="bottom" class="rotated-header-container">
<div class="rotated-header">Small</div>
<div class="rotated-header">Sort</div>
<div class="rotated-header">Size</div>
</div>
</th>
<th data-field="NumScalarComparesScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<div data-intro="How many branches were executed in each sort operation that were based on the unsorted array elements" data-position="top" class="rotated-header-container">
<div class="rotated-header">Data</div>
<div class="rotated-header">Based</div>
<div class="rotated-header">Branches</div>
</div>
</th>
<th data-field="PercentSmallSortCompares" data-sortable="true" data-value-type="float2-percentage">
<div data-intro="What percent of<br/>⬅<br/>branches happened as part of small-sorts" data-position="bottom" class="rotated-header-container">
<div class="rotated-header">Small</div>
<div class="rotated-header">Sort</div>
<div class="rotated-header">Branches</div>
</div>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nv">BenchmarkDotNet</span><span class="o">=</span>v0.12.0, <span class="nv">OS</span><span class="o">=</span>clear-linux-os 32120
Intel Core i7-7700HQ CPU 2.80GHz <span class="o">(</span>Kaby Lake<span class="o">)</span>, 1 CPU, 4 logical and 4 physical cores
.NET Core <span class="nv">SDK</span><span class="o">=</span>3.1.100
<span class="o">[</span>Host] : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
Job-DEARTS : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
<span class="nv">InvocationCount</span><span class="o">=</span>3 <span class="nv">IterationCount</span><span class="o">=</span>15 <span class="nv">LaunchCount</span><span class="o">=</span>2
<span class="nv">UnrollFactor</span><span class="o">=</span>1 <span class="nv">WarmupCount</span><span class="o">=</span>10
<span class="nv">$ </span><span class="nb">grep</span> <span class="s1">'stepping\|model\|microcode'</span> /proc/cpuinfo | <span class="nb">head</span> <span class="nt">-4</span>
model : 158
model name : Intel<span class="o">(</span>R<span class="o">)</span> Core<span class="o">(</span>TM<span class="o">)</span> i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>We’re off to a very good start:</p>
<ul>
<li>
<p>We can see that as soon as we hit 1000 element arrays (even earlier, in earnest), we already outperform <code class="highlighter-rouge">Array.Sort</code> (77% runtime), and by the time we get to 1M / 10M element arrays, we see speed-ups north of 2.5x (39%, 36% runtime) over the scalar C++ code!</p>
</li>
<li>
<p>While <code class="highlighter-rouge">Array.Sort</code> is behaving like we would expect from a <code class="highlighter-rouge">QuickSort</code>-like function: it is slowing down at a rate you’d expect given its \(\mathcal{O}(n\log{}n)\) complexity, our own <code class="highlighter-rouge">DoublePumpedNaive</code> is peculiar: the time spent sorting every single element starts going up as we increase <code class="highlighter-rouge">N</code>, then goes down a bit and back up. Huh? It actually improves as we sort more data? Quite unreasonable, unless we remind ourselves that we are executing a mix of scalar insertion sort and vectorized code. Where are we actually spending more CPU cycles, though? We’ll run some profiling sessions in a minute, to get a better idea of what’s going on.</p>
</li>
</ul>
<p>If you recall, in the first post in this series, I presented some statistics about what is going on inside our sort routine. This is a perfect time to switch to the statistics tab, where I’ve beefed up the table with some vectorized counters that didn’t make sense before with the scalar version. From here we can learn a few interesting facts:</p>
<ul>
<li>The number of partitioning operations / small sorts is practically the same
<ul>
<li>You could ask yourself, or me, why they are not <strong>exactly</strong> the same?
To which I’d answer:
<ul>
<li>The thresholds are 16 vs. 17, which has some effect.</li>
<li>We have to remember that the resulting partitions from each implementation end up looking slightly different because of the double pumping + temporary memory shenanigans. Once the partitions look different, the following pivots selected are different, and the whole sort mechanic looks slightly different.</li>
</ul>
</li>
</ul>
</li>
<li>We are doing a lot of vectorized work:
<ul>
<li>Loading two vectors per 8 elements (1 data vector + 1 permutation vector)</li>
<li>Storing two vectors (left+right) for every vector read</li>
<li>In a weird coincidence, this means we perform the same number of vectorized loads and stores for every test case.<br />
In future posts, I will discard one of these columns to reduce the information overload…</li>
<li>Finally, lest we forget, we perform compares/permutations at exactly half of the load/store rate.</li>
</ul>
</li>
<li>All of this is helping us by reducing the number of scalar comparisons, but there’s still quite a lot of it left too:
<ul>
<li>We continue to do scalar partitioning inside <code class="highlighter-rouge">VectorizedPartitionInPlace</code>, as part of handling the remainder that doesn’t fit into a <code class="highlighter-rouge">Vector256<int></code>.</li>
<li>We are still executing scalar comparisons as part of small-sorting, inside the insertion sort, at an alarming rate:
<ul>
<li>The absolute number of comparisons is quite high: We’re still doing millions of data-based branches.</li>
<li>It is also clear from the counters that the overwhelming majority of these are from <code class="highlighter-rouge">InsertionSort</code>: If we focus on the 1M/10M cases here, we see that <code class="highlighter-rouge">InsertionSort</code> went up from accounting for 28.08%/24.60% of scalar comparisons in the <code class="highlighter-rouge">Unmanaged</code> (scalar) test-case all the way to 66.4%/62.74% in the vectorized <code class="highlighter-rouge">DoublePumpedNaive</code> version. Of course this rise is merely in percent terms, but clearly we will have to deal with this if we intend to make this thing fast(er).</li>
</ul>
</li>
</ul>
</li>
</ul>
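<p>To get a feel for why <code class="highlighter-rouge">InsertionSort</code> dominates these branch counters, we can instrument a scalar insertion sort and count every data-dependent comparison it executes. This C++ sketch is purely illustrative; it is not the instrumentation that produced the table above:</p>

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Count every data-dependent comparison insertion sort executes; each one
// is a branch the CPU must predict based on the unsorted data itself.
static uint64_t insertion_sort_compares(std::vector<int> v) {
    uint64_t compares = 0;
    for (std::size_t i = 1; i < v.size(); ++i) {
        int key = v[i];
        std::size_t j = i;
        // the comma operator bumps the counter on every evaluated compare
        while (j > 0 && (++compares, v[j - 1] > key)) {
            v[j] = v[j - 1];
            --j;
        }
        v[j] = key;
    }
    return compares;
}
```

<p>A random 16-element partition averages around n²/4 ≈ 60 such compares, and a reverse-sorted one hits the full n(n-1)/2 = 120; multiply that by the roughly hundreds of thousands of small partitions a 10M-element sort produces, and the 62-66% figures above stop being surprising.</p>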
<p>This is but the beginning of our profiling journey, but we are already learning a complicated truth: Right now, as fast as this is already going, the scalar code we use for insertion sort will always put an upper limit on how fast we can possibly go by optimizing the <em>vectorized code</em> we’ve gone over so far, <em>unless</em> we get rid of <code class="highlighter-rouge">InsertionSort</code> altogether, replacing it with something better. But first things first, we must remain focused: 65% of instructions executed are still spent doing vectorized partitioning; that is the biggest target in our sights!</p>
</div>
<p>As promised, it’s time we profile the code to see what’s really up: We can fire up the venerable Linux <code class="highlighter-rouge">perf</code> tool, through a simple test binary/project I’ve coded up which allows me to execute some dummy sorting by selecting the sort method I want to invoke and specify some parameters for it through the command line, for example:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre><span class="nv">$ </span><span class="nb">cd</span> ~/projects/public/VxSort/Example
<span class="nv">$ </span>dotnet publish <span class="nt">-c</span> release <span class="nt">-o</span> linux-x64 <span class="nt">-r</span> linux-x64
<span class="c"># Run DoublePumpedNaive with 1,000,000 elements x 100 times</span>
<span class="nv">$ </span>./linux-x64/Example <span class="nt">--type-list</span> DoublePumpedNaive <span class="nt">--size-list</span> 1000000
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Here we call the <code class="highlighter-rouge">DoublePumpedNaive</code> implementation we’ve been discussing from the beginning of this post with 1M elements, and sort the random data 100 times to generate some heat in case global warming is not cutting it for you.<br />
I know that calling <code class="highlighter-rouge">dotnet publish ...</code> seems superfluous, but trust<sup id="fnref:0" role="doc-noteref"><a href="#fn:0" class="footnote">1</a></sup> me and go with me on this one:</p>
<ul class="uk-tab" data-uk-switcher="{connect:'#0022d19b-dd68-4bb7-a13e-8acabcb4c12f'}">
<li class="uk-active"><a href="#">1M</a></li>
<li><a href="#">10K</a></li>
</ul>
<ul id="0022d19b-dd68-4bb7-a13e-8acabcb4c12f" class="uk-switcher uk-margin">
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre><span class="nv">$ COMPlus_PerfMapEnabled</span><span class="o">=</span>1 perf record <span class="nt">-F</span> max <span class="nt">-e</span> instructions ./Example <span class="se">\</span>
<span class="nt">--type-list</span> DoublePumpedNaive <span class="nt">--size-list</span> 1000000
...
<span class="nv">$ </span>perf report <span class="nt">--stdio</span> <span class="nt">-F</span> overhead,sym | <span class="nb">head</span> <span class="nt">-15</span>
...
<span class="c"># Overhead Symbol</span>
65.66% <span class="o">[</span>.] ... ::VectorizedPartitionInPlace<span class="o">(</span>int32<span class="k">*</span>,int32<span class="k">*</span>,int32<span class="k">*</span><span class="o">)[</span>Optimized]
22.43% <span class="o">[</span>.] ... ::InsertionSort<span class="o">(!!</span>0<span class="k">*</span>,!!0<span class="k">*</span><span class="o">)[</span>Optimized]
5.43% <span class="o">[</span>.] ... ::QuickSortInt<span class="o">(</span>int32<span class="k">*</span>,int32<span class="k">*</span>,int32<span class="k">*</span>,int32<span class="k">*</span><span class="o">)[</span>OptimizedTier1]
4.00% <span class="o">[</span>.] ... ::Memmove<span class="o">(</span>uint8&,uint8&,uint64<span class="o">)[</span>OptimizedTier1]
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre><span class="nv">$ COMPlus_PerfMapEnabled</span><span class="o">=</span>1 perf record <span class="nt">-F</span> max <span class="nt">-e</span> instructions ./Example <span class="se">\</span>
<span class="nt">--type-list</span> DoublePumpedNaive <span class="nt">--size-list</span> 10000
...
<span class="nv">$ </span>perf report <span class="nt">--stdio</span> <span class="nt">-F</span> overhead,sym | <span class="nb">head</span> <span class="nt">-15</span>
...
<span class="c"># Overhead Symbol</span>
54.59% <span class="o">[</span>.] ... ::VectorizedPartitionInPlace<span class="o">(</span>int32<span class="k">*</span>,int32<span class="k">*</span>,int32<span class="k">*</span><span class="o">)[</span>Optimized]
29.87% <span class="o">[</span>.] ... ::InsertionSort<span class="o">(!!</span>0<span class="k">*</span>,!!0<span class="k">*</span><span class="o">)[</span>Optimized]
7.02% <span class="o">[</span>.] ... ::QuickSortInt<span class="o">(</span>int32<span class="k">*</span>,int32<span class="k">*</span>,int32<span class="k">*</span>,int32<span class="k">*</span><span class="o">)[</span>OptimizedTier1]
5.23% <span class="o">[</span>.] ... ::Memmove<span class="o">(</span>uint8&,uint8&,uint64<span class="o">)[</span>OptimizedTier1]
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
<p>This is a trimmed summary of a <code class="highlighter-rouge">perf</code> session recording performance metrics, specifically: the number of instructions executed for running a 1M element sort 100 times, followed by running a 10K element sort, 10K times. I was shocked when I saw this for the first time, but we’re starting to understand the previous oddities we saw with the <code class="highlighter-rouge">Time/N</code> column!<br />
We’re spending upwards of 20% of our time doing scalar insertion sorting! I lured you here with promises of vectorized sorting and yet, somehow, “only” 65% of the time is spent doing “vectorized” work (which also has some scalar partitioning, if we’re honest). Not only that, but as the size of the array decreases, the percentage of time spent in scalar code <em>increases</em> (from 22.43% to 29.87%), which should not surprise us anymore.<br />
Before anything else, let me clearly state that this is not necessarily a bad thing! As the size of the partition decreases, the <em>benefit</em> of doing vectorized partitioning decreases in general, and even more so for our AVX2 partitioning, which has non-trivial start-up overhead. We shouldn’t care about the amount of time we’re spending on scalar code per se, but the amount of time taken to sort the entire array.<br />
The decision to go with scalar insertion-sort or stick to vectorized code is controlled by the threshold I mentioned before, which is still sitting there at <code class="highlighter-rouge">16</code>. We’re only beginning our optimization phase in the next post, so for now, we’ll stick with the threshold selected for <code class="highlighter-rouge">Array.Sort</code> by the CoreCLR developers; this is the “correct” starting point both in terms of allowing us to compare apples-to-apples, and also because I am a firm believer in doing very incremental modifications for this sort of work.<br />
Having said that, this is definitely something we will tweak later for our particular implementation.</p>
<h2 id="finishing-off-with-a-sour-taste">Finishing off with a sour taste</h2>
<p>I’ll end this post with a not so easy pill to swallow: let’s re-run <code class="highlighter-rouge">perf</code> and measure a different aspect of our code: Let’s see how the code is behaving in terms of top-level performance counters. The idea here is to use counters that our CPU is already capable of collecting at the hardware level, with almost no performance impact, to see where/if we’re hurting. What I’ll do before invoking <code class="highlighter-rouge">perf</code> is use a Linux utility called <a href="https://github.com/lpechacek/cpuset"><code class="highlighter-rouge">cset</code></a> which can be <a href="https://stackoverflow.com/a/13076880/9172">used to</a> evacuate all user threads and (almost all) kernel threads from a given physical CPU core, using <a href="https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/cgroup-v1/cpusets.rst">cpusets</a>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre><span class="nv">$ </span><span class="nb">sudo </span>cset shield <span class="nt">--cpu</span> 3 <span class="nt">-k</span> on
cset: <span class="nt">--</span><span class="o">></span> activating shielding:
cset: moving 638 tasks from root into system cpuset...
<span class="o">[==================================================]</span>%
cset: kthread shield activated, moving 56 tasks into system cpuset...
<span class="o">[==================================================]</span>%
cset: <span class="k">**</span><span class="o">></span> 38 tasks are not movable, impossible to move
cset: <span class="s2">"system"</span> cpuset of CPUSPEC<span class="o">(</span>0-2<span class="o">)</span> with 667 tasks running
cset: <span class="s2">"user"</span> cpuset of CPUSPEC<span class="o">(</span>3<span class="o">)</span> with 0 tasks running
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Once we have “shielded” a single CPU core, we execute the <code class="highlighter-rouge">Example</code> binary we used before in much the same way, this time collecting different top-level hardware statistics using the following <code class="highlighter-rouge">perf</code> command line:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="rouge-code"><pre><span class="nv">$ </span>perf <span class="nb">stat</span> <span class="nt">-a</span> <span class="nt">--topdown</span> <span class="nb">sudo </span>cset shield <span class="nt">-e</span> ./Example <span class="se">\</span>
<span class="nt">--type-list</span> DoublePumpedNaive <span class="nt">--size-list</span> 1000000
cset: <span class="nt">--</span><span class="o">></span> last message, executed args into cpuset <span class="s2">"/user"</span>, new pid is: 16107
Performance counter stats <span class="k">for</span> <span class="s1">'system wide'</span>:
retiring bad speculation frontend bound backend bound
...
S0-C3 1 37.6% 32.3% 16.9% 13.2%
3.221968791 seconds <span class="nb">time </span>elapsed
</pre></td></tr></tbody></table></code></pre></div></div>
<p>I’m purposely showing only the statistics collected for our shielded core since we know we only care about that core in the first place.</p>
<p>Here is some bad news: core #3 is really not having a good time running our code. <code class="highlighter-rouge">perf --topdown</code> is essentially screaming from the top of its lungs with that <code class="highlighter-rouge">32.3%</code> under the <code class="highlighter-rouge">bad speculation</code> column. This might seem like an innocent metric if you haven’t done this sort of thing before (in which case, read the info box below), but this is <strong>really bad</strong>. In plain English and <a href="https://easyperf.net/blog/2019/02/09/Top-Down-performance-analysis-methodology">without getting into the intricacies of top-down performance analysis</a>, this metric represents cycles where the CPU isn’t doing useful work because of an earlier mis-speculation. Here, the mis-speculation is mis-predicted branches. The penalty for <em>each</em> such mis-predicted branch is an entire flush of the pipeline (hence the wasted time), which costs us around 14-15 cycles on modern Intel CPUs.</p>
<table style="margin-bottom: 0em" class="notice--info">
<tr>
<td style="border: none; padding-top: 0; padding-bottom: 0; vertical-align: top"><span class="uk-label">Note</span></td>
<td style="border: none; padding-top: 0; padding-bottom: 0"><div>
<p>We have to remember that efficient execution on modern CPUs means keeping the CPU pipeline as busy as possible; this is quite a challenge given its length is about 15 stages, and the CPU itself is super-scalar (For example: an <a href="https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Individual_Core">Intel Skylake CPU has 8 ports</a> that can execute some instruction every cycle!). If, for example, all instructions in the CPU have a constant latency in cycles, this means it <em>has</em> to process 100+ instructions into “the future” while it’s just finishing up with a current one to avoid doing nothing. That’s enough of a challenge for regular code, but what should it do when it sees a branch? It could attempt to execute <strong>both</strong> branches, which quickly becomes a fool’s errand if there are even more branches close by. What CPU designers did was opt for speculative execution: add complex machinery to <em>predict</em> if a branch will be taken and speculatively execute the next instruction according to the prediction. But the predictor isn’t all-knowing, and it will mis-predict, and then we end up paying a huge penalty: The CPU will have to push those mis-predicted instructions through the pipeline, flushing the results out as if the whole thing never happened. This is why the rate of mis-prediction is a life-and-death matter when it comes to performance.</p>
</div>
</td>
</tr>
</table>
<p>Wait, I sense some optimistic thoughts all across the internet… maybe it’s not our precious vectorized, so-called branch-less code? Maybe we can chalk it all up to that mean scalar <code class="highlighter-rouge">InsertionSort</code> function doing those millions and millions of scalar comparisons? We are, after all, using it to sort small partitions, which we’ve already measured at more than 20% of the total run-time. Let’s see this again with <code class="highlighter-rouge">perf</code>, <em>this time</em> focusing on the <code class="highlighter-rouge">branch-misses</code> HW counter, and try to figure out how the mis-predictions are distributed amongst our call-stacks:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
</pre></td><td class="rouge-code"><pre><span class="nv">$ </span><span class="nb">export </span><span class="nv">COMPlus_PerfMapEnabled</span><span class="o">=</span>1 <span class="c"># Make perf speak to the JIT</span>
<span class="c"># Record some performance information:</span>
<span class="nv">$ </span>perf record <span class="nt">-F</span> max <span class="nt">-e</span> branch-misses ./Example <span class="se">\</span>
<span class="nt">--type-list</span> DoublePumpedNaive <span class="nt">--size-list</span> 1000000
...
<span class="nv">$ </span>perf report <span class="nt">--stdio</span> <span class="nt">-F</span> overhead,sym | <span class="nb">head</span> <span class="nt">-17</span>
...
40.97% <span class="o">[</span>.] ...::InsertionSort<span class="o">(!!</span>0<span class="k">*</span>,!!0<span class="k">*</span><span class="o">)[</span>Optimized]
32.30% <span class="o">[</span>.] ...::VectorizedPartitionInPlace<span class="o">(</span>int32<span class="k">*</span>,int32<span class="k">*</span>,int32<span class="k">*</span><span class="o">)[</span>Optimized]
9.64% <span class="o">[</span>.] ...::Memmove<span class="o">(</span>uint8&,uint8&,uint64<span class="o">)[</span>OptimizedTier1]
9.64% <span class="o">[</span>.] ...::QuickSortInt<span class="o">(</span>int32<span class="k">*</span>,int32<span class="k">*</span>,int32<span class="k">*</span>,int32<span class="k">*</span><span class="o">)[</span>OptimizedTier1]
5.62% <span class="o">[</span>.] ...::VectorizedPartitionOnStack<span class="o">(</span>int32<span class="k">*</span>,int32<span class="k">*</span>,int32<span class="k">*</span><span class="o">)[</span>Optimized]
...
</pre></td></tr></tbody></table></code></pre></div></div>
<p>No such luck. While <code class="highlighter-rouge">InsertionSort</code> is definitely starring here with 41% <em>of the</em> branch-misprediction events, we still have <strong>32%</strong> of the bad speculation coming from our own new vectorized code. This is a red flag as far as we’re concerned: it means that our vectorized code still contains a lot of mis-predicted branches. Given that we’re in the business of sorting (random data) and the high rate of recorded mis-predictions, the only logical conclusion is that we have data-dependent branches. Another thing to keep in mind is that the resulting pipeline flush is a large penalty to pay, given that our entire 8-element partition block has a throughput of around 8-9 cycles. That means we are taking that 15-cycle pan-to-the-face way too often to feel good about ourselves.</p>
<p>I’ll finish this post here. We have a <strong>lot of work</strong> cut out for us. This is nowhere near over.<br />
In the next post, I’ll try to give the current vectorized code a good shakeup. After all, it’s still our biggest target in terms of the number of instructions executed, and 2<sup>nd</sup> when it comes to branch mis-predictions. Once we finish squeezing that lemon for all its performance juice in the 4<sup>th</sup> post, we’ll turn our focus to the <code class="highlighter-rouge">InsertionSort</code> function in the 5<sup>th</sup> post, and we’ll see if we can appease the performance gods to make that part of the sorting effort faster.<br />
In the meantime, if you’re up for a small challenge, you can go back to the vectorized partitioning function and try to figure out what is causing all those nasty branch mis-predictions. We’ll be dealing with it head-on in the next post.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:0" role="doc-endnote">
<p>For some, <code class="highlighter-rouge">perf</code> wasn’t in the mood to show me function names without calling <code class="highlighter-rouge">dotnet publish</code> and using the resulting binary, and I didn’t care enough to investigate further… <a href="#fnref:0" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<hr />
<p><strong>This Goes to Eleven (Part 2/∞)</strong> (2020-01-29)</p>
<p>Since there’s a lot to go over here, I’ve split it up into no less than 6 parts:</p>
<ol>
<li>In <a href="/2020-01-28/this-goes-to-eleven-pt1">part 1</a>, we start with a refresher on <code class="highlighter-rouge">QuickSort</code> and how it compares to <code class="highlighter-rouge">Array.Sort()</code>.</li>
<li>In this part, we go over the basics of vectorized hardware intrinsics, vector types, and go over a handful of vectorized instructions we’ll use in part 3. We still won’t be sorting anything.</li>
<li>In <a href="/2020-01-30/this-goes-to-eleven-pt3">part 3</a>, we go through the initial code for the vectorized sorting, and start seeing some payoff. We finish agonizing courtesy of the CPU’s branch predictor, throwing a wrench into our attempts.</li>
<li>In part 4, we go over a handful of optimization approaches that I attempted trying to get the vectorized partitioning to run faster. We’ll see what worked and what didn’t.</li>
<li>In part 5, we’ll see how we can almost get rid of all the remaining scalar code- by implementing small-constant size array sorting. We’ll use, drum roll…, yet more AVX2 vectorization.</li>
<li>Finally, in part 6, I’ll list the outstanding stuff/ideas I have for getting more juice and functionality out of my vectorized code.</li>
</ol>
<h2 id="intrinsics--vectorization">Intrinsics / Vectorization</h2>
<p>I’ll start by repeating my own words from the first <a href="/2018-08-18/netcoreapp3.0-intrinsics-in-real-life-pt1#the-whatwhy-of-intrinsics">blog post where I discussed intrinsics</a> in the CoreCLR 3.0 alpha days:</p>
<blockquote>
<p>Processor intrinsics are a way to directly embed specific CPU instructions via special, fake method calls that the JIT replaces at code-generation time. Many of these instructions are considered exotic, and normal language syntax cannot map them cleanly.<br />
The general rule is that a single intrinsic “function” becomes a single CPU instruction.</p>
</blockquote>
<p>You can go and re-read that introduction if you care for a more general and gentle introduction to processor intrinsics. For this series, we are going to focus on vectorized intrinsics in Intel processors. This is the largest group of CPU specific intrinsics in our processors, and I want to start by showing this by the numbers. I gathered some statistics by processing Intel’s own <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/files/data-3.4.6.xml">data-3.4.6.xml</a>. This XML file is part of the <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/">Intel Intrinsics Guide</a>, an invaluable resource on intrinsics in itself, and the “database” behind the guide. What I learned was that:</p>
<ul>
<li>There are no less than 1,218 intrinsics in Intel processors<sup id="fnref:0" role="doc-noteref"><a href="#fn:0" class="footnote">1</a></sup>!
<ul>
<li>Those can be combined in 6,180 different ways (according to operand sizes and types).</li>
<li>They’re grouped into 67 different categories/groups, these groups loosely correspond to various generations of CPUs as more and more intrinsics were gradually added.</li>
</ul>
</li>
<li>More than 94% are vectorized hardware intrinsics, which we’ll define more concretely below.</li>
</ul>
<p>That last point is super-critical: CPU intrinsics, at least in 2020, are overwhelmingly about being able to execute vectorized instructions. That’s really why you <em>should</em> be paying attention to them in the first place. Sure, there’s additional stuff in there if you’re a kernel developer, writing crypto code, or covering some other niche case, but vectorization is why you’re really here, whether you knew it or not.</p>
<p>In C#, we’ve mostly shied away from having intrinsics until CoreCLR 3.0 came along, where intrinsic support became official/complete, championed by <a href="https://twitter.com/tannergooding">@tannergooding</a> as well as others from Microsoft and Intel. As single-threaded performance has virtually stopped improving, more programming languages have started adding intrinsics support (Go, Rust, Java, and now C#) so that developers in those languages have access to these specialized, much more efficient instructions. CoreCLR 3.0 does not support all 1,218 intrinsics that I found, but a more modest 226 intrinsics in <a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86?view=netcore-3.0&viewFallbackFrom=dotnet-plat-ext-3.0">15 different classes</a> for x86 Intel and AMD processors. Each class is filled with many static functions, all of which are unique processor intrinsics representing a 1:1 mapping to Intel group/code names. As C# developers, we roughly get access to everything Intel incorporated in their processors manufactured from 2014 onwards<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">2</a></sup>, and for AMD processors, from 2015 onwards.</p>
<p>What are these vectorized intrinsics?<br />
We need to cover a few base concepts specific to that category of intrinsics before we can start explaining specific intrinsics/instructions:</p>
<ul>
<li>What are vectorized intrinsics, and why have they become so popular.</li>
<li>How vectorized intrinsics interact with specialized vectorized <em>registers</em>.</li>
<li>How those registers are reflected as, essentially, new primitive types in CoreCLR 3.0.</li>
</ul>
<h3 id="simd-what--why">SIMD What & Why</h3>
<p>I’m going to use vectorization and SIMD interchangeably from here-on, but for the first and last time, let’s spell out what SIMD is: <strong>S</strong>ingle <strong>I</strong>nstruction <strong>M</strong>ultiple <strong>D</strong>ata is really a simple idea when you think about it. A lot of code ends up doing “stuff” in loops, usually, processing vectors of data one element at a time. SIMD instructions bring a simple new idea to the table: The CPU adds special instructions that can do arithmetic, bit-operations, comparisons and many other types of generalized operations on “vectors”, e.g. process multiple elements per instruction.</p>
<p>The benefit of this approach to computing is much greater efficiency: with vectorized intrinsics, a single instruction processes, for example, 8 data elements at once, so we reduce the amount of time the CPU spends decoding instructions for the same amount of work; furthermore, most vectorized instructions operate <em>independently</em> on the various <strong>elements</strong> of the vector and complete in the same number of CPU cycles as the equivalent non-vectorized (or scalar) instruction. In short, in the land of CPU feature economics, vectorization is considered a high bang-for-buck feature: you get a lot of <em>potential</em> performance for relatively few transistors added to the CPU, as long as people are willing to adapt their code (e.g. rewrite it) to use these new intrinsics, or compilers somehow magically manage to auto-vectorize the code (spoiler: there are tons of problems with that too)<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote">3</a></sup>.</p>
<p>Another equally important thing to embrace and understand about vectorized intrinsics is what they don’t and cannot provide: branching. It’s pretty much impossible to even imagine what a vectorized branch instruction would mean; these two concepts don’t begin to mix. Appropriately, a substantial part of vectorizing code is forcing oneself to accomplish the given task without using branching. As we will see, branching begets unpredictability at the CPU level, and unpredictability is our enemy when we want to go fast.</p>
<p>Of course, I’m grossly over-romanticizing vectorized intrinsics and their benefits: there are also many non-trivial overheads involved, both in adding them to our processors and in using them in our code. However, all in all, in the grand picture of CPU/performance economics, adding and using vectorized instructions is still, compared to other potential improvements, quite cheap, under the assumption that programmers are willing to make the effort to re-write and maintain vectorized code.</p>
<h4 id="simd-registers">SIMD registers</h4>
<p>After our short introduction to vectorized intrinsics, we need to discuss SIMD registers, and how this piece of the puzzle fits the grand picture: Teaching our CPU to execute 1,000+ vectorized instructions is just part of the story, these instructions need to somehow operate on our data. Do all of these instructions simply take a pointer to memory and run wild with it? The short answer is: <strong>No</strong>. For the <em>most</em> part, CPU instructions dealing with vectorization (with a few notable exceptions) use special registers inside our CPU that are called SIMD registers. This is analogous to scalar (regular, non-vectorized) code we write in any programming language: while some instructions read and write directly to memory, and occasionally some instruction will accept a memory address as an operand, most instructions are register ↔ register only.</p>
<p>Just like scalar CPU registers, SIMD registers have a constant bit-width. On Intel these come in 64-, 128-, 256-, and more recently 512-bit widths. Unlike scalar registers, though, SIMD registers end up <em>containing multiple</em> data-elements of another primitive type. The same register can and will be used to process a wide range of primitive data-types, depending on which instruction is using it, as we will shortly demonstrate.</p>
<p>For now, this is all I care to explain about SIMD registers at the CPU level: we need to be aware of their existence (we’ll see them in disassembly dumps anyway), and since we are dealing with high-performance code, we kind of need to know how many of them exist inside our CPU.</p>
<h4 id="simd-intrinsic-types-in-c">SIMD Intrinsic Types in C#</h4>
<p>We’ve touched lightly upon SIMD intrinsics and how they operate (e.g. accept and modify) on SIMD registers. Time to figure out how we can fiddle with everything in C#; we’ll start with the types:</p>
<table>
<thead>
<tr>
<th>C# Type</th>
<th style="text-align: center">x86 Registers</th>
<th style="text-align: center">Width (bits)</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.vector64?view=netcore-3.0"><code class="highlighter-rouge">Vector64<T></code></a></td>
<td style="text-align: center"><code class="highlighter-rouge">mm0-mm7</code></td>
<td style="text-align: center">64</td>
</tr>
<tr>
<td><a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.vector128?view=netcore-3.0"><code class="highlighter-rouge">Vector128<T></code></a></td>
<td style="text-align: center"><code class="highlighter-rouge">xmm0-xmm15</code></td>
<td style="text-align: center">128</td>
</tr>
<tr>
<td><a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.vector256?view=netcore-3.0"><code class="highlighter-rouge">Vector256<T></code></a></td>
<td style="text-align: center"><code class="highlighter-rouge">ymm0-ymm15</code></td>
<td style="text-align: center">256</td>
</tr>
</tbody>
</table>
<p>These are primitive vector value-types recognized by the JIT while it is generating machine code. We should try and think about these types just like we think about other special-case primitive types such as <code class="highlighter-rouge">int</code> or <code class="highlighter-rouge">double</code>, with one exception: these vector types all accept a generic parameter <code class="highlighter-rouge"><T></code>, which may seem a little odd for a primitive type at first glance, until we remember that their purpose is to contain <em>other</em> primitive types (there’s a reason they put the word “Vector” in there…); moreover, this generic parameter can’t just be any type or even value-type we’d like… It is limited to the types supported on our CPU and its vectorized intrinsics.</p>
<p>Let’s take <code class="highlighter-rouge">Vector256<T></code>, which I’ll be using exclusively in this series, as an example; <code class="highlighter-rouge">Vector256<T></code> can be used <strong>only</strong> with the following primitive types:</p>
<table class="fragment">
<thead><th style="border: none"><code>typeof(T)</code></th>
<th />
<th style="border: none"># Elements</th>
<th style="border: none"></th>
<th style="border: none">Element Width (bits)</th>
</thead>
<tbody>
<tr><td style="border: none"><code>byte / sbyte</code></td> <td style="border: none">➡</td><td style="border: none">32</td><td style="border: none">x</td><td style="border: none">8b</td></tr>
<tr><td style="border: none"><code>short / ushort</code></td><td style="border: none">➡</td> <td style="border: none">16</td><td style="border: none">x</td><td style="border: none">16b</td></tr>
<tr><td style="border: none"><code>int / uint</code></td> <td style="border: none">➡</td> <td style="border: none">8</td><td style="border: none">x</td><td style="border: none">32b</td></tr>
<tr><td style="border: none"><code>long / ulong</code></td> <td style="border: none">➡</td> <td style="border: none">4</td><td style="border: none">x</td><td style="border: none">64b</td></tr>
<tr><td style="border: none"><code>float</code></td><td style="border: none">➡</td> <td style="border: none">8</td><td style="border: none">x</td><td style="border: none">32b</td></tr>
<tr><td style="border: none"><code>double</code></td> <td style="border: none">➡</td> <td style="border: none">4</td><td style="border: none">x</td><td style="border: none">64b</td></tr>
</tbody>
</table>
<p>No matter which of the supported primitive types we choose, we’ll end up with a total of 256 bits, or the underlying SIMD register width.<br />
Now that we’ve kind of figured out how vector types/registers are represented in C#, let’s perform some operations on them.</p>
<h3 id="a-few-vectorized-instructions-for-the-road">A few Vectorized Instructions for the road</h3>
<p>Armed with this new understanding and knowledge of <code class="highlighter-rouge">Vector256<T></code> we can move on and start learning a few vectorized instructions.</p>
<p>Chekhov famously said: “If in the first act you have hung a pistol on the wall, then in the following one it should be fired. Otherwise, don’t put it there”. Here are seven loaded AVX2 pistols; rest assured they are about to fire in the next act. I’m obviously not going to explain all 1,000+ intrinsics mentioned before, if only not to piss off Anton Chekhov. We will <strong>thoroughly</strong> explain the ones needed to get this party going.<br />
Here’s the list of what we’re going to go over:</p>
<table>
<thead>
<tr>
<th style="text-align: left">x64 asm</th>
<th style="text-align: center">Intel</th>
<th style="text-align: right">CoreCLR</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">vbroadcastd</code></td>
<td style="text-align: center"><a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_broadcastd_epi32&expand=542"><code class="highlighter-rouge">_mm256_broadcastd_epi32</code></a></td>
<td style="text-align: right"><a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.vector256.create?view=netcore-3.0#System_Runtime_Intrinsics_Vector256_Create_System_Int32_"><code class="highlighter-rouge">Vector256.Create(int)</code></a></td>
</tr>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">vlddqu</code></td>
<td style="text-align: center"><a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_lddqu_si256&expand=3296"><code class="highlighter-rouge">_mm256_lddqu_si256</code></a></td>
<td style="text-align: right"><a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.avx.loaddquvector256?view=netcore-3.0#System_Runtime_Intrinsics_X86_Avx_LoadDquVector256_System_Int32__"><code class="highlighter-rouge">Avx.LoadDquVector256</code></a></td>
</tr>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">vmovdqu</code></td>
<td style="text-align: center"><a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_storeu_si256&expand=5654"><code class="highlighter-rouge">_mm256_storeu_si256</code></a></td>
<td style="text-align: right"><a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.avx.store?view=netcore-3.0#System_Runtime_Intrinsics_X86_Avx_Store_System_Int32__System_Runtime_Intrinsics_Vector256_System_Int32__"><code class="highlighter-rouge">Avx.Store</code></a></td>
</tr>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">vpcmpgtd</code></td>
<td style="text-align: center"><a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_cmpgt_epi32&expand=900"><code class="highlighter-rouge">_mm256_cmpgt_epi32</code></a></td>
<td style="text-align: right"><a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.avx2.comparegreaterthan?view=netcore-3.0#System_Runtime_Intrinsics_X86_Avx2_CompareGreaterThan_System_Runtime_Intrinsics_Vector256_System_Int32__System_Runtime_Intrinsics_Vector256_System_Int32__"><code class="highlighter-rouge">Avx2.CompareGreaterThan</code></a></td>
</tr>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">vmovmskps</code></td>
<td style="text-align: center"><a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_movemask_ps&expand=3870"><code class="highlighter-rouge">_mm256_movemask_ps</code></a></td>
<td style="text-align: right"><a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.avx.movemask?view=netcore-3.0#System_Runtime_Intrinsics_X86_Avx_MoveMask_System_Runtime_Intrinsics_Vector256_System_Single__"><code class="highlighter-rouge">Avx.MoveMask</code></a></td>
</tr>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">popcnt</code></td>
<td style="text-align: center"><a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_popcnt_u32&expand=4378"><code class="highlighter-rouge">_mm_popcnt_u32</code></a></td>
<td style="text-align: right"><a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.popcnt.popcount?view=netcore-3.0#System_Runtime_Intrinsics_X86_Popcnt_PopCount_System_UInt32_"><code class="highlighter-rouge">Popcnt.PopCount</code></a></td>
</tr>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">vpermd</code></td>
<td style="text-align: center"><a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_permutevar8x32_epi32&expand=4201"><code class="highlighter-rouge">_mm256_permutevar8x32_epi32</code></a></td>
<td style="text-align: right"><a href="https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.avx2.permutevar8x32?view=netcore-3.0#System_Runtime_Intrinsics_X86_Avx2_PermuteVar8x32_System_Runtime_Intrinsics_Vector256_System_Int32__System_Runtime_Intrinsics_Vector256_System_Int32__"><code class="highlighter-rouge">Avx2.PermuteVar8x32</code></a></td>
</tr>
</tbody>
</table>
<p>I understand that for first time readers, this list looks like I’m just name-dropping lots of fancy code names to make myself sound smart, but the unfortunate reality is that we <em>kind of need</em> to know all of these, and here is why: On the right column I’ve provided the actual C# Intrinsic function we will be calling in our code and linked to their docs. But here’s a funny thing: There is no “usable” documentation on Microsoft’s own docs regarding most of these intrinsics. All those docs do is simply point back to the Intel C/C++ intrinsic name, which I’ve also provided in the middle column, with links to the real documentation, the sort that actually explains what the instruction does with pseudo code. Finally, since we’re practically writing assembly code anyways, and I can guarantee we’ll end up inspecting JIT’d code down the road, I provided the x86 assembly opcodes for our instructions as well.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote">4</a></sup>
Now, what does each of these do? Let’s find out…</p>
<table style="margin-bottom: 0em" class="notice--info">
<tr>
<td style="border: none"><span class="uk-label">Hint</span></td>
<td style="border: none">From here-on, The following icon means I have a thingy that animates: <object style="margin: auto; position: relative; top: 1.1em" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/play.svg"></object><br />
Click/Touch/Hover <b>inside</b> means: <i class="glyphicon glyphicon-play"></i><br />
Click/Touch/Hover <b>outside</b> means: <i class="glyphicon glyphicon-pause"></i>
</td>
</tr>
</table>
<h4 id="vector256createint-value">Vector256.Create(int value)</h4>
<div>
<div class="stickemup">
<object class="animated-border" width="100%" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/inst-animations/vbroadcast-with-hint.svg"></object>
</div>
<p>We start with a couple of simple instructions, and nothing is simpler than this first one: this intrinsic accepts a single scalar value and simply “broadcasts” it to an entire SIMD register. This is how you’d use it:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre><span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="n">someVector256</span> <span class="p">=</span> <span class="n">Vector256</span><span class="p">.</span><span class="nf">Create</span><span class="p">(</span><span class="m">0x42</span><span class="p">);</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>This will load up <code class="highlighter-rouge">someVector256</code> with 8 copies of <code class="highlighter-rouge">0x42</code> once executed, and in x64 assembly, the JIT will produce something quite simple:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="nf">vmovd</span> <span class="nv">xmm0</span><span class="p">,</span> <span class="nb">rax</span> <span class="c1">; 3 cycle latency / 1 cycle throughput</span>
<span class="nf">vpbroadcastd</span> <span class="nv">ymm0</span><span class="p">,</span> <span class="nv">xmm0</span> <span class="c1">; 3 cycle latency / 1 cycle throughput</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>This specific intrinsic is translated into two Intel opcodes, since there is no single instruction that performs this directly.</p>
</div>
<h4 id="avx2loaddquvector256--avxstore">Avx2.LoadDquVector256 / Avx.Store</h4>
<div>
<div class="stickemup">
<object class="animated-border" width="100%" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/inst-animations/lddqu-with-hint.svg"></object>
</div>
<p>Next up we have a couple of brain-dead simple intrinsics that we use to read from memory into SIMD registers and, conversely, store from SIMD registers back to memory. These are amongst the most common intrinsics out there, as you can imagine:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="kt">int</span> <span class="p">*</span><span class="n">ptr</span> <span class="p">=</span> <span class="p">...;</span> <span class="c1">// Get some pointer to a big enough array</span>
<span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="n">data</span> <span class="p">=</span> <span class="n">Avx</span><span class="p">.</span><span class="nf">LoadDquVector256</span><span class="p">(</span><span class="n">ptr</span><span class="p">);</span>
<span class="p">...</span>
<span class="n">Avx</span><span class="p">.</span><span class="nf">Store</span><span class="p">(</span><span class="n">ptr</span><span class="p">,</span> <span class="n">data</span><span class="p">);</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>And in x64 assembly:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="nf">vlddqu</span> <span class="nv">ymm1</span><span class="p">,</span> <span class="nv">ymmword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">]</span> <span class="c1">; 5 cycle latency + cache/memory</span>
<span class="c1">; 0.5 cycle throughput</span>
<span class="nf">vmovdqu</span> <span class="nv">ymmword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">],</span> <span class="nv">ymm1</span> <span class="c1">; Same as above</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>I only included an SVG animation for <code class="highlighter-rouge">LoadDquVector256</code>, but you can use your imagination and visualize how <code class="highlighter-rouge">Store</code> simply does the same thing in reverse.</p>
</div>
<h4 id="comparegreaterthan">CompareGreaterThan</h4>
<div>
<div class="stickemup">
<object class="animated-border" width="100%" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/inst-animations/vpcmpgtd-with-hint.svg"></object>
</div>
<p><code class="highlighter-rouge">CompareGreaterThan</code> does an <em>n</em>-way, element-by-element <em>greater-than</em> (<code class="highlighter-rouge">></code>) comparison between two <code class="highlighter-rouge">Vector256<T></code> instances. In our case where <code class="highlighter-rouge">T</code> is really <code class="highlighter-rouge">int</code>, this means comparing 8 integers in one go, instead of performing 8 comparisons serially!</p>
<p>But where is the result? In a new <code class="highlighter-rouge">Vector256<int></code> of course! The resulting vector contains 8 results for the corresponding comparisons between the elements of the first and second vectors. For each position where the element in the first vector is <em>greater-than</em> (<code class="highlighter-rouge">></code>) the corresponding element in the second, the matching element in the result vector gets a <code class="highlighter-rouge">-1</code> value (all bits set), or <code class="highlighter-rouge">0</code> otherwise.<br />
Calling this is rather simple:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="n">data</span><span class="p">,</span> <span class="n">comperand</span><span class="p">;</span>
<span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="n">result</span> <span class="p">=</span>
<span class="n">Avx2</span><span class="p">.</span><span class="nf">CompareGreaterThan</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">comperand</span><span class="p">);</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>And in x64 assembly, this is pretty simple too:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="nf">vpcmpgtd</span> <span class="nv">ymm2</span><span class="p">,</span> <span class="nv">ymm1</span><span class="p">,</span> <span class="nv">ymm0</span> <span class="c1">; 1 cycle latency</span>
<span class="c1">; 0.5 cycle throughput</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
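<p>To make the comparison semantics concrete, here is a small, hypothetical example of my own (the values are made up, and naturally, running it requires an AVX2-capable CPU):</p>

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

var data      = Vector256.Create(1, 5, 3, 9, 7, 2, 8, 4);
var comparand = Vector256.Create(5); // broadcast 5 into all 8 elements
var result    = Avx2.CompareGreaterThan(data, comparand);
// result == <0, 0, 0, -1, -1, 0, -1, 0>:
// -1 (all bits set) wherever data[i] > 5, and 0 everywhere else
```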
</div>
<h4 id="movemask">MoveMask</h4>
<div>
<div class="stickemup">
<object class="animated-border" width="100%" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/inst-animations/vmovmskps-with-hint.svg"></object>
</div>
<p>Another intrinsic which will prove to be very useful is the ability to extract some bits from a vectorized register into a normal, scalar one. <code class="highlighter-rouge">MoveMask</code> does just this. This intrinsic takes the top-most (most significant) bit from every element and moves it into our scalar result:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="n">data</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">result</span> <span class="p">=</span> <span class="n">Avx</span><span class="p">.</span><span class="nf">MoveMask</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="nf">AsSingle</span><span class="p">());</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>There’s an oddity here, as you can tell by that awkward <code class="highlighter-rouge">.AsSingle()</code> call; try to ignore it if you can, or hit this footnote<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote">5</a></sup> if you can’t. The assembly instruction here is exactly as simple as you would think:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="nf">vmovmskps</span> <span class="nb">rax</span><span class="p">,</span> <span class="nv">ymm2</span> <span class="c1">; 5 cycle latency</span>
<span class="c1">; 1 cycle throughput</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
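<p>Here is a hypothetical example of my own showing how an 8-element comparison result collapses into a single scalar bitmask (again, the values are made up, and an AVX2-capable CPU is assumed):</p>

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

var data = Vector256.Create(1, 5, 3, 9, 7, 2, 8, 4);
var gt   = Avx2.CompareGreaterThan(data, Vector256.Create(5));
int mask = Avx.MoveMask(gt.AsSingle());
// Bit i of mask is the MSB of element i:
// elements 3, 4 and 6 were greater than 5, so mask == 0b0101_1000
```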
</div>
<h4 id="popcount">PopCount</h4>
<p><code class="highlighter-rouge">PopCount</code> is a very powerful intrinsic, which <a href="/2018-08-19/netcoreapp3.0-intrinsics-in-real-life-pt2">I’ve covered extensively before</a>: <code class="highlighter-rouge">PopCount</code> returns the number of <code class="highlighter-rouge">1</code> bits in a 32/64 bit primitive.<br />
In C#, we would use it as follows:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="kt">uint</span> <span class="n">result</span> <span class="p">=</span> <span class="n">Popcnt</span><span class="p">.</span><span class="nf">PopCount</span><span class="p">(</span><span class="m">0</span><span class="n">b0000111100110011</span><span class="p">);</span>
<span class="c1">// result == 8</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>And in x64 assembly code:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="nf">popcnt</span> <span class="nb">rax</span><span class="p">,</span> <span class="nb">rdx</span> <span class="c1">; 3 cycle latency</span>
<span class="c1">; 1 cycle throughput</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>In this series, <code class="highlighter-rouge">PopCount</code> is the only intrinsic I use that is not purely vectorized<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote">6</a></sup>.</p>
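<p>Chaining the two previous intrinsics into <code class="highlighter-rouge">PopCount</code> is where the payoff starts to show; here's a hypothetical example of mine counting, in a handful of instructions, how many of 8 elements exceed some value (an AVX2-capable CPU is assumed):</p>

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

var data   = Vector256.Create(1, 5, 3, 9, 7, 2, 8, 4);
var gt     = Avx2.CompareGreaterThan(data, Vector256.Create(5));
int mask   = Avx.MoveMask(gt.AsSingle());
uint count = Popcnt.PopCount((uint) mask);
// count == 3: exactly 3 of the 8 elements were greater than 5
```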
<h4 id="permutevar8x32">PermuteVar8x32</h4>
<div>
<div class="stickemup">
<object class="animated-border" type="image/svg+xml" data="../talks/intrinsics-sorting-2019/inst-animations/vpermd-with-hint.svg"></object>
</div>
<p><code class="highlighter-rouge">PermuteVar8x32</code> accepts two vectors, source and permutation, and performs a permutation operation <strong>on</strong> the source value <em>according to the order provided</em> in the permutation value. If this sounds confusing, go straight to the visualization above…</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="n">data</span><span class="p">,</span> <span class="n">perm</span><span class="p">;</span>
<span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="n">result</span> <span class="p">=</span> <span class="n">Avx2</span><span class="p">.</span><span class="nf">PermuteVar8x32</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">perm</span><span class="p">);</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>While technically speaking, both the <code class="highlighter-rouge">data</code> and <code class="highlighter-rouge">perm</code> parameters are of type <code class="highlighter-rouge">Vector256<int></code> and can contain any integer value in their elements, only the 3 least significant bits in <code class="highlighter-rouge">perm</code> are taken into account for permutation of the elements in <code class="highlighter-rouge">data</code>.<br />
This should make sense, as we are permuting an 8-element vector, so we need 3 bits (2<sup>3</sup> == 8) in every permutation element to figure out which element goes where… In x64 assembly this is:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="nf">vpermd</span> <span class="nv">ymm1</span><span class="p">,</span> <span class="nv">ymm2</span><span class="p">,</span> <span class="nv">ymm1</span> <span class="c1">; 3 cycle latency</span>
<span class="c1">; 1 cycle throughput</span>
</pre></td></tr></tbody></table></code></pre></div> </div>
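<p>A hypothetical example of my own, to make the permutation semantics concrete; here we reverse an 8-element vector (the values and the permutation are made up, and an AVX2-capable CPU is assumed):</p>

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

var data   = Vector256.Create(10, 20, 30, 40, 50, 60, 70, 80);
var perm   = Vector256.Create(7, 6, 5, 4, 3, 2, 1, 0);
var result = Avx2.PermuteVar8x32(data, perm);
// result == <80, 70, 60, 50, 40, 30, 20, 10>
// Since only the 3 least significant bits of each perm element matter,
// a perm of <15, 14, 13, 12, 11, 10, 9, 8> produces the exact same result.
```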
</div>
<h3 id="thats-it-for-now">That’s it for now</h3>
<p>This post was all about laying the groundwork before this whole mess comes together.<br />
Remember, we’re re-implementing QuickSort with AVX2 intrinsics in this series, which for the most part, means re-implementing the partitioning function from our scalar code listing in the previous post.<br />
I’m sure wheels are turning in many heads now as you are trying to figure out what comes next…<br />
I think this is as good a time as any to end this post and leave you with a suggestion: Try to take a piece of paper or your favorite text editor, and see if you can cobble these instructions together into something that can partition numbers given a selected pivot.</p>
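<p>If you'd like a nudge before moving on, here is one possible shape such a partition block could take. To be clear: this is a rough sketch of my own, not the actual code from the next post, and <code class="highlighter-rouge">pivot</code>, the pointers, and <code class="highlighter-rouge">lookupTable</code> (a hypothetical, precomputed table mapping each of the 256 possible masks to a permutation that shuffles the smaller-than elements to the front) are all assumed:</p>

```csharp
// Rough sketch only: assumes an unsafe context, an AVX2-capable CPU, and a
// hypothetical precomputed `lookupTable` of 256 permutation vectors.
var P    = Vector256.Create(pivot);
var data = Avx.LoadDquVector256(readPtr);             // read 8 elements
var gt   = Avx2.CompareGreaterThan(data, P);          // 8-way comparison
int mask = Avx.MoveMask(gt.AsSingle());               // 8-bit scalar mask
data = Avx2.PermuteVar8x32(data, lookupTable[mask]);  // smaller-than first
Avx.Store(writePtr, data);                            // write all 8 back
writePtr += 8 - Popcnt.PopCount((uint) mask);         // keep only the
                                                      // smaller-than prefix
```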
<p>When you’re ready, head on to the <a href="/2020-01-30/this-goes-to-eleven-pt3">next post</a> to see how the whole thing comes together, and how fast we can get it to run with a basic version…</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:0" role="doc-endnote">
<p>To be clear, some of these intrinsics are only available in unreleased processors, and even among those that are released in the wild, there is no single processor supporting all of these… <a href="#fnref:0" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:1" role="doc-endnote">
<p>CoreCLR supports roughly everything up to and including the AVX2 intrinsics, which were introduced with the Intel Haswell processor, near the end of 2013. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>In general, auto-vectorizing compilers are a huge subject in their own right, but the bottom line is that without completely changing the syntax and concepts of our programming language, there is very little that an auto-vectorizing compiler can do with existing code; making one that really works often involves designing a programming language with vectorization baked into it from day one. I really recommend reading <a href="https://pharr.org/matt/blog/2018/04/30/ispc-all.html">this series about Intel’s attempt</a> in this space if you are into this sort of thing. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Now, If I was in my annoyed state of mind, I’d bother to mention that <a href="https://github.com/dotnet/corefx/issues/2209#issuecomment-317124449">I personally always thought</a> that introducing 200+ functions with already established names (in C/C++/rust) and forcing everyone to learn new names whose only saving grace is that they look BCL<em>ish</em> to begin with was not the friendliest move on Microsoft’s part, and that trying to give C# names to the utter mess that Intel created in the first place was a thankless effort that would only annoy everyone more, and would eventually run up against the inhumane names Intel went for (Yes, I’m looking at you <code class="highlighter-rouge">LoadDquVector256</code>, you are not looking very BCL-ish to me with the <code class="highlighter-rouge">Dqu</code> slapped in the middle there : (╯°□°)╯︵ ┻━┻)… But thankfully, I’m not in my annoyed state. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>While this looks like we’re really doing “something” with our <code class="highlighter-rouge">Vector256<int></code> and somehow casting it to single-precision floating point values, let me assure you, this is just smoke and mirrors: The intrinsic simply accepts only floating point values (32/64 bit ones), so we have to “cast” the data to <code class="highlighter-rouge">Vector256<float></code>, or alternatively call <code class="highlighter-rouge">.AsSingle()</code> before calling <code class="highlighter-rouge">MoveMask</code>. Yes, this is super awkward from a pure C# perspective, but in reality, the JIT understands these shenanigans and really ignores them completely. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>By the way, although this intrinsic neither accepts nor returns one of the SIMD registers/types, and is considered a non-vectorized intrinsic as far as classification goes, as far as I’m concerned, bit-level intrinsic functions that operate on scalar registers are just as “vectorized” as their “pure” vectorized sisters, since they mostly deal with scalar values as vectors of bits. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>

This Goes to Eleven (Part 1/∞), 2020-01-28, https://bits.houmus.org/2020-01-28/this-goes-to-eleven-pt1
<h1 id="lets-do-this">Let’s do this</h1>
<p>Let’s get in the ring and show what AVX/AVX2 intrinsics can really do for a non-trivial problem, and even discuss potential improvements that future CoreCLR versions could bring to the table.</p>
<p>Everyone needs to sort arrays, once in a while, and many algorithms we take for granted rely on doing so. We think of it as a <em>solved</em> problem, one where nothing <em>further</em> can be done about it in 2020, except for waiting for newer, marginally faster machines to pop up<sup id="fnref:0" role="doc-noteref"><a href="#fn:0" class="footnote">1</a></sup>. However, that is not the case, and while I’m not the first to have thought about it, nor the best at implementing it, if you join me in this rather long journey, we’ll end up with a replacement function for <code class="highlighter-rouge">Array.Sort</code>, written in pure C#, that outperforms CoreCLR’s C++<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote">2</a></sup> code by a factor north of 10x on most modern Intel CPUs, and north of 11x on my laptop.<br />
Sounds interesting? If so, down the rabbit hole we go…</p>
<table style="margin-bottom: 0em" class="notice--warning">
<tr>
<td style="border: none;vertical-align: top"><span class="uk-label">Note</span></td>
<td style="border: none"><div>
<p>In the final days before posting this series, Intel started seeding a CPU microcode update that is/was affecting the performance of the released version of CoreCLR 3.0/3.1 quite considerably. I managed to stir up a <a href="https://twitter.com/damageboy/status/1194751035136450560">small commotion</a> as this was unraveling in my benchmarks. As it happened, my code was (not coincidentally) less affected by this change, while CoreCLR’s <code class="highlighter-rouge">Array.Sort()</code> <a href="https://github.com/dotnet/coreclr/issues/27877">took a 20% nosedive</a>. Let it never be said I’m anything less than chivalrous, for I rolled back the microcode update, and for this <strong>entire</strong> series, I’m going to run against a much faster version of <code class="highlighter-rouge">Array.Sort()</code> than what you, the reader, are probably using, assuming you update your machine from time to time. For the technically inclined, here’s a whole footnote<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote">3</a></sup> on how to double-check what your machine is actually running. I also opened two issues in the CoreCLR repo about attempting to mitigate this both in CoreCLR’s C++ code and separately in the JIT. If/when there is movement on those fronts, the microcode you’re running will become less of an issue to begin with, but for now, this just adds another level of unwarranted complexity to our lives.</p>
</div>
</td>
</tr>
</table>
<p>A while back now, I was reading the post by Stephen Toub about <a href="https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-core-3-0/">Improvements in CoreCLR 3.0</a>, and it became apparent that hardware intrinsics were common to many of these, and that so many parts of CoreCLR could still be sped up with these techniques, that one thing led to another, and I decided an attempt to apply hardware intrinsics to a larger problem than I had previously done myself was in order. To see if I could rise to the challenge, I decided to take on array sorting and see how far I can go.</p>
<p>What I came up with eventually would become a re-write of <code class="highlighter-rouge">Array.Sort()</code> with AVX2 hardware intrinsics. Fortunately, choosing sorting and focusing on QuickSort makes for a great blog post series, since:</p>
<ul>
<li>Everyone should be familiar with the domain and even the original (sorting is the bread and butter of learning computer science, really, and QuickSort is the queen of all sorting algorithms).</li>
<li>It’s relatively easy to explain/refresh on the original.</li>
<li>If I can make it there, I can make it anywhere.</li>
<li>I had no idea how to do it.</li>
</ul>
<p>I started with searching various keywords and found an interesting paper titled: <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1009.7773&rep=rep1&type=pdf">Fast Quicksort Implementation Using AVX Instructions</a> by Shay Gueron and Vlad Krasnov. That title alone made me think this is about to be a walk in the park. While initially promising, it wasn’t good enough as a drop-in replacement for <code class="highlighter-rouge">Array.Sort</code> for reasons I’ll shortly go into. I ended up having a lot of fun expanding on their basic approach. <a href="https://github.com/dotnet/runtime/pull/33152#issuecomment-596405021"><del>I will submit a proper pull-request to start a discussion with CoreCLR devs about integrating this code into the main dotnet repository</del></a><sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote">4</a></sup>, but for now, let’s talk about sorting.</p>
<p>Since there’s a lot to go over here, I’ve split it up into no less than 6 parts:</p>
<ol>
<li>In this part, we start with a refresher on QuickSort and how it compares to <code class="highlighter-rouge">Array.Sort()</code>. If you don’t need a refresher, skip it and get right down to part 2 and onwards. I recommend skimming through, mostly because I’ve got excellent visualizations which should be in the back of everyone’s mind as we deal with vectorization & optimization later.</li>
<li>In <a href="/2020-01-29/this-goes-to-eleven-pt2">part 2</a>, we go over the basics of vectorized hardware intrinsics, vector types, and go over a handful of vectorized instructions we’ll use in part 3. We still won’t be sorting anything.</li>
<li>In <a href="/2020-01-30/this-goes-to-eleven-pt3">part 3</a>, we go through the initial code for the vectorized sorting, and we’ll start seeing some payoff. We finish agonizing courtesy of the CPU’s Branch Predictor, throwing a wrench into our attempts.</li>
<li>In part 4, we go over a handful of optimization approaches that I attempted trying to get the vectorized partitioning to run faster. We’ll see what worked and what didn’t.</li>
<li>In part 5, we’ll see how we can almost get rid of all the remaining scalar code- by implementing small-constant size array sorting. We’ll use, drum roll…, yet more AVX2 vectorization.</li>
<li>Finally, in part 6, I’ll list the outstanding stuff/ideas I have for getting more juice and functionality out of my vectorized code.</li>
</ol>
<h2 id="quicksort-crash-course">QuickSort Crash Course</h2>
<p>QuickSort is deceptively simple.<br />
No, it really is.<br />
In 20 lines of C#, or whatever language you prefer, you can sort numbers. Lots of them, and incredibly fast. However, try and change something about it; nudge it in the wrong way, and it will quickly turn around and teach you a lesson in humility. It is hard to improve on it without breaking any of the tenets it is built upon.</p>
<h3 id="in-words">In words</h3>
<p>Before we discuss any of that, let’s describe QuickSort in words, code, pictures, and statistics:</p>
<ul>
<li>It uses a <em>divide-and-conquer</em> approach.
<ul>
<li>In other words, it’s recursive.</li>
<li>It performs \(\mathcal{O}(n\log{}n)\) comparisons to sort <em>n</em> items.</li>
</ul>
</li>
<li>It performs an in-place sort.</li>
</ul>
<p>That last point, referring to in-place sorting, sounds simple and neat, and it sure is from the perspective of the user: no additional memory allocation needs to occur regardless of how much data they’re sorting. While that’s great, I’ve spent days trying to overcome the correctness and performance challenges that arise from it, specifically in the context of vectorization. It is also essential to remain in-place since I intend for this to become a <em>drop-in</em> replacement for <code class="highlighter-rouge">Array.Sort</code>.</p>
<p>More concretely, QuickSort works like this:</p>
<ol>
<li>Pick a pivot value.</li>
<li><strong>Partition</strong> the array around the pivot value.</li>
<li>Recurse on the left side of the pivot.</li>
<li>Recurse on the right side of the pivot.</li>
</ol>
<p>Picking a pivot could be a mini-post in itself, but again, in the context of competing with <code class="highlighter-rouge">Array.Sort</code> we don’t need to dive into it, we’ll copy whatever CoreCLR does, and get on with our lives.<br />
CoreCLR uses a pretty standard scheme of median-of-three for pivot selection, which can be summed up as: “Let’s sort the three elements in the first, middle, and last positions, then pick the middle one of those three as the pivot”.</p>
<p><strong>Partitioning</strong> the array is where we spend most of the execution time: we take our selected pivot value and rearrange the array segment that was handed to us such that all numbers <em>smaller-than</em> the pivot are in the beginning or <strong>left</strong>, in no particular order amongst themselves. Then comes the <em>pivot</em>, in its <strong>final</strong> resting position, and following it are all elements <em>greater-than</em> the pivot, again in no particular order amongst themselves.</p>
<p>After partitioning is complete, we recurse to the left and right of the pivot, as previously described.</p>
<p>That’s all there is to it: this gets millions, even billions, of numbers sorted in-place, as efficiently as we know how to, 60+ years after its invention.</p>
<p class="notice--info">Bonus trivia points for those who are still here with me: <a href="https://en.wikipedia.org/wiki/Tony_Hoare">Tony Hoare</a>, who invented QuickSort back in the early 60s also took responsibility for inventing the <code class="highlighter-rouge">null</code> pointer concept. So I guess there really is no good without evil in this world.</p>
<h3 id="in-code">In code</h3>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
</pre></td><td class="rouge-code"><pre><span class="k">void</span> <span class="nf">QuickSort</span><span class="p">(</span><span class="kt">int</span><span class="p">[]</span> <span class="n">items</span><span class="p">)</span> <span class="p">=></span> <span class="nf">QuickSort</span><span class="p">(</span><span class="n">items</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="n">items</span><span class="p">.</span><span class="n">Length</span> <span class="p">-</span> <span class="m">1</span><span class="p">);</span>
<span class="k">void</span> <span class="nf">QuickSort</span><span class="p">(</span><span class="kt">int</span><span class="p">[]</span> <span class="n">items</span><span class="p">,</span> <span class="kt">int</span> <span class="n">left</span><span class="p">,</span> <span class="kt">int</span> <span class="n">right</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">left</span> <span class="p">==</span> <span class="n">right</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">pivot</span> <span class="p">=</span> <span class="nf">PickPivot</span><span class="p">(</span><span class="n">items</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">pivotPos</span> <span class="p">=</span> <span class="nf">Partition</span><span class="p">(</span><span class="n">items</span><span class="p">,</span> <span class="n">pivot</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="nf">QuickSort</span><span class="p">(</span><span class="n">items</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">pivotPos</span><span class="p">);</span>
<span class="nf">QuickSort</span><span class="p">(</span><span class="n">items</span><span class="p">,</span> <span class="n">pivotPos</span> <span class="p">+</span> <span class="m">1</span><span class="p">,</span> <span class="n">right</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">PickPivot</span><span class="p">(</span><span class="kt">int</span><span class="p">[]</span> <span class="n">items</span><span class="p">,</span> <span class="kt">int</span> <span class="n">left</span><span class="p">,</span> <span class="kt">int</span> <span class="n">right</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">var</span> <span class="n">mid</span> <span class="p">=</span> <span class="n">left</span> <span class="p">+</span> <span class="p">((</span><span class="n">right</span> <span class="p">-</span> <span class="n">left</span><span class="p">)</span> <span class="p">/</span> <span class="m">2</span><span class="p">);</span>
<span class="nf">SwapIfGreater</span><span class="p">(</span><span class="k">ref</span> <span class="n">items</span><span class="p">[</span><span class="n">left</span><span class="p">],</span> <span class="k">ref</span> <span class="n">items</span><span class="p">[</span><span class="n">mid</span><span class="p">]);</span>
<span class="nf">SwapIfGreater</span><span class="p">(</span><span class="k">ref</span> <span class="n">items</span><span class="p">[</span><span class="n">left</span><span class="p">],</span> <span class="k">ref</span> <span class="n">items</span><span class="p">[</span><span class="n">right</span><span class="p">]);</span>
<span class="nf">SwapIfGreater</span><span class="p">(</span><span class="k">ref</span> <span class="n">items</span><span class="p">[</span><span class="n">mid</span><span class="p">],</span> <span class="k">ref</span> <span class="n">items</span><span class="p">[</span><span class="n">right</span><span class="p">]);</span>
    <span class="k">return</span> <span class="n">items</span><span class="p">[</span><span class="n">mid</span><span class="p">];</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">Partition</span><span class="p">(</span><span class="kt">int</span><span class="p">[]</span> <span class="n">array</span><span class="p">,</span> <span class="kt">int</span> <span class="n">pivot</span><span class="p">,</span> <span class="kt">int</span> <span class="n">left</span><span class="p">,</span> <span class="kt">int</span> <span class="n">right</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">left</span> <span class="p"><</span> <span class="n">right</span><span class="p">)</span> <span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">array</span><span class="p">[</span><span class="n">left</span><span class="p">]</span> <span class="p"><</span> <span class="n">pivot</span><span class="p">)</span> <span class="n">left</span><span class="p">++;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">array</span><span class="p">[</span><span class="n">right</span><span class="p">]</span> <span class="p">></span> <span class="n">pivot</span><span class="p">)</span> <span class="n">right</span><span class="p">--;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">left</span> <span class="p"><=</span> <span class="n">right</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">t</span> <span class="p">=</span> <span class="n">array</span><span class="p">[</span><span class="n">left</span><span class="p">];</span>
<span class="n">array</span><span class="p">[</span><span class="n">left</span><span class="p">++]</span> <span class="p">=</span> <span class="n">array</span><span class="p">[</span><span class="n">right</span><span class="p">];</span>
<span class="n">array</span><span class="p">[</span><span class="n">right</span><span class="p">--]</span> <span class="p">=</span> <span class="n">t</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">left</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>I did say it is deceptively simple, and grasping how QuickSort really works sometimes feels like trying to lift sand through your fingers; to that end, I’ve included two more visualizations of QuickSort, which are derivatives of the amazing work done by <a href="https://observablehq.com/@mbostock">Michael Bostock (@mbostock)</a> with <a href="https://d3js.org/">d3.js</a>.</p>
<h3 id="visualizing-quicksorts-recursion">Visualizing QuickSort’s recursion</h3>
<p>One thing that we have to keep in mind is that the same data is partitioned over-and-over again, many times, with ever-shrinking partition sizes until we end up having a partition size of 2 or 3, in which case we can trivially sort the partition as-is and return.</p>
<p>To help see this better, we’ll use this way of visualizing arrays and their intermediate states in QuickSort:</p>
<div>
<div class="stickemup">
<p><img src="/talks/intrinsics-sorting-2019/quicksort-mbostock/quicksort-vis-legend.svg" alt="QuickSort Legend" /></p>
</div>
<p>Here, we see an unsorted array of 200 elements (in the process of getting sorted).<br />
The different sticks represent numbers in the [-45°..+45°] range, and the angle of each individual stick represents its value, as I hope it is easy to discern.<br />
We represent the pivots with <strong>two</strong> colors:</p>
<ul>
<li><span style="color: red"><strong>Red</strong></span> for the currently selected pivot at a given recursion level.</li>
<li><span style="color: green"><strong>Green</strong></span> for previous pivots that have already been partitioned around in previous rounds/levels of the recursion.</li>
</ul>
<p>Our ultimate goal is to go from the messy image above to the visually appeasing one below:</p>
</div>
<p><img src="/talks/intrinsics-sorting-2019/quicksort-mbostock/quicksort-vis-sorted.svg" alt="QuickSort Sorted" /></p>
<p>What follows is a static (e.g., non-animated) visualization that shows how pivots are randomly selected at each level of recursion and how, by the next step, the unsorted segments around them become partitioned until we finally have a completely sorted array. Here is how the whole thing looks:</p>
<p class="notice--info">These visuals are auto-generated in Javascript + d3.js, so feel free to hit that “Reload” button and/or change the number of elements in the array if you feel you want to see a new set of random sticks sorted.</p>
<iframe src="../talks/intrinsics-sorting-2019/quicksort-mbostock/qs-static-reload.html" scrolling="no" style="width:1600px; max-width: 100%;background: transparent;" allowfullscreen=""></iframe>
<p>I encourage you to look at this and try to explain to yourself what QuickSort “does” here, at every level. What you can witness here is the interaction between pivot selection, where it “lands” in the next recursion level (or row), and future pivots to its left and right and in the next levels of recursion. We also see how, with every level of recursion, the partition sizes decrease until, finally, every element is a pivot, which means sorting is complete.</p>
<h3 id="visualizing-quicksorts-comparisonsswaps">Visualizing QuickSort’s Comparisons/Swaps</h3>
<p>While the above visualization really does a lot to help understand <strong>how</strong> QuickSort works, I also wanted to leave you with an impression of the total amount of work done by QuickSort:</p>
<div>
<div class="stickemup">
<iframe src="../talks/intrinsics-sorting-2019/quicksort-mbostock/qs-animated-playpause.html" scrolling="no" style="width:1600px; height: 250px; max-width: 100%;background: transparent;" allowfullscreen=""></iframe>
</div>
<p>Above is an <strong>animation</strong> of the whole process as it goes over the same array, slowly and recursively going from an unsorted mess to a completely sorted array.</p>
<p>We can witness just how many comparisons and swap operations need to happen for a 200 element QuickSort to complete successfully. There’s genuinely a lot of work that needs to happen per element (when considering how we re-partition virtually all elements again and again) for the whole thing to finish.</p>
</div>
<h3 id="arraysort-vs-quicksort">Array.Sort vs. QuickSort</h3>
<p>It’s important to note that <code class="highlighter-rouge">Array.Sort</code> uses a couple more tricks to get better performance and avoid certain dark spots that come with QuickSort. I would be irresponsible if I didn’t mention those, since in the later posts I borrow at least one idea from its play-book and improve upon it with intrinsics.</p>
<p><code class="highlighter-rouge">Array.Sort</code> isn’t strictly QuickSort; it is a variation on it called <a href="https://en.wikipedia.org/wiki/Introsort">Introspective Sort</a>, invented by <a href="https://en.wikipedia.org/wiki/David_Musser">David Musser</a> in 1997. What it roughly does is combine Quick-Sort, Heap-Sort, and Insertion-Sort by dynamically switching between them: more specifically, it starts with quick-sort and <em>may</em> switch to heap-sort if the recursion depth goes beyond a specific threshold, while also switching to insertion-sort if the size of the partition drops below a different threshold. This hybrid approach is a clever way of mitigating the two biggest shortcomings in quick-sort alone:</p>
<ul>
<li>QuickSort is notorious for degenerating into \(\mathcal{O}(n^2)\) for various edge-cases input sequences. I won’t go very deeply into this, but think about an array that is made up of a single repeated number. In such an extreme case, partitioning results in a bad separation around the pivot (e.g. one sub-partition will always have a size of <code class="highlighter-rouge">0</code>) for each partitioning attempt, and the whole thing goes south very quickly.
<ul>
<li>Introspective-sort mitigates such bad cases by tracking the current recursion depth vs. an acceptable worst-case depth (usually \(2 \cdot (\lfloor \log_{2}(n) \rfloor + 1)\)). Once the measured/actual depth crosses over that threshold, introspective-sort switches internally from partitioning/quick-sort to heap-sort which deals with such cases better, on average.</li>
</ul>
</li>
<li>Lastly, once the partition is small enough, introspective-sort switches to using insertion-sort. This is a critical improvement when we consider that recursive calls are never cheap (even more so for the code I’ll present later in this series). In CoreCLR/C#, where this threshold was selected to be 16 elements, this hybrid approach manages to replace up to 3 levels of recursive calls (or \(2^{n+1}-1 = 2^{4}-1 = 15\) partitioning calls on average) with a <strong>single</strong> call to insertion-sort, which is very effective for these small input sizes anyway. The impact of this optimization, where recursion is replaced with simpler loop-based code, cannot be overstated.</li>
</ul>
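<p>To make the switching logic concrete, here is a compact, illustrative sketch of introspective sort in plain C#. This is <em>not</em> CoreCLR’s actual code: it uses a simple Lomuto partition and helper names of my own invention, but the depth-limit and small-partition switches follow the scheme described above:</p>

```csharp
using System;

public static class IntroSortSketch
{
    // Same small-partition cutoff CoreCLR uses
    const int SmallSortThreshold = 16;

    public static void Sort(int[] a)
    {
        if (a.Length < 2) return;
        // Acceptable worst-case depth: 2 * (floor(log2(n)) + 1)
        var depthLimit = 2 * (FloorLog2(a.Length) + 1);
        Sort(a, 0, a.Length - 1, depthLimit);
    }

    static int FloorLog2(int n) { var r = 0; while (n > 1) { n >>= 1; r++; } return r; }

    static void Sort(int[] a, int lo, int hi, int depthLimit)
    {
        while (hi - lo + 1 > SmallSortThreshold)
        {
            // Recursion going too deep? This input is degenerate: bail to heap-sort
            if (depthLimit-- == 0) { HeapSort(a, lo, hi); return; }
            var p = Partition(a, lo, hi);
            Sort(a, p + 1, hi, depthLimit); // recurse into the right side,
            hi = p - 1;                     // loop over the left one
        }
        // Small partition: a single insertion-sort call replaces
        // the last few levels of recursion
        InsertionSort(a, lo, hi);
    }

    static int Partition(int[] a, int lo, int hi)
    {
        var pivot = a[hi];
        var i = lo;
        for (var j = lo; j < hi; j++)
            if (a[j] < pivot) Swap(a, i++, j);
        Swap(a, i, hi);
        return i;
    }

    static void InsertionSort(int[] a, int lo, int hi)
    {
        for (var i = lo + 1; i <= hi; i++) {
            var t = a[i];
            var j = i - 1;
            while (j >= lo && a[j] > t) { a[j + 1] = a[j]; j--; }
            a[j + 1] = t;
        }
    }

    static void HeapSort(int[] a, int lo, int hi)
    {
        var n = hi - lo + 1;
        for (var i = n / 2 - 1; i >= 0; i--) SiftDown(a, lo, i, n);
        for (var i = n - 1; i > 0; i--) { Swap(a, lo, lo + i); SiftDown(a, lo, 0, i); }
    }

    static void SiftDown(int[] a, int lo, int root, int n)
    {
        while (true) {
            var c = 2 * root + 1;
            if (c >= n) return;
            if (c + 1 < n && a[lo + c + 1] > a[lo + c]) c++;
            if (a[lo + root] >= a[lo + c]) return;
            Swap(a, lo + root, lo + c);
            root = c;
        }
    }

    static void Swap(int[] a, int i, int j) { var t = a[i]; a[i] = a[j]; a[j] = t; }
}
```

<p>The depth limit is computed once up front, every level of recursion “spends” one unit of it, and only pathological inputs ever exhaust it.</p>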
<p>As mentioned, I ended up borrowing this last idea for my code as the issues around smaller partition sizes are exacerbated by using vectorized intrinsics in the following posts.</p>
<p>For the unfriendly cases I mentioned before, I have no vectorized approach yet (OK, I kind of do, but I have no intention of making this a 9-post blog series :). However, I have no problem admitting to this while weaseling my way out of this pit of despair in the most direct way: use the same logic that introspective-sort uses for switching to heap-sort (triggered when the depth exceeds some dynamically computed threshold) and, in turn, switch to… <code class="highlighter-rouge">Array.Sort</code>; we let <em>it</em> stumble a bit with the same input until it gives up and switches internally to heap-sort. It’s slightly nasty, but it works…</p>
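<p>A minimal sketch of that escape hatch, with a scalar Lomuto partition standing in for the vectorized one (all names here are my own, for illustration):</p>

```csharp
using System;

public static class DepthLimitedSort
{
    public static void Sort(int[] a)
    {
        if (a.Length < 2) return;
        // Same worst-case depth bound introspective-sort uses
        var depthLimit = 2 * (FloorLog2(a.Length) + 1);
        Sort(a, 0, a.Length - 1, depthLimit);
    }

    static int FloorLog2(int n) { var r = 0; while (n > 1) { n >>= 1; r++; } return r; }

    static void Sort(int[] a, int lo, int hi, int depthLimit)
    {
        if (hi <= lo) return;
        if (depthLimit == 0) {
            // Give up on this segment: hand it to Array.Sort, and let *it*
            // stumble a bit more before switching internally to heap-sort.
            Array.Sort(a, lo, hi - lo + 1);
            return;
        }
        var p = Partition(a, lo, hi); // stand-in for the vectorized partition
        Sort(a, lo, p - 1, depthLimit - 1);
        Sort(a, p + 1, hi, depthLimit - 1);
    }

    static int Partition(int[] a, int lo, int hi)
    {
        var pivot = a[hi];
        var i = lo;
        for (var j = lo; j < hi; j++)
            if (a[j] < pivot) { (a[i], a[j]) = (a[j], a[i]); i++; }
        (a[i], a[hi]) = (a[hi], a[i]);
        return i;
    }
}
```

<p>With an all-equal input, every partition degenerates and the recursion marches straight down one side, but the depth limit caps the damage after \(\mathcal{O}(\log n)\) levels.</p>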
<h2 id="comparing-scalar-variants">Comparing Scalar Variants</h2>
<p>With all this new information, this is a good time to measure how a couple of different scalar (i.e., non-vectorized) versions compare to <code class="highlighter-rouge">Array.Sort</code>. I’ll show some results generated using <a href="https://benchmarkdotnet.org/">BenchmarkDotNet</a> (BDN) with:</p>
<ul>
<li><code class="highlighter-rouge">Array.Sort()</code> as the baseline.</li>
<li><a href="https://github.com/damageboy/VxSort/blob/research/VxSortResearch/Unstable/Scalar/Managed.cs"><code class="highlighter-rouge">Managed</code></a> as the code I’ve just presented above.
<ul>
<li>This version is just basic QuickSort using regular/safe C#. With this version, every time we access an array element, the JIT inserts bounds-checking machine code around our actual access that ensures the CPU does not read/write outside the memory region owned by the array.</li>
</ul>
</li>
<li><a href="https://github.com/damageboy/VxSort/blob/research/VxSortResearch/Unstable/Scalar/Unmanaged.cs"><code class="highlighter-rouge">Unmanaged</code></a> as an alternative/faster version to <code class="highlighter-rouge">Scalar</code> where:
<ul>
<li>The code uses native pointers and unsafe semantics (using C#‘s new <code class="highlighter-rouge">unmanaged</code> constraint, neat!).</li>
<li>We switch to <code class="highlighter-rouge">InsertionSort</code> (again, copy-pasted from CoreCLR) when below 16 elements, just like <code class="highlighter-rouge">Array.Sort</code> does.</li>
</ul>
</li>
</ul>
<p>I’ve prepared this last version to show that with unsafe code + <code class="highlighter-rouge">InsertionSort</code>, we can remove most of the performance gap between C# and C++ for this type of code. That gap mainly stems from bounds-checking, which the JIT cannot elide for these sorts of random-access patterns, and from the missing jump-to-<code class="highlighter-rouge">InsertionSort</code> optimization.</p>
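<p>To get a feel for what bounds-check elision means, here is a small illustrative pair of loops (not code from the benchmarked versions): the first indexes safely, in a shape the JIT <em>can</em> prove safe; the second uses <code class="highlighter-rouge">Unsafe.Add</code> to sidestep bounds checks entirely, much like the native-pointer <code class="highlighter-rouge">Unmanaged</code> variant does. In QuickSort’s partition loops, the indices move in data-dependent ways, so the JIT has to keep a check on every access:</p>

```csharp
using System;
using System.Runtime.CompilerServices;

public static class BoundsCheckDemo
{
    // Safe indexing in a pattern the JIT recognizes: i is provably inside
    // [0, a.Length), so the per-access bounds check gets elided here.
    public static long SumChecked(int[] a)
    {
        long s = 0;
        for (var i = 0; i < a.Length; i++)
            s += a[i];
        return s;
    }

    // Pointer-style traversal: Unsafe.Add performs no bounds checks at all,
    // which is essentially what the int*-based Unmanaged variant relies on.
    // (Assumes a non-empty array; with great power comes zero safety.)
    public static long SumUnchecked(int[] a)
    {
        long s = 0;
        ref var r = ref a[0]; // one bounds check here, none inside the loop
        for (var i = 0; i < a.Length; i++)
            s += Unsafe.Add(ref r, i);
        return s;
    }
}
```

<p>Both produce identical results; the difference only shows up in the generated machine code, and — for access patterns the JIT can’t reason about — in the benchmarks.</p>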
<table style="margin-bottom: 0em" class="notice--info">
<tr>
<td style="border: none;vertical-align: top"><span class="uk-label">Note</span></td>
<td style="border: none"><div>
<p>Throughout this series, I’ll benchmark each sorting method with various array sizes (BDN parameter: <code class="highlighter-rouge">N</code>): \(10^i_{i=1\cdots7}\). I’ve added a custom column to the BDN report: <code class="highlighter-rouge">Time / N</code>. This represents the time spent sorting <em>per element</em> in the array and, as such, is very useful for comparing results on a more uniform scale.<br />
In addition, I will only start with purely random and unique sets of values, as that is the classical input type on which I want to focus for this series.<br />
When I actually get to submitting a PR, I will have to show more test cases and prove that the whole thing doesn’t crumble once the input is less than optimal, but that is <em>outside of the scope</em> of this series.</p>
</div>
</td>
</tr>
</table>
<p>Here are the results in the form of charts and tables. I’ve included a handy large button you can press to get a quick tour of what each tab contains; what we have here is:</p>
<ol>
<li>A chart scaling the performance of various implementations being compared to <code class="highlighter-rouge">Array.Sort</code> as a ratio.</li>
<li>A chart showing time spent sorting a single element in an array of N elements (Time / N).</li>
<li>BDN results in a friendly table form.</li>
<li>Statistics/Counters that teach us about what is actually going on under the hood.</li>
</ol>
<div>
<div class="stickemup">
<ul class="uk-tab" data-uk-switcher="{connect:'#e34157f6-a85d-4a6d-9972-3d77cd7e5f87'}">
<li class="uk-active"><a href="#"><i class="glyphicon glyphicon-stats"></i> Scaling</a></li>
<li><a href="#"><i class="glyphicon glyphicon-stats"></i> Time/N</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Benchmarks</a></li>
<li><a href="#"><i class="glyphicon glyphicon-list-alt"></i> Statistics</a></li>
<li><a href="#"><i class="glyphicon glyphicon-info-sign"></i> Setup</a></li>
</ul>
<ul id="e34157f6-a85d-4a6d-9972-3d77cd7e5f87" class="uk-switcher uk-margin">
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Performance scale: Array.Sort (solid gray) is always 100%, and the other methods are scaled relative to it" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
ArraySort,1,1,1,1,1,1
Scalar,2.04,1.57,1.33,1.12,1.09,1.11
Unmanaged,1.75,1.01,0.99,0.97,0.93,0.95
<!--
{
"data" : {
"datasets" : [
{
"backgroundColor": "rgba(66,66,66,0.35)",
"rough": { "fillStyle": "solid", "hachureAngle": -30, "hachureGap": 7 }
},
{
"backgroundColor": "rgba(220,33,33,.6)",
"rough": { "fillStyle": "hachure", "hachureAngle": 15, "hachureGap": 6 }
},
{
"backgroundColor": "rgba(33,33,220,.9)",
"rough": { "fillStyle": "hachure", "hachureAngle": -45, "hachureGap": 6 }
}]
},
"options": {
"title": { "text": "Scalar Sorting - Scaled to Array.Sort", "display": true },
"scales": {
"yAxes": [{
"ticks": {
"min": 0.8,
"fontFamily": "Indie Flower",
"callback": "ticksPercent"
},
"scaleLabel": {
"labelString": "Scaling (%)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower"}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<div data-intro="Size of the sorting problem, 10..10,000,000 in powers of 10" data-position="bottom">
<div data-intro="Time in nanoseconds spent sorting per element. Array.Sort (solid gray) is the baseline, again" data-position="left">
<div data-intro="Click legend items to show/hide series" data-position="right">
<div class="benchmark-chart-container">
<canvas data-chart="line">
N,100,1K,10K,100K,1M,10M
ArraySort,12.1123,30.5461,54.641,60.4874,70.7539,80.8431
Scalar,24.7385,47.8796,72.7528,67.7419,77.3906,89.7593
Unmanaged,21.0955,30.9692,54.3112,58.9577,65.7222,76.8631
<!--
{
"data" : {
"datasets" : [
{ "backgroundColor":"rgba(66,66,66,0.35)", "rough": { "fillStyle": "solid", "hachureGap": 6 } },
{ "backgroundColor":"rgba(33,220,33,.6)", "rough": { "fillStyle": "hachure", "hachureAngle": 15, "hachureGap": 6 } },
{ "backgroundColor":"rgba(33,33,220,.9)", "rough": { "fillStyle": "hachure", "hachureAngle": -45, "hachureGap": 6 } }
]
},
"options": {
"title": { "text": "Scalar Sorting - log(Time/N)", "display": true },
"scales": {
"yAxes": [{
"type": "logarithmic",
"ticks": {
"callback": "ticksNumStandaard",
"fontFamily": "Indie Flower"
},
"scaleLabel": {
"labelString": "Time/N (ns)",
"fontFamily": "Indie Flower",
"display": true
}
}]
}
},
"defaultOptions": {"scales":{"xAxes":[{"scaleLabel":{"display":"true,","labelString":"N (elements)","fontFamily":"Indie Flower"},"ticks":{"fontFamily":"Indie Flower"}}]},"legend":{"display":true,"position":"bottom","labels":{"fontFamily":"Indie Flower","fontSize":14}},"title":{"position":"top","fontFamily":"Indie Flower"}}
}
--> </canvas>
</div>
</div>
</div>
</div>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/Bench.BlogPt1_Int32_-report.datatable.json" data-id-field="name" data-pagination="true" data-page-list="[9, 18]" data-intro="Each row in this table represents a benchmark result" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="TargetMethodColumn.Method" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">
Method<br />Name
</span>
</th>
<th data-field="N" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">
Problem<br />Size
</span>
</th>
<th data-field="TimePerNDataTable" data-sortable="true" data-value-type="float2-interval-muted">
<span data-intro="Time in nanoseconds spent sorting each element in the array (with confidence intervals in parenthesis)" data-position="top">
Time /<br />Element (ns)
</span>
</th>
<th data-field="RatioDataTable" data-sortable="true" data-value-type="inline-bar-horizontal-percentage">
<span data-intro="Each result is scaled to its baseline (Array.Sort in this case)" data-position="top">
Scaling
</span>
</th>
<th data-field="Measurements" data-sortable="true" data-value-type="inline-bar-vertical">
<span data-intro="Raw benchmark results visualize how stable the result it. Longest/Shortest runs marked with <span style='color: red'>Red</span>/<span style='color: green'>Green</span>" data-position="top">Measurements</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div>
<button class="helpbutton" data-toggle="chardinjs" onclick="$('body').chardinJs('start')"><object style="pointer-events: none;" type="image/svg+xml" data="/assets/images/help.svg"></object></button>
<table class="table datatable" data-json="../_posts/scalar-vs-unmanaged-stats.json" data-id-field="name" data-pagination="true" data-page-list="[9, 18]" data-intro="Each row in this table contains statistics collected & averaged out of thousands of runs with random data" data-position="left" data-show-pagination-switch="false">
<thead data-intro="The header can be used to sort/filter by clicking" data-position="right">
<tr>
<th data-field="MethodName" data-sortable="true" data-filter-control="select">
<span data-intro="The name of the benchmarked method" data-position="top">Method<br />Name</span>
</th>
<th data-field="ProblemSize" data-sortable="true" data-value-type="int" data-filter-control="select">
<span data-intro="The size of the sorting problem being benchmarked (# of integers)" data-position="top">Problem<br />Size</span>
</th>
<th data-field="MaxDepthScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<span data-intro="The maximal depth of recursion reached while sorting" data-position="top">Max<br />Depth</span>
</th>
<th data-field="NumPartitionOperationsScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<span data-intro="# of partitioning operations for each sort" data-position="top">#<br />Part-<br />itions</span>
</th>
<th data-field="AverageSmallSortSizeScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<span data-intro="For hybrid sorting, the average size that each small sort operation was called with (e.g. InsertionSort)" data-position="top">
Avg.<br />Small<br />Sorts<br />Size
</span>
</th>
<th data-field="NumScalarComparesScaledDataTable" data-sortable="true" data-value-type="inline-bar-horizontal">
<span data-intro="How many branches were executed in each sort operation that were based on the unsorted array elements" data-position="top">
# Data-<br />Based<br />Branches
</span>
</th>
<th data-field="PercentSmallSortCompares" data-sortable="true" data-value-type="float2-percentage">
<span data-intro="What percent of<br/>⬅<br/>branches happened as part of small-sorts" data-position="top">
% Small<br />Sort<br />Data-<br />Based<br />Branches
</span>
</th>
</tr>
</thead>
</table>
</div>
</li>
<li>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nv">BenchmarkDotNet</span><span class="o">=</span>v0.12.0, <span class="nv">OS</span><span class="o">=</span>clear-linux-os 32120
Intel Core i7-7700HQ CPU 2.80GHz <span class="o">(</span>Kaby Lake<span class="o">)</span>, 1 CPU, 4 logical and 4 physical cores
.NET Core <span class="nv">SDK</span><span class="o">=</span>3.1.100
<span class="o">[</span>Host] : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
Job-DEARTS : .NET Core 3.1.0 <span class="o">(</span>CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404<span class="o">)</span>, X64 RyuJIT
<span class="nv">InvocationCount</span><span class="o">=</span>3 <span class="nv">IterationCount</span><span class="o">=</span>15 <span class="nv">LaunchCount</span><span class="o">=</span>2
<span class="nv">UnrollFactor</span><span class="o">=</span>1 <span class="nv">WarmupCount</span><span class="o">=</span>10
<span class="nv">$ </span><span class="nb">grep</span> <span class="s1">'stepping\|model\|microcode'</span> /proc/cpuinfo | <span class="nb">head</span> <span class="nt">-4</span>
model : 158
model name : Intel<span class="o">(</span>R<span class="o">)</span> Core<span class="o">(</span>TM<span class="o">)</span> i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
</pre></td></tr></tbody></table></code></pre></div></div>
</li>
</ul>
</div>
<p>Surprisingly<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">5</a></sup>, the unmanaged C# version is running slightly faster than <code class="highlighter-rouge">Array.Sort</code>, but with one caveat: it only outperforms the C++ version for large inputs. Otherwise, everything is as expected: the purely <code class="highlighter-rouge">Managed</code> variant is just slow, and the <code class="highlighter-rouge">Unmanaged</code> one is mostly on par with <code class="highlighter-rouge">Array.Sort</code>.<br />
These C# implementations were written to <strong>verify</strong> that we can get to <code class="highlighter-rouge">Array.Sort</code> <em>like</em> performance in C#, and they do just that. Running 5% faster for <em>some</em> input sizes will not cut it for me; I want it <em>much</em> faster. An equally important reason for re-implementing these basic versions is that we can now sprinkle <em>statistics-collecting-code</em> magic fairy dust<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote">6</a></sup> on them so that we have even more numbers to dig into in the “Statistics” tab: These counters will assist us in deciphering and comparing future results and implementations. In this post they serve us by establishing a baseline. We can see, per each <code class="highlighter-rouge">N</code> value (with some commentary):</p>
<ul>
<li>The maximal recursion depth. Note that:
<ul>
<li>The unmanaged version, like CoreCLR’s <code class="highlighter-rouge">Array.Sort</code> switches to <code class="highlighter-rouge">InsertionSort</code> for the last couple of recursion levels, therefore, its maximal depth is smaller.</li>
</ul>
</li>
<li>The total number of partitioning operations performed.
<ul>
<li>Same as above: less recursion ⮚ fewer partitioning calls.</li>
</ul>
</li>
<li>The average size of what I colloquially refer to as “small-sort” operations performed (e.g., <code class="highlighter-rouge">InsertionSort</code> for the <code class="highlighter-rouge">Unmanaged</code> variant).
<ul>
<li>The <code class="highlighter-rouge">Managed</code> version doesn’t have any of this, so it’s just 0.</li>
<li>In the <code class="highlighter-rouge">Unmanaged</code> version, we see a consistent value of 9.x: Given that we special case 1,2,3 in the code and 16 is the upper limit, 9.x seems like a reasonable outcome here.</li>
</ul>
</li>
<li>The number of branch operations that were user-data dependent. This one may be hard to relate to at first, but it will become apparent from the 3<sup>rd</sup> post onwards why this is a crucial number to track. For now, a definition: this statistic counts <em>how many</em> times our code did an <code class="highlighter-rouge">if</code> or a <code class="highlighter-rouge">while</code> or any other branch operation <em>whose condition depended on unsorted user-supplied data</em>!
<ul>
<li>The numbers boggle the mind; this is the first time we get to show how much work is involved.</li>
<li>What’s even more surprising is that for the <code class="highlighter-rouge">Unmanaged</code> variant the number is even higher (well, only surprising if you don’t know anything about how <code class="highlighter-rouge">InsertionSort</code> works…), and yet this version seems to run faster… I have an entire post dedicated to just this part of the problem in this series, so let’s make note of this for now; already we see peculiar things.</li>
</ul>
</li>
<li>Finally, I’ve also included a statistic here that shows what percent of those data-based branches came from small-sort operations. Again, this was 0% for the <code class="highlighter-rouge">Managed</code> variant, but we can see that a large part of those compares are now coming from those last few levels of recursion that were converted to <code class="highlighter-rouge">InsertionSort</code>…</li>
</ul>
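<p>To make the “data-based branches” statistic more tangible, here is a sketch of how such a counter can be wired into <code class="highlighter-rouge">InsertionSort</code> (the counter placement is my own illustration; the actual <code class="highlighter-rouge">Stats</code> build works along similar lines): every evaluation of the inner-loop comparison inspects unsorted user data, so it counts as one data-based branch:</p>

```csharp
using System;

public static class CountingInsertionSort
{
    // Sorts a[] in place and returns how many branches were taken whose
    // condition depended on the (unsorted) data itself.
    public static long Sort(int[] a)
    {
        long dataBasedBranches = 0;
        for (var i = 1; i < a.Length; i++) {
            var t = a[i];
            var j = i - 1;
            while (j >= 0) {                 // index check: not data-based
                dataBasedBranches++;         // comparing a[j] vs. t: data-based!
                if (a[j] <= t) break;
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = t;
        }
        return dataBasedBranches;
    }
}
```

<p>For an already-sorted input of \(n\) elements this reports exactly \(n-1\) data-based branches; for a reverse-sorted one, \(n(n-1)/2\) — which is precisely why the <code class="highlighter-rouge">Unmanaged</code> variant’s branch count climbs once <code class="highlighter-rouge">InsertionSort</code> enters the picture.</p>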
<p>Some of these statistics will remain pretty much the same for the rest of this series, regardless of what we do next in future versions, while others radically change; We’ll observe and make use of these as key inputs in helping us to figure out how/why something worked, or not!</p>
</div>
<h2 id="all-warmed-up">All Warmed Up?</h2>
<p>We’ve spent quite some time polishing our foundations concerning QuickSort and <code class="highlighter-rouge">Array.Sort</code>. I know lengthy introductions are somewhat dull, but I think the time spent on this post will pay off with dividends when we next encounter our actual implementation in the 3<sup>rd</sup> post and later on. This might also be the time to confess that just doing the leg-work to provide this refresher helped me come up with at least one super non-trivial optimization, which I think I’ll keep the lid on all the way until the 6<sup>th</sup> and final post. So never underestimate the importance of “just” covering the basics.</p>
<p>Before we write vectorized code, we need to pick up some know-how specific to vectorized intrinsics and introduce a few select intrinsics we’ll be using, so this is an excellent time to break off this post, grab a fresh cup of coffee and head to the <a href="/2020-01-29/this-goes-to-eleven-pt2">next post</a>.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:0" role="doc-endnote">
<p>Which is increasingly taking <a href="https://github.com/damageboy/analyze-spec-benchmarks#integer">more and more</a> time to happen, due to the end of Dennard scaling and the slow-down of Moore’s law… <a href="#fnref:0" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Since CoreCLR 3.0 was released, a <a href="https://github.com/dotnet/coreclr/pull/27700">PR</a> providing a span-based version of this has been merged into the 5.0 master branch, but I’ll ignore this for the time being as it doesn’t seem to matter in this context. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>You can grab your microcode signature in one of the following ways: On Windows, the easiest way is to install and run the excellent HWiNFO64 application, which will show you the microcode signature. On Linux, <code class="highlighter-rouge">grep -i microcode /proc/cpuinfo</code> does the trick, and on macOS, <code class="highlighter-rouge">sysctl -a | grep -i microcode</code> will get the job done. Unfortunately you’ll have to consult your specific CPU model to figure out the before/after signature, and I can’t help you there, except to point out that the microcode update in question came out on November 13<sup>th</sup> and is about mitigating the JCC errata. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>I came, I Tried, <a href="https://github.com/dotnet/runtime/pull/33152#issuecomment-596405021">I Folded</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:1" role="doc-endnote">
<p>Believe it or not, I pretty much wrote every other version featured in this series <em>before</em> I wrote the <code class="highlighter-rouge">Unmanaged</code> one, so I really was quite surprised that it ended up being slightly faster than <code class="highlighter-rouge">Array.Sort</code> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>I have a special build configuration called <code class="highlighter-rouge">Stats</code> which compiles in a bunch of calls to various conditionally compiled functions that bump assorted counters and, finally, dump it all to JSON, which eventually makes its way into these posts (if you dig deep, you can get the actual JSON files :) <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1 id="unsafe-bounds-checking">Unsafe Bounds Checking</h1>
<p>I thought I’d write a really short post on a nifty technique/trick I came up with while trying to debug my own horrible unsafe code for vectorized sorting. I don’t think I’ve seen it used/shown before, and it really saved me tons of time.
It all boils down to a combination of:</p>
<ul>
<li><code class="highlighter-rouge">using static</code></li>
<li><code class="highlighter-rouge">#if DEBUG</code></li>
<li>Local functions in C#</li>
</ul>
<p>Imagine this is our starting point:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
</pre></td><td class="rouge-code"><pre><span class="k">unsafe</span> <span class="k">void</span> <span class="nf">GenerateRollingSum</span><span class="p">(</span><span class="kt">int</span> <span class="p">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">int</span> <span class="n">lengthInVectors</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// This get's folded as a constant by the</span>
<span class="c1">// JIT and I hate typing this all over the place</span>
<span class="kt">var</span> <span class="n">N</span> <span class="p">=</span> <span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">>.</span><span class="n">Count</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">acc</span> <span class="p">=</span> <span class="n">Avx</span><span class="p">.</span><span class="nf">LoadDquVector256</span><span class="p">(</span><span class="n">p</span><span class="p">);</span>
<span class="kt">var</span> <span class="n">pEnd</span> <span class="p">=</span> <span class="n">p</span> <span class="p">+</span> <span class="n">lengthInVectors</span> <span class="p">*</span> <span class="n">N</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">pRead</span> <span class="p">=</span> <span class="n">p</span> <span class="p">+</span> <span class="m">1</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">pWrite</span> <span class="p">=</span> <span class="n">p</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">p</span> <span class="p"><</span> <span class="n">pEnd</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">data</span> <span class="p">=</span> <span class="n">Avx</span><span class="p">.</span><span class="nf">LoadDquVector256</span><span class="p">(</span><span class="n">p</span><span class="p">);</span>
<span class="n">acc</span> <span class="p">=</span> <span class="n">Avx</span><span class="p">.</span><span class="nf">Add</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">acc</span><span class="p">);</span>
<span class="n">Avx</span><span class="p">.</span><span class="nf">Store</span><span class="p">(</span><span class="n">pWrite</span><span class="p">,</span> <span class="n">acc</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>I’m providing here a very <strong>wrong</strong> implementation, obviously, for the purpose of this post. Keen eyes will immediately notice that this method is going to make us very unhappy, as it is writing partially into the same memory it is about to read in the next iteration. It’s definitely not going to work. But at the same time, it’s important to note that it isn’t going to crash or generate any exception; it just won’t do its job.</p>
<p>Unfortunately for me, I’ve managed to write many variations of this bug, so I had to come up with something that would negate my in-built idiocy. Here’s what I normally write with code like this these days:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
</pre></td><td class="rouge-code"><pre><span class="c1">// We import all the static methods in Avx</span>
<span class="k">using</span> <span class="nn">static</span> <span class="n">System</span><span class="p">.</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Intrinsics</span><span class="p">.</span><span class="n">X86</span><span class="p">.</span><span class="n">Avx</span><span class="p">;</span>
<span class="k">unsafe</span> <span class="k">void</span> <span class="nf">GenerateRollingSum</span><span class="p">(</span><span class="kt">int</span> <span class="p">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">int</span> <span class="n">lengthInVectors</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// This get's folded as a constant by the</span>
<span class="c1">// JIT and I hate typing this all over the place</span>
<span class="kt">var</span> <span class="n">N</span> <span class="p">=</span> <span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">>.</span><span class="n">Count</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">acc</span> <span class="p">=</span> <span class="nf">LoadDquVector256</span><span class="p">(</span><span class="n">p</span><span class="p">);</span>
<span class="kt">var</span> <span class="n">pEnd</span> <span class="p">=</span> <span class="n">p</span> <span class="p">+</span> <span class="n">lengthInVectors</span> <span class="p">*</span> <span class="n">N</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">pRead</span> <span class="p">=</span> <span class="n">p</span> <span class="p">+</span> <span class="m">1</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">pWrite</span> <span class="p">=</span> <span class="n">p</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">p</span> <span class="p"><</span> <span class="n">pEnd</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">data</span> <span class="p">=</span> <span class="nf">LoadDquVector256</span><span class="p">(</span><span class="n">p</span><span class="p">);</span>
<span class="n">acc</span> <span class="p">=</span> <span class="n">Avx</span><span class="p">.</span><span class="nf">Add</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">acc</span><span class="p">);</span>
<span class="nf">Store</span><span class="p">(</span><span class="n">pWrite</span><span class="p">,</span> <span class="n">acc</span><span class="p">);</span>
<span class="p">}</span>
<span class="cp">#if DEBUG
</span> <span class="c1">// "Hijack" LoadDquVector256 under DEBUG configuration</span>
<span class="c1">// and assert for various constraint violations</span>
<span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="nf">LoadDquVector256</span><span class="p">(</span><span class="kt">int</span> <span class="p">*</span><span class="n">ptr</span><span class="p">)</span> <span class="p">{</span>
<span class="n">Debug</span><span class="p">.</span><span class="nf">Assert</span><span class="p">((</span><span class="n">ptr</span> <span class="p">+</span> <span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">)</span> <span class="p"><</span> <span class="n">p</span> <span class="p">+</span> <span class="n">lengthInVectors</span> <span class="p">*</span> <span class="n">N</span><span class="p">,</span>
<span class="s">"Reading past end of array"</span><span class="p">);</span>
<span class="c1">// Finally call the real LoadDquVector256()</span>
<span class="k">return</span> <span class="n">Avx</span><span class="p">.</span><span class="nf">LoadDquVector256</span><span class="p">(</span><span class="n">ptr</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">// "Hijack" LoadDquVector256 under DEBUG configuration</span>
<span class="c1">// and assert for various constraint violations</span>
<span class="k">void</span> <span class="nf">Store</span><span class="p">(</span><span class="kt">int</span> <span class="p">*</span><span class="n">ptr</span><span class="p">,</span> <span class="n">Vector256</span><span class="p"><</span><span class="kt">int</span><span class="p">></span> <span class="n">data</span><span class="p">)</span> <span class="p">{</span>
<span class="n">Debug</span><span class="p">.</span><span class="nf">Assert</span><span class="p">((</span><span class="n">ptr</span> <span class="p">+</span> <span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">)</span> <span class="p"><</span> <span class="n">p</span> <span class="p">+</span> <span class="n">lengthInVectors</span> <span class="p">*</span> <span class="n">N</span><span class="p">,</span>
<span class="s">"Writing past end of array"</span><span class="p">);</span>
<span class="n">Debug</span><span class="p">.</span><span class="nf">Assert</span><span class="p">((</span><span class="n">ptr</span> <span class="p">+</span> <span class="n">N</span> <span class="p">-</span> <span class="m">1</span><span class="p">)</span> <span class="p"><</span> <span class="n">pRead</span><span class="p">,</span>
<span class="s">"Writing will overwrite unread data"</span><span class="p">);</span>
<span class="c1">// Finally call the real Store()</span>
<span class="n">Avx</span><span class="p">.</span><span class="nf">Store</span><span class="p">(</span><span class="n">ptr</span><span class="p">,</span> <span class="n">data</span><span class="p">);</span>
<span class="p">}</span>
<span class="cp">#endif
</span><span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>As you can see, this is a nifty way to abuse <code class="highlighter-rouge">using static</code> statements with local functions. We override the <code class="highlighter-rouge">LoadDquVector256()</code> / <code class="highlighter-rouge">Store()</code> intrinsics only in <code class="highlighter-rouge">DEBUG</code> builds, so they incur no performance hit in <code class="highlighter-rouge">RELEASE</code>. And because they are defined as local functions, they can perform some in-depth <code class="highlighter-rouge">Debug.Assert()</code>ing based on the internal state of the enclosing function; without defining them as local functions we would not be able to do so…</p>
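<p>The same idea carries over to other languages. Here is a minimal scalar C++ sketch of it (all names below are my own illustration, not from any real codebase): in debug builds, loads and stores go through checked wrappers that <code class="highlighter-rouge">assert()</code> the same constraints, and defining <code class="highlighter-rouge">NDEBUG</code> compiles them down to the raw accesses:</p>

```cpp
#include <cassert>

// A scalar stand-in for Vector256<int>.Count
constexpr int N = 8;
// Next position the loop has yet to read; used by the write-side check
static const int* g_read_cursor = nullptr;

// Debug-only checked wrappers: with NDEBUG defined, the asserts vanish,
// mirroring how the C# local functions only exist under DEBUG.
static void load_chunk(const int* p, const int* end, int* out) {
    assert(p + N <= end && "Reading past end of array");
    for (int i = 0; i < N; i++) out[i] = p[i];
}

static void store_chunk(int* p, const int* end, const int* v) {
    assert(p + N <= end && "Writing past end of array");
    assert(p + N <= g_read_cursor && "Writing would overwrite unread data");
    for (int i = 0; i < N; i++) p[i] = v[i];
}

// In-place rolling sum over N-int chunks: chunk k becomes the element-wise
// sum of input chunks 0..k. Each chunk is read before it is overwritten,
// so the checked wrappers stay quiet.
static void rolling_sum(int* p, int length_in_chunks) {
    const int* end = p + length_in_chunks * N;
    int acc[N] = {0};
    for (int k = 0; k < length_in_chunks; k++) {
        int data[N];
        g_read_cursor = p + (k + 1) * N; // everything before this is consumed
        load_chunk(p + k * N, end, data);
        for (int i = 0; i < N; i++) acc[i] += data[i];
        store_chunk(p + k * N, end, acc);
    }
}
```

<p>The wrappers cost nothing in release builds, yet in a debug run they would trip on the first store that clobbers data a later iteration still needs to read.</p>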
<p>This isn’t useful exclusively for vectorized code; it applies to any potentially tricky code. I hope you find it useful! I don’t think I’ve seen this approach in the wild before.</p>damageboydans@houmus.orghttps://bits.houmus.orgUnsafe Bounds CheckingHacking CoreCLR on Linux with CLion2019-05-01T05:26:28+00:002019-05-01T05:26:28+00:00https://bits.houmus.org/2019-05-01/hacking-coreclr-on-linux-with-clion<h2 id="whatwhy">What/Why?</h2>
<p>Being a regular Linux user whenever I can, I was looking for a decent setup that would let me grok, and then hack on, CoreCLR’s C++ code.</p>
<p>CoreCLR, namely the C++ code that implements the runtime (GC, JIT and more) is a BIG project, and trying to peel through its layers for the first time is no easy task for sure. While there are many great resources available for developers that want to read about the runtime such as the <a href="https://github.com/dotnet/coreclr/blob/master/Documentation/botr/README.md">BotR</a>, for me, there really is no replacement for reading the code and trying to reason about what/how it gets stuff done, preferably during a debug session, with a very focused task/inquiry at hand. For this reason, I really wanted a proper IDE for the huge swaths of C++ code, and I couldn’t think of anything else but <a href="https://www.jetbrains.com/clion/">JetBrains’ own CLion IDE</a> under Linux (and macOS, which I’m not a user of).<br />
With my final setup, I really can do non-trivial navigation on the code base such as:</p>
<video width="900" controls="">
<source src="../assets/images/clion-coreclr.webm" type="video/webm" />
</video>
<h2 id="loading-coreclr-with-clion-navigation">Loading CoreCLR with CLion Navigation</h2>
<p>CoreCLR is a beast of a project, and getting it to parse properly under CLion requires some non-trivial setup, so I thought I’d document my process here for other people to see and maybe even improve upon…</p>
<p>Generally speaking, all the puzzle pieces should fit, since the CoreCLR build-system is 95% made up of running <code class="highlighter-rouge">cmake</code> to generate standard GNU makefiles and then building the whole thing with said makefiles; the other 5% is scripts wrapping the <code class="highlighter-rouge">cmake</code> build-system. At the same time, CLion builds upon <code class="highlighter-rouge">cmake</code> to bootstrap its own internal project representation, <em>provided</em> that it can invoke <code class="highlighter-rouge">cmake</code> just like the normal build would.</p>
<p>Here’s what I did to get everything working:</p>
<ol>
<li>First, we’ll clone and perform a single build of CoreCLR by <a href="https://github.com/dotnet/coreclr/blob/master/Documentation/building/linux-instructions.md#environment">following the instructions</a>. What I did on my Ubuntu machine consisted of:
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="nv">$ </span><span class="nb">sudo </span>apt <span class="nb">install </span>cmake llvm-3.9 clang-3.9 lldb-3.9 liblldb-3.9-dev libunwind8 libunwind8-dev gettext libicu-dev liblttng-ust-dev libcurl4-openssl-dev libssl-dev libnuma-dev libkrb5-dev
<span class="nv">$ </span>./build.sh checked
</pre></td></tr></tbody></table></code></pre></div> </div>
</li>
<li>Once the build is over, you should have everything under the <code class="highlighter-rouge">bin/Product/Linux.x64.Checked</code> folder, like so:
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="nv">$ </span><span class="nb">ls </span>bin/Product/Linux.x64.Checked
bin libcoreclr.so netcoreapp2.0
coreconsole libcoreclrtraceptprovider.so PDB
corerun libdbgshim.so sosdocsunix.txt
createdump libmscordaccore.so SOS.NETCore.dll
crossgen libmscordbi.so SOS.NETCore.pdb
gcinfo libprotononjit.so superpmi
IL libsosplugin.so System.Globalization.Native.a
ilasm libsos.so System.Globalization.Native.so
ildasm libsuperpmi-shim-collector.so System.Private.CoreLib.dll
inc libsuperpmi-shim-counter.so System.Private.CoreLib.ni.<span class="o">{</span>fe21e59b-7903-49b4-b2d3-67de152c1d7d<span class="o">}</span>.map
lib libsuperpmi-shim-simple.so System.Private.CoreLib.xml
libclrgc.so Loader
libclrjit.so mcs
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>Now that an initial build has completed, we can be sure that the scripts which generate a few headers, essential for the rest of the compilation process, have run, and that CLion will be able to find all the necessary source code once we teach it how to…</p>
</li>
<li>
<p>CLion needs to invoke <code class="highlighter-rouge">cmake</code> with the same arguments that the build scripts use. To sniff out the <code class="highlighter-rouge">cmake</code> command line we’ll use an *nix old-timer’s trick to generate traces for the <code class="highlighter-rouge">build.sh</code> run: <code class="highlighter-rouge">bash -x</code>. Unfortunately, nothing is ever so simple in life, and CoreCLR’s <code class="highlighter-rouge">build.sh</code> script doesn’t directly invoke <code class="highlighter-rouge">cmake</code>, so we will need to make this <code class="highlighter-rouge">-x</code> parameter sticky, or recursive. I found no better way to do this than the following somewhat convoluted procedure:<br />
First, we generate a wrapper script for <code class="highlighter-rouge">build.sh</code>, which we’ll call <code class="highlighter-rouge">build-wrapper.sh</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre><span class="nb">echo</span> <span class="s2">"export SHELLOPTS && ./build.sh </span><span class="se">\$</span><span class="s2">@"</span> <span class="o">></span> build-wrapper.sh
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>After we have our wrapper in place, we run it instead of <code class="highlighter-rouge">build.sh</code> like this:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="nv">$ </span>bash <span class="nt">-x</span> ./build-wrapper.sh checked
... <span class="c"># omitted</span>
+ /usr/bin/cmake <span class="nt">-G</span> <span class="s1">'Unix Makefiles'</span> <span class="nt">-DCMAKE_BUILD_TYPE</span><span class="o">=</span>CHECKED <span class="nt">-DCMAKE_INSTALL_PREFIX</span><span class="o">=</span>/home/dmg/projects/public/coreclr/bin/Product/Linux.x64.Checked <span class="nt">-DCMAKE_USER_MAKE_RULES_OVERRIDE</span><span class="o">=</span> <span class="nt">-DCLR_CMAKE_PGO_INSTRUMENT</span><span class="o">=</span>0 <span class="nt">-DCLR_CMAKE_OPTDATA_PATH</span><span class="o">=</span>/home/dmg/.nuget/packages/optimization.linux-x64.pgo.coreclr/99.99.99-master-20190716.1 <span class="nt">-DCLR_CMAKE_PGO_OPTIMIZE</span><span class="o">=</span>1 <span class="nt">-S</span> /home/dmg/projects/public/coreclr <span class="nt">-B</span> /home/dmg/projects/public/coreclr/bin/obj/Linux.x64.Checked
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>Boom! We’ve hit the jackpot. For folks following along who are feeling a bit shaky, I’ve isolated the exact part we’re after below:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre><span class="nt">-G</span> <span class="s1">'Unix Makefiles'</span> <span class="nt">-DCMAKE_BUILD_TYPE</span><span class="o">=</span>CHECKED <span class="nt">-DCMAKE_INSTALL_PREFIX</span><span class="o">=</span>/home/dmg/projects/public/coreclr/bin/Product/Linux.x64.Checked <span class="nt">-DCMAKE_USER_MAKE_RULES_OVERRIDE</span><span class="o">=</span> <span class="nt">-DCLR_CMAKE_PGO_INSTRUMENT</span><span class="o">=</span>0 <span class="nt">-DCLR_CMAKE_OPTDATA_PATH</span><span class="o">=</span>/home/dmg/.nuget/packages/optimization.linux-x64.pgo.coreclr/99.99.99-master-20190716.1 <span class="nt">-DCLR_CMAKE_PGO_OPTIMIZE</span><span class="o">=</span>1 <span class="nt">-S</span> /home/dmg/projects/public/coreclr <span class="nt">-B</span> /home/dmg/projects/public/coreclr/bin/obj/Linux.x64.Checked
</pre></td></tr></tbody></table></code></pre></div> </div>
</li>
<li>
<p>The “hard” part is over. It’s a series of boring clicks from here on. It’s time to open up CLion and get this show on the road:
We’ll start by defining a clang-3.9-based toolchain, since CLion defaults to the gcc toolchain on Linux, while CoreCLR needs clang-3.9 to build itself:<img src="/assets/images/clion-toolchains-coreclr.png" alt="clion-toolchains-coreclr" /></p>
</li>
<li>
<p>With a toolchain setup, we need to tell <code class="highlighter-rouge">cmake</code> about our build configuration, so we set it up like so:
<img src="/assets/images/clion-cmake-coreclr.png" alt="clion-cmake-coreclr" /></p>
<p>I’ve highlighted all the text boxes you’ll need to set. I’ll go over the less trivial stuff:</p>
<ul>
<li>The command line option we just set aside in (3) goes into the <code class="highlighter-rouge">CMake options</code> field.<br />
Unfortunately CLion doesn’t like single quotes (weird…), so I’ve had to change the <code class="highlighter-rouge">-G 'Unix Makefiles'</code> into <code class="highlighter-rouge">-G "Unix Makefiles"</code> (notice the use of double quotes).</li>
<li>It would be wise to share the same build folder that our initial command-line build used; moreover, we might end up going back and forth between CLion and the command line, so I override the “Generation Path” setting with the value <code class="highlighter-rouge">bin/obj/Linux.x64.Checked</code>. This is again extracted from the same command line we set aside before; you’ll find it (in my case) towards the end, specified right after the <code class="highlighter-rouge">-B</code> switch.</li>
<li>For the build options, I’ve specified <code class="highlighter-rouge">-j 8</code>. This option controls how many parallel builds (compilers) are launched during the build process. A good default is to set it to 2x the number of physical cores your machine has, so in my case that means using <code class="highlighter-rouge">-j 8</code>.</li>
</ul>
</li>
<li>That’s it! Let CLion do its thing while grinding your machine to a halt, and once it’s done you can start navigating and building the CoreCLR project like a first-class citizen of the civilized world :)</li>
</ol>
<h2 id="debugging-coreclr-from-clion">Debugging CoreCLR from CLion</h2>
<p>Once CLion understands the CoreCLR project structure, we can take it up a notch and try to debug CoreCLR by launching “something” with a breakpoint set.</p>
<p>Let’s try to debug the JIT as an example for a useful scenario.</p>
<ol>
<li>First we need a console application:
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
</pre></td><td class="rouge-code"><pre> <span class="nv">$ </span><span class="nb">cd</span> /tmp/
<span class="nv">$ </span>dotnet new console <span class="nt">-n</span> clion_dbg_sample
The template <span class="s2">"Console Application"</span> was created successfully.
Processing post-creation actions...
Running <span class="s1">'dotnet restore'</span> on clion_dbg_sample/clion_dbg_sample.csproj...
Restore completed <span class="k">in </span>54.39 ms <span class="k">for</span> /tmp/clion_dbg_sample/clion_dbg_sample.csproj.
Restore succeeded.
<span class="nv">$ </span><span class="nb">cd </span>clion_dbg_sample
<span class="nv">$ </span>dotnet publish <span class="nt">-c</span> release <span class="nt">-o</span> linux-x64 <span class="nt">-r</span> linux-x64
Microsoft <span class="o">(</span>R<span class="o">)</span> Build Engine version 16.3.0+0f4c62fea <span class="k">for</span> .NET Core
Copyright <span class="o">(</span>C<span class="o">)</span> Microsoft Corporation. All rights reserved.
Restore completed <span class="k">in </span>66.26 ms <span class="k">for</span> /tmp/clion_dbg_sample/clion_dbg_sample.csproj.
clion_dbg_sample -> /tmp/clion_dbg_sample/bin/release/netcoreapp3.0/linux-x64/clion_dbg_sample.dll
clion_dbg_sample -> /tmp/clion_dbg_sample/linux-x64/
</pre></td></tr></tbody></table></code></pre></div> </div>
<p>Now we have a console application published in some folder, in my case it’s <code class="highlighter-rouge">/tmp/clion_dbg_sample/linux-x64</code></p>
</li>
<li>
<p>Next we will setup a new configuration under CLion:<br />
<img src="/assets/images/clion-edit-configurations-coreclr.png" alt="" /></p>
</li>
<li>
<p>Now we define a <strong>new</strong> configuration:<br />
<img src="/assets/images/clion-select-executable-coreclr.png" alt="" />
We provide a name (I’ve used the same name as my test program: <code class="highlighter-rouge">clion_dbg_sample</code>), select “All targets” as the Target, and under Executable choose “Select other…” to provide a custom path to <code class="highlighter-rouge">corerun</code>. The reason behind this is that we need to run <code class="highlighter-rouge">corerun</code> from a directory that actually contains the entire product: JIT, GC and everything else.</p>
</li>
<li>
<p>The path we provide is to the <code class="highlighter-rouge">corerun</code> executable that resides in the <code class="highlighter-rouge">bin/Product/Linux.x64.Checked</code> folder:
<img src="/assets/images/clion-custom-executable-coreclr.png" alt="" /></p>
</li>
<li>
<p>Finally we provide our sample project from before to the <code class="highlighter-rouge">corerun</code> executable. This is what my final configuration looks like:<br />
<img src="/assets/images/clion-sample-configuration-final-coreclr.png" alt="" /></p>
</li>
<li>
<p>It’s time to set a breakpoint and launch. As a generic sample, I’ll navigate to <code class="highlighter-rouge">compiler.cpp</code> and find the <code class="highlighter-rouge">jitNativeCode</code> method. It’s pretty much one of the top-level functions in the JIT, and therefore a good candidate for us. If we set a breakpoint in that method and launch our newly created configuration, we should hit it in no time:
<img src="/assets/images/clion-debug-jit-coreclr.png" alt="" /></p>
</li>
<li>We’re done! If you really want to figure out what to do next, it’s probably a good time to hit the <a href="https://github.com/dotnet/coreclr/blob/master/Documentation/botr/README.md">BotR</a>, namely the <a href="https://github.com/dotnet/coreclr/blob/master/Documentation/botr/ryujit-overview.md">RyuJit Overview</a> and <a href="https://github.com/dotnet/coreclr/blob/master/Documentation/botr/ryujit-tutorial.md">RyuJit Tutorial</a> pages that contain a more detailed overview of the JIT. Alternatively, if you’re a “get your hands dirty” sort of person, you can also do some warm-up exercises for your fingers and start hitting that step-into keyboard shortcut. You’re debugging the JIT as we speak!</li>
</ol>
<p>I hope this ends up helping someone who wants to get started digging into the JIT somewhere other than Windows. I also personally have a strong preference for CLion, as I really think it’s a much faster and more powerful option than all the other stuff I’ve tried thus far. At any rate, it’s the only viable option for Linux/macOS people.</p>
<p>Have fun! Let me know on <a href="https://twitter.com/damageboy">twitter</a> if you’re encountering any difficulties or you think I can make anything clearer…</p>damageboydans@houmus.orghttps://bits.houmus.orgWhat/Why?.NET Core 3.0 Intrinsics in Real Life - (Part 3/3)2018-08-20T15:26:28+00:002018-08-20T15:26:28+00:00https://bits.houmus.org/2018-08-20/netcoreapp3.0-intrinsics-in-real-life-pt3<p>As I’ve described in <a href="/2018-08-18/netcoreapp3.0-intrinsics-in-real-life-pt1">part 1</a> & <a href="/2018-08-19/netcoreapp3.0-intrinsics-in-real-life-pt2">part 2</a> of this series, I’ve recently overhauled an internal data structure we use at Work<sup>®</sup> to start using <a href="https://github.com/dotnet/designs/blob/master/accepted/platform-intrinsics.md">platform dependent intrinsics</a>.</p>
<p>If you’ve not read the previous posts, I suggest you do so, as a lot of what is discussed here relies on the code and issues presented there…</p>
<p>As a reminder, this series is made in 3 parts:</p>
<ul>
<li><a href="/2018-08-18/netcoreapp3.0-intrinsics-in-real-life-pt1">The data-structure/operation that we’ll optimize and basic usage of intrinsics</a>.</li>
<li><a href="/2018-08-19/netcoreapp3.0-intrinsics-in-real-life-pt2">Using intrinsics more effectively</a></li>
<li>The C++ version(s) of the corresponding C# code, and what I learned from them (this post).</li>
</ul>
<p>All of the code (C# & C++) is published under the <a href="https://github.com/damageboy/bitgoo">bitgoo github repo</a>.</p>
<h2 id="c-vs-c">C++ vs. C#</h2>
<p>I think I’ve mentioned this somewhere before: I started working on better versions of my bitmap search function way before CoreCLR intrinsics were even imagined. This led me to start to tinkering with C++ code where I tried out most of my ideas. When CoreCLR 3.0 became real enough, I ported the C++ code back to C# (which surprisingly consisted of a couple of search and replace operations, no more…).</p>
<p>As such, having two close implementations begs performing a head-to-head comparison.
After some additional work, I had basic <a href="https://github.com/google/benchmark">google benchmark</a> and <a href="https://github.com/google/googletest">google test</a> suites up and running<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">1</a></sup><br />
I’ll cut right to the chase and present a relative comparison between C++ and C# for the last version we ran in our previous post, The C# method is <code class="highlighter-rouge">POPCNTAndBMI2Unrolled</code> and the C++ one is <code class="highlighter-rouge">POPCNTAndBMI2Unrolled2</code>:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>N</th>
<th>C# Mean (ns)</th>
<th>C++ Mean (ns)</th>
<th>C++/C# Ratio</th>
</tr>
</thead>
<tbody>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>1</td>
<td>2.249</td>
<td>3.338</td>
<td>148.42%</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>4</td>
<td>10.904</td>
<td>11.037</td>
<td>101.22%</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>16</td>
<td>50.368</td>
<td>43.786</td>
<td>86.93%</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>64</td>
<td>208.272</td>
<td>202.366</td>
<td>97.16%</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>256</td>
<td>1,580.026</td>
<td>1,493.020</td>
<td>94.49%</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>1024</td>
<td>21,282.905</td>
<td>11,520.900</td>
<td>54.13%</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>4096</td>
<td>255,186.977</td>
<td>133,976.543</td>
<td>52.50%</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>16384</td>
<td>3,730,420.068</td>
<td>1,754,421.485</td>
<td>47.03%</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>65536</td>
<td>56,939,817.593</td>
<td>26,613,731.568</td>
<td>46.74%</td>
</tr>
</tbody>
</table>
<p>There are a few things that stand out from this comparison:</p>
<ul>
<li>The percentage differences in the low bit counts (1,4) should be ignored, they are minuscule in absolute terms and within the margin of error.</li>
<li>C# is doing pretty well up to 256 bits, when we <strong>don’t</strong> execute the unrolled loop; it’s basically neck and neck with C++.</li>
<li>Sweet mercy, what is going on from 1024 bits and onwards, inside the unrolled loop? Why is there such a big difference in what is a relatively optimized (and equivalent) piece of code between the two languages?</li>
</ul>
<p>I’ll cut to the chase and answer this last question directly, then proceed to explain the relevant underlying basics (<em>tl;dr</em>: they’re not so basic) of CPU pipelining and register renaming, so that the explanation sticks for readers who are not familiar with those terms/concepts.</p>
<p>The bottom line is: there is a bug in the CPU! There is a well known (even if very cryptic) <a href="https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf">erratum</a> about this bug, and compiler developers are more or less generally aware of this issue and have been <em>working around</em> it for the better part of the last 5 years.</p>
<h3 id="false-dependencies">False Dependencies</h3>
<p>So what is this mysterious CPU bug all about? The JIT was producing what should be, according to the processor documentation, pretty good code:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="rouge-code"><pre><span class="nl">BEGIN_POPCNT_UROLLED_LOOP:</span>
<span class="nf">popcnt</span> <span class="nb">rsi</span><span class="p">,</span> <span class="kt">qword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rcx</span><span class="p">]</span>
<span class="nf">sub</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nb">rsi</span>
<span class="nf">popcnt</span> <span class="nb">rsi</span><span class="p">,</span> <span class="kt">qword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rcx</span><span class="o">+</span><span class="mi">8</span><span class="p">]</span>
<span class="nf">sub</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nb">rsi</span>
<span class="nf">popcnt</span> <span class="nb">rsi</span><span class="p">,</span> <span class="kt">qword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rcx</span><span class="o">+</span><span class="mi">16</span><span class="p">]</span>
<span class="nf">sub</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nb">rsi</span>
<span class="nf">popcnt</span> <span class="nb">rsi</span><span class="p">,</span> <span class="kt">qword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rcx</span><span class="o">+</span><span class="mi">24</span><span class="p">]</span>
<span class="nf">sub</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nb">rsi</span>
<span class="nf">add</span> <span class="nb">rcx</span><span class="p">,</span> <span class="mi">32</span>
<span class="nf">cmp</span> <span class="nb">rdx</span><span class="p">,</span> <span class="mi">256</span>
<span class="nf">jge</span> <span class="nv">SHORT</span> <span class="nv">BEGIN_POPCNT_UROLLED_LOOP</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>What we see above is an excerpt from <code class="highlighter-rouge">POPCNTAndBMI2Unrolled</code> method’s assembly code, and more specifically the unrolled loop that does 4 <code class="highlighter-rouge">POPCNT</code> instructions in succession.</p>
<p>Even if you are not an assembly guru, it’s pretty clear we have 4 pairs of <code class="highlighter-rouge">POPCNT</code> + <code class="highlighter-rouge">SUB</code> instructions, where:</p>
<ul>
<li>Each <code class="highlighter-rouge">POPCNT</code> instruction is <strong>reading</strong> from successive memory addresses and <strong>writing</strong> their result temporarily into a register <em>named</em> <code class="highlighter-rouge">rsi</code>.</li>
<li>This temporary value is then subtracted using <code class="highlighter-rouge">SUB</code> from another register which represents our good old C# variable <code class="highlighter-rouge">n</code> (the target-bit count).</li>
</ul>
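<p>For readers who prefer C over assembly, here is a rough C++ rendition of that unrolled loop. This is my own sketch, not the bitgoo source: the function name is mine, GCC/Clang’s <code class="highlighter-rouge">__builtin_popcountll</code> stands in for the raw <code class="highlighter-rouge">POPCNT</code> intrinsic, and the bottom-tested loop from the excerpt is folded into a plain <code class="highlighter-rouge">while</code>:</p>

```cpp
#include <cstdint>

// Mirrors the unrolled loop above: each iteration consumes four 64-bit
// words, subtracting their set-bit counts from n (the remaining target-bit
// count), for as long as n >= 256 -- the cmp rdx, 256 / jge pair.
static const uint64_t* skip_words(const uint64_t* p, int64_t& n) {
    while (n >= 256) {
        n -= __builtin_popcountll(p[0]); // popcnt rsi, [rcx]    / sub
        n -= __builtin_popcountll(p[1]); // popcnt rsi, [rcx+8]  / sub
        n -= __builtin_popcountll(p[2]); // popcnt rsi, [rcx+16] / sub
        n -= __builtin_popcountll(p[3]); // popcnt rsi, [rcx+24] / sub
        p += 4;                          // add rcx, 32 (bytes)
    }
    return p;
}
```

<p>In principle, the four popcounts in an iteration are independent of one another, which is exactly the parallelism the false dependency described below ends up destroying.</p>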
<p>The <em>high-level</em> explanation of the bug goes like this:</p>
<ol>
<li>The CPU <em>should</em> have <strong>detected</strong> that each <code class="highlighter-rouge">POPCNT</code> + <code class="highlighter-rouge">SUB</code> instruction <em>pair</em> is effectively <em>independent</em> of the previous pair (inside our unrolled loop and <em>between</em> the loop’s iterations). In other words: although all 4 pairs are using the same destination register (<code class="highlighter-rouge">rsi</code>), each such pair is really not dependent on the previous value of <code class="highlighter-rouge">rsi</code>.</li>
<li>This dependency analysis, performed by the CPU, <em>should</em> have <em>enabled</em> it to use an internal optimization called register-renaming (more on that later).</li>
<li><em>Had</em> register renaming been triggered the CPU could have processed our <code class="highlighter-rouge">POPCNT</code> instructions with a higher degree of parallelism: In other words, our CPU, would run a few <code class="highlighter-rouge">POPCNT</code> instructions in <strong>parallel</strong> at any given moment. This would lead to better perf or better IPC (Instruction-Per-Cycle ratio).</li>
<li>In reality, the bug is causing the CPU to delay the processing of each such pair of instructions for a few cycles, per pair, introducing a lot of “garbage time” inside the CPU, where it’s stalling, doing less work than it should, leading to the slowdown we are seeing.</li>
</ol>
<p>Terminology wise, this sort of bug is called a <em>false-dependency</em> bug: In our case, the CPU wrongfully introduces a dependency between the different <code class="highlighter-rouge">POPCNT</code> instructions on their destination register, it <em>thinks</em> each <code class="highlighter-rouge">POPCNT</code> instruction is <strong>not only writing</strong> into <code class="highlighter-rouge">rsi</code> but <strong>also reading</strong> from it! (it does no such thing)<br />
With this false dependency in place, the CPU is prevented from using register renaming to execute the code more efficiently.</p>
<p>I will first focus on describing how compilers have been working around this, and afterward, I will describe in much more detail how the CPU employs register renaming to improve the throughput of the pipeline when the bug does not exist <em>or</em> is worked around.</p>
<h3 id="working-around-false-dependencies">Working Around False Dependencies</h3>
<p>As I’ve mentioned, this bug has been around for quite some time: It was reported <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011">somewhere in 2014</a> and is unfortunately still persistent to this day on most Intel CPUs, at least when it comes to the <code class="highlighter-rouge">POPCNT</code> instruction<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote">2</a></sup>.</p>
<p>Luckily, compiler developers have been able to work around this issue with relative ease by generating <em>extra code</em> that <strong><em>breaks</em></strong> the aforementioned false-dependency. As far as I can tell, the people who originally wrote the workarounds were Intel developers, so they had a very good understanding of the exact nature of this false-dependency. What they opted to do was make compilers introduce a two-byte instruction that clears the lower 32 bits of the <em>destination</em> register. In our case, this comes in the form of a <code class="highlighter-rouge">xor esi, esi</code> instruction. This is the shortest way (instruction length-wise) on x86 CPUs to zero out a register. This instruction is a well-known special case in the CPU, since it “knows” the future value of the destination register (0) without executing it, or knowing what its original value ever was. It appears the Intel engineers <em>knew</em> that the dependency is not on the entire 64-bit register (<code class="highlighter-rouge">rsi</code>) but only on the lower 32-bit part of that register (<code class="highlighter-rouge">esi</code><sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote">3</a></sup>) and took advantage of this understanding to introduce a two-byte fix into the instruction stream, which is relatively cheap.</p>
<p>The correct x86 assembly, generated by a fixed JIT or compiler should look like this:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
</pre></td><td class="rouge-code"><pre><span class="nl">BEGIN_POPCNT_UNROLLED_LOOP:</span>
<span class="nf">xor</span> <span class="nb">esi</span><span class="p">,</span> <span class="nb">esi</span> <span class="c1">; This breaks the dependency</span>
<span class="nf">popcnt</span> <span class="nb">rsi</span><span class="p">,</span> <span class="kt">qword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rcx</span><span class="p">]</span>
<span class="nf">sub</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nb">rsi</span>
<span class="nf">xor</span> <span class="nb">esi</span><span class="p">,</span> <span class="nb">esi</span> <span class="c1">; This breaks the dependency</span>
<span class="nf">popcnt</span> <span class="nb">rsi</span><span class="p">,</span> <span class="kt">qword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rcx</span><span class="o">+</span><span class="mi">8</span><span class="p">]</span>
<span class="nf">sub</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nb">rsi</span>
<span class="nf">xor</span> <span class="nb">esi</span><span class="p">,</span> <span class="nb">esi</span> <span class="c1">; This breaks the dependency</span>
<span class="nf">popcnt</span> <span class="nb">rsi</span><span class="p">,</span> <span class="kt">qword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rcx</span><span class="o">+</span><span class="mi">16</span><span class="p">]</span>
<span class="nf">sub</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nb">rsi</span>
<span class="nf">xor</span> <span class="nb">esi</span><span class="p">,</span> <span class="nb">esi</span> <span class="c1">; This breaks the dependency</span>
<span class="nf">popcnt</span> <span class="nb">rsi</span><span class="p">,</span> <span class="kt">qword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rcx</span><span class="o">+</span><span class="mi">24</span><span class="p">]</span>
<span class="nf">sub</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nb">rsi</span>
<span class="nf">add</span> <span class="nb">rcx</span><span class="p">,</span> <span class="mi">32</span>
<span class="nf">cmp</span> <span class="nb">rdx</span><span class="p">,</span> <span class="mi">256</span>
<span class="nf">jge</span> <span class="nv">SHORT</span> <span class="nv">BEGIN_POPCNT_UNROLLED_LOOP</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>This short piece of code is what gcc/clang would generate for <code class="highlighter-rouge">POPCNT</code> to work around the bug. Read out of context, it looks silly… it appears as if the compiler generated useless code, and you’ll find people publicly wondering about this on StackOverflow and other forums from time to time, or worse yet: trying to “fix” it. But on most in-production x86 CPUs (i.e. all the ones that suffer from this false-dependency) this code will substantially outperform the original code we saw above…</p>
<h2 id="update-coreclr-does-the-right-thing">Update: CoreCLR does the right thing</h2>
<p>I originally started writing part 3 after I found this issue with the JIT and submitted <a href="https://github.com/dotnet/coreclr/issues/19555">an issue</a>, thinking I would finish writing this post before anyone would fix the underlying issue. I was wrong on both counts: writing this post became an ever-growing challenge as I attempted to explain pipelines and register-renaming for the uninitiated (below), while <a href="https://github.com/dotnet/coreclr/pull/19772">Fei Peng fixed the issue</a> in a matter of two weeks (Thanks!).</p>
<p>What CoreCLR now does (since commit <a href="https://github.com/dotnet/coreclr/pull/19772/commits/6957b4f44f0917209df89499b7c4071bb0bc1941">6957b4f</a>) is <strong>always</strong> introduce the <code class="highlighter-rouge">xor dest, dest</code> workaround/dependency breaker for the 3 affected instructions: <code class="highlighter-rouge">LZCNT</code>, <code class="highlighter-rouge">TZCNT</code>, and <code class="highlighter-rouge">POPCNT</code>. This is <em>not the optimal</em> solution, since the JIT will introduce it both for CPUs afflicted with this bug (specific Intel CPUs) and for CPUs that <strong>don’t</strong> have it (all AMD CPUs and newer Intel CPUs).<br />
From the discussion, it’s clear that this path was chosen for simplicity’s sake: detecting the correct CPU family inside the JIT would require more infrastructure, raise questions about what the JIT should do in the case of AOT (Ahead Of Time) compilation, and require more testing infrastructure than is currently in place, while the two-byte fix is very cheap even on CPUs that are not affected.</p>
<p>Let’s see if this CoreCLR fix does anything to our unmodified piece of code…:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>N</th>
<th style="text-align: right">Mean (ns)</th>
<th style="text-align: right">Scaled To “buggy” CoreCLR</th>
<th style="text-align: right">Scaled to C++</th>
</tr>
</thead>
<tbody>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>1</td>
<td style="text-align: right">2.170</td>
<td style="text-align: right">0.96</td>
<td style="text-align: right">0.65</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>4</td>
<td style="text-align: right">11.910</td>
<td style="text-align: right">1.09</td>
<td style="text-align: right">1.08</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>16</td>
<td style="text-align: right">55.016</td>
<td style="text-align: right">1.09</td>
<td style="text-align: right">1.26</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>64</td>
<td style="text-align: right">225.156</td>
<td style="text-align: right">1.08</td>
<td style="text-align: right">1.11</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>256</td>
<td style="text-align: right">1,637.336</td>
<td style="text-align: right">1.04</td>
<td style="text-align: right">1.10</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>1024</td>
<td style="text-align: right">11,698.421</td>
<td style="text-align: right">0.55</td>
<td style="text-align: right">1.02</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>4096</td>
<td style="text-align: right">149,247.146</td>
<td style="text-align: right">0.58</td>
<td style="text-align: right">1.11</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>16384</td>
<td style="text-align: right">1,904,945.748</td>
<td style="text-align: right">0.51</td>
<td style="text-align: right">1.09</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>65536</td>
<td style="text-align: right">27,712,720.427</td>
<td style="text-align: right">0.49</td>
<td style="text-align: right">1.04</td>
</tr>
</tbody>
</table>
<p>It sure does! It appears that the unrolled version now runs roughly 85-101% faster for higher bit counts than it did with the previous, unfixed CoreCLR! When compared to C++, performance is now pretty close and consistent for the important parts of the benchmark. If you consider, for a moment, that we got here by making the JIT spill out <em>an extra, supposedly useless</em> instruction, the achievement becomes that much more impressive :). As before, <a href="https://gist.github.com/damageboy/0266018efbbf0a8478aa4d50de1c894f">here is the JITDump</a> with the newly fixed JIT in place.</p>
<p>Now, we can really see just how much of a profound effect this false-dependency had on performance. In theory, this might be the right time to finish this post, however, I couldn’t let it go without attempting to explain the underlying CPU internals of <em>how and why</em> the false-dependency had such a deep effect on performance. For readers well aware of how CPU pipelines operate and how they interact with the register renaming functionality on a modern super-scalar out-of-order CPU this is a good time to stop reading.<br />
What follows is me trying to explain how the CPU tries to handle loops of code effectively, and how register renaming plays an important role in that.</p>
<h2 id="the-lovehate-story-that-is-tight-loops-in-cpus">The love/hate story that is tight loops in CPUs</h2>
<p>It takes very little imagination to realize that CPUs spend a lot of their processing time executing loops (or, in this context: executing the same machine code multiple times). <br />
We need to remember that CPUs achieve remarkable throughput (e.g. instructions per cycles, or IPC) even though the table, in some ways, is set <strong>against</strong> them:</p>
<ul>
<li>A modern CPU will often have a dozen or so stages in their pipeline (examples: 14 in Skylake, 19 in AMD Ryzen)
<ul>
<li>This means a single instruction would take about 14 cycles on my CPU from start to finish if we were only executing that instruction and waiting for it to complete!</li>
</ul>
</li>
<li>The CPU attempts to handle multiple instructions in different stages of the pipeline, but it may become <em>stalled</em> (i.e. do no work) when it needs to wait for a previous instruction to advance through the pipeline enough to have its result ready (this is generally referred to as instruction dependencies).</li>
<li>To improve the utilization of CPU caches (L1/2/3 caches) and the memory bus, most modern processors artificially limit the number of register <strong>names</strong> they support in instructions (it seems that as of 2018 everyone has settled on 16 general-purpose registers, except for PowerPC at 32)
<ul>
<li>That way instructions take up fewer bits and can be read more quickly over these highly subscribed resources (caches and memory bus).</li>
<li>The flip side of this design decision is that compilers cannot generate code that uses many different registers, which in turn leads them to generate more code fragments that are dependent on each other because of the limited register names available to them.</li>
</ul>
</li>
</ul>
<p>With that in mind, let’s take the same, short piece of assembly code, which was generated by the JIT for our last unrolled attempt, and see how it theoretically executes on a Skylake CPU.</p>
<h2 id="visualizing-our-loop">Visualizing our loop</h2>
<p>Without any additional fanfare, let’s introduce the following visualization:</p>
<p><img src="/assets/images/iaca-popcnt-retirement.svg" alt="iaca-popcnt" /></p>
<p>I created this diagram by prettifying a trace file generated by a little known tool made by Intel called <a href="https://software.intel.com/en-us/articles/intel-architecture-code-analyzer">IACA</a>, which stands for <strong>I</strong>ntel <strong>A</strong>rchitecture <strong>C</strong>ode <strong>A</strong>nalyzer. IACA takes a piece of machine code + target CPU family and produces a textual trace file that can help us see better what the CPU does, at every cycle of a relatively short loop.<br />
If you dislike having to use commercial (non-OSS) tools, please note that there is a similar tool by the LLVM project called <a href="https://llvm.org/docs/CommandGuide/llvm-mca.html">llvm-mca</a>, and you can even use it from the <a href="https://godbolt.org/z/baOZWy">infamous compiler-explorer</a>.</p>
<p>Let’s try to break this diagram down:</p>
<ul>
<li>The leftmost column contains the loop counter; I’ve limited the trace to 2 iterations [0, 1] of that loop, to keep everything compact.</li>
<li>Next, the instruction counter <em>within</em> its respective loop. Clearly we have 11 instructions per loop.</li>
<li>Next, the disassembly, where we can see 4 <code class="highlighter-rouge">POPCNT</code> instructions and they are interleaved with 4 subtractions of each <code class="highlighter-rouge">POPCNT</code> result from the register <code class="highlighter-rouge">rdx</code></li>
<li>Next we see how the instructions are broken down into µops<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote">4</a></sup>:<br />
For now, we will simply note that every <code class="highlighter-rouge">POPCNT</code>, having been encoded as an instruction that both reads from memory AND calculates the population count, was broken down into two µops:
<ul>
<li>A load µop (<code class="highlighter-rouge">TYPE_LOAD</code>) loading the data from its respective pointer.</li>
<li>An operation µop (<code class="highlighter-rouge">TYPE_OP</code>) performing the actual <code class="highlighter-rouge">POPCNT</code>ing into our destination register (<code class="highlighter-rouge">rsi</code>).</li>
</ul>
</li>
<li>Then comes the real kicker: IACA <strong>simulates</strong> what a Skylake CPU (specifically) <em>should</em> be doing at every cycle of those two loop iterations and provides us with critical insight into the state each instruction is in at every cycle (relative to the beginning of the first loop). These states are described by the coded symbols in each box, which I will shortly describe in more detail.</li>
</ul>
<p class="notice--warning">It is important to note that IACA, while being Intel’s <em>own tool</em> is <strong>not</strong> aware of the Intel CPU bug I just described. It is simulating what that processor <em>should have</em> done with NO false dependency…</p>
<p>While all the various states of the instruction within the pipeline are interesting I will give some more meaning to specific states:</p>
<table>
<thead>
<tr>
<th>mnemonic</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>d</td>
<td>Dispatched to execution: The CPU has completed decoding and is waiting for the instruction’s dependencies to be ready. Execution will begin in the next cycle</td>
</tr>
<tr>
<td>e</td>
<td>Executing: The instruction is being executed, often in multiple cycles within a specific execution port (unit) inside the CPU</td>
</tr>
<tr>
<td>w</td>
<td>Writeback: The instruction’s result is being written back to a register in the register-file (more on this below), where it will be available for other instructions that might have a dependency on that instruction</td>
</tr>
<tr>
<td>R</td>
<td>Retired: The temporary register used during execution/writeback is written back to the “real” destination register, in the original order of the program code; this is called retirement, after which the CPU’s internal, temporary register is free again (more on this below)</td>
</tr>
</tbody>
</table>
<p>I encourage you to try to follow this execution trace for a couple of instructions. I like to stare at these things for hours, trying to tell a story in my own head in the form of “what is the CPU thinking now” for each and every cycle. There is much we could say about this, but I will highlight a few remarkable things:</p>
<ul>
<li>I’ve highlighted the <code class="highlighter-rouge">R</code> symbol/stage with a <span style="color:red"><strong>red-ellipse</strong></span>. For our purposes here, this represents the final stage of each instruction. To me, it’s very impressive to see how all of these instructions terminate execution either 0 or 1 cycles apart of each other.</li>
<li>By the time the first instruction (<code class="highlighter-rouge">POPCNT</code>) reaches the <code class="highlighter-rouge">R</code> (retired) state at cycle 14, when it’s done, we are <em>already</em> executing, in some pipeline stage or another, all instructions from the next 4 iterations of this unrolled loop (I’ve limited the visualization to only 2 iterations for brevity, but you get the hang of it).
<ul>
<li>The processor is already (speculatively) executing loads from memory to satisfy our <code class="highlighter-rouge">POPCNT</code> instructions in loop iterations 1,2,3 before the first iteration has even completed running, and without even knowing for sure our loop would actually execute for that amount of iterations.</li>
<li>Quantitatively speaking: We have roughly 4 iterations of an 11 instruction loop (> 40 instructions) all running in parallel inside one core(!) of our processor. This is possible both because of the length of the pipeline (14 stages for this specific processor) and the fact that internally, the processor has multiple units or ports capable of running various instructions in parallel. This is often referred to as a super-scalar CPU.</li>
</ul>
</li>
</ul>
<p>In case you are interested in digging much deeper than I can afford to go within this post, I suggest you read <a href="http://www.lighterra.com/papers/modernmicroprocessors/">Modern Microprocessors: A 90-Minute Guide!</a> to get more detailed information about pipelines, super-scalar CPUs, everything I try to cover here, and more.</p>
<p>For this post, I will focus on one key aspect that lies in the root of how the CPU manages to do so many things at the same time: register renaming.</p>
<h3 id="instruction-dependencies">Instruction Dependencies</h3>
<p>Let’s look at the code again, this time adding arrows between the various instructions, marking their interdependencies.</p>
<p><img src="/assets/images/popcnt-dependencies.svg" alt="popcnt-deps" /></p>
<p>If we interpret this code naively (and wrongly), we see that <code class="highlighter-rouge">rsi</code> is being used in each and every instruction of this code fragment, this could lead us to assume that the heavy usage of <code class="highlighter-rouge">rsi</code> is generating a long dependency chain:</p>
<ul>
<li>The <code class="highlighter-rouge">POPCNT</code> is writing into <code class="highlighter-rouge">rsi</code>.</li>
<li><code class="highlighter-rouge">rsi</code> is then used as a source for the subtraction from <code class="highlighter-rouge">rdx</code>, so naturally, the <code class="highlighter-rouge">sub</code> instruction cannot proceed before <code class="highlighter-rouge">rsi</code> has the value of <code class="highlighter-rouge">POPCNT</code>.</li>
<li>The next <code class="highlighter-rouge">POPCNT</code> is again writing to <code class="highlighter-rouge">rsi</code> but would seemingly be unable to write before the previous <code class="highlighter-rouge">sub</code> has finished.</li>
<li>After four such operations, we loop (in turquoise) again and we are again taking a dependency on <code class="highlighter-rouge">rsi</code> at the beginning of the loop.</li>
</ul>
<p>This naive dependency analysis pretty much contradicts the output we saw come out of IACA in the previous diagram without further explanation. It would seem impossible for the CPU to run so many things in parallel where every instruction here seems to have a dependency through the use of the <code class="highlighter-rouge">rsi</code> register.<br />
Moreover, both our original C# and C++ code did not force the JIT/compiler to re-use the same register over and over. It could have allocated 4 different registers and used them to generate code where each <code class="highlighter-rouge">POPCNT</code> + <code class="highlighter-rouge">SUB</code> pair would be independent of the previous one, so why didn’t it do so?<br />
Well, it turns out there is no need to! The JIT/compiler is doing exactly what it needs to do; it is just us who need to learn about a very important concept in modern processors called register renaming.</p>
<h3 id="register-renaming">Register Renaming</h3>
<p>To understand why anyone would need something like register renaming, we first need to understand that CPU designers are stuck between a rock and a hard place:</p>
<ul>
<li>On one hand they want to be able to read our program code as fast as possible, from memory 🡒 cache 🡒 instruction decoder (a.k.a CPU front end), this requirement leads down a path where they have to severely <em>limit</em> the number of register <em>names</em> available for machine code, since fewer register names leads to more compact instructions (fewer bits) in memory and more efficient utilization of memory buses and caches.</li>
<li>On the other hand, they would like to give compilers / JIT engines as much flexibility as possible in using as many registers as they want (possibly hundreds) without needing to move their contents into memory (or more realistically: CPU cache) just because they ran out of registers names.</li>
</ul>
<p>These contradicting requirements led CPU designers to decouple the idea of register names from register storage: modern CPUs have many more (hundreds of) physical registers (storage) in their register-file than they have names for our software to use. This is where register renaming enters the scene.</p>
<p>What CPU designers have been doing, for quite a long time now (<a href="https://ieeexplore.ieee.org/document/5392015">before 1967</a>, believe it or not!) is really remarkable: they have been employing a really neat trick that effectively gets the best of both worlds (i.e. satisfies both requirements) at the cost of more complexity, more power usage, and more stages in the pipeline (hence also a slight slowdown in the execution of a single instruction), to achieve better pipeline utilization at the global scale.</p>
<p>This optimization, named “register renaming”, accomplishes just that: by analyzing <em>when</em> a register is being <strong>written to</strong> (write-only, not read-write), the CPU “understands” that the previous value of that register is <em>no longer required</em> for the execution of instructions reading/writing that same register from that moment onwards, even if previous instructions have not completed (or started) execution! What this really means is that if we go back to the naive (now you see why) dependency analysis we did in the previous section, it’s clear that each <code class="highlighter-rouge">POPCNT</code> + <code class="highlighter-rouge">SUB</code> pair is actually completely <strong>independent</strong> of the others, because each pair begins by overwriting <code class="highlighter-rouge">rsi</code>! In other words, each <code class="highlighter-rouge">POPCNT</code>, having written to <code class="highlighter-rouge">rsi</code>, is considered to break the dependency chain from that moment onwards.
What the CPU does, therefore, is continuously re-map <em>named</em> registers to different register <em>locations</em> in the register-file, according to the real dependency chain, and use that newly <strong>allocated</strong> location within the register file (hence the initial “Allocation” stage in the IACA diagram above) until the dependency chain is broken again (e.g. the same register is written to again).<br />
I cannot emphasize enough how important a tool this is for the CPU. Register renaming allows it to schedule multiple instructions to execute concurrently, either at different stages of the same pipeline or in parallel in different execution ports (pipelines) that exist in a super-scalar CPU. Moreover, this optimization achieves all this while keeping the machine code small and easy to decode, since very few bits need to be allocated for register names!</p>
<p class="notice--info">How big of a deal is this? How good is the CPU at using this renaming trick? To best answer this from a practical standpoint, I think, we can take a look at the disparity between how many register <em>names</em> exist, for example, in the x64 architecture, that number being 16, and how <em>much physical register storage</em> there is in the register-file, for example, on an Intel Skylake CPU: 180 (!).</p>
<p>After the temporary (renamed) register has finished its job for a given instruction chain, we are still, unfortunately, not <em>entirely</em> done with it. Understand that the CPU cannot look too far into the incoming instruction stream (mostly a few dozen bytes), and it cannot know, with certainty, whether the last value it just wrote to a renamed register will be required by some future part of the code it hasn’t seen yet, hundreds of instructions in the future. This brings us to the last phase of register renaming, which is retirement: The CPU must still write the last value of our <em>symbolic</em> register (<code class="highlighter-rouge">rsi</code>) back to the canonical location of that register (a.k.a the “real” register), in case future instructions that have not been loaded/decoded attempt to read that value.<br />
Moreover, this retirement phase must be performed exactly in program order for the program to continue operating as its original intention was.</p>
<h3 id="wrapping-up-clearing-the-register-for-the-rescue">Wrapping up: clearing the register for the rescue</h3>
<p>So going back to our false-dependency bug, we can now hopefully understand the underlying issue and the fix armed with our new knowledge:</p>
<p>Our Intel CPU wrongly misunderstands our <code class="highlighter-rouge">POPCNT</code> instruction, when it comes to its dependency analysis: It <strong>“thinks”</strong> our usage of <code class="highlighter-rouge">rsi</code> is not only writing to it but also reading from it.<br />
This is the false-dependency at the root of this issue. We cannot see this with IACA, but we can understand it conceptually: If the CPU (wrongfully) “thinks” that our second <code class="highlighter-rouge">POPCNT</code> has to READ the previous <code class="highlighter-rouge">rsi</code> value, then no register renaming can occur at that point, and the second <code class="highlighter-rouge">POPCNT</code> instruction cannot execute in parallel to the first one, it needs to wait for the completion of the first <code class="highlighter-rouge">POPCNT</code> and basically stall for a few precious cycles, in order for the previous <code class="highlighter-rouge">rsi</code> to be written back somewhere. Naturally this is true for every unrolled <code class="highlighter-rouge">POPCNT</code> in our loop and <em>also</em> between loop iterations.<br />
This alone is enough to cause the perf drop we saw originally with the C# code before CoreCLR was patched. Once the <code class="highlighter-rouge">xor esi,esi</code> dependency breaker is added to the instruction stream, we are basically “informing” the CPU that we really are not dependent on the previous value of <code class="highlighter-rouge">rsi</code> and we allow it to perform register renaming from that point onwards. It still wrongfully thinks that <code class="highlighter-rouge">POPCNT</code> reads from <code class="highlighter-rouge">rsi</code> but thanks to our otherwise seemingly superfluous <code class="highlighter-rouge">xor</code>, this is an <em>already renamed</em> <code class="highlighter-rouge">rsi</code> and the pipeline stall is averted.</p>
<p>I think it is pretty clear by now, although we barely scratched the surface of CPU internals, that CPUs are very complex, and that in the race to extract more performance out of code, today’s out-of-order, super-scalar CPUs go to extreme lengths to find ways to parallelize machine code execution.<br />
It should also be clear that it’s important to be able to <a href="https://mechanical-sympathy.blogspot.com/2011/07/why-mechanical-sympathy.html">empathize with the machine</a> and understand the true nature of its inner workings to really be able to deal with the weirdness we experience as we try to make stuff go faster.</p>
<p>It would be great if all we needed to do was keep compiler and hardware developers well fed and well paid so we could do our job without needing to know any of this, and to a great extent, this statement is true. But more often than not, extreme performance requires deeper understanding.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>As a side note, after not doing serious C++ work for years, coming back to it and discovering sanitizers, cmake, google test & benchmark was a very pleasant surprise. I distinctly remember the surprise of writing C++ and not having violent murderous thoughts at the same time. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Apparently Intel has fixed the bug (according to reports) for the <code class="highlighter-rouge">LZCNT</code> and <code class="highlighter-rouge">TZCNT</code> instructions on Skylake processors, but not so for the <code class="highlighter-rouge">POPCNT</code> instruction for reasons unknown to practically anyone. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>yes, x86 registers are weird in that way, where <em>some</em> 64 bit registers have additional symbolic names referring to their lower 32, 16, and both 8 bit parts of their lower 16 bits, don’t ask. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>µop or micro-op, is a low-level hardware operation. The CPU Front-End is responsible for reading the x86 machine code and decoding them into one or more µops. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<p>damageboy (dans@houmus.org), <a href="https://bits.houmus.org">https://bits.houmus.org</a></p>
<h2>.NET Core 3.0 Intrinsics in Real Life - (Part 2/3)</h2>
<p>2018-08-19T15:26:28+00:00, <a href="https://bits.houmus.org/2018-08-19/netcoreapp3.0-intrinsics-in-real-life-pt2">https://bits.houmus.org/2018-08-19/netcoreapp3.0-intrinsics-in-real-life-pt2</a></p>
<p>As I’ve described in <a href="/2018-08-18/netcoreapp3.0-intrinsics-in-real-life-pt1">part 1</a> of this series, I’ve recently overhauled an internal data structure we use at Work<sup>®</sup> to start using <a href="https://github.com/dotnet/designs/blob/master/accepted/platform-intrinsics.md">platform dependent intrinsics</a>.</p>
<p>If you’ve not read part 1 yet, I suggest you do so, since we continue right where we left off…</p>
<p>As a reminder, this series is made in 3 parts:</p>
<ul>
<li><a href="/2018-08-18/netcoreapp3.0-intrinsics-in-real-life-pt1">The data-structure/operation that we’ll optimize and basic usage of intrinsics</a>.</li>
<li>Using intrinsics more effectively (this post).</li>
<li><a href="/2018-08-20/netcoreapp3.0-intrinsics-in-real-life-pt3">The C++ version(s) of the corresponding C# code, and what I learned from them</a>.</li>
</ul>
<p>All of the code (C# & C++) is published under the <a href="https://github.com/damageboy/bitgoo">bitgoo github repo</a>.</p>
<h3 id="pdep---parallel-bit-deposit">PDEP - Parallel Bit Deposit</h3>
<p>We’re about to twist our heads with a bit of a challenge. For me, this was a lot of fun, since I got to play with something I knew <em>nothing</em> about, which turned out to be very useful, and not only for this specific task, but in general.</p>
<p>We’re going to optimize a subset of this method’s performance “spectrum”: lower bit counts.<br />
If you go back to the previous iteration of the code, you can clearly see that apart from the one 64-bit <code class="highlighter-rouge">POPCNT</code> loop up at the top, the ratio between instructions executed and bits processed for low values of <code class="highlighter-rouge">N</code> doesn’t look too good. I summed up the instruction counts from the JIT Dump linked above:</p>
<ul>
<li>The 64-bit <code class="highlighter-rouge">POPCNT</code> loop takes 10 instructions, split into two fragments of the function, processing 64 bits each iteration.</li>
<li>The rest of the code (31 instructions not including the <code class="highlighter-rouge">ret</code>!) is spent processing the last <= 64 bits, executing a single time.</li>
</ul>
<p>While just counting instructions isn’t the best profiling metric in the world, it’s still very revealing…<br />
Wouldn’t it be great if we could do something to improve that last, long code fragment?
Guess what…<br />
Yes we can, using a weird little instruction called <a href="https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets#Parallel_bit_deposit_and_extract"><code class="highlighter-rouge">PDEP</code></a>, whose description (copy-pasted from <a href="https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf">Intel’s bible of instructions</a>, page 922) goes like this:</p>
<blockquote>
<p>PDEP uses a mask in the second source operand (the third operand) to transfer/scatter contiguous low order bits in the first source operand (the second operand) into the destination (the first operand). PDEP takes the low bits from the first source operand and deposit them in the destination operand at the corresponding bit locations that are set in the second source operand (mask). All other bits (bits not set in mask) in destination are set to zero.</p>
</blockquote>
<p>Luckily, it comes with a diagram that makes it more digestible:</p>
<p><img src="/assets/images/pdep.svg" alt="PDEP" /></p>
<p>I know this might be a bit intimidating at first, but what <code class="highlighter-rouge">PDEP</code> can do for us, in my own words, is this: take a single 64-bit value (<code class="highlighter-rouge">SRC1</code>) and a mask of bits (<code class="highlighter-rouge">SRC2</code>), and copy (“deposit”) the least-significant bits of <code class="highlighter-rouge">SRC1</code> (from right to left in the diagram) into the destination register, at the positions of the <code class="highlighter-rouge">1</code> bits in the mask (<code class="highlighter-rouge">SRC2</code>).<br />
It definitely takes time to wrap your head around how/what can be done with this, and there are many more applications than just this bit-searching. Right after I read a <a href="http://palms.ee.princeton.edu/PALMSopen/hilewitz06FastBitCompression.pdf">paper</a> about <code class="highlighter-rouge">PDEP</code> (which, from what I gathered, was the inspiration for having these primitives in our processors, and an extremely good paper for those willing to dive deeper), I felt like a hammer in search of a nail, wanting to apply this somewhere, until I remembered I had <em>this</em> little thing I needed (e.g. this function), and I tried using it, still in C++, about 2 years ago…<br />
It took me a good day of goofing around with this on a white-board (I actually started with its sister instruction <code class="highlighter-rouge">PEXT</code>) until I finally saw <em>a</em> solution…
<u>*Note*</u>: There might be other solutions, better than what I came up with, and if anyone reading this finds one, I would love to hear about it!</p>
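<p>Since the C# snippets in this post are rendered as highlighted HTML, here is a minimal software model of what <code class="highlighter-rouge">PDEP</code> does, written as a Python sketch (the function name <code>pdep</code> is mine, not a real API): it scans the mask from least- to most-significant bit and deposits successive low-order bits of the source at each set mask position.</p>

```python
def pdep(src: int, mask: int) -> int:
    """Software model of the BMI2 PDEP instruction (64-bit)."""
    result = 0
    src_bit = 0  # index of the next low-order bit of src to deposit
    for pos in range(64):
        if (mask >> pos) & 1:
            # Deposit the next low bit of src at this set mask position.
            result |= ((src >> src_bit) & 1) << pos
            src_bit += 1
    return result

# Low bits of src are scattered to the set positions of the mask:
print(hex(pdep(0b101, 0b1111_0000)))  # → 0x50
```

<p>The real instruction does all of this in hardware in a few cycles; the model is only here to pin down the semantics.</p>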
<p>For those of you who don’t like spoilers, this might be a good time to grab a piece of paper and try to figure out how <code class="highlighter-rouge">PDEP</code> could help us in processing the last 64 bits, where we know our target bit is hiding…</p>
<p>If you are ready for the solution, I’ll just show the one-liner C# expression that takes the <strong>31</strong> instructions we saw the JIT emit for handling those last < 64 bits in our bitmap all the way down to <strong>13</strong> instructions, and just as importantly: with <strong>0</strong> branching:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="c1">// Where:</span>
<span class="c1">// n is the # of the target bit we are searching for</span>
<span class="c1">// value is the 64 bits when we know for sure that n is "hiding" within</span>
<span class="kt">var</span> <span class="n">offsetOfNthBit</span> <span class="p">=</span> <span class="nf">TrailingZeroCount</span><span class="p">(</span>
<span class="nf">ParallelBitDeposit</span><span class="p">(</span><span class="m">1U</span><span class="n">L</span> <span class="p"><<</span> <span class="p">(</span><span class="n">n</span> <span class="p">-</span> <span class="m">1</span><span class="p">),</span> <span class="k">value</span><span class="p">);</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>It’s not trivial to see how/why this works just from reading the code, so lets break this down, for an imaginary case of a 16-bit <code class="highlighter-rouge">PDEP</code> and assorted registers, for simplicity:</p>
<p>As an example, let’s pretend we are looking for the offset (position) of the 8<sup>th</sup> <code class="highlighter-rouge">1</code> bit.<br />
We pass two operands to <code class="highlighter-rouge">ParallelBitDeposit()</code>:<br />
The <code class="highlighter-rouge">SRC1</code> operand has the value of <code class="highlighter-rouge">1</code> left shifted by the bit number we are searching for minus 1, so for our case of <code class="highlighter-rouge">n = 8</code>, we shift a single <code class="highlighter-rouge">1</code> bit 7 bits to the left, ending up with:</p>
<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre><span class="err">0b_0000_0000_1000_0000</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Our “fake” 16-bit <code class="highlighter-rouge">SRC1</code> now has a single <code class="highlighter-rouge">1</code> bit in the <strong>position</strong> that equals our target-bit <strong>count</strong> (this last emphasis is important!).
Remember that by this point in our search function, we have made sure our <code class="highlighter-rouge">n</code> is within the range <code class="highlighter-rouge">1..64</code>, so <code class="highlighter-rouge">n-1</code> can only be <code class="highlighter-rouge">0..63</code>: we can never shift by a negative number of bits, or by more than the size of the register (this can be seen more easily in the full code listing below).</p>
<p>As for <code class="highlighter-rouge">SRC2</code>, We load it up with our remaining portion of the bitmap, whose n<sup>th</sup> lit bit position we are searching for, so with careful mashing of the keyboard, I came up with these random bits:</p>
<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre><span class="err">0b_0001_0111_0011_0110</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>This is what executing <code class="highlighter-rouge">PDEP</code> with these two operands does:</p>
<p><img src="/assets/images/pdep-bitsearch-example-animated.svg" alt="PDEP" /></p>
<p>By now, we’ve managed to generate a temporary value where only our original target-bit remains lit, in its original position, so thanks for that, <code class="highlighter-rouge">PDEP</code>! In a way, we’ve managed to tweak <code class="highlighter-rouge">PDEP</code> into a custom masking opcode, capable of masking out the first <code class="highlighter-rouge">n-1</code> lit bits…<br />
Finally, all that remains is to use the BMI1 <code class="highlighter-rouge">TZCNT</code> instruction to count the number of <code class="highlighter-rouge">0</code> bits leading up to our deposited <code class="highlighter-rouge">1</code> bit marker. That number ends up being the offset of the n<sup>th</sup> lit bit in the original bitmap! Cool, eh?</p>
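<p>The 16-bit walk-through above can be checked mechanically. Below is a Python sketch (using a software stand-in for the intrinsic, with names of my own choosing) that reproduces the trick: depositing a single <code class="highlighter-rouge">1</code>, pre-shifted to position <code>n-1</code>, with the bitmap as the mask leaves only the n<sup>th</sup> set bit lit, and a trailing-zero count then recovers its offset.</p>

```python
def pdep(src, mask):
    """Software stand-in for the BMI2 PDEP instruction."""
    result, src_bit = 0, 0
    for pos in range(64):
        if (mask >> pos) & 1:
            result |= ((src >> src_bit) & 1) << pos
            src_bit += 1
    return result

def nth_set_bit_offset(value, n):
    """Offset of the n-th (1-based) set bit of value, via the PDEP trick."""
    marker = pdep(1 << (n - 1), value)          # only the n-th set bit survives
    return (marker & -marker).bit_length() - 1  # TZCNT of the lone marker bit

value = 0b0001_0111_0011_0110  # the example bitmap from the text
print(nth_set_bit_offset(value, 8))  # → 12
```

<p>The example bitmap has exactly 8 set bits, and the highest of them sits at offset 12, which is what the sketch reports.</p>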
<p>Let’s look at the final code for this function:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
</pre></td><td class="rouge-code"><pre><span class="k">using</span> <span class="nn">static</span> <span class="n">System</span><span class="p">.</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Intrinsics</span><span class="p">.</span><span class="n">X86</span><span class="p">.</span><span class="n">Popcnt</span><span class="p">;</span>
<span class="k">using</span> <span class="nn">static</span> <span class="n">System</span><span class="p">.</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Intrinsics</span><span class="p">.</span><span class="n">X86</span><span class="p">.</span><span class="n">Bmi1</span><span class="p">;</span>
<span class="k">using</span> <span class="nn">static</span> <span class="n">System</span><span class="p">.</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Intrinsics</span><span class="p">.</span><span class="n">X86</span><span class="p">.</span><span class="n">Bmi2</span><span class="p">;</span>
<span class="k">public</span> <span class="k">static</span> <span class="k">unsafe</span> <span class="kt">int</span> <span class="nf">POPCNTAndBMI2</span><span class="p">(</span><span class="kt">ulong</span><span class="p">*</span> <span class="n">bits</span><span class="p">,</span> <span class="kt">int</span> <span class="n">numBits</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">var</span> <span class="n">p64</span> <span class="p">=</span> <span class="n">bits</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">prevN</span><span class="p">;</span>
<span class="k">do</span> <span class="p">{</span>
<span class="n">prevN</span> <span class="p">=</span> <span class="n">n</span><span class="p">;</span>
<span class="n">n</span> <span class="p">-=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="nf">PopCount</span><span class="p">(*</span><span class="n">p64</span><span class="p">);</span>
<span class="n">p64</span><span class="p">++;</span>
<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">n</span> <span class="p">></span> <span class="m">0</span><span class="p">);</span>
<span class="n">p64</span><span class="p">--;</span>
<span class="c1">// Here, we know for sure that 1 .. prevN .. 64 (including)</span>
<span class="kt">var</span> <span class="n">pos</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="nf">TrailingZeroCount</span><span class="p">(</span>
<span class="nf">ParallelBitDeposit</span><span class="p">(</span><span class="m">1U</span><span class="n">L</span> <span class="p"><<</span> <span class="p">(</span><span class="n">prevN</span> <span class="p">-</span> <span class="m">1</span><span class="p">),</span> <span class="p">*</span><span class="n">p64</span><span class="p">));</span>
<span class="k">return</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="p">((</span><span class="n">p64</span> <span class="p">-</span> <span class="n">bits</span><span class="p">)</span> <span class="p"><<</span> <span class="m">6</span><span class="p">)</span> <span class="p">+</span> <span class="n">pos</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
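<p>For readers who want to poke at the control flow (the <code class="highlighter-rouge">prevN</code> bookkeeping in particular) without firing up .NET, here is a Python model of the same algorithm; <code>pdep</code> is a software stand-in for the intrinsic, and a list of integers plays the role of the <code class="highlighter-rouge">ulong*</code> bitmap. This is a sketch of my own, not the shipped code.</p>

```python
def pdep(src, mask):
    """Software stand-in for the BMI2 PDEP instruction."""
    result, src_bit = 0, 0
    for pos in range(64):
        if (mask >> pos) & 1:
            result |= ((src >> src_bit) & 1) << pos
            src_bit += 1
    return result

def find_nth_set_bit(words, n):
    """Global offset of the n-th (1-based) set bit across 64-bit words."""
    i = 0
    while True:
        prev_n = n                     # n before subtracting this word
        n -= bin(words[i]).count("1")  # POPCNT
        if n <= 0:
            break                      # the target bit hides in words[i]
        i += 1
    marker = pdep(1 << (prev_n - 1), words[i])
    return i * 64 + ((marker & -marker).bit_length() - 1)

words = [0x0, 0xFF00, 0x1]  # set bits at global offsets 72..79 and 128
print(find_nth_set_bit(words, 9))  # → 128
```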
<p>With the code out of the way, it’s time to see whether the whole thing paid off:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>N</th>
<th style="text-align: right">Mean (ns)</th>
<th style="text-align: right">Scaled to “POPCNTAndBMI1”</th>
</tr>
</thead>
<tbody>
<tr>
<td>POPCNTAndBMI2</td>
<td>1</td>
<td style="text-align: right">2.232</td>
<td style="text-align: right">0.95</td>
</tr>
<tr>
<td>POPCNTAndBMI2</td>
<td>4</td>
<td style="text-align: right">9.497</td>
<td style="text-align: right">0.62</td>
</tr>
<tr>
<td>POPCNTAndBMI2</td>
<td>16</td>
<td style="text-align: right">40.259</td>
<td style="text-align: right">0.34</td>
</tr>
<tr>
<td>POPCNTAndBMI2</td>
<td>64</td>
<td style="text-align: right">193.253</td>
<td style="text-align: right">0.19</td>
</tr>
<tr>
<td>POPCNTAndBMI2</td>
<td>256</td>
<td style="text-align: right">1,581.082</td>
<td style="text-align: right">0.32</td>
</tr>
<tr>
<td>POPCNTAndBMI2</td>
<td>1024</td>
<td style="text-align: right">23,174.989</td>
<td style="text-align: right">0.51</td>
</tr>
<tr>
<td>POPCNTAndBMI2</td>
<td>4096</td>
<td style="text-align: right">341,087.341</td>
<td style="text-align: right">0.82</td>
</tr>
<tr>
<td>POPCNTAndBMI2</td>
<td>16384</td>
<td style="text-align: right">4,979,229.288</td>
<td style="text-align: right">0.95</td>
</tr>
<tr>
<td>POPCNTAndBMI2</td>
<td>65536</td>
<td style="text-align: right">76,144,935.381</td>
<td style="text-align: right">0.98</td>
</tr>
</tbody>
</table>
<p>Oh boy, did it ever! Results are much better for the lower counts of <code class="highlighter-rouge">N</code>:</p>
<ul>
<li>As expected, the scaling improved, with the <em>peak improvement</em> at <code class="highlighter-rouge">N==64</code>: a 400% speedup compared to the previous version!</li>
<li>As N grows beyond 64, this version’s performance resembles the previous version’s more and more (duh!).</li>
</ul>
<p>All in all, everything looks as we would have expected so far…<br />
Again, for those interested, here’s a <a href="https://gist.github.com/9b049a464dc66237500454ed367a79aa">gist</a> of the JITDump, for your pleasure.</p>
<h3 id="loop-unrolling">Loop Unrolling</h3>
<p>A common optimization technique we haven’t used up to this point, is <a href="https://en.wikipedia.org/wiki/Loop_unrolling">loop unrolling/unwinding</a>:</p>
<blockquote>
<p>The goal of loop unwinding is to increase a program’s speed by reducing
or eliminating instructions that control the loop, such as <a href="https://en.wikipedia.org/wiki/Pointer_arithmetic">pointer arithmetic</a> and “end of loop” tests on each iteration;[<a href="https://en.wikipedia.org/wiki/Loop_unrolling#cite_note-1">1]</a> reducing branch penalties; as well as hiding latencies including the delay in reading data from memory.[<a href="https://en.wikipedia.org/wiki/Loop_unrolling#cite_note-2">2]</a> To eliminate this <a href="https://en.wikipedia.org/wiki/Computational_overhead">computational overhead</a>, loops can be re-written as a repeated sequence of similar independent statements.[<a href="https://en.wikipedia.org/wiki/Loop_unrolling#cite_note-3">3]</a></p>
</blockquote>
<p>By now, we’re left with only one loop, so clearly the target of loop unrolling is the <code class="highlighter-rouge">POPCNT</code> loop.<br />
After all, we are potentially going over thousands of bits, and by shoving more <code class="highlighter-rouge">POPCNT</code> instructions in between the looping instructions, we can theoretically drive the CPU harder.<br />
Not only that, but modern (in this case x86/x64) CPUs are notorious for having internal parallelism that comes in many shapes and forms. For <code class="highlighter-rouge">POPCNT</code> specifically, we know from <a href="https://www.agner.org/optimize/instruction_tables.pdf">Agner Fog’s Instruction Tables</a> that:</p>
<ul>
<li>Intel Skylake can execute certain <code class="highlighter-rouge">POPCNT</code> instructions on two different execution ports, with a single <code class="highlighter-rouge">POPCNT</code> latency of 3 cycles, and a reciprocal throughput of 1 cycle, so a latency of <code class="highlighter-rouge">x + 2</code> cycles as a best case, where <code class="highlighter-rouge">x</code> is the number of <strong>continuous independent</strong> <code class="highlighter-rouge">POPCNT</code> instructions.</li>
<li>AMD Ryzen can execute up to 4 <code class="highlighter-rouge">POPCNT</code> instructions in 1 cycle, with a latency of 1 cycle, for <strong>continuous independent</strong> <code class="highlighter-rouge">POPCNT</code> instructions, which is even more impressive (I’ve not yet been able to verify this somewhat extravagant claim…).</li>
</ul>
<p>These numbers were measured on real CPUs, with very specific benchmarks that measure single independent instructions. They should <strong>not</strong> be taken as a target performance for <strong>our</strong> code, since we are attempting to solve a real-life problem, which isn’t limited to a single instruction and has at least SOME dependency between the different instructions and branching logic on top of that.<br />
But the numbers do give us at least one thing: motivation to unroll our <code class="highlighter-rouge">POPCNT</code> loop and try to get more work out of the CPU by issuing independent <code class="highlighter-rouge">POPCNT</code> on different parts of our bitmap.</p>
<p>Here’s the code that does this:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
</pre></td><td class="rouge-code"><pre><span class="k">using</span> <span class="nn">static</span> <span class="n">System</span><span class="p">.</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Intrinsics</span><span class="p">.</span><span class="n">X86</span><span class="p">.</span><span class="n">Popcnt</span><span class="p">;</span>
<span class="k">using</span> <span class="nn">static</span> <span class="n">System</span><span class="p">.</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Intrinsics</span><span class="p">.</span><span class="n">X86</span><span class="p">.</span><span class="n">Bmi1</span><span class="p">;</span>
<span class="k">using</span> <span class="nn">static</span> <span class="n">System</span><span class="p">.</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Intrinsics</span><span class="p">.</span><span class="n">X86</span><span class="p">.</span><span class="n">Bmi2</span><span class="p">;</span>
<span class="k">public</span> <span class="k">static</span> <span class="k">unsafe</span> <span class="kt">int</span> <span class="nf">POPCNTAndBMI2Unrolled</span><span class="p">(</span><span class="kt">ulong</span><span class="p">*</span> <span class="n">bits</span><span class="p">,</span> <span class="kt">int</span> <span class="n">numBits</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">var</span> <span class="n">p64</span> <span class="p">=</span> <span class="n">bits</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(;</span> <span class="n">n</span> <span class="p">>=</span> <span class="m">256</span><span class="p">;</span> <span class="n">p64</span> <span class="p">+=</span> <span class="m">4</span><span class="p">)</span> <span class="p">{</span>
<span class="n">n</span> <span class="p">-=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="p">(</span>
<span class="nf">PopCount</span><span class="p">(</span><span class="n">p64</span><span class="p">[</span><span class="m">0</span><span class="p">])</span> <span class="p">+</span>
<span class="nf">PopCount</span><span class="p">(</span><span class="n">p64</span><span class="p">[</span><span class="m">1</span><span class="p">])</span> <span class="p">+</span>
<span class="nf">PopCount</span><span class="p">(</span><span class="n">p64</span><span class="p">[</span><span class="m">2</span><span class="p">])</span> <span class="p">+</span>
<span class="nf">PopCount</span><span class="p">(</span><span class="n">p64</span><span class="p">[</span><span class="m">3</span><span class="p">]));</span>
<span class="p">}</span>
<span class="kt">var</span> <span class="n">prevN</span> <span class="p">=</span> <span class="n">n</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">n</span> <span class="p">></span> <span class="m">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">prevN</span> <span class="p">=</span> <span class="n">n</span><span class="p">;</span>
<span class="n">n</span> <span class="p">-=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="nf">PopCount</span><span class="p">(*</span><span class="n">p64</span><span class="p">);</span>
<span class="n">p64</span><span class="p">++;</span>
<span class="p">}</span>
<span class="n">p64</span><span class="p">--;</span>
<span class="kt">var</span> <span class="n">pos</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="nf">TrailingZeroCount</span><span class="p">(</span>
<span class="nf">ParallelBitDeposit</span><span class="p">(</span><span class="m">1U</span><span class="n">L</span> <span class="p"><<</span> <span class="p">(</span><span class="n">prevN</span> <span class="p">-</span> <span class="m">1</span><span class="p">),</span> <span class="p">*</span><span class="n">p64</span><span class="p">));</span>
<span class="k">return</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="p">((</span><span class="n">p64</span> <span class="p">-</span> <span class="n">bits</span><span class="p">)</span> <span class="p">*</span> <span class="m">64</span><span class="p">)</span> <span class="p">+</span> <span class="n">pos</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>We had to change the code flow to account for the unrolled loop, but all in all this is pretty straightforward, so let’s see how it performs:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>N</th>
<th style="text-align: right">Mean (ns)</th>
<th style="text-align: right">Scaled to POPCNTAndBMI2</th>
</tr>
</thead>
<tbody>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>1</td>
<td style="text-align: right">2.249</td>
<td style="text-align: right">1.04</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>4</td>
<td style="text-align: right">10.904</td>
<td style="text-align: right">1.15</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>16</td>
<td style="text-align: right">50.368</td>
<td style="text-align: right">1.11</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>64</td>
<td style="text-align: right">208.272</td>
<td style="text-align: right">1.13</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>256</td>
<td style="text-align: right">1,580.026</td>
<td style="text-align: right">0.99</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>1024</td>
<td style="text-align: right">21,282.905</td>
<td style="text-align: right">0.92</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>4096</td>
<td style="text-align: right">255,186.977</td>
<td style="text-align: right">0.74</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>16384</td>
<td style="text-align: right">3,730,420.068</td>
<td style="text-align: right">0.77</td>
</tr>
<tr>
<td>POPCNTAndBMI2Unrolled</td>
<td>65536</td>
<td style="text-align: right">56,939,817.593</td>
<td style="text-align: right">0.76</td>
</tr>
</tbody>
</table>
<p>There are a few interesting things going on here:</p>
<ul>
<li>For low bit-counts (<code class="highlighter-rouge">N <= 64</code>) we can see a drop in performance compared to the previous version. That is totally acceptable: we’ve made the code longer and more branch-y, all in order to gain some serious ground on the other side of this benchmark (also, in reality, no one ever complains that your code used to take 193ns, but is now taking 208ns :).</li>
<li>In other words: the drop is not horrible, and we hope to make up enough for it on higher bit counts.</li>
<li>And we are making up for it, kind of… We can see a 33%-ish speedup for <code class="highlighter-rouge">N >= 4096</code>.</li>
</ul>
<p>For those interested, here’s the <a href="https://gist.github.com/c73959ad3dfe31e5d65e6bf273f53211">JITDump</a> of this version.</p>
<p>In theory, we should be happy, pack our bags, and call it a day! We’ve done it, we’ve squeezed every last bit we could hope to.<br />
<strong>Except we really didn’t…</strong><br />
While it might not be clear from these results alone, the loop unrolling hit an unexpected snag: the performance improvement is actually disappointing.<br />
How can I tell? Well, that’s simple: <strong>I’m cheating!</strong><br />
I’ve already written the equivalent C++ code as part of this whole effort (to be honest, I wrote the C++ code two years before C# intrinsics were a thing), and I’ve seen where unrolled <code class="highlighter-rouge">POPCNT</code> can go, and this is not it.<br />
Not <em>yet</em> at least.</p>
<p>From my C++ attempts, I know we should have seen a ~100% speedup in high bit-counts with loop unrolling, but we are seeing much less than that.</p>
<p>To understand why though, and what is really going on here, you’ll have to wait for the next post, where we cover some of the C++ code, and possibly learn more about processors than we cared to know…</p>
<h2 id="mid-journey-conclusions">Mid-Journey Conclusions</h2>
<p>We’ve taken our not-so-bad code from the end of the first post and improved upon it quite a lot!<br />
I hope you’ve seen how trying to think outside the box, and finding creative ways to compound various intrinsics provided by the CPU can really pay off in performance, and even simplicity.</p>
<p>Alongside the positive things, we must also not forget that there are some negative sides to working with intrinsics, which, by now, you might have begun sensing:</p>
<ul>
<li>You’ll need to map which CPUs your users are using, and which CPU intrinsics are supported on each model (even within a single architecture, such as Intel/AMD x64 you’ll see great variation throughout different models!).</li>
<li>You’ll sometimes need to write cryptic implementation-selection code that uses the provided <code class="highlighter-rouge">.IsHardwareAccelerated</code> properties (for example detecting <code class="highlighter-rouge">BMI1</code>-only CPUs vs. <code class="highlighter-rouge">BMI1</code> + <code class="highlighter-rouge">BMI2</code> ones) to steer the JIT into the “best” implementation, while praying to the powers that be that the JIT will be intelligent enough to elide the un-needed code at generation time, and still inline the resulting code.</li>
<li>Due to having multiple implementations, architecture-specific <em>testing</em> becomes a new requirement.
This might sound basic to a C++ developer, but less so for C#/CLR developers; it means you need access to x86 (both 32- and 64-bit), arm32, and arm64 test agents, and need to run tests on <strong>all of them</strong> to be able to sleep calmly at night.</li>
</ul>
<p>All of these are considerations to be taken seriously when weighing intrinsics, especially if you work outside of Microsoft (where there are considerably more resources for testing, and greater impact for using intrinsics, at the same time).</p>
<p>In the <a href="/2018-08-20/netcoreapp3.0-intrinsics-in-real-life-pt3">next and final post</a>, we’ll explore the performance bug I uncovered, and how generally C# compares to C++ for this sort of code…</p>damageboydans@houmus.orghttps://bits.houmus.orgAs I’ve described in part 1 of this series, I’ve recently overhauled an internal data structure we use at Work® to start using platform dependent intrinsics..NET Core 3.0 Intrinsics in Real Life - (Part 1/3)2018-08-18T15:26:28+00:002018-08-18T15:26:28+00:00https://bits.houmus.org/2018-08-18/netcoreapp3.0-intrinsics-in-real-life-pt1<p>I’ve recently overhauled an internal data structure we use at Work<sup>®</sup> to start using <a href="https://github.com/dotnet/designs/blob/master/accepted/platform-intrinsics.md">platform dependent intrinsics</a>- the anticipated feature (for speed junkies like me, that is) which was released in preview form as part of CoreCLR 2.1:
What follows is sort of a travel log of what I did and how the new CoreCLR functionality fares compared to writing C++ code, when processor intrinsics are involved.</p>
<p>This series will contain 3 parts:</p>
<ul>
<li>The data-structure/operation that we’ll optimize and basic usage of intrinsics (this post).</li>
<li><a href="/2018-08-19/netcoreapp3.0-intrinsics-in-real-life-pt2">Using intrinsics more effectively</a>.</li>
<li><a href="/2018-08-20/netcoreapp3.0-intrinsics-in-real-life-pt3">The C++ version(s) of the corresponding C# code, and what I learned from them</a>.</li>
</ul>
<p>All of the code (C# & C++) is published under the <a href="https://github.com/damageboy/bitgoo">bitgoo github repo</a>, with build/run scripts in case someone wants to play with it and/or use it as a starting point for humiliating me with better versions.</p>
<p>In order to keep people motivated:</p>
<ul>
<li>By the end of this post, we’ll already start using intrinsics, and see considerable speedup in our execution time</li>
<li>By the end of the 2<sup>nd</sup> post, we will already see a <strong>300%</strong> speed-up compared to my current .NET Core 2.1 production code, and:</li>
<li>By the end of the 3<sup>rd</sup> post I hope to show how with some fixing in the JIT, we can probably get another 100%-ish improvement on top of <strong>that</strong>, bringing us practically to C++ territory<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">1</a></sup></li>
</ul>
<h2 id="the-whatwhy-of-intrinsics">The What/Why of Intrinsics</h2>
<p>Processor intrinsics are a way to directly embed specific CPU instructions via special, fake method calls that the JIT replaces at code-generation time. Many of these instructions are considered exotic, and normal language syntax cannot map them cleanly.<br />
The general rule is that a single intrinsic “function” becomes a single CPU instruction.</p>
<p>Intrinsics are not really new to the CLR, and staples of .NET rely on having them around. For example, practically all of the methods in the <a href="https://docs.microsoft.com/en-us/dotnet/api/system.threading.interlocked?view=netframework-4.7.2"><code class="highlighter-rouge">Interlocked</code></a> class in <code class="highlighter-rouge">System.Threading</code> are essentially intrinsics, even if not referred to as such in the documentation. The same holds true for a vast set of vectorized mathematical operations exposed through the types in <a href="https://docs.microsoft.com/en-us/dotnet/api/system.numerics?view=netframework-4.7.2"><code class="highlighter-rouge">System.Numerics</code></a>.</p>
<p>The recent, new effort to introduce more intrinsics in CoreCLR tries to provide additional processor specific intrinsics that deal with a wide range of interesting operations from sped-up cryptographic functions, random number generation to fused mathematical operations and various CPU/cache synchronization primitives.</p>
<p>Unlike the previous cases mentioned, the new intrinsic wrappers in .NET Core don’t shy away from providing <em>model and architecture specific</em> intrinsics, even in cases where only a small portion of actual CPUs might support them. In addition, a <code class="highlighter-rouge">.IsHardwareAccelerated</code> property was sprinkled all over the BCL classes providing intrinsics, to allow runtime discovery of what the CPU supports.</p>
<p>On the performance/latency side, which is the focus of this series, we often find that intrinsics can replace tens of CPU instructions with one or two, while possibly also eliminating branches (sometimes more important than using fewer instructions…). This is compounded by the fact that the simplified instruction stream makes it possible for a modern CPU to “see” the dependencies between instructions (or lack thereof!) more clearly, and safely attempt to run multiple instructions in parallel even inside a <strong>single CPU core</strong>.</p>
<p>While there are also some downsides to using intrinsics, I’ll discuss some of those at the end of the second post; by then, I hope my warnings will fall on more welcoming ears.<br />
Personally, I’m more than ready to take that plunge, so with that long preamble out of the way, let’s describe our starting point:</p>
<h2 id="the-bitmap-getnthbitoffset">The Bitmap, GetNthBitOffset()</h2>
<p>To keep it short, I’m purposely going to ignore the broader context that the code we are about to discuss is a key part of (If there is interest, I may write a separate post about it).
For now, let’s accept that we have a god-given assignment in the form of a function that we really want to optimize the hell out of, without stopping to ask “Why?”.</p>
<h3 id="the-bitmap">The Bitmap</h3>
<p>This is dead simple: we have a bitmap which is potentially thousands or tens of thousands of bits long, which we will store somewhere as an <code class="highlighter-rouge">ulong[]</code>:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="k">const</span> <span class="kt">int</span> <span class="n">THIS_MANY_BITS</span> <span class="p">=</span> <span class="m">66666</span><span class="p">;</span>
<span class="kt">ulong</span><span class="p">[]</span> <span class="n">bits</span> <span class="p">=</span> <span class="k">new</span> <span class="kt">ulong</span><span class="p">[(</span><span class="n">THIS_MANY_BITS</span> <span class="p">/</span> <span class="m">64</span><span class="p">)</span> <span class="p">+</span> <span class="m">1</span><span class="p">];</span> <span class="c1">// enough room for everyone</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>The <code class="highlighter-rouge">bits</code> array in the sample above is continuously being mutated, and as bits go, this is going to be in the form of bits being turned on and off in no particular order, so imagine:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="kt">var</span> <span class="n">r</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">Random</span><span class="p">((</span><span class="kt">int</span><span class="p">)(</span><span class="n">DateTime</span><span class="p">.</span><span class="n">Now</span><span class="p">.</span><span class="n">Ticks</span> <span class="p">%</span> <span class="kt">int</span><span class="p">.</span><span class="n">MaxValue</span><span class="p">));</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">var</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p"><</span> <span class="n">bits</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
<span class="n">bits</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="p">=</span> <span class="k">unchecked</span><span class="p">(((</span><span class="kt">ulong</span><span class="p">)</span><span class="n">r</span><span class="p">.</span><span class="nf">Next</span><span class="p">())</span> <span class="p"><<</span> <span class="m">32</span> <span class="p">|</span> <span class="p">((</span><span class="kt">ulong</span><span class="p">)</span> <span class="n">r</span><span class="p">.</span><span class="nf">Next</span><span class="p">()));</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<h3 id="the-search-method">The Search Method</h3>
<p>We’re about to describe one of the two methods that I optimized.
I chose this particular method since it was the more challenging one to optimize. But before describing it, a short disclaimer is in order:</p>
<p>The method is implemented with <code class="highlighter-rouge">unsafe</code> and <code class="highlighter-rouge">ulong *</code> rather than the managed/safe variants (<code class="highlighter-rouge">ulong[]</code> or <code class="highlighter-rouge">Span<ulong></code>). The reason I’m using <code class="highlighter-rouge">unsafe</code> is that for this type of code, which makes up a double-digit percentage of our CPU time, bounds-checking can be very destructive for performance; additionally, in the context of this series, where I’m about to compare C# with C++, it gives us an apples-to-apples comparison, since C++ is normally compiled without bounds-checking.</p>
<p>With that out of the way, let’s inspect the method signature:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre><span class="k">unsafe</span> <span class="kt">int</span> <span class="nf">GetNthBitOffset</span><span class="p">(</span><span class="kt">ulong</span> <span class="p">*</span><span class="n">bits</span><span class="p">,</span> <span class="kt">int</span> <span class="n">numBits</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">);</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>This method runs over the entire bitmap until it finds the n<sup>th</sup> bit with the value <code class="highlighter-rouge">1</code>, or as I will refer to it here-on, our <em>target-bit</em>, and returns its bit offset within the bitmap as its return value.
For brevity we <em>assume</em> that incoming values of <code class="highlighter-rouge">n</code> are never below <code class="highlighter-rouge">1</code> or above the number of <code class="highlighter-rouge">1</code> bits in the bitmap.</p>
<p>Here’s a super naive implementation that achieves this:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
</pre></td><td class="rouge-code"><pre><span class="k">public</span> <span class="k">static</span> <span class="k">unsafe</span> <span class="kt">int</span> <span class="nf">Naive</span><span class="p">(</span><span class="kt">ulong</span><span class="p">*</span> <span class="n">bits</span><span class="p">,</span> <span class="kt">int</span> <span class="n">numBits</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">var</span> <span class="n">b</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>
<span class="kt">var</span> <span class="k">value</span> <span class="p">=</span> <span class="p">*</span><span class="n">bits</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">leftInULong</span> <span class="p">=</span> <span class="m">64</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="p"><</span> <span class="n">numBits</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">((</span><span class="k">value</span> <span class="p">&</span> <span class="m">0x1U</span><span class="n">L</span><span class="p">)</span> <span class="p">==</span> <span class="m">0x1U</span><span class="n">L</span><span class="p">)</span>
<span class="n">i</span><span class="p">++;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="p">==</span> <span class="n">n</span><span class="p">)</span>
<span class="k">break</span><span class="p">;</span>
<span class="k">value</span> <span class="p">>>=</span> <span class="m">1</span><span class="p">;</span>
<span class="n">leftInULong</span><span class="p">--;</span>
<span class="n">b</span><span class="p">++;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">leftInULong</span> <span class="p">!=</span> <span class="m">0</span><span class="p">)</span> <span class="c1">// Still more bits left in this ulong?</span>
<span class="k">continue</span><span class="p">;</span>
<span class="k">value</span> <span class="p">=</span> <span class="p">*(++</span><span class="n">bits</span><span class="p">);</span> <span class="c1">// Load a new 64 bit value </span>
<span class="n">leftInULong</span> <span class="p">=</span> <span class="m">64</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">b</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<h3 id="initial-performance">Initial Performance</h3>
<p>This implementation is obviously pretty bad, performance-wise. <em>But wait</em>: there are lots of ways to improve upon it: bit-twiddling hacks, LUTs, and what we’re here for: processor intrinsics.</p>
<p>Our next step is to start measuring this, and we’ll move on to better and better versions of this method, until we exhaust <em>my</em> abilities to make this go any faster.<br />
Using everyone’s favorite CLR microbenchmarking tool, <a href="https://benchmarkdotnet.org/">BDN</a>, I wrote a small harness that preallocates a huge array of bits, fills it up with random values (roughly 50% <code class="highlighter-rouge">0</code>/<code class="highlighter-rouge">1</code>), then executes the benchmark(s) over this array looking for all the offsets of <strong>lit</strong> bits <strong>up to</strong> <code class="highlighter-rouge">N</code>, where <code class="highlighter-rouge">N</code> is parametrized to be: 1, 4, 16, 64, 256, 1024, 4096, 16384, 65536.
The benchmark code looks roughly like this:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="k">const</span> <span class="kt">int</span> <span class="n">KB</span> <span class="p">=</span> <span class="m">1024</span><span class="p">;</span>
<span class="p">[</span><span class="nf">Params</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">4</span><span class="p">,</span> <span class="m">16</span><span class="p">,</span> <span class="m">64</span><span class="p">,</span> <span class="m">256</span><span class="p">,</span> <span class="m">1</span><span class="p">*</span><span class="n">KB</span><span class="p">,</span> <span class="m">4</span><span class="p">*</span><span class="n">KB</span><span class="p">,</span> <span class="m">16</span><span class="p">*</span><span class="n">KB</span><span class="p">,</span> <span class="m">64</span><span class="p">*</span><span class="n">KB</span><span class="p">)]</span>
<span class="k">public</span> <span class="kt">int</span> <span class="n">N</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
<span class="k">protected</span> <span class="k">unsafe</span> <span class="kt">ulong</span> <span class="p">*</span><span class="n">_bits</span><span class="p">;</span>
<span class="p">...</span>
<span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span>
<span class="k">public</span> <span class="k">unsafe</span> <span class="kt">int</span> <span class="nf">Naive</span><span class="p">()</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">sum</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">var</span> <span class="n">i</span> <span class="p">=</span> <span class="m">1</span><span class="p">;</span> <span class="n">i</span> <span class="p"><=</span> <span class="n">N</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
<span class="n">sum</span> <span class="p">+=</span> <span class="n">GetNthBitOffset</span><span class="p">.</span><span class="nf">Naive</span><span class="p">(</span><span class="n">_bits</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
<span class="k">return</span> <span class="n">sum</span><span class="p">;</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>For those in the know, I’m NOT using BDN’s <code class="highlighter-rouge">OperationsPerInvoke()</code> to normalize for <code class="highlighter-rouge">N</code> since the benchmark is looping over the entire bitmap, and the performance varies wildly throughout the loop.</p>
<p>Running this gives us the following results:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>N</th>
<th style="text-align: right">Mean (ns)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Naive</td>
<td>1</td>
<td style="text-align: right">1.185</td>
</tr>
<tr>
<td>Naive</td>
<td>4</td>
<td style="text-align: right">35.308</td>
</tr>
<tr>
<td>Naive</td>
<td>16</td>
<td style="text-align: right">605.021</td>
</tr>
<tr>
<td>Naive</td>
<td>64</td>
<td style="text-align: right">6,368.355</td>
</tr>
<tr>
<td>Naive</td>
<td>256</td>
<td style="text-align: right">99,448.636</td>
</tr>
<tr>
<td>Naive</td>
<td>1024</td>
<td style="text-align: right">2,057,984.353</td>
</tr>
<tr>
<td>Naive</td>
<td>4096</td>
<td style="text-align: right">68,728,413.667</td>
</tr>
<tr>
<td>Naive</td>
<td>16384</td>
<td style="text-align: right">1,365,698,984.333</td>
</tr>
<tr>
<td>Naive</td>
<td>65536</td>
<td style="text-align: right">22,669,217,647.333</td>
</tr>
</tbody>
</table>
<p>A couple of comments about these results:</p>
<ol>
<li>Small numbers of bits actually work out OK-ish given how bad the code is.</li>
<li>Yes, finding all the offsets of the first 64k lit bits (so 64K calls times average length of 64K bits processed per call<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote">2</a></sup>) takes a whopping 22+ seconds…</li>
</ol>
<h3 id="prepping-the-machine--clr--environmental-information">Prepping the Machine / CLR + Environmental information</h3>
<p>Here is the BDN environmental data about my machine:</p>
<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre><span class="py">BenchmarkDotNet</span><span class="p">=</span><span class="s">v0.11.0, OS=ubuntu 18.04</span>
<span class="err">Intel</span> <span class="err">Core</span> <span class="err">i7-7700HQ</span> <span class="err">CPU</span> <span class="err">2.80GHz</span> <span class="err">(Sky</span> <span class="err">Lake),</span> <span class="err">1</span> <span class="err">CPU,</span> <span class="err">4</span> <span class="err">logical</span> <span class="err">and</span> <span class="err">4</span> <span class="err">physical</span> <span class="err">cores</span>
<span class="err">.NET</span> <span class="err">Core</span> <span class="py">SDK</span><span class="p">=</span><span class="s">3.0.100-alpha1-20180720-2</span>
<span class="nn">[Host]</span> <span class="err">:</span> <span class="err">.NET</span> <span class="err">Core</span> <span class="err">3.0.0-preview1-26814-05</span> <span class="err">(CoreCLR</span> <span class="err">4.6.26814.06,</span> <span class="err">CoreFX</span> <span class="err">4.6.26814.01),</span> <span class="err">64bit</span> <span class="err">RyuJIT</span>
<span class="err">ShortRun</span> <span class="err">:</span> <span class="err">.NET</span> <span class="err">Core</span> <span class="err">?</span> <span class="err">(CoreCLR</span> <span class="err">4.6.26814.06,</span> <span class="err">CoreFX</span> <span class="err">4.6.26814.01),</span> <span class="err">64bit</span> <span class="err">RyuJIT</span>
<span class="py">Job</span><span class="p">=</span><span class="s">ShortRun Toolchain=3.0.100-alpha1-20180720-2 IterationCount=3 </span>
<span class="py">LaunchCount</span><span class="p">=</span><span class="s">1 WarmupCount=3 </span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Keen eyes will notice I’m running this with .NET Core 3.0 pre-alpha / preview.
While this is completely unnecessary for the code we’ve seen so far, the next variations will actually depend on having .NET Core 3.0 around, so I ran the whole benchmark set with 3.0.</p>
<p>I’m using an excellent <a href="https://github.com/damageboy/bitcrap/blob/master/prep.sh">prep.sh</a> originally prepared by <a href="https://www.alexgallego.org/">Alexander Gallego</a> that basically kills the Turbo effect on modern CPUs by pinning the min/max frequencies to the base clock of the machine (i.e. what you would get when running at 100% CPU on all cores).</p>
<p>My laptop has an <a href="https://ark.intel.com/products/97185/Intel-Core-i7-7700HQ-Processor-6M-Cache-up-to-3_80-GHz">Intel i7 Skylake processor model 7700HQ</a> with a base frequency of 2.8Ghz, so I ran the following commands on my laptop as <code class="highlighter-rouge">root</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="nb">source </span>prep.sh <span class="c"># to get the bash functions used below</span>
cpu_enable_performance_cpupower_state
cpu_set_min_frequencies 2800000
cpu_set_max_frequencies 2800000
cpu_available_frequencies <span class="c"># should print 2800000 for all 4 cores, in my case</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>This is done so that the numbers presented here are applicable for multi-core machines running this code on all cores, and so that very short benchmarks don’t get skewed results compared to longer benchmarks due to CPU frequency scaling.</p>
<h2 id="popcount-without-popcnt">PopCount() without POPCNT</h2>
<p>Now that we have the initial code out of the way, we’re not going to look at it anymore. The next version will use bit-twiddling hacks in order to count larger groups of bits much faster.</p>
<p>We’ll introduce two pure C# functions that implement <a href="https://en.wikipedia.org/wiki/Hamming_weight">population counts</a>:</p>
<blockquote>
<p>The <strong>Hamming weight</strong> of a <a href="https://en.wikipedia.org/wiki/String_(computer_science)">string</a> is the number of symbols that are different from the zero-symbol of the <a href="https://en.wikipedia.org/wiki/Alphabet">alphabet</a> used. It is thus equivalent to the <a href="https://en.wikipedia.org/wiki/Hamming_distance">Hamming distance</a> from the all-zero string of the same length. For the most typical case, a string of <a href="https://en.wikipedia.org/wiki/Bit">bits</a>, this is the number of 1’s in the string, or the <a href="https://en.wikipedia.org/wiki/Digit_sum">digit sum</a> of the <a href="https://en.wikipedia.org/wiki/Binary_numeral_system">binary representation</a> of a given number and the <a href="https://en.wikipedia.org/wiki/Taxicab_geometry"><em>ℓ</em>₁ norm</a> of a bit vector. In this binary case, it is also called the <strong>population count</strong>,[<a href="https://en.wikipedia.org/wiki/Hamming_weight#cite_note-Warren_2013-1">1]</a> <strong>popcount</strong>, <strong>sideways sum</strong>,[<a href="https://en.wikipedia.org/wiki/Hamming_weight#cite_note-Knuth_2009-2">2]</a> or <strong>bit summation</strong>.[<a href="https://en.wikipedia.org/wiki/Hamming_weight#cite_note-HP-16C_1982-3">3]</a></p>
</blockquote>
<p>Ultimately, one of the key processor intrinsics we will use is… <code class="highlighter-rouge">POPCNT</code> which does exactly this, as a single instruction at the processor level, but for now, we will implement a <code class="highlighter-rouge">PopCount()</code> method without those intrinsics, for 64/32 bit inputs.<br />
Apart from <code class="highlighter-rouge">PopCount()</code> we will also define a <code class="highlighter-rouge">TrailingZeroCount()</code><sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote">3</a></sup> method that counts trailing zero bits. I chose an implementation that uses <code class="highlighter-rouge">PopCount()</code> internally.<br />
Here are the two <code class="highlighter-rouge">PopCount()</code> and <code class="highlighter-rouge">TrailingZeroCount()</code> methods, shamelessly stolen from around the interwebs via <a href="https://github.com/hcs0/Hackers-Delight/blob/master/pop.c.txt">Hacker’s Delight</a>:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
</pre></td><td class="rouge-code"><pre><span class="k">public</span> <span class="k">class</span> <span class="nc">HackersDelight</span>
<span class="p">{</span>
<span class="k">public</span> <span class="k">static</span> <span class="kt">int</span> <span class="nf">PopCount</span><span class="p">(</span><span class="kt">ulong</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">b</span> <span class="p">-=</span> <span class="p">(</span><span class="n">b</span> <span class="p">>></span> <span class="m">1</span><span class="p">)</span> <span class="p">&</span> <span class="m">0x5555555555555555</span><span class="p">;</span>
<span class="n">b</span> <span class="p">=</span> <span class="p">(</span><span class="n">b</span> <span class="p">&</span> <span class="m">0x3333333333333333</span><span class="p">)</span> <span class="p">+</span> <span class="p">((</span><span class="n">b</span> <span class="p">>></span> <span class="m">2</span><span class="p">)</span> <span class="p">&</span> <span class="m">0x3333333333333333</span><span class="p">);</span>
<span class="n">b</span> <span class="p">=</span> <span class="p">(</span><span class="n">b</span> <span class="p">+</span> <span class="p">(</span><span class="n">b</span> <span class="p">>></span> <span class="m">4</span><span class="p">))</span> <span class="p">&</span> <span class="m">0x0f0f0f0f0f0f0f0f</span><span class="p">;</span>
<span class="k">return</span> <span class="k">unchecked</span><span class="p">((</span><span class="kt">int</span><span class="p">)</span> <span class="p">((</span><span class="n">b</span> <span class="p">*</span> <span class="m">0x0101010101010101</span><span class="p">)</span> <span class="p">>></span> <span class="m">56</span><span class="p">));</span>
<span class="p">}</span>
<span class="k">public</span> <span class="k">static</span> <span class="kt">int</span> <span class="nf">PopCount</span><span class="p">(</span><span class="kt">uint</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">b</span> <span class="p">-=</span> <span class="p">(</span><span class="n">b</span> <span class="p">>></span> <span class="m">1</span><span class="p">)</span> <span class="p">&</span> <span class="m">0x55555555</span><span class="p">;</span>
<span class="n">b</span> <span class="p">=</span> <span class="p">(</span><span class="n">b</span> <span class="p">&</span> <span class="m">0x33333333</span><span class="p">)</span> <span class="p">+</span> <span class="p">((</span><span class="n">b</span> <span class="p">>></span> <span class="m">2</span><span class="p">)</span> <span class="p">&</span> <span class="m">0x33333333</span><span class="p">);</span>
<span class="n">b</span> <span class="p">=</span> <span class="p">(</span><span class="n">b</span> <span class="p">+</span> <span class="p">(</span><span class="n">b</span> <span class="p">>></span> <span class="m">4</span><span class="p">))</span> <span class="p">&</span> <span class="m">0x0f0f0f0f</span><span class="p">;</span>
<span class="k">return</span> <span class="k">unchecked</span><span class="p">((</span><span class="kt">int</span><span class="p">)</span> <span class="p">((</span><span class="n">b</span> <span class="p">*</span> <span class="m">0x01010101</span><span class="p">)</span> <span class="p">>></span> <span class="m">24</span><span class="p">));</span>
<span class="p">}</span>
<span class="k">public</span> <span class="k">static</span> <span class="kt">int</span> <span class="nf">TrailingZeroCount</span><span class="p">(</span><span class="kt">uint</span> <span class="n">x</span><span class="p">)</span> <span class="p">=></span> <span class="nf">PopCount</span><span class="p">(~</span><span class="n">x</span> <span class="p">&</span> <span class="p">(</span><span class="n">x</span> <span class="p">-</span> <span class="m">1</span><span class="p">));</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>These methods can quickly, and <strong>without</strong> a single branch instruction, count the lit bits in 64/32-bit words, using just 12 arithmetic operations, most of them simple bit operations, and only one (!) multiplication.</p>
<p>With our bit-twiddling optimized functions implemented and out of the way, let’s put them to good use in a new implementation, and make a few changes in the flow of the code:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
</pre></td><td class="rouge-code"><pre><span class="k">using</span> <span class="nn">static</span> <span class="n">BitGoo</span><span class="p">.</span><span class="n">HackersDelight</span><span class="p">;</span>
<span class="k">public</span> <span class="k">static</span> <span class="k">unsafe</span> <span class="kt">int</span> <span class="nf">NoIntrisics</span><span class="p">(</span><span class="kt">ulong</span><span class="p">*</span> <span class="n">bits</span><span class="p">,</span> <span class="kt">int</span> <span class="n">numBits</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// (1)</span>
<span class="kt">var</span> <span class="n">p64</span> <span class="p">=</span> <span class="n">bits</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">prevN</span><span class="p">;</span>
<span class="k">do</span> <span class="p">{</span>
<span class="n">prevN</span> <span class="p">=</span> <span class="n">n</span><span class="p">;</span>
<span class="n">n</span> <span class="p">-=</span> <span class="nf">PopCount</span><span class="p">(*</span><span class="n">p64</span><span class="p">);</span>
<span class="n">p64</span><span class="p">++;</span>
<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">n</span> <span class="p">></span> <span class="m">0</span><span class="p">);</span>
<span class="c1">// (2)</span>
<span class="kt">var</span> <span class="n">p32</span> <span class="p">=</span> <span class="p">(</span><span class="kt">uint</span> <span class="p">*)</span> <span class="p">(</span><span class="n">p64</span> <span class="p">-</span> <span class="m">1</span><span class="p">);</span>
<span class="n">n</span> <span class="p">=</span> <span class="n">prevN</span> <span class="p">-</span> <span class="nf">PopCount</span><span class="p">(*</span><span class="n">p32</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="p">></span> <span class="m">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">prevN</span> <span class="p">=</span> <span class="n">n</span><span class="p">;</span>
<span class="n">p32</span><span class="p">++;</span>
<span class="p">}</span>
<span class="c1">// (3)</span>
<span class="kt">var</span> <span class="n">prevValue</span> <span class="p">=</span> <span class="p">*</span><span class="n">p32</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">pos</span> <span class="p">=</span> <span class="p">(</span><span class="n">p32</span> <span class="p">-</span> <span class="p">(</span><span class="kt">uint</span><span class="p">*)</span> <span class="n">bits</span><span class="p">)</span> <span class="p">*</span> <span class="m">32</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">prevN</span> <span class="p">></span> <span class="m">0</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">var</span> <span class="n">bp</span> <span class="p">=</span> <span class="nf">TrailingZeroCount</span><span class="p">(</span><span class="n">prevValue</span><span class="p">)</span> <span class="p">+</span> <span class="m">1</span><span class="p">;</span>
<span class="n">pos</span> <span class="p">+=</span> <span class="n">bp</span><span class="p">;</span>
<span class="n">prevN</span><span class="p">--;</span>
<span class="n">prevValue</span> <span class="p">>>=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="n">bp</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="p">(</span><span class="n">pos</span> <span class="p">-</span> <span class="m">1</span><span class="p">);</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>
<p>Our new approach to solving this goes like this (comments correspond to blocks of the code above):</p>
<ol>
<li>As long as we <strong>still</strong> need to look for <em>any</em> <code class="highlighter-rouge">1</code> bits, we loop, calling <code class="highlighter-rouge">PopCount()</code> until we finally consume more bits than we were tasked with… At that stage, our <code class="highlighter-rouge">p64</code> pointer is pointing one <code class="highlighter-rouge">ulong</code> beyond the <code class="highlighter-rouge">ulong</code> containing our target-bit, and <code class="highlighter-rouge">prevN</code> contains the number of consumed <code class="highlighter-rouge">1</code> bits that was still correct one <code class="highlighter-rouge">ulong</code> before.</li>
<li>Once we’re out of the loop, we know that our target-bit is hiding somewhere <em>within</em> that last 64-bit <code class="highlighter-rouge">ulong</code>, so we use a single 32-bit <code class="highlighter-rouge">PopCount()</code> to figure out whether it’s within the first or second 32-bit word making up <em>that</em> 64-bit word, and update the bit-counts / <code class="highlighter-rouge">p32</code> pointer accordingly.</li>
<li>Now that we know <code class="highlighter-rouge">p32</code> is pointing at the 32-bit word containing our target-bit, we find it by using <code class="highlighter-rouge">TrailingZeroCount()</code> and right-shifting in a loop until we reach the target bit’s position within the word, finally returning the offset once we’re done.</li>
</ol>
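<p>The three numbered blocks above can be sketched end-to-end as a small, self-contained method. This is a simplified reconstruction for illustration, not the actual BitGoo code: the name <code class="highlighter-rouge">GetNthBitOffset</code>, the array-indexed (rather than pointer-based) traversal, and the software helpers are all my own stand-ins, and <code class="highlighter-rouge">n</code> is assumed to be a valid 1-based count of set bits present in the bitmap:</p>

```csharp
using System;

static class NthBitSketch
{
    // Software fallbacks standing in for the hand-rolled helpers from the post.
    static int PopCount(ulong v)
    {
        var c = 0;
        while (v != 0) { v &= v - 1; c++; } // Kernighan: clear lowest set bit
        return c;
    }

    static int TrailingZeroCount(uint v)
    {
        if (v == 0) return 32;
        var c = 0;
        while ((v & 1) == 0) { v >>= 1; c++; }
        return c;
    }

    // Returns the bit offset of the n-th (1-based) set bit in the bitmap.
    public static int GetNthBitOffset(ulong[] bits, int n)
    {
        var i = 0;
        var prevN = n;
        // (1) Consume whole 64-bit words until the current word overshoots.
        while (true)
        {
            var consumed = PopCount(bits[i]);
            if (consumed >= prevN) break;
            prevN -= consumed;
            i++;
        }
        // (2) Decide which 32-bit half of that final word holds the target bit.
        var lo = (uint) bits[i];
        var pos = (long) i * 64;
        var inLow = PopCount(lo);
        uint word;
        if (prevN > inLow) { prevN -= inLow; word = (uint)(bits[i] >> 32); pos += 32; }
        else word = lo;
        // (3) Walk the remaining set bits inside the 32-bit word.
        while (prevN > 0)
        {
            var bp = TrailingZeroCount(word) + 1;
            pos += bp;
            prevN--;
            word >>= bp;
        }
        return (int)(pos - 1);
    }
}
```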
<p>Let’s take a look at how this version fares:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>N</th>
<th style="text-align: right">Mean (ns)</th>
<th style="text-align: right">Scaled to “Naive”</th>
</tr>
</thead>
<tbody>
<tr>
<td>NoIntrinsics</td>
<td>1</td>
<td style="text-align: right">5.247</td>
<td style="text-align: right">4.19</td>
</tr>
<tr>
<td>NoIntrinsics</td>
<td>4</td>
<td style="text-align: right">43.919</td>
<td style="text-align: right">0.79</td>
</tr>
<tr>
<td>NoIntrinsics</td>
<td>16</td>
<td style="text-align: right">429.974</td>
<td style="text-align: right">0.58</td>
</tr>
<tr>
<td>NoIntrinsics</td>
<td>64</td>
<td style="text-align: right">2,986.498</td>
<td style="text-align: right">0.44</td>
</tr>
<tr>
<td>NoIntrinsics</td>
<td>256</td>
<td style="text-align: right">16,492.408</td>
<td style="text-align: right">0.16</td>
</tr>
<tr>
<td>NoIntrinsics</td>
<td>1024</td>
<td style="text-align: right">112,049.075</td>
<td style="text-align: right">0.06</td>
</tr>
<tr>
<td>NoIntrinsics</td>
<td>4096</td>
<td style="text-align: right">1,058,565.813</td>
<td style="text-align: right">0.02</td>
</tr>
<tr>
<td>NoIntrinsics</td>
<td>16384</td>
<td style="text-align: right">13,714,191.734</td>
<td style="text-align: right">0.010</td>
</tr>
<tr>
<td>NoIntrinsics</td>
<td>65536</td>
<td style="text-align: right">206,236,218.000</td>
<td style="text-align: right">0.009</td>
</tr>
</tbody>
</table>
<p>Quite an improvement already! To be fair, our starting point being so low helped a lot, but it’s an improvement nonetheless.
As a side note, this is essentially the code I’m running on our own bitmaps in production right now, since intrinsics aren’t available to me there yet.</p>
<p>If there’s one column our focus should gravitate towards, it’s the “Scaled” column on the right of the table. Each result here is scaled to its corresponding <code class="highlighter-rouge">Naive</code> version:</p>
<ul>
<li>For any bit length &lt; 16, the old version runs faster, but only marginally so in absolute terms.</li>
<li>Once we hit <code class="highlighter-rouge">N == 16</code> and upwards, the landscape changes dramatically and our bit-twiddling <code class="highlighter-rouge">PopCount()</code> starts paying off big-time: the speedup at 64 is already &gt; 100%, climbing all the way to an 11,100% speedup @ 64K.</li>
</ul>
<h2 id="coreclr--architecture-dependent-intrinsics">CoreCLR & Architecture Dependent Intrinsics</h2>
<p>Let us remind ourselves where things stand at the time of writing this post, when it comes to using intrinsics in CoreCLR:</p>
<ul>
<li>.NET Core 2.1 was released on May 30<sup>th</sup> 2018, with Intrinsics released as a “preview” feature:
<ul>
<li>The 2.1 JIT kind of knows how to handle <em>some</em> intrinsics.</li>
<li>To actually use them, we need to use the dotnet-core myget feed and install an experimental nuget package that provides the API surface for the intrinsics.</li>
<li>No commitments were made that things would be stable/working.</li>
</ul>
</li>
<li>.NET Core 3.0 is the official (so far?) target release for intrinsics support in .NET Core:
<ul>
<li>Considerably more intrinsics are supported than what was available with 2.1.</li>
<li>No extra nuget package is required (intrinsics are part of the SDK).</li>
<li>Work is still being very actively done to add more intrinsics and improve the quality of what is already there.</li>
</ul>
</li>
</ul>
<p>As we require intrinsics that were not available with 2.1, the code in the <a href="https://github.com/damageboy/bitgoo">repo</a> targets a pre-alpha1 version of .NET Core 3.0 (i.e. <code class="highlighter-rouge">netcoreapp3.0</code>).</p>
<p>For people wanting to run this code, it’s relatively easy to do so, and non-destructive to your current setup:</p>
<ol>
<li>
<p>Go to the <a href="https://github.com/dotnet/core-sdk#installers-and-binaries">Installers and Binaries</a> section of the core-sdk project.</p>
</li>
<li>
<p>The left-most column contains .NET Core master branch builds (3.0.x Runtime).</p>
</li>
<li>
<p>Download the appropriate installer in <code class="highlighter-rouge">.zip</code> / <code class="highlighter-rouge">.tar.gz</code> form: I used the <a href="https://dotnetcli.blob.core.windows.net/dotnet/Sdk/master/dotnet-sdk-latest-linux-x64.tar.gz">linux</a> one, but the <a href="https://dotnetcli.blob.core.windows.net/dotnet/Sdk/master/dotnet-sdk-latest-win-x64.zip">windows</a> / <a href="https://dotnetcli.blob.core.windows.net/dotnet/Sdk/master/dotnet-sdk-latest-osx-x64.tar.gz">osx</a> ones should be just as good.</p>
</li>
<li>
<p>unzip/untar the installer somewhere (*Nix users beware: Microsoft does this entirely inhumane thing of packaging the contents of their distribution at the top level of the <code class="highlighter-rouge">.tar.gz</code>, so be sure to <code class="highlighter-rouge">mkdir dotnet; tar -C dotnet -xf /path/to/where/you/downloaded/the/tar.gz</code> to avoid heart-ache).</p>
</li>
<li>
<p>Adjust your <code class="highlighter-rouge">PATH</code> env. to find the <code class="highlighter-rouge">dotnet</code> executable in the new folder you just unzipped to, before anywhere else. (I did this locally in my terminal session).</p>
</li>
<li>
<p>You should now be able to <code class="highlighter-rouge">dotnet restore|build|run|test</code> the BitGoo project(s).</p>
</li>
<li>
<p>Just to be on the safe side, here is what <code class="highlighter-rouge">dotnet --info</code> prints for me:</p>
<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre><span class="err">.NET</span> <span class="err">Core</span> <span class="err">SDK</span> <span class="err">(reflecting</span> <span class="err">any</span> <span class="err">global.json):</span>
<span class="err">Version:</span> <span class="err">3.0.100-alpha1-20180720-2</span>
<span class="err">Commit:</span> <span class="err">82bd85d0a9</span>
<span class="err">Runtime</span> <span class="err">Environment:</span>
<span class="err">OS</span> <span class="err">Name:</span> <span class="err">ubuntu</span>
<span class="err">OS</span> <span class="err">Version:</span> <span class="err">18.04</span>
<span class="err">OS</span> <span class="err">Platform:</span> <span class="err">Linux</span>
<span class="err">RID:</span> <span class="err">ubuntu.18.04-x64</span>
<span class="err">...</span> <span class="c"># No one really cares that much
</span></pre></td></tr></tbody></table></code></pre></div> </div>
</li>
</ol>
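<p>For *nix users, steps 1–6 above boil down to something like the following session sketch (the URL is the Linux one from step 3; adjust for your OS, and keep in mind that <code class="highlighter-rouge">dotnet-sdk-latest</code> is a moving target):</p>

```shell
# Download the latest master-branch SDK build (Linux x64 here).
curl -LO https://dotnetcli.blob.core.windows.net/dotnet/Sdk/master/dotnet-sdk-latest-linux-x64.tar.gz

# The tarball has no top-level directory, so extract it into a fresh one.
mkdir dotnet
tar -C dotnet -xf dotnet-sdk-latest-linux-x64.tar.gz

# Put the new dotnet first on PATH for this shell session only.
export PATH="$PWD/dotnet:$PATH"

# Sanity check: this should report a 3.0.100-alpha1 SDK, as shown below.
dotnet --info
```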
<h2 id="using-popcnt--tzcnt">Using POPCNT & TZCNT</h2>
<p>The next step will be to replace our bit-twiddling <code class="highlighter-rouge">PopCount()</code> code with the <code class="highlighter-rouge">PopCount()</code> intrinsic provided by <code class="highlighter-rouge">System.Runtime.Intrinsics.X86.Popcnt</code> class in the 3.0 BCL, which should be replaced by a single CPU <code class="highlighter-rouge">POPCNT</code> instruction by the JIT at runtime.
In addition, we will also use the <code class="highlighter-rouge">BMI1</code> (<strong>B</strong>it <strong>M</strong>anipulation <strong>I</strong>ntrinsics <strong>1</strong>) <code class="highlighter-rouge">TrailingZeroCount()</code> intrinsic which maps to the <code class="highlighter-rouge">TZCNT</code> instruction.</p>
<p>These instructions do exactly what our previous hand-written implementation did, except that it’s all done with dedicated circuitry in our CPUs: it takes up fewer instructions in the instruction stream, runs faster, and can be parallelized internally inside the processor.
I was very careful in the last post / code-sample to use the exact same function name(s) as the intrinsics provided by the 3.0 BCL, so the code change really comes down to adjusting the two top <code class="highlighter-rouge">using static</code> statements:</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre><span class="k">using</span> <span class="nn">static</span> <span class="n">System</span><span class="p">.</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Intrinsics</span><span class="p">.</span><span class="n">X86</span><span class="p">.</span><span class="n">Popcnt</span><span class="p">;</span>
<span class="k">using</span> <span class="nn">static</span> <span class="n">System</span><span class="p">.</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Intrinsics</span><span class="p">.</span><span class="n">X86</span><span class="p">.</span><span class="n">Bmi1</span><span class="p">;</span>
<span class="c1">// Rest of the code is the same...</span>
</pre></td></tr></tbody></table></code></pre></div></div>
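<p>One caveat worth making explicit before declaring victory: <code class="highlighter-rouge">POPCNT</code> and <code class="highlighter-rouge">BMI1</code> are not guaranteed to exist on every x86 CPU, so the BCL convention is to guard intrinsic calls with the JIT-time <code class="highlighter-rouge">IsSupported</code> property. Here’s a hedged sketch of that dispatch pattern, written against the final 3.0 API shape (<code class="highlighter-rouge">Popcnt.X64</code> for the 64-bit variant; the pre-alpha surface differed slightly), with fallback bodies along the lines of the bit-twiddling versions from the previous post:</p>

```csharp
using System;
using System.Runtime.Intrinsics.X86;

static class PopCountDispatch
{
    public static int PopCount(ulong value)
    {
        // JIT-time constant: on supporting CPUs this compiles down to POPCNT.
        if (Popcnt.X64.IsSupported)
            return (int) Popcnt.X64.PopCount(value);
        // Portable fallback: clear the lowest set bit until none remain.
        var count = 0;
        while (value != 0) { value &= value - 1; count++; }
        return count;
    }

    public static int TrailingZeroCount(uint value)
    {
        // Maps to the BMI1 TZCNT instruction when available.
        if (Bmi1.IsSupported)
            return (int) Bmi1.TrailingZeroCount(value);
        // TZCNT semantics: a zero input yields the operand size.
        if (value == 0) return 32;
        var count = 0;
        while ((value & 1) == 0) { value >>= 1; count++; }
        return count;
    }
}
```

<p>Because <code class="highlighter-rouge">IsSupported</code> is a constant as far as the JIT is concerned, the untaken branch is eliminated entirely from the generated machine code, so the guard costs nothing at runtime.</p>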
<p>That’s it! We’re using intrinsics, all done!<br />
If you are having a hard time trusting me, here’s a <a href="https://github.com/damageboy/bitgoo/blob/master/csharp/BitGoo/GetNthBitOffset.POPCNTAndBMI1.cs">link to the complete code</a>.
Here are the results, this time scaled to the <code class="highlighter-rouge">NoIntrinsics()</code> version:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>N</th>
<th style="text-align: right">Mean (ns)</th>
<th style="text-align: right">Scaled to “NoIntrinsics”`</th>
</tr>
</thead>
<tbody>
<tr>
<td>POPCNTAndBMI1</td>
<td>1</td>
<td style="text-align: right">2.358</td>
<td style="text-align: right">0.44</td>
</tr>
<tr>
<td>POPCNTAndBMI1</td>
<td>4</td>
<td style="text-align: right">15.318</td>
<td style="text-align: right">0.35</td>
</tr>
<tr>
<td>POPCNTAndBMI1</td>
<td>16</td>
<td style="text-align: right">128.712</td>
<td style="text-align: right">0.31</td>
</tr>
<tr>
<td>POPCNTAndBMI1</td>
<td>64</td>
<td style="text-align: right">916.033</td>
<td style="text-align: right">0.27</td>
</tr>
<tr>
<td>POPCNTAndBMI1</td>
<td>256</td>
<td style="text-align: right">5,005.190</td>
<td style="text-align: right">0.30</td>
</tr>
<tr>
<td>POPCNTAndBMI1</td>
<td>1024</td>
<td style="text-align: right">44,606.327</td>
<td style="text-align: right">0.39</td>
</tr>
<tr>
<td>POPCNTAndBMI1</td>
<td>4096</td>
<td style="text-align: right">408,871.712</td>
<td style="text-align: right">0.39</td>
</tr>
<tr>
<td>POPCNTAndBMI1</td>
<td>16384</td>
<td style="text-align: right">5,205,533.285</td>
<td style="text-align: right">0.39</td>
</tr>
<tr>
<td>POPCNTAndBMI1</td>
<td>65536</td>
<td style="text-align: right">76,186,499.286</td>
<td style="text-align: right">0.37</td>
</tr>
</tbody>
</table>
<p>OK, now we’re talking…<br />
There can be no doubt that we have SOMETHING working: we can see a very substantial improvement across the board, for every value of <code class="highlighter-rouge">N</code>! <br />
There are still some weird things happening here that I cannot fully explain at this stage, namely how the scaling becomes relatively worse as <code class="highlighter-rouge">N</code> increases, but there is generally little to complain about.</p>
<p>For those who need to see assembly code to feel convinced, I’ve uploaded JITDumps to a <a href="https://gist.github.com/b4500d6b7157051551346107786ae4fa">gist</a>, where you can clearly see the various <code class="highlighter-rouge">POPCNT</code> / <code class="highlighter-rouge">TZCNT</code> instructions throughout the ASM code (scroll to the end of the dump…).</p>
<h3 id="whats-next">What’s Next?</h3>
<p>We’ve reached pretty far, and I hope it was interesting even if a bit introductory.<br />
In the next post, we’ll continue iterating on this task, introducing new intrinsics in the process, and encounter some “interesting” quirks.</p>
<p>If you feel like you’re up for it, the next post is <a href="/2018-08-19/netcoreapp3.0-intrinsics-in-real-life-pt2">here</a>…</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Worry not, I reported and <a href="https://github.com/dotnet/coreclr/issues/19555">opened an issue on CoreCLR</a> before even starting to write this post, and plan to do a deep-dive into it in the 3rd post. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Since our bitmap is filled with roughly 50% <code class="highlighter-rouge">0</code>/<code class="highlighter-rouge">1</code> values, searching for 64K lit bits means going over roughly 128K bits, as an example. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>The TrailingZeroCount() method I’ve used here is, from independent testing, the fastest for C#. There are others, but they either depend on having a compiler that can emit CMOV instructions (which CoreCLR doesn’t yet), or on using LUTs (Look-Up Tables), which I dislike since they tend to win benchmarks while losing in the bigger scope of wherever the code is actually used, so I have a semi-religious bias against them. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>