We will explore how developers can substantially improve the performance of computationally intensive code by using CPU intrinsics, and go over the new hardware intrinsics support introduced in the recently released .NET Core 3.0.
We'll take a non-trivial algorithm that is already part of the CoreCLR code-base and should be familiar to most devs, and (re)build an efficient intrinsics-based version of it with AVX and AVX2 instructions, using about 7 vectorized CPU instructions that can be combined in a challenging way.
We'll discuss common hurdles/benefits encountered while building/optimizing intrinsics-based code, such as:
- replacing scalar code with branch-less vectorized code;
- dealing with CPU branch mis-prediction;
- unrolling code to improve performance, and the complexity trade-off of doing so.
By the end of the talk, we'll show how we can beat CoreCLR's own C++ code with C# intrinsics utilizing AVX2 instructions.