Han Lee, Intel Corporation [email protected] Notices and Disclaimers

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, or service activation. Learn more at intel.com, or from the OEM or retailer.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request. No product or component can be absolutely secure.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548- 4725 or by visiting www.intel.com/design/literature.htm.

Intel, the Intel logo, and other Intel product and solution names in this presentation are trademarks of Intel

*Other names and brands may be claimed as the property of others

© Intel Corporation. 2 What Do These Have in Common?

Domain Example Image processing Color extraction High performance computing (HPC) Matrix multiplication Data processing Hamming code Text processing UTF-8 conversion Data structures Bit array Machine learning Classification

For performance sensitive code, consider using Intel® hardware intrinsics

3 Objectives

. (Very brief) Intro to SIMD . Design Motivation . Hardware Intrinsics . Call to Action

4 Single Instruction, Multiple Data (SIMD)

Perform same operation on multiple data using a single instruction . Intel Advanced Vector Extensions 2 (Intel® AVX2) supports 256-bit SIMD instructions – Add eight 32-bit integers using one vpaddd instruction

YMM1

SRC1 X7 X6 X5 X4 X3 X2 X1 X0 X7 X6 X5 X4 X3 X2 X1 X0 YMM2 + + + + + + + + Y7 Y6 Y5 Y4 Y3 Y2 Y1 Y0

======SRC2 Y7 Y6 Y5 Y4 Y3 Y2 Y1 Y0 YMM0 X7 + Y7 X6 + Y6 X5 + Y5 X4 + Y4 X3 + Y3 X2 + Y2 X1 + Y1 X0 + Y0

DEST X7 + Y7 X6 + Y6 X5 + Y5 X4 + Y4 X3 + Y3 X2 + Y2 X1 + Y1 X0 + Y0 8x ADD R32, R32

VPADDD YMM0, YMM1, YMM2

5 Single Instruction, Multiple Data (SIMD)

C# abstraction is System.Numerics.Vector* . System.Numerics.Vector – On Intel AVX2 system, Vector.Count is 8 – On Intel Streaming SIMD Extensions 2 (Intel SSE2) system, Vector.Count is 4

Vector v = New Vector(0, 1, 2, 3, 4, 5, 6, 7) + New Vector(0, 2, 4, 6, 8, 10, 12, 14)

v 21 18 15 12 9 6 3 0

6 Why?

*https://github.com/dotnet/corefx/issues/1168, https://github.com/dotnet/corefx/issues/2209, https://github.com/dotnet/corefx/issues/2025, https://github.com/dotnet/coreclr/issues/916

7 Hardware Intrinsics (a.k.a. Intrinsic functions / Platform Dependent Intrinsics)

Special functions that directly map to hardware instructions . Useful when – Algorithms can better be mapped to underlying hardware – Maximum control over code generation is desired . Mainstream /C++ have intrinsic functions – Field tested and proven to be useful – Provides design guidelines . Both SIMD and non-SIMD

8 Intel Hardware Intrinsics

. originally designed and proposed by Intel . Major enhancements with the help of .NET Core community and Microsoft . Implementation by Intel, Microsoft and .NET Core community . Namespace System.Runtime.Intrinsics contains platform-agnostic types & functions – E.g., Vector256, Vector256.GetLower() . New namespace System.Runtime.Intrinsics. contains 15 top-level classes corresponding to different Instruction Set Architectures (ISAs) – AES, AVX, AVX2, BMI1, BMI2, FMA, LZCNT, PCLMULQDQ, POPCNT, SSE, SSE2, SSE3, SSE4.1, SSE4.2, and SSSE3 . Available as an experimental features since .NET Core 2.1 . Available in .NET Core 3.0 Preview

9 Intel Hardware Intrinsics Design

. ISA class contains IsSupported boolean property to check hardware support . ISA class contains intrinsic methods corresponding to underlying instructions . Methods closely mirror C/C++ intrinsic function ///

/// int _mm_popcnt_u32 (unsigned int a) /// POPCNT reg, reg/m32 /// public static uint PopCount(uint value);

. Majority of Intel hardware intrinsics operate over Vector128/256 public static Vector256 Add(Vector256 left, Vector256 right) . Unsafe methods for operating over pointers public static unsafe Vector256 LoadVector256(sbyte* address)

10 Demo: Counting Set Bits

11 Using Intel Hardware Intrinsics in .NET Core

using System.Runtime.Intrinsics.X86; Import the namespace to use Intel HW intrinsic using System.Runtime.Intrinsics; Import the namespace to use Vector128/256 static int[] Func(int[] data) as needed { if (Avx2.IsSupported) Check hardware ISA support before using any { HW intrinsic // AVX2 implementation } The checks will be optimized away by the else if (Sse.IsSupported) Just-In-Time { // SSE implementation NOTE: Calling HW intrinsic on } unsupported hardware will result in else System.PlatformNotSupportedException { // Software scalar implementation } }

12 SOA-Based Ray Tracer

13 How to Vectorize a Ray-tracer?

Ray-tracer based on 3D vector computation . Most of the underlying structures can be abstracted to 3D vectors – Position/direction (x, y, z) in 3D space, color (R, G, B) Vectorize the underlying 3D vector computation . AoS (Array of Structure) . SoA (Structure of Array)

14 AoS vectorization

Example: Add 3D vectors of float on AVX-capable machine Scalar AoS

x1 y1 z1 + x2 y2 z2 x1 y1 z1 - + x2 y2 z2 - XMM1 XMM2

x = x1 + x2 y = y1 + y2 vaddps xmm, xmm1, xmm2 z = z1 + z2

Cannot leverage wider SIMD architecture!

15 SoA vectorization

Example: Add 3D vectors of float on AVX-capable machine SoA AoS X y z X1 y1 z1 + 2 2 2 x1 y1 z1 - + x y z - X 2 2 2 X12 y12 z12 22 y22 z22 X X13 y13 z13 23 y23 z23 XMM1 XMM2 X X14 y14 z14 24 y24 z24 X X15 y15 z15 25 y25 z25 X X16 y16 z16 26 y26 z26 vaddps xmm, xmm1, xmm2 X X17 y17 z17 27 y27 z27 X X18 y18 z18 28 y28 z28

class VectorPacket256 YMM1 YMM2 YMM3 YMM4 YMM5 YMM6 { public Vector256 Xs; vaddps ymm, ymm1, ymm4 public Vector256 Ys; vaddps ymm, ymm2, ymm5 public Vector256 Zs; vaddps ymm, ymm3, ymm6 }

16 Demo: SOA-Based Ray Tracer Performance

17 Intel Hardware Intrinsics in Use

. CPU math operations in ML.NET . SoA implementation of C# Raytracer . Bit operations . Matrix4x4 operations . BLAKE2 hashing . Your application!

*The figure shown above is from Microsoft blog on ML.NET

18 Turbocharge your applications with Intel® hardware Intrinsics in .NET Core 3.0

19 Accelerate Your Applications

. Understand your bottlenecks – Intel VTune™ Amplifier is a great tool for profiling .NET Core application . Use existing wealth of knowledge available on hardware intrinsics – Intel Intrinsics Guide: https://software.intel.com/sites/landingpage/IntrinsicsGuide/ – How-to guides: https://software.intel.com/en-us/articles/how-to-use-intrinsics . Explore existing solutions in C/C++ – E.g., “Fast conversion from UTF-8 with C++, DFAs, and SSE intrinsics” (https://github.com/BobSteagall/CppCon2018) . Optimize your application & re-measure . Contribute to .NET Core and hardware intrinsics

20

Intrinsic Programming Tips

. Developers are responsible for checking hardware support – Different from higher-level SIMD APIs where software fallback is provided . Test your application with different ISA levels – Use environment variables: e.g., COMPlus_EnableAVX . Instruction encoding issues are handled automatically – No need to worry about AVX/SSE transitions – Best encoding by the JIT . Use using static for concise syntax

22