Han Lee, Intel Corporation [email protected] Notices and Disclaimers
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request. No product or component can be absolutely secure.
Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548- 4725 or by visiting www.intel.com/design/literature.htm.
Intel, the Intel logo, and other Intel product and solution names in this presentation are trademarks of Intel
*Other names and brands may be claimed as the property of others
© Intel Corporation. 2 What Do These Have in Common?
Domain Example Image processing Color extraction High performance computing (HPC) Matrix multiplication Data processing Hamming code Text processing UTF-8 conversion Data structures Bit array Machine learning Classification
For performance sensitive code, consider using Intel® hardware intrinsics
3 Objectives
. (Very brief) Intro to SIMD . Design Motivation . Hardware Intrinsics . Call to Action
4 Single Instruction, Multiple Data (SIMD)
Perform same operation on multiple data using a single instruction . Intel Advanced Vector Extensions 2 (Intel® AVX2) supports 256-bit SIMD instructions – Add eight 32-bit integers using one vpaddd instruction
YMM1
SRC1 X7 X6 X5 X4 X3 X2 X1 X0 X7 X6 X5 X4 X3 X2 X1 X0 YMM2 + + + + + + + + Y7 Y6 Y5 Y4 Y3 Y2 Y1 Y0
======SRC2 Y7 Y6 Y5 Y4 Y3 Y2 Y1 Y0 YMM0 X7 + Y7 X6 + Y6 X5 + Y5 X4 + Y4 X3 + Y3 X2 + Y2 X1 + Y1 X0 + Y0
DEST X7 + Y7 X6 + Y6 X5 + Y5 X4 + Y4 X3 + Y3 X2 + Y2 X1 + Y1 X0 + Y0 8x ADD R32, R32
VPADDD YMM0, YMM1, YMM2
5 Single Instruction, Multiple Data (SIMD)
C# abstraction is System.Numerics.Vector* . System.Numerics.Vector
Vector
v 21 18 15 12 9 6 3 0
6 Why?
*https://github.com/dotnet/corefx/issues/1168, https://github.com/dotnet/corefx/issues/2209, https://github.com/dotnet/corefx/issues/2025, https://github.com/dotnet/coreclr/issues/916
7 Hardware Intrinsics (a.k.a. Intrinsic functions / Platform Dependent Intrinsics)
Special functions that directly map to hardware instructions . Useful when – Algorithms can better be mapped to underlying hardware – Maximum control over code generation is desired . Mainstream C/C++ compilers have intrinsic functions – Field tested and proven to be useful – Provides design guidelines . Both SIMD and non-SIMD
8 Intel Hardware Intrinsics
. APIs originally designed and proposed by Intel . Major enhancements with the help of .NET Core community and Microsoft . Implementation by Intel, Microsoft and .NET Core community . Namespace System.Runtime.Intrinsics contains platform-agnostic types & functions – E.g., Vector256
9 Intel Hardware Intrinsics Design
. ISA class contains IsSupported boolean property to check hardware support . ISA class contains intrinsic methods corresponding to underlying instructions . Methods closely mirror C/C++ intrinsic function ///
. Majority of Intel hardware intrinsics operate over Vector128/256
10 Demo: Counting Set Bits
11 Using Intel Hardware Intrinsics in .NET Core
using System.Runtime.Intrinsics.X86; Import the namespace to use Intel HW intrinsic using System.Runtime.Intrinsics; Import the namespace to use Vector128/256
12 SOA-Based Ray Tracer
13 How to Vectorize a Ray-tracer?
Ray-tracer based on 3D vector computation . Most of the underlying structures can be abstracted to 3D vectors – Position/direction (x, y, z) in 3D space, color (R, G, B) Vectorize the underlying 3D vector computation . AoS (Array of Structure) . SoA (Structure of Array)
14 AoS vectorization
Example: Add 3D vectors of float on AVX-capable machine Scalar AoS
x1 y1 z1 + x2 y2 z2 x1 y1 z1 - + x2 y2 z2 - XMM1 XMM2
x = x1 + x2 y = y1 + y2 vaddps xmm, xmm1, xmm2 z = z1 + z2
Cannot leverage wider SIMD architecture!
15 SoA vectorization
Example: Add 3D vectors of float on AVX-capable machine SoA AoS X y z X1 y1 z1 + 2 2 2 x1 y1 z1 - + x y z - X 2 2 2 X12 y12 z12 22 y22 z22 X X13 y13 z13 23 y23 z23 XMM1 XMM2 X X14 y14 z14 24 y24 z24 X X15 y15 z15 25 y25 z25 X X16 y16 z16 26 y26 z26 vaddps xmm, xmm1, xmm2 X X17 y17 z17 27 y27 z27 X X18 y18 z18 28 y28 z28
class VectorPacket256 YMM1 YMM2 YMM3 YMM4 YMM5 YMM6 { public Vector256
16 Demo: SOA-Based Ray Tracer Performance
17 Intel Hardware Intrinsics in Use
. CPU math operations in ML.NET . SoA implementation of C# Raytracer . Bit operations . Matrix4x4 operations . BLAKE2 hashing . Your application!
*The figure shown above is from Microsoft blog on ML.NET
18 Turbocharge your applications with Intel® hardware Intrinsics in .NET Core 3.0
19 Accelerate Your Applications
. Understand your bottlenecks – Intel VTune™ Amplifier is a great tool for profiling .NET Core application . Use existing wealth of knowledge available on hardware intrinsics – Intel Intrinsics Guide: https://software.intel.com/sites/landingpage/IntrinsicsGuide/ – How-to guides: https://software.intel.com/en-us/articles/how-to-use-intrinsics . Explore existing solutions in C/C++ – E.g., “Fast conversion from UTF-8 with C++, DFAs, and SSE intrinsics” (https://github.com/BobSteagall/CppCon2018) . Optimize your application & re-measure . Contribute to .NET Core and hardware intrinsics
20
Intrinsic Programming Tips
. Developers are responsible for checking hardware support – Different from higher-level SIMD APIs where software fallback is provided . Test your application with different ISA levels – Use environment variables: e.g., COMPlus_EnableAVX . Instruction encoding issues are handled automatically – No need to worry about AVX/SSE transitions – Best encoding by the JIT . Use using static for concise syntax
22