X86 Vector Processing Extensions

Dan Stafford, Justine Bonnot

Outline
• Background
• Applications
• Timeline
• MMX
• 3DNow!
• Streaming SIMD Extension
  ◦ SSE
  ◦ SSE2
  ◦ SSE3 and SSSE3
  ◦ SSE4
• Advanced Vector Extension
  ◦ AVX
  ◦ AVX2
  ◦ AVX-512
• Compiling with x86 Vector Processing Extensions
• Vector Processing Today

Background
• Exploits data level parallelism
• Reduces stalls from branches
• Equivalent to loop unrolling

Scalar Processing vs. Vector Processing
[Figure: instruction and data flowing through a scalar unit and a SIMD unit to produce results]
• Scalar Processor
  ◦ SISD
  ◦ Scalar registers
• (Full) Vector Processor
  ◦ SIMD
  ◦ Vector registers
• Vector Processing Extension
  ◦ SIMD
  ◦ Scalar registers
  ◦ Vector inside of register, divided into separate components
• SISD: Single Instruction Single Data
• SIMD: Single Instruction Multiple Data

Applications
• Multimedia Processing
  ◦ Compression
  ◦ Graphics
  ◦ Image Processing
• Simulations
• Engineering Tools
  ◦ CAD
• Cryptography
• Etc.

Timeline
• MMX: Intel, 1997
• 3DNow!: AMD, 1998
• SSE: Intel, 1999
• AVX: Intel and AMD, 2008

MMX
• "Matrix Math Extensions"
• Launched by Intel in 1997
  ◦ Pentium MMX (P55C), followed by Pentium II
• 8 64-bit integer registers
  ◦ Aliased with x87 floating point registers
• Register layout (64 bits): 8 bytes, 4 words, or 2 double words

3DNow!
• MMX extension by AMD in 1998
  ◦ K6-2
  ◦ Registers shared with MMX and x87 FPU
• 21 single precision floating point instructions
• Discontinued after 2010
• Register layout (64 bits): 8 bytes, 4 words, 2 double words, or 2 single precision values

SSE
• Introduced by Intel in 1999 with the Pentium III
  ◦ Pentium III = Pentium II + SSE
  ◦ Intel's answer to AMD's 3DNow!
  ◦ Katmai New Instructions (KNI)
• 70 new instructions
  ◦ Single-precision floating point
  ◦ Few additional integer instructions
• 8 new 128-bit registers
• Register layout (128 bits): 4 single precision values

SSE2
• Willamette New Instructions
• Intel Pentium 4
  ◦ 2001
• 144 new instructions
  ◦ Double precision (64-bit) support
• Extends MMX to use SSE registers
  ◦ Replaces MMX
• Register layout (128 bits): 8 words, 4 double words, 4 single precision values, or 2 double precision values

SSE3 and SSSE3
• SSE3
  ◦ Prescott New Instructions (PNI), 2004
  ◦ 13 new instructions
  ◦ DSP & 3D focused
  ◦ Iterate horizontally vs. vertically in an instruction
• SSSE3 (Supplemental SSE3)
  ◦ Merom New Instructions (MNI), 2006
  ◦ 16 new instructions
  ◦ Byte permutations
  ◦ Fixed point multiplication with rounding
  ◦ Within-word accumulate
• Register layout (128 bits): same as SSE2

SSE4
• SSE4.1
  ◦ Penryn New Instructions (PNI), 2007
  ◦ Sum of absolute differences
  ◦ Dot products
  ◦ Floating point rounding
  ◦ Blending
  ◦ Packed operations
• SSE4.2
  ◦ Nehalem processors, 2008
  ◦ STTNI: String and Text New Instructions
  ◦ CRC32
• Register layout (128 bits): same as SSE2

AVX
• Proposed by Intel and AMD, March 2008
  ◦ Intel Sandy Bridge processor, 2011
  ◦ AMD Bulldozer processor, 2011
• VEX Coding Prefixes
  ◦ 3 operand instructions
  ◦ 16 256-bit registers
  ◦ Extension supported on legacy SSE instructions
• SSE instructions still only use 128-bit registers
• Register layout (256 bits): 8 double words or single precision values, or 4 double precision values

AVX2
• Haswell New Instructions
  ◦ Intel Haswell processor, 2013
• Additions
  ◦ AVX and SSE integer instructions extended to 256 bits
  ◦ General-purpose bit manipulation and multiply
  ◦ Fused Multiply Add (FMA3): d = round(a × b + c)
  ◦ Gather: vector equivalent of register indirect addressing (scatter follows in AVX-512)
  ◦ Permutations
  ◦ Vector Shifts
• Register layout (256 bits): 8 double words or single precision values, or 4 double precision values

AVX-512
• Intel Knights Landing processor
  ◦ 2nd gen Xeon Phi processors
  ◦ Scheduled 2016
• Supports Enhanced Vector Extension (EVEX)
  ◦ 32 512-bit registers
  ◦ Up to 4 operand instructions
  ◦ 8 new opmask registers (k0-k7), 7 of them usable as write masks
  ◦ Explicit rounding control
  ◦ Compressed displacement addressing mode
• Register layout (512 bits): 16 double words or single precision values, or 8 double precision values

Compiling with x86 Vector Processing Extensions
• Not every application can use them
• Unrolling loops saves time
• One vector load replaces several scalar loads
• Most compilers do not vectorize automatically
  ◦ Program has to be written by hand
• Problems can happen with memory alignment
  ◦ Memory has to be carefully aligned
• Data to process has to be known in advance
• Newer compilers support compiling from high level languages
  ◦ Intel Compiler Suite 11.1: AVX
  ◦ GCC 4.9: AVX-512
  ◦ GCC flags: -msse, -mavx, -mavx512f, etc.

Vector Processing Today
• Where are vector processors today?
  ◦ Gone
  ◦ High bandwidth
  ◦ Custom designed and costly
• Supercomputers now use multiple CPU and GPU cores
  ◦ Cheaper
  ◦ Lower bandwidth
  ◦ National Energy Research Scientific Computing Center "Cori"
  ◦ Will have Knights Landing Xeon Phis with AVX-512
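
Example: hand-written SSE in C
A minimal sketch of what "written by hand" can look like, using the SSE intrinsics from `<xmmintrin.h>`. The function name `add_f32` and its signature are hypothetical and not from the slides. It processes four single precision values per 128-bit register and falls back to a scalar loop for the leftover elements. Unaligned loads and stores are used so the arrays need no special alignment; aligned variants exist but require 16-byte aligned data.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Hypothetical helper (illustration only): out[i] = a[i] + b[i] for i in [0, n). */
static void add_f32(const float *a, const float *b, float *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);

    size_t i = 0;

    /* Vector part: one 128-bit register holds 4 single precision values. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vsum = _mm_add_ps(va, vb);  /* 4 additions in one instruction */
        _mm_storeu_ps(out + i, vsum);      /* unaligned store of 4 floats */
    }

    /* Scalar tail: handle the remaining 0 to 3 elements. */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

On x86-64, GCC enables SSE by default; on 32-bit x86 targets the `-msse` flag is typically needed.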
