
What is Vectorization?
Georg Zitzlsberger, [email protected], 5th of July 2017

Agenda
- What is "Vectorization"?
- History of SIMD
- How does SIMD work?
- SIMD for Intel Architecture
- SIMD Instructions (in a nutshell)
- Where is SIMD?

What is "Vectorization"?

What is a "vector"?
- This: ~x = (x1 x2 ... xm)^T ?
- This: std::vector<int> x ?
- This: x(:) ?
- This: %ymm ?
- ...
Depending on your background your perception might vary, but two properties are common to all:
- They cover multiple data items
- They allow transformations/operations
Isn't that related to parallelism somehow?

Flynn's Taxonomy
Classifying parallelism since 1966:
- Single data stream:
  - SISD: Single Instruction, Single Data
  - MISD: Multiple Instruction, Single Data
- Multiple data streams:
  - SIMD: Single Instruction, Multiple Data
  - MIMD: Multiple Instruction, Multiple Data
  - SPMD: Single Program, Multiple Data
  - MPMD: Multiple Program, Multiple Data
(Images: Wikipedia)

History of SIMD
- The idea goes back to the 1960s; ILLIAC IV became operational in 1972 (~150 MFLOPS)
- Cray dramatically improved on it and was renowned for vector processing for a long time (Cray-1, 1976, ~240 MFLOPS)
- Nowadays SIMD is pervasive and delivers ~1-20+ TFLOPS per node
(Images: Wikipedia, Intel & NVidia)

Moore's "Law" (Image: Intel)

Parallelism (Image: Intel)

How does SIMD work?
It depends on the architecture...
- Cray-1: (Image: Cray)
- GPUs (e.g. NVidia): (Image: NVidia)
- Intel Architecture: in more detail in the following...
We focus on Intel Architectures, used on more than 80% of all supercomputers in the TOP500. (Image: nextplatform.com)

There are other architectures, too:
- IBM PowerPC: AltiVec
- ARM: NEON
Those work similarly to the Intel Architecture we focus on throughout the day, but only have smaller 128-bit vectors.

SIMD for Intel Architecture
Terminology:
- SIMD: Data Level Parallelism (DLP)
- Vector: consists of more than one element
- Elements of a vector: all of the same primitive type (e.g. float, integer, ...)
- Vector Length (VL): number of elements of a vector (subject to the underlying type)
- Vector Size: in bits (type-less)
- SIMD lane: data flow sliced by the elements of a vector

History of SIMD Types for Intel Architecture
- 1996: MMX [+57 instructions]
- 1998: 3DNow! (AMD)
- 1999: Streaming SIMD Extensions (SSE) [+70 instructions]; only support for single & double precision FP
- 2001: SSE2 [+144 instructions]; added integer support
- 2004: SSE3 [+13 instructions]
- 2006: Supplemental SSE3 (SSSE3) [+16 instructions]
- 2006: SSE4.1 & 4.2 [+54 instructions]
- 2008: Advanced Vector Extensions (AVX); only support for single & double precision FP
- 2011/2012: FMA4/FMA3 (AMD)
- 2012: Intel Many Integrated Core Architecture (MIC)
- 2013: AVX2 [+283 instructions]; added integer support + FMA for FP
- 2015: AVX-512

SIMD Types for Intel Architecture I (Image: Intel)
SSE means all SSE and SSSE versions.

SIMD Types for Intel Architecture II (Image: Intel)
AVX means both AVX and AVX2.

SIMD Types for Intel Architecture III (Image: Intel)
Notice the different subsets!

Performance Evolution of SIMD for Intel Architecture (Image: Intel)

SIMD Instructions (in a nutshell)
Registers, depending on the SIMD type (all limited to 8 in 32-bit mode):
- SSE, 128 bit: XMM0-XMM15
- AVX, 256 bit: YMM0-YMM15
- AVX-512, 512 bit: ZMM0-ZMM31, plus mask registers k0-k7

SIMD instruction categorization (simplified; please refer to the processor manuals for full information). E.g. vmovapd %ymm0, (%rdx,%rax):
- Packed [p] or scalar [s] execution?
- Double [d] or single precision [s] floating point?
vec.cpp:

    void vec(double *a, double *b, double * __restrict__ c) {
        double *aa = (double*)__builtin_assume_aligned(a, 64);
        double *ba = (double*)__builtin_assume_aligned(b, 64);
        double *ca = (double*)__builtin_assume_aligned(c, 64);
        for (int i = 0; i < N; ++i) {
            ca[i] = aa[i] + ba[i];
        }
    }

vec.s (GAS style):

    .L4:
        vmovapd (%rdi,%rax), %ymm0
        vaddpd  (%rsi,%rax), %ymm0, %ymm0
        vmovapd %ymm0, (%rdx,%rax)
        addq    $32, %rax
        cmpq    $80000, %rax
        jne     .L4
        vzeroupper
        ret

Where is SIMD?
Every core has SIMD capabilities:
- Every SW thread or MPI process has to use SIMD on its own
- In a multicore environment SIMD will be influenced by memory (cache coherency)
(Image: Intel)