Tool zoo: Choosing the right vectorization method

Nicolai Behmann, December 5, 2015

Institute of Microelectronic Systems, Leibniz Universität Hannover

Table of contents

1. Motion estimation

2. Vectorization methods

3. Performance evaluation

4. Conclusion

Motion estimation

Block-based motion estimation

Compare candidate blocks by the sum of pixelwise absolute differences (SAD).

for (i = 0; i < nBlocks_y; ++i)
  for (j = 0; j < nBlocks_x; ++j)
    for (n = 0; n < nCandidates; ++n)
      sad = calcSAD(...);
      // ... find max similarity, e.g. min sad
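
A plain scalar calcSAD might look like the following sketch (the signature, the blockWidth/blockHeight parameters and the lineWidth pitch are assumptions for illustration; the candidate loop above calls it once per candidate):

#include <cstdint>
#include <cstdlib>

// Hypothetical scalar reference: SAD of one blockWidth x blockHeight block.
// a and b point to the top-left pixels of the two blocks, lineWidth is the
// image pitch in pixels.
int calcSAD(const uint8_t* a, const uint8_t* b,
            int blockWidth, int blockHeight, int lineWidth)
{
  int sad = 0;
  for (int y = 0; y < blockHeight; ++y)
    for (int x = 0; x < blockWidth; ++x) {
      int idx = y * lineWidth + x;
      sad += std::abs(a[idx] - b[idx]);
    }
  return sad;
}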

Vectorization methods

Autovectorization

for (sad = 0, y = 0; y < blockHeight; y++) {
  for (x = 0; x < blockWidth; x++) {
    idx = y * lineWidth + x;
    sad += abs(a[idx] - b[idx]);
  }
}

• Included in most C++ compilers
• Activation by optimization switch -O3
• Improve performance by (see the sketch below)
  1. removing explicit index calculations
  2. using compile-time known loop lengths
  3. marking pointers with restrict
  4. aligning pointers
• Zero code adaptation
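
Applied to the loop above, the four hints might look like the following sketch (the fixed 16x16 block size, the __restrict spelling and the GCC/Clang builtin __builtin_assume_aligned are assumptions; compile with -O3):

#include <cstdint>
#include <cstdlib>

// Hypothetical rewrite of the SAD loop with autovectorization hints:
// compile-time loop bounds, no explicit index arithmetic, restrict-qualified
// pointers and an alignment promise (GCC/Clang builtin).
int calcSAD16(const uint8_t* __restrict a, const uint8_t* __restrict b,
              int lineWidth)
{
  int sad = 0;
  for (int y = 0; y < 16; ++y) {
    const uint8_t* pA =
        static_cast<const uint8_t*>(__builtin_assume_aligned(a, 16));
    const uint8_t* pB =
        static_cast<const uint8_t*>(__builtin_assume_aligned(b, 16));
    for (int x = 0; x < 16; ++x)       // known trip count
      sad += std::abs(pA[x] - pB[x]);  // contiguous accesses, no idx variable
    a += lineWidth;
    b += lineWidth;
  }
  return sad;
}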

OpenMP SIMD

for (sad = 0, y = 0, pA = a, pB = b; y < 16; ++y) {
  #pragma omp simd reduction(+:sad) aligned(pA,pB:16)
  for (int x = 0; x < 16; x++)
    sad += std::abs(pA[x] - pB[x]);
  pA += lineWidth;
  pB += lineWidth;
}

• Introduced with OpenMP 4.0
• Possibility for many compiler hints
• Very little code adaptation
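
A self-contained version of the same kernel might look like this sketch (the function name and parameters are assumptions; with GCC or Clang the pragma can be honoured without the OpenMP runtime via -fopenmp-simd, or with the full runtime via -fopenmp):

#include <cstdint>
#include <cstdlib>

// Hypothetical OpenMP SIMD variant: the inner loop is an explicit SIMD region
// with a sum reduction and an alignment hint. The aligned() clause assumes
// 16-byte aligned buffers and a lineWidth that is a multiple of 16.
int calcSAD16_omp(const uint8_t* a, const uint8_t* b, int lineWidth)
{
  int sad = 0;
  for (int y = 0; y < 16; ++y) {
    #pragma omp simd reduction(+:sad) aligned(a, b : 16)
    for (int x = 0; x < 16; ++x)
      sad += std::abs(a[x] - b[x]);
    a += lineWidth;
    b += lineWidth;
  }
  return sad;
}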

Array slicing notation

for (sad = 0, y = 0, pA = a, pB = b; y < 16; ++y) {
  tmp[0:16] = std::abs(pA[0:16] - pB[0:16]);
  sad += __sec_reduce_add(tmp[0:16]);
  pA += lineWidth;
  pB += lineWidth;
}

• Introduced with Cilk Plus
• Integrated in gcc 4.9+, supported on few platforms
• Medium code adaptation
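
For completeness, a self-contained variant mirroring the slide's array-notation construct might look like the sketch below (the function name and parameters are assumptions; it would be built with GCC's -fcilkplus on the 4.9 through 7 releases or with the Intel compiler, since Cilk Plus was later removed from GCC):

#include <cstdint>
#include <cstdlib>

// Hypothetical Cilk Plus array-notation variant: the slices express the
// element-wise difference and the horizontal reduction for one block row.
int calcSAD16_cilk(const uint8_t* a, const uint8_t* b, int lineWidth)
{
  int sad = 0;
  for (int y = 0; y < 16; ++y) {
    int tmp[16];
    tmp[0:16] = std::abs(a[0:16] - b[0:16]);  // element-wise over the slice
    sad += __sec_reduce_add(tmp[0:16]);       // horizontal sum of the slice
    a += lineWidth;
    b += lineWidth;
  }
  return sad;
}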

Instruction-set intrinsics

__m128i rSad = _mm_setzero_si128();
for (y = 0, pA = a, pB = b; y < 16; y++) {
  __m128i rA = _mm_load_si128((const __m128i*) pA);
  __m128i rB = _mm_load_si128((const __m128i*) pB);
  __m128i rT = _mm_sad_epu8(rA, rB);  // two partial sums (low/high 8 bytes)
  rSad = _mm_adds_epu16(rSad, rT);    // accumulate per row
  pA += lineWidth;
  pB += lineWidth;
}
sad = _mm_extract_epi16(rSad, 0) + _mm_extract_epi16(rSad, 4);

• Low-level accelerator access
• Few platforms, less portability
• Huge code adaptation
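
The aligned loads above assume 16-byte aligned block rows, which rarely holds for arbitrary candidate positions in a motion search; a self-contained variant with unaligned loads might look like the following sketch (the function name and parameters are assumptions):

#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical SSE2 SAD kernel for one 16x16 block at arbitrary (unaligned)
// positions. _mm_sad_epu8 yields two partial sums (low and high 8 bytes),
// which are accumulated per row and combined at the end.
int calcSAD16_sse(const uint8_t* a, const uint8_t* b, int lineWidth)
{
  __m128i rSad = _mm_setzero_si128();
  for (int y = 0; y < 16; ++y) {
    __m128i rA = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i rB = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    rSad = _mm_adds_epu16(rSad, _mm_sad_epu8(rA, rB));
    a += lineWidth;
    b += lineWidth;
  }
  return _mm_extract_epi16(rSad, 0) + _mm_extract_epi16(rSad, 4);
}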

Performance evaluation

Cycle count comparison

[Figure: clock cycles (log scale, about 10 to 10000) versus block size (2x2, 4x4, 8x8, 16x16) for no vectorization, auto-vectorization, OpenMP, Cilk Plus and SSE.]

Additional outer parallelization

[Figure: execution time in ms versus number of cores (1, 2, 4, 8) for NEON 720p (ARM) and SSE 720p (GPP), compared with ideal scaling.]
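
The outer parallelization distributes block rows of the search across cores while each per-block SAD kernel stays vectorized; a minimal sketch with OpenMP could look like this (the refBlock and candidate helpers, the calcSAD16_sse kernel and the surrounding structure are assumptions for illustration, built with -fopenmp):

#include <climits>
#include <cstdint>

// Hypothetical helpers: pointers to reference block (i, j) and to candidate n
// of that block in the previous frame.
const uint8_t* refBlock(int i, int j);
const uint8_t* candidate(int i, int j, int n);
int calcSAD16_sse(const uint8_t* a, const uint8_t* b, int lineWidth);

void motionSearch(int nBlocks_y, int nBlocks_x, int nCandidates,
                  int lineWidth, int* bestSad /* nBlocks_y * nBlocks_x */)
{
  // Outer-loop parallelization: block rows are distributed across cores,
  // while each SAD call stays vectorized (SSE here, NEON on ARM).
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < nBlocks_y; ++i)
    for (int j = 0; j < nBlocks_x; ++j) {
      int best = INT_MAX;
      for (int n = 0; n < nCandidates; ++n) {
        int sad = calcSAD16_sse(refBlock(i, j), candidate(i, j, n), lineWidth);
        if (sad < best)
          best = sad;  // most similar candidate so far
      }
      bestSad[i * nBlocks_x + j] = best;
    }
}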

Conclusion

Summary

The choice of vectorization method is strongly application-specific.

Boost.SIMD, Halide and many others are also worth a try.

Questions?
