
Tool zoo: Choosing the right vectorization method
Nicolai Behmann
December 5, 2015
Institute of Microelectronic Systems, Leibniz Universität Hannover

Table of contents

1. Motion estimation
2. Vectorization methods
3. Performance evaluation
4. Conclusion

Motion estimation

Block-based motion estimation

Compare candidate blocks by the sum of pixelwise absolute differences (SAD):

    for (i = 0; i < nBlocks_y; ++i)
      for (j = 0; j < nBlocks_x; ++j)
        for (n = 0; n < nCandidates; ++n)
          sad = calcSAD(...);
          // ... find the best match, e.g. the minimum sad

Vectorization methods

Autovectorization

    for (sad = 0, y = 0; y < blockHeight; y++) {
      for (x = 0; x < blockWidth; x++) {
        idx = y*lineWidth + x;
        sad += abs(a[idx] - b[idx]);
      }
    }

• Included in most C++ compilers
• Activated by the optimization switch -O3
• Improve performance by (see the sketch after this list):
  1. removing explicit index calculation
  2. using compile-time-known loop lengths
  3. marking pointers with restrict
  4. aligning pointers
• Zero code adaptation
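To make these hints concrete, here is a minimal sketch, not taken from the slides, of how the SAD loop might be rewritten to help the autovectorizer. The 16x16 block size, the name calcSAD16x16, the uint8_t pixel type, and the 16-byte row alignment are illustrative assumptions; __restrict and __builtin_assume_aligned are GCC/Clang-specific spellings of the restrict and alignment hints.

    // Minimal sketch (assumptions: 16x16 blocks, 8-bit pixels, 16-byte-aligned rows).
    // Fixed trip counts, no explicit index arithmetic, restrict-qualified and
    // alignment-hinted pointers give the autovectorizer an easier job at -O3.
    #include <cstddef>
    #include <cstdint>
    #include <cstdlib>

    int calcSAD16x16(const std::uint8_t* __restrict a,
                     const std::uint8_t* __restrict b,
                     std::size_t lineWidth)
    {
        int sad = 0;
        for (int y = 0; y < 16; ++y) {          // compile-time-known trip count
            const std::uint8_t* pA = static_cast<const std::uint8_t*>(
                __builtin_assume_aligned(a + y * lineWidth, 16));  // GCC/Clang hint
            const std::uint8_t* pB = static_cast<const std::uint8_t*>(
                __builtin_assume_aligned(b + y * lineWidth, 16));
            for (int x = 0; x < 16; ++x)        // no explicit idx = y*lineWidth + x
                sad += std::abs(pA[x] - pB[x]);
        }
        return sad;
    }

With these changes the compiler can often emit the same kind of SIMD code on its own, without any source-level intrinsics.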
OpenMP SIMD

    for (sad = 0, y = 0, pA = a, pB = b; y < 16; ++y) {
      #pragma omp simd reduction(+:sad) aligned(pA,pB:16) linear(pA,pB:1)
      for (int x = 0; x < 16; x++)
        sad += std::abs(pA[x] - pB[x]);
      pA += lineWidth;
      pB += lineWidth;
    }

• Introduced with OpenMP 4.0
• Provides many compiler hints (reduction, alignment, linear access)
• Very little code adaptation

Array slicing notation

    for (sad = 0, y = 0, pA = a, pB = b; y < 16; ++y) {
      tmp[0:16] = std::abs(pA[0:16] - pB[0:16]);
      sad += __sec_reduce_add(tmp[0:16]);
      pA += lineWidth;
      pB += lineWidth;
    }

• Introduced with Cilk Plus
• Integrated in gcc 4.9+, few platforms
• Moderate code adaptation

Instruction-set intrinsics

    __m128i rSad = _mm_setzero_si128();
    for (y = 0, pA = a, pB = b; y < 16; y++) {
      __m128i rA = _mm_load_si128((const __m128i*)pA);
      __m128i rB = _mm_load_si128((const __m128i*)pB);
      __m128i rT = _mm_sad_epu8(rA, rB);
      rSad = _mm_adds_epu16(rSad, rT);
      pA += lineWidth;
      pB += lineWidth;
    }
    sad = _mm_extract_epi16(rSad, 0) + _mm_extract_epi16(rSad, 4);

• Low-level accelerator access
• Few platforms, less portability
• Extensive code adaptation (see the self-contained sketch below)
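For reference, a self-contained, compilable form of the intrinsics kernel could look as follows. The function name calcSAD16x16_sse2, the uint8_t pixel type, and the 16-byte row alignment are assumptions rather than part of the slides; the code targets SSE2.

    // Minimal sketch (assumptions: 8-bit pixels, 16-byte-aligned rows, SSE2 target).
    #include <emmintrin.h>   // SSE2 intrinsics
    #include <cstddef>
    #include <cstdint>

    int calcSAD16x16_sse2(const std::uint8_t* pA,
                          const std::uint8_t* pB,
                          std::size_t lineWidth)
    {
        __m128i rSad = _mm_setzero_si128();
        for (int y = 0; y < 16; ++y) {
            // Load one 16-pixel row of each block (requires 16-byte alignment).
            __m128i rA = _mm_load_si128(reinterpret_cast<const __m128i*>(pA));
            __m128i rB = _mm_load_si128(reinterpret_cast<const __m128i*>(pB));
            // _mm_sad_epu8 yields two partial absolute-difference sums,
            // one per 8-byte half; accumulate them with a saturating 16-bit add.
            rSad = _mm_adds_epu16(rSad, _mm_sad_epu8(rA, rB));
            pA += lineWidth;
            pB += lineWidth;
        }
        // Combine the two partial sums (16-bit lanes 0 and 4).
        return _mm_extract_epi16(rSad, 0) + _mm_extract_epi16(rSad, 4);
    }

Every load, SAD, and accumulation is spelled out explicitly, which is what makes the intrinsics version fast but also tied to one instruction set.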
Performance evaluation

Cycle count comparison

[Figure: clock cycles (log scale, 10 to 10,000) over block sizes 2x2, 4x4, 8x8 and 16x16 for no vectorization, auto-vectorization, OpenMP, Cilk Plus and SSE]

Additional outer parallelization

[Figure: execution time in ms over 1, 2, 4 and 8 cores for NEON 720p (ARM) and SSE 720p (GPP), compared against ideal scaling]

Conclusion

Summary

The choice of vectorization method is strongly application-specific.
Boost.SIMD, Halide and many others are also worth a try.

Questions?