Tool zoo: Choosing the right vectorization method
Nicolai Behmann, December 5, 2015
Institute of Microelectronic Systems, Leibniz Universität Hannover

Table of contents
1. Motion estimation
2. Vectorization methods
3. Performance evaluation
4. Conclusion
Motion estimation

Block-based motion estimation
Blocks are compared by the sum of pixelwise absolute differences (SAD).
for (i = 0; i < nBlocks_y; ++i)
  for (j = 0; j < nBlocks_x; ++j)
    for (n = 0; n < nCandidates; ++n)
      sad = calcSAD(...);
      // ... find max similarity, e.g. min sad
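A minimal scalar sketch of the calcSAD kernel referenced above (the signature and parameter names are assumptions for illustration; the slides leave them elided):

```cpp
#include <cstdlib>

// Scalar reference: sum of absolute differences between two blocks.
// a and b point to the top-left pixel of each block; lineWidth is the
// image stride in pixels (assumed row-major 8-bit grayscale layout).
int calcSAD(const unsigned char* a, const unsigned char* b,
            int blockWidth, int blockHeight, int lineWidth) {
    int sad = 0;
    for (int y = 0; y < blockHeight; ++y) {
        for (int x = 0; x < blockWidth; ++x)
            sad += std::abs(a[x] - b[x]);
        a += lineWidth;  // advance both pointers to the next block row
        b += lineWidth;
    }
    return sad;
}
```

The candidate with the minimum SAD is then taken as the best match.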
Vectorization methods
Autovectorization

for (sad = 0, y = 0; y < blockHeight; y++) {
  for (x = 0; x < blockWidth; x++) {
    idx = y*linewidth + x;
    sad += abs(a[idx] - b[idx]);
  }
}

• Included in most C++ compilers
• Activated by the optimization switch -O3
• Improve performance by:
  1. removing the explicit index calculation
  2. using compile-time-known loop lengths
  3. marking pointers with restrict
  4. aligning pointers
• Zero code adaptation
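The four hints above can be combined in one rewrite of the loop; a sketch (function name is illustrative, and __restrict is a common compiler extension since C++ has no restrict keyword):

```cpp
#include <cstdlib>

// SAD loop rewritten to help the autovectorizer: pointers are advanced
// per row (no explicit index computation), the 16x16 block size is a
// compile-time constant, and __restrict asserts a and b do not alias.
int sad16x16(const unsigned char* __restrict a,
             const unsigned char* __restrict b, int lineWidth) {
    int sad = 0;
    for (int y = 0; y < 16; ++y) {
        for (int x = 0; x < 16; ++x)  // fixed trip count, vectorizable
            sad += std::abs(a[x] - b[x]);
        a += lineWidth;
        b += lineWidth;
    }
    return sad;
}
```

With g++ -O3 -fopt-info-vec the compiler reports whether the inner loop was actually vectorized.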
OpenMP SIMD

for (sad = 0, y = 0, pA = a, pB = b; y < 16; ++y) {
  #pragma omp simd reduction(+:sad) aligned(pA,pB:16) linear(pA,pB:1)
  for (int x = 0; x < 16; x++)
    sad += std::abs(pA[x] - pB[x]);
  pA += lineWidth;
  pB += lineWidth;
}

• Introduced with OpenMP 4.0
• Offers many compiler hints
• Very little code adaptation
Array slicing notation

for (sad = 0, y = 0, pA = a, pB = b; y < 16; ++y) {
  tmp[0:16] = std::abs(pA[0:16] - pB[0:16]);
  sad += __sec_reduce_add(tmp[0:16]);
  pA += lineWidth;
  pB += lineWidth;
}

• Introduced with Cilk Plus
• Integrated in gcc 4.9+, few other platforms
• Medium code adaptation
Instruction-set intrinsics

__m128i rSad = _mm_setzero_si128();
for (y = 0, pA = a, pB = b; y < 16; y++) {
  __m128i rA = _mm_load_si128((const __m128i*)pA);
  __m128i rB = _mm_load_si128((const __m128i*)pB);
  __m128i rT = _mm_sad_epu8(rA, rB);
  rSad = _mm_adds_epu16(rSad, rT);
  pA += lineWidth;
  pB += lineWidth;
}
sad = _mm_extract_epi16(rSad, 0) + _mm_extract_epi16(rSad, 4);

• Low-level accelerator access
• Few platforms, less portability
• Huge code adaptation
11 Performance evaluation Cycle count comparison
10000 no vec. auto-vec. OpenMP 1000 Cilk Plus SSE clock cycles 100
10 [2x2] [4x4] [8x8] [16x16]
block size
Additional outer parallelization

[Figure: execution time [ms] over 1, 2, 4 and 8 cores for NEON 720p (ARM) and SSE 720p (GPP), compared against ideal scaling]
Conclusion

Summary

The choice of vectorization method is strongly application-specific.
Boost.SIMD, Halide and many others are also worth a try.
Questions?