Boost.SIMD: Generic Programming for Portable SIMDization
Pierre Estérie (LRI, Université Paris-Sud XI, Orsay, France) [email protected]
Mathias Gaunard (Metascale, Orsay, France) [email protected]
Joel Falcou (LRI, Université Paris-Sud XI, Orsay, France) [email protected]
Jean-Thierry Lapresté (LASMEA, Université Blaise Pascal, Clermont-Ferrand, France) [email protected]
Brigitte Rozoy (LRI, Université Paris-Sud XI, Orsay, France) [email protected]
ABSTRACT

SIMD extensions have been a feature of choice for processor manufacturers for a couple of decades. Designed to exploit data parallelism in applications at the instruction level, these extensions still require a high level of expertise or the use of potentially fragile compiler support or vendor-specific libraries. While a large fraction of their theoretical accelerations can be obtained using such tools, exploiting such hardware becomes tedious as soon as application portability across hardware is required. In this paper, we describe Boost.SIMD, a C++ template library that simplifies the exploitation of SIMD hardware within a standard C++ programming model. Boost.SIMD provides a portable way to vectorize computation on Altivec, SSE or AVX while providing a generic way to extend the set of supported functions and hardware. We introduce a C++ standard compliant interface for users which increases expressiveness by providing a high-level abstraction to handle SIMD operations, an extension-specific optimization pass and a set of SIMD-aware standard compliant algorithms which allow reuse of classical C++ abstractions for SIMD computation. We assess Boost.SIMD performance and applicability by providing an implementation of BLAS and image processing algorithms.

Keywords

SIMD, C++, Generic Programming, Template Metaprogramming

1. INTRODUCTION

Since the late 90's, processor manufacturers have provided specialized processing units called multimedia extensions or Single Instruction Multiple Data (SIMD) extensions. The introduction of this feature has allowed processors to exploit the latent data parallelism available in applications by executing a given instruction simultaneously on multiple data stored in a single special register. With a constantly increasing need for performance in applications, today's processor architectures offer rich SIMD instruction sets working with larger and larger SIMD registers (Table 1). For example, the AVX extension introduced in 2011 enhances the x86 instruction set for the Intel Sandy Bridge and AMD Bulldozer micro-architectures by providing a distinct set of 16 256-bit registers. Similarly, the forthcoming Intel MIC architecture will embed 512-bit SIMD registers. Usage of SIMD processing units can also be mandatory for performance on embedded systems, as demonstrated by the NEON and NEON2 ARM extensions [7] or the CELL-BE processor by IBM [9], whose SPUs were designed as a SIMD-only system.

Table 1: SIMD extensions in modern processors

Manufacturer  Extension  Registers (size & number)  Instructions
Intel         SSE        128 bits - 8               70
Intel         SSE2       128 bits - 8/16            214
Intel         SSE3       128 bits - 8/16            227
Intel         SSSE3      128 bits - 8/16            227
Intel         SSE4.1     128 bits - 8/16            274
Intel         SSE4.2     128 bits - 8/16            281
Intel         AVX        256 bits - 8/16            292
AMD           SSE4a      128 bits - 8/16            231
IBM           VMX        128 bits - 32              114
Motorola      VMX128     128 bits - 128
              VSX        128 bits - 64
ARM           NEON       128 bits - 16              100+

However, programming applications that take advantage of the SIMD extension available on the current target remains a complex task. Programmers that use low-level intrinsics have to deal with a verbose programming style, due to the fact that SIMD instruction sets cover only a few common functionalities, requiring them to bury the initial algorithms in architecture-specific implementation details. Furthermore, these efforts have to be repeated for every different extension that one may want to support, making design and maintenance of such applications very time consuming.
Different approaches have been suggested to limit these shortcomings. On the compiler side, autovectorizers implement code analysis and transform phases to generate vectorized code as part of the code generation process [15, 18] or as an offline process [13]. By this method compilers are able to detect code fragments that can be vectorized. Autovectorizers find their limits when the code does not present a clear vectorizable pattern (complex data dependencies, non-contiguous memory accesses, aliasing or control flow). This results in fragile SIMD code generation where performance and SIMD usage opportunities are uncertain. Other compiler-based systems use code directives to guide the vectorization process.

Boost.SIMD provides:

• An abstraction of SIMD registers, together with a sizeable set (200+) of mathematical functions and utility functions,

• Support for C++ standard components like Iterators over SIMD Ranges, SIMD-aware allocators and SIMD-aware STL algorithms.

2.1 SIMD register abstraction

The first level of abstraction introduced by Boost.SIMD is the pack class. For a given type T and a given static integral value N (N being a power of 2), a pack encapsulates the best type able to store a sequence of N elements of type T. For arbitrary T and N, this type is simply std::array<T, N>.
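To make the scalar-fallback semantics concrete, here is a minimal sketch of a pack-like type in standard C++. It is illustrative only, not Boost.SIMD's implementation: the real pack substitutes a hardware register type whenever one matches T and N.

```cpp
#include <array>
#include <cstddef>

// Illustrative sketch (not Boost.SIMD's actual code): a pack of N
// elements of type T backed by the portable std::array fallback.
// A real pack would select a hardware register type when available.
template <typename T, std::size_t N>
struct pack {
    std::array<T, N> data{};

    pack() = default;
    explicit pack(T value) { data.fill(value); }   // broadcast a scalar

    T  operator[](std::size_t i) const { return data[i]; }
    T& operator[](std::size_t i)       { return data[i]; }
};

// Element-wise operators map one scalar operation over all N lanes,
// which is exactly what a single SIMD instruction would do.
template <typename T, std::size_t N>
pack<T, N> operator+(pack<T, N> const& a, pack<T, N> const& b) {
    pack<T, N> r;
    for (std::size_t i = 0; i < N; ++i) r[i] = a[i] + b[i];
    return r;
}

template <typename T, std::size_t N>
pack<T, N> operator*(pack<T, N> const& a, pack<T, N> const& b) {
    pack<T, N> r;
    for (std::size_t i = 0; i < N; ++i) r[i] = a[i] * b[i];
    return r;
}

// Small check helper: lane 0 of c = a + b*b with a = 2, b = 3.
inline float demo_lane0() {
    pack<float, 4> a(2.0f), b(3.0f);
    pack<float, 4> c = a + b * b;
    return c[0];
}
```

Written this way, a function over packs reads like its scalar counterpart while still describing one operation per register.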
Thus, for any type T, an instance of pack<logical<T>, N> stores the boolean results of comparisons between packs of N elements of type T. Among the functions provided over pack are:

• Arithmetic functions: including abs, sqrt, average and various others,

• IEEE 754 functions: enabling bit-level manipulation of floating point values, including exponent and mantissa extraction,

• Reduction functions: for intra-register operations like the sum or product of a register's elements.

Listing 1: Working with pack

A fundamental aspect of SIMD programming relies on the effective use of fused operations like multiply-add on VMX extensions or sum of absolute differences on SSE extensions. Unlike simple wrappers around SIMD operations [8], pack relies on Expression Templates [2] to capture the Abstract Syntax Tree (AST) of large pack-based expressions and performs compile-time optimization of this AST.

2.2 C++ Standard integration

Writing small functions acting over a few packs has been covered in the previous section, and we saw how the API of Boost.SIMD makes such functions easy to write by abstracting away the architecture-specific code fragments. Realistic applications usually require such functions to be applied over a large set of data. To support such a use case in a simple way, Boost.SIMD provides a set of classes to integrate SIMD computation inside C++ code relying on the C++ Standard Template Library (STL) components, thus reusing its generic aspect to the fullest.
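Any such STL-style integration has to cope with data whose length is not an exact multiple of the SIMD register width. The chunk-plus-scalar-tail pattern can be sketched in plain standard C++ (the name simd_style_transform is illustrative, not Boost.SIMD's API; the inner loop stands in for one SIMD operation):

```cpp
#include <cstddef>

// Sketch of trailing-data handling (illustrative, not Boost.SIMD's
// implementation): process the input in chunks of W "lanes" as a SIMD
// unit would, then finish the remainder with scalar operations.
template <std::size_t W, typename F>
void simd_style_transform(const float* in, float* out, std::size_t n, F f) {
    std::size_t i = 0;
    for (; i + W <= n; i += W)                 // full register-width chunks
        for (std::size_t k = 0; k < W; ++k)    // stands in for one SIMD op
            out[i + k] = f(in[i + k]);
    for (; i < n; ++i)                         // scalar completion of the tail
        out[i] = f(in[i]);
}

// Example: a scaling step applied over n = 10 elements, which is not
// a multiple of W = 4, so the last two elements run in scalar mode.
inline float demo_transform() {
    float in[10], out[10];
    for (int i = 0; i < 10; ++i) in[i] = float(i);
    simd_style_transform<4>(in, out, 10, [](float v) { return 2.0f * v; });
    return out[9];   // 2 * 9
}
```

This is precisely the bookkeeping that a library-provided transform overload can hide from the user.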
Based on Generic Programming as defined by [17], the STL is built on the separation between data, stored in various Containers, and the way one can traverse these data, thanks to Iterators and algorithms. Instead of providing SIMD-aware containers, Boost.SIMD reuses existing STL Concepts to adapt STL-based code to SIMD computations. The goal of this integration is to find standard ways to express classical SIMD programming idioms, thus raising expressiveness while still benefiting from the expertise put into these idioms. More specifically, Boost.SIMD provides SIMD-aware allocators, iterators for regular SIMD computations – including interleaved data or sliding window iterators – and hardware-optimized algorithms.

2.2.1 Aligned allocator

The hardware implementation of SIMD processing units introduces constraints related to memory handling. Performance is guaranteed by accessing the memory through dedicated load and store intrinsics that perform register-length aligned memory accesses. This constraint requires a special memory allocation strategy via OS- and compiler-specific function calls. Boost.SIMD provides two STL-compliant allocators dealing with this kind of alignment. The first one, simd::allocator, wraps these OS and compiler functions in a simple STL-compliant allocator. When an existing allocator defines a specific memory allocation strategy, the user can adapt it to handle alignment by wrapping it in simd::allocator_adaptor.

2.2.2 SIMD Iterator

Modern C++ programming style based on Generic Programming usually leads to an intensive use of various STL components like Iterators. Boost.SIMD provides iterator adaptors that turn regular random access iterators into iterators suitable for SIMD processing. These adaptors act as free functions taking regular iterators as parameters and return iterators that output a pack whenever dereferenced. These iterators are then usable directly in usual STL algorithms.

Such examples of integration within standard algorithms are still limited, however. Applying the transform or fold algorithm to SIMD-aware data requires the size of the data to be an exact multiple of the SIMD register width and, in the case of fold, potentially performing additional operations at the end of its call. To alleviate this limitation, Boost.SIMD provides its own overloads of both transform and fold that take care of potential trailing data and perform proper completion of fold.

2.2.3 Sliding Window Iterator

Some application domains like image or signal processing require specific memory access patterns in which the neighborhood of a given value is used in the computation. Digital filtering and convolutions are examples of such algorithms. The efficient technique for vectorizing such operations is to perform so-called shifted loads, i.e. loads from unaligned memory addresses, so that each value of the neighborhood can be available as a SIMD vector. To limit the number of such loads, a technique called register rotation [14] is often used. This technique allows filter-like algorithms to perform only one load per iteration, swapping neighborhood values as the algorithm goes forward. This idiomatic way of implementing such algorithms usually increases performance by a large factor and is a strong candidate for encapsulation.

In Boost.SIMD, shifted_iterator is an iterator adaptor encapsulating such an abstraction. This iterator is constructed from an iterator and a compile-time width N. When dereferenced, it returns a static array of N packs containing the initial data and its shifted neighbors (Fig. 1). When incremented, the tuple values are internally swapped and the new vector of values is loaded, thus implementing register rotation. With such an iterator, one can simply write an average filter using std::transform.
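The register rotation idiom itself can be sketched in scalar C++. In Boost.SIMD the three window values would be packs obtained through shifted loads rather than single floats; the names below are illustrative:

```cpp
#include <cstddef>
#include <vector>

// Register-rotation sketch (illustrative): a 3-wide sliding window is
// kept in three live values; each iteration performs a single new
// "load" and swaps the neighborhood instead of reloading all three.
inline std::vector<float> average3(const std::vector<float>& in) {
    std::vector<float> out;
    if (in.size() < 3) return out;
    float a = in[0], b = in[1], c = in[2];     // initial window
    for (std::size_t i = 3; ; ++i) {
        out.push_back((a + b + c) / 3.0f);     // filter on current window
        if (i == in.size()) break;
        a = b; b = c;                          // rotate the window
        c = in[i];                             // one load per iteration
    }
    return out;
}
```

With SIMD packs in place of the three scalars, the same structure yields one aligned load per iteration instead of three shifted loads.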
Figure 1: Shifted Iterator output for N = 3. With the values 10 11 12 13 14 15 16 in main memory, dereferencing yields the shifted vectors [10 11 12 13], [11 12 13 14] and [12 13 14 15].

struct point { float x, y; int w; };
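For interleaved records such as point, the effect of an interleaved-data iterator can be sketched as gathering each field into its own set of lanes, i.e. a structure-of-arrays view of W consecutive points. The deinterleave helper below is hypothetical; Boost.SIMD's actual iterators differ:

```cpp
#include <array>
#include <cstddef>

struct point { float x, y; int w; };

// Illustrative sketch (names hypothetical): a structure-of-arrays view
// of W consecutive points, one array of lanes per field.
template <std::size_t W>
struct point_lanes {
    std::array<float, W> x, y;
    std::array<int, W>   w;
};

template <std::size_t W>
point_lanes<W> deinterleave(const point* p) {
    point_lanes<W> out;
    for (std::size_t i = 0; i < W; ++i) {   // gather each field
        out.x[i] = p[i].x;
        out.y[i] = p[i].y;
        out.w[i] = p[i].w;
    }
    return out;
}

// Check helper: y field of the third of four interleaved points.
inline float demo_deinterleave() {
    point pts[4] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {10, 11, 12}};
    return deinterleave<4>(pts).y[2];
}
```

Once each field sits in contiguous lanes, element-wise SIMD operations such as a per-point normalization apply directly.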
Proto [12] provides a semi-automatic generation of all the structures needed to make it work. Compared to hand-written Expression Templates-based EDSLs, designing a new embedded language with Proto is done at a higher level of abstraction. In a way, Proto supersedes the normal compiler workflow so that domain-specific code transformations can take place as soon as possible. Thanks to Proto, and contrary to other EDSL-based solutions [5], Boost.SIMD does not directly evaluate its compile-time AST after its capture. Instead, it relies on a multi-pass system:

• AST Optimization: in this pass, the AST is recursively optimized by transforming any potential sequence of functions or operators into their equivalent fused instructions,

• Code Scheduling: once optimized, the AST is split depending on the nature of each of its nodes. The main schedule operation takes care of reordering reduction operations so they can be evaluated in a proper temporary. This temporary is then stored into the modified AST as a simple terminal. This pass also splits the AST whenever scalar emulation is required,

• Code Generation: once scheduled, the remaining parts of the AST are recursively evaluated to their hardware-specific function calls, eventually taking care of any remaining scalar emulation.

Each expression involving Boost.SIMD types is then captured at compile-time as a single entity which is then processed by these passes.

Dispatching each function call to the proper architecture-specific implementation is an overload resolution problem, for which several C++ techniques exist:

• Tag Dispatching extends overloading directly and has the advantage of providing best match selection in cases of multiple matches.

• Concepts [4] rely on a higher level semantic description of type properties. Language support is required to resolve overloads based on the conformance of types to a given Concept. Currently Concept-based overloading is still an experimental language feature under consideration for inclusion in a future C++ standard.

All these techniques also lack a proper way to introduce the architectural information into the dispatching system. Boost.SIMD proposes to extend Tag Dispatching by using the full set of argument tags for overload selection, coupled with a tag representing the current architecture. These tags are computed as follows:

• For each SIMD family, a hierarchy of classes is defined to represent the relationship between each extension variant. For example an SSE3 tag inherits from the SSE2 tag, as SSE3 is more refined than SSE2.

• For each argument type, a type called a hierarchy is automatically computed. This hierarchy contains information about: the type of register used to store the value (SIMD or scalar), the intrinsic properties of the type (floating point, integer, size in bytes) and the actual type itself. These hierarchies are ordered from the most fine-grained description (for example, scalar_<int8_<T>>) to the most general one.
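A minimal sketch of this tag-based best-match selection in standard C++ (tags and function names are illustrative, not Boost.SIMD's): because extension tags inherit from one another, a caller on a refined extension falls back to the most specialized overload available through derived-to-base conversion.

```cpp
#include <string>

// Tag-dispatching sketch (illustrative): extension tags form an
// inheritance chain mirroring hardware refinement, so an SSE3 caller
// selects the best available overload via base-class conversion.
struct scalar_tag {};
struct sse2_tag : scalar_tag {};
struct sse3_tag : sse2_tag {};   // SSE3 refines SSE2

// A generic fallback plus one specialized overload. No SSE3-specific
// overload exists, so an SSE3 tag binds to the SSE2 version, which is
// a better match than the scalar fallback.
inline std::string dot_impl(scalar_tag) { return "scalar"; }
inline std::string dot_impl(sse2_tag)   { return "sse2"; }

// The front-end function forwards the current architecture tag.
template <typename Arch>
std::string dot() { return dot_impl(Arch{}); }
```

This is the same mechanism the standard library uses for iterator categories, here repurposed to carry architecture information.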
Table 2: Processor details

Architecture    Nehalem   Sandy Bridge   PowerPC G5
Max. Frequency  3.6 GHz   3.8 GHz        1.6 GHz
SIMD Extension  SSE4.2    AVX            Altivec

4.1 AXPY Kernel

The AXPY kernel computes a vector-scalar product and adds the result to another vector. With the x86 processor family, the throughput of the SIMD extension is two floating-point operations per cycle. The SIMD computation of the AXPY kernel fits the hardware specifications and its execution should result in the peak performance of the target. To reach the maximum number of floating-point operations, the source code needs to be tuned with very fine-grained architecture optimizations like loop unrolling according to the instruction latencies or the number of registers to use. The MKL library proposes an optimized routine of this algorithm for the x86 processor family. Autovectorizers in compilers are also able to capture this type of kernel and generate optimized code for the targeted architecture.

Table 3 shows how Boost.SIMD performs against the autovectorizers. For the particular case of Altivec, the SIMD extension does not support the double precision floating-point type.

Table 3: Boost.SIMD vs autovectorizers for the AXPY kernel in GFlop/s

Type    Size   Version  SSE4.2  AVX    Altivec
float   2^9    gcc      3.56    16.8   0.9
               icc      13.21   17.77  -
               B.SIMD   4.70    7.94   1.07
float   2^14   gcc      3.73    9.9    0.8
               icc      10.51   14.64  -
               B.SIMD   3.49    6.21   0.89
float   2^19   gcc      3.59    8.9    0.6
               icc      8.80    11.52  -
               B.SIMD   3.98    5.70   0.35
double  2^9    gcc      3.37    6.6    -
               icc      7.49    11.31  -
               B.SIMD   1.95    3.48   -
double  2^14   gcc      2.94    4.8    -
               icc      5.56    8.10   -
               B.SIMD   1.80    3.00   -
double  2^19   gcc      2.85    4.3    -
               icc      4.51    6.07   -
               B.SIMD   1.63    2.45   -

The sizes of the vectors used are chosen according to the cache sizes of their respective targets so that they all fit in cache. The gcc and icc versions show the autovectorizers at work on the AXPY kernel written in C code. First, the memory hierarchy directly impacts the performance of the kernel when the vectors go out of a cache level. The GNU and Intel compilers are able to vectorize and unroll the loop due to the static information available in the C code. The AXPY computation can be caught at the tree SSA (Static Single Assignment) level, which gives the compilers the opportunity for optimizations. Measurements show that the performance of Boost.SIMD is slightly behind that of the two compilers. This is because of the lack of unrolling and of the fine low-level code tuning necessary to reach the peak performance of the target. The MKL library goes up to 15 GFlop/s in single precision (7.1 in double precision) and outperforms the previous results. The AXPY kernel of MKL is provided as a user function with heavy architecture-specific optimizations for Intel processors, and introduces an architecture dependency in user code.

Table 4 shows how Boost.SIMD performs against handwritten SIMD code without loop unrolling. The results of the generated code are equivalent to the SSE4.2 code and assess that Boost.SIMD delivers the expected speedup. Loop optimizations and fine load/store scheduling strategies can be added on top of Boost.SIMD to increase performance.

Table 4: Boost.SIMD vs handwritten SIMD code for the AXPY kernel in GFlop/s

Type   Size   Version     SSE4.2
float  2^9    Ref. SIMD   4.03
              Boost.SIMD  4.70
float  2^14   Ref. SIMD   3.40
              Boost.SIMD  3.49
float  2^19   Ref. SIMD   3.41
              Boost.SIMD  3.98

The previous results show that Boost.SIMD provides a portable way to access the latent speed-up of the architectures. However, as it is not a special-purpose library like MKL, its performance on this very demanding test is satisfactory yet still far from the peak performance. Other optimizations like loop unrolling and jamming are necessary to compete with the library solutions.

4.2 Sigma-Delta Motion Detection

The Sigma-Delta algorithm [10] is a motion detection algorithm used in image processing to discriminate moving objects from the background. Contrary to simple thresholded background subtraction algorithms, Sigma-Delta uses a per-pixel variance estimation to filter out outliers due to lighting conditions or contrast variation inside the image. The whole algorithm can be expressed by a series of additions, subtractions and various boolean selections. As pointed out by Lacassagne in [10], the Sigma-Delta algorithm is mainly limited by memory bandwidth, so no optimization besides SIMDization is effective, as only point-to-point operations are issued. Table 5 details how Boost.SIMD performs against scalar and handwritten versions of the algorithm; the benchmarks are realized with greyscale images.
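The paper does not spell out the per-pixel update rules, so the following is a commonly used formulation of Sigma-Delta (an assumption), written in scalar C++ to show that only additions, subtractions and selections are involved:

```cpp
#include <cstdint>
#include <cstdlib>

// One common formulation of the Sigma-Delta per-pixel update (an
// assumption: the exact rules are not given in the text). Everything
// is increments, differences and selections, hence SIMD friendliness.
struct sigma_delta_state {
    std::uint8_t m;   // background estimate
    std::uint8_t v;   // variance estimate
};

inline std::uint8_t sigma_delta_step(sigma_delta_state& s,
                                     std::uint8_t pixel, int n = 4) {
    if (s.m < pixel) ++s.m;                   // background creeps toward input
    else if (s.m > pixel) --s.m;
    int diff = std::abs(int(s.m) - int(pixel));
    if (s.v < n * diff) ++s.v;                // variance creeps toward n*diff
    else if (s.v > n * diff) --s.v;
    return diff > s.v ? 255 : 0;              // motion label
}

// Check helper: a static pixel matching the background yields no motion.
inline int demo_static_pixel() {
    sigma_delta_state s{100, 10};
    return sigma_delta_step(s, 100);
}
```

Every branch above maps to a SIMD comparison followed by a selection, which is why the kernel vectorizes with purely point-to-point operations.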
To handle this format, the type unsigned char is used, so that each vector of the SSE2 or Altivec extension can carry sixteen elements. On the AVX side, the instruction set does not provide support for the corresponding type.

Table 5: Results for the Sigma-Delta algorithm in cpp

Extension        SSE4.2            Altivec
Size             256^2    512^2    256^2    512^2
Scalar C++ (1)   9.237    9.296    14.312   27.074
Scalar C icc     2.619    2.842    -        -
Scalar C gcc     8.073    7.966    -        -
Ref. SIMD (2)    1.394    1.281    1.380    4.141
Boost.SIMD (3)   1.106    1.125    1.511    5.488
Speedup (1/3)    8.363    8.263    9.469    4.933
Overhead (2/3)   -26%     -13.9%   8.7%     24.5%

The execution time overhead introduced by the use of Boost.SIMD stays below 8.7%. On SSE2, the library outperforms the SSE2 handwritten version, while on Altivec a slow-down appears with images of 512 × 512 elements. Such a scenario can be explained by the number of images used by the algorithm and their sizes. Three vectors of type unsigned char need to be accessed during the computation, which is the critical section of the Sigma-Delta algorithm. The highest cache level of the PowerPC 970FX provides 512 KBytes of memory, which is not enough to fit the three images in cache. Cache misses become preponderant and the Load/Store unit of the Altivec extension keeps waiting for data from the main memory. This problem goes away on the Nehalem microarchitecture with its additional level of cache memory. The icc autovectorizer generates SSE4.2 code with the C version of Sigma-Delta while gcc fails. The C++ version keeps its fully scalar properties even with the autovectorizers enabled, due to the lack of static information introduced by the Generic Programming style of the C++ language. Boost.SIMD keeps the high level abstraction provided by the use of STL code and is able to reach the performance of the vectorized reference code. In addition, the portability of the Boost.SIMD code gives access to the previous speedups without rewriting the code.

4.3 RGB2YUV Color space conversion

The RGB and YUV models are both color spaces with three components that encode images or videos. As opposed to RGB, the YUV color space takes into account the human perception of colors. The Y component encodes the luminance information (brightness) and the U and V components carry the chrominance information (color). The YUV model makes artifacts that can happen during transmission or compression less noticeable to the human eye. The conversion from the RGB color space is done as follows:

• Weighted values of R, G and B are summed to obtain Y,

• Scaled differences are computed between Y and the B and R values to produce the U and V components.

The RGB2YUV algorithm fits the data parallelism requirement for SIMD computation. The comparison shown in Table 6 aims to measure the performance of Boost.SIMD against scalar C++ code and SIMD reference code. The implementation of the transformation is realized in single precision floating-point.

Table 6: Results for the RGB2YUV algorithm in cpp

Size   Version     SSE4.2   AVX      Altivec
128^2  Scalar C++  30.53    23.01    39.07
       Ref. SIMD   5.30     3.99     13.18
       Boost.SIMD  4.32     3.51     12.89
       Speedup     7.07     6.55     3.03
       Overhead    -22.7%   -13.7%   -2.2%
256^2  Scalar C++  29.23    21.46    42.51
       Ref. SIMD   6.48     2.80     29.05
       Boost.SIMD  6.51     2.45     29.01
       Speedup     4.49     8.76     1.47
       Overhead    0.04%    -14.3%   0.1%
512^2  Scalar C++  28.91    22.03    44.93
       Ref. SIMD   6.54     3.01     30.05
       Boost.SIMD  6.53     3.83     30.42
       Speedup     4.42     5.75     1.48
       Overhead    -0.1%    21.4%    1.2%

The speedups obtained with SSE4.2 are superior to the expected ×4 on such an extension and no overhead is added by Boost.SIMD. With AVX, the theoretical speedup is reached for an image of 256 × 256 elements. A slowdown appears with other sizes due to the memory hierarchy bottleneck. This effect directly impacts the performance on the PowerPC architecture for sizes larger than 128 × 128 elements. As seen in the Sigma-Delta benchmark, the lack of a level 3 cache causes the SIMD unit to wait constantly for data from the main memory. For a size of 64 × 64 on the PowerPC, Boost.SIMD performs at 4.65 cpp against 36.32 cpp for the scalar version, giving a speedup of 7.81, which confirms the memory limitation of the PowerPC architecture. The latent data parallelism in the RGB2YUV algorithm is fully exploited by Boost.SIMD and the benchmarks corroborate the ability of the library to generate efficient SIMD code.
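A scalar sketch of the conversion described above. The exact weights are an assumption (the common BT.601 coefficients); the text does not list the values actually used:

```cpp
// RGB to YUV as described above: Y is a weighted sum of R, G and B,
// and U, V are scaled differences against B and R. The coefficients
// are an assumption (standard BT.601 values), not taken from the text.
struct yuv { float y, u, v; };

inline yuv rgb2yuv(float r, float g, float b) {
    yuv out;
    out.y = 0.299f * r + 0.587f * g + 0.114f * b;  // weighted sum
    out.u = 0.492f * (b - out.y);                  // scaled difference vs B
    out.v = 0.877f * (r - out.y);                  // scaled difference vs R
    return out;
}
```

Each output component is an independent fused multiply-add chain over the three inputs, which is why the kernel maps so directly onto per-lane SIMD arithmetic.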
5. CONCLUSION

SIMD instruction sets are a technology present in an ever growing number of architectures. Despite the performance boost that such extensions usually provide, SIMD has usually been underused. However, as new SIMD-enabled hardware gets designed with increasingly large SIMD vector sizes, losing the ×4 to ×16 speed-ups they may provide in HPC applications is starting to be glaring. In this paper, we presented Boost.SIMD, a C++ software library which aims at simplifying the design of SIMD-aware applications while providing a portable high-level API, integrated with the C++ Standard Template Library. Thanks to meta-programmed optimization opportunity detection and code generation, Boost.SIMD has been shown to be on par with handwritten SIMD code while also providing sufficient expressiveness to implement non-trivial applications. In opposition to compiler-dependent optimizations and manufacturers' libraries, Boost.SIMD solves the problem of portable SIMD code generation.

Future work includes providing support for classical numeric types like complex numbers or various pixel encodings, using the AST exploration system to estimate the proper unrolling and blocking factor for any given expression, and expanding the library to SIMD-enabled embedded systems like the ARM Cortex processor family.

6. ACKNOWLEDGMENT

The authors would like to thank Pierre-Louis Caruana for his help on the experiments reported in this paper.

7. REFERENCES

[1] AMD. AMD Core Math Library. http://developer.amd.com/libraries/acml/pages/default.aspx.
[2] K. Czarnecki, U. W. Eisenecker, R. Glück, D. Vandevoorde, and T. L. Veldhuizen. Generative programming and active libraries. In Generic Programming, pages 25–39, 1998.
[3] J. de Guzman, D. Marsden, and T. Schwinger. Boost.Fusion library. http://www.boost.org/doc/libs/release/libs/fusion/doc/html.
[4] D. Gregor, J. Järvi, J. Siek, B. Stroustrup, G. Dos Reis, and A. Lumsdaine. Concepts: linguistic support for generic programming in C++. In Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications, OOPSLA '06, pages 291–310, New York, NY, USA, 2006. ACM.
[5] G. Guennebaud, B. Jacob, et al. Eigen v3. http://eigen.tuxfamily.org, 2010.
[6] Intel. Math Kernel Library. http://developer.intel.com/software/products/mkl/.
[7] M. Jang, K. Kim, and K. Kim. The performance analysis of ARM NEON technology for mobile platforms. In Proceedings of the 2011 ACM Symposium on Research in Applied Computation, RACS '11, pages 104–106, New York, NY, USA, 2011. ACM.
[8] M. Kretz and V. Lindenstruth. Vc: A C++ library for explicit vectorization. Software: Practice and Experience, 2011.
[9] J. Kurzak, W. Alvaro, and J. Dongarra. Optimizing matrix multiplication for a short-vector SIMD architecture: CELL processor. Parallel Computing, 35(3):138–150, 2009.
[10] L. Lacassagne, A. Manzanera, J. Denoulet, and A. Mérigot. High performance motion detection: some trends toward new embedded architectures for vision systems. Journal of Real-Time Image Processing, 4:127–146, 2009.
[11] N. C. Myers. Traits: a new and useful template technique. C++ Report, June 1995.
[12] E. Niebler. Proto: A compiler construction toolkit for DSELs. In Proceedings of the ACM SIGPLAN Symposium on Library-Centric Software Design, 2007.
[13] D. Nuzman, S. Dyshel, E. Rohou, I. Rosen, K. Williams, D. Yuste, A. Cohen, and A. Zaks. Vapor SIMD: Auto-vectorize once, run everywhere. In CGO, pages 151–160, 2011.
[14] T. Saidani, L. Lacassagne, J. Falcou, C. Tadonki, and S. Bouaziz. Parallelization schemes for memory optimization on the Cell processor: a case study on the Harris corner detector. In P. Stenström, editor, Transactions on High-Performance Embedded Architectures and Compilers III, pages 177–200. Springer-Verlag, Berlin, Heidelberg, 2011.
[15] J. Shin. Introducing control flow into vectorized code. In International Conference on Parallel Architectures and Compilation Techniques, pages 280–291, 2007.
[16] D. Spinellis. Notable design patterns for domain-specific languages. Journal of Systems and Software, 56(1):91–99, 2001.
[17] A. Stepanov and M. Lee. The Standard Template Library. Technical Report 95-11(R.1), HP Laboratories, 1995.
[18] H. Wang, H. Andrade, B. Gedik, and K.-L. Wu. A code generation approach for auto-vectorization in the SPADE compiler. In LCPC '09, pages 383–390, 2009.