Boost.SIMD: Generic Programming for Portable SIMDization
Pierre Estérie (LRI, Université Paris-Sud XI, Orsay, France) [email protected]
Mathias Gaunard (Metascale, Orsay, France) [email protected]
Joel Falcou (LRI, Université Paris-Sud XI, Orsay, France) [email protected]
Jean-Thierry Lapresté (LASMEA, Université Blaise Pascal, Clermont-Ferrand, France) [email protected]
Brigitte Rozoy (LRI, Université Paris-Sud XI, Orsay, France) [email protected]
ABSTRACT

SIMD extensions have been a feature of choice for processor manufacturers for a couple of decades. Designed to exploit data parallelism in applications at the instruction level, these extensions still require a high level of expertise or the use of potentially fragile compiler support or vendor-specific libraries. While a large fraction of their theoretical accelerations can be obtained using such tools, exploiting such hardware becomes tedious as soon as application portability across hardware is required. In this paper, we describe Boost.SIMD, a C++ template library that simplifies the exploitation of SIMD hardware within a standard C++ programming model. Boost.SIMD provides a portable way to vectorize computation on Altivec, SSE or AVX while providing a generic way to extend the set of supported functions and hardware. We introduce a C++ standard compliant interface for users which increases expressiveness by providing a high-level abstraction to handle SIMD operations, an extension-specific optimization pass and a set of SIMD-aware standard compliant algorithms which allow reuse of classical C++ abstractions for SIMD computation. We assess Boost.SIMD performance and applicability by providing an implementation of BLAS and image processing algorithms.

Keywords

SIMD, C++, Generic Programming, Template Metaprogramming

1. INTRODUCTION

Since the late 90's, processor manufacturers have provided specialized processing units called multimedia extensions or Single Instruction Multiple Data (SIMD) extensions. The introduction of this feature has allowed processors to exploit the latent data parallelism available in applications by executing a given instruction simultaneously on multiple data stored in a single special register. With a constantly increasing need for performance in applications, today's processor architectures offer rich SIMD instruction sets working with larger and larger SIMD registers (Table 1). For example, the AVX extension introduced in 2011 enhances the x86 instruction set for the Intel Sandy Bridge and AMD Bulldozer micro-architectures by providing a distinct set of 16 256-bit registers. Similarly, the forthcoming Intel MIC architecture will embed 512-bit SIMD registers. Usage of SIMD processing units can also be mandatory for performance on embedded systems, as demonstrated by the NEON and NEON2 ARM extensions [7] or the CELL-BE processor by IBM [9], whose SPUs were designed as a SIMD-only system.

Table 1: SIMD extensions in modern processors

Manufacturer  Extension  Registers (size & number)  Instructions
Intel         SSE        128 bits - 8               70
Intel         SSE2       128 bits - 8/16            214
Intel         SSE3       128 bits - 8/16            227
Intel         SSSE3      128 bits - 8/16            227
Intel         SSE4.1     128 bits - 8/16            274
Intel         SSE4.2     128 bits - 8/16            281
Intel         AVX        256 bits - 8/16            292
AMD           SSE4a      128 bits - 8/16            231
IBM           VMX        128 bits - 32              114
Motorola      VMX128     128 bits - 128
              VSX        128 bits - 64
ARM           NEON       128 bits - 16              100+

However, programming applications that take advantage of the SIMD extension available on the current target remains a complex task. Programmers that use low-level intrinsics have to deal with a verbose programming style, due to the fact that SIMD instruction sets cover only a few common functionalities, requiring them to bury the initial algorithms in architecture-specific implementation details. Furthermore, these efforts have to be repeated for every different extension that one may want to support, making design and maintenance of such applications very time consuming.
Different approaches have been suggested to limit these shortcomings. On the compiler side, autovectorizers implement code analysis and transform phases to generate vectorized code as part of the code generation process [15, 18] or as an offline process [13]. By this method compilers are able to detect code fragments that can be vectorized. Autovectorizers find their limits when the code does not present a clear vectorizable pattern (complex data dependencies, non-contiguous memory accesses, aliasing or control flow). This results in fragile SIMD code generation where performance and SIMD usage opportunities are uncertain. Other compiler-based systems use code directives to guide the vectorization process.

Boost.SIMD provides:

• An abstraction of SIMD registers, together with a sizeable set (200+) of mathematical functions and utility functions,

• Support for C++ standard components like Iterators over SIMD Ranges, SIMD-aware allocators and SIMD-aware STL algorithms.

2.1 SIMD register abstraction

The first level of abstraction introduced by Boost.SIMD is the pack class. For a given type T and a given static integral value N (N being a power of 2), a pack encapsulates the best type able to store a sequence of N elements of type T. For arbitrary T and N, this type is simply std::array<T, N>.
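To make the scalar-fallback semantics concrete, here is a minimal sketch of a pack-like type in standard C++. It is illustrative only, not Boost.SIMD's implementation: the real pack substitutes a hardware register type whenever one matches T and N.

```cpp
#include <array>
#include <cstddef>

// Illustrative sketch (not Boost.SIMD's actual code): a pack of N
// elements of type T backed by the portable std::array fallback.
// A real pack would select a hardware register type when available.
template <typename T, std::size_t N>
struct pack {
    std::array<T, N> data{};

    pack() = default;
    explicit pack(T value) { data.fill(value); }   // broadcast a scalar

    T  operator[](std::size_t i) const { return data[i]; }
    T& operator[](std::size_t i)       { return data[i]; }
};

// Element-wise operators map one scalar operation over all N lanes,
// which is exactly what a single SIMD instruction would do.
template <typename T, std::size_t N>
pack<T, N> operator+(pack<T, N> const& a, pack<T, N> const& b) {
    pack<T, N> r;
    for (std::size_t i = 0; i < N; ++i) r[i] = a[i] + b[i];
    return r;
}

template <typename T, std::size_t N>
pack<T, N> operator*(pack<T, N> const& a, pack<T, N> const& b) {
    pack<T, N> r;
    for (std::size_t i = 0; i < N; ++i) r[i] = a[i] * b[i];
    return r;
}

// Small check helper: lane 0 of c = a + b*b with a = 2, b = 3.
inline float demo_lane0() {
    pack<float, 4> a(2.0f), b(3.0f);
    pack<float, 4> c = a + b * b;
    return c[0];
}
```

Written this way, a function over packs reads like its scalar counterpart while still describing one operation per register.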
Thus, for any type T, an instance of pack<logical<T>, N> stores the boolean results of comparisons between packs of N elements of type T. Among the functions provided over pack are:

• Arithmetic functions: including abs, sqrt, average and various others,

• IEEE 754 functions: enabling bit-level manipulation of floating point values, including exponent and mantissa extraction,

• Reduction functions: for intra-register operations like the sum or product of a register's elements.

Listing 1: Working with pack

A fundamental aspect of SIMD programming relies on the effective use of fused operations like multiply-add on VMX extensions or sum of absolute differences on SSE extensions. Unlike simple wrappers around SIMD operations [8], pack relies on Expression Templates [2] to capture the Abstract Syntax Tree (AST) of large pack-based expressions and performs compile-time optimization of this AST.

2.2 C++ Standard integration

Writing small functions acting over a few packs has been covered in the previous section, and we saw how the API of Boost.SIMD makes such functions easy to write by abstracting away the architecture-specific code fragments. Realistic applications usually require such functions to be applied over a large set of data. To support such a use case in a simple way, Boost.SIMD provides a set of classes to integrate SIMD computation inside C++ code relying on the C++ Standard Template Library (STL) components, thus reusing its generic aspect to the fullest.
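Any such STL-style integration has to cope with data whose length is not an exact multiple of the SIMD register width. The chunk-plus-scalar-tail pattern can be sketched in plain standard C++ (the name simd_style_transform is illustrative, not Boost.SIMD's API; the inner loop stands in for one SIMD operation):

```cpp
#include <cstddef>

// Sketch of trailing-data handling (illustrative, not Boost.SIMD's
// implementation): process the input in chunks of W "lanes" as a SIMD
// unit would, then finish the remainder with scalar operations.
template <std::size_t W, typename F>
void simd_style_transform(const float* in, float* out, std::size_t n, F f) {
    std::size_t i = 0;
    for (; i + W <= n; i += W)                 // full register-width chunks
        for (std::size_t k = 0; k < W; ++k)    // stands in for one SIMD op
            out[i + k] = f(in[i + k]);
    for (; i < n; ++i)                         // scalar completion of the tail
        out[i] = f(in[i]);
}

// Example: a scaling step applied over n = 10 elements, which is not
// a multiple of W = 4, so the last two elements run in scalar mode.
inline float demo_transform() {
    float in[10], out[10];
    for (int i = 0; i < 10; ++i) in[i] = float(i);
    simd_style_transform<4>(in, out, 10, [](float v) { return 2.0f * v; });
    return out[9];   // 2 * 9
}
```

This is precisely the bookkeeping that a library-provided transform overload can hide from the user.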
Based on Generic Programming as defined by [17], the STL is built on the separation between data, stored in various Containers, and the way one can traverse these data, thanks to Iterators and algorithms. Instead of providing SIMD-aware containers, Boost.SIMD reuses existing STL Concepts to adapt STL-based code to SIMD computations. The goal of this integration is to find standard ways to express classical SIMD programming idioms, thus raising expressiveness while still benefiting from the expertise put into these idioms. More specifically, Boost.SIMD provides SIMD-aware allocators, iterators for regular SIMD computations – including interleaved data or sliding window iterators – and hardware-optimized algorithms.

2.2.1 Aligned allocator

The hardware implementation of SIMD processing units introduces constraints related to memory handling. Performance is guaranteed by accessing the memory through dedicated load and store intrinsics that perform register-length aligned memory accesses. This constraint requires a special memory allocation strategy via OS- and compiler-specific function calls. Boost.SIMD provides two STL-compliant allocators dealing with this kind of alignment. The first one, simd::allocator, wraps these OS and compiler functions in a simple STL-compliant allocator. When an existing allocator defines a specific memory allocation strategy, the user can adapt it to handle alignment by wrapping it in simd::allocator_adaptor.

2.2.2 SIMD Iterator

Modern C++ programming style based on Generic Programming usually leads to an intensive use of various STL components like Iterators. Boost.SIMD provides iterator adaptors that turn regular random access iterators into iterators suitable for SIMD processing. These adaptors act as free functions taking regular iterators as parameters and return iterators that output a pack whenever dereferenced. These iterators are then usable directly in usual STL algorithms.

Such examples of integration within standard algorithms are still limited, however. Applying the transform or fold algorithm to SIMD-aware data requires the size of the data to be an exact multiple of the SIMD register width and, in the case of fold, potentially performing additional operations at the end of its call. To alleviate this limitation, Boost.SIMD provides its own overloads of both transform and fold that take care of potential trailing data and perform proper completion of fold.

2.2.3 Sliding Window Iterator

Some application domains like image or signal processing require specific memory access patterns in which the neighborhood of a given value is used in the computation. Digital filtering and convolutions are examples of such algorithms. The efficient technique for vectorizing such operations is to perform so-called shifted loads, i.e. loads from unaligned memory addresses, so that each value of the neighborhood can be available as a SIMD vector. To limit the number of such loads, a technique called register rotation [14] is often used. This technique allows filter-like algorithms to perform only one load per iteration, swapping neighborhood values as the algorithm goes forward. This idiomatic way of implementing such algorithms usually increases performance by a large factor and is a strong candidate for encapsulation.

In Boost.SIMD, shifted_iterator is an iterator adaptor encapsulating such an abstraction. This iterator is constructed from an iterator and a compile-time width N. When dereferenced, it returns a static array of N packs containing the initial data and its shifted neighbors (Fig. 1). When incremented, the tuple values are internally swapped and the new vector of values is loaded, thus implementing register rotation. With such an iterator, one can simply write an average filter using std::transform.
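The register rotation idiom itself can be sketched in scalar C++. In Boost.SIMD the three window values would be packs obtained through shifted loads rather than single floats; the names below are illustrative:

```cpp
#include <cstddef>
#include <vector>

// Register-rotation sketch (illustrative): a 3-wide sliding window is
// kept in three live values; each iteration performs a single new
// "load" and swaps the neighborhood instead of reloading all three.
inline std::vector<float> average3(const std::vector<float>& in) {
    std::vector<float> out;
    if (in.size() < 3) return out;
    float a = in[0], b = in[1], c = in[2];     // initial window
    for (std::size_t i = 3; ; ++i) {
        out.push_back((a + b + c) / 3.0f);     // filter on current window
        if (i == in.size()) break;
        a = b; b = c;                          // rotate the window
        c = in[i];                             // one load per iteration
    }
    return out;
}
```

With SIMD packs in place of the three scalars, the same structure yields one aligned load per iteration instead of three shifted loads.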
Figure 1: Shifted Iterator output for N = 3. With the values 10 11 12 13 14 15 16 in main memory, dereferencing yields the shifted vectors [10 11 12 13], [11 12 13 14] and [12 13 14 15].

struct point { float x, y; int w; };
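For interleaved records such as point, the effect of an interleaved-data iterator can be sketched as gathering each field into its own set of lanes, i.e. a structure-of-arrays view of W consecutive points. The deinterleave helper below is hypothetical; Boost.SIMD's actual iterators differ:

```cpp
#include <array>
#include <cstddef>

struct point { float x, y; int w; };

// Illustrative sketch (names hypothetical): a structure-of-arrays view
// of W consecutive points, one array of lanes per field.
template <std::size_t W>
struct point_lanes {
    std::array<float, W> x, y;
    std::array<int, W>   w;
};

template <std::size_t W>
point_lanes<W> deinterleave(const point* p) {
    point_lanes<W> out;
    for (std::size_t i = 0; i < W; ++i) {   // gather each field
        out.x[i] = p[i].x;
        out.y[i] = p[i].y;
        out.w[i] = p[i].w;
    }
    return out;
}

// Check helper: y field of the third of four interleaved points.
inline float demo_deinterleave() {
    point pts[4] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {10, 11, 12}};
    return deinterleave<4>(pts).y[2];
}
```

Once each field sits in contiguous lanes, element-wise SIMD operations such as a per-point normalization apply directly.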
Proto [12] provides a semi-automatic generation of all the structures needed to make it work. Compared to hand-written Expression Templates-based EDSLs, designing a new embedded language with Proto is done at a higher level of abstraction. In a way, Proto supersedes the normal compiler workflow so that domain-specific code transformations can take place as soon as possible. Thanks to Proto, and contrary to other EDSL-based solutions [5], Boost.SIMD does not directly evaluate its compile-time AST after its capture. Instead, it relies on a multi-pass system:

• AST Optimization: in this pass, the AST is recursively optimized by transforming any potential sequence of functions or operators into their equivalent fused instructions,

• Code Scheduling: once optimized, the AST is split depending on the nature of each of its nodes. The main schedule operation takes care of reordering reduction operations so they can be evaluated in a proper temporary. This temporary is then stored into the modified AST as a simple terminal. This pass also splits the AST whenever scalar emulation is required,

• Code Generation: once scheduled, the remaining parts of the AST are recursively evaluated to their hardware-specific function calls, eventually taking care of any remaining scalar emulation.

Each expression involving Boost.SIMD types is then captured at compile-time as a single entity which is then processed by these passes.

Dispatching each function call to the proper architecture-specific implementation is an overload resolution problem, for which several C++ techniques exist:

• Tag Dispatching extends overloading directly and has the advantage of providing best match selection in cases of multiple matches.

• Concepts [4] rely on a higher level semantic description of type properties. Language support is required to resolve overloads based on the conformance of types to a given Concept. Currently Concept-based overloading is still an experimental language feature under consideration for inclusion in a future C++ standard.

All these techniques also lack a proper way to introduce the architectural information into the dispatching system. Boost.SIMD proposes to extend Tag Dispatching by using the full set of argument tags for overload selection, coupled with a tag representing the current architecture. These tags are computed as follows:

• For each SIMD family, a hierarchy of classes is defined to represent the relationship between each extension variant. For example an SSE3 tag inherits from the SSE2 tag, as SSE3 is more refined than SSE2.

• For each argument type, a type called a hierarchy is automatically computed. This hierarchy contains information about: the type of register used to store the value (SIMD or scalar), the intrinsic properties of the type (floating point, integer, size in bytes) and the actual type itself. These hierarchies are ordered from the most fine-grained description (for example, scalar_<int8_<T>>) to the most general one.
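A minimal sketch of this tag-based best-match selection in standard C++ (tags and function names are illustrative, not Boost.SIMD's): because extension tags inherit from one another, a caller on a refined extension falls back to the most specialized overload available through derived-to-base conversion.

```cpp
#include <string>

// Tag-dispatching sketch (illustrative): extension tags form an
// inheritance chain mirroring hardware refinement, so an SSE3 caller
// selects the best available overload via base-class conversion.
struct scalar_tag {};
struct sse2_tag : scalar_tag {};
struct sse3_tag : sse2_tag {};   // SSE3 refines SSE2

// A generic fallback plus one specialized overload. No SSE3-specific
// overload exists, so an SSE3 tag binds to the SSE2 version, which is
// a better match than the scalar fallback.
inline std::string dot_impl(scalar_tag) { return "scalar"; }
inline std::string dot_impl(sse2_tag)   { return "sse2"; }

// The front-end function forwards the current architecture tag.
template <typename Arch>
std::string dot() { return dot_impl(Arch{}); }
```

This is the same mechanism the standard library uses for iterator categories, here repurposed to carry architecture information.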
Table 2: Processor details

Architecture    Nehalem   Sandy Bridge   PowerPC G5
Max. Frequency  3.6 GHz   3.8 GHz        1.6 GHz
SIMD Extension  SSE4.2    AVX            Altivec

4.1 AXPY Kernel

The AXPY kernel computes a vector-scalar product and adds the result to another vector. With the x86 processor family, the throughput of the SIMD extension is two floating-point operations per cycle. The SIMD computation of the AXPY kernel fits the hardware specifications and its execution should result in the peak performance of the target. To reach the maximum number of floating-point operations, the source code needs to be tuned with very fine-grained architecture optimizations like loop unrolling according to the instruction latencies or the number of registers to use. The MKL library proposes an optimized routine of this algorithm for the x86 processor family. Autovectorizers in compilers are also able to capture this type of kernel and generate optimized code for the targeted architecture.

Table 3 shows how Boost.SIMD performs against the autovectorizers. For the particular case of Altivec, the SIMD extension does not support the double precision floating-point type.

Table 3: Boost.SIMD vs autovectorizers for the AXPY kernel in GFlop/s

Type    Size   Version  SSE4.2  AVX    Altivec
float   2^9    gcc      3.56    16.8   0.9
               icc      13.21   17.77  -
               B.SIMD   4.70    7.94   1.07
float   2^14   gcc      3.73    9.9    0.8
               icc      10.51   14.64  -
               B.SIMD   3.49    6.21   0.89
float   2^19   gcc      3.59    8.9    0.6
               icc      8.80    11.52  -
               B.SIMD   3.98    5.70   0.35
double  2^9    gcc      3.37    6.6    -
               icc      7.49    11.31  -
               B.SIMD   1.95    3.48   -
double  2^14   gcc      2.94    4.8    -
               icc      5.56    8.10   -
               B.SIMD   1.80    3.00   -
double  2^19   gcc      2.85    4.3    -
               icc      4.51    6.07   -
               B.SIMD   1.63    2.45   -

The sizes of the vectors used are chosen according to the cache sizes of their respective targets so that they all fit in cache. The gcc and icc versions show the autovectorizers at work on the AXPY kernel written in C code. First, the memory hierarchy directly impacts the performance of the kernel when the vectors go out of a cache level. The GNU and Intel compilers are able to vectorize and unroll the loop due to the static information available in the C code. The AXPY computation can be caught at the tree SSA (Static Single Assignment) level, which gives the compilers the opportunity for optimizations. Measurements show that the performance of Boost.SIMD is slightly behind that of the two compilers. This is because of the lack of unrolling and of the fine low-level code tuning necessary to reach the peak performance of the target. The MKL library goes up to 15 GFlop/s in single precision (7.1 in double precision) and outperforms the previous results. The AXPY kernel of MKL is provided as a user function with heavy architecture-specific optimizations for Intel processors, and introduces an architecture dependency in user code.

Table 4 shows how Boost.SIMD performs against handwritten SIMD code without loop unrolling. The results of the generated code are equivalent to the SSE4.2 code and assess that Boost.SIMD delivers the expected speedup. Loop optimizations and fine load/store scheduling strategies can be added on top of Boost.SIMD to increase performance.

Table 4: Boost.SIMD vs handwritten SIMD code for the AXPY kernel in GFlop/s

Type   Size   Version     SSE4.2
float  2^9    Ref. SIMD   4.03
              Boost.SIMD  4.70
float  2^14   Ref. SIMD   3.40
              Boost.SIMD  3.49
float  2^19   Ref. SIMD   3.41
              Boost.SIMD  3.98

The previous results show that Boost.SIMD provides a portable way to access the latent speed-up of the architectures. However, as it is not a special-purpose library like MKL, its performance on this very demanding test is satisfactory yet still far from the peak performance. Other optimizations like loop unrolling and jamming are necessary to compete with the library solutions.

4.2 Sigma-Delta Motion Detection

The Sigma-Delta algorithm [10] is a motion detection algorithm used in image processing to discriminate moving objects from the background. Contrary to simple thresholded background subtraction algorithms, Sigma-Delta uses a per-pixel variance estimation to filter out outliers due to lighting conditions or contrast variation inside the image. The whole algorithm can be expressed by a series of additions, subtractions and various boolean selections. As pointed out by Lacassagne in [10], the Sigma-Delta algorithm is mainly limited by memory bandwidth, so no optimization besides SIMDization is effective, as only point-to-point operations are issued. Table 5 details how Boost.SIMD performs against scalar and handwritten versions of the algorithm; the benchmarks are realized with greyscale images.
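The paper does not spell out the per-pixel update rules, so the following is a commonly used formulation of Sigma-Delta (an assumption), written in scalar C++ to show that only additions, subtractions and selections are involved:

```cpp
#include <cstdint>
#include <cstdlib>

// One common formulation of the Sigma-Delta per-pixel update (an
// assumption: the exact rules are not given in the text). Everything
// is increments, differences and selections, hence SIMD friendliness.
struct sigma_delta_state {
    std::uint8_t m;   // background estimate
    std::uint8_t v;   // variance estimate
};

inline std::uint8_t sigma_delta_step(sigma_delta_state& s,
                                     std::uint8_t pixel, int n = 4) {
    if (s.m < pixel) ++s.m;                   // background creeps toward input
    else if (s.m > pixel) --s.m;
    int diff = std::abs(int(s.m) - int(pixel));
    if (s.v < n * diff) ++s.v;                // variance creeps toward n*diff
    else if (s.v > n * diff) --s.v;
    return diff > s.v ? 255 : 0;              // motion label
}

// Check helper: a static pixel matching the background yields no motion.
inline int demo_static_pixel() {
    sigma_delta_state s{100, 10};
    return sigma_delta_step(s, 100);
}
```

Every branch above maps to a SIMD comparison followed by a selection, which is why the kernel vectorizes with purely point-to-point operations.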
To handle this format, the type unsigned char is used, so that each vector of the SSE2 or Altivec extension can carry sixteen elements. On the AVX side, the instruction set does not provide support for the corresponding type.

Table 5: Results for the Sigma-Delta algorithm in cpp

Extension        SSE4.2            Altivec
Size             256^2    512^2    256^2    512^2
Scalar C++ (1)   9.237    9.296    14.312   27.074
Scalar C icc     2.619    2.842    -        -
Scalar C gcc     8.073    7.966    -        -
Ref. SIMD (2)    1.394    1.281    1.380    4.141
Boost.SIMD (3)   1.106    1.125    1.511    5.488
Speedup (1/3)    8.363    8.263    9.469    4.933
Overhead (2/3)   -26%     -13.9%   8.7%     24.5%

The execution time overhead introduced by the use of Boost.SIMD stays below 8.7%. On SSE2, the library outperforms the SSE2 handwritten version, while on Altivec a slow-down appears with images of 512 × 512 elements. Such a scenario can be explained by the number of images used by the algorithm and their sizes. Three vectors of type unsigned char need to be accessed during the computation, which is the critical section of the Sigma-Delta algorithm. The highest cache level of the PowerPC 970FX provides 512 KBytes of memory, which is not enough to fit the three images in cache. Cache misses become preponderant and the Load/Store unit of the Altivec extension keeps waiting for data from the main memory. This problem goes away on the Nehalem microarchitecture with its additional level of cache memory. The icc autovectorizer generates SSE4.2 code with the C version of Sigma-Delta while gcc fails. The C++ version keeps its fully scalar properties even with the autovectorizers enabled, due to the lack of static information introduced by the Generic Programming style of the C++ language. Boost.SIMD keeps the high level abstraction provided by the use of STL code and is able to reach the performance of the vectorized reference code. In addition, the portability of the Boost.SIMD code gives access to the previous speedups without rewriting the code.

4.3 RGB2YUV Color space conversion

The RGB and YUV models are both color spaces with three components that encode images or videos. As opposed to RGB, the YUV color space takes into account the human perception of colors. The Y component encodes the luminance information (brightness) and the U and V components carry the chrominance information (color). The YUV model makes artifacts that can happen during transmission or compression less noticeable to the human eye. The conversion from the RGB color space is done as follows:

• Weighted values of R, G and B are summed to obtain Y,

• Scaled differences are computed between Y and the B and R values to produce the U and V components.

The RGB2YUV algorithm fits the data parallelism requirement for SIMD computation. The comparison shown in Table 6 aims to measure the performance of Boost.SIMD against scalar C++ code and SIMD reference code. The implementation of the transformation is realized in single precision floating-point.

Table 6: Results for the RGB2YUV algorithm in cpp

Size   Version     SSE4.2   AVX      Altivec
128^2  Scalar C++  30.53    23.01    39.07
       Ref. SIMD   5.30     3.99     13.18
       Boost.SIMD  4.32     3.51     12.89
       Speedup     7.07     6.55     3.03
       Overhead    -22.7%   -13.7%   -2.2%
256^2  Scalar C++  29.23    21.46    42.51
       Ref. SIMD   6.48     2.80     29.05
       Boost.SIMD  6.51     2.45     29.01
       Speedup     4.49     8.76     1.47
       Overhead    0.04%    -14.3%   0.1%
512^2  Scalar C++  28.91    22.03    44.93
       Ref. SIMD   6.54     3.01     30.05
       Boost.SIMD  6.53     3.83     30.42
       Speedup     4.42     5.75     1.48
       Overhead    -0.1%    21.4%    1.2%

The speedups obtained with SSE4.2 are superior to the expected ×4 on such an extension and no overhead is added by Boost.SIMD. With AVX, the theoretical speedup is reached for an image of 256 × 256 elements. A slowdown appears with other sizes due to the memory hierarchy bottleneck. This effect directly impacts the performance on the PowerPC architecture for sizes larger than 128 × 128 elements. As seen in the Sigma-Delta benchmark, the lack of a level 3 cache causes the SIMD unit to wait constantly for data from the main memory. For a size of 64 × 64 on the PowerPC, Boost.SIMD performs at 4.65 cpp against 36.32 cpp for the scalar version, giving a speedup of 7.81, which confirms the memory limitation of the PowerPC architecture. The latent data parallelism in the RGB2YUV algorithm is fully exploited by Boost.SIMD and the benchmarks corroborate the ability of the library to generate efficient SIMD code.
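A scalar sketch of the conversion described above. The exact weights are an assumption (the common BT.601 coefficients); the text does not list the values actually used:

```cpp
// RGB to YUV as described above: Y is a weighted sum of R, G and B,
// and U, V are scaled differences against B and R. The coefficients
// are an assumption (standard BT.601 values), not taken from the text.
struct yuv { float y, u, v; };

inline yuv rgb2yuv(float r, float g, float b) {
    yuv out;
    out.y = 0.299f * r + 0.587f * g + 0.114f * b;  // weighted sum
    out.u = 0.492f * (b - out.y);                  // scaled difference vs B
    out.v = 0.877f * (r - out.y);                  // scaled difference vs R
    return out;
}
```

Each output component is an independent fused multiply-add chain over the three inputs, which is why the kernel maps so directly onto per-lane SIMD arithmetic.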
5. CONCLUSION

SIMD instruction sets are a technology present in an ever growing number of architectures. Despite the performance boost that such extensions usually provide, SIMD has usually been underused. However, as new SIMD-enabled hardware gets designed with increasingly large SIMD vector sizes, losing the ×4 to ×16 speed-ups they may provide in HPC applications is starting to be glaring. In this paper, we presented Boost.SIMD, a C++ software library which aims at simplifying the design of SIMD-aware applications while providing a portable high-level API, integrated with the C++ Standard Template Library. Thanks to meta-programmed optimization opportunity detection and code generation, Boost.SIMD has been shown to be on par with handwritten SIMD code while also providing sufficient expressiveness to implement non-trivial applications. In opposition to compiler-dependent optimizations and manufacturers' libraries, Boost.SIMD solves the problem of portable SIMD code generation.

Future work includes providing support for classical numeric types like complex numbers or various pixel encodings, using the AST exploration system to estimate the proper unrolling and blocking factor for any given expression, and expanding the library to SIMD-enabled embedded systems like the ARM Cortex processor family.

6. ACKNOWLEDGMENT

The authors would like to thank Pierre-Louis Caruana for his help on the experiments reported in this paper.

7. REFERENCES

[1] AMD. AMD Core Math Library. http://developer.amd.com/libraries/acml/pages/default.aspx.
[2] K. Czarnecki, U. W. Eisenecker, R. Glück, D. Vandevoorde, and T. L. Veldhuizen. Generative programming and active libraries. In Generic Programming, pages 25–39, 1998.
[3] J. de Guzman, D. Marsden, and T. Schwinger. Boost.Fusion library. http://www.boost.org/doc/libs/release/libs/fusion/doc/html.
[4] D. Gregor, J. Järvi, J. Siek, B. Stroustrup, G. Dos Reis, and A. Lumsdaine. Concepts: linguistic support for generic programming in C++. In Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications, OOPSLA '06, pages 291–310, New York, NY, USA, 2006. ACM.
[5] G. Guennebaud, B. Jacob, et al. Eigen v3. http://eigen.tuxfamily.org, 2010.
[6] Intel. Math Kernel Library. http://developer.intel.com/software/products/mkl/.
[7] M. Jang, K. Kim, and K. Kim. The performance analysis of ARM NEON technology for mobile platforms. In Proceedings of the 2011 ACM Symposium on Research in Applied Computation, RACS '11, pages 104–106, New York, NY, USA, 2011. ACM.
[8] M. Kretz and V. Lindenstruth. Vc: A C++ library for explicit vectorization. Software: Practice and Experience, 2011.
[9] J. Kurzak, W. Alvaro, and J. Dongarra. Optimizing matrix multiplication for a short-vector SIMD architecture: CELL processor. Parallel Computing, 35(3):138–150, 2009.
[10] L. Lacassagne, A. Manzanera, J. Denoulet, and A. Mérigot. High performance motion detection: some trends toward new embedded architectures for vision systems. Journal of Real-Time Image Processing, 4:127–146, 2009.
[11] N. C. Myers. Traits: a new and useful template technique. C++ Report, June 1995.
[12] E. Niebler. Proto: A compiler construction toolkit for DSELs. In Proceedings of the ACM SIGPLAN Symposium on Library-Centric Software Design, 2007.
[13] D. Nuzman, S. Dyshel, E. Rohou, I. Rosen, K. Williams, D. Yuste, A. Cohen, and A. Zaks. Vapor SIMD: Auto-vectorize once, run everywhere. In CGO, pages 151–160, 2011.
[14] T. Saidani, L. Lacassagne, J. Falcou, C. Tadonki, and S. Bouaziz. Parallelization schemes for memory optimization on the Cell processor: a case study on the Harris corner detector. In P. Stenström, editor, Transactions on High-Performance Embedded Architectures and Compilers III, pages 177–200. Springer-Verlag, Berlin, Heidelberg, 2011.
[15] J. Shin. Introducing control flow into vectorized code. In International Conference on Parallel Architectures and Compilation Techniques, pages 280–291, 2007.
[16] D. Spinellis. Notable design patterns for domain-specific languages. Journal of Systems and Software, 56(1):91–99, 2001.
[17] A. Stepanov and M. Lee. The Standard Template Library. Technical Report 95-11(R.1), HP Laboratories, 1995.
[18] H. Wang, H. Andrade, B. Gedik, and K.-L. Wu. A code generation approach for auto-vectorization in the SPADE compiler. In LCPC '09, pages 383–390, 2009.