An Evaluation of Current SIMD Programming Models for C++

Angela Pohl, Biagio Cosenza, Mauricio Alvarez Mesa, Chi Ching Chi, Ben Juurlink
TU Berlin, Berlin, Germany
{angela.pohl, cosenza, mauricio.alvarezmesa, chi.c.chi, b.juurlink}@tu-berlin.de

Abstract

SIMD extensions were added to microprocessors in the mid '90s to speed up data-parallel code by vectorization. Unfortunately, the SIMD programming model has barely evolved and the most efficient utilization is still obtained with elaborate intrinsics coding. As a consequence, several approaches to write efficient and portable SIMD code have been proposed. In this work, we evaluate current programming models for the C++ language, which claim to simplify SIMD programming while maintaining high performance. The proposals were assessed by implementing two kernels: one standard floating-point benchmark and one real-world integer-based application, both highly data parallel. Results show that the proposed solutions perform well for the floating-point kernel, achieving close to the maximum possible speed-up. For the real-world application, the programming models exhibit significant performance gaps due to data type issues, missing template support and other problems discussed in this paper.

Keywords SIMD, vectorization, C++, parallel programming, programming model

1. Introduction

Single Instruction Multiple Data (SIMD) extensions have their origin in the vector supercomputers of the early '70s and were introduced to desktop microprocessors around twenty years later, when the demand for more compute power grew due to increasingly popular gaming and video applications. They exploit Data Level Parallelism (DLP) by executing the same instruction on a set of data simultaneously, instead of repeating it multiple times on a single, scalar value. An example is the brightening of a digital image, where a constant value is added to each pixel. When using SIMD, a vector of pixels is created and the constant is added to each vector element with one instruction.

The number of bits that can be processed in parallel, the vector size, has been growing with each SIMD generation. A short overview of the evolution of the most common SIMD extensions can be found in [14, 15]. Along with the vector size, the number of available SIMD instructions has been increasing as well, adding more advanced functions to the repertoire over the years.

In order to use these SIMD features, code has to be vectorized to fit the underlying hardware. There are several ways to accomplish this. The least effort for the programmer is using the auto-vectorization capabilities of modern compilers. These compilers implement one or more automatic vectorization passes, commonly applying loop vectorization and Superword Level Parallelism (SLP). Within such a pass, the compiler analyzes the code by looking for instructions that profit from the scalar-to-vector conversion, and then transforms them accordingly. Traditional automatic loop vectorization [6] succeeds for well-defined induction variables and statically analyzable inter- and intra-loop dependencies, but it fails to vectorize codes with complex control flows or structured data layouts. SLP vectorizers [5] typically work on straight-line code and scan for scalars that can be grouped together into vectors; recent studies have shown, however, that the average number of vector lanes occupied this way is merely two [9].

Significantly more effort has to be spent when a programmer chooses to use intrinsic functions to vectorize code. Intrinsics are low-level functions that implement all SIMD instructions of a processor architecture. Though they are at a higher level of abstraction than assembly programming, coding with them comes with its own challenges, i.e. effort, portability and compatibility. Nonetheless, intrinsics coding is still considered state-of-the-art for maximum performance gain, as its low-level approach results in efficient hardware utilization.

With these two options, SIMD programming presents itself as a trade-off between effort and performance, where the programmer can only choose between the two extrema.
This trade-off is illustrated in Figure 1, where an HEVC (H.265) video decoder [16] was vectorized manually with intrinsics and compared to the results of the most commonly used C++ compilers' auto-vectorizers. It is apparent that high programming effort results in high performance and vice versa. To find a middle ground, researchers have worked on inventing more convenient programming models that still deliver sufficient performance. In this paper, we provide an overview of such programming models for current SIMD units and evaluate how well they do in terms of speed-ups. For this purpose, two kernels were assessed:

1. the Mandelbrot benchmark, which is highly data-parallel, straight-forward to vectorize and works with 32bit floating point numbers

2. an HEVC interpolation kernel, taken from a real-world HEVC decoder, also highly data-parallel, based on function templates and working with 8bit and 16bit integers.

This paper will discuss the challenges of intrinsics programming in more detail in Section 2, as the common goal of all proposed programming models is overcoming them while maintaining high performance. It will then present an overview of all proposed approaches in Section 3, highlighting the different paths taken by each of them. Afterwards, the kernel implementations are evaluated in Section 4. Related work is briefly presented in Section 5 and all findings are summarized in Section 6.

[Figure 1: bar chart. Speed-ups obtained with an auto-vectorized and an intrinsics-based implementation of a real-world HEVC video decoder, shown for the most popular C++ compilers (4K resolution video decoding, 8 threads on an Intel i7-4770 with AVX2). For GCC, ICC and Clang++/LLVM, the scalar baseline is 1x and auto-vectorization yields only 1.04x to 1.57x, while the intrinsics versions reach 3.93x to 4.21x.]
2. Challenges of Intrinsics Programming

2.1 Effort

The biggest drawback of intrinsics coding is the significant programming effort that is required. Due to their low level of abstraction, intrinsics do not offer advanced functions or data structures. Instead, they work with elemental numeric types, such as int and float, and most of their functionality is limited to basic arithmetic, boolean and load/store operations. Code vectorization with intrinsics is therefore time-consuming, because all high-level language constructs, such as containers or functors, have to be mapped to this low-complexity feature set.

2.2 Portability

Besides the significant effort of programming with intrinsics, portability of intrinsics code across compute platforms is another challenge. Typically, intrinsics are tied to a specific base ISA. For example, it is not possible to run code written for Intel AVX2 on an ARM machine that supports NEON. Therefore, it is crucial to maintain several versions of a program in order to support the most common processor architectures. Otherwise, only the limited set of intrinsics that is available across SIMD units can be used, ignoring each hardware's unique features and thus opportunities for performance gain.

2.3 Forward Compatibility

In addition to the portability issue, there is compatibility. Since SIMD extensions are generally backwards compatible, intrinsics code will still work when a new SIMD ISA is released. Unfortunately, it will not make use of the hardware's innovations unless a new version using the latest intrinsic set is implemented. On the other hand, when an application provider wants to ensure maximum performance even (or especially) on older platforms, code versions have to be provided for older SIMD ISAs along with current ones, removing intrinsics added in later SIMD releases. In the end, several code versions need to exist to ensure portability and compatibility, each of them requiring time-consuming low-level coding with intrinsics.
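As a small illustration of the portability problem (our sketch, not from the paper), even a four-lane floating-point addition has to be written once per ISA:

    #if defined(__SSE__)
    #include <xmmintrin.h>
    void add4(float* d, const float* a, const float* b) {
        _mm_storeu_ps(d, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    }
    #elif defined(__ARM_NEON)
    #include <arm_neon.h>
    void add4(float* d, const float* a, const float* b) {
        vst1q_f32(d, vaddq_f32(vld1q_f32(a), vld1q_f32(b)));
    }
    #endif

Every additional operation, data type and vector width multiplies this duplication, which is exactly the maintenance burden described above.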

3. Programming Models for Code Vectorization

To address these challenges of intrinsics coding, new programming models have been developed to facilitate a higher level of abstraction for code vectorization, while aiming for the same performance. This section provides an overview of such programming models and categorizes them by their vectorization techniques. We make the distinction between implicit and explicit approaches, where implicit vectorization is performed by the compiler based on directives, and explicit vectorization requires manual code adjustments by the programmer. In addition, there are hybrid forms that combine the two concepts.

3.1 Implicit Vectorization

When using programming models based on implicit vectorization, the programmer does not have to vectorize the code manually. Instead, directives, for example in the form of pragmas, are provided to improve a compiler's auto-vectorization passes. This approach is based on Single-Program Multiple-Data (SPMD) programming models, which are typically used for multi- and many-core platforms to achieve thread-level parallelism.

When a new SIMD generation is published, this vectorization technique does not require adjustments to the application source code, as the parallel regions have not changed. Instead, the compiler, which interprets the directives and performs the intrinsics mapping, has to be updated to support the new hardware features.

3.1.1 Auto-vectorization

Compiler-based auto-vectorization relies on two main techniques: loop vectorization and SLP. For this work, we assessed the vectorization capabilities of three widely used compilers.

The GCC vectorizer is automatically turned on with the -O3 compiler flag. It includes analyses for memory access patterns and loop-carried dependencies. Based on their findings, the vectorizer performs a profitability estimation to decide if a loop should be vectorized. In addition, SLP is implemented through basic block optimizations.

LLVM contains two vectorizers, for loop and SLP vectorization, respectively; it is also able to handle special cases, such as unknown trip counts, runtime checks for pointers, reductions and pointer induction variables.

While ICC's vectorization approach is not published, additional documentation to help with writing auto-vectorizable code is available [1]. Vectorization is typically not performed for non-contiguous memory accesses and data dependencies, though.

In addition, all three compilers offer pragma directives to provide hints for the vectorizers, for example

#pragma ivdep

in ICC. For the results in Figure 1, however, we did not apply these to the scalar code, because our goal was to test the standard automatic approach.

3.1.2 OpenMP 4.0

Open Multi-Processing [17] is an API that has been developed and enhanced by a consortium of compiler and hardware companies since the late '90s. It relies on pragmas to identify parallel regions within the source code, and is supported by the most popular C++ compilers. While originally designed to support multi-threading, the pragma

#pragma omp simd

has been added with the latest standard version, version 4.0. It identifies functions and loops as data-parallel, and can be used to define linear memory access patterns and safe vector lengths using clauses. Unfortunately, not all C++ compilers support the latest 4.0 standard yet.

3.1.3 Cilk Plus

Cilk Plus started out as a project at MIT (named Cilk then) and is now owned and developed by Intel [18]. It is a general-purpose programming language based on C/C++, and hence integrates seamlessly into respective program codes. It targets both thread- and data-level parallelism. Cilk Plus offers a pragma

#pragma simd

similar to OpenMP 4.0, as well as language features for code vectorization, such as array range notations. It is possible to leave original sources untouched, though, and just add pragmas that are treated as comments by compilers other than the Intel C/C++ Compiler; this is critical as ICC is the only compiler that currently supports Cilk Plus.
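As a sketch of how little the source changes under these pragma-based models (our example with function names of our choosing, not taken from the paper), a simple data-parallel loop only gains an annotation; in Cilk Plus it can alternatively be condensed into array notation:

    // OpenMP 4.0: declare the loop data-parallel; safelen bounds the
    // vector length the compiler may assume to be safe.
    void saxpy_omp(float* y, const float* x, float a, int n) {
        #pragma omp simd safelen(8)
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    // Cilk Plus: the same hint via #pragma simd, or array range notation.
    void saxpy_cilk(float* y, const float* x, float a, int n) {
        #pragma simd
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
        // alternatively: y[0:n] = a * x[0:n] + y[0:n];
    }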
3.2 Explicit Vectorization

The programming models for explicit vectorization rely on the programmer to vectorize code manually, using application knowledge. To simplify this task, a set of data types and functions that facilitates a higher level of abstraction is provided by a language library. The written code is then mapped to low-level intrinsics "under the hood", i.e. the library takes care of selecting the correct intrinsic for the target platform, while the compiler performs the register allocation. Instead of updating the application code when a new SIMD ISA is available, only the library has to be updated once to make use of the new features. With this approach, the portability and compatibility issues are not eliminated, but reduced to the minimal effort of a single code update. The following programming models are currently available for explicit code vectorization.

3.2.1 Intrinsics

As mentioned earlier, intrinsics programming is still the state-of-the-art programming model for best SIMD performance. Though providing a higher level of abstraction than assembly programming, generating efficient code is time-consuming and comes with its own challenges, as described in Section 2. For more information, please refer to the SIMD manufacturers' manuals [25, 26].

3.2.2 C++ Data Types

Standard C++ offers two different data types to vectorize code:

• std::vector and std::array
• std::valarray

std::vector and std::array are container classes that are templated to accommodate any data type and size. They come with a set of class member functions, such as iterators, size operators and modifiers. Because this type of vector/array is not intended for mathematical operations, element-wise arithmetic functions are not provided and there is no straightforward mapping of these containers to SIMD intrinsics. As a consequence, the written code has to rely solely on the compiler for auto-vectorization, and there is no performance gain compared to strictly scalar code.

In contrast, the valarray class was introduced to standard C++ to provide an array notation for mathematical operations, trying to port a concept from Fortran to C++. Development stalled, though, when the Standard Template Library (STL) was added in the late '90s. Again, there is no direct connection between this class and SIMD intrinsics, so it relies on auto-vectorization as well. When using structured data, this class can actually cause a slow-down for certain accessing schemes.
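For illustration (our sketch, not from the paper), the element-wise notation std::valarray provides looks as follows; whether any SIMD instructions are emitted is still entirely up to the auto-vectorizer:

    #include <valarray>

    // Element-wise arithmetic in array notation; no guaranteed SIMD mapping.
    std::valarray<float> brighten(const std::valarray<float>& pixels) {
        std::valarray<float> result = pixels * 1.1f + 8.0f; // scale and offset every element
        return result;
    }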

3.2.3 Macro Abstraction Layer

The Macro Abstraction Layer (MAL) [21] offers the lowest level of abstraction of all the programming models evaluated in this work. It was developed at the Norwegian University of Science and Technology (NTNU) as a side project of vectorizing the PARSEC benchmark suite, and offers a set of macros to use in lieu of intrinsic functions. For example, a single ADD macro replaces the multiple intrinsic versions for the SSE, AVX and NEON platforms. MAL can be easily extended to more ISAs and intrinsic functions by adding and extending macros, as the current version only contains the functions needed for the PARSEC vectorization. Unfortunately, operations that exist in a single ISA only cannot be abstracted this way, even though these might be critical to gain optimal performance. Moreover, the programmer needs a good understanding of intrinsics programming, ideally starting from an intrinsics implementation with the goal of generalizing it for a range of ISAs.
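The idea can be sketched as follows (hypothetical macro and type names of our choosing; MAL's actual spelling is not shown here): one generic name per operation, resolved to the platform's intrinsic at compile time.

    // Hypothetical macro layer in the spirit of MAL.
    #if defined(__AVX__)
      #include <immintrin.h>
      typedef __m256 vfloat;                 // 8 lanes
      #define VADD(a, b) _mm256_add_ps((a), (b))
    #elif defined(__SSE__)
      #include <xmmintrin.h>
      typedef __m128 vfloat;                 // 4 lanes
      #define VADD(a, b) _mm_add_ps((a), (b))
    #elif defined(__ARM_NEON)
      #include <arm_neon.h>
      typedef float32x4_t vfloat;            // 4 lanes
      #define VADD(a, b) vaddq_f32((a), (b))
    #endif
    // Application code is then written once against vfloat/VADD.

An operation like AVX2's gather, which has no NEON counterpart, cannot be hidden behind such a macro; this is the limitation noted above.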
3.2.4 Vc

The Vc library, which was developed at Goethe Universität Frankfurt in Germany, is based on the straight-forward approach of wrapping SIMD vectors and intrinsics in high-level functions with zero overhead [22]. This approach is similar to the one taken by the Macro Abstraction Layer, but targets a higher level of abstraction and avoids dealing with macros.

Contrary to other library solutions, though, programmers do not have control over the applied vector size when using Vc. Instead, only generic vector data types, combined with a set of functions and masking operations, are used. For example, when using 32bit floating point numbers in the form of a float_v vector, the library will determine how many operations the target hardware can execute in parallel and size the vector accordingly. This has the advantage that the programmer does not need to deal with hardware specifics, but when an exact vector size is needed, like in the HEVC kernel discussed in Section 4, additional masking operations have to be applied to disable vector lanes. For example,

    Vc::float_v my_vec(src);
    my_vec &= 0xFF;

will ensure that only the lowest 8 entries of the vector are used, although the hardware might be able to process more.

3.2.5 Boost.SIMD

Boost.SIMD [15] was developed at the Université Paris-Sud in collaboration with the Université Blaise Pascal and is now hosted by Numscale; though the name suggests otherwise, it is not part of the official Boost library yet, but is incorporated into the NT2 project. It was designed as an Embedded Domain Specific Language, and as such uses expression templates to capture the abstract syntax tree and perform optimizations on it during the compilation stage. Boost.SIMD is based on STL containers and the definition of functors, and the sources are distributed as a header-only library.

Along with an elaborate set of mathematical functions and C++ standard components, it provides an abstraction of SIMD registers called pack. For a given type T and a static integral value N (N = 2^x), a pack encapsulates the best type able to store a sequence of N elements of type T. Such a vector is created by

    auto my_v = boost::simd::load<boost::simd::pack<T, N>>(ptr);

When both T and N match the type of a SIMD register, my_v will be mapped to that type; otherwise, Boost.SIMD will fall back on using std::array, whose drawbacks have been discussed earlier. With this approach, it is possible to either define vector sizes when an optimal solution is known, or to leave it up to the compiler to find a mapping.
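Building on the load call above, a complete pack computation might look as follows. This is a sketch against the NT2-era interface as we understand it; the exact header paths and function spellings varied between releases:

    // Include Boost.SIMD's pack and memory function headers here;
    // the exact paths differed across NT2 releases.
    using pack8 = boost::simd::pack<float, 8>;

    void scale(float* dst, const float* src) {
        pack8 x = boost::simd::load<pack8>(src); // as in the text above
        pack8 y = x * x + pack8(1.0f);           // expression templates fuse the AST
        boost::simd::store(y, dst);              // one SIMD store if pack8 fits a register
    }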

3.2.6 Generic SIMD Library

Another approach is pursued by the Generic SIMD (gSIMD) Library [20], which also wraps platform intrinsics in generic vector types; the version available at the time of this evaluation supports vectors with four lanes (cf. Section 4). The remaining models evaluated in Section 4 are the Cyme library with its user-defined AoS/AoSoA containers [19], and the hybrid approaches ispc [28] and Sierra [27], which combine compiler support with explicit SPMD-style code.

    void Mandel(float x1, float y1, float x2, float y2,
                int width, int height, int maxIters, int* image)
    {
        float dx = (x2 - x1) / width;
        float dy = (y2 - y1) / height;
        for (int j = 0; j < height; ++j) {
            for (int i = 0; i < width; ++i) {
                MyComplex c(x1 + i * dx, y1 + j * dy);
                MyComplex z(0, 0);
                int count = -1;
                // assumed completion: iterate z = z*z + c until |z| >= 2
                // (abs() on MyComplex) or the iteration limit is reached
                while ((++count < maxIters) && (abs(z) < 2.0f))
                    z = z * z + c;
                image[j * width + i] = count;
            }
        }
    }

Figure 2. Scalar source code of the Mandelbrot kernel

4. Evaluation

4.1 Mandelbrot Kernel

4.1.1 Experimental Setup

Speed-up was measured by comparing the number of execution cycles to the auto-vectorized baseline from Figure 2. The execution cycles were measured using the infrastructure delivered with the ispc examples [29], which is based on standard rdtsc() calls. All results, with the exception of Sierra, were generated using ICC, but the GNU Compiler Collection and Clang++/LLVM produced the same quality of results with similar execution times.

4.1.2 Results

Figure 3 depicts the speed-ups obtained by the different programming models. As expected, the proposed programming models perform well and a significant speed-up is achieved by most. The only two exceptions are gSIMD, compiled for AVX2, and Cyme. For gSIMD, it has to be taken into consideration that this library only supports SSE4.2 with a maximum vector size of 128bit for floating point numbers, i.e. a 4x maximum speed-up. When benchmarking all programming models using SSE4.2, gSIMD is on the same level as the other programming models, achieving a 1.9x–2.3x speed-up.

For Cyme, the data layout in the Mandelbrot kernel does not match the library's AoS or AoSoA containers. The kernel has to be adapted to these data structures, and as a consequence, performance goes down due to non-native data reordering. In short, this library was not designed for this application class. As the HEVC kernel's data layout is similar to this example, Cyme will not be evaluated for the next example.

All other programming models perform well, some even outperforming the AVX intrinsics implementation. For Boost.SIMD and Vc, the effort of explicitly vectorizing the code paid off, as they achieve an even higher speed-up than the implicit proposals Cilk Plus and OpenMP 4.0. It is also interesting to note that vectorization is beneficial even for a single or a small number of loop iterations.

4.2 HEVC Kernel

4.2.1 Experimental Setup

The second kernel that was vectorized is part of the HEVC decoder benchmarked in Figure 1. In this kernel, a filter is applied to a video frame, meaning that for each pixel, the dot product of the pixel's neighbours with a set of coefficients is calculated. Afterwards, the result is clipped to fit into the [0, 255] interval and stored in a separate array.

The kernel function is templated to work with unsigned 8bit or 16bit integer pixels, i.e. data types uchar and short in C++, as well as different configuration options. For example, a template parameter determines the number of neighbours used for calculating the dot product. A stride parameter for vector gathering is introduced, too, to enable vertical and horizontal filtering. No further control flow is added within the loops. The basic kernel code is shown in Figure 4.

    ...
        if (shiftBack) {
          dst[col] = MyClip(0, iMaxVal, (sum + offset) >> shift);
        } else {
          short val = sum >> shift;
          ...
          // more store options
          // with templated control flow
          ...
        }
      }
      src += srcStride;
      dst += dstStride;
    }

Figure 4. Scalar source code of the HEVC interpolation kernel

We implemented the interpolation kernel with the proposed programming models, based on knowledge from published papers, examples and documentation. Results were measured with the same system used for the Mandelbrot evaluation, an Intel i7-4770 processor with AVX2. As the performance is workload dependent, i.e. dependent on the video resolution, encoding scheme etc., a set of 104 standard test videos was benchmarked.

The kernel's maximum theoretical speed-up depends on the operand sizes of the dot product. As one of the vectors is always of data type short and AVX2 supports integer vectors with up to 128bit for this operation, the speed-up is limited to 8x.

4.2.2 Results

Figure 5 shows the speed-ups obtained for the best and worst test video, i.e. the videos achieving the highest and lowest speed-up with the intrinsics implementation. Execution times were measured with a built-in program timer.

As can be seen in this figure, the programming models could not reach intrinsics performance by far, contrary to the results of the Mandelbrot experiment. Even worse, some implementations slowed the kernel down.

At this point it must be mentioned, though, that the intrinsics implementation also applies saturated arithmetic for the clipping and shifting operations that are performed before storing the data. That is why a speed-up above the theoretical maximum (8.6x) was achieved.
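To see why saturated instructions help here, consider how the clip-and-shift store can be fused. The following is a minimal SSE2 illustration of ours, not the decoder's actual code; it assumes eight 16bit sums per step:

    #include <emmintrin.h>

    // Scalar: shift, then clip explicitly to [0, 255].
    static inline unsigned char store_pixel(int sum, int offset, int shift) {
        int v = (sum + offset) >> shift;
        return (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }

    // SSE2: the pack instruction saturates eight 16bit values to 8bit
    // in one step, so no explicit comparisons are needed.
    static inline __m128i store_pixels(__m128i sum, __m128i offset, int shift) {
        __m128i v = _mm_srai_epi16(_mm_add_epi16(sum, offset), shift);
        return _mm_packus_epi16(v, v);  // low 8 bytes hold the clipped pixels
    }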
Although programming models like ispc offer this kind of arithmetic functions, they were not applied to the benchmarked kernel implementations; the reason being that the vectorization capabilities were to be tested, and the saturated arithmetic only makes up a small portion of the overall performance gain.

A problem shared among several programming models, namely gSIMD, Cyme and ispc, was an issue with data-type promotion. In the HEVC kernel, a multiplication of a short and a second operand, either uchar or short, is performed as part of the dot product. Although precautions were taken by the programmer and the result was assigned to an int variable, overflows occurred, whereas the scalar version yielded correct results in accordance with the language standard. This is illustrated in Figure 6. As a consequence, all vectors had to be promoted to int, halving the maximum possible speed-up.

Boost.SIMD was not able to resolve the kernel to a faster implementation than scalar code; neither a performance loss nor a gain was obtained. Apparently, the pack data types were resolved to std::array instead of SIMD vector types. A possible contributing factor were the manual gather implementations, as the gather functions provided with the library were either not implemented or not documented properly.

The gSIMD implementation was affected by the library's level of maturity, because the current version only supports vectors with 4 lanes. In the HEVC kernel, however, either 4 or 8 neighbours of a pixel are accessed to calculate the dot product. When 8 neighbours are needed, the calculation has to be looped twice to produce correct results, so the inner for-loop was not eliminated.

The different number of neighbours also played a role in the Vc implementation. Since Vc does not allow the programmer to determine the vector size, masking precautions had to be put in place. Nonetheless, Vc's performance for 8bit pixel depth is the best of all explicit programming models. It decreases drastically for 16bit pixel depth, though. The big difference here is the gathering of vector data. For this kernel, two different versions had to be implemented, as Vc does not support 8bit integers. When gathering 8bit source data into vectors, a built-in gather function could be used for a stride of 2; otherwise, a scalar implementation including a type cast to short was used as a fall-back solution. In the 16bit version, the built-in gather functions were used exclusively. Apparently, this had a significant performance impact and the kernel slowed down significantly.

Besides the data-type promotion issue, ispc faced a problem due to the heavy templating of the original scalar function. Since template parameters had to be handed to ispc as function arguments, they could not be resolved at compile time and a constant value could not be assumed by the compiler. For example, the shift amount for the store operation caused performance warnings during compilation, which was not an issue in the scalar function, as this parameter is always a power of two. When all template parameters were treated as constants in the ispc code and the failing data-type promotion was ignored, a speed-up of 50% could be achieved in a small additional experiment.

For the Sierra implementation, the biggest problem was the lack of appropriate constructors and gather operations. These had to be hand-coded, because pointer support is not yet complete in the current version, and all vectors had to be created based on pointers to pixel sources. The resulting code resembled a scalar implementation, and more overhead was added than the SIMD acceleration could make up for.

[Figure 3: grouped bar charts for SSE4.2 (top) and AVX2 (bottom). Acceleration of a Mandelbrot kernel with SSE4.2 and AVX2 SIMD units, shown for varying iterations of the inner while-loop (1, 10, 100, 1000). Models on the x-axis: Boost.SIMD, Cilk Plus, Cyme, gSIMD, ispc, OpenMP, Sierra, Vc (plus AVX intrinsics in the AVX2 chart); the y-axis shows the speed-up. With SSE4.2, speed-ups range from about 0.6x to 3.7x; with AVX2, from about 0.4x to 7.2x.]

[Figure 5: bar chart. Speed-ups of the HEVC interpolation kernel, showing best and worst test-video results for 8bit and 16bit pixel depth for Intrinsics, Boost.SIMD, Cilk Plus, gSIMD, ispc, OpenMP, Sierra and Vc. The intrinsics implementation reaches up to 8.6x, while the programming models stay at or below about 2.6x, with several implementations dropping to 0.3x, i.e. slowing the kernel down.]

    short op_a = 0xFFFF;
    short op_b = 0x000F;
    int prod = op_a * op_b;

    Standard C++:    prod = 0xEFFF1
    SIMD Libraries:  prod = 0x0FFF1

Figure 6. Problem of data-type promotion in gSIMD, Cyme, and ispc

5. Related Work

Automatic vectorization. Loop vectorization has been improved with efficient run-time and static memory alignment [2, 3], reduced data permutations [4] and other techniques including outer-inner loop interchange and efficient data realignment [6]; however, Maleki et al. [7] showed that state-of-the-art compilers can only vectorize a small fraction of loops in standard benchmarks like MediaBench.

Recently, a number of publications have addressed the vectorization of straight-line code through Superword Level Parallelism (SLP). Those include SLP exploitation in the presence of control flow [10], vectorization guided by an SLP tree [12], and a throttled dependence graph approach that stops vectorization prematurely when it is no longer beneficial [9]; an alternative approach implements SLP in a back-end vectorizer that is closer to the code generation stage and therefore has more precise information for the profitability estimation analysis [11].

OpenCL. The Open Computing Language (OpenCL) [13] provides a standard interface for using task-based and data-based parallelism. It is an open standard maintained by the Khronos Group, and compliant implementations are available from different vendors, such as Altera, AMD, Apple, ARM, Intel, NVIDIA, Xilinx and others. OpenCL implementations may implicitly exploit vectorization on devices with such capability. For example, the Intel SDK for OpenCL implements an implicit vectorization module; it maps OpenCL work-items to hardware according to SIMD elements, so that several work-items fit in one SIMD register to run simultaneously.

Intel Array Building Blocks. No longer maintained, Intel's Array Building Blocks (ArBB) [31] were proposed as a counterpart to Intel's Thread Building Blocks in 2011. In 2012, though, the effort was given up in favor of pursuing Cilk Plus, and the sources are no longer available [30]. The concept is based on providing an embedded, compiled language whose syntax is implemented as an API. To the programmer, it presents itself as a language library, although the API is layered on top of a virtual machine, which implements the compilation and parallelization. Performance numbers published for the Mandelbrot speed-up were between 3x and 4x, thus in line with the numbers we show for SSE4.2 acceleration.

6. Conclusion

SIMD extensions have been introduced to microprocessors in the late 1990s, with regular releases of new ISA extensions up to this day. Nonetheless, using them efficiently still remains a problem due to the lack of a sufficient programming model. The state-of-the-art in terms of performance, i.e. intrinsics programming, comes with its challenges of effort, portability and compatibility.

Consequently, new solutions to improve productivity and hardware utilization have been proposed, focusing on abstracting code vectorization while maintaining performance. In this work, we categorized these approaches into implicit and explicit vectorization techniques; for implicit approaches, vectorization is taken care of by the compiler, while for explicit approaches, vectorization is typically done by the programmer at a higher level of abstraction than intrinsics coding.

We assessed a set of programming models by implementing two different kernels and comparing the achieved speed-ups. The first kernel calculates a graphic representation of the Mandelbrot set and is commonly used to showcase the power of SIMD extensions. The kernel works with 32bit floating point numbers, and vectorization is straight-forward. Results show that most of the proposed programming models achieve a speed-up close to the theoretical maximum. In our measurements, the explicit programming models fared slightly better than the implicit ones.

As a second benchmark, we implemented an interpolation kernel from a real-world HEVC decoder. It calculates the dot product of each pixel's neighbours with a set of coefficients, and works on small integer data types, such as char and short. Results showed that the intrinsics implementation is at least 2x–8x faster than all other programming models, not taking into consideration those approaches that slowed down the kernel. For this benchmark, the implicit approaches beat the explicit approaches.

What can be seen from these results is that a single proposal that improves performance, reduces implementation effort and fits all types of applications does not yet exist. Each approach has its unique advantages, and a programming model needs to be chosen carefully, knowing the application's structure and demands.

In this work, we focused on assessing highly data-parallel kernels. Nonetheless, a major obstacle for failing auto-vectorizers are complex control flows and data structures within a kernel. Hence, the programming models' capabilities to deal with such applications need to be investigated next in order to give a more thorough analysis of the currently available programming models.

References

[1] Intel. A Guide to Auto-vectorization with Intel C++ Compilers. https://software.intel.com/sites/landingpage/IntrinsicsGuide/, 2012

[2] A. E. Eichenberger, P. Wu, and K. O'Brien. Vectorization for SIMD architectures with alignment constraints. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI), 2004

[3] P. Wu, A. Eichenberger, and A. Wang. Efficient SIMD code generation for runtime alignment and length conversion. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2005

[4] G. Ren, P. Wu, and D. Padua. Optimizing data permutations for SIMD devices. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI), 2006

[5] S. Larsen and S. Amarasinghe. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2000

[6] D. Nuzman and A. Zaks. Outer-loop Vectorization: Revisited for Short SIMD Architectures. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2008

[7] S. Maleki, Y. Gao, M. J. Garzarán, T. Wong, and D. A. Padua. An evaluation of vectorizing compilers. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), 2011

[8] C. E. Kozyrakis, S. Perissakis, D. Patterson, T. Anderson, K. Asanovic, N. Cardwell, R. Fromm, J. Golbus, B. Gribstad, K. Keeton, R. Thomas, N. Treuhaft and K. Yelick. Scalable Processors in the Billion-Transistor Era: IRAM. IEEE Computer, 1997

[9] V. Porpodas and T. M. Jones. Throttling Automatic Vectorization: When Less Is More. In Proceedings of the 24th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2015

[10] J. Shin, M. Hall, and J. Chame. Superword-level parallelism in the presence of control flow. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2005

[11] R. Barik, J. Zhao, and V. Sarkar. Efficient selection of vector instructions using dynamic programming. In Proceedings of the International Symposium on Microarchitecture (MICRO), 2010

[12] J. Liu, Y. Zhang, O. Jang, W. Ding, and M. Kandemir. A compiler framework for extracting superword level parallelism. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI), 2012

[13] L. Howes. The OpenCL Specification, Version 2.1, Revision 23. Khronos Group, 2015

[14] N. Slingerland and A. Smith. Multimedia Instruction Sets for General Purpose Microprocessors: a Survey. University of California, 2000

[15] P. Estérie, J. Falcou, M. Gaunard and J.-T. Lapresté. Boost.SIMD: Generic Programming for Portable SIMDization. In WPMVP, 2014

[16] C. Chi, M. Alvarez-Mesa, B. Bross, B. Juurlink and T. Schierl. SIMD Acceleration for HEVC Decoding. In IEEE Transactions on Circuits and Systems for Video Technology, 2014

[17] OpenMP Architecture Review Board. OpenMP Application Program Interface. http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf, 2013

[18] A. Robison. Composable Parallel Patterns with Intel Cilk Plus. In Computing in Science and Engineering 15.2, 2013

[19] T. Ewart, F. Delalondre and F. Schuermann. Cyme: A Library Maximizing SIMD Computation on User-Defined Containers. In Supercomputing, 2014

[20] H. Wang, P. Wu, I. Tanase, M. Serrano and J. Moreira. Simple, Portable and Fast SIMD Intrinsic Programming: Generic SIMD Library. In WPMVP, 2014

[21] J. Cebrian, M. Jahre and L. Natvig. Optimized Hardware for Suboptimal Software: The Case for SIMD-aware Benchmarks. In ISPASS, 2014

[22] M. Kretz and V. Lindenstruth. Vc: A C++ library for explicit vectorization. In Software: Practice and Experience, 2012

[23] Intel Corp. Introduction to Intel Advanced Vector Extensions. https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions, 2015

[24] Freescale Semiconductor. AltiVec Technology Programming Interface Manual. http://www.freescale.com/files/32bit/doc/ref_manual/ALTIVECPIM.pdf, 1999

[25] Intel Corp. Intel Intrinsics Guide. https://software.intel.com/sites/landingpage/IntrinsicsGuide/, 2015

[26] ARM Ltd. Neon Intrinsics Guide. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0491h/CIHJBEFE.html, 2015

[27] R. Leißa, S. Hack and I. Wald. Extending a C-Like Language for Portable SIMD Programming. In PPoPP, 2012

[28] M. Pharr and W. R. Mark. ispc: A SPMD Compiler for High-Performance CPU Programming. In InPar, 2012

[29] Intel Corp. Intel SPMD Program Compiler: A Simple ispc Example. https://ispc.github.io/example.html, 2015

[30] Intel Corp. Intel Array Building Blocks. https://software.intel.com/en-us/articles/intel-array-building-blocks, 2015

[31] C. Newburn et al. Intel's Array Building Blocks: A Retargetable, Dynamic Compiler and Embedded Language. In CGO, 2011