nova.simd - A framework for architecture-independent SIMD development

Tim BLECHMANN
[email protected]

Abstract

Most CPUs provide instruction set extensions to make use of the Single Instruction Multiple Data (SIMD) programming paradigm; examples are the SSE and AVX instruction set families for IA32 and X86_64, Altivec on PPC or NEON on ARM. While compilers can do limited auto-vectorization, developers usually have to target each instruction set manually in order to get the full performance out of their code. Nova.simd provides a generic framework to write cross-platform code that makes use of data-level parallelism by utilizing these instruction sets.

1 Introduction & Motivation

Most processors provide instructions to make use of data-level parallelism via the Single Instruction Multiple Data (SIMD) paradigm [3]. These instruction sets usually treat data as small vectors of scalars, which can be processed with single instructions.

In an ideal world, compilers would generate these instructions automatically without any contribution from the developers. Unfortunately, it is not that easy: some instructions require specific conditions like memory alignment, and often the compiler is not able to infer whether certain memory regions overlap at run-time (aliasing). Some compilers avoid this by adding run-time checks, which of course introduce some overhead. Some more complex algorithms would require non-trivial program transformations, which are unlikely to be performed by the compiler.

To get the full performance out of the SIMD instruction sets, one usually has to target each instruction set manually, either by using compiler intrinsics or by writing assembler. Unfortunately, each instruction set implements different concepts. The SSE instruction set, for example, originally worked only on single-precision floating point numbers, but was extended to integer data and double-precision floating point types with SSE2, and virtually every new CPU generation has added more instructions for specific use cases (the additional SSE extensions have been SSE3, SSSE3, SSE4a, SSE4.1 and SSE4.2; AMD's proposed SSE5 never made it to real hardware). Some vendors provide specific libraries, but unfortunately most of them have specific restrictions: they only abstract one specific instruction set or work only on a specific platform.
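To make the portability problem concrete, the following minimal sketch (not taken from nova.simd; the function name and the alignment and size assumptions are purely illustrative) shows what a hand-written SSE version of a simple buffer addition looks like. An Altivec or NEON version would require a completely different set of types and intrinsics.

#include <xmmintrin.h>  // SSE intrinsics

// Illustrative only: add two float buffers with hand-written SSE code.
// Assumes 16-byte aligned pointers and a count that is a multiple of 4.
void add_sse(float * out, const float * a, const float * b, unsigned count)
{
    for (unsigned i = 0; i != count; i += 4) {
        __m128 va = _mm_load_ps(a + i);            // aligned load of 4 floats
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_add_ps(va, vb)); // add and store 4 floats
    }
}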
Nova.simd was designed to provide a generic and easy-to-use framework for writing SIMD code that is independent from the instruction set. It provides ready-to-use vector functions, but also a generic framework for writing generic vector code. It is a header-only C++ library that makes heavy use of templates and template metaprogramming techniques and currently supports the SSE and AVX families on IA32 and X86_64, Altivec on PPC and NEON on ARM, but it can easily be extended to other or future instruction sets if they provide reasonable compiler support via intrinsics. For unsupported platforms, a generic C++ implementation is provided. The library is free software, released under the GPL-2+.

Section 2 explains the design decisions of Nova.simd, section 3 introduces the main part of the library, the vec class, section 4 explains the provided algorithms, and we conclude in section 5 with a discussion of related libraries and frameworks.

2 The Design of Nova.simd

Nova.simd was started when the author worked on the `nova' project for his bachelor thesis [1], which provided a simple form of abstraction for SIMD functions. After the `nova' project was abandoned, the SIMD code was maintained separately and the unit generators of the computer music system SuperCollider [5] were adapted to use nova.simd instead of the old PPC-specific Altivec code. After some attempts to use Python to generate C++ code, it is currently implemented as a header-only C++ library, which makes heavy use of template and template meta-programming techniques to generate code at compile time.

The main idea behind the design is to separate the library into a platform-specific and a platform-agnostic part. The platform-specific part is based on a generic template vec<> class, which represents a SIMD vector. Using template specialization, platform-specific versions of this class can be implemented for separate instruction sets; details can be found in the following section. The platform-agnostic part then uses the vec<> class to build more complex algorithms.

3 The class vec

The vec<> class is the heart of nova.simd. It represents a single SIMD vector, similar to __m128 on SSE, __m256 on AVX, vector float on Altivec or the ARM-specific float32x4_t. However, unlike these platform-specific types, which are plain C-style PODs, the vec<> class provides a feature-rich class interface, which can be used to compose more complex algorithms.

All vec<> class instantiations provide a common interface for aligned and non-aligned loads and stores and piecewise vector arithmetics, but also some functions for horizontal accumulation and some mathematical functions. Class specializations for SIMD extensions also provide common building blocks for vector code like comparison to bitmask, bitwise logical operations and other functionality which is required to formulate vectorized code.

The advantage of this modular approach is portability: once an algorithm is implemented via the vec<> class interface, it will usually work out of the box on other platforms. The vector implementations of the mathematical functions, for example, are directly implemented via the vec<> class. The original algorithms for the approximations are adapted from cephes [6], while some of the approximation polynomials are improved with Sollya [2]. The main idea behind the vectorized implementation is to evaluate the approximation polynomial for every part of the function and to select the result for the specific interval with bitmasks.

3.1 Class Interface

The vec<> class provides a feature-rich interface, which provides the means to implement more complex algorithms. It covers the following aspects:

Operators
    vec<> provides C++ operators for arithmetics and comparison. However, it omits logical operators, which are provided as bitmasking functions (see below). Multiply-accumulate is provided as a function, as it is supported by several SIMD instruction sets.

Bitwise Operators and Bitmasks
    All template specializations which actually map to SIMD implementations provide comparison functions that yield bitmasks. These bitmasks can then be used with bitwise operators or a select() function, which selects a specific value from two arguments, depending on the value of a selection bitmask.

Element Access
    It is possible to read and write individual vector elements. However, there is no guarantee that the instruction set implements this efficiently.

Math Functions
    vec<> implements many, but not all, functions that are part of libm. However, it provides some functions that are useful for signal processing, like a signed square root or a signed power function.

Horizontal Functions
    Certain algorithms require some form of horizontal accumulation. The vec<> class provides functions that return the minimum or maximum element and the sum of all elements.

Constant Generation
    Some specific constants can be generated efficiently via bit-twiddling tricks, which can be more efficient than loading a constant from memory (a sketch of such a trick follows below).
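As an illustration of such a bit-twiddling trick, the constant 0.5f can be produced on SSE2 from an all-ones register with two shifts, avoiding a load from memory. This is a general technique and not necessarily how nova.simd implements vec<>::gen_05().

#include <emmintrin.h>  // SSE2 intrinsics

// Illustrative sketch: build 0.5f (bit pattern 0x3f000000) in every lane
// without touching memory.
inline __m128 generate_half()
{
    __m128i zero = _mm_setzero_si128();
    __m128i ones = _mm_cmpeq_epi32(zero, zero);                  // 0xffffffff per lane
    __m128i bits = _mm_srli_epi32(_mm_slli_epi32(ones, 26), 2);  // -> 0x3f000000
    return _mm_castsi128_ps(bits);                               // reinterpret as 4 x 0.5f
}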
Listing 1: SuperCollider's soft clipping

template <typename Float>
Float sc_softclip(Float arg)
{
    Float abs_arg = std::fabs(arg);
    if (abs_arg < 0.5)
        return arg;
    else
        return (abs_arg - 0.25) / arg;
}

3.2 Example: Soft Clipping

As an example we take a closer look at how SuperCollider's softclip operator could be implemented using the vec<> class. Softclip is a simple wave shaper that maps values with an absolute value from 0.5 to infinity into the range from 0.5 to 1 (compare Listing 1). The vectorized implementation is shown in Listing 2.

Listing 2: SuperCollider's soft clipping via nova.simd

template <typename Float>
void softclip_vec(Float * out, Float const * in, int count)
{
    typedef vec<Float> vec_type;
    vec_type const05  = vec_type::gen_05();
    vec_type const025 = vec_type::gen_025();
    const int vs = vec<Float>::size;
    int loops = count / vs;

    for (int i = 0; i != loops; ++i) {
        vec_type arg;
        arg.load_aligned(in + i*vs);
        vec_type abs_arg = abs(arg);
        vec_type mask = mask_lt(abs_arg, const05);
        vec_type alt_ret = (abs_arg - const025) / arg;
        vec_type result = select(alt_ret, arg, mask);
        result.store_aligned(out + i*vs);
    }
}

For the sake of simplicity, we assume that both pointers are reasonably aligned and that count is a multiple of the vec<> size. In the beginning, we generate two constants. Then, in the loop, we load the argument and compute its absolute value. In the following line, we generate a bitmask that denotes whether the absolute value is less than 0.5. [...] of the select statement implies that the result of the division won't be used.

4 Generic Vector Algorithms

The vec<> class is used to implement a number of higher-level vector algorithms, most of which are similar to those of other frameworks for vector arithmetics. However, all vector arithmetic functions come with a templated C++ interface, which provides some very handy features for audio synthesis applications. The basic signatures for the arithmetic vector addition are shown in Listing 3.

Listing 3: Signatures for Vector Additions

template <typename FloatType,
          typename Arg1Type, typename Arg2Type>
inline void plus_vec(FloatType * out,
                     Arg1Type arg1, Arg2Type arg2,
                     unsigned int n);

template <typename FloatType,
          typename Arg1Type, typename Arg2Type>
inline void plus_vec_simd(FloatType * out,
                          Arg1Type arg1, Arg2Type arg2,
                          unsigned int n);

template <unsigned int n,
          typename FloatType, typename Arg1Type,
          typename Arg2Type>
inline void plus_vec_simd(FloatType * out,
                          Arg1Type arg1, Arg2Type arg2);

We provide three different versions of the function: the first version, plus_vec, does not make any assumptions about arguments, vector alignment or vector size, so it can be called with any reasonable arguments, even overlapping arrays.
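As a usage sketch of the signatures from Listing 3: the nova namespace and the alignment and size requirements attributed to the two plus_vec_simd overloads below are assumptions inferred from their names, not spelled out in this excerpt.

// Hypothetical usage of the signatures from Listing 3; include the
// nova.simd headers (exact header names omitted here).
float out[64], a[64], b[64];

void add_buffers()
{
    nova::plus_vec(out, a, b, 64);       // generic: no alignment or size assumptions
    nova::plus_vec_simd(out, a, b, 64);  // assumed: aligned data, count a multiple
                                         // of the vector size
    nova::plus_vec_simd<64>(out, a, b);  // assumed: sample count fixed at compile time
}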