Extending C++ for Explicit Data-Parallel Programming via SIMD Vector Types
Dissertation
for the attainment of the doctoral degree
in natural sciences
submitted to Department 12 (Fachbereich 12)
of Johann Wolfgang Goethe University
in Frankfurt am Main
by
Matthias Kretz
from Dieburg
Frankfurt (2015)
(D 30) Accepted as a dissertation by Department 12 of
Johann Wolfgang Goethe University.
Dean: Prof. Dr. Uwe Brinkschulte
Reviewers: Prof. Dr. Volker Lindenstruth, Prof. Dr. Ulrich Meyer, Prof. Dr.-Ing. André Brinkmann
Date of disputation: 16 October 2015

CONTENTS

Abstract
Publications
Acknowledgments
I Introduction
1 SIMD
1.1 The SIMD Idea
1.2 SIMD Computers
1.3 SIMD Operations
1.4 Current SIMD Instruction Sets
1.5 The Future of SIMD
1.6 Interaction With Other Forms of Parallelism
1.7 Common Optimization Challenges
1.8 Conclusion
2 Loop-Vectorization
2.1 Definitions
2.2 Auto-Vectorization
2.3 Explicit Loop-Vectorization
3 Introduction to SIMD Types
3.1 The SIMD Vector Type Idea
3.2 SIMD Intrinsics
3.3 Benefits of a SIMD Vector Type
3.4 Previous Work on SIMD Types
3.5 Other Work on SIMD Types
II Vc: A C++ Library for Explicit Vectorization
4 A Data-Parallel Type
4.1 The Vector
4.9 Selecting the Default SIMD Vector Type
4.10 Interface to Internal Data
4.11 Conclusion
5 Data-Parallel Conditionals
5.1 Conditionals in SIMD Context
5.2 The Mask
III Vectorization of Applications
11 Nearest Neighbor Searching
11.1 Vectorizing Searching
11.2 Vectorizing Nearest Neighbor Search
11.3 k-d Tree
11.4 Conclusion
12 Vc in High Energy Physics
12.1 ALICE
12.2 Kalman-filter
12.3 Geant-V
12.4 ROOT
12.5 Conclusion
IV Conclusion
13 Conclusion
V Appendix
A Working-Set Benchmark
B Linear Search Benchmark
C Nearest Neighbor Benchmark
D Benchmarking
E C++ Implementations of the k-d Tree Algorithms
E.1 INSERT Algorithm
E.2 FINDNEAREST Algorithm
F ALICE Spline Sources
F.1 Spline Implementations
F.2 Benchmark Implementation
G The Subscript Operator Type Trait
H The AdaptSubscriptOperator Class
I Supporting Custom Diagnostics and SFINAE
I.1 Problem
I.2 Possible Solutions
I.3 Evaluation
Acronyms
Bibliography
Index
German Summary
ABSTRACT
Data-parallel programming is more important than ever since serial performance is stagnating. All mainstream computing architectures have been and still are enhancing their support for general-purpose computing with explicitly data-parallel execution. For CPUs, data-parallel execution is implemented via SIMD instructions and registers. GPU hardware works very similarly, allowing very efficient parallel processing of wide data streams with a common instruction stream.
These advances in parallel hardware have not been accompanied by the necessary advances in established programming languages. Developers have thus not been enabled to explicitly state the data-parallelism inherent in their algorithms. Some approaches from GPU and CPU vendors have introduced new programming languages, language extensions, or dialects enabling explicit data-parallel programming. However, it is arguable whether the programming models introduced by these approaches deliver the best solution. In addition, some of these approaches have shortcomings stemming from the hardware-specific focus of their language design. There are several programming problems for which the aforementioned language approaches are not expressive and flexible enough.
This thesis presents a solution tailored to the C++ programming language. The concepts and interfaces are presented specifically for C++, but as abstractly as possible, facilitating adoption by other programming languages as well. The approach builds upon the observation that C++ is very expressive in terms of types. Types communicate intention and semantics to developers as well as compilers. They allow developers to clearly state their intentions and allow compilers to optimize via the explicitly defined semantics of the type system.
Since data-parallelism affects data structures and algorithms, it is not sufficient to enhance the language's expressivity in only one area. The definition of types whose operators express data-parallel execution automatically enhances the possibilities for building data structures. This thesis therefore defines low-level, but fully portable, arithmetic and mask types required to build a flexible and portable abstraction for data-parallel programming. On top of these, it presents higher-level abstractions such as fixed-width vectors and masks, abstractions for interfacing with containers of scalar types, and an approach for automated vectorization of structured types.
The Vc library is an implementation of these types. I developed the Vc library for researching data-parallel types and as a solution for explicitly data-parallel programming. This thesis discusses a few example applications using the Vc library, showing the real-world relevance of the library. The Vc types enable parallelization of search algorithms and data structures in a way unique to this solution. This demonstrates the importance of using the type system for expressing data-parallelism. Vc has also become an important building block in the high energy physics community. Their reliance on Vc shows that the library and its interfaces were developed to production quality.
PUBLICATIONS
The initial ideas for this work were developed in my Diplom thesis in 2009:
[55] Matthias Kretz. "Efficient Use of Multi- and Many-Core Systems with Vectorization and Multithreading." Diplomarbeit. University of Heidelberg, Dec. 2009.
The Vc library idea has previously been published:
[59] Matthias Kretz et al. "Vc: A C++ library for explicit vectorization." In: Software: Practice and Experience (2011). issn: 1097-024X. doi: 10.1002/spe.1149.
An excerpt of Section 12.2 has previously been published in the pre-Chicago mailing of the C++ standards committee:
[58] Matthias Kretz. SIMD Vector Types. ISO/IEC C++ Standards Committee Paper. N3759. 2013. url: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3759.html.
Chapters 4 & 5 and Appendix I have previously been published in the pre-Urbana mailing of the C++ standards committee:
[57] Matthias Kretz. SIMD Types: The Vector Type & Operations. ISO/IEC C++ Standards Committee Paper. N4184. 2014. url: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4184.pdf.
[56] Matthias Kretz. SIMD Types: The Mask Type & Write-Masking. ISO/IEC C++ Standards Committee Paper. N4185. 2014. url: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4185.pdf.
[60] Matthias Kretz et al. Supporting Custom Diagnostics and SFINAE. ISO/IEC C++ Standards Committee Paper. N4186. 2014. url: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4186.pdf.
ACKNOWLEDGMENTS
For Appendix I, I received a lot of useful feedback from Jens Maurer, which prompted me to rewrite the initial draft completely. Jens suggested focusing on deleted functions and contributed Section I.3.
Part I
Introduction
1 SIMD
Computer science research is different from these more traditional disciplines. Philosophically it differs from the physical sciences because it seeks not to discover, explain, or exploit the natural world, but instead to study the properties of machines of human creation. — Dennis M. Ritchie (1984)
This thesis discusses a programming language extension for portable expression of data-parallelism, which has evolved from the need to program modern SIMD¹ hardware. Therefore, this first chapter covers the details of what SIMD is, where it originated, what it can do, and some important details of current hardware.
1.1 THE SIMD IDEA
The idea of SIMD stems from the observation that there is data-parallelism in almost every compute-intensive algorithm. Data-parallelism occurs whenever one or more operations are applied repeatedly to several values. Typically, programming languages only allow iterating over the data, which leads to strictly serial semantics. Alternatively (e.g. with C++), it may also be expressed in terms of a standard algorithm applied to one or more container objects.²
Hardware manufacturers have implemented support for data-parallel execution via special instructions executing the same operation on multiple values (Listing 1.1). Thus, a single instruction stream is executed, but multiple (often 4, 8, or 16) data sets are processed in parallel. This leads to an improved transistors-per-FLOP³ ratio⁴ and hence less power usage and overall more capable CPUs.
1 Single Instruction, Multiple Data
2 Such an algorithm can relax the iteration ordering and thus place additional restrictions, in terms of concurrency, on the semantics of the executed code (cf. [36]).
3 floating-point operations
4 The number of transistors required for instruction decoding and scheduling is large relative to the transistors needed for the arithmetic and logic unit. However, mainly the latter must be increased if
[Figure 1.1: four scalar additions aᵢ + bᵢ performed as a single SIMD addition]
Figure 1.1: Transformation of four scalar operations into one SIMD operation.
; scalar multiply-add: 1 FLOP per instruction mulss %xmm1,%xmm0 addss %xmm2,%xmm0
; SSE multiply-add: 4 FLOP per instruction mulps %xmm1,%xmm0 addps %xmm2,%xmm0
; AVX multiply-add: 8 FLOP per instruction vmulps %ymm1,%ymm0,%ymm0 vaddps %ymm2,%ymm0,%ymm0
; AVX with FMA multiply-add: 16 FLOP per instruction vfmadd213ps %ymm1,%ymm0,%ymm2
Listing 1.1: Example of SIMD instructions in x86-64 assembly.
Figure 1.1 visualizes the conceptual idea of SIMD execution. Code that needs to execute four add operations on four different value pairs needs four instructions to execute the addition operations. Alternatively, the same result can be achieved with a single SIMD instruction that executes four addition operations on two vectors with four values each. In terms of operations per instruction, the SIMD variant is a factor of four better. Since on most modern CPUs the scalar and the vector instruction execute equally fast, this translates directly to a factor of four higher operations-per-cycle throughput.
In the following, 풲_T denotes the number of scalar values in a SIMD vector register of type T (the width of the vector). Thus, with SSE⁶ or AltiVec 풲_float = 4, and with AVX⁷ 풲_float = 8 (see Section 1.4). In addition, 풮_T denotes sizeof(T). It follows that 풲_T ⋅ 풮_T is the size of a SIMD vector of type T in Bytes.
a CPU adds SIMD instructions or grows the SIMD width. Therefore, the total number of transistors grows only by a fraction whereas the peak performance (in terms of operations per cycle) doubles.
6 Streaming SIMD Extensions
7 Advanced Vector Extensions
1.2 SIMD COMPUTERS
In the original Von Neumann architecture [66] from 1945, the processing unit of a computer consists of a "central arithmetical part", a "central control part", various forms of memory, and I/O. In all of the first computer designs, registers and operations were meant to store and process only single (scalar) values. Thus, the ALUs⁸ only processed one value (normally two inputs and one output) with a given instruction. This model later became known as SISD (according to Flynn [24]), because a Single Instruction processes a Single Data element.
In 1972, the ILLIAC IV [38] was the first array machine to introduce SIMD computing [78]. Array machines employ multiple processing units (PEs⁹) in parallel, which are controlled from the same instruction stream. They may support conditional execution via flags that enable/disable a selected PE. The PEs have some form of local memory and a local index to identify their position in the whole machine. Tanenbaum [78] describes the history and architecture of these machines in more detail. These machines became known as SIMD machines because a Single Instruction processes Multiple Data elements.
The modern implementations of SIMD machines are called vector machines. The name vector stems from the use of wide registers able to store multiple values. Associated vector instructions process all entries in such vector registers in parallel. For some time, vector instructions were only targeted at specific problems, such as multimedia processing. However, over the years many more instructions have been introduced, making the vector units more suited for general-purpose computing.
Whereas array machines have a greater separation between the different PEs, vector machines use the same memory for the whole vector register. Additionally, there is no natural identification in vector machines that would resemble the local index of array machines.
Vector instructions, in most hardware, calculate the result for all vector lanes¹⁰ and have no means to selectively write to the destination register. Subsequent blend instructions (or a combination of and, andnot, or) can be used to achieve conditional assignment. Newer vector instruction sets (the Intel Xeon Phi and AVX-512) support generic write-masking encoded into every instruction. The rest of the document will focus on vector machines. Most considerations can be directly translated to a classical SIMD machine, though.
8 Arithmetic Logic Units
9 Processing Elements
10 A vector lane (or SIMD lane) identifies a scalar entry in a vector register. Most SIMD instructions execute the same operation per lane and do not read/write any data from/to lanes at a different offset in the register. In that sense they are comparable to the PEs of array machines.
[Figure 1.2 panels: Arithmetic, Logical, Shuffle, Conversion, Comparison, Load, Gather]
Figure 1.2: Overview of common SIMD vector operations. Shuffle operations are special because they execute cross-lane data movement and are important to build reduction functions, for example. Load and gather operations copy 풲_T values from consecutive or indexed (respectively) memory locations. The reverse operations, store and scatter, also exist. The remaining depicted operations show the concurrent execution of 풲_T operations on 풲_T different operands.
1.3 SIMD OPERATIONS
The set of possible SIMD operations closely mirrors the available scalar instructions. Accordingly, there are instructions for arithmetic, compare, and logical operations. Figure 1.2 gives an overview for 풲_T = 4. However, SIMD instruction sets typically are not complete. For instance, a 32-bit integer multiply instruction did not exist in SSE until the introduction of SSE4. Also, division (especially for integers) is a likely candidate for omission in SIMD instruction sets (cf. [41, 70]). This difference in available operations is one reason why it is not easy to predict the speedup from vectorization of a given algorithm.
1.3.1 alignment
SIMD programming requires special attention to alignment, otherwise the pro- gram may perform worse or even crash. The C++ standard defines alignment in [48, §3.11]:
Object types have alignment requirements which place restrictions on the addresses at which an object of that type may be allocated. An alignment is an implementation-defined integer value representing the number of bytes between successive addresses at which a given object can be allocated.
For fundamental types, the alignment requirement normally equals the size of the type. A vector register requires 풲_T ⋅ 풮_T Bytes when stored in memory. SIMD architectures typically require the natural alignment of vectors in memory to be equal to their size. Thus, the alignment requirement is 풲_T times higher than for type T (alignof(V_T) = 풲_T ⋅ alignof(T), where V_T is the type for a vector of T). Only every 풲_T-th address in an array of T is a valid aligned address for vector register load and store operations. One may therefore consider the memory to be an interleaving of memory that is local to each SIMD lane. Consequently, memory is partitioned into 풲_T parts for a given type T.
Nevertheless, there are several mechanisms to break this natural partitioning of the memory and assign data to different SIMD lanes. The most obvious of them are unaligned loads and stores. These are usable with an arbitrary offset into an array. It is also possible to execute two aligned loads and stores together with one or more vector shuffle instructions. Additionally, for greatest flexibility, a SIMD instruction set might support gather and scatter instructions. These take a base address and a vector of offsets to load or store the entries of a vector from/to arbitrary memory locations.
1.3.2 comparisons
Another important set of instructions are comparisons. In contrast to compare operations on fundamental types, there cannot be a single true or false answer for a SIMD comparison. Instead, there must be 풲_T boolean values in the result. This implies that such an instruction cannot sensibly set any of the flags that most architectures use for branching instructions.
A vector of 풲_T boolean values is called a mask. There exist different implementations for such masks. They can either use the same registers as the data vectors, or they are stored in specialized mask registers. Special mask registers can thus be optimized to store only max{풲_T ∀ T} Bits. If the mask is stored in a vector register,
it is useful to have an instruction converting the mask to an integer where every bit corresponds to one boolean value of the mask.
1.4 CURRENT SIMD INSTRUCTION SETS
There have been many different SIMD computers, but the most relevant SIMD instruction sets at the time of writing are the ones based on the x86, ARM, and PowerPC architectures.
1.4.1 powerpc
The PowerPC architecture is most relevant for its use in the Xbox 360 and PlayStation 3 game consoles and the Blue Gene supercomputers. These CPUs provide SIMD instructions known as AltiVec, VMX, VMX128, or Velocity Engine. The PowerPC Microprocessor Family: Vector/SIMD Multimedia Extension Technology Programming Environments Manual [70] documents the details: The vector registers are 128 bits wide and can store sixteen 8-bit, eight 16-bit, or four 32-bit integers, or four 32-bit single-precision floats. Thus, 64-bit data types, especially double-precision floats, are not supported in AltiVec vector registers. In addition to typical arithmetic operations, the instruction set contains FMA¹¹ operations. Also, AltiVec contains a generic permute instruction for very flexible data reorganization from two source registers.
1.4.2 arm
The ARM architecture is becoming more and more relevant, not only because of its use in mobile products, but moreover because of its potential as an energy-efficient general-purpose computing platform. Most current ARM designs include a SIMD instruction set called NEON, which was introduced with the ARM Cortex™-A8 processor (ARMv7 architecture). Introducing NEON™ [47] provides an introduction. The vector registers can be accessed as sixteen 128-bit or thirty-two 64-bit registers. There is no need to explicitly switch between 128-bit and 64-bit vector registers, as the instruction used determines the view on the register bank. If the VFPv3 instructions are implemented, the vector registers additionally alias the floating-point registers. The registers can pack 8-bit, 16-bit, 32-bit, or 64-bit integers or 32-bit single-precision floats. Optionally, NEON supports fast conversion between single-precision floats and fixed-point, 32-bit integer, or half-precision floats. Even without the NEON extension, ARM CPUs (since ARMv6) support a small set of SIMD instructions on integer values packed into standard 32-bit registers.
11 Fused Multiply-Add
1.4.3 x86
Intel introduced the first SIMD extension to the x86 family in 1997 with the "Pentium processor with MMX Technology" [69]. This extension is still supported, but for all practical purposes superseded by the SSE and SSE2 extensions that followed in 1999 and 2001. SSE provides eight (sixteen in 64-bit mode) vector registers which are 128 bits wide. It supports 8-bit, 16-bit, 32-bit, and 64-bit integers in addition to 32-bit single-precision and 64-bit double-precision floats. The available instructions for the different integer types differ slightly, and some useful integer instructions were only added with later SSE revisions. For all recent microarchitectures the usage of SIMD instructions improves computational throughput by a factor of 풲_T.¹² With the Intel SandyBridge and AMD Bulldozer microarchitectures (both released in 2011) AVX was introduced, doubling the SIMD register width on x86. However, AVX only shipped full-width arithmetic instructions for single-precision and double-precision floats. Thus, 풲_T effectively only doubled for float and double. With the Intel Haswell microarchitecture (released in 2013) AVX2 was introduced. AVX2 includes the missing integer arithmetic instructions. Additionally, AVX2 includes gather instructions able to optimize indirect array reads. AMD introduced FMA instructions with the Bulldozer microarchitecture in 2011. The use of FMA instructions is required on Bulldozer for full performance. Intel introduced FMA with Haswell and thus doubled SIMD performance relative to the previous microarchitecture, as well.
1.4.4 intel xeon phi
The Intel Xeon Phi is a highly parallel, high-performance computing accelerator built to compete with GPUs¹³ for general-purpose computing. It is built from many small, relatively slow, but 16-way vectorized CPU cores. The Xeon Phi provides 32 vector registers of 512 bits width. It supports sixteen 32-bit integers, sixteen 32-bit single-precision floats, or eight 64-bit double-precision floats. There is limited support for 64-bit integers. Other data types are supported via cheap conversions at load or store. On the Xeon Phi, one vector register is exactly as large as one cache line. Consequently, any unaligned vector load or store is certain to be more expensive than an aligned access. This is clearly visible in the Xeon Phi's instruction set, where unaligned loads and stores each require two instructions: one instruction per cache line. Furthermore, the large register size has implications on locality considerations, since loading one full vector does not fetch any other related or unrelated data.
12 This is not entirely true for the AMD Bulldozer microarchitecture. The effective SIMD width is that of SSE. AVX support is implemented as execution of two 128-bit operations.
13 Graphics Processing Units
[Figure 1.3: line plot of SIMD vs. scalar FLOP/cycle over the years 1998–2014]
Figure 1.3: The maximum single-precision FLOP/cycle when using SIMD/scalar instructions on a single thread on x86(-64) (maximum of AMD and Intel CPUs), plotted against the year of release of new microarchitectures. [74, 26, 41]
1.5 THE FUTURE OF SIMD
In the last years we have seen a steady rise in the importance and performance of SIMD relative to scalar instructions (see Figure 1.3). This development is unlikely to reverse. It is reasonable to expect that vector registers are going to become wider and/or that more efficient SIMD instructions will be released. The AVX instruction coding (VEX prefix) has a 2-bit field encoding the width of the register. Values 0 and 1 are used for 128-bit and 256-bit width, respectively. Values 2 and 3 are likely to be used for 512-bit and 1024-bit wide vectors.
GPUs have shown the great potential of SIMD in general-purpose computing. The important GPU architectures are internally similar to SIMD architectures, even though they are typically able to make independent progress on all the "quasi-threads", requiring an explicit barrier for inter-lane communication. Their programming models are becoming more flexible with every release and will likely converge more and more towards general-purpose programming models (such as full C++ support).
1.6 INTERACTION WITH OTHER FORMS OF PARALLELISM
Hardware designers have implemented several dimensions of parallel execution in current CPU architectures to achieve high performance. It is therefore certainly possible that, if use of the SIMD dimension is increased, a different dimension has less opportunity to parallelize the execution. This section will discuss the issue.
1.6.1 instruction level parallelism
CPUs use pipelining to start execution of subsequent instructions in the instruction stream while previous ones have not been retired yet. However, in general an instruction cannot start before all its input registers are ready, implying that all instructions writing to those registers need to have retired. Therefore, a stream of instructions where each one depends on the output of the previous one cannot benefit from pipelining. It is often possible to execute several serial instruction streams on independent data streams in the same time required to process a single data stream. Consequently, compilers (and sometimes also developers) apply loop unrolling to create this data-parallel stream of independent calculations.
Some transformations from a scalar implementation to a SIMD implementation may reduce the opportunity for pipelined execution. This is especially the case with "vertical vectorization" (cf. [59]), where several independent, serially executed instructions are transformed into a single SIMD-parallel instruction. In such a case the execution units stall for longer durations in the SIMD implementation. This is one of the reasons why vectorization sometimes does not yield the expected speedup, especially because current compilers are sometimes not as smart about loop unrolling SIMD code as they are for scalar code.¹⁴ Manual loop unrolling of SIMD code often improves the vectorization results.
Superscalar execution is another dimension of parallelism present in current CPUs [40]. The CPU schedules instructions to different execution units, which work independently. Typically, the execution units are specialized for a specific set of instructions. For example, the Intel SandyBridge CPU can schedule execution of up to 6 micro-ops every cycle. Mapped to actual instructions this could be one floating-point add/sub, one floating-point mul, and two 128-bit load instructions [41]. Superscalar execution therefore does not compete with SIMD parallelization. It accelerates SIMD and scalar instructions equally.
Speculative execution is an optimization tied to branch prediction. The CPU predicts which path the code will take at a given branch instruction. It will then continue to execute the predicted code path (only operations that can be rolled back) as if the branch was taken. On a mis-prediction this work is rolled back and the other branch is executed. SIMD code often has fewer branches, because branch conditions are tied to scalar values and not to a whole vector. Consequently, SIMD code has to execute all branches and use conditional assignment to merge the results. SIMD code therefore does not benefit as much from speculative execution and might not yield a good speedup for algorithms with many conditionals.
14 This is not a fundamental problem, but just missing optimizations in the compiler.
1.6.2 multiple cores
Common frameworks for thread-parallelization, such as OpenMP [68] and TBB¹⁵ [45], provide easy-to-use multi-threaded execution of data-parallel algorithms (and more) via fork-join execution. The common parallel-for construct splits the work of a given loop into several chunks that are executed on separate worker threads. This parallelization strategy is very efficient because it makes the code scale from low to high CPU core counts rather easily. However, SIMD relies on data-parallelism as well. Therefore, on the one hand, vectorization and multi-threading may compete for the same inherent parallelism. On the other hand, the combination of multi-threading and vectorization may also be complementary, because in contrast to SIMD execution, threads allow concurrent execution of different or diverging tasks.
1.7 COMMON OPTIMIZATION CHALLENGES
If a given scalar code is transformed into SIMD code, the throughput of arithmetic (and logic and compare) operations can increase by as much as a factor of 풲_T. For almost every algorithm this requires an equal increase in load/store throughput. Thus, the size of the working set increases and the cache may be under considerably higher pressure. The load/store unit may also be a bottleneck. For example, the Intel SandyBridge and AMD Bulldozer microarchitectures both can execute up to two 128-bit loads and one 128-bit store per cycle [74, 41]. Thus, only one AVX vector can be read from L1 cache per cycle, and only one AVX vector can be stored to L1 cache every second cycle. This is a considerable limitation, given that these CPUs can execute two AVX floating-point instructions or even three AVX logical operations per cycle. Intel therefore improved the Haswell microarchitecture to allow two 256-bit loads and one 256-bit store per cycle [40].
Figure 1.4 shows the result of a benchmark (cf. Appendix A) testing the floating-point operations throughput for different scenarios. It turns out that SIMD improves the throughput in every tested case. Even if very few operations are executed over a very large working set, the result is slightly improved by using SIMD. This is visible in the lower plot, showing the speedup factor AVX/Scalar.
15 Intel Threading Building Blocks
[Figure 1.4: surface plots of AVX vs. scalar throughput (FLOPs/cycle) and the resulting speedup, over working-set size (Bytes) and FLOPs per Byte]
Figure 1.4: The plot shows the measured (single-precision) FLOPs per clock cycle depending on the working-set size and the amount of calculation per data access (loads and stores combined). The benchmark was compiled and executed once with AVX instructions and once without SIMD instructions for an AMD Bulldozer CPU. It uses FMA instructions in both cases. However, the number of values that are processed in parallel depends on 풲_float. The speedup plot shows the quotient of the AVX throughput relative to the scalar throughput (AVX/Scalar). (Note that with 풲_float = 8 for AVX, the theoretical best speedup is 800%. See Appendix A for the benchmark code.)
1.8 CONCLUSION
Software developers and designers have to consider SIMD. Developers need to understand what SIMD can do and how it can be used. Only then can a developer design data structures and algorithms which fully utilize modern CPUs. Especially compute-intensive programs need to use the SIMD capabilities, otherwise CPU utilization will only be at 6–25% of the theoretical peak performance. However, programs that are bound by memory accesses can also benefit from vectorization, since SIMD can make many memory operations more efficient (cf. Figure 1.4). Only tasks such as user interaction logic (and other inherently sequential codes) often have no use for SIMD.
From the discussion of different forms of parallelism we can conclude that it is helpful for developers to understand how SIMD and other forms of parallelism work. Blindly vectorizing a given code might not yield the expected performance improvement. Therefore, the interface for data-parallel programming should provide as much guidance as possible. Such an interface needs to be designed from use-cases of important parallelization patterns, and not only from existing hardware capabilities. This design process will additionally lead to an abstraction that can hide the substantial differences in hardware.
LOOP-VECTORIZATION
It is a common pattern of software development to execute a sequence of operations on different input values. Programming languages support this pattern with loop statements. In particular, the for-loop is designed for iteration over an index or iterator range. In many cases the individual iterations of the loop are independent from one another (or could be expressed in such a way) and could therefore be executed in any order and even in parallel. This observation has led to many parallelization interfaces that build upon parallelization of loop iterations. Loop-vectorization, including auto-vectorization, is likely the most prevalent method of vectorization. Since this work presents a solution for SIMD programming in C++, this chapter will present the main alternatives and discuss their limitations and issues. The following chapters on Vc (Part II) will show that these issues are solved or do not occur with SIMD types.
2.1 DEFINITIONS

countable loops A common requirement of tool-based vectorizers is that the number of iterations a loop will execute can be determined before the loop is entered. Loops where this condition holds are called countable.

vectorization The process of transforming a serial/scalar code into vector/SIMD code is called vectorization. This transformation is relatively simple to apply for (inner) loops. As a simple example consider Listing 2.1. (The iteration-count can be determined before the loop is entered, making the loop countable.) The iterations are all independent, because &d[i] != &d[j] ∀ i ≠ j. Thus, the loop may be transformed into N/𝒲_float iterations where each remaining iteration executes 𝒲_float additions with one instruction (Figure 2.1), plus a remainder loop in case N is not a multiple of 𝒲_float.

void func(float *d) {
  for (int i = 0; i < N; ++i) {
    d[i] += 1.f;
  }
}

Listing 2.1: A simple countable loop of independent calculations.

i = 0..N: d[i] + 1   →   i = 0..N/4: d[4i] + 1; d[4i+1] + 1; d[4i+2] + 1; d[4i+3] + 1   and   i = N−N%4..N: d[i] + 1

Figure 2.1: Illustration of the loop transformation which the vectorizer applies: a loop over N additions is transformed into a vectorized loop and possibly scalar prologue and/or epilogue loops. This transformation is the basic strategy for (automatic or manual) vectorization of most data-parallel problems.

auto-vectorization Auto-vectorization describes the process of automatically locating vectorizable code and applying the necessary code transformations. In most cases this is done in an optimization step of the compiler. Alternatively, there are tools (i.e. meta-compilers) translating scalar code to (compiler-specific) SIMD code. The auto-vectorizer identifies vectorizable loops or possibly vectorizable sequences of scalar operations on adjacent values in an array.
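Written out by hand, the transformation of Figure 2.1 looks roughly as follows. This is a scalar sketch in which a 4-wide inner loop stands in for one SIMD instruction; the function name and the width 4 are illustrative only:

```cpp
#include <cassert>
#include <cstddef>

// Hand-applied version of the Listing 2.1 transformation: the main loop
// processes W = 4 elements per iteration (standing in for a single SIMD
// addition), and a scalar epilogue handles the remaining n % W elements.
constexpr std::size_t W = 4;

void func_vectorized(float* d, std::size_t n) {
  std::size_t i = 0;
  for (; i + W <= n; i += W) {                            // vectorized main loop
    for (std::size_t k = 0; k < W; ++k) d[i + k] += 1.f;  // one "SIMD add"
  }
  for (; i < n; ++i) d[i] += 1.f;                         // scalar epilogue
}
```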
explicit loop-vectorization For a given loop, the developer can annotate the loop as a SIMD loop. There are two basic approaches to these loop annotations:
1. The loop annotation tells the compiler that the developer expects the loop to be vectorizable under the transformation rules of the auto-vectorizer. Therefore, the developer can reason about his/her code with serial semantics1. On the other hand, if auto-vectorization fails, it is now possible for the compiler to generate more diagnostic information.2 In addition to stating the intent, the annotation typically supports additional clauses able to guide the vectorizer to better optimizations.
2. The loop annotation alters the semantics of the loop body. Serial semantics do not apply anymore. Instead vector semantics are used and the compiler does not have to prove that serial semantics and vector semantics produce
1 “Serial semantics” characterizes the normal C++ sequenced-before rule [48, §1.9 p13].
2 At least with GCC 4.9.1, no useful diagnostics were printed when the loop failed to be vectorized (even with the -Wopenmp-simd flag).
the same result. Thus, cross-iteration data-dependencies (e.g. for (int i = 0; i < N - 1; ++i) a[i] += a[i + 1];) will not inhibit vectorization, enabling a new class of algorithms to be vectorizable via loops. However, it can lead to subtle bugs where the developer (erroneously) reasons about the code with serial semantics.

The following discussion will only refer to the second variant of explicit loop-vectorization, because it is the more general and powerful solution. The first variant only slightly improves over the auto-vectorization status quo.

vector semantics A thorough definition of vector semantics is an important prerequisite for an implementation of explicit loop-vectorization in a programming language. The Parallelism TS [49] states the following on the semantics of code executed with a vector-parallel execution policy:

  The invocations of element access functions […] are permitted to execute in an unordered fashion […] and unsequenced with respect to one another within each thread.

For a loop, this means that individual iterations can make independent progress, progress in lock-step, or progress in any conceivable interleaving. Riegel [71] defines a possible forward progress guarantee for “weakly parallel execution”:

  The implementation does not need to ensure that a weakly parallel EA4 is allowed to execute steps independently of which steps other EAs might or might not execute.

Geva et al. [30] state a formal definition of the wavefront execution model (cf. [35]), which is a stricter progress guarantee for vector semantics:

  A SIMD loop has logical iterations numbered 0, 1, …, N−1 where N is the number of loop iterations, and the logical numbering denotes the sequence in which the iterations would execute in the serialization of the SIMD loop. The order of evaluation of the expressions in a SIMD loop is a partial order, and a relaxation of the order specified for sequential loops.
  Let X_i denote the evaluation of an expression X in the i-th logical iteration of the loop. The partial order is: For all expressions X and Y evaluated as part of a SIMD loop, if X is sequenced before Y in a single iteration of the serialization of the loop and i ≤ j, then X_i is sequenced before Y_j in the SIMD loop. [Note: In each iteration of a SIMD loop, the “sequenced before” relationships are exactly as in the corresponding serial loop. — end note]
4 EA (Execution Agent) is defined by ISO/IEC 14882:2011 [48, §30.2.5.1] as: “An execution agent is an entity such as a thread that may perform work in parallel with other execution agents.”
This overview shows that there is no consensus—at this time—on the specifics of vector semantics. Any further discussion in this document involving vector semantics applies to all variants of vector semantics, though.
2.2 AUTO-VECTORIZATION

2.2.1 uses for auto-vectorization

It is the job of a compiler to create the best machine code from a given source. The executable code should make use of the target's features (in terms of instruction set and microarchitecture) as much as is allowed under the “as-if” rule of C++ [48, §1.9 p1]. Therefore, auto-vectorization is an important optimization step. Many codes can benefit from it without much extra effort on the developer's side. This is important for low-budget development or for developers without experience in parallel programming. Auto-vectorization can also be used for highly optimized codes; in this case, however, explicit loop-vectorization should be preferred, as the implicit expression of parallelism is fragile in terms of maintainability.
2.2.2 inhibitors to auto-vectorization Intel documented the capabilities and limitations of the ICC5 auto-vectorizer in detail [4]. They point out several obstacles and inhibitors to vectorization:
Non-contiguous memory accesses: Non-contiguous memory accesses lead to inefficient vector loads/stores. Thus the compiler has to decide whether the computational speedup outweighs the inefficient loads/stores.
Data dependencies: Data dependencies between the iteration steps make it hard or impossible for the compiler to execute loop steps in parallel.
Countable: If the loop is not countable the compiler will not vectorize.
Limited branching: Branching inside a loop leads to masked assignment. Instead of actual branches in the machine code, the auto-vectorizer has to emit code that executes all branches and blends their results according to the condition on each SIMD lane. If the compiler determines that the amount of branching negates the improvement of vectorization, it will skip the loop.
Outer loops: Outer loops (loops containing other loops) are only vectorized after certain code transformations were successful.
Function calls: Function calls inhibit vectorization. However, functions that can be inlined and intrinsic math functions are exceptions to this rule.
5 Intel C++ Compiler
Thread interaction: Any calls to mutexes or atomics inhibit auto-vectorization.
Auto-vectorization in GCC [5] additionally documents that for some SIMD transformations the order of arithmetic operations must be modified. Since this kind of optimization can change the result when applied to floating-point variables, it would deviate from the C++ standard, which specifies that operators of equal precedence are evaluated from left to right. Consequently, Auto-vectorization in GCC [5] recommends the -ffast-math or -fassociative-math flags for best auto-vectorization results.
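The reason is easy to demonstrate: reassociating a floating-point sum changes the rounding. A minimal example, with values chosen so the difference is guaranteed in IEEE 754 single precision:

```cpp
#include <cassert>

// (a + b) + c versus a + (b + c): the 1.f is lost when it is added to 1e8f
// first, because float has only about 7 decimal digits of precision (the
// spacing between adjacent floats near 1e8 is 8).
float sum_left(float a, float b, float c)  { return (a + b) + c; }
float sum_right(float a, float b, float c) { return a + (b + c); }
```

Compiled without -ffast-math, the two functions are not interchangeable, which is exactly why the vectorizer may not reorder the operations on its own.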
2.2.3 limitations of auto-vectorization

In most cases, auto-vectorization cannot instantly yield the best vectorization of a given algorithm (if at all). This is in part due to the limited freedom the compiler has in transforming the code: aliasing issues, fixed data structures, pointer alignment, function signatures, and conditional statements. Another important part is that the user did not express the inherent data-parallelism explicitly. The compiler has to infer information about the problem that got lost in translation to source code. Finally, the user has to stay within a subset of the language, as listed in Section 2.2.2. If a user wants to vectorize his/her code via auto-vectorization, it is necessary to let the compiler report on its auto-vectorization progress and to use this information to adjust the code to the needs of the vectorizer. Sometimes this can require larger structural changes to the code, especially because data storage must be transformed from arrays of structures to structures of arrays. Additionally, users should add annotations about the alignment of pointers to improve vectorization results.
2.2.3.1 aliasing

Auto-vectorization relies on iterations over arrays to express the per-iteration input and output data. Since structures of arrays work best for the vectorizer, the loop body typically dereferences several pointers to the same fundamental data type. The compiler must account for the possibility that these pointers are equal or point to overlapping arrays. Then, any assignment to such an array potentially alters input values and might create cross-iteration dependencies.
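To see why aliasing inhibits the transformation, consider a sketch in which the same loop is executed once with serial semantics and once in a read-a-chunk-first "vectorized" form. With overlapping pointers the two disagree, so the compiler must either prove non-overlap or emit a run-time check (function names here are illustrative):

```cpp
#include <cassert>
#include <cstddef>

// out[i] = in[i] + 1, as the compiler sees it: through pointers that may alias.
void add_one(float* out, const float* in, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = in[i] + 1.f;
}

// A 4-wide "vectorized" variant: reads a whole chunk before writing it back.
void add_one_chunked(float* out, const float* in, std::size_t n) {
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    float tmp[4] = {in[i], in[i + 1], in[i + 2], in[i + 3]};  // vector load
    for (int k = 0; k < 4; ++k) out[i + k] = tmp[k] + 1.f;    // vector store
  }
  for (; i < n; ++i) out[i] = in[i] + 1.f;                    // epilogue
}
```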
float add_one(float in) {
  return in + 1.f;
}

Listing 2.2: A simple function.
float *data = ...;
for (int i = 0; i < N; ++i) {
  data[i] = add_one(data[i]);
}

Listing 2.3: A function call in a loop.
2.2.3.2 fixed data structures

A major factor for the efficiency of a vectorized algorithm is how the conversion from 𝒲_T scalar objects in memory to a single vector register and back is done. The compiler has no freedom to improve the data layout, which is fully defined by the types used by the algorithm. At the same time, the scalar expression of the algorithm conceals the problem from the developer, who often is oblivious to the limitations of the vector load and store operations of the hardware.
2.2.3.3 pointer alignment

In most cases the compiler cannot deduce whether a given pointer uses the necessary over-alignment for more efficient vector load and store instructions. Without extra effort, and an understanding of hardware details, the user will therefore create a more complex vectorization of the algorithm than necessary.
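As an illustration, the over-alignment can be requested explicitly with alignas; the snippet below (names are illustrative) checks for the 32-byte alignment a 256-bit vector load would want, which the compiler cannot in general deduce for an arbitrary float*:

```cpp
#include <cassert>
#include <cstdint>

// A 256-bit (AVX-sized) vector load is most efficient when the address is
// 32-byte aligned. alignas lets the programmer guarantee this statically.
struct AlignedBlock {
  alignas(32) float data[8];
};

bool is_aligned_32(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}
```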
2.2.3.4 function signatures

Consider a simple function like Listing 2.2. A function defines an interface for data input and/or output. There are fixed rules for how to translate such code for a given ABI6. For example, the in parameter has to be stored in bits 0–31 of register xmm0 on x86_64 Linux. Unless the compiler is allowed and able to inline the function, the function itself cannot be vectorized; neither can a calling loop, such as Listing 2.3, be vectorized in this case. In theory, this limitation can be solved with LTO7: the compiler then has more opportunity to inline functions, since it still has access to the abstract syntax tree of the callee when the optimizer vectorizes the caller loop.
6 Application Binary Interface
7 Link Time Optimization
2.2.3.5 conditional statements

Many algorithms require conditional execution which depends on the actual values of the input data. Since the algorithm is expressed with scalar types, the user has to use if-statements (or while, for, or switch). These cannot be vectorized as actual branches. Instead, the vectorized code has to execute all cases and implicitly mask off assignments according to the conditions in the different vector lanes. This leads to more code that needs to be executed, and the user may be completely unaware of the cost of conditionals (and of possible alternatives), unless (s)he has learned how the vectorizer and the target hardware work.
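This if-conversion can be sketched in scalar code: both branches are evaluated for all four lanes and a per-lane mask selects the results. This is a scalar model of masked assignment, not actual SIMD code:

```cpp
#include <cassert>

// Scalar code with a branch ...
float scalar(float x) { return x < 0.f ? -x : x * 2.f; }

// ... and its 4-lane if-converted form: no branch in the control flow; both
// results are computed on every lane and selected per lane by the mask.
void blended(const float* x, float* out) {
  for (int lane = 0; lane < 4; ++lane) {
    bool mask = x[lane] < 0.f;   // vector compare
    float a = -x[lane];          // "then" branch, executed on all lanes
    float b = x[lane] * 2.f;     // "else" branch, executed on all lanes
    out[lane] = mask ? a : b;    // blend (masked assignment)
  }
}
```

Note that both a and b are computed everywhere, which is exactly the extra work the text describes.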
2.3 EXPLICIT LOOP-VECTORIZATION

ICC has integrated explicit loop-vectorization (SIMD-loops) via the #pragma simd extension [42]. Intel considers this extension important enough that they are pursuing inclusion of this feature in the C++ standard (cf. [30, 35]). The extension is a significant improvement over auto-vectorization, where the developer has no mechanism to ensure SIMD-parallel execution of a given loop. All the limitations described in Section 2.2.3 still apply to explicit loop-vectorization. In contrast to auto-vectorization, the user explicitly expresses that the algorithm is data-parallel, which improves diagnostics and maintainability. In addition, optional clauses to the SIMD-loop syntax allow the user to guide the compiler to a more efficient vectorization. Vector semantics (as defined in Section 2.1) imply the following restrictions, which are correct and well-defined scenarios with serial semantics:
• The behavior is undefined for loops with cross-iteration dependencies, unless correctly annotated with a clause such as safelen. It would be desirable to require compilers to make SIMD-loops with such dependencies ill-formed. This is impossible, though, because the dependencies might be hidden behind pointer aliasing. The issue is thus only detectable at run time.
• Thread-synchronizing operations, such as mutexes, may deadlock. Current implementations of explicit loop-vectorization have to declare synchronizing operations either as undefined behavior or as ill-formed.
• If an exception is thrown in a vector loop, the behavior is undefined or may result in a call to std::terminate.8
If such a case occurs in a (non-inline) function called from the SIMD-loop, the compiler cannot recognize the condition (a situation in which auto-vectorization would simply be inhibited). With vector semantics it is well-defined behavior to call the function

8 std::terminate is the consequence of throwing a second exception during stack unwinding [48].
for each entry in the SIMD vector in sequence, though. This can lead to deadlocks, surprising side-effects (e.g. on global variables), or inconsistent exceptions with a possible call to std::terminate. The user must therefore be aware of the different semantics and of the functions (s)he calls. Especially calls into library functions may be fragile, as a new release of the library might become unsafe to use in a SIMD-loop.
INTRODUCTION TO SIMD TYPES
Perhaps the greatest strength of an object-oriented approach to development is that it offers a mechanism that captures a model of the real world. — Grady Booch (1986)
3.1 THE SIMD VECTOR TYPE IDEA
Most of the current commodity hardware supports SIMD registers and instructions. However, C++ projects that target these systems have no standardized way to explicitly use this part of the hardware. The solutions that currently serve as best practice are auto-vectorization (i.e. pure scalar code, cf. Section 2.2), OpenMP 4 [68] and Cilk Plus [44] SIMD loops (cf. Section 2.3), vendor-specific SIMD intrinsics (cf. Section 3.2), and (inline) assembly. C++ itself only provides types and operators for scalar integer and floating-point registers and instructions. The standard leaves most of the mapping of type names to exact register widths implementation-defined. SIMD registers and operations are simply ignored in the standard and left to “as-if” transformations of the optimizer (auto-vectorization). If C++ were to provide types that map to implementation-defined SIMD registers and instructions, then SIMD instructions would be just as accessible as scalar instructions (see Figure 3.1). Such a SIMD type should, in most respects, behave analogously to the scalar type, in order to make SIMD programming intuitively usable for any C++ developer. The obvious difference to scalar types is that instead of storing and manipulating a single value, it can hold and manipulate 𝒲_T values in parallel. Since the differences between target microarchitectures are a lot more significant in SIMD instructions than in scalar instructions, the portability concerns with these types are a lot more apparent than with scalar types. Therefore, while the type needs to be target-dependent, the interface must be target-agnostic. The type thus needs to express data-parallel execution in an abstract form. It should not be designed specifically for SIMD execution, which is just one form of efficient
[Figure 3.1 diagram: the programming language abstracts the computer; fundamental types abstract the scalar registers and instructions, SIMD types abstract the packed registers and instructions.]

Figure 3.1: Programming languages abstract how computers work. The central processing unit of the computer is controlled via registers and instructions. Most languages, notably C++, use types and their associated operators to abstract registers and instructions. Since the advent of SIMD registers and instructions, however, there exists a gap in the abstraction.
data-parallel execution. Such a strong design focus on SIMD could make the type less portable, thus less general, and therefore less useful. On the other hand, a type that cannot translate to the most efficient SIMD instructions (in all but maybe a few corner cases) would not be useful either. Users that want to explicitly optimize for a given target architecture should be able to do so with the SIMD type. The design goal is therefore to make the vector types (together with a properly optimizing compiler) as efficient as hand-written assembly SIMD code.
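A minimal fixed-width stand-in shows the idea; this is an illustrative sketch, not the Vc interface, and a real implementation would map the per-entry loops to SIMD instructions:

```cpp
#include <cassert>
#include <cstddef>

// A fixed-width stand-in for a SIMD vector type: the same syntax as a scalar
// float, but W values per object, with operators applied entry-wise. The
// per-entry loops here stand in for single instructions (cf. Figure 3.1).
template <typename T, std::size_t W>
struct Vec {
  T d[W];
  static constexpr std::size_t size() { return W; }
  Vec() : d{} {}                                            // zero-initialized
  Vec(T x) { for (auto& e : d) e = x; }                     // broadcast
  T& operator[](std::size_t i) { return d[i]; }
  friend Vec operator+(Vec a, Vec b) {
    Vec r;
    for (std::size_t i = 0; i < W; ++i) r.d[i] = a.d[i] + b.d[i];
    return r;
  }
};
using floatv = Vec<float, 4>;  // width fixed here; a real type is target-dependent
```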
3.2 SIMD INTRINSICS
All major C++ implementations provide SIMD intrinsics as an extension to the standard. These extensions provide types and functions that map directly to the available SIMD registers and instructions. Among the most widespread extensions are the x86 SIMD intrinsics for MMX1, SSE, and AVX. Another important extension is builtin vector types, which provide more contextual information to the middle-end and back-end of the compiler. The GCC and Clang implementations of vector builtins support operators for the vector types and retain more information for the optimizer about the per-entry operations that are executed. This allows the compiler to be smarter about optimizing sequences of SIMD operations.
1 Multimedia Extensions
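For illustration, the GCC/Clang builtin vector types can be used directly; the vector_size attribute is a compiler extension, not standard C++:

```cpp
#include <cassert>

// GCC/Clang builtin vector type: vector_size(16) packs four floats into one
// object, and the usual operators apply element-wise. Unlike an opaque
// intrinsic call, the optimizer sees the per-entry semantics.
typedef float v4sf __attribute__((vector_size(16)));

v4sf add(v4sf a, v4sf b) {
  return a + b;  // element-wise addition; typically compiles to one SIMD add
}
```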
3.3 BENEFITS OF A SIMD VECTOR TYPE

A SIMD vector type constitutes the low-level interface to SIMD programming.2 It is possible to build higher levels of abstraction using standard C++ facilities on top of these types. If this lowest level of SIMD programming is not provided, then users of C++ will be constrained to work within the limits of the provided higher-level abstraction. Even if such an abstraction were very efficient for most problems, it could lead to a situation where an additional and unrelated means of SIMD programming (outside of standard C++) were still required. In some cases the compiler might generate better code if only the intent is stated instead of an exact sequence of operations. Therefore, higher-level abstractions might seem preferable to low-level SIMD types. In my experience this is not the case, because programming with SIMD types makes intent very clear and compilers can optimize sequences of SIMD operations. This is effectively identical to optimizations on scalar operations. SIMD types themselves do not lead to an easy and obvious answer for efficient and easily usable data structures. However, SIMD types reveal inefficient data structures, as they become hard or awkward to use. This can guide developers to create more efficient data structures for vectorization. Chapter 10 shows a high-level interface to data structure vectorization, which can make the creation of properly vectorizable data structures easier or even completely transparent for a user of a given interface (such as shown in Chapter 11). On the other hand, a SIMD programming abstraction that completely hides vectorized memory accesses behind scalar memory accesses requires compiler diagnostics to guide the developer to more efficient data structures. One major benefit of SIMD types is that the programmer can gain an intuition for SIMD. This subsequently influences further design of data structures and algorithms to better suit SIMD architectures.
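The point about data structures can be made concrete with the classic array-of-structures versus structure-of-arrays contrast; the types in this sketch are illustrative:

```cpp
#include <cassert>
#include <cstddef>

// AoS: x and y interleave in memory -- a vector load of four consecutive
// x values would require a gather.
struct PointAoS { float x, y; };

// SoA: all x values are adjacent -- four x values are one contiguous load,
// which is exactly what a SIMD vector type makes easy to express.
struct PointsSoA {
  float x[8];
  float y[8];
};

// With SoA, summing the x components walks contiguous memory.
float sum_x(const PointsSoA& p, std::size_t n) {
  float s = 0.f;
  for (std::size_t i = 0; i < n; ++i) s += p.x[i];
  return s;
}
```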
Additionally, there are already many users of SIMD intrinsics (and thus of a primitive and non-portable form of SIMD types). Providing a cleaner and portable SIMD API3 would give many of them a better alternative. Thus, SIMD types in C++ would capture existing practice.
2 One could argue that target-specific intrinsics or builtins should be considered as the low-level interface. I believe this is the wrong approach, since the programming language should only provide abstractions for all the machines it may be compiled for and not concrete, target-specific interfaces.
3 Application Programming Interface
3.4 PREVIOUS WORK ON SIMD TYPES
3.4.1 intel C++ class interface to sse
Intel has provided simple C++ wrapper classes with the Intel C++ Compiler. These classes use operator overloading to make intrinsics more convenient to use from C++. However, these classes stopped short of becoming a more generic solution for data-parallel programming. The types are directly tied to a specific SIMD register width. Therefore, these types can only be used for compatible targets. Code written with these types does not scale to newer hardware, even if it is a compatible target. Access to scalar entries in a SIMD vector is provided, but breaks aliasing requirements, leading to subtle bugs.4
3.4.2 e.v.e.

The E.V.E. [21] library wraps the AltiVec instruction set of PowerPC CPUs. As the paper notes, it would be feasible to implement different target architectures, such as MMX/SSE2, and thus provide a portable API. The main emphasis was put on vectorization of calculations on larger data structures with more or less homogeneous treatment. The E.V.E. library built a basis for the NT² project [20], which now contains a more modern and more flexible SIMD abstraction called Boost.SIMD (Section 3.5.1).
3.4.3 macstl

Another library wrapping AltiVec exists, called macstl [62], to which MMX and SSE up to SSE3 were added later on. This library focuses on a main class for processing, which is an improved STL5 valarray implementation with vectorized operations on it. However, the flexibility for heterogeneous processing, required for the track finder, was missing. The library is not licensed as Free Software and was thus unsuitable for the tracker.
3.4.4 kisel/gorbunov’s headers

The Kalman filter, the most compute-intensive part of the track finder, has previously been vectorized, as presented in [33]. For this work, a small abstraction around SSE was developed. This work was evaluated and incorporated into the library presented here. In the meantime, this Kalman filter benchmark has been modified to use Vc, showing a slight improvement in performance [53].
4 This is normally not a problem for projects compiled with ICC, since the compiler per default does not enforce strict-aliasing rules.
5 Standard Template Library
3.5 OTHER WORK ON SIMD TYPES

Since the first public release of the Vc library in 2009 there have been several new projects abstracting SIMD types, very similar to the basic idea discussed here. The main difference in the interface abstraction of Vc to the other libraries is that Vc abstracts data-parallel programming as independently of 𝒲_T as possible. The other libraries encode 𝒲_T in the type name or encode the target SIMD instruction set in the type name.6

3.5.1 boost.simd

The Boost.SIMD library7 abstracts SIMD objects via a pack<T, N> class template.
6 Vc also encodes the SIMD instruction set in the type name, but it is an internal type name. The user-visible types are target-agnostic.
7 Boost.SIMD is not in Boost yet. This is the intention of the developers, though.
3.5.2 other libraries

The VCL Library [25] and the Generic SIMD Library [83] are two more implementations of C++ wrapper libraries around SIMD intrinsics. They are competing implementations of the ideas presented here and in earlier publications such as Kretz [55] and Kretz et al. [59].

Part II
Vc: A C++ Library for Explicit Vectorization
4
A DATA-PARALLEL TYPE
Programs must be written for people to read, and only incidentally for machines to execute. — Harold Abelson et al. (1996)
The SIMD vector class shall be an abstraction for the expression of data-parallel operations (cf. Section 3.1). If the target architecture of a compilation unit does not support SIMD instructions, but similar data-parallel execution, the expressed data-parallelism shall be translated accordingly. The following list states the desired properties for such a type:
• The value of an object of Vector<T> consists of 𝒲_T scalar values of type T.
• The sizeof and alignof of Vector<T> objects are target-dependent.
• The number of scalar entries (𝒲_T) is accessible as a constant expression.
• Operators that can be applied to T can be applied to Vector<T>, with the same semantics per entry of the vector.
• The result of each scalar value of an operation on Vector<T> is equal to the result of the same operation on the corresponding scalar entries.
• The syntax and semantics of the fundamental arithmetic types translate directly to the Vector<T> types.
 1 namespace Vc {
 2 namespace target_dependent {
 3 template <typename T> class Vector {
 4   implementation_defined data;
 5
 6 public:
 7   typedef implementation_defined VectorType;
 8   typedef T EntryType;
 9   typedef implementation_defined EntryReference;
10   typedef implementation_defined MaskType;
11
12   // constants
13   static constexpr size_t size();
14   static Vector IndexesFromZero();
15
16   // constructors, loads/stores, operators, ...
17   // (declared in the following listings)
18 };
19 } // namespace target_dependent
20 typedef target_dependent::Vector<float> float_v;
21 typedef target_dependent::Vector<double> double_v;
22 typedef target_dependent::Vector<int> int_v;
23 typedef target_dependent::Vector<unsigned int> uint_v;
24 typedef target_dependent::Vector<short> short_v;
25 typedef target_dependent::Vector<unsigned short> ushort_v;
26 } // namespace Vc

Listing 4.1: Boilerplate of the SIMD vector class interface (skeleton reconstructed from the surrounding text; implementation_defined marks target-dependent parts).
4.1 THE VECTOR<T> CLASS TEMPLATE
The boilerplate of the SIMD vector class interface is shown in Listing 4.1. There are several places in this listing where the declaration says “target-dependent” or “implementation-defined”. All of the following listings, which declare functions of the Vector<T> class, are to be read as part of this class definition.
2 In practice, there are still a few opportunities for compilers to improve optimization of SIMD operations.
The only data member of the vector class is of an implementation-defined type (line 4). This member therefore determines the size and alignment of Vector<T>.
4.1.1 member types
The member types of Vector<T> are defined as follows:
VectorType (line 7) is the internal type for implementing the vector class. This type could be an intrinsic or builtin type. The exact type that will be used here depends on the compiler and compiler flags, which determine the target instruction set. Additionally, if an intrinsic type is used, it might not be used directly (on line 4) but indirectly via a wrapper class that implements compiler-specific methods to access scalar entries of the vector. The VectorType type allows users to build target- and implementation-specific extensions on top of the predefined functionality. This requires a function that returns an lvalue reference to the internal data (line 4). See Section 4.10 for such functions.
EntryType (line 8) is always an alias for the template parameter T. It is the logical type of the scalar entries in the SIMD vector. The actual bit-representation in the SIMD vector register may be different from EntryType, as long as the observable behavior of the scalar entries in the object follows the same semantics.
EntryReference (line 9) is the type returned from the non-const subscript operator. This type should be an lvalue reference to one scalar entry of the SIMD vector. It is not required for EntryReference to be the same as EntryType &. Consider an implementation that uses 32-bit integer SIMD registers for Vector<short>: EntryReference would then be an lvalue reference to int, while EntryType remains short.
MaskType (line 10) is the mask type that is analogous to bool for scalar types. The type is
used in functions that have masked overloads and as the return type of compare operators. A detailed discussion of the class for this type is presented in Chapter 5.
4.1.2 constants

size() The vector class provides a static member function size() which identifies the number of scalar entries in the SIMD vector (line 13). This value is determined by the target architecture and therefore known at compile time. By declaring size() constexpr, the value is usable in contexts where constant expressions are required. This enables template specialization on the number of SIMD vector entries in user code. It also enables the compiler to optimize generic code that depends on the SIMD vector size more effectively. The size() function additionally makes Vector<T> follow the size() convention of the standard containers, so that generic code can query the number of entries uniformly.
IndexesFromZero() The IndexesFromZero() function (line 14) returns a Vector<T> with its entries initialized to the successive values 0, 1, 2, …, 𝒲_T − 1.
3 It would suffice to define only the size() function and drop a separate Size constant. Personally, I prefer to not use function calls in constant expressions. Additionally, a 50% difference in the number of characters makes a Size constant preferable, because querying the entry count is such a basic part of using SIMD types.
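How a constexpr entry count helps generic code can be sketched with a hypothetical stand-in type; none of the names below are Vc's:

```cpp
#include <cassert>
#include <cstddef>

// Stand-in SIMD type whose entry count is a constant expression.
template <std::size_t W>
struct VecStandIn {
  static constexpr std::size_t size() { return W; }
  float d[W];
};

// Because size() is constexpr, generic code can use it wherever constant
// expressions are required: array bounds, static_assert, specialization.
template <typename V>
int chunks(int n) {
  static_assert(V::size() > 0, "vector type must hold at least one entry");
  float buffer[V::size()];  // OK: size() is a constant expression
  (void)buffer;
  return (n + int(V::size()) - 1) / int(V::size());  // loop-trip count
}
```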
4.1.3 namespace

Line 2 defines a namespace that is not part of the standard interface. A user should be able to access the natural SIMD registers and operations for the compilation target via the Vector<T> class template without naming this target-dependent namespace.
4.1.4 simd type aliases

The code on lines 20–25 declares handy aliases for the Vector<T> instantiations with the fundamental arithmetic types.
4.2 SIMD VECTOR INITIALIZATION

The interface for initialization (excluding loads, covered in Section 4.3, and gathers, covered in Section 4.8) is shown in Listing 4.2. The Vc vector types are not POD4 types, because the class interface needs full control over the implicit and explicit initialization methods (and because there is no guarantee about the POD-ness of VectorType). The decision for which constructors to implement follows from syntactical and semantical compatibility with the builtin arithmetic type EntryType. Thus, the expressions in Listing 4.3 must compile and behave analogously to the corresponding fundamental types.
4 Plain Old Data
 1 // init to zero
 2 Vector();
 3
 4 // broadcast with implicit conversions
 5 Vector(EntryType);
 6
 7 // disambiguate broadcast of 0 and load constructor
 8 Vector(int); // must match exactly
 9
10 // implicit conversion from compatible Vector<U>
   ...

Listing 4.2: Initialization constructors of Vector<T> (excerpt).
double_v a0{}, a1 = double_v(); // zero-initialized

float_v b = 0, c = 2.f;
short_v d = -1;  // -1
ushort_v e = -1; // numeric_limits<unsigned short>::max()

Listing 4.3: Initialization expressions that must compile and behave analogously to the fundamental types.
default constructor The default constructor on line 2 creates a zero-initialized object. The constructor may not leave the object uninitialized: if the expression T() is used with a fundamental type, a “prvalue of the specified type, which is value-initialized” [48, §5.2.3] is created, and for fundamental types “value-initialized” implies “zero-initialized”.

destructor or copy/move constructor There is no need for a destructor and explicitly declared copy and/or move constructors as long as the vector type does not use external storage.5 There might be a need for these functions if the Vector<T> implementation uses external storage.
4.2.1 broadcasts

The constructor on line 5 declares an implicit conversion from any value of a type that implicitly converts to EntryType. This means that in places where a variable of type Vector
5 See Section 8.2.1 for why a trivial copy constructor makes an important difference.
1 template
4.2.2 simd vector conversions
The fundamental arithmetic types implicitly convert between one another. (Not every such conversion is value-preserving, which is why some compilers emit warnings for type demotions, and why brace-initialization with a narrowing conversion is ill-formed.) Conversions should work in the same manner for SIMD types. However, there is no guarantee that the number of scalar entries in one SIMD vector type is equal to the number of entries in a different type. Therefore, the conversions between Vector
It would certainly be possible to define additional guaranteed 풲_T relations by requiring implementations to implement some vector types with multiple registers. As a matter of fact, this is how I initially implemented Vc: 풲_int = 풲_float was guaranteed. This was easy to support with only SSE and Larrabee (now Xeon Phi) implementations, but required two SSE int vectors to implement the AVX target. This is due to the initial AVX revision not including instruction support for inte-
1 float_v f(float_v x) {
2   float_v r;
3   for (size_t i = 0; i < float_v::size(); i += double_v::size()) {
4     r = r.shiftIn(
5       double_v::size(),
6       static_cast
Figure 4.1: Vector filling algorithm used in Listing 4.5: starting with an empty return vector, the partial results 푟0 … 푟3 are shifted in from the right until the full vector is filled.
ger vectors of 256 bits. However, conversions between different integer types and floating-point types are very important to many algorithms and therefore must be supported.
The conversion constructor on line 20 converts T U values for a conversion from Vector
1 float_v f(float_v x) {
2   float_v r;
3   for (size_t i = 0; i < float_v::size(); i += double_v::size()) {
4     r[{i, double_v::size()}] = static_cast
tiple Vector
4.3 LOADS AND STORES
The vector types need an interface for loading and storing SIMD vectors from/to memory. In contrast to the primary motivation of providing the same syntax and semantics as for fundamental types, these functions have no equivalent for the underlying fundamental type T. The load & store functions are required because they are a portable and efficient interface for converting between arrays of a fundamental scalar type and vector types. Without load & store functions, data could not reasonably be converted in a portable fashion: all input and output of data to a vectorized algorithm would be required to exclusively use the Vector
6 Such a simd_cast function is implemented in Vc, supporting Vector
1 // load member functions
2 void load(const EntryType *mem);
3 template
1 void Vector
the required load or store operation. (The case of distributed scalars is handled by gather and scatter functions, which are described in Section 4.8.)
4.3.1 converting loads and stores
Some SIMD hardware can convert between different data types without extra runtime overhead when executing a load or store instruction [46]. Therefore, and because it is very convenient for writing portable conversion code, the load & store functions provide a generic variant that can access arrays of different scalar types. Semantically, these functions behave as described in Listing 4.8. Thus, 풲_T values of type U are converted with load/store functions in Vector
Not all conversions are equally efficient in terms of hardware support. How- ever, for reasons of portability, the full set of conversions between fundamental arithmetic types is made available through these functions.
4.3.2 load/store flags
SIMD hardware distinguishes between aligned and unaligned vector loads and stores (cf. Section 1.3.1). Additionally, most algorithms can be optimized if the developer can hint at the temporal usage of the data.7 The alignment could, in theory, be determined from the start address, and thus would not require additional specification in the function call. However, since the alignment can only be determined from the pointer value at runtime, such a check would incur a penalty. Using unaligned load/store instructions unconditionally is more efficient than checking the alignment of the pointer first: an unaligned load/store instruction in hardware can do the alignment test much more efficiently. Therefore, by default, the load/store functions translate to unaligned load/store instructions.
4.3.2.1 alignment

If the user can guarantee alignment, a tag type can be used as the last argument to select the optimized load/store instructions at compile time, without any runtime overhead. It is important that the API is built via a template and tag type, rather than a boolean (or enum) function argument. A boolean function argument cannot guarantee compile-time optimization; in particular, such an API would allow passing a non-constant expression as flag variable, which cannot be optimized at all. Via the tag type the user of the API is required to provide a constant expression and thus decide between aligned or unaligned memory access when (s)he writes the code.
4.3.2.2 non-temporal access

Loads and stores can be further optimized for non-temporal accesses. Many data-parallel algorithms use a streaming pattern, where the input and/or output memory locations are used only once. Such data should therefore not evict other data, which might be used repeatedly in the algorithm, from the CPU caches. The load/store functions in Vc can therefore be called with the Vc::Streaming tag type. This tag hints to the Vector
Streaming stores executed with the Vc::Streaming tag may use non-globally ordered stores if the target CPU supports this. Thus, two stores to the same memory location, where at least one is a streaming store, have undefined behavior unless a memory fencing operation is executed between the stores. This makes it possible to reach the highest store throughput, but requires a good understanding of the implications when used by a developer.
4.3.2.3 prefetching

A last flag that I implemented for the load/store functions makes prefetching in loops significantly simpler. By adding the Vc::PrefetchDefault tag type, the Vector implementation is asked to emit software prefetch instructions for a target-dependent predefined stride. Thus, a call to float_v(memory, Vc::Aligned | Vc::PrefetchDefault) may result in up to three instructions being emitted, one of which is the load instruction. In addition, prefetch instructions for the lowest level cache and second lowest level cache may be emitted. These prefetches are issued with a predefined offset to the memory address that is passed to the load function. The prefetch flag is therefore a shorthand for prefetching explicitly in many loops. However, not all loops require the same prefetch stride lengths, which is why, instead of the predefined strides, the user may also set the strides explicitly. In almost all cases, a developer adds prefetches only after the program or component is already working, when it is modified for speed optimizations. The developer then determines the prefetch strides through intuition and/or trial and error. Note that prefetches only need to be issued once for any address inside one cache line. Thus, two subsequent loads/stores to neighboring SIMD vectors may result in more software prefetch instructions than necessary. This depends on the ratio of the cache line size to the vector register size. As this ratio is target dependent, the API appears to introduce a portability issue in this case. There is no easy solution from the load/store interface side. However, the compiler is, in theory, able to drop the superfluous prefetch instructions.8
8 This is possible if the relative difference between prefetch instructions is considered by the compiler. It could apply an algorithm that keeps the first prefetch call and drops every subsequent prefetch call that would reference the same cache line as a previous call.
1 Vector &operator++();
2 Vector operator++(int);
3 Vector &operator--();
4 Vector operator--(int);
5
6 MaskType operator!() const;
7 Vector operator~() const;
8 Vector operator+() const;
9 Vector operator-() const;

Listing 4.9: Declaration of unary operators.
4.4 UNARY OPERATORS

The unary operators (increment, decrement, logical negation, one's complement, unary +, and unary -) behave as 풲_T applications of the operator to the scalar values in the vector. However, there is an API issue that results from integral promotion, which is applied to the operand of unary +, unary -, and one's complement. Integral promotion leads to operands of builtin integral types smaller than int/unsigned int getting promoted to int or unsigned int before the operator is applied. This implies that ushort_v and short_v would have to return int_v from unary +, unary -, and one's complement. However, since 풲_short = 풲_int does not hold for most targets, this is not possible (except if the return type were SimdArray
9 See Chapter 7.
1 unsigned short a = 40000;
2 auto b = -a; // decltype(b) == int
3
4 ushort_v v = 40000;
5 auto w = -v; // decltype(w) == ushort_v
6
7 assert(b == w[0]); // this fails

Listing 4.10: Subtle differences between scalar code and vector code, because of differences in integer promotion.
4.5 BINARY OPERATORS
Binary operators express arithmetic, comparison, bitwise, and shift operations. They implement the central part of the SIMD functionality by executing 풲_T operations in parallel on the SIMD vector entries. If the EntryType is not an integral type, the bitwise, shift, and modulo operators need to be ill-formed. They would certainly be implementable, but since the builtin non-integral types do not implement these operators, the SIMD types follow the same semantics. The interface for these operators is shown in Listing 4.11. In this form of declaration, the compiler will allow the right-hand operand to be implicitly converted via a non-explicit conversion constructor. Thus, conversions from integer vectors of differing signedness and broadcasts from scalars would be possible. However, the resulting type would solely be determined by the type of the left-hand operand. Consequently, int_v() + uint_v() would result in an int_v, whereas uint_v() + int_v() would result in a uint_v. Also, int_v() + 1.f would compile and result in an int_v, whereas any operation with a scalar value on the left-hand side (such as 1.f + float_v()) would not compile at all. Thus, there is a need for further refinement of the binary operators, which can be done via non-member operators.
4.5.1 generic non-member binary operators
The definition of two of the non-member binary operators (one arithmetic and one comparison operator) is shown in Listing 4.12. There is only a slight difference in the return type between the comparison operators and the remaining binary operators. Compares obviously must return a mask type and therefore require the Vector
10 Substitution Failure Is Not An Error 46 a data-parallel type
1  Vector operator*(Vector x) const;
2  Vector operator/(Vector x) const;
3  Vector operator+(Vector x) const;
4  Vector operator-(Vector x) const;
5
6  MaskType operator==(Vector x) const;
7  MaskType operator!=(Vector x) const;
8  MaskType operator>=(Vector x) const;
9  MaskType operator<=(Vector x) const;
10 MaskType operator> (Vector x) const;
11 MaskType operator< (Vector x) const;
12
13 private:
14 // Use SFINAE to disable the operator if EntryType is not integral:
15 // 1. Variant, for Vector arguments
16 template
1 template
Listing 4.12: Non-member operators for the Vc SIMD vector types.
Thus, TypesForOperator
11 Using a static_assert for improved error messages is also possible here and it can be used to explain the error directly, thus making it easier to correct errors in the usage of the interface. On the other hand, with a static_assert, a trait that checks whether a binary operator for two given operands is defined will return a positive answer even though an actual call would fail to compile because of the static assertion. (See Appendix I for the details.)
1 template
Listing 4.13: The traits the non-member binary operators need for SFINAE and return type evaluation.
1 template <
2   typename V, typename W,
3   bool IsIncorrect = participateInOverloadResolution
• If both operands are SIMD vectors, at least one of them must be implicitly convertible to the other type.
• If one operand is a scalar type, then an implicit conversion from the scalar type to the return type must be possible.
• Furthermore, a conversion from scalar type to vector type may not lead to a narrowing conversion from a floating-point type. This essentially forbids float_v × double because the double operand would have to be converted to the narrower single-precision float type. On the other hand, double_v × float does not require a narrowing conversion and therefore works.
The return type is determined via the CopyUnsigned
• The return type is the unsigned variant of T if T is integral and Test is an unsigned type.
• Otherwise the return type is T.
Thus, if one operand is an unsigned integer vector or scalar and the other operand is a signed integer vector or scalar, then the operands are converted to the corresponding unsigned integer vector. However, in contrast to the semantics of builtin integer types, no full integer promotion is applied, leaving the size 풮_T of the vector entries,
1 Vector &operator*= (Vector
and thus 풲_T unchanged. It follows that int_v × unsigned int yields uint_v and short_v × unsigned int yields ushort_v. The latter implicit conversion from unsigned int to ushort_v is unfortunate, but since short_v + 1 should be valid code and return a short_v, it is more consistent to also convert unsigned int implicitly to short_v or ushort_v. The non-member operators were explicitly designed to support operator calls with objects that have an implicit conversion operator to a SIMD vector type. This is possible by leaving W less constrained than V (in Listing 4.13).
4.5.2 optimizing vector scalar operations × Note the shift operator overloads for an argument of type int on lines 38–37 in Listing 4.11. This touches a general issue that is not fully solved with the binary operators interface as declared above: Some operations can be implemented more efficiently if the operator implementation knows that one operand is a scalar or even a literal. A scalar operand would be converted to a SIMD vector with equal values in all entries via the non-member binary operators (which Vc therefore does not define for the shift operators). The issue is certainly solvable, and a topic for future research. A possible solu- tion could be not to call V::operator (V) from the non-member operators and instead call a template function such as⋯ template
Apart from simple assignment, C++ supports compound assignment operators that combine a binary operator with assignment to the variable of the left operand. The
1 EntryReference operator[](size_t index);
2 EntryType operator[](size_t index) const;

Listing 4.16: Declaration of the subscript operators for scalar access.
standard defines the behavior “of an expression of the form E1 op = E2 […] equivalent to E1 = E1 op E2 except that E1 is evaluated only once.” [48, §5.17] Thus, the simplest implementation calls the binary operator and the assignment operator. However, compound assignment operators can do more than the combination of binary operator and assignment operator. The additional constraint of compound assignment operators, that the result of the binary operator needs to be converted back to the type of the left operand, allows any scalar arithmetic type as operand. This may be best explained with an example. Consider

short_v f(short_v x) { return x *= 0.3f; }
with SSE, where 풲_short ≠ 풲_float. In this case Vector
4.7 SUBSCRIPT OPERATORS
Subscripting SIMD vector objects is a very important feature for adapting scalar code in vectorized code. Subscripting makes mixing of scalar and vector code intuitively easy (though sometimes at a higher cost than intended). However, while subscripting is desirable from a user's point of view, the C++ language standard makes it close to impossible. The issue is that the non-const subscript operator needs to return an lvalue reference to the scalar type (EntryReference) and assignment to this lvalue reference needs to modify the SIMD vector, which is of type VectorType. This requires aliasing two different types onto the same memory location, which is not possible with standard C++. Even a union does not solve the issue because aliasing via unions is only well-defined for layout-compatible types.
1 EntryType operator[](size_t index) const {
2   EntryType mem[size()];
3   store(mem);
4   return mem[index];
5 }

Listing 4.17: The offset in a SIMD register via operator[] is defined by the memory order.
Therefore, the return type of the non-const subscript operator is implementation-defined (see Listing 4.16). Most compilers provide extensions to standard C++. One popular extension is explicit aliasing via unions. However, there are also other ways of allowing explicit aliasing, such as GCC's may_alias attribute. If Vc is compiled with GCC then the return type of the non-const subscript operator will be EntryType with the may_alias attribute. The const subscript operator intentionally returns a prvalue and not a const lvalue reference. With a reference, read-only access would require the compiler to also violate the strict aliasing rules. And it is a very rare use-case to store a reference to the return value of the const subscript operator and then modify the object through a non-const reference. If the user wants a (mutable or immutable) reference to an entry in the SIMD vector, (s)he can (and must) use the non-const subscript operator. Finally, as noted in Section 4.1.1, the EntryReference type is not necessarily the same as EntryType&. Therefore, in order to actually return EntryType, the generic definition of the const operator needs to return a prvalue. The effect of the const subscript operator is shown in Listing 4.17. Most importantly, this defines the indexing order of the values in the SIMD vector: access to the value at offset i via the subscript operator yields the same value that a store and subsequent array access at offset i produces. Of course, the same indexing order must be used for the non-const subscript operator.
4.8 GATHER & SCATTER

A gather operation in SIMD programming describes a vector load from a given base address and arbitrary (positive) offsets to this address. The scatter operation is the reverse, as a vector store to a base address and arbitrary offsets (see Figure 4.2).
Figure 4.2: Graphical representation of gather and scatter operations. Both operations need a base pointer (ptr) and a vector of offsets to this pointer ({3, 17, 8, 3}), which are used to calculate the addresses in memory. For the gather operation, the values at these memory locations are read into the corresponding lanes of the return vector. For the scatter operation a third input, a vector of values, is written to the memory locations.
1 struct iovec {
2   void *iov_base; // Pointer to data
3   size_t iov_len; // Length of data
4 };
5 ssize_t readv (int fildes, const struct iovec *iov, int iovcnt);
6 ssize_t writev(int fildes, const struct iovec *iov, int iovcnt);

Listing 4.18: Scatter/gather functions in POSIX [79].
4.8.1 prior art

Portable and intuitive gather & scatter APIs, in terms of function calls, do not have an obvious solution. The most interesting prior art is the syntax of valarray gather/scatter (Section 4.8.1.4).
4.8.1.1 posix

The Open Group Base Specifications, Issue 6, IEEE Std 1003.1 [79] specify readv and writev in POSIX 1003.1-2004 (Listing 4.18). readv can be used to scatter data from a file descriptor into a given number of arbitrarily sized buffers, and writev does the reverse. Note that the functions are not type-safe (the void* erases the type information), which is fine for file descriptor I/O but not for a C++ API for SIMD vector types. Furthermore, the interface requires the user to create an array of pointers to the locations in memory instead of a single pointer and a vector of offsets. The POSIX gather/scatter interface thus is not a good starting point for gather/scatter for Vector
1  __m512 _mm512_i32gather_ps(__m512i index, void const *addr,
2                             int scale);
3  __m512 _mm512_mask_i32gather_ps(__m512 v1_old, __mmask16 k1,
4                                  __m512i index, void const *addr,
5                                  int scale);
6  __m512 _mm512_i32extgather_ps(__m512i index, void const *mv,
7                                _MM_UPCONV_PS_ENUM conv, int scale,
8                                int hint);
9  __m512 _mm512_mask_i32extgather_ps(__m512 v1_old, __mmask16 k1,
10                                    __m512i index, void const *mv,
11                                    _MM_UPCONV_PS_ENUM conv, int scale,
12                                    int hint);
13
14 void _mm512_i32scatter_ps(void *mv, __m512i index, __m512 v1,
15                           int scale);
16 void _mm512_mask_i32scatter_ps(void *mv, __mmask16 k1, __m512i index,
17                                __m512 v1, int scale);
18 void _mm512_i32extscatter_ps(void *mv, __m512i index, __m512 v1,
19                              _MM_DOWNCONV_PS_ENUM conv, int scale,
20                              int hint);
21 void _mm512_mask_i32extscatter_ps(void *mv, __mmask16 k1,
22                                   __m512i index, __m512 v1,
23                                   _MM_DOWNCONV_PS_ENUM conv,
24                                   int scale, int hint);
Listing 4.19: The gather/scatter intrinsics for the Intel MIC architecture [46].
4.8.1.2 mic intrinsics

Much closer in functionality to the requirements of the vector types API are the SIMD gather/scatter intrinsic functions that Intel introduced with their compiler for the MIC12 architecture (Listing 4.19). In its simplest form the gather takes an int index vector, multiplies the values with the scale parameter (which may be 1, 2, 4, or 8), and uses these as Byte-offsets into the memory pointed to by addr to load 16 values. Thus v = _mm512_i32gather_ps(index, addr, scale) is equivalent to:

for (int i = 0; i < 16; ++i) {
  v[i] = *reinterpret_cast
12 Many Integrated Core
1 std::valarray
• Masked gathers allow loading fewer values from memory, as determined by the corresponding bits in the mask. This allows addr + index[i] * scale to point to an invalid address for all i where mask[i] is false.
• The extended variants accept a parameter to do type conversions in addition to the load. For example, a float vector can be gathered from random shorts in an array.
• The hint parameter is used to do cache optimizations, possibly marking the affected cache lines as LRU13.
Scatter functions do the reverse of gather functions. They store a (masked) set of scalar values from a vector register to memory locations determined by the index and scale parameters.
4.8.1.3 array notation

Array notation is another important interface. It is available as an extension to C/C++ with Cilk Plus [80]. With this extension the user can express operations on whole arrays with very little code. For instance, A[:] = B[:] + 5 is equivalent to for (i = 0; i < size(A); i++) A[i] = B[i] + 5. With this extension a gather is written as C[:] = A[B[:]] and a scatter accordingly as A[B[:]] = C[:]. The interface requires a single new concept (which is the central concept of array notation) to be type-safe, concise, and intuitively clear. This syntax also naturally expresses converting gathers/scatters. Masking is possible as well via the generic masking support in the array notation syntax [43, §5.3.6], which extends the semantics of if/else statements.14
4.8.1.4 stl

Finally, the standard library provides related functionality in the std::valarray and std::indirect_array classes. An indirect_array is created when a valarray object is subscripted with a valarray
13 Least Recently Used
14 In Cilk Plus if/else statements are extended to accept arrays of booleans. Thus, both the if and else branches can be executed. Section 5.1.1 covers this topic in depth.
1  void maskedArrayGatherScatter(float *data, int_v indexes) {
2    const auto mask = indexes > 0;
3    float_v v(data, indexes, mask);
4    v.scatter(data, indexes, mask);
5  }
6
7  struct S { float x, y, z; };
8
9  void structGatherScatter(S *data, int_v indexes) {
10   float_v v(data, &S::x, indexes, indexes > 0);
11   v.scatter(data, &S::y, indexes, indexes > 0);
12   ...
13 }

Listing 4.21: Example usage of the first generation gather/scatter interface in Vc.
a scatter operation. However, these gather and scatter operations are limited to valarray objects.
4.8.2 initial vc interface

The initial approach to gather and scatter interfaces in the Vc SIMD vector classes was via member functions, and for gathers additionally via constructors. A simple gather and scatter example, using this interface, is shown in Listing 4.21. A tricky issue was gather and scatter operations on an array of a non-fundamental type, such as a struct, union, or array. In the case of an array of struct, the user needs to specify the member variable of the struct that (s)he wants to access. A possible interface for this is shown at the bottom of the example in Listing 4.21. This interface can certainly support all features of the underlying hardware, or emulate such features on SIMD hardware that does not have the respective instruction support. However, the interface is hardly intuitive. The order of parameters does follow a logical pattern (outer array, possibly struct members, indexes, and optionally a mask as the last parameter), but even then I often had to look up the interface in the API documentation. A more intuitive interface needs to relate more closely to known C++ syntax, which is something the array notation in Cilk Plus and valarray nicely demonstrate. Furthermore, the necessary overloads to support arbitrary nesting of structs and arrays quickly get out of hand: for every possible composition of structs, unions, and arrays two (unmasked and masked) gather and scatter function overloads are required. In some situations it may be necessary to supply more than one index vector: the first indexes subscript an outer array, and the offsets into the inner array are not equal for all entries but require another index vector. The gather/scatter function approach simply does not scale in that respect. The following sections therefore present a scalable and more intuitively usable solution.
1 template
4.8.3 overloading the subscript operator

The problems discussed above do not exist for scalar types because the subscript operator and the member access operations support arbitrary access of members in such nested data structures. Thus, the question is whether the syntax that works for scalars can be made to work for SIMD vector types as well. It is possible to overload the subscript operator with one parameter of arbitrary type, and thus also with a parameter of SIMD vector type. However, it is not possible to overload the subscript operator of existing types because C++ does not support non-member subscript operators. Thus, in order to implement gathers and scatters via the subscript operator, the array/container class needs to specifically support the SIMD types. On the other hand, adding a gather subscript operator directly to all container classes would make all of them depend on the declarations of the SIMD types. Luckily, there is a clean way around this that effectively creates opt-in non-member subscript operators.15 A class simply needs the two subscript operators defined in Listing 4.22. Then, if the subscript_operator function is declared (with the right signature), the subscript operator can be used with the types the subscript_operator functions implement. As long as the C++ standard library does not implement such a subscript operator, the container classes in the std namespace cannot support SIMD vector subscripting. Therefore, only a new type can implement these subscript operators. It is possible to adapt existing container classes with the AdaptSubscriptOperator class shown in Appendix H (Listing H.1) and thus create a Vc::vector type that implements std::vector with the additional subscript operator.16
15 For implementing the gather and scatter subscript operator overloads it would, of course, be better if non-member subscript operators were possible. Informal feedback from the C++ committee has been that there is interest in a concrete proposal on the issue.
16 The exception is std::array and other container classes that need to be POD or aggregates. Vc::array therefore needs to be a verbatim copy of std::array plus the new subscript operator.
4.8.4 a subscript operator for simd gather and scatter

Once there are container classes that support subscripting with arbitrary, user-defined types, subscripting with Vc's vector types can be implemented (Listing 4.23). The requirements for the subscript_operator function are the following:
• It must accept only a specific subset of container classes, specifically those that use contiguous storage for its entries.
• The container may use storage on the free store. However, nested containers are constrained: Only the outermost container may use the free store.
• The container and/or its entries may be arbitrarily cv-qualified.
• It must accept a specific set of types for the index parameter. – Any type that implements the subscript operator and contains as many entries as the gathered vector will contain / the vector to scatter contains. – SIMD vectors of integral type and with Vector
The enable_if statement makes it possible to implement the function such that it participates in overload resolution if and only if …
… the has_subscript_operator type trait finds a usable subscript operator in the IndexVector type (cf. Appendix G).
… the has_contiguous_storage type trait determines that the Container type is implemented using contiguous storage for the entries in the container.17
… the is_lvalue_reference type trait determines that dereferencing the first iterator of the Container type returns a type that can be used to determine a pointer to the first element of the contiguous storage of the container.
Whether the subscript_operator function participates in overload resolution directly determines whether the generic forwarding member subscript operators in Vc::array and Vc::vector participate in overload resolution. This is due to the return type of the subscript operators, which leads to substitution failure (which is not an error) if and only if subscript_operator is not usable.
17 Such a trait cannot really be implemented, for all I know. However, it is possible to define a list of classes and class templates that will work as expected.
1 template
4.8.5 a proxy type for gather/scatter
The subscript_operator function then returns an object of type SubscriptOperation that contains a pointer to the beginning of the container storage and a const reference to the index vector. A naïve approach would return Vector
4.8.6 simple gather/scatter operators
The proxy type implements the conversion operator for gathers (line 14) and the assignment operator for scatters (line 15). Both of these functions may only participate in overload resolution if the memory pointer type is an arithmetic type and the template parameter type V is a SIMD vector type. The gather and scatter operations are then, together with the template parameter type V, fully defined and thus support type conversion on load/store. The conversion is defined by the entry type of the SIMD vector type and the type of the memory pointer. At this point the number of entries in the SIMD vector is known and therefore the size of the
Figure 4.3: Graphical representation of the nested scalar subscript gather operation. (ptr is a nested array type, such as an array of arrays.)
index vector can be checked. If the size of the index vector is encoded in the type, then this will be used to additionally determine participation of the functions in overload resolution.18

Since scatter is implemented via an assignment operator in SubscriptOperation, it appears consistent to support compound assignment as well. In Vc I decided against doing so, because compound assignment implies that an implicit gather operation needs to be executed. Since gather/scatter are rather expensive operations, I believe the user should see more clearly that the memory access patterns of the code are sub-optimal. Not allowing compound assignment forces the user to explicitly execute a gather and a scatter, thus making the memory accesses more obvious.

The two operators implement unmasked gather and scatter. Masked gather/scatter will be discussed in Section 5.4. The functions on lines 17 and 18 provide a minimal interface for extracting the necessary information.
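The consequence of this decision can be illustrated with a plain scalar emulation (my own sketch with hypothetical helper names, not the Vc implementation): instead of a hidden gather inside a compound assignment, both memory operations are spelled out at the call site.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Illustrative scalar emulation, not the Vc implementation: a gather reads
// mem[indexes[i]] into a small fixed-size "vector", a scatter writes it back.
template <std::size_t N>
std::array<float, N> gather(const float *mem, const std::array<int, N> &indexes) {
  std::array<float, N> result{};
  for (std::size_t i = 0; i < N; ++i) result[i] = mem[indexes[i]];
  return result;
}

template <std::size_t N>
void scatter(float *mem, const std::array<int, N> &indexes,
             const std::array<float, N> &values) {
  for (std::size_t i = 0; i < N; ++i) mem[indexes[i]] = values[i];
}

// Without compound assignment, the expensive gather and scatter are both
// visible at the call site:
template <std::size_t N>
void add_one_at(float *mem, const std::array<int, N> &indexes) {
  auto tmp = gather(mem, indexes);  // explicit gather
  for (auto &v : tmp) v += 1.0f;    // arithmetic on the temporary "register"
  scatter(mem, indexes, tmp);       // explicit scatter
}
```
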
4.8.7 the nested scalar subscript operator
The SubscriptOperation class (Listing 4.24) implements three subscript operators. The scalar subscript operator on line 24 makes it possible to use gather/scatter for nested containers. The second operator on line 31 implements gather/scatter operations for different offsets in nested containers.19 The third operator on line 37 enables gather/scatter to use arrays of structs.
18 An alternative to removing a function from overload resolution is a diagnostic via a static_assert statement. (Compare footnote 11 on page 47.) 19 Basically, only nested arrays will work because of the requirement that only the outer container may allocate the data on the heap.
Listing 4.24: The SubscriptOperation proxy type. It implements conversion to/from SIMD vector types via gather/scatter calls and additional subscript operators.
The scalar subscript operator (line 24) executes a constant array subscript for a nested array (see Figure 4.3). Semantically, the operator implements the same behavior as scalar subscripting. Thus, the expression data[int_v::IndexesFromZero()][3] references the elements data[0][3], data[1][3], data[2][3], …
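A scalar emulation of this semantics (an illustrative helper of my own, not the Vc interface) makes the access pattern explicit: each vector lane i reads data[indexes[i]][inner].

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Scalar emulation of the nested scalar subscript gather:
// data[indexes][inner] reads data[indexes[i]][inner] for every lane i.
template <std::size_t N, std::size_t Inner, std::size_t Outer>
std::array<float, N> gather_nested(
    const std::array<std::array<float, Inner>, Outer> &data,
    const std::array<int, N> &indexes, std::size_t inner) {
  std::array<float, N> result{};
  for (std::size_t i = 0; i < N; ++i) result[i] = data[indexes[i]][inner];
  return result;
}
```
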
4.8.7.1 return type

The subscript operator must determine the new template arguments for the SubscriptOperation return type:
• The memory pointer previously pointed to an array of arrays. The new pointer must point to the beginning of the first array in the outer array. Thus, the type changes from array of U to U.
• The IndexVector type does not change at this point, since its values are not modified.
• Because of the above, the index vector would now contain incorrect offsets. Consider the expression data[int_v::IndexesFromZero()][3] (as above) and assume data is of type Vc::array
4.8.7.2 participation in overload resolution

The scalar subscript operator (line 24) may not be instantiated with the SubscriptOperation
4.8.7.3 nested container requirements

A container class that can be dynamically resized typically stores its data in a heap-allocated array outside of the container object. The object itself typically has only two or three member variables: pointers to the begin and end of the data, and possibly a pointer to the end of the allocated memory. A well-known example is std::vector. Consider what Vc::array
20 or worse: different segments of memory
Overall, these constraints and complications show that an interface to gather/scatter for these kinds of nested container types is not worth the effort. The user would be much better served by writing a loop that assigns the scalar values sequentially.21
4.8.8 type of the index vector

As mentioned above (and more below), at some point the index vector values must be scaled with the ratio stored in the std::ratio template parameter. However, when the multiplication is executed, the index vector type must be able to store these larger values. It would be easiest if the scaled index vector type were equal to the type used for the initial subscript. However, the scale operation on the indexes is implicit and not at all obvious to a user who did not study the implementation of nested subscripting. The user only sees the need for the type of the subscript argument to be able to store the offsets for the outer subscript. Thus, a vector of 16-bit integers may appear to be sufficient.

Obviously, a 16-bit integer can quickly overflow with the scale operations involved for nested arrays.22 Therefore, the index vector type needs to be promoted transparently before applying the scale operation. The safest type for promotion would be std::size_t. However, as discussed above, the interface should preferably avoid 64-bit offsets, since they cannot be used for gather/scatter instructions on at least one major platform. Thus, integral promotion to int or unsigned int is the most sensible solution.

The promoted index vector type is captured in the ScaledIndexes member type (line 8). The discussion showed that there is no definite answer on the type promotion. Since the type is only used internally, the implementation may choose the exact rules. For Vc I chose the following logic for the ScaledIndexes member type:
• If IndexVector is Vector
21 Or even better: The API limitation uncovers the flaw in the data structure and leads to a redesign and better data structures. 22 Consider Vc::array
Figure 4.4: Graphical representation of the nested vector subscript gather operation. (ptr is a nested array type, such as an array of arrays.)
• If IndexVector is an initializer_list
• If IndexVector is a vector
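The promotion rule sketched above can be expressed with standard type traits. The following is my own illustration of such a rule, not Vc's actual metaprogram: index types narrower than int are widened before the scale multiplication, wider types are kept.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Illustrative promotion rule: integer types narrower than int are widened
// to int (signed) or unsigned int (unsigned) before scaling; int and wider
// types are left unchanged. Not the exact rules used by Vc.
template <typename T>
using promoted_index_t = typename std::conditional<
    (sizeof(T) < sizeof(int)),
    typename std::conditional<std::is_signed<T>::value, int, unsigned int>::type,
    T>::type;

static_assert(std::is_same<promoted_index_t<std::int16_t>, int>::value,
              "16-bit signed indexes are widened to int");
static_assert(std::is_same<promoted_index_t<std::uint8_t>, unsigned int>::value,
              "8-bit unsigned indexes are widened to unsigned int");
static_assert(std::is_same<promoted_index_t<int>, int>::value,
              "int stays int");
static_assert(std::is_same<promoted_index_t<long long>, long long>::value,
              "wider types are not narrowed");
```
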
4.8.9 the nested vector subscript operator
The second subscript operator on line 31 also implements gather and scatter on nested arrays (see Figure 4.4). In contrast to the above subscript operator, it allows different offsets on all nesting levels. Thus, data[i][j] references the values {data[i[0]][j[0]], data[i[1]][j[1]], …}. The second subscript's offsets therefore need to be added to the scaled original offsets. This is why the return type of the subscript operator is a SubscriptOperation
… T implements the subscript operator.
… T is a container that stores its data inside the object. (cf. Section 4.8.7.3)
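The per-lane semantics of data[i][j] can again be illustrated with a scalar emulation (a hypothetical helper, not the Vc code): lane k reads data[outer[k]][inner[k]], matching the access pattern shown in Figure 4.4.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Scalar emulation of the nested vector subscript gather: per-lane offsets
// on both nesting levels, so lane k reads data[outer[k]][inner[k]].
template <std::size_t N, std::size_t Inner, std::size_t Outer>
std::array<float, N> gather_nested2(
    const std::array<std::array<float, Inner>, Outer> &data,
    const std::array<int, N> &outer, const std::array<int, N> &inner) {
  std::array<float, N> result{};
  for (std::size_t k = 0; k < N; ++k) result[k] = data[outer[k]][inner[k]];
  return result;
}
```
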
Figure 4.5: Graphical representation of the nested struct subscript gather operation. (ptr is an array of struct type, with struct S { float x, y, z; };)
… the operator's function parameter type implements the subscript operator. This requirement fully disambiguates the function from the scalar subscript operator.
… the number of values in IndexVector and the function parameter type IT are equal, or at least one of them cannot be determined as a constant expression.
4.8.10 the nested struct member subscript operator

The third subscript operator on line 37 in Listing 4.24 enables gather and scatter for arrays of structs (see Figure 4.5). The API defined here does not have a direct counterpart in scalar code. The reason for this is that the dot-operator is not overloadable in C++ (yet).

Consider a container of type Vc::vector<S> with a simple structure (struct S { float x, y, z; }) for the following discussion. Then, to access S::x of a given array element with offset i, the required scalar code is data[i].x. The data member access via .x is not overloadable in a generic proxy-class returned from data[indexes] (where indexes is a vector/array of indexes). Therefore, with C++14, the only way to implement a vectorized struct gather requires data[indexes] to return a struct of proxy objects that are named x, y, and z. It follows that such an implementation must know the members of the struct before template instantiation. Such a return type therefore is not generically implementable with C++14. There is research going into the compile-time reflection capabilities required for creating such code [73].
4.8.10.1 emulating member access

Member access is too important to postpone its use with Vc until some future C++ standard provides the capabilities, though. Therefore, consider an alternative that, at least when seen in code, is suggestive enough that a developer can intuitively understand what it does and how it can be used in new code.

The solution I chose in Vc therefore had to rely on a different operator overload or member function. The subscript operator is the semantically closest relative to accessing member variables via the dot operator. However, instead of an integral argument, the subscript operator needs to know the member variable's offset in the struct, which can be expressed with pointers to members. Thus, data[indexes][&S::x] in Vc expresses the equivalent of data[index].x in scalar code.
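A scalar emulation of the pointer-to-member subscript (illustrative only, not the Vc implementation) shows how a pointer to member such as &S::z selects which member is gathered from each addressed element:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

struct S { float x, y, z; };

// Scalar emulation of data[indexes][&S::z]: the pointer to member selects
// the struct member read from every addressed element.
template <std::size_t N, std::size_t Size>
std::array<float, N> gather_member(const std::array<S, Size> &data,
                                   const std::array<int, N> &indexes,
                                   float S::*member) {
  std::array<float, N> result{};
  for (std::size_t i = 0; i < N; ++i) result[i] = data[indexes[i]].*member;
  return result;
}
```
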
4.8.10.2 participation in overload resolution

The nested struct member subscript operator may only participate in overload resolution if the template parameter T of the SubscriptOperation class template is a class or union type. Otherwise, the pointer to member type in the function parameter list would be an ill-formed expression.
4.8.10.3 return type

The return type of operator[](U S::*) follows the same considerations as the return type for operator[](std::size_t) (cf. Section 4.8.7.1). However, now the importance of using a fraction template parameter instead of immediate scaling or a single integer for the scaling parameter becomes clear. Consider a struct that contains an array: struct S2 { float x[4], y; }. The index scale factor for &S2::x thus would be sizeof(S2)/sizeof(float[4]) = 20/16, which is not an integral value. In order to access a scalar element, the user must call another subscript operator for the S2::x array, which will result in the scaling fraction sizeof(float[4])/sizeof(float) = 16/4. The final scaling that is applied in e.g. SubscriptOperation::operator V() thus becomes (20⋅16)/(16⋅4) = 5.

4.8.10.4 digression: implications for operator.()

The SubscriptOperation class is a use case for an overloadable operator.() that needs to do more than just forward to an object returned by the operator (i.e. the current operator->() cannot be used either). A possible operator.() declaration for SubscriptOperation could look like the one in Listing 4.26. For a container of type Vc::vector<S>, the code data[indexes].x would thus call data[indexes].operator.(&S::x). The type S could be deduced from the parameter type of the operator.() declaration.
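The fractions in the scaling example can be checked directly with sizeof, assuming 4-byte float and no padding in S2 (which holds on common ABIs):

```cpp
#include <cassert>
#include <cstddef>

struct S2 { float x[4], y; };

// Assumptions: 4-byte float, no padding in S2 (true on common ABIs).
static_assert(sizeof(float) == 4, "assumed 4-byte float");
static_assert(sizeof(float[4]) == 16, "inner array is 16 bytes");
static_assert(sizeof(S2) == 20, "assumed no padding: 4*4 + 4 = 20");

// Scale factor for &S2::x is 20/16; subscripting into x contributes 16/4;
// the product reduces to (20*16)/(16*4) = 5, the number of floats per S2.
static_assert((sizeof(S2) * sizeof(float[4])) /
                  (sizeof(float[4]) * sizeof(float)) == 5,
              "combined scale factor is 5");
```
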
namespace Vc {
using target_dependent::Vector;
using target_dependent::float_v;
using target_dependent::double_v;
using target_dependent::int_v;
// ...
}
Listing 4.27: Declaration of the default SIMD vector type(s).
4.9 SELECTING THE DEFAULT SIMD VECTOR TYPE
Section 4.1 (Listing 4.1) showed that the Vector
4.9.1 the scalar Vector
In addition to the vector types in the Vc namespace, the Vc::Scalar namespace is always defined. The Vc::Scalar::Vector
• making debugging a vectorized algorithm easier.
• testing that a given code works with a different vector width.
• targets without SIMD registers/instructions.
23 Since this has consequences on ABI compatibility, the default might need to be a little more conservative and user-configurable. Chapter 8 discusses the issue.
• implementing generic algorithms that need to be able to process chunks of data that are smaller than the native SIMD width (cf. Section 9.1).
4.9.2 several simd generations in one translation unit
For some target architectures it is possible to support more than one SIMD register width. This can be supported by using documented namespace names for the different SIMD targets. Then a user who knows that (s)he targets this architecture can explicitly use SIMD registers and operations that are not declared as the default type.

As an example consider an x86 target with AVX instruction support. In addition to 256-bit SIMD registers, the target also supports 128-bit SIMD registers (SSE). Thus, the types Vc::AVX::float_v and Vc::SSE::float_v as well as Vc::Scalar::float_v are available to the user. (The type Vc::float_v would then be an alias for Vc::AVX::float_v.) For a problem where only 4 entries in a single-precision float SIMD vector are needed, Vc::SSE::float_v will perform better than Vc::AVX::float_v because it requires half of the memory bandwidth and because not all instructions on the CPU work equally fast for AVX vectors as for SSE vectors. As we will see later (Chapter 7), the different vector types can be abstracted into a higher-level type, alleviating the need for #ifdefs checking for a specific target architecture.

Experience has shown that it is useful to forward declare all user-visible types from target-specific namespaces even in case they are incompatible with the target system of the current compilation unit. This makes it easier for users to write target-specific code without using the preprocessor, relying only on template instantiation rules.
4.9.3 implementation considerations
If you consider that there are definitions for all possible SIMD targets (and thus platform-specific code) in different namespaces, the question remains how the library decides which namespaces are made visible and which one of them to choose as the default. This selection can (and must) be implemented with the preprocessor. This is obvious for hiding implementations that would be ill-formed because of the use of platform- and implementation-specific types. Choosing the default SIMD vector type could be possible with template programming, but it still requires some constant expression that identifies the supported SIMD targets. Today, this information is not accessible through standard C++ and must rely on compiler-specific macros, making an implementation with preprocessor expressions straightforward.
For most target architectures there is a very limited set of choices with regard to SIMD. Today, one of the most diverse SIMD targets is the x86_64 architecture (an overview was given in Section 1.4). Let us consider a compilation unit with compiler flags set for the AMD Piledriver microarchitecture.24 This microarchitecture supports the following SIMD (or SIMD-related) instruction extensions: MMX, SSE, SSE2, SSE3, SSE4A, SSSE3, SSE4, F16C, FMA, AVX, XOP. Since the user will expect that in this case the most advanced registers and instructions will be used, the default Vector
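A minimal sketch of such a preprocessor-based selection (simplified and with placeholder namespace contents; not Vc's actual logic) could look like this:

```cpp
// Sketch: select the default SIMD namespace from compiler-defined macros.
// The struct bodies are placeholders for the real target-specific types.
namespace Scalar { struct float_v { float data; }; }  // width 1, always available

#if defined(__AVX__)
namespace AVX { struct float_v { float data[8]; }; }  // 256-bit registers
namespace Default = AVX;
#elif defined(__SSE2__)
namespace SSE { struct float_v { float data[4]; }; }  // 128-bit registers
namespace Default = SSE;
#else
namespace Default = Scalar;                           // no SIMD support
#endif

// The user-visible default type is an alias into the selected namespace.
using float_v = Default::float_v;
```
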
4.10 INTERFACE TO INTERNAL DATA

The Vc::Vector class does not implement all operations that a user might want to use. Most importantly, there exist specialized instructions for specific application domains that are not easy to capture in a generic interface. Thus, if the Vc types are to be usable also for such special application domains, it must be possible to access internal and implementation-dependent data.

The Vc::Vector class therefore defines the data() member function and a constructor that converts from a VectorType object:

Vector(VectorType);
VectorType &data();
const VectorType &data() const;

For example, a user might want to use the SSE intrinsic _mm_adds_epi16²⁶ with Vc::SSE::short_v and can thus write his/her own abstraction:

Vc::SSE::short_v add_saturating(Vc::SSE::short_v a, Vc::SSE::short_v b) {
  return _mm_adds_epi16(a.data(), b.data());
}

The choice of the data() function is rather arbitrary. Alternatively, the VectorType could also be returned via a conversion operator (operator VectorType &()) or a non-member friend function such as VectorType &internal_data(Vector
24 For instance, the compiler flag for GCC for the AMD Piledriver microarchitecture is -march=bdver2. 25 Vc used 256 bits (two 128-bit operations per vector) for int_v and uint_v before Vc 1.0. 26 This intrinsic adds 16-bit integers with signed saturation, thus returning SHRT_MIN/SHRT_MAX if the addition would underflow/overflow. This functionality may certainly be abstracted for the Vc::Vector types, but currently this is not the case.
4.11 CONCLUSION

This chapter has introduced the API for a type that allows data-parallel execution for arithmetic types. On the one hand, I have shown how the interface has to be designed to follow the syntax and semantics of existing C++ as much as possible. On the other hand, I have shown where the semantics have to differ from the scalar types.

The resulting programming interface enables users to write explicitly data-parallel code without sacrificing serial semantics. Thus, the user does not have to reason about concurrent execution of serially stated algorithms but can reason about serially executing operations, each of which processes data in parallel. Vc's Vector<T> type thus enables a data-parallel programming model which is easy to reason about without abandoning functionality.
5 DATA-PARALLEL CONDITIONALS
The effective exploitation of his powers of abstraction must be regarded as one of the most vital activities of a competent programmer. — Edsger W. Dijkstra (1972)
Conditional statements are some of the most important language elements in C++. if statements enable programs to follow different code paths depending on arbitrary boolean conditions. In most cases an if statement is translated as a branching instruction. These instructions can be expensive on modern processors if the branch prediction unit chooses the wrong branch. In such a case the pipeline has to be flushed and execution must restart at the other branch. This can incur penalties on the order of 100 cycles.

In order to overcome costly pipeline flushes on incorrect branch prediction, conditional move instructions have been introduced. A conditional move instruction typically executes a load or register copy if one or more specific flag(s) is/are set. Thus, an optimizing compiler might translate the code if (condition) { x = y; } into a compare instruction and a subsequent conditional move instruction.

Not every conditional jump in machine code is translated from an if statement. Conditional jumps are used for loop exit conditions in while or for statements. Furthermore, switch statements describe jumps into one out of several code sections, where each one can be identified via one or more integral value(s). Instead of a switch statement, the logic can alternatively be expressed as several if statements. This is functionally equivalent, but often compilers optimize switch statements via jump tables, while if cascades are typically translated as consecutive compares and jumps.
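The same transformation can be provoked from source code; both functions below are my own illustration. Many compilers emit a conditional move (cmov on x86) for the ternary form, while the branching form may become a conditional jump:

```cpp
// Branching form: may compile to a compare plus conditional jump.
int max_branch(int a, int b) {
  if (a < b) return b;
  return a;
}

// Branchless form: the ternary expression maps naturally to a compare plus
// conditional move, avoiding misprediction penalties.
int max_cmov(int a, int b) {
  return (a < b) ? b : a;
}
```
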
1 int_v f(int_v a, int_v b) {
2   if (a < b) {
3     a += 1;
4   } else {
5     a -= 1;
6   }
7   return a;
8 }
Listing 5.1: Example code relying on overloaded semantics for if statements with mask arguments.
5.1 CONDITIONALS IN SIMD CONTEXT
The SIMD types, as defined in Chapter 4, do not return booleans from the compare operators. Instead, they return Vector
1. By enhancing the language and modifying compilers accordingly, it is possible to overload the meaning of conditional statements with operands of mask type. This has been implemented in Cilk Plus for the array notation extension [80]. Conditional statements subsequently do not disable a branch unless all entries of the mask are false (though essentially this is an optional optimization). Instead, all code branches are executed, only with some vector lanes implicitly disabled. Consider the example code in Listing 5.1 on a system with 풲int = 4 and a = {1, 2, 3, 4}, b = {7, 0, 7, 7}: The expression a < b then returns a mask with 4 boolean values: {true, false, true, true}. The compiler therefore has to translate the if-branch (line 3) into instructions that modify a only at the indexes 0, 2, and 3. Subsequently, a will be a = {2, 2, 4, 5}. The else-branch (line 5) then may only modify the SIMD vector entry at index 1. Thus, a must become a = {2, 1, 4, 5}, which is the return value of the function f.
2. The alternative keeps the semantics of the existing conditional statements unchanged. Then, mask types can only be used for conditional statements if a reduction function from a mask to a single boolean value is used (cf. Section 5.2.7). Still, the functionality described above (modifying a subset of a SIMD vector, selected via a mask) can be implemented via write-masking expressions (cf. Section 5.3).

int_v f(int_v a, int_v b) {
  if (a < b) {
    return a + 1;
  } else {
    return a - 1;
  }
}
Listing 5.2: Code example that shows unclear return semantics: both branches must execute, but from where does the function return and what is the return value?

int f(int_v a, int_v b) {
  if (a < b) {
    return +1;
  } else {
    return -1;
  }
}
Listing 5.3: Code example that shows unresolvable ambiguity: both branches must execute, but there can be only one return value because the return type is a scalar int.
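The implicit-masking semantics described in the first alternative can be emulated per lane. Applying the sketch below to the values from the Listing 5.1 example (a = {1, 2, 3, 4}, b = {7, 0, 7, 7}) yields {2, 1, 4, 5}:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Per-lane emulation of the masked if/else from Listing 5.1: lanes where
// the comparison holds execute the if-branch, the rest the else-branch.
template <std::size_t N>
std::array<int, N> f_emulated(std::array<int, N> a, const std::array<int, N> &b) {
  for (std::size_t i = 0; i < N; ++i) {
    if (a[i] < b[i]) {
      a[i] += 1;  // lanes with a true mask entry
    } else {
      a[i] -= 1;  // lanes with a false mask entry
    }
  }
  return a;
}
```
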
5.1.1 consequences of implicit masking
Consider the implications of if statements that accept SIMD masks. The code example in Listing 5.2 is a small modification of the example in Listing 5.1 that would be equivalent for scalar types. However, with SIMD vector types both of the two return statements in the code must be taken. It is certainly possible to define that this code blends the SIMD vectors from the two return statements according to the implicit masks in the if and else branches. However, already a seemingly small change, such as returning an int instead of int_v (Listing 5.3), leads to unresolvable ambiguity: Should the function return +1 or -1? Similar ambiguity issues occur with non-complementary masked return statements and function calls inside the branches. Throwing exceptions and locking/unlocking mutexes would even have to be disallowed altogether.

There is a more fundamental uncertainty resulting from implicit masking via if statements on SIMD vector masks: How should different SIMD vector types interact? An if statement from an int_v comparison returns 풲int boolean answers. If the branch contains code with short_v or double_v, should it be implicitly write-masked or not? If yes, how? There is no natural and obvious behavior for applying write masks of different 풲T.
This shows that if statements with non-boolean arguments limit the language features allowed in the if/else branches. This makes the feature much less intuitive. The implicit mask context changes the semantics significantly in different regions of the source code. And the problem is aggravated if a developer requires else if or switch statements.
5.1.2 design decision for vc

For the Vc library I therefore decided that the semantics of if, for, while, and switch must not change for explicit SIMD programming.1 Everything else would be too surprising and unintuitive to users, especially developers that read existing code without prior knowledge about SIMD programming. This may sound obvious, but consider that many developers will start from a scalar implementation of their algorithm. In the scalar code the conditional statements correctly express the logic of the algorithm. When a developer subsequently vectorizes the code, (s)he starts by replacing scalar types with the Vc vector types. At this point it may appear like a logical simplification of the vectorization process to keep the conditional statements unchanged in order to minimize the effort for the user. However, as discussed above, this comes at a considerable cost in consistency of semantics.2 Thus, part of the issue is the question whether it is more important to ease initial vectorization of an algorithm or whether maintenance effort is more important. Even then, whether implicit write-masking via conditional statements eases initial vectorization at all certainly depends on the algorithm: The restricted semantics might lead to an even larger effort required for converting a given scalar code to SIMD code.
5.2 THE MASK
• Reuse/Specialize the Vector
• Define a new class (Mask
• Define a new class (Mask
1 This is nice, because otherwise a pure library solution would not be possible. 2 There is not really a precedent in C++ for such differences in semantics / allowed operations for certain code regions. The transactional memory extensions for C++ [61] may introduce local semantics where actions inside a transaction are restricted to reversible operations. Approaches like explicit SIMD loops (Section 2.3) or the Intel Array Building Blocks framework [67] also rely on restricted local semantics.
5.2.1 why vector
The type bool is part of the integral types in C++. Since values of type bool “participate in integral promotions” [48, §3.9.1] they can be used in any expression where an int can be used.3 Therefore, it appears as if the interface provided by Vector<T> is a good fit for boolean values, too. The additional functionality a SIMD vector of booleans should provide (such as the population count or reductions) could still be defined as non-member functions. However, considering that 풲T may be different for different T, it follows that a single 풲bool cannot match every 풲T. Otherwise Vector
5.2.2 vc::mask
As discussed in Section 5.2.1, it is beneficial to define several mask types instead of a single boolean vector type. Looking at the SSE instruction set, we have seen that Mask
3 “A prvalue of type bool can be converted to a prvalue of type int, with false becoming zero and true becoming one.” [48, §4.5]
1 namespace Vc {
2 namespace target_dependent {
3 template
Vector
VectorType (line 8) This type is used by the implementation to store a SIMD vector of booleans. For some implementations, this type may be equal to Vector
EntryType (line 9) This is an alias for bool. The member type is defined for generality/interface compatibility with Vector
EntryReference (line 10) This type is used as the return type of the non-const subscript operator. It is therefore used to reference a single boolean entry in the internal mask representation. Note that the most compact binary representation for a SIMD vector of booleans uses a single bit per boolean value. In this case there cannot be a type representing the actual bits of the boolean value of a single mask entry.5 Thus, EntryReference can also be a proxy class that can access (read and write) individual bits of such a mask via the assignment operators and cast-to-bool operator.

4 Implicit and explicit conversions between Mask

1 Mask();
2 explicit Mask(bool);
3 template
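Such a bit-accessing proxy can be sketched as follows (similar in spirit to std::bitset::reference; this is my own illustration, not Vc's EntryReference):

```cpp
#include <cassert>
#include <cstdint>

// Proxy referencing one bit of a bitmask word: assignment writes the bit,
// conversion to bool reads it.
class BitReference {
  std::uint32_t &word_;
  std::uint32_t mask_;

public:
  BitReference(std::uint32_t &word, unsigned bit)
      : word_(word), mask_(1u << bit) {}

  BitReference &operator=(bool value) {
    if (value) {
      word_ |= mask_;   // set the referenced bit
    } else {
      word_ &= ~mask_;  // clear the referenced bit
    }
    return *this;
  }

  operator bool() const { return (word_ & mask_) != 0; }
};
```
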
The Mask
The number of entries in the SIMD vector, in general, is different from Mask
5.2.3 constructors
The constructors for the Mask
5 The object representation of any type in C++ takes up 푁 bytes, where 푁 is integral. This is also evident from the sizeof operator, which returns a size_t denoting the number of bytes in the object representation of the type.
explicit Mask(const bool *mem);
template
The constructor on line 2 initializes a mask object with all values set to the boolean value passed in the argument. Therefore, this constructor implements a broadcast from one scalar value to all entries in a SIMD vector. Note that, in contrast to the broadcast constructor from Vector
5.2.4 loads and stores

Mask types can implement load and store functions, reading from / writing to arrays of EntryType (which is bool). These functions can be useful to write code that is independent of the SIMD register width and to interface with non-SIMD code (or I/O in general). Listing 5.6 shows the declaration of the necessary functions. The Flags argument is analogous to the one for the Vector
Mask operator!() const;

Mask &operator&=(Mask);
Mask &operator|=(Mask);
Mask &operator^=(Mask);

Mask operator&(Mask) const;
Mask operator|(Mask) const;
Mask operator^(Mask) const;

Mask operator&&(Mask) const;
Mask operator||(Mask) const;
Listing 5.7: Declaration of logical and bitwise operators for Mask
bool operator==(Mask rhs) const;
bool operator!=(Mask rhs) const;
Listing 5.8: Declaration of the Mask
5.2.5 logical and bitwise operators

Listing 5.7 shows the declaration of the operators for logical and bitwise operations. Each operator simply applies the operation component-wise. There is no need for non-member overloads as was required for Vector
5.2.6 comparison operators

Listing 5.8 shows the declaration of the comparison operators that are implemented for Vc::Mask. Note that the return type is a scalar bool and not a SIMD type. Returning another mask type would make the compare operator basically an alias for the xor operator. Typically, it is more interesting to determine whether two given masks are equal (or not), and this requires a single boolean.

It is certainly possible to additionally define a meaning for relational compare operators (less/greater). The most obvious definition would be an interpretation of the boolean entries as bits of an integer and then compare the integers. Up to now I did not come across a use case for such operators, though, which is why I have not defined an interface for them.
5.2.7 reduction functions

In order to use a mask object in an if statement or loop condition, there needs to be a reduction function from the vector of boolean values in the mask to a single bool. The three reductions all_of, any_of, and none_of [48, §25.2.1–§25.2.3] present in the C++ standard library are applicable to masks. For Vc I added a fourth function:
all_of: Returns true if and only if all entries in the mask are true.
any_of: Returns true if and only if at least one entry in the mask is true.
none_of: Returns true if and only if all entries in the mask are false.
some_of: Returns true if and only if there is at least one entry that is true and at least one entry that is false (note that some_of always returns false for arguments of type Vc::Scalar::Mask
The usefulness of the first three functions should be obvious. The some_of reduction, however, is not used as often. It can be a useful check to determine whether a condition in the SIMD lanes diverged. For example, it could signify that a program still needs to continue iterating, but at least one vector lane is idle and a reorganization of the data vectors might increase the throughput.

The template functions reducing a mask object to a boolean need to be declared in such a way that they do not participate in overload resolution unless the template argument actually is a Mask
Population Count Returns how many values in the mask are true.
Index of First Returns the index of the first true value in the mask.
Index of Last Returns the index of the last true value in the mask.
Bit Representation Returns the mask as one boolean entry per bit in an unsigned int.
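The reference semantics of the four reductions can be written down over a plain array of bool (illustrative only; a real implementation maps these to SIMD instructions such as movemask):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Reference semantics of the mask reductions over a plain bool array.
template <std::size_t N> bool all_of(const std::array<bool, N> &m) {
  for (bool b : m) if (!b) return false;
  return true;
}
template <std::size_t N> bool any_of(const std::array<bool, N> &m) {
  for (bool b : m) if (b) return true;
  return false;
}
template <std::size_t N> bool none_of(const std::array<bool, N> &m) {
  return !any_of(m);
}
// some_of: at least one true AND at least one false entry (always false
// for single-entry masks, matching the scalar case noted in the text).
template <std::size_t N> bool some_of(const std::array<bool, N> &m) {
  return any_of(m) && !all_of(m);
}
```
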
1 namespace Vc {
2 namespace target_dependent {
3 template
int count() const;
int indexOfFirst() const;
int indexOfLast() const;
unsigned int bits() const;
Listing 5.10: Mask reductions with integral return value.
std::valarray
5.3 WRITE-MASKING
The term write-masking is used to denote an expression that disables an arbitrary set of vector lanes for writes to the destination register (or memory location). This is equivalent to the conditional move operation for scalars, applied to several values in parallel. Hardware support for write-masking requires a rather simple operation: instead of writing all bits from some temporary buffer to the destination register, some lanes are disabled, thus keeping the old value in the destination register unchanged.

From the language side, this operation has been implemented via implicit masking (such as the masked if statements in Cilk Plus [80]) or blend functions, which essentially implement the SIMD equivalent of the C++ ternary operator (conditional operator). However, std::valarray declares a write-masking syntax which is very similar to the needs of the SIMD vector types (see Listing 5.11).
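The write-masking operation itself has simple per-lane reference semantics (my own sketch, independent of any particular SIMD instruction set): only lanes with a true mask entry receive the new value, the others keep the old one.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Per-lane emulation of a write-masked assignment: dst[i] is overwritten
// only where mask[i] is true; all other lanes keep their previous value.
template <std::size_t N, typename T>
void masked_assign(std::array<T, N> &dst, const std::array<bool, N> &mask,
                   const std::array<T, N> &src) {
  for (std::size_t i = 0; i < N; ++i) {
    if (mask[i]) dst[i] = src[i];
  }
}
```
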
5.3.1 conditional operator
For SIMD blend operations, the conditional operator (a < b ? 1 : -1) would be a very natural solution. It is straightforward to translate this conditional expression from scalar context into SIMD context. The operator expresses that, for a given condition, its result should be the value of either the first or the second expression after the question mark. In the SIMD case, where a boolean is replaced by a vector of booleans, the conditional operator states that the results of the first expression must be blended with the results of the second expression according to the mask in the conditional expression before the question mark. With the current C++ standard, overloading the conditional operator is not allowed [48, §13.5]. (According to Stroustrup [75], "there is no fundamental reason to disallow overloading of ?:", though.) Therefore, until C++ gains this ability, conditional operators have to be replaced by a function call to support SIMD types.6 For the Vc library, I defined the function Vector<T>
6 Informal feedback from the C++ committee has been that there is interest in a concrete proposal on allowing overloads of the conditional operator.
```cpp
void filter(/*...*/) {
    // ...
    float_t sigma2 = measurementModel.sigma2;
    float_t sigma216 = measurementModel.sigma216;
    float_t hch = measurementModel * F.slice<0, 2>();
    float_t denominator = Vc::iif(hch < sigma216, hch + sigma2, hch);
    // ...
}
```
Listing 5.12: Part of the Kalman-filter code that uses iif. The float_t type can be defined as either float or float_v.
1 WriteMaskedVector
Vc provides the overload T iif(bool, T, T). Thus, iif can be used in template functions where both bools and Vc mask types may occur as the first argument to iif. As an example, Listing 5.12 shows how iif is used inside the Kalman-filter. The float_t type can be defined as anything whose operator< returns either a boolean or a Vc mask. Thus, the implementation of the algorithm is generically usable for SIMD and scalar types.
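A minimal sketch of the two iif overloads described here, using plain 4-wide arrays as hypothetical stand-ins for float_v and its mask type (the real Vc types wrap SIMD registers):

```cpp
#include <array>

// hypothetical 4-wide stand-ins for Vc's float_v and its mask type
using float_v = std::array<float, 4>;
using float_m = std::array<bool, 4>;

// scalar overload, as provided by Vc: T iif(bool, T, T)
template <typename T> T iif(bool c, T a, T b) { return c ? a : b; }

// vector overload: blend a and b lane by lane according to the mask c
inline float_v iif(const float_m &c, const float_v &a, const float_v &b) {
    float_v r{};
    for (int i = 0; i < 4; ++i) r[i] = c[i] ? a[i] : b[i];
    return r;
}
```

Generic code such as the Kalman-filter in Listing 5.12 then compiles unchanged whether float_t's operator< yields a bool or a mask.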
5.3.2 write-masked assignment operators
The iif function would suffice to translate any scalar conditional code into vectorized code. However, it is neither a good general interface, nor does it properly express the intention of the code, which hides behind unnecessarily complex expressions. Therefore, I created a new write-masking syntax for the Vector<T>
and to be used as the right-hand side of an assignment. But WriteMaskedVector may only be used on the left-hand side of an assignment. In addition, the expression x[scalar < 0] *= -1 is valid C++, because bool is implicitly promoted to int and the subscript thus references the entry at either offset 0 or 1. It is very likely a coding error, though, which is why it is better not to reuse the subscript operator for write-masking. Furthermore, the subscript operator can be useful in a future extension of the Vector<T>
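The write-masking syntax discussed above can be sketched as follows; the proxy mirrors the role of Vc's WriteMaskedVector, but the 4-wide array layout and the reduced operator set are simplifying assumptions for this example:

```cpp
#include <array>

struct float_v;

// proxy returned by float_v::operator()(mask); assignments through it
// only touch the lanes selected by the mask
struct WriteMaskedVector {
    float_v &v;
    std::array<bool, 4> mask;
    float_v &operator=(const float_v &rhs);
    float_v &operator*=(float s);
};

struct float_v {
    std::array<float, 4> d{};
    WriteMaskedVector operator()(std::array<bool, 4> m) { return {*this, m}; }
};

float_v &WriteMaskedVector::operator=(const float_v &rhs) {
    for (int i = 0; i < 4; ++i)
        if (mask[i]) v.d[i] = rhs.d[i];
    return v;
}

float_v &WriteMaskedVector::operator*=(float s) {
    for (int i = 0; i < 4; ++i)
        if (mask[i]) v.d[i] *= s;
    return v;
}
```

With this proxy, x(x_is_negative) *= -1 modifies only the masked lanes and leaves the rest of x unchanged.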
5.3.2.1 alternative: Vc::where
The function call operator syntax has a significant downside: it is impossible to write generic functions with conditional assignment that work with both SIMD vector types and fundamental types. It would require an operator overload for fundamental types, or rather a change to the language specification. Therefore, I worked on alternative solutions:
Vc::where(x < 0, x) *= -1; // variant (1)
Vc::where(x < 0) | x *= -1; // variant (2)
Vc::where(x < 0) (x) *= -1; // variant (3)
The goal was to have a function/expression that can return a WriteMaskedVector object for vector types and fundamental types alike.
• The first variant uses less “magic” but does not have such an obvious con- nection between the modified variable x and the assignment operator.
• The second variant states more clearly that an assignment to x is executed. However, it requires an operator between the where function and the assignee that has lower precedence than the assignment operators. In any case, this operator is deprived of its normal semantics, which is a potentially confusing solution.
• The third variant is a compromise of the first two variants. It uses the function call operator of the return type of the where function to make it clearer that assignment is applied to the x variable.
All three variants of the where function can be overloaded with fundamental types. All four solutions for write-masking (where and Vector
the argument is that the first variant is easiest to implement on the one hand and easiest to memorize for users on the other hand.7
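Variant (1) and its overloadability for fundamental types can be sketched like this; the Masked and ScalarMasked helper names are invented for the example and are not Vc's actual return types:

```cpp
#include <array>
#include <cstddef>

// where(mask, v) returns a proxy whose compound assignments are applied
// only to the selected lanes
template <typename V, typename M> struct Masked {
    V &v;
    M m;
    Masked &operator*=(float s) {
        for (std::size_t i = 0; i < v.size(); ++i)
            if (m[i]) v[i] *= s;
        return *this;
    }
};

// the overload for fundamental types that makes generic code possible
struct ScalarMasked {
    float &v;
    bool m;
    ScalarMasked &operator*=(float s) {
        if (m) v *= s;
        return *this;
    }
};

template <typename V, typename M> Masked<V, M> where(M m, V &v) { return {v, m}; }
inline ScalarMasked where(bool m, float &v) { return {v, m}; }
```

The same expression, where(x < 0, x) *= -1, now compiles for a vector-like x and for a plain float.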
5.3.2.2 return type of masked assignment operators
The assignment operators that are declared in the WriteMaskedVector type can return either:
• A reference to the Vector
• A temporary Vector
• The WriteMaskedVector object.
• Nothing (void).
Intuitively, the most sensible choice appears to be a reference to the modified Vec- tor
5.4 MASKED GATHER & SCATTER
Finally, let us look at masked gather and scatter operations. (Gather/scatter was introduced in Section 4.8.) A gather expression creates a temporary Vector<T>
7 Consequently, follow-up papers for data-parallel types to the C++ committee will focus on the where function as the sole solution.
out-of-bounds stores. However, for scatters the scatter expression is on the left-hand side of the assignment operator and thus basically follows the same logic as normal write-masking. Therefore, masked scatters simply work as soon as the SubscriptOperation class supports the write-masking syntax introduced above. To support masked gathers, the WriteMaskedVector class declares an assignment operator for an rvalue reference to SubscriptOperation: template
5.5 CONCLUSION
In this chapter I have shown how an interface to conditionals in data-parallel algorithms needs to be defined. The Mask<T>
VECTOR CLASS DESIGN ALTERNATIVES
There is no obvious natural choice for the class that provides the SIMD types. The choices I considered are:
1. Declare multiple Vector<T> class templates, one per SIMD target, each in its own target-specific namespace.
2. Declare a single Vector<T, Impl> class template, with a second template parameter that identifies the SIMD implementation.
3. Declare a single Vector
6.1 DISCUSSION
At first glance the choices appear equivalent. It is possible to additionally provide the other two interfaces with any of the above choices. However, there are important differences: not every interface adaption can be done transparently. Vector<T>
1 namespace SSE { 2 template
1 namespace Common { 2 enum class SimdInterface { 3 ... 4 }; 5 template
1 template
The Vector
1 template
1 template
therefore does not ensure type safety and either needs a third template parameter to identify the SIMD architecture, or needs to be placed into a namespace, as was done for Vector<T>
1 template
6.2 CONCLUSION
Design 3 is a bad choice for building the low-level type for writing data-parallel algorithms and data structures. Designs 1 and 2 appear to be equally good design choices. Feedback from the C++ committee's study group on concurrency was that they would prefer to see design 2 realized for a vector type in the C++ standard. The reason why I chose design 1 for the Vc library is that it requires considerably fewer template constructions working around restrictions of C++. The second template parameter in design 2 makes specialization of member functions for a given implementation impossible, because it would require partial specialization of function templates, which C++ does not support.
7
SIMDARRAY: VALARRAY AS IT SHOULD HAVE BEEN
The STL has defined a class for data-parallel processing since its inception. This class has been used by some developers to vectorize their programs. There is very little encouragement in the C++ community to use the valarray classes, though.
7.1 STD::VALARRAY
The C++ standard defines the valarray class as "a one-dimensional smart array, with elements numbered sequentially from zero" and as "a representation of the mathematical concept of an ordered set of values" [48, §26.6.2.1]. The std::valarray<T>
1 template
• The valarray
• valarray
• Using valarray
• Even if operations with operands T and U are well-defined, the same operation is not defined for objects of types valarray<T> and valarray<U>.
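The last bullet can be demonstrated with a few lines of standard C++; only the helper function name is invented:

```cpp
#include <valarray>

// float + double is well-defined for scalars, but the corresponding
// valarray expression does not compile:
//   std::valarray<float> a; std::valarray<double> b;
//   auto bad = a + b;  // error: no matching operator+
// valarray operators require operands of the same element type:
std::valarray<float> twice(const std::valarray<float> &a) { return a + a; }
```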
As we have seen, the valarray
7.2 VC::SIMDARRAY
1 translation look-aside buffer
1 template
1 typedef Vc::float_v float_v; 2 typedef Vc::SimdArray
Listing 7.3: The declaration of SIMD types with an equal number of scalar entries enables simple conversion between different data types.
the SSE and AVX variants unique types. The class template therefore is declared as shown in Listing 7.2. For implementation purposes, it may be useful to add a fourth defaulted template parameter, initialized to VectorType::size(). That way the class can be specialized for N == VectorType::size(). The alignas attribute is specified in order to ensure the same alignment on all target systems. That way, if the in-memory representation is equal, the memory from one SIMD implementation can be copied to code built for a different SIMD implementation.
7.3 USES FOR SIMDARRAY
The SimdArray<T, N>
7.3.1 conversion between different data types
The most important use for SimdArray<T, N>
7.3.2 portable code for fixed-width algorithms
Some problems can be expressed more clearly and efficiently with a fixed width of the vector type. For example, consider a problem where only three or four values are available for processing in parallel. In this case the developer can use SimdArray<T, N>
7.4 IMPLEMENTATION CONSIDERATIONS
7.4.1 expression templates
Since the SimdArray
1. Because of instruction-level parallelism (cf. Section 1.6.1), it may be more efficient to execute the sequence of operators in the unmodified order. Consider N_V = 4, i.e. a SimdArray implemented in terms of four Vector<T> objects. Then x * y * z can execute in 11 cycles on a CPU where the multiplication instruction has a latency of 4 cycles and a throughput of 1 cycle (cf. [39]). Because the operator call x * y compiles to four multiplications of Vector<T>
2. Expression templates make the interface more complicated for the compiler and increase compile time. Additionally, the diagnostic messages from the compiler on incorrect use can become very hard for a human to parse. If there is no clear evidence of improvement, this cost should rather be avoided.
1 template
3. Since C++11, expression templates have the issue that assignment to a new variable declared with auto will store an expression object instead of the result of the evaluated expression. This can be surprising and makes the interface slightly fragile. (There is effort going into solving this issue for a future revision of the C++ standard.)
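Item 3 can be reproduced in a few lines; Vec and Sum are invented names for a minimal expression-template pair:

```cpp
#include <cstddef>
#include <initializer_list>
#include <vector>

struct Vec;

// expression object holding references to its operands
struct Sum {
    const Vec &l, &r;
};

struct Vec {
    std::vector<float> d;
    Vec(std::initializer_list<float> i) : d(i) {}
    Vec(const Sum &s);  // evaluation happens on conversion to Vec
};

inline Sum operator+(const Vec &l, const Vec &r) { return {l, r}; }

Vec::Vec(const Sum &s) : d(s.l.d) {
    for (std::size_t i = 0; i < d.size(); ++i) d[i] += s.r.d[i];
}

// Vec v = a + b;   // evaluates the expression as expected
// auto e = a + b;  // surprise: e is a Sum holding references, not a Vec
```

If a or b goes out of scope before e is converted, the stored references dangle, which is exactly the fragility described above.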
The intended use of SimdArray
7.4.2 recursive versus linear implementation
The SimdArray<T, N>
1. A SimdArray
The first implementation strategy requires the use of loops in order to implement the member functions of SimdArray<T, N>
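The recursive strategy can be sketched as follows, with an assumed 4-wide stand-in Vector type; the member functions need no loops because the compiler unrolls the recursion at compile time:

```cpp
#include <cstddef>

template <typename T> struct Vector {  // assumed 4-wide stand-in
    T d[4];
    static constexpr std::size_t size() { return 4; }
    Vector operator+(const Vector &o) const {
        Vector r{};
        for (std::size_t i = 0; i < 4; ++i) r.d[i] = d[i] + o.d[i];
        return r;
    }
};

// recursive strategy: a SimdArray is built from two halves until N
// reaches Vector<T>::size()
template <typename T, std::size_t N> struct SimdArray {
    SimdArray<T, N / 2> lo, hi;
    SimdArray operator+(const SimdArray &o) const {
        return {lo + o.lo, hi + o.hi};
    }
};

template <typename T> struct SimdArray<T, 4> {  // recursion anchor
    Vector<T> v;
    SimdArray operator+(const SimdArray &o) const { return {v + o.v}; }
};
```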
7.5 MASK TYPES
Analogously to the Mask<T>
7.6 CONCLUSION
This chapter has presented class templates that solve the Vector<T>
ABI CONSIDERATIONS
8.1 ABI INTRODUCTION
An Application Binary Interface (ABI) describes machine-, operating-system-, and compiler-specific choices that are not covered by a programming language standard. The most important issues are function calling conventions, memory/stack organization, and symbol name mangling. For Linux the ABI is standardized and documented [64, 50]. For Windows the ABI is implicitly defined by its development tools. (Other compiler vendors either provide incompatible products or reverse-engineer the Windows ABI.) For all targets, the goal is to have an ABI that allows interoperability. Developers expect that their choice of compiler (and compiler version) does not have an influence on the TUs1 that can be linked together correctly. Compiler vendors and operating system vendors have a great interest in providing this guarantee.
8.2 ABI-RELEVANT DECISIONS IN VC
The interface choices for Vc have a direct influence on the ABI of the Vc library.
8.2.1 function parameters
If a Vector
1. The vector can be passed as a single SIMD register.
2. The vector is pushed onto the stack and thus passed via memory.
Choice 1 is the most efficient choice for the majority of situations. This requires a trivial copy constructor and destructor in Vector
1 translation units
If a union of intrinsic type and array of scalar type is used to implement the data member of Vector
8.2.2 linking different translation units
A user can compile two TUs with different compiler flags for the target microarchitecture (for example, one compiled for SSE and the other for AVX). This most likely happens with one TU in a library and the other in an application. Then Vc vector or mask types in the interfaces between the TUs are incompatible types. The most complicated architecture probably is x86: very old systems have no usable SSE support, old systems support SSE, current systems AVX or AVX2, and future systems AVX-512. 𝒲_T is different in each case: xmm vs. ymm vs. zmm registers. Concluding, it might be a good idea for packagers not to treat x86 as a single architecture anymore, but as several. There is great interest in not having to take that path, though. The following sections will explore the issue in detail.
8.3 PROBLEM
The Vector
2 GCC implements it this way. The clang/LLVM compiler (version 3.4) passes unions with 𝒮 = 32 via a single AVX register, though.
3 Clang/LLVM (version 3.4) in this case inverts the behavior for AVX unions and passes them via memory.
4 "Plain ints have the natural size suggested by the architecture of the execution environment" [48, §3.9.1 p2]
if the register sizes differ. Therefore, the use of Vector
• with AVX2: 𝒲_float = 8 and 𝒲_int = 8,
• with AVX: 𝒲_float = 8 and 𝒲_int = 4,
• with SSE: 𝒲_float = 4 and 𝒲_int = 4, and
• without using SIMD functionality: 𝒲_float = 1 and 𝒲_int = 1 (the Scalar implementation mentioned in Section 4.9.1).
The Vector<T>
8.3.1 fixed 𝒲_T in interfaces is not the solution
A common idea for solving the above issue is to request that the SIMD type use a user-defined width (cf. Fog [25] and Wang et al. [83]). Then the type would use the same 𝒲_T on any target and the types would be equal in different TUs. There are two issues with this:
1. There is no guarantee that the specific 𝒲_T can be implemented efficiently on all target systems. Consider, for example, the common choice of 𝒲_float = 4 compiled for an Intel Xeon Phi. The type would have to be implemented with a 512-bit SIMD register where 75% of the values are masked off. On a target without SIMD support, four scalar registers would have to be used, which increases register pressure.5
5 With luck this might just be the right loop-unrolling to achieve good performance, but it is the wrong mechanism to achieve this effect.
2. Even though the types are equal, the specific parameter passing implementa- tion might be different. Consider a vec
From my experience, and in order to enable full scaling to different SIMD targets, I prefer a solution where a fixed 𝒲_T is only chosen because it is dictated by the algorithm, not because of technical complications with ABI compatibility.
8.3.2 derived types
A class that is derived from Vector<T>
```cpp
// a.cc (SSE2: float_v::size() == 4):
static float_v globalData;
void f(float_v x) { globalData = x; }
float_v g() { return globalData; }

// b.cc (AVX2: float_v::size() == 8):
float_v h(float_v x) {
    f(x);  // calls f(x[0...3]) and f(x[4...7])
    // now globalData is either x[0...3] or x[4...7], depending on the
    // order of the calls to f above
    return g();  // calls concatenate(g(), g())
}

int main() {
    cout << h(float_v::IndexesFromZero());  // {0 1 2 3 4 5 6 7}
    return 0;
}

// prints:
// 0 1 2 3 0 1 2 3
// or:
// 4 5 6 7 4 5 6 7
```
Listing 8.1: Impure functions break the adaption strategy of using multiple calls to TUs with shorter SIMD width.
8.3.3 serial semantics
Consider an ABI adaption strategy that splits a function call from TU1 with a Vector<T>
6 This is probably a bad design, but that does not invalidate the problem.
8.3.4 largest common simd width
Consider a compiler implementation that identifies types that depend on 𝒲_T and automatically compiles these symbols for all possible 𝒲_T the target supports (extending the mangling rules accordingly). Then, when the TUs are linked into a single executable, the linker can detect whether some 𝒲_T translations are missing for some symbols. In this case it can drop these 𝒲_T symbols. The same could be done by the loader when the program is dynamically linked, right before executing the program. The largest remaining 𝒲_T symbols can then be used to execute the program. This solution should work as long as no dynamically loaded libraries are used (e.g. plug-ins). Because, if an incompatible library (i.e. one that does not have the symbols for the currently executing 𝒲_T) is loaded, the program cannot switch back down to a smaller 𝒲_T. Thus, at least the ABI compatibility with dynamically loaded symbols cannot be guaranteed by this approach.
8.3.5 simd-enabled functions
The SIMD-enabled functions described in [30] provide the semantic restriction which works around the issue described in Section 8.3.3. The code in Listing 8.1 would still produce the same result, but because of the semantic restriction for the functions f and g the undefined behavior would be expected. On the other hand, a member function of a class with members of vector type that accesses such members will still not be automatically adaptable between different TUs. Consider Listing 8.2. The call to D::f on line 21 will pass a this pointer to an object storing two float_v objects with 𝒲_float = 8, placed next to each other in memory. The function D::f, on the other hand (line 10), expects two float_v objects with 𝒲_float = 4 consecutive in memory (Figure 8.1). In order to adapt such differences between TUs automatically, the adaptor code would have to create two temporary objects of type D (with the ABI in a.cc), copy the data, call the function D::f twice, copy the resulting temporary objects back into the original object, and return. But such a strategy breaks with the call to next->f(). Non-vector members cannot be transformed generically, and the next pointer would therefore point to an untransformed object. Effectively, the strength of vector types (namely target-optimized data structures) inhibits the creation of automatic ABI adaption between TUs with different 𝒲_T.
1 typedef Vector
Figure 8.1: Memory layout differences depending on ABI. The memory layout of d in the caller is shown on the left. However, the function D::f expects the memory layout shown on the right.
8.4 SOLUTION SPACE
In order to enable compilers to keep ABI compatibility for the complete x86_64 architecture, the solution needs to …
1. …make ABI breakage of derived types impossible (or obvious to the user). (cf. Section 8.3.2)
2. …keep one function call as one function call. (cf. Section 8.3.3)
3. …not require a specific 𝒲_T from dynamically loadable libraries. (cf. Section 8.3.4)
8.4.1 drop the default vector type
After extensive consideration and some prototyping, I have not found a way to transparently solve the issue while keeping a default vector type with varying 𝒲_T for different microarchitectures. At this point the only solution I can conceive of is a vector type that does not have a default 𝒲_T, or at least not one that follows the microarchitecture. The C++ committee feedback on the Vector<T>
```cpp
// a.cc
int_v f() [[multitarget]] {
    return int_v::IndexesFromZero();
}

// b.cc
int_v f() [[multitarget]];

int main() [[multitarget]] {
    cout << f() << '\n';
    return 0;
}
```
Listing 8.3: A very simple example using the multitarget attribute.
be very useful for controlled environments, such as homogeneous cluster systems and some in-house software.
8.4.2 improving multi-𝒲_T support
I believe it should be possible to declare a new function attribute that makes multi-SIMD-target deployment easier and at the same time can ease some bits of the ABI inconvenience. Consider an attribute—let us call it multitarget—that instructs the compiler to create all possible microarchitectural variants of this function for the given target architecture.7 The compiler would then create a binary that can execute with the optimal 𝒲_T on different microarchitectures. Calls to functions with this attribute then need to be resolved at run time (via the loader or lazily), except if the call originates from another function with the multitarget attribute, in which case the target may already be known. I will discuss a possible implementation and the ABI implications using the example code shown in Listing 8.3. The function f returns a vector object with the values {0, 1, 2, 3, 4, …} (depending on 𝒲_int). When a.cc is compiled for x86_64, the compiler creates the following symbols:
• _Z1fv@SSE2
• _Z1fv@AVX
• _Z1fv@AVX2
For b.cc, the compiler creates the following symbols:
• main@SSE2
• main@AVX
7 I will only discuss different 𝒲_T here, to focus on the ABI issue. It should be possible to extend the idea to transparently support different versions of SSE, too.
• main@AVX2
When the two TUs are linked, the call to f from main is resolved by using the same suffix. When the application is started, the loader has to resolve the call to main. It therefore has to execute a run-time test for the microarchitecture and then resolve the main function accordingly. A function without the multitarget attribute can call a multitarget function, in which case the call has to be resolved at run time. But if the called multitarget function has vector types (or derived types) in its input or output, then the calling function must also be declared with multitarget. Otherwise, the code will only run on systems with 𝒲_T equal to the one used in the non-multitarget function. On the other hand, this could be intended: a library uses multitarget functions to be generically usable, while an application linking to that library uses a fixed microarchitecture (and therefore no multitarget functions). This feature eases the ABI issue because new code using vector types could easily be prepared to work across microarchitectures. This is especially important for libraries that are shipped in binary form. The ABI issue would resurface as soon as a new microarchitecture with a different 𝒲_T is released. At this point existing libraries will not contain support for the new ABI and will fail to integrate with new code. The problem could be reduced for dynamic libraries that are linked at startup by applying the strategy described in Section 8.3.4. But as discussed before, plug-ins will break the scheme in any case.
8.5 FUTURE RESEARCH & DEVELOPMENT
This chapter has shown that there is still no clear direction for how to define a portable vector type that enables compiler vendors to keep a target architecture ABI-compatible throughout all SIMD variants. The requirements of C++ users certainly differ on this point, and there will probably be no clear one-size-fits-all answer. This issue needs more experience and work in real-world use to come to a final conclusion.
9
LOAD/STORE ABSTRACTIONS
We have seen that computer programming is an art, because it applies accumulated knowledge to the world, because it requires skill and ingenuity, and especially because it produces objects of beauty. — Donald E. Knuth (1974)
Load and store operations are the main reason why a user of Vector
9.1 VECTORIZEDSTLALGORITHMS
The C++ committee is currently developing an extension of the existing STL algo- rithms, which support execution in parallel. The preliminary specification [36] “de- scribes requirements for implementations of an interface that computer programs written in the C++ programming language may use to invoke algorithms with par- allel execution”. As such, it deals with parallelization in terms of multi-threading and SIMD. The specification of the std::par_vec policy requires the compiler to
1 std::vector
treat code called from such algorithms specially, in such a way that "multiple function object invocations may be interleaved on a single thread". The library therefore does not actually vectorize the code; it only annotates the code so that the compiler may do it.
9.1.1 vectorized algorithms with vector types
There is an alternative approach using vector types, which works nicely with generic lambdas and is a useful abstraction to hide the load and store functions of the vector types. Consider the example use of for_each in Listing 9.1. The implementation of for_each can call the lambda with any type that implements the required operators. Therefore, the for_each implementation can use the int_v type and a scalar type.1 The scalar type can either be int or Vc::Scalar::int_v. The latter has the advantage that the lambda can rely on having the full Vc interface available. The need for using more than just the int_v type arises from the possibility that data.size() is not a multiple of 𝒲_int. In that case there will be a number of entries in data that cannot be loaded or stored as a full SIMD vector without risking an out-of-bounds memory access.2 Additionally, an implementation of for_each may choose to do a prologue that processes initial elements as scalars if they are not aligned on the natural alignment of SIMD objects. The parallel algorithms proposal [36] needs to use vector semantics (cf. Section 2.1) in the code that is called from a parallel algorithm using the std::par_vec policy. With the vector types solution presented here, the vectorization policy of the parallel algorithms would not need to restrict thread synchronization and exceptions.
1 This could even be taken further to additionally support smaller SIMD types. For example, on an AVX2 machine, where the main vector type is AVX2::int_v, the next choice would be SSE::int_v, and finally Scalar::int_v.
2 The load/store problem can be solved with partial loads and stores, either as masked load/store instructions or as a concatenation of partial loads / segmentation into partial stores. However, partially unused/invalid values would then have to be passed to the lambda in the unused parts of the vector. Such an implementation of the algorithm would therefore have to use SimdArray
9.1.2 implementation
Listing 9.2 shows a possible implementation of a vectorized for_each. The implementation could be generalized further to support containers that do not store their values in contiguous memory. In the same manner, support for the restrictive InputIterator category of iterators can be implemented. Obviously, memory access would then not use efficient vector loads and stores anymore. Note that vectorization of composite types becomes possible with the work described in Chapter 10.
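The shape of such an implementation can be sketched with a toy 4-wide int_v stand-in (not Vc's actual types); the scalar epilogue handles the tail when data.size() is not a multiple of the vector width:

```cpp
#include <cstddef>
#include <vector>

// toy 4-wide stand-in for Vc's int_v
struct int_v {
    int d[4];
    static constexpr std::size_t size() { return 4; }
    void load(const int *p) { for (int i = 0; i < 4; ++i) d[i] = p[i]; }
    void store(int *p) const { for (int i = 0; i < 4; ++i) p[i] = d[i]; }
    int_v &operator+=(int x) { for (int &e : d) e += x; return *this; }
};

// main loop uses full vector loads/stores; the scalar epilogue handles
// the remaining entries
template <typename F> void simd_for_each(std::vector<int> &data, F &&f) {
    std::size_t i = 0;
    for (; i + int_v::size() <= data.size(); i += int_v::size()) {
        int_v v;
        v.load(&data[i]);
        f(v);
        v.store(&data[i]);
    }
    for (; i < data.size(); ++i) f(data[i]);
}
```

A generic lambda such as [](auto &x) { x += 1; } is instantiated once for int_v& and once for int&, which is exactly the mechanism described for Listing 9.1.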
9.2 VECTORIZING CONTAINER
It is possible to implement a container class template which stores the data in memory such that it is optimized for data-parallel iteration. For fundamental arithmetic types this is as simple as ensuring alignment of the first entry and padding at the end of the container for full vector loads and stores. For structured types the simdize
9.3 CONCLUSION
This chapter has discussed possible abstractions that hide the loop strides and vector loads and stores. That way the 𝒲_T differences, which can make writing portable code more fragile, can be concealed from the user. This area of research and development is still at its beginning. However, the existing abstractions show that the approach discussed here can make portable use of vector types considerably easier. Further developments in reflection capabilities in the C++ language will also enable more powerful abstractions in this area. Therefore, serious research effort in this area should take the possibilities of reflection in C++ into account and contribute to the standardization process to ensure that a future C++ standard will be expressive enough.
1 template
AUTOMATIC TYPE VECTORIZATION
The only way of finding the limits of the possible is by going beyond them into the impossible. — Arthur C. Clarke (1962)
An important pattern in data structure design for data-parallel processing is the creation of structures with Vector
1. maintainability of the code,
2. vector loads and stores,
3. and efficient memory access (e.g. cache usage).
Classically, there are two approaches: AoS and SoA (see Figure 10.1, cf. [76]). With AoS the program uses arrays of structures (where the structures are built from fundamental arithmetic types) to store its data in memory. With SoA (structures of arrays) the storage order in memory is transposed. This latter storage order is used by many developers to improve the data structures for vector loads and stores. At the same time SoA often decreases locality of related objects and thus decreases cache efficiency. Furthermore, the data structures are significantly changed, from containers storing many objects into objects storing many containers. This conflicts with the typical approach of object-orientation and significantly alters the logical structure of the data. Therefore, for vectorization, the AoVS data structure layout is important. In this case arrays of vectorized structures are defined (see Figure 10.1). This introduces the concept of vectorization of structures with several important benefits:
• It constructs chunks of memory that keep related data localized in memory, thus allowing the CPU to reuse cache line fetches more efficiently and to ease the load on the prefetcher.
AoS SoA AoVS
Figure 10.1: Illustration of the memory layout for different storage orders. AoS (array of structures) shows an array of eight objects which store the four members a, b, c, d. SoA (structure of arrays) shows the transposition of AoS, where a structure of four arrays is stored in memory. AoVS (array of vectorized structures) finally rearranges this into an array of small chunks of SoA storage order that are exactly 𝒲_T entries large (𝒲_T = 4 in this example).
• At the same time it stores the corresponding members from neighboring objects in memory in such a way that the member values of 𝒲_T objects can be transferred efficiently to a vector register and back. This is the basis for horizontal vectorization (cf. [59]).
• It is possible to access the Vector
Compared to AoS the maintainability still decreases, though. The logical structuring suffers in a similar way as for SoA, since multiple scalar objects are combined into a single vectorized object. Maintainability of AoVS suffers especially because it is not sufficient for most applications to use only vectorized data structures. Therefore, many users will have to duplicate their data structures as a scalar and a vectorized variant (e.g. Listing 10.1).

```cpp
struct Point {
    float x, y, z;
    float length() const { return std::sqrt(x * x + y * y + z * z); }
};
struct PointV {
    float_v x, y, z;
    float_v length() const { return Vc::sqrt(x * x + y * y + z * z); }
};
```
Listing 10.1: The typical pair of data structures that appear in horizontally vectorized applications.

The scalar structure, that is, a structure with member variables of fundamental arithmetic types, is duplicated as a vectorized structure, which provides the same members (variables and functions) but with Vc vector types replacing the scalar types. This scalar structure is necessary to support the equivalent of working with a scalar entry of a Vector<T>
1 template
10.1 TEMPLATE ARGUMENT TRANSFORMATION
Given a type C
1 Note that std::string is an alias for std::basic_string
This is a good start, but for simdize
2. Replace the bool type with a Mask
3. The number of entries in the mask types needs to match the number of entries in the vector types as well. We need to consider the use of the SimdMaskArray
4. There needs to be an easy-to-use interface for converting objects of the scalar structure type T to objects of the vectorized structure type simdize
5. The C++ standard does not require an implementation to allocate correctly aligned memory with the new operator for over-aligned types. Thus, allocation of a simdize
6. The memory alignment issue also needs to be fixed for std::allocator
7. Template parameter packs only match types. Therefore, simdize
8. Conditional assignment of one simdize
9. The resulting type should provide a constant expression that identifies the SIMD vector size in the same manner that Vc::Vector
10.2 ADAPTER TYPE

In the following I will use these types as abbreviations for the different stages of the simdize
D The scalar data structure to be transformed (e.g. std::tuple
C The class template underlying the type D (e.g. std::tuple and std::pair).
S The vectorized structure, which is D with all template arguments replaced by SIMD types (e.g. std::tuple
A The adapter class inheriting from S.
The ReplaceTypes implementation above created a type S which is equal to an instance of the class template C with Vector
1 namespace Utils { 2 template
1 template
10.2.1 vector width

Once we have an adapter class A deriving from S we can easily implement the constant expression that denotes the SIMD width (cf. item 9 above). This requires a static constexpr member function in the Adapter class template. The value of this member requires a solution for the reconciliation of the vector widths (cf. item 1 above). To determine the common vector width, a heuristic needs to be applied since there is no obvious intended width if different arithmetic types are used in the template arguments of D. A simple and effective strategy is the use of the natural vector width of the first Vector
• The new parameter N to simdize and ReplaceTypes selects whether Vec- tor
• The Adapter class uses the template parameter N to implement the size() function.
1 template
Listing 10.6: Extending ReplaceTypes to use an equal SIMD width for all replaced types.
10.2.2 recursive type substitution
The code in Listing 10.7 is used to recurse over the template arguments of D. The partial specialization of ReplaceTypes on line 76 uses the template class SubstituteOneByOne to substitute the template argument types. The result of this substitution is returned as a structure type with one static member for the determined vector width and a template alias that can instantiate a given class template with the substituted parameter pack. If the type substitution yields the same types (line 80) then ReplaceTypes defines its member type type to the original template argument that it was instantiated with (i.e. it does not use the Adapter class). Otherwise, the substituted type contains a vector type and therefore needs to create an instance of the Adapter class template which derives from it (line 82). The SubstituteOneByOne class template needs two parameter packs to iterate over the template arguments. One parameter pack stores the types that are already
35 // Typelist for multiple parameter packs in one class template 36 template
Listing 10.7: Extending ReplaceTypes to use an equal SIMD width for all replaced types.
substituted and the other pack stores the types that still need to be processed.2 The type T between the two lists signifies the type that is being processed by the current SubstituteOneByOne instance. This type is transformed via a simdize<T, N> expression (line 51) and thus recurses if the type T is a template instance. In case the N parameter to SubstituteOneByOne was zero, a successful type replacement then has to define the vector width for all remaining type replacements. Line 53 therefore checks for N == 0, in which case it tries to call V::size() as a constant expression. If this were ill-formed then a zero is returned through the SFINAE mechanism in the size_or_0 member function overloads (lines 47–50). With T processed, the substitution process recurses via defining the next SubstituteOneByOne instance (line 56) as the member type type. Line 62 finally ends the recursion of SubstituteOneByOne when the Remaining parameter pack is empty. This SubstituteOneByOne instance then defines the return type type as a struct with the information that needs to be passed back to ReplaceTypes, which captures this type on line 77.
10.2.3 broadcast constructor

With the adapter class A deriving from S, it is possible to add a constructor to the generated type, which allows the conversion from an object of type D to S. This is analogous to the broadcast constructor of Vector
10.2.4 inherited constructors

The constructors of C (which are still valid and useful for the template instantiation S) can be inherited by A, so that no functionality is lost (line 2 of Listing 10.8).3 For completeness A needs a defaulted default constructor (line 5), which will be usable only if S has a default constructor.
2 C++ cannot distinguish where one pack ends and the next starts. Therefore, the first pack is passed via the Typelist class template.
3 Constructor inheritance is not the right solution for aggregates. In this case no constructor would be inherited even though the base type can be instantiated via brace initialization. Alternatively, a generic constructor that forwards all arguments can be used. However, such a forwarding constructor requires complicated logic to determine whether to use parentheses, braces, or double braces (such as for std::array) for the call to the base constructor.
1 // inherit constructors from base class
2 using S::S;
3
4 // default constructor, ill-formed if S::S() is ill-formed
5 Adapter() = default;
6
7 // broadcast constructor
8 template <
9   typename U, typename Seq = make_index_sequence
1 template </* … */>
2 D simdize_extract(const /* … */ &a, std::size_t i);
3
4 template </* … */>
5 void simdize_insert(/* … */ &a, std::size_t i, const D &x);

Listing 10.9: The definition of the scalar insertion and extraction functions. (The template parameter lists and the type of a were lost in extraction and are elided here.)
10.3 SCALAR INSERTION AND EXTRACTION
The Adapter class implements conversion from a scalar object to a vector object. Insertion and extraction of scalar objects into/from a vector object at a given offset are still missing, though (cf. item 4 on page 117). This could be implemented via a subscript operator in the Adapter class. However, if S implements a subscript operator the two would likely conflict. In general, this is true for any member function in the Adapter class, which is why the safest solution is non-member functions for insertion and extraction. Listing 10.9 shows a simple and sufficient interface for the insertion and extraction functions. The simdize_extract function constructs a new scalar object of type D from the values at vector offset i of the vectorized object a. An implementation of this function needs to read all members of a in order to copy this data and uses the std::get template function to do so. This function must be defined for the type S, which was vectorized with the simdize
The simdize_insert function works analogously and uses std::get and std ::tuple_size to copy the member values of x to the members of a at vector offset i.
10.4 COMPLETE SPECIFICATION

In the same manner as shown above, the ReplaceTypes class template can be extended to also transform bool arguments to mask types. The complete specification of simdize &a, std::size_t i); template &a, std::size_t i, const D &x); }
template
The alias template simdize
Adapter
Adapter
expression sets the template parameter N of Adapter to simdize ::size().
Vc::Vector
Vc::SimdArray
Vc::Mask
Vc::Mask
Vc::SimdMaskArray
Vc::SimdMaskArray
T …otherwise.
10.5 CURRENT LIMITATIONS
To use the full capabilities of the simdize
• simdize
• Users have to define their data structures as template classes (or structs), where all data member types are template parameters of the class. Additionally, the user has to implement get
• The simdize
• Functions that use such a vectorizable data structure must be defined as template functions and therefore either be defined in header files (inline or unnamed namespace) or explicitly instantiated in the source file.
10.6 CONCLUSION

This chapter has presented a solution for automatic vectorization of data structures and the template functions that work with these types. The simdize
4 This is due to a limitation in variadic templates in current C++.

Part III
Vectorization of Applications
11
NEAREST NEIGHBOR SEARCHING
Search algorithms are a fundamental class of algorithms in computer science (cf. [54]). In general, a search algorithm searches through a set of values for some value that matches a search predicate. The most significant optimization for search algorithms is a reduction of the complexity. Typically, a linear search (풪(푁)) can be improved to a binary search (풪(log 푁)). This requires the set of search values to be structured in a manner that supports the search algorithm. For example, a binary tree data structure is used because it allows simple lookup with 풪(log 푁) complexity.

Nearest neighbor search algorithms are a specialization using a search predicate that evaluates a distance function 푑. The entry in the set that produces the smallest value for the distance function is the nearest neighbor and thus the result of the search. With one-dimensional search keys and a sorted set, nearest neighbor search can be as simple as a binary search using the less operator to find the two neighboring elements where the return value of the less comparison differs. Of those two elements, the element with the shorter distance is the correct search result.

If the search keys are multi-dimensional, the problem is not as simple anymore. Consideration of only a single dimension cannot yield any useful conclusion about the actual distance of the entry in the set. Therefore, there is no natural sorting order that can make searching as simple as in the one-dimensional case. Nevertheless, there are data structures (e.g. k-d tree [6], quad tree [22], …) improving nearest neighbor searches with multi-dimensional keys significantly, achieving a complexity of 풪(log 푁).

Complexity is not the only important characteristic of the efficiency of a given search algorithm. In some applications, the value for 푁 is known to be bounded. Then it is important to compare the actual run time of an 풪(푁) algorithm against an 풪(log 푁) algorithm. Generally, the simple, brute-force 풪(푁) approach is faster for small values of 푁. This is due to more efficient memory access patterns (linear instead of random) and less overhead for address calculation. For some 푁 = 푁₀ the 풪(푁) and 풪(log 푁) algorithms will be equally fast, and consequently the 풪(log 푁) algorithm will be more efficient for 푁 > 푁₀.
1 template
1 template
11.1 VECTORIZING SEARCHING

To show the applicability of vectorization to searching in general, I reimplemented the std::find algorithm (cf. [48, §25.2.5] and Listing 11.1) with Vc. This implementation is presented in Listing 11.2. The find algorithm iterates over a given iterator range (first to last) and compares each value against the search value. If the value matches, the iterator is returned. If no match is found, the last iterator is returned. This algorithm is a classical linear search with 풪(푁) complexity.

11.1.1 loop-vectorization fails

The search loop is not loop-vectorizable (i. e. not auto-vectorizable) because loop-vectorization requires a countable loop (cf. Section 2.1). With the break statement in line 5 the iteration count is input-data dependent. In order to vectorize the search loop, the algorithm has to compare 풲_T values per iteration; and thus potentially execute 풲_T − 1 comparisons more than necessary. To a developer it is obvious that this is efficient and correct, but to the compiler this is an incorrect translation of the source code and it would have to prove that it is allowed to do it under the “as-if” rule of C++ [48, §1.9 p1].
11.1.2 explicit vectorization
The vectorization of the std::find algorithm with Vc (Listing 11.2) is a straightforward transformation from iterating over one entry at a time to processing 풲_T entries per iteration. For this example I have extended simdize
11.1.3 benchmark
The following benchmark compares the efficiencies of the Vc::find and std::find implementations. The benchmark was compiled with GCC version 4.8.2 using its libstdc++ implementation of std::find (which is an optimized implementation compared to Listing 11.1). The optimization-relevant compiler flags used were: -O3 -ffp-contract=fast -march=core-avx-i -DNDEBUG. The benchmark code (cf. Appendix B) executes 10,000 calls to find. The search values are picked from random locations in the haystack using a uniform distribution. The 10,000 searches are repeated at least 100 times (more often for smaller haystacks) and the execution times of each repetition are recorded. Afterwards these values are used to calculate the mean and standard deviation1 of the find
1 This assumes that execution time follows a normal distribution, which is not exactly true. However, it works well enough for approximating the error.
Figure 11.1: Benchmark results of a reimplementation of std::find with Vc (Listing 11.2). The benchmark (cf. Appendix B) uses differently sized haystacks containing random float values (x axis). The average number of CPU cycles required to execute std::find or a vectorized find are plotted in the first diagram. The second plot shows the quotient (speedup factor) of the two graphs on the left. The benchmark was executed on an Intel i5 3360M CPU with 2.80 GHz (“Turbo Mode” and power management were disabled—cf. Appendix D—) using the AVX implementation of Vc.
function’s run time. If the relative error is larger than 5% the measurement is repeated, because the system was likely active with different tasks that competed for execution resources.
11.1.4 results

Figure 11.1 shows the result of the benchmark. It is clearly visible that vectorization improves the efficiency of the find implementation by at least a factor of 3 and up to slightly more than 풲_float = 8. The largest speedup is at 2¹³ ⋅ 4 Bytes = 32 KiB. The CPU that executed the benchmark has 32 KiB of L1 cache, 256 KiB = 2¹⁶ ⋅ 4 Bytes of L2 cache, and 3 MiB = 2¹⁹ ⋅ 1.5 ⋅ 4 Bytes of L3 cache. The three cache sizes can be found in the “Speedup” plot of Figure 11.1, limiting the vectorization improvement for large data sets.
2 The values are of type float and thus 4 Bytes large.
1 template
11.1.5 vectorization direction
There is an alternative vectorization for the find algorithm that executes the search loop for multiple lookups in parallel (cf. Listing 11.3). In this case the value parameter is a vector type and the iteration over the haystack is done with scalar iterators. This approach also speeds up the find algorithm compared to the std::find implementation, but not as efficiently as the vectorization shown above. This is because the implementation has to keep track of the vector lanes that already had a hit, in order to return the iterator to the first match. In addition, the vector usage is suboptimal since one lane after another becomes unused as the search values are found. This is different for the implementation above, where only the last vector compare in the iteration potentially executes unused compare operations.
11.1.6 the merit of vectorization
Vectorization of search algorithms cannot improve the complexity of the algorithms. However, it can make the actual implementation perform more efficiently. SIMD instructions improve load throughput, reduce branching, improve compare-operation throughput, and can potentially reduce the number of pointer chases necessary to arrive at the answer. The effect is most visible for algorithms which
are not slowed down by memory latency (by avoiding pointer chasing3 or hiding the latency efficiently).
11.2 VECTORIZING NEAREST NEIGHBOR SEARCH

Having shown that vectorization of search algorithms can generally improve the efficiency of the implementation, let us investigate nearest neighbor search algorithms. As for the vectorization above, we have two principal choices of vectorization:
1. Execute the search for several search keys in parallel.
2. For one search key compare several entries of the haystack in parallel.
It may additionally be possible to vectorize the compare operation (vertical vectorization [59]) since the evaluation of the distance function consists of multiple scalar operations. This document will not cover vertical vectorization of this problem any further, because it does not scale and competes with ILP4 (cf. Section 1.6.1) and therefore is not a useful transformation.
11.2.1 vectorization approach

The two remaining choices were shown to have a clear winner above (cf. Section 11.1.5). There is no apparent reason why approach 1 would work better for an exhaustive nearest neighbor search. For a search in a search tree, the situation appears to be fundamentally different. These data structures are built in such a way that each scalar entry on a tree node needs to be fully evaluated to determine the next node. Thus, vectorization seems to require approach 1. For the k-d tree algorithm, approach 1 has been researched for raytracing on GPUs before (cf. [37, 27]).

Searching is very sensitive to load performance since the whole task of the search algorithm is to walk through a data structure to identify the match. Thus, a good algorithm does as much as possible per load, per cache line, and per pointer chase. Likewise, a good implementation of a given algorithm tries to optimize the same issues. Vectorization approach 1 for a non-exhaustive search can improve memory load efficiency only in special cases (i.e. if the matches are closely localized). However, in most cases the matches to the search keys will be stored in unrelated locations of the data structure. Then the loads cannot be coalesced, resulting in a load performance similar to the scalar implementation.
3 The term “pointer chasing” is used to describe dereferencing a pointer to load a pointer which, in turn, is dereferenced to load the next pointer, and so on … This is a typical implementation of container iteration for linked data structures such as linked lists and trees.
4 Instruction Level Parallelism
[Figure 11.2 body: for the search point 푞 = (3, 5), the squared distances (푆푥 − 푞푥)² + (푆푦 − 푞푦)² are computed for all entries of 푆 in parallel: (1 − 3)² + (6 − 5)² = 5, (7 − 3)² + (3 − 5)² = 20, (4 − 3)² + (5 − 5)² = 1, (5 − 3)² + (8 − 5)² = 13. The horizontal minimum of the distance vector is 1, and distance.indexOfFirst() = 2 ⇒ 푆₂ = (4, 5) is the nearest neighbor.]
Figure 11.2: SIMD algorithm for nearest neighbor search for 푞 = (3, 5) in a set 푆 = {(1, 6), (7, 3), (4, 5), (5, 8)} for a system with 풲_T = 4.

Consequently, the rest of the chapter focuses on approach 2. The search algorithm therefore must be able to execute comparisons for multiple values in the haystack for a single input in parallel. Section 11.3.5.1 quantifies the differences of approaches 1 and 2.
11.2.2 linear nearest neighbor search
Consider the simple case of four two-dimensional values in the haystack and a system with 풲_float = 4. Then the algorithm simply calculates the distance for all four values in parallel. It then determines the vector entry with the minimum value, which identifies the nearest neighbor (Figure 11.2). This idea can be extended for implementing a complete vectorized linear nearest neighbor search algorithm as shown in Listing 11.4. To showcase how simple the Vc SIMD implementation is, Listing 11.5 shows a scalar implementation of the linear nearest neighbor search algorithm.

The algorithm expects at least one entry in the vector argument to simplify the return semantics. This precondition is verified first by the function. If the condition is not met it throws an exception. The first element of the input data initializes the nearest neighbor candidate variable. Subsequently, all remaining elements in the input data are compared against the candidate. If a point with a smaller distance is found, the candidate is updated.
1 template
1 template
1 template
The SIMD implementation requires one additional idea: The algorithm determines 풲_T candidates from disjunct and interleaved subsets of the input data. At the end of the loop the resulting best candidates are compared and reduced to a single one (in the same manner as in Figure 11.2), which is the return value. Thus, horizontal reduction over SIMD vectors happens only at the very end of the algorithm. Note that the data type for the input data is different in the scalar and SIMD implementations. The vectorized function expects a vector of vector objects of type simdize
11.2.3 programmability
Consider how few modifications are necessary to arrive at the vectorized findNearest implementation starting from Listing 11.5. Most importantly, a developer needs to understand masks and write-masking and how simdize
11.2.4 benchmark
The two findNearest functions can be benchmarked analogously to the benchmark for std::find (cf. Section 11.1.3). Appendix C presents the full benchmark code. The benchmark was compiled with GCC version 4.8.2. The optimization-relevant compiler flags used were: -O3 -ffp-contract=fast -march=core-avx-i -DNDEBUG. The benchmark measures the run times of insertion and searching for the scalar and vectorized linear nearest neighbor search data structures (LinearNeighborSearch and LinearNeighborSearchV). These two classes implement the member function findNearest as shown in Listing 11.4 and Listing 11.5. The benchmark iterates over different haystack sizes and executes 10,000 nearest neighbor lookups for each. The statistics are determined in the same way as described in Section 11.1.3.
Figure 11.3: Benchmark results of linear nearest neighbor searching with the implemen- tations Listing 11.4 and Listing 11.5.
The point objects used in the benchmark are three-dimensional points using single-precision floating point variables (12 Bytes per point object) and a distance function as shown in Listing 11.6. Most importantly, this distance function omits the square root since searching for the nearest neighbor does not depend on the actual magnitude but only on a correct total order.
11.2.5 results
The vectorized search is consistently more efficient than the scalar variant, as Figure 11.3 shows. The first plot shows the average number of CPU cycles a single function call required to execute the scalar or vectorized findNearest functions. The values on the x-axis need to be multiplied by 12 Bytes to determine the size of the data set in Bytes. Thus, the smallest search iterated over 96 Bytes and the largest search over 12 MiB. A run time of 10⁷ cycles multiplied by 10,000 searches and 100 repetitions (for the statistics) at 2.8 GHz equates to 10⁷⁺⁴⁺² / (2.8 ⋅ 10⁹ Hz) ≈ 1 h. The second plot in Figure 11.3 shows the quotient of the run time of the scalar function divided by the run time of the vectorized function (speedup factor). It clearly shows that vectorization improves the efficiency of the exhaustive search considerably.

The benchmark was executed on an Intel i5 3360M CPU with 2.80 GHz (“Turbo Mode” and power management were disabled—cf. Appendix D) using the AVX implementation of Vc. At 2¹⁸ ⋅ 12 Bytes = 3 MiB haystack size, the L3 cache is exhausted and memory performance decreases the achievable speedup. The speedup of the vectorized algorithm is still a factor of four for the largest benchmarked data set, though, showing the usefulness of vectorization for exhaustive nearest neighbor search implementations.
Figure 11.4: k-d tree decomposition for the point set (17, 15), (30, 32), (40, 38), (13, 41), (6, 78), (11, 77), (52, 88), (60, 27), (63, 54), (77, 2), (91, 15), (90, 58), (82, 59), (84, 96), (98, 94).
Inserting new points into the search data structure is slightly slower for the vectorized variant, though. This is obvious if you consider that inserts of scalar structures need to be adapted for a std::vector
11.3 K-D TREE
Finally, this section will show how the vectorization approach from the previous section can be applied to the k-d tree algorithm. Bentley [6] introduced the k-d tree data structure in 1975. The data structure is a binary search tree enabling nearest neighbor search with 풪(log 푁) complexity [28].

Each node of the k-d tree stores a single point. The depth of the node in the tree determines which dimension of the multi-dimensional point is used as discriminator for traversal. Consider the example shown in Figure 11.4. There are 15 points stored in the k-d tree. After sorting for the first dimension, the (60, 27) point is at the center and therefore chosen for the root node of the tree. All points with 푥 < 60 are inserted on the left child node and points with 푥 > 60 are inserted on the right
child node.5 From this construction, the tree contains seven points in the left and right subtrees. The next node is discriminated via the second dimension. Therefore, the left child node stores the (13, 41) point and the right child node stores the (90, 58) point. Since the k-d tree structure in this example only uses two dimensions, the next level of the tree discriminates on the first dimension again.

This data structure leads to inserts with 풪(log 푁) complexity (without balancing). Most importantly, nearest neighbor lookup is now possible with 풪(log 푁) complexity.

11.3.1 k-d tree nearest neighbor lookup

Given a search point 푝 = (푝푥, 푝푦), nearest neighbor lookup traverses the tree down to a leaf node using the same rules as described above for insertion. This leaf node then is used as a first candidate for the solution. The algorithm then walks the tree back up to the root node. For every node it checks whether the hypersphere around the search point, with radius equal to the distance to the candidate point, intersects the splitting plane. If it does intersect, the algorithm must check on the other side of the splitting plane for a closer match.

The k-d tree nearest neighbor search algorithm is inherently not vectorizable with this original definition of the tree structure. One might consider speculative calculation of the distance for multiple nodes down the tree to create a vector of points from the data set. However, this makes the implementation complex and unlikely to increase efficiency.
11.3.2 k-d tree vectorization

In order to vectorize efficiently, the k-d tree data structure must be modified. The idea is to create nodes that store 풲_T points instead of a single one. One node hence stores one simdize
1 template
11.3.2.1 the insert algorithm

퐼푁푆퐸푅푇(푥, 푄 = 푅푂푂푇, 퐷 = 0):
1:  if 푄 = Λ then
2:    푄 ← new Node
3:    for 푖 = 0 to 풲_T − 1 do
4:      푄.푑푎푡푎_푖 ← 푥
5:    end for
6:    푄.ℎ푖 ← Λ
7:    푄.푙표 ← Λ
8:    푄.푒푛푡푟푖푒푠 ← 1
9:    return
10: end if
11: if 푄.푒푛푡푟푖푒푠 < 풲_T then
12:   푄.푑푎푡푎_(푄.푒푛푡푟푖푒푠) ← 푥
13:   푄.푒푛푡푟푖푒푠 ← 푄.푒푛푡푟푖푒푠 + 1
14:   return
15: end if
16: if 퐾퐷(푄.푑푎푡푎_푖) < 퐾퐷(푥) ∀푖 then
17:   푄 ← 푄.ℎ푖
18: else if not (퐾퐷(푄.푑푎푡푎_푖) < 퐾퐷(푥)) ∀푖 then
19:   푄 ← 푄.푙표
20: else
21:   if 푐표푢푛푡(퐾퐷(푄.푑푎푡푎_푖) < 퐾퐷(푥)) >= 풲_T/2 then {majority in 푄 smaller than 푥}
22:     for all 푑 in 퐾퐷(푄.푑푎푡푎_푖) do
23:       if 푑 is maximal then
24:         푅 ← 푖
25:       end if
26:     end for
27:     푄.푑푎푡푎_푅 ↔ 푥
28:     푄 ← 푄.ℎ푖
29:   else
30:     for all 푑 in 퐾퐷(푄.푑푎푡푎_푖) do
31:       if 푑 is minimal then
32:         푅 ← 푖
33:       end if
34:     end for
35:     푄.푑푎푡푎_푅 ↔ 푥
36:     푄 ← 푄.푙표
37:   end if
38: end if
39: 퐷 ← (퐷 + 1) mod 퐷푖푚푒푛푠푖표푛푠
40: 퐼푁푆퐸푅푇(푥, 푄, 퐷)

On INSERT, the algorithm first checks whether the node 푄 is fully filled (cf. lines 1–15). If not, 푥 is inserted into the node and the algorithm terminates. Otherwise, if INSERT is called with a full node (푄.푒푛푡푟푖푒푠 = 풲_T) the algorithm must determine where 푥 needs to be stored. There are three choices: to the 푙표 child node, to the ℎ푖 child node, or to this node. The first two choices are analogous to the original k-d tree insert algorithm, only that the condition requires all discriminators of 푄.푑푎푡푎 to agree on ℎ푖 or 푙표. If they do not agree then min_푖(퐾퐷(푄.푑푎푡푎_푖)) < 퐾퐷(푥) ≤ max_푖(퐾퐷(푄.푑푎푡푎_푖)).6 One of the entries in 푄 then needs to be swapped with 푥 before the recursion continues to a child node. The algorithm determines whether the majority of discriminators agree on ℎ푖 (line 21). If yes, the entry in 푄.푑푎푡푎 with maximum discriminator is swapped with 푥 and INSERT recurses to 푄.ℎ푖. Otherwise, the entry in 푄.푑푎푡푎 with minimum discriminator is swapped with 푥 and INSERT recurses to 푄.푙표. Alternatively, the mean of the discriminator values in the node 푄 can be used to decide on the direction.

Note that on node creation not only the first entry in the 푑푎푡푎 vector is initialized; rather the object 푥 is broadcast to all entries of the 푑푎푡푎 vector (cf. line 3). This ensures that all entries contain valid values for subsequent vector compares. For
6 Actually, the algorithm is defined such that only the less-than operator is used. This leads to the 푥 ≤ max relation.
INSERT this is not important because compares are only executed on fully filled nodes. But the FINDNEAREST algorithm depends on it.7 A possible optimization modifies the INSERT algorithm such that records in the nodes are sorted according to their discriminator value. This enables quick retrieval of the leftmost and rightmost discriminator values without the use of vector reduction operations in FINDNEAREST.

After insertion of the same 15 points as for Figure 11.4 the vectorized tree structure looks as in Figure 11.5. The root node (in red) contains four points and thus partitions the complete area into three subareas. A C++ implementation of the INSERT algorithm is presented in Appendix E.
11.3.2.2 the findnearest algorithm

The vectorized FINDNEAREST algorithm uses the same idea as originally defined for the optimized k-d tree [28].8 The algorithm must descend the tree according to the discriminator 퐾퐷(푥) on each node until it reaches a leaf node. It determines the best candidate of a leaf node using the algorithm from Figure 11.2. For leaf nodes, it is important that none of the entries yield incorrect answers and therefore all must use valid records. This was ensured by the broadcast of the INSERT algorithm on node construction. Effectively, a leaf node that is not fully populated will calculate the same distance multiple times for the exact same data. Only the first minimum distance in the vector will be used by the algorithm; therefore this works fine without extra masking support.

The choice for descending to the ℎ푖 or 푙표 child is not as obvious as with the scalar k-d tree, though, because there are multiple discriminator values 퐾퐷(푑푎푡푎_푖) stored in each node. This is visible in Figure 11.5. The root node (in red) stores four records: 푑푎푡푎_0…3 = (40, 38), (52, 88), (60, 27), (63, 54). Since the root node discriminates on the x-coordinate, there is one record left of the search point (50, 5) and three records on the right. In this case, the search point’s discriminator value lies between the minimum and maximum of the node’s discriminator values (min(퐾퐷(푑푎푡푎)) < 퐾퐷(푥) < max(퐾퐷(푑푎푡푎))) and it appears like an arbitrary choice whether to first descend the 푙표 node or the ℎ푖 node. The best choice for optimization in the tree ascent stage is to use the center between the minimum and maximum discriminator values of the current node ((min(퐾퐷(푑푎푡푎)) + max(퐾퐷(푑푎푡푎))) / 2). Therefore, the algorithm aliases the ℎ푖 and 푙표 children as 푛푒푎푟 and 푓푎푟. The 푛푒푎푟 child is set to ℎ푖 if the discriminator of the search point is larger than the center discriminator of the current node; otherwise it is set to 푙표. The 푓푎푟 child references the opposite node.
7 In fact, FINDNEAREST can be defined in such a way that the broadcast is unnecessary. This would imply extra overhead for the algorithm, though.
8 The optimized k-d tree stores records only in leaf nodes; non-leaf nodes are only used for partitioning. The leaf nodes are defined as buckets and thus store multiple records and require a linear nearest neighbor search such as discussed in Section 11.2.2.
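The near/far aliasing described above can be sketched in scalar C++ (a minimal sketch: Node4, chooseNearFar, and the kd array are hypothetical names, and the real implementation operates on Vc vector types instead of std::array):

```cpp
#include <algorithm>
#include <array>

// Hypothetical node with four records; kd holds the discriminator values
// (the coordinate of each stored record in this node's split dimension).
struct Node4 {
    std::array<float, 4> kd;
    Node4 *lo = nullptr, *hi = nullptr;
};

// Choose the near and far children based on the center between the
// minimum and maximum discriminator values, as described in the text.
inline void chooseNearFar(const Node4 &n, float searchKd,
                          Node4 *&nearChild, Node4 *&farChild) {
    const auto [minIt, maxIt] = std::minmax_element(n.kd.begin(), n.kd.end());
    const float center = (*minIt + *maxIt) * 0.5f;
    if (searchKd > center) { nearChild = n.hi; farChild = n.lo; }
    else                   { nearChild = n.lo; farChild = n.hi; }
}
```

With the example from Figure 11.5 (root discriminators 40, 52, 60, 63 and search point x = 50), the center is 51.5, so the lo child becomes the near child.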
Figure 11.5: Vectorized k-d tree decomposition with 풲_T = 4 for the same data set as used in Figure 11.4.

Geometrically, this is an intuitive choice and denotation. In Figure 11.5 the center is at x = (40 + 63) / 2 = 51.5, marked with the red dashed line. The search point (50, 5) is left of that line and therefore the lo child is chosen as the near child. Since the algorithm only considers the dimension used for the discriminator of the current node, it does not matter that the arrow pointing to near is actually longer than the arrow pointing to far.

Before recursing to near, the vectorized FINDNEAREST algorithm has an optimization opportunity not available to the scalar algorithm. If the search point is located between the discriminator values of the current node (such as for the root node in Figure 11.5), the current best distance might be less than Δmin. In this case no record stored in or below near can yield a better distance and the recursion can be ended before reaching a leaf node. Since this can only be the case if a record in the current node yields the best distance, the current node's records must be considered before recursing. In Figure 11.5, the distance to the (60, 27) record in the root node thus yields the candidate with distance = √584. Recursion to the near child invokes the FINDNEAREST algorithm on that node. The algorithm can make use of the current best distance to skip further recursion. Thus, it will not consider the hi child of near because it is designated as the far child (the search point is below the dashed line) and the current best distance from the
root node is less than Δmax(1) = 77 − 5 = 72 > distance = √584 ≈ 24.2. It must consider the records in the near(1) node itself, though, because Δmin(1) = 15 − 5 = 10 < distance. The candidate from the near child in the example will be (30, 32) with distance = √1129, which is larger than the current best distance and therefore does not update the candidate.

If the near child is searched first because the search point's discriminator value is less than or larger than all discriminator values of the current node, the algorithm subsequently checks whether records in the current node might have a shorter distance. However, if the current best distance is already less than Δmin, neither the records in the node nor the far child can yield a closer result and the algorithm can return. Geometrically, this is a test whether the hypersphere with radius distance around the search point reaches into the area of the current node. Otherwise, the candidate is updated if the current node contains a record with a shorter distance.

After the near child and the current node have been considered, the algorithm might still need to descend the far child. In Figure 11.5, the circle around the search point with radius √584 reaches outside of the area of the root node. Thus, there could be a record closer to the search point in or below the far child. The algorithm determines this via comparing Δmax (the maximum distance in the dimension of the discriminator to the records of the node) against distance.9 If the far child needs to be considered, the algorithm recurses to that node to determine a candidate. To work efficiently, the recursion must always carry the current candidate for nearest neighbor to its child nodes. That way the recursion inside the children can use the current best distance value to avoid descending the tree unnecessarily. If the far child finds a better candidate it replaces the current best and returns it.
At this point the far node is done and can return to its caller.10 In Figure 11.5, the far child determines (77, 2) with distance = √738 as its candidate, but discards it after comparing against the current best. The root node finally returns (60, 27) as the nearest neighbor of (50, 5).
9 At this point it is important that the distance of the far child in the dimension of the discriminator really corresponds to the distance Δmax. This would not be the case if the decision for near and far was made via a different method than the center between the minimum and maximum discriminator values of the node.
10 The call to the far child can thus use tail-recursion, which can be optimized as a jump instead of a full function call & return.
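The two pruning tests used during the tree ascent can be summarized in a small sketch (hypothetical helper names; deltaMin and deltaMax denote the 1-D distances from the search point to the node's nearest and farthest discriminator values, and best is the current best Euclidean candidate distance):

```cpp
#include <cmath>

// The near subtree (and the node's own records) cannot improve the
// candidate if the current best distance is already below deltaMin.
inline bool canSkipNode(float deltaMin, float best) {
    return best < deltaMin;
}

// The far child only needs to be searched if the hypersphere of radius
// `best` around the search point crosses the splitting boundary, i.e.
// if the best distance exceeds deltaMax.
inline bool mustSearchFar(float deltaMax, float best) {
    return best > deltaMax;
}
```

For the example in the text, best = √584 ≈ 24.2 exceeds the root's Δmax of 13 (63 − 50), so the far child of the root must be searched, while Δmin(1) = 10 is below best, so the near child's records must still be considered.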
11.3.3 vc implementation
The INSERT and FINDNEAREST algorithms can easily be implemented in C++ using the Vc library and especially the simdize<T> interface. The implementation relies on three helper functions:
• get_kdtree_distance determines the distance between two points (or vectors thereof).
• get_kdtree_1dim_distance determines the distance between two points (or vectors thereof) in only one coordinate.
• get_kdtree_value returns the value of one coordinate to use as discriminator value (KD).

11.3.3.1 optimizations

The FINDNEAREST implementations were optimized for the target machine as much as possible. In a perfectly balanced tree with 2^n − 1 nodes, there are 2^(n−1) leaf nodes and 2^(n−1) − 1 nodes with two children. This is why all implementations first test for the leaf node case. In the vectorized implementation, the case of a search point between the minimum and maximum discriminator values of a node introduces one more case to special-case in the implementation.

In the vector implementations, the candidate object needs to be updated if a closer record is found in the current node. Since the candidate object is built as a vector (cf. Section 11.3.3.3), the update is executed as a masked assignment. The mask fully suffices to get correct results and thus no branching (if statement) is required. For larger types (point objects with more than 3 dimensions), the number of instructions needed to execute the masked assignment becomes expensive enough that an additional branch to skip the complete masked assignment is beneficial.

In all implementations, the tree is implemented in such a way that the dimension used for the discriminator is attached to the node type. This increases the code size but on the other hand allows the compiler to optimize memory accesses and register usage more aggressively.
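The masked candidate update can be illustrated with a scalar emulation (a sketch; in Vc the loop body corresponds to a single masked assignment such as dist(mask) = newDist, and all names here are hypothetical):

```cpp
#include <array>

constexpr int W = 4;  // assumed vector width

// Per-lane candidate state: one best distance and record index per lane.
struct Candidate {
    std::array<float, W> dist;
    std::array<int, W>   index;
};

// Emulated masked assignment: only the lanes whose new distance improves
// on the current best are overwritten; no branching on the whole vector
// is required for correctness.
inline void maskedUpdate(Candidate &c, const std::array<float, W> &newDist,
                         const std::array<int, W> &newIndex) {
    for (int i = 0; i < W; ++i) {
        const bool mask = newDist[i] < c.dist[i];
        if (mask) {
            c.dist[i] = newDist[i];
            c.index[i] = newIndex[i];
        }
    }
}
```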
11.3.3.2 search key vectorization

As discussed in Section 11.2.1, there is a second approach for vectorization of the nearest neighbor lookup algorithm. To quantify the differences, I implemented a third FINDNEAREST function which searches for the nearest neighbors for multiple search keys in parallel. It reuses the scalar k-d tree data structure, and thus its INSERT implementation. The implementation of the findNearest function is presented in Listing E.5 (Appendix E).
11.3.3.3 candidate object

All three findNearest functions in the node classes use a single candidate object that is passed on from node to node to store the current best result and distance. In the case of the vectorized nodes implementation, this candidate object stores a vector of objects and a vector of distances. This avoids horizontal reductions while traversing the tree. Only after the root node returns, a reduction is executed to determine the nearest neighbor.
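The final reduction after the root node returns can be sketched as follows (hypothetical names; the Vc implementation would use a horizontal minimum on the distance vector instead of std::min_element):

```cpp
#include <algorithm>
#include <array>

constexpr int W = 4;  // assumed vector width

// Only once the traversal is complete is the lane with the smallest
// distance selected; no horizontal reduction happens inside the tree.
inline int reduceToNearest(const std::array<float, W> &dist,
                           const std::array<int, W> &index) {
    const auto it = std::min_element(dist.begin(), dist.end());
    return index[it - dist.begin()];
}
```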
11.3.4 benchmark

The benchmark program introduced in Section 11.2.4 compares the vectorized FINDNEAREST implementation against the equivalent scalar k-d tree implementation. The difference from Section 11.2.4 is that the program now examines the run times of inserts and nearest neighbor searches of the scalar and vectorized k-d tree data structures (KdTree and KdTreeV).
11.3.5 benchmark results & discussion

insert efficiency  The run times of the scalar and vectorized INSERT algorithms (cf. Listing E.1 and Listing E.2 in Appendix E) are plotted in the first diagram of Figure 11.6. The second diagram shows the quotient of the two graphs from the first diagram. It shows that the vectorized implementation is not significantly slower than the scalar implementation, even though it requires a lot more code. To the contrary, the vectorized implementation is actually more efficient for larger trees.

The ability of the vectorized INSERT algorithm to move objects that were already stored in a node to a child node leads to a more balanced tree than for the scalar k-d tree algorithm. With random input, the scalar k-d tree INSERT algorithm creates trees with a depth roughly 100%–150% larger than the optimal depth. The vectorized algorithm (with 풲_T = 8) creates trees with a depth roughly 0%–40% larger than the optimal depth. In both cases the tree is less balanced the more nodes are inserted.
Figure 11.6: Benchmark results of k-d tree insertion. The left plot shows run times in CPU cycles. The right plot shows the quotient and thus the speedup of the vectorized implementation over the scalar implementation.
Figure 11.7: Benchmark results of k-d tree nearest neighbor searching. The left plot shows run times in CPU cycles. The right plot shows the quotient and thus the speedup of the vectorized implementation over the scalar implementation.
findnearest efficiency  The run times of the scalar and vectorized nearest neighbor search algorithms (cf. Appendix E) are plotted in the first diagram of Figure 11.7. The second diagram shows the quotient of the two graphs from the first diagram. It shows that the vectorized k-d tree nearest neighbor search is approximately a factor 3–4 faster. Since the tree is not rebalanced after random insertions, Figure 11.7 shows the combined efficiency of the INSERT and FINDNEAREST algorithms/implementations. If the tree is perfectly balanced when the FINDNEAREST algorithm is executed, the scalar implementation executes roughly a factor four faster. This effect is less pronounced for the vectorized k-d tree implementation, which is why the relative speedup between scalar and vectorized k-d tree implementations decreases. Figure 11.8 shows the nearest neighbor search efficiency on balanced trees storing three-dimensional, six-dimensional, and nine-dimensional points.
higher dimensions  The k-d tree algorithm and its implementation (cf. Appendix E) allow an arbitrary number of dimensions. The effect of dimensionality on the efficiency of the FINDNEAREST algorithm was therefore tested with the benchmark and is shown in Figure 11.8. The scalar implementation exhibits approximately an order of magnitude difference in run time between three-dimensional, six-dimensional, and nine-dimensional k-d tree nearest neighbor search. This scales slightly better for the vectorized implementation, which is clearly visible in the larger speedups for higher dimensions. This can be understood if we consider that the mean distance between the points in the random set increases with more dimensions. Consequently, the hypersphere around a candidate point on average has a larger radius and thus increases the likelihood for descending the far child. This is, of course, true for both the scalar and vector implementations. However, the additional tree traversal is more efficient through vectorization, therefore yielding the greater speedup.
memory latency  The major cost of k-d tree nearest neighbor searching is the memory latency incurred by pointer chasing on tree traversal. The cost of a single node traversal is equivalent for the scalar and vector implementations. The important difference between the two implementations is the tree size. With 풲_T = 8 the vectorized tree has almost a factor of eight fewer nodes11 and consequently the tree depth is three less. If we assume that the time spent at a single node in the tree is approximately equal for the scalar and vector implementations, then the vector implementation is only faster because it has to look at fewer nodes. The reduction of the tree depth is the major improvement of the vectorized implementation. A reduction of the tree depth is also possible with the scalar k-d tree (e.g. by storing multiple records in nodes and using an exhaustive nearest neighbor search for every node). This approach increases the time spent at a single node since it is basically a serialized implementation of the vectorized FINDNEAREST algorithm. Overall, the achievable speedup from comparison vectorization is diminished by the latency incurred from tree traversal. If tree traversal is considered as the serial part of the algorithm then Amdahl's law [3] describes the scalability issue of the parallelization of a search tree algorithm.

11 Since the leaf nodes can be partially filled, the difference is not an exact factor of eight.

Figure 11.8: Benchmark of k-d tree lookup in three, six, and nine dimensions. The plots show the efficiency of lookups on balanced trees. (cf. Appendix E and Appendix C)

influence of cpu caches  The graphs in the run time plots (left plot) of Figure 11.6, Figure 11.7, and Figure 11.8 follow 풪(log N) complexity. At certain points (marked with "exceeds L3") the graphs bend upwards and continue with a different slope, though still with 풪(log N) complexity. The points where the graphs bend upwards correspond to a haystack size where the complete k-d tree data structure occupies more memory than the level 3 cache can hold (3 MiB for the CPU that executed the benchmark). These points are different for the scalar and vectorized trees because the overhead from the tree structure is smaller for the vectorized tree. Every memory allocation (one per tree node) requires alignment to 16 Bytes and extra memory to store the length of the block [63]. Therefore, each node in the scalar k-d tree requires 48 Bytes of memory to store one record.12 Each node in the vectorized k-d tree (in this case with AVX and Point<3, float>) stores up to eight records.13
11.3.5.1 vectorization direction

Figure 11.9 plots the run time normalized to the number of search keys for calls to findNearest of the three implementations in Appendix E. It is clearly visible that vectorization of the search key leads to the most efficient k-d tree implementation for small trees. From 40 or more entries in the tree, the implementation using node vectorization wins over search key vectorization. From ca. 120 or more entries, even the scalar implementation outperforms search key vectorization.

The graph for vectorization on the search key shows linear scaling in N (풪(N) complexity). This implies that on average the majority of the tree nodes needs to be searched. Consider that the search algorithm has to find eight (the benchmark was compiled for AVX with 풲_float = 8) random records in the tree. Thus, it is obvious that for most of the time the largest distance of the candidate set will lead to traversal of the far child and consequently lead to traversal of almost the complete tree structure.

If search key vectorization leads to a mostly exhaustive search, then comparing against the exhaustive search implementation from Section 11.2.2 is inevitable. The
12 16 Bytes for the two child-pointers. 12 Bytes for the Point object; padded to 16 Bytes. 16 Bytes for malloc bookkeeping.
13 16 Bytes for the two child-pointers; padded to 32 Bytes. 96 Bytes for the simdized point data.
Figure 11.9: Run times of k-d tree vectorization over the search key.
results from the exhaustive nearest neighbor search are plotted as a fourth graph in Figure 11.9. The efficiency is consistently better for the simple linear implementation. Note that the linear search uses scalar search keys and return values, though. This makes it perform worse at very small haystacks (especially below 풲_T) since the conversion between T and simdize<T> adds overhead.
11.4 CONCLUSION
There exist different variants of the k-d tree data structure in the literature and in applications. It will be interesting to apply the vectorization approach to these variants as well. Preliminary tests have shown additional potential in vectorization of search trees with additional reduction of the depth (e.g. with 풲_T + 1 instead of 2 child nodes).

The work of this chapter has shown that vectorization of search data structures is possible with vector types, allowing explicit expression of data-parallel storage and execution. This is a major improvement over loop-vectorization, which only solves the programming language shortcoming of expressing data-parallel execution. Especially useful in the development of the vectorized search algorithms and data structures was that vector types guide the developer to the solution by suggesting storage via vector types, which in turn necessitates improvements to the INSERT and FINDNEAREST algorithms.
12

VC IN HIGH ENERGY PHYSICS
The track reconstruction software for the HLT1 [2, 13, 19] at ALICE2 [12, 14], an experiment at the LHC3 [9] in Geneva, was the original application for Vc [55]. Rohr [72] describes how, since then, the track reconstruction software has been optimized for GPU usage and relies on OpenCL [65] and CUDA [15] for parallelization. For CPU development, Vc has become an important building block in High-Energy Physics software and is thus in production use on thousands of different machines with different operating systems and C++ compilers. The following sections will describe some of these projects. The first example on ALICE analyzes the use of Vc and the vectorization approach in depth.
12.1 ALICE
The ALICE detector records information about particles that pass through it. It uses many different subdetectors to record different information and for different particle types. The central detectors for tracking the trajectories of charged particles are the TPC4 and ITS5 detectors. These detectors record space points where a particle passed through and interacted with the material/gas at this point. The detectors cannot relate the space points to individual particles. However, the reconstruction algorithms can recover this relation. Particle identification and energy measurement is possible with different subdetectors on the outer layers.

The task of the ALICE software is the reconstruction of the physical events in the detector from the measurements of the detectors. The track reconstruction software analyzes space points to reconstruct the trajectories of charged particles. In principle, this is just a complex "connect the dots" task. Track reconstruction yields information about the momentum, primary and secondary vertices (origins of several trajectories), and the expected trajectory through the outer detectors.
1 High-Level Trigger
2 A Large Ion Collider Experiment
3 Large Hadron Collider
4 Time Projection Chamber
5 Inner Tracking System
Finally, the complete physics information yields a description of the events in the detector. The main interest in ALICE is the research of matter at high energies and densities. Since the detector cannot measure the events in the vacuum at the center of the detector, the reconstructed track information is used to infer the physics at the collision points.
12.1.1 approximating maps
Some of the reconstruction algorithms require the transformation of the measured quantities to error-corrected physical quantities. The issue is that not all measurements can be mapped directly with an analytical function to known reference values. These corrections are measured via detector calibration.

One example of such an adjustment is the transformation of pad/time coordinates (TPC coordinates) to three-dimensional space points (x, y, z). In principle, the pads at the end-caps of the TPC determine the x and y coordinates and the drift times in the TPC infer the z coordinate. However, several corrections need to be applied, correcting for small variations in alignment of the detector parts, space charge in the TPC, and other distortions. The TPC is therefore calibrated via matching trajectories from the ITS subdetector to trajectories from the TPC. The result is that several correction steps are applied in the ALICE software when converting from pad/time coordinates to space points. These corrections are implemented in the ALICE Offline software stack and require a lot of conditional execution and memory lookups in tabulated correction data.

This implementation of coordinate transformation is inefficient, though, because it requires several memory lookups in large data structures and a lot of branching. Sergey Gorbunov therefore approximated the transformation via a spline implementation. The spline calculation is still computationally expensive. Vectorization of the calculation can speed up the spline implementation further, thus making this critical transformation as efficient as possible.
12.1.2 spline vectorization
The current ALICE code base using this transformation does not employ horizontal vectorization, where a vector of pad/time inputs would yield a vector of three-dimensional space points. Additionally, horizontal vectorization requires gathers in the lookup of the tabulated values for the spline, since the tabulated values are different for different pad/time input values. As the following sections will show, the efficiency of the vectorized implementation is mainly limited by load throughput of the tabulated values and thus horizontal vectorization does not easily improve over vertical vectorization.
Listing 12.1: Generic spline function supporting fundamental arithmetic types as well as Vc vector types.
Vertical vectorization is still interesting for this task since 15 elementary splines (cf. Listing 12.1) have to be calculated for implementing the 2D input / 3D output approximation. Five elementary splines are required per dimension in the output (cf. Figure 12.1): Four splines in the second input dimension6 calculate the input data for the fifth spline in the first input dimension. Thus, there are twelve independent splines calculated from the tabulated data which are used as input for the three subsequent splines. Consequently, 12 ⋅ 4 = 48 tabulated values must be loaded from memory for the complete coordinate transformation.

The spline function in Listing 12.1 is a generic implementation supporting both fundamental arithmetic types as well as any of the Vc vector types. This makes it easy to build and test different implementations using different memory layouts with a single spline function.
12.1.2.1 implementation in production use

The implementation in production use at ALICE uses vertical vectorization. The tabulated values build a two-dimensional table (for the two input dimensions) with three values in each cell (for the three output dimensions). This is depicted in Figure 12.1. The dashed magenta arrow and the dotted green arrow point out the two principal memory orders for storing the tabulated values. Since each τ_i,j = (x_i,j, y_i,j, z_i,j) stores three values there are three natural options:

1. Store three separate tables for x_i,j, y_i,j, and z_i,j.

2. Store the complete τ_i,j = (x_i,j, y_i,j, z_i,j) in each point of the table.

3. Pad τ_i,j with an additional zero and store it in each point of the table.7

In the ALICE implementation the memory order follows the dotted green arrow and uses variant 3 for storage of τ. The vectorization is chosen over the components of τ, which is the reason for the zero-padding. Thus, a single vector load for one table cell at (i, j) loads (x_i,j, y_i,j, z_i,j, 0). These vectors are subsequently used as input variables to the elementary spline function (cf. Listing 12.1).
6 These four splines may be evaluated in arbitrary order.
7 This is a typical approach for vertical vectorization, in order to allow aligned vector loads for three-dimensional data.
Figure 12.1: Layout and usage of the tabulated values.
This vectorization approach requires 풲_float to be exactly four. The ALICE implementation thus simply fails to vectorize if 풲_float ≠ 4. This solution obviously leaves room for improvement.

As Section 12.1.3.1 will show, the speedup of this implementation is roughly a factor two, but only on systems with 풲_float = 4. Therefore, the following will discuss alternative implementations. After benchmarking the different approaches the results will be discussed.
12.1.2.2 alternative implementation using advanced features

The following discussion uses these type aliases (the widths follow from the type names used in the text):

typedef Vc::SimdArray<float,  4> float4;
typedef Vc::SimdArray<float, 12> float12;
typedef Vc::SimdArray<float, 16> float16;
Float4 This implementation is a slight modification of the original ALICE imple- mentation (cf. Listing F.9). It uses the float4 type instead of float_v to solve the portability issue mentioned above. This allows the formulation of the al- gorithm in a way which expresses the fixed width of the data-parallelism in the algorithm. This is an example for the use-case mentioned in Section 7.3.2. It uses (aligned) vector loads to load into 4-wide vectors for a three- fold parallelization of the spline calculation.{푥, 푦, 푧, 0} The Float4 implementation is shown in Listing F.3.
Float16  In the Float4 implementation, the 12 initial splines are executed as four vectorized spline evaluations using {x, y, z, 0} as inputs. This can be combined into a single spline evaluation using float16. The Float16 implementation therefore combines four float4 loads via simd_cast to a single float16 object. The float16 object resulting from the spline evaluation is subsequently partitioned into four float4 objects again to evaluate the final three splines. The Float16 implementation is shown in Listing F.4.
Float12  By changing the storage order of the tabulated data to follow the dashed magenta arrow in Figure 12.1 and variant 1 for τ, it is possible to do vector loads with four relevant values. Thus, the four vector entries with value 0 in the Float16 implementation are optimized out. Instead of four vectors with 75% relevant data, the implementation only loads three vectors. Each float4 object then contains 100% relevant data. The float12 type can be implemented using one eight-wide vector and one four-wide vector on a system with 풲_float = 8. The Float12 implementation is shown in Listing F.5.

Float12 Interleaved  The third storage order uses variant 2 for τ and follows the dashed magenta arrow in Figure 12.1. Four float12 vector loads thus yield the {x, y, z} inputs for all twelve initial splines. The resulting vector with 풲 = 12 is subsequently dissected into four float4 objects storing the four {x, y, z} values and an arbitrary fourth value for the final three splines.8 The Float12 Interleaved implementation is shown in Listing F.6.
Horizontal 1  In order to investigate the potential of a different vectorization direction (which requires vectorization of the calling code), I implemented a horizontal vectorization of the approximation. A vector of two-dimensional inputs with native vector width (풲_float) is used as input to the function. The Horizontal 1 implementation then executes 풲_float times more splines with a single call than the implementations above. In addition to vectorizing the splines, a horizontal implementation also vectorizes the calculation of the indexes for the tabulated data.

This implementation uses the data structures from the Float4 implementation. Since, independent of the direction over τ in Figure 12.1, there are always four τ_i,j elements consecutively in memory and each τ_i,j stores four scalar values, the implementation could load up to sixteen consecutive scalar values with a vector load. However, the loaded values need to be transposed into native vector objects. Therefore it is not beneficial to load all values at once. The consecutive loads with subsequent transposition implement
8 In principle, this should use SimdArray<float, 12>.
a gather operation, which can execute more efficiently because it makes use of the data-locality built into the data structure. The expression Vc::tie(x, y, z) = map[ind] is implemented in Vc as a vector load which loads three consecutive values at each of the memory addresses identified by map[ind[0]], map[ind[1]], map[ind[2]], map[ind[3]], … and subsequently uses vector instructions for transposition. The vector transposition places the first values of each load into x, the next into y, and the final values into z. The Horizontal 1 implementation is shown in Listing F.7.
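The gather-plus-transposition can be emulated in scalar code to show the data movement (a sketch with hypothetical names; Vc performs the transposition with vector shuffle instructions rather than a scalar loop):

```cpp
#include <array>

constexpr int W = 4;  // assumed vector width

struct Point3 { float x, y, z, pad; };  // variant-3 storage: zero-padded

// Scalar emulation of Vc::tie(x, y, z) = map[ind]: one small contiguous
// load per lane, then a transposition of the loaded values into the
// three SoA output vectors x, y, and z.
inline void gatherTranspose(const Point3 *map, const std::array<int, W> &ind,
                            std::array<float, W> &x, std::array<float, W> &y,
                            std::array<float, W> &z) {
    for (int i = 0; i < W; ++i) {
        const Point3 &p = map[ind[i]];
        x[i] = p.x; y[i] = p.y; z[i] = p.z;
    }
}
```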
Horizontal 2  This implementation uses the same interface as the Horizontal 1 implementation. However, the storage order for the tabulated values is different. It equals the order from Float12 (separate tables for x, y, and z and neighboring values in the first input dimension). Thus, it can load and deinterleave four vectors per gather: Vc::tie(x[0][0], x[1][0], x[2][0], x[3][0]) = fXYZ[ind]. After loading 16 vectors (four gathers), all spline calculations for a single output dimension can be executed. This is therefore repeated twice for the y and z results. The Horizontal 2 implementation is shown in Listing F.8.
An important consequence of using SimdArray
12.1.3 benchmark

To evaluate the efficiency of the different spline implementations I adjusted the benchmark program from Chapter 11. It constructs the splines in such a way that they can approximate the transformation [−1, 1]² → [−1, 1]³. The number of tabulated values for the spline is increased starting from 4². This tests the complexity of the algorithm (which is supposed to be 풪(1)) and shows how the CPU caches influence the run time.

Each run time measurement is repeated (and the measurement discarded) until the standard deviation of the 100 timed executions is below 5%. This helps eliminating results that were influenced by a busy system, which would increase the deviations in consecutive executions.

The tabulated values for the different spline data structures are initialized with the exact same values. This makes the run times comparable and allows verification that the results are equivalent. For comparability, the spline evaluations are also executed with the same input coordinates for each run time measurement.
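The repeat-until-stable measurement policy can be sketched as follows (hypothetical helper names; the actual benchmark harness is shown in Listing F.11):

```cpp
#include <cmath>
#include <vector>

// A sample of run times is accepted only once its relative standard
// deviation drops below 5%; noisy samples (e.g. from a busy system)
// are discarded and the measurement is repeated.
inline double relativeStddev(const std::vector<double> &t) {
    double mean = 0;
    for (double v : t) mean += v;
    mean /= t.size();
    double var = 0;
    for (double v : t) var += (v - mean) * (v - mean);
    var /= t.size();
    return std::sqrt(var) / mean;
}

inline bool accept(const std::vector<double> &timings) {
    return relativeStddev(timings) < 0.05;
}
```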
Starting with GCC version 4.9, the compiler is able to auto-vectorize the scalar implementation. Since it is easier to reason about a scalar implementation that is not auto-vectorized, and since ALICE does not yet use GCC 4.9 in production, auto-vectorization is turned off for the benchmark. For comparing manual vectorization against automatic vectorization, the scalar implementation is compiled and benchmarked a second time with auto-vectorization turned on. Refer to Appendix F Listing F.11 for the complete benchmark code.
12.1.3.1 results Figures 12.2 & 12.3 show the results of the benchmark compiled with the AVX and SSE implementations of Vc, respectively. The benchmark was executed on an Intel i5 3360M CPU at 2.80 GHz ("Turbo Mode" and power management were disabled; cf. Appendix D). The code was compiled with GCC version 4.9.1 using the following optimization-relevant compiler flags: -O3 -ffp-contract=fast -march=core-avx-i -DNDEBUG.
The plots on top of Figures 12.2 & 12.3 show the run time of the coordinate transformation approximation normalized to a single transformation.9 The run time of the Scalar implementation is compared against all the other implementations in the "Speedup" plots below. All graphs, except for the Alice graph, exhibit an approximately flat scaling behavior (풪(1)) with respect to the size of the map. At ca. 2⋅10³ tabulated values, which corresponds to the 32 KiB size of the L1 cache, the graphs bend upwards. There is a second bend at ca. 1.5⋅10⁴ tabulated values, corresponding to the 256 KiB size of the L2 cache. The Alice implementation runs longer for small maps and slightly better for larger map sizes (compare Alice against Scalar in Figure 12.2 and Alice against Float4 in Figure 12.3). This is discussed in Section 12.1.3.2. In Figures 12.2 & 12.3, the run times of Float4, Float16, and Float12 Interleaved are approximately equal. For AVX (Figure 12.2), the efficiency of Float16 is slightly higher compared to the other two, though. The Float4 and Scalar implementations show the same run times in the AVX and SSE variants. The Float12 implementation is slightly slower than Float4, with the efficiency degrading faster for larger map sizes. The Horizontal 1 and Horizontal 2 implementations exhibit the shortest run times in Figure 12.2 and results comparable to Float4 or Float12 in Figure 12.3. Note that the Horizontal 2 implementation
9 Note that the Horizontal 1 and Horizontal 2 implementations evaluate 풲_float transformations with a single function call.
[Figure 12.2 plots: time per transformation in cycles (0 to 250, top) and speedup relative to Scalar (0 to 3, bottom), each over the number of tabulated values (10¹ to 10⁵), for the Float4, Float16, Float12, Float12 Interleaved, Horizontal 1, Horizontal 2, Alice, Auto-Vectorized, and Scalar implementations.]
Figure 12.2: Efficiency of the spline implementations using Vc's AVX implementation.
[Figure 12.3 plots: time per transformation in cycles (0 to 250, top) and speedup relative to Scalar (0 to 3, bottom), each over the number of tabulated values (10¹ to 10⁵), for the same implementations as in Figure 12.2.]
Figure 12.3: Efficiency of the spline implementations using Vc's SSE implementation.
is more sensitive to larger map sizes than the Horizontal 1 implementation. Finally, the Auto-Vectorized graph clearly shows that the auto-vectorizer improved the run time of the Scalar implementation. Nevertheless, it runs longer than any of the explicitly vectorized implementations for AVX and SSE.
12.1.3.2 discussion alice The behavior of the original Alice implementation is counterintuitive. The expectation is that the efficiency degrades only due to memory loads which miss the cache (i.e. for larger map sizes). Investigation of the implementation has shown that this is due to the use of if statements for the calculation of the memory addresses (cf. Listing F.9). These are translated by GCC to jump instructions and thus lead to branching and mispredictions on execution. The branches are more expensive for smaller maps, where branch prediction is more likely to be incorrect. If the implementation instead uses min/max instructions to clamp the index values (cf. Listing F.2), the efficiency only depends on cache misses and consequently only starts degrading for map sizes that exceed the respective cache sizes. Since all the other implementations use min/max instructions for clamping, they show a flat graph up to the point where the L1 cache is exceeded. The Alice implementation is more efficient for larger map sizes because the CPU can speculate that no index clamping is necessary. It can thus fetch data from memory earlier than the implementations which first have to wait for the results of the min and max instructions. The Alice implementation is written using float_v directly, and therefore falls back to the scalar implementation if 풲_float ≠ 4. This is why the Alice graph in Figure 12.2 is on the order of the Scalar implementation. attainable speedup / amdahl's law In general, the speedup cannot attain the full vector width. For the vertically vectorized implementations this is mostly due to the latency induced by offset calculation into the map of tabulated values as well as the subsequent load latencies. The Float12 implementation requires vector transposition and therefore introduces additional latency between the actual parallelized spline evaluations.
The horizontal vectorizations require deinterleaving gathers and therefore also introduce additional latency before the actual parallelized spline evaluation. According to Amdahl's law [3], this limits the attainable speedup and especially limits scalability with regard to the vector width. horizontal implementations The vectorization opportunities for horizontal vectorization are better because, in addition to parallelizing the splines, the index calculation can be executed in parallel. On the one hand, this solves the Amdahl problem since the horizontal vectorization parallelizes the complete execution of the approximation function. On the other hand, a different map index per vector entry implies the need for gather operations, which load the input arguments to the elemental spline functions. Since gather operations are, in general, more expensive than loads of successive values in memory, this limits the attainable speedup for horizontal vectorization. Figure 12.3 shows that the horizontal implementations do not achieve higher efficiencies than vertical vectorization. This is different for AVX; Figure 12.2 shows that horizontal vectorization scales better with the vector width for this problem. index type While testing the influence of the index calculation strategy on the total efficiency of each approximation implementation, I noticed that there is a 10% difference in efficiency depending on whether the index variable is declared as a signed or unsigned integer. Using unsigned int, the total run time is longer than with int. This was not investigated further and all implementations simply use int for the benchmark. The discrepancy is likely due to the different overflow properties of unsigned int and int. An unsigned int is specified to implement modulo 2ⁿ arithmetic. For an int, on the other hand, it is undefined behavior if the variable over- or underflows.
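The defined wrap-around of unsigned int can be demonstrated directly (the analogous signed overflow must not be executed, since it is undefined behavior):

```cpp
#include <cassert>
#include <climits>

// unsigned arithmetic is defined to wrap modulo 2^n, so the compiler may not
// assume that u + 1 > u; for (signed) int, overflow is undefined behavior.
unsigned wrapAdd(unsigned u) { return u + 1u; }  // well-defined wrap-around
```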
The compiler can therefore assume that adding a positive value to an int variable yields a larger value. This is not true for unsigned int and thus inhibits certain optimizations in combination with pointer arithmetic. speedup with vertical vectorization The Float16 and Float4 efficiencies for SSE/AVX are as expected. The benchmark result shows how, on the one hand, vertical vectorization using 3-component vectors can also benefit from wider vectors if an additional dimension of data-parallelism exists in the algorithm. The Float4 implementation, on the other hand, shows that with x86, the code cannot be translated to a more efficient variant, but at the same time also does not have to fall back to scalar execution, as in the Alice implementation. Using a larger SimdArray
Consequently, many of the loads span two cache lines, further increasing the latency introduced by the vector gather/deinterleave code. This increases the serial portion of the code and limits the attainable speedup. auto-vectorization The efficiency of Auto-Vectorized must be compared against Float4. The latter is the closest equivalent to the vectorization approach the compiler takes. It shows that auto-vectorization can only make limited assumptions about the code and thus cannot transform it into the clear data-parallel execution possible via explicit vectorization.
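As a scalar illustration of the branch-free clamping discussed above for the alice implementation: std::min/std::max compile to min/max instructions on x86, avoiding the conditional jumps that the original if-based address calculation incurs. The function name below is illustrative; the Vc implementations use the equivalent vector min()/max().

```cpp
#include <algorithm>
#include <cassert>

// Branch-free index clamping: instead of two if statements (conditional
// jumps, subject to misprediction), min/max map to single instructions.
inline int clampIndex(int i, int lo, int hi) {
  return std::min(std::max(i, lo), hi);
}
```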
12.2 KALMAN-FILTER
The Kalman-filter [51] is an important tool in applications that need to predict a trajectory or need to find a set of parameters (and their errors and correlations) that describe a trajectory. As such, it is used extensively in track reconstruction algorithms for high energy physics experiments (cf. [29]). In these experiments, up to several thousand particles move from the collision point through the detector and its strong magnetic field (cf. Section 12.1). The detector itself contains material which interacts with the measured particles. Therefore, the magnetic field, the detector material, and the properties of the particle determine the trajectory through the detector.
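To illustrate the structure of the filter step before the vectorized version in Listing 12.2, here is a minimal scalar one-dimensional sketch (illustrative only, not code from the SIMD-KF benchmark): a measurement m with variance R updates the state estimate x and its variance P.

```cpp
#include <cassert>
#include <cmath>

// Scalar 1-D Kalman filter update step: compute the residual, the gain, and
// update state estimate and variance accordingly.
struct State1D { double x; double P; };  // state estimate and its variance

void filter1D(State1D& s, double m, double R) {
  const double residual = s.x - m;   // innovation
  const double w = 1.0 / (R + s.P);  // inverse innovation variance
  const double K = s.P * w;          // Kalman gain
  s.x -= K * residual;               // state update
  s.P -= K * s.P;                    // variance update
}
```

Listing 12.2 applies the same residual, gain, state, and covariance updates, but with five state components and float_v operations, so each statement processes 풲_float tracks at once.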
12.2.1 vectorization
Vectorization of the Kalman-filter has been researched and implemented by Gorbunov et al. [33, 32], Gorbunov et al. [34], Kisel et al. [53], and Kretz [55] for many different target systems. Kretz [55] developed the Vc abstraction as part of porting the ALICE-related work to the Intel Larrabee. Based on this previous work, the track reconstruction software for ALICE (at CERN10), STAR11 (at RHIC12), and CBM13 (at FAIR14) makes use of vectorized Kalman-filter implementations. While ALICE turned to GPUs and currently uses a CUDA/OpenCL implementation15, the Vc-based ALICE tracker has been adapted by Fisyak et al. [23] for the STAR experiment. At this point the STAR experiment uses Vc in production code in both online and offline software (J. Lauret, personal communication, 02/20/2015).
10 European Organization for Nuclear Research; in Geneva 11 Solenoidal Tracker at RHIC 12 Relativistic Heavy Ion Collider; at Brookhaven National Laboratory 13 Compressed Baryonic Matter; experiment at FAIR 14 Facility for Antiproton and Ion Research; in Darmstadt 15 The target hardware for the Vc implementation did not make it to market, necessitating the use of NVIDIA or AMD GPUs.
Kisel et al. [52] developed the KF Particle package and recently vectorized the project using the Vc library (M. Zyzak, personal communication, 02/20/2015). This work enables fully vectorized usage of Xeon Phi accelerator cards in addition to full utilization of the host CPUs.
12.2.2 example
Listing 12.2 shows a (slightly reduced) Kalman-filter example from track reconstruction. It is used to fit the track parameters of several particles in parallel. In the Filter function a new measurement (m) is added to the Track state. The data structures in the example are declared as structures of vectors (cf. AoVS in Chapter 10). The objects thus each contain the data of 풲_float tracks. Consequently, instead of working with one track at a time, the code explicitly states that multiple tracks can be filtered in parallel and that their data is stored interleaved in memory.
12.3 GEANT-V
Geant-V16 is a “particle transport application used in detector simulation” [10]. The preceding Geant versions are widely used in high energy physics software. Improving its performance will therefore benefit the whole community. Carminati et al. [10] write:
Monte Carlo simulation is one of the most demanding computing task, due to the slow decrease of the statistical fluctuations around the estimated mean, which is proportional to the inverse of the square root of the number of events simulated. Moreover particle transport simulation is one of the most experiment-independent applications in High Energy Physics.
It is therefore a very important result if this software can execute more efficiently. The computing costs for high energy physics experiments could be significantly reduced.
12.3.1 data-parallel processing
According to Carminati et al. [11], the particle transport problem the Geant17 software addresses contains a lot of intrinsic parallelism. There are many events that need to be processed according to the same rules. Inside these events are tracks which again need to be processed in the same way. Besides multi-core optimizations, the Geant-V project in particular investigates vectorized processing.
16 GEometry ANd Tracking Vector (Prototype) project 17 GEometry ANd Tracking
struct Covariance {
    float_v C00,
            C10, C11,
            C20, C21, C22,
            C30, C31, C32, C33,
            C40, C41, C42, C43, C44;
};

struct Track {
    float_v x, y, tx, ty, qp, z;
    float_v chi2;
    Covariance C;

    float_v NDF;
    // ...
};

struct HitInfo {
    float_v cos_phi, sin_phi, sigma2, sigma216;
};

void Filter(Track &track, const HitInfo &info, float_v m) {
    Covariance &C = track.C;
    const float_v residual =
        info.cos_phi * track.x + info.sin_phi * track.y - m;         // ζ = Hr - m
    const float_v F0 = info.cos_phi * C.C00 + info.sin_phi * C.C10;  // CH
    const float_v F1 = info.cos_phi * C.C10 + info.sin_phi * C.C11;
    const float_v F2 = info.cos_phi * C.C20 + info.sin_phi * C.C21;
    const float_v F3 = info.cos_phi * C.C30 + info.sin_phi * C.C31;
    const float_v F4 = info.cos_phi * C.C40 + info.sin_phi * C.C41;
    const float_v HCH = F0 * info.cos_phi + F1 * info.sin_phi;       // HCH
    const float_v wi = 1.f / (info.sigma2 + HCH);
    const float_v zetawi = residual * wi;                            // (V + HCH)⁻¹ ζ
    const float_v K0 = F0 * wi;
    const float_v K1 = F1 * wi;
    const float_v K2 = F2 * wi;
    const float_v K3 = F3 * wi;
    const float_v K4 = F4 * wi;
    track.x  -= F0 * zetawi;                                         // r -= CH (V + HCH)⁻¹ ζ
    track.y  -= F1 * zetawi;
    track.tx -= F2 * zetawi;
    track.ty -= F3 * zetawi;
    track.qp -= F4 * zetawi;
    C.C00 -= K0 * F0;                                                // C -= CH (V + HCH)⁻¹ HC
    C.C10 -= K1 * F0;
    C.C11 -= K1 * F1;
    C.C20 -= K2 * F0;
    C.C21 -= K2 * F1;
    C.C22 -= K2 * F2;
    C.C30 -= K3 * F0;
    C.C31 -= K3 * F1;
    C.C32 -= K3 * F2;
    C.C33 -= K3 * F3;
    C.C40 -= K4 * F0;
    C.C41 -= K4 * F1;
    C.C42 -= K4 * F2;
    C.C43 -= K4 * F3;
    C.C44 -= K4 * F4;
    track.chi2 += residual * zetawi;                                 // χ² += ζ (V + HCH)⁻¹ ζ
    track.NDF += 1;
}

Listing 12.2: Example Kalman-filter implementation from the SIMD-KF benchmark.
[Figure 12.4 bar chart: speedups for the DistFromOutside and DistFromInside functions for the Box, Cone, and Tube shapes, comparing ROOT/5.34.09 (patched) against the Vc (SIMD) version; the bars show speedup factors of 1.7, 1.75, 1.98, 2.24, 2.7, and 2.94.]
Figure 12.4: Speedups obtained from vectorising simple algorithms for the box, cone and tube shapes from the ROOT shape library. The functions presented are distFromOutside and distFromInside, i.e. the distance to the shape boundary from a point outside and from inside the shape respectively. [reproduced from 84]
Wenzel et al. [84] then discuss the details of the vectorization effort, which was initiated “on the geometry component”:
The necessary prerequisite is the availability of (contiguous) data on which the same operations should be carried out. In the Geant-Vector prototype, this is being realised by grouping the particles in the same logical volume (potentially from different events) into a data-parallel container called a basket [31].
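The basket regrouping described in the quote can be sketched with standard containers; the types and names below are illustrative and not Geant-V code.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Regroup particles (possibly from different events) by logical volume into
// contiguous containers ("baskets"), so the same geometry code can later run
// on each basket in a data-parallel fashion.
struct Particle { int volumeId; /* kinematics ... */ };

using Basket = std::vector<Particle>;

std::unordered_map<int, Basket> fillBaskets(const std::vector<Particle>& in) {
  std::unordered_map<int, Basket> baskets;
  for (const Particle& p : in) baskets[p.volumeId].push_back(p);
  return baskets;
}
```

Each basket then holds contiguous particles of one logical volume, which is the prerequisite for running the vectorized geometry algorithms on it.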
12.3.2 uses of vc Wenzel et al. [84] write about the choices they considered for vectorization and their experience with Vc:
Considering the difficulties of trying to get the code to autovectorise, versus the relative ease of programming (and compiler independence) with a library like Vc, we then opted for the second choice for the purpose of this first performance evaluation. At the time of writing, several of the simple shapes, such as boxes, cones, tubes (including their segmented forms), have been successfully ported to Vc code.
The Geant-V project was thus able to show several improvements through their Vc reimplementations of existing geometry algorithms. Figures 12.4 & 12.5 show the results from vectorization of the geometry codes. It is clearly visible how Vc has enabled the Geant-V project to achieve consistent efficiency improvements while expressing their parallel processing in a portable way.
[Figure 12.5 plot: tracking time per particle in nanoseconds (0 to 750) as a function of the number of particles (1 to 10000), for ROOTseq, Vec(noSIMD), Vec(SSE4), and Vec(AVX).]
Figure 12.5: Results from the benchmark comparing the scalar algorithm (ROOT seq) with a vector-oriented algorithm using various degrees of usage of SIMD instructions [VEC(noSIMD), VEC(SSE4) and VEC(AVX)]. Comparison of the runtime per particle showing a speedup factor of roughly 3 comparing the original version to the AVX code. [reproduced from 84]
12.4 ROOT The ROOT library [8] is a de facto standard building block for the software of high energy physics experiments and many other physics projects. About | ROOT [1] summarizes the functionality: "The ROOT system provides a set of OO frameworks with all the functionality needed to handle and analyze large amounts of data in a very efficient way." It is used by thousands of scientists world-wide, and dozens of current experiments (e.g. ALICE, ATLAS, BaBar, CMS, COMPASS, LHCb, PHENIX, PHOBOS, STAR, …) use it for their software. Since the 5.34/18 release of ROOT, the Vc library is included as a module. This enables Vc to reach a large audience. At the same time, the inclusion of Vc in ROOT gives physics experiments an easier adoption strategy, since the ROOT library is already an accepted prerequisite.
12.5 CONCLUSION On the basis of the ALICE usage I have been able to show how API improvements and extensions since previous Vc releases improve the applicability and flexibility of Vc. The ALICE example has also shown that there are often many possible approaches to vectorizing a given algorithm, each with different scaling behavior. Finally, the ALICE example (as well as the k-d tree example in Section 11.3) has shown that Amdahl's law is relevant when considering vectorization opportunities. This chapter has shown that the Vc library is in production use in large software projects and international communities. The ROOT, Geant-V, STAR, and CBM discussions have shown that Vc has become an important building block in software development for high performance computing in high energy physics experiments. This is an important achievement since it proves the portability and compatibility of the library.
Part IV
Conclusion
13
CONCLUSION
Chapter 1 has made the case that software developers and designers have to consider SIMD. Developers need to understand what SIMD can do and how it can be used in order to design data structures and algorithms which fully utilize modern CPUs. The presented vector and mask types (Part II) give software developers a flexible, portable, and efficient solution to explicit data-parallel programming. At the same time, through their API, these types guide the developer to a better understanding of efficient data-parallel execution and software design. Chapter 2 has discussed the main alternative approach of loop transformations. I have shown the limitations, language restrictions, and semantic subtleties that are part of this approach. The data-parallel vector and mask types as presented in Part II overcome these limitations since they work inside the traditional C++ type system and follow the well-understood serial semantics of the sequenced before rule [48, §1.9 p13]. Chapter 4 has given a thorough discussion of the API for the vector type, which allows data-parallel execution for arithmetic types. The user does not have to reason about concurrent execution of serially stated algorithms but can reason about serially executing operations, each of which processes data in parallel. Vc's Vector
the implication for ABI compatibility. Therefore, Chapter 8 has discussed the challenge implementors of C++ are facing with regard to binary compatibility across microarchitectures. Despite this open question, there is consensus in the study group to investigate the vector type programming model for use in standard C++. At the same time there is more research to be done in abstracting the vector width differences (i.e. enabling programmers to write code without using the 풲_T constant). One of the abstraction ideas has been presented in Chapter 9, which has shown that the STL algorithms can be a good interface for vectorization and that vector types resolve the semantic issues of the current parallel algorithms proposal. Chapter 10 has presented a solution for automatic vectorization of data structures and the template functions that work with these types. The simdize
Appendix
A
WORKING-SET BENCHMARK
using Vc::float_v;
// This example shows how an arbitrary problem scales depending on working-set // size and FLOPs per load/store. Understanding this can help to create better // implementations.
// The Runner is a method to generate the different scenarios with all parameters // to the Work available as constant expressions. The idea is to have the compiler // able to optimize as much as possible so that the actual workload alone is // benchmarked.
// The Runner recursively calls operator() on the Work template class with varying // arguments for N and FLOPs. template class Work, std::size_t N = 256, std::size_t M = 4, int FLOPs = 2> struct Runner { static void run(){ Work
// The Flops helper struct generates code that executes FLOPs many floating-point // SIMD instructions (add, sub, and mul) template
template <> inline float_v Flops<2>::operator()(float_v a, float_v b, float_v c) {
    return a * b + c;
}
template <> inline float_v Flops<3>::operator()(float_v a, float_v b, float_v c) {
    return a * b + (c - a);
}
template <> inline float_v Flops<4>::operator()(float_v a, float_v b, float_v c) {
    return (a * b + c) + a * c;
}
template <> inline float_v Flops<5>::operator()(float_v a, float_v b, float_v c) {
    return a * b + (a + c) + a * c;
}
template <> inline float_v Flops<6>::operator()(float_v a, float_v b, float_v c) {
    return (a * b + (a + c)) + (a * c - b);
}
template <> inline float_v Flops<7>::operator()(float_v a, float_v b, float_v c) {
    return (a * b + (a + c)) + (a * c - (b + c));
}
template <> inline float_v Flops<8>::operator()(float_v a, float_v b, float_v c) {
    return (a * b + (a + c) + b) + (a * c - (b + c));
}
// This is the benchmark code. It is called from Runner and uses Flops to do the // work. template
TimeStampCounter tsc; double throughput = 0.; for (std::size_t i = 0; i < 2 + 512 / N; ++i){ tsc.start(); // ------start of the benchmarked code ------for (int repetitions = 0; repetitions < Repetitions; ++repetitions){ for (std::size_t m = 0; m < M; ++m){ for (std::size_t n = 0; n < N; ++n){ (*data)[m][n] = Flops
throughput = std::max(throughput, (Repetitions * M * N * float_v::Size * FLOPs) / static_cast
const long bytes = N * M * sizeof(float_v); printf("%10lu Byte | %4.2f FLOP/Byte | %4.1f FLOP/cycle\n", bytes, static_cast
Listing A.1: The benchmark code for Figure 1.4.
B
LINEAR SEARCH BENCHMARK
int main(){ std::cout << std::setw(15) << "N" << std::setw(15) << "std" << std::setw(15) << "stddev" << std::setw(15) << "Vc" << std::setw(15) << "stddev" << std::setw(15) << "speedup" << std::setw(15) << "stddev" << '\n';
// create data std::vector
for (std::size_t N = float_v::size() * 2; N <= NMax; N *= 2) { const std::size_t Repetitions = 100 + 1024 * 32 / N;
// create search values std::vector
enum { std, vec };
std::vector
double mean[2] = {}; double stddev[2] = {}; do { // search (std)
for (auto n = Repetitions; n;--n){ tsc.start(); for (std::size_t i = 0; i < search_values.size(); ++i){ iterators[std][i] = std::find(data.begin(), data.begin() + N, search_values[i]); } tsc.stop(); cycles[std][Repetitions - n] = tsc.cycles(); }
// search (vec) for (auto n = Repetitions; n;--n){ tsc.start(); for (std::size_t i = 0; i < search_values.size(); ++i){ iterators[vec][i] = Vc::find(data.begin(), data.begin() + N, search_values[i]); } tsc.stop(); cycles[vec][Repetitions - n] = tsc.cycles(); }
// test that the results are equal for (std::size_t i = 0; i < iterators[vec].size(); ++i){ assert(iterators[std][i] == iterators[vec][i]); }
// output results std::cout << std::setw(15) << N; for (int i :{std, vec}){ mean[i] = 0; stddev[i] = 0; std::sort(cycles[i].begin(), cycles[i].end()); for (double x : cycles[i]){ mean[i] += x; stddev[i] += x * x; } mean[i] /= cycles[i].size(); stddev[i] /= cycles[i].size(); stddev[i] = std::sqrt(stddev[i]- mean[i]* mean[i]); } std::cout << std::setw(15) << mean[std]/ search_values.size(); std::cout << std::setw(15) << stddev[std]/ search_values.size(); std::cout << std::setw(15) << mean[vec]/ search_values.size(); std::cout << std::setw(15) << stddev[vec]/ search_values.size(); std::cout << std::setw(15) << std::setprecision(4) << mean[std]/ mean[vec]; std::cout << std::setw(15) << mean[std]/ mean[vec]* std::sqrt( stddev[std]* stddev[std]/(mean[std]* mean[std]) + stddev[vec]* stddev[vec]/(mean[vec]* mean[vec])); std::cout << std::endl; } // if the error is large the system was busy doing something else and we // better repeat the experiment while (stddev[std] * 20 > mean[std] || stddev[vec] * 20 > mean[vec]); }
return 0; }
Listing B.1: The benchmark code for Figure 11.1.
C
NEAREST NEIGHBOR BENCHMARK
constexpr int NumberOfSearches = 10000; constexpr int FirstSetSize = 64; constexpr int MaxSetSize = 32 * 1024 * 1024; constexpr int Repetitions = 100; constexpr auto StepMultiplier = 2; constexpr bool Optimize = true; constexpr int Dimensions = 3; using PointEntry = float; constexpr PointEntry loCoord = -100; constexpr PointEntry hiCoord = +100; enum { KdS, KdV, LnS, LnV, NBenchmarks }; class Runner { const int SetSize; TimeStampCounter tsc; double mean[NBenchmarks] = {}; double stddev[NBenchmarks] = {}; public: Runner(int S): SetSize(S){} void recordTsc(int Test, double norm){ const double x = tsc.cycles()/ norm; mean[Test] += x; stddev[Test] += x * x; } void printRatio(int i, int j){ if (i >= 0 && j >= 0) { const auto ratio = mean[i]/ mean[j]; std::cout << std::setprecision(3) << std::setw(9) << ratio; std::cout << std::setprecision(3) << std::setw(9) << ratio * std::sqrt(stddev[i]* stddev[i]/(mean[i]* mean[i]) + stddev[j]* stddev[j]/(mean[j]* mean[j])); } } template
clear(); tsc.start(); fun(); tsc.stop(); recordTsc(Test, SetSize); } mean[Test] /= Repetitions; stddev[Test] /= Repetitions; stddev[Test] = std::sqrt(stddev[Test]- mean[Test]* mean[Test]); } while (stddev[Test]* err > mean[Test]); std::cout << std::setw(9) << std::setprecision(3) << mean[Test]; std::cout << std::setw(9) << std::setprecision(3) << stddev[Test]; std::cout << std::flush; } template
template
int main(){ using Point = ::Point
// output header using std::cout; using std::setw; using std::setprecision; cout << "NumberOfSearches: " << NumberOfSearches << '\n'; cout << "T: " << typeid(Point).name() << '\n'; cout << "Volume: "; for (auto i = Dimensions; i;--i){ cout << hiCoord - loCoord << ((i > 1) ? " " : "\n"); } cout << setw(8) << 'N' << setw(18) << "KdTree" << setw(18) << "KdTreeV"
<< setw(18) << "Linear" << setw(18) << "LinearV" << setw(18) << "Kd/KdV" << setw(18) << "Lin/LinV"; { constexpr auto Width = Optimize ? 6 : 3; cout << setw(Width) << "DS" << setw(Width) << "DV"; } cout << setw(18) << "KdTree" << setw(18) << "KdTreeV" << setw(18) << "Linear" << setw(18) << "LinearV" << setw(18) << "Kd/KdV" << setw(18) << "Lin/LinV" << std::endl;
// random haystack values and search points std::default_random_engine randomEngine(1); typename std::conditional
KdTree
KdTreeV
LinearNeighborSearch
LinearNeighborSearchV
testRunner.printRatio(KdS, KdV); testRunner.printRatio(LnS, LnV);
cout << setw(3) << pointsTree.depth(); if (Optimize){ pointsTree.optimize(); cout << setw(3) << pointsTree.depth(); } cout << setw(3) << pointsTreeV.depth(); if (Optimize){ pointsTreeV.optimize(); cout << setw(3) << pointsTreeV.depth(); } cout << std::flush;
testRunner.benchmarkSearch(KdS,[&]{ for (int i = 0; i < NumberOfSearches; ++i){ const auto &p = searchPoints[i]; const auto &p2 = pointsTree.findNearest(p); asm("" ::"m"(p2)); } }, 15); testRunner.benchmarkSearch(KdV,[&]{ for (int i = 0; i < NumberOfSearches; ++i){ const auto &p = searchPoints[i]; const auto &p2 = pointsTreeV.findNearest(p); asm("" ::"m"(p2)); } }); testRunner.benchmarkSearch(LnS,[&](){ for (int i = 0; i < NumberOfSearches; ++i){ const auto &p = searchPoints[i]; const auto &p2 = linearSearch.findNearest(p); asm("" ::"m"(p2)); } }); testRunner.benchmarkSearch(LnV,[&](){ for (int i = 0; i < NumberOfSearches; ++i){ const auto &p = searchPoints[i];
const auto &p2 = linearSearchV.findNearest(p); asm("" ::"m"(p2)); } });
testRunner.printRatio(KdS, KdV); testRunner.printRatio(LnS, LnV); cout << std::endl; } return 0; }
Listing C.1: The benchmark code for Figures 11.3, 11.6, 11.7, 11.8, and 11.9.
D
BENCHMARKING
Listing D.1 shows the script I used before executing any of the benchmarks shown in this document. The script tests for the existence of a Turbo Mode of the CPU and disables it if present. On newer Linux kernels with an Intel CPU the Turbo Mode is controlled via the Intel P-state driver [7]. In this case a simple boolean switch is exposed by the driver to disable the Turbo Mode. Otherwise, the Turbo Mode is identified in the cpufreq kernel module as a frequency which is 1 MHz higher than the nominal frequency of the CPU. Additionally, the scaling governor is set to performance mode, to alleviate the need for warm-up phases in the benchmark.
1 #!/bin/sh 2 cd /sys/devices/system/cpu 3 no_turbo=intel_pstate/no_turbo 4 if test -f $no_turbo; then 5 echo 1 > $no_turbo 6 else 7 freq=$(cut -d" " -f1,2 cpu0/cpufreq/scaling_available_frequencies) 8 freq1=${freq%*} 9 freq2=${freq#* } 10 test $(($freq2+1000)) -eq $freq1 &&\ 11 echo $freq2 | tee cpu[0-9]*/cpufreq/scaling_max_freq >/dev/null 12 echo performance | tee cpu[0-9]*/cpufreq/scaling_governor >/dev/null 13 fi
Listing D.1: Shell script enabling reproducible benchmarks on current x86 CPUs.
E
C++ IMPLEMENTATIONS OF THE K-D TREE ALGORITHMS
E.1 INSERT ALGORITHM template
Listing E.1: The C++ implementation of the scalar k-d tree insert algorithm. template
const auto less = get_kdtree_value
template
Listing E.2: The C++ & Vc implementation of the vectorized k-d tree insert algorithm. The broadcast of new_x to all entries in the vector is implemented in the Node constructor.
E.2 FINDNEAREST ALGORITHM
template
const auto dx = get_kdtree_1dim_distance
const std::size_t index = (get_kdtree_value
Listing E.3: The C++ implementation of the scalar k-d tree nearest neighbor search algo- rithm.
struct CandidateType { V nearest; DistanceTypeV distance = DistanceTypeV(std::numeric_limits
template
template
if (nearChild == farChild){ // iff both are nullptr // leaf node. 50% of nodes in a balanced tree are leaf nodes. const auto distance = get_kdtree_distance(x, v()); if (Dimensions < 4 || any_of(distance < candidate.distance)){ where(distance < candidate.distance) | candidate.nearest = v(); candidate.distance = min(candidate.distance, distance); } return; } else if (nearChild && farChild){ // The other 50% of nodes in a balanced tree have two children. const auto left = get_kdtree_value
DistanceTypeV dxMax = get_kdtree_1dim_distance(xx, right); DistanceTypeV dxMin = get_kdtree_1dim_distance(xx, left);
if (xx >= right){ std::swap(dxMax, dxMin); std::swap(nearChild, farChild); } else if (xx > left){ // xx is between left & right if (left + right < xx * 2) { // discriminate on center
std::swap(dxMax, dxMin); std::swap(nearChild, farChild); } const auto distance = get_kdtree_distance(x, v()); if (Dimensions < 4 || any_of(distance < candidate.distance)){ where(distance < candidate.distance) | candidate.nearest = v(); candidate.distance = min(candidate.distance, distance); } if (all_of(dxMin < candidate.distance)){ nearChild->findNearest(x, candidate); } if (all_of(dxMax < candidate.distance)){ farChild->findNearest(x, candidate); } return; } nearChild->findNearest(x, candidate); if (all_of(dxMin < candidate.distance)){ const auto distance = get_kdtree_distance(x, v()); if (Dimensions < 4 || any_of(distance < candidate.distance)){ where(distance < candidate.distance) | candidate.nearest = v(); candidate.distance = min(candidate.distance, distance); } if (all_of(dxMax < candidate.distance)){ farChild->findNearest(x, candidate); } } } else if (child[0]) { const auto left = get_kdtree_value
child[1]->findNearest(x, candidate); if (all_of(dx < candidate.distance)){ const auto distance = get_kdtree_distance(x, v()); if (Dimensions < 4 || any_of(distance < candidate.distance)){ where(distance < candidate.distance) | candidate.nearest = v(); candidate.distance = min(candidate.distance, distance); } } } else { const auto distance = get_kdtree_distance(x, v()); if (Dimensions < 4 || any_of(distance < candidate.distance)){ where(distance < candidate.distance) | candidate.nearest = v(); candidate.distance = min(candidate.distance, distance); } if (all_of(dx < candidate.distance)){ child[1]->findNearest(x, candidate); } } } }
Listing E.4: The C++ & Vc implementation of the vectorized k-d tree nearest neighbor search algorithm.
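Listing E.4 relies on Vc's write-masking syntax, `where(mask) | lhs = rhs`, which assigns only in the vector lanes whose mask entry is true. A per-lane emulation with std::array (Vc itself is not required; maskedAssign and lanewiseMin are illustrative names) makes the semantics of the recurring update step explicit:

```cpp
#include <array>
#include <cstddef>

// Emulates `where(mask) | lhs = rhs`: assign rhs to lhs in active lanes only.
template <typename T, std::size_t N>
void maskedAssign(const std::array<bool, N> &mask, std::array<T, N> &lhs,
                  const std::array<T, N> &rhs) {
  for (std::size_t i = 0; i < N; ++i) {
    if (mask[i]) lhs[i] = rhs[i];  // only active lanes are written
  }
}

// Emulates Vc's lane-wise min(), as used for candidate.distance.
template <typename T, std::size_t N>
std::array<T, N> lanewiseMin(const std::array<T, N> &a,
                             const std::array<T, N> &b) {
  std::array<T, N> r;
  for (std::size_t i = 0; i < N; ++i) r[i] = b[i] < a[i] ? b[i] : a[i];
  return r;
}
```

With `mask[i] = distance[i] < candidate.distance[i]`, the pair of calls reproduces the two statements that appear in every leaf-handling branch of Listing E.4.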
template
CandidateTypeV nearest = std::make_pair( simdize
template
const auto dx = get_kdtree_1dim_distance
const std::size_t index = (get_kdtree_value
Listing E.5: The C++ & Vc implementation of vectorized nearest neighbor search in a scalar k-d tree.
F
ALICE SPLINE SOURCES
F.1 SPLINE IMPLEMENTATIONS
typedef std::array
Vc::vector
}
Listing F.1: The class layouts for the different spline implementations.
inline std::tuple
using Vc::float_v;
typedef float_v::IndexType index_v;
inline std::tuple
Listing F.2: The map index calculation common to the different spline implementations.
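Listing F.2's code is truncated above; the computation it factors out is visible in scalar form in Listing F.9 (GetValueAlice): map one coordinate onto the start cell of a four-point stencil plus a fractional offset, clamped so that the stencil stays inside the map. A 1-D sketch with illustrative names:

```cpp
#include <algorithm>
#include <tuple>

// Map coordinate u onto (cell index i, fractional offset du) for a cubic
// four-point stencil i..i+3 on an n-cell grid. evaluate1D and its parameters
// are illustrative names, not the thesis's actual interface.
std::tuple<int, float> evaluate1D(float u, float uMin, float uScale, int n) {
  const float l = (u - uMin) * uScale - 1.f;  // -1: the stencil starts one cell early
  int i = static_cast<int>(l);
  i = std::max(0, std::min(i, n - 4));  // clamp so that cells i..i+3 exist
  return std::make_tuple(i, l - static_cast<float>(i));
}
```

The two-dimensional evaluatePosition used by all the spline implementations applies this computation once per axis and returns (iA, iB, da, db).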
Point3 Spline::GetValue(Point2 ab) const {
  float da1, db1;
  int iA, iB;
  std::tie(iA, iB, da1, db1) =
      evaluatePosition(ab, {fMinA, fMinB}, {fScaleA, fScaleB}, fNA, fNB);
  int ind = iA * fNB + iB;
  typedef Vc::SimdArray
Listing F.3: Float4 spline implementation.
Point3 Spline::GetValue16(Point2 ab) const {
  float da1, db1;
  int iA, iB;
  std::tie(iA, iB, da1, db1) =
      evaluatePosition(ab, {fMinA, fMinB}, {fScaleA, fScaleB}, fNA, fNB);
  typedef Vc::SimdArray
Listing F.4: Float16 spline implementation.
inline Point3 Spline2::GetValue(Point2 ab) const {
  float da1, db1;
  int iA, iB;
  std::tie(iA, iB, da1, db1) =
      evaluatePosition(ab, {fMinA, fMinB}, {fScaleA, fScaleB}, fNA, fNB);
  typedef Vc::SimdArray
Listing F.5: Float12 spline implementation.
inline Point3 Spline3::GetValue(Point2 ab) const {
  float da1, db1;
  int iA, iB;
  std::tie(iA, iB, da1, db1) =
      evaluatePosition(ab, {fMinA, fMinB}, {fScaleA, fScaleB}, fNA, fNB);
  typedef Vc::SimdArray
Listing F.6: Float12 Interleaved spline implementation.
Point3V Spline::GetValue(const Point2V &ab) const {
  index_v iA, iB;
  float_v da, db;
  std::tie(iA, iB, da, db) =
      evaluatePosition(ab, {fMinA, fMinB}, {fScaleA, fScaleB}, fNA, fNB);
  float_v vx[4];
  float_v vy[4];
  float_v vz[4];
  auto ind = iA * fNB + iB;
  const auto map = Vc::make_interleave_wrapper
Listing F.7: Horizontal 1 spline implementation.
inline Spline2::Point3V Spline2::GetValue(Point2V ab) const {
  index_v iA, iB;
  float_v da, db;
  std::tie(iA, iB, da, db) =
      evaluatePosition(ab, {fMinA, fMinB}, {fScaleA, fScaleB}, fNA, fNB);
  auto ind = iA + iB * fNA;
  Point3V xyz;
  {
    float_v x[4][4];
    Vc::tie(x[0][0], x[1][0], x[2][0], x[3][0]) = fXYZ[ind];
    Vc::tie(x[0][1], x[1][1], x[2][1], x[3][1]) = fXYZ[ind + fNA];
    Vc::tie(x[0][2], x[1][2], x[2][2], x[3][2]) = fXYZ[ind + 2 * fNA];
    Vc::tie(x[0][3], x[1][3], x[2][3], x[3][3]) = fXYZ[ind + 3 * fNA];
    xyz[0] = GetSpline3(GetSpline3(x[0], db), GetSpline3(x[1], db),
                        GetSpline3(x[2], db), GetSpline3(x[3], db), da);
  }
  ind += fN;
  {
    float_v y[4][4];
    Vc::tie(y[0][0], y[1][0], y[2][0], y[3][0]) = fXYZ[ind];
    Vc::tie(y[0][1], y[1][1], y[2][1], y[3][1]) = fXYZ[ind + fNA];
    Vc::tie(y[0][2], y[1][2], y[2][2], y[3][2]) = fXYZ[ind + 2 * fNA];
    Vc::tie(y[0][3], y[1][3], y[2][3], y[3][3]) = fXYZ[ind + 3 * fNA];
    xyz[1] = GetSpline3(GetSpline3(y[0], db), GetSpline3(y[1], db),
                        GetSpline3(y[2], db), GetSpline3(y[3], db), da);
  }
  ind += fN;
  {
    float_v z[4][4];
    Vc::tie(z[0][0], z[1][0], z[2][0], z[3][0]) = fXYZ[ind];
    Vc::tie(z[0][1], z[1][1], z[2][1], z[3][1]) = fXYZ[ind + fNA];
    Vc::tie(z[0][2], z[1][2], z[2][2], z[3][2]) = fXYZ[ind + 2 * fNA];
    Vc::tie(z[0][3], z[1][3], z[2][3], z[3][3]) = fXYZ[ind + 3 * fNA];
    xyz[2] = GetSpline3(GetSpline3(z[0], db), GetSpline3(z[1], db),
                        GetSpline3(z[2], db), GetSpline3(z[3], db), da);
  }
  return xyz;
}
Listing F.8: Horizontal 2 spline implementation.
Point3 Spline::GetValueAlice(Point2 ab) const {
  float lA = (ab[0] - fMinA) * fScaleA - 1.f;
  int iA = (int)lA;
  if (lA < 0)
    iA = 0;
  else if (iA > fNA - 4)
    iA = fNA - 4;
  float lB = (ab[1] - fMinB) * fScaleB - 1.f;
  int iB = (int)lB;
  if (lB < 0)
    iB = 0;
  else if (iB > fNB - 4)
    iB = fNB - 4;
  Point3 XYZ;
  if (Vc::float_v::Size == 4) {
    Vc::float_v da = lA - iA;
    Vc::float_v db = lB - iB;
    Vc::float_v v[4];
    int ind = iA * fNB + iB;
    const Vc::float_v *m = reinterpret_cast
Listing F.9: Alice spline implementation.
Point3 Spline::GetValueScalar(Point2 ab) const {
  float da, db;
  int iA, iB;
  std::tie(iA, iB, da, db) =
      evaluatePosition(ab, {fMinA, fMinB}, {fScaleA, fScaleB}, fNA, fNB);
  int ind = iA * fNB + iB;
  float vx[4];
  float vy[4];
  float vz[4];
  for (int i = 0; i < 4; i++) {
    vx[i] = GetSpline3(fXYZ[ind][0], fXYZ[ind + 1][0], fXYZ[ind + 2][0],
                       fXYZ[ind + 3][0], db);
    vy[i] = GetSpline3(fXYZ[ind][1], fXYZ[ind + 1][1], fXYZ[ind + 2][1],
                       fXYZ[ind + 3][1], db);
    vz[i] = GetSpline3(fXYZ[ind][2], fXYZ[ind + 1][2], fXYZ[ind + 2][2],
                       fXYZ[ind + 3][2], db);
    ind += fNB;
  }
  return {GetSpline3(vx, da), GetSpline3(vy, da), GetSpline3(vz, da)};
}
Listing F.10: Scalar spline implementation.
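All the spline variants delegate the 1-D interpolation to GetSpline3, whose definition is not part of this reproduction. It evaluates a cubic through four equidistant knots; a Catmull-Rom cubic is sketched here as a stand-in (the thesis's actual basis may differ, and the lowercase name marks it as illustrative):

```cpp
// Cubic through four equidistant knots: interpolates between v1 and v2 for
// t in [0, 1]; v0 and v3 supply the end tangents (Catmull-Rom form).
inline float getSpline3(float v0, float v1, float v2, float v3, float t) {
  const float a = 2.f * v1;
  const float b = v2 - v0;
  const float c = 2.f * v0 - 5.f * v1 + 4.f * v2 - v3;
  const float d = -v0 + 3.f * v1 - 3.f * v2 + v3;
  // Horner evaluation of 0.5 * (a + b*t + c*t^2 + d*t^3)
  return 0.5f * (a + t * (b + t * (c + t * d)));
}
```

By construction the curve passes through v1 at t = 0 and v2 at t = 1, which is the property the two-dimensional tensor-product evaluation in the listings above relies on.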
F.2 BENCHMARK IMPLEMENTATION
constexpr int NumberOfEvaluations = 10000;
constexpr int FirstMapSize = 4;
constexpr int MaxMapSize = 256;
constexpr int Repetitions = 100;
constexpr auto StepMultiplier = 1.25;
enum EnabledTests {
  Float4, Float16, Float12, Float12Interleaved,
  Horizontal1, Horizontal2, Alice, Autovectorized, Scalar,
  NBenchmarks
};
EnabledTests &operator++(EnabledTests &x){ return x = static_cast
Runner(const std::vector
    }
    // one cache warm-up run to remove one outlier
    for (auto rep = Repetitions; rep; --rep) {
      tsc.start();
      for (const auto &p : searchPoints) {
        fun(p);
      }
      tsc.stop();
      const double x = tsc.cycles() / NumberOfEvaluations;
      mean[Test] += x;
      stddev[Test] += x * x;
    }
    mean[Test] /= Repetitions;
    stddev[Test] /= Repetitions;
    stddev[Test] = std::sqrt(stddev[Test] - mean[Test] * mean[Test]);
  } while (stddev[Test] * err > mean[Test]);
  std::cout << std::setw(9) << std::setprecision(3) << mean[Test];
  std::cout << std::setw(9) << std::setprecision(3) << stddev[Test];
  std::cout << std::flush;
}
template
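The runner accumulates the sum and the sum of squares of the per-repetition cycle counts and derives the standard deviation as sqrt(E[x²] − E[x]²), repeating the whole measurement until the relative error bound holds. The statistics part in isolation (meanAndStddev is an illustrative name, not part of the benchmark source):

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Mean and (population) standard deviation via sum and sum of squares,
// exactly as accumulated in the measurement loop above.
std::pair<double, double> meanAndStddev(const std::vector<double> &xs) {
  double mean = 0, meanOfSquares = 0;
  for (double x : xs) {
    mean += x;
    meanOfSquares += x * x;
  }
  mean /= xs.size();
  meanOfSquares /= xs.size();
  return {mean, std::sqrt(meanOfSquares - mean * mean)};
}
```

Accumulating x and x² lets the loop compute both statistics in a single pass, at the cost of some numerical cancellation when the variance is small relative to the mean.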
int main() {
  std::default_random_engine randomEngine(1);
  std::uniform_real_distribution
std::vector
for (int MapSize = FirstMapSize; MapSize <= MaxMapSize; MapSize *= StepMultiplier) {
  Spline spline(-1.f, 1.f, MapSize, -1.f, 1.f, MapSize);
  Spline2 spline2(-1.f, 1.f, MapSize, -1.f, 1.f, MapSize);
  Spline3 spline3(-1.f, 1.f, MapSize, -1.f, 1.f, MapSize);
  for (int i = 0; i < spline.GetNPoints(); ++i) {
    const float xyz[3] = {uniform(randomEngine), uniform(randomEngine),
                          uniform(randomEngine)};
    spline.Fill(i, xyz);
    spline2.Fill(i, xyz);
    spline3.Fill(i, xyz);
  }
  std::cout << std::setw(8) << spline.GetMapSize() << std::flush;
  for (EnabledTests i = EnabledTests(0); i < NBenchmarks; ++i) {
VectorizeBuffer
      const auto &p2 = spline2.GetValue(vectorizer.input);
      asm("" ::"m"(p2));
    }
  });
}
for (EnabledTests i = EnabledTests(0); i < NBenchmarks; ++i) {
  if (i != Scalar) {
    runner.printRatio(Scalar, i);
  }
}
std::cout << std::endl;
}
return 0;
}
Listing F.11: Benchmark program for the different spline implementations.
G
THE SUBSCRIPT OPERATOR TYPE TRAIT
In some situations generic code can base a decision on whether a given type implements the subscript operator or not. This kind of query is part of the type traits that are ubiquitous in generic programming. The subscript operator poses a special challenge because a non-member subscript operator is not allowed in C++11. The BOOST operator type traits rely on non-member overloads to detect whether a specific operator is callable for a given type. Therefore, the BOOST type traits library does not contain a type trait for the subscript operator. To implement such a type trait the following problem thus needs to be solved: For a given type T the expression decltype(std::declval
template
template
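Listing G.1 is truncated in this reproduction. The detection idiom the text describes, SFINAE on the expression `std::declval<T &>()[std::declval<I>()]`, can be sketched as follows; the trait name and the partial-specialization machinery are illustrative, not the thesis's verbatim code:

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Primary template: the subscript expression is invalid for (T, I).
template <typename T, typename I, typename = void>
struct has_subscript_operator : std::false_type {};

// Partial specialization: selected whenever T[I] is a valid expression,
// because only then does the decltype below substitute to void.
template <typename T, typename I>
struct has_subscript_operator<
    T, I, decltype(void(std::declval<T &>()[std::declval<I>()]))>
    : std::true_type {};
```

The `decltype(void(...))` trick turns any valid subscript expression into the type void so that the specialization matches the primary template's default argument; an invalid expression causes a substitution failure and falls back to std::false_type.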
THE ADAPTSUBSCRIPTOPERATOR CLASS
template
// explicitly enable Base::operator[] because the following would hide it
using Base::operator[];
// forward to non-member subscript_operator function
template <typename I,
          typename = typename std::enable_if<!std::is_arithmetic<
              typename std::decay<I>::type>::value>::type
          // arithmetic types should always use
          // Base::operator[] and never match this one
          >
auto operator[](I &&arg)
    -> decltype(subscript_operator(*this, std::forward<I>(arg))) {
  return subscript_operator(*this, std::forward<I>(arg));
}
// const overload of the above
template
Listing H.1: Generic adaptor class to add the forwarding subscript operator to existing container classes. This enables non-member subscript functionality for the adapted class.
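A reduced, self-contained sketch of the same idea (all names illustrative, no Vc required): the adaptor inherits the container, keeps Base::operator[] for arithmetic indexes, and forwards everything else to a free subscript_operator function found via argument-dependent lookup.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Example non-member subscript: gather the elements at the given indexes.
template <typename C>
std::vector<typename C::value_type>
subscript_operator(C &c, const std::vector<std::size_t> &indexes) {
  std::vector<typename C::value_type> r;
  for (std::size_t i : indexes) r.push_back(c[i]);
  return r;
}

template <typename Container> struct Adapted : Container {
  using Container::operator[];  // keep the container's own subscript
  // Only non-arithmetic index types match this overload; ints etc. keep
  // going to Container::operator[].
  template <typename I,
            typename = typename std::enable_if<!std::is_arithmetic<
                typename std::decay<I>::type>::value>::type>
  auto operator[](I &&i)
      -> decltype(subscript_operator(*this, std::forward<I>(i))) {
    return subscript_operator(*this, std::forward<I>(i));
  }
};
```

Because the forwarding overload is SFINAE-disabled for arithmetic arguments, `v[2]` still resolves to the container's own subscript while `v[indexVector]` performs the gather.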
I
SUPPORTING CUSTOM DIAGNOSTICS AND SFINAE
The following is a reproduction of a paper that was submitted to the C++ committee and discussed at the Urbana 2014 meeting. It is relevant for the interface design described in Section 4.5. Committee feedback was that there is general interest in solving the problem, but at this point it is not clear enough (to the committee) whether the improved diagnostics from concepts [77] will already solve the issue in an acceptable manner. To go forward, this paper needs more motivational examples and possibly a more general solution.
I.1 PROBLEM
Static assertions are a very useful tool to improve error messages when a library interface is used incorrectly. Consider the addition operator in Listing I.1. This has the following effects:
1. operator+ is a viable function for portable and unportable uses of the addition operator.
2. The program is ill-formed if an unportable type combination is used.
3. The compiler will output the second argument to the static_assert as custom diagnostic output if an unportable type combination is used.
class simd_float;
template
template
template
4. A SFINAE (or concept) check for the usability of the addition operator for an unportable type combination is impossible to implement.
The has_addition_operator trait in Listing I.2 will not tell whether a call to operator+(T, U) leads to a failed static assertion. This depends on the rules of substitution failure: The expression decltype(std::declval
1. viable for incorrect use
2. ill-formed for incorrect use
3. custom diagnostics output for incorrect use
1 It would be possible to modify Listing I.1 such that the return type is invalid, but then the function would not be viable for unportable type combinations and the static_assert would never fail…
4. SFINAE or Concepts can check for usability, not only viability
The custom-diagnostics and SFINAE features are mutually exclusive.
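The conflict can be demonstrated in a few lines (a sketch with illustrative names, not the paper's verbatim listings): operator+ is declared for every right-hand-side type and static_asserts in its body, yet a decltype-based trait still reports the unportable combination as viable.

```cpp
#include <type_traits>
#include <utility>

struct simd_float {};

// Viable for every T; the portability check lives in the function body.
template <typename T> simd_float operator+(simd_float, T) {
  static_assert(std::is_same<T, simd_float>::value,
                "only simd_float + simd_float is portable");
  return {};
}

// Detection trait in the style of Listing I.2.
template <typename T, typename U, typename = void>
struct has_addition_operator : std::false_type {};

template <typename T, typename U>
struct has_addition_operator<
    T, U, decltype(void(std::declval<T>() + std::declval<U>()))>
    : std::true_type {};
```

Because the unevaluated operand only instantiates the declaration of operator+, never its body, the static_assert is not triggered during the trait's substitution: the trait answers "viable" even for combinations whose actual use is ill-formed, which is exactly the mutual exclusion described above.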
I.2 POSSIBLE SOLUTIONS
Approaches:
1. Introduce a new type trait (which requires compiler support) that can detect whether a given expression fails a static assertion.
2. Extend concepts to do “negative matching” to enable customized diagnostics. Thus, a call to simd_float + double would match the second overload in Listing I.4 as best viable function and make such a program ill-formed with the string after error used for diagnostics. In a template parameter substitution this would lead to a failure and thus enable has_addition_operator to check for usability of the addition operator.
template
3. Introduce an additional check at the end of overload resolution [16, §13.3], in the same spirit as the check for accessibility:
§13.3 [over.match] If a best viable function exists and is unique, overload resolution succeeds and produces it as the result. Otherwise overload resolution fails and the invocation is ill-formed. When overload resolution succeeds, and the best viable function is not accessible (Clause 11) in the context in which it is used or template instantiation would lead to a failed static_assert, the program is ill-formed.
The intention is to trigger a substitution failure when a static_assert would fail and thus enable SFINAE.
4. Extend the delete expression for deleted functions [16, §8.4.3] to accept an optional string argument that will be used for diagnostics output.
§8.4.1 [dcl.fct.def.general] Function definitions have the form function-definition:
attribute-specifier-seqopt decl-specifier-seqopt declarator virt-specifier-seqopt function-body
function-body:
template
ctor-initializeropt compound-statement
function-try-block
deleted-definition
= default ;
= delete ;
deleted-definition:
= delete ( string-literal ) ;
= delete ;
§8.4.3 [dcl.fct.def.delete] A function definition of the form:
attribute-specifier-seqopt decl-specifier-seqopt declarator virt-specifier-seqopt = delete ;
deleted-definition
is called a deleted definition. A function with a deleted definition is also called a deleted function. A program that refers to a deleted function implicitly or explicitly, other than to declare it, is ill-formed, and the resulting diagnostic message (1.4) shall include the text of the string-literal, if one is given, except that characters not in the basic source character set (2.3) are not required to appear in the diagnostic message. [ Note: This includes calling the function implicitly or explicitly and forming a pointer or pointer-to-member to the function. It applies even for references in expressions that are not potentially-evaluated. If a function is overloaded, it is referenced only if the function is selected by overload resolution. — end note ]
With this solution the deleted function in Listing I.3 can be extended as shown in Listing I.5.
I.3 EVALUATION

Approaches 1 and 3 require instantiation of the constant part of the function body to evaluate the static_assert during overload resolution, before actually selecting the function. This seems novel territory for such a small feature. Similarly, approach 2 requires extensions to the concepts design, which is currently not even a working draft for a Technical Specification. Approach 4 can be implemented as a fairly small extension to the current check whether a function is deleted at the end of overload resolution. The issue of integrating string-literals from program source code into diagnostic compiler output was already solved for static_assert. The recommendation is to proceed with approach 4.

ACRONYMS
ABI Application Binary Interface
ALICE A Large Ion Collider Experiment
ALU Arithmetic Logic Unit
API official: Application Programming Interface better: Application Programmer Interface
AVX Advanced Vector Extensions
CBM Compressed Baryonic Matter; experiment at FAIR
CERN European Organization for Nuclear Research; in Geneva
CPU Central Processing Unit
FAIR Facility for Antiproton and Ion Research; in Darmstadt
FLOP floating-point operation
FMA Fused Multiply-Add
GCC GNU Compiler Collection
Geant GEometry ANd Tracking
Geant-V GEometry ANd Tracking Vector (Prototype) project
HLT High-Level Trigger
ICC Intel C++ Compiler
ILP Instruction Level Parallelism
ITS Inner Tracking System
LHC Large Hadron Collider
LRU Least Recently Used
LTO Link Time Optimization
MIC Many Integrated Core
MMX Multimedia Extensions
PE Processing Element
POD Plain Old Data
RHIC Relativistic Heavy Ion Collider; at Brookhaven National Laboratory
SFINAE Substitution Failure Is Not An Error
SIMD Single Instruction, Multiple Data
SISD Single Instruction, Single Data
SSE Streaming SIMD Extensions
STAR Solenoidal Tracker at RHIC
STL Standard Template Library
TBB Intel Threading Building Blocks
TLB translation look-aside buffer
TPC Time Projection Chamber
TU translation unit

BIBLIOGRAPHY
[1] About | ROOT. The ROOT Team. url: https://root.cern.ch/drupal/content/about (visited on 02/21/2015).
[2] T. Alt et al. “The ALICE high level trigger.” In: Journal of Physics G: Nuclear and Particle Physics 30.8 (2004), S1097. doi: 10.1088/0954-3899/30/8/066.
[3] Gene M. Amdahl. “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities.” In: Proceedings of the April 18-20, 1967, Spring Joint Computer Conference. AFIPS ’67 (Spring). Atlantic City, New Jersey: ACM, 1967, pp. 483–485. doi: 10.1145/1465482.1465560.
[4] Automatic Vectorization. Intel Corporation. 2013. url: http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/cpp-lin/GUID-7D541D6D-4929-4F35-A58D-B67F9A897AA0.htm (visited on 03/20/2013).
[5] Auto-vectorization in GCC. English. Free Software Foundation. 2012. url: http://gcc.gnu.org/projects/tree-ssa/vectorization.html (visited on 03/20/2013).
[6] Jon Louis Bentley. “Multidimensional Binary Search Trees Used for Associative Searching.” In: Communications of the ACM 18.9 (Sept. 1975), pp. 509–517. issn: 0001-0782. doi: 10.1145/361002.361007.
[7] Dirk Brandewie and Rafael J. Wysocki. Intel P-state driver. 2014. url: https://www.kernel.org/doc/Documentation/cpu-freq/intel-pstate.txt (visited on 02/04/2015).
[8] Rene Brun and Fons Rademakers. “ROOT — An object oriented data analysis framework.” In: Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 389.1-2 (Apr. 1997), pp. 81–86. issn: 01689002. doi: 10.1016/S0168-9002(97)00048-X.
[9] Oliver Simon Brüning et al. LHC Design Report. Geneva: CERN, 2004.
[10] Federico Carminati et al. “A concurrent vector-based steering framework for particle transport.” In: Journal of Physics: Conference Series 523.1 (2014). doi: 10.1088/1742-6596/523/1/012004.
[11] Federico Carminati et al. “The Path toward HEP High Performance Computing.” In: Journal of Physics: Conference Series 513.5 (2014). doi: 10.1088/1742-6596/513/5/052006.
[12] ALICE Collaboration. Technical proposal for A Large Ion Collider Experiment at the CERN LHC. Tech. rep. CERN, Dec. 1995.
[13] ALICE HLT Collaboration. ALICE High-Level Trigger Conceptual Design. Tech. rep. CERN, Dec. 2002.
[14] The ALICE Collaboration et al. “The ALICE experiment at the CERN LHC.” In: Journal of Instrumentation 3.08 (2008), S08002. doi: 10.1088/1748-0221/3/08/S08002.
[15] CUDA C Programming Guide. Design Guide. NVIDIA. Aug. 2014. url: http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf (visited on 01/22/2015).
[16] Stefanus Du Toit, ed. Working Draft, Standard for Programming Language C++. ISO/IEC C++ Standards Committee Paper. N3936. 2014. url: http://www.open-std.org/jtc1/sc22/wg21/prot/14882fdis/n3936.pdf.
[17] Pierre Estérie et al. “Boost.SIMD: Generic Programming for Portable SIMDization.” In: Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing. WPMVP ’14. New York, NY, USA: ACM, 2014, pp. 1–8. isbn: 978-1-4503-2653-7. doi: 10.1145/2568058.2568063.
[18] Pierre Estérie et al. “Exploiting Multimedia Extensions in C++: A Portable Approach.” In: Computing in Science & Engineering 14.5 (Sept. 2012), pp. 72–77. issn: 1521-9615. doi: 10.1109/MCSE.2012.96. url: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6320573.
[19] Christian Wolfgang Fabjan et al. ALICE trigger data-acquisition high-level trigger and control system: Technical Design Report. Technical Design Report ALICE. Geneva: CERN, 2004.
[20] Joel Falcou. NT2: The Numerical Template Toolbox. 2007. url: http://nt2.sourceforge.net/ (visited on 10/19/2011).
[21] Joel Falcou and Jocelyn Serot. “E.V.E., An Object Oriented SIMD Library.” In: Scalable Computing: Practice and Experience 6.4 (2005), pp. 31–41.
[22] R.A. Finkel and J.L. Bentley. “Quad trees a data structure for retrieval on composite keys.” In: Acta Informatica 4 (1974), pp. 1–9. issn: 0001-5903. doi: 10.1007/BF00288933.
[23] Yuri Fisyak et al. Track Reconstruction in the STAR TPC with a CA Based Approach. Tech. rep. GSI, 2010.
[24] M. Flynn. “Some Computer Organizations and Their Effectiveness.” In: Computers, IEEE Transactions on C-21.9 (1972), pp. 948–960. issn: 0018-9340. doi: 10.1109/TC.1972.5009071.
[25] Agner Fog. C++ vector class library. 2014. url: http://www.agner.org/optimize/vectorclass.pdf.
[26] Agner Fog. Instruction tables. Manual. Copenhagen: Copenhagen University College of Engineering, 2012. url: http://www.agner.org/optimize/.
[27] Tim Foley and Jeremy Sugerman. “KD-tree Acceleration Structures for a GPU Raytracer.” In: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware. HWWS ’05. New York, NY, USA: ACM, 2005, pp. 15–22. isbn: 1-59593-086-8. doi: 10.1145/1071866.1071869.
[28] Jerome H. Friedman, Jon Louis Bentley, and Raphael Ari Finkel. “An Algorithm for Finding Best Matches in Logarithmic Expected Time.” In: ACM Trans. Math. Softw. 3.3 (1977), pp. 209–226. issn: 0098-3500. doi: 10.1145/355744.355745.
[29] Rudolf Frühwirth. Data analysis techniques for high-energy physics. 2nd ed. Cambridge: Univ. Pr., 2000. isbn: 9780521635486.
[30] Robert Geva and Clark Nelson. Language Extensions for Vector level parallelism. ISO/IEC C++ Standards Committee Paper. N3831. 2014. url: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3831.pdf.
[31] Andrei Gheata et al. “Rethinking particle transport in the many-core era towards GEANT 5.” In: Journal of Physics: Conference Series 396.2 (2012). doi: 10.1088/1742-6596/396/2/022014.
[32] Sergey Gorbunov, Matthias Kretz, and David Rohr. Fast Cellular Automaton tracker for the ALICE High Level Trigger. Tech. rep. GSI, 2009, p. 348.
[33] Sergey Gorbunov et al. “Fast SIMDized Kalman filter based track fit.” In: Computer Physics Communications 178 (2008), pp. 374–383.
[34] S. Gorbunov et al. “ALICE HLT High Speed Tracking on GPU.” In: Nuclear Science, IEEE Transactions on 58.4 (Aug. 2011), pp. 1845–1851. issn: 0018-9499. doi: 10.1109/TNS.2011.2157702.
[35] Pablo Halpern. An Abstract Model of Vector Parallelism. ISO/IEC C++ Standards Committee Paper. N4238. 2014. url: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4238.pdf.
[36] Jared Hoberock. Working Draft, Technical Specification for C++ Extensions for Parallelism. ISO/IEC C++ Standards Committee Paper. N4071. 2014. url: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4071.htm.
[37] Daniel Reiter Horn et al. “Interactive k-d Tree GPU Raytracing.” In: Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games. ACM, 2007, pp. 167–174. doi: 10.1145/1230100.1230129.
[38] ILLIAC IV Systems Characteristics and Programming Manual. Burroughs Corporation. May 1972. 364 pp.
[39] Intel® 64 and IA-32 Architectures Optimization Reference Manual. Intel Corporation. Jan. 2011. url: http://www.intel.com/products/processor/manuals/.
[40] Intel® 64 and IA-32 Architectures Optimization Reference Manual. Intel Corporation. Sept. 2014. url: http://www.intel.com/products/processor/manuals/.
[41] Intel® 64 and IA-32 Architectures Software Developer’s Manual. Intel Corporation. June 2013. url: http://www.intel.com/products/processor/manuals/.
[42] Intel® C++ Compiler XE 13.1 User and Reference Guides — simd. Intel Corporation. Dec. 2013. url: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-1EA04294-988E-4152-B584-B028FD6FAC48.htm (visited on 12/10/2014).
[43] Intel® Cilk™ Plus Language Extension Specification. Intel Corporation. Sept. 2013. url: https://www.cilkplus.org/sites/default/files/open_specifications/Intel_Cilk_plus_lang_spec_1.2.htm.
[44] Intel® Cilk™ Plus Language Extension Specification Version 1.2. Intel Corporation. Sept. 2013. url: http://www.cilkplus.org/sites/default/files/open_specifications/Intel_Cilk_plus_lang_spec_1.2.htm (visited on 12/09/2014).
[45] Intel® Threading Building Blocks. Intel Corporation. Aug. 2011. 381 pp. url: http://threadingbuildingblocks.org/uploads/81/91/Latest%20Open%20Source%20Documentation/Reference.pdf (visited on 10/20/2011).
[46] Intel® Xeon Phi™ Coprocessor Instruction Set Architecture Reference Manual. Intel Corporation. 2012.
[47] Introducing NEON™. Development Article. ARM Limited, 2009. 18 pp. url: http://infocenter.arm.com/help/topic/com.arm.doc.dht0002a/DHT0002A_introducing_neon.pdf.
[48] ISO/IEC 14882:2011. Information technology — Programming languages — C++. Standard. ISO/IEC JTC 1/SC 22, Sept. 1, 2011. url: http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=50372.
[49] ISO/IEC PDTS 19570. C++ Extensions for Parallelism. Standard Draft. ISO/IEC JTC 1/SC 22, 2014. url: http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=65241.
[50] Itanium C++ ABI. CodeSourcery et al. url: http://www.codesourcery.com/cxx-abi/ (visited on 12/12/2014).
[51] R.E. Kalman. “A new approach to linear filtering and prediction problems.” In: Journal of Basic Engineering. Vol. 82. 1960, pp. 35–45.
[52] I. Kisel, I. Kulakov, and M. Zyzak. “Standalone First Level Event Selection Package for the CBM Experiment.” In: Nuclear Science, IEEE Transactions on 60.5 (Oct. 2013), pp. 3703–3708. issn: 0018-9499. doi: 10.1109/TNS.2013.2265276.
[53] Ivan Kisel, Matthias Kretz, and Igor Kulakov. Scalability of the SIMD Kalman Filter Track Fit Based on the Vector Classes. Tech. rep. GSI, 2009, p. 56.
[54] Donald E. Knuth. The Art of Computer Programming 3. Sorting and Searching. 2nd. Vol. 3. Addison Wesley, 1998, p. 780. isbn: 978-0201896855.
[55] Matthias Kretz. “Efficient Use of Multi- and Many-Core Systems with Vectorization and Multithreading.” Diplomarbeit. University of Heidelberg, Dec. 2009.
[56] Matthias Kretz. SIMD Types: The Mask Type & Write-Masking. ISO/IEC C++ Standards Committee Paper. N4185. 2014. url: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4185.pdf.
[57] Matthias Kretz. SIMD Types: The Vector Type & Operations. ISO/IEC C++ Standards Committee Paper. N4184. 2014. url: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4184.pdf.
[58] Matthias Kretz. SIMD Vector Types. ISO/IEC C++ Standards Committee Paper. N3759. 2013. url: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3759.html.
[59] Matthias Kretz and Volker Lindenstruth. “Vc: A C++ library for explicit vectorization.” In: Software: Practice and Experience (2011). issn: 1097-024X. doi: 10.1002/spe.1149.
[60] Matthias Kretz and Jens Maurer. Supporting Custom Diagnostics and SFINAE. ISO/IEC C++ Standards Committee Paper. N4186. 2014. url: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4186.pdf.
[61] Victor Luchangco, Jens Maurer, Michael Wong, et al. Standard Wording for Transactional Memory Support for C++. ISO/IEC C++ Standards Committee Paper. N3999. 2014. url: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3999.pdf.
[62] macstl. Pixelglow Software. Sept. 2005. url: http://www.pixelglow.com/macstl/ (visited on 01/25/2011).
[63] Malloc-Examples. Free Software Foundation. url: http://www.gnu.org/software/libc/manual/html_node/Malloc-Examples.html (visited on 01/05/2015).
[64] Michael Matz et al. System V Application Binary Interface. Nov. 2014. url: http://www.x86-64.org/documentation_folder/abi-0.99.pdf.
[65] Aaftab Munshi, ed. The OpenCL Specification. Specification. Version 1.2. Khronos Group, Nov. 14, 2012. url: https://www.khronos.org/registry/cl/specs/opencl-1.2.pdf.
[66] J von Neumann. “First Draft of a Report on the EDVAC.” In: Annals of the History of Computing, IEEE (1993). url: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=238389.
[67] Chris J. Newburn et al. “Intel’s Array Building Blocks: A retargetable, dynamic compiler and embedded language.” In: Proceedings - International Symposium on Code Generation and Optimization, CGO 2011. 2011, pp. 224–235.
[68] OpenMP Architecture Review Board. OpenMP Application Program Interface Version 4.0. July 2013. url: http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf.
[69] Pentium® Processor with MMX™ Technology. Datasheet. Order Number: 243185.004. Intel Corporation. June 1997. 51 pp. url: http://download.intel.com/design/archives/processors/mmx/docs/24318504.pdf (visited on 07/05/2013).
[70] PowerPC Microprocessor Family: Vector / SIMD Multimedia Extension Technology Programming Environments Manual. 2.07c. IBM Corporation. 2006. 329 pp. url: https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/6B4C13893761AAB885257B030066C0D9.
[71] Torvald Riegel. Light-Weight Execution Agents Revision 3. ISO/IEC C++ Standards Committee Paper. N4156. 2014. url: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4156.pdf.
[72] David Rohr. “On development, feasibility, and limits of highly efficient CPU and GPU programs in several fields.” PhD thesis. 2014, p. 254. eprint: urn:nbn:de:hebis:30:3-343772.
[73] Jeff Snyder and Chandler Carruth. Call for Compile-Time Reflection Proposals. ISO/IEC C++ Standards Committee Paper. N3814. 2013. url: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3814.html.
[74] Software Optimization Guide for AMD Family 10h and 12h Processors. Publication # 40546. Advanced Micro Devices. Dec. 2010. 346 pp. url: http://developer.amd.com/documentation/guides/Pages/default.aspx (visited on 01/25/2011).
[75] Bjarne Stroustrup. Stroustrup: C++ Style and Technique FAQ. 2013. url: http://www.stroustrup.com/bs_faq2.html#overload-dot (visited on 05/08/2013).
[76] Robert Strzodka. “GPU Computing Gems Jade Edition.” In: GPU Computing Gems Jade Edition. Elsevier, 2012, pp. 429–441. isbn: 9780123859631. doi: 10.1016/B978-0-12-385963-1.00031-9.
[77] Andrew Sutton, Bjarne Stroustrup, and Gabriel Dos Reis. Concepts Lite Specification. ISO/IEC C++ Standards Committee Paper. N3819. 2013. url: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3819.pdf.
[78] Andrew S. Tanenbaum. Structured Computer Organization. Upper Saddle River, NJ: Prentice Hall, 1999, p. 669. isbn: 0-13-020435-8.
[79] The Open Group Base Specifications, Issue 6, IEEE Std 1003.1. Specification. IEEE and The Open Group, 2004. url: http://pubs.opengroup.org/onlinepubs/009695399/.
[80] Tutorial: Array Notation | Cilk Plus. Intel Corporation. url: https://www.cilkplus.org/tutorial-array-notation (visited on 01/11/2014).
[81] David Vandevoorde and Nicolai M Josuttis. C++ Templates: The Complete Guide. Boston, MA: Addison-Wesley, 2003, p. 528. isbn: 0201734842. url: http://proquest.safaribooksonline.com/0201734842.
[82] Todd Veldhuizen. “Expression Templates.” In: C++ Report 7.5 (1995), pp. 26–31.
[83] Haichuan Wang et al. “Simple, Portable and Fast SIMD Intrinsic Programming: Generic Simd Library.” In: Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing. WPMVP ’14. New York, NY, USA: ACM, 2014, pp. 9–16. isbn: 978-1-4503-2653-7. doi: 10.1145/2568058.2568059.
[84] Sandro Wenzel et al. “Vectorising the detector geometry to optimise particle transport.” In: Journal of Physics: Conference Series 513.5 (2014). doi: 10.1088/1742-6596/513/5/052038.
[85] Sergey Zubkov. std::valarray::operator=. 2014. url: http://en.cppreference.com/w/cpp/numeric/valarray/operator= (visited on 12/15/2014).

INDEX
?:, 84
𝒮_T, 4
𝒲_T, 4, 109
ABI, 20, 99
aliasing, 19
ALICE, 155
alignment, 7, 20, 42
AltiVec, 8
ALU, 5
Amdahl's law, 151, 164, 165, 171
AoS, 113, 114, 126, 176, 239
AoVS, 113, 114, 139, 167, 239
array machine, 5
array notation, 55, 74
auto-vectorization, 15, 16, 18, 19, 23, 130, 161, 164, 166
AVX, 4, 9
benchmark, 160
binary operators, 45
bitwise operators, 81
branch prediction, 11
branching, 18, 164
broadcast, 37
builtin vector types, 24
cache, 151, 160, 161, 164
CBM, 166
Cilk Plus, 23, 55, 56, 74, 84
clang, 24
comparison operators, 81
complexity, 151, 160
compound assignment, 50
conditional operator, 84
container, 111
conversion, 38, 95
converting loads, 41
converting stores, 41
coordinate transformation, 156
countable, 15, 18
data-parallelism, 3
default vector type, 68
deinterleave, 160, 164
EntryReference, 33, 78
EntryType, 33, 78
exception, 21, 75, 110
expression templates, 96
FMA, 4
for_each, 110
gather, 6, 7, 9, 52, 156, 160, 164, 165
GCC, 16, 24, 131, 161
Geant-V, 167
HLT, 155
horizontal vectorization, 114, 117, 156, 159, 164, 165
iif, 84
ILP, 11, 96, 134
implicit masking, 75
IndexesFromZero(), 34
internal data, 70
intrinsics, 24
ITS, 155
Kalman-filter, 85, 166
KF Particle, 167
LHC, 155
load, 6, 40, 109
loop-vectorization, 15, 16, 21
LTO, 20
mask initialization, 79
mask load/store, 80
mask type, 76, 98
Mask<T>, …
…
SFINAE, 45, 122, 213, 217–219
shuffle, 6
SIMD, 3, 5
SIMD lane, 5
simd_cast, 39, 40, 159
SimdArray, 94, 158, 160
simdize, 116
…
Vector<T>, …
…
Xeon Phi, 9
GERMAN SUMMARY
Data-parallel programming is more important than ever before, because serial performance has stagnated. All major computer architectures have extended their support for general-purpose computing with explicit data-parallel execution, and further extensions have been announced. These advances in parallel hardware, however, have not been accompanied by the extensions to the established programming languages that they require. Software developers have thus not been given the means to express the inherent data-parallelism of their algorithms explicitly. CPU and GPU vendors have therefore created new languages, language extensions, and dialects to enable data-parallel programming. These solutions carry several drawbacks, though, and are not the right tool for every task. This dissertation therefore develops a type-based solution for the C++ programming language.
SIMD
The idea of data-parallel execution stems from the observation that compute-intensive algorithms frequently process different data with the same operations. This observation led to the development of computers that process multiple data with a single instruction (Flynn [3] coined the term SIMD² for this). On the software side, however, the data-parallelism present in an algorithm is not expressed as parallel execution at all, because programming languages typically offer only loop statements for processing different data with the same operations. Loops, though, are defined as serial iterations over an induction variable. The language is therefore built on inherently serial semantics, which ISO/IEC 14882:2011 [9, §1.9 p13] characterizes via the sequenced before rule. The following picture shows the conceptual transformation needed to turn the serial execution of four additions into a parallel execution:
[Figure: the four serial additions a0 + b0, a1 + b1, a2 + b2, a3 + b3 are transformed (↷) into a single vector addition that computes all four sums at once.]
2 Single Instruction, Multiple Data
Both variants can be executed on a current mainstream processor. The left variant requires four instructions, while the right one needs only a single SIMD instruction. Since on current processors the execution cost of a scalar instruction is largely the same as that of a SIMD instruction, the right variant is four times as efficient.
vectorization
This transformation is also called vectorization, because scalar variables (scalar registers) are transformed into vector variables (vector registers). The following illustration shows the transformation of a loop as it is performed by an automatic vectorizer (auto-vectorization):
    void func(float *d) {
        for (int i = 0; i < N; ++i) {
            d[i] += 1.f;
        }
    }

    i = 0..N: d[i] + 1   ↷   i = 0..N/4:    d[4i] + 1, d[4i+1] + 1, d[4i+2] + 1, d[4i+3] + 1
                             i = N-N%4..N:  d[i] + 1
A loop over N additions is thus translated into a vectorized loop, and possibly scalar prologue and epilogue loops. This is the basic strategy of automatic vectorization for most data-parallel problems. Manual vectorization typically proceeds analogously, since most of the data-parallelism in scalar code is expressed in loops. The developer, however, additionally has the option of adapting data structures, which a compiler, at least with C/C++, is not allowed to do. One could certainly take the position that automatic vectorization works well enough and that manual vectorization is too laborious. In such a consideration one should keep in mind, however, how much more efficient vectorized processing can be, and whether forgoing it is justifiable when in doubt. After all, automatic vectorization usually also requires manual effort and deliberate restrictions during development. The following figure shows the maximum number of single-precision floating-point operations (FLOP) per cycle for SIMD instructions and scalar instructions on x86 processors (maximum of AMD and Intel), plotted over the release years of new microarchitectures [4, 8, 10]. Only a single processor core (or thread) is considered:
[Figure: peak single-precision FLOP per cycle of a single x86 processor core, SIMD vs. scalar instructions, plotted over microarchitecture release years from 1999 to 2014; the SIMD curve grows to 32 FLOP/cycle while the scalar curve stays far below.]
Since an application that consumes a lot of compute time can save a lot of money by computing more efficiently (or use less energy by returning to idle sooner), very few developers today can afford to ignore the possible efficiency gains from SIMD.
SIMD DATA TYPES
The following figure shows how programming languages abstract the way a computer works:
[Figure: the programming language abstracts the computer; a fundamental arithmetic type abstracts the scalar registers & instructions; a SIMD type abstracts the vector registers & instructions.]
The computer's processor is controlled via registers and instructions. Most languages, and C++ in particular, use data types and their associated operators to abstract registers and instructions. Since the introduction of SIMD registers and instructions, however, there has been a gap in this abstraction.
arithmetic types

It therefore makes sense to define new data types that map the vector registers and instructions into the language. The class template Vector<T> provides such types, with the following properties:
• The value of an object of type Vector<T> consists of 𝒲_T scalar values of type T.
• The scalar entries of the object are accessible as lvalue references.
• The number of scalar entries (𝒲_T) is provided as a constant expression.

• Operators that work for T can be applied to Vector<T> with the same semantics per entry.
• Each scalar result of an operation on Vector<T> equals the result of the same operation applied to the corresponding scalar values of type T.
• The syntax and semantics of the fundamental arithmetic data types thus carry over directly to Vector<T>.
• The compiler is able to apply optimizations just as it does to scalar operations.
masks
Analogously to Vector<T>, the class template Mask<T> abstracts the result of data-parallel comparisons: comparing two Vector<T> objects yields one boolean entry per scalar entry. Such masks express conditional execution in SIMD code, where an assignment is applied only to the entries selected by the mask (write-masking).
higher abstractions
Building on Vector<T> and Mask<T>, higher abstractions can be implemented, such as SimdArray<T, N> for vectors with a user-chosen number of entries and simdize<T>, which transforms structured data types into their vectorized form. The latter leads from the AoS (array of structs) and SoA (struct of arrays) layouts to the AoVS (array of vectorized structs) layout:
[Figure: memory layouts of eight objects with members a–d. AoS stores a0…d0, a1…d1, …, a7…d7; SoA stores a0…7, b0…7, c0…7, d0…7; AoVS stores a0…3, b0…3, c0…3, d0…3, followed by a4…7, b4…7, c4…7, d4…7.]
APPLICATIONS
The Vc library is an implementation of these data types; it is freely available to everyone as free software and is by now used worldwide. I developed the Vc library in order to research data-parallel types and as a solution for explicitly data-parallel programming. To demonstrate the relevance of the library, two examples are discussed in the following.
searching
Searching is a common task in computer applications and occurs in many different variants. A search algorithm's execution path almost always depends on the data. Consequently, neither the compiler nor the algorithm can determine, before a loop in the search starts, how many iterations it will need. For loop-based vectorizers this is a knockout criterion. Type-based vectorization, however, can be applied to search algorithms fairly easily. In the worst case a few comparisons too many are executed, which does not affect the result. Accordingly, the std::find algorithm [9, §25.2.5] can be implemented with the help of the Vc library. The comparison of std::find and "Vc::find" shows a clear efficiency gain through vectorization:
[Figure: time per find call (clock cycles) and speedup of "Vc::find" relative to std::find, plotted over the haystack size (number of objects, 2^3 to 2^21); the vectorized find is consistently faster, and the L1, L2, and L3 cache boundaries are visible in the curves.]
Here, thanks to the use of simdize<T>, …
A classical k-d tree stores one (scalar) discriminator value per node. Consequently, this tree structure cannot be traversed in a data-parallel fashion directly. The k-d tree data structure can be modified, however, so that 𝒲_T discriminators are stored in each node. The INSERT and FINDNEAREST algorithms must be adapted accordingly. The interfaces of the algorithms remain identical to the classical, scalar k-d tree variant: INSERT inserts an entry into the tree, and FINDNEAREST finds the one entry in the tree for which the distance function is minimal. The following figure shows a comparison of different implementations of nearest-neighbor searching:
[Figure: time per findNearest call (clock cycles) for balanced trees with Point entries, plotted over the tree size (number of entries, 10^0 to 10^4), comparing a scalar implementation, vectorized nodes, vectorized search keys, and a vectorized linear search.]
The first curve (Scalar) shows a classical, scalar implementation of the k-d tree FINDNEAREST algorithm. The second curve (Vector nodes) shows the variant vectorized with the help of simdize<T> …
alice

The ALICE experiment at CERN uses Vc for an efficient coordinate transformation, in which a two-dimensional point is transformed into a three-dimensional point via 15 elementary spline functions. The computation of the 15 spline functions is parallelized with Vc in the ALICE implementation. Since a version of Vc is still deployed there that does not yet support SimdArray<T, N> …
[Figure: time per transformation (clock cycles) and speedup relative to the scalar implementation, plotted over the number of data points (10^1 to 10^5), for the Scalar, Alice (SSE), Alice (AVX), Float4 (SSE), and Float4 (AVX) spline implementations; the vectorized variants reach a speedup of about 2.]
This shows that, thanks to Vc, efficiency can be improved by a factor of two over the scalar implementation. For the application this makes a big difference, because the transformation has to be performed for every detector measurement. Less computing time at this point means lower resource requirements and energy consumption, which ultimately translates into lower procurement and operating costs.
CONCLUSION
Nowadays developers may no longer neglect SIMD hardware. To use it efficiently, however, a developer must acquire an understanding of and an intuition for this form of data-parallel processing. Only then can a developer design the necessary data structures and algorithms. The vector and mask types of the Vc library provide exactly this programming interface, and at the same time they can help developers learn how data-parallel execution works. The classes Vector<T> and Mask<T> thereby form the abstraction of the vector registers and instructions.
REFERENCES
[1] Gene M. Amdahl. "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities." In: Proceedings of the April 18-20, 1967, Spring Joint Computer Conference. AFIPS '67 (Spring). Atlantic City, New Jersey: ACM, 1967, pp. 483–485. doi: 10.1145/1465482.1465560.
[2] Jon Louis Bentley. "Multidimensional Binary Search Trees Used for Associative Searching." In: Communications of the ACM 18.9 (Sept. 1975), pp. 509–517. issn: 0001-0782. doi: 10.1145/361002.361007.
[3] M. Flynn. "Some Computer Organizations and Their Effectiveness." In: IEEE Transactions on Computers C-21.9 (1972), pp. 948–960. issn: 0018-9340. doi: 10.1109/TC.1972.5009071.
[4] Agner Fog. Instruction tables. Manual. Copenhagen: Copenhagen University College of Engineering, 2012. url: http://www.agner.org/optimize/.
[5] Tim Foley and Jeremy Sugerman. "KD-tree Acceleration Structures for a GPU Raytracer." In: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware. HWWS '05. New York, NY, USA: ACM, 2005, pp. 15–22. isbn: 1-59593-086-8. doi: 10.1145/1071866.1071869.
[6] Jerome H. Friedman, Jon Louis Bentley, and Raphael Ari Finkel. "An Algorithm for Finding Best Matches in Logarithmic Expected Time." In: ACM Trans. Math. Softw. 3.3 (1977), pp. 209–226. issn: 0098-3500. doi: 10.1145/355744.355745.
[7] Daniel Reiter Horn et al. "Interactive k-d Tree GPU Raytracing." In: Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games. ACM, 2007, pp. 167–174. doi: 10.1145/1230100.1230129.
[8] Intel® 64 and IA-32 Architectures Software Developer's Manual. Intel Corporation. June 2013. url: http://www.intel.com/products/processor/manuals/.
[9] ISO/IEC 14882:2011. Information technology — Programming languages — C++. Standard. ISO/IEC JTC 1/SC 22, Sept. 1, 2011. url: http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=50372.
[10] Software Optimization Guide for AMD Family 10h and 12h Processors. Publication #40546. Advanced Micro Devices. Dec. 2010. 346 pp. url: http://developer.amd.com/documentation/guides/Pages/default.aspx (visited on 01/25/2011).
[11] Robert Strzodka. "GPU Computing Gems Jade Edition." In: GPU Computing Gems Jade Edition. Elsevier, 2012, pp. 429–441. isbn: 9780123859631. doi: 10.1016/B978-0-12-385963-1.00031-9.