Extending C++ for Explicit Data-Parallel Programming Via SIMD Vector Types

Total Page:16

File Type:pdf, Size:1020Kb

Extending C++ for Explicit Data-Parallel Programming Via SIMD Vector Types Extending C++ for Explicit Data-Parallel Programming via SIMD Vector Types Dissertation zur Erlangung des Doktorgrades der Naturwissenschaften vorgelegt beim Fachbereich 12 der Johann Wolfgang Goethe-Universität in Frankfurt am Main von Matthias Kretz aus Dieburg Frankfurt (2015) (D 30) vom Fachbereich 12 der Johann Wolfgang Goethe-Universität als Dissertation angenommen. Dekan: Prof. Dr. Uwe Brinkschulte Gutachter: Prof. Dr. Volker Lindenstruth Prof. Dr. Ulrich Meyer Prof. Dr.-Ing. André Brinkmann Datum der Disputation: 16. Oktober 2015 CONTENTS abstract vii publications ix acknowledgments xi I Introduction 1 simd 3 1.1 The SIMD Idea . 3 1.2 SIMD Computers . 5 1.3 SIMD Operations . 6 1.4 Current SIMD Instruction Sets . 8 1.5 The future of SIMD . 10 1.6 Interaction With Other Forms of Parallelism . 11 1.7 Common Optimization Challenges . 12 1.8 Conclusion . 14 2 loop-vectorization 15 2.1 Definitions . 15 2.2 Auto-Vectorization . 18 2.3 Explicit Loop-Vectorization . 21 3 introduction to simd types 23 3.1 The SIMD Vector Type Idea . 23 3.2 SIMD Intrinsics . 24 3.3 Benefits of a SIMD Vector Type . 25 3.4 Previous Work on SIMD Types . 26 3.5 Other Work on SIMD Types . 27 II Vc: A C++ Library for Explicit Vectorization 4 a data-parallel type 31 4.1 The Vector<T> Class Template . 32 4.2 SIMD Vector Initialization . 35 4.3 Loads and Stores . 40 4.4 Unary Operators . 44 4.5 Binary Operators . 45 4.6 Compound Assignment Operators . 50 4.7 Subscript Operators . 51 4.8 Gather & Scatter . 52 iii iv contents 4.9 Selecting the Default SIMD Vector Type . 68 4.10 Interface to Internal Data . 70 4.11 Conclusion . 71 5 data-parallel conditionals 73 5.1 Conditionals in SIMD Context . 74 5.2 The Mask<T> Class Template . 76 5.3 Write-Masking . 84 5.4 Masked Gather & Scatter . 87 5.5 Conclusion . 88 6 vector class design alternatives 89 6.1 Discussion . 89 6.2 Conclusion . 92 7 simdarray: valarray as it should have been 93 7.1 std::valarray<T> . 93 7.2 Vc::SimdArray<T, N> . 94 7.3 Uses for SimdArray . 95 7.4 Implementation Considerations . 96 7.5 Mask Types . 98 7.6 Conclusion . 98 8 abi considerations 99 8.1 ABI Introduction . 99 8.2 ABI Relevant Decisions in Vc . 99 8.3 Problem . 100 8.4 Solution Space . 106 8.5 Future Research & Development . 108 9 load/store abstractions 109 9.1 Vectorized STL Algorithms . 109 9.2 Vectorizing Container . 111 9.3 Conclusion . 111 10 automatic type vectorization 113 10.1 Template Argument Transformation . 116 10.2 Adapter Type . 118 10.3 Scalar Insertion and Extraction . 123 10.4 Complete Specification . 124 10.5 Current Limitations . 125 10.6 Conclusion . 126 III Vectorization of Applications 11 nearest neighbor searching 129 11.1 Vectorizing Searching . 130 contents v 11.2 Vectorizing Nearest Neighbor Search . 134 11.3 k-d Tree . 139 11.4 Conclusion . 152 12 vc in high energy physics 155 12.1 ALICE . 155 12.2 Kalman-filter . 166 12.3 Geant-V . 167 12.4 ROOT . 170 12.5 Conclusion . 171 IV Conclusion 13 conclusion 175 V Appendix a working-set benchmark 179 b linear search benchmark 183 c nearest neighbor benchmark 187 d benchmarking 193 e C++ implementations of the k-d tree algorithms 195 e.1 INSERT Algorithm . 195 e.2 FINDNEAREST Algorithm . 197 f alice spline sources 203 f.1 Spline Implementations . 203 f.2 Benchmark Implementation . 209 g the subscript operator type trait 213 h the adaptsubscriptoperator class 215 i supporting custom diagnostics and sfinae 217 i.1 Problem . 217 i.2 Possible Solutions . 219 i.3 Evaluation . 220 acronyms 221 bibliography 223 index 231 german summary 235 ABSTRACT Data-parallel programming is more important than ever since serial performance is stag- nating. All mainstream computing architectures have been and are still enhancing their support for general purpose computing with explicitly data-parallel execution. For CPUs, data-parallel execution is implemented via SIMD instructions and registers. GPU hard- ware works very similar allowing very efficient parallel processing of wide data streams with a common instruction stream. These advances in parallel hardware have not been accompanied by the necessary ad- vances in established programming languages. Developers have thus not been enabled to explicitly state the data-parallelism inherent in their algorithms. Some approaches of GPU and CPU vendors have introduced new programming languages, language extensions, or dialects enabling explicit data-parallel programming. However, it is arguable whether the programming models introduced by these approaches deliver the best solution. In addi- tion, some of these approaches have shortcomings from a hardware-specific focus of the language design. There are several programming problems for which the aforementioned language approaches are not expressive and flexible enough. This thesis presents a solution tailored to the C++ programming language. The concepts and interfaces are presented specifically for C++ but as abstract as possible facilitating adop- tion by other programming languages as well. The approach builds upon the observation that C++ is very expressive in terms of types. Types communicate intention and semantics to developers as well as compilers. It allows developers to clearly state their intentions and allows compilers to optimize via explicitly defined semantics of the type system. Since data-parallelism affects data structures and algorithms, it is not sufficient to en- hance the language’s expressivity in only one area. The definition of types whose opera- tors express data-parallel execution automatically enhances the possibilities for building data structures. This thesis therefore defines low-level, but fully portable, arithmetic and mask types required to build a flexible and portable abstraction for data-parallel program- ming. On top of these, it presents higher-level abstractions such as fixed-width vectors and masks, abstractions for interfacing with containers of scalar types, and an approach for au- tomated vectorization of structured types. The Vc library is an implementation of these types. I developed the Vc library for re- searching data-parallel types and as a solution for explicitly data-parallel programming. This thesis discusses a few example applications using the Vc library showing the real- world relevance of the library. The Vc types enable parallelization of search algorithms and data structures in a way unique to this solution. It shows the importance of using the type system for expressing data-parallelism. Vc has also become an important building block in the high energy physics community. Their reliance on Vc shows that the library and its interfaces were developed to production quality. vii PUBLICATIONS The initial ideas for this work were developed in my Diplom thesis in 2009: [55] Matthias Kretz. “Efficient Use of Multi- and Many-Core Systems with Vectorization and Multithreading.” Diplomarbeit. University of Heidelberg, Dec. 2009 The Vc library idea has previously been published: [59] Matthias Kretz et al. “Vc: A C++ library for explicit vectorization.” In: Software: Practice and Experience (2011). issn: 1097-024X. doi: 10.1002/spe. 1149 An excerpt of Section 12.2 has previously been published in the pre-Chicago mail- ing of the C++ standards committee: [58] Matthias Kretz. SIMD Vector Types. ISO/IEC C++ Standards Committee Paper. N3759. 2013. url: http://www.open-std.org/jtc1/sc22/wg21/ docs/papers/2013/n3759.html Chapters 4 & 5 and Appendix I have previously been published in the pre-Urbana mailing of the C++ standards committee: [57] Matthias Kretz. SIMD Types: The Vector Type & Operations. ISO/IEC C++ Standards Committee Paper. N4184. 2014. url: http://www.open-std. org/jtc1/sc22/wg21/docs/papers/2014/n4184.pdf [56] Matthias Kretz. SIMD Types: The Mask Type & Write-Masking. ISO/IEC C++ Standards Committee Paper. N4185. 2014. url: http://www.open-std. org/jtc1/sc22/wg21/docs/papers/2014/n4185.pdf [60] Matthias Kretz et al. Supporting Custom Diagnostics and SFINAE. ISO/IEC C++ Standards Committee Paper. N4186. 2014. url: http://www.open-std. org/jtc1/sc22/wg21/docs/papers/2014/n4186.pdf ix ACKNOWLEDGMENTS For Appendix I, I received a lot of useful feedback from Jens Maurer, which prompt- ed me to rewrite the initial draft completely. Jens suggested to focus on deleted functions and contributed Section I.3. xi Part I Introduction 1 SIMD Computer science research is different from these more traditional disciplines. Philosophically it differs from the physical sciences because it seeks not to discover, explain, or exploit the natural world, but instead to study the properties of machines of human creation. — Dennis M. Ritchie (1984) This thesis discusses a programming language extension for portable expression of data-parallelism, which has evolved from the need to program modern SIMD1 hardware. Therefore, this first chapter covers the details of what SIMD is, where it originated, what it can do, and some important details of current hardware. 1.1 THE SIMD IDEA The idea of SIMD stems from the observation that there is data-parallelism in al- most every compute intensive algorithm. Data-parallelism occurs whenever one or more operations are applied repeatedly to several values. Typically, programming languages only allow iterating over the data, which leads to strictly serial seman- tics. Alternatively (e.g. with C++), it may also be expressed in terms of a standard algorithm applied to one or more container objects.2 Hardware manufacturers have implemented support for data-parallel execu- tion via special instructions executing the same operation on multiple values (List- ing 1.1). Thus, a single instruction stream is executed but multiple (often 4, 8, or 16) data sets are processed in parallel. This leads to an improved transistors per FLOPs3 ratio4 and hence less power usage and overall more capable CPUs.
Recommended publications
  • RISC-V Vector Extension Webinar II
    RISC-V Vector Extension Webinar II August 3th, 2021 Thang Tran, Ph.D. Principal Engineer Webinar II - Agenda • Andes overview • Vector technology background – SIMD/vector concept – Vector processor basic • RISC-V V extension ISA – Basic – CSR • RISC-V V extension ISA – Memory operations – Compute instructions • Sample codes – Matrix multiplication – Loads with RVV versions 0.8 and 1.0 • AndesCore™ NX27V • Summary Copyright© 2020 Andes Technology Corp. 2 Terminology • ACE: Andes Custom Extension • ISA: Instruction Set Architecture • CSR: Control and Status Register • GOPS: Giga Operations Per Second • SEW: Element Width (8-64) • GFLOPS: Giga Floating-Point OPS • ELEN: Largest Element Width (32 or 64) • XRF: Integer register file • XLEN: Scalar register length in bits (64) • FRF: Floating-point register file • FLEN: FP register length in bits (16-64) • VRF: Vector register file • VLEN: Vector register length in bits (128-512) • SIMD: Single Instruction Multiple Data • LMUL: Register grouping multiple (1/8-8) • MMX: Multi Media Extension • EMUL: Effective LMUL • SSE: Streaming SIMD Extension • VLMAX/MVL: Vector Length Max • AVX: Advanced Vector Extension • AVL/VL: Application Vector Length • Configurable: parameters are fixed at built time, i.e. cache size • Extensible: added instructions to ISA includes custom instructions to be added by customer • Standard extension: the reserved codes in the ISA for special purposes, i.e. FP, DSP, … • Programmable: parameters can be dynamically changed in the program Copyright© 2020 Andes Technology Corp. 3 RISC-V V Extension ISA Basic Copyright© 2020 Andes Technology Corp. 4 Vector Register ISA • Vector-Register ISA Definition: − All vector operations are between vector registers (except for load and store).
    [Show full text]
  • Thriving in a Crowded and Changing World: C++ 2006–2020
    Thriving in a Crowded and Changing World: C++ 2006–2020 BJARNE STROUSTRUP, Morgan Stanley and Columbia University, USA Shepherd: Yannis Smaragdakis, University of Athens, Greece By 2006, C++ had been in widespread industrial use for 20 years. It contained parts that had survived unchanged since introduced into C in the early 1970s as well as features that were novel in the early 2000s. From 2006 to 2020, the C++ developer community grew from about 3 million to about 4.5 million. It was a period where new programming models emerged, hardware architectures evolved, new application domains gained massive importance, and quite a few well-financed and professionally marketed languages fought for dominance. How did C++ ś an older language without serious commercial backing ś manage to thrive in the face of all that? This paper focuses on the major changes to the ISO C++ standard for the 2011, 2014, 2017, and 2020 revisions. The standard library is about 3/4 of the C++20 standard, but this paper’s primary focus is on language features and the programming techniques they support. The paper contains long lists of features documenting the growth of C++. Significant technical points are discussed and illustrated with short code fragments. In addition, it presents some failed proposals and the discussions that led to their failure. It offers a perspective on the bewildering flow of facts and features across the years. The emphasis is on the ideas, people, and processes that shaped the language. Themes include efforts to preserve the essence of C++ through evolutionary changes, to simplify itsuse,to improve support for generic programming, to better support compile-time programming, to extend support for concurrency and parallel programming, and to maintain stable support for decades’ old code.
    [Show full text]
  • Vectorization Optimization
    Vectorization Optimization Klaus-Dieter Oertel Intel IAGS HLRN User Workshop, 3-6 Nov 2020 Optimization Notice Copyright © 2020, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Changing Hardware Impacts Software More Cores → More Threads → Wider Vectors Intel® Xeon® Processor Intel® Xeon Phi™ processor 5100 5500 5600 E5-2600 E5-2600 E5-2600 Platinum 64-bit E5-2600 Knights Landing series series series V2 V3 V4 8180 Core(s) 1 2 4 6 8 12 18 22 28 72 Threads 2 2 8 12 16 24 36 44 56 288 SIMD Width 128 128 128 128 256 256 256 256 512 512 High performance software must be both ▪ Parallel (multi-thread, multi-process) ▪ Vectorized *Product specification for launched and shipped products available on ark.intel.com. Optimization Notice Copyright © 2020, Intel Corporation. All rights reserved. 2 *Other names and brands may be claimed as the property of others. Vectorize & Thread or Performance Dies Threaded + Vectorized can be much faster than either one alone Vectorized & Threaded “Automatic” Vectorization Not Enough Explicit pragmas and optimization often required The Difference Is Growing With Each New 130x Generation of Threaded Hardware Vectorized Serial 2010 2012 2013 2014 2016 2017 Intel® Xeon™ Intel® Xeon™ Intel® Xeon™ Intel® Xeon™ Intel® Xeon™ Intel® Xeon® Platinum Processor Processor Processor Processor Processor Processor X5680 E5-2600 E5-2600 v2 E5-2600 v3 E5-2600 v4 81xx formerly codenamed formerly codenamed formerly codenamed formerly codenamed formerly codenamed formerly codenamed Westmere Sandy Bridge Ivy Bridge Haswell Broadwell Skylake Server Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
    [Show full text]
  • Optimizing Packed String Matching on AVX2 Platform
    Optimizing Packed String Matching on AVX2 Platform M. Akif Aydo˘gmu¸s1,2 and M.O˘guzhan Külekci1 1 Informatics Institute, Istanbul Technical University, Istanbul, Turkey [email protected], [email protected] 2 TUBITAK UME, Gebze-Kocaeli, Turkey Abstract. Exact string matching, searching for all occurrences of given pattern P on a text T , is a fundamental issue in computer science with many applica- tions in natural language processing, speech processing, computational biology, information retrieval, intrusion detection systems, data compression, and etc. Speeding up the pattern matching operations benefiting from the SIMD par- allelism has received attention in the recent literature, where the empirical results on previous studies revealed that SIMD parallelism significantly helps, while the performance may even be expected to get automatically enhanced with the ever increasing size of the SIMD registers. In this paper, we provide variants of the previously proposed EPSM and SSEF algorithms, which are orig- inally implemented on Intel SSE4.2 (Streaming SIMD Extensions 4.2 version with 128-bit registers). We tune the new algorithms according to Intel AVX2 platform (Advanced Vector Extensions 2 with 256-bit registers) and analyze the gain in performance with respect to the increasing length of the SIMD registers. Profiling the new algorithms by using the Intel Vtune Amplifier for detecting performance bottlenecks led us to consider the cache friendliness and shared-memory access issues in the AVX2 platform. We applied cache op- timization techniques to overcome the problems particularly addressing the search algorithms based on filtering. Experimental comparison of the new solutions with the previously known-to- be-fast algorithms on small, medium, and large alphabet text files with diverse pattern lengths showed that the algorithms on AVX2 platform optimized cache obliviously outperforms the previous solutions.
    [Show full text]
  • Vegen: a Vectorizer Generator for SIMD and Beyond
    VeGen: A Vectorizer Generator for SIMD and Beyond Yishen Chen Charith Mendis Michael Carbin Saman Amarasinghe MIT CSAIL UIUC MIT CSAIL MIT CSAIL USA USA USA USA [email protected] [email protected] [email protected] [email protected] ABSTRACT Vector instructions are ubiquitous in modern processors. Traditional A1 A2 A3 A4 A1 A2 A3 A4 compiler auto-vectorization techniques have focused on targeting single instruction multiple data (SIMD) instructions. However, these B1 B2 B3 B4 B1 B2 B3 B4 auto-vectorization techniques are not sufficiently powerful to model non-SIMD vector instructions, which can accelerate applications A1+B1 A2+B2 A3+B3 A4+B4 A1+B1 A2-B2 A3+B3 A4-B4 in domains such as image processing, digital signal processing, and (a) vaddpd (b) vaddsubpd machine learning. To target non-SIMD instruction, compiler devel- opers have resorted to complicated, ad hoc peephole optimizations, expending significant development time while still coming up short. A1 A2 A3 A4 A1 A2 A3 A4 A5 A6 A7 A8 As vector instruction sets continue to rapidly evolve, compilers can- B1 B2 B3 B4 B5 B6 B7 B8 not keep up with these new hardware capabilities. B1 B2 B3 B4 In this paper, we introduce Lane Level Parallelism (LLP), which A1*B1+ A3*B3+ A5*B5+ A7*B8+ captures the model of parallelism implemented by both SIMD and A1+A2 B1+B2 A3+A4 B3+B4 A2*B2 A4*B4 A6*B6 A7*B8 non-SIMD vector instructions. We present VeGen, a vectorizer gen- (c) vhaddpd (d) vpmaddwd erator that automatically generates a vectorization pass to uncover target-architecture-specific LLP in programs while using only in- Figure 1: Examples of SIMD and non-SIMD instruction in struction semantics as input.
    [Show full text]
  • Ali Aydar Anita Borg Alfred Aho Bjarne Stroustrup Bill Gates
    Ali Aydar Ali Aydar is a computer scientist and Internet entrepreneur. He is the chief executive officer at Sporcle. He is best known as an early employee and key technical contributor at the original Napster. Aydar bought Fanning his first book on programming in C++, the language he would use two years later to build the Napster file-sharing software. Anita Borg Anita Borg (January 17, 1949 – April 6, 2003) was an American computer scientist. She founded the Institute for Women and Technology (now the Anita Borg Institute for Women and Technology). While at Digital Equipment, she developed and patented a method for generating complete address traces for analyzing and designing high-speed memory systems. Alfred Aho Alfred Aho (born August 9, 1941) is a Canadian computer scientist best known for his work on programming languages, compilers, and related algorithms, and his textbooks on the art and science of computer programming. Aho received a B.A.Sc. in Engineering Physics from the University of Toronto. Bjarne Stroustrup Bjarne Stroustrup (born 30 December 1950) is a Danish computer scientist, most notable for the creation and development of the widely used C++ programming language. He is a Distinguished Research Professor and holds the College of Engineering Chair in Computer Science. Bill Gates 2 of 10 Bill Gates (born October 28, 1955) is an American business magnate, philanthropist, investor, computer programmer, and inventor. Gates is the former chief executive and chairman of Microsoft, the world’s largest personal-computer software company, which he co-founded with Paul Allen. Bruce Arden Bruce Arden (born in 1927 in Minneapolis, Minnesota) is an American computer scientist.
    [Show full text]
  • Map, Reduce and Mapreduce, the Skeleton Way$
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Elsevier - Publisher Connector Procedia Computer Science ProcediaProcedia Computer Computer Science Science 00 1 (2010)(2012) 1–9 2095–2103 www.elsevier.com/locate/procedia International Conference on Computational Science, ICCS 2010 Map, Reduce and MapReduce, the skeleton way$ D. Buonoa, M. Daneluttoa, S. Lamettia aDept. of Computer Science, Univ. of Pisa, Italy Abstract Composition of Map and Reduce algorithmic skeletons have been widely studied at the end of the last century and it has demonstrated effective on a wide class of problems. We recall the theoretical results motivating the intro- duction of these skeletons, then we discuss an experiment implementing three algorithmic skeletons, a map,areduce and an optimized composition of a map followed by a reduce skeleton (map+reduce). The map+reduce skeleton implemented computes the same kind of problems computed by Google MapReduce, but the data flow through the skeleton is streamed rather than relying on already distributed (and possibly quite large) data items. We discuss the implementation of the three skeletons on top of ProActive/GCM in the MareMare prototype and we present some experimental obtained on a COTS cluster. ⃝c 2012 Published by Elsevier Ltd. Open access under CC BY-NC-ND license. Keywords: Algorithmic skeletons, map, reduce, list theory, homomorphisms 1. Introduction Algorithmic skeletons have been around since Murray Cole’s PhD thesis [1]. An algorithmic skeleton is a paramet- ric, reusable, efficient and composable programming construct that exploits a well know, efficient parallelism exploita- tion pattern. A skeleton programmer may build parallel applications simply picking up a (composition of) skeleton from the skeleton library available, providing the required parameters and running some kind of compiler+run time library tool specific of the target architecture at hand [2].
    [Show full text]
  • Exploiting Automatic Vectorization to Employ SPMD on SIMD Registers
    Exploiting automatic vectorization to employ SPMD on SIMD registers Stefan Sprenger Steffen Zeuch Ulf Leser Department of Computer Science Intelligent Analytics for Massive Data Department of Computer Science Humboldt-Universitat¨ zu Berlin German Research Center for Artificial Intelligence Humboldt-Universitat¨ zu Berlin Berlin, Germany Berlin, Germany Berlin, Germany [email protected] [email protected] [email protected] Abstract—Over the last years, vectorized instructions have multi-threading with SIMD instructions1. For these reasons, been successfully applied to accelerate database algorithms. How- vectorization is essential for the performance of database ever, these instructions are typically only available as intrinsics systems on modern CPU architectures. and specialized for a particular hardware architecture or CPU model. As a result, today’s database systems require a manual tai- Although modern compilers, like GCC [2], provide auto loring of database algorithms to the underlying CPU architecture vectorization [1], typically the generated code is not as to fully utilize all vectorization capabilities. In practice, this leads efficient as manually-written intrinsics code. Due to the strict to hard-to-maintain code, which cannot be deployed on arbitrary dependencies of SIMD instructions on the underlying hardware, hardware platforms. In this paper, we utilize ispc as a novel automatically transforming general scalar code into high- compiler that employs the Single Program Multiple Data (SPMD) execution model, which is usually found on GPUs, on the SIMD performing SIMD programs remains a (yet) unsolved challenge. lanes of modern CPUs. ispc enables database developers to exploit To this end, all techniques for auto vectorization have focused vectorization without requiring low-level details or hardware- on enhancing conventional C/C++ programs with SIMD instruc- specific knowledge.
    [Show full text]
  • Using Arm Scalable Vector Extension to Optimize OPEN MPI
    Using Arm Scalable Vector Extension to Optimize OPEN MPI Dong Zhong1,2, Pavel Shamis4, Qinglei Cao1,2, George Bosilca1,2, Shinji Sumimoto3, Kenichi Miura3, and Jack Dongarra1,2 1Innovative Computing Laboratory, The University of Tennessee, US 2fdzhong, [email protected], fbosilca, [email protected] 3Fujitsu Ltd, fsumimoto.shinji, [email protected] 4Arm, [email protected] Abstract— As the scale of high-performance computing (HPC) with extension instruction sets. systems continues to grow, increasing levels of parallelism must SVE is a vector extension for AArch64 execution mode be implored to achieve optimal performance. Recently, the for the A64 instruction set of the Armv8 architecture [4], [5]. processors support wide vector extensions, vectorization becomes much more important to exploit the potential peak performance Unlike other SIMD architectures, SVE does not define the size of target architecture. Novel processor architectures, such as of the vector registers, instead it provides a range of different the Armv8-A architecture, introduce Scalable Vector Extension values which permit vector code to adapt automatically to the (SVE) - an optional separate architectural extension with a new current vector length at runtime with the feature of Vector set of A64 instruction encodings, which enables even greater Length Agnostic (VLA) programming [6], [7]. Vector length parallelisms. In this paper, we analyze the usage and performance of the constrains in the range from a minimum of 128 bits up to a SVE instructions in Arm SVE vector Instruction Set Architec- maximum of 2048 bits in increments of 128 bits. ture (ISA); and utilize those instructions to improve the memcpy SVE not only takes advantage of using long vectors but also and various local reduction operations.
    [Show full text]
  • Bjarne Stroustrup
    Bjarne Stroustrup 52 Riverside Dr. #6A +1 979 219 5004 NY, NY 10024 [email protected] USA www.stroustrup.com Education Ph.D. in Computer Science, University of Cambridge, England, 1979 Ph.D. Thesis: Communication and Control in Distributed Computer Systems Thesis advisor: David Wheeler Cand.Scient. in Mathematics with Computer Science, Aarhus University, Denmark, 1975 Thesis advisor: Brian H. Mayoh Research Interests Distributed Systems, Design, Programming techniques, Software development tools, and Programming Languages Professional Experience Technical Fellow, Morgan Stanley, New York, January 2019 – present Managing Director, Division of Technology and Data, Morgan Stanley, New York, January 2014 – present Visiting Professor, Columbia University, New York, January 2014 – present Visiting Professor in the Computer Lab and Fellow of Churchill College, Cambridge University, Spring 2012 Visiting Professor in the Computer Science Department, Princeton University, Fall 2011 The College of Engineering Chair Professor in Computer Science, Department of Computer Science, Texas A&M University, October 2002 – January 2014 Department Head, AT&T Laboratories – Research, Florham Park, New Jersey, July 1995 – October 2002 Distinguished Member of Technical Staff, AT&T Bell Laboratories, Murray Hill, NJ, June 1987 – July 1995 Member of Technical Staff, AT&T Bell Laboratories, Murray Hill, NJ, March 1979 – June 1987 Honors & Awards 2019: Honorary doctor of University Carlos III in Madrid, Spain. 1 2018: The John Scott Legacy Medal and Premium from The Franklin Institute and the City Council of Philadelphia to men and women whose inventions improved the comfort, welfare, and happiness of human kind in a significant way. 2018: The Computer Pioneer Award from The IEEE Computer Society For bringing object- oriented programming and generic programming to the mainstream with his design and implementation of the C++ programming language.
    [Show full text]
  • Introduction on Vectorization
    C E R N o p e n l a b - Intel MIC / Xeon Phi training Intel® Xeon Phi™ Product Family Code Optimization Hans Pabst, April 11th 2013 Software and Services Group Intel Corporation Agenda • Introduction to Vectorization • Ways to write vector code • Automatic loop vectorization • Array notation • Elemental functions • Other optimizations • Summary 2 Copyright© 2012, Intel Corporation. All rights reserved. 4/12/2013 *Other brands and names are the property of their respective owners. Parallelism Parallelism / perf. dimensions Single Instruction Multiple Data • Across mult. applications • Perf. gain simply because • Across mult. processes an instruction performs more works • Across mult. threads • Data parallel • Across mult. instructions • SIMD (“Vector” is usually used as a synonym) 3 Copyright© 2012, Intel Corporation. All rights reserved. 4/12/2013 *Other brands and names are the property of their respective owners. History of SIMD ISA extensions Intel® Pentium® processor (1993) MMX™ (1997) Intel® Streaming SIMD Extensions (Intel® SSE in 1999 to Intel® SSE4.2 in 2008) Intel® Advanced Vector Extensions (Intel® AVX in 2011 and Intel® AVX2 in 2013) Intel Many Integrated Core Architecture (Intel® MIC Architecture in 2013) * Illustrated with the number of 32-bit data elements that are processed by one “packed” instruction. 4 Copyright© 2012, Intel Corporation. All rights reserved. 4/12/2013 *Other brands and names are the property of their respective owners. Vectors (SIMD) float *restrict A; • SSE: 4 elements at a time float *B, *C; addps xmm1, xmm2 • AVX: 8 elements at a time for (i=0; i<n; ++i) { vaddps ymm1, ymm2, ymm3 A[i] = B[i] + C[i]; • MIC: 16 elements at a time } vaddps zmm1, zmm2, zmm3 Scalar code computes the above b1 with one-element at a time.
    [Show full text]
  • From Sequential to Parallel Programming with Patterns
    From sequential to parallel programming with patterns Inverted CERN School of Computing 2018 Plácido Fernández [email protected] 07 Mar 2018 Table of Contents Introduction Sequential vs Parallel patterns Patterns Data parallel vs streaming patterns Control patterns (sequential and parallel Streaming parallel patterns Composing parallel patterns Using the patterns Real world use case GrPPI with NUMA Conclusions Acknowledgements and references 07 Mar 2018 3 Table of Contents Introduction Sequential vs Parallel patterns Patterns Data parallel vs streaming patterns Control patterns (sequential and parallel Streaming parallel patterns Composing parallel patterns Using the patterns Real world use case GrPPI with NUMA Conclusions Acknowledgements and references 07 Mar 2018 4 Source: Herb Sutter in Dr. Dobb’s Journal Parallel hardware history I Before ~2005, processor manufacturers increased clock speed. I Then we hit the power and memory wall, which limits the frequency at which processors can run (without melting). I Manufacturers response was to continue improving their chips performance by adding more parallelism. To fully exploit modern processors capabilities, programmers need to put effort into the source code. 07 Mar 2018 6 Parallel Hardware Processors are naturally parallel: Programmers have plenty of options: I Caches I Multi-threading I Different processing units (floating I SIMD / Vectorization (Single point, arithmetic logic...) Instruction Multiple Data) I Integrated GPU I Heterogeneous hardware 07 Mar 2018 7 Source: top500.org Source: Intel Source: CERN Parallel hardware Xeon Xeon 5100 Xeon 5500 Sandy Bridge Haswell Broadwell Skylake Year 2005 2006 2009 2012 2015 2016 2017 Cores 1 2 4 8 18 24 28 Threads 2 2 8 16 36 48 56 SIMD Width 128 128 128 256 256 256 512 Figure: Intel Xeon processors evolution 0Source: ark.intel.com 07 Mar 2018 11 Parallel considerations When designing parallel algorithms where are interested in speedup.
    [Show full text]