Auto-Tuning Compiler Options for HPC

University of Bath

PHD

Auto-tuning compiler options for HPC

Jones, Jessica

Award date: 2019
Awarding institution: University of Bath

Link to publication

Alternative formats
If you require this document in an alternative format, please contact: [email protected]

General rights
Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.
• Users may download and print one copy of any publication from the public portal for the purpose of private study or research.
• You may not further distribute the material or use it for any profit-making activity or commercial gain.
• You may freely distribute the URL identifying the publication in the public portal.

Take down policy
If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.

Download date: 27. Sep. 2021

Citation for published version: Jones, J 2019, 'Auto-tuning compiler options for HPC', Ph.D., University of Bath.
Publication date: 2019
Document Version: Publisher's PDF, also known as Version of record
Publisher Rights: GNU GPL
Download date: 16. Jul. 2019

Auto-tuning compiler options for HPC
submitted by Jessica R. Jones
for the degree of Doctor of Philosophy of the University of Bath
Department of Computer Science
August 2018

COPYRIGHT

Attention is drawn to the fact that copyright of this thesis rests with the author. A copy of this thesis has been supplied on condition that anyone who consults it is understood to recognise that its copyright rests with the author and that they must not copy it or use material from it except as permitted by law or with the consent of the author. This thesis may be made available for consultation within the University Library and may be photocopied or lent to other libraries for the purposes of consultation.

Signature of Author: Jessica R. Jones

ACKNOWLEDGEMENTS

I must frankly own, that if I had known, beforehand, that this book would have cost me the labour which it has, I should never have been courageous enough to commence it.
Isabella Beeton

There are many people who should be thanked for their patience, understanding and time. These include, but are not limited to, my supervisors, James Davenport, John ffitch and Guy McCusker, my partner, Ed White, my sisters, Eve Jones, Zoë Jones and Rebecca Bonome, and my friends (in no particular order), Martin Brain, Florian Schanda, David Stewart, Mark Cahill, Alaric Snell-Pym, Anita Faul, Mesar Hameed and Ben Pring. All have helped in various ways, from proof-reading to giving advice at odd hours on Sundays, to bringing me cups of tea and making sure some shred of sanity still remains at the end of this. I could not have done it without your support.

The HPC support staff at the University of Bath and at the University of Southampton are also due thanks; without the systems they are responsible for, this work would not have been possible.

I would also like to thank Jason Ansel, author of OpenTuner, and Clint Whaley, author of ATLAS, for the advice they gave both in person and by email. They both helped me to avoid wasting precious time, and made my life a little bit easier.
Considerable thanks are due also to my examiners, who gave up their time to read my work and make suggestions for its improvement. Having acknowledged that I am probably one of the many who “do not know how to write” [121], I know that their insightful comments and thought-provoking questions have greatly aided in achieving the clarity this document requires. I hope I have done justice to their hard work.

Last, but not least, I must thank my colleagues at Cray UK Ltd, who put up with my ramblings and kindly agreed to cope without me for weeks while I battled to finish this. The contribution of time is more valuable than you know.

Contents

List of Figures
List of Tables
List of Algorithms
List of Prototypes
Summary

1 Introduction
  1.1 Motivation
  1.2 Aims of this thesis
  1.3 Document overview

2 The landscape
  2.1 Computational Hardware
    2.1.1 Purpose-built supercomputers
    2.1.2 Floating point hardware
    2.1.3 Multiply-Accumulate
    2.1.4 Vectorisation
    2.1.5 Accelerators
    2.1.6 Computational Speed and FLOPs
  2.2 Interconnect
  2.3 The power consumption/performance trade-off
  2.4 Memory Hardware
    2.4.1 Virtual Memory
    2.4.2 Caches and Translation Look-Aside Buffers (TLBs)
    2.4.3 Caches on modern CPUs
  2.5 Symmetric Multi-Processor (SMP) systems
    2.5.1 Simultaneous Multi-Threading (SMT)
    2.5.2 Non-Uniform Memory Access (NUMA) Architectures
    2.5.3 Pipelining
  2.6 Compilers
    2.6.1 Compilers in practice
    2.6.2 Today’s HPC Compilers
  2.7 Libraries
    2.7.1 Dense Linear Algebra
    2.7.2 Fast Fourier Transform (FFT)
  2.8 Software
  2.9 Operating Systems

3 Related Work
  3.1 Auto-tuning
  3.2 Tuning the choice of compiler flags
    3.2.1 A note about optimization levels of compilers
  3.3 Distributed compilation

4 Initial Experiences
  4.1 Case study: When cache matters, a comparison of ATLAS and GotoBLAS
    4.1.1 Comparing BLAS library performance
    4.1.2 Observations of TOP500 list
    4.1.3 Comparing ATLAS and GotoBLAS
    4.1.4 How much does the translation look-aside buffer (TLB) really matter (to GotoBLAS)?
  4.2 Towards a deeper understanding
  4.3 Early motivation

5 Requirements Analysis
  5.1 Initial High Level Use Case
  5.2 Requirements Specification
    5.2.1 Initial Requirements
    5.2.2 The Prototypes

6 The convergence of benchmarks
  6.1 Introduction
    6.1.1 Background
    6.1.2 Method
    6.1.3 Environmental constraints
    6.1.4 Results
    6.1.5 Homogeneous machines are not always uniform
    6.1.6 Conclusion

7 Design Specification
  7.1 Lessons drawn from prototyping
    7.1.1 Language choice
    7.1.2 Choice of optimization algorithm
    7.1.3 Repetition of benchmark
    7.1.4 Choice of pseudo-random number generator (PRNG)
    7.1.5 Fault Tolerance
  7.2 Key design decisions
    7.2.1 Language choice
    7.2.2 Distribution of work
    7.2.3 Search algorithm
    7.2.4 Categorisation of flags
    7.2.5 Choice of optimization algorithm
    7.2.6 Fitness evaluation
    7.2.7 Repeated runs of the benchmark
    7.2.8 Choice of PRNG
  7.3 Robustness
    7.3.1 Recording progress and resuming after termination
    7.3.2 Terminating cleanly
    7.3.3 Robustness against compiler bugs
    7.3.4 Fault tolerance
    7.3.5 Non-homogeneity
  7.4 Conclusion

8 Design Implementation
  8.1 Component break-down and encapsulation
    8.1.1 Unit testing
  8.2 Alterations to target software
  8.3 Wrapper scripts
  8.4 Signal handling
  8.5 Difficulties encountered in third-party software
    8.5.1 Overly-helpful libraries
    8.5.2 Bugs in SQLite
  8.6 Conclusion
...
Recommended publications
  • Linpack Evaluation on a Supercomputer with Heterogeneous Accelerators
    Linpack Evaluation on a Supercomputer with Heterogeneous Accelerators

    Toshio Endo, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, Tokyo, Japan, [email protected]
    Akira Nukada, Global Scientific Information and Computing Center, Tokyo Institute of Technology, Tokyo, Japan, [email protected]
    Satoshi Matsuoka, Global Scientific Information and Computing Center, Tokyo Institute of Technology/National Institute of Informatics, Tokyo, Japan, [email protected]
    Naoya Maruyama, Global Scientific Information and Computing Center, Tokyo Institute of Technology, Tokyo, Japan, [email protected]

    Abstract: We report Linpack benchmark results on the TSUBAME supercomputer, a large scale heterogeneous system equipped with NVIDIA Tesla GPUs and ClearSpeed SIMD accelerators. With all of 10,480 Opteron cores, 640 Xeon cores, 648 ClearSpeed accelerators and 624 NVIDIA Tesla GPUs, we have achieved 87.01 TFlops, which is the third record as a heterogeneous system in the world. This paper describes careful tuning and load balancing method required to achieve this performance. On the other hand, since the peak speed is 163 TFlops, the efficiency is 53%, which is lower than other systems.

    ... Roadrunner or other systems described above, it includes two types of accelerators. This is due to incremental upgrade of the system, which has been the case in commodity CPU clusters; they may have processors with different speeds as a result of incremental upgrade. In this paper, we present a Linpack implementation and evaluation results on TSUBAME with 10,480 Opteron cores, 624 Tesla GPUs and 648 ClearSpeed accelerators. In the evaluation, we also used a ...
  • Technical Details of the Elliott 152 and 153
    Appendix 1 Technical Details of the Elliott 152 and 153

    Introduction
    The Elliott 152 computer was part of the Admiralty’s MRS5 (medium range system 5) naval gunnery project, described in Chap. 2. The Elliott 153 computer, also known as the D/F (direction-finding) computer, was built for GCHQ and the Admiralty as described in Chap. 3. The information in this appendix is intended to supplement the overall descriptions of the machines as given in Chaps. 2 and 3.

    A1.1 The Elliott 152
    Work on the MRS5 contract at Borehamwood began in October 1946 and was essentially finished in 1950. Novel target-tracking radar was at the heart of the project, the radar being synchronized to the computer’s clock. In his enthusiasm for perfecting the radar technology, John Coales seems to have spent little time on what we would now call an overall systems design. When Harry Carpenter joined the staff of the Computing Division at Borehamwood on 1 January 1949, he recalls that nobody had yet defined the way in which the control program, running on the 152 computer, would interface with guns and radar. Furthermore, nobody yet appeared to be working on the computational algorithms necessary for three-dimensional trajectory prediction. As for the guns that the MRS5 system was intended to control, not even the basic ballistics parameters seemed to be known with any accuracy at Borehamwood [1, 2].

    A1.1.1 Communication and Data-Rate
    The physical separation, between radar in the Borehamwood car park and digital computer in the laboratory, necessitated an interconnecting cable of about 150 m in length.
  • Emerging Technologies Multi/Parallel Processing
    Emerging Technologies: Multi/Parallel Processing
    Mary C. Kulas, New Computing Structures, Strategic Relations Group
    December 1987
    For Internal Use Only

    Copyright © 1987 by Digital Equipment Corporation. Printed in U.S.A. The information contained herein is confidential and proprietary. It is the property of Digital Equipment Corporation and shall not be reproduced or copied in whole or in part without written permission. This is an unpublished work protected under the Federal copyright laws. The following are trademarks of Digital Equipment Corporation, Maynard, MA 01754: DECpage, LN03. This report was produced by Educational Services with DECpage and the LN03 laser printer.

    Contents
    Acknowledgments
    Abstract
    Executive Summary
    I. Analysis
      A. The Players
        1. Number and Status
        2. Funding
        3. Strategic Alliances
        4. Sales
          a. Revenue/Units Installed
          b. European Sales
      B. The Product
        1. CPUs
        2. Chip
        3. Bus
        4. Vector Processing
        5. Operating System
        6. Languages
        7. Third-Party Applications
        8. Pricing
      C. IBM and Other Major Computer Companies
      D. Why Success? Why Failure?
      E. Future Directions
    II. Company/Product Profiles
      A. Multi/Parallel Processors
        1. Alliant
        2. Astronautics
        3. Concurrent
        4. Cydrome
        5. Eastman Kodak
        6. Elxsi
        7. Encore
        8. Flexible
        9. Floating Point Systems - M64line
        10. International Parallel
        11. Loral
        12. Masscomp
        13. Meiko
        14. Multiflow
        15. Sequent
      B. Massively Parallel
        1. Ametek
        2. Bolt Beranek & Newman Advanced Computers
    ...
  • Information Technology 3 and 4/1998. Emerging
    OCCASION
    This publication has been made available to the public on the occasion of the 50th anniversary of the United Nations Industrial Development Organisation.

    DISCLAIMER
    This document has been produced without formal United Nations editing. The designations employed and the presentation of the material in this document do not imply the expression of any opinion whatsoever on the part of the Secretariat of the United Nations Industrial Development Organization (UNIDO) concerning the legal status of any country, territory, city or area or of its authorities, or concerning the delimitation of its frontiers or boundaries, or its economic system or degree of development. Designations such as “developed”, “industrialized” and “developing” are intended for statistical convenience and do not necessarily express a judgment about the stage reached by a particular country or area in the development process. Mention of firm names or commercial products does not constitute an endorsement by UNIDO.

    FAIR USE POLICY
    Any part of this publication may be quoted and referenced for educational and research purposes without additional permission from UNIDO. However, those who make use of quoting and referencing this publication are requested to follow the Fair Use Policy of giving due credit to UNIDO.

    CONTACT
    Please contact [email protected] for further information concerning UNIDO publications. For more information about UNIDO, please visit us at www.unido.org

    UNITED NATIONS INDUSTRIAL DEVELOPMENT ORGANIZATION
    Vienna International Centre, P.O. Box 300, 1400 Vienna, Austria
    Tel: (+43-1) 26026-0 · www.unido.org · [email protected]

    EMERGING TECHNOLOGY SERIES 3 and 4/1998: Information Technology
    UNITED NATIONS INDUSTRIAL DEVELOPMENT ORGANIZATION, Vienna, 2000

    TO OUR READERS
    Of special interest in this issue of Emerging Technology Series: Information Technology is the focus on a meeting that took place from 2 to 4 November 1998, in Bangalore, India.
  • The Profession of IT Multitasking Without Thrashing Lessons from Operating Systems Teach How to Do Multitasking Without Thrashing
    Viewpoints
    DOI:10.1145/3126494
    Peter J. Denning
    The Profession of IT: Multitasking Without Thrashing
    Lessons from operating systems teach how to do multitasking without thrashing.

    Our individual ability to be productive has been hard stressed by the sheer load of task requests we receive via the Internet. In 2001, David Allen published Getting Things Done [1], a best-selling book about a system for managing all our tasks to eliminate stress and increase productivity. Allen claims that a considerable amount of stress comes our way when we have too many incomplete tasks. He views tasks as loops connecting someone making a request and you ...

    ... The first four destinations basically remove incoming tasks from your workspace, the fifth closes quick loops, and the sixth holds your incomplete loops. GTD helps you keep track of these unfinished loops. The idea of tasks being closed loops of a conversation between a requester and a performer was first proposed in 1979 by Fernando Flores [5]. The “conditions of satisfaction” that are produced by the performer define loop completion and allow tracking the movement of the conversation toward completion.

    ... commercial world with its OS 360 in 1965. Operating systems implement multitasking by cycling a CPU through a list of all incomplete tasks, giving each one a time slice on the CPU. If the task does not complete by the end of its time slice, the OS interrupts it and puts it on the end of the list. To switch the CPU context, the OS saves all the CPU registers of the current task and loads the registers of the new task. The designers set the time slice length long enough to keep the total context switch ...
  • Supercomputers: Current Status and Future Trends
    Supercomputers: Current Status and Future Trends
    Presentation to Melbourne PC Users Group, Inc., August 2nd, 2016
    http://levlafayette.com

    0.0 About Your Speaker
    0.1 Lev works as the HPC and Training Officer at the University of Melbourne, and does the day-to-day administration of the Spartan HPC/Cloud hybrid and the Edward HPC.
    0.2 Prior to that he worked for several years at the Victorian Partnership for Advanced Computing, where he looked after the Tango and Trifid systems and taught researchers at 17 different universities and research institutions. Prior to that he worked for the East Timorese government, and prior to that the Parliament of Victoria.
    0.3 He is involved in various community organisations, collects degrees, learns languages, designs games, and writes books for fun.
    0.4 You can stalk him at: http://levlafayette.com or https://www.linkedin.com/in/levlafayette

    1.0 What is a Supercomputer anyway?
    1.1 A supercomputer is a rather nebulous term for any computer system that is at the frontline of current processing capacity.
    1.2 The Top500 metric (http://top500.org) measures pure speed of floating point operations with LINPACK. The HPC Challenge uses a variety of metrics (floating point calculation speed, matrix calculations, sustainable memory bandwidth, paired processor communications, random memory updates, discrete Fourier transforms, and communication bandwidth and latency).
    1.3 The term supercomputer is not quite the same as "high performance computer", and not quite the same as "cluster computing", and not quite the same as "scientific (or research) computing".

    2.0 What are they used for?
    2.1 Typically used for complex calculations where many cores are required to operate in a tightly coupled fashion, or for extremely large collections of datasets where many cores are required to carry out the analysis simultaneously.
  • Red Hat Developer Toolset 9 User Guide
    Red Hat Developer Toolset 9 User Guide
    Installing and Using Red Hat Developer Toolset
    Last Updated: 2020-08-07

    Zuzana Zoubková, Red Hat Customer Content Services
    Olga Tikhomirova, Red Hat Customer Content Services, [email protected]
    Supriya Takkhi, Red Hat Customer Content Services
    Jaromír Hradílek, Red Hat Customer Content Services
    Matt Newsome, Red Hat Software Engineering
    Robert Krátký, Red Hat Customer Content Services
    Vladimír Slávik, Red Hat Customer Content Services

    Legal Notice
    Copyright © 2020 Red Hat, Inc. The text of and illustrations in this document are licensed by Red Hat under a Creative Commons Attribution–Share Alike 3.0 Unported license ("CC-BY-SA"). An explanation of CC-BY-SA is available at http://creativecommons.org/licenses/by-sa/3.0/ . In accordance with CC-BY-SA, if you distribute this document or an adaptation of it, you must provide the URL for the original version. Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law. Red Hat, Red Hat Enterprise Linux, the Shadowman logo, the Red Hat logo, JBoss, OpenShift, Fedora, the Infinity logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries. Linux ® is the registered trademark of Linus Torvalds in the United States and other countries. Java ® is a registered trademark of Oracle and/or its affiliates. XFS ® is a trademark of Silicon Graphics International Corp.
  • STK-6401 the Affordable Mini-Supercomputer with Muscle
    SUPERTEK COMPUTERS
    STK-6401 - The Affordable Mini-Supercomputer with Muscle

    The STK-6401 is a high-performance mini-supercomputer which is fully compatible with the Cray X-MP/48™ instruction set, including some important operations not found in the low-end Cray machines. The system design combines advanced TTL/CMOS technologies with a highly optimized architecture. By taking advantage of mature, multiple-sourced off-the-shelf devices, the STK-6401 offers performance equal to or better than comparable mini-supers at approximately half their cost. Additional benefits of this design approach are much smaller size, ...

    ... design supports two vector reads, one vector write, and one I/O transfer with an aggregate bandwidth of 640 MB/s. Bank conflicts are reduced to a minimum by a 16-way, fully interleaved structure. Coupled with the multi-ported vector register file and a built-in vector chaining capability, the Memory Unit makes most vector operations run as if they were efficient memory-to-memory operations. This important feature is not offered in most of the machines on the market today.

    I/O Subsystem
    The I/O subsystem of the STK-6401 ...

    ... more than 300 third-party and public domain applications developed for the Cray 1™ and Cray X-MP architectures. A UNIX™ environment is also available under CTSS.

    Concurrent Interactive and Batch Processing
    CTSS accommodates the differing requirements of applications development and long-running, computationally intensive codes. As a result, concurrent interactive and batch access to the STK-6401 is supported with no degradation in system performance.
  • 1St Slide Set Operating Systems
    Organizational Information | Operating Systems | Generations of Computer Systems and Operating Systems

    1st Slide Set Operating Systems
    Prof. Dr. Christian Baun
    Frankfurt University of Applied Sciences (1971–2014: Fachhochschule Frankfurt am Main)
    Faculty of Computer Science and Engineering
    [email protected]
    Prof. Dr. Christian Baun – 1st Slide Set Operating Systems – Frankfurt University of Applied Sciences – WS2021

    Organizational Information
    E-Mail: [email protected]
    !!! Tell me when problems exist at an early stage !!!
    Homepage: http://www.christianbaun.de
    !!! Check the course page regularly !!!
    The homepage contains among others the lecture notes:
    Presentation slides in English and German language
    Exercise sheets in English and German language
    Sample solutions of the exercise sheets
    Old exams and their sample solutions
    What is the password? There is no password!
    The content of the English and German slides is identical, but please use the English slides for the exam preparation to become familiar with the technical terms.

    Literature
    My slide sets were the basis for these books.
    The two-column layout (English/German) of the bilingual book is quite useful for this course.
    You can download both books for free via the FRA-UAS library from the intranet.

    Learning Objectives of this Slide Set
    At the end of this slide set you know/understand.
  • Download Chapter 3: "Configuring Your Project with Autoconf"
    CONFIGURING YOUR PROJECT WITH AUTOCONF

    Come my friends, ’Tis not too late to seek a newer world.
    Alfred, Lord Tennyson, “Ulysses”

    Because Automake and Libtool are essentially add-on components to the original Autoconf framework, it’s useful to spend some time focusing on using Autoconf without Automake and Libtool. This will provide a fair amount of insight into how Autoconf operates by exposing aspects of the tool that are often hidden by Automake.

    Before Automake came along, Autoconf was used alone. In fact, many legacy open source projects never made the transition from Autoconf to the full GNU Autotools suite. As a result, it’s not unusual to find a file called configure.in (the original Autoconf naming convention) as well as handwritten Makefile.in templates in older open source projects.

    In this chapter, I’ll show you how to add an Autoconf build system to an existing project. I’ll spend most of this chapter talking about the fundamental features of Autoconf, and in Chapter 4, I’ll go into much more detail about how some of the more complex Autoconf macros work and how to properly use them. Throughout this process, we’ll continue using the Jupiter project as our example.

    Autotools © 2010 by John Calcote

    Autoconf Configuration Scripts
    The input to the autoconf program is shell script sprinkled with macro calls. The input stream must also include the definitions of all referenced macros, both those that Autoconf provides and those that you write yourself. The macro language used in Autoconf is called M4. (The name means M, plus 4 more letters, or the word Macro. [1]) The m4 utility is a general-purpose macro language processor originally written by Brian Kernighan and Dennis Ritchie in 1977.
  • Supermatrix: a Multithreaded Runtime Scheduling System for Algorithms-By-Blocks
    SuperMatrix: A Multithreaded Runtime Scheduling System for Algorithms-by-Blocks

    Ernie Chan, Field G. Van Zee, Robert van de Geijn: Department of Computer Sciences, The University of Texas at Austin, Austin, Texas 78712, {echan,field,rvdg}@cs.utexas.edu
    Paolo Bientinesi: Department of Computer Science, Duke University, Durham, NC 27708, [email protected]
    Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí: Departamento de Ingeniería y Ciencia de Computadores, Universidad Jaume I, 12.071–Castellón, Spain, {quintana,gquintan}@icc.uji.es

    Abstract
    This paper describes SuperMatrix, a runtime system that parallelizes matrix operations for SMP and/or multi-core architectures. We use this system to demonstrate how code described at a high level of abstraction can achieve high performance on such architectures while completely hiding the parallelism from the library programmer. The key insight entails viewing matrices hierarchically, consisting of blocks that serve as units of data where operations over those blocks are treated as units of computation. The implementation transparently enqueues the required operations, internally tracking dependencies, and then executes the operations ...

    ... which led to the adoption of out-of-order execution in many computer microarchitectures [32]. For the dense linear algebra operations on which we will concentrate in this paper, many researchers in the early days of distributed-memory computing recognized that “compute-ahead” techniques could be used to improve parallelism. However, the coding complexity required of such an effort proved too great for these techniques to gain wide acceptance. In fact, compute-ahead optimizations are still absent from linear algebra packages such as ScaLAPACK [12] and PLAPACK [34].

    Recently, there has been a flurry of interest in reviving the idea of compute-ahead [1, 25, 31].
  • Benchmark of C++ Libraries for Sparse Matrix Computation
    Benchmark of C++ Libraries for Sparse Matrix Computation
    Georg Holzmann
    http://grh.mur.at
    email: [email protected]
    August 2007

    This report presents benchmarks of C++ scientific computing libraries for small and medium size sparse matrices. Be warned: these benchmarks are very specialized on a neural network like algorithm I had to implement. However, also the initialization time of sparse matrices and a matrix-vector multiplication was measured, which might be of general interest.

    WARNING: At the time of its writing this document did not consider the eigen library (http://eigen.tuxfamily.org). Please evaluate also this very nice, fast and well maintained library before making any decisions! See http://grh.mur.at/blog/matrix-library-benchmark-follow for more information.

    Contents
    1 Introduction
    2 Benchmarks
      2.1 Initialization
      2.2 Matrix-Vector Multiplication
      2.3 Neural Network like Operation
    3 Results
    4 Conclusion
    A Libraries and Flags
    B Implementations

    1 Introduction
    Quite a lot of open-source libraries exist for scientific computing, which makes it hard to choose between them. Therefore, after communication on various mailing lists, I decided to perform some benchmarks, specialized to the kind of algorithms I had to implement. Ideally algorithms should be implemented in an object oriented, reusable way, but of course without a loss of performance. BLAS (Basic Linear Algebra Subprograms - see [1]) and LAPACK (Linear Algebra Package - see [3]) are the standard building blocks for efficient scientific software. Highly optimized BLAS implementations for all relevant hardware architectures are available, traditionally expressed in Fortran routines for scalar, vector and matrix operations (called Level 1, 2 and 3 BLAS).