Exploiting Data Sparsity in Covariance Matrix Computations on Heterogeneous Systems

Total Page:16

File Type:pdf, Size:1020Kb

Exploiting Data Sparsity in Covariance Matrix Computations on Heterogeneous Systems Exploiting Data Sparsity In Covariance Matrix Computations on Heterogeneous Systems Dissertation by Ali M. Charara In Partial Fulfillment of the Requirements For the Degree of Doctor of Philosophy King Abdullah University of Science and Technology Thuwal, Kingdom of Saudi Arabia May, 2018 2 EXAMINATION COMMITTEE PAGE The dissertation of Ali M. Charara is approved by the examination committee Committee Chairperson: Prof. David E. Keyes Committee Members: Prof. Marc Genton, Prof. Markus Hadwiger, Prof. Anne C. Elster, and Dr. Hatem Ltaief. 3 ©May, 2018 Ali M Charara All Rights Reserved 4 ABSTRACT Exploiting Data Sparsity In Covariance Matrix Computations on Heterogeneous Systems Ali M. Charara Covariance matrices are ubiquitous in computational sciences, typically describing the correlation of elements of large multivariate spatial data sets. For example, covari- ance matrices are employed in climate/weather modeling for the maximum likelihood estimation to improve prediction, as well as in computational ground-based astronomy to enhance the observed image quality by filtering out noise produced by the adap- tive optics instruments and atmospheric turbulence. The structure of these covariance matrices is dense, symmetric, positive-definite, and often data-sparse, therefore, hier- archically of low-rank. This thesis investigates the performance limit of dense matrix computations (e.g., Cholesky factorization) on covariance matrix problems as the number of unknowns grows, and in the context of the aforementioned applications. We employ recursive formulations of some of the basic linear algebra subroutines (BLAS) to accelerate the covariance matrix computation further, while reducing data traffic across the memory subsystems layers. However, dealing with large data sets (i.e., covariance matrices of billions in size) can rapidly become prohibitive in memory footprint and algorithmic complexity. Most importantly, this thesis investigates the tile low-rank data format (TLR), a new compressed data structure and layout, which is valuable in exploiting data sparsity by approximating the operator. The TLR com- pressed data structure allows approximating the original problem up to user-defined numerical accuracy. This comes at the expense of dealing with tasks with much lower 5 arithmetic intensities than traditional dense computations. In fact, this thesis con- solidates the two trends of dense and data-sparse linear algebra for HPC. Not only does the thesis leverage recursive formulations for dense Cholesky-based matrix al- gorithms, but it also implements a novel TLR-Cholesky factorization using batched linear algebra operations to increase hardware occupancy and reduce the overhead of the API. Performance reported of the dense and TLR-Cholesky shows many-fold speedups against state-of-the-art implementations on various systems equipped with GPUs. Additionally, the TLR implementation gives the user flexibility to select the desired accuracy. This trade-off between performance and accuracy is, currently, a well-established leading trend in the convergence of the third and fourth paradigm, i.e., HPC and Big Data, when moving forward with exascale software roadmap. 6 ACKNOWLEDGEMENTS I would like to express my sincere gratitude to my advisor Prof. David Keyes for his continuous support of my Ph.D. study, for his encouragement, immense knowledge, guidance, and inspirations. Equal gratefulness is due to my main mentor Dr. Hatem Ltaief for his dedication, patience, and endurance. Most of the ideas of this thesis work were initiated, supported, and improved by him. I can not say enough to thank his enormous efforts to bring this thesis to its success. I would like to thank my committee members for their insightful comments and encouragements. I'm also grateful to my previous advisor, Prof. Alyn Rockwood, who has sponsored and encouraged me throughout my Master's study and the first project of my Ph.D. study at KAUST. I thank my colleagues and labmates at both ECRC and VCC labs for insightful discussions and encouragements. Special thanks go to the members of HiCMA project for the collaborative and teamwork we have done together, as well as to the members of the KSL lab for their great support and readiness to help. I would like to thank also Philippe Vandermersch for hosting me at their team at NVIDIA as an intern. Highest appreciation goes to my parents and brothers who believed and instilled in me the continuous desire to improve my knowledge and achieve more. Utmost gratefulness and recognition are for my wife who has been, along with my lovely sons (Hassan, Mohsen, Hussein, and Mohamad), the most patient, encouraging, and supportive. They endured long vacations and weekends and sleepless nights while I'm away and indulged in my work. They have been my source of joy and for whom I present this work, hoping it would be an inspiration for them to pursue excellence in their studies. 7 TABLE OF CONTENTS Examination Committee Page 2 Copyright 3 Abstract 4 Acknowledgements 6 List of Abbreviations 11 List of BLAS/LAPACK Routines 12 List of Figures 13 List of Tables 18 List of Algorithms 19 1 Introduction 20 1.1 Motivation . 20 1.2 Objectives and Contributions . 20 1.3 Thesis Outline . 23 2 Covariance Matrices in Scientific Applications 25 2.1 Covariance Matrices . 25 2.2 Multi-Object Adaptive Optics Application . 26 2.3 Spatial Statistics Application . 31 3 Hardware and Software Trends 33 3.1 Emerging Hardware Architectures . 33 3.1.1 GPU . 34 3.1.2 Xeon Phi . 36 3.2 Dense Linear Algebra Algorithms . 37 3.2.1 Block Algorithm . 38 8 3.2.2 Tile Algorithm . 39 3.3 Runtime Systems . 43 3.3.1 The StarPU Runtime System . 43 3.3.2 The OmpSs Runtime System . 44 3.3.3 Customized Static Runtime System . 47 4 Leveraging Vertical Performance using Recursive Formulations 49 4.1 Performance of Matrix Inversion on GPUs . 49 4.2 Triangular BLAS 3 Acceleration on NVIDIA GPUs . 51 4.2.1 Introduction . 51 4.2.2 Related Work . 52 4.2.3 The Triangular Matrix Operations: TRMM and TRSM . 54 4.2.4 Recursive Definition and Implementation Details . 58 4.2.5 Experimental Results . 59 4.2.6 Performance Impact on High-Level Dense Linear Algebra Al- gorithms . 65 4.2.7 Conclusions . 66 4.3 A Framework for Dense Triangular Matrix Kernels on Various Many- core Architectures . 67 4.3.1 Prologue . 67 4.3.2 Related Work . 69 4.3.3 Background on General Recursive TRMM and TRSM Kernels 70 4.3.4 Motivation for further enhancements . 72 4.3.5 Implementation Details for KBLAS CUDA TRSM and TRMM for Few Right-Hand Sides . 73 4.3.6 KBLAS Multi-GPU RecTRSM and RecTRMM ........ 77 4.3.7 On Intel x86 CPU Architectures . 79 4.3.8 Experimental Results . 80 4.3.9 Conclusion . 85 4.4 GPU-Resident Cholesky Factorization . 87 4.4.1 Prologue and Related Work . 87 4.4.2 GPU-Resident Implementation . 88 4.4.3 Performance Results . 91 5 Leveraging Horizontal Performance using Fine-Grained Computa- tions 95 5.1 Background on Adaptive Optics Simulations . 95 9 5.2 Pipelining Computational Stages of the TR for MOAO on a Multi- GPU System . 97 5.2.1 Contributions . 97 5.2.2 A New Mathematical Model . 98 5.2.3 Implementation Details . 103 5.2.4 Experimental Results . 106 5.3 Adaptive Optics Simulation for the World's Largest Telescope on Mul- ticore Architectures with Multiple GPUs . 112 5.3.1 Contributions . 114 5.3.2 Prologue . 115 5.3.3 High Performance MOAO Framework Implementations . 116 5.3.4 Performance Results and Analysis . 119 5.4 Real-Time Massively Distributed Multi-Object Adaptive Optics Sim- ulations for the European Extremely Large Telescope . 125 5.4.1 Introduction . 125 5.4.2 Implementation Details . 128 5.4.3 System and Environment Settings . 133 5.4.4 Science Performance Analysis . 134 5.4.5 HPC Performance Results . 136 5.4.6 Impacts for MOAO Moving Forward . 139 5.4.7 Conclusion . 141 6 Towards Off-Diagonal Low-Rank Matrix Computations 142 6.1 Accelerating Low-Rank Computations . 144 6.2 Contributions Plan . 145 6.3 Batched Operations for Very Small Triangular Matrices Over GPUs . 147 6.3.1 Prologue . 147 6.3.2 Related Work . 150 6.3.3 The KBLAS Library . 151 6.3.4 Limitations of Triangular DLA operations . 153 6.3.5 Fundamental Algorithmic Techniques . 155 6.3.6 High Performance Implementation Details . 160 6.3.7 Performance Results and Analysis . 171 6.3.8 Discussions on Non-uniform Matrix Sizes . 186 6.4 Batched Tile Low-Rank BLAS operations . 187 6.4.1 Introduction . 187 10 6.4.2 Related Work . 189 6.4.3 Background . 190 6.4.4 Design of Tile Low-Rank GEMM Kernels . 192 6.4.5 Implementation Details . 195 6.4.6 Experimental Results . 198 6.5 Tile Low-Rank Cholesky factorization . 202 6.5.1 TLR-Cholesky implementation variants . 203 6.5.2 Performance Results and Analysis . 206 7 Concluding Remarks 211 References 212 Appendices 231 11 LIST OF ABBREVIATIONS AO Adaptive Optics 26 BLAS Basic Linear Algebra Subroutines 21 DAG directed acyclic graph 40 DLA Dense Linear Algebra 38 DM deformable mirrors 26 DP double precision 35 E-ELT European Extremely Large Telescope 26 FoV Field of View 26 HiCMA Hierarchical Computations on Manycore Architectures 145 IP in-place 54 LAPACK Linear Algebra Package 22 MAGMA Matrix Algebra on GPU and Multicore Architectures 39 MOAO Multi-Object Adaptive Optics 27 MORSE Matrices Over Runtime System at Exascale 42 MOS Multi-object spectroscopy 26 OOP out-of-place 54 PLASMA Parallel Linear Algebra for Scalable Multi-core Architectures 40 PSF point spread function 112 RAW Read After Write 51 SM Streaming Multiprocessor 34 SP single precision 35 SPD symmetric positive-definite 98 SVD Singular Value Decomposition 143 TB Thread-Blocks 34 TLR Tile Low-Rank 17 TR tomographic reconstructor 29 TS truth sensor 98 WAR Write After Read 51 WFS wavefront sensors 26 12 LIST OF BLAS/LAPACK ROUTINES GEMM GEneral Matrix Multiply.
Recommended publications
  • ARM HPC Ecosystem
    ARM HPC Ecosystem Darren Cepulis HPC Segment Manager ARM Business Segment Group HPC Forum, Santa Fe, NM 19th April 2017 © ARM 2017 ARM Collaboration for Exascale Programs United States Japan ARM is currently a participant in Fujitsu and RIKEN announced that the two Department of Energy funded Post-K system targeted at Exascale will pre-Exascale projects: Data be based on ARMv8 with new Scalable Movement Dominates and Fast Vector Extensions. Forward 2. European Union China Through FP7 and Horizon 2020, James Lin, vice director for the Center ARM has been involved in several of HPC at Shanghai Jiao Tong University funded pre-Exascale projects claims China will build three pre- including the Mont Blanc program Exascale prototypes to select the which deployed one of the first architecture for their Exascale system. ARM prototype HPC systems. The three prototypes are based on AMD, SunWei TaihuLight, and ARMv8. 2 © ARM 2017 ARM HPC deployments starting in 2H2017 Tw o recent announcements about ARM in HPC in Europe: 3 © ARM 2017 Japan Exascale slides from Fujitsu at ISC’16 4 © ARM 2017 Foundational SW Ecosystem for HPC . Linux OS’s – RedHat, SUSE, CENTOS, UBUNTU,… Commercial . Compilers – ARM, GNU, LLVM,… . Libraries – ARM, OpenBLAS, BLIS, ATLAS, FFTW… . Parallelism – OpenMP, OpenMPI, MVAPICH2,… . Debugging – Allinea, RW Totalview, GDB,… Open-source . Analysis – ARM, Allinea, HPCToolkit, TAU,… . Job schedulers – LSF, PBS Pro, SLURM,… . Cluster mgmt – Bright, CMU, warewulf,… Predictable Baseline 5 © ARM 2017 – now on ARM Functional Supported packages / components Areas OpenHPC defines a baseline. It is a community effort to Base OS RHEL/CentOS 7.1, SLES 12 provide a common, verified set of open source packages for Administrative Conman, Ganglia, Lmod, LosF, ORCM, Nagios, pdsh, HPC deployments Tools prun ARM’s participation: Provisioning Warewulf Resource Mgmt.
    [Show full text]
  • 0 BLIS: a Framework for Rapidly Instantiating BLAS Functionality
    0 BLIS: A Framework for Rapidly Instantiating BLAS Functionality FIELD G. VAN ZEE and ROBERT A. VAN DE GEIJN, The University of Texas at Austin The BLAS-like Library Instantiation Software (BLIS) framework is a new infrastructure for rapidly in- stantiating Basic Linear Algebra Subprograms (BLAS) functionality. Its fundamental innovation is that virtually all computation within level-2 (matrix-vector) and level-3 (matrix-matrix) BLAS operations can be expressed and optimized in terms of very simple kernels. While others have had similar insights, BLIS reduces the necessary kernels to what we believe is the simplest set that still supports the high performance that the computational science community demands. Higher-level framework code is generalized and imple- mented in ISO C99 so that it can be reused and/or re-parameterized for different operations (and different architectures) with little to no modification. Inserting high-performance kernels into the framework facili- tates the immediate optimization of any BLAS-like operations which are cast in terms of these kernels, and thus the framework acts as a productivity multiplier. Users of BLAS-dependent applications are given a choice of using the the traditional Fortran-77 BLAS interface, a generalized C interface, or any other higher level interface that builds upon this latter API. Preliminary performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL). Categories and Subject Descriptors: G.4 [Mathematical Software]: Efficiency General Terms: Algorithms, Performance Additional Key Words and Phrases: linear algebra, libraries, high-performance, matrix, BLAS ACM Reference Format: ACM Trans.
    [Show full text]
  • The BLAS API of BLASFEO: Optimizing Performance for Small Matrices
    The BLAS API of BLASFEO: optimizing performance for small matrices Gianluca Frison1, Tommaso Sartor1, Andrea Zanelli1, Moritz Diehl1;2 University of Freiburg, 1 Department of Microsystems Engineering (IMTEK), 2 Department of Mathematics email: [email protected] February 5, 2020 This research was supported by the German Federal Ministry for Economic Affairs and Energy (BMWi) via eco4wind (0324125B) and DyConPV (0324166B), and by DFG via Research Unit FOR 2401. Abstract BLASFEO is a dense linear algebra library providing high-performance implementations of BLAS- and LAPACK-like routines for use in embedded optimization and other applications targeting relatively small matrices. BLASFEO defines an API which uses a packed matrix format as its native format. This format is analogous to the internal memory buffers of optimized BLAS, but it is exposed to the user and it removes the packing cost from the routine call. For matrices fitting in cache, BLASFEO outperforms optimized BLAS implementations, both open-source and proprietary. This paper investigates the addition of a standard BLAS API to the BLASFEO framework, and proposes an implementation switching between two or more algorithms optimized for different matrix sizes. Thanks to the modular assembly framework in BLASFEO, tailored linear algebra kernels with mixed column- and panel-major arguments are easily developed. This BLAS API has lower performance than the BLASFEO API, but it nonetheless outperforms optimized BLAS and especially LAPACK libraries for matrices fitting in cache. Therefore, it can boost a wide range of applications, where standard BLAS and LAPACK libraries are employed and the matrix size is moderate. In particular, this paper investigates the benefits in scientific programming languages such as Octave, SciPy and Julia.
    [Show full text]
  • Accelerating the “Motifs” in Machine Learning on Modern Processors
    Accelerating the \Motifs" in Machine Learning on Modern Processors Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering Jiyuan Zhang B.S., Electrical Engineering, Harbin Institute of Technology Carnegie Mellon University Pittsburgh, PA December 2020 c Jiyuan Zhang, 2020 All Rights Reserved Acknowledgments First and foremost, I would like to thank my advisor | Franz. Thank you for being a supportive and understanding advisor. You encouraged me to pursue projects I enjoyed and guided me to be the researcher that I am today. When I was stuck on a difficult problem, you are always there to offer help and stand side by side with me to figure out solutions. No matter what the difficulties are, from research to personal life issues, you can always consider in our shoes and provide us your unreserved support. Your encouragement and sense of humor has been a great help to guide me through the difficult times. You are the key enabler for me to accomplish the PhD journey, as well as making my PhD journey less suffering and more joyful. Next, I would like to thank the members of my thesis committee { Dr. Michael Garland, Dr. Phil Gibbons, and Dr. Tze Meng Low. Thank you for taking the time to be my thesis committee. Your questions and suggestions have greatly helped shape this work, and your insights have helped me under- stand the limitations that I would have overlooked in the original draft. I am grateful for the advice and suggestions I have received from you to improve this dissertation.
    [Show full text]
  • Introduchon to Arm for Network Stack Developers
    Introducon to Arm for network stack developers Pavel Shamis/Pasha Principal Research Engineer Mvapich User Group 2017 © 2017 Arm Limited Columbus, OH Outline • Arm Overview • HPC SoLware Stack • Porng on Arm • Evaluaon 2 © 2017 Arm Limited Arm Overview © 2017 Arm Limited An introduc1on to Arm Arm is the world's leading semiconductor intellectual property supplier. We license to over 350 partners, are present in 95% of smart phones, 80% of digital cameras, 35% of all electronic devices, and a total of 60 billion Arm cores have been shipped since 1990. Our CPU business model: License technology to partners, who use it to create their own system-on-chip (SoC) products. We may license an instrucBon set architecture (ISA) such as “ARMv8-A”) or a specific implementaon, such as “Cortex-A72”. …and our IP extends beyond the CPU Partners who license an ISA can create their own implementaon, as long as it passes the compliance tests. 4 © 2017 Arm Limited A partnership business model A business model that shares success Business Development • Everyone in the value chain benefits Arm Licenses technology to Partner • Long term sustainability SemiCo Design once and reuse is fundamental IP Partner Licence fee • Spread the cost amongst many partners Provider • Technology reused across mulBple applicaons Partners develop • Creates market for ecosystem to target chips – Re-use is also fundamental to the ecosystem Royalty Upfront license fee OEM • Covers the development cost Customer Ongoing royalBes OEM sells • Typically based on a percentage of chip price
    [Show full text]
  • A Systematic Approach for Obtaining Performance on Matrix-Like Operations Submitted in Partial Fulfillment of the Requirements
    A Systematic Approach for Obtaining Performance on Matrix-Like Operations Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering Richard Michael Veras B.S., Computer Science & Mathematics, The University of Texas at Austin M.S., Electrical and Computer Engineering, Carnegie Mellon University Carnegie Mellon University Pittsburgh, PA August, 2017 Copyright c 2017 Richard Michael� Veras. All rights reserved. Acknowledgments This work would not have been possible without the encouragement and support of those around me. To the members of my thesis committee – Dr. Richard Vuduc, Dr. Keshav Pingali, Dr. James Hoe and my Advisor Dr. Franz Franchetti – thank you for taking the time to be part of this process. Your questions and suggestions helped shape this work. In addition, I am grateful for the advice I have received from the four of you throughout my PhD. Franz, thank you for being a supportive and understanding advisor. Your excitement is contagious and is easily the best part of our meetings, especially during the times when I was stuck on a difficult problem. You en- couraged me to pursue project I enjoyed and guided me to be the researcher that I am today. I am incredibly grateful for your understanding andflexi- bility when it came to my long distance relationship and working remotely. To the Spiral group past and present and A-Level, you are amazing group and community to be a part of. You are all wonderful friends, and you made Pittsburgh feel like a home. Thom and Tze Meng, whether we were talking shop or sci-fi over beers, our discussions have really shaped the way I think and approach ideas, and that will stick with me for a long time.
    [Show full text]
  • Arxiv:1611.01120V1 [Cs.MS] 3 Nov 2016
    Generating Families of Practical Fast Matrix Multiplication Algorithms FLAME Working Note #82 Jianyu Huang*y, Leslie Rice*y, Devin A. Matthewsy, Robert A. van de Geijn*y *Department of Computer Science yInstitute for Computational Engineering and Sciences The University of Texas at Austin, Austin, TX 78712 jianyu@cs., leslierice, dmatthews, rvdg@cs. utexas.edu f g November 3, 2016 Abstract Matrix multiplication (GEMM) is a core operation to numerous scientific applications. Traditional implementations of Strassen-like fast matrix multiplication (FMM) algorithms often do not perform well except for very large matrix sizes, due to the increased cost of memory movement, which is particu- larly noticeable for non-square matrices. Such implementations also require considerable workspace and modifications to the standard BLAS interface. We propose a code generator framework to automatically implement a large family of FMM algorithms suitable for multiplications of arbitrary matrix sizes and shapes. By representing FMM with a triple of matrices U; V; W that capture the linear combinations of submatrices that are formed, we can use the KroneckerJ productK to define a multi-level representation of Strassen-like algorithms. Incorporating the matrix additions that must be performed for Strassen-like algorithms into the inherent packing and micro-kernel operations inside GEMM avoids extra workspace and reduces the cost of memory movement. Adopting the same loop structures as high-performance GEMM implementations allows parallelization of all FMM algorithms with simple but efficient data par- allelism without the overhead of task parallelism. We present a simple performance model for general FMM algorithms and compare actual performance of 20+ FMM algorithms to modeled predictions.
    [Show full text]
  • Implementing Strassen-Like Fast Matrix Multiplication Algorithms with BLIS
    Implementing Strassen-like Fast Matrix Multiplication Algorithms with BLIS Jianyu Huang, Leslie Rice Joint work with Tyler M. Smith, Greg M. Henry, Robert A. van de Geijn BLIS Retreat 2016 *Overlook of the Bay Area. Photo taken in Mission Peak Regional Preserve, Fremont, CA. Summer 2014. STRASSEN, from 30,000 feet Volker Strassen Original Strassen Paper (1969) (Born in 1936, aged 80) One-level Strassen’s Algorithm (In theory) Direct Computation Strassen’s Algorithm 8 multiplications, 8 additions 7 multiplications, 22 additions *Strassen, Volker. "Gaussian elimination is not optimal." Numerische Mathematik 13, no. 4 (1969): 354-356. Multi-level Strassen’s Algorithm (In theory) M := (A +A )(B +B ); 0 00 11 00 11 • One-level Strassen (1+14.3% speedup) M1 := (A10+A11)B00; 8 multiplications → 7 multiplications ; M2 := A00(B01–B11); M3 := A11(B10–B00); • Two-level Strassen (1+30.6% speedup) M := (A +A )B ; 4 00 01 11 64 multiplications → 49 multiplications; M := (A –A )(B +B ); 5 10 00 00 01 • d-level Strassen (n3/n2.803 speedup) M6 := (A01–A11)(B10+B11); d d C00 += M0 + M3 – M4 + M6 8 multiplications → 7 multiplications; C01 += M2 + M4 C10 += M1 + M3 C11 += M0 – M1 + M2 + M5 Multi-level Strassen’s Algorithm (In theory) • One-level Strassen (1+14.3% speedup) M0 := (A00+A11)(B00+B11); 8 multiplications → 7 multiplications ; M1 := (A10+A11)B00; • Two-level Strassen (1+30.6% speedup) M2 := A00(B01–B11); 64 multiplications → 49 multiplications; M := A (B –B ); 3 11 10 00 • d-level Strassen (n3/n2.803 speedup) M4 := (A00+A01)B11; 8d multiplications
    [Show full text]
  • BLAS and LAPACK Runtime Switching
    BLAS and LAPACK runtime switching Mo Zhou [email protected] April 9, 2019 Abstract mark different implementations efficiently, and painlessly switch the underlying implementation for different programs Classical numerical linear algebra libraries, BLAS and LA- under different scenarios. PACK play important roles in the scientific computing field. Various demands on these libraries pose non-trivial challenge on system management and linux distribution development. 1.1 Solutions of Other Distros By leveraging Debian’s update-alternatives mechanism which enables user to switch BLAS and LAPACK libraries smoothly Archlinux only provides the Netlib and OpenBLAS, where and painlessly, the problems could be properly and decently they conflict to each other. This won’t satisfy all user’s de- addressed. This project aims at introducing the mechanism mand, because the OpenMP version of OpenBLAS is not a into Gentoo’s eselect framework to manage BLAS and LA- good choice for pthread-based programs. PACK, providing equivalent or better functionality of De- Fedora’s solution is to provide every possible version, and bian’s update-alternatives. make them co-installable by assigning different SONAMEs. It leads to confusion and chaos if many alternative libraries co-exists on the system, e.g. libopenblas, libopenblasp, li- 1 Rationale bopenblaso. The BSD (Port) Family forces packages to use a specific BLAS (Basic Linear Algebra Subroutines)[1] and LAPACK implemtation on a per-package basis, which clearly doesn’t (Linear Algebra PACKage)[2] are important mathematical satisfy the divsersed user demand. APIs/ABIs/libraries to performance-critical programs that manipulate dense and contiguous numerical arrays. BLAS provides low-level and frequently used linear algebra routines 1.2 Gentoo’s Solution for vector and matrix operations; LAPACK provides higher- level functionality based on complex call graph over BLAS, Currently Gentoo’s solution is based on eselect/pkg-config.
    [Show full text]
  • Implementing Strassen's Algorithm with BLIS
    Implementing Strassen's Algorithm with BLIS FLAME Working Note # 79 Jianyu Huang∗ Tyler M. Smith∗y Greg M. Henryz Robert A. van de Geijn∗y April 16, 2016 Abstract We dispel with \street wisdom" regarding the practical implementation of Strassen's algorithm for matrix-matrix multiplication (DGEMM). Conventional wisdom: it is only practical for very large ma- trices. Our implementation is practical for small matrices. Conventional wisdom: the matrices being multiplied should be relatively square. Our implementation is practical for rank-k updates, where k is relatively small (a shape of importance for libraries like LAPACK). Conventional wisdom: it inher- ently requires substantial workspace. Our implementation requires no workspace beyond buffers already incorporated into conventional high-performance DGEMM implementations. Conventional wisdom: a Strassen DGEMM interface must pass in workspace. Our implementation requires no such workspace and can be plug-compatible with the standard DGEMM interface. Conventional wisdom: it is hard to demonstrate speedup on multi-core architectures. Our implementation demonstrates speedup over TM 1 conventional DGEMM even on an Intel R Xeon Phi coprocessor utilizing 240 threads. We show how a distributed memory matrix-matrix multiplication also benefits from these advances. 1 Introduction Strassen's algorithm (Strassen) [1] for matrix-matrix multiplication (gemm) has fascinated theoreticians and practitioners alike since it was first published, in 1969. That paper demonstrated that multiplication of n × n matrices can be achieved in less than the O(n3) arithmetic operations required by a conventional formulation. It has led to many variants that improve upon this result [2, 3, 4, 5] as well as practical implementations [6, 7, 8, 9].
    [Show full text]
  • Profiling and Inspecting Linear Algebra Applications Using Flexiblas
    Profiling and Inspecting Linear Algebra Applications using FlexiBLAS Martin K¨ohler joint work with Jens Saak, Christian Himpe, and J¨ornPapenbroock March 21, 2018 89th GAMM Annual Meeting Used in many software packages: MATLAB®, GNU Octave NumPy, SciPy Julia FEM Packages like FEniCS, DEAL.II, . Libraries like PetSC, Trillinos, Eigen, SuiteSparse, . What is BLAS? Basic Linear Algebra Subprograms (BLAS) \The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. Because the BLAS are efficient, portable, and widely available, they are commonly used in the development of high quality linear algebra software, LAPACK for example."4 4From: http://www.netlib.org/blas/faq.html { What and where are the BLAS? Martin K¨ohler,[email protected] FlexiBLAS 2/14 What is BLAS? Basic Linear Algebra Subprograms (BLAS) \The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. Because the BLAS are efficient, portable, and widely available, they are commonly used in the development of high quality linear algebra software, LAPACK for example."6 Used in many software packages: MATLAB, GNU Octave NumPy, SciPy Julia FEM Packages like FEniCS, DEAL.II, . Libraries like PetSC, Trillinos, Eigen, SuiteSparse, . 6From: http://www.netlib.org/blas/faq.html { What and where are the BLAS? Martin K¨ohler,[email protected] FlexiBLAS 2/14 FlexiBLAS Lightweight wrapper library around
    [Show full text]
  • 0 the BLIS Framework: Experiments in Portability
    0 The BLIS Framework: Experiments in Portability FIELD G. VAN ZEE, TYLER SMITH, BRYAN MARKER, TZE MENG LOW, ROBERT A. VAN DE GEIJN, The University of Texas at Austin FRANCISCO D. IGUAL, Univ. Complutense de Madrid MIKHAIL SMELYANSKIY, Intel Corporation XIANYI ZHANG, Chinese Academy of Sciences MICHAEL KISTLER, VERNON AUSTEL, JOHN A. GUNNELS, IBM Corporation LEE KILLOUGH, Cray Inc. BLIS is a new software framework for instantiating high-performance BLAS-like dense linear algebra li- braries. We demonstrate how BLIS acts as a productivity multiplier by using it to implement the level-3 BLAS on a variety of current architectures. The systems for which we demonstrate the framework include state-of-the-art general-purpose, low-power, and many-core architectures. We show how, with very little effort, the BLIS framework yields sequential and parallel implementations that are competitive with the performance of ATLAS, OpenBLAS (an effort to maintain and extend the GotoBLAS), and commercial vendor implementations such as AMD's ACML, IBM's ESSL, and Intel's MKL libraries. While most of this paper focuses on single core implementation, we also provide compelling results that suggest the framework's leverage extends to the multithreaded domain. Categories and Subject Descriptors: G.4 [Mathematical Software]: Efficiency General Terms: Algorithms, Performance Authors’ addresses: Field G. Van Zee and Robert A. van de Geijn, Department of Computer Science and In- stitute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, TX 78712, ffield,[email protected]. Tyler Smith, Bryan Marker, and Tze Meng Low, Department of Computer Science, The University of Texas at Austin, Austin, TX 78712, ftms,bamarker,[email protected].
    [Show full text]