Mixed-Precision Numerical Linear Algebra Algorithms: Integer Arithmetic Based LU Factorization and Iterative Refinement for Hermitian Eigenvalue Problem
Total Page:16
File Type:pdf, Size:1020Kb
Load more
										Recommended publications
									
								- 
												  Superlu Users' GuideSuperLU Users' Guide Xiaoye S. Li1 James W. Demmel2 John R. Gilbert3 Laura Grigori4 Meiyue Shao5 Ichitaro Yamazaki6 September 1999 Last update: August 2011 1Lawrence Berkeley National Lab, MS 50F-1650, 1 Cyclotron Rd, Berkeley, CA 94720. ([email protected]). This work was supported in part by the Director, Office of Advanced Scientific Computing Research of the U.S. Department of Energy under Contract No. D-AC02-05CH11231. 2Computer Science Division, University of California, Berkeley, CA 94720. ([email protected]). The research of Demmel and Li was supported in part by NSF grant ASC{9313958, DOE grant DE{FG03{ 94ER25219, UT Subcontract No. ORA4466 from ARPA Contract No. DAAL03{91{C0047, DOE grant DE{ FG03{94ER25206, and NSF Infrastructure grants CDA{8722788 and CDA{9401156. 3Department of Computer Science, University of California, Santa Barbara, CA 93106. ([email protected]). The research of this author was supported in part by the Institute for Mathematics and Its Applications at the University of Minnesota and in part by DARPA Contract No. DABT63-95-C0087. Copyright c 1994-1997 by Xerox Corporation. All rights reserved. 4INRIA Saclay-Ile de France, Laboratoire de Recherche en Informatique, Universite Paris-Sud 11. ([email protected]) 5Department of Computing Science and HPC2N, Ume˚a University, SE-901 87, Ume˚a, Sweden. ([email protected]) 6Innovative Computing Laboratory, Department of Electrical Engineering and Computer Science, The University of Tennessee. ([email protected]). The research of this author was supported in part by the Director, Office of Advanced Scientific Computing Research of the U.S.
- 
												  Linpack Evaluation on a Supercomputer with Heterogeneous AcceleratorsLinpack Evaluation on a Supercomputer with Heterogeneous Accelerators Toshio Endo Akira Nukada Graduate School of Information Science and Engineering Global Scientific Information and Computing Center Tokyo Institute of Technology Tokyo Institute of Technology Tokyo, Japan Tokyo, Japan [email protected] [email protected] Satoshi Matsuoka Naoya Maruyama Global Scientific Information and Computing Center Global Scientific Information and Computing Center Tokyo Institute of Technology/National Institute of Informatics Tokyo Institute of Technology Tokyo, Japan Tokyo, Japan [email protected] [email protected] Abstract—We report Linpack benchmark results on the Roadrunner or other systems described above, it includes TSUBAME supercomputer, a large scale heterogeneous system two types of accelerators. This is due to incremental upgrade equipped with NVIDIA Tesla GPUs and ClearSpeed SIMD of the system, which has been the case in commodity CPU accelerators. With all of 10,480 Opteron cores, 640 Xeon cores, 648 ClearSpeed accelerators and 624 NVIDIA Tesla GPUs, clusters; they may have processors with different speeds as we have achieved 87.01TFlops, which is the third record as a result of incremental upgrade. In this paper, we present a heterogeneous system in the world. This paper describes a Linpack implementation and evaluation results on TSUB- careful tuning and load balancing method required to achieve AME with 10,480 Opteron cores, 624 Tesla GPUs and 648 this performance. On the other hand, since the peak speed is ClearSpeed accelerators. In the evaluation, we also used a 163 TFlops, the efficiency is 53%, which is lower than other systems.
- 
												  ARM HPC EcosystemARM HPC Ecosystem Darren Cepulis HPC Segment Manager ARM Business Segment Group HPC Forum, Santa Fe, NM 19th April 2017 © ARM 2017 ARM Collaboration for Exascale Programs United States Japan ARM is currently a participant in Fujitsu and RIKEN announced that the two Department of Energy funded Post-K system targeted at Exascale will pre-Exascale projects: Data be based on ARMv8 with new Scalable Movement Dominates and Fast Vector Extensions. Forward 2. European Union China Through FP7 and Horizon 2020, James Lin, vice director for the Center ARM has been involved in several of HPC at Shanghai Jiao Tong University funded pre-Exascale projects claims China will build three pre- including the Mont Blanc program Exascale prototypes to select the which deployed one of the first architecture for their Exascale system. ARM prototype HPC systems. The three prototypes are based on AMD, SunWei TaihuLight, and ARMv8. 2 © ARM 2017 ARM HPC deployments starting in 2H2017 Tw o recent announcements about ARM in HPC in Europe: 3 © ARM 2017 Japan Exascale slides from Fujitsu at ISC’16 4 © ARM 2017 Foundational SW Ecosystem for HPC . Linux OS’s – RedHat, SUSE, CENTOS, UBUNTU,… Commercial . Compilers – ARM, GNU, LLVM,… . Libraries – ARM, OpenBLAS, BLIS, ATLAS, FFTW… . Parallelism – OpenMP, OpenMPI, MVAPICH2,… . Debugging – Allinea, RW Totalview, GDB,… Open-source . Analysis – ARM, Allinea, HPCToolkit, TAU,… . Job schedulers – LSF, PBS Pro, SLURM,… . Cluster mgmt – Bright, CMU, warewulf,… Predictable Baseline 5 © ARM 2017 – now on ARM Functional Supported packages / components Areas OpenHPC defines a baseline. It is a community effort to Base OS RHEL/CentOS 7.1, SLES 12 provide a common, verified set of open source packages for Administrative Conman, Ganglia, Lmod, LosF, ORCM, Nagios, pdsh, HPC deployments Tools prun ARM’s participation: Provisioning Warewulf Resource Mgmt.
- 
												  A New Analysis of Iterative Refinement and Its Application to Accurate Solution of Ill-Conditioned Sparse Linear Systems CarsonA New Analysis of Iterative Refinement and its Application to Accurate Solution of Ill-Conditioned Sparse Linear Systems Carson, Erin and Higham, Nicholas J. 2017 MIMS EPrint: 2017.12 Manchester Institute for Mathematical Sciences School of Mathematics The University of Manchester Reports available from: http://eprints.maths.manchester.ac.uk/ And by contacting: The MIMS Secretary School of Mathematics The University of Manchester Manchester, M13 9PL, UK ISSN 1749-9097 A NEW ANALYSIS OF ITERATIVE REFINEMENT AND ITS APPLICATION TO ACCURATE SOLUTION OF ILL-CONDITIONED SPARSE LINEAR SYSTEMS∗ ERIN CARSONy AND NICHOLAS J. HIGHAMz Abstract. Iterative refinement is a long-standing technique for improving the accuracy of a computed solution to a nonsingular linear system Ax = b obtained via LU factorization. It makes use of residuals computed in extra precision, typically at twice the working precision, and existing results guarantee convergence if the matrix A has condition number safely less than the reciprocal of the unit roundoff, u. We identify a mechanism that allows iterative refinement to produce solutions with normwise relative error of order u to systems with condition numbers of order u−1 or larger, provided that the update equation is solved with a relative error sufficiently less than 1. A new rounding error analysis is given and its implications are analyzed. Building on the analysis, we develop a GMRES-based iterative refinement method (GMRES-IR) that makes use of the computed LU factors as preconditioners. GMRES-IR exploits the fact that even if A is extremely ill conditioned the LU factors contain enough information that preconditioning can greatly reduce the condition number of A.
- 
												  0 BLIS: a Modern Alternative to the BLAS0 BLIS: A Modern Alternative to the BLAS FIELD G. VAN ZEE and ROBERT A. VAN DE GEIJN, The University of Texas at Austin We propose the portable BLAS-like Interface Software (BLIS) framework which addresses a number of short- comings in both the original BLAS interface and present-day BLAS implementations. The framework allows developers to rapidly instantiate high-performance BLAS-like libraries on existing and new architectures with relatively little effort. The key to this achievement is the observation that virtually all computation within level-2 and level-3 BLAS operations may be expressed in terms of very simple kernels. Higher-level framework code is generalized so that it can be reused and/or re-parameterized for different operations (as well as different architectures) with little to no modification. Inserting high-performance kernels into the framework facilitates the immediate optimization of any and all BLAS-like operations which are cast in terms of these kernels, and thus the framework acts as a productivity multiplier. Users of BLAS-dependent applications are supported through a straightforward compatibility layer, though calling sequences must be updated for those who wish to access new functionality. Experimental performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL). Categories and Subject Descriptors: G.4 [Mathematical Software]: Efficiency General Terms: Algorithms, Performance Additional Key Words and Phrases: linear algebra, libraries, high-performance, matrix, BLAS ACM Reference Format: ACM Trans. Math. Softw. 0, 0, Article 0 ( 0000), 31 pages.
- 
												  Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization – LAPACK Working Note 184Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization – LAPACK Working Note 184 Jakub Kurzak1, Alfredo Buttari1, Jack Dongarra1,2 1Department of Computer Science, University Tennessee, Knoxville, Tennessee 37996 2Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, 37831 May 10, 2007 ABSTRACT: The STI CELL processor introduces 1 Motivation pioneering solutions in processor architecture. At the same time it presents new challenges for the devel- In numerical computing, there is a fundamental per- opment of numerical algorithms. One is effective ex- formance advantage of using single precision float- ploitation of the differential between the speed of sin- ing point data format over double precision data for- gle and double precision arithmetic; the other is effi- mat, due to more compact representation, thanks to cient parallelization between the short vector SIMD which, twice the number of single precision data ele- cores. In this work, the first challenge is addressed ments can be stored at each stage of the memory hi- by utilizing a mixed-precision algorithm for the solu- erarchy. Short vector SIMD processing provides yet tion of a dense symmetric positive definite system of more potential for performance gains from using sin- linear equations, which delivers double precision ac- gle precision arithmetic over double precision. Since curacy, while performing the bulk of the work in sin- the goal is to process the entire vector in a single gle precision. The second challenge is approached by operation, the computation throughput can be dou- introducing much finer granularity of parallelization bled when the data representation is halved.
- 
												  Supermatrix: a Multithreaded Runtime Scheduling System for Algorithms-By-BlocksSuperMatrix: A Multithreaded Runtime Scheduling System for Algorithms-by-Blocks Ernie Chan Field G. Van Zee Paolo Bientinesi Enrique S. Quintana-Ort´ı Robert van de Geijn Department of Computer Science Gregorio Quintana-Ort´ı Department of Computer Sciences Duke University Departamento de Ingenier´ıa y Ciencia de The University of Texas at Austin Durham, NC 27708 Computadores Austin, Texas 78712 [email protected] Universidad Jaume I {echan,field,rvdg}@cs.utexas.edu 12.071–Castellon,´ Spain {quintana,gquintan}@icc.uji.es Abstract which led to the adoption of out-of-order execution in many com- This paper describes SuperMatrix, a runtime system that paral- puter microarchitectures [32]. For the dense linear algebra opera- lelizes matrix operations for SMP and/or multi-core architectures. tions on which we will concentrate in this paper, many researchers We use this system to demonstrate how code described at a high in the early days of distributed-memory computing recognized that level of abstraction can achieve high performance on such archi- “compute-ahead” techniques could be used to improve parallelism. tectures while completely hiding the parallelism from the library However, the coding complexity required of such an effort proved programmer. The key insight entails viewing matrices hierarchi- too great for these techniques to gain wide acceptance. In fact, cally, consisting of blocks that serve as units of data where oper- compute-ahead optimizations are still absent from linear algebra ations over those blocks are treated as units of computation. The packages such as ScaLAPACK [12] and PLAPACK [34]. implementation transparently enqueues the required operations, in- Recently, there has been a flurry of interest in reviving the idea ternally tracking dependencies, and then executes the operations of compute-ahead [1, 25, 31].
- 
												  18 Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance18 Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance FIELD G. VAN ZEE and ROBERT A. VAN DE GEIJN, The University of Texas at Austin GREGORIO QUINTANA-ORT´I, Universitat Jaume I We show how both the tridiagonal and bidiagonal QR algorithms can be restructured so that they become rich in operations that can achieve near-peak performance on a modern processor. The key is a novel, cache-friendly algorithm for applying multiple sets of Givens rotations to the eigenvector/singular vec- tor matrix. This algorithm is then implemented with optimizations that: (1) leverage vector instruction units to increase floating-point throughput, and (2) fuse multiple rotations to decrease the total number of memory operations. We demonstrate the merits of these new QR algorithms for computing the Hermitian eigenvalue decomposition (EVD) and singular value decomposition (SVD) of dense matrices when all eigen- vectors/singular vectors are computed. The approach yields vastly improved performance relative to tradi- tional QR algorithms for these problems and is competitive with two commonly used alternatives—Cuppen’s Divide-and-Conquer algorithm and the method of Multiple Relatively Robust Representations—while inherit- ing the more modest O(n) workspace requirements of the original QR algorithms. Since the computations performed by the restructured algorithms remain essentially identical to those performed by the original methods, robust numerical properties are preserved. Categories and Subject Descriptors: G.4 [Mathematical Software]: Efficiency General Terms: Algorithms, Performance Additional Key Words and Phrases: Eigenvalues, singular values, tridiagonal, bidiagonal, EVD, SVD, QR algorithm, Givens rotations, linear algebra, libraries, high performance ACM Reference Format: Van Zee, F.
- 
												  Benchmark of C++ Libraries for Sparse Matrix ComputationBenchmark of C++ Libraries for Sparse Matrix Computation Georg Holzmann http://grh.mur.at email: [email protected] August 2007 This report presents benchmarks of C++ scientific computing libraries for small and medium size sparse matrices. Be warned: these benchmarks are very special- ized on a neural network like algorithm I had to implement. However, also the initialization time of sparse matrices and a matrix-vector multiplication was mea- sured, which might be of general interest. WARNING At the time of its writing this document did not consider the eigen library (http://eigen.tuxfamily.org). Please evaluate also this very nice, fast and well maintained library before making any decisions! See http://grh.mur.at/blog/matrix-library-benchmark-follow for more information. Contents 1 Introduction 2 2 Benchmarks 2 2.1 Initialization.....................................3 2.2 Matrix-Vector Multiplication............................4 2.3 Neural Network like Operation...........................4 3 Results 4 4 Conclusion 12 A Libraries and Flags 12 B Implementations 13 1 1 Introduction Quite a lot open-source libraries exist for scientific computing, which makes it hard to choose between them. Therefore, after communication on various mailing lists, I decided to perform some benchmarks, specialized to the kind of algorithms I had to implement. Ideally algorithms should be implemented in an object oriented, reuseable way, but of course without a loss of performance. BLAS (Basic Linear Algebra Subprograms - see [1]) and LA- PACK (Linear Algebra Package - see [3]) are the standard building blocks for efficient scientific software. Highly optimized BLAS implementations for all relevant hardware architectures are available, traditionally expressed in Fortran routines for scalar, vector and matrix operations (called Level 1, 2 and 3 BLAS).
- 
												  Using Machine Learning to Improve Dense and Sparse Matrix Multiplication KernelsIowa State University Capstones, Theses and Graduate Theses and Dissertations Dissertations 2019 Using machine learning to improve dense and sparse matrix multiplication kernels Brandon Groth Iowa State University Follow this and additional works at: https://lib.dr.iastate.edu/etd Part of the Applied Mathematics Commons, and the Computer Sciences Commons Recommended Citation Groth, Brandon, "Using machine learning to improve dense and sparse matrix multiplication kernels" (2019). Graduate Theses and Dissertations. 17688. https://lib.dr.iastate.edu/etd/17688 This Dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected]. Using machine learning to improve dense and sparse matrix multiplication kernels by Brandon Micheal Groth A dissertation submitted to the graduate faculty in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Major: Applied Mathematics Program of Study Committee: Glenn R. Luecke, Major Professor James Rossmanith Zhijun Wu Jin Tian Kris De Brabanter The student author, whose presentation of the scholarship herein was approved by the program of study committee, is solely responsible for the content of this dissertation. The Graduate College will ensure this dissertation is globally accessible and will not permit alterations after a degree is conferred. Iowa State University Ames, Iowa 2019 Copyright c Brandon Micheal Groth, 2019. All rights reserved. ii DEDICATION I would like to dedicate this thesis to my wife Maria and our puppy Tiger.
- 
												  0 BLIS: a Framework for Rapidly Instantiating BLAS Functionality0 BLIS: A Framework for Rapidly Instantiating BLAS Functionality FIELD G. VAN ZEE and ROBERT A. VAN DE GEIJN, The University of Texas at Austin The BLAS-like Library Instantiation Software (BLIS) framework is a new infrastructure for rapidly in- stantiating Basic Linear Algebra Subprograms (BLAS) functionality. Its fundamental innovation is that virtually all computation within level-2 (matrix-vector) and level-3 (matrix-matrix) BLAS operations can be expressed and optimized in terms of very simple kernels. While others have had similar insights, BLIS reduces the necessary kernels to what we believe is the simplest set that still supports the high performance that the computational science community demands. Higher-level framework code is generalized and imple- mented in ISO C99 so that it can be reused and/or re-parameterized for different operations (and different architectures) with little to no modification. Inserting high-performance kernels into the framework facili- tates the immediate optimization of any BLAS-like operations which are cast in terms of these kernels, and thus the framework acts as a productivity multiplier. Users of BLAS-dependent applications are given a choice of using the the traditional Fortran-77 BLAS interface, a generalized C interface, or any other higher level interface that builds upon this latter API. Preliminary performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL). Categories and Subject Descriptors: G.4 [Mathematical Software]: Efficiency General Terms: Algorithms, Performance Additional Key Words and Phrases: linear algebra, libraries, high-performance, matrix, BLAS ACM Reference Format: ACM Trans.
- 
												  Anatomy of High-Performance Matrix MultiplicationAnatomy of High-Performance Matrix Multiplication KAZUSHIGE GOTO The University of Texas at Austin and ROBERT A. VAN DE GEIJN The University of Texas at Austin We present the basic principles which underlie the high-performance implementation of the matrix- matrix multiplication that is part of the widely used GotoBLAS library. Design decisions are justified by successively refining a model of architectures with multilevel memories. A simple but effective algorithm for executing this operation results. Implementations on a broad selection of architectures are shown to achieve near-peak performance. Categories and Subject Descriptors: G.4 [Mathematical Software]: —Efficiency General Terms: Algorithms;Performance Additional Key Words and Phrases: linear algebra, matrix multiplication, basic linear algebra subprogrms 1. INTRODUCTION Implementing matrix multiplication so that near-optimal performance is attained requires a thorough understanding of how the operation must be layered at the macro level in combination with careful engineering of high-performance kernels at the micro level. This paper primarily addresses the macro issues, namely how to exploit a high-performance “inner-kernel”, more so than the the micro issues related to the design and engineering of that “inner-kernel”. In [Gunnels et al. 2001] a layered approach to the implementation of matrix multiplication was reported. The approach was shown to optimally amortize the required movement of data between two adjacent memory layers of an architecture with a complex multi-level memory. Like other work in the area [Agarwal et al. 1994; Whaley et al. 2001], that paper ([Gunnels et al. 2001]) casts computation in terms of an “inner-kernel” that computes C := AB˜ + C for some mc × kc matrix A˜ that is stored contiguously in some packed format and fits in cache memory.