Iterative Methods in Linear Algebra


Iterative Methods in Linear Algebra
Azzam Haidar, Innovative Computing Laboratory, Computer Science Department, The University of Tennessee
COSC 594, Wednesday, April 18, 2018

Outline
• Part I: Discussion (motivation for iterative methods)
• Part II: Stationary Iterative Methods
• Part III: Iterative Refinement Methods
• Part IV: Nonstationary Iterative Methods

Part I: Discussion

Plotting the sparsity patterns in MATLAB:

Original matrix:
>> load matrix.output
>> S = spconvert(matrix);
>> spy(S)

Reordered matrix:
>> load A.data
>> S = spconvert(A);
>> spy(S)

Original sub-matrix:
>> spy(S(1:8660,1:8660))

Reordered sub-matrix:
>> spy(S(8602:8700,8602:8700))

Performance of the sparse matrix-vector product (Mflop/s):

                     torc0: Pentium III 933 MHz,    woodstock: Intel Xeon 5160 Woodcrest,
                     16 KB L1, 256 KB L2 cache      3.0 GHz, 32 KB L1, 4 MB L2
Original matrix      42.4568                        386.8
Reordered matrix     45.3954                        387.6
BCRS                 72.17374                       894.2

Motivation

So far we have discussed and have seen:
• Many engineering and physics simulations lead to sparse matrices, e.g. PDE-based simulations and their discretizations based on FEM/FVM, finite differences, Petrov-Galerkin type conditions, etc.
• How to optimize performance of sparse matrix computations.
• Some software for solving sparse matrix problems (e.g. PETSc).

The question is: can we solve sparse matrix problems faster than with direct sparse methods? In certain cases yes, using iterative methods.

Sparse direct methods

Sparse direct methods are based on Gaussian elimination (LU) performed in sparse format. Are there limitations to the approach? Yes: they produce fill-ins, which lead to more memory requirements and more flops being performed, and the fill-in can become prohibitively high.

Consider LU for the matrix shown on the slide (a nonzero is represented by *): the first step of the LU factorization already introduces fill-ins (marked by F). Fill-in can be reduced by reordering; remember that we talked about reordering in a slightly different context (for speed and parallelization). The slide shows such a reordering of the same matrix. These were extreme cases, but the problem still exists in general.

Sparse direct vs. dense methods:
• Dense LU takes O(n^2) storage and O(n^3) flops, but runs near peak performance.
• Sparse direct LU can take O(#nonz) storage and O(#nonz) flops, but both can grow due to fill-ins, and the performance is bad.
• With #nonz << n^2 and a proper ordering, it pays off to do sparse direct.

Software (table from Julien Langou, http://www.netlib.org/utk/people/JackDongarra/la-sw.html): packages include DSCPACK, HSL, MFACT, MUMPS, PSPASES, SPARSE, SPOOLES, SuperLU, TAUCS, UMFPACK, and Y12M; for each package the table lists whether it is supported, the data type (real/complex), the language (f77/c/c++), the mode (sequential or distributed), and whether it targets SPD or general matrices.
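As a quick way to see the effect of reordering on fill-in discussed above, here is a minimal MATLAB sketch (not from the slides); it reuses the matrix.output file from the spy example, and symamd is just one possible fill-reducing ordering:

>> load matrix.output                   % (i, j, value) triplets, as in the spy example above
>> S = spconvert(matrix);
>> p = symamd(S);                       % a fill-reducing (approximate minimum degree) ordering
>> [L1, U1] = lu(S);                    % factor in the original ordering
>> [L2, U2] = lu(S(p,p));               % factor after the symmetric permutation
>> fprintf('nnz(A) = %d, nnz(L+U) original = %d, reordered = %d\n', ...
           nnz(S), nnz(L1) + nnz(U1), nnz(L2) + nnz(U2))
>> spy(S(p,p))                          % compare with spy(S) above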
What about iterative methods? Think for example of

    x_{i+1} = x_i + P(b − A x_i).

Pluses:
• Storage is only O(#nonz) (for the matrix and a few working vectors).
• Can take only a few iterations to converge (e.g. << n); for P = A^{-1} it takes one iteration (check!).
• In general easy to parallelize.

Minuses:
• Performance can be bad (as we saw previously).
• Convergence can be slow or even stagnate.
• But both can be improved with preconditioning: think of P as a preconditioner, an operator/matrix P ≈ A^{-1}. Optimal preconditioners (e.g. multigrid can be optimal) lead to convergence in O(1) iterations.

Part II: Stationary Iterative Methods

Stationary iterative methods can be expressed in the form x_{i+1} = B x_i + c, where B and c do not depend on i. They are older, simpler, and easy to implement, but usually not as effective as nonstationary methods. Examples: Richardson, Jacobi, Gauss-Seidel, SOR, etc.

Richardson Method

The Richardson iteration is

    x_{i+1} = x_i + (b − A x_i) = (I − A) x_i + b,                      (1)

i.e. B from the definition above is B = I − A. Denote e_i = x − x_i and rewrite (1):

    x − x_{i+1} = x − x_i − (A x − A x_i)
    e_{i+1} = e_i − A e_i = (I − A) e_i,

so

    ||e_{i+1}|| ≤ ||I − A|| ||e_i|| ≤ ||I − A||^2 ||e_{i−1}|| ≤ ··· ≤ ||I − A||^{i+1} ||e_0||,

i.e. for convergence (e_i → 0) we need ||I − A|| < 1 for some norm ||·||, e.g. when ρ(B) < 1.

Jacobi Method

The Jacobi iteration is

    x_{i+1} = x_i + D^{-1}(b − A x_i) = (I − D^{-1}A) x_i + D^{-1} b,   (2)

where D is the diagonal of A (assumed nonzero), so B = I − D^{-1}A in the definition above. Denote e_i = x − x_i and rewrite (2):

    x − x_{i+1} = x − x_i − D^{-1}(A x − A x_i)
    e_{i+1} = e_i − D^{-1} A e_i = (I − D^{-1}A) e_i,

so

    ||e_{i+1}|| ≤ ||I − D^{-1}A|| ||e_i|| ≤ ||I − D^{-1}A||^2 ||e_{i−1}|| ≤ ··· ≤ ||I − D^{-1}A||^{i+1} ||e_0||,

i.e. for convergence (e_i → 0) we need ||I − D^{-1}A|| < 1 for some norm ||·||, e.g. when ρ(I − D^{-1}A) < 1.

Gauss-Seidel Method

The Gauss-Seidel iteration is

    x_{i+1} = (D − L)^{-1}(U x_i + b),                                   (3)

where −L is the strictly lower and −U the strictly upper triangular part of A, and D is its diagonal (so A = D − L − U). Equivalently, component by component:

    x_1^{(i+1)} = (b_1 − a_{12} x_2^{(i)} − a_{13} x_3^{(i)} − a_{14} x_4^{(i)} − ... − a_{1n} x_n^{(i)}) / a_{11}
    x_2^{(i+1)} = (b_2 − a_{21} x_1^{(i+1)} − a_{23} x_3^{(i)} − a_{24} x_4^{(i)} − ... − a_{2n} x_n^{(i)}) / a_{22}
    x_3^{(i+1)} = (b_3 − a_{31} x_1^{(i+1)} − a_{32} x_2^{(i+1)} − a_{34} x_4^{(i)} − ... − a_{3n} x_n^{(i)}) / a_{33}
    ...
    x_n^{(i+1)} = (b_n − a_{n1} x_1^{(i+1)} − a_{n2} x_2^{(i+1)} − a_{n3} x_3^{(i+1)} − ... − a_{n,n−1} x_{n−1}^{(i+1)}) / a_{nn}

A comparison (from Julien Langou's slides)

Apply the Gauss-Seidel method to

         [ 10   1   2   0   1 ]        [ 23 ]
         [  0  12   1   3   1 ]        [ 44 ]
    A =  [  1   2   9   1   0 ]    b = [ 36 ]
         [  0   3   1  10   0 ]        [ 49 ]
         [  1   2   0   0  15 ]        [ 80 ]

(the slide plots ||b − A x^{(i)}||_2 / ||b||_2 as a function of i, the number of iterations). Starting from x^{(0)} = 0, the iterates are:

    x^(0)     x^(1)     x^(2)     x^(3)     x^(4)     ...
    0.0000    2.3000    0.8783    0.9790    0.9978    1.0000
    0.0000    3.6667    2.1548    2.0107    2.0005    2.0000
    0.0000    2.9296    3.0339    3.0055    3.0006    3.0000
    0.0000    3.5070    3.9502    3.9962    3.9998    4.0000
    0.0000    4.6911    4.9875    5.0000    5.0001    5.0000
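The component formulas above translate directly into code. Here is a minimal MATLAB sketch (not from the slides) of one Jacobi and one Gauss-Seidel loop applied to the 5x5 example; the fixed number of sweeps and the printed relative residual are assumptions added for illustration:

% Jacobi and Gauss-Seidel sweeps on the 5x5 example above.
A  = [10 1 2 0 1; 0 12 1 3 1; 1 2 9 1 0; 0 3 1 10 0; 1 2 0 0 15];
b  = [23; 44; 36; 49; 80];                 % the exact solution is (1, 2, 3, 4, 5)'
xj = zeros(5,1); xg = zeros(5,1); n = 5;
for it = 1:4
    xj = xj + diag(diag(A)) \ (b - A*xj);  % Jacobi: x_{i+1} = x_i + D^{-1}(b - A x_i), formula (2)
    for k = 1:n                            % Gauss-Seidel sweep, formula (3)
        idx   = [1:k-1, k+1:n];            % already-updated entries xg(1:k-1) are used below
        xg(k) = (b(k) - A(k,idx)*xg(idx)) / A(k,k);
    end
    fprintf('sweep %d: Jacobi residual %.1e, Gauss-Seidel residual %.1e\n', ...
            it, norm(b - A*xj)/norm(b), norm(b - A*xg)/norm(b));
end
disp([xj xg])                              % after 4 sweeps xg matches the x^(4) column of the table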
Comparing the two methods component by component: the Jacobi update

    x_k^{(i+1)} = (b_k − a_{k1} x_1^{(i)} − ... − a_{k,k−1} x_{k−1}^{(i)} − a_{k,k+1} x_{k+1}^{(i)} − ... − a_{kn} x_n^{(i)}) / a_{kk}

uses only values from iteration i on the right-hand side, whereas the Gauss-Seidel update (as written above) uses the already-computed components x_1^{(i+1)}, ..., x_{k−1}^{(i+1)} of the new iterate. Gauss-Seidel is the method with the better numerical properties (fewer iterations to convergence). Which method is more efficient to implement on a sequential or parallel computer, and why? In Gauss-Seidel, the computation of x_{k+1}^{(i+1)} implies the knowledge of x_k^{(i+1)}, so parallelization is impossible.

Convergence

Convergence can be very slow. Consider a modified Richardson method:

    x_{i+1} = x_i + τ (b − A x_i).                                       (4)

Convergence is linear; similarly to Richardson we get

    ||e_{i+1}|| ≤ ||I − τA||^{i+1} ||e_0||,

but it can be very slow if ||I − τA|| is close to 1. For example, let A be symmetric and positive definite (SPD) and let the matrix norm in ||I − τA|| be the one induced by ||·||_2. Then the best choice for τ (namely τ = 2/(λ_1 + λ_N)) gives

    ||I − τA|| = (k(A) − 1) / (k(A) + 1),

where k(A) = λ_N / λ_1 is the condition number of A.

Note: the rate of convergence depends on the condition number k(A). Even for the best τ, the rate (k(A) − 1)/(k(A) + 1) is slow (i.e. close to 1) for large k(A). We will see that the convergence of nonstationary methods also depends on k(A), but is better.
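A minimal MATLAB sketch (not from the slides) that checks the bound above numerically; the small SPD tridiagonal test matrix is an assumption:

n   = 50;
A   = full(gallery('tridiag', n, -1, 2, -1));   % a small SPD model problem
lam = eig(A);
tau = 2 / (min(lam) + max(lam));                % the best choice tau = 2/(lambda_1 + lambda_N)
kA  = max(lam) / min(lam);                      % condition number k(A)
fprintf('||I - tau*A||_2 = %.6f, (k(A)-1)/(k(A)+1) = %.6f\n', ...
        norm(eye(n) - tau*A), (kA - 1)/(kA + 1));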
Recommended publications
  • SuperLU Users' Guide
    SuperLU Users' Guide. Xiaoye S. Li, James W. Demmel, John R. Gilbert, Laura Grigori, Meiyue Shao, Ichitaro Yamazaki. September 1999; last update: August 2011. Lawrence Berkeley National Lab, MS 50F-1650, 1 Cyclotron Rd, Berkeley, CA 94720 ([email protected]); this work was supported in part by the Director, Office of Advanced Scientific Computing Research of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. Computer Science Division, University of California, Berkeley, CA 94720 ([email protected]); the research of Demmel and Li was supported in part by NSF grant ASC-9313958, DOE grant DE-FG03-94ER25219, UT Subcontract No. ORA4466 from ARPA Contract No. DAAL03-91-C0047, DOE grant DE-FG03-94ER25206, and NSF Infrastructure grants CDA-8722788 and CDA-9401156. Department of Computer Science, University of California, Santa Barbara, CA 93106 ([email protected]); the research of this author was supported in part by the Institute for Mathematics and Its Applications at the University of Minnesota and in part by DARPA Contract No. DABT63-95-C0087. Copyright © 1994-1997 by Xerox Corporation. All rights reserved. INRIA Saclay-Ile de France, Laboratoire de Recherche en Informatique, Université Paris-Sud 11 ([email protected]). Department of Computing Science and HPC2N, Umeå University, SE-901 87, Umeå, Sweden ([email protected]). Innovative Computing Laboratory, Department of Electrical Engineering and Computer Science, The University of Tennessee ([email protected]); the research of this author was supported in part by the Director, Office of Advanced Scientific Computing Research of the U.S.
  • A New Analysis of Iterative Refinement and Its Application to Accurate Solution of Ill-Conditioned Sparse Linear Systems Carson
    A New Analysis of Iterative Refinement and its Application to Accurate Solution of Ill-Conditioned Sparse Linear Systems. Carson, Erin and Higham, Nicholas J., 2017. MIMS EPrint: 2017.12, Manchester Institute for Mathematical Sciences, School of Mathematics, The University of Manchester. Reports available from: http://eprints.maths.manchester.ac.uk/ and by contacting: The MIMS Secretary, School of Mathematics, The University of Manchester, Manchester, M13 9PL, UK. ISSN 1749-9097. Abstract. Iterative refinement is a long-standing technique for improving the accuracy of a computed solution to a nonsingular linear system Ax = b obtained via LU factorization. It makes use of residuals computed in extra precision, typically at twice the working precision, and existing results guarantee convergence if the matrix A has condition number safely less than the reciprocal of the unit roundoff, u. We identify a mechanism that allows iterative refinement to produce solutions with normwise relative error of order u to systems with condition numbers of order u^{-1} or larger, provided that the update equation is solved with a relative error sufficiently less than 1. A new rounding error analysis is given and its implications are analyzed. Building on the analysis, we develop a GMRES-based iterative refinement method (GMRES-IR) that makes use of the computed LU factors as preconditioners. GMRES-IR exploits the fact that even if A is extremely ill conditioned the LU factors contain enough information that preconditioning can greatly reduce the condition number of A.
  • Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization – LAPACK Working Note 184
    Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization – LAPACK Working Note 184. Jakub Kurzak, Alfredo Buttari, Jack Dongarra. Department of Computer Science, University of Tennessee, Knoxville, Tennessee 37996; Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831. May 10, 2007. ABSTRACT: The STI CELL processor introduces pioneering solutions in processor architecture. At the same time it presents new challenges for the development of numerical algorithms. One is effective exploitation of the differential between the speed of single and double precision arithmetic; the other is efficient parallelization between the short vector SIMD cores. In this work, the first challenge is addressed by utilizing a mixed-precision algorithm for the solution of a dense symmetric positive definite system of linear equations, which delivers double precision accuracy, while performing the bulk of the work in single precision. The second challenge is approached by introducing much finer granularity of parallelization. 1 Motivation. In numerical computing, there is a fundamental performance advantage of using single precision floating point data format over double precision data format, due to more compact representation, thanks to which twice the number of single precision data elements can be stored at each stage of the memory hierarchy. Short vector SIMD processing provides yet more potential for performance gains from using single precision arithmetic over double precision. Since the goal is to process the entire vector in a single operation, the computation throughput can be doubled when the data representation is halved.
  • Error Bounds from Extra-Precise Iterative Refinement
    Error Bounds from Extra-Precise Iterative Refinement. James Demmel, Yozo Hida, and William Kahan (University of California, Berkeley), Xiaoye S. Li (Lawrence Berkeley National Laboratory), Sonil Mukherjee (Oracle), and E. Jason Riedy (University of California, Berkeley). We present the design and testing of an algorithm for iterative refinement of the solution of linear equations where the residual is computed with extra precision. This algorithm was originally proposed in 1948 and analyzed in the 1960s as a means to compute very accurate solutions to all but the most ill-conditioned linear systems. However, two obstacles have until now prevented its adoption in standard subroutine libraries like LAPACK: (1) There was no standard way to access the higher precision arithmetic needed to compute residuals, and (2) it was unclear how to compute a reliable error bound for the computed solution. The completion of the new BLAS Technical Forum Standard has essentially removed the first obstacle. To overcome the second obstacle, we show how the application of iterative refinement can be used to compute an error bound in any norm at small cost and use this to compute both an error bound in the usual infinity norm, and a componentwise relative error bound. This research was supported in part by the NSF Cooperative Agreement No. ACI-9619020; NSF Grant Nos. ACI-9813362 and CCF-0444486; the DOE Grant Nos. DE-FG03-94ER25219, DE-FC03-98ER25351, and DE-FC02-01ER25478; and the National Science Foundation Graduate Research Fellowship. The authors wish to acknowledge the contribution from Intel Corporation, Hewlett-Packard Corporation, IBM Corporation, and the National Science Foundation grant EIA-0303575 in making hardware and software available for the CITRIS Cluster which was used in producing these research results.
  • Three-Precision GMRES-Based Iterative Refinement for Least Squares Problems Carson, Erin and Higham, Nicholas J. and Pranesh, Srikara
    Three-Precision GMRES-Based Iterative Refinement for Least Squares Problems. Carson, Erin and Higham, Nicholas J. and Pranesh, Srikara, 2020. MIMS EPrint: 2020.5, Manchester Institute for Mathematical Sciences, School of Mathematics, The University of Manchester. Reports available from: http://eprints.maths.manchester.ac.uk/ and by contacting: The MIMS Secretary, School of Mathematics, The University of Manchester, Manchester, M13 9PL, UK. ISSN 1749-9097. Abstract. The standard iterative refinement procedure for improving an approximate solution to the least squares problem min_x ||b − Ax||_2, where A ∈ R^{m×n} with m ≥ n has full rank, is based on solving the (m + n) × (m + n) augmented system with the aid of a QR factorization. In order to exploit multiprecision arithmetic, iterative refinement can be formulated to use three precisions, but the resulting algorithm converges only for a limited range of problems. We build an iterative refinement algorithm called GMRES-LSIR, analogous to the GMRES-IR algorithm developed for linear systems [SIAM J. Sci. Comput., 40 (2019), pp. A817-A847], that solves the augmented system using GMRES preconditioned by a matrix based on the computed QR factors. We explore two left preconditioners; the first has full off-diagonal blocks and the second is block diagonal and can be applied in either left-sided or split form. We prove that for a wide range of problems the first preconditioner yields backward and forward errors for the augmented system of order the working precision under suitable assumptions on the precisions and the problem conditioning.
  • Using Mixed Precision in Numerical Computations to Speedup Linear Algebra Solvers
    Using Mixed Precision in Numerical Computations to Speedup Linear Algebra Solvers. Jack Dongarra (UTK/ORNL/U Manchester), Azzam Haidar (Nvidia), Nick Higham (U of Manchester), Stan Tomov (UTK). Slides can be found: http://bit.ly/icerm-05-2020-dongarra (5/7/20). Background: my interest in mixed precision began with my dissertation, Improving the Accuracy of Computed Matrix Eigenvalues: compute the eigenvalues and eigenvectors in low precision, then improve selected values/vectors to higher precision for O(n^2) ops using the matrix decomposition; extended to singular values, 1983; algorithm in TOMS 710, 1992. IBM's Cell Processor (2004): 9 cores (a Power PC at 3.2 GHz plus 8 SPEs), 204.8 Gflop/s peak for $600; the catch is that this is for 32-bit floating point (single precision, SP); the 64-bit peak is 14.6 Gflop/s, 14 times slower than SP (a factor of 2 because of DP and 7 because of latency issues). The SPEs were fully IEEE-754 compliant in double precision; in single precision they only implement round-towards-zero, denormalized numbers are flushed to zero, and NaNs are treated like normal numbers. The mixed precision idea goes something like this: exploit 32-bit floating point as much as possible, especially for the bulk of the computation; correct or update the solution with selective use of 64-bit floating point to provide a refined result. Intuitively: compute a 32-bit result, calculate a correction to the 32-bit result using selected higher precision, and perform the update of the 32-bit result with the correction using high precision.
  • A Survey of Recent Developments in Parallel Implementations of Gaussian Elimination
    CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2015; 27:1292–1309 Published online 2 June 2014 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.3306 A survey of recent developments in parallel implementations of Gaussian elimination Simplice Donfack1, Jack Dongarra1, Mathieu Faverge2, Mark Gates1, Jakub Kurzak1,*,†, Piotr Luszczek1 and Ichitaro Yamazaki1 1EECS Department, University of Tennessee, Knoxville, TN, USA 2IPB ENSEIRB-Matmeca, Inria Bordeaux Sud-Ouest, Bordeaux, France SUMMARY Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm has received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination for shared memory architecture. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Partial Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multisocket multicore systems are presented. Performance and numerical accuracy is analyzed. Copyright © 2014 John Wiley & Sons, Ltd. Received 15 July 2013; Revised 26 April 2014; Accepted 2 May 2014 KEY WORDS: Gaussian elimination; LU factorization; parallel; shared memory; multicore 1. INTRODUCTION Gaussian elimination has a long history that can be traced back some 2000 years [1]. Today, dense systems of linear equations have become a critical cornerstone for some of the most compute intensive applications.
  • Multiprecision Algorithms
    Multiprecision Algorithms. Nick Higham, School of Mathematics, The University of Manchester. http://www.maths.manchester.ac.uk/~higham/ @nhigham, nickhigham.wordpress.com. Mathematical Modelling, Numerical Analysis and Scientific Computing, Kácov, Czech Republic, May 27-June 1, 2018. Outline: Multiprecision arithmetic: floating point arithmetic supporting multiple, possibly arbitrary, precisions. Applications of and support for low precision. Applications of and support for high precision. How to exploit different precisions to achieve faster algorithms with higher accuracy. Focus on iterative refinement for Ax = b and the matrix logarithm. Download these slides from http://bit.ly/kacov18. Lecture 1: floating-point arithmetic; hardware landscape; low precision arithmetic. Floating Point Number System: a floating point number system F ⊂ R consists of y = ±m × β^{e−t}, 0 ≤ m ≤ β^t − 1, with base β (β = 2 in practice), precision t, and exponent range emin ≤ e ≤ emax; assume normalized: m ≥ β^{t−1}. Floating point numbers are not equally spaced: for β = 2, t = 3, emin = −1, and emax = 3, the slide plots the representable numbers between 0 and 7 on the number line. Subnormal Numbers: 0 ≠ y ∈ F is normalized if m ≥ β^{t−1}.
  • Mixed Precision Iterative Refinement
    Mixed Precision Iterative Refinement Techniques for the Solution of Dense Linear Systems. Alfredo Buttari, Jack Dongarra, Julie Langou, Julien Langou, Piotr Luszczek, and Jakub Kurzak. Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, Tennessee; Department of Mathematical Sciences, University of Colorado at Denver and Health Sciences Center, Denver, Colorado; Oak Ridge National Laboratory, Oak Ridge, Tennessee; University of Manchester. September 9, 2007. Abstract. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to exotic technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the Cell BE processor. Results on modern processor architectures and the Cell BE are presented. Introduction. In numerical computing, there is a fundamental performance advantage in using the single precision, floating point data format over the double precision one. Due to more compact representation, twice the number of single precision data elements can be stored at each level of the memory hierarchy including the register file, the set of caches, and the main memory. By the same token, handling single precision values consumes less bandwidth between different memory levels and decreases the number of cache and TLB misses. However, the data movement aspect affects mostly memory-intensive, bandwidth-bound applications, and historically has not drawn much attention to mixed precision algorithms. In the past, the situation looked differently for computationally intensive workloads, where the load was on the floating point processing units rather than the memory subsystem, and so the single precision data motion advantages were for the most part irrelevant. (A minimal illustrative sketch of this single/double refinement loop appears after this publications list.)
  • Design, Implementation and Testing of Extended and Mixed Precision BLAS ∗
    Design, Implementation and Testing of Extended and Mixed Precision BLAS. X. S. Li, J. W. Demmel, D. H. Bailey, G. Henry, Y. Hida, J. Iskandar, W. Kahan, A. Kapur, M. C. Martin, T. Tung, D. J. Yoo. October 20, 2000. Abstract. This paper describes the design rationale, a C implementation, and conformance testing of a subset of the new Standard for the BLAS (Basic Linear Algebra Subroutines): Extended and Mixed Precision BLAS. Permitting higher internal precision and mixed input/output types and precisions permits us to implement some algorithms that are simpler, more accurate, and sometimes faster than possible without these features. The new BLAS are challenging to implement and test because there are many more subroutines than in the existing Standard, and because we must be able to assess whether a higher precision is used for internal computations than is used either for input or output variables. So we have developed an automated process of generating and systematically testing these routines. Our methodology is applicable to languages besides C. In particular, our algorithms used in the testing code would be very valuable to all the other BLAS implementors. Our extra precision routines achieve excellent performance, close to half of the machine peak Megaflop rate even for the Level 2 BLAS, when the data access is stride one. This research was supported in part by the National Science Foundation Cooperative Agreement No. ACI-9619020, NSF Grant No. ACI-9813362, the Department of Energy Grant Nos. DE-FG03-94ER25219 and DE-FC03-98ER25351, and gifts from the IBM Shared University Research Program, Sun Microsystems, and Intel.
  • Adaptive Dynamic Precision Iterative Refinement Jun Kyu Lee [email protected]
    University of Tennessee, Knoxville. Trace: Tennessee Research and Creative Exchange, Doctoral Dissertations, Graduate School, 8-2012. AIR: Adaptive Dynamic Precision Iterative Refinement. Jun Kyu Lee, [email protected]. Recommended Citation: Lee, Jun Kyu, "AIR: Adaptive Dynamic Precision Iterative Refinement." PhD diss., University of Tennessee, 2012. https://trace.tennessee.edu/utk_graddiss/1446. This Dissertation is brought to you for free and open access by the Graduate School at Trace: Tennessee Research and Creative Exchange. It has been accepted for inclusion in Doctoral Dissertations by an authorized administrator of Trace: Tennessee Research and Creative Exchange. For more information, please contact [email protected]. To the Graduate Council: I am submitting herewith a dissertation written by Jun Kyu Lee entitled "AIR: Adaptive Dynamic Precision Iterative Refinement." I have examined the final electronic copy of this dissertation for form and content and recommend that it be accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, with a major in Computer Engineering. Gregory D. Peterson, Major Professor. We have read this dissertation and recommend its acceptance: Itamar Arel, Robert J. Hinde, Robert J. Harrison. Accepted for the Council: Dixie L. Thompson, Vice Provost and Dean of the Graduate School. (Original signatures are on file with official student records.)
  • PARDISO User Guide Version 7.2
    Parallel Sparse Direct And Multi-Recursive Iterative Linear Solvers. PARDISO User Guide Version 7.2 (Updated December 28, 2020). Olaf Schenk and Klaus Gärtner. Since a lot of time and effort has gone into PARDISO version 7.2, please cite the following publications if you are using PARDISO for your own research: [1] C. Alappat, A. Basermann, A. R. Bishop, H. Fehske, G. Hager, O. Schenk, J. Thies, G. Wellein. A Recursive Algebraic Coloring Technique for Hardware-Efficient Symmetric Sparse Matrix-Vector Multiplication. ACM Trans. Parallel Comput., Vol. 7, No. 3, Article 19, June 2020. [2] M. Bollhöfer, O. Schenk, R. Janalik, S. Hamm, and K. Gullapalli. State-of-The-Art Sparse Direct Solvers. Parallel Algorithms in Computational Science and Engineering, Modeling and Simulation in Science, Engineering and Technology, pp. 3-33, Birkhäuser, 2020. [3] M. Bollhöfer, A. Eftekhari, S. Scheidegger, O. Schenk. Large-scale Sparse Inverse Covariance Matrix Estimation. SIAM Journal on Scientific Computing, 41(1), A380-A401, January 2019. Note: This document briefly describes the interface to the shared-memory and distributed-memory multiprocessing parallel direct and multi-recursive iterative solvers in PARDISO 7.2. A discussion of the algorithms used in PARDISO and more information on the solver can be found at http://www.pardiso-project.org. (Please note that this version is a significant extension compared to Intel's MKL version and that these improvements and features are not available in Intel's MKL 9.3 release or later releases.)
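Several of the entries above (in particular the mixed precision iterative refinement paper and the mixed precision slides) describe the same basic loop: factor the matrix in fast low precision, then refine the solution with residuals computed in the working precision. The following is a minimal MATLAB sketch of that idea (not code from any of the papers above); the test matrix and the fixed number of refinement steps are assumptions:

% Mixed precision iterative refinement: single-precision LU, double-precision residuals.
n = 100;
A = rand(n) + n*eye(n);                   % an assumed well-conditioned test matrix
b = A * ones(n,1);
[L, U, p] = lu(single(A), 'vector');      % the O(n^3) factorization is done in single precision
x = double(U \ (L \ single(b(p))));       % initial solution from the single-precision factors
for it = 1:4
    r = b - A*x;                          % residual computed in double (working) precision
    d = double(U \ (L \ single(r(p))));   % correction from the single-precision factors
    x = x + d;                            % update; accuracy improves toward double precision
    fprintf('step %d: relative residual %.2e\n', it, norm(r)/(norm(A,1)*norm(x)));
end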