Solving Diagonally Dominant Tridiagonal Linear Systems with FPGAs in an Heterogeneous Computing Environment
Total Page:16
File Type:pdf, Size:1020Kb
Recommended publications
-
SuperLU Users' Guide
SuperLU Users' Guide Xiaoye S. Li1 James W. Demmel2 John R. Gilbert3 Laura Grigori4 Meiyue Shao5 Ichitaro Yamazaki6 September 1999 Last update: August 2011 1Lawrence Berkeley National Lab, MS 50F-1650, 1 Cyclotron Rd, Berkeley, CA 94720. ([email protected]). This work was supported in part by the Director, Office of Advanced Scientific Computing Research of the U.S. Department of Energy under Contract No. D-AC02-05CH11231. 2Computer Science Division, University of California, Berkeley, CA 94720. ([email protected]). The research of Demmel and Li was supported in part by NSF grant ASC-9313958, DOE grant DE-FG03-94ER25219, UT Subcontract No. ORA4466 from ARPA Contract No. DAAL03-91-C0047, DOE grant DE-FG03-94ER25206, and NSF Infrastructure grants CDA-8722788 and CDA-9401156. 3Department of Computer Science, University of California, Santa Barbara, CA 93106. ([email protected]). The research of this author was supported in part by the Institute for Mathematics and Its Applications at the University of Minnesota and in part by DARPA Contract No. DABT63-95-C0087. Copyright © 1994-1997 by Xerox Corporation. All rights reserved. 4INRIA Saclay-Ile de France, Laboratoire de Recherche en Informatique, Université Paris-Sud 11. ([email protected]) 5Department of Computing Science and HPC2N, Umeå University, SE-901 87, Umeå, Sweden. ([email protected]) 6Innovative Computing Laboratory, Department of Electrical Engineering and Computer Science, The University of Tennessee. ([email protected]). The research of this author was supported in part by the Director, Office of Advanced Scientific Computing Research of the U.S. -
Enhanced Capabilities of the Spike Algorithm and a New Spike-OpenMP Solver
University of Massachusetts Amherst ScholarWorks@UMass Amherst Masters Theses Dissertations and Theses November 2014 Enhanced Capabilities of the Spike Algorithm and a New Spike-OpenMP Solver Braegan S. Spring University of Massachusetts Amherst Follow this and additional works at: https://scholarworks.umass.edu/masters_theses_2 Recommended Citation Spring, Braegan S., "Enhanced Capabilities of the Spike Algorithm and a New Spike-OpenMP Solver" (2014). Masters Theses. 116. https://doi.org/10.7275/5912243 https://scholarworks.umass.edu/masters_theses_2/116 This Open Access Thesis is brought to you for free and open access by the Dissertations and Theses at ScholarWorks@UMass Amherst. It has been accepted for inclusion in Masters Theses by an authorized administrator of ScholarWorks@UMass Amherst. For more information, please contact [email protected]. ENHANCED CAPABILITIES OF THE SPIKE ALGORITHM AND A NEW SPIKE-OPENMP SOLVER A Thesis Presented by BRAEGAN SPRING Submitted to the Graduate School of the University of Massachusetts Amherst in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE IN ELECTRICAL AND COMPUTER ENGINEERING September 2014 Electrical and Computer Engineering Approved as to style and content by: Eric Polizzi, Chair Zlatan Aksamija, Member Hans Johnston, Member Christopher V. Hollot, Department Chair Electrical and Computer Engineering ACKNOWLEDGMENTS I would like to thank Dr. Zlatan Aksamija & Dr. Hans Johnston for assisting me in this project, Dr. Polizzi for guiding me through it, and Dr. Anderson for his help beforehand.
ABSTRACT ENHANCED CAPABILITIES OF THE SPIKE ALGORITHM AND A NEW SPIKE-OPENMP SOLVER SEPTEMBER 2014 BRAEGAN SPRING B.Sc., UNIVERSITY OF MASSACHUSETTS AMHERST M.S.E.C.E., UNIVERSITY OF MASSACHUSETTS AMHERST Directed by: Professor Eric Polizzi SPIKE is a parallel algorithm to solve block tridiagonal matrices. -
A New Analysis of Iterative Refinement and Its Application to Accurate Solution of Ill-Conditioned Sparse Linear Systems Carson
A New Analysis of Iterative Refinement and its Application to Accurate Solution of Ill-Conditioned Sparse Linear Systems Carson, Erin and Higham, Nicholas J. 2017 MIMS EPrint: 2017.12 Manchester Institute for Mathematical Sciences School of Mathematics The University of Manchester Reports available from: http://eprints.maths.manchester.ac.uk/ And by contacting: The MIMS Secretary School of Mathematics The University of Manchester Manchester, M13 9PL, UK ISSN 1749-9097 A NEW ANALYSIS OF ITERATIVE REFINEMENT AND ITS APPLICATION TO ACCURATE SOLUTION OF ILL-CONDITIONED SPARSE LINEAR SYSTEMS∗ ERIN CARSON† AND NICHOLAS J. HIGHAM‡ Abstract. Iterative refinement is a long-standing technique for improving the accuracy of a computed solution to a nonsingular linear system $Ax = b$ obtained via LU factorization. It makes use of residuals computed in extra precision, typically at twice the working precision, and existing results guarantee convergence if the matrix $A$ has condition number safely less than the reciprocal of the unit roundoff, $u$. We identify a mechanism that allows iterative refinement to produce solutions with normwise relative error of order $u$ to systems with condition numbers of order $u^{-1}$ or larger, provided that the update equation is solved with a relative error sufficiently less than 1. A new rounding error analysis is given and its implications are analyzed. Building on the analysis, we develop a GMRES-based iterative refinement method (GMRES-IR) that makes use of the computed LU factors as preconditioners. GMRES-IR exploits the fact that even if $A$ is extremely ill conditioned the LU factors contain enough information that preconditioning can greatly reduce the condition number of $A$. -
Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization – LAPACK Working Note 184
Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization – LAPACK Working Note 184 Jakub Kurzak1, Alfredo Buttari1, Jack Dongarra1,2 1Department of Computer Science, University of Tennessee, Knoxville, Tennessee 37996 2Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, 37831 May 10, 2007
ABSTRACT: The STI CELL processor introduces pioneering solutions in processor architecture. At the same time it presents new challenges for the development of numerical algorithms. One is effective exploitation of the differential between the speed of single and double precision arithmetic; the other is efficient parallelization between the short vector SIMD cores. In this work, the first challenge is addressed by utilizing a mixed-precision algorithm for the solution of a dense symmetric positive definite system of linear equations, which delivers double precision accuracy, while performing the bulk of the work in single precision. The second challenge is approached by introducing much finer granularity of parallelization
1 Motivation In numerical computing, there is a fundamental performance advantage of using single precision floating point data format over double precision data format, due to more compact representation, thanks to which, twice the number of single precision data elements can be stored at each stage of the memory hierarchy. Short vector SIMD processing provides yet more potential for performance gains from using single precision arithmetic over double precision. Since the goal is to process the entire vector in a single operation, the computation throughput can be doubled when the data representation is halved. -
A Parallel Solver for Incompressible Fluid Flows
Available online at www.sciencedirect.com Procedia Computer Science 00 (2013) 000–000 International Conference on Computational Science, ICCS 2013 A parallel solver for incompressible fluid flows Yushan Wang (a,∗), Marc Baboulin (a,b), Jack Dongarra (c,d,e), Joël Falcou (a), Yann Fraigneau (f), Olivier Le Maître (f); (a) LRI, Université Paris-Sud, France; (b) Inria Saclay Île-de-France, France; (c) University of Tennessee, USA; (d) Oak Ridge National Laboratory, USA; (e) University of Manchester, United Kingdom; (f) LIMSI-CNRS, France Abstract The Navier-Stokes equations describe a large class of fluid flows but are difficult to solve analytically because of their nonlinearity. We present in this paper a parallel solver for the 3-D Navier-Stokes equations of incompressible unsteady flows with constant coefficients, discretized by the finite difference method. We apply a prediction-projection method that transforms the Navier-Stokes equations into three Helmholtz equations and one Poisson equation. For each Helmholtz system, we apply the Alternating Direction Implicit (ADI) method resulting in three tridiagonal systems. The Poisson equation is solved using partial diagonalization which transforms the Laplacian operator into a tridiagonal one. We present an implementation based on MPI where the computations are performed on each subdomain and information is exchanged at the interfaces between subdomains. We describe in particular how the solution of tridiagonal systems can be accelerated using vectorization techniques. Keywords: Navier-Stokes equations; prediction-projection method; ADI method; partial diagonalization; SIMD; parallel computing; tridiagonal systems. 1. Introduction The incompressible Navier-Stokes (NS) equations express Newton's second law applied to fluid motion, with the assumptions that the fluid density is constant and the fluid stress is the sum of a viscous term (proportional to the gradient of the velocity) and an isotropic pressure term. -
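The excerpt above reduces each Helmholtz solve to tridiagonal systems but does not say which tridiagonal solver is used. As a hedged illustration only, the sketch below shows the classical Thomas algorithm (Gaussian elimination without pivoting), which is a common choice when the matrix is diagonally dominant. The function name `thomas_solve` and the three-list storage layout are assumptions made for this example, not taken from the paper.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm (no pivoting).

    Assumed layout (hypothetical, for illustration): all four lists have
    length n; lower[i] multiplies x[i-1] in row i (lower[0] is ignored) and
    upper[i] multiplies x[i+1] in row i (upper[n-1] has no effect).
    Skipping pivoting is safe when the matrix is diagonally dominant.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward sweep: eliminate the sub-diagonal row by row.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Back substitution: recover the unknowns from last to first.
    x = [0.0] * n
    x[n - 1] = d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# Example: a diagonally dominant system with 4 on the diagonal and -1 beside it.
lower = [0.0, -1.0, -1.0, -1.0, -1.0]
diag = [4.0, 4.0, 4.0, 4.0, 4.0]
upper = [-1.0, -1.0, -1.0, -1.0, 0.0]
rhs = [2.0, 4.0, 6.0, 8.0, 16.0]
solution = thomas_solve(lower, diag, upper, rhs)
```

The forward sweep is inherently sequential in `i`, which is why vectorized approaches typically batch many independent tridiagonal systems (one per grid line) rather than parallelizing a single solve.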
PSPIKE: a Parallel Hybrid Sparse Linear System Solver *
PSPIKE: A Parallel Hybrid Sparse Linear System Solver ∗ Murat Manguoglu1, Ahmed H. Sameh1, and Olaf Schenk2 1 Department of Computer Science, Purdue University, West Lafayette IN 47907 2 Computer Science Department, University of Basel, Klingelbergstrasse 50, CH-4056 Basel Abstract. The availability of large-scale computing platforms comprised of tens of thousands of multicore processors motivates the need for the next generation of highly scalable sparse linear system solvers. These solvers must optimize parallel performance, processor (serial) performance, as well as memory requirements, while being robust across broad classes of applications and systems. In this paper, we present a new parallel solver that combines the desirable characteristics of direct methods (robustness) and effective iterative solvers (low computational cost), while alleviating their drawbacks (memory requirements, lack of robustness). Our proposed hybrid solver is based on the general sparse solver PARDISO, and the “Spike” family of hybrid solvers. The resulting algorithm, called PSPIKE, is as robust as direct solvers, more reliable than classical preconditioned Krylov subspace methods, and much more scalable than direct sparse solvers. We support our performance and parallel scalability claims using detailed experimental studies and comparison with direct solvers, as well as classical preconditioned Krylov methods. Key words: Hybrid Solvers, Direct Solvers, Krylov Subspace Methods, Sparse Linear Systems 1 Introduction The emergence of extreme-scale parallel platforms, along with increasing number of cores available in conventional processors pose significant challenges for algorithm and software development. Machines with tens of thousands of processors and beyond place tremendous constraints on the communication requirements of algorithms. -
Error Bounds from Extra-Precise Iterative Refinement
Error Bounds from Extra-Precise Iterative Refinement JAMES DEMMEL, YOZO HIDA, and WILLIAM KAHAN University of California, Berkeley XIAOYE S. LI Lawrence Berkeley National Laboratory SONIL MUKHERJEE Oracle and E. JASON RIEDY University of California, Berkeley We present the design and testing of an algorithm for iterative refinement of the solution of linear equations where the residual is computed with extra precision. This algorithm was originally proposed in 1948 and analyzed in the 1960s as a means to compute very accurate solutions to all but the most ill-conditioned linear systems. However, two obstacles have until now prevented its adoption in standard subroutine libraries like LAPACK: (1) There was no standard way to access the higher precision arithmetic needed to compute residuals, and (2) it was unclear how to compute a reliable error bound for the computed solution. The completion of the new BLAS Technical Forum Standard has essentially removed the first obstacle. To overcome the second obstacle, we show how the application of iterative refinement can be used to compute an error bound in any norm at small cost and use this to compute both an error bound in the usual infinity norm, and a componentwise relative error bound. This research was supported in part by the NSF Cooperative Agreement No. ACI-9619020; NSF Grant Nos. ACI-9813362 and CCF-0444486; the DOE Grant Nos. DE-FG03-94ER25219, DE-FC03-98ER25351, and DE-FC02-01ER25478; and the National Science Foundation Graduate Research Fellowship. The authors wish to acknowledge the contribution from Intel Corporation, Hewlett-Packard Corporation, IBM Corporation, and the National Science Foundation grant EIA-0303575 in making hardware and software available for the CITRIS Cluster which was used in producing these research results. -
Dense and Sparse Parallel Linear Algebra Algorithms on Graphics Processing Units
Departament de Sistemes Informàtics i Computació Dense and sparse parallel linear algebra algorithms on graphics processing units Author: Alejandro Lamas Daviña Director: José E. Román Moltó October 2018 To the extent possible under law, the author has waived all copyright and related or neighboring rights to this work. To one, two, and three. Acknowledgments I would like to express my gratitude to my director José Román, for his permanent support during all these years of work. His wise advice and selfless guidance have been decisive for the culmination of this thesis. His door has always been open for me, and he has solved all my doubts with unlimited patience. For all that and for more, I thank him. I would like to extend my gratitude to my colleagues of the SLEPc project. Here I thank again to José Román for his unique humor sense. To Carmen, for showing me the way and for all her good advices. To Enrique, who helped me to get rolling. And to the former members Andrés and Eloy, to whom I had the opportunity to meet and who enliven the group meals. I will keep good memories from these years. I do not want to forget to mention to Xavier Cartoixà, Jeff Steward and Altuğ Aksoy, great researchers with whom I have had the opportunity to collaborate. The afternoon snacks would not have been the same without the excellent discussions and comments of Fernando, David and of course Salva, who, without noticing it, also helped to improve this dissertation. Last, I would like to thank to José Luis, IT staff of the department, for his high valuable work behind the scenes and his promptly response to any incidence. -
Three-Precision GMRES-Based Iterative Refinement for Least Squares Problems Carson, Erin and Higham, Nicholas J. and Pranesh, Sr
Three-Precision GMRES-Based Iterative Refinement for Least Squares Problems Carson, Erin and Higham, Nicholas J. and Pranesh, Srikara 2020 MIMS EPrint: 2020.5 Manchester Institute for Mathematical Sciences School of Mathematics The University of Manchester Reports available from: http://eprints.maths.manchester.ac.uk/ And by contacting: The MIMS Secretary School of Mathematics The University of Manchester Manchester, M13 9PL, UK ISSN 1749-9097 THREE-PRECISION GMRES-BASED ITERATIVE REFINEMENT FOR LEAST SQUARES PROBLEMS∗ ERIN CARSON†, NICHOLAS J. HIGHAM‡, AND SRIKARA PRANESH‡ Abstract. The standard iterative refinement procedure for improving an approximate solution to the least squares problem $\min_x \|b - Ax\|_2$, where $A \in \mathbb{R}^{m \times n}$ with $m \ge n$ has full rank, is based on solving the $(m + n) \times (m + n)$ augmented system with the aid of a QR factorization. In order to exploit multiprecision arithmetic, iterative refinement can be formulated to use three precisions, but the resulting algorithm converges only for a limited range of problems. We build an iterative refinement algorithm called GMRES-LSIR, analogous to the GMRES-IR algorithm developed for linear systems [SIAM J. Sci. Comput., 40 (2019), pp. A817-A847], that solves the augmented system using GMRES preconditioned by a matrix based on the computed QR factors. We explore two left preconditioners; the first has full off-diagonal blocks and the second is block diagonal and can be applied in either left-sided or split form. We prove that for a wide range of problems the first preconditioner yields backward and forward errors for the augmented system of order the working precision under suitable assumptions on the precisions and the problem conditioning. -
Using Mixed Precision in Numerical Computations to Speedup Linear Algebra Solvers
Using Mixed Precision in Numerical Computations to Speedup Linear Algebra Solvers Jack Dongarra, UTK/ORNL/U Manchester; Azzam Haidar, Nvidia; Nick Higham, U of Manchester; Stan Tomov, UTK. Slides can be found: http://bit.ly/icerm-05-2020-dongarra 5/7/20
Background
• My interest in mixed precision began with my dissertation …
§ Improving the Accuracy of Computed Matrix Eigenvalues
• Compute the eigenvalues and eigenvectors in low precision then improve selected values/vectors to higher precision for O(n^2) ops using the matrix decomposition
§ Extended to singular values, 1983
§ Algorithm in TOMS 710, 1992
IBM’s Cell Processor - 2004
• 9 Cores
§ Power PC at 3.2 GHz
§ 8 SPEs
• 204.8 Gflop/s peak! $600
§ The catch is that this is for 32 bit fl pt; (Single Precision SP)
§ 64 bit fl pt peak at 14.6 Gflop/s
• 14 times slower than SP; factor of 2 because of DP and 7 because of latency issues
The SPEs were fully IEEE-754 compliant in double precision. In single precision, they only implement round-towards-zero, denormalized numbers are flushed to zero and NaNs are treated like normal numbers.
Mixed Precision Idea Goes Something Like This…
• Exploit 32 bit floating point as much as possible.
§ Especially for the bulk of the computation
• Correct or update the solution with selective use of 64 bit floating point to provide a refined result
• Intuitively:
§ Compute a 32 bit result,
§ Calculate a correction to the 32 bit result using selected higher precision and,
§ Perform the update of the 32 bit result with the correction using high precision. -
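The three intuitive steps listed in the slides (a 32-bit solve, a correction computed with higher precision, a high-precision update) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `f32`, `solve_low_precision`, and `refine` are hypothetical, single precision is emulated by rounding through the standard `struct` module, and the low-precision solver simply repeats Gaussian elimination on each pass, whereas a practical code would factor the matrix once and reuse the factors.

```python
import struct


def f32(value):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def solve_low_precision(matrix, rhs):
    """Gaussian elimination with partial pivoting, rounding every result to float32."""
    n = len(rhs)
    # Augmented matrix [A | b], stored in emulated single precision.
    aug = [[f32(v) for v in row] + [f32(b)] for row, b in zip(matrix, rhs)]
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(aug[r][k]))
        aug[k], aug[pivot] = aug[pivot], aug[k]
        for r in range(k + 1, n):
            factor = f32(aug[r][k] / aug[k][k])
            for j in range(k, n + 1):
                aug[r][j] = f32(aug[r][j] - f32(factor * aug[k][j]))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = aug[i][n]
        for j in range(i + 1, n):
            acc = f32(acc - f32(aug[i][j] * x[j]))
        x[i] = f32(acc / aug[i][i])
    return x


def refine(matrix, rhs, iterations=5):
    """Mixed-precision iterative refinement: float32 solves, float64 residual and update."""
    x = solve_low_precision(matrix, rhs)  # step 1: 32-bit result
    for _ in range(iterations):
        # Residual r = b - A x, accumulated in full double precision.
        residual = [b - sum(a * xj for a, xj in zip(row, x))
                    for row, b in zip(matrix, rhs)]
        correction = solve_low_precision(matrix, residual)  # step 2: cheap correction
        x = [xi + ci for xi, ci in zip(x, correction)]      # step 3: 64-bit update
    return x


# Example system (made up for illustration) with known solution [1/3, 2/3, 1].
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
x_true = [1.0 / 3.0, 2.0 / 3.0, 1.0]
b = [sum(a * xj for a, xj in zip(row, x_true)) for row in A]
x_refined = refine(A, b)
```

For a well-conditioned matrix such as the one in the example, each pass shrinks the error by roughly the single-precision rounding level, so a handful of passes is enough to approach double-precision accuracy while the expensive elimination work stays in 32-bit arithmetic.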
A Banded Spike Algorithm and Solver for Shared Memory Architectures Karan Mendiratta University of Massachusetts Amherst
University of Massachusetts Amherst ScholarWorks@UMass Amherst Masters Theses 1911 - February 2014 2011 A Banded Spike Algorithm and Solver for Shared Memory Architectures Karan Mendiratta University of Massachusetts Amherst Follow this and additional works at: https://scholarworks.umass.edu/theses Mendiratta, Karan, "A Banded Spike Algorithm and Solver for Shared Memory Architectures" (2011). Masters Theses 1911 - February 2014. 699. Retrieved from https://scholarworks.umass.edu/theses/699 This thesis is brought to you for free and open access by ScholarWorks@UMass Amherst. It has been accepted for inclusion in Masters Theses 1911 - February 2014 by an authorized administrator of ScholarWorks@UMass Amherst. For more information, please contact [email protected]. A BANDED SPIKE ALGORITHM AND SOLVER FOR SHARED MEMORY ARCHITECTURES A Thesis Presented by KARAN MENDIRATTA Submitted to the Graduate School of the University of Massachusetts Amherst in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE IN ELECTRICAL AND COMPUTER ENGINEERING September 2011 Electrical & Computer Engineering A BANDED SPIKE ALGORITHM AND SOLVER FOR SHARED MEMORY ARCHITECTURES A Thesis Presented by KARAN MENDIRATTA Approved as to style and content by: Eric Polizzi, Chair David P Schmidt, Member Michael Zink, Member Christopher V Hollot, Department Chair Electrical & Computer Engineering For Krishna Babbar, and Baldev Raj Mendiratta ACKNOWLEDGMENTS I would like to thank my advisor and mentor Prof. Eric Polizzi for his guidance, patience and support. I would also like to thank Prof. David P Schmidt and Prof. Michael Zink for agreeing to be a part of my thesis committee. Last but not the least, I am grateful to my family and friends for always encouraging and supporting me in all my endeavors. -
A Survey of Recent Developments in Parallel Implementations of Gaussian Elimination
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2015; 27:1292–1309 Published online 2 June 2014 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.3306 A survey of recent developments in parallel implementations of Gaussian elimination Simplice Donfack1, Jack Dongarra1, Mathieu Faverge2, Mark Gates1, Jakub Kurzak1,*,†, Piotr Luszczek1 and Ichitaro Yamazaki1 1EECS Department, University of Tennessee, Knoxville, TN, USA 2IPB ENSEIRB-Matmeca, Inria Bordeaux Sud-Ouest, Bordeaux, France SUMMARY Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm has received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination for shared memory architecture. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Partial Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multisocket multicore systems are presented. Performance and numerical accuracy is analyzed. Copyright © 2014 John Wiley & Sons, Ltd. Received 15 July 2013; Revised 26 April 2014; Accepted 2 May 2014 KEY WORDS: Gaussian elimination; LU factorization; parallel; shared memory; multicore 1. INTRODUCTION Gaussian elimination has a long history that can be traced back some 2000 years [1]. Today, dense systems of linear equations have become a critical cornerstone for some of the most compute intensive applications.