Parallel Mathematical Libraries STIMULATE Training Workshop 2018


December 2018, I. Gutheil, JSC

Outline

  - A Short History
  - Sequential Libraries
  - Parallel Libraries and Application Systems:
      - Threaded Libraries
      - MPI parallel Libraries
      - Libraries for GPU usage
  - Usage of ScaLAPACK, Elemental, and FFTW

A Short History

  Libraries for Dense Linear Algebra, starting in the early 1970s, written in FORTRAN

  - LINPACK, LINear algebra PACKage for digital computers, supercomputers of the 1970s and early 1980s
    Dense linear system solvers, factorization and solution, real and complex, single and double precision
  - EISPACK, EIgenSolver PACKage, written around 1972–1973
    Eigensolvers for real and complex, symmetric and non-symmetric matrices, solvers for the generalized eigenvalue problem and the singular value decomposition
  - LAPACK, initial release 1992, still in use, now Fortran 90, also C and C++ interfaces, tuned for single-core performance on modern supercomputers, threaded versions available from vendors, e.g. MKL

A Short History, continued

  BLAS (Basic Linear Algebra Subprograms for Fortran Usage)

  - First step 1979: vector operations, now called BLAS 1, widely used kernels of linear algebra routines. Idea: tuning in assembly on each machine, leading to portable performance
  - Second step 1988: matrix-vector operations, now called BLAS 2, optimization on the matrix-vector level necessary for vector computers, building blocks for LINPACK and EISPACK
  - Third step 1990: matrix-matrix operations, now called BLAS 3. Memory access became slower than the CPU, so optimization was now necessary at the matrix-matrix level. Building blocks for LAPACK, the successor of LINPACK and EISPACK

Why you should use BLAS 3 if possible

  1. Standardized interface, on most computers also for C, readable code
  2. Optimized for data re-use, memory access usually one order of magnitude slower than the CPU

  "Data re-use factor":

    $r := \frac{\text{floating point operations}}{\text{memory accesses}}$

  Example:

    AXPY: $r \approx \frac{2n}{2n} = 1$
    GEMV: $r \approx \frac{2n^2}{n^2} = 2$
    GEMM: $r \approx \frac{2n^3}{3n^2} = \frac{2}{3} n$

  Only GEMM gets close to peak performance.

Performance of matrix-matrix multiplication

  JULIA, gfortran, comparison with MKL

    Comp.          O0          O2          O3
    ijk-loop       4.9249      0.98752     0.99167
    ikj-loop       5.5986      2.5144      2.5146
    jik-loop       5.3677      1.0795      1.0792
    jki-loop       4.5858      0.67161     0.43218
    kij-loop       5.5317      2.5506      2.5492
    kji-loop       4.6007      0.68648     0.46044
    MKL Fortran    0.045247    0.043458    0.043392

Performance of matrix-matrix multiplication

  JULIA, ifort, comparison with MKL

    Comp.          O0          O2          O3
    ijk-loop       11.124      1.0156      0.44883
    ikj-loop       12.165      0.22118     0.089086
    jik-loop       11.888      0.95409     0.38558
    jki-loop       10.474      0.22997     0.23005
    kij-loop       11.948      0.22124     0.089038
    kji-loop       10.528      0.22985     0.23035
    MKL Fortran    0.044419    0.042375    0.042731

Sequential Libraries

  Vendor-specific library
  - MKL, Intel(R) Math Kernel Library
    Usage: see https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor

  Public domain libraries
  - LAPACK (Linear Algebra PACKage), part of MKL or libopenblas.so
  - ARPACK (Arnoldi PACKage), iterative solver for sparse eigenvalue problems
  - GSL (GNU Scientific Library, C library)
  - GMP (GNU Multiple Precision Arithmetic Library)

Contents of Intel(R) MKL 11.*

  - BLAS, Sparse BLAS, CBLAS
  - LAPACK
  - Iterative Sparse Solvers, Trust Region Solver
  - Vector Math Library
  - Vector Statistical Library
  - Fourier Transform Functions
  - Trigonometric Transform Functions
  - GMP routines
  - Poisson Library
  - Interface for fftw

Contents of GSL (not complete)

  - CBLAS
  - Linear Algebra, linear systems and eigenproblems
  - FFT and other transformations
  - Interpolation
  - Integration and numerical differentiation
  - Statistics
  - Ordinary differential equations

Parallel Libraries: Threaded Parallelism

  - MKL is multi-threaded or at least thread-safe
    Usage as with sequential routines; if OMP_NUM_THREADS is not set, the maximum possible number of threads is used

      ifort name.f -o name -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread

  - FFTW 3.3 (Fastest Fourier Transform of the West)
    Sequential, threaded, and OpenMP version
    Additional version in MKL
    Cray-intelmpi version on JULIA
    http://www.fftw.org

Parallel Libraries: MPI Parallelism, dense linear algebra

  - ScaLAPACK (Scalable Linear Algebra PACKage), Fortran 77, public domain version now contains BLACS
    http://netlib.org/scalapack
  - ELPA (Eigenvalue SoLvers for Petaflop-Applications), Fortran 2003
    https://elpa.mpcdf.mpg.de
  - Elemental, C++ framework for parallel dense linear algebra
    http://libelemental.org/

MPI Parallelism, sparse linear algebra

  - MUMPS (MUltifrontal Massively Parallel sparse direct Solver)
    http://mumps.enseeiht.fr/index.php?page=home
  - PARPACK (Parallel ARPACK), now ARPACK-NG, eigensolver
    https://github.com/opencollab/arpack-ng
  - hypre (high performance preconditioners)
    https://computation.llnl.gov/projects/hypre-scalable-linear-solvers-multigrid-methods/software

MPI Parallelism, tools and differential equations

  Tools
  - FFTW (Fastest Fourier Transform of the West)
  - ParMETIS (Parallel Graph Partitioning)
    http://glaros.dtc.umn.edu/gkhome/views/metis
  - SPRNG (Scalable Parallel Random Number Generator)
    http://www.sprng.org/

  Ordinary differential equations
  - SUNDIALS (SUite of Nonlinear and DIfferential/ALgebraic equation Solvers)
    https://computation.llnl.gov/projects/sundials/sundials-software

Parallel Systems, MPI Parallelism, PETSc

  Portable, Extensible Toolkit for Scientific Computation

  - Numerical solution of partial differential equations
  - Can make use of many other libraries
  - Solver and preconditioner can be chosen with command line arguments
  - Comes with lots of examples
  - https://www.mcs.anl.gov/petsc/
  - Very active mailing list, good support via the mailing list

Contents of parallel libraries, dense linear algebra: ScaLAPACK and ELPA

  ScaLAPACK
  - Parallel BLAS 1-3, PBLAS Version 2
  - Dense linear system solvers
  - Banded linear system solvers
  - Solvers for the Linear Least Squares Problem
  - Singular value decomposition
  - Eigenvalues and eigenvectors of dense symmetric/Hermitian matrices

  ELPA
  - Eigensolver only, uses ScaLAPACK

Contents of parallel libraries, dense linear algebra: Elemental (incomplete list)

  - Dense and sparse-direct (generalized) Least Squares problems
  - High-performance pseudospectral computation and visualization
  - LU and Cholesky with full pivoting
  - Column-pivoted QR and interpolative/skeleton decompositions
  - Many algorithms for Singular-Value soft-Thresholding (SVT)
  - Tall-skinny QR decompositions
  - Hermitian matrix functions
  - Prototype Spectral Divide and Conquer Schur decomposition and Hermitian EVD
  - Sign-based Lyapunov/Riccati/Sylvester solvers
  - Convex optimization

Contents of parallel libraries, sparse linear algebra: MUMPS and ParMETIS

  MUMPS: Multifrontal Massively Parallel sparse direct Solver
  - Solution of linear systems with symmetric positive definite matrices, general symmetric matrices, general unsymmetric matrices
  - Real or complex
  - Parallel factorization and solve phase, iterative refinement and backward error analysis
  - F90 and MPI
  - Graph partitioning used for symbolic factorization with reduced fill-in

  ParMETIS: Parallel Graph Partitioning and Fill-reducing Matrix Ordering
  - Developed in the Karypis Lab at the University of Minnesota

Contents of parallel libraries, sparse linear algebra: PARPACK

  - Reverse communication interface, the user has to supply the parallel matrix-vector multiplication
  - Standard or generalized problems
  - Single and double precision complex arithmetic versions for standard or generalized problems
  - Routines for banded matrices, standard or generalized problems
  - Routines for the singular value decomposition
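The reverse communication idea mentioned for PARPACK can be illustrated with a small sketch. The following is a toy under stated assumptions, not the ARPACK or PARPACK API: the class `PowerIterationRC`, its `step` method, and the helper `laplacian_matvec` are hypothetical names, and plain power iteration stands in for the Arnoldi-type methods the real library uses. The point it tries to show is only the control flow: the solver never sees the matrix, it hands a vector to the caller and waits for the caller's own matrix-vector product.

```python
import math


class PowerIterationRC:
    """Toy reverse-communication solver for the dominant eigenvalue.

    Hypothetical illustration only; this is NOT the ARPACK/PARPACK API.
    The solver exposes the vector `x` and expects the caller to hand
    back y = A*x through `step`.
    """

    def __init__(self, start, tol=1e-12, max_iter=10000):
        norm = math.sqrt(sum(v * v for v in start))
        self.x = [v / norm for v in start]  # current unit-norm iterate
        self.eigenvalue = 0.0               # latest Rayleigh quotient
        self.tol = tol
        self.max_iter = max_iter
        self.iterations = 0
        self.done = False

    def step(self, y):
        """Consume y = A*x computed by the caller and update the iterate."""
        rayleigh = sum(xi * yi for xi, yi in zip(self.x, y))  # x^T A x
        norm = math.sqrt(sum(yi * yi for yi in y))
        self.x = [yi / norm for yi in y]
        self.iterations += 1
        converged = abs(rayleigh - self.eigenvalue) < self.tol
        self.done = converged or self.iterations >= self.max_iter
        self.eigenvalue = rayleigh


def laplacian_matvec(x):
    """Caller-owned product with the tridiagonal matrix (-1, 2, -1)."""
    n = len(x)
    return [
        2.0 * x[i]
        - (x[i - 1] if i > 0 else 0.0)
        - (x[i + 1] if i < n - 1 else 0.0)
        for i in range(n)
    ]


solver = PowerIterationRC(start=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
while not solver.done:
    # In a parallel code this product would be the user's own
    # distributed matrix-vector multiplication.
    solver.step(laplacian_matvec(solver.x))

# Known largest eigenvalue of the 6 x 6 matrix, for comparison.
exact = 2.0 + 2.0 * math.cos(math.pi / 7.0)
print(solver.eigenvalue, exact)
```

Because the matrix only appears inside `laplacian_matvec`, the caller is free to store and distribute it in whatever way suits the application, which is presumably why such an interface is attractive for sparse parallel problems.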
Contents of parallel libraries, parallel tools

  FFTW3, Version 3.3
  - Discrete Fourier transform (DFT) in one or more dimensions
  - Real and complex data
  - Arbitrary input size

  SPRNG: The Scalable Parallel Random Number Generators Library for ASCI Monte Carlo Computations
  - Version ≥ 2.0: various random number generators in one library
  - Version 1.0: separate library for each random number generator

Contents of parallel libraries, ordinary differential equations: SUNDIALS

  - CVODE: initial value problems, ODEs
  - CVODES: ODE systems and sensitivity analysis capabilities
  - ARKODE: initial value ODE problems with additive Runge-Kutta methods
  - IDA: initial value problems, differential-algebraic equation systems (DAE)
  - IDAS: DAE systems and sensitivity analysis capabilities
  - KINSOL: nonlinear algebraic systems

Libraries for GPU usage

  - cuBLAS, cuSPARSE, cuSOLVER: linear algebra using CUDA from NVIDIA
  - cuRAND, cuFFT: random numbers and FFT for CUDA
  - CUDA Math Library: standard mathematical function library
  - cuDNN: primitives for deep neural networks (deep learning)
  - All of them and more: https://developer.nvidia.com/gpu-accelerated-libraries
  - MAGMA, linear algebra library for GPUs, similar to LAPACK
    http://icl.utk.edu/magma/
  - Other libraries come with CUDA kernels, for example ELPA

Usage of ScaLAPACK: Background

  - Scalable version of a subset of LAPACK, redesigned for distributed memory MIMD parallel computers
  - Calls as similar to those of LAPACK as possible
  - Based on PBLAS instead of BLAS
  - BLACS (Basic Linear Algebra Communication Subroutines) for communication
  - The user has to take care of the data distribution on their own
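Since the data distribution is left to the user, it may help to see what the mapping looks like. ScaLAPACK stores dense matrices in a two-dimensional block-cyclic layout; the sketch below applies the one-dimensional version of that mapping to row and column indices separately. The function names `owner_and_local_index` and `local_count` are hypothetical, indices are zero-based, and the first block is assumed to live on process coordinate 0. ScaLAPACK itself ships one-based Fortran helpers for the same purpose (`INDXG2P`, `INDXG2L`, `NUMROC`), which should be preferred in real code.

```python
def owner_and_local_index(g, nb, nprocs):
    """Map a 0-based global index to (process coordinate, 0-based local index).

    Assumes a block-cyclic layout with block size nb whose first block
    sits on process coordinate 0.
    """
    block = g // nb                # block that contains the global index
    proc = block % nprocs          # blocks are dealt out round-robin
    local_block = block // nprocs  # blocks this process owns before it
    return proc, local_block * nb + g % nb


def local_count(n, nb, proc, nprocs):
    """Number of the n global indices stored on process coordinate proc."""
    full_blocks = n // nb
    count = (full_blocks // nprocs) * nb
    extra = full_blocks % nprocs
    if proc < extra:
        count += nb        # one additional full block
    elif proc == extra:
        count += n % nb    # the trailing partial block, if any
    return count


# A 10 x 10 matrix with 2 x 2 blocks on a 3 x 2 process grid.
n, nb, nprow, npcol = 10, 2, 3, 2

# Where does global entry (7, 4) live, and at which local position?
row_owner, local_row = owner_and_local_index(7, nb, nprow)
col_owner, local_col = owner_and_local_index(4, nb, npcol)
print("process", (row_owner, col_owner), "local entry", (local_row, local_col))

# Local array sizes each process has to allocate.
for p in range(nprow):
    for q in range(npcol):
        rows = local_count(n, nb, p, nprow)
        cols = local_count(n, nb, q, npcol)
        print("process", (p, q), "holds", rows, "x", cols)
```

The local sizes differ between processes whenever the number of blocks is not a multiple of the grid dimension, so each process has to compute its own allocation rather than assume `n / nprow` rows.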