HPC Libraries Outline

HPC libraries

Outline
  Introduction to HPC libraries
  Numerical libraries
    – Linear algebra
    – FFT
  Other
    – Domain-specific libraries
    – I/O libraries
  Frameworks

INTRODUCTION

Motivation for using libraries
  No need to reinvent the wheel!
  Tend to have good performance
  Libraries tend to have fewer bugs
  Can be called from high-level languages
    – C++, Python, Perl, Matlab, Mathematica
  Makes the code more portable
    – Even to GPUs!

Motivation: matrix multiply
  At 4096x4096: Optimized DGEMM = 9.5 GF, Matmul = 0.75 GF, Naive = 0.014 GF

Choosing libraries
  Factors to keep in mind
    – Performance
    – Accuracy
    – Licensing & pricing
    – Supported languages & platforms & portability
    – Ease of use
    – Parallelization & scalability

NUMERICAL LIBRARIES

Dense linear algebra
  The "Netlib" libraries have become a de facto standard for dense linear algebra
    – Most notably: BLAS, LAPACK, BLACS, PBLAS, ScaLAPACK
    – Developed at University of Tennessee
    – Reference (F77!) implementations available freely from http://www.netlib.org/
  A number of tuned implementations are available (use these whenever possible!)
    – ATLAS, GOTO, ACML, MKL, LibSci etc.

Netlib ecosystem
  [Figure: Netlib software hierarchy with the components ScaLAPACK, PBLAS, LAPACK, BLAS, BLACS and MPI; axis labels: global addressing / local addressing, platform independent / platform specific]

BLAS
  Basic Linear Algebra Subprograms
  Basic dense matrix operations
    – Multiplication, addition, rank update, triangular solve
  Divided into 3 levels

  Level  Type             Example               # mem refs
  1      Vector – Vector  DAXPY (y = a*x + y)   3n
  2      Matrix – Vector  DGEMV (y = Ax + y)    n**2
  3      Matrix – Matrix  DGEMM (C = AB + C)    4n**2

  Try to use the highest-level calls
    – E.g.
      DGEMM instead of n calls to DGER

LAPACK
  Linear Algebra PAckage
  Matrix solvers
    – Simultaneous linear equations, least-squares solutions, eigenvalue problems
    – Matrix factorizations (LU, Cholesky, QR, SVD, Schur)
  Relies heavily on BLAS for computation

ScaLAPACK
  Scalable LAPACK
  BLACS
    – Message passing interface for linear algebra
    – 2D arrays on 2D process grids
    – Generation of process grids
    – Communication of matrices
  Parallel LAPACK and BLAS routines (through PBLAS)
  Extends the BLAS / LAPACK interface
    – Addition of a P prefix to the routine names
    – For example PDGEMM

Interesting implementations
  PLASMA
    – Manycore optimized linear algebra routines
    – Potential successor of LAPACK
    – http://icl.cs.utk.edu/plasma/
  MAGMA & CULA
    – LAPACK-type routines for GPUs
    – http://icl.cs.utk.edu/magma/
    – http://www.culatools.com/

Sparse matrices
  PDEs, network simulation ...
  "Enough zeroes that it pays to take advantage of them" (J. Wilkinson)
    – Only store non-zero elements
  Density and sparsity pattern affect performance
    – Optimal storage format and algorithm differ
    – No single general purpose solution
    – Many different solvers available
      . Both direct and iterative methods

Sparse linear systems
  Solving sparse linear systems is more difficult than solving dense systems
    – Using direct solvers is often infeasible because the decompositions of sparse systems are generally not sparse (fill-in)
    – Iterative solvers utilize sparse matrix-vector multiplication (SpMV) extensively
    – Memory layout of sparse matrix formats is difficult to optimize

Sparse system libs
  Hypre (preconditioners)
  KLU, Paraklete, AztecOO (parts of Trilinos)
    – Direct and iterative solvers for sparse systems
  SuperLU, UMFPACK, MUMPS
    – Direct solvers for sparse systems
  CuSP, CuSPARSE
    – Solvers for use with NVIDIA GPUs
  Many, many more exist...

FFT libraries
  No standardized API
  Implementations tend to have similar functionality
    – Create a "plan"
      . Combination of codelets that compute the FFT
      . Depends on architecture & FFT parameters
    – Execute the FFT(s)
      Plan can be reused if parameters are identical
  Transform sizes that are powers of a single factor perform well

FFTW
  Fastest Fourier Transform in the West
  1D, 2D and 3D complex and real FFTs
  Two versions commonly in use
    – Version 2: MPI support
    – Version 3: Much better performance, MPI support in development
  Planning has multiple levels of comprehensiveness
    – ESTIMATE, MEASURE, PATIENT, EXHAUSTIVE
    – Plans can be saved in a "wisdom file" and reused

Other FFT implementations
  Cray CRAFFT
    – Dynamically selects the best FFT routine
    – Serial and parallel routines
    – See info_crafft
  AMD ACML
  Intel MKL and IPP
  CuFFT
    – FFT routines for NVIDIA GPUs

OTHER LIBRARIES

Domain-specific libraries
  OpenMM
    – Molecular dynamics
    – Supports GPUs
  Libint
    – Two-body molecular integrals
  Overture
    – PDE solvers designed for CFD
  Libgenometools
    – Genome analysis functions

Decomposition
  How to decompose a problem efficiently?
    – Load balancing (static/dynamic), optimal communication, simplicity
  Linear algebra
    – BLACS
  Mesh manipulation and load balancing
    – (par)Metis, Jostle, Party, Chaco, Zoltan...
  Generic
    – Global Arrays

Partitioning: METIS
  Graph and mesh partitioning
  Graph repartitioning
    – Adaptively refined meshes
  Partitioning refinement
    – Improves quality of existing partitions
  Matrix reordering
    – Fill reduction for sparse systems
  ParMETIS for MPI parallel meshes

Zoltan
  Library of data management services for unstructured, dynamic and/or adaptive computations
    – Provides different partitioning methods (geometric, graph, hypergraph)
    – Part of the Trilinos framework, can also be used as a stand-alone library
    – Callable from C/C++ and Fortran
  Data-structure neutral design

I/O libraries
  How should HPC data be stored?
    – Large, complex, heterogeneous, esoteric, metadata ...
    – Parallel and random access
  Traditional relational databases are a poor fit
    – Cannot handle large objects
    – Many unnecessary features for HPC data
  Need a better solution
    – NetCDF
    – HDF5

I/O: HDF5
  A data model, library, and file format for storing and managing multidimensional data
  Can store complex data objects and metadata
  File format and files are portable
  Possibility for parallel I/O on top of MPI-IO
  Fortran, C, C++, and Java interfaces
  The HDF5 data model and library are complex

I/O: NetCDF
  Used extensively by geoscientists
  Optimized for access to subsets of large datasets
    – Direct access instead of sequential
  Version 4 implemented on top of HDF5
    – HDF5 file compatibility
  Parallel NetCDF (pNetCDF)
    – Supports MPI-IO

FRAMEWORKS

Frameworks
  Frameworks provide a large collection of routines for solving numerical problems
    – Not a single library, but many of them for different tasks
      . Datatypes for parallel matrices, linear and nonlinear solvers, parallel communication, I/O, etc.
    – Usually different parts are well integrated and the user does not have to worry about the different datatypes or matrix representations

Frameworks: pros and cons
  Pros
    Many common operations are readily available
      – Can speed up development considerably
    Easy to test different methods provided by the framework
      – Preconditioners, solvers, etc.
  Cons
    Many frameworks are non-trivial to compile or port
    Strong dependency on the selected framework
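
The point that a framework makes it easy to test different methods can be illustrated with a small sketch. This is not PETSc or Trilinos code: the driver `solve` and the functions `jacobi_sweep` and `gauss_seidel_sweep` are hypothetical names invented for this example, written in plain Python on a tiny dense system. It only shows the idea of one driver with interchangeable methods behind a common interface.

```python
# Hypothetical mini "framework": one driver, interchangeable iterative methods.
# None of these names come from PETSc or Trilinos; they only illustrate the idea.

def jacobi_sweep(A, b, x):
    """One Jacobi sweep: every update uses only values from the previous iterate."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]

def gauss_seidel_sweep(A, b, x):
    """One Gauss-Seidel sweep: updated entries are reused immediately."""
    n = len(b)
    x = list(x)  # work on a copy so the caller's iterate is not modified
    for i in range(n):
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / A[i][i]
    return x

def solve(A, b, sweep, tol=1e-10, max_iter=1000):
    """Apply the chosen sweep until the max-norm residual drops below tol."""
    n = len(b)
    x = [0.0] * n
    for iteration in range(1, max_iter + 1):
        x = sweep(A, b, x)
        residual = max(
            abs(b[i] - sum(A[i][j] * x[j] for j in range(n))) for i in range(n)
        )
        if residual < tol:
            return x, iteration
    raise RuntimeError("no convergence within max_iter")

# Small diagonally dominant test system, so both methods converge.
A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]

# Switching the method is a one-argument change in the driver call.
x_jacobi, it_jacobi = solve(A, b, jacobi_sweep)
x_gs, it_gs = solve(A, b, gauss_seidel_sweep)
```

In a real framework the same switch is typically made through the framework's own solver and preconditioner objects or options; the details differ per framework and are not shown here.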

PETSc
  Library for parallel sparse linear systems
    – Targets solution of PDEs
    – Suite of data structures and routines
  Defines a programming model
    – Can be linked with external solvers
  C and Fortran interfaces
  Available on Cray XC
    – Load the module cray-petsc to use and see the intro_petsc manual page

PETSc architecture
  [Figure: PETSc architecture diagram]

Trilinos
  http://trilinos.sandia.gov
  Comprehensive collection of loosely connected packages developed by Sandia and different research groups
  Designed from the bottom up to use distributed memory
  Includes over 50 modules for different tasks
    – Linear solvers, mesh partitioning, utilities for FEM solvers, optimization, automatic differentiation, etc.

Trilinos (continued)
  Most packages are self-contained (can be compiled without others)
  Very liberal licensing, LGPL and BSD
  Available on Cray XC
    – 50 packages included
    – Load the module cray-trilinos to use
    – See the intro_trilinos manual page

Summary
  Libraries should be used whenever possible
    – Optimized libraries, if you have a choice
    – Use common, standardized interfaces
  Familiarize yourself with potential caveats
    – Numerical accuracy!
  Libraries often depend on other libraries
    – BLAS especially is fundamental
  No single library or framework or API covers everything.
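
As a small illustration of the BLAS advice in these notes (prefer one DGEMM over n calls to DGER), the sketch below computes the same product C = A*B both ways. It is plain Python and does not call any BLAS library; the function names `matmul`, `rank1_update` and `matmul_by_rank1_updates` are made up for this example and only mirror what DGEMM and DGER compute (with the scaling factors fixed to 1).

```python
# Pure-Python sketch (no BLAS involved): the same product C = A*B computed
# as one matrix-matrix operation and as a sequence of rank-1 updates.

def matmul(A, B):
    """C = A*B in one operation, the kind of work DGEMM performs."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def rank1_update(C, x, y):
    """C += x * y^T in place, the kind of work DGER performs."""
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            C[i][j] += xi * yj

def matmul_by_rank1_updates(A, B):
    """Build C = A*B from one rank-1 update per column of A and row of B."""
    n, p = len(A), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for k in range(len(B)):
        column_k = [A[i][k] for i in range(n)]
        rank1_update(C, column_k, B[k])  # every update reads and writes all of C
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C_gemm = matmul(A, B)
C_ger = matmul_by_rank1_updates(A, B)
```

Both routes give the same matrix, but each rank-1 update touches every entry of C, so the update-based version passes over C once per update. A single Level 3 call leaves an optimized library free to block the computation and reuse data in cache, which is why the highest-level call is usually the better choice.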