Over-Scalapack.Pdf

1. Overview of ScaLAPACK
Jack Dongarra
Computer Science Department, University of Tennessee
and Mathematical Sciences Section, Oak Ridge National Laboratory
http://www.netlib.org/utk/people/JackDongarra.html

2. The Situation
Parallel scientific applications are typically
- written from scratch, or manually adapted from sequential programs,
- using a simple SPMD model in Fortran or C,
- with explicit message passing primitives.
Which means:
- similar components are coded over and over again,
- codes are difficult to develop and maintain (debugging particularly unpleasant...),
- components are difficult to reuse,
- not really designed for a long-term software solution.

3. What should math software look like?
- Many more possibilities than shared memory.
- Possible approaches:
  - Minimum change to status quo: message passing and object oriented,
  - Reusable templates (requires a sophisticated user),
  - Problem solving environments: integrated systems a la Matlab, Mathematica,
  - Computational server, like netlib/xnetlib f2c.
- Some users want high performance and access to details; others sacrifice speed to hide details.
- Not enough penetration of libraries into the hardest applications on vector machines.
- Need to look at applications and rethink mathematical software.

4. The NetSolve Client
- Virtual Software Library.
- Multiplicity of interfaces -> ease of use.
- Programming interfaces: C interface, FORTRAN interface; use with heterogeneous platforms.
- Interactive interfaces:
  - Non-graphic interfaces: MATLAB interface, UNIX-Shell interface,
  - Graphic interfaces: TK/TCL interface, HotJava Browser interface (WWW).

5. The NetSolve Agent
- Intelligent Agent -> optimum use of resources.
- The agent is permanently aware of:
  - the configuration of the NetSolve system,
  - the characteristics of each NetSolve resource,
  - the load of each resource,
  - the functionalities of each resource.
- Each client request informs the agent about:
  - the type of problem to solve,
  - the size of the problem.
- The NetSolve Agent chooses the NetSolve resource: load balancing strategy.

6. The NetSolve System
- The system is a set of computational servers.
- Heterogeneity supported (XDR).
- Dynamic addition of resources.
- Fault tolerance.
- Each server uses the installed scientific packages. Examples of packages: BLAS, LAPACK, ScaLAPACK, Ellpack.
- Connection to other systems: NEOS.

7. New challenges
- No standard software (Fortran 90, HP Fortran, PVM, MPI, but ...).
- Many flavors of message passing; need a standard: BLACS.
- Highly varying ratio r = computation speed / communication speed.
- Many ways to lay out data.
- Fastest parallel algorithm sometimes less stable numerically.

8. LAPACK / ScaLAPACK: A Scalable Library for Distributed Memory Machines
- LAPACK success, from HP-48G to CRAY C-90:
  - targets workstations, vector machines, shared memory,
  - 449,015 requests since then (as of 10/95),
  - dense and banded matrices,
  - linear systems, least squares, eigenproblems,
  - manual available from SIAM, html, Japanese translation,
  - second release and manual in 9/94,
  - used by Cray, IBM, Convex, NAG, IMSL, ...
- Provide useful scientific program libraries for computational scientists working on new parallel architectures:
  - move to new-generation high performance machines: SP-2, T3D, Paragon, clusters of workstations, ...,
  - add functionality (generalized eigenproblem, sparse matrices, ...).

9. Needs
- Standards: portable, vendor-supported, scalable.
- High efficiency.
- Protect programming investment, if possible.
- Avoid low-level details.
- Scalability.
- Simple calling interface.
- Ease of programming / maintenance.

10. ScaLAPACK -- Overview
Goal: port the Linear Algebra PACKage (LAPACK) to distributed-memory environments.
- Scalability and performance:
  - optimized compute and communication engines,
  - blocked data access (level 3 BLAS) yields good node performance,
  - 2-D block-cyclic data decomposition yields good load balance.
- Portability:
  - BLAS: optimized compute engine,
  - BLACS: communication kernel (PVM, MPI, ...).
- Stay as close as possible to LAPACK in calling sequence, storage, etc.
- Modularity: build a rich set of linear algebra tools: BLAS, BLACS, PBLAS.
- Whenever possible, use reliable LAPACK algorithms.

11. Communication Library for Linear Algebra: Basic Linear Algebra Communication Subroutines (BLACS)
Purpose of the BLACS:
- A design tool: a conceptual aid in design and coding.
- Associate widely recognized mnemonic names with communication operations; improve program readability and the self-documenting quality of code.
- Promote efficiency by identifying frequently occurring operations of linear algebra which can be optimized on various computers.
- Program portability is improved through standardization of kernels, without giving up efficiency since optimized versions will be available.

12. 2-Dimensional Block-Cyclic Distribution
[Figure: global (left) and distributed (right) views of an 8 x 8 matrix mapped onto a 2 x 3 process grid; e.g. process (0,0) holds A11, A14, A17, A31, A34, A37, ...]
- Ensures good load balance -> performance and scalability.
- Encompasses a large number (but not all) data distribution schemes.
- Need redistribution routines to go from one distribution to the other.
- Not intuitive.
(An index-mapping sketch is given after these slide summaries.)

13. BLACS -- Basics
- Processes are embedded in a 2-dimensional grid.
  [Figure: a 3 x 4 process grid holding process ranks 0-3, 4-7, 8-11 in its three rows.]
- An operation which involves more than one sender and one receiver is called a "scoped operation". Using a 2-D grid, there are 3 natural scopes:
    SCOPE    MEANING
    Row      All processes in a process row participate.
    Column   All processes in a process column participate.
    All      All processes in the process grid participate.
  -> use of the SCOPE parameter.

14. BLACS -- Basics (continued)
- Operations are matrix based and ID-less.
  [Figure: an M x N (sub)matrix stored within an array of leading dimension LDA.]
- 4 types of BLACS routines: point-to-point communication, broadcast, combine operations, and support routines.

15. BLACS -- Communication Routines
1. Send/Recv: send a submatrix from one process to another:

     _xxSD2D( ICONTXT, M, N, A, LDA, RDEST, CDEST )
     _xxRV2D( ICONTXT, M, N, A, LDA, RSRC, CSRC )

   _ is the data type: I: INTEGER, S: REAL, D: DOUBLE PRECISION, C: COMPLEX, Z: DOUBLE COMPLEX.
   xx is the matrix type: GE: GEneral rectangular matrix, TR: TRapezoidal matrix.

     CALL BLACS_GRIDINFO( ICONTXT, NPROW, NPCOL, MYROW, MYCOL )
     IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN
        CALL DGESD2D( ICONTXT, 5, 1, X, 5, 1, 0 )
     ELSE IF( MYROW.EQ.1 .AND. MYCOL.EQ.0 ) THEN
        CALL DGERV2D( ICONTXT, 5, 1, Y, 5, 0, 0 )
     END IF

16. BLACS -- Communication Routines (continued)
2. Broadcast: send a submatrix to all processes, or to a subsection of processes in SCOPE, using various distribution patterns (TOP):

     _xxBS2D( ICONTXT, SCOPE, TOP, M, N, A, LDA )
     _xxBR2D( ICONTXT, SCOPE, TOP, M, N, A, LDA, RSRC, CSRC )

   SCOPE: 'Row', 'Column', 'All'
   TOP:   ' ' (default), 'Increasing Ring', '1-tree', ...

     CALL BLACS_GRIDINFO( ICONTXT, NPROW, NPCOL, MYROW, MYCOL )
     IF( MYROW.EQ.0 ) THEN
        IF( MYCOL.EQ.0 ) THEN
           CALL DGEBS2D( ICONTXT, 'Row', ' ', 2, 2, A, 3 )
        ELSE
           CALL DGEBR2D( ICONTXT, 'Row', ' ', 2, 2, A, 3, 0, 0 )
        END IF
     END IF

17. BLACS -- Combine Operations
3. Global combine operations: perform element-wise SUM, |MAX|, |MIN| operations on rectangular matrices:

     _GSUM2D( ICONTXT, SCOPE, TOP, M, N, A, LDA, RDEST, CDEST )
     _GAMX2D( ICONTXT, SCOPE, TOP, M, N, A, LDA, RA, CA, RCFLAG, RDEST, CDEST )
     _GAMN2D( ICONTXT, SCOPE, TOP, M, N, A, LDA, RA, CA, RCFLAG, RDEST, CDEST )

   RDEST = -1 indicates that the result of the operation should be left on all processes selected by SCOPE.
   For |MAX|, |MIN|: when RCFLAG = -1, RA and CA are not referenced; otherwise, these arrays are set on output with the coordinates of the process owning the corresponding maximum (resp. minimum) element in absolute value of A. Both arrays must be at least of size M x N and their leading dimension is specified by RCFLAG.

18. BLACS -- Advanced Topics
The BLACS context is the BLACS mechanism for partitioning communication space. A defining property of a context is that a message in a context cannot be sent or received in another context. The BLACS context includes the definition of a grid, and each process's coordinates in it.
It allows the user to:
- create arbitrary groups of processes,
- create multiple overlapping and/or disjoint grids,
- isolate each process grid so that grids do not interfere with each other.
-> the BLACS context is compatible with the MPI communicator.

19. BLACS -- References
BLACS software and documentation can be obtained via:
- WWW: http://www.netlib.org/blacs,
- anonymous ftp to ftp.netlib.org: cd blacs; get index,
- email [email protected] with the message: send index from blacs.
Comments and questions can be addressed to [email protected].
- J. Dongarra and R. van de Geijn, Two Dimensional Basic Linear Algebra Communication Subprograms, Technical Report UT CS-91-138, LAPACK Working Note 37, University of Tennessee, 1991.
- R. C. Whaley, Basic Linear Algebra Communication Subprograms: Analysis and Implementation Across Multiple Parallel Architectures, Technical Report UT CS-94-234, LAPACK Working Note 73, University of Tennessee, 1994.

20. PBLAS -- Introduction
- Parallel Basic Linear Algebra Subprograms for distributed memory MIMD computers.
- Similar functionality to the BLAS: distributed vector-vector, matrix-vector and matrix-matrix operations.
- Simplification of the parallelization of dense linear algebra codes, especially when implemented on top of the BLAS.
- Clarity: code is shorter and easier to read.
- Modularity: gives the programmer larger building blocks.
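The 2-D block-cyclic distribution summarized in slide 12 is easiest to see as an index mapping. Below is a minimal sketch in plain Python (not ScaLAPACK code; it assumes the first block sits on process (0,0), i.e. RSRC = CSRC = 0, and the helper name is invented), in the spirit of ScaLAPACK's INDXG2P/INDXG2L utilities:

    def owner_and_local(g, nb, p):
        # Map a 0-based global index g to (owning process, local index)
        # for block size nb distributed cyclically over p processes.
        block = g // nb                # which block the index falls into
        proc = block % p               # blocks are dealt out to processes cyclically
        local_block = block // p       # blocks this process already holds before it
        return proc, local_block * nb + g % nb

    # The slide's example: an 8 x 8 matrix with 1 x 1 blocks on a 2 x 3 process grid.
    M = N = 8; MB = NB = 1; P, Q = 2, 3
    layout = {}
    for i in range(M):
        for j in range(N):
            pr, li = owner_and_local(i, MB, P)
            pc, lj = owner_and_local(j, NB, Q)
            layout.setdefault((pr, pc), []).append(((i, j), (li, lj)))

    # Process (0, 0) holds global rows 0, 2, 4, 6 and columns 0, 3, 6
    # (1-based: rows 1, 3, 5, 7 and columns 1, 4, 7), matching the distributed
    # view of slide 12.
    print(sorted(g for g, _ in layout[(0, 0)]))

With realistic block sizes (e.g. MB = NB = 64) the same formulas produce the blocked layout that lets each process apply level 3 BLAS to contiguous local tiles.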
Recommended publications
  • Recursive Approach in Sparse Matrix LU Factorization
    Recursive approach in sparse matrix LU factorization
    Jack Dongarra, Victor Eijkhout and Piotr Łuszczek
    University of Tennessee, Department of Computer Science, Knoxville, TN 37996-3450, USA. Tel.: +865 974 8295; Fax: +865 974 8296

    This paper describes a recursive method for the LU factorization of sparse matrices. The recursive formulation of common linear algebra codes has been proven very successful in dense matrix computations. An extension of the recursive technique for sparse matrices is presented. Performance results given here show that the recursive approach may perform comparable to leading software packages for sparse matrix factorization in terms of execution time, memory usage, and error estimates of the solution.

    1. Introduction

    Typically, a system of linear equations has the form:

        Ax = b,    (1)

    where A is an n by n real matrix (A ∈ R^{n×n}), and b and x are n-dimensional real vectors (b, x ∈ R^n).

    … and the resulting matrix is often guaranteed to be positive definite or close to it. However, when the linear system matrix is strongly unsymmetric or indefinite, as is the case with matrices originating from systems of ordinary differential equations or the indefinite matrices arising from shift-invert techniques in eigenvalue methods, one has to revert to direct methods, which are the focus of this paper.

    In direct methods, Gaussian elimination with partial pivoting is performed to find a solution of Eq. (1). Most commonly, the factored form of A is given by means of matrices L, U, P and Q such that:

        LU = PAQ,    (2)

    where:
    – L is a lower triangular matrix with unitary diagonal,
    – U is an upper triangular matrix with arbitrary diagonal,
    – P and Q are row and column permutation matrices, respectively (each row and column of these matrices contains a single non-zero entry, which is 1, and the following holds: PP^T = QQ^T = I).
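    As a concrete illustration of the factored form in Eq. (2), the sketch below uses SciPy's SuperLU-based sparse LU as a stand-in for the recursive factorization described in the paper (the test matrix is made up); it checks the relation P A Q = L U with the permutations SciPy reports:

        import numpy as np
        from scipy import sparse
        from scipy.sparse.linalg import splu

        rng = np.random.default_rng(0)
        n = 200
        A = sparse.random(n, n, density=0.02, random_state=rng)
        A = (A + n * sparse.eye(n)).tocsc()       # keep it comfortably nonsingular
        b = rng.random(n)

        lu = splu(A)       # sparse LU; Q chosen for sparsity, P for partial pivoting
        x = lu.solve(b)
        print(np.allclose(A @ x, b))              # True

        # Reassemble P A Q = L U from the reported row/column permutations.
        Pr = sparse.csc_matrix((np.ones(n), (lu.perm_r, np.arange(n))))
        Pc = sparse.csc_matrix((np.ones(n), (np.arange(n), lu.perm_c)))
        print(np.allclose((Pr @ A @ Pc).toarray(), (lu.L @ lu.U).toarray()))   # True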
  • Implementation in ScaLAPACK of Divide-and-Conquer Algorithms for Banded and Tridiagonal Linear Systems
    Implementation in ScaLAPACK of Divide-and-Conquer Algorithms for Banded and Tridiagonal Linear Systems
    A. Cleary, Department of Computer Science, University of Tennessee
    J. Dongarra, Department of Computer Science, University of Tennessee; Mathematical Sciences Section, Oak Ridge National Laboratory

    Abstract
    Described here are the design and implementation of a family of algorithms for a variety of classes of narrowly banded linear systems. The classes of matrices include symmetric and positive definite, nonsymmetric but diagonally dominant, and general nonsymmetric; and all these types are addressed for both general band and tridiagonal matrices. The family of algorithms captures the general flavor of existing divide-and-conquer algorithms for banded matrices in that they have three distinct phases, the first and last of which are completely parallel, and the second of which is the parallel bottleneck. The algorithms have been modified so that they have the desirable property that they are the same mathematically as existing factorizations (Cholesky, Gaussian elimination) of suitably reordered matrices. This approach represents a departure in the nonsymmetric case from existing methods, but has the practical benefits of a smaller and more easily handled reduced system. All codes implement a block odd-even reduction for the reduced system that allows the algorithm to scale far better than existing codes that use variants of sequential solution methods for the reduced system. A cross-section of results is displayed that supports the predicted performance results for the algorithms. Comparison with existing dense-type methods shows that for areas of the problem parameter space with low bandwidth and/or high numbers of processors, the family of algorithms described here is superior.
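    To make the three-phase structure concrete, here is a small illustrative sketch in Python/NumPy (all names and sizes are invented, and it uses a simple low-rank-update/Woodbury splitting rather than the reordered factorizations of the paper): a tridiagonal system is split into two independent blocks (phase 1), coupled through a tiny reduced system (phase 2, the parallel bottleneck), and then corrected (phase 3).

        import numpy as np

        rng = np.random.default_rng(0)
        n, m = 8, 4                          # split an 8 x 8 system into two 4 x 4 blocks
        b = 4.0 + rng.random(n)              # main diagonal (diagonally dominant)
        a = rng.random(n)                    # sub-diagonal entries, a[i] couples x[i-1]
        c = rng.random(n)                    # super-diagonal entries
        d = rng.random(n)
        T = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)

        # Decouple the two blocks by zeroing the cut entries; keep them as a rank-2 update.
        D = T.copy(); D[m-1, m] = 0.0; D[m, m-1] = 0.0
        U = np.zeros((n, 2)); V = np.zeros((n, 2))
        U[m-1, 0] = T[m-1, m]; V[m, 0] = 1.0
        U[m, 1] = T[m, m-1]; V[m-1, 1] = 1.0      # so that T = D + U @ V.T

        # Phase 1 (fully parallel): solves with the block-diagonal D, which in a
        # distributed setting are two independent tridiagonal solves.
        y = np.linalg.solve(D, d)
        W = np.linalg.solve(D, U)

        # Phase 2 (the parallel bottleneck): the reduced system, here only 2 x 2.
        S = np.eye(2) + V.T @ W
        z = np.linalg.solve(S, V.T @ y)

        # Phase 3 (fully parallel): correction / back-substitution.
        x = y - W @ z
        print(np.allclose(T @ x, d))              # True

    The paper's algorithms keep this three-phase shape but are organized so that the factorization is mathematically identical to Cholesky or Gaussian elimination of a suitably reordered matrix, with a block odd-even reduction for the reduced system.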
  • Overview of Iterative Linear System Solver Packages
    Overview of Iterative Linear System Solver Packages
    Victor Eijkhout
    July, 1998

    Abstract
    Description and comparison of several packages for the iterative solution of linear systems of equations.

    1 Introduction
    There are several freely available packages for the iterative solution of linear systems of equations, typically derived from partial differential equation problems. In this report I will give a brief description of a number of packages, and give an inventory of their features and defining characteristics. The most important features of the packages are which iterative methods and preconditioners they supply; the most relevant defining characteristics are the interface they present to the user's data structures, and their implementation language.

    2 Discussion
    Iterative methods are subject to several design decisions that affect ease of use of the software and the resulting performance. In this section I will give a global discussion of the issues involved, and how certain points are addressed in the packages under review.

    2.1 Preconditioners
    A good preconditioner is necessary for the convergence of iterative methods as the problem to be solved becomes more difficult. Good preconditioners are hard to design, and this especially holds true in the case of parallel processing. Here is a short inventory of the various kinds of preconditioners found in the packages reviewed.

    2.1.1 About incomplete factorisation preconditioners
    Incomplete factorisations are among the most successful preconditioners developed for single-processor computers. Unfortunately, since they are implicit in nature, they cannot immediately be used on parallel architectures.
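    A minimal sketch of the kind of pairing the report surveys — a Krylov solver combined with an incomplete-factorization preconditioner — using SciPy as a stand-in for the packages reviewed (the test matrix and parameters are invented):

        import numpy as np
        from scipy import sparse
        from scipy.sparse.linalg import gmres, spilu, LinearOperator

        # 2-D Poisson-like matrix (5-point stencil), a typical PDE-derived system.
        m = 40
        T = sparse.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
        I = sparse.identity(m)
        A = (sparse.kron(I, T) + sparse.kron(T, I)).tocsc()
        b = np.ones(A.shape[0])

        # Incomplete LU factorization, applied as the preconditioner M ~= A^{-1}.
        ilu = spilu(A, drop_tol=1e-4, fill_factor=10)
        M = LinearOperator(A.shape, matvec=ilu.solve)

        x, info = gmres(A, b, M=M)            # info == 0 signals convergence
        print(info, np.linalg.norm(A @ x - b))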
  • Accelerating the LOBPCG Method on GPUs Using a Blocked Sparse Matrix Vector Product
    Accelerating the LOBPCG method on GPUs using a blocked Sparse Matrix Vector Product
    Hartwig Anzt, Stanimire Tomov and Jack Dongarra
    Innovative Computing Lab, University of Tennessee, Knoxville, USA
    Email: [email protected], [email protected], [email protected]

    Abstract — This paper presents a heterogeneous CPU-GPU algorithm design and optimized implementation for an entire sparse iterative eigensolver – the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) – starting from low-level GPU data structures and kernels to the higher-level algorithmic choices and overall heterogeneous design. Most notably, the eigensolver leverages the high performance of a new GPU kernel developed for the simultaneous multiplication of a sparse matrix and a set of vectors (SpMM). This is a building block that serves as a backbone for not only block-Krylov, but also for other methods relying on blocking for acceleration in general. The heterogeneous LOBPCG developed here reveals the potential of this type of eigensolver by highly optimizing all of its components, and can be viewed as a benchmark for other SpMM-dependent applications.

    … the computing power of today's supercomputers, often accelerated by coprocessors like graphics processing units (GPUs), becomes challenging. While there exist numerous efforts to adapt iterative linear solvers like Krylov subspace methods to coprocessor technology, sparse eigensolvers have so far remained outside the main focus. A possible explanation is that many of those combine sparse and dense linear algebra routines, which makes porting them to accelerators more difficult. Aside from the power method, algorithms based on the Krylov subspace idea are among the most commonly used general eigensolvers [1]. When targeting symmetric positive definite eigenvalue problems, the recently developed Locally Optimal …
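    The central point — one product of a sparse matrix with a block of vectors (SpMM) instead of many separate SpMVs — can be sketched with SciPy on the CPU (the matrix, block size and tolerances are invented; the paper's contribution is the GPU kernel and the heterogeneous design, not this toy):

        import numpy as np
        from scipy import sparse
        from scipy.sparse.linalg import lobpcg

        # Sparse SPD test matrix: 2-D Laplacian on a 40 x 40 grid.
        m = 40
        T = sparse.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
        A = (sparse.kron(sparse.identity(m), T) + sparse.kron(T, sparse.identity(m))).tocsr()

        k = 8                                 # block size = number of wanted eigenpairs
        rng = np.random.default_rng(0)
        X = rng.random((A.shape[0], k))       # initial block of vectors

        AX = A @ X       # the dominant kernel per iteration: one SpMM, not k SpMVs

        w, V = lobpcg(A, X, largest=False, tol=1e-5, maxiter=200)
        print(np.sort(w))                     # approximations to the smallest eigenvalues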
  • Prospectus for the Next LAPACK and ScaLAPACK Libraries
    Prospectus for the Next LAPACK and ScaLAPACK Libraries
    James Demmel¹, Jack Dongarra²³, Beresford Parlett¹, William Kahan¹, Ming Gu¹, David Bindel¹, Yozo Hida¹, Xiaoye Li¹, Osni Marques¹, E. Jason Riedy¹, Christof Voemel¹, Julien Langou², Piotr Luszczek², Jakub Kurzak², Alfredo Buttari², Julie Langou², Stanimire Tomov²
    ¹ University of California, Berkeley CA 94720, USA
    ² University of Tennessee, Knoxville TN 37996, USA
    ³ Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA

    1 Introduction
    Dense linear algebra (DLA) forms the core of many scientific computing applications. Consequently, there is continuous interest and demand for the development of increasingly better algorithms in the field. Here 'better' has a broad meaning, and includes improved reliability, accuracy, robustness, ease of use, and most importantly new or improved algorithms that would more efficiently use the available computational resources to speed up the computation. The rapidly evolving high end computing systems and the close dependence of DLA algorithms on the computational environment is what makes the field particularly dynamic. A typical example of the importance and impact of this dependence is the development of LAPACK [1] (and later ScaLAPACK [2]) as a successor to the well known and formerly widely used LINPACK [3] and EISPACK [3] libraries. Both LINPACK and EISPACK were based, and their efficiency depended, on optimized Level 1 BLAS [4]. Hardware development trends though, and in particular an increasing Processor-to-Memory speed gap of approximately 50% per year, started to increasingly show the inefficiency of Level 1 BLAS vs Level 2 and 3 BLAS, which prompted efforts to reorganize DLA algorithms to use block matrix operations in their innermost loops.
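    The Level 1 versus Level 3 BLAS point can be felt even from Python/NumPy; the micro-benchmark below (illustrative only, timings are machine-dependent) computes the same product once as a single matrix-matrix call, once as a column of matrix-vector products, and once as individual dot products:

        import time
        import numpy as np

        n = 300
        rng = np.random.default_rng(0)
        A, B = rng.random((n, n)), rng.random((n, n))

        t0 = time.perf_counter(); C3 = A @ B; t3 = time.perf_counter() - t0   # Level 3 flavor

        t0 = time.perf_counter()
        C2 = np.column_stack([A @ B[:, j] for j in range(n)])                 # Level 2 flavor
        t2 = time.perf_counter() - t0

        t0 = time.perf_counter()
        C1 = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                C1[i, j] = A[i, :] @ B[:, j]                                  # Level 1 flavor
        t1 = time.perf_counter() - t0

        print(np.allclose(C3, C1), t3, t2, t1)   # identical result, very different times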
  • Jack Dongarra: Supercomputing Expert and Mathematical Software Specialist
    Biographies
    Jack Dongarra: Supercomputing Expert and Mathematical Software Specialist
    Thomas Haigh, University of Wisconsin
    Editor: Thomas Haigh

    Jack J. Dongarra was born in Chicago in 1950 to a family of Sicilian immigrants. He remembers himself as an undistinguished student during his time at a local Catholic elementary school, burdened by undiagnosed dyslexia. Only later, in high school, did he begin to connect material from science classes with his love of taking machines apart and tinkering with them. Inspired by his science teacher, he attended Chicago State University and majored in mathematics, thinking that this would combine well with education courses to equip him for a high school teaching career. The first person in his family to go to college, he lived at home and worked in a pizza restaurant to cover the cost of his education.¹

    In 1972, during his senior year, a series of chance events reshaped Dongarra's planned career. On the suggestion of Harvey Leff, one of his physics professors, he applied for an internship at nearby Argonne National Laboratory as part of a program that rewarded a small group of undergraduates with college credit. Dongarra credits his success here, against strong competition from students attending far more prestigious schools, to Leff's friendship with Jim Pool, who was then associate director of the Applied Mathematics Division at Argonne.²

    EISPACK
    Dongarra was supervised by Brian Smith, a young researcher whose primary concern at the time was the lab's EISPACK project. Although budget cuts forced Pool to make substantial layoffs during his time as acting director of the Applied Mathematics Division in 1970–1971, he had made a special effort to find funds to protect the project and hire Smith.
  • Distributed Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA
    Distributed Dense Numerical Linear Algebra Algorithms on massively parallel architectures: DPLASMA
    George Bosilca*, Aurelien Bouteiller*, Anthony Danalis*†, Mathieu Faverge*, Azzam Haidar*, Thomas Herault*‡, Jakub Kurzak*, Julien Langou§, Pierre Lemarinier*, Hatem Ltaief*, Piotr Luszczek*, Asim YarKhan* and Jack Dongarra*
    * University of Tennessee Innovative Computing Laboratory
    † Oak Ridge National Laboratory
    ‡ University Paris-XI
    § University of Colorado Denver
    ¶ Research was supported by the National Science Foundation grant no. NSF CCF-811520

    Abstract — We present DPLASMA, a new project related to PLASMA, that operates in the distributed memory regime. It uses a new generic distributed Direct Acyclic Graph engine for high performance computing (DAGuE). Our work also takes advantage of some of the features of DAGuE, such as DAG representation that is independent of problem size, overlapping of communication and computation, task prioritization, architecture-aware scheduling and management of micro-tasks on distributed architectures that feature heterogeneous many-core nodes. The originality of this engine is that it is capable of translating a sequential nested-loop code into a concise and synthetic format which it can interpret and then execute in a …

    … first two, combined with the economic necessity of using many thousands of CPUs to scale up to petascale and larger systems. More transistors and slower clocks means multicore designs and more parallelism required. The modus operandi of traditional processor design (increase the transistor density, speed up the clock rate, raise the voltage) has now been blocked by a stubborn set of physical barriers: excess heat produced, too much power consumed, too much voltage leaked. Multicore designs are a natural response to this situation.
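    A toy illustration (sequential Python/NumPy, not DPLASMA or DAGuE code) of the task view such an engine consumes: a tiled Cholesky factorization written as tile kernels (POTRF/TRSM/SYRK/GEMM). A DAG runtime launches each kernel as soon as the tiles it reads are ready, instead of following this loop order, which is what enables overlap of communication and computation across nodes.

        import numpy as np
        from scipy.linalg import cholesky, solve_triangular

        nt, ts = 4, 32                                    # 4 x 4 grid of 32 x 32 tiles
        rng = np.random.default_rng(0)
        M = rng.random((nt * ts, nt * ts))
        M = M @ M.T + nt * ts * np.eye(nt * ts)           # symmetric positive definite
        A = [[M[i*ts:(i+1)*ts, j*ts:(j+1)*ts].copy() for j in range(nt)]
             for i in range(nt)]

        for k in range(nt):
            A[k][k] = cholesky(A[k][k], lower=True)                       # POTRF
            for i in range(k + 1, nt):                                    # TRSMs: independent tasks
                A[i][k] = solve_triangular(A[k][k], A[i][k].T, lower=True).T
            for i in range(k + 1, nt):                                    # updates: independent tasks
                for j in range(k + 1, i + 1):
                    A[i][j] -= A[i][k] @ A[j][k].T                        # SYRK / GEMM

        L = np.tril(np.block(A))
        print(np.allclose(L @ L.T, M))                                    # True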
  • LAPACK Working Note 195: ScaLAPACK's MRRR Algorithm
    LAPACK Working Note 195: ScaLAPACK's MRRR Algorithm
    Christof Vömel

    Abstract. The sequential algorithm of Multiple Relatively Robust Representations, MRRR, can compute numerically orthogonal eigenvectors of an unreduced symmetric tridiagonal matrix T ∈ R^{n×n} with O(n²) cost. This paper describes the design of ScaLAPACK's parallel MRRR algorithm. One emphasis is on the critical role of the representation tree in achieving both numerical accuracy and parallel scalability. A second point concerns the favorable properties of this code: subset computation, the use of static memory, and scalability. Unlike ScaLAPACK's Divide & Conquer and QR, MRRR can compute subsets of eigenpairs at reduced cost. And in contrast to inverse iteration, which can fail, it is guaranteed to produce a numerically satisfactory answer while maintaining memory scalability. ParEig, the parallel MRRR algorithm for PLAPACK, uses dynamic memory allocation. This is avoided by our code at marginal additional cost. We also use a different representation tree criterion that allows for more accurate computation of the eigenvectors but can make parallelization more difficult.

    AMS subject classifications. 65F15, 65Y15.
    Key words. Symmetric eigenproblem, multiple relatively robust representations, ScaLAPACK, numerical software.

    1. Introduction. Since 2005, the National Science Foundation has been funding an initiative [19] to improve the LAPACK [3] and ScaLAPACK [14] libraries for Numerical Linear Algebra computations. One of the ambitious goals of this initiative is to put more of LAPACK into ScaLAPACK, recognizing the gap between available sequential and parallel software support. In particular, parallelization of the MRRR algorithm [24, 50, 51, 26, 27, 28, 29], the topic of this paper, is identified as one key improvement to ScaLAPACK.
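    For a feel of what the sequential kernel delivers, here is a hedged SciPy sketch of a subset tridiagonal eigenproblem (the matrix is made up, and SciPy picks the underlying LAPACK driver itself; ScaLAPACK's parallel MRRR routine provides the same subset capability in distributed memory):

        import numpy as np
        from scipy.linalg import eigh_tridiagonal

        rng = np.random.default_rng(0)
        n = 1000
        d = rng.random(n)            # main diagonal of the symmetric tridiagonal T
        e = rng.random(n - 1)        # off-diagonal of T

        # Only the 10 smallest eigenpairs (indices 0..9), a key selling point of MRRR.
        w, V = eigh_tridiagonal(d, e, select='i', select_range=(0, 9))

        T = np.diag(d) + np.diag(e, 1) + np.diag(e, -1)
        print(np.max(np.abs(T @ V - V * w)))            # small residuals
        print(np.max(np.abs(V.T @ V - np.eye(10))))     # numerically orthogonal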
  • Reciprocity Relations
    Reciprocity relations - May 11, 2006

    Using the reciprocity theorem a relation between the seismic reflection response and transmission data can be derived. This relation is used to calculate transmission data from reflection data measured at the surface. In this presentation the steps involved in this calculation are explained in detail and possible pitfalls are discussed. At the end a simple example is given to illustrate the discussed procedure.

    1 One-way reciprocity

    One-way reciprocity relation of the convolution type:

        ∫_{∂D0} {P_A^+ P_B^- − P_A^- P_B^+} d²x_H = ∫_{∂Dm} {P_A^+ P_B^- − P_A^- P_B^+} d²x_H    (1)

    One-way reciprocity relation of the correlation type:

        ∫_{∂D0} {(P_A^+)* P_B^+ − (P_A^-)* P_B^-} d²x_H = ∫_{∂Dm} {(P_A^+)* P_B^+ − (P_A^-)* P_B^-} d²x_H    (2)

    The medium parameters in both states A and B are identical, lossless and 3-D inhomogeneous, and the domain D is source free.

    [Figure: states A and B, each with a source at x_A (resp. x_B) at the surface ∂D0 (depth z0), the reflection response R0(x, x_A, ω) (resp. R0(x, x_B, ω)) recorded at ∂D0, and the transmission response T0(x, x_A, ω) (resp. T0(x, x_B, ω)) recorded at the depth level ∂Dm (depth zm), separated by ∆z.]

    Field at surface ∂D0      State A                          State B
      P^+                     δ(x_H − x_{H,A}) s_A(ω)          δ(x_H − x_{H,B}) s_B(ω)
      P^-                     R0(x, x_A, ω) s_A(ω)             R0(x, x_B, ω) s_B(ω)
    Field at surface ∂Dm
      P^+                     T0(x, x_A, ω) s_A(ω)             T0(x, x_B, ω) s_B(ω)
      P^-                     0                                 0

    One-way reciprocity relation of the convolution type:

        s_A s_B ∫_{∂D0} {δ(x_H − x_{H,A}) R0(x, x_B) − R0(x, x_A) δ(x_H − x_{H,B})} d²x_H = 0    (3a)

        R0(x_A, x_B) = R0(x_B, x_A)    (3b)

    One-way reciprocity relation of the correlation type:

        δ(x_{H,A} − x_{H,B}) − ∫_{∂D0} R0*(x, x_A) R0(x, x_B) d²x_H = ∫_{∂Dm} T0*(x, x_A) T0(x, x_B) d²x_H    (4)

    Equation (4) can be used to estimate transmission effects on wave propagation (coda) from reflection data at the surface.
  • C++ API for BLAS and LAPACK
    SLATE Working Note 2: C++ API for BLAS and LAPACK
    Mark Gates (ICL), Piotr Luszczek (ICL), Ahmad Abdelfattah (ICL), Jakub Kurzak (ICL), Jack Dongarra (ICL), Konstantin Arturov (Intel), Cris Cecka (NVIDIA), Chip Freitag (AMD)
    ICL: Innovative Computing Laboratory, University of Tennessee; Intel Corporation; NVIDIA Corporation; Advanced Micro Devices, Inc.
    February 21, 2018

    This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, in support of the nation's exascale computing imperative.

    Revision Notes
    06-2017: first publication
    02-2018: copy editing; new cover
    03-2018: adding a section about GPU support; adding Ahmad Abdelfattah as an author

    @techreport{gates2017cpp,
      author      = {Gates, Mark and Luszczek, Piotr and Abdelfattah, Ahmad and Kurzak, Jakub and Dongarra, Jack and Arturov, Konstantin and Cecka, Cris and Freitag, Chip},
      title       = {{SLATE} Working Note 2: C++ {API} for {BLAS} and {LAPACK}},
      institution = {Innovative Computing Laboratory, University of Tennessee},
      year        = {2017},
      month       = {June},
      number      = {ICL-UT-17-03},
      note        = {revision 03-2018}
    }

    Contents
    1 Introduction and Motivation
    2 Standards and Trends
      2.1 Programming Language Fortran (FORTRAN 77; BLAST XBLAS; Fortran 90)
      2.2 Programming Language C (Netlib CBLAS; Intel MKL CBLAS; Netlib lapack cwrapper; LAPACKE; Next-Generation BLAS: "BLAS G2")
  • On the Performance and Energy Efficiency of Sparse Linear Algebra on GPUs
    On the performance and energy efficiency of sparse linear algebra on GPUs
    Hartwig Anzt, Stanimire Tomov and Jack Dongarra
    The International Journal of High Performance Computing Applications, 1–16. © The Author(s) 2016. DOI: 10.1177/1094342016672081

    Abstract
    In this paper we unveil some performance and energy efficiency frontiers for sparse computations on GPU-based supercomputers. We compare the resource efficiency of different sparse matrix–vector products (SpMV) taken from libraries such as cuSPARSE and MAGMA for GPU and Intel's MKL for multicore CPUs, and develop a GPU sparse matrix–matrix product (SpMM) implementation that handles the simultaneous multiplication of a sparse matrix with a set of vectors in block-wise fashion. While a typical sparse computation such as the SpMV reaches only a fraction of the peak of current GPUs, we show that the SpMM succeeds in exceeding the memory-bound limitations of the SpMV. We integrate this kernel into a GPU-accelerated Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) eigensolver. LOBPCG is chosen as a benchmark algorithm for this study as it combines an interesting mix of sparse and dense linear algebra operations that is typical for complex simulation applications, and allows for hardware-aware optimizations. In a detailed analysis we compare the performance and energy efficiency against a multi-threaded CPU counterpart. The reported performance and energy efficiency results are indicative of sparse computations on supercomputers.

    Keywords: sparse eigensolver, LOBPCG, GPU supercomputer, energy efficiency, blocked sparse matrix–vector product
  • Evolving Software Repositories
    1. Evolving Software Repositories
    http://www.netlib.org/utk/projects/esr/
    Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory
    Ron Boisvert, National Institute of Standards and Technology
    Eric Grosse, AT&T Bell Laboratories

    2. Project Focus Areas
    - NHSE overview
    - Resource Cataloging and Distribution System (RCDS)
    - Safe execution environments for mobile code
    - Application-level and content-oriented tools
    - Repository interoperability
    - Distributed, semantic-based searching

    3. NHSE: National HPCC Software Exchange
    - NASA plus other agencies funded CRPC project
    - Center for Research on Parallel Computation (CRPC):
      - Argonne National Laboratory
      - California Institute of Technology
      - Rice University
      - Syracuse University
      - University of Tennessee
    - Uniform interface to distributed HPCC software repositories
    - Facilitation of cross-agency and interdisciplinary software reuse
    - Material from ASTA, HPCS, and IITA components of the HPCC program
    - http://www.netlib.org/nhse/

    4. Goals
    - Capture, preserve and make available all software and software-related artifacts produced by the federal HPCC program. Software-related artifacts include algorithms, specifications, designs, documentation, reports, ...
    - Promote formation, growth, and interoperation of discipline-oriented repositories that organize, evaluate, and add value to individual contributions.
    - Employ and develop where necessary state-of-the-art technologies for assisting users in finding, understanding, and using HPCC software and technologies.

    5. Benefits
    1. Faster development of high-quality software so that scientists can spend less time writing and debugging programs and more time on research problems.
    2. Less duplication of software development effort by sharing of software modules.