Overview of Dense Numerical Libraries

• BLAS: kernels for dense linear algebra
• LAPACK: sequential dense linear algebra
• ScaLAPACK: parallel distributed dense linear algebra
• And beyond

Linear Algebra PACKage

What is dense linear algebra?
• Not just matmul!
• Linear systems: Ax = b

• Least squares: choose x to minimize ||Ax − b||₂
  – Overdetermined or underdetermined
  – Unconstrained, constrained, weighted
• Eigenvalues and eigenvectors of symmetric matrices
  – Standard (Ax = λx), Generalized (Ax = λBx)
• Eigenvalues and eigenvectors of unsymmetric matrices
  – Eigenvalues, Schur form, eigenvectors, invariant subspaces
  – Standard, Generalized
• Singular values and vectors (SVD)
  – Standard, Generalized
• Different structures
  – Real, complex; symmetric, Hermitian, positive definite; dense, triangular, banded, …
• Level of detail
  – Simple drivers ("x = A\b")
  – Expert drivers with error bounds, extra precision, other options
  – Lower-level routines ("apply a certain kind of orthogonal transformation", matmul, …)

Dense Linear Algebra

• Common operations: Ax = b;  min over x of ||Ax − b||;  Ax = λx
• A major source of large dense linear systems is problems involving the solution of boundary integral equations.
  – The price one pays for replacing three dimensions with two is that what started as a sparse problem in O(n³) variables is replaced by a dense problem in O(n²) variables.
• Dense systems of linear equations are found in numerous other applications, including:
  – airplane wing design;
  – radar cross-section studies;
  – flow around ships and other off-shore constructions;
  – diffusion of solid bodies in a liquid;
  – noise reduction; and
  – diffusion of light through small particles.

Existing Math Software - Dense LA

http://www.netlib.org/utk/people/JackDongarra/la-sw.html
• LINPACK, EISPACK, LAPACK, ScaLAPACK

– PLASMA, MAGMA

A brief history of (Dense) Linear Algebra software

• In the beginning was the do-loop…
  – Libraries like EISPACK (for eigenvalue problems)
• Then the BLAS (1) were invented (1973-1977)
  – Standard library of 15 operations (mostly) on vectors
    • "AXPY" (y = α·x + y), dot product, scale (x = α·x), etc. (see the CBLAS sketch below)
    • Up to 4 versions of each (S/D/C/Z), 46 routines, 3300 LOC
  – Goals
    • Common "pattern" to ease programming, readability
    • Robustness, via careful coding (avoiding over/underflow)
    • Portability + efficiency via machine-specific implementations
  – Why BLAS 1? They do O(n¹) ops on O(n¹) data
  – Used in libraries like LINPACK (for linear systems)
    • Source of the name "LINPACK" (not the code!)
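As a concrete illustration of these Level-1 operations, here is a minimal sketch using the CBLAS interface; it assumes a CBLAS-providing BLAS library (e.g. the Netlib reference BLAS or OpenBLAS) is available at link time.

    #include <stdio.h>
    #include <cblas.h>   /* C interface to the BLAS; link against a BLAS library */

    int main(void) {
        double x[4] = {1.0, 2.0, 3.0, 4.0};
        double y[4] = {4.0, 3.0, 2.0, 1.0};

        /* Level-1 BLAS AXPY: y = 2.0*x + y, O(n) flops on O(n) data */
        cblas_daxpy(4, 2.0, x, 1, y, 1);

        /* Level-1 BLAS dot product */
        double d = cblas_ddot(4, x, 1, y, 1);
        printf("dot = %g\n", d);
        return 0;
    }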

What Is LINPACK?

• LINPACK is a package of mathematical software for solving problems in linear algebra, mainly dense systems of linear equations.

• LINPACK: "LINear algebra PACKage"
  – Written in Fortran 66

• The project had its origins in 1974.

• The project had four primary contributors: myself when I was at Argonne National Lab, Jim Bunch from the University of California-San Diego, Cleve Moler, who was at the University of New Mexico at that time, and Pete Stewart from the University of Maryland.

• LINPACK as a software package has been largely superseded by LAPACK, which has been designed to run efficiently on shared-memory vector supercomputers.

Computing in 1974

• High-performance computers:
  – IBM 370/195, CDC 7600, Univac 1110, DEC PDP-10, Honeywell 6030
• Fortran 66
• Trying to achieve software portability
• Run efficiently
• BLAS (Level 1)
  – Vector operations
• Software released in 1979
  – About the time of the Cray 1

LINPACK Benchmark?

• The Linpack Benchmark is a measure of a computer's floating-point rate of execution.
  – It is determined by running a computer program that solves a dense system of linear equations.
• Over the years the characteristics of the benchmark have changed a bit.
  – In fact, there are three benchmarks included in the Linpack Benchmark report.
• LINPACK Benchmark
  – Dense linear system solve with LU factorization using partial pivoting
  – Operation count: 2/3 n³ + O(n²)
  – Benchmark measure: MFlop/s
  – The original benchmark measures the execution rate for a Fortran program on a matrix of size 100x100.

Accidental Benchmarker

• Appendix B of the Linpack Users' Guide
  – Designed to help users extrapolate execution time for the Linpack software package
• First benchmark report from 1977
  – Cray 1 to DEC PDP-10
(a small flop-rate sketch follows)
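To make the benchmark measure concrete, here is a small sketch that converts a measured solve time into the reported flop rate; it assumes the conventional LINPACK operation count of 2/3 n³ + 2 n² (LU factorization plus the triangular solves), and the helper name is ours, not part of any benchmark code.

    #include <stdio.h>

    /* Gflop/s for a dense solve of order n that took 'seconds' of wall time,
       using the conventional 2/3 n^3 + 2 n^2 operation count. */
    double linpack_gflops(int n, double seconds) {
        double dn  = (double)n;
        double ops = (2.0 / 3.0) * dn * dn * dn + 2.0 * dn * dn;
        return ops / seconds / 1.0e9;
    }

    int main(void) {
        /* e.g. an n = 100 solve that took 0.8 ms */
        printf("%.3f Gflop/s\n", linpack_gflops(100, 8.0e-4));
        return 0;
    }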

High Performance Linpack (HPL)

Benchmark name      Matrix dimension   Optimizations allowed    Parallel processing
Linpack 100         100                compiler                 -(a)
Linpack 1000(b)     1000               hand, code replacement   -(c)
Linpack Parallel    1000               hand, code replacement   Yes
HPLinpack(d)        arbitrary          hand, code replacement   Yes

(a) Compiler parallelization possible.
(b) Also known as TPP (Toward Peak Performance) or Best Effort.
(c) Multiprocessor implementations allowed.
(d) The Highly-Parallel LINPACK Benchmark is also known as the NxN Linpack Benchmark or High Parallel Computing (HPC).

The Standard LU Factorization: LINPACK (1970's HPC of the day: vector architectures)

[Figure: one step of LINPACK-style LU: factor a column with Level 1 BLAS, divide by the pivot, rank-1 Schur complement update, then move to the next step.]

Main points
• Factoring the column is mostly sequential due to the memory bottleneck (Level 1 BLAS)
• Dividing by the pivot row has little parallelism
• The rank-1 Schur complement update is the only task that is easy to parallelize
• Partial pivoting complicates things even further
• Bulk synchronous parallelism (fork-join)
  – Load imbalance
  – Non-trivial Amdahl fraction in the panel
  – The potential workaround (look-ahead) has a complicated implementation

A brief history of (Dense) Linear Algebra software

• But the BLAS-1 weren't enough
  – Consider AXPY (y = α·x + y): 2n flops on 3n reads/writes
  – Computational intensity = (2n)/(3n) = 2/3
  – Too low to run near peak speed (reads/writes dominate)
• So the BLAS-2 were developed (1984-1986)
  – Standard library of 25 operations (mostly) on matrix/vector pairs
  – "GEMV": y = α·A·x + β·y,  "GER": A = A + α·x·yᵀ,  "TRSV": x = T⁻¹·x (see the sketch below)
  – Up to 4 versions of each (S/D/C/Z), 66 routines, 18K LOC
  – Why BLAS 2? They do O(n²) ops on O(n²) data
  – So the computational intensity is still just ~(2n²)/(n²) = 2
  – OK for vector machines, but not for machines with caches
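A minimal Level-2 sketch with CBLAS (again assuming a BLAS library at link time): GEMV touches roughly m·n matrix entries once while doing about 2·m·n flops, which is the intensity of about 2 mentioned above.

    #include <cblas.h>

    void gemv_demo(void) {
        /* 3x3 column-major A; compute y = 1.0*A*x + 0.0*y */
        double A[9] = {1, 4, 7,    /* column 0 */
                       2, 5, 8,    /* column 1 */
                       3, 6, 9};   /* column 2 */
        double x[3] = {1, 1, 1};
        double y[3] = {0, 0, 0};

        cblas_dgemv(CblasColMajor, CblasNoTrans, 3, 3,
                    1.0, A, 3, x, 1, 0.0, y, 1);
    }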

A brief history of (Dense) Linear Algebra software

• The next step: BLAS-3 (1987-1988)
  – Standard library of 9 operations (mostly) on matrix/matrix pairs
  – "GEMM": C = α·A·B + β·C,  "SYRK": C = α·A·Aᵀ + β·C,  "TRSM": B = T⁻¹·B
  – Up to 4 versions of each (S/D/C/Z), 30 routines, 10K LOC
  – Why BLAS 3? They do O(n³) ops on O(n²) data
  – So the computational intensity is (2n³)/(4n²) = n/2 – big at last!
  – Good for machines with caches and other memory hierarchy levels
• How much BLAS1/2/3 code so far (all at www.netlib.org/blas)?
  – Source: 142 routines, 31K LOC; testing: 28K LOC
  – Reference (unoptimized) implementation only
  – Ex: 3 nested loops for GEMM (see the sketch below)
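The "3 nested loops" reference GEMM mentioned above looks roughly like the sketch below (column-major, no blocking); optimized BLAS-3 implementations tile these loops so that each O(n²) block of data is reused O(n) times from cache.

    /* C = alpha*A*B + beta*C, with A m x k, B k x n, C m x n, column-major. */
    void ref_dgemm(int m, int n, int k, double alpha,
                   const double *A, int lda, const double *B, int ldb,
                   double beta, double *C, int ldc) {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < m; i++) {
                double sum = 0.0;
                for (int p = 0; p < k; p++)
                    sum += A[i + p * lda] * B[p + j * ldb];
                C[i + j * ldc] = alpha * sum + beta * C[i + j * ldc];
            }
    }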

Back to basics: Why avoiding communication is important

Algorithms have two costs:
1. Arithmetic (FLOPS)
2. Communication: moving data between
   – levels of a memory hierarchy (sequential case)
   – processors over a network (parallel case)

[Figure: sequential case, one CPU with cache and DRAM; parallel case, several CPUs, each with its own DRAM, connected by a network.]

Memory Hierarchy

• By taking advantage of the principle of locality:
  – Present the user with as much memory as is available in the cheapest technology.
  – Provide access at the speed offered by the fastest technology.

[Figure: the memory hierarchy, from registers and on-chip cache (SRAM), through level 2/3 cache, main memory (DRAM), distributed and remote (cluster) memory, to secondary storage (disk) and tertiary storage (disk/tape). Access times grow from about a nanosecond at the registers to tens of milliseconds at disk, while capacities grow from hundreds of bytes to terabytes.]

Why Higher Level BLAS?

• We can only do arithmetic on data at the top of the hierarchy
• Higher-level BLAS let us do this

[Figure: memory pyramid: registers, L1 cache, L2 cache, local memory, remote memory, secondary memory.]

Level 1, 2 and 3 BLAS

• Level 1 BLAS: vector-vector operations
• Level 2 BLAS: matrix-vector operations
• Level 3 BLAS: matrix-matrix operations

Level 1, 2 and 3 BLAS
1 core Intel Xeon E5-2670 (Sandy Bridge); 2.6 GHz; peak = 20.8 Gflop/s

[Figure: Gflop/s vs. matrix size on one core; Level 3 BLAS reaches about 19.3 Gflop/s, Level 2 BLAS about 3.3 Gflop/s, Level 1 BLAS about 0.2 Gflop/s.]

1 core Intel Xeon E5-2670 (Sandy Bridge), 2.6 GHz. 24 MB shared L3 cache, and each core has a private 256 KB L2 and 64 KB L1. The theoretical peak per core in double precision is 8 flops/cycle x 2.6 GHz = 20.8 Gflop/s. Compiled with gcc 4.4.6 and using MKL composer_xe_2013.3.163.

A brief history of (Dense) Linear Algebra software

• LAPACK – "Linear Algebra PACKage" – uses BLAS-3 (1989 – now)
  – Ex: the obvious way to express Gaussian Elimination (GE) is adding multiples of one row to other rows – BLAS-1
  – How do we reorganize GE to use BLAS-3? (details later)
  – Contents of LAPACK (summary)
    • Algorithms we can turn into (nearly) 100% BLAS 3
      – Linear systems: solve Ax = b for x
      – Least squares: choose x to minimize ||Ax − b||₂
    • Algorithms that are only 50% BLAS 3 (so far)
      – "Eigenproblems": find λ and x where Ax = λx
      – Singular Value Decomposition (SVD): (AᵀA)x = σ²x
    • Generalized problems (e.g. Ax = λBx)
    • Error bounds for everything
    • Lots of variants depending on A's structure (banded, A = Aᵀ, etc.)
  – How much code? (Release 3.5.0, Nov 2013) (www.netlib.org/lapack)
    • Source: 1674 routines, 490K LOC; testing: 448K LOC

A brief history of (Dense) Linear Algebra software

• Is LAPACK parallel?
  – Only if the BLAS are parallel (possible in shared memory)
• ScaLAPACK – "Scalable LAPACK" (1995 – now)
  – For distributed memory; uses MPI
  – More complex data structures and algorithms than LAPACK
    • Only a (small) subset of LAPACK's functionality is available
  – All at www.netlib.org/scalapack

LAPACK

• http://www.netlib.org/lapack/
• LAPACK (Linear Algebra PACKage) provides routines for
  – solving systems of simultaneous linear equations,
  – least-squares solutions of linear systems of equations,
  – eigenvalue problems,
  – and singular value problems.
• LAPACK relies on the BLAS.
(LAPACK is in Fortran, column major. LAPACK is SEQUENTIAL. LAPACK is a REFERENCE implementation.)

• The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers.

• Dense and banded matrices are handled, but not general sparse matrices. In all areas, similar functionality is provided for real and complex matrices, in both single and double precision.

The Standard LU Factorization: LINPACK (1970's HPC of the day: vector architectures)

[Same figure and main points as on the earlier LINPACK LU slide: Level 1 BLAS column factorization, pivot division, rank-1 Schur complement update, fork-join bulk synchronous parallelism.]

The Standard LU Factorization: LAPACK (1980's HPC of the day: cache-based SMPs)

[Figure: one step of blocked LU: factor a panel with Level 1/2 BLAS, triangular update, Schur complement update, then the next step.]

Main points
• Panel factorization is mostly sequential due to the memory bottleneck
• The triangular solve has little parallelism
• The Schur complement update is the only task that is easy to parallelize
• Partial pivoting complicates things even further
• Bulk synchronous parallelism (fork-join)
  – Load imbalance
  – Non-trivial Amdahl fraction in the panel
  – The potential workaround (look-ahead) has a complicated implementation

Example with GESV: solve a system of linear equations using an LU factorization

subroutine dgesv( n, nrhs, A, lda, ipiv, b, ldb, info )

[Figure: on input, the n-by-n matrix A and the n-by-nrhs right-hand side B; on output, the solution X of Ax = b overwrites B, the L and U factors overwrite A, ipiv holds the pivot indices, and info reports success or failure. A LAPACKE call sketch follows.]
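For example, a call to this driver through the C interface (LAPACKE, described later in these slides) might look like the following sketch; the 3x3 system is made up for illustration.

    #include <stdio.h>
    #include <lapacke.h>

    int main(void) {
        double a[9] = {4, 1, 2,    /* column-major 3x3 matrix A */
                       1, 5, 3,
                       2, 3, 6};
        double b[3] = {1, 2, 3};   /* right-hand side, overwritten by x */
        lapack_int ipiv[3];

        lapack_int info = LAPACKE_dgesv(LAPACK_COL_MAJOR, 3, 1,
                                        a, 3, ipiv, b, 3);
        if (info == 0)
            printf("x = [%g %g %g]\n", b[0], b[1], b[2]);
        return (int)info;
    }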

Functionalities in LAPACK

Type of Problem                                              Acronyms
Linear systems of equations                                  SV
Linear least squares problems                                LLS
Linear equality-constrained least squares problem            LSE
General linear model problem                                 GLM
Symmetric eigenproblems                                      SEP
Nonsymmetric eigenproblems                                   NEP
Singular value decomposition                                 SVD
Generalized symmetric definite eigenproblems                 GSEP
Generalized nonsymmetric eigenproblems                       GNEP
Generalized (or quotient) singular value decomposition       GSVD (QSVD)

LAPACK Software

• First release in February 1992
• Version 3.5.0 released in November 2013
• LICENSE: Mod-BSD, freely-available software package
  – Thus, it can be included in commercial software packages (and has been). We only ask that proper credit be given to the authors.
• Open SVN repository
• Multi-OS: *nix, Mac OS/X, Windows
• Multi-build support (cmake): make, xcode, nmake, VS Studio, Eclipse, etc.
• LAPACKE: standard C language APIs for LAPACK (in collaboration with Intel)
  – 2 layers of interface
    • High-level interface: workspace allocation and NaN checks
    • Low-level interface
• Prebuilt libraries for Windows
• Extensive test suite
• Forum and user support: http://icl.cs.utk.edu/lapack-forum/

Latest Algorithms

Since release 3.0 of LAPACK:

3.1
– Hessenberg QR algorithm with the small-bulge multi-shift QR algorithm together with aggressive early deflation. [2003 SIAM SIAG/LA Prize winning algorithm of Braman, Byers and Mathias]
– Improvements of the Hessenberg reduction subroutines. [G. Quintana-Ortí and van de Geijn]
– New MRRR eigenvalue algorithms. [2006 SIAM SIAG/LA Prize winning algorithm of Dhillon and Parlett]
– New partial column norm updating strategy for QR factorization with column pivoting. [Drmač and Bujanović]
– Mixed precision iterative refinement for exploiting fast single precision hardware for GE, PO. [Langou]

3.2
– Variants of various factorizations (LU, QR, Chol). [Du]
– RFP (Rectangular Full Packed) format. [Gustavson, Langou]
– XBLAS and extra-precise iterative refinement for GESV. [Demmel et al.]
– New fast and accurate Jacobi SVD. [2009 SIAM SIAG/LA Prize, Drmač and Veselić]
– Pivoted Cholesky. [Lucas]
– Better multishift Hessenberg QR algorithm with early aggressive deflation. [Byers]
– Complete CS decomposition. [Sutton]

3.3
– Level-3 BLAS symmetric indefinite solve and symmetric indefinite inversion. [Langou]
– Since LAPACK 3.3, all routines are now thread-safe.

Latest Algorithms

Since release 3.0 of LAPACK:

3.4
– xGEQRT: QR factorization (improved interface). [James (UC Denver)]
– xGEQRT3: recursive QR factorization. [James (UC Denver)] The recursive QR factorization is cache-oblivious and enables high performance on tall and skinny matrices.
– xTPQRT: communication-avoiding QR sequential kernels. [James (UC Denver)]
– CMAKE build system. Building under Windows has never been easier. This also allows us to release DLLs for Windows, so users no longer need a Fortran compiler to use LAPACK under Windows.
– Doxygen documentation. LAPACK routine documentation has never been more accessible. See http://www.netlib.org/lapack/explore-html/.
– New website allowing for easier navigation.
– LAPACKE: standard C language APIs for LAPACK.

3.5
– Symmetric/Hermitian LDLᵀ factorization routines with rook pivoting algorithm. [Kozachenko]
– 2-by-1 CSD to be used for tall and skinny matrices with orthonormal columns. [Brian Sutton]
– New stopping criteria for balancing. [Rodney James]
– New complex division algorithm. [Victor Liu]

LAPACKE – the C Standard Interface to LAPACK

• Two interfaces for two kinds of users
  – High-level interface: ease of use (for users)
    • No more WORK parameter; allocation is done for the user
    • Optional NaN check on all vector/matrix inputs before calling any LAPACK Fortran routine
    • Supports row-major and column-major formats
  – Low-level interface: for expert users and libraries
    • Low overhead and user control are the two main priorities here
• Other features
  – Include and configuration files
  – The INFO parameter is dropped in the C interface to LAPACK; instead the routine returns an integer
  – Handles ANSI C99 (default), C structure option, CPP complex type and custom types for complex types
• Needs Fortran and LAPACK at link time (see the sketch below)
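A sketch of the two levels for the least-squares driver DGELS (assuming the reference LAPACKE headers and library): the high-level call hides the workspace, while the low-level *_work call exposes the classic lwork = -1 workspace query. In real code you would use one or the other, not both on the same data.

    #include <stdlib.h>
    #include <lapacke.h>

    void dgels_two_ways(lapack_int m, lapack_int n, lapack_int nrhs,
                        double *a, lapack_int lda, double *b, lapack_int ldb) {
        /* High-level interface: no WORK argument, INFO is the return value. */
        lapack_int info = LAPACKE_dgels(LAPACK_COL_MAJOR, 'N', m, n, nrhs,
                                        a, lda, b, ldb);
        (void)info;

        /* Low-level interface: the caller queries and owns the workspace. */
        double wkopt;
        LAPACKE_dgels_work(LAPACK_COL_MAJOR, 'N', m, n, nrhs,
                           a, lda, b, ldb, &wkopt, -1);   /* workspace query */
        lapack_int lwork = (lapack_int)wkopt;
        double *work = malloc((size_t)lwork * sizeof(double));
        LAPACKE_dgels_work(LAPACK_COL_MAJOR, 'N', m, n, nrhs,
                           a, lda, b, ldb, work, lwork);
        free(work);
    }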

Resources

Reference code:
– Reference code (current version 3.4.2): http://www.netlib.org/lapack/lapack.tgz
– LAPACK build for Windows (current version 3.4.2): http://icl.cs.utk.edu/lapack-for-windows/lapack
– LAPACKE, standard C language APIs for LAPACK (in collaboration with Intel): http://www.netlib.org/lapack/#_standard_c_language_apis_for_lapack
– Remi's wrappers (wrapper for Matlab users): http://icl.cs.utk.edu/~delmas/lapwrapmw.htm

Vendor libraries: more or less the same as for the BLAS: MKL, ACML, VECLIB, ESSL, etc. (WARNING: some implementations are just a subset of LAPACK)

Documentation:
– LAPACK Users' Guide: http://www.netlib.org/lapack/lug/
– LAPACK Working Notes (in particular LAWN 41): http://www.netlib.org/lapack/lawns/downloads/
– LAPACK release notes: http://www.netlib.org/lapack/lapack-3.4.2.html
– LAPACK NAG example and auxiliary routines: http://www.nag.com/lapack-ex/lapack-ex.html
– CRC Handbook of Linear Algebra, Leslie Hogben ed., "Packages of Subroutines for Linear Algebra," Bai, Demmel, Dongarra, Langou, and Wang, Section 75, pages 75-1 to 75-24, CRC Press, 2006. http://www.netlib.org/netlib/utk/people/JackDongarra/PAPERS/CRC-LAPACK-2005.pdf

Support:
– LAPACK forum (more than 1000 topics): http://icl.cs.utk.edu/lapack-forum/
– LAPACK mailing list: [email protected]
– LAPACK mailing-list archive: http://icl.cs.utk.edu/lapack-forum/archives/

Organizing Linear Algebra – in books

www.netlib.org/lapack www.netlib.org/scalapack

www.netlib.org/templates      www.cs.utk.edu/~dongarra/etemplates

Parallelization of LU and QR: parallelize the update (dgemm)
• Easy and done in any reasonable software.
• This is the 2/3 n³ term in the FLOP count.
• Can be done efficiently with LAPACK + multithreaded BLAS.

[Figure: one step of blocked LU: dgetf2 factors the panel of A(1), dtrsm (+ dswp) forms the block row of U, and dgemm updates the trailing matrix A(2). A sketch of this step follows.]
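A hedged sketch of one such step with CBLAS + LAPACKE calls (the panel is factored with dgetrf here standing in for dgetf2, and the dlaswp row interchanges across the trailing matrix are omitted, so this is illustrative rather than a complete LU):

    #include <cblas.h>
    #include <lapacke.h>

    /* One right-looking blocked LU step at column k of an n x n column-major A,
       with panel width nb (assumes k + nb <= n). */
    void blocked_lu_step(int n, int k, int nb, double *A, int lda, lapack_int *ipiv) {
        int m  = n - k;          /* rows in the panel            */
        int nt = n - k - nb;     /* columns in the trailing part */
        double *Akk = &A[k + (size_t)k * lda];

        /* 1) panel factorization (dgetf2-like), Level 1/2 BLAS */
        LAPACKE_dgetrf(LAPACK_COL_MAJOR, m, nb, Akk, lda, &ipiv[k]);

        if (nt > 0) {
            /* 2) dtrsm: U12 = L11^{-1} * A12, Level 3 BLAS */
            cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                        CblasUnit, nb, nt, 1.0, Akk, lda,
                        &A[k + (size_t)(k + nb) * lda], lda);

            /* 3) dgemm: A22 = A22 - L21 * U12 (Schur complement), Level 3 BLAS */
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        m - nb, nt, nb, -1.0,
                        &A[(k + nb) + (size_t)k * lda], lda,
                        &A[k + (size_t)(k + nb) * lda], lda,
                        1.0, &A[(k + nb) + (size_t)(k + nb) * lda], lda);
        }
    }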

Overview of Dense Libraries

• BLAS: kernels for dense linear algebra
• LAPACK: sequential dense linear algebra
• ScaLAPACK: parallel distributed dense linear algebra

Scalable Linear Algebra PACKage

ScaLAPACK

• Library of software dealing with dense and banded routines
• Distributed memory, message passing
• MIMD computers and networks of workstations
• Clusters of SMPs

ScaLAPACK

• http://www.netlib.org/scalapack/
• ScaLAPACK (Scalable Linear Algebra PACKage) provides routines for
  – solving systems of simultaneous linear equations,
  – least-squares solutions of linear systems of equations,
  – eigenvalue problems,
  – and singular value problems.
• Relies on LAPACK / BLAS and BLACS / MPI
• Includes the PBLAS (Parallel BLAS)
(ScaLAPACK is in Fortran and C. ScaLAPACK is for PARALLEL DISTRIBUTED memory. ScaLAPACK is a REFERENCE implementation.)

Programming Style

• SPMD Fortran 77 with object-based design
• Built on various modules
  – PBLAS: provides the right level of notation
  – BLACS: interprocessor communication (on top of PVM, MPI, IBM SP, CRI T3, Intel, TMC)
  – BLAS
• LAPACK software expertise/quality
  – Software approach
  – Numerical methods

Overall Structure of Software

• Object based: array descriptor
  – Contains the information required to establish the mapping between a global array entry and its corresponding process and memory location.
  – Provides a flexible framework to easily specify additional data distributions or matrix types.
  – Currently dense, banded, and out-of-core
• Uses the concept of a context

PBLAS

• Similar to the BLAS in functionality and naming.
• Built on the BLAS and BLACS.
• Provide a global view of the matrix:

  CALL DGEXXX ( M, N, A( IA, JA ), LDA, ... )      ! BLAS/LAPACK: local matrix, explicit submatrix offset
  CALL PDGEXXX( M, N, A, IA, JA, DESCA, ... )      ! PBLAS: global indices plus an array descriptor

ScaLAPACK Structure

[Figure: the ScaLAPACK software stack: ScaLAPACK calls the PBLAS and LAPACK; the PBLAS (global addressing) sit on top of the BLACS and the BLAS (local addressing); the BLACS sit on PVM/MPI/...]

Choosing a Data Distribution

• Main issues are:
  – Load balancing
  – Use of the Level 3 BLAS

Possible Data Layouts

• 1D block and cyclic column distributions

• 1D block-cyclic column and 2D block-cyclic distributions
• 2D block-cyclic is used in ScaLAPACK for dense matrices

From LAPACK to ScaLAPACK

[LAPACK]    subroutine  dgesv( n, nrhs, a(ia,ja), lda, ipiv, b(ib,jb), ldb, info )
[ScaLAPACK] subroutine pdgesv( n, nrhs, a, ia, ja, desca, ipiv, b, ib, jb, descb, info )

[Figure: with the LAPACK data layout, the whole n-by-n matrix A and the n-by-nrhs right-hand side B sit in one memory, and on output L, U, X and ipiv overwrite them. With the ScaLAPACK data layout, A, B and the pivot vector are split into blocks (A11 … A33, B11 … B31, ip1 … ip3) distributed over the process grid, and the factors L, U and the solution X are returned in the same distributed layout.]

Distribution and Storage

• The matrix is block-partitioned and blocks are mapped to processes
• Distributed 2-D block-cyclic scheme

Example: a 5x5 matrix partitioned in 2x2 blocks, mapped onto a 2x2 process grid:

  matrix point of view          process grid point of view
  A11 A12 A13 A14 A15           A11 A12 A15 | A13 A14
  A21 A22 A23 A24 A25           A21 A22 A25 | A23 A24
  A31 A32 A33 A34 A35           A51 A52 A55 | A53 A54
  A41 A42 A43 A44 A45           ------------+------------
  A51 A52 A53 A54 A55           A31 A32 A35 | A33 A34
                                A41 A42 A45 | A43 A44

• Routines are available to distribute/redistribute data.

2D Block Cyclic Layout

[Figure (animation over several slides): the same M x N matrix shown from the matrix point of view and from the processor point of view. The process grid is P x Q with P = 2, Q = 3 (processes numbered 0-5); blocks are MB x NB. Successive frames deal the blocks out cyclically in both dimensions, so every process ends up owning a regularly spaced 2-D subset of the blocks. A small index-mapping sketch follows.]
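The mapping the figure illustrates can be written down in a few lines; the sketch below (plain C, zero-based indices, zero block-source offsets, so not literal ScaLAPACK/BLACS code) computes which process owns a global entry and where it lands locally.

    #include <stdio.h>

    typedef struct {
        int p, q;     /* owning process row/column in the P x Q grid */
        int li, lj;   /* local row/column index inside that process  */
    } BlockCyclicLoc;

    BlockCyclicLoc locate(int i, int j, int MB, int NB, int P, int Q) {
        BlockCyclicLoc loc;
        loc.p  = (i / MB) % P;                   /* process row        */
        loc.q  = (j / NB) % Q;                   /* process column     */
        loc.li = (i / (MB * P)) * MB + i % MB;   /* local row index    */
        loc.lj = (j / (NB * Q)) * NB + j % NB;   /* local column index */
        return loc;
    }

    int main(void) {
        /* e.g. global entry (7, 10) with 2x2 blocks on a 2x3 process grid */
        BlockCyclicLoc loc = locate(7, 10, 2, 2, 2, 3);
        printf("owner (%d,%d), local (%d,%d)\n", loc.p, loc.q, loc.li, loc.lj);
        return 0;
    }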

Parallelism in ScaLAPACK

• Level 3 BLAS block operations
  – All the reduction routines
• Pipelining
  – QR algorithm, triangular solvers, classic factorizations
• Redundant computations
  – Condition estimators
• Static work assignment
  – Bisection
• Task parallelism
  – Sign function eigenvalue computations
• Divide and conquer
  – Tridiagonal and band solvers, symmetric eigenvalue problem and sign function
• Cyclic reduction
  – Reduced system in the band solver
• Data parallelism
  – Sign function

Functionalities in LAPACK

(The same table of problem types and acronyms as shown earlier: SV, LLS, LSE, GLM, SEP, NEP, SVD, GSEP, GNEP, GSVD/QSVD.)

Functionalities in ScaLAPACK

(ScaLAPACK covers the same list of problem types and acronyms.)

Major Changes to Software

• Must rethink the design of our software
  – Another disruptive technology
  – Similar to what happened with cluster computing and message passing
  – Rethink and rewrite the applications, algorithms, and software
• Numerical libraries, for example, will change
  – For example, both LAPACK and ScaLAPACK will undergo major changes to accommodate this

Moore's Law Reinterpreted

[Figure: number of cores in the Top 20 systems, 2000-2013, growing into the millions.]

• The number of cores per chip doubles every 2 years, while clock speed decreases (not increases).
  – Need to deal with systems with millions of concurrent threads
  – Future generations will have billions of threads!
  – Need to be able to easily replace inter-chip parallelism with intra-chip parallelism
• The number of threads of execution doubles every 2 years

Key Challenges at Exascale

• Power management
  – API for fine-grain management
• Language constraints
  – Fortran, C & MPI, OpenMP
• Autotuning
  – Systems are complex and changing
• Bulk synchronous processing
  – Break fork-join parallelism
• Lack of reproducibility; unnecessarily expensive (most of the time)
  – Can't guarantee bitwise results
• Fault occurrence; shared responsibility
  – Process failure recovery
• Levels of parallelism
  – O(100M) and beyond
• Hybrid architectures
  – Node composed of multiple multicore sockets + accelerators
• Bandwidth vs. arithmetic rate
  – Most approaches assume flops are expensive
• Storage capacity
  – Issue of weak scalability in future systems
• Need for effective scheduling of tasks

A New Generation of DLA Software

Software/Algorithms follow hardware evolution in time:

LINPACK (70's), vector operations: relies on Level-1 BLAS operations
LAPACK (80's), blocking, cache friendly: relies on Level-3 BLAS operations
ScaLAPACK (90's), distributed memory: relies on the PBLAS and message passing
PLASMA, new algorithms (many-core friendly): relies on a DAG/scheduler, block data layout, and some extra kernels
MAGMA, hybrid algorithms (heterogeneity friendly): relies on a hybrid scheduler and hybrid kernels

Parallelization of LU and QR: parallelize the update (dgemm), as on the earlier slide: easy, it is the 2/3 n³ term in the FLOP count, and it can be done efficiently with LAPACK + multithreaded BLAS.

[Figure: the blocked LU step again: dgetf2 on the panel of A(1), dtrsm (+ dswp), and the dgemm update of the trailing matrix A(2), executed as fork-join / bulk synchronous processing.]

Synchronization (in LAPACK LU)
• Fork-join, bulk synchronous processing
• The alternative is to allow delayed updates and out-of-order, asynchronous, dataflow execution

[Figure: trace of cores vs. time for the fork-join version; cores sit idle at every join.]

PLASMA LU Factorization: Dataflow Driven

[Figure: the LU task DAG: an xGETF2 panel task feeds xTRSM tasks, which in turn feed the xGEMM updates of the trailing submatrix; the updates of one step overlap with the next panel.]

Data Layout is Critical

• Tile data layout, where each data tile is contiguous in memory
• Decomposed into several fine-grained tasks, which better fit the memory of the small core caches (see the conversion sketch below)

PLASMA LU: Tile Algorithm and Nested Parallelism
• Operates on one, two, or three matrix tiles at a time using a single core
  – This is called a kernel; it executes independently of other kernels
  – Mostly Level 3 BLAS are used
• Data flows between kernels as prescribed by the programmer
• Coordination is done transparently via the runtime scheduler (QUARK)
  – Parallelism level adjusted at runtime
  – Look-ahead adjusted at runtime
• Uses single-threaded BLAS, with all the optimization benefits
• The panel is done on multiple cores
  – Recursive formulation of LU for better BLAS use
  – Level 1 BLAS are faster because they work on the combined cache size
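A minimal sketch of the conversion into tile layout (assuming, for brevity, that n is a multiple of the tile size nb); each nb x nb tile becomes a contiguous column-major block, which is what lets a tile kernel stream through one compact piece of memory.

    #include <string.h>

    /* Copy a column-major n x n matrix A (leading dimension lda) into an array
       of nb x nb tiles T, the tiles stored one after another. */
    void lapack_to_tile(int n, int nb, const double *A, int lda, double *T) {
        int nt = n / nb;                              /* tiles per dimension */
        for (int tj = 0; tj < nt; tj++)
            for (int ti = 0; ti < nt; ti++) {
                double *tile = T + (size_t)(tj * nt + ti) * nb * nb;
                for (int j = 0; j < nb; j++)          /* copy one tile column */
                    memcpy(tile + (size_t)j * nb,
                           A + (size_t)(tj * nb + j) * lda + ti * nb,
                           (size_t)nb * sizeof(double));
            }
    }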

QUARK Shared Memory Superscalar Scheduling

Definition – pseudocode (tile Cholesky):

  FOR k = 0..TILES-1
    A[k][k] ← DPOTRF(A[k][k])
    FOR m = k+1..TILES-1
      A[m][k] ← DTRSM(A[k][k], A[m][k])
    FOR m = k+1..TILES-1
      A[m][m] ← DSYRK(A[m][k], A[m][m])
      FOR n = k+1..m-1
        A[m][n] ← DGEMM(A[m][k], A[n][k], A[m][n])

Implementation – actual QUARK code in PLASMA:

  for (k = 0; k < A.mt; k++) {
    QUARK_CORE_dpotrf(...);
    for (m = k+1; m < A.mt; m++) {
      QUARK_CORE_dtrsm(...);
    }
    for (m = k+1; m < A.mt; m++) {
      QUARK_CORE_dsyrk(...);
      for (n = k+1; n < m; n++) {
        QUARK_CORE_dgemm(...);
      }
    }
  }

Parallel Linear Algebra Software for Multicore/Hybrid Architectures (PLASMA)
• Objectives
  – High utilization of each core
  – Scaling to large numbers of cores
  – Synchronization-reducing algorithms
• Methodology
  – Dynamic DAG scheduling (QUARK)
  – Explicit parallelism
  – Implicit communication
  – Fine granularity / block data layout
• Arbitrary DAG with dynamic scheduling (the sketch below spells out this dependence structure)
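To make the dataflow idea concrete without the QUARK API, here is a sketch of the same tile-Cholesky loop nest with the dependences spelled out as OpenMP task dependences (compile with -fopenmp); the *_tile kernels are assumed wrappers around the corresponding BLAS/LAPACK tile operations, not PLASMA routines.

    /* Assumed (hypothetical) tile kernels wrapping dpotrf/dtrsm/dsyrk/dgemm. */
    void dpotrf_tile(double *akk);
    void dtrsm_tile(const double *akk, double *amk);
    void dsyrk_tile(const double *amk, double *amm);
    void dgemm_tile(const double *amk, const double *ank, double *amn);

    /* A[i + j*t] points to the nb x nb tile (i, j) of a t x t tile matrix. */
    void tile_cholesky(double **A, int t) {
        #pragma omp parallel
        #pragma omp single
        for (int k = 0; k < t; k++) {
            #pragma omp task depend(inout: A[k + k*t])
            dpotrf_tile(A[k + k*t]);

            for (int m = k + 1; m < t; m++) {
                #pragma omp task depend(in: A[k + k*t]) depend(inout: A[m + k*t])
                dtrsm_tile(A[k + k*t], A[m + k*t]);
            }
            for (int m = k + 1; m < t; m++) {
                #pragma omp task depend(in: A[m + k*t]) depend(inout: A[m + m*t])
                dsyrk_tile(A[m + k*t], A[m + m*t]);

                for (int n = k + 1; n < m; n++) {
                    #pragma omp task depend(in: A[m + k*t], A[n + k*t]) depend(inout: A[m + n*t])
                    dgemm_tile(A[m + k*t], A[n + k*t], A[m + n*t]);
                }
            }
        }
    }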

[Figure: execution traces on the same set of cores: fork-join parallelism leaves idle gaps at every synchronization, while DAG-scheduled parallelism keeps the cores busy over time.]

PLASMA Local Scheduling – Dynamic Scheduling: Sliding Window
• DAGs get very big, very fast
• So windows of active tasks are used; this means no global critical path
• Matrix of NBxNB tiles; NB³ operations
• NB = 100 gives 1 million tasks

Example: QR Factorization

  FOR k = 0 .. SIZE-1
    A[k][k], T[k][k] <- GEQRT( A[k][k] )
    FOR m = k+1 .. SIZE-1
      A[k][k]|Up, A[m][k], T[m][k] <- TSQRT( A[k][k]|Up, A[m][k], T[m][k] )
    FOR n = k+1 .. SIZE-1
      A[k][n] <- UNMQR( A[k][k]|Low, T[k][k], A[k][n] )
      FOR m = k+1 .. SIZE-1
        A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )

[Figure: the corresponding task DAG with GEQRT, TSQRT, UNMQR and TSMQR nodes.]

Input Format – QUARK (PLASMA)
• Sequential C code, annotated through QUARK-specific syntax
  – Insert_Task
  – INOUT, OUTPUT, INPUT
  – REGION_L, REGION_U, REGION_D, …
  – LOCALITY
• Executes through the QUARK runtime to run on multicore SMPs

  for (k = 0; k < A.mt; k++) {
    Insert_Task( zgeqrt, A[k][k], INOUT,
                         T[k][k], OUTPUT);
    for (m = k+1; m < A.mt; m++) {
      Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D | REGION_U,
                           A[m][k], INOUT | LOCALITY,
                           T[m][k], OUTPUT);
    }
    for (n = k+1; n < A.nt; n++) {
      Insert_Task( zunmqr, A[k][k], INPUT | REGION_L,
                           T[k][k], INPUT,
                           A[k][n], INOUT);
      for (m = k+1; m < A.mt; m++) {
        Insert_Task( ztsmqr, A[k][n], INOUT,
                             A[m][n], INOUT | LOCALITY,
                             A[m][k], INPUT,
                             T[m][k], INPUT);
      }
    }
  }

Algorithms: PLASMA_[scdz]potrf[_Tile][_Async]() - Cholesky

• Algorithm
  – equivalent to LAPACK
• Numerics
  – same as LAPACK
• Performance
  – comparable to the vendor library on a few cores
  – much better than the vendor library on many cores

Algorithms: PLASMA_[scdz]getrf[_Tile][_Async]() - LU

• Algorithm
  – equivalent to LAPACK
  – same pivot vector
  – same L and U factors
  – same forward substitution procedure
• Numerics
  – same as LAPACK
• Performance
  – comparable to the vendor library on a few cores
  – much better than the vendor library on many cores

Algorithms: PLASMA_[scdz]geqrt[_Tile][_Async]() - incremental QR factorization

• Algorithm
  – the same R factor as LAPACK (up to absolute values)
  – different set of Householder reflectors
  – different Q matrix
  – different Q generation / application procedure
• Numerics
  – same as LAPACK
• Performance
  – comparable to the vendor library on a few cores
  – much better than the vendor library on many cores

Algorithms: PLASMA_[scdz]syev[_Tile][_Async]() - three-stage symmetric EVP

• Algorithm
  – two-stage tridiagonal reduction + QR algorithm
  – fast eigenvalues, slower eigenvectors (possibility to calculate a subset)
• Numerics
  – same as LAPACK
• Performance
  – comparable to MKL for very small problems
  – absolutely superior for larger problems

Algorithms: PLASMA_[scdz]gesvd[_Tile][_Async]() - three-stage SVD

• Algorithm
  – two-stage bidiagonal reduction + QR iteration
  – fast singular values, slower singular vectors (possibility of calculating a subset)
• Numerics
  – same as LAPACK
• Performance
  – comparable with MKL for very small problems
  – absolutely superior for larger problems

Pipelining: Cholesky Inversion 3 Steps: Factor, Invert L, Multiply L’s

[Figure: execution traces on 48 cores of POTRF, TRTRI and LAUUM for a 4000 x 4000 matrix with 200 x 200 tiles. POTRF+TRTRI+LAUUM run back to back: 25 (7t-3); Cholesky factorization alone: 3t-2; pipelined: 18 (3t+6). A plain LAPACKE version of the three steps follows.]
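For reference, the three steps being traced can be expressed with stock LAPACK routines via LAPACKE, as in the sketch below (sequential, not the PLASMA pipelined version; if the TRTRI/LAUUM wrappers are not available in your LAPACKE build, LAPACKE_dpotri performs the last two steps in one call).

    #include <lapacke.h>

    /* Overwrite the lower triangle of an SPD matrix A with A^{-1}:
       POTRF (A = L L^T), TRTRI (L <- L^{-1}), LAUUM (form L^{-T} L^{-1}). */
    lapack_int spd_inverse(lapack_int n, double *a, lapack_int lda) {
        lapack_int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, a, lda);
        if (info != 0) return info;
        info = LAPACKE_dtrtri(LAPACK_COL_MAJOR, 'L', 'N', n, a, lda);
        if (info != 0) return info;
        return LAPACKE_dlauum(LAPACK_COL_MAJOR, 'L', n, a, lda);
    }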

Algorithms: PLASMA_[scdz]geqrt[_Tile][_Async]() - incremental QR with
PLASMA_Set( PLASMA_HOUSEHOLDER_MODE, PLASMA_TREE_HOUSEHOLDER );

• Algorithm
  – the same R factor as LAPACK (up to absolute values)
  – different set of Householder reflectors
  – different Q matrix
  – different Q generation / application procedure
• Numerics
  – same as LAPACK
• Performance
  – absolutely superior for tall matrices

Communication Avoiding QR Example

[Figure: communication-avoiding QR on a quad-socket, quad-core Intel Xeon EMT64 E7340 at 2.39 GHz; theoretical peak is 153.2 Gflop/s with 16 cores; matrix size 51200 by 3200.]

Avoiding Synchronization

• "Responsibly reckless" algorithms
  – Try a fast (possibly unstable) algorithm that might fail (but rarely)
  – Check for failure and, if needed, recompute with a stable algorithm

Avoid Pivoting; Minimize Synchronization

• To solve Ax = b:
  – Compute A_r = Uᵀ A V, with U and V random matrices
  – Factorize A_r without pivoting (GENP)
  – Solve A_r y = Uᵀ b and then x = V y
• U and V are recursive butterfly matrices
  – Randomization is cheap (O(n²) operations)
  – GENP is fast ("Cholesky" speed; takes advantage of the GPU)
  – Accuracy is in practice similar to GEPP (with iterative refinement), but… think of this as a preprocessing step.

Goal: Transform A into a matrix that would be sufficiently “random” so that, with a probability close to 1, pivoting is not needed.

PLASMA RBT execution trace

[Figure: execution trace with n = 2000, nb = 250 on a 12-core AMD Opteron. The partial randomization (shown in gray) is inexpensive, and the factorization without pivoting scales without synchronizations.]

Summary

• Major challenges are ahead for extreme computing
  – Parallelism O(10⁹): programming issues
  – Hybrid: peak and HPL may be very misleading; nowhere near close to peak for most apps
  – Fault tolerance: today the Sequoia BG/Q node failure rate is 1.25 failures/day
  – Power: 50 Gflops/W needed (today at 2 Gflops/W)
• We will need completely new approaches and technologies to reach the exascale level

Collaborators / Software / Support

• PLASMA: http://icl.cs.utk.edu/plasma/
• MAGMA: http://icl.cs.utk.edu/magma/
• QUARK (runtime for shared memory): http://icl.cs.utk.edu/quark/
• PaRSEC (Parallel Runtime Scheduling and Execution Control): http://icl.cs.utk.edu/parsec/
• Collaborating partners:
  – University of Tennessee, Knoxville
  – University of California, Berkeley
  – University of Colorado, Denver
  – INRIA, France
  – KAUST, Saudi Arabia
