Overview of Dense Numerical Linear Algebra Libraries
• BLAS: kernels for dense linear algebra
• LAPACK: sequential dense linear algebra
• ScaLAPACK: parallel distributed dense linear algebra
• And Beyond
What is dense linear algebra?
• Not just matmul!
• Linear Systems: Ax = b
• Least Squares: choose x to minimize ||Ax − b||₂
  – Overdetermined or underdetermined
  – Unconstrained, constrained, weighted
• Eigenvalues and vectors of symmetric matrices
  – Standard (Ax = λx), Generalized (Ax = λBx)
• Eigenvalues and vectors of unsymmetric matrices
  – Eigenvalues, Schur form, eigenvectors, invariant subspaces
  – Standard, Generalized
• Singular values and vectors (SVD)
  – Standard, Generalized
• Different matrix structures
  – Real, complex; symmetric, Hermitian, positive definite; dense, triangular, banded …
• Level of detail
  – Simple driver (“x = A\b”)
  – Expert drivers with error bounds, extra precision, other options
  – Lower-level routines (“apply certain kind of orthogonal transformation”, matmul, …)
Dense Linear Algebra
• Common operations: Ax = b;  min_x ||Ax − b||;  Ax = λx
• A major source of large dense linear systems is problems involving the solution of boundary integral equations.
  – The price one pays for replacing three dimensions with two is that what started as a sparse problem in O(n³) variables is replaced by a dense problem in O(n²) variables.
• Dense systems of linear equations are found in numerous other applications, including:
  – airplane wing design;
  – radar cross-section studies;
  – flow around ships and other off-shore constructions;
  – diffusion of solid bodies in a liquid;
  – noise reduction; and
  – diffusion of light through small particles.
Existing Math Software - Dense LA
http://www.netlib.org/utk/people/JackDongarra/la-sw.html
• LINPACK, EISPACK, LAPACK, ScaLAPACK
  – PLASMA, MAGMA

A brief history of (Dense) Linear Algebra software
• In the beginning was the do-loop…
  – Libraries like EISPACK (for eigenvalue problems)
• Then the BLAS (1) were invented (1973-1977)
  – Standard library of 15 operations (mostly) on vectors
    • “AXPY” (y = α·x + y), dot product, scale (x = α·x), etc.
    • Up to 4 versions of each (S/D/C/Z), 46 routines, 3300 LOC
  – Goals
    • Common “pattern” to ease programming, readability
    • Robustness, via careful coding (avoiding over/underflow)
    • Portability + efficiency via machine-specific implementations
  – Why BLAS 1? They do O(n¹) ops on O(n¹) data
  – Used in libraries like LINPACK (for linear systems)
    • Source of the name “LINPACK Benchmark” (not the code!)
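The AXPY operation named above is tiny in code terms; a minimal C sketch of the double-precision version (the reference BLAS also takes stride/increment arguments, omitted here):

```c
#include <assert.h>

/* y = alpha*x + y on n-element vectors. Note the ratio of work to
   traffic: 2n flops against roughly 3n memory accesses. */
void daxpy(int n, double alpha, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```

This low flops-per-byte ratio is exactly why Level 1 BLAS alone cannot reach peak speed, as the BLAS-2/BLAS-3 slides below argue.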
What Is LINPACK?
• LINPACK is a package of mathematical software for solving problems in linear algebra, mainly dense systems of linear equations.
• LINPACK: “LINear algebra PACKage”
  – Written in Fortran 66
• The project had its origins in 1974
• The project had four primary contributors: myself when I was at Argonne National Lab, Jim Bunch from the University of California-San Diego, Cleve Moler who was at New Mexico at that time, and Pete Stewart from the University of Maryland.
• LINPACK as a software package has been largely superseded by LAPACK, which has been designed to run efficiently on shared-memory, vector supercomputers.
Computing in 1974
• High Performance Computers:
  – IBM 370/195, CDC 7600, Univac 1110, DEC PDP-10, Honeywell 6030
• Fortran 66
• Trying to achieve software portability
• Run efficiently
• BLAS (Level 1)
  – Vector operations
• Software released in 1979
  – About the time of the Cray 1
LINPACK Benchmark?
• The LINPACK Benchmark is a measure of a computer’s floating-point rate of execution.
  – It is determined by running a computer program that solves a dense system of linear equations.
• Over the years the characteristics of the benchmark have changed a bit.
  – In fact, there are three benchmarks included in the LINPACK Benchmark report.
• LINPACK Benchmark
  – Dense linear system solve with LU factorization using partial pivoting
  – Operation count is: 2/3 n³ + O(n²)
  – Benchmark measure: MFlop/s
  – Original benchmark measures the execution rate for a Fortran program on a matrix of size 100x100.

Accidental Benchmarker
• Appendix B of the LINPACK Users’ Guide
  – Designed to help users extrapolate execution time for the LINPACK software package
• First benchmark report from 1977
  – Cray 1 to DEC PDP-10
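The reported MFlop/s (today Gflop/s) rate follows directly from that operation count. As a sketch (the function names are illustrative, not part of any benchmark code; the 2n² term is the credited cost of the triangular solves):

```c
#include <assert.h>

/* Nominal flop count credited for solving one n-by-n dense system
   with one right-hand side: (2/3)n^3 + 2n^2. */
double lu_solve_flops(double n) {
    return 2.0 / 3.0 * n * n * n + 2.0 * n * n;
}

/* Benchmark rate in Gflop/s given measured wall-clock seconds. */
double gflops(double n, double seconds) {
    return lu_solve_flops(n) / seconds / 1e9;
}
```

Note that the credited flop count is fixed by the formula, regardless of how many operations an implementation actually performs.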
High Performance Linpack (HPL)

Benchmark name     Matrix dimension   Optimizations allowed     Parallel processing
Linpack 100        100                compiler                  – (a)
Linpack 1000 (b)   1000               hand, code replacement    – (c)
Linpack Parallel   1000               hand, code replacement    Yes
HPLinpack (d)      arbitrary          hand, code replacement    Yes

(a) Compiler parallelization possible.
(b) Also known as TPP (Toward Peak Performance) or Best Effort.
(c) Multiprocessor implementations allowed.
(d) The Highly-Parallel LINPACK Benchmark is also known as the NxN Linpack Benchmark or High Parallel Computing (HPC).
The Standard LU Factorization LINPACK
1970’s HPC of the Day: Vector Architecture

[Figure: one step of LU – factor the column with Level 1 BLAS, divide by the pivot, Schur complement update (rank-1 update), move to the next step]

Main points
• Factoring the column is mostly sequential due to the memory bottleneck (Level 1 BLAS)
• Dividing by the pivot row has little parallelism
• The rank-1 Schur complement update is the only task that is easy to parallelize
• Partial pivoting complicates things even further
• Bulk synchronous parallelism (fork-join)
  – Load imbalance
  – Non-trivial Amdahl fraction in the panel
  – The potential workaround (look-ahead) has a complicated implementation

A brief history of (Dense) Linear Algebra software
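The column-by-column steps just described can be sketched as an unblocked right-looking LU with partial pivoting (a minimal illustration of the idea, not the actual LINPACK dgefa code):

```c
#include <assert.h>
#include <math.h>

/* In-place LU with partial pivoting on an n-by-n row-major matrix.
   On exit, A holds L (unit lower, below the diagonal) and U; piv[k]
   records the row swapped into position k. Returns 0 on success. */
int lu_factor(int n, double *A, int *piv) {
    for (int k = 0; k < n; k++) {
        /* Pick the pivot: largest magnitude entry in column k. */
        int p = k;
        for (int i = k + 1; i < n; i++)
            if (fabs(A[i*n + k]) > fabs(A[p*n + k])) p = i;
        piv[k] = p;
        if (A[p*n + k] == 0.0) return k + 1;   /* singular */
        if (p != k)                             /* swap rows k and p */
            for (int j = 0; j < n; j++) {
                double t = A[k*n + j]; A[k*n + j] = A[p*n + j]; A[p*n + j] = t;
            }
        /* Scale the column below the pivot (a Level 1 BLAS dscal). */
        for (int i = k + 1; i < n; i++)
            A[i*n + k] /= A[k*n + k];
        /* Rank-1 Schur complement update (daxpy row by row). */
        for (int i = k + 1; i < n; i++)
            for (int j = k + 1; j < n; j++)
                A[i*n + j] -= A[i*n + k] * A[k*n + j];
    }
    return 0;
}
```

Everything except the rank-1 update is inherently serial per column, which is the performance problem the slide's main points describe.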
• But the BLAS-1 weren’t enough
  – Consider AXPY (y = α·x + y): 2n flops on 3n reads/writes
  – Computational intensity = (2n)/(3n) = 2/3
  – Too low to run near peak speed (read/write dominates)
• So the BLAS-2 were developed (1984-1986)
  – Standard library of 25 operations (mostly) on matrix/vector pairs
  – “GEMV”: y = α·A·x + β·y, “GER”: A = A + α·x·yᵀ, “TRSV”: x = T⁻¹·x
  – Up to 4 versions of each (S/D/C/Z), 66 routines, 18K LOC
  – Why BLAS 2? They do O(n²) ops on O(n²) data
  – So computational intensity is still just ~(2n²)/(n²) = 2
  – OK for vector machines, but not for machines with caches
A brief history of (Dense) Linear Algebra software
• The next step: BLAS-3 (1987-1988)
  – Standard library of 9 operations (mostly) on matrix/matrix pairs
  – “GEMM”: C = α·A·B + β·C, “SYRK”: C = α·A·Aᵀ + β·C, “TRSM”: B = T⁻¹·B
  – Up to 4 versions of each (S/D/C/Z), 30 routines, 10K LOC
  – Why BLAS 3? They do O(n³) ops on O(n²) data
  – So computational intensity is (2n³)/(4n²) = n/2 – big at last!
  – Good for machines with caches, other memory hierarchy levels
• How much BLAS1/2/3 code so far (all at www.netlib.org/blas)
  – Source: 142 routines, 31K LOC; Testing: 28K LOC
  – Reference (unoptimized) implementation only
  – Ex: 3 nested loops for GEMM
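The reference GEMM mentioned above really is just three nested loops; a sketch of the dgemm kernel restricted to C = α·A·B + β·C with no transposes (the real routine handles transpose options, leading dimensions, and rectangular shapes):

```c
#include <assert.h>

/* C = alpha*A*B + beta*C for row-major n-by-n matrices: 2n^3 flops
   touching only 4n^2 matrix entries, hence intensity ~ n/2. */
void dgemm_ref(int n, double alpha, const double *A, const double *B,
               double beta, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = alpha * s + beta * C[i*n + j];
        }
}
```

Optimized implementations restructure this loop nest with blocking so that tiles of A, B, and C stay in cache, which is what makes the n/2 intensity achievable in practice.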
Back to basics: Why avoiding communication is important
Algorithms have two costs:
1. Arithmetic (FLOPS)
2. Communication: moving data between
   – levels of a memory hierarchy (sequential case)
   – processors over a network (parallel case)

[Figure: sequential case – a CPU with its cache and DRAM; parallel case – several CPUs, each with local DRAM, connected by a network]

Memory Hierarchy
• By taking advantage of the principle of locality:
  – Present the user with as much memory as is available in the cheapest technology.
  – Provide access at the speed offered by the fastest technology.

[Figure: memory hierarchy from registers and on-chip cache (SRAM), through level 2 and 3 cache, main memory (DRAM), distributed memory, and remote cluster memory, to secondary (disk) and tertiary (disk/tape) storage; access times grow from about 1 ns at the registers to tens of ms at disk, while capacities grow from hundreds of bytes to terabytes]

Why Higher Level BLAS?
• Can only do arithmetic on data at the top of the hierarchy
• Higher level BLAS lets us do this

[Figure: memory pyramid – registers, L1 cache, L2 cache, local memory, remote memory, secondary memory]

Level 1, 2 and 3 BLAS
• Level 1 BLAS: vector-vector operations (e.g., y = α·x + y)
• Level 2 BLAS: matrix-vector operations (e.g., y = α·A·x + β·y)
• Level 3 BLAS: matrix-matrix operations (e.g., C = α·A·B + β·C)
Level 1, 2 and 3 BLAS
1 core Intel Xeon E5-2670 (Sandy Bridge); 2.6 GHz; Peak = 20.8 Gflop/s

[Figure: Gflop/s vs. matrix size – Level 3 BLAS reaches 19.3 Gflop/s, Level 2 BLAS 3.3 Gflop/s, Level 1 BLAS 0.2 Gflop/s]

1 core Intel Xeon E5-2670 (Sandy Bridge), 2.6 GHz; 24 MB shared L3 cache, and each core has a private 256 KB L2 and 64 KB L1. The theoretical peak per core in double precision is 8 flops/cycle * 2.6 GHz = 20.8 Gflop/s. Compiled with gcc 4.4.6 and using MKL_composer_xe_2013.3.163.
02/25/2009 CS267 Lecture 8

A brief history of (Dense) Linear Algebra software
• LAPACK – “Linear Algebra PACKage” – uses BLAS-3 (1989 – now)
  – Ex: the obvious way to express Gaussian Elimination (GE) is adding multiples of one row to other rows – BLAS-1
  – How do we reorganize GE to use BLAS-3? (details later)
• Contents of LAPACK (summary)
  – Algorithms we can turn into (nearly) 100% BLAS 3
    • Linear Systems: solve Ax = b for x
    • Least Squares: choose x to minimize ||Ax − b||₂
  – Algorithms that are only 50% BLAS 3 (so far)
    • “Eigenproblems”: find λ and x where Ax = λx
    • Singular Value Decomposition (SVD): (AᵀA)x = σ²x
    • Generalized problems (e.g., Ax = λBx)
  – Error bounds for everything
  – Lots of variants depending on A’s structure (banded, A = Aᵀ, etc.)
• How much code? (Release 3.5.0, Nov 2013) (www.netlib.org/lapack)
  – Source: 1674 routines, 490K LOC; Testing: 448K LOC
A brief history of (Dense) Linear Algebra software
• Is LAPACK parallel?
  – Only if the BLAS are parallel (possible in shared memory)
• ScaLAPACK – “Scalable LAPACK” (1995 – now)
  – For distributed memory – uses MPI
  – More complex data structures and algorithms than LAPACK
  – Only a (small) subset of LAPACK’s functionality is available
  – All at www.netlib.org/scalapack
LAPACK
• http://www.netlib.org/lapack/
• LAPACK (Linear Algebra Package) provides routines for
  – solving systems of simultaneous linear equations,
  – least-squares solutions of linear systems of equations,
  – eigenvalue problems,
  – and singular value problems.
• LAPACK is in FORTRAN, column major; LAPACK is SEQUENTIAL; LAPACK is a REFERENCE implementation
• LAPACK relies on BLAS
• The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers.
• Dense and banded matrices are handled, but not general sparse matrices. In all areas, similar functionality is provided for real and complex matrices, in both single and double precision.

The Standard LU Factorization LINPACK
1970’s HPC of the Day: Vector Architecture
[The LINPACK column-by-column LU scheme and its main points are repeated here for comparison.]

The Standard LU Factorization LAPACK
1980’s HPC of the Day: Cache Based SMP
[Figure: one step of blocked LU – factor the panel with Level 1,2 BLAS, triangular update, Schur complement update, move to the next step]

Main points
• Panel factorization is mostly sequential due to the memory bottleneck
• The triangular solve has little parallelism
• The Schur complement update is the only task that is easy to parallelize
• Partial pivoting complicates things even further
• Bulk synchronous parallelism (fork-join)
  – Load imbalance
  – Non-trivial Amdahl fraction in the panel
  – The potential workaround (look-ahead) has a complicated implementation

Example with GESV
Solve a system of linear equations using an LU factorization
subroutine dgesv( n, nrhs, A, lda, ipiv, b, ldb, info )

Input: the n-by-n matrix A and the n-by-nrhs right-hand-side matrix B.
Output: the solution X of AX = B; A is overwritten by its L and U factors, and ipiv holds the pivot indices.

Functionalities in LAPACK
Type of problem                                          Acronym
Linear system of equations                               SV
Linear least squares problems                            LLS
Linear equality-constrained least squares problem        LSE
General linear model problem                             GLM
Symmetric eigenproblems                                  SEP
Nonsymmetric eigenproblems                               NEP
Singular value decomposition                             SVD
Generalized symmetric definite eigenproblems             GSEP
Generalized nonsymmetric eigenproblems                   GNEP
Generalized (or quotient) singular value decomposition   GSVD (QSVD)
LAPACK Software
• First release in February 1992
• Version 3.5.0 released in November 2013
• LICENSE: Mod-BSD, freely-available software package
  – Thus, it can be included in commercial software packages (and has been). We only ask that proper credit be given to the authors.
• Open SVN repository
• Multi-OS
  – *nix, Mac OS/X, Windows
• Multi-build support (cmake)
  – make, xcode, nmake, VS studio, Eclipse, etc.
• LAPACKE: Standard C language APIs for LAPACK (in collaboration with INTEL)
  – 2 layers of interface
    • High-Level Interface: workspace allocation and NaN check
    • Low-Level Interface
• Prebuilt libraries for Windows
• Extensive test suite
• Forum and user support: http://icl.cs.utk.edu/lapack-forum/
Latest Algorithms
Since release 3.0 of LAPACK:

3.1
– Hessenberg QR algorithm with the small-bulge multi-shift QR algorithm together with aggressive early deflation. [2003 SIAM SIAG LA Prize winning algorithm of Braman, Byers and Mathias]
– Improvements of the Hessenberg reduction subroutines. [G. Quintana-Ortí and van de Geijn]
– New MRRR eigenvalue algorithms. [2006 SIAM SIAG LA Prize winning algorithm of Dhillon and Parlett]
– New partial column norm updating strategy for QR factorization with column pivoting. [Drmač and Bujanovic]
– Mixed precision iterative refinement for exploiting fast single precision hardware for GE, PO. [Langou]
3.2
– Variants of various factorizations (LU, QR, Chol). [Du]
– RFP (Rectangular Full Packed) format. [Gustavson, Langou]
– XBLAS and extra-precise iterative refinement for GESV. [Demmel et al.]
– New fast and accurate Jacobi SVD. [2009 SIAM SIAG LA Prize, Drmač and Veselić]
– Pivoted Cholesky. [Lucas]
– Better multishift Hessenberg QR algorithm with early aggressive deflation. [Byers]
3.3
– Complete CS decomposition. [Sutton]
– Level-3 BLAS symmetric indefinite solve and symmetric indefinite inversion. [Langou]
– Since LAPACK 3.3, all routines are now thread-safe.
Latest Algorithms (continued)
Since release 3.0 of LAPACK (continued):

3.4
– xGEQRT: QR factorization (improved interface). [James (UC Denver)]
– xGEQRT3: recursive QR factorization. [James (UC Denver)] The recursive QR factorization is cache-oblivious and enables high performance on tall and skinny matrices.
– xTPQRT: Communication-Avoiding QR sequential kernels. [James (UC Denver)]
– CMAKE build system. Building under Windows has never been easier. This also allows us to release DLLs for Windows, so users no longer need a Fortran compiler to use LAPACK under Windows.
– Doxygen documentation. LAPACK routine documentation has never been more accessible. See http://www.netlib.org/lapack/explore-html/.
– New website allowing for easier navigation.
– LAPACKE – Standard C language APIs for LAPACK.
3.5
– Symmetric/Hermitian LDLᵀ factorization routines with rook pivoting algorithm. [Kozachenko]
– 2-by-1 CSD to be used for tall and skinny matrices with orthonormal columns. [Brian Sutton]
– New stopping criteria for balancing. [Rodney James]
– New complex division algorithm. [Victor Liu]
LAPACKE – the C Standard Interface to LAPACK
• Two interfaces for two kinds of users
  – High-Level Interface: ease of use
    • No more WORK parameter; allocation is done for the user
    • Optional NaN check on all vector/matrix inputs before calling any LAPACK FORTRAN routine
    • Supports row-major and column-major formats
  – Low-Level Interface: expert users and libraries
    • Low overhead and user control are the two main priorities here
• Other features
  – Include and configuration files
  – INFO parameter dropped in the C interface to LAPACK; instead the routine returns an integer
  – Handles ANSI C99 (default), C structure option, CPP complex type, and custom types for complex types
• Need Fortran and LAPACK at linking time

Resources
Reference Code:
– Reference code (current version 3.4.2): http://www.netlib.org/lapack/lapack.tgz
– LAPACK build for Windows (current version 3.4.2): http://icl.cs.utk.edu/lapack-for-windows/lapack
– LAPACKE: Standard C language APIs for LAPACK (in collaboration with INTEL): http://www.netlib.org/lapack/#_standard_c_language_apis_for_lapack
– Remi’s wrappers (wrapper for Matlab users): http://icl.cs.utk.edu/~delmas/lapwrapmw.htm

Vendor Libraries: more or less the same as for the BLAS: MKL, ACML, VECLIB, ESSL, etc. (WARNING: some implementations are just a subset of LAPACK)

Documentation:
– LAPACK Users’ Guide: http://www.netlib.org/lapack/lug/
– LAPACK Working Notes (in particular LAWN 41): http://www.netlib.org/lapack/lawns/downloads/
– LAPACK release notes: http://www.netlib.org/lapack/lapack-3.4.2.html
– LAPACK NAG example and auxiliary routines: http://www.nag.com/lapack-ex/lapack-ex.html
– CRC Handbook of Linear Algebra, Leslie Hogben, ed., “Packages of Subroutines for Linear Algebra,” Bai, Demmel, Dongarra, Langou, and Wang, Section 75, pages 75-1 to 75-24, CRC Press, 2006. http://www.netlib.org/netlib/utk/people/JackDongarra/PAPERS/CRC-LAPACK-2005.pdf

Support:
– LAPACK forum (more than 1000 topics): http://icl.cs.utk.edu/lapack-forum/
– LAPACK mailing list: [email protected]
– LAPACK mailing-list archive: http://icl.cs.utk.edu/lapack-forum/archives/
Organizing Linear Algebra – in books
www.netlib.org/lapack        www.netlib.org/scalapack
www.netlib.org/templates     www.cs.utk.edu/~dongarra/etemplates

Parallelization of LU and QR
Parallelize the update (dgemm):
• Easy and done in any reasonable software.
• This is the 2/3·n³ term in the FLOPs count.
• Can be done efficiently with LAPACK + multithreaded BLAS

[Figure: one step of blocked LU – dgetf2 panel factorization lu(·), dtrsm (+ dswp) on the row panel, dgemm update of the trailing submatrix A(1) to A(2)]

Overview of Dense Numerical Linear Algebra Libraries
• BLAS: kernels for dense linear algebra
• LAPACK: sequential dense linear algebra
• ScaLAPACK: parallel distributed dense linear algebra

Scalable Linear Algebra PACKage

ScaLAPACK
• Library of software dealing with dense & banded routines
• Distributed memory – message passing
• MIMD computers and networks of workstations
• Clusters of SMPs
ScaLAPACK
• http://www.netlib.org/scalapack/
• ScaLAPACK (Scalable Linear Algebra Package) provides routines for
  – solving systems of simultaneous linear equations,
  – least-squares solutions of linear systems of equations,
  – eigenvalue problems,
  – and singular value problems.
• ScaLAPACK is in FORTRAN and C; it targets PARALLEL DISTRIBUTED memory machines; it is a REFERENCE implementation
• Relies on LAPACK / BLAS and BLACS / MPI
• Includes the PBLAS (Parallel BLAS)

Programming Style
• SPMD Fortran 77 with object-based design
• Built on various modules
  – PBLAS
  – BLACS: interprocessor communication
    • On PVM, MPI, IBM SP, CRI T3, Intel, TMC
    • Provides the right level of notation
  – BLAS
• LAPACK software expertise/quality
  – Software approach
  – Numerical methods
Overall Structure of Software
• Object-based – array descriptor
  – Contains the information required to establish the mapping between a global array entry and its corresponding process and memory location.
  – Provides a flexible framework to easily specify additional data distributions or matrix types.
  – Currently dense, banded, & out-of-core
• Uses the concept of context
PBLAS
• Similar to the BLAS in functionality and naming.
• Built on the BLAS and BLACS
• Provides a global view of the matrix

BLAS:   CALL DGEXXX ( M, N, A( IA, JA ), LDA, ... )
PBLAS:  CALL PDGEXXX( M, N, A, IA, JA, DESCA, ... )
ScaLAPACK Structure
[Figure: ScaLAPACK software stack – ScaLAPACK calls the PBLAS; the PBLAS are built on LAPACK and the BLACS (global layer vs. local layer); LAPACK sits on the BLAS, and the BLACS sit on PVM/MPI/...]
Choosing a Data Distribution
• Main issues are:
  – Load balancing
  – Use of the Level 3 BLAS
Possible Data Layouts
• 1D block and 1D cyclic column distributions
• 1D block-cyclic column and 2D block-cyclic distributions
• 2D block-cyclic used in ScaLAPACK for dense matrices

From LAPACK to ScaLAPACK

[LAPACK]    subroutine  dgesv( n, nrhs, a(ia,ja), lda, ipiv, b(ib,jb), ldb, info )
[ScaLAPACK] subroutine pdgesv( n, nrhs, a, ia, ja, desca, ipiv, b, ib, jb, descb, info )

[Figure: in LAPACK, A and B are single column-major arrays, overwritten on output by the L and U factors and the solution X; in ScaLAPACK, A, B, and the pivot vector are partitioned into blocks (A11…A33, B11…B31, ip1…ip3) distributed over the process grid, with the distributed factors and solution on output]
Distribution and Storage
• The matrix is block-partitioned & blocks are mapped to processes
• Distributed 2-D block-cyclic scheme

Example: a 5x5 matrix partitioned in 2x2 blocks, mapped onto a 2x2 process grid (matrix point of view on the left, process point of view on the right):

A11 A12 | A13 A14 | A15        A11 A12 A15 | A13 A14
A21 A22 | A23 A24 | A25        A21 A22 A25 | A23 A24
--------+---------+----        A51 A52 A55 | A53 A54
A31 A32 | A33 A34 | A35        ------------+--------
A41 A42 | A43 A44 | A45        A31 A32 A35 | A33 A34
A51 A52 | A53 A54 | A55        A41 A42 A45 | A43 A44

• Routines are available to distribute/redistribute data.
2D Block Cyclic Layout

The matrix is MxN, blocks are MBxNB, and the process grid is PxQ (here P=2, Q=3, with processes numbered p + P·q, i.e. 0…5).

Matrix point of view: each MBxNB block is labeled with the rank of the process that owns it; block row i goes to process row i mod P and block column j to process column j mod Q, so the owner labels repeat in the pattern

  0 2 4 0 2 4 …
  1 3 5 1 3 5 …

Processor point of view: gathering its blocks, each process stores a single dense local matrix.
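The owner of any matrix entry under this distribution can be computed in a few lines (a sketch of the mapping only; ScaLAPACK itself derives this from the array descriptor, and the `owner` helper name here is illustrative):

```c
#include <assert.h>

/* For a 2D block-cyclic layout with MB-by-NB blocks on a P-by-Q grid,
   return the rank (numbered p + P*q, matching the 0..5 labels in the
   figure) of the process owning global entry (i, j), 0-based. */
int owner(int i, int j, int mb, int nb, int P, int Q) {
    int p = (i / mb) % P;   /* process row: block row modulo P */
    int q = (j / nb) % Q;   /* process col: block col modulo Q */
    return p + P * q;
}
```

Because both block indices are taken modulo the grid dimensions, the trailing submatrix that remains active during a factorization still touches every process, which is the load-balancing property the layout is chosen for.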
[Figure: the same layout shown as a factorization proceeds – as leading block rows and columns are eliminated, the active trailing submatrix still spans all P·Q processes, preserving load balance]

Parallelism in ScaLAPACK
• Level 3 BLAS block operations
  – All the reduction routines
• Pipelining
  – QR algorithm, triangular solvers, classic factorizations
• Redundant computations
  – Condition estimators
• Static work assignment
  – Bisection
• Task parallelism
  – Sign function eigenvalue computations
• Divide and conquer
  – Tridiagonal and band solvers, symmetric eigenvalue problem and sign function
• Cyclic reduction
  – Reduced system in the band solver
• Data parallelism
  – Sign function
Functionalities in ScaLAPACK

Same set of problem types as LAPACK:
Linear system of equations (SV); linear least squares problems (LLS); linear equality-constrained least squares problem (LSE); general linear model problem (GLM); symmetric eigenproblems (SEP); nonsymmetric eigenproblems (NEP); singular value decomposition (SVD); generalized symmetric definite eigenproblems (GSEP); generalized nonsymmetric eigenproblems (GNEP); generalized (or quotient) singular value decomposition (GSVD/QSVD).

Major Changes to Software
• Must rethink the design of our software
  – Another disruptive technology
  – Similar to what happened with cluster computing and message passing
  – Rethink and rewrite the applications, algorithms, and software
• Numerical libraries, for example, will change
  – Both LAPACK and ScaLAPACK will undergo major changes to accommodate this
Moore’s Law Reinterpreted

[Figure: number of cores in the Top 20 systems, 2000-2013, growing from thousands toward ten million]

• Number of cores per chip doubles every 2 years, while clock speed decreases (not increases).
  – Need to deal with systems with millions of concurrent threads
  – Future generations will have billions of threads!
  – Need to be able to easily replace inter-chip parallelism with intra-chip parallelism
• Number of threads of execution doubles every 2 years

Key Challenges at Exascale
• Power management
  – API for fine-grain management
• Levels of parallelism
  – O(100M and beyond)
• Language constraints
  – Fortran, C & MPI, OpenMP
• Hybrid architectures
  – Node composed of multiple multicore sockets + accelerators
• Autotuning
  – Systems complex and changing
• Bandwidth vs arithmetic rate
  – Most approaches assume flops are expensive
• Bulk Sync Processing
  – Break fork-join parallelism
• Storage capacity
  – Issue of weak scalability in future systems
• Lack of reproducibility; unnecessarily expensive (most of the time)
  – Can’t guarantee bitwise results
• Fault occurrence; shared responsibility
  – Process failure recovery
• Need for effective scheduling of tasks
A New Generation of DLA Software
Software/Algorithms follow hardware evolution in time:

LINPACK (70’s)    Rely on Level-1 BLAS operations (vector operations)
LAPACK (80’s)     Rely on Level-3 BLAS operations (blocking, cache friendly)
ScaLAPACK (90’s)  Rely on PBLAS and message passing (distributed memory)
PLASMA            New algorithms (many-core friendly); rely on a DAG/scheduler, block data layout, and some extra kernels
MAGMA             Hybrid algorithms (heterogeneity friendly); rely on a hybrid scheduler and hybrid kernels

Parallelization of LU and QR
Parallelize the update (dgemm):
• Easy and done in any reasonable software.
• This is the 2/3·n³ term in the FLOPs count.
• Can be done efficiently with LAPACK + multithreaded BLAS
[Figure: one step of blocked LU – dgetf2 panel factorization lu(·), dtrsm (+ dswp) on the row panel, dgemm update of A(1) to A(2); executed as fork-join, bulk synchronous processing]

Synchronization (in LAPACK LU)
• Fork-join, bulk synchronous processing: cores sit idle at every join
• Allowing for delayed update and out-of-order, asynchronous, dataflow execution removes these synchronizations

PLASMA LU Factorization Dataflow Driven
[Figure: LU task DAG – each xGETF2 panel task feeds xTRSM tasks, which feed the xGEMM updates of the trailing submatrix; tasks run as soon as their inputs are ready]

Data Layout is Critical
• Tile data layout where each data tile is contiguous in memory
• Decomposed into several fine-grained tasks, which better fit the small core caches

PLASMA LU: Tile Algorithm and Nested Parallelism
• Operates on one, two, or three matrix tiles at a time using a single core
  – This is called a kernel; executed independently of other kernels
  – Mostly Level 3 BLAS are used
• Data flows between kernels as prescribed by the programmer
• Coordination is done transparently via the runtime scheduler (QUARK)
  – Parallelism level adjusted at runtime
  – Look-ahead adjusted at runtime
• Uses single-threaded BLAS with all the optimization benefits
• The panel is done on multiple cores
  – Recursive formulation of LU for better BLAS use
  – Level 1 BLAS are faster because they work on the combined cache size

QUARK Shared Memory Superscalar Scheduling
Tile Cholesky, definition (pseudocode):

FOR k = 0..TILES-1
  A[k][k] ← DPOTRF(A[k][k])
  FOR m = k+1..TILES-1
    A[m][k] ← DTRSM(A[k][k], A[m][k])
  FOR m = k+1..TILES-1
    A[m][m] ← DSYRK(A[m][k], A[m][m])
    FOR n = k+1..m-1
      A[m][n] ← DGEMM(A[m][k], A[n][k], A[m][n])
Implementation (actual QUARK code in PLASMA):

for (k = 0; k < A.mt; k++) {
    QUARK_CORE_dpotrf(...);
    for (m = k+1; m < A.mt; m++)
        QUARK_CORE_dtrsm(...);
    for (m = k+1; m < A.mt; m++) {
        QUARK_CORE_dsyrk(...);
        for (n = k+1; n < m; n++)
            QUARK_CORE_dgemm(...);
    }
}

Parallel Linear Algebra s/w for Multicore/Hybrid Architectures (PLASMA)
• Objectives
  – High utilization of each core
  – Scaling to large numbers of cores
  – Synchronization-reducing algorithms
• Methodology
  – Dynamic DAG scheduling (QUARK)
  – Explicit parallelism
  – Implicit communication
  – Fine granularity / block data layout
• Arbitrary DAG with dynamic scheduling
[Figure: execution traces over time – fork-join parallelism leaves cores idle at every synchronization point, while DAG-scheduled parallelism keeps all cores busy]

PLASMA Local Scheduling
Dynamic Scheduling: Sliding Window
• DAGs get very big, very fast
• So windows of active tasks are used; this means there is no global critical path
• Matrix of NBxNB tiles; NB³ operations
• NB=100 gives 1 million tasks

Example: QR Factorization
FOR k = 0 .. SIZE-1
  A[k][k], T[k][k] <- GEQRT( A[k][k] )
  FOR m = k+1 .. SIZE-1
    A[k][k]|Up, A[m][k], T[m][k] <- TSQRT( A[k][k]|Up, A[m][k], T[m][k] )
  FOR n = k+1 .. SIZE-1
    A[k][n] <- UNMQR( A[k][k]|Low, T[k][k], A[k][n] )
    FOR m = k+1 .. SIZE-1
      A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )

Kernels: GEQRT, TSQRT, UNMQR, TSMQR

Input Format – QUARK (PLASMA)
• Sequential C code annotated through QUARK-specific syntax
  – Insert_Task
  – INOUT, OUTPUT, INPUT
  – REGION_L, REGION_U, REGION_D, …
  – LOCALITY
• Executes through the QUARK runtime to run on multicore SMPs

for (k = 0; k < A.mt; k++) {
    Insert_Task( zgeqrt, A[k][k], INOUT,
                         T[k][k], OUTPUT);
    for (m = k+1; m < A.mt; m++) {
        Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D | REGION_U,
                             A[m][k], INOUT | LOCALITY,
                             T[m][k], OUTPUT);
    }
    for (n = k+1; n < A.nt; n++) {
        Insert_Task( zunmqr, A[k][k], INPUT | REGION_L,
                             T[k][k], INPUT,
                             A[k][n], INOUT);
        for (m = k+1; m < A.mt; m++) {
            Insert_Task( ztsmqr, A[k][n], INOUT,
                                 A[m][n], INOUT | LOCALITY,
                                 A[m][k], INPUT,
                                 T[m][k], INPUT);
        }
    }
}

Algorithms: PLASMA_[scdz]potrf[_Tile][_Async]() – Cholesky
• Algorithm
  – equivalent to LAPACK
• Numerics
  – same as LAPACK
• Performance
  – comparable to the vendor library on few cores
  – much better than the vendor library on many cores

Algorithms: PLASMA_[scdz]getrf[_Tile][_Async]() – LU
• Algorithm
  – equivalent to LAPACK
  – same pivot vector
  – same L and U factors
  – same forward substitution procedure
• Numerics
  – same as LAPACK
• Performance
  – comparable to the vendor library on few cores
  – much better than the vendor library on many cores

Algorithms: PLASMA_[scdz]geqrt[_Tile][_Async]() – incremental QR factorization
• Algorithm
  – the same R factor as LAPACK (absolute values)
  – different set of Householder reflectors
  – different Q matrix
  – different Q generation / application procedure
• Numerics
  – same as LAPACK
• Performance
  – comparable to the vendor library on few cores
  – much better than the vendor library on many cores
Algorithms: PLASMA_[scdz]syev[_Tile][_Async]() – three-stage symmetric EVP
• Algorithm
  – two-stage tridiagonal reduction + QR algorithm
  – fast eigenvalues, slower eigenvectors (possibility to calculate a subset)
• Numerics
  – same as LAPACK
• Performance
  – comparable to MKL for very small problems
  – absolutely superior for larger problems

Algorithms: PLASMA_[scdz]gesvd[_Tile][_Async]() – three-stage SVD
• Algorithm
  – two-stage bidiagonal reduction + QR iteration
  – fast singular values, slower singular vectors (possibility of calculating a subset)
• Numerics
  – same as LAPACK
• Performance
  – comparable with MKL for very small problems
  – absolutely superior for larger problems
Pipelining: Cholesky Inversion
3 steps: factor (POTRF), invert L (TRTRI), multiply the L’s (LAUUM)

[Figure: execution trace on 48 cores; the matrix is 4000x4000, tile size 200x200. POTRF+TRTRI+LAUUM pipelined: 18 time units (3t+6), versus 25 (7t-3) unpipelined; the Cholesky factorization alone takes 3t-2]

Algorithms: PLASMA_[scdz]geqrt[_Tile][_Async]() – incremental QR
PLASMA_Set( PLASMA_HOUSEHOLDER_MODE, PLASMA_TREE_HOUSEHOLDER );
• Algorithm
  – the same R factor as LAPACK (absolute values)
  – different set of Householder reflectors
  – different Q matrix
  – different Q generation / application procedure
• Numerics
  – same as LAPACK
• Performance
  – absolutely superior for tall matrices
Communication Avoiding QR Example

[Figure: CAQR performance on a quad-socket, quad-core Intel Xeon EMT64 E7340 at 2.39 GHz; theoretical peak is 153.2 Gflop/s with 16 cores; matrix size 51200 by 3200]

Avoiding Synchronization
• “Responsibly Reckless” Algorithms
  – Try a fast (possibly unstable) algorithm that might fail (but rarely)
  – Check for failure; recompute, if needed, with a stable algorithm

Avoid Pivoting; Minimize Synchronization
• To solve Ax = b:
  – Compute Ar = Uᵀ·A·V, with U and V random matrices
  – Factorize Ar without pivoting (GENP)
  – Solve Ar·y = Uᵀ·b and then x = V·y
• U and V are recursive butterfly matrices
  – Randomization is cheap (O(n²) operations)
  – GENP is fast (“Cholesky” speed; takes advantage of the GPU)
  – Accuracy is in practice similar to GEPP (with iterative refinement), but think of this as a preconditioner step.
• Goal: transform A into a matrix that is sufficiently “random” so that, with probability close to 1, pivoting is not needed.
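The randomization Ar = Uᵀ·A·V can be sketched with a single depth-1 butterfly (a sketch only: real recursive butterflies apply this blockwise at several depths with random diagonal entries, and the function names and fixed values below are illustrative):

```c
#include <assert.h>
#include <math.h>

/* A depth-1 butterfly is B = (1/sqrt(2)) * [R0 R1; R0 -R1] with R0, R1
   diagonal; r holds diag(R0) in r[0..h-1] and diag(R1) in r[h..n-1]. */

/* C = B^T * A for an n-by-n row-major A (n even). */
void butterfly_left(int n, const double *r, const double *A, double *C) {
    int h = n / 2;
    double s = 1.0 / sqrt(2.0);
    for (int i = 0; i < h; i++)
        for (int j = 0; j < n; j++) {
            C[i*n + j]     = s * r[i]     * (A[i*n + j] + A[(i+h)*n + j]);
            C[(i+h)*n + j] = s * r[i + h] * (A[i*n + j] - A[(i+h)*n + j]);
        }
}

/* C = A * B for an n-by-n row-major A (n even). */
void butterfly_right(int n, const double *r, const double *A, double *C) {
    int h = n / 2;
    double s = 1.0 / sqrt(2.0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < h; j++) {
            C[i*n + j]     = s * r[j]     * (A[i*n + j] + A[i*n + j + h]);
            C[i*n + j + h] = s * r[j + h] * (A[i*n + j] - A[i*n + j + h]);
        }
}
```

Both transforms cost O(n²), matching the bullet above: cheap relative to the O(n³) factorization, yet enough to make a zero leading pivot vanishingly unlikely.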
PLASMA RBT execution trace

[Figure: execution trace with n=2000, nb=250 on a 12-core AMD Opteron]