The Impact of RISC on Software

Jack Dongarra
Computer Science Department
University of Tennessee
and
Mathematical Sciences Section
Oak Ridge National Laboratory

http://www.netlib.org/utk/people/JackDongarra.html


Overview

- Why
- Motivation - History
  - Level 1 BLAS
  - Linpack
  - Unrolling to allow the compiler to recognize
  - Tricks like the BABE algorithm
- Supercomputers
  - Vector
  - Level 2, 3 BLAS
  - Compilers
  - Memory Hierarchy
- RISC
  - Linpack on RISC
- MP and Parallel RISC Systems
  - Scalapack
  - Bandwidth / Latency
  - Iterative Methods Bombardment
- Future
Mathematical Software for Linear Algebra in the 1970's

- 1970's software packages for eigenvalue problems and linear equations.
- EISPACK - Fortran routines
  - Eigenvalue problem
  - Wilkinson-Reinsch Handbook for Automatic Computation
- LINPACK - Fortran routines
  - Systems of linear equations.
- Little thought given to high performance computers.
- CDC 7600 state-of-the-art supercomputer.


Mathematical Software for Linear Algebra in the 1970's

- EISPACK (early 70's)
  - translation of Algol algorithms from Num. Math.
- Software issues:
  - Portability
  - Robustness
  - Accuracy
  - Uniform
  - Well documented
  - Collective ideas.
Mathematical Software for Linear Algebra in the 1970's

Level 1 BLAS

- Basic Linear Algebra Subprograms for Fortran Usage
- C. Lawson, R. Hanson, D. Kincaid, and F. Krogh
- Conceptual aid in design and coding
- Aid to readability and documentation
- Promote efficiency:
  - through optimization or assembly language versions
- Improve robustness and reliability
- Improve portability through standardization


Mathematical Software for Linear Algebra in the 1970's

Level 1 BLAS

- Dot products
- 'axpy' operations y <- alpha*x + y
- Multiply a vector by a constant
- Set up and apply Givens rotations
- Copy and swap vectors
- Vector norms

   name             dim   scalars   vector     vector     scalars
   SUBROUTINE AXPY( N,    ALPHA,    X, INCX,   Y, INCY )
   FUNCTION   DOT ( N,              X, INCX,   Y, INCY )
   SUBROUTINE SCAL( N,    ALPHA,    X, INCX )
   SUBROUTINE ROTG(       A, B,                           C, S )
   SUBROUTINE ROT ( N,              X, INCX,   Y, INCY,   C, S )
   SUBROUTINE COPY( N,              X, INCX,   Y, INCY )
   SUBROUTINE SWAP( N,              X, INCX,   Y, INCY )
   FUNCTION   NRM2( N,              X, INCX )
   FUNCTION   ASUM( N,              X, INCX )
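
As a small usage illustration (not from the original slides), here is how
the single-precision axpy routine is typically called; the program name,
vector length and data values below are made up for the example.

      PROGRAM AXPYEX
*     Illustrative only: y <- y + alpha*x via the Level 1 BLAS
*     routine SAXPY, with unit strides.
      INTEGER N, I
      PARAMETER ( N = 100 )
      REAL X( N ), Y( N ), ALPHA
      ALPHA = 2.0E0
      DO 10 I = 1, N
         X( I ) = 1.0E0
         Y( I ) = REAL( I )
   10 CONTINUE
      CALL SAXPY( N, ALPHA, X, 1, Y, 1 )
      WRITE(*,*) 'Y(1) =', Y( 1 )
      END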

Mathematical Software for Linear Algebra in the 1970's

- LINPACK (late 70's)
- Used current ideas - not a translation.
- First vector supercomputer arrived - CRAY-1.
- BLAS to standardize basic vector operations.
- LINPACK embraced BLAS for modularity and efficiency.
- Reworked algorithms.
- Column oriented.


Here is the body of the code of the LINPACK routine SPOFA, which
implements the above method:

      DO 30 J = 1, N
         INFO = J
         S = 0.0E0
         JM1 = J - 1
         IF (JM1 .LT. 1) GO TO 20
         DO 10 K = 1, JM1
            T = A(K,J) - SDOT(K-1,A(1,K),1,A(1,J),1)
            T = T/A(K,K)
            A(K,J) = T
            S = S + T*T
   10    CONTINUE
   20    CONTINUE
         S = A(J,J) - S
C     ......EXIT
         IF (S .LE. 0.0E0) GO TO 40
         A(J,J) = SQRT(S)
   30 CONTINUE
BLAS - BASICS

Level 1 BLAS - Vector Operations (Late '70s)

   y <- y + alpha*x,   y <- alpha*x,   y <- x,
   alpha <- x^T y,   ||x||_2,   y <-> x.

Example: SAXPY operates on vectors:

   y <- y + alpha*x

- Vector instructions:
  - 2 vector loads
  - 2 vector operations
  - 1 vector store
- Operation to data reference:
  - 2n operations
  - 3n data transfers
- O(n) memory references,
- O(n) floating point operations.

Too much memory traffic: little chance to use the memory hierarchy.


Loop Unrolling

In the scalar case

      do 10 i = 1, n, 4
         y(i)   = y(i)   + alfa*x(i)
         y(i+1) = y(i+1) + alfa*x(i+1)
         y(i+2) = y(i+2) + alfa*x(i+2)
         y(i+3) = y(i+3) + alfa*x(i+3)
   10 continue

instead of   y(1:n) = y(1:n) + alfa*x(1:n)

- WHY?
  - Reduce the loop overhead
  - Allow pipelined functional units to operate
  - Allow independent functional units to operate
- Machines that might benefit?
  - Thought it would work well on all scalar machines, but... vector
    machines
Other Tricks for Performance...

Algorithm for factoring a tridiagonal matrix

[Diagrams: a tridiagonal matrix and the elimination pattern when the
factorization sweeps from the top of the matrix only.]

- Algorithm is sequential in nature.
- Not much scope for pipelining.
- In an attempt to increase the speed of execution the BABE algorithm
  was adopted.


Burn At Both Ends (BABE)

With the BABE algorithm the factorization is started at the top of the
matrix as well as at the bottom of the matrix at the same time.

[Diagrams: the elimination pattern when the factorization sweeps from
both ends of the tridiagonal matrix toward the middle.]

- Reduce loop overhead by a factor of 2
- Scope for independent operations
- Resulted in 25% faster execution
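
The following is a minimal sketch of the burn-at-both-ends idea (my own
illustration, not the original LINPACK or vendor code); the routine name
and argument order are made up, and it assumes no pivoting is required
(e.g. a diagonally dominant matrix). The two elimination sweeps are
independent, which is what gives the scope for overlapped execution.

      SUBROUTINE BABETS( N, B, D, C, Y, X )
*     Illustrative BABE sketch: solve a tridiagonal system with
*     sub-diagonal B(2:N), diagonal D(1:N), super-diagonal C(1:N-1)
*     and right-hand side Y(1:N); the solution is returned in X(1:N).
*     D and Y are overwritten.  No pivoting.  Requires N >= 2.
      INTEGER N, I, M
      REAL    B(*), D(*), C(*), Y(*), X(*), T
      M = N/2
*     Eliminate from the top down to row M ...
      DO 10 I = 2, M
         T    = B(I)/D(I-1)
         D(I) = D(I) - T*C(I-1)
         Y(I) = Y(I) - T*Y(I-1)
   10 CONTINUE
*     ... and, independently, from the bottom up to row M+1.
      DO 20 I = N-1, M+1, -1
         T    = C(I)/D(I+1)
         D(I) = D(I) - T*B(I+1)
         Y(I) = Y(I) - T*Y(I+1)
   20 CONTINUE
*     The two sweeps meet in a 2-by-2 system for rows M and M+1.
      T      = D(M)*D(M+1) - C(M)*B(M+1)
      X(M)   = ( D(M+1)*Y(M) - C(M)*Y(M+1) )/T
      X(M+1) = ( D(M)*Y(M+1) - B(M+1)*Y(M) )/T
*     Back-substitute outward in both directions.
      DO 30 I = M-1, 1, -1
         X(I) = ( Y(I) - C(I)*X(I+1) )/D(I)
   30 CONTINUE
      DO 40 I = M+2, N
         X(I) = ( Y(I) - B(I)*X(I-1) )/D(I)
   40 CONTINUE
      RETURN
      END

Compared with a single top-to-bottom sweep, the two halves can proceed
concurrently and the loop overhead per element is halved, which is the
effect the slide quantifies.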

Level 1, 2, and 3 BLAS

- Level 1 BLAS - vector-vector operations
     y <- y + alpha*x,  also dot product, ...
- Level 2 BLAS - matrix-vector operations
     y <- y + A*x,  also triangular solve, rank-1 update
- Level 3 BLAS - matrix-matrix operations
     A <- A + B*C,  also block triangular solve, rank-k update


Why Higher Level BLAS?

IBM SP2 Memory Hierarchy (smallest and fastest at the top, largest and
slowest at the bottom):

   256 B   Register           2128 MB/s
   256 KB  Data cache         2128 MB/s
   128 MB  Local memory       2128 MB/s
    24 GB  Remote memory        32 MB/s
   485 GB  Secondary memory     10 MB/s

- Can only do arithmetic on data at the top
  -> keep active data as close to the top of the hierarchy as possible.
  Higher level BLAS lets us do this:

                                    mem refs   ops     ops/mem ref
   Level 1 BLAS  y <- y + alpha*x      3n       2n         2/3
   Level 2 BLAS  y <- y + A*x          n^2      2n^2        2
   Level 3 BLAS  A <- A + B*C          4n^2     2n^3       n/2

- On parallel machines, higher level BLAS -> increased granularity
  -> lower synchronization cost.
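
As a hedged illustration of why the Level 3 ratio matters in practice
(this example is mine, not from the slides): a single GEMM call carries
out the entire C <- C + A*B update, so a tuned BLAS can block the
computation for the cache internally.  The program name, matrix size and
fill values below are made up; DGEMM is the standard double-precision
Level 3 BLAS routine.

      PROGRAM L3DEMO
      INTEGER N, I, J
      PARAMETER ( N = 200 )
      DOUBLE PRECISION A( N, N ), B( N, N ), C( N, N )
*     Fill A, B, C with something simple.
      DO 20 J = 1, N
         DO 10 I = 1, N
            A( I, J ) = 1.0D0
            B( I, J ) = 1.0D0
            C( I, J ) = 0.0D0
   10    CONTINUE
   20 CONTINUE
*     One Level 3 BLAS call does the whole C <- C + A*B update:
*     2*N**3 flops touching only about 4*N**2 matrix elements.
      CALL DGEMM( 'No transpose', 'No transpose', N, N, N,
     $            1.0D0, A, N, B, N, 1.0D0, C, N )
      WRITE(*,*) 'C(1,1) =', C( 1, 1 )
      END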

Matrix-vector product

DOT version - 25 Mflops in cache

- Model 530, 50 Mflop/s peak

      DO 20 I = 1, M
         DO 10 J = 1, N
            Y(I) = Y(I) + A(I,J)*X(J)
   10    CONTINUE
   20 CONTINUE

   From Cache   22.7 Mflops
   Memory       12.4 Mflops


Loop unrolling

      DO 20 I = 1, M, 2
         T1 = Y(I)
         T2 = Y(I+1)
         DO 10 J = 1, N
            T1 = T1 + A(I,J)*X(J)
            T2 = T2 + A(I+1,J)*X(J)
   10    CONTINUE
         Y(I)   = T1
         Y(I+1) = T2
   20 CONTINUE

3 loads, 4 ops

Speed of y <- y + A^T x, N = 48
Model 530, 50 Mflop/s peak

   Depth        1      2      3      4     inf
   Speed       25     33.3   37.5   40     50
   Measured    22.7   30.5   34.3   36.5    -
   Memory      12.4   12.7   12.7   12.6    -
Matrix-matrix multiply

DOT version - 25 Mflops in cache

      DO 30 J = 1, M
         DO 20 I = 1, M
            DO 10 K = 1, L
               C(I,J) = C(I,J) + A(I,K)*B(K,J)
   10       CONTINUE
   20    CONTINUE
   30 CONTINUE


How to get 50 Mflops!

      DO 30 J = 1, M, 2
         DO 20 I = 1, M, 2
            T11 = C(I,  J  )
            T12 = C(I,  J+1)
            T21 = C(I+1,J  )
            T22 = C(I+1,J+1)
            DO 10 K = 1, L
               T11 = T11 + A(I,  K)*B(K,J  )
               T12 = T12 + A(I,  K)*B(K,J+1)
               T21 = T21 + A(I+1,K)*B(K,J  )
               T22 = T22 + A(I+1,K)*B(K,J+1)
   10       CONTINUE
            C(I,  J  ) = T11
            C(I,  J+1) = T12
            C(I+1,J  ) = T21
            C(I+1,J+1) = T22
   20    CONTINUE
   30 CONTINUE

Inner loop: 4 loads, 8 operations, optimal.
In practice we have measured 48.1 Mflops (M = N = 16, L = 128).
BLAS - BASICS

[Figure: Level 1, 2 and 3 BLAS performance on an RS/6000-550 - speed in
megaflops (0-80) versus order of vectors/matrices (0-600); Level 3 is
fastest, then Level 2, then Level 1.]

- Development of blocked algorithms using Level 3 BLAS,
- LAPACK, high performance, portability.


BLAS - REFERENCES

- BLAS software and documentation can be obtained via:
  - WWW: http://www.netlib.org/blas,
  - anonymous ftp netlib2.cs.utk.edu: cd blas; get index
  - email [email protected] with the message: send index from blas
- Comments and questions can be addressed at ...@cs.utk.edu

- C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, Basic Linear Algebra
  Subprograms for Fortran Usage, ACM Transactions on Mathematical
  Software, 5:308-325, 1979.
- J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson, An Extended
  Set of Fortran Basic Linear Algebra Subprograms, ACM Transactions on
  Mathematical Software, 14(1):1-32, 1988.
- J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling, A Set of Level 3
  Basic Linear Algebra Subprograms, ACM Transactions on Mathematical
  Software, 16(1):1-17, 1990.
Here is the body of the code of the LINPACK routine SPOFA, which
implements the above method:

      DO 30 J = 1, N
         INFO = J
         S = 0.0E0
         JM1 = J - 1
         IF (JM1 .LT. 1) GO TO 20
         DO 10 K = 1, JM1
            T = A(K,J) - SDOT(K-1,A(1,K),1,A(1,J),1)
            T = T/A(K,K)
            A(K,J) = T
            S = S + T*T
   10    CONTINUE
   20    CONTINUE
         S = A(J,J) - S
C     ......EXIT
         IF (S .LE. 0.0E0) GO TO 40
         A(J,J) = SQRT(S)
   30 CONTINUE


Block Algorithms and their Derivation

The Cholesky factorization A = U^T U in partitioned form:

   ( A11  a_j   A13 )   ( U11^T   0     0    ) ( U11  u_j   U13 )
   (  .   a_jj   .  ) = ( u_j^T  u_jj   0    ) (  0   u_jj   .  )
   (  .    .    A33 )   ( U13^T   .    U33^T ) (  0    0    U33 )

and equating coefficients of the j-th column, we obtain:

   a_j  = U11^T u_j
   a_jj = u_j^T u_j + u_jj^2.

Hence, if U11 has already been computed, we can compute u_j and u_jj
from the equations:

   U11^T u_j = a_j
   u_jj^2    = a_jj - u_j^T u_j.
      DO 10 J = 1, N
         CALL STRSV( 'Upper', 'Transpose', 'Non-unit', J-1, A, LDA,
     $               A(1,J), 1 )
         S = A(J,J) - SDOT( J-1, A(1,J), 1, A(1,J), 1 )
         IF( S.LE.ZERO ) GO TO 20
         A(J,J) = SQRT( S )
   10 CONTINUE

- This change by itself is sufficient to make big gains in performance
  on a number of machines:
  - from 24 to 34 megaflops for a matrix of order 500 on an IBM
    RS/6000 model 550.
  - from 72 to 251 megaflops for a matrix of order 500 on one
    processor of a CRAY Y-MP.


To derive a block form of Cholesky factorization, we write the defining
equation in partitioned form thus:

   ( A11  A12  A13 )   ( U11^T   0      0     ) ( U11  U12  U13 )
   (  .   A22  A23 ) = ( U12^T  U22^T   0     ) (  0   U22  U23 )
   (  .    .   A33 )   ( U13^T  U23^T  U33^T  ) (  0    0   U33 )

Equating submatrices in the second block of columns, we obtain:

   A12 = U11^T U12
   A22 = U12^T U12 + U22^T U22.

Hence, if U11 has already been computed, we can compute U12 as the
solution to the equation

   U11^T U12 = A12

by a call to the Level 3 BLAS routine STRSM; and then we can compute
U22 from

   U22^T U22 = A22 - U12^T U12.
Speed in megaflops of Cholesky factorization A = U^T U for n = 500
Machine: IBM RS/6000-550

   j-variant: LINPACK              24
   j-variant: using Level 2 BLAS   34
   j-variant: using Level 3 BLAS   69

      DO 10 J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL STRSM( 'Left', 'Upper', 'Transpose', 'Non-unit', J-1, JB,
     $               ONE, A, LDA, A(1,J), LDA )
         CALL SSYRK( 'Upper', 'Transpose', JB, J-1, -ONE, A(1,J), LDA,
     $               ONE, A(J,J), LDA )
         CALL SPOTF2( 'Upper', JB, A(J,J), LDA, INFO )
         IF( INFO.NE.0 ) GO TO 20
   10 CONTINUE


Blocked Algorithms have been developed for:

- LU factorization
- Cholesky factorization
- Factorization of symmetric indefinite matrices
- Matrix inversion
- Banded LU and Cholesky factorization
- QR factorization
- Form Q or Q^T B
- Orthogonal reduction to:
  - Hessenberg form
  - symmetric tridiagonal form
  - bidiagonal form
- Block QR iteration for nonsymmetric eigenvalue problems
LU variants on the IBM RS/6000 550

[Figure: MFLOPS (10-80) versus problem size (200-800).  Dashed line =
matrix multiply; dashed-dot line = blocked LU; solid line = matrix-
vector; dotted line = point LU; lower solid line = Linpack LU.]


Performance Numbers Comparing Various RISC Processors

Using the Linpack benchmark.  All based on actual runs.  Linpack n=100
is based on Fortran code only.  Linpack Ax=b n=1000 is based on a
blocked algorithm (LAPACK routine) using the Level 3 BLAS as provided
by the vendor.

                            Cycle   Linpack          Ax=b            Theor.
                            time    n=100     %     n=1000     %      Peak
   Machine           MHz    nsec    Mflops   peak   Mflops    peak   Mflops
   DEC Alpha         300     3.3     140      23     411       69      600
   IBM Power2         66    15       130      48     236       89      266
   SGI Power          75    13.3     101      33     260       87      300
   HP 735            100    10        61      31     107       54      198
   DEC Alpha         200     5        43      22     155       78      200
   DEC Alpha         182     5        39      21     141       77      182
   IBM RS-580         66    15        38      30     104       83      125
   DEC Alpha         160     6        36      23     114       71      160
   DEC Alpha         150     7        30      20     107       71      150
   IBM RS-550         42    24        26      31      70       83       84
   HP 730/750         66    15        24      36      47       71       66
   SGI Indy2/Crim    100    10        15      30      32       64       50
   IBM RS-530         25    40        15      30      42       84       50
   KSR 1 proc         40    25        15      38      31       78       40
   Intel i860         40    25        10      25      34       85       40
   CRAY T90          454     2.2     522      29    1576       88     1800
   CRAY C90          238     4.2     387      41     902       95      952
   CRAY J90          100    10       115      51     193       96      200
   CRAY Y-MP         166     6       161      48     324       97      333
   CRAY X-MP         118     8.5     121      51     218       93      235
   CRAY 3            481     2.1     241      25       -        -      962
   CRAY 2            244     4.1     120      25     384       79      488
   CRAY 1             80    12.5      27      17     110       69      160
Table 1: Multiprocessor Latency and Bandwidth.

                                              Latency   Bandwidth             Theoretical
                                               n = 0     n = 10^6    n_1/2     Bandwidth
   Machine                  OS                  (us)      (MB/s)    (bytes)      (MB/s)
   Convex SPP1200 PVM       SPP-UX 3.0.4.1       63         15        1000         250
   Convex SPP1200 sm m-n    SPP-UX 3.0.4.1       11         71        1000         250
   Cray T3D sm              MAX 1.2.0.2           3        128         363         300
   Cray T3D PVM             MAX 1.2.0.2          21         27        1502         300
   Intel Paragon            OSF 1.0.4            29        154        7236         175
   Intel Paragon            SUNMOS 1.6.2         25        171        5856         175
   Intel Delta              NX 3.3.10            77          8         900          22
   Intel iPSC/860           NX 3.3.2             65          3         340           3
   Intel iPSC/2             NX 3.3.2            370          2.8      1742           3
   IBM SP-1                 MPL                 270          7        1904          40
   IBM SP-2                 MPI                  35         35        3263          40
   KSR-1                    OSF R1.2.2           73          8         635          32
   Meiko CS2 sm             Solaris 2.3          11         40         285          50
   Meiko CS2                Solaris 2.3          83         43        3559          50
   nCUBE 2                  Vertex 2.0          154          1.7       333           2.5
   nCUBE 1                  Vertex 2.3          384          0.4       148           1
   NEC Cenju-3              Env. Rel 1.5d        40         13         900          40
   NEC Cenju-3 sm           Env. Rel 1.5d        34         25         400          40
   SGI                      IRIX 6.1             10         64         799        1200
   TMC CM-5                 CMMD 2.0             95          9         962          10
   Ethernet                 TCP/IP              500          0.9         -           1.2
   FDDI                     TCP/IP              900          9.7         -          12
   ATM-100                  TCP/IP              900          3.5         -          12


Table 2: Computation Performance.

                                             Clock   Cycle   Linpack 100      Linpack 1000       Latency
   Machine                  OS                MHz    nsec    Mflops  ops/cl   Mflops  ops/cl     us      cl
   Convex SPP1200 PVM       SPP-UX 3.0.4.1    100     8.33     65     .54      123     1.02       63    7560
   Convex SPP1200 sm m-n                                                                           11    1260
   Cray T3D sm              MAX 1.2.0.2       150     6.67     38     .25       94      .62        3     450
   Cray T3D PVM                                                                                    21    3150
   Intel Paragon            OSF 1.0.4          50    20        10     .20       34      .68       29    1450
   Intel Paragon            SUNMOS 1.6.2                                                           25    1250
   Intel Delta              NX 3.3.10          40    25         9.8   .25       34      .85       77    3080
   Intel iPSC/860           NX 3.3.2           40    25         9.8   .25       34      .85       65    2600
   Intel iPSC/2             NX 3.3.2           16    63          .37  .01        -       -       370    5920
   IBM SP-1                 MPL                62.5  16        38     .61      104     1.66      270   16875
   IBM SP-2                 MPI                66    15.15    130    1.97      236     3.58       35    2310
   KSR-1                    OSF R1.2.2         40    25        15     .38       31      .78       73    2920
   Meiko CS2 MPI            Solaris 2.3        90    11.11     24     .27       97     1.08       83    7470
   Meiko CS2 sm                                                                                    11     990
   nCUBE 2                  Vertex 2.0         20    50          .78  .04        2      .10      154    3080
   nCUBE 1                  Vertex 2.3          8   125          .10  .01        -       -       384    3072
   NEC Cenju-3              Env Rev 1.5d       75    13.3      23     .31       39      .52       40    3000
   NEC Cenju-3 sm           Env Rev 1.5d       75    13.3      23     .31       39      .52       34    2550
   SGI Power Challenge      IRIX 6.1           90    11.11    126    1.4       308     3.42       10     900
   TMC CM-5                 CMMD 2.0           32    31.25      -     -          -       -        95    3040
Message-passing Space

[Figure: scatter plot of bandwidth (MB/s, 1-100) versus latency (us,
1-1000) with points (latency, bandwidth): Paragon (25,171), T3D (3,128),
SPP1200 (3,90), SGI (10,64), CS2 (87,43), CS2 (11,40), NEC (33,25),
SP2 (38,34), T3D (21,27), SPP1200 (60,16), NEC Cenju-3 (40,13),
CM5 (95,9), Delta (70,8), iPSC/860 (70,3), iPSC/2 (370,3),
Ncube2 (154,2), Ether (500,1).]


Challenges in developing distributed memory libraries

- How to integrate libraries?
  - no standard software
  - many parallel languages
  - many flavors of message passing
  - various parallel programming models
  - assumptions made about parallel environment
    - granularity, topology
    - overlapping of communication / computation
    - development tools
- Where is the data?
  - who owns it?
  - optimal data distribution not a function of individual routines
    but of overall integration of components
- Who determines data layout and/or transformations?
  - determined by user?
  - determined by library developer?
  - choose from a small set?
  - allow for dynamic data distributions?
  - load balancing?
PBLAS - INTRODUCTION

Parallel Basic Linear Algebra Subprograms for distributed memory MIMD
computers.

- Similar functionality as the BLAS: distributed vector-vector,
  matrix-vector and matrix-matrix operations,
- Simplification of the parallelization of dense linear algebra codes:
  especially when implemented on top of the BLAS,
- Clarity: code is shorter and easier to read,
- Modularity: gives programmer larger building blocks,
- Program portability: machine dependencies are confined to the PBLAS
  (BLAS and BLACS).


PBLAS - STORAGE CONVENTIONS

- An M-by-N matrix is block partitioned and these MB-by-NB blocks are
  distributed according to the 2-dimensional block-cyclic scheme
  - load balanced computations, scalability,
  - re-use of the BLAS.
- Locally, the scattered columns are stored contiguously (FORTRAN
  "column-major"), with leading dimension LLD.

[Figure: a 5 x 5 matrix partitioned in 2 x 2 blocks, seen from the
2 x 2 process grid point of view.  Process row 0 holds global rows
1, 2, 5 and process row 1 holds rows 3, 4; similarly for the columns,
so process (0,0) stores a11 a12 a15 / a21 a22 a25 / a51 a52 a55.]

Descriptor DESC: 8-integer array describing the matrix layout,
containing M, N, MB, NB, RSRC, CSRC, CTXT and LLD, where (RSRC, CSRC)
are the coordinates of the process owning the first matrix entry in
the grid specified by CTXT.

Ex: M = N = 5, MB = NB = 2, RSRC = CSRC = 0, LLD >= 3 in process row 0,
and >= 2 in process row 1.
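
To make the block-cyclic mapping concrete, here is a small sketch (my
own illustration, not part of the PBLAS; ScaLAPACK ships analogous
helper routines in its TOOLS library) that converts a 1-based global
row index into the owning process row and the local index, assuming the
first block lives in process row 0 (RSRC = 0).  The routine name G2LP
and its argument order are made up.

      SUBROUTINE G2LP( IG, MB, NPROW, IPROW, IL )
*     Illustrative only: map global (1-based) row index IG, with block
*     size MB and NPROW process rows, to the owning process row IPROW
*     and the local (1-based) row index IL.  Assumes RSRC = 0.
      INTEGER IG, MB, NPROW, IPROW, IL, IB
*     Global block number, counting from 0.
      IB    = ( IG - 1 ) / MB
*     Blocks are dealt out to process rows cyclically.
      IPROW = MOD( IB, NPROW )
*     Block IB is the (IB/NPROW)-th block stored locally.
      IL    = ( IB / NPROW )*MB + MOD( IG - 1, MB ) + 1
      RETURN
      END

For the 5 x 5 example above (MB = 2, NPROW = 2), global rows 1, 2, 5
map to process row 0 (local rows 1, 2, 3) and rows 3, 4 to process
row 1, matching the figure and the LLD values quoted.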

PBLAS - ARGUMENT CONVENTIONS

- Global view of the matrix operands, allowing global addressing of
  distributed matrices (hiding complex local indexing),

[Figure: an M_-by-N_ distributed matrix with the submatrix
A( IA:IA+M-1, JA:JA+N-1 ) on which the PBLAS call operates.]

- Code reusability, interface very close to sequential BLAS:

     BLAS                     PBLAS
     INTEGER LDA              INTEGER IA, JA, DESCA( 8 )
     INTEGER INCX             INTEGER INCX, IX, JX, DESCX( 8 )
     A, LDA                   A, IA, JA, DESCA
     X, INCX                  X, IX, JX, DESCX, INCX

     CALL DGEXXX ( M, N, A( IA, JA ), LDA )
     CALL PDGEXXX( M, N, A, IA, JA, DESCA )

      CALL DGEMM( 'No Transpose', 'No Transpose',
     $            M-J-JB+1, N-J-JB+1, JB, -ONE, A( J+JB, J ),
     $            LDA, A( J, J+JB ), LDA, ONE, A( J+JB, J+JB ),
     $            LDA )

      CALL PDGEMM( 'No Transpose', 'No Transpose',
     $             M-J-JB+1, N-J-JB+1, JB, -ONE, A, J+JB,
     $             J, DESCA, A, J, J+JB, DESCA, ONE, A,
     $             J+JB, J+JB, DESCA )

>> SPMD PROGRAMMING MODEL <<


PBLAS - SPECIFICATIONS

   DGEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX,
           BETA, Y, INCY )

   PDGEMV( TRANS, M, N, ALPHA, A, IA, JA, DESCA,
           X, IX, JX, DESCX, INCX,
           BETA, Y, IY, JY, DESCY, INCY )

- In the PBLAS, the increment specified for vectors is always global.
  So far only INCX=1 and INCX=DESCX(1) are supported.
- PBLAS matrix transposition routine (REAL):

   PDTRAN( M, N, ALPHA, A, IA, JA, DESCA, BETA,
           C, IC, JC, DESCC )

- Level 1 BLAS functions have become PBLAS subroutines.  Output
  scalar correct in operand scope.

Parallel Level 2 and 3 BLAS: PBLAS

Goal

- Port sequential libraries for shared memory machines that use the
  BLAS to distributed memory machines with little effort.
- Reuse the existing software by hiding the message passing in a set
  of BLAS routines.
- Parallel implementation of the BLAS that understands how the matrix
  is laid out and, when called, can perform not only the operation but
  the required data transfer.

   LAPACK  ->  ScaLAPACK
   BLAS    ->  PBLAS (BLAS, BLACS)

- Quality maintenance
- Portability (F77 - C)
- Efficiency - Reusability (BLAS, BLACS)
- Hide Parallelism in PBLAS


SEQUENTIAL LU FACTORIZATION CODE

      DO 20 J = 1, MIN( M, N ), NB
         JB = MIN( MIN( M, N )-J+1, NB )
*
*        Factor diagonal and subdiagonal blocks and test for exact
*        singularity.
*
         CALL DGETF2( M-J+1, JB, A( J, J ), LDA, IPIV( J ), IINFO )
*
*        Adjust INFO and the pivot indices.
*
         IF( INFO.EQ.0 .AND. IINFO.GT.0 ) INFO = IINFO + J - 1
         DO 10 I = J, MIN( M, J+JB-1 )
            IPIV( I ) = J - 1 + IPIV( I )
   10    CONTINUE
*
*        Apply interchanges to columns 1:J-1.
*
         CALL DLASWP( J-1, A, LDA, J, J+JB-1, IPIV, 1 )
*
         IF( J+JB.LE.N ) THEN
*
*           Apply interchanges to columns J+JB:N.
*
            CALL DLASWP( N-J-JB+1, A( 1, J+JB ), LDA, J, J+JB-1,
     $                   IPIV, 1 )
*
*           Compute block row of U.
*
            CALL DTRSM( 'Left', 'Lower', 'No transpose', 'Unit',
     $                  JB, N-J-JB+1, ONE, A( J, J ), LDA,
     $                  A( J, J+JB ), LDA )
            IF( J+JB.LE.M ) THEN
*
*              Update trailing submatrix.
*
               CALL DGEMM( 'No transpose', 'No transpose',
     $                     M-J-JB+1, N-J-JB+1, JB, -ONE,
     $                     A( J+JB, J ), LDA, A( J, J+JB ), LDA,
     $                     ONE, A( J+JB, J+JB ), LDA )
            END IF
         END IF
   20 CONTINUE


PARALLEL LU FACTORIZATION CODE

      DO 10 J = JA, JA+MIN( M, N )-1, DESCA( 4 )
         JB = MIN( MIN( M, N )-J+JA, DESCA( 4 ) )
         I = IA + J - JA
*
*        Factor diagonal and subdiagonal blocks and test for exact
*        singularity.
*
         CALL PDGETF2( M-J+JA, JB, A, I, J, DESCA, IPIV, IINFO )
*
*        Adjust INFO and the pivot indices.
*
         IF( INFO.EQ.0 .AND. IINFO.GT.0 )
     $      INFO = IINFO + J - JA
*
*        Apply interchanges to columns JA:J-JA.
*
         CALL PDLASWP( 'Forward', 'Rows', J-JA, A, IA, JA, DESCA,
     $                 J, J+JB-1, IPIV )
*
         IF( J-JA+JB+1.LE.N ) THEN
*
*           Apply interchanges to columns J+JB:JA+N-1.
*
            CALL PDLASWP( 'Forward', 'Rows', N-J-JB+JA, A, IA,
     $                    J+JB, DESCA, J, J+JB-1, IPIV )
*
*           Compute block row of U.
*
            CALL PDTRSM( 'Left', 'Lower', 'No transpose', 'Unit',
     $                   JB, N-J-JB+JA, ONE, A, I, J, DESCA, A, I,
     $                   J+JB, DESCA )
            IF( J-JA+JB+1.LE.M ) THEN
*
*              Update trailing submatrix.
*
               CALL PDGEMM( 'No transpose', 'No transpose',
     $                      M-J-JB+JA, N-J-JB+JA, JB, -ONE, A,
     $                      I+JB, J, DESCA, A, I, J+JB, DESCA,
     $                      ONE, A, I+JB, J+JB, DESCA )
            END IF
         END IF
   10 CONTINUE

SCALAPACK - REFERENCES

- ScaLAPACK software and documentation can be obtained via:
  - WWW: http://www.netlib.org/scalapack,
  - WWW: http://www.netlib.org/lapack/lawns,
  - anonymous ftp netlib2.cs.utk.edu:
    - cd scalapack; get index
    - cd lapack/lawns; get index
  - email [email protected] with the message:
    send index from scalapack
- Comments and questions can be addressed at [email protected]
- LAPACK Working Notes: 43, 55, 57, 58, 61, 65, 73, 80, 86, 91,
  92, 93, 94, 95, 96, 100.
- J. Dongarra and D. Walker, Software Libraries for Linear Algebra
  Computations on High Performance Computers, SIAM Review,
  Vol. 37, No. 2, pp. 151-180, 1995.


SCALAPACK - ONGOING WORK

- HPF
  - HPF supports the 2D-block-cyclic distribution used by ScaLAPACK
  - HPF-like calling sequence (Global indexing scheme)
- Portability: MPI implementation of the BLACS
- More flexibility added to the PBLAS
- More testing and timing programs
- Condition estimators
- Iterative refinement of linear system solutions
- SVD
- Linear Least Squares solvers
- Banded systems
- Nonsymmetric eigensolvers
- ...

LU Factorization on Intel Paragon

[Figure: bar chart of LU factorization performance (Gflops, 0-35) on
128, 256, 512 and 1024 nodes for problem sizes 1000, 5000, 15000,
25000 and 35000.]


LU Factorization on IBM SP-2

[Figure: bar chart of LU factorization performance (Gflops, 0-3.5) on
4, 8 and 16 nodes for problem sizes 1000, 5000, 10000 and 15000.]


Divide

- Important on tridiagonal solvers for ADI schemes.
- 9n floating point operations in the inner loop, n of which are
  floating-point divide instructions.
- Also important on block tridiagonal and pentadiagonal solvers.
- Newton's iteration usually used (see the sketch below).
- Not typically pipelined.

                     Clock   Divide   Cache BW   Memory BW   Peak Perf.   Linpack
   Processor          MHz      CP       MW/s       MW/s         MF/s        MF/s
   Cray C-90/1        240       4         -        1440          960         387
   Cray Y-MP/1        166       4         -         500          333         161
   RS/6000-590         66      19       266         266          264         130
   DEC Alpha          150      63       150          37          150          30
   HP PA-RISC          99      20        99          33          198          41
   Intel i860          40     190        80          16           60          10
   SGI R4400          100      36       100          50           50          17
   SuperSPARC          40       7        40          40           40           7
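
As a hedged illustration of the "Newton's iteration" bullet above (my
sketch, not vendor code): the reciprocal 1/a can be computed with
multiplies and subtracts only, since x(k+1) = x(k)*(2 - a*x(k))
converges quadratically to 1/a from a reasonable starting guess, and
the multiply-add iterations pipeline well while a hardware divide
typically does not.  The function name RECIP, the starting-guess
argument and the iteration count are made up.

      REAL FUNCTION RECIP( A, X0, NITER )
*     Illustrative only: approximate 1/A by Newton's iteration,
*     starting from an initial guess X0 (e.g. taken from a small
*     table or an exponent-only estimate).
      INTEGER NITER, K
      REAL    A, X0, X
      X = X0
      DO 10 K = 1, NITER
*        Each step uses only multiplies and a subtract.
         X = X*( 2.0E0 - A*X )
   10 CONTINUE
      RECIP = X
      RETURN
      END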

Conclusion

- RISC has had an impact on basic software design
- Exploiting memory hierarchy in the algorithm
- Blocking is now taken for granted


Sparse Matrix Algorithms

- Based on iterative methods
  - Krylov subspace methods (a sketch follows this slide)
  - Multigrid
- Important for many large applications
- Attempts to solve Ax = b, where A is large and sparse.
  - A too large for cache
  - Generates a sequence of iterates, x_i, converging to the solution.
  - Requires accessing A for each iteration.
  - Thus, computation runs at main memory speeds, not cache speeds.
- Method may not converge.
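
To make the cost argument concrete, here is a minimal sketch of one
Krylov subspace method, the conjugate gradient iteration for symmetric
positive definite A (my illustration, not from the slides).  The routine
name CGSOL and the MATVEC interface are made up; MATVEC stands for a
user-supplied sparse matrix-vector product, and SDOT/SAXPY are the
Level 1 BLAS routines from the earlier slides.  Note that MATVEC sweeps
over all of A once per iteration, which is why the computation runs at
memory speed rather than cache speed.

      SUBROUTINE CGSOL( N, B, X, R, P, Q, TOL, MAXIT, MATVEC )
*     Illustrative conjugate gradient sketch for A*x = b, A symmetric
*     positive definite.  MATVEC( N, P, Q ) is a user-supplied routine
*     returning Q = A*P; R, P, Q are work vectors of length N.
      EXTERNAL MATVEC
      INTEGER  N, MAXIT, IT, I
      REAL     B( N ), X( N ), R( N ), P( N ), Q( N ), TOL
      REAL     RHO, RHOOLD, ALPHA, BETA, SDOT
      DO 10 I = 1, N
         X( I ) = 0.0E0
         R( I ) = B( I )
         P( I ) = B( I )
   10 CONTINUE
      RHO = SDOT( N, R, 1, R, 1 )
      DO 30 IT = 1, MAXIT
         IF( SQRT( RHO ).LE.TOL ) RETURN
*        One sparse matrix-vector product per iteration: Q = A*P.
         CALL MATVEC( N, P, Q )
         ALPHA = RHO / SDOT( N, P, 1, Q, 1 )
*        Level 1 BLAS updates of the solution and the residual.
         CALL SAXPY( N,  ALPHA, P, 1, X, 1 )
         CALL SAXPY( N, -ALPHA, Q, 1, R, 1 )
         RHOOLD = RHO
         RHO    = SDOT( N, R, 1, R, 1 )
         BETA   = RHO / RHOOLD
         DO 20 I = 1, N
            P( I ) = R( I ) + BETA*P( I )
   20    CONTINUE
   30 CONTINUE
      RETURN
      END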