An Accidental Benchmarker
Jack Dongarra
University of Tennessee, Oak Ridge National Laboratory, University of Manchester

Over the Past 50 Years: Evolving Software and Algorithms Tracking Hardware Developments
Features: Performance, Portability, and Accuracy

• EISPACK (1970’s): relies on translation of Algol to Fortran 66 (Fortran, but row oriented)
  Standard established: Level 1 Basic Linear Algebra Subprograms (BLAS), for vector-vector operations
• LINPACK (1980’s): relies on vector operations, Level-1 BLAS operations, column oriented
  Standards established: Level 2 & 3 BLAS, for matrix-vector & matrix-matrix operations
• LAPACK (1990’s): relies on blocking (cache friendly), Level-3 BLAS operations
  Standards established: PVM and MPI, for message passing
• ScaLAPACK (2000’s): relies on distributed memory, PBLAS, message passing
• PLASMA / MAGMA (2010’s): relies on many-core-friendly designs & GPUs, a DAG/scheduler, block data layout
  Standard established: PaRSEC, for scheduling
• SLATE (2020’s): relies on C++ (distributed memory and heterogeneous architectures), tasking with DAG scheduling, tiling (but tiles can come from anywhere), heterogeneous hardware, batched dispatch

EISPACK
• EISPACK is a software library for numerical computation of eigenvalues and eigenvectors of matrices, written in FORTRAN.
• It contains subroutines for calculating the eigenvalues of nine classes of matrices: complex general, complex Hermitian, real general, real symmetric, real symmetric banded, real symmetric tridiagonal, special real tridiagonal, generalized real, and generalized real symmetric matrices.
• The library drew heavily on Algol algorithms developed by Jim Wilkinson & colleagues.
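Each BLAS level raises the granularity of a single call, and with it the ratio of flops to data moved. A minimal sketch of the three levels, assuming a Fortran compiler and linking against any reference BLAS (the dimensions and values are illustrative only):

      program blaslv
      integer n, lda
      parameter (n = 4, lda = 4)
      double precision alpha, beta
      double precision x(n), y(n), a(lda,n), b(lda,n), c(lda,n)
      integer i, j
      alpha = 2.0d0
      beta = 1.0d0
      do 20 j = 1, n
         x(j) = 1.0d0
         y(j) = 1.0d0
         do 10 i = 1, n
            a(i,j) = 1.0d0
            b(i,j) = 1.0d0
            c(i,j) = 0.0d0
   10    continue
   20 continue
c     Level 1: y <- alpha*x + y         O(n) flops on O(n) data
      call daxpy(n, alpha, x, 1, y, 1)
c     Level 2: y <- alpha*A*x + beta*y  O(n**2) flops on O(n**2) data
      call dgemv('N', n, n, alpha, a, lda, x, 1, beta, y, 1)
c     Level 3: C <- alpha*A*B + beta*C  O(n**3) flops on O(n**2) data
      call dgemm('N', 'N', n, n, n, alpha, a, lda, b, lda, beta, c, lda)
      write (*,*) 'y(1) =', y(1), ' c(1,1) =', c(1,1)
      end

The Level-3 ratio of O(n**3) flops to O(n**2) data is what lets blocked, cache-friendly libraries such as LAPACK run near machine peak.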
My First Paper
Unrolling the inner loops:
• Reduces loop overhead
• Level of unrolling dictated by the instruction stack size
• Helps the compiler to:
  - facilitate pipelining
  - increase the concurrency between independent functional units
• Provided ~15% performance improvement (a sketch follows this list)
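A minimal sketch of the technique on an axpy-style loop, unrolled to a depth of 4 (the subroutine name daxpy4 and the fixed depth are illustrative; the appropriate depth depends on the machine’s instruction stack):

      subroutine daxpy4(n, da, dx, dy)
c     computes y <- da*x + y with the loop body unrolled by 4:
c     one increment and test per four updates, and four independent
c     multiply-adds that the functional units can overlap
      integer n, i, m
      double precision da, dx(n), dy(n)
      if (n .le. 0) return
c     clean-up loop for the n mod 4 leftover elements
      m = mod(n, 4)
      do 10 i = 1, m
         dy(i) = dy(i) + da*dx(i)
   10 continue
c     main loop: four elements per trip
      do 20 i = m + 1, n, 4
         dy(i)   = dy(i)   + da*dx(i)
         dy(i+1) = dy(i+1) + da*dx(i+1)
         dy(i+2) = dy(i+2) + da*dx(i+2)
         dy(i+3) = dy(i+3) + da*dx(i+3)
   20 continue
      return
      end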
An Accidental Benchmarker
• Appendix B of the Linpack Users’ Guide was designed to help users estimate the run time for solving systems of equations with the Linpack software.
• LINPACK was an NSF project with ANL, UNM, UM, & UCSD; we worked independently and came to Argonne in the summers.
• Top 23 List from 1977
• The first benchmark report is from 1977, spanning the Cray-1 to the DEC PDP-10.
The Original Code (1979)
• Benchmark based on solving Ax = b using LU factorization with partial pivoting
• Fortran 77, using the Level 1 BLAS
• # of flops: the innermost operation does 2 flops (a multiply and an add) inside three nested loops, so the total is sum sum sum 2 = (2/3)n^3 + O(n^2)
• Outer loop: partial pivoting, i.e., find the largest element of a column & interchange, then scale the column
• Middle loop: apply the pivot and update each remaining column
• Inner loop: axpy, y = alpha*x + y

      subroutine dgefa(a,lda,n,ipvt,info)
      integer lda,n,ipvt(1),info
      double precision a(lda,1)
      double precision t
      integer idamax,j,k,kp1,l,nm1
      info = 0
      nm1 = n - 1
      if (nm1 .lt. 1) go to 70
c     outer loop over the columns
      do 60 k = 1, nm1
         kp1 = k + 1
c        partial pivoting: find the largest element of the column
         l = idamax(n-k+1,a(k,k),1) + k - 1
         ipvt(k) = l
         if (a(l,k) .eq. 0.0d0) go to 40
c        interchange
         if (l .eq. k) go to 10
            t = a(l,k)
            a(l,k) = a(k,k)
            a(k,k) = t
   10    continue
c        scale the column
         t = -1.0d0/a(k,k)
         call dscal(n-k,t,a(k+1,k),1)
c        middle loop: apply the pivot to the remaining columns
         do 30 j = kp1, n
            t = a(l,j)
            if (l .eq. k) go to 20
               a(l,j) = a(k,j)
               a(k,j) = t
   20       continue
c           inner loop: axpy, y = alpha*x + y
            call daxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
   30    continue
         go to 50
   40    continue
         info = k
   50    continue
   60 continue
   70 continue
      ipvt(n) = n
      if (a(n,n) .eq. 0.0d0) info = n
      return
      end

Pivoting is needed to:
• avoid dividing by zero or by a small number (degeneracy)
• avoid growth in the elements (loss of accuracy)

Ax = b: factor A into L and U, then solve the system of equations:
y = L^-1 b;  x = U^-1 y
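To complete the picture, a minimal driver sketch for Ax = b, assuming the program is linked against LINPACK and the Level-1 BLAS. dgefa factors A in place; its companion routine dgesl then applies the pivots and does the two triangular solves (y = L^-1 b, then x = U^-1 y), overwriting b with x. The 3x3 system and its values are illustrative only:

      program solve
      integer lda, n
      parameter (lda = 3, n = 3)
      double precision a(lda,n), b(n)
      integer ipvt(n), info
c     a small, nonsingular test system (columns of a listed in order)
      data a / 4.0d0, 2.0d0, 1.0d0,
     &         3.0d0, 5.0d0, 2.0d0,
     &         1.0d0, 1.0d0, 6.0d0 /
      data b / 8.0d0, 8.0d0, 9.0d0 /
c     factor A = L*U with partial pivoting; a nonzero info flags
c     an exactly zero pivot
      call dgefa(a, lda, n, ipvt, info)
      if (info .ne. 0) stop 'zero pivot encountered'
c     job = 0 solves A*x = b using the stored factors and pivots
      call dgesl(a, lda, n, ipvt, b, 0)
      write (*,*) 'x =', b
      end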