The Impact of RISC
and Parallel RISC Systems
on Linear Algebra Software

Jack Dongarra

Computer Science Department
University of Tennessee
and
Mathematical Sciences Section
Oak Ridge National Laboratory

http://www.netlib.org/utk/people/JackDongarra.html

Overview

- Motivation - History
- Linpack
  - Level 1 BLAS
  - Unrolling to allow the compiler to recognize, and why
  - Tricks like the BABE algorithm
- Supercomputers
  - Vector Computers
  - Compilers
- Level 2, 3 BLAS
  - Memory Hierarchy
- RISC
  - Linpack on RISC
- MP
  - Scalapack
  - Bandwidth / Latency
  - Iterative Methods
  - Bombardment
- Future
Mathematical Software for Linear Algebra in the 1970's

1970's software packages for eigenvalue problems and linear equations.

- EISPACK - Fortran routines
  - Eigenvalue problem
  - Wilkinson-Reinsch Handbook for Automatic Computation
- LINPACK - Fortran routines
  - Systems of linear equations.
- Little thought given to high performance computers.
- CDC 7600 state-of-the-art supercomputer.

Mathematical Software for Linear Algebra in the 1970's

EISPACK (early 70's)

- Fortran translation of Algol algorithms from Num. Math.
- Software issues -
  - Portability
  - Robustness
  - Accuracy
  - Uniform
  - Well documented
  - Collective ideas.
Mathematical Software for Linear Algebra in the 1970's

Level 1 BLAS

- Basic Linear Algebra Subprograms for Fortran Usage
- C. Lawson, R. Hanson, D. Kincaid, and F. Krogh
- Conceptual aid in design and coding
- Aid to readability and documentation
- Promote efficiency:
  - through optimization or assembly language versions
- Improve robustness and reliability
- Improve portability through standardization

Operations covered:

- Dot products
- 'axpy' operations: y <- alpha*x + y
- Multiply a vector by a constant
- Set up and apply Givens rotations
- Copy and swap vectors
- Vector norms

                    name    dim   scalars   vector        vector      scalars
  SUBROUTINE        AXPY  ( N,    ALPHA,    X, INCX,      Y, INCY )
  FUNCTION          DOT   ( N,              X, INCX,      Y, INCY )
  SUBROUTINE        SCAL  ( N,    ALPHA,    X, INCX )
  SUBROUTINE        ROTG  ( A,    B,                                  C, S )
  SUBROUTINE        ROT   ( N,              X, INCX,      Y, INCY,    C, S )
  SUBROUTINE        COPY  ( N,              X, INCX,      Y, INCY )
  SUBROUTINE        SWAP  ( N,              X, INCX,      Y, INCY )
  FUNCTION          NRM2  ( N,              X, INCX )
  FUNCTION          ASUM  ( N,              X, INCX )
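As an illustration of the calling sequence, here is a minimal C sketch of the axpy operation with the INCX/INCY stride arguments (a sketch only, not the netlib reference implementation; negative-increment handling is omitted):

```c
/* y <- alpha*x + y over n elements, stepping through x and y with
   strides incx and incy, mirroring the Fortran AXPY interface. */
static void saxpy(int n, float alpha, const float *x, int incx,
                  float *y, int incy)
{
    if (n <= 0 || alpha == 0.0f) return;   /* quick return, as in BLAS */
    int ix = 0, iy = 0;
    for (int i = 0; i < n; i++) {
        y[iy] += alpha * x[ix];            /* one multiply-add per element */
        ix += incx;
        iy += incy;
    }
}
```

With unit strides this is the plain vector update y = y + alpha*x; nonunit strides let the same routine walk rows of a column-stored matrix.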
Mathematical Software for Linear Algebra in the 1970's

LINPACK (late 70's)

- Used current ideas - not a translation.
- First vector supercomputer arrived - CRAY-1.
- BLAS to standardize basic vector operations.
- LINPACK embraced BLAS for modularity and efficiency.
- Reworked algorithms.
- Column oriented.

Here is the body of the code of the LINPACK routine SPOFA, which implements the above method:

         DO 30 J = 1, N
            INFO = J
            S = 0.0E0
            JM1 = J - 1
            IF (JM1 .LT. 1) GO TO 20
            DO 10 K = 1, JM1
               T = A(K,J) - SDOT(K-1,A(1,K),1,A(1,J),1)
               T = T/A(K,K)
               A(K,J) = T
               S = S + T*T
   10       CONTINUE
   20       CONTINUE
            S = A(J,J) - S
C     ......EXIT
            IF (S .LE. 0.0E0) GO TO 40
            A(J,J) = SQRT(S)
   30    CONTINUE
BLAS - BASICS

Level 1 BLAS - Vector Operations (late '70s)

   y <- y + alpha*x,   y <- alpha*x,   y <- x,
   x^T y,   ||x||_2,   y <-> x.

Example: SAXPY operates with vectors:

   y <- y + alpha*x

- 2 vector loads
- 2 vector operations
- 1 vector store

Operation to data reference:
- 2n operations
- 3n data transfers

Too much memory traffic: little chance to use the memory hierarchy.
O(n) memory references, O(n) floating point operations.

Loop Unrolling

In the scalar case, y(1:n) = y(1:n) + alfa*x(1:n) is unrolled:

      do i = 1,n,4
         y(i)   = y(i)   + alfa*x(i)
         y(i+1) = y(i+1) + alfa*x(i+1)
         y(i+2) = y(i+2) + alfa*x(i+2)
         y(i+3) = y(i+3) + alfa*x(i+3)
      continue

WHY?
- Reduce the loop overhead
- Allow pipelined functional units to operate
- Allow independent functional units to operate
- Vector instructions

Machines that might benefit?
- Thought it would work well on all scalar machines, but... vector machines
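The unrolled loop above assumes n is a multiple of 4; a practical version adds a cleanup loop. A C sketch of the same trick:

```c
/* 4-way unrolled axpy, y <- y + alfa*x. The main loop issues four
   independent multiply-adds per trip, so loop-control overhead is
   paid once per four elements and the compiler can schedule the
   four statements across pipelined or independent functional units. */
static void saxpy_unrolled(int n, float alfa, const float *x, float *y)
{
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        y[i]   += alfa * x[i];
        y[i+1] += alfa * x[i+1];
        y[i+2] += alfa * x[i+2];
        y[i+3] += alfa * x[i+3];
    }
    /* cleanup for the remaining n mod 4 elements */
    for (; i < n; i++)
        y[i] += alfa * x[i];
}
```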
Other Tricks for Performance...

Burn At Both Ends (BABE)

Algorithm for factoring a tridiagonal matrix.

[Figure: a tridiagonal matrix, and the elimination sweeping down from the
top row in a conventional factorization.]

- Algorithm is sequential in nature.
- Not much scope for pipelining.

In an attempt to increase the speed of execution the BABE algorithm was
adopted. With the BABE algorithm the factorization is started at the top
of the matrix as well as at the bottom of the matrix at the same time.

[Figure: the elimination proceeding from both ends of the tridiagonal
matrix simultaneously, meeting in the middle.]

- Reduce loop overhead by a factor of 2
- Scope for independent operations
- Resulted in 25% faster execution
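The slides do not show the BABE code itself, so the following is an illustrative C reconstruction of the idea for solving a tridiagonal system T x = b: eliminate from the top and from the bottom at once, meeting at a middle row. The two sweeps touch disjoint data, so they are independent operations in the sense above:

```c
/* Tridiagonal solve by "burning at both ends" (no pivoting, so the
   pivots diag[i] must stay nonzero). sub[i] = T(i,i-1), diag[i] =
   T(i,i), sup[i] = T(i,i+1); sub[0] and sup[n-1] are unused.
   All four input arrays are overwritten. */
static void babe_solve(int n, double *sub, double *diag, double *sup,
                       double *b, double *x)
{
    int k = n / 2;                        /* meeting point */

    /* burn from the top: eliminate sub[1..k] */
    for (int i = 1; i <= k; i++) {
        double m = sub[i] / diag[i-1];
        diag[i] -= m * sup[i-1];
        b[i]    -= m * b[i-1];
    }
    /* burn from the bottom: eliminate sup[k..n-2]; this sweep reads and
       writes only rows k..n-1, so it is independent of the sweep above */
    for (int i = n - 2; i >= k; i--) {
        double m = sup[i] / diag[i+1];
        diag[i] -= m * sub[i+1];
        b[i]    -= m * b[i+1];
    }

    /* row k now holds a single pivot; substitute outward both ways */
    x[k] = b[k] / diag[k];
    for (int i = k - 1; i >= 0; i--)
        x[i] = (b[i] - sup[i] * x[i+1]) / diag[i];
    for (int i = k + 1; i < n; i++)
        x[i] = (b[i] - sub[i] * x[i-1]) / diag[i];
}
```

Each sweep is still a sequential recurrence, but the two recurrences run at the same time, halving the effective loop count and giving two independent streams of work to the functional units.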
Level 1, 2, and 3 BLAS

Why Higher Level BLAS?

IBM SP2 Memory Hierarchy:

                          Size      Level               Bandwidth
  Smallest, fastest       256 B     Registers           2128 MB/s
                          256 KB    Data cache          2128 MB/s
                          128 MB    Local memory        2128 MB/s
                          24 GB     Remote memory         32 MB/s
  Largest, slowest        485 GB    Secondary memory      10 MB/s

Can only do arithmetic on data at the top of the hierarchy.

- Level 1 BLAS - vector-vector operations
  y <- y + alpha*x, also dot product, ...
- Level 2 BLAS - matrix-vector operations
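A C sketch of a representative Level 2 operation, the matrix-vector product y <- y + A*x (the alpha/beta scalars and transpose options of the real GEMV are omitted for brevity):

```c
/* y <- y + A*x for a row-major n x n matrix A: about 2n^2 flops
   against about n^2 + 3n data references, so each matrix element
   moved up the hierarchy feeds a multiply and an add - compare the
   2n operations per 3n transfers of axpy, and note x and y can sit
   in cache or registers across the whole sweep over A. */
static void gemv(int n, const double *a, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double t = y[i];
        for (int j = 0; j < n; j++)
            t += a[i*n + j] * x[j];   /* row i of A times x */
        y[i] = t;
    }
}
```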