Optimizing Matrix Multiply using PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology

Jeff Bilmes, Krste Asanović, Chee-Whye Chin, Jim Demmel
{bilmes,krste,cheewhye,demmel}@cs.berkeley.edu

CS Division, University of California at Berkeley, Berkeley CA, 94720
International Computer Science Institute, Berkeley CA, 94704

Author footnotes:
* CS Division, University of California at Berkeley and the International Computer Science Institute. The author acknowledges the support of JSEP contract F49620-94-C-0038.
† CS Division, University of California at Berkeley and the International Computer Science Institute. The author acknowledges the support of ONR URI Grant N00014-92-J-1617.
‡ CS Division, University of California at Berkeley. The author acknowledges the support of ARPA contract DAAL03-91-C-0047 (University of Tennessee Subcontract ORA4466.02).
§ CS Division and Mathematics Dept., University of California at Berkeley. The author acknowledges the support of ARPA contract DAAL03-91-C-0047 (University of Tennessee Subcontract ORA4466.02), ARPA contract DAAH04-95-1-0077 (University of Tennessee Subcontract ORA7453.02), DOE grant DE-FG03-94ER25219, DOE contract W-31-109-Eng-38, NSF grant ASC-9313958, and DOE grant DE-FG03-94ER25206.

Abstract

Modern microprocessors can achieve high performance on linear algebra kernels but this currently requires extensive machine-specific hand tuning. We have developed a methodology whereby near-peak performance on a wide range of systems can be achieved automatically for such routines. First, by analyzing current machines and C compilers, we've developed guidelines for writing Portable, High-Performance, ANSI C (PHiPAC, pronounced "fee-pack"). Second, rather than code by hand, we produce parameterized code generators. Third, we write search scripts that find the best parameters for a given system. We report on a BLAS GEMM compatible multi-level cache-blocked matrix multiply generator which produces code that achieves around 90% of peak on the Sparcstation-20/61, IBM RS/6000-590, HP 712/80i, SGI Power Challenge R8k, and SGI Octane R10k, and over 80% of peak on the SGI Indigo R4k. The resulting routines are competitive with vendor-optimized BLAS GEMMs.

1 Introduction

The use of a standard linear algebra library interface, such as BLAS [LHKK79, DCHH88, DCDH90], enables portable application code to obtain high performance provided that an optimized library (e.g., [AGZ94, KHM94]) is available and affordable.

Developing an optimized library, however, is a difficult and time-consuming task. Even excluding algorithmic variants such as Strassen's method [BLS91] for matrix multiplication, these routines have a large design space with many parameters such as blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules. Furthermore, these parameters have complicated interactions with the increasingly sophisticated microarchitectures of new microprocessors.

Various strategies can be used to produce optimized routines for a given platform. For example, the routines could be manually written in assembly code, but fully exploring the design space might then be infeasible, and the resulting code might be unusable or sub-optimal on a different system.

Another commonly used method is to code using a high level language but with manual tuning to match the underlying architecture [AGZ94, KHM94]. While less tedious than coding in assembler, this approach still requires writing machine specific code which is not performance-portable across a range of systems.

Ideally, the routines would be written once in a high-level language and fed to an optimizing compiler for each machine. There is a large literature on relevant compiler techniques, many of which use matrix multiplication as a test case [WL91, LRW91, MS95, ACF95, CFH95, SMP+96] (a longer list appears in [Wol96]). While these compiler heuristics generate reasonably good code in general, they tend not to generate near-peak code for any one operation. A high-level language's semantics might also obstruct aggressive compiler optimizations. Moreover, it takes significant time and investment before compiler research appears in production compilers, so these capabilities are often unavailable. While both microarchitectures and compilers will improve over time, we expect it will be many years before a single version of a library routine can be compiled to give near-peak performance across a wide range of machines.

We have developed a methodology, named PHiPAC, for developing Portable High-Performance linear algebra libraries in ANSI C. Our goal is to produce, with minimal effort, high-performance linear algebra libraries for a wide range of systems. The PHiPAC methodology has three components. First, we have developed a generic model of current C compilers and microprocessors that provides guidelines for producing portable high-performance ANSI C code. Second, rather than hand code particular routines, we write parameterized generators [ACF95, MS95] that produce code according to our guidelines. Third, we write scripts that automatically tune code for a particular system by varying the generators' parameters and benchmarking the resulting routines.

We have found that writing a parameterized generator and search scripts for a routine takes less effort than hand-tuning a single version for a single system. Furthermore, with the PHiPAC approach, development effort can be amortized over a large number of platforms. And by automatically searching a large design space, we can discover winning yet unanticipated parameter combinations.

Using the PHiPAC methodology, we have produced a portable, BLAS-compatible matrix multiply generator. The resulting code can achieve over 90% of peak performance on a variety of current workstations, and is sometimes faster than the vendor-optimized libraries. We focus on matrix multiplication in this paper, but we have produced other generators including dot-product, AXPY, and convolution, which have similarly demonstrated portable high performance.

We concentrate on producing high quality uniprocessor libraries for microprocessor-based systems because multiprocessor libraries, such as [CDD+96], can be readily built from uniprocessor libraries. For vector and other architectures, however, our machine model would likely need substantial modification.

Section 2 describes our generic C compiler and microprocessor model, and develops the resulting guidelines for writing portable high-performance C code. Section 3 describes our generator and the resulting code for a BLAS-compatible matrix multiply. Section 4 describes our strategy for searching the matrix multiply parameter space. Section 5 presents results on several architectures comparing against vendor-supplied BLAS GEMM. Section 6 describes the availability of the distribution, and discusses future work.

2 PHiPAC

By analyzing the microarchitectures of a range of machines, such as workstations and microprocessor-based SMP and MPP nodes, and the output of their ANSI C compilers, we derived a set of guidelines that help us attain high performance across a range of machine and compiler combinations [BAD+96].

From our analysis of various ANSI C compilers, we determined we could usually rely on reasonable register allocation, instruction selection, and instruction scheduling. More sophisticated compiler optimizations, however, including pointer alias disambiguation, register and cache blocking, loop unrolling, and software pipelining, were either not performed or not very effective at producing the highest quality code.

Although it would be possible to use another target language, we chose ANSI C because it provides a low-level, yet portable, interface to machine resources, and compilers are widely available. One problem with our use of C is that we must explicitly work around pointer aliasing as described below. In practice, this has not limited our ability to extract near-peak performance.

We emphasize that for both microarchitectures and compilers we are determining a lowest common denominator. Some microarchitectures or compilers will have superior characteristics in certain attributes, but, if we code assuming these exist, performance will suffer on systems where they do not. Conversely, coding for the lowest common denominator should not adversely affect performance on more capable platforms.

For example, some machines can fold a pointer update into a load instruction while others require a separate add. Coding for the lowest common denominator dictates replacing pointer updates with base plus constant offset addressing where possible. In addition, while some production compilers have sophisticated loop unrolling and software pipelining algorithms, many do not. Our search strategy (Section 4) empirically evaluates several levels of explicit loop unrolling and depths of software pipelining. While a naive compiler might benefit from code with explicit loop unrolling or software pipelining, a more sophisticated compiler might perform better without either.

2.1 PHiPAC Coding Guidelines

The following paragraphs exemplify PHiPAC C code generation guidelines. Programmers can use these coding guidelines directly to improve performance in critical routines while retaining portability, but this does come at the cost of less maintainable code. This problem is mitigated in the PHiPAC approach, however, by the use of parameterized code generators.

Use local variables to explicitly remove false dependencies.

Casually written C code often over-specifies operation order, particularly where pointer aliasing is possible. C compilers, constrained by C semantics, must obey these over-specifications thereby reducing optimization potential. We therefore remove these extraneous dependencies.

For example, the following code fragment contains a false Read-After-Write hazard:

a[i] = b[i] + c;
a[i+1] = b[i+1] * d;

The compiler may not assume &a[i] != &b[i+1] and is forced to first store a[i] to memory before loading b[i+1]. We may re-write this with explicit loads to local variables:

float f1, f2;
f1 = b[i]; f2 = b[i+1];
a[i] = f1 + c; a[i+1] = f2 * d;

The compiler can now interleave execution of both original statements thereby increasing parallelism.

Exploit multiple integer and floating-point registers.

We explicitly keep values in local variables to reduce memory bandwidth demands. For example, consider the following 3-point FIR filter code:

while ( ... ) {
  *res++ = filter[0]*signal[0] +
           filter[1]*signal[1] +
           filter[2]*signal[2];
  signal++;
}

The compiler will usually reload the filter values every loop iteration because of potential aliasing with res. We can remove the alias by preloading the filter into local variables that may be mapped into registers:

float f0, f1, f2;
f0 = filter[0]; f1 = filter[1]; f2 = filter[2];
while ( ... ) {
  *res++ = f0*signal[0] + f1*signal[1] + f2*signal[2];
  signal++;
}

Minimize pointer updates by striding with constant offsets.

We replace pointer updates for strided memory addressing with constant array offsets. For example:

f0 = *r8; r8 += 4; f1 = *r8; r8 += 4; f2 = *r8; r8 += 4;

should be converted to:

f0 = r8[0]; f1 = r8[4]; f2 = r8[8]; r8 += 12;

Compilers can fold the constant index into a register plus offset addressing mode.

Hide multiple instruction FPU latency with independent operations.

We use local variables to expose independent operations so they can be executed independently in a pipelined or superscalar processor. For example:

f1 = f5*f9; f2 = f6+f10; f3 = f7*f11; f4 = f8+f12;

Balance the instruction mix.

A balanced instruction mix has a floating-point multiply, a floating-point add, and 1-2 floating-point loads or stores interleaved. It is not worth decreasing the number of multiplies at the expense of additions if the total floating-point operation count increases.
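For instance, a dot-product-style accumulation can be unrolled so that each statement issues roughly one multiply, one add, and two loads. The sketch below is a hypothetical illustration, not PHiPAC-generated code; it assumes an even vector length n >= 2:

/* Hypothetical sketch of a balanced mix: each unrolled statement issues
   one multiply, one add, and two loads that feed the next iteration.
   Assumes n is even and n >= 2. */
float balanced_dot(const float *x, const float *y, int n)
{
  float s0 = 0.0f, s1 = 0.0f;
  float x0 = x[0], y0 = y[0], x1 = x[1], y1 = y[1];
  int i;
  for (i = 2; i < n; i += 2) {
    s0 += x0 * y0; x0 = x[i];     y0 = y[i];     /* mul, add, 2 loads */
    s1 += x1 * y1; x1 = x[i + 1]; y1 = y[i + 1]; /* mul, add, 2 loads */
  }
  s0 += x0 * y0;
  s1 += x1 * y1;
  return s0 + s1;
}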

Increase locality to improve cache performance.

Cached machines benefit from increases in spatial and temporal locality. Whenever possible, we arrange our code to have predominantly unit-stride memory accesses, and try to reuse data once it is in cache. See Section 3.1 for our blocked matrix multiply example.

Convert integer multiplies to adds.

Integer multiplies and divides are slow relative to integer addition. Therefore, we use pointer updates rather than subscript expressions. Rather than:

for (i=...) { row_ptr = &p[i*col_stride]; ... }

we produce:

for (i=...) { ... row_ptr += col_stride; }

Minimize branches, avoid magnitude compares.

Branches are costly, especially on modern superscalar processors. Therefore, we unroll loops to amortize branch cost and use C do {} while (); loop control whenever possible to avoid any unnecessary compiler-produced loop head branches.

Also, on many microarchitectures, it is cheaper to perform equality or inequality loop termination tests than magnitude comparisons. For example, instead of:

for (i=0, a=start_ptr; i<ARRAY_SIZE; i++, a++)
{ ... }

we produce:

end_ptr = &a[ARRAY_SIZE]; a = start_ptr;
do { ... a++; } while (a != end_ptr);

This also removes one loop control variable.

Loop unroll explicitly to expose optimization opportunities.

We unroll loops explicitly to increase opportunities for other performance optimizations. For example, our 3-point FIR filter example above may be further optimized as follows:

float f0, f1, f2, s0, s1, s2;
f0 = filter[0]; f1 = filter[1]; f2 = filter[2];
s0 = signal[0]; s1 = signal[1]; s2 = signal[2];
*res++ = f0*s0 + f1*s1 + f2*s2;
do {
  signal += 3;
  s0 = signal[0]; res[0] = f0*s1 + f1*s2 + f2*s0;
  s1 = signal[1]; res[1] = f0*s2 + f1*s0 + f2*s1;
  s2 = signal[2]; res[2] = f0*s0 + f1*s1 + f2*s2;
  res += 3;
} while ( ... );

In the inner loop, there are now only two memory accesses per five floating point operations, whereas in our unoptimized code there were seven memory accesses per five floating point operations.

3 Matrix Multiply Generator

mm_gen is a generator that produces C code, following the PHiPAC coding guidelines, for one variant of the matrix multiply operation C = α op(A) op(B) + β C, where op(A), op(B), and C are respectively M×K, K×N, and M×N matrices, α and β are scalar parameters, and op(X) is either transpose(X) or just X. Our individual procedures have a lower level interface than a BLAS GEMM and have no error checking. For optimal efficiency, error checking should be performed by the caller when necessary rather than unnecessarily by the callee. We create a full BLAS-compatible GEMM by generating all required matrix multiply variants and linking with our GEMM-compatible interface that includes error checking.

mm_gen produces a cache-blocked matrix multiply [GL89, LRW91, MS95], restructuring the algorithm for unit stride, and reducing the number of cache misses and unnecessary loads and stores. Under control of command line parameters, mm_gen can produce blocking code for any number of levels of memory hierarchy, including register, L1 cache, TLB, L2 cache, and so on. mm_gen's code can also perform copy optimization [LRW91], optionally with a different accumulator precision. The latest version can also generate the innermost loop with various forms of software pipelining.

A typical invocation of mm_gen is:

mm_gen -cb M0 K0 N0 [ -cb M1 K1 N1 ] ...

where the register blocking is M0, K0, N0, the L1-cache blocking is M1, K1, N1, etc. The parameters M0, K0, and N0 are specified in units of matrix elements, i.e., single, double, or extended precision floating-point numbers, M1, K1, N1 are specified in units of register blocks, M2, K2, and N2 are in units of L1 cache blocks, and so on. For a particular cache level, say i, the code accumulates into a C destination block of size Mi×Ni units and uses A source blocks of size Mi×Ki units and B source blocks of size Ki×Ni units (see Figure 1).

Figure 1: Matrix blocking parameters. (Diagram: the M×K matrix A, the K×N matrix B, and the M×N matrix C subdivided into register-level blocks of size M0×K0, K0×N0, and M0×N0, which are in turn grouped into L1-level blocks of size (M0*M1)×(K0*K1), (K0*K1)×(N0*N1), and (M0*M1)×(N0*N1).)
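For instance, an illustrative invocation (the block sizes here are chosen only for the example) requesting a 2×2×2 register block nested inside a 16×16×16 L1 block would be:

mm_gen -cb 2 2 2 -cb 16 16 16

Following the unit conventions above, the L1-level blocks in this case span 2*16 = 32 matrix elements in each dimension.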

3.1 Matrix Multiply Code

In this section, we examine the code produced by mm_gen for the operation C = C + A*B where A (respectively B, C) is an M×K (respectively K×N, M×N) matrix. Figure 2 lists the L1 cache blocking core code comprising the 3 nested loops, M, N, and K. mm_gen does not vary the loop permutation [MS95, LRW91] because the resulting gains in locality are subsumed by the method described below.

The outer M loop in Figure 2 maintains pointers c0 and a0 to rows of register blocks in the A and C matrices. It also maintains end pointers (ap0_endp and cp0_endp) used for loop termination. The middle N loop maintains a pointer b0 to columns of register blocks in the B matrix, and maintains a pointer cp0 to the current C destination register block. The N loop also maintains separate pointers (ap0_0 through ap0_(M0-1)) to successive rows of the current A source block. It also initializes a pointer bp0 to the current B source block. We assume local variables can be held in registers, so our code uses many pointers to minimize both memory references and integer multiplies.

The K loop iterates over source matrix blocks and accumulates into the same M0×N0 destination block. We assume that the floating-point registers can hold an M0×N0 accumulator block, so this block is loaded once before the K loop begins and stored after it ends. The K loop updates the set of pointers to the A source block, one of which is used for loop termination.

The parameter K0 controls the extent of inner loop unrolling as can be seen in Figure 2. The unrolled core loop performs K0 outer products accumulating into the C destination block. We code the outer products by loading one row of the B block, one element of the A block, then performing N0 multiply-accumulates. The C code uses N0+M0 memory references per 2*N0*M0 floating-point operations in the inner K loop (for the M0 = N0 = 2 code of Figure 2, four memory references per eight flops), while holding M0*N0 + N0 + 1 values in local variables. While the intent is that these local variables map to registers, the compiler is free to reorder all of the independent loads and multiply-accumulates to trade increased memory references for reduced register usage. The compiler also requires additional registers to name intermediate results propagating through machine pipelines.

The code we have so far described is valid only when M, K, and N are integer multiples of M0, K0, and N0 respectively. In the general case, mm_gen also includes code that operates on power-of-two sized fringe strips, i.e., 2^0 through 2^⌊log2 L⌋ where L is M0, K0, or N0. We can therefore manage any fringe size from 1 to L-1 by executing an appropriate combination of fringe code. The resulting code size growth is logarithmic in the register blocking (i.e., O(log M0 log K0 log N0)) yet maintains good performance. To reduce the demands on the instruction cache, we arrange the code into several independent sections, the first handling the matrix core and the remainder handling the fringes.
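As an illustration of how power-of-two fringe code can be combined, the sketch below is a hypothetical dispatch (not actual mm_gen output; the fringe_rows_* prototypes stand in for generated strip routines) handling a leftover strip of m rows with 0 < m < M0 = 8:

/* Hypothetical sketch: handle an m-row fringe (0 < m < 8) by combining
   power-of-two fringe routines for 4-, 2-, and 1-row strips.  The
   fringe_rows_* prototypes stand in for generated code. */
void fringe_rows_4(const float*, const float*, float*, int, int, int, int, int);
void fringe_rows_2(const float*, const float*, float*, int, int, int, int, int);
void fringe_rows_1(const float*, const float*, float*, int, int, int, int, int);

void handle_row_fringe(int m, const float *A, const float *B, float *C,
                       int K, int N, int Astride, int Bstride, int Cstride)
{
  if (m & 4) { fringe_rows_4(A, B, C, K, N, Astride, Bstride, Cstride);
               A += 4*Astride; C += 4*Cstride; }
  if (m & 2) { fringe_rows_2(A, B, C, K, N, Astride, Bstride, Cstride);
               A += 2*Astride; C += 2*Cstride; }
  if (m & 1) { fringe_rows_1(A, B, C, K, N, Astride, Bstride, Cstride); }
}

A 5-row fringe, for example, would then be handled as a 4-row strip followed by a 1-row strip.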

The latest version of the generator can optionally produce code with a software-pipelined inner loop. Each outer product consists of a load, a multiply, and an accumulate section. We group these sections into three software pipelined code variants: two two-stage pipes with stages [load-multiply, accumulate] and [load, multiply-accumulate], and a three-stage pipe with stages [load, multiply, accumulate]. Recent results show that this software pipelining can result in an appreciable speedup.

Because of the separation between matrix dimension and matrix stride, we can implement higher levels of cache blocking as calls to lower level routines with appropriately sized sub-matrices.

void
mul_mfmf_mf(const int M, const int K, const int N,
            const float *const A, const float *const B, float *const C,
            const int Astride, const int Bstride, const int Cstride)
{
  const float *a, *b; float *c;
  const float *ap_0, *ap_1; const float *bp; float *cp;
  const int A_sbs_stride = Astride*2;
  const int C_sbs_stride = Cstride*2;

  const int k_marg_el = K & 1;
  const int k_norm = K - k_marg_el;
  const int m_marg_el = M & 1;
  const int m_norm = M - m_marg_el;
  const int n_marg_el = N & 1;
  const int n_norm = N - n_marg_el;

  float *const c_endp = C + m_norm*Cstride;
  register float c0_0, c0_1, c1_0, c1_1;

  c = C; a = A;
  do { /* M loop */
    const float* const ap_endp = a + k_norm;
    float* const cp_endp = c + n_norm;
    const float* const apc_1 = a + Astride;
    b = B; cp = c;
    do { /* N loop */
      register float _b0, _b1;
      register float _a0, _a1;
      float *_cp;

      ap_0 = a; ap_1 = apc_1; bp = b;

      /* load the 2x2 C accumulator block */
      _cp = cp;       c0_0 = _cp[0]; c0_1 = _cp[1];
      _cp += Cstride; c1_0 = _cp[0]; c1_1 = _cp[1];

      do { /* K loop: two unrolled 2x2 outer products */
        _b0 = bp[0]; _b1 = bp[1];

        bp += Bstride; _a0 = ap_0[0];
        c0_0 += _a0*_b0; c0_1 += _a0*_b1;
        _a1 = ap_1[0];
        c1_0 += _a1*_b0; c1_1 += _a1*_b1;

        _b0 = bp[0]; _b1 = bp[1];

        bp += Bstride; _a0 = ap_0[1];
        c0_0 += _a0*_b0; c0_1 += _a0*_b1;
        _a1 = ap_1[1];
        c1_0 += _a1*_b0; c1_1 += _a1*_b1;

        ap_0 += 2; ap_1 += 2;
      } while (ap_0 != ap_endp);

      /* store the accumulator block back to C */
      _cp = cp;       _cp[0] = c0_0; _cp[1] = c0_1;
      _cp += Cstride; _cp[0] = c1_0; _cp[1] = c1_1;

      b += 2; cp += 2;
    } while (cp != cp_endp);

    c += C_sbs_stride; a += A_sbs_stride;
  } while (c != c_endp);
}

Figure 2: M0 = 2, K0 = 2, N0 = 2 matrix multiply L1 routine for M ∈ {2m : m ≥ 1}, K ∈ {2k : k ≥ 1}, N ∈ {2n : n ≥ 1}. Within the K-loop is our fully-unrolled 2×2×2 core matrix multiply. The code is not unlike the register code in [CFH95]. In our terminology, the leading dimensions LDA, LDB, and LDC are called Astride, Bstride, and Cstride respectively. The four local variables c0_0 through c1_1 hold a complete C destination block. Variables ap_0 and ap_1 point to successive rows of the A source matrix block, and variable bp points to the first row of the B source matrix block. Elements in A and B are accessed using constant offsets from the appropriate pointers.
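To make the dimension/stride separation concrete, the sketch below (a simplified illustration, not mm_gen output) adds one more level of blocking by calling the routine of Figure 2 on sub-matrices; it assumes the block sizes MB, KB, and NB are even and divide M, K, and N exactly, so no fringe code is needed:

/* Simplified sketch (not mm_gen output): one more level of blocking,
   implemented as calls to the Figure 2 routine on sub-matrices.  The
   strides are those of the full matrices, so a sub-matrix is selected
   by a pointer offset alone.  Assumes MB, KB, NB are even and divide
   M, K, N exactly. */
void mul_blocked(const int M, const int K, const int N,
                 const float *const A, const float *const B, float *const C,
                 const int Astride, const int Bstride, const int Cstride,
                 const int MB, const int KB, const int NB)
{
  int i, j, k;
  for (i = 0; i < M; i += MB)
    for (j = 0; j < N; j += NB)
      for (k = 0; k < K; k += KB)
        mul_mfmf_mf(MB, KB, NB,
                    A + i*Astride + k,
                    B + k*Bstride + j,
                    C + i*Cstride + j,
                    Astride, Bstride, Cstride);
}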

4 Matrix Multiply Search Scripts

The search script takes parameters describing the machine architecture, including the number of integer and floating-point registers and the sizes of each level of cache. For each combination of generator parameters and compilation options, the matrix multiply search script calls the generator, compiles the resulting routine, links it with timing code, and benchmarks the resulting executable.
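As a rough sketch of the benchmarking step (our timing libraries differ in detail; the driver below is only an illustration, with the operand initialization and row-major strides chosen arbitrarily), the timing code can measure repeated calls to a generated routine and report MFLOPS, counting 2*M*K*N flops per multiply:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void mul_mfmf_mf(const int M, const int K, const int N,
                 const float *const A, const float *const B, float *const C,
                 const int Astride, const int Bstride, const int Cstride);

/* Illustrative timing driver: benchmark one matrix size and report MFLOPS. */
double benchmark_mflops(int M, int K, int N, int iters)
{
  float *A = malloc(M * K * sizeof(float));
  float *B = malloc(K * N * sizeof(float));
  float *C = calloc(M * N, sizeof(float));
  clock_t start, stop;
  double secs;
  int i;

  for (i = 0; i < M * K; i++) A[i] = 1.0f;
  for (i = 0; i < K * N; i++) B[i] = 1.0f;

  start = clock();
  for (i = 0; i < iters; i++)
    mul_mfmf_mf(M, K, N, A, B, C, K, N, N);  /* row-major strides */
  stop = clock();
  secs = (double)(stop - start) / CLOCKS_PER_SEC;

  free(A); free(B); free(C);
  return (2.0 * M * K * N * iters) / (secs * 1.0e6);
}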

To produce a complete BLAS GEMM routine, we find separate parameters for each of the three cases A×B, A^T×B, and A×B^T (A^T×B^T has code identical to A×B). For each case, we first find the best register (or L0) parameters for in-L1-cache matrices, then find the best L1 parameters for in-L2-cache matrices, etc. While this strategy is not guaranteed to find the best L0 core for out-of-L1-cache matrices, the resulting cores have performed well in practice.

4.1 Register (L0) Parameter Search

The register block search evaluates all combinations of M0 and N0 where 1 ≤ M0*N0 ≤ NR and where NR is the number of machine floating-point registers. We search the above for 1 ≤ K0 ≤ K0max where K0max = 20 but is adjustable. Empirically, K0 > 20 has never shown appreciable benefit.
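A minimal sketch of this enumeration is shown below (an illustration only: the register count NR = 16 and the output format are assumptions, and the real search scripts, written in perl, also vary compiler options and software-pipelining choices):

#include <stdio.h>

#define NR    16   /* number of FP registers (machine-dependent assumption) */
#define K0MAX 20

/* Print one candidate mm_gen invocation per (M0, N0, K0) register blocking
   satisfying 1 <= M0*N0 <= NR and 1 <= K0 <= K0MAX. */
int main(void)
{
  int m0, n0, k0;
  for (m0 = 1; m0 <= NR; m0++)
    for (n0 = 1; m0 * n0 <= NR; n0++)
      for (k0 = 1; k0 <= K0MAX; k0++)
        printf("mm_gen -cb %d %d %d\n", m0, k0, n0);
  return 0;
}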

Our initial strategy [BAD+96] benchmarked a set of square matrices that fit in L1 cache. We then chose the L0 parameters that achieved the highest performance. While this approach gave good performance, the searches were time consuming.

We noted that the majority of the computation, especially for larger matrices, is performed by the core M0×K0×N0 register blocked code. Our newer search strategy, therefore, produces code containing only the core, which decreases compile time, and for each L0 parameter set, we benchmark only a single matrix multiply with size M = M0, N = N0, and K = kK0. The parameter k is chosen such that the three matrices are no larger than L1 cache (we call this a "fat" dot-product). While this case ignores the cost of the M- and N-loop overhead, we have so far found this approach to produce comparable quality code in much less time than the previous strategy (e.g., 5 hours vs. 24 hours on the SGI Indigo R4K).

We initially thought the order in which we searched the L0 parameter space could have an effect on search time and evaluated orders such as i-j-k, random, best-first, and simulated annealing. We found, however, that the added search complexity outweighed any benefit so we settled on a random strategy.

4.2 Cache Block Search

We perform the L1 cache blocking search after the best register blocking is known. We would like to make the L1 blocks large to increase data reuse but larger L1 blocks increase the probability of cache conflicts [LRW91]. Tradeoffs between M- and N-loop overheads, memory access patterns, and TLB structure also affect the best L1 size. We currently perform a relatively simple search of the L1 parameter space. For the D×D square case, we search the neighborhood centered at 3D² = L1, where L1 is the L1 cache size in elements. We set M1 to the values ⌊φD/M0⌋ where φ ∈ {0.25, 0.5, 1.0, 1.5, 2.0} and D = √(L1/3). K1 and N1 are set similarly. We benchmark the resulting 125 combinations with matrix sizes that either fit in L2 cache, or are within some upper bound if no L2 cache exists. The L2 cache blocking search, when necessary, is performed in a similar way.
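As a worked example (the cache size is an assumption chosen only for illustration): with a 16 KB L1 cache holding L1 = 4096 single-precision elements and the Figure 2 register blocking M0 = 2, D = √(4096/3) ≈ 36.95, so the candidate M1 values ⌊φD/M0⌋ are 4, 9, 18, 27, and 36. The snippet below reproduces the computation:

#include <math.h>
#include <stdio.h>

/* Worked example of the L1 candidate generation.  The 16 KB (4096
   single-precision element) cache size and M0 = 2 are assumptions
   chosen only for illustration. */
int main(void)
{
  const double phi[5] = { 0.25, 0.5, 1.0, 1.5, 2.0 };
  const int L1 = 4096, M0 = 2;
  const double D = sqrt(L1 / 3.0);
  int i;
  for (i = 0; i < 5; i++)
    printf("M1 candidate: %d\n", (int)floor(phi[i] * D / M0));
  return 0;
}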

5 Results

We ran the search scripts to find the best register and L1 cache blocking parameters for six commercial workstation systems. These systems have different instruction set architectures and widely varying microarchitectures and memory hierarchies. The results are summarized in Table 1.

The SGI R8K and R10K searches used the newer version of the code generator and search scripts, while other results were obtained with the alpha release. Figures 3-6 plot the performance of the resulting routines for all square matrices M = K = N = D, where D runs over powers of 2 and 3, multiples of 10, and primes, up to a maximum of 300. We compare with the performance of a vendor-optimized BLAS GEMM where available. In each case, PHiPAC yields a substantial fraction of peak performance and is competitive with vendor BLAS. Due to limited availability, we could only perform an incomplete search on the R8K and R10K, and so these are preliminary performance numbers. There is also potential for improvement on the other machines when we rerun with the newer version. For completeness, we also show the poor performance obtained when compiling a simple three nested loop version of GEMM with FORTRAN or C optimizing compilers.

The PHiPAC methodology can also improve performance even if there is no scope for memory blocking. In Figure 9 we plot the performance of a dot product code generated using PHiPAC techniques. Although the parameters used were obtained using a short manual search, we can see that we are competitive with the assembly-coded SGI BLAS SDOT.

In some of the plots, we see that PHiPAC routines suffer from cache conflicts. Our measurements exaggerate this effect by including all power-of-2 sized matrices, and by allocating all regions contiguously in memory. For matrix multiply, we can reduce cache conflicts by copying to contiguous memory when pathological strides are encountered [LRW91]. Unfortunately, this approach does not help dot product. One drawback of the PHiPAC approach is that we cannot control the order in which compilers schedule independent loads. We've occasionally found that exchanging two loads in the assembly output for dot product can halve the number of cache misses where conflicts occur, without otherwise impacting performance.

Figure 3: Performance of single precision matrix multiply on a Sparcstation-20/61. (Plot of MFLOPS versus square matrix size, with curves for PHiPAC and for C, 3 nested loops.)

Figure 4: Performance of single precision matrix multiply on a 100 MHz SGI Indigo R4K. We show the SGEMM from SGI's libblas_mips2_serial.a library. (Plot of MFLOPS versus square matrix size, with curves for PHiPAC, the SGI R4k assembly libblas_mips2_serial.a SGEMM, and C, 3 nested loops.)

Figure 5: Performance of double precision matrix multiply on an HP 712/80i. We show DGEMM from the pa1.1 version of libvec.a in HP's compiler distribution. (Plot of MFLOPS versus square matrix size, with curves for PHiPAC, the vendor DGEMM, and FORTRAN, 3 nested loops.)

                         Sun              HP             IBM              SGI            SGI              SGI
                         Sparc-20/61      712/80i        RS/6000-590      Indigo R4K     Challenge        Octane
Processor                SuperSPARC+      PA7100LC       RIOS-2           R4K            R8K              R10K
Frequency (MHz)          60               80             66.5             100            90               195
Max Instructions/cycle   3                2              6                1              4                4
Peak MFLOPS (32b/64b)    60/60            160/80         266/266          67/50          360/360          390/390
FP registers (32b/64b)   32/16            64/32          32/32            16/16          32/32            32/32
L1 Data cache (KB)       16               128            256              8              -                32
L2 Data cache (KB)       1024             -              -                1024           4096             1024
OS                       SunOS 4.1.3      HP-UX 9.05     AIX 3.2          Irix 4.0.5H    IRIX6.2          IRIX6.4
C Compiler               Sun acc 2.0.1    HP c89 9.61    IBM xlc 1.3      SGI cc 3.10    SGI cc           SGI cc
Search results:
PHiPAC version           alpha            alpha          alpha            alpha          new              new
Precision                32b              64b            64b              32b            64b              64b
M0, K0, N0               2,4,10           3,1,2          2,1,10           2,10,3         2,4,14           4,2,6
M1, K1, N1               26,10,4          30,60,30       105,70,28        30,4,10        200,80,25        12,24,9
CFLAGS                   -fast            -O             -O3 -qarch=pwr2  -O2 -mips2     -n32 -mips4 -O3

Table 1: System details and results of matrix multiply parameter search.

Figure 6: Performance of double precision matrix multiply on an IBM RS/6000-590. We show the DGEMM from IBM's POWER2-optimized ESSL library. (Plot of MFLOPS versus square matrix size, with curves for PHiPAC, the ESSL DGEMM, and FORTRAN, 3 nested loops.)

Figure 7: Preliminary performance of double precision matrix multiply on an SGI R8K Power Challenge. We show the DGEMM from SGI's R8K-optimized libblas. (Plot of MFLOPS versus square matrix size, with curves for PHiPAC, the vendor DGEMM, and FORTRAN, 3 nested loops.)

Figure 8: Preliminary performance of double precision matrix multiply on an SGI R10K Octane. We show the DGEMM from SGI's R10K-optimized libblas. (Plot of MFLOPS versus square matrix size, with curves for PHiPAC, the vendor DGEMM, and FORTRAN, 3 nested loops.)

Figure 9: Performance of single precision unit-stride dot-product on a 100 MHz SGI R4K. (Plot of MFLOPS versus vector length on a log scale, with curves for PHiPAC, the SGI libblas_mips2_serial.a SDOT, and C, simple loop.)

6 Status, Availability, and Future Work

This paper has demonstrated our ability to write portable, high performance ANSI C code for matrix multiply using parameterized code generators and a timing-driven search strategy.

The PHiPAC alpha release contains the matrix multiply generator, the naive search scripts written in perl, and our timing libraries. We have created a Web site from which the alpha release is available and on which we plan to list blocking parameters for many systems [BAD+]. We are currently working on a better L1 blocking strategy and accompanying methods for search based on various criteria [LRW91]. The PHiPAC GEMM can be used with Bo Kågström's GEMM-based BLAS3 package [BLL93] and LAPACK [ABB+92].

We have also written parameterized generators for matrix-vector and vector-matrix multiply, dot product, AXPY, convolution, and outer-product, and further generators, such as for FFT, are planned.

We wish to thank Ed Rothberg of SGI for help obtaining the R8K and R10K performance plots. We also wish to thank Nelson Morgan who provided initial impetus for this project and Dominic Lam for work on the initial search scripts.

References

[ABB+92] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK users' guide, release 1.0. SIAM, Philadelphia, 1992.

[ACF95] B. Alpern, L. Carter, and J. Ferrante. Space-limited procedures: A methodology for portable high-performance. In International Working Conference on Massively Parallel Programming Models, 1995.

[AGZ94] R. Agarwal, F. Gustavson, and M. Zubair. IBM Engineering and Scientific Subroutine Library, Guide and Reference, 1994. Available through IBM branch offices.

[BAD+] J. Bilmes, K. Asanović, J. Demmel, D. Lam, and C.W. Chin. The PHiPAC WWW home page. http://www.icsi.berkeley.edu/~bilmes/phipac.

[BAD+96] J. Bilmes, K. Asanović, J. Demmel, D. Lam, and C.W. Chin. PHiPAC: A portable, high-performance, ANSI C coding methodology and its application to matrix multiply. LAPACK working note 111, University of Tennessee, 1996.

[BLL93] B. Kågström, P. Ling, and C. Van Loan. Portable high performance GEMM-based level 3 BLAS. In R.F. Sincovec et al., editor, Parallel Processing for Scientific Computing, pages 339-346, Philadelphia, 1993. SIAM Publications.

[BLS91] D. H. Bailey, K. Lee, and H. D. Simon. Using Strassen's algorithm to accelerate the solution of linear systems. J. Supercomputing, 4:97-371, 1991.

[CDD+96] J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, and R.C. Whaley. ScaLAPACK: A portable linear algebra library for distributed memory computers - design issues and performance. LAPACK working note 95, University of Tennessee, 1996.

[CFH95] L. Carter, J. Ferrante, and S. Flynn Hummel. Hierarchical tiling for improved superscalar performance. In International Parallel Processing Symposium, April 1995.

[DCDH90] J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Soft., 16(1):1-17, March 1990.

[DCHH88] J. Dongarra, J. Du Croz, S. Hammarling, and R.J. Hanson. An extended set of FORTRAN basic linear algebra subroutines. ACM Trans. Math. Soft., 14:1-17, March 1988.

[GL89] G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1989.

[KHM94] C. Kamath, R. Ho, and D.P. Manley. DXML: A high-performance scientific subroutine library. Digital Technical Journal, 6(3):44-56, Summer 1994.

[LHKK79] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh. Basic linear algebra subprograms for FORTRAN usage. ACM Trans. Math. Soft., 5:308-323, 1979.

[LRW91] M. S. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of ASPLOS IV, pages 63-74, April 1991.

[MS95] J.D. McCalpin and M. Smotherman. Automatic benchmark generation for cache optimization of matrix algorithms. In R. Geist and S. Junkins, editors, Proceedings of the 33rd Annual Southeast Conference, pages 195-204. ACM, March 1995.

[SMP+96] R. Saavedra, W. Mao, D. Park, J. Chame, and S. Moon. The combined effectiveness of unimodular transformations, tiling, and software prefetching. In Proceedings of the 10th International Parallel Processing Symposium, April 15-19, 1996.

[WL91] M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, pages 30-44, June 1991.

[Wol96] M. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.