Optimizing Matrix Multiply using PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology

Jeff Bilmes, Krste Asanović, Chee-Whye Chin, Jim Demmel
{bilmes,krste,cheewhye,demmel}@cs.berkeley.edu

CS Division, University of California at Berkeley, Berkeley CA, 94720
International Computer Science Institute, Berkeley CA, 94704

Author footnotes:
* CS Division, University of California at Berkeley and the International Computer Science Institute. The author acknowledges the support of JSEP contract F49620-94-C-0038.
† CS Division, University of California at Berkeley and the International Computer Science Institute. The author acknowledges the support of ONR URI Grant N00014-92-J-1617.
‡ CS Division, University of California at Berkeley. The author acknowledges the support of ARPA contract DAAL03-91-C-0047 (University of Tennessee Subcontract ORA4466.02).
§ CS Division and Mathematics Dept., University of California at Berkeley. The author acknowledges the support of ARPA contract DAAL03-91-C-0047 (University of Tennessee Subcontract ORA4466.02), ARPA contract DAAH04-95-1-0077 (University of Tennessee Subcontract ORA7453.02), DOE grant DE-FG03-94ER25219, DOE contract W-31-109-Eng-38, NSF grant ASC-9313958, and DOE grant DE-FG03-94ER25206.

Abstract

Modern microprocessors can achieve high performance on linear algebra kernels but this currently requires extensive machine-specific hand tuning. We have developed a methodology whereby near-peak performance on a wide range of systems can be achieved automatically for such routines. First, by analyzing current machines and C compilers, we've developed guidelines for writing Portable, High-Performance, ANSI C (PHiPAC, pronounced "fee-pack"). Second, rather than code by hand, we produce parameterized code generators. Third, we write search scripts that find the best parameters for a given system. We report on a BLAS GEMM compatible multi-level cache-blocked matrix multiply generator which produces code that achieves around 90% of peak on the Sparcstation-20/61, IBM RS/6000-590, HP 712/80i, SGI Power Challenge R8k, and SGI Octane R10k, and over 80% of peak on the SGI Indigo R4k. The resulting routines are competitive with vendor-optimized BLAS GEMMs.

1 Introduction

The use of a standard linear algebra library interface, such as BLAS [LHKK79, DCHH88, DCDH90], enables portable application code to obtain high performance provided that an optimized library (e.g., [AGZ94, KHM94]) is available and affordable.

Developing an optimized library, however, is a difficult and time-consuming task. Even excluding algorithmic variants such as Strassen's method [BLS91] for matrix multiplication, these routines have a large design space with many parameters such as blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules. Furthermore, these parameters have complicated interactions with the increasingly sophisticated microarchitectures of new microprocessors.

Various strategies can be used to produce optimized routines for a given platform. For example, the routines could be manually written in assembly code, but fully exploring the design space might then be infeasible, and the resulting code might be unusable or sub-optimal on a different system.

Another commonly used method is to code using a high level language but with manual tuning to match the underlying architecture [AGZ94, KHM94]. While less tedious than coding in assembler, this approach still requires writing machine specific code which is not performance-portable across a range of systems.

Ideally, the routines would be written once in a high-level language and fed to an optimizing compiler for each machine. There is a large literature on relevant compiler techniques, many of which use matrix multiplication as a test case [WL91, LRW91, MS95, ACF95, CFH95, SMP+96] (a longer list appears in [Wol96]). While these compiler heuristics generate reasonably good code in general, they tend not to generate near-peak code for any one operation. A high-level language's semantics might also obstruct aggressive compiler optimizations. Moreover, it takes significant time and investment before compiler research appears in production compilers, so these capabilities are often unavailable. While both microarchitectures and compilers will improve over time, we expect it will be many years before a single version of a library routine can be compiled to give near-peak performance across a wide range of machines.

We have developed a methodology, named PHiPAC, for developing Portable High-Performance linear algebra libraries in ANSI C. Our goal is to produce, with minimal effort, high-performance linear algebra libraries for a wide range of systems. The PHiPAC methodology has three components. First, we have developed a generic model of current C compilers and microprocessors that provides guidelines for producing portable high-performance ANSI C code. Second, rather than hand code particular routines, we write parameterized generators [ACF95, MS95] that produce code according to our guidelines. Third, we write scripts that automatically tune code for a particular system by varying the generators' parameters and benchmarking the resulting routines.

We have found that writing a parameterized generator and search scripts for a routine takes less effort than hand-tuning a single version for a single system. Furthermore, with the PHiPAC approach, development effort can be amortized over a large number of platforms. And by automatically searching a large design space, we can discover winning yet unanticipated parameter combinations.

Using the PHiPAC methodology, we have produced a portable, BLAS-compatible matrix multiply generator. The resulting code can achieve over 90% of peak performance on a variety of current workstations, and is sometimes faster than the vendor-optimized libraries. We focus on matrix multiplication in this paper, but we have produced other generators including dot-product, AXPY, and convolution, which have similarly demonstrated portable high performance.

We concentrate on producing high quality uniprocessor libraries for microprocessor-based systems because multiprocessor libraries, such as [CDD+96], can be readily built from uniprocessor libraries. For vector and other architectures, however, our machine model would likely need substantial modification.

Section 2 describes our generic C compiler and microprocessor model, and develops the resulting guidelines for writing portable high-performance C code. Section 3 describes our generator and the resulting code for a BLAS-compatible matrix multiply. Section 4 describes our strategy for searching the matrix multiply parameter space. Section 5 presents results on several architectures comparing against vendor-supplied BLAS GEMM. Section 6 describes the availability of the distribution, and discusses future work.

2 PHiPAC

By analyzing the microarchitectures of a range of machines, such as workstations and microprocessor-based SMP and MPP nodes, and the output of their ANSI C compilers, we derived a set of guidelines that help us attain high performance across a range of machine and compiler combinations [BAD+96].

From our analysis of various ANSI C compilers, we determined we could usually rely on reasonable register allocation, instruction selection, and instruction scheduling. More sophisticated compiler optimizations, however, including pointer alias disambiguation, register and cache blocking, loop unrolling, and software pipelining, were either not performed or not very effective at producing the highest quality code.

Although it would be possible to use another target language, we chose ANSI C because it provides a low-level, yet portable, interface to machine resources, and compilers are widely available. One problem with our use of C is that we must explicitly work around pointer aliasing as described below. In practice, this has not limited our ability to extract near-peak performance.

We emphasize that for both microarchitectures and compilers we are determining a lowest common denominator. Some microarchitectures or compilers will have superior characteristics in certain attributes, but, if we code assuming these exist, performance will suffer on systems where they do not. Conversely, coding for the lowest common denominator should not adversely affect performance on more capable platforms.

For example, some machines can fold a pointer update into a load instruction while others require a separate add. Coding for the lowest common denominator dictates replacing pointer updates with base plus constant offset addressing where possible. In addition, while some production compilers have sophisticated loop unrolling and software pipelining algorithms, many do not. Our search strategy (Section 4) empirically evaluates several levels of explicit loop unrolling and depths of software pipelining. While a naive compiler might benefit from code with explicit loop unrolling or software pipelining, a more sophisticated compiler might perform better without either.

2.1 PHiPAC Coding Guidelines

The following paragraphs exemplify PHiPAC C code generation guidelines. Programmers can use these coding guidelines directly to improve performance in critical routines while retaining portability, but this does come at the cost of less maintainable code. This problem is mitigated in the PHiPAC approach, however, by the use of parameterized code generators.

Use local variables to explicitly remove false dependencies.

Casually written C code often over-specifies operation order, particularly where pointer aliasing is possible. C compilers, constrained by C semantics, must obey these over-specifications thereby reducing optimization potential. We therefore remove these extraneous dependencies.

For example, the following code fragment contains a false Read-After-Write hazard:

a[i] = b[i] + c;
a[i+1] = b[i+1] * d;

The compiler may not assume &a[i] != &b[i+1] and is forced to first store a[i] to memory before loading b[i+1]. We may re-write this with explicit loads to local variables:

float f1, f2;
f1 = b[i]; f2 = b[i+1];
a[i] = f1 + c; a[i+1] = f2 * d;

The compiler can now interleave execution of both original statements thereby increasing parallelism.

Exploit multiple integer and floating-point registers.

We explicitly keep values in local variables to reduce memory bandwidth demands. For example, consider the following 3-point FIR filter code:

while ( ... ) {
  *res++ = filter[0]*signal[0] +
           filter[1]*signal[1] +
           filter[2]*signal[2];
  signal++;
}

The compiler will usually reload the filter values every loop iteration because of potential aliasing with res. We can remove the alias by preloading the filter into local variables that may be mapped into registers:

float f0, f1, f2;
f0 = filter[0]; f1 = filter[1]; f2 = filter[2];
while ( ... ) {
  *res++ = f0*signal[0] + f1*signal[1] + f2*signal[2];
  signal++;
}

Minimize pointer updates by striding with constant offsets.

We replace pointer updates for strided memory addressing with constant array offsets. For example:

f0 = *r8; r8 += 4; f1 = *r8; r8 += 4; f2 = *r8; r8 += 4;

should be converted to:

f0 = r8[0]; f1 = r8[4]; f2 = r8[8]; r8 += 12;

Compilers can fold the constant index into a register plus offset addressing mode.

Hide multiple instruction FPU latency with independent operations.

We use local variables to expose independent operations so they can be executed independently in a pipelined or superscalar processor. For example:

f1 = f5*f9; f2 = f6+f10; f3 = f7*f11; f4 = f8+f12;

Balance the instruction mix.

A balanced instruction mix has a floating-point multiply, a floating-point add, and 1-2 floating-point loads or stores interleaved. It is not worth decreasing the number of multiplies at the expense of additions if the total floating-point operation count increases.
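For instance, a dot-product-style accumulation can be unrolled so that each statement issues roughly one multiply, one add, and two loads. The sketch below is a hypothetical illustration, not PHiPAC-generated code; it assumes an even vector length n >= 2:

/* Hypothetical sketch of a balanced mix: each unrolled statement issues
   one multiply, one add, and two loads that feed the next iteration.
   Assumes n is even and n >= 2. */
float balanced_dot(const float *x, const float *y, int n)
{
  float s0 = 0.0f, s1 = 0.0f;
  float x0 = x[0], y0 = y[0], x1 = x[1], y1 = y[1];
  int i;
  for (i = 2; i < n; i += 2) {
    s0 += x0 * y0; x0 = x[i];     y0 = y[i];     /* mul, add, 2 loads */
    s1 += x1 * y1; x1 = x[i + 1]; y1 = y[i + 1]; /* mul, add, 2 loads */
  }
  s0 += x0 * y0;
  s1 += x1 * y1;
  return s0 + s1;
}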

Increase locality to improve cache performance.

Cached machines benefit from increases in spatial and temporal locality. Whenever possible, we arrange our code to have predominantly unit-stride memory accesses, and try to reuse data once it is in cache. See Section 3.1 for our blocked matrix multiply example.

Convert integer multiplies to adds.

Integer multiplies and divides are slow relative to integer addition. Therefore, we use pointer updates rather than subscript expressions. Rather than:

for (i=...) { row_ptr = &p[i*col_stride]; ... }

we produce:

for (i=...) { ... row_ptr += col_stride; }

Minimize branches, avoid magnitude compares.

Branches are costly, especially on modern superscalar processors. Therefore, we unroll loops to amortize branch cost and use C do {} while (); loop control whenever possible to avoid any unnecessary compiler-produced loop head branches.

Also, on many microarchitectures, it is cheaper to perform equality or inequality loop termination tests than magnitude comparisons. For example, instead of:

for (i=0, a=start_ptr; i<ARRAY_SIZE; i++, a++)
{ ... }

we produce:

end_ptr = &a[ARRAY_SIZE]; a = start_ptr;
do { ... a++; } while (a != end_ptr);

This also removes one loop control variable.

Loop unroll explicitly to expose optimization opportunities.

We unroll loops explicitly to increase opportunities for other performance optimizations. For example, our 3-point FIR filter example above may be further optimized as follows:

float f0, f1, f2, s0, s1, s2;
f0 = filter[0]; f1 = filter[1]; f2 = filter[2];
s0 = signal[0]; s1 = signal[1]; s2 = signal[2];
*res++ = f0*s0 + f1*s1 + f2*s2;
do {
  signal += 3;
  s0 = signal[0]; res[0] = f0*s1 + f1*s2 + f2*s0;
  s1 = signal[1]; res[1] = f0*s2 + f1*s0 + f2*s1;
  s2 = signal[2]; res[2] = f0*s0 + f1*s1 + f2*s2;
  res += 3;
} while ( ... );

In the inner loop, there are now only two memory accesses per five floating point operations, whereas in our unoptimized code there were seven memory accesses per five floating point operations.

3 Matrix Multiply Generator

mm_gen is a generator that produces C code, following the PHiPAC coding guidelines, for one variant of the matrix multiply operation C = α op(A) op(B) + β C, where op(A), op(B), and C are respectively M×K, K×N, and M×N matrices, α and β are scalar parameters, and op(X) is either transpose(X) or just X. Our individual procedures have a lower level interface than a BLAS GEMM and have no error checking. For optimal efficiency, error checking should be performed by the caller when necessary rather than unnecessarily by the callee. We create a full BLAS-compatible GEMM by generating all required matrix multiply variants and linking with our GEMM-compatible interface that includes error checking.

mm_gen produces a cache-blocked matrix multiply [GL89, LRW91, MS95], restructuring the algorithm for unit stride, and reducing the number of cache misses and unnecessary loads and stores. Under control of command line parameters, mm_gen can produce blocking code for any number of levels of memory hierarchy, including register, L1 cache, TLB, L2 cache, and so on. mm_gen's code can also perform copy optimization [LRW91], optionally with a different accumulator precision. The latest version can also generate the innermost loop with various forms of software pipelining.

A typical invocation of mm_gen is:

mm_gen -cb M0 K0 N0 [ -cb M1 K1 N1 ] ...

where the register blocking is M0, K0, N0, the L1-cache blocking is M1, K1, N1, etc. The parameters M0, K0, and N0 are specified in units of matrix elements, i.e., single, double, or extended precision floating-point numbers, M1, K1, N1 are specified in units of register blocks, M2, K2, and N2 are in units of L1 cache blocks, and so on. For a particular cache level, say i, the code accumulates into a C destination block of size Mi×Ni units and uses A source blocks of size Mi×Ki units and B source blocks of size Ki×Ni units (see Figure 1).

Figure 1: Matrix blocking parameters. (Diagram: the M×K matrix A, the K×N matrix B, and the M×N matrix C subdivided into register-level blocks of size M0×K0, K0×N0, and M0×N0, which are in turn grouped into L1-level blocks of size (M0*M1)×(K0*K1), (K0*K1)×(N0*N1), and (M0*M1)×(N0*N1).)
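For instance, an illustrative invocation (the block sizes here are chosen only for the example) requesting a 2×2×2 register block nested inside a 16×16×16 L1 block would be:

mm_gen -cb 2 2 2 -cb 16 16 16

Following the unit conventions above, the L1-level blocks in this case span 2*16 = 32 matrix elements in each dimension.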

3.1 Matrix Multiply Code

In this section, we examine the code produced by mm_gen for the operation C = C + A*B where A (respectively B, C) is an M×K (respectively K×N, M×N) matrix. Figure 2 lists the L1 cache blocking core code comprising the 3 nested loops, M, N, and K. mm_gen does not vary the loop permutation [MS95, LRW91] because the resulting gains in locality are subsumed by the method described below.

The outer M loop in Figure 2 maintains pointers c0 and a0 to rows of register blocks in the A and C matrices. It also maintains end pointers (ap0_endp and cp0_endp) used for loop termination. The middle N loop maintains a pointer b0 to columns of register blocks in the B matrix, and maintains a pointer cp0 to the current C destination register block. The N loop also maintains separate pointers (ap0_0 through ap0_(M0-1)) to successive rows of the current A source block. It also initializes a pointer bp0 to the current B source block. We assume local variables can be held in registers, so our code uses many pointers to minimize both memory references and integer multiplies.

The K loop iterates over source matrix blocks and accumulates into the same M0×N0 destination block. We assume that the floating-point registers can hold an M0×N0 accumulator block, so this block is loaded once before the K loop begins and stored after it ends. The K loop updates the set of pointers to the A source block, one of which is used for loop termination.

The parameter K0 controls the extent of inner loop unrolling as can be seen in Figure 2. The unrolled core loop performs K0 outer products accumulating into the C destination block. We code the outer products by loading one row of the B block, one element of the A block, then performing N0 multiply-accumulates. The C code uses N0+M0 memory references per 2*N0*M0 floating-point operations in the inner K loop (for the M0 = N0 = 2 code of Figure 2, four memory references per eight flops), while holding M0*N0 + N0 + 1 values in local variables. While the intent is that these local variables map to registers, the compiler is free to reorder all of the independent loads and multiply-accumulates to trade increased memory references for reduced register usage. The compiler also requires additional registers to name intermediate results propagating through machine pipelines.

The code we have so far described is valid only when M, K, and N are integer multiples of M0, K0, and N0 respectively. In the general case, mm_gen also includes code that operates on power-of-two sized fringe strips, i.e., 2^0 through 2^⌊log2 L⌋ where L is M0, K0, or N0. We can therefore manage any fringe size from 1 to L-1 by executing an appropriate combination of fringe code. The resulting code size growth is logarithmic in the register blocking (i.e., O(log M0 log K0 log N0)) yet maintains good performance. To reduce the demands on the instruction cache, we arrange the code into several independent sections, the first handling the matrix core and the remainder handling the fringes.
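As an illustration of how power-of-two fringe code can be combined, the sketch below is a hypothetical dispatch (not actual mm_gen output; the fringe_rows_* prototypes stand in for generated strip routines) handling a leftover strip of m rows with 0 < m < M0 = 8:

/* Hypothetical sketch: handle an m-row fringe (0 < m < 8) by combining
   power-of-two fringe routines for 4-, 2-, and 1-row strips.  The
   fringe_rows_* prototypes stand in for generated code. */
void fringe_rows_4(const float*, const float*, float*, int, int, int, int, int);
void fringe_rows_2(const float*, const float*, float*, int, int, int, int, int);
void fringe_rows_1(const float*, const float*, float*, int, int, int, int, int);

void handle_row_fringe(int m, const float *A, const float *B, float *C,
                       int K, int N, int Astride, int Bstride, int Cstride)
{
  if (m & 4) { fringe_rows_4(A, B, C, K, N, Astride, Bstride, Cstride);
               A += 4*Astride; C += 4*Cstride; }
  if (m & 2) { fringe_rows_2(A, B, C, K, N, Astride, Bstride, Cstride);
               A += 2*Astride; C += 2*Cstride; }
  if (m & 1) { fringe_rows_1(A, B, C, K, N, Astride, Bstride, Cstride); }
}

A 5-row fringe, for example, would then be handled as a 4-row strip followed by a 1-row strip.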

The latest version of the generator can optionally produce code with a software-pipelined inner loop. Each outer product consists of a load, a multiply, and an accumulate section. We group these sections into three software pipelined code variants: two two-stage pipes with stages [load-multiply, accumulate] and [load, multiply-accumulate], and a three-stage pipe with stages [load, multiply, accumulate]. Recent results show that this software pipelining can result in an appreciable speedup.

Because of the separation between matrix dimension and matrix stride, we can implement higher levels of cache blocking as calls to lower level routines with appropriately sized sub-matrices.

void
mul_mfmf_mf(const int M, const int K, const int N,
            const float *const A, const float *const B, float *const C,
            const int Astride, const int Bstride, const int Cstride)
{
  const float *a, *b; float *c;
  const float *ap_0, *ap_1; const float *bp; float *cp;
  const int A_sbs_stride = Astride*2;
  const int C_sbs_stride = Cstride*2;

  const int k_marg_el = K & 1;
  const int k_norm = K - k_marg_el;
  const int m_marg_el = M & 1;
  const int m_norm = M - m_marg_el;
  const int n_marg_el = N & 1;
  const int n_norm = N - n_marg_el;

  float *const c_endp = C + m_norm*Cstride;
  register float c0_0, c0_1, c1_0, c1_1;

  c = C; a = A;
  do { /* M loop */
    const float* const ap_endp = a + k_norm;
    float* const cp_endp = c + n_norm;
    const float* const apc_1 = a + Astride;
    b = B; cp = c;
    do { /* N loop */
      register float _b0, _b1;
      register float _a0, _a1;
      float *_cp;

      ap_0 = a; ap_1 = apc_1; bp = b;

      /* load the 2x2 C accumulator block */
      _cp = cp;       c0_0 = _cp[0]; c0_1 = _cp[1];
      _cp += Cstride; c1_0 = _cp[0]; c1_1 = _cp[1];

      do { /* K loop: two unrolled 2x2 outer products */
        _b0 = bp[0]; _b1 = bp[1];

        bp += Bstride; _a0 = ap_0[0];
        c0_0 += _a0*_b0; c0_1 += _a0*_b1;
        _a1 = ap_1[0];
        c1_0 += _a1*_b0; c1_1 += _a1*_b1;

        _b0 = bp[0]; _b1 = bp[1];

        bp += Bstride; _a0 = ap_0[1];
        c0_0 += _a0*_b0; c0_1 += _a0*_b1;
        _a1 = ap_1[1];
        c1_0 += _a1*_b0; c1_1 += _a1*_b1;

        ap_0 += 2; ap_1 += 2;
      } while (ap_0 != ap_endp);

      /* store the accumulator block back to C */
      _cp = cp;       _cp[0] = c0_0; _cp[1] = c0_1;
      _cp += Cstride; _cp[0] = c1_0; _cp[1] = c1_1;

      b += 2; cp += 2;
    } while (cp != cp_endp);

    c += C_sbs_stride; a += A_sbs_stride;
  } while (c != c_endp);
}

Figure 2: M0 = 2, K0 = 2, N0 = 2 matrix multiply L1 routine for M ∈ {2m : m ≥ 1}, K ∈ {2k : k ≥ 1}, N ∈ {2n : n ≥ 1}. Within the K-loop is our fully-unrolled 2×2×2 core matrix multiply. The code is not unlike the register code in [CFH95]. In our terminology, the leading dimensions LDA, LDB, and LDC are called Astride, Bstride, and Cstride respectively. The four local variables c0_0 through c1_1 hold a complete C destination block. Variables ap_0 and ap_1 point to successive rows of the A source matrix block, and variable bp points to the first row of the B source matrix block. Elements in A and B are accessed using constant offsets from the appropriate pointers.
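To make the dimension/stride separation concrete, the sketch below (a simplified illustration, not mm_gen output) adds one more level of blocking by calling the routine of Figure 2 on sub-matrices; it assumes the block sizes MB, KB, and NB are even and divide M, K, and N exactly, so no fringe code is needed:

/* Simplified sketch (not mm_gen output): one more level of blocking,
   implemented as calls to the Figure 2 routine on sub-matrices.  The
   strides are those of the full matrices, so a sub-matrix is selected
   by a pointer offset alone.  Assumes MB, KB, NB are even and divide
   M, K, N exactly. */
void mul_blocked(const int M, const int K, const int N,
                 const float *const A, const float *const B, float *const C,
                 const int Astride, const int Bstride, const int Cstride,
                 const int MB, const int KB, const int NB)
{
  int i, j, k;
  for (i = 0; i < M; i += MB)
    for (j = 0; j < N; j += NB)
      for (k = 0; k < K; k += KB)
        mul_mfmf_mf(MB, KB, NB,
                    A + i*Astride + k,
                    B + k*Bstride + j,
                    C + i*Cstride + j,
                    Astride, Bstride, Cstride);
}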

4 Matrix Multiply Search Scripts

The search script takes parameters describing the machine architecture, including the number of integer and floating-point registers and the sizes of each level of cache. For each combination of generator parameters and compilation options, the matrix multiply search script calls the generator, compiles the resulting routine, links it with timing code, and benchmarks the resulting executable.
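As a rough sketch of the benchmarking step (our timing libraries differ in detail; the driver below is only an illustration, with the operand initialization and row-major strides chosen arbitrarily), the timing code can measure repeated calls to a generated routine and report MFLOPS, counting 2*M*K*N flops per multiply:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void mul_mfmf_mf(const int M, const int K, const int N,
                 const float *const A, const float *const B, float *const C,
                 const int Astride, const int Bstride, const int Cstride);

/* Illustrative timing driver: benchmark one matrix size and report MFLOPS. */
double benchmark_mflops(int M, int K, int N, int iters)
{
  float *A = malloc(M * K * sizeof(float));
  float *B = malloc(K * N * sizeof(float));
  float *C = calloc(M * N, sizeof(float));
  clock_t start, stop;
  double secs;
  int i;

  for (i = 0; i < M * K; i++) A[i] = 1.0f;
  for (i = 0; i < K * N; i++) B[i] = 1.0f;

  start = clock();
  for (i = 0; i < iters; i++)
    mul_mfmf_mf(M, K, N, A, B, C, K, N, N);  /* row-major strides */
  stop = clock();
  secs = (double)(stop - start) / CLOCKS_PER_SEC;

  free(A); free(B); free(C);
  return (2.0 * M * K * N * iters) / (secs * 1.0e6);
}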

To produce a complete BLAS GEMM routine, we find separate parameters for each of the three cases A×B, A^T×B, and A×B^T (A^T×B^T has code identical to A×B). For each case, we first find the best register (or L0) parameters for in-L1-cache matrices, then find the best L1 parameters for in-L2-cache matrices, etc. While this strategy is not guaranteed to find the best L0 core for out-of-L1-cache matrices, the resulting cores have performed well in practice.

4.1 Register (L0) Parameter Search

The register block search evaluates all combinations of M0 and N0 where 1 ≤ M0*N0 ≤ NR and where NR is the number of machine floating-point registers. We search the above for 1 ≤ K0 ≤ K0max where K0max = 20 but is adjustable. Empirically, K0 > 20 has never shown appreciable benefit.
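A minimal sketch of this enumeration is shown below (an illustration only: the register count NR = 16 and the output format are assumptions, and the real search scripts, written in perl, also vary compiler options and software-pipelining choices):

#include <stdio.h>

#define NR    16   /* number of FP registers (machine-dependent assumption) */
#define K0MAX 20

/* Print one candidate mm_gen invocation per (M0, N0, K0) register blocking
   satisfying 1 <= M0*N0 <= NR and 1 <= K0 <= K0MAX. */
int main(void)
{
  int m0, n0, k0;
  for (m0 = 1; m0 <= NR; m0++)
    for (n0 = 1; m0 * n0 <= NR; n0++)
      for (k0 = 1; k0 <= K0MAX; k0++)
        printf("mm_gen -cb %d %d %d\n", m0, k0, n0);
  return 0;
}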

Our initial strategy [BAD+96] benchmarked a set of square matrices that fit in L1 cache. We then chose the L0 parameters that achieved the highest performance. While this approach gave good performance, the searches were time consuming.

We noted that the majority of the computation, especially for larger matrices, is performed by the core M0×K0×N0 register blocked code. Our newer search strategy, therefore, produces code containing only the core, which decreases compile time, and for each L0 parameter set, we benchmark only a single matrix multiply with size M = M0, N = N0, and K = kK0. The parameter k is chosen such that the three matrices are no larger than L1 cache (we call this a "fat" dot-product). While this case ignores the cost of the M- and N-loop overhead, we have so far found this approach to produce comparable quality code in much less time than the previous strategy (e.g., 5 hours vs. 24 hours on the SGI Indigo R4K).

We initially thought the order in which we searched the L0 parameter space could have an effect on search time and evaluated orders such as i-j-k, random, best-first, and simulated annealing. We found, however, that the added search complexity outweighed any benefit so we settled on a random strategy.

4.2 Cache Block Search

We perform the L1 cache blocking search after the best register blocking is known. We would like to make the L1 blocks large to increase data reuse but larger L1 blocks increase the probability of cache conflicts [LRW91]. Tradeoffs between M- and N-loop overheads, memory access patterns, and TLB structure also affect the best L1 size. We currently perform a relatively simple search of the L1 parameter space. For the D×D square case, we search the neighborhood centered at 3D² = L1, where L1 is the L1 cache size in elements. We set M1 to the values ⌊φD/M0⌋ where φ ∈ {0.25, 0.5, 1.0, 1.5, 2.0} and D = √(L1/3). K1 and N1 are set similarly. We benchmark the resulting 125 combinations with matrix sizes that either fit in L2 cache, or are within some upper bound if no L2 cache exists. The L2 cache blocking search, when necessary, is performed in a similar way.
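As a worked example (the cache size is an assumption chosen only for illustration): with a 16 KB L1 cache holding L1 = 4096 single-precision elements and the Figure 2 register blocking M0 = 2, D = √(4096/3) ≈ 36.95, so the candidate M1 values ⌊φD/M0⌋ are 4, 9, 18, 27, and 36. The snippet below reproduces the computation:

#include <math.h>
#include <stdio.h>

/* Worked example of the L1 candidate generation.  The 16 KB (4096
   single-precision element) cache size and M0 = 2 are assumptions
   chosen only for illustration. */
int main(void)
{
  const double phi[5] = { 0.25, 0.5, 1.0, 1.5, 2.0 };
  const int L1 = 4096, M0 = 2;
  const double D = sqrt(L1 / 3.0);
  int i;
  for (i = 0; i < 5; i++)
    printf("M1 candidate: %d\n", (int)floor(phi[i] * D / M0));
  return 0;
}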

5 Results

We ran the search scripts to find the best register and L1 cache blocking parameters for six commercial workstation systems. These systems have different instruction set architectures and widely varying microarchitectures and memory hierarchies. The results are summarized in Table 1.

The SGI R8K and R10K searches used the newer version of the code generator and search scripts, while other results were obtained with the alpha release. Figures 3-6 plot the performance of the resulting routines for all square matrices M = K = N = D, where D runs over powers of 2 and 3, multiples of 10, and primes, up to a maximum of 300. We compare with the performance of a vendor-optimized BLAS GEMM where available. In each case, PHiPAC yields a substantial fraction of peak performance and is competitive with vendor BLAS. Due to limited availability, we could only perform an incomplete search on the R8K and R10K, and so these are preliminary performance numbers. There is also potential for improvement on the other machines when we rerun with the newer version. For completeness, we also show the poor performance obtained when compiling a simple three nested loop version of GEMM with FORTRAN or C optimizing compilers.

The PHiPAC methodology can also improve performance even if there is no scope for memory blocking. In Figure 9 we plot the performance of a dot product code generated using PHiPAC techniques. Although the parameters used were obtained using a short manual search, we can see that we are competitive with the assembly-coded SGI BLAS SDOT.

In some of the plots, we see that PHiPAC routines suffer from cache conflicts. Our measurements exaggerate this effect by including all power-of-2 sized matrices, and by allocating all regions contiguously in memory. For matrix multiply, we can reduce cache conflicts by copying to contiguous memory when pathological strides are encountered [LRW91]. Unfortunately, this approach does not help dot product. One drawback of the PHiPAC approach is that we cannot control the order in which compilers schedule independent loads. We've occasionally found that exchanging two loads in the assembly output for dot product can halve the number of cache misses where conflicts occur, without otherwise impacting performance.

Figure 3: Performance of single precision matrix multiply on a Sparcstation-20/61. (Plot of MFLOPS versus square matrix size, with curves for PHiPAC and for C, 3 nested loops.)

Figure 4: Performance of single precision matrix multiply on a 100 MHz SGI Indigo R4K. We show the SGEMM from SGI's libblas_mips2_serial.a library. (Plot of MFLOPS versus square matrix size, with curves for PHiPAC, the SGI R4k assembly libblas_mips2_serial.a SGEMM, and C, 3 nested loops.)

Figure 5: Performance of double precision matrix multiply on an HP 712/80i. We show DGEMM from the pa1.1 version of libvec.a in HP's compiler distribution. (Plot of MFLOPS versus square matrix size, with curves for PHiPAC, the vendor DGEMM, and FORTRAN, 3 nested loops.)

                         Sun              HP             IBM              SGI            SGI              SGI
                         Sparc-20/61      712/80i        RS/6000-590      Indigo R4K     Challenge        Octane
Processor                SuperSPARC+      PA7100LC       RIOS-2           R4K            R8K              R10K
Frequency (MHz)          60               80             66.5             100            90               195
Max Instructions/cycle   3                2              6                1              4                4
Peak MFLOPS (32b/64b)    60/60            160/80         266/266          67/50          360/360          390/390
FP registers (32b/64b)   32/16            64/32          32/32            16/16          32/32            32/32
L1 Data cache (KB)       16               128            256              8              -                32
L2 Data cache (KB)       1024             -              -                1024           4096             1024
OS                       SunOS 4.1.3      HP-UX 9.05     AIX 3.2          Irix 4.0.5H    IRIX6.2          IRIX6.4
C Compiler               Sun acc 2.0.1    HP c89 9.61    IBM xlc 1.3      SGI cc 3.10    SGI cc           SGI cc
Search results:
PHiPAC version           alpha            alpha          alpha            alpha          new              new
Precision                32b              64b            64b              32b            64b              64b
M0, K0, N0               2,4,10           3,1,2          2,1,10           2,10,3         2,4,14           4,2,6
M1, K1, N1               26,10,4          30,60,30       105,70,28        30,4,10        200,80,25        12,24,9
CFLAGS                   -fast            -O             -O3 -qarch=pwr2  -O2 -mips2     -n32 -mips4 -O3

Table 1: System details and results of matrix multiply parameter search.

Figure 6: Performance of double precision matrix multiply on an IBM RS/6000-590. We show the DGEMM from IBM's POWER2-optimized ESSL library. (Plot of MFLOPS versus square matrix size, with curves for PHiPAC, the ESSL DGEMM, and FORTRAN, 3 nested loops.)

Figure 7: Preliminary performance of double precision matrix multiply on an SGI R8K Power Challenge. We show the DGEMM from SGI's R8K-optimized libblas. (Plot of MFLOPS versus square matrix size, with curves for PHiPAC, the vendor DGEMM, and FORTRAN, 3 nested loops.)

Figure 8: Preliminary performance of double precision matrix multiply on an SGI R10K Octane. We show the DGEMM from SGI's R10K-optimized libblas. (Plot of MFLOPS versus square matrix size, with curves for PHiPAC, the vendor DGEMM, and FORTRAN, 3 nested loops.)

Figure 9: Performance of single precision unit-stride dot-product on a 100 MHz SGI R4K. (Plot of MFLOPS versus vector length on a log scale, with curves for PHiPAC, the SGI libblas_mips2_serial.a SDOT, and C, simple loop.)

6 Status, Availability, and Future Work

This paper has demonstrated our ability to write portable, high performance ANSI C code for matrix multiply using parameterized code generators and a timing-driven search strategy.

The PHiPAC alpha release contains the matrix multiply generator, the naive search scripts written in perl, and our timing libraries. We have created a Web site from which the alpha release is available and on which we plan to list blocking parameters for many systems [BAD+]. We are currently working on a better L1 blocking strategy and accompanying methods for search based on various criteria [LRW91]. The PHiPAC GEMM can be used with Bo Kågström's GEMM-based BLAS3 package [BLL93] and LAPACK [ABB+92].

We have also written parameterized generators for matrix-vector and vector-matrix multiply, dot product, AXPY, convolution, and outer-product, and further generators, such as for FFT, are planned.

We wish to thank Ed Rothberg of SGI for help obtaining the R8K and R10K performance plots. We also wish to thank Nelson Morgan who provided initial impetus for this project and Dominic Lam for work on the initial search scripts.

References

[ABB+92] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK users' guide, release 1.0. SIAM, Philadelphia, 1992.

[ACF95] B. Alpern, L. Carter, and J. Ferrante. Space-limited procedures: A methodology for portable high-performance. In International Working Conference on Massively Parallel Programming Models, 1995.

[AGZ94] R. Agarwal, F. Gustavson, and M. Zubair. IBM Engineering and Scientific Subroutine Library, Guide and Reference, 1994. Available through IBM branch offices.

[BAD+] J. Bilmes, K. Asanović, J. Demmel, D. Lam, and C.W. Chin. The PHiPAC WWW home page. http://www.icsi.berkeley.edu/~bilmes/phipac.

[BAD+96] J. Bilmes, K. Asanović, J. Demmel, D. Lam, and C.W. Chin. PHiPAC: A portable, high-performance, ANSI C coding methodology and its application to matrix multiply. LAPACK working note 111, University of Tennessee, 1996.

[BLL93] B. Kågström, P. Ling, and C. Van Loan. Portable high performance GEMM-based level 3 BLAS. In R.F. Sincovec et al., editor, Parallel Processing for Scientific Computing, pages 339-346, Philadelphia, 1993. SIAM Publications.

[BLS91] D. H. Bailey, K. Lee, and H. D. Simon. Using Strassen's algorithm to accelerate the solution of linear systems. J. Supercomputing, 4:97-371, 1991.

[CDD+96] J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, and R.C. Whaley. ScaLAPACK: A portable linear algebra library for distributed memory computers - design issues and performance. LAPACK working note 95, University of Tennessee, 1996.

[CFH95] L. Carter, J. Ferrante, and S. Flynn Hummel. Hierarchical tiling for improved superscalar performance. In International Parallel Processing Symposium, April 1995.

[DCDH90] J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Soft., 16(1):1-17, March 1990.

[DCHH88] J. Dongarra, J. Du Croz, S. Hammarling, and R.J. Hanson. An extended set of FORTRAN basic linear algebra subroutines. ACM Trans. Math. Soft., 14:1-17, March 1988.

[GL89] G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1989.

[KHM94] C. Kamath, R. Ho, and D.P. Manley. DXML: A high-performance scientific subroutine library. Digital Technical Journal, 6(3):44-56, Summer 1994.

[LHKK79] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh. Basic linear algebra subprograms for FORTRAN usage. ACM Trans. Math. Soft., 5:308-323, 1979.

[LRW91] M. S. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of ASPLOS IV, pages 63-74, April 1991.

[MS95] J.D. McCalpin and M. Smotherman. Automatic benchmark generation for cache optimization of matrix algorithms. In R. Geist and S. Junkins, editors, Proceedings of the 33rd Annual Southeast Conference, pages 195-204. ACM, March 1995.

[SMP+96] R. Saavedra, W. Mao, D. Park, J. Chame, and S. Moon. The combined effectiveness of unimodular transformations, tiling, and software prefetching. In Proceedings of the 10th International Parallel Processing Symposium, April 15-19, 1996.

[WL91] M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, pages 30-44, June 1991.

[Wol96] M. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.