A Framework for Practical Parallel Fast Matrix Multiplication

Austin R. Benson, Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA, USA ([email protected])

Grey Ballard, Sandia National Laboratories, Livermore, CA, USA ([email protected])

Abstract

Matrix multiplication is a fundamental computation in many scientific disciplines. In this paper, we show that novel fast matrix multiplication algorithms can significantly outperform vendor implementations of the classical algorithm and Strassen's fast algorithm on modest problem sizes and shapes. Furthermore, we show that the best choice of fast algorithm depends not only on the size of the matrices but also the shape. We develop a code generation tool to automatically implement multiple sequential and shared-memory parallel variants of each fast algorithm, including our novel parallelization scheme. This allows us to rapidly benchmark over 20 fast algorithms on several problem sizes. Furthermore, we discuss a number of practical implementation issues for these algorithms on shared-memory machines that can direct further research on making fast algorithms practical.

Categories and Subject Descriptors: G.4 [Mathematical software]: Efficiency; G.4 [Mathematical software]: Parallel and vector implementations

Keywords: fast matrix multiplication, dense linear algebra, parallel linear algebra, shared memory

1. Introduction

Matrix multiplication is one of the most fundamental computations in numerical linear algebra and scientific computing. Consequently, the computation has been extensively studied in parallel computing environments (cf. [3, 20, 34] and references therein). In this paper, we show that fast algorithms for matrix-matrix multiplication can achieve higher performance on sequential and shared-memory parallel architectures for modestly sized problems. By fast algorithms, we mean ones that perform asymptotically fewer floating point operations and communicate asymptotically less data than the classical algorithm. We also provide a code generation framework to rapidly implement sequential and parallel versions of over 20 fast algorithms. Our performance results in Section 5 show that several fast algorithms can outperform the Intel Math Kernel Library (MKL) dgemm (double precision general matrix-matrix multiplication) routine and Strassen's algorithm [32]. In parallel implementations, fast algorithms can achieve a speedup of 5% over Strassen's original fast algorithm and greater than 15% over MKL.

However, fast algorithms for matrix multiplication have largely been ignored in practice. For example, numerical libraries such as Intel's MKL [19], AMD's Core Math Library (ACML) [1], and the Cray Scientific Libraries package (LibSci) [8] do not provide implementations of fast algorithms, though we note that IBM's Engineering and Scientific Subroutine Library (ESSL) [18] does include Strassen's algorithm. Why is this the case? First, users of numerical libraries typically consider fast algorithms to be of only theoretical interest and never practical for reasonable problem sizes. We argue that this is not the case with our performance results in Section 5. Second, fast algorithms do not provide the same numerical stability guarantees as the classical algorithm. In practice, there is some loss in precision in the fast algorithms, but they are not nearly as bad as the worst-case guarantees [14, 27]. Third, the LINPACK benchmark (footnote 1) used to rank supercomputers by performance forbids fast algorithms. We suspect that this has driven effort away from the study of fast algorithms.

Strassen's algorithm is the most well known fast algorithm, but this paper explores a much larger class of recursive fast algorithms based on different base case dimensions. We review these algorithms and methods for constructing them in Section 2. The structure of these algorithms makes them amenable to code generation, and we describe this process and other performance tuning considerations in Section 3. In Section 4, we describe three different methods for parallelizing fast matrix multiplication algorithms on shared-memory machines. Our code generator implements all three parallel methods for each fast algorithm. We evaluate the sequential and parallel performance characteristics of the various algorithms and implementations in Section 5 and compare them with MKL's implementation of the classical algorithm as well as an existing implementation of Strassen's algorithm.

The goal of this paper is to help bridge the gap between theory and practice of fast matrix multiplication algorithms. By introducing our tool of automatically translating a fast matrix multiplication algorithm to high performance sequential and parallel implementations, we enable the rapid prototyping and testing of theoretical developments in the search for faster algorithms. We focus the attention of theoretical researchers on what algorithmic characteristics matter most in practice, and we demonstrate to practical researchers the utility of several existing fast algorithms besides Strassen's, motivating further effort towards high performance implementations of those that are most promising. Our contributions are summarized as follows:

1 http://www.top500.org

PPoPP'15, February 7-11, 2015, San Francisco, CA, USA. Copyright 2015 ACM 978-1-4503-3205-7/15/02. http://dx.doi.org/10.1145/2688500.2688513

• By using new fast matrix multiplication algorithms, we achieve better performance than Intel MKL's dgemm, both sequentially and with 6 and 24 cores on a shared-memory machine.

• We demonstrate that, in order to achieve the best performance for matrix multiplication, the choice of fast algorithm depends on the size and shape of the matrices. Our new fast algorithms outperform Strassen's on the multiplication of rectangular matrices.

• We show how to use code generation techniques to rapidly implement sequential and shared-memory parallel fast matrix multiplication algorithms.

• We provide a new hybrid parallel algorithm for shared-memory fast matrix multiplication.

• We implement a fast matrix multiplication algorithm with asymptotic complexity O(N^2.775) for square N × N matrices (discovered by Smirnov [31]). In terms of asymptotic complexity, this is the fastest matrix multiplication algorithm implementation to date. However, our performance results show that this algorithm is not practical for the problem sizes that we consider.

Overall, we find that Strassen's algorithm is hard to beat for square matrix multiplication, both in serial and in parallel. However, for rectangular matrices (which occur more frequently in practice), other fast algorithms can perform much better. The structure of the fast algorithms that perform well tends to "match the shape" of the matrices, an idea that we will make clear in Section 5. We also find that bandwidth is a limiting factor towards scalability in shared-memory parallel implementations of fast algorithms. Our parallel implementations of fast algorithms suffer when memory bandwidth does not scale linearly with the number of cores. Finally, we find that algorithms that are theoretically fast in terms of asymptotic complexity do not perform well on problems of modest size that we consider on shared-memory parallel architectures. We discuss these conclusions in more detail in Section 6.

Finally, all of the software used for this paper is available at https://github.com/arbenson/fast-matmul.

1.1 Related Work

There are several sequential implementations of Strassen's fast matrix multiplication algorithm [2, 11, 17], and parallel versions have been implemented for both shared-memory [9, 25] and distributed-memory architectures [3, 13]. For our parallel algorithms in Section 4, we use the ideas of breadth-first and depth-first traversals of the recursion trees, which were first considered by Kumar et al. [25] and Ballard et al. [3] for minimizing memory footprint and communication.

Apart from Strassen's algorithm, a number of fast matrix multiplication algorithms have been developed, but only a small handful have been implemented. Furthermore, these implementations have only been sequential. Hopcroft and Kerr showed how to construct recursive fast algorithms where the base case is multiplying a p × 2 by a 2 × n matrix [16]. Bini et al. introduced the concept of arbitrary precision approximate (APA) algorithms for matrix multiplication and demonstrated a method for multiplying 3 × 2 by 2 × 2 matrices, which leads to a general square matrix multiplication APA algorithm that is asymptotically faster than Strassen's [5]. Schönhage also developed an APA algorithm that is asymptotically faster than Strassen's, based on multiplying square 3 × 3 matrices [30]. These APA algorithms suffer from severe numerical issues: both lose at least half the digits of accuracy with each recursive step. While no exact solution can have the same complexity as Bini's algorithm [16], it is still an open question if there exists an exact fast algorithm with the same complexity as Schönhage's. Pan used factorization of trilinear forms and a base case of 70 × 70 square matrix multiplication to construct an exact algorithm asymptotically faster than Strassen's algorithm [29]. This algorithm was implemented by Kaporin [22], and the running time was competitive with Strassen's algorithm on a sequential machine. Recently, Smirnov presented optimization tools for finding many fast algorithms based on factoring bilinear forms [31], and we will use these tools for finding our own algorithms in Section 2.

There are several lines of theoretical research (cf. [12, 36] and references therein) that prove existence of fast APA algorithms with much better asymptotic complexity than the algorithms considered here. Unfortunately, there remains a large gap between the substantial theoretical work and what we can practically implement.

Renewed interest in the practicality of Strassen's and other fast algorithms is motivated by the observation that not only is the arithmetic cost reduced when compared to the classical algorithm, the communication costs also improve asymptotically [4]. That is, as the relative cost of moving data throughout the memory hierarchy and between processors increases, we can expect the benefits of fast algorithms to grow accordingly. We note that communication lower bounds [4] apply to all the algorithms presented in this paper, and in most cases they are attained by the implementations used here.

1.2 Notation and Tensor Preliminaries

We briefly review basic tensor preliminaries, following the notation of Kolda and Bader [24]. Scalars are represented by lowercase Roman or Greek letters (a), vectors by lowercase boldface (x), matrices by uppercase boldface (A), and tensors by boldface Euler script letters (T). For a matrix A, we use a_k and a_ij to denote the kth column and the (i, j) entry, respectively. A tensor is a multi-dimensional array, and in this paper we deal exclusively with order-3, real-valued tensors; i.e., T ∈ R^{I×J×K}. The kth frontal slice of T is T_k = t_{:,:,k} ∈ R^{I×J}. For u ∈ R^I, v ∈ R^J, w ∈ R^K, we define the outer product tensor T = u ∘ v ∘ w ∈ R^{I×J×K} with entries t_{ijk} = u_i v_j w_k. Addition of tensors is defined entry-wise. The rank of a tensor T is the minimum number of rank-one tensors that generate T as their sum. Decompositions of the form T = Σ_{r=1..R} u_r ∘ v_r ∘ w_r lead to fast matrix multiplication algorithms (Section 2.2), and we use ⟦U, V, W⟧ to denote the decomposition, where U, V, and W are matrices with R columns given by u_r, v_r, and w_r. Of the various flavors of products involving tensors, we will need to know that, for a ∈ R^I and b ∈ R^J, T ×_1 a ×_2 b = c ∈ R^K, with c_k = a^T T_k b, or c_k = Σ_{i=1..I} Σ_{j=1..J} t_{ijk} a_i b_j.

2. Fast Matrix Multiplication

We now review the preliminaries for fast matrix multiplication algorithms. In particular, we focus on factoring tensor representations of bilinear forms, which will facilitate the discussion of the implementation in Sections 3 and 4.

2.1 Recursive Multiplication

Matrices are self-similar, i.e., a submatrix is also a matrix. Arithmetic with matrices is closely related to arithmetic with scalars, and we can build recursive matrix multiplication algorithms by manipulating submatrix blocks. For example, consider multiplying C = A · B,

  [ C11 C12 ]   [ A11 A12 ]   [ B11 B12 ]
  [ C21 C22 ] = [ A21 A22 ] · [ B21 B22 ],

where we have partitioned the matrices into four submatrices. Throughout this paper, we denote the block multiplication of M × K and K × N matrices by ⟨M, K, N⟩. Thus, the above computation is ⟨2, 2, 2⟩.

Multiplication with the classical algorithm proceeds by combining a set of eight matrix multiplications with four matrix additions:

  M1 = A11 · B11    M2 = A12 · B21    M3 = A11 · B12    M4 = A12 · B22
  M5 = A21 · B11    M6 = A22 · B21    M7 = A21 · B12    M8 = A22 · B22
  C11 = M1 + M2     C12 = M3 + M4     C21 = M5 + M6     C22 = M7 + M8

The multiplication to form each Mi is recursive and the base case is scalar multiplication. The number of flops performed by the classical algorithm for N × N matrices, where N is a power of two, is 2N^3 − N^2.

The idea of fast matrix multiplication algorithms is to perform fewer recursive matrix multiplications at the expense of more matrix additions. Since matrix multiplication is asymptotically more expensive than matrix addition, this tradeoff results in faster algorithms. The most well known fast algorithm is due to Strassen, and follows the same block structure:

  S1 = A11 + A22    S2 = A21 + A22    S3 = A11          S4 = A22
  S5 = A11 + A12    S6 = A21 − A11    S7 = A12 − A22
  T1 = B11 + B22    T2 = B11          T3 = B12 − B22    T4 = B21 − B11
  T5 = B22          T6 = B11 + B12    T7 = B21 + B22
  Mr = Sr · Tr,  1 ≤ r ≤ 7
  C11 = M1 + M4 − M5 + M7    C12 = M3 + M5
  C21 = M2 + M4              C22 = M1 − M2 + M3 + M6

We have explicitly written out terms like T2 = B11 to hint at the generalizations provided in Section 2.2. Strassen's algorithm uses 7 matrix multiplications and 18 matrix additions. The number of flops performed by the algorithm is 7N^{log2 7} − 6N^2 = O(N^2.81), when we assume a base case of N = 1.

There are natural extensions to Strassen's algorithm. We might try to find an algorithm using fewer than 7 multiplications; unfortunately, we cannot [37]. Alternatively, we could try to reduce the number of additions. This leads to the Strassen-Winograd algorithm, which reduces the 18 additions down to 15. We explore such methods in Section 3.3. We can also improve the constant on the leading term by choosing a bigger base case dimension (and using the classical algorithm for the base case). This turns out not to be important in practice because the base case will be chosen to optimize performance rather than flop count. Lastly, we can use blocking schemes apart from ⟨2, 2, 2⟩, which we explain in the remainder of this section. This leads to a host of new algorithms, and we show in Section 5 that they are often faster in practice.

2.2 Fast Algorithms as Low-Rank Tensor Decompositions

The approach we use to devise fast algorithms exploits an important connection between matrix multiplication (and other bilinear forms) and tensor computations. We detail the connection in this section for completeness; see [7, 23] for earlier explanations.

A bilinear form on a pair of finite-dimensional vector spaces is a function that maps a pair of vectors to a scalar and is linear in each of its inputs separately. A bilinear form B(x, y) can be represented by a matrix D of coefficients: B(x, y) = x^T D y = Σ_i Σ_j d_ij x_i y_j, where we note that x and y may have different dimensions. In order to describe a set of K bilinear forms B_k(x, y) = z_k, 1 ≤ k ≤ K, we can use a three-way tensor T of coefficients:

  z_k = Σ_{i=1..I} Σ_{j=1..J} t_{ijk} x_i y_j,    (1)

or, in more succinct tensor notation, z = T ×_1 x ×_2 y.

2.2.1 Low-Rank Tensor Decompositions

The advantage of representing the operations using a tensor of coefficients is a key connection between the rank of the tensor and the arithmetic complexity of the corresponding operation. Consider the "active" multiplications between elements of the input vectors (e.g., x_i · y_j). The conventional algorithm, following Equation (1), will compute an active multiplication for every nonzero coefficient in T. However, suppose we have a rank-R decomposition of the tensor, T = Σ_{r=1..R} u_r ∘ v_r ∘ w_r, so that

  t_{ijk} = Σ_{r=1..R} u_ir v_jr w_kr    (2)

for all i, j, k, where U, V, and W are matrices with R columns each. We will also use the equivalent notation T = ⟦U, V, W⟧. Substituting Equation (2) into Equation (1) and rearranging, we have, for k = 1, ..., K,

  z_k = Σ_{r=1..R} (s_r · t_r) w_kr = Σ_{r=1..R} m_r w_kr,

where s = U^T x, t = V^T y, and m = s ∗ t (footnote 2). This reduces the number of active multiplications (now between linear combinations of elements of the input vectors) to R. Here we highlight active multiplications with (·) notation. Assuming R < nnz(T), this reduction of active multiplications, at the expense of increasing the number of other operations, is valuable when active multiplications are much more expensive than the other operations. This is the case for recursive matrix multiplication, where the elements of the input vectors are (sub)matrices, as we describe below.

2.2.2 Tensor Representation of Matrix Multiplication

Matrix multiplication is a bilinear operation, so we can represent it as a tensor computation. In order to match the notation above, we vectorize the input and output matrices A, B, and C using row-wise ordering, so that x = vec(A), y = vec(B), and z = vec(C). For every triplet of matrix dimensions for valid matrix multiplication, there is a fixed tensor that represents the computation so that T ×_1 vec(A) ×_2 vec(B) = vec(C) holds for all A, B, and C. For example, if A and B are both 2 × 2, the corresponding 4 × 4 × 4 tensor T has frontal slices

  T1 = [1 0 0 0    T2 = [0 1 0 0    T3 = [0 0 0 0    T4 = [0 0 0 0
        0 0 1 0          0 0 0 1          0 0 0 0          0 0 0 0
        0 0 0 0          0 0 0 0          1 0 0 0          0 1 0 0
        0 0 0 0],        0 0 0 0],        0 0 1 0],        0 0 0 1].

This yields, for example, T3 ×_1 vec(A) ×_2 vec(B) = a21 · b11 + a22 · b21 = c21.

By Strassen's algorithm, we know that although this tensor has 8 nonzero entries, its rank is at most 7. Indeed, that algorithm corresponds to a low-rank decomposition represented by the following triplet of matrices, each with 7 columns:

  U = [1  0  1  0  1 −1  0      V = [1  1  0 −1  0  1  0      W = [1  0  0  1 −1  0  1
       0  0  0  0  1  0  1           0  0  1  0  0  1  0           0  0  1  0  1  0  0
       0  1  0  0  0  1  0           0  0  0  1  0  0  1           0  1  0  1  0  0  0
       1  1  0  1  0  0 −1],         1  0 −1  0  1  0  1],         1 −1  1  0  0  1  0].

As in the previous section, for example, s1 = u1^T vec(A) = a11 + a22, t1 = v1^T vec(B) = b11 + b22, and c11 = (Wm)_1 = m1 + m4 − m5 + m7. Note that in the previous section, the elements of the input matrices are already interpreted as submatrices (e.g., A11 and M1); here we represent them as scalars (e.g., a11 and m1).

We need not restrict ourselves to the ⟨2, 2, 2⟩ case; there exists a tensor for matrix multiplication given any set of valid dimensions. When considering a base case of M × K by K × N matrix multiplication (denoted ⟨M, K, N⟩), the tensor has dimensions MK × KN × MN and MKN non-zeros. In particular, t_{ijk} = 1 if the following three conditions hold: (1) (i − 1) mod K = ⌊(j − 1)/N⌋; (2) (j − 1) mod N = (k − 1) mod N; and (3) ⌊(i − 1)/K⌋ = ⌊(k − 1)/N⌋. Otherwise, t_{ijk} = 0 (here we assume entries are 1-indexed).

2 Here (∗) denotes element-wise vector multiplication.
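To make the scalar interpretation concrete, the following is a minimal sketch (not the paper's generated code) of one step of a fast algorithm driven directly by a ⟦U, V, W⟧ triplet, applied to row-wise vectorized inputs; the function and variable names are ours, and the driver uses Strassen's factors listed above.

  // Sketch: one step of a fast algorithm <M,K,N> given by U (MK x R),
  // V (KN x R), W (MN x R), applied to x = vec(A), y = vec(B).
  // The "active multiplications" s_r * t_r are scalar products here; in the
  // recursive algorithm they become submatrix multiplications.
  #include <cstdio>
  #include <vector>

  using Mat = std::vector<std::vector<double>>;  // factor matrix, row-indexed

  std::vector<double> one_step(const Mat& U, const Mat& V, const Mat& W,
                               const std::vector<double>& x,
                               const std::vector<double>& y) {
    const std::size_t R = U[0].size();
    std::vector<double> s(R, 0.0), t(R, 0.0), m(R), z(W.size(), 0.0);
    for (std::size_t r = 0; r < R; ++r) {
      for (std::size_t i = 0; i < x.size(); ++i) s[r] += U[i][r] * x[i];  // s = U^T x
      for (std::size_t j = 0; j < y.size(); ++j) t[r] += V[j][r] * y[j];  // t = V^T y
      m[r] = s[r] * t[r];                                                 // active multiplication
    }
    for (std::size_t k = 0; k < W.size(); ++k)
      for (std::size_t r = 0; r < R; ++r) z[k] += W[k][r] * m[r];         // z = W m
    return z;
  }

  int main() {
    // Strassen's <2,2,2> factors from Section 2.2.2 (R = 7).
    Mat U = {{1,0,1,0,1,-1,0}, {0,0,0,0,1,0,1}, {0,1,0,0,0,1,0}, {1,1,0,1,0,0,-1}};
    Mat V = {{1,1,0,-1,0,1,0}, {0,0,1,0,0,1,0}, {0,0,0,1,0,0,1}, {1,0,-1,0,1,0,1}};
    Mat W = {{1,0,0,1,-1,0,1}, {0,0,1,0,1,0,0}, {0,1,0,1,0,0,0}, {1,-1,1,0,0,1,0}};
    std::vector<double> x = {1, 2, 3, 4};  // vec(A), A = [1 2; 3 4]
    std::vector<double> y = {5, 6, 7, 8};  // vec(B), B = [5 6; 7 8]
    std::vector<double> z = one_step(U, V, W, x, y);
    std::printf("C = [%g %g; %g %g]\n", z[0], z[1], z[2], z[3]);  // expect [19 22; 43 50]
    return 0;
  }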

2.2.3 Approximate Tensor Decompositions

The APA algorithms discussed in Section 1.1 arise from approximate tensor decompositions. With Bini's algorithm, for example, the factor matrices (i.e., the corresponding ⟦U, V, W⟧) have entries 1/λ and λ. As λ → 0, the low-rank tensor approximation approaches the true tensor. However, as λ gets small, we suffer from loss of precision in the floating point calculations of the resulting fast algorithm. Setting λ = √ε minimizes the loss of accuracy for one step of Bini's algorithm, where ε is machine precision [6], but even in this case at least half the digits are lost with a single recursive step of the algorithm.

2.3 Finding Fast Algorithms

We conclude this section with a description of a method for searching for and discovering fast algorithms for matrix multiplication. Our search goal is to find low-rank decompositions of tensors corresponding to matrix multiplication of a particular set of dimensions, which will identify fast, recursive algorithms with reduced arithmetic complexity. That is, given a particular base case ⟨M, K, N⟩ and the associated tensor T, we seek a rank R and matrices U, V, and W that satisfy Equation (1). Table 1 summarizes the algorithms that we find and use for numerical experiments in Section 5.

Table 1. Summary of fast algorithms. Algorithms without citation were found by the authors using the ideas in Section 2.3. An asterisk denotes an approximation (APA) algorithm. The number of multiplications is equal to the rank R of the corresponding tensor decomposition. The multiplication speedup per recursive step is the expected speedup if matrix additions were free. This speedup does not determine the fastest algorithm because the maximum number of recursive steps depends on the size of the sub-problems created by the algorithm. By Propositions 2.1 and 2.2, we also have fast algorithms for all permutations of the base case ⟨M, K, N⟩.

  Algorithm base case   Multiplies (fast)   Multiplies (classical)   Multiplication speedup per recursive step
  ⟨2, 2, 3⟩                  11                  12                       9%
  ⟨2, 2, 5⟩                  18                  20                      11%
  ⟨2, 2, 2⟩ [32]              7                   8                      14%
  ⟨2, 2, 4⟩                  14                  16                      14%
  ⟨3, 3, 3⟩                  23                  27                      17%
  ⟨2, 3, 3⟩                  15                  18                      20%
  ⟨2, 3, 4⟩                  20                  24                      20%
  ⟨2, 4, 4⟩                  26                  32                      23%
  ⟨3, 3, 4⟩                  29                  36                      24%
  ⟨3, 4, 4⟩                  38                  48                      26%
  ⟨3, 3, 6⟩ [31]             40                  54                      35%
  ⟨2, 2, 3⟩* [5]             10                  12                      20%
  ⟨3, 3, 3⟩* [30]            21                  27                      29%

The rank of the decomposition determines the number of active multiplications, or recursive calls, and therefore the exponent in the arithmetic cost of the algorithm. The number of other operations (additions and inactive multiplications) will affect only the constants in the arithmetic cost. For this reason, we want sparse U, V, and W matrices with simple values (like ±1), but that goal is of secondary importance compared to minimizing the rank R. Note that these constant values do affect performance of these algorithms for reasonable matrix dimensions in practice, though mainly because of how they affect the communication costs of the implementations rather than the arithmetic cost. We discuss this in more detail in Section 3.2.

2.3.1 Equivalent Algorithms

Given an algorithm ⟦U, V, W⟧ for base case ⟨M, K, N⟩, we can transform it to an algorithm for any of the other 5 permutations of the base case dimensions with the same number of multiplications. This is a well known property [15]; here we state the two transformations that generate all permutations in our notation. We let P_{I×J} be the permutation matrix that swaps row-order for column-order in the vectorization of an I × J matrix. In other words, if A is I × J, then P_{I×J} · vec(A) = vec(A^T).

Proposition 2.1. Given a fast algorithm ⟦U, V, W⟧ for ⟨M, K, N⟩, ⟦P_{K×N} V, P_{M×K} U, P_{M×N} W⟧ is a fast algorithm for ⟨N, K, M⟩.

Proposition 2.2. Given a fast algorithm ⟦U, V, W⟧ for ⟨M, K, N⟩, ⟦P_{M×N} W, U, P_{K×N} V⟧ is a fast algorithm for ⟨N, M, K⟩.

Fast algorithms for a given base case also belong to equivalence classes. Two algorithms are equivalent if one can be generated from another based on the following transformations [10, 21].

Proposition 2.3. If ⟦U, V, W⟧ is a fast algorithm for ⟨M, K, N⟩, then the following are also fast algorithms for ⟨M, K, N⟩:
  ⟦UP, VP, WP⟧ for any permutation matrix P;
  ⟦UD_x, VD_y, WD_z⟧ for any diagonal matrices D_x, D_y, and D_z such that D_x D_y D_z = I;
  ⟦(Y^{−T} ⊗ X)U, (Z^{−T} ⊗ Y)V, (X ⊗ Z^{−T})W⟧ for any nonsingular matrices X ∈ R^{M×M}, Y ∈ R^{K×K}, Z ∈ R^{N×N}.

2.3.2 Numerical Search

Given a rank R for base case ⟨M, K, N⟩, Equation (1) defines (MKN)^2 polynomial equations of the form given in Equation (2). Because the polynomials are trilinear, alternating least squares (ALS) can be used to iteratively compute an approximate (numerical) solution to the equations. That is, if two of the three factor matrices are fixed, the optimal third factor matrix is the solution to a linear least squares problem. Thus, each outer iteration of ALS involves alternating among solving for U, V, and W, each of which can be done efficiently with a QR decomposition, for example. This approach was first proposed for fast matrix multiplication search by Brent [7], but ALS has been a popular method for general low-rank tensor approximation for as many years (see [24] and references therein).

The main difficulties ALS faces for this problem include getting stuck at local minima, encountering ill-conditioned linear least-squares problems, and, even if ALS converges to machine-precision accuracy, computing dense U, V, and W matrices with floating point entries. We follow the work of Johnson and McLoughlin [21] and Smirnov [31] in addressing these problems. We use multiple starting points to handle the problem of local minima, add regularization to help with the ill-conditioning, and encourage sparsity in order to recover exact factorizations (with integral or rational values) from the approximations.

The most useful techniques in our search have been (1) exploiting equivalence transformations [21, Eq. (6)] to encourage sparsity and obtain discrete values and (2) using and adjusting the regularization penalty term [31, Eq. (4-5)] throughout the iteration. As described in earlier efforts, algorithms for small base cases can be discovered nearly automatically. However, as the values M, N, and K grow, more hands-on tinkering using heuristics seems to be necessary to find discrete solutions.
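Once a candidate has been rounded to discrete values, it can be accepted or rejected by checking it entry-by-entry against the matrix multiplication tensor. The following is a small sketch of such a test, written by us under the index conditions stated in Section 2.2.2; it is not part of the paper's search tooling.

  // Sketch: does a candidate [[U,V,W]] exactly reproduce every entry of the
  // <M,K,N> matrix multiplication tensor? Indices are 1-based, matching the
  // conditions in Section 2.2.2 (row-wise vectorization).
  #include <cmath>
  #include <cstdio>
  #include <vector>

  using Mat = std::vector<std::vector<double>>;

  bool is_exact(const Mat& U, const Mat& V, const Mat& W,
                int M, int K, int N, double tol = 1e-12) {
    const std::size_t R = U[0].size();
    for (int i = 1; i <= M * K; ++i)
      for (int j = 1; j <= K * N; ++j)
        for (int k = 1; k <= M * N; ++k) {
          bool nz = ((i - 1) % K == (j - 1) / N) &&
                    ((j - 1) % N == (k - 1) % N) &&
                    ((i - 1) / K == (k - 1) / N);   // t_ijk = 1 iff all three hold
          double sum = 0.0;                         // sum_r u_ir * v_jr * w_kr
          for (std::size_t r = 0; r < R; ++r)
            sum += U[i - 1][r] * V[j - 1][r] * W[k - 1][r];
          if (std::fabs(sum - (nz ? 1.0 : 0.0)) > tol) return false;
        }
    return true;
  }

  int main() {
    // Strassen's <2,2,2> factors from Section 2.2.2 pass the test.
    Mat U = {{1,0,1,0,1,-1,0}, {0,0,0,0,1,0,1}, {0,1,0,0,0,1,0}, {1,1,0,1,0,0,-1}};
    Mat V = {{1,1,0,-1,0,1,0}, {0,0,1,0,0,1,0}, {0,0,0,1,0,0,1}, {1,0,-1,0,1,0,1}};
    Mat W = {{1,0,0,1,-1,0,1}, {0,0,1,0,1,0,0}, {0,1,0,1,0,0,0}, {1,-1,1,0,0,1,0}};
    std::printf("exact: %s\n", is_exact(U, V, W, 2, 2, 2) ? "yes" : "no");
    return 0;
  }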

3. Implementation and Practical Considerations

We now discuss our code generation method for fast algorithms and the major implementation issues. All experiments were conducted on a single compute node on NERSC's Edison. Each node has two 12-core Intel 2.4 GHz Ivy Bridge processors and 64 GB of memory.

3.1 Code Generation

Our code generator automatically implements a fast algorithm in C++ given the U, V, and W matrices representing the algorithm. The generator simultaneously produces both sequential and parallel implementations. We discuss the sequential code in this section and the parallel extensions in Section 4. For computing C = A · B, the following are the key ingredients of the generated code:

• Using the entries in the U and V matrices, form the temporary matrices Sr and Tr, 1 ≤ r ≤ R, via matrix additions and scalar multiplication. The Sr and Tr are linear combinations of sub-blocks of A and B, respectively. For each Sr and Tr, the corresponding linear combination is a custom implementation. Scalar multiplication by ±1 is replaced with native addition / subtraction operators. The code generator can produce three variants of matrix additions, which we describe in Section 3.2. When a column of U or V contains a single non-zero element, there is no matrix addition (only scalar multiplication). In order to save memory, the code generator does not form a temporary matrix in this case. The scalar multiplication is piped through to subsequent recursive calls and is eventually used in a base case call to dgemm.

• Recursive calls to the fast matrix multiplication routine compute Mr = Sr · Tr, 1 ≤ r ≤ R.

• Using the entries of W, linear combinations of the Mr form the output C. Matrix additions and scalar multiplications are again handled carefully, as above.

• Common subexpression elimination detects redundant matrix additions, and the code generator can automatically implement algorithms with fewer additions. We discuss this process in more detail in Section 3.3.

• Dynamic peeling [33] accounts for matrices whose dimensions are not evenly divided by the base case of the fast algorithm. This method handles the boundaries of the matrix at each recursive level, and requires no additional memory. (Other methods, such as zero-padding, require additional memory.) With dynamic peeling, the implementation can multiply matrices of any dimensions.

Figure 1 shows performance benchmarks of the code generator's implementation. In order to compare the performance of matrix multiplication algorithms with different computational costs, we use the effective GFLOPS metric for P × Q × R matrix multiplication:

  effective GFLOPS = (2PQR − PR) / (time in seconds) · 1e-9.    (3)

We note that effective GFLOPS is only the true GFLOPS for the classical algorithm (the fast algorithms perform fewer floating point operations). However, this metric lets us compare all of the algorithms on an inverse-time scale, normalized by problem size [13, 27].

We compare our code-generated Strassen implementation with MKL's dgemm and a tuned implementation of Strassen-Winograd from D'Alberto et al. [9] (recall that Strassen-Winograd performs the same number of multiplications but fewer matrix additions than Strassen's algorithm). The code generator's implementation outperforms MKL and is competitive with the tuned implementation.

Figure 1. Effective performance (Equation (3)) of our code generator's implementation of Strassen's algorithm against MKL's dgemm and a tuned implementation of the Strassen-Winograd algorithm [9]. The problem sizes are square. The generated code easily outperforms MKL and is competitive with the tuned code.

Thus, we are confident that the general conclusions we draw with code-generated implementations of fast algorithms will also apply to hand-tuned implementations.

3.2 Handling Matrix Additions

While the matrix multiplications constitute the bulk of the running time, matrix additions are still an important performance optimization. We call the linear combinations used to form Sr, Tr, and Cij addition chains. For example, S1 = A11 + A22 is an addition chain in Strassen's algorithm. We consider three different implementations for the addition chains:

1. Pairwise: With r fixed, compute Sr and Tr using the daxpy BLAS routine for all matrices in the addition chain. This requires nnz(u_r) calls to daxpy to form Sr and nnz(v_r) calls to form Tr. After the recursive computations of the Mr, we follow the same strategy to form the output. The ith sub-block (row-wise) of C requires nnz(w_{i,:}) daxpy calls (footnote 3).

2. Write-once: With r fixed, compute Sr and Tr with only one write for each entry (instead of, for example, nnz(u_r) writes for Sr with the pairwise method). In place of daxpy, stream through the necessary submatrices of A and B and combine the entries to form Sr and Tr. This requires reading some submatrices of A and B several times, but writing to only one output stream at a time. Similarly, we write the output matrix C once and read the Mr several times.

3. Streaming: Read each input matrix once and write each temporary matrix Sr and Tr once. Stream through the entries of each sub-block of A and B, and update the corresponding entries in all temporary matrices Sr and Tr. Similarly, stream through the entries of the Mr and update all submatrices of C.

Each daxpy call requires two matrix reads and one matrix write (except for the first call in an addition chain, which is a copy and requires one read and one write). Let nnz(U, V, W) = nnz(U) + nnz(V) + nnz(W). Then the pairwise additions perform 2 · nnz(U, V, W) − 2R − MN submatrix reads and nnz(U, V, W) submatrix writes. However, the additions use an efficient vendor implementation.

3 Because daxpy computes y ← αx + y, we make a call for each addition in the chain as well as one call for an initial copy.
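To illustrate the difference between the first two variants, here is a minimal sketch, written by us rather than taken from the generated code, of forming one temporary Sr from the blocks in its addition chain. Block layouts and strides are simplified to dense arrays, and the daxpy calls of the real pairwise variant are shown as plain loops.

  // Sketch: forming S_r from the nonzero entries of column u_r.
  // write_once writes each entry of S_r exactly once; pairwise makes one full
  // pass per operand block (what one daxpy call per block would do).
  #include <cstdio>
  #include <vector>

  void write_once(const std::vector<const double*>& blocks,
                  const std::vector<double>& coeffs, double* S, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
      double acc = 0.0;
      for (std::size_t b = 0; b < blocks.size(); ++b) acc += coeffs[b] * blocks[b][i];
      S[i] = acc;                                   // single write per entry
    }
  }

  void pairwise(const std::vector<const double*>& blocks,
                const std::vector<double>& coeffs, double* S, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) S[i] = coeffs[0] * blocks[0][i];  // initial copy
    for (std::size_t b = 1; b < blocks.size(); ++b)                       // one pass per block
      for (std::size_t i = 0; i < n; ++i) S[i] += coeffs[b] * blocks[b][i];
  }

  int main() {
    // S1 = A11 + A22 from Strassen's algorithm, with 2x2 blocks stored densely.
    double A11[] = {1, 2, 3, 4}, A22[] = {5, 6, 7, 8}, S[4];
    write_once({A11, A22}, {1.0, 1.0}, S, 4);
    std::printf("S1 = [%g %g %g %g]\n", S[0], S[1], S[2], S[3]);  // 6 8 10 12
    pairwise({A11, A22}, {1.0, 1.0}, S, 4);
    std::printf("S1 = [%g %g %g %g]\n", S[0], S[1], S[2], S[3]);  // same values
    return 0;
  }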

46 The write-once additions perform nnz (U, V, W) submatrix 3.4 Recursion Cutoff Point reads and at most 2R + MN submatrix writes. We do not need In practice, we take only a few steps of recursion before calling a to write any data for the columns of U and V with a single non- vendor-tuned library classical routine as the base case (in our case, zero entry. These correspond to addition chains that are just a copy, Intel MKL’s dgemm). One method for determining the cutoff point for example, T2 = B11 in Strassen’s algorithm. While we perform is to benchmark each algorithm and measure where the implemen- fewer reads and writes than the pairwise additions, the complexity tation outperforms dgemm. While this is sustainable for the analysis of our code increases (we have to write our own additions), and we of any individual algorithm, we are interested in a large class of fast can no longer use a tuned daxpy routine. We do not worry about algorithms. Furthermore, a simple set of cutoff points limits under- code complexity because we use code generation. Since the prob- standing of the performance and will have to be re-measured for lem is bandwidth-bound and compilers can automatically vectorize different architectures. Instead, we provide a rule of thumb based for loops, we don’t expect the latter concern to be an issue. on the performance of dgemm. Finally, the streaming additions perform MK+KN+R submatrix Figure 3 shows the performance of Intel MKL’s sequential reads and at most 2R + MN submatrix writes. This is fewer reads and parallel dgemm routines. We see that the routines exhibit than the write-once additions, but we have increased the complexity a “ramp-up” phase and then flatten for sufficiently large prob- of the writes. Specifically, we alternate writes to different memory lems. In both serial and parallel, multiplication of square matrices locations, whereas with the write-once algorithm, we write to a (N × N × N computation) tends to level at a higher performance single (contiguous) output stream. than the problem shapes with a fixed dimension (N × 800 × N and The three methods also have different memory footprints. With N × 800 × 800). Our principle for recursion is to take a recursive pairwise or write-once, Sr and Tr are formed just before computing step only if the sub-problems fall on the flat part of the curve. If Mr. After Mr is computed, the memory becomes available. On the the ratio of performance drop in the DGEMM curve is greater than other hand, the streaming algorithm must compute all temporary the speedup per step (as listed in Table 1), then taking an additional matrices Sr and Tr simultaneously, and hence needs R times as recursive step cannot improve performance.4 Finally, we note that much memory for the temporary matrices. We will explore the some of our parallel algorithms call the sequential dgemm routine performance of the three methods at the end of Section 3.3. in the base case. Both curves will be important to our parallel fast matrix multiplication algorithms in Section 4. 3.3 Common Subexpression Elimination

The Sr, Tr, and Mr matrices often share subexpressions. For exam- 4. Parallel Algorithms for Shared Memory ple, in our h4, 2, 4i fast algorithm (see Table 1), T11 and T25 are: We present three algorithms for parallel fast matrix multiplication: depth-first search (DFS), breadth-first search (BFS), and a hybrid T11 = B24 − B12 − B22 T25 = B23 + B12 + B22 of the two (HYBRID). In this work, we target shared memory ma- chines, although the same ideas generalize to distributed memory. Both T11 and T25 share the subexpression B12 + B22, up to scalar For example, DFS and BFS ideas are used for a distributed memory multiplication. Thus, there is opportunity to remove additions / implementation of Strassen’s algorithm [27]. subtractions: 4.1 Depth-First Search Y1 = B12 + B22 T11 = B24 − Y1 T25 = B23 + Y1 The DFS algorithm is straightforward: when recursion stops, the At face value, eliminating additions would appear to improve classical algorithm uses all threads on each sub-problem. In other the algorithm. However, there are two important considerations. words, we use parallel matrix multiplication on the leaf nodes of a First, using Y1 with the pairwise or write-once approaches requires depth-first traversal of the recursion tree. At a high-level, the code additional memory (with the streaming approach it requires only path is exactly the same as in the sequential case, and the main additional local variables). parallelism is in library calls. The advantages of DFS are that the Second, we discussed in Section 3.2 that an important metric is memory footprint matches the sequential algorithm and the code the number of reads and writes. If we use the write-once algorithm, is simpler—parallelism in multiplications is hidden inside library we have actually increased the number of reads and writes. Orig- calls. Furthermore, matrix additions are trivially parallelized. The inally, forming T11 and T25 required six reads and two writes. By key disadvantage of DFS is that the recursion cutoff point is larger eliminating the common subexpression, we performed two fewer (Figure 3), limiting the number of recursive steps. On Edison’s 24- reads in forming T11 and T25 but needed an additional two reads core compute node, the recursion cutoff point is around N = 5000. and one write to form Y1. In other words, we have read the same amount of data and written more data. In general, eliminating the 4.2 Breadth-First Search same length-two subexpression k times reduces the number of ma- The BFS algorithm uses task-based parallelism. Each leaf node trix reads and writes by k − 3. Thus, a length-two subexpression in the matrix multiplication recursion tree is an independent task. must appear at least four times for elimination to reduce the total The recursion tree also serves as a dependency graph: we need number of reads and writes in the algorithm. to compute all Mr, 1 ≤ r ≤ R, (children) before forming the Figure 2 shows the performance all three matrix addition meth- result (parent). The major advantage of BFS is we can take more ods from Section 3.2, with and without common subexpression recursive steps because the recursion cutoff point is based on the elimination (CSE). For CSE, we greedily eliminate length-two sequential dgemm curves. Matrix additions to form Sr and Tr are subexpressions. In general, the write-once algorithm without CSE part of the task that computes Mr. 
In the first level of recursion, performs the best on the rectangular matrix multiplication problem matrix additions to form Ci j from the Mr are handled in the same sizes. For these problems, CSE lowers performance of the write- way as DFS, since all threads are available. once algorithm and has little to modest effect on the streaming and The BFS approach has two distinct disadvantages. First, it is pairwise algorithms. For square matrix problems, the best variant difficult to load balance the tasks because the number of threads is less clear, but write-once with no elimination often performs the highest. We use write-once without CSE for the rest of our perfor- 4 Note that the inverse is not necessarily true, the speedup depends on the mance experiments. overhead of the additions.
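The following schematic, written by us and not taken from the generated code, shows the essential OpenMP task structure of the BFS strategy. The names Subproblem, multiply_leaf, and combine_output are placeholders for the submatrix views, the sequential dgemm leaf call, and the W-combination of the real implementation.

  // Minimal BFS sketch: one OpenMP task per subproblem M_r; the additions
  // forming S_r and T_r belong to the task, and the taskwait ensures all M_r
  // exist before the output combination.
  #include <vector>

  struct Subproblem { /* views of S_r, T_r, M_r in the real code */ };

  static void multiply_leaf(Subproblem&) { /* sequential dgemm at the cutoff */ }
  static void combine_output(std::vector<Subproblem>&) { /* form C_ij from the M_r */ }

  static void fast_mm_bfs(std::vector<Subproblem>& subs) {
    for (std::size_t r = 0; r < subs.size(); ++r) {
      #pragma omp task firstprivate(r) shared(subs)
      {
        // form S_r and T_r here (part of the task), then either recurse to
        // create more tasks or hit the base case:
        multiply_leaf(subs[r]);
      }
    }
    #pragma omp taskwait      // dependency: all children M_r before the parent C
    combine_output(subs);
  }

  int main() {
    std::vector<Subproblem> subs(7);   // e.g., one step of Strassen: R = 7 tasks
    #pragma omp parallel
    #pragma omp single                 // a single thread creates the tasks
    fast_mm_bfs(subs);
    return 0;
  }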

Figure 2. Effective performance (Equation (3)) comparison of common subexpression elimination (CSE) and the three matrix addition methods: write-once, streaming, and pairwise (see Section 3.2). The ⟨4, 2, 4⟩ fast algorithm computed N × 1600 × N ("outer product" shape) for varying N, and the ⟨4, 2, 3⟩ fast algorithm computed N × N × N (square multiplication). Write-once with no CSE tends to have the highest performance, especially for the ⟨4, 2, 4⟩ algorithm. Pairwise is slower because it performs more reads and writes.

Figure 3. Performance curves of MKL's dgemm routine in serial (left) and in parallel (right) for three different problem shapes. The performance curves exhibit a "ramp-up" phase and then flatten for large enough problems. Performance levels off near N = 1500 in serial and N = 5000 in parallel. For large problems in both serial and parallel, N × N × N multiplication is faster than N × 800 × N, which is faster than N × 800 × 800. We note that sequential performance is faster than per-core parallel performance due to Intel Turbo Boost, which increases the clock speed from 2.4 to 3.2 GHz. With Turbo Boost, peak sequential performance is 25.6 GFLOPS. Peak parallel performance is 19.2 GFLOPS/core.

4.3 Hybrid

Our novel hybrid algorithm compensates for the load imbalance in BFS by applying the DFS approach on a subset of the base case problems. With L levels of recursion and P threads, the hybrid algorithm applies task parallelism (BFS) to the first R^L − (R^L mod P) multiplications. The number of BFS sub-problems is a multiple of P, so this part of the algorithm is load balanced. All threads are used on each of the R^L mod P remaining multiplications (DFS).

An alternative approach uses another level of hybridization: evenly assign as many as possible of the remaining R^L mod P multiplications to disjoint subsets of P' < P threads (where P' divides P), and then finish off the still-remaining multiplications with all P threads. This approach reduces the number of small multiplications assigned to all P threads, where perfect scaling is harder to achieve. However, it leads to additional load balancing concerns in practice and requires a more complicated task scheduler.

4.4 Implementation

The code generation from Section 3.1 produces code that can compile to the DFS, BFS, or HYBRID parallel algorithms. We use OpenMP to implement each algorithm. The overview of the parallelization is:

• DFS: Each dgemm call uses all threads. Matrix additions are always fully parallelized.

• BFS: Each recursive matrix multiplication routine and the associated matrix additions are launched as an OpenMP task. At each recursive level, the taskwait barrier ensures that all Mr matrices are available to form the output matrix.

• HYBRID: Matrix multiplies are either launched as an OpenMP task (BFS), or the number of MKL threads is adjusted for a parallel dgemm (DFS). This is implemented with the if conditional clause of OpenMP tasks. Again, taskwait barriers ensure that the Mr matrices are computed before forming the output matrix. We use an explicit synchronization scheme with OpenMP locks to ensure that the DFS steps occur after the BFS tasks complete. This ensures that there is no oversubscription of threads. A sketch of the HYBRID dispatch follows the list.

4.5 Shared-Memory Bandwidth Limitations

The performance gains of the fast algorithms rely on the cost of matrix multiplications being much larger than the cost of matrix additions. Since matrix multiplication is compute-bound and matrix addition is bandwidth-bound, these computations scale differently with the amount of parallelism. For large enough matrices, MKL's dgemm achieves near-peak performance of the node (Figure 3). On the other hand, the STREAM benchmark [28] shows that the node achieves around a five-fold speedup in bandwidth with 24 cores. In other words, in parallel, matrix multiplication is near 100% parallel efficiency and matrix addition is near 20% parallel efficiency. The bandwidth bottleneck makes it more difficult for parallel fast algorithms to be competitive with parallel MKL. To illuminate this issue, we will present performance results with both 6 and 24 cores. Using 6 cores avoids the bandwidth bottleneck and leads to much better performance per core.

Figure 4. Effective performance (Equation (3)) comparison of the BFS, DFS, and HYBRID parallel implementations on representative fast algorithms and problem sizes. We use 6 and 24 cores to show the bandwidth limitations of the matrix additions. (Left): Strassen's algorithm on square problems. With 6 cores, we see significant speedups on large problems. (Middle): The ⟨4, 2, 4⟩ fast algorithm (26 multiplies) on N × 2800 × N problems. HYBRID performs the best in all cases. With 6 cores, the fast algorithm consistently outperforms MKL. With 24 cores, the fast algorithm can achieve significant speedups for small problem sizes. (Right): The ⟨4, 3, 3⟩ fast algorithm (29 multiplies) on N × 3000 × 3000 problems. HYBRID again performs the best. With 6 cores, the fast algorithm gets modest speedups over MKL.

4.6 Performance Comparisons

Figure 4 shows the performance of the BFS, DFS, and HYBRID parallel methods with both 6 and 24 cores for three representative algorithms. The left plot shows the performance of Strassen's algorithm on square problems. With 6 cores, HYBRID does the best for small problems. Since Strassen's algorithm uses 7 multiplies, BFS has poor performance with 6 cores when using one step of recursion. While all 6 cores can do 6 multiplies in parallel, the 7th multiply is done sequentially (with HYBRID, the 7th multiply uses all 6 cores). With two steps of recursion, BFS has better load balance but is forced to work on smaller sub-problems. As the problems get larger, BFS outperforms HYBRID due to synchronization overhead when HYBRID switches from BFS to DFS steps. When the matrix dimension is around 15,000, the fast algorithm achieves a 25% speedup over MKL. Using 24 cores, HYBRID and DFS are the fastest. With one step of recursion, BFS can achieve only seven-fold parallelism. With two steps, there are 49 sub-problems, so one core is assigned 3 sub-problems while all others are assigned 2. In general, we see that it is much more difficult to achieve speedups with 24 cores. However, Strassen's algorithm has a modest performance gain over MKL for large problem sizes (∼5% faster).

The middle plot of Figure 4 shows the ⟨4, 2, 4⟩ fast algorithm (26 multiplies) for N × 2800 × N problems. With 6 cores, HYBRID is fastest for small problems and BFS becomes competitive for larger problems, where the performance is 15% better than MKL. In Section 5, we show that ⟨4, 2, 4⟩ is also faster than Strassen's algorithm for these problems. With 24 cores, we see that HYBRID is drastically faster than MKL on small problems. For example, HYBRID is 75% faster on 3500 × 2800 × 3500 (footnote 5). As the problem sizes get larger, we experience the bandwidth bottleneck and HYBRID achieves around the same performance as MKL. BFS uses one step of recursion and is consistently slower since it parallelizes 24 of 26 multiplies and uses only 2 cores on the last 2 multiplies. While multiple steps of recursion create more load balance, the sub-problems are small enough that performance degrades even more. DFS follows a similar ramp-up curve as MKL, but the sub-problems are still too small to see a performance benefit.

The right plot of Figure 4 shows the ⟨4, 3, 3⟩ fast algorithm (29 multiplies) for N × 3000 × 3000. We see similar trends as for the other problem sizes. With 6 cores, HYBRID does well for all problem sizes. Speedups are around ∼5% for large problems. With 24 cores, HYBRID is again drastically faster than MKL for small problem sizes and about the same as MKL for large problems.

5. Performance Experiments

We now present performance results for a variety of fast algorithms on several problem sizes. Based on the results of Section 4.5, we take the best of BFS and HYBRID when using 6 cores and the best of DFS and HYBRID when using 24 cores. For rectangular problem sizes in both sequential and parallel, we take the best of one or two steps of recursion, and for square problem sizes, we take the best of one, two, or three steps of recursion. Additional recursive steps do not improve the performance for the problem sizes we consider.

The square problem sizes for parallel benchmarks require the most memory—for some algorithms, three steps of recursion results in out-of-memory errors. In these cases, the original problem consumes 6% of the memory. For these algorithms, we only record the best of one or two steps of recursion in the performance plots. Finally, all timings are the median of five trials.

5.1 Sequential Performance

Figures 5 and 6 summarize the sequential performance of several fast algorithms. For N × N × N problems (Figure 5), we test the algorithms in Table 1 and some of their permutations. For example, we test ⟨4, 4, 2⟩ and ⟨4, 2, 4⟩, which are permutations of ⟨2, 4, 4⟩. In total, over 20 algorithms are tested for square matrices. Two of these algorithms, Bini's ⟨3, 2, 2⟩ and Schönhage's ⟨3, 3, 3⟩, are APA algorithms. We note that APA algorithms are of limited practical interest; even one step of recursion causes numerical errors in at least half the digits (a better speedup with the same or better numerical accuracy can be obtained by switching to single precision).

5 This result is an artifact of MKL's parallelization on these problem sizes and is not due to the speedups of the fast algorithm. We achieved similar speedups using our code generator and a classical ⟨2, 3, 4⟩ recursive algorithm (24 multiplies).
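The permuted variants tested here (e.g., ⟨4, 2, 4⟩ from ⟨2, 4, 4⟩) can be generated mechanically from an existing ⟦U, V, W⟧ using Proposition 2.1. The fragment below is our own illustration of that transformation, reusing the Mat alias from the earlier sketches; the function names are ours.

  // Sketch of Proposition 2.1: given [[U,V,W]] for <M,K,N>, the algorithm for
  // <N,K,M> is [[P_{KxN} V, P_{MxK} U, P_{MxN} W]]. Applying P_{IxJ} is a row
  // permutation of the factor matrix: row (i*J + j) moves to row (j*I + i)
  // (0-indexed), i.e., row-wise vec of an I x J matrix becomes vec of its transpose.
  #include <vector>

  using Mat = std::vector<std::vector<double>>;

  Mat apply_vec_transpose_perm(const Mat& F, int I, int J) {
    Mat G(F.size());
    for (int i = 0; i < I; ++i)
      for (int j = 0; j < J; ++j)
        G[j * I + i] = F[i * J + j];
    return G;
  }

  void permute_algorithm(const Mat& U, const Mat& V, const Mat& W,
                         int M, int K, int N,
                         Mat& U_out, Mat& V_out, Mat& W_out) {
    U_out = apply_vec_transpose_perm(V, K, N);   // P_{KxN} V
    V_out = apply_vec_transpose_perm(U, M, K);   // P_{MxK} U
    W_out = apply_vec_transpose_perm(W, M, N);   // P_{MxN} W
  }

The transformation of Proposition 2.2 is analogous, cyclically rotating the factors with the corresponding permutations.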

Figure 5. Effective sequential performance (Equation (3)) of a variety of fast algorithms on N × N × N problem sizes distributed across three plots. Each data point is the best of one, two, or three steps of recursion—additional recursive steps did not improve performance. Bini and Schönhage are approximate algorithms, and all others are exact fast algorithms. MKL and Strassen's are repeated on all three plots for comparison. All of the fast algorithms outperform MKL for large enough problem sizes, and Strassen's algorithm usually performs the best.

Figure 6. Effective sequential performance (Equation (3)) of fast matrix multiplication algorithms on rectangular problem sizes. (Left): Performance on an "outer product" shape, N × 1600 × N. Exact fast algorithms that have a similar outer product shape (e.g., ⟨4, 2, 4⟩) tend to have the highest performance. (Right): Performance of multiplication of a tall-and-skinny matrix by a small square matrix, N × 2400 × 2400. Again, fast algorithms that have this shape (e.g., ⟨4, 3, 3⟩) tend to have the highest performance.

For the problem sizes N × 1600 × N and N × 2400 × 2400 (Figure 6), we evaluate the algorithms that are comparable to, or outperform, Strassen's algorithm. The results are summarized as follows:

1. All of the fast algorithms outperform MKL for large enough problem sizes. These algorithms are implemented with our code generator and use only the high-level optimizations described in Section 3.1. Since the fast algorithms perform less computation and communication, we expect this to happen.

2. For square matrices, Strassen's algorithm often performs the best. This is mostly due to its relatively small number of matrix additions in comparison to other fast algorithms. On large problem sizes, Strassen's algorithm provides around a 20% speedup over MKL's dgemm. The right plot of Figure 5 shows that some algorithms are competitive with Strassen's algorithm on large problems. These algorithms have large speedups per recursive step (see Table 1). While Strassen's algorithm can take more recursive steps, memory constraints and the cost of additions with additional recursive steps cause Strassen's algorithm to be on par with these other algorithms.

3. Although Strassen's algorithm has the highest performance for square matrices, other fast algorithms have higher performance for N × 1600 × N and N × 2400 × 2400 problem sizes (Figure 6). The reason is that the fixed dimension constrains the number of recursive steps. With multiple recursive steps, the matrix sub-blocks become small enough that dgemm does not achieve good performance on the sub-problem. Thus, algorithms that get a better speedup per recursive step have higher performance for these problem sizes.

4. For rectangular matrices, algorithms that "match the shape" of the problem tend to perform the best. For example, ⟨4, 2, 4⟩ and ⟨3, 2, 3⟩ both have the "outer product" shape of the N × 1600 × N problem sizes and have the highest performance. Similarly, ⟨4, 2, 3⟩ and ⟨4, 3, 3⟩ have the highest performance of the exact algorithms for N × 2400 × 2400 problem sizes. The ⟨4, 2, 4⟩ and ⟨4, 3, 3⟩ algorithms provide around a 5% performance improvement over Strassen's algorithm and a 10% performance improvement over MKL on N × 1600 × N and N × 2400 × 2400, respectively. The reason follows from the performance explanation in Result 3. Only one or two steps of recursion improve performance. Thus, algorithms that match the problem shape and have high speedups per step perform the best.

5. Bini's ⟨3, 2, 2⟩ APA algorithm typically has the highest performance on rectangular problem sizes. However, we remind the reader that the approximation used by this algorithm results in severe numerical errors.

5.2 Parallel Performance

Figure 7 shows the parallel performance for multiplying square matrices and Figure 8 shows the parallel performance for N × 2800 × N and N × 3000 × 3000 problem sizes. We include performance on both 6 and 24 cores in order to illustrate the bandwidth issues discussed in Section 4.5. We observe the following patterns in the parallel performance data:

Figure 7. Effective parallel performance (Equation (3)) of fast algorithms on square problems using only 6 cores (top row) and all 24 cores (bottom row). With 6 cores, bandwidth is not a bottleneck and we see similar trends to the sequential algorithms. With 24 cores, speedups over MKL are less dramatic, but Strassen's (bottom left), ⟨3, 3, 2⟩ (bottom left), and ⟨4, 3, 3⟩ (bottom right) all outperform MKL and have similar performance. Bini and Schönhage have high performance, but they are APA algorithms and suffer from severe numerical problems.

1. With 6 cores, bandwidth scaling is not a problem, and we find ments. For example, with 6 cores and BFS parallelism, the algo- many of the same trends as in the sequential case. All fast rithm achieved only 8.4 effective GFLOPS/core multiplying square algorithms outperform MKL. Apart from the APA algorithms, with dimension N = 13000. This is far below MKL’s performance Strassen’s algorithm is typically fastest for square matrices. (Figure 7). We conclude that while the algorithm may be of theoret- The h3, 2, 3i fast algorithm has the highest performance for ical interest, it does not perform well on the modest problem sizes the N × 2800 × N problem sizes, while h4, 3, 3i and h4, 2, 3i of interest on shared memory machines. have the highest performance for the N × 3000 × 3000. These algorithms match the shape of the problem. 6. Discussion 2. With 24 cores, MKL’s dgemm is typically the highest perform- Our code generation framework lets us benchmark a large number ing algorithm for rectangular problem sizes (bottom row of Fig- of existing and new fast algorithms and test a variety of implemen- ure 8). In these problems, the ratio of time spent in additions to tation details, such as how to handle matrix additions and how to time spent in multiplications is too large, and bandwidth limita- implement the parallelism. However, we performed only high-level tions prevent the fast algorithms from outperforming MKL. optimizations; we believe more detailed tuning of fast algorithms 3. With 24 cores and square problem sizes (bottom row of Fig- can provide performance gains. Based on the performance results ure 7), several algorithms outperform MKL. Strassen’s algo- we obtain in this work, we can draw several conclusions in bridging rithm provides a modest speedup over MKL (around 5%) and the gap between the theory and practice of fast algorithms. is one of highest performing exact algorithms. The h4, 3, 3i and First, in the case of multiplying square matrices, Strassen’s algo- h4, 2, 4i fast algorithms outperform MKL and are competitive rithm consistently dominates the performance of exact algorithms with Strassen’s algorithm. The square problem sizes spend a (in sequential and parallel). Even though Smirnov’s exact algorithm large fraction of time in matrix multiplication, so the bandwidth and Schonhage’s¨ APA algorithm are asymptotically faster in theory, costs for the matrix additions have less impact on performance. they never outperform Strassen’s for reasonable matrix dimensions 4. Again, the APA algorithms (Bini’s and Schonhage’s)¨ have high in practice (sequential or parallel). This sheds some doubt on the performance on rectangular problem sizes. It is still an open prospect of finding a fast algorithm that will outperform Strassen’s question if there exists a fast algorithm with the same complex- on square matrices; it will likely need to have a small base case and ity as Schonhage’s¨ algorithm. Our results show that a significant still offer a significant reduction in multiplications. performance gain is possible with such an algorithm. On the other hand, another conclusion from our performance results is that for multiplying rectangular matrices (which occurs We also benchmarked the asymptotically fastest implemen- more frequently than square in practice), there is a rich space for tation of square matrix multiplication. The algorithm consists improvements. In particular, fast algorithms with base cases that of composing h3, 3, 6i, h3, 6, 3i, h6, 3, 3i base cases. 
6. Discussion

Our code generation framework lets us benchmark a large number of existing and new fast algorithms and test a variety of implementation details, such as how to handle the matrix additions and how to implement the parallelism. However, we performed only high-level optimizations; we believe more detailed tuning of fast algorithms can provide further performance gains. Based on the performance results we obtain in this work, we can draw several conclusions about bridging the gap between the theory and practice of fast algorithms.

First, in the case of multiplying square matrices, Strassen's algorithm consistently dominates the performance of the exact algorithms, both sequentially and in parallel. Even though Smirnov's exact algorithm and Schönhage's APA algorithm are asymptotically faster in theory, they never outperform Strassen's algorithm for reasonable matrix dimensions in practice. This casts some doubt on the prospect of finding a fast algorithm that outperforms Strassen's on square matrices; such an algorithm will likely need a small base case and still offer a significant reduction in multiplications.

On the other hand, a second conclusion from our performance results is that for multiplying rectangular matrices (which occurs more frequently than the square case in practice), there is a rich space for improvement. In particular, fast algorithms whose base cases match the shape of the matrices tend to have the highest performance. There are many promising algorithms, and we suspect that algorithm-specific optimizations will prove fruitful.

Third, in the search for new fast algorithms, our results confirm the importance of the (secondary) metric of the sparsity of the ⟦U, V, W⟧ factor matrices. Although the arithmetic cost associated with this sparsity is negligible in practice, the communication cost associated with each nonzero can be performance limiting. We note that the communication cost of the streaming additions algorithm is independent of the sparsity, but the highest-performing additions algorithm in practice, the write-once algorithm, is sensitive to the number of nonzeros.
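To make the sparsity metric concrete, the sketch below counts the nonzeros and the implied submatrix additions for the textbook ⟦U, V, W⟧ representation of Strassen's ⟨2, 2, 2⟩ algorithm. The matrices are written out from the standard formulas, and the row/column conventions are chosen for illustration; they need not match the layout used by our code generator.

```python
import numpy as np

# [[U, V, W]] for Strassen's <2,2,2> algorithm (7 products M1..M7).
# Rows of U are indexed by (A11, A12, A21, A22), rows of V by (B11, B12, B21, B22),
# rows of W by (C11, C12, C21, C22); columns correspond to M1..M7.
# Entries come from the textbook formulas, e.g. M1 = (A11 + A22)(B11 + B22)
# and C11 = M1 + M4 - M5 + M7.
U = np.array([[1, 0, 1, 0, 1, -1,  0],
              [0, 0, 0, 0, 1,  0,  1],
              [0, 1, 0, 0, 0,  1,  0],
              [1, 1, 0, 1, 0,  0, -1]])
V = np.array([[1, 1, 0, -1, 0, 1, 0],
              [0, 0, 1,  0, 0, 1, 0],
              [0, 0, 0,  1, 0, 0, 1],
              [1, 0, -1, 0, 1, 0, 1]])
W = np.array([[1,  0, 0, 1, -1, 0, 1],
              [0,  0, 1, 0,  1, 0, 0],
              [0,  1, 0, 1,  0, 0, 0],
              [1, -1, 1, 0,  0, 1, 0]])

nnz = sum(int(np.count_nonzero(M)) for M in (U, V, W))
# One nonzero per column of U and V (per row of W) is "free"; every extra
# nonzero corresponds to one submatrix addition.
adds = (np.count_nonzero(U) - U.shape[1]) + (np.count_nonzero(V) - V.shape[1]) \
     + (np.count_nonzero(W) - W.shape[0])
print(nnz, adds)   # 36 nonzeros and 18 submatrix additions per recursive step
```

Each such addition reads and writes a submatrix, which is precisely the bandwidth-bound work described above; this is why sparser ⟦U, V, W⟧ representations tend to perform better.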

Figure 8. Effective parallel performance (Equation (3)) of fast algorithms on rectangular problems using only 6 cores (top row) and all 24 cores (bottom row). The problem sizes are an "outer product" shape, N × 2800 × N, and the multiplication of a tall-and-skinny matrix by a small square matrix, N × 3000 × 3000. With 6 cores, all fast algorithms outperform MKL, and the new fast algorithms achieve about a 5% performance gain over Strassen's algorithm. With 24 cores, bandwidth is a bottleneck and MKL outperforms the fast algorithms.

Fourth, we have identified a parallel scaling impediment for fast algorithms on shared-memory architectures. Because the memory bandwidth often does not scale with the number of cores, and because the additions and multiplications are separate computations in our framework, the overhead of the additions relative to the multiplications worsens in the parallel case. This hardware bottleneck is unavoidable on most shared-memory architectures, though we note that it does not occur in distributed memory, where the aggregate memory bandwidth scales with the number of nodes.

We would like to extend our framework to the distributed-memory case, in part because of the better prospects for parallel scaling. A larger fraction of the time is spent in communication for the classical algorithm on that architecture, and fast algorithms can reduce the communication cost in addition to the computational cost [3]. Similar code generation techniques will be helpful in exploring performance in this setting.

As matrix multiplication is the main computational kernel in linear algebra libraries, we also want to incorporate these fast algorithms into frameworks like BLIS [35] and PLASMA [26] to see how they affect a broader class of numerical algorithms.

Finally, we have not explored the numerical stability of the exact algorithms in order to compare their results. While theoretical bounds can be derived from each algorithm's ⟦U, V, W⟧ representation, it is an open question which algorithmic properties are most influential in practice; our framework will allow for rapid empirical testing. As numerical stability is an obstacle to the widespread use of fast algorithms, extensive testing can help alleviate (or confirm) common concerns.
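As a flavor of such empirical testing, the standalone sketch below (assuming only NumPy; it is not part of our framework) applies one recursive step of Strassen's algorithm and reports the relative error against the classical product.

```python
import numpy as np

def strassen_one_level(A, B):
    """One recursive step of Strassen's algorithm; the seven products use classical dgemm."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4,           M1 - M2 + M3 + M6]])

rng = np.random.default_rng(0)
A = rng.standard_normal((1024, 1024))
B = rng.standard_normal((1024, 1024))
C_classical = A @ B
C_fast = strassen_one_level(A, B)
rel_err = np.linalg.norm(C_fast - C_classical) / np.linalg.norm(C_classical)
print(rel_err)   # typically small for random inputs; worst-case bounds are far more pessimistic
```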
Acknowledgments

This research was supported in part by an appointment to the Sandia National Laboratories Truman Fellowship in National Security Science and Engineering, sponsored by Sandia Corporation (a wholly owned subsidiary of Lockheed Martin Corporation) as Operator of Sandia National Laboratories under its U.S. Department of Energy Contract No. DE-AC04-94AL85000. Austin R. Benson is also supported by an Office of Technology Licensing Stanford Graduate Fellowship.

This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

References

[1] AMD. AMD core math library user guide, 2014. Version 6.0.
[2] D. H. Bailey. Extra high speed matrix multiplication on the Cray-2. SIAM Journal on Scientific and Statistical Computing, 9(3):603–607, 1988.
[3] G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, and O. Schwartz. Communication-optimal parallel algorithm for Strassen's matrix multiplication. In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures, pages 193–204. ACM, 2012.
[4] G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Graph expansion and communication costs of fast matrix multiplication. Journal of the ACM, 59(6):32, 2012.

[5] D. Bini, M. Capovani, F. Romani, and G. Lotti. O(n^2.7799) complexity for n × n approximate matrix multiplication. Information Processing Letters, 8(5):234–235, 1979.
[6] D. Bini, G. Lotti, and F. Romani. Approximate solutions for the bilinear form computational problem. SIAM Journal on Computing, 9(4):692–697, 1980.
[7] R. P. Brent. Algorithms for matrix multiplication. Technical report, Stanford University, Stanford, CA, USA, 1970.
[8] Cray. Cray application developer's environment user's guide, 2012. Release 3.1.
[9] P. D'Alberto, M. Bodrato, and A. Nicolau. Exploiting parallelism in matrix-computation kernels for symmetric multiprocessor systems. ACM Transactions on Mathematical Software, 38(1):2, 2011.
[10] H. F. de Groote. On varieties of optimal algorithms for the computation of bilinear mappings I. The isotropy group of a bilinear mapping. Theoretical Computer Science, 7(1):1–24, 1978.
[11] C. C. Douglas, M. Heroux, G. Slishman, and R. M. Smith. GEMMW: a portable level 3 BLAS Winograd variant of Strassen's matrix-matrix multiply algorithm. Journal of Computational Physics, 110(1):1–10, 1994.
[12] F. L. Gall. Powers of tensors and fast matrix multiplication. In Proceedings of the International Symposium on Symbolic and Algebraic Computation, pages 296–303, 2014.
[13] B. Grayson and R. Van De Geijn. A high performance parallel Strassen implementation. Parallel Processing Letters, 6(1):3–12, 1996.
[14] N. J. Higham. Accuracy and stability of numerical algorithms. SIAM, 2002.
[15] J. Hopcroft and J. Musinski. Duality applied to the complexity of matrix multiplication and other bilinear forms. SIAM Journal on Computing, 2(3):159–173, 1973.
[16] J. E. Hopcroft and L. R. Kerr. On minimizing the number of multiplications necessary for matrix multiplication. SIAM Journal on Applied Mathematics, 20(1):30–36, 1971.
[17] S. Huss-Lederman, E. M. Jacobson, J. R. Johnson, A. Tsao, and T. Turnbull. Strassen's algorithm for matrix multiplication: Modeling, analysis, and implementation. In Proceedings of Supercomputing '96, pages 9–6, 1996.
[18] IBM. Engineering and scientific subroutine library guide and reference, 2014. Version 5, Release 3.
[19] Intel. Math kernel library reference manual, 2014. Version 11.2.
[20] D. Irony, S. Toledo, and A. Tiskin. Communication lower bounds for distributed-memory matrix multiplication. Journal of Parallel and Distributed Computing, 64(9):1017–1026, 2004.
[21] R. W. Johnson and A. M. McLoughlin. Noncommutative bilinear algorithms for 3 × 3 matrix multiplication. SIAM Journal on Computing, 15(2):595–603, 1986.
[22] I. Kaporin. The aggregation and cancellation techniques as a practical tool for faster matrix multiplication. Theoretical Computer Science, 315(2):469–510, 2004.
[23] D. E. Knuth. The Art of Computer Programming, Volume II: Seminumerical Algorithms, 2nd Edition. Addison-Wesley, 1981. ISBN 0-201-03822-6.
[24] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
[25] B. Kumar, C.-H. Huang, P. Sadayappan, and R. W. Johnson. A tensor product formulation of Strassen's matrix multiplication algorithm with memory reduction. Scientific Programming, 4(4):275–289, 1995.
[26] J. Kurzak, P. Luszczek, A. YarKhan, M. Faverge, J. Langou, H. Bouwmeester, J. Dongarra, J. J. Dongarra, M. Faverge, T. Herault, et al. Multithreading in the PLASMA library. Multicore Computing: Algorithms, Architectures, and Applications, page 119, 2013.
[27] B. Lipshitz, G. Ballard, J. Demmel, and O. Schwartz. Communication-avoiding parallel Strassen: Implementation and performance. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, page 101, 2012.
[28] J. D. McCalpin. A survey of memory bandwidth and machine balance in current high performance computers. IEEE TCCA Newsletter, pages 19–25, 1995.
[29] V. Y. Pan. Strassen's algorithm is not optimal: Trilinear technique of aggregating, uniting and canceling for constructing fast algorithms for matrix operations. In Proceedings of the IEEE Symposium on Foundations of Computer Science, pages 166–176, 1978.
[30] A. Schönhage. Partial and total matrix multiplication. SIAM Journal on Computing, 10(3):434–455, 1981.
[31] A. Smirnov. The bilinear complexity and practical algorithms for matrix multiplication. Computational Mathematics and Mathematical Physics, 53(12):1781–1795, 2013.
[32] V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13(4):354–356, 1969.
[33] M. Thottethodi, S. Chatterjee, and A. R. Lebeck. Tuning Strassen's matrix multiplication for memory efficiency. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pages 1–14, 1998.
[34] R. A. van de Geijn and J. Watts. SUMMA: Scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience, 9(4):255–274, 1997.
[35] F. G. Van Zee and R. A. van de Geijn. BLIS: A framework for rapidly instantiating BLAS functionality. ACM Transactions on Mathematical Software, 2014. To appear.
[36] V. V. Williams. Multiplying matrices faster than Coppersmith-Winograd. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, pages 887–898. ACM, 2012.
[37] S. Winograd. On multiplication of 2 × 2 matrices. Linear Algebra and its Applications, 4(4):381–388, 1971.
