Communication Lower Bounds for Nested Bilinear Algorithms

Caleb Ju∗†, Yifan Zhang∗‡, Edgar Solomonik§

Abstract

We develop lower bounds on communication in the memory hierarchy or between processors for nested bilinear algorithms, such as Strassen's algorithm for matrix multiplication. We build on a previous framework that establishes communication lower bounds by use of the rank expansion, or the minimum rank of any fixed-size subset of columns of a matrix, for each of the three matrices encoding the bilinear algorithm. This framework provides lower bounds for any way of computing a bilinear algorithm, which encompasses a larger space of algorithms than is obtained by fixing a particular dependency graph. Nested bilinear algorithms include fast recursive algorithms for convolution, matrix multiplication, and contraction of tensors with symmetry. Two bilinear algorithms can be nested by taking Kronecker products between their encoding matrices. Our main result is a lower bound on the rank expansion of a matrix constructed by a Kronecker product, derived from lower bounds on the rank expansion of the Kronecker product's operands. To prove this bound, we map a subset of columns from a submatrix to a 2D grid, collapse them into a dense grid, expand the grid, and use the size of the expanded grid to bound the number of linearly independent columns of the submatrix. We apply the rank expansion lower bounds to obtain novel communication lower bounds for nested Toom-Cook convolution, Strassen's algorithm, and fast algorithms for partially symmetric contractions.

1 Introduction

With the proliferation of massively parallel machines, the communication cost of an algorithm (i.e., data movement across the memory hierarchy or between processors) is and will continue to be orders of magnitude larger than its arithmetic cost, both in time and energy. Therefore, it is imperative to design algorithms that minimize communication.
Communication lower bounds provide a theoretical limit and guide the design of algorithms that minimize communication [1]. Hong and Kung initiated the study of communication lower bounds by modeling the computation as a directed acyclic dependency graph (dependency DAG) and representing the data access patterns via a red-blue pebble game [2]. Since then, new techniques have been developed to derive more lower bounds, such as volumetric inequalities for nested loop programs [3, 4, 5, 6] and analysis of expansion and separability of the dependency DAG [7, 8, 9]. These approaches derive closed-form lower bounds by considering a particular dependency DAG consisting of binary operations on scalar values. However, most algorithms admit algebraic reorganizations (i.e., computation of different partial sums) that change the dependency graph and may be more communication efficient in a particular setting. By working with more abstract algorithm representations, a larger space of admissible dependency graphs can be considered simultaneously. Hypergraphs have been used to capture potential orderings of partial sums [10], while bilinear algorithms [11] provide a more powerful abstraction for problems that can be posed as bilinear maps on two input sets. Many important numerical problems fall under this category, including matrix multiplication, convolution, and symmetric tensor contractions, and all known fast algorithms for these problems can be expressed as bilinear algorithms.

∗Equal contribution. Work was partially done while at the University of Illinois at Urbana-Champaign.
†School of Industrial and Systems Engineering, Georgia Institute of Technology, [email protected]
‡Oden Institute for Computational Engineering and Sciences, University of Texas at Austin, [email protected]
§Department of Computer Science, University of Illinois at Urbana-Champaign, [email protected]
arXiv:2107.09834v1 [cs.DC] 21 Jul 2021
A bilinear algorithm $(A, B, C)$ with $A \in \mathbb{C}^{m_A \times R}$, $B \in \mathbb{C}^{m_B \times R}$, and $C \in \mathbb{C}^{m_C \times R}$ computes $f(x, y) = C[(A^T x) \odot (B^T y)]$, where $\odot$ is the Hadamard (element-wise or bilinear) product. The value $R$ is called the rank of the bilinear algorithm. When a subset of columns from $A$, $B$, or $C$ forms a low-rank matrix, the communication cost of executing this portion of the computation can be reduced. To see why, let $P$ consist of $k$ different columns from an identity matrix of size $R$, so that the portion of the bilinear algorithm associated with $k$ of the $R$ bilinear products is $CP[((AP)^T x) \odot ((BP)^T y)]$. We see that $\mathrm{rank}(AP)$, $\mathrm{rank}(BP)$, and $\mathrm{rank}(CP)$ bound the minimum number of linear combinations of inputs needed from $x$ and $y$, and the amount of output information produced, respectively, in computing this portion of the bilinear algorithm. Lower bounds on the growth of this rank with $k$, i.e., the rank expansion, yield lower bounds for any execution DAG of the bilinear algorithm [12].

We focus on nested bilinear algorithms [11], which are bilinear algorithms constructed via Kronecker products of the matrices encoding the two factor bilinear algorithms: $(A_1 \otimes A_2, B_1 \otimes B_2, C_1 \otimes C_2)$. This abstraction captures both recursive and higher-order methods for matrix multiplication, integer multiplication, convolution, tensor contractions, as well as other algorithms. We show that, in general, the rank expansion for the matrices defining a nested bilinear algorithm is based on the rank expansion of its factors. We prove that for a certain class of rank expansion lower bounds $\sigma_A$ and $\sigma_B$ for $A$ and $B$, respectively, there exists a rank expansion lower bound $\sigma_C$ for $C = A \otimes B$ satisfying $\sigma_C(k) \le \sigma_A(k_A)\,\sigma_B(k_B)$ for some reals $k_A \in [1, n_A]$ and $k_B \in [1, n_B]$ with $k_A k_B = k$ and $n_X = \#\mathrm{cols}(X)$. We state these results in Section 2 and provide the proofs in Section 6. In Section 4, we apply our framework towards fast algorithms for matrix multiplication, convolution, and partially symmetric tensor contractions.
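As a concrete instance of the definitions above, the following minimal sketch evaluates $f(x, y) = C[(A^T x) \odot (B^T y)]$ for a rank-3 bilinear algorithm and then nests it with itself via Kronecker products. The Karatsuba-style (Toom-2) encoding matrices `A1`, `B1`, `C1`, the helper name `apply_bilinear`, and the final overlap-add step that folds the nested output back into a 1D convolution are illustrative assumptions chosen for this example, not constructions taken from the text.

```python
import numpy as np

def apply_bilinear(A, B, C, x, y):
    """Evaluate f(x, y) = C [(A^T x) ⊙ (B^T y)] for a bilinear algorithm (A, B, C)."""
    return C @ ((A.T @ x) * (B.T @ y))

# Assumed example: Karatsuba-style encoding of the linear convolution of two
# length-2 vectors. The rank is R = 3 (three bilinear products):
#   p0 = x0*y0, p1 = x1*y1, p2 = (x0 + x1)*(y0 + y1).
A1 = np.array([[1, 0, 1],
               [0, 1, 1]])
B1 = A1.copy()
# Output recombination: z0 = p0, z1 = p2 - p0 - p1, z2 = p1.
C1 = np.array([[ 1,  0, 0],
               [-1, -1, 1],
               [ 0,  1, 0]])

x = np.array([1, 2])
y = np.array([3, 4])
z = apply_bilinear(A1, B1, C1, x, y)

# Nested algorithm (A1 ⊗ A1, B1 ⊗ B1, C1 ⊗ C1) of rank 3 * 3 = 9.
A2, B2, C2 = np.kron(A1, A1), np.kron(B1, B1), np.kron(C1, C1)

x4 = np.array([1, 2, 3, 4])
y4 = np.array([5, 6, 7, 8])
# With row-major indexing x4[2*i + j] = X[i, j], the nested algorithm returns
# the 3x3 two-dimensional convolution of the 2x2 arrays X and Y.
Z = apply_bilinear(A2, B2, C2, x4, y4).reshape(3, 3)

# Overlap-add (an assumption of this sketch): entry Z[a, b] contributes to the
# coefficient of t^(2a + b) in the product polynomial.
z7 = np.zeros(7, dtype=Z.dtype)
for a in range(3):
    for b in range(3):
        z7[2 * a + b] += Z[a, b]

print(z, np.convolve(x, y))
print(z7, np.convolve(x4, y4))
```

Both printed pairs should agree with NumPy's direct convolution. The nested algorithm uses 9 scalar products where the direct length-4 convolution uses 16, which is the kind of saving that recursive nesting provides.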
We obtain lower bounds on both vertical communication (communication in a two-level memory hierarchy) as well as horizontal communication (communication between processors in a distributed-memory computer with a fully connected network). The latter bounds can be translated to the LogGP and BSP models [12]. Our lower bounds are all novel in that they consider a larger space of algorithms than previous works. We obtain the first communication lower bounds for nested symmetry preserving tensor contraction algorithms [13], lower bounds for multi-dimensional and recursive Toom-Cook (i.e., convolution) that are either novel or match previously known bounds [7], and lower bounds for Strassen's algorithm which are asymptotically lower than previous results [8]. See Table 1 for a comparison between previously known lower bounds and the lower bounds derived in this paper.

| Algorithm | Previous (V) | Previous (H) | This Paper (V) | This Paper (H) |
|---|---|---|---|---|
| Strassen's | $n^{\log_2 7} / H^{\log_4(7) - 1}$ [8] | $n^2 / p^{\log_7 4}$ [8] | $n^{\log_2 7} / H^{\log_2(3) - 1}$ (4.3) | $n^{\log_3 7} / p^{\log_3 2}$ (4.4) |
| Recursive convolution | $n^{\log_k(2k-1)} / H^{\log_k(2k-1) - 1}$ [7] | - | Match Prev (4.6) | $n / p^{\log_{2k-1} k}$ (4.7) |

Table 1: Asymptotic, non-trivial (not reading inputs or writing outputs) communication lower bounds between fast memory of size $H$ and slow memory (i.e., vertical, denoted by V) as well as between $p$ processors (i.e., horizontal, denoted by H). The two algorithms are Strassen's fast matrix multiplication and nested Toom-$k$ for 1D convolution. A dash indicates that no previous lower bound is known, to the best of our knowledge.

2 Notation and Definitions

We denote by $\mathbb{N} = \{1, 2, \ldots\}$ the natural numbers. For any $n \in \mathbb{N}$, we write $[n] = \{1, 2, \ldots, n\}$. We now introduce the main tool to quantify communication lower bounds. Let $\mathcal{P}_n^{(k)}$ be the set of matrices comprised of $k$ different columns from an identity matrix of size $n$. We write $f(x) \propto_{+} g(x)$ if there exists $c > 0$ such that $f(x) = c \cdot g(x)$.

Definition 2.1. The rank expansion of $A \in \mathbb{C}^{m \times n}$, $\tilde{\sigma} : \{0\} \cup [n] \to \{0\} \cup \mathbb{N}$, is defined as

$$\tilde{\sigma}(k) = \min_{P \in \mathcal{P}_n^{(k)}} \mathrm{rank}(AP).$$

To obtain communication lower bounds for bilinear algorithms, we seek to lower bound $\tilde{\sigma}$ by a continuous increasing function $\sigma$.

Definition 2.2. A lower bound on the rank expansion $\tilde{\sigma}$ for $A \in \mathbb{C}^{m \times n}$ is a continuous and increasing function $\sigma : \mathbb{R}_+ \to \mathbb{R}_+$ with $\sigma(k) \le \tilde{\sigma}(k)$ for all $k \in [n]$ and $\sigma(0) = 0$. We denote the pseudo-inverse of the function $\sigma$ by $\sigma^\dagger : [0, +\infty) \to [0, +\infty]$, $\sigma^\dagger(x) = \sup\{k : \sigma(k) \le x\}$.

3 Lower Bounds for Rank Expansion of Kronecker Product Matrices

In this section, we show how to construct a rank expansion lower bound $\sigma_C$ for $C = A \otimes B$ when given the rank expansion lower bounds $\sigma_A$ and $\sigma_B$ for $A$ and $B$, respectively. We can reduce the search for a continuous increasing function $\sigma_C$ to solving an optimization problem. If we put further assumptions on $\sigma_A$ and $\sigma_B$, we can simplify the optimization problem and derive closed-form expressions for $\sigma_C$. More precisely, we prove the following theorem in Section 6.

Theorem 3.1. Suppose $\sigma_A$ and $\sigma_B$ are concave rank expansion lower bounds for $A \in \mathbb{C}^{m_A \times n_A}$ and $B \in \mathbb{C}^{m_B \times n_B}$, respectively. Let $d_A = \sigma_A^\dagger(1)$ and $d_B = \sigma_B^\dagger(1)$. Then $\sigma_C : [1, n_A n_B] \to \mathbb{R}$,

$$\sigma_C(k) = \min\{\sigma_A(k / d_B),\ \sigma_B(k / d_A)\} \tag{1}$$

is a concave rank expansion lower bound for $C = A \otimes B$.

Corollary 3.2. Let $\sigma_i$ be a rank expansion lower bound of $A_i \in \mathbb{R}^{m_i \times n_i}$, $1 \le i \le p$, and let $A = \bigotimes_{i=1}^{p} A_i$. If the $\sigma_i$ are concave, then $\sigma_A : [1, \prod_i n_i] \to \mathbb{R}$,

$$\sigma_A(k) = \min_j \left\{ \sigma_j\!\left( \frac{k}{\prod_{i \ne j} d_i} \right) \right\} \tag{2}$$

is a concave rank expansion lower bound of $A$, where $d_i = \sigma_i^\dagger(1)$.
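To make Definition 2.1 and Theorem 3.1 concrete, the sketch below computes the rank expansion $\tilde{\sigma}$ by brute force over all column subsets of a small matrix and of its Kronecker square, then checks the bound of equation (1) numerically. The example matrix `A`, the helper name `rank_expansion`, and the linear lower bound $\sigma(k) = 2k/3$ (so that $d = \sigma^\dagger(1) = 3/2$) are assumptions made for illustration. The enumeration is exponential in the number of columns and is only feasible for tiny matrices.

```python
from itertools import combinations
import numpy as np

def rank_expansion(A):
    """Brute-force rank expansion: sigma[k] = min rank over all k-column subsets of A."""
    n = A.shape[1]
    sigma = [0]  # the empty subset has rank 0
    for k in range(1, n + 1):
        sigma.append(min(int(np.linalg.matrix_rank(A[:, list(cols)]))
                         for cols in combinations(range(n), k)))
    return sigma

# Assumed example: a 2x3 matrix whose columns are pairwise linearly independent.
A = np.array([[1, 0, 1],
              [0, 1, 1]])
sig_A = rank_expansion(A)   # [0, 1, 2, 2]

# Rank expansion of the Kronecker product C = A ⊗ A (a 4x9 matrix).
K = np.kron(A, A)
sig_K = rank_expansion(K)

# Assumed concave (linear) lower bound for A: sigma(k) = 2k/3, which satisfies
# sigma(0) = 0 and sigma(k) <= sig_A[k] for k = 1, 2, 3.
def sigma_lin(k):
    return 2.0 * k / 3.0

d = 1.5  # pseudo-inverse value sup{k : sigma_lin(k) <= 1}

# Theorem 3.1 with A = B: sigma_C(k) = min{sigma_A(k / d_B), sigma_B(k / d_A)}.
def sigma_C(k):
    return min(sigma_lin(k / d), sigma_lin(k / d))

bound_holds = all(sigma_C(k) <= sig_K[k] for k in range(1, K.shape[1] + 1))
print(sig_A, sig_K, bound_holds)
```

For this choice of $\sigma$ the bound simplifies to $\sigma_C(k) = 4k/9$, which meets the true rank expansion of $A \otimes A$ at $k = 9$, where all nine columns span the full four-dimensional space.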
