Communication Lower Bounds for Nested Bilinear Algorithms

Caleb Ju∗†, Yifan Zhang∗‡, Edgar Solomonik§

Abstract We develop lower bounds on communication in the memory hierarchy or between proces- sors for nested bilinear algorithms, such as Strassen’s algorithm for multiplication. We build on a previous framework that establishes communication lower bounds by use of the rank expansion, or the minimum rank of any fixed size subset of columns of a matrix, for each of the three matrices encoding the bilinear algorithm. This framework provides lower bounds for any way of computing a bilinear algorithm, which encompasses a larger space of algorithms than by fixing a particular dependency graph. Nested bilinear algorithms include fast recursive algorithms for convolution, , and contraction of tensors with symmetry. Two bilinear algorithms can be nested by taking Kronecker products between their encoding matrices. Our main result is a lower bound on the rank expansion of a matrix constructed by a Kronecker product derived from lower bounds on the rank expansion of the Kronecker product’s operands. To prove this bound, we map a subset of columns from a submatrix to a 2D grid, collapse them into a dense grid, expand the grid, and use the size of the expanded grid to bound the number of linearly independent columns of the submatrix. We apply the rank expansion lower bounds to obtain novel communication lower bounds for nested Toom-Cook convolution, Strassen’s algorithm, and fast algorithms for partially symmetric contractions.

1 Introduction

With the proliferation of massively parallel machines, the communication cost of an algorithm (i.e., data movement across the memory hierarchy or between processors) is and will continue to be orders of magnitudes larger than its arithmetic cost, both in time and energy. Therefore, it is imperative to design algorithms that minimize communication. Communication lower bounds provide a theoretical limit and guide the design of algorithms that minimize communication [1]. Hung and Kung initiated the study of communication lower bounds by modeling the com- putation as a directed acyclic dependency graph (dependency DAG), and representing the data arXiv:2107.09834v1 [cs.DC] 21 Jul 2021 access patterns via a red-blue pebble game [2]. Since then, new techniques have been developed to derive more lower bounds, such as volumetric inequalities for nested loop programs [3, 4, 5, 6] and analysis of expansion and separability of the dependency DAG [7, 8, 9]. These approaches derive closed form expressions lower bounds by considering a particular dependency DAG con- sisting of binary operations on scalar values. However, most algorithms admit algebraic reor- ganizations (i.e., computation of different partial sums) that change the dependency graph and

∗Equal contribution. Work was partially done while at the University of Illinois at Urbana-Champaign †School of Industrial and Systems Engineering, Georgia Institute of Technology, [email protected] ‡Oden Institute for Computational Engineering and Sciences, University of Texas at Austin, [email protected] §Department of Computer Science, University of Illinois at Urbana-Champaign, [email protected]

1 may be more communication efficient in a particular setting. By working with more abstract algorithm representations, a larger space of admissible dependency graphs can be considered simultaneously. Hypergraphs have been used to capture potential orderings of partial sums [10], while bilinear algorithms [11] provide a more powerful abstraction for problems which can be posed as bilinear maps on two input sets. Many important numerical problems fall under this category, including matrix multiplication, convolution, and symmetric tensor contractions, and all known fast algorithms for these problems can be expressed as bilinear algorithms. m ×R m ×R m ×R A bilinear algorithm (A, B, C) with A ∈ C A , B ∈ C B , and C ∈ C C computes f(x, y) = C[(AT x) (BT y)], where is the Hadamard (element-wise or bilinear) product. The value R is called the rank of the bilinear algorithm. When a subset of columns from A, B, or C is a low-rank matrix, then the communication costs can be reduced for executing this portion of the computation. To see why, let P consist of a subset of k different columns from an identity matrix of size R so that a portion of the bilinear algorithm associated with k of the R bilinear products is CP (AP )T x BP )T y. We see that rank(AP ), rank(BP ), and rank(CP ) bound the minimum number of linear combinations of inputs needed from x, y, and the amount of output information produced, respectively, in computing this portion of the bilinear algorithm. Lower bounds on the growth of this rank with k, i.e., the rank expansion, yield lower bounds for any execution DAG of the bilinear algorithm [12]. We focus on nested bilinear algorithms [11], which are bilinear algorithm constructed via

Kronecker products of matrices encoding the two factor bilinear algorithms: (A1 ⊗ A2, B1 ⊗

B2, C1 ⊗ C2). This abstraction captures both recursive and higher-order methods for matrix multiplication, integer multiplication, convolution, tensor contractions, as well as other algo- rithms. We show in general the rank expansion for the matrices defining a nested bilinear algorithm is based on the rank expansion of its factors. We prove that for a certain class of rank expansion lower bounds σA and σB for A and B, respectively, there exists a rank expansion lower bound σC for C = A ⊗ B satisfying σC (k) 6 σA(kA)σB(kB), for some reals kA ∈ [1, nA] and kB ∈ [1, nB] with kAkB = k and nX = #cols(X). We state these results in Section 2 and provide the proofs in Section 6. In Section 4, we apply our framework towards fast algorithms for matrix multiplication, convolution, and partially symmetric tensor contractions. We obtain lower bounds on both vertical communication (communication in a two-level memory hierarchy) as well horizontal communication (communication between processors in a distributed-memory computer with a fully connected network). The latter bounds can be translated to the LogGP and BSP model [12]. Our lower bounds are all novel in that they consider a larger space of algorithms than previous works. We obtain the first communication lower bounds for nested symmetry preserving tensor contraction algorithms [13], lower bounds for multi-dimensional and recursive Toom-Cook (i.e., convolution) that are either novel or match previously known bounds [7], and lower bounds for Strassen’s algorithm which are asymptotically lower than previous results [8]. See Table 1 for a comparison between previously known lower bounds and the lower bounds derived in this paper.

2 Algorithm Previous (V) Previous (H) This Paper (V) This Paper (H)

nlog2(7) n2 nlog2(7) nlog3(7) Strassen’s [8] [8] (4.3) (4.4) Hlog4(7)−1 plog7(4) Hlog2(3)−1 plog3(2) Recursive nlogk(2k−1) n [7] - Match Prev (4.6) (4.7) convolution Hlogk(2k−1)−1 plog2k−1(k)

Table 1: Asymptotic, non-trivial (not reading inputs or writing outputs) communication lower bounds between fast memory of size H and slow memory (i.e., vertical, denoted by V) as well as between p processors (i.e., horizontal, denoted by H). The two algorithms are Strassen’s fast matrix multiplication and nested Toom-k for 1D convolution. A dash indicates no previous lower bound is known, to the best of our knowledge.

2 Notation and Definitions

We will denote N = {1, 2,...} to be the natural numbers. For any n ∈ N, we write [n] = {1, 2, . . . , n}. We now introduce the main tool to quantify communication lower bounds. Let (k) Pn be the set of matrices comprised of k different columns from an identity matrix of size n. We write f(x) ∝+ g(x) if there exists c > 0 such that f(x) = c · g(x).

m×n Definition 2.1. The rank expansion of A ∈ C ,σ ˜ : {0} ∪ [n] 7→ {0} ∪ N, is defined as σ˜(k) = min  rank(AP ) . (k) P ∈Pn To obtain communication lower bounds for bilinear algorithms, we seek to lower boundσ ˜ by a continuous increasing function σ.

m×n Definition 2.2. A lower bound on the rank expansionσ ˜ for A ∈ C is a continuous and increasing function σ : R+ 7→ R+ with σ(k) 6 σ˜(k) for all k ∈ [n] and σ(0) = 0. We denote the † † pseudo-inverse of function σ as σ : [0, +∞) 7→ [0, +∞], σ (x) = sup {k : σ(k) 6 x}.

3 Lower Bounds for Rank Expansion of Kronecker Product Ma- trices

In this section, we show how to construct a rank expansion lower bound σC for C = A ⊗ B when given the rank expansion lower bounds σA and σB for A and B, respectively. We can reduce the search for a continuous increasing function σC to solving an optimization problem. If we put further assumptions on σA and σB, we can simplify the optimization problem and derive closed-form expressions for σC . More precisely, we prove the following theorem in Section 6.

m ×n Theorem 3.1. Suppose σA and σB are concave rank expansions lower bounds for A ∈ C A A mB ×nB † † and B ∈ C , respectively. Let dA = σA(1), dB = σB(1). Then σC : [1, nAnB] 7→ R,

σC (k) = min {σA(k/dB), σB(k/dA)} (1) is a concave rank expansion lower bound for C = A ⊗ B.

3 mi×ni Corollary 3.2. Let σi be a rank expansion lower bound of Ai ∈ R , 1 6 i 6 p, and let Np Q A = i=1 Ai. If σi are concave, then σA : [1, i ni] 7→ R, ( !) k σA(k) = min σj Q (2) j i6=j di

† qi is a concave rank expansion lower bound of A, where di = σi (1). In particular, if σi(k) = (k/ki) with ki > 1, qi ∈ (0, 1], then  k min{qi} σA(k) = Q . (3) i ki

If σi(k) = ai ln(bik + 1), then

σA(k) = a ln(bk + 1), (4) e1/a − 1 where a = min {ai}, b = . Qn −1 1/ai i=1 bi (e − 1) N Proof. Equation (2) will follow from Theorem 3.1 using an induction on p. For B = i

qj ! !  qj k k/dj k σj Q = Q = Q . i6=j di i6=j di i ki

The result follows after a trivial minimization. For σi(k) = ai ln(bik + 1), we proceed by induction. We assume that a1 > ... > ap−1 > ap, N and again let B = i

  k   k  σA(k) = min {σB(k/dp), σp(k/dB)} = min ap−1 ln bB + 1 , ap ln bp + 1 , dp dB

† −1 1/ap † −1 1/ap−1 where dp = σp(1) = bp (e − 1), and dB = σB(1) = bB (e − 1), and by the induction e1/ap−1 − 1 hypothesis bB = . Q −1 1/ai i

dBdp. Let k = γdBdp, γ > 1, we have n    o 1/ap−1 1/ap σA(k) = min ap−1 ln (e − 1)γ + 1 , ap ln (e − 1)γ + 1 .

We prove that the the second term is smaller than the first. With this established, one can

4 directly verify that σA coincides with the proposed function given by (4), and the proof is then h(x) x complete. To that end, let us consider g(x) = x , where h(x) = ln (γ(e − 1) + 1). The two terms in the minimization are respectively g(1/ap−1) and g(1/ap). Now it suffices to show that g is decreasing on R+. Indeed, note that h(0) = 0, and one can check that h is a concave function on R+ when γ > 1. Thus g(x) is the secant slope of h between x1 = 0 and x2 = x, which is decreasing in x. 

Remark 3.3. Using different lower bounds σA and σB in Theorem 3.1 gives different rank expansion lower bounds σC , and the optimal choice is not always obvious, as illustrated below.

4×7 Example 3.4. Consider the case A = B ∈ R with rank and Kruskal rank (as defined in [14]) being 4. For example,

  1 0 0 0 1 1 1   0 1 0 0 1 2 3  A = B =   . (5) 0 0 1 0 1 4 9    0 0 0 1 1 8 27

Denote ai, bi the ith column of A and B. Nowσ ˜A(x) =σ ˜B(x) = min {x, 4}. Let C = A ⊗ B, and we seek rank expansion lower bound σC (k) for C.

When k = 13, it is not hard to check thatσ ˜C (k) = 7, attained by submatrix CP =

{ai ⊗ bj : i = 1 or j = 1}. If we naturally take σA =σ ˜A, σB =σ ˜B, Theorem 3.1 gives σC (13) =  1 1 4. If we take instead σA(k) = σB(k) = min k, 2 k + 2 , Theorem 3.1 gives σC (13) = 7, which is optimal.

However when k = 5 < nA = 7, with σA =σ ˜A, σB =σ ˜B, σC (5) = 4 =σ ˜C (5), but with  1 1 σA = σB = min k, 2 k + 2 , σC (5) = 3. So the best choice of σA, σB can depend on k.

Although the rank expansion lower bound stated in Theorem 3.1 is easy to use and can be applied in a nested manner, it involves function evaluations σA(kA) and σB(kB) at kA > nA, kB > nB. This may not be desirable when σA or σB is not well-defined beyond its original domain and generally leads to a suboptimal bound for sufficiently large k. Thus, we derive the following tighter bounds that only depend on σA|[1,nA] and σB|[1,nB ]. In this case, the optimal choice of σA, σB is to matchσ ˜A,σ ˜B. See Remark 6.15 for more details.

Theorem 3.5. Suppose functions σA and σB are concave rank expansion lower bounds of A ∈ mA×nA mB ×nB † C and B ∈ C , respectively. Let rA = σA(nA), rB = σB(nB), dA = σA(1), dB = † σB(1). Define RC : [1, nAnB] 7→ R by

RC (k) = min σA(kA) · σB(kB), (6) kA∈[dA,nA], kB ∈[dB ,nB ] kA∈{dA,nA} or kB ∈{dB ,nB }, kAkB >k and LC : [1, nAnB] 7→ R by

LC (k) = min rBσA(kA) + rAσB(kB) − σA(kA) · σB(kB). (7) kA∈[0,nA], kB ∈[0,nB ], nB kA+nAkB −kAkB =k

5 Then σC (k) = min {LC (k),RC (k)} is a rank expansion lower bound of C = A ⊗ B. If RC (k) 6 max {rA, rB} then σC (k) = RC (k) 6 LC (k).

RC is easily computable – given constants k, nA, nB, dA, dB, there will be only two feasible combinations of (kA, kB) one needs to compare. Similar simplifications of LC may not to be new readily available. The bound σC derived in Theorem 3.5 is tighter than the bound σC derived new in Theorem 3.1 since σC (k) 6 RC (k) 6 σC (k). Under further assumptions on σA and σB, we can simplify Theorem 3.5 to the following.

Theorem 3.6. Let everything be defined as in Theorem 3.5. Suppose in addition σA and σB −1 −1 are strictly increasing and differentiable. Let f(x) = σA (x), g(x) = σB (x). If

ˆ f(rA) − f(rA − x) g(rB) − g(rB − x) f(x) = 0 , gˆ(x) = 0 . xf (rA − x) xg (rB − x) are increasing on (0, rA) and (0, rB), respectively, then σC : [1, nAnB] 7→ R,

σC (k) = min σA(kA) · σB(kB), (8) kA∈[dA,nA], kB ∈[dB ,nB ] kA∈{dA,nA} or kB ∈{dB ,nB }, kAkB >k is a rank expansion lower bound of C = A ⊗ B.

The assumption on fˆ andg ˆ is satisfied by commonly used convex increasing functions, such as monomial and exponential functions.

Corollary 3.7. For the following choices of f (resp. g), fˆ(x) (resp. gˆ) is increasing in x: (i) f(x) = b(ax − 1), a > 1, b > 0; p (ii) f(x) = ax , a > 0, p > 1. Thus when σA and σB are monomials or logarithms,

σC (k) = min σA(kA) · σB(kB) kA∈[dA,nA], kB ∈[dB ,nB ] kA∈{dA,nA} or kB ∈{dB ,nB }, kAkB >k is a rank expansion lower bound of C = A ⊗ B.

Proof. Assuming Theorem 3.6, it suffices to verify fˆ(x) is increasing. Direct computation gives (i) when f is exponential,

(arA − 1) − (arA−x − 1) ax − 1 fˆ(x) = ∝+ ; (9) xarA−x ln a x and (ii) when f is monomial,

p p p ˆ rA − (rA − x) t=x/rA 1 − (1 − t) f(x) = p−1 ∝+ p−1 , (t ∈ (0, 1)). (10) xp(rA − x) t(1 − t)

It is then trivial to check that they are indeed increasing functions on their domains.  In Section 6, we will prove and discuss these rank expansion lower bounds in detail. Although Theorem 3.5 and Theorem 3.6 in general give tighter lower bounds on the rank expansion for

6 A ⊗ B, the resulting function may not be concave and thus may not be repeatedly applied to nested algorithms. Before the proofs, we first describe some applications.

4 Applications

To derive communication lower bounds, we require an explicit upper bound for the size of any subset of selected columns with bounded rank. Let us make this concrete. Recall that R is the rank of the bilinear algorithm.

Definition 4.1. The bilinear algorithm (A, B, C) has non-decreasing (in all variables) expan- 3 (k) sion bound E : N 7→ N, if for any k ∈ [R] and P ∈ PR ,   #cols(P ) 6 E rank(AP ), rank(BP ), rank(CP ) .

In the forthcoming applications, we will define the expansion bound using the inverse of the lower bound on the rank expansion, for which we can apply the results on the rank expansion lower bounds for Kronecker products. Finally, our model of computational prohibits re-computation in the bilinear algorithm, meaning that if a bilinear product has already been computed and is not in fast memory or the current processor, the algorithm must communicate the value instead of recomputing it [12]. This restriction differs from other works, e.g., [7].

4.1 Fast Matrix Multiplication

Communication lower bounds for classical matrix multiplication are usually derived by explicitly reasoning about potential partial sums and applying the Loomis-Whitney inequality [15]. The expansion bound for classical matrix multiplication follows from the same inequality H3/2 [12], where H is the size of fast memory. Strassen’s algorithm [16], the most practical known fast algorithm [17] among those that achieve sub-cubic complexity, corresponds to a more nontrivial bilinear algorithm than classical matrix multiplication. Previous work on communication lower bounds for Strassen’s algorithm has yielded tight bounds by making use of graph expansion [8]. In particular, they consider the computational DAG arising from executing the multiplications and additions prescribed by Strassen’s algorithm. We derive communication lower bounds for the bilinear algorithm given by Strassen’s ap- proach, which considers any other computational DAG. These DAGs include ones that do not follow the recursive structure of Strassen’s algorithm; they need only compute the same scalar products at the base-case level of recursion. In particular, the operands can be computed by any other additions or linear combinations. First, we establish expansion bounds for bilinear algorithms encoding Strassen’s algorithm.

⊗τ ⊗τ ⊗τ Lemma 4.2. For τ ∈ Z+ and the nested bilinear algorithm F = (A , B , C ), where ⊗τ Nτ X = i=1 X and (A, B, C) is the bilinear algorithm for Strassen’s base algorithm (Defini- tion 7.1), E(d(a), d(b), d(c)) = min(d(a), d(b), d(c))log2(3)

7 is an expansion bound for F.

Proof. The rank expansion for A, B, and C matrices of Strassen’s base algorithm for k = 1, 2,..., 7 is given by 1, 2, 2, 3, 3, 4, 4, respectively. Therefore, function σ(k) = klog2(3) is a rank expansion lower bound for A, B, and C. By application of Corollary 3.7 with a = 1 and p = log2(3) > 1, the rank expansion lower bound for A ⊗ A is

−1 σA⊗A(k) = min σ(kA) · σ(kB)(dA = dB = σ (1) = 1) kA∈[dA,nA], kB ∈[dB ,nB ] kA∈{dA,nA} or kB ∈{dB ,nB }, kAkB >k   log (3) = σ min(k, nA) · σ max(1, k/nA) = k 2 = σ(k).

Repeatedly applying Corollary 3.7, we find σ(k) remains a rank expansion lower bound for ⊗τ ⊗τ ⊗τ τ (k) ⊗τ A as well as B and C . Then for k ∈ [7 ] and P ∈ P7τ , σ(k) 6 min(rank(A P ), rank(B⊗τ P ), rank(C⊗τ P )). Applying the inverse σ−1(k) = klog2(3) to both sides leads to

 −1 ⊗τ  −1 ⊗τ  −1 ⊗τ  #cols(P ) = k 6 min σ rank(A P ) , σ rank(B P ) , σ rank(C P ) = Erank(A⊗τ P ), rank(B⊗τ P ), rank(C⊗τ P ).

Thus, E is an expansion bound by Definition 4.1.  Next, we apply Lemma 4.2 to derive a vertical communication lower bound.

Theorem 4.3 (Vertical Communication Costs for Strassen’s Algorithm). Given square matrices of size n, the vertical communication cost of Strassen’s fast matrix is at least 2nlog2(7)  max · H, 3n2 . Hlog2(3)

Proof. The bilinear rank of Strassen’s algorithm is R = nlog2(7). Using the expansion bound for Strassen’s algorithm from Lemma 4.2, we have

Emax(H) = max E(c(A), c(B), c(B)) c(A),c(B),c(C)∈N,c(A)+c(B)+c(C)=3H = max min(c(A), c(B), c(C))log2(3) = Hlog2(3). c(A),c(B),c(C)∈N,c(A)+c(B)+c(C)=3H

By Theorem 5.2 of [12], the vertical communication lower bound of the bilinear algorithm is

h 2RH i max ,N (A) + N (B) + N (C) , (11) Emax(H) where N (A) and N (B) are the sizes of the two inputs and N (C) is the size of the output, which are all equal to n2. The communication lower bound follows after substituting these values into (11).  In contrast, the existing vertical commmunication lower bound for the standard Strassen’s algorithm computational DAG [8] is

 nlog2(7)  max · H, 3n2 . Hlog4(7)

8 Our asymptotically smaller communication lower bound suggests that our bound may not be tight. Alternatively, there may exist a reordering of partial sums from Strassen’s algorithm that can reduce the communication costs. Such a reordering must exploit the low-rank structure of submatrices, particularly the submatrix of size three with rank two, to achieve higher re-use of inputs. At the same time, it is unclear how the remaining submatrix of size four, which is not as low rank, affects the total communication cost. Next we consider lower bounds on horizontal communication costs.

Theorem 4.4 (Horizontal Communication Costs for Strassen’s Algorithm). Given square ma- trices of size n, the horizontal communication costs with p processors for computing Strassen’s fast matrix multiplication algorithm is at least

nlog3(7) n2  3 · − . plog3(2) p

Proof. By Theorem 5.3 from [12], the communication lower bound is no smaller than the sum (A) (B) (C) of some non-negative integers c , c , c ∈ N, which satisfy, R E(c(A) + n2/p, c(B) + n2/p, c(C) + n2/p) (12) p 6 for a bilinear algorithm with an expansion bound E. Lemma 4.2 provides an expansion bound E for Strassen’s algorithm, yielding

R min c(A) + n2/p, c(B) + n2/p, c(C) + n2/plog2(3). p 6

Since the rank is R = nlog2(7), the communication cost is at least

log3(7) 2 (A) (B) (C) n n  c + c + c > 3 · − . plog3(2) p



4.2 Convolution

m Given a set of distinct nodes {xi}i=1, where xi ∈ C, the corresponding Vandermonde matrix is n m×n n j−1 Vm ∈ C , where [Vm]i,j = xi . To compute the discrete convolution between two vectors k  k T k T 2k−1 −1 f, g ∈ C , we use the Toom-k bilinear algorithm, F = (V2k−1) , (V2k−1) , (V2k−1 ) [18]. The term Toom-k refers to a particular bilinear algorithm for computing the convolution between vectors of size k, and it belongs to a broader class of convolution algorithms known as Toom- Cook. For example, the discrete Fourier transform (DFT) is a special case of Toom-Cook, hence our lower bounds apply to fast Fourier transform (FFT)-based approaches for convolution. By nesting Toom-Cook bilinear algorithms, one can utilize split-nesting schemes to derive algorithms for 1D convolution that are more stable and computationally efficient [19, 20], as well as to compute multidimensional convolution [21, 22]. This, however, results in bilinear algorithms whose matrices are no longer Vandermonde. In view of this, we apply Corollary 3.7 to deduce rank expansion lower bounds for Nτ (V ki )T given arbitrary integers k . i=1 2ki−1 i

9 Nτ Nτ Nτ Lemma 4.5. For τ ∈ Z+ and the nested bilinear algorithm F = ( i=1 Ai, i=1 Bi, i=1 Ci), where (Ai, Bi, Ci) is the bilinear algorithm for Toom-ki and k = min ki, i

E(d(A), d(B), d(C)) = min{(d(A))logk(2k−1), (d(B))logk(2k−1)} is an expansion bound for F.

2k−1 Proof. Since C = V2k−1 is linearly independent, the function σC (`) = ` is a valid rank k expansion lower bound. On the other hand, A = B = V2k−1 are Vandermonde matrices with k T (`) nodes {xi} = S. Thus, (V2k−1) P , where P ∈ P2k−1 is a subset of ` unique columns from an identity matrix of size 2k − 1, is the transpose of a Vandermonde matrix with a subset of nodes C ⊆ S. And because S contains unique nodes, then C must also have contain unique k T nodes. Hence, (V2k−1) P is full rank. With this observation, we find the functions σA(`) = log (k) σB(`) = ` 2k−1 are valid rank expansion lower bounds. Repeatedly invoking Corollary 3.7 log (k) log (k) like in Lemma 4.2, we have that σA(`) = ` 2k−1 , σB(`) = ` 2k−1 , and σC (`) = ` are sigma expansion lower bounds for A⊗τ , B⊗τ , and C⊗τ , respectively. Q (`) Set K = i ki. Then for ` ∈ [K] and P ∈ PK , the definition of rank expansion lower bounds ⊗τ ⊗τ ⊗τ gives us σA(`) 6 rank(A P ), σB(`) 6 rank(B P ), and σC (`) 6 rank(C P ). Applying the inverse to both sides of the three aforementioned equations yields

 −1 ⊗τ  −1 ⊗τ  −1 ⊗τ  #cols(P ) = ` 6 min σA rank(A P ) , σB rank(B P ) , σC rank(C P ) ⊗τ log (2k−1) ⊗τ log (2k−1) ⊗τ  6 min rank(A P ) k , rank(B P ) k , rank(C P ) ⊗τ log (2k−1) ⊗τ log (2k−1) 6 min rank(A P ) k , rank(B P ) k = Erank(A⊗τ P ), rank(B⊗τ P ), rank(C⊗τ P ).

Thus, E is an expansion bound by Definition 4.1.  We now state a communication lower bound when nesting the same Toom-k bilinear algo- rithm. Note that these bounds hold for both multidimensional and recursive (1D) convolution. The latter is simply multidimensional convolution plus a recomposition step, where the recom- position is applied via a pre-multiplication by the linear operator Q ∈ {0, 1}2n−1×(2k−1)τ (see Section 7.3 in [18]) to C⊗τ . Because the rank of the C⊗τ matrix, the only matrix in the bilinear algorithm that is affected by the recomposition matrix Q, is absent in the expansion bound of Lemma 4.5, the expansion bound still holds for recursive (1D) convolution. First, we focus on vertical communication, or communication between a fast and slow mem- ory (e.g., cache and memory). In the special 1D case, where Toom-n is recursively applied to a input vector of size N = nd, the following result matches previously established lower bounds [7].

Theorem 4.6 (Vertical communication costs for multidimensional convolution). Given two d- dimensional tensors where each mode length is n, the vertical communication of computing their discrete convolution using a nested Toom-n bilinear algorithm is at least

 (2n − 1)d  max · H, 2nd + (2n − 1)d . (13) Hlogn(2n−1)

Proof. Similar to Theorem 4.3, the vertical communication lower bound of the bilinear algo-

10 rithm is given by (11). Using the expansion bound for Toom-Cook algorithm from Lemma 4.5,

Emax(H) = max E(c(A), c(B), c(B)) c(A),c(B),c(C)∈N,c(A)+c(B)+c(C)=3H = max min{(c(A))logn(2n−1), (c(B))logn(2n−1)} c(A),c(B),c(C)∈N,c(A)+c(B)+c(C)=3H 3H logn(2n−1) = . 2

Since the rank is R = (2n − 1)d, the input size is N (A) = N (B) = nd, the output size is N (C) = d log (2n−1) (2n−1) , and 2·(2/3) n > 1, ∀n > 2, then equation (11) yields the communication lower bound as written in (13).  Next, we derive horizontal communication (i.e., communication between parallel processors) lower bounds for multidimensional convolution.

Theorem 4.7 (Horizontal communication costs for multidimensional convolution). Given two d-dimensional tensors where each mode length is n, the horizontal communication costs with p processors for computing their discrete convolution using a nested Toom-n bilinear algorithm is at least  nd nd  2 · − . plog2n−1(n) p Proof. Similar to Theorem 4.4, the horizontal communication lower bound is the sum of some (A) (B) (C) non-negative integers c , c , c ∈ N satisfying (12). Lemma 4.5 provides an expansion bound E for Toom-Cook algorithm yielding

R min c(A) + nd/p, c(B) + nd/plogn(2n−1). p 6

Since the rank is R = (2n − 1)d, the communication cost is at least

d d (A) (B) (C) (A) (B)  n n  c + c + c > c + c > 2 · − . plog2n−1(n) p



4.3 Partially Symmetric Tensor Contractions

Tensor contractions are isomorphic to matrix multiplication, but special fast bilinear algorithms become possible when the tensors have symmetry [13]. Symmetric tensors are equivalent under all permutations of their indices, e.g., tijk = tjik = .... For example, a product of a symmetric matrix A and a vector b can be computed from n(n + 1)/2 scalar products, mostly of the form aij(bi + bj), yielding a bilinear algorithm with rank R = n(n + 1)/2. The main application of such symmetry preserving algorithms is the derivation of lower cost contraction algorithms for partially symmetric tensors, i.e., tensors that are equivalent only under permutation of a subset of their indices. For example, given a partially symmetric tensor T with symmetry tijab = tjiab, P the contraction uiac = jb tijabvjbc can be performed with 2X fewer operations to leading order than in the nonsymmetric case, by nesting a symmetry preserving algorithm for a symmetric

11 vector product (corresponding to contraction indices i and j) with an algorithm for matrix multiplication (corresponding to contraction indices a, b, c). Tensors with partial symmetry are prevalent in quantum chemistry methods, a core appli- cation domain of higher order tensor contractions [23]. By analysis of these bilinear algorithms’ rank expansions, communication lower bounds have been established showing that such sym- metry preserving algorithms require asymptotically more communication for some contractions of symmetric tensors [12]. Our results allow us to derive the first communication lower bounds for nested symmetry preserving algorithms, yielding lower bounds on communication costs for symmetry preserving contraction algorithms on partially symmetric tensors. The lemma below follows from the analysis in the proof of Lemma 6.3 in [12].

Lemma 4.8. For the bilinear algorithm (A, B, C) corresponding to symmetry preserving con- traction of symmetric tensors of order s + v and v + t over v indices, we can lower bound the (s+v)/(s+t+v) s+t+v rank expansion of each encoding matrix as follows: σA(k) = k / t , σB(k) = (v+t)/(s+t+v) s+t+v (s+t)/(s+t+v) s+t+v k / s , and σC (k) = k / v . We derive lower bounds for nesting of multiple symmetry preserving algorithms, as well as nesting of a symmetry preserving algorithm with a nonsymmetric contraction algorithm. In the former case, we consider nesting of two arbitrary symmetry preserving algorithms.

Lemma 4.9. For the bilinear algorithm (A⊗U, B⊗V , C⊗W ), where (A, B, C) is a symmetry preserving contraction of symmetric tensors of order s + v and v + t over v indices and all dimensions equal to n1, while (U, V , W ) is a symmetry preserving contraction of symmetric 0 0 0 0 0 0 tensors of order s + v and v + t over v indices with all dimension equal to n2, we can lower bound the rank expansion of A ⊗ U by

1 (s+v)/(s+t+v) (s0+v0)/(s0+t0+v0) σA⊗U (k) > min 0 0 0 · k1 · k2 s+t+v s0+t0+v0 s+t+vs +t +v  0 k1∈[1,n1 ],k2∈[1,n2 ], t t k1k2>k 1 0 0 0 0 0  · kmin (s+v)/(s+t+v),(s +v )/(s +t +v ) , > s+t+vs0+t0+v0 t t0 as well as similar bounds for B ⊗ V and C ⊗ W .

Proof. The theorem follows by application of Corollary 3.7 on the rank expansion lower bounds s+t+v given by Lemma 4.8. Note that in the first inequality, we restrict kA and kB to [1, n1 ] and s0+t0+v0 [1, n2 ], respectively, which weakens the bound up to a constant.  The rank expansion lower bound σA⊗U in Lemma 4.9 generalizes to nestings of three or more symmetry preserving bilinear algorithms. This rank expansion lower bound implies horizontal and vertical communication for nested bilinear algorithms follow immediately from those of the nested parts. These communication lower bounds ascertain that standard approaches for tiling nested loops can asymptotically minimize communication done in the execution of the bilinear algorithm. We also consider nestings of a symmetry preserving algorithm with a standard (nonsym- metric) tensor contraction. Nonsymmetric tensor contractions are equivalent to matrix-matrix products with an appropriate choice of matrix dimensions. Like for symmetric tensor contrac- tions, we assume tensor dimensions are equal size and classify contractions by the tuple (s, t, v).

12 When one of s, t, v is zero, the number of products needed to compute the contraction matches the size of the largest tensor (ns+t+v = nmax(s+t,s+v,v+t)), and the corresponding bilinear encod- ing matrix is (some permutation of) the identity matrix, with a rank expansion ofσ ˜(k) = k. When this is the case for nonsymmetric contractions, a tight communication lower bounds can be derived just by considering the rank expansion of a single matrix. For the symmetry pre- serving algorithm, for any choice of s, t, v, a tight communication lower bound can be derived from the rank expansion of just one of the matrices. Consequently, we can use our general lower bounds on rank expansion of a Kronecker product of matrices to derive tight communication lower bounds by restricting the type of nonsymmetric contraction performed.

Lemma 4.10. For the bilinear algorithm (A ⊗ U, B ⊗ V , C ⊗ W ), where (A, B, C) is a symmetry preserving contraction of symmetric tensors of order s + v and v + t over v indices and all dimensions equal to n1, while (U, V , W ) is a contraction of nonsymmetric tensors of 0 0 0 0 0 0 0 order s + v and v + t over v indices with all dimension equal to n2. If t = 0, we can lower bound the rank expansion of A ⊗ U by

1 (s+v)/(s+t+v) σA⊗U (k) = σA(k) = s+t+v · k , t as well as similar bounds for B ⊗ V (if instead of t0 = 0, we have s0 = 0) and C ⊗ W (if instead of t0 = 0, we have v0 = 0).

The expansion bound in Lemma 4.10 implies that we can obtain a lower bound on hori- zontal and vertical communication cost on nested algorithms composed of symmetry preserving algorithms and a nonsymmetric tensor contraction where one of s0, t0, v0 is zero. Given a rank expansion bound for one of the inputs one the output, e.g., σA⊗U , we obtain a vertical commu- nication lower bound of the form,

s+t+v+s0+t0+v0 −1 Ω(n H/σA⊗U (H)), and a horizontal communication lower bound of

s+t+v+s0+t0+v0 s+v+s0+v0 Ω(σA⊗U (n /p) − n /p).

For nested symmetry preserving algorithms, the greatest of the three communication lower bounds (based on σA⊗U , σB⊗V , or σC⊗W ) would be asymptotically attainable for sufficiently small H, p with standard approaches for multidimensional tiling. While the bound Lemma 4.10 should also be asymptotically attainable for many contractions, but not all, as preclusion of t0 > 0 implies we do not provide bounds for communication associated of all inputs/ouputs in a partic- ular contraction. The new communication lower bounds imply that for some partially symmetric tensor contractions, the use of the symmetry preserving algorithm may require asymptotically more communication than if the symmetry was ignored. However, these contractions involve high order tensors. The example below is among the simplest possible cases.

13 Example 4.11. Consider the contraction,

X cim = aijklbjklm, j,k,l where aijkl is symmetric under any permutation of (i, j, k, l) and bjklm is symmetric under any permutation of (j, k, l). Here, a symmetry preserving algorithm with s = 1, v = 3, t = 0 may be nested with a nonsymmetric contraction with s0 = 0, v0 = 0, t0 = 1. By Lemma 4.10, we obtain the following rank expansion bound,

1 (t+v)/(s+t+v) 3/4 σB⊗V (k) = s+t+vk = (1/4)k . s

Assume the dimension of each mode of the tensor (range of each index) is n1 = n2 = n. Overall, this algorithm would then require 4X fewer products (n5/24 to leading order) than if only considering symmetry in (j, k, l) and performing classical matrix multiplication with dimensions n+2 5 n × 3 × n (requiring n /6 products to leading order). However, by our new lower bounds, it would require Ω(n5/H1/3) vertical communication. On the other hand, for sufficiently small H, the matrix-multiplication-based approach requires only O(n5/H1/2) vertical communication.

5 Related Work

Let us review previously used techniques to derive communication lower bounds. Algorithms that can be cast as a nested loop program, which encompass many classical routines, have relatively simple data access patterns [1, 24]. The matrices for their bilinear algorithm consists of columns with one non-zero. This permits the use of volumetric inequalities, such as the Loomis-Whitney or more general H¨older-Brascamp-Liebinequalities, to uncover an expansion bound. However, many fast algorithms, such as Strassen’s algorithm, have more than one non-zero per column. Therefore, volumetric inequalities do not solely suffice in this setting. Previous works have also studied the structure of the dependency DAG using techniques such as graph expansion, dominating sets, or graph partitioning by eigenvalues [2, 7, 8, 9]. In general, this family of techniques aims to uncover bounds on the size of the graph partitions of the dependency DAG. The study of such graph structures is similar to our study of a submatrix rank, in that an expansion bound quantifies how information is passed through any subset of a dependency DAG. These methodologies, however, must fix a particular DAG. Moreover, they require additional constraints or a priori information on the dependency DAG. For example, the graph expansion argument [8] requires the DAG to have bounded degrees to derive sharp lower bounds, and the graph partitioning argument [9] requires one to know the eigenvalues of the corresponding Laplacian matrix to derive analytic lower bounds.

6 Proof of Main Theorems

m ×n m ×n We now prove the results from Section 3. Given matrices A ∈ C A A and B ∈ C B B with rank expansion lower bounds σA and σB, respectively, we aim to construct a sharp lower bound

14 m ×n on the rank expansion for C = A ⊗ B ∈ C C C . Later in our analysis, we will require σA and σB satisfy some properties, such as continuity. The heart of our proof employs a grid expansion method to solve this problem, which we now formalize.

6.1 Preliminaries

(k) Recall that Pn is the set of matrices comprised of k unique columns from an identity matrix (k) of size n. For any P ∈ P#cols(C), CP then contains k columns of C. We can identify a column ai ⊗ bj of C by the tuple (i, j). Thus CP can be identified (up to a column reordering) by a set of k grid points from an nA × nB grid, as shown in Figure 1. We call this set of grid points the grid representation of CP .

m ×7 m ×3 Figure 1: Let A ∈ C A , B ∈ C B , and C = A ⊗ B. The submatrix {a1 ⊗ b1, a3 ⊗ b2, a4 ⊗ b1, a6 ⊗ b3} of C is then represented as the black dots in the grid.

For a grid G ⊆ [nA] × [nB], we will write G[i,·] and G[·,j] to be the sets of points of G with column and row index of i and j, respectively. We denote the size of a grid G as |G|. When it is not ambiguous, we use (i, j) and ai ⊗ bj interchangeably to denote a column of C, and associate a grid G with a submatrix of C, referring to the subset of columns represented by G. When we say span {G}, we mean the space spanned by column vectors represented by grid G, and we call the dimension of this space by rank(G).

(k) Definition 6.1. Let G be the grid representation of CP for some P ∈ P#cols(C). We say that grid B ⊆ G is a basis of G if the columns of C corresponding to grid B comprise a maximal linearly independent set of columns of G.

The basis B may be non-unique. We make it unique by selecting the basis with lowest colexicographic order, where (i, j) precedes (i0, j0) if j < j0 or if j = j0 and i < i0. Graphically, the colexicographic order is a series of repeated N-shaped curves. Algorithm 1 is one way to find this basis. For the rest of the paper, we will simply write “the basis of a grid G” to mean the unique colexicographic basis of G.

Definition 6.2. Let C = A ⊗ B be represented by a nA × nB grid. A subset F ⊆ [nA] × [nB] is a foundation for a grid point (i, j) ∈ [nA] × [nB] if (i, j) ∈ span {F } and no proper subset of F is a foundation for (i, j).

15 Algorithm 1 Constructing a unique basis 1: function BasisSelection(Grid G) 2: B ← { } 3: for (i, j) ∈ G using colexicographic traversal order do 4: if (i, j) 6∈ span {B} then 5: B ← B ∪ {(i, j)} . Add grid point if it is “new” to span return B

To gain a more granular-level viewpoint of the structure of a dependent grid point’s foun- dation, we separate the Kronecker product into its operands. We do so by projecting grid points.

Definition 6.3. The A-projection of the grid G, denoted by PA(G), is the set of indices I ⊆ [nA] such that G[i,·] 6= ∅. Likewise, the B-projection, PB(G), is the set of indices J ⊆ [nB] such that

G[·,j] 6= ∅.

Definition 6.4. A grid G is a dense grid if for every nonempty column of i of G, PB(G[i,·]) = [k] for some integer k ∈ [nB]. A grid is a compact dense grid (CDG) if it is a dense grid and |G[i,·]| is non-increasing in i. We show the difference between a dense and CDG in Figure 2. We can transform any arbitrary grid into a dense grid by collapsing as shown in the same figure.

Figure 2: Example a non-dense grid (left), dense grid (center), and CDG (right).

Definition 6.5. Let G be an arbitrary grid. Then a vertical collapse (VCollapse) of the grid G produces a new grid D such that  [|G[i,·]|]: |G[i,·]| > 0 PB(D[i,·]) = ∅ : |G[i,·]| = 0

Given a CDG B, Algorithm 2 will help bound the largest number of grid points G such that span{B} = span{G}. To do so, the algorithm adds grid points to B in two phases. In the first phase, the algorithm independently adds grid points to each column i until there are min{nB, bfB(ηi)c} grid points, where ηi is the number of grid points in column i before the expansion, fB is a linear or superlinear function, and nB is the height of the grid. Let the resulting subset of grid points be G+. In the second phase, the algorithm grows G+ but now ++ row-wise using the function fA, which results in G . We show an application of the two- stage grid expansion in Figure 3, where we identify the first phase as “VExpansion“ for vertical expansion and the second phase as “HExpansion” for horizontal expansion.

16 Figure 3: Grid expansion on a 4 × 4 CDG where f = fA = fB and bf(1)c = 1, bf(2)c = 2, bf(3)c = 4, and bf(4)c = 5. The left grid is the input grid G, the center grid is G+, and the right grid is G++. Black cells represent existing grid points before the expansion and blue cells represent new grid points.

Algorithm 2 Grid Expansion

1: function GridExp(Functions fA, fB : R 7→ R>0, CDG G) + 2: G ← VExpansion(fB,G) . Expand grid column-wise ++ + 3: G ← HExpansion(fA,G ) . Expand grid row-wise 4: return G++ . Returned grid is a CDG 5:

6: function VExpansion(fB : R 7→ R>0, Grid G) 7: H ← { } . Column-expanded version of G 8: for i = 1, 2, . . . nA do 9: ηi ← |G[i,·]| . Number of grid points in column i pre-expansion + 10: ηi ← min{nB, bfB(ηi)c} . Number of grid points in column i post-expansion  +  11: H ← H ∪ {i} × [ηi ] . Add grid points to column i 12: return H 13:

14: function HExpansion(fA : R 7→ R>0, Grid G) 15: H ← { } 16: for i = 1, 2, . . . nB do 17: ηi ← |G[·,j]| + 18: ηi ← min{nA, bfA(ηi)c}  +  19: H ← H ∪ [ηi ] × {j} . Add grid points to row j 20: return H

For any matrix M and a submatrix A ⊆ M (without loss of generality, assume it is linearly independent), Algorithm 2 is similar to an algorithm where a column in M is added to A if it lies in span(A). The resulting submatrix B ⊆ M is the largest submatrix (w.r.t. the number of columns) such that span(A) = span(B). A key difference between the two algorithms is one adds columns from M based on the entire starting matrix A, whereas the other only considers a subset of grid points (i.e., column i) from G. When CDG G is the grid representation of A † † and when we set fA = σA, fB = σB, then the structure of a CDG ensures Algorithm 2 adds a ++ sufficient number of grid points so that #cols(B) 6 |G |.

6.2 Outline of the Proof

In this section we outline the proof for the main theorems using the idea of collapse-then-expand, where we collapse then permute columns of a grid so that it is CDG followed by the two-stage

17 grid expansion. Let the grid G represent an arbitrary submatrix of C = A ⊗ B. We can reorder the columns of the matrix A so that D = VCollapse(G) is a CDG. Thus we will always assume D is a CDG. Let BD and BG be the basis for D and G, respectively, constructed using Algorithm 1 with a colexicographic order. Assume that any future basis is constructed in a similar fashion.

Now, let σA and σB be, respectively, rank expansion lower bounds for A and B. Then we will show that for some subset S ⊂ BD with |S| 6 |BG|, we have

† † |G| = |D| 6 |GridExp(σA, σB,S)|. (14)

† † We abbreviate our notation by GridExp(S) ≡ GridExp(σA, σB,S) (Algorithm 2). The equality |G| = |D| is trivial. In Proposition 6.9, we will show how to select grid S so that inequality (14) holds. Up to this point, we do not need σA, σB being concave. Next, we show in Section 6.4 that for concave σA, σB, we can derive

|GridExp(S)| 6 φ(|S|) (15) for some increasing and invertible function φ. Finally,

−1 k = |G| 6 φ(|S|) 6 φ(|BG|) =⇒ φ is a rank expansion lower bound for C.

6.3 Finding a Right Grid S

The key observation is that the basis of a CDG has a reducible structure, as implied by the proposition below. We denote the columns of A and B as {ai} and {bj} , respectively. i∈[nA] j∈[nB ]

Proposition 6.6. Let D be a CDG with basis BD. Let the A and B-projection of BD be

X = PA(BD) and Y = PB(BD), respectively. Then

(i) (1, j) ∈/ BD if and only if bj ∈ span {bk}k∈Y ∩[j−1]

(ii) (i, 1) ∈/ BD if and only if ai ∈ span {ak}k∈X∩[i−1] (iii) BD = (X × Y ) ∩ D

(iv) For each p ∈ X, (BD)[p,·] spans D[p,·]

(v) For each q ∈ Y , (BD)[·,q] spans D[·,q].

Graphically, this proposition says that BD should look like the one in Figure 4, and for nonempty rows (columns) in BD, the selected vectors span the vectors in that row (column) of

D. This “” structure in BD is important for bounding its size in our later proof. To establish this proposition, we need following lemmas.

n m p n q m Lemma 6.7. Let u ∈ C , v ∈ C . Let {ai}i=1 ⊆ C , {bj}j=1 ⊆ C be sets of vectors. Then u ⊗ v ∈ span {ai ⊗ bj} iff u ∈ span {ai} and v ∈ span {bj}.

Proof. The “if” direction is trivial. We show the other. Without loss of generality, we may assume {ai} and {bj} are sets of linearly independent vectors. Write A = span {ai} and p q B = span {bj}. Let {ei}i=1 and {fj}j=1 be an orthonormal basis (ONB) for A and B, re- n m spectively, and {ei}i=p+1 and {fj}j=q+1 be an ONB for the orthogonal complements of A and

18 Figure 4: An illustration of the structure of BD.

B, respectively. Then, we have decompositions,

n m X X u = λiei, v = µjfj. i=1 j=1

Thus, X X u ⊗ v = λiµjei ⊗ fj + λiµjei ⊗ fj. (i,j)∈[p]×[q] (i,j)∈/[p]×[q]

Since {ei ⊗ fj}(i,j)∈[p]×[q] and {ei ⊗ fj}(i,j)∈[n]×[m] are respectively an ONB for span{ai ⊗ bj} mn and C , while u ⊗ v ∈ span{ai ⊗ bj}, we have

p !  q  X X X u ⊗ v = λiµjei ⊗ fj = λiei ⊗  µifj . (i,j)∈[p]×[q] i=1 j=1

Hence, u ∈ span {ai} and v ∈ span {bj}.  We endow the grid points with the colexicographic order. We then denote by [(i, j)] the set of all points preceding (i, j) in this order. We write ((i, j)) = [(i, j)] \{(i, j)}. We now show that if a column in the grid is “new to” (not spanned by) the previous columns, then any grid point from this new column can only be spanned by its own column.

Lemma 6.8. If ap ∈/ span {ai}i

0 0 Proof. Let F[p,·] = {(p, jm)}m. Let V = span {bjm }m and V = span {ap ⊗ bjm }m = ap ⊗ V . Let PV , PV 0 be orthogonal projections. Let u = (I − PV )(ap ⊗ bq) = ap ⊗ (I − PV 0 )bq. If u 6= 0, then F \{(p, jm)}m is a foundation of u, but this is impossible by Lemma 6.7, since ap ∈/ span {ai}i

19 6.8; if ai ∈/ span {ak}k∈X∩[i−1], then (i, 1) is added to the basis. If otherwise, by an induction on i, we see that ai ⊗ b1 is spanned by {ak ⊗ b1}k∈X∩[i−1] ⊆ BD. We prove part (iii) by induction on the number of non-empty columns in D. The base case is clear. Now suppose we have p > 1 columns.

Case 1: If ap is spanned by previous columns, then ∀ q,(p, q) is spanned by {(i, q)}i

Case 2: Now suppose ap is not spanned by previous columns. For a fixed q, if bq ∈/ span {bj}j

If bq ∈ span {bj}j

BD. Case 2 says if we don’t skip a column, then we add point (p, q) iff bq ∈/ span {bj}j

Proposition 6.9. Let G be a grid such that D = VCollapse(G) is a CDG. Let S be generated from the procedure above. Let BG be the basis grid of G. Then (i) |S| 6 |BG| (ii) D ⊆ GridExp(S). Thus |D| 6 | GridExp(S)|.

First we introduce a lemma for part (i).

m Lemma 6.10. Let G be a grid such that VCollapse(G) is a CDG with n columns. Let BG be the basis for the first m columns of G, and let BG be the basis for G. Then 1 2 n (i) BG ⊆ BG ⊆ · · · ⊆ BG = BG (ii) If a ∈/ span {a } , then |Bk+1| − |Bk | rank(G ). k+1 i i6k G G > [k+1,·] Proof. Statement (i) is clear by the construction of a basis. For (ii), since a ∈/ span {a } , k+1 i i6k by Lemma 6.8, we must add (k + 1, q) to the basis if bq ∈/ span {bj :(p, j) ∈ G, j < q}. Thus we have to add at least a maximal linear independent set of vectors G[k+1,·] to the basis (but may include more).  Now we prove Proposition 6.9. Proof. (of Proposition 6.9) First we prove part (i) by an induction on the number of columns. n n As before, we define S and BG to be the points in S and BG in columns 1 to n. We show that n n |S | 6 |BG| for all n. As previously defined, let rp = rank(G[p,·]). 1 1 n n The base case is clear, that |S | 6 r1 = |BG|. Now suppose |S | 6 |BG| for all n 6 k. We wish to show |Sk+1| |Bk+1|. Indeed, if a ∈ span {a } , then by Proposition 6.6, |Sk+1| = |Sk|, 6 G k+1 i i6k so |Sk+1| = |Sk| |Bk | |Bk+1|. If a ∈/ span {a } , we have |Sk+1| − |Sk| r , and by 6 G 6 G k+1 i i6k 6 k+1

20 k+1 k k k part (ii) of Lemma 6.10, |BG | − |BG| > rk+1. Thus we conclude |S | 6 |BG| for all k. Now by part (i) of Proposition 6.10, we have |S| 6 |BG| as desired. Now we prove statement (ii). First we show that VExpansion(S) contains the points in

{(i, j) ∈ D : i ∈ PA(BD)}, i.e., columns of D not skipped by BD are covered in VExpansion(S).

Fix i ∈ PA(BD). Denote |S[i,·]|, |G[i,·]|, and |D[i,·]| by respectively s, g, and d. Since VExpansion(S) and D are both CDGs, it suffices to check |D[i,·]| 6 | VExpansion(S)[i,·]|. Easily, g = d. If no point is removed from BD to form S in column i, then by part (iv) of Proposition 6.6 and the fact σB is a lower bound on the rank expansion for B, |D[i,·]| 6 | VExpansion(S)[i,·]|. If grid points are † † removed from BD, then s = ri. Now | VExpansion(S)[i,·]| = σB(s) = σB(ri) > g = d = |D[i,·]|. Thus, in both cases, column i of D is covered by VExpansion(S). Finally, by part (v) of Proposition 6.6, if we now apply a HExpansion on VExpansion(S), it covers those skipped columns in D. Hence D ⊆ GridExp(S).  Now we established that for an arbitrary grid G, |G| 6 | GridExp(S)| for some |S| 6 |BG|. It remains to bound | GridExp(S)| using |S|. We address this in the next section.

6.4 Bounding the Expansion Size – Proof of Theorem 3.1

We note that through a set of row and column permutations, S can be made a CDG S0. As permutations do not affect the size of GridExp (Algorithm 2) of a grid, we may assume S to be a CDG. Thus, if we define φ : Z+ 7→ R,

† † φ(t) = max | GridExp(σA, σB,S)|, (16) CDG S, |S|=t then for all continuous and increasing σC such that σC (0) = 0, φ(σC (k)) 6 k, ∀ k ∈ [nAnB], σC is a rank expansion lower bound for C.

We show how to further reduce this optimization problem when σA and σB are concave functions on R+. When σA or σB are zero functions, the proposed bound in Theorem 3.1 holds trivially. Otherwise, using a density argument, we can assume that they are C1 and strictly −1 −1 −1 −1 increasing functions. Now σA and σB are well-defined. We denote f = σA , g = σB . They are convex and strictly increasing functions, and we will seek a bound for

φ(t) = max | GridExp(f, g, S)| (17) CDG S, |S|=t

We want to show the inverse of this φ function is exactly the function σC proposed in Theorem 3.1. We approach this in a few steps. Step 1. Reduce from any CDG S to a rectangular-shaped grid. That is to show

φ(t) = max | GridExp(f, g, S)| 6 max f(tA)g(tB). (18) CDG S, |S|=t tA, tB >1, tAtB =t

Step 2. Solve the optimization problem on rectangular-shaped grids. We will show that for convex increasing functions f and g with f(0) = g(0) = 0,

max f(tA)g(tB) = max {f(1)g(t), f(t)g(1)} 6 φ(t). (19) tA, tB >1, tAtB =t

21 Step 3. Finally with φ(t) = max {f(1)g(t), f(t)g(1)}, we invert this function to get the proposed function σC . Now we prove Step 1. Hereafter, we will abbreviate | GridExp(S)| ≡ | GridExp(f, g, S)|. Let R us define hGi = G df(x)dg(y) to be the area of grid G under the new measure (here we think of 2 a grid point (i, j) as the unit square [i−1, i]×[j −1, j] in R ). To verify (18), we identify a CDG 2 with a stair-like area in R , as illustrated in Figure 5. Furthermore, we relax the corner points 2 from being integer coordinates to points in R , i.e., we represent a CDG as the region below a decreasing step function in the first quadrant. With this setup, one can check the following. Proposition 6.11. If G is a CDG and f(0) = g(0) = 0, then hGi = | GridExp(G)|. Our proof approach is to show that if we reduce the number of “stairs” in S in a proper way (i.e., preserving the area), then hSi increases, as explained in the Figure 5.

Figure 5: An illustration of reducing the number of stairs in S by 1. The 2-stair (shaded in green) E ⊆ S is merged in 2 ways: either moving the left stair up and shrink the lower stair until (1) the left stair reaches the height of its left neighbor, or (2) the right stair decreases to the height of its right neighbor (or floor), which reduces E to M1 (pink 2-stair); or move in the other way until the two stairs level up, which reduces E to M2 (blue rectangle). The area is not allowed to change, i.e., |E| = |M1| = |M2|. In either case, the number of stairs in S is reduced by at least 1. If hEi 6 max {hM1i, hM2i}, then we can repeat this procedure until S becomes a single rectangle without decreasing hSi = | GridExp(S)|.

We use the notation introduced in the caption of Figure 5. We fix x0, x1, x2, y0 and the area |E| as constants and parametrize the stair by u = y2 − y1 ∈ R>0. Let E(u) be the grid with height difference u. We allow u to grow without changing the area |E| until one of the followings happens: (i) when the left stair reaches the height of its left neighbor or (ii) when the right stair reaches the height of its right neighbor (or down to the ground). Define the feasible region u ∈ [0, c] (the exact value of c is not important, though available). Let M1 = E(c) and M2 = E(0). It then suffices to show that hE(u)i 6 max {hM1i, hM2i}. Equivalently, hE(u)i is maximized on the boundary of [0, c]. The rest work to prove Step 1 is pure calculus and is given below.

22 Proof. (of Step 1.) Let us define the widths of the stairs w1 = x1 − x0, w2 = x2 − x1. Using the equi-area relation, we have  w1 y1 =y ¯ − u, w1+w2 (20) y =y ¯ + w2 u, 2 w1+w2 wherey ¯ = y0 + |E|/(w1 + w2) is independent of u. Easily, the area can be computed

hE(u)i = [f(x1) − f(x0)][g(y2(u)) − g(y0)] + [f(x2) − f(x1)][g(y1(u)) − g(y0)].

Taking derivative of u, we have

d w2 0 w1 0 hE(u)i = [f(x1) − f(x0)]g (y2) − [f(x2) − f(x1)]g (y1). du w1 + w2 w1 + w2

Replacing x1,2 by w1,2, we have

d f(x0 + w1) − f(x0) 0 f(x1 + w2) − f(x1) 0 0 0 hE(u)i ∝+ g (y2) − g (y1) = k1g (y2) − k2g (y1). du w1 w2

Plugging in relation (20), we have

d  w  k  w  hE(u)i = 0 ⇔ g0 y¯ − 1 u = 1 · g0 y¯ + 2 u . du w1 + w2 k2 w1 + w2

By convexity of g, the right-hand side is increasing in u whereas the left-hand side is decreasing in d u. Thus there is at most one critical point for hE(u)i on [0, c]. Now we show that du hE(u)i|u=0 6 0. Indeed, when u = 0, y1 = y2 =y ¯, and by convexity of f, k1 6 k2 by the slopes of secant lines d 0 of f. Therefore, du hE(u)i|u=0 = (k1 − k2)g (¯y) 6 0. Thus hE(u)i cannot have a local maximum on [0, c]. Hence, the maximum of hE(u)i is attained at either u = 0 (M2) or u = c (M1). Finally, we apply this merge repeatedly to the left-most 2-stair E of S to convert S into a rectangle. Note that when we finish the merge, the resulting rectangle has a width and height of at least 1. This completes the proof of Step 1.  Next we establish Step 2. Recall that for any convex function f(x), we can write it as a supremum of a set of linear functions (see, e.g., Theorem 12.1 in [25]), f(x) = supi∈I aix + bi.

Proof. (of Step 2.) Let f(x) = supi∈I aix + bi. Since f(0) = 0 and f is increasing, we assume supi∈I bi = 0 and ai > 0. Similarly we can write g(x) = supj∈J cjx + dj, with cj > 0, supj∈J dj = 0. By positiveness of f and g, we can reduce the optimization problem to

max sup (aix + bi)(cjy + dj). (21) x>1, y>1, (i,j)∈I×J x·y=t

Now we show that for each given i, j, the maximum is attained (1, t) or (t, 1). Indeed, the

23 problem reduces to maximizing the nonconstant part,

max aidjx + bicjy, (22) x>1, y>1, x·y=t which has a solution on the boundary, since the coefficients aidj, and bicj are nonnegative, and the constraint is equivalent to aidjx · bicjy = const. Finally we return to the conclusion. Suppose the maximum is attained at some (x0, y0). Then, for any ε > 0, one can find (i0, j0), and (x∗, y∗) ∈ {(1, t), (t, 1)}, so that

∗ ∗ ∗ ∗ 0 0 0 0 f(x )g(y ) > (ai0 x + bi0 )(cj0 y + dj0 ) > (ai0 x + bi0 )(cj0 y + dj0 ) > f(x )g(y ) − ε. (23)

Since ε > 0 is arbitrary, the proof is complete.  Finally, we prove Step 3, which completes the proof of Theorem 3.1. Proof. (of Step 3 and Theorem 3.1) This proof is mostly repeating the procedure stated in Section 6.2. Recall that the proposed lower bound of the rank expansion for C = A ⊗ B is

σC (k) = min {σA(k/dB), σB(k/dA)} ,

† † where dA = σA(1), dB = σB(1). Clearly, σC is continuous, increasing, and concave, as both candidates in the minimization are. It suffices to show σC 6 σ˜C with concave, strictly increasing 1 C functions σA, σB. If k 6 dAdB, then σC (k) 6 1 is clearly a lower bound of the rank of any (k) submatrices C . Let us consider k > dAdB. For any grid G with a basis of size t, we proved that |G| 6 φ(t), where by Steps 1 and 2,

    φ(t) = max{σ_A⁻¹(1) σ_B⁻¹(t), σ_A⁻¹(t) σ_B⁻¹(1)} = max{d_A σ_B⁻¹(t), d_B σ_A⁻¹(t)}.

Note that φ(t) is now defined on {t ≥ 1}, not only on integer t. Since φ is strictly increasing, it suffices to check that σ_C(k) = φ⁻¹(k). Suppose t = φ⁻¹(k). When k > d_A d_B, σ_A(k/d_B) > 1 and σ_B(k/d_A) > 1, so t > 1 and φ(t) is well-defined. Now suppose φ(t) = d_A σ_B⁻¹(t); then one can derive

    φ(t) = d_A σ_B⁻¹(t)  ⟺  k = d_A σ_B⁻¹(t) ≥ d_B σ_A⁻¹(t)                     (24)
                         ⟺  d_B σ_A⁻¹(σ_B(k/d_A)) ≤ k                            (25)
                         ⟺  σ_B(k/d_A) ≤ σ_A(k/d_B)                              (26)
                         ⟺  σ_C(k) = σ_B(k/d_A) = φ⁻¹(k).                        (27)

In the other case, φ(t) = d_B σ_A⁻¹(t), one can verify that σ_C = φ⁻¹ through a similar derivation. Thus σ_C = φ⁻¹, and it is a lower bound on the rank expansion of C. □
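As an illustration of how the composed bound is evaluated, the sketch below (ours, not the paper's) applies σ_C(k) = min{σ_A(k/d_B), σ_B(k/d_A)} with d_A = σ_A⁻¹(1) and d_B = σ_B⁻¹(1) to one concrete pair of concave expansion bounds; the particular σ_A, σ_B are illustrative choices.

```python
# Sketch: evaluating the composed rank expansion lower bound of Theorem 3.1
# for C = A (x) B.  The concrete sigma_A, sigma_B below are illustrative.
import math

sigma_A = lambda k: math.log(k + 1)   # concave, increasing, sigma_A(0) = 0
sigma_B = lambda k: k ** 0.5          # concave, increasing, sigma_B(0) = 0
d_A = math.e - 1                      # sigma_A^{-1}(1)
d_B = 1.0                             # sigma_B^{-1}(1)

def sigma_C(k):
    if k <= d_A * d_B:
        return 1.0 if k >= 1 else 0.0  # any nonempty column subset has rank >= 1
    return min(sigma_A(k / d_B), sigma_B(k / d_A))

for k in [1, 10, 100, 1000]:
    print(k, round(sigma_C(k), 3))
```

Since the resulting bound is again continuous, increasing, and concave, the same formula can be applied recursively when further Kronecker factors are nested.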

6.5 Improving the Bound – Proof of Theorems 3.5 and 3.6

Again we assume that σ_A, σ_B are continuous, concave, and strictly increasing C¹ functions. Let r_A = σ_A(n_A) and r_B = σ_B(n_B). Recall that in the last section we proved that

    max_{CDG S, |S| = t} |GridExp(σ_A⁻¹, σ_B⁻¹, S)| = max{σ_A⁻¹(1) σ_B⁻¹(t), σ_A⁻¹(t) σ_B⁻¹(1)}.                (28)

In practice (e.g., the applications in Section 4), σ_A⁻¹ and σ_B⁻¹ may only be well-defined over their respective domains, [1, r_A] and [1, r_B]. When |S| > min{r_A, r_B}, the optimization problem (28) evaluates σ_A⁻¹ and σ_B⁻¹ beyond these domains, thereby extrapolating values which may not be well-defined. This is because the merging of a CDG's stairs may produce rectangles extending beyond the region contained in [r_A] × [r_B] to achieve maximal growth. To constrain t_A ≤ r_A and t_B ≤ r_B, we must conduct merges so that the geometry does not cross the border. This results in a 2-stair or L-shaped CDG. We make this notion concrete.

Definition 6.12. The L-shaped CDG S = L(x_1, y_1; x_2, y_2) is the 2-stair CDG with horizontal edges at y_1, y_2 and vertical edges at x_1, x_2. By convention, we require 0 ≤ x_1 ≤ x_2 and 0 ≤ y_1 ≤ y_2. Alternatively, L(x_1, y_1; x_2, y_2) can be obtained by removing the rectangle [x_1, x_2] × [y_1, y_2] from the rectangle [0, x_2] × [0, y_2]. Also note that a rectangle with height h and width w is an L-shaped CDG L(w, h; w, h). Now we state the improved version of (28).

Theorem 6.13. Let S be a CDG with size |S|, and let f, g be convex increasing functions on [0, r_A] and [0, r_B], respectively. Let L = {L(x_1, y_1; r_A, r_B) : x_1, y_1 ≥ 0} and R = {L(x_1, y_1; x_1, y_1) : 1 ≤ x_1 ≤ r_A, 1 ≤ y_1 ≤ r_B}. Then

    max_{CDG S, |S| = t} |GridExp(f, g, S)| ≤ max_{M ∈ L ∪ R, |M| = t} ⟨M⟩.                             (29)

Proof. The proof is mostly identical to that of Step 1 in the last section. We start with a CDG S whose height and width are bounded by r_B and r_A, respectively. Instead of merging the left-most 2-stair, we start with the second and third stairs in S. By what is proved in Step 1, ⟨S⟩ increases either (i) when u = c, which means either the second stair is raised to the level of the left-most stair or the third stair is lowered to the level of the ground or the fourth stair; or (ii) when u = 0, which means we level the second and third stairs. We repeatedly apply this procedure to the updated second and third stairs until we are left with an L-shaped grid. Now we merge the left-most 2-stair, but we do not allow the height of the left-most stair to go above r_B. This may result in an L-shaped grid L(x_1, y_1; x_2, r_B). Finally, we merge horizontally and do not allow the width of the lower layer to go beyond r_A, which may result in L(x_1′, y_1; r_A, r_B). When we finish these steps of merging, S either becomes a rectangle in R or an L-shaped grid in L. The proof is thus complete. □
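To make the objects appearing in Theorem 6.13 concrete, the following short sketch (ours) evaluates the area and expansion of an L-shaped CDG directly from the two-stair expansion formula used in the Step 1 proof; it assumes f(0) = g(0) = 0.

```python
# Sketch: area |M| and expansion <M> of an L-shaped CDG M = L(x1, y1; x2, y2)
# under expansion maps f, g with f(0) = g(0) = 0.  The block over [0, x1]
# contributes f(x1)*g(y2) and the block over [x1, x2] contributes
# (f(x2) - f(x1))*g(y1); a w-by-h rectangle is the special case L(w, h; w, h).
def l_shape_area(x1, y1, x2, y2):
    return x2 * y2 - (x2 - x1) * (y2 - y1)

def l_shape_expansion(f, g, x1, y1, x2, y2):
    return f(x1) * g(y2) + (f(x2) - f(x1)) * g(y1)
```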

Remark 6.14. Note that we can always assume r_B ≥ r_A. If |S| ≤ max{r_A, r_B} = r_B, then Theorem 3.1 is sufficient, since when we complete the (vertical) merge into a box, the box has width at least 1 and thus height at most r_B, which is within [r_A] × [r_B]. Thus we only need to take the set L into account when |S| > max{r_A, r_B}.

Inevitably, including the L-shaped grid in the bound gives a tighter bound with a more complicated expression. It corresponds to the rank expansion lower bound given in Theorem 3.5. We prove it below.

Proof. (of Theorem 3.5.) First we show that σ_C is increasing (it is clearly continuous, as R_C and L_C are continuous). Clearly, R_C is increasing. We rewrite L_C as

    L_C(k) = r_A r_B − max{ (r_A − σ_A(k_A))(r_B − σ_B(k_B)) : k_A ∈ [0, n_A], k_B ∈ [0, n_B], (n_A − k_A)(n_B − k_B) = n_A n_B − k }.       (30)

Thus as k increases, one can increase kA or kB, so that LC increases. Hence LC is also increasing.

Consequently, σ_C is increasing. To show σ_C ≤ σ̃_C, it is sufficient to consider strictly increasing C¹ functions σ_A, σ_B. By Theorem 6.13, to obtain the maximum grid expansion, the CDG should be either a box or L-shaped. Similarly to the proof of Theorem 3.1, we define

    φ_R(k) = max{ σ_A⁻¹(t_A) σ_B⁻¹(t_B) : t_A ∈ [1, r_A], t_B ∈ [1, r_B], t_A t_B = k },                          (31)
    φ_L(k) = max{ n_A σ_B⁻¹(y_1) + n_B σ_A⁻¹(x_1) − σ_A⁻¹(x_1) σ_B⁻¹(y_1) : x_1 ∈ [0, r_A], y_1 ∈ [0, r_B], r_B x_1 + r_A y_1 − x_1 y_1 = k },   (32)

and put

    φ(k) = max{φ_R(k), φ_L(k)}.                                                   (33)

By Theorem 6.13, φ(k) is an upper bound on |G|, where G is any grid with a basis of size k. Similarly to the previous argument on L_C, φ_L is strictly increasing, so φ⁻¹ is well-defined. Now we check σ_C ≤ φ⁻¹. Through the same argument we used to prove Step 2 of the last section, we know that φ_R is maximized on the “boundary,” that is, at least one of the constraints t_A ∈ [1, r_A] and t_B ∈ [1, r_B] must be active. This leads to

    φ_R(k) = max{ σ_A⁻¹(t_A) σ_B⁻¹(t_B) : t_A ∈ [1, r_A], t_B ∈ [1, r_B], t_A ∈ {1, r_A} or t_B ∈ {1, r_B}, t_A t_B = k }.                   (34)

Through an inversion similar to the one in the proof of Theorem 3.1, we see that φ_R⁻¹(k) = R_C(k). Thus we have σ_C ≤ φ⁻¹ when φ = φ_R. Let us verify this when φ = φ_L. It suffices to show that φ_L(σ_C(k)) ≤ k. If φ_L(σ_C(k)) > k, then there exist x_1 ∈ [0, r_A] and y_1 ∈ [0, r_B] with r_B x_1 + r_A y_1 − x_1 y_1 = σ_C(k) such that

    k′ = n_A σ_B⁻¹(y_1) + n_B σ_A⁻¹(x_1) − σ_A⁻¹(x_1) σ_B⁻¹(y_1) > k.

Consequently,

    σ_C(k) ≤ L_C(k) < L_C(k′)
           = min{ r_B σ_A(k_A) + r_A σ_B(k_B) − σ_A(k_A) σ_B(k_B) : k_A ∈ [0, n_A], k_B ∈ [0, n_B], n_B k_A + n_A k_B − k_A k_B = k′ }
           ≤ r_B σ_A(σ_A⁻¹(x_1)) + r_A σ_B(σ_B⁻¹(y_1)) − σ_A(σ_A⁻¹(x_1)) · σ_B(σ_B⁻¹(y_1))
           = r_B x_1 + r_A y_1 − x_1 y_1
           = σ_C(k).

This is absurd. Thus we have proved σ_C ≤ φ⁻¹, so σ_C is a lower bound on the rank expansion of C. Finally, as noted in Remark 6.14, when the resulting CDG S has size no larger than max{r_A, r_B}, we only need to consider boxes. Thus σ_C = R_C ≤ L_C in this regime, and σ_C(0) = R_C(0) = 0. The proof is then complete. □
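The proof identifies the Theorem 3.5 bound with φ⁻¹, where φ = max{φ_R, φ_L} from (31)–(33), and this inverse can be evaluated numerically. The sketch below is our illustration only: a coarse grid search stands in for the exact maximizations, and f = σ_A⁻¹, g = σ_B⁻¹ are assumed to be vectorized (NumPy-compatible) callables.

```python
# Sketch: numerically evaluate sigma_C = phi^{-1} with phi = max{phi_R, phi_L}.
# f = sigma_A^{-1}, g = sigma_B^{-1}; rA = sigma_A(nA), rB = sigma_B(nB).
import numpy as np

def phi(t, f, g, rA, rB):
    # phi_R: rectangles t_A x t_B with t_A * t_B = t inside [1, rA] x [1, rB]
    tA = np.linspace(1.0, rA, 2001)
    ok = (t / tA >= 1) & (t / tA <= rB)
    phi_R = np.max(f(tA[ok]) * g(t / tA[ok])) if ok.any() else 0.0
    # phi_L: L-shapes L(x1, y1; rA, rB) with area rB*x1 + rA*y1 - x1*y1 = t
    x1 = np.linspace(0.0, rA - 1e-9, 2001)
    y1 = (t - rB * x1) / (rA - x1)
    ok = (y1 >= 0) & (y1 <= rB)
    vals = f(rA) * g(rB) - (f(rA) - f(x1[ok])) * (g(rB) - g(y1[ok]))
    phi_L = np.max(vals) if ok.any() else 0.0
    return max(phi_R, phi_L)

def sigma_C(k, f, g, rA, rB, tol=1e-6):
    lo, hi = 0.0, rA * rB            # phi is increasing, so invert by bisection
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if phi(mid, f, g, rA, rB) < k else (lo, mid)
    return 0.5 * (lo + hi)
```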

Remark 6.15. Theorem 3.5 is, in a sense, a consistent result. Suppose we have two rank expansion lower bounds σ_A, σ̂_A for A and σ_B, σ̂_B for B. If one set of lower bounds is better than the other, e.g., σ_{A,B} ≥ σ̂_{A,B} on [0, n_{A,B}], then the corresponding functions satisfy R_C ≥ R̂_C on [0, n_A n_B]. If in addition σ_{A,B}(n_{A,B}) = σ̂_{A,B}(n_{A,B}), say both equal to the true ranks of A and B, then also L_C ≥ L̂_C on [0, n_A n_B]. Thus a better rank expansion lower bound on C is guaranteed given better rank expansion lower bounds on A and B. We emphasize that this is not necessarily the case when using Theorem 3.1, due to the extrapolation of σ_{A,B} from [0, n_{A,B}] to infinity. However, Theorem 3.1 gives a much simpler expression and can easily be applied recursively to derive a lower bound on the rank expansion for C = ⊗_i A_i.

Next, we investigate under what conditions we can ignore the class L and only maximize over R, while restricting ourselves to grids in [n_A] × [n_B]. That is, we establish a sufficient condition under which

    R_C(k) ≤ L_C(k),  ∀ k ∈ [1, n_A n_B].                                          (35)

This corresponds to Theorem 3.6, which we now prove.

Proof. (of Theorem 3.6.) As defined in the proof of Theorem 3.5,

    φ_L(k) = max{ n_A σ_B⁻¹(y_1) + n_B σ_A⁻¹(x_1) − σ_A⁻¹(x_1) σ_B⁻¹(y_1) : x_1 ∈ [0, r_A], y_1 ∈ [0, r_B], r_B x_1 + r_A y_1 − x_1 y_1 = k }.     (36)

When x_1 = 0, we get an r_A × (k/r_A) rectangle R_1. When y_1 = 0, we get a (k/r_B) × r_B rectangle R_2. We only need to consider the L-shaped grids when k > max{r_A, r_B}, which means k/r_A > 1 and k/r_B > 1, and thus R_1 ∈ R and R_2 ∈ R. We show that when f = σ_A⁻¹ and g = σ_B⁻¹ satisfy the assumptions stated in Theorem 3.6, φ_L(k) is maximized at either x_1 = 0 or y_1 = 0. Therefore, in this case, we have φ_L(k) ≤ φ_R(k) for all k. Hence one can ignore the L-shaped grids L when maximizing the expansion size.

To that end, we again rewrite φ_L as

    φ_L(k) = max{ n_A n_B − (n_A − f(x_1))(n_B − g(y_1)) : x_1 ∈ [0, r_A], y_1 ∈ [0, r_B], (r_A − x_1)(r_B − y_1) = r_A r_B − k }
           = n_A n_B − min_{x ∈ [0, r_A − (r_A r_B − k)/r_B]} (f(r_A) − f(x)) (g(r_B) − g(r_B − (r_A r_B − k)/(r_A − x))).

For simplicity, we write r_A r_B − k = c ≥ 0 and s_x = c/(r_A − x) ∈ [c/r_A, r_B]. It suffices to show

    argmin_{x ∈ [0, r_A − c/r_B]} (f(r_A) − f(x)) (g(r_B) − g(r_B − s_x)) ∈ {0, r_A − c/r_B}.                      (37)

Indeed, let h(x) be the function minimized in (37); then h is C¹ on [0, r_A − c/r_B], and

    h′(x) = [f(r_A) − f(x)] · (ds_x/dx) · g′(r_B − s_x) − f′(x) [g(r_B) − g(r_B − s_x)]
          = s_x [ ((f(r_A) − f(x))/(r_A − x)) g′(r_B − s_x) − f′(x) ((g(r_B) − g(r_B − s_x))/s_x) ].

On (0, r_A − c/r_B), f′(x) > 0 and g′(r_B − s_x) > 0, so

    h′(x) ∝₊ f̂(r_A − x) − ĝ(c/(r_A − x)).                                          (38)

By monotonicity of f̂ and ĝ, h(x) has at most one critical point on (0, r_A − c/r_B). If f′(0) = 0, then h′(0) ≥ 0, so the minimum of h is attained on the boundary. A similar conclusion holds if g′(0) = 0, which leads to h′(r_A − c/r_B) ≤ 0. Assuming f′(0), g′(0) ≠ 0, we can continuously extend f̂ and ĝ to r_A and r_B, and (38) holds on [0, r_A − c/r_B]. It then suffices to verify that we cannot simultaneously have h′(0) < 0 and h′(r_A − c/r_B) > 0. Indeed, if so, then since c ≤ r_A r_B,

    h′(0) ∝₊ f̂(r_A) − ĝ(c/r_A) < 0,
    h′(r_A − c/r_B) ∝₊ f̂(c/r_B) − ĝ(r_B) > 0;

since f̂ and ĝ are both increasing, we have

    ĝ(c/r_A) > f̂(r_A) ≥ f̂(c/r_B) > ĝ(r_B) ≥ ĝ(c/r_A).

This is absurd. The proof is then complete. □

In the example below, we show that the assumption on f̂ and ĝ is nontrivial.

Example 6.16. In general, given that f = σ_A⁻¹ is C¹, convex, strictly increasing, and f(0) = 0, we do not have monotonicity of f̂, and we cannot ignore the L-shaped grids in the maximization. A counterexample is given below.

Consider r_A = 5 and

    σ_A⁻¹(x) = f(x) =  x,                            x ≤ 3,
                       5/2 + (1/2) e^{2(x−3)},       3 < x ≤ 4,
                       5/2 + (1/2) e² + e² (x − 4),  4 < x ≤ 5.

Then f satisfies all the assumptions we require. One can check that f̂ is not monotonically increasing. We demonstrate that in this case the L-shaped grids cannot be discarded. Let us take g = f. Consider the grid S = L(1, 1; 5, 5) of size 9. We have ⟨S⟩ ≈ 26.17. With a rectangle of area 9 inside the 5 × 5 region, the maximum expansion size, rounded up to an integer, is ⌈φ_R(9)⌉ = ⌈f(5) · f(9/5)⌉ = 25 < ⌊⟨S⟩⌋ ≤ ⌊φ_L(9)⌋. Thus in this case we have to return to Theorem 3.5 or Theorem 3.1; a short numerical check of these values is sketched below. Finally, we conclude this section by demonstrating how Theorem 3.6 can improve the rank expansion lower bound derived from Theorem 3.1 in the following example.
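For reference, the numbers in Example 6.16 can be reproduced with a few lines (our sketch; f is transcribed from the piecewise definition above, and the L-shape expansion formula is the one used throughout this section):

```python
# Sketch: reproducing the values in Example 6.16 with g = f.
import math

def f(x):
    if x <= 3:
        return x
    if x <= 4:
        return 2.5 + 0.5 * math.exp(2 * (x - 3))
    return 2.5 + 0.5 * math.e**2 + math.e**2 * (x - 4)

# L-shaped grid S = L(1, 1; 5, 5): area 5*5 - 4*4 = 9, expansion <S>
S = f(1) * f(5) + (f(5) - f(1)) * f(1)
print(round(S, 2))                    # ~26.17
# best area-9 rectangle inside [0, 5] x [0, 5]: phi_R(9) = f(5) * f(9/5)
print(math.ceil(f(5) * f(9 / 5)))     # 25 < floor(<S>) = 26
```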

Example 6.17. For brevity, we write σ_X for some rank expansion lower bound for the matrix X. First consider C = A ⊗ B, where σ_A(k) = k^{1/2} and σ_B(k) = k^{1/4}. If we apply Corollary 3.2, we find

    σ_C(k) = k^{1/4}.

Now if we turn to Theorem 3.6, for k > nB, we get

    σ_C^{new}(k) = (k/n_B)^{1/2} · n_B^{1/4} = n_B^{−1/4} · k^{1/2},

and when k ≤ n_B, σ_C^{new}(k) = σ_C(k). The bound is improved by an order of k^{1/4}. Numerically, with n_A = n_B = 100 and k = 10 n_A, we have ⌈σ_C(k)⌉ = 6, whereas σ_C^{new}(k) = 10. However, note that although σ_C^{new} is continuous, it is no longer concave. Hence it is in general not possible to apply the bound in Theorem 3.6 recursively. For another illustration, let us use logarithms for the rank expansion lower bound. Suppose

C = A ⊗ A with σA(k) = ln(k + 1). From Corollary 3.2, we get

    σ_C(k) = ln(k/(e − 1) + 1).

Using the new bound, when k/nA > e − 1, we have

    σ_C^{new}(k) = ln(n_A + 1) · ln(k/n_A + 1).

Thus

    e^{σ_C(k)} = k/(e − 1) + 1;                                                    (39)
    e^{σ_C^{new}(k)} = (k/n_A + 1)^{ln(n_A + 1)}.                                  (40)

Since (40) grows like k^{ln(n_A+1)} while (39) grows only linearly in k, (40) eventually surpasses (39) as k becomes large, and thus we obtain a tighter lower bound, σ_C^{new}, on the rank expansion. Numerically, with n_A = 100 and k = 10 n_A, we have (39) ≈ 583 and (40) ≈ 63996. Thus ⌈σ_C(k)⌉ = 7, whereas ⌈σ_C^{new}(k)⌉ = 12.
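A short numerical check of these values (our sketch):

```python
# Sketch: the quantities in (39)-(40) for n_A = 100, k = 10 * n_A.
import math

nA, k = 100, 1000
sigma_C     = math.log(k / (math.e - 1) + 1)            # Corollary 3.2 bound
sigma_C_new = math.log(nA + 1) * math.log(k / nA + 1)   # Theorem 3.6 bound
print(round(math.exp(sigma_C)), round(math.exp(sigma_C_new)))   # ~583 vs ~6.4e4
print(math.ceil(sigma_C), math.ceil(sigma_C_new))                # 7, 12
```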

7 Conclusion

We develop a new framework to ascertain communication lower bounds for any bilinear algorithm via the rank expansion of a matrix, or the minimum rank of any submatrix of fixed size. We acquire general lower bounds by a novel technique called grid expansion, a geometric interpretation of how the rank of any submatrix grows, and we solve a simple optimization problem to bound the growth of the grid expansion. We apply the proposed framework to derive communication lower bounds for fast matrix multiplication, convolution, and symmetry preserving algorithms. Unlike previous works, which assume a particular computational DAG, our lower bounds consider a larger space of permissible computational DAGs. We show an asymptotically smaller lower bound for Strassen's matrix multiplication, both matching and novel lower bounds for variants of Toom-Cook convolution, and novel lower bounds for nested symmetry preserving tensor contractions.

To close, we briefly discuss limitations in our analysis. First, recall that we reason about communication lower bounds by separately nesting the matrices from a bilinear algorithm. This is why we do not nest standard matrix multiplication, whose bilinear algorithm's matrices each have a trivial rank expansion lower bound of one, to build partially symmetric tensor contractions in Lemma 4.10. Second, in our analysis of Strassen's algorithm (Theorem 4.3), we observed that the low-rank submatrix results in an asymptotically smaller communication lower bound than in previous works [4]. A more holistic analysis, where one nests the entire bilinear algorithm and understands how low-rank submatrices affect the overall communication cost, may enhance the theoretical analysis and produce sharper communication lower bounds.

Appendix

Definition 7.1 (Bilinear Algorithm for Strassen’s Matrix Multiplication).

      1 0 1 0 1 −1 0 1 1 0 −1 0 1 0 1 0 0 1 −1 0 1       0 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 0 A =   , B =   , C =   . 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0 0 0       1 1 0 1 0 0 1 1 0 −1 0 1 0 1 1 −1 1 0 0 1 0

References

[1] Grace Dinh and James Demmel. Communication-optimal tilings for projective nested loops with arbitrary bounds. In Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures, pages 523–525, 2020.

[2] Hong Jia-Wei and Hsiang-Tsung Kung. I/O complexity: The red-blue pebble game. In Proceedings of the thirteenth annual ACM symposium on Theory of computing, pages 326–333, 1981.

[3] James Demmel and Grace Dinh. Communication-optimal convolutional neural nets. arXiv preprint arXiv:1802.06905, 2018.

[4] Grey Ballard, Aydin Buluc, James Demmel, Laura Grigori, Benjamin Lipshitz, Oded Schwartz, and Sivan Toledo. Communication optimal parallel multiplication of sparse random matrices. In Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures, pages 222–231, 2013.

[5] Grey Ballard, Nicholas Knight, and Kathryn Rouse. Communication lower bounds for matricized tensor times Khatri-Rao product. In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 557–567. IEEE, 2018.

[6] Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Minimizing communication in numerical linear algebra. SIAM Journal on Matrix Analysis and Applications, 32(3):866–901, 2011.

[7] Gianfranco Bilardi and Lorenzo De Stefani. The I/O complexity of Toom-Cook integer multiplication. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2034–2052. SIAM, 2019.

[8] Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Graph expansion and communication costs of fast matrix multiplication. Journal of the ACM (JACM), 59(6):1–23, 2013.

[9] Saachi Jain and Matei Zaharia. Spectral lower bounds on the I/O complexity of computation graphs. In Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures, pages 329–338, 2020.

[10] Grey Ballard, Alex Druinsky, Nicholas Knight, and Oded Schwartz. Hypergraph partitioning for sparse matrix-matrix multiplication. ACM Transactions on Parallel Computing (TOPC), 3(3):1–34, 2016.

[11] Victor Pan. How can we speed up matrix multiplication? SIAM review, 26(3):393–415, 1984.

[12] Edgar Solomonik, James Demmel, and Torsten Hoefler. Communication lower bounds of bilinear algorithms for symmetric tensor contractions. Journal on Scientific Computing, 2021. to appear, arXiv preprint arXiv:1707.04618.

[13] Edgar Solomonik and James Demmel. Fast bilinear algorithms for symmetric tensor con- tractions. Computational Methods in Applied Mathematics, (0), 2020.

[14] Joseph B Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear algebra and its applications, 18(2):95–138, 1977.

[15] Dror Irony, Sivan Toledo, and Alexander Tiskin. Communication lower bounds for distributed-memory matrix multiplication. Journal of Parallel and Distributed Comput- ing, 64(9):1017–1026, 2004.

[16] Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13(4):354–356, 1969.

[17] Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded Schwartz. Communication-optimal parallel algorithm for Strassen's matrix multiplication. In Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures, pages 193–204, 2012.

[18] Caleb Ju and Edgar Solomonik. Derivation and analysis of fast bilinear algorithms for convolution. SIAM Review, 62(4):743–777, 2020.

[19] Ivan W Selesnick and C Sidney Burrus. Extending Winograd’s small convolution algo- rithm to longer lengths. In Proceedings of IEEE International Symposium on Circuits and Systems-ISCAS’94, volume 2, pages 449–452. IEEE, 1994.

[20] R Agarwal and J Cooley. New algorithms for digital convolution. IEEE Transactions on Acoustics, Speech, and Signal Processing, 25(5):392–410, 1977.

[21] Ioannis Pitas and M Strintzis. Multidimensional cyclic convolution algorithms with minimal multiplicative complexity. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3):384–390, 1987.

[22] Andrew Lavin and Scott Gray. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4013–4021, 2016.

[23] So Hirata. Tensor Contraction Engine: Abstraction and automated parallel implementation of configuration-interaction, coupled-cluster, and many-body perturbation theories. The Journal of Physical Chemistry A, 107(46):9887–9897, 2003.

[24] Michael Christ, James Demmel, Nicholas Knight, Thomas Scanlon, and Katherine Yelick. Communication lower bounds and optimal algorithms for programs that reference arrays– part 1. arXiv preprint arXiv:1308.0068, 2013.

[25] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.
