arXiv:1808.06287v1 [math.NA] 20 Aug 2018

Superlinear Convergence of Randomized Block Lanczos Algorithm

Qiaochu Yuan, Department of Mathematics, UC Berkeley, Berkeley, CA, USA, [email protected]
Ming Gu, Department of Mathematics, UC Berkeley, Berkeley, CA, USA, [email protected]
Bo Li, Department of Mathematics, UC Berkeley, Berkeley, CA, USA, [email protected]

Abstract—The low rank approximation of matrices is a crucial component in many data mining applications today. A competitive algorithm for this class of problems is the randomized block Lanczos algorithm - an amalgamation of the traditional block Lanczos algorithm with a randomized starting matrix. While empirically this algorithm performs quite well, there have been scant new theoretical results on its convergence behavior and approximation accuracy, and past results have been restricted to certain parameter settings. In this paper, we present a unified singular value convergence analysis for this algorithm, valid for all choices of the block size parameter. We present results on the rate of convergence, and show that under certain singular value spectrum regimes the convergence is superlinear. Additionally, we provide results from numerical experiments that validate our novel analysis.

Index Terms—low-rank approximation, randomized block Lanczos, block size, singular values.

I. INTRODUCTION

The low rank approximation of matrices is a crucial component in many data mining applications today. In addition to functioning as a stand alone technique for dimensionality reduction, compression, denoising, and signal processing [1]–[4], it has also been incorporated as a computational subroutine into more complex algorithms [5], [6]. As part of modern large scale data processing, low rank approximations help reveal important structural information in the data, and transform the raw data into forms that are more efficient for computation, transmission, and storage.

The singular value decomposition (SVD) is a matrix factorization that is of both theoretical and practical importance, and it has a number of useful properties related to rank. In particular, it is used to identify nearby matrices of lower rank, and, leaving aside the question of computational complexity, it is known that the rank-k truncated SVD is the "gold standard" for approximating a matrix by another matrix of rank at most k [7].

While procedures for computing the exact rank-k truncated SVD have existed since the 1960s [8], the computational cost of these algorithms is prohibitive at the scale of today's datasets. The recent applications of low rank matrix approximation techniques to big-data problems differ in both the scale of the computation and the requirements on the efficiency and the accuracy of the algorithms. Firstly, we are increasingly leaving behind the era of moderately sized matrices and entering an age of web-scale datasets and big-data applications. The matrices arising from such applications are often extraordinarily large, on the order of or exceeding 10^6 in one or both of the dimensions [9]–[11], and place much higher demands on the computational efficiency of the algorithms. Secondly, while the truncated SVD may be the final desired object for previous scientific computing applications, for big-data applications it is usually an intermediate representation for the overall classification or regression task. Empirically, the final accuracy of the task depends only weakly on the accuracy of the matrix approximation [12]. Thus, while previous variants of SVD algorithms aimed at computing the truncated SVD up to full double precision, newer algorithms focused on big-data applications can get by comfortably with only 2-3 digits of accuracy.

These considerations have led to the development of randomized variants of traditional SVD algorithms suited to large, sparse matrices, in particular randomized subspace iteration (RSI) and randomized block Lanczos (RBL) [13]–[16]. By applying either a randomized sketching or a randomized projecting operation on the original matrix, these algorithms balance reducing the computational complexity against producing an acceptably accurate approximation. While they have been shown to be empirically effective and have been widely adopted by popular software packages, e.g. [17], there has been scant work on new theoretical convergence guarantees for the better performing but more complicated of the two, the randomized block Lanczos algorithm.

In this paper, we present novel theoretical convergence results concerning the rate of singular value convergence for the RBL algorithm, along with numerical experiments supporting these results. Our analysis presents a unified singular value convergence theory for variants of the block Lanczos algorithm, for all valid parameter choices of the block size b. To our knowledge, all previous results in the literature are applicable only for the choice of b ≥ k, the target rank. We present a generalized theorem, applicable to all block sizes b, which coincides asymptotically with previous results for the case b ≥ k, while providing equally strong rates of convergence for the case b < k.

In Section II, we present the randomized block Lanczos algorithm and discuss some previous convergence results for this algorithm. In Section III, we dive into our main theoretical result and its derivation, followed by corollaries for special cases. In Section IV, we investigate the behavior of this algorithm for different parameter settings and empirically verify the results of the previous section. Finally, we give concluding remarks in Section V.
II. BACKGROUND

A. Preliminaries

Throughout this paper, our analysis assumes exact arithmetic.

We denote matrices by bold-faced uppercase letters, e.g. M, entries of matrices by the plain-faced lowercase letter that the entry belongs to, e.g. m_11, and block submatrices by the bold-faced or script-faced uppercase letter that the submatrix belongs to, subscripted by position, possibly with subscripts, e.g. M_11 or M_{a×b}. Double numerical subscripts denote the position of the element or the submatrix, i.e. M_11 and m_11 are the topmost leftmost subblock or entry of M respectively. Dimension subscripts denote the size of a submatrix, when such information is relevant, i.e. M_{a×b} denotes a subblock of M that has dimensions a × b.

Constants are denoted by script-faced uppercase or lowercase letters, e.g. 𝒞 or α, when the quantity is asymptotically insignificant, i.e. constant with respect to the convergence parameter.

The SVD of a matrix A is defined as the factorization

    A = U Σ V^T    (1)

where U = [u_1 ··· u_n] and V = [v_1 ··· v_n] are orthogonal matrices whose columns are the sets of left and right singular vectors respectively, and Σ is a diagonal matrix whose entries Σ_ii = σ_i are the singular values, ordered descendingly, σ_1 ≥ ··· ≥ σ_n ≥ 0.

The rank-k truncated SVD of a matrix is defined as

    svd_k(A) = U_k Σ_k V_k^T    (2)

where U_k = [u_1 ··· u_k] and V_k = [v_1 ··· v_k] contain the first k left and right singular vectors respectively, and Σ_k = diag(σ_1, ··· , σ_k).

The ith singular value of an arbitrary matrix M is denoted by σ_i(M), or simply σ_i when the matrix in question is clear from context.

The pth degree Chebyshev polynomial is defined by the recurrence

    T_0(x) ≡ 1    (3)
    T_1(x) ≡ x    (4)
    T_p(x) ≡ 2x T_{p−1}(x) − T_{p−2}(x)    (5)

Alternatively, they may be expressed as

    T_p(x) = (1/2) [ (x + √(x²−1))^p + (x + √(x²−1))^{−p} ]    (6)

for |x| > 1, and estimated as

    T_p(1 + ε) ≈ (1/2) (1 + ε + √(2ε))^p    (7)

for p large and ε small.

B. The Algorithm

The randomized block Lanczos algorithm is a straightforward combination of the classical block Lanczos method [18] with the added element of a randomized starting matrix V = AΩ. The pseudocode for this algorithm is outlined in Algorithm 1. Of the parameters of the algorithm, k (target rank) is problem dependent, while b (block size) and q (number of iterations) are chosen by the user to control the quality and computational cost of the approximation. The algorithm requires the choices of b, q to satisfy qb ≥ k, to ensure that the Krylov subspace is at least k dimensional.

Algorithm 1: randomized block Lanczos algorithm pseudocode
Input: A ∈ R^{m×n}; Ω ∈ R^{n×b}, random Gaussian matrix; k, target rank; b, block size; q, number of Lanczos iterations
Output: B_k ∈ R^{m×n}, a rank-k approximation to A
1: Form the block column Krylov subspace matrix K = [ AΩ  (AA^T)AΩ  ···  (AA^T)^q AΩ ].
2: Compute an orthonormal basis Q for the column span of K, using e.g. QR: [Q, ∼] ← qr(K).
3: Project A onto the Krylov subspace by computing B = QQ^T A.
4: Compute the k-truncated SVD B_k = svd_k(B) = svd_k(QQ^T A) = Q · svd_k(Q^T A).
5: Return B_k.

We present the algorithm pseudocode in this form in order to highlight the mathematical ideas that are at the core of this algorithm. It is well known that a naive implementation of any Lanczos algorithm is plagued by loss of orthogonality of the Lanczos vectors due to roundoff errors [19]. A practical implementation of Algorithm 1 should involve, at the very least, a reorganization of the computation to use the three-term recurrence and bidiagonalization [20], and reorthogonalizations of the Lanczos vectors at each step using one of the numerous schemes that have been proposed [20]–[22].
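To make the steps above concrete, the following is a minimal dense NumPy sketch of Algorithm 1; the function name rbl_approx and the seeded random generator are our own choices, not part of the paper. It folds in the simplest of the safeguards just mentioned - each new Krylov block is orthogonalized against the basis accumulated so far - which leaves the subspace unchanged in exact arithmetic.

import numpy as np

def rbl_approx(A, k, b, q, rng=None):
    """Rank-k approximation of A via randomized block Lanczos (Algorithm 1).

    Illustrative dense sketch, not a robust implementation: it materializes
    the block Krylov basis explicitly and reorthogonalizes each new block
    against all previous ones.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    m, n = A.shape
    assert q * b >= k, "need qb >= k so the Krylov subspace is at least k-dimensional"

    Z = A @ rng.standard_normal((n, b))     # A @ Omega, with Omega random Gaussian
    Q, _ = np.linalg.qr(Z)                  # orthonormal basis, grown block by block
    Zq = Q                                  # current orthonormal block
    for _ in range(q):                      # append blocks (A A^T)^i A Omega, i = 1..q
        Z = A @ (A.T @ Zq)
        Z -= Q @ (Q.T @ Z)                  # orthogonalize against earlier blocks
        Zq, _ = np.linalg.qr(Z)
        Q = np.hstack([Q, Zq])

    B = Q.T @ A                             # Step 3: B = Q Q^T A, stored as Q^T A
    Us, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ Us[:, :k], s[:k], Vt[:k, :]  # Step 4: factors of B_k = svd_k(B)

For example, Uk, s, Vt = rbl_approx(A, k=10, b=2, q=20) returns the factors of B_k; since B is an orthogonal projection of A, the entries of s approximate the corresponding singular values of A from below.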
C. Previous Work

Historically, the classical Lanczos algorithm was developed as an eigenvalue algorithm for symmetric matrices. Its convergence analysis focused on theorems concerning the approximation quality of the approximant's eigenvalues as a function of k, the target rank. The analysis relied heavily on the analysis of the k-dimensional Krylov subspace and the choice of the associated k-degree Chebyshev polynomial. Classical results in this line of inquiry include those by Kaniel [23], Paige [24], Underwood [25], and Saad [26].

More recently, while there has been much work on the analysis of randomized algorithms, such efforts have been focused mostly on RBL's simpler cousins, such as the randomized SVD or randomized subspace iteration [12], [15].

The exception is the results from [16]. To our knowledge, this is one of the few works that provide convergence analysis for randomized block Lanczos and the first work that gives "gap"-independent theoretical bounds for this algorithm. The analysis found therein is restricted to the case of the block size, b, chosen at least the size of k, the desired target rank. Our theoretical analysis will give a more generally applicable convergence bound, encompassing the cases of both 1 ≤ b < k and b ≥ k. In the latter case, our theoretical results will coincide with those in [16]. In the former case, we show that rapid convergence of the algorithm is assured for any block size b larger than the largest singular value cluster size. We draw attention to this distinction in choosing the block size parameter b - in our numerical experiments, we show that generally smaller choices for b are favored.

Our current work is based partially on the analysis found in [12]. That work established aggressive multiplicative convergence bounds for the randomized subspace iteration algorithm, for both singular values and normed (Frobenius, spectral) matrix convergence. These bounds depend on both the singular value gap and the number of iterations taken by the algorithm - the former is a property of the matrix in question, and the latter is proportional to the computational complexity of the algorithm. The analysis presented in that work is linear algebraic in nature, drawing on deterministic matrix analysis, as well as expectation bounds on randomized Gaussian matrices and their concentration of measure characteristics. Our current work employs similar methods, and achieves bounds of a similar form. While the details differ, core ideas, such as creating an artificial "gap" in the spectrum and choosing an opportune orthonormal basis for the analysis, are the same.

III. THEORETICAL RESULTS

A. Problem Statement

Given an arbitrary matrix A ∈ R^{m×n} and a target rank k ≤ rank(A), the goal of a low-rank matrix approximation algorithm is to compute another matrix B_k ∈ R^{m×n} whose rank is at most k.

There are many ways to ask and answer the question, "how good of an approximation is B_k to the original A?" In particular, for various low-rank approximation algorithms, the answer has been provided in terms of normed approximation error [12], [15], [16], [27], singular subspace error [28], [29], and singular value error [12], [26].

In this paper, we focus on the singular value error for the randomized block Lanczos algorithm. As B is an orthogonal projection of A in Alg. 1, by the Cauchy interlacing theorem for singular values, we immediately have the upper bound

    σ_j ≥ σ_j(B_k)    (8)

for j = 1, ··· , k.

The optimal lower bound is achieved, of course, by the rank-k truncated SVD of A, giving the tight inequality

    σ_j ≥ σ_j(svd_k(A)) ≥ σ_j    (9)

for j = 1, ··· , k.

We will show that the randomized block Lanczos algorithm provides competitive accuracy, and produces singular value estimates at least some fraction of the optimum,

    σ_j ≥ σ_j(B_k) ≥ σ_j / √( 1 + {some convergence factor} )    (10)

for some {convergence factor} → 0.

B. Key Results

Our convergence analysis will show that if the randomized block Lanczos algorithm converges, then the k desired singular values of the approximation B_k converge to the corresponding true singular values of A exponentially in the number of iterations q. Moreover, convergence occurs as long as the block size b is chosen to be larger than the maximum cluster size for the k relevant singular values.

We present our main results here and delay their proofs to Subsection III-D. Our main theorem is as follows.

Theorem III.1. Let B_k be the matrix returned by Alg. 1. Assume that Ω is chosen such that the two conditions in Remark III.1 hold. For any choices of r, s ≥ 0, and any parameter choices b, q satisfying k + r = (q − p)b ≥ k, for j = 1, ··· , k,

    σ_j ≥ σ_j(B_k) ≥ σ_{j+s} / √( 1 + 𝒞 · T_{2p+1}^{−2}( 1 + 2 (σ_j − σ_{j+s+r+1}) / σ_{j+s+r+1} ) )    (11)

where 𝒞 is a constant that is independent of q.

This inequality shows that for all valid choices of parameters b, q, the convergence of the approximate singular values is governed by the growth of the Chebyshev polynomial term

    T_{2p+1}( 1 + 2 (σ_j − σ_{j+s+r+1}) / σ_{j+s+r+1} )    (12)

with the bound holding across all choices of the analysis parameters s, r.
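The size of this term is easy to tabulate. The sketch below evaluates T_{2p+1}(1 + 2γ) through the identity T_p(x) = cosh(p·arccosh(x)) for x ≥ 1 and compares it against the estimate of Eqn. (7); the relative gap value γ is an assumption chosen purely for illustration. The rapid growth with p is what drives the correction term in (11) toward zero.

import numpy as np

def cheb(p, x):
    # T_p(x) for x >= 1, via T_p(x) = cosh(p * arccosh(x))
    return np.cosh(p * np.arccosh(x))

gamma = 0.05                     # assumed relative singular value gap (illustrative)
for p in [1, 2, 4, 8, 16]:
    exact = cheb(2 * p + 1, 1 + 2 * gamma)
    approx = 0.5 * (1 + 2 * gamma + np.sqrt(4 * gamma)) ** (2 * p + 1)  # Eqn. (7) with eps = 2*gamma
    print(p, exact, approx)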
Theorem III.1 admits the following corollaries about two special choices for the block size parameter b, where the constants in each case can be expressed in an algebraically closed form.

Corollary III.2 (Special case: b = 1). For any choices of r, s ≥ 0 satisfying k + r = (q − p) ≥ k, for j = 1, ··· , k,

    σ_j ≥ σ_j(B_k) ≥ σ_{j+s} / √( 1 + 𝒞_{b=1} · T_{2p+1}^{−2}( 1 + 2 (σ_j − σ_{j+s+r+1}) / σ_{j+s+r+1} ) )    (13)

where

    𝒞_{b=1} = max_{1≤s≤k}  Σ_{i=j+r+1}^{n} ( ω̂_i / ω̂_s )²  Π_{t=1, t≠s}^{j+r} ( (σ_i² − σ_t²) / (σ_s² − σ_t²) )²    (14)

is a constant independent of q.

Corollary III.3 (Special case: b ≥ k + r). For any choices of r, s ≥ 0, for j = 1, ··· , k,

    σ_j ≥ σ_j(B_k) ≥ σ_{j+s} / √( 1 + 𝒞_{b≥k+r} · T_{2q+1}^{−2}( 1 + 2 (σ_j − σ_{j+s+r+1}) / σ_{j+s+r+1} ) )    (15)

where

    𝒞_{b≥k+r} = ‖ Ω̃_41 Ω̃_11^{−1} ‖_2²    (16)

is a constant independent of both q, the iteration parameter, and Σ, the spectrum of A.

Fig. 1. Chebyshev polynomials T_n(x) grow much faster than monomials of the same degree M_n(x) = x^n in the interval |x| > 1. (Curves labeled "Chebyshev, Lanczos" and "Monomial, power method".)

Fig. 2. Auxiliary analysis parameters r, s are adjusted to create a sufficient singular spectrum "gap" to drive convergence. (Spectrum annotated at σ_n, σ_{k+s+r+1}, σ_{k+s}, σ_k, σ_1.)

Choosing the analysis parameters r, s optimally, we arrive at a result coinciding asymptotically with the conclusions reached in [16].

Theorem III.4. Let B_k be the matrix returned by running Alg. 1 with the block size b = k. Assume Ω is chosen such that Ω̃_11 is nonsingular. Then, for j = 1, ··· , k,

    σ_j ≥ σ_j(B_k) ≥ σ_j · e^{ −O( log²(𝒜(4q+2)) / (4q+2)² ) }    (17)

where 𝒜 = 2 𝒞_{b≥k+r} is a constant independent of q.

Finally, from Theorem III.1 we may derive the following result, which states that for certain matrices with singular spectrum rapidly decaying to 0, the RBL algorithm converges superlinearly.

Theorem III.5. Assume the singular value spectrum of A decays such that σ_i → 0. Let B_k be the rank-k approximation of A returned by Alg. 1. Assume additionally that the hypothesis and notation of Theorem III.1 hold. Then

    σ_j(B_k) → σ_j    (18)

superlinearly in q, the number of iterations.

This theorem validates long observed empirical behaviors of block Lanczos algorithms. In Section IV, we show two examples of typical data matrices with spectrums that fall under this regime, and the expected superlinear convergence behavior.

C. Intuition

Our analysis makes use of the following three ideas:
1) the growth behavior of Chebyshev polynomials, a traditional ingredient in the analysis of Lanczos iteration methods (Fig. 1),
2) the choice of a clever orthonormal basis for the Krylov subspace, an idea adapted from [12],
3) the creation of a spectrum "gap", by separating the spectrum of A into those singular values that are "close" to σ_k and those that are sufficiently smaller in magnitude, using auxiliary analysis parameters r, s (Fig. 2).

D. Analysis

We are interested in the column span of the Krylov subspace matrix K. Let the singular value decomposition of A be denoted as A = UΣV^T. Then, we may write

    K = [ AΩ  (AA^T)AΩ  ···  (AA^T)^q AΩ ]
      = [ UΣV^TΩ  UΣ³V^TΩ  ···  UΣ^{2q+1}V^TΩ ]
      = UΣ [ Ω̂  Σ̂Ω̂  ···  Σ̂^q Ω̂ ]    (19)

where for notational convenience we have defined the quantities Ω̂ ≡ V^T Ω and Σ̂ ≡ Σ².

We "factor out" the component of the Krylov subspace that drives convergence from the component that is related to the initial starting subspace but independent of q. To this end, define for 0 ≤ p ≤ q,

    K̂_p ≡ U T_{2p+1}(Σ) [ Ω̂  Σ̂Ω̂  ···  Σ̂^{q−p} Ω̂ ]    (20)

The matrices K and K̂_p are related as

    span{K̂_p} ⊆ span{K}    (21)

In light of this, since Step 3 of Alg. 1 is a projection, we are justified in our analysis to work with K̂_p instead of the more complicated K.

Next, we multiply K̂_p by a specially constructed, full rank matrix X. This operation will preserve the subspace spanned by the columns of K̂_p, but align, as much as possible, the first k columns to the direction of the leading k singular vectors.

For all 0 ≤ p ≤ q, let

    V̂_p ≡ [ Ω̂  Σ̂Ω̂  ···  Σ̂^{q−p} Ω̂ ]    (22)

denote the generalized Vandermonde matrix from Eqn. (20) and
partition this matrix as follows:

    V̂_p = [ V̂_11 V̂_12 ; V̂_21 V̂_22 ; V̂_31 V̂_32 ; V̂_41 V̂_42 ]    (23)

where the blocks in the first dimension are sized k, s, r, t = n − (k + s + r) and the blocks in the second dimension are sized k, r. Intuitively, s is used to handle duplicate or clustered singular values, while r is used to create the "gap" that drives convergence (Fig. 2). With this partition, we examine the convergence behavior viewed as an accentuation of the "gap" by the appropriate Chebyshev polynomial.

We show the existence of at least one special non-singular X ∈ R^{(k+r)×(k+r)} such that

    K̂_p X = U T_{2p+1}(Σ) V̂_p X    (24)
          = U [ Q_11 ··· ; Q_21 ··· ; 0 ··· ; H ··· ]    (25)

with [Q_11 ; Q_21] a column orthogonal matrix. Notice the "gap" in the (3,1) block of size r is created by using X to align the columns of K̂_p.

We explicitly construct such an X. Partition

    X = [ X_11 X_12 ; X_21 X_22 ]    (26)
    Σ = diag( Σ_1, Σ_2, Σ_3, Σ_4 )    (27)

where each dimension of X is sized k, r, and each dimension of Σ is sized k, s, r, t = n − (k + s + r). Then,

    T_{2p+1}(Σ) V̂_p X ≡ [ V_11 ··· ; V_21 ··· ; V_31 ··· ; V_41 ··· ]    (28)

where

    [ V_11 ; V_21 ] = [ T_{2p+1}(Σ_1)(V̂_11 X_11 + V̂_12 X_21) ; T_{2p+1}(Σ_2)(V̂_21 X_11 + V̂_22 X_21) ]
    V_31 = T_{2p+1}(Σ_3)(V̂_31 X_11 + V̂_32 X_21)
    V_41 = T_{2p+1}(Σ_4)(V̂_41 X_11 + V̂_42 X_21)

Setting

    X_21 = −V̂_32^{−1} V̂_31 X_11    (29)

ensures the block V_31, of dimensions r × k, to be V_31 = 0, and causes the (k+s) × k block to become

    [ V_11 ; V_21 ] = [ T_{2p+1}(Σ_1)(V̂_11 − V̂_12 V̂_32^{−1} V̂_31) ; T_{2p+1}(Σ_2)(V̂_21 − V̂_22 V̂_32^{−1} V̂_31) ] X_11

We can then take the QR factorization

    Q̃ R̃ = [ T_{2p+1}(Σ_1)(V̂_11 − V̂_12 V̂_32^{−1} V̂_31) ; T_{2p+1}(Σ_2)(V̂_21 − V̂_22 V̂_32^{−1} V̂_31) ]    (30)

and set

    X_11 = R̃^{−1}    (31)

This ensures that

    [ V_11 ; V_21 ]^T [ V_11 ; V_21 ] = ( Q̃ R̃ R̃^{−1} )^T ( Q̃ R̃ R̃^{−1} ) = I    (32)

Let Eqn. (31) and Eqn. (29) define X_11 and X_21 respectively as

    [ X_11 ; X_21 ] = [ I ; −V̂_32^{−1} V̂_31 ] R̃^{−1}    (33)

We specify

    [ X_12 ; X_22 ] ≡ [ X_11 ; X_21 ]^⊥    (34)

to provide a complete description of X which satisfies Eqn. (25).

Remark III.1. In order for the above derivation and thus Eqn. (34) and Eqn. (33) to be valid, the following conditions must hold:
• Ω is chosen to allow V̂_32 to be non-singular and thus invertible,
• V̂_11 − V̂_12 V̂_32^{−1} V̂_31 to be non-singular and thus R̃ to be invertible. Note that this expression is the Schur complement of the (k + r) × (k + r) matrix [ V̂_11 V̂_12 ; V̂_31 V̂_32 ] with respect to the V̂_32 block.

We present a first result on a lower bound for the singular values of B_k.

Lemma III.6. Let B_k be the matrix returned by Alg. 1, let H be as defined in Eqn. (25), and assume that the two conditions in Remark III.1 hold. Then,

    σ_k(B_k) ≥ σ_{k+s} / √( 1 + ‖H‖_2² )    (35)

Proof. The matrix returned by Alg. 1 is the k-truncated SVD of QQ^T A, where the columns of Q are an orthonormal basis for the column span of K. By construction, it follows that

    σ_k(B_k) ≥ σ_k( Q̂_p Q̂_p^T A )    (36)

where Q̂_p contains columns that form an orthonormal basis for the column span of K̂_p X.

In particular, let Q̂_p R̂_p be the QR factorization of K̂_p X, partitioned as follows:

    K̂_p X = Q̂_p R̂_p = [ Q̂_1 Q̂_2 ] [ R̂_11 R̂_12 ; 0 R̂_22 ]    (37)

where the block dimensions are sized k, s, as appropriate.

We can write

    Q̂_p Q̂_p^T A = Q̂_p [ Q̂_1^T ; Q̂_2^T ] U Σ V^T = Q̂_p [ Q̂_1^T U diag(Σ_1, Σ_2, Σ_3, Σ_4) ; Q̂_2^T U diag(Σ_1, Σ_2, Σ_3, Σ_4) ] V^T

By the Cauchy interlacing theorem for singular values, it follows that

    σ_k( Q̂_p Q̂_p^T A ) ≥ σ_k( Q̂_1^T U [ Σ_1 0 ; 0 Σ_2 ; 0 0 ; 0 0 ] )    (38)

We can compare the first columns of Eqn. (37) with the expression in Eqn. (25) to see that

    Q̂_1 R̂_11 = U [ Q_11 ; Q_21 ; 0 ; H ]    (39)

which helps us to write

    Q̂_1^T U [ Σ_1 0 ; 0 Σ_2 ; 0 0 ; 0 0 ] = R̂_11^{−T} [ Q_11 ; Q_21 ; 0 ; H ]^T [ Σ_1 0 ; 0 Σ_2 ; 0 0 ; 0 0 ] = R̂_11^{−T} [ Q_11^T Σ_1   Q_21^T Σ_2 ]    (40)

On the other hand,

    σ_{k+s} ≤ σ_k( [ Q_11^T Σ_1   Q_21^T Σ_2 ] ) = σ_k( R̂_11^T R̂_11^{−T} [ Q_11^T Σ_1   Q_21^T Σ_2 ] ) ≤ ‖R̂_11^T‖_2 · σ_k( R̂_11^{−T} [ Q_11^T Σ_1   Q_21^T Σ_2 ] )    (41)

Combining Eqns. (36), (38), (40), and (41), we obtain

    σ_k(B_k) ≥ σ_{k+s} / ‖R̂_11^T‖_2    (42)

With the help of Eqn. (39), we can then write

    R̂_11^T R̂_11 = R̂_11^T Q̂_1^T U U^T Q̂_1 R̂_11    (43)
                 = [ Q_11 ; Q_21 ]^T [ Q_11 ; Q_21 ] + H^T H    (44)
                 = I + H^T H    (45)

which completes the proof.

We are now in a position to provide the proof for Theorem III.1.

Proof. With an eye toward Lemma III.6, we proceed by providing a bound for ‖H‖_2:

    ‖H‖_2² = σ_1²( T_{2p+1}(Σ_4)(V̂_41 − V̂_42 V̂_32^{−1} V̂_31) R̃^{−1} )
           ≤ σ_1²( T_{2p+1}(Σ_4)(V̂_41 − V̂_42 V̂_32^{−1} V̂_31) ) · ‖ ( (V̂_11 − V̂_12 V̂_32^{−1} V̂_31)^T T_{2p+1}²(Σ_1) (V̂_11 − V̂_12 V̂_32^{−1} V̂_31) )^{−1} ‖_2
           ≤ ‖ (V̂_41 − V̂_42 V̂_32^{−1} V̂_31)(V̂_11 − V̂_12 V̂_32^{−1} V̂_31)^{−1} ‖_2² · T_{2p+1}^{−2}( 1 + 2 (σ_k − σ_{k+s+r+1}) / σ_{k+s+r+1} )

The 1 + 2(σ_k − σ_{k+s+r+1})/σ_{k+s+r+1} factor is interpreted as shifting the Chebyshev polynomial T_{2p+1} onto the interval [0, σ_{k+s+r+1}], so that the tail of the singular spectrum is bounded by 1 and convergence is driven by the growth of the Chebyshev polynomial on the [σ_k, ··· , σ_1] part of the spectrum that we are interested in.

Repeating the previous argument for 1 ≤ j ≤ k completes the proof for the bound on σ_j(B_k).

Due to space constraints, we omit the proofs for the corollaries of Theorem III.1. They are similar in flavor to the proof above and involve constructions of specifically chosen matrices in each case.
We close by providing the proof for Theorem III.5.

Proof. The statement of the theorem is equivalent to the statement that

    𝒞 · T_{2p+1}^{−1}( 1 + 2 (σ_j − σ_{j+r+1}) / σ_{j+r+1} ) → 0    (46)

superlinearly. For notational convenience we assume σ_j is not a multiple singular value and we have chosen s = 0; otherwise, the following argument can be made for the largest choice of s such that σ_{j+s} = σ_j.

Recall that a sequence a_n converges superlinearly to a if

    lim_{n→∞} |a_{n+1} − a| / |a_n − a| = 0    (47)

For any fixed j = 1, ··· , k, define

    a_q(r) ≡ 𝒞(r) · T_{2(q+1−(k+r)/b)+1}^{−1}( 1 + 2 (σ_j − σ_{j+r+1}) / σ_{j+r+1} )

where we have explicitly specified the dependence of the constant 𝒞 on the analysis parameter r, and expressed p in terms of q. We approximate

    a_q(r) ≈ 𝒞(r) · [ (1/2) (1 + g + √(2g)) ^ {2(q+1−(k+r)/b)+1} ]^{−1}
           ≈ [ 2 𝒞(r) · (1 + g + √(2g))^{ −(2(1−(k+r)/b)+1) } ] · [ (1 + g + √(2g))^{−2} ]^q

where g = 2 (σ_j − σ_{j+r+1}) / σ_{j+r+1} = 2 (σ_j / σ_{j+r+1} − 1).

Then we argue that a_{q+1}/a_q → 0 as follows:

    a_{q+1} / a_q = 1 / (1 + g + √(2g))² ≤ 1 / (1 + g)²    (48)

Since we assume a spectrum such that σ_i → 0 eventually, it is possible to choose r sufficiently large such that 1/(1 + g)² is arbitrarily small.

Rigorously, the above argument applies only to infinite dimensional operators, as in the finite dimensional case r cannot be chosen to be arbitrarily large. However, numerous previous works have noted that in practice, the convergence does tend to exhibit superlinear behavior for certain types of spectrums [30].
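The key step of the proof is Eqn. (48): the ratio of successive bound terms is at most 1/(1+g)², and g can be made as large as desired by increasing r when the spectrum decays. The following check uses an assumed geometrically decaying spectrum and an arbitrary tracked index j, both chosen only for illustration:

import numpy as np

sigma = 2.0 ** -np.arange(1, 201)   # assumed decaying spectrum, sigma_i = 2^{-i}
j = 10                              # singular value being tracked (illustrative)
for r in [1, 5, 10, 20, 40]:
    g = 2.0 * (sigma[j - 1] / sigma[j + r] - 1.0)   # g = 2(sigma_j / sigma_{j+r+1} - 1), 0-based arrays
    print(r, 1.0 / (1.0 + g) ** 2)                  # bound on a_{q+1}/a_q from Eqn. (48)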

IV. NUMERICAL EXPERIMENTS

A. Computational Complexity

We will give an arithmetic complexity accounting of the randomized block Lanczos algorithm. The initialization of the random starting matrix Ω takes O(nb) floating-point operations (flops). In step 1, the formation of the Krylov matrix K consists of 1 matrix multiplication of AΩ along with 2(q − 1) accumulated applications of either A or A^T, for a total of O(mnbq) flops. The orthonormal basis Q of K can be computed using a QR factorization with the standard Householder implementation, which has complexity O(m(bq)²). Finally, steps 3 and 4 consist of first forming Q^T A for O(mnbq) flops, then computing its truncated SVD factorization. Because the size of this matrix is qb × n and we expect qb ≈ k to be small, we assume its SVD computation is performed with a non-specialized dense matrix algorithm, using O(n(bq)²) flops. The final step of projecting the resulting k singular vectors onto Q is an additional O(m(bq)²) flops.

Overall, the computational complexity of Algorithm 1 is O(mnbq + (m + n)(bq)²). The first term dominates the computations and is the result of performing the matrix multiplications for the computation of the Lanczos block vectors. Fortunately, matrix multiplication is a highly optimized and highly tuned part of many matrix computation libraries, especially for suitably chosen block sizes.

We draw attention to the fact that the parameters b and q only appear together as the quantity bq in our computational complexity count. This suggests that we may freely vary b, q - as long as they vary inversely and the quantity bq remains constant, the cost of running Algorithm 1 remains comparable. (In practice, this will only hold true for b > 1, due to the efficiency of BLAS2 and BLAS3 operations compared with BLAS1 operations.) Given the comparable computational complexity, and assuming the conditions for the convergence of Algorithm 1 are met, we need not privilege the block size choice b = k. In fact, we show empirically that in many cases, it is advantageous to choose block sizes b strictly smaller than k.
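This bq dependence is easy to see in a toy count. The sketch below drops all constants, per the O(·) model above, and evaluates the leading-order flop estimate for several (b, q) pairs with the same product bq; the matrix dimensions used are those of the dataset in the next subsection.

def rbl_flops(m, n, b, q):
    # leading-order model from the text: O(mnbq + (m + n)(bq)^2), constants dropped
    return m * n * b * q + (m + n) * (b * q) ** 2

m, n = 9120, 5625
for b, q in [(200, 5), (100, 10), (50, 20), (10, 100), (1, 1000)]:
    print(b, q, b * q, rbl_flops(m, n, b, q))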

B. Activities and Sports Dataset

The Activities and Sports Dataset is a dataset consisting of motion sensor data for 8 subjects performing 19 daily/sports activities, for 5 minutes, sampled at 25Hz frequency. This dataset can be found at [31].

The matrix associated with this dataset is dense and of dimensions A ∈ R^{9120×5625}, where each row is a sample and each entry is a double precision float. Figure 3 shows a plot of the first 500 singular values of A. As is typical for data matrices, this matrix exhibits spectrum decay on the order of σ_j = 1/τ^j, and our theory suggests that in this case, we should observe superlinear convergence for the RBL algorithm.

Fig. 3. First 500 singular values of the Daily Activities and Sports Matrix.

In this set of experiments, we investigate the convergence of a single singular value with respect to the number of iterations, in addition to the effect of the block size on convergence. We run the RSI and RBL algorithms on the Activities and Sports Dataset matrix with a target rank of k = 200, and examine the convergence of σ_1, σ_100, and σ_200. The results of these experiments are in Figures 4, 5, and 6.

Each of these plots represents the convergence of a particular singular value. In each plot, each line represents a single parameter setting of the block size b, for either the RSI or the RBL algorithm. The y-axis is in log scale, and denotes

    rel. err. = (σ_j − σ_j(B_k)) / σ_j    (49)

the relative error of the particular singular value we are examining. The x-axis is in linear scale, and denotes the number of matrix-vector multiplications (MATVECs), a proxy measure for computational complexity. Markers on each line represent successive iterations of the algorithm. In these plots, down and to the left is good - we seek parameter settings that give good convergence for less computational complexity.

Fig. 4. k = 200 approximation of the Daily Activities Dataset, convergence of σ_1.

Fig. 5. k = 200 approximation of the Daily Activities Dataset, convergence of σ_100.

Fig. 6. k = 200 approximation of the Daily Activities Dataset, convergence of σ_200.

(In Figures 4-6, the x-axis is the number of MATVECs = b(1+2q), the y-axis is rel. err. = (σ_j − σ_j(B_k))/σ_j, and there is one curve per RSI or RBL block size setting b.)

We observe that, as expected, RSI converges linearly and RBL converges superlinearly. These trends are most clearly seen in Figure 6 and are also present in Figure 5. The convergence of σ_1 is extremely rapid in Figure 4, and reaches double precision in 2-5 iterations for all block sizes. In all cases, for both RBL and RSI, it appears that at the same computational complexity, choosing a smaller block size b leads to more rapid convergence. For example, in Figure 6, we observe that in order for σ_j to converge to a relative error of ∼10^{−5}, taking b = 1 uses 1/2 the number of MATVECs as taking b = k = 200.

C. Eigenfaces Dataset

The Eigenfaces dataset is available from the AT&T Laboratories' Database of Faces [32], and consists of 10 different face images of 40 different subjects at 92 × 112 pixels resolution, varying in light, facial expressions, and other details. The widely cited technique for processing this data is via principal component analysis (PCA), where it was observed that each face can be composed in large part from a few prominent "Eigenfaces" [33].

The associated matrix is a dense matrix, which is formed by vectorizing each face image as a column vector. It has dimensions A ∈ R^{10304×400} and is of full numerical rank. The spectrum of this matrix spans 5 orders of magnitude but decays extremely rapidly, typical of data matrices. In fact, as seen in Figure 7, it drops to zero within the first 50 largest singular values.

Fig. 7. Spectrum of the Eigenfaces Matrix.

We repeat the experiments performed in the last section. For this set of experiments, we use the RSI and RBL algorithms to compute rank-k = 100 approximations for the Eigenfaces matrix, and examine the convergence of σ_100. The result appears in Figure 8.

Fig. 8. k = 100 approximation of the Eigenfaces Dataset, convergence of σ_100.

We observe similar behavior as that observed for the Daily Activities and Sports Matrix: the RSI algorithm exhibits linear convergence while the RBL algorithm exhibits superlinear convergence; smaller block sizes b appear to converge more quickly for a fixed number of flops.
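The datasets themselves are needed to reproduce the figures, but the qualitative comparison can be sketched on a synthetic matrix with a geometrically decaying spectrum. The snippet below is such a sketch - the matrix size, spectrum, target index, and MATVEC budget are all illustrative assumptions, and it reuses the rbl_approx function from the sketch in Section II-B - sweeping the block size b at a roughly fixed MATVEC budget and reporting the relative error of Eqn. (49).

import numpy as np

rng = np.random.default_rng(0)
m, n, k, j = 600, 400, 50, 50                    # synthetic sizes and tracked index (illustrative)
sigma = 0.8 ** np.arange(n)                      # assumed geometrically decaying spectrum
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (U * sigma) @ V.T                            # A = U diag(sigma) V^T

budget = 600                                     # target number of MATVECs ~ b(1+2q)
for b in [1, 2, 10, 50]:
    q = max((budget // b - 1) // 2, (k + b - 1) // b)   # keep b(1+2q) near the budget, with qb >= k
    _, s, _ = rbl_approx(A, k, b, q)             # rbl_approx: the sketch from Section II-B
    rel_err = (sigma[j - 1] - s[j - 1]) / sigma[j - 1]  # Eqn. (49) for the j-th singular value
    print(b, q, b * (1 + 2 * q), rel_err)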

V. CONCLUSIONS

In this paper, we have derived a novel convergence result for the randomized block Lanczos algorithm. We have shown that for all block sizes, the singular value approximation accuracy for this algorithm converges geometrically in the number of iterations, with a rate that is asymptotically superior to that achieved by the randomized subspace iteration algorithm. We have also shown that for a matrix with spectrum decaying to zero, the RBL algorithm converges superlinearly. Additionally, we have provided numerical results in support of our analysis.

The current work is largely theoretical in nature, and there continues to be a need for quality implementations of the Randomized Block Lanczos algorithm to aid its wider adoptability. To this end, continuations of the current work might include such an (possibly parallelized) implementation, along with further investigations of practical choices for the block size parameter b, which balance the evident preference for a smaller b for convergence against the advantages of a larger b for computational efficiency.

REFERENCES

[1] M. B. Cohen, S. Elder, C. Musco, C. Musco, and M. Persu, "Dimensionality reduction for k-means clustering and low rank approximation," in Proceedings of the forty-seventh annual ACM symposium on Theory of computing. ACM, 2015, pp. 163–172.
[2] H. M. Nguyen, X. Peng, M. N. Do, and Z.-P. Liang, "Denoising mr spectroscopic imaging data with low-rank approximations," IEEE Transactions on Biomedical Engineering, vol. 60, no. 1, pp. 78–89, 2013.

[3] M. Fazel, E. Candes, B. Recht, and P. Parrilo, "Compressed sensing and robust recovery of low rank matrices," in Signals, Systems and Computers, 2008 42nd Asilomar Conference on. IEEE, 2008, pp. 1043–1047.
[4] D. Anderson and M. Gu, "An efficient, sparsity-preserving, online algorithm for low-rank approximation," in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70. International Convention Centre, Sydney, Australia: PMLR, 06–11 Aug 2017, pp. 156–165.
[5] J. Liu, P. Musialski, P. Wonka, and J. Ye, "Tensor completion for estimating missing values in visual data," IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 1, pp. 208–220, 2013.
[6] N. Parikh, S. Boyd et al., "Proximal algorithms," Foundations and Trends® in Optimization, vol. 1, no. 3, pp. 127–239, 2014.
[7] C. Eckart and G. Young, "The approximation of one matrix by another of lower rank," Psychometrika, vol. 1, no. 3, pp. 211–218, 1936.
[8] G. Golub and W. Kahan, "Calculating the singular values and pseudo-inverse of a matrix," Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis, vol. 2, no. 2, pp. 205–224, 1965.
[9] A. Talwalkar, S. Kumar, M. Mohri, and H. Rowley, "Large-scale svd and manifold learning," The Journal of Machine Learning Research, vol. 14, no. 1, pp. 3129–3152, 2013.
[10] R. Mazumder, T. Hastie, and R. Tibshirani, "Spectral regularization algorithms for learning large incomplete matrices," Journal of machine learning research, vol. 11, no. Aug, pp. 2287–2322, 2010.
[11] S. Cohen, B. Kimelfeld, and G. Koutrika, "A survey on proximity measures for social networks," in Search computing. Springer, 2012, pp. 191–206.
[12] M. Gu, "Subspace iteration randomization and singular value problems," SIAM Journal on Scientific Computing, vol. 37, no. 3, pp. A1139–A1173, 2015.
[13] P. Drineas, R. Kannan, and M. W. Mahoney, "Fast monte carlo algorithms for matrices ii: Computing a low-rank approximation to a matrix," SIAM Journal on computing, vol. 36, no. 1, pp. 158–183, 2006.
[14] V. Rokhlin, A. Szlam, and M. Tygert, "A randomized algorithm for principal component analysis," SIAM Journal on Matrix Analysis and Applications, vol. 31, no. 3, pp. 1100–1124, 2009.
[15] N. Halko, P.-G. Martinsson, and J. A. Tropp, "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions," SIAM review, vol. 53, no. 2, pp. 217–288, 2011.
[16] C. Musco and C. Musco, "Randomized block krylov methods for stronger and faster approximate singular value decomposition," in Advances in Neural Information Processing Systems, 2015, pp. 1396–1404.
[17] F.
Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in python,” Journal of machine learning research, vol. 12, no. Oct, pp. 2825–2830, 2011. [18] G. H. Golub and R. Underwood, “The block lanczos method for computing eigenvalues,” in Mathematical software. Elsevier, 1977, pp. 361–377. [19] C. C. Paige, “Computational variants of the lanczos method for the eigenproblem,” IMA Journal of Applied Mathematics, vol. 10, no. 3, pp. 373–381, 1972. [20] G. H. Golub, R. R. Underwood, and J. H. Wilkinson, “The lanczos algorithm for the symmetric ax= λ bx problem.” 1972. [21] B. N. Parlett and D. S. Scott, “The lanczos algorithm with selective orthogonalization,” Mathematics of computation, vol. 33, no. 145, pp. 217–238, 1979. [22] H. D. Simon, “The lanczos algorithm with partial reorthogonalization,” Mathematics of Computation, vol. 42, no. 165, pp. 115–142, 1984. [23] S. Kaniel, “Estimates for some computational techniques in linear algebra,” Mathematics of Computation, vol. 20, no. 95, pp. 369–378, 1966. [24] C. C. Paige, “The computation of eigenvalues and eigenvectors of very large sparse matrices.” Ph.D. dissertation, University of London, 1971. [25] R. Underwood, “An iterative block lanczos method for the solution of large sparse symmetric eigenproblems,” Tech. Rep., 1975. [26] Y. Saad, “On the rates of convergence of the lanczos and the block- lanczos methods,” SIAM Journal on Numerical Analysis, vol. 17, no. 5, pp. 687–706, 1980. [27] J. Xiao and M. Gu, “Spectrum-revealing cholesky factorization for methods,” in Data Mining (ICDM), 2016 IEEE 16th International Conference on. IEEE, 2016, pp. 1293–1298. [28] J. Chen and Y. Saad, “Lanczos vectors versus singular vectors for effective dimension reduction,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 8, pp. 1091–1103, 2009. [29] R.-C. Li and L.-H. Zhang, “Convergence of the block lanczos method for eigenvalue clusters,” Numerische Mathematik, vol. 131, no. 1, pp. 83–113, 2015. [30] Y. Saad, “Theoretical error bounds and general analysis of a few lanczos-type algorithms,” in Proceedings of the Cornelius Lanczos International Centenary Conference (JD Brown, MT Chu, DC Ellison and RJ Plemmons, eds), SIAM, Philadelphia, PA, 1994, pp. 123–134. [31] K. Altun, B. Barshan, and O. Tunc¸el, “Comparative study on classifying human activities with miniature inertial and magnetic sensors,” Pattern Recognition, vol. 43, no. 10, pp. 3605–3620, 2010. [32] F. S. Samaria and A. C. Harter, “Parameterisation of a stochastic model for human face identification,” in Applications of Computer Vision, 1994., Proceedings of the Second IEEE Workshop on. IEEE, 1994, pp. 138–142. [33] M. A. Turk and A. P. Pentland, “Face recognition using eigenfaces,” in Computer Vision and Pattern Recognition, 1991. Proceedings CVPR’91., IEEE Computer Society Conference on. IEEE, 1991, pp. 586–591.