Approximating Orthogonal Matrices with Effective Givens Factorization

arXiv:1905.05796v1 [math.OC] 14 May 2019

Thomas Frerix 1, Joan Bruna 2

1 Technical University of Munich, 2 New York University. Correspondence to: Thomas Frerix <[email protected]>, Joan Bruna <[email protected]>.

Abstract

We analyze effective approximation of unitary matrices. In our formulation, a unitary matrix is represented as a product of rotations in two-dimensional subspaces, so-called Givens rotations. Instead of the quadratic dimension dependence when applying a dense matrix, applying such an approximation scales with the number of factors, each of which can be implemented efficiently. Consequently, in settings where an approximation is once computed and then applied many times, such a representation becomes advantageous. Although effective Givens factorization is not possible for generic unitary operators, we show that minimizing a sparsity-inducing objective with a coordinate descent algorithm on the unitary group yields good factorizations for structured matrices. Canonical applications of such a setup are orthogonal basis transforms. We demonstrate numerical results of approximating the graph Fourier transform, which is the matrix obtained when diagonalizing a graph Laplacian.

1. Introduction

Unitary operators are ubiquitous in many areas, from numerical linear algebra to quantum computing and cryptography. Celebrated applications include the QR-decomposition and the diagonalization of symmetric matrices (Golub & Van Der Vorst, 2000). Without any assumptions on the structure of the matrix, applying a unitary transformation in $d$ dimensions requires $O(d^2)$ operations for the matrix-vector product. In scenarios where a given unitary operator needs to be applied many times, using approximations that trade-off accuracy with a better scaling behavior in the dimension is desirable. In this paper, we develop a method to compute approximations of unitary matrices in the form of Givens factorization (Givens, 1958). Givens rotations are localized in a two-dimensional subspace of predefined coordinates. Therefore, computations with Givens sequences scale with the number of factors and the computational cost for applying each factor can be kept low since efficient implementations are possible (Golub & Van Loan, 2012).

Our main motivation comes from the success story of the Fast Fourier transform (FFT) (Cooley & Tukey, 1965), which brought down the computational cost of applying a Fourier transform to $O(d \log(d))$ operations. This reduction led to a revolution in signal processing and was recognized by Dongarra & Sullivan (2000) as one of the most important algorithms of the 20th century. However, this speed-up relies on the fact that the classical Fourier transform is defined over a periodic grid, which provides many symmetries leveraged in the butterfly structure of the FFT.

These symmetries do not carry over to unstructured domains such as graphs and general unitary operators. In fact, using simple covering bounds, we show that generic unitary matrices require $O(d^2 / \log d)$ Givens factors to be effectively approximated. However, the question of approximating with fewer factors in the presence of structure remains open: given an element $U \in U(d)$, how to produce the best $K$-term sequence of Givens rotations $G_1 \ldots G_K$ that minimizes $\| U - \prod_j G_j \|$?

Due to the combinatorial nature of selecting Givens subspaces, this is an NP-hard optimization problem. In this paper, we propose a relaxation based on sparsity-inducing norms over the unitary group. In essence, given a point $U \in U(d)$, we use the gradient flow of a potential function $f : U(d) \to \mathbb{R}$ to define a path that links $U$ to its nearest signed permutation matrix, the sparsest elements of the group and thus the global minimizers of $f$. Then, our algorithm tries to approximately follow this path using coordinate descent with the Givens factors acting as generators of the group.

We validate our algorithm on a family of structured orthogonal operators, constructed with a planted random sequence of $K$ Givens factors and demonstrate that effective approximation is possible in the regime $K = O(d \log d)$. Finally, we apply our algorithm to approximate a graph Fourier transform (GFT), the orthogonal matrix obtained when diagonalizing a graph Laplacian.
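To make the cost argument concrete, here is a minimal sketch of applying a Givens sequence to a vector. It is not the authors' implementation: the function names `apply_givens` and `apply_sequence` are illustrative, indices are zero-based, and the sign convention is assumed to match the matrix $G(i, j, \alpha)$ defined in Section 3. Each factor touches only two coordinates, so $N$ factors cost $O(N)$ operations instead of the $O(d^2)$ of a dense matrix-vector product.

```python
import math


def apply_givens(x, i, j, alpha):
    """Apply G(i, j, alpha) to the list x in place; only x[i] and x[j] change."""
    c, s = math.cos(alpha), math.sin(alpha)
    xi, xj = x[i], x[j]
    x[i] = c * xi + s * xj    # row i of G: cos at (i, i), sin at (i, j)
    x[j] = -s * xi + c * xj   # row j of G: -sin at (j, i), cos at (j, j)


def apply_sequence(factors, x):
    """Return (G_1 G_2 ... G_N) x for factors = [(i, j, alpha), ...]."""
    y = list(x)
    # The rightmost factor of the product acts on the vector first.
    for i, j, alpha in reversed(factors):
        apply_givens(y, i, j, alpha)
    return y
```

Since every factor is orthogonal, the Euclidean norm of the result should equal that of the input up to rounding, which gives a cheap sanity check for a stored factor list.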

For ease of exposition, we restrict our discussion to approximating orthogonal group elements. However, this does not impose a restriction on the outlined approaches, as they equally apply to the complex unitary group as well as the real orthogonal group.

2. Related Work

Givens rotations were introduced by (Givens, 1958) to factorize the unitary matrix that transforms a square matrix into triangular form. The elementary operation of rotating in a two-dimensional subspace led to numerous successful applications in numerical linear algebra (Golub & Van Loan, 2012), in particular, for eigenvalue problems (Golub & Van Der Vorst, 2000). In this context, a Givens sequence factorizes a unitary basis transform, which is an operation of paramount importance to signal processing.

In contrast to signal processing on a Euclidean domain, recently there has been increased interest in signal processing on irregular domains such as graphs (Shuman et al., 2013; Bronstein et al., 2017). In this setting, Magoarou et al. (2018) considered a truncated version of the classical Jacobi algorithm (Jacobi, 1846) to approximate the orthogonal matrix that diagonalizes a graph Laplacian. Other notable strategies to efficiently approximate large matrices with presumed structure include multiresolution analysis (Kondor et al., 2014) and sparsity (Kyng & Sachdeva, 2016).

In quantum computation, approximate representation of unitary operators is a fundamental problem. Here, a unitary operation that performs a computation on a quantum state needs to be represented by or approximated with few elementary single- and two-qubit gates, ideally polynomial in the number of qubits. In the literature of quantum computing, a Givens rotation is commonly referred to as a two-level unitary matrix; a generic $n$-qubit unitary operator can be factorized in such two-level matrices with $O(4^n)$ elementary quantum gates (Vartiainen et al., 2004).

An alternative viewpoint on Givens sequences was analyzed by Shalit & Chechik (2014). The authors considered manifold coordinate descent over the orthogonal group as sequentially applying Givens factors. Consequently, the minimizing sequence of this algorithm yields a Givens factorization of the initial orthogonal matrix.

In this work, we analyze information theoretic properties of approximating unitary matrices via Givens factorization. We then propose to minimize a sparsity-inducing objective via manifold coordinate descent in a regime where effective approximation is possible. Subsequently, we apply this approach to approximate the graph Fourier transform and demonstrate that the proposed method can find better sequences compared to a truncated Jacobi algorithm. This allows to efficiently transform a graph signal into the graph's approximate Fourier basis, an essential operation in graph signal processing.

3. Givens Factorization and Elimination

Givens matrices represent rotations in a two-dimensional subspace, while leaving all other dimensions invariant (Givens, 1958; Golub & Van Loan, 2012). Such a counter-clockwise rotation in the $(i, j)$-plane by an angle $\alpha$ can be written as applying $G^T(i, j, \alpha)$, where

\[
G(i, j, \alpha) =
\begin{pmatrix}
1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & & \vdots & & \vdots \\
0 & \cdots & \cos(\alpha) & \cdots & \sin(\alpha) & \cdots & 0 \\
\vdots & & \vdots & \ddots & \vdots & & \vdots \\
0 & \cdots & -\sin(\alpha) & \cdots & \cos(\alpha) & \cdots & 0 \\
\vdots & & \vdots & & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & \cdots & 0 & \cdots & 1
\end{pmatrix}
\tag{1}
\]

The trigonometric expressions appear in the $i$-th and $j$-th rows and columns. Any orthogonal matrix $U \in \mathbb{R}^{d \times d}$ that is a rotation, $U \in SO(d)$, can be decomposed into a product of at most $d(d-1)/2$ Givens rotations. In general, there exist many possible factorizations. If $U \in O(d) \setminus SO(d)$, then it cannot be represented directly by a sequence of Givens rotations. However, a factorization can be obtained up to permutation with a negative sign, e.g., by flipping two columns.

In numerical linear algebra, Givens factors are often used to selectively introduce zero matrix entries by controlling the rotation angle. This leads to a constructive factorization algorithm, which demonstrates a $d(d-1)/2$-factorization. To this end, we start with the matrix $U \in SO(d)$ and introduce zeros on the lower diagonal column-wise from left to right and bottom to top within every column. This is achieved by choosing the rotation subspace $(i, j)$ and a suitable rotation angle to zero-out the matrix element $(i, j)$. The elimination order is illustrated for $d = 4$ by

\[
\begin{pmatrix}
* & * & * & * \\
3 & * & * & * \\
2 & 5 & * & * \\
1 & 4 & 6 & *
\end{pmatrix}
\tag{2}
\]

After $N = d(d-1)/2$ steps, we have $G_N^T \ldots G_1^T U = D$, where $D$ is a diagonal matrix with $D_{kk} = -1$ for an even number of values and $D_{kk} = 1$ otherwise. This result can be reduced to the identity by selecting two subspaces with values $D_{ii} = D_{jj} = -1$ and applying a rotation by an angle $\alpha = \pi$. We refer to this algorithm by structured elimination.

Apart from this sign ambiguity, we consider factorizations in the broader sense up to signed permutation of the resulting matrix columns. To be explicit, the set of signed permutation matrices is defined as $\mathcal{P}_d := \{ P \in \mathbb{R}^{d \times d} \mid P_{ij} \in \{-1, 0, 1\},\ \sum_i |P_{ij}| = 1\ \forall j,\ \sum_j |P_{ij}| = 1\ \forall i \}$. For a matrix $U \in O(d)$, to measure approximation quality, we denote an approximation by $\hat{U}$ and use a symmetrized Frobenius norm criterion up to a signed permutation matrix as follows:

\[
\| U - \hat{U} \|_{F,\mathrm{sym}} := \min_{P \in \mathcal{P}_d} \| U - \hat{U} P \|_F . \tag{3}
\]

The range of (3) over the orthogonal group is $[0, \sqrt{2d})$ as the maximum is obtained for the distance between Hadamard (1) matrices $H(d)$ and the identity with $\| H(d) - I \|_{F,\mathrm{sym}} / \sqrt{d} \to \sqrt{2}$ as $d \to \infty$. Since $\| A \|_F^2 = \mathbb{E}_{x \sim \mathcal{N}(0, I)} \left[ \| A x \|_2^2 \right]$, the criterion measures the average approximation quality over random Gaussian vectors when applying $\hat{U}$ instead of $U$.

(1) A Hadamard matrix is an orthogonal matrix $H$ whose entries satisfy $|H_{i,j}| = 1/\sqrt{d}$ for all $i, j$.

The motivation for this definition is twofold. First, this definition allows us to discuss Givens factorizations of orthogonal matrices with negative determinant and henceforth we consider factorization over the orthogonal group $O(d)$ rather than the special orthogonal group $SO(d)$. Second, it enlarges the class of possible factorization algorithms to those that cannot distinguish between signed permutation matrices. Observe that since the cost of multiplying by a signed permutation matrix is $O(d)$ (Knuth, 1998), the computational efficiency arguments in this paper are not affected by the permutation equivalence class as we are discussing approximations in the regime of $O(d \log(d))$ factors.

4. Information Theoretic Rate of Givens Representation

The elimination algorithm discussed in Section 3 guarantees to factorize any orthogonal matrix in at most $d(d-1)/2$ Givens factors, which corresponds to the dimension of the orthogonal group. Since each Givens factor is parametrized by a single angle, it immediately follows that exact Givens factorization for arbitrary elements $U \in O(d)$ necessarily requires $d(d-1)/2$ factors.

Hence, this leads to the question of approximate factorization: if one tolerates a certain error $\| U - \hat{U} \|_F \le \epsilon$, is it possible to find approximations $\hat{U} = \prod_{n \le N} G_n$ with $N = o(d^2)$, ideally with $N = O(d \log d)$? A covering argument shows that generic orthogonal matrices in $d$ dimensions require at least $\Theta(d^2 / \log(d))$ Givens factors to achieve an $\epsilon$-approximate factorization. We denote by $\mu$ the uniform Haar measure on the unitary group, which we normalize for each $d$, $\mu(U(d)) = 1$. For notational simplicity, we carry out the proof for the operator 2-norm. An analogous argument holds by replacing the operator 2-norm with the Frobenius norm while re-scaling the error by $\sqrt{d}$.

Lemma 1. Let $\prod_{n \le N} G_n$ be a product of Givens factors with rotation angles $\alpha_n$ and $\bar{G}_n$ be the respective perturbed factors with rotation angles $\alpha_n + \delta_n$ and perturbations $0 \le \delta_n \le \delta$. Then,

\[
\Big\| \prod_{n \le N} \bar{G}_n - \prod_{n \le N} G_n \Big\|_F \le 2 N \delta . \tag{4}
\]

Proof. For any orthogonal matrices $U, U', V, V'$, we have

\[
\| U' V' - U V \|_F = \| (U + U' - U) V' - U V \|_F
\le \| U (V' - V) \|_F + \| (U' - U) V' \|_F
= \| V' - V \|_F + \| U' - U \|_F , \tag{5}
\]

by using the fact that the Frobenius norm is invariant to orthogonal matrix multiplication. By iterating this relation, we obtain

\[
\Big\| \prod_{n \le N} \bar{G}_n - \prod_{n \le N} G_n \Big\|_F \le \sum_{n \le N} \| \bar{G}_n - G_n \|_F . \tag{6}
\]

Since $\bar{G}_n$ and $G_n$ rotate in the same subspace,

\[
\| \bar{G}_n - G_n \|_F = 2 \sqrt{1 - \cos(\delta_n)} . \tag{7}
\]

Inequality (4) follows from $\sqrt{1 - \cos(\delta_n)} \le \delta_n \le \delta$.

Theorem 1. Let $\epsilon > 0$. If $N = o(d^2 / \log(d))$, then as $d \to \infty$,

\[
\mu \Big( \Big\{ U \in U(d) \ \Big|\ \inf_{G_1 \ldots G_N} \Big\| U - \prod_n G_n \Big\|_2 \le \epsilon \Big\} \Big) \to 0 .
\]

Proof. Consider an $\epsilon$-covering of the unitary group, i.e., a discrete set $\mathcal{X}$ such that $\inf_{X \in \mathcal{X}} \| U - X \|_2 \le \epsilon$ for all $U \in U(d)$. Since the manifold dimension of the unitary group is $d(d-1)/2$, we need $|\mathcal{X}| = \Theta(\epsilon^{-d(d-1)/2})$ many balls for that cover. Let $N := N(d)$ be the number of available Givens factors for approximation at dimension $d$, and $\mathcal{A}_N = \{ X \in U(d) \mid \inf_{G_1 \ldots G_N} \| X - \prod_{n \le N} G_n \|_2 \le \epsilon / 2 \}$ denote the set of unitary operators which can be effectively approximated with $N$ Givens terms. Now, suppose that $\mu(\mathcal{A}_N) \ge c > 0$, i.e., the set of group elements admitting an $\epsilon/2$-approximation has positive measure. This implies that any $\epsilon$-cover of $\mathcal{A}_N$ must be of size $\Theta(\epsilon^{-d(d-1)/2})$. Let us build such an $\epsilon$-cover.

If we discretize the rotation angle to a value $\delta > 0$, then there are $(d(d-1)/2\delta)$ many different quantized Givens factors, denoted by $\bar{G}_i$, and consequently $(d(d-1)/2\delta)^N$

many different sequences. It follows that if $\delta := \epsilon / (4N)$, the discrete set $\mathcal{Y} = \{ \prod_{n \le N} \bar{G}_{i_n} \}$ containing all possible sequences of length $N$ of quantized Givens rotations is an $\epsilon$-cover of $\mathcal{A}_N$. Indeed, by using Lemma 1 and the fact that the operator 2-norm is bounded by the Frobenius norm, we have $\forall X \in \mathcal{A}_N$,

\[
\Big\| X - \prod_{n \le N} \bar{G}_n \Big\|_2 \le \Big\| X - \prod_{n \le N} G_n \Big\|_2 + 2 N \delta \le \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon .
\]

Since $|\mathcal{Y}| = \left( \frac{2 d (d-1) N}{\epsilon} \right)^N$, it follows that

\[
\left( \frac{2 d (d-1) N}{\epsilon} \right)^N = \Theta(\epsilon^{-d(d-1)/2}) ,
\]

which implies $N = O(d^2 / \log d)$.

An immediate consequence of Theorem 1 is that generic effective approximation, i.e., with a number of factors $N = O(d \log d)$, is information theoretically impossible. However, the situation may be entirely different for structured distributions of unitary operators. For that purpose, we develop an algorithm to obtain effective approximations based on sparsity-inducing norms.

5. Givens Factorization and Coordinate Descent on O(d)

In this section, we offer an alternative viewpoint presented by Shalit & Chechik (2014) that interprets Givens factorization as manifold coordinate descent on the orthogonal group over a certain potential energy.

The orthogonal group $O(d)$ is a matrix Lie group with associated Lie algebra $\mathfrak{o}(d) = \mathrm{Skew}(d) = \{ X \in \mathbb{R}^{d \times d} \mid X = -X^T \}$, the set of $d \times d$ skew-symmetric matrices (Hall, 2003). The tangent space at an element $U$ is $T_U O(d) = \{ X U \mid X \in \mathrm{Skew}(d) \}$ and the Riemannian directional derivative of a differentiable function $f$ in the direction $X U \in T_U O(d)$ is given by

\[
D_X f(U) = \frac{d}{d\alpha} f(\mathrm{Exp}(\alpha X) U) \Big|_{\alpha = 0} , \tag{8}
\]

where $\mathrm{Exp} : \mathfrak{o}(d) \to O(d)$ is the matrix exponential. If we choose the basis $\{ X_{ij} = e_i e_j^T - e_j e_i^T \mid 1 \le i \le j \le d \}$ for the tangent space, then $D_{X_{ij}} f(U)$ represents the directional derivative in such a coordinate direction. A coordinate descent algorithm uses a criterion to choose coordinates $(i, j)$ and a step size (rotation angle) $\alpha$ to iteratively update

\[
U^{k+1} = \mathrm{Exp}(-\alpha X_{ij}) U^k . \tag{9}
\]

A greedy criterion determines the best descent on $f$ by a search over all possible coordinate directions $\{ X_{ij} \}_{i \le j \le d}$ with the optimal step size obtained by a line search.

A Givens factor can be interpreted as a coordinate descent step over the orthogonal group. This follows from the relation

\[
\mathrm{Exp}(-\alpha X_{ij}) = G^T(i, j, \alpha) . \tag{10}
\]

In $d = 3$, an explicit example of the correspondence between Lie algebra and Lie group elements is

\[
\begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & -\alpha \\ 0 & \alpha & 0 \end{pmatrix}
\longrightarrow
\begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos(\alpha) & -\sin(\alpha) \\ 0 & \sin(\alpha) & \cos(\alpha) \end{pmatrix} . \tag{11}
\]

Suppose we want to minimize a function $f$ over the orthogonal group,

\[
\min_{U \in O(d)} f(U) . \tag{12}
\]

Then minimizing (12) with manifold coordinate descent iterations (9) yields a Givens factorization of the initial point $U^0$. A truncated sequence leads to an approximate factorization. From this viewpoint, the quality of a Givens factorization can be controlled by properties of the function $f$. In the following, we construct an objective function that results in approximate factorization with less than $O(d^2)$ factors.

6. Sparsity-Inducing Dynamics

To factorize a matrix $U \in O(d)$ one may choose it as an initial value to problem (12) when minimizing a suitable potential function $f$ with manifold coordinate descent. We want to find a factorization up to signed permutation of the matrix columns. As the signed permutation matrices are the sparsest orthogonal matrices, we consider an energy function that quickly enforces sparsity, the element-wise $L_1$-norm of a matrix,

\[
f(U) := d^{-1} \| U \|_1 = d^{-1} \sum_{i,j=1}^{d} | U_{ij} | . \tag{13}
\]

Although $f$ is convex in $\mathbb{R}^{d^2}$ (since it is a norm), due to the non-convexity of the domain, the problem $\min_{U \in O(d)} f(U)$ is non-convex. The landscape of $f$ characterizes the class of orthogonal matrices that admit effective Givens approximation. It is easy to see that the global minima of $f$ in $O(d)$ consist of signed permutation matrices, with $\min f(U) = 1$, and the global maxima are located at Hadamard matrices, with $\max f(U) = \sqrt{d}$. A more involved question concerning the presence or absence of spurious local minima of $f$ is of interest. The following proposition partially addresses this question by showing that critical points of $f$ are necessarily located at $U \in O(d)$ with some of its entries set to zero.

Proposition 1. Let $x \in \mathbb{R}^{2 \times d}$ and let

\[
R(\alpha) := \begin{pmatrix} \cos(\alpha) & -\sin(\alpha) \\ \sin(\alpha) & \cos(\alpha) \end{pmatrix} \tag{14}
\]

be a counter-clockwise rotation in the plane by an angle $\alpha$. Consider the function $g(\alpha) = \| R(\alpha) x \|_1$. Then, at every local minimum $\alpha^*$ of $g$ there exist indices $k, l$ such that $(R(\alpha^*) x)_{kl} = 0$.

Proof. We show equivalently that any stationary point $\alpha^*$ with $(R(\alpha^*) x)_{kl} \ne 0\ \forall k, l$ is a local maximum. At any such point the function $g$ is twice continuously differentiable and the second derivative is

\[
\frac{\partial^2 g}{\partial \alpha^2} \Big|_{\alpha = \alpha^*} = -g(\alpha^*) < 0 . \tag{15}
\]

Consequently, any stationary point under this assumption must be a local maximum.

Proposition 1 implies that for a given subspace $(i, j)$, the best rotation angle can be found by checking all axis transitions for the 2D points $(u_{ik}, u_{jk})$, $k \in \{1, \ldots, d\}$ and selecting the angle that most minimizes the objective among them. It also implies that any local minimum of $f$ must correspond to an orthogonal matrix with at least $d$ zeros placed at specific entries, such that no two rows or columns have the same support. Indeed, Proposition 1 implies that there exists a continuous path $t \mapsto U(t) = G(i, j, \alpha(t))$ with $\alpha(0) = 0$, generated by a Givens rotation of angle $\alpha(t)$, such that $f(U(t))$ is non-increasing at $t = 0$, provided one can find two rows or columns of $U$ with the same support. However, this result does not exclude the possibility that $f$ has spurious local minima at matrices $U$ with the above special sparsity pattern. In fact, we conjecture that the landscape of $f$ does have spurious local minima.

A manifold coordinate descent on the objective function $f$ is explicitly stated in Algorithm 1. The crucial step involves optimizing this objective in the rotation angle $\alpha$ for a given subspace $(i, j)$, which is a non-convex optimization problem. Nevertheless, the global optimum can be found as stated by Proposition 1. In $d$ dimensions, this step requires $d$ operations. Consequently, due to the squared dimension dependence of the double for-loop, a naive implementation of Algorithm 1 would require $O(d^3)$ operations. However, applying the selected Givens factor in each step changes only two rows of the matrix; thus, in the subsequent iteration, only those pairs of rows that involve the previously modified ones need to be re-computed. These are $O(d)$ rows and altogether the runtime of an iteration is $O(d^2)$.

Algorithm 1 Coordinate descent on the $L_1$-criterion
  Input: initial value $U^0 \in O(d)$, $f(U) = \| U \|_1$
  repeat
    for $i = 1$ to $d$ do
      for $j = 1$ to $d$ do
        if $\alpha^*_{ij}$ not up-to-date then
          $\alpha^*_{ij} = \arg\min_\alpha f(G^T(i, j, \alpha) U^k)$
        end if
      end for
    end for
    $i^*, j^* = \arg\min_{i,j} f(G^T(i, j, \alpha^*_{ij}) U^k)$
    $U^{k+1} = G^T(i^*, j^*, \alpha^*_{i^* j^*}) U^k$
  until $\| U^{k+1} - I \|_{F,\mathrm{sym}} < \varepsilon$ or maxIter is reached

7. Numerical Experiments

7.1. Planted Models

Theorem 1 shows that we cannot expect to find good approximations to Haar-sampled matrices with less than $O(d^2 / \log(d))$ Givens factors. Therefore, we focus on a distribution for which we can control approximability. We use the uniform distribution over the set $\{ U \in SO(d) \mid U = G_1 \cdots G_K,\ G_k = G(i_k, j_k, \alpha_k) \}$, where each $G_k$ is obtained by first sampling a subspace uniformly at random (with replacement), and then sampling the corresponding angle uniformly from $(0, 2\pi)$. We denote the resulting distribution by the $K$-planted distribution $\nu_K$. While this distribution may be sparse in the number of Givens factors for $K \ll d(d-1)/2$, this does not imply that the resulting matrices are sparse. In fact, products of Givens matrices become dense quickly. It follows from the Coupon Collector's Lemma that matrices generated with $\Theta(d \log_2(d))$ Givens factors are already dense with high probability. To visualize this effect, Figure 1 shows the $L_0$-norm as a function of planted Givens factors.

We compare the following factorization algorithms. A greedy baseline iteratively finds the Givens factor that most minimizes the objective (3). The structured elimination algorithm described in Section 3 yields a sequence of Givens factors that eliminate matrix entries in the order (2) and is guaranteed to find a perfect factorization with $d(d-1)/2$ factors. Our sparsity-inducing algorithm minimizes the $L_1$-criterion (13) via a manifold coordinate descent scheme. (2)

In an initial experiment, we demonstrate the approximation effectiveness of these algorithms; the results are shown in Figure 2. They indicate that minimizing the $L_1$-criterion improves over directly minimizing the Frobenius norm (greedy baseline). Next, we analyze the approximability

(2) An implementation of these algorithms can be found at https://github.com/tfrerix/givens-factorization

[Figure 1: plot of $\| U \|_0 / d^2$ against $K / (d \log_2(d))$ for $d \in \{256, 512, 1024\}$.]

Figure 1. Average sparsity based on 100 samples of matrices drawn from the $K$-planted distribution over $SO(d)$ for increasing $K$. Standard deviation is negligible and not shown. Matrices become dense quickly as the number of planted Givens factors grows. In particular, matrices sampled from the $d \log_2(d)$-planted distribution are already dense.

[Figure 2: plot of $\| U - \hat{U} \|_{F,\mathrm{sym}} / \sqrt{d}$ against $N / (d \log_2(d))$ for the $L_1$ algorithm, structured elimination, and the greedy baseline.]

Figure 2. Average Frobenius norm approximation error in $d = 1024$ dimensions when factorizing 10 samples drawn from the $d \log_2(d)$-planted distribution over $SO(d)$ with $d(d-1)/2$ factors. Shaded area denotes standard deviation.
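The density measurement summarized in Figure 1 can be reproduced in miniature with the sketch below. It is an assumption-laden illustration, not the experiment code: `sample_planted` and `density` are hypothetical names, the matrix is a list of rows, and the factors are accumulated by left-multiplication, which yields the same distribution as $G_1 \cdots G_K$ only because the planted factors are independent and identically distributed.

```python
import math
import random


def sample_planted(d, K, rng):
    """Draw a d x d matrix as a product of K random Givens factors."""
    U = [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]
    for _ in range(K):
        i, j = rng.sample(range(d), 2)           # two distinct coordinates
        alpha = rng.uniform(0.0, 2.0 * math.pi)  # rotation angle in (0, 2*pi)
        c, s = math.cos(alpha), math.sin(alpha)
        # Left-multiplying by G(i, j, alpha) mixes only rows i and j.
        U[i], U[j] = ([c * a + s * b for a, b in zip(U[i], U[j])],
                      [-s * a + c * b for a, b in zip(U[i], U[j])])
    return U


def density(U, tol=1e-12):
    """Fraction of entries with magnitude above tol, i.e. ||U||_0 / d^2."""
    d = len(U)
    return sum(abs(v) > tol for row in U for v in row) / (d * d)
```

For a quick check one might create `rng = random.Random(0)` and evaluate `density(sample_planted(32, K, rng))` for increasing `K`; the value starts at $1/d$ for $K = 0$ and is expected to approach 1 as $K$ nears $d \log_2(d)$.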

of samples drawn from the K-planted distribution νK as a function of K. To obtain a Givens sequence, we factorize these samples with manifold coordinate descent on the K = (d log d). The mathematical analysis of our co- L1-objective (13). Along the optimization path, we deﬁne ordinateO descent algorithm in the regime where effective Nǫ(U) as the number of Givens factors for which the nor- approximationis feasible is beyond the scope of the present malized approximation error (3) is smaller than ǫ = 0.1, paper. In particular, proving that Nǫ = (d log d) is sufﬁ- i.e., cient when K . d log d remains an openO question.

U G1 ...GN N (U) := min N k − kF,sym <ǫ (16) 7.2. Application: Graph Fourier Transform ǫ √ ( d ) The method introduced in this paper is useful in situations

We refer to a Givens sequence with such Nǫ(U) factors as where one at first computes an approximation to a uni- an ǫ-factorizing sequence of U. In Figure 3, the sample tary operator, which is subsequently applied many times. −1 n average Nǫ = n i=1 Nǫ(Ui) for n = 10 samples is Hence, the trade-off between initial computation and ap- shown as a function of K. We are interested in the rate at proximation on the one hand and efficient application on P which Nǫ grows for increasing K. The data in Figure 3 the other hand is in favor of the latter. Canonical examples show that for K = αd log2(d) and Nǫ = βd log2(d), the for this scenario are orthogonal basis transforms. In this ratio β/α is not independentof d. For the shown dimension paper, we draw motivation from the FFT, which yields a regime this implies that for K = d log(d) , Nǫ grows speed-up of applying a Fourier transformation over a regu- polynomial in d, albeit with smallO rate for few planted fac- lar grid domain from (d2) to (d log(d)) time complex- tors. To make this relation more precise, we extract the ity (Cooley & Tukey, O1965). However,O these speed-ups do η exponent η of a model Nǫ d . Figure 4 shows that the not carry over when the domain is unstructured, such as growth is slightly superlinear∼ in the few-factor regime and general graphs. Here, we compute an effective approxima- becomes quadratic towards K = d log2(d). Analytically tion of the graph Fourier transformation (GFT). Consider characterizing such growth is left for future work. a simple, undirected graph with degree matrix D and ad- jacency matrix A. The unnormalized graph Laplacian is That said, our initial results suggest the existence of a defined as L := D A, which is a positive semi-definite, computational-to-statistical gap for the recovery (or detec- symmetric matrix. The− GFT is represented by the orthogo- tion) of sparse planted Givens factors. Indeed, Theorem nal matrix that diagonalizes L. 
1 proves that recovery with K = d2/ log d planted factors is information-theoreticallyO possible, whereas A baseline for our method is the Jacobi algorithm (Jacobi, our greedy recovery strategy is only effective for 1846), which diagonalizes a symmetric matrix L by greed- Approximating Orthogonal Matrices with Effective Givens Factorization

2 30

1 25 1.8 .75 ) d (

2 20 .5 η log .25 1.6

/d 15 ǫ 0 N 10 1.4

5 1.2 0 .1 .15 .2 .25 .3 .35 .4 .45 .5 .75 1 .1 .15 .2 .25 .3 .35 .4 .45 .5 .75 1

K/d log2(d) K/d log2(d) 256 512 1024

η Figure 4. Polynomial growth rate η of the model Nǫ d as a ∼ Figure 3. Average number of Givens factors necessary to factorize function of the number of planted factors estimated from d ∈ a K-planted matrix in d 256, 512, 1024 dimensions up to 256, 512, 1024 . Note that the x-axis is shown with unequal ∈ { } { } desired accuracy as a function of K. Here, ǫ = 0.1 is the accuracy spacing to highlight the relevant regime of the data. as defined in expression (16). Note that the x-axis is shown with unequal spacing to highlight the relevant regime of the data. The inset plot shows a zoom of the first data points. Table 1. Construction of Barabási-Albert graphs. An n-vertex graph is constructed by choosing n0 = mk initial vertices, then adding vertices and connecting them to m of already existing ily minimizing the off-diagonal squared Frobenius norm, k ones with a probability proportional to the degree of these ver- d tices. mk is chosen such that the number of resulting edges is off(L) := L 2 L2 . (17) approximatly k 0.25n(n 1)/2. k kF − kk · − Xk=1 This is achieved by zeroing-out the largest matrix element n 64 128 256 512 1024 in absolute value at every iteration. To this end, a Givens m1 54 109 218 437 874 matrix similarity transformation with a suitably chosen ro- m2 36 69 136 267 528 tation subspace and rotation angle is applied. However, the Jacobi algorithm does not guarantee factorization in a fi- nite number of steps; in particular, it may take more than The Barabási-Albert model starts with n unconnected ver- N = d(d 1)/2 iterations. In fact, the algorithm converges 0 tices and iteratively adds vertices to the graph, which are linearly (Golub− & Van Loan, 2012), connected to a number m of already existing ones with 1 a probability proportional to the degree of these vertices. off(Lk+1) 1 off(Lk) . 
(18) ≤ − N This construction is known as preferential attachment and induces a scale-free degree distribution found in real world If the iteration number k is large enough, quadratic conver- graphs (Barabási & Albert, 1999). The details of gener- gence was shown by Schönhage (1964). Hence, the method ating these graphs are described in Table 1. We approx- is ineffective for small iteration numbers and in high di- imate the corresponding graph Laplacians with n log (n) mensions. A truncated version of this algorithm was used 2 factors leading to the results shown in Figure 5. While our by Magoarou et al. (2018) to obtain an approximation to sparsity-inducing algorithm yields better factorizations in the GFT. The objective (17) of the Jacobi method is mo- most cases, there exist scenarios, where the greedy base- tivated by approximating the spectrum of the symmetric line results in better approximations (d 512, 2014 for matrix through the Gershgorin circle theorem (Gershgorin, 0.25n log (n) edges). Finally, we demonstrate∈ { approx-} 1931). However, we argue here that a criterion focused 2 imate∼ factorization of the graph Laplacian of various real on approximating the eigenbasis of the symmetric matrix world graphs listed in Table 2. Our L -algorithm yiels the directly yields a more effective approximation to this or- 1 best factorization for the Minnesota, HumanProtein, and thogonal basis transformation. We consider the eigende- EMail graphs, while the greedy baseline algorithm is supe- composition L = UΛU T and compute an approximation rior for the Facebook graph. of the orthogonal matrix U with the algorithms outlined in Section 7.1. We demonstrate this procedure on Barabási- A simple strategy to improve the performance of our L1 Albert random graphs and several real world graphs. greedy method with mild computational overhead is to per- Approximating Orthogonal Matrices with Effective Givens Factorization

1.3 1.2 1.2 1 d d 1.1 √ √ / 0.8 / 1 sym sym F, F, || || ˆ ˆ 0.6 U U 0.9 − − U U || || 0.4 0.8

0.2 0.7

0.6 0 64 128 256 512 1024 Minnesota HumanProtein EMail Facebook n

L1 Jacobi greedy baseline L1 Jacobi greedy baseline

Figure 5. Approximate factorization of the graph Laplacian of n-vertex Barabási-Albert graphs with $n \log_2(n)$ factors. Data points are averages of 10 samples, vertical lines denote standard deviation. The solid (–) lines show factorizations of graphs with $\sim 0.5\,n(n-1)/2$ edges, while the dashed (- -) lines show factorizations of graphs with $\sim 0.25\,n(n-1)/2$ edges.

Figure 6. Approximate factorization of the graph Laplacian of various n-vertex real world graphs with $n \log_2(n)$ factors.

Table 2. GFT approximation for real world graphs with $n$ vertices and $n_e$ edges.

| Graph | Source | $n$ | $n_e$ |
|---|---|---|---|
| MINNESOTA | (Defferrard et al.) | 2642 | 3304 |
| HUMANPROTEIN | (Rual et al., 2005) | 3133 | 6726 |
| EMAIL | (Guimerà et al., 2003) | 1133 | 5451 |
| FACEBOOK | (McAuley & Leskovec, 2012) | 2888 | 2981 |

form beam-search, which is beyond the scope of this paper. Overall, it remains an open question to more closely characterize the graphs for which our sparsity-inducing algorithm yields effective approximations of the GFT.

8. Discussion

We analyzed the problem of approximating orthogonal matrices with few Givens factors. While a perfect factorization with $O(d^2)$ factors is always possible, an approximation with fewer factors is advantageous if the orthogonal matrix is applied many times. We showed that effective Givens factorization of generic orthogonal matrices is impossible and inspected a distribution of planted factors, which allows us to control approximability. Our initial results suggest that sparsity-inducing factorization is promising beyond the sparse matrix regime. However, it remains an open problem to further characterize the matrices that admit effective factorization using manifold coordinate descent on an L1-criterion.

This work opens up questions we believe are important both from a theoretical and an applied perspective. On the theory side, important problems arising from our analysis are: (i) a complete description of the landscape of $f(U) = \|U\|_1$ over the orthogonal and unitary groups, (ii) a precise classification of the detection threshold $K(d)$ below which it is possible to discriminate a K-planted sample from a Haar sample in polynomial time, and (iii) a guarantee that the proposed sparse Givens coordinate descent algorithm requires $N = \Theta(d \log d)$ terms for $K \le C d \log d$ for some constant $C > 0$. These questions suggest a learning approach whereby our sparsity-promoting potential $f$ would be replaced by a classifier $f_\theta$ trained to discriminate between K-planted and Haar distributions.

From an applied perspective, the method allows one to approximately invert a time-varying symmetric linear operator $H(t)$. Similar to the Woodbury formula for low-rank updates of an inverse, one could set up an approximate Givens factorization of the eigenbasis of $H(t_0)$, and update it efficiently at subsequent times. If successful, this could dramatically improve the efficiency of second-order optimization schemes, where $H(t)$ is the Hessian of a loss function.

Acknowledgements

The authors would like to thank Oded Regev for in-depth discussions and early feedback, as well as Kyle Cranmer and Lenka Zdeborová for fruitful discussions on the topic, and Yann LeCun for introducing us to the problem. This work was partially supported by NSF grant RI-IIS 1816753, NSF CAREER CIF 1845360, the Alfred P. Sloan Fellowship and Samsung Electronics.
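As an illustration of the applied perspective raised in the Discussion, the following is a minimal sketch of how a Givens factorization of an eigenbasis could be used to apply an inverse. It is not the paper's factorization algorithm: the rotations are planted exactly rather than computed by coordinate descent, so the solve is exact here and would only be approximate with an approximate factorization. The function names (`apply_givens`, `dense_from_givens`, `givens_solve`) and the `(i, j, theta)` representation of a rotation are assumptions made for this example.

```python
import numpy as np

def apply_givens(x, rotations, transpose=False):
    """Apply U = G_1 G_2 ... G_N (or U^T) to x in O(N) operations.

    Each rotation is a hypothetical triple (i, j, theta) acting on
    coordinates i and j as [[cos, -sin], [sin, cos]].
    """
    y = np.array(x, dtype=float)
    # U x applies the right-most factor G_N first;
    # U^T x = G_N^T ... G_1^T x applies G_1^T first.
    seq = rotations if transpose else reversed(rotations)
    for i, j, theta in seq:
        c, s = np.cos(theta), np.sin(theta)
        if transpose:
            s = -s  # the transpose of a rotation is the rotation by -theta
        yi, yj = y[i], y[j]
        y[i] = c * yi - s * yj
        y[j] = s * yi + c * yj
    return y

def dense_from_givens(rotations, d):
    """Build the dense d x d matrix U = G_1 G_2 ... G_N (for checking only)."""
    U = np.eye(d)
    for i, j, theta in rotations:
        G = np.eye(d)
        c, s = np.cos(theta), np.sin(theta)
        G[i, i], G[i, j], G[j, i], G[j, j] = c, -s, s, c
        U = U @ G
    return U

def givens_solve(rotations, eigvals, b):
    """Solve H x = b for H = U diag(eigvals) U^T using O(N + d) operations."""
    z = apply_givens(b, rotations, transpose=True)  # U^T b
    z = z / eigvals                                 # diag(eigvals)^{-1} U^T b
    return apply_givens(z, rotations)               # U diag(eigvals)^{-1} U^T b

# Usage: plant N random rotations, form H densely, and compare the two solves.
rng = np.random.default_rng(0)
d, N = 8, 24
rotations = []
for _ in range(N):
    i, j = rng.choice(d, size=2, replace=False)
    rotations.append((int(i), int(j), float(rng.uniform(0.0, 2.0 * np.pi))))
eigvals = rng.uniform(1.0, 2.0, size=d)
U = dense_from_givens(rotations, d)
H = U @ np.diag(eigvals) @ U.T
b = rng.standard_normal(d)
x = givens_solve(rotations, eigvals, b)
print("residual norm:", np.linalg.norm(H @ x - b))
```

The point of the sketch is the cost: once the factors are known, each solve touches only two coordinates per rotation, so it scales with the number of factors N rather than with d^2.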

References

Barabási, A.-L. and Albert, R. Emergence of Scaling in Random Networks. Science, 286(5439):509–512, 1999.

Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric Deep Learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

Cooley, J. W. and Tukey, J. W. An algorithm for the machine calculation of complex Fourier series. Math. Comp., 19:297–301, 1965.

Defferrard, M., Martin, L., Pena, R., and Perraudin, N. Pygsp: Graph signal processing in python.

Gershgorin, S. A. Über die Abgrenzung der Eigenwerte einer Matrix. Izv. Akad. Nauk. USSR Otd. Fiz.-Mat. Nauk, (6):749–754, 1931.

Givens, W. Computation of Plain Unitary Rotations Transforming a General Matrix to Triangular Form. Journal of the Society for Industrial and Applied Mathematics, 6(1):26–50, 1958.

Golub, G. H. and Van Der Vorst, H. A. Eigenvalue Computation in the 20th Century. Journal of Computational and Applied Mathematics, 123(1-2):35–65, 2000.

Golub, G. H. and Van Loan, C. F. Matrix computations. JHU Press, 4th edition, 2012.

Guimerà, R., Danon, L., Díaz-Guilera, A., Giralt, F., and Arenas, A. Self-similar Community Structure in a Network of Human Interactions. Phys. Rev. E, 68(6):065103, 2003. retrieved from http://konect.uni-koblenz.de.

Hall, B. Lie Groups, Lie Algebras, and Representations. Graduate Texts in Mathematics. Springer, 2003.

Jacobi, C. Über ein leichtes Verfahren die in der Theorie der Säcularstörungen vorkommenden Gleichungen numerisch aufzulösen. Journal für die reine und angewandte Mathematik, 30:51–94, 1846.

Knuth, D. E. The Art of Computer Programming, Volume 3: Sorting and Searching. 2nd edition, 1998. Ex 5.2-10, p.80.

Kondor, R., Teneva, N., and Garg, V. Multiresolution matrix factorization. In International Conference on Machine Learning (ICML 2014), pp. 1620–1628, 2014.

Kyng, R. and Sachdeva, S. Approximate Gaussian Elimination for Laplacians: Fast, Sparse, and Simple. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pp. 573–582, 2016.

Magoarou, L. L., Gribonval, R., and Tremblay, N. Approximate Fast Graph Fourier Transforms via Multilayer Sparse Approximations. IEEE Transactions on Signal and Information Processing over Networks, 4(2):407–420, 2018.

McAuley, J. and Leskovec, J. Learning to Discover Social Circles in Ego Networks. In Advances in Neural Information Processing Systems (NIPS 2012), pp. 548–556. 2012. retrieved from http://konect.uni-koblenz.de.

Rual, J.-F., Venkatesan, K., Hao, T., Hirozane-Kishikawa, T., Dricot, A., Li, N., Berriz, G. F., Gibbons, F. D., Dreze, M., and Ayivi-Guedehoussou, N. Towards a Proteome-scale Map of the Human Protein–Protein Interaction Network. Nature, (7062):1173–1178, 2005. retrieved from http://konect.uni-koblenz.de.

Schönhage, A. Zur quadratischen Konvergenz des Jacobi-Verfahrens. Numerische Mathematik, 6(1):410–412, 1964.

Shalit, U. and Chechik, G. Coordinate-Descent for Learning Orthogonal Matrices through Givens Rotations. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), 2014.

Shuman, D. I., Narang, S. K., Frossard, P., Ortega, A., and Vandergheynst, P. The Emerging Field of Signal Processing on Graphs: Extending High-Dimensional Data Analysis to Networks and other Irregular Domains. IEEE Signal Processing Magazine, 30(3):83–98, 2013.

Sullivan, F. and Dongarra, J. Guest Editors' Introduction: The Top 10 Algorithms. Computing in Science and Engineering, 2:22–23, 2000.

Vartiainen, J. J., Möttönen, M., and Salomaa, M. M. Efficient Decomposition of Quantum Gates. Phys. Rev. Lett., 92:177902, 2004.