Tensor Balancing on Statistical Manifold
Mahito Sugiyama 1 2, Hiroyuki Nakahara 3, Koji Tsuda 4 5 6

1 National Institute of Informatics  2 JST PRESTO  3 RIKEN Brain Science Institute  4 Graduate School of Frontier Sciences, The University of Tokyo  5 RIKEN AIP  6 NIMS. Correspondence to: Mahito Sugiyama <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

We solve tensor balancing, rescaling an $N$th order nonnegative tensor by multiplying $N$ tensors of order $N-1$ so that every fiber sums to one. This generalizes a fundamental process of matrix balancing used to compare matrices in a wide range of applications from biology to economics. We present an efficient balancing algorithm with quadratic convergence using Newton's method and show in numerical experiments that the proposed algorithm is several orders of magnitude faster than existing ones. To theoretically prove the correctness of the algorithm, we model tensors as probability distributions in a statistical manifold and realize tensor balancing as projection onto a submanifold. The key to our algorithm is that the gradient of the manifold, used as a Jacobian matrix in Newton's method, can be analytically obtained using the Möbius inversion formula, a cornerstone of combinatorial mathematics. Our model is not limited to tensor balancing but has wide applicability, as it includes various statistical and machine learning models such as weighted DAGs and Boltzmann machines.

Figure 1. Overview of our approach: a given tensor $A$ is modeled as a probability distribution $P$ on a statistical manifold (a dually flat Riemannian manifold); tensor balancing, which rescales $A$ into a multistochastic tensor $A'$ in which every fiber sums to 1, is realized as the projection of $P$ onto a submanifold $(\beta)$, yielding the projected distribution $P_\beta$.

1. Introduction

Matrix balancing is the problem of rescaling a given square nonnegative matrix $A \in \mathbb{R}^{n \times n}_{\geq 0}$ to a doubly stochastic matrix $RAS$, where every row and column sums to one, by multiplying two diagonal matrices $R$ and $S$. This is a fundamental process for analyzing and comparing matrices in a wide range of applications, including input-output analysis in economics, where it is called the RAS approach (Parikh, 1979; Miller & Blair, 2009; Lahr & de Mesnard, 2004), seat assignments in elections (Balinski, 2008; Akartunalı & Knight, 2016), Hi-C data analysis (Rao et al., 2014; Wu & Michor, 2016), the Sudoku puzzle (Moon et al., 2009), and the optimal transportation problem (Cuturi, 2013; Frogner et al., 2015; Solomon et al., 2015). An excellent review of this theory and its applications is given by Idel (2016).

The standard matrix balancing algorithm is the Sinkhorn-Knopp algorithm (Sinkhorn, 1964; Sinkhorn & Knopp, 1967; Marshall & Olkin, 1968; Knight, 2008), a special case of Bregman's balancing method (Lamond & Stewart, 1981), which iterates rescaling of each row and column until convergence. The algorithm is widely used in the above applications due to its simple implementation and theoretically guaranteed convergence. However, it converges only linearly (Soules, 1991), which is prohibitively slow for recently emerging large and sparse matrices. Although Livne & Golub (2004) and Knight & Ruiz (2013) tried to achieve faster convergence by approximating each step of Newton's method, the exact Newton's method with quadratic convergence has not been intensively studied yet.

Another open problem is tensor balancing, a generalization of balancing from matrices to higher-order multidimensional arrays, or tensors. The task is to rescale an $N$th order nonnegative tensor to a multistochastic tensor, in which every fiber sums to one, by multiplying $N$ tensors of order $N-1$. There are some results on mathematical properties of multistochastic tensors (Cui et al., 2014; Chang et al., 2016; Ahmed et al., 2003). However, no tensor balancing algorithm with guaranteed convergence that transforms a given tensor into a multistochastic tensor has been known until now.

Here we show that Newton's method with quadratic convergence can be applied to tensor balancing while avoiding solving a linear system on the full tensor. Our strategy is to realize matrix and tensor balancing as projection onto a dually flat Riemannian submanifold (Figure 1), which is a statistical manifold known to be the essential structure for probability distributions in information geometry (Amari, 2016). Using a partially ordered outcome space, we generalize the log-linear model (Agresti, 2012) used to model higher-order combinations of binary variables (Amari, 2001; Ganmor et al., 2011; Nakahara & Amari, 2002; Nakahara et al., 2003), which allows us to model tensors as probability distributions in the statistical manifold. The remarkable property of our model is that the gradient of the manifold can be analytically computed using the Möbius inversion formula (Rota, 1964), the heart of combinatorial mathematics (Ito, 1993), which enables us to directly obtain the Jacobian matrix in Newton's method. Moreover, we show that $(n-1)^N$ entries of a tensor of size $n^N$ are invariant with respect to one of the two coordinate systems of the statistical manifold. Thus the number of equations in Newton's method is $O(n^{N-1})$.

The remainder of this paper is organized as follows: We begin with a low-level description of our matrix balancing algorithm in Section 2 and demonstrate its efficiency in numerical experiments in Section 3. To guarantee the correctness of the algorithm and extend it to tensor balancing, we provide theoretical analysis in Section 4. In Section 4.1, we introduce a generalized log-linear model associated with a partially ordered outcome space, followed by the dually flat Riemannian structure in Section 4.2. In Section 4.3, we show how to use Newton's method to compute the projection of a probability distribution onto a submanifold. Finally, we formulate the matrix and tensor balancing problems in Section 5 and summarize our contributions in Section 6.

2. The Matrix Balancing Algorithm

Given a nonnegative square matrix $A = (a_{ij}) \in \mathbb{R}^{n \times n}_{\geq 0}$, the task of matrix balancing is to find $r, s \in \mathbb{R}^n$ that satisfy

$$(RAS)\mathbf{1} = \mathbf{1}, \qquad (RAS)^{T}\mathbf{1} = \mathbf{1}, \tag{1}$$

where $R = \mathrm{diag}(r)$ and $S = \mathrm{diag}(s)$. The balanced matrix $A' = RAS$ is called doubly stochastic; each entry $a'_{ij} = a_{ij} r_i s_j$, and all rows and columns sum to one. The most popular algorithm is the Sinkhorn-Knopp algorithm, which repeats updating $r$ and $s$ as $r = 1/(As)$ and $s = 1/(A^{T}r)$. We denote $[n] = \{1, 2, \dots, n\}$ hereafter.
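For concreteness, here is a minimal NumPy sketch of the Sinkhorn-Knopp iteration just described, alternating $r = 1/(As)$ and $s = 1/(A^{T}r)$; the function name, tolerance, and stopping rule are our own choices, not from the paper.

```python
import numpy as np

def sinkhorn_knopp(A, max_iter=100000, tol=1e-9):
    """Balance a strictly positive square matrix by alternately
    rescaling columns and rows until diag(r) A diag(s) is
    (approximately) doubly stochastic."""
    n = A.shape[0]
    r = np.ones(n)
    for _ in range(max_iter):
        s = 1.0 / (A.T @ r)              # rescale columns
        r = 1.0 / (A @ s)                # rescale rows
        P = A * np.outer(r, s)           # current balanced candidate
        if (np.abs(P.sum(axis=0) - 1.0).max() < tol and
                np.abs(P.sum(axis=1) - 1.0).max() < tol):
            break
    return P
```

Each sweep costs $O(n^2)$, but, as noted in the introduction, the number of sweeps can grow rapidly because convergence is only linear.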
In our algorithm, instead of directly updating $r$ and $s$, we update two parameters $\theta$ and $\eta$, defined for each $i, j \in [n]$ as

$$\log p_{ij} = \sum_{i' \leq i} \sum_{j' \leq j} \theta_{i'j'}, \qquad \eta_{ij} = \sum_{i' \geq i} \sum_{j' \geq j} p_{i'j'}, \tag{2}$$

where we normalize entries as $p_{ij} = a_{ij} / \sum_{ij} a_{ij}$ so that $\sum_{ij} p_{ij} = 1$. We assume for simplicity that each entry is strictly larger than zero; this assumption will be removed in Section 5.

The key to our approach is that we update $\theta_{ij}^{(t)}$ with $i = 1$ or $j = 1$ by Newton's method at each iteration $t = 1, 2, \dots$, while fixing $\theta_{ij}$ with $i, j \neq 1$, so that $\eta_{ij}^{(t)}$ satisfies the following condition (Figure 2):

$$\eta_{i1}^{(t)} = (n - i + 1)/n, \qquad \eta_{1j}^{(t)} = (n - j + 1)/n.$$

Note that the rows and columns sum not to $1$ but to $1/n$ due to the normalization.

Figure 2. Matrix balancing with two parameters $\theta$ and $\eta$: the constraints for balancing are imposed on the $\eta$ values in the first row and first column of the matrix, while the $\theta$ values outside them remain invariant.

The update formula is

$$
\begin{bmatrix}
\theta_{11}^{(t+1)} \\ \vdots \\ \theta_{1n}^{(t+1)} \\ \theta_{21}^{(t+1)} \\ \vdots \\ \theta_{n1}^{(t+1)}
\end{bmatrix}
=
\begin{bmatrix}
\theta_{11}^{(t)} \\ \vdots \\ \theta_{1n}^{(t)} \\ \theta_{21}^{(t)} \\ \vdots \\ \theta_{n1}^{(t)}
\end{bmatrix}
- J^{-1}
\begin{bmatrix}
\eta_{11}^{(t)} - (n - 1 + 1)/n \\ \vdots \\ \eta_{1n}^{(t)} - (n - n + 1)/n \\ \eta_{21}^{(t)} - (n - 2 + 1)/n \\ \vdots \\ \eta_{n1}^{(t)} - (n - n + 1)/n
\end{bmatrix}, \tag{3}
$$

where $J$ is the Jacobian matrix given as

$$J_{(ij)(i'j')} = \frac{\partial \eta_{ij}^{(t)}}{\partial \theta_{i'j'}^{(t)}} = \eta_{\max\{i,i'\}\max\{j,j'\}}^{(t)} - n^{2}\,\eta_{ij}^{(t)}\eta_{i'j'}^{(t)}, \tag{4}$$

which is derived from our theoretical result in Theorem 3. Since $J$ is a $(2n-1) \times (2n-1)$ matrix, the time complexity of each update is $O(n^3)$, which is needed to compute the inverse of $J$.

After updating to $\theta_{ij}^{(t+1)}$, we can compute $p_{ij}^{(t+1)}$ and $\eta_{ij}^{(t+1)}$ by Equation (2). Since this update does not ensure the condition $\sum_{ij} p_{ij}^{(t+1)} = 1$, we again update $\theta_{11}^{(t+1)}$ as $\theta_{11}^{(t+1)} = \theta_{11}^{(t+1)} - \log \sum_{ij} p_{ij}^{(t+1)}$ and recompute $p_{ij}^{(t+1)}$ and $\eta_{ij}^{(t+1)}$ for each $i, j \in [n]$.

By iterating the above update process in Equation (3) until convergence, $A = (a_{ij})$ with $a_{ij} = n p_{ij}$ becomes doubly stochastic.
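The following NumPy sketch assembles the whole update loop of this section: Möbius inversion to obtain $\theta$ from $\log p$, the coordinate transformations of Equation (2), the Jacobian of Equation (4), and the renormalization of $\theta_{11}$. It is an illustrative transcription of the equations as reconstructed above, not the authors' reference implementation; all function and variable names are ours.

```python
import numpy as np

def newton_balance(A, max_iter=100, tol=1e-12):
    """Matrix balancing by Newton's method in the (theta, eta)
    coordinates of Equation (2), using the update of Equations (3)-(4)."""
    n = A.shape[0]
    p = A / A.sum()                          # p_ij, normalized to sum to 1
    # Moebius inversion of log p: theta such that log p_ij equals the
    # prefix sum of theta over {i' <= i, j' <= j} (Equation (2), left).
    L = np.log(p)
    theta = L.copy()
    theta[1:, :] -= L[:-1, :]
    theta[:, 1:] -= L[:, :-1]
    theta[1:, 1:] += L[:-1, :-1]
    # The 2n - 1 updated parameters (1,1),...,(1,n),(2,1),...,(n,1),
    # in the ordering of Equation (3) (0-based indices here).
    idx = [(0, j) for j in range(n)] + [(i, 0) for i in range(1, n)]
    for _ in range(max_iter):
        # eta_ij: sum of p over {i' >= i, j' >= j} (Equation (2), right),
        # computed as suffix sums along both axes.
        eta = np.cumsum(np.cumsum(p[::-1, ::-1], axis=0), axis=1)[::-1, ::-1]
        # Residuals of the balancing condition eta_i1 = (n - i + 1)/n,
        # eta_1j = (n - j + 1)/n (0-based: (n - max(i, j))/n).
        res = np.array([eta[i, j] - (n - max(i, j)) / n for i, j in idx])
        if np.abs(res).max() < tol:
            break
        # Jacobian of Equation (4), restricted to the updated parameters.
        J = np.empty((2 * n - 1, 2 * n - 1))
        for a, (i, j) in enumerate(idx):
            for b, (k, l) in enumerate(idx):
                J[a, b] = eta[max(i, k), max(j, l)] - n**2 * eta[i, j] * eta[k, l]
        step = np.linalg.solve(J, res)       # Newton step of Equation (3)
        for a, (i, j) in enumerate(idx):
            theta[i, j] -= step[a]
        # Recover p from theta by double prefix sums, then renormalize by
        # shifting theta_11, as described after Equation (3).
        p = np.exp(np.cumsum(np.cumsum(theta, axis=0), axis=1))
        theta[0, 0] -= np.log(p.sum())
        p /= p.sum()
    return n * p                             # a_ij = n * p_ij, doubly stochastic
```

Solving the $(2n-1) \times (2n-1)$ linear system dominates each iteration at $O(n^3)$, matching the complexity stated above.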
3. Numerical Experiments

We evaluate the efficiency of our algorithm compared to two prominent balancing methods: the standard Sinkhorn-Knopp algorithm (Sinkhorn, 1964) and the state-of-the-art algorithm BNEWT (Knight & Ruiz, 2013).

Figure 3. Running time (sec.) and number of iterations as functions of the matrix size $n$ for Newton (proposed), Sinkhorn, and BNEWT (log-scale axes).
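The paper's benchmark data and the BNEWT implementation are not reproduced here, but a toy comparison of the two sketches above illustrates the kind of measurement reported in Figure 3; the random strictly positive test matrix is our own choice.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n = 100
A = rng.uniform(0.1, 1.0, size=(n, n))       # strictly positive test matrix

for name, balance in [("Newton (proposed)", newton_balance),
                      ("Sinkhorn", sinkhorn_knopp)]:
    t0 = time.perf_counter()
    P = balance(A)
    elapsed = time.perf_counter() - t0
    err = max(np.abs(P.sum(axis=0) - 1.0).max(),
              np.abs(P.sum(axis=1) - 1.0).max())
    print(f"{name}: {elapsed:.3f} sec, max row/column residual {err:.2e}")
```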