
Tensor Balancing on Statistical Manifold

Mahito Sugiyama 1 2, Hiroyuki Nakahara 3, Koji Tsuda 4 5 6

Abstract

We solve tensor balancing, rescaling an Nth order nonnegative tensor by multiplying N tensors of order N − 1 so that every fiber sums to one. This generalizes a fundamental process of matrix balancing used to compare matrices in a wide range of applications from biology to economics. We present an efficient balancing algorithm with quadratic convergence using Newton's method and show in numerical experiments that the proposed algorithm is several orders of magnitude faster than existing ones. To theoretically prove the correctness of the algorithm, we model tensors as probability distributions in a statistical manifold and realize tensor balancing as projection onto a submanifold. The key to our algorithm is that the gradient of the manifold, used as a Jacobian matrix in Newton's method, can be analytically obtained using the Möbius inversion formula, an essential tool of combinatorial mathematics. Our model is not limited to tensor balancing, but has a wide applicability as it includes various statistical and machine learning models such as weighted DAGs and Boltzmann machines.

[Figure 1 (schematic): a given tensor A is represented as a probability distribution P in a statistical manifold (dually flat); tensor balancing corresponds to projecting P onto the submanifold S(β), yielding the projected distribution Pβ, i.e., the multistochastic tensor A′ in which every fiber sums to 1.] Figure 1. Overview of our approach.

1. Introduction

Matrix balancing is the problem of rescaling a given square nonnegative matrix A ∈ R_{≥0}^{n×n} to a doubly stochastic matrix RAS, where every row and column sums to one, by multiplying two diagonal matrices R and S. This is a fundamental process for analyzing and comparing matrices in a wide range of applications, including input-output analysis in economics, called the RAS approach (Parikh, 1979; Miller & Blair, 2009; Lahr & de Mesnard, 2004), seat assignments in elections (Balinski, 2008; Akartunalı & Knight, 2016), Hi-C data analysis (Rao et al., 2014; Wu & Michor, 2016), the Sudoku puzzle (Moon et al., 2009), and the optimal transportation problem (Cuturi, 2013; Frogner et al., 2015; Solomon et al., 2015). An excellent review of this process and its applications is given by Idel (2016).

The standard matrix balancing algorithm is the Sinkhorn-Knopp algorithm (Sinkhorn, 1964; Sinkhorn & Knopp, 1967; Marshall & Olkin, 1968; Knight, 2008), a special case of Bregman's balancing method (Lamond & Stewart, 1981), which iterates rescaling of each row and column until convergence. The algorithm is widely used in the above applications thanks to its simple implementation and theoretically guaranteed convergence. However, the algorithm converges only linearly (Soules, 1991), which is prohibitively slow for recently emerging large and sparse matrices. Although Livne & Golub (2004) and Knight & Ruiz (2013) tried to achieve faster convergence by approximating each step of Newton's method, the exact Newton's method with quadratic convergence has not been intensively studied yet.

Another open problem is tensor balancing, a generalization of balancing from matrices to higher-order multidimensional arrays, or tensors. The task is to rescale an Nth order nonnegative tensor to a multistochastic tensor, in which every fiber sums to one, by multiplying N tensors of order N − 1. There are some results about mathematical properties of multistochastic tensors (Cui et al., 2014; Chang et al., 2016; Ahmed et al., 2003).

1 National Institute of Informatics, 2 JST PRESTO, 3 RIKEN Brain Science Institute, 4 Graduate School of Frontier Sciences, The University of Tokyo, 5 RIKEN AIP, 6 NIMS. Correspondence to: Mahito Sugiyama.
However, there is no result so far for tensor balancing with guaranteed convergence that transforms a given tensor into a multistochastic tensor.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Here we show that Newton's method with quadratic convergence can be applied to tensor balancing while avoiding solving a linear system on the full tensor. Our strategy is to realize matrix and tensor balancing as projection onto a dually flat Riemannian submanifold (Figure 1), which is a statistical manifold known to be the essential structure for probability distributions in information geometry (Amari, 2016). Using a partially ordered outcome space, we generalize the log-linear model (Agresti, 2012) used to model higher-order combinations of binary variables (Amari, 2001; Ganmor et al., 2011; Nakahara & Amari, 2002; Nakahara et al., 2003), which allows us to model tensors as probability distributions in the statistical manifold. The remarkable property of our model is that the gradient of the manifold can be analytically computed using the Möbius inversion formula (Rota, 1964), the heart of combinatorial mathematics (Ito, 1993), which enables us to directly obtain the Jacobian matrix in Newton's method. Moreover, we show that (n − 1)^N of the n^N entries of a tensor are invariant with respect to one of the two coordinate systems of the statistical manifold. Thus the number of parameters in Newton's method is O(n^{N−1}).

The remainder of this paper is organized as follows: We begin with a low-level description of our matrix balancing algorithm in Section 2 and demonstrate its efficiency in numerical experiments in Section 3. To guarantee the correctness of the algorithm and extend it to tensor balancing, we provide theoretical analysis in Section 4. In Section 4.1 we introduce a generalized log-linear model associated with a partial order structured outcome space, followed by the dually flat Riemannian structure in Section 4.2. In Section 4.3 we show how to use Newton's method to compute the projection of a probability distribution onto a submanifold. Finally, we formulate the matrix and tensor balancing problems in Section 5 and summarize our contributions in Section 6.

[Figure 2 (schematic): an n × n matrix (a_ij) in the (θ, η) parameterization; the first row and first column carry the η constraints for balancing, while the remaining entries carry fixed θ parameters.] Figure 2. Matrix balancing with two parameters θ and η.

2. The Matrix Balancing Algorithm

Given a nonnegative square matrix A = (a_ij) ∈ R_{≥0}^{n×n}, the task of matrix balancing is to find r, s ∈ R^n that satisfy

    (RAS)1 = 1,   (RAS)^T 1 = 1,   (1)

where R = diag(r) and S = diag(s). The balanced matrix A′ = RAS is called doubly stochastic: each entry a′_ij = a_ij r_i s_j, and all rows and columns sum to one. The most popular algorithm is the Sinkhorn-Knopp algorithm, which repeats updating r and s as r = 1/(As) and s = 1/(A^T r). We denote [n] = {1, 2, ..., n} hereafter.

In our algorithm, instead of directly updating r and s, we update two parameters θ and η defined as

    log p_ij = Σ_{i′≤i} Σ_{j′≤j} θ_{i′j′},   η_ij = Σ_{i′≥i} Σ_{j′≥j} p_{i′j′}   (2)

for each i, j ∈ [n], where we normalize entries as p_ij = a_ij / Σ_{ij} a_ij so that Σ_{ij} p_ij = 1. We assume for simplicity that each entry is strictly larger than zero; the assumption will be removed in Section 5.

The key to our approach is that we update θ_{ij}^{(t)} with i = 1 or j = 1 by Newton's method at each iteration t = 1, 2, ..., while fixing θ_{ij} with i, j ≠ 1, so that η_{ij}^{(t)} satisfies the following condition (Figure 2):

    η_{i1}^{(t)} = (n − i + 1)/n,   η_{1j}^{(t)} = (n − j + 1)/n.

Note that the rows and columns sum not to 1 but to 1/n due to the normalization. The update formula is described as

    (θ_{11}^{(t+1)}, ..., θ_{1n}^{(t+1)}, θ_{21}^{(t+1)}, ..., θ_{n1}^{(t+1)})^T
      = (θ_{11}^{(t)}, ..., θ_{1n}^{(t)}, θ_{21}^{(t)}, ..., θ_{n1}^{(t)})^T
        − J^{−1} (η_{11}^{(t)} − (n − 1 + 1)/n, ..., η_{1n}^{(t)} − (n − n + 1)/n, η_{21}^{(t)} − (n − 2 + 1)/n, ..., η_{n1}^{(t)} − (n − n + 1)/n)^T,   (3)

where J is the Jacobian matrix given as

    J_{(ij)(i′j′)} = ∂η_{ij}^{(t)} / ∂θ_{i′j′}^{(t)} = η_{max{i,i′} max{j,j′}} − n² η_{ij} η_{i′j′},   (4)

which is derived from our theoretical result in Theorem 3. Since J is a (2n − 1) × (2n − 1) matrix, the time complexity of each update is O(n³), which is needed to compute the inverse of J.
After updating to θ^{(t+1)}, we can compute p_{ij}^{(t+1)} and η_{ij}^{(t+1)} by Equation (2). Since this update does not ensure the normalization condition Σ_{ij} p_{ij}^{(t+1)} = 1, we again update θ_{11}^{(t+1)} by subtracting log Σ_{ij} p_{ij}^{(t+1)} and recompute p_{ij}^{(t+1)} and η_{ij}^{(t+1)} for each i, j ∈ [n].

By iterating the above update process in Equation (3) until convergence, A = (a_ij) with a_ij = n p_ij becomes doubly stochastic.
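To make the procedure concrete, here is a minimal NumPy sketch of the algorithm above (our own illustration with hypothetical helper names; the authors' reference implementation is the C++/Eigen code linked in Section 3). It transcribes Equations (2)-(4) directly and assumes a strictly positive square matrix:

```python
import numpy as np

def newton_balance(A, tol=1e-6, max_iter=50):
    """Matrix balancing by Newton's method (sketch of Section 2).

    Assumes every entry of A is strictly positive. Indices are 0-based,
    so the paper's parameters theta_ij with i = 1 or j = 1 live in the
    first row and first column of the theta array.
    """
    n = A.shape[0]
    p = A / A.sum()                                    # normalize: sum(p) = 1
    # parameter positions (1,1), (1,2), ..., (1,n), (2,1), ..., (n,1) of Eq. (3)
    idx = [(0, j) for j in range(n)] + [(i, 0) for i in range(1, n)]
    # target values eta_i1 = (n - i + 1)/n and eta_1j = (n - j + 1)/n
    target = np.array([(n - max(i, j)) / n for (i, j) in idx])
    for _ in range(max_iter):
        # eta_ij = sum of p over {i' >= i, j' >= j}    (Eq. (2))
        eta = p[::-1, ::-1].cumsum(0).cumsum(1)[::-1, ::-1]
        residual = np.array([eta[i, j] for (i, j) in idx]) - target
        if np.abs(residual).max() < tol:
            break
        # Jacobian of Eq. (4), a (2n - 1) x (2n - 1) matrix
        J = np.array([[eta[max(i, k), max(j, l)] - n * n * eta[i, j] * eta[k, l]
                       for (k, l) in idx] for (i, j) in idx])
        # theta is the 2-D finite difference of log p (inverse of Eq. (2))
        theta = np.diff(np.diff(np.log(p), axis=0, prepend=0), axis=1, prepend=0)
        step = np.linalg.solve(J, residual)            # Newton step of Eq. (3)
        for a, (i, j) in enumerate(idx):
            theta[i, j] -= step[a]
        p = np.exp(theta.cumsum(axis=0).cumsum(axis=1))
        p /= p.sum()                                   # renormalize via theta_11
    return n * p                                       # doubly stochastic matrix
```

For example, `newton_balance(np.random.rand(5, 5) + 0.1)` returns a matrix whose row and column sums are all 1 up to the tolerance; solving the (2n − 1) × (2n − 1) system, rather than a system over all n² entries, is what keeps each update at O(n³).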

3. Numerical Experiments

We evaluate the efficiency of our algorithm against two prominent balancing methods: the standard Sinkhorn-Knopp algorithm (Sinkhorn, 1964) and the state-of-the-art algorithm BNEWT (Knight & Ruiz, 2013), which uses Newton's-method-like iterations with conjugate gradients.

All experiments were conducted on Amazon Linux AMI release 2016.09 with a single core of a 2.3 GHz Intel Xeon CPU E5-2686 v4 and 256 GB of memory. All methods were implemented in C++ with the Eigen library and compiled with gcc 4.8.3 [1]. For a fair comparison, we carefully implemented BNEWT by directly translating the MATLAB code provided in Knight & Ruiz (2013) into C++ with the Eigen library, and used the default parameters. We measured the residual of a matrix A′ = (a′_ij) by the squared norm ∥(A′1 − 1, A′^T 1 − 1)∥₂, where each entry a′_ij is obtained as n p_ij in our algorithm, and ran each of the three algorithms until the residual fell below the tolerance threshold 10⁻⁶.

Hessenberg Matrix. The first set of experiments used a Hessenberg matrix, which has been a standard benchmark for matrix balancing (Parlett & Landis, 1982; Knight & Ruiz, 2013). Each entry of an n × n Hessenberg matrix H_n = (h_ij) is given as h_ij = 0 if j < i − 1 and h_ij = 1 otherwise. We varied the size n from 10 to 5,000 and measured the running time (in seconds) and the number of iterations of each method; the benchmark and the stopping criterion are sketched in code below.

[Figure 3 (plots): running time (sec.) and number of iterations versus n for Newton (proposed), Sinkhorn, and BNEWT.] Figure 3. Results on Hessenberg matrices. The BNEWT algorithm (green) failed to converge for n ≥ 200.

[Figure 5 (plots): running time (sec.) and number of iterations versus n for Newton (proposed), Sinkhorn, and BNEWT.] Figure 5. Results on Trefethen matrices. The BNEWT algorithm (green) failed to converge for n ≥ 200.
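The sketch below (our own code, not the authors' implementation) builds H_n, runs the Sinkhorn-Knopp baseline described in Section 2, and evaluates the residual used in all three experiments:

```python
import numpy as np

def hessenberg(n):
    """H_n with h_ij = 0 if j < i - 1 and h_ij = 1 otherwise (1-based indices)."""
    return np.triu(np.ones((n, n)), k=-1)

def residual(B):
    """Squared norm ||(B1 - 1, B^T 1 - 1)||_2 of the balancing error."""
    r = np.concatenate([B.sum(axis=1) - 1, B.sum(axis=0) - 1])
    return r @ r

def sinkhorn_knopp(A, tol=1e-6, max_iter=10**6):
    """Baseline: alternately rescale rows and columns until convergence."""
    r = np.ones(A.shape[0])
    for it in range(1, max_iter + 1):
        s = 1.0 / (A.T @ r)             # s = 1/(A^T r)
        r = 1.0 / (A @ s)               # r = 1/(A s)
        B = A * np.outer(r, s)          # B = diag(r) A diag(s)
        if residual(B) < tol:
            return B, it
    return B, max_iter

B, iters = sinkhorn_knopp(hessenberg(20))
print(iters, residual(B))               # linear convergence: many iterations
```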

Results are plotted in Figure 3. Our balancing algorithm with Newton's method (plotted in blue in the figures) is clearly the fastest: it is three to five orders of magnitude faster than the standard Sinkhorn-Knopp algorithm (plotted in red). Although the BNEWT algorithm (plotted in green) is competitive if n is small, it suddenly fails to converge whenever n ≥ 200, which is consistent with the results in the original paper (Knight & Ruiz, 2013), where no result is reported for the setting n ≥ 200 on the same matrix. Moreover, our method converges in around 10 to 20 steps, which is about three and seven orders of magnitude fewer iterations than BNEWT and Sinkhorn-Knopp, respectively, at n = 100.

To see the behavior of the rate of convergence in detail, we plot the convergence graph for n = 20 in Figure 4, where we observe the slow convergence rate of the Sinkhorn-Knopp algorithm and the unstable convergence of the BNEWT algorithm, in contrast with our quick convergence.

[Figure 4 (plot): residual versus number of iterations for Newton (proposed), Sinkhorn, and BNEWT.] Figure 4. Convergence graph on H_20.

Trefethen Matrix. Next, we collected a set of Trefethen matrices from a sparse matrix collection website [2], which are nonnegative diagonal matrices with primes. Results are plotted in Figure 5, where we observe the same trend as before: our algorithm is the fastest, about four orders of magnitude faster than the Sinkhorn-Knopp algorithm. Note that larger matrices with n > 300 do not have total support, which is a necessary condition for matrix balancing (Knight & Ruiz, 2013), while the BNEWT algorithm fails to converge for n = 200 and n = 300.

4. Theoretical Analysis

In the following, we provide theoretical support for our algorithm by formulating the problem as a projection within a statistical manifold, in which a matrix corresponds to an element, that is, a probability distribution, in the manifold. We show that balanced matrices form a submanifold and that matrix balancing is the projection of a given distribution onto this submanifold, where the Jacobian matrix in Equation (4) is derived from the gradient of the manifold.

[1] An implementation of our algorithms for matrices and third order tensors is available at: https://github.com/mahito-sugiyama/newton-balancing
[2] http://www.cise.ufl.edu/research/sparse/matrices/

4.1. Formulation

We introduce our log-linear probabilistic model, where the outcome space is a partially ordered set, or a poset (Gierz et al., 2003). We first prepare basic notation and the key mathematical tool for posets, the Möbius inversion formula, and then formulate the log-linear model.

4.1.1. Möbius Inversion

A poset (S, ≤), a set of elements S with a partial order ≤ on S, is a fundamental structured space in computer science. A partial order "≤" is a relation between elements in S satisfying the following three properties: for all x, y, z ∈ S, (1) x ≤ x (reflexivity), (2) x ≤ y, y ≤ x ⇒ x = y (antisymmetry), and (3) x ≤ y, y ≤ z ⇒ x ≤ z (transitivity). In what follows, S is always finite and includes the least element (bottom) ⊥ ∈ S; that is, ⊥ ≤ x for all x ∈ S. We denote S \ {⊥} by S⁺.

Rota (1964) introduced the Möbius inversion formula on posets by generalizing the inclusion-exclusion principle. Let ζ: S × S → {0, 1} be the zeta function defined as

    ζ(s, x) = 1 if s ≤ x,   ζ(s, x) = 0 otherwise.

Its convolutional inverse is the Möbius function µ: S × S → Z, given inductively by µ(s, x) = 1 if s = x, µ(s, x) = −Σ_{s≤u<x} µ(s, u) if s < x, and µ(s, x) = 0 otherwise.

A probability vector is treated as a mapping p: S → (0, 1) such that Σ_{x∈S} p(x) = 1, where every entry p(x) is assumed to be strictly larger than zero.

Using the zeta and Möbius functions, let us introduce two mappings θ: S → R and η: S → R as

    θ(x) = Σ_{s∈S} µ(s, x) log p(s),   (6)
    η(x) = Σ_{s∈S} ζ(x, s) p(s) = Σ_{s≥x} p(s).   (7)

From the Möbius inversion formula, we have

    log p(x) = Σ_{s∈S} ζ(s, x) θ(s) = Σ_{s≤x} θ(s),   (8)
    p(x) = Σ_{s∈S} µ(x, s) η(s).   (9)

These are generalizations of the log-linear model (Agresti, 2012) that gives the probability p(x) of an n-dimensional binary vector x = (x¹, ..., xⁿ) ∈ {0, 1}ⁿ as

    log p(x) = Σ_i θ_i x^i + Σ_{i<j} θ_{ij} x^i x^j + Σ_{i<j<k} θ_{ijk} x^i x^j x^k + ··· .
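Since ζ is an upper-triangular 0-1 matrix under any linear extension of ≤ and µ is its inverse in the incidence algebra, Equations (6)-(9) reduce to matrix-vector products. The following small self-contained check (our own illustration, on the poset [2] × [2] ordered coordinate-wise) verifies the inversion numerically:

```python
import numpy as np
from itertools import product

# Poset S = [2] x [2] with (i,j) <= (k,l) iff i <= k and j <= l; bottom = (1,1).
# Listing the elements in a linear extension of <= makes Z upper triangular.
S = sorted(product([1, 2], repeat=2), key=sum)
leq = lambda x, y: x[0] <= y[0] and x[1] <= y[1]

Z = np.array([[1.0 if leq(s, x) else 0.0 for x in S] for s in S])  # zeta(s, x)
M = np.linalg.inv(Z)    # Moebius function mu(s, x) as the matrix inverse of zeta

p = np.array([0.1, 0.2, 0.3, 0.4])    # a strictly positive probability vector on S
theta = M.T @ np.log(p)               # Eq. (6): theta(x) = sum_s mu(s, x) log p(s)
eta = Z @ p                           # Eq. (7): eta(x) = sum_{s >= x} p(s)

# Moebius inversion, Eqs. (8) and (9): both transforms recover p exactly
assert np.allclose(Z.T @ theta, np.log(p))   # log p(x) = sum_{s <= x} theta(s)
assert np.allclose(M @ eta, p)               # p(x) = sum_s mu(x, s) eta(s)
```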

In the following, we denote by ξ either of the functions θ or η, and by ∇ the gradient operator with respect to S⁺ = S \ {⊥}, i.e., (∇f(ξ))(x) = ∂f/∂ξ(x) for x ∈ S⁺, and denote by 𝒮 the set of probability distributions specified by probability vectors.

Proof. They can be directly derived from our definitions (Equations (6) and (11)) as

    ∂ψ(θ)/∂θ(x) = ∂/∂θ(x) log Σ_{y∈S} exp( Σ_{⊥<s≤y} θ(s) ) = Σ_{y≥x} exp( Σ_{⊥<s≤y} θ(s) − ψ(θ) ) = Σ_{y≥x} p(y) = η(x),

where we used log p(y) = Σ_{⊥<s≤y} θ(s) − ψ(θ) with ψ(θ) = −θ(⊥), which follows from Equation (8).
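The identity ∂ψ(θ)/∂θ(x) = η(x) can also be checked numerically by finite differences. Continuing the [2] × [2] example (again our own illustration), we write ψ as an explicit log-partition function over S⁺ and compare its gradient with η:

```python
import numpy as np
from itertools import product

S = sorted(product([1, 2], repeat=2), key=sum)       # bottom (1,1) comes first
leq = lambda x, y: x[0] <= y[0] and x[1] <= y[1]
Z = np.array([[1.0 if leq(s, x) else 0.0 for x in S] for s in S])
M = np.linalg.inv(Z)
p = np.array([0.1, 0.2, 0.3, 0.4])

def psi(t):
    """psi(theta) = log sum_y exp(sum_{bot < s <= y} theta(s)) for theta on S^+."""
    full = np.concatenate([[0.0], t])   # theta(bot) = -psi is absorbed into psi
    return np.log(np.exp(Z.T @ full).sum())

t0 = (M.T @ np.log(p))[1:]              # theta on S^+ from Eq. (6)
eps = 1e-6
grad = np.array([(psi(t0 + eps * e) - psi(t0 - eps * e)) / (2 * eps)
                 for e in np.eye(len(t0))])

assert np.allclose(grad, (Z @ p)[1:], atol=1e-4)   # gradient of psi equals eta on S^+
```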

5.2. Tensor Balancing

Next, we generalize our approach from matrices to tensors. For an Nth order tensor A = (a_{i1 i2 ... iN}) ∈ R^{n1×n2×···×nN} and a vector b ∈ R^{nm}, the m-mode product of A and b is defined as

    (A ×_m b)_{i1 ... i_{m−1} i_{m+1} ... iN} = Σ_{im=1}^{nm} a_{i1 i2 ... iN} b_{im}.

We define tensor balancing as follows: given a tensor A ∈ R^{n1×n2×···×nN} with n1 = ··· = nN = n, find (N − 1)th order tensors R¹, R², ..., Rᴺ such that

    A′ ×_m 1 = 1 (∈ R^{n1×···×n_{m−1}×n_{m+1}×···×nN})   (18)

for all m ∈ [N], i.e., Σ_{im=1}^{n} a′_{i1 i2 ... iN} = 1, where each entry a′_{i1 i2 ... iN} of the balanced tensor A′ is given as

    a′_{i1 i2 ... iN} = a_{i1 i2 ... iN} Π_{m∈[N]} Rᵐ_{i1 ... i_{m−1} i_{m+1} ... iN}.

A tensor A′ that satisfies Equation (18) is called multistochastic (Cui et al., 2014). Note that this is exactly the matrix balancing problem if N = 2.

It is straightforward to extend matrix balancing to tensor balancing as e-projection onto a submanifold. Given a tensor A ∈ R^{n1×n2×···×nN}, its normalized probability distribution P is defined by

    p(x) = a_{i1 i2 ... iN} / Σ_{j1 j2 ... jN} a_{j1 j2 ... jN}   (19)

for all x = (i1, i2, ..., iN). The objective is to obtain P_β such that Σ_{im=1}^{n} p_β((i1, ..., iN)) = 1/n^{N−1} for all m ∈ [N] and i1, ..., iN ∈ [n]. In the same way as for matrix balancing, we define S as

    S = { (i1, i2, ..., iN) ∈ [n]ᴺ | a_{i1 i2 ... iN} ≠ 0 },

removing zero entries, and the partial order ≤ as

    x = (i1 ... iN) ≤ y = (j1 ... jN) ⇔ im ≤ jm for all m ∈ [N].

In addition, we introduce ι_{k,m} as

    ι_{k,m} = min{ x = (i1, i2, ..., iN) ∈ S | im = k }

and require the condition in Equation (17).

Tensor balancing as e-projection: Given a tensor A ∈ R^{n1×n2×···×nN} with its normalized probability distribution P ∈ 𝒮 given in Equation (19), the submanifold 𝒮(β) of multistochastic tensors is given as

    𝒮(β) = { P ∈ 𝒮 | η_P(x) = β(x) for all x ∈ dom(β) },

where the domain of the function β is given as

    dom(β) = { ι_{k,m} | k ∈ [n], m ∈ [N] }

and each value is described using the zeta function as

    β(ι_{k,m}) = Σ_{l∈[n]} ζ(ι_{k,m}, ι_{l,m}) · (1/n^{N−1}).

Tensor balancing is the e-projection of P onto the submanifold 𝒮(β); that is, the multistochastic tensor is the distribution P_β such that

    θ_{P_β}(x) = θ_P(x) if x ∈ S⁺ \ dom(β),
    η_{P_β}(x) = β(x) if x ∈ dom(β),

which is unique and always exists in 𝒮 thanks to its dually flat structure. Moreover, each balancing tensor Rᵐ is obtained as

    Rᵐ_{i1 ... i_{m−1} i_{m+1} ... iN} = exp( Σ_{m′≠m} Σ_{k=1}^{i_{m′}} ( θ_{P_β}(ι_{k,m′}) − θ_P(ι_{k,m′}) ) )

for every m ∈ [N], with R¹ rescaled as R¹ = R¹ n^{N−1} / Σ_{j1...jN} a_{j1...jN} to recover a multistochastic tensor. ■

Our result means that the e-projection algorithm based on Newton's method proposed in Section 4.3 converges to the unique balanced tensor whenever 𝒮(β) ≠ ∅ holds.
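The objects in Equation (18) are straightforward to express in code. Below is a small sketch (our own helper names) of the m-mode product with the all-ones vector, the multistochasticity check, and the entrywise application of the balancing tensors Rᵐ:

```python
import numpy as np

def mode_product_ones(A, m):
    """A x_m 1: contract the m-th index of A with the all-ones vector."""
    return A.sum(axis=m)

def is_multistochastic(A, tol=1e-8):
    """Check Eq. (18): A x_m 1 = 1 for every mode m, i.e., all fibers sum to one."""
    return all(np.allclose(mode_product_ones(A, m), 1.0, atol=tol)
               for m in range(A.ndim))

def apply_balancing(A, Rs):
    """Entrywise product a'_{i1...iN} = a_{i1...iN} * prod_m R^m,
    where Rs[m] is an (N-1)th order tensor with the m-th axis of A removed."""
    out = np.array(A, dtype=float)
    for m, R in enumerate(Rs):
        out *= np.expand_dims(R, axis=m)   # broadcast R^m along the m-th axis
    return out

# Sanity check: the uniform tensor with every entry 1/n is multistochastic,
# since each fiber has n entries summing to n * (1/n) = 1.
n, N = 3, 3
assert is_multistochastic(np.full((n,) * N, 1.0 / n))
```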

6. Conclusion

In this paper, we have solved the open problem of tensor balancing and presented an efficient balancing algorithm using Newton's method. Our algorithm converges quadratically, while the popular Sinkhorn-Knopp algorithm converges only linearly. We have examined the efficiency of our algorithm in numerical experiments on matrix balancing and showed that the proposed algorithm is several orders of magnitude faster than the existing approaches.

We have analyzed the theory behind the algorithm and proved that balancing is e-projection in a special type of statistical manifold, in particular, a dually flat Riemannian manifold studied in information geometry. Our key finding is that the gradient of the manifold, equivalent to the Riemannian metric or the Fisher information matrix, can be analytically obtained using the Möbius inversion formula.

Our information geometric formulation can model several machine learning applications such as statistical analysis on a DAG structure. Thus, we can perform efficient learning as projection using information of the gradient of manifolds by reformulating such models, which we will study in future work.

Acknowledgements

The authors sincerely thank Marco Cuturi for his valuable comments. This work was supported by JSPS KAKENHI Grant Numbers JP16K16115, JP16H02870 (MS), JP26120732 and JP16H06570 (HN). The research of K.T. was supported by JST CREST JPMJCR1502, RIKEN PostK, KAKENHI Nanostructure and KAKENHI JP15H05711.

References

Agresti, A. Categorical Data Analysis. Wiley, 3rd edition, 2012.

Ahmed, M., De Loera, J., and Hemmecke, R. Polyhedral Cones of Magic Cubes and Squares, volume 25 of Algorithms and Combinatorics, pp. 25-41. Springer, 2003.

Akartunalı, K. and Knight, P. A. Network models and biproportional rounding for fair seat allocations in the UK elections. Annals of Operations Research, pp. 1-19, 2016.

Amari, S. Information geometry on hierarchy of probability distributions. IEEE Transactions on Information Theory, 47(5):1701-1711, 2001.

Amari, S. Information geometry and its applications: Convex function and dually flat manifold. In Nielsen, F. (ed.), Emerging Trends in Visual Computing: LIX Fall Colloquium, ETVC 2008, Revised Invited Papers, pp. 75-102. Springer, 2009.

Amari, S. Information geometry of positive measures and positive-definite matrices: Decomposable dually flat structure. Entropy, 16(4):2131-2145, 2014.

Amari, S. Information Geometry and Its Applications. Springer, 2016.

Balinski, M. Fair majority voting (or how to eliminate gerrymandering). American Mathematical Monthly, 115(2):97-113, 2008.

Censor, Y. and Lent, A. An iterative row-action method for interval convex programming. Journal of Optimization Theory and Applications, 34(3):321-353, 1981.

Chang, H., Paksoy, V. E., and Zhang, F. Polytopes of stochastic tensors. Annals of Functional Analysis, 7(3):386-393, 2016.

Cui, L.-B., Li, W., and Ng, M. K. Birkhoff-von Neumann theorem for multistochastic tensors. SIAM Journal on Matrix Analysis and Applications, 35(3):956-973, 2014.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, pp. 2292-2300, 2013.

Frogner, C., Zhang, C., Mobahi, H., Araya, M., and Poggio, T. A. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems 28, pp. 2053-2061, 2015.

Ganmor, E., Segev, R., and Schneidman, E. Sparse low-order interaction network underlies a highly correlated and learnable neural population code. Proceedings of the National Academy of Sciences, 108(23):9679-9684, 2011.

Gierz, G., Hofmann, K. H., Keimel, K., Lawson, J. D., Mislove, M., and Scott, D. S. Continuous Lattices and Domains. Cambridge University Press, 2003.

Idel, M. A review of matrix scaling and Sinkhorn's normal form for matrices and positive maps. arXiv:1609.06349, 2016.

Ito, K. (ed.). Encyclopedic Dictionary of Mathematics. The MIT Press, 2nd edition, 1993.

Knight, P. A. The Sinkhorn-Knopp algorithm: Convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1):261-275, 2008.

Knight, P. A. and Ruiz, D. A fast algorithm for matrix balancing. IMA Journal of Numerical Analysis, 33(3):1029-1047, 2013.

Lahr, M. and de Mesnard, L. Biproportional techniques in input-output analysis: Table updating and structural analysis. Economic Systems Research, 16(2):115-134, 2004.

Lamond, B. and Stewart, N. F. Bregman's balancing method. Transportation Research Part B: Methodological, 15(4):239-248, 1981.

Livne, O. E. and Golub, G. H. Scaling by binormalization. Numerical Algorithms, 35(1):97-120, 2004.

Marshall, A. W. and Olkin, I. Scaling of matrices to achieve specified row and column sums. Numerische Mathematik, 12(1):83-90, 1968.

Miller, R. E. and Blair, P. D. Input-Output Analysis: Foundations and Extensions. Cambridge University Press, 2nd edition, 2009.

Moon, T. K., Gunther, J. H., and Kupin, J. J. Sinkhorn solves sudoku. IEEE Transactions on Information Theory, 55(4):1741-1746, 2009.

Nakahara, H. and Amari, S. Information-geometric measure for neural spikes. Neural Computation, 14(10):2269-2316, 2002.

Nakahara, H., Nishimura, S., Inoue, M., Hori, G., and Amari, S. Gene interaction in DNA microarray data is decomposed by information geometric measure. Bioinformatics, 19(9):1124-1131, 2003.

Nakahara, H., Amari, S., and Richmond, B. J. A comparison of descriptive models of a single spike train by information-geometric measure. Neural Computation, 18(3):545-568, 2006.

Parikh, A. Forecasts of input-output matrices using the R.A.S. method. The Review of Economics and Statistics, 61(3):477-481, 1979.

Parlett, B. N. and Landis, T. L. Methods for scaling to doubly stochastic form. Linear Algebra and its Applications, 48:53-79, 1982.

Rao, S. S. P., Huntley, M. H., Durand, N. C., Stamenova, E. K., Bochkov, I. D., Robinson, J. T., Sanborn, A. L., Machol, I., Omer, A. D., Lander, E. S., and Aiden, E. L. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell, 159(7):1665-1680, 2014.

Rota, G.-C. On the foundations of combinatorial theory I: Theory of Möbius functions. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 2:340-368, 1964.

Sinkhorn, R. A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics, 35(2):876-879, 1964.

Sinkhorn, R. and Knopp, P. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343-348, 1967.

Solomon, J., de Goes, F., Peyré, G., Cuturi, M., Butscher, A., Nguyen, A., Du, T., and Guibas, L. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics, 34(4):66:1-66:11, 2015.

Soules, G. W. The rate of convergence of Sinkhorn balancing. Linear Algebra and its Applications, 150:3-40, 1991.

Sugiyama, M., Nakahara, H., and Tsuda, K. Information decomposition on structured space. In 2016 IEEE International Symposium on Information Theory, pp. 575-579, 2016.

Wu, H.-J. and Michor, F. A computational strategy to adjust for copy number in tumor Hi-C data. Bioinformatics, 32(24):3695-3701, 2016.