
A Simulated Annealing Based Inexact Oracle for Wasserstein Loss Minimization

Jianbo Ye 1 James Z. Wang 1 Jia Li 2

1 College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA. 2 Department of Statistics, The Pennsylvania State University, University Park, PA. Correspondence to: Jianbo Ye.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

arXiv:1608.03859v4 [stat.CO] 6 Jun 2017

Abstract

Learning under a Wasserstein loss, a.k.a. Wasserstein loss minimization (WLM), is an emerging research topic for gaining insights from a large set of structured objects. Despite being conceptually simple, WLM problems are computationally challenging because they involve minimizing over functions of quantities (i.e. Wasserstein distances) that themselves require numerical algorithms to compute. In this paper, we introduce a stochastic approach based on simulated annealing for solving WLMs. Particularly, we have developed a Gibbs sampler to approximate effectively and efficiently the partial gradients of a sequence of Wasserstein losses. Our new approach has the advantages of numerical stability and readiness for warm starts. These characteristics are valuable for WLM problems that often require multiple levels of iterations in which the oracle for computing the value and gradient of a loss function is embedded. We applied the method to optimal transport with Coulomb cost and the Wasserstein non-negative matrix factorization problem, and made comparisons with the existing method of entropy regularization.

1. Introduction

An oracle is a computational module in an optimization procedure that is applied iteratively to obtain certain characteristics of the function being optimized. Typically, it calculates the value and gradient of a loss function l(x, y). In the vast majority of machine learning models, where those loss functions are decomposable along each dimension (e.g., Lp norm, KL divergence, or hinge loss), ∇_x l(·, y) or ∇_y l(x, ·) is computed in O(m) time, m being the complexity of the outcome variables x or y. This part of the calculation is often negligible compared with the calculation of the full gradient with respect to the model parameters. But this is no longer the case in learning problems based on Wasserstein distance, due to the intrinsic complexity of the distance. We will call such problems Wasserstein loss minimization (WLM). Examples of WLMs include Wasserstein barycenters (Li & Wang, 2008; Agueh & Carlier, 2011; Cuturi & Doucet, 2014; Benamou et al., 2015; Ye & Li, 2014; Ye et al., 2017b), principal geodesics (Seguy & Cuturi, 2015), nonnegative matrix factorization (Rolet et al., 2016; Sandler & Lindenbaum, 2009), barycentric coordinates (Bonneel et al., 2016), and multi-label classification (Frogner et al., 2015).

Wasserstein distance is defined as the cost of matching two probability measures, originated from the literature of optimal transport (OT) (Monge, 1781). It takes into account the cross-term similarity between different support points of the distributions, a level of complexity beyond the usual vector data treatment, i.e., converting the distribution into a vector of frequencies. It has been promoted for comparing sets of vectors (e.g. bag-of-words models) by researchers in computer vision, multimedia and, more recently, natural language processing (Kusner et al., 2015; Ye et al., 2017a). However, its potential as a powerful loss function for machine learning has been underexplored. The major obstacle is a lack of standardized and robust numerical methods to solve WLMs. Even to empirically better understand the advantages of the distance is of interest.

As a long-standing consensus, solving WLMs is challenging (Cuturi & Doucet, 2014). Unlike the usual optimization in machine learning where the loss and the (partial) gradient can be calculated in linear time, these quantities are non-smooth and hard to obtain in WLMs, requiring solution of a costly network transportation problem (a.k.a. OT). The time complexity, O(m³ log m), is prohibitively high (Orlin, 1993). In contrast to the Lp or KL counterparts, this step of calculation elevates from a negligible fraction of the overall learning problem to a dominant portion, preventing the scaling of WLMs to large data. Recently, iterative approximation techniques have been developed to compute the loss and the (partial) gradient at complexity O(m²/ε) (Cuturi, 2013; Wang & Banerjee, 2014).
In tion in machine learning where the loss and the (partial) the vast majority of machine learning models, where those gradient can be calculated in linear time, these quantities loss functions are decomposable along each dimension are non-smooth and hard to obtain in WLMs, requiring so- (e.g., L norm, KL divergence, or hinge loss), ∇ l(·, y) or p x lution of a costly network transportation problem (a.k.a. 3 1College of Information Sciences and Technology, The Penn- OT). The time complexity, O(m log m), is prohibitively 2 sylvania State University, University Park, PA. Department of high (Orlin, 1993). In contrast to the Lp or KL counter- Statistics, The Pennsylvania State University, University Park, parts, this step of calculation elevates from a negligible PA.. Correspondence to: Jianbo Ye . fraction of the overall learning problem to a dominant por- Proceedings of the 34 th International Conference on Machine tion, preventing the scaling of WLMs to large data. Re- Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 cently, iterative approximation techniques have been devel- by the author(s). oped to compute the loss and the (partial) gradient at com- A Simulated Annealing Based Inexact Oracle for Wasserstein Loss Minimization plexity O(m2/ε) (Cuturi, 2013; Wang & Banerjee, 2014). lem with a strongly convex term such that the regularized However, nontrivial algorithmic efforts are needed to in- objective becomes a smooth function of all its coordinat- corporate these methods into WLMs because WLMs often ing parameters. Neither the Sinkhorn’s nor Breg- require multi-level loops (Cuturi & Doucet, 2014; Frogner man ADMM can be readily integrated into a general WLM. et al., 2015). Specifically, one must re-calculate through Based on the entropic regularization of primal OT, Cuturi & many iterations the loss and its partial gradient in order to Peyre´(2016) recently showed that the Legendre transform update other model dependent parameters. of the entropy regularized Wasserstein loss and its gradi- ent can be computed in closed form, which appear in the We are thus motivated to seek for a fast inexact oracle first-order condition of some complex WLM problems. Us- that (i) runs at lower time complexity per iteration, and ing this technique, the regularized primal problem can be (ii) accommodates warm starts and meaningful early stops. converted to an equivalent Fenchel-type dual problem that These two properties are equally important for efficiently has a faster numerical solver in the Euclidean space (Ro- obtaining adequate approximation to the solutions of a se- let et al., 2016). But this methodology can only be applied quence of slowly changing OTs. The second property en- to a certain class of WLM problems of which the Fenchel- sures that the subsequent OTs can effectively leverage the type dual has closed forms of objective and full gradient. solutions of the earlier OTs so that the total computational In contrast, the proposed SA-based approach directly deals time is low. Approximation techniques with low complex- with the dual OT problem without assuming any particular ity per iteration already exist for solving a single OT, but mathematical structure of the WLM problem, and hence is they do not possess the second property. In this paper, we more flexible to apply. introduce a method that uses a time-inhomogeneous Gibbs sampler as an inexact oracle for Wasserstein losses. 
2. Related Work

Recently, several methods have been proposed to overcome the aforementioned difficulties in solving WLMs. Representatives include entropic regularization (Cuturi, 2013; Cuturi & Doucet, 2014; Benamou et al., 2015) and Bregman ADMM (Wang & Banerjee, 2014; Ye et al., 2017b). The main idea is to augment the original optimization problem with a strongly convex term such that the regularized objective becomes a smooth function of all its coordinating parameters. Neither the Sinkhorn's algorithm nor Bregman ADMM can be readily integrated into a general WLM. Based on the entropic regularization of primal OT, Cuturi & Peyré (2016) recently showed that the Legendre transform of the entropy regularized Wasserstein loss and its gradient can be computed in closed form, which appear in the first-order condition of some complex WLM problems. Using this technique, the regularized primal problem can be converted to an equivalent Fenchel-type dual problem that has a faster numerical solver in the Euclidean space (Rolet et al., 2016). But this methodology can only be applied to a certain class of WLM problems of which the Fenchel-type dual has closed forms of objective and full gradient. In contrast, the proposed SA-based approach directly deals with the dual OT problem without assuming any particular mathematical structure of the WLM problem, and hence is more flexible to apply.

More recent approaches based on solving the dual OT problems have been proposed to calculate and optimize the Wasserstein distance between a single pair of distributions with very large support sets — often as large as the size of an entire machine learning dataset (Montavon et al., 2016; Genevay et al., 2016; Arjovsky et al., 2017). For these methods, scalability is achieved in terms of the support size. Our proposed method has a different focus: calculating and optimizing Wasserstein distances between many pairs all together in WLMs, with each distribution having a moderate support size (e.g., dozens or hundreds). We aim at scalability for the scenarios when a large set of distributions have to be handled simultaneously, that is, when the optimization cannot be decoupled on the distributions. In addition, existing methods have no on-the-fly mechanism to control the approximation quality at a limited number of iterations.

3. Preliminaries of Optimal Transport

In this section, we present notations, mathematical backgrounds, and set up the problem of interest.

Definition 3.1 (Optimal Transportation, OT). Let p ∈ ∆_{m1}, q ∈ ∆_{m2}, where ∆_m is the m-dimensional simplex: ∆_m := {q ∈ R_+^m : ⟨q, 1⟩ = 1}. The set of transportation plans between p and q is defined as Π(p, q) := {Z ∈ R_+^{m1×m2} : Z · 1_{m2} = p; Z^T · 1_{m1} = q}. Let M ∈ R_+^{m1×m2} be the matrix of costs. The optimal transport cost between p and q with respect to M is

    W(p, q) := min_{Z ∈ Π(p,q)} ⟨Z, M⟩ .    (1)

In particular, Π(·, ·) is often called the coupling set.
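For concreteness, Eq. (1) is a finite linear program and can be solved directly at small scale. The following minimal sketch (our own illustration, not the method proposed in this paper; the helper name ot_cost and the toy data are ours) computes W(p, q) with SciPy:

```python
import numpy as np
from scipy.optimize import linprog

def ot_cost(p, q, M):
    """Solve the primal OT linear program of Eq. (1) for small m1, m2."""
    m1, m2 = M.shape
    # Equality constraints: row sums of Z equal p, column sums equal q.
    A_eq = np.zeros((m1 + m2, m1 * m2))
    for i in range(m1):
        A_eq[i, i * m2:(i + 1) * m2] = 1.0   # sum_j Z[i, j] = p[i]
    for j in range(m2):
        A_eq[m1 + j, j::m2] = 1.0            # sum_i Z[i, j] = q[j]
    b_eq = np.concatenate([p, q])
    res = linprog(M.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(m1, m2), res

p = np.array([0.5, 0.5])
q = np.array([0.25, 0.25, 0.5])
M = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0]])
cost, Z, _ = ot_cost(p, q, M)
print(cost)  # optimal transport cost W(p, q)
```

Of course, at the O(m³ log m) scaling discussed above, such a direct LP solve is exactly the bottleneck that the rest of the paper is designed to avoid.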

Now we relate the primal version of (discrete) OT to a variant of its dual version. One may refer to Villani (2003) for the general background of the Kantorovich-Rubenstein duality. In particular, our formulation introduces an auxiliary parameter C_M for the sake of mathematical soundness in defining Boltzmann distributions.

Definition 3.2 (Dual Formulation of OT). Let C_M > 0, denote the vector [g_1, ..., g_{m1}]^T by g, and the vector [h_1, ..., h_{m2}]^T by h. We define the dual domain of OT by

    Ω(M) := { f = [g; h] ∈ R^{m1+m2} :
              −C_M < g_i − h_j ≤ M_{i,j}, 1 ≤ i ≤ m1, 1 ≤ j ≤ m2 } .    (2)

Informally, for a sufficiently large C_M (subject to p, q, M), the LP problem Eq. (1) can be reformulated as¹

    W(p, q) = max_{f ∈ Ω(M)} ⟨p, g⟩ − ⟨q, h⟩ .    (3)

Let the optimum set be Ω*(M). Then any optimal point f* = (g*, h*) ∈ Ω*(M) constructs a (projected) subgradient such that g* ∈ ∂W/∂p and −h* ∈ ∂W/∂q. The main computational difficulty of WLMs comes from the fact that the (projected) subgradient f* is not efficiently solvable.

Note that Ω(M) is an unbounded set in R^{m1+m2}. In order to constrain the feasible set to be bounded, we alternatively define

    Ω_0(M) := { f = [g; h] ∈ Ω(M) | g_1 = 0 } .    (4)

¹ However, for any proper M and strictly positive p, q, there exists a C_M such that the optimal value of the primal problem equals the optimal value of the dual problem. This modification is solely for an ad-hoc treatment of a single OT problem. In general cases of (p, q, M), when C_M is pre-fixed, the solution of Eq. (3) may be suboptimal.
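In practice, the dual potentials of Eq. (3) can be read off from the equality-constraint multipliers of the primal LP. A hedged continuation of the ot_cost sketch above (the mapping h = −β converts the textbook convention α_i + β_j ≤ M_{i,j} into the g_i − h_j ≤ M_{i,j} convention used here; the sign convention of LP duals should be double-checked against the SciPy version in use):

```python
# Dual potentials from the primal LP solved by ot_cost (see Section 3).
# With method="highs", the equality-constraint duals are in res.eqlin.marginals.
_, _, res = ot_cost(p, q, M)
alpha = res.eqlin.marginals[:len(p)]   # multipliers of the row-sum constraints
beta = res.eqlin.marginals[len(p):]    # multipliers of the column-sum constraints
g, h = alpha, -beta                    # now g_i - h_j <= M_ij holds
# The dual objective <p, g> - <q, h> matches the primal cost W(p, q)
# (the potentials are only unique up to a common constant shift, which
# leaves the objective of Eq. (3) invariant).
print(np.dot(p, g) - np.dot(q, h))
```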

One can show that the maximization over Ω(M) in Eq. (3) is equivalent to the maximization over Ω_0(M), because ⟨p, 1_{m1}⟩ = ⟨q, 1_{m2}⟩ = 1, so the objective is invariant to shifting g and h by a common constant and one may normalize g_1 = 0.

4. Simulated Annealing for Optimal Transport via Gibbs Sampling

Following the basic strategy outlined in the seminal paper on simulated annealing (Kirkpatrick et al., 1983), we present the definition of the Boltzmann distribution supported on Ω_0(M) below which, as we will elaborate, links the dual formulation of OT to a Gibbs sampling scheme (Algorithm 1 below).

Definition 4.1 (Boltzmann Distribution of OT). Given a temperature parameter T > 0, the Boltzmann distribution of OT is a probability measure on Ω_0(M) ⊆ R^{m1+m2−1} such that

    p(f; p, q) ∝ exp( (⟨p, g⟩ − ⟨q, h⟩) / T ) .    (5)

It is a well-defined probability measure for an arbitrary finite C_M > 0.

The basic concept behind SA states that samples from the Boltzmann distribution will eventually concentrate at the optimum set of its deriving problem (e.g. W(p, q)) as T → 0. However, since the Boltzmann distribution is often difficult to sample, a practical convergence rate remains mostly unsettled for specific MCMC methods.

Because Ω(M) defined by Eq. (2) (and likewise Ω_0(M)) has a conditional independence structure among its variables, a Gibbs sampler can be naturally applied to the Boltzmann distribution defined by Eq. (5). We summarize this result below.

Proposition 4.1. Given any f = (g; h) ∈ Ω_0(M) and any C_M > 0, we have for any i and j,

    g_i ≤ U_i(h) := min_{1≤j≤m2} ( M_{i,j} + h_j ) ,    (6)
    h_j ≥ L_j(g) := max_{1≤i≤m1} ( g_i − M_{i,j} ) ,    (7)

and

    g_i > L̂_i(h) := max_{1≤j≤m2} ( h_j − C_M ) ,    (8)
    h_j < Û_j(g) := min_{1≤i≤m1} ( g_i + C_M ) .    (9)

Here U_i = U_i(h) and L_j = L_j(g) are auxiliary variables. Suppose f follows the Boltzmann distribution of Eq. (5). Then the g_i's are conditionally independent given h, and likewise the h_j's are conditionally independent given g. Furthermore, it is immediate from Eq. (5) that each of their conditional densities within its feasible region (subject to C_M) satisfies

    p(g_i | h) ∝ exp( g_i p_i / T ) ,   L̂_i(h) < g_i ≤ U_i(h) ,    (10)
    p(h_j | g) ∝ exp( −h_j q_j / T ) ,  L_j(g) ≤ h_j < Û_j(g) ,    (11)

where 2 ≤ i ≤ m1 and 1 ≤ j ≤ m2.

Remark 1. As C_M → +∞, Û_j(g) → +∞ and L̂_i(h) → −∞. For 2 ≤ i ≤ m1 and 1 ≤ j ≤ m2, one can then approximate the conditional probabilities p(g_i | h) and p(h_j | g) by exponential distributions.

By Proposition 4.1, our proposed time-inhomogeneous Gibbs sampler is given in Algorithm 1.

Algorithm 1 Gibbs Sampling for Optimal Transport

Given f^(0) ∈ Ω_0(M), p ∈ ∆_{m1}, q ∈ ∆_{m2}, and T^(1), ..., T^(2N) > 0, for t = 1, ..., N we define the following Markov chain:

1. Randomly sample θ_1, ..., θ_{m2} i.i.d. ∼ Exponential(1). For j = 1, 2, ..., m2, let

       L_j^(t) := max_{1≤i≤m1} ( g_i^(t−1) − M_{i,j} ) ,
       h_j^(t) := L_j^(t) + θ_j · T^(2t−1) / q_j .    (12)

2. Randomly sample θ_1, ..., θ_{m1} i.i.d. ∼ Exponential(1). For i = 1, 2, ..., m1, let

       U_i^(t) := min_{1≤j≤m2} ( M_{i,j} + h_j^(t) ) ,
       g_i^(t) := U_i^(t) − θ_i · T^(2t) / p_i .    (13)
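A minimal NumPy sketch of Algorithm 1 follows (our own illustration, not the authors' C/C++ implementation; the cooling schedule temps is left to the caller, since the paper deliberately leaves it unspecified — see Remark 2 below):

```python
import numpy as np

def gibbs_ot(M, p, q, temps, rng=None, g0=None):
    """Time-inhomogeneous Gibbs sampler of Algorithm 1.

    M: (m1, m2) cost matrix; p, q: marginals; temps: length-2N array of
    temperatures T^(1..2N). Returns the final (g, h, U, L) variables.
    """
    if rng is None:
        rng = np.random.default_rng()
    m1, m2 = M.shape
    g = np.zeros(m1) if g0 is None else g0.copy()   # warm start goes here
    N = len(temps) // 2
    for t in range(N):
        # Step 1 (Eq. 12): resample h given g at temperature T^(2t-1).
        L = np.max(g[:, None] - M, axis=0)          # L_j = max_i g_i - M_ij
        h = L + rng.exponential(size=m2) * temps[2 * t] / q
        # Step 2 (Eq. 13): resample g given h at temperature T^(2t).
        U = np.min(M + h[None, :], axis=1)          # U_i = min_j M_ij + h_j
        g = U - rng.exponential(size=m1) * temps[2 * t + 1] / p
    return g, h, U, L
```

Tracking the running dual objective ⟨p, g⟩ − ⟨q, h⟩ (or its V(z) analogue ⟨p, U⟩ − ⟨q, L⟩, defined below) across calls is what enables the warm starts and the mixing criterion of Section 5.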

Specifically, in Algorithm 1 the variable g_1 is fixed to zero by the definition of Ω_0(M). But we have found in experiments that by calculating U_1^(t) and sampling g_1^(t) in Algorithm 1 according to Eq. (13), one can still generate MCMC samples from Ω(M) such that the energy quantity ⟨p, g⟩ − ⟨q, h⟩ converges to the same distribution as that of MCMC samples from Ω_0(M). Therefore, we will not assume g_1 = 0 from now on and develop the analysis solely for the unconstrained version of Gibbs-OT.

Figure 1 illustrates the behavior of the proposed Gibbs sampler with a cooling schedule at different temperatures. As T decreases along iterations, the 95% percentile band for the sample f becomes thinner and thinner.

Remark 2. Algorithm 1 does not specify the actual cooling schedule, nor does the analysis of the proposed Gibbs sampler in Theorem A.2. We have been agnostic here for a reason. In the SA literature, cooling schedules with guaranteed optimality are often too slow to be useful in practice. To our knowledge, the guaranteed rate of the SA approach is worse than that of the combinatorial solver for OT. As a result, a well-accepted practice of SA for many complicated optimization problems is to empirically adjust cooling schedules, a strategy we take for our experiments.

Remark 3. Although the exact cooling schedule is not specified, we still provide a quantitative upper bound on the chosen temperature T at different iterations in Appendix A, Eq. (24). One can calculate this bound at the cost of m log m at certain iterations to check whether the current temperature is too high for Gibbs-OT to accurately approximate the Wasserstein gradient. In practice, we find this bound helps one quickly select the beginning temperature of the Gibbs-OT algorithm.

Definition 4.2 (Notations for Auxiliary Statistics). Besides the Gibbs coordinates g and h, the Gibbs-OT sampler naturally introduces two auxiliary variables, U and L. Let L^(t) = [L_1^(t), ..., L_{m2}^(t)]^T and U^(t) = [U_1^(t), ..., U_{m1}^(t)]^T. Likewise, denote the collections of g_i^(t) and h_j^(t) by the vectors g^(t) and h^(t) respectively. The following sequence of auxiliary statistics

    [ ..., z^(2t−1), z^(2t), z^(2t+1), ... ]
      := [ ..., (L^(t); U^(t−1)), (L^(t); U^(t)), (L^(t+1); U^(t)), ... ]    (14)

for t = 1, ..., N is also a Markov chain. It can be redefined equivalently by specifying the transition probabilities p(z^(n+1) | z^n) for n = 1, ..., 2N, a.k.a. the conditional p.d.f.s p(U^(t) | L^(t)) for t = 1, ..., N and p(L^(t+1) | U^(t)) for t = 1, ..., N − 1.

One may notice that this alternative representation converts the Gibbs sampler to one whose structure is similar to a hidden Markov model, where the (g, h) chain is conditionally independent given the (U, L) chain and has (factored) exponential emission distributions. We will use this equivalent representation in Appendix A and develop the analysis based on the (U, L) chain accordingly.

Remark 4. We now consider the function

    V(x, y) := ⟨p, x⟩ − ⟨q, y⟩ ,

and define a few additional notations. Let V(U^(t'), L^(t)) be denoted by V(z^(t+t')), where t' = t or t − 1. If g, h are independently resampled according to Eqs. (12) and (13), we have the inequality

    E[ V(g, h) | z^n ] ≤ V(z^n) .

Both V(z) and V(g, h) converge to the exact loss W(p, q) at the equilibrium of the Boltzmann distribution p(f; p, q) as T → 0.²

² The conditional quantity V(z^n) − V(g, h) | z^n is the sum of two Gamma random variables: Gamma(m1, 1/T^(2t)) + Gamma(m2, 1/T^(2t'+1)), where t' = t or t' = t − 1.

5. Gibbs-OT: An Inexact Oracle for WLMs

In this section, we introduce a non-standard SA approach for general WLM problems. The main idea is to replace the standard Boltzmann energy with an asymptotically consistent upper bound, outlined in the previous section.


Figure 1. The Gibbs sampling of the proposed SA method. From left to right: plots of the variables (U, L, f^(1), f^(2)) for solving an illustrative 1D optimal transportation problem with Coulomb cost after (a) 20, (b) 40, and (c) 60 iterations of the inhomogeneous Gibbs sampler. The 95% percentiles of the exponential distributions are marked by the gray area.

Let

    R(θ) := Σ_{i=1}^{|D|} W( p_i(θ), q_i(θ) )

be our prototyped objective function, where D represents a dataset, and p_i, q_i are prototyped probability densities representing the i-th instance. We now discuss how to solve min_{θ∈Θ} R(θ).

To minimize the Wasserstein losses W(p, q) approximately in such WLMs, we propose to instead optimize their asymptotically consistent upper bound E[V(z)] at the equilibrium of the Boltzmann distribution p(f; p, q), using its stochastic gradients: U ∈ ∂V(z)/∂p and −L ∈ ∂V(z)/∂q. Therefore, one can calculate the gradient approximately:

    ∇_θ R ≈ Σ_{i=1}^{|D|} [ J_θ(p_i(θ)) U_i − J_θ(q_i(θ)) L_i ] ,

where J_θ(·) is the Jacobian, and U_i, L_i are computed from Algorithm 1 for the respective problem W(p_i, q_i). Together with the iterative updates of the model parameters θ, one gradually anneals the temperature T. The equilibrium of p(f; p, q) becomes more and more concentrated. We assume the inexact oracle at a relatively higher temperature is adequate for the early updates of the model parameters, but sooner or later it becomes necessary to set T smaller to better approximate the exact loss.

It is well known that the variance of a stochastic gradient usually affects the rate of convergence. The reason to replace V(g, h) with V(z) as the inexact oracle (for some T > 0) is motivated by the same intuition. The variances of the MCMC samples g_i^(t), h_j^(t) of Algorithm 1 can be very large if p_i/T and q_j/T are small, unavoidably making the embedded first-order method inaccurate. But we find the variances of the max/min statistics U_i^(t), L_j^(t) are much smaller. Fig. 1 shows an example. The bias introduced in the replacement is also well controlled by decreasing the temperature parameter T.

For the sake of efficiency, we use a very simple convergence diagnostic in the practice of Gibbs-OT. We check the values of V(z^(2t)): a Markov chain is roughly considered mixed if every τ iterations the quantity V(z^(2t)) (almost) stops increasing (τ = 5 by default); say, if for some t,

    V(z^(2t)) − V(z^(2(t−τ))) < 0.01 τ T · V(z^(2t)) ,

we terminate the Gibbs iterations.
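This diagnostic is easy to bolt onto the sampler sketched in Section 4; a minimal helper (names and defaults ours, mirroring the criterion above) might look like:

```python
def mixed(V_hist, t, T, tau=5, c=0.01):
    """Mixing test on the V(z^{2t}) trace of the Gibbs-OT sampler.

    V_hist: list of V(z^{2s}) = <p, U^(s)> - <q, L^(s)> values, s = 1..t.
    Declares mixing once V stops increasing by more than a c*tau*T fraction
    over the last tau sweeps.
    """
    if t <= tau:
        return False
    return V_hist[t - 1] - V_hist[t - 1 - tau] < c * tau * T * V_hist[t - 1]
```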
6. Applications of Gibbs-OT

6.1. Toy OT Examples

1D Case with Euclidean Cost. We first illustrate the differences between the approximate primal solutions computed by different methods by replicating a toy example from (Benamou et al., 2015). The toy example calculates the OT between two 1D two-mode distributions. We visualize their solved couplings as 2D images in Fig. 2 at budgets of different numbers of iterations. Given their different convergence behaviors, when one wants to compromise by using pre-converged primal solutions in WLMs, he or she has to account for the different results computed by different numerical methods, even though they all aim at the Wasserstein loss.

As a note, Sinkhorn, B-ADMM and Gibbs-OT share the same computational complexity per iteration. The difference in their actual CPU times comes from the different arithmetic operations used. B-ADMM may be the slowest because it requires log() and exp() operations. When memory efficiency is of concern, both the implementations of Sinkhorn and Gibbs-OT can be modified to take only O(m1 + m2) additional memory besides the space for caching the cost matrix M.

Figure 2. A simple example of OT between two 1D distributions: the solutions by Iterative Bregman Projection (IBP), B-ADMM, and Gibbs-OT are shown in pink, while the exact solution by linear programming is shown in green. Rows from top to bottom present results at iterations {1, 10, 50, 200, 1000, 5000}. The left three columns are by IBP with ε = {0.1/N, 0.5/N, 2/N}, where [0, 1] is discretized with N = 128 uniformly spaced points. The fourth column is by B-ADMM (with default parameter τ_0 = 2.0). The last column is the proposed Gibbs-OT, with a geometric cooling schedule. With a properly selected cooling schedule, one can achieve fast convergence of the OT solution without compromising much solution quality.

Two Electrons with Coulomb Cost in DFT. In quantum mechanics, the Coulomb cost (or electron-electron Coulomb repulsion) is an important energy functional in Density Functional Theory (DFT). Numerical methods that solve the multi-marginal OT problem with unbounded costs remain an open challenge in DFT (Benamou et al., 2016). We consider two uniform densities on the 1D domain [0, 1] with Coulomb cost c(x, y) = 1/|x − y|, which has analytic solutions. The Coulomb cost is different from the usual metric costs in the OT literature: it is unbounded and singular at x = y. As observed in (Benamou et al., 2016), the entropic regularized primal solution becomes more concentrated at the boundaries, which is not physically plausible. This effect is not observed in the Gibbs-OT solution, as shown in Appendix Fig. 3. As shown by Fig. 1, the variables U, L in the computation always stay in a bounded range (with an overwhelming probability), thus the algorithm does not endure any numerical difficulties.

For entropic regularization (Benamou et al., 2015; 2016), we empirically select the minimal ε which does not cause numerical overflow before 5000 iterations (in which ε = 0.5/N). For Gibbs-OT, we use a geometric temperature scheme such that T = 2.0 (1/l⁴)^{n/l} / N at the n-th iteration, where l is the max iteration number. For the unbounded Coulomb cost, Bregman ADMM (Wang & Banerjee, 2014) does not converge to a solution close to the true optimum.

6.2. Wasserstein NMF

We now illustrate how the proposed Gibbs-OT can be used as a ready-to-plugin inexact oracle for a typical WLM — Wasserstein NMF (Sandler & Lindenbaum, 2009; Rolet et al., 2016). The data parallelization of this framework is natural because the Gibbs-OT samplers subject to different instances are independent.

Problem Formulation. Given a set of discrete probability measures {Φ_i}_{i=1}^n (data) over R^d, we want to estimate a model Θ = {Ψ_k}_{k=1}^K such that for each Φ_i there exists a membership vector β^(i) ∈ ∆_K with Φ_i ≈ Σ_{k=1}^K β_k^(i) Ψ_k, where each Ψ_k is again a discrete probability measure to be estimated. Therefore, Wasserstein NMF reads

    min_{Θ,Ξ} Σ_{i=1}^n W( Φ_i, Σ_{k=1}^K β_k^(i) Ψ_k ) ,

where Ξ = (β^(1), ..., β^(n)) is the collection of membership vectors, and W is the Wasserstein distance. One can write the problem by plugging Eq. (3) into the dual formulation:

    min_{Θ,Ξ} max_{F={f_i}_{i=1}^n} Σ_{i=1}^n [ ⟨ŵ^(i), g_i⟩ − ⟨w^(i), h_i⟩ ]    (15)
    s.t.  Ψ_k = Σ_{i=1}^m v_i^(k) δ_{x_i} ,    (16)
          Φ̂^(i) = Σ_{k=1}^K β_k^(i) Ψ_k ,    (17)
          f_i ∈ Ω( M(Φ̂^(i), Φ_i) ) ,    (18)

where ŵ^(i) ∈ ∆_m is the weight vector of the discrete probability measure Φ̂^(i), and w^(i) ∈ ∆_{mi} is the weight vector of Φ^(i). M(·, ·) denotes the transportation cost matrix between the supports of two measures. The global optimization solves for all three sets of variables (Θ, Ξ, F). In the sequel, we assume the support points of {Ψ_k}_{k=1}^K — {x_i}_{i=1}^m — are shared and pre-fixed.

Algorithm. At every epoch, one updates the variables either sequentially (indexed by i) or all together. This is done by first executing the Gibbs-OT oracle subject to the i-th instance and then updating v^(k) and the membership vector β^(i) accordingly at a chosen step size γ > 0. At the end of each epoch, the temperature parameter T is adjusted as T := T (1 − 1/√(m + m̄)), where m̄ = (1/n) Σ_{i=1}^n m_i. For each instance i, the algorithm proceeds with the following steps iteratively (a code sketch follows the list):

Figure 3. The recovered primal solutions for two uniform 1D distributions with Coulomb cost. The approximate solutions are shown in pink, while the exact solution by linear programming is shown in green. Top row: entropic regularization with ε = 0.5/N. Bottom row: Gibbs-OT. Images from left to right present results at different max iterations {1, 10, 50, 200, 1000, 2000, 5000}.

1. Initiate from the last computed U/L sample subject to instance i, execute the Gibbs-OT sampler at constant temperature T until the mixing criterion is met, and get U_i.

2. For k = 1, ..., K, update v^(k) ∈ ∆_m based on the gradient β_k^(i) U_i, using the iterates of online mirror descent (MD) subject to the step size γ (Beck & Teboulle, 2003).

3. Also update the membership vector β^(i) ∈ ∆_K based on the gradient (⟨v^(1), U_i⟩, ..., ⟨v^(K), U_i⟩)^T, using the iterates of accelerated mirror descent (AMD) with restarts subject to the same step size γ (Krichene et al., 2015).
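As a hedged illustration of steps 2-3 above — a minimal entropic mirror descent update on the simplex (exponentiated gradient), not the authors' exact implementation, and with the AMD restart logic of Krichene et al. (2015) omitted:

```python
import numpy as np

def md_simplex_step(x, grad, gamma):
    """One entropic mirror descent step on the probability simplex,
    as used for the v^(k) and beta^(i) updates."""
    y = x * np.exp(-gamma * grad)
    return y / y.sum()

# Step 2: v[k] <- MD step with gradient beta[i, k] * U_i.
# Step 3: beta[i] <- MD step with gradient (<v[1],U_i>, ..., <v[K],U_i>).
# for k in range(K):
#     v[k] = md_simplex_step(v[k], beta[i, k] * U_i, gamma)
# beta[i] = md_simplex_step(beta[i], V_mat @ U_i, gamma)  # V_mat rows = v[k]
```

The multiplicative form keeps the iterates strictly inside the simplex, which is why MD is the natural first-order method for both update steps.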

We note that the practical speed-up we achieved via the above procedure is due to the warm-start feature in Step 1. If one uses a black-box OT solver, this dimension of speed-up is not viable.

Results. We investigate the empirical convergence of the proposed Wasserstein NMF method on two datasets: one is a subset of the MNIST handwritten digit images containing 200 digits of "5", and the other is the ORL 400-face dataset. Our results are based on a C/C++ implementation with vectorization. In particular, we set K = 40, γ = 2.0 for both datasets. The learned components are visualized together with those of alternative approaches (smoothed W-NMF (Rolet et al., 2016) and regular NMF) in Appendix Figs. 4 and 5. From these figures, we observe that our components learned using Gibbs-OT are sharper than those of the smoothed W-NMF. This can be explained by the fact that Gibbs-OT can potentially push for a higher quality of approximation by gradually annealing the temperature. We also observe that the learned components might possess some salt-and-pepper noise. This is because the Wasserstein distance by definition is not very sensitive to subpixel displacements. On a single core of a 3.3 GHz Intel Core i5 CPU, the average times spent on each epoch for these two datasets are 0.84 seconds and 16.8 seconds, respectively. This is about two orders of magnitude faster than fully solving all OTs via a commercial LP solver.³

³ We use the specialized network flow solver in Mosek (https://www.mosek.com) for the computation, which is found to be faster than general simplex or IPM solvers at moderate problem scales.

7. Discussions

The solution of the primal OT (Monge-Kantorovich) problem has many direct interpretations, where the solved transport is a coupling between two measures. Hence, it can be well motivated to regularize the solution on the primal domain in those problems (Cuturi, 2013). Meanwhile, the solution of the dual OT can be meaningful in its own right. For instance, in finance, the dual solution is directly interpreted as the vanilla prices implementing robust static super-hedging strategies. The entropy regularized OT, under the Fenchel-type dual, provides a smoothed unconstrained dual problem, as shown in (Cuturi & Peyré, 2016). In this paper, we develop Gibbs-OT, whose solutions respect the dual feasibility of OT and are subject to a different regularization effect, as explained by (Abernethy & Hazan, 2015). It is a numerically stable and computationally suitable oracle to handle WLMs.

Acknowledgement. This material is based upon work supported by the National Science Foundation under Grant Nos. ECCS-1462230 and DMS-1521092. The authors would also like to thank the anonymous reviewers for their valuable comments.

Figure 4. NMF components learned by different methods (K = 40) on the 200 digit "5" images. Top: regular NMF; Middle: W-NMF with entropic regularization (ε = 1/100, ρ1 = ρ2 = 1/200); Bottom: W-NMF using Gibbs-OT. It is observed that the components of W-NMF with entropic regularization are smoother than those optimized with Gibbs-OT.

Figure 5. NMF components learned by different methods (K = 40) on the ORL face images. Top: regular NMF; Middle: W-NMF with entropic regularization (ε = 1/100, ρ1 = ρ2 = 1/200); Bottom: W-NMF using Gibbs-OT, in which salt-and-pepper noise is observed due to the fact that the Wasserstein distance is insensitive to subpixel mass displacement (Cuturi & Peyré, 2016).

References

Abernethy, Jacob and Hazan, Elad. Faster convex optimization: Simulated annealing with an efficient universal barrier. arXiv preprint arXiv:1507.02528, 2015.

Agueh, Martial and Carlier, Guillaume. Barycenters in the Wasserstein space. SIAM J. Math. Analysis, 43(2):904–924, 2011.

Arjovsky, Martin, Chintala, Soumith, and Bottou, Léon. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

Beck, Amir and Teboulle, Marc. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.

Benamou, Jean-David, Carlier, Guillaume, Cuturi, Marco, Nenna, Luca, and Peyré, Gabriel. Iterative Bregman projections for regularized transportation problems. SIAM J. on Scientific Computing, 37(2):A1111–A1138, 2015.

Benamou, Jean-David, Carlier, Guillaume, and Nenna, Luca. A numerical method to solve multi-marginal optimal transport problems with Coulomb cost. In Splitting Methods in Communication, Imaging, Science, and Engineering, pp. 577–601. Springer, 2016.

Bonneel, Nicolas, Peyré, Gabriel, and Cuturi, Marco. Wasserstein barycentric coordinates: Histogram regression using optimal transport. ACM Trans. on Graphics, 35(4), 2016.

Corana, Angelo, Marchesi, Michele, Martini, Claudio, and Ridella, Sandro. Minimizing multimodal functions of continuous variables with the simulated annealing algorithm. ACM Trans. on Mathematical Software, 13(3):262–280, 1987.

Cuturi, Marco. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pp. 2292–2300, 2013.

Cuturi, Marco and Doucet, Arnaud. Fast computation of Wasserstein barycenters. In Proc. Int. Conf. Machine Learning, pp. 685–693, 2014.

Cuturi, Marco and Peyré, Gabriel. A smoothed dual approach for variational Wasserstein problems. SIAM J. on Imaging Sciences, 9(1):320–343, 2016.

Frogner, Charlie, Zhang, Chiyuan, Mobahi, Hossein, Araya, Mauricio, and Poggio, Tomaso A. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, pp. 2044–2052, 2015.

Genevay, Aude, Cuturi, Marco, Peyré, Gabriel, and Bach, Francis. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems 29, pp. 3440–3448, 2016.

Kirkpatrick, Scott, Gelatt, C. Daniel, Jr, and Vecchi, Mario P. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

Krichene, Walid, Bayen, Alexandre, and Bartlett, Peter L. Accelerated mirror descent in continuous and discrete time. In Advances in Neural Information Processing Systems, pp. 2827–2835, 2015.

Kusner, Matt, Sun, Yu, Kolkin, Nicholas, and Weinberger, Kilian. From word embeddings to document distances. In Proc. of the Int. Conf. on Machine Learning, pp. 957–966, 2015.

Li, Jia and Wang, James Z. Real-time computerized annotation of pictures. IEEE Trans. Pattern Analysis and Machine Intelligence, 30(6):985–1002, 2008.

Monge, Gaspard. Mémoire sur la théorie des déblais et des remblais. De l'Imprimerie Royale, 1781.

Montavon, Grégoire, Müller, Klaus-Robert, and Cuturi, Marco. Wasserstein training of restricted Boltzmann machines. In Advances in Neural Information Processing Systems 29, pp. 3711–3719, 2016.

Orlin, James B. A faster strongly polynomial minimum cost flow algorithm. Operations Research, 41(2):338–350, 1993.

Rolet, Antoine, Cuturi, Marco, and Peyré, Gabriel. Fast dictionary learning with a smoothed Wasserstein loss. In AISTATS, 2016.

Sandler, Roman and Lindenbaum, Michael. Nonnegative matrix factorization with earth mover's distance metric. In Proc. of the Conf. on Computer Vision and Pattern Recognition, pp. 1873–1880. IEEE, 2009.

Seguy, Vivien and Cuturi, Marco. Principal geodesic analysis for probability measures under the optimal transport metric. In Advances in Neural Information Processing Systems, pp. 3294–3302, 2015.

Villani, Cédric. Topics in Optimal Transportation. Number 58. American Mathematical Soc., 2003.

Wang, Huahua and Banerjee, Arindam. Bregman alternating direction method of multipliers. In Advances in Neural Information Processing Systems, pp. 2816–2824, 2014.

Ye, Jianbo and Li, Jia. Scaling up discrete distribution clustering using ADMM. In Proc. of the Int. Conf. on Image Processing, pp. 5267–5271. IEEE, 2014.

Ye, Jianbo, Li, Yanran, Wu, Zhaohui, Wang, James Z., Li, Wenjie, and Li, Jia. Determining gains acquired from word embedding quantitatively using discrete distribution clustering. In Proc. of the Annual Meeting of the Association for Computational Linguistics, 2017a.

Ye, Jianbo, Wu, Panruo, Wang, James Z., and Li, Jia. Fast discrete distribution clustering using Wasserstein barycenter with sparse support. IEEE Trans. on Signal Processing, 65(9):2317–2332, 2017b.

A. Theoretical Properties of Gibbs-OT

We develop quantitative concentration bounds for Gibbs-OT in a finite number of iterations in order to understand the relationship between the temperature schedule and the concentration progress. The analysis also guides us in adjusting the cooling schedule on the fly, as will be shown. Proofs are provided in Appendix B.

Preliminaries. Before characterizing the properties of Gibbs-OT via Definition 4.2, we first give the analytic expression for p(z^(n+1) | z^n). Let G(·) : [−∞, ∞] → [0, 1] be the c.d.f. of the negative of a standard exponential random variable, i.e., G(x) = min{exp(x), 1}. Because L_j^(t+1) < x by definition ⇔ ∀i, g_i^(t) − M_{i,j} < x, the c.d.f. of L_j^(t+1) | U^(t) reads

    Pr( L_j^(t+1) < x | U^(t) ) = Π_{i=1}^{m1} G( (x − U_i^(t) + M_{i,j}) p_i / T^(2t) ) .

Likewise, the c.d.f. of U_i^(t) | L^(t) satisfies

    Pr( U_i^(t) ≥ x | L^(t) ) = Π_{j=1}^{m2} G( (M_{i,j} + L_j^(t) − x) q_j / T^(2t−1) ) .

With some calculation, the following can be shown. As a note, this lemma provides an intermediate result whose main purpose is to lay down the definitions of φ_j^(t) and ψ_i^(t), which are then used in defining O(z, T) (Eq. (21)) and r^n (Eq. (23)) and in Theorem A.2.

Lemma A.1. (i) Given 1 ≤ j ≤ m2 and 1 ≤ t ≤ N, let the sorted index of {U_i^(t) − M_{i,j}}_{i=1}^{m1} be the permutation {σ(i)}_{i=1}^{m1} such that the sequence {U_σ(i)^(t) − M_{σ(i),j}}_{i=1}^{m1} is monotonically non-increasing. Define the auxiliary quantity

    φ_j^(t) := Σ_{k=1}^{m1} [ (1 − μ_k) Π_{i=1}^{k−1} μ_i ] / [ Σ_{i=1}^{k} p_σ(i) ] ,    (19)

where

    1 ≥ μ_i := exp{ ( Σ_{k=1}^{i} p_σ(k) / T^(2t) ) [ (U_σ(i+1)^(t) − M_{σ(i+1),j}) − (U_σ(i)^(t) − M_{σ(i),j}) ] }

for i = 1, ..., m1 − 1, and μ_{m1} := 0. Then the conditional expectation

    E[ L_j^(t+1) | U^(t) ] = U_σ(1)^(t) − M_{σ(1),j} − φ_j^(t) T^(2t) .

In particular, we denote σ(1) by I_j^t or I(j, t).

(ii) Given 1 ≤ i ≤ m1 and 1 ≤ t ≤ N, let the sorted index of {M_{i,j} + L_j^(t)}_{j=1}^{m2} be the permutation {σ(j)}_{j=1}^{m2} such that the sequence {M_{i,σ(j)} + L_σ(j)^(t)}_{j=1}^{m2} is monotonically non-decreasing. Define the auxiliary quantity

    ψ_i^(t) := Σ_{k=1}^{m2} [ (1 − λ_k) Π_{j=1}^{k−1} λ_j ] / [ Σ_{j=1}^{k} q_σ(j) ] ,    (20)

where

    1 ≥ λ_k := exp{ ( Σ_{j=1}^{k} q_σ(j) / T^(2t−1) ) [ (M_{i,σ(k)} + L_σ(k)^(t)) − (M_{i,σ(k+1)} + L_σ(k+1)^(t)) ] }

for k = 1, ..., m2 − 1, and λ_{m2} := 0. Then the conditional expectation

    E[ U_i^(t) | L^(t) ] = M_{i,σ(1)} + L_σ(1)^(t) + ψ_i^(t) T^(2t−1) .

In particular, we denote σ(1) by J_i^t or J(i, t).

We note that the calculation of Eq. (19) and Eq. (20) needs O(m1 log m1) and O(m2 log m2) time, respectively. With a few additional calculations, we introduce the notation O(·, ·):

    O(z^(2t), T^(2t)) := E[ ⟨q, L^(t)⟩ − ⟨q, L^(t+1)⟩ | U^(t), L^(t) ]
        = Σ_{j=1}^{m2} ( M_{I_j^t, j} + L_j^(t) − U_{I_j^t}^(t) + φ_j^(t) T^(2t) ) q_j ,

    O(z^(2t−1), T^(2t−1)) := E[ ⟨p, U^(t)⟩ − ⟨p, U^(t−1)⟩ | U^(t−1), L^(t) ]
        = Σ_{i=1}^{m1} ( M_{i, J_i^t} + L_{J_i^t}^(t) − U_i^(t−1) + ψ_i^(t) T^(2t−1) ) p_i .    (21)

Note that O(z^n, T^(n)) = E[ V(z^(n+1)) − V(z^n) | z^n ].

Recovery of Approximate Primal Solution. An approximate (m1 + m2)-sparse primal solution can be recovered⁴ from z^n at n = 2t by

    Z ≈ (1/2) sparse(1:m1, J(1:m1, t), p) + (1/2) sparse(I(1:m2, t), 1:m2, q) ∈ R^{m1×m2} .    (22)

⁴ The sparse(·, ·, ·) function is written in the syntax of MATLAB: http://www.mathworks.com/help/matlab/ref/sparse.html
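Eq. (22) is straightforward to reproduce outside MATLAB; a hedged SciPy equivalent (array names ours, taking the index vectors I(·, t) and J(·, t) from Lemma A.1 as inputs):

```python
import numpy as np
from scipy.sparse import coo_matrix

def recover_primal(p, q, J_idx, I_idx):
    """Approximate (m1+m2)-sparse primal coupling of Eq. (22).

    J_idx[i] = J(i, t): argmin_j of M[i, :] + L^(t); I_idx[j] = I(j, t).
    """
    m1, m2 = len(p), len(q)
    Z1 = coo_matrix((p, (np.arange(m1), J_idx)), shape=(m1, m2))
    Z2 = coo_matrix((q, (I_idx, np.arange(m2))), shape=(m1, m2))
    return 0.5 * Z1 + 0.5 * Z2
```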

Concentration Bounds. We are interested in the concentration bound related to V(z^n) because it replaces the true Wasserstein loss in WLMs. Given U^(0) (i.e., z^1 is implied), for n = 1, ..., 2N, we let

    r^n := V(z^n) − Σ_{s=1}^{n−1} O(z^s, T^(s)) .    (23)

This is crucial for one who wants to know whether the cooling schedule is too fast to secure suboptimality within a finite budget of iterations. The following Theorem A.2 relates the gap V(z^(2N)) − V(z^1) to the quantitative term Σ_{s=1}^{2N−1} O(z^s, T^(s)), the latter being the sum of a computable sequence. We see that O(z^s, T^(s)) = E[ V(z^(s+1)) − V(z^s) | z^s ] = 0 if and only if T^(s) = T(z^s), where

    T(z^s) :=
      − (1/⟨φ^(t), q⟩) Σ_{j=1}^{m2} q_j [ M_{I_j^t, j} + L_j^(t) − U_{I_j^t}^(t) ]     if s = 2t ,
      − (1/⟨ψ^(t), p⟩) Σ_{i=1}^{m1} p_i [ M_{i, J_i^t} + L_{J_i^t}^(t) − U_i^(t−1) ]    if s = 2t − 1 .    (24)

In the practice of Gibbs-OT, choosing the proper cooling schedule for a specific WLM needs trial and error. Here we present a heuristic: the temperature T^(s) is often chosen and adapted around η T(z^s), where η ∈ [0.1, 0.9]. We have two concerns regarding the choice of temperature T. First, in a WLM the cost V(z) is to be gradually minimized, hence a temperature T smaller than T(z^s) at every iteration ensures that the cost is actually decreased in expectation, i.e., E[V(z^n) − V(z^1)] < 0. Second, if T is too small, it takes many iterations to reach a highly accurate equilibrium, which might not be necessary for a single outer-level step of parameter update.

Theorem A.2 (Concentration bounds for finite-time Gibbs-OT). First, r^n (by definition) is a martingale subject to the filtration of z^1, ..., z^n. Second, given ε ∈ (0, 1), for n = 1, ..., 2N − 1, suppose we choose the temperature schedule T^(1), ..., T^(2N) such that (i) C^n · T^(n) ≤ a_n, or (ii) ∃γ > 0, log( 2N max{m1, m2} / ε ) · T^(n) + D^n ≤ γ a_n, where {a_n ≥ 0} is a pre-determined array. Here, for t = 1, ..., N,

    C^(2t−1) := ⟨ψ^(t), p⟩ ,  C^(2t) := ⟨φ^(t), q⟩ ,    (25)
    D^(2t−1) := Σ_{i=1}^{m1} p_i R( M_{i,·} + (L^(t))^T ; q ) ,
    D^(2t) := Σ_{j=1}^{m2} q_j R( M_{·,j} − U^(t) ; p ) ,

where M_{i,·} and M_{·,j} represent the i-th row and j-th column of the matrix M respectively, ψ^(t) and φ^(t) are defined in Lemma A.1, and the regret function is R(x; w) := Σ_{i=1}^{m} w_i x_i − min_{1≤i≤m} x_i for any w ∈ ∆_m and x ∈ R^m. Then for any K > 0, we have

    Pr( r^(2N) ≥ r^1 + γK ) ≤ exp( − K² / (2 Σ_{n=1}^{2N−1} a_n²) ) + ε .    (26)

Remark 5. The bound obtained is a quantitative Hoeffding-type bound, not a bound that guarantees contraction around the true solution of the dual OT. Nevertheless, we argue that this bound is still useful in investigating the proposed Gibbs sampler when the temperature is not annealed to zero. Particularly, the bound holds for cooling schedules in general, i.e., it is more broadly applicable than a bound for one specific schedule. There has long been a gap between the practice and the theory of SA despite its wide usage. Our result likewise falls short of a firm theoretical guarantee from the optimization perspective, as with the usual applications of SA.

B. Proof of Lemmas and Theorem

The minimum of n independent exponential random variables with different parameters has a computable formula for its expectation. This result immediately lays out the proof of Lemma A.1.

Lemma B.1. Suppose we have n independent exponential random variables e_i whose c.d.f.s are given by f_i(x) = min{exp(ω_i(x − z_i)), 1}. Without loss of generality, we assume z_1 ≥ z_2 ≥ ... ≥ z_n. Let z_{n+1} := −∞ and

    h_i := exp[ Σ_{j=1}^{i} ω_j (z_{i+1} − z_i) ] ≤ 1   (with h_n = 0 and z_{n+1} h_n = 0) .

Then we have

    E[ max{e_1, ..., e_n} ] = z_1 − Σ_{i=1}^{n} [ (1 − h_i) Π_{j=1}^{i−1} h_j ] / [ Σ_{j=1}^{i} ω_j ] .

Proof. The c.d.f. of max{e_1, ..., e_n} is F(x) = Π_{i=1}^{n} f_i(x), which is piecewise smooth on the intervals (z_{i+1}, z_i). We want to calculate ∫_{−∞}^{∞} x dF(x):

    ∫_{−∞}^{∞} x dF(x)
      = Σ_{i=1}^{n} ∫_{z_{i+1}}^{z_i} x dF(x) + 0
      = Σ_{i=1}^{n} ∫_{z_{i+1}}^{z_i} x d exp( Σ_{j=1}^{i} ω_j (x − z_j) )
      = Σ_{i=1}^{n} ∫_{z_{i+1}}^{z_i} ( Σ_{j=1}^{i} ω_j ) x exp( Σ_{j=1}^{i} ω_j (x − z_j) ) dx
      = Σ_{i=1}^{n} [ ( z_i − 1/Σ_{j=1}^{i} ω_j ) exp( Σ_{j=1}^{i} ω_j (z_i − z_j) )
                    − ( z_{i+1} − 1/Σ_{j=1}^{i} ω_j ) exp( Σ_{j=1}^{i} ω_j (z_{i+1} − z_j) ) ]
      = Σ_{i=1}^{n} [ (z_i − z_{i+1} h_i) − (1 − h_i) / Σ_{j=1}^{i} ω_j ] Π_{j=1}^{i−1} h_j
      = Σ_{i=1}^{n} ( z_i Π_{j=1}^{i−1} h_j − z_{i+1} Π_{j=1}^{i} h_j )
        − Σ_{i=1}^{n} (1 − h_i) Π_{j=1}^{i−1} h_j / Σ_{j=1}^{i} ω_j
      = z_1 − Σ_{i=1}^{n} (1 − h_i) Π_{j=1}^{i−1} h_j / Σ_{j=1}^{i} ω_j ,

where the last step telescopes using z_{n+1} h_n = 0. ∎

Therefore, Lemma A.1 is proved up to trivial calculation using the above Lemma B.1. In order to further prove Lemma B.3, we also have (by the definition of F(x)):

Lemma B.2. Subject to the setup of Lemma B.1, we also have

    max{e_1, ..., e_n} ≤ z_1 ,

and for each n,

    F(x) ≤ min{ exp( Σ_{i=1}^{n} ω_i (x − z*) ), 1 } ,   −∞ < x < ∞ ,

where z* = Σ_{i=1}^{n} ω_i z_i / Σ_{i=1}^{n} ω_i.

Therefore, based on the observation of Lemma B.2, the tail probability Pr(max{e_1, ..., e_n} < x) is upper bounded by the corresponding probability of a single exponential random variable, which leads us to the proof of Lemma B.3.

Lemma B.3. Note that Eq. (21) implies E[ r^(n+1) − r^n | z^1, ..., z^n ] = 0 for n = 1, ..., 2N. Therefore, {r^n} is a (discrete-time) martingale subject to the filtration of {z^n}. (Recall the notation of Eq. (14).) Moreover, we have the following two bounds. First, we can establish the left-hand side bound for {r^(n+1) − r^n}_{n=1}^{2N−1}:

    r^n − r^(n+1) ≤ C^n · T^(n) ,    (27)

where for t = 1, ..., N, C^(2t−1) := ⟨ψ^(t), p⟩ and C^(2t) := ⟨φ^(t), q⟩. Second, we also have a bound on the right-hand side: for any 1 > ε > 0,

    Pr( ∃ n ∈ {1, ..., 2N} s.t. r^(n+1) − r^n
        ≥ log( 2N max{m1, m2} / ε ) · T^(n) + D^n | z^1, ..., z^n ) ≤ ε ,    (28)

where for t = 1, ..., N,

    D^(2t−1) := Σ_{i=1}^{m1} p_i R( M_{i,·} + (L^(t))^T ; q ) ,    (29)
    D^(2t) := Σ_{j=1}^{m2} q_j R( M_{·,j} − U^(t) ; p ) ,    (30)

where M_{i,·} and M_{·,j} represent the i-th row and j-th column of the matrix M, respectively.

Proof. On one hand, because for each i ∈ {1, ..., m1}, U_i^(t) | L^(t) is lower bounded by M_{i,J(i,t)} + L_{J(i,t)}^(t) (Lemma B.2), and for each j ∈ {1, ..., m2}, L_j^(t) | U^(t−1) is upper bounded by U_{I(j,t)}^(t−1) − M_{I(j,t),j} (Lemma B.2), we easily have (by definition) that r^(n+1) | z^1, ..., z^n is lower bounded by r^n − C^n · T^(n).

On the other hand, if r^(n+1) − r^n ≥ log(1/ε_0) · T^(n) + D^n given z^1, ..., z^n for some ε_0 > 0, then at least one of the U_i^(t) (or L_j^(t)) violates the bound log(1/ε_0) · T^(n) + R( M_{i,·} + (L^(t))^T ; q ) (or log(1/ε_0) · T^(n) + R( M_{·,j} − U^(t) ; p )), whose probability, using Lemma B.2, is shown to be less than ε_0. Therefore, we have for each n

    Pr( r^(n+1) − r^n ≥ log(1/ε_0) · T^(n) + D^n | z^1, ..., z^n ) ≤ max{m1, m2} ε_0 ,    (31)

and

    Pr( ∃ n, r^(n+1) − r^n ≥ log(1/ε_0) · T^(n) + D^n | z^1, ..., z^n ) ≤ 2N max{m1, m2} ε_0 .    (32)

Letting ε = 2N max{m1, m2} ε_0 concludes the result. ∎
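Lemma B.1 is easy to sanity-check numerically; a hedged Monte Carlo snippet (entirely ours, using the fact that a variable with c.d.f. min{exp(ω(x − z)), 1} is z minus a rate-ω exponential):

```python
import numpy as np

rng = np.random.default_rng(0)
z = np.array([1.0, 0.5, -0.2])   # z_1 >= z_2 >= z_3
w = np.array([2.0, 1.0, 3.0])    # rates omega_i
samples = z - rng.exponential(size=(10**6, 3)) / w
mc = samples.max(axis=1).mean()
# Closed form of Lemma B.1.
ze = np.append(z, -np.inf)
h = np.exp(np.cumsum(w) * (ze[1:] - z))          # h_n becomes exp(-inf) = 0
closed = z[0] - np.sum((1 - h) * np.concatenate(([1.0], np.cumprod(h)[:-1]))
                       / np.cumsum(w))
print(mc, closed)  # the two should agree to roughly 3 decimals
```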

Given Lemma B.3, we can prove Theorem A.2 by applying the classical Azuma inequality for the left-hand side bound, and applying one of its extensions (Proposition 34 in Tao & Vu (2015)) for the right-hand side bound. Remark that Theorem A.2 is about a single OT. For multiple different OTs that share the same temperature schedule, one can obtain asymptotic bounds via the Law of Large Numbers, due to the fact that their Gibbs samplers are independent of one another. Let R̄^n = (1/S) Σ_{k=1}^{S} r_k^n, where r_k^n is defined by Eq. (23) for sample k. Since for any ε > 0 one has P( |R̄^(n+1) − R̄^n| > ε ) → 0 as S → ∞, one can obtain the asymptotic concentration bound for R̄^(2N): for any ε_1, ε_2 > 0, there exists an S such that

    P( |R̄^(2N) − R̄^1| > ε_1 ) ≤ exp( − 1 / (2N ε_2²) ) .

Tao, Terence and Vu, Van. Random matrices: Universality of local spectral statistics of non-Hermitian matrices. The Annals of Probability, 43(2):782–874, 2015.