
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

Approximate Optimal Transport for Continuous Densities with Copulas

Jinjin Chi1,2, Jihong Ouyang1,2, Ximing Li1,2∗, Yang Wang3 and Meng Wang3
1 College of Computer Science and Technology, Jilin University, China
2 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, China
3 School of Computer Science and Information Engineering, Hefei University of Technology, China
[email protected], [email protected], [email protected], [email protected], [email protected]
∗Corresponding Author

Abstract

Optimal Transport (OT) formulates a powerful framework for comparing probability distributions, and it has increasingly attracted great attention within the machine learning community. However, it suffers from a severe computational burden, due to the intractable objective with respect to the distributions of interest. In particular, there still exist very few attempts for continuous OT, i.e., OT for comparing continuous densities. To this end, we develop a novel continuous OT method, namely Copula OT (Cop-OT). The basic idea is to transform the primal objective of continuous OT into a tractable form with respect to the copula parameter, which can be efficiently solved by stochastic optimization with less time and memory requirements. Empirical results on synthetic data and the real application of image retrieval demonstrate that our Cop-OT can gain more accurate approximations to continuous OT values than the state-of-the-art baselines.

1 Introduction

Optimal Transport (OT), a powerful distance for comparing probability distributions, has recently attracted great attention in the machine learning area [Rubner et al., 2000; Pele and Werman, 2009; Ni et al., 2009; Courty et al., 2017; Arjovsky et al., 2017; Wang et al., 2018; Li et al., 2019]. In contrast to other popular distances between distributions, e.g., the KL-divergence, (1) OT is more effective at describing the geometry of distributions by considering the spatial location of the distribution modes [Villani, 2003]; (2) it has also gained superior performance in real applications, such as image retrieval [Pele and Werman, 2009], image segmentation [Ni et al., 2009] and generative adversarial networks [Arjovsky et al., 2017], etc.

The motivation of OT begins with the Monge problem [Monge, 1781]. That is, for any two D-dimensional probability distributions, denoted as p(x) and q(y) over metric spaces X and Y, it consists of finding an optimal map f(x): X → Y that transports the mass of p(x) to q(y) with minimum cost, given a predefined cost function c(x, y): X × Y → R+:

    min_{f} ∫_X c(x, f(x)) dp(x)
    s.t. f(x) ∼ q(y)

One concern with the Monge problem is that a map f(x) satisfying the constraint may not always exist, e.g., when p(x) is a discrete distribution. To achieve a more feasible solution, the Kantorovich problem [Kantorovitch, 1942], a relaxed version of the Monge problem, has been proposed. It transforms the optimal map problem into finding an optimal joint distribution π(x, y) with marginals p(x) and q(y), i.e., a cheapest transport plan from p(x) to q(y):

    min_{π ∈ Π(p,q)} ∫_Y ∫_X π(x, y) c(x, y) dx dy
    s.t. x ∼ p(x), y ∼ q(y),   (1)

where Π(p, q) is the set of all joint distributions with marginals p(x) and q(y). In this paper we focus on the Kantorovich problem due to its feasibility, and refer to it as OT in the subsequent sections.

Generally speaking, the primal objective of OT, i.e., Eq.1, is intractable to compute, since it involves an optimization over the distributions of interest. A mainstream methodology is to formulate its dual problem with regularizers, e.g., entropic and ℓ2-norm penalization [Cuturi, 2013; Ferradans et al., 2014; Genevay et al., 2016; Seguy et al., 2018; Blondel et al., 2018], leading to approximate but more easily solved objectives with respect to the dual variables. For example, the representative work of [Cuturi, 2013] incorporates an entropic regularization term into the dual OT problem of two discrete distributions, so as to achieve a convex objective, which can be solved by the Sinkhorn algorithm. Naturally, these methods have been successfully applied to approximately computing OT values in many real applications [Kusner et al., 2015].

The above existing methods, however, mainly focus on discrete distributions, and cannot cope with continuous OT, i.e., OT for comparing continuous densities. To the best of our knowledge, only very few works [Genevay et al., 2016; Seguy et al., 2018] attempt to fill this gap for continuous OT.
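For concreteness, the entropic-regularization approach mentioned above can be sketched for two discrete histograms as follows. This is a minimal illustration of the baseline family the paper contrasts with, not part of Cop-OT; the weights a and b, the cost matrix C, the regularization strength eps and the iteration count are illustrative placeholders, and the returned value is the transport cost of the regularized Sinkhorn plan rather than the exact OT value.

import numpy as np

def sinkhorn_ot(a, b, C, eps=0.1, iters=500):
    # a, b: discrete weight vectors summing to one; C: cost matrix between their supports
    K = np.exp(-C / eps)              # Gibbs kernel induced by the entropic regularizer
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)             # alternating dual scaling updates
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]   # approximate transport plan with marginals a and b
    return np.sum(P * C)              # transport cost of the regularized plan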


Method                         Primal objective   Time complexity   Space complexity
K-OT [Genevay et al., 2016]    ×                  O(T^3 S^3 D)      O(TSD + f_k(D))
NN-OT [Seguy et al., 2018]     ×                  O(T S^2 D)        O(SD + f_n(D))
Cop-OT (Ours)                  √                  O(T S D^2)        O(SD + D^2)

Table 1: A brief summary of methods for continuous OT.

A typical method [Genevay et al., 2016], referred to as Kernel Optimal Transport (K-OT), parameterizes the dual variables by kernel functions, so as to transform the regularized dual objective into an expectation form with respect to p(x)q(y); it can then be solved through stochastic optimization by drawing Monte Carlo samples from p(x)q(y). However, since at each iteration the update process has to recycle the samples of previous iterations, K-OT suffers from the higher costs of O(T^3 S^3 D) time and O(TSD + f_k(D)) memory, where S and T are the number of samples per iteration and the total number of iterations, and f_k(D) specifies the memory cost of the kernel functions. [Seguy et al., 2018] utilizes neural networks to parameterize the dual variables, instead of kernel functions, leading to another candidate method for continuous OT, namely Neural Network Optimal Transport (NN-OT). It requires O(T S^2 D) time and O(SD + f_n(D)) space, where f_n(D) specifies the memory cost of the neural networks. However, one salient problem remains: their target is to find the optima of the regularized dual objective, rather than the primal objective of OT, i.e., Eq.1, theoretically introducing biases. Another problem is that they are empirically sensitive to the regularization parameters, making them less practical.

Our contributions. In this work, we attempt to further contribute to this challenging topic of continuous OT. Compared with the prior arts of K-OT and NN-OT summarized in Table 1, our motivation is to directly optimize the primal objective of continuous OT, i.e., Eq.1, instead of its regularized dual versions, so as to achieve more accurate approximations. Our basic idea is to formulate the joint distribution π(x, y) using the well-established copula [Sklar, 1959], and then transform the primal objective with respect to the distributions of interest into a tractable form with respect to the copula parameter, enabling it to be efficiently solved by stochastic optimization. Following this, a novel continuous OT method, namely Copula Optimal Transport (Cop-OT), is developed. We first propose Cop-OT for comparing 1-D continuous densities, and then extend it to high-dimensional ones, i.e., D ≥ 2, with requirements of O(TSD^2) time and O(SD + D^2) memory. Referring to Table 1, Cop-OT is more efficient than K-OT, and practically more efficient than NN-OT due to the expensive computational cost of neural networks. Empirical studies on synthetic data and the real application of image retrieval demonstrate that Cop-OT can gain more accurate approximations to continuous OT values than the state-of-the-art baselines of K-OT and NN-OT.

Related work. The existing works on OT mainly focus on comparing discrete distributions. In this setting, OT is actually equivalent to a linear programming problem, which can be solved in cubic complexity. Some attempts investigate approximate but more efficient methods. The mainstream methodology is built on the idea of formulating regularized dual objectives [Cuturi, 2013; Benamou et al., 2015; Genevay et al., 2016; Seguy et al., 2018], and solving them using commonly used optimization schemes, e.g., the Sinkhorn algorithm and stochastic optimization. Besides, there are some works [Aurenhammer et al., 1998; Genevay et al., 2016; Lévy and Schwindt, 2018; Seguy et al., 2018] on semi-discrete OT, i.e., OT for comparing a continuous density and a discrete distribution. They employ the machinery of c-transforms to eliminate one dual variable, and then obtain convex finite-dimensional optimizations, solved by the Newton method [Lévy and Schwindt, 2018] and stochastic optimization [Genevay et al., 2016; Seguy et al., 2018]. To the best of our knowledge, there are only very few works on continuous OT, which is still an open problem we attempt to tackle in this paper.

2 Proposed Method

Our goal is the computation of OT between continuous densities p(x) and q(y), i.e., continuous OT. However, it is intractable to compute, since its target, i.e., the objective of Eq.1, is to minimize the transport cost by finding an optimal joint distribution π(x, y) from an unknown set of distributions with marginals p(x) and q(y). To effectively solve continuous OT, we propose a novel method, namely Copula Optimal Transport (Cop-OT). The basic idea is to re-parameterize the joint distribution π(x, y) with the well-established copula [Sklar, 1959] on p(x)q(y), so as to transform the primal objective of continuous OT into a tractable objective with respect to the copula parameter, referred to as the copula objective. We directly derive its gradient as an expectation form with respect to p(x)q(y); therefore we can efficiently solve it following the spirit of stochastic optimization, where we iteratively update the copula parameter by forming noisy gradients using Monte Carlo samples drawn from p(x)q(y).

For the rest of this section, we briefly review the concept of the copula, and then describe Cop-OT for 1-D continuous densities and its extension to high-dimensional ones, i.e., D ≥ 2.

2.1 Copula

The story of the copula begins with the theorem of Sklar [Sklar, 1959]: For any continuous joint distribution π(x, y), x, y ∈ R^D, its Cumulative Distribution Function (CDF), denoted as F(·), can be represented by a copula with respect to the


marginal CDFs F_i(·) as follows:

    F(x, y) = COP(F_1(x_1), ..., F_D(x_D), F_{D+1}(y_1), ..., F_{2D}(y_D) | η),

where η is the copula parameter. Then, one can directly derive the following equation for π(x, y):

    π(x, y) = ∏_{i=1}^{D} p_i(x_i) q_i(y_i) cop(F_1(x_1), ..., F_D(x_D), F_{D+1}(y_1), ..., F_{2D}(y_D) | η),   (2)

where cop(·) denotes the copula density; p_i(·) and q_i(·) are the marginals of p(x) and q(y), respectively. For brevity, we omit the notation of the CDF, i.e., F(·), enabling us to re-write Eq.2 as follows:

    π(x, y) = ∏_{i=1}^{D} p_i(x_i) q_i(y_i) cop(x_1, y_1, ..., x_D, y_D | η)   (3)

In practice, there are many well-established families of copulas, including Gaussian, Clayton, Frank, Gumbel, Joe and Student-t copulas [Dissmann et al., 2013]. Given these copulas, one can describe any continuous joint distribution via a copula-related form parameterized by η.

2.2 Cop-OT for 1-D Continuous Densities

We are now ready to propose Cop-OT for 1-D continuous densities, where both p(x) and q(y) are 1-dimensional densities. In this situation, referring to Eq.3, the joint distribution π(x, y) can be directly represented by the following form with the copula density:

    π(x, y) = p(x) q(y) cop(x, y | η),   (4)

Combining Eq.4 with the primal objective of Eq.1, we can equivalently transform it into a tractable copula objective with respect to η:

    min_{η} L(η) ≜ ∫_Y ∫_X p(x) q(y) cop(x, y | η) c(x, y) dx dy
    s.t. x ∼ p(x), y ∼ q(y),   (5)

Given a pre-defined copula cop(x, y|η) and cost function c(x, y), we can iteratively optimize the objective L(η) by gradient-based methods; its gradient can be derived as an expectation form with respect to p(x)q(y) (there exist many off-the-shelf copula libraries, e.g., VineCopula, which can compute the gradients of copulas, i.e., ∇_η cop(x, y|η)):

    ∇_η L = ∇_η ∫_Y ∫_X p(x) q(y) cop(x, y | η) c(x, y) dx dy
          = ∫_Y ∫_X p(x) q(y) ∇_η cop(x, y | η) c(x, y) dx dy
          = E_{p(x)q(y)}[∇_η cop(x, y | η) c(x, y)]   (6)

Therefore, we can directly form a noisy gradient using Monte Carlo samples drawn from p(x)q(y), which simultaneously satisfy the constraints in Eq.5:

    ∇_η L ≈ (1/S) \sum_{s=1}^{S} ∇_η cop(x^(s), y^(s) | η) c(x^(s), y^(s)),
    x^(s) ∼ p(x), y^(s) ∼ q(y),   (7)

where S is the number of Monte Carlo samples. Then, at each iteration t, the parameter of interest η can be updated as follows:

    η_t ← η_{t−1} − ρ_t ∇_η L,   (8)

where ρ_t is the learning rate.

Note that this optimization process strictly follows the spirit of stochastic optimization [Robbins and Monro, 1951], where the expectation of the noisy gradient used, i.e., Eq.7, is equal to the true gradient, i.e., Eq.6. Therefore it is guaranteed to converge to a local optimum if the learning rate satisfies the Robbins-Monro condition:

    \sum_{t=1}^{∞} ρ_t = ∞,   \sum_{t=1}^{∞} ρ_t^2 < ∞

Variance Reduction by Reparameterization
The noisy gradient of Eq.7, formed using Monte Carlo samples, often suffers from high variance, resulting in slower convergence and even worse performance [Paisley et al., 2012; Li et al., 2018]. To alleviate this, we use the reparameterization trick [Kingma and Welling, 2013], an empirically effective method for variance reduction. The basic idea is to use a transformation of a simple random variable by a differentiable mapping function, and then form the noisy gradient using Monte Carlo samples drawn from the distribution of this simple random variable. Inspired by this, we re-write the noisy gradient of Eq.7 with reparameterization as follows:

    ∇_η L ≈ (1/S) \sum_{s=1}^{S} ∇_η cop(f_x(u^(s)), f_y(v^(s)) | η) c(f_x(u^(s)), f_y(v^(s))),
    u^(s) ∼ p̂(u), v^(s) ∼ q̂(v),   (9)

where u and v are the corresponding simple random variables for x and y, respectively, and p̂(u) and q̂(v) are their distributions; f_x(·) and f_y(·) are the mapping functions for x and y, respectively.

For clarity, we illustrate an example: if p(x) is a Gaussian with mean µ and variance σ^2, i.e., x ∼ N(µ, σ^2), the corresponding distribution p̂(u) of the simple random variable can be the standard Gaussian, i.e., u ∼ N(0, 1), and the mapping function is f_x(u) = µ + uσ.
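To make the 1-D procedure above concrete, here is a minimal sketch of Eqs.4-9 under several assumptions: both marginals are Gaussian (so the reparameterization mappings are f_x(u) = µ_x + uσ_x and f_y(v) = µ_y + vσ_y), the copula is a bivariate Gaussian copula with a single correlation parameter η, the copula gradient ∇_η cop is approximated by central finite differences as a stand-in for an off-the-shelf copula library such as VineCopula, and the update is a plain stochastic-gradient step in the spirit of Eq.8 (the paper uses Adam, reviewed next). All function and variable names below are ours, not the authors' released code.

import numpy as np
from scipy.stats import norm

def gaussian_copula_density(u, v, eta):
    # bivariate Gaussian copula density at CDF values (u, v) with correlation parameter eta
    z1, z2 = norm.ppf(u), norm.ppf(v)
    r2 = 1.0 - eta ** 2
    return np.exp(-(eta ** 2 * (z1 ** 2 + z2 ** 2) - 2.0 * eta * z1 * z2) / (2.0 * r2)) / np.sqrt(r2)

def noisy_gradient(eta, u, v, mu_x, sig_x, mu_y, sig_y, h=1e-4):
    # Monte Carlo estimate of Eq.9 with squared Euclidean cost; finite differences stand in
    # for the analytic copula gradient a copula library would provide
    x, y = mu_x + u * sig_x, mu_y + v * sig_y          # f_x(u), f_y(v)
    Fx, Fy = norm.cdf(u), norm.cdf(v)                  # marginal CDF values of x and y
    dcop = (gaussian_copula_density(Fx, Fy, eta + h)
            - gaussian_copula_density(Fx, Fy, eta - h)) / (2.0 * h)
    return np.mean(dcop * (x - y) ** 2)

def cop_ot_1d(mu_x, sig_x, mu_y, sig_y, S=200, T=2000):
    u, v = np.random.randn(S), np.random.randn(S)      # simple variables, drawn once and re-used
    eta = 0.0
    for t in range(1, T + 1):
        rho_t = 1.0 / (t + 100.0)                      # decaying rate in the Robbins-Monro spirit
        eta -= rho_t * noisy_gradient(eta, u, v, mu_x, sig_x, mu_y, sig_y)
        eta = np.clip(eta, -0.99, 0.99)                # keep the correlation parameter valid
    # approximate OT value: Monte Carlo estimate of E_{p q}[cop(x, y | eta) c(x, y)]
    x, y = mu_x + u * sig_x, mu_y + v * sig_y
    w = gaussian_copula_density(norm.cdf(u), norm.cdf(v), eta)
    return np.mean(w * (x - y) ** 2)

The fixed decaying learning rate is only a placeholder and may need tuning to the scale of the cost; in the paper the update is instead driven by Adam, described below.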


Learning Rate Setting
The stochastic optimization of Cop-OT may be sensitive to the learning rate ρ_t. To address this, we use the Adam method [Kingma and Ba, 2015] to adaptively adjust the optimization process, which guarantees convergence. For clarity, we briefly review this method.

The Adam method is built on the first and second moments of the gradients, requiring exponential moving averages of the gradient m_t and the squared gradient r_t:

    m_t = β_1 m_{t−1} + (1 − β_1) ∇_η L,
    r_t = β_2 r_{t−1} + (1 − β_2) (∇_η L)^2,

where β_1 and β_2 control the decay rates of these moving averages, and (∇_η L)^2 is the elementwise square of the gradient. Then, at each iteration t, we can update η using the following equation:

    η_t ← η_{t−1} − α m_t / r_t,   (10)

where α is the step size of the Adam method. Following [Kingma and Ba, 2015], in this work we empirically fixed the parameters of Adam as follows: β_1 = 0.9, β_2 = 0.999 and α = 0.001.

2.3 Extension to High-dimensional Densities

In this section, we extend Cop-OT to high-dimensional continuous densities p(x) and q(y), x, y ∈ R^D, D ≥ 2.

Analogous to the 1-D problem setting, we can also derive a high-dimensional copula objective; however, it suffers from a severe problem: since the copula formulation of π(x, y), i.e., Eq.3, refers to all marginals, we can only form noisy gradients using Monte Carlo samples drawn from the marginals of p(x) and q(y), rather than samples from p(x) and q(y). This does not satisfy the constraints of OT, i.e., x ∼ p(x) and y ∼ q(y).

To resolve this, we resort to a novel form of π(x, y) that refers to p(x) and q(y) using the D-vine method [Dissmann et al., 2013], which decomposes a high-dimensional copula into a set of conditional bivariate copulas. After some simple algebra (the derivation details of Eq.11 can be found at https://github.com/jinjinchi/Approximate-Optimal-Transport-for-Continuous-Densities-with-Copulas), we propose the following form with the product of D^2 conditional bivariate copulas between each component of x and y:

    π(x, y) = p(x) q(y) ∏_{i,j}^{D} cop_xy(x_i, y_j | D(e_ij), η_ij^{xy}),   (11)

where D(e_ij) denotes the variable conditioning set of the pair (x_i, y_j).

Applying Eq.11 to the primal objective, we achieve a copula objective of OT for high-dimensional densities:

    min_{η^{xy}} L(η^{xy}) ≜ ∫_Y ∫_X p(x) q(y) ∏_{i,j}^{D} cop_xy(x_i, y_j | D(e_ij), η_ij^{xy}) c(x, y) dx dy
    s.t. x ∼ p(x), y ∼ q(y)   (12)

Analogous to Eq.6, the gradient of Eq.12 can also be represented as an expectation form with respect to p(x)q(y):

    ∇_{η_ij^{xy}} L = E_{p(x)q(y)}[ ∏_{i,j}^{D} ∇_{η_ij^{xy}} cop_xy(x_i, y_j | D(e_ij), η_ij^{xy}) c(x, y) ]   (13)

We can form a noisy gradient using Monte Carlo samples from p(x)q(y), which naturally satisfies the constraints in Eq.12. However, we actually need a further approximation, since every conditional bivariate copula cop_xy(·) refers to the conditional marginals p_i(x_i | D(e_ij)) and q_j(y_j | D(e_ij)), rather than p(x) and q(y), so that samples from these conditional marginals are also needed to compute cop_xy(·). Since the conditional marginals are commonly intractable to compute, we replace them with the corresponding unconditional marginals to further simplify the computation. Then, we approximate the noisy gradient using Monte Carlo samples from both p(x)q(y) and their marginals:

    ∇_{η_ij^{xy}} L ≈ (1/S) \sum_{s=1}^{S} ∏_{i,j}^{D} ∇_{η_ij^{xy}} cop_xy(x_i^(s), y_j^(s) | η_ij^{xy}) c(x^(s), y^(s)),
    x^(s) ∼ p(x), y^(s) ∼ q(y), x_i^(s) ∼ p_i(x_i), y_j^(s) ∼ q_j(y_j)   (14)

For variance reduction, we also apply the reparameterization trick, so as to re-write the gradient of Eq.14 as follows:

    ∇_{η_ij^{xy}} L ≈ (1/S) \sum_{s=1}^{S} ∏_{i,j}^{D} ∇_{η_ij^{xy}} cop_xy(f_{x_i}(u_i^(s)), f_{y_j}(v_j^(s)) | η_ij^{xy}) × c(f_x(u^(s)), f_y(v^(s))),
    u^(s) ∼ p̂(u), v^(s) ∼ q̂(v), u_i^(s) ∼ p̂_i(u_i), v_j^(s) ∼ q̂_j(v_j),   (15)

where u_i and v_j are the components of u and v, respectively; p̂_i(u_i) and q̂_j(v_j) are the marginals of p̂(u) and q̂(v), respectively. Given these noisy gradients, the parameter of interest, i.e., η^{xy}, can be finally updated using the Adam method.

Algorithm 1 Optimization of Cop-OT
1: Input: Continuous densities p(x), q(y), x, y ∈ R^D
2: Setting: (1) Choose the cost function c(·); (2) construct distributions p̂(u) and q̂(v) for reparameterization; (3) choose the bivariate copula function cop(·|η), and initialize η randomly
3: For s = 1 to S
4:   Draw u^(s) ∼ p̂(u)
5:   Draw v^(s) ∼ q̂(v)
6: End For
7: If D ≥ 2
8:   For s = 1 to S and i = 1 to D
9:     Draw u_i^(s) ∼ p̂_i(u_i)
10:    Draw v_i^(s) ∼ q̂_i(v_i)
11:  End For
12: Repeat
13:   Form the noisy gradient using Eq.9, if D = 1
14:   Form the noisy gradient using Eq.15, if D ≥ 2
15:   Update η or η^{xy} using the Adam method
16: Until convergence
17: Output: OT between p(x) and q(y) approximated by Cop-OT

Full algorithm. In summary, we outline the full optimization process of Cop-OT in Algorithm 1.
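As a companion to Algorithm 1, below is a hedged sketch of one reading of the high-dimensional noisy gradient (Eq.15), under the same simplifications stated above: bivariate Gaussian copulas cop_xy(x_i, y_j | η_ij) with the conditional marginals replaced by plain marginals, Gaussian marginals for the reparameterization, and finite-difference copula gradients. It re-uses the gaussian_copula_density helper from the 1-D sketch; all names are ours. Differentiating the product over pairs with respect to η_ij keeps the other factors fixed, which is how the product rule applies here.

import numpy as np
from scipy.stats import norm

def noisy_gradient_hd(eta, U, V, mu_x, sig_x, mu_y, sig_y, h=1e-4):
    # eta: D x D matrix of pairwise copula parameters; U, V: S x D reparameterization samples;
    # mu_x, sig_x, mu_y, sig_y: length-D arrays of the Gaussian marginal parameters
    S, D = U.shape
    X, Y = mu_x + U * sig_x, mu_y + V * sig_y                  # f_x(u), f_y(v), componentwise
    Fu, Fv = norm.cdf(U), norm.cdf(V)                          # marginal CDF values
    c = np.sum((X - Y) ** 2, axis=1)                           # squared Euclidean cost per sample
    dens = np.empty((S, D, D))                                 # every bivariate copula density
    for i in range(D):
        for j in range(D):
            dens[:, i, j] = gaussian_copula_density(Fu[:, i], Fv[:, j], eta[i, j])
    prod = dens.prod(axis=(1, 2))                              # product over all (i, j) pairs
    grad = np.zeros_like(eta)
    for i in range(D):
        for j in range(D):
            d_ij = (gaussian_copula_density(Fu[:, i], Fv[:, j], eta[i, j] + h)
                    - gaussian_copula_density(Fu[:, i], Fv[:, j], eta[i, j] - h)) / (2.0 * h)
            # replace the (i, j) factor of the product by its derivative (product rule)
            grad[i, j] = np.mean(prod / np.maximum(dens[:, i, j], 1e-12) * d_ij * c)
    return grad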


Figure 1: Error rates of 1-D Gaussians with different means and variances. Each sub-figure presents the results of a different pair of {µ_x, µ_y}: (a) {2^1, 2^3}; (b) {2^1, 2^5}; (c) {2^1, 2^7}; (d) {2^3, 2^5}; (e) {2^3, 2^7}; (f) {2^5, 2^7}.

Discussion. We discuss some crucial details of Cop-OT: (1) Referring to Eqs.9 and 15, at each iteration Cop-OT computes the noisy gradients using samples from the same distributions, i.e., p̂(u), q̂(v) and their marginals. Therefore, we only need to draw samples once and iteratively re-use them, as indicated in Algorithm 1. (2) For high-dimensional Cop-OT, its noisy gradient of Eq.15 does not strictly satisfy the convergence condition of stochastic optimization, since its expectation only approximates the true gradient of Eq.13. Fortunately, it empirically converges fast and achieves competitive results, as shown in Section 3.

Complexity analysis. We discuss the complexity of Cop-OT. Reviewing Eq.12, the copula objective actually involves D^2 bivariate copulas between each pair of x_i and y_j, leading to D^2 copula parameters of interest. Therefore, reviewing Eq.15, Cop-OT needs to form the noisy gradient for each copula parameter using 2S samples, requiring O(TSD^2) time and O(SD + D^2) memory, where T is the total iteration number. Besides, we note that the 1-D setting can be considered as the special case of D = 1, requiring O(TS) time and O(S) memory.

3 Empirical Study

In this section, we empirically evaluate Cop-OT on both synthetic and real data.

3.1 Experimental Setup

Our aim is to examine whether Cop-OT can accurately approximate continuous OT. To this end, we evaluate OT with the squared Euclidean cost, i.e., c(x, y) = ||x − y||_2^2, x, y ∈ R^D, between two Gaussian distributions with means µ_x, µ_y ∈ R^D and covariance matrices Σ_x, Σ_y ∈ R^{D×D}, i.e., p(x) = N(µ_x, Σ_x), q(y) = N(µ_y, Σ_y), which has a Closed-Form Solution (CFS) [Givens et al., 1984] computed by:

    W*_xy(p(x), q(y)) = sqrt( ||µ_x − µ_y||_2^2 + Tr( Σ_x + Σ_y − 2(Σ_y^{1/2} Σ_x Σ_y^{1/2})^{1/2} ) ),   (16)

where Tr(·) denotes the trace of a matrix.

We compare Cop-OT against two baseline methods, K-OT [Genevay et al., 2016] (code: https://github.com/audeg/StochasticOT) and NN-OT [Seguy et al., 2018] (code: https://github.com/vivienseguy/Large-Scale-OT). For both baselines, we implemented in-house codes based on the demo codes of the semi-continuous versions provided by their authors.

For K-OT and NN-OT, we adjust the regularization parameter over the set [10^{-3}, 10^{-2}, ..., 10^3] and report the best results. For Cop-OT, we employ the family of Gaussian copula functions, and use standard Gaussians as the mapping distributions in the reparameterization trick. For all methods, the sample number S is set to 200, and we report the average results of five independent runs.

Besides, in our early experiments we examined various copula families, e.g., Clayton, Frank, Gumbel, Joe and Student-t copulas, and found that the Gaussian copula performed the best in most settings.
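For reference, the closed-form solution of Eq.16 can be computed with a few lines of code; the sketch below is our own helper (not code from the paper) and uses scipy's matrix square root, keeping only the real part since sqrtm may return negligible imaginary components due to floating-point error.

import numpy as np
from scipy.linalg import sqrtm

def closed_form_w2(mu_x, cov_x, mu_y, cov_y):
    # W*_xy between N(mu_x, cov_x) and N(mu_y, cov_y) under the squared Euclidean cost (Eq.16)
    root_y = sqrtm(cov_y)
    cross = np.real(sqrtm(root_y @ cov_x @ root_y))
    trace_term = np.trace(cov_x + cov_y - 2.0 * cross)
    return np.sqrt(np.sum((np.asarray(mu_x) - np.asarray(mu_y)) ** 2) + trace_term)

In the 1-D case this reduces to sqrt((µ_x − µ_y)^2 + (σ_x − σ_y)^2), which is the reference value used for the error rates in Section 3.2.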


Figure 2: Error rate results of high-dimensional Gaussians: (a) D = 2; (b) D = 8.

Figure 3: Convergence curves of the OT methods: (a) {N(2^1, 2^2), N(2^3, 2^8)}; (b) {N(2^1, 2^2), N(2^5, 2^8)}.

Method     CFS    K-OT   NN-OT   Cop-OT
Accuracy   0.47   0.44   0.39    0.46

Table 2: Empirical results of image retrieval.

3.2 Evaluation on Synthetic Data

Results of 1-D Gaussians
We now evaluate on 1-D Gaussians N(µ_x, σ_x^2) and N(µ_y, σ_y^2). The means and variances vary over the sets {2^1, 2^3, 2^5, 2^7} and {2^2, 2^4, 2^6, 2^8}, respectively. We examine the performance of the OT methods by measuring whether their approximations, denoted by Ŵ_xy, approach the CFS values, i.e., W*_xy computed by Eq.16. To this end, we define an evaluation metric of error rate computed as follows:

    ErrorRate(Ŵ_xy) = |Ŵ_xy − W*_xy| / W*_xy,

The results are shown in Fig.1, where the error rates of Cop-OT are significantly lower than those of NN-OT in most settings, and Cop-OT performs very competitively with K-OT. These results imply that Cop-OT can compute more accurate approximations to the CFS.

Results of High-dimensional Gaussians
We further evaluate high-dimensional Gaussians (D = 2 and 8) with mean zero and different covariance matrices drawn from inverse-Wishart distributions. For each dimension, we draw three relatively sparse covariance matrices Σ^s and three dense ones Σ^d from W^{-1}(I, D + 2) and W^{-1}(10I, D + 2), respectively, leading to three pairs of Gaussians, denoted by {p_1^s, p_2^s}, {p_1^d, p_3^s} and {p_2^d, p_3^d}, where p_i^s = N(0, Σ_i^s) and p_i^d = N(0, Σ_i^d).

We show the results in Fig.2, where Cop-OT outperforms K-OT and NN-OT in most cases. This empirically indicates that Cop-OT works well on comparing high-dimensional densities, even though it uses a further approximation to its noisy gradients during optimization. Besides, we observe that the error rates for high-dimensional Gaussians seem less stable than those for 1-D densities.

Convergence
We discuss the convergence of the OT methods on two pairs of 1-D Gaussians. Fig.3 presents the convergence curves of all methods within 25 seconds. We can observe that Cop-OT converges faster than the baseline methods of K-OT and NN-OT. Surprisingly, NN-OT converges slower than K-OT. The possible reason is that the neural network of NN-OT is computationally expensive due to its relatively complex structure.

3.3 Evaluation on Real Data

We evaluate Cop-OT on image retrieval. MNIST (http://yann.lecun.com/exdb/mnist/), a dataset of handwritten digits from zero to nine, is used. We randomly select 10,000 images as the database and 150 images for testing. We refer to each image as a Gaussian estimated from its pixels. Following [Li et al., 2013], we find the nearest 10 images of each test image measured by OT, and compute the accuracy by counting how many of the nearest images belong to the same digit as the test image.

The results are shown in Table 2. First, we observe that the accuracy scores of Cop-OT are higher than those of K-OT and NN-OT. Since image retrieval is factually a problem of finding the nearest neighboring images for test inputs, the performance gain of Cop-OT indicates that it can better maintain the relative distances among images. Besides, we can see that the accuracy gap between Cop-OT and the CFS is not obvious. This further indicates the effectiveness of Cop-OT.
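As an illustration of one plausible reading of the retrieval protocol just described (variable names are hypothetical), the accuracy can be computed from a precomputed matrix of approximate OT values between test and database images together with their digit labels:

import numpy as np

def retrieval_accuracy(ot_dist, test_labels, db_labels, k=10):
    # ot_dist[i, j]: approximate OT value between test image i and database image j
    nearest = np.argsort(ot_dist, axis=1)[:, :k]            # indices of the k nearest database images
    hits = (db_labels[nearest] == test_labels[:, None])     # True where the digit labels match
    return hits.mean()                                      # fraction of matching neighbors, averaged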
4 Conclusion

We develop a novel Cop-OT method for continuous OT. It formulates the joint distribution with the copula function, and then transforms the primal objective of OT into a copula objective with respect to the copula parameter, which is solved by stochastic optimization with the reparameterization trick. Both the 1-D and the high-dimensional Cop-OT models are proposed. Empirical studies on both real and synthetic data demonstrate the effectiveness of Cop-OT.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (NSFC) [No.61602204, No.61876071, No.61806035].


References

[Arjovsky et al., 2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
[Aurenhammer et al., 1998] Franz Aurenhammer, Friedrich Hoffmann, and Boris Aronov. Minkowski-type theorems and least-squares clustering. Algorithmica, 20(1):61–76, 1998.
[Benamou et al., 2015] Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
[Blondel et al., 2018] Mathieu Blondel, Vivien Seguy, and Antoine Rolet. Smooth and sparse optimal transport. In International Conference on Artificial Intelligence and Statistics, pages 880–889, 2018.
[Courty et al., 2017] Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, 2017.
[Cuturi, 2013] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Neural Information Processing Systems, pages 2292–2300, 2013.
[Dissmann et al., 2013] Jeffrey Dissmann, Eike C. Brechmann, Claudia Czado, and Dorota Kurowicka. Selecting and estimating regular vine copulae and application to financial returns. Computational Statistics & Data Analysis, 59:52–69, 2013.
[Ferradans et al., 2014] Sira Ferradans, Nicolas Papadakis, Gabriel Peyré, and Jean-François Aujol. Regularized discrete optimal transport. SIAM Journal on Imaging Sciences, 7(3):1853–1882, 2014.
[Genevay et al., 2016] Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochastic optimization for large-scale optimal transport. In Neural Information Processing Systems, pages 3440–3448, 2016.
[Givens et al., 1984] Clark R. Givens, Rae Michael Shortt, et al. A class of Wasserstein metrics for probability distributions. The Michigan Mathematical Journal, 31(2):231–240, 1984.
[Kantorovitch, 1942] Leonid Kantorovitch. On the transfer of masses (in Russian). Doklady Akademii Nauk, 37(2):227–229, 1942.
[Kingma and Ba, 2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[Kingma and Welling, 2013] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2013.
[Kusner et al., 2015] Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966, 2015.
[Lévy and Schwindt, 2018] Bruno Lévy and Erica L. Schwindt. Notions of optimal transport theory and how to implement them on a computer. Computers & Graphics, 72:135–148, 2018.
[Li et al., 2013] Peihua Li, Qilong Wang, and Lei Zhang. A novel earth mover's distance methodology for image matching with Gaussian mixture models. In IEEE International Conference on Computer Vision, pages 1689–1696, 2013.
[Li et al., 2018] Ximing Li, Changchun Li, Jinjin Chi, and Jihong Ouyang. Variance reduction in black-box variational inference by adaptive importance sampling. In International Joint Conference on Artificial Intelligence, pages 2404–2410, 2018.
[Li et al., 2019] Changchun Li, Jihong Ouyang, and Ximing Li. Classifying extremely short texts by exploiting semantic centroids in word mover's distance space. In The World Wide Web Conference, pages 939–949, 2019.
[Monge, 1781] Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de l'Académie Royale des Sciences de Paris, 1781.
[Ni et al., 2009] Kangyu Ni, Xavier Bresson, Tony Chan, and Selim Esedoglu. Local histogram based segmentation using the Wasserstein distance. International Journal of Computer Vision, 84(1):97–111, 2009.
[Paisley et al., 2012] John William Paisley, David M. Blei, and Michael I. Jordan. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning, pages 1367–1374, 2012.
[Pele and Werman, 2009] Ofir Pele and Michael Werman. Fast and robust earth mover's distances. In International Conference on Computer Vision, pages 460–467, 2009.
[Robbins and Monro, 1951] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[Rubner et al., 2000] Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.
[Seguy et al., 2018] Vivien Seguy, Bharath Bhushan Damodaran, Rémi Flamary, Nicolas Courty, Antoine Rolet, and Mathieu Blondel. Large-scale optimal transport and mapping estimation. In International Conference on Learning Representations, 2018.
[Sklar, 1959] Abe Sklar. Fonctions de répartition à n dimensions et leurs marges. Publications de l'Institut de Statistique de l'Université de Paris, 8:229–231, 1959.
[Villani, 2003] Cédric Villani. Topics in optimal transportation. Number 58. American Mathematical Society, 2003.
[Wang et al., 2018] Yang Wang, Lin Wu, Xuemin Lin, and Junbin Gao. Multiview spectral clustering via structured low-rank matrix factorization. IEEE Transactions on Neural Networks and Learning Systems, 29(10):4833–4843, 2018.
