Large Deviations for a Matching Problem Related to the -Wasserstein
Total Page:16
File Type:pdf, Size:1020Kb
Large Deviations for a matching problem related to the 1-Wasserstein distance José Trashorras To cite this version: José Trashorras. Large Deviations for a matching problem related to the 1-Wasserstein distance. ALEA : Latin American Journal of Probability and Mathematical Statistics, Instituto Nacional de Matemática Pura e Aplicada, 2018, 15, pp.247-278. hal-00641378v2 HAL Id: hal-00641378 https://hal.archives-ouvertes.fr/hal-00641378v2 Submitted on 15 Feb 2017 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Large Deviations for a matching problem related to the 1-Wasserstein distance Jos´eTrashorras ∗ February 15, 2017 Abstract Let (E; d) be a compact metric space, X = (X1;:::;Xn;::: ) and Y = (Y1;:::;Yn;::: ) two independent sequences of independent E-valued ran- X Y dom variables and (Ln )n≥1 and (Ln )n≥1 the associated sequences of em- X Y pirical measures. We establish a large deviation principle for (W1(Ln ;Ln ))n≥1 where W1 is the 1-Wasserstein distance, X Y W1(Ln ;Ln ) = min max d(Xi;Yσ(i)) σ2Sn 1≤i≤n where Sn stands for the set of permutations of f1; : : : ; ng. 1 Introduction n We say that a sequence of Borel probability measures (P )n≥1 on a topological space Y obeys a Large Deviation Principle (hereafter abbreviated LDP) with rate function I if I is a non-negative, lower semi-continuous function defined on Y such that 1 1 − inf I(y) ≤ lim inf log P n(A) ≤ lim sup log P n(A) ≤ − inf I(y) y2Ao n!1 n n!1 n y2A¯ for any measurable set A ⊂ Y, whose interior is denoted by Ao and closure by A¯. If the level sets fy : I(y) ≤ αg are compact for every α < 1, I is called a good rate function. With a slight abuse of language we say that a sequence of random variables obeys an LDP when the sequence of measures induced by these random variables obeys an LDP. For a background on the theory of large deviations see Dembo and Zeitouni [4] and references therein. Let (E; d) be a metric space. There has been a lot of interest for years in considering the space M 1(E) of Borel probability measures on E endowed with the so-called p-Wasserstein distances (Z 1=p) p Wp(ν; γ) = inf d(x; y) Q(dx; dy) Q2C(ν,γ) E×E ∗Universit´eParis-Dauphine, Ceremade, Place du mar´echal de Lattre de Tassigny, 75775 Paris Cedex 16 France. Email: [email protected] 1 where p 2 [1; 1) and C(ν; γ) stands for the set of Borel probability measures 2 on E with first marginal Q1 = ν and second marginal Q2 = γ, see Chapter 6 in [20] for a broad review. However, the 1-Wasserstein distance equivalently defined by either W1(ν; γ) = lim Wp(ν; γ) p!1 or −1 W1(ν; γ) = inf sup u; u 2 S(Q ◦ d ) (1) Q2C(ν,γ) where S(Q◦d−1) ⊂ E2 stands for the support of the probability measure Q◦d−1 has attracted much less attention so far.1 Our framework is the following : We are given a compact metric space (E; d). Without loss of generality we can assume that supx;y2E d(x; y) = 1. Let (Xn)n≥1 and (Yn)n≥1 be two independent sequences of E-valued independent random variables defined on the same probability space (Ω; A; P). We assume 1 2 that all the Xi's (resp. Yi's) have the same distribution µ (resp. µ ). For every n ≥ 1 we consider the empirical measures n n 1 X 1 X LX = δ and LY = δ : n n Xi n n Yi i=1 i=1 X Y In [7] Ganesh and O'Connell conjecture that (W1(Ln ;Ln ))n≥1 obeys an LDP with rate function 1 1 2 2 I1(x) = inf H(ν jµ ) + H(ν jµ ) ν1,ν22M 1(E) W1(ν1,ν2)=x where for any two ν; µ 2 M 1(E) R log dν dν if ν µ H(νjµ) = E dµ 1 otherwise. If instead of I1 one considers its lower semi-continuous regularization J1 which is defined by J1(x) = sup inf I1(y); δ>0 y2B(x,δ) see e.g. Chapter 1 in [15], then X Y Theorem 1.1 The sequence (W1(Ln ;Ln ))n≥1 satisfies an LDP on [0; 1] with good rate function J1(x). In Section 3.1 we show that if e.g. E = [0; 1] is endowed with the usual Euclidean 1 2 3 3 distance and µ = µ = µ admits f(x) = 1 1 (x) + 1 2 (x) as a density 2 [0; 3 ] 2 [ 3 ;1] w.r.t. the Lebesgue measure then, due to the disconnectedness of the support of µ, I1 fails to be lower semi-continuous. Hence, I1 and J1 do not always coincide. This example illustrates a general feature of the problem considered 1Let us recall that the support S(P ) of a Borel probability measure P on a metric space (E; d) is defined by x 2 S(P ) if and only if for every " > 0;P (B(x; ")) > 0 where B(x; ") stands for the open ball centered at x with radius ". If E is separable, in particular if E is compact, S(P ) is closed, see e.g. Theorem 2.1 in [14]. 2 here : the connectedness properties of the support of the reference measures X Y are the key ingredient in the behavior of (W1(Ln ;Ln ))n≥1. We will say more about that below, in Proposition 1.3 below. X Y Since for every n ≥ 1 every Q 2 C(Ln ;Ln ) can be represented as a bi- stochastic matrix and since, according to the Birkhoff-Von Neumann Theorem, every bi-stochastic matrix is a convex combination of permutation matrices (see e.g. Theorem 5.5.1 in [17]) we have X Y W1(Ln ;Ln ) = min max d(Xi;Yσ(i)) (2) σ2Sn 1≤i≤n where Sn stands for the set of permutations of f1; : : : ; ng. Hence, computing X Y W1(Ln ;Ln ) is nothing but solving a minimax matching problem which is a fundamental combinatorial question. Equality (2) can be rephrased as X Y X Y W1(Ln ;Ln ) = dH(S(Ln ); S(Ln )) (3) X Y where S(Ln ) = fX1;:::;Xng, S(Ln ) = fY1;:::;Yng and dH is the so-called Hausdorff distance. Lets us recall that the Hausdorff distance is defined on the set of non-empty closed subsets of E by dH(A; B) = max sup inf d(x; y); sup inf d(x; y) x2A y2B y2B x2A see e.g. Section 6.2.2 in [3]. It is the essential tool in many applications like image processing [8], object matching [18], face detection [10] and evolutionary optimization [16] to name just a few. Now that we have stated our main result, let us briefly review some facts about W1. The reference paper on the subject is [1] by Champion, De Pascale and Juutinen. They consider measures with support on compact subsets of Rd (d ≥ 1). Some of their results have been generalized from Rd to the setting of Polish spaces and general cost functions by Jylh¨ain [9]. We recall them for latter use. Lemma 1.1 (Proposition 2.1 in [1] and Theorem 2.6 in [9]) For any two ν; γ 2 M 1(E) there exists at least one Q 2 C(ν; γ) such that −1 W1(ν; γ) = sup u; u 2 S(Q ◦ d ) : They further establish the existence of nice solutions to the problem (1) which they call infinitely cyclically monotone couplings. Definition 1.1 A probability measure P 2 M 1(E2) is called infinitely cyclically monotone if and only if for every integer n ≥ 2, every (x1; y1);:::; (xn; yn) 2 S(P ) and every σ 2 Sn we have max d(xi; yi) ≤ max d(xi; yσ(i)): (4) 1≤i≤n 1≤i≤n Lemma 1.2 (Theorem 3.2 and Theorem 3.4 in [1] and Theorem 2.16 in [9]) For any two γ; ν 2 M 1(E) there exists at least one infinitely cyclically monotone P 2 C(γ; ν) such that −1 W1(γ; ν) = sup u; u 2 S(P ◦ d ) : 3 In order to check that P 2 M 1(E2) is infinitely cyclically monotone we only need to know its support. The way mass is spread over S(P ) does not matter : this property is concerned with subsets of E2 rather than with fully specified probability measures. This is the reason why we will also call infinitely cyclically monotone subsets of E2 that satisfy (4). Infinitely cyclically monotone couplings are nice solutions to the problem (1) in the sense that Lemma 1.3 (Theorem 3.4 in [1] and Theorem 2.17 in [9]) Any infinitely cycli- cally monotone P 2 M 1(E2) satisfies −1 W1(P1;P2) = sup u; u 2 S(P ◦ d ) : Indeed, once an infinitely cyclically monotone A ⊂ E2 is given, no matter how one spreads mass over A provided the couple that maximizes (x; y) 7! d(x; y) over A is covered, the marginals of the resulting probability measure over E2 will always be at the same W1 distance.