Bregman Divergence and Mirror Descent


1 Bregman Divergence

Motivation

• Generalize the squared Euclidean distance to a class of distances that all share similar properties.
• Lots of applications in machine learning: clustering, exponential families, etc.

Definition 1 (Bregman divergence) Let $\psi : \Omega \to \mathbb{R}$ be a function that is: a) strictly convex, b) continuously differentiable, c) defined on a closed convex set $\Omega$. Then the Bregman divergence is defined as

    $\Delta_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla\psi(y), x - y \rangle, \quad \forall\, x, y \in \Omega.$    (1)

That is, it is the difference between the value of $\psi$ at $x$ and the first-order Taylor expansion of $\psi$ around $y$, evaluated at the point $x$.

Examples

• Euclidean distance. Let $\psi(x) = \frac{1}{2}\|x\|^2$. Then $\Delta_\psi(x, y) = \frac{1}{2}\|x - y\|^2$.

• Let $\Omega = \{x \in \mathbb{R}^n_+ : \sum_i x_i = 1\}$ and $\psi(x) = \sum_i x_i \log x_i$. Then $\Delta_\psi(x, y) = \sum_i x_i \log \frac{x_i}{y_i}$ for $x, y \in \Omega$. This is called the relative entropy, or Kullback–Leibler (KL) divergence, between the probability distributions $x$ and $y$.

• $\ell_p$ norm. Let $p \ge 1$ and $\frac{1}{p} + \frac{1}{q} = 1$, and let $\psi(x) = \frac{1}{2}\|x\|_q^2$. Then $\Delta_\psi(x, y) = \frac{1}{2}\|x\|_q^2 + \frac{1}{2}\|y\|_q^2 - \left\langle x, \nabla \frac{1}{2}\|y\|_q^2 \right\rangle$. Note that $\frac{1}{2}\|y\|_q^2$ is not necessarily continuously differentiable, which makes this case not precisely consistent with our definition.
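As a quick numerical sanity check of Definition 1 and the first two examples, here is a minimal NumPy sketch (the helper names are mine, not from the notes); it confirms that the generic formula (1) reproduces the closed forms above:

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    """Delta_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>, i.e. formula (1)."""
    return psi(x) - psi(y) - grad_psi(y) @ (x - y)

# psi(x) = 0.5 ||x||^2  =>  Delta_psi(x, y) = 0.5 ||x - y||^2
sq = lambda x: 0.5 * (x @ x)
grad_sq = lambda x: x

# psi(x) = sum_i x_i log x_i (negative entropy)  =>  Delta_psi(x, y) = KL(x || y)
# (the KL identity uses sum_i x_i = sum_i y_i = 1, so the "+1" in the gradient cancels)
negent = lambda x: np.sum(x * np.log(x))
grad_negent = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
print(bregman(sq, grad_sq, x, y), 0.5 * np.sum((x - y) ** 2))         # equal
print(bregman(negent, grad_negent, x, y), np.sum(x * np.log(x / y)))  # equal
```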
Properties of the Bregman divergence

• Strict convexity in the first argument $x$. Trivial by the strict convexity of $\psi$.

• Nonnegativity: $\Delta_\psi(x, y) \ge 0$ for all $x, y$, and $\Delta_\psi(x, y) = 0$ if and only if $x = y$. Trivial by strict convexity.

• Asymmetry: in general, $\Delta_\psi(x, y) \ne \Delta_\psi(y, x)$; e.g., the KL divergence. Symmetrization is not always useful.

• Non-convexity in the second argument. Let $\Omega = [1, \infty)$ and $\psi(x) = -\log x$. Then $\Delta_\psi(x, y) = -\log x + \log y + \frac{x - y}{y}$. One can check that its second-order derivative in $y$ is $\frac{1}{y^2}\left(\frac{2x}{y} - 1\right)$, which is negative when $2x < y$.

• Linearity in $\psi$. For any $a > 0$, $\Delta_{\psi + a\varphi}(x, y) = \Delta_\psi(x, y) + a\,\Delta_\varphi(x, y)$.

• Gradient in $x$: $\frac{\partial}{\partial x} \Delta_\psi(x, y) = \nabla\psi(x) - \nabla\psi(y)$. The gradient in $y$ is trickier, and not commonly used.

• Generalized triangle inequality:

    $\Delta_\psi(x, y) + \Delta_\psi(y, z) = \psi(x) - \psi(y) - \langle \nabla\psi(y), x - y \rangle + \psi(y) - \psi(z) - \langle \nabla\psi(z), y - z \rangle$    (2)
    $= \Delta_\psi(x, z) + \langle x - y, \nabla\psi(z) - \nabla\psi(y) \rangle.$    (3)

• Special case: $\psi$ is called strongly convex with respect to some norm $\|\cdot\|$ with modulus $\sigma$ if

    $\psi(x) \ge \psi(y) + \langle \nabla\psi(y), x - y \rangle + \frac{\sigma}{2} \|x - y\|^2.$    (4)

  Note the norm here is not necessarily the Euclidean norm. When the norm is Euclidean, this condition is equivalent to $\psi(x) - \frac{\sigma}{2}\|x\|^2$ being convex. For example, the $\psi(x) = \sum_i x_i \log x_i$ used in the KL divergence is 1-strongly convex over the simplex $\Omega = \{x \in \mathbb{R}^n_+ : \sum_i x_i = 1\}$ with respect to the $\ell_1$ norm. When $\psi$ is $\sigma$ strongly convex, we have

    $\Delta_\psi(x, y) \ge \frac{\sigma}{2} \|x - y\|^2.$    (5)

  Proof: By definition, $\Delta_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla\psi(y), x - y \rangle \ge \frac{\sigma}{2}\|x - y\|^2$.

• Duality. Suppose $\psi$ is strongly convex. Then

    $(\nabla\psi^*)(\nabla\psi(x)) = x, \qquad \Delta_\psi(x, y) = \Delta_{\psi^*}(\nabla\psi(y), \nabla\psi(x)).$    (6)

  Proof (for the first equality only): Recall

    $\psi^*(y) = \sup_{z \in \Omega} \{ \langle z, y \rangle - \psi(z) \}.$    (7)

  The sup must be attained because $\psi$ is strongly convex and $\Omega$ is closed, and $x$ is a maximizer if and only if $y = \nabla\psi(x)$. So

    $\psi^*(y) + \psi(x) = \langle x, y \rangle \iff y = \nabla\psi(x).$    (8)

  Since $\psi = \psi^{**}$, we have $\psi^*(y) + \psi^{**}(x) = \langle x, y \rangle$, which means $y$ is the maximizer in

    $\psi^{**}(x) = \sup_z \{ \langle x, z \rangle - \psi^*(z) \}.$    (9)

  This means $x = \nabla\psi^*(y)$. To summarize, $(\nabla\psi^*)(\nabla\psi(x)) = x$.

• Mean of a distribution. Suppose $U$ is a random variable over an open set $S$ with distribution $\mu$. Then

    $\min_{x \in S} \mathbb{E}_{U \sim \mu}[\Delta_\psi(U, x)]$    (10)

  is optimized at $\bar{u} := \mathbb{E}_\mu[U] = \int_{u \in S} u \, \mu(u) \, du$.

  Proof: For any $x \in S$, we have

    $\mathbb{E}_{U \sim \mu}[\Delta_\psi(U, x)] - \mathbb{E}_{U \sim \mu}[\Delta_\psi(U, \bar{u})]$    (11)
    $= \mathbb{E}_\mu[\psi(U) - \psi(x) - (U - x)^\top \nabla\psi(x) - \psi(U) + \psi(\bar{u}) + (U - \bar{u})^\top \nabla\psi(\bar{u})]$    (12)
    $= \psi(\bar{u}) - \psi(x) + x^\top \nabla\psi(x) - \bar{u}^\top \nabla\psi(\bar{u}) + \mathbb{E}_\mu[-U^\top \nabla\psi(x) + U^\top \nabla\psi(\bar{u})]$    (13)
    $= \psi(\bar{u}) - \psi(x) - (\bar{u} - x)^\top \nabla\psi(x)$    (14)
    $= \Delta_\psi(\bar{u}, x).$    (15)

  This must be nonnegative, and is 0 if and only if $x = \bar{u}$.

• Pythagorean theorem. If $x^*$ is the projection of $x_0$ onto a convex set $C \subseteq \Omega$,

    $x^* = \operatorname{argmin}_{x \in C} \Delta_\psi(x, x_0),$    (16)

  then for all $y \in C$,

    $\Delta_\psi(y, x_0) \ge \Delta_\psi(y, x^*) + \Delta_\psi(x^*, x_0).$    (17)

  In the Euclidean case, this means the angle $\angle y x^* x_0$ is obtuse. More generally:

Lemma 2 Suppose $L$ is a proper convex function whose domain is an open set containing $C$ ($L$ is not necessarily differentiable), and let

    $x^* = \operatorname{argmin}_{x \in C} \{ L(x) + \Delta_\psi(x, x_0) \}.$    (18)

Then for any $y \in C$ we have

    $L(y) + \Delta_\psi(y, x_0) \ge L(x^*) + \Delta_\psi(x^*, x_0) + \Delta_\psi(y, x^*).$    (19)

The projection in (16) is just the special case $L = 0$. This property is the key to the analysis of many optimization algorithms using Bregman divergences.

Proof: Denote $J(x) = L(x) + \Delta_\psi(x, x_0)$. Since $x^*$ minimizes $J$ over $C$, there must exist a subgradient $d \in \partial J(x^*)$ such that

    $\langle d, x - x^* \rangle \ge 0, \quad \forall\, x \in C.$    (20)

Since $\partial J(x^*) = \{ g + \nabla_x \Delta_\psi(x, x_0)|_{x = x^*} : g \in \partial L(x^*) \} = \{ g + \nabla\psi(x^*) - \nabla\psi(x_0) : g \in \partial L(x^*) \}$, there must be a subgradient $g \in \partial L(x^*)$ such that

    $\langle g + \nabla\psi(x^*) - \nabla\psi(x_0), x - x^* \rangle \ge 0, \quad \forall\, x \in C.$    (21)

Therefore, using the defining property of subgradients, we have for all $y \in C$ that

    $L(y) \ge L(x^*) + \langle g, y - x^* \rangle$    (22)
    $\ge L(x^*) + \langle \nabla\psi(x_0) - \nabla\psi(x^*), y - x^* \rangle$    (23)
    $= L(x^*) - \langle \nabla\psi(x_0), x^* - x_0 \rangle + \psi(x^*) - \psi(x_0)$    (24)
    $\quad + \langle \nabla\psi(x_0), y - x_0 \rangle - \psi(y) + \psi(x_0)$    (25)
    $\quad - \langle \nabla\psi(x^*), y - x^* \rangle + \psi(y) - \psi(x^*)$    (26)
    $= L(x^*) + \Delta_\psi(x^*, x_0) - \Delta_\psi(y, x_0) + \Delta_\psi(y, x^*).$    (27)

Rearranging completes the proof.

2 Mirror Descent

Why bother? Because the rate of convergence of subgradient descent often depends on the dimension of the problem. Suppose we want to minimize a function $f$ over a set $C$. Recall the subgradient descent rule

    $x_{k+\frac{1}{2}} = x_k - \alpha_k g_k, \quad \text{where } g_k \in \partial f(x_k),$    (28)
    $x_{k+1} = \operatorname{argmin}_{x \in C} \frac{1}{2} \big\|x - x_{k+\frac{1}{2}}\big\|^2 = \operatorname{argmin}_{x \in C} \frac{1}{2} \|x - (x_k - \alpha_k g_k)\|^2.$    (29)

This can be interpreted as follows. First approximate $f$ around $x_k$ by a first-order Taylor expansion,

    $f(x) \approx f(x_k) + \langle g_k, x - x_k \rangle.$    (30)

Then penalize the displacement by $\frac{1}{2\alpha_k} \|x - x_k\|^2$. So the update rule is to find a regularized minimizer of the model:

    $x_{k+1} = \operatorname{argmin}_{x \in C} \left\{ f(x_k) + \langle g_k, x - x_k \rangle + \frac{1}{2\alpha_k} \|x - x_k\|^2 \right\}.$    (31)

It is trivial to see that this is exactly equivalent to (29). To generalize the method beyond the Euclidean distance, it is straightforward to use the Bregman divergence as the measure of displacement:

    $x_{k+1} = \operatorname{argmin}_{x \in C} \left\{ f(x_k) + \langle g_k, x - x_k \rangle + \frac{1}{\alpha_k} \Delta_\psi(x, x_k) \right\}$    (32)
    $= \operatorname{argmin}_{x \in C} \{ \alpha_k f(x_k) + \alpha_k \langle g_k, x - x_k \rangle + \Delta_\psi(x, x_k) \}.$    (33)

Mirror descent interpretation. Suppose the constraint set $C$ is the whole space (i.e., there is no constraint). Then we can take the gradient with respect to $x$ and find the optimality condition

    $g_k + \frac{1}{\alpha_k} (\nabla\psi(x_{k+1}) - \nabla\psi(x_k)) = 0$    (34)
    $\iff \nabla\psi(x_{k+1}) = \nabla\psi(x_k) - \alpha_k g_k$    (35)
    $\iff x_{k+1} = (\nabla\psi)^{-1}(\nabla\psi(x_k) - \alpha_k g_k) = (\nabla\psi^*)(\nabla\psi(x_k) - \alpha_k g_k).$    (36)

That is, we map $x_k$ into the "mirror" (dual) space via $\nabla\psi$, take the gradient step there, and map back via $\nabla\psi^*$. For example, for the KL divergence over the simplex, the update rule becomes

    $x_{k+1}(i) = x_k(i) \exp(-\alpha_k g_k(i)).$    (37)
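Update (37) is one line of code. Below is a minimal sketch of a single entropic mirror descent step (NumPy; the function name is my own). When $C$ is the simplex rather than the whole positive orthant, the multiplicative update is followed by a renormalization, which one can check is exactly the Bregman (KL) projection back onto the simplex:

```python
import numpy as np

def entropic_md_step(x_k, g_k, alpha_k):
    """One mirror descent step with psi(x) = sum_i x_i log x_i.

    Here grad psi(x) = log(x) + 1, so condition (35) reads
        log x_{k+1} = log x_k - alpha_k * g_k,
    i.e. the multiplicative update (37).
    """
    w = x_k * np.exp(-alpha_k * g_k)  # unconstrained update (37)
    return w / w.sum()                # KL (Bregman) projection onto the simplex

# One step from the uniform distribution:
x = np.full(4, 0.25)
g = np.array([1.0, 0.0, -1.0, 0.0])  # a subgradient of f at x
print(entropic_md_step(x, g, alpha_k=0.1))
```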
Rate of convergence. Recall that in unconstrained subgradient descent we followed four steps.

1. Bound on a single update:

    $\|x_{k+1} - x^*\|_2^2 = \|x_k - \alpha_k g_k - x^*\|_2^2$    (38)
    $= \|x_k - x^*\|_2^2 - 2\alpha_k \langle g_k, x_k - x^* \rangle + \alpha_k^2 \|g_k\|_2^2$    (39)
    $\le \|x_k - x^*\|_2^2 - 2\alpha_k (f(x_k) - f(x^*)) + \alpha_k^2 \|g_k\|_2^2.$    (40)

2. Telescope:

    $\|x_{T+1} - x^*\|_2^2 \le \|x_1 - x^*\|_2^2 - 2 \sum_{k=1}^T \alpha_k (f(x_k) - f(x^*)) + \sum_{k=1}^T \alpha_k^2 \|g_k\|_2^2.$    (41)

3. Bound using $\|x_1 - x^*\|_2^2 \le R^2$ and $\|g_k\|_2^2 \le G^2$:

    $2 \sum_{k=1}^T \alpha_k (f(x_k) - f(x^*)) \le R^2 + G^2 \sum_{k=1}^T \alpha_k^2.$    (42)

4. Denote $\epsilon_k = f(x_k) - f(x^*)$ and rearrange:

    $\min_{k \in \{1, \dots, T\}} \epsilon_k \le \frac{R^2 + G^2 \sum_{k=1}^T \alpha_k^2}{2 \sum_{k=1}^T \alpha_k}.$    (43)

By setting the step size $\alpha_k$ judiciously, we can achieve

    $\min_{k \in \{1, \dots, T\}} \epsilon_k \le \frac{RG}{\sqrt{T}}.$    (44)

Suppose $C$ is the simplex. Then $R \le \sqrt{2}$. If each coordinate of each gradient $g_k$ is upper bounded by $M$, then $G$ can be as large as $M\sqrt{n}$, i.e., it depends on the dimension.

Clearly, steps 2 to 4 can be easily extended by replacing $\|x_{k+1} - x^*\|_2^2$ with $\Delta_\psi(x^*, x_{k+1})$. So the only challenge left is to extend step 1. This is possible via Lemma 2. We further assume $\psi$ is $\sigma$ strongly convex. In (33), consider $\alpha_k (f(x_k) + \langle g_k, x - x_k \rangle)$ as the $L$ in Lemma 2. Then

    $\alpha_k (f(x_k) + \langle g_k, x^* - x_k \rangle) + \Delta_\psi(x^*, x_k) \ge \alpha_k (f(x_k) + \langle g_k, x_{k+1} - x_k \rangle) + \Delta_\psi(x_{k+1}, x_k) + \Delta_\psi(x^*, x_{k+1}).$    (45)

Canceling some terms and rearranging, we obtain

    $\Delta_\psi(x^*, x_{k+1}) \le \Delta_\psi(x^*, x_k) + \alpha_k \langle g_k, x^* - x_{k+1} \rangle - \Delta_\psi(x_{k+1}, x_k)$    (46)
    $= \Delta_\psi(x^*, x_k) + \alpha_k \langle g_k, x^* - x_k \rangle + \alpha_k \langle g_k, x_k - x_{k+1} \rangle - \Delta_\psi(x_{k+1}, x_k)$    (47)
    $\le \Delta_\psi(x^*, x_k) - \alpha_k (f(x_k) - f(x^*)) + \alpha_k \langle g_k, x_k - x_{k+1} \rangle - \frac{\sigma}{2} \|x_k - x_{k+1}\|^2$    (48)
    $\le \Delta_\psi(x^*, x_k) - \alpha_k (f(x_k) - f(x^*)) + \alpha_k \|g_k\|_* \|x_k - x_{k+1}\| - \frac{\sigma}{2} \|x_k - x_{k+1}\|^2$    (49)
    $\le \Delta_\psi(x^*, x_k) - \alpha_k (f(x_k) - f(x^*)) + \frac{\alpha_k^2}{2\sigma} \|g_k\|_*^2,$    (50)

where $\|\cdot\|_*$ denotes the dual norm and the last step uses $at - \frac{\sigma}{2} t^2 \le \frac{a^2}{2\sigma}$ with $t = \|x_k - x_{k+1}\|$. Now, comparing with (40), we have successfully replaced $\|x_{k+1} - x^*\|_2^2$ with $\Delta_\psi(x^*, x_{k+1})$.
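The payoff, continuing the simplex example: for $\psi(x) = \sum_i x_i \log x_i$ we have $\sigma = 1$ with respect to the $\ell_1$ norm, the dual norm is $\ell_\infty$ (so $\|g_k\|_* \le M$, with no $\sqrt{n}$ factor), and $\Delta_\psi(x^*, x_1) \le \log n$ when $x_1$ is uniform; telescoping (50) with a constant step size then gives a bound of order $M\sqrt{2\log(n)/T}$, improving the dimension dependence from $\sqrt{n}$ to $\sqrt{\log n}$. The end-to-end sketch below illustrates this (the function name is mine; a linear objective is used so that $g_k = c$ is an exact gradient with $\|g_k\|_\infty \le M$):

```python
import numpy as np

def entropic_mirror_descent(c, T):
    """Minimize f(x) = <c, x> over the simplex by entropic mirror descent.

    Constant step size alpha = sqrt(2 log n / T) / M, the choice suggested by
    bound (50) with sigma = 1, Delta_psi(x*, x_1) <= log n, ||g_k||_inf <= M.
    """
    n = len(c)
    M = np.max(np.abs(c))           # bound on ||g_k||_inf
    alpha = np.sqrt(2.0 * np.log(n) / T) / M
    x = np.full(n, 1.0 / n)         # x_1 = uniform distribution
    best = np.inf
    for _ in range(T):
        g = c                       # gradient of the linear objective
        x = x * np.exp(-alpha * g)  # update (37)
        x /= x.sum()                # KL projection onto the simplex
        best = min(best, c @ x)
    return x, best

c = np.array([0.9, 0.5, 0.1, 0.7])
x_T, best = entropic_mirror_descent(c, T=5000)
print(x_T)              # mass concentrates on argmin(c), here index 2
print(best - c.min())   # optimality gap, of order M * sqrt(2 log n / T)
```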
Recommended publications
  • Learning to Approximate a Bregman Divergence
Learning to Approximate a Bregman Divergence. Ali Siahkamari, Xide Xia, Venkatesh Saligrama, David Castañón, Brian Kulis. Department of Electrical and Computer Engineering and Department of Computer Science, Boston University, Boston, MA 02215. {siaa, xidexia, srv, dac, bkulis}@bu.edu. Abstract: Bregman divergences generalize measures such as the squared Euclidean distance and the KL divergence, and arise throughout many areas of machine learning. In this paper, we focus on the problem of approximating an arbitrary Bregman divergence from supervision, and we provide a well-principled approach to analyzing such approximations. We develop a formulation and algorithm for learning arbitrary Bregman divergences based on approximating their underlying convex generating function via a piecewise linear function. We provide theoretical approximation bounds using our parameterization and show that the generalization error $O_p(m^{-1/2})$ for metric learning using our framework matches the known generalization error in the strictly less general Mahalanobis metric learning setting. We further demonstrate empirically that our method performs well in comparison to existing metric learning methods, particularly for clustering and ranking problems. 1 Introduction. Bregman divergences arise frequently in machine learning. They play an important role in clustering [3] and optimization [7], and specific Bregman divergences such as the KL divergence and squared Euclidean distance are fundamental in many areas. Many learning problems require divergences other than Euclidean distances (for instance, when requiring a divergence between two distributions), and Bregman divergences are natural in such settings. The goal of this paper is to provide a well-principled framework for learning an arbitrary Bregman divergence from supervision.
• Bregman Divergence Bounds and Universality Properties of the Logarithmic Loss
Bregman Divergence Bounds and Universality Properties of the Logarithmic Loss. Amichai Painsky, Member, IEEE, and Gregory W. Wornell, Fellow, IEEE. IEEE Transactions on Information Theory, DOI 10.1109/TIT.2019.2958705. Abstract: A loss function measures the discrepancy between the true values and their estimated fits, for a given instance of data. In classification problems, a loss function is said to be proper if a minimizer of the expected loss is the true underlying probability. We show that for binary classification, the divergence associated with smooth, proper, and convex loss functions is upper bounded by the Kullback-Leibler (KL) divergence, to within a normalization constant. This implies that by minimizing the logarithmic loss associated with the KL divergence, we minimize an upper bound to any choice of loss from this set. As such, the logarithmic loss is universal in the sense of providing performance guarantees with respect to a broad class of accuracy measures. Moreover, designing a predictor with respect to one measure may result in poor performance when evaluated by another. For example, the minimizer of a 0-1 loss may result in an unbounded loss when measured with a Bernoulli log-likelihood loss. In such cases, it would be desirable to design a predictor according to a "universal" measure, i.e., one that is suitable for a variety of purposes and provides performance guarantees for different uses. In this paper, we show that for binary classification, the Bernoulli log-likelihood loss (log-loss) is such a universal choice.
• Metrics Defined by Bregman Divergences
Metrics Defined by Bregman Divergences. Pengwen Chen, Yunmei Chen, and Murali Rao. Abstract: Bregman divergences are generalizations of the well-known Kullback-Leibler divergence. They are based on convex functions and have recently received great attention. We present a class of "squared root metrics" based on Bregman divergences. They can be regarded as natural generalizations of the Euclidean distance. We provide necessary and sufficient conditions for a convex function so that the square root of its associated average Bregman divergence is a metric. Key words: metrics, Bregman divergence, convexity. Analysis of noise-corrupted data is difficult without interpreting the data to have been randomly drawn from some unknown distribution with unknown parameters. The most common assumption on noise is the Gaussian distribution. However, it may be inappropriate if the data is binary-valued, integer-valued, or nonnegative. The Gaussian is a member of the exponential family; other members of this family, for example the Poisson and the Bernoulli, are better suited for integer and binary data. Exponential families and Bregman divergences (Definition 1.1) have a very intimate relationship: there exists a unique Bregman divergence corresponding to every regular exponential family [13][3]. More precisely, the log-likelihood of an exponential family distribution can be represented by a sum of a Bregman divergence and a parameter-unrelated term. Hence, the Bregman divergence provides a likelihood distance for exponential families in some sense. This property has been used in generalizing principal component analysis to the exponential family [7]. The Bregman divergence, however, is not a metric, because it is not symmetric and does not satisfy the triangle inequality.
  • On a Generalization of the Jensen-Shannon Divergence and the Jensen-Shannon Centroid
On a generalization of the Jensen-Shannon divergence and the JS-symmetrization of distances relying on abstract means. Frank Nielsen, Sony Computer Science Laboratories, Inc., Tokyo, Japan. arXiv:1904.04017v3 [cs.IT], 10 Dec 2020. Abstract: The Jensen-Shannon divergence is a renowned bounded symmetrization of the unbounded Kullback-Leibler divergence which measures the total Kullback-Leibler divergence to the average mixture distribution. However, the Jensen-Shannon divergence between Gaussian distributions is not available in closed form. To bypass this problem, we present a generalization of the Jensen-Shannon (JS) divergence using abstract means which yields closed-form expressions when the mean is chosen according to the parametric family of distributions. More generally, we define the JS-symmetrizations of any distance using generalized statistical mixtures derived from abstract means. In particular, we first show that the geometric mean is well-suited for exponential families, and report two closed-form formulas for (i) the geometric Jensen-Shannon divergence between probability densities of the same exponential family, and (ii) the geometric JS-symmetrization of the reverse Kullback-Leibler divergence. As a second illustrating example, we show that the harmonic mean is well-suited for the scale Cauchy distributions, and report a closed-form formula for the harmonic Jensen-Shannon divergence between scale Cauchy distributions. We also define generalized Jensen-Shannon divergences between matrices (e.g., quantum Jensen-Shannon divergences) and consider clustering with respect to these novel Jensen-Shannon divergences. Keywords: Jensen-Shannon divergence, Jeffreys divergence, resistor average distance, Bhattacharyya distance, Chernoff information, f-divergence, Jensen divergence, Burbea-Rao divergence, Bregman divergence, abstract weighted mean, quasi-arithmetic mean, mixture family, statistical M-mixture, exponential family, Gaussian family, Cauchy scale family, clustering.
  • Information Divergences and the Curious Case of the Binary Alphabet
Information Divergences and the Curious Case of the Binary Alphabet. Jiantao Jiao, Thomas Courtade, Albert No, Kartik Venkat, and Tsachy Weissman. Department of Electrical Engineering, Stanford University; EECS Department, University of California, Berkeley. Abstract: Four problems related to information divergence measures defined on finite alphabets are considered. In three of the cases we consider, we illustrate a contrast which arises between the binary-alphabet and larger-alphabet settings. This is surprising in some instances, since characterizations for the larger-alphabet settings do not generalize their binary-alphabet counterparts. For example, we show that f-divergences are not the unique decomposable divergences on binary alphabets that satisfy the data processing inequality, despite contrary claims in the literature. I. Introduction. Divergence measures play a central role in information theory and other branches of mathematics. Many special classes of divergences have been studied; the answers to the above series of questions imply that the class of "interesting" divergence measures can be very small when n ≥ 3 (e.g., restricted to the class of f-divergences, or a multiple of KL divergence). However, in the binary alphabet setting, the class of "interesting" divergence measures is strikingly rich, a point which will be emphasized in our results. In many ways, this richness is surprising, since the binary alphabet is usually viewed as the simplest setting one can consider. In particular, we would expect that the class of interesting divergence measures corresponding to binary alphabets would be less rich than the counterpart class for larger alphabets. However, as we will see, the opposite is true.
  • Statistical Exponential Families: a Digest with Flash Cards
Statistical exponential families: A digest with flash cards. Frank Nielsen (École Polytechnique, France, and Sony Computer Science Laboratories Inc., Japan) and Vincent Garcia (École Polytechnique, France). May 16, 2011 (v2.0). arXiv:0911.4863v2 [cs.LG]. Abstract: This document describes concisely the ubiquitous class of exponential family distributions met in statistics. The first part recalls definitions and summarizes main properties and duality with Bregman divergences (all proofs are skipped). The second part lists decompositions and related formulas of common exponential family distributions. We recall the Fisher-Rao-Riemannian geometries and the dual affine connection information geometries of statistical manifolds. It is intended to maintain and update this document and catalog by adding new distribution items. (See the jMEF library, a Java package for processing mixtures of exponential families, available for download at http://www.lix.polytechnique.fr/~nielsen/MEF/.) Part I: A digest of exponential families. 1 Essentials of exponential families. 1.1 Sufficient statistics. A fundamental problem in statistics is to recover the model parameters λ from a given set of observations x1, ..., etc. Those samples are assumed to be randomly drawn from an independent and identically-distributed random vector with associated density p(x; λ). Since the sample set is finite, statisticians estimate a close approximation λ̂ of the true parameter. However, a surprising fact is that one can collect and concentrate from a random sample all necessary information for recovering/estimating the parameters. The information is collected into a few elementary statistics of the random vector, called the sufficient statistic. Figure 1 illustrates the notions of statistics and sufficiency.
• Applications of Bregman Divergence Measures in Bayesian Modeling
Applications of Bregman Divergence Measures in Bayesian Modeling. Gyuhyeong Goh, Ph.D., University of Connecticut, 2015. Doctoral Dissertation 785, University of Connecticut Graduate School, OpenCommons@UConn, 7-6-2015, https://opencommons.uconn.edu/dissertations/785. Recommended citation: Goh, Gyuhyeong, "Applications of Bregman Divergence Measures in Bayesian Modeling" (2015). Doctoral Dissertations. 785. Abstract: This dissertation has mainly focused on the development of statistical theory, methodology, and application from a Bayesian perspective using a general class of divergence measures (or loss functions), called Bregman divergence. Many applications of Bregman divergence have played a key role in recent advances in machine learning. My goal is to turn the spotlight on Bregman divergence and its applications in Bayesian modeling. Since Bregman divergence includes many well-known loss functions such as squared error loss, Kullback-Leibler divergence, Itakura-Saito distance, and Mahalanobis distance, the theoretical and methodological development unify and extend many existing Bayesian methods. The broad applicability of both Bregman divergence and the Bayesian approach can handle diverse types of data such as circular data, high-dimensional data, multivariate data and functional data. Furthermore, ...
• Jensen-Bregman LogDet Divergence with Application to Efficient Similarity Search for Covariance Matrices
Jensen-Bregman LogDet Divergence with Application to Efficient Similarity Search for Covariance Matrices. Anoop Cherian, Suvrit Sra, Arindam Banerjee, and Nikolaos Papanikolopoulos. IEEE Transactions on Pattern Analysis and Machine Intelligence, December 2012. Covariance matrices have found success in several computer vision applications, including activity recognition, visual surveillance, and diffusion tensor imaging. This is because they provide an easy platform for fusing multiple features compactly. An important task in all of these applications is to compare two covariance matrices using a (dis)similarity function, for which the common choice is the Riemannian metric on the manifold inhabited by these matrices. As this Riemannian manifold is not flat, the dissimilarities should take into account the curvature of the manifold. As a result, such distance computations tend to slow down, especially when the matrix dimensions are large or gradients are required. Further, suitability of the metric to enable efficient nearest neighbor retrieval is an important requirement in the contemporary times of big data analytics. To alleviate these difficulties, this paper proposes a novel dissimilarity measure for covariances, the Jensen-Bregman LogDet Divergence (JBLD). This divergence enjoys several desirable theoretical properties, and at the same time is computationally less demanding (compared to standard measures). Utilizing the fact that the square-root of JBLD is a metric, we address the problem of efficient nearest neighbor retrieval on large covariance datasets via a metric tree data structure. To this end, we propose a K-Means clustering algorithm on JBLD. We demonstrate the superior performance of JBLD on covariance datasets from several computer vision applications.
  • Generalized Bregman and Jensen Divergences Which Include Some F-Divergences
Generalized Bregman and Jensen Divergences Which Include Some f-Divergences. Tomohiro Nishiyama. Abstract: In this paper, we introduce new classes of divergences by extending the definitions of the Bregman divergence and the skew Jensen divergence. These new divergence classes (g-Bregman divergence and skew g-Jensen divergence) satisfy some properties similar to the Bregman or skew Jensen divergence. We show these g-divergences include divergences which belong to a class of f-divergence (the Hellinger distance, the chi-square divergence and the alpha-divergence, in addition to the Kullback-Leibler divergence). Moreover, we derive an inequality between the g-Bregman divergence and the skew g-Jensen divergence and show this inequality is a generalization of Lin's inequality. Keywords: Bregman divergence, f-divergence, Jensen divergence, Kullback-Leibler divergence, Jeffreys divergence, Jensen-Shannon divergence, Hellinger distance, chi-square divergence, alpha divergence, centroid, parallelogram, Lin's inequality. 1. Introduction. Divergences are functions that measure the discrepancy between two points and play a key role in the fields of machine learning, signal processing, and so on. Given a set $S$ and $p, q \in S$, a divergence is defined as a function $D : S \times S \to \mathbb{R}$ which satisfies the following properties: 1. $D(p, q) \ge 0$ for all $p, q \in S$; 2. $D(p, q) = 0 \iff p = q$. The Bregman divergence [7], $B_F(p, q) \stackrel{\mathrm{def}}{=} F(p) - F(q) - \langle p - q, \nabla F(q) \rangle$, and the f-divergence [1, 10], $D_f(p \| q) \stackrel{\mathrm{def}}{=} \sum_i q_i f\big(\frac{p_i}{q_i}\big)$, are representative divergences defined by using a strictly convex function, where $F, f : S \to \mathbb{R}$ are strictly convex functions and $f(1) = 0$.
  • Bregman Metrics and Their Applications
Bregman Metrics and Their Applications. By Pengwen Chen. A dissertation presented to the Graduate School of the University of Florida in partial fulfillment of the requirements for the degree of Doctor of Philosophy, University of Florida, 2007. © 2007 Pengwen Chen. Acknowledgments: My first and foremost thanks go to my advisers, Dr. Yunmei Chen and Dr. Murali Rao. Without their constant encouragement and support, I would not have been able to complete this research work. They were very generous with their time and help. Especially, they gave me a lot of helpful mathematical and non-mathematical advice and assistance throughout my research work. I am grateful to my supervisory committee members (William Hager, Gopalakrishnan Jayadeep, Jose Principe, and Rongling Wu) for my study. It is a pleasure to acknowledge their suggestions on this research work. I benefited from every discussion with them. I am especially grateful to Dr. Gopalakrishnan Jayadeep and Dr. William Hager for offering numerical courses. Listening to their lectures was a pleasant experience. I would also like to thank the professors in my department who helped me on various occasions, and in particular Dr. John Klauder for our enjoyable discussions during my first year. Last but not least, I want to thank my wife April, my parents, and her parents for their understanding and support. Table of contents: Acknowledgments; List of Tables; List of Figures; Abstract; Chapter 1, Motivation and Outline of This Paper; Chapter 2, Bregman Metrics (2.1 Introduction; 2.2 Preliminaries and Notations; ...)
  • Low-Rank Kernel Learning with Bregman Matrix Divergences
Low-Rank Kernel Learning with Bregman Matrix Divergences. Brian Kulis, Mátyás A. Sustik, and Inderjit S. Dhillon. Journal of Machine Learning Research 10 (2009) 341-376; submitted 8/07, revised 10/08, published 2/09. Department of Computer Sciences, University of Texas at Austin, Austin, TX 78712, USA. Editor: Michael I. Jordan. Abstract: In this paper, we study low-rank matrix nearness problems, with a focus on learning low-rank positive semidefinite (kernel) matrices for machine learning applications. We propose efficient algorithms that scale linearly in the number of data points and quadratically in the rank of the input matrix. Existing algorithms for learning kernel matrices often scale poorly, with running times that are cubic in the number of data points. We employ Bregman matrix divergences as the measures of nearness; these divergences are natural for learning low-rank kernels since they preserve rank as well as positive semidefiniteness. Special cases of our framework yield faster algorithms for various existing learning problems, and experimental results demonstrate that our algorithms can effectively learn both low-rank and full-rank kernel matrices. Keywords: kernel methods, Bregman divergences, convex optimization, kernel learning, matrix nearness. 1. Introduction. Underlying many machine learning algorithms is a measure of distance, or divergence, between data objects. A number of factors affect the choice of the particular divergence measure used: the data may be intrinsically suited for a specific divergence measure, an algorithm may be faster or easier to implement for some measure, or analysis may be simpler for a particular measure.