Learning with Bregman Divergences
Machine Learning and Optimization

Inderjit S. Dhillon
University of Texas at Austin

Mathematical Programming in Data Mining and Machine Learning, Banff International Research Station, Canada, January 18, 2007

Joint work with Arindam Banerjee, Jason Davis, Joydeep Ghosh, Brian Kulis, Srujana Merugu and Suvrit Sra

Machine Learning: Problems

Unsupervised Learning
  Clustering: group a set of data objects
  Co-clustering: simultaneously partition data objects & features
Matrix Approximation
  SVD: low-rank approximation, minimizes Frobenius error
  NNMA: low-rank non-negative approximation

Supervised Learning
  Classification: k-nearest neighbor, SVMs, boosting, ...
  Many classifiers rely on the choice of distance measures
Kernel Learning: used in “kernelized” algorithms
Metric Learning: Information retrieval, nearest neighbor searches

Example: Clustering

Goal: partition points into k clusters

Example: K-Means Clustering

Minimizes squared Euclidean distance from points to their cluster centroids

Example: K-Means Clustering

Assumes a Gaussian noise model
Corresponds to squared Euclidean distance
What if a different noise model is assumed? Poisson, multinomial, exponential, etc.
We will see: for every exponential family distribution, there exists a corresponding generalized distance measure

Distribution          Distance Measure
Spherical Gaussian    Squared Euclidean Distance
Multinomial           Kullback-Leibler Distance
Exponential           Itakura-Saito Distance

Leads to generalizations of the k-means objective
Bregman divergences are the generalized distance measures

Background

Bregman Divergences: Definition

Let ϕ : S → R be a differentiable, strictly convex function of “Legendre type” (S ⊆ R^d).
The Bregman divergence Dϕ : S × relint(S) → R is defined as

    Dϕ(x, y) = ϕ(x) − ϕ(y) − (x − y)^T ∇ϕ(y)

ϕ(z) = ½ z^T z,    Dϕ(x, y) = ½ ‖x − y‖²

Squared Euclidean distance is a Bregman divergence

ϕ(z) = z log z,    Dϕ(x, y) = x log(x/y) − x + y

Relative Entropy (or KL-divergence) is another Bregman divergence

ϕ(z) = − log z,    Dϕ(x, y) = x/y − log(x/y) − 1

Itakura-Saito Dist. (used in signal processing) is also a Bregman divergence
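
All three examples above are instances of the same formula. For concreteness, a minimal Python sketch (our own illustration, not from the original slides; the helper names are ours) that evaluates Dϕ(x, y) = ϕ(x) − ϕ(y) − (x − y)^T ∇ϕ(y) for these three choices of separable ϕ, applied to vectors:

```python
import numpy as np

def bregman_divergence(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - (x - y)^T grad_phi(y)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return phi(x) - phi(y) - (x - y) @ grad_phi(y)

# phi(z) = 1/2 z^T z          -> squared Euclidean distance
sq_euclidean = lambda x, y: bregman_divergence(
    lambda z: 0.5 * z @ z, lambda z: z, x, y)
# phi(z) = sum_i z_i log z_i  -> relative entropy (KL divergence)
rel_entropy = lambda x, y: bregman_divergence(
    lambda z: np.sum(z * np.log(z)), lambda z: np.log(z) + 1.0, x, y)
# phi(z) = -sum_i log z_i     -> Itakura-Saito distance
itakura_saito = lambda x, y: bregman_divergence(
    lambda z: -np.sum(np.log(z)), lambda z: -1.0 / z, x, y)

x, y = np.array([0.2, 0.5, 0.3]), np.array([0.4, 0.4, 0.2])
print(sq_euclidean(x, y))   # 0.5 * ||x - y||^2
print(rel_entropy(x, y))    # sum_i x_i log(x_i/y_i) - x_i + y_i
print(itakura_saito(x, y))  # sum_i x_i/y_i - log(x_i/y_i) - 1
```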

Bregman Divergences: Properties

Dϕ(x, y) ≥ 0, and equals 0 iff x = y
Not a metric (symmetry, triangle inequality do not hold)
Strictly convex in the 1st argument, but (in general) not in the 2nd
Three-point property generalizes the “Law of cosines”:

    Dϕ(x, y) = Dϕ(x, z) + Dϕ(z, y) − (x − z)^T (∇ϕ(y) − ∇ϕ(z))

Projections

“Bregman projection” of y onto a convex set Ω,

    PΩ(y) = argmin_{ω ∈ Ω} Dϕ(ω, y)

Generalized Pythagorean Theorem:

    Dϕ(x, y) ≥ Dϕ(x, PΩ(y)) + Dϕ(PΩ(y), y)   for all x ∈ Ω

When Ω is an affine set, the above holds with equality
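
As a concrete case (our own illustration, not from the slides): under relative entropy, the Bregman projection onto the affine set {ω : Σᵢ ωᵢ = 1} is simply a rescaling of y, and the generalized Pythagorean theorem then holds with equality:

```python
import numpy as np

def kl(x, y):
    """Relative entropy D_phi(x, y) with phi(z) = sum_i z_i log z_i."""
    return np.sum(x * np.log(x / y) - x + y)

def kl_projection_onto_simplex(y):
    """Bregman projection of y > 0 onto {w : sum(w) = 1} under KL.
    Stationarity gives log(w_i / y_i) = const, hence w = y / sum(y)."""
    return y / y.sum()

y = np.array([0.3, 0.9, 0.6])          # point to be projected
x = np.array([0.2, 0.5, 0.3])          # a point already in the affine set
p = kl_projection_onto_simplex(y)

# Omega is affine, so the generalized Pythagorean theorem is an equality:
print(kl(x, y), kl(x, p) + kl(p, y))   # the two values agree
```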

Bregman's original work

L. M. Bregman. “The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming.” USSR Computational Mathematics and Mathematical Physics, 7:200-217, 1967.

Problem:

    min ϕ(x)  subject to  ai^T x = bi,  i = 0, ..., m − 1

Bregman's cyclic projection method: Start with an appropriate x(0). For t = 0, 1, 2, ..., compute x(t+1) as the Bregman projection of x(t) onto the i-th hyperplane, where i = t mod m.
Converges to the globally optimal solution.
This cyclic projection method can be extended to halfspace and convex constraints, where each projection is followed by a correction.

Question: What role do Bregman divergences play in machine learning?
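
A minimal sketch of the cyclic scheme (our own illustration), specialized to ϕ(x) = ½‖x‖², for which the Bregman projection onto a hyperplane is the ordinary orthogonal projection; the general method replaces that one projection step with its ϕ-specific counterpart:

```python
import numpy as np

def cyclic_bregman_projections(A, b, x0, sweeps=500):
    """Cyclically project onto the hyperplanes a_i^T x = b_i.
    With phi(x) = 0.5 ||x||^2 each Bregman projection is an orthogonal
    projection, and the iterates converge to the feasible point closest
    to x0 in the corresponding divergence (Kaczmarz/POCS special case)."""
    x = np.asarray(x0, dtype=float).copy()
    m = len(b)
    for t in range(sweeps * m):
        i = t % m                              # i = t mod m
        a = A[i]
        x -= (a @ x - b[i]) / (a @ a) * a      # project onto hyperplane i
    return x

A = np.array([[1.0, 1.0], [1.0, -1.0]])        # two hyperplanes in R^2
b = np.array([2.0, 0.0])
print(cyclic_bregman_projections(A, b, np.zeros(2)))   # -> approx [1. 1.]
```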

Exponential Families of Distributions

Definition: A regular exponential family is a family of probability distributions on R^d with density function parameterized by θ:

    pψ(x | θ) = exp{x^T θ − ψ(θ) − gψ(x)}

ψ is the so-called cumulant function, and is a convex function of Legendre type

Example: spherical Gaussians parameterized by mean µ (and fixed variance σ²):

    p(x) = (1/√((2πσ²)^d)) exp{ −‖x − µ‖² / (2σ²) }
         = (1/√((2πσ²)^d)) exp{ x^T(µ/σ²) − (σ²/2) ‖µ/σ²‖² − x^T x/(2σ²) }

Thus θ = µ/σ², and ψ(θ) = (σ²/2) ‖θ‖²

Note: Gaussian distribution ←→ Squared Loss
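
A quick numerical check of this rewriting (our own check; here gψ(x) = x^T x/(2σ²) + (d/2) log(2πσ²) is the term the slide leaves implicit):

```python
import numpy as np

d, sigma2 = 3, 0.7
mu = np.array([0.5, -1.0, 2.0])
x = np.array([0.3, 0.1, 1.5])

# Ordinary spherical Gaussian density
p_direct = (2 * np.pi * sigma2) ** (-d / 2) * np.exp(
    -np.sum((x - mu) ** 2) / (2 * sigma2))

# Exponential-family form exp{x^T theta - psi(theta) - g_psi(x)}
theta = mu / sigma2
psi = sigma2 / 2 * (theta @ theta)
g_psi = x @ x / (2 * sigma2) + d / 2 * np.log(2 * np.pi * sigma2)
p_expfam = np.exp(x @ theta - psi - g_psi)

print(p_direct, p_expfam)   # the two values agree
```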

Example: Poisson Distribution

Poisson Distribution:

    p(x) = (λ^x / x!) e^{−λ},   x ∈ Z+

The Poisson Distribution is a member of the exponential family
Is there a Divergence associated with the Poisson Distribution?

YES — p(x) can be written as

p(x) = exp{−Dϕ(x, µ) − gϕ(x)},

where Dϕ is the Relative Entropy, i.e., Dϕ(x, µ) = x log(x/µ) − x + µ

Implication: Poisson distribution ←→ Relative Entropy
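
A numerical check of this correspondence (our own check; gϕ(x) = log x! − x log x + x collects the terms that do not involve µ):

```python
import math

mu, x = 3.5, 2                       # Poisson mean and an observed count x > 0

# Poisson pmf directly
p_direct = mu ** x * math.exp(-mu) / math.factorial(x)

# Bregman form exp{-D_phi(x, mu) - g_phi(x)}, with D_phi = relative entropy
D = x * math.log(x / mu) - x + mu
g = math.lgamma(x + 1) - x * math.log(x) + x    # log x! - x log x + x
p_bregman = math.exp(-D - g)

print(p_direct, p_bregman)           # the two values agree
```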

Example: Exponential Distribution

Exponential Distribution:

p(x)= λ exp{−λx}

The Exponential Distribution is a member of the exponential family
Is there a Divergence associated with the Exponential Distribution?

YES — p(x) can be written as

p(x) = exp{−Dϕ(x, µ) − h(x)},

where Dϕ is the Itakura-Saito Distance, i.e., Dϕ(x, µ) = x/µ − log(x/µ) − 1

Implication: Exponential distribution ←→ Itakura-Saito Dist.

Bregman Divergences and the Exponential Family

Theorem

Suppose that ϕ and ψ are conjugate Legendre functions. Let Dϕ be the Bregman divergence associated with ϕ, and let pψ(· | θ) be a member of the regular exponential family with cumulant function ψ. Then

    pψ(x | θ) = exp{−Dϕ(x, µ(θ)) − gϕ(x)},

where gϕ is a function uniquely determined by ϕ.

Thus there is a unique Bregman divergence associated with every member of the exponential family
Implication: Member of Exponential Family ←→ unique Bregman Divergence

Machine Learning Applications

Clustering with Bregman Divergences

Let a1, ..., an be data vectors
Goal: divide the data into k disjoint partitions γ1, ..., γk
Objective function for Bregman clustering:

    min_{γ1,...,γk}  Σ_{h=1}^{k} Σ_{ai ∈ γh} Dϕ(ai, yh),

where yh is the representative of the h-th partition

Lemma. The arithmetic mean is the optimal representative for all Dϕ:

    µh ≡ (1/|γh|) Σ_{ai ∈ γh} ai = argmin_x Σ_{ai ∈ γh} Dϕ(ai, x)

The reverse implication also holds
Algorithm: a k-means-type iterative re-partitioning algorithm (sketched below) monotonically decreases the objective
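
A compact sketch of that algorithm (our own illustration; any Bregman divergence evaluated on a pair of vectors can be passed in as div):

```python
import numpy as np

def bregman_kmeans(A, k, div, n_iter=50, seed=0):
    """Generalized k-means: assign each point to its 'nearest' representative
    under div(a, y), then set each representative to the arithmetic mean of
    its cluster (the optimal representative for every Bregman divergence)."""
    rng = np.random.default_rng(seed)
    Y = A[rng.choice(len(A), size=k, replace=False)]        # initial reps
    for _ in range(n_iter):
        D = np.array([[div(a, y) for y in Y] for a in A])   # n x k table
        labels = D.argmin(axis=1)
        Y = np.array([A[labels == h].mean(axis=0) if np.any(labels == h)
                      else Y[h] for h in range(k)])
    return labels, Y

# Example: cluster positive vectors under relative entropy
kl = lambda a, y: np.sum(a * np.log(a / y) - a + y)
A = np.random.default_rng(1).gamma(shape=2.0, scale=1.0, size=(100, 5))
labels, reps = bregman_kmeans(A, k=3, div=kl)
```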

Co-Clustering with Bregman Divergences

Let A = [a1,..., an] be an m × n data matrix

Goal: partition A into k row clusters and ℓ column clusters

How do we judge the quality of co-clustering?

Use the quality of an “associated” matrix approximation
Associate a matrix approximation using the Minimum Bregman Information (MBI) principle

Objective: Find optimal co-clustering ↔ optimal MBI approximation

Example: Information-Theoretic Co-Clustering
Measures approximation error using relative entropy

Co-Clustering as Matrix Approximation

[Figure: KL-divergence vs. log10(number of parameters), for co-clustering and NNMA]

Error of approximation vs. number of parameters

M = 5471, N = 300

NNMA approximation computed using Lee & Seung’s algorithm

Co-Clustering as Matrix Approximation

[Figure: KL-divergence vs. log10(number of parameters), for co-clustering and NNMA]

Error of approximation vs. number of parameters

M = 4303, N = 3891

NNMA approximation computed using Lee & Seung’s algorithm

Co-Clustering Applied to Bioinformatics

Gene Expression: Leukemia data
Matrix contains expression levels of genes in different tissue samples

Co-clustering recovers cancer samples & functionally related genes

[Figure: co-clustered gene expression matrix; tissue samples (T-ALL, AML, B-ALL) along the columns and gene clusters R1-R6 along the rows]

Learning Over Matrix Inputs

Many problems in machine learning require optimization over symmetric matrices

Kernel learning: find a kernel matrix that satisfies a set of constraints
  Support vector machines
  Semi-supervised graph clustering via kernels

Distance metric learning: find a Mahalanobis distance metric
  Information retrieval
  k-nearest neighbor classification

Bregman divergences can be naturally extended to matrix-valued inputs

Bregman Matrix Divergences

Let H be the space of N × N Hermitian matrices, λ : H → R^N the eigenvalue map, and ϕ : R^N → R a convex function of Legendre type; set ϕ̂ = ϕ ◦ λ. Define

    Dϕ̂(X, Y) = ϕ̂(X) − ϕ̂(Y) − trace((∇ϕ̂(Y))* (X − Y))

Squared Frobenius norm: ϕ̂(X) = ½ ‖X‖_F². Then

    Dϕ̂(X, Y) = ½ ‖X − Y‖_F²

Used in many nearness problems

von Neumann Divergence: For X ⪰ 0, ϕ̂(X) = trace(X log X). Then

    Dϕ̂(X, Y) = trace(X log X − X log Y − X + Y)

also called quantum relative entropy

LogDet divergence: For X ≻ 0, ϕ̂(X) = − log det X. Then

    Dϕ̂(X, Y) = trace(XY⁻¹) − log det(XY⁻¹) − N

Interesting Connection: The differential relative entropy between two equal-mean Gaussians with covariance matrices X and Y EXACTLY equals the LogDet divergence between X and Y
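
These matrix divergences are straightforward to evaluate for small dense matrices; a sketch using NumPy/SciPy (our own code; X and Y are taken strictly positive definite so all logarithms are well defined):

```python
import numpy as np
from scipy.linalg import logm

def frobenius_div(X, Y):
    return 0.5 * np.linalg.norm(X - Y, 'fro') ** 2

def von_neumann_div(X, Y):
    return np.trace(X @ logm(X) - X @ logm(Y) - X + Y).real

def logdet_div(X, Y):
    N = X.shape[0]
    XYinv = X @ np.linalg.inv(Y)
    sign, logabsdet = np.linalg.slogdet(XYinv)
    return np.trace(XYinv) - logabsdet - N

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4)); X = B @ B.T + np.eye(4)   # random spd matrices
C = rng.normal(size=(4, 4)); Y = C @ C.T + np.eye(4)

print(frobenius_div(X, Y), von_neumann_div(X, Y), logdet_div(X, Y))
```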

Low-Rank Kernel Learning

Learn a low-rank spd matrix that satisfies given constraints:

    min_K   Dϕ̂(K, K0)
    subject to   trace(K Ai) ≤ bi,  1 ≤ i ≤ c
                 rank(K) ≤ r
                 K ⪰ 0

Problem is non-convex due to rank constraint

Preserving Range Space

Lemma

Suppose ϕ is separable, i.e., ϕ(x) = Σᵢ ϕs(xᵢ). Let the spectral decompositions of X and Y be X = V Λ V^T and Y = U Θ U^T. Then

    Dϕ̂(X, Y) = Σᵢ Σⱼ (vᵢ^T uⱼ)² Dϕs(λᵢ, θⱼ).

Example: LogDet Divergence can be written as

    DLogDet(X, Y) = Σᵢ Σⱼ (vᵢ^T uⱼ)² ( λᵢ/θⱼ − log(λᵢ/θⱼ) − 1 )

Corollary 1: DvN(X, Y) is finite iff range(X) ⊆ range(Y)

Corollary 2: DLogDet(X, Y) is finite iff range(X) = range(Y)
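
A numerical check of the lemma for the LogDet case (our own check): the double sum of scaled scalar Itakura-Saito divergences matches trace(XY⁻¹) − log det(XY⁻¹) − N.

```python
import numpy as np

def logdet_div(X, Y):
    N = X.shape[0]
    XYinv = X @ np.linalg.inv(Y)
    sign, logabsdet = np.linalg.slogdet(XYinv)
    return np.trace(XYinv) - logabsdet - N

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4)); X = B @ B.T + np.eye(4)
C = rng.normal(size=(4, 4)); Y = C @ C.T + np.eye(4)

lam, V = np.linalg.eigh(X)        # X = V diag(lam) V^T
theta, U = np.linalg.eigh(Y)      # Y = U diag(theta) U^T

# Right-hand side of the lemma, with the scalar divergence for phi_s = -log
rhs = sum((V[:, i] @ U[:, j]) ** 2 *
          (lam[i] / theta[j] - np.log(lam[i] / theta[j]) - 1)
          for i in range(4) for j in range(4))

print(logdet_div(X, Y), rhs)      # the two values agree
```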

Low-Rank Kernel Learning

Implication: rank(K) ≤ rank(K0) for vN-divergence and rank(K) = rank(K0) for LogDet divergence

Adapt Bregman’s algorithm to solve the problem

    min_K   Dϕ̂(K, K0)
    subject to   trace(K Ai) ≤ bi,  1 ≤ i ≤ c

Algorithm works on factored forms of the kernel matrix

Bregman projections onto a rank-one constraint can be computed in O(r²) time for both divergences

Details

LogDet divergence
  Projection can be easily computed in closed form
  Iterate is updated using the Sherman-Morrison formula
  Requires O(r²) Cholesky decomposition of I + α x x^T
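
For intuition, a sketch of a single LogDet Bregman projection onto one equality constraint trace(K zz^T) = b, written on the full matrix (our own derivation via stationarity plus Sherman-Morrison; the actual algorithm works on low-rank factored forms to obtain the O(r²) cost quoted above):

```python
import numpy as np

def logdet_project_rank_one(K, z, b):
    """Bregman projection of spd K onto {K' : z^T K' z = b} under LogDet.
    Stationarity gives K'^{-1} = K^{-1} + alpha z z^T; Sherman-Morrison and
    the constraint yield K' = K + ((b - p) / p^2) (Kz)(Kz)^T with p = z^T K z.
    For b > 0 the result stays positive definite."""
    p = z @ K @ z
    Kz = K @ z
    return K + (b - p) / p ** 2 * np.outer(Kz, Kz)

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5)); K = B @ B.T + np.eye(5)   # random spd matrix
z = rng.normal(size=5)

K_new = logdet_project_rank_one(K, z, b=0.5)
print(z @ K_new @ z)                          # -> 0.5 (constraint satisfied)
print(np.all(np.linalg.eigvalsh(K_new) > 0))  # -> True
```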

von Neumann divergence
  Projection computed by a custom non-linear solver with quadratic convergence
  Iterate is updated using the eigenvalue decomposition of I + α x x^T
  Requires O(r²) update using the fast multipole method

Largest problem size handled: n = 20,000 with r = 16

Useful for learning low-rank kernels for support vector machines, semi-supervised clustering, etc.

Information-Theoretic Metric Learning

Problem: Learn a Mahalanobis metric

    dX(y1, y2) = (y1 − y2)^T X (y1 − y2)

that satisfies given pairwise distance constraints

The following problems are equivalent:

Metric Learning:
    min_X   KL(p(y; µ, X) ‖ p(y; µ, I))
    s.t.    dX(yi, yj) ≤ U,  (i, j) ∈ S
            dX(yi, yj) ≥ L,  (i, j) ∈ D
            X ⪰ 0

Kernel Learning:
    min_K   Dϕ̂(K, K0)
    s.t.    trace(K Ai) ≤ bi
            rank(K) ≤ r
            K ⪰ 0

where the connection is that K0 = Y^T Y, K = Y^T X Y, and r = m

Note that K0 and K are low-rank when n > m
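
A small numerical illustration of the connection (our own check): with K = Y^T X Y, the Mahalanobis distance between two data points (columns of Y) can be read off from the kernel as Kii + Kjj − 2Kij.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 6
Y = rng.normal(size=(m, n))                             # data points as columns
B = rng.normal(size=(m, m)); X = B @ B.T + np.eye(m)    # Mahalanobis matrix (spd)

K = Y.T @ X @ Y                                         # associated kernel matrix

i, j = 1, 4
d_metric = (Y[:, i] - Y[:, j]) @ X @ (Y[:, i] - Y[:, j])
d_kernel = K[i, i] + K[j, j] - 2 * K[i, j]
print(d_metric, d_kernel)                               # the two values agree
```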

Challenges

Algorithms
  Bregman's method is simple, but suffers from slow convergence
  Interior point methods? Numerical stability?

Choosing an appropriate Bregman Divergence
  Noise models are not always available
  How to choose the best Bregman divergence?

What Bregman Divergence to use?

NNMA approximation: A ≈ VH
Some divergences might preserve sparsity better than others

[Figure: The Sparsity of NNMA Solutions for Relative Entropy and Frobenius Norm; sparsity (bars) and Hoyer's measure (lines) for the factors V and H at ranks 5 and 10]

References

Clustering

“Clustering with Bregman Divergences”, A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Journal of Machine Learning Research, vol. 6, pages 1705-1749, October 2005.

“A Generalized Maximum Entropy Approach to Bregman Co-Clustering and Matrix Approximations”, A. Banerjee, I. S. Dhillon, J. Ghosh, S. Merugu, and D. S. Modha. ACM Conference on Knowledge Discovery and Data Mining (KDD), pages 509-514, August 2004.

“Co-clustering of Human Cancer Microarrays using Minimum Squared Residue Co-clustering”, H. Cho and I. S. Dhillon. submitted for publication, 2006.

“Differential Entropic Clustering of Multivariate Gaussians”, J. V. Davis and I. S. Dhillon. Neural Information Processing Systems Conference (NIPS), December 2006.

NNMA

“Generalized Nonnegative Matrix Approximations with Bregman Divergences”, I. S. Dhillon and S. Sra. Neural Information Processing Systems Conference (NIPS), pages 283-290, Vancouver, Canada, December 2005.

“Fast Newton-type Methods for the Nonnegative Matrix Approximation Problem”, D. Kim, S. Sra, and I. S. Dhillon. SIAM International Conference on Data Mining, April 2007.

Kernel & Metric Learning

“Learning Low-Rank Kernel Matrices”, B. Kulis, M. A. Sustik, and I. S. Dhillon. International Conference on Machine Learning (ICML), pages 505-512, July 2006.

“Matrix Exponentiated Gradient Updates for Online Learning and Bregman Projection”, K. Tsuda, G. Ratsch, and M. Warmuth. Journal of Machine Learning Research, vol. 6, pages 995-1018, December 2004.

“Information-Theoretic Metric Learning”, J. V. Davis, B. Kulis, S. Sra, and I. S. Dhillon. Neural Information Processing Systems Workshop on Learning to Compare Examples, 2006.

Classification

“Logistic Regression, AdaBoost and Bregman Distances”, M. Collins, R. Schapire, and Y. Singer. Machine Learning, vol. 48, pages 253-285, 2000.

Expository Article

“Matrix Nearness Problems using Bregman Divergences”, I. S. Dhillon and J. A. Tropp. To appear in SIAM Journal on Matrix Analysis and Applications, 2007.
