Bregman Divergences for Data Mining Meta-Algorithms

Joydeep Ghosh University of Texas at Austin [email protected]

Reflects joint work with Arindam Banerjee, Srujana Merugu, Inderjit Dhillon, Dharmendra Modha

Measuring Distortion or Loss

Squared Euclidean distance: k-means clustering, least-squares regression, Wiener filtering, ...

Squared loss is not appropriate in many situations:
- Sparse, high-dimensional data
- Probability distributions: KL-divergence (relative entropy)

What distortion/loss functions make sense, and where? What properties do they have in common? (meta-algorithms)

Bregman Divergences

[Figure: d_φ(x, y) is the gap at x between φ(x) and the tangent to φ at y, i.e., between φ(x) and φ(y) + (x − y)·φ'(y)]

φ is strictly convex, differentiable

d_φ(x, y) = φ(x) − φ(y) − ⟨x − y, ∇φ(y)⟩

Examples

- φ(x) = ‖x‖² is strictly convex and differentiable on R^m:
  d_φ(x, y) = ‖x − y‖²  [squared Euclidean distance]

- φ(p) = Σ_{j=1}^m p_j log p_j (negative entropy) is strictly convex and differentiable on the m-simplex:
  d_φ(p, q) = Σ_{j=1}^m p_j log(p_j / q_j)  [KL-divergence]

- φ(x) = −Σ_{j=1}^m log x_j is strictly convex and differentiable on R^m_{++}:
  d_φ(x, y) = Σ_{j=1}^m ( x_j/y_j − log(x_j/y_j) − 1 )  [Itakura-Saito distance]

Properties of Bregman Divergences

- d_φ(x, y) ≥ 0, and equals 0 iff x = y, but d_φ is not a metric (symmetry and the triangle inequality do not hold)
- Convex in the first argument, but not necessarily in the second
- The KL divergence between two distributions of the same exponential family is a Bregman divergence
- Generalized Law of Cosines and Pythagoras Theorem:

d_φ(x, y) = d_φ(z, y) + d_φ(x, z) − ⟨x − z, ∇φ(y) − ∇φ(z)⟩

When x ∈ Ω, a convex (affine) set, and z is the Bregman projection of y onto Ω,

z ≡ P_Ω(y) = argmin_{ω ∈ Ω} d_φ(ω, y),

the inner product term becomes negative (equals zero for affine Ω)

Bregman Information

For squared loss, the mean is the best constant predictor of a random variable:

µ = argmin_c E[‖X − c‖²]

The minimum loss is the variance E[‖X − µ‖²].

Theorem: For all Bregman divergences,

µ = argmin_c E[d_φ(X, c)]

Definition: The minimum loss is the Bregman information of X

I_φ(X) = E[d_φ(X, µ)]

(minimum distortion at Rate = 0)

Examples of Bregman Information

- φ(x) = ‖x‖², X ∼ ν over R^m:
  I_φ(X) = E_ν[‖X − E_ν[X]‖²]  [Variance]

- φ(x) = Σ_{j=1}^m x_j log x_j, X ∼ p(z) over {p(Y|z)} ⊂ m-simplex:
  I_φ(X) = I(Z; Y)  [Mutual information]

- φ(x) = −Σ_{j=1}^m log x_j, X uniform over {x^(i)}_{i=1}^n ⊂ R^m_{++}:
  I_φ(X) = Σ_{j=1}^m log(µ_j / g_j)  [log AM/GM, per coordinate]
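A quick numerical check of the theorem above (a sketch; the sample data and candidate points are mine): the expected Itakura-Saito divergence to the arithmetic mean is no larger than to any other constant.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.1, 1.0, size=(1000, 3))    # samples in R^3_{++}

def itakura_saito(x, y):
    r = x / y
    return np.sum(r - np.log(r) - 1, axis=-1)

mu = X.mean(axis=0)                           # arithmetic mean
for c in [mu, 1.2 * mu, np.full(3, 0.5)]:     # mean vs. two other candidates
    print(itakura_saito(X, c).mean())
# The first value (at mu) is the smallest: it is the Bregman information
# I_phi(X), and the same holds for every Bregman divergence.
```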

Bregman Hard Clustering Algorithm

(Std. objective is the same as minimizing the loss in Bregman information when using K representatives.)

Initialize {µ_h}_{h=1}^k

Repeat until convergence

{ Assignment step } Assign x to the nearest cluster X_h, where

h = argmin_{h′} d_φ(x, µ_{h′})

{ Re-estimation step }

For all h, recompute mean µh as

µ_h = (1/n_h) Σ_{x ∈ X_h} x
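A minimal sketch of this loop (NumPy; `divergence` is any Bregman divergence such as `itakura_saito` above; initialization and empty-cluster handling are simplified):

```python
import numpy as np

def bregman_hard_cluster(X, k, divergence, n_iters=100, seed=0):
    """Bregman hard clustering: k-means-style alternation with an
    arbitrary Bregman divergence. X: (n, m); divergence(X, y) must
    broadcast a single representative y over the rows of X."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # initial representatives
    for _ in range(n_iters):
        # Assignment step: nearest representative in Bregman divergence
        d = np.stack([divergence(X, mu[h]) for h in range(k)], axis=1)  # (n, k)
        labels = d.argmin(axis=1)
        # Re-estimation step: the plain arithmetic mean is optimal for
        # every Bregman divergence (the Bregman information theorem)
        new_mu = np.stack([X[labels == h].mean(axis=0) for h in range(k)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return labels, mu
```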

Properties

- Guarantee: monotonically decreases the objective function till convergence
- Scalability: every iteration is linear in the size of the input
- Exhaustiveness: if such an algorithm exists for a loss function L(x, µ), then L has to be a Bregman divergence
- Linear separators: clusters are separated by hyperplanes
- Mixed data types: allows an appropriate Bregman divergence for each subset of features

Examples of Algorithms

Convex function     Bregman divergence       Algorithm
Squared norm        Squared loss             KMeans [M'67]
Negative entropy    KL-divergence            Information Theoretic [DMK'03]
Burg entropy        Itakura-Saito distance   Linde-Buzo-Gray [LBG'80]

Bijection between BD and Exponential Family

Regular exponential families ↔ Regular Bregman divergences

Gaussian ↔ Squared loss
Multinomial ↔ KL-divergence
Geometric ↔ Itakura-Saito distance
Poisson ↔ I-divergence

Bregman Divergences and Exponential Family

Theorem: For any regular exponential family p_{(ψ,θ)} and all x ∈ dom(φ),

p_{(ψ,θ)}(x) = exp(−d_φ(x, µ)) b_φ(x)

for a uniquely determined b_φ, where θ is the natural parameter and µ is the expectation parameter
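For a concrete check, take the Poisson family (a sketch; the closed forms for φ, d_φ, and b_φ are standard, but the demo itself is mine): with φ(x) = x log x − x, the factorization exp(−d_φ(x, µ)) b_φ(x) recovers the Poisson pmf.

```python
import math

# Poisson instance of the theorem: phi(x) = x log x - x gives
# d_phi(x, mu) = x log(x/mu) - x + mu and b_phi(x) = x^x e^{-x} / x!,
# so exp(-d_phi(x, mu)) * b_phi(x) equals the Poisson pmf mu^x e^{-mu} / x!.
def d_phi(x, mu):
    return x * math.log(x / mu) - x + mu

def b_phi(x):
    return x**x * math.exp(-x) / math.factorial(x)

x, mu = 3, 2.5
lhs = math.exp(-d_phi(x, mu)) * b_phi(x)
rhs = mu**x * math.exp(-mu) / math.factorial(x)
print(lhs, rhs)  # both equal the Poisson probability p(x=3; mu=2.5)
```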

Legendre Duality

[Diagram: Legendre duality links the Bregman divergence d_φ(x, µ) and convex function φ(µ) with the cumulant function ψ(θ) and exponential family p_{(ψ,θ)}(x)]

Bregman Soft Clustering

Soft clustering: data modeling with a mixture of exponential family distributions, solved using the Expectation Maximization (EM) algorithm

Maximum log-likelihood ≡ Minimum Bregman divergence:

log p_{(ψ,θ)}(x) ≡ −d_φ(x, µ)

The bijection implies a Bregman divergence viewpoint, giving an efficient algorithm for soft clustering

Bregman Soft Clustering Algorithm

Initialize {π_h, µ_h}_{h=1}^k

Repeat until convergence

{ Expectation step } For all x, h, compute the posterior probability

p(h|x) = π_h exp(−d_φ(x, µ_h)) / Z(x),

where Z(x) is the normalization function

{ Maximization step } For all h,

π_h = (1/n) Σ_x p(h|x)

µ_h = Σ_x p(h|x) x / Σ_x p(h|x)
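A minimal sketch of these two steps (NumPy; it assumes the carrier term b_φ(x) is shared across clusters, so it cancels in the posterior):

```python
import numpy as np

def bregman_soft_cluster(X, k, divergence, n_iters=50, seed=0):
    """Soft clustering via EM with a Bregman divergence (a sketch)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    mu = X[rng.choice(n, size=k, replace=False)]
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # E-step: posterior p(h|x) proportional to pi_h * exp(-d_phi(x, mu_h))
        logp = np.log(pi) - np.stack(
            [divergence(X, mu[h]) for h in range(k)], axis=1)   # (n, k)
        logp -= logp.max(axis=1, keepdims=True)                 # numerical stability
        p = np.exp(logp)
        p /= p.sum(axis=1, keepdims=True)                       # divide by Z(x)
        # M-step: mixing proportions and posterior-weighted means
        pi = p.mean(axis=0)
        mu = (p.T @ X) / p.sum(axis=0)[:, None]
    return p, pi, mu
```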

Rate Distortion with Bregman Divergences

Theorem: If the distortion is a Bregman divergence, then either R(D) equals the Shannon-Bregman lower bound, or |X̂| is finite.

When |X̂| is finite:

Bregman divergences ↔ Exponential family distributions
Rate distortion with Bregman divergences ↔ Modeling with mixture of exponential family distributions

R(D) can be obtained either analytically or computationally

- Compression vs. loss in Bregman information formulation
- Information bottleneck as a special case

Online Learning (Warmuth)

Setting: For trials t = 1, ..., T do:
- Predict target ŷ_t = g(w_t · x_t) for instance x_t using link function g
- Incur loss L_t^curr(w_t)

Update rule:

w_{t+1} = argmin_w [ L^hist(w) + η_t L_t^curr(w) ]

where L^hist measures deviation from history and L_t^curr is the current loss. When L^hist(w) = d_F(w, w_t), i.e., a Bregman loss function, and L_t^curr(w) is convex, the update rule reduces to

w_{t+1} = f^{−1}(f(w_t) − η_t ∇L_t^curr(w_t)), where f = ∇F

Also yields regret bounds and density estimation bounds.
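A sketch of this update for two standard choices of F (helper names are mine; the minus sign makes it a descent step on the current loss):

```python
import numpy as np

def bregman_online_update(w, grad, eta, f, f_inv):
    """One step of w_{t+1} = f^{-1}(f(w_t) - eta * grad L_t(w_t)),
    where f = grad F for the Bregman history loss d_F(w, w_t)."""
    return f_inv(f(w) - eta * grad)

w, g, eta = np.array([0.5, 0.5]), np.array([0.1, -0.2]), 0.1

# F(w) = ||w||^2 / 2  ->  f = identity: plain gradient descent
w_gd = bregman_online_update(w, g, eta, lambda v: v, lambda v: v)

# F(w) = sum w_j log w_j  ->  f(w) = 1 + log w: exponentiated gradient,
# i.e. w_{t+1} = w_t * exp(-eta * grad), typically renormalized
w_eg = bregman_online_update(w, g, eta,
                             lambda v: 1 + np.log(v),
                             lambda v: np.exp(v - 1))
print(w_gd, w_eg / w_eg.sum())
```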

Examples

History loss: update family                Current loss    Algorithm
Squared loss: gradient descent             Squared loss    Widrow-Hoff (LMS)
Squared loss: gradient descent             Hinge loss      Perceptron
KL-divergence: exponentiated gradient      Hinge loss      Normalized Winnow

Generalizing PCA to the Exponential Family

(Collins/Dasgupta/Schapire, NIPS 2001)

PCA: Given a data matrix X = [x_1, ..., x_n] ∈ R^{m×n}, find an orthogonal basis V ∈ R^{m×k} and a projection matrix A ∈ R^{k×n} that solve:

min_{A,V} Σ_{i=1}^n ‖x_i − V a_i‖²

Equivalent to maximizing likelihood when x_i ∼ Gaussian(V a_i, σ²)

Generalized PCA: For any specified exponential family p_{(ψ,θ)}, find V ∈ R^{m×k} and A ∈ R^{k×n} that maximize the data likelihood, i.e.,

max_{A,V} Σ_{i=1}^n log p_{(ψ,θ_i)}(x_i), where θ_i = V a_i, i = 1, ..., n

Bregman divergence formulation: min_{A,V} Σ_{i=1}^n d_φ(x_i, ∇ψ(V a_i))
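A rough sketch of this formulation for the Poisson family (my own plain gradient-descent implementation; the orthogonality constraint and any step-size schedule are omitted for brevity):

```python
import numpy as np

def exp_pca_poisson(X, k, lr=1e-3, n_iters=2000, seed=0):
    """Sketch: generalized PCA for the Poisson family. Minimizes
    sum_i d_phi(x_i, grad_psi(V a_i)) with grad_psi = exp, i.e. the
    Poisson negative log-likelihood in natural parameters Theta = V A."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    V = 0.01 * rng.standard_normal((m, k))
    A = 0.01 * rng.standard_normal((k, n))
    for _ in range(n_iters):
        Theta = V @ A                 # natural parameters theta_i = V a_i
        R = np.exp(Theta) - X         # d(loss)/d(Theta) for the Poisson family
        V = V - lr * (R @ A.T)        # chain rule through Theta = V A
        A = A - lr * (V.T @ R)
    return V, A

# e.g. V, A = exp_pca_poisson(np.random.poisson(3.0, size=(20, 50)), k=2)
```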

Uniting AdaBoost and Logistic Regression

(Collins/Schapire/Singer, Machine Learning, 2002)

- Boosting: minimize exponential loss; sequential updates
- Logistic Regression: minimize log-loss; parallel updates

Both are special cases of a classical Bregman projection problem: find p ∈ S = dom(φ) that is "closest" in Bregman divergence to a given vector q_0 ∈ S, subject to certain linear constraints:

min_{p ∈ S: Ap = Ap_0} d_φ(p, q_0)

Boosting: I-divergence;

LR: binary relative entropy

Implications

- Convergence proof for Boosting
- Parallel versions of Boosting algorithms
- Boosting with [0,1]-bounded weights
- Extension to multi-class problems
- ...

Misc. Work on Bregman Divergences

Duality results and auxiliary functions for the Bregman projection problem. Della Pietra, Della Pietra and Lafferty

Learning latent variable models using Bregman divergences. Wang and Schuurmans

U-Boost: Boosting with Bregman divergences. Murata et al.

Historical Reference

L. M. Bregman. "The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming." USSR Computational Mathematics and Mathematical Physics, 7:200-217, 1967.

Problem:

min ϕ(x) subject to a_i^T x = b_i, i = 0, ..., m − 1

Iterative procedure:
1. Start with x^(0) satisfying ∇ϕ(x^(0)) = −A^T π. Set t = 0.
2. Compute x^(t+1) as the "Bregman" projection of x^(t) onto the hyperplane a_i^T x = b_i, where i = t mod m. Set t = t + 1 and repeat.

Converges to the globally optimal solution. This cyclic projection method can be extended to halfspace and convex constraints, where each projection is followed by a correction. Censor and Lent (1981) coined the term "Bregman distance".
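For ϕ(x) = ‖x‖²/2 the Bregman projection onto a hyperplane is the ordinary orthogonal projection, so the procedure specializes to Kaczmarz's method. A minimal sketch (the zero starting point corresponds to π = 0):

```python
import numpy as np

def cyclic_bregman_projection(A, b, n_sweeps=100):
    """Bregman's cyclic projection for phi(x) = ||x||^2 / 2: repeatedly
    project onto each hyperplane a_i^T x = b_i (Kaczmarz's method)."""
    m, d = A.shape
    x = np.zeros(d)                                  # start at x^(0) = 0
    for t in range(n_sweeps * m):
        i = t % m                                    # cycle through constraints
        a = A[i]
        x = x + (b[i] - a @ x) / (a @ a) * a         # project onto a_i^T x = b_i
    return x
```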

Bertinoro Challenge

- What other learning formulations can be generalized with BDs?
- Given a problem/application, how does one find the "best" Bregman divergence to use?
- Examples of unusual but practical Bregman divergences?
- Other useful families of loss functions?

References

BGW'05: On the Optimality of Conditional Expectation as a Bregman Predictor. IEEE Trans. Information Theory, July 2005.

BMDG'05: Clustering with Bregman Divergences. Journal of Machine Learning Research, Oct 2005 (earlier version in SDM 2004).

BDGM'04: An Information Theoretic Analysis of Maximum Likelihood Mixture Estimation for Exponential Families. ICML, 2004.

BDGMM'04: A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation. KDD, 2004.

Backups

The Exponential Family

Definition: A multivariate parametric family with density

p_{(ψ,θ)}(x) = exp{⟨x, θ⟩ − ψ(θ)} p_0(x)

- ψ is the cumulant or log-partition function
- ψ uniquely determines a family; examples: Gaussian, Bernoulli, Multinomial, Poisson
- θ fixes a particular distribution in the family
- ψ is a strictly convex function

Online Learning (Warmuth & Co)

Setting: For trials t = 1, ..., T do:
- Predict target ŷ_t = g(w_t · x_t) for instance x_t using link function g
- Incur loss L_t^curr(w_t) (depends on true y_t and predicted ŷ_t)

- Update w_t to w_{t+1} using past history and the current trial

Update rule:

w_{t+1} = argmin_w [ L^hist(w) + η_t L_t^curr(w) ]

where L^hist measures deviation from history and L_t^curr is the current loss. When L^hist(w) = d_F(w, w_t), i.e., a Bregman loss function, and L_t^curr(w) is convex, the update rule reduces to

w_{t+1} = f^{−1}(f(w_t) − η_t ∇L_t^curr(w_t)), where f = ∇F

Further, if L_t^curr(w) is the link Bregman loss d_G(w · x_t, g^{−1}(y_t)), then

w_{t+1} = f^{−1}(f(w_t) − η_t (g(w_t · x_t) − y_t) x_t), where g = ∇G

Examples and Bounds

History loss: update family                Current loss    Algorithm
Squared loss: gradient descent             Squared loss    Widrow-Hoff (LMS)
Squared loss: gradient descent             Hinge loss      Perceptron
KL-divergence: exponentiated gradient      Hinge loss      Normalized Winnow

Regret bounds: For a convex loss L^curr and a Bregman loss L^hist,

L_alg ≤ min_w ( Σ_{t=1}^T L_t^curr(w) ) + constant,

where L_alg = Σ_{t=1}^T L_t^curr(w_t) is the total algorithm loss

Uniting AdaBoost and Logistic Regression

(Collins/Schapire/Singer, Machine Learning, 2002)

Task: Learn target function y(x) from labeled data {(x_1, y_1), ..., (x_n, y_n)} s.t. y_i ∈ {+1, −1}

Boosting                                     | Logistic Regression
Weak hypotheses {h_1(x), ..., h_m(x)}        | Features {h_1(x), ..., h_m(x)}
Predicts ŷ(x) = sign(Σ_{j=1}^m w_j h_j(x))   | (same)
Minimizes exponential loss:                  | Minimizes log loss:
Σ_{i=1}^n exp(−y_i (w · h(x_i)))             | Σ_{i=1}^n log(1 + exp(−y_i (w · h(x_i))))
Sequential updates                           | Parallel updates

Both are special cases of a classical Bregman projection problem

Bregman Projection Problem

Find p ∈ S = dom(φ) that is "closest" in Bregman divergence to a given vector q_0 ∈ S, subject to certain linear constraints:

min_{p ∈ S: Ap = Ap_0} d_φ(p, q_0)

AdaBoost: S = R^n_+, p_0 = 0, q_0 = 1, A = [y_1 h(x_1), ..., y_n h(x_n)], and

d_φ(p, q) = Σ_{i=1}^n ( p_i log(p_i / q_i) − p_i + q_i )

Logistic Regression: S = [0, 1]^n, p_0 = 0, q_0 = (1/2)·1, A = [y_1 h(x_1), ..., y_n h(x_n)], and

d_φ(p, q) = Σ_{i=1}^n ( p_i log(p_i / q_i) + (1 − p_i) log((1 − p_i) / (1 − q_i)) )

Optimal combining weights w* can be obtained from the minimizer p*
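To make the projection concrete, here is a small sketch of the AdaBoost instance on hypothetical toy data, with SciPy's generic constrained optimizer standing in for the specialized boosting updates:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy instance: n = 4 examples, m = 2 weak hypotheses;
# row j of A holds the margins y_i h_j(x_i)
rng = np.random.default_rng(0)
A = rng.choice([-1.0, 1.0], size=(2, 4))
p0 = np.zeros(4)
q0 = np.ones(4)

def i_divergence(p):                     # AdaBoost's d_phi(p, q0)
    p = np.maximum(p, 1e-12)             # keep the log well-defined
    return np.sum(p * np.log(p / q0) - p + q0)

cons = {"type": "eq", "fun": lambda p: A @ p - A @ p0}   # Ap = Ap_0
res = minimize(i_divergence, x0=q0, bounds=[(0, None)] * 4,
               constraints=[cons])
print(res.x)   # the Bregman projection p*; combining weights follow from it
```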