Agglomerative Bregman Clustering

Agglomerative Bregman Clustering Matus Telgarsky [email protected] Sanjoy Dasgupta [email protected] Department of Computer Science and Engineering, UCSD, 9500 Gilman Drive, La Jolla, CA 92093-0404 Abstract 1.1. Bregman clustering This manuscript develops the theory of ag- There is already a rich theory of clustering with Breg- glomerative clustering with Bregman diver- man divergences, and in particular the relationship gences. Geometric smoothing techniques are of these divergences with exponential family distribu- developed to deal with degenerate clusters. tions (Banerjee et al., 2005). The standard develop- To allow for cluster models based on expo- ment has two shortcomings, the first of which is am- nential families with overcomplete represen- plified in the agglomerative setting. tations, Bregman divergences are developed for nondifferentiable convex functions. Degenerate divergences. Many divergences lead to merge costs which are undefined on certain inputs. This scenario is exacerbated with small clusters; 1. Introduction for instance, with Gaussian clusters, the corre- m sponding divergence rule is the KL divergence, Starting with points fxigi=1 and a pairwise merge cost ∆(·; ·), classical agglomerative clustering produces a which demands full rank cluster covariances. This single hierarchical tree as follows (Duda et al., 2001). is impossible with ≤ d points, but the agglomerative procedure above starts with singletons. 1. Start with m clusters: Ci := fxig for each i. Minimal representations. The standard theory of 2. While at least two clusters remain: exponential families and its connections to Breg- man divergences depend on minimal representa- (a) Choose fCi;Cjg with minimal ∆(Ci;Cj). tions: there is just one way to write down any (b) Remove fCi;Cjg, add in Ci [ Cj. particular distribution. On the other hand, the natural encoding for many problems | e.g., Ising In order to build a hierarchy with low k-means cost, models, and many other examples listed in the one can use the merge cost due to Ward(1963), textbook of Wainwright & Jordan(2008, Section 4) | is overcomplete, necessitating potentially te- jCijjCjj 2 dious conversions to invoke the theory. ∆w(Ci;Cj) := kτ(Ci) − τ(Cj)k2; jCij + jCjj where τ(C) denotes the mean of cluster C. 1.2. Contribution The k-means cost, and thus the Ward merge rule, in- The approach of this manuscript is to carefully build a herently prefer spherical clusters of common radius. theory of Bregman divergences constructed from con- To accommodate other cluster shapes and input do- vex, yet nondifferentiable functions. Section2 will mains, the squared Euclidean norm may be replaced present the basic definition, and verify this generaliza- with a relaxation sharing many of the same properties, tion satisfies the usual Bregman divergence properties. a Bregman divergence. Section3 will revisit the standard Bregman hard clus- This manuscript develops the theory of agglomerative tering model (Banerjee et al., 2005), and show how it clustering with Bregman divergences. naturally leads to a merge cost ∆. Section4 then con- structs exponential families, demonstrating that non- Appearing in Proceedings of the 29 th International Confer- differentiable Bregman divergences, while permitting ence on Machine Learning, Edinburgh, Scotland, UK, 2012. representations which are not minimal, still satisfy Copyright 2012 by the author(s)/owner(s). all the usual properties. To overcome the aforemen- Agglomerative Bregman Clustering tioned small-sample cases where divergences may not f(x; r): x 2 Rn; r ≥ f(x)g ⊆ Rn × R¯. A function is be well-defined, Section5 presents smoothing proce- convex when its epigraph is convex, and closed when dures which immediately follow from the preceding its epigraph is closed. The domain dom(f) of a func- technical development. tion f : Rn ! R¯ is the subset of inputs not mapping to +1: dom(f) = fx 2 n : f(x) < 1g. A function To close, Section6 presents the final algorithm, and R is proper if dom(f) is nonempty, and f never takes Section7 provides experimental validation both by on the value −∞. The Bregman divergences in this measuring cluster fit, and the suitability of cluster fea- manuscript will be generated from closed proper con- tures in supervised learning tasks. vex functions. The various appendices contain all proofs, as well as The conjugate of a function f is the function f ∗(φ) := some additional technical material and examples. supx hφ, xi − f(x); when f is closed proper convex, so is f ∗, and moreover f ∗∗ = f. A subgradient g to 1.3. Related work a function f at y 2 dom(f) provides an affine lower A number of works present agglomerative schemes bound: for any x 2 Rn, f(x) ≥ f(y) + hg; x − yi. for clustering with exponential families, from the per- The set of all subgradients at a point y is denoted spective of KL divergences between distributions, or by @f(y) (which is easily empty). The directional the analogous goal of maximizing model likelihood, derivative f 0(y; d) of a function f at y in direction d is or lastly in connection to the information bottleneck limt#0(f(y + td) − f(y))=t. method (Iwayama & Tokunaga, 1995; Fraley, 1998; n The affine hull of a set S ⊆ R is the smallest affine set Heller & Ghahramani, 2005; Garcia et al., 2010; Blun- containing it. If S is translated and rotated so that its dell et al., 2010; Slonim & Tishby, 1999). Further- d n d affine hull is some R ⊆ R , then its interior within R more, Merugu(2006) studied the same algorithm as n is its relative interior within R . Said another way, the the present work, phrased in terms of Bregman diver- relative interior ri(S) is the interior of S with respect gences. These preceding methods either do not explic- n to the R topology relativized to the affine hull of S. itly mention divergence degeneracies, or circumvent Although functions in this manuscript will generally them with Bayesian techniques, a connection discussed be closed, their domains are often (relatively) open. in Section5. Convex functions will be defined over Rn, but it will Bregman divergences for nondifferentiable functions be useful to treat data as lying in an abstract space X , have been studied in a number of places. Remark 2.4 n and a statistic map τ : X! R will embed examples shows the relationship between the presented version, in the desired Euclidean space. This map, which will and one due to Gordon(1999). Furthermore, Kiwiel also be overloaded to handle finite subsets of X , will (1995) presents divergences almost identical to those eventually incorporate the smoothing procedure. here, but the manuscripts and focuses differ thereafter. The cluster cost will be denoted by φ, or φf,τ to make The development here of exponential families and re- the underlying convex function and statistic map clear; lated Bregman properties generalizes results found in similarly, the merge cost is denoted by ∆ and ∆f,τ . a variety of sources (Brown, 1986; Azoury & War- muth, 2001; Banerjee et al., 2005; Wainwright & Jor- dan, 2008); further bibliographic remarks will appear 2. Bregman divergences throughout, and in AppendixG. Finally, parallel to n Given a convex function f : R ! R¯, the Bregman the completion of this manuscript, another group has divergence Bf (·; y) is the gap between f and its lin- developed exponential families under similarly relaxed earization at y. Typically, f is differentiable, and so conditions, but from the perspective of maximum en- Bf (x; y) = f(x) − f(y) − hrf(y); x − yi. tropy and convex duality (Csiszár& Matúˇs, 2012). Definition 2.1. Given a convex function f : Rn ! R¯, the corresponding Bregman divergence between x; y 2 1.4. Notation dom(f) is The following concepts from convex analysis are used 0 throughout the text; the interested reader is directed Bf (x; y) := f(x) − f(y) + f (y; y − x): ♦ to the seminal text of Rockafellar(1970). A set is convex when the line segment between any two of its Unlike gradients, directional derivatives are well- elements is again within the set. The epigraph of a defined whenever a convex function is finite, although function f : Rn ! R¯, where R¯ = R [ {±∞}, is they can be infinite on the relative boundary of dom(f) the set of points bounded below by f; i.e., the set (Rockafellar, 1970, Theorems 23.1, 23.3, 23.4). Agglomerative Bregman Clustering Remark 2.4. Given x 2 dom(f) and a dual element g 2 n, another nondifferentiable generalization of x x R 1 2 Bregman divergence, due to Gordon(1999), is ∗ Df (x; g) := f(x) + f (g) − hg; xi : y Bf (x1; y) Bf (x2; y) Now suppose there exists y 2 ri(dom(f)) with g 2 @f(y); the Fenchel-Young inequality (Rockafellar, 1970, Theorem 23.5) grants Df (x; g) = f(x) − f(y) − hg; x − yi. Thus, by Proposition 2.3, Figure 1. Bregman divergences with respect to a reference point y at which f is nondifferentiable. The thick (red Bf (x; y) := maxfDf (x; g): g 2 @f(y)g: ♦ or blue) dashed lines denote the divergence values them- selves; they travel down from f to the negated sublinear To sanity check Bf , AppendixA states and proves a function x 7! f(y) − f 0(y; y − x), here a pair of dotted number of key Bregman divergence properties, gener- rays. Noting Proposition 2.3 and fixing some xi, the sub- alized to the case of nondifferentiability. The following gradient at y farthest from xi is one of these dotted rays list summarizes these properties; in general, f is closed together with its dashed, gray extension. The gray ex- proper convex, y 2 ri(dom(f)), and x 2 dom(f).

Agglomerative Bregman Clustering

Learning to Approximate a Bregman Divergence

Bregman Divergence Bounds and Universality Properties of the Logarithmic Loss Amichai Painsky, Member, IEEE, and Gregory W

Metrics Defined by Bregman Divergences †

On a Generalization of the Jensen-Shannon Divergence and the Jensen-Shannon Centroid

Information Divergences and the Curious Case of the Binary Alphabet

Statistical Exponential Families: a Digest with Flash Cards

Applications of Bregman Divergence Measures in Bayesian Modeling Gyuhyeong Goh University of Connecticut - Storrs, [email protected]

Jensen-Bregman Logdet Divergence with Application to Efficient

Bregman Divergence Bounds and Universality Properties of the Logarithmic Loss Amichai Painsky, Member, IEEE, and Gregory W

Generalized Bregman and Jensen Divergences Which Include Some F-Divergences

Bregman Metrics and Their Applications

Low-Rank Kernel Learning with Bregman Matrix Divergences