Agglomerative Bregman Clustering

Matus Telgarsky [email protected]
Sanjoy Dasgupta [email protected]
Department of Computer Science and Engineering, UCSD, 9500 Gilman Drive, La Jolla, CA 92093-0404

Abstract

This manuscript develops the theory of agglomerative clustering with Bregman divergences. Geometric smoothing techniques are developed to deal with degenerate clusters. To allow for cluster models based on exponential families with overcomplete representations, Bregman divergences are developed for nondifferentiable convex functions.

1. Introduction

Starting with points {x_i}_{i=1}^m and a pairwise merge cost ∆(·, ·), classical agglomerative clustering produces a single hierarchical tree as follows (Duda et al., 2001).

1. Start with m clusters: C_i := {x_i} for each i.
2. While at least two clusters remain:
   (a) Choose {C_i, C_j} with minimal ∆(C_i, C_j).
   (b) Remove {C_i, C_j}, add in C_i ∪ C_j.

In order to build a hierarchy with low k-means cost, one can use the merge cost due to Ward (1963),

∆_w(C_i, C_j) := (|C_i||C_j| / (|C_i| + |C_j|)) ‖τ(C_i) − τ(C_j)‖_2^2,

where τ(C) denotes the mean of cluster C.

The k-means cost, and thus the Ward merge rule, inherently prefer spherical clusters of common radius. To accommodate other cluster shapes and input domains, the squared Euclidean norm may be replaced with a relaxation sharing many of the same properties: a Bregman divergence.

This manuscript develops the theory of agglomerative clustering with Bregman divergences.
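For concreteness, here is a minimal sketch (not from the paper; plain NumPy with a brute-force pair search) of the classical procedure above using the Ward merge cost ∆_w; later sections swap the squared Euclidean norm for a Bregman divergence.

```python
import numpy as np

def ward_cost(c1, c2):
    """Ward merge cost: |C1||C2|/(|C1|+|C2|) * ||mean(C1) - mean(C2)||^2."""
    n1, n2 = len(c1), len(c2)
    d = np.mean(c1, axis=0) - np.mean(c2, axis=0)
    return (n1 * n2) / (n1 + n2) * float(d @ d)

def agglomerate(points, merge_cost=ward_cost):
    """Classical agglomerative clustering: repeatedly merge the cheapest pair.
    Returns the merge history; each entry is (cluster_i, cluster_j, cost)."""
    clusters = [[x] for x in points]           # start from singletons
    history = []
    while len(clusters) > 1:
        # brute-force search over all pairs for the minimal merge cost
        (i, j), cost = min(
            (((a, b), merge_cost(clusters[a], clusters[b]))
             for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda t: t[1])
        history.append((clusters[i], clusters[j], cost))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

# toy usage
pts = [np.array(p, dtype=float) for p in [(0, 0), (0, 1), (5, 5), (5, 6)]]
for ci, cj, cost in agglomerate(pts):
    print(len(ci), len(cj), round(cost, 3))
```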

1.1. Bregman clustering

There is already a rich theory of clustering with Bregman divergences, and in particular of the relationship of these divergences with exponential family distributions (Banerjee et al., 2005). The standard development has two shortcomings, the first of which is amplified in the agglomerative setting.

Degenerate divergences. Many divergences lead to merge costs which are undefined on certain inputs. This scenario is exacerbated with small clusters; for instance, with Gaussian clusters, the corresponding rule is the KL divergence, which demands full rank cluster covariances. This is impossible with ≤ d points, but the agglomerative procedure above starts with singletons.

Minimal representations. The standard theory of exponential families and its connections to Bregman divergences depend on minimal representations: there is just one way to write down any particular distribution. On the other hand, the natural encoding for many problems — e.g., Ising models, and many other examples listed in the textbook of Wainwright & Jordan (2008, Section 4) — is overcomplete, necessitating potentially tedious conversions to invoke the theory.

1.2. Contribution

The approach of this manuscript is to carefully build a theory of Bregman divergences constructed from convex, yet nondifferentiable functions. Section 2 will present the basic definition, and verify this generalization satisfies the usual Bregman divergence properties. Section 3 will revisit the standard Bregman hard clustering model (Banerjee et al., 2005), and show how it naturally leads to a merge cost ∆. Section 4 then constructs exponential families, demonstrating that nondifferentiable Bregman divergences, while permitting representations which are not minimal, still satisfy all the usual properties. To overcome the aforementioned small-sample cases where divergences may not be well-defined, Section 5 presents smoothing procedures which immediately follow from the preceding technical development.

To close, Section 6 presents the final algorithm, and Section 7 provides experimental validation, both by measuring cluster fit and by assessing the suitability of cluster features in supervised learning tasks.

The various appendices contain all proofs, as well as some additional technical material and examples.

1.3. Related work

A number of works present agglomerative schemes for clustering with exponential families, from the perspective of KL divergences between distributions, or the analogous goal of maximizing model likelihood, or lastly in connection to the information bottleneck method (Iwayama & Tokunaga, 1995; Fraley, 1998; Heller & Ghahramani, 2005; Garcia et al., 2010; Blundell et al., 2010; Slonim & Tishby, 1999). Furthermore, Merugu (2006) studied the same algorithm as the present work, phrased in terms of Bregman divergences. These preceding methods either do not explicitly mention divergence degeneracies, or circumvent them with Bayesian techniques, a connection discussed in Section 5.

Bregman divergences for nondifferentiable functions have been studied in a number of places. Remark 2.4 shows the relationship between the presented version and one due to Gordon (1999). Furthermore, Kiwiel (1995) presents divergences almost identical to those here, but the manuscripts and focuses differ thereafter.

The development here of exponential families and related Bregman properties generalizes results found in a variety of sources (Brown, 1986; Azoury & Warmuth, 2001; Banerjee et al., 2005; Wainwright & Jordan, 2008); further bibliographic remarks will appear throughout, and in Appendix G. Finally, parallel to the completion of this manuscript, another group has developed exponential families under similarly relaxed conditions, but from the perspective of maximum entropy and convex duality (Csiszár & Matúš, 2012).

1.4. Notation

The following concepts from convex analysis are used throughout the text; the interested reader is directed to the seminal text of Rockafellar (1970). A set is convex when the line segment between any two of its elements is again within the set. The epigraph of a function f : R^n → R̄, where R̄ = R ∪ {±∞}, is the set of points bounded below by f; i.e., the set {(x, r) : x ∈ R^n, r ≥ f(x)} ⊆ R^n × R̄. A function is convex when its epigraph is convex, and closed when its epigraph is closed. The domain dom(f) of a function f : R^n → R̄ is the subset of inputs not mapping to +∞: dom(f) = {x ∈ R^n : f(x) < ∞}. A function is proper if dom(f) is nonempty, and f never takes on the value −∞. The Bregman divergences in this manuscript will be generated from closed proper convex functions.

The conjugate of a function f is the function f*(φ) := sup_x ⟨φ, x⟩ − f(x); when f is closed proper convex, so is f*, and moreover f** = f. A subgradient g of a function f at y ∈ dom(f) provides an affine lower bound: for any x ∈ R^n, f(x) ≥ f(y) + ⟨g, x − y⟩. The set of all subgradients at a point y is denoted by ∂f(y) (which is easily empty). The directional derivative f′(y; d) of a function f at y in direction d is lim_{t↓0} (f(y + td) − f(y))/t.

The affine hull of a set S ⊆ R^n is the smallest affine set containing it. If S is translated and rotated so that its affine hull is some R^d ⊆ R^n, then its interior within R^d is its relative interior within R^n. Said another way, the relative interior ri(S) is the interior of S with respect to the R^n topology relativized to the affine hull of S. Although functions in this manuscript will generally be closed, their domains are often (relatively) open.

Convex functions will be defined over R^n, but it will be useful to treat data as lying in an abstract space X, and a statistic map τ : X → R^n will embed examples in the desired Euclidean space. This map, which will also be overloaded to handle finite subsets of X, will eventually incorporate the smoothing procedure.

The cluster cost will be denoted by φ, or φ_{f,τ} to make the underlying convex function and statistic map clear; similarly, the merge cost is denoted by ∆ and ∆_{f,τ}.

2. Bregman divergences

Given a convex function f : R^n → R̄, the Bregman divergence B_f(·, y) is the gap between f and its linearization at y. Typically, f is differentiable, and so B_f(x, y) = f(x) − f(y) − ⟨∇f(y), x − y⟩.

Definition 2.1. Given a convex function f : R^n → R̄, the corresponding Bregman divergence between x, y ∈ dom(f) is

B_f(x, y) := f(x) − f(y) + f′(y; y − x). ♦

Unlike gradients, directional derivatives are well-defined whenever a convex function is finite, although they can be infinite on the relative boundary of dom(f) (Rockafellar, 1970, Theorems 23.1, 23.3, 23.4).

Figure 1. Bregman divergences with respect to a reference point y at which f is nondifferentiable, evaluated at two points x_1, x_2. The thick (red or blue) dashed lines denote the divergence values themselves; they travel down from f to the negated sublinear function x ↦ f(y) − f′(y; y − x), here a pair of dotted rays. Noting Proposition 2.3 and fixing some x_i, the subgradient at y farthest from x_i is one of these dotted rays together with its dashed, gray extension. The gray extensions, taken together, represent the sublinear function x ↦ f(y) + f′(y; x − y).

Noting that f′(y; y − x) ≥ −f′(y; x − y) (Rockafellar, 1970, Theorem 23.1), it may seem closer to the original expression to instead use f(x) − f(y) − f′(y; x − y) (which is thus bounded above by B_f(x, y)); however, it will later be shown that B_f(·, y) is convex, which fails if the directional derivative is flipped. This distinction is depicted in Figure 1.

Example 2.2. In the case of the differentiable convex function f_2 = ‖·‖_2^2, B_{f_2}(x, y) = ‖x − y‖_2^2 follows by noting f_2′(y; y − x) = ⟨2y, y − x⟩. To analyze the case of f_1 = ‖·‖_1, first consider the univariate case h = |·|. Either by drawing a picture or checking h′(·; ·) from the definition, it follows that

B_h(x, y) := 0 when xy > 0, and 2|x| otherwise.

Then, noting that f_1′(·; ·) decomposes coordinate-wise, it follows that B_{f_1}(x, y) = Σ_i B_h(x_i, y_i). Said another way, B_{f_1} is twice the l_1 distance from x to the farthest orthant containing y, which bears a resemblance to the hinge loss. ♦

B_f can also be written in terms of subgradients.

Proposition 2.3. Let a proper convex f and y ∈ ri(dom(f)) be given. Then for any x ∈ dom(f):

• f′(y; y − x) and B_f(x, y) are finite, and
• B_f(x, y) = max_{g ∈ ∂f(y)} f(x) − f(y) − ⟨g, x − y⟩.

The above characterization will be extremely useful in proofs, where the existence of a maximizing subgradient ḡ_{y,x} will frequently be invoked.
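As a quick numerical sanity check of Definition 2.1, Example 2.2, and Proposition 2.3 (an illustration, not part of the original text), the directional derivative can be approximated by a one-sided finite difference; the hypothetical helper below recovers the closed forms for ‖·‖_2^2 and ‖·‖_1.

```python
import numpy as np

def bregman(f, x, y, t=1e-8):
    """B_f(x, y) = f(x) - f(y) + f'(y; y - x), with the directional
    derivative approximated by a one-sided finite difference."""
    d = y - x
    dir_deriv = (f(y + t * d) - f(y)) / t
    return f(x) - f(y) + dir_deriv

sq = lambda v: float(v @ v)             # f_2 = ||.||_2^2
l1 = lambda v: float(np.abs(v).sum())   # f_1 = ||.||_1

x = np.array([1.0, -2.0])
y = np.array([0.5, 3.0])

print(bregman(sq, x, y), float((x - y) @ (x - y)))    # both ~ 25.25
# closed form from Example 2.2: sum of 2|x_i| over coordinates where x_i*y_i <= 0
closed = sum(2 * abs(xi) for xi, yi in zip(x, y) if xi * yi <= 0)
print(bregman(l1, x, y), closed)                       # both ~ 4.0
```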

Remark 2.4. Given x ∈ dom(f) and a dual element g ∈ R^n, another nondifferentiable generalization of the Bregman divergence, due to Gordon (1999), is

D_f(x, g) := f(x) + f*(g) − ⟨g, x⟩.

Now suppose there exists y ∈ ri(dom(f)) with g ∈ ∂f(y); the Fenchel-Young inequality (Rockafellar, 1970, Theorem 23.5) grants D_f(x, g) = f(x) − f(y) − ⟨g, x − y⟩. Thus, by Proposition 2.3,

B_f(x, y) = max{D_f(x, g) : g ∈ ∂f(y)}. ♦

To sanity check B_f, Appendix A states and proves a number of key Bregman divergence properties, generalized to the case of nondifferentiability. The following list summarizes these properties; in general, f is closed proper convex, y ∈ ri(dom(f)), and x ∈ dom(f).

• B_f(·, y) is convex, proper, nonnegative, and B_f(y, y) = 0.
• When f is strictly convex, B_f(x, y) = 0 iff x = y.
• Given g_x ∈ ri(dom(f*)), sup_{x ∈ ∂f*(g_x)} B_f(x, y) = sup_{g_y ∈ ∂f(y)} B_{f*}(g_y, g_x).
• Under some regularity conditions on f, a generalization of the Pythagorean theorem holds, with B_f replacing the squared Euclidean distance.

Over and over, this section depends on relative interiors. What's so bad about the relative boundary? The directional derivatives and subgradients break down. If y ∈ relbd(dom(f)) and x ∈ ri(dom(f)), then f′(y; y − x) = ∞ = B_f(x, y), and there exists no maximizing subgradient as in Proposition 2.3; in fact, one can not in general guarantee the existence of any subgradients at all.

In just a moment, the cluster model will be developed, where it will be very easy for the second argument of B_f to lie on relbd(dom(f)), rendering the divergences infinite and cluster costs meaningless. Worse, it is frequently the case that dom(f) is relatively open, meaning the relative boundary is not in dom(f)! The smoothing methods of Section 5 work around these issues. Their approach is simple enough: they just push relative boundary points inside the relative interior.

3. Cluster model

Let a finite collection C of points {x_i}_{i=1}^m in some abstract space X — say, documents or vectors — be given. In order to cluster these with Bregman divergences, the first step is to map them into R^n.

Definition 3.1. A statistic map τ is any function from X to R^n. Given a finite set C ⊆ X, overload τ via averages: τ(C) := |C|^{-1} Σ_{x ∈ C} τ(x). ♦

For now, it suffices to think of τ as the identity map (with X = R^n), with an added convenience of computing means. Section 4, however, will rely on τ when constructing exponential families.

Definition 3.2. Given a statistic map τ : X → R^n and a convex function f, the cost of a single cluster C is

φ_{f,τ}(C) := Σ_{x ∈ C} B_f(τ(x), τ(C)).

(This cost was the basis for Bregman hard clustering (Banerjee et al., 2005).) ♦

Example 3.3 (k-means cost). Choose X = R^n, τ(x) = x, and f = ‖·‖_2^2. As discussed in Example 2.2, B_f(x, y) = ‖x − y‖_2^2, and so φ_{f,τ}(C) = Σ_{x ∈ C} ‖x − τ(C)‖_2^2, precisely the k-means cost. ♦

As such, τ(C) plays the role of a cluster center. While this may be intuitive for the k-means cost, it requires justification for general Bregman divergences. The following definition and results bridge this gap.

Definition 3.4. A convex function f is relatively (Gâteaux) differentiable if, for any y ∈ ri(dom(f)), there exists g (necessarily any subgradient) so that, for any x ∈ dom(f), f′(y; y − x) = ⟨g, y − x⟩. ♦

Every differentiable function is relatively differentiable (with g = ∇f(y)); fortunately, many relevant convex functions, in particular those used to construct Bregman divergences between exponential family distributions (cf. Proposition 4.5), will be relatively differentiable.

Under relative differentiability, Bregman divergences admit a bias-variance style decomposition, which immediately justifies the choice of centroid τ(C).

Lemma 3.5. Let a proper convex relatively differentiable f, points {x_i}_{i=1}^m ⊂ R^n, and weights {α_i}_{i=1}^m ⊂ R be given, with µ := Σ_i α_i x_i / (Σ_j α_j) ∈ ri(dom(f)). Then, given any point y ∈ ri(dom(f)),

Σ_{i=1}^m α_i B_f(x_i, y) = Σ_{i=1}^m α_i B_f(x_i, µ) + (Σ_{i=1}^m α_i) B_f(µ, y).

Corollary 3.6. Suppose the convex function f is relatively differentiable, and let any statistic map τ and any finite cluster C ⊆ X be given. Then φ_{f,τ}(C) = inf_{y ∈ R^n} Σ_{x ∈ C} B_f(τ(x), y).

Proof. Use µ := τ(C) = |C|^{-1} Σ_{x ∈ C} τ(x), Lemma 3.5, and B_f ≥ 0.

Continuing, the stage is set to finally construct the Bregman merge cost.

Definition 3.7. Given two finite subsets C_1, C_2 of X, the cluster merge cost is simply the growth in total cost:

∆_{f,τ}(C_1, C_2) = φ_{f,τ}(C_1 ∪ C_2) − Σ_{j ∈ {1,2}} φ_{f,τ}(C_j). ♦

The above expression seems to imply that the computational cost of ∆ scales with the number of points. But in fact, one need only look at the relevant centers.

Proposition 3.8. Let a proper convex relatively differentiable f and two finite subsets C_1, C_2 of X with τ(C_i) ∈ ri(dom(f)) be given. Then

∆_{f,τ}(C_1, C_2) = Σ_{j ∈ {1,2}} |C_j| B_f(τ(C_j), τ(C_1 ∪ C_2)).

Example 3.9 (Ward/k-means merge cost). Continuing with the k-means cost from Example 3.3, note that for j ∈ {1, 2},

‖τ(C_j) − τ(C_1 ∪ C_2)‖_2 = (|C_{3−j}| / (|C_1| + |C_2|)) ‖τ(C_1) − τ(C_2)‖_2.

Plugging this into the simplification of ∆_{f,τ} provided by Proposition 3.8,

∆_{f,τ}(C_1, C_2) = Σ_{j ∈ {1,2}} (|C_j||C_{3−j}|^2 / (|C_1| + |C_2|)^2) ‖τ(C_1) − τ(C_2)‖_2^2 = (|C_1||C_2| / (|C_1| + |C_2|)) ‖τ(C_1) − τ(C_2)‖_2^2.

This is exactly the Ward merge cost. ♦
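A small illustrative sketch of Proposition 3.8 (not from the paper): the merge cost is computable from cluster sizes and means alone, and for f = ‖·‖_2^2 it reproduces the Ward cost of Example 3.9. Since this f is differentiable, the divergence is written with gradients.

```python
import numpy as np

def merge_cost(f, grad_f, n1, m1, n2, m2):
    """Prop. 3.8 (differentiable case): sum_j |C_j| B_f(tau(C_j), tau(C1 u C2))."""
    m12 = (n1 * m1 + n2 * m2) / (n1 + n2)      # mean of the merged cluster
    def B(x, y):                               # B_f(x,y) = f(x) - f(y) - <grad f(y), x - y>
        return f(x) - f(y) - float(grad_f(y) @ (x - y))
    return n1 * B(m1, m12) + n2 * B(m2, m12)

sq  = lambda v: float(v @ v)
gsq = lambda v: 2 * v

m1, n1 = np.array([0.0, 0.0]), 3
m2, n2 = np.array([4.0, 2.0]), 5

ward = n1 * n2 / (n1 + n2) * float((m1 - m2) @ (m1 - m2))
print(merge_cost(sq, gsq, n1, m1, n2, m2), ward)   # both equal 37.5
```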
4. Exponential families

So far, this manuscript has developed a mathematical basis for clustering with Bregman divergences. But what does it matter, if examples of meaningful Bregman divergences are few and far between?

The primary mechanism for constructing meaningful merge costs is to model the clusters as exponential family distributions. Throughout this section, let ν be any measure over X, and further stipulate that the statistic map τ is ν-measurable.

Definition 4.1. Given a measurable statistic map τ and a vector θ ∈ R^n of canonical parameters, the corresponding exponential family distribution has density

p_θ(x) := exp(⟨τ(x), θ⟩ − ψ(θ)),

where the normalization term ψ, typically called the cumulant or log partition function, is simply

ψ(θ) = ln ∫ exp(⟨τ(x), θ⟩) dν(x). ♦

Many standard distributions have this representation.

Example 4.2. Choose X = R^d with ν being Lebesgue measure, and n = d(d + 1), i.e. R^n = R^{d(d+1)}. The map τ(x) = (x, xx^⊤) will provide for Gaussian densities. In particular, starting from the familiar form, with mean µ ∈ R^d and positive definite covariance Σ ∈ R^{d×d}, the density at x, p_θ(x), is

exp(−(x − µ)^⊤ Σ^{−1} (x − µ)/2) / √((2π)^d |Σ|)
  = exp(⟨τ(x), (Σ^{−1}µ, −Σ^{−1}/2)⟩ − (1/2) ln((2π)^d |Σ| exp(µ^⊤ Σ^{−1} µ))).

In other words, θ = (Σ^{−1}µ, −Σ^{−1}/2). Notice that ψ (here expanded as (1/2) ln(· · ·)) and θ do not make sense if Σ is merely positive semi-definite. ♦

So far so good, but where's the convex function, and does the definition of p_θ even make sense?

Proposition 4.3. Given a measurable statistic map τ, the function ψ is well-defined, closed, convex, and never takes on the value −∞.

Remark 4.4. Notice that Proposition 4.3 did not provide that ψ is proper, only that it is never −∞. Unfortunately, more can not be guaranteed: if ν is Lebesgue measure over R and τ(x) = 0 for all x, then every parameter choice θ ∈ R has ψ(θ) = ∞. It is therefore necessary to check, for any provided τ, whether dom(ψ) is nonempty. ♦

Not only is ψ closed convex, it is about as well-behaved as any function discussed in this manuscript.

Proposition 4.5. Suppose dom(ψ) is nonempty. Then ψ is relatively differentiable; in fact, given any θ ∈ ri(dom(ψ)), any τ̂ ∈ ∂ψ(θ), and any ξ ∈ dom(ψ),

ψ′(θ; ξ − θ) = ⟨τ̂, ξ − θ⟩ = ∫ ⟨τ(x), ξ − θ⟩ p_θ(x) dν(x).

If ψ is fully differentiable at θ, then ∇ψ(θ) = ∫ τ p_θ dν.

Since ψ is closed, given τ̂ ∈ ∂ψ(θ), it follows that θ ∈ ∂ψ*(τ̂). There is still cause for concern that other subgradients at τ̂ lead to different densities, but as will be shown below, this does not happen.

Now that a relevant convex function ψ has been identified, the question is whether B_ψ (or B_{ψ*}) provide a reasonable notion of distance amongst densities. This will be answered in two ways. To start, recall the Kullback-Leibler divergence K between densities p, q:

K(p, q) = ∫ p(x) ln(p(x)/q(x)) dν(x).

Theorem 4.6. Let any θ_1, θ_2 ∈ ri(dom(ψ)) and any τ̂_1 ∈ ∂ψ(θ_1), τ̂_2 ∈ ∂ψ(θ_2) be given, where ∂ψ*(τ̂_2) ⊆ ri(dom(ψ)) (for instance, if dom(ψ) is relatively open). Then

K(p_{θ_1}, p_{θ_2}) = B_ψ(θ_2, θ_1) = B_{ψ*}(τ̂_1, τ̂_2).

Furthermore, if θ_1 ∈ ∂ψ*(τ̂_2), then p_{θ_1} = p_{θ_2} ν-a.e.

Motivated by Proposition 4.5 and Theorem 4.6, the choice here is to base the cluster model on B_{ψ*}. Given two clusters {C_i}_{i=1}^2, set τ̂_i := τ(C_i). When working with these clusters, it is entirely sufficient to store only these and the cluster sizes, since

τ(C_1 ∪ C_2) = |C_1 ∪ C_2|^{−1} (|C_1| τ̂_1 + |C_2| τ̂_2).

Assuming for interpretability that ψ is differentiable, since ψ is closed, ψ** = ψ, and thus ∇ψ(θ_1) = ∫ τ p_{θ_1} dν = τ̂_1; that is to say, these distributions have their (aptly named) mean parameterizations as their means. And as provided by Theorem 4.6, even if differentiability fails, various subgradients of the same mean all effectively represent the same distributions.

Example 4.7. Suppose X is a finite set, representing a vocabulary with n words, and ν is counting measure over X. The statistic map τ converts word k into the kth basis vector e_k. Let τ̂ ∈ R^n_{++} represent the mean parameters of a multinomial over this set; observe that

p_θ(e_i) = ⟨τ(i), τ̂⟩ = exp(⟨e_i, ln τ̂⟩ − ln ∫ exp(⟨τ(k), ln τ̂⟩) dν(k)).

That is to say, the canonical parameter vector is θ = ln τ̂, the coordinate-wise logarithm of the mean parameters. Proposition 4.5 can be verified directly: (∇ψ(θ))_i = e^{θ_i} / Σ_k e^{θ_k} = τ̂_i. Similarly, given another multinomial with mean parameters τ̂′ ∈ R^n_{++} and canonical parameters θ′ = ln τ̂′,

K(p_θ, p_{θ′}) = Σ_{i=1}^n τ̂_i ln(τ̂_i / τ̂′_i).

The notation R^n_{++} means strictly positive coordinates: no word can have zero probability. Without this restriction, it is not possible to map into the canonical parameter space. This is precisely the scenario the smoothing methods of Section 5 will work around: the provided clusters are on the relative boundary of dom(ψ*), which is either not part of dom(ψ*) at all, or, as is the case here, causes degenerate Bregman divergences (infinite valued, and lacking subgradients). ♦
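To see Theorem 4.6 and Example 4.7 numerically (an illustrative sketch, not from the paper), the code below checks K(p_{θ_1}, p_{θ_2}) = B_ψ(θ_2, θ_1) = B_{ψ*}(τ̂_1, τ̂_2) for two multinomials, using ψ(θ) = ln Σ_k e^{θ_k} and, for the conjugate, the negative entropy restricted to the simplex.

```python
import numpy as np

psi           = lambda th: float(np.log(np.exp(th).sum()))   # log-partition
grad_psi      = lambda th: np.exp(th) / np.exp(th).sum()     # mean parameters
psi_star      = lambda t: float((t * np.log(t)).sum())       # negative entropy (on the simplex)
grad_psi_star = lambda t: np.log(t) + 1.0

def breg(f, g, x, y):
    """Bregman divergence of a differentiable convex f with gradient g."""
    return f(x) - f(y) - float(g(y) @ (x - y))

tau1 = np.array([0.2, 0.5, 0.3])        # mean parameters (strictly positive)
tau2 = np.array([0.4, 0.4, 0.2])
th1, th2 = np.log(tau1), np.log(tau2)   # canonical parameters

kl = float((tau1 * np.log(tau1 / tau2)).sum())
print(kl,
      breg(psi, grad_psi, th2, th1),              # B_psi(theta_2, theta_1)
      breg(psi_star, grad_psi_star, tau1, tau2))  # B_psi*(tau_1, tau_2)
# all three agree, as in Theorem 4.6
```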
striction, it is not possible to map into the canoni- Now that a relevant convex function ψ has been iden- cal parameter space. This is precisely the scenario the smoothing methods of Section5 will work around: tified, the question is whether Bψ (or Bψ∗ ) provide a reasonable notion of distance amongst densities. the provided clusters are on the relative boundary of dom(ψ∗), which is either not part of dom(ψ∗) at all, or This will be answered in two ways. To start, recall the as is the case here, causes degenerate Bregman diver- Kullback-Leibler divergence K between densities p, q: gences (infinite valued, and lacking subgradients). ♦ Z p(x) Remark 4.8. The multinomials in Example 4.7 have K(p, q) = p(x) ln dν(x). q(x) an overcomplete representation: scaling any canoni- cal parameter vector by a constant gives the same Agglomerative Bregman Clustering mean parameter. In general, if two relative interior z are chosen from data, and moreover satisfy kαzk ↓ 0 canonical parameters θ1 6= θ2 have a common sub- as the total amount of available data grows; that is to gradientτ ˆ ∈ ∂ψ(θ1) ∩ ∂ψ(θ2), then it follows that say, τ will more and more closely match τ0. ∗ {θ1, θ2} ⊂ ∂ψ (ˆτ)(Rockafellar, 1970, Theorem 23.5). Example 5.3 (Smoothing multinomials.). The mean That is to say: this scenario leads to mean parameters parameters to a multinomial lie within the probabil- which have distinct subgradients, and are thus points ity simplex, a compact convex set. As discussed in ∗ of nondifferentiability within ri(dom(ψ )), which ne- Example 4.7, only the relative interior of the simplex cessitate the generalized development of Bregman di- provides canonical parameters. According to Theo- vergences in this manuscript. ♦ rem 5.2, all that remains in fixing this problem is to determine αz. A further example of Gaussians appears in Ap- pendixC. The approach here is to interpret the provided multi- nomial τ0(C) =τ ˆ as based on a finite sample of size The second motivation for ∆ ∗ is an interpretation ψ ,τ m, and thus the true parameters lie within some con- in terms of model fit. fidence interval aroundτ ˆ; crucially, this confidence Theorem 4.9. Fix some measurable statistic map interval intersects the relative interior of the proba- τ, and let two finite subsets C1,C2 of X be given bility simplex. One choice is a Bernstein-based up- with mean parameters {τ(C1), τ(C2)} = {τˆ1, τˆ2} ⊆ per confidence estimate τ(C) = τ0(C) + O(1/m + ∗ p ri(dom(ψ )). Choose any canonical parameters θi ∈ p(1 − p)/m), where p = 1/n. ♦ ∂ψ∗(ˆτ ), and for convenience set C := C ∪ C , i 3 1 2 Example 5.4 (Smoothing Gaussians.). In the case of with mean parameter τˆ and any canonical parameter 3 Gaussians, as discussed in Example 4.2, only positive θ ∈ ∂ψ∗(ˆτ ). Then 3 3 definite covariance matrices are allowed. But this set X X X is a convex cone, so Theorem 5.2 reduces the problem ∆ ∗ (C ,C ) = ln p (x) − ln p (x). ψ ,τ 1 2 θi θ3 to finding a sensible element to add in. i∈{1,2} x∈Ci x∈C3 Consider the case of singleton clusters. Adding a full- 5. Smoothing rank covariance matrix in to the observed zero co- variance matrix is like replacing this singleton with The final piece of the technical puzzle is the smooth- a constellation of points centered at it. Equivalently, ing procedure: most of the above properties — for each point is replaced with a tiny Gaussian, which is instance, that Bf (τ(C1), τ(C2)) < ∞ — depend on reminiscent of nonparametric density estimation tech- τ(C2) ∈ ri(dom(f)). Relative boundary points lead to niques. 
The following two examples smooth Gaussians and multinomials via Theorem 5.2. The parameters α and z are chosen from data, and moreover satisfy ‖αz‖ ↓ 0 as the total amount of available data grows; that is to say, τ will more and more closely match τ_0.

Example 5.3 (Smoothing multinomials). The mean parameters of a multinomial lie within the probability simplex, a compact convex set. As discussed in Example 4.7, only the relative interior of the simplex provides canonical parameters. According to Theorem 5.2, all that remains in fixing this problem is to determine αz.

The approach here is to interpret the provided multinomial τ_0(C) = τ̂ as based on a finite sample of size m, and thus the true parameters lie within some confidence interval around τ̂; crucially, this confidence interval intersects the relative interior of the probability simplex. One choice is a Bernstein-based upper confidence estimate τ(C) = τ_0(C) + O(1/m + √(p(1 − p)/m)), where p = 1/n. ♦

Example 5.4 (Smoothing Gaussians). In the case of Gaussians, as discussed in Example 4.2, only positive definite covariance matrices are allowed. But this set is a convex cone, so Theorem 5.2 reduces the problem to finding a sensible element to add in.

Consider the case of singleton clusters. Adding a full-rank covariance matrix in to the observed zero covariance matrix is like replacing this singleton with a constellation of points centered at it. Equivalently, each point is replaced with a tiny Gaussian, which is reminiscent of nonparametric density estimation techniques. Therefore one option is to use bandwidth selection techniques; the experiments of Section 7 use the "normal reference rule" (Bowman & Azzalini, 1997, Section 2.4.2), trying both the approach of estimating a bandwidth for each coordinate (suffix -nd), and computing one bandwidth for every direction uniformly and simply adding a rescaling of the identity matrix to the sample covariance (suffix -n). ♦

When there is a probabilistic interpretation of the clusters, and in particular τ(C) may be viewed as a maximum likelihood estimate, another approach is to choose some prior over the parameters, and have τ produce a MAP estimate which also lies in the relative interior. As stated, this approach will differ slightly from the one presented here: the weight on the added element will scale with the cluster size, rather than the size of the full data, and the relationship of τ(C_1 ∪ C_2) to τ(C_1) and τ(C_2) becomes less clear.
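The sketch below (illustrative only; the constants α are hypothetical and not the paper's data-dependent choices) instantiates Theorem 5.2 in the spirit of Examples 5.3 and 5.4: a multinomial mean estimate is pulled toward the uniform distribution, and a singleton Gaussian's zero sample covariance is shifted by a scaled identity.

```python
import numpy as np

def smooth_multinomial(counts, alpha=0.01):
    """tau_1(C) = (1 - alpha) * tau_0(C) + alpha * z, with z uniform on the simplex."""
    tau0 = counts / counts.sum()                      # empirical mean parameters
    z = np.full_like(tau0, 1.0 / len(tau0))
    return (1 - alpha) * tau0 + alpha * z             # strictly positive coordinates

def smooth_gaussian(X, alpha=0.1):
    """tau_2(C) = tau_0(C) + alpha * z: add a scaled identity to the sample covariance
    (the PSD cone is a convex cone, so the second map of Theorem 5.2 applies)."""
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False, bias=True) + alpha * np.eye(X.shape[1])
    return mean, cov                                   # full rank, even for |C| = 1

print(smooth_multinomial(np.array([3.0, 0.0, 1.0])))
m, c = smooth_gaussian(np.array([[1.0, 2.0]]))         # singleton cluster
print(m, np.linalg.matrix_rank(c))                     # rank 2 despite one point
```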

6. Clustering algorithm

The algorithm appears in Algorithm 1. Letting T_{∆f,τ} denote an upper bound on the time to calculate a single merge cost, a brute-force implementation (over m points) takes space O(m) and time O(m^3 T_{∆f,τ}), whereas caching merge cost computations in a min-heap requires space O(m^2) and time O(m^2 (lg(m) + T_{∆f,τ})). Please refer to Appendix E for more notes on running times, and a depiction of sample hierarchies over synthetic data.

Algorithm 1 Agglomerate.
  Input: merge cost ∆_{f,τ}, points {x_i}_{i=1}^m ⊆ X.
  Output: hierarchical clustering.
  Initialize forest as F := {{x_i} : i ∈ [m]}.
  while |F| > 1 do
    Let {C_i, C_j} ⊆ F be any pair minimizing ∆_{f,τ}(C_i, C_j), as computed by Proposition 3.8.
    Remove {C_i, C_j} from F, add in C_i ∪ C_j.
  end while
  return the single tree within F.

If Proposition 3.8 is used to compute ∆_{f,τ}, then only the sizes and means of clusters need be stored, and computing this merge cost involves just two Bregman divergence calculations. As the new mean is a convex combination of the two old means, computing it takes time O(n). The Bregman cost itself can be more expensive; for instance, as discussed with Gaussians in Appendix C, it is necessary to invert a matrix, meaning O(n^3) steps.
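A compact sketch of Algorithm 1 with the cached min-heap strategy discussed above (an illustration under stated assumptions: clusters are summarized by sizes and means per Proposition 3.8, the merge cost is specialized to the squared Euclidean/Ward case, and stale heap entries are lazily skipped).

```python
import heapq
import numpy as np

def ward_delta(n1, m1, n2, m2):
    d = m1 - m2
    return n1 * n2 / (n1 + n2) * float(d @ d)

def agglomerate(points):
    """Algorithm 1 with a min-heap of candidate merges; stale entries are skipped."""
    # live clusters: id -> (size, mean); ids are never reused, so dead entries are detectable
    clusters = {i: (1, np.asarray(x, dtype=float)) for i, x in enumerate(points)}
    heap = [(ward_delta(1, clusters[i][1], 1, clusters[j][1]), i, j)
            for i in clusters for j in clusters if i < j]
    heapq.heapify(heap)
    next_id, merges = len(points), []
    while len(clusters) > 1:
        cost, i, j = heapq.heappop(heap)
        if i not in clusters or j not in clusters:
            continue                                    # references a cluster already merged
        (n1, m1), (n2, m2) = clusters.pop(i), clusters.pop(j)
        merged = (n1 + n2, (n1 * m1 + n2 * m2) / (n1 + n2))
        merges.append((i, j, next_id, cost))
        for k, (nk, mk) in clusters.items():            # push new candidate merges
            heapq.heappush(heap, (ward_delta(merged[0], merged[1], nk, mk), next_id, k))
        clusters[next_id] = merged
        next_id += 1
    return merges

print(agglomerate([(0, 0), (0, 1), (5, 5), (5, 6)]))
```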

7. Empirical results

Trees generated by Agglomerate are evaluated in two ways. First, cluster compatibility scores are computed via dendrogram purity and initialization quality for EM upon mixtures of Gaussians. Secondly, cluster features are fed into supervised learners.

There are two kinds of data: Euclidean (points in some R^n), and text data. There are three Euclidean data sets: UCI's glass (214 points in R^9); 3s and 5s from the mnist digit recognition problem (1984 training digits and 1984 testing digits in R^49, scaled down from the original 28x28); UCI's spambase data (2301 train and 2300 test points in R^57). Text data is drawn from the 20 newsgroups data, which has a vocabulary of 61,188 words; a difficult dichotomy (20n-h), the pair alt.atheism/talk.religion.misc (856 train and 569 test documents); an easy dichotomy (20n-e), the pair rec.sport.hockey/sci.crypt (1192 train and 794 test documents). Finally, 20n-b collects these four groups into one corpus.

The various trees are labeled as follows. s-link and c-link denote single and complete linkage, where l1 distance is used for text, and l2 distance is used for Euclidean data. km is the Ward/k-means merge cost. g-n fits full covariance Gaussians, whereas dg-nd fits diagonal covariance Gaussians; smoothing follows the data-dependent scheme of Example 5.4. multi fits multinomials, with the smoothing scheme of Example 5.3.

7.1. Cluster compatibility

Table 1 contains cluster purity scores, a standard dendrogram quality measure, defined as follows. For any two points with the same label l, find the smallest cluster C in the tree which contains them both; the purity with respect to these two points is the fraction of C having label l. The purity of the dendrogram is the weighted sum, over all pairs of points sharing labels, of pairwise purities.

Table 1. Dendrogram purity on Euclidean and text data.

          c-link   s-link   km     dg-nd   g-n
glass     0.46     0.45     0.50   0.49    0.54
spam      0.59     0.58     0.59   0.65    0.60
mnist35   0.59     0.51     0.69   0.62    0.73

          c-link   s-link   multi
20n-e     0.60     0.50     0.93
20n-h     0.54     0.52     0.56
20n-b     0.31     0.29     0.62

The glass, spam, and 20newsgroups data appear in Heller & Ghahramani (2005); although a direct comparison is difficult, since those experiments used subsampling and randomized purity, the Euclidean experiments perform similarly, and the text experiments fare slightly better here.

For another experiment, now assessing the viability of Agglomerate as an initialization to EM applied to mixtures of Gaussians, please see Appendix F.

7.2. Feature generation

The final experiment is to use dendrograms, built from training data, to generate features for classification tasks. Given a budget of features k, the top k clusters {C_i}_{i=1}^k of a specified dendrogram are chosen, and for any example x, the ith feature is ∆(C_i, {x}). Statistically, this feature measures the amount by which the model likelihood degrades when C_i is adjusted to accommodate x. The choice of tree was based on training set purity from Table 1. In all tests, the original features are discarded (i.e., only the k generated features are used).

Figure 2 shows the performance of logistic regression classifiers using tree features, as well as SVD features. The stopping rule used validation set performance.
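A sketch of this feature construction (illustrative; the merge-cost helper matches the earlier Ward sketch, clusters are assumed summarized by size and mean, and "top k clusters" is read here as any fixed choice of k clusters from the tree).

```python
import numpy as np

def ward_delta(n1, m1, n2, m2):
    d = m1 - m2
    return n1 * n2 / (n1 + n2) * float(d @ d)

def cluster_features(top_clusters, x, merge_cost=ward_delta):
    """Given a budget of k clusters, the i-th feature of example x is
    Delta(C_i, {x}): the cost of absorbing x into cluster C_i."""
    xv = np.asarray(x, dtype=float)
    return np.array([merge_cost(n_i, m_i, 1, xv) for (n_i, m_i) in top_clusters])

# toy usage: two clusters summarized by (size, mean), and one test point
tops = [(10, np.array([0.0, 0.0])), (4, np.array([5.0, 5.0]))]
print(cluster_features(tops, (4.0, 4.0)))   # smaller cost for the nearby cluster
```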

Figure 2. Comparison of dendrogram features to SVD features; the y-axis denotes classification accuracy on test data, the x-axis denotes #features. Panels: (a) mnist35, zoomed in; (b) mnist35, zoomed out; (c) 20n-e; (d) 20n-h. In the first two plots, mnist35 was used. The SVD can only produce as many features as the dimension of the data, but the proposed tree continues to improve performance beyond this point. For the text data tasks 20n-e and 20n-h, tree methods strongly outperform the SVD. Please see text for details.

Acknowledgements

This work was graciously supported by the NSF under grant IIS-0713540.

References

Azoury, Katy S. and Warmuth, Manfred K. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, 2001.

Banerjee, Arindam, Merugu, Srujana, Dhillon, Inderjit S., and Ghosh, Joydeep. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.

Blundell, C., Teh, Y.W., and Heller, K.A. Bayesian rose trees. In UAI, 2010.

Borwein, Jonathan and Lewis, Adrian. Convex Analysis and Nonlinear Optimization. Springer Publishing Company, Incorporated, 2000.

Bowman, Adrian W. and Azzalini, Adelchi. Applied Smoothing Techniques for Data Analysis. Oxford University Press, USA, 1997.

Brown, Lawrence D. Fundamentals of Statistical Exponential Families. Institute of Mathematical Statistics, USA, 1986.

Csiszár, Imre and Matúš, František. Generalized minimizers of convex integral functionals, Bregman distance, Pythagorean identities. 2012. arXiv:1202.0666v1 [math.OC].

Duda, Richard O., Hart, Peter E., and Stork, David G. Pattern Classification. Wiley, 2 edition, 2001.

Folland, Gerald B. Real Analysis: Modern Techniques and Their Applications. Wiley Interscience, 2 edition, 1999.

Fraley, C. Algorithms for model-based Gaussian hierarchical clustering. SIAM Journal on Scientific Computing, 20:270–281, 1998.

Garcia, Vincent, Nielsen, Frank, and Nock, Richard. Hierarchical Gaussian mixture model. In ICASSP, pp. 4070–4073, 2010.

Gordon, Geoff J. Approximate Solutions to Markov Decision Processes. PhD thesis, Carnegie Mellon University, 1999.

Heller, Katherine A. and Ghahramani, Zoubin. Bayesian hierarchical clustering. In ICML, pp. 297–304, 2005.

Hiriart-Urruty, Jean-Baptiste and Lemaréchal, Claude. Fundamentals of Convex Analysis. Springer Publishing Company, Incorporated, 2001.

Iwayama, Makoto and Tokunaga, Takenobu. Hierarchical Bayesian clustering for automatic text classification. In IJCAI, pp. 1322–1327, 1995.

Kiwiel, Krzysztof C. Proximal minimization methods with generalized Bregman functions. SIAM Journal on Control and Optimization, 35:1142–1168, 1995.

Merugu, Srujana. Distributed Learning using Generative Models. PhD thesis, University of Texas, Austin, 2006.

Murtagh, Fionn. A survey of recent advances in hierarchical clustering algorithms. The Computer Journal, 26(4):354–359, 1983.

Rockafellar, R. Tyrrell. Convex Analysis. Princeton University Press, 1970.

Slonim, Noam and Tishby, Naftali. Agglomerative information bottleneck. pp. 617–623, 1999.

Wainwright, Martin J. and Jordan, Michael I. Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc., Hanover, MA, USA, 2008.

Ward, Joe H. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.