Bayesian Hierarchical Clustering

Katherine A. Heller    [email protected]
Zoubin Ghahramani    [email protected]
Gatsby Computational Neuroscience Unit, University College London, 17 Queen Square, London, WC1N 3AR, UK

Abstract

We present a novel algorithm for agglomerative hierarchical clustering based on evaluating marginal likelihoods of a probabilistic model. This algorithm has several advantages over traditional distance-based agglomerative clustering algorithms. (1) It defines a probabilistic model of the data which can be used to compute the predictive distribution of a test point and the probability of it belonging to any of the existing clusters in the tree. (2) It uses a model-based criterion to decide on merging clusters rather than an ad-hoc distance metric. (3) Bayesian hypothesis testing is used to decide which merges are advantageous and to output the recommended depth of the tree. (4) The algorithm can be interpreted as a novel fast bottom-up approximate inference method for a Dirichlet process (i.e. countably infinite) mixture model (DPM). It provides a new lower bound on the marginal likelihood of a DPM by summing over exponentially many clusterings of the data in polynomial time. We describe procedures for learning the model hyperparameters, computing the predictive distribution, and extensions to the algorithm. Experimental results on synthetic and real-world data sets demonstrate useful properties of the algorithm.

1. Introduction

Hierarchical clustering is one of the most frequently used methods in unsupervised learning. Given a set of data points, the output is a binary tree (dendrogram) whose leaves are the data points and whose internal nodes represent nested clusters of various sizes. The tree organizes these clusters hierarchically, where the hope is that this hierarchy agrees with the intuitive organization of real-world data. Hierarchical structures are ubiquitous in the natural world. For example, the evolutionary tree of living organisms (and consequently features of these organisms such as the sequences of homologous genes) is a natural hierarchy. Hierarchical structures are also a natural representation for data which was not generated by evolutionary processes. For example, internet newsgroups, emails, or documents from a newswire can be organized in increasingly broad topic domains.

The traditional method for hierarchically clustering data as given in (Duda & Hart, 1973) is a bottom-up agglomerative algorithm. It starts with each data point assigned to its own cluster and iteratively merges the two closest clusters together until all the data belongs to a single cluster. The nearest pair of clusters is chosen based on a given distance measure (e.g. Euclidean distance between cluster means, or distance between nearest points).

There are several limitations to the traditional hierarchical clustering algorithm. The algorithm provides no guide to choosing the "correct" number of clusters or the level at which to prune the tree. It is often difficult to know which distance metric to choose, especially for structured data such as images or sequences. The traditional algorithm does not define a probabilistic model of the data, so it is hard to ask how "good" a clustering is, to compare to other models, to make predictions, or to cluster new data into an existing hierarchy. We use statistical inference to overcome these limitations. Previous work which uses probabilistic methods to perform hierarchical clustering is discussed in section 6.

Our Bayesian hierarchical clustering algorithm uses marginal likelihoods to decide which clusters to merge and to avoid overfitting. Basically it asks what the probability is that all the data in a potential merge were generated from the same mixture component, and compares this to exponentially many hypotheses at lower levels of the tree (section 2).

The generative model for our algorithm is a Dirichlet process mixture model (i.e. a countably infinite mixture model), and the algorithm can be viewed as a fast bottom-up agglomerative way of performing approximate inference in a DPM. Instead of giving weight to all possible partitions of the data into clusters, which is intractable and would require the use of sampling methods, the algorithm efficiently computes the weight of exponentially many partitions which are consistent with the tree structure (section 3).

Figure 1. (a) Schematic of a portion of a tree where Ti and Tj are merged into Tk, and the associated data sets Di and Dj are merged into Dk. (b) An example tree with 4 data points. The clusterings (1 2 3)(4) and (1 2)(3)(4) are tree-consistent partitions of this data. The clustering (1)(2 3)(4) is not a tree-consistent partition.

2. Algorithm

Our Bayesian hierarchical clustering algorithm is similar to traditional agglomerative clustering in that it is a one-pass, bottom-up method which initializes each data point in its own cluster and iteratively merges pairs of clusters. As we will see, the main difference is that our algorithm uses a statistical hypothesis test to choose which clusters to merge.

Let D = {x^(1), ..., x^(n)} denote the entire data set, and Di ⊂ D the set of data points at the leaves of the subtree Ti. The algorithm is initialized with n trivial trees, {Ti : i = 1...n}, each containing a single data point Di = {x^(i)}. At each stage the algorithm considers merging all pairs of existing trees. For example, if Ti and Tj are merged into some new tree Tk then the associated set of data is Dk = Di ∪ Dj (see figure 1(a)).

In considering each merge, two hypotheses are compared. The first hypothesis, which we will denote H1^k, is that all the data in Dk were in fact generated independently and identically from the same probabilistic model, p(x|θ) with unknown parameters θ. Let us imagine that this probabilistic model is a multivariate Gaussian, with parameters θ = (µ, Σ), although it is crucial to emphasize that for different types of data, different probabilistic models may be appropriate. To evaluate the probability of the data under this hypothesis we need to specify some prior over the parameters of the model, p(θ|β), with hyperparameters β. We now have the ingredients to compute the probability of the data Dk under H1^k:

p(D_k | H_1^k) = \int p(D_k|\theta) p(\theta|\beta) \, d\theta = \int \Big[ \prod_{x^{(i)} \in D_k} p(x^{(i)}|\theta) \Big] p(\theta|\beta) \, d\theta    (1)

This calculates the probability that all the data in Dk were generated from the same parameter values, assuming a model of the form p(x|θ). This is a natural model-based criterion for measuring how well the data fit into one cluster. If we choose models with conjugate priors (e.g. Normal-Inverse-Wishart priors for Normal continuous data or Dirichlet priors for Multinomial discrete data) this integral is tractable. Throughout this paper we use such conjugate priors, so the integrals are simple functions of sufficient statistics of Dk. For example, in the case of Gaussians, (1) is a function of the sample mean and covariance of the data in Dk.

The alternative hypothesis to H1^k would be that the data in Dk has two or more clusters in it. Summing over the exponentially many possible ways of dividing Dk into two or more clusters is intractable. However, if we restrict ourselves to clusterings that partition the data in a manner that is consistent with the subtrees Ti and Tj, we can compute the sum efficiently using recursion. (We elaborate on the notion of tree-consistent partitions in section 3 and figure 1(b).) The probability of the data under this restricted alternative hypothesis, H2^k, is simply a product over the subtrees, p(Dk|H2^k) = p(Di|Ti) p(Dj|Tj), where the probability of a data set under a tree (e.g. p(Di|Ti)) is defined below.

Combining the probability of the data under hypotheses H1^k and H2^k, weighted by the prior that all points in Dk belong to one cluster, πk = p(H1^k), we obtain the marginal probability of the data in tree Tk:

p(D_k | T_k) = \pi_k p(D_k | H_1^k) + (1 - \pi_k) p(D_i | T_i) p(D_j | T_j)    (2)

This equation is defined recursively: the first term considers the hypothesis that there is a single cluster in Dk, and the second term efficiently sums over all other clusterings of the data in Dk which are consistent with the tree structure (see figure 1(a)). In section 3 we show that equation 2 can be used to derive an approximation to the marginal likelihood of a Dirichlet Process mixture model, and in fact provides a new lower bound on this marginal likelihood. (It is important not to confuse the marginal likelihood in equation 1, which integrates over the parameters of one cluster, with the marginal likelihood of a DPM, which integrates over all clusterings and their parameters.)

We also show that the prior for the merged hypothesis, πk, can be computed bottom-up in a DPM. The posterior probability of the merged hypothesis rk = p(H1^k|Dk) is obtained using Bayes rule:

r_k = \frac{\pi_k p(D_k | H_1^k)}{\pi_k p(D_k | H_1^k) + (1 - \pi_k) p(D_i | T_i) p(D_j | T_j)}    (3)

This quantity is used to decide greedily which two trees to merge, and is also used to determine which merges in the final hierarchy structure were justified. The general algorithm is very simple (see figure 2).

input: data D = {x^(1) ... x^(n)}, model p(x|θ), prior p(θ|β)
initialize: number of clusters c = n, and Di = {x^(i)} for i = 1...n
while c > 1 do
  Find the pair Di and Dj with the highest probability of the merged hypothesis:
    rk = πk p(Dk|H1^k) / p(Dk|Tk)
  Merge Dk ← Di ∪ Dj, Tk ← (Ti, Tj)
  Delete Di and Dj, c ← c − 1
end while
output: Bayesian mixture model where each tree node is a mixture component
The tree can be cut at points where rk < 0.5

Figure 2. Bayesian Hierarchical Clustering Algorithm

Our Bayesian hierarchical clustering algorithm has many desirable properties which are absent in traditional hierarchical clustering. For example, it allows us to define predictive distributions for new data points, it decides which merges are advantageous and suggests natural places to cut the tree using a statistical model comparison criterion (via rk), and it can be customized to different kinds of data by choosing appropriate models for the mixture components.
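To make equations (1)-(3) and the loop of figure 2 concrete, the following is a minimal sketch in Python; it is an illustration written for this text, not the authors' implementation. The cluster marginal likelihood of equation (1) is supplied by the caller as a function log_ml taking a 2-D array of data points, and πk and dk are computed with the bottom-up recursion of figure 3, which is only derived in section 3. All names (Node, merge, bhc) are ours.

import itertools
import numpy as np
from scipy.special import gammaln

class Node:
    """One subtree Ti, its data Di, and the quantities in equations (2)-(3)."""
    def __init__(self, data, alpha, log_ml):
        self.data = data                      # rows of Di
        self.n = data.shape[0]
        self.log_d = np.log(alpha)            # leaf value d_i = alpha (figure 3)
        self.log_pi = 0.0                     # leaf value pi_i = 1
        self.log_ph1 = log_ml(data)           # log p(Di | H1), equation (1)
        self.log_ptree = self.log_ph1         # log p(Di | Ti); equal at a leaf
        self.children = None

def merge(ni, nj, alpha, log_ml):
    """Build the candidate node Tk = (Ti, Tj) and its r_k (equations 2 and 3)."""
    nk = Node(np.vstack([ni.data, nj.data]), alpha, log_ml)
    nk.children = (ni, nj)
    # figure 3 recursion (section 3): d_k = alpha*Gamma(n_k) + d_left * d_right
    log_alpha_gamma = np.log(alpha) + gammaln(nk.n)
    nk.log_d = np.logaddexp(log_alpha_gamma, ni.log_d + nj.log_d)
    nk.log_pi = log_alpha_gamma - nk.log_d
    # equation (2): p(Dk|Tk) = pi_k p(Dk|H1) + (1 - pi_k) p(Di|Ti) p(Dj|Tj)
    log_term1 = nk.log_pi + nk.log_ph1
    log_term2 = np.log1p(-np.exp(nk.log_pi)) + ni.log_ptree + nj.log_ptree
    nk.log_ptree = np.logaddexp(log_term1, log_term2)
    # equation (3): posterior probability of the merged hypothesis
    nk.log_r = log_term1 - nk.log_ptree
    return nk

def bhc(X, alpha, log_ml):
    """Greedy loop of figure 2: repeatedly merge the pair with the highest r_k."""
    nodes = [Node(X[i:i + 1], alpha, log_ml) for i in range(X.shape[0])]
    while len(nodes) > 1:
        candidates = [(merge(nodes[i], nodes[j], alpha, log_ml), i, j)
                      for i, j in itertools.combinations(range(len(nodes)), 2)]
        best, i, j = max(candidates, key=lambda c: c[0].log_r)
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [best]
    return nodes[0]   # root of the tree; cut wherever r_k < 0.5

A practical implementation would cache candidate merges so that after each merge only pairs involving the new node are re-scored, which is what gives the O(n^2) tree-construction cost quoted in section 3. Any conjugate marginal likelihood, such as the Bernoulli-Beta expression of appendix B, can be plugged in as log_ml.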
3. Approximate Inference in a Dirichlet Process Mixture Model

The above algorithm is an approximate inference method for Dirichlet Process mixture models (DPM). Dirichlet Process mixture models consider the limit of infinitely many components of a finite mixture model. Allowing infinitely many components makes it possible to more realistically model the kinds of complicated distributions which we expect in real problems. We briefly review DPMs here, starting from finite mixture models.

Consider a finite mixture model with C components

p(x^{(i)} | \phi) = \sum_{j=1}^{C} p(x^{(i)} | \theta_j) \, p(c_i = j | \mathbf{p})    (4)

where ci ∈ {1,...,C} is a cluster indicator variable for data point i, p are the parameters of a multinomial distribution with p(ci = j|p) = pj, θj are the parameters of the jth component, and φ = (θ1,...,θC, p). Let the parameters of each component have conjugate priors p(θ|β) as in section 2, and let the multinomial parameters also have a conjugate Dirichlet prior

p(\mathbf{p} | \alpha) = \frac{\Gamma(\alpha)}{\Gamma(\alpha/C)^C} \prod_{j=1}^{C} p_j^{\alpha/C - 1}.    (5)

Given a data set D = {x^(1), ..., x^(n)}, the marginal likelihood for this mixture model is

p(D | \alpha, \beta) = \int \Big[ \prod_{i=1}^{n} p(x^{(i)} | \phi) \Big] p(\phi | \alpha, \beta) \, d\phi    (6)

where p(φ|α, β) = p(p|α) ∏_{j=1}^{C} p(θj|β). This marginal likelihood can be re-written as

p(D | \alpha, \beta) = \sum_{\mathbf{c}} p(\mathbf{c} | \alpha) \, p(D | \mathbf{c}, \beta)    (7)

where c = (c1, ..., cn) and p(c|α) = ∫ p(c|p) p(p|α) dp is a standard Dirichlet integral. The quantity (7) is well-defined even in the limit C → ∞. Although the number of possible settings of c grows as C^n and therefore diverges as C → ∞, the number of possible ways of partitioning the n points remains finite (roughly O(n^n)). Using V to denote the set of all possible partitionings of n data points, we can re-write (7) as:

p(D | \alpha, \beta) = \sum_{v \in V} p(v | \alpha) \, p(D | v, \beta)    (8)

Rasmussen (2000) provides a thorough analysis of DPMs with Gaussian components, and a Markov chain Monte Carlo (MCMC) algorithm for sampling from the partitionings v; we have included a small amount of review in appendix A. DPMs have the interesting property that the probability of a new data point belonging to a cluster is proportional to the number of points already in that cluster (Blackwell & MacQueen, 1973), where α controls the probability of the new point creating a new cluster.

For an n point data set, each possible clustering is a different partition of the data, which we can denote by placing brackets around data point indices: e.g. (1 2)(3)(4). Each individual cluster, e.g. (1 2), is a nonempty subset of data, yielding 2^n − 1 possible clusters, which can be combined in many ways to form clusterings (i.e. partitions) of the whole data set. We can organize a subset of these clusters into a tree. Combining these clusters one can obtain all tree-consistent partitions of the data (see figure 1(b)). Rather than summing over all possible partitions of the data using MCMC, our algorithm computes the sum over all exponentially many tree-consistent partitions for a particular tree built greedily bottom-up. This can be seen as a fast and deterministic alternative to MCMC approximations.

Returning to our algorithm, since a DPM with concentration hyperparameter α defines a prior on all partitions of the nk data points in Dk (the value of α is directly related to the expected number of clusters), the prior on the merged hypothesis is the relative mass of all nk points belonging to one cluster versus all the other partitions of those nk data points consistent with the tree structure. This can be computed bottom-up as the tree is being built (figure 3).

initialize each leaf i to have di = α, πi = 1
for each internal node k do
  dk = α Γ(nk) + dleftk drightk
  πk = α Γ(nk) / dk
end for

Figure 3. Algorithm for computing the prior on merging, where rightk (leftk) indexes the right (left) subtree of Tk and drightk (dleftk) is the value of d computed for the right (left) child of internal node k.
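As a small numerical check on this "relative mass" interpretation of πk, the sketch below (ours, not the authors') enumerates the tree-consistent partitions of the four-point tree of figure 1(b), assuming its structure is (((1 2) 3) 4), which is consistent with the caption. It weights each partition by α^{mv} ∏ℓ Γ(nℓ^v) and confirms that the relative mass of the single-cluster partition equals πk computed by the figure 3 recursion.

import math
from itertools import product

alpha = 1.0   # DPM concentration, chosen arbitrarily for this check

def leaves(tree):
    return [tree] if not isinstance(tree, tuple) else leaves(tree[0]) + leaves(tree[1])

def tree_consistent_partitions(tree):
    """All tree-consistent partitions of the leaves; a partition is a list of clusters."""
    if not isinstance(tree, tuple):
        return [[[tree]]]                       # a leaf: only the singleton cluster
    parts = [pl + pr                            # combine any partition of each subtree
             for pl, pr in product(tree_consistent_partitions(tree[0]),
                                   tree_consistent_partitions(tree[1]))]
    parts.append([leaves(tree)])                # plus the single-cluster partition
    return parts

def dpm_weight(partition):
    """Unnormalized prior mass alpha^{m_v} * prod_l Gamma(n_l^v), with Gamma(n) = (n-1)!."""
    return alpha ** len(partition) * math.prod(math.factorial(len(c) - 1) for c in partition)

def d_recursion(tree):
    """d_k from figure 3: alpha*Gamma(n_k) + d_left * d_right, with d = alpha at leaves."""
    if not isinstance(tree, tuple):
        return alpha
    n = len(leaves(tree))
    return alpha * math.factorial(n - 1) + d_recursion(tree[0]) * d_recursion(tree[1])

tree = (((1, 2), 3), 4)                         # figure 1(b), assumed structure
parts = tree_consistent_partitions(tree)
total = sum(dpm_weight(p) for p in parts)
n_root = len(leaves(tree))
print(len(parts))                                               # 4 tree-consistent partitions
print(dpm_weight([leaves(tree)]) / total)                       # 0.6 for alpha = 1 ...
print(alpha * math.factorial(n_root - 1) / d_recursion(tree))   # ... equals pi_k from figure 3

The construction of the partition list also illustrates the CiCj + 1 counting argument used in Proposition 1 below.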

Lemma 1 The marginal likelihood of a DPM is:

p(D_k) = \sum_{v \in V} \frac{\alpha^{m_v} \prod_{\ell=1}^{m_v} \Gamma(n_\ell^v)}{\Gamma(n_k + \alpha)/\Gamma(\alpha)} \prod_{\ell=1}^{m_v} p(D_\ell^v)

where V is the set of all possible partitionings of Dk, mv is the number of clusters in partitioning v, and nℓ^v is the number of points in cluster ℓ of partitioning v.

This is easily shown since:

p(D_k) = \sum_{v} p(v) \, p(D^v)

where:

p(v) = \frac{\alpha^{m_v} \prod_{\ell=1}^{m_v} \Gamma(n_\ell^v)}{\Gamma(n_k + \alpha)/\Gamma(\alpha)}

and:

p(D^v) = \prod_{\ell=1}^{m_v} p(D_\ell^v)

Here p(Dk) is a sum over all partitionings v, where the first (fractional) term in the sum is the prior on partitioning v and the second (product) term is the likelihood of partitioning v under the data.

Theorem 1 The quantity (2) computed by the Bayesian Hierarchical Clustering algorithm is:

p(D_k | T_k) = \sum_{v \in V_T} \frac{\alpha^{m_v} \prod_{\ell=1}^{m_v} \Gamma(n_\ell^v)}{d_k} \prod_{\ell=1}^{m_v} p(D_\ell^v)

where VT is the set of all tree-consistent partitionings of Dk.

Proof Rewriting equation (2), using the recursion of figure 3 to substitute in for πk, we obtain:

p(D_k | T_k) = \frac{\alpha \Gamma(n_k)}{d_k} p(D_k | H_1^k) + \frac{d_i d_j}{d_k} p(D_i | T_i) \, p(D_j | T_j)

We will proceed to give a proof by induction. In the base case, at a leaf node, the second term in this equation drops out since there are no subtrees. Γ(nk = 1) = 1 and dk = α, yielding p(Dk|Tk) = p(Dk|H1^k), as we should expect at a leaf node.

For the inductive step, we note that the first term is always just the trivial partition with all nk points in a single cluster. According to our inductive hypothesis:

p(D_i | T_i) = \sum_{v' \in V_{T_i}} \frac{\alpha^{m_{v'}} \prod_{\ell'=1}^{m_{v'}} \Gamma(n_{\ell'}^{v'})}{d_i} \prod_{\ell'=1}^{m_{v'}} p(D_{\ell'}^{v'})

and similarly for p(Dj|Tj), where VTi (VTj) is the set of all tree-consistent partitionings of Di (Dj). Combining terms we obtain:

\frac{d_i d_j}{d_k} p(D_i | T_i) \, p(D_j | T_j)
= \frac{1}{d_k} \Bigg[ \sum_{v' \in V_{T_i}} \alpha^{m_{v'}} \prod_{\ell'=1}^{m_{v'}} \Gamma(n_{\ell'}^{v'}) \prod_{\ell'=1}^{m_{v'}} p(D_{\ell'}^{v'}) \Bigg] \times \Bigg[ \sum_{v'' \in V_{T_j}} \alpha^{m_{v''}} \prod_{\ell''=1}^{m_{v''}} \Gamma(n_{\ell''}^{v''}) \prod_{\ell''=1}^{m_{v''}} p(D_{\ell''}^{v''}) \Bigg]
= \sum_{v \in V_{NTT}} \frac{\alpha^{m_v} \prod_{\ell=1}^{m_v} \Gamma(n_\ell^v)}{d_k} \prod_{\ell=1}^{m_v} p(D_\ell^v)

where VNTT is the set of all non-trivial tree-consistent partitionings of Dk. For the trivial partition, mv = 1 and n1^v = nk. By combining the trivial and non-trivial terms we get a sum over all tree-consistent partitions, yielding the result in Theorem 1. This completes the proof. Another way to get this result is to expand out p(D|T) and substitute for π using the recursion of figure 3.

Corollary 1 For any binary tree Tk with the data points in Dk at its leaves, the following is a lower bound on the marginal likelihood of a DPM:

\frac{d_k \Gamma(\alpha)}{\Gamma(n_k + \alpha)} \, p(D_k | T_k) \le p(D_k)

Proof Corollary 1 follows trivially by multiplying p(Dk|Tk) by the ratio of its denominator and the denominator of p(Dk) from lemma 1 (i.e. dkΓ(α)/Γ(nk + α)), and from the fact that tree-consistent partitions are a subset of all partitions of the data.

Proposition 1 The number of tree-consistent partitions is exponential in the number of data points for balanced binary trees.

Proof If Ti has Ci tree-consistent partitions of Di and Tj has Cj tree-consistent partitions of Dj, then Tk = (Ti, Tj) merging the two has CiCj + 1 tree-consistent partitions of Dk = Di ∪ Dj, obtained by combining all pairs of partitions and adding the partition where all data in Dk are in one cluster. At the leaves Ci = 1. Therefore, for a balanced binary tree of depth ℓ the number of tree-consistent partitions grows as O(2^{2^ℓ}), whereas the number of data points n grows as O(2^ℓ).
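As a worked instance of the theorem, take the tree of figure 1(b) and assume, consistently with its caption, that its structure is (((1 2) 3) 4). Writing p(D_S) for the marginal likelihood (1) of the points in S treated as a single cluster, VT contains exactly four partitions and Theorem 1 reads:

\[
p(D_k \mid T_k) = \frac{1}{d_k}\Big[
\alpha\,\Gamma(4)\, p(D_{\{1234\}})
+ \alpha^{2}\,\Gamma(3)\,\Gamma(1)\, p(D_{\{123\}})\, p(D_{\{4\}})
+ \alpha^{3}\,\Gamma(2)\,\Gamma(1)^{2}\, p(D_{\{12\}})\, p(D_{\{3\}})\, p(D_{\{4\}})
+ \alpha^{4}\, p(D_{\{1\}})\, p(D_{\{2\}})\, p(D_{\{3\}})\, p(D_{\{4\}})
\Big],
\]
\[
d_k = \alpha\Gamma(4) + \alpha^{2}\Gamma(3) + \alpha^{3}\Gamma(2) + \alpha^{4}
    = 6\alpha + 2\alpha^{2} + \alpha^{3} + \alpha^{4}.
\]

The Γ-prefactors in the bracket sum to dk, so the four weights sum to one, while the partition (1)(2 3)(4), which is not tree-consistent, receives no weight; it does appear in the sum of Lemma 1, which ranges over all fifteen partitions of four points.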

In summary, p(D|T) sums the probabilities for all tree-consistent partitions, weighted by the prior mass assigned to each partition by the DPM. The computational complexity of constructing the tree is O(n^2), the complexity of computing the marginal likelihood is O(n log n), and the complexity of computing the predictive distribution (see section 4) is O(n).

4. Learning and Prediction

Learning Hyperparameters. For any given setting of the hyperparameters, the root node of the tree approximates the probability of the data given those particular hyperparameters. In our model the hyperparameters are the concentration parameter α from the DPM, and the hyperparameters β of the probabilistic model defining each component of the mixture. We can use the root node marginal likelihood p(D|T) to do model comparison between different settings of the hyperparameters. For a fixed tree we can optimize over the hyperparameters by taking gradients. In the case of all data belonging to a single cluster:

p(D | H_1) = \int p(D|\theta) \, p(\theta|\beta) \, d\theta

we can compute ∂p(D|H1)/∂β. Using this we can compute gradients for the component model hyperparameters bottom-up as the tree is being built:

\frac{\partial p(D_k|T_k)}{\partial \beta} = \pi_k \frac{\partial p(D_k|H_1^k)}{\partial \beta} + (1 - \pi_k) \frac{\partial p(D_i|T_i)}{\partial \beta} p(D_j|T_j) + (1 - \pi_k) p(D_i|T_i) \frac{\partial p(D_j|T_j)}{\partial \beta}

Similarly, we can compute

\frac{\partial p(D_k|T_k)}{\partial \alpha} = \frac{\partial \pi_k}{\partial \alpha} p(D_k|H_1^k) - \frac{\partial \pi_k}{\partial \alpha} p(D_i|T_i) p(D_j|T_j) + (1 - \pi_k) \frac{\partial p(D_i|T_i)}{\partial \alpha} p(D_j|T_j) + (1 - \pi_k) p(D_i|T_i) \frac{\partial p(D_j|T_j)}{\partial \alpha}

where:

\frac{\partial \pi_k}{\partial \alpha} = \frac{\pi_k}{\alpha} - \frac{\pi_k}{d_k} \frac{\partial d_k}{\partial \alpha}

and:

\frac{\partial d_k}{\partial \alpha} = \Gamma(n_k) + \frac{\partial d_{left_k}}{\partial \alpha} d_{right_k} + d_{left_k} \frac{\partial d_{right_k}}{\partial \alpha}

These gradients can be computed bottom-up by additionally propagating ∂d/∂α. This allows us to construct an EM-like algorithm where we find the best tree structure in the (Viterbi-like) E step and then optimize over the hyperparameters in the M step. In our experiments we have only optimized one of the hyperparameters with a simple line search for Gaussian components. A simple empirical approach is to set the hyperparameters β by fitting a single model to the whole data set.

Figure 4. Log marginal likelihood (evidence) vs. purity over 50 iterations of hyperparameter optimization.

Figure 5. Top level structure of BHC (left) vs. Average Linkage HC (right) for the newsgroup dataset. The 3 words shown at each node have the highest mutual information between the cluster of documents at that node versus its sibling, and occur with higher frequency in that cluster. The number of documents at each cluster is also given.

Predictive Distribution. For any tree, the probability of a new test point given the data can be computed by recursing through the tree starting at the root node. Each node k represents a cluster, with an associated predictive distribution p(x|Dk) = ∫ p(x|θ) p(θ|Dk, β) dθ. The overall predictive distribution sums over all nodes weighted by their posterior probabilities:

p(x|D) = \sum_{k \in N} \omega_k \, p(x|D_k)    (9)

where N is the set of all nodes in the tree, ωk = rk ∏_{i∈Nk} (1 − ri) is the weight on cluster k, and Nk is the set of nodes on the path from the root node to the parent of node k. This expression can be derived by rearranging the sum over all tree-consistent partitionings into a sum over all clusters in the tree (noting that a cluster can appear in many partitionings). For Gaussian components with conjugate priors, this results in a predictive distribution which is a mixture of multivariate t distributions. We show some examples of this in the Results section.
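The weights ωk can be read off a fitted tree in a single pass from the root. The sketch below assumes the Node objects produced by the bhc sketch in section 2 (children and log_r attributes; leaves have no log_r, which corresponds to rk = 1 there), and a caller-supplied log_pred(x, data) returning log p(x|Dk); both conventions are ours, not the paper's.

import numpy as np

def predictive_weights(root):
    """(node, log omega_k) pairs for equation (9); omega_k = r_k * prod_{i in N_k} (1 - r_i)."""
    weights = []
    def recurse(node, log_unclaimed):
        # log_unclaimed = sum over ancestors i of log(1 - r_i)
        log_r = getattr(node, "log_r", 0.0)          # leaves: r_k = 1
        weights.append((node, log_unclaimed + log_r))
        if node.children is not None:
            log_one_minus_r = np.log1p(-np.exp(log_r)) if log_r < 0 else -np.inf
            for child in node.children:
                recurse(child, log_unclaimed + log_one_minus_r)
    recurse(root, 0.0)
    return weights   # the omegas sum to 1 because r_k = 1 at the leaves

def predictive_density(x, root, log_pred):
    """p(x|D) = sum_k omega_k p(x|D_k), with p(x|D_k) supplied by log_pred(x, data)."""
    logs = [lw + log_pred(x, node.data) for node, lw in predictive_weights(root)]
    m = max(logs)
    return float(np.exp(m) * sum(np.exp(l - m) for l in logs))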

5. Results

We compared Bayesian hierarchical clustering to traditional hierarchical clustering using average, single, and complete linkage over 5 datasets (4 real and 1 synthetic). We also compared our algorithm to average linkage hierarchical clustering on several toy 2D problems (figure 8). On these problems we were able to compare the different hierarchies generated by the two algorithms, and to visualize clusterings and predictive distributions.

The 4 real datasets we used are the spambase (100 random examples from each class, 2 classes, 57 attributes) and glass (214 examples, 7 classes, 9 attributes) datasets from the UCI repository, the CEDAR Buffalo digits (20 random examples from each class, 10 classes, 64 attributes), and the CMU 20Newsgroups dataset (120 examples, 4 classes: rec.sport.baseball, rec.sport.hockey, rec.autos, and sci.space, 500 attributes). We also used synthetic data generated from a mixture of Gaussians (200 examples, 4 classes, 2 attributes). The synthetic, glass and toy datasets were modeled using Gaussians, while the digits, spambase, and newsgroup datasets were binarized and modeled using Bernoullis. We binarized the digits dataset by thresholding at a greyscale value of 128 out of 0 through 255, and the spambase dataset by whether each attribute value was zero or non-zero. We ran the algorithms on 3 digits (0, 2, 4), and on all 10 digits. The newsgroup dataset was constructed using Rainbow (McCallum, 1996), where a stop list was used and words appearing fewer than 5 times were ignored. The dataset was then binarized based on word presence/absence in a document.

For these classification datasets, where labels for the data points are known, we computed a measure between 0 and 1 of how well a dendrogram clusters the known labels, called the dendrogram purity.² We found that the marginal likelihood of the tree structure for the data was highly correlated with the purity. Over 50 iterations with different hyperparameters this correlation was 0.888 (see figure 4). Table 1 shows the results on these datasets.

² Let T be a tree with leaves 1,...,n and c1,...,cn be the known discrete class labels for the data points at the leaves. Pick a leaf ℓ uniformly at random; pick another leaf j uniformly in the same class, i.e. cℓ = cj. Find the smallest subtree containing ℓ and j. Measure the fraction of leaves in that subtree which are in the same class (cℓ). The expected value of this fraction is the dendrogram purity, and can be computed exactly in a bottom-up recursion on the dendrogram. The purity is 1 iff all leaves in each class are contained in some pure subtree.

On all datasets except Glass, BHC found the highest purity trees. For Glass, the Gaussian assumption may have been poor. This highlights the importance of model choice. Similarly, for classical distance-based hierarchical clustering methods, a poor choice of distance metric may result in poor clusterings.

Data Set      Single Linkage    Complete Linkage    Average Linkage    BHC
Synthetic     0.599 ± 0.033     0.634 ± 0.024       0.668 ± 0.040      0.828 ± 0.025
Newsgroups    0.275 ± 0.001     0.315 ± 0.008       0.282 ± 0.002      0.465 ± 0.016
Spambase      0.598 ± 0.017     0.699 ± 0.017       0.668 ± 0.019      0.728 ± 0.029
3Digits       0.545 ± 0.015     0.654 ± 0.013       0.742 ± 0.018      0.807 ± 0.022
10Digits      0.224 ± 0.004     0.299 ± 0.006       0.342 ± 0.005      0.393 ± 0.015
Glass         0.478 ± 0.009     0.476 ± 0.009       0.491 ± 0.009      0.467 ± 0.011

Table 1. Purity scores for 3 kinds of traditional agglomerative clustering, and Bayesian hierarchical clustering.
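The expectation in the purity definition can be computed exactly bottom-up, as footnote 2 notes; the sketch below instead estimates it by direct Monte Carlo sampling, simply because that is the shortest faithful transcription of the definition. It assumes a tree whose leaves carry an integer leaf_id indexing into the label list and whose internal nodes have a children attribute, and that every class has at least two leaves; these conventions are ours, not the paper's.

import random

def dendrogram_purity_mc(root, labels, n_samples=20000, seed=0):
    """Monte Carlo estimate of dendrogram purity (footnote 2)."""
    rng = random.Random(seed)

    def leaf_ids(node):
        if node.children is None:
            return [node.leaf_id]
        return [l for c in node.children for l in leaf_ids(c)]

    # For every pair of leaves, record the leaves of their smallest common subtree.
    smallest = {}
    def collect(node):
        ids = leaf_ids(node)
        if node.children is not None:
            left, right = (set(leaf_ids(c)) for c in node.children)
            for a in left:
                for b in right:
                    smallest[(a, b)] = smallest[(b, a)] = ids
            for c in node.children:
                collect(c)
    collect(root)

    by_class = {}
    for i, c in enumerate(labels):
        by_class.setdefault(c, []).append(i)

    total = 0.0
    for _ in range(n_samples):
        l = rng.randrange(len(labels))
        j = rng.choice(by_class[labels[l]])
        while j == l:                            # "another leaf j" in the same class
            j = rng.choice(by_class[labels[l]])
        subtree = smallest[(l, j)]
        total += sum(labels[s] == labels[l] for s in subtree) / len(subtree)
    return total / n_samples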

Another advantage of the BHC algorithm, not fully addressed by purity scores alone, is that it tends to create hierarchies with good structure, particularly at high levels. Figure 5 compares the top three levels (last three merges) of the newsgroups hierarchy (using 800 examples and the 50 words with highest information gain) from BHC and ALHC. Continuing to look at lower levels does not improve ALHC, as can be seen from the full dendrograms for this dataset, figure 7. Although sports and autos are both under the newsgroup heading rec, our algorithm finds more similarity between autos and the space document class. This is quite reasonable since these two classes have many more words in common (such as engine, speed, cost, develop, gas, etc.) than autos and sports do. Figure 6 also shows dendrograms of the 3digits dataset.

6. Related Work

The work in this paper is related to and inspired by several previous probabilistic approaches to clustering³, which we briefly review here. Stolcke and Omohundro (1993) described an algorithm for agglomerative model merging based on marginal likelihoods in the context of hidden Markov model structure induction. Williams (2000) and Neal (2003) describe Gaussian and diffusion-based hierarchical generative models, respectively, for which inference can be done using MCMC methods. Similarly, Kemp et al. (2004) present a hierarchical generative model for data based on a mutation process. Ward (2001) uses a hierarchical fusion of contexts based on marginal likelihoods in a Dirichlet language model.

³ There has also been a considerable amount of decision tree based work on Bayesian tree structures for classification and regression (Chipman et al., 1998; Denison et al., 2002), but this is not closely related to the work presented here.

Banfield and Raftery (1993) present an approximate method based on the likelihood ratio test statistic to compute the marginal likelihood for c and c − 1 clusters and use this in an agglomerative algorithm. Vaithyanathan and Dom (2000) perform hierarchical clustering of multinomial data consisting of a vector of features. The clusters are specified in terms of which subset of features have common distributions. Their clusters can have different parameters for some features and the same parameters to model other features, and their method is based on finding merges that maximize marginal likelihood under a Dirichlet-Multinomial model.

Iwayama and Tokunaga (1995) define a Bayesian hierarchical clustering algorithm which attempts to agglomeratively find the maximum posterior probability clustering, but it makes strong independence assumptions and does not use the marginal likelihood.

Segal et al. (2002) present probabilistic abstraction hierarchies (PAH) which learn a hierarchical model in which each node contains a probabilistic model and the hierarchy favors placing similar models at neighboring nodes in the tree (as measured by a distance function between probabilistic models). The training data is assigned to leaves of this tree. Ramoni et al. (2002) present an agglomerative algorithm for merging time series based on greedily maximizing marginal likelihood. Friedman (2003) has also recently proposed a greedy agglomerative algorithm based on marginal likelihood which simultaneously clusters rows and columns of gene expression data.

The algorithm in our paper differs from the above algorithms in several ways. First, unlike (Williams, 2000; Neal, 2003; Kemp et al., 2004) it is not in fact a hierarchical generative model of the data, but rather a hierarchical way of organizing nested clusters. Second, our algorithm is derived from Dirichlet process mixtures. Third, the hypothesis test at the core of our algorithm tests between a single merged hypothesis and the alternative of exponentially many other clusterings of the same data (not one vs. two clusters at each stage). Lastly, our algorithm does not use any iterative method, like EM, or require sampling, like MCMC, and is therefore significantly faster than most of the above algorithms.
Discussion tures and the same parameters to model other fea- We have presented a novel algorithm for Bayesian hier- tures and their method is based on finding merges archical clustering based on Dirichlet process mixtures. This algorithm has several advantages over traditional 3There has also been a considerable amount of decision tree based work on Bayesian tree structures for classification and approaches, which we have highlighted throughout the regression (Chipman et al., 1998; Denison et al., 2002), but this paper. We have presented prediction and hyperparam- is not closely related to work presented here. eter optimization procedures and shown that the al- gorithm provides competitive clusterings of real-world From this we can compute the conditional probability data as measured by purity with respect to known la- of a single indicator: bels. The algorithm can also be seen as an extremely n + α/C fast alternative to MCMC inference in DPMs. p(c = j|c , α) = −i,j i −i n − 1 + α The limitations of our algorithm include its inher- ent greediness, a computational complexity which is where n−i,j is the number of data points, excluding quadratic in the number of data points, and the lack point i, assigned to class j. Finally, taking the infinite of any incorporation of tree uncertainty. limit (C → ∞), we obtain:

n−i,j In future work, we plan to try BHC on more com- p(ci = j|c−i, α) = , plex component models for other realistic data—this n − 1 + α ′ α is likely to require approximations of the component p(ci 6= ci′ , ∀i 6= i|c−i, α) = marginal likelihoods (1). We also plan to extend BHC n − 1 + α to systematically incorporate hyperparameter opti- The second term is the prior probability that point i mization and improve the running time to O(n log n) belongs to some new class that no other points belong by exploiting a randomized version of the algorithm. to. We need to compare this novel, fast inference algo- rithm for DPMs to other inference algorithms such B. Marginal Likelihoods for Single as MCMC (Rasmussen, 2000), EP (Minka & Ghahra- mani, 2003) and Variational Bayes (Blei & Jordan, Components 2004). We also hope to explore the idea of computing Here we give equations for the marginal likelihoods several alternative tree structures in order to create of single components using simple distributions and a manipulable tradeoff between computation time and their conjugate priors. For each component model we tightness of our lower bound. There are many exciting compute: avenues for further work in this area.
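These conditional probabilities are simple to compute; the sketch below (our own helper, not from the paper) returns them for a single point given the cluster labels of the remaining points.

def crp_predictive(c_minus_i, alpha):
    """Conditional prior over the cluster of point i given the others (appendix A).

    Returns a dict mapping each existing cluster label j to n_{-i,j} / (n - 1 + alpha),
    plus the key "new" for alpha / (n - 1 + alpha).  c_minus_i is the list of cluster
    labels of all points except i.
    """
    n = len(c_minus_i) + 1
    probs = {j: c_minus_i.count(j) / (n - 1 + alpha) for j in set(c_minus_i)}
    probs["new"] = alpha / (n - 1 + alpha)
    return probs

# e.g. crp_predictive([0, 0, 1], alpha=1.0) gives {0: 0.5, 1: 0.25, "new": 0.25}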

B. Marginal Likelihoods for Single Components

Here we give equations for the marginal likelihoods of single components using simple distributions and their conjugate priors. For each component model we compute:

p(D | H_1) = \int p(D|\theta) \, p(\theta|\beta) \, d\theta

Bernoulli-Beta

p(D | H_1) = \prod_{d=1}^{k} \frac{\Gamma(\alpha_d + \beta_d)\,\Gamma(\alpha_d + m_d)\,\Gamma(\beta_d + N - m_d)}{\Gamma(\alpha_d)\,\Gamma(\beta_d)\,\Gamma(\alpha_d + \beta_d + N)}

where k is the total number of dimensions, N is the number of data points in the dataset, m_d = \sum_{i=1}^{N} x_d^{(i)}, X is the dataset, and α and β are hyperparameters for the Beta distribution.

Multinomial-Dirichlet

p(D | H_1) = \prod_{i=1}^{N} \binom{M_i}{x_1^{(i)} \ldots x_k^{(i)}} \times \frac{\Gamma\big(\sum_{d=1}^{k} \alpha_d\big) \prod_{d=1}^{k} \Gamma(\alpha_d + m_d)}{\Gamma\big(M + \sum_{d=1}^{k} \alpha_d\big) \prod_{d=1}^{k} \Gamma(\alpha_d)}

where k is the total number of dimensions, N is the number of data points in the dataset, M = \sum_{i=1}^{N} M_i, M_i = \sum_{j=1}^{k} x_j^{(i)}, m_d = \sum_{i=1}^{N} x_d^{(i)}, X is the dataset, and α is a hyperparameter for the Dirichlet prior.
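In code, the Bernoulli-Beta marginal likelihood is a few lines once the sums md are available. The sketch below is a direct transcription of the expression above into log space; the function name and the use of scipy are our choices.

import numpy as np
from scipy.special import gammaln

def bernoulli_beta_log_ml(X, a, b):
    """log p(D|H1) for binary data under the Bernoulli-Beta model above.

    X is an (N, k) binary array; a and b are length-k arrays (or scalars) holding
    the Beta hyperparameters alpha_d and beta_d.
    """
    X = np.atleast_2d(X)
    N = X.shape[0]
    m = X.sum(axis=0)                      # m_d = sum_i x_d^(i)
    return float(np.sum(gammaln(a + b) + gammaln(a + m) + gammaln(b + N - m)
                        - gammaln(a) - gammaln(b) - gammaln(a + b + N)))

A function like this is exactly the kind of log_ml that the BHC sketch in section 2 expects.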

Normal-Inverse Wishart-Normal

p(D | H_1) = (2\pi)^{-\frac{Nk}{2}} \left( \frac{r}{N + r} \right)^{\frac{k}{2}} |S|^{\frac{v}{2}} \, |S'|^{-\frac{v'}{2}} \times \frac{2^{\frac{v'k}{2}} \prod_{d=1}^{k} \Gamma\!\left(\frac{v' + 1 - d}{2}\right)}{2^{\frac{vk}{2}} \prod_{d=1}^{k} \Gamma\!\left(\frac{v + 1 - d}{2}\right)}

where:

S' = S + XX^T + \frac{rN}{N + r}\, \mathbf{m}\mathbf{m}^T - \frac{1}{N + r} \Big( \sum_{i=1}^{N} x^{(i)} \Big) \Big( \sum_{i=1}^{N} x^{(i)} \Big)^{\!T} - \frac{r}{N + r} \left( \mathbf{m} \Big( \sum_{i=1}^{N} x^{(i)} \Big)^{\!T} + \Big( \sum_{i=1}^{N} x^{(i)} \Big) \mathbf{m}^T \right)

and:

v' = v + N

Here X is the dataset, k is the number of dimensions, and N is the number of datapoints in X. The hyperparameters for the Inverse Wishart-Normal prior include: m, the prior on the mean; S, the prior on the precision matrix; r, a scaling factor on the prior precision of the mean; and v, the degrees of freedom.
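The same formula can be transcribed directly, working in log space for numerical stability. The sketch below is our transcription of the three equations above, with the data stored as an (N, k) matrix so that the paper's XX^T becomes X.T @ X; it assumes v > k − 1 so that all Gamma arguments are positive.

import numpy as np
from scipy.special import gammaln

def niw_log_ml(X, m, S, r, v):
    """log p(D|H1) for the Normal-Inverse Wishart-Normal model above."""
    X = np.atleast_2d(X)
    N, k = X.shape
    xsum = X.sum(axis=0)                                   # sum_i x^(i)
    S_prime = (S + X.T @ X
               + (r * N / (N + r)) * np.outer(m, m)
               - (1.0 / (N + r)) * np.outer(xsum, xsum)
               - (r / (N + r)) * (np.outer(m, xsum) + np.outer(xsum, m)))
    v_prime = v + N
    d = np.arange(1, k + 1)
    log_gamma_ratio = (gammaln((v_prime + 1 - d) / 2.0).sum()
                       - gammaln((v + 1 - d) / 2.0).sum()
                       + ((v_prime - v) * k / 2.0) * np.log(2.0))
    return float(-(N * k / 2.0) * np.log(2.0 * np.pi)
                 + (k / 2.0) * np.log(r / (N + r))
                 + (v / 2.0) * np.linalg.slogdet(S)[1]
                 - (v_prime / 2.0) * np.linalg.slogdet(S_prime)[1]
                 + log_gamma_ratio)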

References

Banfield, J. D., & Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803–821.

Blackwell, D., & MacQueen, J. B. (1973). Ferguson distributions via Polya urn schemes. Ann. Stats.

Blei, D., & Jordan, M. (2004). Variational methods for Dirichlet process mixture models (Technical Report 674). UC Berkeley.

Chipman, H., George, E., & McCulloch, R. (1998). Bayesian CART model search (with discussion). Journal of the American Statistical Association, 93, 935–960.

Denison, D., Holmes, C., Mallick, B., & Smith, A. (2002). Bayesian methods for nonlinear classification and regression. Wiley.

Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. Wiley.

Friedman, N. (2003). PCluster: Probabilistic agglomerative clustering of gene expression profiles (Technical Report 2003-80). Hebrew University.

Iwayama, M., & Tokunaga, T. (1995). Hierarchical Bayesian clustering for automatic text classification. Proceedings of IJCAI-95, 14th International Joint Conference on Artificial Intelligence (pp. 1322–1327). Montreal, CA: Morgan Kaufmann Publishers, San Francisco, US.

Kemp, C., Griffiths, T. L., Stromsten, S., & Tenenbaum, J. B. (2004). Semi-supervised learning with trees. NIPS 16.

McCallum, A. K. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow.

Minka, T., & Ghahramani, Z. (2003). Expectation propagation for infinite mixtures. NIPS Workshop on Nonparametric Bayesian Methods and Infinite Models.

Neal, R. M. (2003). Density modeling and clustering using Dirichlet diffusion trees. Bayesian Statistics 7 (pp. 619–629).

Ramoni, M. F., Sebastiani, P., & Kohane, I. S. (2002). Cluster analysis of gene expression dynamics. Proc Nat Acad Sci, 99, 9121–9126.

Rasmussen, C. E. (2000). The infinite Gaussian mixture model. NIPS 12 (pp. 554–560).

Segal, E., Koller, D., & Ormoneit, D. (2002). Probabilistic abstraction hierarchies. NIPS 14.

Stolcke, A., & Omohundro, S. (1993). Hidden Markov Model induction by Bayesian model merging. NIPS 5 (pp. 11–18).

Vaithyanathan, S., & Dom, B. (2000). Model-based hierarchical clustering. UAI.

Ward, D. J. (2001). Adaptive computer interfaces. Doctoral dissertation, University of Cambridge.

Williams, C. (2000). A MCMC approach to hierarchical mixture modelling. NIPS 12.


Figure 6. (a) The tree given by performing average linkage hierarchical clustering on 120 examples of 3 digits (0, 2, and 4), yielding a purity score of 0.870. Numbers on the x-axis correspond to the true class label of each example, while the y-axis corresponds to distance between merged clusters. (b) The tree given by performing Bayesian hierarchical clustering on the same dataset, yielding a purity score of 0.924. Here higher values on the y-axis also correspond to higher levels of the hierarchy (later merges), though the exact numbers are irrelevant. Red dashed lines correspond to merges with negative posterior log probabilities and are labeled with these values.


Figure 7. Full dendrograms for average linkage hierarchical clustering and BHC on 800 newsgroup documents. Each of the leaves in the dendrogram is labeled with a color, according to which newsgroup that document came from. Red is rec.autos, blue is rec.sport.baseball, green is rec.sport.hockey, and magenta is sci.space.


Figure 8. Three toy examples (one per column). The first row shows the original data sets. The second row gives the dendrograms resulting from average linkage hierarchical clustering; each number on the x-axis corresponds to a data point as displayed in the plots in the fourth row. The third row gives the dendrograms resulting from Bayesian Hierarchical Clustering, where the red dashed lines are merges our algorithm prefers not to make given our priors. The numbers on the branches are the log odds for merging (log rk/(1 − rk)). The fourth row shows the clusterings found by our algorithm using Gaussian components, when the tree is cut at red dashed branches (rk < 0.5). The last row shows the predictive densities resulting from our algorithm. The BHC algorithm behaves structurally more sensibly than the traditional algorithm. Note that point 18 in the middle column is clustered into the green class even though it is substantially closer to points in the red class. In the first column the traditional algorithm prefers to merge cluster 1-6 with cluster 7-10 rather than, as BHC does, cluster 21-25 with cluster 26-27. Finally, in the last column, the BHC algorithm clusters all points along the upper and lower horizontal parallels together, whereas the traditional algorithm merges some subsets of points vertically (for example cluster 7-12 with cluster 16-27).