Bayesian Hierarchical Clustering

Katherine A. Heller    [email protected]
Zoubin Ghahramani    [email protected]
Gatsby Computational Neuroscience Unit, University College London, 17 Queen Square, London, WC1N 3AR, UK

Abstract

We present a novel algorithm for agglomerative hierarchical clustering based on evaluating marginal likelihoods of a probabilistic model. This algorithm has several advantages over traditional distance-based agglomerative clustering algorithms. (1) It defines a probabilistic model of the data which can be used to compute the predictive distribution of a test point and the probability of it belonging to any of the existing clusters in the tree. (2) It uses a model-based criterion to decide on merging clusters rather than an ad-hoc distance metric. (3) Bayesian hypothesis testing is used to decide which merges are advantageous and to output the recommended depth of the tree. (4) The algorithm can be interpreted as a novel fast bottom-up approximate inference method for a Dirichlet process (i.e. countably infinite) mixture model (DPM). It provides a new lower bound on the marginal likelihood of a DPM by summing over exponentially many clusterings of the data in polynomial time. We describe procedures for learning the model hyperparameters, computing the predictive distribution, and extensions to the algorithm. Experimental results on synthetic and real-world data sets demonstrate useful properties of the algorithm.

1. Introduction

Hierarchical clustering is one of the most frequently used methods in unsupervised learning. Given a set of data points, the output is a binary tree (dendrogram) whose leaves are the data points and whose internal nodes represent nested clusters of various sizes. The tree organizes these clusters hierarchically, where the hope is that this hierarchy agrees with the intuitive organization of real-world data. Hierarchical structures are ubiquitous in the natural world. For example, the evolutionary tree of living organisms (and consequently features of these organisms such as the sequences of homologous genes) is a natural hierarchy. Hierarchical structures are also a natural representation for data which was not generated by evolutionary processes. For example, internet newsgroups, emails, or documents from a newswire can be organized in increasingly broad topic domains.

The traditional method for hierarchically clustering data as given in (Duda & Hart, 1973) is a bottom-up agglomerative algorithm. It starts with each data point assigned to its own cluster and iteratively merges the two closest clusters together until all the data belongs to a single cluster. The nearest pair of clusters is chosen based on a given distance measure (e.g. Euclidean distance between cluster means, or distance between nearest points).

There are several limitations to the traditional hierarchical clustering algorithm. The algorithm provides no guide to choosing the "correct" number of clusters or the level at which to prune the tree. It is often difficult to know which distance metric to choose, especially for structured data such as images or sequences. The traditional algorithm does not define a probabilistic model of the data, so it is hard to ask how "good" a clustering is, to compare to other models, to make predictions, or to cluster new data into an existing hierarchy. We use statistical inference to overcome these limitations. Previous work which uses probabilistic methods to perform hierarchical clustering is discussed in section 6.

Our Bayesian hierarchical clustering algorithm uses marginal likelihoods to decide which clusters to merge and to avoid overfitting. Basically it asks what the probability is that all the data in a potential merge were generated from the same mixture component, and compares this to exponentially many hypotheses at lower levels of the tree (section 2).

The generative model for our algorithm is a Dirichlet process mixture model (i.e. a countably infinite mixture model), and the algorithm can be viewed as a fast bottom-up agglomerative way of performing approximate inference in a DPM. Instead of giving weight to all possible partitions of the data into clusters, which is intractable and would require the use of sampling methods, the algorithm efficiently computes the weight of exponentially many partitions which are consistent with the tree structure (section 3).

Figure 1. (a) Schematic of a portion of a tree where Ti and Tj are merged into Tk, and the associated data sets Di and Dj are merged into Dk. (b) An example tree with 4 data points. The clusterings (1 2 3)(4) and (1 2)(3)(4) are tree-consistent partitions of this data. The clustering (1)(2 3)(4) is not a tree-consistent partition.

2. Algorithm

Our Bayesian hierarchical clustering algorithm is similar to traditional agglomerative clustering in that it is a one-pass, bottom-up method which initializes each data point in its own cluster and iteratively merges pairs of clusters. As we will see, the main difference is that our algorithm uses a statistical hypothesis test to choose which clusters to merge.

Let D = {x^(1), ..., x^(n)} denote the entire data set, and Di ⊂ D the set of data points at the leaves of the subtree Ti. The algorithm is initialized with n trivial trees, {Ti : i = 1...n}, each containing a single data point Di = {x^(i)}. At each stage the algorithm considers merging all pairs of existing trees. For example, if Ti and Tj are merged into some new tree Tk then the associated set of data is Dk = Di ∪ Dj (see figure 1(a)).

In considering each merge, two hypotheses are compared. The first hypothesis, which we will denote H1^k, is that all the data in Dk were in fact generated independently and identically from the same probabilistic model, p(x|θ) with unknown parameters θ. Let us imagine that this probabilistic model is a multivariate Gaussian, with parameters θ = (µ, Σ), although it is crucial to emphasize that for different types of data, different probabilistic models may be appropriate. To evaluate the probability of the data under this hypothesis we need to specify some prior over the parameters of the model, p(θ|β), with hyperparameters β. We now have the ingredients to compute the probability of the data Dk under H1^k:

p(D_k | H_1^k) = \int p(D_k|\theta) p(\theta|\beta) \, d\theta = \int \Big[ \prod_{x^{(i)} \in D_k} p(x^{(i)}|\theta) \Big] p(\theta|\beta) \, d\theta    (1)

This calculates the probability that all the data in Dk were generated from the same parameter values, assuming a model of the form p(x|θ). This is a natural model-based criterion for measuring how well the data fit into one cluster. If we choose models with conjugate priors (e.g. Normal-Inverse-Wishart priors for Normal continuous data or Dirichlet priors for Multinomial discrete data) this integral is tractable. Throughout this paper we use such conjugate priors, so the integrals are simple functions of sufficient statistics of Dk. For example, in the case of Gaussians, (1) is a function of the sample mean and covariance of the data in Dk.

The alternative hypothesis to H1^k would be that the data in Dk has two or more clusters in it. Summing over the exponentially many possible ways of dividing Dk into two or more clusters is intractable. However, if we restrict ourselves to clusterings that partition the data in a manner that is consistent with the subtrees Ti and Tj, we can compute the sum efficiently using recursion. (We elaborate on the notion of tree-consistent partitions in section 3 and figure 1(b).) The probability of the data under this restricted alternative hypothesis, H2^k, is simply a product over the subtrees, p(Dk|H2^k) = p(Di|Ti) p(Dj|Tj), where the probability of a data set under a tree (e.g. p(Di|Ti)) is defined below.

Combining the probability of the data under hypotheses H1^k and H2^k, weighted by the prior that all points in Dk belong to one cluster, πk = p(H1^k), we obtain the marginal probability of the data in tree Tk:

p(D_k | T_k) = \pi_k p(D_k | H_1^k) + (1 - \pi_k) p(D_i | T_i) p(D_j | T_j)    (2)

This equation is defined recursively: the first term considers the hypothesis that there is a single cluster in Dk, and the second term efficiently sums over all other clusterings of the data in Dk which are consistent with the tree structure (see figure 1(a)). In section 3 we show that equation 2 can be used to derive an approximation to the marginal likelihood of a Dirichlet Process mixture model, and in fact provides a new lower bound on this marginal likelihood. (It is important not to confuse the marginal likelihood in equation 1, which integrates over the parameters of one cluster, with the marginal likelihood of a DPM, which integrates over all clusterings and their parameters.)

We also show that the prior for the merged hypothesis, πk, can be computed bottom-up in a DPM. The posterior probability of the merged hypothesis rk = p(H1^k|Dk) is obtained using Bayes rule:

r_k = \frac{\pi_k p(D_k | H_1^k)}{\pi_k p(D_k | H_1^k) + (1 - \pi_k) p(D_i | T_i) p(D_j | T_j)}    (3)

This quantity is used to decide greedily which two trees to merge, and is also used to determine which merges in the final hierarchy structure were justified. The general algorithm is very simple (see figure 2).

input: data D = {x^(1) ... x^(n)}, model p(x|θ), prior p(θ|β)
initialize: number of clusters c = n, and Di = {x^(i)} for i = 1...n
while c > 1 do
  Find the pair Di and Dj with the highest probability of the merged hypothesis:
    rk = πk p(Dk|H1^k) / p(Dk|Tk)
  Merge Dk ← Di ∪ Dj, Tk ← (Ti, Tj)
  Delete Di and Dj, c ← c − 1
end while
output: Bayesian mixture model where each tree node is a mixture component
The tree can be cut at points where rk < 0.5

Figure 2. Bayesian Hierarchical Clustering Algorithm

Our Bayesian hierarchical clustering algorithm has many desirable properties which are absent in traditional hierarchical clustering. For example, it allows us to define predictive distributions for new data points, it decides which merges are advantageous and suggests natural places to cut the tree using a statistical model comparison criterion (via rk), and it can be customized to different kinds of data by choosing appropriate models for the mixture components.
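To make equations (1)-(3) and the loop of figure 2 concrete, the following is a minimal sketch in Python; it is an illustration written for this text, not the authors' implementation. The cluster marginal likelihood of equation (1) is supplied by the caller as a function log_ml taking a 2-D array of data points, and πk and dk are computed with the bottom-up recursion of figure 3, which is only derived in section 3. All names (Node, merge, bhc) are ours.

import itertools
import numpy as np
from scipy.special import gammaln

class Node:
    """One subtree Ti, its data Di, and the quantities in equations (2)-(3)."""
    def __init__(self, data, alpha, log_ml):
        self.data = data                      # rows of Di
        self.n = data.shape[0]
        self.log_d = np.log(alpha)            # leaf value d_i = alpha (figure 3)
        self.log_pi = 0.0                     # leaf value pi_i = 1
        self.log_ph1 = log_ml(data)           # log p(Di | H1), equation (1)
        self.log_ptree = self.log_ph1         # log p(Di | Ti); equal at a leaf
        self.children = None

def merge(ni, nj, alpha, log_ml):
    """Build the candidate node Tk = (Ti, Tj) and its r_k (equations 2 and 3)."""
    nk = Node(np.vstack([ni.data, nj.data]), alpha, log_ml)
    nk.children = (ni, nj)
    # figure 3 recursion (section 3): d_k = alpha*Gamma(n_k) + d_left * d_right
    log_alpha_gamma = np.log(alpha) + gammaln(nk.n)
    nk.log_d = np.logaddexp(log_alpha_gamma, ni.log_d + nj.log_d)
    nk.log_pi = log_alpha_gamma - nk.log_d
    # equation (2): p(Dk|Tk) = pi_k p(Dk|H1) + (1 - pi_k) p(Di|Ti) p(Dj|Tj)
    log_term1 = nk.log_pi + nk.log_ph1
    log_term2 = np.log1p(-np.exp(nk.log_pi)) + ni.log_ptree + nj.log_ptree
    nk.log_ptree = np.logaddexp(log_term1, log_term2)
    # equation (3): posterior probability of the merged hypothesis
    nk.log_r = log_term1 - nk.log_ptree
    return nk

def bhc(X, alpha, log_ml):
    """Greedy loop of figure 2: repeatedly merge the pair with the highest r_k."""
    nodes = [Node(X[i:i + 1], alpha, log_ml) for i in range(X.shape[0])]
    while len(nodes) > 1:
        candidates = [(merge(nodes[i], nodes[j], alpha, log_ml), i, j)
                      for i, j in itertools.combinations(range(len(nodes)), 2)]
        best, i, j = max(candidates, key=lambda c: c[0].log_r)
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [best]
    return nodes[0]   # root of the tree; cut wherever r_k < 0.5

A practical implementation would cache candidate merges so that after each merge only pairs involving the new node are re-scored, which is what gives the O(n^2) tree-construction cost quoted in section 3. Any conjugate marginal likelihood, such as the Bernoulli-Beta expression of appendix B, can be plugged in as log_ml.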
3. Approximate Inference in a Dirichlet Process Mixture Model

The above algorithm is an approximate inference method for Dirichlet Process mixture models (DPM). Dirichlet Process mixture models consider the limit of infinitely many components of a finite mixture model. Allowing infinitely many components makes it possible to more realistically model the kinds of complicated distributions which we expect in real problems. We briefly review DPMs here, starting from finite mixture models.

Consider a finite mixture model with C components

p(x^{(i)} | \phi) = \sum_{j=1}^{C} p(x^{(i)} | \theta_j) \, p(c_i = j | \mathbf{p})    (4)

where ci ∈ {1,...,C} is a cluster indicator variable for data point i, p are the parameters of a multinomial distribution with p(ci = j|p) = pj, θj are the parameters of the jth component, and φ = (θ1,...,θC, p). Let the parameters of each component have conjugate priors p(θ|β) as in section 2, and let the multinomial parameters also have a conjugate Dirichlet prior

p(\mathbf{p} | \alpha) = \frac{\Gamma(\alpha)}{\Gamma(\alpha/C)^C} \prod_{j=1}^{C} p_j^{\alpha/C - 1}.    (5)

Given a data set D = {x^(1), ..., x^(n)}, the marginal likelihood for this mixture model is

p(D | \alpha, \beta) = \int \Big[ \prod_{i=1}^{n} p(x^{(i)} | \phi) \Big] p(\phi | \alpha, \beta) \, d\phi    (6)

where p(φ|α, β) = p(p|α) ∏_{j=1}^{C} p(θj|β). This marginal likelihood can be re-written as

p(D | \alpha, \beta) = \sum_{\mathbf{c}} p(\mathbf{c} | \alpha) \, p(D | \mathbf{c}, \beta)    (7)

where c = (c1, ..., cn) and p(c|α) = ∫ p(c|p) p(p|α) dp is a standard Dirichlet integral. The quantity (7) is well-defined even in the limit C → ∞. Although the number of possible settings of c grows as C^n and therefore diverges as C → ∞, the number of possible ways of partitioning the n points remains finite (roughly O(n^n)). Using V to denote the set of all possible partitionings of n data points, we can re-write (7) as:

p(D | \alpha, \beta) = \sum_{v \in V} p(v | \alpha) \, p(D | v, \beta)    (8)

Rasmussen (2000) provides a thorough analysis of DPMs with Gaussian components, and a Markov chain Monte Carlo (MCMC) algorithm for sampling from the partitionings v; we have included a small amount of review in appendix A. DPMs have the interesting property that the probability of a new data point belonging to a cluster is proportional to the number of points already in that cluster (Blackwell & MacQueen, 1973), where α controls the probability of the new point creating a new cluster.

For an n point data set, each possible clustering is a different partition of the data, which we can denote by placing brackets around data point indices: e.g. (1 2)(3)(4). Each individual cluster, e.g. (1 2), is a nonempty subset of data, yielding 2^n − 1 possible clusters, which can be combined in many ways to form clusterings (i.e. partitions) of the whole data set. We can organize a subset of these clusters into a tree. Combining these clusters one can obtain all tree-consistent partitions of the data (see figure 1(b)). Rather than summing over all possible partitions of the data using MCMC, our algorithm computes the sum over all exponentially many tree-consistent partitions for a particular tree built greedily bottom-up. This can be seen as a fast and deterministic alternative to MCMC approximations.

Returning to our algorithm, since a DPM with concentration hyperparameter α defines a prior on all partitions of the nk data points in Dk (the value of α is directly related to the expected number of clusters), the prior on the merged hypothesis is the relative mass of all nk points belonging to one cluster versus all the other partitions of those nk data points consistent with the tree structure. This can be computed bottom-up as the tree is being built (figure 3).

initialize each leaf i to have di = α, πi = 1
for each internal node k do
  dk = α Γ(nk) + dleftk drightk
  πk = α Γ(nk) / dk
end for

Figure 3. Algorithm for computing the prior on merging, where rightk (leftk) indexes the right (left) subtree of Tk and drightk (dleftk) is the value of d computed for the right (left) child of internal node k.
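As a small numerical check on this "relative mass" interpretation of πk, the sketch below (ours, not the authors') enumerates the tree-consistent partitions of the four-point tree of figure 1(b), assuming its structure is (((1 2) 3) 4), which is consistent with the caption. It weights each partition by α^{mv} ∏ℓ Γ(nℓ^v) and confirms that the relative mass of the single-cluster partition equals πk computed by the figure 3 recursion.

import math
from itertools import product

alpha = 1.0   # DPM concentration, chosen arbitrarily for this check

def leaves(tree):
    return [tree] if not isinstance(tree, tuple) else leaves(tree[0]) + leaves(tree[1])

def tree_consistent_partitions(tree):
    """All tree-consistent partitions of the leaves; a partition is a list of clusters."""
    if not isinstance(tree, tuple):
        return [[[tree]]]                       # a leaf: only the singleton cluster
    parts = [pl + pr                            # combine any partition of each subtree
             for pl, pr in product(tree_consistent_partitions(tree[0]),
                                   tree_consistent_partitions(tree[1]))]
    parts.append([leaves(tree)])                # plus the single-cluster partition
    return parts

def dpm_weight(partition):
    """Unnormalized prior mass alpha^{m_v} * prod_l Gamma(n_l^v), with Gamma(n) = (n-1)!."""
    return alpha ** len(partition) * math.prod(math.factorial(len(c) - 1) for c in partition)

def d_recursion(tree):
    """d_k from figure 3: alpha*Gamma(n_k) + d_left * d_right, with d = alpha at leaves."""
    if not isinstance(tree, tuple):
        return alpha
    n = len(leaves(tree))
    return alpha * math.factorial(n - 1) + d_recursion(tree[0]) * d_recursion(tree[1])

tree = (((1, 2), 3), 4)                         # figure 1(b), assumed structure
parts = tree_consistent_partitions(tree)
total = sum(dpm_weight(p) for p in parts)
n_root = len(leaves(tree))
print(len(parts))                                               # 4 tree-consistent partitions
print(dpm_weight([leaves(tree)]) / total)                       # 0.6 for alpha = 1 ...
print(alpha * math.factorial(n_root - 1) / d_recursion(tree))   # ... equals pi_k from figure 3

The construction of the partition list also illustrates the CiCj + 1 counting argument used in Proposition 1 below.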

Lemma 1 The marginal likelihood of a DPM is:

p(D_k) = \sum_{v \in V} \frac{\alpha^{m_v} \prod_{\ell=1}^{m_v} \Gamma(n_\ell^v)}{\Gamma(n_k + \alpha)/\Gamma(\alpha)} \prod_{\ell=1}^{m_v} p(D_\ell^v)

where V is the set of all possible partitionings of Dk, mv is the number of clusters in partitioning v, and nℓ^v is the number of points in cluster ℓ of partitioning v.

This is easily shown since:

p(D_k) = \sum_{v} p(v) \, p(D^v)

where:

p(v) = \frac{\alpha^{m_v} \prod_{\ell=1}^{m_v} \Gamma(n_\ell^v)}{\Gamma(n_k + \alpha)/\Gamma(\alpha)}

and:

p(D^v) = \prod_{\ell=1}^{m_v} p(D_\ell^v)

Here p(Dk) is a sum over all partitionings v, where the first (fractional) term in the sum is the prior on partitioning v and the second (product) term is the likelihood of partitioning v under the data.

Theorem 1 The quantity (2) computed by the Bayesian Hierarchical Clustering algorithm is:

p(D_k | T_k) = \sum_{v \in V_T} \frac{\alpha^{m_v} \prod_{\ell=1}^{m_v} \Gamma(n_\ell^v)}{d_k} \prod_{\ell=1}^{m_v} p(D_\ell^v)

where VT is the set of all tree-consistent partitionings of Dk.

Proof Rewriting equation (2), using the recursion of figure 3 to substitute in for πk, we obtain:

p(D_k | T_k) = \frac{\alpha \Gamma(n_k)}{d_k} p(D_k | H_1^k) + \frac{d_i d_j}{d_k} p(D_i | T_i) \, p(D_j | T_j)

We will proceed to give a proof by induction. In the base case, at a leaf node, the second term in this equation drops out since there are no subtrees. Γ(nk = 1) = 1 and dk = α, yielding p(Dk|Tk) = p(Dk|H1^k), as we should expect at a leaf node.

For the inductive step, we note that the first term is always just the trivial partition with all nk points in a single cluster. According to our inductive hypothesis:

p(D_i | T_i) = \sum_{v' \in V_{T_i}} \frac{\alpha^{m_{v'}} \prod_{\ell'=1}^{m_{v'}} \Gamma(n_{\ell'}^{v'})}{d_i} \prod_{\ell'=1}^{m_{v'}} p(D_{\ell'}^{v'})

and similarly for p(Dj|Tj), where VTi (VTj) is the set of all tree-consistent partitionings of Di (Dj). Combining terms we obtain:

\frac{d_i d_j}{d_k} p(D_i | T_i) \, p(D_j | T_j)
= \frac{1}{d_k} \Bigg[ \sum_{v' \in V_{T_i}} \alpha^{m_{v'}} \prod_{\ell'=1}^{m_{v'}} \Gamma(n_{\ell'}^{v'}) \prod_{\ell'=1}^{m_{v'}} p(D_{\ell'}^{v'}) \Bigg] \times \Bigg[ \sum_{v'' \in V_{T_j}} \alpha^{m_{v''}} \prod_{\ell''=1}^{m_{v''}} \Gamma(n_{\ell''}^{v''}) \prod_{\ell''=1}^{m_{v''}} p(D_{\ell''}^{v''}) \Bigg]
= \sum_{v \in V_{NTT}} \frac{\alpha^{m_v} \prod_{\ell=1}^{m_v} \Gamma(n_\ell^v)}{d_k} \prod_{\ell=1}^{m_v} p(D_\ell^v)

where VNTT is the set of all non-trivial tree-consistent partitionings of Dk. For the trivial partition, mv = 1 and n1^v = nk. By combining the trivial and non-trivial terms we get a sum over all tree-consistent partitions, yielding the result in Theorem 1. This completes the proof. Another way to get this result is to expand out p(D|T) and substitute for π using the recursion of figure 3.

Corollary 1 For any binary tree Tk with the data points in Dk at its leaves, the following is a lower bound on the marginal likelihood of a DPM:

\frac{d_k \Gamma(\alpha)}{\Gamma(n_k + \alpha)} \, p(D_k | T_k) \le p(D_k)

Proof Corollary 1 follows trivially by multiplying p(Dk|Tk) by the ratio of its denominator and the denominator of p(Dk) from lemma 1 (i.e. dkΓ(α)/Γ(nk + α)), and from the fact that tree-consistent partitions are a subset of all partitions of the data.

Proposition 1 The number of tree-consistent partitions is exponential in the number of data points for balanced binary trees.

Proof If Ti has Ci tree-consistent partitions of Di and Tj has Cj tree-consistent partitions of Dj, then Tk = (Ti, Tj) merging the two has CiCj + 1 tree-consistent partitions of Dk = Di ∪ Dj, obtained by combining all pairs of partitions and adding the partition where all data in Dk are in one cluster. At the leaves Ci = 1. Therefore, for a balanced binary tree of depth ℓ the number of tree-consistent partitions grows as O(2^{2^ℓ}), whereas the number of data points n grows as O(2^ℓ).
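As a worked instance of the theorem, take the tree of figure 1(b) and assume, consistently with its caption, that its structure is (((1 2) 3) 4). Writing p(D_S) for the marginal likelihood (1) of the points in S treated as a single cluster, VT contains exactly four partitions and Theorem 1 reads:

\[
p(D_k \mid T_k) = \frac{1}{d_k}\Big[
\alpha\,\Gamma(4)\, p(D_{\{1234\}})
+ \alpha^{2}\,\Gamma(3)\,\Gamma(1)\, p(D_{\{123\}})\, p(D_{\{4\}})
+ \alpha^{3}\,\Gamma(2)\,\Gamma(1)^{2}\, p(D_{\{12\}})\, p(D_{\{3\}})\, p(D_{\{4\}})
+ \alpha^{4}\, p(D_{\{1\}})\, p(D_{\{2\}})\, p(D_{\{3\}})\, p(D_{\{4\}})
\Big],
\]
\[
d_k = \alpha\Gamma(4) + \alpha^{2}\Gamma(3) + \alpha^{3}\Gamma(2) + \alpha^{4}
    = 6\alpha + 2\alpha^{2} + \alpha^{3} + \alpha^{4}.
\]

The Γ-prefactors in the bracket sum to dk, so the four weights sum to one, while the partition (1)(2 3)(4), which is not tree-consistent, receives no weight; it does appear in the sum of Lemma 1, which ranges over all fifteen partitions of four points.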

In summary, p(D|T) sums the probabilities for all tree-consistent partitions, weighted by the prior mass assigned to each partition by the DPM. The computational complexity of constructing the tree is O(n^2), the complexity of computing the marginal likelihood is O(n log n), and the complexity of computing the predictive distribution (see section 4) is O(n).

4. Learning and Prediction

Learning Hyperparameters. For any given setting of the hyperparameters, the root node of the tree approximates the probability of the data given those particular hyperparameters. In our model the hyperparameters are the concentration parameter α from the DPM, and the hyperparameters β of the probabilistic model defining each component of the mixture. We can use the root node marginal likelihood p(D|T) to do model comparison between different settings of the hyperparameters. For a fixed tree we can optimize over the hyperparameters by taking gradients. In the case of all data belonging to a single cluster:

p(D | H_1) = \int p(D|\theta) \, p(\theta|\beta) \, d\theta

we can compute ∂p(D|H1)/∂β. Using this we can compute gradients for the component model hyperparameters bottom-up as the tree is being built:

\frac{\partial p(D_k|T_k)}{\partial \beta} = \pi_k \frac{\partial p(D_k|H_1^k)}{\partial \beta} + (1 - \pi_k) \frac{\partial p(D_i|T_i)}{\partial \beta} p(D_j|T_j) + (1 - \pi_k) p(D_i|T_i) \frac{\partial p(D_j|T_j)}{\partial \beta}

Similarly, we can compute

\frac{\partial p(D_k|T_k)}{\partial \alpha} = \frac{\partial \pi_k}{\partial \alpha} p(D_k|H_1^k) - \frac{\partial \pi_k}{\partial \alpha} p(D_i|T_i) p(D_j|T_j) + (1 - \pi_k) \frac{\partial p(D_i|T_i)}{\partial \alpha} p(D_j|T_j) + (1 - \pi_k) p(D_i|T_i) \frac{\partial p(D_j|T_j)}{\partial \alpha}

where:

\frac{\partial \pi_k}{\partial \alpha} = \frac{\pi_k}{\alpha} - \frac{\pi_k}{d_k} \frac{\partial d_k}{\partial \alpha}

and:

\frac{\partial d_k}{\partial \alpha} = \Gamma(n_k) + \frac{\partial d_{left_k}}{\partial \alpha} d_{right_k} + d_{left_k} \frac{\partial d_{right_k}}{\partial \alpha}

These gradients can be computed bottom-up by additionally propagating ∂d/∂α. This allows us to construct an EM-like algorithm where we find the best tree structure in the (Viterbi-like) E step and then optimize over the hyperparameters in the M step. In our experiments we have only optimized one of the hyperparameters with a simple line search for Gaussian components. A simple empirical approach is to set the hyperparameters β by fitting a single model to the whole data set.

Figure 4. Log marginal likelihood (evidence) vs. purity over 50 iterations of hyperparameter optimization.

Figure 5. Top level structure of BHC (left) vs. Average Linkage HC (right) for the newsgroup dataset. The 3 words shown at each node have the highest mutual information between the cluster of documents at that node versus its sibling, and occur with higher frequency in that cluster. The number of documents at each cluster is also given.

Predictive Distribution. For any tree, the probability of a new test point given the data can be computed by recursing through the tree starting at the root node. Each node k represents a cluster, with an associated predictive distribution p(x|Dk) = ∫ p(x|θ) p(θ|Dk, β) dθ. The overall predictive distribution sums over all nodes weighted by their posterior probabilities:

p(x|D) = \sum_{k \in N} \omega_k \, p(x|D_k)    (9)

where N is the set of all nodes in the tree, ωk = rk ∏_{i∈Nk} (1 − ri) is the weight on cluster k, and Nk is the set of nodes on the path from the root node to the parent of node k. This expression can be derived by rearranging the sum over all tree-consistent partitionings into a sum over all clusters in the tree (noting that a cluster can appear in many partitionings). For Gaussian components with conjugate priors, this results in a predictive distribution which is a mixture of multivariate t distributions. We show some examples of this in the Results section.
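The weights ωk can be read off a fitted tree in a single pass from the root. The sketch below assumes the Node objects produced by the bhc sketch in section 2 (children and log_r attributes; leaves have no log_r, which corresponds to rk = 1 there), and a caller-supplied log_pred(x, data) returning log p(x|Dk); both conventions are ours, not the paper's.

import numpy as np

def predictive_weights(root):
    """(node, log omega_k) pairs for equation (9); omega_k = r_k * prod_{i in N_k} (1 - r_i)."""
    weights = []
    def recurse(node, log_unclaimed):
        # log_unclaimed = sum over ancestors i of log(1 - r_i)
        log_r = getattr(node, "log_r", 0.0)          # leaves: r_k = 1
        weights.append((node, log_unclaimed + log_r))
        if node.children is not None:
            log_one_minus_r = np.log1p(-np.exp(log_r)) if log_r < 0 else -np.inf
            for child in node.children:
                recurse(child, log_unclaimed + log_one_minus_r)
    recurse(root, 0.0)
    return weights   # the omegas sum to 1 because r_k = 1 at the leaves

def predictive_density(x, root, log_pred):
    """p(x|D) = sum_k omega_k p(x|D_k), with p(x|D_k) supplied by log_pred(x, data)."""
    logs = [lw + log_pred(x, node.data) for node, lw in predictive_weights(root)]
    m = max(logs)
    return float(np.exp(m) * sum(np.exp(l - m) for l in logs))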

5. Results

We compared Bayesian hierarchical clustering to traditional hierarchical clustering using average, single, and complete linkage over 5 datasets (4 real and 1 synthetic). We also compared our algorithm to average linkage hierarchical clustering on several toy 2D problems (figure 8). On these problems we were able to compare the different hierarchies generated by the two algorithms, and to visualize clusterings and predictive distributions.

The 4 real datasets we used are the spambase (100 random examples from each class, 2 classes, 57 attributes) and glass (214 examples, 7 classes, 9 attributes) datasets from the UCI repository, the CEDAR Buffalo digits (20 random examples from each class, 10 classes, 64 attributes), and the CMU 20Newsgroups dataset (120 examples, 4 classes: rec.sport.baseball, rec.sport.hockey, rec.autos, and sci.space, 500 attributes). We also used synthetic data generated from a mixture of Gaussians (200 examples, 4 classes, 2 attributes). The synthetic, glass and toy datasets were modeled using Gaussians, while the digits, spambase, and newsgroup datasets were binarized and modeled using Bernoullis. We binarized the digits dataset by thresholding at a greyscale value of 128 out of 0 through 255, and the spambase dataset by whether each attribute value was zero or non-zero. We ran the algorithms on 3 digits (0, 2, 4), and on all 10 digits. The newsgroup dataset was constructed using Rainbow (McCallum, 1996), where a stop list was used and words appearing fewer than 5 times were ignored. The dataset was then binarized based on word presence/absence in a document.

For these classification datasets, where labels for the data points are known, we computed a measure between 0 and 1 of how well a dendrogram clusters the known labels, called the dendrogram purity.² We found that the marginal likelihood of the tree structure for the data was highly correlated with the purity. Over 50 iterations with different hyperparameters this correlation was 0.888 (see figure 4). Table 1 shows the results on these datasets.

² Let T be a tree with leaves 1,...,n and c1,...,cn be the known discrete class labels for the data points at the leaves. Pick a leaf ℓ uniformly at random; pick another leaf j uniformly in the same class, i.e. cℓ = cj. Find the smallest subtree containing ℓ and j. Measure the fraction of leaves in that subtree which are in the same class (cℓ). The expected value of this fraction is the dendrogram purity, and can be computed exactly in a bottom-up recursion on the dendrogram. The purity is 1 iff all leaves in each class are contained in some pure subtree.

On all datasets except Glass, BHC found the highest purity trees. For Glass, the Gaussian assumption may have been poor. This highlights the importance of model choice. Similarly, for classical distance-based hierarchical clustering methods, a poor choice of distance metric may result in poor clusterings.

Data Set      Single Linkage    Complete Linkage    Average Linkage    BHC
Synthetic     0.599 ± 0.033     0.634 ± 0.024       0.668 ± 0.040      0.828 ± 0.025
Newsgroups    0.275 ± 0.001     0.315 ± 0.008       0.282 ± 0.002      0.465 ± 0.016
Spambase      0.598 ± 0.017     0.699 ± 0.017       0.668 ± 0.019      0.728 ± 0.029
3Digits       0.545 ± 0.015     0.654 ± 0.013       0.742 ± 0.018      0.807 ± 0.022
10Digits      0.224 ± 0.004     0.299 ± 0.006       0.342 ± 0.005      0.393 ± 0.015
Glass         0.478 ± 0.009     0.476 ± 0.009       0.491 ± 0.009      0.467 ± 0.011

Table 1. Purity scores for 3 kinds of traditional agglomerative clustering, and Bayesian hierarchical clustering.
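The expectation in the purity definition can be computed exactly bottom-up, as footnote 2 notes; the sketch below instead estimates it by direct Monte Carlo sampling, simply because that is the shortest faithful transcription of the definition. It assumes a tree whose leaves carry an integer leaf_id indexing into the label list and whose internal nodes have a children attribute, and that every class has at least two leaves; these conventions are ours, not the paper's.

import random

def dendrogram_purity_mc(root, labels, n_samples=20000, seed=0):
    """Monte Carlo estimate of dendrogram purity (footnote 2)."""
    rng = random.Random(seed)

    def leaf_ids(node):
        if node.children is None:
            return [node.leaf_id]
        return [l for c in node.children for l in leaf_ids(c)]

    # For every pair of leaves, record the leaves of their smallest common subtree.
    smallest = {}
    def collect(node):
        ids = leaf_ids(node)
        if node.children is not None:
            left, right = (set(leaf_ids(c)) for c in node.children)
            for a in left:
                for b in right:
                    smallest[(a, b)] = smallest[(b, a)] = ids
            for c in node.children:
                collect(c)
    collect(root)

    by_class = {}
    for i, c in enumerate(labels):
        by_class.setdefault(c, []).append(i)

    total = 0.0
    for _ in range(n_samples):
        l = rng.randrange(len(labels))
        j = rng.choice(by_class[labels[l]])
        while j == l:                            # "another leaf j" in the same class
            j = rng.choice(by_class[labels[l]])
        subtree = smallest[(l, j)]
        total += sum(labels[s] == labels[l] for s in subtree) / len(subtree)
    return total / n_samples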

Another advantage of the BHC algorithm, not fully addressed by purity scores alone, is that it tends to create hierarchies with good structure, particularly at high levels. Figure 5 compares the top three levels (last three merges) of the newsgroups hierarchy (using 800 examples and the 50 words with highest information gain) from BHC and ALHC. Continuing to look at lower levels does not improve ALHC, as can be seen from the full dendrograms for this dataset, figure 7. Although sports and autos are both under the newsgroup heading rec, our algorithm finds more similarity between autos and the space document class. This is quite reasonable since these two classes have many more words in common (such as engine, speed, cost, develop, gas, etc.) than autos and sports do. Figure 6 also shows dendrograms of the 3digits dataset.

6. Related Work

The work in this paper is related to and inspired by several previous probabilistic approaches to clustering³, which we briefly review here. Stolcke and Omohundro (1993) described an algorithm for agglomerative model merging based on marginal likelihoods in the context of hidden Markov model structure induction. Williams (2000) and Neal (2003) describe Gaussian and diffusion-based hierarchical generative models, respectively, for which inference can be done using MCMC methods. Similarly, Kemp et al. (2004) present a hierarchical generative model for data based on a mutation process. Ward (2001) uses a hierarchical fusion of contexts based on marginal likelihoods in a Dirichlet language model.

³ There has also been a considerable amount of decision tree based work on Bayesian tree structures for classification and regression (Chipman et al., 1998; Denison et al., 2002), but this is not closely related to the work presented here.

Banfield and Raftery (1993) present an approximate method based on the likelihood ratio test statistic to compute the marginal likelihood for c and c − 1 clusters and use this in an agglomerative algorithm. Vaithyanathan and Dom (2000) perform hierarchical clustering of multinomial data consisting of a vector of features. The clusters are specified in terms of which subset of features have common distributions. Their clusters can have different parameters for some features and the same parameters to model other features, and their method is based on finding merges that maximize marginal likelihood under a Dirichlet-Multinomial model.

Iwayama and Tokunaga (1995) define a Bayesian hierarchical clustering algorithm which attempts to agglomeratively find the maximum posterior probability clustering, but it makes strong independence assumptions and does not use the marginal likelihood.

Segal et al. (2002) present probabilistic abstraction hierarchies (PAH) which learn a hierarchical model in which each node contains a probabilistic model and the hierarchy favors placing similar models at neighboring nodes in the tree (as measured by a distance function between probabilistic models). The training data is assigned to leaves of this tree. Ramoni et al. (2002) present an agglomerative algorithm for merging time series based on greedily maximizing marginal likelihood. Friedman (2003) has also recently proposed a greedy agglomerative algorithm based on marginal likelihood which simultaneously clusters rows and columns of gene expression data.

The algorithm in our paper differs from the above algorithms in several ways. First, unlike (Williams, 2000; Neal, 2003; Kemp et al., 2004) it is not in fact a hierarchical generative model of the data, but rather a hierarchical way of organizing nested clusters. Second, our algorithm is derived from Dirichlet process mixtures. Third, the hypothesis test at the core of our algorithm tests between a single merged hypothesis and the alternative of exponentially many other clusterings of the same data (not one vs. two clusters at each stage). Lastly, our algorithm does not use any iterative method, like EM, or require sampling, like MCMC, and is therefore significantly faster than most of the above algorithms.
Discussion tures and the same parameters to model other fea- We have presented a novel algorithm for Bayesian hier- tures and their method is based on finding merges archical clustering based on Dirichlet process mixtures. This algorithm has several advantages over traditional 3There has also been a considerable amount of decision tree based work on Bayesian tree structures for classification and approaches, which we have highlighted throughout the regression (Chipman et al., 1998; Denison et al., 2002), but this paper. We have presented prediction and hyperparam- is not closely related to work presented here. eter optimization procedures and shown that the al- gorithm provides competitive clusterings of real-world From this we can compute the conditional probability data as measured by purity with respect to known la- of a single indicator: bels. The algorithm can also be seen as an extremely n + α/C fast alternative to MCMC inference in DPMs. p(c = j|c , α) = −i,j i −i n − 1 + α The limitations of our algorithm include its inher- ent greediness, a computational complexity which is where n−i,j is the number of data points, excluding quadratic in the number of data points, and the lack point i, assigned to class j. Finally, taking the infinite of any incorporation of tree uncertainty. limit (C → ∞), we obtain:

n−i,j In future work, we plan to try BHC on more com- p(ci = j|c−i, α) = , plex component models for other realistic data—this n − 1 + α ′ α is likely to require approximations of the component p(ci 6= ci′ , ∀i 6= i|c−i, α) = marginal likelihoods (1). We also plan to extend BHC n − 1 + α to systematically incorporate hyperparameter opti- The second term is the prior probability that point i mization and improve the running time to O(n log n) belongs to some new class that no other points belong by exploiting a randomized version of the algorithm. to. We need to compare this novel, fast inference algo- rithm for DPMs to other inference algorithms such B. Marginal Likelihoods for Single as MCMC (Rasmussen, 2000), EP (Minka & Ghahra- mani, 2003) and Variational Bayes (Blei & Jordan, Components 2004). We also hope to explore the idea of computing Here we give equations for the marginal likelihoods several alternative tree structures in order to create of single components using simple distributions and a manipulable tradeoff between computation time and their conjugate priors. For each component model we tightness of our lower bound. There are many exciting compute: avenues for further work in this area.
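These conditional probabilities are simple to compute; the sketch below (our own helper, not from the paper) returns them for a single point given the cluster labels of the remaining points.

def crp_predictive(c_minus_i, alpha):
    """Conditional prior over the cluster of point i given the others (appendix A).

    Returns a dict mapping each existing cluster label j to n_{-i,j} / (n - 1 + alpha),
    plus the key "new" for alpha / (n - 1 + alpha).  c_minus_i is the list of cluster
    labels of all points except i.
    """
    n = len(c_minus_i) + 1
    probs = {j: c_minus_i.count(j) / (n - 1 + alpha) for j in set(c_minus_i)}
    probs["new"] = alpha / (n - 1 + alpha)
    return probs

# e.g. crp_predictive([0, 0, 1], alpha=1.0) gives {0: 0.5, 1: 0.25, "new": 0.25}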

B. Marginal Likelihoods for Single Components

Here we give equations for the marginal likelihoods of single components using simple distributions and their conjugate priors. For each component model we compute:

p(D | H_1) = \int p(D|\theta) \, p(\theta|\beta) \, d\theta

Bernoulli-Beta

p(D | H_1) = \prod_{d=1}^{k} \frac{\Gamma(\alpha_d + \beta_d)\,\Gamma(\alpha_d + m_d)\,\Gamma(\beta_d + N - m_d)}{\Gamma(\alpha_d)\,\Gamma(\beta_d)\,\Gamma(\alpha_d + \beta_d + N)}

where k is the total number of dimensions, N is the number of data points in the dataset, m_d = \sum_{i=1}^{N} x_d^{(i)}, X is the dataset, and α and β are hyperparameters for the Beta distribution.

Multinomial-Dirichlet

p(D | H_1) = \prod_{i=1}^{N} \binom{M_i}{x_1^{(i)} \ldots x_k^{(i)}} \times \frac{\Gamma\big(\sum_{d=1}^{k} \alpha_d\big) \prod_{d=1}^{k} \Gamma(\alpha_d + m_d)}{\Gamma\big(M + \sum_{d=1}^{k} \alpha_d\big) \prod_{d=1}^{k} \Gamma(\alpha_d)}

where k is the total number of dimensions, N is the number of data points in the dataset, M = \sum_{i=1}^{N} M_i, M_i = \sum_{j=1}^{k} x_j^{(i)}, m_d = \sum_{i=1}^{N} x_d^{(i)}, X is the dataset, and α is a hyperparameter for the Dirichlet prior.
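In code, the Bernoulli-Beta marginal likelihood is a few lines once the sums md are available. The sketch below is a direct transcription of the expression above into log space; the function name and the use of scipy are our choices.

import numpy as np
from scipy.special import gammaln

def bernoulli_beta_log_ml(X, a, b):
    """log p(D|H1) for binary data under the Bernoulli-Beta model above.

    X is an (N, k) binary array; a and b are length-k arrays (or scalars) holding
    the Beta hyperparameters alpha_d and beta_d.
    """
    X = np.atleast_2d(X)
    N = X.shape[0]
    m = X.sum(axis=0)                      # m_d = sum_i x_d^(i)
    return float(np.sum(gammaln(a + b) + gammaln(a + m) + gammaln(b + N - m)
                        - gammaln(a) - gammaln(b) - gammaln(a + b + N)))

A function like this is exactly the kind of log_ml that the BHC sketch in section 2 expects.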

Normal-Inverse Wishart-Normal

p(D | H_1) = (2\pi)^{-\frac{Nk}{2}} \left( \frac{r}{N + r} \right)^{\frac{k}{2}} |S|^{\frac{v}{2}} \, |S'|^{-\frac{v'}{2}} \times \frac{2^{\frac{v'k}{2}} \prod_{d=1}^{k} \Gamma\!\left(\frac{v' + 1 - d}{2}\right)}{2^{\frac{vk}{2}} \prod_{d=1}^{k} \Gamma\!\left(\frac{v + 1 - d}{2}\right)}

where:

S' = S + XX^T + \frac{rN}{N + r}\, \mathbf{m}\mathbf{m}^T - \frac{1}{N + r} \Big( \sum_{i=1}^{N} x^{(i)} \Big) \Big( \sum_{i=1}^{N} x^{(i)} \Big)^{\!T} - \frac{r}{N + r} \left( \mathbf{m} \Big( \sum_{i=1}^{N} x^{(i)} \Big)^{\!T} + \Big( \sum_{i=1}^{N} x^{(i)} \Big) \mathbf{m}^T \right)

and:

v' = v + N

Here X is the dataset, k is the number of dimensions, and N is the number of datapoints in X. The hyperparameters for the Inverse Wishart-Normal prior include: m, the prior on the mean; S, the prior on the precision matrix; r, a scaling factor on the prior precision of the mean; and v, the degrees of freedom.
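The same formula can be transcribed directly, working in log space for numerical stability. The sketch below is our transcription of the three equations above, with the data stored as an (N, k) matrix so that the paper's XX^T becomes X.T @ X; it assumes v > k − 1 so that all Gamma arguments are positive.

import numpy as np
from scipy.special import gammaln

def niw_log_ml(X, m, S, r, v):
    """log p(D|H1) for the Normal-Inverse Wishart-Normal model above."""
    X = np.atleast_2d(X)
    N, k = X.shape
    xsum = X.sum(axis=0)                                   # sum_i x^(i)
    S_prime = (S + X.T @ X
               + (r * N / (N + r)) * np.outer(m, m)
               - (1.0 / (N + r)) * np.outer(xsum, xsum)
               - (r / (N + r)) * (np.outer(m, xsum) + np.outer(xsum, m)))
    v_prime = v + N
    d = np.arange(1, k + 1)
    log_gamma_ratio = (gammaln((v_prime + 1 - d) / 2.0).sum()
                       - gammaln((v + 1 - d) / 2.0).sum()
                       + ((v_prime - v) * k / 2.0) * np.log(2.0))
    return float(-(N * k / 2.0) * np.log(2.0 * np.pi)
                 + (k / 2.0) * np.log(r / (N + r))
                 + (v / 2.0) * np.linalg.slogdet(S)[1]
                 - (v_prime / 2.0) * np.linalg.slogdet(S_prime)[1]
                 + log_gamma_ratio)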

References

Banfield, J. D., & Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803–821.

Blackwell, D., & MacQueen, J. B. (1973). Ferguson distributions via Polya urn schemes. Ann. Stats.

Blei, D., & Jordan, M. (2004). Variational methods for Dirichlet process mixture models (Technical Report 674). UC Berkeley.

Chipman, H., George, E., & McCulloch, R. (1998). Bayesian CART model search (with discussion). Journal of the American Statistical Association, 93, 935–960.

Denison, D., Holmes, C., Mallick, B., & Smith, A. (2002). Bayesian methods for nonlinear classification and regression. Wiley.

Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. Wiley.

Friedman, N. (2003). PCluster: Probabilistic agglomerative clustering of gene expression profiles (Technical Report 2003-80). Hebrew University.

Iwayama, M., & Tokunaga, T. (1995). Hierarchical Bayesian clustering for automatic text classification. Proceedings of IJCAI-95, 14th International Joint Conference on Artificial Intelligence (pp. 1322–1327). Montreal, CA: Morgan Kaufmann Publishers, San Francisco, US.

Kemp, C., Griffiths, T. L., Stromsten, S., & Tenenbaum, J. B. (2004). Semi-supervised learning with trees. NIPS 16.

McCallum, A. K. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow.

Minka, T., & Ghahramani, Z. (2003). Expectation propagation for infinite mixtures. NIPS Workshop on Nonparametric Bayesian Methods and Infinite Models.

Neal, R. M. (2003). Density modeling and clustering using Dirichlet diffusion trees. Bayesian Statistics 7 (pp. 619–629).

Ramoni, M. F., Sebastiani, P., & Kohane, I. S. (2002). Cluster analysis of gene expression dynamics. Proc Nat Acad Sci, 99, 9121–9126.

Rasmussen, C. E. (2000). The infinite Gaussian mixture model. NIPS 12 (pp. 554–560).

Segal, E., Koller, D., & Ormoneit, D. (2002). Probabilistic abstraction hierarchies. NIPS 14.

Stolcke, A., & Omohundro, S. (1993). Hidden Markov Model induction by Bayesian model merging. NIPS 5 (pp. 11–18).

Vaithyanathan, S., & Dom, B. (2000). Model-based hierarchical clustering. UAI.

Ward, D. J. (2001). Adaptive computer interfaces. Doctoral dissertation, University of Cambridge.

Williams, C. (2000). A MCMC approach to hierarchical mixture modelling. NIPS 12.


Figure 6. (a) The tree given by performing average linkage hierarchical clustering on 120 examples of 3 digits (0, 2, and 4), yielding a purity score of 0.870. Numbers on the x-axis correspond to the true class label of each example, while the y-axis corresponds to distance between merged clusters. (b) The tree given by performing Bayesian hierarchical clustering on the same dataset, yielding a purity score of 0.924. Here higher values on the y-axis also correspond to higher levels of the hierarchy (later merges), though the exact numbers are irrelevant. Red dashed lines correspond to merges with negative posterior log probabilities and are labeled with these values.


Figure 7. Full dendrograms for average linkage hierarchical clustering and BHC on 800 newsgroup documents. Each of the leaves in the dendrogram is labeled with a color, according to which newsgroup that document came from. Red is rec.autos, blue is rec.sport.baseball, green is rec.sport.hockey, and magenta is sci.space.


Figure 8. Three toy examples (one per column). The first row shows the original data sets. The second row gives the dendrograms resulting from average linkage hierarchical clustering; each number on the x-axis corresponds to a data point as displayed in the plots in the fourth row. The third row gives the dendrograms resulting from Bayesian Hierarchical Clustering, where the red dashed lines are merges our algorithm prefers not to make given our priors. The numbers on the branches are the log odds for merging (log rk/(1 − rk)). The fourth row shows the clusterings found by our algorithm using Gaussian components, when the tree is cut at red dashed branches (rk < 0.5). The last row shows the predictive densities resulting from our algorithm. The BHC algorithm behaves structurally more sensibly than the traditional algorithm. Note that point 18 in the middle column is clustered into the green class even though it is substantially closer to points in the red class. In the first column the traditional algorithm prefers to merge cluster 1-6 with cluster 7-10 rather than, as BHC does, cluster 21-25 with cluster 26-27. Finally, in the last column, the BHC algorithm clusters all points along the upper and lower horizontal parallels together, whereas the traditional algorithm merges some subsets of points vertically (for example cluster 7-12 with cluster 16-27).