Arxiv:Q-Bio.QM/0511042 V2 25 Nov 2005 Nomto Ae Lseig Upeetr Material Supplementary Clustering: Based Information .THE I
Total Page:16
File Type:pdf, Size:1020Kb
Information based clustering: Supplementary material Noam Slonim, Gurinder Singh Atwal, Gaˇsper Tkaˇcik, and William Bialek Joseph Henry Laboratories of Physics, and Lewis–Sigler Institute for Integrative Genomics Princeton University, Princeton, New Jersey 08544 USA (Dated: December 4, 2005) This technical report provides the supplementary material for a paper entitled “Information based clustering,” to appear shortly in Proceedings of the National Academy of Sciences (USA). In Section I we present in detail the iterative clustering algorithm used in our experiments and in Section II we describe the validation scheme used to determine the statistical significance of our results. Then in subsequent sections we provide all the experimental results for three very different applications: the response of gene expression in yeast to different forms of environmental stress, the dynamics of stock prices in the Standard and Poor’s 500, and viewer ratings of popular movies. In particular, we highlight some of the results that seem to deserve special attention. All the experimental results and relevant code, including a freely available web application, can be found at http://www.genomics.princeton.edu/biophysics-theory . Contents many choices at different levels of the analysis. In recent work we suggest that some generality can be achieved I. The Iclust algorithm 1 through the use of information theory (1). Here we re- view this formulation briefly and then proceed to the II. Evaluating clusters’ coherence 3 technical details of its implementation that were left out III. First application: The yeast ESR data 3 of Ref (1). A. Description of the data 3 We formulate clustering as a tradeoff between maxi- B. Quality–complexity trade–off curves 4 mizing the mean similarity of elements within a cluster C. Comparing solutions at different numbers of clusters 4 D. Coherence results 5 and minimizing the complexity of the description pro- 1. Constructing the annotation matrices 5 vided by cluster membership. Thus if we have some sim- 2. Coherence results and comparisons 5 ilarity measure s(i, j) between elements i and j, optimal E. Detailed results for Nc = 20 clusters 6 clustering is a probabilistic assignment to clusters C ac- F. A cluster enriched with uncharacterized genes 6 cording to P (C|i) such that we maximize IV. Second application: The SP500 data 7 A. Description of the data 7 F = hsi− T I(C; i), (1) B. Quality–complexity trade–off curves 7 C. Comparing solutions at different numbers of clusters 7 where hsi is the mean similarity of elements chosen at D. Coherence results 8 random out of each cluster, 1. Constructing the annotation matrices 8 2. Coherence results and comparisons 8 hsi = P (C) P (i|C)P (j|C)s(i, j), (2) E. Detailed results for Nc = 20 clusters 8 C i,j X X V. Third application: The EachMovie data 8 A. Description of the data 8 and I(C; i) is the information that clusters provide about B. Quality–complexity trade–off curves 9 the identity of their elements, arXiv:q-bio.QM/0511042 v2 25 Nov 2005 C. Comparing solutions at different numbers of clusters 9 D. Coherence results 9 P (C|i) 1. Constructing the annotation matrices 9 I(C;i) = P (C|i)P (i) log ; (3) P (C) 2. Coherence results and comparisons 9 C i E. Detailed results for Nc = 20 clusters 10 X X as usual we have Acknowledgments 10 1 P (i|C) = P (C|i)P (i) · (4) References 10 P (C) VI. Figures and tables 11 P (C) = P (C|i)P (i), (5) Xi I. THE ICLUST ALGORITHM and since in many cases all examples i occur with equal probability [P (i) = 1/N] we consider this case for sim- Although clustering is a widely used method of data plicity, although it is not essential. This formulation can analysis and exploration, there is at present no unique be generalized to handle similarity measures defined on or universal mathematical formulation of the clustering groups of more than two elements (1), but here we con- problem. In practice, clustering a given data set involves centrate on the conventional case where only pairwise 2 interactions are considered. Importantly, as opposed to in Eq (9) will be positive for i and C, but might be neg- using a problem specific similarity measure s, we use ative for i and C′. Consequently, while applying the up- the generality of information theory once more, and take date step the weight of assignment of i to C [P (C|i)] will ′ s(i, j) to be the pairwise mutual information Iij between be boosted while its assignment to C will be decreased. the observed patterns that correspond to data items i This is clearly a desirable outcome, which in particular and j (1).1 should increase F. Thus, since F is upper bounded (as a It is shown in Ref (1) that any stationary point of our sum of information terms), after a finite number of such target functional, F, must obey updates the algorithm is expected to converge to a fixed point which corresponds to a (possibly local) maximum P (C) 1 of F. P (C|i) = exp [2s(C; i) − s(C)] , (6) Z(i; T ) T This example also illustrates one of the differences be- tween our algorithm and previous approaches. While in where Z(i; T ) is a normalization function, s(C; i) is the the Blahut–Arimoto algorithm in rate–distortion theory expected similarity between i and a member of cluster C, (3), in the iterative Information Bottleneck algorithm (5), and typically also in EM for maximum likelihood (4), the N sign of the exponent is constant (for a given i), this is not s(C;i) = P (i1|C)s(i1, i), (7) true in our case. In principle, such a non–constant ex- i1=1 ponent sign should imply faster convergence to a local X stationary point, but might also imply higher sensitivity and s(C) is the average similarity among pairs chosen to the random initialization of P (C|i). Thus, as in other independently out of the cluster C, work, we typically perform several runs with different random initializations of P (C|i) from which we choose N N the best solution, i.e., the one that maximizes F. s(C)= P (i1|C)P (i2|C)s(i1, i2). (8) The Iclust algorithm presented here uses a sequential, i =1 i =1 X1 X2 or incremental iterative procedure in which the updates for some i incorporate the implications of the updates for Eq. (6) defines an implicit set of equations since the right ′ preceding elements, i 6= i. As a simple example, consider hand side depends on P (i|C) and P (C). This is a com- the case where we have three elements (N = 3) and two mon situation in variational methods, also present, for clusters (N = 2). We start from some random condi- example, in conventional rate–distortion clustering (3), in c tional distribution matrix, P (0)(C|i), which in particular maximum likelihood estimation with hidden variables (4) defines s(0)(C), ∀ C = 1 : 2. At the first iteration we find and in the Information Bottleneck framework (5). The a new distribution for the first element (i = 1) over the standard strategy is to turn the self–consistency condi- two clusters. Thus, we now have a new conditional distri- tion into an iterative algorithm. Specifically, let us de- bution matrix, P (1)(C|i) which differs from the previous note the intermediate solution of the algorithm at the P (0)(C|i) only by its first row. This distribution is used mth iteration by P (m)(C|i). Then, at the m +1st itera- to define s(1)(C), ∀ C = 1 : 2. Now, in the next iteration, tion, the algorithm applies the following update rule: we find a new distribution for the second element (i = 2) over the two clusters. This yields another new condi- 1 P (m+1)(C|i) ← P (m)(C)exp [2s(m)(C; i) − s(m)(C)] tional distribution matrix, P (2)(C|i) which differs from T (1) the previous P (C|i) only by its middle row, and so on. (9) This process is somewhat in the spirit of the incremental followed by a normalization step. Notice that the terms EM (6) and the sequential Information Bottleneck algo- {P (m)(C), s(m)(C; i), s(m)(C)} all are calculated using (m) rithm (7). An alternative optimization routine, which P (C|i). Pseudo–code for this algorithm is given in seems less natural in our case, would be parallel opti- Figure 1. It is easy to verify that with a straightfor- mization, used e.g., in standard EM (4). In this case, if ward implementation, the complexity of this algorithm 3 we continue our example, at the first iteration we will up- is O(N · Nc) for a single pass over the entire data set, date all the rows in the conditional distribution matrix, where Nc is the number of clusters. We will refer to this P (0)(C|i) using s(0)(C), to find the new P (1)(C|i). algorithm as the Iclust algorithm. In some extreme cases the above algorithm might pro- To gain some intuition let us consider a typical situ- duce a non–monotonic behavior in F. That is, some ation where i is relatively similar to elements in C, but of the updates might reduce F, suggesting that obtain- very different from elements in C′. Thus, the exponent ing a general proof of convergence is a challenging goal.