arXiv:q-bio.QM/0511042 v2 25 Nov 2005 nomto ae lseig upeetr material Supplementary clustering: based Information .THE I. rbe.I rcie lseigagvndt e involves set data given a clustering unique clustering the practice, no In of present formulation problem. at mathematical is universal there or exploration, and analysis Contents I.Frtapiain h es S data ESR yeast The application: First III. I iue n tables and Figures VI. V eodapiain h P0 data SP500 The application: Second IV. I vlaigcutr’coherence clusters’ Evaluating II. .Tidapiain h ahoi data EachMovie The application: Third V. .The I. lhuhcutrn sawdl sdmto fdata of method used widely a is clustering Although References Acknowledgments .Chrneresults Coherence D. .Chrneresults Coherence D. .Chrneresults Coherence D. .Dsrpino h aa8 data the of Description A. .Dsrpino h aa7 data the of Description A. .Dsrpino h aa3 data the of Description A. .Cmaigsltosa ieetnmeso lses9 clusters of numbers different at solutions Comparing C. .Cmaigsltosa ieetnmeso lses7 clusters of numbers different at solutions Comparing C. .Cmaigsltosa ieetnmeso lses4 clusters of numbers different at solutions Comparing C. .Qaiycmlxt rd–ffcre 9 curves trade–off Quality–complexity B. .Qaiycmlxt rd–ffcre 7 curves trade–off Quality–complexity B. .Qaiycmlxt rd–ffcre 4 curves trade–off Quality–complexity B. .Dtie eut for results Detailed E. .Dtie eut for results Detailed E. .Dtie eut for results Detailed E. .Acutrerce ihucaatrzdgns6 genes uncharacterized with enriched cluster A F. ICLUST .Chrnerslsadcmaios9 9 comparisons and results Coherence matrices annotation 2. the Constructing 1. .Chrnerslsadcmaios8 8 comparisons and results Coherence matrices annotation 2. the Constructing 1. .Chrnerslsadcmaios5 5 comparisons and results Coherence matrices annotation 2. the Constructing 1. lseig”t persotyin shortly appear to material supplementary clustering,” the provides report technical This Ins 2005) 4, Lewis–Sigler December (Dated: and USA 08544 Physics, Jersey New of Princeton, University, Laboratories Princeton Henry Wil GaˇsperJoseph Tkaˇcik, Atwal, and Singh Gurinder Slonim, Noam h xeietlrslsadrlvn oe nldn fre a including code, that relevant at results and the found of results Poor some and experimental highlight Standard the we the e particular, in the In prices y stock all movies. in of provide expression dynamics deter gene we the of to sections stress, response used algo subsequent the scheme clustering in applications: validation different Then iterative the the describe results. we detail our II in Section present in we I Section Iclust algorithm http://www.genomics.princeton.edu/biophysics-theory ALGORITHM N N N c c c 0cutr 10 clusters 20 = 0cutr 8 clusters 20 = 0cutr 6 clusters 20 = rceig fteNtoa cdm fSine (USA) Sciences of Academy National the of Proceedings 11 10 10 8 9 7 8 1 3 3 5 att ieetfrso environmental of forms different to east o ae nild“nomto based “Information entitled paper a for l vial e plcto,cnbe can application, web All available attention. ely special deserve to seem ih sdi u xeiet and experiments our in used rithm s50 n iwrrtnso popular of ratings viewer and 500, ’s etaeo h ovninlcs hr nypairwise only where con- we case here conventional but the (1), elements on on two defined centrate than measures more similarity of can handle groups formulation to This generalized essential. be not is it although plicity, n ic nmn ae l xmlsiocrwt equal with occur i examples all [ cases probability many in since and suulw have we usual as ie ycutrmmesi.Tu fw aesm sim- some have pro- we description if measure Thus ilarity the membership. cluster of cluster a by complexity vided within the elements minimizing of similarity and mean the mizing the out to left were proceed (1). that Ref re- then implementation of we its and Here of briefly details achieved (1). technical formulation theory be this information can of view generality use some recent the that In through analysis. suggest the we of work levels different at choices many lseigi rbblsi sinett clusters to assignment probabilistic a is clustering odn to cording admoto ahcluster, each of out random where h dniyo hi elements, their of identity the and ietesaitclsgicneof significance statistical the mine prmna eut o he very three for results xperimental . efruaecutrn sataeffbtenmaxi- between tradeoff a as clustering formulate We I iuefrItgaieGenomics Integrative for titute ( C imBialek liam h I s ( )i h nomto htcutr rvd about provide clusters that information the is i) ; i h C s stema iiaiyo lmnscoe at chosen elements of similarity mean the is i )= i) ; P = P ( P C i 1 = (i) P X (i | C s )sc htw maximize we that such i) X ( | (i C C C F P , = ) = ) )bteneeet n ,optimal j, and i elements between j) X ( = C /N i ) h P s X X P ecnie hscs o sim- for case this consider we ] i − i , ( i j ( C In . C P P | I T i) | i) (i ( P C P | ( C i log (i) C | (i) i) ) i) ; P P · (j (i) , P | C , ( 1 P C ) P s ( ) (i ( C C , | j) ) i) , (3) ; C (5) (4) ac- (1) (2) 2 interactions are considered. Importantly, as opposed to in Eq (9) will be positive for i and C, but might be neg- using a problem specific similarity measure s, we use ative for i and C′. Consequently, while applying the up- the generality of information theory once more, and take date step the weight of assignment of i to C [P (C|i)] will ′ s(i, j) to be the pairwise mutual information Iij between be boosted while its assignment to C will be decreased. the observed patterns that correspond to data items i This is clearly a desirable outcome, which in particular and j (1).1 should increase F. Thus, since F is upper bounded (as a It is shown in Ref (1) that any stationary point of our sum of information terms), after a finite number of such target functional, F, must obey updates the algorithm is expected to converge to a fixed point which corresponds to a (possibly local) maximum P (C) 1 of F. P (C|i) = exp [2s(C; i) − s(C)] , (6) Z(i; T ) T This example also illustrates one of the differences be- tween our algorithm and previous approaches. While in where Z(i; T ) is a normalization function, s(C; i) is the the Blahut–Arimoto algorithm in rate–distortion theory expected similarity between i and a member of cluster C, (3), in the iterative Information Bottleneck algorithm (5), and typically also in EM for maximum likelihood (4), the N sign of the exponent is constant (for a given i), this is not s(C;i) = P (i1|C)s(i1, i), (7) true in our case. In principle, such a non–constant ex- i1=1 ponent sign should imply faster convergence to a local X stationary point, but might also imply higher sensitivity and s(C) is the average similarity among pairs chosen to the random initialization of P (C|i). Thus, as in other independently out of the cluster C, work, we typically perform several runs with different random initializations of P (C|i) from which we choose N N the best solution, i.e., the one that maximizes F. s(C)= P (i1|C)P (i2|C)s(i1, i2). (8) The Iclust algorithm presented here uses a sequential, i =1 i =1 X1 X2 or incremental iterative procedure in which the updates for some i incorporate the implications of the updates for Eq. (6) defines an implicit set of equations since the right ′ preceding elements, i 6= i. As a simple example, consider hand side depends on P (i|C) and P (C). This is a com- the case where we have three elements (N = 3) and two mon situation in variational methods, also present, for clusters (N = 2). We start from some random condi- example, in conventional rate–distortion clustering (3), in c tional distribution matrix, P (0)(C|i), which in particular maximum likelihood estimation with hidden variables (4) defines s(0)(C), ∀ C = 1 : 2. At the first iteration we find and in the Information Bottleneck framework (5). The a new distribution for the first element (i = 1) over the standard strategy is to turn the self–consistency condi- two clusters. Thus, we now have a new conditional distri- tion into an iterative algorithm. Specifically, let us de- bution matrix, P (1)(C|i) which differs from the previous note the intermediate solution of the algorithm at the P (0)(C|i) only by its first row. This distribution is used mth iteration by P (m)(C|i). Then, at the m +1st itera- to define s(1)(C), ∀ C = 1 : 2. Now, in the next iteration, tion, the algorithm applies the following update rule: we find a new distribution for the second element (i = 2) over the two clusters. This yields another new condi- 1 P (m+1)(C|i) ← P (m)(C)exp [2s(m)(C; i) − s(m)(C)] tional distribution matrix, P (2)(C|i) which differs from T (1) the previous P (C|i) only by its middle row, and so on. (9) This process is somewhat in the spirit of the incremental followed by a normalization step. Notice that the terms EM (6) and the sequential Information Bottleneck algo- {P (m)(C), s(m)(C; i), s(m)(C)} all are calculated using (m) rithm (7). An alternative optimization routine, which P (C|i). Pseudo–code for this algorithm is given in seems less natural in our case, would be parallel opti- Figure 1. It is easy to verify that with a straightfor- mization, used e.g., in standard EM (4). In this case, if ward implementation, the complexity of this algorithm 3 we continue our example, at the first iteration we will up- is O(N · Nc) for a single pass over the entire data set, date all the rows in the conditional distribution matrix, where Nc is the number of clusters. We will refer to this P (0)(C|i) using s(0)(C), to find the new P (1)(C|i). algorithm as the Iclust algorithm. In some extreme cases the above algorithm might pro- To gain some intuition let us consider a typical situ- duce a non–monotonic behavior in F. That is, some ation where i is relatively similar to elements in C, but of the updates might reduce F, suggesting that obtain- very different from elements in C′. Thus, the exponent ing a general proof of convergence is a challenging goal. Nonetheless, even in these extreme cases, and more gen- erally in all our experiments (which included more than 1000 runs over real world problems with different T val- 1 This report does not deal with the technical details of estimating mutual (and multi) information from empirical data. The reader ues and different numbers of clusters), the algorithm al- is referred to (2) for a complete description of the estimation ways converged to a stationary point. Moreover, for the
procedure used in our experiments. regime T ≥ maxi1,i2 s(i1, i2) it is possible to prove this 3 convergence analytically (the details will be presented annotations. Naturally, the more hypotheses one tests elsewhere). the less surprising it is to find one with a small P -value, even in a randomly chosen cluster. The simplest and most conservative approach to correct for this multiple II. EVALUATING CLUSTERS’ COHERENCE hypotheses testing effect is to apply the Bonferroni cor- rection (see,e.g., (8)). Specifically, if the statistical sig- The central question in clustering is whether an essen- nificance level is q (e.g., q =0.05), an event is considered tially unsupervised analysis of a data set can recover cat- significant if and only if its P -value satisfies: egories that have “meaning.” In practice we assess this q by comparison with some set of labels for the data that Pval < , (12) were generated by human intervention. To get started, H then, we need a set of annotations (or labels) provided where H is the number of hypotheses being tested. We for every data item we clustered. Importantly, these an- will say that a cluster is enriched with the annotation if notations are not used during the clustering process but the corresponding P -value satisfies Eq. (12). rather are exposed only for the post–clustering valida- Finally, while the above procedure determines the sig- tion. Every data item might be assigned more than one nificance of every annotation that occurs in the cluster, it annotation via different sources of information. The as- also is useful to have a single score that roughly summa- sumption is that these annotations reflect to some extent rizes how homogeneous the cluster is with respect to all the “real” structure of the data that one wishes to reveal annotations. Different alternatives have been proposed through the clustering process. to this end and here we use the coherence score, sug- To be more concrete, let us assume that we clustered gested by Segal et al. (9), N elements, where each one of these elements is assigned some set of annotations. Formally, this could be rep- nenriched coh(C) ≡ 100 · , (13) resented through an annotation matrix, denoted as A, n with N rows and R columns, where R is the number of distinct annotations in our data. Thus, A(i, j) = 1 if and where n is the number of items in the cluster C, and nenriched is the number of items in C with an annotation only if the i-th element is assigned annotation aj, and zero otherwise. A simple example is given in Table I. that was found to be significantly enriched in C. In other When we examine a single cluster, consisting of n