Information-Theoretic Co-clustering

Inderjit S. Dhillon, Dept. of Computer Sciences, University of Texas, Austin — [email protected]
Subramanyam Mallela, Dept. of Computer Sciences, University of Texas, Austin — [email protected]
Dharmendra S. Modha, IBM Almaden Research Center, San Jose, CA — [email protected]

ABSTRACT

Two-dimensional contingency or co-occurrence tables arise frequently in important applications such as text, web-log and market-basket data analysis. A basic problem in contingency table analysis is co-clustering: simultaneous clustering of the rows and columns. A novel theoretical formulation views the contingency table as an empirical joint probability distribution of two discrete random variables and poses the co-clustering problem as an optimization problem in information theory — the optimal co-clustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters. We present an innovative co-clustering algorithm that monotonically increases the preserved mutual information by intertwining both the row and column clusterings at all stages. Using the practical example of simultaneous word-document clustering, we demonstrate that our algorithm works well in practice, especially in the presence of sparsity and high-dimensionality.

Categories and Subject Descriptors
E.4 [Coding and Information Theory]: Data compaction and compression; G.3 [Probability and Statistics]: Contingency table analysis; H.3.3 [Information Search and Retrieval]: Clustering; I.5.3 [Pattern Recognition]: Clustering

Keywords
Co-clustering, information theory, mutual information

1. INTRODUCTION

Clustering is a fundamental tool in unsupervised learning that is used to group together similar objects [14], and has practical importance in a wide variety of applications such as text, web-log and market-basket data analysis. Typically, the data that arises in these applications is arranged as a contingency or co-occurrence table, such as a word-document co-occurrence table or webpage-user browsing data. Most clustering algorithms focus on one-way clustering, i.e., they cluster one dimension of the table based on similarities along the second dimension. For example, documents may be clustered based upon their word distributions, or words may be clustered based upon their distribution amongst documents.

It is often desirable to co-cluster or simultaneously cluster both dimensions of a contingency table [11] by exploiting the clear duality between rows and columns. For example, we may be interested in finding similar documents and their interplay with word clusters. Quite surprisingly, even if we are interested in clustering along only one dimension of the contingency table, when dealing with sparse and high-dimensional data it turns out to be beneficial to employ co-clustering.

To outline a principled approach to co-clustering, we treat the (normalized) non-negative contingency table as a joint probability distribution between two discrete random variables that take values over the rows and columns. We define co-clustering as a pair of maps from rows to row-clusters and from columns to column-clusters. Clearly, these maps induce clustered random variables. Mutual information can now be used to give a theoretical formulation to the problem: the optimal co-clustering is one that leads to the largest mutual information between the clustered random variables. Equivalently, the optimal co-clustering is one that minimizes the difference ("loss") between the mutual information of the original random variables and the mutual information of the clustered random variables. In this paper, we present a novel algorithm that directly optimizes the above loss function. The resulting algorithm is quite interesting: it intertwines both row and column clustering at all stages. Row clustering is done by assessing the closeness of each row distribution, in relative entropy, to certain "row cluster prototypes". Column clustering is done similarly, and this process is iterated till it converges to a local minimum. Co-clustering differs from ordinary one-sided clustering in that at all stages the row cluster prototypes incorporate column clustering information, and vice versa. We theoretically establish that our algorithm never increases the loss, and so, gradually improves the quality of co-clustering.

We empirically demonstrate that our co-clustering algorithm alleviates the problems of sparsity and high dimensionality by presenting results on joint word-document clustering. An interesting aspect of the results is that our co-clustering approach yields superior document clusters as compared to the case where document clustering is performed without any word clustering. The explanation is that co-clustering implicitly performs an adaptive dimensionality reduction at each iteration, and estimates fewer parameters than a standard "one-dimensional" clustering approach. This results in an implicitly "regularized" clustering.

A word about notation: upper-case letters such as $X$, $Y$, $\hat{X}$, $\hat{Y}$ will denote random variables. Elements of sets will be denoted by lower-case letters such as $x$ and $y$. Quantities associated with clusters will be "hatted": for example, $\hat{X}$ denotes a random variable obtained from a clustering of $X$ while $\hat{x}$ denotes a cluster. Probability distributions are denoted by $p$ or $q$ when the random variable is obvious, or by $p(X,Y)$, $q(X,Y,\hat{X},\hat{Y})$, $p(Y|x)$, or $q(Y|\hat{x})$ to make the random variable explicit. Logarithms are to the base 2.
2. PROBLEM FORMULATION

Let $X$ and $Y$ be discrete random variables that take values in the sets $\{x_1,\ldots,x_m\}$ and $\{y_1,\ldots,y_n\}$ respectively. Let $p(X,Y)$ denote the joint probability distribution between $X$ and $Y$. We will think of $p(X,Y)$ as an $m \times n$ matrix. In practice, if $p$ is not known, it may be estimated using observations. Such a statistical estimate is called a two-dimensional contingency table or a two-way frequency table [9].

We are interested in simultaneously clustering or quantizing $X$ into (at most) $k$ disjoint or hard clusters, and $Y$ into (at most) $\ell$ disjoint or hard clusters. Let the $k$ clusters of $X$ be written as $\{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_k\}$, and let the $\ell$ clusters of $Y$ be written as $\{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_\ell\}$. In other words, we are interested in finding maps $C_X$ and $C_Y$,

$$C_X : \{x_1, x_2, \ldots, x_m\} \to \{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_k\}, \qquad C_Y : \{y_1, y_2, \ldots, y_n\} \to \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_\ell\}.$$

For brevity, we will often write $\hat{X} = C_X(X)$ and $\hat{Y} = C_Y(Y)$; $\hat{X}$ and $\hat{Y}$ are random variables that are a deterministic function of $X$ and $Y$, respectively. Observe that $X$ and $Y$ are clustered separately, that is, $\hat{X}$ is a function of $X$ alone and $\hat{Y}$ is a function of $Y$ alone. But the partition functions $C_X$ and $C_Y$ are allowed to depend upon the entire joint distribution $p(X,Y)$.

Definition 2.1. We refer to the tuple $(C_X, C_Y)$ as a co-clustering.

Suppose we are given a co-clustering. Let us "re-order" the rows of the joint distribution $p$ such that all rows mapping into $\hat{x}_1$ are arranged first, followed by all rows mapping into $\hat{x}_2$, and so on. Similarly, let us "re-order" the columns such that all columns mapping into $\hat{y}_1$ are arranged first, followed by all columns mapping into $\hat{y}_2$, and so on. This row-column reordering has the effect of dividing the distribution $p$ into little two-dimensional blocks. We refer to each such block as a co-cluster.

A fundamental quantity that measures the amount of information random variable $X$ contains about $Y$ (and vice versa) is the mutual information $I(X;Y)$ [3]. We will judge the quality of a co-clustering by the resulting loss in mutual information, $I(X;Y) - I(\hat{X};\hat{Y})$ (note that $I(\hat{X};\hat{Y}) \le I(X;Y)$ by Lemma 2.1 below).

Definition 2.2. An optimal co-clustering minimizes

$$I(X;Y) - I(\hat{X};\hat{Y}) \qquad (1)$$

subject to the constraints on the number of row and column clusters.

For a fixed distribution $p$, $I(X;Y)$ is fixed; hence minimizing (1) amounts to maximizing $I(\hat{X};\hat{Y})$.
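For reference (this display is our addition; it is simply the standard definition from [3]), the two mutual informations appearing in (1) are

$$I(X;Y) = \sum_{x}\sum_{y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}, \qquad I(\hat{X};\hat{Y}) = \sum_{\hat{x}}\sum_{\hat{y}} p(\hat{x},\hat{y})\log\frac{p(\hat{x},\hat{y})}{p(\hat{x})\,p(\hat{y})},$$

where $p(\hat{x},\hat{y})$, $p(\hat{x})$ and $p(\hat{y})$ are the cluster-level marginals obtained by summing $p$ over the corresponding clusters, as in (6) below.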

Let us illustrate the situation with an example. Consider the $6 \times 6$ matrix below that represents the joint distribution:

$$p(X,Y) = \begin{bmatrix} .05 & .05 & .05 & 0 & 0 & 0\\ .05 & .05 & .05 & 0 & 0 & 0\\ 0 & 0 & 0 & .05 & .05 & .05\\ 0 & 0 & 0 & .05 & .05 & .05\\ .04 & .04 & 0 & .04 & .04 & .04\\ .04 & .04 & .04 & 0 & .04 & .04 \end{bmatrix} \qquad (2)$$

Looking at the row distributions it is natural to group the rows into three clusters: $\hat{x}_1 = \{x_1, x_2\}$, $\hat{x}_2 = \{x_3, x_4\}$ and $\hat{x}_3 = \{x_5, x_6\}$. Similarly, the natural column clustering is $\hat{y}_1 = \{y_1, y_2, y_3\}$, $\hat{y}_2 = \{y_4, y_5, y_6\}$. The resulting joint distribution $p(\hat{X},\hat{Y})$, see (6) below, is given by:

$$p(\hat{X},\hat{Y}) = \begin{bmatrix} .3 & 0\\ 0 & .3\\ .2 & .2 \end{bmatrix} \qquad (3)$$

It can be verified that the mutual information lost due to this co-clustering is only .0957, and that any other co-clustering leads to a larger loss in mutual information.

The following lemma shows that the loss in mutual information can be expressed as the "distance" of $p(X,Y)$ to an approximation $q(X,Y)$ — this lemma will facilitate our search for the optimal co-clustering.

Lemma 2.1. For a fixed co-clustering $(C_X, C_Y)$, we can write the loss in mutual information as

$$I(X;Y) - I(\hat{X};\hat{Y}) = D(p(X,Y)\,\|\,q(X,Y)), \qquad (4)$$

where $D(\cdot\|\cdot)$ denotes the Kullback-Leibler (KL) divergence, also known as relative entropy, and $q(X,Y)$ is a distribution of the form

$$q(x,y) = p(\hat{x},\hat{y})\,p(x|\hat{x})\,p(y|\hat{y}), \quad \text{where } x \in \hat{x},\ y \in \hat{y}. \qquad (5)$$

Proof. Since we are considering hard clustering,

$$p(\hat{x},\hat{y}) = \sum_{x \in \hat{x}} \sum_{y \in \hat{y}} p(x,y), \qquad (6)$$

$p(\hat{x}) = \sum_{x \in \hat{x}} p(x)$, and $p(\hat{y}) = \sum_{y \in \hat{y}} p(y)$. By definition,

$$I(X;Y) - I(\hat{X};\hat{Y}) = \sum_{\hat{x}}\sum_{\hat{y}}\sum_{x \in \hat{x}}\sum_{y \in \hat{y}} p(x,y)\log\frac{p(x,y)}{p(x)p(y)} \;-\; \sum_{\hat{x}}\sum_{\hat{y}}\sum_{x \in \hat{x}}\sum_{y \in \hat{y}} p(x,y)\log\frac{p(\hat{x},\hat{y})}{p(\hat{x})p(\hat{y})}$$
$$= \sum_{\hat{x}}\sum_{\hat{y}}\sum_{x \in \hat{x}}\sum_{y \in \hat{y}} p(x,y)\log\frac{p(x,y)}{p(\hat{x},\hat{y})\,\frac{p(x)}{p(\hat{x})}\,\frac{p(y)}{p(\hat{y})}} = \sum_{\hat{x}}\sum_{\hat{y}}\sum_{x \in \hat{x}}\sum_{y \in \hat{y}} p(x,y)\log\frac{p(x,y)}{q(x,y)},$$

where the last step follows since $p(x|\hat{x}) = p(x)/p(\hat{x})$ for $\hat{x} = C_X(x)$ and 0 otherwise, and similarly for $p(y|\hat{y})$. $\square$
Lemma 2.1 shows that the loss in mutual information must be non-negative, and reveals that finding an optimal co-clustering is equivalent to finding an approximating distribution $q$ of the form (5) that is close to $p$ in Kullback-Leibler divergence. Note that the distribution $q$ preserves the marginals of $p$, that is, for $\hat{x} = C_X(x)$ and $\hat{y} = C_Y(y)$,

$$q(x) = \sum_{y} q(x,y) = \sum_{\hat{y}}\sum_{y \in \hat{y}} p(\hat{x},\hat{y})\,p(x|\hat{x})\,p(y|\hat{y}) = p(x,\hat{x}) = p(x).$$

Similarly, $q(y) = p(y)$. In Section 4 we give further properties of the approximation $q$. Recall the example $p(X,Y)$ in (2) and the "natural" row and column clusterings that led to (3). It is easy to verify that the corresponding approximation $q(X,Y)$ defined in (5) equals

$$q(X,Y) = \begin{bmatrix} .054 & .054 & .042 & 0 & 0 & 0\\ .054 & .054 & .042 & 0 & 0 & 0\\ 0 & 0 & 0 & .042 & .054 & .054\\ 0 & 0 & 0 & .042 & .054 & .054\\ .036 & .036 & .028 & .028 & .036 & .036\\ .036 & .036 & .028 & .028 & .036 & .036 \end{bmatrix} \qquad (7)$$

and that $D(p\|q) = .0957$. Note that the row and column sums of the above $q$ are identical to those of $p$ given in (2).
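As a quick numerical check (ours, not part of the original text), the following NumPy snippet builds the approximation $q$ of (5) for the example (2) and evaluates the loss (4) with base-2 logarithms; it reproduces the matrix in (7) and a divergence of about 0.0957. The function names are our own.

```python
import numpy as np

# Joint distribution p(X,Y) from (2) and the "natural" co-clustering.
p = np.array([
    [.05, .05, .05, 0,   0,   0  ],
    [.05, .05, .05, 0,   0,   0  ],
    [0,   0,   0,   .05, .05, .05],
    [0,   0,   0,   .05, .05, .05],
    [.04, .04, 0,   .04, .04, .04],
    [.04, .04, .04, 0,   .04, .04],
])
row_cluster = np.array([0, 0, 1, 1, 2, 2])   # C_X
col_cluster = np.array([0, 0, 0, 1, 1, 1])   # C_Y

def approximation_q(p, row_cluster, col_cluster, k, l):
    """Build q(x,y) = p(xhat,yhat) p(x|xhat) p(y|yhat) as in (5)."""
    px, py = p.sum(axis=1), p.sum(axis=0)
    # p(xhat, yhat): sum p over each co-cluster, as in (6)
    p_hat = np.zeros((k, l))
    for xh in range(k):
        for yh in range(l):
            p_hat[xh, yh] = p[np.ix_(row_cluster == xh, col_cluster == yh)].sum()
    p_xhat, p_yhat = p_hat.sum(axis=1), p_hat.sum(axis=0)
    # p(x|xhat) = p(x)/p(xhat) and p(y|yhat) = p(y)/p(yhat) within the containing cluster
    px_given = px / p_xhat[row_cluster]
    py_given = py / p_yhat[col_cluster]
    return p_hat[np.ix_(row_cluster, col_cluster)] * np.outer(px_given, py_given)

def kl_divergence(p, q):
    """D(p || q) in bits; terms with p = 0 contribute 0."""
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / q[mask])).sum())

q = approximation_q(p, row_cluster, col_cluster, k=3, l=2)
print(np.round(q, 3))              # matches (7)
print(kl_divergence(p, q))         # ~0.0957, the loss in (4)
```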
RELATED WORK mum likelihood in a model-based generative framework. Most of the clustering literature has focused on one-sided clustering algorithms [14]. There was some early work on 4. CO-CLUSTERING ALGORITHM co-clustering, such as in [11] which used a local greedy split- We now describe a novel algorithm that monotonically de- ting procedure to identify hierarchical row and column clus- creases the objective function (1). To describe the algorithm ters in matrices of small size. Co-clustering has also been and related proofs, it will be more convenient to think of the joint distribution of X,Y, Xˆ, and Yˆ . Let p(X,Y, X,ˆ Yˆ ) de- where the last equality follows from (15). tu note this distribution. Observe that we can write: Interestingly, q(X,Y ) also enjoys a maximum entropy prop- p(x, y, x,ˆ yˆ) = p(ˆx, yˆ)p(x, y|x,ˆ yˆ). (9) erty and it can be verified that H(p(X,Y )) ≤ H(q(X,Y )) By Lemma 2.1, for the purpose of co-clustering, we will for any input p(X,Y ). seek an approximation to p(X,Y, X,ˆ Yˆ ) using a distribution q(X,Y, X,ˆ Yˆ ) of the form: Lemma 2.1 quantified the loss in mutual information upon co-clustering as the KL-divergence of p(X,Y ) to q(X,Y ). q(x, y, x,ˆ yˆ) = p(ˆx, yˆ)p(x|xˆ)p(y|yˆ). (10) Next, we use the above proposition to prove a lemma that The reader may want to compare (5) and (10): observe expresses the loss in mutual information in two revealing that the latter is defined for all combinations of x, y,x ˆ, ways. This lemma will lead to a “natural” algorithm. andy ˆ. Note that in (10) ifx ˆ 6= CX (x) ory ˆ 6= CY (y) then q(x, y, x,ˆ yˆ) is zero. We will think of p(X,Y ) as a two-dimensional marginal of p(X,Y, X,ˆ Yˆ ) and q(X,Y ) as Lemma 4.1. The loss in mutual information can be ex- a two-dimensional marginal of q(X,Y, X,ˆ Yˆ ). Intuitively, by pressed as (i) a weighted sum of the relative entropies be- tween row distributions p(Y |x) and “row-lumped” distribu- (9) and (10), within the co-cluster denoted by Xˆ =x ˆ and ˆ ˆ ˆ tions q(Y |xˆ), or as (ii) a weighted sum of the relative en- Y =y ˆ, we seek to approximate p(X,Y |X =x, ˆ Y =y ˆ) by tropies between column distributions p(X|y) and “column- ˆ ˆ a distribution of the form p(X|X =x ˆ)p(Y |Y =y ˆ). The lumped” distributions q(X|yˆ), that is, following proposition (which we state without proof) estab- lishes that there is no harm in adopting such a formulation. D(p(X,Y, X,ˆ Yˆ )||q(X,Y, X,ˆ Yˆ )) X X = p(x)D(p(Y |x)||q(Y |xˆ)),

Proposition 4.1. For every fixed "hard" co-clustering,

$$D(p(X,Y)\,\|\,q(X,Y)) = D(p(X,Y,\hat{X},\hat{Y})\,\|\,q(X,Y,\hat{X},\hat{Y})).$$

We first establish a few simple, but useful, equalities that highlight properties of $q$ desirable in approximating $p$.

Proposition 4.2. For a distribution $q$ of the form (10), the following marginals and conditionals are preserved:

$$q(\hat{x},\hat{y}) = p(\hat{x},\hat{y}), \quad q(x,\hat{x}) = p(x,\hat{x}), \quad q(y,\hat{y}) = p(y,\hat{y}). \qquad (11)$$

Thus,

$$p(x) = q(x), \quad p(y) = q(y), \quad p(\hat{x}) = q(\hat{x}), \quad p(\hat{y}) = q(\hat{y}), \qquad (12)$$
$$p(x|\hat{x}) = q(x|\hat{x}), \quad p(y|\hat{y}) = q(y|\hat{y}), \qquad (13)$$
$$p(\hat{y}|\hat{x}) = q(\hat{y}|\hat{x}), \quad p(\hat{x}|\hat{y}) = q(\hat{x}|\hat{y}), \qquad (14)$$

for all $x$, $y$, $\hat{x}$, and $\hat{y}$. Further, if $\hat{y} = C_Y(y)$ and $\hat{x} = C_X(x)$, then

$$q(y|\hat{x}) = q(y|\hat{y})\,q(\hat{y}|\hat{x}), \qquad (15)$$
$$q(x,y,\hat{x},\hat{y}) = p(x)\,q(y|\hat{x}), \qquad (16)$$

and, symmetrically,

$$q(x|\hat{y}) = q(x|\hat{x})\,q(\hat{x}|\hat{y}), \qquad (17)$$
$$q(x,y,\hat{x},\hat{y}) = p(y)\,q(x|\hat{y}).$$

Proof. The equalities of the marginals in (11) are simple to show and will not be proved here for brevity. Equalities (12), (13), and (14) easily follow from (11). Equation (15) follows from

$$q(y|\hat{x}) = q(y,\hat{y}|\hat{x}) = \frac{q(y,\hat{y},\hat{x})}{q(\hat{x})} = q(y|\hat{y},\hat{x})\,q(\hat{y}|\hat{x}) = q(y|\hat{y})\,q(\hat{y}|\hat{x}).$$

Equation (16) follows from

$$q(x,y,\hat{x},\hat{y}) = p(\hat{x},\hat{y})\,p(x|\hat{x})\,p(y|\hat{y}) = p(\hat{x})\,p(x|\hat{x})\,p(\hat{y}|\hat{x})\,p(y|\hat{y}) = p(x,\hat{x})\,p(\hat{y}|\hat{x})\,p(y|\hat{y}) = p(x)\,q(\hat{y}|\hat{x})\,q(y|\hat{y}) = p(x)\,q(y|\hat{x}),$$

where the last equality follows from (15). $\square$

Interestingly, $q(X,Y)$ also enjoys a maximum entropy property: it can be verified that $H(p(X,Y)) \le H(q(X,Y))$ for any input $p(X,Y)$.

Lemma 2.1 quantified the loss in mutual information upon co-clustering as the KL-divergence of $p(X,Y)$ to $q(X,Y)$. Next, we use the above proposition to prove a lemma that expresses the loss in mutual information in two revealing ways. This lemma will lead to a "natural" algorithm.

Lemma 4.1. The loss in mutual information can be expressed as (i) a weighted sum of the relative entropies between row distributions $p(Y|x)$ and "row-lumped" distributions $q(Y|\hat{x})$, or as (ii) a weighted sum of the relative entropies between column distributions $p(X|y)$ and "column-lumped" distributions $q(X|\hat{y})$, that is,

$$D(p(X,Y,\hat{X},\hat{Y})\,\|\,q(X,Y,\hat{X},\hat{Y})) = \sum_{\hat{x}} \sum_{x:\,C_X(x)=\hat{x}} p(x)\, D(p(Y|x)\,\|\,q(Y|\hat{x})),$$
$$D(p(X,Y,\hat{X},\hat{Y})\,\|\,q(X,Y,\hat{X},\hat{Y})) = \sum_{\hat{y}} \sum_{y:\,C_Y(y)=\hat{y}} p(y)\, D(p(X|y)\,\|\,q(X|\hat{y})).$$

Proof. We show the first equality; the second is similar.

$$D(p(X,Y,\hat{X},\hat{Y})\,\|\,q(X,Y,\hat{X},\hat{Y})) = \sum_{\hat{x},\hat{y}}\ \sum_{x:\,C_X(x)=\hat{x}}\ \sum_{y:\,C_Y(y)=\hat{y}} p(x,y,\hat{x},\hat{y}) \log\frac{p(x,y,\hat{x},\hat{y})}{q(x,y,\hat{x},\hat{y})}$$
$$\stackrel{(a)}{=} \sum_{\hat{x},\hat{y}}\ \sum_{x:\,C_X(x)=\hat{x}}\ \sum_{y:\,C_Y(y)=\hat{y}} p(x)\,p(y|x) \log\frac{p(x)\,p(y|x)}{p(x)\,q(y|\hat{x})} = \sum_{\hat{x}} \sum_{x:\,C_X(x)=\hat{x}} p(x) \sum_{y} p(y|x)\log\frac{p(y|x)}{q(y|\hat{x})},$$

where (a) follows from (16) and since $p(x,y,\hat{x},\hat{y}) = p(x,y) = p(x)\,p(y|x)$ when $\hat{y} = C_Y(y)$ and $\hat{x} = C_X(x)$. $\square$
The significance of Lemma 4.1 is that it allows us to express the objective function solely in terms of the row-clustering, or solely in terms of the column-clustering. Furthermore, it allows us to define the distribution $q(Y|\hat{x})$ as a "row-cluster prototype", and similarly, the distribution $q(X|\hat{y})$ as a "column-cluster prototype". With this intuition, we now present the co-clustering algorithm in Figure 1. The algorithm works as follows. It starts with an initial co-clustering $(C_X^{(0)}, C_Y^{(0)})$ and iteratively refines it to obtain a sequence of co-clusterings $(C_X^{(1)}, C_Y^{(1)}), (C_X^{(2)}, C_Y^{(2)}), \ldots$. Associated with a generic co-clustering $(C_X^{(t)}, C_Y^{(t)})$ in the sequence, the distributions $p^{(t)}$ and $q^{(t)}$ are given by $p^{(t)}(x,y,\hat{x},\hat{y}) = p^{(t)}(\hat{x},\hat{y})\,p^{(t)}(x,y|\hat{x},\hat{y})$ and $q^{(t)}(x,y,\hat{x},\hat{y}) = p^{(t)}(\hat{x},\hat{y})\,p^{(t)}(x|\hat{x})\,p^{(t)}(y|\hat{y})$. Observe that while, as a function of four variables, $p^{(t)}(x,y,\hat{x},\hat{y})$ depends upon the iteration index $t$, the marginal $p^{(t)}(x,y)$ is, in fact, independent of $t$. Hence, we will write $p(x)$, $p(y)$, $p(x|y)$, $p(y|x)$, and $p(x,y)$, respectively, instead of $p^{(t)}(x)$, $p^{(t)}(y)$, $p^{(t)}(x|y)$, $p^{(t)}(y|x)$, and $p^{(t)}(x,y)$.

Algorithm Co_Clustering($p$, $k$, $\ell$, $C_X^\dagger$, $C_Y^\dagger$)
Input: the joint probability distribution $p(X,Y)$, $k$ the desired number of row clusters, and $\ell$ the desired number of column clusters.
Output: the partition functions $C_X^\dagger$ and $C_Y^\dagger$.

1. Initialization: Set $t = 0$. Start with some initial partition functions $C_X^{(0)}$ and $C_Y^{(0)}$. Compute $q^{(0)}(\hat{X},\hat{Y})$, $q^{(0)}(X|\hat{X})$, $q^{(0)}(Y|\hat{Y})$ and the distributions $q^{(0)}(Y|\hat{x})$, $1 \le \hat{x} \le k$, using (18).
2. Compute row clusters: for each row $x$, find its new cluster index as $C_X^{(t+1)}(x) = \arg\min_{\hat{x}} D\big(p(Y|x)\,\|\,q^{(t)}(Y|\hat{x})\big)$, resolving ties arbitrarily. Let $C_Y^{(t+1)} = C_Y^{(t)}$.
3. Compute distributions $q^{(t+1)}(\hat{X},\hat{Y})$, $q^{(t+1)}(X|\hat{X})$, $q^{(t+1)}(Y|\hat{Y})$ and the distributions $q^{(t+1)}(X|\hat{y})$, $1 \le \hat{y} \le \ell$, using (19).
4. Compute column clusters: for each column $y$, find its new cluster index as $C_Y^{(t+2)}(y) = \arg\min_{\hat{y}} D\big(p(X|y)\,\|\,q^{(t+1)}(X|\hat{y})\big)$, resolving ties arbitrarily. Let $C_X^{(t+2)} = C_X^{(t+1)}$.
5. Compute distributions $q^{(t+2)}(\hat{X},\hat{Y})$, $q^{(t+2)}(X|\hat{X})$, $q^{(t+2)}(Y|\hat{Y})$ and the distributions $q^{(t+2)}(Y|\hat{x})$, $1 \le \hat{x} \le k$, using (18).
6. Stop and return $C_X^\dagger = C_X^{(t+2)}$ and $C_Y^\dagger = C_Y^{(t+2)}$ if the change in objective function value, that is, $D(p(X,Y)\,\|\,q^{(t)}(X,Y)) - D(p(X,Y)\,\|\,q^{(t+2)}(X,Y))$, is "small" (say $10^{-3}$); else set $t = t + 2$ and go to Step 2.

Figure 1: Information-theoretic co-clustering algorithm that simultaneously clusters both the rows and columns.
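To make the procedure concrete, here is a short NumPy sketch of Figure 1 (our own illustration, not the authors' released code; names such as `itcc` and `row_prototypes` are ours). It uses the prototype formulas (18) and (19) derived below, measures closeness by KL divergence, and monitors the objective through the row form of Lemma 4.1. It assumes $p$ has no all-zero rows or columns; empty clusters are not handled.

```python
import numpy as np

def _kl_rows(P, Q):
    """KL(P_i || Q_j) for every pair of rows of P and Q (base-2 logs).

    Returns a |P| x |Q| matrix. Zero entries of P contribute 0; a small
    epsilon guards against log(0) in Q.
    """
    eps = 1e-12
    logratio = np.log2(np.where(P > 0, P, 1.0)[:, None, :] / (Q[None, :, :] + eps))
    return (P[:, None, :] * logratio).sum(axis=2)

def itcc(p, k, l, cx, cy, n_iters=20, tol=1e-3):
    """Sketch of Algorithm Co_Clustering (Figure 1).

    p  : m x n joint distribution (non-negative, sums to 1)
    cx : length-m array of initial row-cluster labels in {0,...,k-1}
    cy : length-n array of initial column-cluster labels in {0,...,l-1}
    """
    px, py = p.sum(axis=1), p.sum(axis=0)
    p_y_given_x = p / px[:, None]          # row distributions p(Y|x)
    p_x_given_y = (p / py[None, :]).T      # column distributions p(X|y)

    def cluster_joint(cx, cy):
        # p(xhat, yhat), eq. (6), via indicator matrices
        R = np.eye(k)[cx]                  # m x k
        C = np.eye(l)[cy]                  # n x l
        return R.T @ p @ C

    def row_prototypes(cx, cy):
        # q(y|xhat) = q(y|yhat) q(yhat|xhat), eq. (18); shape k x n
        phat = cluster_joint(cx, cy)
        p_xhat, p_yhat = phat.sum(axis=1), phat.sum(axis=0)
        q_y_given_yhat = py / p_yhat[cy]
        q_yhat_given_xhat = phat / p_xhat[:, None]
        return q_yhat_given_xhat[:, cy] * q_y_given_yhat[None, :]

    def col_prototypes(cx, cy):
        # q(x|yhat) = q(x|xhat) q(xhat|yhat), eq. (19); shape l x m
        phat = cluster_joint(cx, cy)
        p_xhat, p_yhat = phat.sum(axis=1), phat.sum(axis=0)
        q_x_given_xhat = px / p_xhat[cx]
        q_xhat_given_yhat = (phat / p_yhat[None, :]).T
        return q_xhat_given_yhat[:, cx] * q_x_given_xhat[None, :]

    def objective(cx, cy):
        # Loss in mutual information, via the row form of Lemma 4.1
        kl = _kl_rows(p_y_given_x, row_prototypes(cx, cy))
        return float((px * kl[np.arange(len(cx)), cx]).sum())

    prev = cur = objective(cx, cy)
    for _ in range(n_iters):
        # Steps 2-3: re-assign rows against row-cluster prototypes, then update
        cx = _kl_rows(p_y_given_x, row_prototypes(cx, cy)).argmin(axis=1)
        # Steps 4-5: re-assign columns against column-cluster prototypes, then update
        cy = _kl_rows(p_x_given_y, col_prototypes(cx, cy)).argmin(axis=1)
        # Step 6: stop when the decrease in D(p || q) is small
        cur = objective(cx, cy)
        if prev - cur < tol:
            break
        prev = cur
    return cx, cy, cur
```

On the example distribution (2) with $k = 3$ and $\ell = 2$, runs from random initial partitions typically terminate at the co-clustering of (3), mirroring the trace shown later in Table 1.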

In Step 1, the algorithm starts with an initial co-clustering $(C_X^{(0)}, C_Y^{(0)})$ and computes the required marginals of the resulting approximation $q^{(0)}$ (the choice of starting points is important, and will be discussed in Section 5). The algorithm then computes the appropriate "row-cluster prototypes" $q^{(0)}(Y|\hat{x})$. While the reader may wish to think of these as "centroids", note that $q^{(0)}(Y|\hat{x})$ is not a centroid,

$$q^{(0)}(Y|\hat{x}) \ne \frac{1}{|\hat{x}|}\sum_{x \in \hat{x}} p(Y|x),$$

where $|\hat{x}|$ denotes the number of rows in cluster $\hat{x}$. Rather, by (15), for every $y$ we write

$$q^{(t)}(y|\hat{x}) = q^{(t)}(y|\hat{y})\,q^{(t)}(\hat{y}|\hat{x}), \qquad (18)$$

where $\hat{y} = C_Y^{(t)}(y)$. Note that (18) gives a formula that would have been difficult to guess a priori without the help of analysis. In Step 2, the algorithm "re-assigns" each row $x$ to a new row-cluster whose row-cluster prototype $q^{(t)}(Y|\hat{x})$ is closest to $p(Y|x)$ in Kullback-Leibler divergence. In essence, Step 2 defines a new row-clustering. Also, observe that the column clustering is not changed in Step 2. In Step 3, using the new row-clustering and the old column clustering, the algorithm recomputes the required marginals of $q^{(t+1)}$. More importantly, the algorithm recomputes the column-cluster prototypes. Once again, these are not ordinary centroids; rather, by using (17), for every $x$ we write

$$q^{(t+1)}(x|\hat{y}) = q^{(t+1)}(x|\hat{x})\,q^{(t+1)}(\hat{x}|\hat{y}), \qquad (19)$$

where $\hat{x} = C_X^{(t+1)}(x)$. Now, in Step 4, the algorithm "re-assigns" each column $y$ to a new column-cluster whose column-cluster prototype $q^{(t+1)}(X|\hat{y})$ is closest to $p(X|y)$ in Kullback-Leibler divergence. Step 4 defines a new column-clustering while holding the row-clustering fixed. In Step 5, the algorithm re-computes the marginals of $q^{(t+2)}$. The algorithm keeps iterating Steps 2 through 5 until some desired convergence condition is met. The following reassuring theorem, which is our main result, guarantees convergence. Note that co-clustering is NP-hard and a local minimum does not guarantee a global minimum.

Theorem 4.1. Algorithm Co_Clustering monotonically decreases the objective function given in Lemma 2.1.

Proof.

$$D(p^{(t)}(X,Y,\hat{X},\hat{Y})\,\|\,q^{(t)}(X,Y,\hat{X},\hat{Y})) \stackrel{(a)}{=} \sum_{\hat{x}} \sum_{x:\,C_X^{(t)}(x)=\hat{x}} p(x) \sum_{y} p(y|x)\log\frac{p(y|x)}{q^{(t)}(y|\hat{x})}$$
$$\stackrel{(b)}{\ge} \sum_{\hat{x}} \sum_{x:\,C_X^{(t)}(x)=\hat{x}} p(x) \sum_{y} p(y|x)\log\frac{p(y|x)}{q^{(t)}\big(y\,|\,C_X^{(t+1)}(x)\big)}$$
$$\stackrel{(c)}{=} \sum_{\hat{x}} \sum_{x:\,C_X^{(t+1)}(x)=\hat{x}} p(x) \sum_{\hat{y}} \sum_{y:\,C_Y^{(t+1)}(y)=\hat{y}} p(y|x)\log\frac{p(y|x)}{q^{(t)}(\hat{y}|\hat{x})\,q^{(t)}(y|\hat{y})}$$
$$= \underbrace{\sum_{\hat{x}} \sum_{x:\,C_X^{(t+1)}(x)=\hat{x}} p(x) \sum_{\hat{y}} \sum_{y:\,C_Y^{(t+1)}(y)=\hat{y}} p(y|x)\log\frac{p(y|x)}{q^{(t)}(y|\hat{y})}}_{I_1} \;+\; \sum_{\hat{x}}\sum_{\hat{y}} \Big(\sum_{x:\,C_X^{(t+1)}(x)=\hat{x}}\ \sum_{y:\,C_Y^{(t+1)}(y)=\hat{y}} p(x)\,p(y|x)\Big) \log\frac{1}{q^{(t)}(\hat{y}|\hat{x})}$$
$$\stackrel{(d)}{=} I_1 + \sum_{\hat{x}}\sum_{\hat{y}} q^{(t+1)}(\hat{x},\hat{y}) \log\frac{1}{q^{(t)}(\hat{y}|\hat{x})} = I_1 + \sum_{\hat{x}} q^{(t+1)}(\hat{x}) \sum_{\hat{y}} q^{(t+1)}(\hat{y}|\hat{x}) \log\frac{1}{q^{(t)}(\hat{y}|\hat{x})}$$
$$\stackrel{(e)}{\ge} I_1 + \sum_{\hat{x}} q^{(t+1)}(\hat{x}) \sum_{\hat{y}} q^{(t+1)}(\hat{y}|\hat{x}) \log\frac{1}{q^{(t+1)}(\hat{y}|\hat{x})}$$
$$= \sum_{\hat{x}} \sum_{x:\,C_X^{(t+1)}(x)=\hat{x}} p(x) \sum_{\hat{y}} \sum_{y:\,C_Y^{(t+1)}(y)=\hat{y}} p(y|x)\log\frac{p(y|x)}{q^{(t+1)}(\hat{y}|\hat{x})\,q^{(t)}(y|\hat{y})}$$
$$\stackrel{(f)}{=} \sum_{\hat{x}} \sum_{x:\,C_X^{(t+1)}(x)=\hat{x}} p(x) \sum_{\hat{y}} \sum_{y:\,C_Y^{(t+1)}(y)=\hat{y}} p(y|x)\log\frac{p(y|x)}{q^{(t+1)}(\hat{y}|\hat{x})\,q^{(t+1)}(y|\hat{y})}$$
$$\stackrel{(g)}{=} D(p^{(t+1)}(X,Y,\hat{X},\hat{Y})\,\|\,q^{(t+1)}(X,Y,\hat{X},\hat{Y})), \qquad (20)$$

where (a) follows from Lemma 4.1; (b) follows from Step 2 of the algorithm; (c) follows by rearranging the sum and from (15); (d) follows from Step 3 of the algorithm, (6) and (11); (e) follows by the non-negativity of the Kullback-Leibler divergence; (f) follows since we hold the column clusters fixed in Step 2, that is, $C_Y^{(t+1)} = C_Y^{(t)}$; and (g) is due to (15) and Lemma 4.1. By using an identical argument, which we omit for brevity, and by using properties of Steps 4 and 5, we can show that

$$D(p^{(t+1)}(X,Y,\hat{X},\hat{Y})\,\|\,q^{(t+1)}(X,Y,\hat{X},\hat{Y})) \ge D(p^{(t+2)}(X,Y,\hat{X},\hat{Y})\,\|\,q^{(t+2)}(X,Y,\hat{X},\hat{Y})). \qquad (21)$$

By combining (20) and (21), it follows that every iteration of the algorithm never increases the objective function. $\square$
Corollary 4.1. The algorithm in Figure 1 terminates in a finite number of steps at a cluster assignment that is locally optimal, that is, the loss in mutual information cannot be decreased by either (a) re-assignment of a distribution $p(Y|x)$ or $p(X|y)$ to a different cluster distribution $q(Y|\hat{x})$ or $q(X|\hat{y})$, respectively, or by (b) defining a new distribution for any of the existing clusters.

Proof. The result follows from Theorem 4.1 and since the number of distinct co-clusterings is finite. $\square$

Remark 4.1. The algorithm is computationally efficient even for sparse data, as its complexity can be shown to be $O(nz \cdot \tau \cdot (k + \ell))$, where $nz$ is the number of nonzeros in the input joint distribution $p(X,Y)$ and $\tau$ is the number of iterations; empirically, 20 iterations are seen to suffice.

Remark 4.2. A closer examination of the above proof shows that Steps 2 and 3 together imply (20) and Steps 4 and 5 together imply (21). This generalizes the above convergence guarantee to a class of iterative algorithms: any algorithm that uses an arbitrary concatenation of Steps 2 and 3 with Steps 4 and 5 is guaranteed to monotonically decrease the objective function. For example, consider an algorithm that flips a coin at every iteration and performs Steps 2 and 3 if the coin turns up heads, and performs Steps 4 and 5 otherwise. As another example, consider an algorithm that keeps iterating Steps 2 and 3 until no improvement in the objective function is noticed; next, it can keep iterating Steps 4 and 5 until no further improvement in the objective function is noticed; now, it can again iterate Steps 2 and 3, and so on and so forth. Both these algorithms, as well as all algorithms in the same spirit, are guaranteed to monotonically decrease the objective function. Such algorithmic flexibility can allow exploration of various local minima when starting from a fixed initial random partition in Step 1.

Remark 4.3. While our algorithm is in the spirit of k-means, the precise algorithm itself is quite different. For example, in our algorithm the distribution $q^{(t)}(Y|\hat{x})$ serves as a "row-cluster prototype". This quantity is different from the naive "centroid" of the cluster $\hat{x}$. Similarly, the column-cluster prototype $q^{(t+1)}(X|\hat{y})$ is different from the obvious centroid of the cluster $\hat{y}$. In fact, detailed analysis (as is evident from the proof of Theorem 4.1) was necessary to identify these key quantities.

4.1 Illustrative Example

We now illustrate how our algorithm works by showing how it discovers the optimal co-clustering for the example $p(X,Y)$ distribution given in (2) of Section 2. Table 1 shows a typical run of our co-clustering algorithm that starts with a random partition of rows and columns. At each iteration Table 1 shows the steps of Algorithm Co_Clustering, the resulting approximation $q^{(t)}(X,Y)$ and the corresponding compressed distribution $p^{(t)}(\hat{X},\hat{Y})$. The row and column cluster labels are shown around the matrix to indicate the clustering at each stage. Notice how the intertwined row and column co-clustering leads to progressively better approximations to the original distribution. At the end of four iterations the algorithm almost accurately reconstructs the original distribution, discovers the natural row and column partitions, and recovers the ideal compressed distribution $p(\hat{X},\hat{Y})$ given in (3). A pleasing property is that at all iterations $q^{(t)}(X,Y)$ preserves the marginals of the original $p(X,Y)$.
Table 1: Algorithm Co_Clustering of Figure 1 gives progressively better clusterings and approximations till the optimal is discovered for the example p(X,Y) given in Section 2. Each stage shows q(t)(X,Y) (rows and columns annotated with their current cluster labels) and the compressed distribution p(t)(X̂,Ŷ) (rows x̂1, x̂2, x̂3; columns ŷ1, ŷ2).

Initial random co-clustering:
       ŷ1    ŷ1    ŷ2    ŷ1    ŷ2    ŷ2
 x̂3   .029  .029  .019  .022  .024  .024
 x̂1   .036  .036  .014  .028  .018  .018
 x̂2   .018  .018  .028  .014  .036  .036
 x̂2   .018  .018  .028  .014  .036  .036
 x̂3   .039  .039  .025  .030  .032  .032
 x̂3   .039  .039  .025  .030  .032  .032
 p(t)(X̂,Ŷ) = [ .10 .05 ; .10 .20 ; .30 .25 ]

↓ steps 2 & 3 of Figure 1
       ŷ1    ŷ1    ŷ2    ŷ1    ŷ2    ŷ2
 x̂1   .036  .036  .014  .028  .018  .018
 x̂1   .036  .036  .014  .028  .018  .018
 x̂2   .019  .019  .026  .015  .034  .034
 x̂2   .019  .019  .026  .015  .034  .034
 x̂3   .043  .043  .022  .033  .028  .028
 x̂2   .025  .025  .035  .020  .046  .046
 p(t)(X̂,Ŷ) = [ .20 .10 ; .18 .32 ; .12 .08 ]

↓ steps 4 & 5 of Figure 1
       ŷ1    ŷ1    ŷ1    ŷ2    ŷ2    ŷ2
 x̂1   .054  .054  .042   0     0     0
 x̂1   .054  .054  .042   0     0     0
 x̂2   .013  .013  .010  .031  .041  .041
 x̂2   .013  .013  .010  .031  .041  .041
 x̂3   .028  .028  .022  .033  .043  .043
 x̂2   .017  .017  .013  .042  .054  .054
 p(t)(X̂,Ŷ) = [ .30 0 ; .12 .38 ; .08 .12 ]

↓ steps 2 & 3 of Figure 1
       ŷ1    ŷ1    ŷ1    ŷ2    ŷ2    ŷ2
 x̂1   .054  .054  .042   0     0     0
 x̂1   .054  .054  .042   0     0     0
 x̂2    0     0     0    .042  .054  .054
 x̂2    0     0     0    .042  .054  .054
 x̂3   .036  .036  .028  .028  .036  .036
 x̂3   .036  .036  .028  .028  .036  .036
 p(t)(X̂,Ŷ) = [ .30 0 ; 0 .30 ; .20 .20 ]

↓ steps 2 & 3 of Figure 1 Bow [16] is a library of C code useful for writing text analy- sis, language modeling and information retrieval programs.

5.1 Data Sets and Implementation Details

For our experimental results we use various subsets of the 20-Newsgroup data (NG20) [15] and the SMART collection from Cornell (ftp://ftp.cs.cornell.edu/pub/smart). The NG20 data set consists of approximately 20,000 newsgroup articles collected evenly from 20 different usenet newsgroups. This data set has been used for testing several supervised text classification tasks [6] and unsupervised document clustering tasks [19, 8]. Many of the newsgroups share similar topics and about 4.5% of the documents are cross-posted, making the boundaries between some newsgroups rather fuzzy. To make our comparison consistent with previous algorithms we reconstructed various subsets of NG20 used in [19, 8]. We applied the same pre-processing steps as in [19] to all the subsets, i.e., removed stop words, ignored file headers and selected the top 2000 words by mutual information.¹ Specific details of the subsets are given in Table 2. The SMART collection consists of MEDLINE, CISI and CRANFIELD sub-collections. MEDLINE consists of 1033 abstracts from medical journals, CISI consists of 1460 abstracts from information retrieval papers and CRANFIELD consists of 1400 abstracts from aerodynamic systems. After removing stop words and numeric characters we selected the top 2000 words by mutual information as part of our pre-processing. We will refer to this data set as CLASSIC3.

¹The data sets used in [19] and [8] differ in their pre-processing steps. The latter includes subject lines while the former does not. So we prepared two different data sets, one with subject lines and the other without subject lines.

Bow [16] is a library of C code useful for writing text analysis, language modeling and information retrieval programs. We extended Bow with our co-clustering and 1D-clustering procedures, and used MATLAB for spy plots of the matrices.

Table 2: Datasets: Each dataset contains documents randomly sampled from newsgroups in the NG20 corpus.

 Dataset                    | Newsgroups included                                                                                                                             | #documents per group | Total documents
 Binary & Binary_subject    | talk.politics.mideast, talk.politics.misc                                                                                                       | 250                  | 500
 Multi5 & Multi5_subject    | comp.graphics, rec.motorcycles, rec.sport.baseball, sci.space, talk.politics.mideast                                                            | 100                  | 500
 Multi10 & Multi10_subject  | alt.atheism, comp.sys.mac.hardware, misc.forsale, rec.autos, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, talk.politics.guns | 50                   | 500

5.2 Evaluation Measures

Validating clustering results is a non-trivial task. In the presence of true labels, as in the case of the data sets we use, we can form a confusion matrix to measure the effectiveness of the algorithm. Each entry $(i,j)$ in the confusion matrix represents the number of documents in cluster $i$ that belong to true class $j$. For an objective evaluation measure we use micro-averaged precision. For each class $c$ in the data set we define $\alpha(c,\hat{y})$ to be the number of documents correctly assigned to $c$, $\beta(c,\hat{y})$ to be the number of documents incorrectly assigned to $c$, and $\gamma(c,\hat{y})$ to be the number of documents incorrectly not assigned to $c$. The micro-averaged precision and recall are defined, respectively, as:

$$P(\hat{y}) = \frac{\sum_c \alpha(c,\hat{y})}{\sum_c \big(\alpha(c,\hat{y}) + \beta(c,\hat{y})\big)}, \qquad R(\hat{y}) = \frac{\sum_c \alpha(c,\hat{y})}{\sum_c \big(\alpha(c,\hat{y}) + \gamma(c,\hat{y})\big)}.$$

Note that for uni-labeled data $P(\hat{y}) = R(\hat{y})$.
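As an illustration (our addition, not part of the original text), micro-averaged precision for uni-labeled data can be computed directly from a confusion matrix. The snippet below assumes that each cluster is labeled with its dominant true class — an assumption on our part, though it reproduces the values quoted in Section 5.3 for the confusion matrices of Table 3.

```python
import numpy as np

def micro_averaged_precision(confusion):
    """Micro-averaged precision for uni-labeled data.

    confusion[i, j] = number of documents in cluster i with true class j.
    Assumes each cluster is labeled with its dominant true class, so the
    correctly assigned documents are the row-wise maxima.
    """
    confusion = np.asarray(confusion)
    return confusion.max(axis=1).sum() / confusion.sum()

# Confusion matrices from Table 3 (CLASSIC3):
coclust = [[992, 4, 8], [40, 1452, 7], [1, 4, 1387]]
oned    = [[944, 9, 98], [71, 1431, 5], [18, 20, 1297]]
print(micro_averaged_precision(coclust))  # ~0.9835
print(micro_averaged_precision(oned))     # ~0.9432
```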
5.3 Results and Discussion

First we demonstrate that co-clustering is significantly better than clustering along a single dimension using word-document co-occurrence matrices. In all our experiments, since we know the number of true document clusters, we can give that as input to our algorithm. For example, in the case of the Binary data set we ask for 2 document clusters. To show how the document clustering results change with the number of word clusters, we tried $k = 2, 4, 8, \ldots, 128$ word clusters. To initialize a co-clustering with $k$ word clusters, we split each word cluster obtained from a co-clustering run with $k/2$ word clusters. Note that this does not increase the overall complexity of the algorithm. We bootstrap at $k = 2$ by choosing initial word cluster distributions to be "maximally" far apart from each other [1, 6]. Our initialization scheme alleviates, to a large extent, the problem of poor local minima. To initialize document clustering we use a random perturbation of the "mean" document, a strategy that has been observed to work well for document clustering [5]. Since this initialization has a random component, all our results are averages of five trials unless stated otherwise.
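A minimal sketch (ours) of the warm-start just described: a clustering with $k/2$ word clusters is grown to $k$ clusters by splitting each cluster, and the result seeds the next run. The split rule below (an arbitrary even/odd split of each cluster's members) is only a placeholder, since the text does not pin down how each cluster is split; `itcc`, `bootstrap_two_clusters` and `n_doc_clusters` refer to the sketch given after Figure 1 and to hypothetical helpers.

```python
import numpy as np

def split_word_clusters(cy, k_new):
    """Grow a column (word) clustering from k_new/2 to k_new clusters.

    cy is the label vector from a run with k_new/2 clusters; each existing
    cluster is split into two halves (here: an arbitrary even/odd split of
    its members -- a placeholder choice, not the paper's rule).
    """
    cy = np.asarray(cy)
    new = cy.copy()
    k_old = k_new // 2
    for c in range(k_old):
        members = np.where(cy == c)[0]
        new[members[1::2]] = k_old + c   # send every other member to a fresh cluster
    return new

# Hypothetical warm-start ladder (illustrative only):
# cy = bootstrap_two_clusters(p)                # k = 2, "maximally far apart" init [1, 6]
# for k in [4, 8, 16, 32, 64, 128]:
#     cy = split_word_clusters(cy, k)
#     cx, cy, loss = itcc(p, n_doc_clusters, k, cx, cy)
```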

Table 3 shows two confusion matrices obtained on the CLASSIC3 data set using the 1D-clustering and co-clustering algorithms (with 200 word clusters). Observe that co-clustering extracted the original clusters almost correctly, resulting in a micro-averaged precision of 0.9835, while 1D-clustering led to a micro-averaged precision of 0.9432.

Table 3: Co-clustering accurately recovers the original clusters in the CLASSIC3 data set (rows: document clusters; columns: true classes MEDLINE, CISI, CRANFIELD).

        Co-clustering          1D-clustering
        992     4     8        944     9    98
         40  1452     7         71  1431     5
          1     4  1387         18    20  1297

Table 4 shows the confusion matrices obtained by co-clustering and 1D-clustering on the more "confusable" Binary and Binary_subject data sets. While co-clustering achieves 0.98 and 0.96 micro-averaged precision on these data sets respectively, 1D-clustering yields only 0.648 and 0.67.

Table 4: Co-clustering obtains better clustering results compared to one-dimensional document clustering on the Binary and Binary_subject data sets.

              Binary                        Binary_subject
   Co-clustering   1D-clustering    Co-clustering   1D-clustering
     244     4       178   104        241    11       179    94
       6   246        72   146          9   239        71   156

Figure 2 shows how the precision values vary with the number of word clusters for each data set. The Binary and Binary_subject data sets reach peak precision at 128 word clusters, Multi5 and Multi5_subject at 64 and 128 word clusters, and Multi10 and Multi10_subject at 64 and 32 word clusters respectively. Different data sets achieve their maximum at a different number of word clusters. In general, selecting the number of clusters to start with is a non-trivial model selection task and is beyond the scope of this paper. Figure 3 shows the fraction of mutual information lost using co-clustering with a varied number of word clusters for each data set. For optimal co-clusterings, we expect the loss in mutual information to decrease monotonically with an increasing number of word clusters. We observe this on all data sets in Figure 3; our initialization plays an important role in achieving this. Also note the correlation between Figures 2 & 3: the trend is that the lower the loss in mutual information, the better the clustering. To avoid clutter we did not show error bars in Figures 2 & 3 since the variation in values was minimal.

Figure 2 (plot omitted; x-axis: Number of Word Clusters (log scale), y-axis: Micro Average Precision, one curve per data set): Micro-averaged precision values with varied number of word clusters using co-clustering on different NG20 data sets.

Figure 3 (plot omitted; x-axis: Number of Word Clusters (log scale), y-axis: Fraction of Mutual Information lost): Fraction of mutual information lost with varied number of word clusters using co-clustering on different NG20 data sets.

Figure 4 shows a typical run of our co-clustering algorithm on the Multi10 data set. Notice how the objective function value (loss in mutual information) decreases monotonically. We also observed that co-clustering converges quickly, in about 20 iterations, on all our data sets.

Figure 4 (plot omitted; x-axis: Number of Iterations, y-axis: Objective value (Loss in Mutual Information)): Loss in mutual information decreases monotonically with the number of iterations on a typical co-clustering run on the Multi10 data set.

Table 5 shows micro-averaged precision measures on all our data sets; we report the peak IB-Double and IDC precision values given in [19] and [8] respectively. Similarly, in the column under co-clustering we report the peak precision value from among the values in Figure 2. On all data sets co-clustering performs much better than 1D-clustering, clearly indicating the utility of clustering words and documents simultaneously. Co-clustering is also significantly better than IB-Double and comparable with IDC, supporting the hypothesis that word clustering can alleviate the problem of clustering in high dimensions.

Table 5: Co-clustering obtains better micro-averaged precision values on different newsgroup data sets compared to other algorithms.

                    Co-clustering   1D-clustering   IB-Double   IDC
 Binary                  0.98            0.64          0.70       -
 Binary_subject          0.96            0.67            -       0.85
 Multi5                  0.87            0.34          0.5        -
 Multi5_subject          0.89            0.37            -       0.88
 Multi10                 0.56            0.17          0.35       -
 Multi10_subject         0.54            0.19            -       0.55

We now show the kind of structure that co-clustering can discover in sparse word-document matrices. Figure 5 shows the original word-document matrix and the reordered matrix obtained by arranging rows and columns according to cluster order to reveal the various co-clusters. To simplify the figure, the final row clusters from co-clustering are ordered in ascending order of their cluster-purity distribution entropy. Notice how co-clustering reveals the hidden sparsity structure of the various co-clusters of the data set. Some word clusters are found to be highly indicative of individual document clusters, inducing a block-diagonal sub-structure, while the dense sub-blocks at the bottom of the right panel of Figure 5 show that other word clusters are more uniformly distributed over the document clusters. We observed a similar sparsity structure in other data sets.

Figure 5 (spy plots omitted; both panels are 2000 words by 500 documents, with nz = 19626 non-zero entries): Sparsity structure of the Binary_subject word-document co-occurrence matrix before (left) and after (right) co-clustering reveals the underlying structure of various co-clusters (2 document clusters and 100 word clusters). The shaded regions represent the non-zero entries.

While document clustering is the main objective of our experiments, the co-clustering algorithm also returns word clusters. An interesting experiment would be to apply co-clustering to co-occurrence matrices where true labels are available for both dimensions. In Table 6 we give an example to show that the word clusters obtained with co-clustering are meaningful and often representative of the document clusters. Table 6 shows six of the word clusters obtained with co-clustering on the Multi5_subject data set when a total of 50 word clusters and 5 document clusters are obtained. The clusters $\hat{x}_{13}$, $\hat{x}_{14}$, $\hat{x}_{16}$, $\hat{x}_{23}$ and $\hat{x}_{24}$ appear to represent individual newsgroups, with each word cluster containing words indicative of a single newsgroup. This correlates well with the co-cluster block-diagonal sub-structure observed in Figure 5. Additionally, there are a few clusters like $\hat{x}_{47}$ which contained non-differentiating words; clustering them into a single cluster appears to help co-clustering in overcoming noisy dimensions.

Table 6: Word clusters obtained using co-clustering on the Multi5_subject data set. The clusters x̂13, x̂14, x̂16, x̂23 and x̂24 represent the rec.motorcycles, rec.sport.baseball, comp.graphics, sci.space and talk.politics.mideast newsgroups respectively. For each cluster only the top 10 words sorted by mutual information are shown.

   x̂13       x̂14        x̂16        x̂23         x̂24          x̂47
   dod        pitching    graphics    space        israel       army
   ride       season      image       nasa         arab         working
   rear       players     mac         shuttle      jewish       running
   riders     scored      ftp         flight       occupied     museum
   harleys    cubs        color       algorithm    rights       drive
   camping    fans        cd          orbital      palestinian  visit
   carbs      teams       package     satellite    holocaust    post
   bikers     yankees     display     budget       syria        cpu
   tharp      braves      data        srb          civil        plain
   davet      starters    format      prototype    racist       mass

6. CONCLUSIONS AND FUTURE WORK

We have provided an information-theoretic formulation for co-clustering, and presented a simple-to-implement, top-down, computationally efficient, principled algorithm that intertwines row and column clusterings at all stages and is guaranteed to reach a local minimum in a finite number of steps. We have presented examples to motivate the new concepts and to illustrate the efficacy of our algorithm. In particular, word-document matrices that arise in information retrieval are known to be highly sparse [7]. For such sparse high-dimensional data, even if one is only interested in document clustering, our results show that co-clustering is more effective than a plain clustering of just documents. The reason is that when co-clustering is employed, we effectively use word clusters as underlying features and not individual words. This amounts to implicit and adaptive dimensionality reduction and noise removal, leading to better clusters. As a side benefit, co-clustering can be used to annotate the document clusters.

While, for simplicity, we have restricted attention to co-clustering for joint distributions of two random variables, both our algorithm and our main theorem can be easily extended to co-cluster multi-dimensional joint distributions.

In this paper, we have assumed that the numbers of row and column clusters are pre-specified. However, since our formulation is information-theoretic, we hope that an information-theoretic regularization procedure like MDL may allow us to select the number of clusters in a data-driven fashion. Finally, as the most interesting open research question, we would like to seek a generalization of our "hard" co-clustering formulation and algorithms to an abstract multivariate clustering setting that would be applicable when more complex interactions are present between the variables being clustered and the clusters themselves.
Acknowledgments. Part of this research was supported by NSF CAREER Award No. ACI-0093404 and Texas Advanced Research Program grant 003658-0431-2001.


7. REFERENCES

[1] P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In KDD-98. AAAI Press, 1998.
[2] Y. Cheng and G. Church. Biclustering of expression data. In Proceedings ISMB, pages 93-103. AAAI Press, 2000.
[3] T. Cover and J. Thomas. Elements of Information Theory. John Wiley & Sons, New York, USA, 1991.
[4] I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2001), pages 269-274, 2001.
[5] I. S. Dhillon, J. Fan, and Y. Guan. Efficient clustering of very large document collections. In R. Grossman, C. Kamath, P. Kegelmeyer, V. Kumar, and R. Namburu, editors, Data Mining for Scientific and Engineering Applications, pages 357-381. Kluwer Academic Publishers, 2001.
[6] I. S. Dhillon, S. Mallela, and R. Kumar. A divisive information-theoretic feature clustering algorithm for text classification. Journal of Machine Learning Research (JMLR): Special Issue on Variable and Feature Selection, 3:1265-1287, March 2003.
[7] I. S. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1):143-175, January 2001.
[8] R. El-Yaniv and O. Souroujon. Iterative double clustering for unsupervised and semi-supervised learning. In ECML-01, pages 121-132, 2001.
[9] R. A. Fisher. On the interpretation of χ² from contingency tables, and the calculation of P. J. Royal Stat. Soc., 85:87-94, 1922.
[10] N. Friedman, O. Mosenzon, N. Slonim, and N. Tishby. Multivariate information bottleneck. In UAI-2001, pages 152-161, 2001.
[11] J. A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337):123-129, March 1972.
[12] T. Hofmann. Probabilistic latent semantic indexing. In Proc. ACM SIGIR. ACM Press, August 1999.
[13] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 1999.
[14] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ, 1988.
[15] K. Lang. NewsWeeder: Learning to filter netnews. In ICML-95, pages 331-339, 1995.
[16] A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. www.cs.cmu.edu/~mccallum/bow, 1996.
[17] B. Mirkin. Mathematical Classification and Clustering. Kluwer Academic Publishers, 1996.
[18] N. Slonim, N. Friedman, and N. Tishby. Agglomerative multivariate information bottleneck. In NIPS-14, 2001.
[19] N. Slonim and N. Tishby. Document clustering using word clusters via the information bottleneck method. In ACM SIGIR, pages 208-215, 2000.
[20] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Proc. of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368-377, 1999.