Machine Learning Approaches to Link-Based Clustering
Zhongfei (Mark) Zhang†, Bo Long‡, Zhen Guo†, Tianbing Xu†, and Philip S. Yu♮

† Computer Science Department, SUNY Binghamton, e-mail: {zhongfei,zguo,txu}@cs.binghamton.edu
‡ Yahoo! Labs, Yahoo! Inc., e-mail: [email protected]
♮ Dept. of Computer Science, Univ. of Illinois at Chicago, e-mail: [email protected]

Abstract

In this chapter we review several state-of-the-art machine learning approaches to different types of link-based clustering. Specifically, we present spectral clustering for heterogeneous relational data, symmetric convex coding for homogeneous relational data, a citation model for clustering a special but popular kind of homogeneous relational data (textual documents with citations), a probabilistic clustering framework based on mixed membership for general relational data, and a statistical graphical model for dynamic relational clustering. We demonstrate the effectiveness of these machine learning approaches through empirical evaluations.

1 Introduction

Link information plays an important role in discovering knowledge from data. For link-based clustering, machine learning approaches provide pivotal strengths for developing effective solutions. In this chapter, we review several specific machine learning approaches to link-based clustering. These approaches are by no means exhaustive; rather, our intention is to use them as exemplars to showcase the power of machine learning techniques in solving the general link-based clustering problem.

By link-based clustering we mean the clustering of relational data; in other words, links are the relations among the data items or objects. Consequently, in the rest of this chapter we use the terms link-based clustering and relational clustering interchangeably. In general, relational data are data that carry link information among the data items in addition to the classic attribute information for each data item. Relational data can be categorized by the type of their relations into homogeneous relational data (relations exist among data items of the same type), heterogeneous relational data (relations exist only between data items of different types), general relational data (relations exist both among data items of the same type and between data items of different types), and dynamic relational data (all data items with relations carry time stamps, which differentiates them from the previous, static types). All the specific machine learning approaches reviewed in this chapter are based on the mathematical foundations of matrix decomposition, optimization, and probability and statistics theory.
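To make this taxonomy concrete, the following toy sketch (our own illustrative example, with hypothetical object types and randomly generated links, not drawn from the approaches reviewed later) shows one simple way each kind of relational data can be written down as relation matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Homogeneous relational data: one object type, links among its own members,
# e.g. a 6 x 6 paper-paper citation matrix.
homogeneous = rng.integers(0, 2, size=(6, 6))

# Heterogeneous relational data: links only between different types,
# e.g. a 6 x 10 document-word occurrence matrix.
heterogeneous = rng.integers(0, 2, size=(6, 10))

# General relational data: links both within a type and across types,
# e.g. paper-paper citations together with paper-author links.
general = {
    "paper-paper": rng.integers(0, 2, size=(6, 6)),
    "paper-author": rng.integers(0, 2, size=(6, 4)),
}

# Dynamic relational data: the links carry time stamps,
# modeled here as one relation matrix per time step.
dynamic = [rng.integers(0, 2, size=(6, 6)) for t in range(3)]
```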
In this chapter, we review five different machine learning approaches tailored to different types of link-based clustering. Specifically, the chapter is organized as follows. In Section 2 we study the heterogeneous relational data clustering problem and present an effective solution through spectral analysis. In Section 3 we study the homogeneous relational data clustering problem and present an effective solution through symmetric convex coding. In Section 4 we study a special but very popular case of homogeneous relational data, in which the data are textual documents and the link information is citation information, and present a generative citation model that captures effective learning of the document topics. In Section 5 we study the general relational data clustering problem and present a general probabilistic framework based on a generative model of mixed membership. In Section 6 we study the dynamic relational data clustering problem and present a solution under a statistical graphical model. Finally, we conclude this chapter in Section 8.

2 Heterogeneous Relational Clustering through Spectral Analysis

Many real-world clustering problems involve data objects of multiple types that are related to each other, such as Web pages, search queries, and Web users in a Web search system, or papers, key words, authors, and conferences in a scientific publication domain. In such scenarios, using traditional methods to cluster each type of objects independently may not work well, for the following reasons.

First, to make use of relation information under the traditional clustering framework, the relation information needs to be transformed into features. In general, this transformation causes information loss and/or very high-dimensional and sparse data. For example, if we represent the relations between Web pages and Web users as well as search queries as features for the Web pages, this leads to a huge number of features with sparse values for each Web page. Second, traditional clustering approaches are unable to handle the interactions among the hidden structures of different types of objects, since they cluster data of a single type based on static features. Note that the interactions can propagate along the relations, i.e., there exists influence propagation in multi-type relational data. Third, in some machine learning applications, users are interested not only in the hidden structure of each type of objects but also in the global structure involving multiple types of objects. For example, in document clustering, besides document clusters and word clusters, the relationship between document clusters and word clusters is also useful information. It is difficult to discover such global structures by clustering each type of objects individually.

Therefore, heterogeneous relational data present a great challenge for traditional clustering approaches. In this study [37], we present a general model, the collective factorization on related matrices, to discover the hidden structures of objects of different types based on both feature information and relation information. By clustering the objects of different types simultaneously, the model performs adaptive dimensionality reduction for each type of data. Through the related factorizations, which share factors, the hidden structures of objects of different types may interact under the model. In addition to the cluster structures for each type of data, the model also provides information about the relation between clusters of objects of different types.

Under this model, we derive an iterative algorithm, spectral relational clustering, to cluster the interrelated data objects of different types simultaneously. By iteratively embedding each type of data objects into low-dimensional spaces, the algorithm benefits from the interactions among the hidden structures of data objects of different types. The algorithm has the simplicity of spectral clustering approaches but is at the same time applicable to relational data with various structures. Theoretical analysis and experimental results demonstrate the promise and effectiveness of the algorithm. We also show that existing spectral clustering algorithms can be considered special cases of the proposed model and algorithm, which provides a unified view for understanding the connections among these algorithms.
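As a rough illustration of this embedding idea only, the sketch below (our own simplification, not the spectral relational clustering algorithm of [37]) handles the basic bi-type case: both object types are embedded into a shared low-dimensional spectral space via a truncated SVD of their relation matrix and then clustered with k-means. The function name and toy data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def bitype_spectral_embed_cluster(R, k, dim=2, seed=0):
    """R: n1 x n2 relation matrix; k: number of clusters per object type."""
    # Truncated SVD gives low-dimensional embeddings for rows and columns.
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    row_embed = U[:, :dim] * s[:dim]      # embedding of type-1 objects
    col_embed = Vt[:dim, :].T * s[:dim]   # embedding of type-2 objects
    # Cluster each object type in its low-dimensional space.
    row_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(row_embed)
    col_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(col_embed)
    return row_labels, col_labels

# Toy block-structured relation matrix with two row clusters and two column clusters.
R = np.block([[np.ones((4, 5)), np.zeros((4, 5))],
              [np.zeros((4, 5)), np.ones((4, 5))]])
rows, cols = bitype_spectral_embed_cluster(R, k=2)
```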
2.1 Model Formulation and Algorithm

In this section, we present a general model for clustering heterogeneous relational data in the spectral domain based on factorizing multiple related matrices.

Given $m$ sets of data objects, $\mathcal{X}_1 = \{x_{11}, \ldots, x_{1n_1}\}, \ldots, \mathcal{X}_m = \{x_{m1}, \ldots, x_{mn_m}\}$, which refer to $m$ different types of objects relating to each other, we are interested in simultaneously clustering $\mathcal{X}_1$ into $k_1$ disjoint clusters, ..., and $\mathcal{X}_m$ into $k_m$ disjoint clusters. We call this task collective clustering on heterogeneous relational data.

To derive a general model for collective clustering, we first formulate the Heterogeneous Relational Data (HRD) as a set of related matrices, in which two matrices are related in the sense that their row indices or column indices refer to the same set of objects. First, if there exist relations between $\mathcal{X}_i$ and $\mathcal{X}_j$ (denoted as $\mathcal{X}_i \sim \mathcal{X}_j$), we represent them as a relation matrix $R^{(ij)} \in \mathbb{R}^{n_i \times n_j}$, where an element $R^{(ij)}_{pq}$ denotes the relation between $x_{ip}$ and $x_{jq}$. Second, a set of objects $\mathcal{X}_i$ may have its own features, which are denoted by a feature matrix $F^{(i)} \in \mathbb{R}^{n_i \times f_i}$, where an element $F^{(i)}_{pq}$ denotes the $q$th feature value of the object $x_{ip}$ and $f_i$ is the number of features for $\mathcal{X}_i$.

Fig. 1 Examples of the structures of the heterogeneous relational data.

Figure 1 shows three examples of the structures of HRD. Example (a) refers to basic bi-type relational data denoted by a relation matrix $R^{(12)}$, such as word-document data. Example (b) represents tri-type star-structured data, such as Web pages, Web users, and search queries in Web search systems, which are denoted by two relation matrices $R^{(12)}$ and $R^{(23)}$. Example (c) represents data consisting of shops, customers, suppliers, shareholders, and advertisement media, in which customers (type 5) have features. These data are denoted by four relation matrices $R^{(12)}$, $R^{(13)}$, $R^{(14)}$, and $R^{(15)}$, and one feature matrix $F^{(5)}$.

It has been shown that the hidden structure of a data matrix can be explored by its factorization [14, 40]. Motivated by this observation, we propose a general model for collective clustering that is based on factorizing the multiple related matrices. In HRD, the cluster structure for a type of objects $\mathcal{X}_i$ may be embedded in multiple related matrices; hence it can be exploited in multiple related factorizations. First, if $\mathcal{X}_i \sim \mathcal{X}_j$, then the cluster structures of both $\mathcal{X}_i$ and $\mathcal{X}_j$ are reflected in the triple factorization of their relation matrix $R^{(ij)}$ such that $R^{(ij)} \approx C^{(i)} A^{(ij)} (C^{(j)})^T$ [40], where $C^{(i)} \in \{0,1\}^{n_i \times k_i}$ is a cluster indicator matrix for $\mathcal{X}_i$ such that $\sum_{q=1}^{k_i} C^{(i)}_{pq} = 1$, and $C^{(i)}_{pq} = 1$ denotes that the $p$th object in $\mathcal{X}_i$ is associated with the $q$th cluster.
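As a quick numerical check of this triple factorization, the following toy sketch (hypothetical object counts, cluster assignments, and association values of our own choosing) builds cluster indicator matrices $C^{(1)}$ and $C^{(2)}$ and a cluster association matrix $A^{(12)}$, and forms the idealized relation matrix they generate.

```python
import numpy as np

n1, n2 = 6, 8    # numbers of objects of type 1 and type 2
k1, k2 = 2, 2    # numbers of clusters for each type

# Cluster indicator matrices: each row contains exactly one 1.
labels1 = np.array([0, 0, 0, 1, 1, 1])
labels2 = np.array([0, 0, 0, 0, 1, 1, 1, 1])
C1 = np.eye(k1)[labels1]    # shape (n1, k1), entries in {0, 1}
C2 = np.eye(k2)[labels2]    # shape (n2, k2), entries in {0, 1}

# Cluster association matrix: strength of the relation between
# the clusters of type 1 and the clusters of type 2.
A12 = np.array([[0.9, 0.1],
                [0.2, 0.8]])

# Idealized (noise-free) relation matrix generated by the model.
R12 = C1 @ A12 @ C2.T       # shape (n1, n2)

# Each block of R12 equals the corresponding entry of A12, e.g. the block
# linking cluster 0 of type 1 with cluster 1 of type 2 is constant 0.1.
assert np.allclose(R12[:3, 4:], 0.1)
```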