
Incorporating User’s Preference into Attributed Graph Clustering

Wei Ye¹, Dominik Mautz², Christian Böhm², Ambuj Singh¹, Claudia Plant³
¹University of California, Santa Barbara
²Ludwig-Maximilians-Universität München, Munich, Germany
³University of Vienna, Vienna, Austria
{weiye,ambuj}@cs.ucsb.edu, {mautz,boehm}@dbs.ifi.lmu.de, [email protected]

ABSTRACT

Graph clustering has been studied extensively on both plain graphs and attributed graphs. However, all these methods need to partition the whole graph to find cluster structures. Sometimes, based on domain knowledge, people may have information about a specific target region in the graph and only want to find a single cluster concentrated on this local region. Such a task is called local clustering. In contrast to global clustering, local clustering aims to find only one cluster that concentrates on the given seed vertex (and also on the designated attributes for attributed graphs). Currently, very few methods can deal with this kind of task. To this end, we propose two quality measures for a local cluster: Graph Unimodality (GU) and Attribute Unimodality (AU). The former measures the homogeneity of the graph structure, while the latter measures the homogeneity of the subspace that is composed of the designated attributes. We call their linear combination Compactness. Further, we propose LOCLU to optimize the Compactness score. The local cluster detected by LOCLU concentrates on the region of interest, provides efficient information flow in the graph, and exhibits a unimodal data distribution in the subspace of the designated attributes.

KEYWORDS

Local clustering, user's preference, attributed graphs, dip test, unimodal, NCut, power iteration.

ACM Reference Format:
Wei Ye, Dominik Mautz, Christian Böhm, Ambuj Singh, and Claudia Plant. 2020. Incorporating User's Preference into Attributed Graph Clustering. In Woodstock '20: ACM Symposium on Neural Gaze Detection, June 03–05, 2020, Woodstock, NY. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/1122445.1122456

arXiv:2003.11079v1 [cs.LG] 24 Mar 2020
© 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-XXXX-X/20/06.

1 INTRODUCTION

Data can be collected from multiple sources and modeled as attributed graphs (networks), in which vertices represent entities, edges represent their relations, and attributes describe their own characteristics. For example, proteins in a protein-protein interaction network may be associated with gene expressions in addition to their interaction relations; users in a social network may be associated with individual attributes such as interests, residence, and demographics in addition to their friendship relations.

One of the major data mining tasks in graphs (networks) is the detection of clusters. Existing methods for cluster detection in attributed graphs can be divided into two categories, i.e., full space attributed graph clustering methods [1, 52] and subspace attributed graph clustering methods [6, 11, 47]. The methods belonging to the first category treat all attributes as equally important to the graph structure, while the methods belonging to the second category consider varying relevance of attributes to the graph structure. All these methods need to partition the whole graph to find cluster structures. However, based on domain knowledge, sometimes people may have information about a specific target region in the graph and are only interested in finding a cluster surrounding this local region. Such a task is called local cluster detection, and it has aroused a great deal of attention in many applications, e.g., targeted ads, medicine, etc. Without considering scalability, one may think we can first use full space or subspace attributed graph clustering techniques and then return the cluster that contains the target region. However, it is hard to set the number of clusters in real-world graphs, and the cluster content depends on the chosen number of clusters.

To deal with this task, several recent works [3, 20] use short random walks starting from the target region to find the local cluster. Also, some approaches [2, 18] focus on using graph diffusion methods to find the local cluster. However, these methods are only suitable for detecting local clusters in plain graphs whose vertices have no attributes. Recently, FocusCO [34] has been proposed to find a local cluster of interest to users in attributed graphs. Given an exemplar set, it first exploits a metric learning method to learn a projection vector that makes the vertices in the exemplar set similar to each other in the projected attribute subspace, then updates the graph weights, and finally performs the focused cluster extraction. FocusCO cannot infer the projection vector if the exemplar set has only one vertex.

In this paper, given user's preference, i.e., the seed vertex and the designated attributes, we develop a method that can automatically find the vertices that are similar to the given seed vertex. The similarity is measured by the homogeneity both in the graph structure and in the subspace that is composed of the designated attributes. To this end, we first propose Compactness to measure the unimodality¹ of clusters in attributed graphs. Compactness is composed of two measures: Graph Unimodality (GU) and Attribute Unimodality (AU). GU measures the unimodality of the graph structure, and AU measures the unimodality of the subspace that is composed of the designated attributes. To consider both the graph structure and the attributes, we first embed the graph structure into vector space. Then we consider the graph embedding vector as another designated attribute and apply the local clustering technique separately on each designated attribute. We call the procedure to find a local cluster LOCLU.

¹In this work, unimodality/unimodal and homogeneity/homogeneous are used interchangeably.

Let us use a simple example to demonstrate our motivation. Figure 1 shows an example social network, in which the vertices represent students in a middle school, the edges represent their friendship relations, and the attributes associated with each vertex are age (years), sport time (hours per week), studying time (hours per week), and playing mobile game (M.G.) time (hours per week). Given vertex 4 and the designated attribute M.G., the task is to find a local cluster around vertex 4. (This task is of interest to mobile game producers.) A conventional diffusion-based local clustering method such as HK [18] finds a cluster C1 = {1, 3, 4, 5, 6, 7}. However, this cluster is not homogeneous in the subspace of the M.G. attribute. Compared with C1, the cluster C2 = {1, 2, 3, 4} is local, which is concentrated on vertex 4 and the M.G. attribute.

Figure 1: An example social network.

The main contributions are as follows:
• We introduce the univariate statistic hypothesis test called Hartigans' dip test [13] to a new user-centric problem setting: incorporating user's preference into attributed graph clustering.
• We propose Compactness, a new quality measure for clusters in attributed graphs. Compactness measures the homogeneity (unimodality) of both the graph structure and the subspace that is composed of the designated attributes.
• We propose LOCLU to optimize the Compactness score.
• We demonstrate the effectiveness and efficiency of LOCLU by conducting experiments on both synthetic and real-world attributed graphs.

2 PRELIMINARIES

2.1 Notation

In this work, we use lower-case Roman letters (e.g., a, b) to denote scalars. We denote (column) vectors by boldface lower-case letters (e.g., x) and denote the i-th element of x by x(i). Matrices are denoted by boldface upper-case letters (e.g., X). We denote entries in a matrix by non-bold lower-case letters, such as xij. Row i of matrix X is denoted by X(i, :), column j by X(:, j). A set is denoted by calligraphic capital letters (e.g., S). An undirected attributed graph is denoted by G = (V, E, X), where V is the set of graph vertices with n = |V| vertices, E is the set of graph edges with e = |E| edges, and X ∈ R^{n×d} is the data matrix of attributes associated with the vertices, where d is the number of attributes. The adjacency matrix A ∈ R^{n×n} has aij = 1 if (vi, vj) ∈ E (i ≠ j) and aij = 0 otherwise; in particular, aii = 0. The degree matrix D is the diagonal matrix associated with A, with dii = Σ_j aij. The random walk transition matrix W is defined as D⁻¹A. The Laplacian matrix is denoted as L = I − W, where I is the identity matrix. An attributed graph cluster is a subset of vertices C ⊆ V with attributes. The indicator function is denoted by 1(x).

2.2 The Dip Test

Before introducing the concept of the dip test, let us first clarify the definitions of unimodal and multimodal distributions. In statistics, a unimodal distribution refers to a probability distribution that has only a single mode (i.e., peak). If a probability distribution has multiple modes, it is called a multimodal distribution. In terms of the behavior of the cumulative distribution function (CDF), a unimodal distribution can also be defined as follows: if the CDF is convex for x < m and concave for x > m (m is the mode), then the distribution is unimodal.

Now let us introduce a univariate statistic hypothesis test called Hartigans' dip test [13]:

Theorem 1. [13] Let F(x) be a distribution function. Then D(F) = 2h (h is the dip test value) only if there exists a nondecreasing function G(x) such that for some xl ≤ xu:
• G(x) is the greatest convex minorant (g.c.m.) of F(x) + h in (−∞, xl).
• G(x) has constant maximum slope in [xl, xu] (the modal interval).
• G(x) is the least concave majorant (l.c.m.) of F(x) − h in (xu, ∞).
• h = sup_{x∉[xl,xu]} |F(x) − G(x)| ≥ sup_{x∈[xl,xu]} |F(x) − G(x)|.

The g.c.m. of F(x) in (−∞, xl] is sup(L(x)) for x ≤ xl, where the sup(·) is taken over all functions L(x) that are convex in (−∞, xl] and nowhere greater than F(x). The l.c.m. of F(x) in [xu, ∞) is inf(L(x)) for x ≥ xu, where the inf(·) is taken over all functions L(x) that are concave in [xu, ∞) and nowhere less than F(x).
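The convex-then-concave characterization of a unimodal CDF above can be checked numerically. The sketch below is only an illustration of that definition, not Hartigans' dip test itself: it counts sign changes of the second differences of a CDF (a discrete proxy for curvature), which is 1 for a unimodal Gaussian CDF and 3 for a two-component mixture whose convexity returns between the modes.

```python
import math

def norm_cdf(x, mu=0.0, sigma=1.0):
    # CDF of a Gaussian via the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def curvature_sign_changes(cdf, lo, hi, steps=200, tol=1e-6):
    """Count sign changes of the CDF's second differences.
    A unimodal CDF is convex then concave: exactly one sign change."""
    h = (hi - lo) / steps
    xs = [lo + i * h for i in range(steps + 1)]
    F = [cdf(x) for x in xs]
    d2 = [F[i - 1] - 2.0 * F[i] + F[i + 1] for i in range(1, steps)]
    signs = [1 if d > tol else -1 for d in d2 if abs(d) > tol]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

unimodal = lambda x: norm_cdf(x)                               # one mode at 0
bimodal = lambda x: 0.5 * (norm_cdf(x, -3) + norm_cdf(x, 3))   # modes near ±3

print(curvature_sign_changes(unimodal, -6, 6))  # 1: convex, then concave
print(curvature_sign_changes(bimodal, -6, 6))   # 3: convexity returns twice
```

A distribution failing the single convex-to-concave switch is exactly what the dip statistic penalizes with a positive value.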

The dip test is the infimum among the suprema computed between the CDF F(x) and the CDFs G(x) from the set of unimodal distributions; it measures the departure of F(x) from unimodality. As pointed out in [13], the class of uniform distributions is the most suitable for the null hypothesis, because their dip test values are stochastically larger than those of other unimodal distributions. Note that the higher the dip test value, the more multimodal the distribution. Also note that the dip test value is in the range [0, 0.25) [28].

Let us use Figure 2 to demonstrate the main idea behind the dip test. Figure 2(a) shows the histogram of the x-axis projection of the data shown in Figure 3(a). The blue curve (F(x)) in Figure 2(b) is the CDF of the x-axis projection of the data. Note that the histogram is for visual comparison only; the dip test only needs F(x). To measure the unimodality of F(x), the dip test tries to fit a piecewise-linear function G(x) onto it. Twice the dip test value, 2h, is then defined as the maximum achievable vertical offset between two copies of F(x) (the red and magenta curves, i.e., F(x) + h and F(x) − h in Figure 2(b)) such that G(x) does not violate its unimodal rules (convex up to the modal interval (the shaded area in Figure 2(b)) and concave after it). The farther F(x) strays from unimodality, the larger the required offset between the two copies of F(x). For more details, please refer to [13, 19, 28].

Figure 2: The demonstration of the dip test. (a) Histogram. (b) Dip solution.

The p-value for the dip test is then computed by comparing D(F(x)) with D(G(x)^(q)) b times, each time with n different observations from G(x), and the proportion Σ_{1≤q≤b} 1(D(F(x)) ≤ D(G(x)^(q))) / b is the p-value. If the p-value is greater than a significance level α, say 0.05, the null hypothesis H0 that F(x) is unimodal is accepted. Otherwise, H0 is rejected in favor of the alternative hypothesis H1 that F(x) is multimodal.

3 METHOD LOCLU

3.1 Objective Function

The problem of incorporating user's preference into attributed graph clustering can be defined as follows:

Incorporating User's Preference into Attributed Graph Clustering. Given an attributed graph G = (V, E, X), a seed vertex vq, and the indexes of the designated attributes I = {a1, a2, ···, au}, find a cluster C = {v1, v2, ···, vq, ···} around the seed vertex vq, such that the cluster C is unimodal not only in the graph structure but also in the subspace that is composed of the designated attributes.

In order to find the local cluster that satisfies this definition, we need to consider the information from both the graph structure and the attributes. Firstly, we consider the graph structure. We propose Graph Unimodality (GU) to measure the unimodality of the graph structure.

Definition 1 (Graph Unimodality). For a local cluster C = {v1, v2, ···, vq, ···} around the given seed vertex vq, its graph unimodality is defined as:

GU(C) = (1/r) Σ_{i=1}^{r} D(F(ei(C)))    (1)

where r is the dimension of the graph embedding E, ei is the i-th embedding vector, and F(ei(C)) is the CDF of ei(C).

Graph Unimodality (GU) measures the unimodality of the graph structure in a detected local cluster. The lower the GU value, the more unimodal a local cluster is in the graph structure. We use the spectral embedding method, especially normalized cut (NCut) [39], to find the embedding E of the graph structure. The definition of the NCut is:

NCut(C) = cut(C, C̄) / vol(C)    (2)

where cut(C, C̄) = Σ_{vi∈C, vj∈C̄} aij and vol(C) = Σ_{vi∈C, vj∈V} aij. Equation (2) can be equivalently rewritten as (for a more detailed explanation, please refer to [42]):

NCut(C) = e⊺Le    s.t. e⊺De = vol(G), De ⊥ 1    (3)

where e is the cluster indicator vector (embedding vector), e⊺Le is the cost of the cut, and 1 is a constant vector whose entries are all 1. Note that finding the optimal solution is known to be NP-hard [43] when the values of e are constrained to {1, −1}. But if we relax the objective function to allow e to take values in R, a near-optimal partition of the graph G can be derived from the eigenvector having the second smallest eigenvalue of L. More generally, the embedding E that is composed of the k eigenvectors with the k smallest eigenvalues partitions the graph into k subgraphs with near-optimal normalized cut value.

Secondly, we consider the attributes. We propose Attribute Unimodality (AU) to measure the unimodality of the subspace that is composed of the designated attributes.

Definition 2 (Attribute Unimodality). For a local cluster C = {v1, v2, ···, vq, ···} around the given seed vertex vq, its attribute unimodality is defined as:

AU(C) = (1/u) Σ_{i=1}^{u} D(F(xi(C)))    (4)

where u is the number of the designated attributes, xi is the ai-th designated attribute, and F(xi(C)) is the CDF of xi(C).
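As a concrete illustration of the ingredients of Equation (2), the following sketch computes cut(C, C̄) and vol(C) for a small hypothetical graph of our own choosing (two triangles joined by one bridge edge), where the numbers are easy to verify by hand:

```python
import numpy as np

# Toy undirected graph: two triangles {0,1,2} and {3,4,5} joined by edge (2,3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

def ncut(A, C):
    """NCut(C) = cut(C, C_bar) / vol(C), as in Equation (2)."""
    in_C = np.zeros(A.shape[0], dtype=bool)
    in_C[list(C)] = True
    cut = A[in_C][:, ~in_C].sum()   # weight of edges leaving C
    vol = A[in_C].sum()             # sum of degrees of vertices in C
    return cut / vol

print(ncut(A, {0, 1, 2}))  # 1 crossing edge / volume 7
```

Cutting off one triangle severs a single edge (cut = 1) while the triangle's degrees sum to 7, so the NCut value is 1/7; a less balanced cut would score worse.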

Attribute Unimodality (AU) measures the unimodality of the subspace that is composed of the designated attributes in a detected local cluster. The lower the AU value, the more unimodal a local cluster is in the subspace.

To measure the unimodality of both the graph structure and the attributes of a local cluster, our objective function integrates both GU and AU into one framework, which is called Compactness:

Compactness(C) = GU(C) + AU(C)    (5)

In the following, we elaborate the optimization method LOCLU. It employs a dip test based local clustering technique on the embedding of the graph structure and on the designated attributes. The detected local cluster is unimodal both in the graph structure and in the designated attributes.

3.2 Optimizing Attribute Unimodality

There are many clustering techniques for numerical attributes, e.g., k-means, EM, DBSCAN [8], etc. One possible idea is inputting proper parameters (such as the number of clusters for k-means, and MinPts and ϵ for DBSCAN) to these clustering techniques and letting them return the cluster that includes the given seed vertex. However, the disadvantage is that the parameters are difficult to set, and the cluster assignments change with different numbers of clusters. Moreover, many clustering techniques need to partition the whole dataset, which is very time- and resource-consuming. Alternatively, we could consider the attributes and the graph structure simultaneously and apply some attributed graph clustering technique, but the problems we face are the same as described above. In this paper, we would like to develop a local clustering technique that does not require the number of clusters k, which is difficult to set in real-world datasets.

Note that the dip test returns a modal interval, in which the distribution of the data is unimodal. Our idea is to employ the modal interval to find a local cluster around the given seed vertex. Our perspective is that data points in the modal interval belong to one cluster. We cluster vertices according to their positions and the position of the modal interval. Thus, other statistical tests that do not return a modal interval cannot be adopted. We use Figure 3 to elaborate the main idea of our local clustering technique. Assume that Figure 3(a) shows two numerical attributes x and y associated with a local graph cluster. The two attributes have three clusters inside. The purple and blue clusters follow Gaussian distributions; the orange cluster follows a uniform distribution. The big blue dot in the blue cluster reveals the numerical values of the given seed vertex, and we want to find a local cluster concentrating on this big blue dot and the x-axis.

We input the x-axis projection of the data into the dip test, and it returns a p-value and a modal interval [xl^(1), xu^(1)]. In this case, the p-value is 0, which is below the significance level α = 0.05, meaning the x-axis projection of the data is multimodal. If the given seed point is on the left side of xl^(1), we remove the data points situated on the right side of xl^(1) and dip over the x-axis projection of the remaining data. If the given seed point is on the right side of xu^(1), we remove the data points situated on the left side of xu^(1) and dip over the x-axis projection of the remaining data. Otherwise, we dip over the x-axis projection of the data situated in the modal interval [xl^(1), xu^(1)]. We repeat this process until the x-axis projection of the remaining data that contains the given seed point is unimodal.

In our case, since [xl^(1), xu^(1)] does not contain the big blue dot, we remove the data points situated on the right side of xl^(1) and continue to dip over the x-axis projection of the remaining data. The dip test on the x-axis projection of the remaining data (shown in Figure 3(b)) returns a p-value of 0, which indicates that the remaining data is still multimodal. Because the given seed point is within the new modal interval [xl^(2), xu^(2)], we extract the data points situated in this modal interval and dip over their x-axis projection. The p-value returned by the dip test is now 0.937, which is greater than the significance level α = 0.05, meaning the x-axis projection of the data points (shown in Figure 3(c)) is unimodal. We therefore terminate the recursive process and return the found local cluster (shown in Figure 3(c)).

Figure 3: The demonstration of the local clustering technique. Assume that the attributes x and y are associated with a local graph cluster. Given the seed vertex (the big blue dot) and the index of the designated attribute (x), we want to find a local cluster around the blue dot which is unimodal in the subspace that is composed of the designated attribute x. (a) The x-axis projection is multimodal with the modal interval [xl^(1), xu^(1)]. (b) The x-axis projection is multimodal with the modal interval [xl^(2), xu^(2)]. (c) The x-axis projection is unimodal and the local clustering stops.

In the above, we recursively dip over the x-axis projection of the data to find the local cluster. If given the indexes of attributes I = {a1, a2, ···, au}, we first compute the dip test value of each attribute and then dip over the attributes according to their dip test values, from the highest to the lowest. In this way, the most multimodal attribute is explored first. The insight is that the directions which depart the most from unimodality are promising for clustering.

Theorem 2. For a random variable X = [x1, x2, ..., xn], the local clustering method will find a unimodal cluster whose values are in the interval [xl, xu]. The probability distribution function (PDF) of the data points in the interval [xi, xj] (∀i, j, xl ≤ xi < xj ≤ xu) is unimodal.

Proof. If the PDF of the data points in the interval [xi, xj] (∀i, j, xl ≤ xi < xj ≤ xu) were multimodal, the PDF of the data points in the interval [xl, xu] would also be multimodal. Thus, the cluster in the interval [xl, xu] would not be unimodal, and the local clustering method would continue dipping over the interval [xl, xu] until the PDF of the remaining data points is unimodal. □

3.3 Optimizing Graph Unimodality

A good partition of the graph structure is achieved by using the local clustering method in the embedding vector space. In this work, the graph embedding contains just one eigenvector. Each eigenvector bisects the graph into two clusters, and we consider the cluster that contains the seed vertex as the required local cluster. We could first compute the eigenvector e2 corresponding to the second smallest eigenvalue of the graph Laplacian matrix L and then use the proposed local clustering technique on it to find a local cluster that contains the seed vertex vq. For large-scale graphs, however, the eigendecomposition of L is O(n³), which is impractical. Instead, we use the power iteration method [23] to compute an approximate eigenvector.

The power iteration (PI) is a fast method to compute the dominant eigenvector of a matrix. Note that the k largest eigenvectors of W are also the k smallest eigenvectors of L. The power iteration method starts with a randomly generated vector v⁰ and iteratively updates as follows:

v^t = W v^{t−1} / ‖W v^{t−1}‖₁    (6)

Suppose W has eigenvectors (embedding vectors) E = [e1; e2; ...; en] with eigenvalues Λ = [λ1, λ2, ..., λn], where λ1 = 1 and e1 is constant. We have WE = ΛE and, in general, W^t E = Λ^t E. Ignoring renormalization, Equation (6) can be written as:

v^t = W v^{t−1} = W² v^{t−2} = ··· = W^t v⁰
    = W^t (c1 e1 + c2 e2 + ··· + cn en)
    = c1 W^t e1 + c2 W^t e2 + ··· + cn W^t en
    = c1 λ1^t e1 + c2 λ2^t e2 + ··· + cn λn^t en    (7)

where v⁰ is written as c1 e1 + c2 e2 + ··· + cn en, a linear combination of all the original orthonormal eigenvectors. Since the orthonormal eigenvectors form a basis for R^n, any vector can be expanded by them.

From Equation (7), we have the following:

v^t / (c1 λ1^t) = e1 + (c2/c1)(λ2/λ1)^t e2 + ··· + (cn/c1)(λn/λ1)^t en    (8)

So the convergence rate of PI towards the dominant eigenvector e1 depends on the significant terms (λi/λ1)^t (2 ≤ i ≤ n). If we let the power iteration method run long enough, it converges to the dominant eigenvector e1, which is of little use in clustering. If we define the velocity at t to be the vector δ^t = v^t − v^{t−1} and the acceleration at t to be the vector ϵ^t = δ^t − δ^{t−1}, we can stop the power iteration when ‖ϵ^t‖max is below a threshold ϵ̂. We use v^t as the graph embedding vector. Figure 4 shows e2 and v^t of the graph Laplacian matrix of the data shown in Figure 3. Compared with the PDF of e2, the PDF of v^t is more multimodal and thus more promising for clustering.

Figure 4: Comparison between the PDFs of e2 and v^t. (a) Histogram of e2. (b) Histogram of v^t.

3.4 Implementation Details and Analysis

We can consider the embedding vector v^t as another designated attribute and perform local clustering on it. Let us elaborate the algorithmic details of LOCLU, whose pseudo-code is given in Algorithm 1. Line 3 computes the random walk transition matrix, which costs O(e), where e represents the number of edges in the graph. Line 4 initializes the starting vector for the power iteration method. Lines 5–9 use the power iteration method to compute an embedding vector for the graph structure. The power iteration method is guaranteed to converge (please refer to [23]), and its time complexity is O(e) [24]. Line 10 considers the embedding vector v^t as another designated attribute and concatenates it to the data matrix X. Lines 11–16 perform local clustering separately on the designated attributes. Lines 11–13 compute the dip test values and p-values for the designated attributes. At lines 15–16, we dip over the designated attributes according to their dip test values, from the highest to the lowest. Line 16 uses the local clustering method to find a local cluster around the given seed vertex on each designated attribute. The time complexity of the dip test at line 12 is O(n) [13]. Note that the dip test method first sorts the input data, which costs O(n · log(n)) time. Thus, the time complexity of lines 11–14 in Algorithm 1 is O(u · n · log(n)), where u is the number of the designated attributes and n is the number of vertices.
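The truncated power iteration of Equations (6)–(8) with the velocity/acceleration stopping rule can be sketched as follows. The two-block graph, its sizes, and the densities are our own illustrative choices; only the threshold 0.001 mirrors ϵ̂ in Algorithm 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical graph: two dense 20-vertex blocks, sparse edges in between.
n = 40
A = (rng.random((n, n)) < 0.02).astype(float)   # sparse background
A[:20, :20] = rng.random((20, 20)) < 0.5        # dense block 1
A[20:, 20:] = rng.random((20, 20)) < 0.5        # dense block 2
A = np.triu(A, 1)
A = A + A.T                                     # undirected, no self-loops
deg = A.sum(axis=1)
deg[deg == 0] = 1.0                             # guard isolated vertices
W = A / deg[:, None]                            # random walk matrix D^{-1} A

v = rng.random(n)
v /= np.abs(v).sum()
delta_prev = np.full(n, np.inf)
for t in range(1000):
    v_new = W @ v
    v_new /= np.abs(v_new).sum()                # Equation (6)
    delta = np.abs(v_new - v)                   # "velocity"  δ^t
    eps = np.abs(delta - delta_prev)            # "acceleration" ε^t
    v, delta_prev = v_new, delta
    if eps.max() < 1e-3:                        # stop early, before v flattens
        break                                   # into the useless e1
```

Stopping on the acceleration rather than on full convergence is what preserves the cluster-revealing components of Equation (8) in v^t.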

Algorithm 1: LOCLU
Input: Adjacency matrix A, data matrix X ∈ R^{n×d}, the seed vertex index q, the indexes of the designated attributes I = {a1, a2, ···, au}
Output: Local cluster C
1  ϵ̂ ← 0.001, t ← 0, iter ← 1000;
2  C ← [1 : n];  /* C contains the indexes of vertices. */
3  compute the random walk transition matrix W;
4  v⁰ ← randn(n, 1);
   /* power iteration */
5  repeat
6      v^{t+1} ← W v^t / ‖W v^t‖₁;
7      δ^{t+1} ← |v^{t+1} − v^t|;
8      t ← t + 1;
9  until ‖δ^{t+1} − δ^t‖max ≤ ϵ̂ or t ≥ iter;
10 X ← [X, v^t], a_{u+1} ← d + 1, I ← I ∪ {a_{u+1}};  /* [·, ·] means concatenation. */
   /* Perform local clustering separately on the embedding vector of the graph structure and the designated attributes. */
11 for i ← 1 to u + 1 do
12     [dip, p-value, xl, xu] ← DipTest(X(:, ai));
13     d(i) ← dip;
14 [d, s] ← sort(d);  /* descending sort; s contains the indexes of attributes sorted by their dip test values. */
15 for i ← 1 to u + 1 do
16     C ← LocalClustering(C, X, q, s(i));
17 return C;

The local clustering method is given in Algorithm 2. It recursively dips over the designated attribute until it finds the local cluster around the given seed vertex. Line 3 dips over the designated attribute. At lines 4–9, if the given seed vertex does not belong to the current modal interval, we extract the vertices whose attribute values are on the left side of xl (line 5) or on the right side of xu (line 7) and update the cluster C; at line 9, if the given seed vertex belongs to the current modal interval, we update the cluster C with the vertices whose attribute values are inside the modal interval. The time complexity of the dip test at line 3 is O(n · log(n)). Thus, the time complexity of the local clustering procedure is bounded by O(k · n · log(n)), where k ≪ n is the number of modes in the data. We remark that the local clustering method is guaranteed to converge. The worst case is that each vertex is a mode and the local clustering method finds the given seed vertex itself as the local cluster, which leads to the termination of the dip test. Thus, LOCLU is also guaranteed to converge. The time complexity of LOCLU is O((u + k) · n · log(n) + e).

Theorem 3. The local cluster detected by LOCLU is unimodal in each designated attribute and in the graph structure.

Proof. For the data matrix X = [X, v^t] at line 10 in Algorithm 1, we first apply the local clustering method on the attribute with the highest dip test value, and it finds a unimodal cluster in the interval [x_{l1}^(1), x_{u1}^(1)]. Then we apply it on the attribute with the second largest dip test value, and it finds a unimodal cluster in the interval [x_{l2}^(2), x_{u2}^(2)], where l1 ≤ l2 < u2 ≤ u1. From Theorem 2, we know that the data points in the interval [x_{l2}^(1), x_{u2}^(1)] of the attribute with the highest dip test value remain unimodal. If we apply the local clustering method on each column of X, from the highest to the lowest with respect to their dip test values, the final local cluster detected by LOCLU is unimodal in each designated attribute and in the graph structure. □
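The recursive dipping of the local clustering procedure can be sketched end to end. Since Hartigans' dip test is not in the Python standard library, the sketch below substitutes a crude histogram-based stand-in of our own (it is NOT the dip test): it declares a sample unimodal when the bin counts rise to a single peak and then fall, and returns a "modal interval" as the downhill run of bins around the highest peak. The recursion itself mirrors the keep-left / keep-right / keep-interval cases of the local clustering method.

```python
import numpy as np

def toy_dip(x, bins=8):
    # Histogram stand-in for DipTest (not Hartigans' test): unimodal iff the
    # counts rise to one peak then fall; "modal interval" = downhill run
    # of bins around the highest bin.
    counts, edges = np.histogram(x, bins=bins)
    peak = int(counts.argmax())
    uni = (np.all(np.diff(counts[:peak + 1]) >= 0) and
           np.all(np.diff(counts[peak:]) <= 0))
    lo, hi = peak, peak
    while lo > 0 and counts[lo - 1] <= counts[lo]:
        lo -= 1
    while hi < bins - 1 and counts[hi + 1] <= counts[hi]:
        hi += 1
    return uni, edges[lo], edges[hi + 1]

def local_clustering(x, seed_idx):
    # Recursive dipping in the spirit of the LocalClustering procedure.
    idx = np.arange(len(x))
    while True:
        uni, xl, xu = toy_dip(x[idx])
        if uni:
            return idx
        s = x[seed_idx]
        if s < xl:                        # seed left of the modal interval
            new = idx[x[idx] < xl]
        elif s > xu:                      # seed right of the modal interval
            new = idx[x[idx] > xu]
        else:                             # seed inside the modal interval
            new = idx[(x[idx] >= xl) & (x[idx] <= xu)]
        if len(new) in (0, len(idx)):     # no progress: stop the recursion
            return idx
        idx = new

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(-6, 0.5, 1000),   # outer blob
                    rng.normal(0, 0.5, 1000),    # middle blob (holds the seed)
                    rng.normal(6, 0.5, 1000)])   # outer blob
cluster = local_clustering(x, seed_idx=1500)     # seed in the middle blob
```

On this toy trimodal sample the recursion discards the two outer blobs and returns essentially the middle cluster around the seed; swapping the stand-in for a real dip test implementation yields the attribute-side behavior of LOCLU.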

Algorithm 2: LocalClustering
Input: Cluster C, data matrix X, the seed vertex index q, the index s of the designated attribute
Output: Local cluster C
1  repeat
2      x ← X(C, s);
3      [dip, p-value, xl, xu] ← DipTest(x);
4      if x(q) < xl then
5          C ← {σ1, σ2, ...};  /* {x(σ1), x(σ2), ...} < xl */
6      else if x(q) > xu then
7          C ← {σ1, σ2, ...};  /* {x(σ1), x(σ2), ...} > xu */
8      else
9          C ← {σ1, σ2, ...};  /* {x(σ1), x(σ2), ...} ∈ [xl, xu] */
10 until p-value > 0.05;
11 return C;

4 EXPERIMENTAL EVALUATION

4.1 Experiment Settings

We thoroughly evaluate LOCLU on cluster quality and runtime using both synthetic and real-world attributed graphs. We compare LOCLU with baseline methods whose descriptions are as follows:
• FocusCO [34] identifies the relevance of vertex attributes that makes the user-provided exemplar vertices similar to each other. Then it reweighs the graph edges and extracts the focused cluster.
• SG-Pursuit [6] is a generic and efficient method for detecting subspace clusters in attributed graphs. The main idea is to iteratively identify an intermediate solution that is close to optimal and then project it to the feasible space defined by the topological and sparsity constraints.
• UNCut [47] proposes the unimodal normalized cut to find cohesive clusters in attributed graphs. The homogeneity of attributes is measured by the proposed unimodality compactness, which also exploits Hartigans' dip test.
• AMEN [32, 33] develops a measure called Normality to quantify both the internal consistency and the external separability of a graph cluster. Then, the graph cluster that has the best Normality score is extracted.

• AGC [50] is an adaptive graph convolution method for attributed graph clustering, using spectral convolution filters on the vertex attributes.
• HK [18] is a local and deterministic method to accurately compute a heat kernel diffusion in a graph. Then, it finds a small-conductance community around a given seed vertex. HK only considers the graph structure.
We use the Normalized Mutual Information (NMI) [27] and the F1 score [18, 41] to evaluate the cluster quality. NMI is a widely used metric for computing the clustering accuracy of a method against the desired ground truth. NMI is defined as NMI(C*, C) = 2·I(C*; C) / (H(C*) + H(C)), where C* is the ground truth, C is the detected cluster, I(·; ·) is the mutual information, and H(·) is the entropy. The F1 score is the harmonic mean of precision P and recall R and is defined as F1 = 2·P·R / (P + R), where P = |C ∩ C*| / |C| and R = |C ∩ C*| / |C*|. The higher the NMI and F1 score, the better the clustering.
For the experiments on the synthetic graphs, we give the correct number of clusters to SG-Pursuit, UNCut, and AGC. We also give the correct size of each cluster to SG-Pursuit. We compute the NMI and the F1 score for each combination of the detected cluster and the ground truth cluster and report the best NMI and F1 score. Note that the selection of the seed vertex and the indexes of the designated attributes depends on the user's preferences. In the experiments, we randomly sample a vertex as the seed vertex, and only dip over the most multimodal attribute, i.e., the one whose dip test value is the highest.
FocusCO needs to compute the relevant attribute weight vector β, which is then used to weight each edge in the graph. For a fair comparison, the entry in β that corresponds to the attribute whose dip test value is the highest is set to one and the other entries are set to zero. AMEN also needs to compute the relevant attribute weight vector. Analogous to FocusCO, we set the corresponding entry to one and the other entries to zero. Since HK is designed for plain graphs and cannot handle attribute information, we incorporate the attribute information by weighing the edges of the graph using the weighting vector β. We also report the results of HK on the plain graph structure. We call these two versions weighted HK (w) and unweighted HK (uw).
We run each experiment 50 times, and each time we randomly sample a seed vertex. All the experiments are run on the same machine with Ubuntu 18.04.1 LTS and an Intel Core Quad i7-3770 with 3.4 GHz and 32 GB RAM. LOCLU is written in Java. The code of LOCLU and all the synthetic and real-world graphs used in this work are publicly available at Github².

4.2 Synthetic Graphs

4.2.1 Clustering Quality. To study the clustering performance, we generate synthetic graphs with varying numbers of vertices n, varying numbers of attributes d, varying ratios of relevant attributes, and variable cluster size ranges. For the case of varying n, we fix the attribute dimension d = 20 and the ratio of relevant attributes at 50%. For the case of varying d, we fix the number of vertices n = 1000 and the ratio of relevant attributes at 50%. For the case of varying the ratio of relevant attributes, we fix the attribute dimension d = 20 and the number of vertices n = 1000. For varying the cluster size range, we fix the attribute dimension d = 20 and the ratio of relevant attributes at 50%.
All the graphs are generated based on the planted partitions model [7], which is also used in FocusCO and SG-Pursuit. Given the desired number of vertices in each cluster, we define a block for the cluster on the diagonal of the adjacency matrix and randomly assign a 1 (an edge) to each entry in the block with probability 0.35 (the density of edges in each cluster). For the blocks that are not on the diagonal of the adjacency matrix, we randomly assign an edge to each entry in the block with a probability of 0.01 (the density of edges between clusters). We further bisect each graph cluster into two new clusters and then assign attributes to each new cluster. In this case, a method that only considers the graph structure cannot detect the "real cluster" (unimodal both in the graph structure and in the designated attributes). To add vertex attributes, for each new graph cluster, we generate the values of the relevant attributes according to a Gaussian distribution with the mean value of each attribute randomly sampled from the range [0, 10] and the variance of each attribute set to 0.001. Following FocusCO [34], the variance is specifically chosen to be small such that the clustered vertices "agree" on their relevant attributes. To make the other attributes of clusters irrelevant to the graph structure, we first randomly permute the vertex labels and then generate each cluster's irrelevant attribute values according to a Gaussian distribution with mean randomly sampled from the range [10, 20] and variance 1.
To study how the number of designated attributes affects the performance of LOCLU, we use our generative model to generate a synthetic dataset with n = 1000 vertices, d = 20 attributes, and a ratio of relevant attributes of 50%. Figure 5 shows that LOCLU can almost detect the ground truth. When increasing the number of designated attributes, the performance of LOCLU does not change much. LOCLU considers the data points that are situated in the modal interval as the cluster. However, for some boundary data points of Gaussian clusters, the modal interval may not include them. This is the reason why the performance curves of LOCLU show some small fluctuations.

Figure 5: Clustering results of LOCLU with the increasing number of designated attributes. (NMI and F1 values between 0.90 and 1.00 for 1–9 designated attributes.)

² https://github.com/yeweiysh/LOCLU
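The two quality metrics defined in Section 4.1 can be computed in a few lines; this is a plain-Python sketch (function names are ours; `nmi` takes two label lists, `f1_score` compares two vertex sets):

```python
import math
from collections import Counter

def nmi(truth, pred):
    """NMI(C*, C) = 2 * I(C*; C) / (H(C*) + H(C))."""
    n = len(truth)
    p_t, p_p = Counter(truth), Counter(pred)
    joint = Counter(zip(truth, pred))
    h_t = -sum(c / n * math.log(c / n) for c in p_t.values())
    h_p = -sum(c / n * math.log(c / n) for c in p_p.values())
    # mutual information over the observed joint label pairs
    mi = sum(c / n * math.log(c * n / (p_t[t] * p_p[p]))
             for (t, p), c in joint.items())
    return 2 * mi / (h_t + h_p) if h_t + h_p > 0 else 1.0

def f1_score(detected, truth):
    """F1 = 2PR/(P+R) with P = |C ∩ C*|/|C| and R = |C ∩ C*|/|C*|."""
    detected, truth = set(detected), set(truth)
    inter = len(detected & truth)
    if inter == 0:
        return 0.0
    precision, recall = inter / len(detected), inter / len(truth)
    return 2 * precision * recall / (precision + recall)
```

Both scores are 1 for a perfect match; as in the text, higher is better.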

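The generative model of Section 4.2.1 (planted partition plus per-cluster Gaussian attributes) can be sketched as follows. The function and parameter names are ours; the edge probabilities and Gaussian parameters follow the text:

```python
import random

def planted_partition(sizes, p_in=0.35, p_out=0.01, seed=0):
    """Planted partition model [7]: diagonal blocks (clusters) get edge
    probability p_in, off-diagonal blocks p_out. Returns the adjacency
    matrix (list of lists) and the ground-truth cluster labels."""
    rng = random.Random(seed)
    n = sum(sizes)
    labels = [c for c, size in enumerate(sizes) for _ in range(size)]
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                adj[i][j] = adj[j][i] = 1
    return adj, labels

def cluster_attributes(labels, d=20, relevant_ratio=0.5, seed=0):
    """Per-cluster attributes as in Sec. 4.2.1: relevant attributes are
    Gaussian with mean ~ U[0, 10] and variance 0.001; irrelevant ones use
    mean ~ U[10, 20] and variance 1, drawn after randomly permuting the
    vertex labels so they are decoupled from the graph structure."""
    rng = random.Random(seed)
    k = max(labels) + 1
    d_rel = int(d * relevant_ratio)
    permuted = list(labels)
    rng.shuffle(permuted)
    rel_means = [[rng.uniform(0, 10) for _ in range(d_rel)] for _ in range(k)]
    irr_means = [[rng.uniform(10, 20) for _ in range(d - d_rel)] for _ in range(k)]
    X = []
    for i, c in enumerate(labels):
        row = [rng.gauss(rel_means[c][a], 0.001 ** 0.5) for a in range(d_rel)]
        row += [rng.gauss(irr_means[permuted[i]][a], 1.0) for a in range(d - d_rel)]
        X.append(row)
    return X
```

The small variance of the relevant attributes makes same-cluster vertices "agree" on them, while the permutation step detaches the irrelevant attributes from the planted blocks.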
Figures 6 and 7 show the quality results. Since HK (w) and HK (uw) have similar performance, we only show one of them. Figure 6(d) shows the results of each method when varying the cluster size range. We let the graph contain clusters of variable sizes and increase the variance of the cluster sizes. The cluster size is randomly drawn from the variable ranges. In Figure 6, we can see that LOCLU outperforms most of the comparison methods. In most cases, LOCLU beats all the competitors by a large margin, although we provide them with the correct parameters. Figure 6 also shows that SG-Pursuit is the most unstable method compared with the other methods in all these scenarios. AGC is a deep learning method, and it is the best among all the comparison methods. Figure 6(c) demonstrates that the performance of AGC increases dramatically with the increasing ratio of relevant attributes. In Figure 7, we reach similar conclusions. As pointed out above, for some boundary data points of Gaussian clusters, the modal interval may not include them. Thus, the curves of LOCLU show some small fluctuations. In addition, we randomly generate the mean values of the attributes of each graph cluster. If the mean values of the attributes of two graph clusters are very close, the dip test may conclude that these two clusters' attributes follow a unimodal distribution. Therefore, LOCLU cannot separate them. This is another reason why the curves of LOCLU show some small fluctuations.

Figure 6: Clustering results (F1) on synthetic graphs. ((a) number of vertices; (b) number of attributes; (c) ratio of relevant attribute; (d) cluster size range.)

Figure 7: Clustering results (NMI) on synthetic graphs. ((a) number of vertices; (b) number of attributes; (c) ratio of relevant attribute; (d) cluster size range.)

4.2.2 Scalability. In this section, we study the scalability of all the methods. We still use the generative model to generate synthetic graphs. For the case of varying the number of attributes, we fix the number of vertices n = 2000 and the ratio of the relevant attributes at 50%. For the case of varying the number of vertices, we fix the attribute dimension d = 20 and the ratio of the relevant attributes at 50%. Because the running times of the weighted and unweighted versions of the baseline HK are similar, we only give the results of the unweighted version. The runtime of each method is shown in Figure 8. Since HK (uw) only considers the graph structure, its running time is the lowest. AGC has the second lowest running time. LOCLU outperforms FocusCO, SG-Pursuit, UNCut, and AMEN in most cases.

Figure 8: Runtime experiments. ((a) runtime vs. number of attributes; (b) runtime vs. number of vertices; time in seconds on a logarithmic scale.)

4.3 Real-world Graphs
We conduct experiments on six real-world attributed graphs whose statistics are given in Table 1. Their details are described in the following. For the real-world graphs, if their attributes are not numeric, i.e., categorical, we use one-hot encoding to transform the categorical values to numeric ones.
• Disney [36]: This network is the Amazon co-purchase network of Disney movies. The network has 124 vertices and 333

Table 1: The statistics of the real-world attributed graphs.

Datasets   vertex#    edge#      attribute#   cluster# [11, 47]
Disney     124        333        28           9
4Area      26,144     108,550    4            50
ARXIV      856        2,660      30           19
IMDb       862        4,388      21           30
Enron      13,533     176,967    18           40
Patents    100,000    188,631    5            150

edges. Vertices represent movies and edges represent their co-purchase relationships. Each movie has 28 attributes.
• 4Area [34]: This network is a co-authorship network of computer science authors. The attributes represent the relevance scores of the publications of an author to the conferences. The categories of conferences are "databases", "data mining", "information retrieval", and "machine learning". The network has 26,114 vertices and 108,550 edges.
• ARXIV [11]: This network is a citation network whose vertices represent papers and edges represent citation relationships. Attributes denote how often a specific keyword appears in the abstract of the paper. The network has 856 vertices, 2,660 edges, and 30 attributes.
• IMDb [11]: This network is extracted from the Internet Movie Database. Each vertex represents a movie with at least 200 rankings and an average ranking of at least 6.5. Two movies are connected if they have the same actors. Attributes denote 21 movie genres. The network has 862 vertices and 4,388 edges.
• Enron [36]: This network is the communication network with email transmissions as edges between email addresses. Each vertex has 18 attributes which describe aggregated information about average content length, average number of recipients, or the time range between two mails. The network has 13,533 vertices and 176,967 edges.
• Patents [11]: This network is a citation network of patents with 100,000 vertices, 188,631 edges, and five attributes, which are "assignee code", "claims", "patent class", "year" and "country".
Since the real-world graphs do not have a ground truth, we use the proposed AU, GU, and Compactness as cluster quality measures. The lower the scores of these three measures, the higher the cluster quality. In addition to these three measures, we also report the Normality [32, 33] score. The Normality score is a generalization of Newman's modularity and assortativity [29, 30] to attributed graphs. Normality measures both the internal consistency and external separability of an attributed graph cluster. The higher the Normality score, the better the cluster quality. Note that the Normality score can be negative. We give the average scores over 50 runs in Table 2 and Table 3, each time with a randomly sampled vertex as the seed vertex. For SG-Pursuit, UNCut and AGC, we give them the same number of clusters as used in [11, 47]. For each seed vertex, we first decide which cluster contains it and then compute the scores of these measures.
We can see from Table 2 that LOCLU achieves the best AU and Compactness scores on all six real-world datasets. On the dataset Disney, AMEN achieves the best GU score. Table 3 shows the Normality score of each method. We can see that LOCLU has the best Normality score on four datasets. On the dataset Disney, UNCut has the best Normality score. On the dataset Patents, FocusCO has the best Normality score. For case studies, we interpret the results of LOCLU and its competitor FocusCO on the Disney and 4Area datasets. For FocusCO, we set the entries in β that correspond to the designated attributes to one and the other entries to zero.
Disney. Disney is a subgraph of the Amazon co-purchase network. Each movie (vertex) is described by 28 attributes, such as "Average Vote", "Product Group", and "Price". Given the seed vertex and one designated attribute, "Amazon Price", we want to find a local cluster concentrating on this seed vertex and the designated attribute. All the 15 vertices in Figure 9 are read-along movies that are rated as PG (Parental Guidance Suggested) and attributed as "Action & Adventure", e.g., "Spy Kids", "Inspector Gadget" and "Mighty Joe Young". We show the local clusters detected by LOCLU and its competitor FocusCO in Figure 9. In Figure 9, the vertex in red is the given seed vertex, and the vertices in blue are the detected vertices. Figure 9(a) shows the local cluster detected by LOCLU. The GU score is 0.087 and the AU score is 0.110. The Compactness score is 0.197. The Normality score is -1.454. Figure 9(b) shows the local cluster detected by FocusCO. The GU score is 0.050 and the AU score is 0.100. The Compactness score is 0.150. The Normality score is -1.711. FocusCO is better than LOCLU when considering the Compactness score. LOCLU is superior to FocusCO when considering the Normality score.

Figure 9: Local clusters found in the Disney dataset by LOCLU and FocusCO. ((a) LOCLU; (b) FocusCO.)

4Area. 4Area is a co-authorship network of computer science authors. The attributes represent the relevance scores of the publications of an author to the conferences "databases", "data mining", "information retrieval", and "machine learning". Given the seed vertex Jiawei Han and two attributes, "data mining" and "machine learning", we want to find a local cluster concentrating on this seed vertex and the two designated attributes. Figure 10(a) shows the main component of the local cluster detected by LOCLU that includes the given seed vertex (in red) and the detected vertices (in blue). This subgraph has 37 authors and is unimodal in the two designated attributes "data mining" and "machine learning". This subgraph contains authors that belong to the "data mining" field, but not

Table 2: The AU/GU/Compactness scores of each method on the real-world attributed graphs. N/A means the results are not available because the method: 1) is not applicable on the unconnected graphs, 2) runs out of memory, or 3) does not finish in a week.

Algorithms   Disney              4Area               ARXIV               IMDb                Enron               Patents
LOCLU        0.010/0.062/0.072   0.002/0.009/0.011   0/0.027/0.027       0/0.015/0.015       0.006/0.001/0.007   0/0.001/0.001
FocusCO      0.012/0.080/0.092   0.021/0.075/0.096   0.074/0.055/0.129   0.024/0.079/0.103   0.088/0.001/0.089   0.012/0.067/0.079
SG-Pursuit   0.088/0.087/0.167   N/A                 0.075/0.074/0.149   0.051/0.058/0.109   0.126/0.001/0.127   N/A
UNCut        0.094/0.079/0.173   0.175/0.009/0.184   0.172/0.051/0.223   0.148/0.023/0.171   0.149/0/0.149       0.162/0.012/0.174
AMEN         0.071/0.055/0.126   0.022/0.033/0.055   N/A                 0.106/0.060/0.166   N/A                 N/A
AGC          0.138/0.094/0.232   0.008/0.011/0.019   0.138/0.057/0.195   0.132/0.029/0.161   0.156/0.001/0.157   0.007/0.007/0.014
HK (uw)      0.123/0.070/0.193   0.050/0.078/0.128   0.168/0.050/0.218   0.143/0.019/0.162   0.167/0.002/0.169   0.027/0.031/0.058
HK (w)       0.123/0.070/0.193   0.050/0.078/0.128   0.168/0.050/0.218   0.143/0.019/0.162   0.167/0.002/0.169   0.027/0.031/0.058

Table 3: The Normality score of each method on the real-world attributed graphs. N/A means the results are not available because the method: 1) is not applicable on the unconnected graphs, 2) runs out of memory, or 3) does not finish in a week.

Algorithms   Disney   4Area    ARXIV    IMDb     Enron    Patents
LOCLU        -1.326   -0.420   -0.567    0.013   -0.800   -1.001
FocusCO      -1.567   -0.760   -0.832   -0.979   -1.000   -0.920
SG-Pursuit   -1.898   N/A      -0.914   -0.979   -0.999   N/A
UNCut        -1.182   -1.000   -0.999   -0.996   -1.000   -1.001
AMEN         -2.403   -0.974   N/A      -0.990   N/A      N/A
AGC          -1.235   -1.000   -0.987   -0.998   -1.000   -1.000
HK (uw)      -1.687   -0.780   -0.858   -0.958   -0.926   -1.001
HK (w)       -1.687   -0.780   -0.858   -0.958   -0.926   -1.001

the "machine learning" field. It contains authors such as Jian Pei, Philip S. Yu, Hans-Peter Kriegel, and Christos Faloutsos, who focus primarily on "data mining". The GU score is 0.046 and the AU score is 0.052. The Compactness score is 0.098. The Normality score is -1.424. The cluster detected by FocusCO (shown in Figure 10(b)) has 134 authors. This local cluster is not unimodal in the designated attribute "data mining". The cluster contains authors such as Chu Xu and Liping Wang, who focus primarily on "information retrieval", and authors such as Joseph C. Pemberton and Zhao Xing, who focus primarily on "machine learning". The GU score is 0.018 and the AU score is 0.110. The Compactness score is 0.128. The Normality score is -2.000. Thus, the subgraph shown in Figure 10(a) has a higher quality than that shown in Figure 10(b).

5 RELATED WORK AND DISCUSSION

5.1 Plain Graph Clustering
Plain graphs are those graphs whose vertices have no attributes. Clustering on plain graphs has been well studied in the literature. METIS [17] and spectral clustering [31, 39, 46] are typical and widely used methods, which compute a k-way partitioning of a graph. METIS is a multi-constraint graph partitioning method based on the multilevel graph partitioning paradigm. Spectral clustering aims to partition the graph into k subgraphs such that the normalized cut criterion is minimized. Instead of optimizing the normalized cut criterion, MODULE [30] optimizes a quality function known as "modularity" over the possible divisions of a graph. The authors showed that modularity was superior to the normalized cut criterion in the task of community detection. The Markov Cluster Algorithm (MCL) [40] is a fast and scalable graph clustering method that is based on the simulation of stochastic flow in graphs. Infomap [35] is an information-theoretic approach that uses the probability flow of random walks on a network as a proxy for information flows and decomposes the network into modules by the Minimum Description Length (MDL) principle. Attractor [37] automatically detects communities in a network by using the concept of distance dynamics, i.e., the network is treated as an adaptive dynamical system where each vertex interacts with its neighbors. CLMC (Cluster-driven Low-rank Matrix Completion) [38] performs community detection and link prediction simultaneously. It first decomposes the adjacency matrix of a graph into three additive matrices: a clustering matrix, a noise matrix and a supplement matrix. Then, community-structure and low-rank constraints are imposed on the clustering matrix to remove noisy edges between communities.

5.2 Attributed Graph Clustering
Differing from plain graph clustering, which groups vertices taking only the graph structure into account, attributed graph clustering detects clusters in which the vertices have dense edge connectivity and homogeneous attribute values. PICS [1] exploits the MDL principle to automatically decide the parameters to detect meaningful and insightful patterns in attributed graphs. SA-Cluster [52] first designs a unified neighborhood random walk distance to measure the vertex similarity on an augmented graph. It then uses k-medoids to partition the graph into clusters with cohesive intra-cluster structures and homogeneous attribute values.

Figure 10: Local clusters found in the 4Area dataset by LOCLU and FocusCO. To reduce clutter, we only show a subgraph of 4Area, which consists of the seed vertex and the detected vertices. ((a) LOCLU; (b) FocusCO; vertices are labeled with author names.)

BAGC [45] develops a Bayesian probabilistic model for attributed graphs. Clustering on attributed graphs is transformed into a probabilistic inference problem, which is then solved by an efficient variational method.
The above methods consider all attributes for clustering. However, the irrelevant attributes may contradict the graph structure. In this case, clusters only exist in a subset (subspace) of the attributes. For subspace clustering in attributed graphs, several methods have been proposed. SSCG [11] proposes the Minimum Normalized Subspace Cut and detects an individual set of relevant features for each cluster. It needs to update the subspace-dependent weight matrix in every iteration, which is very time-consuming. CDE [22] formulates community detection in attributed graphs as a nonnegative matrix factorization problem. It first develops a structural embedding method for the graph structure. Then, it integrates the community structure embedding matrix and the vertex attribute matrix for subsequent nonnegative matrix factorization. CDE is only applicable to graphs with nonnegative vertex attributes. UNCut [47] proposes the unimodal normalized cut to find cohesive clusters in attributed graphs. The detected cohesive clusters have densely connected edges and as many homogeneous (unimodal) attributes as possible. The homogeneity or unimodality of attributes is measured by the proposed unimodality compactness, which also exploits Hartigan's dip test. The dip test used in UNCut measures the unimodality of each attribute. However, in our method LOCLU, the dip test is used to generate the modal interval on which the local clustering technique is based. SG-Pursuit [6] is a generic and efficient method for detecting subspace clusters in attributed graphs. The main idea is to iteratively identify the intermediate solution that is close to optimal and then project it to the feasible space defined by the topological and sparsity constraints. SG-Pursuit needs to specify parameters such as the maximum number of vertices in the subspace cluster and the maximum size of the selected features, which are difficult to set for real-world datasets.
Recently, deep learning techniques have been adopted for attributed graph clustering. DAEGC [44] develops a graph attention-based autoencoder to effectively integrate both structure and attribute information for deep latent representation learning. Furthermore, soft labels for the graph representation are generated to supervise a self-training clustering process. The graph representation and self-training processes are unified in one framework. AGC [50] is an adaptive graph convolution method for attributed graph clustering. AGC first designs a k-order graph convolution that acts as a low-pass filter on vertex attributes to obtain smooth feature representations. Then, it utilizes spectral clustering to find clusters in the representation space.
Another research trend is to integrate anomaly detection into the clustering process. AMEN [32, 33] proposes a new quality measure called Normality for attributed neighborhoods, which utilizes the graph structure and attributes together to quantify both internal consistency and external separability. Normality is inspired by Newman's modularity and assortativity [29, 30]. Then, a community and anomaly detection algorithm that uses Normality is proposed to extract communities and anomalies in attributed graphs. Each community is assigned a few characterizing attributes. PAICAN [5] is a probabilistic generative model that jointly models the attribute and graph space, as well as the latent group assignments and anomaly detection. All the methods discussed above need to partition the whole graph structure to find clusters and cannot incorporate the user's preference into clustering.

5.3 Semi-supervised Graph Clustering
In many applications, people may only be interested in finding clusters near a target local region in the graph. The methods for plain graph and attributed graph clustering cannot be applied in such a scenario. Several recent methods [3, 20, 49] focus on using short random walks starting from a small seed set of vertices to find local clusters. There are also some proposals focusing on using graph diffusion methods to find local clusters, such as PPR [2], HK [18], PGDc [41], HOSPLOC [51], and MAPPR [48]. PPR [2] is an approximate method to compute the personalized PageRank vector, which is used for local graph partitioning. HK [18] is a local and deterministic method to accurately compute a heat kernel diffusion in a graph. There are also some methods based on spectral clustering and propagation for local cluster detection, such as [4, 12, 14, 21, 26].
However, all these methods are only applicable to the task of local clustering on plain graphs. To the best of our knowledge, there are only two methods focusing on local clustering on attributed graphs. FocusCO [34] incorporates the user's preference into graph mining and outlier detection. It identifies the relevance of vertex attributes that makes the user-provided exemplar vertices similar to each other. Then it reweighs the graph edges and extracts the focused clusters. FocusCO cannot infer the projection vector if the exemplar set has only one vertex. LOCLU can find a local cluster around a given seed vertex. If given a set of vertices whose designated attribute values follow a unimodal distribution, LOCLU can also work. However, if their designated attribute values follow a multimodal distribution, LOCLU cannot find a local cluster that includes all these vertices. Like other clustering methods, LOCLU also has limitations. For example, the univariate projection for the dip test may cause information loss in some cases. TCU-SA (Target Community Detection with User's Preference and Attribute Subspace) [25] first computes the similarities between vertices and then expands the query vertex set with its neighbors. Based on the expanded set, TCU-SA deduces the attribute subspace using an entropy method. Finally, the target community is extracted. The idea is very similar to that of FocusCO.

5.4 Community Search
Community search over attributed graphs in the database research field is also related to our work. Given an input set of query vertices Vq and their corresponding attributes, the task is to find a community containing Vq, in which vertices are densely connected and have homogeneous attributes. These methods include [9, 10, 15, 16]. CTC (closest truss community) [16] is proposed to find a connected k-truss subgraph that has the largest k, contains Vq, and has the minimum diameter. The problem is NP-hard and the authors develop a greedy algorithm to find a satisfying community. ATC (attribute truss community) [15] formulates community search on attributed graphs as finding attributed truss communities. The detected communities are connected and close k-truss subgraphs which contain Vq and have the largest attribute relevance score proposed by the authors. ACQ (attributed community query) [9, 10] develops the CL-tree index structure and three algorithms based on it for efficient attributed community search. The CL-tree is devised to organize the vertex attribute data in a hierarchical structure. The community search methods are only applicable to categorical attributes. The detected vertices have the same attribute values as those of the query vertices. For continuous attributes, they cannot search for a community that is unimodal in the subspace that is composed of the designated attributes. In addition, they are based on dense subgraph structures, such as quasi-clique, k-core, or k-truss, which are not commonly used in graph clustering.

6 CONCLUSION
In this work, we have proposed LOCLU for incorporating the user's preference into attributed graph clustering. Currently, very few methods can deal with this kind of task. To achieve the goal, we first propose a new quality measure called Compactness that measures the unimodality of both the graph structure and the subspace that is composed of the designated attributes of a local cluster. Then, we propose LOCLU to optimize the Compactness score. Empirical studies show that our method LOCLU is superior to the state of the art. In the future, we will further explore node embeddings for attributed graphs, which seamlessly integrate information from both the attributes and the graph structure.

ACKNOWLEDGMENT
The authors would like to thank the anonymous reviewers for their constructive and helpful comments. This work was supported partially by the U.S. National Science Foundation (grant # IIS-1817046) and by the U.S. Army Research Laboratory and the U.S. Army Research Office (grant # W911NF-15-1-0577).

REFERENCES
[1] Leman Akoglu, Hanghang Tong, Brendan Meeder, and Christos Faloutsos. 2012. PICS: Parameter-free identification of cohesive subgroups in large attributed graphs. In SDM. SIAM, 439–450.
[2] Reid Andersen, Fan Chung, and Kevin Lang. 2006. Local graph partitioning using pagerank vectors. In Foundations of Computer Science (FOCS). IEEE, 475–486.
[3] Reid Andersen and Kevin J Lang. 2006. Communities from seed sets. In WWW. ACM, 223–232.
[4] James P Bagrow and Erik M Bollt. 2005. Local method for detecting communities. Physical Review E 72, 4 (2005), 046108.
[5] Aleksandar Bojchevski and Stephan Günnemann. 2018. Bayesian robust attributed graph clustering: Joint learning of partial anomalies and group structure. In AAAI.
[6] Feng Chen, Baojian Zhou, Adil Alim, and Liang Zhao. 2017. A Generic Framework for Interesting Subspace Cluster Detection in Multi-attributed Networks. In ICDM. IEEE, 41–50.
[7] Anne Condon and Richard M Karp. 2001. Algorithms for graph partitioning on the planted partition model. Random Structures and Algorithms 18, 2 (2001), 116–140.
[8] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In SIGKDD, Vol. 96. 226–231.
[9] Yixiang Fang, Reynold Cheng, Yankai Chen, Siqiang Luo, and Jiafeng Hu. 2017. Effective and efficient attributed community search. The VLDB Journal 26 (2017), 803–828.
[10] Yixiang Fang, Reynold Cheng, Siqiang Luo, and Jiafeng Hu. 2016. Effective community search for large attributed graphs. PVLDB 9 (2016), 1233–1244.
[11] Stephan Günnemann, Ines Färber, Sebastian Raubach, and Thomas Seidl. 2013. Spectral Subspace Clustering for Graphs with Feature Vectors. In ICDM. IEEE, 231–240.
[12] Toke Jansen Hansen and Michael W Mahoney. 2014. Semi-supervised eigenvectors for large-scale locally-biased learning. JMLR 15, 1 (2014), 3691–3734.
[13] John A Hartigan and PM Hartigan. 1985. The dip test of unimodality. The Annals of Statistics (1985), 70–84.
[14] Kun He, Yiwei Sun, David Bindel, John Hopcroft, and Yixuan Li. 2015. Detecting overlapping communities from local spectral subspaces. In ICDM. IEEE, 769–774.

[15] Xin Huang and Laks VS Lakshmanan. 2017. Attribute-driven community search. In PVLDB, Vol. 10. PVLDB, 949–960.
[16] Xin Huang, Laks VS Lakshmanan, Jeffrey Xu Yu, and Hong Cheng. 2015. Approximate closest community search in networks. In VLDB, Vol. 9. VLDB Endowment, 276–287.
[17] George Karypis and Vipin Kumar. 1998. Multilevel algorithms for multi-constraint graph partitioning. In ACM/IEEE Conference on Supercomputing. IEEE, 1–13.
[18] Kyle Kloster and David F Gleich. 2014. Heat kernel based community detection. In SIGKDD. ACM, 1386–1395.
[19] Andreas Krause and Volkmar Liebscher. 2005. Multimodal projection pursuit using the dip statistic. Preprint-Reihe Mathematik 13 (2005).
[20] Jure Leskovec, Kevin J Lang, Anirban Dasgupta, and Michael W Mahoney. 2008. Statistical properties of community structure in large social and information networks. In WWW. ACM, 695–704.
[21] Yixuan Li, Kun He, David Bindel, and John E Hopcroft. 2015. Uncovering the small community structure in large networks: A local spectral approach. In WWW. 658–668.
[22] Ye Li, Chaofeng Sha, Xin Huang, and Yanchun Zhang. 2018. Community detection in attributed graphs: an embedding approach. In Thirty-Second AAAI Conference on Artificial Intelligence. 338–345.
[23] Frank Lin and William W Cohen. 2010. Power iteration clustering. In ICML. 655–662.
[24] Frank Lin and William W Cohen. 2010. A Very Fast Method for Clustering Big Text Datasets. In ECAI. IOS Press, 303–308.
[25] Haijiao Liu, Huifang Ma, Yang Chang, Zhixin Li, and Wenjuan Wu. 2019. Target Community Detection With User’s Preference and Attribute Subspace. IEEE Access 7 (2019), 46583–46594.
[26] Michael W Mahoney, Lorenzo Orecchia, and Nisheeth K Vishnoi. 2012. A local spectral method for graphs: With applications to improving graph partitions and exploring data graphs locally. JMLR 13, Aug (2012), 2339–2365.
[27] Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. 2010. Introduction to information retrieval. Natural Language Engineering 16, 1 (2010), 100–103.
[28] Samuel Maurus and Claudia Plant. 2016. Skinny-dip: Clustering in a Sea of Noise. In SIGKDD. ACM, 1055–1064.
[29] Mark EJ Newman. 2002. Assortative mixing in networks. Physical Review Letters 89, 20 (2002), 208701.
[30] Mark EJ Newman. 2006. Modularity and community structure in networks. Proceedings of the National Academy of Sciences 103, 23 (2006), 8577–8582.
[31] Andrew Y Ng, Michael I Jordan, Yair Weiss, et al. 2001. On spectral clustering: Analysis and an algorithm. In NIPS, Vol. 14. 849–856.
[32] Bryan Perozzi and Leman Akoglu. 2016. Scalable anomaly ranking of attributed neighborhoods. In Proceedings of the 2016 SIAM International Conference on Data Mining. SIAM, 207–215.
[33] Bryan Perozzi and Leman Akoglu. 2018. Discovering communities and anomalies in attributed graphs: Interactive visual exploration and summarization. TKDD 12, 2 (2018), 24.
[34] Bryan Perozzi, Leman Akoglu, Patricia Iglesias Sánchez, and Emmanuel Müller. 2014. Focused clustering and outlier detection in large attributed graphs. In SIGKDD. ACM, 1346–1355.
[35] Martin Rosvall and Carl T Bergstrom. 2008. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105, 4 (2008), 1118–1123.
[36] Patricia Iglesias Sánchez, Emmanuel Müller, Fabian Laforet, Fabian Keller, and Klemens Böhm. 2013. Statistical Selection of Congruent Subspaces for Mining Attributed Graphs. In ICDM. IEEE, 647–656.
[37] Junming Shao, Zhichao Han, Qinli Yang, and Tao Zhou. 2015. Community detection based on distance dynamics. In SIGKDD. ACM, 1075–1084.
[38] Junming Shao, Zhong Zhang, Zhongjing Yu, Jun Wang, Yi Zhao, and Qinli Yang. 2019. Community detection and link prediction via cluster-driven low-rank matrix completion. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press, 3382–3388.
[39] Jianbo Shi and Jitendra Malik. 2000. Normalized cuts and image segmentation. IEEE TPAMI 22, 8 (2000), 888–905.
[40] Stijn Van Dongen. 2000. A cluster algorithm for graphs. Report-Information Systems 10 (2000), 1–40.
[41] Twan Van Laarhoven and Elena Marchiori. 2016. Local network community detection with continuous optimization of conductance and weighted kernel k-means. JMLR 17, 147 (2016), 1–28.
[42] Ulrike Von Luxburg. 2007. A tutorial on spectral clustering. Statistics and Computing 17, 4 (2007), 395–416.
[43] Dorothea Wagner and Frank Wagner. 1993. Between min cut and graph bisection. In International Symposium on Mathematical Foundations of Computer Science. Springer, 744–750.
[44] Chun Wang, Shirui Pan, Ruiqi Hu, Guodong Long, Jing Jiang, and Chengqi Zhang. 2019. Attributed Graph Clustering: A Deep Attentional Embedding Approach. IJCAI (2019).
[45] Zhiqiang Xu, Yiping Ke, Yi Wang, Hong Cheng, and James Cheng. 2012. A model-based approach to attributed graph clustering. In SIGMOD. ACM, 505–516.
[46] Wei Ye, Sebastian Goebl, Claudia Plant, and Christian Böhm. 2016. Fuse: Full spectral clustering. In SIGKDD. ACM, 1985–1994.
[47] Wei Ye, Linfei Zhou, Xin Sun, Claudia Plant, and Christian Böhm. 2017. Attributed Graph Clustering with Unimodal Normalized Cut. In ECML PKDD. Springer, 601–616.
[48] Hao Yin, Austin R Benson, Jure Leskovec, and David F Gleich. 2017. Local higher-order graph clustering. In SIGKDD. ACM, 555–564.
[49] Yuchen Bian, Yaowei Yan, Wei Cheng, Wei Wang, Dongsheng Luo, and Xiang Zhang. 2018. On Multi-Query Local Community Detection. In ICDM. IEEE, 9–18.
[50] Xiaotong Zhang, Han Liu, Qimai Li, and Xiao-Ming Wu. 2019. Attributed Graph Clustering via Adaptive Graph Convolution. IJCAI (2019).
[51] Dawei Zhou, Si Zhang, Mehmet Yigit Yildirim, Scott Alcorn, Hanghang Tong, Hasan Davulcu, and Jingrui He. 2017. A local algorithm for structure-preserving graph cut. In SIGKDD. ACM, 655–664.
[52] Yang Zhou, Hong Cheng, and Jeffrey Xu Yu. 2009. Graph Clustering Based on Structural/Attribute Similarities. PVLDB 2, 1 (2009), 718–729.