Hierarchical Overlapping Clustering of Network Data Using Cut Metrics Fernando Gama, Santiago Segarra, and Alejandro Ribeiro
Total Page:16
File Type:pdf, Size:1020Kb
IEEE TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING OVER NETWORKS (ACCEPTED) 1 Hierarchical Overlapping Clustering of Network Data Using Cut Metrics Fernando Gama, Santiago Segarra, and Alejandro Ribeiro Abstract—A novel method to obtain hierarchical and overlap- degrees that may be present. Hierarchical clustering solves ping clusters from network data – i.e., a set of nodes endowed this issue by providing a collection of partitions, indexed with pairwise dissimilarities – is presented. The introduced by a resolution parameter, that can be set by the user to method is hierarchical in the sense that it outputs a nested collection of groupings of the node set depending on the resolution determine the extent up to which nodes are considered similar or degree of similarity desired, and it is overlapping since it or different [9]. In other words, each partition in the collection allows nodes to belong to more than one group. Our construction reflects a different degree of similarity between the nodes, is rooted on the facts that a hierarchical (non-overlapping) ranging from considering all nodes different, to considering clustering of a network can be equivalently represented by all nodes similar. Examples of hierarchical clustering methods a finite ultrametric space and that a convex combination of ultrametrics results in a cut metric. By applying a hierarchical are UPGMA [10], Ward’s method [11], complete [12] or single (non-overlapping) clustering method to multiple dithered versions linkage [13]. of a given network and then convexly combining the resulting However, hierarchical methods still have a major limitation, ultrametrics, we obtain a cut metric associated to the network namely, that nodes are assigned to one and only one category of interest. We then show how to extract a hierarchical overlap- or cluster. There are many situations in which this assignment ping clustering structure from the aforementioned cut metric. Furthermore, the so-called overlapping function is presented does not constitute a reasonable approach. Specifically, some as a tool for gaining insights about the data by identifying nodes might inherently have traits of more than one group and meaningful resolutions of the obtained hierarchical structure. Ad- hence have similarities to multiple clusters of nodes that oth- ditionally, we explore hierarchical overlapping quasi-clustering erwise would be considered dissimilar [14], [15]. Two classes methods that preserve the asymmetry of the data contained in of methods have been proposed to overcome this issue. The directed networks. Finally, the presented method is illustrated via synthetic and real-world classification problems including first class corresponds to the so-called soft or fuzzy clustering handwritten digit classification and authorship attribution of methods that allow the allocation of a node to multiple clusters famous plays. by assigning to each node a membership degree or probability Index Terms—Clustering, Network theory, Cut metrics, Hier- of belonging to every cluster [16]. This can also be done in archical clustering, Covering, Dithering. a hierarchical fashion [17], [18]. Nevertheless, these methods subscribe to the idea that nodes have more affinity to either one or another subset, as illustrated by their membership degree; I. INTRODUCTION or that they belong to only one group but the setting of ONSIDER a dataset where each element can be rep- the problem is not rich enough to determine to which one, C resented as a node in a network with a pairwise dis- thus assigning a probability of belonging to different subsets. similarity function. In this setting, the general objective of Fundamentally, these methods do not contemplate the idea that clustering is to group those nodes that are more similar to a node can intrinsically be part in equal terms of more than each other than to the rest, according to the relationship one cluster. The second class of methods that attempt to solve established by the dissimilarity function [1], [2]. Clustering this issue is the class of overlapping clustering methods, which arXiv:1611.01393v2 [cs.SI] 12 Dec 2017 and its generalizations are ubiquitous tools since they are used perform a non-hierarchical deterministic assignment of nodes in a wide variety of fields such as psychology [3], social to more than one subset [19]. While overlapping methods network analysis [4], political science [5], neuroscience [6], enable nodes to belong to more than one cluster, they are still among many others [7], [8]. myopic – just like traditional clustering methods – to multiple Traditional clustering methods provide only one partitioning affinity resolutions. of the node set in such a way that each data point belongs In this paper, a hierarchical overlapping clustering method to one and only one block of the partition. An important is proposed. Our method outputs a collection of groupings limitation of these traditional clustering methods is that the where each of them corresponds to different resolutions of dataset may present a complex data structure at several reso- similarity between the nodes (hierarchical), while allowing lutions or levels of similarity, and outputting only one partition nodes in each grouping to deterministically belong to more may not be adequate in portraying the different grouping than one cluster (overlapping). Essentially, the method is de- rived by generalizing the concepts of ultrametrics, equivalence Work in this paper is supported by NSF CCF 1217963. F. Gama and relations, and dendrograms – typical of hierarchical (non- A. Ribeiro are with the Department of Electrical and Systems Engi- overlapping) clustering methods – to those of cut metrics neering, University of Pennsylvania. S. Segarra is with the Institute for Data, Systems, and Society, Massachusetts Institute of Technology. Emails: [20], [21], tolerance relations [22], and nested collections of [email protected], [email protected], [email protected]. coverings. More precisely, we generate a family of networks IEEE TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING OVER NETWORKS (ACCEPTED) 2 0 0 derived from dithering multiple times the dissimilarity function that is, we may have that AX (x; x ) 6= AX (x ; x). This is of a given network. We then proceed to apply a hierarchical particularly useful when modeling uneven levels of influence (non-overlapping) clustering method to each of the dithered among nodes. We denote by N the set of all possible networks. networks, resulting in a collection of ultrametrics. From this Definition 1. A partition PX = fB1;:::;Bmg is a collection collection we obtain a cut metric related to the network of m interest by means of a convex combination of the ultrametrics. of subsets of X such that [i=1Bi = X and Bi \ Bj = ; for Finally, based on the cut metric we construct a tolerance i 6= j, i; j = 1; : : : ; m. Denote by P the set of all possible relation that connects the nodes that are at a distance lower partitions. than a pre-specified resolution, which we then use to derive a Definition 2. An equivalence relation ∼ is a binary set covering of the node set. By considering all possible resolution relation that, for any x; x0; x00 2 X, satisfies the properties of levels, we obtain a nested collection of coverings. Hierar- reflexivity (x ∼ x), symmetry (x ∼ x0 if and only if x0 ∼ x), chical overlapping clustering methods have been previously and transitivity (if x ∼ x0 and x0 ∼ x00, then x ∼ x00). developed in the context of community detection, but only for undirected, unweighted networks [23], [24]. Observe that partitions (Def. 1) uniquely define equivalence 0 0 In Section II, the concepts of partition, equivalence relation, relations (Def. 2) as follows: x ∼ x if and only if x; x 2 Bi and dendrogram are defined, and their relations to both tradi- for some Bi 2 PX , i = 1; : : : ; m. Equivalently, equivalence tional and hierarchical (non-overlapping) clustering methods relations uniquely define partitions [28]. An equivalence re- are established. These concepts are then respectively general- lation obtained from a specific partition PX will be denoted ized to those of covering, tolerance relation, and nested collec- by ∼PX . tion of coverings. This generalization is achieved by dropping With these definitions in place, a (traditional) clustering the requirements that prevent a node from belonging to more method G can be defined as a structure-preserving map from than one group. Moreover, a formal definition of a hierarchical the set of networks to the set of partitions, G : N!P. overlapping clustering method is presented. Section III relates This map is structure preserving in the sense that the output the introduced concepts with concrete finite metric construc- partition is defined over the same node set X specific to the tions. More specifically, ultrametrics are defined and directly input network. That is, G(N) = PX = fB1;:::;Bmg for m linked to hierarchical clustering methods (Theorem 1). We N = (X; AX ) with [i=1Bi = X. Intuitively, a desirable then introduce cut metrics and explain how they can be used clustering method G is one in which the nodes contained in to obtain a nested collection of coverings (Theorem 2). The each subset Bi are determined by the dissimilarity function proposed algorithm to obtain cut metrics from a given network AX so that similar nodes are grouped together. (Algorithm 1), and hence nested collection of coverings, is In many situations, having only one partition as the output detailed in Section IV. The algorithm is grounded on the fact of a clustering method may not be appropriate. It might be that a convex combination of ultrametrics yields a cut metric desirable to output several partitions that depend on a resolu- (Proposition 1) and the concept of dithering [25], that can be tion parameter that specifies the affinity required between two used to obtain multiple noisy versions of the given network. nodes to be deemed as similar. Also, quasi-clustering methods that allow for a hierarchical structure and overlapping nodes are explored in Section V.