Structural Diversity and Homophily: a Study Across More Than One Hundred Big Networks
Total Page:16
File Type:pdf, Size:1020Kb
KDD 2017 Research Paper KDD’17, August 13–17, 2017, Halifax, NS, Canada Structural Diversity and Homophily: A Study Across More Than One Hundred Big Networks Yuxiao Dong∗ Reid A. Johnson Microso Research University of Notre Dame Redmond, WA 98052 Notre Dame, IN 46556 [email protected] [email protected] Jian Xu Nitesh V. Chawla University of Notre Dame University of Notre Dame Notre Dame, IN 46556 Notre Dame, IN 46556 [email protected] [email protected] ABSTRACT ACM Reference format: A widely recognized organizing principle of networks is structural Yuxiao Dong, Reid A. Johnson, Jian Xu, and Nitesh V. Chawla. 2017. Struc- tural Diversity and Homophily: A Study Across More an One Hundred homophily, which suggests that people with more common neigh- Big Networks. In Proceedings of KDD ’17, August 13–17, 2017, Halifax, NS, bors are more likely to connect with each other. However, what in- Canada., , 10 pages. uence the diverse structures embedded in common neighbors (e.g., DOI: hp://dx.doi.org/10.1145/3097983.3098116 and ) have on link formation is much less well-understood. To explore this problem, we begin by characterizing the structural diversity of common neighborhoods. Using a collection of 120 large- 1 INTRODUCTION scale networks, we demonstrate that the impact of the common neighborhood diversity on link existence can vary substantially Since the time of Aristotle, it has been observed that people “love across networks. We nd that its positive eect on Facebook and those who are like themselves” [2]. We now know this as the negative eect on LinkedIn suggest dierent underlying network- principle of homophily, which asserts that people’s propensity to ing needs in these networks. We also discover striking cases where associate and bond with others similar to themselves is what drives diversity violates the principle of homophily—that is, where fewer the formation of social relationships [19, 27]. But homophily applies mutual connections may lead to a higher tendency to link with each not only to shared traits and characteristics, as captured by this other. We then leverage structural diversity to develop a common principle, but to the network structure of our relationships as well. neighborhood signature (CNS), which we apply to a large set of net- Structural homophily holds that individuals with more friends works to uncover unique network superfamilies not discoverable in common are more likely to associate [15, 30]. is principle has by conventional methods. Our ndings shed light on the pursuit to been widely explored in network science [1, 24], where it has been understand the ways in which network structures are organized shown to be a strong driving force of link formation. However, and formed, pointing to potential advancement in designing graph structural homophily can be limiting in its explanation of the di- generation models and recommender systems. verse ways that may actually drive the process of link formation. While it accounts for similarity based on the number of common CCS CONCEPTS neighbors, structural homophily fails to account for the diverse ways in which these neighbors are embedded within the network. •Information systems →Social networks; •Human-centered We posit that structural diversity, as measured by the number of computing →Collaborative and social computing; •Mathematics connected components [39], may provide additional insight into of computing →Random graphs; •Applied computing →Sociology; how individuals are connected. For example, common social net- KEYWORDS work neighbors may arise from relatively weak connections such as a shared hometown, implying a more diverse common neigh- Network Diversity; Embeddedness; Network Signature; Triadic borhood, or relatively stronger connections such as being college Closure; Link Prediction; Social Networks; Big Data. classmates, implying a relatively less diverse common neighbor- structural diversity of ∗ hood. In this paper, we study this notion of is work was done when Yuxiao was a Ph.D. student at University of Notre Dame. the common neighborhood shared by two nodes, investigating how it aligns with the principle of homophily and whether it can shed Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed light on the process of link formation and network organization. for prot or commercial advantage and that copies bear this notice and the full citation Motivating example. Consider the scenarios presented in Fig- on the rst page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permied. To copy otherwise, or republish, ure 1. According to homophily, the probability that vi and vj know to post on servers or to redistribute to lists, requires prior specic permission and/or a each other given that they share four (b) common neighbors (CN), fee. Request permissions from [email protected]. is generally higher than when they share only three (a). Formally, KDD ’17, August 13–17, 2017, Halifax, NS, Canada. © 2017 ACM. 978-1-4503-4887-4/17/08...$15.00 P (eij =1 j #CN=4) > P (eij =1 j #CN=3), where eij =1 denotes the exis- DOI: hp://dx.doi.org/10.1145/3097983.3098116 tence of an edge e between vi and vj . A natural question that arises 807 KDD 2017 Research Paper KDD’17, August 13–17, 2017, Halifax, NS, Canada e t a v v r vi j vi j each representative of a particular superfamily. We nd that these e c en t superfamilies cannot be uncovered by subgraph frequencies [28] xis (a) (b) e and other properties, indicating that the CNS can serve to reveal e link v previously undiscovered mechanisms of network organization. In ti v v a i j v el R addition, we nd that classical random graph models are unable to generate synthetic graphs that have CNS similar to real networks, (c) (d) (e) despite their degree distribution, clustering coecient, and other Figure 1: Structural diversity of common neighborhoods. traditional network properties matching those of real networks. Two nodes vi and vj with three disconnected (a), four disconnected Investigating the cause of these dierences between superfam- (b), and four connected (c) common neighbors. (d) Dierent from ilies, we nd that they correspond to the ways in which these structural diversity in an ego-centric notion [39], we go beyond an networks are used to satisfy social or information needs. Some ego, and focus on the structural diversity of common neighborhoods networks, such as Friendster and Facebook, are more likely to be for two persons. (e) Structural diversity of common neighborhoods used for maintaining densely connected real-world relationships, aects the link existence rate. resulting in such networks exhibiting a positive relationship be- tween the diversity of the common neighborhood and the link is how two users’ common neighborhood—that is, the subgraph existence rate. By contrast, other networks, such as BlogCatalog structure of their common neighbors—inuences the probability and LinkedIn, are more likely to be used for acquiring new and that they form a link. For example, let us assume that vi and vj diverse information, resulting in a negative relationship between share four common neighbors. Are vi and vj more likely to connect the common neighborhood diversity and link existence rate. with each other if their four common neighbors do not know each Our ndings demonstrate that the structure of common neigh- other (i.e., (b) ), or if they all know each other (i.e., (c) )? borhoods can eectively capture driving forces intrinsic to the In essence, then, we are interested in the truth of the following formation of local network structures and global network orga- inequality: nizations. e results have important, practical implications for j j P (eij =1 ) ? P (eij =1 ) ? random graph model design, as well as recommendations in social Contribution. We study the structural diversity of common networks, such as “People You May Know (PYMK)” in Facebook. neighborhoods between two nodes and its impact on the proba- For example, by simply using structural diversity as an unsuper- bility that these nodes may be connected. e diversity measures vised link predictor, our case study nds a 57% improvement over the number of connected components that comprise the common the commonly used #CN predictor as measured by AUPR in both neighborhood, capturing the variable ways in which a given pair’s the Friendster and BlogCatalog Networks. common contexts may be composed. Our study considers more than one hundred large-scale networks from a wide range of do- 2 BIG NETWORK DATA mains, making this the largest empirical analysis done on networks to To comprehensively examine our proposed concept of the structural date. We rst consider three representative networks—Friendster, diversity of common neighborhoods, we have assembled a large BlogCatalog, and YouTube—to develop the intuition underlying collection of big network datasets from several platforms, including these ideas, and demonstrate the notion of common neighborhood (in alphabetical order): AMiner [37], ASU [45], KONECT [18], MPI- diversity and its impact on link existence. SWS [29], ND [4, 6], NetRep [35], Newman [31], and SNAP [36]. We discover, in Figure 1(e), that when the size of the common In total, we have compiled a set of 120 large-scale undirected and neighborhood is xed with the same diversity (i.e., #components), unweighted networks from the platforms listed above, including 80 the link existence rates given dierent common neighborhoods (e.g., real-world networks and 40 random graphs. We have cleaned the vs. , vs. , and vs. ) are relatively close. Further, networks as follows: For directed networks that have no reciprocal we nd an increase in the structural diversity of the neighborhood connections (5 citation networks), we convert each directed link negatively impacts the formation of links in Friendster—that is, into an undirected one.