Ego-Centric Graph Pattern Census
Total Page:16
File Type:pdf, Size:1020Kb
Ego-centric Graph Pattern Census Walaa Eldin Moustafa, Amol Deshpande, Lise Getoor Department of Computer Science, University of Maryland College Park, USA fwalaa,amol,[email protected] is_friend M M is_friend Abstract—There is increasing interest in analyzing networks M M of all types including social, biological, sensor, computer, and S transportation networks. Broadly speaking, we may be inter- is_married is_married S is_married is_married ested in global network-wide analysis (e.g., centrality analysis, community detection) where the properties of the entire network F F S S is_friendF F S S are of interest, or local ego-centric analysis where the focus is is_friend is_married is_married on studying the properties of nodes (egos) by analyzing their neighborhood subgraphs. In this paper we propose and study (a) (b) ego-centric pattern census queries, a new type of graph analysis A A A A A query, where a given structural pattern is searched for in every node’s neighborhood and the counts are reported or used in further analysis. This kind of analysis is useful in many domains B B B B B in social network analysis including opinion leader identification, node classification, link prediction, and role identification. We propose an SQL-based declarative language to support this class C C C C C of queries, and develop a series of efficient query evaluation Coordinator Consultant Gatekeeper Representative Liaison algorithms for it. We evaluate our algorithms on a variety of synthetically generated graphs. We also show an application (c) of our language in a real-world scenario for predicting future Fig. 1. (a) Pattern that captures two couples that are friends with each collaborations from DBLP data. other – such a pattern may be useful in a targeted marketing application; (b) Example pattern used in the node classification application; (c) Different I. INTRODUCTION brokerage patterns – the colors denote organizations, and the function of the broker (the middle node) depends on the organizations that the three nodes Network analysis is an area of growing importance in a belong to (e.g., B is a coordinator if all three are in the same organization). variety of domains including the social sciences, biology, communications, ecology, finance, the internet, law enforce- nodes (for a given k), or in subgraphs defined by intersections ment, national security, social media and others. Most of the or unions of k-hop neighborhoods of pairs of nodes; k-hop work in network analysis focuses on either (a) global macro- neighborhood of a node n is defined to be the incident level analysis characterizing properties of the entire network subgraph on the nodes reachable from n in k hops or less. We and communities (subgraphs) within the network, or (b) local refer to this type of query as an ego-centric pattern census. micro-level analysis characterizing properties of the actors We illustrate the need for such analysis using a few motivating (nodes) in the network. Macro-level analysis is important applications: for characterizing networks and understanding the evolution of networks; examples of such analysis include measuring Targeted Marketing: Viral marketing is proving to be an structural properties like degree distribution, diameter, and effective tool for product advertisement. Selected consumers graph cohesion [30], and discovery of patterns or motifs in the are given a product with the hope that they will like the network [28], [27], [6], [7], [21], [38]. Micro-level analysis fo- product and recommend it to their friends. These consumers cuses instead on measuring properties of the actors in network, must be chosen wisely to minimize the cost and maximize e.g., degree centrality, second-order degree, local clustering the benefits of advertising. Simple criteria such as picking coefficient etc. This is often called ego network analysis, consumers with the most friends, or consumers that are because it looks at the individual actors (egos) and their neigh- connected to many other consumers through short paths, bors (called alters) in the network. Although global network are typically used. However the ability to identify richer analysis is well-understood and efficient computational tools structures is desirable in many cases. For example, a travel are well-developed, similar tools are not yet available for the agency may wish to identify couples that have either the harder problem of ego-centric network analysis that requires largest number of couples in their combined network, or the analyzing a very large number of small, largely overlapping largest number of couple pairs, i.e., couples who are friends networks. with couples. The latter structure is depicted in Figure 1(a). In this paper, we address one such problem in ego-centric Node Classification: In collective classification [32], a node’s analysis, namely counting the number of structural patterns neighborhood is used to predict the node’s own class label. or motifs that occur either in the k-hop neighborhoods of the For example, in a collaboration network, a scientist who collaborates mostly with scientists from a specific field (e.g., motifs that a node is part of) has also been recently used as databases or software engineering) is likely to be from a tool to classify a node’s role. For example, Kerrebroeck et the same field. In a family relationship network (with “is al. [39] count the number of loops a node is part of, and married to” and “is parent of” relationships), for each child use it as a measure to quantify the node’s importance in the we may wish to count the number of relatives up to 3 hops network. Pruzlj¨ [31] proposes the notion of local graphlet away who are smokers (or obese), with their parents also degree distribution as a means for network comparison. being smokers (or obese). This could be a measure of the Several ego-centric measures can be expressed as ego- risk of being a smoker (or obese) for the child, and can be centric pattern census queries with very simple input patterns, used for predicting that risk. The pattern of this query is and can be seen as special cases of ego-centric pattern cen- depicted in Figure 1(b). sus. For example, an ego-centric measure like node degree Structural Balance: In social balance theory, signed net- corresponds to searching for a pattern of a node in a 1-hop works are networks that have positive and negative signs on neighborhood, and clustering coefficient (and its k-clustering their links, denoting whether a link expresses a positive tie coefficient version [22]) can be expressed in terms of counting (e.g., friendship) or a negative tie (e.g., foe) [10]. In signed edge patterns in 1-neighborhood (and k-neighborhood, respec- networks, triangles with an odd number of negative links tively). Similarly, Jaccard coefficient of a pair of nodes can (one or three) are considered unstable. In such networks, be computed by counting a node pattern in the nodes’ 1- we can measure the amount of instability in each node’s neighborhood intersection and union. ego network by counting the number of unstable triangles Our approach and contributions: In this work we investi- in its k-hop neighborhood. gate ego-centric pattern census queries, and develop efficient Brokerage Analysis: In organizational theory, management techniques for executing them. Our key contributions include: scientists are interested in the roles different individuals play, • We introduce a flexible declarative SQL-based query lan- both within an organization, and across organizations. For guage for specifying ego-centric pattern census queries. example, in a transaction network, the middle node in a Our query language allows users to specify the neighbor- directed triad (e.g., node B in the triad A ! B ! C) is hood size (k), the set of focal nodes (or pairs of nodes), called a coordinator if all three are in the same organization, and the pattern to be counted. or a gatekeeper if A is in a different organization than B • We propose an efficient graph pattern matching algo- and C. Figure 1(c) shows the different brokerage types. A rithm, and show that it outperforms GraphQL [20], a brokerage score of a given type can then be computed for recent graph pattern matching system. a node by counting the number of patterns of that type in • We introduce two query evaluation algorithms for the which the node occurs as the middle node. ego-centric census queries, one based on searching from Graph Indexing: Finally, counts of specific structural pat- nodes to patterns (node-driven) and another based on terns in every node’s k-hop neighborhood in a large graph searching from patterns to nodes (pattern-driven). are regarded as node signatures and are often used for • We empirically evaluate our algorithms on a variety of subgraph pattern matching to prune the search space [43]. real-world and synthetic data. Our algorithms can be used to efficiently build sophisticated Outline: The rest of paper is organized as follows. In Sec- signatures, that can be used when searching for large, tion II we present the data model and our language specifica- complex subgraphs. tion. In Section III, we propose an efficient pattern matching algorithm that is used as part of the algorithms in Section IV The problem of ego-centric pattern census combines elements that we propose for evaluating pattern census queries. In from micro-level ego-centric network analysis, subgraph pat- Section V we discuss experimental performance evaluation tern matching, and motif counting. The goal of subgraph through synthetic datasets and workloads. In addition, we pattern matching is to find all matches of a given query graph use our language to design a link prediction experiment over in the database graph.