Real-Time Community Detection in Large Social Networks on a Laptop
Total Page:16
File Type:pdf, Size:1020Kb
Real-Time Community Detection in Large Social Networks on a Laptop Ben Chamberlain Clive Humby Marc Peter Deisenroth Department of Computing Starcount Department of Computing Imperial College London 2 Riding House Street Imperial College London London, UK London, UK London, UK ABSTRACT queries reveal related products or users; marketing, where For a broad range of research, governmental and commercial companies seek to optimise advertising channels or celebrity applications, it is important to understand the allegiances, endorsement portfolios. To meet these emergent needs an- communities and structure of key players in society. One alysts must explore vast networks. Our work demonstrates promising direction towards extracting this information is an alternative monetisation channel for Digital Social Net- to exploit the rich relational data in digital social networks works (DSN)s; that social media data can be used in mul- (the social graph). As social media data sets are very large, tiple industries for decision support. Decision support does most approaches make use of distributed computing systems not disrupt user experience in the way that sponsored links for this purpose. Distributing graph processing requires or feed advertising do. It offers another channel for social solving many difficult engineering problems, which has lead media providers to monetise their data, helping them to con- some researchers to look at single-machine solutions that are tinue to provide free services that are valued by billions of faster and easier to maintain. In this article, we present a people globally. single-machine real-time system for large-scale graph pro- As a sketch of how the system is used we imagine a com- cessing that allows analysts to interactively explore graph pany that wants to enter a new foreign market. To do structures. The key idea is that the aggregate actions of this they need to understand the market's competitors, cus- large numbers of users can be compressed into a data struc- tomers and marketing channels. Using our system they ture that encapsulates user similarities while being robust could input the Twitter handles for their existing products, to noise and queryable in real-time. We achieve single ma- key people, brands and endorsers and in real-time receive the chine real-time performance by compressing the neighbour- most similar accounts to their company in that market. The hood graph of each vertex using minhash signatures and output is structured into groups such as magazines, sports- facilitate rapid queries through Locality Sensitive Hashing. people, companies in the same industry etc. They could These techniques reduce query times from hours using in- then examine the results and explore different regions by dustrial desktop machines operating on the full graph to changing the input accounts. milliseconds on standard laptops. Our method allows ex- Throughout this paper we refer to graphs. In this context ploration of strongly associated regions (i.e. communities) a graph is a collection of vertices (social media accounts) and of large graphs in real-time on a laptop. It has been deployed edges connecting them. We are particularly interested in the in software that is actively used by analysts to understand neighbourhood graph, which is often used to compare ver- social network structure. tices and has been shown to have attractive properties in this regard [24, 23]. The neighbourhood graph of a vertex con- sists of the set of all vertices that are directly connected to 1. INTRODUCTION it, irrespective of the edge direction. We propose that robust Algorithms to discover groups of associated entities from associations between social network accounts can be reached relational (networked) data sets are often called community by considering the similarity of their neighbourhood graphs. detection methods. They come in two forms: global meth- This proposition relies on the existence of homophily in so- ods, which partition the entire graph, and local methods cial networks. The homophily principle states that people that look for vertices that are related to an input vertex with similar attributes are more likely to form relationships and only work on a small part of the graph. We are con- [16]. Accordingly social media accounts with similar neigh- cerned with community detection on large graphs that runs bourhood graphs are likely to have similar attributes. on a single commodity computer. To achieve this we com- We seek to build a system that is robust to failure, does bine the two approaches using local community detection to not require engineering support and is parsimonious with identify an interesting region of a graph and then applying the time of its users. The first two requirements lead us global community detection to help understand that region. to search for a single machine solution and the later re- Applications include: security, where analysts explore a quirement prescribes a real-time (or as close as possible) network looking for groups of potential adversaries; social system. To produce a real-time single-machine system that sciences, where queries can establish the important relation- operates on large graphs we must solve two problems: (1) ships between individuals of interest; e-commerce, where Fit the graph in the memory of a single (commodity) ma- chine; (2) Calculate the similarity of neighbourhood graphs containing up to 100m verices in milliseconds. We address these problems by compressing the neighbourhood graphs 3. RELATED WORK into fixed-length minhash signatures. Minhash signatures Existing approaches to large scale, efficient, community vastly reduce the size of the graph while encoding an effi- detection have three flavours: More efficient community de- cient estimation of the Jaccard similarity between any two tection algorithms, innovative ways to perform processing on neighbourhoods. Choosing appropriate length minhash sig- large graphs and data structures for graph compression and natures squeezes the graph into memory. To achieve real- search. Table 1 shows related approaches to this problem time querying we use the elements of the minhash signatures and which constraints they satisfy. to build a Locality Sensitive Hashing (LSH) data structure. LSH facilitates querying of similar accounts in constant time. Table 1: Comparison of related work. SCM is Single Com- A combination of minhashing and LSH allows analysts to modity Machine and MO is Modularity Optimisation enter a set of accounts and receive the set of most related accounts in milliseconds. Our contributions are: Method Real-time Large graphs SCM MO [18] 7 7 3 1. We establish that robust associations between social me- WALKTRAP [20] 7 7 3 dia users can be determined by means of the Jaccard INFOMAP [22] 7 7 3 similarity of their neighbourhood graphs. Louvain method [3] 7 3 3 BigClam [27] 7 3 3 2. We show that the approximations implicit in minhash- Graphci [14] 7 3 3 Twitter WTF [9] 3 3 7 ing and LSH minimally degrade performance and allow Our Method 3 3 3 querying of very large graphs in real time. There is an enormous community detection literature and 3. System design and evaluation: We have designed and Fortunato [8] provides an excellent overview. Approaches evaluated an end-to-end Python system for extracting can be broadly categorised into local and global methods. data from social media providers, compressing the data Global methods assign every vertex to a community, usu- into a form where it can be efficiently queried in real time. ally by partitioning the vertices. Many highly innovative schemes have been developed to do this [18, 22, 20]. Most 4. We demonstrate how queries can be applied to a range global methods do not easily scale to very large data sets. of problems in graph analysis, e.g., understanding the The availability of data from the Web, DSNs and services structure of industries, allegiances within political parties like Wikipedia led to a focus on methods for large graphs [3, and the public image of a brand. 27]. However, there are no real-time algorithms that could facilitate interactive exploration of social networks. 2. DATA Unlike global community detection methods, local algo- rithms do not assign every vertex to a community. Instead In this article, we focus on Twitter data because Twitter they find vertices in the same community as a set of input is the most widely used DSN for academic research. The vertices (seeds). For this reason they are normally faster Twitter Follower graph consists of roughly one billion ver- than global methods. Kloumann et al. [13] conducted a tices and 30 billion edges. To show that our method gener- comprehensive assessment of local community detection al- alises to other social networks, we also present results using gorithms on large graphs and identified Personal PageRank a Facebook Pages engagement graph containing 450 million (PPR) [11] as the clear winner. Recent advances have shown vertices and 700 million edges (see Section 5). that seeding PPR with the neighbourhood graph can im- To collect Twitter data we use the REST API to crawl prove performance [24] and that PPR can be used to initi- the network identifying every account with more than 10,000 local 1 ate spectral methods with good results [25]. Random Followers and gather their complete follower lists. To opti- walk methods are usually evaluated on distributed comput- mize data throughput while remaining within the DSN rate ing resources (e.g., Hadoop). While distributed systems are limits we developed an asynchronous distributed network continually improving, they are not always available to an- crawler using Python's Twisted library [26]. Our data set 10 alysts, require skilled operators and have a typical overhead contains 675,000 such accounts with a total of 1:5 10 Fol- of several minutes per query. lowers, of which 7 108 were unique. We restrict the× data set × A complimentary approach to efficient community detec- to accounts with greater than 10,000 Followers (though 700 tion is to develop more efficient and robust graph process- million Twitter accounts are used to build the signatures) ing systems.