Efficient Semi-streaming Algorithms for Local Triangle Counting in Massive Graphs ∗

Luca Becchetti, “Sapienza” Università di Roma, Rome, Italy, [email protected]
Paolo Boldi, Università degli Studi di Milano, Milan, Italy, [email protected]
Carlos Castillo, Yahoo! Research, Barcelona, Spain, [email protected]
Aristides Gionis, Yahoo! Research, Barcelona, Spain, [email protected]

ABSTRACT

In this paper we study the problem of local triangle counting in large graphs. Namely, given a large graph $G = (V, E)$ we want to estimate as accurately as possible the number of triangles incident to every node $v \in V$ in the graph. The problem of computing the global number of triangles in a graph has been considered before, but to our knowledge this is the first paper that addresses the problem of local triangle counting with a focus on the efficiency issues arising in massive graphs. The distribution of the local number of triangles and the related local clustering coefficient can be used in many interesting applications. For example, we show that the measures we compute can help to detect the presence of spamming activity in large-scale Web graphs, as well as to provide useful features to assess content quality in social networks.

For computing the local number of triangles we propose two approximation algorithms, which are based on the idea of min-wise independent permutations (Broder et al. 1998). Our algorithms operate in a semi-streaming fashion, using $O(|V|)$ space in main memory and performing $O(\log |V|)$ sequential scans over the edges of the graph. The first algorithm we describe in this paper also uses $O(|E|)$ space in external memory during computation, while the second algorithm uses only main memory. We present the theoretical analysis as well as experimental results on massive graphs demonstrating the practical efficiency of our approach.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval

General Terms
Algorithms, Measurements

Keywords
Graph Mining, Semi-Streaming, Probabilistic Algorithms

∗ Luca Becchetti was partially supported by EU Integrated Project AEOLUS and by MIUR FIRB project N. RBIN047MH9: “Tecnologia e Scienza per le reti di prossima generazione”. Paolo Boldi was partially supported by the MIUR COFIN Project “Linguaggi formali e automi” and by EU Integrated Project DELIS.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD'08, August 24–27, 2008, Las Vegas, Nevada, USA. Copyright 2008 ACM 978-1-60558-193-4/08/08 ...$5.00.

1. INTRODUCTION

Graphs are a ubiquitous data representation that is used to model complex relations in a wide variety of applications, including biochemistry, neurobiology, ecology, social sciences, and information systems. Defining new measures of interest on graph data and designing novel algorithms that compute or approximate such measures on large graphs is an important task for analysing graph structures that reveal their underlying properties.

In this paper we study the problem of counting the local number of triangles in large graphs. In particular, we consider undirected graphs $G = (V, E)$, in which $V$ is the set of nodes and $E$ is the set of edges. For a node $u$ we define $S(u)$ to be the set of neighbors of $u$, that is, $S(u) = \{v \in V : e_{uv} \in E\}$, and let the degree of $u$ be $d_u = |S(u)|$. We are then interested in computing, for every node $u$, the number of triangles incident to $u$, defined as:

$$T(u) = \frac{1}{2} \left| \{ e_{vw} \in E : e_{uv} \in E,\ e_{uw} \in E \} \right|.$$

The problem of counting triangles also translates into computing the local clustering coefficient (also known as transitivity coefficient). For a node $u$, the local clustering coefficient is defined as $\frac{2T(u)}{d_u(d_u - 1)}$, that is, the ratio between the number of triangles and the largest possible number of triangles in which the node could participate.

Note that the problem of estimating the overall (global) number of triangles in a graph has been studied already, see e.g. [2, 10]; here we deal with the problem of estimating the (local) number of triangles of all the individual nodes in the graph simultaneously.

We motivate our problem definition by showing how the local triangle computation can be used in a number of interesting applications. Our first application involves spam detection: we show that the distribution of the local clustering coefficient can be an effective feature for automatic Web-spam detection. In particular, we study the distribution of the local clustering coefficient and the number of triangles in large samples of the Web. Results show that these metrics, in particular the former, exhibit statistical differences between normal and spam pages and are thus suitable features for the automatic detection of spam activity on the Web.

Next we apply our techniques to the characterization of content quality in a social network, in our case the Yahoo! Answers community. Following a suggestion from the study of social networks in reference [32], that the type and quality of content provided by the agents is related to the degree of clustering of their local neighborhoods, we perform a statistical analysis of answers provided by users, studying the correlation between the quality of answers and the local clustering of users in the social network.

In addition to the ones we consider, the efficient computation of the local number of triangles and local clustering coefficient can have a large number of other potential applications, ranging from the analysis of social or biological networks [29] to the uncovering of thematic relationships in the Web [16].

For computing the local number of triangles we propose two approximation algorithms, which rely on well established probabilistic techniques to estimate the size of the intersection of two sets and the related Jaccard coefficient [6, 8, 9]. Our algorithms use an amount of main memory in the order of the number of nodes, $O(|V|)$, and make $O(\log |V|)$ sequential scans over the edges in the graph.

Our first algorithm is based on the approach proposed in [7, 8, 9], which uses min-wise independent hash functions to compute a random permutation of an ordered set. In our case, this is the (labeled) set of nodes in the graph. In practice, to increase efficiency, instead of hash functions we simply use a random number generator to assign binary labels to nodes. Doing this can in principle lead to collisions (i.e., we might have subsets of nodes with the same label). We provide a quantitative analysis of this approach, characterizing the quality of the approximation in terms of the Jaccard coefficient and the role of collisions. A similar analysis had been sketched in [6].

We then propose a second algorithm that maintains one counter per node in main memory, as opposed to the first [...] (semi-streaming) approximation algorithms for counting triangles are described.

The rest of the paper is organized as follows. In the next section we review the related work and in Section 3 we introduce the model of computation and the notation that we will be using throughout the paper. Section 4 describes how to approximate the intersection of two sets using pairwise independent permutations, as described in [8]. Section 5 presents our first algorithm, and Section 6 the main-memory-only algorithm. The last section presents our conclusions and outlines future work.

2. RELATED WORK

Computing the clustering and the distribution of triangles is important to quantitatively assess the community structure of social networks [29] or the thematic structure of large, hyperlinked document collections, such as the Web [16].

There has been work on the exact computation of the number of triangles incident to each node in a graph [1, 3, 25]. The brute-force algorithm for computing the number of triangles simply enumerates all $\binom{|V|}{3}$ triples of nodes, and thus it requires $O(|V|^3)$ time. A more efficient solution for the local triangle counting problem is to reduce the problem to matrix multiplication, yielding an algorithm with running time $O(|V|^{\omega})$, where currently $\omega \le 2.376$ [14]. If in addition to counting one wants to list all triangles incident to each node in the graph, variants of the “node iterator” and “edge-iterator” algorithms can be used. A description and an experimental evaluation of those “iterator” algorithms can be found in [30]; however, their running time is $O(|V| d_{\max}^2)$ and $O(\sum_{v \in V} d_v^2)$, respectively. For the datasets we consider (a very large number of nodes, and high-degree nodes due to skewed degree distributions), such exact algorithms are not scalable, so in this paper we resort to approximation algorithms.

In [13] the authors propose a streaming algorithm that estimates the global number of triangles with high accuracy, using an amount of memory that decreases as the number of triangles increases. This result has been improved in [10]. We remark that, unlike [13, 10], in this paper we are interested in estimating the local clustering coefficient (and the number of triangles) for all vertices at the same time.

Min-wise independent permutations have been proposed by Broder et al.
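
As a point of reference for the exact “node iterator” approach mentioned in the related work, and for the definitions of $T(u)$ and the local clustering coefficient given in the introduction, the following is a minimal Python sketch. It is an exact in-memory baseline and not one of the approximation algorithms proposed in this paper. The function names `local_triangles` and `local_clustering` are hypothetical, the input is assumed to be a simple undirected graph given as an edge list, and returning 0.0 for nodes of degree below 2 is a convention assumed here, since the ratio is undefined in that case.

```python
from itertools import combinations


def local_triangles(edges):
    """Exact node-iterator count of T(u) for every node u.

    `edges` is an iterable of (u, v) pairs of an undirected simple graph
    (an assumed input format). Returns (neighbors, counts), where
    neighbors[u] is S(u) and counts[u] is T(u).
    """
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # self-loops cannot be part of a triangle
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    counts = {}
    for u, s_u in neighbors.items():
        # Test every unordered pair {v, w} of neighbors of u for adjacency;
        # this costs O(d_u^2) set lookups per node.
        counts[u] = sum(1 for v, w in combinations(s_u, 2) if w in neighbors[v])
    return neighbors, counts


def local_clustering(neighbors, counts):
    """Local clustering coefficient 2 T(u) / (d_u (d_u - 1)) for every node."""
    result = {}
    for u, s_u in neighbors.items():
        d = len(s_u)
        # Assumed convention: nodes with fewer than two neighbors get 0.0.
        result[u] = 2 * counts[u] / (d * (d - 1)) if d > 1 else 0.0
    return result


# A triangle 0-1-2 with a pendant node 3 attached to node 2.
neighbors, counts = local_triangles([(0, 1), (0, 2), (1, 2), (2, 3)])
coefficients = local_clustering(neighbors, counts)
# counts is {0: 1, 1: 1, 2: 1, 3: 0}; coefficients[2] is 1/3.
```

The sketch keeps the whole adjacency structure in memory and does work proportional to $\sum_{v \in V} d_v^2$, which is the cost that makes exact methods impractical on the skewed-degree graphs considered here.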