Graph Community Discovery Algorithms in Neo4j with a Regularization-Based Evaluation Metric
Total Page:16
File Type:pdf, Size:1020Kb
Graph Community Discovery Algorithms in Neo4j with a Regularization-based Evaluation Metric Andreas Kanavos, Georgios Drakopoulos and Athanasios Tsakalidis Computer Engineering and Informatics Department, University of Patras, Achaia 26504, Greece Keywords: CNM Algorithm, Community Discovery, Graph Databases, Graph Mining, Graph Signal Processing, Louvain Algorithm, Newman-Girvan Algorithm, Neo4j, Regularization, Walktrap Algorithm. Abstract: Community discovery is central to social network analysis as it provides a natural way for decomposing a social graph to smaller ones based on the interactions among individuals. Communities do not need to be disjoint and often exhibit recursive structure. The latter has been established as a distinctive characteristic of large social graphs, indicating a modularity in the way humans build societies. This paper presents the im- plementation of four established community discovery algorithms in the form of Neo4j higher order analytics with the Twitter4j Java API and their application to two real Twitter graphs with diverse structural properties. In order to evaluate the results obtained from each algorithm a regularization-like metric, balancing the global and local graph self-similarity akin to the way it is done in signal processing, is proposed. 1 INTRODUCTION los et al., 2015b) (Drakopoulos et al., 2016). To this end, a coherence metric balancing global and local Twitter is currently the most popular microblogging self-similarity properties with a rationale similar to platform and the stage for ongoing political, finan- the signal processing regularization criterion 2 2 cial, and cultural conversations with a vast amount K = x As + µ0 Bs , µ0 > 0 (1) of tweets being posted on a daily basis. Decom- k − k k k posing a Twitter social graph to communities yields which given a data vector x, possibly with noise and a deeper insight to these seemingly chaotic interac- outliers, computes a smoother version s thereof by tions. However, community discovery is by no means combining global and local patterns coded in matri- a trivial task. Besides the large volume of accounts, ces A and B respectively. µ0 (Drakopoulos and Mega- tweets, retweets, and hashtags that need to be exam- looikonomou, 2016) controls their contribution to s. Graph databases such as Neo4j1, GraphDB2, and ined, necessarily implying parallel or distributed pro- 3 cessing, the question of what constitutes a commu- BrightstarDB provide production grade front- or nity, although posed in easily understood terms, re- back-end graph storage. In addition, they also offer mains to be definitively answered. This does not im- graph analytics such as link prediction and minimum ply that no formal community definition exists. Quite spanning trees (Panzarino, 2014) (Robinson et al., the contrary, a plethora of such definitions has been 2013). Higher order analytics, such as community in fact proposed, for instance in (Carrington et al., discovery, constitute a significant addition as they of- 2005), (Fortunato, 2010), (Newman, 2010), which fer deeper insight in the graph structure. successfully capture crucial aspects of human social The primary contribution of this paper is twofold. organization. However, they differ in key aspects and, Four community discovery algorithms, namely the therefore, lead to different community detection algo- Newman-Girvan, the Walktrap, the Louvain, and the rithms. CNM were implemented in Java over Neo4j. More- over, the results of these algorithms applied to two Similarly, there are a number of ways to assess Twitter graphs created with Twitter4j4 are evalu- the clustering quality, namely community coherence. However, most of the existing coherence metrics 1www.neo4j.com are either prohibitively expensive, such as the maxi- 2www.ontotext.com mum distance between vertices, or are prone to out- 3www.brightstardb.com liers, such as the diameter-based metrics (Drakopou- 4http://twitter4j.org/en/index.html 403 Kanavos, A., Drakopoulos, G. and Tsakalidis, A. Graph Community Discovery Algorithms in Neo4j with a Regularization-based Evaluation Metric. DOI: 10.5220/0006382104030410 In Proceedings of the 13th International Conference on Web Information Systems and Technologies (WEBIST 2017), pages 403-410 ISBN: 978-989-758-246-2 Copyright © 2017 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved WEBIST 2017 - 13th International Conference on Web Information Systems and Technologies ated with a regularization-like criterion which is effi- topics and subsequently the authorities for each topic ciently computed and relies on the fundamental self- are identified. This was extended in (Pal and Counts, similarity property of scale-free graphs. 2011) with additional features, advanced clustering The rest of this paper is structured as follows. and real-time capabilities. In addition, a previous Section 2 provides an overview of community detec- work regarding influential communities identification tion algorithms. The main characteristics of graph is presented in (Kafeza et al., 2014). Finally, an over- databases are described in section 3. The inherent all and extensive overview of the community discov- high order nature of graph communities and four im- ery field is (Fortunato, 2010). plemented algorithms are outlined in section 4. Fi- Signal regularization is a common technique aim- nally, section 5 describes the datasets used in this pa- ing at deriving a smoother or cleaner version of per and the results obtained from executing the com- a data vector without altering the regions of inter- munity detection algorithms in Neo4j, whereas sec- est. It has numerous applications in signal process- tion 6 concludes by recapitulating the main findings ing (Drakopoulos and Megalooikonomou, 2016), ma- and exploring future research directions. chine learning (Girosi et al., 1995), system identi- fication (Johansen, 1997), and inverse problem the- Table 1: Paper notation. ory (Vogel, 2002), while it also has connections to Symbol Meaning Sobolev space theory (Adams and Fournier, 2003) =4 Definition or equality by definition and to reproducible kernel Hilbert space theory (At- touch and Aze,´ 1993). deg(vk) Degree of vertex vk Kn Complete graph with n vertices The interest in the graph processing field has been invigorated with the advent of open source graph (v1,...,vn) Path with vertices v1,...,vn databases such as Neo4j, GraphDB and BrightStar. sk Set containing elements sk { } Graph processing is usually implemented with the S or sk Cardinality of set S or sk | | |{ }| { } use of massive distributed graph computing systems sk Sequence of items sk h i like Google Pregel and graph based machine learning frameworks like GraphLab. In these systems, graphs play a twofold role as the data flow model and as the 2 RELATED WORK learning model. Community detection is related mostly to graph clus- tering (Scott, 2000), Web retrieval (Newman, 2010), and user influence (Carrington et al., 2005). Con- 3 ARCHITECTURE AND cerning graph clustering, it can be performed either SOFTWARE structurally or spectrally. In the former case parti- tioning is based on the properties of the graph adja- Graph databases such as Neo4j constitute one of the cency matrix (Kernighan and Lin, 1970) (Shi and Ma- four major database technologies collectively known lik, 2000), whereas in the latter connectivity patterns as NoSQL. RDBMSs assume that data can be repre- such as edge density or modularity (Newman, 2004b) sented in a structured and tabular manner. However, (Newman, 2004a) play a primary role with notable the modern Web and the IoT generate unstructured or examples being (Blondel et al., 2008), (Girvan and semistructured, higher order, linked data which can- Newman, 2002). Vertex ranking, computed for in- not be easily described by a schema. The primary stance with PageRank (Brin and Page, 1998) includ- properties of Neo4j include (Robinson et al., 2013) ing its variants (Langville and Meyer, 2006) or HITS (Panzarino, 2014) (Drakopoulos et al., 2015a) (Kleinberg, 1998), can be used to build communities Property 1. Neo4j is schemaless. with vertices which share common topics. Authority estimation can also be used to construct graph com- Property 2. Neo4j conforms to BASE requirements. munities. In (Agichtein et al., 2008) several graph Property 3. The property graph model is the primary features as well as hub and authority scores are used conceptual data model of Neo4j. to model the relative importance of a given user. Al- Property 4. Neo4j supports SPARQL, a W3C RDF ternatively, in the expertise ranking model (Jurczyk query language, and Gremlin, a path query language and Agichtein, 2007), authorities are derived by per- (Drakopoulos et al., 2015a). However, queries to forming link analysis to the graph induced from in- a Neo4j system are mostly submitted in Cypher, an teractions between users. Moreover, in (Weng et al., ASCII art, pattern based, declarative language. The 2010) authors employ Latent Dirichlet Allocation and basic Cypher query has the form a PageRank variant to cluster the graph according to 404 Graph Community Discovery Algorithms in Neo4j with a Regularization-based Evaluation Metric [ s t a r t <p a t t e r n >] be considered as a third order quantity. If a triangle match <p a t t e r n > is closed, then it is a third order quantity in terms of [ with <p a t t e r n > [ as <p a t t e r n >]] edges as well. This stems from the fact that single re- where <p a t t e r n > lationships between individuals, namely edges in so- return <e x p r e s s i o n > cial graphs, do not qualify as communities. Thus, in [ order by <f u n c t i o n > [ desc ]] a group there has to be at least one common acquain- tance connecting the individuals in this group. This is Cypher queries can be submitted directly in Neo4j reflected by the fact that successful community detec- console or, most frequently, through an application tion algorithms rely on higher order metrics directly over a Neo4j API. For Java the Neo4j API is included or indirectly.