Evaluating Twitter Influence Ranking with System Theory
Total Page:16
File Type:pdf, Size:1020Kb
Evaluating Twitter Influence Ranking with System Theory Georgios Drakopoulos, Andreas Kanavos and Athanasios Tsakalidis Computer Engineering and Informatics Department, University of Patras, Patras 26500, Greece Keywords: Functional Metrics, Structural Metrics, Graph Databases, Higher Order Data, Influence Ranking, Neo4j. Abstract: A considerable part of social network analysis literature is dedicated to determining which individuals are to be considered as influential in particular social settings. Most established algorithms, such as Freeman and Katz- Bonacich centrality metrics, place emphasis on various structural properties of the social graph. Although this makes centrality metrics generic enough to be applied in virtually any setting, they are oblivious to the functionality of the underlying social network. This paper examines five social influence metrics designed especially for Twitter and their implementation in a Java client retrieving network information from a Neo4j server. Additionally, a sceheme is proposed for evaluating the performance of an influence ranking based on estimating the exponent of a Zipf model fitted to the ranking score. 1 INTRODUCTION processing systems such as Spark (Robinson et al., 2013)(Onofrio Panzarino, 2014). Social media constitute a mainstay of the connected The primary contribution of this work is three- age. Recently, Twitter has emerged among them as fold. Five Tweeter-specific metrics capturing essen- one of the most popular microblogging platforms, tial online behavior characteristics have been devel- where on a daily basis a vast amount of information, oped. Additionally, a methodology for evaluating in- including tweets and hashtags, is posted by the users fluence metrics based on concepts from system the- of the platform to the public or to selected circles of ory is proposed. Finally, the aforementioned metrics their contacts. have been implemented in Java over Neo4j through Their advent made feasible the application of the Cypher API. both traditional and innovative social network anal- The rest of this paper is structured as follows. Sec- ysis methods in previously prohibitive magnitude tion 2 summarizes the influence ranking literature. for cornerstone problems such as social coher- Implementation aspects are described in section 3. ence, social graph clustering, expansion potential, Twitter-specific functional metrics are outlined in sec- or information flow (Russell, 2013)(Leskovec et al., tion 4. Section 5 discusses the experimental results 2014)(Leskovec, 2011). Currently, social influence and the metric evaluation methodology, while section ranking has been recognized as an important research 6 explores future research directions. topic. Existing influence metrics rely heavily on structural properties of the social graph, such as the Table 1: Symbols used in this paper. number of shortest paths through a given vertex. Al- Symbol Meaning though these structural metrics are well defined and =4 Equality by definition can be applied to literally every social graph, they ig- x ,...,x Set containing elements x ,...,x nore the array of functions each social network per- 1 n 1 n {S } Cardinality of set S forms. | | τS ,S Tanimoto coefficient of sets S and S Graph databases such as Neo4j provide produc- 1 2 1 2 tk k-th Twitter user (used as shorthand) tion grade front- or back-end social graph storage. µ Twitter user influence metric Moreover, they offer graph analytics such as link µk Influence score of tk assigned by µ prediction, shortest paths, clustering coefficient, and µ1 µ2 Metric µ1 outperforms µ2 minimum spanning trees, bolstering the potential µ1 µ2 Metric µ1 is at least as good as µ2 of graph tools such as NetworkX, machine learn- ing frameworks such as Graphlab, and distribiuted 113 Drakopoulos, G., Kanavos, A. and Tsakalidis, A. Evaluating Twitter Influence Ranking with System Theory. In Proceedings of the 12th International Conference on Web Information Systems and Technologies (WEBIST 2016) - Volume 1, pages 113-120 ISBN: 978-989-758-186-1 Copyright c 2016 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved WEBIST 2016 - 12th International Conference on Web Information Systems and Technologies 2 RELATED WORK 3 SYSTEM ARCHITECTURE Social network analysis scientific literature abounds Figure 1 illustrates the components of the system de- with influence rankings or metrics (Kempe et al., veloped to address the needs of this work as well as 2003). Although ranking classification is not always the information flow between them. The three main clear, distinct metric design methodologies can be functions are to retrieve the social graph through the broadly classified to structural and functional, while social media crawler, to store it in the Neo4j server, the former can be further subdivided to combinato- and to query this graph through its Java client. rial and spectral. Structural metrics are more versatile than functional ones, since the former are network- independent. On the contrary, the latter are network- specific but they tend to reveal more information about the underlying network. Combinatorial rankings compute the influence score based on basic graph properties. Prominent metrics of this category are the degree metric, namely the neighborhood size of a given vertex, and the Newman-Girvan centrality, a function of the graph shortest paths (Leskovec et al., 2014). Combinato- rial rankings may as well be expressed in the linear Figure 1: System architecture. algebra domain through the graph adjacency matrix. One such example is the Katz centrality (Katz, 1953). The social crawler, described in (Kafeza et al., Spectral rankings derive the influence score di- 2014), has been programmed in Twitter4j to collect rectly or indirectly through the spectral decomposi- im JSON format data such as tweets, retweets, and tion of the graph adjacency matrix. Eigenvalues and hashtags. It communicates with the Neo4j server eigenvectors play an important role in metrics such as through the Cypher API, a Java interface extended by PageRank (Leskovec et al., 2014) or eigenvector cen- Neo4j to Java clients, in order to populate the graph trality (Drakopoulos et al., 2015). database. The crawler is inaccessible from the client. On the contrary, functional metrics are associ- The Neo4j version is 2.2.5, the latest available version ated with particular aspects of a given social net- at the beginning of system development. It commu- work performs. TwitterRank (Weng et al., 2010) and nicates both with the Twitter crawler and the client. TunkRank (TunkRank, 2015) are two PageRank ex- The expressive power of Cypher, its declarative query tensions which take into account user similarity and language, reduces relatively complex graph queries to retweet probability respectively. In (Bakshy et al., simple patterns modifiable dynamically by the Java 2011) the most influential users are also the most cost- client. Since only the API is visible by the client, it is effective ones, where the cost is defined in terms of listed as a grey box, while the graph database proper overall communication complexity. In (Mehta et al., as a black box. On the end user side, the client has 2012) influence is expressed in terms of a sophisti- been developed in Java using the libraries available in cated metric incorporating structural and functional NetBeans for interfacing with Neo4j. elements. In (Pal and Counts, 2011) the problem of The need for new database approaches, besides finding the most influential authors for a given topic in the relational one, was highlighted with the advent Twitter are selected from a Gaussian Mixture Model. of Web 2.0, which is dominated by high velocity, A similar problem in Yahoo! Answers was addressed unstructured or semistructured and high order data. in (Bouguessa et al., 2008), where ranking is done as Neo4j, a NoSQL graph database, stores data physi- a mixture of gamma distributions. cally as a graph. Finally, authors in (Rogers and Beal, 1957) move Property 1. Neo4j is schemaless (Robinson et al., along a different reasoning beyond the structural and 2013). functional divisive line, presenting influence in terms This is a fundamental NoSQL characteristic which of an intuition stemming from the current technolog- yields design flexibility but also imposes additional ical evolution, which eventually led to the successful design and administration burdens. spread of online social media. Quoting (Cha et al., 2010), influence is Property 2. The relational ACID operational re- quirements have been replaced by the three BASE “[. ] the ability of a person to influence the requirements (Robinson et al., 2013)(Onofrio Pan- thoughts or actions of others.” zarino, 2014). 114 Evaluating Twitter Influence Ranking with System Theory The BASE set is less strict than the ACID ’followers’: x, one, providing implementation flexibility and ease of ’frequency’: x, maintenance at the expense of data consistency. De- ’hashtags’: x ) spite the latter, a NoSQL database can be in practice } successfully tailored to the local operational demands. where x denotes a value computed elsewhere in the source code. The above Cypher command creates a Property 3. The property graph model is the primary vertex n of type user along with a set of associated conceptual data model supported by Neo4j and of- key-value pairs. This makes the conceptual graph ho- fered to a high level user typically through the Cypher mogeneous in the sense that each vertex has the same querying language (Drakopoulos et al., 2015)(Kon- number of key-value pairs. Moreover, there are no topoulos and Drakopoulos, 2014). missing values. Therefore, an implicit schema does Property 4. Neo4j fully supports for both vertices exist for the vertices of the particular Twitter graph, and edges the CRUD set of operations, namely but this need not to be the case generally. Create, Read, Update, and Delete. In a similar manner, if a user u follows another Neo4j supports Cypher, a declarative, ASCII art, user v, then an edge with the FOLLOWS tag (and and pattern based query language for handling con- FOLLOWEDBY) is created: ceptual graphs. Cypher possesses suitable syntax to c r e a t e ( ( u ) [:FOLLOWS] >(v ) , concisely express CRUD operations.