Social Network Analysis

Student thesis

Ioannis Xarchakos
Athens University of Economics and Business

Supervised by: Prof. Vasilis Vassalos

Athens, March 2016

Abstract

This thesis investigates social network theory using graph processing systems, with the purpose of finding interesting properties and relationships in different kinds of social networks. We use several kinds of networks: online social networks, collaboration networks, product networks and business email networks. We mainly use four centrality metrics, namely Degree centrality, Closeness centrality, Betweenness centrality and Eigenvector centrality, in order to investigate interesting properties of the graphs. We use three processing systems: Stanford's Network Analysis Platform (SNAP), Neo4j and Apache Giraph. We use the four centrality metrics in order to compare the performance of the three systems. We also investigate the correlation between research productivity and research collaboration using the Betweenness centrality metric over the years.


Acknowledgements

First and foremost, I would like to thank Professor V. Vassalos for his willingness to be my supervisor and for his valuable guidance.


Table of Contents

1. Overview
   1.1 Goals
   1.2 Used techniques
   1.3 Results
2. Related Work
   2.1 Network analysis
   2.2 Centrality measures
   2.3 Graph analysis systems
3. Centrality Measures
   3.1 Mathematical definitions
   3.2 Intuitions
   3.3 Examples
   3.4 Pseudocode
4. Graph Systems
   4.1 SNAP
       4.1.1 Degree Centrality
       4.1.2 Closeness Centrality
       4.1.3 Betweenness Centrality
       4.1.4 Eigenvector Centrality
       4.1.5 Clustering Coefficient
       4.1.6 K-Core
   4.2 Neo4j
       4.2.1 Degree Centrality - Cypher
       4.2.2 Closeness Centrality
       4.2.3 Betweenness Centrality
       4.2.4 Eigenvector Centrality
       4.2.5 Clustering Coefficient


   4.3 Apache Giraph
       4.3.1 Degree Centrality
       4.3.2 Closeness Centrality
       4.3.3 Eigenvector Centrality
   4.4 Experimental Comparison
       4.4.1 DBLP
       4.4.2 Wiki-Talk
       4.4.3 LiveJournal
       4.4.4 Twitter
       4.4.5 Amazon product network
       4.4.6 Email Enron
5. Conclusion
   5.1 Graph Analysis Platforms Performance Comparison
   5.2 Research productivity and Research collaboration correlation
Appendix A


1. Overview

Social network analysis (SNA) is the process of investigating social structures through the use of network and graph theories. It characterizes networked structures in terms of nodes (individual actors, people, or things within the network) and the ties or edges (relationships or interactions) that connect them. The extreme popularity and rapid growth of these social networks represent a unique opportunity to study, understand, and leverage their properties. In this thesis, we investigate online social networks such as Twitter and LiveJournal, collaboration networks such as DBLP, product networks such as Amazon's product network, and business email networks such as Enron's e-mail network.

1.1 Goals

This section presents the goals of this thesis. Our first goal is to apply social network principles and methodologies using centrality metrics including Degree centrality, Betweenness centrality, Closeness centrality and Eigenvector centrality. The second goal is to compare the performance of the three graph analysis platforms which we use to calculate the previously mentioned metrics. Finally, we investigate the correlation between research productivity and research collaboration using the Betweenness centrality metric calculated over the years in the DBLP collaboration network.

1.2 Used techniques

This section presents how we collect our data and how we calculate the different centrality metrics. We acquire the Twitter dataset from the Arizona State University repository [15], while the LiveJournal and Amazon product network data sets are acquired from Stanford's Large Network Dataset Collection repository [14]. We also construct the DBLP and Enron e-mail network datasets. DBLP is a computer science bibliography website hosted at the University of Trier, in Germany. It was originally a database and logic programming bibliography site, and has existed at least since the 1980s. DBLP lists more than 3.5 million journal articles, conference papers, and other publications on computer science. We parse each publication and, for each scientific collaboration between two or more authors, we construct links among them in pairs; with this method we construct a dense collaboration network. In order to construct the Enron e-mail network data set, we acquire all the e-mails among Enron's employees from the Enron Email Dataset repository, which is hosted by Carnegie Mellon University. The Enron corpus is a large database with over 600,000 emails generated by 158 employees of the Enron Corporation and acquired by the Federal Energy Regulatory Commission during its investigation after the company's collapse. We then parse every e-mail in the data set and create a directed link from the sender to the recipient of every e-mail. A minimal sketch of this edge-list construction is given below.
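The following Python sketch is purely illustrative (it is not the parsing code used in the thesis); the input formats, file names and helper names are assumptions. It shows how pairwise co-author edges and directed sender-to-recipient e-mail edges can be written out as simple edge lists.

from itertools import combinations

# Hypothetical input: one publication per line, author names separated by ';'
def dblp_coauthor_edges(publications_file, edges_file):
    with open(publications_file) as src, open(edges_file, "w") as dst:
        for line in src:
            authors = [a.strip() for a in line.split(";") if a.strip()]
            # a collaboration between two or more authors yields pairwise links
            for a, b in combinations(authors, 2):
                dst.write("%s\t%s\n" % (a, b))      # undirected edge a -- b

# Hypothetical input: one e-mail per line in the form "sender -> recipient"
def enron_email_edges(emails_file, edges_file):
    with open(emails_file) as src, open(edges_file, "w") as dst:
        for line in src:
            sender, sep, recipient = line.partition("->")
            if sep:
                # directed edge from the sender to the recipient
                dst.write("%s\t%s\n" % (sender.strip(), recipient.strip()))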

The next step is to calculate the centrality metrics for each of our data sets. For this purpose, we use three different graph analysis systems. The first system is the Stanford Network Analysis Platform (SNAP), a general-purpose network analysis and graph mining library. It follows a graph-centric programming model and can efficiently manipulate large graphs, calculate structural properties and support attributes on nodes and edges. The second system is Neo4j, a well-known open-source graph database implemented in Java which uses an intuitive graph-oriented model for data representation. The last system is Apache Giraph, an iterative graph processing system built for high scalability; it is the open-source counterpart to Pregel, the graph processing architecture developed at Google, and both systems are inspired by the Bulk Synchronous Parallel model of distributed computation. Apache Giraph is part of the Hadoop ecosystem and runs on Hadoop. Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. In order to visualize the node connections in a network, and also to visualize the results about the correlation between research productivity and research collaboration, we use D3.js, a JavaScript library for producing dynamic, interactive data visualizations in web browsers; it makes use of the widely implemented SVG, HTML5 and CSS standards.

1.3 Results

In this section, we briefly summarize our observations; Section 5 presents our results in detail. We compare the performance of the three graph processing systems using four centrality algorithms on six different datasets. The first system, SNAP, is very efficient at calculating Degree and Eigenvector centrality on small or medium-sized data sets, but it underperforms on huge data sets like Twitter and does not perform well with more complex algorithms. Neo4j executes the calculations for complex metrics such as Closeness and Betweenness centrality very fast; however, Neo4j needs a lot of memory to store large graphs. Apache Giraph calculates the Degree centrality metric 4 times faster than the second-best system on a very large data set such as Twitter, which includes 11,316,811 nodes and 85,331,845 edges. This thesis also investigates the correlation between research productivity and research collaboration. In order to measure research productivity, we count the total papers of each author using the DBLP data set. Given all the analyses we performed, it is straightforward to measure research collaboration using the Betweenness centrality metric. We calculate Betweenness centrality and total publications first as a baseline from 1990 to 1999 and then for every additional year until 2005. Unfortunately, our hypothesis that there is a relationship between publication count (i.e., productivity) and collaboration "strength" as measured by Betweenness centrality does not seem to be correct; however, we do find "important" nodes which meet our proposal.


2. Related Work

In this chapter, we describe prior work related to the topics presented in this thesis. As this thesis covers a number of different topics, the related work has been grouped into sections detailing (2.1) work that examines the properties of different networks, (2.2) work that examines the centrality measures in detail and (2.3) work that compares the efficiency of graph analysis systems.

2.1 Network analysis

As online social networks gained popularity, researchers began to investigate their properties extensively. Mislove et al. [1] studied large-scale online social networking sites like Orkut, YouTube, and Flickr, and found that these networks comply with the power-law, small-world, and scale-free properties. They also observe that the indegree of user nodes tends to match the outdegree and that the networks contain a densely connected core of high-degree nodes. Cha et al. [2] examined information propagation in the Flickr social network. They answer three key questions: how widely does information propagate in social networks, how quickly does it propagate, and what is the role of word-of-mouth exchanges between friends in the overall propagation of information in the network. Viswanath et al. [3] investigate the evolution of user interaction in Facebook. They study the evolution of activity between users in the Facebook social network and find that links in the activity network tend to come and go rapidly over time, and that the strength of ties exhibits a general decreasing trend of activity as the social network link ages. Cha et al. [4] measure user influence in the Twitter social network. First, they find that popular users who have high indegree are not necessarily influential in terms of spawning retweets or mentions. Second, most influential users can hold significant influence over a variety of topics, and finally, influence is not gained spontaneously or accidentally, but through concerted effort such as limiting tweets to a single topic. McAuley and Leskovec [5] study the problem of automatically discovering users' social circles. They introduce a dataset of 1,143 ego-networks from Facebook, Google+, and Twitter, for which they obtain hand-labeled ground truth from 5,636 different circles. They use an unsupervised method which is able to automatically determine both the number of circles as well as the circles themselves.

2.2 Centrality measures

Bader and Madduri [6] discuss fast parallel algorithms for evaluating several centrality indices frequently used in complex network analysis. These algorithms have been optimized to exploit properties typically observed in real-world large-scale networks. Brandes and Pich [7] conducted a series of experiments to assess the practicality of heuristic methods for centrality computation, in order to calculate centralities from a limited number of SSSP computations. Ulrik Brandes [8] presents a faster algorithm for Betweenness centrality. He shows that Betweenness centrality can be computed exactly even for fairly large networks. He introduces more efficient algorithms based on a new accumulation technique that

integrates well with traversal algorithms solving the single-source shortest-paths problem, and thus exploits the sparsity of typical instances. The range of networks for which Betweenness centrality can be computed is thereby extended significantly. Sarıyuce et al. [9] provided fast incremental algorithms for closeness centrality computation. Their algorithms efficiently compute the closeness centrality values upon changes in network topology, i.e., edge insertions and deletions. Edmonds et al. [10] created a space-efficient parallel algorithm for computing Betweenness centrality in distributed memory. Parallelization over the vertex set of the standard algorithm, with a final reduction of the centrality for each vertex, is straightforward but requires O(|V|²) storage. They present a new parallelizable algorithm with low spatial complexity that is based on the best known sequential algorithm; their algorithm requires O(|V| + |E|) storage and enables efficient parallel execution.

2.3 Graph analysis systems

Han et al. [11] conduct a study to experimentally compare Giraph, GPS, Mizan, and GraphLab on equal ground by considering graph- and algorithm-agnostic optimizations and by using several metrics. The systems are compared with four different algorithms (PageRank, single source shortest path, weakly connected components, and distributed minimum spanning tree) on up to 128 Amazon EC2 machines. They find that the system optimizations present in Giraph and GraphLab allow them to perform well. Jouili and Vansteenberghe [12] present a distributed graph database comparison framework and the results they obtained by comparing four important players in the graph database market: Neo4j, OrientDB, Titan and DEX. Their study focused on two key issues: the growing size of the datasets and the increase of data complexity. Koch et al. [13] describe and compare programming models for distributed computing with a focus on graph algorithms for large-scale complex network analysis. Four frameworks – GraphLab, Apache Giraph, Giraph++ and – are used to implement algorithms for the representative problems Connected Components, Community Detection, PageRank and Clustering Coefficients. The implementations are executed on a computer cluster to evaluate the frameworks' suitability in practice and to compare their performance to that of the single-machine, shared-memory parallel network analysis package NetworKit. Out of the distributed frameworks, GraphLab and Apache Giraph generally show the best performance.


3. Centrality Measures

Graph theory and network analysis use various measures to determine the relative importance of a vertex within a graph. Four centrality measures are widely used in network analysis: Degree centrality, Closeness centrality, Betweenness centrality and Eigenvector centrality.

3.1 Mathematical definitions

Degree centrality
Degree centrality is defined as the number of links incident upon a node; it is a measure of a node's connectedness. In the case of a directed network we define two separate measures of Degree centrality, namely In-Degree and Out-Degree. In-Degree is the total number of ties directed to the node and Out-Degree is the total number of ties that the node directs to other nodes of the network. The Degree centrality of a vertex v, for a given graph G := (V, E) with |V| vertices and |E| edges, is defined as

    C_D(v) = \deg(v)

The normalized version divides the simple degree by the maximum possible degree, which is N-1, yielding a measure ranging from 0 to 1.

Closeness centrality
Closeness centrality is based on the length of the average shortest path between a vertex and all other vertices in the network: the more central a node is, the lower its total distance from all other nodes. In directed graphs, distances to a node are considered a more meaningful measure of centrality, because a node usually has little control over its incoming links. Closeness centrality can be formally represented as

    C_C(v) = \frac{1}{\sum_{t \in V,\, t \neq v} d(v, t)}

The normalized version uses the average distance to the other N-1 nodes instead of the total distance, yielding a measure ranging from 0 to 1.

Betweenness centrality
A more complex centrality measure is Betweenness centrality. It relies on the concept of shortest paths: in order to compute the Betweenness centrality of a vertex, it is necessary to count the number of shortest paths that pass through the given vertex.


The Betweenness centrality C_B(v_i) of a vertex v_i is computed as

    C_B(v_i) = \sum_{s \neq i \neq t} \frac{\sigma_{st}(v_i)}{\sigma_{st}}

where \sigma_{st} is the number of shortest paths between vertices v_s and v_t, and \sigma_{st}(v_i) is the number of shortest paths between v_s and v_t that pass through v_i.

Brandes Betweenness centrality
The fastest known algorithm for exactly computing the betweenness of all vertices was designed by Ulrik Brandes; it requires O(nm) time for unweighted graphs and O(nm + n² log n) time for weighted graphs, where n is the number of vertices and m is the number of edges. The dependency of a source vertex s ∈ V on a vertex v ∈ V is defined as

    \delta_{s\bullet}(v) = \sum_{t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}

Then the Betweenness centrality of v can be expressed as

    C_B(v) = \sum_{s \neq v \in V} \delta_{s\bullet}(v)

Also, let P_s(v) denote the set of predecessors of a vertex v on shortest paths from s:

    P_s(v) = \{ u \in V : \{u, v\} \in E,\; d(s, v) = d(s, u) + 1 \}

Brandes proved that the dependencies satisfy the following recursive relation, which is the key result in the analysis of the algorithm. The dependency of s ∈ V on any v ∈ V is

    \delta_{s\bullet}(v) = \sum_{w \,:\, v \in P_s(w)} \frac{\sigma_{sv}}{\sigma_{sw}} \left( 1 + \delta_{s\bullet}(w) \right)

For the n nodes in V, n Single-Source Shortest Paths (SSSP) computations are performed, and the predecessor sets P_s(v) are retained during these computations. In the next step, the dependencies \delta_{s\bullet}(v) are computed for every v ∈ V using the information from the shortest-path trees and the predecessor sets along the paths. The calculation is over when all dependency values have been summed.

Eigenvector centrality
Another way to assign centrality to a vertex is based on the idea that if a vertex has many central neighbors, it should be central as well.

This measure is called Eigenvector centrality and establishes that the importance of a vertex is determined by the importance of its neighbors. For a given graph G := (V, E) with |V| vertices, let A = (a_{v,t}) be the adjacency matrix, i.e. a_{v,t} = 1 if vertex v is linked to vertex t, and a_{v,t} = 0 otherwise. The centrality score of vertex v can be defined as

    x_v = \frac{1}{\lambda} \sum_{t \in M(v)} x_t = \frac{1}{\lambda} \sum_{t \in V} a_{v,t}\, x_t

where M(v) is the set of the neighbors of v and λ is a constant. With a small rearrangement this can be rewritten in vector notation as the eigenvector equation

    A \mathbf{x} = \lambda \mathbf{x}
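As a small worked example (not taken from the thesis), consider an undirected star graph with a center c connected to three leaves l_1, l_2, l_3. The four measures defined above give:

    C_D(c) = 3, \qquad C_D(l_i) = 1
    C_C(c) = \frac{1}{1+1+1} = \frac{1}{3}, \qquad C_C(l_i) = \frac{1}{1+2+2} = \frac{1}{5}
    C_B(c) = 3, \qquad C_B(l_i) = 0
    A\mathbf{x} = \lambda\mathbf{x} \;\Rightarrow\; \lambda = \sqrt{3},\; x_c = \sqrt{3}\, x_{l_i}

The center lies on the single shortest path of each of the three leaf pairs, hence its Betweenness of 3, and it also obtains the highest Eigenvector centrality.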

3.2 Intuitions

Degree centrality
Degree centrality is the simplest measure of node connectivity: the node with the most connections is the most important. It is useful for finding very connected individuals, popular individuals, individuals who are likely to hold the most information, or individuals who can quickly connect with the wider network. With directed data, however, it can be important to distinguish centrality based on In-Degree from centrality based on Out-Degree. If a node receives many ties, it is often said to be prominent, or to have high prestige; many other nodes seek to direct ties to it, and this may indicate its importance. Nodes with unusually high Out-Degree are nodes that are able to exchange with many others, or make many others aware of their views, and could therefore be influential nodes.

Possible questions to answer:
- Who is the most / least popular person in the network?
- Who can call upon the most resources in the network?

Closeness centrality
An intuitive definition of Closeness centrality is that the most important node is the one that is the most independent. Nodes that require the fewest edges to transfer information to all other nodes are considered the most independent. In practice, Closeness centrality measures the importance of a vertex by how close the given vertex is to the other vertices. Central vertices, with respect to this measure, are important as they can reach the whole network more quickly than non-central vertices. Closeness is most insightful when a network is sparsely connected; in a highly connected network you will often find that all nodes have a similar score.


Possible questions to answer:
- Who can most efficiently obtain information on other nodes in the network?
- Who could most quickly spread information in a network?

Betweenness centrality
Betweenness centrality identifies the nodes with the most strategic location in a network. It favors nodes that join communities rather than nodes that lie inside a community; in other words, it shows which nodes act as 'bridges' between nodes in a network. Vertices with high values of Betweenness centrality are important because they maintain an efficient way of communication inside a network and foster information diffusion.

Possible questions to answer:
- Who or what can most strongly control information flow around the network?
- Who or what would cause the most disruption to flow if they were removed?

Eigenvector centrality
An intuitive definition of Eigenvector centrality is that the node that has the most important neighbors is the most important node in the network; in other words, you are only important if you have important neighbors. A high Eigenvector centrality score indicates a strong influence over other nodes in the network. It is useful because it indicates not just direct influence, but also implies influence over nodes more than one 'hop' away.

Possible questions to answer:
- Who or what holds wide-reaching influence in my network?
- Who or what is important in my network on a macro scale?

3.3 Examples

The four centrality metrics have several applications on networks. We can utilize their properties and draw conclusions on social, collaboration, product and other types of networks.

Degree centrality
In Social networks, Degree centrality is useful for indicating active or popular users of the network. If an actor receives many ties, they are often said to be prominent, or to have high prestige. Actors with unusually high Out-Degree are able to exchange with many others, or make many others aware of their views, and are often said to be influential actors. In Collaboration networks this measure is useful for indicating how collaborative each author is. In Product networks this metric is useful for indicating popular products and products that are purchased along with other products in the network.

Closeness centrality
In Social networks this centrality measure can help find good 'broadcasters', but in a highly connected network you will often find that all nodes have a similar score. What may be more useful is using Closeness to find influencers within a small community (a single cluster). A useful application of this metric is that users with high Closeness centrality inside a community can be recommended as potential friends to the other members of this community. In Collaboration networks, Closeness centrality identifies authors who possess and control a great deal of research; the closer one author is to another, the easier information exchange and research collaboration become. Closeness centrality is also useful for identifying the authors that are, on average, easily accessible by all other authors. In a Product network, products with high Closeness centrality inside a "community" can be recommended as similar to the other products of this product "community".

Betweenness centrality
In Social networks, Betweenness centrality shows which nodes are more likely to be on communication paths between other nodes. Nodes with high Betweenness centrality connect large groups of people and may be able to diffuse information to a large portion of the network; for example, they could promote a product or endorse a political party. Nodes with high Betweenness centrality do not necessarily have high Degree or Closeness centrality. In Collaboration networks, nodes with high Betweenness centrality have the power to control collaborative relationships and possess and control a great deal of research resources. They also tend to attract more new co-authors than the well-connected authors (high Degree centrality) or the authors who are close to all others (high Closeness centrality); in other words, new authors prefer to attach to the existing authors who control the flow of information (communication) by having a brokering (or bridging) role in the collaboration network. Obviously, as the collaboration network grows, authors with high Betweenness centrality gain more power and influence. In Product networks this centrality measure identifies the nodes with the most strategic location in the network, such as nodes that join communities. In a product network, products with high Betweenness could be the "core" products of the network: by "core" we mean products at which it is relatively easy to end up, regardless of what you purchase.


Eigenvector centrality
Eigenvector centrality is associated with prestige and influence. A high Eigenvector score indicates a strong influence over other nodes in the network. A node may have a high Degree score (i.e. many connections) but a relatively low Eigenvector score if many of those connections are with similarly low-scored nodes; in other words, you are important if you have important neighbors. In Collaboration networks, Eigenvector centrality is usually associated with prestige, because authors with high values of this metric have formed collaborations with important authors. In terms of a Product network, products that are bought along with high-Degree products will have high Eigenvector centrality.

3.4 Pseudocode

Degree centrality

for i from 1 to |V|
    for j from 1 to |V|
        if matrix[i][j] != 0 then
            degree[i]++
        end if

Closeness centrality

for i from 1 to |V|
    execute dijkstra(i) -> vector path[i]
    for j from 1 to |V|
        path_length[i] += path[i][j]
    closeness_cent[i] = 1.0 / path_length[i]

Betweenness centrality

for s from 1 to |V|
    for t from 1 to |V| with t != s
        compute sigma_st and sigma_st(v) for every vertex v
        (numbers of shortest s-t paths, and of those passing through v)
        for each vertex v with v != s and v != t
            betweenness[v] += sigma_st(v) / sigma_st


Eigenvector centrality

for i from 1 to |V|
    eigenvector[i] <- 1
for n from 1 to MAX_TIMES
    for i from 1 to |V|
        tmp_eigen[i] <- 0
        for j from 1 to |V|
            tmp_eigen[i] += matrix[i][j] * eigenvector[j]
    norm_sq <- 0
    for i from 1 to |V|
        norm_sq += tmp_eigen[i] * tmp_eigen[i]
    norm <- sqrt(norm_sq)
    for i from 1 to |V|
        eigenvector[i] <- tmp_eigen[i] / norm

Brandes Betweenness centrality

CB[v] ← 0, for all v ∈ V
for s ∈ V do
    S ← empty stack
    P[w] ← empty list, for all w ∈ V
    σ[t] ← 0, for all t ∈ V; σ[s] ← 1
    d[t] ← −1, for all t ∈ V; d[s] ← 0
    Q ← empty queue
    enqueue s → Q
    while Q not empty do
        dequeue v ← Q
        push v → S
        foreach neighbor w of v do
            if d[w] < 0 then
                enqueue w → Q
                d[w] ← d[v] + 1
            if d[w] = d[v] + 1 then
                σ[w] ← σ[w] + σ[v]
                append v → P[w]
    δ[v] ← 0, for all v ∈ V
    while S not empty do
        pop w ← S
        for v ∈ P[w] do
            δ[v] ← δ[v] + σ[v]/σ[w] * (1 + δ[w])
        if w != s then
            CB[w] ← CB[w] + δ[w]


4. Graph Systems

In this chapter, we describe the three graph processing systems used in this thesis and compare their performance on the data sets we use. This part of the thesis has been grouped into sections detailing (4.1) the Stanford Network Analysis Platform (SNAP) and the algorithms we developed using this graph library, (4.2) Neo4j and the algorithms we developed using this graph database system, (4.3) Apache Giraph and the algorithms we developed using this system, and (4.4) the performance comparison of the three systems.

4.1 SNAP

Introduction
Stanford Network Analysis Platform (SNAP) is a general-purpose network analysis and graph mining library. It is written in C++ and easily scales to massive networks with hundreds of millions of nodes and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. SNAP also follows a graph-centric programming model. The SNAP library has been actively developed since 2004 by Stanford University and is organically growing as a result of their research pursuits in the analysis of large social and information networks. Snap.py is a Python interface for SNAP; it provides the performance benefits of SNAP combined with the flexibility of Python. Most of the SNAP functionality is available via Snap.py.


The latest version of Snap.py is 1.2 (May 12, 2015). Packages for Mac OS X, Linux (such as CentOS) and 64-bit Windows are available at Stanford's Snap.py repository.

Code examples
To use Snap.py in Python, import the snap module:

import snap

Basic types in SNAP are TInt, TFlt, and TStr, which in Snap.py are automatically converted to the Python types int, float, and str, respectively. In general, there is no need to work explicitly with SNAP types in Snap.py because of the automatic conversion.

i = snap.TInt(10)
print i.Val

snap.TInt(10): Create an integer with value 10.
i.Val: Return the value of variable i.

Vectors are sequences of values of the same type. Existing vector values can be accessed or changed by their index in the sequence, and new values can be added at the end of a vector. Vector types in Snap.py and SNAP are named after the value type, followed by V; for example, a vector of integers is named TIntV.

v = snap.TIntV()
v.Add(1)
v.Add(2)
v.Add(3)
print v.Len()
print v[2]
v.SetVal(2, 6)
print v[2]

snap.TIntV(): Create an empty vector of integers.
v.Add(1): Add a value at the end of the vector.
v.Len(): Get the number of values in the vector.
v[2]: Get the value at a specific vector location.
v.SetVal(2, 6): Change the value at a specific vector location.

Pairs contain two values, and each value has its own type. Pair types in Snap.py and SNAP are named after the two value types, followed by Pr; for example, a pair of (integer, string) is named TIntStrPr. If both values have the same type, only one type name might be used, such as TIntPr. Vectors of pairs also exist; for example, TIntPrV is a vector of (integer, integer) pairs.

p = snap.TIntStrPr(1, "one")
print p.GetVal1()
print p.GetVal2()

snap.TIntStrPr(1, "one"): Create a pair of an integer and a string.
p.GetVal1(): Return the first value of the pair.
p.GetVal2(): Return the second value of the pair.
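The centrality examples later in this chapter also use SNAP hash tables, which follow the same naming convention (key type, then value type, followed by H) and map keys to values. A minimal sketch, assuming a working Snap.py installation:

import snap

h = snap.TIntStrH()      # hash table mapping integer keys to string values
h[5] = "five"
h[3] = "three"
print h.Len()            # number of (key, value) entries
print h[5]               # value stored for key 5
for key in h:            # iterate over the keys
    print key, h[key]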


The LoadEdgeList(GraphType, InFNm, SrcColId, DstColId) command loads a (directed, undirected or multi) graph from a text file with one edge per line. GraphType indicates the type of the graph to be loaded:
- PNGraph: Directed graph
- PNEANet: Multi graph
- PUNGraph: Undirected graph

InFNm is a whitespace-separated file of several columns; SrcColId and DstColId indicate which column gives the source and which column gives the destination of each edge.

Graph = snap.LoadEdgeList(snap.PUNGraph, "edges.txt", 0, 1)
Graph = snap.LoadEdgeList(snap.PNGraph, "edges.txt", 0, 1)

snap.LoadEdgeList(snap.PUNGraph, "edges.txt", 0, 1): Load the edges of the file 'edges.txt' as an undirected graph.
snap.LoadEdgeList(snap.PNGraph, "edges.txt", 0, 1): Load the edges of the file 'edges.txt' as a directed graph.

Nodes and edges are traversed with iterators. For example:

for NI in Graph.Nodes():
    print "nodeId: %d" % (NI.GetId())
for EI in Graph.Edges():
    print "edge (%d, %d)" % (EI.GetSrcNId(), EI.GetDstNId())

Graph.Nodes(): Traverse all the nodes using a node iterator.
NI.GetId(): Return the id of the node.
Graph.Edges(): Traverse all the edges using an edge iterator.
EI.GetSrcNId(): Return the id of the edge's source node.
EI.GetDstNId(): Return the id of the edge's destination node.

4.1.1 Degree Centrality

In-Degree centrality
In-Degree is the total number of ties directed to the node.

SnapInDegreeCentrality.py

import snap

Graph = snap.LoadEdgeList(snap.PNGraph, "edges.txt", 0, 1)
for NI in Graph.Nodes():
    x = str(NI.GetId()) + '\t' + str(NI.GetInDeg())
    print x

NI.GetInDeg(): Returns the In-Degree of a node.


Out-Degree centrality
Out-Degree is the total number of ties that the node directs to other nodes of the network.

SnapOutDegreeCentrality.py

import snap

Graph = snap.LoadEdgeList(snap.PNGraph, "edges.txt", 0, 1)
for NI in Graph.Nodes():
    x = str(NI.GetId()) + '\t' + str(NI.GetOutDeg())
    print x

NI.GetOutDeg(): Returns the Out-Degree of a node.

Degree centrality
Degree centrality is defined as the number of links incident upon a node.

SnapDegreeCentrality.py

import snap

Graph = snap.LoadEdgeList(snap.PNGraph, "edges.txt", 0, 1)
for NI in Graph.Nodes():
    x = str(NI.GetId()) + '\t' + str(NI.GetOutDeg() + NI.GetInDeg())
    print x

4.1.2 Closeness Centrality

Closeness centrality is based on the length of the average shortest path between a vertex and all vertices in the network. The more central a node is, the lower its total distance from all other nodes.

SnapClosenessCentrality.py

import snap

Graph = snap.LoadEdgeList(snap.PUNGraph, "edges.txt", 0, 1)
for NI in Graph.Nodes():
    closenessValue = snap.GetClosenessCentr(Graph, NI.GetId())
    x = str(NI.GetId()) + '\t' + str(closenessValue)
    print x

snap.GetClosenessCentr(Graph, NI.GetId()): Returns the Closeness centrality of a node, given the graph as the first parameter and the id of the node as the second parameter.

4.1.3 Betweenness Centrality

To compute the Betweenness centrality of a vertex, it is necessary to count the number of shortest paths that pass across the given vertex.


SnapBetweennessCentrality.py

import snap

Graph = snap.LoadEdgeList(snap.PUNGraph, "edges.txt", 0, 1)
Nodes = snap.TIntFltH()
Edges = snap.TIntPrFltH()
snap.GetBetweennessCentr(Graph, Nodes, Edges, 1.0)
for node in Nodes:
    x = str(node) + "\t" + str(Nodes[node])
    print x

snap.TIntFltH(): Hash table mapping node ids to their betweenness centrality values.
snap.TIntPrFltH(): Hash table mapping edges (provided as pairs of node ids) to their betweenness centrality values.
snap.GetBetweennessCentr(Graph, Nodes, Edges, 1.0): Computes the Betweenness centrality of every node in the network. The last parameter controls the quality of the approximation; the value 1.0 gives the exact Betweenness centrality.

4.1.4 Eigenvector Centrality

This measure is called Eigenvector centrality and establishes that the importance of a vertex is determined by the importance of its neighbors.

SnapEigenvectorCentrality.py

import snap

Graph = snap.LoadEdgeList(snap.PUNGraph, "edges.txt", 0, 1)
NIdEigenH = snap.TIntFltH()
snap.GetEigenVectorCentr(Graph, NIdEigenH)
for item in NIdEigenH:
    x = str(item) + '\t' + str(NIdEigenH[item])
    print x

snap.TIntFltH(): Hash table mapping node ids to their eigenvector centrality values.
snap.GetEigenVectorCentr(Graph, NIdEigenH): Computes the Eigenvector centrality of every node in the network.

4.1.5 Clustering Coefficient

Clustering Coefficient is a measure of the degree to which nodes in a graph tend to cluster together. More particularly, the clustering coefficient of a selected user is defined as the probability that two randomly selected neighbors are connected to each other.
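For reference, the standard local clustering coefficient of a node i with degree k_i and T_i edges among its neighbors (a well-known formula, not reproduced from the thesis) is

    C_i = \frac{2 T_i}{k_i (k_i - 1)}

which is exactly the fraction of connected pairs among the k_i (k_i - 1)/2 possible neighbor pairs.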


SnapClusteringCoefficient.py

import snap

Graph = snap.LoadEdgeList(snap.PUNGraph, "edges.txt", 0, 1)
NIdCCfH = snap.TIntFltH()
snap.GetNodeClustCf(Graph, NIdCCfH)
for NI in NIdCCfH:
    x = str(NI) + "\t" + str(NIdCCfH[NI])
    print x

snap.TIntFltH(): Hash table mapping node ids to their clustering coefficient values.
snap.GetNodeClustCf(Graph, NIdCCfH): Computes the Clustering Coefficient of every node in the network.

4.1.6 K-Core

A k-core of a graph G is a maximal connected subgraph of G in which all vertices have degree at least k. Equivalently, it is one of the connected components of the subgraph of G formed by repeatedly deleting all vertices of degree less than k; a minimal sketch of this peeling process is shown below.
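The following pure-Python sketch (independent of SNAP, with an illustrative adjacency-list input) makes the iterative deletion concrete:

def k_core_subgraph(adj, k):
    # adj: dict mapping each vertex to the set of its neighbours (undirected graph)
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if len(adj[v]) < k:
                for u in adj[v]:          # remove v and all edges incident to it
                    adj[u].discard(v)
                del adj[v]
                changed = True
    return adj  # remaining vertices induce the k-core(s), possibly several components

# Example: a triangle (1, 2, 3) with a pendant vertex 4; the 2-core is the triangle.
print k_core_subgraph({1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}, 2)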

SnapKCore.py

import snap

Graph = snap.LoadEdgeList(snap.PUNGraph, "edges.txt", 0, 1)
CoreIDSzV = snap.TIntPrV()
kValue = snap.GetKCoreNodes(Graph, CoreIDSzV)
for item in CoreIDSzV:
    x = str(item.GetVal1()) + "\t" + str(item.GetVal2())
    print x

snap.TIntPrV(): Vector of (int, int) pairs.
snap.GetKCoreNodes(Graph, CoreIDSzV): Computes, for every order k, the number of nodes in the k-core and stores the (k, size) pairs in CoreIDSzV.


4.2 Neo4j

Introduction
Neo4j is an open-source graph database implemented in Java, first released in 2007. It is a well-known graph database system and currently the most popular one. Neo4j provides two licenses: Enterprise and Community. The Enterprise edition offers incredible power and flexibility, with enterprise-grade availability, management, and scale-up and scale-out capabilities. The Community edition is ideal for learning and for smaller do-it-yourself projects that do not require high levels of scaling; it excludes professional services and support. The current version is Neo4j 2.3.

Neo4j uses an intuitive graph-oriented model for data representation. Instead of static and rigid tables, it works with a flexible graph network consisting of nodes, relationships and properties. It is applicable to many web use cases such as tagging, social networks, real-time recommendations and other network-shaped or hierarchical data sets.

Neo4j supports the ACID (Atomicity, Consistency, Isolation, Durability) transaction concept. This is achieved by keeping in-memory transaction logs and a lock manager that applies locks to any database objects altered during the transaction. The changes in the log are flushed to disk after the successful completion of the transaction.


The default Neo4j Server has a powerful, customizable data visualization tool based on the built-in D3.js library, with which it is very easy to style nodes and relationships by setting different colors, sizes and titles on them.

Neo4j includes several ways to import data. For small or medium-sized CSV files it is easier to use the Cypher Query Language, which allows loading CSV data from any URL and creating simple or complex graph structures. For large CSV files consisting of millions of nodes, the Batch Importer tool is capable of loading large amounts of data into a Neo4j database very quickly. This tool supports only CSV files, and the data cannot be imported into an existing database. As input, the Batch Importer takes a number of CSV files describing nodes and relationships and creates a graph database.

There are several ways of accessing the Neo4j database (a small REST API sketch is shown after this list):
- Web UI
- Neo4j Shell
- Native Java API
  - Cypher Query Language
  - Core API
  - Traversal Framework
- REST API
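As an illustration of the REST API access path, the following is a minimal sketch (not taken from the thesis) that sends a Cypher statement to the transactional HTTP endpoint of a local Neo4j 2.x server; the URL, port and query are illustrative, and authentication is assumed to be disabled.

import json
import urllib2

# Transactional Cypher endpoint of a local Neo4j 2.x server (default port 7474)
url = "http://localhost:7474/db/data/transaction/commit"
payload = {"statements": [{"statement": "MATCH (n) RETURN count(n) AS TotalNodes"}]}

request = urllib2.Request(url, json.dumps(payload),
                          {"Content-Type": "application/json", "Accept": "application/json"})
response = json.load(urllib2.urlopen(request))
print response["results"][0]["data"][0]["row"][0]   # total number of nodes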

A general view of Neo4j architecture


Code examples
The following commands are used for importing the CSV files of the Twitter dataset into a Neo4j database named 'TwitterDB':

“C:\Program Files\Neo4j Community\bin\Neo4jImport.bat”

--into C:\databases\TwitterDB --nodes "C:\datasets\Twitter\nodes.csv" --relationships "C:\datasets\Twitter\edges.csv"

After the execution of these commands a new Neo4j database was created with 11,316,811 nodes and 85,331,845 edges in 2 minutes and 23 seconds.

Find and show the total number of nodes in the graph using the Cypher Query Language:

MATCH (n) RETURN count(*) AS TotalNodes;

We also present a complete example of how to find and show the total number of nodes in the graph using the Java API. First, open the database Test in the folder Databases:

db = dbFactory.newEmbeddedDatabase("C:/Databases/Test");

Create a transaction with the selected database:

try (Transaction tx = db.beginTx())

A Java HashSet is created, every node in the graph is added to the hash set allNodes, and the size of the hash set gives the total number of nodes in the graph.

Set<Node> allNodes = new HashSet<>();
for (Node node : GlobalGraphOperations.at(db).getAllNodes()) {
    allNodes.add(node);
}
System.out.println(allNodes.size());


Finally, we shut down the database server when the application finishes.

private static void shutDown() {
    Runtime.getRuntime().addShutdownHook(new Thread() {
        public void run() {
            db.shutdown();
        }
    });
}

The complete code of the previously explained example.


4.2.1 Degree Centrality - Cypher

In-Degree centrality
In-Degree is the total number of ties directed to the node.

Neo4jCypherInDegreeCentrality

start n = node(*)
WITH n, length(n<--()) as indegree
return n, indegree

Out-Degree centrality
Out-Degree is the total number of ties that the node directs to other nodes of the network.

Neo4jCypherOutDegreeCentrality

start n = node(*)
WITH n, length(n-->()) as outdegree
return n, outdegree

Degree centrality
Degree centrality is defined as the number of links incident upon a node.

Neo4jCypherDegreeCentrality

start n = node(*)
WITH n, length(n-->()) as outdegree, length(n<--()) as indegree
return n, outdegree + indegree AS degree


4.2.2 Closeness Centrality

Closeness centrality is based on the length of the average shortest path between a vertex and all vertices in the network. The more central a node is, the lower its total distance from all other nodes.

Neo4jClosenessCentrality.java

private static GraphDatabaseService db;

public static void main(String[] args) {
    GraphDatabaseFactory dbFactory = new GraphDatabaseFactory();
    db = dbFactory.newEmbeddedDatabase("C:/Databases/Test");
    try (Transaction tx = db.beginTx()) {
        Set<Node> allNodes = new HashSet<>();
        for (Node node : GlobalGraphOperations.at(db).getAllNodes()) {
            allNodes.add(node);
        }
        SingleSourceShortestPath<Double> singleSourceShortestPath = getSingleSourceShortestPath();
        ClosenessCentrality<Double> closenessCentrality = new ClosenessCentrality<Double>(
            singleSourceShortestPath, new DoubleAdder(), 0.0, allNodes,
            new CostDivider<Double>() {
                public Double divideByCost(Double d, Double c) {
                    return d / c;
                }
                public Double divideCost(Double c, Double d) {
                    return c / d;
                }
            });
        closenessCentrality.calculate();
        for (Node node : GlobalGraphOperations.at(db).getAllNodes()) {
            Double value = closenessCentrality.getCentrality(node);
            System.out.println("Node:" + node.getId() + ",value:" + value);
        }
        tx.success();
        shutDown();
    }
}

protected static SingleSourceShortestPath<Double> getSingleSourceShortestPath() {
    return new SingleSourceShortestPathDijkstra<Double>(
        0.0, null,
        new CostEvaluator<Double>() {
            public Double getCost(Relationship relationship, Direction direction) {
                return 1.0;
            }
        },
        new org.neo4j.graphalgo.impl.util.DoubleAdder(),
        new org.neo4j.graphalgo.impl.util.DoubleComparator(),
        Direction.BOTH, RelTypes.LINKS);
}


The second parameter is an object capable of adding distances; it is needed since an "average" will be computed. The third parameter is the starting value of the accumulator. The fourth parameter is a set containing the nodes for which centrality values should be computed. The fifth parameter is an object capable of inverting a distance.

The first parameter of the ClosenessCentrality constructor is the result of the getSingleSourceShortestPath() method. This method returns a SingleSourceShortestPath object, which encapsulates an algorithm able to solve the single-source shortest path problem: it can find the shortest path(s) from a given start node to all other nodes in a network. To do this it uses the following code:

return new SingleSourceShortestPathDijkstra<Double>(
    0.0, null,
    new CostEvaluator<Double>() {
        public Double getCost(Relationship relationship, Direction direction) {
            return 1.0;
        }
    },
    new org.neo4j.graphalgo.impl.util.DoubleAdder(),
    new org.neo4j.graphalgo.impl.util.DoubleComparator(),
    Direction.BOTH, RelTypes.LINKS);

The first parameter of this object is the starting cost for both the start node and the end node. The second parameter is the starting node. The third parameter is the cost function per relationship; getCost(Relationship relationship, Direction direction) is the general method for looking up costs for relationships. The fourth parameter adds up the path cost. The fifth parameter compares path costs. The sixth parameter is the relationship direction to follow. The seventh parameter is the kind of relationship that should be included in the path.


4.2.3 Betweenness Centrality

To compute the Betweenness centrality of a vertex, it is necessary to count the number of shortest paths that pass across the given vertex.

Neo4jBetweennessCentrality.java

private static GraphDatabaseService db;

public static void main(String[] args) {
    GraphDatabaseFactory dbFactory = new GraphDatabaseFactory();
    db = dbFactory.newEmbeddedDatabase("C:/Databases/Test");
    try (Transaction tx = db.beginTx()) {
        Set<Node> allNodes = new HashSet<>();
        for (Node node : GlobalGraphOperations.at(db).getAllNodes()) {
            allNodes.add(node);
        }
        BetweennessCentrality<Double> betweennessCentrality =
            new BetweennessCentrality<Double>(getSingleSourceShortestPath(), allNodes);
        betweennessCentrality.calculate();
        for (Node node : GlobalGraphOperations.at(db).getAllNodes()) {
            Double value = betweennessCentrality.getCentrality(node);
            System.out.println("Node:" + node.getId() + ",value:" + value);
        }
        tx.success();
        shutDown();
    }
}

protected static SingleSourceShortestPath<Double> getSingleSourceShortestPath() {
    return new SingleSourceShortestPathDijkstra<Double>(
        0.0, null,
        new CostEvaluator<Double>() {
            public Double getCost(Relationship relationship, Direction direction) {
                return 1.0;
            }
        },
        new org.neo4j.graphalgo.impl.util.DoubleAdder(),
        new org.neo4j.graphalgo.impl.util.DoubleComparator(),
        Direction.BOTH, RelTypes.LINKS);
}

The first parameter of the BetweennessCentrality constructor is the result of the getSingleSourceShortestPath() method. This method returns a SingleSourceShortestPath object, which encapsulates an algorithm able to solve the single-source shortest path problem: it can find the shortest path(s) from a given start node to all other nodes in a network. The second parameter is a set containing the nodes for which centrality values should be computed.


4.2.4 Eigenvector Centrality

This measure is called Eigenvector centrality and establishes that the importance of a vertex is determined by the importance of its neighbors.

Neo4jEigenvectorCentrality.java

public class Neo4jEigenvectorCentrality {
    private static GraphDatabaseService db;

    public static void main(String[] args) {
        GraphDatabaseFactory dbFactory = new GraphDatabaseFactory();
        db = dbFactory.newEmbeddedDatabase("C:/Databases/Test");
        try (Transaction tx = db.beginTx()) {
            EigenvectorCentralityPower eigen = new EigenvectorCentralityPower(
                Direction.BOTH,
                new CostEvaluator<Double>() {
                    public Double getCost(Relationship arg0, Direction arg1) {
                        return 1.0;
                    }
                },
                Sets.newHashSet(GlobalGraphOperations.at(db).getAllNodes()),
                Sets.newHashSet(GlobalGraphOperations.at(db).getAllRelationships()),
                0.01);
            eigen.calculate();
            for (Node node : GlobalGraphOperations.at(db).getAllNodes()) {
                Double value = eigen.getCentrality(node);
                System.out.println("NodeId:" + node.getId() + ",value:" + value);
            }
            tx.success();
            shutDown();
        }
    }

    private static void shutDown() {
        Runtime.getRuntime().addShutdownHook(new Thread() {
            public void run() {
                db.shutdown();
            }
        });
    }

    public enum RelTypes implements RelationshipType {
        EDGE
    }
}

The first parameter of the EigenvectorCentralityPower constructor is the direction in which the paths should follow the relationships. The second parameter is a CostEvaluator object. In order to make the solving of shortest path problems as general as possible, the algorithms accept objects handling all relevant tasks regarding the costs of paths; this allows the user to represent the costs in any possible way and to calculate them in any way. getCost(Relationship arg0, Direction arg1) is the general method for looking up costs for relationships; it can do anything, such as looking up a property or running some small calculation. In this example it simply returns the value 1.0 for every edge/relationship.


The third parameter is the set of nodes the calculation should be run on. The fourth parameter is the set of relationships that should be processed. The fifth parameter is the precision factor (e.g. 0.01 for 1% error); this is not the error from the correct values, but the amount of change tolerated in one iteration.

EigenvectorCentralityPower eigen = new EigenvectorCentralityPower(
    Direction.BOTH,
    new CostEvaluator<Double>() {
        public Double getCost(Relationship arg0, Direction arg1) {
            return 1.0;
        }
    },
    Sets.newHashSet(GlobalGraphOperations.at(db).getAllNodes()),
    Sets.newHashSet(GlobalGraphOperations.at(db).getAllRelationships()),
    0.01);


4.2.5 Clustering Coefficient

Clustering Coefficient is a measure of the degree to which nodes in a graph tend to cluster together. More particularly, the clustering coefficient of a selected user is defined as the probability that two randomly selected neighbors are connected to each other.

public class ClusteringCoefficient {
    private static GraphDatabaseService db;

    public static void main(String[] args) {
        GraphDatabaseFactory dbFactory = new GraphDatabaseFactory();
        db = dbFactory.newEmbeddedDatabase("C:/Databases/Test");

        try (Transaction transaction = db.beginTx()) {
            for (Node node : GlobalGraphOperations.at(db).getAllNodes()) {
                double value = compute(node);
                System.out.println("NodeId: " + node.getId() + ",value:" + value);
            }
            transaction.success();
            shutDown();
        }
    }

    private static double compute(Node node) {
        Set<Node> neighbours = new HashSet<>();
        for (Relationship edge : node.getRelationships(RelTypes.LINKS, Direction.BOTH)) {
            neighbours.add(edge.getOtherNode(node));
        }
        if (neighbours.size() <= 1) return 0.0;
        long numEdges = 0;
        for (Node neighbour : neighbours) {
            for (Relationship edge : neighbour.getRelationships(RelTypes.LINKS, Direction.OUTGOING)) {
                if (neighbours.contains(edge.getOtherNode(neighbour))) {
                    numEdges++;
                }
            }
        }
        long possibleEdges = (long) neighbours.size() * (neighbours.size() - 1);
        return (double) numEdges / possibleEdges;
    }

    public enum RelTypes implements RelationshipType {
        LINKS
    }

    private static void shutDown() {
        Runtime.getRuntime().addShutdownHook(new Thread() {
            public void run() {
                db.shutdown();
            }
        });
    }
}


4.3 Apache Giraph

Introduction
Apache Giraph is an iterative graph processing system built for high scalability. It is currently used at Facebook to analyze the social graph formed by users and their connections. Giraph originated as the open-source counterpart to Pregel, the graph processing architecture developed at Google; both systems are inspired by the Bulk Synchronous Parallel model of distributed computation. Giraph adds several features beyond the basic Pregel model:
- master computation
- shared aggregators
- edge-oriented input
- out-of-core computation

Apache Giraph releases:
- Giraph 0.1-incubating, released in February 2012
- Giraph 1.0, released in January 2013
- Giraph 1.1, released in November 2014

Apache Giraph is part of the Hadoop ecosystem and runs on Hadoop. Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common and should be automatically handled by the framework. The base of the Apache Hadoop framework:
- Hadoop MapReduce: A programming model and a system for the processing of large datasets. It is based on scanning files in parallel across multiple machines, modelling data as keys and values. Giraph can run on Hadoop as a MapReduce job.
- Hadoop Distributed File System (HDFS): A distributed filesystem that stores data across multiple machines. It provides high throughput and fault tolerance. Giraph can read data from and write data to HDFS.
- Hadoop YARN: A resource-management platform responsible for managing computing resources in clusters and using them for the scheduling of users' applications.
- Apache Hive: A data warehouse for Hadoop which lets you express ad hoc queries and analytics for large data sets with a high-level programming language similar to SQL.


Apache Giraph uses the Bulk Synchronous Parallel (BSP) model, a bridging model for designing parallel algorithms. The BSP model was developed by Leslie Valiant of Harvard University during the 1980s. A BSP computer consists of:
- components capable of processing and/or local memory transactions
- a network that routes messages between pairs of such components
- a hardware facility that allows for the synchronization of the components

A BSP algorithm relies heavily on the third feature, because a computation proceeds in a series of global supersteps, each of which consists of three components:
- Concurrent computation: every participating processor may perform local computations, i.e., each process can only make use of values stored in the local fast memory of the processor. The computations occur asynchronously of all the others but may overlap with communication.
- Communication: the processes exchange data between themselves to facilitate remote data storage capabilities.
- Barrier synchronization: when a process reaches this point (the barrier), it waits until all other processes have reached the same barrier.

A minimal sketch of this superstep structure is given below.
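The following purely illustrative Python sketch (not Giraph code; all names are assumptions) shows a vertex-centric BSP loop: in every superstep each vertex consumes the messages sent to it in the previous superstep, updates its value and may send new messages, and a barrier separates supersteps.

def bsp_run(vertices, out_edges, compute, max_supersteps=30):
    # vertices: dict vertex id -> value; out_edges: dict vertex id -> list of neighbour ids
    inbox = {v: [] for v in vertices}              # messages delivered in the current superstep
    for superstep in range(max_supersteps):
        outbox = {v: [] for v in vertices}         # messages to be delivered in the next superstep
        all_halted = True
        for v in vertices:                         # "concurrent computation" (sequential here for clarity)
            value, messages, active = compute(superstep, v, vertices[v], inbox[v], out_edges[v])
            vertices[v] = value
            for target, msg in messages:
                outbox[target].append(msg)
            all_halted = all_halted and not active
        inbox = outbox                             # barrier: messages become visible only in the next superstep
        if all_halted and all(len(m) == 0 for m in inbox.values()):
            break
    return vertices

# Example compute(): every vertex adopts the largest value it has seen and forwards it.
def max_value_compute(superstep, vertex_id, value, messages, neighbours):
    new_value = max([value] + messages)
    changed = (superstep == 0) or (new_value > value)
    out = [(n, new_value) for n in neighbours] if changed else []
    return new_value, out, changed

print bsp_run({1: 3, 2: 6, 3: 2}, {1: [2], 2: [3], 3: [1]}, max_value_compute)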


Vertex-Centric BSP:
- Each vertex has an id, a value, a list of its adjacent vertex ids and the corresponding edge values.
- Each vertex is invoked in each superstep, can recompute its value and send messages to other vertices, which are delivered over superstep barriers.

The Apache Giraph data model consists of the following:

1. A graph is a set of vertices and edges, and each vertex is defined as follows:
- It has a unique ID, defined by a type (e.g., an integer)
- It has a value, also defined by a type (e.g., a double)
- It has a number of outgoing edges that point to other vertices
- Each edge can have a value, also defined by a type (e.g., an integer)


2. The input data is modeled as a directed graph and edges are assigned to their source vertex.
3. Vertices are aware only of their outgoing edges.

An Apache Giraph job includes three phases:
- Loading phase: The graph is loaded from disk and each worker is assigned specific vertices. Giraph loads the graph using an input format.
- Compute phase: In each superstep, the workers deliver messages to vertices, iterate over the active vertices, and call the compute() method on each vertex.
- Offloading phase: Once the computation is done, the vertices are offloaded to HDFS using an output format.

Code examples
Apache Giraph includes some essential objects and interface instances for its functionality; some of them are presented in this section. This section also includes a complete example, which is explained in detail. Each vertex object is an instance of the Vertex class. Using the interface of the Vertex class you can access the vertex value, access the edges and their values, and add and remove edges. Giraph has a default implementation of vertices and edges, and the user is not required to implement these interfaces unless some specific behavior needs to be extended. The Edge class is simpler and has only three methods: one to get the vertex ID of the other endpoint, one to get the value, and one to set the value.


Vertex class
- function getId(): Returns the vertex ID; the type depends on the type of the ID.
- function getValue(): Returns the vertex value; the type depends on the type of the vertex value.
- function setValue(value): Sets the vertex value; the type of the parameter depends on the type of the value.
- function getEdges(): Returns all the outgoing edges of the vertex.
- function getNumEdges(): Returns the total number of outgoing edges of the vertex.
- function getEdgeValue(targetId): If the edge exists, returns the value of the first edge connecting to the target vertex.
- function setEdgeValue(targetId, value): If the edge exists, sets the value of the first edge connecting to the target vertex.
- function getAllEdgeValues(targetId): Returns the values of all the edges connecting to the given target vertex.
- function voteToHalt(): Signals the end of the computation for the given vertex.
- function addEdge(edge): Adds the edge specified in the parameter to the vertex.
- function removeEdges(targetId): Removes all edges that point to the target vertex.

Edge class
- function getTargetVertexId(): Returns the target vertex ID; the return type depends on the type of the ID.
- function getValue(): Returns the edge value; the return type depends on the type of the edge value.
- function setValue(value): Sets the edge value; the type of the parameter depends on the type of the edge value.

Example
The next example shows a complete implementation of one of the important algorithms in graph theory. It calculates the shortest paths in a graph using Dijkstra's algorithm: given a graph representation and a vertex, it finds the shortest path between that vertex and every other one in the graph. It starts by assigning a value of 0 to the chosen vertex and a value of infinity to all the others. Then the algorithm proceeds, in parallel, to calculate for every vertex the sum of the vertex's value (the current shortest distance to it) plus the distance to each of the vertex's neighbors. The algorithm converges when no more updates happen on the graph.


SimpleShortestPathsComputation.java

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.Algorithm;
import org.apache.giraph.conf.LongConfOption;
import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.log4j.Logger;
import java.io.IOException;

public class SimpleShortestPathsComputation extends BasicComputation<
    LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  public static final LongConfOption SOURCE_ID =
      new LongConfOption("SimpleShortestPathsVertex.sourceId", 1, "The shortest paths id");

  private static final Logger LOG =
      Logger.getLogger(SimpleShortestPathsComputation.class);

  private boolean isSource(Vertex<LongWritable, ?, ?> vertex) {
    return vertex.getId().get() == SOURCE_ID.get(getConf());
  }

  @Override
  public void compute(
      Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstep() == 0) {
      vertex.setValue(new DoubleWritable(Double.MAX_VALUE));
    }
    double minDist = isSource(vertex) ? 0d : Double.MAX_VALUE;
    for (DoubleWritable message : messages) {
      minDist = Math.min(minDist, message.get());
    }
    if (LOG.isDebugEnabled()) {
      LOG.debug("Vertex " + vertex.getId() + " got minDist = " + minDist +
          " vertex value = " + vertex.getValue());
    }
    if (minDist < vertex.getValue().get()) {
      vertex.setValue(new DoubleWritable(minDist));
      for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
        double distance = minDist + edge.getValue().get();
        if (LOG.isDebugEnabled()) {
          LOG.debug("Vertex " + vertex.getId() + " sent to " +
              edge.getTargetVertexId() + " = " + distance);
        }
        sendMessage(edge.getTargetVertexId(), new DoubleWritable(distance));
      }
    }
    vertex.voteToHalt();
  }
}


The above code operates on a graph where every vertex has a LongWritable ID and a DoubleWritable distance to the source vertex as its vertex value. All the edges carry FloatWritable labels, representing the distance between the two connected vertices. A source vertex is given to the implementation via the SimpleShortestPathsVertex.sourceId command-line option and is expected to be a LongWritable ID. The algorithm then measures the distance from this source vertex to all the other vertices in the graph. Initially it assigns a distance of Double.MAX_VALUE, which represents infinity, to all vertices except the source vertex. Every vertex waits to receive messages from the neighbors it is connected to; each message carries the neighbor's current distance from the source plus the weight of the edge between that neighbor and the receiving vertex. The next step is to pick the smallest of these values, and if it is smaller than the current distance, the vertex assumes that a shorter path through one of its neighbors has been found. It then sends messages to its own neighbors announcing that fact. Finally, the call to the voteToHalt() Giraph method guarantees that the computation continues only as long as there are unprocessed messages. When no messages are left, every vertex holds its shortest distance and all that remains is to record the state of the graph. In Apache Giraph all the computation takes place in the compute() method, which is called for every vertex at least once at Superstep 0 and then for as long as there are messages sent to the vertex. Defining the compute() method is mandatory, so in the next sections we describe only the implementation of this method, which is the core of the Apache Giraph computations.
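To make this structure explicit before the following sections, the sketch below shows a minimal compute() implementation. It is illustrative only and not part of the thesis code (the class name is hypothetical): initialization at Superstep 0, message aggregation in later supersteps, and deactivation through voteToHalt().

ComputeSkeleton.java

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import java.io.IOException;

public class ComputeSkeleton extends BasicComputation<
    LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstep() == 0) {
      // Superstep 0: initialize the vertex value (a real algorithm would
      // typically also send the first round of messages here).
      vertex.setValue(new DoubleWritable(0.0));
    } else {
      // Later supersteps: combine the incoming messages and update the value.
      double sum = 0;
      for (DoubleWritable message : messages) {
        sum += message.get();
      }
      vertex.setValue(new DoubleWritable(sum));
    }
    // Deactivate the vertex; it is woken up again only if a new message arrives.
    vertex.voteToHalt();
  }
}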


4.3.1 Degree Centrality
Degree centrality is defined as the number of links incident upon a node.
DegreeCentrality.java

@Override
public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
    Iterable<DoubleWritable> messages) throws IOException {
  if (getSuperstep() == 0) {
    Iterable<Edge<LongWritable, FloatWritable>> edges = vertex.getEdges();
    for (Edge<LongWritable, FloatWritable> edge : edges) {
      sendMessage(edge.getTargetVertexId(), new DoubleWritable(1.0));
    }
  } else {
    long sum = 0;
    for (DoubleWritable message : messages) {
      sum++;
    }
    DoubleWritable vertexValue = vertex.getValue();
    vertexValue.set(sum + vertex.getNumEdges());
    vertex.setValue(vertexValue);
    vertex.voteToHalt();
  }
}

At Superstep 0, every vertex gets its edges as an Iterable object. Then, for every edge it is connected with, the vertex sends a message. The first parameter of the message is the destination ID and the second parameter is a value. To calculate the in-degree, we send only the value 1.0 in the message.

if (getSuperstep() == 0) {
  Iterable<Edge<LongWritable, FloatWritable>> edges = vertex.getEdges();
  for (Edge<LongWritable, FloatWritable> edge : edges) {
    sendMessage(edge.getTargetVertexId(), new DoubleWritable(1.0));
  }
}

At the next Superstep, every vertex counts in the sum variable the messages it receives.

else {
  long sum = 0;
  for (DoubleWritable message : messages) {
    sum++;
  }


Finally, every vertex sets its value to the sum of the incoming messages plus its number of outgoing edges; this summation is the Degree centrality. Then every vertex calls the voteToHalt() method, signaling that it has finished its computation. When all vertices have completed their computation and no more messages are sent through the network, the Degree centrality value of every vertex has been calculated.

DoubleWritable vertexValue = vertex.getValue();
vertexValue.set(sum + vertex.getNumEdges());
vertex.setValue(vertexValue);
vertex.voteToHalt();
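As a small worked example with hypothetical numbers (the class below is illustrative, not thesis code), a vertex that receives three messages (its in-degree) and has two outgoing edges ends up with a Degree centrality value of 5.

DegreeExample.java

public class DegreeExample {
  public static void main(String[] args) {
    long sum = 3;                         // messages received in the second superstep (in-degree)
    int outgoingEdges = 2;                // what vertex.getNumEdges() would return (out-degree)
    double degree = sum + outgoingEdges;  // Degree centrality value = 5.0
    System.out.println(degree);
  }
}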


4.3.2 Closeness Centrality
Closeness centrality is based on the length of the average shortest path between a vertex and all other vertices in the network. The more central a node is, the lower its total distance from all other nodes. In this section we split the compute() function into parts in order to improve the code readability.

public void compute(Vertex<LongWritable, VertexData, FloatWritable> vertex,
    Iterable<Message> messages) throws IOException {
  if (getSuperstep() == 0) {
    // Initialize the distance map: every vertex of the graph starts at
    // Double.MAX_VALUE ("infinity"); the current vertex gets the default value -1.
    closeness = new HashMap<Long, Double>();
    for (long i = START_ID; i <= GRAPH_SIZE; i++) {
      closeness.put(i, Double.MAX_VALUE);
    }
    closeness.put(vertex.getId().get(), -1.0);
    vertex.setValue(new VertexData(closeness));
    ArrayList<Long> vertexIds = new ArrayList<Long>();
    vertexIds.add(vertex.getId().get());
    Iterable<Edge<LongWritable, FloatWritable>> edges = vertex.getEdges();
    for (Edge<LongWritable, FloatWritable> edge : edges) {
      sendMessage(edge.getTargetVertexId(), new Message(1.0, vertexIds));
    }
  }

In Superstep 0, we create a Java HashMap with all the vertex IDs of the graph as keys and the maximum double value in Java as the value for each key. Each vertex holds such a HashMap, which records its distance to every other vertex in the graph. The variable GRAPH_SIZE denotes the size of the graph being analyzed and must be specified before the algorithm starts. If the value for a vertex in this HashMap is equal to Double.MAX_VALUE, then the current vertex has not discovered that vertex yet. For the current vertex itself, we set a default value of -1. Then the vertex sets its value to this HashMap. By default, a vertex can only store a double value; we extend this functionality with the class VertexData [Appendix A.1] in order to store a HashMap instead of a Double. We then send a message to all the vertices to which the current vertex is connected. The message includes the value 1.0, because direct neighbors have distance 1, and a list of IDs; in Superstep 0 this list contains only the sender's own ID. By default the sendMessage() function takes as first parameter the ID of the destination vertex and as second parameter a Double value. We extend this functionality with the class Message [Appendix A.2], which takes as first parameter a double value, the distance between the source and the destination vertex, and as second parameter a list of ID values whose distances the destination vertex will update in its distance HashMap.


We explain the rest of the compute() function in three parts for better readability.

else {
  vertexIds.clear();
  closeness = vertex.getValue().getCloseness();
  double value = 0;
  ArrayList<Long> exist = new ArrayList<Long>();
  Iterable<Edge<LongWritable, FloatWritable>> edges = vertex.getEdges();
  for (Message message : messages) {
    value = message.getValue();
    for (int i = 0; i < message.getIds().size(); i++) {
      long id = message.getIds().get(i);
      // Keep only IDs that this vertex has not discovered yet, each at most once.
      if (closeness.get(id) == Double.MAX_VALUE && !exist.contains(id)) {
        exist.add(id);
        vertexIds.add(id);
      }
    }
  }
  if (vertexIds.size() > 0) {
    for (Edge<LongWritable, FloatWritable> edge : edges) {
      sendMessage(edge.getTargetVertexId(), new Message(value + 1.0, vertexIds));
    }
  }
}

At first, the vertex receives messages from the other vertices it is connected with; the value of every message received in the same superstep is the same. Then the current node builds a list of unique node IDs whose value in its distance HashMap is still equal to Double.MAX_VALUE, and sends a message with these IDs to its neighbors so that they can update their own distance HashMaps. At every round of messages the message value increases by one. If this list is empty, no message is sent from the vertex. If none of the vertices of the graph sends a new message, the algorithm stops.

for (Message message : messages) {
  for (int i = 0; i < message.getIds().size(); i++) {
    if (closeness.get(message.getIds().get(i)) > message.getValue()) {
      closeness.put(message.getIds().get(i), message.getValue());
    }
  }
}

In this part of the code, the current vertex updates its distance HashMap whenever the received value for a specific ID is less than the value currently stored for that ID. Initially, the distance values are initialized to Double.MAX_VALUE, so the first received distance for an ID always replaces it.
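The following plain-Java snippet, with hypothetical values and a hypothetical class name, illustrates this update rule in isolation: a stored distance is replaced only when the received value is smaller.

DistanceUpdateExample.java

import java.util.HashMap;
import java.util.Map;

public class DistanceUpdateExample {
  public static void main(String[] args) {
    Map<Long, Double> closeness = new HashMap<>();
    closeness.put(3L, Double.MAX_VALUE);   // vertex 3 has not been discovered yet
    double received = 2.0;                 // distance carried by an incoming message for vertex 3
    if (closeness.get(3L) > received) {
      closeness.put(3L, received);         // shorter path found: store distance 2
    }
    System.out.println(closeness.get(3L)); // prints 2.0
  }
}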

double counter = 0.0;
for (Map.Entry<Long, Double> entry : closeness.entrySet()) {
  if (entry.getValue() != Double.MAX_VALUE && entry.getKey() != vertex.getId().get()) {
    counter = counter + entry.getValue();
  }
}
double closeness_value = round((double) ((GRAPH_SIZE - 1) / counter), 2);
closeness.put(GRAPH_SIZE, closeness_value);
vertex.setValue(new VertexData(closeness));
vertex.voteToHalt();

In the final part of the compute() function, the current vertex sums the distances to the other vertices in the network, skipping every distance that is still equal to Double.MAX_VALUE. Subsequently, the vertex calculates its current Closeness centrality value and stores it in the last position of the distance HashMap. Then the vertex updates its vertex value with the updated distance HashMap and calls the voteToHalt() function, which signals that the current vertex has completed its computation.

Next, we give a simple example of how the algorithm works, in order to fully clarify the steps of the Closeness centrality algorithm in Apache Giraph. In this example, we only present the distance map of Node 1. In Superstep 0, the value of each vertex in the distance map, excluding the current vertex, is set to infinity. Every vertex has a distance map with the distances to all other nodes. Each vertex sends a message to its neighbors with its ID and the value 1, which represents the distance between the two nodes; then every vertex votes to halt, signaling that it has completed its computation.


In Superstep 1, each vertex sends to its neighbors the ID values it received from them in the previous Superstep, increasing the message value by one. It also updates its distance map for every ID whose received value is smaller than the stored one, and finally calculates its current Closeness centrality value. The vertices then vote to halt.

In the next Supersteps the same procedure continues, and the distance map of every vertex is updated.

In the final Superstep (Superstep 3 in this example), none of the nodes sends a message. Every vertex calculates its Closeness centrality value and calls the voteToHalt() method. Because no message is sent through the network in this last Superstep, every vertex in the graph becomes inactive and the Closeness centrality calculation ends.
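As a numeric illustration of the final step (the graph and distances below are hypothetical and do not refer to the example above), a vertex in a 4-node graph that has discovered distances 1, 1 and 2 to the other vertices obtains Closeness centrality (4 - 1) / (1 + 1 + 2) = 0.75.

ClosenessExample.java

import java.util.HashMap;
import java.util.Map;

public class ClosenessExample {
  public static void main(String[] args) {
    long graphSize = 4;
    Map<Long, Double> distances = new HashMap<>();
    distances.put(2L, 1.0);
    distances.put(3L, 1.0);
    distances.put(4L, 2.0);
    double sum = 0.0;
    for (double d : distances.values()) {
      sum += d;                                // total distance to the other vertices
    }
    double closeness = (graphSize - 1) / sum;  // (GRAPH_SIZE - 1) / counter = 0.75
    System.out.println(closeness);
  }
}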


4.3.3 Eigenvector Centrality
Eigenvector centrality establishes that the importance of a vertex is determined by the importance of its neighbors.
EigenvectorCentrality.java

@Override
public void compute(
    Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
    Iterable<DoubleWritable> messages) throws IOException {
  if (getSuperstep() >= 1) {
    double sum = 0;
    for (DoubleWritable message : messages) {
      sum += message.get();
    }
    DoubleWritable vertexValue =
        new DoubleWritable((0.15f / getTotalNumVertices()) + 0.85f * sum);
    vertex.setValue(vertexValue);
    aggregate(MAX_AGG, vertexValue);
    aggregate(MIN_AGG, vertexValue);
    aggregate(SUM_AGG, new LongWritable(1));
  } else {
    vertex.setValue(new DoubleWritable(1.0 / getTotalNumVertices()));
  }

  if (getSuperstep() < MAX_SUPERSTEPS) {
    long edges = vertex.getNumEdges();
    sendMessageToAllEdges(vertex,
        new DoubleWritable(vertex.getValue().get() / edges));
  } else {
    vertex.voteToHalt();
  }
}

The calculation of Eigenvector centrality is similar to the PageRank calculation. PageRank is a recursive process: the PageRank of each vertex depends on the PageRank of the vertices with edges pointing to it, that is, on its neighbors. At Superstep 0, every vertex value is set to the same initial value, 1 divided by the number of vertices in the graph.

else {
  vertex.setValue(new DoubleWritable(1.0 / getTotalNumVertices()));
}

At every Superstep after 0, and until the algorithm completes, every vertex calculates a new value, its current PageRank. Each vertex computes its PageRank from the incoming messages, the number of vertices and the constant d, the damping factor, which is set to 0.85. The computation is thus completely independent for each vertex.
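As a hand-computed instance of this update (the message values, graph size and class name below are hypothetical), a vertex with three incoming messages in a 10-vertex graph obtains a new value of 0.15/10 + 0.85 * (0.05 + 0.10 + 0.15), which is approximately 0.27.

PageRankUpdateExample.java

public class PageRankUpdateExample {
  public static void main(String[] args) {
    long numVertices = 10;
    double sum = 0.05 + 0.10 + 0.15;                      // sum of the incoming message values
    double newValue = (0.15 / numVertices) + 0.85 * sum;  // damping factor d = 0.85
    System.out.println(newValue);                         // approximately 0.27
  }
}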


The aggregate() function enables global computation in the application; it can be used, for example, to check whether a global condition is satisfied. During a superstep, vertices provide values to aggregators. These values are aggregated by the system and the results become available to all vertices in the following superstep.

if (getSuperstep() >= 1) {
  double sum = 0;
  for (DoubleWritable message : messages) {
    sum += message.get();
  }
  DoubleWritable vertexValue =
      new DoubleWritable((0.15f / getTotalNumVertices()) + 0.85f * sum);
  vertex.setValue(vertexValue);
  aggregate(MAX_AGG, vertexValue);
  aggregate(MIN_AGG, vertexValue);
  aggregate(SUM_AGG, new LongWritable(1));
}

As long as the number of Supersteps has not reached the maximum value (the number defined at the beginning of the Eigenvector centrality computation), every vertex sends a message to all the vertices it is connected to. The message value is the current vertex value divided by the total number of the vertex's outgoing edges.

if (getSuperstep() < MAX_SUPERSTEPS) {
  long edges = vertex.getNumEdges();
  sendMessageToAllEdges(vertex,
      new DoubleWritable(vertex.getValue().get() / edges));
}

Ideally the algorithm would complete at convergence. Because convergence is reached only asymptotically, PageRank is usually computed for a fixed number of iterations. The halting condition is based on this number. else { vertex.voteToHalt(); }
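The aggregators referenced in the compute() method above (MAX_AGG, MIN_AGG, SUM_AGG) must be registered before they can be used. The sketch below shows one way to do this in a MasterCompute class, following the pattern of Giraph's bundled PageRank example; the class name and constant values are illustrative and not taken from the thesis code.

EigenvectorMasterCompute.java

import org.apache.giraph.aggregators.DoubleMaxAggregator;
import org.apache.giraph.aggregators.DoubleMinAggregator;
import org.apache.giraph.aggregators.LongSumAggregator;
import org.apache.giraph.master.DefaultMasterCompute;

public class EigenvectorMasterCompute extends DefaultMasterCompute {
  // Aggregator names; assumed to match the MAX_AGG / MIN_AGG / SUM_AGG
  // constants used in the compute() method above.
  public static final String MAX_AGG = "max";
  public static final String MIN_AGG = "min";
  public static final String SUM_AGG = "sum";

  @Override
  public void initialize() throws InstantiationException, IllegalAccessException {
    registerAggregator(MAX_AGG, DoubleMaxAggregator.class);
    registerAggregator(MIN_AGG, DoubleMinAggregator.class);
    registerPersistentAggregator(SUM_AGG, LongSumAggregator.class);
  }
}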


4.4 Experimental Comparison
In this chapter, we present and analyze our experimental results. In our experiments we use a single computer with a 4-core Intel i5 processor at 3.2 GHz and 16 GB of RAM. We use six datasets: DBLP, Wiki-Talk, LiveJournal, Twitter, the Amazon product network and Email Enron. We performed experiments on three graph processing systems (Neo4j, SNAP and Apache Giraph), and for each dataset we execute the algorithms for Degree centrality, Closeness centrality, Betweenness centrality and Eigenvector centrality.
4.4.1 DBLP
For this dataset, we extracted the nodes and the edges from the DBLP database using the collaboration relationships among the authors of the scientific publications it contains. The resulting graph has 997,525 nodes and 5,429,755 edges. The following chart displays the running times in seconds for Degree centrality and Eigenvector centrality.

As we can observe, the Stanford's Network Analysis Platform (SNAP) has the best performance for both algorithms: it took 10 seconds to calculate Degree centrality and 54 seconds for Eigenvector centrality. Besides SNAP, Apache Giraph performs better than Neo4j for Degree centrality, but Neo4j is much faster than Apache Giraph in the Eigenvector centrality calculation.


The following chart displays the running times in days for Closeness centrality and Betweenness centrality.

Closeness centrality on SNAP took 7.5 days to complete and still executed faster than on Neo4j. Neo4j performed better in the calculation of Betweenness centrality because it executes the calculations in parallel threads using a parallel Betweenness centrality algorithm. The calculation of Closeness centrality and Betweenness centrality was not possible in Apache Giraph on a single computer due to the expensive computations of these two centrality algorithms.


4.4.2 Wiki-Talk
This dataset was acquired from Stanford's Large Network Dataset Collection repository [14]. The network contains all the users and discussions from the inception of Wikipedia until January 2008. Nodes in the network represent Wikipedia users, and a directed edge from one user to another indicates that the first user edited a talk page of the second at least once. The graph has 2,394,385 nodes and 5,021,410 edges. The following chart displays the running times in seconds for Degree centrality and Eigenvector centrality.

As we can observe, the Stanford's Network Analysis Platform (SNAP) has the best performance for both algorithms: it took 17.8 seconds to calculate Degree centrality and 61 seconds for Eigenvector centrality. As in the previous experiment on the DBLP dataset, Apache Giraph performs better than Neo4j for Degree centrality, but Neo4j is much faster than Apache Giraph in the Eigenvector centrality calculation. The calculation of Closeness centrality and Betweenness centrality was not possible in any of the systems due to the size of the dataset and the available RAM.


4.4.3 LiveJournal
This dataset was acquired from Stanford's Large Network Dataset Collection repository [14]. LiveJournal is a free online community which allows members to maintain journals, individual and group blogs, and to declare which other members are their friends. The graph has 3,997,962 nodes and 34,681,189 edges. The following chart displays the running times in seconds for Degree centrality and Eigenvector centrality.

Stanford's Network Analysis Platform (SNAP) has the best performance for both algorithms: it took 49.8 seconds to calculate Degree centrality and 8 minutes 39 seconds for Eigenvector centrality. Neo4j executes the Degree centrality algorithm in 2 minutes 50 seconds and Eigenvector centrality in 2 hours 11 minutes. Apache Giraph needs 3 minutes and 33 seconds to calculate Degree centrality, and it was not possible to calculate Eigenvector centrality due to memory issues. The calculation of Closeness centrality and Betweenness centrality was not possible in any of the systems due to the size of the dataset and the available RAM.


4.4.4 Twitter
This dataset was acquired from the Arizona State University repository [15]. Twitter is a social news website; it can be viewed as a hybrid of email, instant messaging and SMS messaging, and it offers an easy way to discover the latest news related to subjects a user cares about. The graph has 11,316,811 nodes and 85,331,845 edges. The following chart displays the running times in seconds for Degree centrality and Eigenvector centrality.

Apache Giraph has the best performance in Degree centrality, completing the calculation in 3 minutes and 54 seconds; it was not possible to calculate Eigenvector centrality due to memory issues. SNAP calculates Degree centrality in 12 minutes 50 seconds and Eigenvector centrality in 13 minutes and 43 seconds. In Neo4j, the calculation of Degree and Eigenvector centrality was infeasible due to memory issues. The calculation of Closeness centrality and Betweenness centrality was not possible in any of the systems due to the size of the dataset and the available RAM.


4.4.5 Amazon product network
This dataset was acquired from Stanford's Large Network Dataset Collection repository [14]. The network was collected by crawling the Amazon website and is based on the "Customers Who Bought This Item Also Bought" feature: if a product i is frequently co-purchased with product j, the graph contains an undirected edge between i and j. The graph has 334,863 nodes and 925,872 edges. The following chart displays the running times in seconds for Degree centrality and Eigenvector centrality.

Stanford's Network Analysis Platform (SNAP) has the best performance for both algorithms: it took 4.6 seconds to calculate Degree centrality and 29.3 seconds for Eigenvector centrality. Neo4j has the second best performance among the three systems; it executes the Degree centrality calculation in 8.1 seconds and Eigenvector centrality in 3 minutes and 15 seconds. Apache Giraph calculates Degree centrality in 18.1 seconds and Eigenvector centrality in 11 minutes and 32 seconds. The following chart displays the running times in hours for Closeness centrality and Betweenness centrality.


Both Closeness centrality and Betweenness centrality executed fastest in Neo4j: the execution time was 3 hours and 20 minutes for Closeness centrality and 7 hours and 53 minutes for Betweenness centrality. In Apache Giraph, Closeness centrality executed in approximately 6 hours. Finally, in SNAP the execution time of Closeness centrality was 19 hours and 55 minutes, and the complete execution of Betweenness centrality took approximately 15 days. Betweenness centrality was not calculated in Apache Giraph.


4.4.6 Email Enron
In order to construct the Enron e-mail network dataset, we first acquire all the e-mails among Enron's employees from the Enron Email Dataset repository, which is hosted by Carnegie Mellon University. The Enron corpus is a large database of over 600,000 emails generated by 158 employees of the Enron Corporation and acquired by the Federal Energy Regulatory Commission during its investigation after the company's collapse. We then parse every e-mail in the dataset and create a directed link from the sender to the recipient. The graph has 36,692 nodes and 183,831 edges. The following chart displays the running times in seconds for Degree centrality and Eigenvector centrality.

Stanford's Network Analysis Platform (SNAP) has the best performance for both algorithms: it took 0.55 seconds to calculate Degree centrality and 1.3 seconds for Eigenvector centrality. Neo4j has the second best performance among the three systems; it executes the Degree centrality calculation in 3.9 seconds and Eigenvector centrality in 35 seconds. Apache Giraph calculates Degree centrality in 16.8 seconds and Eigenvector centrality in 58.4 seconds. The following chart displays the running times in hours for Closeness centrality and Betweenness centrality.


Both Closeness centrality and Betweenness centrality executed fastest in Neo4j: the execution time was 2 hours 55 minutes for Closeness centrality and 2 hours and 3 minutes for Betweenness centrality. In Apache Giraph, Closeness centrality executed in approximately 4 hours. Finally, in SNAP the execution time of Closeness centrality was 7 hours and 53 minutes, and the complete execution of Betweenness centrality took 5 hours and 5 minutes. Betweenness centrality was not calculated in Apache Giraph.


5. Conclusion
Graph processing systems are increasingly important as more and more problems require dealing with graphs. The comparison of the three systems on datasets with different sizes and properties helped us understand better the individual properties of each graph processing system and study them in depth. Furthermore, the centrality metrics of social networks have been studied in depth, and the correlation between research productivity and research collaboration has been examined.
5.1 Graph Analysis Platforms Performance Comparison
We presented a thorough comparison of three recent graph processing systems, SNAP, Neo4j and Apache Giraph, on six datasets using four different algorithms: Degree centrality, Closeness centrality, Betweenness centrality and Eigenvector centrality. We used a single machine with a 4-core Intel i5 processor at 3.2 GHz and 16 GB of RAM.
The first graph processing system we used, Stanford's Network Analysis Platform, is a lightweight graph library with Python and C++ distributions. It is very easy to understand and to code for, it is well documented and it is supported by Stanford University. Compared to the other two systems, it manages the memory occupied by the loaded graph very efficiently: even if the graph is huge, like the Twitter network with 11,316,811 nodes and 85,331,845 edges, SNAP loads it using only a few GB of RAM. As far as execution time is concerned, it is by far the most efficient system for calculating Degree centrality and Eigenvector centrality on small or medium-sized datasets. However, on huge datasets like Twitter, or for more complex algorithms like Closeness centrality and Betweenness centrality, it does not perform as well as the other two systems. One of its disadvantages is that it uses single-threaded computations, which is the reason it does not perform as well as Neo4j on the most complex centrality algorithms.
The next graph system we used is Neo4j, an open-source graph database implemented in Java. There are several ways of accessing a Neo4j database: a Web UI that executes queries written in the Cypher query language, the Neo4j shell commands, the native Java API that executes Cypher queries, or the Java core API, which offers a well-implemented library for accessing the database. For our experiments we use the Java core API to execute the four algorithms we implemented. The core functions of Neo4j's Java API are easy to understand, and the documentation is comprehensive. Although it does not perform as well as SNAP for the Degree centrality and Eigenvector centrality algorithms, Neo4j executes the calculations for complex metrics such as Closeness and Betweenness centrality very fast, because it runs the calculations in multiple threads and reduces the waste associated with unused slots. One disadvantage of Neo4j is its memory management: due to its building blocks it tends to store many components which are meaningless for some of the algorithms, and as a result it wastes a significant amount of memory. For example, we could not run any of the centrality algorithms on the Twitter dataset because 16 GB of RAM is not enough to store the whole graph.
The last graph processing system we used is Apache Giraph, an iterative graph processing system built for high scalability. Apache Giraph uses a vertex-centric programming model, where users express their algorithms by "thinking like a vertex": each vertex contains information about itself and all its outgoing edges, and the computation is expressed at the level of a single vertex. That is one of the main differences compared to the other two systems, which use a graph-centric approach. The vertex-centric model is very easy to program and has proved to be useful for many graph algorithms. Apache Giraph is well documented; however, it does not have a very active community supporting it. Apache Giraph was created to execute computations mainly on computer clusters over huge datasets. On a single computer Apache Giraph cannot show its full capabilities; nevertheless, on a very large dataset such as Twitter, with 11,316,811 nodes and 85,331,845 edges, it calculates the Degree centrality metric four times faster than the second best system, SNAP. On small datasets it does not perform very well, because it adds an overhead to the centrality calculations: it needs additional time to load the graph during the startup of the algorithm execution. We faced memory issues during some expensive calculations, e.g. Closeness centrality, but we believe that these issues can be eliminated in a distributed environment.


5.2 Research productivity and Research collaboration correlation
This thesis also investigates the correlation between research productivity and research collaboration. Building on the analyses we performed, it is straightforward to measure research collaboration using the Betweenness centrality metric. Intuitively, Betweenness centrality identifies the nodes with the most strategic location in a network. It favors nodes that connect communities rather than nodes that lie inside a community; in other words, it shows which nodes act as ‘bridges’ between nodes in a network. In a collaboration network, nodes with high Betweenness centrality have the power to control collaborative relationships. They also tend to attract more new co-authors, because new authors prefer to attach to existing authors who control the flow of information by playing a brokering (or bridging) role in the collaboration network. In order to measure research productivity, we count the total papers of each author using the DBLP dataset. DBLP is a computer science bibliography website hosted at the University of Trier, in Germany. The original dataset includes 997,525 active nodes and 5,429,755 edges. We calculate Betweenness centrality and total publications first as a baseline from 1990 to 1999 and then for every additional year until 2005. We used a subset of the dataset for the graph visualizations in the following figures. Unfortunately, our hypothesis that there is a relationship between publication count (i.e., productivity) and collaboration “strength” as measured by Betweenness centrality does not seem to hold in general.


However, we did find individual “important” nodes that match our hypothesis. We present two examples below. Rama Chellappa is the chair of the Department of Electrical and Computer Engineering at the University of Maryland.

Until 1999 he showed high Betweenness centrality and published many scientific papers. In 2001 he published only one paper and his Betweenness centrality steadily decreased until 2002. Until then he was Associate Director of the Center for Automation Research at the University of Maryland; in 2002 he became Director of the center and his Betweenness centrality increased. Another example of an important node in the network is Azriel Rosenfeld. Azriel Rosenfeld was a distinguished university professor and a leading researcher in the field of computer image analysis, to which he made many fundamental and pioneering contributions.


From 1990 to 1999 he held a leading position among the top authors according to both Betweenness centrality and total publications. After his retirement in 2001 his total publications and his Betweenness centrality score decreased significantly. Finally, the two metrics became zero after his death in 2004.

The graphical illustration of these results is available in an online application, built with the JavaScript library D3.js, at the URL below.
http://thesisapp.co.nf/


Bibliography

[1] Alan Mislove, Hema Swetha Koppula, Krishna P. Gummadi, Peter Druschel, and Bobby Bhattacharjee. 2008. Growth of the flickr social network. In Proceedings of the first workshop on Online social networks (WOSN '08). ACM, New York, NY, USA, 25-30.
[2] Meeyoung Cha, Alan Mislove, and Krishna P. Gummadi. 2009. A measurement-driven analysis of information propagation in the flickr social network. In Proceedings of the 18th international conference on World wide web (WWW '09). ACM, New York, NY, USA, 721-730.
[3] Bimal Viswanath, Alan Mislove, Meeyoung Cha, and Krishna P. Gummadi. 2009. On the evolution of user interaction in Facebook. In Proceedings of the 2nd ACM workshop on Online social networks (WOSN '09). ACM, New York, NY, USA, 37-42.
[4] Cha, M.; Haddadi, H.; Benevenuto, F. & Gummadi, K. (2010), Measuring User Influence in Twitter: The Million Follower Fallacy, in 4th International AAAI Conference on Weblogs and Social Media (ICWSM).
[5] Julian McAuley and Jure Leskovec. 2014. Discovering social circles in ego networks. ACM Trans. Knowl. Discov. Data 8, 1, Article 4 (February 2014), 28 pages.
[6] Bader, D. & Madduri, K. (2006), Parallel Algorithms for Evaluating Centrality Indices in Real-world Networks, in Proc. The 35th International Conference on Parallel Processing (ICPP).
[7] Brandes, Ulrik, and Christian Pich. "Centrality estimation in large networks." International Journal of Bifurcation and Chaos 17.07 (2007): 2303-2318.
[8] U. Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology 25, 163-177 (2001).
[10] Edmonds, Nick, Torsten Hoefler, and Andrew Lumsdaine. "A space-efficient parallel algorithm for computing betweenness centrality in distributed memory." 2010 International Conference on High Performance Computing. IEEE, 2010.
[11] Minyang Han, Khuzaima Daudjee, Khaled Ammar, M. Tamer Özsu, Xingfang Wang, and Tianqi Jin. 2014. An experimental comparison of pregel-like graph processing systems. Proc. VLDB Endow. 7, 12 (August 2014), 1047-1058.
[12] Salim Jouili and Valentin Vansteenberghe. 2013. An Empirical Comparison of Graph Databases. In Proceedings of the 2013 International Conference on Social Computing (SOCIALCOM '13). IEEE Computer Society, Washington, DC, USA, 708-715.
[13] J. Koch, C.L. Staudt, M. Vogel, H. Meyerhenke. An Empirical Comparison of Big Graph Frameworks in the Context of Network Analysis. In revision.
[14] Stanford Large Network Dataset Collection. http://snap.stanford.edu/data/index.html
[15] Arizona State University Social Computing Data Repository. http://socialcomputing.asu.edu/
[16] Apache Giraph. http://giraph.apache.org/
[17] Neo4j. https://neo4j.com/
[18] Claudio Martella, Dionysios Logothetis, Roman Shaposhnik. 2015. Practical Graph Analytics with Apache Giraph.
[19] DBLP. http://dblp.uni-trier.de/
[20] Miriam Baglioni, Filippo Geraci, Marco Pellegrini, and Ernesto Lastres. 2012. Fast Exact Computation of betweenness Centrality in Social Networks. In Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM '12). IEEE Computer Society, Washington, DC, USA, 450-456.


[21] Eytan Bakshy, Jake M. Hofman, Winter A. Mason, and Duncan J. Watts. 2011. Everyone's an influencer: quantifying influence on twitter. In Proceedings of the fourth ACM international conference on Web search and data mining (WSDM '11). ACM, New York, NY, USA, 65-74.
[22] Kamesh Madduri, David Ediger, Karl Jiang, David A. Bader, and Daniel Chavarria-Miranda. 2009. A faster parallel algorithm and efficient multithreaded implementations for evaluating betweenness centrality on massive datasets. In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS '09). IEEE Computer Society, Washington, DC, USA, 1-8.
[23] Email Enron dataset. https://www.cs.cmu.edu/~./enron/
[24] Ghosh, Saptarshi, et al. "Understanding and combating link farming in the twitter social network." Proceedings of the 21st international conference on World Wide Web. ACM, 2012.
[25] Day, Min-Yuh, Sheng-Pao Shih, and Weide Chang. "Understanding scientific collaboration with social network analysis." The 17th Cross-Strait Conference on Information and Management. 2011.
[26] Abbasi, Alireza, and Jorn Altmann. "On the correlation between research performance and social network analysis measures applied to research collaboration networks." System Sciences (HICSS), 2011 44th Hawaii International Conference on. IEEE, 2011.
[27] Apache Hadoop. http://hadoop.apache.org/
[28] D3.js. https://d3js.org/


Appendix A
In this section, we present the necessary code for the Closeness centrality algorithm in Apache Giraph.
A.1

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import org.apache.hadoop.io.Writable;

public class VertexData implements Writable {
  private Map<Long, Double> closeness = new HashMap<Long, Double>();

  public VertexData() {}

  public VertexData(Map<Long, Double> closeness) {
    this.closeness = closeness;
  }

  public Map<Long, Double> getCloseness() {
    return closeness;
  }

  public void setCloseness(Map<Long, Double> closeness) {
    this.closeness = closeness;
  }

  @Override
  public void readFields(DataInput input) throws IOException {
    long size = input.readLong();
    this.closeness.clear();
    for (int i = 0; i < size; i++) {
      this.closeness.put(input.readLong(), input.readDouble());
    }
  }

  @Override
  public void write(DataOutput output) throws IOException {
    output.writeLong(closeness.size());
    for (Entry<Long, Double> entry : closeness.entrySet()) {
      output.writeLong(entry.getKey());
      output.writeDouble(entry.getValue());
    }
  }
}


A.2

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.io.Writable;

public class Message implements Writable {
  private Double value;
  private ArrayList<Long> ids = new ArrayList<Long>();

  public Message() {}

  public Message(Double value, ArrayList<Long> ids) {
    this.value = value;
    this.ids = ids;
  }

  public Double getValue() {
    return value;
  }

  public void setValue(Double value) {
    this.value = value;
  }

  public ArrayList<Long> getIds() {
    return ids;
  }

  public void setIds(ArrayList<Long> ids) {
    this.ids = ids;
  }

  @Override
  public void readFields(DataInput input) throws IOException {
    this.value = input.readDouble();
    int size = input.readInt();
    this.ids.clear();
    for (int i = 0; i < size; i++) {
      this.ids.add(input.readLong());
    }
  }

  @Override
  public void write(DataOutput output) throws IOException {
    output.writeDouble(this.value);
    output.writeInt(ids.size());
    for (int i = 0; i < ids.size(); i++) {
      output.writeLong(ids.get(i));
    }
  }
}
