DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Comparison study on graph sampling algorithms for interactive visualizations of large-scale networks

ALEKSANDRA VOROSHILOVA

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Aleksandra Voroshilova

2019-06-20 Master’s Thesis

Place for Project

Stockholm, Sweden

Examiner

Mihhail Matskin KTH Royal Institute of Technology

Supervisor

Tino Weinkauf KTH Royal Institute of Technology

Industry Supervisor

Alexandros Gkogkas, Hive Streaming

Abstract

Networks are present in computer science, sociology, biology, and neuroscience, as well as in applied fields such as transportation, communication, and the medical industry. The growing volume of collected data is pushing the scalability and performance requirements on graph algorithms, and at the same time, a need for a deeper understanding of these structures through visualization arises. Network diagrams, or graph drawings, can facilitate the understanding of data, making the identification of the largest clusters, the number of connected components, and the overall structure intuitive and enabling the detection of anomalies, which is not achievable through textual or matrix representations. The aim of this study was to evaluate approaches that would enable the visualization of large-scale peer-to-peer live video streaming networks. The visualization of such large-scale graphs has technical limitations, which can be overcome by filtering important structural data from the networks. In this study, four sampling algorithms for graph reduction were applied to large overlay peer-to-peer network graphs and compared. The four algorithms cover different approaches: selecting links with the highest weight, selecting nodes with the highest cumulative weight, using betweenness metrics, and constructing a focus-based tree. Through the evaluation process, it was discovered that the algorithm based on betweenness centrality approximation offers the best results. Finally, for each of the algorithms in comparison, the resulting sampled graphs were visualized using a force-directed layout with a 2-step loading approach to depict the effect of each algorithm on the representation of the graphs.

Keywords

Graph sampling, graph filtering, large graph visualization

Abstract (Swedish)

Networks are found in computer science, sociology, biology, and neuroscience, as well as in applied areas such as transportation, communication, and the medical industry. The growing volume of data collection is pushing the scalability and performance requirements on graph algorithms, while at the same time a need arises for a deeper understanding of these structures through visualization. Network diagrams or graph drawings can facilitate the understanding of data, identify the largest groups and the number of connected components, show an overall structure, and reveal anomalies, which cannot be achieved with textual or matrix representations. The purpose of this study was to evaluate approaches that could enable the visualization of a large-scale P2P (peer-to-peer) live streaming network. The visualization of larger graphs has technical limitations, which can be overcome by extracting important structural data from the networks. In this study, four sampling algorithms for graph reduction were applied to large overlay P2P network graphs and then compared. The four algorithms are based on selecting links with the highest weight, selecting nodes with the highest cumulative weight, using betweenness centrality values, and constructing a focus-based tree with the longest paths excluded. During the evaluation process, it was discovered that the algorithm based on betweenness centrality approximation showed the best results. In addition, for each algorithm in the comparison, the resulting sampled graphs were visualized using a force-directed layout with a 2-step loading approach.

Keywords

Graph reduction, large graph visualization

Acknowledgements

I would like to thank my supervisors: Tino Weinkauf, for guiding the process, giving advice on the next steps, and keeping it always professional and fun; and Alexandros Gkogkas, for giving me the opportunity to start this interesting study, sharing his knowledge, and helping with everything I needed to successfully finish it. I thank Mihhail Matskin for examining the work and giving the final feedback. I also owe gratitude to the Hive Streaming team, who have made working there very pleasant.

Thanks to my friends from EIT Digital Master School and Teknikringen who have made this year very memorable, and encouraged and inspired me to go on.

Finally, I want to thank my family: Angela, Artem and Ksenia for always being there for me. Zoja, Maja, and Borja for being an example.

Contents

1 Introduction
1.1 Background
1.2 Problem
1.3 Purpose
1.4 Goal
1.5 Benefits, Ethics and Sustainability
1.6 Methodology
1.7 Stakeholders
1.8 Delimitations
1.9 Outline

2 Background
2.1 Network theory
2.2 Graph visualization
2.2.1 Graph layout
2.2.2 Edge bundling
2.2.3 Reduction and Clustering
2.3 Related Work
2.4 Sampling algorithms
2.4.1 Simple random sample (SRS2)
2.4.2 Weighted Independence Sampling (WIS)
2.4.3 Edge Filtering based on betweenness centrality (BC)
2.4.4 Focus-based Filtering (FF)

3 Comparative study design
3.1 Research Process and Paradigm
3.2 Experimental design
3.3 Planned data analysis
3.4 Test data collection
3.4.1 Degree distribution
3.4.2 Reduction size
3.5 Test environment
3.6 Assessing reliability and validity of the data collected
3.7 Evaluation framework

4 Implementation
4.1 Graph libraries
4.2 Graph-tool graph library
4.3 NetworkX graph library
4.4 Graph sampling implementation
4.5 Performance
4.6 Visualization
4.7 Implementation

5 Analysis and evaluation of results
5.1 Algorithms performance comparison
5.1.1 Hypothesis 1
5.1.2 Hypothesis 2
5.1.3 Hypothesis 3
5.1.4 Hypothesis 4
5.1.5 Hypothesis 5
5.1.6 Hypothesis 6
5.1.7 Conclusions
5.2 Visualization results
5.2.1 Test graphs visualization
5.2.2 Overlay network visualization

6 Conclusions
References

Chapter 1

Introduction

Growing amounts of data are constantly generated, collected, analyzed, and stored for later use. As the data collection rate increases, data processing algorithms are pushed towards optimization for size and complexity. Furthermore, a need for a deeper understanding of underlying data structures, such as complex graphs, emerges. A graph data structure is a mathematical representation of a network, consisting of data points and the relationships among them. Common examples are social networks, neural networks, computer networks, traffic networks, and many others [1].

Networks of web pages are examples of tightly interconnected graphs, where the pages are the nodes and the links among them are directed edges. As more pages are added, the networks increase in size. For instance, as of the date of this paper, Wikipedia contains around 47 million pages in 293 languages, and 5.8 million interconnected articles in the English language alone [2]. Visualizing such a graph is a challenge from the computational, layout, and information visualization points of view.

Wikipedia is not the largest example of a network. According to Google, the World Wide Web consists of trillions of interconnected indexed pages. Large networks are also present in other sciences. For example, the neural network of the human brain includes 86 billion interconnected neurons [3].

Graphs can be represented as pairwise relations in a textual format; however, it is hard to draw conclusions about network properties by looking at a text file. A visual representation of a graph is easier to comprehend. The graph structure, connected components, and clusters can be determined with a quick glance, and it is also possible to detect anomalies and properties by examining the visualization thoroughly.

Technical and visual limitations are the two main challenges in the area of large graph visualization. Most graphics engines have a ceiling on the size of a graph they are able to render before running out of memory and computational capacity. The visual challenge is to lay out the graph data in a coordinate space such that the structure is reflected in the best way, and the user is able to understand the properties of the graph by looking at the visualization.

To tackle the technical limitations, it is necessary to reduce the size of the data structure. The emphasis of this work is on keeping the structural information while reducing the original graphs by applying a number of sampling algorithms in a resource-effective way. The study focuses on a comparison of the algorithms based on scalability and structural information retention. Additionally, the resulting sampled graphs are visualized using a web-based graphics engine.

1.1 Background

One example of computer networks are peer-to-peer overlays: computer systems where computers act as "peers" connected among each other, forming an overlay network. The computers use their resources to execute tasks and utilize network bandwidth for sharing data. In a client-server architecture, the server acts as a supplier and the client as a consumer. Conversely, peer-to-peer nodes can act both as suppliers and consumers.

Other examples of peer-to-peer networks are peer-to-peer based cryptocurrencies, which take advantage of the scalability and security features: there is no single source of truth and therefore no single point of failure. Every node can act as a leader node; therefore, in case of failures, the system can select another leader and continue the work.

Hive Streaming builds scalable, peer-to-peer (P2P) content distribution solutions with a focus on live video streaming. The company has millions of installed agents that facilitate the distribution of thousands of data streams daily. To track the usage and performance of a P2P distribution network, a vast amount of data is collected and then distilled into insightful analytics and interactive data visualizations with exploratory capabilities.

Enterprise networks are usually arranged into LANs, grouped into sites. Hive Streaming solutions utilize them to distribute data among peers without having to send it to a separate server to be consumed by clients. Instead, the nodes spread the data via the internal network. Hive Streaming solutions are mostly used for live video streaming; they significantly increase streaming performance and ease the network load.

1.2 Problem

The peer-to-peer overlay network is a scale-free dynamic network. Nodes are added and removed as computers join or leave the video stream. These networks can reach a size of hundreds of thousands of edges and tens of thousands of nodes. Graphs of such size are not supported by most web-based libraries. Moreover, visualizing such a graph would lead to an unreadable, space-filling jumble of edge crossings. To tackle this visualization challenge, various graph pre-processing and clustering techniques can be used to reduce a graph and visualize a simplified version of it.

The study focuses on finding the algorithm that efficiently reduces the graph while maintaining the most important nodes and links. The formal problem statement is: How do different algorithms for graph pre-processing compare in terms of performance and preserving the original graph structure?

1.3 Purpose

The purpose of this study is to investigate the state of the art and present a solution fitting the requirements of an interactive and large-scale visualization of a peer-to-peer distribution network.

1.4 Goal

The goal of this study is to visualize the overlay structure of the network using graph pre-processing techniques while retaining important links and the overall structure. This thesis intends to produce an implementation of the compared algorithms and experimentally evaluate them. Moreover, it aims to provide suggestions for a visualization technique for the resulting sampled graphs running inside a browser, scaling up to hundreds of thousands of edges.

1.5 Benefits, Ethics and Sustainability

The study focuses on reducing the data to show insightful analytics while having less data to operate on, resulting in a decrease in resource usage. The outcomes can be used in development processes for a quick overview of the network, to detect anomalies, and to get a better understanding of the structure.

Moreover, the customers can visually comprehend the overlay networks and evaluate the performance of peer-to-peer distribution without having to dig into technical logs.

Hive Streaming customer network information is confidential and is therefore not used in this study. The test network data used in this study is anonymous and similar in structure and size to actual enterprise networks, so as not to compromise privacy or non-disclosure agreements.

1.6 Methodology

The study uses the applied research method. It starts with a literature study, for which an inductive approach is used to collect state-of-the-art solutions and determine the algorithms to be implemented. A comparative study is then performed on the chosen algorithms, and they are evaluated. [4]

1.7 Stakeholders

The stakeholders of this study are the Hive Streaming company and its customers. A visual representation of a peer-to-peer distribution network would be useful for the evaluation and analysis of the dynamics and structure of the formed overlay networks, and could lead to improvements in the peer-to-peer distribution algorithm. While the basic representation of a network graph is stored in a text format, having it represented visually gives a quick and informative overview of the structure.

As for the customers, they get to see the internals of the video stream distribution in a user-friendly way. Moreover, the visualization reveals the underlying structure of a network. Many companies do not have up-to-date information about their internal network structure, which makes such information very valuable. Therefore, in addition to the benefits stated above, they can also use it to gain insights into their network structure.

1.8 Delimitations

The study is scoped to a comparison of sampling algorithms. The problem of visualizing large graphs is complex and touches multiple fields such as data optimization, coordinate allocation, graphics rendering, and user-experience design, among others. The proposed pre-processing step is necessary for any later visualization approach. The end work includes the definition of a visualization technique; however, user experience evaluation and customer requirements validation are out of the scope of this study and could be the subject of a future study.

1.9 Outline

The necessary network theory fundamentals and general graph visualization techniques are explained in Chapter 2, as well as a graph reduction as an approach to visualization, and the algorithms that are used for comparative study.

In Chapter 3 the methodology and the structure of the comparison study are presented. Chapter 4 explains the details of the technical implementation and the libraries used. In Chapter 5 the results of the comparison are presented, followed by conclusions in Chapter 6.

Chapter 2

Background

To cover the background knowledge required to understand the work, an introduction to network theory fundamentals is presented in Section 2.1. The information on network theory comes from the fundamental book by Mark Newman, "Networks: An Introduction" [5].

The graph visualization part of this chapter (Section 2.2) explains the required steps for visualizing a graph and common approaches for working with large graphs. Section 2.3 covers related work on graph reduction and, finally, in Section 2.4 the four graph sampling algorithms used for the comparative study are presented.

2.1 Network theory

Data can be of different sizes and have various characteristics and relationships. Many systems have components that link to one another in some way or have some kind of relationship. For instance, social groups are sets of people that know each other or have some interests in common. A computer network, such as the Internet, is a set of computers connected by physical wires. These kinds of data structures are distinguished by the fact that the connecting links are crucial relationships that describe the properties of a data set.

Network theory is an interdisciplinary field that spans mathematics, biology, physics, social sciences, economics, and other sciences. It studies network structures and focuses on network analysis and optimization. Some examples of large-scale networks are biological protein chains, social networks, traffic connection networks, and the World Wide Web as the largest present computer network. Depending on the field, the scientific interest focuses on different areas. In the social sciences, for instance, the focal point is on the dynamics of relationships between social entities. Such research evaluates the connectivity density of social groups and assesses the probability of members of one group relating to members of another group. In the case of traffic regulation, the aim is to design for the shortest routes and distinguish large hubs. In the biological field, the network structure is used to model interaction patterns between appropriate biological elements, such as biochemical networks, metabolic networks, protein-protein interaction networks, neural networks, and many others. [5]

Graph definition

A network can be described as a collection of data points with relationships encoded as links. In computer science, networks are represented as graphs, structures amounting to a set of objects with pairwise relations between them. Graphs are studied thoroughly in the mathematics sub-field of graph theory and are formally represented as

G = (V, E), where V is a collection of data points, called vertices or nodes, and E ⊆ {{x, y} | x, y ∈ V ∧ x ≠ y} is the set of relationships between vertices, referred to as links or edges. This mathematical representation enables the analysis of networks by applying various graph measures and metrics [5].

Types of graphs

Graphs can be directed or undirected; in directed graphs, every node has an in-degree and an out-degree metric assigned. Nodes with a high degree are called hubs; they are important connecting points. There can be several edges between two nodes, called multiedges. If those are present, the graph is called a multigraph. Otherwise, if no multiedges are present and the graph does not contain any self-loops (edges whose source and destination are the same node), it is called a simple graph.

Knowing whether a graph falls under a specific category is important, since certain graph properties can facilitate analysis and visualization. The basic types of graphs are a ring, a tree, and a complete graph (where there exists a link from each node to every other node). Another network type is a small-world network, where the neighbors of any given node are likely to be reachable in a small number of hops. Word co-occurrence networks [6] and brain neuron networks [7] are some examples of small-world networks. Power-law networks, also called scale-free networks, follow a power-law degree distribution, meaning that there are a few nodes with a large degree, called hubs, followed by a higher number of smaller hubs. The lower the degree, the more nodes of that degree are present. Social networks are one example of scale-free networks.

Adjacency matrix

A mathematical representation of a graph is an adjacency matrix A, such that A_ij is 1 if there exists an edge between i and j, and 0 otherwise.

A = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}

The matrix above represents a graph with three nodes and five edges, one of which is a self-loop.
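As a quick plain-Python illustration (not part of the thesis implementation), the edge list of this example graph can be read directly off the non-zero entries of the matrix:

import numpy as np

# The example adjacency matrix from above: three nodes, five edges,
# where A[2][2] is the self-loop.
A = np.array([
    [0, 1, 1],
    [1, 0, 1],
    [0, 0, 1],
])

# Every non-zero entry A[i][j] encodes an edge from node i to node j.
edges = [(int(i), int(j)) for i, j in zip(*np.nonzero(A))]
print(edges)  # [(0, 1), (0, 2), (1, 0), (1, 2), (2, 2)]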

A network can be weighted, with weights representing the strength of a given edge. For instance, on the Internet, a link weight can show the amount of traffic transferred between nodes [8]. In social networks, it can represent the strength of a relationship or the frequency of communication.

Once networks get large, standard metrics and measurements can be used to analyze the structure of the networks.

Shortest path

A geodesic path (or the shortest path) is a path between two vertices in a graph such that no shorter path exists. In weighted graphs, the sum of weights of its constituent edges is minimized to acquire the shortest path.

Centrality metrics

There are several measures of centrality, which represent the relative importance of vertices and edges based on various parameters, chosen depending on the use case. The common ones are eigenvector centrality, closeness centrality, degree centrality, PageRank, and betweenness centrality.

Betweenness centrality is calculated using shortest paths. The betweenness centrality of a node is the total number of shortest paths passing through the given node; it measures the extent to which a node lies between other nodes. For instance, in a star graph, the center node has the highest betweenness centrality. The betweenness centrality of an edge, similarly, is the total number of shortest paths passing through it.

BC(v) = \sum_{\substack{s, t \in V \\ s \neq v \neq t}} \frac{\sigma_{st}(v)}{\sigma_{st}},

where σ_st is the number of shortest paths between s and t, and σ_st(v) counts only the ones passing through v. Similarly, the betweenness centrality of an edge can be calculated by measuring how many shortest paths of the graph contain the given edge.
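As a hedged illustration using NetworkX (one of the libraries used in Chapter 4), the star-graph example above can be checked directly with the library's standard betweenness routines:

import networkx as nx

# A star graph: node 0 is the center, nodes 1..5 are leaves. Every
# shortest path between two leaves passes through the center.
G = nx.star_graph(5)

node_bc = nx.betweenness_centrality(G)       # normalized node betweenness
edge_bc = nx.edge_betweenness_centrality(G)  # normalized edge betweenness

# The center node has the highest betweenness centrality, as stated above.
print(max(node_bc, key=node_bc.get))  # 0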

Clique

A clique is a maximal subset of the vertices in an undirected network such that every vertex in the subset is connected to every other vertex in the subset. This measure is useful in network analysis. In social networks, a clique can represent a close group of friends where everyone knows each other. If the network is otherwise sparse, that can be an indication of a socially isolated group.

Transitivity

Another important property is transitivity. A relation "◦" is transitive if a ◦ b and b ◦ c together imply a ◦ c. In networks, this appears as follows: if there is an edge between a and b, and between b and c, transitivity implies that there exists an edge between a and c, resulting in a clique. Perfect transitivity is achieved in a complete graph, which is a graph where all vertices are connected to one another.

Clustering coefficient

Partial transitivity is used to calculate the clustering coefficient of a network: the fraction of paths of length two in the network that are closed, i.e., that belong to a clique.

C(G) = \frac{\text{number of closed paths of length two}}{\text{number of paths of length two}}
     = \frac{6 \times \text{number of triangles}}{\text{number of paths of length two}}
     = \frac{3 \times \text{number of triangles}}{\text{number of connected triples}}

The clustering coefficient represents the degree to which the network is clustered, and, as the formula above shows, it measures the frequency of loops of length three present in the network. One of its applications is to estimate the probability that two random vertices u and v are connected:

P(u, v) = \frac{C(G)}{n}, \quad u, v \in V,

where C(G) is the clustering coefficient and n is the total number of vertices.
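For illustration, the global clustering coefficient of a small graph can be computed with NetworkX, whose transitivity function implements the 3 × triangles / triples formula above (the toy graph is an assumption made for the example):

import networkx as nx

# A triangle 1-2-3 with a pendant node 4: one triangle, five connected
# triples, so C(G) = 3 * 1 / 5 = 0.6.
G = nx.Graph([(1, 2), (2, 3), (3, 1), (3, 4)])
print(nx.transitivity(G))  # 0.6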

Assortativity

Assortativity is the fraction of connected vertices of the same type, where the type can be the vertex degree or any other attribute. In social networks, this measure of assortative mixing (homophily) can represent a tendency of people of the same age, sex, race, language, cultural background, or geographical location to belong to the same group. In web networks, pages of the same language tend to be linked. In computer networks, computers belonging to the same LANs tend to be connected.

Multiple characteristics can be used to calculate assortativity. Discrete and scalar characteristics shall be considered separately. The discrete ones are limited sets of values, such as sex, race, language, or geographical location. Scalar characteristics have an order; thus, even if the values are not exactly equal, the difference can be measured. Examples of such characteristics in social networks are age and income.

2.2 Graph visualization

The area of graph visualization consists of different aspects such as graph layout, clustering algorithms, reduction algorithms, edge drawing, and others. Various algorithms and metrics from network theory can be applied to, for instance, find shortest paths, assign page ranks, and determine clusters, all of which can be visualized, depending on the information to be communicated.

In order to visualize a graph from a collection of data in a text format, layout algorithms are applied to allocate coordinates to each node in such a way that the structural information is communicated efficiently and it is easy for the user to perceive the topological structure of the graph. That implies minimizing the number of edge crossings and occlusions and allocating optimal distances between nodes. Once the coordinates are assigned, the graph can be rendered using different graphics engines, depending on system requirements.

2.2.1 Graph layout

The graph drawing field is a large research sub-field of graph visualization; it combines the fields of mathematics and information visualization. It focuses on algorithms for allocating coordinates to vertices and edges. To draw a graph in coordinate space, vertices have to be distributed in a given frame such that edge crossings are minimized, reflecting symmetry and graph structure while maintaining acceptable computational performance.

Planar graphs are the simplest case since, by definition, they can be embedded in a plane, meaning that it is possible to allocate coordinates in 2D space without any edge crossings. Large power-law graphs are non-planar; therefore, algorithms need to be applied to find an optimal layout.

One common approach is to use force-directed placement [9]. The algorithm assigns physical forces to nodes and edges. The distribution of forces can vary depending on the implementation details. In a nutshell, nodes that are connected by an edge are pulled towards each other, while nodes are otherwise pushed away from each other. There are implementation variants of this method [9, 10].

The force-directed approach has several advantages, as evaluated by many papers: simplicity, good quality of results, intuitive design, and customizability. The strength of the forces can be adjusted to proportionally assign lengths to the edges. The known drawback of the basic algorithm is a long running time, equal to O(n³). There exist additions to the algorithm that improve performance [11].

An alternative algorithm is the spectral layout, which is based on the Laplacian matrix of a graph. The eigenvectors corresponding to the two largest eigenvalues are used to set the locations of the nodes in a 2D plane: the first eigenvector's values are used to allocate X-coordinates and the second eigenvector's values to allocate Y-coordinates. [12]

For hierarchical data, a layered approach can be used. Each layer in the hierarchy gets drawn as a row. Variations of the drawing approaches can be applied: drawing from top to bottom or left to right. [13]

2.2.2 Edge bundling

Densely connected graphs can look very cluttered, with links filling the entire background in the worst cases. Geometric edge bundling attempts to address this problem by bending edges towards each other: similar edges are drawn together, creating empty white space in between and thus reducing visual complexity. Although the visual complexity is reduced with this approach, the computational complexity increases: not only does the graph have to be visualized, but new positions for each link also have to be calculated. [14]

2.2.3 Reduction and Clustering

To achieve higher information comprehension for the user and to tackle the technical limitations, large-scale graphs need to be reduced. Two common approaches used in the literature are graph clustering and graph sampling, also referred to as filtering. A combination of both approaches can also be used; thus a graph can be both reduced and clustered.

The graph sampling approach addresses the problem of large graph size by removing edges or nodes. In contrast, clustering algorithms group graph data based on structural properties and create levels of hierarchy, which can then be used to visualize only these higher-level groupings instead of displaying all nodes and links.

2.3 Related Work

An extensive effort has been made in the field of graph reduction, resulting in numerous publications describing sampling algorithms. The algorithms differ in the graph type and structure they can be applied to (trees, small-world, power-law), connectivity, and graph properties (weighted or unweighted). There is also a specific type of algorithm using walks, applied to graphs that are not seen as a whole structure at once, for instance, when the graph comes in a continuous stream or when the network is too large to fit in memory.

The graph sampling algorithms can be stochastic or deterministic. Leskovec and Faloutsos have made a comparative study on stochastic algorithms [15], evaluating sampling results from graphs with zero-weight edges. Due to randomness, stochastic algorithms may produce different results when applied with the same parameters.

Deterministic filtering, on the other hand, guarantees the same result after each run. Lee [16] and Boutin [17] are using tree reconstruction methods to yield an approximate graph from the given one. Another approach [18] is to use betweenness centrality as a metric for selecting the edges to keep.

Apart from filtering, as stated above, a clustering approach is common to represent groups of nodes. There are many algorithms for grouping nodes into clusters [19]. Examples of such approaches include selecting nodes by authority metric [20], removing weak edges and thus grouping well-connected components [21], or selecting groups of nodes by their shortest path distance from selected nodes [22].

The graph data used in this study already consists of several layers of groups of clusters, created from the real network characteristics, so it would be redundant to apply clustering algorithms. Since the graphs are very densely connected, having hundreds of thousands of edges and tens of thousands of vertices, the filtering (or sampling) approach was chosen instead.

2.4 Sampling algorithms

Four sampling algorithms were selected for this comparative study. The algorithms focus on reducing the graph while aiming to keep its original structural properties. The network overlay graphs that need to be visualized in this study are weighted power-law graphs; thus the selection was based on algorithms that are applicable to such graphs.

2.4.1 Simple random sample (SRS2)

The "Simple Random Sample" sampling algorithm is presented in the paper "Effectively Visualizing Large Networks Through Sampling" by Davood Rafiei [23]. A simple random sample of the edges is taken from an unweighted graph, and only vertices that are incident to those sampled edges are kept.

For a weighted graph, the same procedure is followed, with the difference that the selection criterion is the weight of an edge. The edges with the highest weight are kept, and nodes with the largest number of heavy-weight links end up in the sample. The algorithm does not guarantee connectivity. It is one of the most intuitive approaches to removing edges; therefore it is used as a baseline for comparison.

Given a number of edges to keep in the sampled graph, the procedure consists of the following 2 steps:

1. Sort

• Sort the graph edges by weight.

2. Reduce

• Remove the edges with the lowest weight until the number of edges is reduced to the given threshold. Also, remove nodes that are left without any edges.
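A minimal sketch of this procedure in Python with NetworkX follows; the function name and the use of a "weight" edge attribute are illustrative assumptions, not the thesis code:

import networkx as nx

def srs2(G, edge_threshold):
    """Keep the `edge_threshold` heaviest edges; nodes left without
    edges are dropped implicitly by rebuilding the graph from edges."""
    edges = sorted(G.edges(data="weight"),
                   key=lambda e: e[2], reverse=True)
    H = nx.Graph()
    H.add_weighted_edges_from(edges[:edge_threshold])
    return H  # connectivity is not guaranteed, as noted above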

2.4.2 Weighted Independence Sampling (WIS)

The Weighted Independence Sampling algorithm is described in "Walking on a Graph with a Magnifying Glass: Stratified Sampling via Weighted Random Walks" [24]. For a graph G, let w(u, v) denote the weight of an edge (u, v) ∈ E. The weight of a node u ∈ V is then

w(u) = \sum_{v \in N(u)} w(u, v)

For each node v in the set V, a probability proportional to the node's weight is assigned:

\pi(v) = \frac{w(v)}{\sum_{u \in V} w(u)}

Then nodes are sampled with replacement, independently at random, based on the calculated probabilities. Since the algorithm samples nodes, the output is based on a node filtering threshold rather than an edge threshold. Given k, the node threshold, the algorithm is as follows:

1. Calculate vertex probabilities.

• To each node assign a probability π to be selected.

2. Select nodes

• Select the k nodes at random with probability π

3. Sample Graph

• The sampled graph consists of the k selected nodes and all the links existing between those nodes in the original graph.
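A sketch of WIS using NumPy and NetworkX; the function signature and the "weight" attribute are assumptions made for the example:

import numpy as np
import networkx as nx

def wis(G, k, seed=None):
    """Sample k nodes with probability proportional to their cumulative
    edge weight and return the induced subgraph."""
    rng = np.random.default_rng(seed)
    nodes = list(G.nodes())
    w = np.array([G.degree(v, weight="weight") for v in nodes], float)
    pi = w / w.sum()  # per-node selection probabilities
    # Sampling is done with replacement, as in the original algorithm;
    # duplicates collapse when the induced subgraph is taken.
    picked = rng.choice(len(nodes), size=k, replace=True, p=pi)
    return G.subgraph(nodes[i] for i in picked).copy()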

2.4.3 Edge Filtering based on betweenness centrality (BC)

The algorithm was introduced in the publication "On the Visualization of Social and other Scale-Free Networks" by Jia, Hoberock, Garland, and Hart [18]. It is specifically designed for visualizing scale-free networks and is based on the betweenness centrality metric.

The algorithm consists of two steps: reduction and post-processing.

1. Reduction

• Calculate betweenness centrality for each edge in G.

• Sort the edges by the betweenness centrality in decreasing order

17 • If a node has at least 3 edges, remove an edge

2. Post-processing

This step is executed only if the resulting reduced graph is disconnected.

• Take a collection of edges removed in the Reduction step and sort by betweenness centrality in a decreasing order

• From the collection, starting with edges with the highest betweenness centrality, add back the edges that would reconnect the disconnected components until the graph is connected.

One detail of the algorithm is that an edge cannot be removed if one of its endpoints has fewer than two remaining edges.

In addition, to optimize performance, an approximation of the betweenness centrality metric is applied instead of the full-graph betweenness centrality calculation.
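A simplified sketch of both steps with NetworkX is shown below. The pivot parameter of edge_betweenness_centrality stands in for the betweenness approximation mentioned above, and the degree rule follows the description of the algorithm; this is an interpretation, not the published implementation:

import networkx as nx

def bc_filter(G, edge_threshold, pivots=256, seed=1):
    """Remove low-betweenness edges down to `edge_threshold`, then
    re-add removed edges (highest betweenness first) until connected."""
    ebc = nx.edge_betweenness_centrality(
        G, k=min(pivots, len(G)), seed=seed)  # pivot-based approximation
    H = G.copy()
    removed = []
    # 1. Reduction: walk edges from the lowest betweenness upwards.
    for u, v in sorted(ebc, key=ebc.get):
        if H.number_of_edges() <= edge_threshold:
            break
        # Only remove an edge if both endpoints keep at least 2 edges.
        if H.degree(u) >= 3 and H.degree(v) >= 3:
            H.remove_edge(u, v)
            removed.append((u, v))
    # 2. Post-processing: reconnect components if needed.
    for u, v in sorted(removed, key=ebc.get, reverse=True):
        if nx.is_connected(H):
            break
        if not nx.has_path(H, u, v):  # endpoints in different components
            H.add_edge(u, v)
    return H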

2.4.4 Focus-based Filtering (FF)

The Focus-based Filtering algorithm is described in the paper "Focus-based filtering + clustering technique for power-law networks with small world phenomenon" by Boutin, Francois, and Thievre [17]. The algorithm is designed for networks with a power-law degree distribution and eventually produces a connected graph. It uses the shortest path metric. The algorithm begins by selecting a root node called the filtering focus. It first builds a tree from the edges of the original graph and then adds back the edges of G spanning the longest tree paths until a threshold is reached.

1. Select the filtering focus node V1.

The root is the node with the highest cumulative weight of its edges.

2. Take the node V_{n+1} that is connected to any node in V_n and has the highest degree.

• Find all neighbors of V_n

• Retrieve the one with the highest degree

• Connect it by selecting an edge to the highest-degree node in V_n

The output of this step is a tree.

3. Dense extraction.

• Calculate shortest path distances in the tree between all node pairs that are connected by an edge in the graph G.

• The edges in G whose endpoints have the longest shortest-path distance in the new tree are added back until the size of the sampled graph reaches the set threshold.

This algorithm heavily depends on the filtering focus selection. The more edges are cut away, the more the sampled graph is limited to the area around the filtering focus.
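A sketch of the tree-construction phase (steps 1 and 2) with NetworkX; the dense-extraction step is omitted for brevity, and the function name and weight attribute are illustrative assumptions:

import networkx as nx

def ff_tree(G):
    """Build the FF tree: start from the focus (highest cumulative edge
    weight) and repeatedly attach the highest-degree frontier node."""
    focus = max(G, key=lambda v: G.degree(v, weight="weight"))
    T = nx.Graph()
    T.add_node(focus)
    while T.number_of_nodes() < G.number_of_nodes():
        # Nodes outside the tree that are adjacent to it.
        frontier = {v for u in T for v in G.neighbors(u)} - set(T)
        if not frontier:
            break  # the remainder of G is unreachable
        v = max(frontier, key=G.degree)
        # Attach v through its highest-degree neighbor inside the tree.
        u = max((n for n in G.neighbors(v) if n in T), key=G.degree)
        T.add_edge(u, v)
    return T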

Chapter 3

Comparative study design

The following sections describe the process used to answer the research question, followed by a description of the experiment design, the planned analysis structure, and the characteristics of the test data and test environment.

3.1 Research Process and Paradigm

The applied research method is used to determine the appropriate algorithms for graph reduction. The graph sampling algorithms are run on several datasets, and the quantitative results are collected from the reduced graphs.

3.2 Experimental design

The implementations of the four algorithms produce different reduced graphs as output. To evaluate and compare them, different graph measurements are taken. These measurements are formulated into hypotheses, appropriate tests are executed, and the hypotheses are then confirmed or refuted.

The tests are run on six graphs. For the three smaller, artificially generated test graphs, the algorithms are run for ten different thresholds (the number of edges remaining in the sampled graph), resulting in 120 runs altogether. For the three real-life datasets, they are run with nineteen different thresholds, which corresponds to 228 reduced graphs.

3.3 Planned data analysis

The algorithms produce sampled graphs that are smaller in size and may have different structural characteristics. To compare the produced graphs, their properties are measured and evaluated. Depending on the requirements on the expected outcome, different algorithms can be chosen for the end visualization implementation.

The goal of this study is to visualize a connected peer-to-peer network, giving an overview of the structure of the network. In this case, maintaining the structure means keeping the graph connected and keeping the dense clusters. A heavy reduction can lead to a complete loss of structure; for instance, components that are connected in the original graph can appear disconnected. That should be avoided: a balance between reducing the graph to a more comprehensible size and keeping the connectivity and clustering information shall be maintained.

3.4 Test data collection

The real-life datasets are provided by the stakeholders and represent snapshots of a peer-to-peer overlay network at particular moments in time when the video stream was active. The test datasets are compiled from the reported usage during a video stream, where a node is a viewer and a link is a data connection with another viewer or with the source of the video stream. The graphs are weighted; the weight represents the amount of data in bytes transferred between two nodes.

These real overlay network graphs are composed of highly clustered connected components. In practice, those clusters are collections of computer LANs forming a site. The entire network consists of multiple such sites, connected among each other and representing one connected component. Additionally, one node in the network is the source of the video stream.

Since even the reduced graphs are large, the tests are also run on sample graphs to ensure the correctness of the algorithms. The sample graphs are generated using a relaxed caveman graph model. A caveman graph has a number of groups made up from cliques of size k. The graph is relaxed, meaning that the edges are rewired with a given probability. [25]
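NetworkX ships a generator for this model, so the caveman test graphs can be reproduced approximately as follows (the seed is an arbitrary choice, and rewiring makes the exact edge count vary slightly):

import networkx as nx

# 8 groups, each a clique of 50 nodes, edges rewired with probability p:
# this mirrors the "caveman 8 50" test graph from Table 3.4.1.
G = nx.relaxed_caveman_graph(l=8, k=50, p=0.1, seed=42)
print(G.number_of_nodes(), G.number_of_edges())  # 400 nodes, ~9800 edges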

Graph measurements are shown both for the caveman graphs and for the real graphs in Table 3.4.1. "Avg degree" stands for the average degree, and "GCC" stands for the global clustering coefficient.

ID             Nodes    Edges      Avg degree   GCC
caveman 2 5       10       20          2.0      0.67
caveman 3 10      30      135          4.5      0.62
caveman 8 50     400     9800         24.5      0.76
9101           33561   656584        19.56      0.39
9001           17296   907800        52.48      0.74
9037           24854  1085964        47.5       0.57

Table 3.4.1: Test data

3.4.1 Degree distribution

The degree distribution of a network is the distribution of the node degrees of the network, where the degree of a node is equal to the number of edges it has. Networks can be categorized by distribution type; the two common types are the binomial degree distribution and the power-law distribution. This metric can give an idea of the structure of the network and is a way to distinguish different types of networks.

Networks having a power-law degree distribution are called scale-free networks. Examples of such networks are some social networks and the Internet. The test graphs are scale-free graphs as can be seen in Figure 3.4.1.

Figure 3.4.1: Degree distribution of the test graphs
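The histogram underlying such a plot can be computed as follows; the Barabási–Albert generator is used here only as a stand-in for a scale-free test graph:

from collections import Counter

import networkx as nx

def degree_histogram(G):
    """Map each degree value to the number of nodes having it; on a
    log-log plot a power law shows up as a roughly straight line."""
    return Counter(d for _, d in G.degree())

hist = degree_histogram(nx.barabasi_albert_graph(1000, 3, seed=1))
print(sorted(hist.items())[:5])  # low degrees dominate in a power law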

3.4.2 Reduction size

The graphs can be reduced by any number of edges or vertices. To maintain the structure of the original graph, the edge cut should not be too large. The upper limit on the possible number of edges is determined by the graphics engine used for visualization: there is a limit on the number of nodes and edges that can be held in memory and rendered with a reasonable frame rate.

The minimum possible number of edges in a connected graph is (n − 1), where n is the number of vertices. Since the original network is connected, the reduction should not bring the number of edges below that.

3.5 Test environment

Running the implementations of the algorithms in an identical clean environment is essential to guarantee reproducible and reliable results. There shall be no other processes or programs running that could influence the performance or the results. To ensure that, the tests were run on Amazon Web Services virtual instances. AWS is a cloud solution offering a variety of computing, storage, and other services; most importantly, it is possible to set up virtual machines to run code in parallel on clean instances.

The tests were run on clean EC2 instances with Ubuntu 18.04 (Bionic). The instance type is m5ad.large with 8 GiB RAM and 1 x 75 GB NVMe SSD, which is comparable performance-wise to an average computer. Each instance had the graph-tool Ubuntu distribution installed, along with Python and pip to install the dependencies. The graph files and the algorithm implementation code were copied to the instances and executed.

3.6 Assessing reliability and validity of the data collected

To ensure that the collected results are valid, the tests are run on clean virtual machine instances; therefore the results are not affected by any other running processes. All tests are run on the same installations, with the same versions of the required libraries. The sample test data are of different graph sizes and represent different underlying networks, to support the validity of the results.

3.7 Evaluation framework

The gathered data is quantitative; thus, for each run, a dependent parameter is calculated and evaluated. The study aims for the fastest performance, the smallest size of output, and characteristics of the sampled graph that are close to the original graph.

Chapter 4

Implementation

The work for this comparative study consists of two parts: the implementation of the graph sampling algorithms and the network visualization. The algorithms are implemented as Python scripts that can be deployed to a back-end, and the visualization is implemented as a JavaScript script that can run in a browser.

4.1 Graph libraries

Graph libraries provide basic graph input and output methods for reading and writing graph files. In addition, they support the graph data structure and implement basic graph analysis and measurement functions presented in Chapter 2.1.

There are plenty of Python graph libraries available. The core algorithms for graph analysis are implemented in various frameworks such as JUNG, NetworkX, graph-tool, and others. These frameworks differ in terms of implementation, performance, and the set of algorithms offered.

Graph data formats

The basic representation of a graph is an adjacency matrix. The offers a simple graph representation without any metadata. Additionally to the relationships between nodes, it can be useful to have metadata on nodes

26 and edges. Such graphs can be stored in more complex formats such as GML, GraphML, JSON, Pajek, YAML, JSON, LEDA, and many others.

The right choice of a data format is determined by the structure of the data in question in combination with the libraries and software to be used. It is straightforward to convert from one format to another.

4.2 Graph-tool graph library

Graph-tool is a Python library for graph analysis. It has a well-documented set of APIs for core graph algorithms, measurements, and graph manipulation. It is essentially a C++ library wrapped in Python. It has dependencies on Boost, expat, SciPy, NumPy, CGAL, and other optional libraries. The Boost Graph Library is a C++ graph library with standard mathematical graph algorithms implemented. [26]

Installation

A common way to install Python libraries that are not included in a standard Python distribution is by using the pip package manager to pull them from PyPI (the Python Package Index), a repository used to distribute software for Python. Graph-tool is not present there, since it is practically a C++ library and has dependencies on other C++ libraries which cannot be installed using pip. For GNU/Linux distributions and MacOS, the installation can be done using package managers.

The fastest and the easiest way to get graph-tool is by downloading a Docker image. It is an OS-agnostic way, which requires minimal effort and does not cause compatibility issues since it is executed in an isolated Docker container.

The implementation of the algorithms for this research was done on MacOS, where installation via the Homebrew package manager should normally suffice. Due to a mismatch of dependency library versions, however, a manual compilation was done instead. The installations on the AWS Ubuntu instances and the local MacOS machine behaved exactly the same, produced the same function outputs, and did not require any adjustments in the code.

Performance

Graph-tool uses the OpenMP (Open Multi-Processing) API to run algorithms in parallel. This can be fully utilized when running on hardware with multiple cores with parallel execution enabled.

It is based on the Boost C++ libraries and takes advantage of metaprogramming techniques to achieve a high level of performance. Moreover, it has APIs for accessing vertices and edges as NumPy arrays, without having to create complex object collections. Another feature that boosts performance is a powerful filtering capability: it is an efficient way to filter out edges or vertices by assigning a property, without having to copy the entire graph or modify the original structure.
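A minimal sketch of that filtering mechanism (the toy graph and property values are assumptions made for the example):

from graph_tool.all import Graph, GraphView

g = Graph(directed=False)
g.add_edge_list([(0, 1), (1, 2), (2, 0), (2, 3)])

# Mark edges to keep in a boolean property map...
keep = g.new_edge_property("bool")
keep.a = True                  # keep everything by default
keep[g.edge(2, 3)] = False     # ...except this one edge

# ...and look at the graph through a filtered view: the original graph
# is neither copied nor modified.
view = GraphView(g, efilt=keep)
print(view.num_edges())  # 3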

License

Graph-tool is distributed under the GNU General Public License. The source code is publicly available.

4.3 NetworkX graph library

NetworkX is a pure Python library for graph analysis, first publicly released in 2005 [27]. It is well-established and has extensive documentation and example code. It can be installed with the pip package manager and is compatible with all major operating systems (MacOS, Linux, Windows) [28].

It includes most of the standard network analysis algorithms, as well as links to the implementation, and usage examples.

Installation

NetworkX can be easily installed using the pip package manager and does not require any manual compilation.

28 License

NetworkX is free open-source software distributed with a 3-clause BSD License. It can be redistributed and modified under the terms of the license.

4.4 Graph sampling implementation

The implementation of the sampling algorithms was developed using Visual Studio Code version 1.33.1 for Mac, with additional Python plugins installed. Originally the implementation was done using NetworkX 2.2; however, graph loading and betweenness centrality execution times increased significantly for the larger graphs. The shortest path calculation for the overlay network graphs was taking over an hour. Therefore, the final implementation was rewritten using the graph-tool library.

Graph-tool proved to have better performance, and in addition, it has APIs for accessing vertices and edges as NumPy arrays, which enables the implementation of computations as matrix operations rather than loops over collections of objects. NumPy is a Python package for scientific computing; it supports multi-dimensional arrays and linear algebra computations and is used heavily in data science [29].

4.5 Performance

The difference between graph-tool and NetworkX is not significant on small graphs with about a hundred edges, however, as the overlay network datasets are of the scale of a million edges, the gap in performance increases drastically.

The graph-tool graph filters were extensively utilized for tests. Since the tests included many runs over the same set of graphs, it would be wasteful to modify the original graphs, and then read them again for the next iteration. Instead, the filters were used to mark the removed edges, and each graph was read only once.

Performance-wise, there is a drastic difference between NetworkX and graph-tool due to the fact that the libraries have different implementations: NetworkX is a pure Python library, while graph-tool has a C++ based implementation.

The performance measures are presented in Table 4.5.1. The "BC gt" column shows the betweenness centrality calculation time for each graph using graph-tool; the results are measured in seconds. "BC nx" is the betweenness centrality calculation using NetworkX. "SP gt" and "SP nx" give the time it took to compute the shortest path distances for all node pairs in the graph using graph-tool and NetworkX respectively.

graph   BC gt     BC nx     SP gt       SP nx
2 5     0.002 s   0.005 s   0.004 s     0.02 s
3 10    0.002 s   0.02 s    0.0025 s    0.27 s
8 50    0.005 s   3.024 s   0.0352 s    206.93 s
9101    2.15 s    660 s     494.208 s   > 1 hour
9001    2.213 s   820 s     219.16 s    > 1 hour
9037    3.99 s    1122 s    363.5 s     > 1 hour

Table 4.5.1: Performance measurements for NetworkX and graph-tool

The performance differs drastically, especially for the shortest path calculation, which, when applied to the network overlay graphs, takes hours with NetworkX and only a couple of minutes with graph-tool.

4.6 Visualization

One of the requirements for this study is to offer a visualization solution running in the web browser. There are three common ways to render graphics in the web browser: SVG, Canvas, and WebGL.

30 SVG

SVG (Scalable Vector Graphics) is, as the name suggests, a vector-based image format represented in an XML-like format. SVG images preserve proportions and shapes when scaled and are resolution independent. The format handles large high-resolution graphics well; however, it performs poorly when rendering many elements.

Canvas

Canvas is a container HTML element with the capability to draw interactive graphics in it. The drawn elements can be paths, circles, boxes, text, or images. Basic mouse interactivity is handled through events on the canvas element. The performance of Canvas decreases on large-resolution visualization surfaces.

WebGL and Three.js

WebGL is a low-level engine which enables both three- and two-dimensional drawing. It is able to render many large objects while maintaining reasonable performance, combining the strengths of both Canvas and SVG. This is achieved through GPU-accelerated processing, which is not possible with regular Canvas and SVG elements. WebGL is used widely for 3D content creation; examples include game engines like Unity and Unreal Engine 4. [30]

Since WebGL exposes low-level APIs, to facilitate developers' efforts there exist several libraries that abstract basic scene creation and manipulation functionality: A-Frame for virtual reality programs, BabylonJS, PlayCanvas, three.js, OSG.JS, and others. Three.js is a popular JavaScript library built on top of WebGL, providing high-level support for drawing GPU-enabled graphics. It abstracts away low-level WebGL calls and wraps repetitive bits and WebGL implementation details, resulting in lower overhead and ease of development. [31]

All of the graphics libraries listed above are supported by the latest browsers and are available under the MIT license. One would pick a library based on the size of the data to be visualized and the rendering dimensions. Since the networks in this study are large, the Three.js based solution was chosen for the visualization implementation.

D3

D3 is a visualization library written in JavaScript. It can be used for visualizing data in the form of interactive charts, tables, trees, and many other data structures. D3 can be integrated with Canvas, SVG, and WebGL. Most of the documentation examples use SVG and Canvas to draw plots and animated data tables. [32]

It also has support for visualizing graph structures and reading graph data formats. In addition, it has an implementation of a force-directed layout in a sub-module called d3-force. The module adds forces to nodes, enabling a physical simulation. [33]

4.7 Implementation

The proposed end solution consists of two blocks: the graph reduction algorithm and the visualization block. The solution uses the betweenness centrality algorithm for graph sampling. The original graph is delivered in JSON format and converted to GraphML to be read into the graph-tool library. Then the algorithm is applied, and the graph is reduced to the specified threshold.
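The conversion step could look roughly as follows; this sketch assumes the JSON uses a node-link layout readable by NetworkX (the file names are placeholders, and the actual Hive Streaming schema is not shown in this thesis):

import json

import networkx as nx
from networkx.readwrite import json_graph

# Hypothetical input: a node-link JSON document with weighted links.
with open("overlay.json") as f:
    G = json_graph.node_link_graph(json.load(f))

# GraphML output is readable by graph-tool for the reduction step.
nx.write_graphml(G, "overlay.graphml")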

The visualization client side consumes two graphs: the original and the sampled one. The original graph is used to assign forces and allocate coordinates to the graph nodes; the sampled graph tells which edges to display. To avoid duplication of data and to optimize loading time, the reduced graph contains only the identifiers of the edges. The graph metadata is still accessible from the original graph.

Once the visualization is launched, Three.js creates a scene. Meanwhile, the original graph is loaded, and the coordinates are calculated using the d3-force library. Then, for the links and nodes present in the reduced graph, the graphic objects are created and displayed. All nodes and links are interactive, and event handlers can be assigned to them to display the metadata.

Chapter 5

Analysis and evaluation of results

In this chapter, the results of the comparison study on the selected graph sampling algorithms described in Chapter 2.4 are presented. In the first section, the algorithms are compared across six performance criteria. In the second section, the resulting reduced graph visualizations are presented for each algorithm.

5.1 Algorithms performance comparison

The algorithm performance comparisons based on various measurements are presented below. The test parameter for all hypotheses is the reduction threshold: the algorithms are run with different threshold values, a percentage of edges to keep. The motivation for choosing this test parameter is to see how the measurements change as the graph size decreases. For the WIS algorithm, since it is applied to nodes, the threshold specifies how many nodes shall be kept. Results of the test runs for the caveman 2x5 graph are attached in Appendix B.

5.1.1 Hypothesis 1

Hypothesis: One of the algorithms performs better when it comes to keeping the percentage of the total edge weight

Related Questions: How big is the difference between algorithms SRS2, FF, BC and WIS in terms of maintaining edges with the highest weight?

Test output (observed/dependent variables): A percentage of total edge weights retained after the application of the graph sampling algorithm.

Results (Figure 5.1.1): The collected results are presented in Figure 5.1.1. As expected, SRS2 retains the highest total weight, because the design of the algorithm is such that the edges with the highest weight are picked first, and no other selection parameter is applied. It is followed by FF. BC and WIS have similar results once the edge cut gets larger. Since SRS2 keeps the edges with the largest weight, it is a reference point showing the maximum possible retained weight at a given edge cut.


Figure 5.1.1: Hypothesis 1

5.1.2 Hypothesis 2

Hypothesis: Algorithms BC and FF perform better than WIS and SRS2 when it comes to keeping the graph connectivity.

Test output (observed/dependent variables): The structural connectivity of the resulting sampled graph expressed through the number of connected components.

Results (Figure 5.1.2): As expected, FF and BC produce connected sampled graphs. The FF algorithm starts by selecting a vertex and then adds adjacent vertices and edges one by one, forming a tree, and afterwards adds back the edges between vertices that form the longest shortest paths in the constructed tree. Therefore, at all stages of the algorithm, the sampled graph is connected.

The BC algorithm first removes edges with the lowest betweenness centrality, which results in a disconnected graph. Then the post-processing step is applied, guaranteeing the connectivity.

WIS and SRS2 do not consider connectivity in their design. After applying SRS2, the number of components in the filtered graphs grows exponentially as the edge cut increases. The number of components in a graph reduced with WIS is not as large as in a graph reduced with SRS2, and it changes linearly.
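The connectivity measurement itself is directly available in NetworkX; a minimal sketch:

# Number of connected components of a sampled graph (sketch). A value of 1
# means the sampled graph is connected; SRS2 at large edge cuts yields many.
import networkx as nx

def count_components(sampled: nx.Graph) -> int:
    return nx.number_connected_components(sampled)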


Figure 5.1.2: Hypothesis 2

5.1.3 Hypothesis 3

Hypothesis: SRS2, FF, and BC shall output graphs with the same degree distribution.

Test output (observed/dependent variables): the average node degree of a reduced graph.

Results (Figure 5.1.3): The average node degrees are the same for BC, FF, and SRS2. They decrease linearly, proportionally to the edge cut, indicating that the algorithms correctly reduce the number of edges in proportion to the specified edge threshold. A difference becomes visible as the edge cut gets larger; this is because BC and FF have a limit on the maximum number of removed edges, while SRS2 does not.

The results for the WIS algorithm show that the average degree at first goes up as the graph is reduced. This happens because the nodes with a low degree are removed first, so each remaining node has, on average, a higher degree compared to the original. Eventually, the average degree decreases along with the number of edges.
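For reference, the average node degree of an undirected graph is 2|E|/|V|, since each edge contributes to the degree of two nodes; this is how the test output here can be computed. A sketch:

# Average node degree of an undirected graph (sketch).
import networkx as nx

def average_degree(graph: nx.Graph) -> float:
    n = graph.number_of_nodes()
    # each edge contributes to two node degrees, hence the factor of 2
    return 2.0 * graph.number_of_edges() / n if n else 0.0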


Figure 5.1.3: Hypothesis 3

5.1.4 Hypothesis 4

Hypothesis: One of the algorithms shall perform better in terms of the computation time of the graph reduction.

Test output (observed/dependent variables): running time in seconds for different algorithms.

Output: FF is the slowest (due to the shortest path calculation). SRS2 is the fastest algorithm.

Results (Figure 5.1.4):

SRS2 and WIS take only a few seconds to execute, since these algorithms do not use any time-demanding measurements. The more complex algorithms, FF and BC, are slower in terms of running time because of the shortest path and betweenness centrality calculations they require. The running time comparison of these measures, implemented with Python libraries, is presented in Section 4.1.4.

The performance of FF does not depend on the edge cut, since the algorithm reconstructs the graph from scratch on every run and then removes the longest paths. A significant drop in running time is visible on all the larger graphs (9001, 9037, 9101) once the threshold gets small. The reason is that when the threshold is small enough, it is already reached while the tree is being created, so the algorithm never gets to the second step of adding the longest paths.

When the edge cut is small, BC is as fast as SRS2 and WIS. Once the edge cut gets larger, the execution time grows, since a large reduction implies a longer post-processing step.
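The running time measurement can be sketched as below; the timing setup of the actual experiments is described in Section 4.1.4 and may differ.

# Wall-clock timing of one reduction run (sketch); `sample` is any of the
# sampling functions, e.g. the srs2 sketch shown earlier.
import time

def timed_reduction(sample, graph, threshold):
    start = time.perf_counter()
    result = sample(graph, threshold)
    elapsed = time.perf_counter() - start  # seconds
    return result, elapsed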


Figure 5.1.4: Hypothesis 4

5.1.5 Hypothesis 5

Hypothesis: One of the algorithms shall perform better in terms of maintaining the global clustering structural characteristic.

Test output (observed/dependent variables): the global clustering coefficient, calculated using the formula described in Chapter 2.1 under the “Clustering coefficient” section.

Results (Figure 5.1.5):

WIS has the highest global clustering coefficient (just as it has the highest average degree, as shown in Hypothesis 3). As with the average degree, the global clustering coefficient goes up as the node threshold decreases. The outputs of SRS2 and FF depend on the density of the original graph, as the results for 9037, 9001, and 9101 differ.

The global clustering coefficient is based on the number of triangles, and the base structure in the FF algorithm is a tree. When the edge cut is small, the sampled graph still resembles the original graph. Once the edge cut increases, the sampled graph structure gets closer to the tree because of the shorter dense component step, and thus has a lower global clustering coefficient.
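In NetworkX, the global clustering coefficient is available as transitivity, defined as 3 x (number of triangles) / (number of connected triples); this is assumed to match the formula referenced from Chapter 2.1. A sketch:

# Global clustering coefficient (transitivity) of a sampled graph (sketch).
import networkx as nx

def global_clustering(graph: nx.Graph) -> float:
    return nx.transitivity(graph)

# A tree contains no triangles, so its coefficient is 0, consistent with the
# FF samples converging to a tree and losing clustering.
assert global_clustering(nx.balanced_tree(2, 3)) == 0.0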


Figure 5.1.5: Hypothesis 5

5.1.6 Hypothesis 6

Hypothesis: One of the algorithms shall perform better in terms of keeping the assortativity coefficient.

Test data: Only the real overlay network graphs were compared for this hypothesis. The assortativity of all reduced graphs is positive, indicating that they are assortative networks in which edges tend to connect vertices of the same type.

Test parameter: The type parameter is the site, or geographical location, where the network computer is physically situated. The overlay network is designed to have direct connections between computers in the same location, as well as connections to other sites.

Test output (observed/dependent variables): assortativity coefficient measuring the similarity of connections in the graph with respect to the given attribute.

Results (Figure 5.1.6):

The overlay network graphs are densely connected within the sites; therefore, the first priority for reduction purposes would be to decrease the number of edges within the clustered components and to keep the edges connecting different sites, in order to show the data flow.

Since the vertex type is not considered in any of the implemented sampling algorithms, the results vary. SRS2 keeps the assortativity at the same level but sacrifices connectivity, as was shown in Hypothesis 2. The largest assortativity coefficient occurs at the threshold of 0.01%, indicating that the edges are most likely to connect vertices of the same type once the edge cut is large.

The FF algorithm shows the steepest curve as the cut gets larger. This is because the base structure of the graph in the algorithm is a tree; therefore, all clusters, except for the one containing the filtering focus node, end up with leaf nodes that have no direct connecting links.

The WIS algorithm shows a high assortativity value for 9101 and 9001, and a very low one for 9037. Graph 9037 has the largest number of edges; as the number of selected nodes gets smaller, the number of edges between nodes belonging to different sites still stays large.
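The assortativity coefficient with respect to a categorical node attribute is directly available in NetworkX. In the sketch below, "site" is an assumed attribute name holding each node's location.

# Attribute assortativity with respect to the node's site (sketch). Values
# lie in [-1, 1]; positive values indicate assortative mixing, i.e. edges
# tend to connect vertices of the same type.
import networkx as nx

def site_assortativity(graph: nx.Graph) -> float:
    return nx.attribute_assortativity_coefficient(graph, "site")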

Figure 5.1.6: Hypothesis 6

5.1.7 Conclusions

The following patterns are observed in the test results. SRS2 is the fastest algorithm, but it produces a reduced graph that has little in common with the original one structure-wise. WIS is also a very fast algorithm, and it keeps some of the sampled graphs connected even with half of the edges removed from the original overlay network. This is a good result, considering that the connectivity property is not specified in the algorithm description; the reason is that the test graphs are very densely connected. The sampled graph stays connected when the graph is reduced by 30% or more.

The focus-filtering algorithm is based on creating a tree and then adding back the edges. It works well when the sampled number of edges is at least twice the number of vertices, which guarantees the connectivity of the smaller components. However, once the number of edges in the graph decreases, the shape of the graph gets closer to a tree. This can be clearly seen when visualizing the graph with a force-directed layout, as will be shown in the following section. The algorithm starts with the selected source node and grows the tree from there. The result is unbalanced: the nodes and links closer to the source are present in the sampled graph, while the ones forming smaller clusters in the original graph are not selected and appear as leaf nodes.

The betweenness centrality based algorithm (BC) performs better than FF in terms of speed; however, its performance decreases linearly as the edge cut gets higher. It always keeps the graph connected by design, and, generally, the sampled graphs are well balanced.

Overall, it can be concluded that the betweenness centrality based algorithm is the best choice when the graph is expected to stay connected, the important links must stay in place, and a non-immediate execution time is acceptable; the BC algorithm is therefore the best choice for this study.

Source node

Each graph that was created from the video stream has a source node: the node at which the video originated. When applying the FF algorithm, the source node is always selected as the “filtering focus” node. The reason is that the source node has the highest number of edges in all the test graphs. Since the “filtering focus” is chosen as the node with the highest summed edge weights, the source node gets selected as the root of the sampled tree. The outcome of the FF algorithm will therefore always contain edges connected to the source node.

The same applies to the WIS algorithm: the weight of the source node is high, and therefore it is present in all reduced graphs, guaranteeing that the node where the video stream originates will not be cropped out of the graph.

The source node also has a high betweenness centrality; after sampling the graph with the BC algorithm, the edges connecting to the source node have a high probability of staying.

As for the SRS2 algorithm, the edge cut depends purely on the weight; therefore, if the links adjacent to the source node have a high weight, they will be selected for the sampled graph.

5.2 Visualization results

In addition to the quantitative measures, this section presents a comparison of the visual representations of the filtered graphs with those of the original ones. Such visualizations facilitate the understanding of the different datasets and their structural similarities.

The end goal of this study is to effectively visualize the networks by creating the closest possible approximation of the original graph. The end result should have visually similar properties to the original network and show its structure while maintaining optimal performance.

5.2.1 Test graphs visualization

A generated caveman graph is presented in Figure 5.2.1, visualized using the solution explained in Chapter 4. The caveman graph consists of two interconnected clusters of size five, connected randomly, which results in 20 edges and 10 nodes.

Figure 5.2.1: Caveman 2x5

Examining how such a sample graph degrades after applying the sampling algorithms facilitates a better understanding of the algorithms' results.
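For reproducibility, such a test graph can be generated with NetworkX's caveman family of generators. The sketch below uses connected_caveman_graph, which rewires one edge per clique to link the clusters; the generator and the random inter-cluster wiring used in the thesis may differ, but the node and edge counts match the description above.

# Generate a caveman-style test graph: two cliques of five nodes, linked by
# rewiring one edge per clique (sketch).
import networkx as nx

caveman = nx.connected_caveman_graph(2, 5)
assert caveman.number_of_nodes() == 10
assert caveman.number_of_edges() == 20
assert nx.number_connected_components(caveman) == 1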

Figure 5.2.2: Degrading caveman 2x5 graph after applying the sampling algorithms; panels: (a) BC, (b) SRS2, (c) WIS, (d) FF. 80% of edges left, original graph reduced by 20%.

First, we examine how the graph structure is affected when it is reduced by 20%. The resulting graphs can be seen in Figure 5.2.2. The number of removed edges is quite low, and all sampling algorithms produce connected graphs. The betweenness centrality (BC) algorithm keeps the two components, while the components themselves are less densely connected. SRS2 also keeps the graph structure quite similar; however, two visibly distinct components can no longer be determined. WIS produces the graph with the structure most similar to the original one, by removing nodes. The FF algorithm keeps one component as it was originally, while the second one is quite disconnected.

By further reducing the graph, to half of its edges, we see the extremes of the end shape that each algorithm tends to fall into. It is not practical to reduce graphs that drastically for visualization; however, for deciding which algorithm to pick, it is crucial to determine what kind of structure they converge to. The resulting sampled graphs are presented in Figure 5.2.3.

At this point, SRS2 outputs a completely disconnected graph, with two disconnected components and one node without links. All other algorithms produce connected graphs. BC outputs a graph that is very close to a ring. Generally, none of the sampled graphs resemble the original structure; however, FF is the closest one, while SRS2 and WIS are the furthest from the original.

Figure 5.2.3: Degrading caveman 2x5 graph after applying the sampling algorithms; panels: (a) BC, (b) SRS2, (c) WIS, (d) FF. 50% of edges left, original graph reduced by 50%.

5.2.2 Overlay network visualization

In this section, the algorithms are applied to one of the original graphs of the network overlay structure provided by the company. Graph 9031 has 1821 vertices and 52343 edges, and three large distinct clusters, one of which has many sub-clusters and also contains the source node; a full view is shown in Figure 5.2.4 and a zoomed view in Figure 5.2.5.

Figure 5.2.4: Full view of graph 9031

Figure 5.2.5: Zoomed view of graph 9031

The graph was sampled with a filtering threshold of 8%. The threshold was chosen empirically so that the output graph would still be connected while being small enough to guarantee good visualization performance. The resulting graphs are presented in Figure 5.2.6.

The BC algorithm keeps the components connected; however, it is hard to distinguish the original components with a force-directed layout. This is because the algorithm keeps the edges with the highest betweenness centrality and removes the ones with the lowest, so the smaller dense components become less connected. This is good in terms of structural properties, but the resulting force-directed visualization has one clear dense center and is not evenly distributed.

The SRS2-sampled graph looks very close to the original. It has a couple of disconnected components, but the structure is well represented. Considering that the algorithm is the least computation-intensive of the four, the results are very satisfactory.

Figure 5.2.6: Degrading 9031 graph after applying the sampling algorithms; panels: (a) graph reduced with BC, (b) graph reduced with SRS2, (c) graph reduced with WIS, (d) graph reduced with FF. 8% of edges left in the sampled graphs; the graph is drawn using a force-directed layout.

At the same time, the algorithm disconnects the connected component; as a result, many independent sub-graphs can be seen in the view.

FF also appears close to the original; however, as can be seen from the caveman samples, when the filtering threshold decreases, the edges of the sampled graph become concentrated in the area of the largest component, and the other two components become disconnected.

The WIS sample size is different due to the design differences; to get closer to the other samples, 25% of the nodes were sampled, resulting in a graph with 455 vertices and 5271 edges, approximately a 10% edge cut. The sampled graph has two components, since the weak nodes are removed.

2-step rendering

As can be seen from the previous section, once the graphs are reduced through sampling, the force distribution changes. When the reduction percentage is small, it is possible to maintain a layout similar to that of the original graph. However, the original graphs have hundreds of thousands of edges, while the rendering limit is around tens of thousands of edges, so a large portion of a graph needs to be removed.

To keep the structure the same, the 2-step rendering technique was developed. The visualization code takes two graphs: the original and the sampled one. The force-directed layout algorithm is first applied to the original graph, and the assigned node coordinates are kept in memory. Then, as a second step, only the links present in the sampled graph are visualized, using the coordinates allocated to the original graph.

This approach keeps the structure of the original graph while rendering only the edges chosen by the sampling algorithm. The same graphs from Figure 5.2.6 are redrawn using the 2-step approach, resulting in the visualizations in Figure 5.2.7.
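The same idea can be sketched in Python, with NetworkX and Matplotlib standing in for d3-force and Three.js: the force-directed coordinates are computed on the original graph, and only the sampled edges are drawn at those coordinates.

# 2-step rendering sketch: layout from the original graph, edges from the
# sampled graph (NetworkX/Matplotlib stand-ins for d3-force/Three.js).
import matplotlib.pyplot as plt
import networkx as nx

def two_step_draw(original: nx.Graph, sampled: nx.Graph) -> None:
    # Step 1: force-directed coordinates computed on the full graph
    pos = nx.spring_layout(original, seed=42)
    # Step 2: draw only the sampled nodes and edges, reusing those coordinates
    nx.draw_networkx_nodes(sampled, pos, node_size=10)
    nx.draw_networkx_edges(sampled, pos)
    plt.axis("off")
    plt.show()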

SRS2 (Figure 5.2.7b) gives the best visibility of all the smaller components of the larger cluster, which were not visible in the original graph shown in Figure 5.2.5. The three clusters appear almost disconnected.

Figure 5.2.7: Degrading 9031 graph after applying the sampling algorithms; panels: (a) graph reduced with BC, (b) graph reduced with SRS2, (c) graph reduced with WIS (10% edge cut), (d) graph reduced with FF. 8% of edges left in the sampled graphs; the graph is drawn using the 2-step force-directed layout.

As for WIS, the smallest component has shrunk drastically, and it is hard to distinguish it in the view. The BC and FF algorithms produce the resulting graphs closest to the original, with a clearer structure. The difference between them lies in the density of the components: with BC, the clusters are more densely connected to one another, while with FF, the components are more connected internally. Moreover, it can be seen that with FF the smaller independent components are not as clearly represented.

Large overlay network visualization

The implemented visualization, based on a force-directed layout and Three.js, has a limit on the maximum number of edges that can be visualized while keeping the page responsive at a reasonable frame rate. The visualized reduced graph 9001 is presented in Figure 5.2.8. The visualizations of the 9037 and 9101 network overlay graphs are attached in Appendix A. They are reduced with an edge threshold of 5% using the betweenness centrality algorithm.

Figure 5.2.8: Graph 9001, reduced with an edge threshold of 5%

Chapter 6

Conclusions

Throughout the thesis work, it became clear that graph reduction is necessary for visualization: none of the currently existing web-based graphics engines is able to load the original graphs as-is while keeping the nodes interactive. Therefore, the investigation of sampling algorithms was a valid approach.

The main outcome of this thesis is that it evaluated reduction algorithms and proposed a functional solution enabling the visual representation of the overlay network, one that allows the user to identify network hubs, evaluate the general structure, and see differences between the networks.

Discussion

The comparison study was done by evaluating a set of objective graph measurements. The chosen betweenness centrality based algorithm fits the particular visualization use case, applied to a dense overlay network. The comparison study findings presented in Chapter 5 can be used as a general guide for selecting a filtering approach; however, the choice of algorithm may differ depending on the requirements.

Future Work

Future work should include the investigation of a hybrid sampling solution that reduces the graph in steps, which could lead to optimally reduced graphs and improved performance: for instance, using WIS first to reduce the number of nodes, then applying the betweenness centrality based algorithm to remove extra edges. On the visualization side of the solution, future work could focus on usability and user requirements.

Finally, future research shall explore the ability to handle streaming graph datasets and large-scale dynamic graphs by leveraging reduction techniques. Such a solution would allow the dynamic representation of the evolution of the overlay network over time.


Appendices

Appendix - Contents

A Network visualization
B Test results for caveman 2x5

A Network visualization

Figure A.1: Graph 9101, reduced with an edge threshold of 5%

Figure A.2: Graph 9037, reduced with an edge threshold of 5%

B Test results for caveman 2x5

TRITA-EECS-EX-2019:269
