DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Comparison study on graph sampling algorithms for interactive visualizations of large-scale networks

ALEKSANDRA VOROSHILOVA

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Aleksandra Voroshilova

2019-06-20 Master’s Thesis

Place for Project

Stockholm, Sweden

Examiner

Mihhail Matskin KTH Royal Institute of Technology

Supervisor

Tino Weinkauf KTH Royal Institute of Technology

Industry Supervisor

Alexandros Gkogkas, Hive Streaming

Abstract

Networks are present in computer science, sociology, biology, and neuroscience, as well as in applied fields such as transportation, communication, and the medical industry. The growing volume of collected data is pushing the scalability and performance requirements on graph algorithms, and at the same time, a need for a deeper understanding of these structures through visualization arises. Network diagrams, or graph drawings, can facilitate the understanding of data, making the identification of the largest clusters, the number of connected components, and the overall structure intuitive and enabling the detection of anomalies, which is not achievable through textual or matrix representations. The aim of this study was to evaluate approaches that would enable the visualization of large-scale peer-to-peer live video streaming networks. The visualization of such large-scale graphs has technical limitations, which can be overcome by filtering important structural data from the networks. In this study, four sampling algorithms for graph reduction were applied to large overlay peer-to-peer network graphs and compared. The four algorithms cover different approaches: selecting links with the highest weight, selecting nodes with the highest cumulative weight, using betweenness metrics, and constructing a focus-based tree. Through the evaluation process, it was discovered that the algorithm based on betweenness centrality approximation offers the best results. Finally, for each of the algorithms in comparison, the resulting sampled graphs were visualized using a force-directed layout with a 2-step loading approach to depict the effect of each algorithm on the representation of the graphs.

Keywords

Graph sampling, graph filtering, large graph visualization

Abstract (Swedish)

Networks are found in computer science, sociology, biology, and neuroscience, as well as in applied areas such as transportation, communication, and the medical industry. The growing volume of data collection is pushing the scalability and performance requirements on graph algorithms, while at the same time a need arises for a deeper understanding of these structures through visualization. Network diagrams or graph drawings can facilitate the understanding of data, identify the largest groups and the number of connected components, show an overall structure, and reveal anomalies, which cannot be achieved with textual or matrix representations. The purpose of this study was to evaluate approaches that could enable the visualization of a large-scale P2P (peer-to-peer) live streaming network. The visualization of larger graphs has technical limitations, which can be overcome by extracting important structural data from the networks. In this study, four sampling algorithms for graph reduction were applied to large overlay P2P network graphs and then compared. The four algorithms are based on selecting links with the highest weight, selecting nodes with the highest cumulative weight, using betweenness centrality values, and constructing a focus-based tree with the longest paths excluded. During the evaluation process, it was discovered that the algorithm based on betweenness centrality approximation showed the best results. In addition, for each algorithm in the comparison, the resulting sampled graphs were visualized using a force-directed layout with a 2-step loading approach.

Keywords

Graph reduction, large graph visualization

Acknowledgements

I would like to thank my supervisors: Tino Weinkauf, for guiding the process, giving advice on the next steps, and keeping it always professional and fun; and Alexandros Gkogkas, for giving me the opportunity to start this interesting study, sharing his knowledge, and helping with everything I needed to successfully finish it. I thank Mihhail Matskin for examining the work and giving the final feedback. I also owe gratitude to the Hive Streaming team, who have made working there very pleasant.

Thanks to my friends from EIT Digital Master School and Teknikringen who have made this year very memorable, and encouraged and inspired me to go on.

Finally, I want to thank my family: Angela, Artem and Ksenia for always being there for me. Zoja, Maja, and Borja for being an example.

Contents

1 Introduction
1.1 Background
1.2 Problem
1.3 Purpose
1.4 Goal
1.5 Benefits, Ethics and Sustainability
1.6 Methodology
1.7 Stakeholders
1.8 Delimitations
1.9 Outline

2 Background
2.1 Network theory
2.2 Graph visualization
2.2.1 Graph layout
2.2.2 Edge bundling
2.2.3 Reduction and Clustering
2.3 Related Work
2.4 Sampling algorithms
2.4.1 Simple random sample (SRS2)
2.4.2 Weighted Independence Sampling (WIS)
2.4.3 Edge Filtering based on betweenness centrality (BC)
2.4.4 Focus-based Filtering (FF)

3 Comparative study design
3.1 Research Process and Paradigm
3.2 Experimental design
3.3 Planned data analysis
3.4 Test data collection
3.4.1 Degree distribution
3.4.2 Reduction size
3.5 Test environment
3.6 Assessing reliability and validity of the data collected
3.7 Evaluation framework

4 Implementation
4.1 Graph libraries
4.2 Graph-tool graph library
4.3 NetworkX graph library
4.4 Graph sampling implementation
4.5 Performance
4.6 Visualization
4.7 Implementation

5 Analysis and evaluation of results
5.1 Algorithms performance comparison
5.1.1 Hypothesis 1
5.1.2 Hypothesis 2
5.1.3 Hypothesis 3
5.1.4 Hypothesis 4
5.1.5 Hypothesis 5
5.1.6 Hypothesis 6
5.1.7 Conclusions
5.2 Visualization results
5.2.1 Test graphs visualization
5.2.2 Overlay network visualization

6 Conclusions
References

Chapter 1

Introduction

Growing amounts of data are constantly generated, collected, analyzed, and stored for later use. As the data collection rate increases, data processing algorithms are pushed towards optimization for size and complexity. Furthermore, a need for a deeper understanding of underlying data structures, such as complex graphs, emerges. A graph data structure is a mathematical representation of a network, consisting of data points and the relationships among them. Common examples are social networks, neural networks, computer networks, traffic networks, and many others [1].

Networks of web pages are examples of tightly interconnected graphs, where the pages are the nodes and the links among them are directed edges. As more pages are added, the networks increase in size. For instance, as of the date of this paper, Wikipedia contains around 47 million pages in 293 languages, and 5.8 million interconnected articles in the English language alone [2]. Visualizing such a graph is a challenge from the computational, layout, and information visualization points of view.

Wikipedia is not the largest example of a network. According to Google, the World Wide Web consists of trillions of interconnected indexed pages. Large networks are also present in other sciences. For example, the neural network of the human brain includes 86 billion interconnected neurons [3].

Graphs can be represented as pairwise relations in a textual format; however, it is hard to draw conclusions about network properties by looking at a text file. A visual representation of a graph is easier to comprehend. The graph structure, connected components, and clusters can be determined with a quick glance, and it is also possible to detect anomalies and properties by examining the visualization thoroughly.

Technical and visual limitations are the two main challenges in the area of large graph visualization. Most graphics engines have a ceiling on the size of a graph they are able to render before running out of memory and computational capacity. The visual challenge is to lay out the graph data in a coordinate space such that the structure is reflected in the best way, and the user is able to understand the properties of the graph by looking at the visualization.

To tackle the technical limitations, it is necessary to reduce the size of the data structure. The emphasis of this work is on keeping the structural information while reducing the original graphs by applying a number of sampling algorithms in a resource-effective way. The study focuses on a comparison of the algorithms based on scalability and structural information retention. Additionally, the resulting sampled graphs are visualized using a web-based graphics engine.

1.1 Background

One example of computer networks are peer-to-peer overlays: computer systems where computers act as "peers" connected among each other, forming an overlay network. The computers use their resources to execute tasks and utilize network bandwidth for sharing data. In a client-server architecture, the server acts as a supplier and the client as a consumer. Conversely, peer-to-peer nodes can act both as suppliers and consumers.

Other examples of peer-to-peer networks are peer-to-peer based cryptocurrencies, which take advantage of the scalability and security features: there is no single source of truth and therefore no single point of failure. Every node can act as a leader node; therefore, in case of failures, the system can select another leader and continue the work.

Hive Streaming builds scalable, peer-to-peer (P2P) content distribution solutions with a focus on live video streaming. The company has millions of installed agents that facilitate the distribution of thousands of data streams daily. To track the usage and performance of a P2P distribution network, a vast amount of data is collected and then distilled into insightful analytics and interactive data visualizations with exploratory capabilities.

Enterprise networks are usually arranged into LANs, grouped into sites. Hive Streaming solutions utilize them to distribute data among peers without having to send it to a separate server to be consumed by clients. Instead, the nodes spread the data via the internal network. Hive Streaming solutions are mostly used for live video streaming; they significantly increase streaming performance and ease the network load.

1.2 Problem

The peer-to-peer overlay network is a scale-free dynamic network. Nodes are added and removed as computers join or leave the video stream. These networks can reach a size of hundreds of thousands of edges and tens of thousands of nodes. Graphs of such size are not supported by most web-based libraries. Moreover, visualizing such a graph would lead to an unreadable, space-filling jumble of edge crossings. To tackle this visualization challenge, various graph pre-processing and clustering techniques can be used to reduce a graph and visualize a simplified version of it.

The study focuses on finding the algorithm that efficiently reduces the graph while maintaining the most important nodes and links. The formal problem statement is: How do different algorithms for graph pre-processing compare in terms of performance and preserving the original graph structure?

1.3 Purpose

The purpose of this study is to investigate the state of the art and present a solution fitting the requirements of an interactive and large-scale visualization of a peer-to-peer distribution network.

1.4 Goal

The goal of this study is to visualize the overlay structure of the network using graph pre-processing techniques while retaining important links and the overall structure. This thesis intends to produce an implementation of the compared algorithms and experimentally evaluate them. Moreover, it aims to provide suggestions for a visualization technique for the resulting sampled graphs running inside a browser, scaling up to hundreds of thousands of edges.

1.5 Benefits, Ethics and Sustainability

The study focuses on reducing the data to show insightful analytics while having less data to operate on, resulting in a decrease in resource usage. The outcomes can be used in development processes for a quick overview of the network, to detect anomalies, and to get a better understanding of the structure.

Moreover, the customers can visually comprehend the overlay networks and evaluate the performance of peer-to-peer distribution without having to dig into technical logs.

Hive Streaming customer network information is confidential and is therefore not used in this study. The test network data used in this study is anonymous and similar in structure and size to actual enterprise networks, so as not to compromise privacy or non-disclosure agreements.

1.6 Methodology

The study uses the applied research method. It starts with a literature study, for which an inductive approach is used to collect state-of-the-art solutions and determine the algorithms to be implemented. A comparative study is then performed on the chosen algorithms, and they are evaluated. [4]

1.7 Stakeholders

The stakeholders of this study are the Hive Streaming company and its customers. A visual representation of a peer-to-peer distribution network would be useful for the evaluation and analysis of the dynamics and structure of the formed overlay networks, and could lead to improvements in the peer-to-peer distribution algorithm. While the basic representation of a network graph is stored in a text format, having it represented visually gives a quick and informative overview of the structure.

As for the customers, they get to see the internals of the video stream distribution in a user-friendly way. Moreover, the visualization reveals the underlying structure of a network. Many companies do not have up-to-date information about their internal network structure, which makes such information very valuable. Therefore, in addition to the benefits stated above, they can also use it to gain insights into their network structure.

1.8 Delimitations

The study is scoped to a comparison of sampling algorithms. The problem of visualizing large graphs is complex and touches multiple fields such as data optimization, coordinate allocation, graphics rendering, and user-experience design, among others. The proposed pre-processing step is necessary for any later visualization approach. The end work includes the definition of a visualization technique; however, user experience evaluation and customer requirements validation are out of the scope of this study and could be the subject of a future study.

1.9 Outline

The necessary network theory fundamentals and general graph visualization techniques are explained in Chapter 2, as well as a graph reduction as an approach to visualization, and the algorithms that are used for comparative study.

In Chapter 3 the methodology and the structure of the comparison study are presented. Chapter 4 explains the details of the technical implementation and the libraries used. In Chapter 5 the results of the comparison are presented, followed by conclusions in Chapter 6.

Chapter 2

Background

To cover the background knowledge required to understand the work, an introduction to network theory fundamentals is presented in Section 2.1. The information on network theory comes from the fundamental book by Mark Newman, "Networks: An Introduction" [5].

The graph visualization part of this chapter (Section 2.2) explains the required steps for visualizing a graph and common approaches for working with large graphs. Section 2.3 covers related work on graph reduction and, finally, in Section 2.4 the four graph sampling algorithms used for the comparative study are presented.

2.1 Network theory

Data can be of different sizes and have various characteristics and relationships. Many systems have components that link to one another in some way or have some kind of relationship. For instance, social groups are sets of people that know each other or have some interests in common. A computer network, such as the Internet, is a set of computers connected by physical wires. These kinds of data structures are distinguished by the fact that the connecting links are crucial relationships that describe the properties of a data set.

Network theory is an interdisciplinary field that spans mathematics, biology, physics, social sciences, economics, and other sciences. It studies network structures and focuses on network analysis and optimization. Some examples of large-scale networks are biological protein chains, social networks, traffic connection networks, and the World Wide Web as the largest present computer network. Depending on the field, the scientific interest focuses on different areas. In the social sciences, for instance, the focal point is on the dynamics of relationships between social entities. Such research evaluates the connectivity density of social groups and assesses the probability of members of one group relating to members of another group. In the case of traffic regulation, the aim is to design for the shortest routes and distinguish large hubs. In the biological field, the network structure is used to model interaction patterns between appropriate biological elements, such as biochemical networks, metabolic networks, protein-protein interaction networks, neural networks, and many others. [5]

Graph definition

A network can be described as a collection of data points with relationships encoded as links. In computer science, networks are represented as graphs, structures amounting to a set of objects with pairwise relations between them. Graphs are studied thoroughly in the mathematics sub-field of graph theory and are formally represented as

G = (V, E), where V is a collection of data points, called vertices or nodes, and E ⊆ {{x, y} | x, y ∈ V ∧ x ≠ y} is the set of relationships between vertices, referred to as links or edges. This mathematical representation enables the analysis of networks by applying various graph measures and metrics [5].

Types of graphs

Graphs can be directed or undirected; in directed graphs, every node has an in-degree and an out-degree metric assigned. Nodes with a high degree are called hubs; they are important connecting points. There can be several edges between two nodes, called multiedges. If those are present, the graph is called a multigraph. Otherwise, if no multiedges are present and the graph does not contain any self-loops (edges whose source and destination are the same node), it is called a simple graph.

Knowing whether a graph falls under a specific category is important, since certain graph properties can facilitate analysis and visualization. The basic types of graphs are a ring, a tree, and a complete graph (where there exists a link from each node to every other node). Another network type is a small-world network, where the neighbors of any given node are likely to be reachable in a small number of hops. Word co-occurrence networks [6] and brain neuron networks [7] are some examples of small-world networks. Power-law networks, also called scale-free networks, follow a power-law degree distribution, meaning that there are a few nodes with a large degree, called hubs, followed by a higher number of smaller hubs. The lower the degree, the more nodes of that degree are present. Social networks are one example of scale-free networks.

Adjacency matrix

A mathematical representation of a graph is an adjacency matrix A, such that A_ij is 1 if there exists an edge between i and j, and 0 otherwise.

A = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}

The matrix above represents a graph with three nodes and five edges, one of which is a self-loop.
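As a quick plain-Python illustration (not part of the thesis implementation), the edge list of this example graph can be read directly off the non-zero entries of the matrix:

import numpy as np

# The example adjacency matrix from above: three nodes, five edges,
# where A[2][2] is the self-loop.
A = np.array([
    [0, 1, 1],
    [1, 0, 1],
    [0, 0, 1],
])

# Every non-zero entry A[i][j] encodes an edge from node i to node j.
edges = [(int(i), int(j)) for i, j in zip(*np.nonzero(A))]
print(edges)  # [(0, 1), (0, 2), (1, 0), (1, 2), (2, 2)]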

A network can be weighted, with weights representing the strength of a given edge. For instance, on the Internet, a link weight can show the amount of traffic transferred between nodes [8]. In social networks, it can represent the strength of a relationship or the frequency of communication.

Once networks get large, standard metrics and measurements can be used to analyze the structure of the networks.

Shortest path

A geodesic path (or the shortest path) is a path between two vertices in a graph such that no shorter path exists. In weighted graphs, the sum of weights of its constituent edges is minimized to acquire the shortest path.

Centrality metrics

There are several measures of centrality, which represent the relative importance of vertices and edges based on various parameters, chosen depending on the use case. The common ones are eigenvector centrality, closeness centrality, degree centrality, PageRank, and betweenness centrality.

Betweenness centrality is calculated using shortest paths. The betweenness centrality of a node is the total number of shortest paths passing through the given node; it measures the extent to which a node lies between other nodes. For instance, in a star graph, the center node has the highest betweenness centrality. The betweenness centrality of an edge, similarly, is the total number of shortest paths passing through it.

BC(v) = \sum_{\substack{s, t \in V \\ s \neq v \neq t}} \frac{\sigma_{st}(v)}{\sigma_{st}},

where σ_st is the number of shortest paths between s and t, and σ_st(v) counts only the ones passing through v. Similarly, the betweenness centrality of an edge can be calculated by measuring how many shortest paths of the graph contain the given edge.
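As a hedged illustration using NetworkX (one of the libraries used in Chapter 4), the star-graph example above can be checked directly with the library's standard betweenness routines:

import networkx as nx

# A star graph: node 0 is the center, nodes 1..5 are leaves. Every
# shortest path between two leaves passes through the center.
G = nx.star_graph(5)

node_bc = nx.betweenness_centrality(G)       # normalized node betweenness
edge_bc = nx.edge_betweenness_centrality(G)  # normalized edge betweenness

# The center node has the highest betweenness centrality, as stated above.
print(max(node_bc, key=node_bc.get))  # 0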

Clique

A clique is a maximal subset of the vertices in an undirected network such that every vertex in the subset is connected to every other vertex in the subset. This measure is useful in network analysis. In social networks, a clique can represent a close group of friends where everyone knows each other. If the network is otherwise sparse, that can be an indication of a socially isolated group.

Transitivity

Another important property is transitivity. A relation "◦" is transitive if a ◦ b and b ◦ c together imply a ◦ c. In networks, this appears as follows: if there is an edge between a and b, and between b and c, transitivity implies that there exists an edge between a and c, resulting in a clique. Perfect transitivity is achieved in a complete graph, which is a graph where all vertices are connected to one another.

Clustering coefficient

Partial transitivity is used to calculate the clustering coefficient of a network: the fraction of paths of length two in the network that are closed, i.e., that belong to a clique.

C(G) = \frac{\text{number of closed paths of length two}}{\text{number of paths of length two}}
     = \frac{6 \times \text{number of triangles}}{\text{number of paths of length two}}
     = \frac{3 \times \text{number of triangles}}{\text{number of connected triples}}

The clustering coefficient represents the degree to which the network is clustered, and, as the formula above shows, it measures the frequency of loops of length three present in the network. One of its applications is to estimate the probability that two random vertices u and v are connected:

P(u, v) = \frac{C(G)}{n}, \quad u, v \in V,

where C(G) is the clustering coefficient and n is the total number of vertices.
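For illustration, the global clustering coefficient of a small graph can be computed with NetworkX, whose transitivity function implements the 3 × triangles / triples formula above (the toy graph is an assumption made for the example):

import networkx as nx

# A triangle 1-2-3 with a pendant node 4: one triangle, five connected
# triples, so C(G) = 3 * 1 / 5 = 0.6.
G = nx.Graph([(1, 2), (2, 3), (3, 1), (3, 4)])
print(nx.transitivity(G))  # 0.6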

Assortativity

Assortativity is the fraction of connected vertices of the same type, where the type can be the vertex degree or any other attribute. In social networks, this measure of assortative mixing (homophily) can represent a tendency of people of the same age, sex, race, language, cultural background, or geographical location to belong to the same group. In web networks, pages of the same language tend to be linked. In computer networks, computers belonging to the same LANs tend to be connected.

Multiple characteristics can be used to calculate assortativity. Discrete and scalar characteristics shall be considered separately. The discrete ones are limited sets of values, such as sex, race, language, or geographical location. Scalar characteristics have an order; thus, even if the values are not exactly equal, the difference can be measured. Examples of such characteristics in social networks are age and income.

2.2 Graph visualization

The area of graph visualization consists of different aspects such as graph layout, clustering algorithms, reduction algorithms, edge drawing, and others. Various algorithms and metrics from network theory can be applied to, for instance, find shortest paths, assign page ranks, and determine clusters, all of which can be visualized, depending on the information to be communicated.

In order to visualize a graph from a collection of data in a text format, layout algorithms are applied to allocate coordinates to each node in such a way that the structural information is communicated efficiently and it is easy for the user to perceive the topological structure of the graph. That implies minimizing the number of edge crossings and occlusions and allocating optimal distances between nodes. Once the coordinates are assigned, the graph can be rendered using different graphics engines, depending on system requirements.

2.2.1 Graph layout

The graph drawing field is a large research sub-field of graph visualization; it combines the fields of mathematics and information visualization. It focuses on algorithms for allocating coordinates to vertices and edges. To draw a graph in coordinate space, vertices have to be distributed in a given frame such that edge crossings are minimized, reflecting symmetry and graph structure while maintaining acceptable computational performance.

Planar graphs are the simplest case since, by definition, they can be embedded in a plane, meaning that it is possible to allocate coordinates in 2D space without any edge crossings. Large power-law graphs are non-planar; therefore, algorithms need to be applied to find an optimal layout.

One common approach is to use force-directed placement [9]. The algorithm assigns physical forces to nodes and edges. The distribution of forces can vary depending on the implementation details. In a nutshell, nodes that are connected by an edge are pulled towards each other, while nodes are otherwise pushed away from each other. There are implementation variants of this method [9, 10].

The force-directed approach has several advantages, as evaluated by many papers: simplicity, good quality of results, intuitive design, and customizability. The strength of the forces can be adjusted to proportionally assign lengths to the edges. The known drawback of the basic algorithm is a long running time, equal to O(n³). There exist additions to the algorithm that improve performance [11].

An alternative algorithm is the spectral layout, which is based on the Laplacian matrix of a graph. The eigenvectors corresponding to the two largest eigenvalues are used to set the locations of the nodes in a 2D plane: the first eigenvector's values are used to allocate X-coordinates and the second eigenvector's values to allocate Y-coordinates. [12]

For hierarchical data, a layered approach can be used. Each layer in the hierarchy gets drawn as a row. Variations of the drawing approaches can be applied: drawing from top to bottom or left to right. [13]

2.2.2 Edge bundling

Densely connected graphs can look very cluttered, with links filling the entire background in the worst cases. Geometric edge bundling attempts to address this problem by bending edges towards each other: similar edges are drawn together, creating empty white space in between and thus reducing visual complexity. Although the visual complexity is reduced with this approach, the computational complexity increases: not only does the graph have to be visualized, but new positions for each link also have to be calculated. [14]

2.2.3 Reduction and Clustering

To achieve higher information comprehension for the user and to tackle the technical limitations, large-scale graphs need to be reduced. Two common approaches used in the literature are graph clustering and graph sampling, also referred to as filtering. A combination of both approaches can also be used; thus a graph can be both reduced and clustered.

The graph sampling approach addresses the problem of large graph size by removing edges or nodes. In contrast, clustering algorithms group graph data based on structural properties and create levels of hierarchy, which can then be used to visualize only these higher-level groupings instead of displaying all nodes and links.

2.3 Related Work

An extensive effort has been made in the field of graph reduction, resulting in numerous publications describing sampling algorithms. The algorithms differ in the graph type and structure they can be applied to (trees, small-world, power-law), connectivity, and graph properties (weighted or unweighted). There is also a specific type of algorithm using walks, applied to graphs that are not seen as a whole structure at once, for instance, when the graph comes in a continuous stream or when the network is too large to fit in memory.

The graph sampling algorithms can be stochastic or deterministic. Leskovec and Faloutsos have made a comparative study on stochastic algorithms [15], evaluating sampling results from graphs with zero-weight edges. Due to randomness, stochastic algorithms may produce different results when applied with the same parameters.

Deterministic filtering, on the other hand, guarantees the same result after each run. Lee [16] and Boutin [17] are using tree reconstruction methods to yield an approximate graph from the given one. Another approach [18] is to use betweenness centrality as a metric for selecting the edges to keep.

Apart from filtering, as stated above, a clustering approach is common to represent groups of nodes. There are many algorithms for grouping nodes into clusters [19]. Examples of such approaches include selecting nodes by authority metric [20], removing weak edges and thus grouping well-connected components [21], or selecting groups of nodes by their shortest path distance from selected nodes [22].

The graph data used in this study already consists of several layers of groups of clusters, created from the real network characteristics, so it would be redundant to apply clustering algorithms. Since the graphs are very densely connected, having hundreds of thousands of edges and tens of thousands of vertices, the filtering (or sampling) approach was chosen instead.

2.4 Sampling algorithms

Four sampling algorithms were selected for this comparative study. The algorithms focus on reducing the graph while aiming to keep its original structural properties. The network overlay graphs that need to be visualized in this study are weighted power-law graphs; thus the selection was based on algorithms that are applicable to such graphs.

2.4.1 Simple random sample (SRS2)

The "Simple Random Sample" sampling algorithm is presented in the paper "Effectively Visualizing Large Networks Through Sampling" by Davood Rafiei [23]. A simple random sample of the edges is taken from an unweighted graph, and only vertices that are incident to those sampled edges are kept.

For a weighted graph, the same procedure is followed, with the difference that the selection criterion is the weight of an edge. The edges with the highest weight are kept, and nodes with the largest number of heavy-weight links end up in the sample. The algorithm does not guarantee connectivity. It is one of the most intuitive approaches to removing edges; therefore it is used as a baseline for comparison.

Given a number of edges to keep in the sampled graph, the procedure consists of the following 2 steps:

1. Sort

• Sort the graph edges by weight.

2. Reduce

• Remove the edges with the lowest weight until the number of edges is reduced to the given threshold. Also, remove nodes that are left without any edges.
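A minimal sketch of this procedure in Python with NetworkX follows; the function name and the use of a "weight" edge attribute are illustrative assumptions, not the thesis code:

import networkx as nx

def srs2(G, edge_threshold):
    """Keep the `edge_threshold` heaviest edges; nodes left without
    edges are dropped implicitly by rebuilding the graph from edges."""
    edges = sorted(G.edges(data="weight"),
                   key=lambda e: e[2], reverse=True)
    H = nx.Graph()
    H.add_weighted_edges_from(edges[:edge_threshold])
    return H  # connectivity is not guaranteed, as noted above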

2.4.2 Weighted Independence Sampling (WIS)

The Weighted Independence Sampling algorithm is described in "Walking on a Graph with a Magnifying Glass: Stratified Sampling via Weighted Random Walks" [24]. For a graph G, let w(u, v) denote the weight of an edge (u, v) ∈ E. The weight of a node u ∈ V is then

w(u) = \sum_{v \in N(u)} w(u, v)

For each node v in the set V, a probability proportional to the node's weight is assigned:

\pi(v) = \frac{w(v)}{\sum_{u \in V} w(u)}

Then nodes are sampled with replacement, independently at random, based on the calculated probabilities. Since the algorithm samples nodes, the output is based on a node filtering threshold rather than an edge threshold. Given k, the node threshold, the algorithm is as follows:

1. Calculate vertex probabilities.

• To each node assign a probability π to be selected.

2. Select nodes

• Select the k nodes at random with probability π

3. Sample Graph

• The sampled graph consists of the k selected nodes and all the links existing between those nodes in the original graph.
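A sketch of WIS using NumPy and NetworkX; the function signature and the "weight" attribute are assumptions made for the example:

import numpy as np
import networkx as nx

def wis(G, k, seed=None):
    """Sample k nodes with probability proportional to their cumulative
    edge weight and return the induced subgraph."""
    rng = np.random.default_rng(seed)
    nodes = list(G.nodes())
    w = np.array([G.degree(v, weight="weight") for v in nodes], float)
    pi = w / w.sum()  # per-node selection probabilities
    # Sampling is done with replacement, as in the original algorithm;
    # duplicates collapse when the induced subgraph is taken.
    picked = rng.choice(len(nodes), size=k, replace=True, p=pi)
    return G.subgraph(nodes[i] for i in picked).copy()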

2.4.3 Edge Filtering based on betweenness centrality (BC)

The algorithm was introduced in the publication "On the Visualization of Social and other Scale-Free Networks" by Jia, Hoberock, Garland, and Hart [18]. It is specifically designed for visualizing scale-free networks and is based on the betweenness centrality metric.

The algorithm consists of two steps: reduction and post-processing.

1. Reduction

• Calculate betweenness centrality for each edge in G.

• Sort the edges by the betweenness centrality in decreasing order

17 • If a node has at least 3 edges, remove an edge

2. Post-processing

This step is executed only if the resulting reduced graph is disconnected.

• Take a collection of edges removed in the Reduction step and sort by betweenness centrality in a decreasing order

• From the collection, starting with edges with the highest betweenness centrality, add back the edges that would reconnect the disconnected components until the graph is connected.

One detail of the algorithm is that an edge cannot be removed if one of its endpoints has fewer than two remaining edges.

In addition, to optimize performance, an approximation of the betweenness centrality metric is applied instead of the full-graph betweenness centrality calculation.
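A simplified sketch of both steps with NetworkX is shown below. The pivot parameter of edge_betweenness_centrality stands in for the betweenness approximation mentioned above, and the degree rule follows the description of the algorithm; this is an interpretation, not the published implementation:

import networkx as nx

def bc_filter(G, edge_threshold, pivots=256, seed=1):
    """Remove low-betweenness edges down to `edge_threshold`, then
    re-add removed edges (highest betweenness first) until connected."""
    ebc = nx.edge_betweenness_centrality(
        G, k=min(pivots, len(G)), seed=seed)  # pivot-based approximation
    H = G.copy()
    removed = []
    # 1. Reduction: walk edges from the lowest betweenness upwards.
    for u, v in sorted(ebc, key=ebc.get):
        if H.number_of_edges() <= edge_threshold:
            break
        # Only remove an edge if both endpoints keep at least 2 edges.
        if H.degree(u) >= 3 and H.degree(v) >= 3:
            H.remove_edge(u, v)
            removed.append((u, v))
    # 2. Post-processing: reconnect components if needed.
    for u, v in sorted(removed, key=ebc.get, reverse=True):
        if nx.is_connected(H):
            break
        if not nx.has_path(H, u, v):  # endpoints in different components
            H.add_edge(u, v)
    return H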

2.4.4 Focus-based Filtering (FF)

The Focus-based Filtering algorithm is described in the paper "Focus-based filtering + clustering technique for power-law networks with small world phenomenon" by Boutin, Francois, and Thievre [17]. The algorithm is designed for networks with a power-law degree distribution and eventually produces a connected graph. It uses the shortest path metric. The algorithm begins by selecting a root node called the filtering focus. It first builds a tree from the edges of the original graph and then adds back the edges of G spanning the longest tree paths until a threshold is reached.

1. Select the filtering focus node V1.

The root is the node with the highest cumulative weight of its edges.

2. Take the node V_{n+1} that is connected to any node in V_n and has the highest degree.

• Find all neighbors of V_n

• Retrieve the one with the highest degree

• Connect it by selecting an edge to the highest-degree node in V_n

The output of this step is a tree.

3. Dense extraction.

• Calculate shortest path distances in the tree between all node pairs that are connected by an edge in the graph G.

• The edges in G whose endpoints have the longest shortest-path distance in the new tree are added back until the size of the sampled graph reaches the set threshold.

This algorithm heavily depends on the filtering focus selection. The more edges are cut away, the more the sampled graph is limited to the area around the filtering focus.
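A sketch of the tree-construction phase (steps 1 and 2) with NetworkX; the dense-extraction step is omitted for brevity, and the function name and weight attribute are illustrative assumptions:

import networkx as nx

def ff_tree(G):
    """Build the FF tree: start from the focus (highest cumulative edge
    weight) and repeatedly attach the highest-degree frontier node."""
    focus = max(G, key=lambda v: G.degree(v, weight="weight"))
    T = nx.Graph()
    T.add_node(focus)
    while T.number_of_nodes() < G.number_of_nodes():
        # Nodes outside the tree that are adjacent to it.
        frontier = {v for u in T for v in G.neighbors(u)} - set(T)
        if not frontier:
            break  # the remainder of G is unreachable
        v = max(frontier, key=G.degree)
        # Attach v through its highest-degree neighbor inside the tree.
        u = max((n for n in G.neighbors(v) if n in T), key=G.degree)
        T.add_edge(u, v)
    return T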

Chapter 3

Comparative study design

The following sections describe the process used to answer the research question, followed by a description of the experiment design, the planned analysis structure, and the characteristics of the test data and test environment.

3.1 Research Process and Paradigm

The applied research method is used to determine the appropriate algorithms for graph reduction. The graph sampling algorithms are run on several datasets, and the quantitative results are collected from the reduced graphs.

3.2 Experimental design

The implementations of the four algorithms produce different reduced graphs as output. To evaluate and compare them, different graph measurements are taken. These measurements are formulated into hypotheses, appropriate tests are executed, and the hypotheses are then confirmed or refuted.

The tests are run on six graphs. For the three smaller, artificially generated test graphs, the algorithms are run for ten different thresholds (the number of edges remaining in the sampled graph), resulting in 120 runs altogether. For the three real-life datasets, they are run with nineteen different thresholds, which corresponds to 228 reduced graphs.

3.3 Planned data analysis

The algorithms produce sampled graphs that are smaller in size and may have different structural characteristics. To compare the produced graphs, their properties are measured and evaluated. Depending on the requirements on the expected outcome, different algorithms can be chosen for the end visualization implementation.

The goal of this study is to visualize a connected peer-to-peer network, giving an overview of the structure of the network. In this case, maintaining the structure means keeping the graph connected and keeping the dense clusters. A heavy reduction can lead to a complete loss of structure; for instance, components that are connected in the original graph can appear disconnected. That should be avoided: a balance between reducing the graph to a more comprehensible size and keeping the connectivity and clustering information shall be maintained.

3.4 Test data collection

The real-life datasets are provided by the stakeholders and represent snapshots of a peer-to-peer overlay network at particular moments in time when the video stream was active. The test datasets are compiled from the reported usage during a video stream, where a node is a viewer and a link is a data connection with another viewer or with the source of the video stream. The graphs are weighted; the weight represents the amount of data in bytes transferred between two nodes.

These real overlay network graphs are composed of highly clustered connected components. In practice, those clusters are collections of computer LANs forming a site. The entire network consists of multiple such sites, connected among each other and representing one connected component. Additionally, one node in the network is the source of the video stream.

Since even the reduced graphs are large, the tests are also run on sample graphs to ensure the correctness of the algorithms. The sample graphs are generated using a relaxed caveman graph model. A caveman graph has a number of groups made up from cliques of size k. The graph is relaxed, meaning that the edges are rewired with a given probability. [25]
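NetworkX ships a generator for this model, so the caveman test graphs can be reproduced approximately as follows (the seed is an arbitrary choice, and rewiring makes the exact edge count vary slightly):

import networkx as nx

# 8 groups, each a clique of 50 nodes, edges rewired with probability p:
# this mirrors the "caveman 8 50" test graph from Table 3.4.1.
G = nx.relaxed_caveman_graph(l=8, k=50, p=0.1, seed=42)
print(G.number_of_nodes(), G.number_of_edges())  # 400 nodes, ~9800 edges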

Graph measurements are shown both for the caveman graphs and for the real graphs in Table 3.4.1. "Avg degree" stands for the average degree, and "GCC" stands for the global clustering coefficient.

ID             Nodes    Edges      Avg degree   GCC
caveman 2 5       10       20          2.0      0.67
caveman 3 10      30      135          4.5      0.62
caveman 8 50     400     9800         24.5      0.76
9101           33561   656584        19.56      0.39
9001           17296   907800        52.48      0.74
9037           24854  1085964        47.5       0.57

Table 3.4.1: Test data

3.4.1 Degree distribution

The degree distribution of a network is the distribution of the node degrees of the network, where the degree of a node is equal to the number of edges it has. Networks can be categorized by distribution type; the two common types are the binomial degree distribution and the power-law distribution. This metric can give an idea of the structure of the network and is a way to distinguish different types of networks.

Networks having a power-law degree distribution are called scale-free networks. Examples of such networks are some social networks and the Internet. The test graphs are scale-free graphs as can be seen in Figure 3.4.1.

Figure 3.4.1: Degree distribution of the test graphs
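The histogram underlying such a plot can be computed as follows; the Barabási–Albert generator is used here only as a stand-in for a scale-free test graph:

from collections import Counter

import networkx as nx

def degree_histogram(G):
    """Map each degree value to the number of nodes having it; on a
    log-log plot a power law shows up as a roughly straight line."""
    return Counter(d for _, d in G.degree())

hist = degree_histogram(nx.barabasi_albert_graph(1000, 3, seed=1))
print(sorted(hist.items())[:5])  # low degrees dominate in a power law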

3.4.2 Reduction size

The graphs can be reduced by any number of edges or vertices. To maintain the structure of the original graph, the edge cut should not be too large. The upper limit on the possible number of edges is determined by the graphics engine used for visualization: there is a limit on the number of nodes and edges that can be held in memory and rendered with a reasonable frame rate.

The minimum possible number of edges in a connected graph is (n − 1), where n is the number of vertices. Since the original network is connected, the reduction should not bring the number of edges below that.

3.5 Test environment

Running the implementations of the algorithms in an identical clean environment is essential to guarantee reproducible and reliable results. There shall be no other processes or programs running that could influence the performance or the results. To ensure that, the tests were run on Amazon Web Services virtual instances. AWS is a cloud solution offering a variety of computing, storage, and other services; most importantly, it is possible to set up virtual machines to run code in parallel on clean instances.

The tests were run on clean EC2 instances with Ubuntu 18.04 (Bionic). The instance type is m5ad.large with 8 GiB RAM and 1 x 75 GB NVMe SSD, which is comparable performance-wise to an average computer. Each instance had the graph-tool Ubuntu distribution installed, along with Python and pip to install the dependencies. The graph files and the algorithm implementation code were copied to the instances and executed.

3.6 Assessing reliability and validity of the data collected

To ensure that the collected results are valid, the tests are run on clean virtual machine instances; therefore the results are not affected by any other running processes. All tests are run on the same installations, with the same versions of the required libraries. The sample test data are of different graph sizes and represent different underlying networks, to support the validity of the results.

3.7 Evaluation framework

The gathered data is quantitative; thus, for each run, a dependent parameter is calculated and evaluated. The study aims for the fastest performance, the smallest size of output, and characteristics of the sampled graph that are close to the original graph.

Chapter 4

Implementation

The work for this comparative study consists of two parts: the implementation of the graph sampling algorithms and the network visualization. The algorithms are implemented as Python scripts that can be deployed to a back-end, and the visualization is implemented as a JavaScript script that can run in a browser.

4.1 Graph libraries

Graph libraries provide basic graph input and output methods for reading and writing graph files. In addition, they support the graph data structure and implement basic graph analysis and measurement functions presented in Chapter 2.1.

There are plenty of Python graph libraries available. The core algorithms for graph analysis are implemented in various frameworks such as JUNG, NetworkX, graph-tool, and others. These frameworks differ in terms of implementation, performance, and the set of algorithms offered.

Graph data formats

The basic representation of a graph is an adjacency matrix. The offers a simple graph representation without any metadata. Additionally to the relationships between nodes, it can be useful to have metadata on nodes

26 and edges. Such graphs can be stored in more complex formats such as GML, GraphML, JSON, Pajek, YAML, JSON, LEDA, and many others.

The right choice of a data format is determined by the structure of the data in question in combination with the libraries and software to be used. It is straightforward to convert from one format to another.

4.2 Graph-tool graph library

Graph-tool is a Python library for graph analysis. It has a well-documented set of APIs for core graph algorithms, measurements, and graph manipulation. It is essentially a C++ library wrapped in Python. It has dependencies on Boost, expat, SciPy, NumPy, CGAL, and other optional libraries. The Boost Graph Library is a C++ graph library with standard mathematical graph algorithms implemented. [26]

Installation

A common way to install Python libraries that are not included in a standard Python distribution is by using the pip package manager to pull them from PyPI (the Python Package Index), a repository used to distribute software for Python. Graph-tool is not present there, since it is practically a C++ library and has dependencies on other C++ libraries which cannot be installed using pip. For GNU/Linux distributions and MacOS, the installation can be done using package managers.

The fastest and the easiest way to get graph-tool is by downloading a Docker image. It is an OS-agnostic way, which requires minimal effort and does not cause compatibility issues since it is executed in an isolated Docker container.

The implementation of the algorithms for this research was done on MacOS, where installation via the Homebrew package manager should normally suffice. Due to a mismatch of dependency library versions, however, a manual compilation was done instead. The installations on the AWS Ubuntu instances and the local MacOS machine behaved exactly the same, produced the same function outputs, and did not require any adjustments in the code.

Performance

Graph-tool uses the OpenMP (Open Multi-Processing) API to run algorithms in parallel. This can be fully utilized when running on hardware with multiple cores with parallel execution enabled.

It is based on the Boost C++ libraries and takes advantage of metaprogramming techniques to achieve a high level of performance. Moreover, it has APIs for accessing vertices and edges as NumPy arrays, without having to create complex object collections. Another feature that boosts performance is a powerful filtering capability: it is an efficient way to filter out edges or vertices by assigning a property, without having to copy the entire graph or modify the original structure.
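A minimal sketch of that filtering mechanism (the toy graph and property values are assumptions made for the example):

from graph_tool.all import Graph, GraphView

g = Graph(directed=False)
g.add_edge_list([(0, 1), (1, 2), (2, 0), (2, 3)])

# Mark edges to keep in a boolean property map...
keep = g.new_edge_property("bool")
keep.a = True                  # keep everything by default
keep[g.edge(2, 3)] = False     # ...except this one edge

# ...and look at the graph through a filtered view: the original graph
# is neither copied nor modified.
view = GraphView(g, efilt=keep)
print(view.num_edges())  # 3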

License

Graph-tool is distributed under the GNU General Public License. The source code is publicly available.

4.3 NetworkX graph library

NetworkX is a pure Python library for graph analysis, first publicly released in 2005 [27]. It is well-established and has extensive documentation and example code. It can be installed with the pip package manager and is compatible with all major operating systems (MacOS, Linux, Windows) [28].

It includes most of the standard network analysis algorithms, as well as links to the implementation, and usage examples.

Installation

NetworkX can be easily installed using the pip package manager and does not require any manual compilation.

28 License

NetworkX is free open-source software distributed with a 3-clause BSD License. It can be redistributed and modified under the terms of the license.

4.4 Graph sampling implementation

The implementation of the sampling algorithms was developed using Visual Studio Code version 1.33.1 for Mac, with additional Python plugins installed. Originally the implementation was done using NetworkX 2.2; however, graph loading and betweenness centrality execution times increased significantly for the larger graphs. The shortest path calculation for the overlay network graphs was taking over an hour. Therefore, the final implementation was rewritten using the graph-tool library.

Graph-tool proved to have better performance, and in addition, it has APIs for accessing vertices and edges as NumPy arrays, which enables the implementation of computations as matrix operations rather than loops over collections of objects. NumPy is a Python package for scientific computing; it supports multi-dimensional arrays and linear algebra computations and is used heavily in data science [29].

4.5 Performance

The difference between graph-tool and NetworkX is not significant on small graphs with about a hundred edges, however, as the overlay network datasets are of the scale of a million edges, the gap in performance increases drastically.

The graph-tool graph filters were extensively utilized for tests. Since the tests included many runs over the same set of graphs, it would be wasteful to modify the original graphs, and then read them again for the next iteration. Instead, the filters were used to mark the removed edges, and each graph was read only once.

Performance-wise, there is a drastic difference between NetworkX and graph-tool due to the fact that the libraries have different implementations: NetworkX is a pure Python library, while graph-tool has a C++ based implementation.

The performance measures are presented in Table 4.5.1. The "BC gt" column shows the betweenness centrality calculation time for each graph using graph-tool; the results are measured in seconds. "BC nx" is the betweenness centrality calculation using NetworkX. "SP gt" and "SP nx" give the time it took to compute the shortest path distances for all node pairs in the graph using graph-tool and NetworkX respectively.

graph   BC gt     BC nx     SP gt       SP nx
2 5     0.002 s   0.005 s   0.004 s     0.02 s
3 10    0.002 s   0.02 s    0.0025 s    0.27 s
8 50    0.005 s   3.024 s   0.0352 s    206.93 s
9101    2.15 s    660 s     494.208 s   > 1 hour
9001    2.213 s   820 s     219.16 s    > 1 hour
9037    3.99 s    1122 s    363.5 s     > 1 hour

Table 4.5.1: Performance measurements for NetworkX and graph-tool

The performance differs drastically, especially for the shortest path calculation, which, when applied to the network overlay graphs, takes hours with NetworkX and only a couple of minutes with graph-tool.

4.6 Visualization

One of the requirements for this study is to offer a visualization solution running in the web browser. There are three common ways to render graphics in the web browser: SVG, Canvas, and WebGL.

30 SVG

SVG (Scalable Vector Graphics) is, as the name suggests, a vector-based image format represented in an XML-like format. SVG images preserve proportions and shapes when scaled and are resolution independent. The format handles large high-resolution graphics well; however, it performs poorly when rendering many elements.

Canvas

Canvas is a container HTML element with the capability to draw interactive graphics in it. The drawn elements can be paths, circles, boxes, text, or images. Basic mouse interactivity is handled through events on the canvas element. The performance of Canvas decreases on large-resolution visualization surfaces.

WebGL and Three.js

WebGL is a low-level engine which enables both three- and two-dimensional drawing. It is able to render many large objects while maintaining reasonable performance, combining the strengths of both Canvas and SVG. This is achieved through GPU-accelerated processing, which is not possible with regular Canvas and SVG elements. WebGL is used widely for 3D content creation; examples include game engines like Unity and Unreal Engine 4. [30]

Since WebGL exposes low-level APIs, to facilitate developers' efforts there exist several libraries that abstract basic scene creation and manipulation functionality: A-Frame for virtual reality programs, BabylonJS, PlayCanvas, three.js, OSG.JS, and others. Three.js is a popular JavaScript library built on top of WebGL, providing high-level support for drawing GPU-enabled graphics. It abstracts away low-level WebGL calls and wraps repetitive bits and WebGL implementation details, resulting in lower overhead and ease of development. [31]

All of the graphics libraries listed above are supported by the latest browsers and are available under the MIT license. One would pick a library based on the size of the data to be visualized and the rendering dimensions. Since the networks in this study are large, the Three.js based solution was chosen for the visualization implementation.

D3

D3 is a visualization library written in JavaScript. It can be used for visualizing data in the form of interactive charts, tables, trees, and many other data structures. D3 can be integrated with Canvas, SVG, and WebGL. Most of the documentation examples use SVG and Canvas to draw plots and animated data tables. [32]

It also has support for visualizing graph structures and reading graph data formats. In addition, it has an implementation of a force-directed layout in a sub-module called d3-force. The module adds forces to nodes, enabling a physical simulation. [33]

4.7 Implementation

The proposed end solution consists of two blocks: the graph reduction algorithm and the visualization block. The solution uses the betweenness centrality algorithm for graph sampling. The original graph is delivered in JSON format and converted to GraphML to be read into the graph-tool library. Then the algorithm is applied, and the graph is reduced to the specified threshold.
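The conversion step could look roughly as follows; this sketch assumes the JSON uses a node-link layout readable by NetworkX (the file names are placeholders, and the actual Hive Streaming schema is not shown in this thesis):

import json

import networkx as nx
from networkx.readwrite import json_graph

# Hypothetical input: a node-link JSON document with weighted links.
with open("overlay.json") as f:
    G = json_graph.node_link_graph(json.load(f))

# GraphML output is readable by graph-tool for the reduction step.
nx.write_graphml(G, "overlay.graphml")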

The visualization client side consumes two graphs: the original and the sampled one. The original graph is used to assign forces and allocate coordinates to the graph nodes; the sampled graph tells which edges to display. To avoid duplication of data and to optimize loading time, the reduced graph contains only the identifiers of the edges. The graph metadata is still accessible from the original graph.

Once the visualization is launched, Three.js creates a scene. Meanwhile, the original graph is loaded, and the coordinates are calculated using the d3-force library. Then, for the links and nodes present in the reduced graph, the graphic objects are created and displayed. All nodes and links are interactive, and event handlers can be assigned to them to display the metadata.

Chapter 5

Analysis and evaluation of results

In this chapter, the results of the comparison study on the selected graph sampling algorithms described in Chapter 2.4 are presented. In the first section, the algorithms are compared across six performance criteria. In the second section, the resulting reduced graph visualizations are presented for each algorithm.

5.1 Algorithms performance comparison

The algorithm performance comparisons based on various measurements are presented below. The test parameter for all hypotheses is the reduction threshold: the algorithms are run with different threshold values, a percentage of edges to keep. The motivation for choosing this test parameter is to see how the measurements change as the graph size decreases. For the WIS algorithm, since it is applied to nodes, the threshold specifies how many nodes shall be kept. Results of the test runs for the caveman 2x5 graph are attached in Appendix B.

5.1.1 Hypothesis 1

Hypothesis: One of the algorithms performs better when it comes to keeping the percentage of the total edge weight

Related Questions: How big is the difference between algorithms SRS2, FF, BC and WIS in terms of maintaining edges with the highest weight?

Test output (observed/dependent variables): A percentage of total edge weights retained after the application of the graph sampling algorithm.

Results (Figure 5.1.1): The collected results are presented in Figure 5.1.1. As expected, SRS2 retains the highest total weight, because the design of the algorithm is such that the edges with the highest weight are picked first, and no other selection parameter is applied. It is followed by FF. BC and WIS have similar results once the edge cut gets larger. Since SRS2 keeps the edges with the largest weight, it is a reference point showing the maximum possible retained weight at a given edge cut.


Figure 5.1.1: Hypothesis 1

5.1.2 Hypothesis 2

Hypothesis: Algorithms BC and FF perform better than WIS and SRS2 when it comes to keeping the graph connectivity.

Test output (observed/dependent variables): The structural connectivity of the resulting sampled graph expressed through the number of connected components.

Results (Figure 5.1.2): As expected, FF and BC produce connected sampled graphs. The FF algorithm starts by selecting a vertex and then adds adjacent vertices and edges one by one, forming a tree, and afterwards adds back the edges between vertices that form the longest shortest paths in the constructed tree. Therefore, at all stages of the algorithm, the sampled graph is connected.

The BC algorithm first removes edges with the lowest betweenness centrality, which results in a disconnected graph. Then the post-processing step is applied, guaranteeing the connectivity.

WIS and SRS2 do not consider connectivity in their design. After applying SRS2, the number of components in the filtered graphs grows exponentially as the edge cut increases. The number of components in a graph reduced with WIS is not as large as in a graph reduced with SRS2, and it changes linearly.
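The connectivity measurement itself is directly available in NetworkX; a minimal sketch:

# Number of connected components of a sampled graph (sketch). A value of 1
# means the sampled graph is connected; SRS2 at large edge cuts yields many.
import networkx as nx

def count_components(sampled: nx.Graph) -> int:
    return nx.number_connected_components(sampled)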


Figure 5.1.2: Hypothesis 2

5.1.3 Hypothesis 3

Hypothesis: SRS2, FF, and BC shall output graphs with the same degree distribution.

Test output (observed/dependent variables): the average node degree of a reduced graph.

Results (Figure 5.1.3): The average node degrees are the same for BC, FF, and SRS2. They decrease linearly, proportionally to the edge cut, indicating that the algorithms correctly reduce the number of edges in proportion to the specified edge threshold. A difference becomes visible as the edge cut gets larger; this is because BC and FF have a limit on the maximum number of removed edges, while SRS2 does not.

The results for the WIS algorithm show that the average degree at first goes up as the graph is reduced. This happens because the nodes with a low degree are removed first, so each remaining node has, on average, a higher degree compared to the original. Eventually, the average degree decreases along with the number of edges.
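For reference, the average node degree of an undirected graph is 2|E|/|V|, since each edge contributes to the degree of two nodes; this is how the test output here can be computed. A sketch:

# Average node degree of an undirected graph (sketch).
import networkx as nx

def average_degree(graph: nx.Graph) -> float:
    n = graph.number_of_nodes()
    # each edge contributes to two node degrees, hence the factor of 2
    return 2.0 * graph.number_of_edges() / n if n else 0.0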


Figure 5.1.3: Hypothesis 3

5.1.4 Hypothesis 4

Hypothesis: One of the algorithms shall perform better in terms of the computation time of the graph reduction.

Test output (observed/dependent variables): running time in seconds for different algorithms.

Output: FF is the slowest (due to the shortest path calculation). SRS2 is the fastest algorithm.

Results (Figure 5.1.4):

SRS2 and WIS take only a few seconds to execute, since these algorithms do not use any time-demanding measurements. The more complex algorithms, FF and BC, are slower in terms of running time because of the shortest path and betweenness centrality calculations they require. The running time comparison of these measures, implemented with Python libraries, is presented in Section 4.1.4.

The performance of FF does not depend on the edge cut, since the algorithm reconstructs the graph from scratch on every run and then removes the longest paths. A significant drop in running time is visible on all the larger graphs (9001, 9037, 9101) once the threshold gets small. The reason is that when the threshold is small enough, it is already reached while the tree is being created, so the algorithm never gets to the second step of adding the longest paths.

When the edge cut is small, BC is as fast as SRS2 and WIS. Once the edge cut gets larger, the execution time grows, since a large reduction implies a longer post-processing step.
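The running time measurement can be sketched as below; the timing setup of the actual experiments is described in Section 4.1.4 and may differ.

# Wall-clock timing of one reduction run (sketch); `sample` is any of the
# sampling functions, e.g. the srs2 sketch shown earlier.
import time

def timed_reduction(sample, graph, threshold):
    start = time.perf_counter()
    result = sample(graph, threshold)
    elapsed = time.perf_counter() - start  # seconds
    return result, elapsed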


Figure 5.1.4: Hypothesis 4

5.1.5 Hypothesis 5

Hypothesis: One of the algorithms shall perform better in terms of maintaining the global clustering structural characteristic.

Test output (observed/dependent variables): the global clustering coefficient, calculated using the formula described in Chapter 2.1 under the “Clustering coefficient” section.

Results (Figure 5.1.5):

WIS has the highest global clustering coefficient (just as it has the highest average degree, as shown in Hypothesis 3). As with the average degree, the global clustering coefficient goes up as the node threshold decreases. The outputs of SRS2 and FF depend on the density of the original graph, as the results for 9037, 9001, and 9101 differ.

The global clustering coefficient is based on the number of triangles, and the base structure in the FF algorithm is a tree. When the edge cut is small, the sampled graph still resembles the original graph. Once the edge cut increases, the sampled graph structure gets closer to the tree because of the shorter dense component step, and thus has a lower global clustering coefficient.
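In NetworkX, the global clustering coefficient is available as transitivity, defined as 3 x (number of triangles) / (number of connected triples); this is assumed to match the formula referenced from Chapter 2.1. A sketch:

# Global clustering coefficient (transitivity) of a sampled graph (sketch).
import networkx as nx

def global_clustering(graph: nx.Graph) -> float:
    return nx.transitivity(graph)

# A tree contains no triangles, so its coefficient is 0, consistent with the
# FF samples converging to a tree and losing clustering.
assert global_clustering(nx.balanced_tree(2, 3)) == 0.0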


Figure 5.1.5: Hypothesis 5

5.1.6 Hypothesis 6

Hypothesis: One of the algorithms shall perform better in terms of keeping the assortativity coefficient.

Test data: Only the real overlay network graphs were compared for this hypothesis. The assortativity of all reduced graphs is positive, indicating that they are assortative networks in which edges tend to connect vertices of the same type.

Test parameter: The type parameter is the site, or geographical location, where the network computer is physically situated. The overlay network is designed to have direct connections between computers in the same location, as well as connections to other sites.

Test output (observed/dependent variables): assortativity coefficient measuring the similarity of connections in the graph with respect to the given attribute.

Results (Figure 5.1.6):

The overlay network graphs are densely connected within the sites; therefore, the first priority for reduction purposes would be to decrease the number of edges within the clustered components and to keep the edges connecting different sites, in order to show the data flow.

Since the vertex type is not considered in any of the implemented sampling algorithms, the results vary. SRS2 keeps the assortativity at the same level but sacrifices connectivity, as was shown in Hypothesis 2. The largest assortativity coefficient occurs at the threshold of 0.01%, indicating that the edges are most likely to connect vertices of the same type once the edge cut is large.

The FF algorithm shows the steepest curve as the cut gets larger. This is because the base structure of the graph in the algorithm is a tree; therefore, all clusters, except for the one containing the filtering focus node, end up with leaf nodes that have no direct connecting links.

The WIS algorithm shows a high assortativity value for 9101 and 9001, and a very low one for 9037. Graph 9037 has the largest number of edges; as the number of selected nodes gets smaller, the number of edges between nodes belonging to different sites still stays large.
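The assortativity coefficient with respect to a categorical node attribute is directly available in NetworkX. In the sketch below, "site" is an assumed attribute name holding each node's location.

# Attribute assortativity with respect to the node's site (sketch). Values
# lie in [-1, 1]; positive values indicate assortative mixing, i.e. edges
# tend to connect vertices of the same type.
import networkx as nx

def site_assortativity(graph: nx.Graph) -> float:
    return nx.attribute_assortativity_coefficient(graph, "site")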

Figure 5.1.6: Hypothesis 6

5.1.7 Conclusions

The following patterns are observed in the test results. SRS2 is the fastest algorithm, but it produces a reduced graph that has little in common with the original one structure-wise. WIS is also a very fast algorithm, and it keeps some of the sampled graphs connected even with half of the edges removed from the original overlay network. This is a good result, considering that the connectivity property is not specified in the algorithm description; the reason is that the test graphs are very densely connected. The sampled graph stays connected when the graph is reduced by 30% or more.

The focus-filtering algorithm is based on creating a tree and then adding back the edges. It works well when the sampled number of edges is at least twice the number of vertices, which guarantees the connectivity of the smaller components. However, once the number of edges in the graph decreases, the shape of the graph gets closer to a tree. This can be clearly seen when visualizing the graph with a force-directed layout, as will be shown in the following section. The algorithm starts with the selected source node and grows the tree from there. The result is unbalanced: the nodes and links closer to the source are present in the sampled graph, while the ones forming smaller clusters in the original graph are not selected and appear as leaf nodes.

The betweenness centrality based algorithm (BC) performs better than FF in terms of speed; however, its performance decreases linearly as the edge cut gets higher. It always keeps the graph connected by design, and, generally, the sampled graphs are well balanced.

Overall, it can be concluded that the betweenness centrality based algorithm is the best choice when the graph is expected to stay connected, the important links must stay in place, and a non-immediate execution time is acceptable; the BC algorithm is therefore the best choice for this study.

Source node

Each graph that was created from the video stream has a source node: the node at which the video originated. When applying the FF algorithm, the source node is always selected as the “filtering focus” node. The reason is that the source node has the highest number of edges in all the test graphs. Since the “filtering focus” is chosen as the node with the highest summed edge weights, the source node gets selected as the root of the sampled tree. The outcome of the FF algorithm will therefore always contain edges connected to the source node.

The same applies to the WIS algorithm: the weight of the source node is high, and therefore it is present in all reduced graphs, guaranteeing that the node where the video stream originates will not be cropped out of the graph.

The source node also has a high betweenness centrality; after sampling the graph with the BC algorithm, the edges connecting to the source node have a high probability of staying.

As for the SRS2 algorithm, the edge cut depends purely on the weight; therefore, if the links adjacent to the source node have a high weight, they will be selected for the sampled graph.

5.2 Visualization results

In addition to the quantitative measures, this section presents a comparison of the visual representations of the filtered graphs with those of the original ones. Such visualizations facilitate the understanding of the different datasets and their structural similarities.

The end goal of this study is to effectively visualize the networks by creating the closest possible approximation of the original graph. The end result should have visually similar properties to the original network and show its structure while maintaining optimal performance.

5.2.1 Test graphs visualization

A generated caveman graph is presented in Figure 5.2.1, visualized using the solution explained in Chapter 4. The caveman graph consists of two interconnected clusters of size five, connected randomly, which results in 20 edges and 10 nodes.

Figure 5.2.1: Caveman 2x5

Examining how such a sample graph degrades after applying the sampling algorithms facilitates a better understanding of the algorithms' results.
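For reproducibility, such a test graph can be generated with NetworkX's caveman family of generators. The sketch below uses connected_caveman_graph, which rewires one edge per clique to link the clusters; the generator and the random inter-cluster wiring used in the thesis may differ, but the node and edge counts match the description above.

# Generate a caveman-style test graph: two cliques of five nodes, linked by
# rewiring one edge per clique (sketch).
import networkx as nx

caveman = nx.connected_caveman_graph(2, 5)
assert caveman.number_of_nodes() == 10
assert caveman.number_of_edges() == 20
assert nx.number_connected_components(caveman) == 1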

Figure 5.2.2: Degrading caveman 2x5 graph after applying the sampling algorithms; panels: (a) BC, (b) SRS2, (c) WIS, (d) FF. 80% of edges left, original graph reduced by 20%.

First, we examine how the graph structure is affected when it is reduced by 20%. The resulting graphs can be seen in Figure 5.2.2. The number of removed edges is quite low, and all sampling algorithms produce connected graphs. The betweenness centrality (BC) algorithm keeps the two components, while the components themselves are less densely connected. SRS2 also keeps the graph structure quite similar; however, two visibly distinct components can no longer be determined. WIS produces the graph with the structure most similar to the original one, by removing nodes. The FF algorithm keeps one component as it was originally, while the second one is quite disconnected.

By further reducing the graph, to half of its edges, we see the extremes of the end shape that each algorithm tends to fall into. It is not practical to reduce graphs that drastically for visualization; however, for deciding which algorithm to pick, it is crucial to determine what kind of structure they converge to. The resulting sampled graphs are presented in Figure 5.2.3.

At this point, SRS2 outputs a completely disconnected graph, with two disconnected components and one node without links. All other algorithms produce connected graphs. BC outputs a graph that is very close to a ring. Generally, none of the sampled graphs resemble the original structure; however, FF is the closest one, while SRS2 and WIS are the furthest from the original.

Figure 5.2.3: Degrading caveman 2x5 graph after applying the sampling algorithms; panels: (a) BC, (b) SRS2, (c) WIS, (d) FF. 50% of edges left, original graph reduced by 50%.

5.2.2 Overlay network visualization

In this section, the algorithms are applied to one of the original graphs of the network overlay structure provided by the company. Graph 9031 has 1821 vertices and 52343 edges, and three large distinct clusters, one of which has many sub-clusters and also contains the source node; a full view is shown in Figure 5.2.4 and a zoomed view in Figure 5.2.5.

Figure 5.2.4: Full view of graph 9031

Figure 5.2.5: Zoomed view of graph 9031

The graph was sampled with a filtering threshold of 8%. The threshold was chosen empirically so that the output graph would still be connected while being small enough to guarantee good visualization performance. The resulting graphs are presented in Figure 5.2.6.

The BC algorithm keeps the components connected; however, it is hard to distinguish the original components with a force-directed layout. This is because the algorithm keeps the edges with the highest betweenness centrality and removes the ones with the lowest, so the smaller dense components become less connected. This is good in terms of structural properties, but the resulting force-directed visualization has one clear dense center and is not evenly distributed.

The SRS2-sampled graph looks very close to the original. It has a couple of disconnected components, but the structure is well represented. Considering that the algorithm is the least computation-intensive of the four, the results are very satisfactory.

Figure 5.2.6: Degrading 9031 graph after applying the sampling algorithms; panels: (a) graph reduced with BC, (b) graph reduced with SRS2, (c) graph reduced with WIS, (d) graph reduced with FF. 8% of edges left in the sampled graphs; the graph is drawn using a force-directed layout.

At the same time, the algorithm disconnects the connected component; as a result, many independent sub-graphs can be seen in the view.

FF also appears close to the original; however, as can be seen from the caveman samples, when the filtering threshold decreases, the edges of the sampled graph become concentrated in the area of the largest component, and the other two components become disconnected.

The WIS sample size is different due to the design differences; to get closer to the other samples, 25% of the nodes were sampled, resulting in a graph with 455 vertices and 5271 edges, approximately a 10% edge cut. The sampled graph has two components, since the weak nodes are removed.

2-step rendering

As can be seen from the previous section, once the graphs are reduced through sampling, the force distribution changes. When the reduction percentage is small, it is possible to maintain a layout similar to that of the original graph. However, the original graphs have hundreds of thousands of edges, while the rendering limit is around tens of thousands of edges, so a large portion of a graph needs to be removed.

To keep the structure the same, the 2-step rendering technique was developed. The visualization code takes two graphs: the original and the sampled one. The force-directed layout algorithm is first applied to the original graph, and the assigned node coordinates are kept in memory. Then, as a second step, only the links present in the sampled graph are visualized, using the coordinates allocated to the original graph.

This approach keeps the structure of the original graph while rendering only the edges chosen by the sampling algorithm. The same graphs from Figure 5.2.6 are redrawn using the 2-step approach, resulting in the visualizations in Figure 5.2.7.
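The same idea can be sketched in Python, with NetworkX and Matplotlib standing in for d3-force and Three.js: the force-directed coordinates are computed on the original graph, and only the sampled edges are drawn at those coordinates.

# 2-step rendering sketch: layout from the original graph, edges from the
# sampled graph (NetworkX/Matplotlib stand-ins for d3-force/Three.js).
import matplotlib.pyplot as plt
import networkx as nx

def two_step_draw(original: nx.Graph, sampled: nx.Graph) -> None:
    # Step 1: force-directed coordinates computed on the full graph
    pos = nx.spring_layout(original, seed=42)
    # Step 2: draw only the sampled nodes and edges, reusing those coordinates
    nx.draw_networkx_nodes(sampled, pos, node_size=10)
    nx.draw_networkx_edges(sampled, pos)
    plt.axis("off")
    plt.show()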

SRS2 (Figure 5.2.7b) gives the best visibility of all the smaller components of the larger cluster, which were not visible in the original graph shown in Figure 5.2.5. The three clusters appear almost disconnected.

Figure 5.2.7: Degrading 9031 graph after applying the sampling algorithms; panels: (a) graph reduced with BC, (b) graph reduced with SRS2, (c) graph reduced with WIS (10% edge cut), (d) graph reduced with FF. 8% of edges left in the sampled graphs; the graph is drawn using the 2-step force-directed layout.

As for WIS, the smallest component has shrunk drastically, and it is hard to distinguish it in the view. The BC and FF algorithms produce the resulting graphs closest to the original, with a clearer structure. The difference between them lies in the density of the components: with BC, the clusters are more densely connected to one another, while with FF, the components are more connected internally. Moreover, it can be seen that with FF the smaller independent components are not as clearly represented.

Large overlay network visualization

The implemented visualization, based on a force-directed layout and Three.js, has a limit on the maximum number of edges that can be visualized while keeping the page responsive at a reasonable frame rate. The visualized reduced graph 9001 is presented in Figure 5.2.8. The visualizations of the 9037 and 9101 network overlay graphs are attached in Appendix A. They are reduced with an edge threshold of 5% using the betweenness centrality algorithm.

Figure 5.2.8: Graph 9001, reduced with an edge threshold of 5%

Chapter 6

Conclusions

Throughout the thesis work, it became clear that graph reduction is necessary for visualization: none of the currently existing web-based graphics engines is able to load the original graphs as-is while keeping the nodes interactive. Therefore, the investigation of sampling algorithms was a valid approach.

The main outcome of this thesis is that it evaluated reduction algorithms and proposed a functional solution enabling the visual representation of the overlay network, one that allows the user to identify network hubs, evaluate the general structure, and see differences between the networks.

Discussion

The comparison study was done by evaluating a set of objective graph measurements. The chosen betweenness centrality based algorithm fits the particular visualization use case, applied to a dense overlay network. The comparison study findings presented in Chapter 5 can be used as a general guide for selecting a filtering approach; however, the choice of algorithm may differ depending on the requirements.

Future Work

Future work should include the investigation of a hybrid sampling solution that reduces the graph in steps, which could lead to optimally reduced graphs and improved performance: for instance, using WIS first to reduce the number of nodes, then applying the betweenness centrality based algorithm to remove extra edges. On the visualization side of the solution, future work could focus on usability and user requirements.

Finally, future research shall explore the ability to handle streaming graph datasets and large-scale dynamic graphs by leveraging reduction techniques. Such a solution would allow the dynamic representation of the evolution of the overlay network over time.


Appendices

Appendix - Contents

A Network visualization
B Test results for caveman 2x5

A Network visualization

Figure A.1: Graph 9101, reduced with an edge threshold of 5%

Figure A.2: Graph 9037, reduced with an edge threshold of 5%

B Test results for caveman 2x5

TRITA-EECS-EX-2019:269
