The Core Decomposition of Networks: Theory, Algorithms and Applications Fragkiskos Malliaros, Christos Giatsidis, Apostolos Papadopoulos, Michalis Vazirgiannis
Total Page:16
File Type:pdf, Size:1020Kb
The Core Decomposition of Networks: Theory, Algorithms and Applications Fragkiskos Malliaros, Christos Giatsidis, Apostolos Papadopoulos, Michalis Vazirgiannis To cite this version: Fragkiskos Malliaros, Christos Giatsidis, Apostolos Papadopoulos, Michalis Vazirgiannis. The Core Decomposition of Networks: Theory, Algorithms and Applications. The VLDB Journal, Springer, 2019. hal-01986309v3 HAL Id: hal-01986309 https://hal-centralesupelec.archives-ouvertes.fr/hal-01986309v3 Submitted on 3 Dec 2019 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. The Core Decomposition of Networks: Theory, Algorithms and Applications (The VLDB Journal, 2019) Fragkiskos D. Malliaros∗, Christos Giatsidisy, Apostolos N. Papadopoulosz, and Michalis Vazirgiannisx Abstract The core decomposition of networks has attracted significant attention due to its numerous appli- cations in real-life problems. Simply stated, the core decomposition of a network (graph) assigns to each graph node v, an integer number c(v) (the core number), capturing how well v is connected with respect to its neighbors. This concept is strongly related to the concept of graph degeneracy, which has a long history in Graph Theory. Although the core decomposition concept is extremely simple, there is an enormous interest in the topic from diverse application domains, mainly because it can be used to analyze a network in a simple and concise manner by quantifying the significance of graph nodes. Therefore, there exists a respectable number of research works that either propose efficient algorithmic techniques under different settings and graph types or apply the concept to another problem or scientific area. Based on this large interest in the topic, in this survey, we perform an in-depth discussion of core decomposition, focusing mainly on: i) the basic theory and fundamental concepts, ii) the algorithmic techniques proposed for computing it efficiently under different settings, and iii) the applications that can benefit significantly from it. Keywords: Core decomposition, graph mining, graph degeneracy, graph theory, algorithms ∗Center for Visual Computing, CentraleSup´elec,University of Paris-Saclay and Inria Saclay, France. Email: fragkiskos. [email protected] yComputer Science Laboratory, Ecole´ Polytechnique, France. Email: [email protected] zSchool of Informatics, Aristotle University of Thessaloniki, Greece. Email: [email protected] xComputer Science Laboratory, Ecole´ Polytechnique, France. Email: [email protected] 1 Contents 1 Introduction 4 2 Fundamental Concepts8 2.1 Simple Graphs . .8 2.2 Extensions to Other Types of Graphs . 10 2.2.1 Directed Graphs . 10 2.2.2 Weighted Graphs . 10 2.2.3 Signed Graphs . 11 2.2.4 Bipartite Graphs . 11 2.2.5 Dynamic Graphs . 11 2.2.6 Temporal Graphs . 12 2.2.7 Probabilistic Graphs . 12 2.2.8 Multilayer Graphs . 12 2.2.9 Hidden Graphs . 13 2.3 Generalized Core Decomposition . 13 2.4 Extensions of the Core Decomposition . 14 2.4.1 Truss Decomposition . 14 2.4.2 Density-friendly Core Decomposition . 15 2.4.3 Peak Decomposition . 15 2.4.4 Nucleus Decomposition . 16 2.4.5 Distance-based Core Decomposition . 17 2.4.6 k-tip Decomposition . 18 2.4.7 Radius-Bounded Decomposition . 19 3 Algorithmic Techniques 20 3.1 In-Memory Computation . 20 3.2 Disk-Resident Graphs . 21 3.3 Core Decomposition in Dynamic Environments . 23 3.4 Local Computation of Core Numbers . 26 3.5 Parallel and Distributed Techniques . 27 3.5.1 Parallel Computation . 28 3.5.2 Distributed Computation . 29 3.6 Probabilistic Core Decomposition . 31 4 Representative Application Domains 33 4.1 Network Modeling and Analysis . 33 4.2 Temporal Evolution . 33 4.3 Anomaly Detection . 34 4.4 Detection of Influential Spreaders . 34 4.5 Network Visualization . 35 4.6 Communities and Dense Subgraphs . 37 4.7 Text Analytics . 38 4.8 The Anchored k-Core Problem and Engagement Dynamics in Social Graphs . 39 4.9 Graph Similarity . 39 2 4.10 Physics, Biology and Ecology . 40 4.11 Neuroscience . 41 5 Conclusions and Further Research 41 3 1 Introduction Graph management and graph mining are two important research areas with a plethora of significant practical applications [3, 36]. The main reason for this is the fact that graphs are ubiquitous and, therefore, their efficient management and mining is necessary to guarantee fast and meaningful knowledge discovery. Research on graph processing and mining was boosted by the need to explore and analyze massive graphs in Big Data Analytics tasks, aiming at computational efficiency and high scalability. A network or graph (we will use the terms interchangeably) is denoted by G(V; E), where V is the set of nodes or vertices and E is the set of edges or links. We will follow the trend in the literature and use the symbol n for the number of nodes (n = jV j) and the symbol m for the number of edges (m = jEj). The number of neighbors of a node u 2 V plays a central role in general, and it will be denoted by deg(u). Figure1(a) presents a simple graph G(V; E) with n = 8 nodes and m = 12 edges. Based on the degree definition, deg(v1) = 2, deg(v4) = 2 whereas node v3 has the highest degree deg(v3) = 5. Therefore, node v8 has the smallest degree and node v3 the highest. (a) 1-core (b) 2-core (c) 3-core Figure 1: A small graph G with n = 8 nodes and m = 12 edges and the corresponding 1-core, 2-core and 3-core of G. In many modern applications graphs are first class citizens. For example: • The Web [80] can be modeled as a graph, where nodes represent web pages and edges represent hyperlinks among them. • In a social network [2], nodes may represent users and edges may provide information regarding friendship relationships. • In protein analysis, a protein-protein interaction network (PPI for short) [122] may be represented as a graph, where nodes represent different proteins and edges capture the interaction between proteins during a specific experiment. • Graphs may also represent interactions among different types of nodes. For example, purchase information may be captured by using a bipartite graph, where a set of nodes (e.g., customers) interact with another set of nodes (e.g., products). In this case, a link between a customer c and a product p captures the fact that p was purchased by c. These are only a few examples of applications that model elements and the interactions among them by using a graph structure. In these cases, it is important to rely on the structural properties of the interactions among the elements in order to discover useful and meaningful patterns. Exploring and analyzing massive complex networks involves the execution of (usually) computationally intensive tasks, aiming at uncovering the network structure and detecting the presence of useful patterns 4 that could be proven significant. Some important graph mining tasks involve: reachability queries, graph partitioning, graph clustering, classification of graph nodes, predicting network evolution, discovering dense subgraphs, detecting influential spreaders. In many cases we are searching for graph nodes that are considered \central" with respect to a specific problem at hand. Therefore, the concept of node importance is crucial in network analysis, since it is expected that among the nodes of a massive network, only a small fraction is of high significance. Evidently, one should first determine a method to quantify this significance (importance), since this concept is highly related to the application under consideration. For example, if we assume that we are interested in nodes with a large degree (simply because the number of links is important), then evidently, importance is directly measured by counting the number of neighbors of every node (a concept known as degree centrality). As another example, we may define importance in relation to the number of triangles each node participates in; the higher the number of triangles, the higher the node importance. Another popular measure of importance is the total number of shortest paths passing through a specific node (also known as betweenness centrality [50]). Also, one can quantify node importance using the concept of random walks and applying techniques similar to PageRank [27]. In such a case, the importance of a node is represented by the probability that this node will be visited by a random walker. Based on the previous discussion, we realize that there is a plethora of different ways to define impor- tance. However, it turns out that the concept of core decomposition can be used efficiently and effectively to quantify node importance in many different domains, thus, avoiding the use of more complex and com- putationally intensive algorithmic techniques. To be precise, the core decomposition of a simple graph G can be computed in linear time with respect to the number of edges of G, if the computation is done in main memory. Simply put, the k-core of a graph G is the maximal induced subgraph Gk, where the number of neighbors of every node u in Gk is at least k. The core number of a node u (c(u)) is defined as the maximum value of k such that u is contained in Gk. Figure1 illustrates the 1-core, the 2-core and the 3-core of a toy graph. Based on the definition of the core number, it holds that c(v8) = 1, c(v1) = 2 and c(v6) = 3.