The Geometric Block Model

Total Page:16

File Type:pdf, Size:1020Kb

The Geometric Block Model The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18) The Geometric Block Model Sainyam Galhotra, Arya Mazumdar, Soumyabrata Pal, Barna Saha College of Information and Computer Sciences University of Massachusetts Amherst Amherst, MA 01003 {sainyam,arya,spal,barna}@cs.umass.edu Abstract are only two communities of exactly equal sizes, and the inter- b log n cluster edge probability is n and intra-cluster edge proba- To capture the inherent geometric features of many community a log n detection problems, we propose to use a new random graph bility is n√ , it is√ known√ that perfect recovery is possible if model of communities that we call a Geometric Block Model. and only if a − b> 2 (Abbe, Bandeira, and Hall 2016; The geometric block model generalizes the random geometric Mossel, Neeman, and Sly 2015). The regime of the prob- graphs in the same way that the well-studied stochastic block log n model generalizes the Erdos-Renyi¨ random graphs. It is also a abilities being Θ n has been put forward as one of natural extension of random community models inspired by most interesting ones, because in an Erdos-Renyi¨ random the recent theoretical and practical advancement in commu- graph, this is the threshold for graph connectivity (Bol- nity detection. While being a topic of fundamental theoretical lobas´ 1998). This result has been subsequently general- interest, our main contribution is to show that many practical ized for k communities (Abbe and Sandon 2015a; 2015b; community structures are better explained by the geometric Hajek, Wu, and Xu 2016) (for constant k or when k = block model. We also show that a simple triangle-counting o n algorithm to detect communities in the geometric block model (log )), and under the assumption that the communities is near-optimal. Indeed, even in the regime where the average are generated according to a probabilistic generative model degree of the graph grows only logarithmically with the num- (there is a prior probability pi of an element being in the ith ber of vertices (sparse-graph), we show that this algorithm community) (Abbe and Sandon 2015a). Note that, the results performs extremely well, both theoretically and practically. are not only of theoretical interest, many real-world networks In contrast, the triangle-counting algorithm is far from be- exhibit a “sparsely connected” community feature (Leskovec ing optimum for the stochastic block model. We simulate our et al. 2008), and any efficient recovery algorithm for SBM results on both real and synthetic datasets to show superior has many potential applications. performance of both the new model as well as our algorithm. One aspect that the SBM does not account for is a “tran- sitivity rule” (‘friends having common friends’) inherent to 1 Introduction many social and other community structures. To be precise, consider any three vertices x, y and z.Ifx and y are con- The planted-partition model or the stochastic block model nected by an edge (or they are in the same community), and (SBM) is a random graph model for community detection that y and z are connected by an edge (or they are in the same generalizes the well-known Erdos-Renyi¨ graphs (Holland, community), then it is more likely than not that x and z are Laskey, and Leinhardt 1983; Dyer and Frieze 1989; Decelle connected by an edge. This phenomenon can be seen in many et al. 2011; Abbe and Sandon 2015a; Abbe, Bandeira, and network structures - predominantly in social networks, blog- Hall 2016; Hajek, Wu, and Xu 2015; Chin, Rao, and Vu 2015; networks and advertising. SBM, primarily a generalization Mossel, Neeman, and Sly 2015). Consider a graph G(V,E), of Erdos-Renyi¨ random graph, does not take into account where V = V V ···Vk is a disjoint union of k clusters 1 2 this characteristic, and in particular, probability of an edge denoted by V ,...,Vk. The edges of the graph are drawn 1 between x and z there is independent of the fact that there randomly: there is an edge between u ∈ Vi and v ∈ Vj with exist edges between x and y and y and z. However, one needs probability qi,j, 1 ≤ i, j ≤ k. Given the adjacency matrix of to be careful such that by allowing such “transitivity”, the such a graph, the task is to find exactly (or approximately) simplicity and elegance of the SBM is not lost. the partition V V ···Vk of V . 1 2 Inspired by the above question, we propose a random graph This model has been incredibly popular both in theoreti- community detection model analogous to the stochastic block cal and practical domains of community detection, and the model, that we call the geometric block model (GBM). The aforementioned references are just a small sample. Recent GBM depends on the basic definition of the random geo- theoretical works focus on characterizing sharp threshold of metric graph that has found a lot of practical use in wireless recovering the partition in the SBM. For example, when there networking because of its inclusion of the notion of proximity Copyright c 2018, Association for the Advancement of Artificial between nodes (Penrose 2003). Intelligence (www.aaai.org). All rights reserved. Definition. A random geometric graph (RGG) on n ver- 2215 tices has parameters n, an integer t>1 and a real number Given any simple random graph model, it is possible to t β ∈ [−1, 1]. It is defined by assigning a vector Zi ∈ R to generalize it to a random block model of communities much vertex i, 1 ≤ i, n, where Zi, 1 ≤ i ≤ n are independent in line with the SBM. We however stress that the geometric and identical random vectors uniformly distributed in the Eu- block model is perhaps the simplest possible model of real- St−1 ≡{x ∈ Rt x } clidean sphere : 2 =1 . There will be world communities that also captures the transitive/geometric an edge between vertices i and j if and only if Zi,Zj ≥β. features of communities. Moreover, the GBM explains be- Note that, the definition can be further generalized by haviors of many real world networks as we will exemplify t−1 considering Zis to have a sample space other than S , and subsequently. by using a different notion of distance than inner product (i.e., the Euclidean distance). We simply stated one of the many 2 The Geometric Block Model and its equivalent definitions (Bubeck et al. 2016). Validation Random geometric graphs are often proposed as an alter- V ≡ V V ···V native to Erdos-Renyi¨ random graphs. They are quite well Let 1 2 k be the set of vertices that is a k V ,...,V studied theoretically (though not nearly as much as the Erdos-¨ disjoint union of clusters, denoted by 1 k.Given t ≥ u ∈ V Renyi graphs), and very precise results exist regarding their an integer 2, for each vertex , define a random Z ∈ Rt St−1 ⊂ Rt, connectivity, clique numbers and other structural properties vector u that is uniformly distributed in t − (Gupta and Kumar 1998; Penrose 1991; Devroye et al. 2011; the 1-dimensional sphere. Avin and Ercal 2007; Goel, Rai, and Krishnamachari 2005). Definition (Geometric Block Model For a survey of early results on geometric graphs and the (V,t,βi,j, 1 ≤ i<j≤ k)). Given V,t and a set of analogy to results in Erdos-Renyi¨ graphs, we refer the reader real numbers βi,j ∈ [−1, 1], 1 ≤ i ≤ j ≤ k, the geometric to (Penrose 2003). A very interesting question of distinguish- block model is a random graph with vertices V and an ing an Erdos-Renyi¨ graph from a geometric random graph edge exists between v ∈ Vi and u ∈ Vj if and only if has also recently been studied (Bubeck et al. 2016). This will Zu,Zv ≥βi,j. provide a way to test between the models which better fits a The case of t =2: In this paper we particularly analyze scenario, a potentially great practical use. our algorithm for t =2. In this special case, the above def- As mentioned earlier, the “transitivity” feature led to ran- inition is equivalent to choosing random variable θu uni- dom geometric graphs being used extensively to model formly distributed in [0, 2π], for all u ∈ V . Then there will wireless networks (for example, see (Haenggi et al. 2009; be an edge between two vertices u ∈ Vi,v ∈ Vj if and only Bettstetter 2002)). Surprisingly, however, to the best of our if cos θu cos θv +sinθu sin θv = cos(θu − θv) ≥ βi,j or knowledge, random geometric graphs are never used to model min{|θu − θv|, 2π −|θu − θv|} ≤ arccos βi,j. This in turn, community detection problems. In this paper we take the first is equivalent to choosing a random variable Xu uniformly step towards this direction. Our main contributions can be distributed in [0, 1] for all u ∈ V , and there exists an edge classified as follows. between two vertices u ∈ Vi,v ∈ Vj if and only if • We define a random generative model to study canonical problems of community detection, called the geometric min{|Xu − Xv|, 1 −|Xu − Xv|} ≤ ri,j, block model (GBM). This model takes into account a mea- r ∈ , 1 , ≤ i, j ≤ k sure of proximity between nodes and this proximity measure where i,j [0 2 ] 0 , are a set of real numbers. characterizes the likelihood of two nodes being connected For the rest of this paper, we concentrate on the case when r r i ∈{,...,k} when they are in same or different communities. The geo- i,i = s for all 1 , which we call the “intra- r r i, j ∈{ ,...,k},i metric block model inherits the connectivity properties of cluster distance” and i,j = d for all 1 = j the random geometric graphs, in particular the likelihood of , which we call the “inter-cluster distance,” mainly for the “transitivity” in triplet of nodes (or more).
Recommended publications
  • Basic Models and Questions in Statistical Network Analysis
    Basic models and questions in statistical network analysis Mikl´osZ. R´acz ∗ with S´ebastienBubeck y September 12, 2016 Abstract Extracting information from large graphs has become an important statistical problem since network data is now common in various fields. In this minicourse we will investigate the most natural statistical questions for three canonical probabilistic models of networks: (i) community detection in the stochastic block model, (ii) finding the embedding of a random geometric graph, and (iii) finding the original vertex in a preferential attachment tree. Along the way we will cover many interesting topics in probability theory such as P´olya urns, large deviation theory, concentration of measure in high dimension, entropic central limit theorems, and more. Outline: • Lecture 1: A primer on exact recovery in the general stochastic block model. • Lecture 2: Estimating the dimension of a random geometric graph on a high-dimensional sphere. • Lecture 3: Introduction to entropic central limit theorems and a proof of the fundamental limits of dimension estimation in random geometric graphs. • Lectures 4 & 5: Confidence sets for the root in uniform and preferential attachment trees. Acknowledgements These notes were prepared for a minicourse presented at University of Washington during June 6{10, 2016, and at the XX Brazilian School of Probability held at the S~aoCarlos campus of Universidade de S~aoPaulo during July 4{9, 2016. We thank the organizers of the Brazilian School of Probability, Paulo Faria da Veiga, Roberto Imbuzeiro Oliveira, Leandro Pimentel, and Luiz Renato Fontes, for inviting us to present a minicourse on this topic. We also thank Sham Kakade, Anna Karlin, and Marina Meila for help with organizing at University of Washington.
    [Show full text]
  • Arxiv:1705.10225V8 [Stat.ML] 6 Feb 2020
    Bayesian stochastic blockmodelinga Tiago P. Peixotoy Department of Mathematical Sciences and Centre for Networks and Collective Behaviour, University of Bath, United Kingdom and ISI Foundation, Turin, Italy This chapter provides a self-contained introduction to the use of Bayesian inference to ex- tract large-scale modular structures from network data, based on the stochastic blockmodel (SBM), as well as its degree-corrected and overlapping generalizations. We focus on non- parametric formulations that allow their inference in a manner that prevents overfitting, and enables model selection. We discuss aspects of the choice of priors, in particular how to avoid underfitting via increased Bayesian hierarchies, and we contrast the task of sampling network partitions from the posterior distribution with finding the single point estimate that maximizes it, while describing efficient algorithms to perform either one. We also show how inferring the SBM can be used to predict missing and spurious links, and shed light on the fundamental limitations of the detectability of modular structures in networks. arXiv:1705.10225v8 [stat.ML] 6 Feb 2020 aTo appear in “Advances in Network Clustering and Blockmodeling,” edited by P. Doreian, V. Batagelj, A. Ferligoj, (Wiley, New York, 2019 [forthcoming]). y [email protected] 2 CONTENTS I. Introduction 3 II. Structure versus randomness in networks 3 III. The stochastic blockmodel (SBM) 5 IV. Bayesian inference: the posterior probability of partitions 7 V. Microcanonical models and the minimum description length principle (MDL) 11 VI. The “resolution limit” underfitting problem, and the nested SBM 13 VII. Model variations 17 A. Model selection 18 B. Degree correction 18 C.
    [Show full text]
  • Networkx Reference Release 1.9.1
    NetworkX Reference Release 1.9.1 Aric Hagberg, Dan Schult, Pieter Swart September 20, 2014 CONTENTS 1 Overview 1 1.1 Who uses NetworkX?..........................................1 1.2 Goals...................................................1 1.3 The Python programming language...................................1 1.4 Free software...............................................2 1.5 History..................................................2 2 Introduction 3 2.1 NetworkX Basics.............................................3 2.2 Nodes and Edges.............................................4 3 Graph types 9 3.1 Which graph class should I use?.....................................9 3.2 Basic graph types.............................................9 4 Algorithms 127 4.1 Approximation.............................................. 127 4.2 Assortativity............................................... 132 4.3 Bipartite................................................. 141 4.4 Blockmodeling.............................................. 161 4.5 Boundary................................................. 162 4.6 Centrality................................................. 163 4.7 Chordal.................................................. 184 4.8 Clique.................................................. 187 4.9 Clustering................................................ 190 4.10 Communities............................................... 193 4.11 Components............................................... 194 4.12 Connectivity..............................................
    [Show full text]
  • Latent Distance Estimation for Random Geometric Graphs ∗
    Latent Distance Estimation for Random Geometric Graphs ∗ Ernesto Araya Valdivia Laboratoire de Math´ematiquesd'Orsay (LMO) Universit´eParis-Sud 91405 Orsay Cedex France Yohann De Castro Ecole des Ponts ParisTech-CERMICS 6 et 8 avenue Blaise Pascal, Cit´eDescartes Champs sur Marne, 77455 Marne la Vall´ee,Cedex 2 France Abstract: Random geometric graphs are a popular choice for a latent points generative model for networks. Their definition is based on a sample of n points X1;X2; ··· ;Xn on d−1 the Euclidean sphere S which represents the latent positions of nodes of the network. The connection probabilities between the nodes are determined by an unknown function (referred to as the \link" function) evaluated at the distance between the latent points. We introduce a spectral estimator of the pairwise distance between latent points and we prove that its rate of convergence is the same as the nonparametric estimation of a d−1 function on S , up to a logarithmic factor. In addition, we provide an efficient spectral algorithm to compute this estimator without any knowledge on the nonparametric link function. As a byproduct, our method can also consistently estimate the dimension d of the latent space. MSC 2010 subject classifications: Primary 68Q32; secondary 60F99, 68T01. Keywords and phrases: Graphon model, Random Geometric Graph, Latent distances estimation, Latent position graph, Spectral methods. 1. Introduction Random geometric graph (RGG) models have received attention lately as alternative to some simpler yet unrealistic models as the ubiquitous Erd¨os-R´enyi model [11]. They are generative latent point models for graphs, where it is assumed that each node has associated a latent d point in a metric space (usually the Euclidean unit sphere or the unit cube in R ) and the connection probability between two nodes depends on the position of their associated latent points.
    [Show full text]
  • On Community Structure in Complex Networks: Challenges and Opportunities
    On community structure in complex networks: challenges and opportunities Hocine Cherifi · Gergely Palla · Boleslaw K. Szymanski · Xiaoyan Lu Received: November 5, 2019 Abstract Community structure is one of the most relevant features encoun- tered in numerous real-world applications of networked systems. Despite the tremendous effort of a large interdisciplinary community of scientists working on this subject over the past few decades to characterize, model, and analyze communities, more investigations are needed in order to better understand the impact of community structure and its dynamics on networked systems. Here, we first focus on generative models of communities in complex networks and their role in developing strong foundation for community detection algorithms. We discuss modularity and the use of modularity maximization as the basis for community detection. Then, we follow with an overview of the Stochastic Block Model and its different variants as well as inference of community structures from such models. Next, we focus on time evolving networks, where existing nodes and links can disappear, and in parallel new nodes and links may be introduced. The extraction of communities under such circumstances poses an interesting and non-trivial problem that has gained considerable interest over Hocine Cherifi LIB EA 7534 University of Burgundy, Esplanade Erasme, Dijon, France E-mail: hocine.cherifi@u-bourgogne.fr Gergely Palla MTA-ELTE Statistical and Biological Physics Research Group P´azm´any P. stny. 1/A, Budapest, H-1117, Hungary E-mail: [email protected] Boleslaw K. Szymanski Department of Computer Science & Network Science and Technology Center Rensselaer Polytechnic Institute 110 8th Street, Troy, NY 12180, USA E-mail: [email protected] Xiaoyan Lu Department of Computer Science & Network Science and Technology Center Rensselaer Polytechnic Institute 110 8th Street, Troy, NY 12180, USA E-mail: [email protected] arXiv:1908.04901v3 [physics.soc-ph] 6 Nov 2019 2 Cherifi, Palla, Szymanski, Lu the last decade.
    [Show full text]
  • Community Detection with Dependent Connectivity∗
    Submitted to the Annals of Statistics arXiv: arXiv:0000.0000 COMMUNITY DETECTION WITH DEPENDENT CONNECTIVITY∗ BY YUBAI YUANy AND ANNIE QUy University of Illinois at Urbana-Champaigny In network analysis, within-community members are more likely to be connected than between-community members, which is reflected in that the edges within a community are intercorrelated. However, existing probabilistic models for community detection such as the stochastic block model (SBM) are not designed to capture the dependence among edges. In this paper, we propose a new community detection approach to incorporate intra-community dependence of connectivities through the Bahadur representation. The pro- posed method does not require specifying the likelihood function, which could be intractable for correlated binary connectivities. In addition, the pro- posed method allows for heterogeneity among edges between different com- munities. In theory, we show that incorporating correlation information can achieve a faster convergence rate compared to the independent SBM, and the proposed algorithm has a lower estimation bias and accelerated convergence compared to the variational EM. Our simulation studies show that the pro- posed algorithm outperforms the popular variational EM algorithm assuming conditional independence among edges. We also demonstrate the application of the proposed method to agricultural product trading networks from differ- ent countries. 1. Introduction. Network data has arisen as one of the most common forms of information collection. This is due to the fact that the scope of study not only focuses on subjects alone, but also on the relationships among subjects. Networks consist of two components: (1) nodes or ver- tices corresponding to basic units of a system, and (2) edges representing connections between nodes.
    [Show full text]
  • Developing a Credit Scoring Model Using Social Network Analysis
    Developing a Credit Scoring Model Using Social Network Analysis Ahmad Abd Rabuh UP821483 This thesis is submitted in fulfilment of the requirements for the award of the degree of Doctor of Philosophy in the School of Business and Law at the University of Portsmouth November 2020 1 Author’s Declaration Whilst registered as a candidate for the above degree, I have not been registered for any other research award. The results and conclusions embodied in this thesis are the work of the named candidate and have not been submitted for any other academic award. Signature: Name: Ahmad Abd Rabuh Date: the 30th of April 2020 Word Count: 63,809 2 Acknowledgements As I witness this global pandemic, my greatest gratitude goes to my mother, Hanan, for her moral support and encouragement through the good, the bad and the ugly. During the period of my PhD revision, I received mentoring and help from my future lifetime partner, Kawthar. Also, I have been blessed with the help of wonderful people from Scholars at Risk, Rose Anderson and Sarina Rosenthal, who assisted me in the PhD application and scholarship processes. Additionally, I cannot be thankful enough to the Associate Dean of Research, Andy Thorpe, who supported me in every possible way during my time at the University of Portsmouth. Finally, many thanks to my supervisors, Mark Xu and Renatas Kizys for their continuous guidance. 3 Developing a Credit Scoring Model Using Social Network Analysis Table of Contents Abstract ......................................................................................................................................... 10 CHAPTER 1: INTRODUCTION ................................................................................................. 12 1.1. Research Questions ........................................................................................................ 16 1.2. Aims and Objectives .....................................................................................................
    [Show full text]
  • The Domination Number of On-Line Social Networks and Random Geometric Graphs?
    The domination number of on-line social networks and random geometric graphs? Anthony Bonato1, Marc Lozier1, Dieter Mitsche2, Xavier P´erez-Gim´enez1, and Pawe lPra lat1 1 Ryerson University, Toronto, Canada, [email protected], [email protected], [email protected], [email protected] 2 Universit´ede Nice Sophia-Antipolis, Nice, France [email protected] Abstract. We consider the domination number for on-line social networks, both in a stochastic net- work model, and for real-world, networked data. Asymptotic sublinear bounds are rigorously derived for the domination number of graphs generated by the memoryless geometric protean random graph model. We establish sublinear bounds for the domination number of graphs in the Facebook 100 data set, and these bounds are well-correlated with those predicted by the stochastic model. In addition, we derive the asymptotic value of the domination number in classical random geometric graphs. 1. Introduction On-line social networks (or OSNs) such as Facebook have emerged as a hot topic within the network science community. Several studies suggest OSNs satisfy many properties in common with other complex networks, such as: power-law degree distributions [2, 13], high local cluster- ing [36], constant [36] or even shrinking diameter with network size [23], densification [23], and localized information flow bottlenecks [12, 24]. Several models were designed to simulate these properties [19, 20], and one model that rigorously asymptotically captures all these properties is the geometric protean model (GEO-P) [5{7] (see [16, 25, 30, 31] for models where various ranking schemes were first used, and which inspired the GEO-P model).
    [Show full text]
  • Clique Colourings of Geometric Graphs
    CLIQUE COLOURINGS OF GEOMETRIC GRAPHS COLIN MCDIARMID, DIETER MITSCHE, AND PAWELPRA LAT Abstract. A clique colouring of a graph is a colouring of the vertices such that no maximal clique is monochromatic (ignoring isolated vertices). The least number of colours in such a colouring is the clique chromatic number. Given n points x1;:::; xn in the plane, and a threshold r > 0, the corre- sponding geometric graph has vertex set fv1; : : : ; vng, and distinct vi and vj are adjacent when the Euclidean distance between xi and xj is at most r. We investigate the clique chromatic number of such graphs. We first show that the clique chromatic number is at most 9 for any geometric graph in the plane, and briefly consider geometric graphs in higher dimensions. Then we study the asymptotic behaviour of the clique chromatic number for the random geometric graph G(n; r) in the plane, where n random points are independently and uniformly distributed in a suitable square. We see that as r increases from 0, with high probability the clique chromatic number is 1 for very small r, then 2 for small r, then at least 3 for larger r, and finally drops back to 2. 1. Introduction and main results In this section we introduce clique colourings and geometric graphs; and we present our main results, on clique colourings of deterministic and random geometric graphs. Recall that a proper colouring of a graph is a labeling of its vertices with colours such that no two vertices sharing the same edge have the same colour; and the smallest number of colours in a proper colouring of a graph G = (V; E) is its chromatic number, denoted by χ(G).
    [Show full text]
  • Betweenness Centrality in Dense Random Geometric Networks
    Betweenness Centrality in Dense Random Geometric Networks Alexander P. Giles Orestis Georgiou Carl P. Dettmann School of Mathematics Toshiba Telecommunications Research Laboratory School of Mathematics University of Bristol 32 Queen Square University of Bristol Bristol, UK, BS8 1TW Bristol, UK, BS1 4ND Bristol, UK, BS8 1TW [email protected] [email protected] [email protected] Abstract—Random geometric networks are mathematical are routed in a multi-hop fashion between any two nodes that structures consisting of a set of nodes placed randomly within wish to communicate, allowing much larger, more flexible d a bounded set V ⊆ R mutually coupled with a probability de- networks (due to the lack of pre-established infrastructure or pendent on their Euclidean separation, and are the classic model used within the expanding field of ad-hoc wireless networks. In the need to be within range of a switch). Commercial ad hoc order to rank the importance of the network’s communicating networks are nowadays realised under Wi-Fi Direct standards, nodes, we consider the well established ‘betweenness’ centrality enabling device-to-device (D2D) offloading in LTE cellular measure (quantifying how often a node is on a shortest path of networks [5]. links between any pair of nodes), providing an analytic treatment This new diversity in machine betweenness needs to be of betweenness within a random graph model by deriving a closed form expression for the expected betweenness of a node understood, and moreover can be harnessed in at least three placed within a dense random geometric network formed inside separate ways: historically, in 2005 Gupta et al.
    [Show full text]
  • Robustness of Community Detection to Random Geometric Perturbations
    Robustness of Community Detection to Random Geometric Perturbations Sandrine Péché Vianney Perchet LPSM, University Paris Diderot Crest, ENSAE & Criteo AI Lab [email protected] [email protected] Abstract We consider the stochastic block model where connection between vertices is perturbed by some latent (and unobserved) random geometric graph. The objective is to prove that spectral methods are robust to this type of noise, even if they are agnostic to the presence (or not) of the random graph. We provide explicit regimes where the second eigenvector of the adjacency matrix is highly correlated to the true community vector (and therefore when weak/exact recovery is possible). This is possible thanks to a detailed analysis of the spectrum of the latent random graph, of its own interest. Introduction In a d-dimensional random geometric graph, N vertices are assigned random coordinates in Rd, and only points close enough to each other are connected by an edge. Random geometric graphs are used to model complex networks such as social networks, the world wide web and so on. We refer to [19]- and references therein - for a comprehensive introduction to random geometric graphs. On the other hand, in social networks, users are more likely to connect if they belong to some specific community (groups of friends, political party, etc.). This has motivated the introduction of the stochastic block models (see the recent survey [1] and the more recent breakthrough [5] for more details), where in the simplest case, each of the N vertices belongs to one (and only one) of the two communities that are present in the network.
    [Show full text]
  • Exploratory Analysis of Networks
    Exploratory Analysis of Networks James D. Wilson Data Institute Conference 2017 James D. Wilson (USF) Exploratory Analysis of Networks 1 / 35 Summarizing Networks Goals 1 Provide one number summaries of network 2 Develop hypotheses about observed data 3 Motivate predictive models 4 Augment standard multivariate analyses James D. Wilson (USF) Exploratory Analysis of Networks 2 / 35 At first glance: Network Summary Statistics Measures of Connectivity Degree: number of edges incident to each node (popularity) For directed networks, we consider In- and Out-Degree Geodesic distance: shortest path between two nodes Diameter: longest geodesic distance Clustering coefficient: fraction of triangles to total triples Reciprocity: fraction of reciprocated ties (directed graphs only) James D. Wilson (USF) Exploratory Analysis of Networks 3 / 35 Examples in Social Networks James D. Wilson (USF) Exploratory Analysis of Networks 4 / 35 At first glance: Network Summary Statistics Measures of Nodal Influence / Centrality Centrality: how ”central” is a node in the observed network? Degree: popularity of a node Eigenvector: self- neighbor- importance Betweenness: extent+ to which a node lies on the shortest path between two nodes Closeness: mean distance from node to all other vertices Authorities: vertices with useful information on a topic Hubs: vertices that point to authorities James D. Wilson (USF) Exploratory Analysis of Networks 5 / 35 Examples Internet Search PageRank Algorithm Popular web-search scoring and search algorithm (Google) Score based on eigenvector centrality HITS Algorithm (Kleinberg, 1999) Hyperlink-induced topic search Uses hub and authority scores as basis for measuring importance of webpages James D. Wilson (USF) Exploratory Analysis of Networks 6 / 35 Example: Recommendation Systems Figure: eye2data.blogspot.com Want individual (node) with most influence James D.
    [Show full text]