A Survey of Statistical Network Models Full Text Available At
Total Page:16
File Type:pdf, Size:1020Kb
Full text available at: http://dx.doi.org/10.1561/2200000005 A Survey of Statistical Network Models Full text available at: http://dx.doi.org/10.1561/2200000005 A Survey of Statistical Network Models Anna Goldenberg University of Toronto Canada [email protected] Alice X. Zheng Microsoft Research USA [email protected] Stephen E. Fienberg Carnegie Mellon University USA fi[email protected] Edoardo M. Airoldi Harvard University USA [email protected] Boston { Delft Full text available at: http://dx.doi.org/10.1561/2200000005 Foundations and Trends R in Machine Learning Published, sold and distributed by: now Publishers Inc. PO Box 1024 Hanover, MA 02339 USA Tel. +1-781-985-4510 www.nowpublishers.com [email protected] Outside North America: now Publishers Inc. PO Box 179 2600 AD Delft The Netherlands Tel. +31-6-51115274 The preferred citation for this publication is A. Goldenberg, A. X. Zheng, S. E. Fienberg and E. M. Airoldi, A Survey of Statistical Network Models, Foundation and Trends R in Machine Learning, vol 2, no 2, pp 129{233, 2009 ISBN: 978-1-60198-320-6 c 2010 A. Goldenberg, A. X. Zheng, S. E. Fienberg and E. M. Airoldi All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, mechanical, photocopying, recording or otherwise, without prior written permission of the publishers. Photocopying. In the USA: This journal is registered at the Copyright Clearance Cen- ter, Inc., 222 Rosewood Drive, Danvers, MA 01923. Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by now Publishers Inc. for users registered with the Copyright Clearance Center (CCC). The `services' for users can be found on the internet at: www.copyright.com For those organizations that have been granted a photocopy license, a separate system of payment has been arranged. Authorization does not extend to other kinds of copy- ing, such as that for general distribution, for advertising or promotional purposes, for creating new collective works, or for resale. In the rest of the world: Permission to pho- tocopy must be obtained from the copyright owner. Please apply to now Publishers Inc., PO Box 1024, Hanover, MA 02339, USA; Tel. +1-781-871-0245; www.nowpublishers.com; [email protected] now Publishers Inc. has an exclusive license to publish this material worldwide. Permission to use this content must be obtained from the copyright license holder. Please apply to now Publishers, PO Box 179, 2600 AD Delft, The Netherlands, www.nowpublishers.com; e-mail: [email protected] Full text available at: http://dx.doi.org/10.1561/2200000005 Foundations and Trends R in Machine Learning Volume 2 Issue 2, 2009 Editorial Board Editor-in-Chief: Michael Jordan Department of Electrical Engineering and Computer Science Department of Statistics University of California, Berkeley Berkeley, CA 94720-1776 Editors Peter Bartlett (UC Berkeley) John Lafferty (Carnegie Mellon University) Yoshua Bengio (Universit´ede Montr´eal) Michael Littman (Rutgers University) Avrim Blum (Carnegie Mellon University) Gabor Lugosi (Pompeu Fabra University) Craig Boutilier (University of Toronto) David Madigan (Columbia University) Stephen Boyd (Stanford University) Pascal Massart (Universit´ede Paris-Sud) Carla Brodley (Tufts University) Andrew McCallum (University of Inderjit Dhillon (University of Texas at Massachusetts Amherst) Austin) Marina Meila (University of Washington) Jerome Friedman (Stanford University) Andrew Moore (Carnegie Mellon Kenji Fukumizu (Institute of Statistical University) Mathematics) John Platt (Microsoft Research) Zoubin Ghahramani (Cambridge Luc de Raedt (Albert-Ludwigs Universitaet University) Freiburg) David Heckerman (Microsoft Research) Christian Robert (Universit´e Tom Heskes (Radboud University Nijmegen) Paris-Dauphine) Geoffrey Hinton (University of Toronto) Sunita Sarawagi (IIT Bombay) Aapo Hyvarinen (Helsinki Institute for Robert Schapire (Princeton University) Information Technology) Bernhard Schoelkopf (Max Planck Institute) Leslie Pack Kaelbling (MIT) Richard Sutton (University of Alberta) Michael Kearns (University of Larry Wasserman (Carnegie Mellon Pennsylvania) University) Daphne Koller (Stanford University) Bin Yu (UC Berkeley) Full text available at: http://dx.doi.org/10.1561/2200000005 Editorial Scope Foundations and Trends R in Machine Learning will publish sur- vey and tutorial articles in the following topics: • Adaptive control and signal • Inductive logic programming processing • Kernel methods • Applications and case studies • Markov chain Monte Carlo • Behavioral, cognitive and neural • Model choice learning • Nonparametric methods • Bayesian learning • Online learning • Classification and prediction • Optimization • Clustering • Reinforcement learning • Data mining • Relational learning • Dimensionality reduction • Robustness • Evaluation • Spectral methods • Game theoretic learning • Statistical learning theory • Graphical models • Variational inference • Independent component analysis • Visualization Information for Librarians Foundations and Trends R in Machine Learning, 2009, Volume 2, 4 issues. ISSN paper version 1935-8237. ISSN online version 1935-8245. Also available as a combined paper and online subscription. Full text available at: http://dx.doi.org/10.1561/2200000005 Foundations and Trends R in Machine Learning Vol. 2, No. 2 (2009) 129{233 c 2010 A. Goldenberg, A. X. Zheng, S. E. Fienberg and E. M. Airoldi DOI: 10.1561/2200000005 A Survey of Statistical Network Models Anna Goldenberg1, Alice X. Zheng2, Stephen E. Fienberg3 and Edoardo M. Airoldi4 1 Center for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, M5S 3G4, Canada, [email protected] 2 Microsoft Research, One Microsoft Way, Redmond, WA, 98052, USA, [email protected] 3 Department of Statistics, Machine Learning Department, Cylab, and iLab Carnegie Mellon University, Pittsburgh, PA, 15213-3890, USA, fi[email protected] 4 Department of Statistics & FAS Center for Systems Biology, Harvard University, 1 Oxford Street, Cambridge, MA, 02138, USA, [email protected] Abstract Networks are ubiquitous in science and have become a focal point for discussion in everyday life. Formal statistical models for the analysis of network data have emerged as a major topic of interest in diverse areas of study, and most of these involve a form of graphical representation. Probability models on graphs date back to 1959. Along with empirical studies in social psychology and sociology from the 1960s, these early works generated an active \network community" and a substantial liter- ature in the 1970s. This effort moved into the statistical literature in the late 1970s and 1980s, and the past decade has seen a burgeoning net- work literature in statistical physics and computer science. The growth Full text available at: http://dx.doi.org/10.1561/2200000005 of the World Wide Web and the emergence of online \networking com- munities" such as Facebook, MySpace, and LinkedIn, and a host of more specialized professional network communities has intensified interest in the study of networks and network data. Our goal in this review is to provide the reader with an entry point to this burgeoning literature. We begin with an overview of the historical development of statistical network modeling and then we introduce a number of examples that have been studied in the network literature. Our subsequent discussion focuses on a number of prominent static and dynamic network models and their interconnections. We emphasize for- mal model descriptions, and pay special attention to the interpretation of parameters and their estimation. We end with a description of some open problems and challenges for machine learning and statistics. Full text available at: http://dx.doi.org/10.1561/2200000005 Contents 1 Introduction 1 1.1 Overview of Modeling Approaches 2 1.2 What This Survey Does Not Cover 7 2 Motivation and Data-Set Examples 11 2.1 Motivations for Network Analysis 11 2.2 Sample Data Sets 13 3 Static Network Models 25 3.1 Basic Notation and Terminology 26 3.2 The Erd¨os{R´enyi{Gilbert Random Graph Model 26 3.3 The Exchangeable Graph Model 28 3.4 The p1 Model for Social Networks 33 3.5 p2 Models for Social Networks and Their Bayesian Relatives 36 3.6 Exponential Random Graph Models 37 3.7 Random Graph Models with Fixed Degree Distribution 40 3.8 Blockmodels, Stochastic Blockmodels and Community Discovery 42 3.9 Latent Space Models 46 ix Full text available at: http://dx.doi.org/10.1561/2200000005 4 Dynamic Models for Longitudinal Data 53 4.1 Random Graphs and the Preferential Attachment Model 54 4.2 Small-World Models 57 4.3 Duplication-Attachment Models 60 4.4 Continuous-Time Markov Chain Models 61 4.5 Discrete Time Markov Models 66 5 Issues in Network Modeling 75 6 Summary 81 Acknowledgments 85 References 87 Full text available at: http://dx.doi.org/10.1561/2200000005 1 Introduction Many scientific fields involve the study of networks in some form. Networks have been used to analyze interpersonal social relationships, communication networks, academic paper co-authorships and cita- tions, protein interaction patterns, and much more. Popular books on networks and their analysis began to appear a decade ago (see, e.g., [24, 50, 68, 318, 319]), and online \networking communities" such as Facebook, MySpace, and LinkedIn are an even more recent phenomenon. In this work, we survey selective