Online Social Networks: Measurement, Analysis, and Applications to Distributed Information Systems

RICE UNIVERSITY

Online Social Networks: Measurement, Analysis, and Applications to Distributed Information Systems

by Alan E. Mislove

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

Approved, Thesis Committee:
Peter Druschel, Chair, Professor of Computer Science
T. S. Eugene Ng, Assistant Professor of Computer Science
Krishna P. Gummadi, Assistant Professor of Computer Science

Houston, Texas
April, 2009

Abstract

Recently, online social networking sites have exploded in popularity. Numerous sites are dedicated to finding and maintaining contacts and to locating and sharing different types of content. Online social networks represent a new kind of information network that differs significantly from existing networks like the Web. For example, in the Web, hyperlinks between content form a graph that is used to organize, navigate, and rank information. The properties of the Web graph have been studied extensively, and have led to useful algorithms such as PageRank. In contrast, few links exist between content in online social networks; instead, links exist between content and users, and between users themselves. However, little is known in the research community about the properties of online social network graphs at scale, the factors that shape their structure, or the ways they can be leveraged in information systems.

In this thesis, we use novel measurement techniques to study online social networks at scale, and use the resulting insights to design new information systems. First, we examine the structure and growth patterns of online social networks, focusing on how users are connecting to one another. We conduct the first large-scale measurement study of multiple online social networks, capturing information about over 50 million users and 400 million links.
Our analysis identifies a common structure across multiple networks, characterizes the underlying processes that are shaping the network structure, and exposes the rich community structure. Second, we leverage our understanding of the properties of online social networks to design new information systems. Specifically, we build two distinct applications that leverage different properties of online social networks. We present and evaluate Ostra, a novel system for preventing unwanted communication that leverages the difficulty of establishing and maintaining relationships in social networks. We also present, deploy, and evaluate PeerSpective, a system for enhancing Web search using the natural community structure in social networks. Each of these systems has been evaluated on data from real online social networks or in a deployment with real users.

Acknowledgments

First and foremost, I would like to thank my advisors, Peter Druschel and Krishna P. Gummadi, for their help, advice, and mentoring during my graduate career. Without their support and guidance, none of the work presented in this thesis would have been possible. Moreover, I am deeply indebted to them both for showing me how to do successful research, how to mentor students, and how to communicate research results effectively. I suspect that this debt will only grow over time, as I use these skills in my own research career.

I would also like to thank Eugene Ng for his service on my thesis committee. His insight and advice proved very useful during the preparation of this thesis, and in my search for a tenure-track job. I am also grateful to have worked with Bobby Bhattacharjee; his advice and enthusiasm played no small part in my decision to continue a career in academia.

I am extremely grateful to have worked with and mentored numerous talented students during my research career.
Working with Bimal, Malveeka, and Hema was a pleasure, and the excitement and energy they each brought to their research was both refreshing and invigorating. I hope that I am lucky enough to work with students of a similar caliber in the future.

I am deeply indebted to Brigitta Hansen, Claudia Richter, and Belia Martinez, whose assistance with many administrative matters proved invaluable. They all made living in Germany while finishing a Ph.D. at Rice a much easier experience.

I would also like to thank my colleagues and friends in Saarbrücken: Ansley, Animesh, Atul, Andreas, Jeff, Jim, Rodrigo, Andrey, Derek, Rose, Marcel, Max, Nuno, Pedro, Mia, and Ashu. They all made MPI-SWS a wonderful place to be, and being in Germany is an experience that I will always treasure.

I am also grateful for my close friendship with Rebecca. Her contagious excitement and enthusiasm were always refreshing, and I benefited greatly from her insight and advice. Additionally, I am grateful for my friendship with Stephanie; our travels and adventures often provided a needed break from research.

Finally, I would like to express my deep gratitude to my family, and especially my parents, for their love and support during the ups and downs of graduate school. I am grateful beyond words for all that they have given me.

Contents

Abstract
Acknowledgments
List of Illustrations
List of Tables

1 Introduction
  1.1 Background, related work, and methodology
  1.2 Network structure and growth
  1.3 Communities in online social networks
  1.4 Ostra: Leveraging relationships
  1.5 PeerSpective: Leveraging shared interest

2 Background
  2.1 What are online social networks?
    2.1.1 Definition and purpose
    2.1.2 A brief history
    2.1.3 Mechanisms and policies
    2.1.4 A new form of information exchange
  2.2 Why study online social networks?
    2.2.1 Trust
    2.2.2 Shared interest
    2.2.3 Content exchange
    2.2.4 Other disciplines
  2.3 How do we analyze complex networks?
    2.3.1 Preliminaries
    2.3.2 Radius and diameter
    2.3.3 Degree distribution
    2.3.4 Joint degree distribution
    2.3.5 Scale-free behavior
    2.3.6 Assortativity
    2.3.7 Clustering coefficient
    2.3.8 Betweenness centrality
    2.3.9 Modularity
    2.3.10 Connected components
    2.3.11 Classes of studied networks
    2.3.12 Preferential attachment

3 Related Work
  3.1 Complex network structure
    3.1.1 Social networks
    3.1.2 Other information networks
  3.2 Complex network growth
    3.2.1 Growth models
    3.2.2 Observations of network growth
  3.3 Detecting communities
    3.3.1 Classical community detection
    3.3.2 Global community detection
    3.3.3 Local community detection
    3.3.4 Observations of communities
  3.4 Preventing unwanted communication
    3.4.1 Content-based filtering
    3.4.2 Originator-based filtering
    3.4.3 Imposing a cost on the sender
    3.4.4 Content rating
    3.4.5 Leveraging relationships
  3.5 Personalized web search

4 Measurement Methodology
  4.1 Challenges in crawling large graphs
    4.1.1 Crawling the entire large WCC
    4.1.2 Using only forward links
  4.2 Capturing social networks’ structure
    4.2.1 Flickr
    4.2.2 LiveJournal
    4.2.3 Orkut
    4.2.4 YouTube
    4.2.5 Web graph
    4.2.6 Summary
  4.3 Capturing group membership
  4.4 Capturing social networks’ growth
    4.4.1 Flickr
    4.4.2 YouTube
    4.4.3 Wikipedia
    4.4.4 Internet topology
  4.5 Capturing communities
    4.5.1 Measurement methodology
    4.5.2 Collected data
    4.5.3 Limitations
  4.6 Data availability

5 Network Structure
  5.1 High-level data statistics
  5.2 Link symmetry
  5.3 Power-law node degrees
  5.4 Correlation of indegree and outdegree
  5.5 Path lengths and diameter
  5.6 Link degree correlations
    5.6.1 Joint degree distribution
    5.6.2 Scale-free behavior
    5.6.3 Assortativity
  5.7 Densely connected core
  5.8 Tightly clustered fringe
  5.9 Groups
  5.10 Discussion
    5.10.1 Information dissemination and search
    5.10.2 Trust
  5.11 Summary

6 Network Growth
  6.1 High-level data characteristics
  6.2 Growth dominates network evolution
  6.3 Reciprocation
  6.4 Preferential attachment
    6.4.1 Undirected networks
    6.4.2 Directed networks
    6.4.3 Discussion
  6.5 Proximity bias in link creation
  6.6 Mechanisms causing proximity bias
  6.7 Discussion
    6.7.1 Is proximity fundamental?
    6.7.2 Proximity mechanisms
  6.8 Summary

7 Network Communities
  7.1 Datasets used
  7.2 Attributes in the network
    7.2.1 Friends with common attributes
    7.2.2 Attribute-based communities
    7.2.3 Summary
