Unsupervised Spatial, Temporal and Relational Models for Social Processes
George B. Davis
February 2012
CMU-ISR-11-117

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Thesis Committee:
Kathleen M. Carley (CMU, ISR), Chair
Christos Faloutsos (CMU, CSD)
Javier F. Peña (CMU, Tepper)
Carter T. Butts (UCI, Sociology / MBS)

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

This work was supported in part by the National Science Foundation under the IGERT program (DGE-9972762) for training and research in CASOS, the Office of Naval Research under the Dynamic Network Analysis program (N00014-02-1-0973, ONR N00014-06-1-0921, ONR N00014-06-1-0104) and ONR MURI N00014-08-1-1186, the Army Research Laboratory under ARL W911NF-08-R-0013, the Army Research Institute under ARI W91WAW-07-C-0063, and ARO-ERDC W911NF-07-1-0317. Additional support was provided by CASOS, the Center for Computational Analysis of Social and Organizational Systems at Carnegie Mellon University. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the National Science Foundation, the Office of Naval Research, or the U.S. government.

Keywords: Clustering, unsupervised learning, factor graphs, kernel density estimation

Abstract

This thesis addresses two challenges in extracting patterns from social data generated by modern sensor systems and electronic mechanisms. First, such data often combine spatial, temporal, and relational evidence, requiring models that properly utilize the regularities of each domain. Second, data from open-ended systems often contain a mixture of entities and relationships that are known a priori, others that are explicitly detected, and still others that are latent but significant in interpreting the data.
Identifying the final category requires unsupervised inference techniques that can detect certain structures without explicit examples. I present new algorithms designed to address both issues within three frameworks: relational clustering, probabilistic graphical models, and kernel-conditional density estimation. These algorithms are applied to several datasets, including geospatial traces of international shipping traffic and a dynamic network of publicly declared supply relations between US companies. The inference tasks considered include community detection, path prediction, and link prediction. In each case, I present theoretical and empirical results regarding accuracy and complexity, and compare efficacy to previous techniques.

For Mary, Lea, George and Bruce, who always looked forward.

Acknowledgments

I am deeply indebted to a small host whose collaboration and faith saw this work to completion. First thanks go to my advisor, Kathleen Carley, whose support and advice across every possible endeavor have broadened my world. It has been a pleasure studying under someone who refuses to limit their scope of interest, and encourages the same. My gratitude for the breadth of opportunity I've experienced at Carnegie Mellon extends to the rest of the Computations, Organizations and Society faculty, to the exceptional Computer Science Department faculty including Carlos Guestrin and Tuomas Sandholm, and to Javier, Carter and Christos, who continued to offer support and advice months outside and topics away from our original plan. I also wish to thank Dannie Durand, Christian Burks, and Darren Platt, whose mentoring in Computational Biology formed the foundation of my academic interest in applied Computer Science.

I've benefited greatly from interactions with extraordinary collaborators. To Michael Benisch, for his contagious high standards, and to Jamie Olson, for his tenacity, I owe special thanks.
I'm also very appreciative of the conversations I shared with Andrew Gilpin, Jesse St. Charles, Jana Deisner, Peter Landwehr, Brian Hirschman, Brad Malin, Edo Airoldi, and Ralph Gross, and many other exceptional students in the COS and CSD programs.

Many people outside the Computer Science community helped me situate this work in the context of real problems. Robert Martinez and Kevin O'Brien at Revere Data, LLC were invaluable in both the data they donated and their experience in interpreting declared business-to-business relations, as were Arif Inayatullah and Gerald Leitner for their perspectives on financial networks. Rebecca Goolsby at the Office of Naval Research was supportive in far more than funding, helping me understand how and where to look for data, and how government agencies process research. I'm also grateful to our collaborators at Singapore's DSTA, whose data and domain knowledge in shipping traffic were crucial to our joint projects.

To yet unnamed friends, especially Drew, Leith, Sam, Cesar and both Bens, I say thank you for your help balancing my life during the first few years. To my partners at Hg Analytics, I say thank you for your moral support in persevering with academic work as entrepreneurial life grew inevitably more unbalanced.

My most stalwart supporters have always been my family, and without love and encouragement from my parents Ann and Bart, and my sister Stephanie, this would not even have begun. I also wish to thank the Ricous, Fonsecas, and Schattens in my life for their kind encouragement. However, my deepest gratitude is reserved for Joana, my true love, friend, and partner in this endeavor and all that follows.

Contents

1 Introduction
  1.1 Spatial, Temporal and Relational (STR) Data
  1.2 Unsupervised Learning
2 Background and Preliminaries
  2.1 STR Modeling
    2.1.1 Temporal Data
    2.1.2 Spatial Data
    2.1.3 Relational Data
    2.1.4 Frameworks for STR analysis
  2.2 Clustering
    2.2.1 Defining Similarity
    2.2.2 Clustering Algorithms
    2.2.3 How Many Groups?
  2.3 Probabilistic Graphical Models
    2.3.1 Factor graphs
    2.3.2 Parameter sharing
    2.3.3 Discriminative models
    2.3.4 Representation
    2.3.5 Inference
    2.3.6 The Bayesian view
  2.4 Kernel Conditional Density Estimation
    2.4.1 Regression
    2.4.2 Density Estimation
    2.4.3 Conditional and Categorical Density Estimation
3 Data
  3.1 AIS Data (AIS)
  3.2 Company Relations Data (B2B)
  3.3 Supplementary datasets
    3.3.1 Sampson's Monastery (SAMPSON)
    3.3.2 Davis' Southern Women (DAVIS)
4 Community Detection
  4.1 Problem Definition and Background
    4.1.1 Defining 'Group'
    4.1.2 Variations and Detection of Cohesive Groups
    4.1.3 Bipartite (Link) Data
  4.2 Fuzzy, Overlapping Grouping
    4.2.1 Link data from network data
    4.2.2 Stochastic model of evidence generation
    4.2.3 The H-FOG algorithm
    4.2.4 The k-FOG algorithm
    4.2.5 The α-FOG algorithm
  4.3 Example analysis: fuzzy groups in the monastery
    4.3.1 Example analysis: fuzzy groups among the southern women
  4.4 Parameter recovery experiments
    4.4.1 Generation of parameters and data
    4.4.2 Evaluation of fit
    4.4.3 Convergence profiles
  4.5 Improving α-FOG performance with residual priority updates
  4.6 Performance comparison summary
  4.7 (Infinite) Residual Expectation Maximization (REM)
  4.8 Discussion
    4.8.1 Validation
    4.8.2 Interstitial Roles
    4.8.3 Generating link data from networks
    4.8.4 Analyzing and visualizing fuzzy relationships
    4.8.5 Efficient EM algorithms for an unknown number of clusters
    4.8.6 Final thoughts and remaining questions
5 Path Prediction
  5.1 Problem Definition
  5.2 Factor Graphs for Path Prediction
  5.3 Unsupervised Learning for Factor Graphs
  5.4 Plan Projection Experiment
    5.4.1 Method
    5.4.2 Results and Analysis
  5.5 Residual Expectation Maximizing Belief Propagation (REM-BP)
    5.5.1 Multi-clustering factors and factor graphs
    5.5.2 Inference
    5.5.3 Empirical performance of REM-BP
  5.6 Discussion and Future Work
6 Spatially embedded link prediction
  6.1 Problem Definition
  6.2 Classifying spatial embedding models
  6.3 A kernel-based spatial correlation model
    6.3.1 Inference and training for spatial correlation
    6.3.2 Fast Approximation
    6.3.3 Distances
  6.4 Experimental Results
    6.4.1 Scoring link prediction
    6.4.2 Experimental Setup
    6.4.3 Impact of Approximation
    6.4.4 Robustness to Missing Data
    6.4.5 Comparative Performance
  6.5 Discussion
7 Conclusion: contributions, limitations and themes
  7.1 The human process that studies human processes
  7.2 Relaxed grouping in networks