Social Network Extraction from Text Apoorv Agarwal
Total Page:16
File Type:pdf, Size:1020Kb
Social Network Extraction from Text Apoorv Agarwal Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences COLUMBIA UNIVERSITY 2016 c 2016 Apoorv Agarwal All Rights Reserved ABSTRACT Social Network Extraction from Text Apoorv Agarwal In the pre-digital age, when electronically stored information was non-existent, the only ways of creating representations of social networks were by hand through surveys, inter- views, and observations. In this digital age of the internet, numerous indications of social interactions and associations are available electronically in an easy to access manner as structured meta-data. This lessens our dependence on manual surveys and interviews for creating and studying social networks. However, there are sources of networks that remain untouched simply because they are not associated with any meta-data. Primary examples of such sources include the vast amounts of literary texts, news articles, content of emails, and other forms of unstructured and semi-structured texts. The main contribution of this thesis is the introduction of natural language processing and applied machine learning techniques for uncovering social networks in such sources of unstructured and semi-structured texts. Specifically, we propose three novel techniques for mining social networks from three types of texts: unstructured texts (such as literary texts), emails, and movie screenplays. For each of these types of texts, we demonstrate the utility of the extracted networks on three applications (one for each type of text). Table of Contents List of Figures. viii List of Tables. xiii I Introduction 1 1 Introduction 2 1.1 Motivation . .2 1.2 A High Level Organization of the Thesis . .5 1.3 Contributions in Terms of Techniques . .8 1.4 Contributions in Terms of Applications . 11 1.5 Thesis Organization . 13 2 A New Kind of a Social Network 14 2.1 Definition of Entities . 15 2.2 Definition of Social Events . 15 2.3 Subcategorization of Social Events . 19 2.3.1 Examples for the Subcategories of OBS . 20 2.3.2 Examples for the Subcategories of INR . 21 2.4 Social Events Considered in this Thesis . 22 2.5 Comparison of Social Events with ACE Relations and Events . 23 2.5.1 Social Events versus ACE Relations . 23 2.5.2 Social Events versus ACE Events . 24 2.6 Comparison of Social Events with Conversations . 32 2.7 Conclusion . 35 i II Extracting Social Networks from Unstructured Text 36 3 Introduction 38 3.1 Task Definition . 38 3.2 Related Work on Relation Extraction . 39 3.2.1 Bootstrapping . 40 3.2.2 Unsupervised . 42 3.2.3 Distant Supervision . 43 3.2.4 Feature Based Supervision . 44 3.2.5 Convolution Kernels Based Supervision . 46 3.2.6 Rule Based . 51 3.3 Conclusion . 52 4 Machine Learning Approach 55 4.1 Data . 56 4.1.1 Annotation Procedure . 57 4.1.2 Additional Annotation Instructions . 60 4.1.3 Inter-annotator Agreement . 61 4.2 Baselines . 65 4.2.1 Co-occurrence Based (CoOccurN)................... 65 4.2.2 Syntactic Rule Based (SynRule)..................... 66 4.2.3 Feature Based Supervision (GuoDong05)................ 69 4.2.4 Convolution Kernel Based Supervision (Nguyen09).......... 71 4.2.5 Features derived from FrameNet (BOF and SemRules)........ 74 4.3 Structures We Introduce . 76 4.3.1 Sequence on Grammatical Relation Dependency Tree (SqGRW) . 76 4.3.2 Semantic trees (FrameForest, FrameTree, FrameTreeProp) . 76 4.4 Experiments and Results: Intrinsic Evaluation . 80 4.4.1 Task Definitions . 80 4.4.2 Data Distribution . 80 4.4.3 Experimental Set-up . 81 ii 4.4.4 Experiments and Results for Social Event Detection . 81 4.4.5 Experiments and Results for Social Event Classification . 85 4.4.6 Experiments and Results for Directionality Classification . 86 4.4.7 Experiments and Results for Social Network Extraction (SNE) . 87 4.5 Conclusion and Future Work . 88 5 Application: Validating Literary Theories 93 5.1 Introduction . 93 5.2 Literary Theories . 95 5.3 Conversational, Interaction, and Observation Networks . 95 5.4 Evaluation of SINNET .............................. 97 5.4.1 Gold standards: Conv-Gold and SocEv-Gold ............ 98 5.4.2 Evaluation and Results . 99 5.4.3 Discussion of Results . 100 5.5 Expanded Set of Literary Hypotheses . 100 5.6 Methodology for Validating Hypotheses . 102 5.7 Results for Testing Literary Hypotheses . 104 5.8 Conclusion and Future Work . 106 III Extracting Networks from Emails 108 6 Introduction 111 6.1 Terminology . 111 6.2 Task Definition . 114 6.3 Literature Survey . 114 6.4 Conclusion . 116 7 Machine Learning Approach 117 7.1 Data . 117 7.1.1 A Brief History of the Enron Corporation . 117 7.1.2 The Enron Email Corpus . 120 iii 7.2 Name Disambiguation Approach . 121 7.2.1 Terminology . 121 7.2.2 Candidate Set Generation Algorithm . 122 7.2.3 Our Name Disambiguation Algorithm . 125 7.3 Experiments and Results . 125 7.3.1 Evaluation Set for Name Disambiguation . 125 7.3.2 Experiments and Results . 127 7.3.3 Examples of the Name Disambiguation Algorithm in Action . 128 7.3.4 Error Analysis . 131 7.4 Conclusion and Future Work . 132 8 Application: Predicting Organizational Dominance Relations 135 8.1 Introduction . 135 8.2 Related . 137 8.3 A New Gold Standard for Enron Hierarchy Prediction . 140 8.4 Dominance Prediction Technique . 141 8.5 Baseline Approaches . 142 8.5.1 Unsupervised SNA-Based Approach . 142 8.5.2 Supervised NLP-Based Approach . 142 8.5.3 Experiments and Results . 143 8.6 Dominance Prediction Using the Mention Network . 144 8.6.1 Set of Experiments . 144 8.6.2 Weighted versus Unweighted Networks . 146 8.6.3 Type of Network . 147 8.6.4 Linking to the Mentioned Person . 149 8.6.5 Type of Degree Centrality . 150 8.6.6 Summary: What Matters in Mentions . 150 8.7 Conclusion and Future Work . 151 iv IV Extracting Networks from Screenplays 153 9 Introduction 156 9.1 Terminology . 156 9.2 The Structure of Screenplays . 158 9.3 Task Definition . 159 9.4 Literature Survey . 160 9.5 Conclusion . 163 10 Machine Learning Approach 164 10.1 Data . 165 10.1.1 Distant Supervision . 165 10.1.2 Heuristics for Preparing Training Data . 166 10.1.3 Data Distribution . 169 10.2 Baseline Approach . 169 10.3 Machine Learning Approach . 170 10.3.1 Terminology . 171 10.3.2 Overall Machine Learning Approach . 171 10.4 Features . 174 10.5 Experiments and Results . 177 10.5.1 Training Experts . 178 10.5.2 Finding the Right Ensemble . ..