Narrative Information Extraction with Non-Linear Natural Language Processing
Total Page:16
File Type:pdf, Size:1020Kb
Narrative Information Extraction with Non-Linear Natural Language Processing Pipelines A Thesis Submitted to the Faculty of Drexel University by Josep Valls Vargas in partial fulfillment of the requirements for the degree of Doctor of Philosophy December 2017 c Copyright 2017 Josep Valls Vargas. This work is licensed under the terms of the Creative Commons Attribution-ShareAlike 4.0 International license. The license is available at http://creativecommons.org/licenses/by-sa/4.0/. ii Acknowledgments First and foremost, I would like to thank my advisors, Dr. Santiago Ontañón and Dr. Jichen Zhu for all their guidance and support over the last few years. Then, the members of my PhD dissertation committee, Dr. David Elson, Dr. Mark Riedl, and Dr. Dario Salvucci. Additionally, I would like to thank Dr. Marcello Balduccini, Dr. Christopher Geib and Dr. Rachel Greenstadt for their involvement and valuable feedback in the candidacy and proposal stages of my PhD, and Dr. Michael Mateas and Dr. Luc Lamontagne for their mentorship and support earlier in my PhD. I would also like to thank Dr. Mark Finlayson, who kindly shared his translated corpus and annotations with us, and all the authors, researchers and scholars whose work I referred to throughout this dissertation. I should not forget all my colleagues, friends and family who supported and encouraged me through this journey. They all contributed to this dissertation and I would not had been able to finish it without them. Finally, I would also like to thank all the administrative personnel who supported me and my fellow graduate students; the social organizations that kept us sane; and the amazing staff from the recreation athletics center that kept us healthy. All their work is often forgotten but has been an integral part of the last few years of my life. iii iv Contents List of Tables ........................................... ix List of Figures .......................................... xiii Abstract .............................................. xvi 1. Introduction .......................................... 1 1.1 Open Problems........................................ 2 1.2 Thesis Statement....................................... 4 1.3 Organization ......................................... 5 2. Background and Related Work .............................. 7 2.1 Natural Language Processing ................................ 7 2.1.1 Segmentation, Chunking and Tokenization....................... 8 2.1.2 Morphological Analysis.................................. 8 2.1.3 Part-Of-Speech (POS) tagging.............................. 9 2.1.4 Grammatical and Syntactic Parsing........................... 9 2.1.5 Named Entity Recognition (NER) and Information Extraction (IE)......... 10 2.1.6 Coreference Resolution.................................. 11 2.1.7 Semantic Role Labeling (SRL).............................. 11 2.1.8 Word Sense Disambiguation, Common Sense and Domain Knowledge . 11 2.1.9 Natural Language Processing Pipelines......................... 12 2.1.10 Natural Language Generation.............................. 13 2.1.11 Neural Networks ..................................... 14 2.2 Narrative ........................................... 15 2.2.1 Narratology........................................ 15 2.2.2 Computational Narrative................................. 17 2.3 Computational Models of Narrative............................. 18 v 2.3.1 Annotating Narratives.................................. 18 2.4 Applications.......................................... 20 2.5 A Taxonomy of CN...................................... 26 2.5.1 Narrative Generation and Storytelling ......................... 26 2.5.2 Narrative Analysis and Narrative Information Extraction .............. 29 2.5.3 Statistical Models..................................... 31 2.6 Information Extraction.................................... 32 2.6.1 Automatically Extracting Narrative Information ................... 33 2.6.2 Evaluation of NLP Pipelines............................... 35 3. Proppian Stories ....................................... 37 3.1 Corpus of Russian Stories.................................. 38 3.2 Synthetic Dataset....................................... 44 3.3 Discussion........................................... 47 4. Automated Narrative Inf. Extraction ......................... 48 4.1 Infrastructure: Voz...................................... 50 4.2 Extracting Mentions..................................... 52 4.3 Identifying Characters.................................... 54 4.3.1 Verb Extraction...................................... 55 4.3.2 Computing Features from Extracted Mentions..................... 56 4.3.3 Case-Based Reasoning.................................. 60 4.3.4 Extending Mention Classification............................ 65 4.3.5 Improving Mention Classification............................ 66 4.4 Identifying Narrative Roles ................................. 68 4.5 Improving Coref. Resolution................................. 73 4.6 Narrative Functions ..................................... 80 4.7 Dialogue Participants .................................... 91 4.7.1 Experimental Results................................... 101 CONTENTS CONTENTS vi 4.7.2 Evaluating IE with Dialogue............................... 104 4.8 Discussion........................................... 105 5. Evaluation of IE Pipelines ................................. 107 5.1 Systematic Methodology................................... 108 5.2 Evaluating Voz........................................ 110 5.2.1 Incremental Dataset ................................... 110 5.2.2 Individual Module Evaluation.............................. 111 5.2.3 Error Propagation .................................... 116 5.3 Discussion........................................... 120 6. Non-linear IE Pipelines ................................... 123 6.1 Probabilistic Inference.................................... 123 6.1.1 Module Outputs to Probability Distributions ..................... 124 6.1.2 Aggregating Probability Distributions ......................... 125 6.2 Feedback Loops in Voz.................................... 127 6.2.1 Coreference Resolution.................................. 132 6.2.2 Verb Argument Extraction................................ 133 6.3 Discussion........................................... 134 7. Bridging the Gap ....................................... 135 7.1 End-to-end Computational Narrative Systems....................... 135 7.1.1 Mapping Computational Narrative Models....................... 136 7.2 Connecting Voz and Riu................................... 138 7.2.1 Riu............................................. 139 7.3 Evaluation of Riu’s Output ................................. 145 7.3.1 Analytical Evaluation .................................. 145 7.3.2 Empirical Evaluation................................... 150 7.4 Discussion........................................... 159 8. Conclusions and Future Work .............................. 162 CONTENTS CONTENTS vii 8.1 Dissertation Summary.................................... 163 8.2 Contributions......................................... 166 8.3 Discussion........................................... 170 8.4 Future Work ......................................... 172 Bibliography ............................................ 176 Appendix A: Publications .................................... 187 A.1 Journal Articles........................................ 187 A.2 Peer-Reviewed Conference Proceedings........................... 187 A.3 Workshops........................................... 188 A.4 Other Peer-Reviewed Publications ............................. 188 Appendix B: User Study on Connecting Voz+Riu . 190 Appendix C: Narrative Functions Identified by Propp . 197 C.1 Tables............................................. 197 Appendix D: Alexander Afanasyev’s Stories . 200 D.1 Story Titles.......................................... 200 Appendix E: Story-based PCG ................................. 207 E.1 Story-based Procedural Content Generation........................ 207 E.1.1 Experimental Evaluation................................. 218 E.2 Conclusions and Future Work................................ 221 E.3 Visualizing Story Spatial Information Using Voz ..................... 222 CONTENTS CONTENTS viii ix List of Tables 3.1 Annotation present in Finlayson’s 15 story dataset that we use in our work and the number of annotations for each................................. 42 3.2 Stories in our dataset. The second column lists the numerical identifier of the stories in the new Afanasyev collections. ................................ 45 3.3 Annotation present in Finlayson’s 15 story dataset that we use in our work and the number of annotations for each................................. 47 4.1 Performance of the different distance measures over the complete dataset (1122 instances) measured as classification accuracy (Acc.), Precision (Prec.) and Recall (Rec.). 63 4.2 Performance of the different distance measures over the filtered dataset removing the instances that contain personal pronouns measured as classification accuracy (Acc.), Pre- cision (Prec.) and Recall (Rec.)................................. 64 4.3 Performance of the SwJ distance measure with different feature subsets: Acc. only reports the accuracy only using features of a given type, and Acc. all except reports the accuracy using all the features except the ones of the given type. N reports the number of features of each type. .........................................