Techniques for the Automatic Extraction of Character Networks in German Historic Novels
Total Page:16
File Type:pdf, Size:1020Kb
Techniques for the Automatic Extraction of Character Networks in German Historic Novels vorgelegt von Markus Krug Wurzburg,¨ 2020 Dissertation zur Erlangung des naturwissenschaftlichen Doktorgrades der Bayerischen Julius-Maximilians-Universit¨at Wurzburg¨ ii Abstract Recent advances in Natural Language Preprocessing (NLP) allow for a fully automatic extraction of character networks for an incoming text. These networks serve as a compact and easy to grasp representation of literary fiction. They offer an aggregated view of the text, which can be used during distant reading approaches for the analysis of literary hypotheses. In their core, the networks consist of nodes, which represent literary characters, and edges, which represent relations between characters. For an automatic extraction of such a network, the first step is the detection of the references of all fictional entities that are of importance for a text. References to the fictional entities appear in the form of names, noun phrases and pronouns and prior to this work, no components capable of automatic detection of character references were available. Existing tools are only capable of detecting proper nouns, a subset of all character references. When evaluated on the task of detecting proper nouns in the domain of literary fiction, they still underperform at an F1-score of just about 50%. This thesis uses techniques from the field of semi-supervised learning, such as Distant supervision and Generalized Expectations, and improves the results of an existing tool to about 82%, when evaluated on all three categories in literary fiction, but without the need for annotated data in the target domain. However, since this quality is still not sufficient, the decision to annotate DROC, a corpus comprising 90 fragments of German novels was made. This resulted in a new general purpose annotation environment titled as ATHEN, as well as annotated data that spans about 500.000 tokens in total. Using this data, the combination of supervised algorithms and a tailored rule based algorithm, which in combination are able to exploit both - local consistencies as well as global consistencies - yield an algorithm with an F1-score of about 93%. This component is referred to as the Kallimachos tagger. A character network can not directly display references however, instead they need to be clustered so that all references that belong to a real world or fictional entity are grouped together. This process widely known as coreference resolution is a hard problem in the focus of research for more than half a century. This work experimented with adaptations of classical feature based machine learning, with a dedicated rule based algorithm and with modern techniques of Deep Learning, but no approach can surpass 55% B-Cubed F1, when evaluated on DROC. Due to this barrier, many researchers do not use a fully-fledged coreference resolution when they extract character networks, but only focus on a more forgiving subset- the names. For novels such as Alice's Adventures in Wonderland by Lewis Caroll, this would however only result in a network in which many important characters are missing. In order to integrate important characters into the network that are not named by the author, this work makes use of automatic detection of speaker and addressees for direct speech utterances (all entities involved in a dialog are considered to be of importance). This problem is by itself not an easy task, however the most successful system analysed in this thesis is able to correctly determine the speaker to about 85% of the utterances as well as about 65% of the addressees. This speaker information can not only help to identify the most dominant characters, but also serves as a way to model the relations between entities. iii During the span of this work, components have been developed to model relations between characters using speaker attribution, using co-occurrences as well as by the us- age of true interactions, for which yet again a dataset was annotated using ATHEN. Furthermore, since relations between characters are usually typed, a component for the extraction of a typed relation was developed. Similar to the experiments for the charac- ter reference detection, a combination of a rule based and a Maximum Entropy classifier yielded the best overall results, with the extraction of family relations showing a score of about 80% and the quality of love relations with a score of about 50%. For family relations, a kernel for a Support Vector Machine was developed that even exceeded the scores of the combined approach but is behind on the other labels. In addition, this work presents new ways to evaluate automatically extracted networks without the need of domain experts, instead it relies on the usage of expert summaries. It also refrains from the uses of social network analysis for the evaluation, but instead presents ranked evaluations using Precision@k and the Spearman Rank correlation coef- ficient for the evaluation of the nodes and edges of the network. An analysis using these metrics showed, that the central characters of a novel are contained with high probability but the quality drops rather fast if more than five entities are analyzed. The quality of the edges is mainly dominated by the quality of the coreference resolution and the correlation coefficient between gold edges and system edges therefore varies between 30 and 60%. All developed components are aggregated alongside a large set of other preprocessing modules in the Kallimachos pipeline and can be reused without any restrictions. iv Acknowledgements / Danksagung Zum Entstehen der vorliegenden Arbeit haben viele Personen beigetragen. Dabei gilt mein besonderer Dank an meinen Doktorvater Frank Puppe, der nicht nur stets ein offenes Ohr fur¨ Probleme hatte, ich m¨ochte auch insbesondere die angenehmen und spannenden Diskussionen uber¨ Forschungsfragen hervorheben, in denen wir immer zu einer sehr guten L¨osung gekommen sind, was mir insgesamt eine sehr gute Forschung- sumgebung verschafft hat. Des Weiteren m¨ochte ich mich bei Prof. Dr. Fotis Jannidis bedanken, der durch seine bestechende Expertise in der Computerlinguistik und daruber¨ hinaus stets sehr gute Ideen zu unseren Vorhaben beisteuerte. Fur¨ ein angenehmes Arbeitsklima haben auch die zahlreichen Diskussionen und Ak- tivit¨aten mit meinen Kollegen des Lehrstuhls fur¨ Informatik VI und des Lehrstuhls fur¨ Informatik X beigetragen, wofur¨ ich mich auch in diesem Sinne bedanken m¨ochte. An dieser Stelle m¨ochte ich die Unterstutzung,¨ die ich durch David Schmidt erhalten habe besonders betonen, ohne seine Unterstutzung¨ h¨atte sich mein Zeitplan sicher noch et- was verz¨ogert. Durch die Zusammenarbeit mit Lukas Weimer und Isabella Reger wurde die Arbeit im Projekt Kallimachos deutlich vereinfacht, so dass die Erstellung der im Rahmen dieser Arbeit entstanden Daten und Tools sehr angenehm verlief. Daruber¨ hinaus bin ich sehr dankbar fur¨ den Ruckhalt¨ und die Unterstutzung,¨ die ich von meiner Famlie, von meiner Frau und meiner Tochter erhalten habe, so dass auch mal lange Tage im Buro¨ ertr¨aglicher wurden. Wurzburg,¨ September 2019 v Contents 1. Motivation1 1.1. Context of this Thesis . .5 2. Goals of this Work7 2.1. Adaptation and Creation of Algorithms for the Domain of German historic novels ......................................7 2.2. Kallimachos Pipeline . .7 2.2.1. Experiments concerning Runtimes and Requirements of the differ- ent Algorithms . 12 2.3. Contributions . 15 3. Preliminaries 17 3.1. Performance Metrics . 17 3.1.1. Performance Metrics for Labelwise Evaluation . 18 3.1.2. Performance Metric for the Evaluation of Spans . 19 3.1.3. Performance Metric for the Evaluation of Coreference Clusters . 19 3.2. Background on Machine Learning . 34 3.2.1. Perceptron Algorithm . 34 3.2.2. Support Vector Machine . 39 3.2.3. Maximum Entropy Classifier . 40 3.2.4. A primer on Deep Learning . 41 3.2.5. (Neural) Conditional Random Fields . 42 3.3. Rule-Based Algorithms . 48 3.4. Comparison of Rule-based vs Machine Learning Methods . 50 3.5. The Convergence of Rule-based approaches and Machine Learning algo- rithms...................................... 51 4. Datasets and Resources used in this Work 53 4.1. Available Corpora . 53 4.1.1. Datasets for German Character Reference Detection . 53 4.1.2. Datasets for Quotation Attribution . 54 4.1.3. Datasets for Coreference Resolution . 54 4.2. The Creation of DROC . 55 4.2.1. Description of the Textual Sources . 55 4.2.2. Creation of the Corpus . 55 4.2.3. Annotation Guidelines . 56 4.2.4. Annotated Character References . 56 vii Contents 4.2.5. Annotated Coreferences . 58 4.2.6. Annotated Direct Speech . 59 4.2.7. Inter-Annotator Agreement . 60 4.2.8. Corpus Statistics . 60 4.3. Kindlers Expert Summaries . 61 4.3.1. Guidelines for the Annotation of Relations Between characters . 61 4.3.2. Statistics of the Kindler Data set . 62 4.4. Active Learning to Annotate Relations in German Novels . 62 4.5. Inter Annotator Agreement for the Annotation of Relations . 63 4.6. Annotation of Interactions in Parts of DROC . 65 4.6.1. Statistics about the Dataset . 68 4.7. Gazetteers and Further Lexical Resources . 68 4.7.1. Dictionaries . 68 4.7.2. Gazetteers and Semantic Resources . 69 4.7.3. Semi automatically created resources . 70 4.8. Selection of appropriate Preprocessing Components for German Literary Documents ................................... 70 4.8.1. Tokenizing and Sentence Splitting . 70 4.8.2. Part of Speech Tagging and Morphology . 71 4.8.3. Dependency and Constituency Parsing . 71 4.9. Development of unpublished Preprocessing Components for German lit- erarytexts.................................... 73 4.9.1. Tokenizer and Sentence Splitting Component . 73 4.9.2. Ontology Based Information Extraction . 74 5. ATHEN 81 5.1. Introduction . 81 5.2. Motivation . 82 5.3. Features of ATHEN . 83 5.3.1. General Purpose Annotation Using the Annotation Browser .