Knowledge-Based Supervision for Domain-Adaptive Semantic Role Labeling

Knowledge-based Supervision for Domain-adaptive Semantic Role Labeling Vom Fachbereich Informatik der Technischen Universität Darmstadt genehmigte Dissertation zur Erlangung des akademischen Grades Dr. rer. nat. vorgelegt von Silvana Hartmann, Dipl. Ling. geboren in Darmstadt Tag der Einreichung: 3. August 2016 Tag der Disputation: 30. September 2016 Referenten: Prof. Dr. Iryna Gurevych, Darmstadt Prof. Martha Palmer, Ph.D., Boulder Prof. Dr. Simone Paolo Ponzetto, Mannheim Darmstädter Dissertationen 2017 D17 Please cite this document as URN: urn:nbn:de:tuda-tuprints-67700 URL: http://tuprints.ulb.tu-darmstadt.de/id/eprint/6770 This document is provided by tuprints, E-Publishing-Service of the TU Darmstadt http://tuprints.ulb.tu-darmstadt.de [email protected] This work is published under the following Creative Commons license: Attribution – Non Commercial – No Derivative Works 4.0 International https://creativecommons.org/licenses/by-nc-nd/4.0/ Abstract Semantic role labeling (SRL) is a method for the semantic analysis of texts that adds a level of semantic abstraction on top of syntactic analysis, for instance adding semantic role labels like Agent on top of syntactic functions like Subject . SRL has been shown to benefit various natural language processing applications such as question answering, information extraction, and summarization. Automatic SRL systems are typically based on a predefined model of semantic predicate argument structure incorporated in lexical knowledge bases like PropBank or Frame- Net. They are trained using supervised or semi-supervised machine learning methods using training data labeled with predicate (word sense) and role labels. Even state-of-the-art systems based on deep learning still rely on a labeled training set. However, despite the success in an experimental setting, the real-world application of SRL methods is still prohibited by severe coverage problems (lexicon coverage problem) and lack of domain-relevant training data for training supervised systems (domain adaptation problem). These issues apply to English, but are even more severe for other languages, for which only small resources exist. The goal of this thesis is to develop knowledge-based methods to improve lexicon coverage and training data coverage for SRL. We use linked lexical knowledge bases to extend the lexicon coverage and as a basis for automatic training data generation across languages and domains. Links between lexical resources have already been previously used to address this problem, but the linkings have not been explored and applied at a large scale and the resulting generated training data only contained predicate (word sense) labels, but no role labels. To create predicate and role labels, corpus-based methods have been used. These rely on the existence of labeled training data as sources for label transfer to unlabeled cor- pora. For certain languages, like German or Spanish, several lexical knowledge bases, but only small amounts of labeled training data exist. For such languages, knowledge-based methods promise greater improvements. In our experiments, we target FrameNet, a lexical-semantic resource with a strong focus on semantic abstraction and generalization, but the methods developed in this thesis can be extended to other models of predicate argument structure, like VerbNet and PropBank. This iii work presents the first knowledge-based method for training data generation that covers both predicate and role labels. The main contributions of this work are the following: • First, we lay the foundations for the improvement of FrameNet coverage through the development of standardized models of interoperability for semantic knowledge bases like FrameNet, specifically modeling FrameNet in the standard-compliant lexicon model UBY-LMF, and through modeling links between lexical knowledge bases on the level of predicate argument structure, i.e., the level of frames and roles. • Second, we develop a novel method for creating a FrameNet in any language based on the automatic alignment of FrameNet and Wiktionary. The method is evaluated on the example of German, effectively building a larger FrameNet knowledge base to improve FrameNet semantic role labeling for English and German. • Third, we present the application of DistantSRL, a novel knowledge-based distant supervision method for the creation of frame- and role-labeled training data, which we evaluate for English and German. We find that the resulting training data are of high quality and complementary to the manually labeled data, i.e., the FrameNet fulltext corpus and SALSA. • Fourth, we assess the domain generalization capabilities of open-source FrameNet SRL and analyze how DistantSRL can contribute to the adaptation of FrameNet semantic role labeling to new domains, including the domain of user-generated dis- course. Therefore, we create a new, substantially-sized test dataset based on English texts from the community question-and-answer forum Yahoo! Answers. We find that domain adaptation is required for the predicate labeling step of FrameNet SRL. Our experiments include a comparison of DistantSRL to alternative methods for training data generation for English, e.g., paraphrasing and monolingual annotation projec- tion. We find that our automatically generated training data contribute to best results on a range of out-of-domain test sets. Finally, we summarize the main findings of our work and discuss open research ques- tions that result from our work. Results show that our method creates high-quality training data that complement the available FrameNet data for English and German. However, our automatically generated data is noisy and may require methods for training SRL systems that better deal with the large and noisy generated training data, for instance byusing additional domain adaptation methods. We point out possible directions and recommendations for future work, e.g., the automatic linking of semantic knowledge bases on the level of predicate argument structure to improve the coverage of our automatically labeled data, and the further integration of FrameNet-like resources with large semantic knowledge bases like Wikidata with the goal to extend the ontology coverage of FrameNet. iv Zusammenfassung Die automatische Annotation semantischer Rollen (Semantic Role Labeling, kurz SRL) ist eine Methode der automatischen Textanalyse, die auf der syntaktischen Analyse aufbaut und syntaktische Argumente um Annotationen ihrer semantischen Funktion ergänzt. Die syntaktische Funktion Subjekt erhält so beispielsweise die semantische Funktion, oder semantische Rolle, Agent . Frühere Arbeiten zeigen, dass Semantic Role Labeling eingesetzt werden kann um verschiedene Anwendungen, die semantische Informationen vor- aussetzen, zu verbessern. Beispiele sind das automatische Beantworten von Fragen (Ques- tion answering), die Informationsextraktion (Information extraction) oder die automatische Textzusammenfassung (Summarization). Systeme für die automatische Rollen-Annotation nutzen üblicherweise ein theoreti- sches Modell semantischer Prädikat-Argument-Struktur, das in lexikalischen Wissensba- sen wie PropBank oder FrameNet implementiert ist. Diese Modelle weisen semantischen Prädikaten, zumeist Verben, eine Lesartenannotation (Word Sense) zu, und annotieren (oft abhängig von der Lesart) syntaktische Argumente der Prädikate mit semantischen Rollen. Überwachte oder teilüberwachte Verfahren des Maschinellen Lernens werden auf entsprechend annotierten Trainingsdaten angewendet, um automatische Systeme zur Anno- tation der Prädikat-Argument-Strukturen zu trainieren. Auch Systeme, die dem neuesten Stand der Forschung entsprechend Deep Learning einsetzen, benötigen annotierte Trai- ningsdaten. Diese üblicherweise von Experten manuell annotierten Datensätze zu produ- zieren ist sehr aufwändig. Die mangelnde Abdeckung der Vielfalt natürlicher Sprache durch die Trainingskorpora (mangelnde Lexikonabdeckung) ist ein Grund dafür, dass Systeme für die automatische Annotation semantischer Rollen zwar in Laborexperimenten erfolgreich sind, in praktischen Anwendungen jedoch noch nicht umfassend eingesetzt werden kön- nen. Ein weiterer Grund ist der Mangel an Trainingsdaten für verschiedene Textarten oder Genres, auch Domänen genannt, denn trainierte Systeme müssen auf neue Genres, für die sie eingesetzt werden sollen, angepasst werden (Domänenadaption). Diese beiden Proble- me bestehen für das Englische, sind jedoch noch stärker ausgeprägt für andere Sprachen, v für die es nur wenige, kleine Ressourcen mit semantischen Rollen, also lexikalische Wis- sensbasen und annotierte Korpora, gibt. Das Forschungsziel dieser Arbeit ist die Entwicklung wissensbasierter Methoden, mit denen die Lexikonabdeckung und Abdeckung mit Trainingsdaten für die automatische An- notation semantischer Rollen verbessert werden kann, sowohl für neue Sprachen als auch für neue Genres. Die Verlinkung lexikalischer Wissensbasen auf der Ebene von Word Sen- se und semantischer Prädikat-Argument-Struktur dient als Grundlage für die automatische Generierung von Trainingsdaten mit Lesarten und semantischen Rollen für verschiedene Sprachen und Genres. Verlinkte lexikalische Wissensbasen wurden bereits in früheren Ar- beiten mit dem Ziel eingesetzt, die Lexikonabdeckung zu erhöhen. Diese Arbeiten nutzen jedoch nicht die umfassende Verlinkung mehrerer Wissensbasen. Zudem erstellen sie zwar neue Lesartenannotationen, nicht jedoch vollständige Prädikat-Argument-Strukturen, die eine schwierigere

Load more