Department of Informatics University of Fribourg (Switzerland)

Neural Machine Reading for Domain-Specific Text Resources

THESIS presented to the Faculty of Science and Medicine of the University of Fribourg (Switzerland) in consideration for the award of the academic grade of Doctor of Philosophy in Computer Science

by

Sebastian Arnold

from

Berlin, Germany

Thesis Nº 2220 Print Seven 2020

Accepted by the Faculty of Science and Medicine of the University of Fribourg (Switzerland) upon the recommendation of:

Prof. Dr. Philippe Cudré-Mauroux, University of Fribourg (Switzerland), thesis supervisor;
Prof. Dr.-Ing. habil. Alexander Löser, Beuth University of Applied Sciences Berlin (Germany), thesis supervisor;
Prof. Dr.-Ing. Laura Dietz, University of New Hampshire (USA), external examiner;
Prof. Dr. Ulrich Ultes-Nietsche, University of Fribourg (Switzerland), president of the jury.

Fribourg, October 12, 2020

Thesis supervisor: Prof. Dr. Philippe Cudré-Mauroux
Thesis supervisor: Prof. Dr.-Ing. habil. Alexander Löser

Dean

Prof. Dr. Gregor Rainer

Declaration of Authorship

Title: “Neural Machine Reading for Domain-Specific Text Resources”

I, Sebastian Arnold, declare that I have authored this thesis independently, without illicit help, that I have not used any other than the declared sources/resources, and that I have explicitly marked all material which has been quoted either literally or by content from the used sources.

Date: 08.09.2020


Die Natur muß gefühlt werden, wer nur sieht und abstrahirt, kann ein Menschenalter, im Lebensgedränge der glühenden Tropenwelt, Pflanzen und Thiere zergliedern, er wird die Natur zu beschreiben glauben, ihr selbst aber ewig fremd sein.

— Alexander von Humboldt, an Johann Wolfgang von Goethe, Paris 3. Januar 1810.


Acknowledgements

The journey towards this PhD thesis not only required me to grow a rough idea into a larger vision. It also required me to step back, again and again, and focus on precise scientific contributions. Most of all, however, it required me to ask the people around me for help.

In particular, I would like to thank my supervisors Alexander Löser and Philippe Cudré-Mauroux. Alexander has guided and facilitated my research since the very start of my Bachelor’s thesis at TU Berlin, and will continue to be a signpost for my own visions. Philippe helped me to establish the groundwork for my contributions and responded to all my concerns and questions on the way with comprehensive support from Fribourg.

I am very fortunate to be in the company of many supportive mentors from the Data Science Research Center at Beuth University of Applied Sciences Berlin. Felix A. Gers organized our journal club discussions, provided methodical soundness and the right amount of appreciation in any situation; Peter Tröger not only established an excellent technical infrastructure that made it possible to concentrate on my work, but also contributed precise comments on the group’s visions; Amy Siu helped me many times with concise feedback for my research proposals.

Most of this work would not have been possible without my coauthors, students, discussion partners and colleagues Torsten Kilias, Robert Dziuba, Rudolf Schneider, Christopher Kümmel, Robin Mehlitz, Tom Oberhauser, Betty van Aken, Benjamin Winter, Paul Grundmann and Michalis Papaioannou (in order of appearance), as well as the entire DATEXIS and eXascale Infolab teams. I would further like to thank Djellel E. Difalla for his support in the first phase of this thesis; Iryna Gurevych for introducing me to the Natural Language Processing community; and Laura Dietz for being a member of my committee and providing her constructive feedback during the writing of this thesis.

I would like to express my personal gratitude to Franziska, Leonora and my family, Ms. Bräutigam, my friends and fellow musicians, who provided me with emotional grounding and creative space for the ups and downs during these five years. I would like to thank everyone who accompanied me along this way and made this journey possible.


Abstract

The vision of Machine Reading is to automatically understand written text and transform the contained information into machine-readable representations. This thesis approaches this challenge in particular in the context of commercial organizations. Here, an abundance of domain-specific knowledge is frequently stored in unstructured text resources. Existing methods often fail in this scenario, because they cannot handle heterogeneous document structure, idiosyncratic language, spelling variations and noise. Specialized methods can hardly overcome these issues and often suffer from recall loss. Moreover, they are expensive to develop and often require large amounts of task-specific labeled training examples.

Our goal is to support the human information-seeking process with generalized language understanding methods. These methods need to eliminate expensive adaptation steps and must provide high error tolerance. Our central research question focuses on capturing domain-specific information from multiple levels of abstraction, such as named entities, latent topics, long-range discourse trajectory and document structure. We address this problem in three central information-seeking tasks: Named Entity Linking, Topic Modeling and Answer Passage Retrieval. We propose a collection of Neural Machine Reading models for these tasks. Our models are based on the paradigm of artificial neural networks and utilize deep recurrent architectures and end-to-end sequence learning methods.

We show that automatic language understanding requires a contextualized document representation that embeds the semantics and skeletal structure of a text. We further identify key components that allow for robust word representations and efficient learning from sparse data. We conduct large-scale experiments in English and German to show that Neural Machine Reading can adapt with high accuracy to various vertical domains, such as geopolitics, automotive, clinical healthcare and biomedicine. This thesis is the first comprehensive research approach to extend distributed language models with complementary structure information from long-range document discourse. It closes the gap between symbolic Information Extraction and Information Retrieval by transforming both problems into continuous vector space representations and solving them jointly using probabilistic methods. Our models can fulfill task-specific information needs on large domain-specific text resources with low latency. This opens up possibilities for interactive applications to further evolve Machine Reading with human feedback.


Zusammenfassung

Machine Reading ist die Vision, Text automatisiert zu verstehen und in maschinenlesbare Form zu überführen. Die vorliegende Dissertation nimmt sich dieses Problems an und legt dabei besonderes Augenmerk auf die Anwendung in Unternehmen. Hier wird häufig eine große Fülle domänenspezifischen Wissens in Form von unstrukturierten Textdaten vorgehalten. Existierende Methoden der Informationsextraktion weisen in diesem Szenario erhebliche Mängel auf. Häufige Fehlerquellen sind heterogene Dokumente, eigentümliche Sprache, abweichende Schreibweisen und verrauschte Daten. Selbst spezialisierte Methoden können diese Herausforderungen nur mit eingeschränkter Trefferquote bewältigen. Zusätzlich sind sie kostspielig in der Entwicklung und benötigen oft große Mengen an annotierten Trainingsdaten.

Unser Ziel ist es, den Anwender im Prozess der Informationssuche mit maschinellen Sprachverständnismethoden zu unterstützen. Diese Methoden sollen kostenintensive Anpassungsschritte vermeiden und müssen eine hohe Fehlertoleranz aufweisen. Unsere zentrale Forschungsfrage richtet sich darauf, domänenspezifische Information auf mehreren Abstraktionsebenen zu erfassen. Dies umfasst u.a. die Identifikation von Objekten, latenten Themenverteilungen, Diskursverläufen und Dokumentenstruktur. Im Fokus stehen dabei drei zentrale Prozessschritte der Informationssuche: Named Entity Linking, Topic Modeling und Answer Passage Retrieval. Die vorliegende Arbeit stellt für diese Zwecke eine Sammlung neuronaler Machine Reading Modelle vor. Auf der Grundlage von künstlichen neuronalen Netzen werden hierfür insbesondere Verfahren des tiefen und sequenzbasierten Lernens genutzt.

Das zentrale Ergebnis dieser Arbeit ist eine kontextualisierte Dokumentenrepräsentation für automatisiertes Sprachverständnis, welche in verdichteter Form die Semantik und Grundstruktur eines Textes umfasst. Darüber hinaus werden grundlegende Komponenten vorgestellt, die robuste Wortrepräsentation und effizientes Lernen aus spärlichen Daten ermöglichen. Umfassende Experimente in englischer und deutscher Sprache belegen, dass neuronales Machine Reading mit hoher Präzision auf eine Vielzahl vertikaler Domänen anwendbar ist, wie z.B. Geopolitik, Autoindustrie, Gesundheitswesen und Biomedizin. Diese Dissertation ist der erste umfassende Forschungsansatz, neuronale Sprachmodelle mit komplementären Strukturelementen auf Dokumentenebene anzureichern. Dieser Ansatz schließt die Lücke zwischen symbolischer Informationsextraktion und Informationssuche, indem beide Probleme in kontinuierliche Vektorraumrepräsentationen übersetzt und durchgängig probabilistisch gelöst werden. So können unternehmensspezifische Informationsbedürfnisse mit schnellen Antwortzeiten erfüllt werden. Dies ermöglicht interaktive Anwendungen, die Machine Reading zukünftig mit Hilfe von menschlichem Feedback verbessern können.


Contents

Declaration of Authorship
Acknowledgements
Abstract
Zusammenfassung
List of Figures
List of Tables
List of Abbreviations

1 Introduction
1.1 The Vision of Machine Reading
1.2 Domain-Specific Language Understanding
1.3 Research Objectives
1.4 Contributions
1.5 Thesis Outline
1.6 Publications

2 Background
2.1 Information Extraction
2.1.1 Text Preprocessing
2.1.2 Shallow Syntactic Parsing
2.1.3 Deep Linguistic Processing
2.1.4 Named Entity Recognition
2.1.5 Named Entity Linking
2.2 Distributed Language Representations
2.2.1 Distributional Hypothesis
2.2.2 Semantic Vector Space Model
2.2.3 Neural Distributed Language Representations
2.2.4 Contextualized Language Models
2.3 Supervised Sequence Learning
2.3.1 Finite State Models
2.3.2 Recurrent Neural Networks
2.3.3 Long Short-Term Memory
2.3.4 Transformer
2.4 Discussion

3 A Robust Model for Efficient Entity Linking
3.1 Introduction
3.1.1 Task Definition
3.1.2 Challenges for Entity Linking
3.1.3 Common Error Analysis
3.2 Entity Linking Model
3.2.1 Robust Word Encoding
3.2.2 Named Entity Recognition
3.2.3 Named Entity Disambiguation
3.3 Evaluation
3.3.1 Evaluation Set-up
3.3.2 Experimental Results
3.4 Discussion and Error Analysis
3.5 Conclusions

4 Coherent Topic Segmentation and Classification
4.1 Introduction
4.1.1 Challenges for Topic Representation
4.1.2 Task Definition
4.2 WikiSection Dataset
4.2.1 Preprocessing
4.2.2 Synset Clustering
4.3 Model Architecture
4.3.1 Sentence Representation
4.3.2 Topic Classification
4.3.3 Topic Segmentation
4.4 Evaluation
4.4.1 Evaluation Set-up
4.4.2 Experimental Results
4.5 Discussion and Insights
4.6 Related Work
4.7 Conclusions

5 Contextualized Document Representations for Answer Retrieval
5.1 Introduction
5.1.1 Challenges for Answer Passage Retrieval
5.1.2 Contextual Discourse Vectors
5.2 Discourse Model
5.2.1 Entity Representation
5.2.2 Aspect Representation
5.2.3 Query Representation

5.3 Contextualized Document Representation
5.3.1 Sentence Encoder
5.3.2 Document Encoder
5.3.3 Passage Scoring
5.3.4 Self-supervised Training
5.4 Evaluation
5.4.1 Evaluation Set-up
5.4.2 Experimental Results
5.5 Discussion and Error Analysis
5.6 Related Work
5.7 Conclusions

6 Systems
6.1 TASTY: Interactive Editor for Entity Linking As-You-Type
6.1.1 Design Challenges
6.1.2 Interactive Entity Linking Process
6.1.3 Application Scenarios
6.2 TraiNER: Bootstrapping Named Entity Recognition
6.2.1 Active Learning Process
6.2.2 Graphical User Interface
6.2.3 Experimental Results
6.3 Smart-MD: Clinical Decision Support System
6.3.1 Demonstration Scenario
6.3.2 Passage Retrieval Process
6.4 CDV Healthcare Answer Retrieval
6.4.1 Demonstration Scenario
6.4.2 Discussion of Healthcare Queries
6.5 Conclusions

7 Conclusion and Future Work
7.1 Contributions of Neural Machine Reading
7.2 Review of Research Questions
7.3 Limitations
7.4 Business Perspectives
7.5 Future Work

Bibliography

Curriculum Vitae


List of Figures

1.1 Human information-seeking process

2.1 Part-of-speech and dependency parse tree
2.2 Mention-entity graph example
2.3 Syntagmatic and paradigmatic relations
2.4 Syntagmatic and paradigmatic neighborhood examples
2.5 CBOW and Skip-gram architectures
2.6 Paragraph Vector architecture
2.7 ELMo and BERT architectures
2.8 Pre-training and fine-tuning procedures for BERT
2.9 Dependency graph of HMM, MEMM and CRF
2.10 RNN architecture
2.11 LSTM memory cell
2.12 BLSTM architecture
2.13 Transformer architecture
2.14 Scaled dot-product attention and multi-head attention

3.1 TASTY robust letter-trigram word encoding architecture
3.2 TASTY BLSTM network architecture for Named Entity Recognition
3.3 Effect of training data size on NER model performance

4.1 WIKISECTION task overview
4.2 Training and inference phases of segmentation and topic classification
4.3 SECTOR neural network architecture
4.4 Deviation graph of SECTOR topic embeddings
4.5 SECTOR topic prediction histogram
4.6 SECTOR topic embedding vector space

5.1 Structured entity/aspect query example
5.2 CDV neural network architecture
5.3 Stages of the passage retrieval process
5.4 CDV model prediction histogram
5.5 Comparison of entity/aspect matching errors
5.6 Comparison of performance across data sources

5.7 Comparison of aspect prediction performance
5.8 Comparison of entity and aspect mismatch error classes

6.1 TASTY graphical user interface
6.2 Interactive Entity Linking process
6.3 TraiNER active learning process
6.4 TraiNER graphical user interface
6.5 Comparison of sampling strategies
6.6 SMART-MD graphical user interface
6.7 Neural network architectures for topic classification and entity recognition
6.8 Visualization of neural topic classification
6.9 CDV graphical search interface

List of Tables

3.1 Experimental results for English news Named Entity Recognition
3.2 Experimental results for domain-specific Named Entity Recognition
3.3 Ablation study for Named Entity Recognition
3.4 Experimental results for English Entity Disambiguation

4.1 Characteristics of WIKISECTION datasets
4.2 Distribution of disease headings in the English Wikipedia
4.3 List of topics in the WIKISECTION dataset
4.4 Experimental results for topic segmentation and single-label classification
4.5 Experimental results for topic segmentation and multi-label classification
4.6 Experimental results for cross-dataset topic segmentation

5.1 Distribution of disease headings in the CDV training set
5.2 Characteristics of CDV training and evaluation data sets
5.3 Experimental results for healthcare Answer Passage Retrieval

6.1 Example scenarios for TASTY’s application
6.2 Examples for CDV healthcare queries and answers


List of Abbreviations

General Abbreviations

e.g. exempli gratia (English: for example)
Eq. Equation
et al. et alia (English: and others)
i.e. id est (English: that is)
K kilo (English: thousand)
M million
RQ Research Question

General Terms and Concepts

AI Artificial Intelligence
CDSS Clinical Decision Support System
CPU Central Processing Unit
CUI Concept Unique Identifier (in UMLS)
DDx Differential Diagnosis
DE German (Deutsch)
EBM Evidence Based Medicine
EDRM Electronic Discovery Reference Model
EHR Electronic Health Record
EN English
ESI Electronically Stored Information
ETL Extract–Transform–Load
GPU Graphics Processing Unit
H&P History and Physical Examination
HCI Human–Computer Interaction
ICD-10 International Classification of Diseases (Version 10)
ID Identifier
ML Machine Learning
NIH National Institutes of Health
NLP Natural Language Processing
OCR Optical Character Recognition

PICO Patient–Intervention–Comparison–Outcome
SCRM Supply Chain Risk Management
UID Unique Identifier
UMLS Unified Medical Language System

Natural Language Processing

BIOES Begin–Inside–Outside–End–Single
BOW Bag-Of-Words
CBOW Continuous Bag-Of-Words
IDF Inverse Document Frequency
IE Information Extraction
IR Information Retrieval
KB Knowledge Base
KBP Knowledge Base Population
LM Language Model
MR Machine Reading
MRC Machine Reading Comprehension
NED Named Entity Disambiguation
NEL Named Entity Linking
NER Named Entity Recognition
NIL Non-Linkable Entity
NL New Line
NLU Natural Language Understanding
OOV Out-Of-Vocabulary
POS Part-Of-Speech
QA Question Answering
RE Relationship Extraction
SRL Semantic Role Labeling
TDT Topic Detection and Tracking
TF Term Frequency
WSD Word Sense Disambiguation

Machine Learning Architectures and Algorithms

ADAM Adaptive Moment Estimation
BPTT Backpropagation Through Time
BLSTM Bidirectional Long Short-Term Memory
CNN Convolutional Neural Network
CRF Conditional Random Fields

DNN Deep Neural Network
FF Feed-Forward Neural Network
HMM Hidden Markov Model
LDA Latent Dirichlet Allocation
LSA Latent Semantic Analysis
LSTM Long Short-Term Memory
MAP Mean Average Precision
MEMM Maximum-Entropy Markov Model
PCA Principal Component Analysis
P@k Precision at Top-k
PMI Pointwise Mutual Information
Prec Precision
Rec Recall
ReLU Rectified Linear Unit
R@k Recall at Top-k
RNN Recurrent Neural Network
SGD Stochastic Gradient Descent
SVD Singular Value Decomposition
SVM Support Vector Machine
TBPTT Truncated Backpropagation Through Time
TSNE T-distributed Stochastic Neighbor Embedding
VSM Vector Space Model

Model and Framework Names

BERT Bidirectional Encoder Representations from Transformers
CDV Contextual Discourse Vectors
DL4J Deep Learning for Java (Deeplearning4j)
ELMo Embeddings from Language Models
GloVe Global Vectors
HAL Hyperspace Analogue to Language
ND4J N-Dimensional Arrays for Java
NLTK Natural Language Toolkit
PV, ParVec Paragraph Vectors
TASTY Tag As-You-Type
W2V Word2Vec Embeddings


Chapter 1

Introduction

This thesis discusses the challenges and methods for Machine Reading. The vision of Machine Reading is to automatically understand unstructured text and represent the contained information in a machine-readable format. This knowledge is then presented to a user to fulfill one or more information-seeking tasks. Automatic language understanding is a challenge in particular for the application in commercial organizations. Here, information resources consist of heterogeneous document types with domain-specific language. In this scenario, off-the-shelf models will likely fail to produce appropriate results. This gives rise to the need for specialized methods, which are expensive to develop and often require large amounts of annotated training data. Therefore, this thesis focuses on the investigation of robust and general methods which can be built efficiently from available data. Our goal is to develop and evaluate these methods over a broad range of document types, vertical domains and information-seeking tasks.

1.1 The Vision of Machine Reading

In corporate information systems, the majority of resources consist of unstructured data, including images, audio, video and text documents [Inmon et al., 2019]. The data produced by consumers and industry is continuously growing, but most of it is not transformed into structured records, which would provide the highest business value. As of 2012, only 3.5% of unstructured data worldwide was analyzed or tagged with meta-data, while 23% of the data could potentially contain valuable information [Gantz and Reinsel, 2012]. This data lake is rapidly growing, and a recent IDC study predicts that by 2025, 80% of worldwide data will be unstructured [Reinsel et al., 2018]. Two of the fastest-growing industries are healthcare, with a projected 36% growth in data production between 2018 and 2025, and manufacturing with 30%. Interestingly, in a recent MAPI survey, 58% of organizations reported a lack of data resources as the most significant barrier to deploying AI solutions [Atkinson and Ezell, 2019]. Moreover, 52% of organizations reported their uncertainty in implementing specific data-based tasks. Hence, organizations have a strong need to transform the information contained in unstructured text resources into universal machine-readable representations of knowledge.

The need for unsupervised language understanding. Traditionally, the goal of Information Extraction (IE) is to automatically transform unstructured sources into structured or semi-structured data formats and process them in a business logic pipeline [Sarawagi, 2008]. A classic architecture for this process is called extract–transform–load (ETL) and is often used to populate the entity-relationship model in data warehousing [Vassiliadis, 2009]. The most common IE techniques for this pipeline originate from research on Natural Language Processing (NLP). These methods focus on the recognition of topics, named entities, e.g. persons or organizations, and their relationships [Sarawagi, 2008]. This is achieved by understanding the syntax of a document as discrete low-level features, such as part-of-speech tags, dependency parse trees or semantic role labels. One inherent problem of this symbolic approach is that the extracted information is represented as isolated ‘nuggets’ [Etzioni et al., 2006]. These need to be discretely processed for downstream tasks such as Named Entity Linking (NEL), Question Answering (QA) or Information Retrieval (IR). Furthermore, the majority of these models are restricted to a specific task on a specific type of language. They are often based on hardcoded rule sets or on parameters learned automatically from hand-labeled training examples [Etzioni et al., 2006]. Thus, expensive human labor is required to transfer them to different languages or more specific domains.

Machine Reading (MR) is the vision to overcome these discrepancies by the “automatic, unsupervised understanding of text” [Etzioni et al., 2006]. Independently of the methodical approach, the envisioned properties of a MR system comprise the coherent formation of beliefs, the inclusion of background knowledge, the ability for implicit inference and scalability towards an arbitrary number of relations [Etzioni et al., 2006]. Ideally, this should be achieved without the need for hand-labeled training examples. To date, such a generalized form of language understanding has not yet been achieved entirely. However, important ingredients appeared in the mid-2010s, when deep neural networks (DNNs) re-emerged as powerful ML models for image processing [Krizhevsky et al., 2012] and core NLP techniques [Goldberg, 2016]. DNNs encode information as dense representations spanning multiple stacked layers of continuous weights. Layers are connected using nonlinear activations. Therefore, these models are able to express complex functions with a large number of dependencies and produce probabilistic outputs rather than discrete decisions. Stochastic gradient descent (SGD) allows DNNs to be optimized with supervised end-to-end training data, eliminating the need for discrete intermediate representations of linguistic syntax.

The DNN paradigm has changed the construction process of IE models drastically, from a symbolic language-oriented perspective into an empirical data-driven discipline. DNNs are highly dependent on large training data sets. But pre-trained representation layers for background knowledge, such as word embeddings or language models, can drastically reduce the iteration times required to build tailor-made models. A large fraction of these representations can be trained using self-supervision. This process generates labels from the text itself without human supervision.
Eventually, Machine Reading constitutes the transformation of information-seeking systems from symbolic IE towards distributed semantic representations.
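As a minimal illustration of this paradigm, the following sketch, assuming the PyTorch library and purely synthetic data (it is not one of the models developed in this thesis), stacks two layers with a nonlinear activation, produces probabilistic outputs and is trained end-to-end with SGD:

import torch
import torch.nn as nn

# Dense representations over stacked layers, connected by a nonlinear activation.
model = nn.Sequential(
    nn.Linear(300, 128),
    nn.ReLU(),
    nn.Linear(128, 5),            # scores for 5 hypothetical output classes
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()   # probabilistic outputs via softmax

x = torch.randn(32, 300)          # e.g. a batch of pre-trained word vectors
y = torch.randint(0, 5, (32,))    # end-to-end task labels

for _ in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)   # no discrete intermediate representations
    loss.backward()               # errors propagate through all layers
    optimizer.step()

The same training loop applies unchanged when the labels are generated from the text itself, which is the essence of self-supervised pre-training.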

FIGURE 1.1: The process of supporting human information seeking with Machine Reading from domain-specific text resources.

Supporting human information seeking with Neural Machine Reading. The information search process is a cyclic activity by a user aiming “to extend his or her state of knowledge on a particular problem” [Kuhlthau, 1991]. Figure 1.1 shows this simplified process in an organization. (1) An expert with an analytical task searches for answers over a large collection of unstructured domain-specific text resources. (2) The MR system formulates its beliefs based on the task, background knowledge and the information contained in the texts. (3) The system presents a coherent result to the user and reasons about its decisions by highlighting the original source context. (4) Finally, the user enters a feedback loop by exploring results, iteratively refining the query, and eventually evaluating the success of the task.

There have been several approaches to describe the complexity of this process. Marchionini [2006] classifies information-seeking activities into three categories: Lookup tasks aim to retrieve discrete well-structured objects with high precision (e.g. known item search, navigation, factoid QA). Learn tasks require cognitive processing to return aggregated results (e.g. comprehension, comparison, interpretation). Finally, investigative tasks are the most complex and require substantial knowledge and integration abilities. These mainly focus on high recall for open-ended search intentions (e.g. discovery, analysis, synthesis, evaluation). With respect to domain-specific applications in the industry, Castro and New [2016] discuss the fundamental functionalities of an artificial intelligence (AI) system as learning, understanding, reasoning and interaction. In this thesis, we focus on three central tasks which cover all complexity levels:

• Named Entity Linking (NEL) aims to recognize mentions of named entities, such as product, organization or disease names, and link them to a structured knowledge base [Bunescu and Pasca, 2006; Hachey et al., 2013; Shen et al., 2015]. Some challenging entity-centric search intentions arise from this scenario. For example: “Which recent news articles mention my product X?”, “Is any of our suppliers mentioned in current risk reports?” or “Show me historical patient records of a cohort suffering from disease Y”. While NEL is a ‘classic’ IE task in the first place, there is a large gap between its current definition as a word-level classification task and its application in a real-world information-seeking activity. For example, mentions might be implicit (e.g. “the former CEO of Apple”), expressed as coreferences (e.g. “our flagship product in 2020”) or can be discussed over a long-range context [Ratinov and Roth, 2009]. Furthermore, many entity types are often overlapping or hierarchically structured [Ling et al., 2015] and it is not clear if a query needs to focus on precise subtypes or more generic supertypes. We will discuss NEL with its challenges and opportunities for a Neural Machine Reading system in Chapter 3.

• Topic Modeling aims to discover the main themes in a document in order to help organize a collection [Blei, 2012]. In contrast to named entities, topics are often not explicitly mentioned in the text. Instead, topics are distributions of typical words used to describe a common aspect, e.g. “product features” or “genetics”. Some examples of topical information-seeking tasks are: “Which recent news articles contain technical product reviews?”, “Which risk reports focus on transportation issues?”, “Show me clinical studies that focus on research related to kidneys.” However, Topic Modeling often focuses on capturing entire documents and neglects an opportunity arising from deeper language understanding: topics might emerge and disappear over the course of long documents, such as research papers that focus on individual aspects in different passages. We will discuss this idea of topic segmentation and classification in Chapter 4.

• Answer Passage Retrieval is the task of identifying an answer passage for a given query from a large number of long documents [O’Connor, 1980; Salton et al., 1993]. In contrast to ‘classic’ QA, this task does not focus on factoid questions with short answers. Instead, questions are typically open-ended and answers consist of one or more passages spanning multiple sentences. Example questions are: “Are there any product recalls from company X?”, “Is supplier Y affected by the transportation issues mentioned in the news?”, “What treatment options are currently being researched for IgA nephropathy?”. Answer Passage Retrieval requires the ability to identify named entities and topics in queries and answers, and to extract and aggregate passages from long documents that contain coherent answers. We will approach this task in Chapter 5.

There are other potential variations of these tasks, such as Relation Extraction [Sarawagi, 2008], Question Answering [Rajpurkar et al., 2016] or Machine Reading Comprehension [Hermann et al., 2015]. We do not specifically address these tasks in this thesis. In general, all discussed information-seeking tasks can utilize Neural MR as a central component, as depicted in step 2 of Figure 1.1. We will focus on this part of the process throughout this thesis. In the next section, we discuss the requirements and properties of such a MR model.

1.2 Domain-Specific Language Understanding

In the information-seeking process, complex problems have to be solved in each of the steps. Especially when applying the tasks to domain-specific text, such as Web blogs, financial risk reports or clinical notes, symbolic IE models will likely fail to produce satisfactory results.

The main reason for this is that language is often ambiguous, and supervised models highly depend on the data they were trained on. Based on prior work on symbolic IE [Löser et al., 2012; Arnold et al., 2014; Maqsud et al., 2014; Arnold et al., 2015], we identify six central challenges, which we describe below. We will revisit these challenges in Section 7.1.

• Domain-specific language. Vocabulary and sentence structure vary strongly between different domains. This is not only the case across different languages (e.g. English or German), but also between different application domains, e.g. geopolitics, manufacturing or healthcare. For example, Wikipedia articles mostly contain generally understandable language. In contrast, financial risk reports or clinical notes might contain a large number of domain-specific abbreviations, idiosyncratic terms and even sentences that do not follow the syntactical rules of language. We require a MR model to be adaptive to domain-specific language. A model trained for one task must be transferable to perform the same task on a different domain without having to rebuild it from scratch.

• Variations and noise. Texts may contain spelling errors, morphological variations, incomplete sentences or noise introduced from technical processing, e.g. optical character recognition (OCR) or Web scraping. We require a MR model to be robust towards these variations. The model should prioritize recall over precision, and local errors should not affect the model’s performance in the surrounding context.

• Heterogeneous document structure. Text resources have a high variance in terms of structure. Documents might be very long (e.g. research articles), medium (e.g. news articles) or short (e.g. tweets), strongly structured (e.g. case studies), diverse (e.g. blog articles) or flat (e.g. transcriptions). It is not always possible to identify coherent structural blocks or topics without understanding the text itself. We require a MR model to perform well on this variety of documents without prior training for a specific type. In principle, the model should be able to grasp the overall idea of a text but be sensitive to local information at the same time.

• High task variance. Information-seeking tasks span a broad range in specificity and complexity. For example, risk analysts expect high recall for detecting a broad range of possible events. Medical specialists may pose very precise queries that align with a predefined taxonomy and focus on rare diseases. Because in many cases we do not know the specificity of a task a priori, it is not feasible to train personalized MR models for specific intentions only. Instead, a well-generalized MR model must support multiple tasks at the same time, in a zero-shot manner or with only a few adaptation steps.

• Insufficient training data. In many cases, there exist unlabeled corpora of domain-specific text (e.g. in corporate ‘data silos’), but we do not have access to large amounts of task-specific labeled training data. Labeling data requires expensive human labor and it is not feasible to create large training sets for each specific combination of tasks and domains. Therefore, we aim to focus on efficient models that can be trained with small amounts of task-specific training data and leverage background knowledge, unsupervised or self-supervised training procedures wherever possible.

• Error propagation. Traditional information processing pipelines are sensitive to recall loss, because discrete decisions in early stages of a pipeline (e.g. part-of-speech tagging) propagate into downstream tasks. Typical errors in these stages arise from the ambiguity of language or irregularities in the text. Often, errors are not recoverable in a later stage. Therefore, a MR system should maintain a consistent differentiable model from end to end. This allows probabilistic beliefs to be pushed up to the final decisions at task level. Furthermore, end-to-end approaches may provide important features for error analysis and enable system-wide correction of errors using backpropagation.

Approach and scope of this thesis. We approach these challenges by proposing a collection of Neural Machine Reading models to address the three central tasks discussed above: Named Entity Linking, Topic Modeling and Answer Passage Retrieval. Our work is built upon the paradigm of deep neural networks. This enables us to design probabilistic models invariant to all languages expressed in Western Latin characters. Furthermore, DNNs allow us to reuse pre-trained components in different architectures. We focus on supervised and self-supervised techniques to train these architectures with end-to-end textual examples, independent of their language or domain. In addition, we address the challenges of robustness and training efficiency throughout this thesis. Consequently, we will show that all of our approaches are applicable to various domain-specific text resources, even with limited training data.

This thesis does not cover the human information-seeking process as shown in Figure 1.1 in its entirety. In particular, we leave out a detailed discussion of the subprocesses (1) intent detection (human expresses information need to the machine) and (4) interactive feedback loop (human communicates degree of task completion). Instead, we focus on the MR component (2) and its ability to produce accurate results (3). Although studying user interaction is beyond the scope of this thesis, domain-specific information seeking requires real-world information needs and user interfaces. Therefore, we demonstrate the applicability of our MR components at the end of this thesis with a description of four prototypical implementations.

1.3 Research Objectives

This thesis addresses the problem of automatic understanding of unstructured text from domain-specific resources. The main hypothesis of this thesis is the following:

Hypothesis. Deep neural networks enable the efficient creation of general Neural Machine Reading models by capturing the distributional properties of text using self-supervision. MR models are able to process domain-specific text resources without expensive adaptation and with high error tolerance. MR enables end-to-end models to fulfill task-specific information needs with high accuracy by incorporating background knowledge, contextual information and task-specific training objectives.

We divide this general problem statement that we refer to as Neural Machine Reading into four subordinate research questions (RQ):

RQ 1. What are general solutions to identify named entities in domain-specific text? Named entities are the smallest information units in classical IE and must be identified with high recall to facilitate effective downstream tasks. In domain-specific text, generic concepts, such as event and disease names, as well as rare and emerging entities, such as new product and brand names, must be recognized precisely. Recognition must not fail because of spelling errors, morphological variations or transliterations in the source text. A general model architecture must be efficiently applicable to a broad range of languages and domains, even with limited or zero training data. Furthermore, it is important to leverage contextual information from sentence, document, corpus and knowledge base to improve disambiguation compared to approaches based on local features.

RQ 2. How can Machine Reading models detect topics and structure in long documents? Long documents are often thoughtfully designed and structured by their authors to guide readers through the text. Established Information Extraction models, which operate on document or sentence level, do not take this structure into account. To bridge the gap between IE and MR, a model must gain an understanding of global and local topics and handle topical shifts in long documents. A MR model must be able to identify coherent passages in the text and assign topic labels or section headings to each of the passages. MR models should be able to capture the context of entire documents, but operate coherently with word or sentence granularity.

RQ 3. How can we embed discourse structure into document representations? Neural models are often built upon distributed representations, which encode contextual information learned from large corpora of text. We aim to extend these representations with rich information about the discourse structure of a document. Specifically, we investigate if neural MR models benefit from contextualized document representations that embed information about entities and topical aspects. We raise the question if these representations can complement pre-trained word and sentence embeddings with missing information from long-range context, such as introduction passages, coreferences, entity mentions, document titles and section headings. We expect from a document representation a complete and coherent semantic interpretation, comparable to a human reader who aims to understand the meaning of a text.

RQ 4. How effective are document representations for retrieving answer passages? Machine Reading models are designed to support human information-seeking tasks, e.g. retrieving answers from long documents, with high recall. We expect from a MR model that it does not rely on task-specific training data, but instead is able to form its beliefs based on unsupervised training data from generic sources, e.g. Wikipedia text, and distributional background knowledge. Specifically, we aim to understand how we can effectively apply the internal representations of a neural MR model, such as contextualized document embeddings, to nearest-neighbor search tasks. The representations must be utilizable in a wide range of tasks from similar domains, and should perform equally well as supervised ML models that are trained with explicit examples.

In this thesis, we develop methods that address all four RQs, focusing on three central tasks: Named Entity Linking, Topic Modeling and Answer Passage Retrieval.

1.4 Contributions

The main contribution of this thesis is the application of Neural Machine Reading to the problem of human information seeking across domain-specific text resources. We investigate this problem with respect to the research objectives posed in Section 1.3. The outcomes of our investigation are condensed into three main systems, which provide the following theoretical, practical and empirical contributions:

TASTY Named Entity Recognition and Linking [Arnold et al., 2016b] (Chapter 3)

• We analyze design challenges and common errors for Named Entity Recognition (NER) and Disambiguation (NED) when applied to domain-specific text resources (Section 3.1).

• We compare distributed word embeddings and character-trigram based word encodings and show that trigrams are a robust and efficient representation for domain-specific text (Sections 3.2.1 and 3.1.3).

• We present the TASTY NER model based on trigram word encodings and Bidirectional Long Short-Term Memory (BLSTM) networks with state-of-the-art NER performance on CoNLL2003 and five other English and German news datasets (Sections 3.2.2 and 3.3).

• We show that TASTY exceeds performance of other pretrained baselines in scenarios with small available training data, such as German car forum discussions and English biomedical text (Section 3.3).

• We present the TASTY NED model based on k-nearest neighbor search over entity embeddings and show that this neural model consistently scores high on four standard English datasets (Sections 3.2.3 and 3.3).

• We present the TASTY editor, an interactive application for Entity Linking as-you-type (Section 6.1).

• We present TraiNER, a process for bootstrapping domain-specific Named Entity Recog- nition using seed lists and active learning (Section 6.2).

SECTOR Topic Segmentation and Classification [Arnold et al., 2019] (Chapter 4)

• We analyze challenges for Machine Reading and in particular show that structural topic information in long documents is not represented by current document representation models (Section 4.1).

• We introduce WIKISECTION, a dataset for the task of topic segmentation and classification in German and English for two specific domains: healthcare (diseases) and geopolitics (cities) (Section 4.1.2).

• We compare word-based and distributed sentence encodings and show that Bloom filters are an efficient sentence representation in a topic classification task (Section 4.3.1).

• We present SECTOR, a Neural Machine Reading architecture based on BLSTMs, which encodes structure and topic information much better than existing document embeddings (Sections 4.3.2 and 4.4).

• We propose to measure the deviation of SECTOR embeddings to identify topic shifts (Section 4.3.3).

• We show that SECTOR embeddings can be used to segment and classify passages into 25–30 domain-specific topics with high accuracy (Section 4.4).

• We present SMART-MD, an interface for clinical decision support based on the SECTOR Machine Reading architecture (Section 6.3).

CDV Contextual Discourse Vectors for Answer Retrieval [Arnold et al., 2020] (Chapter 5)

• We analyze challenges for MR-based Information Retrieval and identify task coverage, domain adaptability, contextual coherence and retrieval efficiency as important requirements (Section 5.1).

• We present CDV, a contextual discourse vector representation based on pre-trained language models and BLSTMs, which fulfills these requirements (Section 5.3).

• We propose to apply a multi-task CDV architecture for Answer Passage Retrieval based on entity and aspect embeddings (Section 5.2).

• We show that CDV can be trained with self-supervised data available from Wikipedia (Section 5.3.4).

• We adapt CDV for the healthcare domain and show that it is the most effective model when applied to three answer retrieval tasks without retraining (Section 5.4).

• We empirically analyze errors of the CDV model and propose how to address them in future work (Section 5.5).

• We present CDV Healthcare Answer Retrieval, a search interface for medical research papers based on the CDV Machine Reading architecture (Section 6.4).

To integrate our practical contributions, we open source all code of TASTY, SECTOR and CDV in a common Java framework TeXoo1 released under the Apache V2 license.

1 https://github.com/sebastianarnold/TeXoo

1.5 Thesis Outline

This thesis is structured around the vision of Machine Reading and three main systems that approach specific tasks for MR over domain-specific text resources. Here, we give an overview of each chapter:

Chapter 1: Introduction. We motivate the necessity for natural language understanding in the context of the human information-seeking process. We introduce the vision of Machine Reading to approach the challenges arising from high variance between specific languages, domains and tasks, and missing training data for domain-specific tasks. We divide this problem into four research questions, which we answer in the course of this thesis.

Chapter 2: Background. We discuss existing work from the literature that forms the groundwork for our automatic language understanding objective. We summarize the idea of distributed language representations, which are the fundamental background for our Neural Machine Reading approach. We further introduce specific methodical approaches for language-related sequence learning using deep neural networks.

Chapter 3: A Robust Model for Efficient Entity Linking. We identify design challenges and common errors specific to the task of Named Entity Recognition and Linking for domain-specific text. The main requirements of this task are robustness against spelling variations, effective consideration of contextual long-range information and efficient model creation from limited amounts of training examples. We introduce TASTY, an efficient sequence learning model for NER and NEL based on letter-trigram word encoding, Long Short-Term Memory and entity embeddings.

Chapter 4: Coherent Topic Segmentation and Classification. We broaden our view towards understanding document structure. We observe that documents often contain coherent passages that locally focus on a specific topic. We address the problem of identifying section boundaries in long documents and classifying each section into one of up to 30 domain-specific topics. We present SECTOR, a Neural Machine Reading architecture that learns a sentence-level latent topic embedding from training documents using Bidirectional Long Short-Term Memory and Bloom filter sentence encoding. We show that this embedding embodies all necessary information for the task of topic segmentation and classification.

Chapter 5: Contextualized Document Representations for Answer Retrieval. We build upon the SECTOR Neural Machine Reading model and extend it to cover named entities and entity-specific aspects. Our contextual discourse representation (CDV) can be trained with self-supervised data from Wikipedia and background knowledge from distributed entity and aspect embeddings. We apply the CDV model to human information-seeking tasks in the healthcare domain over nine different English text resources without additional training data.

Chapter 6: Systems. Our models TASTY, SECTOR and CDV are based on a common code base TeXoo, which we release under an open source license. The models have been applied to various research prototypes and systems in industry. In this chapter, we present four applications which cover the human information-seeking process as a whole.

Chapter 7: Conclusion and Future Work. We conclude this thesis with a summary and discussion of our contributions, and a reflection on our research objectives and limitations. Additionally, we present potential business applications and perspectives for Neural Machine Reading in future research.

1.6 Publications

This thesis is based on the following peer-reviewed articles (in the order of publication):

1. S. Arnold, F. A. Gers, T. Kilias, and A. Löser [2016b]. “Robust Named Entity Recognition in Idiosyncratic Domains”. In: arXiv:1608.06757 [cs.CL]2

2. S. Arnold, R. Dziuba, and A. Löser [2016a]. “TASTY: Interactive Entity Linking As-You- Type”. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, pp. 111–115

3. R. Schneider, S. Arnold, T. Oberhauser, T. Klatt, T. Steffek, and A. Löser [2018]. “Smart-MD: Neural Paragraph Retrieval of Medical Topics”. In: The Web Conference 2018 Companion. IW3C2, pp. 203–206

4. S. Arnold, R. Schneider, P. Cudré-Mauroux, F. A. Gers, and A. Löser [2019]. “SECTOR: A Neural Model for Coherent Topic Segmentation and Classification”. In: Transactions of the Association for Computational Linguistics 7, pp. 169–184

5. S. Arnold, B. van Aken, P. Grundmann, F. A. Gers, and A. Löser [2020]. “Learning Con- textualized Document Representations for Healthcare Answer Retrieval”. In: Proceedings of The Web Conference 2020. ACM, pp. 1332–1343

6. J.-M. Papaioannou, S. Arnold, F. A. Gers, A. Löser, M. Mayrdorfer, and K. Budde [2020]. “Aspect-Based Passage Retrieval with Contextualized Discourse Vectors”. In: [IN SUBMISSION] COLING 2020 System Demonstrations3

The following articles have been published prior to this thesis. They describe classic approaches of symbolic Information Extraction and have led to the motivation of this work:

1. A. Löser, S. Arnold, and T. Fiehn [2012]. “The GoOLAP Fact Retrieval Framework”. In: Business Intelligence. Springer, pp. 84–97

2 Due to many concurrent submissions on LSTM-based NER in 2016, we eventually published the final version of this article without peer review. Lample et al. [2016] provide a good overview of concurrent work.
3 The paper is currently under review.

2. S. Arnold, D. Burke, T. Dörsch, B. Loeber, and A. Lommatzsch [2014]. “News Visualization Based on Semantic Knowledge.” In: International Semantic Web Conference (Posters & Demos), pp. 5–8

3. U. Maqsud, S. Arnold, M. Hülfenhaus, and A. Akbik [2014]. “Nerdle: Topic-Specific Question Answering Using Wikia Seeds”. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations, pp. 81–85

4. S. Arnold, A. Löser, and T. Kilias [2015]. “Resolving Common Analytical Tasks in Text Databases”. In: Proceedings of the ACM Eighteenth International Workshop on Data Warehousing and OLAP (DOLAP). ACM, pp. 75–84

Chapter 2

Background

In this chapter, we provide background information on the fundamental techniques relevant for this thesis. We begin in Section 2.1 by introducing the traditional approaches for Information Extraction based on discrete syntactic processing of text. In Section 2.2 we discuss the underlying idea of distributed language representations, such as the Vector Space Model, topic models, neural word embeddings and contextualized language models. In Section 2.3, we provide details on methods for sequential Machine Reading, such as finite-state machines, Recurrent Neural Networks, Long Short-Term Memory and Transformer architectures.

2.1 Information Extraction

Knowledge is often expressed by humans in the form of written text, which is technically an unstructured representation that varies in style and format across different use cases. Typical text resources are news stories, chat messages, citations, technical documents or encyclopedia articles. Information Extraction (IE) describes the automatic transformation of these sources into structured or semi-structured representations, such as tuples, tables, lists, graphs or hierarchies. The main goal of this data integration step is to enable Information Retrieval (IR) using structured queries [Sarawagi, 2008].

Typically, IE is applied to short records with known boundaries, e.g. citations, tweets or individual sentences. Additionally, many extraction tasks require considering the context of associated paragraphs (e.g. headings, abstracts, snippets) or entire documents (e.g. news articles, discussion threads). To allow IE models to identify and process these elements, they often depend on Natural Language Processing (NLP) pipelines, which enrich plain text with linguistic or layout information using rule-based or statistical methods [Sarawagi, 2008]. In this section, we briefly describe classic approaches to IE which are based on this pipeline paradigm.

There exists a large variety of NLP pipeline implementations that come with pre-trained models for the English language, such as Natural Language Toolkit (NLTK) [Loper and Bird, 2002], UIMA [Ferrucci and Lally, 2004], GATE [Cunningham et al., 2002], Stanford CoreNLP [Manning et al., 2014], Apache OpenNLP1 or spaCy2. These pipelines have in common that they often rely on multiple independent models which are applied sequentially. This makes them sensitive to variations and noise, and it is often not possible to adapt an entire pipeline to a different task or domain. The outcome is that most of these models do not meet our requirements for a Machine Reading model posed in Section 1.2. Next, we discuss the basic stages in the NLP pipeline in order to learn about their strengths and shortcomings.

1 https://opennlp.apache.org
2 https://spacy.io

2.1.1 Text Preprocessing

NLP models typically require a token-based input representation. The following three tasks are commonly applied in a preprocessing pipeline. Their aim is to transform the text given as one long character string into smaller normalized chunks that are later used for processing [Jurafsky and Martin, 2019]:

• Tokenization is the task of segmenting running text into words or word-piece tokens. Commonly used standards are the Penn Treebank standard released by the Linguistic Data Consortium [Marcus et al., 1993], or the word-piece tokens used in more recent Transformer models [Devlin et al., 2019]. A tokenizer needs to execute very fast and handle the ambiguity of words or special characters such as dots, apostrophes or hyphens. Therefore, the standard methods for tokenization use carefully designed deterministic algorithms based on regular expressions and finite state automata. Tokenization is more complex in languages such as Chinese, where word boundaries are not usually expressed by spaces [Jurafsky and Martin, 2019].

• Word normalization is the task of transforming word tokens into a standard normal form. While feature-based models may benefit from this reduction of information, recent neural models are often able to handle larger variances or even generalize better with non-normalized input. Typical practices for word normalization are case folding (mapping input to lower case) and lemmatization/stemming (normalizing words to their base form) [Jurafsky and Martin, 2019]. Furthermore, traditional NLP methods often apply stopword removal, which deletes the most common function words, such as articles and possessives, before processing.

• Sentence splitting is the task of segmenting text into individual sentences. This step is often important, because many IE models are based on tokenized sentences as input. The most important cues for this task are punctuation (e.g. periods or question marks) and newline characters. Most models address sentence splitting and word tokenization jointly by deciding whether a period is part of a word or it is used as a sentence boundary marker. These methods are often based on deterministic models that include dictionaries of common abbreviations [Jurafsky and Martin, 2019].

The principle of word or word-piece tokenization and sentence splitting is not only used in NLP pipelines, but also in more powerful machine learning (ML) frameworks, such as PyTorch Transformers (https://github.com/huggingface/transformers). We will discuss token sequence based ML methods in Section 2.3.
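For illustration, the following minimal sketch applies word-piece tokenization using the transformers library mentioned above. The checkpoint name bert-base-uncased and the example sentence are illustrative choices, and the pre-trained vocabulary is downloaded on first use.

```python
# A minimal sketch of word-piece tokenization with the transformers library.
# The checkpoint "bert-base-uncased" is an example choice; any pre-trained
# vocabulary could be used instead.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Preliminary findings were reported in today's New England Journal of Medicine."
print(tokenizer.tokenize(text))
# e.g. ['preliminary', 'findings', 'were', 'reported', 'in', 'today', "'", 's', ...]
```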


FIGURE 2.1: Part-of-speech and dependency parse result for the example sen- tence “Preliminary findings were reported in today’s New England Journal of Medicine.” generated by the Stanford CoreNLP framework.

2.1.2 Shallow Syntactic Parsing

A second stage in the NLP pipeline is often required to enrich the normalized tokens with linguistic information. For this task, a shallow parse can efficiently generate the necessary information from surface form features. Shallow models are based on syntax information and do not require full semantic understanding of language. The most important tasks are the following:

• Part-of-speech tagging (POS) is the task of assigning each word a grammatical category from a fixed set for a specific language [Jurafsky and Martin, 2019]. The set of tags includes closed class types that contain function words with relatively fixed word memberships, such as determiners (DT), conjunctions (CC), prepositions (IN), pronouns and special character classes, such as parentheses or quotes. Open class types contain words which may continually be created or borrowed, such as proper nouns and names (NNP), common nouns (NN), verbs (VB), adjectives (JJ) or adverbs (RB). Typically, tags are further divided for plural nouns (NNS, NNPS) and verb inflections (VB*). An important set of tags for English is contained in the 45-class Penn Treebank [Marcus et al., 1993]. POS taggers are often implemented using sequence learning algorithms (see Section 2.3).

• Constituency parsing is the task of grouping the words in a sentence into abstract phrase types, such as noun phrases, verb phrases or prepositional phrases [Sarawagi, 2008]. This operation is typically based on a parse tree that is generated by a context-free grammar over the POS tags of a sentence. The grammar rules can be learned from annotated treebanks. Often, constituency parse trees provide useful information for tasks such as entity or relation extraction.

• Dependency parsing is the task of identifying words in a sentence that form arguments of other words in the sentence [Sarawagi, 2008]. In contrast to constituency parsing, this task is based on a graph that describes the syntactic structure of a sentence in terms of the binary grammatical relations among its words, e.g. nominal subject, conjunction, direct or indirect object. Dependency parsers can be learned from large annotated dependency treebanks, such as Penn Treebank [Marcus et al., 1993], OntoNotes [Hovy et al., 2006] or Universal Dependencies [Nivre et al., 2016], which provides aligned annotations for multiple languages. Dependency trees provide useful information for many applications, such as coreference resolution, Question Answering or IE in general.

Figure 2.1 shows an example parse tree with POS and dependency annotations. This shallow information is often used as features in traditional symbolic IE models. However, recent ML approaches have replaced this extraction step by distributed representations which encode more complex relationships based on empirical linguistic knowledge. Distributed models are therefore less sensitive to noise and variance than symbolic representations. We will discuss these approaches in Section 2.2.
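As an illustration of such a pipeline, the following sketch produces POS tags and dependency relations for the example sentence of Figure 2.1 with spaCy; it assumes the small English model en_core_web_sm has been installed, and the attribute names follow the spaCy API.

```python
# A minimal shallow parsing sketch with spaCy (assumes the model
# "en_core_web_sm" was installed, e.g. via `python -m spacy download`).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Preliminary findings were reported in today's New England Journal of Medicine.")

for token in doc:
    # surface form, part-of-speech tag and dependency relation to the head word
    print(f"{token.text:12} {token.pos_:6} {token.dep_:10} head={token.head.text}")
```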

2.1.3 Deep Linguistic Processing

In addition to shallow syntactic parsing, many IE downstream tasks require a deeper semantic understanding of the text. This is necessary because propositions can be expressed using a large variety of words, which themselves are often ambiguous. In traditional NLP, this is done by further enriching the extractions with semantic annotations. Often, these tasks are not included in the preprocessing pipeline, but constitute a downstream task on their own. We briefly highlight the most important tasks:

• Word sense disambiguation (WSD) is the task of determining the sense of a word used in a particular context [Jurafsky and Martin, 2019]. Word senses are often defined as glosses in a dictionary, e.g. the word “mouse” can refer to “any small animal of various rodent and marsupial families” or “a palm-sized pointing device for a computer system”. The most commonly used resource for multi-lingual WSD is the WordNet lexical database [Fellbaum, 1998], which contains a large set of concept lemmas (nouns, verbs, adjectives and adverbs), each annotated with a set of senses. These senses are further clustered into synsets, which are sets of near-synonyms for a single concept. The goal of WSD is to assign a sense ID to each word in a text. WordNet has been further extended in BabelNet [Navigli and Ponzetto, 2012] with named entities from various sources, such as Wikipedia.

• Semantic role labeling (SRL) aims to capture the semantic relationships between a verb and noun arguments [Jurafsky and Martin, 2019]. For example, the verbs “sold”, “bought” or the noun “purchase” may refer to an acquisition event that holds roles like “agent” (the buyer), “theme” (the item to be bought) or “provider” (the seller). The task of SRL is to automatically find the semantic role of each predicate’s argument in a sentence. SRL algorithms often depend on features such as the governing predicate, phrase type of each constituent, headword of each constituent and the path in the parse tree from each constituent to the predicate. Although there is no universally agreed-upon set of roles, there are important resources, such as PropBank [Kingsbury and Palmer, 2002] and FrameNet [Baker et al., 1998] that hold large numbers of fine-grained roles that can be used to train and evaluate SRL models.

• Coreference resolution is the task of determining whether two mentions corefer, which means that they refer to the same discourse entity [Jurafsky and Martin, 2019]. The following example from Jurafsky and Martin [2019] shows a coreference chain, which is a set of expressions that contains the initial entity mention (“Victoria”), pronouns (“she”, “her”) and other anaphora (“the 38-year-old”):

Example 2.1. [Victoria Chen]1, CFO of Megabucks Banking, saw [her]1 pay jump to $2.3 million, as [the 38-year-old]1 became the company’s president. It is widely known that [she]1 came to Megabucks from rival Lotsabucks.

An important property of referring expressions and their referents is that they must agree in number, person, gender or noun class. Coreference resolution is an important component of natural language understanding. It connects the symbolic token representations of the text with the mental discourse model [Karttunen, 1976] that a human reader builds incrementally when interpreting it. In some preprocessing pipelines, coreference information is resolved by replacing all referents with the initial mention.

Similar to shallow parsing, deep linguistic processing is today often replaced by distributional language modeling techniques. However, word senses, semantic roles and coreferences are still important latent features for language understanding. It remains an open challenge to include this information effectively in end-to-end models.

2.1.4 Named Entity Recognition

A central aspect of word sense disambiguation and coreference resolution tasks is that many mentions in a discourse refer to real-world instances of named entities, such as persons, companies or locations. The task of Named Entity Recognition (NER) is to identify the boundaries of named entity mentions in text and assign each mention a type. This task was originally introduced at MUC-6 [Grishman and Sundheim, 1996] and was later adapted in MUC-7 [Chinchor and Robinson, 1997] and the CoNLL-2003 shared task [Kim et al., 2004]. The objective is to classify each word in a document into one of four generic entity types, or none: PER (person names), LOC (locations), ORG (company and organization names) or MISC (miscellaneous). There exist similar schemes for domain-specific NER, e.g. gene, protein, drug and disease names in the biomedical domain [Kim et al., 2003] or a large number of fine-grained types [Ling and Weld, 2012]. Ratinov and Roth [2009] have illustrated the main challenges for this task based on the following example:

Example 2.2. SOCCER - [BLINKER]PER BAN LIFTED.

[LONDON]LOC 1996-12-06 [Dutch]MISC forward had his indefinite suspension lifted by [FIFA]ORG on Friday and was set to make his [Sheffield Wednesday]ORG comeback against [Liverpool]ORG on Saturday. [Blinker]PER missed his club’s last two games after [FIFA]ORG slapped a worldwide ban on him for appearing to sign contracts for both [Wednesday]ORG and [Udinese]ORG while he was playing for [Feyenoord]ORG.

Challenges for NER. A NER model requires prior knowledge about entity names and type assignments (e.g. that “Udinese” is a soccer club). It has to make decisions based on non-local features such as context words (e.g. the overall topic “SOCCER”), linguistic patterns (e.g. “for both [X]ORG and [Y]ORG”) or coreferences (e.g. “Wednesday” referring to “Sheffield Wednesday”). Additionally, noisy data (e.g. “BLINKER” is uppercased to mimic a bold headline) makes it hard to detect known entities.

NER model evaluation. NER models are evaluated by comparing their output with human annotations using micro-averaged precision, recall and F1 scores. These scores are calculated using true positives (TP), false positives (FP) and false negatives (FN) for exact span match between a set of predicted mentions Pd and a set of relevant annotations Rd mentioned in each document d of the dataset D [Cornolti et al., 2013]:

$$\mathrm{TP}_d = |\{p \in P_d \mid \exists r \in R_d : \mathrm{match}(p, r)\}|$$
$$\mathrm{FP}_d = |\{p \in P_d \mid \nexists r \in R_d : \mathrm{match}(p, r)\}| \qquad (2.1)$$
$$\mathrm{FN}_d = |\{r \in R_d \mid \nexists p \in P_d : \mathrm{match}(r, p)\}|$$

$$\mathrm{Prec} = \frac{\sum_{d \in D} \mathrm{TP}_d}{\sum_{d \in D} (\mathrm{TP}_d + \mathrm{FP}_d)} \qquad
\mathrm{Rec} = \frac{\sum_{d \in D} \mathrm{TP}_d}{\sum_{d \in D} (\mathrm{TP}_d + \mathrm{FN}_d)} \qquad
F_1 = \frac{2 \cdot \mathrm{Prec} \cdot \mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}} \qquad (2.2)$$
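A minimal sketch of this evaluation scheme is shown below; the tuple representation of mentions and the exact-match criterion are illustrative assumptions, not a specific benchmark implementation.

```python
# Micro-averaged precision, recall and F1 for exact span matching,
# following Eq. 2.1 and 2.2. Mentions are assumed to be
# (start, end, type) tuples; match() requires exact span and type equality.
def match(p, r):
    return p == r

def micro_scores(predictions, references):
    """predictions/references: one set of mention tuples per document."""
    tp = fp = fn = 0
    for pred_d, ref_d in zip(predictions, references):
        tp += sum(1 for p in pred_d if any(match(p, r) for r in ref_d))
        fp += sum(1 for p in pred_d if not any(match(p, r) for r in ref_d))
        fn += sum(1 for r in ref_d if not any(match(p, r) for p in pred_d))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gold = [{(0, 7, "PER"), (9, 15, "ORG")}]   # one document with two gold mentions
pred = [{(0, 7, "PER"), (20, 25, "LOC")}]  # one correct and one spurious prediction
print(micro_scores(pred, gold))            # (0.5, 0.5, 0.5)
```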

Rule-based approaches utilize a set of hand-crafted rules to detect particular entity types. These rules are often based on lexical or syntactic features, linguistic patterns and domain knowledge. Some examples are word capitalization, regular expressions or shallow patterns based on POS tags. Often, these systems include domain-specific gazetteers, which are comprehensive lists of entity names that can be looked up. Traditional rule-based NER systems include FASTUS [Appelt et al., 1995], LaSIE-II [Humphreys et al., 1998], NetOwl [Krupka and Hausman, 1998], Facile [Black et al., 1998], SRA [Aone et al., 1998] or LTG [Mikheev et al., 1999]. These systems depend heavily on precise rule definitions; they are therefore highly specific to a certain domain and often achieve high precision but low recall.

Unsupervised learning approaches focus on learning a set of rules by clustering unlabeled examples using a small set of seed rules [Collins and Singer, 1999], extraction patterns [Etzioni et al., 2005] or gazetteers [Nadeau et al., 2006]. These approaches can generalize well for different domains, but often depend on accurate linguistic preprocessing, which is challenging in domains with high variance.

Supervised learning approaches regard NER as a multi-class classification or sequence labeling task. These methods require a large hand-tagged corpus which is used as training set to optimize the parameters of a ML model. The goal is that the model learns to generalize over the training examples and discovers patterns and rules that can be applied to unseen examples. One critical step in supervised NER models is feature engineering, where text input is transformed into abstract representations. Typical word-based features are the current word, surface type (e.g. capitalized, all-capitalized, all-digits, alphanumeric, etc.), prefixes and suffixes, capitalization patterns, context words and previous label predictions [Ratinov and Roth, 2009]. Often, NER systems use additional features from linguistic preprocessing, such as POS tags, shallow parsing information and gazetteers. The training objective is most often to assign a tag to each word, such as BIOES (Begin, Inside, Outside, End, Singleton) of a named entity along with its type [Ratinov and Roth, 2009]. There also exist other tagging schemes with similar semantics, such as BIO2 or BILOU. A large variety of ML algorithms has been applied to the NER task, such as Hidden Markov Models (HMM) [Rabiner, 1989; Bikel et al., 1997], Support Vector Machines (SVM) [Hearst et al., 1998; McNamee and Mayfield, 2002] or Decision Trees [Quinlan, 1986; Szarvas et al., 2006]. It was shown that sequence-based models such as Maximum Entropy Markov Models (MEMM) [McCallum et al., 2000] are most suitable for the NER task [Chieu and Ng, 2002; Bender et al., 2003; Curran and Clark, 2003]. Most prominently, Conditional Random Fields (CRF) [Lafferty et al., 2001] have become an early benchmark for NER [McCallum and Li, 2003; Krishnan and Manning, 2006].
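The sketch below illustrates such hand-crafted word features together with BIOES target tags for a short phrase; the feature names and the example annotation are hypothetical and not taken from a particular system.

```python
# Typical word-based features for a feature-based NER tagger and BIOES tags
# for an illustrative phrase (feature set and annotation are examples only).
def word_features(tokens, t):
    w = tokens[t]
    return {
        "word": w.lower(),
        "is_capitalized": w[0].isupper(),
        "is_all_caps": w.isupper(),
        "is_digit": w.isdigit(),
        "prefix3": w[:3].lower(),
        "suffix3": w[-3:].lower(),
        "prev_word": tokens[t - 1].lower() if t > 0 else "<BOS>",
        "next_word": tokens[t + 1].lower() if t < len(tokens) - 1 else "<EOS>",
    }

tokens = ["Blinker", "played", "for", "Sheffield", "Wednesday", "."]
tags = ["S-PER", "O", "O", "B-ORG", "E-ORG", "O"]  # singleton PER, two-token ORG

for t, tag in enumerate(tags):
    print(tag, word_features(tokens, t))
```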

Deep-learning based approaches have significantly improved the NER task [Li et al., 2020]. Pre-trained word embeddings such as word2vec [Mikolov et al., 2013a], GloVe [Pennington et al., 2014] and Fasttext [Bojanowski et al., 2017] improve sequence labeling tasks by enriching word representations with external knowledge learned from unlabeled text. Stacked encoder-decoder architectures of Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) such as bidirectional Long Short-Term Memory (BLSTM) allow to include sentence-level and long-range context. For example, Collobert et al. [2011] apply CNN over the sequence of words in a sentence. Huang et al. [2015] and Lample et al. [2016] use a combination of BLSTM and CRF, Chiu and Nichols [2016] combine BLSTM and CNN. Ma and Hovy [2016] combine character-based word representations with a BLSTM-CNN-CRF model. Akbik et al. [2018] propose contextual string embeddings, which represent words by their stream of characters in a sequential context. Peters et al. [2018] introduce the ELMo embedding, which represents words using character-CNNs in a contextualized language model. Radford et al. [2018] and Devlin et al. [2019] extend the language model paradigm using pre-trained Transformers (GPT and BERT). These models are becoming a new paradigm of NER by replacing traditional word embeddings with powerful pre-trained language representations that can further be fine-tuned to a wide range of tasks [Li et al., 2020]. We will discuss the distributional properties leveraged by these models in more detail in Section 2.2 and provide detailed information on the individual sequence learning methods in Section 2.3.

FIGURE 2.2: Example of a mention-entity graph used for NEL. Mentions in the text (left, highlighted in yellow) are linked to candidates (center). The goal is to assign the mention to the correct candidate based on the KB entry. The KB contains entities with relational and contextual information (right). (Figure taken from Hoffart et al. [2011] CC BY-NC-SA 3.0)

2.1.5 Named Entity Linking

Many IE tasks require explicitly associating named entity mentions with representations of objects in an ontology, such as a knowledge base (KB) or product database. The process of resolving named entities to a knowledge base is called Named Entity Linking (NEL) [Ji and Grishman, 2011]. This task is similar to the problem of word sense disambiguation, but here a domain-specific ontology is used instead of a complete lexical resource. A KB is typically focused on a specific subset of real-world entities (e.g. companies, products, or diseases) and it is potentially incomplete. Therefore, entity linking often includes the detection of non-linkable (NIL) entities or NIL clusters. NEL is typically approached in three separate stages: mention recognition (extraction, NER), candidate generation (search) and candidate ranking (disambiguation, NED) [Hachey et al., 2013]. In this section, we briefly discuss the candidate generation and disambiguation processes.

Knowledge Bases. Since most tasks require covering a large number of real-world entities, the most common ontology used for NEL is Wikipedia. Wikipedia not only provides unstructured textual information about over 6 million entities and concepts in many languages, it also contains many useful structured elements such as page titles, redirects, hyperlink anchor texts, disambiguation pages, infoboxes and categories [Hachey et al., 2013]. Some KBs that contain structured data derived from Wikipedia are Wikidata [Vrandečić and Krötzsch, 2014], Freebase [Tanon et al., 2016], YAGO2 [Hoffart et al., 2013] or DBpedia [Lehmann et al., 2015]. A popular evaluation task for NEL is TAC Knowledge Base Population (KBP) [Ji and Grishman, 2011].

Methods for candidate generation. As a first step, a set of candidates from the KB is generated for every mention. An ideal candidate generation should balance precision and recall to capture the correct entity while reducing the amount of computation for disambiguation [Hachey et al., 2013]. Typically, this involves matching entity names using dictionary-based techniques such as exact or partial match [Cucerzan, 2007; Varma et al., 2009; Ratinov et al., 2011]. It is common practice to build name dictionaries by gathering aliases and synonyms from Wikipedia pages, hyperlinks and metadata [Bunescu and Pasca, 2006; Mihalcea and Csomai, 2007]. Furthermore, surface form expansion for acronyms [Varma et al., 2009; Zhang et al., 2011] can be used to generate more synonyms. Spelling correction techniques [Chen et al., 2010; Zheng et al., 2010; Shen et al., 2012] are often used to normalize candidate queries. In addition, Web search results [Han and Zhao, 2009; Dredze et al., 2010; Cheng et al., 2011] and query click logs [Chakrabarti et al., 2012; Taneva et al., 2013] can be utilized to generate candidates with higher variance.

Features for candidate ranking. In the second step, the candidates are ranked and a decision is made to assign an entity ID (or NIL) to each mention. Ideally, the method should rank all mentions in a document collectively to obtain a coherent assignment. Typical context-independent ranking features include name string comparison (e.g. using edit or Hamming distance), entity type and popularity information [Shen et al., 2015]. However, in most cases it is crucial to include context-dependent features, which are often extracted from the interaction between mention context and entity descriptions. Typical examples are bag-of-words, feature vectors, language and topic modeling and coherence between entity mappings [Shen et al., 2015].

Supervised methods for candidate ranking. A simple supervised approach to the candidate ranking problem is to apply binary classification to decide if an entity mention refers to a candidate entity. To rank a larger number of candidates and handle unbalanced training data, many NEL systems utilize learning to rank [Herbrich, 2000]. This framework takes the relations between candidate entities into account instead of considering them as independent. Bunescu and Pasca [2006] utilize an SVM ranking model over link anchor context and categories from Wikipedia. Cucerzan [2007] uses a similar approach, but ranks candidates using the scalar product of a document feature vector with all candidate feature vectors. Varma et al. [2009] rank candidates based on cosine similarity between the mention paragraph and the text of the candidate page. Zheng et al. [2010] utilize a ranking perceptron algorithm to learn pairwise ranking. Zhang et al. [2011] utilize a topic model to generate additional semantic features for ranking. Ratinov et al. [2011] distinguish between local and global disambiguation techniques. They propose to use normalized Google distance and pointwise mutual information (PMI) as global relatedness measures to achieve more coherent disambiguations.

Graph-based approaches make use of relationships that are explicitly modeled in a KB or can be inferred from Wikipedia page structure. Hoffart et al. [2011] propose a robust and coherent method for disambiguation by applying a dense subgraph algorithm over a mention–entity graph (a visualization is shown in Figure 2.2). Shen et al. [2012] propose a semantic network similarity measure based on Wikipedia concepts and hyperlink structure.

Moro et al. [2014] approach NEL and WSD tasks jointly by applying densest subgraph heuristics to a semantic interpretation graph. Durrett and Klein [2014] use a factor graph to associate mentions with coreference, entity type and link variables using CRF and minimum Bayes risk decoding.

Neural architectures allow to replace labor-intensive feature engineering with learned dense representations of words, sentences, phrases or documents that can capture a larger amount of context. Typically, neural models learn a similarity function between a mention and each entity candidate using labeled examples for supervised training. Mentions are mostly represented by their characters and surrounding context. Entity candidates are often represented using descriptions and type information derived from the KB.

He et al. [2013] apply feed-forward networks to learn this similarity function using bag-of-words context representations. The representations are pre-trained using stacked denoising autoencoders and then fine-tuned with labeled examples. Sun et al. [2015] and Francis-Landau et al. [2016] utilize CNNs to learn the similarity function based on semantic representations of mention and entity contexts. Sil et al. [2018] extend the semantic representation using multi-lingual word embeddings and CNN sentence encoding. They encode left and right context using LSTM and Neural Tensor Networks. Pappu et al. [2017] propose an efficient entity disambiguation by extending the Paragraph Vectors model [Le and Mikolov, 2014]. This model encodes an entity embedding jointly based on global token context and context from surrounding entities.

Gupta et al. [2017] use a modular model which encodes an entity embedding using descriptions, mention contexts and fine-grained types using LSTM and Feed-forward (FF) networks. Gillick et al. [2019] combine two network structures into a dual encoder model: one network encodes mentions including their contexts, the other one encodes entities from the KB. A key property of this model is that it does not require interaction between both encoders, therefore efficient retrieval without candidate generation is possible. Logeswaran et al. [2019] utilize pre-trained Transformers for matching candidate and mention contexts. They show that domain adaptive pre-training allows the model to be applied in a zero-shot task, where mentions are linked to unseen entities without in-domain labeled examples.

In summary, we have seen that neural NEL models are increasingly moving towards learning a universal linking function between unstructured text and structured world knowledge. This objective refers back to WSD and SRL, as it often includes general concepts, unseen entities, topics and, ideally, coreferences. At the same time, Information Retrieval, which is traditionally considered a downstream task to structured Information Extraction, is now often directly applied to neural representations [Mitra and Craswell, 2018; Onal et al., 2018]. The question remains whether the intermediate symbolic representations produced by Entity Linking are still required for neural MR. In the next section, we will discuss a complementary perspective, which aims towards understanding language generically based on distributed representations.

2.2 Distributed Language Representations

Natural language is an inherent representation of human communication and knowledge. It can be perceived as a sequence of discrete symbols, e.g. words, phonemes or sounds, that follows rules that both the speaker and the hearer know [Chomsky, 1965]. The idea of distributional semantics is to transform these linguistic elements into generalized representations that encode meaning rather than discrete syntax [Turney and Pantel, 2010]. One central property of distributed representations is semantic similarity, which describes a measurable relation of nearness between the meaning of two elements. In general, semantic similarity is closely related to the phenomena of synonymy (two words have the same meaning), hyponymy (one word is a generalization or specialization of another word) or antonymy (two words have the opposite meaning). Symbolic language representations are limited when words or phrases are ambiguous, as in the case of polysemy (a single word can be used to express different meanings) and homonymy (two words with same sound or spelling have different meanings). Therefore distributional models aim to capture the meaning of linguistic elements based on the context they appear in.

2.2.1 Distributional Hypothesis

Harris [1954] describes how the distributional structure of language is characterized by the relations between its elements. His hypothesis assumes that “these relations really hold in the data investigated” [1954, p. 149]. They can be empirically observed by frequency and relative position between elements and basic classes of co-occurring elements. However, Harris states that “the distinction between distributional structure and meaning is not yet always clear” [1954, p. 151]. This means that a morpheme or word has no single or central meaning, or even “a continuous or coherent range of meanings” [1954, p. 152]. This mismatch between symbols and their semantic interpretation is picked up by Firth [1957], who states that a word is characterized “by the company it keeps” and that its sense can therefore be determined by its context. Sahlgren [2008] solidifies this empirical hypothesis with a theoretical foundation. Specifically, he points out two relationships of linguistic context based on structuralist theory [de Saussure, 1916]: Syntagmatic relations concern positioning and hold elements that co-occur in sequential combinations in a text. Paradigmatic relations concern substitution and hold elements that occur in the same context but not at the same time (see Figure 2.3). This differential view on meaning is the foundation of the refined distributional hypothesis [Sahlgren, 2008, p. 7]:

A distributional model accumulated from co-occurrence information contains syntagmatic relations between words, while a distributional model accumulated from information about shared neighbors contains paradigmatic relations between words.

In the following sections, we summarize existing representation models based on this hypothesis. We will show that the distributional hypothesis enables the automatic construction of neural language models, which constitute the foundation for Neural Machine Reading.

FIGURE 2.3: Examples for syntagmatic and paradigmatic relations. (Figure taken from Sahlgren [2008])

2.2.2 Semantic Vector Space Model

The idea of the Vector Space Model (VSM) is to represent linguistic elements—traditionally documents in a collection—as a point in space [Salton et al., 1975]. In this space, points that are close together are semantically similar and points that are distant are semantically different. Following the distributional hypothesis, such a model can automatically be built from a corpus of text using algorithms that utilize the distribution of terms in that corpus. Therefore the VSM constitutes a suitable foundation for self-supervised Machine Reading systems.

Bag-of-words vector space model. The original bag-of-words vector space model (BOW) is represented by a term–document matrix X [Salton et al., 1975]. In this matrix, the rows correspond to terms (e.g. words) and the columns to documents. For a corpus of n documents $D = d_{1...n}$ with m unique terms $w_{1...m}$, the matrix X will have m rows and n columns. We assign each element $x_{ij}$ the frequency of the i-th term $w_i$ in the j-th document $d_j$. Often, a normalized frequency measure such as term frequency – inverse document frequency (TF-IDF) [Jones, 1972] is used to give higher weight to discriminative words:

$$\mathrm{TF}_{w,d} = \log\left(1 + |\{w' : w' = w \wedge w' \in d\}|\right)$$
$$\mathrm{IDF}_{w,D} = \log\left(1 + \frac{n}{|\{d \in D : w \in d\}|}\right) \qquad (2.3)$$
$$\text{TF-IDF}_{w,d,D} = \mathrm{TF}_{w,d} \cdot \mathrm{IDF}_{w,D}$$

As most documents contain only a small fraction of the entire vocabulary, X is a sparse matrix, i.e. most of its entries are 0. We can now use the j-th column vector as bag-of-words vector representation of the j-th document:

$$x_{\mathrm{bow}}(d_j, D) = x_{\mathrm{bow}}(X, j) = x_{:j} = \sum_{w_i \in d_j} e_{w_i} \qquad (2.4)$$

where $e_{w_i} \in \{0, 1\}^m$ is a one-hot representation of word $w_i$, i.e. the $i$-th unit vector in $\mathbb{R}^m$. This local distributed representation captures (to some degree) the meaning of a document, although the sequential order of terms is lost [Turney and Pantel, 2010]. Nevertheless, it captures an important aspect of semantics and is often used in IR to calculate the relevance of a document to a query [Salton et al., 1975].
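The following sketch builds such a sparse bag-of-words representation with TF-IDF weighting for a small toy corpus; the corpus and the dictionary-based sparse encoding are illustrative choices.

```python
# Sparse TF-IDF document vectors following Eq. 2.3 and 2.4 (toy corpus).
import math
from collections import Counter

corpus = [
    "the mouse ran across the floor".split(),
    "the mouse is a pointing device".split(),
    "genes carry information about heredity".split(),
]

n = len(corpus)
# document frequency per term
df = Counter(w for d in corpus for w in set(d))

def tfidf_vector(doc):
    vec = {}
    for w, count in Counter(doc).items():
        tf = math.log(1 + count)
        idf = math.log(1 + n / df[w])
        vec[w] = tf * idf          # nonzero entries of the column vector x_{:j}
    return vec

for j, d in enumerate(corpus):
    print(j, tfidf_vector(d))
```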

FIGURE 2.4: Syntagmatic (left) and paradigmatic (right) neighborhood examples of the word “knife”. (Figures taken from Sahlgren [2008])

Probabilistic topic models. One important application of term–document frequency distributions is probabilistic topic modeling [Blei, 2012]. Topic modeling algorithms aim to discover and organize latent themes discussed in documents. The most prominent algorithm, latent Dirichlet allocation (LDA) [Blei et al., 2003], uses a generative process to assign words to topics and topics to documents. A topic is here defined as a distribution over words of a fixed vocabulary, and each document holds a distribution over a fixed number of topics. For example the “genetics” topic assigns high probability to words referring to genes, DNA and heredity in general. A topic model will therefore assign the “genetics” topic with high probability to documents containing these words. One significant drawback of this model is that it is entirely based on BOW over the global document context, i.e. it ignores word order and it does not respect local word contexts. Therefore, an important step towards semantic models of language is to represent words in their local context.

Vector space models based on syntagmatic associates. When investigating the similarity between individual words or short phrases, the BOW model is very limited. The reason is that BOW vectors of two documents without common words are always orthogonal. Therefore, Deerwester et al. [1990] shift the focus to analyzing single terms by looking at their syntagmatic associates. These relations are represented by the row vectors in the term–document matrix instead of column vectors. They introduce the Latent Semantic Analysis (LSA) algorithm, which applies singular-value decomposition (SVD) to decompose X into $k \in \{50, \ldots, 1500\}$ orthogonal factors [Landauer and Dumais, 1997]. Words can be translated to vectors in the semantic space spanned by the approximated factor weights. They use truncated SVD, $\hat{X} = U_k \Sigma_k V_k^\top$, to transform the i-th row vector of X into the LSA representation of the i-th term [Turney and Pantel, 2010]:

$$x_{\mathrm{lsa}}(w_i, D) = x_{\mathrm{lsa}}(X, \Sigma_k, V_k, i) = \Sigma_k^{-1} V_k^\top x_i \qquad (2.5)$$

More generally, distributional word meaning can be inferred from word–context matrices, where context is a text region covering a phrase, sentence, paragraph, chapter or document [Turney and Pantel, 2010]. The semantic properties of the resulting vector space not only depend on the length of the context, but also on the type of relation used to build the model. Because the LSA algorithm is based on syntagmatic relations, it encodes word associations rather than taxonomic similarities [Schütze and Pedersen, 1993]. For example, Figure 2.4 (left) demonstrates the syntagmatic neighborhood for the word “knife” based on a 10M word corpus of English high-school level texts and a section-level context of roughly 150 words [Sahlgren, 2008]. “Noni” and “Nimuk” are person names related to a story where a knife plays a role. These names are therefore semantically associated to “knife” but not semantically similar [Turney and Pantel, 2010].

Vector space models based on paradigmatic parallels. A different perspective on distributional word context is to look at close neighbors of each term. This is achieved by collecting text data in a words-by-words co-occurrence matrix that holds frequencies of words occurring together within a context window. In a directional co-occurrence matrix, rows and columns correspond with left and right context. In symmetric co-occurrence matrices, frequencies are calculated from the entire window, and therefore rows and columns are equal [Sahlgren, 2008]. The Hyperspace Analogue to Language (HAL) model [Lund and Burgess, 1996] uses a paradigmatic approach to encode distributional word context. The HAL model uses a directional matrix over a sliding window of 5 words to the left and right as context [Lund and Burgess, 1996]. For example, Figure 2.4 (right) demonstrates the paradigmatic neighborhood for the word “knife” with a small 2+2 words context on the same corpus used for the first example. While “spoon” occurs in both syntagmatic and paradigmatic context, all the words in this figure intuitively have a higher taxonomical similarity than the words in the previous example [Turney and Pantel, 2010]: they are singular neuter nouns from the area of household items and most of them share the common hypernym “tool”.

Similarity measures. The most popular measure for semantic nearness captures the idea that the angle between two vectors in semantic space $a, b \in \mathbb{R}^n$ reflects the similarity between them. We calculate this measure using cosine similarity [Turney and Pantel, 2010]:

$$\cos(a, b) = \frac{a \cdot b}{\sqrt{a \cdot a} \cdot \sqrt{b \cdot b}} = \frac{a}{\lVert a \rVert} \cdot \frac{b}{\lVert b \rVert} \qquad (2.6)$$

If a and b are normalized to unit length, cosine similarity equals the dot product. The value ranges from −1 (vectors point in opposite directions) through 0 (vectors are orthogonal) to 1 (vectors point in the same direction) [Turney and Pantel, 2010]. In practice, cosine similarity is often applied to positive vectors (e.g. term frequency vectors), where a value of 1 depicts the highest similarity, while a value of 0 means they are uncorrelated. Other popular distance measures include the geometric Euclidean and Manhattan distances, the information-theoretic Kullback-Leibler divergence, and the Dice and Jaccard coefficients [Bullinaria and Levy, 2007; Manning et al., 2008].
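A minimal numpy sketch of this measure is given below; the two vectors are arbitrary toy examples.

```python
# Cosine similarity (Eq. 2.6) between two term-frequency vectors.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 0.0, 1.0])
b = np.array([0.0, 1.0, 1.0, 1.0])
print(cosine(a, b))  # approximately 0.71
```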

2.2.3 Neural Distributed Language Representations

Statistical language modeling based on global matrix factorization, such as LSA, is problematic when applied to large data sets with high variance. Not only is it computationally expensive to handle large sparse matrices. More importantly, in order to obtain generalization, it is necessary to handle longer context sequences and a large number of infrequent words or phrases. Therefore, neural models aim to approximate a language model (LM) by observing a large number of samples in context. They often learn a mathematical embedding from the high-dimensional word representation to a continuous vector space with much lower dimension.

Neural probabilistic language model. Bengio et al. [2003] propose a neural probabilistic language model that utilizes neural networks to learn a distributed representation of words along with a probability function for word sequences. The main idea is to represent language by the conditional probability of the next word given all previous words:

$$p(w_1, \ldots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \ldots, w_{t-1}) \qquad (2.7)$$

where $w_1, \ldots, w_T$ is a sequence of words. The complexity of this model can be considerably reduced using the n-gram model, which reduces the context to combinations of the last n − 1 words:

$$p(w_t \mid w_1, \ldots, w_{t-1}) \approx p(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) \qquad (2.8)$$

This probability distribution can be computed by a neural network with a softmax output layer to produce a vector with positive probabilities for a word wt summing to 1:

$$p(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}} \qquad (2.9)$$

They use a feed-forward neural network with word-feature mapping weights Wx, hidden layer weights Wh, output layer weights Wy and corresponding bias terms bx, bh:

$$y = W_x x + b_x + W_y \tanh(W_h x + b_h) \qquad (2.10)$$

Input features x are a concatenation of all context words. The network is trained with stochastic gradient descent (SGD) and a loss function that maximizes the penalized log-likelihood for the training corpus. The advantage of this neural model over traditional term-frequency models is a more compact and smoother representation that better preserves linear regularities among words. In addition, a neural model can be trained with a larger number of conditioning variables, because it scales linearly, not exponentially, with the number of variables [Bengio et al., 2003].

FIGURE 2.5: Architecture of CBOW (left) and Skip-gram (right) models. (Figure taken from Mikolov et al. [2013a])

Word embeddings based on local context windows. Mikolov et al. [2013a] propose two model architectures to learn accurate distributed word representations with minimized computational complexity. The Continuous Bag-of-Words model (CBOW) predicts the current word based on its context, while the Skip-gram model predicts surrounding words for a given word (see Figure 2.5). By learning word vectors without constructing a full language model, this approach improves efficiency for training with larger datasets. Especially the Skip-gram model, often referred to as word2vec (https://code.google.com/p/word2vec), has proven to be an efficient and accurate model. The objective of this model is to encode paradigmatic relations for a given word by maximizing the log-likelihood for all surrounding words $C_t$ in a window of typically 4–10 words [Mikolov et al., 2013b]:

$$L_{\text{skip-gram}} = \sum_{t=1}^{T} \sum_{c \in C_t} \log p(w_c \mid w_t) \qquad (2.11)$$

Word2vec uses a similar neural architecture to that of Bengio et al. [2003] with a single hidden layer. Instead of softmax, they use negative sampling, a simplified form of Noise Contrastive Estimation [Gutmann and Hyvärinen, 2012], to reduce the training complexity of the model. Here, the objective is to minimize the negative log-likelihood:

$$L_{\mathrm{word2vec}} = \sum_{t=1}^{T} \sum_{c \in C_t} \left( \log\left(1 + e^{-s(w_t, w_c)}\right) + \sum_{n \in N_{t,c}} \log\left(1 + e^{\,s(w_t, w_n)}\right) \right) \qquad (2.12)$$

where $N_{t,c}$ is a set of negative word examples and $s(w_t, w_c) = u_{w_t}^\top v_{w_c}$ is a scoring function using the scalar product between word vector $u_{w_t}$ and context vector $v_{w_c}$ [Pennington et al., 2014]. Additionally, they observed that subsampling frequent words during training results in better representations of uncommon words [Mikolov et al., 2013b].
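For illustration, the sketch below evaluates the negative sampling loss of Eq. 2.12 for a single (target, context) pair with a few sampled negatives; the embedding matrices are random placeholders rather than trained parameters.

```python
# Skip-gram negative sampling loss for one (target, context) pair (Eq. 2.12).
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 50
U = rng.normal(scale=0.1, size=(vocab_size, dim))  # word vectors u_w
V = rng.normal(scale=0.1, size=(vocab_size, dim))  # context vectors v_c

def pair_loss(target, context, negatives):
    s_pos = U[target] @ V[context]
    loss = np.log1p(np.exp(-s_pos))        # positive pair term
    for n in negatives:
        s_neg = U[target] @ V[n]
        loss += np.log1p(np.exp(s_neg))    # negative sample terms
    return float(loss)

negatives = rng.integers(0, vocab_size, size=5)
print(pair_loss(target=42, context=7, negatives=negatives))
```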

FIGURE 2.6: Architecture of the Paragraph Vector framework. (Figure taken from Le and Mikolov [2014])

Character-based word embeddings. Word-based embedding models have the disadvantage that their one-hot input representation relies on a fixed vocabulary, and it is therefore not possible to generate word vectors for out-of-vocabulary (OOV) words. Furthermore, such representations ignore the morphology of words and will assign distinct vectors to different word forms or writings. Additionally, the accuracy of infrequent words, e.g. in morphologically rich languages or specialized domains, may suffer from too few training examples. To address this issue, Bojanowski et al. [2017] propose Fasttext (https://fasttext.cc), a word embedding approach that extends the Skip-gram model with character-based word representations similar to the approach of Schütze [1993]. In this model, each word is represented as a bag of character n-grams with 3 ≤ n ≤ 6. For example, the word “where” is represented by the special sequence <where>, which yields the 3-grams <wh, whe, her, ere, re>. The model efficiently learns a representation for each n-gram and represents a word as the sum of n-gram vectors. To achieve this, Eq. 2.12 is extended with a different scoring function:

$$s(w, c) = \sum_{g \in G_w} z_g^\top v_c \qquad (2.13)$$

where $G_w$ is the set of n-grams appearing in w and $z_g$ is the vector representation of n-gram g.
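The character n-gram decomposition can be sketched in a few lines; boundary markers and the n-gram range follow the description above, while the function itself is only an illustration.

```python
# Fasttext-style character n-gram extraction with boundary markers (3 <= n <= 6).
def char_ngrams(word, n_min=3, n_max=6):
    marked = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    grams.add(marked)  # the full marked word is kept as an additional feature
    return grams

print(sorted(g for g in char_ngrams("where") if len(g) == 3))
# ['<wh', 'ere', 'her', 're>', 'whe']
```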

Neural word embeddings based on global word co-occurrences. Pennington et al. [2014] propose the Global Vectors for word representation (GloVe, https://nlp.stanford.edu/projects/glove/) model. In contrast to word2vec, which utilizes local word context windows, GloVe is based on a global word–word co-occurrence matrix and therefore is able to leverage a larger amount of context from syntagmatic relations. To increase efficiency, GloVe is trained only on the nonzero elements of the matrix using a specific weighted least squares model. They use a similar log-bilinear regression objective as Mikolov et al. [2013a]. This method achieves higher accuracy than word2vec for models based on the same amount of training data and computation time during training.

Distributed representations of sentences and documents. Most of the previous models focus on the representation of single words. Aiming at the representation of longer phrases, it was shown that the Skip-gram model exhibits a linear structure that allows accurate composition of words [Mikolov et al., 2013b; Mitchell and Lapata, 2010]. This is possible by using simple vector arithmetic, especially element-wise addition or weighted average of word vectors, to create fixed-length representations from multiple words. However, similar to the bag-of-words approach, the main drawback of this method is that word order information is lost during this process. This is especially problematic for the representation of long documents. Therefore, Le and Mikolov [2014] propose an unsupervised algorithm that addresses these issues for texts of variable length. Their Paragraph Vectors (ParVec) model extends the CBOW model by optimizing multiple shared word vectors in combination with one unique paragraph vector per text. This process is visualized in Figure 2.6: words are represented by columns in matrix W, and paragraphs are mapped to columns in matrix D. At training time, these matrices are optimized using a neural network:

$$y = U\, h(w_{t-k}, \ldots, w_{t+k};\, W, D) + b \qquad (2.14)$$

where h is constructed by concatenation or average of word vectors and U, b are parameters of a hierarchical softmax objective encoded as a binary Huffman tree. At prediction time, an inference step is performed through the word matrix W of the network. Finally, a paragraph representation is computed by performing SGD through the paragraph matrix D.

2.2.4 Contextualized Language Models

The pre-trained word representation models discussed in the previous section have become key components for neural language understanding. However, most of them are focused on paradigmatic relations and ignore how words are used differently in different linguistic contexts, as in the case of polysemous words. The goal of contextualized language models is to capture context-dependent aspects of word meaning while maintaining the effectiveness of unsupervised language representations.

Embeddings from Language Models. To tackle this challenge, Peters et al. [2018] propose Embeddings from Language Models (ELMo), a deep contextualized word representation using a recurrent neural network (RNN) with higher complexity. More specifically, they extend the idea of Bengio et al. [2003] by learning a bidirectional language model with multiple layers of internal representations. The overall objective of this LM is to maximize the log likelihood of predicting each word in a document d = (w1, . . . , wT ) given its left and right context:

$$L_{\mathrm{ELMo}} = \sum_{t=1}^{T} \Big( \log p(w_t \mid w_1, \ldots, w_{t-1};\, \Theta_{\mathrm{CNN}}, \overrightarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s) + \log p(w_t \mid w_{t+1}, \ldots, w_T;\, \Theta_{\mathrm{CNN}}, \overleftarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s) \Big) \qquad (2.15)$$

FIGURE 2.7: Model architectures of ELMo (left) and BERT (right). (Figure adapted from Devlin et al. [2019] CC BY 4.0)

Internally, the parameters Θ refer to three different types of network layers, which are jointly optimized: input word vectors $x_t$ are encoded on character level using a convolutional neural network (CNN) [LeCun et al., 1989] with parameters $\Theta_{\mathrm{CNN}}$. Internal hidden states $h_{t,k}$ are computed by L interconnected layers of Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997] for the forward and backward directions with parameters $\overrightarrow{\Theta}_{\mathrm{LSTM}}$ and $\overleftarrow{\Theta}_{\mathrm{LSTM}}$. The output objective $y_t$ is calculated by a softmax layer. The first layer uses the word vectors as input:

$$h_{t,0} = x_t = \mathrm{CNN}(w_t, \Theta_{\mathrm{CNN}})$$
$$\overrightarrow{h}_{t,k} = \mathrm{LSTM}(\overrightarrow{h}_{t,k-1}, \overrightarrow{h}_{t-1,k};\, \overrightarrow{\Theta}_{\mathrm{LSTM}})$$
$$\overleftarrow{h}_{t,k} = \mathrm{LSTM}(\overleftarrow{h}_{t,k-1}, \overleftarrow{h}_{t+1,k};\, \overleftarrow{\Theta}_{\mathrm{LSTM}}) \qquad (2.16)$$
$$h_{t,k} = \overrightarrow{h}_{t,k} \oplus \overleftarrow{h}_{t,k}$$
$$y_t = \mathrm{softmax}(h_{t,L})$$

where ⊕ depicts vector concatenation and $k \in \{1, \ldots, L\}$ is the layer index. Finally, the representation is calculated by collapsing all layers into a single vector:

$$x_{\mathrm{elmo}}(w_t, d) = \gamma \sum_{k=1}^{L} s_k\, h_{t,k} \qquad (2.17)$$

where γ is a scaling parameter and s are softmax-normalized layer weights learned for a specific task. It was shown that ELMo representations capture polysemous word meanings from context and therefore significantly improve many word-based tasks compared to GloVe.
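The task-specific combination of Eq. 2.17 can be sketched as follows; layer outputs, weights and the scaling factor are random placeholders instead of trained parameters.

```python
# Collapsing L ELMo layers into one vector with softmax-normalized weights (Eq. 2.17).
import numpy as np

L, dim = 3, 8
h = np.random.randn(L, dim)                 # hidden states h_{t,1..L} for one token
raw_weights = np.array([0.1, 0.5, -0.2])    # task-specific weights before softmax
gamma = 1.0

s = np.exp(raw_weights) / np.exp(raw_weights).sum()
x_elmo = gamma * (s[:, None] * h).sum(axis=0)
print(x_elmo.shape)  # (8,)
```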

Bidirectional masked language models. One inherent problem of RNN-based LMs is that the recurrent architecture makes it hard to maintain long-range dependencies. In these networks, the sequential information has to travel a long path through the cells, leading to the vanishing gradient problem [Pascanu et al., 2013]. Although LSTMs can partly solve this issue using gating mechanisms, their architecture becomes very complex, because every time step in a sequence depends on all preceding time steps (see Eq. 2.7). This leads to high memory bandwidth and prevents the parallelization of computations. An additional problem is that RNN-based LMs handle left and right context of a word as two independent sequences (see Eq. 2.15 and Figure 2.7). Combining both directions into a true bidirectional model would exponentially increase the number of dependent weights.

FIGURE 2.8: Pre-training and fine-tuning procedures for BERT. (Figure taken from Devlin et al. [2019] CC BY 4.0)

Devlin et al. [2019] tackle these problems using a different neural architecture. They propose Bidirectional Encoder Representations from Transformers (BERT, https://github.com/google-research/bert). This model is based on a multi-layer bidirectional Transformer encoder [Vaswani et al., 2017], which allows to fully interconnect all tokens of a sequence using self-attention and to parallelize the computations at the same time. In BERT, the input sequence is represented using word-piece embeddings [Wu et al., 2016] combined with positional embeddings [Gehring et al., 2017]. The latter are required because the Transformer architecture reduces the complexity of dependencies by discarding information about the ordering of tokens in a sequence.

Training BERT for a task involves a two-step procedure (see Figure 2.8). The first step is pre-training, where the model is optimized for a self-supervised objective to gain basic language and domain understanding. Here, a masked language model is used to train bidirectional sentence representations. This is done by masking input tokens at random and then predicting the masked tokens using softmax. Furthermore, a binarized next sentence prediction task is used to improve the understanding of relationships between sentences. The second step is fine-tuning, where task-specific inputs and outputs are used to continue training the model. In contrast to feature-based approaches such as ELMo, where fixed language representations are used as input to downstream tasks, fine-tuning involves the adaptation of all model parameters end-to-end. This second step is relatively inexpensive and requires less data than pre-training. Therefore, this procedure makes it possible to enhance task-specific models with pre-trained weights from domain-specific LMs, such as BioBERT [Lee et al., 2019].
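The masked language model objective can be tried out directly with a pre-trained checkpoint; the sketch below uses the transformers fill-mask pipeline, where the checkpoint name and the probe sentence are illustrative choices.

```python
# Querying a pre-trained masked language model (checkpoint name is an example).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("Preliminary findings were reported in the New England [MASK] of Medicine."):
    print(prediction["token_str"], round(prediction["score"], 3))
```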

In this thesis, we investigate the applicability of distributed representations, including neural LMs, to the information-seeking process. We focus on the desired model properties for Neural MR introduced in Section 1.2 and discuss the limitations of each model. Because a large amount of related work was published concurrently with this thesis, we include this concurrent work in each of the following chapters.


2.3 Supervised Sequence Learning

One of our main objectives is to learn generalized Machine Reading models from unlabeled text. As we have seen in the previous section, distributional language models are often based on self-supervision, a process that applies supervised learning with training examples that can be obtained automatically using distributional statistics. Furthermore, we have identified and discussed the sequential properties of natural language. Subsequently, in this section we will discuss models and requirements for supervised sequence learning [Graves, 2012].

In general, a supervised learning task is based on a training set S and a test set S′ that both contain input–target pairs (x, y). We assume that both S and S′ are drawn from the same distribution. The goal of supervised machine learning is to minimize a specific error measure E defined on the test set by finding the maximum-likelihood parameters $\Theta_{\mathrm{ml}}$ for a model:

$$\Theta_{\mathrm{ml}} = \arg\max_{\Theta}\, p(S \mid \Theta) = \arg\max_{\Theta} \prod_{(x,y) \in S} p(y \mid x, \Theta) \qquad (2.18)$$

In neural networks, this objective is approximated by iteratively optimizing the parameters using a loss function on the training set, which is closely related to E. The standard procedure is to minimize the loss L(S), defined as the negative logarithm of the probability of predicting the training examples:

$$L(S, \Theta) = -\log \prod_{(x,y) \in S} p(y \mid x, \Theta) = -\sum_{(x,y) \in S} \log p(y \mid x, \Theta) \qquad (2.19)$$

where the explicit dependence on Θ is often left out in notation. The transfer of learning to predict S′ from S is known as generalization. When applying Machine Learning (ML) methods to sequential tasks, such as language modeling, we assume that the individual data points, such as words or sentences, are not independent [Graves, 2012]. Instead, both the input features $x_{1...T} = (x_1, \ldots, x_T)$ and labels $y_{1...T} = (y_1, \ldots, y_T)$ constitute sequences with high correlation. Although linguistic input is not explicitly a time series, the individual inputs $x_t$ and labels $y_t$, $t \in \{1, \ldots, T\}$, are usually referred to as time steps at position t. Following this approach, we require a ML model to handle the alignment between individual time steps in the sequence. We will now discuss the most common sequence learning architectures with a focus on our Neural Machine Reading task.

2.3.1 Finite State Models

FIGURE 2.9: Dependency graph of simple HMMs (left), MEMMs (center) and CRF (right). Arrows show dependencies between states. States shown with open circles are not generated by the model. (Figure adapted from Lafferty et al. [2001])

A simple approach to model sequence labeling is the Hidden Markov Model (HMM). This model assumes that a sequence of labels $Y = (y_1, \ldots, y_T)$ is generated by a Markov process with unobservable (hidden) states [Rabiner, 1989]. It further assumes that the observations $X = (x_1, \ldots, x_T)$ are generated by another process which depends on Y. The goal is then to learn a generative model for X:

$$P(X) = \sum_{Y} P(X \mid Y)\, P(Y) \qquad (2.20)$$

where the sum runs over all possible sequences of Y. This optimization problem can be solved by efficient dynamic programming algorithms such as forward-backward, Viterbi [Forney, 1973] or Baum-Welch [Baum et al., 1970]. However, these algorithms are not practical for feature-rich and long sequences. The most significant drawback of this model is that predictions for the next time step t + 1 are solely based on the current time step t (see left side of Figure 2.9). Thus, the model ignores the previous history of the sequence entirely.

Therefore, Maximum Entropy Markov Models (MEMMs) extend this idea using a feature-based discriminative model. MEMMs describe the probability for reaching a label sequence

Y using an initial state distribution $P_0(Y)$ and a transition function. This function $P_{Y'}(Y \mid X)$ describes the probability of reaching Y from a previous state Y′ and the current observation sequence X (see center of Figure 2.9). It can be fitted per state using maximum entropy classifiers [McCallum et al., 2000]. MEMMs and other discriminative finite-state models share a label bias problem: transition probabilities are calculated per state, and therefore they compete against each other, rather than against all transitions in the model. As a result, states with fewer outgoing transitions are preferred during optimization.

Conditional Random Fields (CRF) address this problem by modeling the joint probability of the entire label sequence Y in a single exponential model [Lafferty et al., 2001]. In CRF, Y is represented as an undirected graph where vertices and edges are constructed using the Markov property. The joint distribution over the label sequence Y given X is expressed as a random field, which is optimized by maximizing the log-likelihood objective:

$$p(Y \mid X; \lambda) = \frac{\exp\left(\sum_{t=1}^{T} \sum_{j} \lambda_j f_j(X, t, y_{t-1}, y_t)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(\sum_{t=1}^{T} \sum_{j} \lambda_j f_j(X, t, y'_{t-1}, y'_t)\right)} \qquad (2.21)$$

where λ is a set of weights and f is a feature function that returns j feature values. The CRF model is much more expressive than HMM-like models, because it allows arbitrary dependencies on the observed sequence (see right side of Figure 2.9). Furthermore, in CRF, a state does not need to be specified completely by the features, therefore it can better exploit sparse training sequences [Lafferty et al., 2001].
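Decoding the most likely label sequence in such finite state models is typically done with the Viterbi algorithm; the numpy sketch below assumes generic per-position label scores and a transition score matrix, as they would be produced by an HMM or a trained linear-chain CRF.

```python
# Viterbi decoding for a first-order sequence model with T positions and K labels.
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (T, K) label scores; transitions: (K, K) scores; returns best path."""
    T, K = emissions.shape
    delta = np.zeros((T, K))               # best score ending in each label
    backptr = np.zeros((T, K), dtype=int)  # best previous label
    delta[0] = emissions[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + transitions + emissions[t][None, :]
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]

emissions = np.array([[2.0, 0.5], [0.1, 1.5], [1.0, 1.2]])
transitions = np.array([[0.5, -0.5], [-0.5, 0.5]])
print(viterbi(emissions, transitions))  # [0, 1, 1]
```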


FIGURE 2.10: RNN architecture with input activations x, hidden layer h and output layer y. Dotted lines depict recurrent connections. (Figure adapted from Graves [2012])

2.3.2 Recurrent Neural Networks

Recurrent neural networks (RNNs) are a class of neural architectures that handle sequential iterations using cyclical connections [Graves, 2012]. These recurrent connections allow the internal state of a network to act as ‘memory’ for previous inputs. A typical RNN architecture uses nodes that receive input from the current data point $x_t$ and from hidden node outputs $h_{t-1}$ from the previous time step. Here, nodes use nonlinear sigmoid activations σ. The network’s output $y_t$ is calculated from the hidden node values $h_t$ at time step t [Lipton and Berkowitz, 2015]:

$$h_t = \sigma(W_{hx} x_t + W_{hh} h_{t-1} + b_h)$$
$$y_t = \mathrm{softmax}(W_{yh} h_t + b_y) \qquad (2.22)$$

where $W_{hx}$, $W_{hh}$ and $W_{yh}$ are weight matrices for input, recurrent and output connections respectively, and $b_h$, $b_y$ are bias parameters. This architecture is visualized in Figure 2.10 with unfolded time steps. In practice, this network can be trained over many time steps using the backpropagation through time (BPTT) algorithm [Werbos, 1990].
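A forward pass through Eq. 2.22 can be sketched directly in numpy; weights are random placeholders and the input sequence is a toy example.

```python
# Unrolled forward pass of a simple RNN (Eq. 2.22).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
input_dim, hidden_dim, output_dim = 4, 3, 2
W_hx = rng.normal(size=(hidden_dim, input_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
W_yh = rng.normal(size=(output_dim, hidden_dim))
b_h, b_y = np.zeros(hidden_dim), np.zeros(output_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):   # five input time steps
    h = sigmoid(W_hx @ x_t + W_hh @ h + b_h)  # hidden state update
    y_t = softmax(W_yh @ h + b_y)             # output distribution
    print(y_t)
```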

2.3.3 Long Short-Term Memory

Training RNNs over long sequences is challenging, because gradients may vanish or explode along the large number of recurrent connections. This makes it especially hard to capture long-range dependencies in the data. Hochreiter and Schmidhuber [1997] introduce the Long Short-Term Memory (LSTM) model to overcome this problem. In contrast to RNNs, the hidden node is replaced by a memory cell (see Figure 2.11). Each cell contains five nodes, which are internally connected [Lipton and Berkowitz, 2015]:

• Input node (g): takes the activations from input layer $x_t$ and the previous time step $h_{t-1}$, similar to a RNN. Typically, inputs are summed and run through a tanh activation φ.


FIGURE 2.11: A single LSTM memory cell with input node g, internal state s, input gate i, forget gate f and output gate o. Dotted lines depict recurrent connections. (Figure adapted from Lipton and Berkowitz [2015])

• Internal state (s): acts like a memory using a self-connected recurrent edge with linear activation. This ensures that the gradient can pass many time steps without vanishing or exploding.

• Input gate (i): controls the flow from input node to internal state by multiplication with a sigmoid σ. If the gate value is zero, activations from other nodes are cut off. If the value is one, activations are passed through.

• Forget gate (f): enables the cell to delete its internal memory [Gers et al., 2000]. If the gate value is zero, the internal state is flushed. If the value is one, the internal state is kept.

• Output gate (o): controls the flow from internal state to the output ht.

The computation algorithm for one LSTM layer is as follows [Lipton and Berkowitz, 2015]:

g_t = φ(W_gx x_t + W_gh h_{t−1} + b_g)
i_t = σ(W_ix x_t + W_ih h_{t−1} + b_i)
f_t = σ(W_fx x_t + W_fh h_{t−1} + b_f)                               (2.23)
o_t = σ(W_ox x_t + W_oh h_{t−1} + b_o)
s_t = φ(g_t ⊙ i_t + s_{t−1} ⊙ f_t)
h_t = s_t ⊙ o_t

where W, b are weight matrices and bias terms, and ⊙ denotes element-wise multiplication.
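The following NumPy sketch (an illustration, not the thesis implementation) computes one memory cell following Equation 2.23; weight shapes and the random initialization are assumptions, and the tanh around the internal state follows the equation exactly as printed above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, W, b):
    """One LSTM time step following Equation 2.23.
    W and b are dicts holding the input/recurrent weights and biases per node."""
    g = np.tanh(W["gx"] @ x_t + W["gh"] @ h_prev + b["g"])   # input node
    i = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev + b["i"])   # input gate
    f = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev + b["f"])   # forget gate
    o = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev + b["o"])   # output gate
    s = np.tanh(g * i + s_prev * f)                          # internal state (as in Eq. 2.23)
    h = s * o                                                # hidden output
    return h, s

# toy setup: 4-dim inputs, 6 memory cells, 5 unrolled time steps (all assumed)
rng = np.random.default_rng(1)
d_in, d_h = 4, 6
W = {k: rng.normal(scale=0.1, size=(d_h, d_in if k.endswith("x") else d_h))
     for k in ["gx", "gh", "ix", "ih", "fx", "fh", "ox", "oh"]}
b = {k: np.zeros(d_h) for k in "gifo"}
h, s = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):
    h, s = lstm_step(x_t, h, s, W, b)
print(h.shape)  # (6,)
```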


FIGURE 2.12: Bidirectional LSTM architecture with input activations x, forward layer h, backward layer z and output layer y. (Figure adapted from Graves [2012])

For many tasks it is beneficial to extend this unidirectional LSTM so that it can handle information from left and right context of the current time step. Therefore, the bidirectional LSTM (BLSTM) uses a second independent backward layer z (see Figure 2.12). The backward layer is defined similarly to above, but with recurrent connections zt+1 instead of ht−1. Both layers are combined to train a common objective y [Lipton and Berkowitz, 2015]:

y_t = softmax(W_yh h_t + W_yz z_t + b_y)                             (2.24)

where W_yh, W_yz are weight matrices and b_y is a bias term. The BLSTM can be trained the same way as an RNN using BPTT, or, in the case of very long or continuously running sequences, with truncated backpropagation through time (TBPTT) [Williams and Zipser, 1989].
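The sketch below (an illustration under assumed toy dimensions, not the thesis code) shows only the combination step of Equation 2.24: given the per-step states of an independent forward and backward layer, both are merged into one class distribution per time step.

```python
import numpy as np

def combine_bidirectional(H, Z, Wyh, Wyz, by):
    """Equation 2.24: merge forward states H and backward states Z into per-step
    class distributions. H and Z would come from two independent LSTM layers."""
    Y = []
    for h_t, z_t in zip(H, Z):
        logits = Wyh @ h_t + Wyz @ z_t + by
        e = np.exp(logits - logits.max())
        Y.append(e / e.sum())
    return np.stack(Y)

# toy dimensions: 5 time steps, 6 hidden units per direction, 4 output classes (assumed)
rng = np.random.default_rng(2)
H = rng.normal(size=(5, 6))          # stand-in for forward LSTM outputs h_t
Z = rng.normal(size=(5, 6))          # stand-in for backward LSTM outputs z_t
Y = combine_bidirectional(H, Z, rng.normal(size=(4, 6)), rng.normal(size=(4, 6)), np.zeros(4))
print(Y.sum(axis=1))                 # each row sums to 1
```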

2.3.4 Transformer

One inherent problem of recurrent models, such as RNN and LSTM, is that they cannot efficiently be parallelized. Recurrent models require O(n) sequential operations to compute an example of n time steps. Handling long-range dependencies is especially expensive in recurrent models, because the maximum path length between any two positions is O(n) as well. Vaswani et al. [2017] introduce the Transformer to overcome this problem. Their model is entirely based on attention to draw global dependencies between input and output. The Transformer can efficiently parallelize computations by keeping a maximum path length of O(1) and thereby reducing the number of sequential operations to O(1). The Transformer’s architecture is shown in Figure 2.13. It consists of N layers of encoder and decoder stacks. The input is a sequence of token embeddings concatenated with positional encodings, which provide ordering information for the tokens. Similar to sequence-to-sequence models, this architecture allows every position in the decoder to attend over all positions in the input sequence.

FIGURE 2.13: Architecture of the Transformer model with encoder (left) and decoder stacks (right). (Figure taken from Vaswani et al. [2017])

FIGURE 2.14: Architecture of scaled dot-product attention (top) and multi-head attention (bottom). (Figures taken from Vaswani et al. [2017])

An encoder is composed of a self-attention mechanism on the input and a feed-forward network, with residual connections and layer normalization after each step. A decoder is composed of masked attention on the output, self-attention over the decoder state and the encoder output, and a feed-forward network, again with residual connections and normalization after each step. The masking is necessary to ensure that predictions depend only on the known outputs from previous time steps. The decoder produces a logit vector, which is turned into softmax probabilities for the task output. Typically, N = 6 layers are used with an embedding dimension of 512. The Transformer uses a scaled dot-product attention mechanism, which calculates alignments between output queries Q and input keys K of dimension d_k and applies them to the values V [Vaswani et al., 2017]:

attention(Q, K, V) = softmax( QK^⊤ / √d_k ) V                        (2.25)

In multi-head attention, this function is performed h times in parallel with different learned linear projections of queries, keys and values. This allows the model to jointly attend to different positions from the representation subspaces. The attention mechanisms are shown in Figure 2.14. Typically, h = 8 attention heads with 64 dimensions are used.
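As an illustration of Equation 2.25 and the multi-head extension, the following NumPy sketch computes scaled dot-product attention and concatenates several heads with separate projections. It is a minimal sketch with assumed toy dimensions and random projection matrices, not a full Transformer layer (masking, output projection and layer normalization are omitted).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Equation 2.25: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(Q, K, V, projections):
    """Run h attention heads in parallel with learned projections and concatenate.
    `projections` is a list of (Wq, Wk, Wv) tuples, one per head."""
    heads = [scaled_dot_product_attention(Q @ Wq, K @ Wk, V @ Wv)
             for Wq, Wk, Wv in projections]
    return np.concatenate(heads, axis=-1)

# toy setup (assumed): sequence of 5 positions, model dimension 16, 2 heads of dimension 8
rng = np.random.default_rng(3)
X = rng.normal(size=(5, 16))
projections = [tuple(rng.normal(size=(16, 8)) for _ in range(3)) for _ in range(2)]
out = multi_head_attention(X, X, X, projections)   # self-attention: Q = K = V = X
print(out.shape)  # (5, 16)
```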

A Transformer model with dimension d has a total computational complexity per layer of O(n² · d), in contrast to RNN, which has a complexity of O(n · d²) [Vaswani et al., 2017]. Consequently, Transformer models not only provide better parallelization, but also scale better in memory capacity, especially for short sequences where n < d. However, a large drawback of Transformers is the memory requirement for long sequences, such as entire documents with thousands of tokens.

2.4 Discussion

In this chapter, we have discussed traditional approaches for Information Extraction, which are often based on discrete syntactic processing. We have shown that these models improve from pre-trained distributed language representations, but—most prominently in the case of Entity Linking—remain inside the symbolic paradigm and generate discrete outcomes. The vision of Neural Machine Reading, however, is to design the information-seeking process as a continuous end-to-end task. In this scenario, intermediate representations are distributed and allow semantic nuances to propagate through multiple layers in the application. Sequence labeling models and Transformers, on the other hand, are currently too focused on the distributional hypothesis. Especially in specific domains such as clinical medicine, a large amount of knowledge is modeled using hierarchical structure, for example in the Unified Medical Language System (UMLS). A domain-specific Neural Machine Reading model needs to utilize this structural knowledge in the same way as neural Language Models utilize distributional information from raw text. Furthermore, in long documents the distributional hypothesis may fail, because long-range and multi-document dependencies cannot effectively be resolved by these models. In this thesis, we therefore keep on following the traditional ideas of word-sense disambiguation, coreference resolution, Entity Linking and Topic Modeling. However, we do not primarily regard them as linguistic tasks on their own. Instead, we aim to embed these tasks into the larger picture of the information-seeking process.


Chapter 3

A Robust Model for Efficient Entity Linking

In Information Extraction, named entities are the smallest units of information [Sarawagi, 2008]. Therefore, we expect a Machine Reading model to be able to detect and represent entities in a way that they can be linked to an existing knowledge base. In this chapter, we approach RQ 1: What are general solutions to identify named entities in domain-specific text? We present TASTY, a robust end-to-end model for Named Entity Recognition (NER) and Linking (NEL)1. As we aim to create a MR model for domain-specific text resources, we put a special focus on vertical corpora, such as Reuters news, Frankfurter Rundschau, Medline biomedical abstracts and car discussion forums. These corpora contain a large amount of domain-specific entities with potentially idiosyncratic names and context from different languages. We approach the NER task with contextual sequence learning and combine subword-level token representations with distributed word embeddings. Our NEL approach utilizes cosine similarity between the model’s predictions and entity embeddings. We design TASTY so that it is efficiently trainable with only a few hundred labeled sentences while targeting state-of-the-art results at the same time. This chapter is structured as follows: In Section 3.1, we introduce the tasks of NER and NEL, their design challenges and common errors. In Section 3.2, we describe our TASTY Entity Linking model, which uses robust word encoding (Section 3.2.1), a sequence-learning-based approach for recognition (Section 3.2.2) and a vector space approach for disambiguation (Section 3.2.3). In Section 3.3, we evaluate both tasks using publicly available datasets from different domains in English and German. We show that the effectiveness of our TASTY model is on par with the state of the art and highlight its efficiency in scenarios with limited available training data. In Section 3.4, we analyze errors produced by our model. We summarize this chapter in Section 3.5 with a review of the research questions posed at the beginning and we highlight the building blocks that will be important during the following chapters of this thesis.

1The main parts of this chapter were published by S. Arnold, F. A. Gers, T. Kilias, and A. Löser [2016b]. “Robust Named Entity Recognition in Idiosyncratic Domains”. In: arXiv:1608.06757 [cs.CL]. The chapter contains additional results published by S. Arnold, R. Dziuba, and A. Löser [2016a]. “TASTY: Interactive Entity Linking As-You-Type”. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, pp. 111–115.

3.1 Introduction

Information Extraction tasks have become very important not only in the Web, but also for in-house enterprise settings. One of the crucial steps towards understanding natural language is Named Entity Linking (NEL), which aims to extract mentions of entity names in text and link them to a knowledge base. NEL is necessary for many higher-level tasks such as Relation Extraction, Knowledge Base Population, Question Answering and Information Retrieval. In these scenarios recall is critical, because candidates that are not generated by an NEL system cannot be recovered later [Hachey et al., 2013]. We contribute TASTY, a general annotator for robust Named Entity Recognition and Linking that can be trained for vertical domains with low human labeling effort. TASTY does not depend on domain-specific rules, dictionaries, fine-tuning, syntactic annotation or external knowledge. Instead, our approach is built as an end-to-end model which is based on low-level character features of text. We train our model for news and biomedical domains with raw text data and a few hundred labels. With this model, we aim to achieve equal performance compared to state-of-the-art NER annotators for common tasks, such as CoNLL2003, KORE50, ACE2004 and MSNBC. We further apply the highly domain-specific biomedical GENIA corpus and a car discussion forum dataset to show how our approach adapts to various idiosyncratic domains. In particular, we observe that letter-trigram word encoding with surface form features efficiently compensates typing and capitalization errors and Bidirectional Long Short-Term Memory (BLSTM) networks capture useful distributional context required for effective NER. With a combination of these techniques, we achieve better context representation than word2vec models trained with significantly larger corpora.

3.1.1 Task Definition

The Entity Linking task aims to establish links between a dataset of text documents D and a structured knowledge base containing named entities ej ∈ K [Ji and Grishman, 2011]. This problem can be divided into two stages [Hachey et al., 2013]. In the mention recognition (NER) stage, the goal is to detect a set of entity mentions mi ∈ Md for each document d ∈ D, so that each mention holds a span of tokens that refers to an entity or concept. In the disambiguation

(NED) stage, the goal is to create the entity links Ed by assigning to each of the mentions an entry of the knowledge base:

E_d = {⟨m, e⟩ | m ∈ M_d ∧ ∃e ∈ K : m refers to e}                    (3.1)

This further implies that mentions which do not refer to an entry in K (often called NIL) are not contained in E_d. We will focus on these two stages in sections 3.2.2 and 3.2.3.

3.1.2 Challenges for Entity Linking

Pink et al. [2014] show that NER components can reduce the search space for slot filling tasks by 99.8% with a recall loss of 15%. However, a large effort is required to adapt most annotators to specialized domains, such as biomedical documents. When focusing on recall for these domains, we face three major problems. First, the language used in the documents is often idiosyncratic and cannot be effectively identified by standard NLP tools [Prokofyev et al., 2014]. Second, training for these domains is difficult: data is sparse, data may contain a large number of non-linkable entity mentions (NILs) and large labeled gold standards are hardly available. Third, applications vary greatly and we cannot standardize annotation guidelines to meet all of their requirements [Ling et al., 2015]. For example, NER on news texts might focus on proper named entity annotation (e.g. people, companies and locations), whereas phrase recognition on medical text might include the annotation of common concepts (e.g. medical terms and treatments). We therefore focus on building a generalized system with high recall, which can be efficiently trained with only a few labeled examples.

3.1.3 Common Error Analysis

Ling et al. [2015] point out common errors of NER systems, which yield non-recognized mentions (false negatives), invalid detections (false positives), wrong boundaries (e.g. multi-word mentions, missing determiners) and annotation errors from human labelers (e.g., correct answers are not marked as correct, unclear annotation guidelines). Consider the following example taken from the biomedical GENIA corpus [Kim et al., 2003], with underlined named entity mentions:

Example 3.1. Engagement of the Lewis X antigen (CD15) results in monocyte activation. Nuclear extracts of anti-CD15 cross-linked cells demonstrated enhanced levels of the transcriptional factor activator protein-1, minimally changed nuclear factor-kappa B, and did not affect SV40 promoter specific protein-1.

We observe that common errors are caused by several frequent factors:

• non-verbatim mentions (e.g. misspellings, alternate writings: monoyctes, Lewis-X)

• part-of-speech (POS) tagging errors (e.g. unidentified NP tags: [monoycte]JJ)

• wrong capitalization (e.g. uppercase headlines, lowercase proper names)

• unseen or novel words (e.g. idiosyncratic language: anti-CD15)

• irregular word context (e.g. collapsed lists, semi-structured data, invalid segmentation)

We therefore focus on building a robust system, which does not rely on linguistic preprocessing, but instead is trained with in-domain end-to-end data and character-based representations.

3.2 Entity Linking Model

In this section we describe the three stages of our TASTY Entity Linking model: robust word encoding, Named Entity Recognition and Named Entity Disambiguation.

3.2.1 Robust Word Encoding

We have shown that the most common errors for recall loss are misspellings, POS errors, capitalization, unseen words and irregular context. Therefore we carefully design the input representations to our model using three complementary feature spaces: distributed word embeddings, letter-trigram representations and surface form features.

Word embeddings. Sahlgren [2008] proposes local word context features for resolving paradigmatic relations (e.g. cyclosporin A-treated cells / HU treated cells). We apply the efficient technique of Mikolov et al. [2013a] to this problem. Their approach utilizes the continuous Skip-gram model to classify a word based on the distribution of other words in the same sentence. Our implementation is based on word2vec and represents words in a dense vector space.

xemb(w) = word2vec(w) (3.2)

Letter-trigram hashing. Dictionary-based word vectorization methods suffer from sparse training sets, especially in the case of non-verbatim mentions, rare words, spelling and capitalization errors. For example, word2vec generalizes insufficiently for rare words in idiosyncratic domains or for misspelled words, since for these words no vector representation is learned at training time. In the biomedical GENIA data set, we notice 27% unseen words (out-of-vocabulary) in the pretrained model. As training data generation is expensive, we investigate a general approach for the generation of word vectors. We use letter-trigram word hashing as introduced by Huang et al. [2013]. This technique splits a word into discriminative three-letter ‘syllables’ with boundary markers and generates an n-hot vector using this bag, e.g. “cell” → {#ce, cel, ell, ll#} (see Figure 3.1). With this partitioning, the vector is robust against misspellings and out-of-vocabulary words and has the advantage of grouping morphologically similar words in similar vector spaces:

x_tri(w) = ∑_{t∈trigram(w)} e_idx(t)                                 (3.3)

where idx(t) is a function that returns the index of a trigram and e_i is the i-th unit vector.
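A minimal Python sketch of this encoding is given below; it is an illustration, not the thesis implementation, and the toy trigram vocabulary is an assumption (in practice the index would be built from the training corpus).

```python
import numpy as np

def letter_trigrams(word):
    """Split a word into three-letter 'syllables' with boundary markers,
    e.g. "cell" -> {"#ce", "cel", "ell", "ll#"}."""
    padded = "#" + word.lower() + "#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_vector(word, trigram_index):
    """Equation 3.3: n-hot vector over the known trigram vocabulary."""
    x = np.zeros(len(trigram_index))
    for t in letter_trigrams(word):
        if t in trigram_index:          # unknown trigrams only drop single dimensions
            x[trigram_index[t]] = 1.0
    return x

# toy vocabulary of trigrams (assumed for illustration)
vocab = {t: i for i, t in enumerate(sorted(letter_trigrams("cell") | letter_trigrams("cells")))}
print(letter_trigrams("cell"))
print(trigram_vector("cel", vocab))     # misspelling still shares most dimensions with "cell"
```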

FIGURE 3.1: Architecture of robust letter-trigram word encoding. The character stream “family of G-protein coupled chemokine receptors” is tokenized into words w_{1...T} and encoded into word vectors x_{1...T} using letter-trigram hashing.

Surface form features. Words appear in various surface forms, e.g. capitalized at the beginning of sentences, uppercase in headlines, lowercase in social media text or other erroneous capitalizations. However, the most important features for shallow learners are word shape properties, such as length, initial capitalization, all-word uppercase, in-word capitalization and use of numbers or punctuation [Ling and Weld, 2012]. When using mixed-case dictionaries, these features are implicitly included in the encoding. We achieve stronger generalization and faster learning by encoding words as lowercase and isolating the surface information as 15 boolean values in the feature vector that indicate uppercase (UC), lowercase (LC), numerics (NUM), punctuation (PUC), document (DOC) and sentence (SNT) features:

x_sf(w) = [ startUC(w), allUC(w), startLC(w), allLC(w), mixedCase(w),
            startNUM(w), someNUM(w), allNUM(w), endNUM(w),
            startPUC(w), endPUC(w),
            startSNT(w), startDOC(w), endSNT(w), endDOC(w) ]^⊤          (3.4)
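The sketch below illustrates how the 15 flags of Equation 3.4 could be computed for a single token; the exact predicate definitions (e.g. for mixedCase and the punctuation flags) are illustrative assumptions and not taken from the thesis.

```python
def surface_form_features(word, start_snt=False, end_snt=False,
                          start_doc=False, end_doc=False):
    """The 15 boolean surface-form flags of Equation 3.4, returned as 0/1 values.
    Sentence/document position flags are passed in by the tokenizer."""
    alpha = [c for c in word if c.isalpha()]
    return [int(flag) for flag in (
        word[:1].isupper(),                                             # startUC
        bool(alpha) and all(c.isupper() for c in alpha),                # allUC
        word[:1].islower(),                                             # startLC
        bool(alpha) and all(c.islower() for c in alpha),                # allLC
        any(c.isupper() for c in word[1:]) and any(c.islower() for c in word),  # mixedCase
        word[:1].isdigit(),                                             # startNUM
        any(c.isdigit() for c in word),                                 # someNUM
        word.isdigit(),                                                 # allNUM
        word[-1:].isdigit(),                                            # endNUM
        bool(word) and not word[:1].isalnum(),                          # startPUC
        bool(word) and not word[-1:].isalnum(),                         # endPUC
        start_snt, start_doc, end_snt, end_doc,                         # position flags
    )]

print(surface_form_features("CD28"))        # uppercase letters mixed with digits
print(surface_form_features("anti-CD15"))   # in-word capitalization and punctuation
```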

3.2.2 Named Entity Recognition

We model mention recognition as a sequential word labeling problem. We express each sentence in a document as a sequence of words: s = (w_1, . . . , w_T). We define a mention as the longest possible span of adjacent tokens that refer to an entity or relevant concept of a real-world object, such as CD28 surface receptor. We further assume that mentions are non-recursive and non-overlapping. Figure 3.2 illustrates an example for context-sensitive transformation of a word sequence s = (w_1, . . . , w_T) into word labels y = (y_1, . . . , y_T). We use the BIOES tagging scheme [Ratinov and Roth, 2009] to encode boundaries of a mention span. We assign labels {B,I,O,E,S} to each token to mark begin (B), inside (I), outside (O), end (E) and single-word

(S) mentions, reading from left to right. Our objective is now to predict the most likely label yˆt for a word wt regarding its context:

ŷ_t = arg max_{l∈{B,I,O,E,S}} P(y_t = l | w_1, . . . , w_{t−1}, w_t, w_{t+1}, . . . , w_T)        (3.5)

We utilize recurrent neural networks to approximate a solution for this objective.
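As a brief illustration of the BIOES scheme introduced above (not part of the thesis implementation), the following sketch converts mention span annotations into per-token labels; the example sentence and span are taken from Figure 3.2, the helper function itself is an assumption.

```python
def spans_to_bioes(num_tokens, spans):
    """Convert mention spans [(start, end), ...] (end exclusive, token indices)
    into one BIOES label per token."""
    labels = ["O"] * num_tokens
    for start, end in spans:
        if end - start == 1:
            labels[start] = "S"                    # single-word mention
        else:
            labels[start] = "B"                    # begin
            labels[end - 1] = "E"                  # end
            for i in range(start + 1, end - 1):
                labels[i] = "I"                    # inside
    return labels

tokens = ["the", "CD28", "surface", "receptor", "provides"]
print(spans_to_bioes(len(tokens), [(1, 4)]))
# ['O', 'B', 'I', 'E', 'O']
```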


FIGURE 3.2: Architecture of the BLSTM network used for named entity recognition. (1) The character stream “the CD28 surface receptor provides” is tokenized into words and converted into word feature vectors using surface forms (SF), letter-trigram hashing (TRI) and word embeddings (EMB). (2) We use a deep contextual sequence learner with stacked feed-forward (FF) and recurrent (LSTM) layers for bidirectional context representation. (3) We correct BIOES word labels using local context and decode them into mention annotations.

Sequence learning. To efficiently recognize mentions in text, long-range context-sensitive information is indispensable. Especially in the idiosyncratic domain, we expect noisy input data with high variance. Thus, we require strong generalization not only for the syntactic representation of language, but also for the latent semantic dependencies between the words in a document. We approach this problem by applying the computational model of recurrent neural networks, in particular Long Short-Term Memory networks (LSTMs) [Hochreiter and Schmidhuber, 1997] with forget gates [Gers et al., 2000] to the problem of sequence learning. Like deep neural feed-forward (FF) networks, LSTMs are able to learn complex parameters using gradient descent, but include additional recurrent connections between cells to influence weight updates over adjacent time steps. With their ability to memorize and forget over time, LSTMs have proven to generalize context-sensitive sequential data well [Graves, 2012; Lipton and Berkowitz, 2015].

Bidirectional Long Short-Term Memory. We use a combination of all three word feature approaches as input for a network of five stacked layers (see Figure 3.2). The word-wise sequential input vectors x_{1...T} are calculated using concatenation (⊕):

x(w) = x_emb(w) ⊕ x_tri(w) ⊕ x_sf(w)                                 (3.6)

We squash the input vector using two fully connected FF layers of size 300. We utilize the efficient rectified linear unit (ReLU) activation, which is more robust against the vanishing gradient problem [Nair and Hinton, 2010]. The core sequence learner is composed of two LSTM layers with 100 cells each, one connected to read the sentence from left to right, the other in reverse direction. The bidirectional layers with forward state ht and backward state zt are combined into a final FF layer with 5-class softmax activation. The output of this layer yt is a probability distribution over BIOES labels per time step t:

h_t = LSTM(h_{t−1}, x_t)
z_t = LSTM(z_{t+1}, x_t)                                             (3.7)
y_t = softmax(W_yh h_t + W_yz z_t + b_y)

where W_yh, W_yz are weight matrices and b_y is a bias term. We iterate over labeled sentences as training examples in mini-batches to update weights and bias parameters using backpropagation through time [Werbos, 1990]. The network is then used to predict label probabilities y_{1...T} for unseen word sequences w_{1...T}. We implement the network using the DL4J framework2. We use stochastic gradient descent with RMSProp [Tieleman and Hinton, 2012] and a learning rate of 0.0025 with L2 regularization. Using the bidirectional LSTM, we achieve deeper contextual understanding, e.g. over the boundaries of multi-word annotations and at the beginning of sentences.

Entity mention decoding. To apply the trained sequence learner to unseen text, we do a forward pass through the network to predict label probabilities y = (y_1, . . . , y_T) for each sentence. The output of the BLSTM softmax is a local per-word probability vector y_t for BIOES labels. However, this prediction cannot guarantee a correct order of labels according to the BIOES scheme, e.g. E is only valid after B or I. We generate a corrected label sequence ŷ for each sentence ỹ = (O, y_1, . . . , y_T, O) by maximizing the combined probability of all possible valid sequences:

ŷ = arg max_y ∏_{t=2}^{N+1} ψ(ỹ_{t−1}, ỹ_t)                          (3.8)

where ψ(y_i, y_j) = 1 if y_j is a valid label after y_i, and 0 otherwise. This step is usually implemented as a CRF classifier [McCallum and Li, 2003]. However, we rely on the more generalized LSTM sequence learner and use this simple approach for correction. Since ψ only depends on two adjacent tokens, we implement the optimization of Equation 3.8 using dynamic programming. Finally, we decode the corrected sequence ŷ into a set of sentence-level mention annotations

M_s and add the instances to the document mentions M_d.
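One way such a dynamic program could look is sketched below; it is an illustration under the validity constraints described above, not the thesis source code, and the toy softmax outputs are assumptions. It finds the most probable BIOES sequence among all sequences allowed by ψ.

```python
import numpy as np

LABELS = ["B", "I", "O", "E", "S"]
# psi(prev, curr) = 1 if curr may follow prev under the BIOES scheme, else 0
VALID_NEXT = {
    "O": {"B", "O", "S"},
    "B": {"I", "E"},
    "I": {"I", "E"},
    "E": {"B", "O", "S"},
    "S": {"B", "O", "S"},
}

def correct_bioes(probs):
    """Most probable *valid* BIOES sequence for per-token softmax outputs
    `probs` of shape (T, 5), maximizing the product of probabilities over valid paths."""
    T = len(probs)
    log_p = np.log(np.asarray(probs) + 1e-12)
    best = np.full((T, 5), -np.inf)
    back = np.zeros((T, 5), dtype=int)
    for j, label in enumerate(LABELS):              # first token follows the padding "O"
        if label in VALID_NEXT["O"]:
            best[0, j] = log_p[0, j]
    for t in range(1, T):
        for j, label in enumerate(LABELS):
            for i, prev in enumerate(LABELS):
                if label in VALID_NEXT[prev] and best[t - 1, i] + log_p[t, j] > best[t, j]:
                    best[t, j] = best[t - 1, i] + log_p[t, j]
                    back[t, j] = i
    # the sequence must end in a state that the padding "O" may follow
    final = [j for j, l in enumerate(LABELS) if "O" in VALID_NEXT[l]]
    j = max(final, key=lambda k: best[T - 1, k])
    path = [j]
    for t in range(T - 1, 0, -1):
        j = back[t, j]
        path.append(j)
    return [LABELS[k] for k in reversed(path)]

# toy softmax outputs where the raw argmax would be the invalid sequence "O E O"
probs = [[0.1, 0.1, 0.6, 0.1, 0.1],
         [0.1, 0.1, 0.1, 0.6, 0.1],
         [0.1, 0.1, 0.6, 0.1, 0.1]]
print(correct_bioes(probs))   # ['O', 'O', 'O'] -- a valid sequence is chosen instead
```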

2http://deeplearning4j.org version 0.4.0

3.2.3 Named Entity Disambiguation

The next step is to disambiguate the set of mentions to the knowledge base K. We break down this problem into two steps. In the first step, candidate retrieval, we retrieve a set of candidates

C_m for each mention m ∈ M_d. In the second step, disambiguation, we aim to find the most likely entity assignment ⟨m, e⟩ by ranking all candidates and picking the first-ranked candidate.

Knowledge Base. The knowledge base K is constructed as a set of entities, which we represent as a tuple:

K = { e | e = ⟨ id, N, ε, σ ⟩ }                                      (3.9)

where id is a unique entity identifier, N is a list of names and abbreviations for the entity, ε is an entity name embedding and σ a context embedding. We populate this information by parsing page names, page titles and anchor links from a recent English Wikipedia dump3. We store this information in a Lucene4 index to allow lookup and search of entity names with low latency.

Entity embeddings. For each entity e in the knowledge base, we create two distributed entity embeddings ε_e and σ_e. The purpose of these vectors is to encode contextual information for disambiguation. The entity name embedding encodes all known names for an entity using the trigram representation:

ε_e = ∑_{w∈N_e} ∑_{t∈trigram(w)} e_idx(t)                            (3.10)

The entity context embedding encodes typical entity context c_e as a distributed vector representation. We utilize the Paragraph Vectors model [Le and Mikolov, 2014]:

σe = ParVec(ce) (3.11)

We train this model using Wikipedia articles and generate the entity context embeddings using the abstract of each entity.

Candidate retrieval. In the first step, we generate entity candidates by matching the mention string to the set of names contained in the alias list:

Cm = {e | e ∈ K ∧ m ∈ Ne} (3.12)

We use the implementation of BM25 in Lucene with additional fuzzy matching parameters to allow higher recall.

3https://dumps.wikimedia.org/enwiki/ 4https://lucene.apache.org

Disambiguation. In the second step, we aim to find the most likely entity assignment ⟨m, e⟩ by ranking all candidates and picking the first-ranked candidate:

E_d = {⟨m, e⟩ | m ∈ M_d ∧ e = arg max_{c∈C_m} sim(c, m, d)}          (3.13)

The similarity function sim(c, m, d) uses cosine similarity between the pre-calculated entity embeddings of each candidate, ε_c and σ_c, and the mention name embedding ε_m concatenated with the document context embedding σ_d:

sim(c, m, d) = cosine(ε_c ⊕ σ_c, ε_m ⊕ σ_d)                          (3.14)

While picking the first-ranked candidate could be replaced by collective re-ranking over a document, we observe that the integration of document context in the similarity function provides enough contextual information to achieve coherent disambiguation.
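The ranking step of Equations 3.13 and 3.14 can be sketched as follows; this is an illustration only, with made-up entity identifiers and random toy vectors standing in for the pre-computed trigram and Paragraph Vectors embeddings.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def disambiguate(mention_vec, doc_context_vec, candidates):
    """Rank candidate entities by Equation 3.14 and return the top-ranked entity id.
    `candidates` maps entity ids to (name_embedding, context_embedding) pairs."""
    query = np.concatenate([mention_vec, doc_context_vec])
    scored = [(cosine(np.concatenate([name_emb, ctx_emb]), query), entity_id)
              for entity_id, (name_emb, ctx_emb) in candidates.items()]
    return max(scored)[1]

# toy vectors (assumed): 4-dim trigram name embeddings, 3-dim context embeddings
rng = np.random.default_rng(4)
candidates = {"Q1": (rng.normal(size=4), rng.normal(size=3)),
              "Q2": (rng.normal(size=4), rng.normal(size=3))}
mention = candidates["Q2"][0] + 0.05 * rng.normal(size=4)   # mention resembles Q2's name
context = candidates["Q2"][1] + 0.05 * rng.normal(size=3)
print(disambiguate(mention, context, candidates))           # expected: "Q2"
```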

3.3 Evaluation

We evaluate two configurations of our TASTY model on eleven gold standard evaluation data sets. We perform Named Entity Recognition and Named Entity Disambiguation tasks. We show that the combination of letter-trigram word hashing and word embeddings with bidirectional LSTM yields the best results and outperforms sequence learners based on dictionaries or word2vec. To highlight the generalization of our model to specialized domains, we run tests on common-typed English data sets as well as on German and English datasets from the biomedical and car industry domains. We compare our system on these data sets with state-of-the-art entity linkers and annotators from NLP pipelines.

3.3.1 Evaluation Set-up

We train two models with identical parameterization, each with labeled sentences from the corresponding training data set. The first model, TASTY, is solely based on letter-trigram features from the training data. The second model, TASTY+emb, utilizes additional word embeddings to include distributional context information from pre-training on a larger dataset5. For preprocessing (sentence splitting and word tokenization), we use Stanford CoreNLP 3.6.0 [Manning et al., 2014]. The prediction model was trained using Deeplearning4j with nd4j-x86 backend. Training TASTY on a 4-core Intel i7 CPU at 2.8GHz takes approximately 50 minutes.

Evaluation datasets. We use a variety of standard and domain-specific data sets for training and evaluation. CoNLL2003 [Kim et al., 2004] is a standard NER dataset based on the Reuters RCV-1 news corpus. It covers named entities of type person, location, organization and miscellaneous in English and German. KORE50 [Hoffart et al., 2012] uses a similar annotation scheme on the same corpus. The TIGER treebank [Brants et al., 2002] is based on German

5we utilize the pretrained GoogleNews-vectors-negative300 embeddings model

Dataset              CoNLL03-en            KORE50                MSNBC                 ACE2004
domain               English news          English news          English news          English news
task                 named entities        named entities        wikification          concepts

Model                Prec   Rec    F1      Prec   Rec    F1      Prec   Rec    F1      Prec   Rec    F1
Babelfy              44.2   62.7   51.8    69.2   68.8   69.0    36.9   66.2   47.4     8.2   48.0   14.1
DBpedia Spotlight    66.6   58.6   62.4      –      –      –     48.6   45.2   46.8    11.1   64.4   18.9
Entityclassifier     68.2   69.8   69.0    87.6   88.2   87.9    61.7   76.0   68.1    11.5   77.8   20.0
Stanford NER         96.4   73.6   83.5    91.4   73.6   81.5    87.9   77.6   82.4    17.1   85.6   28.6
LingPipe             69.0   50.3   58.2    78.0   71.5   74.6    53.1   57.0   55.0    12.7   69.6   21.5
FOX                  94.7   71.3   81.3    90.4   71.5   79.8     3.2    2.7    2.9      –      –      –
NERD-ML              51.7   62.1   56.4    67.5   79.2   72.8    59.8   47.7   53.1    19.6   34.3   25.0
TASTY                87.9   90.1   89.0    96.6   92.2   94.3    68.5   83.1   75.2    11.7   86.0   20.5
TASTY+emb            90.3   92.0   91.1    95.3   92.2   93.7    72.4   82.7   77.2      –      –      –

TABLE 3.1: Experimental results for English Named Entity Recognition for four standard data sets compared to seven state-of-the-art annotators. The table shows micro-averaged Precision, Recall, F1 scores for exact annotation span match. All annotators use the same domain- and task-independent model across all experiments.

Dataset         CoNLL03-de            TIGER                 GENIA                 CAR-Model             CAR-Part
domain          German news           German news           English biomedical    German web            German web
task            named entities        noun phrases          biomedical terms      car models            car parts

Model           Prec   Rec    F1      Prec   Rec    F1      Prec   Rec    F1      Prec   Rec    F1      Prec   Rec    F1
TagMe             –      –      –       –      –      –       –      –      –      2.2   29.8    4.0     1.3   21.4    2.5
Stanford NER    89.4   63.9   74.5    68.9   31.7   43.4    31.7    7.6   12.3      –      –      –       –      –      –
LingPipe          –      –      –       –      –      –     71.8   68.3   70.0      –      –      –       –      –      –
TASTY           88.1   87.6   87.8    81.6   71.3   76.1    75.7   80.3   77.9    74.6   72.8   73.7    69.6   66.9   68.2
TASTY+emb       87.3   88.6   87.9    82.7   83.9   83.3    77.5   79.5   78.5    82.5   79.9   81.2    79.3   70.8   74.8

TABLE 3.2: Experimental results for domain-specific Named Entity Recognition on five data sets compared to standard entity annotators (micro-averaged Precision, Recall, F1 scores). The datasets vary in language, source domain and annotation task definition. Therefore, all TASTY annotators are specifically trained using labeled in-domain data.

news from Frankfurter Rundschau and contains annotated noun phrases. MSNBC [Cucerzan, 2007] contains NEL annotations for linking English news texts to Wikipedia articles. ACE2004 [Mitchell et al., 2005] contains NEL annotations covering named entities and general concepts on English news text. AQUAINT [Milne and Witten, 2008] contains NEL annotations on long English news documents, covering only the first mention of an entity. The DBpedia Spotlight corpus [Mendes et al., 2011] contains NEL annotations on English news articles. The GENIA Corpus [Kim et al., 2003] contains biomedical abstracts from PubMed and covers biomedical entities such as proteins, genes and cells. The CAR corpus was created by crawling German discussion forums on cars and annotating car models and parts [Mehlitz, 2019].

                        CoNLL03-en              GENIA
Configuration           Prec    Rec     F1      Prec    Rec     F1
TASTY+emb               90.3    92.0    91.1    77.5    79.5    78.5
no LSTM                -21.3   -20.1   -20.6   -15.1   -14.3   -14.8
unidirectional LSTM     -5.1    -2.0    -3.5    -2.6    -3.8    -3.3
1-hot encoding         -14.2   -16.4   -15.3    -6.4    +0.4    -3.2
no EMB encoding         -2.4    -1.9    -2.1    -1.8    +0.8    -0.6
no TRI encoding         -0.6    -5.8    -3.3   -14.1    -4.4    -9.5

TABLE 3.3: Ablation study for Named Entity Recognition (difference in micro-averaged F1 score) for different model configurations.

Named entity annotators. We distinguish between three broad categories of named entity annotators: Babelfy [Moro et al., 2014], DBpedia Spotlight [Mendes et al., 2011], Entityclassifier [Dojchinovski and Kliegr, 2013] and TagMe [Ferragina and Scaiella, 2010] spot noun chunks and filter them with dictionaries, often derived from Wikipedia. Stanford NER [Manning et al., 2014] and LingPipe6 utilize discriminative tagging approaches. FOX [Speck and Ngomo, 2014] and NERD-ML [Van Erp et al., 2013] combine several approaches in an ensemble learner for enhancing precision. AIDA [Hoffart et al., 2011] uses a graph-based approach for entity disambiguation.

Quality measures. We measure overall performance of all annotators using micro-averaged precision, recall and NER-style F1 score for exact span match, as defined by Cornolti et al. [2013]. We evaluate these systems in comparison with our TASTY model and utilize the GERBIL evaluation framework [Usbeck et al., 2015] to run the experiments.

3.3.2 Experimental Results

Next, we discuss the evaluation of our TASTY system on NER and NED tasks.

Named Entity Recognition performance on common English news. Table 3.1 shows the comparison of TASTY with seven state-of-the-art annotators on common news data sets in the English language. We observe that TASTY achieves the highest recall scores of all tested annotators, with 86%–92% on all measured data sets. Most notably, it achieves a state-of-the-art 91.1% F1 on the CoNLL03-en task. Overall, we achieve high micro-F1 scores of 91%–94% on news entity recognition. We notice that systems specialized in word sense disambiguation (Babelfy, DBpedia Spotlight) do not perform well on untyped concept recognition and wikification tasks. Stanford NER reaches high precision, but especially lacks recall. Ensemble methods, such as FOX and NERD-ML, can help to improve precision drastically, but do not affect recall. We also notice an overall low precision of all annotators on the ACE2004 dataset, which can be explained by differing annotation standards between training and inference time.

6http://alias-i.com/lingpipe/


FIGURE 3.3: Effect of the training data size on NER performance (F1 score) for five different configurations on CoNLL2003-en and GENIA datasets.

Entity recognition on specific domains. Table 3.2 shows the application of TASTY to five specialized domains. Across all domains investigated, our system performs comparably to the English news domain, with micro-F1 scores between 75–88%. Off-the-shelf annotators such as Stanford NER or TagMe do not generalize that well, and even the specialized GENIA model in LingPipe falls short by 8.5 percentage points on the corresponding data set. Furthermore, TASTY’s NER performance on German newswire is directly comparable to the results on English text. We achieve 87.9% F1 on German CoNLL2003 data and 83.3% F1 on the TIGER dataset. Other annotators do not support German at all. TASTY’s end-to-end sequence learner is able to adapt to different languages without any hyperparameter changes.

Ablation study. Table 3.3 shows the results of experiments on both CoNLL2003 and GENIA data sets with different model configurations. It is clearly visible that the LSTM sequence learning model increases performance significantly compared to feed-forward neural networks (no LSTM). Furthermore, bidirectional LSTM layers contribute another 3.3–3.5 percentage points F1. We notice that dictionary-based word encoding (1-hot) works surprisingly well for the medical domain, whereas it suffers from high word ambiguity in the news texts. Using only letter-trigram hashing for word encodings (no EMB) is generally robust, and even improves recall in the medical domain. This domain contains a large number of compound words, so the encoding based on subword-‘syllables’ can provide important information about word meaning. Using only pretrained word2vec embeddings (no TRI) performs well on news data, but cannot adapt to the medical domain without retraining, mainly because of the large number of unseen words. We conclude that trigram word encodings are an important factor for robustness and lead to higher precision in special domains and high recall in general.

Effect of training data size. Because it is often expensive to obtain labeled data, it is crucial to design effective models that require less training data. As we have shown in the ablation study, TASTY’s architecture supports this scenario with strong generalization and high robustness. Figure 3.3 shows the performance of TASTY’s sequence learner with varying training data sizes. We observe that the TASTY model reaches its peak performance already after training on 4,000–5,000 randomly sampled sentences. For TASTY+emb, the curve is even steeper and further increases with more than 7,000 examples.

Model               MSNBC   ACE2004   AQUAINT   DBpedia Spotlight
Babelfy              70.3     56.5      68.2          51.6
DBpedia Spotlight    38.4     45.6      47.5          60.6
Entityclassifier       –        –         –           25.5
FOX                    –        –        42.9         15.3
NERD-ML                –       57.8      59.6         55.6
AIDA                 68.4     70.3      55.0          24.9
TASTY+emb            60.8     65.3      61.9          65.4

TABLE 3.4: Experimental results (micro-averaged F1 score) for Named Entity Disambiguation to English Wikipedia on four standard data sets.

Disambiguation performance to English Wikipedia. Table 3.4 shows a comparison of TASTY’s entity disambiguation with six state-of-the-art Entity Linking systems. TASTY achieves the highest F1 scores on the DBpedia data set and is the only NEL system that consistently achieves high scores across all data sets in the range of 61–65% F1.

3.4 Discussion and Error Analysis

We investigate different aspects of the TASTY components by manual inspection of classification errors in the context of the document. For the error classes (false negative detections, false positives and invalid boundaries), we observe the following causes:

Unseen words and misspellings. In dictionary-based configurations (e.g. 1-hot encoding), we observe false negative predictions caused by out-of-vocabulary words that do not exist in the training data. The cause can be rare, unseen or novel words (e.g. T-prolymphocytic cells) or misspellings (e.g. strengthnend). These words yield a null vector result from the encoder and can therefore not be distinguished by the LSTM. The error increases when using word2vec, because these models are often trained with stop words filtered out, which are very frequent and provide important context. This also implies that e.g. mentions surrounded by or containing a determiner (e.g. The Sunday Telegraph quoted Majorie Orr) are highly error-prone with respect to the detection of their boundaries. We resolve this error using the letter-trigram approach. Very rare trigrams (e.g. thh) may be missing in the encoding, but this only affects single dimensions as opposed to the vector as a whole.

Misleading surface form features. Surface forms encode important features for NER (e.g. capitalization of “new” in Alan Shearer was named as the new England captain and as New York beat the Angels). However, case-sensitive word vectorization methods yield a large amount of false positive predictions caused by incorrect capitalization in the input data. An uppercase headline (e.g. TENNIS - U.S. TEAM ON THE ROAD FOR 1997 FED CUP) is encoded completely differently than a lowercase one (e.g. U.S. team on the road for Fed Cup). Because of that, we achieve best results with lowercase trigram vectors and additional surface form feature flags, as described in Section 3.2.1.

Syntagmatic and paradigmatic word relations. We observe mentions that are composed of co-occurring words with high ambiguity (e.g. degradation of IkB alpha in T cell lines). These groups encode strong syntagmatic word relations [Sahlgren, 2008] that can be leveraged to resolve word sense and homonyms from sentence context. Therefore, correct boundaries in these groups can effectively be identified only with contextual models such as LSTMs. Orthogonal to the previous problem, different words in a paradigmatic relation can occur in the same context (e.g. cyclosporin A-treated cells and HU treated cells). These groups are efficiently represented in word2vec. However, letter-trigram vectors cannot encode paradigmatic groups and therefore require a larger training sample to capture these relations.

Context boundaries. Often, resolving synonyms requires a larger context than the sentence- level LSTM used in TASTY. In these cases, word sense is often defined by a topic model local to the paragraph (e.g. sports: Tiger was lost in the woods after divorce.). This problem does not heavily affect NER recall, but is crucial for NED performance and coreference resolution.

3.5 Conclusions

In this chapter, we have approached RQ 1, the identification of named entities in domain-specific text. We presented TASTY, a robust end-to-end model for Named Entity Recognition and Linking. We have shown that TASTY is able to identify named entities with high recall in domain-specific text and with small available training data. Our end-to-end architecture provides a first step towards general Machine Reading by pushing the costs for creating a new IE model down to labeling a set of training examples. We showed that TASTY is able to recognize generic entity names such as persons, organizations and locations, as well as domain-specific concepts such as disease names, biomedical concepts and car parts. Our model requires labeled training data in the range of 4,000–5,000 sentences to reach an F1 score above 90%. TASTY leverages contextual information from sentence level using a bidirectional LSTM. It is robust to spelling errors and morphological variations by using a subword-level encoding based on letter trigrams and surface form features7. The same features were used to disambiguate the entity names to a knowledge base without additional model training. We further incorporated corpus-level background knowledge by including distributed word embeddings. This helped to further speed up the training process and solved generalization errors emerging from a lack of available training data. Finally, one aspect of RQ 1 still remains open: can contextual information on document level, especially structure and topic information, further improve the identification of named entities? We will come back to this question in Chapter 5.

7After this chapter was published, related research proposed a combination of letter n-gram based and distributional approaches for word encoding [Bojanowski et al., 2017]. The Fasttext model comprises most of the advantages for robust word encoding presented in this chapter and we will use it in the following chapters.

Chapter 4

Coherent Topic Segmentation and Classification

When searching for information, a human reader first glances over a document, spots relevant sections and then focuses on a few sentences for resolving her intention. To reproduce this process, a Machine Reading system needs to identify the structure of a document and highlight the salient topics for each section. In this chapter, we approach RQ 2: How can Machine Reading models detect topics and structure in long documents? We propose the task of segmenting long documents into coherent sections and assigning topic labels to each section. We present SECTOR, a Neural Machine Reading model which learns a latent topic embedding over the course of a document1. Our model utilizes bidirectional LSTMs with Bloom filter embeddings on sentence level. We apply SECTOR to the task of coherent topic segmentation and classification into up to 30 topics. We introduce the WIKISECTION dataset, which contains long documents in English and German labeled for this task from two distinct domains: medicine and geopolitics. Additionally, we evaluate SECTOR’s performance on four English datasets from clinical medicine, biomedicine, geopolitics and general encyclopedia. We show that SECTOR performs classification with high accuracy across all domains and can adapt to various datasets to predict boundaries of coherent passages. This chapter is structured as follows: In Section 4.1, we define the task of topic segmentation and classification. In Section 4.2, we introduce our WIKISECTION dataset for this task. In Section 4.3, we present the architecture for our SECTOR model. We formulate three different sentence representations as input features (Section 4.3.1). We describe SECTOR’s central topic embedding which is trained for single and multi-label multi-class topic classification (Section 4.3.2). We introduce our embedding deviation method for segmenting topics (Section 4.3.3). In Section 4.4, we evaluate different configurations of our SECTOR model in comparison to 12 architectures from related work. In Section 4.5, we discuss the results and give insights into the internal representations of SECTOR. In Section 4.6, we discuss related work. We summarize this chapter in Section 4.7 and point out important building blocks for our vision of Neural Machine Reading.

1This chapter was published by S. Arnold, R. Schneider, P. Cudré-Mauroux, F. A. Gers, and A. Löser [2019]. “SECTOR: A Neural Model for Coherent Topic Segmentation and Classification”. In: Transactions of the Association for Computational Linguistics 7, pp. 169–184 with oral presentation at ACL’2019.

4.1 Introduction

Today’s systems for natural language understanding are comprised of building blocks that extract semantic information from the text, such as named entities, relations, topics or discourse structure. In traditional Natural Language Processing (NLP), these extractors are typically applied to bags of words or full sentences [Hirschberg and Manning, 2015]. Recent neural architectures build upon pre-trained word or sentence embeddings [Mikolov et al., 2013a; Le and Mikolov, 2014], which focus on semantic relations that can be learned from large sets of paradigmatic examples, even from long ranges [Dieng et al., 2017]. From a human perspective, however, it is mostly the authors themselves who help best to understand a text. Especially in long documents, an author thoughtfully designs a readable structure and guides the reader through the text by arranging topics into coherent passages [Glavaš et al., 2016]. In many cases, this structure is not formally expressed as section headings (e.g. in news articles, reviews, discussion forums) or it is structured according to domain-specific aspects (e.g. health reports, research papers, insurance documents).

4.1.1 Challenges for Topic Representation

Ideally, systems for text analytics, such as Topic Detection and Tracking (TDT) [Allan, 2002], text summarization [Huang et al., 2003], Information Retrieval (IR) [Dias et al., 2007] or Question Answering (QA) [Cohen et al., 2018] could access a document representation that is aware of both topical (i.e. latent semantic content) and structural information (i.e. segmentation) in the text [MacAvaney et al., 2018]. The challenge in building such a representation is to combine these two dimensions which are strongly interwoven in the author’s mind. It is therefore important to understand topic segmentation and classification as a mutual task that requires encoding both topic information and document structure coherently. In this chapter, we present SECTOR2, an end-to-end model which learns an embedding of latent topics from potentially ambiguous headings and can be applied to entire documents to predict local topics on sentence level. Our model encodes topical information on a vertical dimension and structural information on a horizontal dimension. We show that the resulting embedding can be leveraged in a downstream pipeline to segment a document into coherent sections and classify the sections into one of up to 30 topic categories reaching 71.6% F1 – or alternatively to attach up to 603 topic labels with 71.1% MAP. We further show that the segmentation performance of our bidirectional LSTM architecture is comparable to specialized state-of-the-art segmentation methods on various real-world datasets. To the best of our knowledge, the combined task of segmentation and classification has not been approached on full document level before. There exist a large number of datasets for text segmentation, but most of them do not reflect real-world topic drifts [Choi, 2000; Sehikh et al., 2017], do not include topic labels [Eisenstein and Barzilay, 2008; Jeong and Titov, 2010; Glavaš

2Our source code is available under the Apache License 2.0 at https://github.com/sebastianarnold/SECTOR


et al., 2016] or are heavily normalized and too small to be used for training neural networks [Chen et al., 2009]. We can utilize a generic segmentation dataset derived from Wikipedia that includes headings [Koshorek et al., 2018], but there is also a need in IR and QA for supervised structural topic labels [Agarwal and Yu, 2009; MacAvaney et al., 2018], different languages and more specific domains, such as clinical or biomedical research [Tepper et al., 2012; Tsatsaronis et al., 2012] and news-based TDT [Kumaran and Allan, 2004; Leetaru and Schrodt, 2013]. Therefore we introduce WIKISECTION3, a large novel dataset of 38K articles from the English and German Wikipedia labeled with 242K sections, original headings and normalized topic labels for up to 30 topics from two domains: diseases and cities. We chose these subsets to cover both clinical/biomedical aspects (e.g. symptoms, treatments, complications) and news-based topics (e.g. history, politics, economy, climate). Both article types are reasonably well-structured according to Wikipedia guidelines [Piccardi et al., 2018], but we show that they are also complementary: diseases is a typical scientific domain with low entropy, i.e. very narrow topics, precise language and low word ambiguity. In contrast, cities resembles a diversified domain, with high entropy, i.e. broader topics, common language and higher word ambiguity, and will be more applicable to e.g. news, risk reports or travel reviews. We compare SECTOR to existing segmentation and classification methods based on Latent Dirichlet Allocation (LDA), paragraph embeddings, convolutional neural networks (CNNs) and recurrent neural networks (RNNs). We show that SECTOR significantly improves these methods in a combined task by up to 29.5 points F1 when applied to plain text with no given segmentation.

FIGURE 4.1: Overview of the WIKISECTION task: (1) The input is a plain text document D without structure information. (2) We assume the sentences s_{1...N} contain a coherent sequence of local topics e_{1...N}. (3) The task is to segment the document into coherent sections S_{1...M} and (4) to classify each section with a topic label y_{1...M}.

3The dataset is available under the CC BY-SA 3.0 license at https://github.com/sebastianarnold/WikiSection

Dataset              disease            city
language             en      de         en       de
total docs           3.6K    2.3K       19.5K    12.5K
avg sents per doc    58.5    45.7       56.5     39.9
avg sects per doc    7.5     7.2        8.3      7.6
headings             8.5K    6.1K       23.0K    12.2K
topics               27      25         30       27
coverage             94.6%   89.5%      96.6%    96.1%

TABLE 4.1: Dataset characteristics for disease (German: Krankheit) and city (German: Stadt). Headings denotes the number of distinct section and subsection headings among the documents. Topics stands for the number of topic labels after synset clustering. Coverage denotes the proportion of headings covered by topics; the remaining headings are labeled as other.

4.1.2 Task Definition

We start with a definition of the WIKISECTION Machine Reading task shown in Figure 4.1. We take a document D = ⟨S, T⟩ consisting of N consecutive sentences S = (s_1, . . . , s_N) and empty segmentation T = ∅ as input. In our example, this is the plain text of a Wikipedia article (e.g. about Trichomoniasis4) without any section information. For each sentence s_k, we assume a distribution of local topics e_k that gradually changes over the course of the document.

The task is to split D into a sequence of distinct topic sections T = (T_1, . . . , T_M), so that each predicted section T_j = ⟨S_j, y_j⟩ contains a sequence of coherent sentences S_j ⊆ S and a topic label y_j that describes the common topic in these sentences. For the document Trichomoniasis, the sequence of topic labels is y_{1...M} = (symptom, cause, diagnosis, prevention, treatment, complication, epidemiology).

4.2 WikiSection Dataset

For the evaluation of this task, we created WIKISECTION, a novel dataset containing a gold standard of 38K full-text documents from English and German Wikipedia comprehensively annotated with sections and topic labels (see Table 4.1). The documents originate from recent dumps in English5 and German6. We filtered the collection using SPARQL queries against Wikidata [Tanon et al., 2016]. We retrieved instances of Wikidata categories disease (Q12136) and their subcategories, e.g. Trichomoniasis or Pertussis, or city (Q515), e.g. London or Madrid. Our dataset contains the article abstracts, plain text of the body, positions of all sections given by the Wikipedia editors with their original headings (e.g. “Causes | Genetic sequence”) and a normalized topic label (e.g. disease.cause). We randomized the order of documents and split them into 70% training, 10% validation, 20% test sets.

4https://en.wikipedia.org/w/index.php?title=Trichomoniasis&oldid=814235024 5https://dumps.wikimedia.org/enwiki/20180101 6https://dumps.wikimedia.org/dewiki/20180101

rank     heading h                  label y          H      freq
0        Diagnosis                  diagnosis        0.68   3,854
1        Treatment                  treatment        0.69   3,501
2        Signs and Symptoms         symptom          0.68   2,452
...
21       Differential Diagnosis     diagnosis        0.23   236
22       Pathogenesis               mechanism        0.16   205
23       Medications                medication       0.14   186
...
8,494    Usher Syndrome Type IV     classification   0.00   1
8,495    False Melanose Lesions     other            0.00   1
8,496    Cognitive Therapy          treatment        0.00   1

TABLE 4.2: Frequency and entropy (H) of top-3 head and randomly selected torso and tail headings for category diseases in the English Wikipedia.

4.2.1 Preprocessing

To obtain plain document text, we used Wikiextractor7, split the abstract sections and stripped all section headings and other structure tags except newline characters and lists.

Vocabulary mismatch in section headings. Table 4.2 shows examples of section headings from disease articles separated into head (most common), torso (frequently used) and tail (rare). Initially, we expected articles to share congruent structure in naming and order. Instead, we observe a high variance with 8.5K distinct headings in the diseases domain and over 23K for English cities. A closer inspection reveals that Wikipedia authors utilize headings at different granularity levels, frequently copy and paste from other articles, but also introduce synonyms or hyponyms, which leads to a vocabulary mismatch problem [Furnas et al., 1987]. As a result, the distribution of headings is heavy-tailed across all articles. Roughly 1% of headings appear more than 25 times while the vast majority (88%) appear 1 or 2 times only.

4.2.2 Synset Clustering

In order to use Wikipedia headlines as a source for topic labels, we contribute a normalization method that reduces the high variance of headings to a few representative labels based on the clustering of BabelNet synsets [Navigli and Ponzetto, 2012]. We create a set H that contains all headings in the dataset and use the BabelNet API to match each heading h ∈ H to its corresponding synsets S_h ⊂ S.^8 For example, “Cognitive behavioral therapy” is assigned to synset bn:03387773n. Next, we insert all matched synsets into an undirected graph G with nodes s ∈ S and edges e. We create edges between all synsets that match among each other with a lemma h′ ∈ H.

7 http://attardi.github.io/wikiextractor/
8 We match lemmas of main senses and compounds to synsets of type NOUN CONCEPT.

en_disease (27): cause, classification, complication, culture, diagnosis, epidemiology, etymology, fauna, genetics, geography, history, infection, management, mechanism, medication, pathology, pathophysiology, prevention, prognosis, research, risk, screening, surgery, symptom, tomography, treatment, other

de_disease (25): definition, diagnose, epidemiologie, fauna, forschung, genetik, geographie, geschichte, infektion, kategorisierung, klinik, komplikation, mensch, organe, pathologie, prävalenz, prognose, risiko, symptom, terminologie, therapie, ursache, verlauf, vorbeugung, sonstiges

en_city (30): architecture, climate, crime, culture, demography, district, economics, education, environment, etymology, facility, faith, geography, health, history, infrastructure, international_affairs, law, media, overview, people, politics, recreation, science, sights, society, sport, tourism, transport, other

de_city (27): architektur, bildung, demografie, erholung, etymologie, gemeinde, gemeindepartnerschaft, geographie, geschichte, infrastruktur, kirche, klima, kriminalität, kultur, menschen, politik, presse, regierung, sport, stadtlandschaft, stadtviertel, tourismus, überblick, verkehr, wirtschaft, sonstiges

TABLE 4.3: List of topics contained in the four WIKISECTION datasets (representative cluster labels in alphabetical order).

Finally, we apply a community detection algorithm [Newman, 2006] on G to find dense clusters of synsets. We use these clusters as normalized topics and assign the sense with the most outgoing edges as representative label, in our example therapy. From this normalization step we obtain 598 synsets, which we prune using the head/tail division rule count(s) < (1/|S|) Σ_{s_i ∈ S} count(s_i) [Jiang, 2012]. This method covers over 94% of all headings and yields 26 normalized labels and one other class in the English disease dataset. Table 4.1 shows the corresponding numbers for the other datasets; a full list of topics is given in Table 4.3. We verify our normalization process by manual inspection of 400 randomly chosen heading–label assignments by two independent judges and report an accuracy of 97.2% with an average observed inter-annotator agreement of 96.0%.
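The clustering step can be illustrated with a short sketch. The following Python snippet is a simplified illustration only, assuming a hypothetical babelnet_synsets(heading) lookup and using networkx's modularity-based community detection in place of the exact algorithm of Newman [2006]; it is not the implementation used in this thesis.

```python
# Sketch of the synset clustering step: headings are mapped to synsets,
# co-occurring synsets are connected, and dense communities become
# normalized topic labels.
import networkx as nx

def cluster_headings(headings, babelnet_synsets):
    G = nx.Graph()
    synsets_of = {h: set(babelnet_synsets(h)) for h in headings}
    for syns in synsets_of.values():
        G.add_nodes_from(syns)
        syns = list(syns)
        # edges between all synsets that were matched by the same heading
        for i in range(len(syns)):
            for j in range(i + 1, len(syns)):
                G.add_edge(syns[i], syns[j])
    # dense clusters of synsets become normalized topics
    clusters = nx.algorithms.community.greedy_modularity_communities(G)
    label_of = {}
    for cluster in clusters:
        # representative label: the synset with the most edges in the graph
        representative = max(cluster, key=G.degree)
        for synset in cluster:
            label_of[synset] = representative
    return label_of
```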

4.3 Model Architecture

We introduce SECTOR, a neural embedding model that predicts a latent topic distribution for every position in a document. Based on the task described in the previous section, we aim to detect M sections T_{1...M} in a document D and assign topic labels y_j = topic(S_j), where j = 1, ..., M. Because we do not know the expected number of sections, we formulate the objective of our model on sentence level and later segment based on the predictions. Therefore, we assign each sentence s_k a sentence topic label ȳ_k = topic(s_k), where k = 1, ..., N. Thus, we aim to predict coherent sections with respect to the document context:

p(ȳ_1, ..., ȳ_N | D) = ∏_{k=1}^{N} p(ȳ_k | s_1, ..., s_N)    (4.1)

We approach two variations of this task: for WIKISECTION-topics, we choose a single topic label y_j ∈ Y out of a small number of normalized topic labels. However, from this simplified classification task arises an entailment problem, because topics might be hierarchically structured. For example, a section with heading “Treatment | Gene Therapy” might describe genetics as a subtopic of treatment. Therefore, we also approach an extended task WIKISECTION-headings to capture the ambiguity in a heading. We follow the CBOW approach [Mikolov et al., 2013a] and assign all words in the heading z_j ⊂ Z as a multi-label bag over the original heading vocabulary. This turns our problem into a ranked retrieval task with a large number of ambiguous labels and further eliminates the need for normalized topic labels. Because this ‘noisy’ task is much larger and more ambiguous, our approach is similar to extreme multi-label text classification [Prabhu and Varma, 2014]. For both tasks, we aim to maximize the log likelihood of the model parameters Θ per section and sentence:

L(Θ)_sect = Σ_{j=1}^{M} log p(y_j | s_1, ..., s_N; Θ)
L̄(Θ)_sent = Σ_{k=1}^{N} log p(ȳ_k | s_1, ..., s_N; Θ)    (4.2)

Our SECTOR architecture consists of four stages shown in Figure 4.2: sentence encoding, topic embedding, topic classification and topic segmentation. We discuss each stage in the following sections.

4.3.1 Sentence Representation

The first stage of our SECTOR model transforms each sentence s_k from plain text into a fixed-size sentence vector x_k which serves as input to the neural network layers. Following the findings of Hill et al. [2016], word order is not critical for document-centric evaluation settings. Therefore, we mainly focus on unsupervised compositional sentence representations.


FIGURE 4.2: Training and inference phases of segmentation and topic classification (SECTOR). For training (A), we preprocess Wikipedia documents to supply a ground truth for segmentation T, headings Z and topic labels Y. During inference (B), we invoke SECTOR with unseen plain text to predict topic embeddings e_k on sentence level. The embeddings are used to segment the document and classify headings ẑ_j and normalized topic labels ŷ_j.

Bag-of-words baseline. As a baseline, we compose sentence vectors using a weighted bag-of-words scheme. Let e_w ∈ {0, 1}^{|V|} be the indicator vector such that e_w^{(i)} = 1 iff w is the i-th word in the fixed vocabulary V, and let TF-IDF(w) be the TF-IDF weight of w in the corpus. We define the sparse bag-of-words encoding x_bow ∈ R^{|V|} as follows:

x_bow(s) = Σ_{w∈s} TF-IDF(w) · e_w    (4.3)

Bloom filter encoding. For large V and long documents, input matrices grow too large to fit into GPU memory, especially with larger batch sizes. Therefore we apply a compression technique for sparse sentence vectors based on Bloom filters [Serrà and Karatzoglou, 2017]. A Bloom filter projects every item of a set onto a bit array A(i) ∈ {0, 1}^m using k independent hash functions. We use the sum of bit arrays per word as compressed Bloom embedding x_bloom ∈ N^m:

x_bloom(s) = Σ_{w∈s} Σ_{i=1}^{k} A(hash_i(w))    (4.4)

We set the parameters to m = 4096 and k = 5 to achieve a compression factor of 0.2, which showed good performance in the original paper.
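As an illustration, a minimal Python sketch of this encoding with m = 4096 and k = 5 could look as follows; the salted built-in hash() merely stands in for k independent hash functions and is not the hashing scheme used in the thesis.

```python
# Sketch of the Bloom filter sentence encoding (Eq. 4.4): every word sets
# k positions in an m-dimensional count vector.
import numpy as np

M, K = 4096, 5

def bloom_embedding(sentence_tokens):
    x = np.zeros(M, dtype=np.int32)
    for w in sentence_tokens:
        for i in range(K):
            # i-th "independent" hash function, simulated by salting the key
            x[hash((i, w)) % M] += 1
    return x

x = bloom_embedding(["chronic", "cough", "is", "a", "common", "symptom"])
print(x.sum())  # = 6 tokens * K = 30 increments (collisions included)
```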

Sentence embeddings. We use the strategy of Arora et al. [2017] to generate a distributed sentence representation based on pre-trained word2vec embeddings [Mikolov et al., 2013a]. This method composes a sentence vector v_emb ∈ R^d for all sentences using a probability-weighted sum of word embeddings v_w ∈ R^d with α = 10^{−4} and subtracts the first principal component

FIGURE 4.3: Neural network architecture SECTOR. The recurrent model consists of stacked LSTM, embedding and output layers that are optimized on document level and later accessed during inference in stages 1–4.

u of the embedding matrix [v_s : s ∈ S]:

v_s = (1/|S|) Σ_{w∈S} [α / (α + p(w))] · v_w
x_emb(s) = v_s − u u^T v_s    (4.5)
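A compact sketch of this sentence embedding, assuming word_vec and word_prob lookups for pre-trained word vectors and unigram probabilities (both hypothetical helpers), could look as follows:

```python
# Sketch of the probability-weighted sentence embedding (Eq. 4.5) with
# removal of the first principal component.
import numpy as np

ALPHA = 1e-4

def sentence_embeddings(sentences, word_vec, word_prob):
    # sentences: list of token lists
    V = np.stack([
        np.mean([ALPHA / (ALPHA + word_prob(w)) * word_vec(w) for w in sent], axis=0)
        for sent in sentences
    ])
    # first principal component u of the embedding matrix
    _, _, Wt = np.linalg.svd(V, full_matrices=False)
    u = Wt[0]
    return V - np.outer(V @ u, u)   # x_emb(s) = v_s - u u^T v_s
```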

4.3.2 Topic Classification

We model the second stage in our architecture to produce a dense distributed representation of latent topics for each sentence in the document.

Sequential topic embedding. We use two layers of LSTM [Hochreiter and Schmidhuber, 1997] with forget gates [Gers et al., 2000] connected to read the document in forward and backward direction [Graves, 2012]. We feed the LSTM outputs to a ‘bottleneck’ layer with tanh activation as topic embedding. Figure 4.3 shows these layers in the context of the complete architecture. We can see that context from left (k − 1) and right (k + 1) affects the forward and backward layers independently. It is therefore important to separate these weights in the embedding layer to precisely capture the difference between sentences at section boundaries. We modify our objective given in Eq. 4.2 accordingly with long-range dependencies from the forward and backward layers of the LSTM:

L(Θ) = Σ_{k=1}^{N} [ log p(ȳ_k | x_{1...k−1}; Θ⃗, Θ′) + log p(ȳ_k | x_{k+1...N}; Θ⃖, Θ′) ]    (4.6)

Note that we separate the network parameters Θ⃗ and Θ⃖ for the forward and backward directions of the LSTM, and tie the remaining parameters Θ′ for the embedding and output layers. This strategy couples the optimization of both directions into the same vector space without the need for an additional loss function. The embeddings e_{1...N} are calculated from the context-adjusted hidden states h⃗_k and h⃖_k of the LSTM cells (here simplified as f_LSTM) through the bottleneck layer:

h⃗_k = f_LSTM(x_k, h⃗_{k−1}, Θ⃗)
h⃖_k = f_LSTM(x_k, h⃖_{k+1}, Θ⃖)
e⃗_k = tanh(W_eh h⃗_k + b_e)
e⃖_k = tanh(W_eh h⃖_k + b_e)    (4.7)

Now, a simple concatenation of the embeddings e_k = e⃗_k ⊕ e⃖_k can be used as topic vector by downstream applications.
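For illustration, the topic embedding stage could be sketched in PyTorch as follows. This is a simplified re-implementation under the layer sizes reported in Section 4.4.1, not the original Deeplearning4j code; the shared bottleneck corresponds to W_eh and b_e in Eq. 4.7.

```python
# Minimal sketch of the topic embedding stage: separate forward/backward
# LSTMs over the sentence vectors with a shared tanh bottleneck layer.
import torch
import torch.nn as nn

class TopicEmbedding(nn.Module):
    def __init__(self, input_dim=256, hidden_dim=256, topic_dim=128):
        super().__init__()
        self.lstm_fw = nn.LSTM(input_dim, hidden_dim, num_layers=2, batch_first=True)
        self.lstm_bw = nn.LSTM(input_dim, hidden_dim, num_layers=2, batch_first=True)
        self.bottleneck = nn.Linear(hidden_dim, topic_dim)  # W_eh, b_e (shared)

    def forward(self, x):                      # x: (batch, N, input_dim)
        h_fw, _ = self.lstm_fw(x)              # left-to-right context
        h_bw, _ = self.lstm_bw(torch.flip(x, dims=[1]))
        h_bw = torch.flip(h_bw, dims=[1])      # re-align to sentence order
        e_fw = torch.tanh(self.bottleneck(h_fw))
        e_bw = torch.tanh(self.bottleneck(h_bw))
        return torch.cat([e_fw, e_bw], dim=-1)  # e_k = e_fw concatenated with e_bw

model = TopicEmbedding()
sentences = torch.randn(1, 40, 256)            # one document, 40 sentence vectors
topic_vectors = model(sentences)               # shape (1, 40, 256)
```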

Single-class topic classification. The third stage in our architecture is the output layer that decodes the class labels. To learn the model parameters Θ required by the embedding, we need to optimize the full model for a training target. For the WIKISECTION-topics task, we use a simple one-hot encoding ȳ ∈ {0, 1}^{|Y|} of the topic labels constructed in Section 4.2.2 with a softmax activation output layer. For the WIKISECTION-headings task, we encode each heading as a lowercase bag-of-words vector z̄ ∈ {0, 1}^{|Z|}, such that z̄^{(i)} = 1 iff the i-th word in Z is contained in the heading, e.g. z̄_k ≙ {gene, therapy, treatment}, and use a sigmoid activation in the output layer:

ȳ̂_k = softmax(W_ye e⃗_k + W_ye e⃖_k + b_y)
z̄̂_k = sigmoid(W_ze e⃗_k + W_ze e⃖_k + b_z)    (4.8)
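A small example of the multi-label heading encoding with a toy vocabulary Z (the vocabulary and tokenization here are illustrative assumptions, not the actual preprocessing):

```python
# Encode a raw heading as a multi-label bag over the heading vocabulary Z.
import numpy as np

def heading_to_multilabel(heading, vocab):
    z = np.zeros(len(vocab), dtype=np.float32)
    for word in heading.lower().replace("|", " ").split():
        if word in vocab:
            z[vocab[word]] = 1.0
    return z

vocab = {"gene": 0, "therapy": 1, "treatment": 2, "diagnosis": 3}
print(heading_to_multilabel("Treatment | Gene Therapy", vocab))  # [1. 1. 1. 0.]
```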

Multi-class topic classification. The multi-label objective is to maximize the likelihood of every word that appears in a heading:

L(Θ) = Σ_{k=1}^{N} Σ_{i=1}^{|Z|} log p(z̄_k^{(i)} | x_{1...N}; Θ)    (4.9)

Ranking loss. For training this model, we use a variation of the logistic pairwise ranking loss function proposed by Santos et al. [2015]. It learns to maximize the distance between positive and negative labels:

L_rank = log(1 + exp(γ(m⁺ − score⁺(x)))) + log(1 + exp(γ(m⁻ + score⁻(x))))    (4.10)

We calculate the positive term of the loss by taking all scores of correct labels y⁺ into account. We average over all correct scores to avoid a too strong positive push on the energy surface of the loss function [LeCun et al., 2006]. For the negative term, we only take the most offending example y⁻ among all incorrect class labels.

score⁺(x) = (1/|y⁺|) Σ_{y∈y⁺} s_θ(x)^{(y)}
score⁻(x) = arg max_{y∈y⁻} s_θ(x)^{(y)}    (4.11)

Here, s_θ(x)^{(y)} denotes the score of label y for input x. We follow the authors and set the scaling factor γ = 2 and the margins m⁺ = 2.5 and m⁻ = 0.5.
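For a single training example, this loss could be sketched as follows (NumPy, illustrative only):

```python
# Sketch of the pairwise ranking loss (Eq. 4.10/4.11) with gamma = 2,
# m+ = 2.5 and m- = 0.5.
import numpy as np

GAMMA, M_POS, M_NEG = 2.0, 2.5, 0.5

def ranking_loss(scores, positive_labels):
    # scores: s_theta(x)^(y) for all labels y; positive_labels: indices of y+
    pos_mask = np.zeros(len(scores), dtype=bool)
    pos_mask[positive_labels] = True
    score_pos = scores[pos_mask].mean()   # average over all correct labels
    score_neg = scores[~pos_mask].max()   # most offending incorrect label
    return (np.log1p(np.exp(GAMMA * (M_POS - score_pos)))
            + np.log1p(np.exp(GAMMA * (M_NEG + score_neg))))

print(ranking_loss(np.array([3.1, 0.2, -1.0, 2.7]), positive_labels=[0, 3]))
```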

4.3.3 Topic Segmentation

In the final stage, we leverage the information encoded in the topic embedding and output layers to segment the document and classify each section.

Baseline segmentation methods. As a simple baseline method, we use prior information from the text and split sections at newline characters (NL). Additionally, we merge two adjacent sections if they are assigned the same topic label after classification. If there is no newline information available in the text, we use a maximum label (max) approach: we first split sections at every sentence break, i.e. S_j = s_k; j = k = 1, ..., N, and then merge all sections which share at least one label in the top-2 predictions.

Embedding deviation. All information required to classify each sentence in a document is contained in our dense topic embedding matrix E = [e_1, ..., e_N]. We are now interested in the vector space movement of this embedding over the sequence of sentences. Therefore, we apply a number of transformations adapted from Laplacian-of-Gaussian edge detection on images [Ziou and Tabbone, 1998] to obtain the magnitude of embedding deviation (emd) per sentence. First, we reduce the dimensionality of E to D dimensions using PCA, i.e. we solve E = UΣW^T using singular value decomposition and then project E onto the D principal components, E_D = E W_D. Next, we apply Gaussian smoothing to obtain a smoothed matrix E′_D by convolution with a Gaussian kernel with variance σ². From the reduced and smoothed embedding vectors e′_{1...N} we construct a sequence of deviations d_{1...N} by calculating the stepwise difference using cosine distance:

d_k = cos(e′_{k−1}, e′_k) = (e′_{k−1} · e′_k) / (‖e′_{k−1}‖ ‖e′_k‖)    (4.12)

Finally, we use the sequence d_{1...N}, computed with parameters D = 16 and σ = 2.5, to locate the spots of fastest movement (see Figure 4.4), i.e. all k where d_{k−1} < d_k > d_{k+1}; k = 1, ..., N in our discrete case. We use these positions to start a new section.
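A minimal sketch of this segmentation step is given below. It uses 1 − cos as the deviation, so that local maxima correspond to the fastest movement, and centers the embeddings before PCA; both are modeling assumptions of this sketch rather than details taken from the thesis.

```python
# Sketch of the embedding deviation (emd) segmentation: PCA to D dims,
# Gaussian smoothing, stepwise cosine deviations, local maxima = boundaries.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def emd_boundaries(E, D=16, sigma=2.5):
    # E: (N, d) topic embedding matrix e_1..e_N of one document
    E = E - E.mean(axis=0)
    _, _, Wt = np.linalg.svd(E, full_matrices=False)
    E_smooth = gaussian_filter1d(E @ Wt[:D].T, sigma=sigma, axis=0)
    a, b = E_smooth[:-1], E_smooth[1:]
    d = 1 - np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-9)
    # d[k] is the deviation between sentence k and k+1 (0-based);
    # a local maximum starts a new section at sentence k+1
    return [k + 1 for k in range(1, len(d) - 1) if d[k - 1] < d[k] > d[k + 1]]
```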

FIGURE 4.4: Embedding deviations emd_k and bemd_k of the smoothed SECTOR topic embeddings for the example document Trichomoniasis. The plot shows the first derivative of vector movement over sentences k = 1, ..., N from left to right. The predicted segmentation is shown as black lines, the axis labels indicate the ground truth segmentation.

Bidirectional embedding deviation. We adopt the approach of Sehikh et al. [2017], who examine the difference between the forward and backward layers of an LSTM for segmentation. However, our approach focuses on the difference of left and right topic context over time steps k, which allows for a sharper distinction between sections. Here, we obtain two smoothed embeddings e⃗′ and e⃖′ and define the bidirectional embedding deviation (bemd) as the geometric mean of the forward and backward difference:

d′_k = √( cos(e⃗′_{k−1}, e⃗′_k) · cos(e⃖′_k, e⃖′_{k+1}) )    (4.13)

Finally, we assign each segment the mean class distribution of all contained sentences:

ŷ_j = (1/|S_j|) Σ_{s_i ∈ S_j} ȳ̂_i    (4.14)

We show in the evaluation that our SECTOR model, which was optimized for sentences ȳ_k, can be applied to the WIKISECTION task to predict coherently labeled sections T_j = ⟨S_j, ŷ_j⟩.

4.4 Evaluation

We conduct three experiments to evaluate the segmentation and classification task introduced in Section 4.1.2. The WIKISECTION-topics experiment comprises segmentation and classification of each section with a single topic label out of a small number of clean labels (25–30 topics). The WIKISECTION-headings experiment extends the classification task to multiple labels per section with a larger target vocabulary (115–601 words^9). This is important, because often there are no clean topic labels available for training or evaluation. Finally, we conduct a third experiment to see how SECTOR performs across existing segmentation datasets.

9 The original vocabulary of 1.0K–2.8K words reported in the paper was pruned during preprocessing.

4.4.1 Evaluation Set-up

Evaluation datasets. For the first two experiments we use the WIKISECTION datasets introduced in Section 4.2, which contain documents about diseases and cities in both English and German. The subsections are retained with full granularity. For the third experiment, text segmentation results are often reported on artificial datasets [Choi, 2000]. It was shown that this scenario is hardly applicable to topic-based segmentation [Koshorek et al., 2018], so we restrict our evaluation to real-world datasets that are publicly available. The Wiki-727k dataset by Koshorek et al. [2018] contains Wikipedia articles with a broad range of topics and their top-level sections. However, it is too large to compare exhaustively, so we use the smaller Wiki-50 subset. We further use the Cities and Elements datasets introduced by Chen et al. [2009], which also provide headings. These sets are typically used for word-level segmentation, so they do not contain any punctuation and are lowercased. Finally, we use the Clinical Textbook chapters introduced by Eisenstein and Barzilay [2008], which do not supply headings.

Text segmentation models. We compare SECTOR to the baseline text segmentation methods C99 [Choi, 2000] and TopicTiling [Riedl and Biemann, 2012], and to the state-of-the-art TextSeg [Koshorek et al., 2018]. In the third experiment we report numbers for BayesSeg [Eisenstein and Barzilay, 2008] (without a given number of segments) and GraphSeg [Glavaš et al., 2016].

Classification models. We compare SECTOR to existing models for single and multi-label sentence classification. Because we are not aware of any existing method for combined segmentation and classification, we first use the given prior segmentation from newlines in the text (NL) and then additionally apply our own segmentation strategies for plain text input: maximum label (max), embedding deviation (emd) and bidirectional embedding deviation (bemd). For the experiments, we train a Paragraph Vectors (PV) model [Le and Mikolov, 2014] using all sections of the training sets. We utilize this model for single-label topic classification (depicted as PV>T) by assigning the given topic labels as paragraph IDs. Multi-label classification is not possible with this model. We use the paragraph embedding for our own segmentation strategies. We set the layer size to 256, the window size to 7 and trained for 10 epochs using a batch size of 512 sentences and a learning rate of 0.025. We further use an implementation of CNN [Kim, 2014] with our pre-trained word vectors as input for single-label topics (CNN>T) and multi-label headings (CNN>H). We configured the models using the hyperparameters given in the paper and trained the model using a batch size of 256 sentences for 20 epochs with learning rate 0.01.

SECTOR configurations. We evaluate the various configurations of our model discussed in the prior sections. SEC>T depicts the single-label topic classification model which uses a softmax activation output layer, SEC>H is the multi-label variant with a larger output and sigmoid activations. Other options are: bag-of-words sentence encoding (+bow), Bloom filter encoding (+bloom) and sentence embeddings (+emb); multi-class cross-entropy loss (as default)

WikiSection-topics, single-label classification — en_disease (27 topics), de_disease (25 topics), en_city (30 topics), de_city (27 topics). Each dataset column reports Pk, F1 and MAP.

model configuration | segm. | en_disease | de_disease | en_city | de_city
Classification with newline prior segmentation
PV>T* | NL | 35.6 31.7 47.2 | 36.0 29.6 44.5 | 22.5 52.9 63.9 | 27.2 42.9 55.5
CNN>T* | NL | 31.5 40.4 55.6 | 31.6 38.1 53.7 | 13.2 66.3 76.1 | 13.7 63.4 75.0
SEC>T+bow | NL | 25.8 54.7 68.4 | 25.0 52.7 66.9 | 21.0 43.7 55.3 | 20.2 40.5 52.2
SEC>T+bloom | NL | 22.7 59.3 71.9 | 27.9 50.2 65.5 | 9.8 74.9 82.6 | 11.7 73.1 81.5
SEC>T+emb* | NL | 22.5 58.7 71.4 | 23.6 50.9 66.8 | 10.7 74.1 82.2 | 10.7 74.0 83.0
Classification and segmentation on plain text
C99 | – | 37.4 n/a n/a | 42.7 n/a n/a | 36.8 n/a n/a | 38.3 n/a n/a
TopicTiling | – | 43.4 n/a n/a | 45.4 n/a n/a | 30.5 n/a n/a | 41.3 n/a n/a
TextSeg | – | 24.3 n/a n/a | 35.7 n/a n/a | 19.3 n/a n/a | 27.5 n/a n/a
PV>T* | max | 43.6 20.4 36.5 | 44.3 19.3 34.6 | 31.1 28.1 43.1 | 36.4 20.2 35.5
PV>T* | emd | 39.2 32.9 49.3 | 37.4 32.9 48.7 | 24.9 53.1 65.1 | 32.9 40.6 55.0
CNN>T* | max | 40.1 26.9 45.0 | 40.7 25.2 43.8 | 21.9 42.1 58.7 | 21.4 42.1 59.5
SEC>T+bow | max | 30.1 40.9 58.5 | 32.1 38.9 56.8 | 24.5 28.4 43.5 | 28.0 26.8 42.6
SEC>T+bloom | max | 27.9 49.6 64.7 | 35.3 39.5 57.3 | 12.7 63.3 74.3 | 26.2 58.9 71.6
SEC>T+bloom | emd | 29.7 52.8 67.5 | 35.3 44.8 61.6 | 16.4 65.8 77.3 | 26.0 65.5 76.7
SEC>T+bloom | bemd | 26.8 56.6 70.1 | 31.7 47.8 63.7 | 14.4 71.6 80.9 | 16.8 70.8 80.1
SEC>T+bloom+rank* | bemd | 26.8 56.7 68.8 | 33.1 44.0 58.5 | 15.7 71.1 79.1 | 18.0 66.8 76.1
SEC>T+emb* | bemd | 26.3 55.8 69.4 | 27.5 48.9 65.1 | 15.5 71.6 81.0 | 16.2 71.0 81.1

TABLE 4.4: Experimental results for topic segmentation and single-label classification on the four WIKISECTION datasets (n = 718 / 464 / 3,907 / 2,507 documents). Numbers are given as Pk on sentence level and micro-averaged F1 and MAP on segment level. For methods without segmentation, we used newlines as segment boundaries (NL) and merged sections of the same class after prediction. Models marked with * are based on pre-trained distributed embeddings.

WikiSection-headings, multi-label classification — en_disease (179 topics), de_disease (115 topics), en_city (603 topics), de_city (318 topics). Each dataset column reports Pk, P@1 and MAP.

model configuration | segm. | en_disease | de_disease | en_city | de_city
CNN>H* | max | 40.9 36.7 31.5 | 41.3 14.1 21.1 | 36.9 43.3 46.7 | 42.2 40.9 46.5
SEC>H+bloom | bemd | 35.4 35.8 38.2 | 36.9 31.7 37.8 | 20.0 65.2 62.0 | 23.4 49.8 53.4
SEC>H+bloom+rank | bemd | 40.2 47.8 49.0 | 42.8 28.4 33.2 | 41.9 66.8 59.0 | 34.9 59.6 54.6
SEC>H+emb* | bemd | 30.7 50.5 57.3 | 32.9 26.6 36.7 | 17.9 72.3 71.1 | 19.3 68.4 70.2
SEC>H+emb+rank* | bemd | 30.5 47.6 48.9 | 42.9 32.0 36.4 | 16.1 65.8 59.0 | 18.3 69.2 58.9
SEC>H+emb@fullwiki* | bemd | 42.4 9.7 17.9 | 42.7 (0.0) (0.0) | 20.3 59.4 50.4 | 38.5 (0.0) (0.1)

TABLE 4.5: Experimental results for segmentation and multi-label classification trained with raw Wikipedia headings. Here, the task is to segment the document and predict multi-word topics from a large ambiguous target vocabulary.

and ranking loss (+rank). We have chosen the network hyperparameters using grid search on the en_disease validation set and keep them fixed over all evaluation runs. For all configurations, we set the BLSTM layer size to 256 and the topic embedding dimension to 128. Models are trained on the complete train splits with a batch size of 16 documents (reduced to 8 for bag-of-words), 0.01 learning rate, 0.5 dropout and ADAM optimization. We used early stopping after 10 epochs without MAP improvement on the validation data sets. We pre-trained word embeddings with 256 dimensions for the specific tasks using word2vec on lowercase English and German Wikipedia documents using a window size of 7. All tests are implemented in Deeplearning4j and run on a Tesla P100 GPU with 16 GB memory. Training a SEC+bloom model on en_city takes roughly 5 hours, inference on CPU takes on average 0.36 seconds per document. In addition, we trained a SEC>H@fullwiki model with raw headings from a complete English Wikipedia dump^10, and use this model for cross-dataset evaluation.

Quality metrics. We measure text segmentation at sentence level using the probabilistic Pk error score [Beeferman et al., 1999], which calculates the probability of a false boundary in a window of size k; lower numbers mean better segmentation. As relevant section boundaries we consider all section breaks where the topic label changes. We set k to half of the average segment length. We measure classification performance on section level by comparing the topic labels of all ground truth sections with the predicted sections. We select the pairs by matching their positions with maximum boundary overlap. We report the micro-averaged F1 score for single-label classification and Precision@1 for multi-label classification. We further report Mean Average Precision (MAP), which measures the average fraction of true labels ranked above a particular label [Tsoumakas et al., 2009].
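A simple reference implementation of the Pk score over per-sentence segment IDs might look as follows (illustrative sketch; the window size defaults to half of the average reference segment length):

```python
# Sketch of the P_k error [Beeferman et al., 1999]: probability that two
# positions k sentences apart are wrongly judged to lie in the same or in
# different segments.
def pk_score(reference, hypothesis, k=None):
    n = len(reference)
    if k is None:
        k = max(1, round(n / (len(set(reference)) * 2)))
    errors = 0
    for i in range(n - k):
        same_ref = reference[i] == reference[i + k]
        same_hyp = hypothesis[i] == hypothesis[i + k]
        errors += same_ref != same_hyp
    return errors / (n - k)

ref = [0, 0, 0, 1, 1, 2, 2, 2, 2]   # gold segment ID per sentence
hyp = [0, 0, 1, 1, 1, 2, 2, 2, 2]   # predicted segment ID per sentence
print(round(pk_score(ref, hyp), 3))
```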

4.4.2 Experimental Results

Table 4.4 shows the evaluation results of the WIKISECTION-topics single-label classification task, Table 4.5 contains the corresponding numbers for multi-label classification. Table 4.6 shows results for topic segmentation across different datasets.

SECTOR outperforms existing classifiers. With our given segmentation baseline (NL), the best sentence classification model, CNN, achieves 52.1% F1 averaged over all datasets. SECTOR improves this score significantly by 12.4 points. Furthermore, in the setting with plain text input, SECTOR improves the CNN score by 18.8 points using identical baseline segmentation. Our model finally reaches an average of 61.8% F1 on the classification task using sentence embeddings and bidirectional segmentation. This is a total improvement of 27.8 points over the CNN model.

Topic embeddings improve segmentation. SECTOR outperforms C99 and TopicTiling significantly, by 16.4 and 18.8 points Pk on average, respectively. Compared to the maximum label baseline, our model gains 3.1 points by using the bidirectional embedding deviation and 1.0 points using sentence embeddings. Overall, SECTOR misses only 4.2 points Pk and 2.6 points F1 compared to the experiments with prior newline segmentation. The third experiment reveals that our segmentation method in isolation almost reaches state-of-the-art performance on existing datasets and beats the unsupervised baselines, but lacks performance in cross-dataset evaluation.

10 Excluding all documents contained in the test sets.

Segmentation and multi-label classification — Wiki-50 (Pk, MAP), Cities (Pk, MAP), Elements (Pk, MAP), Clinical (Pk).

model | Wiki-50 | Cities | Elements | Clinical
GraphSeg | 63.6 n/a | 40.0 n/a | 49.1 n/a | –
BayesSeg | 49.2 n/a | 36.2 n/a | 35.6 n/a | 57.8
TextSeg | 18.2* n/a | 19.7* n/a | 41.6 n/a | 30.8
SEC>H+emb@en_disease | – – | – – | 43.3 9.5 | 36.5
SEC>C+emb@en_disease | – – | – – | 45.1 n/a | 35.6
SEC>H+emb@en_city | 30.0 31.4 | 28.2 56.5 | 41.0 7.9 | –
SEC>C+emb@en_city | 31.3 n/a | 22.9 n/a | 48.8 n/a | –
SEC>H+emb@cities | 33.3 15.3 | 21.4* 52.3* | 39.2 12.1 | 37.7
SEC>H+emb@fullwiki | 28.6* 32.6* | 33.4 40.5 | 42.8 14.4 | 36.9

TABLE 4.6: Experimental results for cross-dataset topic segmentation. Numbers marked with * are generated by models trained specifically for this dataset. A value of ‘n/a’ indicates that a model is not applicable to this problem.

Bloom filters on par with word embeddings. Bloom filter encoding achieves high scores among all datasets and outperforms our bag-of-words baseline, possibly because of larger training batch sizes and reduced model parameters. Surprisingly, word embeddings did not improve the model significantly. On average, German models gained 0.7 points F1 while English models declined by 0.4 points compared to Bloom filters. However, model training and inference using pre-trained embeddings is faster by an average factor of 3.2.

Topic embeddings perform well on noisy data. In the multi-label setting with unprocessed Wikipedia headings, the classification precision of SECTOR reaches up to 72.3% P@1 for 603 labels. This score is on average 9.5 points lower compared to the models trained on the small number of 25–30 normalized labels. Furthermore, segmentation performance is only missing 3.8 points Pk compared to the topics task. Ranking loss could not improve our models significantly, but achieved better segmentation scores on the headings task. Finally, the cross-domain English fullwiki model performs only on baseline level for segmentation, but still achieves better classification performance than CNN on the English cities dataset.

4.5 Discussion and Insights

Figure 4.5 shows classification and segmentation of SECTOR compared to the PV baseline.

SECTOR captures latent topics from context. We clearly see from the NL predictions (left side of Figure 4.5) that SECTOR produces coherent results with sentence granularity, with topics emerging and disappearing over the course of a document. In contrast, PV predictions are scattered across the document. Both models successfully classify the first (symptoms) and last sections (epidemiology). However, only SECTOR can capture diagnosis, prevention and treatment. Furthermore, we observe additional screening predictions in the center of the document. This section is actually labeled “Prevention | Screening” in the source document, which explains this overlap.


FIGURE 4.5: Heatmaps of predicted topic labels ŷ_k for the document Trichomoniasis from the PV and SECTOR models with newline and embedding segmentation. Shading denotes the probability for 10 out of 27 selected topic classes on the Y axis, with sentences from left to right. Segmentation is shown as black lines, the X axis shows the expected gold labels. Note that segments with the same class assignments are merged in both predictions and gold standard (‘. . . ’).

Furthermore, we observe low confidence in the second section labeled cause. Our multi-class model predicts {diagnosis, cause, genetics} for this section. The ground truth heading for this section is “Causes | Genetic sequence”, but even for a human reader this assignment is not clear. This shows that the multi-label approach fills an important gap and can even serve as an indicator for low-quality article structure. Finally, both models fail to segment the complication section near the end, because it consists of an enumeration. The embedding deviation segmentation strategy (right side of Figure 4.5) completely solves this issue for both models. Our SECTOR model gives nearly perfect segmentation using the bidirectional strategy; it only misses the discussed part of cause and is off by one sentence for the start of prevention. Furthermore, averaging over sentence-level predictions reveals clearly distinguishable section class labels.

SECTOR represents topics coherently in vector space. Figure 4.6 reveals an insight into the topic embedding of our SECTOR multi-class model. It is clearly visible that the model is able to separate the classes in vector space. Furthermore, the sequential classification of sentences from a single document (here depicted as a line) is continuously moving through the space of sections. This makes SECTOR embeddings an ideal source for semantic language understanding tasks, because they can provide a topical discourse vector that is generated from the sequential context of an entire long document.

4.6 Related Work

The analysis of emerging topics over the course of a document is related to a large number of research areas. In particular, topic modeling [Blei et al., 2003] and topic detection and tracking (TDT) [Jin et al., 1999] focus on representing and extracting the semantic topical content of text.


FIGURE 4.6: t-SNE projection of the vector space of the headings in our SECTOR topic embedding. The path illustrates the sequence of section labels abstract, symptom, diagnosis for the document Dyslexia. Every dot is the projection of a randomly chosen sentence from the en_disease validation set. Shape and color of the dots relate to the mentioned section labels; others are left out for readability.

Text segmentation [Beeferman et al., 1999] is used to split documents into smaller coherent chunks. Finally, text classification [Joachims, 1998] is often applied to detect topics on text chunks. Our method unifies those strongly interwoven tasks and is the first to evaluate a combined topic segmentation and classification task using a dataset with long documents.

Topic modeling is commonly applied to entire documents using probabilistic models, such as latent Dirichlet allocation (LDA) [Blei et al., 2003]. AlSumait et al. [2008] introduced an online topic model that captures emerging topics when new documents appear. Gabrilovich and Markovitch [2007] proposed the Explicit Semantic Analysis method in which concepts from Wikipedia articles are indexed and assigned to documents. Later, and to overcome the vocabulary mismatch problem, Cimiano et al. [2009] introduced a method for assigning latent concepts to documents. More recently, Liu et al. [2016] represented documents with vectors of closely related domain keyphrases. Yeh et al. [2016] proposed a conceptual dynamic LDA model for tracking topics in conversations. Bhatia et al. [2016] utilized Wikipedia document titles to learn neural topic embeddings and assign document labels. Dieng et al. [2017] focused on the issue of long-range dependencies and proposed a latent topic model based on recurrent neural networks (RNNs). However, the authors did not apply the RNN to predict local topics.

Text segmentation has been approached with a wide variety of methods. Early unsupervised methods utilized lexical overlap statistics [Hearst, 1997; Choi, 2000], dynamic programming

[Utiyama and Isahara, 2001], Bayesian models [Eisenstein and Barzilay, 2008] or point-wise boundary sampling [Du et al., 2013] on raw terms. Later, supervised methods included topic models [Riedl and Biemann, 2012] by calculating a coherence score using dense topic vectors obtained by LDA. Bayomi et al. [2015] exploited ontologies to measure semantic similarity between text blocks. Alemi and Ginsparg [2015] and Naili et al. [2017] studied how word embeddings can improve classical segmentation approaches. Glavaš et al. [2016] utilized semantic relatedness of word embeddings by identifying cliques in a graph. More recently, Sehikh et al. [2017] utilized Long Short-Term Memory (LSTM) networks and showed that cohesion between bidirectional layers can be leveraged to predict topic changes. In contrast to our method, the authors focused on segmenting speech recognition transcripts on word level without explicit topic labels. The network was trained with supervised pairs of contrary examples and was mainly evaluated on artificially-segmented documents. Our approach extends this idea so it can be applied to dense topic embeddings which are learned from raw section headings.

Wang et al. [2017a] tackled segmentation by training a CNN to learn coherence scores for text pairs. Similar to Sehikh et al. [2017], the network was trained with short contrary examples and no topic objective. The authors showed that their point-wise ranking model performs well on the datasets by Jeong and Titov [2010]. In contrast to our method, the ranking algorithm strictly requires a given ground truth number of segments and no topic labels are predicted.

Koshorek et al. [2018] presented a large new dataset for text segmentation based on Wikipedia that includes section headings. The authors introduced a neural architecture for segmentation which is based on sentence embeddings and four layers of BLSTM. Similar to Sehikh et al. [2017], the authors used a binary segmentation objective on sentence level, but trained on entire documents. Our work takes up this idea of end-to-end training and enriches the neural model with a layer of latent topic embeddings that can be utilized for topic classification.

Text classification is mostly applied at paragraph or sentence level using machine learning methods such as Support Vector Machines [Joachims, 1998] or, more recently, shallow and deep neural networks [Hoa T. Le et al., 2018; Conneau et al., 2017]. Notably, Paragraph Vectors [Le and Mikolov, 2014] is an extension of word2vec for learning fixed-length distributed representations from texts of arbitrary length. The resulting model can be utilized for classification by providing paragraph labels during training. Furthermore, Kim [2014] has shown that convolutional neural networks (CNNs) combined with pre-trained task-specific word embeddings achieve highest scores for various text classification tasks.

Combined approaches of topic segmentation and classification are rare to find. Agarwal and Yu [2009] classified sections of BioMed Central articles into four structural classes (introduction, methods, results and discussion). However, their manually-labeled dataset only contains a sample of sentences from the documents, so they evaluated sentence classification as an isolated task. Chen et al. [2009] introduced two Wikipedia-based datasets for segmentation, one about large cities, the second about chemical elements. While these datasets have been used to evaluate word-level and sentence-level segmentation [Koshorek et al., 2018], we are not aware of any topic classification approach on this dataset.

Tepper et al. [2012] approached segmentation and classification in a clinical domain as a supervised sequence labeling problem. The documents were segmented using a Maximum Entropy model and then classified into 11 or 33 categories. A similar approach by Ajjour et al. [2017] used sequence labeling with a small number of 3–6 classes. Their model is extractive, so it does not produce a continuous segmentation over the entire document. Finally, Piccardi et al. [2018] did not approach segmentation, but recommended an ordered set of section labels based on Wikipedia articles.

Ultimately, we are inspired by passage retrieval [Liu and Croft, 2002] as an important downstream task for topic segmentation and classification. For example, Hewlett et al. [2016] proposed WikiReading, a QA task to retrieve values from sections of long documents. The objective of TREC Complex Answer Retrieval is to retrieve a ranking of relevant passages for a given outline of hierarchical sections [Nanni et al., 2017]. Both tasks highly depend on a building block for local topic embeddings such as our proposed model.

4.7 Conclusions

In this chapter, we have approached RQ 2, the detection of topics and structure in long documents. We presented SECTOR, a novel neural model for coherent text segmentation and classification based on latent topics. This model supports our vision of Neural Machine Reading by extending the scope of language understanding towards higher-level abstractions on passage and document level. We further contributed WIKISECTION, a collection of four large datasets from clinical and geopolitical domains in English and German for this task. Our end-to-end method builds upon a neural topic embedding which is trained using Wikipedia headings to optimize a bidirectional LSTM classifier. We showed that our best performing model is based on sparse word features with Bloom filter encoding and significantly improves classification precision for 25–30 topics on comprehensive documents by up to 29.5 points F1 compared to state-of-the-art sentence classifiers. We used the bidirectional deviation in our topic embedding to segment a document into coherent sections without additional training. Finally, our experiments showed that extending the task to multi-label classification of over 600 ambiguous topic words still produces coherent results with 71.1% average precision. In the next chapter, we will extend this idea towards general Machine Reading by encoding named entities and aspects alongside the document structure.

Chapter 5

Contextualized Document Representations for Answer Retrieval

In the preceding chapters we have approached the detection of named entities (RQ 1) and topical structure (RQ 2) in domain-specific text. In this chapter, we aim to combine these two approaches towards a general document representation for Machine Reading. As a central idea, we approach RQ 3: How can we embed discourse structure into document representations? To examine the impact of discourse-aware representations, we will further investigate RQ 4: How effective are document representations for retrieving answer passages? We present Contextual Discourse Vectors (CDV), a distributed document representation for efficient answer retrieval from long documents^1. Our approach is based on structured query tuples of entities and aspects from free text and domain-specific taxonomies. Our model leverages a dual encoder architecture with hierarchical LSTM layers and multi-task training to encode the position of entities and aspects alongside the document discourse. We use our continuous representations to resolve queries with short latency using approximate nearest neighbor search on sentence level. We apply the CDV model for retrieving coherent answer passages from nine English public healthcare resources from the Web, addressing both patients and medical professionals. Because there is no end-to-end training data available for such an application scenario, we train our model with self-supervised data from Wikipedia. We compare our general model with several state-of-the-art baselines for passage ranking and discuss the adaptation to heterogeneous domains without additional fine-tuning.

This chapter is structured as follows: In Section 5.1, we introduce our idea of discourse-aware representations for Answer Passage Retrieval. In Section 5.2, we model a structured query for information-seeking tasks based on vector space representations of entities and aspects. In Section 5.3, we present the architecture of our CDV model and the process for self-supervised training. In Section 5.4, we evaluate our model in an answer retrieval task over nine English healthcare text resources. In Section 5.5, we discuss the results and provide an analysis of the model's false predictions. In Section 5.6, we summarize related work. We conclude in Section 5.7 with a reflection on our research questions for Neural Machine Reading.

1 This chapter was published and presented by S. Arnold, B. van Aken, P. Grundmann, F. A. Gers, and A. Löser [2020]. “Learning Contextualized Document Representations for Healthcare Answer Retrieval”. In: Proceedings of The Web Conference 2020. ACM, pp. 1332–1343.

5.1 Introduction

In a clinical decision support system (CDSS), doctors and healthcare professionals require access to information from heterogeneous sources, such as research papers [Gorman et al., 1994; Schardt et al., 2007], electronic health records [Hanauer et al., 2015], clinical case reports [Fujiwara et al., 2018], reference works and knowledge base articles. Differential diagnosis is an important task where a doctor seeks to retrieve answers for non-factoid queries about diseases, such as “symptoms of IgA nephropathy” (see Figure 5.1). A relevant answer typically spans multiple sentences and is most likely embedded into the discourse of a long document [Yang et al., 2016b; Cohan et al., 2018].

FIGURE 5.1: Example of a structured entity/aspect query Q and a highlighted answer passage from Wikipedia. Note that the answer is part of a longer document and there is almost no word overlap between query and answer passage.

Evidence-based medicine (EBM) has made efforts to structure physicians' information needs into short structured question representations, such as PICO (patient, intervention, comparison, outcome) [Richardson et al., 1995] and, more generally, well-formed background–foreground questions [Cheng, 2004]. We support this important query intention and define a query as a structured tuple of entity (e.g. a disease or health problem) and aspect. Our model is focused on clinical aspects such as therapy, diagnosis, etiology, prognosis, and others, which have been described in the literature previously by manual clustering of semantic question types [Huang et al., 2006] or crawling medical Wikipedia section headings [Arnold et al., 2019]. In a CDSS, a doctor can express these query terms with identifiers from a knowledge base or medical taxonomy, e.g. UMLS, ICD-10 or Wikidata. The system will support the user in assigning these links by search and auto-completion operators [Schneider et al., 2018; Fujiwara et al., 2018], which allows us to use these representations as input for the answer retrieval task.

5.1.1 Challenges for Answer Passage Retrieval

Several methods have been proposed to apply deep neural networks for effective information retrieval [Guo et al., 2016; Mitra et al., 2017; Dai et al., 2018] and question answering [Seo et al., 2017; Wang et al., 2017b], also with focus on healthcare [Zhu et al., 2019; Jin et al., 2019]. However, our scenario poses a unique combination of open challenges to a retrieval system:

1. Task coverage: Query intentions span a broad range in specificity and complexity [Huang et al., 2006; Nanni et al., 2017]. For example, medical specialists may pose very precise queries that align with a pre-defined taxonomy and focus on rare diseases. On the other hand, nursing staff might have broader and more heterogeneous questions. However, in most cases we do not have access to task-specific training data, so training the model for a single intention is not feasible. We therefore require a generalized query representation that covers a broad range of intents and taxonomies, even with limited training data.

2. Domain adaptability: In many cases we do not even have textual data readily available at training time from all resources in a CDSS. However, we observe linguistic and semantic shifts between the heterogeneous types of text, e.g. different use of terms and abbreviations among groups of doctors. Therefore we face a zero-shot retrieval task that requires robust domain transfer abilities across diverse biomedical, clinical and healthcare text resources [Logeswaran et al., 2019].

3. Contextual coherence: Answers are often expressed as passages in the context of a long document. Therefore the model needs to respect long-range dependencies such as the sequence of micro-topics that establish a coherent ‘train of thought’ in a document [Arora et al., 2016; Arnold et al., 2019]. At the same time, the model is required to operate on a fine granularity (e.g., on sentence level) rather than on entire documents to be able to capture the boundaries of answers [Keikha et al., 2014].

4. Efficient neural information retrieval: All documents in the CDSS need to be accessible with fast ad-hoc queries by the users. Many question answering models are based on pairwise similarity, which is computationally too intensive when applied to large-scale retrieval tasks [Gillick et al., 2018]. Instead, we require a continuous retrieval model that allows for offline indexing and approximate nearest neighbor search with high recall [Gillick et al., 2018], even for rare queries and with low latency in the order of milliseconds.

5.1.2 Contextual Discourse Vectors

We approach these challenges and present Contextual Discourse Vectors (CDV)^2, a neural document representation which is based on discourse modeling and fulfills the above requirements. Our method is the first to address answer retrieval with structured queries on long heterogeneous documents from the healthcare domain.

2 Code and evaluation data is available at https://github.com/sebastianarnold/cdv

CDV is based on hierarchical layers to encode word, sentence and document context with bidirectional Long Short-Term Memory (BLSTM). The model uses multi-task learning [Caruana, 1997] to align the sequence of sentences in a long document to the clinical knowledge encoded in pre-trained entity and aspect vector spaces. We use a dual encoder architecture [Gillick et al., 2018], which allows us to precompute discourse vectors for all documents and later answer ad-hoc queries over that corpus with short latency [Gillick et al., 2019]. Consequently, the model predicts similarity scores with sentence granularity and does not require an extra inference step after the initial document indexing. We apply our CDV model for retrieving passages from various public health resources on the Web, including NIH documents and Patient articles, with structured clinical query intentions of the form ⟨entity, aspect⟩. Because there is no training data available from most sources, we use a self-supervised approach to train a generalized model from medical Wikipedia texts. We apply this model to the texts in our evaluation in a zero-shot approach [Palatucci et al., 2009] without additional fine-tuning. In summary, our major contributions include:

• We propose a structured entity/aspect healthcare query model to support the essential query intentions of medical professionals. Our task is focused on the efficient retrieval of answer passages from long documents of heterogeneous health resources.

• We introduce CDV, a contextualized document representation for passage retrieval. Our model leverages a dual encoder architecture with BLSTM layers and multi-task training to encode the position of discourse topics alongside the document. We use the representations to answer queries using nearest neighbor search on sentence level.

• Our model utilizes generalized language models and aligns them with clinical knowledge from medical taxonomies, e.g. pre-trained entity and aspect embeddings. Therefore, it can be trained with sparse self-supervised training data, e.g. from Wikipedia texts, and is applicable to a broad range of texts.

• We prove the applicability of our CDV model with extensive experiments and a qualitative error analysis on nine heterogeneous healthcare resources. We provide additional entity/aspect labels for all datasets. Our model significantly outperforms existing document matching methods in the retrieval task and can adapt to different healthcare domains without fine-tuning.

5.2 Discourse Model

Our first challenge is to design a query model which can adapt to a broad number of healthcare answer retrieval tasks and utilizes the information sources available in a CDSS. In this section, we introduce a vector-space representation for this purpose. 5.2. Discourse Model 79

We define a query as a structured tuple Q = ⟨entity, aspect⟩. This approach of using two complementary query arguments originates from the idea of structured background–foreground questions in EBM [Cheng, 2004] and has been used before in many triple-based retrieval systems [Adolphs et al., 2011]. In our healthcare scenario, we restrict entities to be of type disease, e.g. IgA nephropathy, and aspects to the clinical domain, e.g. symptoms, treatment, or prognosis. We discuss these two spaces and their combination in this section. In general, our model is not limited to the query spaces used in this paper and is further extendable to a larger number of arguments.

5.2.1 Entity Representation

The first part of our problem is to represent the entity in focus of a query. In contrast to interaction-based models, which are applied to query–document pairs, our approach is to decouple entity encoding and document encoding. Therefore we follow recent work in representation-based Entity Linking [Gillick et al., 2019] and embed textual knowledge from the clinical domain into this representation. Our goal is to generalize entity representations, so the model will be able to align to existing taxonomies without retraining. Therefore, our entity space must be as complete as possible: it needs to cover each of the entities that appear in the discourse training data, but also rare entities that we expect at query time, e.g. in the application. We must further provide a robust method for predicting unseen entities [Logeswaran et al., 2019]. In contrast to highly specialized entity embeddings constructed from knowledge graphs or multimodal data [Beam et al., 2018], our general approach is based on textual data and allows us to apply the model to different knowledge bases and domains.

Entity Embeddings. Our goal is to create a mapping of each entity E ∈ K in the knowledge base, identified by its ID, into a low-dimensional entity vector space E ⊂ R^d.^3 We train an embedding by minimizing the loss for predicting the entity from the sentences s ∈ D_E in the entity descriptions:

L_entity(Θ) = −log ∏_{E∈K} ∏_{s∈D_E} p_Θ(E.id | s)    (5.1)

where Θ denotes the parameters required to approximate the probability p. We optimize Θ using a bidirectional Long Short-Term Memory (BLSTM) network [Hochreiter and Schmidhuber, 1997] to predict the entity ID E.id from the words w_i ∈ s. We encode w_i using Fasttext embeddings [Bojanowski et al., 2017] and use Bloom filters [Serrà and Karatzoglou, 2017] to compress E.id into a hashed bit encoding, allowing for fewer model parameters and faster training.

p_Θ(E.id | s) = p_Θ(E.id | w_1, ..., w_N)
             ≈ p(bloom(E.id) | BLSTM_Θ(w_1, ..., w_N))    (5.2)

3 We use d as a placeholder for all embedding vector sizes, even if they are not equal.

Subsequently, we extend the approach of Palangi et al. [2016] and define the embedding function ε as the average of the hidden word states g⃗_T and g⃖_1 at the last and first time step, respectively:

g⃗_k = LSTM_Θ(g⃗_{k−1}, w_k)
g⃖_k = LSTM_Θ(g⃖_{k+1}, w_k)    (5.3)

ε(s) = (g⃗_T + g⃖_1) / 2    (5.4)

Finally, we generate entity embeddings ε_E ∈ E by applying the embedding function to all available descriptions. In case of unseen entities, the embedding can be generated on-the-fly:

ε_E = (1/|D_E|) Σ_{s∈D_E} ε(s)   if E ∈ K
ε_E = ε(E.mention)               if E ∉ K    (5.5)
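A minimal sketch of this lookup, where encode(text) stands in for the trained BLSTM encoder ε(s) and descriptions maps entity IDs to their description sentences D_E (both hypothetical names), could look as follows:

```python
# Sketch of the entity embedding lookup (Eq. 5.5): average the encoded
# description sentences; unseen entities fall back to their mention text.
import numpy as np

def entity_embedding(entity_id, mention, descriptions, encode):
    if entity_id in descriptions and descriptions[entity_id]:
        return np.mean([encode(s) for s in descriptions[entity_id]], axis=0)
    return encode(mention)   # entity not in K: embed its surface form on the fly
```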

Training data for clinical named entities. We train the entity representation for diseases, syndromes and health problems using textual descriptions from various sources: Wikidata^4, UMLS^5, GARD^6, Wikipedia abstracts, and the Diseases Database^7. In total, the knowledge base contains over 27,000 entities identified by their Wikidata ID. We trained roughly 9,700 common entities with text from Wikipedia abstracts, while for rare entities we used only their name and short description texts.

5.2.2 Aspect Representation

The second part of our problem is to represent the aspect in the query tuple. Here, we expect a wide range of clinical facets and we do not want to limit the users of our system to a specific terminology. Instead, we train a low-dimensional aspect vector space A ⊂ R^d using the Fasttext skip-gram model [Bojanowski et al., 2017] on medical Wikipedia articles. This approach places words with similar semantics nearby in vector space and allows queries with morphologic variations using subword representations.

Aspect embeddings. To find all possible aspects, we adopt prior work [Arnold et al., 2019] and collect all section headings from the medical Wikipedia articles. These headings typically consist of 1–3 words and describe the main topic of a section. We apply moderate preprocessing (lowercase, remove punctuation, split at “and|&”) to generate aspect embeddings α_A ∈ A using a BLSTM encoder α(s) with the same architecture as discussed above:

α_A = (1/|D_A|) ∑_{s∈D_A} α(s)    (5.6)

⁴ https://www.wikidata.org/wiki/Q12136
⁵ https://uts.nlm.nih.gov
⁶ https://rarediseases.info.nih.gov
⁷ http://www.diseasesdatabase.com

Heading                  cusum    Heading                   cusum
information (abstract)    9.6%    mechanism                 54.1%
treatment                16.0%    culture                   54.7%
diagnosis                22.3%    society                   55.3%
symptoms                 28.2%    research                  55.9%
signs                    33.0%    risk factors              56.4%
causes                   37.1%    presentation              56.9%
history                  39.3%    differential diagnosis    57.3%
pathophysiology          41.4%    surgery                   57.7%
management               43.4%    treatments                58.1%
epidemiology             45.4%    pathogenesis              58.4%
cause                    47.3%    medications               58.7%
classification           48.9%    complications             59.0%
prognosis                50.4%    characteristics           59.3%
prevention               51.7%    medication                59.5%
types                    52.5%    other animals             59.7%
genetics                 53.3%    pathology                 60.0%

TABLE 5.1: Distribution of the top 32 headings (60% of 577K total occurrences) contained in our training set. Numbers are given as cumulative sum. We observe that these headings cover the most important aspects for differential diagnosis.

Training data for clinical aspects. We train the embedding with over 577K sentences from Wikipedia (see Table 5.1). We observe that there is a vocabulary mismatch in the headings so that potentially synonymous aspects are frequently labeled with different headings, e.g. types / classification or signs / symptoms / presentation / characteristics. However, it is also possible that in some contexts these aspects are hierarchically structured, e.g. presentation refers to the visible forms of a symptom. Our vector-space representation reflects these similarities, so it is possible to distinguish between these nuances at query time.

5.2.3 Query Representation

Finally, we represent the query as a tuple of entity and aspect embeddings using vector concatenation (⊕):

Q(E,A) = ⟨ε_Q ∈ E, α_Q ∈ A⟩ = ε_Q ⊕ α_Q    (5.7)

This query encoder constitutes the upper part of our dual encoder architecture shown in Figure 5.2. Our next goal is to find the positions in all documents where the local discourse matches the query Q. In the next section, we introduce our contextualized document representation that allows similarity matching between Q and each document at sentence level.

5.3 Contextualized Document Representation

In this section, we introduce Contextual Discourse Vectors (CDV), a distributed document representation that focuses on coherent encoding of local discourse in the context of the entire document. The architecture of our model is shown in Figure 5.2. We approach the challenges introduced in Section 5.1 by reading a document at word and sentence level (Section 5.3.1) and encoding sentence-wise representations using recurrent layers at document level (Section 5.3.2). We use the representations to measure similarity between every position in the document and the query (Section 5.3.3). Our model is trained to match the entity/aspect vector spaces introduced in Section 5.2 using self-supervision (Section 5.3.4).

5.3.1 Sentence Encoder

The first group of layers in our architecture encodes the plain text of an entire document into a sequence of vector representations. As we expect long documents—the average document length in our test sets is over 1,200 words—we chose to reduce the computational complexity by encoding the document discourse at sentence level. It is important to avoid losing document context and word–discourse interactions (e.g. entity names or certain aspect-specific terms) during this step. Furthermore, our challenge of domain adaptability requires the sentence encoder to be robust to linguistic and semantic shifts from text sources that differ from the training data. Therefore, we start at the input layer by encoding all words in a document D into fixed low-dimensional word vectors w_{1...N} ∈ R^d using pre-trained word embeddings with subword information (see below). Next, we encode all sentences s_{1...T} ∈ D into sentence representations σ_t ∈ R^d based on the words w_k ∈ s_t in the sentence. This reduces the number of computational time steps from N words in a document to T sentences. We compare two approaches for this sentence encoding step:

Compositional sentence embeddings. As the simplest approach, we use an average vector composition of the word embeddings w_k from GloVe [Pennington et al., 2014] or Fasttext [Bojanowski et al., 2017], which is more robust against out-of-vocabulary errors:

σ_avg(s) = (1/len(s)) ∑_{w_k∈s} w_k    (5.8)
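Eq. 5.8 amounts to a simple mean over pre-trained word vectors; in the sketch below, the lookup table is assumed to be a plain dictionary of GloVe or Fasttext vectors.

```python
import numpy as np

def sigma_avg(tokens, word_vectors, dim=300):
    """Compositional sentence embedding by averaging word vectors (Eq. 5.8, sketch)."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:                 # guard against empty or fully out-of-vocabulary sentences
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)
```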

Pooling-based sentence embeddings. Since we want the model to be able to focus on individual words, we apply a language model encoder. We use the recent BioBERT [Lee et al., 2019], a transformer model pre-trained on a large amount of biomedical text at sub-word level. To generate sentence vectors from the input sequence, we use pooling of the attention layers per sentence:

σ_pool(s) = BioBERT(w_{s.begin}, ..., w_{s.end})    (5.9)

Finally, we concatenate a positional encoding to the sentence embeddings, which encodes rule-based structural flags such as begin/end-of-document, begin/end-of-paragraph and is-list-item. This encoding helps to guide the document encoder through the structure of a document.

FIGURE 5.2: Neural network architecture for our contextualized document representation. The contextual discourse vectors (CDV) are generated by a hierarchical stack of layers: sentence encoder (GloVe/Fasttext/BioBERT) and document encoder (BLSTM). The query encoder (entity/aspect embeddings) is used for scoring on sentence level.

5.3.2 Document Encoder

The second group of layers in our architecture encodes the sequence of sentences over the document. The objective of these layers is to transform the word/sentence input space into discourse vector space—which will later match with the query entity and aspect spaces—in the context of the document. To achieve contextual coherence, we use the entire document as input for a recurrent neural network with parameters Θ, which we optimize at training time to minimize the loss over the sequence:

L_doc(Θ) = −log ∏_{t=1}^{T} p_Θ(ε(s_t), α(s_t) | σ(s_1), ..., σ(s_T))    (5.10)

We adopt the architecture of SECTOR [Arnold et al., 2019] and use bidirectional LSTMs to read the document sentence-by-sentence. We use a final dense layer (matrix W_he and bias b_e) to produce the local discourse vectors δ_{1...T} for every sentence in D.

\overrightarrow{h}_t = LSTM_Θ(\overrightarrow{h}_{t−1}, σ(s_t))
\overleftarrow{h}_t = LSTM_Θ(\overleftarrow{h}_{t+1}, σ(s_t))    (5.11)
δ_t = tanh(W_{he}(\overrightarrow{h}_t ⊕ \overleftarrow{h}_t) + b_e)

The CDV matrix C_D = [δ_1, ..., δ_T] is our discourse-aware document representation which embeds all features necessary to decode contextualized entity and aspect information for D.
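The document encoder of Eq. 5.11 can be sketched as a bidirectional LSTM over the sentence embeddings followed by a dense tanh layer. The sketch below uses PyTorch and the layer sizes reported in Section 5.4.1 (512-dimensional LSTM states, 256-dimensional discourse vectors); it is an illustration, not the Java/Deeplearning4j implementation used for the experiments.

```python
import torch
import torch.nn as nn

class DocumentEncoder(nn.Module):
    """BLSTM document encoder producing discourse vectors delta_t (Eq. 5.11, sketch)."""
    def __init__(self, sent_dim=300, hidden=512, dvec=256):
        super().__init__()
        self.blstm = nn.LSTM(sent_dim, hidden, bidirectional=True, batch_first=True)
        self.dense = nn.Linear(2 * hidden, dvec)      # corresponds to W_he and b_e

    def forward(self, sentences):                     # sentences: (batch, T, sent_dim)
        hidden_states, _ = self.blstm(sentences)      # (batch, T, 2*hidden)
        return torch.tanh(self.dense(hidden_states))  # C_D = [delta_1, ..., delta_T]

# Example: a single document of 40 sentences with 300-dimensional sentence embeddings
cdv_matrix = DocumentEncoder()(torch.randn(1, 40, 300))   # shape (1, 40, 256)
```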

5.3.3 Passage Scoring

The center layer in our architecture addresses our main task objective: find the passages with highest similarity to the query. In Section 5.2, we described generalized vector spaces for entities ε ∈ E and aspects α ∈ A that we use as query representation Q for high task coverage. We train our discourse vectors C_D to share the same vector spaces E and A. This enables us to run efficient neural information retrieval of multiple ad-hoc queries Q over the pre-computed

CDV vectors C_D later without having to re-run inference on the document encoder for each query. We store all vectors δ_t in an in-memory vector index that allows us to efficiently retrieve approximate nearest neighbors using cosine distance. Figure 5.3 shows the overall process of training, indexing and ad-hoc answer retrieval. Because we reuse entity and aspect embeddings for training, our document model ‘inherits’ the properties from these spaces, e.g. robustness for unseen and rare entities or aspects.

Discourse decoder. To decode the individual entity and aspect predictions ε̂_t, α̂_t from δ_t ∈ C_D, we utilize two learned decoder matrices W_{δε}, W_{δα} with bias terms b_ε, b_α. We optimize these parameters by using a multi-task objective with shared weights [Caruana, 1997] to minimize


FIGURE 5.3: The entire Answer Passage Retrieval process with three stages. (1) We train the discourse-aware document representation model using self- supervision on Wikipedia articles. (2) At indexing time, the model is applied once to the entire test corpus of unseen documents. The discourse vectors are saved into a vector index. (3) A query is retrieved by ranking similarity scores between the query representation and all sentence-level vectors in the candidate passages.

the distance to the training labels ε_t, α_t:

ε̂_t = tanh(W_{δε} δ_t + b_ε)
α̂_t = tanh(W_{δα} δ_t + b_α)    (5.12)

L_cdv(Θ) = (1/T) ∑_{t=1}^{T} ( ‖ε̂_t − ε_t‖ + ‖α̂_t − α_t‖ )

Sentence scoring. To compute similarity scores at query time, we pick up our query representation (Eq. 5.7) and compute the semantic similarity between Q and each contextual discourse vector δ_t in the vector index. To achieve low latency, we use cosine similarity between the decoded entity and aspect representations:

score_t(Q(E,A), C_D) = cosine(ε_Q ⊕ α_Q, ε̂_t ⊕ α̂_t) = ((ε_Q ⊕ α_Q)ᵀ (ε̂_t ⊕ α̂_t)) / (‖ε_Q ⊕ α_Q‖ ‖ε̂_t ⊕ α̂_t‖)    (5.13)

Answer passage retrieval. The scoring operation score_{1...T}(Q, C_D) ∈ [0, 1] yields a sentence-level histogram which describes the similarity between the query and every sentence in a document. At this point, we have the opportunity to select a coherent set of sentences as answers, similar to Arnold et al. [2019]. However, because all healthcare datasets that we use for evaluation already provide passage boundaries, we leave this step for future work. Instead, we use the average sentence score per passage for answer retrieval:

score(Q, P) = (1/len(P)) ∑_{s_t∈P} score_t(Q, C_D)    (5.14)
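Eqs. 5.13–5.14 reduce to a cosine similarity between the concatenated query vector and the concatenated decoded vectors of each sentence, averaged over the sentences of a passage. A minimal numpy sketch (the decoded ε̂_t and α̂_t are assumed to be precomputed and stored in the vector index):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def score_passage(eps_q, alpha_q, decoded_entities, decoded_aspects, sentence_ids):
    """Average sentence-level query similarity of a passage (Eqs. 5.13-5.14, sketch).
    decoded_entities/decoded_aspects: per-sentence decoded vectors of the document;
    sentence_ids: indices of the sentences s_t that belong to passage P."""
    query = np.concatenate([eps_q, alpha_q])
    scores = [cosine(query, np.concatenate([decoded_entities[t], decoded_aspects[t]]))
              for t in sentence_ids]
    return sum(scores) / len(scores)
```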

FIGURE 5.4: CDV model predictions for query “symptoms of IgA nephropathy” on the example document. The histogram shows the similarity score of the dis- course vector with the query over sentences t = 1 ...T from left to right.

Figure 5.4 shows the scoring curves divided into entity Q(E), aspect Q(A) and the averaged score(Q(E,A), P). It is clearly visible that the model coherently predicts long-range dependencies for the entity IgA nephropathy over the entire document. The aspect similarity with symptoms is much more focused on single sentences.

5.3.4 Self-supervised Training

We train a generalized CDV model by jointly optimizing all model parameters from sentence encoder, document encoder and passage scoring layers on a training set.

Generating entity and aspect labels from Wikipedia. For this task, we use the textual data about diseases and health problems available from Wikipedia. This process is self-supervised, because there exist no labeled query–answer pairs for these documents. Instead, we assign to each sentence s_t ∈ D a set of related entities E and aspects A using simple heuristics:

E(s_t, D) = {E | title(D) = E ∨ contains_link(s_t, E)}
A(s_t, D) = {A | heading(s_t) = A}    (5.15)
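A sketch of the labeling heuristic in Eq. 5.15, assuming a simple document structure with a title and per-sentence heading and link annotations (the field names are hypothetical):

```python
def generate_labels(doc):
    """Assign entity and aspect label sets to each sentence (Eq. 5.15, sketch).
    doc is assumed to look like:
      {"title": "IgA nephropathy",
       "sentences": [{"text": "...", "heading": "Signs and symptoms",
                      "links": ["Hematuria"]}, ...]}"""
    labels = []
    for sentence in doc["sentences"]:
        entities = {doc["title"]} | set(sentence.get("links", []))  # article entity + linked entities
        aspects = {sentence["heading"]}                              # heading of the enclosing section
        labels.append((entities, aspects))
    return labels
```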

We collected over 8,600 articles for training and removed all instances contained in any of the test sets. The collection covers over 8K entities and 15K aspects (see Table 5.2).

Discourse decoder objective. We create the target objectives for training using the average of the label embeddings contained in the training entities E(st,D) and aspects A(st,D). We formulate the objectives on sentence level:

ε_t = (∑_{E∈E(s_t,D)} ε_E) / |E(s_t,D)|
α_t = (∑_{A∈A(s_t,D)} α_A) / |A(s_t,D)|    (5.16)

Optimized loss function. We observe a strong imbalance of entity and aspect labels over the course of a single document, for example when passages contain lists (very short sentences) or rare entities, or have uncommon headlines. To give the network the ability to capture these anomalies, especially with larger batch sizes, we use a robust loss function [Barron, 2019] which resembles a smoothed form of the Huber loss [Huber, 1992]:

L_cdv+(Θ) = (1/T) ∑_{t=1}^{T} [ √(1 + (‖ε̂_t − ε_t‖ + ‖α̂_t − α_t‖)² / 4) − 1 ]    (5.17)
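The loss of Eq. 5.17 can be written directly over precomputed prediction and target matrices; the following numpy sketch assumes all four arguments are (T × d) arrays of sentence-level vectors.

```python
import numpy as np

def robust_cdv_loss(e_hat, e, a_hat, a):
    """Smoothed Huber-like training loss over all T sentences (Eq. 5.17, sketch)."""
    residual = (np.linalg.norm(e_hat - e, axis=1) +
                np.linalg.norm(a_hat - a, axis=1))           # per-sentence decoding error
    return float(np.mean(np.sqrt(1.0 + residual ** 2 / 4.0) - 1.0))
```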

In the next section, we apply our CDV model to a healthcare answer retrieval task.

5.4 Evaluation

We evaluate our CDV model and 14 baseline methods in an Answer Passage Retrieval task. All models are trained using self-supervision on Wikipedia texts and applied as a zero-shot task [Palatucci et al., 2009] (i.e. without further fine-tuning) to three diverse English healthcare datasets: WikiSectionQA, MedQuAD and HealthQA.

5.4.1 Evaluation Set-up

As queries, we use tuples of the form ⟨entity, aspect⟩. Because our task requires retrieving the answers from over 4,000 passages and the interaction-based models in our comparison require computationally expensive pairwise inference, we evaluate all numbers on a re-ranking task [Gillick et al., 2018]. We follow the setup of Logeswaran et al. [2019] and use BM25 [Robertson et al., 1995] to provide each model with a pre-filtered set of 64 potentially relevant passage candidates⁸. To facilitate full recall in this model comparison, we add missing true answers to the candidates if necessary by overwriting the lowest-ranked false answers in the list and shuffle afterwards. We rank the candidate answers using exhaustive nearest neighbor search and leave the evaluation of indexing efficiency for future work. Next, we describe the datasets, metrics and methods used in our experiments.
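The candidate construction for the re-ranking setup can be sketched as follows; `bm25_scores` stands in for any BM25 implementation, and missing true answers overwrite the lowest-ranked false candidates before shuffling, as described above.

```python
import random

def build_candidates(passage_ids, bm25_scores, true_ids, k=64, seed=0):
    """Pre-filter k candidate passages with BM25 and guarantee full recall (sketch).
    passage_ids: all passages in the test set; bm25_scores: one score per passage;
    true_ids: set of relevant passage ids for the query."""
    ranked = [p for _, p in sorted(zip(bm25_scores, passage_ids),
                                   key=lambda x: x[0], reverse=True)][:k]
    missing = [p for p in true_ids if p not in ranked]
    false_slots = [i for i in range(len(ranked) - 1, -1, -1) if ranked[i] not in true_ids]
    for passage, slot in zip(missing, false_slots):   # overwrite lowest-ranked false answers
        ranked[slot] = passage
    random.Random(seed).shuffle(ranked)               # remove the BM25 rank information
    return ranked
```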

Evaluation datasets. We conduct experiments on three English datasets from the clinical and healthcare domain. From the documents provided, we use the plain text of the entire document body during model inference and the segmentation information for generating the passage candidates. From all queries provided, we use the entity labels (mention text, Wikidata ID) and aspect labels (UMLS canonical name). If entity and aspect identifiers were not provided by the dataset, we added them manually by asking three annotators from clinical healthcare to label them. Table 5.2 shows an overview of the datasets.

⁸ This choice covers 80–91% of all true answers (depending on the dataset) as a trade-off between task complexity and real-world applicability. The numbers reported for HealthQA in the original paper were evaluated by re-ranking ten candidates (one relevant, 3 partially relevant and 6 irrelevant) and are therefore not comparable.

Dataset             Wikipedia ⇒   WS        MQ        HQ
Split               train         test      test      test
# documents         8,605         716       1,111     178
# passages          53,477        4,373     3,762     1,109
# entities          8,605         716       1,100     221
# aspects           15,028        27        15        21
# queries           N/A           4,178     3,294     1,045
avg words/doc       977.6         1,396.7   811.1     1,449.4
avg sents/doc       43.5          63.7      48.0      82.5
avg passages/doc    6.2           6.1       3.4       6.2
avg words/passage   221.8         228.6     237.9     232.6
avg sents/passage   9.8           10.4      14.1      13.2
avg words/sent      22.7          21.9      16.9      17.6

TABLE 5.2: Statistics of our training and evaluation data sets. The training splits on the healthcare datasets are only used for model evaluation. For the final model, we only used Wikipedia for training.

WikiSectionQA [Arnold et al., 2019] (WS) is a large subset of full-text Wikipedia articles about diseases, labeled with entity identifiers, section headlines and 27 normalized aspect classes. We extended this dataset for answer retrieval by constructing query tuples from every section containing the given entity ID and normalized aspect label. We included abstracts as information, but skipped sections labeled as other. We use the en_disease-test split for evaluation and made sure that none of the documents are contained in our training data.

MedQuAD [Abacha and Demner-Fushman, 2019] (MQ) is a collection of medical question–answer pairs from multiple trusted sources of the National Institutes of Health (NIH): National Cancer Institute (NCI)⁹, Genetic and Rare Diseases (GARD)¹⁰, Genetics Home Reference (GHR)¹¹, National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK)¹², National Institute of Neurological Disorders and Stroke (NINDS)¹³, NIH Senior Health¹⁴ and National Heart, Lung and Blood Institute (NHLBI)¹⁵. We left out documents from Medline Plus due to property rights. Questions are annotated with structured identifiers for entity (UMLS CUI) and aspect (semantic question type) and contain a long passage as answer. To make this dataset applicable to our method, we reconstructed the entire documents from the answer passages and kept only questions about diseases for evaluation. We filtered out documents with only one passage (these were always labeled “information”) and separated a random 25% test split from the remaining documents.

HealthQA [Zhu et al., 2019] (HQ) is a collection of consumer health question–answer pairs crawled from the website Patient¹⁶. The answer passages were generated from sections in

the documents and annotated by human labelers with natural language questions. We reconstructed the full documents from these sections. Additionally, our annotators added structured entity and aspect labels to all questions in the test split. Although some questions are not about diseases, we kept all of them to remain comparable with related work.

⁹ https://www.cancer.gov
¹⁰ https://rarediseases.info.nih.gov
¹¹ https://ghr.nlm.nih.gov
¹² http://www.niddk.nih.gov/health-information/health-topics/
¹³ http://www.ninds.nih.gov/disorders/
¹⁴ http://nihseniorhealth.gov/
¹⁵ http://www.nhlbi.nih.gov/health/
¹⁶ https://patient.info

Baseline methods. We evaluate two term-based matching functions as baselines: TF-IDF [Jones, 1972] and BM25 [Robertson et al., 1995]. We used the implementation in Apache Lucene 8.2.0¹⁷ to retrieve passages containing the entity and aspect of a query, e.g. “IgA nephropathy symptoms”, from the index of all passages in the test dataset. Additionally, we evaluate the following document matching methods from the literature: ARC-I and ARC-II [Hu et al., 2014], DSSM [Huang et al., 2013], C-DSSM [Shen et al., 2014], DRMM [Guo et al., 2016], MatchPyramid [Pang et al., 2016], aNMM [Yang et al., 2016a], Duet [Mitra et al., 2017], MVLSTM [Wan et al., 2016], KNRM [Xiong et al., 2017], CONV-KNRM [Dai et al., 2018] and HAR [Zhu et al., 2019]. For implementing these models, we followed Zhu et al. [2019] and used the open source implementation MatchZoo [Guo et al., 2019] with pre-trained glove.840B.300d vectors [Pennington et al., 2014]. All models were trained with our self-supervised Wikipedia training set using queries containing the entity and lowercase heading, e.g. “IgA nephropathy ; symptoms”, and applied to the test sets using queries of the same structure instead of natural language questions.

Quality metrics. For all ranking experiments, we use Recall at top K (R@K) and Mean Average Precision (MAP) as metrics. While R@1 measures whether the top-1 answer is correct (similar to a question answering task), we also report R@10, which corresponds to the ability to retrieve all correct answers in a top-10 result list, and MAP, which considers the entire result list.
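For reference, the per-query metrics can be computed as in the following sketch, where `ranking` is a model's ordered candidate list and `relevant` the set of true answer ids; the reported MAP is the mean of these per-query average precision values.

```python
def recall_at_k(ranking, relevant, k):
    """Fraction of relevant answers that appear within the top-k results."""
    return len(set(ranking[:k]) & set(relevant)) / len(relevant)

def average_precision(ranking, relevant):
    """Average of the precision values at the ranks of the relevant answers."""
    hits, ap = 0, 0.0
    for rank, passage_id in enumerate(ranking, start=1):
        if passage_id in relevant:
            hits += 1
            ap += hits / rank
    return ap / len(relevant) if relevant else 0.0
```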

Implementation details. We implement our models with the following configurations. Where applicable, we chose the hyperparameters using grid search on the WikiSection validation set. For the sentence encoding, we use either glove.6B.300d pre-trained GloVe vectors (+avg-glove), 128d fine-tuned Fasttext embeddings (+avg-fasttext) or the 768d pre-trained BioBERT [Lee et al., 2019] language model (+pool-biobert). For the document encoding, we use two LSTM layers (one forward, one backward) with 512 dimensions each, a discourse vector dense layer with 256 dimensions, L2 batch normalization and tanh activation. The discourse decoder is a 128-dimensional output layer with tanh activation and Huber loss. The network is trained with stochastic gradient descent over 50 epochs using the ADAM optimizer [Kingma and Ba, 2015] with a batch size of 16 documents, a learning rate of 10⁻³, 0.975 exponential decay per epoch, 0.0 dropout and 10⁻⁴ weight decay regularization [Loshchilov and Hutter, 2017]. During training, we restrict the maximum document length to 396 sentences and the maximum sentence length to 96 tokens, due to memory constraints on the GPU.

¹⁷ https://lucene.apache.org

Model                  WikiSectionQA            MedQuAD                  HealthQA
(all trained on        R@1    R@10   MAP        R@1    R@10   MAP        R@1    R@10   MAP
Wikipedia)
Term-based models
TF-IDF                 17.10  64.99  31.77      23.83  82.84  42.66      17.46  71.54  34.47
BM25                   23.87  71.26  38.89      29.48  86.11  48.89      22.55  73.27  38.45
Representation-based models
ARC-I                   1.61  13.69   6.90       1.98  19.22   8.47       1.38  13.87   6.87
DSSM                   22.82  74.31  39.02      13.11  55.92  27.38      10.50  46.44  22.04
C-DSSM                  9.59  53.12  22.82       9.67  47.54  22.12      10.56  58.30  25.37
Interaction-based models
ARC-II                 10.38  53.62  23.61       9.19  47.58  21.66      11.26  58.85  26.09
DRMM                   24.96  67.56  39.24      34.52  82.35  51.51      21.80  80.24  40.03
MatchPyramid           18.53  64.21  33.12      25.14  72.33  41.37      19.24  73.79  37.22
aNMM                    4.77  32.17  14.03       7.15  37.18  17.08       3.74  27.20  12.07
KNRM                   16.96  61.03  31.04      16.86  61.35  31.35      22.94  67.92  37.65
CONV-KNRM              34.36  77.25  48.72      42.70  84.54  57.57      33.13  85.41  50.55
HAR                    45.31  84.15  58.38      55.65  93.17  69.10      43.20  88.34  58.80
Combined models
Duet                   18.34  59.13  31.74      20.50  65.91  35.28      17.27  64.81  32.13
MVLSTM                 30.74  76.10  45.58      36.86  86.29  53.18      26.78  84.42  45.37
CDV+avg-glove          59.60  95.67  72.72      34.00  80.87  50.45      37.17  84.47  53.30
CDV+avg-fasttext       60.34  97.49  74.01      45.26  92.29  62.56      40.08  89.80  58.35
CDV+pool-biobert       65.21  97.84  77.60      39.96  91.32  58.91      43.60  88.12  59.40

TABLE 5.3: Experimental results for the Answer Passage Retrieval task on three healthcare datasets. All models were trained using the self-supervised Wikipedia training set and applied without fine-tuning. Queries were evaluated by ranking 64 given candidates from the respective test sets. As queries we used ⟨entity, aspect⟩ tuples in a representation suitable for the individual model.

The entity and aspect embeddings are trained with 128d Fasttext embeddings, followed by 128d BLSTM and dense embedding layers with tanh activations. The 1024d output layer is configured with sigmoid activation and BPMLL loss [Zhang and Zhou, 2006] to predict the Bloom hash (k = 5) of the entity or aspect. The network is trained similarly to the CDV model, but we use 5 epochs with a batch size of 128 sentences, a learning rate of 10⁻³ and 0.5 dropout.

5.4.2 Experimental Results

Table 5.3 shows the results on the Answer Passage Retrieval task using CDV and document matching models on three healthcare datasets. We observe that CDV consistently achieves significantly better results than all term-based, representation-based and combined models across all datasets. In comparison with pairwise interaction-based models, our representation-based retrieval model outperforms all tested models on average, scores best on WikiSectionQA and HealthQA and second best on the MedQuAD dataset. Retrieval time per query is 247 ms (±43 ms) on average. Figure 5.5 further shows that we correctly match between 67.5% and 91.4% of entities in the datasets and resolve 49.4% to 66.3% of all aspects.

Comparison of model architectures. Our query model is able to match most of the questions in the entity/aspect scheme (see Section 5.5 for exceptions). The results show that term-based TF-IDF and BM25 models can solve the healthcare retrieval task sufficiently well, with R@10 > 70%. In contrast, none of the representation-based re-ranking models achieves similar performance, except DSSM on WikiSectionQA. Most of the recent interaction-based and combined models outperform BM25 and have significant advantages on the MedQuAD dataset, which contains a large amount of generated information that can be matched exactly. We conclude that simple word-level interactions are important for this task and that representation-based models trade off this property for fast retrieval times.

Background knowledge. Our CDV model performs well on all datasets, but shows a significant advantage on the Wikipedia-based WikiSectionQA dataset. Although all models are trained on the same data, the only model with similar behavior is DSSM. One possible reason is that entity embeddings are an important source of background information, and these are mainly based on Wikipedia descriptions: 99.9% of the WikiSectionQA entities are covered in our embedding, 97.1% on MedQuAD and only 69.29% on HealthQA, because the latter does not contain only diseases. Sentence embeddings provide different levels of background knowledge and language understanding. The pre-trained GloVe embedding can handle the task well, but is outperformed by our fine-tuned Fasttext embedding and the large BioBERT language model.

Domain adaptability and task coverage. Figure 5.6 shows the performance of our CDV+avg-fasttext model across all data sources, most of which are contained in MedQuAD. This distribution reveals that our model's top-1 accuracy is stable in the adaptation to most sources except the National Cancer Institute (CancerGov). However, we notice that R@10 performance is high among all sources except SeniorHealth. Figure 5.7 shows that R@10 performance across the most frequent aspects is over 93% in most cases, but with varying top-1 recall. We will address these errors in Section 5.5.

Impact of contextual dependencies. Score predictions in CDV are calculated on sentence level with respect to long-range context across the entire document. In Figure 5.4, we observe that the model is able to predict the entity (top curve) consistently over the document, although there are many coreferences in Wikipedia text. The aspect curve (center) clearly shows the beginning of the expected section Symptoms, and the model is uncertain for the following sentences. Finally, the average score (bottom curve) shows a coherent prediction.

5.5 Discussion and Error Analysis

We perform an error analysis on the predictions of the CDV+avg-fasttext model to identify the main reasons for answer misranking. For this purpose, we analyse samples in which the model ranks a wrong passage at the top-1 position. We look at 50 random mismatched samples per dataset to understand the individual challenges per source. We now discuss the main findings.

FIGURE 5.5: Entity/aspect matching (values in percent) observed on all examples in the three evaluation datasets.

FIGURE 5.6: R@1 and R@10 performance of the CDV-EA+pool-biobert model across all data sources. All sources except Wikipedia and Patient are contained in the MedQuAD dataset.

FIGURE 5.7: R@1 and R@10 prediction performance of the 17 most frequent aspects in all test sets (of 34 total), sorted from left to right by frequency: information, symptoms, treatment, diagnosis, causes, inheritance, prognosis, prevalence, mutations, pathogenesis, epidemiology, management, classification, research, prevention, history and risk factors.

Related entities. Figure 5.8 shows that a main source of entity errors comes from selecting passages that belong to related entities. This includes entities that are superclasses or subclasses of the gold truth, e.g. selecting a passage covering “Diarrhea” when “Chronic Diarrhea in Children” is the query entity. These errors are most significant in WikiSectionQA and MedQuAD, because HealthQA covers mostly common diseases. We especially observe this in samples from Genetics Home Reference and National Cancer Institute. Figure 5.6 shows that R@1 is low for samples from these sources, whereas their R@10 is high. That is because genetic conditions and cancer types inherently contain entities with very similar names and descriptions. For instance, we see “Spastic Paraplegia Type 8” falsely resolved to “Spastic Paraplegia Type 11”. As the representations are close to each other in vector space, the correct samples are almost always found within the top-10 ranked candidates, corresponding with the high R@10.

FIGURE 5.8: Error classes for entity and aspect mismatch (values in %) from manual analysis of 150 mismatched queries.

Related aspects. Likewise, we observe that in HealthQA 34% and in WikiSectionQA 16% of aspects are mismatched to related aspects. Figure 5.7 shows the distribution of aspects and the model’s ability to resolve them. It is salient that some aspects are especially difficult to resolve. Aside from the fact that these aspects are in the long tail, a further analysis reveals that they are often resolved to related aspects. For example, passages covering classification are often very similar and therefore confused with passages about diagnosis and symptoms. The same holds true for prognosis and management. Queries asking for prevalence of a disease are often resolved to information passages, because disease frequency is often mentioned in these introductory texts. In general, passages about related aspects often share similar tokens and document context, which makes their distinction more difficult.

Out-of-scope questions. 25% of queries from HealthQA contain entities that are not diseases, but procedures, drugs or other entity types. As our model is trained on textual data covering diseases only, we do not expect it to fully resolve these entities. However, we observe that the model is capable of finding the correct passage for 23% of unseen entities. This shows that while our model is not trained on such entity types, the fallback embedding described in Section 5.2.1 still allows it to generalize even to non-diseases in these cases.

Evaluation vs. real-world application. We further identified a number of errors related to the structure of the evaluation that would be less problematic or even beneficial in real-world applications. The model frequently ranks passages to the top which answer the query but have a differing aspect assigned. We observe this in 24% of analysed samples from WikiSectionQA and 18% from HealthQA. This often seems to be caused by the non-discrete nature of topical aspects. In practice, a passage can cover more than one aspect, but our evaluation does not currently capture this ambiguity. Additionally, we find some mismatches between passages and their ground-truth aspects, which can be ascribed to writing errors in WikiSectionQA and labeling errors in HealthQA. Aspects in MedQuAD are less ambiguous in general and only

4% fall into this error class. Figure 5.5 shows that the model therefore resolves more aspects correctly for MedQuAD queries.

Irrelevant text within passage boundaries. Another finding is that 28% of analysed samples from the MedQuAD dataset contain boilerplate text unrelated to a specific entity. The boilerplate includes repeated text such as information about how data was collected. In this case our model is able to detect relevant parts of a passage (see Figure 5.4), but the remaining irrelevant sentences lead to a worse ranking of the passage. Evaluating with flexible passage boundaries would eliminate this issue and be a better match for real-world scenarios, in which the interest of a medical professional is mainly focused on non-boilerplate parts of a document.

Complex questions. We find that most questions in our evaluation can be represented as tuples of entity and aspect without information loss. However, in 4% of analysed queries in the HealthQA dataset we see a mismatch between question and query. For instance, the question “How common is OCD in Children and Young People?” is more specific than the assigned query tuple ⟨“Obsessive-compulsive disorder”, prevalence⟩. Different solutions are possible for representing more complex queries, e.g. by composing multiple queries during retrieval. We leave these questions for future research.

5.6 Related Work

There is a large amount of work on Question Answering (QA) [Seo et al., 2017; Wang et al., 2017b], also applied to healthcare [Abacha et al., 2019; Jin et al., 2019], which focuses primarily on factoid questions with short answers. Typically, these models are trained with labeled question–answer pairs. However, it was shown that these models are not suitable for extracting local aspects from long documents, and especially not for open-ended, long answer passages [Tellex et al., 2003; Keikha et al., 2014; Nanni et al., 2018; Zhu et al., 2019]. We therefore frame our task as a passage retrieval problem, where the system’s goal is to extract a concise snippet (typically 5–20 sentences) out of a large number of long documents. Furthermore, following studies from EBM [Richardson et al., 1995; Cheng, 2004; Huang et al., 2006], we focus on structured healthcare queries instead of free-text questions.

Discourse-aware representations. Recently, new approaches have emerged that represent local information in the context of long documents. For example, Cohan et al. [2018] approach the problem as an abstractive summarization task. The authors use hierarchical encoders to model the discourse structure of a document and generate summaries using an attentive discourse-aware decoder. In our prior work on SECTOR [Arnold et al., 2019], we apply a segmentation and classification method to long documents to identify coherent passages and classify them into 27 clinical aspects. The model produces a continuous topic embedding on sentence level using BLSTMs, which has similar properties to the micro-topics described earlier by Arora et al. [2016] as discourse vector (“what is being talked about”).

We follow these ideas as the groundwork for our approach. Our proposed model is based on a hierarchical architecture to encode a continuous discourse representation. To the best of our knowledge, our model is the first to use discourse-aware representations for answer retrieval. Additionally, we address the problem of sparse training data and propose a multi-task approach for training the model with self-supervised data instead of labeled examples.

Passage matching. A baseline approach to the passage retrieval problem is to split longer documents into individual passages and rank them independently according to their relevance for the query. Passage matching has been done using term-based methods [Robertson and Jones, 1976; Salton and Buckley, 1988], most prominently TF-IDF [Jones, 1972] or Okapi BM25 [Robertson et al., 1995]. However, these methods usually do not perform well on long passages or when there is minimal word overlap between passage and query. Therefore, most neural models tackle vocabulary mismatch using semantic vector-space representations.

Representation-based matching models aim to match the continuous representations of queries and passages using a similarity function, e.g. cosine distance. This can be done on sentence level (ARC-I [Hu et al., 2014]), which does not work well if queries are short and passages are longer than a few sentences. Therefore, most approaches learn distinct query and passage representations using feed-forward (DSSM [Huang et al., 2013]) or convolutional neural networks (C-DSSM [Shen et al., 2014]).

Interaction-based matching models focus on the complex interaction between query and passage. These models use CNNs on sentence level (ARC-II [Hu et al., 2014]), match query terms and words using word count histograms (DRMM [Guo et al., 2016]), word-level dot product similarity (MatchPyramid [Pang et al., 2016]), attention-based neural networks (aNMM [Yang et al., 2016a]), kernel pooling (K-NRM [Xiong et al., 2017]) or convolutional n-gram kernel pooling (Conv-KNRM [Dai et al., 2018]). Most recently, Zhu et al. [2019] utilize hierarchical attention on word and sentence level (HAR) to capture the interaction of the query with local context in long passages.

While interaction-based models can capture complex correlations between query and passage, they do not include contextualized local information—e.g. long-range document context that comes before or after a passage—which might contain important information for the query. To overcome this problem, Mitra et al. [2017] combine document-level representations with interaction features in a deep CNN model (Duet). Wan et al. [2016] utilize BLSTMs (MVLSTM) to generate positional sentence representations across the entire document.

We combine the representation approach with interaction. Our proposed model is able to learn the interaction between the words of the passage and the discourse using a language model. At the same time, it encodes fixed sentence representations that we use to match query representations. Consequently, our model does not require pairwise inference between all query–sentence pairs, which is usually circumvented by re-ranking candidates [Gillick et al., 2018]. Instead, our model requires only a single pass through all documents at index time.

Furthermore, by encoding discourse-aware representations, the model is able to access long-range document context which is normally hidden after the passage split. We compare our approach to all the discussed matching models in Section 5.4.

5.7 Conclusions

In this chapter, we have approached RQ 3, the embedding of discourse structure into document representations. We presented CDV, a contextualized document representation that is encoded by applying Neural Machine Reading to long documents without external supervision. The model builds upon our previous results on RQ 1 and 2 by extending sentence representations with complementary distributed information from entity and aspect embeddings. We have proposed to integrate pre-trained embeddings using a multi-task objective function. This approach retains the properties of the embedding spaces and helps the model to generalize over unseen examples. We further approached RQ 4, the retrieval of answer passages from domain-specific text resources. We showed the applicability of the CDV model to three healthcare answer retrieval tasks with best or second best accuracy compared to 14 strong baseline models. We showed that in comparison to previous approaches, CDV is able to integrate structural document context into its representations, which helps to resolve long-range dependencies normally not visible to passage re-ranking models. Our CDV representations can be precomputed and used to retrieve passages for ad-hoc queries with efficient k-nearest neighbor search. Furthermore, we showed that accuracy improves significantly when a biomedical language model is integrated as the lowest layer. Finally, our CDV model can be trained with self-supervised data from medical Wikipedia articles and applied to different domains, such as biomedicine or consumer healthcare. In summary, we have shown that CDV fulfills all properties that we introduced for our vision of automatic language understanding in Section 1.2. This makes it an appropriate model to satisfy our hypothesis of Neural Machine Reading. In the next chapter, we discuss systems that utilize CDV and other building blocks introduced in this thesis for applications of human information seeking.

Chapter 6

Systems

In this chapter, we present four implementations of our Neural Machine Reading architectures in the form of system applications: TASTY is a text editor that utilizes our Entity Linking components to interactively support human information-seeking intentions (Section 6.1). TraiNER is a system to efficiently train Named Entity Recognition models with active learning (Section 6.2). SMART-MD is a clinical decision support system that utilizes TASTY and SECTOR to find passages in research articles (Section 6.3). CDV Healthcare Retrieval utilizes CDV for finding answers to clinical questions in large corpora (Section 6.4). All these systems are implemented in Java using the TeXoo Text Extraction Framework, which we release as open source¹.

6.1 TASTY: Interactive Editor for Entity Linking As-You-Type

We introduce TASTY (Tag-as-you-type)², a novel text editor for interactive Entity Linking as part of the writing process³. TASTY supports the author of a text with complementary information about the mentioned entities, shown in a ‘live’ exploration view. The system is automatically triggered by keystrokes, recognizes mention boundaries and disambiguates the mentioned entities to Wikipedia articles. The author can use seven operators to interact with the editor and refine the results according to his specific intention while writing. Our implementation captures syntactic and semantic context using a robust end-to-end LSTM sequence learner and word embeddings. We demonstrate the applicability of our system for English and German and for encyclopedic and medical text. TASTY is currently being tested in interactive applications for text production, such as scientific research, news editorial, medical anamnesis, help desks and product reviews.

6.1.1 Design Challenges

Entity Linking is the task of identifying mentions of named entities in free text and resolving them to their corresponding entries in a structured knowledge base [Hachey et al., 2013].

¹ TeXoo is available at https://github.com/sebastianarnold/TeXoo under Apache V2 license.
² This system was published by S. Arnold, R. Dziuba, and A. Löser [2016a]. “TASTY: Interactive Entity Linking As-You-Type”. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, pp. 111–115.
³ A live demo is available at https://tasty.demo.datexis.com

FIGURE 6.1: Example of writing a text in TASTY’s user interface. Named entities are displayed as tags, articles appear on the right side. White boxes denote interaction operators, filled boxes show system actions.

These two steps are often executed as a batch process after the document has been written by the author. In contrast, doctors during a medical anamnesis, technicians writing supportive manuals or assistants in help desks desire Entity Linking during writing. Ideally, a machine could highlight relevant information about recognized entities while the author is typing the text and gradually adapt the results to complement his task. TASTY is such a novel text editing interface for fine-grained tagging of articles as part of the writing process. Figure 6.1 shows an example of the editor in use. While the author is typing, a contextual sequence learner immediately recognizes mention boundaries, tags them in-line, resolves associated articles and displays them beside the document. When more context is written, the system reacts and refines boundaries and associations without interrupting the process. The author can add, remove and disambiguate tags according to his task and knowledge. TASTY’s extraction model recognizes multi-word mentions and identifies entities that are both in and outside the knowledge base. It does not require linguistic features, can be applied to multiple languages without hyperparameter changes and is robust to misspelled or out-of-vocabulary words. To our knowledge, TASTY is the first system that implements interactive Entity Linking for manifold scenarios.

TASTY supplies doctors with supplemental materials. As a demonstration example we showcase a medical History and Physical Examination (H&P) write-up, where doctors write text about a patient’s history and conditions. TASTY can recognize these medical entities and link them to Wikipedia articles. Other possible targets are e.g. research papers or doctor letters. As a result, a doctor may gain additional insights from these documents for sharpening her focus.

FIGURE 6.2: Overview of the interactive Entity Linking process in TASTY. While the author is writing a text, the system recognizes mentions, searches for entity candidates and disambiguates the mention to its corresponding Wikipedia article. The author is able to interact with every stage of the extraction process.

We showcase the following scenario as an example H&P (see Figure 6.1): The doctor starts by writing the first sentence about her patient: “Mrs E is a 43-year-old female with a past medical history significant for a laparoscopic cholecystectomy”. TASTY responds to keystrokes, recognizes mentions, searches for candidates and displays a complementary article for cholecystectomy next to the document. The doctor might explore the article and incrementally learn about important aspects of this condition. She might continue writing “she suffered from periodic episodes of abdominal pain localized in the epigastric region” and manually select a more precise disambiguation for the phrase “abdominal pain”. She may correct further tagging errors, e.g. remove the unwanted tag Mrs E. In case of a missing tag, the doctor can edit a phrase, e.g. “NRS-11 pain scale”, and tag it manually. The system reacts and returns a corresponding disambiguation.

6.1.2 Interactive Entity Linking Process

We implement interactive Entity Linking using mention recognizer, candidate searcher and disambiguator stages [Hachey et al., 2013]. We extend the process by an interactive cycle that includes partial update and user feedback operators, as shown in Figure 6.2. We demonstrate TASTY for English (EN) and German (DE) and for a specialized medical scenario (MED).

Step 1: Update while the author is typing. TASTY’s user interface is based on a lightweight rich text editor⁴ that we extend to display named entity mentions as in-line tags. TASTY captures the author’s keystrokes and detects word boundaries after space or punctuation characters. We split a document of length n into a sequence of word tokens d = (w_1, ..., w_n) using a language-independent whitespace tokenizer⁵. In a partial update step, we analyze only the changed portion d̃ = (w_b, ..., w_e), 1 ≤ b < e ≤ n of the document. We expand the indexes b and e to sentence boundaries and omit any further linguistic processing.

⁴ We use Quill v1.0.0-beta.11 http://quilljs.com
⁵ We use PTBTokenizer from Stanford CoreNLP 3.6.0 http://stanfordnlp.github.io/CoreNLP/

Step 2: Recognize mention boundaries. We define a mention m as the longest possible span of adjacent tokens that refers to an entity or relevant concept of a real-world object, such as “epigastric region”. In TASTY, we further assume that mentions are non-recursive and non-overlapping. The objective of this step is to detect all mention spans M_d̃ = {m_i} in the document portion. We model this task as a context-sensitive sequential word labeling problem. We predict for each token w_t ∈ d̃ a target label ŷ_t according to the BIOES tagging scheme [Ratinov and Roth, 2009] with respect to its surrounding words (Eq. 6.1). From these labels, we populate

M_d̃ in a single iteration. For the prediction task, we utilize Long Short-Term Memory (LSTM) networks [Hochreiter and Schmidhuber, 1997], which are able to capture long-range sequential context information with short answer times. The input is a sequence of word feature vectors x(w_t) with three components: First, we use lowercase letter-trigram word hashing [Huang et al., 2013] to encode word syntax on character level. This technique splits a word into discriminative three-letter ‘syllables’ with boundary markers, e.g. cell → {#ce, cel, ell, ll#}, to make the bag robust against misspellings and out-of-vocabulary words. Second, we utilize word embeddings [Mikolov et al., 2013a]⁶ to represent word semantics in dense vector space. Third, we encode surface form features by generating a vector of flags that indicate e.g. initial capitalization, uppercase, lowercase or mixed case.

ŷ_t = argmax_{l∈{B,I,O,E,S}} p(y_t = l | x(w_b), ..., x(w_{t−1}), x(w_t), x(w_{t+1}), ..., x(w_e))    (6.1)

We pass through d̃ bidirectionally using a stacked BLSTM+LSTM architecture [Arnold et al., 2016b]⁷. Our recognition component can be trained ‘end-to-end’ with only a few thousand labeled sentences. For the demonstration, we provide three different pre-trained models: EN is trained to recognize named entities (persons, organizations, locations and misc) in English encyclopedic text, DE captures proper nouns (untyped) in German encyclopedic text, and MED recognizes biomedical terms in scientific text.
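The letter-trigram component of the input features x(w_t) can be sketched in a few lines; hashing the trigram bag into a fixed-size indicator vector is an illustrative choice, and the exact feature layout of the system may differ.

```python
import zlib
import numpy as np

def letter_trigrams(word):
    """Split a word into letter trigrams with boundary markers, e.g. cell -> {#ce, cel, ell, ll#}."""
    padded = f"#{word.lower()}#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_features(word, num_buckets=4096):
    """Hash the trigram bag into a fixed-size indicator vector (sketch)."""
    features = np.zeros(num_buckets, dtype=np.float32)
    for trigram in letter_trigrams(word):
        features[zlib.crc32(trigram.encode("utf-8")) % num_buckets] = 1.0
    return features
```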

Step 3: Search for candidate links. Our next step is to resolve a subset of Wikipedia article candidates C_m for each of the detected mentions m. We especially aim to capture a large number of candidates for highly ambiguous terms such as scale or child. For this task, we create an index of 4.5M English and 1.6M German Wikipedia abstracts⁸. We use redirects and anchor phrases to capture alternative writings and synonyms [Hachey et al., 2013]. We apply a dictionary-based technique described by Ling et al. [2015] and query the index for candidates C_m = {c_j | ∀m ∈ d̃ : c.title ≈ m.span} using phrase queries with BM25 similarity for retrieval⁹. In case of an empty result, we return NIL (non-linkable entity).

⁶ We use a 150-dimensional lowercase word2vec model trained on English and German Wikipedia.
⁷ We implement the network using Deeplearning4j 0.6.0 https://deeplearning4j.org
⁸ We use DBpedia version 2015-10 http://wiki.dbpedia.org/datasets
⁹ We use the implementation in Lucene 6.1.0 http://lucene.apache.org

Step 4: Disambiguate associated articles. From the set of candidates C_m, we want to pick the most likely entity associations E_d = {(m_i, ĉ_j)}. We do this by picking the candidate ĉ with the maximum score depending on the mention and the current document context:

ĉ = argmax_{c∈C_m} score(c | m, d)    (6.2)

As scoring function, we utilize short text similarity [Kenter and de Rijke, 2015] between the mention context m.d and a candidate article c.d. We utilize word embeddings to calculate vectors x_emb(w_t) for every token in the document and aggregate them into a normalized mean document vector that we use as semantic signature s(d):

s(d) = (1/n) ∑_{w_t∈d} x_emb(w_t)    (6.3)

We finally use cosine similarity between the semantic signatures as scoring function:

score(c | m, d) = (s(m.d) · s(c.d)) / (‖s(m.d)‖ ‖s(c.d)‖)    (6.4)
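Steps 3–4 boil down to comparing mean word-embedding signatures (Eqs. 6.3–6.4). A minimal sketch, with a plain dictionary standing in for the word2vec lookup and a hypothetical `abstract_tokens` field for the candidate articles:

```python
import numpy as np

def signature(tokens, word_vectors, dim=150):
    """Mean word-embedding signature of a token sequence (Eq. 6.3, sketch)."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim, dtype=np.float32)

def disambiguate(mention_context_tokens, candidates, word_vectors):
    """Pick the candidate whose signature is most similar to the mention context (Eq. 6.4, sketch)."""
    s_m = signature(mention_context_tokens, word_vectors)
    def score(candidate):
        s_c = signature(candidate["abstract_tokens"], word_vectors)
        return float(s_m @ s_c / (np.linalg.norm(s_m) * np.linalg.norm(s_c) + 1e-8))
    return max(candidates, key=score)
```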

Step 5: Feed back user interaction. TASTY offers seven feedback operators that enable an author to interact with every component in the extraction process. All operators are based on typing or text selection. Using write, the author emits more context and the system reacts to word boundaries by triggering a partial update. He might also rephrase single words, triggering the system to update surrounding annotations. Using the add button, he is able to correct false negative predictions from the recognition component. The system will tag the selected mention, generate candidates and decide for an associated article. The remove button deletes selected tags to correct false positive predictions. The author can correct boundaries of an existing tag, and the system will update the link if necessary. If the boundaries of a tag are correct, but the link is not, he can disambiguate by assigning a different candidate from the drop-down menu. Finally, the author benefits from several operators to explore the articles. Corrections are directly executed in the local session and fed back as training data to our model.

6.1.3 Application Scenarios

We showcased TASTY’s editor with pre-trained models to 21 experienced professionals and learned about exciting application scenarios, which are shown in Table 6.1. A large group of users applied the results of in-line Entity Linking to subtasks with exploratory search intention [Marchionini, 2006]: look up facts or definitions for entities in the text, learn from complementary articles, compare written text against text in archives, verify information, integrate with existing tagging schemes. For future implementations, users suggested the application of investigatory subtasks: evaluate text to fit a desired tone or vocabulary, discover alternatives or get advice from user reviews or experts. We aim to extend TASTY with specialized models for these scenarios in future work.

Scenario    Example            Subtasks
Research    report writing     pin topics, lookup, find sources, explain
Editorial   news authoring     annotate paragraphs, style suggestion, identify topics and tags, search engine optimization
Diagnosis   anamnesis          lexicon search, side effects, patient history, medical compatibility
Help Desk   customer support   FAQ search, manuals, related tickets, expertise search
Shopping    product order      price comparison, user reviews, feature infobox, purchase advice

TABLE 6.1: Five example scenarios for TASTY’s application.

6.2 TraiNER: Bootstrapping Named Entity Recognition

Named Entity Recognition (NER) is an important preprocessing step for downstream tasks such as Named Entity Linking, Relation Extraction or Question Answering. These tasks depend on high NER recall [Pink et al., 2014]. However, pre-trained NER models are often used in scenarios where they do not exactly fit the underlying task. For example, a system that links text with a product database, such as cars or car parts, can utilize a fine-grained NER model for products or technologies [Ling and Weld, 2012]. A more accurate solution requires a NER model specifically trained for cars, but this step usually requires a large set of annotated text as training data. In corporate scenarios, this data does not exist, but instead lists of examples, e.g. car brands and model names, are stored in databases.

We propose TraiNER¹⁰, an adaptive entity extractor that can be trained with only a dictionary of seed examples and a collection of unlabeled documents as training data. We bootstrap a model using the seeds, sample training examples from the data and ask an expert user to refine the labeling task by manual annotation. We use this feedback to generate additional training examples and fit a personalized model after a few iterations.

6.2.1 Active Learning Process

In our scenario, we train the extractor with the following assets: a corpus of unlabeled text from the target domain; a seed list of keywords that express possible entity instances; an expert who knows the task and is able to complete a set of labeling tasks; and, optionally, a small labeled set with evaluation data. We utilize these assets to build a model using an active learning process [Settles, 2010], which is visualized in Figure 6.3.

1. Bootstrapping training data from dictionaries. We start to bootstrap training data by collecting a seed list S of entity names from a database, e.g. Wikidata. We utilize an efficient implementation of the Backward-Oracle-Matching algorithm [Faro and Lecroq, 2009] to match all

10This system is based on unpublished work by S. Arnold, R. Schneider, C. Kümmel, R. Mehlitz, T. Oberhauser and A. Löser.

entries of S to the documents of the unlabeled text corpus. We use lowercase matching of all terms of length 3 or longer. We discard matches that do not align entirely with token boundaries and give priority to the longest matches. The resulting data set contains sentences with sparse distant-supervised labels YD = {(s, match(s, S)) | s ∈ Xtrain}. Because the dictionary is incomplete, the data set may contain a large number of false negatives. The set may also contain false positive mentions and wrong boundaries resulting from morphological similarity.

FIGURE 6.3: TraiNER active learning process.
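The bootstrapping step can be illustrated with the following minimal sketch. Instead of the Backward-Oracle-Matching implementation used in TraiNER, it uses a simple longest-match scan over lowercased tokens; the BIO labeling scheme and helper names are assumptions of this illustration only.

```python
# Minimal sketch of dictionary bootstrapping (step 1): assign sparse
# distant-supervised BIO labels by matching seed entries against token sequences.
def match_seeds(tokens, seed_list):
    """Label tokens with B/I/O tags for the longest seed matches (length >= 3)."""
    seeds = {tuple(s.lower().split()) for s in seed_list if len(s) >= 3}
    max_len = max((len(s) for s in seeds), default=0)
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # prefer the longest seed entry that starts at token i (aligned to token boundaries)
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(t.lower() for t in tokens[i:i + n]) in seeds:
                labels[i] = "B"
                labels[i + 1:i + n] = ["I"] * (n - 1)
                i += n
                break
        else:
            i += 1
    return labels

# Example: sparse distant labels for one sentence from the unlabeled corpus
print(match_seeds("The patient received high doses of doxycycline .".split(),
                  ["doxycycline", "lyme disease"]))
```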

2. Training an efficient NER model. We utilize our TASTY sequence labeler with robust encodings and contextual embeddings [Arnold et al., 2016b] that can be trained end-to-end. We call this model Mi at iteration i. For bootstrapping, we optimize ỹ = M0(s) ∀(s, ỹ) ∈ YD for a random sample from the training data. We empirically observed that this step requires at least 4,000 training sentences with high variance to produce a well-performing model. As the process continues, we call this step iteratively to retrain Mi+1 with improved labels.

3. Inference on evaluation data. We apply Mi to a labeled test dataset YT = {(s, y) | s ∈ Xtest} to produce labels for evaluation: ŶT = {(s, Mi(s)) | s ∈ Xtest}. Furthermore, we produce labels for a random sample of machine-labeled examples ŶM = {(s, Mi(s)) | s ∈ Xtrain}.

4. Evaluation of current iteration. We evaluate predictions ŶT with respect to the ground truth YT using the F1 measure. We stop after this step if the results are above a given threshold or the annotation budget is reached.

FIGURE 6.4: Screenshot of the TraiNER user interface for human annotation.

5. Sampling examples for manual annotation. From machine-labeled predictions ŶM, we pick 100 examples Xsample that require manual examination. We calculate a confidence score for each prediction using the maximum class probability from the NER softmax. We use this value for uncertainty sampling [Settles, 2010].
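A minimal sketch of this sampling step follows. It assumes the model exposes per-token softmax probabilities and uses one possible aggregation (the smallest maximum class probability over the tokens of a sentence) as the sentence-level confidence; the function names are illustrative.

```python
# Minimal sketch of uncertainty sampling (step 5) over machine-labeled sentences.
import numpy as np

def uncertainty_sample(sentences, predict_proba, k=100):
    """Pick the k sentences whose least confident token prediction is lowest.

    predict_proba(sentence) is assumed to return an array of shape
    (num_tokens, num_classes) with softmax probabilities.
    """
    confidences = []
    for s in sentences:
        proba = predict_proba(s)
        # confidence of a sentence = smallest maximum class probability over its tokens
        confidences.append(np.min(np.max(proba, axis=1)))
    order = np.argsort(confidences)          # ascending: least confident first
    return [sentences[i] for i in order[:k]]
```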

6. Annotation by human expert. The human oracle is asked to annotate the 100 samples Xsample with expert labels YE = {(s, label(s)) | s ∈ Xsample}. We use a graphical annotation interface, which we discuss in the next section.

7. Upsampling human annotations to generate training data. The number of examples required to improve the NER model is often too large to annotate manually. We therefore utilize an upsampling method to generate more training samples from the 100 expert labels. We apply the dictionary bootstrapping method to a random sample of the training data with all new entity names contained in the expert labels: YD+ = {(s, match(s, S ∪ YE)) | s ∈ Xtrain}.

Iterative loop. We continue with step 2 and optimize the next-generation model Mi+1 with expert and distant labels: ỹ = Mi+1(s) ∀(s, ỹ) ∈ YE ∪ YD+.
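The overall loop can be summarized with the following sketch. The individual components (dictionary matching, training, evaluation, sampling and expert annotation) are passed in as callables, so the function only fixes the control flow; the budget, sample sizes, F1 target and the assumption that the annotation step also returns new entity names are illustrative.

```python
# Minimal sketch of the TraiNER active learning loop (steps 2-7).
import random

def active_learning_loop(unlabeled, seed_list, test_set,
                         match, train, evaluate_f1, sample_uncertain, annotate,
                         budget=10, f1_target=0.70):
    seeds = list(seed_list)
    distant = [(s, match(s, seeds)) for s in random.sample(unlabeled, 4000)]
    model = train(distant)                                  # step 2: bootstrap M0
    expert = []
    for _ in range(budget):
        f1 = evaluate_f1(model, test_set)                   # steps 3-4: inference + evaluation
        if f1 >= f1_target:
            break
        batch = sample_uncertain(model, unlabeled, k=100)   # step 5: pick uncertain samples
        labeled = annotate(batch)                           # step 6: expert labels Y_E,
        expert += labeled                                   #         assumed as (sentence, names)
        seeds += [name for _, names in labeled for name in names]
        upsampled = [(s, match(s, seeds))                   # step 7: upsampling Y_D+
                     for s in random.sample(unlabeled, 4000)]
        model = train(expert + upsampled)                   # retrain M_{i+1}
    return model
```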

6.2.2 Graphical User Interface

The TraiNER interface is optimized for annotating named entity mentions with very few required clicks. Figure 6.4 shows the interface with its operators. The system can be used in two scenarios: an expert can use it in offline mode (1) to exhaustively annotate a complete data set with named entity mentions, or she can use it in online mode (2) and iteratively annotate small batches of sampled sentences, as described in the previous section. Each text snippet is displayed as an interactive text box (3). Bootstrapped model annotations are shown in blue (4), user corrections are shown in red (5). By selecting the text with the mouse cursor, the user can efficiently add, delete and correct annotations. An autocompletion operator helps to quickly annotate entire words with single clicks (6). The user can also display the snippet in the context of the entire document (7). The user needs to accept the annotations for the snippet (8) or reject all of them (9). After this operation, the snippet is closed. The user can reopen the snippets later for review or correction (10). After annotation, the user sends the annotations back to the model to start a training iteration (11). The evaluation scores are displayed after each iteration (12). To give an overview of the annotation progress, the number of remaining samples and annotations is displayed for the user (13). In our experiments with ten expert annotators using the system for 8 hours each, we observed that an average annotator labeled 305 sentences per hour.

FIGURE 6.5: Comparison of TraiNER sampling strategies. (Figure adapted from Kümmel [2018])

6.2.3 Experimental Results

We simulated the online labeling scenario on the i2b2 clinical concept recognition dataset [Uzuner et al., 2011]. We used a seed list with 24,125 entity names from Wikidata for bootstrapping and a 20% test split of the dataset as evaluation set. We conducted several training runs with 10 iterations each: random sampling and uncertainty sampling of 100 human-labeled examples, and incremental sampling of all examples (10% of the training data per iteration), both with and without upsampling. The training curves are shown in Figure 6.5. From training the NER model on the fully labeled training set we observe an upper bound performance of 74.5% F1. Using only the bootstrapping approach yields a lower bound performance of 27.5% F1. We further observe that incremental sampling requires 50% of the training data to reach 60% F1, while upsampling reduces these costs to 25% of the data. With random sampling, the model does not significantly improve after 10 iterations (7.7% of the training data), even with upsampling. Uncertainty sampling with upsampling can speed up this process, so that the model performs with 45% F1 or better after 5% of the training data has been labeled.

6.3 Smart-MD: Clinical Decision Support System

Medical doctors, in particular in emergencies, often need to make fast decisions without studying the latest research results from journals thoroughly. In particular, less experienced doctors might overlook alternative treatments or therapies and often fall back to potentially less effective standard procedures known from their academic studies. Despite the fact that most queries of doctors are of informational intent [Yoo and Mosa, 2015; White and Horvitz, 2014], standard medical search engines, like PubMed11, still focus on filtering documents using keyword queries. Ideally, a doctor could use an effective search engine for retrieving diverse and potentially unknown results from the latest literature about symptoms, therapies, medications, treatments or other often requested aspects during the anamnesis.

6.3.1 Demonstration Scenario

Consider the case of a doctor searching for treatments of Lyme disease, an infectious disease caused by bacteria of the Borrelia type which are mainly spread by ticks. She will study essential articles and will find the transmission of ticks from birds to humans as the main cause. While she knows from her academic studies that antibiotics such as doxycycline will help most patients, she might overlook that certain patients with cardiac diseases will likely suffer from this treatment and should rather be treated with ceftriaxone-based antibiotics. Ideally, the system would retrieve all treatments for Lyme disease and would display an aggregated overview of different treatments, including some paragraphs of text which explain infrequent edge cases.

We demonstrate SMART-MD12, an IR system that provides such functionality for medical professionals13. The system takes as input diseases and a list of optional topical aspects. Figure 6.6 shows a typical result for the query “lyme treatment” (1). Given the query, the system retrieves two highly relevant paragraphs about treatments from two articles on Lyme disease or on Borrelia (4). It recognizes and aggregates important facets in these paragraphs, such as correlating medical terms or topics, and provides the user with these facets for query refinement (2). Furthermore, SMART-MD shows a distribution of treatments (3) and the user can narrow the query to particular novel and previously unknown treatments. Finally, the user may click on an interesting paragraph to inspect the context of the entire document (5). Thereby the system highlights the topic of each relevant paragraph (6). In particular with long documents, this fine granularity at paragraph level permits the reader to skip many irrelevant passages.

6.3.2 Passage Retrieval Process

SMART-MD is built upon two neural information extractors which process the dataset at load time. The topic extractor assigns a distribution of topics to each sentence in the dataset. The entity extractor recognizes named entities in these sentences. Both models are trained end-to-end with data from the medical domain. We store all extractions in an index and retrieve them at query time to return relevant paragraphs. In this section we describe these steps briefly.

11https://www.ncbi.nlm.nih.gov/pubmed/ 12This system was published by R. Schneider, S. Arnold, T. Oberhauser, T. Klatt, T. Steffek, and A. Löser [2018]. “Smart-MD: Neural Paragraph Retrieval of Medical Topics”. In: The Web Conference 2018 Companion. IW3C2, pp. 203–206. 13A video is available at https://www.youtube.com/watch?v=kcDi7qQxpBo

FIGURE 6.6: Screenshot of the SMART-MD user interface.

Sequential topic classification. The topic extractor’s goal is to assign a coherent distribution of topics over all positions in a document. In contrast to traditional probabilistic topic models such as LDA [Blei et al., 2003], which describe topic distributions at the document level, we aim to capture topics at the sentence level. To achieve a coherent sequence of topics, e.g. to spot adjacent sentences that express treatments of a disease, we need to respect the sequential order and long-range dependencies of sentences in the document. We use the SECTOR model [Arnold et al., 2019] which utilizes bidirectional Long Short-Term Memory (BLSTM) networks [Hochreiter and Schmidhuber, 1997] to segment and classify passages. The network architecture is shown in Figure 6.7. We utilize section and subsection headlines from Wikipedia documents to define possible topics. For example, we observe 6,876 distinct headlines from 3,469 Wikipedia pages on diseases. A closer inspection reveals that this distribution is heavily skewed, e.g. the top 20 topics cover more than 90% of all paragraphs. We therefore chose 20 representative topic labels for training and assign the label ‘other’ to the remainder.
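The following PyTorch sketch illustrates the core idea of sentence-level sequential topic classification: a BLSTM reads one embedding per sentence and emits a topic distribution per position. It is a simplified illustration of this idea, not the exact SECTOR architecture; all dimensions and layer choices are assumptions.

```python
# Minimal sketch: a BLSTM over sentence embeddings assigns one topic per sentence.
import torch
import torch.nn as nn

class SentenceTopicTagger(nn.Module):
    def __init__(self, sentence_dim=256, hidden_dim=128, num_topics=21):
        super().__init__()
        # bidirectional LSTM over the sequence of sentence embeddings of one document
        self.blstm = nn.LSTM(sentence_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classify = nn.Linear(2 * hidden_dim, num_topics)  # 20 topics + 'other'

    def forward(self, sentence_embeddings):
        # sentence_embeddings: (batch, num_sentences, sentence_dim)
        context, _ = self.blstm(sentence_embeddings)
        return self.classify(context)  # logits per sentence: (batch, num_sentences, num_topics)

# one document with 40 sentences, each encoded e.g. as a bag-of-words vector
doc = torch.randn(1, 40, 256)
topic_logits = SentenceTopicTagger()(doc)
topics = topic_logits.argmax(dim=-1)  # most likely topic per sentence
```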

Medical Named Entity Recognition. The entity extractor’s goal is to recognize medical named entities, such as diseases or medications, in the documents. This task is often difficult, since only sparse training data exists and recall suffers from missing variance [Pink et al., 2014]. We utilize the TASTY model [Arnold et al., 2016b], a generic and robust approach for high-recall NER in many languages and with sparse training data. TASTY offers strong generalization over domain-specific language, such as in biomedical text (e.g. Medline, PubMed or Wikipedia articles) and can be trained with only a few hundred labeled sentences to achieve F1 scores in the range of 84–94% on standard datasets. To achieve a robust classifier, TASTY encodes words as bags of letter-trigrams as input features. This allows us to train a character embedding that is able to recognize typical syllables in a word. We extract possible diseases and other medical entities and store them in the index for query completion and paragraph retrieval.

FIGURE 6.7: Neural network architectures for topic classification (left) and entity recognition (right).

FIGURE 6.8: Visualization of neural topic classification on short passages.
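A minimal sketch of the bag-of-letter-trigrams encoding follows. The boundary markers, hashing step and vector size are assumptions of this illustration; the point is that misspellings share most trigrams with the original word, so their encodings stay close.

```python
# Minimal sketch of encoding a word as a bag of letter-trigrams.
import numpy as np

def letter_trigrams(word):
    """Split a word into overlapping character trigrams with boundary markers."""
    padded = f"#{word.lower()}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def encode_word(word, dim=1024):
    """Hash each trigram into a fixed-size bag-of-trigrams count vector."""
    vec = np.zeros(dim, dtype=np.float32)
    for tri in letter_trigrams(word):
        vec[hash(tri) % dim] += 1.0   # note: hash() is illustrative, not stable across runs
    return vec

print(letter_trigrams("doxycycline"))
```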

Query processing and paragraph scoring. SMART-MD executes queries of the form (disease, topic) as follows: First, the user matches ambiguous disease and topic names using autocompletion. This operator maps a variety of notations from Wikipedia entity names and headlines to well-defined entities and topic classes. We then conduct a conjunctive boolean search and retrieve documents that contain both the disease name and the topic ID in a single document. Finally, we score the candidate paragraphs. Our scoring approach is based on the assumption that paragraphs likely contain medical entities that have a mutual relation with the topic of the paragraph and the requested disease. Moreover, we aim to retrieve low-frequency events that are probably unknown to the doctor. For each paragraph, we measure the proximity between the requested topic and co-occurring entities with normalized pointwise mutual information (nPMI) [Bouma, 2009]:

nPMI(entity, topic) = ln [ P(entity, topic) / (P(entity) P(topic)) ] / (−ln P(entity, topic))    (6.5)

P(entity) denotes the probability that retrieved paragraphs contain the entity, P(topic) the probability that the topic is discussed in the retrieved paragraphs, and P(entity, topic) denotes the probability that an entity appears in any retrieved paragraph that discusses the topic. Hence we assign relatively high scores to low-frequency events and display these results at the beginning of the page.
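A counting-based sketch of Equation 6.5 follows, estimating the probabilities over the set of retrieved paragraphs. The input representation (a list of entity-set/topic pairs) and the handling of degenerate cases are assumptions of this illustration.

```python
# Minimal sketch of nPMI scoring (Equation 6.5) over retrieved paragraphs.
import math

def npmi(paragraphs, entity, topic):
    """paragraphs: list of (entities_in_paragraph, topic_of_paragraph) tuples."""
    n = len(paragraphs)
    p_entity = sum(entity in ents for ents, _ in paragraphs) / n
    p_topic = sum(t == topic for _, t in paragraphs) / n
    p_joint = sum(entity in ents and t == topic for ents, t in paragraphs) / n
    if p_joint == 0.0 or p_entity == 0.0 or p_topic == 0.0:
        return -1.0   # entity and topic never co-occur
    if p_joint == 1.0:
        return 1.0    # they co-occur in every retrieved paragraph
    return math.log(p_joint / (p_entity * p_topic)) / (-math.log(p_joint))

paragraphs = [({"doxycycline"}, "treatment"), ({"ceftriaxone"}, "treatment"),
              ({"borrelia"}, "cause"), ({"ceftriaxone", "cardiac disease"}, "treatment")]
print(npmi(paragraphs, "ceftriaxone", "treatment"))  # nPMI score in [-1, 1]
```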

6.4 CDV Healthcare Answer Retrieval

In prior work, we have presented SMART-MD [Schneider et al., 2018], a clinical passage retrieval system based on TASTY Named Entity Extraction [Arnold et al., 2016b] and the SECTOR [Arnold et al., 2019] topic classification method. In this section, we extend this task with a system that utilizes the Contextual Discourse Vectors (CDV) model [Arnold et al., 2020] for retrieving clinical answers in long documents. For this scenario, we apply CDV as a Neural Machine Reading model on multiple large corpora of domain-specific text14. The aim of this system is to demonstrate a variety of use-cases to doctors and healthcare professionals. From this demonstration, we hope to gain more insights on how we can utilize neural document representations to solve clinical information-seeking tasks.

6.4.1 Demonstration Scenario

We use the CDV model, which encodes the discourse of a document, e.g. the entities and aspects discussed in a certain sentence. Our healthcare CDV model is trained with over 27,000 diseases and over 14,000 clinical aspects, such as symptoms, diagnosis, causes, therapy, prevalence, etc. This enables us to search for passages in clinical articles that potentially contain answers for clinical background–foreground questions. We applied CDV to a variety of domain-specific text resources: WikiSection [Arnold et al., 2019], CORD-19 [Wang et al., 2020], Orphanet [INSERM, 1997], MedQuAD [Abacha and Demner-Fushman, 2019] and HealthQA [Zhu et al., 2019], covering over 33.1K articles in total.

Figure 6.9 shows the CDV search interface over the CORD-19 open research dataset15. The user can search using the name of a disease, e.g. “COVID-19”, and the aspect of interest, e.g. “medication” (1). The autocomplete helps her resolve the correct entities and aspects. The system returns a list of similar entities, e.g. “2019 novel coronavirus respiratory syndrome” and “SARS-CoV-2”, which can be clicked to refine the query (2). The search result contains up to 30 passages from different articles that match the query, each shown with its matching score in percent (3). By hovering over a sentence, its individual score is shown. For each article, the best matching sentence is highlighted in bold, e.g. “potential drugs [...] such as Remdesivir, Atazanavir, Saquinavir, and Formoterol, and Tocilizumab can be introduced as treatments for COVID-19 [...]”16(4). The user can read a larger part of the passage by clicking “more” or open the original source article. By clicking on the headline, the user can access the highlight view (5). Here, the entire article is shown, and the correspondence score of each sentence with the query is indicated by the shade of blue. In other words, this view highlights interesting passages that the user should read. This operator is especially helpful for skimming long and complex documents for specific questions.

14This system was submitted for review by J.-M. Papaioannou, S. Arnold, F. A. Gers, A. Löser, M. Mayrdorfer, and K. Budde [2020]. “Aspect-Based Passage Retrieval with Contextualized Discourse Vectors”. In: [IN SUBMISSION] COLING 2020 System Demonstrations. 15A live demo is available at https://cord19.cdv.demo.datexis.com 16Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7085862 CC-BY 4.0, 18.04.2020

FIGURE 6.9: Screenshot of the CDV search interface (left), result passages (center) and one document in highlight view (right).

Example query: Peanut allergy treatment
“Mild reactions can be treated with an antihistamine medicine.” (Patient17)
“Antihistamines can alleviate some of the milder symptoms of an allergic reaction, but do not treat all symptoms of anaphylaxis.” (Wikipedia18)
“The principal treatment for anaphylaxis is epinephrine as an injection.” (Wikipedia19)

Example query: Cystic fibrosis symptoms
“The main signs and symptoms of cystic fibrosis are salty-tasting skin, poor growth, and poor weight gain despite normal food intake, accumulation of thick, sticky mucus, frequent chest infections, and coughing or shortness of breath.” (Wikipedia20)
“Signs and symptoms may include salty-tasting skin; persistent coughing; frequent lung infections; wheezing or shortness of breath; poor growth; weight loss; greasy, bulky stools; difficulty with bowel movements; and in males, infertility” (GARD21)
“The most common form of cystic fibrosis is associated with respiratory symptoms, digestive problems [...] and staturoponderal growth anomalies.” (Orphanet22)

Example query: COVID-19 medication
“Also, potential drugs [...] such as Remdesivir, Atazanavir, Saquinavir, and Formoterol, and Tocilizumab can be introduced as treatments for COVID-19 if they prove to be effective in animal and clinical studies” (PMC16)
“Chloroquine has been used to treat malaria for many years, with a mechanism that is not well understood against some viral infections.” (PMC23)
“The WHO does not oppose the use of non-steroidal anti-inflammatory drugs (NSAIDs) such as ibuprofen for symptoms, and the FDA says currently there is no evidence that NSAIDs worsen COVID-19 symptoms” (Wikipedia24)

TABLE 6.2: Example results for three CDV healthcare queries. The table shows for each query the highlighted sentences from the top-3 predicted passages.

17Source: https://patient.info/allergies-blood-immune/food-allergy-and-intolerance/nut-allergy, 18.04.2020 18Source: https://en.wikipedia.org/wiki/Food_allergy CC-BY-SA 3.0, 18.04.2020 19Source: https://en.wikipedia.org/wiki/Peanut_allergy CC-BY-SA 3.0, 18.04.2020 20Source: https://en.wikipedia.org/wiki/Cystic_fibrosis CC-BY-SA 3.0, 18.04.2020

6.4.2 Discussion of Healthcare Queries

We exemplify the results for three healthcare queries on six different datasets, shown in Table 6.2. We chose these queries because they show a variety of exploratory information-seeking tasks, such as lookup, learn and investigate [Marchionini, 2006].

“Treatments for peanut allergy” is a typical consumer question, which is difficult to answer with a single fact lookup, because there exists no cure for this allergy up to now. Such queries require the user to learn more about a topic, so they are best answered by passages from health portals like Patient or the Wikipedia encyclopedia. We notice that two articles mention antihistamines for mild reactions, and another article suggests epinephrine (a synthetic form of adrenaline) as a treatment for anaphylaxis. The information that anaphylaxis is a severe allergic reaction is mentioned in the context of this passage. We further notice that the information is contained in specific articles about peanut allergy, but also in more generic ones such as food allergies. Therefore it is important that the predictions respect the locality of the passages.

“Symptoms of Cystic fibrosis” is a similar lookup question that is focused on a rare disease. Here, the answers are contained in various specialized sources such as GARD or Orphanet, which contain only short articles. We observe that the three top answers strongly overlap; some describe the symptoms more casually (e.g. poor growth), others are more specific (e.g. staturoponderal growth anomalies). We further notice that for these rare cases, answers are very precise and there is also strong lexical overlap between query and passage.

“Medication for COVID-19” is an investigatory question which focuses on finding potential medications for the new COVID-19 disease. This is an open-ended question, and we cannot validate its answers today. Therefore, the goal is to maximize the recall over recent research articles and point an expert user to the interesting passages. The answers show a high variance and discuss potential drugs such as Remdesivir, Formoterol or Chloroquine, which are discussed in individual PMC articles. The model also predicts a more generic Wikipedia passage mentioning ibuprofen as a typical anti-inflammatory drug that has been widely discussed in relation with COVID-19.

21Source: https://rarediseases.info.nih.gov/diseases/6233/index 18.04.2020 22Source: https://www.orpha.net/consor/cgi-bin/Disease_Search_Simple.php?lng=EN&diseaseGroup=Cystic+fibrosis Copyright INSERM 1997, 18.04.2020 23Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7068984 CC-BY 4.0, 18.04.2020 24Source: https://en.wikipedia.org/wiki/Coronavirus_disease_2019 CC-BY-SA 3.0, 18.04.2020

6.5 Conclusions

In this chapter, we have introduced four systems that cover a broad range of Machine Reading applications. TASTY combines classical Information Extraction with an interactive feedback loop while the user is writing text. TraiNER approaches the efficient creation of domain-specific Information Extraction models in an active learning setting. SMART-MD models a clinical information-seeking task with a combination of neural Information Extraction and Machine Reading methods over long documents. Finally, CDV Healthcare Retrieval utilizes a self-supervised Neural Machine Reading model to retrieve answers from a variety of domain-specific text resources. With these systems, we cover the entire process of supporting human information seeking as introduced in Section 1.1. Most of these systems are prototypical implementations that have been applied in industry and research projects. These systems benefit from the contributions within the scope of this thesis, which are focused on automatic language processing. It remains for future work to evaluate the entire information-seeking process, including the user interfaces and feedback processes of these systems, with methods from human-computer interaction (HCI).

Chapter 7

Conclusion and Future Work

In this thesis, we have examined the vision of Neural Machine Reading for domain-specific text resources from a variety of viewpoints. We have contributed the Neural MR architectures TASTY, SECTOR and CDV that support three central tasks in the process of human information seeking: Named Entity Linking, Topic Modeling and Answer Passage Retrieval. We have shown that deep neural networks enable the efficient creation of generalized models using end-to-end training methods, self-supervision and the integration of contextual and background knowledge. Our models are able to process domain-specific text resources without expensive adaptation and with high error tolerance. We have evaluated several information-seeking tasks and shown that our models achieve high accuracy, while they are often trained with widely available training data, e.g. from Wikipedia. In this chapter, we review our contributions with respect to the desired properties of a MR model (Section 7.1) and research questions (Section 7.2). We discuss the limitations (Section 7.3) and perspectives (Section 7.4) of our vision. Finally, we propose directions for future work (Section 7.5).

7.1 Contributions of Neural Machine Reading

In Section 1.1, we have introduced six central challenges for domain-specific language understanding. We have designed our models to meet these requirements and now discuss our contributions and findings:

Domain-specific language understanding. We have approached three central tasks over multiple languages or domains. More specifically, we proposed a deep learning architecture for Named Entity Recognition that can be trained end-to-end and does not require language and domain-specific feature engineering (Section 3.2.2). We have shown that our NER model can efficiently adapt to English and German in news, biomedical and industry domains, when supervised training data is available (Section 3.3). We further have introduced the SECTOR and CDV Topic Classification and Answer Passage Retrieval architectures that share these properties, even when trained with self-supervision from Wikipedia section headings (Sections 4.3.2 and 5.3). We have shown that SECTOR adapts to English and German for medical, geopolitical, chemistry and clinical domains with high accuracy (Section 4.4). We further report high accuracy of CDV for retrieving answers from medical encyclopedia, consumer healthcare, clinical research and professional biomedicine domains, even without any additional fine-tuning (Sections 5.4 and 6.4).

Robustness against noise and spelling variations. We have analyzed errors arising from capitalization, spelling variations, novel or incomplete words and irregular sentences (Section 3.1.3). All of our Machine Reading approaches abstract from rule-based linguistic preprocessing in order to avoid these errors. Instead, they rely on the distributional hypothesis to represent language using empirical observations, such as distributed word embeddings and language models (Section 2.2). Primarily, we have shown that letter-trigram encoding is a key component for robust word representations (Section 3.2.1). We further have shown that sentence embeddings based on Bloom filters, weighted average or self-attention over words improve recall for information-seeking tasks (Sections 4.3.1 and 5.3.1).

Document structure representation. We have shown that our SECTOR and CDV models are able to encode the structural properties of long documents (Sections 4.3 and 5.3). Our document representations improve information-seeking tasks with contextualized local information that is not contained in word- or sentence-based models, such as Word2Vec, ParVec or BERT (Sections 4.5, 5.3.3 and 6.4).

Broad task coverage. We have covered a broad range of tasks using two approaches. First, our efficient supervised training techniques for TASTY and TraiNER allow us to build personalized IE models from labeled data without any architecture changes (Sections 3.2.2 and 6.2). We have shown high accuracy on Named Entity Recognition and Linking to Wikipedia, but also report high recall from extracting general concepts, noun phrases, more specific biomedical terms, car models and car parts (Section 3.3). However, these models rely on accurate training data and are not easily interchangeable between tasks. Therefore, our second approach with CDV focuses on general Machine Reading models that can be reused for different tasks. We have shown that distributed entity and aspect embeddings are able to handle generic and specific Answer Passage Retrieval tasks using nearest-neighbor search (Section 5.2). This includes, for example, general disease descriptions, precise definitions of rare diseases and also zero-shot adaptation to previously unknown diseases such as COVID-19 (Section 6.4). We further have shown that multi-task training with complementary objectives, such as named entities and topical aspects, improves representations for Answer Retrieval compared to language models that solely rely on the distributional hypothesis (Section 5.4).

Efficient model training. We have proposed several approaches to deal with insufficient amounts of training data. Our TASTY model utilizes efficient word representations, sequential context and background knowledge to train NER models with high accuracy using only 4,000–5,000 labeled sentences (Section 3.3). We have proposed the TraiNER framework to reduce the amount of human labeling for this task to a few hundred examples by automatically matching seed labels and sampling instances for active learning (Section 6.2). We proposed SECTOR and CDV document representations that can be reused for downstream tasks. These models are efficiently trained with self-supervised data from Wikipedia articles and leverage background knowledge from large pre-trained language models and complementary entity and aspect embeddings (Sections 4.2 and 5.3.4).

Error analysis and feedback propagation. We have conducted error analyses for each of our models to explain their strengths and weaknesses (Sections 3.4, 4.5 and 5.5). We aimed to cover the entire process of human information seeking by including the user in the feedback loop of the TASTY Editor and TraiNER active learning system. The components of our models are trained end-to-end, so they can be replaced with updated weights and often improve with more feedback (Sections 6.1 and 6.2).

7.2 Review of Research Questions

At the beginning of this thesis, we posed four central research questions to approach our hypothesis of Neural Machine Reading. We are now going to summarize our findings for each of these questions.

RQ1. What are general solutions to identify named entities in domain-specific text? Extracting named entities from domain-specific text requires a model that can leverage local, contextual and global features. First, we identified character-based word representations as a key component for efficient and robust recognition of domain-specific entity names. In particular, letter-trigram encodings provide our model with important local subword information that enables the model to efficiently learn from sparse data with high recall. Second, we have shown that Bidirectional Long Short-Term Memory (BLSTM) models effectively encode long-range dependencies from sentence and document context. The encoder-decoder architecture using stacked BLSTMs enables us to train language-invariant models for Named Entity Recognition end-to-end, i.e. using a labeled set of 4,000–5,000 examples or in an active learning scenario. Third, pre-trained language models, word and sentence embeddings provide important background information for generalization of the model. We used a combination of these features to achieve 91.1% F1 on the English CoNLL03 task, and report high scores for domain-specific NER models in English and German. Furthermore, we used entity embeddings that were trained with plain entity descriptions to efficiently retrieve and rank candidates for Named Entity Disambiguation and Answer Passage Retrieval with high accuracy using nearest neighbor search.

RQ2. How can Machine Reading models detect topics and structure in long documents? The understanding of entire documents is an important ingredient for Machine Reading. To tackle this task, we introduced a topic segmentation and classification task which operates with sentence granularity on long documents with 1,500 words on average. We have shown that models based on the distributional hypothesis, such as Paragraph Vectors or LDA topic models, cannot solve this task adequately. Instead, MR models require complementary structural information, which we take from section headings of Wikipedia training documents. We have shown that BLSTMs can effectively capture entire documents using Bloom filters for sentence encoding. We have introduced a bidirectional embedding deviation method, inspired by edge detection in images, to segment documents at topic shifts into coherent passages with high accuracy. We further have shown that the same MR model can be used to classify passages into 25–30 normalized topic classes with up to 71.6% F1. We provided insights showing that the model predicts a coherent topical structure, which can further be reused for downstream tasks such as large multi-class multi-label classification with up to 603 classes.
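The following sketch conveys the intuition behind segmenting at topic shifts via embedding deviations: large changes between adjacent sentence-level embeddings act like edges in an image. It is a simplified, unidirectional illustration, not the exact bidirectional embedding deviation method of SECTOR; the number of segments and the distance measure are assumptions.

```python
# Minimal sketch: place segment boundaries where adjacent sentence embeddings deviate most.
import numpy as np

def segment_at_topic_shifts(sentence_embeddings, num_segments=5):
    e = np.asarray(sentence_embeddings, dtype=np.float32)
    e /= np.linalg.norm(e, axis=1, keepdims=True) + 1e-8
    # deviation between consecutive sentences: 1 - cosine similarity
    deviation = 1.0 - np.sum(e[1:] * e[:-1], axis=1)
    k = max(num_segments - 1, 0)
    # keep the k largest local deviations ("edges") as boundaries
    boundaries = np.sort(np.argsort(deviation)[len(deviation) - k:] + 1)
    return boundaries.tolist()

# example: 20 sentences of 64-dimensional embeddings split into 5 passages
print(segment_at_topic_shifts(np.random.randn(20, 64)))
```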

RQ3. How can we embed discourse structure into document representations? Automatic language understanding requires a contextualized document representation that reflects the discourse structure of a text, including topical structure, entity mentions, coreferences and general long-range dependencies. We proposed to extend pre-trained language models that rely on the distributional hypothesis with complementary information from document structure and distributed entity and aspect embeddings. We integrated these objectives using multi-task training over entire documents with sentence granularity. We have introduced a model that uses BLSTMs stacked on top of distributed sentence representations and Huber loss to align the sentences of a document with contextual discourse information. This step was possible without external supervision, because Wikipedia documents provide enough features for the alignment. Furthermore, we showed that by integrating a pre-trained BERT language model as the lowest layer, accuracy improves significantly. We have discussed that our document representation retains the original properties of the semantic entity and aspect spaces, such as nearness measures, robustness against variations and the coverage of long-tail entities. This further helped the model to generalize over previously unseen examples and provides semantic interpretability of the representation vector space.

RQ4. How effective are document representations for retrieving answer passages? We examined the application of our contextual discourse vector representation in an answer retrieval task. We have shown that searching contextualized document representations at the sentence level using cosine similarity achieves significantly higher recall than term-based methods and shows equal or superior performance compared with supervised document re-ranking methods. This is possible because structural document context and long-range dependencies are normally not captured by document matching models. Furthermore, we highlighted that representation-based search is more efficient than models based on deep interaction between query and document. Document representations can be precomputed and cached, so that the complexity of an ad-hoc query is reduced to encoding the query and retrieving k-nearest neighbors from a vector space index. From an in-depth error analysis, we identified the representation of hierarchical, related and overlapping information as a potential cause of errors, because this information is not considered adequately by the cosine similarity measure.
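A minimal sketch of this representation-based retrieval pattern follows: sentence vectors are precomputed and normalized once, and an ad-hoc query is answered by a cosine-similarity k-nearest-neighbor lookup. The brute-force matrix product stands in for a proper vector space index and is an assumption of this illustration.

```python
# Minimal sketch of answer passage retrieval over precomputed sentence vectors.
import numpy as np

class PassageIndex:
    def __init__(self, sentence_vectors, sentences):
        v = np.asarray(sentence_vectors, dtype=np.float32)
        self.vectors = v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-8)  # precomputed, cached
        self.sentences = sentences

    def search(self, query_vector, k=10):
        q = np.asarray(query_vector, dtype=np.float32)
        q = q / (np.linalg.norm(q) + 1e-8)
        scores = self.vectors @ q                 # cosine similarity against all sentences
        top = np.argsort(-scores)[:k]             # k-nearest neighbors
        return [(self.sentences[i], float(scores[i])) for i in top]

# usage: index built once at load time, queried with low latency at query time
index = PassageIndex(np.random.randn(1000, 128), [f"sentence {i}" for i in range(1000)])
print(index.search(np.random.randn(128), k=3))
```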

Put together, our answers to these research questions enable us to build general Neural Machine Reading models that fulfill task-specific information needs across domain-specific text resources. This thesis covers all necessary stages of this process. We have built our contribution around the definition of central information-seeking tasks, the principles of unsupervised language understanding, distributed language representations and sequence learning methods. We have contributed algorithms that solve three central tasks with high accuracy and high error tolerance from self-supervised data or only a few hundred labeled examples. Even though many of the individual problems have been approached by fast-paced concurrent work, no comprehensive solution to document-level Neural Machine Reading has been presented before. This thesis is the first research approach to extend distributed language representations with complementary information about document topics and discourse structure. It closes the gap between symbolic Information Extraction and Information Retrieval by transforming both problems into latent distributed vector space representations. Our models can fulfill domain-specific information needs on large text resources with low latency, suitable for interactive applications.

7.3 Limitations

We have presented a general architecture for Neural Machine Reading that is applicable to a broad range of domain-specific text resources. Like every research project, our approach is subject to a number of limitations, which could be addressed in future work.

Applicability to different writing systems. We could show that end-to-end models work well when transferred to different languages, such as English and German. However, our experiments have been restricted to written languages that use a linear segmented mono-phonemic alphabet, such as Latin script. This is mainly caused by common preprocessing steps, such as tokenization, sentence splitting and character encoding, which rely on the assumption of space-separated character n-grams. These operations are often rule-based and language-specific, although recent NLP libraries cover a large variety of languages. We further cannot assume that the principles of the distributional hypothesis will also hold in languages that use morphemic (e.g. Chinese) or partial phonemic (e.g. Arabic) writing systems. One possible circumvention would be to replace preprocessing steps with compatible methods, such as Chinese word segmentation algorithms [Peng et al., 2004]. This could enable us to train each individual component of the MR model with end-to-end training data of the different writing system.

Preprocessing required for self-supervision. The approaches for self-supervised training introduced in this thesis are not entirely unsupervised. Generating self-supervised training data from external sources such as Wikipedia often requires a fair amount of site-specific structural parsing, e.g. extracting document titles, links, lists and section headings from the HTML source. This is not always possible on domain-specific text resources, because they might not expose this structural information. To better understand this limitation, we have investigated the transfer of models trained from Wikipedia to domain-specific text and demonstrated options to minimize the annotations required for fine-tuning. One possible circumvention is to specifically train models that select an ensemble of hand-written heuristics in order to generate training data from unlabeled corpora with weak supervision [Ratner et al., 2020].

Controlled adjustment of model properties. We could show that our Neural Machine Reading architectures solve three important information-seeking tasks with high accuracy compared to a variety of recent approaches. However, in practice, for example in Web search engines, accuracy measures such as Precision and Recall are not the only objectives. Instead, optimization criteria for search engine result pages are domain-specific and dynamically change between user profiles and from user feedback. Typical objectives include result diversification, freshness, source trust, popularity, etc. [Toms et al., 2005; Chuklin et al., 2013; Li et al., 2015]. In Information Retrieval, these objectives are often achieved by learning personalized decision rules between an ensemble of multiple models by relevance feedback. Additionally, experts often curate hard-coded rules and exceptions to maximize the key performance indicators of their product. An ideal solution would be to enable these adjustments and feedback inside the model itself, or by extending the model with controlled ‘plug&play’ layers [Dathathri et al., 2020]. Although we have shown that our neural MR architecture is highly adaptive to a broad range of tasks and domains, we have not focused on dynamically changing objectives and leave a solution up to future work.

7.4 Business Perspectives

Neural Machine Reading opens up a broad range of perspectives for commercial applications. Most importantly, general and robust MR models accelerate the design of data products which normally require time-intensive research and development. These products are urgently needed to conquer the continuously growing data lake from corporate information resources, social digital communication and the Web in general. It is assumed that by 2025, 80% of worldwide data will be unstructured, with healthcare, manufacturing, financial services and retail being the fastest-growing industries [Reinsel et al., 2018]. Currently, only a small fraction of this data is analyzed, although approximately one quarter could contain valuable information [Gantz and Reinsel, 2012]. For example, according to a 2019 study, only 11% of manufacturing companies consider themselves as ‘data mature’ and only 48% have begun the transformation of the organization towards data-driven processes [Atkinson and Ezell, 2019]. The most prominent barriers for this transformation mentioned in this study are lack of data resources (58%), task-specific implementation challenges (52%), lack of development skill (47%) and interoperability and integration problems (47%). A general-purpose MR model addresses these problems.

Increasing coverage of domain-specific text resources in e-discovery. Electronic discovery (e-discovery) is a legal process in which a party requests the delivery of electronically stored information (ESI) as potentially relevant evidence in a civil lawsuit. Typical search queries include the application of patents, misuse of licensing rights, or collusion. A key challenge to the e-discovery process is to cover a broad range of resources (e.g. corporate documents, e-mail, instant messaging communication, audio recordings, databases, Web sites, images, metadata etc.), to deliver only specific information that is relevant for litigation and, at the same time, to protect sensitive corporate data.

With the growing emergence of new content sources in organizations, a rising number of litigations, and increasing compliance requirements and data protection regulations, e-discovery is expected to gain major traction in the next years. According to a recent study, the global e-discovery market is growing at a compound annual growth rate of 10.0% and is projected to reach over $17.3 billion by 2023 [MarketsandMarkets, 2019]. Discovery accounts for 20–50% of all costs in federal civil litigations [Lee and Willging, 2010]. Therefore, organizations need to proactively prepare for delivery requests from potential lawsuits in order to minimize the cost for manual management. For example, in a past patent dispute of Samsung against Apple Computers, 11 million documents of over 3.6 terabytes were processed, with a total processing cost of over $13 million [Sullivan, 2017]. Today, e-discovery is built upon the Electronic Discovery Reference Model (EDRM), a standard process for information governance, identification, preservation, collection, processing, review, analysis, production and presentation of ESI [Holley et al., 2010]. The main contribution of e-discovery software is to reduce the complexity of the process and strip noise from the data. However, text resources are not deeply processed in these systems, and in particular review and analysis stages still require expensive human labor. Typical solutions, such as Microsoft Office 365 eDiscovery1, IBM StoredIQ Suite2, DISCO3 or Logikcull4, are based on term-based keyword search and metadata filters. Some of these systems apply machine learning approaches for keyword and topic expansion.

Neural Machine Reading supports the e-discovery process fundamentally in the important stages of identification, collection, review and analysis of textual data. It helps to accelerate these stages in order to save costs by providing general language representations, which can be trained end-to-end with self-supervised data and can be queried with low latency. Neural MR representations provide a higher result coverage by effectively increasing recall over large corpora of domain-specific resources, in particular from the long tail. Entity and topical aspect

1https://docs.microsoft.com/en-us/microsoft-365/compliance/ediscovery 2https://www.ibm.com/products/storediq-suite 3https://www.csdisco.com/disco-ediscovery 4https://www.logikcull.com/ representations condense important latent concepts for searching, clustering and visualizing data. Document representations can be used to retrieve short passages from long documents. In summary, Neural MR provides the e-discovery process with in-depth semantic coverage of text resources and guides experts to make review decisions faster and with higher precision.

Detecting trigger events for Supply-Chain Risk Management. Supply-chain risk management (SCRM) describes the strategies to identify, assess, control and monitor unforeseen developments and their effects on a supply chain [Heckmann et al., 2015]. A key process in SCRM is to identify the occurrence of triggering events for a long list of risk factors, such as geopolitical and economic instability, environmental risks, weather events, natural disasters, technical failures, crime, transportation issues, product issues, market uncertainty, corporate transactions, stock market activity and legal issues. For example, pharmaceutical companies must secure multiple weeks of supply for critical drugs and emergency equipment. The manufacturing industry is characterized by just-in-time production and needs to minimize the potential impact of logistic delays and disruptions. Financial services and insurance companies need to assess global economic and environmental risk factors and proactively detect stock market and economic disruptions with short reaction times. For these organizations, it is viable to identify public mentions of these events at their first occurrence. This is possible by screening a broad range of internal and external information resources. The key challenges in event detection are to deliver results with low latency, to achieve high recall without triggering false alarms, and to reduce the number of cases that require manual examination.

In a 2019 survey among procurement leaders, managing risk was the second most important business strategy (55% strong priority), right after reducing costs (70%) [Umbenhauer et al., 2019]. Moreover, 81% of companies in the same survey that have fully implemented digital technologies report that they are not satisfied with their SCRM implementations. More specifically, 65% of procurement leaders have reported limited or no transparency in the supply chain beyond their first-level suppliers [Umbenhauer and Younger, 2018]. Installing automated information-driven processes for SCRM can reduce costs for procurement, inventory management and logistics by identifying risks with short reaction times, assessing risks with high precision and minimizing risks by proactive actions. Today, intelligent analysis systems with global coverage deliver frequent and recurrent data from news outlets, social media, company websites, press releases, stock exchange reports, audit reports or financial reports. The main functions of these systems comprise automatic tagging, de-duplication and scoring of risk trigger events. However, these systems are mainly focused on procurement and logistics sectors and do not deliver sufficient quality of results from more specific domains. Especially small tech companies and startups operate in highly specialized niches and need to detect rare events with high recall.

Neural Machine Reading provides the necessary technology to complement SCRM solutions by enriching and annotating text resources with domain-specific entities, aspects, topics and events. By automatic language understanding, the high complexity of dependencies is reduced to semantic representations that can be utilized by a company to define its specific risk areas with high granularity. Neural MR is specifically important for handling the cases of zero-shot adaptation to unknown sources with sparse and noisy data.
Furthermore, distributed representations allow us to visualize results and accelerate human examination and labeling. In summary, Neural Machine Reading improves SCRM with higher coverage of risk trigger events, lower costs for manual examination and better semantic accessibility for decision makers and creators of risk assessment reports.

Enriching patient representations in electronic healthcare. Clinical Decision Support Systems (CDSS) aim to capture a comprehensive picture of a patient in order to assist clinicians in their choices at the point of care [Berner, 2007]. A key process in CDSS is the integration of multi-modal electronically stored information, such as laboratory test results, vital signs, radiology images and clinical notes into electronic health records (EHR). However, a large fraction of EHRs consists of written notes. These are regularly created and updated by medical professionals and cover the entire trajectory of a patient, including admission (chief complaints, results of physical examination), medical history (surgeries, medications, family and social history), therapy progress, test results and discharge information (diagnosis, medications, prognosis). This longitudinal clinical pathway information is important for differential diagnosis (DDx) and supports doctors in deciding on diagnostic procedures or treatments. Electronic healthcare faces the problem of making these notes machine readable and allowing a CDSS to access rich patient representations. Patient representations allow searching, clustering or even simulating patient trajectories based on similar cohorts from past medical records. This would enable clinicians to make decisions earlier and with the reduced risk of overlooking important cases from the large historic record.

Healthcare annual data production is projected to grow by 36% between 2018 and 2025 [Reinsel et al., 2018]. The market for CDSS is growing at a rate of 9.5% and is projected to reach $2.4 billion by 2027 [Reports and Data, 2020]. Improving patient representations could accelerate the diagnosis of rare diseases, reduce length-of-stay, propose cheaper medication options, decrease test duplication and provide better informed reasoning to the clinicians. Current CDSSs resemble clinical guidelines and are mainly focused on capturing structured data and metadata, such as symptoms and conditions extracted from clinical text, and visualize these findings to allow exploration of cases. Typically, these expert systems allow term-based and metadata search to explore literature and studies or identify cohorts from case reports. However, even large hospitals fail to integrate their own EHRs with these systems. This is partly caused by technological and data protection boundaries, but also because off-the-shelf text mining models fail to adapt to doctors’ specific writing styles.

Neural Machine Reading provides an opportunity to deeply integrate clinical notes with EHRs. Neural patient representations can be inferred from textual data, such as doctor letters and admission notes, and cover a broad range of clinical aspects that are otherwise not contained in structured EHR data and metadata. Vector space representations provide similarity measures that enable precise search for cohorts, similar cases from clinical guidelines and case studies, and research literature. Neural MR models can be trained on public data, fine-tuned with small amounts of anonymized in-domain data, and applied to sensitive personal data in-house. Vector space indexing can help to precompute and access passages from thousands of clinical EHRs with low latency.
In summary, Neural Machine Reading improves patient representations with deep understanding of domain-specific clinical notes and makes it possible to search, cluster and explore EHRs directly based on the information provided by doctors and clinicians.

7.5 Future Work

During our research on Neural Machine Reading for domain-specific text resources, we identified several meta-problems that we had to leave for further investigation. In this section, we briefly discuss the most important questions as a guideline for future work.

Hierarchical knowledge representations. Generalization and specialization are important concepts in human cognition. Therefore, hierarchical knowledge is often modeled explicitly in information management [Saxe et al., 2013]. For example, medical reference knowledge bases, product catalogs or geographical databases are structured into broad categories and multiple levels of finer-grained subcategories. Entries in these data structures are often located in all hierarchy levels, sometimes including associations between multiple overlapping categories. We have shown that the vector space model applied with cosine similarity is error-prone, especially when associations between different nested or overlapping hierarchical levels have to be modeled. Furthermore, vector space queries do not allow basic query composition operators, such as union, intersection and complement. For example, in our experiments queries for very specific cancer types often showed high similarity with generic descriptions of cancer and tumors, and we needed to sharpen or broaden the range of a query. More precisely, the distributional hypothesis succeeds at drawing associations between multiple elements in a nested hierarchy, but it fails to distinguish them and exploit their hierarchical position. One approach to this problem is hierarchical entity embeddings [Hu et al., 2015]. Another promising method is to compute embeddings not in Euclidean but in hyperbolic space, such as the Poincaré ball model [Nickel and Kiela, 2017].
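For illustration, the distance function on the Poincaré ball [Nickel and Kiela, 2017] can be sketched as follows; the numerical example and the epsilon guard are assumptions of this illustration.

```python
# Minimal sketch of the hyperbolic distance on the Poincaré ball:
# d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
import numpy as np

def poincare_distance(u, v, eps=1e-7):
    u, v = np.asarray(u, dtype=np.float64), np.asarray(v, dtype=np.float64)
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq / max(denom, eps))

# points near the origin can act like general concepts, points near the boundary
# like specific ones, which leaves room for hierarchical relations
print(poincare_distance([0.0, 0.1], [0.0, 0.9]))
```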

Understanding real-world data by modeling uncertainty. When applied to real-world problems, probabilistic models will always produce errors based on differences between the data distributions at training and inference time. However, current DNN architectures are not able to report their confidence on a prediction. Extending our neural MR model with uncertainty modeling would enable us to sample and counteract weak points of our models more effectively, for example using active learning. It should be our objective to reduce high-confidence false predictions and reinforce low-confidence true predictions of a model. Furthermore, measuring different aspects of uncertainty is an important requirement for safety-critical operations [Kendall and Gal, 2017]. One promising approach for this problem is Bayesian Deep Learning, which uses the principles of neural variational inference [Paisley et al., 2012; Ranganath et al., 2014] and can be estimated using existing DNN architectures with Monte Carlo dropout sampling [Gal and Ghahramani, 2016].
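The Monte Carlo dropout idea can be sketched as follows: dropout is kept active at inference time and several stochastic forward passes are averaged, with their spread serving as an uncertainty signal. The toy model and the number of passes are assumptions of this illustration.

```python
# Minimal PyTorch sketch of Monte Carlo dropout sampling [Gal and Ghahramani, 2016].
import torch
import torch.nn as nn

def mc_dropout_predict(model, x, passes=20):
    model.train()  # keep dropout layers active during inference
    with torch.no_grad():
        samples = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(passes)])
    mean = samples.mean(dim=0)          # averaged class probabilities
    uncertainty = samples.std(dim=0)    # spread over passes as an uncertainty estimate
    return mean, uncertainty

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 5))
probs, unc = mc_dropout_predict(model, torch.randn(1, 16))
```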

Continual learning from interactive feedback loops. Whenever new data is produced, predictions will deviate from the expected results that the model was optimized for at training time. To counteract this problem, neural MR models need to be retrained from time to time with fresh in-domain data or new tasks. This is possible by introducing new prior distributions from self-supervised examples, or by collecting explicit or implicit feedback from users in an active or passive learning scenario. Sometimes, it is even feasible to make controlled adjustments to a model in order to reduce or introduce certain biases, e.g. balancing recall and precision for business-critical classification outputs. This continual learning process introduces a number of challenges [Parisi et al., 2019; Yogatama et al., 2019]: How can we ensure enough capacity in a network for lifelong learning? What is the optimal curriculum for updating models with new tasks sequentially? How can we prevent catastrophic forgetting of previously learned tasks? A promising approach to this problem is to extend transfer learning with episodic memory modules that allow experience replay of previous examples during training [de Masson d’Autume et al., 2019].
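A minimal sketch of the episodic-memory idea [de Masson d’Autume et al., 2019]: a reservoir-sampled buffer stores a subset of earlier training examples and mixes them into each new-task batch to counteract catastrophic forgetting. The buffer capacity, the sampling strategy and the notion of an "example" are assumptions made for illustration; the published method additionally performs local adaptation at inference time, which is omitted here.

```python
import random

class EpisodicMemory:
    """Reservoir-sampled store of past (input, label) pairs for replay."""
    def __init__(self, capacity=1000, seed=42):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0
        self.rng = random.Random(seed)

    def write(self, example):
        """Insert with reservoir sampling so earlier tasks stay represented."""
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = example

    def sample(self, k):
        """Draw a replay batch of up to k stored examples."""
        return self.rng.sample(self.buffer, min(k, len(self.buffer)))

memory = EpisodicMemory(capacity=3)
stream = [("doc1", "A"), ("doc2", "B"), ("doc3", "C"), ("doc4", "A")]
for step, example in enumerate(stream):
    memory.write(example)
    # During training, each new-task batch would be augmented with replay:
    batch = [example] + memory.sample(2)
    print(step, batch)
```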

Put together, the research directions of hierarchical knowledge representations, uncertainty modeling and continual learning aim at extending Neural Machine Reading with three important properties: MR models need to acquire a deeper understanding of the domain's structural and hierarchical properties. We need to better understand the models' inner workings, uncertainties and limitations. And we require feedback operators to handle dynamic changes in the environment in which the models are applied. With this perspective, Machine Reading models will be able to complement domain-specific language understanding with automatic task understanding based on users' information-seeking behavior.


Bibliography

Abacha, A. B. and Demner-Fushman, D. (2019). “A Question-Entailment Approach to Question Answering”. In: BMC Bioinformatics 20.1, p. 511. Abacha, A. B., Shivade, C., and Demner-Fushman, D. (2019). “Overview of the MEDIQA 2019 Shared Task on Textual Inference, Question Entailment and Question Answering”. In: Pro- ceedings of the 18th BioNLP Workshop and Shared Task, pp. 370–379. Adolphs, P., Theobald, M., Schäfer, U., Uszkoreit, H., and Weikum, G. (2011). “YAGO-QA: Answering Questions by Structured Knowledge Queries”. In: Fifth International Conference on Semantic Computing. IEEE, pp. 158–161. Agarwal, S. and Yu, H. (2009). “Automatically Classifying Sentences in Full-Text Biomedi- cal Articles into Introduction, Methods, Results and Discussion”. In: Bioinformatics 25.23, pp. 3174–3180. Ajjour, Y., Chen, W.-F., Kiesel, J., Wachsmuth, H., and Stein, B. (2017). “Unit Segmentation of Argumentative Texts”. In: Proceedings of the 4th Workshop on Argument Mining, pp. 118–128. Akbik, A., Blythe, D., and Vollgraf, R. (2018). “Contextual String Embeddings for Sequence Labeling”. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638–1649. Alemi, A. A. and Ginsparg, P. (2015). “Text Segmentation Based on Semantic Word Embed- dings”. In: arXiv:1503.05543 [cs.CL]. Allan, J. (2002). “Introduction to Topic Detection and Tracking”. In: Topic Detection and Tracking. Springer, pp. 1–16. AlSumait, L., Barbará, D., and Domeniconi, C. (2008). “On-Line LDA: Adaptive Topic Models for Mining Text Streams with Applications to Topic Detection and Tracking”. In: Eighth IEEE International Conference on Data Mining. IEEE, pp. 3–12. Aone, C., Halverson, L., Hampton, T., and Ramos-Santacruz, M. (1998). “SRA: Description of the IE2 System Used for MUC-7”. In: Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29-May 1, 1998. Appelt, D. E., Hobbs, J. R., Bear, J., Israel, D., Kameyama, M., Martin, D., Myers, K., and Tyson, M. (1995). “SRI International FASTUS System: MUC-6 Test Results and Analysis”. In: Pro- ceedings of the 6th Conference on Message Understanding. ACL, pp. 237–248. Arnold, S., Burke, D., Dörsch, T., Loeber, B., and Lommatzsch, A. (2014). “News Visualization Based on Semantic Knowledge.” In: International Semantic Web Conference (Posters & Demos), pp. 5–8. 126 Bibliography

Arnold, S., Dziuba, R., and Löser, A. (2016a). “TASTY: Interactive Entity Linking As-You- Type”. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, pp. 111–115. Arnold, S., Gers, F. A., Kilias, T., and Löser, A. (2016b). “Robust Named Entity Recognition in Idiosyncratic Domains”. In: arXiv:1608.06757 [cs.CL]. Arnold, S., Löser, A., and Kilias, T. (2015). “Resolving Common Analytical Tasks in Text Databases”. In: Proceedings of the ACM Eighteenth International Workshop on Data Warehousing and OLAP (DOLAP). ACM, pp. 75–84. Arnold, S., Schneider, R., Cudré-Mauroux, P., Gers, F. A., and Löser, A. (2019). “SECTOR: A Neural Model for Coherent Topic Segmentation and Classification”. In: Transactions of the Association for Computational Linguistics 7, pp. 169–184. Arnold, S., van Aken, B., Grundmann, P., Gers, F. A., and Löser, A. (2020). “Learning Contextu- alized Document Representations for Healthcare Answer Retrieval”. In: Proceedings of The Web Conference 2020. ACM, pp. 1332–1343. Arora, S., Li, Y., Liang, Y., Ma, T., and Risteski, A. (2016). “A Latent Variable Model Approach to PMI-Based Word Embeddings”. In: Transactions of the Association for Computational Lin- guistics 4, pp. 385–399. Arora, S., Liang, Y., and Ma, T. (2017). “A Simple but Tough-to-Beat Baseline for Sentence Em- beddings”. In: ICLR 2017: 5th International Conference on Learning Representations. Atkinson, R. D. and Ezell, S. J. (2019). The Manufacturing Evolution: How AI Will Transform Man- ufacturing & the Workforce of the Future. Tech. rep. Manufacturers Alliance for Productivity and Innovation (MAPI) and Information Technology and Innovation Foundation (ITIF). Baker, C. F., Fillmore, C. J., and Lowe, J. B. (1998). “The Berkeley Framenet Project”. In: Proceed- ings of the 17th International Conference on Computational Linguistics-Volume 1. ACL, pp. 86– 90. Barron, J. T. (2019). “A General and Adaptive Robust Loss Function”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4331–4339. Baum, L. E., Petrie, T., Soules, G., and Weiss, N. (1970). “A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains”. In: The Annals of Mathematical Statistics 41.1, pp. 164–171. Bayomi, M., Levacher, K., Ghorab, M. R., and Lawless, S. (2015). “OntoSeg: A Novel Approach to Text Segmentation Using Ontological Similarity”. In: 2015 International Conference on Data Mining Workshop. IEEE, pp. 1274–1283. Beam, A. L., Kompa, B., Schmaltz, A., Fried, I., Weber, G., Palmer, N., Shi, X., Cai, T., and Kohane, I. S. (2018). “Clinical Concept Embeddings Learned from Massive Sources of Mul- timodal Medical Data”. In: Pacific Symposium on Biocomputing. Vol. 25, pp. 295–306. Beeferman, D., Berger, A., and Lafferty, J. (1999). “Statistical Models for Text Segmentation”. In: Machine Learning 34.1, pp. 177–210. Bibliography 127

Bender, O., Och, F. J., and Ney, H. (2003). “Maximum Entropy Models for Named Entity Recog- nition”. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003-Volume 4. ACL, pp. 148–151. Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). “A Neural Probabilistic Language Model”. In: Journal of machine learning research 3.Feb, pp. 1137–1155. Berner, E. S. (2007). Clinical Decision Support Systems. Vol. 233. Springer. Bhatia, S., Lau, J. H., and Baldwin, T. (2016). “Automatic Labelling of Topics with Neural Em- beddings”. In: Proceedings of the 26th International Conference on Computational Linguistics, pp. 953–963. Bikel, D. M., Miller, S., Schwartz, R., and Weischedel, R. (1997). “Nymble: A High-Performance Learning Name-Finder”. In: Proceedings of the Fifth Conference on Applied Natural Language Processing. ACL, pp. 194–201. Black, W. J., Rinaldi, F., and Mowatt, D. (1998). “FACILE: Description of the NE System Used for MUC-7”. In: Seventh Message Understanding Conference (MUC-7): Proceedings of a Confer- ence Held in Fairfax, Virginia, April 29-May 1, 1998. Blei, D. M. (2012). “Probabilistic Topic Models”. In: Communications of the ACM 55.4, pp. 77–84. Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). “Latent Dirichlet Allocation”. In: Journal of Machine Learning Research 3.Jan, pp. 993–1022. Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). “Enriching Word Vectors with Sub- word Information”. In: Transactions of the Association for Computational Linguistics 5, pp. 135– 146. Bouma, G. (2009). “Normalized (Pointwise) Mutual Information in Collocation Extraction”. In: Proceedings of GSCL, pp. 31–40. Brants, S., Dipper, S., Hansen, S., Lezius, W., and Smith, G. (2002). “The TIGER Treebank”. In: Proceedings of the Workshop on Treebanks and Linguistic Theories. Vol. 168. Bullinaria, J. A. and Levy, J. P. (2007). “Extracting Semantic Representations from Word Co- Occurrence Statistics: A Computational Study”. In: Behavior Research Methods 39.3, pp. 510– 526. Bunescu, R. and Pasca, M. (2006). “Using Encyclopedic Knowledge for Named Entity Disam- biguation”. In: Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics. Caruana, R. (1997). “Multitask Learning”. In: Machine Learning 28.1, pp. 41–75. Castro, D. and New, J. (2016). The Promise of Artificial Intelligence. Tech. rep. Center for Data Innovation, pp. 1–48. Chakrabarti, K., Chaudhuri, S., Cheng, T., and Xin, D. (2012). “A Framework for Robust Dis- covery of Entity Synonyms”. In: Proceedings of the 18th ACM SIGKDD International Confer- ence on Knowledge Discovery and Data Mining, pp. 1384–1392. Chen, H., Branavan, S. R. K., Barzilay, R., and Karger, D. R. (2009). “Global Models of Docu- ment Structure Using Latent Permutations”. In: Proceedings of Human Language Technologies: 128 Bibliography

The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. ACL, pp. 371–379. Chen, Z., Tamang, S., Lee, A., Li, X., Lin, W.-P., Snover, M. G., Artiles, J., Passantino, M., and Ji, H. (2010). “CUNY-BLENDER TAC-KBP2010 Entity Linking and Slot Filling System De- scription.” In: Text Analysis Conference. Cheng, G. Y. (2004). “A Study of Clinical Questions Posed by Hospital Clinicians”. In: Journal of the Medical Library Association 92.4, pp. 445–458. Cheng, T., Lauw, H. W., and Paparizos, S. (2011). “Entity Synonyms for Structured Web Search”. In: IEEE Transactions on Knowledge and Data Engineering 24.10, pp. 1862–1875. Chieu, H. L. and Ng, H. T. (2002). “Named Entity Recognition: A Maximum Entropy Approach Using Global Information”. In: Proceedings of the 19th International Conference on Computa- tional Linguistics-Volume 1. ACL, pp. 1–7. Chinchor, N. and Robinson, P. (1997). “MUC-7 Named Entity Task Definition”. In: Proceedings of the 7th Conference on Message Understanding. Vol. 29, pp. 1–21. Chiu, J. P. and Nichols, E. (2016). “Named Entity Recognition with Bidirectional LSTM-CNNs”. In: Transactions of the Association for Computational Linguistics 4, pp. 357–370. Choi, F. Y. Y. (2000). “Advances in Domain Independent Linear Text Segmentation”. In: Pro- ceedings of the 1st North American Chapter of the Association for Computational Linguistics Con- ference. ACL, pp. 26–33. Chomsky, N. (1965). Aspects of the Theory of Syntax. Cambridge. Chuklin, A., Serdyukov, P., and De Rijke, M. (2013). “Using Intent Information to Model User Behavior in Diversified Search”. In: European Conference on Information Retrieval. Springer, pp. 1–13. Cimiano, P., Schultz, A., Sizov, S., Sorg, P., and Staab, S. (2009). “Explicit versus Latent Concept Models for Cross-Language Information Retrieval”. In: Proceedings of the 21st International Joint Conference on Artifical Intelligence. Vol. 9, pp. 1513–1518. Cohan, A., Dernoncourt, F., Kim, D. S., Bui, T., Kim, S., Chang, W., and Goharian, N. (2018). “A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Technologies, pp. 615–621. Cohen, D., Yang, L., and Croft, W. B. (2018). “WikiPassageQA: A Benchmark Collection for Research on Non-Factoid Answer Passage Retrieval”. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, pp. 1165–1168. Collins, M. and Singer, Y. (1999). “Unsupervised Models for Named Entity Classification”. In: 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). “Nat- ural Language Processing (Almost) from Scratch”. In: Journal of Machine Learning Research 12.Aug, pp. 2493–2537. Bibliography 129

Conneau, A., Schwenk, H., Barrault, L., and Lecun, Y. (2017). “Very Deep Convolutional Net- works for Text Classification”. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. Vol. 1. ACL, pp. 1107–1116. Cornolti, M., Ferragina, P., and Ciaramita, M. (2013). “A Framework for Benchmarking Entity- Annotation Systems”. In: Proceedings of the 22nd International Conference on World Wide Web. ACM, pp. 249–260. Cucerzan, S. (2007). “Large-Scale Named Entity Disambiguation Based on Wikipedia Data”. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 708–716. Cunningham, H., Maynard, D., Bontcheva, K., Maynard, H., and Tablan, V (2002). “GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Ap- plications”. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. ACL, pp. 168–175. Curran, J. R. and Clark, S. (2003). “Language Independent NER Using a Maximum Entropy Tagger”. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 164–167. Dai, Z., Xiong, C., Callan, J., and Liu, Z. (2018). “Convolutional Neural Networks for Soft- Matching n-Grams in Ad-Hoc Search”. In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, pp. 126–134. Dathathri, S., Madotto, A., Lan, J., Hung, J., Frank, E., Molino, P., Yosinski, J., and Liu, R. (2020). “Plug and Play Language Models: A Simple Approach to Controlled Text Generation”. In: International Conference on Learning Representations. de Masson d’Autume, C., Ruder, S., Kong, L., and Yogatama, D. (2019). “Episodic Memory in Lifelong Language Learning”. In: Advances in Neural Information Processing Systems, pp. 13122– 13131. de Saussure, F. (1916). Cours de Linguistique Générale. Paris: Payot. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). “Index- ing by Latent Semantic Analysis”. In: Journal of the American Society for Information Science 41.6, pp. 391–407. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). “BERT: Pre-Training of Deep Bidi- rectional Transformers for Language Understanding”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Vol. 1, pp. 4171–4186. Dias, G., Alves, E., and Lopes, J. G. P. (2007). “Topic Segmentation Algorithms for Text Sum- marization and Passage Retrieval: An Exhaustive Evaluation”. In: Proceedings of the Twenty- Second AAAI Conference on Artificial Intelligence. Vol. 7, pp. 1334–1340. Dieng, A. B., Wang, C., Gao, J., and Paisley, J. (2017). “TopicRNN: A Recurrent Neural Net- work with Long-Range Semantic Dependency”. In: 5th International Conference on Learning Representations. 130 Bibliography

Dojchinovski, M. and Kliegr, T. (2013). “Entityclassifier.eu: Real-Time Classification of Enti- ties in Text with Wikipedia”. In: Machine Learning and Knowledge Discovery in Databases. Springer, pp. 654–658. Dredze, M., McNamee, P., Rao, D., Gerber, A., and Finin, T. (2010). “Entity Disambiguation for Knowledge Base Population”. In: Proceedings of the 23rd International Conference on Compu- tational Linguistics. ACL, pp. 277–285. Du, L., Buntine, W., and Johnson, M. (2013). “Topic Segmentation with a Structured Topic Model”. In: Proceedings of the 2013 Conference of the North American Chapter of the Associa- tion for Computational Linguistics: Human Language Technologies, pp. 190–200. Durrett, G. and Klein, D. (2014). “A Joint Model for Entity Analysis: Coreference, Typing, and Linking”. In: Transactions of the Association for Computational Linguistics 2, pp. 477–490. Eisenstein, J. and Barzilay, R. (2008). “Bayesian Unsupervised Topic Segmentation”. In: Pro- ceedings of the 2008 Conference on Empirical Methods in Natural Language Processing. ACL, pp. 334–343. Etzioni, O., Banko, M., and Cafarella, M. J. (2006). “Machine Reading”. In: AAAI. Vol. 6, pp. 1517– 1519. Etzioni, O., Cafarella, M., Downey, D., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D. S., and Yates, A. (2005). “Unsupervised Named-Entity Extraction from the Web: An Experi- mental Study”. In: Artificial intelligence 165.1, pp. 91–134. Faro, S. and Lecroq, T. (2009). “Efficient Variants of the Backward-Oracle-Matching Algorithm”. In: International Journal of Foundations of Computer Science 20.06, pp. 967–984. Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. MIT Press. Ferragina, P. and Scaiella, U. (2010). “TAGME: On-the-Fly Annotation of Short Text Fragments (by Wikipedia Entities)”. In: Proceedings of the 19th ACM International Conference on Informa- tion and Knowledge Management. ACM, pp. 1625–1628. Ferrucci, D. and Lally, A. (2004). “UIMA: An Architectural Approach to Unstructured Informa- tion Processing in the Corporate Research Environment”. In: Natural Language Engineering 10.3-4, pp. 327–348. Firth, J. R. (1957). “A Synopsis of Linguistic Theory, 1930-1955”. In: Studies in Linguistic Analysis. Forney, G. D. (1973). “The Viterbi Algorithm”. In: Proceedings of the IEEE 61.3, pp. 268–278. Francis-Landau, M., Durrett, G., and Klein, D. (2016). “Capturing Semantic Similarity for En- tity Linking with Convolutional Neural Networks”. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1256–1261. Fujiwara, T., Yamamoto, Y., Kim, J.-D., Buske, O., and Takagi, T. (2018). “PubCaseFinder: A Case-Report-Based, Phenotype-Driven Differential-Diagnosis System for Rare Diseases”. In: The American Journal of Human Genetics 103.3, pp. 389–399. Furnas, G. W., Landauer, T. K., Gomez, L. M., and Dumais, S. T. (1987). “The Vocabulary Prob- lem in Human-System Communication”. In: Communications of the ACM 30.11, pp. 964–971. Bibliography 131

Gabrilovich, E. and Markovitch, S. (2007). “Computing Semantic Relatedness Using Wikipedia- Based Explicit Semantic Analysis”. In: Proceedings of the Twentieth International Joint Confer- ence on Artificial Intelligence, pp. 1606–1611. Gal, Y. and Ghahramani, Z. (2016). “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning”. In: International Conference on Machine Learning, pp. 1050–1059. Gantz, J. and Reinsel, D. (2012). The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. Tech. rep. 1414. IDC, pp. 1–16. Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. (2017). “Convolutional Se- quence to Sequence Learning”. In: Proceedings of the 34th International Conference on Machine Learning. Vol. 70, pp. 1243–1252. Gers, F. A., Schmidhuber, J., and Cummins, F. (2000). “Learning to Forget: Continual Prediction with LSTM”. In: Neural Compututation 12.10, pp. 2451–2471. Gillick, D., Kulkarni, S., Lansing, L., Presta, A., Baldridge, J., Ie, E., and Garcia-Olano, D. (2019). “Learning Dense Representations for Entity Retrieval”. In: Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL). ACL, pp. 528–537. Gillick, D., Presta, A., and Tomar, G. S. (2018). “End-to-End Retrieval in Continuous Space”. In: arXiv:1811.08008 [cs.IR]. Glavaš, G., Nanni, F., and Ponzetto, S. P. (2016). “Unsupervised Text Segmentation Using Se- mantic Relatedness Graphs”. In: Proceedings of the Fifth Joint Conference on Lexical and Com- putational Semantics. ACL, pp. 125–130. Goldberg, Y. (2016). “A Primer on Neural Network Models for Natural Language Processing”. In: Journal of Artificial Intelligence Research 57, pp. 345–420. Gorman, P. N., Ash, J., and Wykoff, L. (1994). “Can Primary Care Physicians’ Questions Be An- swered Using the Medical Journal Literature?” In: Bulletin of the Medical Library Association 82.2. Graves, A. (2012). Supervised Sequence Labelling with Recurrent Neural Networks. Vol. 385. Berlin Heidelberg: Springer. Grishman, R. and Sundheim, B. M. (1996). “Message Understanding Conference-6: A Brief History”. In: The 16th International Conference on Computational Linguistics. Guo, J., Fan, Y., Ai, Q., and Croft, W. B. (2016). “A Deep Relevance Matching Model for Ad- Hoc Retrieval”. In: Proceedings of the 25th ACM International Conference on Information and Knowledge Management. ACM, pp. 55–64. Guo, J., Fan, Y., Ji, X., and Cheng, X. (2019). “MatchZoo: A Learning, Practicing, and Develop- ing System for Neural Text Matching”. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, pp. 1297–1300. Gupta, N., Singh, S., and Roth, D. (2017). “Entity Linking via Joint Encoding of Types, Descrip- tions, and Context”. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2681–2690. 132 Bibliography

Gutmann, M. U. and Hyvärinen, A. (2012). “Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics”. In: Journal of Machine Learning Research 13.Feb, pp. 307–361. Hachey, B., Radford, W., Nothman, J., Honnibal, M., and Curran, J. R. (2013). “Evaluating En- tity Linking with Wikipedia”. In: Artificial Intelligence 194, pp. 130–150. Han, X. and Zhao, J. (2009). “NLPR_KBP in TAC 2009 KBP Track: A Two-Stage Method to Entity Linking.” In: Text Analysis Conference. Hanauer, D. A., Mei, Q., Law, J., Khanna, R., and Zheng, K. (2015). “Supporting Informa- tion Retrieval from Electronic Health Records: A Report of University of Michigan’s Nine- Year Experience in Developing and Using the Electronic Medical Record Search Engine (EMERSE)”. In: Journal of Biomedical Informatics 55, pp. 290–300. Harris, Z. S. (1954). “Distributional Structure”. In: WORD 10.2-3, pp. 146–162. He, Z., Liu, S., Li, M., Zhou, M., Zhang, L., and Wang, H. (2013). “Learning Entity Representa- tion for Entity Disambiguation”. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Vol. 2 (Short Papers), pp. 30–34. Hearst, M. A. (1997). “TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages”. In: Computational Linguistics 23.1, pp. 33–64. Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., and Scholkopf, B. (1998). “Support Vector Machines”. In: IEEE Intelligent Systems and their Applications 13.4, pp. 18–28. Heckmann, I., Comes, T., and Nickel, S. (2015). “A Critical Review on Supply Chain Risk– Definition, Measure and Modeling”. In: Omega 52, pp. 119–132. Herbrich, R. (2000). “Large Margin Rank Boundaries for Ordinal Regression”. In: Advances in Large Margin Classifiers, pp. 115–132. Hermann, K. M., Kocisky, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., and Blun- som, P. (2015). “Teaching Machines to Read and Comprehend”. In: Advances in Neural In- formation Processing Systems, pp. 1693–1701. Hewlett, D., Lacoste, A., Jones, L., Polosukhin, I., Fandrianto, A., Han, J., Kelcey, M., and Berth- elot, D. (2016). “WikiReading: A Novel Large-Scale Language Understanding Task over Wikipedia”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Vol. 1. ACL, pp. 1535–1545. Hill, F., Cho, K., and Korhonen, A. (2016). “Learning Distributed Representations of Sentences from Unlabelled Data”. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1367–1377. Hirschberg, J. and Manning, C. D. (2015). “Advances in Natural Language Processing”. In: Science 349.6245, pp. 261–266. Hoa T. Le, Cerisara, C., and Denis, A. (2018). “Do Convolutional Networks Need to Be Deep for Text Classification?” In: Association for the Advancement of Artificial Intelligence 2018 Workshop on Affective Content Analysis, pp. 29–36. Hochreiter, S. and Schmidhuber, J. (1997). “Long Short-Term Memory”. In: Neural Computation 9.8, pp. 1735–1780. Bibliography 133

Hoffart, J., Seufert, S., Nguyen, D. B., Theobald, M., and Weikum, G. (2012). “KORE: Keyphrase Overlap Relatedness for Entity Disambiguation”. In: Proceedings of the 21st ACM Interna- tional Conference on Information and Knowledge Management. ACM, pp. 545–554. Hoffart, J., Suchanek, F. M., Berberich, K., and Weikum, G. (2013). “YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia”. In: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence. AAAI Press, pp. 3161–3165. Hoffart, J., Yosef, M. A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., Taneva, B., Thater, S., and Weikum, G. (2011). “Robust Disambiguation of Named Entities in Text”. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. ACL, pp. 782–792. Holley, J. O., Luehr, P. H., Smith, J. R., and Schwerha IV, J. J. (2010). “Electronic Discovery”. In: Handbook of Digital Forensics and Investigation. Elsevier, pp. 63–133. Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., and Weischedel, R. (2006). “OntoNotes: The 90% Solution”. In: Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pp. 57–60. Hu, B., Lu, Z., Li, H., and Chen, Q. (2014). “Convolutional Neural Network Architectures for Matching Natural Language Sentences”. In: Advances in Neural Information Processing Sys- tems, pp. 2042–2050. Hu, Z., Huang, P., Deng, Y., Gao, Y., and Xing, E. (2015). “Entity Hierarchy Embedding”. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Vol. 1, pp. 1292–1300. Huang, P.-S., He, X., Gao, J., Deng, L., Acero, A., and Heck, L. (2013). “Learning Deep Struc- tured Semantic Models for Web Search Using Clickthrough Data”. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, pp. 2333–2338. Huang, X., Peng, F., Schuurmans, D., Cercone, N., and Robertson, S. E. (2003). “Applying Ma- chine Learning to Text Segmentation for Information Retrieval”. In: Information Retrieval 6.3-4, pp. 333–362. Huang, X., Lin, J., and Demner-Fushman, D. (2006). “Evaluation of PICO as a Knowledge Representation for Clinical Questions”. In: AMIA Annual Symposium Proceedings. Vol. 2006. AMIA, pp. 359–363. Huang, Z., Xu, W., and Yu, K. (2015). “Bidirectional LSTM-CRF Models for Sequence Tagging”. In: arXiv:1508.01991 [cs.CL]. Huber, P. J. (1992). “Robust Estimation of a Location Parameter”. In: Breakthroughs in Statistics. Springer, pp. 492–518. Humphreys, K., Gaizauskas, R., Azzam, S., Huyck, C., Mitchell, B., Cunningham, H., and Wilks, Y. (1998). “University of Sheffield: Description of the LaSIE-II System as Used for MUC-7”. In: Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29-May 1, 1998. Inmon, W. H., Linstedt, D., and Levins, M. (2019). Data Architecture: A Primer for the Data Scien- tist: A Primer for the Data Scientist. Academic Press. 134 Bibliography

INSERM (1997). Orphanet: An Online Database of Rare Diseases and Orphan Drugs. Tech. rep. Copyright, INSERM. Jeong, M. and Titov, I. (2010). “Multi-Document Topic Segmentation”. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management. ACM, pp. 1119– 1128. Ji, H. and Grishman, R. (2011). “Knowledge Base Population: Successful Approaches and Chal- lenges”. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguis- tics: Human Language Technologies-Volume 1. ACL, pp. 1148–1158. Jiang, B. (2012). “Head/Tail Breaks: A New Classification Scheme for Data with a Heavy-Tailed Distribution”. In: The Professional Geographer 65.3, pp. 482–494. Jin, H., Schwartz, R., Sista, S., and Walls, F. (1999). “Topic Tracking for Radio, TV Broadcast and Newswire”. In: Proceedings of the DARPA Broadcast News Workshop. Morgan Kaufmann, pp. 199–204. Jin, Q., Dhingra, B., Liu, Z., Cohen, W., and Lu, X. (2019). “PubMedQA: A Dataset for Biomed- ical Research Question Answering”. In: Proceedings of the 2019 Conference on Empirical Meth- ods in Natural Language Processing and the 9th International Joint Conference on Natural Lan- guage Processing, pp. 2567–2577. Joachims, T. (1998). “Text Categorization with Support Vector Machines: Learning with Many Relevant Features”. In: European Conference on Machine Learning. Springer, pp. 137–142. Jones, S. K. (1972). “A Statistical Interpretation of Term Specificity and Its Application to Re- trieval”. In: Journal of Documentation 28.1, pp. 11–21. Jurafsky, D. and Martin, J. H. (2019). Speech and Language Processing – An Introduction to Natu- ral Language Processing, Computational Linguistics, and Speech Recognition. Vol. Third Edition draft. Karttunen, L. (1976). “Discourse Referents”. In: Notes from the Linguistic Underground. Brill, pp. 363–385. Keikha, M., Park, J. H., Croft, W. B., and Sanderson, M. (2014). “Retrieving Passages and Finding Answers”. In: Proceedings of the 2014 Australasian Document Computing Symposium. ACM, pp. 81–84. Kendall, A. and Gal, Y. (2017). “What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?” In: Advances in Neural Information Processing Systems, pp. 5574–5584. Kenter, T. and de Rijke, M. (2015). “Short Text Similarity with Word Embeddings”. In: Pro- ceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, pp. 1411–1420. Kim, J.-D., Ohta, T., Tateisi, Y., and Tsujii, J. (Mar. 2003). “GENIA Corpus – a Semantically Annotated Corpus for Bio-Textmining”. In: Bioinformatics 19.suppl 1, pp. i180–i182. Kim, J.-D., Ohta, T., Tsuruoka, Y., Tateisi, Y., and Collier, N. (2004). “Introduction to the Bio- Entity Recognition Task at JNLPBA”. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and Its Applications, pp. 70–75. Bibliography 135

Kim, Y. (2014). “Convolutional Neural Networks for Sentence Classification”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1746–1751. Kingma, D. and Ba, J. (2015). “ADAM: A Method for Stochastic Optimization”. In: 3rd Interna- tional Conference on Learning Representations. Kingsbury, P. and Palmer, M. (2002). “From TreeBank to PropBank”. In: Language Resources and Evaluation, pp. 1989–1993. Koshorek, O., Cohen, A., Mor, N., Rotman, M., and Berant, J. (2018). “Text Segmentation as a Supervised Learning Task”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Vol. 2, pp. 469–473. Krishnan, V. and Manning, C. D. (2006). “An Effective Two-Stage Model for Exploiting Non- Local Dependencies in Named Entity Recognition”. In: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Com- putational Linguistics. ACL, pp. 1121–1128. Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). “ImageNet Classification with Deep Convolutional Neural Networks”. In: Advances in Neural Information Processing Systems, pp. 1097–1105. Krupka, G. and Hausman, K. (1998). “IsoQuest Inc.: Description of the NetOwl™ Extractor System as Used for MUC-7”. In: Seventh Message Understanding Conference (MUC-7): Pro- ceedings of a Conference Held in Fairfax, Virginia. Kuhlthau, C. C. (1991). “Inside the Search Process: Information Seeking from the User’s Per- spective”. In: Journal of the American Society for Information Science 42.5, pp. 361–371. Kumaran, G. and Allan, J. (2004). “Text Classification and Named Entities for New Event De- tection”. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, pp. 297–304. Kümmel, C. (2018). “Learning a Sampling Strategy for Named Entity Recognition”. Master Thesis. Berlin: Beuth Hochschule für Technik. Lafferty, J., McCallum, A., and Pereira, F. C. (2001). “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data”. In: Proceedings of the 18th International Conference on Machine Learning 2001, pp. 282–289. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., and Dyer, C. (2016). “Neural Ar- chitectures for Named Entity Recognition”. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technolo- gies, pp. 260–270. Landauer, T. K. and Dumais, S. T. (1997). “A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge.” In: Psycho- logical review 104.2. Le, Q. V. and Mikolov, T. (2014). “Distributed Representations of Sentences and Documents”. In: Proceedings of the 31st International Conference on Machine Learning. Vol. 32, pp. 1188–1196. 136 Bibliography

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1989). “Backpropagation Applied to Handwritten Zip Code Recognition”. In: Neural Computation 1.4, pp. 541–551. LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., and Huang, F. (2006). “A Tutorial on Energy- Based Learning”. In: Predicting Structured Data. Vol. 1. MIT Press. Lee, E. G. and Willging, T. E. (2010). “Defining the Problem of Cost in Federal Civil Litigation”. In: Duke Law Journal 60.3, pp. 765–788. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., and Kang, J. (2019). “BioBERT: A Pre- Trained Biomedical Language Representation Model for Biomedical Text Mining”. In: Bioin- formatics, pp. 1–7. Leetaru, K. and Schrodt, P. A. (2013). “GDELT: Global Data on Events, Location, and Tone, 1979–2012”. In: ISA Annual . Vol. 2, pp. 1–49. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P. N., Hellmann, S., Morsey, M., Van Kleef, P., Auer, S., and Bizer, C. (2015). “DBpedia – a Large-Scale, Multilin- gual Knowledge Base Extracted from Wikipedia”. In: Semantic Web 6.2, pp. 167–195. Li, J., Sun, A., Han, J., and Li, C. (2020). “A Survey on Deep Learning for Named Entity Recog- nition”. In: IEEE Transactions on Knowledge and Data Engineering. Li, L., Chen, S., Kleban, J., and Gupta, A. (2015). “Counterfactual Estimation and Optimization of Click Metrics in Search Engines: A Case Study”. In: Proceedings of the 24th International Conference on World Wide Web, pp. 929–934. Ling, X., Singh, S., and Weld, D. S. (2015). “Design Challenges for Entity Linking”. In: Transac- tions of the Association for Computational Linguistics 3, pp. 315–328. Ling, X. and Weld, D. S. (2012). “Fine-Grained Entity Recognition”. In: Twenty-Sixth AAAI Con- ference on Artificial Intelligence. Lipton, Z. C. and Berkowitz, J. (2015). “A Critical Review of Recurrent Neural Networks for Sequence Learning”. In: arXiv:1506.00019 [cs.LG]. Liu, J., Ren, X., Shang, J., Cassidy, T., Voss, C. R., and Han, J. (2016). “Representing Documents via Latent Keyphrase Inference”. In: Proceedings of the 25th International Conference on World Wide Web, pp. 1057–1067. Liu, X. and Croft, W. B. (2002). “Passage Retrieval Based on Language Models”. In: Proceed- ings of the Eleventh International Conference on Information and Knowledge Management. ACM, pp. 375–382. Logeswaran, L., Chang, M.-W., Lee, K., Toutanova, K., Devlin, J., and Lee, H. (2019). “Zero-Shot Entity Linking by Reading Entity Descriptions”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3449–3460. Loper, E. and Bird, S. (2002). “NLTK: The Natural Language Toolkit”. In: Proceedings of the ACL- 02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pp. 63–70. Löser, A., Arnold, S., and Fiehn, T. (2012). “The GoOLAP Fact Retrieval Framework”. In: Busi- ness Intelligence. Springer, pp. 84–97. Bibliography 137

Loshchilov, I. and Hutter, F. (2017). “Decoupled Weight Decay Regularization”. In: International Conference on Learning Representations. Lund, K. and Burgess, C. (1996). “Producing High-Dimensional Semantic Spaces from Lexical Co-Occurrence”. In: Behavior Research Methods, Instruments, & Computers 28.2, pp. 203–208. Ma, X. and Hovy, E. (2016). “End-to-End Sequence Labeling via Bi-Directional LSTM-CNNs- CRF”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguis- tics. Vol. 1. ACL, pp. 1064–1074. MacAvaney, S., Yates, A., Cohan, A., Soldaini, L., Hui, K., Goharian, N., and Frieder, O. (2018). “Characterizing Question Facets for Complex Answer Retrieval”. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM Press, pp. 1205– 1208. Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press. Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., and McClosky, D. (2014). “The Stanford CoreNLP Natural Language Processing Toolkit”. In: ACL System Demonstrations, pp. 55–60. Maqsud, U., Arnold, S., Hülfenhaus, M., and Akbik, A. (2014). “Nerdle: Topic-Specific Ques- tion Answering Using Wikia Seeds”. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations, pp. 81–85. Marchionini, G. (2006). “Exploratory Search: From Finding to Understanding”. In: CACM 49.4, pp. 41–46. Marcus, M., Santorini, B., and Marcinkiewicz, M. A. (1993). “Building a Large Annotated Cor- pus of English: The Penn Treebank”. In: Computational Linguistics 19.2. MarketsandMarkets (2019). eDiscovery Market by Component (Software (Processing, Review and Analysis, Identification, Preservation and Collection, and Production and Presentation) and Ser- vices), Deployment Type, Organization Size, Vertical, and Region - Global Forecast to 2023. Tech. rep. MarketsandMarkets. McCallum, A., Freitag, D., and Pereira, F. C. (2000). “Maximum Entropy Markov Models for Information Extraction and Segmentation.” In: Proceedings of the Seventeenth International Conference on Machine Learning, pp. 591–598. McCallum, A. and Li, W. (2003). “Early Results for Named Entity Recognition with Condi- tional Random Fields, Feature Induction and Web-Enhanced Lexicons”. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003. Stroudsburg, PA, USA, pp. 188–191. McNamee, P.and Mayfield, J. (2002). “Entity Extraction without Language-Specific Resources”. In: Proceedings of the 6th Conference on Natural Language Learning-Volume 20. ACL, pp. 1–4. Mehlitz, R. (2019). “Deep-Learning für das Erkennen und Verlinken in Texten für die Automo- bilbranche”. German. Master thesis. Berlin: Beuth Hochschule für Technik. 138 Bibliography

Mendes, P. N., Jakob, M., Garcia-Silva, A., and Bizer, C. (2011). “DBpedia Spotlight: Shedding Light on the Web of Documents”. In: Proceedings of the 7th International Conference on Seman- tic Systems. ACM, pp. 1–8. Mihalcea, R. and Csomai, A. (2007). “Wikify! Linking Documents to Encyclopedic Knowl- edge”. In: Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowl- edge Management, pp. 233–242. Mikheev, A., Moens, M., and Grover, C. (1999). “Named Entity Recognition without Gazetteers”. In: Proceedings of the Ninth Conference on European Chapter of the Association for Computational Linguistics. ACL, pp. 1–8. Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). “Efficient Estimation of Word Repre- sentations in Vector Space”. In: International Conference on Learning Representations. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). “Distributed Rep- resentations of Words and Phrases and Their Compositionality”. In: Advances in Neural Information Processing Systems, pp. 3111–3119. Milne, D. and Witten, I. H. (2008). “Learning to Link with Wikipedia”. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 509–518. Mitchell, A., Strassel, S., Huang, S., and Zakhary, R. (2005). ACE 2004 Multilingual Training Corpus. https://catalog.ldc.upenn.edu/LDC2005T09. (accessed 2016-05-23). Mitchell, J. and Lapata, M. (2010). “Composition in Distributional Models of Semantics”. In: Cognitive Science 34.8, pp. 1388–1429. Mitra, B. and Craswell, N. (2018). “An Introduction to Neural Information Retrieval”. In: Foun- dations and Trends in Information Retrieval 13.1, pp. 1–126. Mitra, B., Diaz, F., and Craswell, N. (2017). “Learning to Match Using Local and Distributed Representations of Text for Web Search”. In: Proceedings of the 26th International Conference on World Wide Web. IW3C2, pp. 1291–1299. Moro, A., Raganato, A., and Navigli, R. (2014). “Entity Linking Meets Word Sense Disambigua- tion: A Unified Approach”. In: Transactions of the Association for Computational Linguistics 2, pp. 231–244. Nadeau, D., Turney, P. D., and Matwin, S. (2006). “Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity”. In: Conference of the Canadian Society for Computational Studies of Intelligence. Springer, pp. 266–277. Naili, M., Chaïbi, A. H., and Ghézala, H. H. B. (2017). “Comparative Study of Word Embed- ding Methods in Topic Segmentation”. In: Proceedings of the 21st International Conference Knowledge-Based and Intelligent Information & Engineering Systems. Vol. 112. Procedia Com- puter Science. Elsevier, pp. 340–349. Nair, V. and Hinton, G. E. (2010). “Rectified Linear Units Improve Restricted Boltzmann Ma- chines”. In: Proceedings of the 27th International Conference on International Conference on Ma- chine Learning, pp. 807–814. Bibliography 139

Nanni, F., Mitra, B., Magnusson, M., and Dietz, L. (2017). “Benchmark for Complex Answer Retrieval”. In: Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval. ACM, pp. 293–296. Nanni, F., Ponzetto, S. P., and Dietz, L. (2018). “Entity-Aspect Linking: Providing Fine-Grained Semantics of Entities in Context”. In: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries. ACM, pp. 49–58. Navigli, R. and Ponzetto, S. P. (2012). “BabelNet: The Automatic Construction, Evaluation and Application of a Wide-Coverage Multilingual Semantic Network”. In: Artificial Intelligence 193, pp. 217–250. Newman, M. E. J. (2006). “Finding Community Structure in Networks Using the Eigenvectors of Matrices”. In: Physical Review E 74.3, p. 036104. Nickel, M. and Kiela, D. (2017). “Poincaré Embeddings for Learning Hierarchical Representa- tions”. In: Advances in Neural Information Processing Systems, pp. 6338–6347. Nivre, J., De Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajic, J., Manning, C. D., McDonald, R., Petrov, S., Pyysalo, S., Silveira, N., Tsarfaty, R., and Zeman, D. (2016). “Universal Depen- dencies v1: A Multilingual Treebank Collection”. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp. 1659–1666. O’Connor, J. (1980). “Answer-Passage Retrieval by Text Searching”. In: Journal of the American Society for Information Science 31.4, pp. 227–239. Onal, K. D., Zhang, Y., Altingovde, I. S., Rahman, M. M., Karagoz, P., Braylan, A., Dang, B., Chang, H.-L., Kim, H., McNamara, Q., Angert, A., Banner, E., Khetan, V., McDonnell, T., Nguyen, A. T., Xu, D., Wallace, B. C., de Rijke, M., and Lease, M. (2018). “Neural Infor- mation Retrieval: At the End of the Early Years”. In: Information Retrieval Journal 21.2-3, pp. 111–182. Paisley, J., Blei, D. M., and Jordan, M. I. (2012). “Variational Bayesian Inference with Stochastic Search”. In: Proceedings of the 29th International Conference on Machine Learning, pp. 1363– 1370. Palangi, H., Deng, L., Shen, Y., Gao, J., He, X., Chen, J., Song, X., and Ward, R. (2016). “Deep Sentence Embedding Using Long Short-Term Memory Networks: Analysis and Applica- tion to Information Retrieval”. In: IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 24.4, pp. 694–707. Palatucci, M., Pomerleau, D., Hinton, G. E., and Mitchell, T. M. (2009). “Zero-Shot Learn- ing with Semantic Output Codes”. In: Advances in Neural Information Processing Systems, pp. 1410–1418. Pang, L., Lan, Y., Guo, J., Xu, J., Wan, S., and Cheng, X. (2016). “Text Matching as Image Recog- nition”. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 2793– 2799. Papaioannou, J.-M., Arnold, S., Gers, F. A., Löser, A., Mayrdorfer, M., and Budde, K. (2020). “Aspect-Based Passage Retrieval with Contextualized Discourse Vectors”. In: [IN SUBMIS- SION] COLING 2020 System Demonstrations. 140 Bibliography

Pappu, A., Blanco, R., Mehdad, Y., Stent, A., and Thadani, K. (2017). “Lightweight Multilingual Entity Extraction and Linking”. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM Press, pp. 365–374. Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. (2019). “Continual Lifelong Learning with Neural Networks: A Review”. In: Neural Networks 113, pp. 54–71. Pascanu, R., Mikolov, T., and Bengio, Y. (2013). “On the Difficulty of Training Recurrent Neural Networks”. In: International Conference on Machine Learning, pp. 1310–1318. Peng, F., Feng, F., and McCallum, A. (2004). “Chinese Segmentation and New Word Detection Using Conditional Random Fields”. In: Proceedings of the 20th International Conference on Computational Linguistics. ACL. Pennington, J., Socher, R., and Manning, C. D. (2014). “GloVe: Global Vectors for Word Repre- sentation”. In: Empirical Methods in Natural Language Processing, pp. 1532–1543. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). “Deep Contextualized Word Representations”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Vol. 1, pp. 2227–2237. Piccardi, T., Catasta, M., Zia, L., and West, R. (2018). “Structuring Wikipedia Articles with Sec- tion Recommendations”. In: Proceedings of the 41th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, pp. 665–674. Pink, G., Nothman, J., and Curran, J. R. (2014). “Analysing Recall Loss in Named Entity Slot Filling”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Pro- cessing, pp. 820–830. Prabhu, Y. and Varma, M. (2014). “FastXML: A Fast, Accurate and Stable Tree-Classifier for Extreme Multi-Label Learning”. In: Proceedings of the 20th ACM SIGKDD International Con- ference on Knowledge Discovery and Data Mining. ACM, pp. 263–272. Prokofyev, R., Demartini, G., and Cudré-Mauroux, P. (2014). “Effective Named Entity Recog- nition for Idiosyncratic Web Collections”. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 397–408. Quinlan, J. R. (1986). “Induction of Decision Trees”. In: Machine Learning 1.1, pp. 81–106. Rabiner, L. R. (1989). “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”. In: Proceedings of the IEEE 77.2, pp. 257–286. Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). “Improving Language Un- derstanding by Generative Pre-Training”. In: Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). “SQuAD: 100,000+ Questions for Machine Comprehension of Text”. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392. Ranganath, R., Gerrish, S., and Blei, D. M. (2014). “Black Box Variational Inference”. In: Journal of Machine Learning Research 33, pp. 814–822. Bibliography 141

Ratinov, L. and Roth, D. (2009). “Design Challenges and Misconceptions in Named Entity Recognition”. In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning. ACL, pp. 147–155. Ratinov, L., Roth, D., Downey, D., and Anderson, M. (2011). “Local and Global Algorithms for Disambiguation to Wikipedia”. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. ACL, pp. 1375–1384. Ratner, A., Bach, S. H., Ehrenberg, H., Fries, J., Wu, S., and Ré, C. (2020). “Snorkel: Rapid Train- ing Data Creation with Weak Supervision”. In: The VLDB Journal 29.2, pp. 709–730. Reinsel, D, Gantz, J, and Rydning, J (2018). Data Age 2025: The Digitization of the World from Edge to Core. IDC White Paper US44413318. IDC. Reports and Data (2020). Clinical Decision Support System (CDSS) Market By Component, By Prod- uct, By Type, By Mode of Delivery, By Level of Interactivity, By Setting Outlook, By Usage, By Application, By End-Use, And Segment Forecasts, 2017-2027. Tech. rep. Reports and Data. Richardson, W. S., Wilson, M. C., Nishikawa, J., and Hayward, R. S. (1995). “The Well-Built Clinical Question: A Key to Evidence-Based Decisions”. In: ACP Journal Club 123.3, A12–3. Riedl, M. and Biemann, C. (2012). “TopicTiling: A Text Segmentation Algorithm Based on LDA”. In: Proceedings of ACL 2012 Student Research Workshop. ACL, pp. 37–42. Robertson, S. E. and Jones, K. S. (1976). “Relevance Weighting of Search Terms”. In: Journal of the American Society for Information Science 27.3, pp. 129–146. Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., and Gatford, M. (1995). “Okapi at TREC-3”. In: NIST Special Publication SP 109. Sahlgren, M. (2008). “The Distributional Hypothesis”. In: Italian Journal of Linguistics 20.1, pp. 33– 54. Salton, G., Allan, J., and Buckley, C. (1993). “Approaches to Passage Retrieval in Full Text In- formation Systems”. In: Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 49–58. Salton, G. and Buckley, C. (1988). “Term-Weighting Approaches in Automatic Text Retrieval”. In: Information Processing & Management 24.5, pp. 513–523. Salton, G., Wong, A., and Yang, C.-S. (1975). “A Vector Space Model for Automatic Indexing”. In: Communications of the ACM 18.11, pp. 613–620. Santos, C. N. dos, Xiang, B., and Zhou, B. (2015). “Classifying Relations by Ranking with Con- volutional Neural Networks”. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. ACL, pp. 626–634. Sarawagi, S. (2008). “Information Extraction”. In: Foundations and Trends in Databases 1.3, pp. 261– 377. Saxe, A. M., McClellans, J. L., and Ganguli, S. (2013). “Learning Hierarchical Categories in Deep Neural Networks”. In: Proceedings of the Annual Meeting of the Cognitive Science Society. Vol. 35. 142 Bibliography


Sebastian Arnold

Horst-Kohl-Str. 16, 12157 Berlin, Germany
Phone: +49 160 80 21 43 6
Email: [email protected]

Date and place of birth: July 10, 1984, Freiburg im Breisgau
Citizenship: German

Education and Qualifications

08/2016 – present  Doctoral Program in Computer Science
Université de Fribourg, Switzerland
Thesis: “Neural Machine Reading for Domain-Specific Text Resources”, supervised by Prof. Dr. Philippe Cudré-Mauroux and Prof. Dr.-Ing. habil. Alexander Löser.

10/2011 – 03/2015  Master of Science in Computer Science (M.Sc.-Inf.)
Technische Universität Berlin, Germany
Thesis: “Interactive Classification of Keyword Search Queries”, supervised by Prof. Dr. rer. nat. Volker Markl and Prof. Dr.-Ing. habil. Alexander Löser.

10/2006 – 09/2011  Bachelor of Science in Computer Science (B.Sc.-Inf.)
Technische Universität Berlin, Germany
Thesis: “GoOLAP User Interaktion”, supervised by Prof. Dr. rer. nat. Volker Markl and Dr.-Ing. Alexander Löser.

10/2005 – 09/2006  Communication in Social and Economic Contexts
Berlin University of Arts (UdK), Germany (no formal degree)

09/2001 – 07/2004  General Qualification for University Entrance
Technisches Gymnasium Freiburg, Germany

Work Experience

07/2020 – present  Machine Learning Expert
Curalie GmbH, Berlin, Germany

06/2015 – 06/2020  Research Assistant
Beuth University of Applied Sciences Berlin, Germany
Data Science and Text-based Information Systems group (DATEXIS)

04/2011 – 07/2014  Student Research Assistant
Technische Universität Berlin, Germany
Database Systems and Information Management group (DIMA)

08/2005 – 08/2007  Student Assistant
Dornier Consulting Berlin, Germany

Languages

• German – native speaker
• English – highly proficient in spoken and written English
• French – basic communication skills

Research Interests

• Neural Machine Reading – focused on distributional representations for long documents, topic and discourse modeling, self-supervised model training;
• Information Extraction – focused on domain-specific named entity recognition (NER) and linking (NEL), clinical information extraction, sparse training data;
• Information Retrieval – focused on neural answer passage retrieval from long documents;
• Machine Learning – focused on deep neural networks, e.g. LSTM, CNN, Transformer.

Personal Interests

• Music and Audio Technology – drums, synthesizers, composition, sound design, digital signal processing (DSP), audio production and recording;
• Embedded Systems – hardware and software prototyping, sensors, communication;
• Open Source Contributions – maintainer of the TeXoo Java IE framework. Contributions to Deeplearning4j, Apache Mahout and JUCE framework. See: github.com/sebastianarnold.

Selected Publications

• S. Arnold, B. van Aken, P. Grundmann, F. A. Gers and A. Löser. Learning Contextualized Document Representations for Healthcare Answer Retrieval. Proceedings of The Web Conference 2020. ACM, 2020: 1332–1343.
• S. Arnold, R. Schneider, P. Cudré-Mauroux, F. A. Gers and A. Löser. SECTOR: A Neural Model for Coherent Topic Segmentation and Classification. Transactions of the Association for Computational Linguistics (TACL) Vol. 7. MIT Press, 2019: 169–184.
• R. Schneider, S. Arnold, T. Oberhauser, T. Klatt, T. Steffek and A. Löser. Smart-MD: Neural Paragraph Retrieval of Medical Topics. World Wide Web Conference (Companion). IW3C2, 2018: 203–206.
• S. Arnold, R. Dziuba and A. Löser. TASTY: Interactive Entity Linking As-You-Type. 26th International Conference on Computational Linguistics (COLING'16): Demos. ACL, 2016: 111–115.
• S. Arnold, F. A. Gers, T. Kilias and A. Löser. Robust Named Entity Recognition in Idiosyncratic Domains. arXiv:1608.06757 [cs.CL], 2016.
• S. Arnold, A. Löser and T. Kilias. Resolving Common Analytical Tasks in Text Databases. Proceedings of the ACM Eighteenth International Workshop on Data Warehousing and OLAP (DOLAP). ACM, 2015: 75–84.
• A. Löser, S. Arnold and T. Fiehn. The GoOLAP Fact Retrieval Framework. Business Intelligence. Springer, 2012: 84–97.