A Composite Model for Computing Similarity Between Texts

Dissertation

approved by the Department of Computer Science of Technische Universität Darmstadt

in fulfillment of the requirements for the academic degree of Doktor-Ingenieur (Dr.-Ing.)

submitted by Dipl.-Inform. Daniel Bär, born in Würzburg

Date of submission: 20 August 2013
Date of defense: 11 October 2013

Referees: Prof. Dr. Iryna Gurevych, Prof. Ido Dagan, Ph.D., and Dr. Torsten Zesch

Darmstadt 2013 D17


Declaration of Honor1

I hereby declare that I have written the submitted thesis for the attainment of the academic degree of Dr.-Ing., entitled “A Composite Model for Computing Similarity Between Texts”, independently and exclusively with the aids indicated. I have not previously made any attempt to obtain a doctoral degree.

Darmstadt, 20 August 2013
Dipl.-Inform. Daniel Bär

1 In accordance with §9(1) of the doctoral regulations of TU Darmstadt

Academic Career of the Author2

10/02 – 11/08 Studies of computer science with a minor in linguistics at Julius-Maximilians-Universität Würzburg

06/08 – 11/08 Diploma thesis at the Chair of Artificial Intelligence and Applied Computer Science of the University of Würzburg, in cooperation with SAP AG (SAP Research), Karlsruhe, entitled “User-centered Annotation Concepts for the Semantic Web”

01/09 – 05/09 Research associate at the Fraunhofer-Institut für Graphische Datenverarbeitung, Darmstadt

06/09 – 08/13 Research associate at the Ubiquitous Knowledge Processing Lab of Technische Universität Darmstadt

2 In accordance with §20(3) of the doctoral regulations of TU Darmstadt

Abstract

Computing text similarity is a foundational technique for a wide range of tasks in natural language processing such as duplicate detection, question answering, or automatic essay grading. Just recently, text similarity received widespread attention in the research community through the establishment of the Semantic Textual Similarity (STS) Task at the Semantic Evaluation (SemEval) workshop in 2012—a fact that stresses the importance of text similarity research. The goal of the STS Task is to create automated measures which are able to compute the degree of similarity between two given texts in the same way that humans do. Measures are thereby expected to output continuous text similarity scores, which are then either compared with human judgments or used as a means for solving a particular problem.

We start this thesis with the observation that while the concept of similarity is well grounded in psychology, text similarity is much less well-defined in the natural language processing community. No attempt has been made yet to formalize in what way text similarity between two texts can be computed. Still, text similarity is regarded as a fixed, axiomatic notion in the community. To alleviate this shortcoming, we describe existing formal models of similarity and discuss how we can adapt them to texts. We propose to judge text similarity along multiple text dimensions, i.e. characteristics inherent to texts, and provide empirical evidence based on a set of annotation studies that the proposed dimensions are perceived by humans.

We continue with a comprehensive survey of state-of-the-art text similarity measures previously proposed in the literature. To the best of our knowledge, no such survey has been done yet. We propose a classification into compositional and non-compositional text similarity measures according to their inherent properties. Compositional measures compute text similarity based on pairwise word similarity scores between all words, which are then aggregated to an overall similarity score, while non-compositional measures project the complete texts onto particular models and then compare the texts based on these models.

Based on our theoretical insights, we then present the implementation of a text similarity system which composes a multitude of text similarity measures along multiple text dimensions using a machine learning classifier. We argue that, depending on the concrete task at hand, such a system may need to address more than a single text dimension in order to best resemble human judgments. Our efforts culminate in the open source framework DKPro Similarity, which streamlines the development of text similarity measures and experimental setups.

We apply our system in two evaluations, for which it consistently outperforms prior work and competing systems: an intrinsic and an extrinsic evaluation. In the intrinsic evaluation, the performance of text similarity measures is evaluated in an isolated setting by comparing the algorithmically produced scores with human judgments. We conducted the intrinsic evaluation in the context of the STS Task as part of the SemEval workshop. In the extrinsic evaluation, the performance of text similarity measures is evaluated with respect to a particular task at hand, where text similarity is a means for solving a particular problem. We conducted the extrinsic evaluation in the text classification task of text reuse detection. The results of both evaluations support our hypothesis that a composition of text similarity measures highly benefits the similarity computation process.

Finally, we stress the importance of text similarity measures for real-world applications. We therefore introduce the application scenario Self-Organizing Wikis, where users of wikis, i.e. web-based collaborative content authoring systems, are supported in their everyday tasks by means of natural language processing techniques in general, and text similarity in particular. We elaborate on two use cases where text similarity computation is particularly beneficial: the detection of duplicates, and the semi-automatic insertion of hyperlinks. Moreover, we discuss two further applications where text similarity is a valuable tool: In both question answering and textual entailment recognition, text similarity has been used successfully in experiments and appears to be a promising means for further research in these fields.

We conclude this thesis with an analysis of shortcomings of current text similarity research and formulate challenges which should be tackled by future work. In particular, we believe that computing text similarity along multiple text dimensions—which depend on the specific task at hand—will benefit any other task where text similarity is fundamental, as a composition of text similarity measures has shown superior performance in both the intrinsic as well as the extrinsic evaluation.

Zusammenfassung

Computing text similarity is a fundamental technique for a broad range of applications in natural language processing, such as duplicate detection, question answering, or automatic essay grading. With the establishment of the Semantic Textual Similarity competition at the Semantic Evaluation (SemEval) workshop in 2012, text similarity received considerable attention in the research community, a clear indication of the current need for research in this area. The goal of this competition is to develop automated measures which are able to determine the similarity between two given texts in the same way that humans do. These measures are expected to produce similarity scores on a continuous scale, which are then either compared directly with human reference judgments or used as a means for solving a concrete problem.

We begin this thesis with the observation that while the concept of similarity is well defined in psychology, the concept of text similarity has only a rudimentary definition in our research community. To our knowledge, there has been no concrete attempt to formalize in what way texts can be similar at all. To this day, text similarity is used as a blanket term. To remedy this shortcoming, we describe existing formal models of similarity and discuss how we can tailor them to texts. We propose to determine text similarity along multiple text dimensions, i.e. characteristics that are inherent to texts. In a set of annotation studies, we show that the proposed dimensions are indeed used by humans when judging the similarity of texts.

We then present a thorough survey of the state of the art in text similarity measures, which, to our knowledge, constitutes the first comprehensive analysis in this area. We propose to divide the existing measures into two classes: Compositional measures first compute pairwise word similarities between all words of the given texts and then aggregate them to obtain a final text similarity score. Non-compositional measures, in contrast, project the given texts onto particular models and then compare the texts exclusively on the basis of these models.

Against the background of the theoretical analyses presented at the beginning of this thesis, we design the implementation of a text similarity system which combines a multitude of text similarity measures along different text dimensions within a machine learning framework. We argue that such a system, depending on the concrete task, should take more than a single text dimension into account in order to best approximate human similarity judgments. Our work culminates in the open source software library DKPro Similarity, which enables the development of text similarity measures based on standardized interfaces and encourages researchers to share experimental setups in a simple way, with a view to the reproducibility of results in the research community.

We subsequently evaluate our system both intrinsically and extrinsically; in both cases it consistently outperforms all prior work and competing systems. In the intrinsic evaluation, we measure the quality of the text similarity measures in an isolated setting and compare the automatically produced similarity scores with those of human study participants. We conducted this evaluation in the Semantic Textual Similarity competition at the SemEval workshop. In the extrinsic evaluation, by contrast, we measure the quality of the text similarity measures not directly, but in the context of a concrete problem. We conducted the extrinsic evaluation for a text classification task in which the degree of text reuse between two texts is determined. The results of both evaluations support our assumption that a system for computing text similarity clearly benefits from using a combination of multiple measures.

In the final part of the thesis, we emphasize the particular importance of text similarity for real-world problems. We first discuss the application scenario of Self-Organizing Wikis. In this scenario, users of wikis, i.e. collaborative tools for web-based knowledge management, are supported in their everyday tasks by natural language processing methods, in particular with the help of text similarity measures. We discuss two fields of application in detail: the detection of duplicates and the semi-automatic insertion of cross-references. In addition, we address two further applications in which text similarity measures have already been used very promisingly: question answering and textual entailment recognition.

We conclude this thesis with an analysis of currently open research questions and formulate challenges which should be addressed in future work. Based on the positive results of our two evaluations, we are firmly convinced that the proposed approach of computing text similarity along multiple text dimensions, which in turn depend on the concrete task, will also have a lasting positive influence on other related problems in natural language processing.

Acknowledgements

First and foremost, I would like to thank my principal supervisor Iryna Gurevych for giving me the chance to pursue my doctoral studies. I am enormously grateful to her for all her guidance, patience and support, especially at the beginning of this work. I also thank Ido Dagan for the time he spent with me discussing my work. In particular, I am thankful to him for pointing me to the ongoing development efforts of the novel platform for textual inference named Excitement, which led to the integration of DKPro Similarity with this platform.

I wholeheartedly thank Torsten Zesch, who co-supervised this thesis during the last couple of years. Without his most generous support and endless motivation, this work would not have been possible. He is a wonderful person to work with, and I am truly and deeply indebted to him for his endless feedback on the experimental work and publications. It was invaluable. He is an exceptionally skilled mentor and always helped me find a path back when I felt totally lost—which happened more than just once. I truly wish that many, many more students will get the opportunity to work under his supervision in the future.

Thanks also to the scientific community, in particular the anonymous peers that reviewed our publications and provided valuable feedback. I also thank the organizers of the pilot Semantic Textual Similarity Task at the SemEval-2012 workshop for providing a wonderful setting for text similarity evaluation: Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. For financial support, I thank the Klaus Tschira Foundation for supporting this work under project No. 00.133.2008.

I am enormously grateful to all my friends who still make my time in Darmstadt so worthwhile. Thanks for everything. Thanks also to all colleagues at our group. In particular, I would like to thank Nicolai Erbs, Christian M. Meyer and Richard Eckart de Castilho for sharing an office with me for so long. A special thanks to Nicolai for finding ever new ways of making time in the office so entertaining, to Niklas Jakob for inviting me to the regular soccer matches on Tuesdays, to Michèle Spankus for continuously putting up motivation stickers on my desk, and to Torsten Zesch and Oliver Ferschke for the great night out in Seattle. A special thanks also goes out to Johannes Hoffart. Since the beginning of our diploma studies, he was and still is an invaluable companion on all matters. He is an exceptionally nice person to discuss all sorts of things with, and always reminded me of this work’s big picture when I was about to get stuck in details.

I also thank my family for giving me the chance to pursue my studies and for their continuous support and motivation. Above all, I thank Jenny. She always had an open ear for my worries and encouraged me everlastingly. Without her, I would have stopped working on this thesis (literally) a hundred times. Thank you so much for making the past four years the most wonderful time of my life.

Contents

1 Introduction
  1.1 Text Similarity Measures
  1.2 Evaluation
  1.3 Related Tasks
  1.4 Outline and Contributions
  1.5 Publication Record

2 Background and Theory
  2.1 Formal Models of Similarity
    2.1.1 Geometric Model
    2.1.2 Set-Theoretic Model
    2.1.3 Conceptual Spaces
  2.2 Adapting Conceptual Spaces to Texts
    2.2.1 Dimension Identification
    2.2.2 Empirical Evidence
  2.3 Chapter Summary

3 Text Similarity Measures
  3.1 Compositional Measures
    3.1.1 Mihalcea et al. (2006)
    3.1.2 Li et al. (2006)
    3.1.3 Islam and Inkpen (2008)
    3.1.4 Tsatsaronis et al. (2010)
  3.2 Non-Compositional Measures
    3.2.1 String distance metrics
    3.2.2 n-gram models
    3.2.3 Vector Space Model
    3.2.4 Latent Semantic Analysis
    3.2.5 Explicit Semantic Analysis
    3.2.6 Kennedy and Szpakowicz (2008)
    3.2.7 Ramage et al. (2009); Yeh et al. (2009)
  3.3 Chapter Summary

4 Evaluation Methodology
  4.1 Intrinsic Evaluation
    4.1.1 Datasets
    4.1.2 Metrics
  4.2 Extrinsic Evaluation
    4.2.1 Datasets
    4.2.2 Metrics
  4.3 Chapter Summary

5 Composite Model
  5.1 Motivation
  5.2 Composing Measures
  5.3 Text Similarity Measures
    5.3.1 Content Similarity
    5.3.2 Structural Similarity
    5.3.3 Stylistic Similarity
  5.4 Experimental Setup
  5.5 Chapter Summary

6 Intrinsic Evaluation
  6.1 Task Description
  6.2 Datasets
  6.3 Text Similarity Measures
  6.4 Experimental Setup
  6.5 Results & Discussion
    6.5.1 Feature Selection
    6.5.2 Final Results
  6.6 Error Analysis
  6.7 Competing Systems
  6.8 Follow-up Work
  6.9 Chapter Summary

7 Extrinsic Evaluation
  7.1 Task Description
  7.2 Text Similarity Measures
  7.3 Experimental Setup
  7.4 Results & Discussion
    7.4.1 Wikipedia Rewrite Corpus
    7.4.2 METER Corpus
    7.4.3 Webis Crowd Paraphrase Corpus
  7.5 Chapter Summary

8 NLP Applications
  8.1 Self-Organizing Wikis
    8.1.1 Scenario Description
    8.1.2 System Architecture
    8.1.3 Text Similarity Use Cases
    8.1.4 Further Use Cases
  8.2 Further Applications
    8.2.1 Question Answering
    8.2.2 Textual Entailment Recognition
  8.3 Chapter Summary

9 DKPro Similarity
  9.1 Motivation
  9.2 Related Frameworks
  9.3 Architecture
  9.4 Measures & Components
  9.5 Experimental Setups
  9.6 Chapter Summary

10 Summary and Conclusions

Bibliography

A Human Annotations
  A.1 Intrinsic Evaluation
    A.1.1 30 Sentence Pairs
    A.1.2 50 Short Texts
  A.2 Extrinsic Evaluation
    A.2.1 Wikipedia Rewrite Corpus
    A.2.2 METER Corpus

List of Figures

2.1 Graphical example of the geometric model

3.1 Steps performed by compositional text similarity measures
3.2 Steps performed by non-compositional text similarity measures

4.1 Example sentence pair in the 50 Short Texts dataset
4.2 Similarity score distribution for the 50 Short Texts dataset
4.3 Paraphrase examples in the Microsoft Research Paraphrase Corpus

5.1 Composite model: Using multiple text similarity measures in parallel
5.2 System architecture: feature generation and feature combination

6.1 Final results on the test data
6.2 Results on the 2013 test data

7.1 Example of text reuse in the Wikipedia Rewrite Corpus
7.2 Individual results on the Wikipedia Rewrite Corpus
7.3 Composition results on the Wikipedia Rewrite Corpus
7.4 Individual results on the METER Corpus
7.5 Composition results on the METER Corpus
7.6 Individual results on the Webis Crowd Paraphrase Corpus
7.7 Composition results on the Webis Crowd Paraphrase Corpus
7.8 Results of our system compared to previous systems

8.1 Wikulu proxy architecture
8.2 Integration of Wikulu with Wikipedia
8.3 Illustration of Wikulu’s information flow
8.4 Wikulu use case: duplicate detection
8.5 Wikulu use case: link discovery
8.6 Wikulu use case:
8.7 Wikulu use case: text summarization
8.8 Wikulu use case: visual result analysis
8.9 Results on the RTE1 dataset
8.10 Results on the RTE2 dataset
8.11 Results on the RTE3 dataset
8.12 Results on the RTE4 dataset
8.13 Results on the RTE5 dataset
8.14 Textual entailment recognition: Excitement project

9.1 Integration of text similarity measures with UIMA pipelines
9.2 DKPro Similarity system demonstration

List of Tables

2.1 Example of the set-theoretic model
2.2 Classification of NLP tasks w.r.t. relevant text dimensions

3.1 Overview of compositional text similarity measures
3.2 Overview of non-compositional text similarity measures

4.1 Statistics for the evaluation datasets

6.1 Statistics for the datasets used for the intrinsic evaluation
6.2 Overview of measures used in the intrinsic evaluation
6.3 Final feature sets of the intrinsic evaluation
6.4 Results for the feature selection
6.5 Final results on the test data
6.6 Results on the 2013 test data

7.1 Overview of measures used in the extrinsic evaluation
7.2 Individual results across the evaluation datasets
7.3 Results (4-way) on the Wikipedia Rewrite Corpus
7.4 Composition results (4-way) on the Wikipedia Rewrite Corpus
7.5 Confusion matrix (4-way) on the Wikipedia Rewrite Corpus
7.6 Results (3-way) on the Wikipedia Rewrite Corpus
7.7 Confusion matrix (3-way) on the Wikipedia Rewrite Corpus
7.8 Results (2-way) on the Wikipedia Rewrite Corpus
7.9 Confusion matrix (2-way) on the Wikipedia Rewrite Corpus
7.10 Results on the METER Corpus
7.11 Composition results on the METER Corpus
7.12 Confusion matrix on the METER Corpus
7.13 Results on the Webis Crowd Paraphrase Corpus
7.14 Feature set on the Webis Crowd Paraphrase Corpus
7.15 Composition results on the Webis Crowd Paraphrase Corpus
7.16 Confusion matrix on the Webis Crowd Paraphrase Corpus

8.1 Statistics for the RTE1–5 datasets
8.2 Results across the RTE1–5 datasets

9.1 Overview of measures available in DKPro Similarity

A.1 Human annotations of the 30 Sentence Pairs dataset
A.2 Human annotations of the 50 Short Texts dataset
A.3 Human annotations of the Wikipedia Rewrite Corpus
A.4 Human annotations of the METER Corpus

Chapter 1

Introduction

Text similarity is the task of determining the degree of similarity between two texts. While texts may vary in length, e.g. from single words to paragraphs to complete novels or even books, we henceforth regard a sentence as the minimum length for texts under analysis. Single words constitute a special case which is commonly referred to as the task of computing word similarity (Zesch and Gurevych, 2010) and is not the focus of this thesis. As an example of text similarity, consider the following text pair, which is taken from the dataset by Li et al. (2006) that we will discuss in detail in Section 4.1.1:

(1.1) (a) A gem is a jewel or stone that is used in jewellery.
(b) A jewel is a precious stone used to decorate valuable things that you wear, such as rings or necklaces.

A system which computes text similarity should be able to detect that a strong similarity relationship holds between the two sentences in Example 1.1, as both give a definition for a jewel.

Computing text similarity has become increasingly important in recent years as a fundamental means for many natural language processing (NLP) tasks and applications. In automatic essay grading (Attali and Burstein, 2006), for example, text similarity can be used to compare texts by students with a reference answer created by a teacher, in order to automatically assign grades to the student essays. In question answering (Lin and Pantel, 2001), text similarity can be used to compare answer candidates from a large pool of potential answers with the question, in order to automatically determine whether a candidate is a suitable answer. For text summarization (Barzilay and Elhadad, 1997), text similarity can be employed to improve the quality of the automatic summaries by reducing redundancy, which is in turn detected by measuring text similarity.

Computing text similarity is a very difficult task for machines. This is mainly due to the enormous variability in natural language, in which texts can be constructed using different lexical and syntactic constructions. In Example 1.1, we showed two sentences which both express essentially the same thing: a jewel is a precious stone which is used for decoration. While it is simple for a machine to determine that the sentences both mention a jewel, as there is an exact word match in both Examples 1.1a and 1.1b, it is rather difficult to infer that jewellery matches with the expression decorate valuable things that you wear, as there is no overlap between their lexical representations.

Text similarity received widespread attention in recent years. In 2012, the pilot Semantic Textual Similarity (STS) Task (Agirre et al., 2012) was established at the Semantic Evaluation (SemEval) workshop1. The task is intended to unite multiple efforts across the applied semantics community in order to investigate novel approaches to text similarity computation. In Chapter 6, we will discuss this task in detail and present results of our system, which achieved the best overall performance among 35 participating teams.

However, text similarity is often used as an umbrella term covering quite different phenomena—as opposed to the notion of similarity in psychology, which is well studied and captured in formal models such as the geometric model (Shepard, 1962; Widdows, 2004) or the set-theoretic model (Tversky, 1977). We argue that the seemingly simple question “How similar are two texts?” cannot be answered independently from asking what properties make them similar. Goodman (1972) gives a good example for physical objects regarding the situation of a baggage check at the airport: While a spectator might compare bags by shape, size, or color, the pilot only focuses on a bag’s weight, and a passenger likely compares bags by destination and ownership. Similarly, texts also have particular inherent properties that need to be considered in any attempt to judge their similarity (Bär et al., 2011b). Take for example two novels by the famous 19th century Russian writer Leo Tolstoy. A reader may readily argue that these novels are completely dissimilar due to different plots, people, or places. On the other hand, a second reader (e.g. a scholar overseeing texts of disputed authorship) may argue that both texts are indeed highly similar because of their stylistic similarity. In consequence, text similarity remains a loose notion unless we provide a certain frame of reference. We argue that text similarity cannot be seen as a fixed, axiomatic notion. Rather, we need to define in what way two texts are similar.

From a human-centered perspective, we say that text similarity is a function between two texts which can be informally characterized by the readers’ shared view on the text characteristics along which similarity is to be judged. However, to the best of our knowledge, the definition of appropriate text characteristics for text similarity computation has not been tackled in any previous research. We thus further argue that text similarity can be judged along different text dimensions, i.e. groups of text characteristics which are perceived by humans and for which we provide empirical evidence. For example, a scholar in digital humanities may be less interested in texts that share similar contents—as opposed to e.g. near-duplicate detection (see Section 1.3)—but may rather be looking for text pairs which are similar with respect to their style and structure. Gärdenfors (2000) introduced a conceptual framework called conceptual spaces, which allows modeling such perceived similarity (Smith and Heise, 1992), which “changes with changes in selective attention to specific perceptual properties.” Throughout this work, and in particular in Chapter 7, we will elaborate on the idea of text dimensions and further discuss suitable dimensions for text similarity tasks.
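To make the limitation of purely lexical comparison concrete, the following minimal Python sketch computes a simple word-overlap score for Example 1.1. It is an illustration only; the function name and the use of the Jaccard coefficient are choices made here for demonstration, not a measure discussed in this thesis.

import re

# Purely illustrative: a lexical word-overlap measure applied to Example 1.1.
# It detects the shared word "jewel", but has no way to relate "jewellery"
# to "decorate valuable things that you wear".

def jaccard_word_overlap(text_a: str, text_b: str) -> float:
    """Jaccard coefficient over lowercased word sets (crude tokenization)."""
    tokens_a = set(re.findall(r"\w+", text_a.lower()))
    tokens_b = set(re.findall(r"\w+", text_b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

sentence_a = "A gem is a jewel or stone that is used in jewellery."
sentence_b = ("A jewel is a precious stone used to decorate valuable things "
              "that you wear, such as rings or necklaces.")

print(round(jaccard_word_overlap(sentence_a, sentence_b), 2))  # low score despite high similarity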

1 http://www.cs.york.ac.uk/semeval-2012

1.1 Text Similarity Measures

A large number of text similarity measures have been proposed in the literature. For the categorization of these measures, we propose a classification scheme which groups the measures according to the type of similarity computation being used into compositional measures and non-compositional measures. In Chapter 3, we argue that this classification scheme has several advantages over other schemes.

Compositional measures tokenize the input texts, compute pairwise word similarities between all words (i.e. similarity computation is limited to one word pair at a time), and finally aggregate all pairwise scores to an overall score for the text pair. The literature on compositional measures often does not introduce novel measures itself, but rather describes proposals for combining one or more existing word similarity measures with an optional word weighting scheme and a particular aggregation strategy. The literature proposes different sets of instantiations for these steps. Popular word similarity measures have previously been surveyed by Zesch and Gurevych (2010) and comprise measures such as Wu and Palmer (1994), Hirst and St. Onge (1998), Leacock and Chodorow (1998), Jiang and Conrath (1997), or Lesk (1986). Basic strategies to aggregate the pairwise word similarity scores for a text pair are, for example, to compute the arithmetic mean of all scores or to take the maximum score, even though more sophisticated strategies exist.

Non-compositional measures, on the other hand, first project the complete input texts onto certain models, e.g. high-dimensional vector spaces or graph structures, before text similarity is then computed based on these abstract representations. For the compositional measures discussed above, the two major steps (word similarity computation, aggregation) are modular and fully interchangeable. In contrast, the two major steps for non-compositional measures (projection onto a model, similarity computation) are an integral part of the text similarity measure and typically not interchangeable. A non-compositional measure is thus more than just a combination of a particular projection and similarity computation step. Because the abstract models differ greatly between the proposed measures, a similarity computation step specifically tailored towards the employed models is required. Existing non-compositional measures range from simple string-based measures, which compare two texts solely based on their string sequences without any semantic processing, through vector space models such as Latent Semantic Analysis (Landauer et al., 1998) and Explicit Semantic Analysis (Gabrilovich and Markovitch, 2007), to measures which project texts onto graph-based representations (Ramage et al., 2009; Yeh et al., 2009).
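The following Python sketch illustrates the compositional scheme described above (tokenize, compute pairwise word similarities, aggregate). The aggregation strategy shown here (averaging, per word, the maximum similarity to any word of the other text, in both directions) is only one of the options mentioned above, and the plugged-in word similarity function is a toy stand-in for, e.g., a WordNet-based measure.

import re
from statistics import mean
from typing import Callable

def compositional_similarity(text_a: str, text_b: str,
                             word_sim: Callable[[str, str], float]) -> float:
    """Compositional text similarity: for every word in one text, take the
    best-matching word in the other text, then average these maxima in both
    directions (one of several possible aggregation strategies)."""
    tokens_a = re.findall(r"\w+", text_a.lower())
    tokens_b = re.findall(r"\w+", text_b.lower())

    def directed(src, tgt):
        # maximum pairwise word similarity for each source word, then averaged
        return mean(max(word_sim(s, t) for t in tgt) for s in src)

    return 0.5 * (directed(tokens_a, tokens_b) + directed(tokens_b, tokens_a))

# Toy word similarity (exact string match) just to make the sketch runnable;
# a real system would plug in e.g. a WordNet-based word similarity measure here.
score = compositional_similarity("a gem is a jewel", "a jewel is a stone",
                                 word_sim=lambda u, v: 1.0 if u == v else 0.0)
print(round(score, 2))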

1.2 Evaluation

The performance of text similarity measures can be evaluated either in an intrinsic or an extrinsic evaluation.

An intrinsic evaluation evaluates the performance of text similarity measures in an isolated setting by comparing the system results with a human gold standard. Systems are thereby expected to output continuous text similarity scores within a given interval, e.g. between 0 and 5, where 0 means not similar at all and 5 is perfectly similar. Popular datasets are the 30 Sentence Pairs dataset (Li et al., 2006), the 50 Short Texts collection (Lee et al., 2005), the Microsoft Paraphrase Corpus (Dolan et al., 2004), and the Microsoft Video Description Paraphrase Corpus (Chen and Dolan, 2011). Each dataset contains text pairs which are accompanied by human judgments about their perceived similarity.2 Evaluation is then typically carried out by computing Pearson or Spearman correlation between the system output and the human judgments. We will discuss details of the intrinsic evaluation in Chapter 6.

An extrinsic evaluation, on the contrary, evaluates the performance of text similarity measures with respect to a particular task at hand, where text similarity is a means for solving a concrete problem. In Chapter 7, we apply our system to the task of text reuse detection, where text reuse is defined as “the reuse of existing written sources in the creation of new text” (Clough et al., 2002). Popular datasets for this task include the Wikipedia Rewrite Corpus (Clough and Stevenson, 2011), the METER Corpus (Gaizauskas et al., 2001), and the Webis Crowd Paraphrase Corpus (Burrows et al., 2013). The datasets contain text pairs along with human judgments on the degree of text reuse. A system for computing text similarity is expected to produce a classification output which assigns each text pair a class label such as similar/dissimilar for a binary classification, or highly similar/moderately similar/dissimilar for a multiclass setting. Evaluation is then carried out by comparing the system output with the human classifications, for example in terms of accuracy, i.e. the number of correctly predicted texts divided by the total number of texts, or F1 score. We will discuss details of the extrinsic evaluation in Chapter 7.
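In code, the two evaluation modes reduce to two familiar comparisons. The following sketch uses SciPy for the correlations of the intrinsic setting and plain accuracy for the extrinsic setting; the scores and labels are made up for illustration.

from scipy.stats import pearsonr, spearmanr

# Intrinsic setting: correlate continuous system scores with human judgments.
system_scores = [4.2, 1.0, 3.5, 0.5, 2.8]   # hypothetical system output
human_scores  = [4.5, 0.8, 3.0, 1.0, 2.5]   # hypothetical gold standard

pearson_r, _ = pearsonr(system_scores, human_scores)
spearman_rho, _ = spearmanr(system_scores, human_scores)
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")

# Extrinsic setting: compare predicted class labels with human classifications.
predicted = ["similar", "dissimilar", "similar", "similar"]
gold      = ["similar", "dissimilar", "dissimilar", "similar"]
accuracy = sum(p == g for p, g in zip(predicted, gold)) / len(gold)
print(f"Accuracy = {accuracy:.2f}")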

1.3 Related Tasks

Text similarity is closely related to a number of other tasks in the area of natural language processing. These tasks are inherently similar to text similarity, but require additional processing steps; recognizing textual entailment (Dagan et al., 2006), for example, implies further constraints on the similarity computation process. We elaborate on these related tasks in the following.

Textual Entailment A related task is textual entailment, which is defined as the directional relationship between a text T (the text) and a second text H (the hypothesis), where T entails H (T ⇒ H) “if the meaning of H can be inferred from the meaning of T, as would typically be interpreted by people” (Dagan et al., 2006). If T entails H, then T and H are often also highly similar. However, an entailment relationship also holds if H is not similar to T, but can be logically inferred from T. Textual entailment further differs from text similarity in two significant ways: (a) it is defined as a unidirectional relationship, while text similarity requires a bidirectional similarity relationship to hold between a text pair, and (b) textual entailment operates on binary judgments while text similarity is defined as a continuous notion.

2 Subsets of the Microsoft Paraphrase Corpus and the Microsoft Video Description Paraphrase Corpus have been re-annotated with human judgments in the context of the pilot Semantic Textual Similarity Task (Agirre et al., 2012) at the Semantic Evaluation (SemEval) workshop. We will report details in Chapter 6.

Paraphrase recognition A task which is also closely related to text similarity is paraphrase recognition (Dolan et al., 2004). A paraphrase comprises a pair of texts which are “more or less semantically equivalent”3, but might differ in their syntactic structure and degree of detail. Chen and Dolan (2011) define an ideal paraphrase as “meaning-preserving” which “must also diverge as sharply as possible in form from the original, while still sounding natural and fluent.” Under this definition, paraphrase recognition and text similarity closely resemble each other: Two texts for which a paraphrase relationship holds are, in most cases, naturally also highly similar. However, if for example one text is a negation of the other, the texts may still be highly similar, but would no longer be paraphrases. Additionally, paraphrases typically comprise text pairs up to sentence length only (Dolan et al., 2004; Knight and Marcu, 2002; Cohn et al., 2008; Chen and Dolan, 2011), while text similarity applies to texts of any length. A key difference between paraphrase recognition and text similarity is that paraphrases are annotated with binary judgments rather than continuous similarity scores.

Near-duplicate detection Another field of related work is near-duplicate detection. In the context of web search and crawling, the goal is to detect text pairs (i.e. typically pairs of web pages) which differ only slightly in some small portion of text, e.g. in advertisements or timestamps, or as the result of a revision step (Manku et al., 2007; Hoad and Zobel, 2003). While near-duplicate detection is similar to text similarity in that a bidirectional similarity relationship holds, it differs in what is considered a text: In the context of near-duplicate detection, a text refers to a sequence of arbitrary characters, i.e. typically HTML source code rather than a natural language text. Prior work thus mainly uses fingerprinting and hashing techniques (Charikar, 2002) rather than methods from natural language processing to find near-duplicates. Typically, near-duplicates are sought in very large datasets, for example a corpus of 1.6 billion web pages (Henzinger, 2006).
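The fingerprinting-and-hashing idea mentioned above can be sketched in a few lines. This is a generic shingle-hashing illustration with arbitrary parameter values, not the specific algorithm of Charikar (2002).

import hashlib

def fingerprint(text: str, n: int = 8, keep: int = 16) -> set:
    """Hash all character n-grams (shingles) and keep the `keep` smallest
    hash values as a compact fingerprint of the text."""
    shingles = {text[i:i + n] for i in range(max(1, len(text) - n + 1))}
    hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16) for s in shingles)
    return set(hashes[:keep])

def near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Two texts count as near-duplicates if their fingerprints overlap strongly."""
    fa, fb = fingerprint(a), fingerprint(b)
    overlap = len(fa & fb) / len(fa | fb)
    return overlap >= threshold

print(near_duplicate("The quick brown fox jumps over the lazy dog.",
                     "The quick brown fox jumps over the lazy dog!"))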

Plagiarism detection Various definitions of plagiarism have been proposed: the result of copying an original text and claiming its authorship (Potthast et al., 2012), the “unauthorised use or close imitation” of an original text and claiming its authorship (Hannabuss, 2001), or the “unacknowledged copying of documents” (Joy and Luck, 1999). By these definitions, plagiarism detection and text similarity are clearly similar to each other, as a copied text naturally exhibits a high degree of similarity with the original. However, a central aspect of all these definitions is the act of unacknowledged, unauthorized reuse or copying of an original text, which may appear in different forms, e.g. direct word-by-word copies, the reuse of text with only slight changes (paraphrasing), or the omission of citations on referenced text parts (Clough, 2003). High similarity between two texts thus only serves as an indicator for a further plagiarism analysis (Potthast et al., 2012) which specifically addresses external factors, e.g. whether the original text has indeed been reused without authorization or whether proper citations are absent.

3 According to the instructions given to the annotators in the study by Dolan et al. (2004). The instructions are part of the archive which is available at http://research.microsoft.com/en-us/downloads/607d14d9-20cd-47e3-85bc-a2f65cd28042

1.4 Outline and Contributions

We now give an overview of the main contributions of this thesis.

• No attempt has been made yet to formalize in what way text similarity between two texts can be computed—as opposed to the notion of similarity in psychology, for which formal models exist. We thus start this thesis with a discussion of the existing formal models of similarity and how we can adapt them to texts. We propose to define text similarity as a notion which can be judged along multiple text dimensions, i.e. characteristics inherent to texts, and provide empirical evidence based on a set of annotation studies that humans perceive these dimensions when asked to judge text similarity.

• In previous research, numerous text similarity measures have been presented. Each of these measures exhibits particular inherent properties. However, to the best of our knowledge, no comprehensive survey of the existing measures has been done yet. In Chapter 3, we thus present an overview of existing text similarity measures and propose a classification into compositional and non-compositional text similarity measures according to their inherent properties.

• Text similarity measures are typically evaluated either intrinsically by comparing system outputs with human judgments, or extrinsically by trying to solve a particular task such as text classification. In Chapter 4, we introduce the evaluation methodology and present and discuss popular evaluation metrics and datasets previously used in the literature for both types of evaluation.

• Based on the theoretical insights of Chapter 2, we adopt the idea of conceptual spaces and propose a composite model which computes text similarity along multiple text dimensions. While traditionally text similarity is computed by using a single text similarity measure at a time, we assume that no single text similarity measure is able to capture all necessary text characteristics.

• We then apply this model in an intrinsic evaluation in the Semantic Textual Similarity Task (Agirre et al., 2012) as part of the Semantic Evaluation (SemEval) workshop. Using a simple log-linear regression model to combine the individual measures (see the sketch after this list), our system produces the similarity scores which best resemble human judgments across the five given datasets, out of 35 participating teams.

• In an extrinsic evaluation, we apply our model to the text classification task of text reuse detection. Across three standard evaluation datasets, our model consistently outperforms all prior work. Moreover, the results empirically demonstrate that text reuse can be best detected if text similarity measures are combined across multiple text dimensions.

• We then stress the importance of text similarity for real-world natural language processing applications. We describe our focus application scenario Self-Organizing Wikis, where we support users of wikis, i.e. web-based collaborative content authoring systems (Leuf and Cunningham, 2001), with text similarity features. We thereby focus on two use cases: duplicate detection and link discovery. We also discuss two further applications where text similarity has been found useful: question answering and textual entailment recognition.

• We describe how our system architecture culminated in a generalized framework for computing text similarity named DKPro Similarity.4 The framework addresses the fact that text similarity computation has traditionally been a highly scattered effort. Our open source software package is designed to complement DKPro Core5, a collection of software components for natural language processing based on Apache UIMA (Ferrucci and Lally, 2004).

• In the final part, we conclude with an analysis of shortcomings of current text similarity research and formulate challenges which should be tackled by future work. In particular, we believe that computing text similarity along multiple text dimensions—which depend on the concrete task at hand—will benefit any other task where text similarity is fundamental.
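The composition referenced in the contributions above can be sketched as follows. The sketch uses scikit-learn's LinearRegression purely as a stand-in for the learner employed in the thesis; the feature values (scores of individual text similarity measures) and the gold judgments are made up for illustration.

from sklearn.linear_model import LinearRegression

# Each row: scores of several text similarity measures for one text pair
# (hypothetical values; columns could be e.g. n-gram overlap, ESA, LSA, ...).
train_features = [
    [0.82, 0.75, 0.90],
    [0.10, 0.22, 0.15],
    [0.55, 0.48, 0.60],
    [0.30, 0.35, 0.25],
]
train_gold = [4.6, 0.8, 3.1, 1.9]   # human similarity judgments (0-5 scale)

model = LinearRegression().fit(train_features, train_gold)

test_features = [[0.70, 0.65, 0.72]]
print(model.predict(test_features))  # combined similarity score for a new text pair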

1.5 Publication Record

The majority of this thesis’ contents have previously appeared in peer-reviewed conference or workshop proceedings, which are listed below. The chapters which build upon these publications are indicated accordingly.

• Bär, D., Zesch, T., and Gurevych, I. (2013). DKPro Similarity: An Open Source Framework for Text Similarity. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 121–126, Sofia, Bulgaria (Chapters 5, 6, 7, and 9)

• Bär, D., Zesch, T., and Gurevych, I. (2012b). Text Reuse Detection Using a Composition of Text Similarity Measures. In Proceedings of the 24th International Conference on Computational Linguistics, pages 167–184, Mumbai, India (Chapters 3, 4, 5, and 7)

• Walter, S., Unger, C., Cimiano, P., and Bär, D. (2012). Evaluation of a Layered Approach to Question Answering over Linked Data. In Proceedings of the 11th International Semantic Web Conference, pages 362–374, Boston, MA, USA (Chapter 8)

• Bär, D., Biemann, C., Gurevych, I., and Zesch, T. (2012a). UKP: Computing Semantic Textual Similarity by Combining Multiple Content Similarity Measures. In Proceedings of the 6th International Workshop on Semantic Evaluation, in conjunction with the 1st Joint Conference on Lexical and Computational Semantics, pages 435–440, Montreal, Canada (Chapters 3, 4, 5, and 6)

• Bär, D., Zesch, T., and Gurevych, I. (2011b). A Reflective View on Text Similarity. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 515–520, Hissar, Bulgaria (Chapters 2 and 4)

4 http://code.google.com/p/dkpro-similarity-asl
5 http://code.google.com/p/dkpro-core-asl

• Bär, D., Erbs, N., Zesch, T., and Gurevych, I. (2011a). Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. System Demonstrations, pages 74–79, Portland, OR, USA (Chapter 8)

• Hoffart, J., Bär, D., Zesch, T., and Gurevych, I. (2009a). Discovering Links Using Semantic Relatedness. In Proceedings of the 8th Workshop of the Initiative for the Evaluation of XML Retrieval, pages 314–325, Brisbane, Australia (Chapter 8)

Chapter 2

Background and Theory

“A pretender, an impostor, a quack.” That is what Goodman (1972) called similarity. To him, similarity is a slippery and useless notion (Decock and Douven, 2010). However, formal models of similarity have been developed that refute these arguments and make similarity a well-formalized notion. In the first part of this chapter, we introduce three such models: the geometric model (Shepard, 1962; Widdows, 2004), the set-theoretic model (Tversky, 1977), and conceptual spaces (Gärdenfors, 2000). In the second part, we argue that, in contrast to the notion of similarity in the cognitive sciences, text similarity still lacks a clear definition. We therefore adapt conceptual spaces to texts. We propose that text dimensions, i.e. characteristics inherent to texts, can be used to judge text similarity, and provide empirical evidence based on a set of annotation studies that humans perceive these dimensions when asked to judge text similarity.

2.1 Formal Models of Similarity

We now introduce three formal models of similarity which have been developed in the cognitive sciences: the geometric model, the set-theoretic model, and conceptual spaces. For an exhaustive discussion of formal models of similarity, we refer the reader to the work by Decock and Douven (2010).

2.1.1 Geometric Model

The geometric model (Shepard, 1962; Widdows, 2004) represents objects in a metric space ⟨X, δ⟩ where each object is represented by a point x ∈ X, and δ is a distance function on X. For two objects a and b, the distance function δ(a, b) returns a non-negative number which denotes the distance between a and b. Thereby, less distance corresponds to higher similarity. In the geometric space, for all points a, b, c ∈ X the three metric axioms minimality, symmetry, and triangle inequality hold (Tversky, 1977; Widdows, 2004):

Minimality: δ(a, b) ≥ δ(a, a) = 0    (2.1)
Symmetry: δ(a, b) = δ(b, a)    (2.2)
Triangle Inequality: δ(a, b) + δ(b, c) ≥ δ(a, c)    (2.3)



Figure 2.1: Graphical example of the geometric model: The eight planets of the solar system are represented in a two-dimensional space along two dimensions: distance from the Sun, and equatorial radius. The function δ(a, b) denotes the distance between a and b.

A graphical example is shown in Figure 2.1. Here, we depict the eight planets of the solar system (Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune) in a two-dimensional space along two dimensions: distance from the Sun, and equatorial radius.1 As described above, the distance function δ denotes the distance between two objects, where less distance corresponds to higher similarity. In this example, we can clearly see that the Earth is much more similar to Venus than to Saturn in the given two-dimensional space, as δ(Earth, Venus) < δ(Earth, Saturn).
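As a small illustration of the model, the sketch below reproduces the comparison from Figure 2.1 with approximate real values. It assumes, for brevity, that the two dimensions are used unnormalized; in practice they would need to be rescaled so that neither dimension dominates the distance.

import math

# Two quality dimensions per planet: distance from the Sun (10^6 km)
# and equatorial radius (km); values rounded.
planets = {
    "Venus":  (108.2, 6052),
    "Earth":  (149.6, 6371),
    "Saturn": (1433.5, 58232),
}

def delta(a: str, b: str) -> float:
    """Euclidean distance in the (unnormalized) two-dimensional space."""
    (x1, y1), (x2, y2) = planets[a], planets[b]
    return math.hypot(x1 - x2, y1 - y2)

# Less distance corresponds to higher similarity:
print(delta("Earth", "Venus") < delta("Earth", "Saturn"))  # True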

2.1.2 Set-Theoretic Model

The geometric model, however, was criticized for being inadequate (Goodman, 1972; Tversky, 1977). Besides the minimality and symmetry axioms, Tversky (1977) discusses the triangle inequality axiom and gives a contradicting example for the similarity of countries: While Jamaica is similar to Cuba (in terms of geographical proximity) and Cuba is similar to Russia (in terms of political affinity), Jamaica and Russia are not similar at all, neither geographically nor politically. However, the triangle inequality axiom (see Equation 2.3) claims that δ(Jamaica, Cuba) + δ(Cuba, Russia) ≥ δ(Jamaica, Russia), i.e. that the sum of the distances between (Jamaica, Cuba) and (Cuba, Russia) is an upper bound for the distance between (Jamaica, Russia). This, however, is contradicted by the given example, as δ(Jamaica, Russia) is obviously much larger. Tversky (1977) hence concludes that “the triangle inequality is hardly compelling.” However, we argue that this effect is due to the different assumptions under which the countries have been compared: While geographical proximity has been used to justify the high similarity for the pair (Jamaica, Cuba), political affinity has been used for (Cuba, Russia). We will continue the discussion of this effect in Section 2.2.

1 Data taken from http://nssdc.gsfc.nasa.gov/planetary/factsheet and Wikipedia

Table 2.1 Example of the set-theoretic model: We describe the eight planets of the solar system as sets of features rather than points in a metric space. While the values of nominal variables such as type or core can directly serve as features in the set-theoretic model, the values of cardinal variables such as density, orbital period, or number of moons need to be represented, for example, as a sequence of nested or overlapping sets (Tversky, 1977).

                       Mercury      Venus        Earth        Mars
Type                   terrestrial  terrestrial  terrestrial  terrestrial
Core                   metallic     metallic     metallic     metallic
Density (kg/m3)        5,427        5,243        5,515        3,933
Orbital Period (days)  88           224          365          686
# Moons                0            0            1            2

                       Jupiter      Saturn       Uranus       Neptune
Type                   gas giant    gas giant    gas giant    gas giant
Core                   rocky        rocky        ice          ice
Density (kg/m3)        1,326        687          1,270        1,638
Orbital Period (days)  4,332        10,759       30,799       60,190
# Moons                67           62           27           13

In consequence, Tversky (1977) proposes a set-theoretic similarity model which accounts for the fact that similarity may not be metric in nature. The proposed model is based on the features of the two objects under comparison, whereby a feature denotes the value of a binary, nominal, ordinal, or cardinal variable. While the values of binary and nominal variables can directly serve as features in the set-theoretic model, the values of ordinal and cardinal variables need to be represented, for example, as a sequence of nested or overlapping sets (Tversky, 1977). The similarity sim(a, b) is then computed based on a matching function f between two feature sets A and B of two objects a and b, respectively:

sim(a, b) = f(A ∩ B, A − B, B − A)    (2.4)

Thereby, A ∩ B denotes the common features shared by a and b, and A − B and B − A are the distinctive features which belong only to A or B, respectively. The more common features a and b share and the fewer distinctive features they have, the more similar a and b are.

To illustrate this model, we continue with the example of the solar system. In Table 2.1, we describe the eight planets as sets of features rather than points in a metric space. As example variables, we selected a planet’s type, core, density, orbital period, and number of moons.1 For example, the planet Venus is represented as Venus = {terrestrial, metallic, 5,243, 224, 0}.2 We can see that there is a high similarity between Venus and Earth, as both are terrestrial planets with a metallic core, have a similar density, and differ only slightly in their orbital period and the number of moons.

2 In the example, we show the raw values of the cardinal variables for the sake of simplicity and for illustration purposes only. In a proper set-theoretic model, the raw values need to be represented, for example, as a sequence of nested or overlapping sets (Tversky, 1977).

Venus and Jupiter, on the other hand, differ in all five features and are thus not similar at all in the given model. Unlike the geometric model, the set-theoretic model has the advantage that it allows similarity to be judged in different contexts under different assumptions and objectives. That is, while in the geometric model the metric space is fixed prior to comparing two objects a and b, in the set-theoretic model a and b can be similar in a multitude of ways, as only a subset of the features of a and b are considered relevant in a given context and hence selected in the comparison process (Decock and Douven, 2010).
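Tversky's matching function f is often instantiated as the linear contrast model, sim(a, b) = θ·|A ∩ B| − α·|A − B| − β·|B − A|. The sketch below uses this instantiation with arbitrary weights and simplified nominal feature sets; cardinal variables such as density are omitted, as they would need to be discretized into nested or overlapping sets (see the text above).

def tversky_contrast(features_a: set, features_b: set,
                     theta: float = 1.0, alpha: float = 0.5, beta: float = 0.5) -> float:
    """Tversky's linear contrast model: common features increase similarity,
    distinctive features of either object decrease it."""
    common = len(features_a & features_b)
    only_a = len(features_a - features_b)
    only_b = len(features_b - features_a)
    return theta * common - alpha * only_a - beta * only_b

# Simplified nominal feature sets for three planets (illustrative only).
venus   = {"terrestrial", "metallic-core", "no-moons"}
earth   = {"terrestrial", "metallic-core", "one-moon"}
jupiter = {"gas-giant", "rocky-core", "many-moons"}

print(tversky_contrast(venus, earth))    # higher score: mostly shared features
print(tversky_contrast(venus, jupiter))  # lower score: only distinctive features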

2.1.3 Conceptual Spaces

We have learned that the set-theoretic model allows similarity to be judged depending upon context, as a “limited list of relevant features” (Tversky, 1977) is compiled in the comparison process. However, Gärdenfors (2000) argues that similarity is indeed geometric in nature, but that the traditional geometric model (see Section 2.1.1) needs a refinement. He proposes that its major limitation—the assumption of the existence of a single metric space—can be overcome by representing objects in a multitude of metric spaces.

Gärdenfors (2000) therefore introduces the idea of a conceptual space. A conceptual space comprises a number of quality dimensions with a geometric structure which are grouped into domains. Depending upon context, one or more domains of a conceptual space are then activated and used to compare two objects. A prominent example by Gärdenfors (2000) is that of color perception: He introduces the color domain, which comprises the three quality dimensions hue, chromaticness, and brightness, corresponding to our cognitive representation of colors. For our solar system example, we can thus construct a conceptual space based on the potential domains orbital characteristics (potential quality dimensions: orbital period, inclination, distance from the Sun), physical characteristics (e.g. equatorial radius, mass), and atmosphere (surface pressure, composition).3

The theory of conceptual spaces is a flexible representation framework which is considered to represent human cognition. Similar to the set-theoretic model (see Section 2.1.2), conceptual spaces allow for similarity judgments that depend on a given context by activating a subset of all available domains. In contrast to the set-theoretic model, though, the basic quality dimensions are considered geometric in nature. Decock and Douven (2010) thus describe conceptual spaces as a refined and contextualized geometrical model.
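The context dependence of conceptual spaces can be sketched as follows: each object carries several domains of quality dimensions, and only the domains activated by the current context enter the distance computation. Domain names and raw values are illustrative, and the dimensions are again left unnormalized for brevity.

import math

# Hypothetical conceptual space for planets: quality dimensions grouped into domains.
planets = {
    "Earth": {"orbital":  {"period_days": 365, "distance_e6km": 149.6},
              "physical": {"radius_km": 6371, "mass_e24kg": 5.97}},
    "Venus": {"orbital":  {"period_days": 224, "distance_e6km": 108.2},
              "physical": {"radius_km": 6052, "mass_e24kg": 4.87}},
}

def distance(a: str, b: str, active_domains: list) -> float:
    """Euclidean distance restricted to the domains activated by the context."""
    total = 0.0
    for domain in active_domains:
        da, db = planets[a][domain], planets[b][domain]
        total += sum((da[dim] - db[dim]) ** 2 for dim in da)
    return math.sqrt(total)

# Different contexts activate different domains and yield different judgments.
print(distance("Earth", "Venus", ["orbital"]))
print(distance("Earth", "Venus", ["physical"]))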

2.2 Adapting Conceptual Spaces to Texts

In the previous section, we learned that the criticism by Goodman (1972) does not hold: similarity is a well-formalized notion in the cognitive sciences. While the traditional geometric model (see Section 2.1.1) has its shortcomings, the set-theoretic model as well as conceptual spaces are very well suited to capture similarity.

3 The proposed domains and corresponding quality dimensions are inspired by the classification of planets in Wikipedia.

Table 2.2 Classification of common natural language processing tasks with respect to the relevant text dimensions suitable for text similarity computation: content, structure, and style

Task Content Structure Style

Authorship Attribution (Mosteller and Wallace, 1964) X
Automatic Essay Scoring (Attali and Burstein, 2006) XXX
Information Retrieval (Manning et al., 2008) XXX
Paraphrase Recognition (Dolan et al., 2004) X
Plagiarism Detection (Potthast et al., 2012) XX
Question Answering (Lin and Pantel, 2001) X
Short Answer Grading (Leacock and Chodorow, 2003) XXX
Text Categorization (Cavnar and Trenkle, 1994) X
Text Segmentation (Hearst, 1997) XX
Text Simplification (Chandrasekar et al., 1996) XX
Text Summarization (Barzilay and Elhadad, 1997) XX
Word Sense Alignment (Ponzetto and Navigli, 2010) X

In this section, we now adapt the idea of conceptual spaces to texts. In the first part, we therefore introduce the notion of text dimensions which are inherent to texts. In the second part, we give empirical evidence that humans perceive text similarity along the proposed dimensions.

2.2.1 Dimension Identification

To the best of our knowledge, no attempt has been made yet to formalize in what way text similarity between two texts can be computed. We therefore adapt the idea of conceptual spaces, a framework that builds upon the idea of the existence of dimensions (see Section 2.1.3). In order to adapt this model to texts, we need to define explicit dimensions suitable for texts. These dimensions are characteristics inherent to texts that can be used to judge text similarity. We will refer to them as text dimensions, henceforth.

We analyzed common NLP tasks with respect to the relevant text dimensions suitable for text similarity computation and present some examples in Table 2.2. We identified three major dimensions inherent to texts: content, structure, and style (Bär et al., 2011b). Content addresses all topics and their relationships within a text. Structure refers to the internal development of a given text, i.e. discourse structuring, such as the order of sections. Style refers to grammar, usage, mechanics, and lexical complexity, as proposed in the task of automatic essay scoring (Attali and Burstein, 2006). That task typically not only requires the essay to be about a certain topic (content dimension), but also demands an adequate style and a coherent structure. A scholar in digital humanities, in contrast, might be interested in texts that are similar to a reference document with respect to style and structure, while texts with similar contents are of minor interest.

It should be noted that the dimensions in conceptual spaces are not totally independent, but are correlated (Gärdenfors, 2000). For example, the structure dimension is related to the style dimension, as a particular style inherently leads, for example, to a particular usage pattern of structural elements. The style dimension in turn is related to the content dimension, as particular contents (such as a newspaper report about a car accident) require a particular (factual) style. Changes in the content dimension hence often come with changes in the style dimension.

2.2.2 Empirical Evidence

In Table 2.2, we identified a number of typical NLP tasks with respect to the relevant text dimensions suitable for text similarity computation. We now give empirical evidence for the proposed text dimensions. We report on the results of two annotation studies which we conducted in order to show that humans indeed judge similarity along the proposed text dimensions. In these studies, we asked human participants to give insights into the rationales behind their text similarity judgments (Bär et al., 2011b). In the first study, we found empirical evidence for the content and structure dimensions, while the second study further grounds the style dimension. Overall, the results show that human annotators indeed distinguish between different text dimensions when they are asked to judge text similarity. In consequence, text similarity cannot be seen as a fixed, axiomatic notion. Rather, a thorough definition of the relevant text dimensions is necessary.

Content vs. Structure Dimension

In this study, we used the 50 Short Texts dataset by Lee et al. (2005) that contains pairwise human similarity judgments for 1,225 text pairs. We will discuss this dataset in detail in Section 4.1.1. For this study, we selected a subset of 50 pairs with a uniform distribution of judgments across the whole similarity range. We asked three annotators (A1–A3): “How similar are the given texts?” Replicating the original annotation setting by Lee et al. (2005), we did not give further instructions to the annotators. The following text pair is an example from the subset that we selected:

(2.1)

(a) Washington has sharply rebuked Russia over bombings of Georgian villages, warning the raids violated Georgian sovereignty and could worsen tensions between Moscow and Tbilisi. “The United States regrets the loss of life and deplores the violation of Georgia’s sovereignty,” White House spokesman Ari Fleischer said. Mr Fleischer said US Secretary of State Colin Powell had delivered the same message to his Russian counterpart but that the stern language did not reflect a sign of souring relations between Moscow and Washington.

(b) Russia defended itself against U.S. criticism of its economic ties with countries like Iraq, saying attempts to mix business and ideology were misguided. “Mixing ideology with economic ties, which was characteristic of the Cold War that Russia and the United States worked to end, is a thing of the past,” Russian Foreign Ministry spokesman Boris Malakhov said Saturday, reacting to U.S. Defense Secretary Donald Rumsfeld’s statement that Moscow’s economic relationships with such countries sends a negative signal.

The text pair in Example 2.1 is accompanied by a human similarity score of 3.8 in the original dataset. However, the three annotators who participated in our study rated this pair with scores of 3, 5, and 2 on the 1–5 scale, respectively. Obviously, A2 had different assumptions than A1 and A3 on what makes these texts similar, as she assigned the highest possible similarity score (5) to this text pair. To further investigate this issue, we asked the annotators about the reasons for their judgments. A1 and A3 reported consistently that they intuitively focused only on the content of the texts and completely disregarded other characteristics. A2, however, also took structural similarities into account: two texts were rated highly similar because of the way they are organized. First, an introduction to the topic is given, then a quotation is stated, and finally the text concludes with a certain reaction of the acting subject (see Example 2.1). These results indicate that human annotators indeed judge text similarity along both the content and the structure text dimensions, as introduced above. We further computed the Spearman correlation ρ of each annotator’s ratings with the gold standard: ρ(A1) = 0.83, ρ(A2) = 0.65, and ρ(A3) = 0.85. The much lower correlation of annotator A2 indicates that most likely a different dimension was used to judge similarity, which further supports our hypothesis that different text dimensions indeed influence the judgments of text similarity.

Content vs. Style Dimension

The annotators in the previous study distinguished between the text dimensions content and structure. The stylistic aspect of the texts was not addressed, as the pairs were all of similar style, and hence that dimension was not perceived as salient. In order to further investigate whether the style dimension can be grounded empirically, we conducted a second study: We selected 10 pairs of short texts from Wikipedia (WP) and Simple English Wikipedia4 (SWP). We used the first paragraphs of Wikipedia articles and the full texts of articles in Simple English to obtain pairs of similar length. An example is shown in the following.

(2.2)

SWP. The constitution of a country is a special type of law that tells how its government is supposed to work. It tells how the country’s leaders are to be chosen and how long they get to stay in office, how new laws are made and old laws are to be changed or removed based on law, what kind of people are allowed to vote and what other rights they are guaranteed, and how the constitution can be changed. Limits are put on the Government in how much power they have within the Constitution. On the other hand, countries with repressive or corrupt governments frequently do not stick to their constitutions, or have bad constitutions without giving freedom to citizens & others. This can be known as dictatorship or simply “bending the rules”. A Constitution is often a way of a uniting within a Federation.

WP. A constitution is a set of laws that a set of people have made and agreed upon for government—often as a written document—that enumerates and limits the powers and functions of a political entity. These rules together make up, i.e. constitute, what the entity is. In the case of countries and autonomous regions of federal countries the term refers specifically to a constitution defining the fundamental political principles, and establishing the structure, procedures, powers and duties, of a government. By limiting the government’s own reach, most constitutions guarantee certain rights to the people. The term constitution can be applied to any overall system of law that defines the functioning of a government, including several uncodified historical constitutions that existed before the development of modern codified constitutions.

4Articles written in Simple English use a limited vocabulary and easier grammar than the standard Wikipedia, http://simple.wikipedia.org

We then formed pairs in all combinations (WP-WP, SWP-WP, and SWP-SWP) to ensure that both text dimensions content and style are salient for some pairs. For example, an article from SWP and one from WP about the same topic share the same content, but are different in style, while two articles from SWP have a similar style, but different content. We then asked three annotators to rate each pair according to the content and style dimensions. A qualitative analysis of the results shows that WP-WP and SWP-SWP pairs are perceived as stylistically similar, while WP-SWP pairs are seen as similar with respect to their content. We conclude that human annotators indeed distinguish between the two text dimensions content and style for text pairs where these dimensions are discriminative and hence perceived as salient text characteristics.

2.3 Chapter Summary

In this chapter, we discussed formal models of similarity which have been developed in the cognitive sciences, and showed that the criticism by Goodman (1972) of similarity as a slippery and useless notion does not hold. We then adapted the theory of conceptual spaces to texts. Based on an analysis of common NLP tasks where text similarity is fundamental, we proposed to focus on three major text dimensions: content, structure, and style. We reported on a number of annotation studies which we conducted in order to empirically demonstrate that humans perceive text similarity along the proposed text dimensions. The results of these studies show that humans seem intuitively able to find an appropriate dimension of comparison for a given text collection. Smith and Heise (1992) refer to that as perceived similarity which “changes with changes in selective attention to specific perceptual properties.” Selective attention can probably be modeled using dimension-specific similarity measures. We will refer to these theoretical insights in Chapter 5 where we present our composite model.

Chapter 3

Text Similarity Measures

A large number of text similarity measures have been proposed in the literature. For the categorization of these measures, we propose a classification scheme which clusters the measures according to the type of similarity computation being used into compositional measures and non-compositional measures. Compositional measures tokenize the input texts, compute pairwise word similarity between all words, and aggregate the resulting scores to an overall similarity score. Non-compositional measures, on the other hand, first project the complete input texts onto certain models, e.g. high-dimensional vector spaces, before then comparing them based on the abstract representations. We argue that the proposed classification scheme has several advantages over traditional classifications, which we review in the following.

In the literature, the measures are typically classified by the type of resources used into corpus-based and knowledge-based measures (Mihalcea et al., 2006). Corpus-based measures operate on corpus statistics gathered from a usually large representative corpus. For example, the raw frequencies of all words may be computed along with statistics of which words appear in what documents in order to identify function words, a combination which is widely known as tf-idf. Knowledge-based measures operate on lexical-semantic resources that encode human knowledge about words. Such resources are, for example, dictionaries, thesauri, or wordnets. They encode knowledge about words and their definitions (dictionaries, wordnets), or the relations between them (thesauri, wordnets) in a machine-readable format. The most prominent example of a lexical-semantic resource probably is WordNet (Fellbaum, 1998). Ho et al. (2010) and Tsatsaronis et al. (2010) extend the above classification scheme by proposing a third class, hybrid measures, which employ resources of both types.

Such a classification scheme is well suited to describe the properties of word similarity measures which are applied to certain resources, e.g. the knowledge-based measures that are exhaustively discussed by Zesch and Gurevych (2010). However, we argue that it is not appropriate for text similarity measures, for two main reasons:

Multiple Resources. Text similarity measures typically use more than a single resource, as we will see in Section 3.1. For example, many text similarity measures first compute raw similarity scores, e.g. using WordNet (knowledge-based), before then weighting these scores, e.g. with tf-idf weights (corpus-based). It is unclear how to classify such measures according to the corpus-/knowledge-based scheme, as well as which part of the algorithm takes precedence and hence defines the overall class. Using the mixed class hybrid measures also is no feasible alternative, as almost all of the measures use more than a single type of resource, which would render the other two classes superfluous.

Interchangeability. The literature on compositional text similarity measures often does not introduce novel measures itself, but rather describes proposals for combining one or more existing word similarity measures with an optional word weighting scheme and a particular aggregation strategy. That way, a text similarity measure can easily be modified by exchanging the underlying word similarity measures or adapting the aggregation strategy. For example, a word similarity measure based on WordNet can easily be replaced by a measure that operates on corpus statistics. We argue that a classification into corpus-/knowledge-based measures is thus again not appropriate: Such a class only describes the nature of one single constituent, i.e. the word similarity measure, rather than the whole text similarity measure.

Islam and Inkpen (2008) propose an alternative classification scheme with four classes: vector-based document model measures, corpus-based measures, hybrid measures, and descriptive feature-based measures. The first class describes measures which use any type of vector representation to compare two texts. Corpus-based measures and hybrid measures correspond to the classes by Ho et al. (2010) and Tsatsaronis et al. (2010) described above. Descriptive feature-based measures refer to measures which use a machine learning classifier with any set of features derived from the given texts.

We argue, though, that this classification scheme mixes two orthogonal characteristics of text similarity measures: Corpus-based and hybrid measures describe the characteristics of the underlying word similarity measure, in particular which type of resource they use, as discussed above. Vector-based and descriptive feature-based measures, on the other hand, are instantiations of particular data models that can be used to compute similarity based on any kind of raw scores. For example, the raw pairwise similarity scores computed by a corpus-based measure could be used as the vector elements of a vector-based measure, which then computes the cosine to get the final similarity score. The same holds true for descriptive feature-based measures, where the raw pairwise similarity scores by any corpus-based or hybrid word similarity measure could be used as features in a machine learning classifier, as it was done, for example, by Bär et al. (2012a).

In the following, we elaborate on the proposed classes of compositional and non-compositional measures, and discuss the corresponding measures.

3.1 Compositional Measures

Figure 3.1 shows a high-level depiction of the steps that are performed by compositional measures to compute text similarity between two given input texts. After tokenization, pairwise word similarity scores are computed between all words (i.e. similarity computation is limited to one word pair at a time), before then aggregating all pairwise scores to an overall score for the texts. As discussed above, the literature proposes different sets of instantiations for these steps.

Figure 3.1: High-level depiction of the steps that are performed by compositional measures to compute text similarity between two given input texts: After tokenization, pairwise word similarity scores wsim are computed between all words (i.e. similarity computation is limited to one word pair at a time), before then aggregating all pairwise scores to an overall score tsim for the texts.

In Table 3.1, we summarize the proposals and list the underlying word similarity measures along with a very brief summary of the employed aggregation strategy. Zesch and Gurevych (2010) previously summarized existing word similarity measures. These include, for example, the measures by Wu and Palmer (1994), Hirst and St. Onge (1998), Leacock and Chodorow (1998), Jiang and Conrath (1997), or Lesk (1986). The pairwise word similarity scores can then be aggregated for a text pair by computing the arithmetic mean of all scores or by taking the maximum score. However, more sophisticated aggregation strategies exist as well. In the following, we therefore discuss popular compositional text similarity measures with a focus on their underlying word similarity measures and the proposed aggregation strategies.
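To make the interplay of the two steps concrete, the following sketch shows a minimal compositional measure in Python; the word similarity function and the aggregation strategy are both pluggable, and the exact-match word similarity used in the example is merely a stand-in for any of the WordNet- or corpus-based measures discussed below.

```python
from statistics import mean
from typing import Callable, List

def tokenize(text: str) -> List[str]:
    # Minimal whitespace tokenization; a real system would use a proper tokenizer.
    return text.lower().split()

def compositional_similarity(text1: str, text2: str,
                             word_sim: Callable[[str, str], float],
                             aggregation: str = "mean") -> float:
    """Aggregate pairwise word similarity scores into a single text similarity score."""
    words1, words2 = tokenize(text1), tokenize(text2)
    pairwise = [word_sim(w1, w2) for w1 in words1 for w2 in words2]
    if aggregation == "mean":
        return mean(pairwise)
    if aggregation == "max":
        return max(pairwise)
    raise ValueError(f"Unknown aggregation strategy: {aggregation}")

# Placeholder word similarity measure (exact string match); any other
# word similarity measure could be plugged in without further changes.
exact_match = lambda w1, w2: 1.0 if w1 == w2 else 0.0
print(compositional_similarity("a quick brown fox", "a quick brown dog", exact_match))
```

Both the word similarity function and the aggregation strategy can be exchanged independently, which is exactly the interchangeability property discussed above.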

3.1.1 Mihalcea et al.(2006)

This work proposes to use a single word similarity measure at a time out of a rich set of measures in combination with a bidirectional aggregation strategy. The proposed measures are the PMI-IR metric (Turney, 2001), Latent Semantic Analysis (Landauer et al., 1998)1, Leacock and Chodorow (1998), Lesk (1986), Wu and Palmer (1994), Resnik (1995), Lin (1998a), and Jiang and Conrath (1997). In a follow-up work, Mohler and Mihalcea (2009) used three additional measures: the shortest path in the WordNet taxonomy (Rada et al., 1989), the measure by Hirst and St. Onge (1998), and Explicit Semantic Analysis (Gabrilovich and Markovitch, 2007)1. The proposed aggregation strategy computes a directional similarity score from a text t1 to a second text t2 and vice versa, whereby for each word a counterpart in the other text is sought which maximizes the pairwise similarity.

1Both Latent Semantic Analysis and Explicit Semantic Analysis can operate on single words as well as complete texts. We will discuss them in detail in Section 3.2.

Table 3.1 Compositional measures can be described as instantiations of the two main steps depicted in Figure 3.1: word similarity computation and aggregation. The specific proposals for compositional measures then differ in the measures and strategies chosen—which are both fully interchangeable.

Reference: Mihalcea et al. (2006)
Word similarity measures: PMI-IR (Turney, 2001), Latent Semantic Analysis (Landauer et al., 1998), Leacock and Chodorow (1998), Lesk (1986), Wu and Palmer (1994), Resnik (1995), Lin (1998a), Jiang and Conrath (1997); additional measures by Mohler and Mihalcea (2009): Shortest Path (Rada et al., 1989), Hirst and St. Onge (1998), Explicit Semantic Analysis (Gabrilovich and Markovitch, 2007)
Aggregation: Bidirectional aggregation

Reference: Li et al. (2006)
Word similarity measures: Shortest Path (Rada et al., 1989), word order
Aggregation: Cosine

Reference: Islam and Inkpen (2008)
Word similarity measures: Longest Common Subsequence, SOC-PMI (Islam and Inkpen, 2006)
Aggregation: Unidirectional aggregation

Reference: Ho et al. (2010)
Word similarity measures: Longest Common Subsequence, Yang and Powers (2005)
Aggregation: As above

Reference: Islam et al. (2012)
Word similarity measures: Word 3-grams (Web1T)
Aggregation: As above, with modified maxSim function

Reference: Tsatsaronis et al. (2010)
Word similarity measures: Shortest Path (Rada et al., 1989), harmonic mean of tf-idf scores
Aggregation: Similar to Mihalcea et al. (2006)

The similarity scores are weighted by a word’s inverse document frequency (idf) (Salton and McGill, 1983) computed on the British National Corpus (Leech, 1993)2, and then normalized. The final text similarity score is the average of applying this strategy in both directions, as depicted in Equation 3.1.

sim(t_1, t_2) = \frac{1}{2} \left( \frac{\sum_{w \in t_1} maxSim(w, t_2) \cdot idf(w)}{\sum_{w \in t_1} idf(w)} + \frac{\sum_{w \in t_2} maxSim(w, t_1) \cdot idf(w)}{\sum_{w \in t_2} idf(w)} \right)    (3.1)

Thereby, maxSim(w, t_j) = \max_{w_j \in t_j} sim(w, w_j) denotes the maximum score that can be found for a word w compared with all words w_j \in t_j, j = 1, 2.
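A minimal sketch of the aggregation strategy in Equation 3.1 might look as follows; the word similarity function and the idf dictionary are placeholders for whichever measure and corpus statistics are chosen, and missing idf values simply default to 1.0 here.

```python
from typing import Callable, Dict, List

def max_sim(word: str, other_text: List[str],
            word_sim: Callable[[str, str], float]) -> float:
    # Best pairwise similarity of `word` against all words of the other text.
    return max(word_sim(word, w) for w in other_text)

def mihalcea_similarity(t1: List[str], t2: List[str],
                        word_sim: Callable[[str, str], float],
                        idf: Dict[str, float]) -> float:
    """Bidirectional, idf-weighted aggregation as in Equation 3.1."""
    def directional(source: List[str], target: List[str]) -> float:
        weighted = sum(max_sim(w, target, word_sim) * idf.get(w, 1.0) for w in source)
        return weighted / sum(idf.get(w, 1.0) for w in source)
    return 0.5 * (directional(t1, t2) + directional(t2, t1))
```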

3.1.2 Li et al.(2006) The work by Li et al.(2006) uses the shortest path (Rada et al., 1989) word similarity measure in conjunction with a word order similarity measure. The latter is intended to avoid the shortcomings of traditional bag-of-words models which would compute a perfect similarity between the texts A quick brown fox jumps over the lazy dog and A quick brown dog jumps over the lazy fox. Two word order vectors are compared by their normalized difference, before the measure then weights all word pairs by their information content (Resnik, 1995), which is estimated as the probability of both words co-occurring in the Brown Corpus (Kucera and Francis, 1967). The weighted pairwise word similarity scores then serve as the vector elements of two text vectors, which are compared using cosine similarity to compute the final text similarity score.

2A 100 million word corpus containing various texts of current British English, http://www.natcorp.ox.ac.uk

3.1.3 Islam and Inkpen (2008)

This work uses a combination of two word similarity measures: longest common subsequence (Allison and Dix, 1986) and Second Order Co-occurrence PMI (SOC-PMI) (Islam and Inkpen, 2006) based on corpus statistics from the British National Corpus. The latter is a modification of the original PMI-IR measure (Turney, 2001) which has also been used by Mihalcea et al. (2006). A weighted sum of both measures is computed to determine the pairwise word similarity score. In order to aggregate these scores for the complete texts, the following strategy is applied:

sim(t_1, t_2) = \frac{(\delta + \sum_{i=1}^{|\rho|} \rho_i) \cdot (|t_1| + |t_2|)}{2 \cdot |t_1| \cdot |t_2|}    (3.2)

where t_1, t_2 are the two texts with |t_1| \leq |t_2|. The number of words which match perfectly in t_1 and t_2 according to their string sequences is denoted by \delta, and \rho is the remaining list of words. For those, word similarity is computed as \rho_i = maxSim(w_i, t_2) for i = 1, \ldots, |\rho|, using the maxSim notation introduced in Section 3.1.1.

In a follow-up work by Ho et al. (2010), the word similarity measure by Yang and Powers (2005) is used instead of the SOC-PMI metric, while the rest of the original proposal stays unmodified. In another follow-up work by Islam et al. (2012), both the word similarity measure and the aggregation strategy are modified. Word similarity is now computed from a single component only, namely pairwise word similarity scores based on 3-gram frequencies found in the Google Web1T corpus (Brants and Franz, 2006)3. In the aggregation step, the maxSim function now determines the best-matching word in t_2 by incorporating the arithmetic mean and standard deviation across all words in t_2.
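The following sketch illustrates Equation 3.2; counting exact matches for δ by removing matched tokens is one possible reading of the description above, and the word similarity function is again a placeholder.

```python
from typing import Callable, List

def islam_inkpen_similarity(t1: List[str], t2: List[str],
                            word_sim: Callable[[str, str], float]) -> float:
    """Aggregation strategy of Equation 3.2; t1 is assumed to be the shorter text."""
    if len(t1) > len(t2):
        t1, t2 = t2, t1
    unmatched2 = list(t2)
    delta = 0          # number of perfect string matches
    rho_words = []     # remaining words of t1, compared via word similarity
    for w in t1:
        if w in unmatched2:
            delta += 1
            unmatched2.remove(w)
        else:
            rho_words.append(w)
    rho = [max((word_sim(w, v) for v in t2), default=0.0) for w in rho_words]
    return (delta + sum(rho)) * (len(t1) + len(t2)) / (2 * len(t1) * len(t2))
```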

3.1.4 Tsatsaronis et al. (2010)

This work relies on a modified shortest path distance measure (Rada et al., 1989) that weights the shortest path length in WordNet by compactness and path depth. Compactness is the assessment of frequencies of edge types (e.g. hypernym/hyponym, nominalization, etc.) that constitute the path, and path depth is a function to tell whether a path passes through more general or more specific concepts. The aggregation strategy is a directional strategy similar to the one proposed by Mihalcea et al. (2006). It differs in that (i) it weights the pairwise word similarity scores based on the harmonic mean of the two words’ tf-idf scores rather than only one word’s idf score, and (ii) it does not normalize the similarity scores, e.g. by the idf score as done by Mihalcea et al. (2006).

3.2 Non-Compositional Measures

Figure 3.2 shows a high-level depiction of the steps that are performed by non-compositional measures to compute text similarity between two given input texts.

3Google Web1T (Brants and Franz, 2006) is a corpus comprising n-grams for n = 1,..., 5 which have been created from approximately 1 trillion words found on public web pages.


Figure 3.2: High-level depiction of the steps that are performed by non-compositional measures to compute text similarity between two given input texts: After pre-processing, the texts are projected onto certain models such as high-dimensional vector spaces or graph structures, before the overall text similarity score tsim is then computed based on the abstract representations.

After pre-processing, the texts are projected onto certain models such as high-dimensional vector spaces or graph structures, before text similarity is then computed based on the abstract representations. For the compositional measures which we discussed in the previous section, the two major steps (word similarity computation, aggregation) were modular and fully interchangeable. Any compositional measure thus is an instantiation of measures and strategies for these steps. In order to form a new compositional measure, any combination of a new or an existing word similarity measure and an aggregation strategy may be chosen. In contrast, the two major steps for non-compositional measures (projection onto a model, similarity computation) are an integral part of the text similarity measure and typically not interchangeable. A non-compositional measure is thus more than just a combination of a particular projection and similarity computation step. Because the abstract models differ widely between the proposed measures, a similarity computation step specifically tailored towards the employed models is required. In Table 3.2, we summarize the existing non-compositional measures which we discuss in the following.

3.2.1 String distance metrics

A basic way to compare two texts is to take their representations at the character level and compare them solely based on their string sequences, without any semantic processing. One possibility is to compare the texts’ longest common substring (Gusfield, 1997). Thereby, the length l of the longest contiguous character sequence lcs shared between the two texts t1 and t2 is compared with the text lengths:

sim(t_1, t_2) = 1 - \frac{l(t_1) + l(t_2) - 2 \cdot l(lcs(t_1, t_2))}{l(t_1) + l(t_2)}    (3.3)

However, this measure has limitations, e.g. in cases of word insertions/deletions or typographical errors which break the common substring. The longest common subsequence measure (Allison and Dix, 1986) overcomes this by dropping the contiguity requirement. Similarity is then computed according to Equation 3.3, whereby the lcs function now refers to the non-contiguous shared subsequence.

Greedy String Tiling (Wise, 1996) is a method which further allows to deal with shared substrings which do not appear in the same order in both texts. The measure determines the set of shared contiguous substrings, whereby each substring is a match of maximal length. Similarity is then computed as the number of marked characters (i.e. those participating in any shared substring) divided by the text length.

Table 3.2 Overview of existing non-compositional text similarity measures

Text Similarity Measure: Description

Character n-gram profiles (Keselj et al., 2003): Character n-gram set comparisons
Explicit Semantic Analysis (Gabrilovich and Markovitch, 2007): Vector space model on Wikipedia
Greedy String Tiling (Wise, 1996): Shared substrings of maximal length
Jaro distance (Jaro, 1989): Short string matching (e.g. person or place names)
Jaro-Winkler distance (Winkler, 1990): Short string and prefix matching
Kennedy and Szpakowicz (2008): Projection onto Roget’s Thesaurus or WordNet
Latent Semantic Analysis (Landauer et al., 1998): Semantic space on corpus statistics
Levenshtein distance (Levenshtein, 1966): Edit-distance metric (uniform costs)
Longest Common Subsequence (Allison and Dix, 1986): Longest non-contiguous character sequence
Longest Common Substring (Gusfield, 1997): Longest contiguous character sequence
Monge Elkan distance (Monge and Elkan, 1997): Edit-distance metric (affine gap model)
Random Walks (Ramage et al., 2009): Modified PageRank on WordNet
Vector Space Model (Salton and McGill, 1983): Generalized vector space model on word weights
WikiWalk (Yeh et al., 2009): Modified PageRank on Wikipedia
Word n-grams (Lyon et al., 2001): Word n-gram set comparisons

A number of other popular string distance metrics have also been proposed in the past, where a smaller distance corresponds to a higher similarity (Cohen et al., 2003b). The popular Jaro distance (Jaro, 1989) is especially suitable for short strings such as person or place names:

sim_{jaro}(t_1, t_2) = \frac{1}{3} \left( \frac{m}{|t_1|} + \frac{m}{|t_2|} + \frac{m - t}{m} \right) \text{ if } m > 0 \text{, and } 0 \text{ otherwise}    (3.4)

where m refers to the number of matching characters between t_1 and t_2. A character a_i \in t_1 matches a character b_j \in t_2 if a_i = b_j and i - d \leq j \leq i + d, where d = \lfloor \max(|t_1|, |t_2|) / 2 \rfloor - 1. Matching characters which do not appear in the same order in both texts are called transpositions, and t is defined as half of the number of transpositions. The Jaro-Winkler distance (Winkler, 1990) is a variation of the original metric which assigns a higher similarity score to texts with a matching prefix, i.e. texts which match from the beginning rather than at any position within the string sequence:

sim_{winkler}(t_1, t_2) = sim_{jaro}(t_1, t_2) + l \cdot p \cdot (1 - sim_{jaro}(t_1, t_2))    (3.5)

where 0 \leq l \leq 4 is the number of characters in a common prefix of t_1 and t_2, and p is a scaling factor for assigning higher weights to a longer common prefix, originally set to 0.1 (Winkler, 1990).

The following measures are often referred to as edit-distance metrics. Here, the distance between two texts t_1 and t_2 is the minimum number of edit operations that transform t_1 into t_2. The Levenshtein distance (Levenshtein, 1966) is a simple metric that assigns uniform costs to all edit operations: insertion, deletion, and substitution. The Monge Elkan distance (Monge and Elkan, 1997) is an edit-distance metric that uses an affine gap model. The intuition behind this model is that particular sequences of alignments and misalignments between character sequences are more likely to occur than others. For this metric, edit operations have varying costs: exact matches +5, mismatches −5, and approximate matches 2, which occur if both characters are in one of the following sets: {dt}, {gj}, {lr}, {mn}, {bpv}, {aeiou}, or {,.}. The costs for starting and continuing a gap are +5 and +1, respectively.
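As an illustration of the string-level measures, the following sketch implements the longest common substring similarity of Equation 3.3 with a simple dynamic program; it is meant as a readable reference, not an optimized implementation.

```python
def longest_common_substring_length(t1: str, t2: str) -> int:
    # Dynamic programming over character positions, O(|t1| * |t2|) time.
    best = 0
    prev = [0] * (len(t2) + 1)
    for i in range(1, len(t1) + 1):
        curr = [0] * (len(t2) + 1)
        for j in range(1, len(t2) + 1):
            if t1[i - 1] == t2[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def substring_similarity(t1: str, t2: str) -> float:
    """Similarity according to Equation 3.3, based on the longest common substring."""
    l1, l2 = len(t1), len(t2)
    lcs = longest_common_substring_length(t1, t2)
    return 1.0 - (l1 + l2 - 2 * lcs) / (l1 + l2)

print(substring_similarity("text similarity", "text similarly"))
```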

3.2.2 n-gram models

Text similarity measures based on n-gram text representations exist for words and characters. Character n-gram profiles (Keselj et al., 2003), as implemented by Barrón-Cedeño et al. (2010), discard all characters (case insensitive) which are not in the alphabet Σ = {a, . . . , z, 0, . . . , 9}. All n-grams on character level are then generated and weighted by a tf-idf scheme. While in the original implementation only n = 3 is used, other values for n may also be considered. Finally, the feature vectors of both string sequences are compared by computing the cosine between them.

A method for comparing texts by means of word n-grams has been proposed by Lyon et al. (2001). Two sets of n-grams are generated for both texts, and may then be compared using the Jaccard coefficient, as originally done:

sim(t_1, t_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}    (3.6)

or using the containment measure (Broder, 1997):

sim(t_1, t_2) = \frac{|T_1 \cap T_2|}{|T_1|}    (3.7)

where T_i is the set of n-grams for the text t_i, i = 1, 2. While in their original work n = 3 was proposed, other values for n may also be considered.
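Both set comparisons are straightforward to sketch; the whitespace tokenization and the default n = 3 below are simplifying assumptions.

```python
from typing import Set, Tuple

def word_ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard_similarity(t1: str, t2: str, n: int = 3) -> float:
    """Equation 3.6: n-gram overlap normalized by the union of both sets."""
    s1, s2 = word_ngrams(t1, n), word_ngrams(t2, n)
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

def containment_similarity(t1: str, t2: str, n: int = 3) -> float:
    """Equation 3.7: n-gram overlap normalized by the n-grams of the first text."""
    s1, s2 = word_ngrams(t1, n), word_ngrams(t2, n)
    return len(s1 & s2) / len(s1) if s1 else 0.0
```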

3.2.3 Vector Space Model

The vector space model (Salton and McGill, 1983) allows to project texts onto vectors which are used to compute text similarity. The basic model is very general: Within a set of objects, each object is defined by a number of properties. For two objects t_1, t_2, a shared set of properties is compiled, for which two vectors are constructed. The vector elements thereby are any kind of weights (e.g. binary weights indicating the presence or absence of a property, or a property’s strength of association with an object) that are associated with a particular property in the given text. Finally, the vectors are compared using a vector similarity function such as the cosine:

sim(t_1, t_2) = \frac{\sum_{i=1}^{n} t_{1,i} \cdot t_{2,i}}{\sqrt{\sum_{i=1}^{n} (t_{1,i})^2} \cdot \sqrt{\sum_{i=1}^{n} (t_{2,i})^2}}    (3.8)

For the application to texts, objects typically correspond to texts, and properties to words. A popular weighting scheme is tf-idf or variations thereof. It is often used as a scoring means in information retrieval to match a query with a set of documents (Manning et al., 2008). The vectors are compared by computing the cosine between them. This configuration of the vector space model (tf-idf weights, cosine) is often referred to as cosine similarity and is still a strong baseline in such applications.
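The tf-idf/cosine configuration can be reproduced with off-the-shelf components; the sketch below assumes that scikit-learn is available and, for brevity, fits the idf statistics on the two input texts alone, whereas a realistic setup would fit them on a larger background collection.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def vsm_similarity(text1: str, text2: str) -> float:
    """Cosine similarity between the tf-idf vectors of the two texts."""
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform([text1, text2])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

print(vsm_similarity("a quick brown fox jumps over the lazy dog",
                     "a quick brown dog jumps over the lazy fox"))
```

Note that the bag-of-words example from Section 3.1.2 yields a perfect score here, which illustrates why word order measures were proposed as a complement.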

3.2.4 Latent Semantic Analysis

Landauer et al. (1998) present a technique for representing a text T in a semantic space based on corpus statistics. It casts a large representative document collection D into a term-document matrix. Each matrix element reflects the weight of a term w with respect to a document d ∈ D. The original matrix is then decomposed by Singular Value Decomposition (SVD) into a matrix of reduced rank. Text similarity is then computed by projecting both texts into a vector space (Salton and McGill, 1983) based on the reduced-rank matrix, and by comparing these vectors using cosine similarity.
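A compact way to sketch this pipeline uses truncated SVD from scikit-learn; the tiny background collection and the number of components are placeholders, as in practice the matrix is built from a large corpus and reduced to a few hundred dimensions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder for a large representative document collection D.
corpus = ["the cat sat on the mat",
          "dogs and cats are popular pets",
          "a constitution is a set of laws"]

vectorizer = TfidfVectorizer()
term_doc = vectorizer.fit_transform(corpus)       # tf-idf weighted document-term matrix
svd = TruncatedSVD(n_components=2).fit(term_doc)  # rank-reduced semantic space

def lsa_similarity(text1: str, text2: str) -> float:
    """Project both texts into the reduced space and compare them by cosine."""
    reduced = svd.transform(vectorizer.transform([text1, text2]))
    return float(cosine_similarity(reduced[0:1], reduced[1:2])[0, 0])

print(lsa_similarity("cats are pets", "dogs are pets"))
```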

3.2.5 Explicit Semantic Analysis

Gabrilovich and Markovitch (2007) introduced a method for representing and comparing texts of any length in a high-dimensional vector space. The vector space is constructed based on a given document collection D, where the documents are assumed to describe natural concepts such as cat or dog (concept hypothesis). While Wikipedia was proposed as the collection of choice in the original work, we follow Zesch et al. (2008) and also use two additional collections: Wiktionary4 and WordNet (Fellbaum, 1998). In the construction phase, a term-document matrix is built using a tf-idf weighting scheme of terms with respect to the documents d ∈ D. Each word w ∈ t is represented by the concept vector ~c(w) of the corresponding row in the term-document matrix, where each vector element denotes the strength of association with a particular document d. For |t| > 1 (i.e. texts rather than single words) the vector ~c(t) consists of the sum of the individual word vectors ~c(w) for all w ∈ t. Finally, two concept vectors are compared using cosine similarity.

Contrary to methods based on lexical-semantic resources such as WordNet, ESA hence does not need to operate on expensive expert-created knowledge resources. Virtually any document collection can be used for constructing the vector space, and as Anderka and Stein (2009) showed, not even the concept hypothesis needs to hold true. However, for document collections where the concept hypothesis does hold true, ESA allows to explain similarity between two texts “explicitly” by inspecting the top-ranked elements in the concept vectors. For the words cat and dog, for example, a strong association with the natural concepts mammal and carnivore is likely to be found.
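The construction can be sketched with a small amount of linear algebra; the three "concept" documents below are placeholders for a collection such as Wikipedia, and the whitespace tokenization is a simplification.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder for the concept collection D (e.g. Wikipedia articles).
concept_docs = ["cats are small domesticated mammals and carnivores",
                "dogs are domesticated mammals kept as pets",
                "a constitution is a fundamental legal document"]

vectorizer = TfidfVectorizer()
term_doc = vectorizer.fit_transform(concept_docs).toarray()  # shape: (documents, terms)
vocabulary = vectorizer.vocabulary_

def concept_vector(text: str) -> np.ndarray:
    """Sum of the word concept vectors, i.e. association strengths with each document."""
    vec = np.zeros(term_doc.shape[0])
    for word in text.lower().split():
        index = vocabulary.get(word)
        if index is not None:
            vec += term_doc[:, index]  # column = association of the word with all documents
        # out-of-vocabulary words simply contribute nothing
    return vec

def esa_similarity(text1: str, text2: str) -> float:
    c1, c2 = concept_vector(text1), concept_vector(text2)
    return float(cosine_similarity(c1.reshape(1, -1), c2.reshape(1, -1))[0, 0])

print(esa_similarity("cats are mammals", "dogs are mammals"))
```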

3.2.6 Kennedy and Szpakowicz (2008)

This work proposes a text similarity measure based on either WordNet (Fellbaum, 1998) or Roget’s Thesaurus. The latter is an English thesaurus, first introduced in 1852. Kennedy and Szpakowicz (2008) used the two versions released in 1911⁵ and 1987, one version at a time. WordNet groups semantically similar words in synsets, and explicitly encodes the relations between them.

4Wiktionary is a collaboratively constructed, web-based multi-lingual dictionary, http://www.wiktionary.org
5The 1911 edition of Roget’s Thesaurus is available at http://www.gutenberg.org/ebooks/22

Roget’s Thesaurus, on the other hand, consists of eight clusters of thematically related words, where relations are only implicit. Each cluster contains a multi-level hierarchy of word sets, where each set further down in the hierarchy refines the parent set. In the 1911 edition, for example, the word jewellery is listed in the top-level cluster Words Relating to the Sentiment and Moral Powers, then in personal affections (level 1), discriminative affections (level 2), ornament/jewelry/blemish (level 3), and so on (Kennedy and Szpakowicz, 2008).

The text similarity measure works as follows: First, all words are mapped onto word sets in the lexical-semantic resource, i.e. synsets in WordNet or the word sets in Roget’s Thesaurus, by assigning equal weight 1/n to all n sets a word belongs to (for monosemous words, n = 1). Each word set then (i) receives the aggregated weights of all word sets further down in the hierarchy, and (ii) is weighted according to its depth in the hierarchy. For two texts, Kennedy and Szpakowicz (2008) transform the weighted concept hierarchies into flat vectors and compare them using cosine similarity.

3.2.7 Ramage et al. (2009); Yeh et al. (2009)

In the proposed graph-based models, texts are represented as graphs G = (V, E) of weighted nodes V and edges E. Nodes correspond to words, edges to relations between the words, and node and edge weights to their importance. Depending on the concrete implementation, a relation between two words holds, e.g., if a lexical-semantic relation such as hypernymy exists between them, or if one word co-occurs with the other above a given frequency threshold. Two measures have been proposed in recent years by Ramage et al. (2009) and Yeh et al. (2009) which employ a similar algorithm, but differ in the lexical-semantic resource they use to construct the graph, as well as in the assignment of the initial node weights. In both works, a modified PageRank algorithm (Brin and Page, 1998) is applied to the text graphs, and random walks are iterated on them until a convergence criterion is met. The final text similarity score is computed by comparing the converged semantic signatures (Ramage et al., 2009), i.e. the stationary distributions, by means of a measure such as cosine similarity or the Jensen-Shannon divergence (Lin, 1991).

For the measure proposed by Ramage et al. (2009), the graph is constructed using WordNet. The nodes are all synsets (e.g. foot#n#1), all of the possibly ambiguous part-of-speech tagged words (e.g. foot#n), and all untagged words (e.g. foot). The connecting edges are all relations found in WordNet, as well as links from POS-tagged words to synsets and from untagged words to their POS-tagged instances. The WordNet relations receive uniform weights, all others are derived from corpus statistics (Hughes and Ramage, 2007). An initial distribution of tf-idf weights (Salton and McGill, 1983) is assigned to each node.

Yeh et al. (2009) use Wikipedia as a lexical-semantic resource instead of WordNet, as they raise concerns about WordNet’s limited coverage of world knowledge. In the Wikipedia graph, nodes represent Wikipedia articles, and edges are all links between these articles. For the initial distribution of node weights, the authors propose to use a direct mapping from individual words to Wikipedia articles, as well as Explicit Semantic Analysis (Gabrilovich and Markovitch, 2007), which we described above in detail.

3.3 Chapter Summary

In this chapter, we discussed a variety of text similarity measures that have been proposed in the literature, which we classify into compositional and non-compositional measures. We discussed that a number of other classification schemes have previously been proposed. However, we argued that those have serious limitations and shortcomings: they work well for word similarity measures, but fail when it comes to classifying text similarity measures.

From an algorithmic point of view, runtime complexity varies across both classes and the corresponding measures. Compositional measures are typically expensive as they compute pairwise word similarity between all words in two texts. Depending on the nature of the underlying word similarity measure, this may require considerable processing time. Non-compositional measures, however, cannot generally be said to be superior in runtime complexity. While some of the measures such as the string distance metrics require little to no effort to project the input texts onto the model, other measures such as Latent Semantic Analysis and Explicit Semantic Analysis need to build up very large models first, before then being able to project texts onto them. Usually, however, the creation of the models is a one-time effort which casts a large document collection such as Wikipedia onto its model representation. The comparison of two texts can then be done rather efficiently.

Another observation is that a virtually unlimited number of compositional measures exists, as they can be formed from arbitrary combinations of new or existing word similarity measures and a suitable aggregation strategy (see Table 3.1). Even though some combinations have been explicitly proposed in the literature, as discussed above, others may achieve good results as well. It is still unclear which combinations work well for what kind of data, and the literature on compositional measures typically gives no clue as to why particular measures and strategies have been preferred over others. In our opinion, all measures have their inherent strengths and weaknesses. We believe that all of them judge text similarity along particular text characteristics, and—as we will see later in Chapters 6 and 7—instead of creating more and more separate measures, it seems a promising research direction to harness the potential of the existing measures and combine them in a single model, in order to capture a wide variety of text characteristics.

Chapter 4

Evaluation Methodology

In this chapter, we discuss the methodology for evaluating text similarity measures. The performance of text similarity measures can be evaluated either in an intrinsic or an extrinsic evaluation. In an intrinsic evaluation, the performance of text similarity measures is evaluated in an isolated setting. In an extrinsic evaluation, the performance of text similarity measures is evaluated with respect to a particular task at hand, where text similarity is a means for solving a concrete problem. In the following, we discuss both types of evaluation, and present datasets and evaluation metrics which have been widely used in the literature.

4.1 Intrinsic Evaluation

An intrinsic evaluation assesses the performance of text similarity measures in an isolated setting. The datasets for an intrinsic evaluation therefore contain text pairs along with human similarity judgments. The intuition is that machines should be able to judge text similarity for the given pairs in a similar manner as humans do. Systems are thereby expected to output continuous text similarity scores within a given interval, e.g. between 0 and 5, where 0 means not similar at all and 5 means perfectly similar. Evaluation is then carried out by comparing the system results with the human judgments. In the following, we introduce the datasets and the evaluation metrics which have been widely used in the literature. In Chapter 6, we present details and results on an intrinsic evaluation which we carried out in the context of the Semantic Textual Similarity (STS) Task (Agirre et al., 2012) at the Semantic Evaluation (SemEval) workshop.

4.1.1 Datasets

Popular datasets for an intrinsic evaluation of text similarity measures are the 30 Sentence Pairs dataset (Li et al., 2006), the 50 Short Texts collection (Lee et al., 2005), the Microsoft Paraphrase Corpus (Dolan et al., 2004), and the Microsoft Video Description Paraphrase Corpus (Chen and Dolan, 2011).


Table 4.1 Statistics for seven datasets which have been used for the evaluation of text similarity measures. We classify them as datasets for an intrinsic (top) and extrinsic evaluation (bottom).

Intrinsic evaluation datasets:

30 Sentence Pairs (Li et al., 2006): Concept Definitions; 30 pairs; length in terms 5–33 (avg. 14); rating scale 0–4; 32 judges per pair
50 Short Texts (Lee et al., 2005): News (Politics); 1,225 pairs; length in terms 45–126 (avg. 80); rating scale 1–5; 8–12 judges per pair
Microsoft Paraphrase Corpus (Dolan et al., 2004)1: News; 5,801 pairs; length in terms 5–31 (avg. 19); binary rating; 2–3 judges per pair
MS Video Desc. Paraphr. Corpus (Chen and Dolan, 2011)1: Video descriptions; 120K pairs; length in terms 1–50 (avg. 7); binary rating; number of judges n/a

Extrinsic evaluation datasets:

Wikipedia Rewrite Corpus (Clough and Stevenson, 2011)3: Computer Science; 95 pairs; length in terms 36–343 (avg. 208); 4-way rating; 1 judge per pair
METER Corpus (Gaizauskas et al., 2001)3: News (Law & Court, Show Business); 1,716 pairs; length in terms 17–1K (avg. 205); 3-way rating; 1 judge per pair
Webis Crowd Paraphrase Corpus (Burrows et al., 2013)3: Book excerpts; 7,859 pairs; length in terms 28–954 (avg. 618); binary rating; 1 judge per pair

Each dataset contains text pairs which are accompanied by human judgments about their perceived similarity.1 In Table 4.1, we summarize the statistics for these datasets.2

30 Sentence Pairs

Li et al. (2006) introduced a collection of 65 sentence pairs. The dataset is an extension of the collection of noun pairs by Rubenstein and Goodenough (1965), where each noun was replaced by its definition from Collins Cobuild English Dictionary (Sinclair, 2001). An example where the original noun pair gem/jewel was replaced by a sentence pair was already given in Example 1.1 on page 1. The dataset is accompanied by judgments from 32 subjects on how similar in meaning one sentence is to another. As the distribution of similarity scores was heavily skewed towards low scores, Li et al. (2006) selected 30 pairs to reduce the bias in the frequency distribution. We refer to this subset by the name 30 Sentence Pairs, henceforth. In the literature, this dataset has been used for evaluation, for example, by Islam and Inkpen (2008), Kennedy and Szpakowicz (2008), and Tsatsaronis et al. (2010).

We further conducted a re-rating study in order to evaluate whether the human judgments in this dataset are stable across time and subjects. Therefore, we adapted the experiments by Miller and Charles (1991), who showed that human similarity judgments for pairs of words are stable over a long time span (more than 25 years in this case). We collected 10 judgments per pair asking: “How close do these sentences come to meaning the same thing?”4

1A subset of the Microsoft Paraphrase Corpus and the Microsoft Video Description Paraphrase Corpus have been re-annotated with human judgments in the context of the pilot Semantic Textual Similarity (STS) Task (Agirre et al., 2012) at the Semantic Evaluation (SemEval) workshop. We will report details in Section 6.2.
2Table 4.1 contains statistics for both intrinsic and extrinsic evaluation datasets. We discuss the latter in Section 4.2.1.
3We report the class distributions along with the dataset descriptions in Section 4.2.1.
4We asked the same question as in the original study by Li et al. (2006). We used Amazon Mechanical Turk (http://www.mturk.com) via CrowdFlower (http://crowdflower.com).

Spearman’s rank correlation of the aggregated results with the original scores is ρ = 0.91. From the high correlation, we conclude that the human judgments are indeed stable across time and subjects. It also indicates that humans indeed share a common understanding of what makes texts similar; otherwise, such a high correlation could not have been observed. The detailed results of this study can be found in Appendix A.1.1.

In order to better understand the characteristics of this dataset, we performed a study which aimed at investigating human rationales behind similarity judgments. For each text pair we asked the annotators: “Why did people agree that these two sentences are (not) close in meaning?” We collected 10 judgments per pair in the same crowdsourcing setting as described above. Interestingly, the annotators only used lexical-semantic relations between words to justify the similarity relation between texts. For example, the text pairs for the original word pairs tool/implement and cemetery/graveyard were consistently said to be synonymous. We conclude that—in this setting—humans rather reduce text similarity to word similarity. As the text pairs are originally based on term pairs, we further computed the Spearman correlation between the text pair scores and the original term pair scores. The very high correlation of ρ = 0.94 supports the results of our annotation study that the annotators indeed judged the similarity between words rather than texts. In consequence, we believe that only very limited conclusions can be drawn when using this dataset for the evaluation of text similarity measures. We thus decided not to use this dataset in the intrinsic evaluation.

50 Short Texts

The dataset by Lee et al. (2005) has been used for evaluation, for example, by Gabrilovich and Markovitch (2007) and Yeh et al. (2009). It comprises 50 short texts (45 to 126 words in length)5 which contain newspaper articles from the political domain. For the annotation by human judges, each text was paired with every other one, resulting in 1,225 distinct text pairs. Human judgments were collected from 83 subjects who were paid on a per-100-ratings basis. The subjects were asked to rate “how similar they felt the documents were” on a discrete 1–5 scale (higher values indicating higher similarity). In the end, each pair received between 8 and 12 human scores, which were averaged to create the final scores. An example text pair is shown in Figure 4.1. Here, the first text talks about a Chinese car registration system, while the second one elaborates on the topic of Chinese workers seeking employment in Russia. The average human similarity rating for this pair is in the medium range (2.80) on the 1–5 scale, which reflects the intuition that both texts share particular topics (e.g. China), but differ in various aspects (e.g. car registration system vs. work & employment).

In analogy to the study for the 30 Sentence Pairs dataset, we performed an annotation study to show whether the encoded judgments are stable across time and subjects. We therefore selected a subset of 50 pairs which have a uniform distribution of judgments across the full similarity range. We asked three human annotators to rate “How similar are the given texts?”.

5Lee et al.(2005) report the shortest document having 51 words probably due to a different tokenization strategy.

(a) Beijing has abruptly withdrawn a new car registration system after drivers demonstrated “an unhealthy fixation” with symbols of Western military and industrial strength - such as FBI and 007. Senior officials have been infuriated by a popular demonstration of interest in American institutions such as the FBI. Particularly galling was one man’s choice of TMD, which stands for Theatre Missile Defence, a US-designed missile system that is regularly vilified by Chinese propaganda channels.

(b) The Russian defense minister said residents shouldn’t feel threatened by the growing number of Chinese workers seeking employment in the country’s sparsely populated Far Eastern and Siberian regions. There are no exact figures for the number of Chinese working in Russia, but estimates range from 200,000 to as many as 5 million. Most are in the Russian Far East, where they arrive with legitimate work visas to do seasonal work on Russia’s low-tech, labor-intensive farms.

Figure 4.1: Example sentence pair taken from the 50 Short Texts dataset by Lee et al. (2005). The average human similarity score of this pair is 2.80 (standard deviation 0.98) on a 1–5 scale.

The resulting Spearman correlation between the aggregated results of the annotators and the original scores is ρ = 0.88. This shows that again judgments are stable across time and subjects. The detailed results of this study can be found in Appendix A.1.2.

Even though this dataset has also been widely adopted for the evaluation of text similarity measures, we argue—similarly to the 30 Sentence Pairs dataset—that the evaluation on this dataset only allows to draw limited conclusions. In this dataset, the distribution of similarity scores is heavily skewed towards low scores: 82% of all text pairs have a text similarity score between 1 and 2 on the 1–5 scale, i.e. are in the lowest similarity range. These text pairs are virtually not similar at all, while the remaining 18% of pairs are spread across the remaining similarity range. We depict the frequency distribution of text pairs across the full similarity range in Figure 4.2.

Figure 4.2: Distribution of similarity scores (0.1 intervals) for the 50 Short Texts dataset.

This limits the kind of conclusions that can be drawn, as the number of pairs in the most interesting class of highly similar pairs is actually very small. If a text similarity measure performs well on this dataset, this rather tells us that it is well able to determine text pairs which are not similar at all, but it does not allow for further conclusions on the most interesting class of highly similar pairs.

Microsoft Paraphrase Corpus

Dolan et al. (2004) introduced a dataset of 5,801 sentence pairs (5 to 31 words in length) taken from news sources on the Web. The objective of this corpus is different from the corpora presented above: It originates in the field of paraphrase recognition, and hence is originally not accompanied by continuous human similarity scores. Rather, the text pairs are binary classified as paraphrase or no paraphrase. Despite the simpler annotation scheme, the text similarity community has widely adopted this dataset for evaluating text similarity measures (Mihalcea et al., 2006; Islam and Inkpen, 2008; Ramage et al., 2009; Tsatsaronis et al., 2010).

Originally, binary judgments were collected from 2–3 subjects who indicated whether a pair captures a paraphrase relationship or not (83% inter-annotator agreement)6. In the end, 3,900 (67%) of all pairs were positive examples. According to the instructions which were given to the annotators, a paraphrase relationship holds if two sentences are “more or less semantically equivalent”. The loose definition is explained in more detail by several positive and negative examples in the annotation guidelines, two of which are shown in Figure 4.3. For the positive example, the same content is expressed via similar lexical items (underlined). The negative example, on the other hand, does not capture a paraphrase relationship due to the lack of details (underlined), even though both sentences share a basic description of the same event.

Due to the nature of the original dataset described above, only an extrinsic evaluation could be carried out, where the goal would be to identify whether a paraphrase relationship holds or not. However, in the context of the Semantic Textual Similarity (STS) Task (Agirre et al., 2012) at the Semantic Evaluation (SemEval) workshop, a subset of 1,500 text pairs from this dataset was re-annotated with human similarity judgments on a continuous 0–5 scale. That way, the text pairs of the original dataset have been reused, but instead of the binary paraphrase annotations, the new dataset is accompanied by continuous human judgments on text similarity. Hence, the revised dataset is well suited for an intrinsic evaluation and may be highly beneficial for future work on text similarity. In Chapter 6, we conduct our intrinsic evaluation on this dataset in addition to four other datasets. The revised dataset is freely available.7

Microsoft Video Description Paraphrase Corpus

Chen and Dolan (2011) originally introduced this dataset in the field of paraphrase recognition to address the lack of a large-scale evaluation dataset. In their study, they presented short (i.e. typically less than 10 seconds) video clips to human subjects and asked each one to provide a single sentence description of the video’s “main action or event”. They carried out the study in a crowdsourcing setting using Amazon Mechanical Turk.

6The reported agreement is the fraction of text pairs that the annotators agreed on (no chance correction).
7http://www.cs.york.ac.uk/semeval-2012/task6/index.php?id=data

Positive example: equivalent content

(a) The Senate Select Committee on Intelligence is preparing a blistering report on prewar intelligence on Iraq.

(b) American intelligence leading up to the war on Iraq will be criticised by a powerful US Congressional committee due to report soon, officials said today.

Negative example: shared content of the same event, but lacking details

(a) Researchers have identified a genetic pilot light for puberty in both mice and humans.

(b) The discovery of a gene that appears to be a key regulator of puberty in humans and mice could lead to new infertility treatments and contraceptives.

Figure 4.3: Positive and negative paraphrase examples taken from the annotation guidelines of the Microsoft Research Paraphrase Corpus (Dolan et al., 2004). Semantically overlapping and discriminating phrases are underlined.

In total, the collection contains about 120,000 multilingual parallel descriptions (approximately 85,000 in English) of more than 2,000 video clips. The parallel descriptions of a single video clip can then be considered paraphrases of each other. The authors argue that gathering paraphrases in this particular way overcomes the limitations of traditional paraphrase acquisition: The subjects are not biased by any lexical or stylistic choices of an original sentence which is then to be paraphrased.

As with the aforementioned dataset, this dataset is originally suited only for an extrinsic evaluation where the goal would be to identify whether a paraphrase relationship holds or not. However, this dataset has also been adopted for the evaluation of text similarity measures in an intrinsic evaluation in the context of the Semantic Textual Similarity Task (Agirre et al., 2012) at SemEval-2012. Human subjects annotated the similarity of a subset of 1,500 text pairs from this dataset on a continuous 0–5 scale. We report details on our intrinsic evaluation on this dataset in Chapter 6. The annotated subset is again freely available.7

4.1.2 Metrics

In an intrinsic evaluation, the system output is usually compared with human judgments by computing Pearson or Spearman correlation, which we discuss in the following.

Pearson’s r

The correlation coefficient is typically denoted by r and measures the strength of linear dependence between two similarity score vectors, i.e. how well the system reflects human judgments. The value of r lies in the interval [−1, 1], where r = 0 can be interpreted as not similar at all, and r = −1 and r = 1 are perfect linear relationships (completely similar), i.e. in a plotted two-dimensional graph all data points lie on a line which either increases (+1) or decreases (−1). Pearson’s r between two score vectors $\vec{x}$ and $\vec{y}$ is computed as follows, with $n = |\vec{x}| = |\vec{y}|$:

$$r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}} \qquad (4.1)$$

Linear transformations of the similarity scores do not affect the correlation. For example, a perfect correlation r = 1 is computed between $\vec{x} = [1, 2, 3, 4]$ and $\vec{y} = [2, 4, 6, 8]$. Pearson’s r, however, cannot cope with non-linear relationships such as $\vec{x} = [1, 2, 3, 4]$ and $\vec{y} = [1, 4, 9, 16]$, and is highly sensitive to outliers.
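For illustration, the following minimal Python sketch (an editorial example, not part of the thesis system) implements Equation 4.1 directly and reproduces the two examples above.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long score vectors (Equation 4.1)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0: linear transformation, perfect correlation
print(pearson_r([1, 2, 3, 4], [1, 4, 9, 16]))  # ~0.984: non-linear relationship, no longer perfect
```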

Spearman’s ρ

Typically denoted by ρ, Spearman’s rank correlation is not computed between the actual values of the elements in two similarity score vectors, but between their ranks. In the absence of tied ranks, Spearman’s ρ can be calculated using a simplified procedure: the raw similarity scores are transformed into ranks, then the difference $d_i$ between the ranks of $x_i$ and $y_i$ is computed. Spearman’s ρ is then computed as follows, with $n = |\vec{x}| = |\vec{y}|$:

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \qquad (4.2)$$

If tied ranks are present, Spearman’s ρ is computed as Pearson’s r (see Equation 4.1) between the ranked similarity scores. For tied ranks, the arithmetic mean of all of their individual ranks is used. Spearman’s ρ overcomes the shortcomings of Pearson’s r such as sensitivity to outliers and its limitation to measuring only linear relationships between the similarity score vectors. However, Spearman’s ρ suffers from the drawback that a perfect correlation of all ranks does not entail a perfect prediction of continuous similarity scores: for example, if a measure were to predict $\vec{x} = [0.1, 0.2, 4.9, 5.0]$, i.e. two very low and two very high similarity scores on a 0−5 scale, a perfect Spearman’s ρ would be computed with the gold standard $\vec{y} = [4.7, 4.8, 4.9, 5.0]$, i.e. very high similarity scores for all four elements, as only the ranks of the pairs are compared.
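As a companion to the previous sketch, the following Python fragment (again only an illustration) first converts the raw scores into ranks, averaging the ranks of tied values, and then applies Equation 4.1 to the ranks; it reuses pearson_r from the sketch above and reproduces the rank-only example from the text.

```python
def to_ranks(scores):
    """Assign ranks 1..n; tied values receive the arithmetic mean of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1                     # extend the block of tied values
        mean_rank = (i + j) / 2 + 1    # average of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rank correlation as Pearson's r over the ranked scores."""
    return pearson_r(to_ranks(x), to_ranks(y))

# Perfect rank correlation despite very different absolute scores (see text):
print(spearman_rho([0.1, 0.2, 4.9, 5.0], [4.7, 4.8, 4.9, 5.0]))  # 1.0
```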

4.2 Extrinsic Evaluation

An extrinsic evaluation measures the performance of text similarity measures with respect to a particular task at hand, where text similarity is a means for solving a specific problem. In this thesis, and particularly in Chapter 7, we focus on the task of text reuse detection. Text reuse is thereby defined as “the reuse of existing written sources in the creation of new text” (Clough et al., 2002). A system for computing text similarity is expected to produce a classification output which assigns each text pair a class label such as similar/dissimilar for a binary classification, or highly similar/moderately similar/dissimilar for a multiclass setting. Evaluation is then carried out by comparing the system output with the human classifications. In the following, we introduce the datasets and the evaluation metrics which are suitable for an extrinsic evaluation. In Chapter 7, we present details and results of an extrinsic evaluation which we carried out for the task of text reuse detection.

4.2.1 Datasets

Popular datasets for this task include the Wikipedia Rewrite Corpus (Clough and Stevenson, 2011), the METER Corpus (Gaizauskas et al., 2001), and the Webis Crowd Paraphrase Corpus (Burrows et al., 2013). The datasets contain text pairs along with human judgments on the degree of text reuse.

Wikipedia Rewrite Corpus

The dataset introduced by Clough and Stevenson (2011) contains 95 pairs of short texts (193 words on average). For each of 5 questions about topics of computer science (e.g. “What is dynamic programming?”), a reference answer (source text, henceforth) has been manually created by copying portions of text from a suitable Wikipedia article. Text similarity is then computed between a source text and the answers given by each of 19 human subjects. The subjects were asked to provide short answers, each of which should comply with one of four rewrite levels and hence reuse the source text to a varying extent. According to the degree of rewrite, the dataset is 4-way classified as cut & paste (38 texts; simple copy of text portions from the Wikipedia article), light revision (19; synonym substitutions and changes of grammatical structure allowed), heavy revision (19; rephrasing of Wikipedia excerpts using different words and structure), and no plagiarism (19; answer written independently from the Wikipedia article). We will further discuss this dataset in Chapter 7 and give an example of a heavy revision in Figure 7.1 on page 68.

METER Corpus

The dataset was created by Gaizauskas et al. (2001) and contains news sources from the UK Press Association (PA) and newspaper articles from 9 British newspapers that reused the PA source texts to generate their own texts. The complete dataset contains 1,716 texts from two domains: law & court and show business. All newspaper articles have been annotated as to whether they are wholly derived from the PA sources (i.e. the PA text has been used exclusively as text reuse source), partially derived (the PA text has been used in addition to other sources), or non-derived (the PA text has not been used at all). Several newspaper texts, though, have more than a single PA source in the original dataset, in which case it is unclear which (if not all) of the source stories have been used to generate the rewritten story. However, for text similarity computation it is important to have aligned pairs of reused texts and source texts. Therefore, Sánchez-Vega et al. (2010) proposed to select only the subset of text pairs where a single source story is present in the dataset. This leaves 253 pairs of short texts (205 words on average), out of which 181 (72%) are classified as positive samples where text reuse occurs, and 72 (28%) as negative samples.

Webis Crowd Paraphrase Corpus

The dataset by Burrows et al. (2013) was originally introduced as part of the PAN 2010 international plagiarism detection competition (Potthast et al., 2010). It contains 7,859 pairs of original texts along with their paraphrases (28 to 954 words in length), with 4,067 (52%) positive and 3,792 (48%) negative samples. The original texts are book excerpts from free e-books of Project Gutenberg8, and the corresponding paraphrases were acquired in a crowdsourcing process using Amazon Mechanical Turk (Callison-Burch and Dredze, 2010). In the manual filtering process9 of all acquired paraphrases, Burrows et al. (2013) follow the paraphrase definition by Boonthum (2004): a good paraphrase exhibits patterns such as synonym use, changes between active and passive voice, or changing word forms and parts of speech, while a bad paraphrase is, for example, a (near-)duplicate or an automated one-for-one word substitution. This definition implies that a more sophisticated interpretation of text similarity scores needs to be learned, where e.g. (near-)duplicates with very high similarity scores are in fact negative samples.

4.2.2 Metrics

In an extrinsic evaluation, the system output is compared with human classifications. Besides accuracy, i.e. the number of correctly predicted text pairs divided by the total number of pairs, a popular evaluation metric is F1 score, which we discuss in the following.

F1 score

In a classification context, F1 score evaluates system performance in terms of precision and recall. Contrary to Pearson’s r and Spearman’s ρ, F1 score does not directly compare a system output of continuous similarity scores with human similarity judgments. It rather expects a system to produce a classification output which assigns each text pair a class label such as similar/dissimilar for a binary classification, or highly similar/moderately similar/dissimilar for a multiclass setting. These classifications are then compared with human judgments, and can be categorized according to the predictive value as true positives (tp), true negatives (tn), false positives (fp), and false negatives (fn). Precision and recall are defined as:

$$\text{precision} = \frac{tp}{tp + fp} \qquad \text{recall} = \frac{tp}{tp + fn} \qquad (4.3)$$

The F1 score is defined as the harmonic mean of precision and recall:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (4.4)$$
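For illustration, a small self-contained Python sketch (not taken from the thesis system) that derives precision, recall, and F1 from gold and predicted class labels as in Equations 4.3 and 4.4; the label names are arbitrary placeholders.

```python
def precision_recall_f1(gold, predicted, positive_label="similar"):
    """Compute precision, recall, and F1 for one class from two parallel label lists."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == positive_label and p == positive_label)
    fp = sum(1 for g, p in zip(gold, predicted) if g != positive_label and p == positive_label)
    fn = sum(1 for g, p in zip(gold, predicted) if g == positive_label and p != positive_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold      = ["similar", "similar", "dissimilar", "dissimilar"]
predicted = ["similar", "dissimilar", "dissimilar", "similar"]
print(precision_recall_f1(gold, predicted))  # (0.5, 0.5, 0.5)
```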

4.3 Chapter Summary

In this chapter, we discussed that the performance of text similarity measures can be evaluated either intrinsically or extrinsically. For each type of evaluation, we presented popular datasets and evaluation metrics.

8 http://www.gutenberg.org
9 Burrows et al. (2013) do not report any inter-annotator agreement for the filtering process, as the task was split across two annotators and each text pair was labeled by only a single annotator.

For an intrinsic evaluation, popular datasets are the 30 Sentence Pairs dataset (Li et al., 2006), the 50 Short Texts collection (Lee et al., 2005), the Microsoft Paraphrase Corpus (Dolan et al., 2004), and the Microsoft Video Description Paraphrase Corpus (Chen and Dolan, 2011). Evaluation is then carried out by comparing the system results with human judgments, typically by computing Pearson or Spearman correlation. We discussed the shortcomings of the former two datasets and reported on their limitations for the evaluation of text similarity measures. The latter two datasets, however, have been successfully employed in an intrinsic evaluation. We report details on this evaluation in Chapter 6. For an extrinsic evaluation, popular datasets include the Wikipedia Rewrite Corpus (Clough and Stevenson, 2011), the METER Corpus (Gaizauskas et al., 2001), and the Webis Crowd Paraphrase Corpus (Burrows et al., 2013). Evaluation is then carried out by comparing the system output with human classifications, typically by computing accuracy and F1 score. We report on our extrinsic evaluation in Chapter 7.

Chapter 5

Composite Model

In Section 2.2, we provided empirical evidence that humans perceive text similarity along different dimensions inherent to texts. In particular, we identified three dimensions suitable for texts: content, structure, and style. In this chapter, we now take these theoretical insights as a basis for a composite model for computing text similarity. Depending on the concrete task at hand, we argue that such a model may need to address more than a single text dimension when computing similarity, for example a combination of all three dimensions. In Section 5.1, we discuss our motivation for the proposed composite model. In Section 5.2, we then describe how we combine a multitude of text similarity measures along the proposed text dimensions using machine learning. In Section 5.3, we briefly list the text similarity measures which we used for our experiments. In Section 5.4, we report details on our employed experimental setup. We will follow up on this setup in Chapters 6 and 7 for an intrinsic and an extrinsic evaluation.

5.1 Motivation

In Chapter 2, we discussed formal models of similarity and how we can adapt them to texts. In particular, we proposed to follow the idea of conceptual spaces for text similarity (see Section 2.1.3). In this model, similarity is computed along multiple dimensions. In a set of annotation studies, we showed that humans indeed judge similarity along such dimensions when asked to rate the degree of similarity between two texts. In our studies, we identified three major dimensions inherent to texts: content, structure, and style. Our goal in this chapter is to design a model which implements the idea of text similarity computation along the proposed text dimensions. Traditionally, however, text similarity between two texts is computed by using a single text similarity measure at a time, as depicted in Figure 5.1a, and most of the measures presented in Chapter 3 operate that way. A single measure is thereby assumed to capture all necessary text characteristics, and is assumed to be able to judge text similarity in the same way that humans do. Almost all measures presented in Chapter 3 have in common that they have been proposed in isolation from other measures. However, we argue that these measures do not necessarily compete with each other over which one is more suitable for text similarity computation. Rather, we believe that each measure serves a certain purpose and makes different assumptions about


(a) Traditionally, text similarity between two texts T1 and T2 is computed by using a single text similarity measure.

(b) In contrast, we propose to use n similarity measures in parallel and combine them using a machine learning classifier.

Figure 5.1: Traditionally, text similarity (sim) between two texts T1 and T2 is computed using a single text similarity measure (Figure 5.1a). In contrast, our model is designed to employ multiple text similarity measures in parallel (Figure 5.1b), which we combine using a machine learning classifier. Thereby, each measure captures particular text characteristics.

what makes texts similar. For example, while string distance measures (see Section 3.2.1) compare two texts based on their string sequences, other measures such as Explicit Semantic Analysis (Gabrilovich and Markovitch, 2007) (see Section 3.2.5) are suitable for going beyond the lexical level and comparing texts based on their semantics. Which one is more suitable then depends on the specific task at hand. However, drawing on the insights of Chapter 2, we argue that no single text similarity measure, if used in isolation, is able to compute text similarity in the same way that humans do. As we reported in Chapter 2, humans seem intuitively able to find an appropriate dimension of comparison for given texts. In our opinion, a sophisticated model for computing text similarity thus needs to address a wide variety of text characteristics in order to best resemble human judgments. In the following section, we describe how we combine a multitude of text similarity measures in a single composite model.

5.2 Composing Measures

In the literature, a vast number of text similarity measures has been proposed to date. Our goal is to leverage their enormous potential and combine them in a single composite model. In that process, we do not attempt to judge the quality or suitability of any of the existing text similarity measures if used in isolation. Rather, we argue that each measure has its justification and serves a certain purpose when it comes to similarity computation, such as addressing text similarity on a lexical or semantic level. Instead of using a single measure at a time, we propose to compute text similarity by using multiple text similarity measures in parallel. The only measures presented in Chapter 3 which partly fulfill this requirement and combine a small number of measures are the compositional measures by Li et al. (2006) (see Section 3.1.2) and Islam and Inkpen (2008) (see Section 3.1.3). In addition to measures which cover semantic aspects of the texts, they take into account word order or the longest common subsequence. However, we argue that we cannot build upon these approaches for combining a multitude of measures along different text dimensions, as the proposed aggregation strategies do not scale to a large number of individual measures: For example, Islam and Inkpen (2008) use a normalized, weighted sum of two measures. However, the weights need to be fixed prior to computing similarity. In consequence, the measure cannot easily adapt to different sets of text similarity measures (especially sets which contain more than just two individual measures) or different tasks at hand, as it is an open question which weights would be most suitable in those cases. We thus propose to build a model which is flexible and easily adapts to different sets of text similarity measures. Drawing on the insights of Section 2.2, we propose to use similarity measures from multiple text dimensions, in particular content, structure, and style. We will report further details on the employed text similarity measures in Section 5.3. We show our composite model in Figure 5.1b. In contrast to the traditional measures, our model is designed to employ multiple text similarity measures in parallel, which we combine using a machine learning classifier. The model is thereby assumed to operate in a supervised setting. That way, given training data (that is, text pairs with characteristics similar to the ones under study, accompanied by human similarity judgments) is used to train the classifier. The classifier then learns how to best combine the given text similarity measures along the different text dimensions in order to best reproduce the human similarity scores. By following this approach, we take a different point of view than most previous research in this field (see Chapter 3). We do not propose a single novel text similarity measure in this thesis. Rather, we extensively build upon previous work and design a system architecture which integrates a multitude of existing text similarity measures in a single composite model. That way, we leverage the potential of previous work, but overcome the traditional limitation of using the measures only in isolation. This is in line with recent findings by Fokkens et al. (2013), who argue that the core value of research is to “build upon existing approaches.”1

5.3 Text Similarity Measures

In this section, we report on the text similarity measures which we used to compute similarity along the three characteristic text dimensions content, structure, and style.

5.3.1 Content Similarity

The set of text similarity measures that we used for computing similarity along the content text dimension comprises measures which we already introduced in Chapter 3. In the following, we briefly list the measures that we used in both the intrinsic evaluation (see Chapter 6) as well as the extrinsic evaluation (see Chapter 7).

We will further report in the respective chapters the detailed configuration parameters for all measures that we used in our experiments, as well as additional measures which we used only in one of the evaluations. From the class of compositional measures (see Section 3.1), we chose to follow the work by Mihalcea et al. (2006) and used the proposed aggregation strategy in combination with three word similarity measures that operate on a given lexical-semantic resource: Jiang and Conrath (1997), Lin (1998a), and Resnik (1995). We used WordNet (Fellbaum, 1998) as the lexical-semantic resource of choice. From the class of non-compositional measures (see Section 3.2), we first chose a set of string distance metrics, which we introduced in Section 3.2.1. We used the longest common substring measure (Gusfield, 1997), the longest common subsequence measure (Allison and Dix, 1986), Greedy String Tiling (Wise, 1996), and the metrics by Jaro (1989), Winkler (1990), Monge and Elkan (1997), and Levenshtein (1966). We also compared sets of n-grams, as described in Section 3.2.2. In order to compare character n-grams, we followed the implementation by Barrón-Cedeño et al. (2010). We further compared word n-grams using the Jaccard coefficient, following Lyon et al. (2001), as well as using the containment measure (Broder, 1997). We also used the vector space model Explicit Semantic Analysis (Gabrilovich and Markovitch, 2007) as a text similarity measure in our system, which we described in Section 3.2.5. Besides WordNet, we used two additional lexical-semantic resources for the construction of the vector space: Wikipedia and Wiktionary2.

1 In Chapter 9, we will further elaborate on the work by Fokkens et al. (2013), who attempt to reproduce previous experimental results and discuss various pitfalls.
2 http://www.wiktionary.org
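To make the word n-gram measures from this list concrete, the following minimal Python sketch (a simplified illustration, not the DKPro Similarity implementation) compares two texts via word n-gram sets using the Jaccard coefficient and the containment measure (Broder, 1997).

```python
def word_ngrams(text, n):
    """Build the set of word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(text1, text2, n=2):
    """Jaccard coefficient: n-gram overlap normalized by the size of the union."""
    a, b = word_ngrams(text1, n), word_ngrams(text2, n)
    return len(a & b) / len(a | b) if a | b else 0.0

def containment(text1, text2, n=2):
    """Containment (Broder, 1997): n-gram overlap normalized by the first set only."""
    a, b = word_ngrams(text1, n), word_ngrams(text2, n)
    return len(a & b) / len(a) if a else 0.0

t1 = "A man is playing a guitar ."
t2 = "A boy is playing a piano ."
print(jaccard(t1, t2), containment(t1, t2))  # 0.2 and ~0.33 for bigrams
```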

5.3.2 Structural Similarity

As discussed above, we assume that text similarity can be best computed along multiple text dimensions. In the following, we introduce measures which allow us to compute text similarity based on structural aspects inherent to the compared texts. Stopword n-grams (Stamatatos, 2011) are based on the idea that similar texts may preserve syntactic similarity while exchanging content words. Thus, the measure removes all content words while preserving only stopwords. All n-grams of both texts are then compared using the containment measure (Broder, 1997) which we described in Section 3.2.2. For the same reason, we also included part-of-speech n-grams of all words in our feature set. We thereby disregard the actual words that appear in two given texts, while taking only the words’ part-of-speech tags into account. We compile n-grams based on these tags and consider them indicators for the texts’ shallow syntactic structures. We tested n-gram sizes for n = 2, 3, ..., 15, and compared the two sets using the containment measure (Broder, 1997). We also employed two similarity measures between pairs of words (Hatzivassiloglou et al., 1999) in order to compare the texts’ syntactic structures. We first constructed a set of all shared word pairs which appear in both texts. We then created one feature vector per text pair where each vector element corresponds to the weight of a shared word pair. The weights differ per measure and are determined as follows: The word pair order measure assumes that a similar syntactic structure

causes two words to occur in the same order in both texts (with any number of words in between). The word pairs hence receive binary weights depending on whether they occur in the same order in both texts or not. The complementary word pair distance measure counts the number of words which lie between those of a shared word pair. The word pairs hence receive numeric weights which correspond to the number of words in between. Finally, we compared the feature vectors using Pearson correlation.
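The stopword n-gram idea can be sketched in a few lines of Python (an illustration only, with a hypothetical, heavily truncated stopword list; the actual measure by Stamatatos (2011) uses a full stopword list): content words are removed, and the remaining stopword n-grams are compared with the containment measure.

```python
# Hypothetical, heavily truncated stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "to", "and", "who"}

def stopword_ngrams(text, n=3):
    """Remove content words, keep stopwords in order, and build n-grams over them."""
    kept = [t for t in text.lower().split() if t in STOPWORDS]
    return {tuple(kept[i:i + n]) for i in range(len(kept) - n + 1)}

def stopword_ngram_containment(text1, text2, n=3):
    a, b = stopword_ngrams(text1, n), stopword_ngrams(text2, n)
    return len(a & b) / len(a) if a else 0.0

t1 = "The man who is sitting in the park is reading a book"
t2 = "The woman who is standing in the garden is holding a phone"
# The content words differ, but the stopword skeleton is identical:
print(stopword_ngram_containment(t1, t2))  # 1.0
```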

5.3.3 Stylistic Similarity

Measures of stylistic similarity adopt ideas from authorship attribution (Mosteller and Wallace, 1964) or use statistical properties of texts to compute text similarity. The type-token ratio (TTR) (Templin, 1957), for example, compares the vocabulary richness of two texts. However, it suffers from sensitivity to variations in text length and the assumption of textual homogeneity (McCarthy and Jarvis, 2010): As a text gets longer, the increase of the number of tokens is linear, while the increase of the number of types steadily slows down. In consequence, lexical repetition causes the TTR value to vary, while it does not necessarily entail that a reader perceives changes in the vocabulary usage. Secondly, textual homogeneity is the assumption of the existence of a single lexical diversity level across a whole text, which may be violated by different rhetorical strategies. Sequential TTR (McCarthy and Jarvis, 2010) alleviates these shortcomings. It iteratively computes a TTR score for a dynamically growing text segment until a point of saturation (i.e. a fixed TTR score of 0.72) is reached, then it starts anew from that position in the text for a new segment. The final lexical diversity score is computed as the number of tokens divided by the number of segments. Inspired by Yule (1939), who discussed sentence length as a characteristic of style, we also used two simple measures, average sentence length and average token length, in our system. These measures compute the average number of tokens per sentence and the average number of characters per token, respectively. Additionally, we compared the average sentence and token lengths between the texts of a pair. We refer to these measures as sentence length ratio and token length ratio, respectively. Finally, we compare texts by their function word frequencies (Dinu and Popescu, 2009), which have been shown to be good style indicators in authorship attribution studies. Following the original work, this measure uses a set of 70 function words identified by Mosteller and Wallace (1964) and computes feature vectors of their frequencies for each text pair. As originally proposed, the comparison of the vectors is then performed using Pearson correlation.
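The simplest of these stylistic features can be sketched as follows (an illustrative Python fragment, not the thesis code; the whitespace tokenization, the sentence split on periods, and the min/max normalization of the ratio are simplifying assumptions).

```python
def type_token_ratio(text):
    """Vocabulary richness: number of distinct word types divided by the number of tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def average_sentence_length(text):
    """Average number of tokens per sentence (naive split on '.')."""
    sentences = [s for s in text.split(".") if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences) if sentences else 0.0

def average_token_length(text):
    """Average number of characters per token."""
    tokens = text.split()
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0

def sentence_length_ratio(text1, text2):
    """Compare the average sentence lengths of two texts (shorter divided by longer)."""
    a, b = average_sentence_length(text1), average_sentence_length(text2)
    return min(a, b) / max(a, b) if max(a, b) else 0.0
```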

5.4 Experimental Setup

Our system is based on DKPro Core3, a collection of software components for natural language processing built upon the Apache UIMA framework (Ferrucci and Lally, 2004). A high-level depiction of our experimental setup is shown in Figure 5.2. The system works as follows: We first run a multitude of text similarity measures separately on all text pairs. We then use the resulting similarity scores as features for

3 http://code.google.com/p/dkpro-core-asl

(a) In the feature generation step, text similarity is computed between two texts T1 and T2 with all n available measures, resulting in a vector with n individual text similarity scores sim1, sim2, ..., simn.

(b) In the feature combination step, a training model is built up from the generated features, which is then applied to the test data in order to predict the text similarity scores for all text pairs in the test data.

Figure 5.2: Our system performs two main steps: First, it pre-computes text similarity scores with all available measures (Figure 5.2a), before then using these scores as features for a machine learning classifier in order to combine them (Figure 5.2b).

a machine learning classifier in order to combine the measures. That way, we follow our assumption that the similarity computation process needs to address multiple text dimensions. In detail, we proceed as follows.

Pre-processing We tokenize the input texts and then lemmatize using the TreeTagger (Schmid, 1994). Where applicable, we additionally apply a stopword filter.

Feature Generation We compute text similarity scores for the text pairs with all available measures and for all configurations. This results in a large vector of individual text similarity scores for each text pair.

Feature Combination We now use the pre-computed similarity scores and combine them using a machine learning classifier from the Weka toolkit (Hall et al., 2009). Depending on the task at hand, the classifier then predicts either a numerical similarity score for all text pairs based on the given features, or a nominal similarity class. We discuss the effects of different classifiers in the remainder of this section.

Post-processing If necessary, we apply some additional post-processing steps, e.g. limiting the predicted scores to a fixed interval.

The proposed system has the advantage that it allows us to combine multiple text similarity measures in a unified classification framework. We assume that each measure computes similarity along particular text characteristics, e.g. along the proposed dimensions content, structure, or style. In combination, our system allows us to take advantage of these manifold similarity scores and that way learns better how to predict text similarity scores which resemble human judgments. Our experimental setups use different machine learning classifiers. For an intrinsic evaluation, it is necessary that the classifier predicts continuous text similarity scores, which can then be compared with human judgments, for example, using Pearson correlation (see Section 4.1.2). In that case, we use a linear regression classifier from the Weka toolkit. An extrinsic evaluation, on the other hand, requires the classifier to predict nominal similarity classes, which can then be compared to human classifications by computing F1 score (see Section 4.2.2). For example, in Chapter 7, we will report on the task of text reuse detection. In this task, text pairs are classified according to the degree of text reuse, for example into wholly derived, partially derived, or non-derived classes. Suitable classifiers for predicting nominal similarity classes, which we used for our experiments, are the Naive Bayes classifier and the C4.5 decision tree classifier. We will present details on the evaluation of our system in Chapters 6 and 7.
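The following Python sketch mirrors the two-step setup of Figure 5.2 in miniature. It is only an illustration under simplifying assumptions: three toy feature functions stand in for the multitude of DKPro-based measures described in Section 5.3, scikit-learn's LinearRegression replaces the Weka linear regression classifier, and the training pairs and gold scores are invented for illustration.

```python
from sklearn.linear_model import LinearRegression

# Toy stand-ins for the content, structure, and style measures of Section 5.3.
def word_overlap(t1, t2):
    a, b = set(t1.lower().split()), set(t2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def char_overlap(t1, t2):
    a, b = set(t1.lower()), set(t2.lower())
    return len(a & b) / len(a | b) if a | b else 0.0

def length_ratio(t1, t2):
    la, lb = len(t1.split()), len(t2.split())
    return min(la, lb) / max(la, lb) if max(la, lb) else 0.0

MEASURES = [word_overlap, char_overlap, length_ratio]

def feature_generation(pairs):
    """Step (a) of Figure 5.2: one similarity score per measure and text pair."""
    return [[measure(t1, t2) for measure in MEASURES] for t1, t2 in pairs]

# Step (b) of Figure 5.2: learn how to combine the measures from human judgments.
train_pairs = [("A man is playing a guitar.", "A man plays a guitar."),
               ("A man is playing a guitar.", "A boy is playing a piano."),
               ("A man is playing a guitar.", "The committee issued a report.")]
train_scores = [4.8, 1.2, 0.1]  # invented gold similarity scores on the 0-5 scale

model = LinearRegression()
model.fit(feature_generation(train_pairs), train_scores)

test_pairs = [("A woman plays the violin.", "A woman is playing a violin.")]
print(model.predict(feature_generation(test_pairs)))
```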

5.5 Chapter Summary

In this chapter, we presented a composite model for text similarity computation, which we devised based on our theoretical insights of Section 2.2. Our model follows the idea that text similarity can be best computed if multiple text similarity measures are used in parallel, whereby each measure is supposed to capture particular text characteristics. In our experiments, we included measures from the content, structure, and style text dimensions, which we listed in this chapter. Using machine learning techniques, we combined the measures and had the system output numeric similarity scores (suitable for an intrinsic evaluation, see Chapter 6) or nominal similarity classes (suitable for an extrinsic evaluation, see Chapter 7). The proposed system implementation is highly flexible and not limited to the application of the proposed text similarity measures or the employed machine learning classifiers. We will follow up on our system in Chapter 9, where we present DKPro Similarity, a generalized framework for text similarity, which resulted from the work presented in this chapter.

Chapter 6

Intrinsic Evaluation

For an intrinsic evaluation, we apply our system to the pilot Semantic Textual Similarity (STS) Task (Agirre et al., 2012) at the Semantic Evaluation (SemEval) workshop.1 In the following, we first introduce the task in detail in Section 6.1 and report on the five evaluation datasets in Section 6.2. In Section 6.3, we then give an overview of the text similarity measures that we used in our system. We report details on the experimental setup along with its configuration parameters in Section 6.4. In Sections 6.5 and 6.6, we discuss the results obtained and present a thorough analysis of the most severe types of errors that occurred. In Section 6.7, we compare our system with the competing systems in this task. We conclude this chapter with a report on follow-up work in Section 6.8 and an overall summary.

6.1 Task Description

Since 1998, SemEval has been an ongoing series of semantic evaluation exercises, where systems compete on a number of shared tasks. The goal is to find computational means for assessing different aspects of meaning in language. While the workshop primarily addressed word sense disambiguation (Yarowsky, 1995) in its early days (and was called Senseval back then)2, it has evolved into a major venue for researchers working on a wide variety of topics such as text simplification (Chandrasekar et al., 1996), semantic role labeling (Palmer et al., 2010), or textual entailment (Dagan et al., 2006). In 2012, a new task was introduced: The Semantic Textual Similarity (STS) Task (Agirre et al., 2012) is intended to unite multiple efforts across the applied semantics community. The goal is to design algorithms which measure the degree of semantic similarity between two texts in a pair. These scores, on a continuum from 0 (not similar at all) to 5 (completely similar), are then compared to averaged human judgments. That way, it is evaluated how well the algorithmically produced scores resemble human judgments. The text pairs in the employed evaluation datasets are rather short, i.e. typically sentence pairs which may or may not constitute grammatically correct texts. It remains an open question for future research how our results scale to even longer texts, as suitable datasets are currently not available.

1 http://www.cs.york.ac.uk/semeval-2012/task6
2 http://www.senseval.org


Table 6.1 Statistics for the five datasets which have been used for the intrinsic evaluation of text similarity measures in the Semantic Textual Similarity (STS) Task (Agirre et al., 2012). All text pairs in these datasets are accompanied by human similarity judgments on a 0−5 scale where 0 is not similar at all and 5 means completely similar. Additionally, we compared the ratings of each human annotator with the average ratings of all other annotators. We report it as the average weighted Pearson correlation r across all annotators.3

Source                                                        Dataset    Text Type / Domain                 Length in Terms (avg.)   # Pairs   Human r
Microsoft Paraphrase Corpus (Dolan et al., 2004)              MSRpar     News                               6–38 (21)                1,500     0.707
Microsoft Video Descr. Paraph. Cor. (Chen and Dolan, 2011)    MSRvid     Video Descriptions                 3–25 (8)                 1,500     0.876
ACL Workshops on SMT (Callison-Burch et al., 2007, 2008)      SMT-eur    European Parliament Proceedings    3–78 (24)                1,193     0.547
WMT-2007 News Commentary (Callison-Burch et al., 2007)        SMT-news   News conversations                 4–36 (13)                399       0.572
OntoNotes (Hovy et al., 2006), WordNet (Fellbaum, 1998)       ON-WN      Glosses                            3–41 (9)                 750       0.631

6.2 Datasets

The pilot STS task at SemEval-2012 employed five text similarity datasets. Each dataset contains between 399 and 1,500 sentence pairs drawn from publicly available corpora. We summarize the dataset statistics in Table 6.1 and describe the datasets in detail below. During the construction of the datasets for the STS task, human subjects rated all sentence pairs on how similar they are. The original datasets did not contain continuous human judgments on text similarity, but were typically either rated in a binary fashion or not rated at all. In a crowdsourcing setting, the subjects were asked to rate the similarity between the sentences on a discrete scale from 0 to 5. Thereby, explanations were given for each class (Agirre et al., 2012), e.g. “The two sentences are completely equivalent, as they mean the same thing.” (5), “The two sentences are mostly equivalent, but some unimportant details differ.” (4), or “The two sentences are on different topics.” (0). In the end, approx. five annotations per sentence pair were gathered, which were then averaged to create the final similarity score.

MSRpar4

The first corpus is the Microsoft Paraphrase Corpus (Dolan et al., 2004). In Section 4.1.1, we already extensively discussed the nature of this corpus along with its data statistics. In Figure 4.3 on page 34, we already showed an example text pair which is taken from this dataset. For the STS task, a subset of 1,500 sentence pairs was chosen and split 50/50% for training and testing. The pairs were selected following a string-sampling approach: Based on the Levenshtein (1966) distance between the sentences, their similarity score was computed. Pairs were included in the final dataset based on samples drawn from five equally distributed similarity bands in the range [0.4, 0.8] in order to get a balanced dataset.5

3 Originally, Agirre et al. (2012) did not report any insights on the human annotations.
4 For consistency, we continue to use the abbreviations as introduced by Agirre et al. (2012).
5 Randomly pairing texts would result in too many dissimilar pairs (Agirre et al., 2012).

MSRvid

The second corpus is the Microsoft Video Description Paraphrase Corpus (Chen and Dolan, 2011), which we also introduced in detail in Section 4.1.1. In the original dataset, more than 120,000 video descriptions were gathered for short video clips in several languages. However, the descriptions were only assigned to a particular video clip. No sentence pairs were constructed yet. For the STS task, Agirre et al. (2012) carried out the pairing process by selecting only English sentence pairs, and mixing sentences with others from the same video clip, as well as with ones from other clips. Thereby, they applied a similar string sampling technique as described above, this time sampling pairs from four similarity bands in the range [0.5, 0.8]. In the end, 1,500 pairs were selected which were again split 50/50% for training and testing. We show an example below which is rated 1.2 (0−5 scale).

(a) A man is playing a guitar.    (6.1)
(b) A boy is playing a piano.

SMT-eur

The third collection of sentence pairs was drawn from data of the 2007/08 ACL Workshops on Statistical Machine Translation (WMT) (Callison-Burch et al., 2007, 2008). Here, English reference texts from the Europarl corpus6 were paired with the automatic translations (i.e. the system submissions in this exercise) from French to English. In total, 1,193 pairs were selected, which were split into 734 pairs for training and 459 for testing. We show an example text pair below which is rated 5.0 on the 0−5 scale.

(a) Clearly, explanations are necessary for the increased incidence of BSE, not only in France but also in other Member States.    (6.2)
(b) It is obviously necessary to determine the reasons for the increase in the incidence of BSE, not only in France but also in other Member States.

SMT-news

Finally, two surprise datasets were provided for testing the final system submissions, for which no training data had been provided. The first one is the SMT-news dataset. It contains 399 pairs of English reference texts along with their automatic translations (i.e. the system submissions in this exercise, similar to SMT-eur) from French to English, where the data was drawn from the WMT 2007 news commentary test set (Callison-Burch et al., 2007). We show a not necessarily grammatically correct example below which was rated 4.75 on the 0−5 scale.

(a) The old version of the European response–what psychologists might call “dollar envy”–will only become more acute.    (6.3)
(b) The old version of the European response (what psychologists call “the envy of the dollar”) only become more pervasive.

ON-WN

The second surprise dataset is the ON-WN dataset which contains 750 pairs of glosses from OntoNotes (Hovy et al., 2006) and WordNet (Fellbaum, 1998)

senses. Here, the sampling procedure was to pair 50% “equivalent senses” and 50% unrelated senses using a binary-weighted cosine similarity metric on the tokenized sentences. We show an example below which was rated 4.75 on the 0−5 scale.

6 The Europarl corpus comprises texts from the proceedings of the European Parliament (Koehn, 2005) and is available for download: http://www.statmt.org/europarl

(a) bring back to life, return from the dead    (6.4)
(b) cause to become alive again

Unfortunately, Agirre et al. (2012) do not report any insights on the human annotations for any of the five datasets. In particular, they do not report how well the ratings of a single annotator correlate with the average ratings of all other annotators. Through personal communication, however, we were able to receive the raw crowdsourced annotation data and inspect it ourselves. We compared the ratings of each human annotator with the average ratings of all others by means of Pearson correlation, and in Table 6.1 report the final score as the weighted average of all correlations. Thereby, we weighted each annotator’s correlation according to the number of pairs that were judged. This procedure is similar to the one used in follow-up work by Agirre et al. (2013), which we will discuss later in Section 6.8. Other potential inter-rater agreement metrics such as Fleiss’ (1971) κ are inappropriate, as the human similarity ratings are continuous scores rather than distinct annotation classes. We observed the highest average correlation (0.876) for the MSRvid dataset. We argue that this is due to the fact that the sentences describe real-world situations which can be rather easily compared by humans, e.g. “A man is playing a guitar.” (see Example 6.1). The lowest average correlation (0.547) was observed for the SMT-eur dataset. As described above, this dataset comprises texts that are taken from the proceedings of the European Parliament. We argue that these texts are potentially harder to read and understand than the very clear sentences of the MSRvid dataset. Hence, a lower agreement for the similarity scores between the different annotators is to be expected. Nonetheless, we further argue that the reported overall scores of the human agreements are only to be considered a rough estimate. Even though we conducted the same procedure as Agirre et al. (2013) did in their follow-up work (see Section 6.8), we would like to point out a major shortcoming of that process: The raw crowdsourced annotation data that was available to us contained the text pairs along with all human ratings that were collected for each of them. We argue that for a proper evaluation of how well a single annotator performs compared to the rest of the annotators, all annotators would need to rate all text pairs that are contained in the datasets. However, in the data that we had available, some annotators rated only very few text pairs, while others rated a large number of pairs. In consequence, the reported overall score is the result of computing the weighted arithmetic mean of correlations which have been observed on different sets of text pairs.
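The procedure described above can be sketched as follows (an illustrative Python fragment; the dictionary-based data layout is an assumption, not the format of the actual crowdsourced files, and scipy is an assumed dependency): each annotator is correlated with the mean of all other annotators on the pairs that annotator judged, and the correlations are averaged weighted by the number of judged pairs.

```python
from scipy.stats import pearsonr

def weighted_annotator_agreement(ratings):
    """ratings: dict annotator -> dict pair_id -> similarity score (assumed layout)."""
    weighted_sum, total_weight = 0.0, 0
    for annotator, own in ratings.items():
        x, y = [], []
        for pair_id, score in own.items():
            others = [r[pair_id] for a, r in ratings.items()
                      if a != annotator and pair_id in r]
            if others:  # only pairs that at least one other annotator also rated
                x.append(score)
                y.append(sum(others) / len(others))
        if len(x) > 1:
            weighted_sum += pearsonr(x, y)[0] * len(x)  # weight by number of judged pairs
            total_weight += len(x)
    return weighted_sum / total_weight if total_weight else 0.0
```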

6.3 Text Similarity Measures

For the intrinsic evaluation, we employ a set of text similarity measures which follows the description in Section 5.3 with a minor set of modifications: While we used the set of structural and stylistic similarity measures as originally described, we slightly modified the set of content similarity measures to better fit this task. In Table 6.2, we list the measures that we already described in Section 5.3 and that we used in the intrinsic evaluation. We also describe the configuration parameters as necessary. In the following, we report on all measures which differ from the basic setup described in Section 5.3. In our experiments, we also used an additional measure based on a distributional thesaurus. The idea here is that similarity can be effectively computed based on word co-occurrence statistics in a large corpus. Similar to Lin (1998b), we therefore implemented a word similarity measure based on pairwise similarity scores between all words from 10 million dependency-parsed sentences of English newswire. We then implemented a text similarity measure based on these pre-computed scores which follows the aggregation strategy by Mihalcea et al. (2006). We also tried a new approach which is inspired by query expansion techniques in information retrieval (Xu and Croft, 1996). Query expansion tries to overcome the problem of potential word mismatches (lexical gaps) between a query and the documents by expanding the original query words. Word mismatches in information retrieval typically occur when one user formulates a query using particular words, but a second user has authored a document which uses synonymous words rather than the query words. We believe that a similar technique can also benefit text similarity computation in the intrinsic evaluation: As the two texts within each pair in the evaluation datasets (see Section 6.2) are typically not authored by the same person, lexical gaps may easily occur. Thus, instead of applying even more text similarity measures, we decided to stick with the set of existing similarity measures, but modify the input texts, so that lexical gaps are closed before computing similarity. We therefore used two text expansion mechanisms which augment or replace (parts of) the original texts with synonyms, hence allowing for better comparisons in any consecutive step. We used two expansion mechanisms: lexical substitution and statistical machine translation, which we describe in the following. We then compared the modified texts using the strategy by Mihalcea et al. (2006) with the underlying measure by Resnik (1995) as described in Section 3.1. For lexical substitution, we used the system that operates on the Turk Bootstrap Word Sense Inventory (TWSI) 2.0 (Biemann, 2012a,b). The TWSI resource is a word sense inventory which provides substitutions for a set of 1,012 frequent English nouns with high precision. The resource comes with a system for supervised word sense disambiguation which disambiguates polysemous nouns using a machine learning classifier from the Weka toolkit (Hall et al., 2009). For each noun for which substitutions exist7, we added all the substitutions to the text, in addition to the original noun, and computed pairwise word similarity for the texts as described above. That way, we alleviate potential word mismatches for the covered subset of words.
For statistical machine translation, we used the Moses SMT system (Koehn et al., 2007) to translate the original English texts via three bridge languages (Dutch, German, Spanish) back to English. In the translation process, additional lexemes are introduced which alleviate potential lexical gaps. The full augmented texts then

7 The TWSI 2.0 resource is available for download at http://www.ukp.tu-darmstadt.de/data/lexical-resources/twsi-lexical-substitutions

Table 6.2 Overview of the text similarity measures that we already listed in Section 5.3 and which have been used in the intrinsic evaluation, along with their configurations.

Text Similarity Measure            Configurations

Compositional Measures
  Mihalcea et al. (2006)           Resnik (1995), Jiang and Conrath (1997), and Lin (1998a) on WordNet

Non-compositional Measures
  Character n-gram profiles        n = 2, ..., 15
  Explicit Semantic Analysis       Wikipedia, Wiktionary
  Greedy String Tiling
  Longest common subsequence       2 normalizations
  Longest common substring
  Word n-grams                     Jaccard/Containment, n = 1, ..., 15

comprise the concatenation of the original texts and all three round-trip translations. The system was trained on the Europarl corpus (Koehn, 2005), using the following configuration which was not optimized for this task: WMT11 baseline without tuning, with MGIZA alignment. We will discuss the effects of both text expansion mechanisms as part of Section 6.5.2, and provide examples and further insights there.
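The text expansion idea itself is simple once the substitutions and round-trip translations have been obtained from the external systems (the TWSI-based lexical substitution system and Moses, respectively). The following Python sketch is a minimal illustration under that assumption; the substitution lookup shown here is a tiny hand-built placeholder, not the TWSI resource itself.

```python
# Illustrative placeholder only; the actual system uses the full TWSI 2.0 substitution
# lists. Nouns without substitutions simply stay unchanged.
SUBSTITUTIONS = {"history": ["record", "story", "background", "past"]}

def expand_with_substitutions(text):
    """Append the known substitutions after each covered noun to close lexical gaps."""
    expanded = []
    for token in text.split():
        expanded.append(token)
        expanded.extend(SUBSTITUTIONS.get(token.lower(), []))
    return " ".join(expanded)

def expand_with_translations(text, round_trip_translations):
    """Concatenate the original text with its pre-computed round-trip translations."""
    return " ".join([text] + list(round_trip_translations))

t = "The western Europeans , who have forgotten this history , should heed our warnings"
print(expand_with_substitutions(t))
```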

6.4 Experimental Setup

The system we used for the intrinsic evaluation is an adaptation of the system originally introduced in Chapter 5. It follows the idea that text similarity can be best detected if a composition of measures is used. In the following, we describe the system configuration that we used for the experiments throughout this chapter.

Pre-processing no modifications; see Section 5.4

Feature Generation We computed text similarity scores for the text pairs with all measures and for all configurations introduced above. This resulted in a vector of 300+ individual text similarity scores for each text pair.

Feature Combination We now use the pre-computed text similarity scores and combine their log-transformed values using a linear regression classifier from the Weka toolkit (Hall et al., 2009). The classifier then predicts a numerical score for all text pairs based on the given features.

Post-processing In the training phase, the linear regression classifier learns a linear combination of all given text similarity scores. Due to the learned coefficients of the linear combination, the predicted text similarity scores in the test phase may then fall outside the required interval [0, 5]. In order to avoid this, we convert all scores outside the interval to the minimum/maximum value, respectively. That way, we ensure that text pairs that receive an enormously

Table 6.3 Feature sets of the two final system configurations for the intrinsic evaluation

Run   Text Similarity Measure                    Configurations
1     Character 2-, 3-, and 4-gram profiles
      Distributional Thesaurus                   Cardinal numbers
      Explicit Semantic Analysis                 Wikipedia, Wiktionary
      Greedy String Tiling
      Longest common subsequence                 2 normalizations
      Longest common substring
      Mihalcea et al. (2006)                     Resnik (1995) on WordNet
      Word 1- and 2-grams                        Containment, w/o stopwords
      Word 1-, 3-, and 4-grams                   Jaccard
      Word 2- and 4-grams                        Jaccard, w/o stopwords
2     All features of Run 1
      Lexical Substitution for Word Similarity
      Statistical Machine Translation for Word Similarity

low/high similarity score outside the interval limits are classified as not similar at all/perfectly similar and hence receive the minimum/maximum score. For the configuration of Run 2—which we will introduce in the following—we further apply a post-processing filter which strips all characters off the texts which are not in the character range [a-zA-Z0-9]. If the texts then equal each other on the character level, we set their similarity score to the maximum (5.0) regardless of the classifier’s output. That way, we try to reduce the number of classification errors for texts which equal each other under the assumption of a simplified vocabulary. For example, text pairs which equal each other in all characters but some additional whitespaces or punctuation marks will still receive the maximum similarity score.
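These two post-processing steps, clamping the predicted scores to [0, 5] and overriding the score for text pairs that are identical after removing all characters outside [a-zA-Z0-9], can be illustrated with the following Python sketch (an illustration only, not the actual pipeline code).

```python
import re

def clamp(score, low=0.0, high=5.0):
    """Map out-of-range regression outputs onto the interval limits."""
    return max(low, min(high, score))

def normalize(text):
    """Keep only characters in [a-zA-Z0-9] for the equality check of Run 2."""
    return re.sub(r"[^a-zA-Z0-9]", "", text)

def postprocess(score, text1, text2):
    if normalize(text1) == normalize(text2):
        return 5.0  # texts equal up to whitespace/punctuation: maximum similarity
    return clamp(score)

print(postprocess(5.73, "A man plays a guitar.", "A man plays a guitar"))  # 5.0
print(postprocess(-0.4, "A man plays a guitar.", "The committee met."))    # 0.0
```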

During the development cycle, we evaluated the performance of our system using 10-fold cross-validation on the training data. For the final system, we trained the classifier on the available training data and generated one model per dataset which we then applied to the test data. We report the following experimental configurations:

Run 1 In a manual feature selection process (which we detail in Section 6.5.1), we identified the features which achieved the best performance on the training data. We list these features in Table 6.3. For each of the known datasets MSRpar, MSRvid, and SMT-eur, we trained a separate classifier and applied it to the test data. For the two surprise datasets ON-WN and SMT-news, we trained the classifier on a joint dataset of all known training datasets.

Run 2 In the second configuration of our system, we studied the effects of two additional features: lexical substitution and statistical machine translation. We added the corresponding measures (see Section 6.3) to the feature set of Run 1.

6.5 Results & Discussion

We now report and discuss the results obtained in the Semantic Textual Similarity Task. In Section 6.5.1, we first report on the feature selection process, where we identified an optimal combination of text similarity measures using the available training data. In Section 6.5.2, we then report and discuss the final results obtained on the test data.

6.5.1 Feature Selection

In order to find an optimal feature combination of text similarity measures, we evaluated our system on the three available training datasets MSRpar, MSRvid, and SMT-eur by computing Pearson correlation of the system output with human judgments. In Table 6.4, we report the results achieved on each dataset using 10-fold cross-validation. The best results are based on the feature set of Run 2, with Pearson’s r = .724, r = .868, and r = .742 for the datasets MSRpar, MSRvid, and SMT-eur, respectively. Run 2 performs at least as well as Run 1 across all three training datasets, even though the differences are not statistically significant.8 Run 2 further outperforms the best-performing classes of content similarity measures for two of the three datasets, and this difference is statistically significant.8 Individual classes of content similarity measures achieved good results. A different class performed best for each dataset, e.g. character n-gram profiles for the MSRpar dataset, pairwise word similarity for the MSRvid dataset, and even simple string similarity measures for the SMT-eur dataset. We attribute the differences to the different nature of the data (see Section 6.2), as the original texts were taken from completely different sources with a varying degree of lexical overlap. Text similarity measures related to structure and style achieved only poor results on the training data. We attribute this fact to the following properties of the data: (i) The texts are only a single sentence long. Measures designed to capture sophisticated aspects of structural similarity probably do not work well on these short texts. (ii) The texts of all pairs display similar style. This is due to the nature of the data and how the data was originally paired: Pairs were only formed between texts taken from the same source (e.g. news texts, video descriptions, etc.). By pairing texts that way, we expect stylistic measures to fail.
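The selection procedure described at the beginning of this section can be approximated by a greedy forward selection loop over cross-validated Pearson correlation. The thesis describes a manual process, so the following automated Python sketch is only an analogy, with scikit-learn and numpy as assumed dependencies.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

def greedy_feature_selection(X, y, n_folds=10):
    """Greedily add the feature column that most improves 10-fold CV Pearson correlation.
    X: numpy array of shape (n_pairs, n_features); y: gold similarity scores.
    Requires at least n_folds text pairs."""
    selected, best_r, best_f = [], -1.0, None
    remaining = list(range(X.shape[1]))
    improved = True
    while improved and remaining:
        improved = False
        for f in remaining:
            cols = selected + [f]
            predictions = cross_val_predict(LinearRegression(), X[:, cols], y, cv=n_folds)
            r = np.corrcoef(predictions, y)[0, 1]
            if r > best_r:
                best_r, best_f, improved = r, f, True
        if improved:
            selected.append(best_f)
            remaining.remove(best_f)
    return selected, best_r
```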

6.5.2 Final Results

In the final evaluation on the test data, three evaluation metrics were used (Agirre et al., 2012):9 The ALL metric computes Pearson correlation of the union of the system outputs across all five datasets with human judgments. The ALLnrm metric first fits the system output to the gold standard using least squares, then applies the

8 95% confidence intervals (Fisher Z-value transformation): 0.688 ≤ r ≤ 0.756 (MSRpar), 0.849 ≤ r ≤ 0.884 (MSRvid), 0.707 ≤ r ≤ 0.772 (SMT-eur)
9 At system submission time, the ALL metric was the single official evaluation metric. The task organizers later introduced two additional metrics: ALLnrm and Mean.

Table 6.4 Best results in terms of Pearson correlation for individual classes of measures, grouped by text dimension, on the training datasets MSRpar, MSRvid, and SMT-eur, using 10-fold cross-validation

Text Dimension   Text Similarity Features            MSRpar   MSRvid   SMT-eur
                 Best Feature Set, Run 1             .711     .868     .735
                 Best Feature Set, Run 2             .724     .868     .742
Content          Pairwise Word Similarity11          .564     .835     .527
                 Character n-gram profiles           .658     .771     .554
                 Explicit Semantic Analysis          .427     .781     .619
                 Word n-grams                        .474     .782     .619
                 String Similarity11                 .593     .677     .744
                 Distributional Thesaurus            .494     .481     .365
                 Lexical Substitution                .228     .554     .483
                 Statistical Machine Translation     .287     .652     .516
Structure        Part-of-speech n-grams              .193     .265     .557
                 Stopword n-grams                    .211     .118     .379
                 Word Pair Order                     .104     .077     .295
Style            Statistical Properties11            .168     .225     .325
                 Function Word Frequencies           .179     .142     .189

ALL metric.10 Mean refers to the weighted mean across the Pearson correlations of all datasets, where the weight corresponds to the number of text pairs in each dataset. In Table 6.5, we report the official results achieved on the test data, and visualize the results in Figure 6.1. In total, 35 teams submitted 88 systems in addition to the provided cosine similarity baseline. Almost all systems outperformed the provided baseline for the ALL and ALLnrm metrics. For the Mean metric, the baseline was ranked #70 with an average Pearson correlation r = 0.435 across the five test datasets, while it was ranked #87 and #85 for the ALL and ALLnrm metric, respectively. Our first configuration, Run 1, was ranked #4 across all three evaluation metrics. It was outperformed by our best system configuration (Run 2) which was ranked #1 for the evaluation metrics ALL (r = .823)12 and Mean

10 The ALL metric penalizes systems which do not perform equally well across all five datasets. The ALLnrm metric, on the other hand, can assign higher overall scores to systems which perform well on some of the datasets, but not on others. Leaving aside the open question of which goal is more desirable, the ALLnrm metric was criticized for reducing the variance of similarity scores in the normalization step, which disproportionately favors poor predictions when computing Pearson correlation in the following step.
11 In order to better demonstrate the performance of different classes of text similarity measures, we used a combination of measures with a single machine learning classifier and report the combined results. We grouped the measures as follows: Pairwise Word Similarity: aggregation by Mihalcea et al. (2006) in combination with the word similarity measures by Resnik (1995), Jiang and Conrath (1997), and Lin (1998a) on WordNet (see Section 3.1.1); String Similarity: all measures described in Section 3.2.1; Statistical Properties: type-token ratio, sequential TTR, average sentence length, average token length, sentence length ratio and token length ratio (see Section 5.3.3).
12 95% confidence interval (Fisher Z-value transformation): 0.811 ≤ r ≤ 0.834

Table 6.5 Official results on the test data for the top 6 participating runs out of 89 for the datasets MSRpar, MSRvid, and SMT-eur, as well as for the surprise datasets ON-WN and SMT-news. We report the ranks (#1: ALL, #2: ALLnrm, #3: Mean) and the corresponding Pearson correlation r according to the three official evaluation metrics (see Sec. 6.5.2). The provided cosine similarity baseline is shown at the bottom of this table. The 95% confidence interval12 for the ALL metric comprises the top four systems.

#1  #2  #3  System                        r1    r2    r3    MSRpar  MSRvid  SMT-eur  On-WN  SMT-news
 1   2   1  Bär et al. (2012a) (Run 2)   .823  .857  .677   .683    .873    .528     .664   .493
 2   3   5  Sarić et al. (2012)          .813  .856  .660   .698    .862    .361     .704   .468
 3   1   2  Sarić et al. (2012)          .813  .863  .675   .734    .880    .477     .679   .398
 4   4   4  Bär et al. (2012a) (Run 1)   .811  .855  .670   .682    .870    .511     .664   .467
 5   6  13  Banea et al. (2012)          .784  .844  .616   .535    .875    .420     .671   .403
 6  27   7  Heilman and Madnani (2012)   .783  .808  .639   .639    .720    .485     .712   .531
...
87  85  70  Baseline                     .311  .673  .435   .433    .299    .454     .586   .390

(r = .677), and #2 for ALLnrm (r = .857). The results of both configurations are within the 95% confidence interval for the ALL metric.12
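To make the evaluation metrics concrete, the following minimal sketch computes the Mean metric (the weighted mean of per-dataset Pearson correlations) and a Fisher-Z confidence interval as referred to in footnote 12. This is our own illustration of the computation, not the official scoring script; the dataset contents are placeholders.

```python
import math
import numpy as np

def mean_metric(per_dataset):
    """Weighted mean of the per-dataset Pearson correlations (the Mean metric).
    per_dataset: list of (gold_scores, system_scores) pairs, one entry per dataset;
    each dataset is weighted by its number of text pairs."""
    total_pairs = sum(len(gold) for gold, _ in per_dataset)
    weighted = 0.0
    for gold, system in per_dataset:
        r = np.corrcoef(gold, system)[0, 1]   # Pearson correlation for this dataset
        weighted += r * len(gold) / total_pairs
    return weighted

def fisher_z_interval(r, n, z=1.96):
    """Approximate 95% confidence interval for a Pearson correlation r
    over n text pairs, via the Fisher Z-value transformation."""
    fz = math.atanh(r)                # Fisher transformation
    se = 1.0 / math.sqrt(n - 3)       # standard error in z-space
    return math.tanh(fz - z * se), math.tanh(fz + z * se)
```

Datasets with more text pairs thus contribute proportionally more to the Mean score, which explains why robustness across the large datasets matters for this metric.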

Comparison of Runs 1 and 2

In the following, we analyze the differences between the configurations of Runs 1 and 2 in order to assess the effects of the two additional features, lexical substitution and statistical machine translation. Consider the following example text pair, taken from the SMT-news dataset:

(6.5)  (a) Putin’s Russia has already lost 12 leading journalists to murder in the past six years.
       (b) Putin’s Russia has already lost 12 prominent journalists, murdered in the last six years.

The above Example 6.5 received the maximum similarity score (5.0) by human annotators. In Run 1, our system predicted a similarity score of 4.37, while in Run 2 a score of 4.80 was predicted. We can clearly see that the improved score is due to the additional features of Run 2, as we show in the following. The texts are transformed into the following (not necessarily grammatically correct) fragments using statistical machine translation (as described in Section 6.3, using Spanish as a bridge language):

(6.6)  (a) The Russia of Putin has already lost 12 outstanding journalists of murder in the last six years.
       (b) The Russia of Putin has already lost 12 outstanding journalists, murdered in the last six years.

Figure 6.1: Official results on the test data for all 89 participating systems in terms of the ALL metric. Our two system configurations are ranked #1 and #4 (shown in opaque green), while the baseline is ranked #87 (opaque red).

In the translation process, the mismatching words leading and prominent were transformed into outstanding in both texts (Example 6.6). That way, the lexical gap was closed and in consequence an improved similarity score (4.80) was computed, which is very close to the human score (5.0). In the following, we give a second interesting example which also highlights the positive effects of the additional features of Run 2:

(6.7)  (a) Western Europeans, who have been spared this legacy, should heed our warnings.
       (b) The western Europeans, who have forgotten this history, should heed our warnings.

The above Example 6.7 also received an average human similarity score of 5.0. The predictions by Runs 1 and 2 are 3.67 and 3.95, respectively. For this text pair, there is a lexical gap between the words legacy and history, which carry similar meanings in the two texts. In analogy to the first example, we see that the prediction of Run 2 shows an improvement over Run 1. In this case, we attribute the improved performance to the employed lexical substitution system: As discussed in Section 6.3, this system provides substitutions for a set of 1,012 frequent English nouns. The substitutions for the words legacy and history are listed below:

(6.8)  (a) legacy: –
       (b) history: record, story, background, past, annals, account, antiquity, historical record

For Example 6.8, no substitutions were found for the word legacy. For history in the second text, eight substitutions (see above) were found. As described in Section 6.3, in our experiments we add all substitutions to the original texts. In a subsequent step, text similarity measures are thus much better able to compute a proper similarity score between the two augmented texts, as additional synonyms are given.
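To illustrate this expansion step, here is a minimal sketch under the assumption of a plain substitution lexicon; the actual system of Section 6.3 obtains the candidates from a lexical substitution system for 1,012 frequent English nouns. The sketch simply appends all known substitution candidates to a text before the similarity measures are applied.

```python
# Hypothetical substitution lexicon; the actual candidates come from the
# lexical substitution system described in Section 6.3.
SUBSTITUTIONS = {
    "history": ["record", "story", "background", "past", "annals",
                "account", "antiquity", "historical record"],
    # "legacy" has no entry, mirroring Example 6.8
}

def expand(text, lexicon=SUBSTITUTIONS):
    """Append all substitution candidates for known words to the text, so that
    downstream similarity measures can bridge the lexical gap."""
    tokens = [tok.strip(".,;:!?\"").lower() for tok in text.split()]
    extra = [cand for tok in tokens for cand in lexicon.get(tok, [])]
    return text if not extra else text + " " + " ".join(extra)

expanded = expand("The western Europeans, who have forgotten this history, should heed our warnings.")
```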

6.6 Error Analysis

We now present insights on the most severe types of errors that occurred across the five evaluation datasets. In order to classify the errors, we manually inspected the system output of our best system configuration (Run 2) and compared it with the given gold standard scores assigned by human subjects (for details on the annotation procedure, see Section 6.2). In the following, we discuss the error classes that we identified and present examples from the datasets along with the similarity scores produced by our system as well as those assigned by human subjects.13

Spelling errors/nonsensical input texts

Several cases exist where the input texts are either nonsensical, i.e. the meaning of the texts is fully unclear, or contain spelling errors. This is particularly true for the datasets MSRvid, SMT-eur, and SMT-news, while the spelling errors most prominently occur in the MSRvid dataset. We attribute the errors to the nature of the datasets: The MSRvid dataset contains descriptions of short video clips that were gathered using crowdsourcing (see Section 6.2). In that process, spelling errors could easily have happened. The text pairs in the SMT-eur and SMT-news datasets, on the other hand, resulted from automatic translations for given input texts (see Section 6.2). That way, nonsensical input texts provided to the original SMT system naturally resulted in nonsensical paired texts as a result of the automatic translation process. In contrast, such effects cannot be found for the other two datasets MSRpar and ON-WN. Those datasets have been created based on (presumably spelling-corrected) online news sources (MSRpar), or glosses in expert-created wordnets (ON-WN) (for details, see Section 6.2).

We illustrate some texts with spelling errors in Examples 6.9 to 6.11 and highlight the errors in boldface. In these cases, the text similarity score computed by our system is substantially lower than the human-assigned score. Even though it is easy for humans to determine the correct meaning of potatos (the correct plural form is potatoes), kangroo (should be kangaroo), or muybarak (should be Mubarak), machines cannot easily do so. This is due to the fact that erroneously spelled words either cannot be found in the employed lexical-semantic resource, such as WordNet, during the similarity computation, or, in the case of named entities, simply cannot be matched with the second text using e.g. string distance metrics (see Section 3.2.1).

(6.9)  (a) A man cuts up potatos.
       (b) A man slices potatoes.
       (source dataset: MSRvid; system output: 1.54; human score: 4.80)

13 The texts are taken from the original datasets; they may be grammatically incorrect and may contain erroneous punctuation.

(6.10)  (a) A kangroo is eating something.
        (b) A kangaroo is eating.
        (source dataset: MSRvid; system output: 2.53; human score: 4.80)

(6.11)  (a) Only a month ago, Mubarak dismissed demands for constitutional reform as “futile.”
        (b) Only a month before, muybarak refused the demands of constitutional reform by taxing them with “futile.”
        (source dataset: SMT-news; system output: 3.11; human score: 5.00)
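To make the character-level perspective concrete, a small sketch (our own illustration, not one of the measures evaluated here) computes a Dice coefficient over character bigrams. Even for pairs like kangroo/kangaroo the overlap stays well below that of an exact match, and lexicon-based measures fail entirely because the misspelled form has no entry.

```python
def char_bigrams(word):
    """Set of character bigrams of a word, lowercased."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}

def dice(a, b):
    """Dice coefficient over the two bigram sets."""
    x, y = char_bigrams(a), char_bigrams(b)
    if not x or not y:
        return 0.0
    return 2.0 * len(x & y) / (len(x) + len(y))

print(dice("kangroo", "kangaroo"))   # partial overlap despite the typo
print(dice("muybarak", "Mubarak"))   # misspelled named entity
```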

In the following Examples 6.12 to 6.14, we illustrate nonsensical text pairs. For these cases, it is unclear what constitutes text similarity for very short texts such as Tunisia/Tunisia was. While our system produced reasonable similarity estimates for these pairs (e.g. the maximum score of 5.0 for the Tunisia pair), it is unclear how the corresponding human scores are to be interpreted. Example 6.14 also illustrates a case where two headlines of a report were to be compared. Similarly, we argue that this is an invalid text pair for text similarity evaluation, as it is unclear in what way these texts are to be compared.

(6.12)  (a) A Europe for All
        (b) All Europe
        (source dataset: SMT-news; system output: 2.79; human score: 4.66)

(6.13)  (a) Tunisia
        (b) Tunisia was
        (source dataset: SMT-eur; system output: 5.00; human score: 3.75)

(6.14)  (a) Van Orden Report (A5-0241/2000)
        (b) Relation Horse-Trailer Orden (A5-0241/2000)
        (source dataset: SMT-eur; system output: 4.63; human score: 3.00)

Missing/non-resolvable contextual references

This error class comprises all errors that presumably happened due to contextual references, e.g. person or date/time references, that were either missing in one text, or not resolvable from the given text fragments. These errors exclusively occurred for the MSRpar dataset. We list three such erroneously classified text pairs in Examples 6.15 to 6.17 and highlight the contextual references in boldface. In Example 6.15, it is unclear whether the expression afternoon trading can be matched with trading Wednesday in the second text, as no temporal point of reference is given. In Example 6.16, the phrase believes the latter is true cannot be resolved in the given context: It is unclear what the acting subject Marcel Desailly actually refers to. In Example 6.17, one may assume that the pronoun he in the first sentence refers to the person named Young in the second text. However, this is not clear from the given context.

We argue that for a proper evaluation dataset, these artifacts should be resolved. An improved dataset would leave less room for speculation and subjective interpretations, and hence allow for a more precise evaluation.

(6.15)  (a) Gillette shares rose $1.45, or 4.5 percent, to $33.95 in afternoon New York Stock Exchange trading.
        (b) Shares of Gillette closed down 45 cents at $33.70 in trading Wednesday on the New York Stock Exchange.
        (source dataset: MSRpar; system output: 3.49; human score: 1.50)

(6.16)  (a) The Chelsea defender Marcel Desailly has been the latest to speak out.
        (b) Marcel Desailly, the France captain and Chelsea defender, believes the latter is true.
        (source dataset: MSRpar; system output: 4.36; human score: 2.40)

(6.17)  (a) He also could be barred permanently from the securities industry.
        (b) Young faces a fine, suspension or being permanently barred from the securities industry.
        (source dataset: MSRpar; system output: 3.63; human score: 1.80)

Unclear objectives

In the MSRpar as well as the MSRvid datasets, we found several cases where it is unclear how to judge similarity, as it is not well defined what makes the given texts similar or not. We list three cases in Examples 6.18 to 6.20. In Example 6.18, it is unclear why the average human rating is very low (1.30). In our opinion, the given texts are rather similar, as both texts discuss the rise or fall of a given stock market index. However, the addressed indexes and the amount of increase or decrease vary. In Example 6.19, two texts are given which describe that a person is doing something. However, in the first sentence the person is cutting [a] meat, while in the second sentence the person is riding a mechanical bull. The average human similarity rating for this text pair is the minimum score of 0 on the 0-5 scale, i.e. all humans unanimously assigned the lowest possible score. However, other interpretations also seem possible, e.g. by stressing the fact that both sentences mention a person who is doing something. Our system estimated a medium similarity score of 2.26, which seems, under this assumption, perfectly reasonable. In Example 6.20, both texts are equal except for the fact that in one text three men are playing chess, while in the second text only two people are doing so. Intuitively, we believe that our system output (4.63) is in the correct order of magnitude, i.e. in the very high similarity range between 4 and 5, compared to the average human score, which is in the medium range only (2.60).

In all these cases, more clearly defined guidelines should be provided which need to state explicitly what makes two texts similar or not. In consequence, we believe that both human annotators as well as text similarity measures will greatly benefit from such a more clear-cut definition of text similarity.

(6.18)  (a) The technology-laced Nasdaq Composite Index .IXIC inched down 1 point, or 0.11 percent, to 1,650.
        (b) The broad Standard & Poor’s 500 Index .SPX inched up 3 points, or 0.32 percent, to 970.
        (source dataset: MSRpar; system output: 3.43; human score: 1.30)

(6.19)  (a) A person is cutting a meat.
        (b) A person riding a mechanical bull
        (source dataset: MSRvid; system output: 2.26; human score: 0.00)

(6.20)  (a) Three men are playing chess.
        (b) Two men are playing chess.
        (source dataset: MSRvid; system output: 4.63; human score: 2.60)

Multiword expressions

A fourth class of errors comprises multiword expressions, which we illustrate in Examples 6.21 to 6.26. In all of these cases, humans rated the similarity of the text pairs very high (between 4 and 5 on the 0-5 scale). However, our system was not able to capture the meaning of such expressions appropriately, and hence assigned substantially lower similarity scores to the given text pairs. In Example 6.21, both words restrict and confine should presumably match with the expression place limits on in the second text. However, our system failed to do so and assigned only a very low similarity score of 0.78. In Example 6.22, the expressions nonacceptance and refusing to accept are expected to match. Again, our system failed and assigned only a medium similarity score, as some of the remaining text is similar. Examples 6.23 to 6.26 follow the same pattern: Humans assigned very high scores, while our system failed to determine the correct meaning of the multiword expressions and hence assigned substantially lower scores.

In our opinion, being able to compute similarity between multiword expressions is crucial for a robust text similarity system. While multiword expressions typically appear as isolated texts in the ON-WN dataset, they appear as part of longer texts in the SMT-news dataset. We strongly believe that many real-world texts exhibit similar characteristics, as such texts may be authored by different people or at different points in time, which in consequence may lead to variability in vocabulary usage. A robust text similarity system needs to be able to infer the correct meaning of particular expressions even if the meaning is spread across multiple words.

(6.21)  (a) restrict or confine
        (b) place limits on (extent or access).
        (source dataset: ON-WN; system output: 0.78; human score: 4.75)

(6.22)  (a) a written message of nonacceptance
        (b) a message refusing to accept something that is offered.
        (source dataset: ON-WN; system output: 2.55; human score: 5.00)

(6.23)  (a) the act of officially gaining entrance to somewhere
        (b) the act of admitting someone to enter.
        (source dataset: ON-WN; system output: 1.93; human score: 4.00)

(6.24)  (a) Being a Muslim and being an Islamist are not the same thing.
        (b) To be Moslem and be Islamiste are two different things.
        (source dataset: SMT-news; system output: 2.84; human score: 5.00)

(6.25)  (a) This tendency extends deeper than headscarves.
        (b) This trend goes well beyond simple headscarves.
        (source dataset: SMT-news; system output: 2.61; human score: 4.50)

(6.26)  (a) None of this absolves rich countries of their responsibility to help.
        (b) But that does not detract rich countries from their obligation to help.
        (source dataset: SMT-news; system output: 3.15; human score: 5.00)

Discussion

We demonstrated that the most severe errors in the intrinsic evaluation resulted from one of four classes: spelling errors/nonsensical input texts, missing/non-resolvable contextual references, unclear objectives, and multiword expressions. Overall, however, as we have shown with the discussion of the annotation quality in Table 6.1 on page 48, not even humans perfectly agree on how similar two texts in a pair are. To date, there is still much speculation and subjective interpretation involved in the process of judging text similarity. We argue that for future studies it needs to be better defined in what way texts are to be considered similar. In particular, it needs to be clearly stated what both human annotators and automated measures should focus on when judging similarity.

That way, (a) human subjects can be better trained and instructed on this task, and (b) text similarity measures can be tailored more precisely to the type of similarity at hand. Additionally, if multiword expressions are then captured more precisely in a future system, we hypothesize that text similarity can be computed even better with automated measures, and in consequence the overall system performance will most likely also improve.

6.7 Competing Systems

It is particularly interesting to note that all of the top 3 teams (Bär et al., 2012a; Sarić et al., 2012; Banea et al., 2012), which submitted the systems that performed best in the exercise and were ranked #1 to #5 in the final results according to the ALL metric (see Table 6.5), used a machine learning classifier to combine different sets of text similarity features. While we used a linear regression classifier from the Weka toolkit (Hall et al., 2009) as described earlier (see Section 6.4), Sarić et al. (2012) and Banea et al. (2012) combined the features using a support vector regression model from either the LIBSVM package (Chang and Lin, 2011) or Weka, respectively. In our opinion, this again supports our hypothesis that text similarity can best be computed if multiple measures are used in parallel, i.e. similarity is computed along multiple text dimensions.

The features that were used by the system by Sarić et al. (2012) comprise different variations of n-gram overlap measures, Latent Semantic Analysis (Landauer et al., 1998) (see Section 3.2.4), as well as syntactic features based on dependency trees. Banea et al. (2012) use the aggregation strategy by Mihalcea et al. (2006) in combination with a set of word similarity measures (see Section 3.1.1), non-compositional measures such as Latent Semantic Analysis and Explicit Semantic Analysis (see Section 3.2.5), and also syntactic features based on dependency graphs. However, contrary to these two systems, we not only evaluated the overall performance of our system, but precisely evaluated the performance of the individual classes of measures on the three given training datasets (see Table 6.4). That way, we were able to determine a set of text similarity measures which contains only a rather small number of general measures (see Table 6.3).14 In consequence, we avoided overfitting the features of the machine learning classifier on the training data, which made our system perform most robustly across the five datasets.

Besides the top systems, the next best system by Heilman and Madnani (2012) uses a text similarity measure which is based on a machine translation evaluation technique. Similar to edit-distance measures (see Section 3.2.1), the proposed measure is based on the following idea: the fewer operations it takes to transform one text into the other, the higher their similarity. Operations are thereby not only based on the character level of the texts (e.g. shift, insertion, deletion), but also comprise semantic operations such as synonymy replacements or paraphrasing text fragments.

As we are not able to summarize all remaining systems in detail, we refer the interested reader to the task overview by Agirre et al. (2012). There, all tools and resources used by the different systems are briefly summarized.

14 Sarić et al. (2012), for example, use number and stock index symbol overlap measures which are highly tailored towards the particular data at hand and will most likely not generalize well.

Table 6.6 Official results on the 2013 test data for the STS Task. In total, there were 88 participating runs in addition to the cosine baseline and the two best-performing systems from 2012. We report the Pearson correlation on the four test datasets HDL, OnWN, FNWN, and SMT (Agirre et al., 2013) along the single official evaluation metric Mean (see Section 6.5.2).

Rank  System               Mean  HDL   OnWN  FNWN  SMT
 1    Han et al. (2013)    .618  .764  .752  .581  .380
 2    Han et al. (2013)    .592  .742  .705  .544  .370
 3    deft-baseline¹⁵      .579  .653  .843  .508  .326
...
 6    Bär et al. (2012a)   .565  .734  .734  .340  .325
...
27    Sarić et al. (2012)  .522  .655  .633  .405  .338
...
76    Baseline             .363  .539  .282  .214  .286

Further details are reported in the respective system descriptions.

Following the idea of the top-performing systems, we assume it would be interesting in future work to inspect the performance of a combined regression model on all features of the top systems. We assume that the better these features capture different text characteristics, the better the combined model performs. We already carried out first experiments in this direction: We combined the feature sets of the two top-ranked systems, i.e. our Run 2 and the system by Sarić et al. (2012), in a single log-linear regression model. Our assumption holds true, and the combined model performs best across all three evaluation metrics with r1 = .835, r2 = .872, and r3 = .707 for the metrics ALL, ALLnrm and Mean, respectively. The combined feature set also achieves the best performance on the MSRvid dataset, r = .899. The results show that text similarity computation can greatly benefit from a rich set of features covering a wide variety of text characteristics. In consequence, we believe this to be a highly promising direction for future research. Follow-up research in this direction has already partly been carried out in the context of the second Semantic Textual Similarity Task in 2013, which we discuss in the following section.
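The feature combination underlying all of these systems can be sketched as follows. This is an illustrative analogue of the regression setup described in Section 6.4, written with scikit-learn rather than the Weka toolkit actually used; the feature matrix, the measure names, and the clamping of predictions to the [0, 5] range are placeholders and simplifications of our own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# One row per text pair, one column per pre-computed similarity score
# (placeholder values; e.g. ESA, character n-gram profiles, word n-grams).
X_train = np.array([[0.72, 0.64, 0.58],
                    [0.31, 0.28, 0.40],
                    [0.90, 0.85, 0.77]])
y_train = np.array([4.2, 1.5, 4.8])     # human similarity judgments on the 0-5 scale

model = LinearRegression().fit(X_train, y_train)

# Predict a continuous score for an unseen pair and clamp it to [0, 5]
# (the clamping is our own simplification, not necessarily the original setup).
X_test = np.array([[0.55, 0.60, 0.52]])
score = float(np.clip(model.predict(X_test)[0], 0.0, 5.0))
```

Adding further feature columns from other systems, as in the combination experiment above, only changes the width of the feature matrix, not the combination step itself.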

6.8 Follow-up Work

In 2013, the second Semantic Textual Similarity (STS) Task (Agirre et al., 2013) was held as a follow-up exercise to the pilot task that we reported on throughout this chapter. This year, the STS Task was held as a shared task at the Second Joint Conference on Lexical and Computational Semantics.16 This time, we did not participate directly in the exercises, but provided our experimental setup (see Section 6.4) as a strong open source baseline which was also officially recommended to all task participants.17 The intention thereby was to support participants of this year's exercises in exploring our system, in particular with respect to the features and the combination strategy used, and to invite them to build upon our last year's best performing system, e.g. by introducing novel features or using a different machine learning classifier. That way, we also hope to lower the entrance barrier for new participants to this task by providing them with an out-of-the-box solution for getting started with developing systems for text similarity computation.

In this year's task, a number of participants reported that they were inspired by our experimental setup and developed a system which reuses several of our proposed text similarity features (Buscaldi et al., 2013; Severyn et al., 2013; Zhu and Lan, 2013). Two participants further reported that they directly built upon our proposed experimental setup, either as a first step for the generation of text similarity features (Marsi et al., 2013) or as a basis for the whole text similarity system including the feature combination and evaluation components (Wu et al., 2013). This demonstrates that our previously proposed experimental setup is both highly competitive with respect to the text similarity features used, as well as thoroughly engineered so that new users can quickly adopt our system and customize it in various ways.

While the 2012 data (see Section 6.2) was provided to train the systems, new test data was given to the participants to evaluate their final system configurations. The new data comprises sentence pairs which are taken from news headlines (HDL in Table 6.6), machine translation evaluation exercises (SMT), and glosses (OnWN and FNWN). In total, 34 teams (compared to 35 in 2012) submitted 89 systems (compared to 88 in 2012, up to 3 per participant) in these exercises. Additionally, the task organizers provided the same cosine similarity baseline as last year. This year, the task organizers further officially report the results of the two best-performing systems from 2012 (Bär et al., 2012a; Sarić et al., 2012) on the new 2013 data.

We report the official results (Agirre et al., 2013) in Table 6.6 and visualize them in Figure 6.2. This year, evaluation was only carried out in terms of the Mean metric (see Section 6.5.2), which refers to the weighted mean across the Pearson correlations of all datasets, where the weight corresponds to the number of text pairs in each dataset. The results show that the top-performing system (Han et al., 2013) outperforms the publicly available systems by Bär et al. (2012a) and Sarić et al. (2012). It is to be noted that the publicly available version19 of our experimental setup is slightly different from the version that performed best in the SemEval-2012 workshop: It does not use a distributional thesaurus and includes only a single text expansion mechanism (lexical substitution only, no statistical machine translation), as the corresponding code and the accompanying resources cannot be readily made available to the public. Nonetheless, it is particularly interesting to note that the system by Sarić et al. (2012) is ranked only #27 on this year's data, while it was ranked #2 in 2012. Obviously, this system is highly sensitive to both the training and test data. In contrast, our baseline system is ranked #6 on this year's data. As Agirre et al. (2013) put it, our system (Bär et al., 2012a) "performed extremely well when trained on all available training [data], with no special tweaking for each dataset". This is in line with our findings reported in Section 6.5 that our system does not necessarily show the best results on any of the single datasets, but it is most robust across different data.

Figure 6.2: Official results on the 2013 test data for all participating systems in terms of the Mean metric. Our system from last year is reported as a strong baseline and ranked #6 (shown in opaque green). The cosine similarity baseline is ranked #76 (opaque red).18

15 Unfortunately, it is unclear who the authors of this system submission are.
16 http://ixa2.si.ehu.es/sts
17 http://www-nlp.stanford.edu/wiki/STS
18 In the official results (Agirre et al., 2012), the cosine similarity baseline is reported on rank #73. While the detailed results for both our system as well as the two system configurations by Sarić et al. (2012) are reported as well, they are not listed directly as part of the ranked results.

6.9 Chapter Summary

In this chapter, we presented an intrinsic evaluation of our system in the context of the pilot Semantic Textual Similarity (STS) Task at the Semantic Evaluation workshop. We showed that our system performed best in two out of the three official evaluation metrics. While we did not reach the highest scores on any of the single datasets (see Table 6.5), our system was most robust across different data.

The STS Task is a supervised setting, i.e. text pairs along with human judgments are provided at development time in order to train the systems. This setting allows our system to build a machine learning classifier which learns combinations of the text similarity scores very well with respect to human judgments on the training data, and then successfully applies this model to unseen test data. However, we argue that its generalization is still limited, due to two major issues: (a) It is unclear how to judge similarity between pairs of texts which contain contextual (e.g. temporal) references such as on Monday vs. after the Thanksgiving weekend. (b) For several pairs, it is unclear what point of view to take, e.g. for the pair An animal is eating/The animal is hopping. Is the pair to be considered similar (an animal is doing something) or rather not (eating vs. hopping)? In both cases, a different set of annotators may have a different view on how similar such texts actually are. Even more, a new collection of text pairs may favor particular points of view, such as resolving temporal references with respect to a time frame that is shared by many text pairs. For future work, we believe that a more clear-cut definition of what constitutes text similarity and what point of view is to be taken is the key to systems that will generalize even better.

19 http://code.google.com/p/dkpro-similarity-asl/wiki/SemEval2013

Chapter 7

Extrinsic Evaluation

In this chapter, we conduct an extrinsic evaluation of our system on the task of text reuse detection. We first describe the task in Section 7.1, and then report details of our experimental setup and the features used in Section 7.2. We discuss the evaluation results on three standard datasets: the Wikipedia Rewrite Corpus, the METER Corpus, and the Webis Crowd Paraphrase Corpus, which we already introduced in Section 4.2.1. Finally, we conclude with an analysis of our findings.

7.1 Task Description

Text reuse is “the reuse of existing written sources in the creation of a new text” (Clough et al., 2002). Text reuse is a common phenomenon and arises, for example, on the Web from mirroring texts on different sites or reusing texts in public blogs. In other text collections such as content authoring systems of communities or enterprises, text reuse arises from keeping multiple versions, copies containing customizations or reformulations, or the use of template texts (Broder et al., 1997). Detecting text reuse has been studied in a variety of tasks and applications, e.g. the detection of journalistic text reuse (Clough et al., 2002), the identification of rewrite sources for ancient literary texts (Lee, 2007), or the analysis of text reuse in blogs and web pages (Abdel-Hamid et al., 2009).

A common approach to text reuse detection is to compute similarity between a source text and a possibly reused text. A multitude of text similarity measures have been proposed for computing similarity based on surface-level and semantic features (see Chapter 3). However, as discussed in Chapter 5, the existing similarity measures typically exhibit a major limitation: They compute similarity only on the features that describe the content dimension of the given texts. Thus, they so far ignore any other text characteristics, e.g. the structural and stylistic text dimensions.

Figure 7.1 shows an example of text reuse taken from the Wikipedia Rewrite Corpus (Clough and Stevenson, 2011) (see Section 4.2.1 for details on the dataset) where parts of a given source text have been reused either verbatim or by using similar words or phrases. As the example illustrates, the process of creating reused text includes a revision step in which the author has a certain degree of freedom on how to reuse the source text. Similarity between the topics and their relations in both revisions can then be detected by content-centric text similarity measures. However, the author has further split the source text into two individual sentences and changed the order of the reused parts. For detecting the degree of similarity of such a revision, structural similarity should be computed. Additionally, the given texts exhibit a high degree of similarity with respect to stylistic features, e.g. vocabulary richness.1 In order to use such features as indicators of text reuse, we propose to further include measures of stylistic similarity.

Source Text. PageRank is a link analysis algorithm used by the Google Internet search engine that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of “measuring” its relative importance within the set.

Text Reuse. The PageRank algorithm is used to designate every aspect of a set of hyperlinked documents with a numerical weighting. It is used by the Google search engine to estimate the relative importance of a web page according to this weighting.

Figure 7.1: Example of text reuse taken from the Wikipedia Rewrite Corpus (Clough and Stevenson, 2011). Various parts of the source text have been reused, either verbatim (underlined) or using similar words or phrases (wavy underlined). However, the source text was split into two individual sentences and the order of the reused parts was changed.
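As a concrete instance of such a stylistic feature, the type-token ratio mentioned above can be sketched as follows; this is our own illustration, and the exact values given in footnote 1 depend on the tokenizer actually used.

```python
import re

def type_token_ratio(text):
    """Number of distinct word forms divided by the total number of tokens.
    A very simple tokenization is used here, so the exact values may differ
    from those reported in footnote 1."""
    tokens = re.findall(r"[a-zA-Z0-9]+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```

One simple way to turn the two per-text values into a similarity score is, for example, 1 - |TTR_a - TTR_b|; see Section 5.3.3 for the stylistic measures actually used.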

7.2 Text Similarity Measures

For this task, we employ a set of text similarity measures which follows the description in Section 5.3 with a minor set of modifications: While we used the set of structural and stylistic similarity measures as originally described, we slightly modified the set of content similarity measures to better fit the task of text reuse detection. In Table 7.1, we list the measures we used in the extrinsic evaluation and describe the configuration parameters as necessary. In the following, we report on all measures which differ from the basic setup described in Section 5.3.

This time, we additionally used Latent Semantic Analysis (Landauer et al., 1998) for computing text similarity, where the construction of the semantic space was done using the evaluation corpora (see Section 7.4). Details on this measure can be found in Section 3.2.4.

For assessing word n-gram similarity, we employed the same measures as originally described in Section 5.3. This time, we additionally use the original system name Ferret (Lyon et al., 2004) to refer to the variant with n = 3 using the Jaccard coefficient.
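As an illustration of the word n-gram measures, the following sketch (assuming simple whitespace tokenization, which is a simplification of our own) shows the Jaccard and containment variants; with n = 3 and the Jaccard coefficient it corresponds to the Ferret-style configuration mentioned above.

```python
def word_ngrams(text, n):
    """Set of word n-grams of a text (simple whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard coefficient over word n-grams; n = 3 mirrors the Ferret variant."""
    x, y = word_ngrams(a, n), word_ngrams(b, n)
    return len(x & y) / len(x | y) if x | y else 0.0

def containment(a, b, n=2):
    """Containment: the share of n-grams of the first text also found in the second."""
    x, y = word_ngrams(a, n), word_ngrams(b, n)
    return len(x & y) / len(x) if x else 0.0
```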

7.3 Experimental Setup

The system we used for the extrinsic evaluation in the context of text reuse detection again builds upon the composite model introduced in Chapter 5, and is a slight modification of the system which we already used for an intrinsic evaluation in Chapter 6. It again follows the idea that text similarity can be best detected if a composition of measures is used. We refer the reader to Section 5.4 for an elaborate discussion on the technical aspects of the system.

1 The type-token ratio (Templin, 1957) of the texts is .79 and .71, respectively.

Table 7.1 Compositional and non-compositional text similarity measures introduced in Chapter 3, along with their configurations, which we used in the extrinsic evaluation

Text Similarity Measure        Configurations
Compositional Measures
Mihalcea et al. (2006)         Resnik (1995), Jiang and Conrath (1997), and Lin (1998a) on WordNet
Non-compositional Measures
Character n-gram profiles      n = 2, ..., 15
Explicit Semantic Analysis     Wikipedia, Wiktionary
Greedy String Tiling
Latent Semantic Analysis       Semantic space on the evaluation corpora
Longest common subsequence     2 normalizations
Longest common substring
Word n-grams                   Jaccard/Containment, n = 1, ..., 15; Ferret: n = 3 using Jaccard coefficient

The system we used for the experiments throughout this chapter modifies the original experimental setup as follows:

Pre-processing: no modifications; see Section 5.4

Feature Generation: no modifications; see Section 5.4

Feature Combination: We slightly modified the feature combination step of the original system to meet the requirements of this task. We still use the pre-computed similarity scores as input for the machine learning classifier. While we used a linear regression classifier in Chapter 6 for the intrinsic evaluation which predicts numerical scores for each text pair, we now combine the features using a Naive Bayes and a C4.5 decision tree classifier (J48 implementation) from the Weka toolkit (Hall et al., 2009). These classifiers are able to predict nominal classes of text reuse, e.g. a heavy revision shown in Figure 7.1.

Post-processing: We did not apply any post-processing steps for this task.

Using 10-fold cross-validation, we then ran three sets of experiments as follows:

Run 1: In the first set of experiments, we tested the text similarity scores of one single measure at a time as a single feature for the classifier. That way, we were able to determine those text similarity measures which perform best per text dimension.

Run 2: We then tested a combination of similarity measures per text dimension, i.e. we combined multiple similarity measures within the content, structure, and style dimensions individually. That way, we were able to determine the performance of multiple measures within a single text dimension.

Run 3: In the third configuration, we combined the measures across text dimensions in order to determine the best overall configuration, i.e. we combined multiple similarity measures regardless of their text dimension. That way, we were able to investigate the effects of computing text similarity along multiple text dimensions.

Table 7.2 Performance of selected similarity measures on the Wikipedia Rewrite Corpus, the METER Corpus, and the Webis Crowd Paraphrase Corpus, grouped by similarity dimension

                                 WP Rewrite      METER          Webis CPC
Text Similarity Feature          Acc.   F¯1      Acc.   F¯1     Acc.   F¯1
Majority Class Baseline          .400   .143     .715   .417    .517   .341
Ferret Baseline                  .642   .517     .684   .535    .794   .789
Content Similarity
Character 5-gram Profiles        .642   .537     .715   .417    .753   .742
ESA (Wikipedia)                  .474   .323     .711   .484    .760   .753
Greedy String Tiling             .558   .457     .755   .645    .805   .800
Longest Common Substring         .621   .524     .719   .467    .743   .736
Resnik Word Similarity           .632   .500     .715   .417    .666   .656
Word 2-grams Containment         .747   .683     .727   .692    .801   .797
Structural Similarity
Word Pair Distance               .611   .489     .715   .417    .775   .767
Word Pair Ordering               .642   .494     .715   .417    .785   .780
POS 3-grams Containment          .642   .554     .731   .701    .787   .783
Stopword 3-grams                 .632   .515     .715   .417    .778   .776
Stopword 7-grams                 .653   .527     .652   .482    .753   .750
Stylistic Similarity
Function Word Frequencies        .453   .296     .715   .417    .727   .719
Sequential TTR                   .400   .220     .715   .417    .667   .638
Sentence Ratio                   .389   .268     .755   .625    .657   .653
Token Ratio                      .432   .222     .755   .619    .778   .774
Type-Token Ratio                 .379   .197     .715   .417    .723   .712
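To make the three experiment configurations concrete, the following sketch outlines the cross-validation setup. It uses scikit-learn classifiers as stand-ins for the Weka Naive Bayes and C4.5/J48 classifiers named above, and the feature matrix, labels, and dimension groupings are placeholders of our own.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Placeholder feature matrix: one row per text pair, one column per measure.
dimensions = {                   # column indices per text dimension (illustrative)
    "content": [0, 1],           # e.g. word 2-grams containment, Greedy String Tiling
    "structure": [2, 3],         # e.g. POS 3-grams containment, stopword n-grams
    "style": [4],                # e.g. Sequential TTR
}
rng = np.random.default_rng(0)
X = rng.random((100, 5))              # pre-computed similarity scores
y = np.arange(100) % 4                # e.g. the four rewrite classes (balanced placeholder)

def evaluate(columns, clf):
    """Unweighted macro-F1 under 10-fold cross-validation."""
    return cross_val_score(clf, X[:, columns], y, cv=10, scoring="f1_macro").mean()

run1 = evaluate([0], GaussianNB())                    # Run 1: one measure at a time
run2 = evaluate(dimensions["content"], GaussianNB())  # Run 2: one dimension
run3 = evaluate(sum(dimensions.values(), []),         # Run 3: across all dimensions
                DecisionTreeClassifier())
```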

7.4 Results & Discussion

In this section, we report the results obtained on the evaluation datasets, thereby demonstrating the effectiveness of the proposed composite model for computing text similarity across the three text dimensions content, structure, and style. We utilized three datasets for the evaluation of our system which originate in the fields of plagiarism detection, journalistic text reuse detection, and paraphrase recognition: the Wikipedia Rewrite Corpus (Clough and Stevenson, 2011), the METER Corpus (Gaizauskas et al., 2001), and the Webis Crowd Paraphrase Corpus (Burrows et al., 2013). We already described these datasets in detail in Section 4.2.1, and summarized their properties in Table 4.1 on page 30.

Evaluation was carried out in terms of accuracy and F¯1 score. By accuracy, we refer to the number of correctly predicted texts divided by the total number of texts. As we reported in Section 4.2.1, the class distributions in the datasets are skewed.2 We hence report the overall F¯1 score as the arithmetic mean across the F1 scores of all classes (which vary from dataset to dataset) in order to account for the class imbalance (Clough et al., 2002; Sánchez-Vega et al., 2010). That way, systems that perform well on only some of the classes but perform poorly on others are penalized when computing the overall F¯1 score. A good performance across all classes is necessary to achieve a good overall performance. We compare our results with two baselines: the majority class baseline and the word similarity measure Ferret (Lyon et al., 2004) (see Section 7.2). Additionally, we report the best results from the literature for comparison.

Table 7.3 Results for the best classification on the Wikipedia Rewrite Corpus for the original 4-way classification

System                              Acc.   F¯1
Majority Class Baseline             .400   .143
Ferret Baseline                     .642   .517
Chong et al. (2010)³                .705   .641
Clough and Stevenson (2011)
  - our re-implementation⁴          .726   .658
  - as reported in their work       .800   .757
Our System                          .842   .811
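A minimal sketch of the two evaluation scores as defined here, i.e. accuracy and the unweighted arithmetic mean of the per-class F1 values (our own illustration; the labels below are placeholders):

```python
def accuracy(gold, pred):
    """Share of correctly predicted texts."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted arithmetic mean of the per-class F1 scores."""
    scores = []
    for label in set(gold):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

gold = ["cut&paste", "light", "heavy", "none", "none"]   # placeholder labels
pred = ["cut&paste", "heavy", "heavy", "none", "none"]
print(accuracy(gold, pred), macro_f1(gold, pred))
```

Because every class contributes equally to the mean, a system that ignores a rare class is penalized even if its overall accuracy remains high.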

7.4.1 Wikipedia Rewrite Corpus

In the Wikipedia Rewrite Corpus, text reuse occurs between a source text and each of the short answers given by human subjects. According to the degree of rewrite, the dataset is 4-way classified as cut & paste, light revision, heavy revision, and no plagiarism. We reported further details on this dataset in Section 4.2.1.

We summarize the results on this dataset in Table 7.3.5 In the best configuration, when combining similarity measures across dimensions, our system achieves a performance of F¯1 = .811. It outperforms the best reference system by Clough and Stevenson (2011) by 5.4 points in terms of F¯1 score compared to their reported numbers, and by 15.3 points compared to our re-implementation of this system.4

2 We reported the class distributions along with the dataset descriptions in Section 4.2.1.
3 Chong et al. (2010) report F¯1 = .698 in their original work. This figure, however, reflects the weighted arithmetic mean over all four classes of the dataset where one class is twice as prominent as each of the others. As discussed in Section 7.4, we report all F¯1 scores as the unweighted arithmetic mean in order to account for the class imbalance.
4 While we were able to reproduce the results of the Ferret baseline as reported by Chong et al. (2010), our re-implementation of the system by Clough and Stevenson (2011) (Naive Bayes classifier, same feature set) resulted in a much lower overall performance. We observed the largest difference for the longest common subsequence measure, even though we used a standard implementation (Allison and Dix, 1986) and normalized as described by Clough and Stevenson (2011).
5 Figures in italics are taken from the literature, while we (re-)implemented the remaining systems. This applies to all result tables in this chapter.


Figure 7.2: Detailed results for a selected set of individual text similarity measures, grouped by similarity dimension, on the Wikipedia Rewrite Corpus

Their system uses a Naive Bayes classifier with only a very small feature set: word n-gram containment (n = 1, 2, ..., 5) and longest common subsequence. For comparison, we re-implemented their system and also applied it to the two datasets in the remainder of this chapter. We report our findings in Sections 7.4.2 and 7.4.3.

In Table 7.2, we further report the detailed results for a selected set of individual text similarity measures, grouped by similarity dimension.6 A graphical overview of these results is shown in Figure 7.2. Due to space limitations, we only report a selected set of best-performing measures per dimension and compare them with the baselines: While the majority class baseline performs very poorly on this dataset (F¯1 = .143), the Ferret baseline achieves F¯1 = .517. Some content similarity measures such as word 2-grams containment show a reasonable performance (F¯1 = .683), while structural measures cannot exceed F¯1 = .554, and stylistic measures perform only slightly better than the majority class baseline (F¯1 = .296).

In Table 7.4, we report the best results for the combinations of text similarity measures within and across dimensions, and visualize them in Figure 7.3. When we combine the measures within their respective dimensions, content outperforms structural and stylistic similarity. However, all combinations of measures across dimensions in addition to content similarity improve the results. The best performance is achieved by combining the three similarity measures longest common subsequence, stopword 10-grams, and character 5-gram profiles from the two dimensions content and structure. This supports our hypothesis that computing text similarity indeed benefits from dimensions other than content. The effects of dimension combination held true regardless of the classifier used, even though the decision tree classifier performed consistently better than Naive Bayes.

6 Table 7.2 also lists the detailed results for the METER Corpus and the Webis Crowd Paraphrase Corpus. We will discuss these results in the corresponding Sections 7.4.2 and 7.4.3.

Table 7.4 Results of the best combinations of text similarity measures within (left) and across (right) dimensions on the Wikipedia Rewrite Corpus for the original 4-way classification

Text Similarity Dimension        Acc.   F¯1
Combinations within dimensions
Content                          .747   .693
Structure                        .716   .660
Style                            .442   .398
Combinations across dimensions
Content + Style                  .800   .757
Content + Structure              .842   .811
Structure + Style                .632   .569
Content + Structure + Style      .832   .798

Figure 7.3: Results of the best combinations of text similarity measures within (left) and across (right) dimensions on the Wikipedia Rewrite Corpus

Error Analysis

We present the confusion matrix for our best configuration in Table 7.5. In total, 15 texts out of 95 have been classified with the wrong label. While all texts except a single one in the class no plagiarism have been classified correctly, 67% of the errors (10 texts) are due to misclassifications in the light and heavy revision classes. We assume that these errors are due to questionable gold standard annotations, as the annotation guidelines for these two classes are highly similar (Clough and Stevenson, 2011). For the light revision class, the annotators “could alter the text in some basic ways”, thereby “altering the grammatical structure (i.e. paraphrasing).” Likewise, for the heavy revision class, the annotation manual expected the annotators to “rephrase the text to generate an answer with the same meaning as the source text, but expressed using different words and structure.”

Table 7.5 Confusion matrix (expected class vs. classification result) for the best classification on the Wikipedia Rewrite Corpus for the original 4-way classification

classified →
expected ↓     cut&paste  light rev.  heavy rev.  no plag.
cut&paste          15          1           1          2
light rev.          3         13           3          0
heavy rev.          2          2          15          0
no plag.            0          0           1         37

Table 7.6 Results on the Wikipedia Rewrite Corpus for the folded 3-way classification

System                          Acc.   F¯1
Majority Class Baseline         .400   .190
Ferret Baseline                 .768   .745
Clough and Stevenson (2011)⁷    .821   .788
Our System                      .884   .859

Table 7.7 Confusion matrix on the Wikipedia Rewrite Corpus for the folded 3-way classification

classified →
expected ↓     cut&paste  potential  no plag.
cut&paste          14          3         2
potential           5         33         0
no plag.            0          1        37

As each text of this dataset was written by only a single person for a given rewrite category, we decided to conduct an annotation study, in which we were mostly interested in the inter-rater agreement of the subjects. We asked 3 participants to rate the degree of text reuse and provided them with the original annotation guidelines. We used a generalization of Scott's (1955) π-measure for calculating a chance-corrected inter-rater agreement for multiple raters, which is known as Fleiss' (1971) κ and Carletta's (1996) K.8 In summary, the results9 of our study support our hypothesis that the annotators mostly disagree for the light and heavy revision classes, with fair10 agreements of κ = .34 and κ = .28, respectively. For the cut & paste and no plagiarism classes, we observe moderate10 agreements, κ = .53 and κ = .56, respectively. The detailed annotations can be found in Appendix A.2.1.

Based on these insights, we decided to fold the light and heavy revision classes into a single class potential plagiarism. This approach was also briefly discussed by Clough and Stevenson (2011), though not carried out in their work. We report the results and the confusion matrix in Tables 7.6 and 7.7. As the classification task gets easier through the reduction to three classes, the results for the Ferret baseline improve, from F¯1 = .517 to F¯1 = .745. The re-implementation of the system by Clough and Stevenson (2011) achieves F¯1 = .788. Our system again outperforms all other systems with F¯1 = .859.

We hypothesize that fine-grained distinctions are not always necessary, depending on the particular task at hand. We thus decided to go one step further and fold all potential cases of text reuse. This variant of the dataset results in a binary classification of plagiarized/non-plagiarized texts. We present the results and the corresponding confusion matrix in Tables 7.8 and 7.9. In this simplified setting, even the Ferret baseline achieves an excellent performance of F¯1 = .935. Our system (F¯1 = .967) still slightly outperforms the re-implementation of the system by Clough and Stevenson (2011).

An interesting observation across all three variants of the dataset is that the same three texts always constitute severe error instances, where e.g. a cut & paste text is falsely labeled as no plagiarism, which is more severe than mislabeling a light revision as a heavy revision. Two of the three cases account for the texts which describe the PageRank algorithm. One of these instances was falsely labeled as cut & paste while it is non-plagiarized, and the other one vice versa. In the envisioned semi-supervised setting, the remaining less severe error instances, where e.g. a light revision was classified as a heavy revision, can be reviewed by a user of the system. We suppose it is hard even for users to draw a strict line between possibly reused and non-reused texts11, as this heavily depends on external effects such as user intentions and the task at hand.

7 We report the results for our re-implementation of the system by Clough and Stevenson (2011). In their original work, they did not evaluate on this dataset.
8 An exhaustive discussion of inter-rater agreement measures is given by Artstein and Poesio (2008).
9 The detailed results are reported in Appendix A.2.1.
10 Strength of agreement for κ values according to Landis and Koch (1977).

Table 7.8 Results on the Wikipedia Rewrite Corpus for the folded binary classification

System                          Acc.   F¯1
Majority Class Baseline         .600   .375
Ferret Baseline                 .937   .935
Clough and Stevenson (2011)
  - our re-implementation       .958   .957
  - as reported                 .947   n/a
Our System                      .968   .967

Table 7.9 Confusion matrix on the Wikipedia Rewrite Corpus for the folded binary classification

classified →
expected ↓     plagiarism  no plag.
plagiarism         55          2
no plag.            1         37

11 As also indicated by the only fair inter-annotator agreements discussed above.
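For reference, a compact sketch of the chance-corrected multi-rater agreement (Fleiss' κ) used in the annotation studies of this chapter; the rating matrix below is a placeholder, not our actual annotation data.

```python
import numpy as np

def fleiss_kappa(ratings):
    """Fleiss' kappa for a matrix with one row per item and one column per
    category; each cell holds the number of raters who chose that category."""
    ratings = np.asarray(ratings, dtype=float)
    n_items, _ = ratings.shape
    n_raters = ratings[0].sum()                       # raters per item (assumed constant)
    p_j = ratings.sum(axis=0) / (n_items * n_raters)  # category proportions
    p_i = ((ratings ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar, p_e = p_i.mean(), (p_j ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)

# Placeholder: 5 items rated by 3 annotators into 4 rewrite categories.
example = [[3, 0, 0, 0],
           [0, 2, 1, 0],
           [0, 1, 2, 0],
           [0, 0, 1, 2],
           [0, 0, 0, 3]]
print(fleiss_kappa(example))
```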


Figure 7.4: Detailed results for a selected set of individual text similarity measures, grouped by similarity dimension, on the METER Corpus

7.4.2 METER Corpus

In the METER Corpus, text reuse occurs between news sources from the UK Press Association (PA) and corresponding newspaper articles. All newspaper articles have been annotated as to whether they are wholly derived, partially derived, or non-derived from the PA sources. We reported details on this dataset in Section 4.2.1. For the evaluation of our system on this dataset, we followed Sánchez-Vega et al. (2010) and folded the three annotation classes to a binary classification of 181 reused (wholly/partially derived) and 72 non-reused instances in order to carry out a comparable evaluation study.

We summarize the results on this dataset in Table 7.10. In the best configuration, our system achieves an overall performance of F¯1 = .768. It outperforms the best reference system by Sánchez-Vega et al. (2010) by 6.3 points in terms of F¯1 score. Their system uses a Naive Bayes classifier with two custom features which compare texts based on the length and frequency of common word sequences and the relevance of individual words.

As in the previous section, we further report the detailed results for a selected set of individual text similarity measures in Table 7.2 on page 70, and visualize them in Figure 7.4. From these figures, we learn that many text similarity measures cannot exceed the simple majority class baseline (F¯1 = .417) when applied individually.

In Table 7.11, we show that the performance of text reuse detection always improves over individual measures (see Table 7.2 on page 70) when we combine the measures within their respective dimensions. The corresponding chart is shown in Figure 7.5. An exception is the combination of structural similarity measures, which only performs on the same level as the best individual measure, part-of-speech 3-grams containment. Combinations of content similarity measures show a better performance than combinations of structural or stylistic measures. Our system achieves its best performance on this dataset when text similarity measures are combined across all three dimensions content, structure, and style. The best configuration resulted from using a Naive Bayes classifier with the following measures: Greedy String Tiling, stopword 12-grams, and Sequential TTR. As for the previous dataset, the effects of dimension combination held true regardless of the classifier used.

The influence of the stylistic similarity measures is particularly interesting to note. In contrast to the Wikipedia Rewrite Corpus, including these measures in the composition improves the results on this dataset: Our classifier is able to detect similarity even for reused texts by expert journalists. This is due to the fact that a journalistic text which reuses the original press agency source most likely also shows stylistic similarity in terms of e.g. vocabulary richness.

Table 7.10 Results for the best classification on the METER Corpus

System                          Acc.   F¯1
Majority Class Baseline         .715   .417
Ferret Baseline                 .684   .535
Clough and Stevenson (2011)⁷    .692   .680
Sánchez-Vega et al. (2010)      .783   .705
Our System                      .802   .768

Error Analysis We present the confusion matrix for our best configuration in Table 7.12. In total, 50 texts out of 253 have been classified incorrectly: 30 instances of text reuse have not been identified by the classifier, and 20 non-reused texts have been mistakenly labeled as such. However, the original annotations have been carried out by only a single annotator (Gaizauskas et al., 2001) which may have resulted in subjective judgments. Thus, as for the previous dataset in Section 7.4.1, we conducted an annotation study with three annotators to gain further insights into the data. The results8 show that for 61% of all texts the annotators fully agree. The chance-corrected Fleiss’(1971) agreement κ = .47 is moderate9. The detailed annotations can be found in Appendix A.2.2. For the 30 instances of text reuse which have not been identified by the classifier, it is particularly interesting to note that many errors are due to the fact that a lower overall text similarity between the possibly reused text and the original source does not necessarily entail the label no reuse. The newspaper article about the English singer-songwriter Liam Gallagher, for example, is originally labeled as text reuse. However, our classifier falsely assigned the label no reuse. It turns out, though, that the reused text is about four times as long as the original press agency source, with lots of new facts being introduced there. Consequently, only a low similarity score can be computed between the additional material in the newspaper article 78 CHAPTER 7. EXTRINSIC EVALUATION

Table 7.11 Results of the best combinations of text similarity measures within (left) and across (right) dimensions on the METER Corpus

Text Similarity Dim. Acc. F¯1 Text Similarity Dimension Acc. F¯1 Combinations within dimensions Combinations across dimensions Content .759 .712 Content + Style .779 .733 Structure .731 .701 Content + Structure .739 .713 Style .755 .672 Structure + Style .767 .739 Content + Structure + Style .802 .768


Figure 7.5: Results of the best combinations of text similarity measures within (left) and across (right) dimensions on the METER Corpus

7.4.3 Webis Crowd Paraphrase Corpus

In the Webis Crowd Paraphrase Corpus, text reuse occurs between book excerpts and the corresponding paraphrases that were acquired using crowdsourcing. The dataset is binarily classified into positive and negative samples. We reported further details on this dataset in Section 4.2.1. We summarize the results on this dataset in Table 7.13. Even though the Ferret baseline is a strong competitor (F̄1 = .789), our system achieves the best results on this dataset with F̄1 = .852. The results reported by Burrows et al. (2013) are slightly worse (F̄1 = .837). Their best score was achieved by using a k-nearest neighbor classifier with a feature set of 10 similarity measures. They exclusively used similarity measures that operate on the texts' string sequences and thus capture the content dimension of text similarity only, e.g. the Levenshtein (1966) distance and a word n-gram similarity measure. As in the previous sections, we report the detailed results for a selected set of individual text similarity measures in Table 7.2 on page 70 and visualize them in Figure 7.6. These figures show that, regardless of the similarity dimension, many measures achieve a very reasonable performance when applied individually, with the measures Greedy String Tiling and word 2-grams containment performing best. As for the previous datasets, our hypothesis holds true that the combination of similarity dimensions improves the results: When we combine the similarity features within each of the respective dimensions, the performance numbers increase (see Table 7.15 as compared to Table 7.2 on page 70). The corresponding chart is shown in Figure 7.7. The combination of content similarity measures is stronger than the combination of structural and stylistic similarity measures, and performs on the same level as the original results reported by Burrows et al. (2013). This is to be expected, as their system uses a feature set which also addresses the content dimension exclusively.

Table 7.12 Confusion matrix for the best classification on the METER Corpus

expected \ classified      reuse    no reuse
reuse                        151          30
no reuse                      20          52
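For traceability, a minimal worked example shows how the figures reported for this configuration in Section 7.4.2 follow from the confusion matrix, assuming that F̄1 denotes the unweighted macro-average of the per-class F1 scores as used throughout this chapter:

\begin{align*}
\text{Acc} &= \frac{151 + 52}{253} \approx .802\\
F_{1,\text{reuse}} &= \frac{2 \cdot \frac{151}{151+20} \cdot \frac{151}{151+30}}{\frac{151}{151+20} + \frac{151}{151+30}} \approx .858\\
F_{1,\text{no reuse}} &= \frac{2 \cdot \frac{52}{52+30} \cdot \frac{52}{52+20}}{\frac{52}{52+30} + \frac{52}{52+20}} \approx .675\\
\bar{F}_1 &= \tfrac{1}{2}\,(.858 + .675) \approx .767
\end{align*}

which agrees, up to rounding, with the reported Acc = .802 and F̄1 = .768.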


Figure 7.6: Detailed results for a selected set of individual text similarity measures, grouped by similarity dimension, on the Webis Crowd Paraphrase Corpus

Table 7.13 Results for the best classification on the Webis Crowd Paraphrase Corpus

System                         Acc.    F̄1
Majority Class Baseline        .517    .341
Ferret Baseline                .794    .789
Clough and Stevenson (2011)    .798    .795
Burrows et al. (2013)          .839    .837
Our System                     .853    .852

Table 7.14 Text similarity measures used to achieve the best results on the Webis Crowd Paraphrase Corpus, grouped by text dimension

Dimension    Text Similarity Measure                    Configurations
Content      Explicit Semantic Analysis                 WordNet, with + w/o stopwords
             Greedy String Tiling
             Jaro
             Longest Common Subsequence                 2 normalizations
             Longest Common Substring
             n-gram Jaccard                             n = {6, 14, 15}
             Mihalcea et al. (2006)
             Resnik (1995) on WordNet (SMT wrapper)
Structure    Word Pair Ordering
             POS 2-grams Jaccard
             Stopword 6-grams
Style        Function Word Frequencies
             Sequential TTR
             Token Ratio

When we combine measures across dimensions, the results improve even further. An exception is the combination of content and structural measures, which performs slightly worse than content measures alone due to the lower performance of structural measures on this dataset. The best configuration of our system resulted from combining all three dimensions content, structure, and style in a single classification model using the decision tree classifier, resulting in F̄1 = .852. The final feature set contains 17 text similarity features which are listed in Table 7.14.

Error Analysis We present the confusion matrix for our best classification in Table 7.16. In total, 1,172 (15%) out of 7,859 text pairs have been classified incorrectly. Out of these, our classifier mistakenly labeled 759 negative samples as true paraphrases, while 413 true paraphrases were not recognized. In the following, we give an example for the error class of false positives: in Example 7.1, a non-paraphrased text pair is erroneously classified as a true paraphrase.

Table 7.15 Results of the best combinations of text similarity measures within (left) and across (right) dimensions on the Webis Crowd Paraphrase Corpus

Text Similarity Dimension         Acc.    F̄1
Combinations within dimensions
  Content                         .840    .839
  Structure                       .816    .814
  Style                           .819    .817
Combinations across dimensions
  Content + Style                 .844    .843
  Content + Structure             .838    .838
  Structure + Style               .831    .830
  Content + Structure + Style     .853    .852


Figure 7.7: Results of the best combinations of text similarity measures within (left) and across (right) dimensions on the Webis Crowd Paraphrase Corpus

Example 7.1:
(a) “I have heard many accounts of him,” said Emily, “all differing from each other: I think, however, that the generality of people rather incline to Mrs. Dalton’s opinion than to yours, Lady Margaret.” “I can easily believe it.”
(b) “I have heard many accounts of him,” said Emily, “all different from each other: I think, however, that the generality of the people rather inclined to the view of Ms Dalton to yours, Lady Margaret.” “That I can not believe.”

As we can see, both texts in Example 7.1 are highly similar with respect to all investigated text dimensions.12 We attribute such errors to the particular properties of this dataset, which differ from those of the Wikipedia Rewrite Corpus and the METER Corpus (see Sections 7.4.1 and 7.4.2). For those two datasets, the more similar two texts are, the higher their degree of text reuse. For the Webis Crowd Paraphrase Corpus, however, a different interpretation needs to be learned by the classifier: Here, (near-)duplicates and texts with automated word-by-word substitutions, which will receive high similarity scores by any of our content similarity measures, are in fact annotated as bad paraphrases, i.e. negative samples. Unrelated texts, empty samples, and the like also belong to the class of negative samples. In consequence, positive instances are only those in the medium similarity range. We assume that this unusual definition of positive and negative instances makes it more difficult to learn a proper model for the given data.

12 Example similarity scores for the text dimensions content, structure, and style: Explicit Semantic Analysis (WordNet) 0.86, Word Pair Ordering 0.98, Function Word Frequencies 0.83

Table 7.16 Confusion matrix for the best classification on the Webis Crowd Paraphrase Corpus

expected \ classified      paraphrase    no paraphrase
paraphrase                      3,654              413
no paraphrase                     759            3,033


Figure 7.8: Our system (shown in opaque green) outperforms all previous systems (patterned bars) across the three evaluation datasets.

7.5 Chapter Summary

In this chapter, we conducted an extrinsic evaluation of our system on the task of text reuse detection. The motivation here was the hypothesis that content features alone are not a reliable indicator for text reuse detection, as a reused text may also contain modifications such as split sentences, changed order of reused parts, or stylistic variance (see Figure 7.1). We evaluated our system on three standard datasets where text reuse is prevalent and which originate in the fields of plagiarism detection, journalistic text reuse detection, and paraphrase recognition: the Wikipedia Rewrite Corpus (Clough and Stevenson, 2011), the METER Corpus (Gaizauskas et al., 2001), and the Webis Crowd Paraphrase Corpus (Burrows et al., 2013), which we described in detail in Section 4.2.1. In Figure 7.8, we show a graphical comparison of the best results across all three datasets. As we can see there, the composition of measures consistently outperforms previous systems across all three datasets. As we showed, similarity computation works best if the text dimensions are chosen well with respect to the type of text reuse at hand. For the Wikipedia Rewrite Corpus, for example, the stylistic similarity features perform poorly, which is why the composition of all three dimensions performs slightly worse than the combination of only content and structural features. For the other two datasets, however, stylistic similarity is a strong dimension within the composition, and consequently the best performance is reached when combining all three text dimensions content, structure, and style.

Chapter 8

NLP Applications

In the previous chapters, we discussed how we evaluated the performance of the proposed composite model either intrinsically in an isolated setting, or extrinsically by using text similarity as a means for solving a particular problem such as the detection of text reuse. In this chapter, we now stress the importance of text similarity for real-world natural language processing applications. In Section 8.1, we introduce the application scenario Self-Organizing Wikis, where wiki users are supported in their everyday tasks by means of natural language processing in general, and text similarity in particular. Furthermore, in Section 8.2 we discuss two applications where text similarity is highly beneficial: question answering and textual entailment recognition. We conclude this chapter with a summary.

8.1 Self-Organizing Wikis

We first introduce the Self-Organizing Wikis scenario in Section 8.1.1. In Section 8.1.2, we discuss the architecture of our system, detail the technical aspects of the system, and introduce the system's core components along with a walk-through example of the information flow. In Section 8.1.3, we describe two use cases within the application scenario where text similarity is particularly beneficial: duplicate detection and link discovery. Finally, in Section 8.1.4, we elaborate on further use cases which have been implemented successfully as part of our system.

8.1.1 Scenario Description

Wikis are web-based collaborative content authoring systems (Leuf and Cunningham, 2001). As they offer fast and simple means for adding and editing content, they are used for various purposes such as creating encyclopedias (e.g. Wikipedia1), constructing dictionaries (e.g. Wiktionary2), or hosting online communities (e.g. ACLWiki3). However, wikis do not require their users to structure pages or to add complementary metadata. They thus often end up as a mass of unmanageable pages with meaningless page titles and no usable link structure (Buffa, 2006).

1 http://www.wikipedia.org
2 http://www.wiktionary.org
3 http://aclweb.org/aclwiki


To solve this issue, we present the Self-Organizing Wikis application scenario. The basic idea is to use natural language processing techniques to support wiki users with their everyday tasks of adding, organizing, and finding content. As a system implementation, we therefore created the Wikulu system4, which integrates natural language processing techniques seamlessly with any wiki by means of an HTTP proxy architecture. While we stress the importance of text similarity for this application, our system is not limited to text similarity only, but allows employing any type of natural language processing technique. We present further details of our system architecture in Section 8.1.2. Supporting wiki users with natural language processing has not attracted a lot of research attention yet. A notable exception is the work by Witte and Gitzinger (2007). They propose an architecture to connect wikis to services providing NLP functionality which are based on the General Architecture for Text Engineering (GATE) (Cunningham et al., 2002). Contrary to Wikulu, though, their system does not integrate transparently with an underlying wiki engine, but rather uses a separate application to perform natural language processing. Thereby, wiki users can leverage the power of NLP algorithms, but have to interrupt their current work to switch to a different application. Moreover, their system is only loosely coupled with the underlying wiki engine. While it allows reading from and writing to existing pages, it does not allow further modifications such as adding custom user interface controls. A lot of work in the wiki community is done in the context of Wikipedia. For example, the FastestFox5 plug-in for Wikipedia is able to suggest links to related articles. However, unlike Wikulu, FastestFox is tailored towards Wikipedia and cannot be used with any other wiki platform.

8.1.2 System Architecture

In this section, we detail our system architecture and describe what is necessary to make NLP algorithms available through our system. The core system is joint work with Johannes Hoffart (Hoffart et al., 2009b) and Nicolai Erbs (Bär et al., 2011a). We show the overall architecture of our Wikulu system in Figure 8.1. Wikulu is implemented as an HTTP proxy server which intercepts the communication between the web browser and the underlying wiki engine. It further builds upon a modular architecture and allows running any NLP component which is based on Apache UIMA (Ferrucci and Lally, 2004) using an extensible plugin mechanism. In order to run Wikulu, no further modifications to the original wiki installation are necessary other than activating the proxy server in a user's web browser. Currently, our system contains adaptors for two widely used wiki engines: MediaWiki6 and TWiki7. Adaptors for other wiki engines can be added with minimal effort. In Figure 8.2, we show a screenshot of what the integration of Wikulu with Wikipedia looks like.8 The additional user interface components are integrated into Wikipedia's default toolbar (highlighted by a red box in the screenshot). In this example, the user has requested to highlight keyphrases (i.e. selected words or phrases which describe the main idea of the text) by clicking the corresponding button in the augmented toolbar. Wikulu then invokes the corresponding NLP component and highlights the keyphrases in the article, so that the user can quickly grasp the main idea of the wiki article. In the following, we introduce each module in detail: (a) the proxy server which allows using Wikulu with any underlying wiki engine, (b) the JavaScript injection that bridges the gap between the client- and server-side code, (c) the plugin manager which gives access to any Apache UIMA-based NLP component, and (d) the wiki abstraction layer which offers a high-level interface to typical wiki operations such as reading and writing the wiki contents. To illustrate Wikulu's information flow, we further conclude this section with a walk-through example.

4 The name is based on the Hawaiian terms wiki (en. fast) and kukulu (to organize).
5 http://www.smarterfox.com
6 http://www.mediawiki.org (e.g. used by Wikipedia)
7 http://www.twiki.org
8 As screenshots only provide a limited overview of Wikulu's capabilities, we refer the reader to a screencast: http://www.ukp.tu-darmstadt.de/research/current-projects/wikulu

[Figure 8.1 diagram: the user's web browser connects through the Wikulu proxy (JavaScript injection, plugin manager, wiki abstraction layer) to the underlying wiki; the Apache UIMA-compliant NLP components include duplicate detection, link suggestion, semantic search, text segmentation, text summarization, and keyphrase highlighting.]

Figure 8.1: Wikulu acts as a proxy server which intercepts the communication between the web browser and the underlying wiki engine. Its plugin manager allows integrating any Apache UIMA-compliant NLP component.

Proxy Server

Wikulu is designed to work with any underlying wiki engine such as MediaWiki or TWiki. Consequently, we implemented it as an HTTP proxy server which allows it to be enabled at any time by changing the proxy settings of a user's web browser.9 The proxy server intercepts all requests between the wiki user and the underlying wiki engine. For example, Wikulu can redirect particular requests to its language processing components or augment the default wiki toolbar with additional functionality.

9 The process of enabling a custom proxy server can be simplified by using web browser extensions such as FoxyProxy (http://getfoxyproxy.org).

Figure 8.2: Integration of Wikulu with Wikipedia. The augmented toolbar (red box) and the results of a keyphrase extraction algorithm (yellow text spans) are highlighted.

JavaScript Injection Wikulu modifies the requests between the web browser and the target wiki by injecting custom client-side JavaScript code. Wikulu is thus capable of altering the default behavior of the wiki engine, e.g. replacing keyword-based retrieval with enhanced search methods, adding novel behavior such as additional toolbar buttons or advanced input fields, or augmenting the originating web page after a certain request has been processed, e.g. after an NLP algorithm has been run.
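As an illustration of the response-rewriting idea only (this is not Wikulu's actual code; the class name and script URL are made up), a proxy can insert a reference to its client-side code before handing an HTML response back to the browser:

public class ScriptInjector {

    // Location of the injected client-side code; illustrative only.
    private static final String WIKULU_SCRIPT =
            "<script type=\"text/javascript\" src=\"http://localhost:8080/wikulu.js\"></script>";

    // Inserts the script reference right before the closing </head> tag of a
    // proxied HTML response; non-HTML responses are passed through unchanged.
    public static String inject(String response) {
        int pos = response.toLowerCase().indexOf("</head>");
        if (pos < 0) {
            return response;
        }
        return response.substring(0, pos) + WIKULU_SCRIPT + response.substring(pos);
    }

    public static void main(String[] args) {
        String page = "<html><head><title>A wiki page</title></head><body>...</body></html>";
        System.out.println(inject(page));
    }
}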

Plugin Manager Wikulu does not perform language processing itself. It relies on Apache UIMA-compliant NLP components (such as a dedicated text similarity component) which use wiki pages (or parts thereof) as input texts. Wikulu offers a sophisticated plugin manager which takes care of dynamically loading those components. The plugin loader is designed to run plugins either every time a wiki page loads, or manually by picking them from the augmented wiki toolbar. The NLP components are available as server-side Java classes. Via direct web remoting10, those components are made accessible through a JavaScript proxy object. Wikulu offers a generic language processing plugin which takes the current page contents as input text, runs an NLP component, and writes its output back to the wiki. To run a custom Apache UIMA-compliant NLP component with Wikulu, one just needs to plug that particular NLP component into the generic plugin. No further adaptations to the generic plugin are necessary. However, more advanced users may create fully customized plugins.

10 http://directwebremoting.org

[Figure 8.3 sequence diagram: Browser → Proxy Server → JavaScript Injection → Keyphrase Plugin → Wiki Abstraction Layer → Wiki; messages: highlight keyphrases, get content from wiki page, get page, extract keyphrases, inject keyphrases.]

Figure 8.3: Illustration of Wikulu's information flow when a user has requested to highlight keyphrases on the current page

Wiki Abstraction Layer Wikulu communicates with the underlying wiki engine via an abstraction layer. That layer provides a generic interface for accessing and manipulating the underlying wiki engine. Thereby, Wikulu can be tightly coupled to a certain wiki instance such as MediaWiki or TWiki, while at the same time remaining flexible enough to adapt to a changing environment. New adaptors for other target wiki engines such as Confluence11 can be added with minimal effort.
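The interface below is a hypothetical sketch of this idea (the method names are ours and do not reflect the actual Wikulu API); each adaptor, e.g. for MediaWiki or TWiki, implements it against the respective wiki engine:

import java.util.List;

// Generic, engine-independent view on a wiki, implemented once per adaptor.
public interface WikiAbstractionLayer {

    // Returns the plain-text content of the given page.
    String getPageContent(String pageTitle);

    // Writes new content to the given page, creating the page if necessary.
    void setPageContent(String pageTitle, String content);

    // Lists the titles of all pages in the wiki.
    List<String> listPageTitles();

    // Places a link on the given anchor text, pointing to the target page.
    void addLink(String pageTitle, String anchorText, String targetPageTitle);
}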

Walk-Through Example In Figure 8.3, we show an illustration of Wikulu’s information flow. Let us assume that a user encounters a wiki page which is rather lengthy. The traditional option would be to read through all contents of the page to get the main idea of the wiki article. However, by using Wikulu she can rely on a number of NLP components which help her grasp the main idea in less time. For example, she realizes that Wikulu’s keyphrase extraction component might help her to better grasp the idea of this page at a glance, so she activates Wikulu by setting her web browser to pass all requests through the proxy server. After applying the settings, the JavaScript injection module adds additional links to the wiki’s toolbar on the originating wiki page (as shown in Figure 8.2 for Wikipedia). Having decided to apply keyphrase extraction, she then invokes that NLP component by clicking the corresponding link. Before the request is passed to that component, Wikulu extracts the wiki page contents using the high-level wiki abstraction layer.

11 http://www.atlassian.com/software/confluence

Thereafter, the request is passed via direct web remoting to the NLP component which has been loaded by Wikulu’s plugin mechanism. After processing the request, the extracted keyphrases are returned to Wikulu’s custom JavaScript handlers and finally highlighted in the originating wiki page.

8.1.3 Text Similarity Use Cases

We now describe two use cases within the Self-Organizing Wikis application scenario where users are supported by means of text similarity techniques: duplicate detection and link discovery.

Duplicate Detection Whenever users add new content to a wiki, there is the danger of duplicating information that is already contained. In order to avoid duplication, users would need comprehensive knowledge of what content is already present in the wiki, which is almost impossible for large wikis like Wikipedia. Wikulu helps to detect potential duplicates by computing the text similarity between newly added content and each existing wiki page. If a potential duplicate is detected, the user is notified and may decide to augment the duplicate page instead of adding a new one. Wikulu relies on text similarity measures such as Explicit Semantic Analysis (Gabrilovich and Markovitch, 2007) and Latent Semantic Analysis (Landauer et al., 1998), which have been introduced in Chapter 3. In Figure 8.4, we show a screenshot of the implementation in Wikulu. Here, the user has requested that text similarity is computed between the current page about the Web Ontology Language and all other pages in the wiki. As we can see in the (shortened) screenshot, some wiki pages have a much higher similarity to the current page than others. In consequence, a user may decide to inspect those pages further, e.g. to merge some of them with the current page in order to avoid content duplication.
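The core of such a duplicate check can be sketched as follows, using the stand-alone interface of DKPro Similarity described in Chapter 9. This is a simplified illustration: the imports for the DKPro classes are omitted, the threshold is made up, and the actual component additionally takes care of pre-processing and of choosing a suitable measure such as Explicit Semantic Analysis.

import java.util.HashMap;
import java.util.Map;

public class DuplicateDetector {

    // Pages scoring above this (illustrative) threshold are flagged as potential duplicates.
    private static final double THRESHOLD = 0.8;

    public static Map<String, Double> findPotentialDuplicates(TextSimilarityMeasure measure,
            String newContent, Map<String, String> existingPages) throws Exception {

        Map<String, Double> duplicates = new HashMap<String, Double>();
        for (Map.Entry<String, String> page : existingPages.entrySet()) {
            // Compare the newly added content with each existing wiki page.
            double score = measure.getSimilarity(newContent, page.getValue());
            if (score >= THRESHOLD) {
                duplicates.put(page.getKey(), score);
            }
        }
        return duplicates;
    }
}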

Link Discovery While many wiki users readily add textual contents to wikis, they often refrain from also adding links to related pages, as this is a very tedious task. However, links in wikis are crucial as they allow users to quickly navigate from one page to another, or browse through the wiki. Therefore, it may be reasonable to augment a page about the topic sentiment analysis (Pang and Lee, 2008) with a link to a page providing related information such as evaluation datasets. Wikulu supports users in this tedious task by automatically suggesting links. Link suggestion thereby is a two-step process: (a) first, suitable text phrases are extracted which might be worth placing a link on, and (b) for each phrase, related pages are ranked by comparing their relevance to the current page, and then presented to the user. The user may thus decide whether she wants to use a detected phrase as a link or not, and if so, which other wiki page to link this phrase to. Wikulu currently integrates link suggestion algorithms by Geva (2007) and Itakura and Clarke (2007). In Figure 8.5, we show a screenshot of the link discovery implementation in Wikulu. Here, suitable text phrases to place a link on are highlighted in green and augmented with a "+" button to the right. If the user chooses to accept one of these suggestions, she just needs to click the button in order to select one of the suggested target pages. The highlighted text phrase is then transformed into a hyperlink. Hoffart et al. (2009b) further applied text similarity techniques for link discovery in the INEX 2009 Link the Wiki competition (Huang et al., 2010). In this international competition, Explicit Semantic Analysis (Gabrilovich and Markovitch, 2007) was used to find suitable target pages for a given article where links are to be placed. In a qualitative analysis, Hoffart et al. (2009b) showed that text similarity is beneficial for identifying link targets of ambiguous link anchors.
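The following sketch illustrates the two-step idea in a strongly simplified form. It is neither the Geva (2007) nor the Itakura and Clarke (2007) algorithm, and the helper names are hypothetical: candidate phrases are simply page titles that occur verbatim in the text, and candidate target pages are ranked by their similarity to the current page via DKPro Similarity's common interface (whose import is omitted).

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LinkSuggester {

    // Step (a), simplified: phrases worth linking are titles of other pages
    // that literally occur in the current page text.
    public static List<String> extractCandidatePhrases(String pageText, List<String> allTitles) {
        List<String> candidates = new ArrayList<String>();
        String lowerText = pageText.toLowerCase();
        for (String title : allTitles) {
            if (lowerText.contains(title.toLowerCase())) {
                candidates.add(title);
            }
        }
        return candidates;
    }

    // Step (b): rank candidate target pages by their similarity to the current page.
    public static List<String> rankTargets(TextSimilarityMeasure measure, String currentPage,
            Map<String, String> candidateTargets) throws Exception {

        final Map<String, Double> scores = new HashMap<String, Double>();
        for (Map.Entry<String, String> target : candidateTargets.entrySet()) {
            scores.put(target.getKey(), measure.getSimilarity(currentPage, target.getValue()));
        }
        List<String> ranked = new ArrayList<String>(scores.keySet());
        Collections.sort(ranked, new Comparator<String>() {
            public int compare(String a, String b) {
                return Double.compare(scores.get(b), scores.get(a));
            }
        });
        return ranked;
    }
}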

Figure 8.4: Screenshot of duplicate detection implemented in Wikulu. Text similarity was computed between the current page and all other pages in the wiki.

8.1.4 Further Use Cases

In the following, we briefly summarize further use cases which have been implemented as part of the Self-Organizing Wikis application scenario, but do not necessarily employ text similarity techniques. We envision, for example, that wiki users are supported in reading longer texts by means of techniques for keyphrase highlighting, that is, using keyphrase extraction methods such as TextRank (Mihalcea and Tarau, 2004) to highlight important words on the current wiki page so that a user can get a better grasp of the idea that the text conveys in less time. Below, we give an overview of these use cases. This is joint work with Nicolai Erbs (Bär et al., 2011a).

Figure 8.5: Screenshot of link discovery implemented in Wikulu. Suitable text phrases to place a link on were automatically identified and are highlighted in green.

Semantic Search The capabilities of a wiki's built-in search engine are typically rather limited, as it traditionally performs mostly keyword-based retrieval. If the query keyword is not found in the wiki, the query returns an empty result set. However, a page might exist which is semantically related to the keyword, and should thus yield a match. As the search engine is typically a core part of the wiki system, it is rather difficult to modify its behavior. By leveraging Wikulu's architecture, though, we can replace the default search mechanisms with algorithms which allow for semantic search to alleviate the vocabulary mismatch problem (Gurevych et al., 2007).

Segmenting Long Pages Due to the open editing policy of wikis, pages tend to grow rather fast. For users, it is thus a major challenge to keep an overview of what content is present on a certain page. Wikulu therefore supports users by analyzing long pages with text segmentation algorithms which detect topically coherent segments of text. It then suggests segment boundaries which the user may or may not accept; inserting a subheading at an accepted boundary makes pages easier to read and to navigate. When the user accepts one or more of these suggested boundaries, Wikulu stores them persistently in the wiki. Wikulu currently integrates text segmentation methods such as TextTiling (Hearst, 1997) or C99 (Choi, 2000).

Figure 8.6: Screenshot of text segmentation implemented in Wikulu. For the original wiki page, three segments (boundaries highlighted by yellow bars) were automatically suggested.

We show an example screenshot in Figure 8.6. Here, three segments were proposed for the original wiki page by the employed text segmentation method. The segment boundaries are highlighted by yellow bars and further allow the user to manually add a segment title. For future versions, we envision that suitable segment titles may be proposed automatically.

Summarizing Pages Similarly to segmenting pages, Wikulu makes long wiki pages more accessible by generating an extractive summary. While abstractive approaches create a summary in their own words, extractive approaches analyze the original wiki text sentence-by-sentence, rank each sentence, and return a list of the most important ones (Carenini and Cheung, 2008; Erkan and Radev, 2004). Wikulu integrates extractive text summarization methods such as LexRank (Erkan and Radev, 2004). In Figure 8.7, we show the results of the current text summarization component contained in our Wikulu system. For the original wiki page about the Web Ontology Language (see Figure 8.4), an extractive summary was generated. In our system implementation, we present the top-ranked sentences as a list which can be inspected by the wiki user much faster than the original article.
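To make the extractive idea concrete, the following is a deliberately naive sketch (not LexRank and not Wikulu's actual component): each sentence is scored by the average document-wide frequency of its words, and the top-ranked sentences are returned in their original order.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FrequencySummarizer {

    public static List<String> summarize(String text, int numSentences) {
        // Naive sentence splitting; a real component would use a proper segmenter.
        String[] sentences = text.split("(?<=[.!?])\\s+");

        // Document-wide word frequencies (lower-cased, letters only).
        Map<String, Integer> frequency = new HashMap<String, Integer>();
        for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                Integer count = frequency.get(token);
                frequency.put(token, count == null ? 1 : count + 1);
            }
        }

        // Score each sentence by the average frequency of its words.
        final double[] scores = new double[sentences.length];
        for (int i = 0; i < sentences.length; i++) {
            int sum = 0, count = 0;
            for (String token : sentences[i].toLowerCase().split("[^\\p{L}]+")) {
                if (!token.isEmpty()) {
                    Integer f = frequency.get(token);
                    sum += (f == null ? 0 : f);
                    count++;
                }
            }
            scores[i] = count > 0 ? (double) sum / count : 0.0;
        }

        // Select the indices of the highest-scoring sentences and keep document order.
        List<Integer> indices = new ArrayList<Integer>();
        for (int i = 0; i < sentences.length; i++) {
            indices.add(i);
        }
        Collections.sort(indices, (a, b) -> Double.compare(scores[b], scores[a]));
        List<Integer> top = new ArrayList<Integer>(
                indices.subList(0, Math.min(numSentences, indices.size())));
        Collections.sort(top);

        List<String> summary = new ArrayList<String>();
        for (int i : top) {
            summary.add(sentences[i]);
        }
        return summary;
    }
}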

Highlighting Keyphrases Another approach to assist users in better grasping the idea of a wiki page at a glance is to highlight important keyphrases. As Tucker and Whittaker (2009) have shown, highlighting important phrases assists users with reading longer texts and yields faster understanding. Wikulu thus improves readability by employing automatic keyphrase extraction algorithms. Additionally, Wikulu allows the user to dynamically adjust the number of keyphrases shown by presenting a slider. We integrated keyphrase extraction methods such as TextRank (Mihalcea and Tarau, 2004) and KEA (Witten et al., 1999). An example screenshot in which keyphrases in a Wikipedia article were highlighted was shown in Figure 8.2 on page 88.

Figure 8.7: Screenshot of text summarization implemented in Wikulu. For the original wiki page, an extractive summary containing the top-ranked sentences was generated.

Besides the use cases outlined above, further use cases for supporting wiki users include (i) visually analyzing the results of NLP algorithms, (ii) educational purposes, and (iii) enabling semantic wikis. We briefly describe them in the following.

Visually Analyzing the Results of NLP Algorithms Wikulu facilitates analyzing the results of NLP algorithms by using wiki pages as input documents and visualizing the results directly on that page. Consider an NLP algorithm which performs sentiment analysis (Pang and Lee, 2008). Typically, we would put our analysis sentences into a text file, launch the NLP application, process the file, and read the output from either a built-in console or a separate output file. This procedure suffers from two major drawbacks: (a) it is inconvenient to copy existing data into a custom input format which can be fed into the NLP system, and (b) the textual output does not allow presenting the results in a visually rich manner. Wikulu tackles both challenges by using wiki pages as input/output documents. For example, by running a part-of-speech tagger right from within the wiki, its output can be written back to the originating wiki page, resulting in visually rich, possibly interactive presentations. An example screenshot is shown in Figure 8.8, where verbs and prepositions are highlighted.

Figure 8.8: Screenshot of the visual result analysis implemented in Wikulu. The output of a part-of-speech tagger is shown as a visually rich presentation on the original wiki page.

Educational Purposes Wikulu is a handy tool for educational purposes as it allows (a) rapidly creating textual data in a collaborative manner and (b) visualizing the results of NLP algorithms on such data, as described above. That way, students can gather hands-on experience by experimenting with NLP components in an easy-to-use wiki system. They can both collaboratively edit input documents and explore the possible results of, e.g., different configurations of NLP components. In our system prototype, we integrated the highlighting of part-of-speech tags which have been determined by a part-of-speech tagger (Schmid, 1994).

Enabling Semantic Wikis Semantic wikis such as the Semantic MediaWiki (Krötzsch et al., 2006) augment standard wikis with machine-readable semantic annotations of pages and links. As those annotations have to be entered manually, this step is often skipped by users, which severely limits the usefulness of semantic wikis. Wikulu could support users e.g. by automatically suggesting the type of a link by means of relation detection, or the type of a page by means of text categorization. Thus, Wikulu could constitute an important step towards the semantification of the content contained in wikis.

8.2 Further Applications

In this section, we report details on two further applications which are not yet part of our Wikulu system, but are closely related and may benefit future versions of our system: question answering and textual entailment recognition. The work on question answering was conducted by Walter et al. (2012), who built upon our open source framework DKPro Similarity (see Chapter 9) to carry out the experiments. The work on textual entailment recognition was carried out as a set of our own experiments, and resulted in a contribution to a novel open source platform for textual inference. In the following, we report details on both applications.

8.2.1 Question Answering

Walter et al. (2012) experimented with question answering over Linked Data. The task of question answering thereby addresses the challenge of making machines capable of answering queries which are formulated in natural language. The term Linked Data refers to the best practices of publishing any kind of structured data on the Web which is connected by typed links (Bizer et al., 2009a). Prominent datasets that adopt Linked Data include DBpedia12 (Bizer et al., 2009b), Freebase13, and YAGO14 (Suchanek et al., 2007; Hoffart et al., 2013). In their experiments, Walter et al. (2012) introduce a system which follows a layered approach to question answering: Only questions that cannot be answered using simple processing steps are passed on to more complex and hence more expensive components. Their architecture consists of five layers, where the final layer computes text similarity using Explicit Semantic Analysis (Gabrilovich and Markovitch, 2007) (see Section 3.2) based on our implementation which is available as part of our open source framework DKPro Similarity, which we will introduce in Chapter 9. Evaluation was carried out on the DBpedia test set which was provided by the 2nd Open Challenge on Question Answering over Linked Data (QALD-2).15 The final dataset comprises 72 questions.16 We list some examples below in Example 8.1.

Example 8.1:
(a) What is the currency of the Czech Republic?
(b) Who was the wife of U.S. president Lincoln?
(c) Was Natalie Portman born in the United States?

Walter et al. (2012) particularly study the impact of the individual layers on the overall system performance. The evaluation metric used is the F′ score, which is defined as the harmonic mean of the coverage and the F score.17 A simple lookup is performed first, where natural language expressions of the query such as Czech Republic are mapped onto uniform resource identifiers such as the corresponding DBpedia resource, and a SPARQL query is then performed to answer the question. This method achieves a poor overall F′ score of 0.25. However, taking all layers into account up to Explicit Semantic Analysis achieves the best overall performance with F′ = 0.52.
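Written out, following the definition just quoted, with C denoting the coverage and F the F score, the metric is simply:

\[ F' = \frac{2 \cdot C \cdot F}{C + F} \]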

12 http://dbpedia.org
13 http://www.freebase.com
14 http://www.mpi-inf.mpg.de/yago-naga/yago
15 http://www.sc.cit-ec.uni-bielefeld.de/qald-2
16 Walter et al. (2012) filtered some questions which were not solvable.
17 For details, we refer the reader to the work by Walter et al. (2012).

Table 8.1 Statistics for the five datasets which have been used for our experiments with textual entailment. All text pairs in these datasets are accompanied by human judgments whether an entailment relationship holds or not. The datasets are (roughly) equally split into positive (i.e. entailment) and negative (i.e. non-entailment) samples.

Dataset                            Length in Terms (avg.)   Training Pairs (pos./neg.)   Test Pairs (pos./neg.)   Agreement (Fleiss' (1971) κ)
RTE1 (Dagan et al., 2006)          2–76 (20)                287 (143/144)                800 (400/400)            0.60
RTE2 (Bar-Haim et al., 2006)       4–110 (20)               800 (400/400)                800 (400/400)            0.78
RTE3 (Giampiccolo et al., 2007)    4–117 (22)               800 (412/388)                800 (410/390)            0.75
RTE4 (Giampiccolo et al., 2008)    4–134 (26)               –                            1,000 (500/500)          n/a
RTE5 (Bentivogli et al., 2009)     4–264 (59)               600 (300/300)                600 (300/300)            0.97 (see footnote 18)

The results show that taking more sophisticated processing layers into account benefits the overall system performance. In particular, using all layers up to the final text similarity layer shows the best overall results. From these results we conclude that text similarity techniques can indeed benefit question answering. While in these experiments only Explicit Semantic Analysis was used as the measure of choice, we argue that a composition of text similarity measures (see Chapter 5) may be interesting to explore in future work and may lead to even better overall results.

8.2.2 Textual Entailment Recognition

In Section 1.3, we introduced textual entailment as a task that is highly related to text similarity. It differs, though, in that it is defined as a unidirectional relationship between two texts, and in that it operates on binary judgments rather than continuous similarity scores. However, we argue that text similarity measures are a highly valuable tool in any textual entailment recognition setting: For a given document collection, text similarity measures can be employed in a first step to filter those text pairs that are highly similar to each other. In a second step, these pairs can then be further investigated by a dedicated textual entailment system.

Experimental Setup We therefore conducted experiments on the datasets from the Recognizing Textual Entailment (RTE) challenge series (Dagan et al., 2006). We used the data from the first five challenges and will refer to them by their original names RTE1 to RTE5.19 The datasets are (roughly) equally distributed into positive (i.e. entailment) and negative (i.e. non-entailment) samples. We list the detailed dataset statistics in Table 8.1. The system we used for the experiments throughout this section modifies the original experimental setup described in Chapter 5 as follows:

Pre-processing no modifications; see Section 5.4

18 The exceptionally high agreement is due to the fact that Bentivogli et al. (2009) filtered out "annotators' mistakes" and left only "real disagreements". The reported agreement is based on the filtered annotations.
19 http://aclweb.org/aclwiki/index.php?title=Textual_Entailment_Resource_Pool

Table 8.2 Results across the RTE1–5 datasets in terms of confidence-weighted score (RTE1) or average precision (RTE2–5), respectively. Additionally, we list the results of the best and worst reference systems for comparison.

System                      RTE1   RTE2   RTE3   RTE4   RTE5
Our System                  .592   .594   .667   .656   .628
Best Reference System20     .686   .808   .881   .741   .701
Worst Reference System20    .507   .504   .496   .494   .530

Feature Generation The text similarity measures that we used for the experiments are the same as for the extrinsic evaluation. We described the set of measures in Section 7.2: We used the measures along the structural and stylistic text dimensions as originally described, and used the measures along the content dimension as summarized in Table 7.1 on page 69.

Feature Combination We still use the pre-computed similarity scores as input for the machine learning classifier, but combine them using an SVM classifier (SMO implementation) and a multinomial logistic regression classifier (Logistic implementation) from the Weka toolkit (Hall et al., 2009). These classifiers are able to predict nominal classes of textual entailment (a minimal sketch of this step is given below).

Post-processing We did not apply any post-processing steps for this task.

We trained the classifiers on the available training data for each dataset, e.g. for RTE1 we trained on the RTE1 training data and tested on the RTE1 test data. For RTE4, however, no training data was provided and it was suggested to train the system on the RTE3 dataset. We followed this suggestion in the evaluation. In the following, we report the best results obtained on the RTE1 to RTE5 datasets. RTE1 to RTE3 contain pairs of a text and a hypothesis (see Section 1.3) for which an entailment relationship holds or not. In addition, RTE4 and RTE5 contain both two- and three-way classifications, where the third class is unknown, i.e. it is not possible to determine whether the text entails the hypothesis (Giampiccolo et al., 2008). For the evaluation on the RTE4 and RTE5 data, however, we stuck to the classical binary task with positive and negative entailment relationships. Evaluation was carried out in terms of accuracy, i.e. the number of correctly predicted pairs divided by the total number of pairs, and a score which additionally includes how confident a system is about the predicted scores. In RTE1, the weighted score is called confidence-weighted score (CWS) and computes the average precision on both positive and negative entailment pairs (Dagan et al., 2006). From RTE2 onwards, CWS was replaced by the common variant of the average precision (AP) which operates only on the positive entailment pairs (Bar-Haim et al., 2006). In the following, we report all results of our system in terms of CWS/AP rather than accuracy, as we assume it is important for a system to be able to tell how confident it is about the judgments.

20 For the reference systems, both the accuracy as well as a confidence-weighted evaluation metric (confidence-weighted score/average precision) are reported. We report the best and worst system with respect to the confidence-weighted score or average precision, respectively.
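For illustration, the feature combination step announced above can be sketched with Weka as follows. This is a minimal example assuming Weka 3.7 or later; the two feature names and all similarity values are made up, the actual setup uses the full feature set described above, and Logistic can be swapped in for SMO in the same way.

import java.util.ArrayList;
import java.util.Arrays;

import weka.classifiers.Classifier;
import weka.classifiers.functions.SMO;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class EntailmentClassificationSketch {

    public static void main(String[] args) throws Exception {
        // One numeric attribute per pre-computed text similarity feature,
        // plus the nominal entailment class.
        ArrayList<Attribute> attributes = new ArrayList<Attribute>();
        attributes.add(new Attribute("greedyStringTiling"));
        attributes.add(new Attribute("esaWordnet"));
        attributes.add(new Attribute("entailment",
                new ArrayList<String>(Arrays.asList("yes", "no"))));

        Instances train = new Instances("rte-train", attributes, 0);
        train.setClassIndex(train.numAttributes() - 1);

        // Two toy training pairs: similarity scores plus gold entailment label.
        train.add(new DenseInstance(1.0, new double[] { 0.82, 0.91, 0.0 })); // yes
        train.add(new DenseInstance(1.0, new double[] { 0.21, 0.34, 1.0 })); // no

        Classifier classifier = new SMO();
        classifier.buildClassifier(train);

        // Predict the entailment class for an unseen pair from its feature vector.
        Instances test = new Instances("rte-test", attributes, 0);
        test.setClassIndex(test.numAttributes() - 1);
        test.add(new DenseInstance(1.0, new double[] { 0.78, 0.88, 0.0 }));
        double prediction = classifier.classifyInstance(test.instance(0));
        System.out.println(test.classAttribute().value((int) prediction));
    }
}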


Figure 8.9: Results on the RTE1 dataset in terms of confidence-weighted score. Our system (shown in opaque green) is ranked #8 out of 26 systems in total.21


Figure 8.10: Results on the RTE2 dataset in terms of average precision. Our system (shown in opaque green) is ranked #15 out of 32 systems in total.21

Results In Table 8.2, we report our best results across the five RTE datasets. In addition, we report the best and worst reference systems for comparison.20 Thereby, we compare to the reference systems in terms of confidence-weighted score or average precision, respectively.21 The corresponding charts are shown in Figures 8.9 to 8.13. In summary, we see a varying performance of our system across the five datasets. However, the results of our best system configuration are always at least in the medium range compared to the reference systems. This is particularly interesting to note, as our system does not contain any measures which are specifically tailored towards the recognition of textual entailment. Moreover, for the RTE4 dataset (see Figure 8.12), our system is ranked #2 in terms of average precision, outperformed only by a single reference system.21 We conclude that our system, which uses a composition of text similarity measures, shows particularly promising results for confidence-weighted evaluation metrics, where not only the classification into positive and negative entailment pairs is important, but the system also needs to tell how confident it is about its decision.

21 We compare only with systems which operate on all entailment pairs and report results for accuracy as well as for the confidence-weighted score or average precision, respectively.


Figure 8.11: Results on the RTE3 dataset in terms of average precision. Our system (shown in opaque green) is ranked #7 out of 24 systems in total.21


Figure 8.12: Results on the RTE4 dataset (binary classification) in terms of average precision. Our system (shown in opaque green) is ranked #2 out of 21 systems in total.21


Figure 8.13: Results on the RTE5 dataset (binary classification) in terms of average precision. Our system (shown in opaque green) is ranked #7 out of 17 systems in total.21

Figure 8.14: Demonstration system of the Excitement textual entailment platform: In this screenshot, the entailment decision algorithm MaxEntClassificationEDA was used to classify the text pairs of the RTE3 dataset.22

System Integration We believe that the above insights can be very beneficial for future work on textual entailment recognition. Employing text similarity measures in the first phase of a textual entailment system can reduce the number of potentially positive entailment pairs, before sophisticated textual entailment components are then applied. As a first step towards such a combined system, we integrated our open source text similarity framework DKPro Similarity (see Chapter 9) with the EU-funded Excitement project.23 Excitement is a novel textual entailment platform which is intended to provide a generic tool for textual inference to both the scientific community and enterprise developers. The project will soon be released publicly.24 We show a screenshot of the current demonstration system in Figure 8.14. The Excitement platform is designed in a way that it can support a multitude of entailment decision algorithms (EDA). In order to integrate DKPro Similarity with Excitement, we thus created two novel algorithms based on our comprehensive set of text similarity measures. We describe our contributions in the following.

Simple EDA The simple decision algorithm takes a single text similarity measure as input. It computes text similarity with the given measure, and transforms the continuous similarity score into a binary decision whether an entailment relationship holds or not, based on a given threshold (a minimal sketch of this idea follows below).

22 The demonstration system is available at http://hlt-services4.fbk.eu/eop
23 Exploring Customer Interactions through Textual Entailment, http://www.excitement-project.eu
24 The repository is available at http://github.com/hltfbk/Excitement-Open-Platform

Classification EDA The advanced decision algorithm takes a multitude of text similarity measures as input. It basically resembles the setup we used for the experiments on the RTE datasets described above: First, a machine learning classifier is trained on the similarity scores which are computed for the given training data. Then the classifier is applied to the given test data and outputs binary decisions whether an entailment relationship holds or not. For this implementation, we used a Naive Bayes classifier. However, it can easily be adapted to any other classifier from the Weka toolkit (Hall et al., 2009).
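As announced above, the thresholding idea behind the simple EDA can be sketched as follows. This is a hypothetical class, not the actual Excitement implementation; the measure is passed in via DKPro Similarity's common interface (whose import is omitted), and the classification EDA replaces the fixed threshold with a trained classifier.

public class ThresholdEntailmentDecider {

    private final TextSimilarityMeasure measure;
    private final double threshold;

    public ThresholdEntailmentDecider(TextSimilarityMeasure measure, double threshold) {
        this.measure = measure;
        this.threshold = threshold;
    }

    // Returns true if the text is assumed to entail the hypothesis, i.e. if the
    // continuous similarity score reaches the given threshold.
    public boolean entails(String text, String hypothesis) throws Exception {
        return measure.getSimilarity(text, hypothesis) >= threshold;
    }
}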

By contributing two entailment decision algorithms based on DKPro Similarity to the Excitement platform, we hope to stimulate further research in this field, as we envision that text similarity measures are valuable tools in the process of recognizing textual entailment.

8.3 Chapter Summary

In this chapter, we stressed the importance of text similarity for real-world applications. We first presented the Self-Organizing Wikis application scenario which supports wiki users in their everyday tasks of adding, organizing, and finding content. We discussed our Wikulu system implementation, which employs a flexible architecture that features a modular plugin mechanism for natural language processing components in general, and text similarity components in particular. We introduced a number of use cases in the context of this application scenario, and highlighted the importance of text similarity techniques for two particular use cases: duplicate detection and link suggestion. In the second part of this chapter, we reported on two further applications which are not yet part of the Wikulu system, but are closely related and may benefit future versions of our system: question answering and textual entailment recognition. For question answering, our open source text similarity framework DKPro Similarity, which we will discuss in detail in the following chapter, has been used successfully in these experiments and has helped to better answer questions that are formulated in natural language. For textual entailment recognition, we showed that text similarity is a promising means for further research in this field, and contributed two entailment decision algorithms to the novel Excitement open source platform for textual inference.

Chapter 9

DKPro Similarity

In this chapter, we present DKPro Similarity, a generalized framework for text similarity. The framework resulted from the work on our system which we discussed in Chapter 5 and which we then tailored towards the intrinsic and extrinsic evaluation in Chapters 6 and 7. In the following, we start with a brief note on the motivation for this framework in Section 9.1 and compare DKPro Similarity with a number of related frameworks in Section 9.2. In Section 9.3, we then describe the overall architecture of the framework. In Section 9.4, we briefly list the text similarity measures which are available to date, as well as the DKPro components which allow the integration of similarity computation with any Apache UIMA-based language processing pipeline. In Section 9.5, we introduce a number of experimental setups which are specific instantiations of the proposed composite model introduced in Chapter 5 and which can be run out-of-the-box.

9.1 Motivation

“We plan to explore building an open source toolkit for integrating and applying diverse linguistic analysis modules to the STS task,” Agirre et al. (2012) announced in the task description of the pilot Semantic Textual Similarity Task (see Chapter 6), as they realized that only a few generalized similarity frameworks exist at all. We are currently not aware of any designated text similarity framework which goes beyond simple lexical similarity or contains more than a small number of measures, even though related frameworks exist, which we discuss in Section 9.2. In order to fill this gap, we present DKPro Similarity, an open source framework for text similarity. DKPro Similarity is designed to complement DKPro Core1, a collection of software components for natural language processing based on the Apache UIMA framework (Ferrucci and Lally, 2004). Our goal is to provide a comprehensive repository of text similarity measures which are implemented in a common framework using standardized interfaces. Besides the already available measures, DKPro Similarity is easily extensible and intended to allow for custom implementations, for which it offers various templates and examples. The Java implementation is publicly available at Google Code2 under the Apache Software License v2 and partly under GNU GPL v3.

1 http://code.google.com/p/dkpro-core-asl
2 http://code.google.com/p/dkpro-similarity-asl


Just recently, Fokkens et al. (2013) stressed that a key aspect of research is to build upon prior work. However, they point out that it is still an enormous effort to reproduce results previously reported in the literature. This is mostly due to the fact that previous experimental setups are not reported in enough detail in the literature (e.g. due to space limitations), or not made publicly available as source code. Fokkens et al. (2013) investigated, for example, the effects of major experimental steps on the overall system performance, e.g. variations in pre-processing steps or different versions of software libraries and resources used. As they showed, despite enormous efforts it may be impossible to reproduce previously reported results. To alleviate this shortcoming, DKPro Similarity also comes with a set of full-featured experimental setups which can be run out-of-the-box and used for future systems to build upon. That way, we promote the reproducibility of experimental results, and provide reliable, permanent experimental conditions which can benefit future studies and help to stimulate the reuse of particular experimental steps and software modules. As Fokkens et al. (2013) put it, “sharing is key to facilitating reuse.” We will elaborate on the available experimental setups in Section 9.5. DKPro Similarity is a mature framework which has been effectively used in a number of tasks such as question answering (Walter et al., 2012), text reuse detection (Bär et al., 2012b), and textual entailment recognition (see Section 8.2.2). DKPro Similarity also participated in the Semantic Textual Similarity Task at SemEval-2012 (Agirre et al., 2012), where a combination of the available text similarity measures has shown the best performance among all participating systems (see Chapter 6). In follow-up work in the second exercise of this series (Agirre et al., 2013), which we described in Section 6.8, DKPro Similarity served as a baseline system which was officially recommended to all task participants.3 A number of participants reported that they were either inspired by our system, or directly built upon it for the generation of text similarity features or as a basis for the whole text similarity system. We refer the reader to Section 6.8 for further details.

9.2 Related Frameworks

In the following, we discuss related frameworks and indicate where DKPro Similarity uses implementations from existing libraries. That way, DKPro Similarity brings together the scattered efforts by offering access to all measures through common interfaces. It goes far beyond the functionality of the original libraries as it generalizes the resources used, allows a tight integration with any UIMA-based pipeline, and comes with full-featured experimental setups which are pre-configured stand-alone text similarity systems that can be run out-of-the-box.

S-Space Package Even though not a designated text similarity library, the S-Space Package (Jurgens and Stevens, 2010)4 contains some text similarity measures such as Latent Semantic Analysis (LSA) and Explicit Semantic Analysis (see Section 3.2). However, it is primarily focused on word space models which operate on word distributions in text. Besides such algorithms, it offers a variety of interfaces, data structures, evaluation datasets and metrics, and global operation utilities, e.g. for dimension reduction using Singular Value Decomposition or randomized projections, which are particularly useful with such distributional word space models. DKPro Similarity integrates LSA based on the S-Space Package.

3 http://www-nlp.stanford.edu/wiki/STS
4 http://github.com/fozziethebeat/S-Space

Semantic Vectors The Semantic Vectors package is another library for distributional semantics (Widdows and Cohen, 2010).5 It contains text similarity measures such as LSA and allows for comparing documents within a given vector space. The main focus lies on word space models with a number of dimension reduction techniques, and applications on word spaces such as automatic thesaurus generation.

WordNet::Similarity The open source package by Pedersen et al. (2004)6 is a popular Perl library for similarity computation on WordNet. It comprises six word similarity measures which operate on WordNet, e.g. Jiang and Conrath (1997) or Resnik (1995). Unfortunately, though, no strategies have been added to the package yet which aggregate the word similarity scores for complete texts in a similar manner as described in Section 3.1. In DKPro Similarity, we offer native Java implementations of all measures contained in WordNet::Similarity, and allow going beyond WordNet and using the measures with any lexical-semantic resource of choice, e.g. Wiktionary or Wikipedia.

SimMetrics Library The Java library by Chapman et al. (2005)7 exclusively comprises text similarity measures which compute lexical similarity on string sequences and compare texts without any semantic processing. It contains measures such as the Levenshtein (1966) or Monge and Elkan (1997) distance metrics. In DKPro Similarity, some string-based measures (see Section 3.2) are based on implementations from this library.

SecondString Toolkit The freely available library by Cohen et al. (2003a) (http://sourceforge.net/projects/secondstring) is similar to SimMetrics and is also implemented in Java. It likewise contains several well-known text similarity measures on string sequences, and includes many of the measures which are also part of the SimMetrics Library. Some string-based measures in DKPro Similarity are based on the SecondString Toolkit.

SEMILAR Toolkit The package by Rus et al. (2013) comprises both a graphical user interface and a developer toolkit for text similarity. The toolkit offers a rich set of diverse text similarity measures, such as measures operating on string sequences, the aggregation measure by Mihalcea et al. (2006) which builds on a number of word similarity measures, or Latent Semantic Analysis. In our opinion, though, the focus of this package lies rather on its graphical user interface, which allows users to manage different projects, to inspect the training and test data along with its annotations, and to run text similarity measures in a convenient end-user setting.

9.3 Architecture

DKPro Similarity is designed to operate in either of two modes: The stand-alone mode allows text similarity measures to be used as independent components in any experimental setup, but does not offer means for further language processing, e.g. lemmatization. The UIMA-coupled mode tightly integrates similarity computation with full-fledged Apache UIMA-based language processing pipelines (Ferrucci and Lally, 2004). That way, it allows performing any number of language processing steps, e.g. coreference or named-entity resolution, along with the text similarity computation.

Stand-alone Mode In this mode, text similarity measures can be used independently of any language processing pipeline just by passing them a pair of texts as (i) two strings, or (ii) two lists of strings (e.g. already lemmatized texts). We therefore provide an API module which contains Java interfaces and abstract base classes for the measures. That way, DKPro Similarity allows for maximum flexibility in experimental design, as the text similarity measures can easily be integrated with any existing experimental setup:

TextSimilarityMeasure m = new GreedyStringTiling();
double similarity = m.getSimilarity(text1, text2);

The above code snippet instantiates the Greedy String Tiling measure (Wise, 1996) and then computes the text similarity between the given pair of texts. The resulting similarity score is normalized into [0, 1], where 0 means not similar at all and 1 corresponds to perfectly similar. (Some string distance measures such as the Levenshtein distance (Levenshtein, 1966) return a raw distance score where a smaller distance corresponds to higher similarity; such a score can easily be normalized, e.g. by text length, as sketched below.) By using the common TextSimilarityMeasure interface, it is easy to replace Greedy String Tiling with any measure of choice, such as Latent Semantic Analysis (Landauer et al., 1998) or Explicit Semantic Analysis (Gabrilovich and Markovitch, 2007). We give an overview of the measures available in DKPro Similarity in Section 9.4.
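To make the normalization remark above concrete, the following minimal sketch maps a raw edit distance onto a similarity score in [0, 1] using the length of the longer text. The class and method names are illustrative assumptions; DKPro Similarity's own string measures already return normalized scores.

public class DistanceNormalizationSketch {

    // Turn a raw edit distance into a similarity score in [0, 1] by normalizing
    // with the length of the longer text; two empty texts count as identical.
    public static double toSimilarity(String text1, String text2, int rawDistance) {
        int maxLength = Math.max(text1.length(), text2.length());
        if (maxLength == 0) {
            return 1.0;
        }
        return 1.0 - (double) rawDistance / maxLength;
    }

    public static void main(String[] args) {
        // The Levenshtein distance between "kitten" and "sitting" is 3,
        // which yields a similarity of 1 - 3/7, i.e. approximately 0.57.
        System.out.println(toSimilarity("kitten", "sitting", 3));
    }
}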

UIMA-coupled Mode In this mode, DKPro Similarity allows text similarity computation to be directly integrated into any UIMA-based language processing pipeline. That way, it is easy to use text similarity components alongside other UIMA-based components in the same pipeline. For example, an experimental setup may require first computing text similarity scores and then running a classification algorithm on the resulting scores. In Figure 9.1, we show a graphical overview of the integration of text similarity measures (right) with a UIMA-based pipeline (left).


[Figure 9.1 here. The figure shows a UIMA-based pipeline (Corpus Reader, Pre-processing, Text Similarity Computation, Post-processing, Evaluation) on the left, a set of text similarity measures (Greedy String Tiling, Double Metaphone, Explicit Semantic Analysis, ...) on the right, and a Similarity Scorer component connecting the two.]

Figure 9.1: DKPro Similarity allows any text similarity measure (right) which conforms to the standardized interfaces to be integrated into a UIMA-based language processing pipeline (left) by means of a dedicated Similarity Scorer component (middle).

The pipeline starts by reading a given dataset, then performs any number of pre-processing steps such as tokenization, sentence splitting, lemmatization, or stopword filtering, then runs the text similarity computation, before executing any subsequent post-processing steps and finally returning the processed texts in a suitable format for evaluation or manual inspection. As all text similarity measures in DKPro Similarity conform to standardized interfaces, they can easily be exchanged in the text similarity computation step. With DKPro Similarity, we offer various subclasses of the generic UIMA components which are specifically tailored towards text similarity experiments, e.g. corpus readers for standard evaluation datasets as well as evaluation components for running typical evaluation metrics. By leveraging UIMA's architecture, we also define an additional interface for text similarity measures: JCasTextSimilarityMeasure inherits from TextSimilarityMeasure and adds a method for two JCas text representations (the JCas is an object-oriented Java interface to the Common Analysis Structure, Apache UIMA's internal document representation format; Ferrucci and Lally, 2004):

double getSimilarity(JCas text1, JCas text2);

This additional interface allows implementing measures which have full access to UIMA's document structure. That way, it is possible to create text similarity measures which can use any piece of information that has been annotated in the processed documents, such as dependency trees or morphological information; a small illustrative sketch follows. We detail the new set of components offered by DKPro Similarity in the following section.
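To illustrate, the following is a minimal sketch of what such a JCas-based measure could look like. It is not code taken from DKPro Similarity: only the getSimilarity(JCas, JCas) signature is given above, while the uimaFIT JCasUtil helper, the DKPro Core Token type, and the idea of computing a token Jaccard overlap are assumptions made for this example. In practice one would additionally hook such a class into the JCasTextSimilarityMeasure interface or its abstract base class.

import java.util.HashSet;
import java.util.Set;
import org.apache.uima.fit.util.JCasUtil;   // uimaFIT utility; the package name differs in older versions
import org.apache.uima.jcas.JCas;
import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token;   // assumed DKPro Core token type

// Hypothetical measure: Jaccard overlap of token surface forms, computed directly
// on the annotated UIMA document structure instead of on raw strings.
public class TokenOverlapSketch {

    public double getSimilarity(JCas text1, JCas text2) {
        Set<String> tokens1 = coveredTexts(text1);
        Set<String> tokens2 = coveredTexts(text2);

        Set<String> intersection = new HashSet<>(tokens1);
        intersection.retainAll(tokens2);
        Set<String> union = new HashSet<>(tokens1);
        union.addAll(tokens2);

        return union.isEmpty() ? 1.0 : (double) intersection.size() / union.size();
    }

    // Collects the lower-cased surface forms of all Token annotations in the document.
    private Set<String> coveredTexts(JCas jcas) {
        Set<String> result = new HashSet<>();
        for (Token token : JCasUtil.select(jcas, Token.class)) {
            result.add(token.getCoveredText().toLowerCase());
        }
        return result;
    }
}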

9.4 Measures & Components

A large number of the measures presented in Section 3 are already available in DKPro Similarity. We give an overview of the available measures in Table 9.1. Among many others, popular measures implemented in DKPro Similarity include compositional text similarity measures based on pairwise word similarity scores, such as the one proposed by Mihalcea et al. (2006), and a set of non-compositional text similarity measures such as Greedy String Tiling (Wise, 1996) and the Levenshtein distance (Levenshtein, 1966), as well as Explicit Semantic Analysis (Gabrilovich and Markovitch, 2007) and Latent Semantic Analysis (Landauer et al., 1998). In Chapters 6 and 7, we already reported further details on the employed text similarity measures. In addition, DKPro Similarity includes components which allow the integration of text similarity measures with any UIMA-based pipeline, as outlined in Figure 9.1.


Table 9.1 Overview of the text similarity measures which are available in DKPro Similarity. We also report the corresponding section where each measure has been described in detail.

Content
  Compositional Measures
    Word Similarity Aggregation by Mihalcea et al. (2006)  (Section 3.1.1, page 19)
  String distance measures
    Greedy String Tiling (Wise, 1996)  (Section 3.2.1, page 22)
    Jaro distance (Jaro, 1989)  (Section 3.2.1, page 22)
    Jaro-Winkler distance (Winkler, 1990)  (Section 3.2.1, page 22)
    Levenshtein distance (Levenshtein, 1966)  (Section 3.2.1, page 22)
    Longest common subsequence (Allison and Dix, 1986)  (Section 3.2.1, page 22)
    Longest common substring (Gusfield, 1997)  (Section 3.2.1, page 22)
    Monge Elkan distance (Monge and Elkan, 1997)  (Section 3.2.1, page 22)
  n-gram Models
    Character n-gram profiles (Keselj et al., 2003)  (Section 3.2.2, page 24)
    Word n-gram profiles (Lyon et al., 2001)  (Section 3.2.2, page 24)
  Vector Space Models
    Explicit Semantic Analysis (Gabrilovich and Markovitch, 2007)  (Section 3.2.5, page 25)
    General Vector Space Model (Salton and McGill, 1983)  (Section 3.2.3, page 24)
    Latent Semantic Analysis (Landauer et al., 1998)  (Section 3.2.4, page 25)
  Lexical Substitution
    Lexical Substitution Wrapper (Biemann, 2012a,b)  (Section 6.3, page 50)
    Microsoft Bing Statistical Machine Translator Wrapper (see note below)  (Section 6.3, page 50)

Structure
  Part-of-speech n-grams  (Section 5.3.2, page 42)
  Stopword n-grams (Stamatatos, 2011)  (Section 5.3.2, page 42)
  Word Pair Distance (Hatzivassiloglou et al., 1999)  (Section 5.3.2, page 42)
  Word Pair Order (Hatzivassiloglou et al., 1999)  (Section 5.3.2, page 42)

Style
  Average Token/Sentence Length  (Section 5.3.3, page 43)
  Function Word Frequencies (Dinu and Popescu, 2009)  (Section 5.3.3, page 43)
  Sequential Type-Token Ratio (McCarthy and Jarvis, 2010)  (Section 5.3.3, page 43)
  Token/Sentence Length Ratio  (Section 5.3.3, page 43)
  Type-Token Ratio (Templin, 1957)  (Section 5.3.3, page 43)

Note: Similar to the statistical machine translation system employed in the intrinsic evaluation (see Section 6.3), we provide a text similarity wrapper based on the Microsoft Bing translator (http://www.bing.com/translator) which performs a round-trip translation on the given input texts before passing the transformed texts on to a particular text similarity measure.

In the following, we introduce these components along with their resources.

Readers & Datasets DKPro Similarity includes corpus readers specifically tailored towards combining the input texts in a number of ways, e.g. all possible combinations, or each text paired with n others at random. Standard datasets for which readers come pre-packaged include, among others, the SemEval-2012 STS data (see Chapter 6), the METER corpus (see Chapter 7), and the RTE 1–5 data (Dagan et al., 2006). As far as license terms allow redistribution, the datasets themselves are integrated into the framework as well and are readily available as Maven modules.

Similarity Scorer The Similarity Scorer allows any text similarity measure (which is decoupled from UIMA by default) to be integrated into a UIMA-based pipeline. It builds upon the standardized text similarity interfaces and thus makes it easy to exchange the similarity measure as well as to specify the data types the measure should operate on, e.g. tokens or lemmas.

Machine Learning In previous research, we have shown that a combination of text similarity measures greatly benefits system performance, as different measures capture different text characteristics (Bär et al., 2012b). DKPro Similarity thus provides adapters for Weka's machine learning algorithms (Hall et al., 2009) and allows sets of text similarity scores to be pre-computed first and then used as features for various machine learning classifiers, as sketched below.
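The following minimal sketch illustrates this workflow with plain Weka (assuming Weka 3.7 or later): pre-computed similarity scores become numeric attributes of a dataset on which any classifier can be trained. The attribute names, class labels, and choice of classifier are illustrative assumptions and not the configuration used in our experiments.

import java.util.ArrayList;
import java.util.Arrays;
import weka.classifiers.Classifier;
import weka.classifiers.functions.Logistic;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class SimilarityFeatureSketch {

    public static void main(String[] args) throws Exception {
        // One numeric attribute per pre-computed text similarity score,
        // plus a nominal class attribute (here: a hypothetical reuse label).
        ArrayList<Attribute> attributes = new ArrayList<>();
        attributes.add(new Attribute("greedyStringTiling"));
        attributes.add(new Attribute("explicitSemanticAnalysis"));
        attributes.add(new Attribute("latentSemanticAnalysis"));
        attributes.add(new Attribute("class", Arrays.asList("reuse", "no_reuse")));

        Instances data = new Instances("textSimilarityFeatures", attributes, 0);
        data.setClassIndex(data.numAttributes() - 1);

        // Each row holds the scores of the individual measures for one text pair;
        // the last value is the index of the class label (0 = reuse, 1 = no_reuse).
        data.add(new DenseInstance(1.0, new double[] { 0.83, 0.91, 0.77, 0 }));
        data.add(new DenseInstance(1.0, new double[] { 0.12, 0.25, 0.31, 1 }));

        // Any Weka classifier can be trained on these similarity features.
        Classifier classifier = new Logistic();
        classifier.buildClassifier(data);
    }
}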

Evaluation Metrics In the final step of a UIMA pipeline, the processed data is read by a dedicated evaluation component. DKPro Similarity ships with a set of components which, for example, compute Pearson and Spearman correlation with human judgments, or apply task-specific evaluation metrics such as average precision and other confidence-weighted scores as used, for example, in the RTE challenges (Dagan et al., 2006).
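As a minimal illustration of the correlation-based metrics, the following sketch compares system scores against gold-standard judgments using Apache Commons Math; this is one possible implementation assumed for illustration, not the code inside DKPro Similarity's evaluation components.

import org.apache.commons.math3.stat.correlation.PearsonsCorrelation;
import org.apache.commons.math3.stat.correlation.SpearmansCorrelation;

public class CorrelationSketch {

    public static void main(String[] args) {
        // System-produced similarity scores and the corresponding human judgments.
        double[] system = { 0.12, 0.45, 0.78, 0.91, 0.33 };
        double[] gold   = { 0.10, 0.50, 0.70, 0.95, 0.40 };

        double pearson  = new PearsonsCorrelation().correlation(system, gold);
        double spearman = new SpearmansCorrelation().correlation(system, gold);

        System.out.printf("Pearson r = %.3f, Spearman rho = %.3f%n", pearson, spearman);
    }
}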

9.5 Experimental Setups

DKPro Similarity further encourages the creation and publication of complete experimental setups. That way, we promote the reproducibility of experimental results, and provide reliable, permanent experimental conditions which can benefit future studies and help to stimulate the reuse of particular experimental steps and software modules. The experimental setups are instantiations of the generic UIMA-based language processing pipeline depicted in Figure 9.1 and are designed to precisely match the particular task at hand. They thus come pre-configured with corpus readers for the relevant input data, with a set of pre- and post-processing as well as evaluation components, and with a set of text similarity measures which are well-suited for the particular task. The experimental setups are self-contained systems and can be run out-of-the-box without further configuration (a one-time setup of local lexical-semantic resources such as WordNet may be necessary, though).


DKPro Similarity contains two major types of experimental setups: (i) setups for an intrinsic evaluation, which evaluate the system performance in an isolated setting by comparing the system results with a human gold standard, and (ii) setups for an extrinsic evaluation, which evaluate the system with respect to a particular task at hand, where text similarity is a means for solving a particular problem, e.g. recognizing textual entailment.

Intrinsic Evaluation DKPro Similarity contains the setup (Bär et al., 2012a) which participated in the Semantic Textual Similarity Task at SemEval-2012 (Agirre et al., 2012) and which has now become one of the official baseline systems for the second task of this series (in 2013, the Semantic Textual Similarity Task is a shared task of the Second Joint Conference on Lexical and Computational Semantics, http://ixa2.si.ehu.es/sts). The system uses a simple log-linear regression model to combine multiple text similarity measures of varying complexity. The provided setup allows evaluating how well the system output resembles human similarity judgments on short texts which are taken from five different sources, e.g. paraphrases of news texts. We report details on the intrinsic evaluation in Chapter 6.

Extrinsic Evaluation DKPro Similarity includes two setups for an extrinsic evaluation: detecting text reuse, and recognizing textual entailment. For detecting text reuse (Clough et al., 2002), the experimental setup we provide (Bär et al., 2012b) combines a multitude of text similarity measures along different text characteristics. Thereby, it makes extensive use of measures along structural and stylistic text characteristics. Across three standard evaluation datasets, the system consistently outperforms all previous work. We already reported details on the extrinsic evaluation in Chapter 7.

For recognizing textual entailment, we provide a setup which is similar in configuration to the one described above, but contains corpus readers and evaluation components precisely tailored towards the RTE challenge series (Dagan et al., 2006). We believe that our setup can be used for filtering those text pairs which need further analysis by a dedicated textual entailment system. We already reported details on the provided setup in Section 8.2.2.

9.6 Chapter Summary

In this chapter, we described how our work culminated in DKPro Similarity, an open source framework designed to streamline the development of text similarity measures. All measures conform to standardized interfaces and can either be used as stand-alone components in any experimental setup, or be tightly coupled with a full-featured UIMA-based language processing pipeline in order to allow for advanced processing capabilities.

We would like to encourage other researchers to participate in our efforts and invite them to explore our existing experimental setups as outlined above, run modified versions of our setups, and contribute their own text similarity measures to the framework. For that, DKPro Similarity also comes with an example module for getting started, which guides first-time users through both the stand-alone and the UIMA-coupled modes.


For the same reason, we also created a system demonstration for DKPro Similarity which we presented at the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013) in Sofia, Bulgaria, in August 2013 (Bär et al., 2013). A screenshot of the system demonstration is shown in Figure 9.2. We implemented the demonstration as part of the Wikulu platform which we introduced in Section 8.1. In the demonstration, the user can easily choose a standard dataset to operate on, e.g. a dataset used in the intrinsic evaluation as part of the Semantic Textual Similarity Task at SemEval-2012 (see Chapter 6), or enter two custom texts. The user then selects one of the available text similarity measures contained in DKPro Similarity and activates the system. The similarity computation is then invoked and the results are directly written back to the originating web page for manual inspection. We hope that this system demonstration lowers the entry barrier to DKPro Similarity, and also helps to quickly inspect or compare the performance of different text similarity measures for given text pairs.

Figure 9.2: Screenshot of the DKPro Similarity system demonstration: Text similarity was computed for the MSRpar dataset taken from the SemEval-2012 exercises (see Section 6.2) with the Explicit Semantic Analysis measure on Wiktionary (see Section 3.2.5). The resulting similarity scores are shown along with the text pairs at the bottom of the screenshot.

Chapter 10

Summary and Conclusions

In the natural language processing community, text similarity is fundamental to a variety of tasks and applications such as automatic essay grading or paraphrase recognition. Contrary to the notion of similarity in psychology, though, text similarity has traditionally been a loose notion and is much less well-defined than its psychological counterpart. We argued that a major shortcoming of previous text similarity research is the fact that no attempt has been made yet to formalize in what way text similarity between two texts can be computed. Still, text similarity is regarded as a fixed, axiomatic notion in the community. In consequence, we proposed to define text similarity as a notion which can be judged along multiple text dimensions, i.e. characteristics inherent to texts. We thereby build upon the idea of conceptual spaces, a theory which allows modeling aspects of the external world in a generic framework. For the application to texts, we proposed to focus on three generic text dimensions for which we provided empirical evidence: content, structure, and style. Based on this analysis, we combined a multitude of text similarity measures along these dimensions in a composite model. As we showed, a system based on this model consistently outperformed prior work and competing systems in both an intrinsic and an extrinsic evaluation, which we summarize in the following.

We first applied our system in an intrinsic evaluation, namely the Semantic Textual Similarity (STS) Task (Agirre et al., 2012) as part of the Semantic Evaluation (SemEval) workshop. Across all five evaluation datasets, our system performed best in two out of three official evaluation metrics and was ranked #2 for the third metric when comparing the system output to human judgments. While we did not reach the highest scores on any of the single datasets, our system was most robust across different data. In the experiments, we showed that text similarity can be best detected if a multitude of measures, each addressing different text characteristics, are combined in a single classification model. Due to the nature of the data in this task, a combination of measures along the content dimension showed the best results.

In an extrinsic evaluation, we applied our system to the task of text reuse detection. In this task, text pairs were classified by our system according to the degree of text reuse they exhibit and then compared with human classifications. In these experiments, we empirically showed that text reuse can be best detected if measures are combined across multiple text dimensions. For two out of three datasets, the best performance was reached when combining all three dimensions content, structure, and style in a single classification model, while for the third dataset stylistic measures showed poor performance and, in consequence, the best performance was reached when combining only measures from the content and the structure dimensions.


Across all datasets, our system outperformed all previous work on this data.

In our opinion, the presented research is a fundamental step towards a more well-defined notion of text similarity. The central point of our discussion is that text similarity cannot be seen as a fixed, axiomatic notion. Rather, we need to define, per task, in what way two texts should be considered similar. In this work, we identified the text dimensions content, structure, and style, which may be a promising starting point for future studies, even though other dimensions may be more suitable, depending on the concrete task and application at hand. We believe that our work will benefit any other task where text similarity computation is fundamental. For example, plagiarism detection is traditionally limited to the comparison of features along the content dimension. However, we see great potential for improvement by including, for example, measures for grammar analysis, lexical complexity, or measures assessing text organization with respect to its discourse elements. At the same time, each task exhibits particular characteristics which influence the choice of a suitable set of text dimensions. A particular dimension may or may not contribute to an overall improvement based on the nature of the data.

Resulting from the insight that the community still lacks a generalized framework for text similarity, our efforts culminated in the open source text similarity framework DKPro Similarity. While some frameworks for similarity computation already exist, e.g. the S-Space Package or the SimMetrics Library, none of them goes beyond simple lexical similarity or contains more than a small number of measures. We thus introduced an open source package for developing text similarity measures as well as complete experimental setups. Our goal here is to promote the reproducibility of experimental results, and to provide reliable, permanent experimental conditions which can benefit future studies and help foster the reuse of particular experimental steps and software modules. We also hope to stimulate further research in this field as well as future developments of text similarity measures and experimental setups.

Future Work

The greatest obstacle that hinders the broad adoption of our research in future work is the lack of suitable evaluation datasets. For example, in the data by Agirre et al. (2012) two major issues remain: (a) It is unclear how to judge similarity between pairs of texts which contain contextual references such as on Monday vs. after the Thanksgiving weekend. (b) For several pairs, it is unclear what point of view to take, e.g. for the pair An animal is eating / The animal is hopping. Is the pair to be considered similar (an animal is doing something) or rather not (eating vs. hopping)? As long as no more specific instructions are given on how to judge text similarity for such cases, there will inevitably be variations in similarity judgments, both in judgments by human subjects and in those by similarity measures.

Agirre et al. (2013) also point out that there is a need for “more nuanced” datasets that can be used for evaluation. Thereby, they refer to datasets which include, for example, modality, polarity, or sentiment. In the process of annotating evaluation datasets with human judgments, human annotators are typically asked to rate the degree of similarity between two given texts. However, in this setting it remains completely unclear how the annotators are influenced by the aforementioned linguistic phenomena. An idea for future datasets is to accompany text pairs not only with a single text similarity score, but with a set of scores along different text dimensions. These dimensions could then go beyond content, structure, and style and include dimensions for sentiment or modality. In consequence, text similarity measures could be precisely evaluated against one or more of these dimensions.

We further believe that future work on text similarity may greatly benefit from the inclusion of more elaborate textual features. Up to now, we did not address any features which explicitly express, for example, semantic relations between the texts such as elaboration or contradiction, as proposed by Cross-document Structure Theory (Zhang et al., 2003). We believe that the inclusion of such features may better allow similarity measures to compute text similarity in the same way that humans do.

Bibliography

Abdel-Hamid, O., Behzadi, B., Christoph, S., and Henzinger, M. (2009). Detecting the Origin of Text Segments Efficiently. In Proceedings of the 18th International Conference on World Wide Web, pages 61–70, Madrid, Spain.

Agirre, E., Cer, D., Diab, M., and Gonzalez-Agirre, A. (2012). SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity. In Proceedings of the 6th International Workshop on Semantic Evaluation, in conjunction with the 1st Joint Conference on Lexical and Computational Semantics, pages 385–393, Montreal, Canada.

Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A., and Guo, W. (2013). *SEM 2013 Shared Task: Semantic Textual Similarity. In Proceedings of the 2nd Joint Conference on Lexical and Computational Semantics, pages 32–43, Atlanta, GA, USA.

Allison, L. and Dix, T. I. (1986). A bit-string longest-common-subsequence algorithm. Information Processing Letters, 23:305–310.

Anderka, M. and Stein, B. (2009). The ESA Retrieval Model Revisited. In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 670–671, Boston, MA, USA.

Artstein, R. and Poesio, M. (2008). Inter-Coder Agreement for Computational Linguistics. Computational Linguistics, 34(4):555–596.

Attali, Y. and Burstein, J. (2006). Automated Essay Scoring with e-rater v.2. Journal of Technology, Learning, and Assessment, 4(3).

Banea, C., Hassan, S., Mohler, M., and Mihalcea, R. (2012). UNT: A Supervised Synergistic Approach to Semantic Text Similarity. In Proceedings of the 6th International Workshop on Semantic Evaluation, in conjunction with the 1st Joint Conference on Lexical and Computational Semantics, pages 635–642, Montreal, Canada.

Bär, D., Biemann, C., Gurevych, I., and Zesch, T. (2012a). UKP: Computing Semantic Textual Similarity by Combining Multiple Content Similarity Measures. In Proceedings of the 6th International Workshop on Semantic Evaluation, in conjunction with the 1st Joint Conference on Lexical and Computational Semantics, pages 435–440, Montreal, Canada.

Bär, D., Erbs, N., Zesch, T., and Gurevych, I. (2011a). Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. System Demonstrations, pages 74–79, Portland, OR, USA.

Bär, D., Zesch, T., and Gurevych, I. (2011b). A Reflective View on Text Similarity. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 515–520, Hissar, Bulgaria.

Bär, D., Zesch, T., and Gurevych, I. (2012b). Text Reuse Detection Using a Composition of Text Similarity Measures. In Proceedings of the 24th International Conference on Computational Linguistics, pages 167–184, Mumbai, India.

Bär, D., Zesch, T., and Gurevych, I. (2013). DKPro Similarity: An Open Source Framework for Text Similarity. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 121–126, Sofia, Bulgaria.

Bar-Haim, R., Dagan, I., Dolan, B., Ferro, L., Giampiccolo, D., Magnini, B., and Szpektor, I. (2006). The Second PASCAL Recognising Textual Entailment Challenge. In Proceedings of the 2nd Pascal Challenges Workshop on Recognizing Textual Entailment, Venice, Italy.

Barrón-Cedeño, A., Rosso, P., Agirre, E., and Labaka, G. (2010). Plagiarism Detection across Distant Language Pairs. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 37–45, Beijing, China.

Barzilay, R. and Elhadad, M. (1997). Using Lexical Chains for Text Summarization. In Proceedings of the ACL Workshop on Intelligent Scalable Text Summarization, pages 10–17, Madrid, Spain.

Bentivogli, L., Dagan, I., Dang, H. T., Giampiccolo, D., and Magnini, B. (2009). The Fifth PASCAL Recognizing Textual Entailment Challenge. In Proceedings of the 2nd Text Analysis Conference, Gaithersburg, MD, USA.

Biemann, C. (2012a). Creating a System for Lexical Substitutions from Scratch using Crowdsourcing. Language Resources and Evaluation: Special Issue on Collaboratively Constructed Language Resources, 47(1):97–122.

Biemann, C. (2012b). Turk Bootstrap Word Sense Inventory 2.0: A Large-Scale Resource for Lexical Substitution. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 4038–4042, Istanbul, Turkey.

Bizer, C., Heath, T., and Berners-Lee, T. (2009a). Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems, 5(3):1–22.

Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., and Hellmann, S. (2009b). DBpedia - A Crystallization Point for the Web of Data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165.

Boonthum, C. (2004). iSTART: Paraphrase Recognition. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 31–36, Barcelona, Spain.

Brants, T. and Franz, A. (2006). Web 1T 5-gram corpus version 1.1. Technical report, Google Research.

Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the 7th International World Wide Web Conference, pages 107–117, Brisbane, Australia.

Broder, A. Z. (1997). On the resemblance and containment of documents. In Proceedings of Compression and Complexity of Sequences, pages 21–29, Salerno, Italy.

Broder, A. Z., Glassman, S. C., Manasse, M. S., and Zweig, G. (1997). Syntactic clustering of the Web. In Proceedings of the 6th International World Wide Web Conference, pages 1157–1166, Santa Clara, CA, USA.

Buffa, M. (2006). Intranet Wikis. In Proceedings of the IntraWebs Workshop, in conjunction with the 15th International World Wide Web Conference, Edinburgh, Scotland.

Burrows, S., Potthast, M., and Stein, B. (2013). Paraphrase Acquisition via Crowdsourcing and Machine Learning. Transactions on Intelligent Systems and Technology, 4(3):43:1–21.

Buscaldi, D., Roux, J. L., Flores, J. J. G., and Popescu, A. (2013). LIPN-CORE: Semantic Text Similarity using n-grams, WordNet, Syntactic Analysis, ESA and Information Retrieval based Features. In Proceedings of the 2nd Joint Conference on Lexical and Computational Semantics, pages 162–168, Atlanta, GA, USA.

Callison-Burch, C. and Dredze, M. (2010). Creating Speech and Language Data With Amazon’s Mechanical Turk. In Proceedings of the NAACL HLT Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pages 1–12, Los Angeles, CA, USA.

Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C., and Schroeder, J. (2007). (Meta-)Evaluation of Machine Translation. In Proceedings of the 2nd Workshop on Statistical Machine Translation, pages 136–158, Prague, Czech Republic.

Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C., and Schroeder, J. (2008). Further Meta-Evaluation of Machine Translation. In Proceedings of the 3rd Workshop on Statistical Machine Translation, pages 70–106, Columbus, OH, USA.

Carenini, G. and Cheung, J. C. K. (2008). Extractive vs. NLG-based Abstractive Summarization of Evaluative Text: The Effect of Corpus Controversiality. In Proceedings of the 5th International Natural Language Generation Conference, pages 33–41, Salt Fork, OH, USA.

Carletta, J. (1996). Assessing Agreement on Classification Tasks: The Kappa Statistic. Computational Linguistics, 22(2):249–254.

Cavnar, W. B. and Trenkle, J. M. (1994). N-Gram-Based Text Categorization. In Proceedings of the 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 161–175, Las Vegas, NV, USA.

Chandrasekar, R., Doran, C., and Srinivas, B. (1996). Motivations and Methods for Text Simplification. In Proceedings of the 16th International Conference on Computational Linguistics, pages 1041–1044, Copenhagen, Denmark.

Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27.

Chapman, S., Norton, B., and Ciravegna, F. (2005). Armadillo: Integrating Knowledge for the Semantic Web. In Proceedings of the Dagstuhl Seminar in Machine Learning for the Semantic Web, Dagstuhl Castle, Germany.

Charikar, M. S. (2002). Similarity Estimation Techniques from Rounding Algorithms. In Proceedings of the 34th Annual Symposium on Theory of Computing, pages 380–388, Montreal, Canada.

Chen, D. L. and Dolan, W. B. (2011). Collecting Highly Parallel Data for Paraphrase Evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 190–200, Portland, OR, USA.

Choi, F. Y. (2000). Advances in domain independent linear text segmentation. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 26–33, Stroudsburg, PA, USA.

Chong, M., Specia, L., and Mitkov, R. (2010). Using Natural Language Processing for Automatic Detection of Plagiarism. In Proceedings of the 4th International Plagiarism Conference, Newcastle upon Tyne, UK.

Clough, P. (2003). Old and new challenges in automatic plagiarism detection. Technical report, National UK Plagiarism Advisory Service.

Clough, P., Gaizauskas, R., Piao, S. S., and Wilks, Y. (2002). METER: MEasuring TExt Reuse. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 152–159, Philadelphia, PA, USA.

Clough, P. and Stevenson, M. (2011). Developing a Corpus of Plagiarised Short Answers. Language Resources and Evaluation: Special Issue on Plagiarism and Authorship Analysis, 45(1):5–24.

Cohen, W. W., Ravikumar, P., and Fienberg, S. (2003a). A Comparison of String Metrics for Matching Names and Records. In Proceedings of the KDD Workshop on Data Cleaning and Object Consolidation, Washington, DC, USA.

Cohen, W. W., Ravikumar, P., and Fienberg, S. E. (2003b). A Comparison of String Distance Metrics for Name-Matching Tasks. In Proceedings of the IJCAI Workshop on Information Integration on the Web, pages 73–78, Acapulco, Mexico.

Cohn, T., Callison-Burch, C., and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4):597–614.

Cunningham, H., Maynard, D., Bontcheva, K., and Tablan, V. (2002). GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, pages 168–175.

Dagan, I., Glickman, O., and Magnini, B. (2006). The PASCAL Recognising Textual Entailment Challenge. In Machine Learning Challenges, Lecture Notes in Computer Science, pages 177–190. Springer.

Decock, L. and Douven, I. (2010). Similarity After Goodman. Review of Philosophy and Psychology, pages 1–15.

Dinu, L. P. and Popescu, M. (2009). Ordinal measures in authorship identification. In Proceedings of the 3rd PAN Workshop. Uncovering Plagiarism, Authorship and Social Software Misuse, pages 62–66, San Sebastian, Spain.

Dolan, W. B., Quirk, C., and Brockett, C. (2004). Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources. In Proceedings of the 20th International Conference on Computational Linguistics, pages 350–356, Geneva, Switzerland.

Erkan, G. and Radev, D. (2004). LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research, 22:457–479.

Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. MIT Press.

Ferrucci, D. and Lally, A. (2004). UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment. Natural Language Engineering, 10(3-4):327–348.

Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382.

Fokkens, A., van Erp, M., Postma, M., Pedersen, T., Vossen, P., and Freire, N. (2013). Offspring from Reproduction Problems: What Replication Failure Teaches Us. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 1691–1701, Sofia, Bulgaria.

Gabrilovich, E. and Markovitch, S. (2007). Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 1606–1611, Hyderabad, India.

Gaizauskas, R., Foster, J., Wilks, Y., Arundel, J., Clough, P., and Piao, S. (2001). The METER Corpus: A corpus for analysing journalistic text reuse. In Proceedings of the Corpus Linguistics 2001 Conference, pages 214–223, Bailrigg, UK.

Gärdenfors, P. (2000). Conceptual Spaces: The Geometry of Thought. MIT Press.

Geva, S. (2007). GPX: Ad-Hoc Queries and Automated Link Discovery in the Wikipedia. In Proceedings of the 6th International Workshop of the Initiative for the Evaluation of XML Retrieval, pages 404–416, Dagstuhl Castle, Germany.

Giampiccolo, D., Dang, H. T., Magnini, B., Dagan, I., Cabrio, E., and Dolan, B. (2008). The Fourth PASCAL Recognizing Textual Entailment Challenge. In Proceedings of the 1st Text Analysis Conference, Gaithersburg, MD, USA.

Giampiccolo, D., Magnini, B., Dagan, I., and Dolan, B. (2007). The Third PASCAL Recognizing Textual Entailment Challenge. In Proceedings of the Workshop on Textual Entailment and Paraphrasing, pages 1–9, Prague, Czech Republic.

Goodman, N. (1972). Seven Strictures on Similarity. In Problems and Projects, pages 437–446. Bobbs-Merrill.

Gurevych, I., Müller, C., and Zesch, T. (2007). What to be? Electronic Career Guidance Based on Semantic Relatedness. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 1032–1039, Prague, Czech Republic.

Gusfield, D. (1997). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1):10–18.

Han, L., Kashyap, A., Finin, T., Mayfield, J., and Weese, J. (2013). UMBC_EBIQUITY-CORE: Semantic Textual Similarity Systems. In Proceedings of the 2nd Joint Conference on Lexical and Computational Semantics, pages 44–52, Atlanta, GA, USA.

Hannabuss, S. (2001). Contested Texts: Issues of Plagiarism. Library Management, 22(6/7):311–318.

Hatzivassiloglou, V., Klavans, J. L., and Eskin, E. (1999). Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 203–212, College Park, MD, USA.

Hearst, M. A. (1997). TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages. Computational Linguistics, 23(1):33–64.

Heilman, M. and Madnani, N. (2012). ETS: Discriminative Edit Models for Paraphrase Scoring. In Proceedings of the 6th International Workshop on Semantic Evaluation, in conjunction with the 1st Joint Conference on Lexical and Computational Semantics, pages 529–535, Montreal, Canada.

Henzinger, M. (2006). Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 284–291, Seattle, WA, USA.

Hirst, G. and St. Onge, D. (1998). Lexical chains as representations of contexts for the detection and correction of malapropisms. In Fellbaum, C., editor, WordNet: An Electronic Lexical Database. MIT Press.

Ho, C., Azrifah, M., Murad, A., Kadir, R. A., and Doraisamy, S. C. (2010). Word Sense Disambiguation-based Sentence Similarity. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 418–426, Beijing, China.

Hoad, T. C. and Zobel, J. (2003). Methods for identifying versioned and plagiarized documents. Journal of the American Society of Information Science and Technology, 54(3):203–215.

Hoffart, J., Bär, D., Zesch, T., and Gurevych, I. (2009a). Discovering Links Using Semantic Relatedness. In Proceedings of the 8th Workshop of the Initiative of the Evaluation of XML Retrieval, pages 314–325, Brisbane, Australia.

Hoffart, J., Suchanek, F. M., Berberich, K., and Weikum, G. (2013). YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. Artificial Intelligence, 194:28–61.

Hoffart, J., Zesch, T., and Gurevych, I. (2009b). An Architecture to Support Intelligent User Interfaces for Wikis by Means of Natural Language Processing. In Proceedings of the 5th International Symposium on Wikis and Open Collaboration, Orlando, FL, USA.

Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., and Weischedel, R. (2006). OntoNotes: The 90% Solution. In Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 57–60, New York, NY, USA.

Huang, W. C. D., Geva, S., and Trotman, A. (2010). Overview of the INEX 2009 Link the Wiki Track. In Proceedings of the 8th International Workshop of the Initiative for the Evaluation of XML Retrieval, pages 312–323, Brisbane, Australia.

Hughes, T. and Ramage, D. (2007). Lexical semantic relatedness with random graph walks. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 581–589, Prague, Czech Republic.

Islam, A. and Inkpen, D. (2006). Second order co-occurrence PMI for determining the semantic similarity of words. In Proceedings of the International Conference on Language Resources and Evaluation, pages 1033–1038, Genoa, Italy.

Islam, A. and Inkpen, D. (2008). Semantic Text Similarity Using Corpus-Based Word Similarity and String Similarity. ACM Transactions on Knowledge Discovery from Data, 2(2):1–25.

Islam, A., Milios, E., and Keselj, V. (2012). Text Similarity Using Google Tri-grams. In Proceedings of the 25th Canadian Conference on Artificial Intelligence, pages 312–317, Toronto, Canada.

Itakura, K. Y. and Clarke, C. L. (2007). University of Waterloo at INEX2007: Adhoc and Link-the-Wiki Tracks. In Proceedings of the 6th International Workshop of the Initiative for the Evaluation of XML Retrieval, pages 417–425, Dagstuhl Castle, Germany.

Jaro, M. A. (1989). Advances in record linkage methodology as applied to the 1985 census of Tampa Florida. Journal of the American Statistical Association, 84(406):414–420.

Jiang, J. J. and Conrath, D. W. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the 10th International Conference on Research in Computational Linguistics, pages 19–33, Taiwan.

Joy, M. and Luck, M. (1999). Plagiarism in Programming Assignments. IEEE Transactions of Education, 42(2):129–133.

Jurgens, D. and Stevens, K. (2010). The S-Space Package: An Open Source Package for Word Space Models. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 30–35, Uppsala, Sweden.

Kennedy, A. and Szpakowicz, S. (2008). Evaluating Roget’s Thesauri. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 416–424, Columbus, OH, USA.

Keselj, V., Peng, F., Cercone, N., and Thomas, C. (2003). N-gram-based author profiles for authorship attribution. In Proceedings of the Conference of the Pacific Association for Computational Linguistics, pages 255–264, Halifax, Canada.

Knight, K. and Marcu, D. (2002). Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139(1):91–107.

Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the 10th Machine Translation Summit, pages 79–86, Phuket Island, Thailand.

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., and Herbst, E. (2007). Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics. Demo and Poster Sessions, pages 177–180, Prague, Czech Republic.

Krötzsch, M., Vrandecic, D., and Völkel, M. (2006). Semantic MediaWiki. In Proceedings of the 5th International Semantic Web Conference, pages 935–942, Athens, GA, USA.

Kucera, H. and Francis, W. N. (1967). Computational Analysis of Present Day American English. Brown University Press.

Landauer, T. K., Foltz, P. W., and Laham, D. (1998). An Introduction to Latent Semantic Analysis. Discourse Processes, 25(2):259–284.

Landis, J. R. and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174.

Leacock, C. and Chodorow, M. (1998). Combining local context and WordNet similarity for word sense identification. In Fellbaum, C., editor, WordNet: An Electronic Lexical Database. MIT Press.

Leacock, C. and Chodorow, M. (2003). C-rater: Automated Scoring of Short-Answer Questions. Computers and the Humanities, 37(4):389–405.

Lee, J. (2007). A Computational Model of Text Reuse in Ancient Literary Texts. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 472–479, Prague, Czech Republic.

Lee, M. D., Pincombe, B., and Welsh, M. (2005). An empirical evaluation of models of text document similarity. In Proceedings of the 27th Annual Conference of the Cognitive Science Society, pages 1254–1259, Stresa, Italy.

Leech, G. (1993). 100 million words of English. English Today, 9(1):9–15.

Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation, pages 24–26, New York, NY, USA.

Leuf, B. and Cunningham, W. (2001). The Wiki Way: Collaboration and Sharing on the Internet. Addison-Wesley Professional.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707–710.

Li, Y., McLean, D., Bandar, Z., O’Shea, J., and Crockett, K. (2006). Sentence Similarity Based on Semantic Nets and Corpus Statistics. IEEE Transactions on Knowledge and Data Engineering, 18(8):1138–1150.

Lin, D. (1998a). An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, pages 296–304, Madison, WI, USA.

Lin, D. (1998b). Automatic Retrieval and Clustering of Similar Words. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics, pages 768–774, Montreal, Canada.

Lin, D. and Pantel, P. (2001). Discovery of Inference Rules for Question Answering. Natural Language Engineering, 7(4):343–360.

Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151.

Lyon, C., Barrett, R., and Malcolm, J. (2004). A theoretical basis to the automated detection of copying between texts, and its practical implementation in the Ferret plagiarism and collusion detector. In Proceedings of the Conference on Plagiarism: Prevention, Practice and Policies, Newcastle upon Tyne, UK.

Lyon, C., Malcolm, J., and Dickerson, B. (2001). Detecting short passages of similar text in large document collections. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 118–125, Pittsburgh, PA, USA.

Manku, G. S., Jain, A., and Sarma, A. D. (2007). Detecting Near-Duplicates for Web Crawling. In Proceedings of the 16th International World Wide Web Conference, pages 141–149, Banff, Canada.

Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.

Marsi, E., Moen, H., Bungum, L., Sizov, G., Gambäck, B., and Lynum, A. (2013). NTNU-CORE: Combining strong features for semantic similarity. In Proceedings of the 2nd Joint Conference on Lexical and Computational Semantics, pages 66–73, Atlanta, GA, USA.

McCarthy, P. M. and Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2):381–392.

Mihalcea, R., Corley, C., and Strapparava, C. (2006). Corpus-based and Knowledge-based Measures of Text Semantic Similarity. In Proceedings of the 21st National Conference on Artificial Intelligence, pages 775–780, Boston, MA, USA.

Mihalcea, R. and Tarau, P. (2004). TextRank: Bringing Order into Texts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 404–411, Barcelona, Spain.

Miller, G. and Charles, W. (1991). Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28.

Mohler, M. and Mihalcea, R. (2009). Text-to-text Semantic Similarity for Automatic Short Answer Grading. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 567–575, Athens, Greece.

Monge, A. and Elkan, C. (1997). An efficient domain-independent algorithm for detecting approximately duplicate database records. In Proceedings of the SIGMOD Workshop on Data Mining and Knowledge Discovery, pages 23–29, Tucson, AZ, USA.

Mosteller, F. and Wallace, D. L. (1964). Inference and disputed authorship: The Federalist. Addison-Wesley.

Palmer, M., Gildea, D., and Xue, N. (2010). Semantic Role Labeling. Synthesis Lectures on Human Language Technologies, 3(1):1–103.

Pang, B. and Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

Pedersen, T., Patwardhan, S., and Michelizzi, J. (2004). WordNet::Similarity – Measuring the Relatedness of Concepts. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: Demonstration Papers, pages 38–41, Boston, MA, USA.

Ponzetto, S. P. and Navigli, R. (2010). Knowledge-rich Word Sense Disambiguation Rivaling Supervised Systems. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1522–1531, Uppsala, Sweden.

Potthast, M., Barrón-Cedeño, A., Eiselt, A., Stein, B., and Rosso, P. (2010). Overview of the 2nd International Competition on Plagiarism Detection. In Proceedings of the 1st International Conference of the Cross-Language Evaluation Forum: Evaluation Labs and Workshops, Padua, Italy.

Potthast, M., Gollub, T., Hagen, M., Graßegger, J., Kiesel, J., Michel, M., Oberländer, A., Tippmann, M., Barrón-Cedeño, A., Gupta, P., Rosso, P., and Stein, B. (2012). Overview of the 4th International Competition on Plagiarism Detection. In Proceedings of the 3rd International Conference of the Cross-Language Evaluation Forum: Evaluation Labs and Workshops, Rome, Italy.

Rada, R., Mili, H., Bicknell, E., and Blettner, M. (1989). Development and Application of a Metric on Semantic Nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1):17–30.

Ramage, D., Rafferty, A. N., and Manning, C. D. (2009). Random Walks for Text Semantic Similarity. In Proceedings of the Workshop on Graph-based Methods for Natural Language Processing, pages 23–31, Singapore.

Resnik, P. (1995). Using Information Content to Evaluate Semantic Similarity in a Taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pages 448–453, Montreal, Canada.

Rubenstein, H. and Goodenough, J. B. (1965). Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.

Rus, V., Lintean, M., Banjade, R., Niraula, N., and Stefanescu, D. (2013). SEMILAR: The Semantic Similarity Toolkit. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 163–168, Sofia, Bulgaria.

Salton, G. and McGill, M. J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill.

Sánchez-Vega, F., Villaseñor-Pineda, L., Montes-y-Gómez, M., and Rosso, P. (2010). Towards Document Plagiarism Detection Based on the Relevance and Fragmentation of the Reused Text. In Proceedings of the 9th Mexican International Conference on Artificial Intelligence, pages 24–31, Pachuca, Mexico.

Sarić, F., Glavas, G., Karan, M., Snajder, J., and Basić, B. D. (2012). TakeLab: Systems for Measuring Semantic Text Similarity. In Proceedings of the 6th International Workshop on Semantic Evaluation, in conjunction with the 1st Joint Conference on Lexical and Computational Semantics, pages 441–448, Montreal, Canada.

Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK.

Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19(3):321–325.

Severyn, A., Nicosia, M., and Moschitti, A. (2013). iKernels-Core: Tree Kernel Learning for Textual Similarity. In Proceedings of the 2nd Joint Conference on Lexical and Computational Semantics, pages 53–58, Atlanta, GA, USA.

Shepard, R. (1962). The analysis of proximities: Multidimensional scaling with an unknown distance function. I. Psychometrika, 27(2):125–140.

Sinclair, J. (2001). Collins COBUILD Advanced Learner’s English Dictionary. HarperCollins, 3rd edition.

Smith, L. B. and Heise, D. (1992). Perceptual similarity and conceptual structure. In Burns, B., editor, Percepts, Concepts, and Categories, pages 233–272. Elsevier.

Stamatatos, E. (2011). Plagiarism detection using stopword n-grams. Journal of the American Society for Information Science and Technology, 62(12):2512–2527.

Suchanek, F. M., Kasneci, G., and Weikum, G. (2007). YAGO: A Core of Semantic Knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706, Banff, Canada.

Templin, M. C. (1957). Certain language skills in children. University of Minnesota Press.

Tsatsaronis, G., Varlamis, I., and Vazirgiannis, M. (2010). Text Relatedness Based on a Word Thesaurus. Journal of Artificial Intelligence Research, 37:1–39.

Tucker, S. and Whittaker, S. (2009). Have A Say Over What You See: Evaluating Interactive Compression Techniques. In Proceedings of the International Conference on Intelligent User Interfaces, pages 37–46, Sanibel Island, FL, USA.

Turney, P. D. (2001). Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the 12th European Conference on Machine Learning, pages 491–502, Freiburg, Germany.

Tversky, A. (1977). Features of Similarity. Psychological Review, 84(4):327–352.

Walter, S., Unger, C., Cimiano, P., and Bär, D. (2012). Evaluation of a Layered Approach to Question Answering over Linked Data. In Proceedings of the 11th International Semantic Web Conference, pages 362–374, Boston, MA, USA.

Widdows, D. (2004). Geometry and Meaning. Center for the Study of Language and Information, Stanford, CA, USA.

Widdows, D. and Cohen, T. (2010). The Semantic Vectors Package: New Algorithms and Public Tools for Distributional Semantics. In 4th IEEE International Conference on Semantic Computing, pages 9–15, Pittsburgh, PA, USA.

Winkler, W. E. (1990). String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. In Proceedings of the Survey Research Methods Section, pages 354–359.

Wise, M. J. (1996). YAP3: Improved detection of similarities in computer program and other texts. In Proceedings of the 27th SIGCSE technical symposium on computer science education, pages 130–134, Philadelphia, PA, USA.

Witte, R. and Gitzinger, T. (2007). Connecting wikis and natural language processing systems. In Proceedings of the International Symposium on Wikis, pages 165–176, Montreal, Canada.

Witten, I. H., Paynter, G. W., Frank, E., Gutwin, C., and Nevill-Manning, C. G. (1999). KEA: Practical automatic keyphrase extraction. In Proceedings of the 4th ACM Conference on Digital Libraries, pages 254–255, Berkeley, CA, USA.

Wu, S., Zhu, D., Carterette, B., and Liu, H. (2013). MayoClinicNLP–CORE: Semantic representations for textual similarity. In Proceedings of the 2nd Joint Conference on Lexical and Computational Semantics, pages 148–154, Atlanta, GA, USA.

Wu, Z. and Palmer, M. (1994). Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, Las Cruces, NM, USA.

Xu, J. and Croft, W. B. (1996). Query Expansion Using Local and Global Document Analysis. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 4–11, Zurich, Switzerland.

Yang, D. and Powers, D. M. (2005). Measuring Semantic Similarity in the Taxonomy of WordNet. In Proceedings of the 28th Australasian Computer Science Conference, pages 315–332, Newcastle, Australia.

Yarowsky, D. (1995). Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196, Cambridge, MA, USA.

Yeh, E., Ramage, D., Manning, C. D., Agirre, E., and Soroa, A. (2009). WikiWalk: Random walks on Wikipedia for Semantic Relatedness. In Proceedings of the Workshop on Graph-based Methods for Natural Language Processing, pages 41– 49, Singapore.

Yule, G. U. (1939). On sentence-length as a statistical characteristic of style in prose: With application to two cases of disputed authorship. Biometrika, 30(3/4):363–390.

Zesch, T. and Gurevych, I. (2010). Wisdom of Crowds versus Wisdom of Linguists – Measuring the Semantic Relatedness of Words. Journal of Natural Language Engineering, 16(1):25–59.

Zesch, T., Müller, C., and Gurevych, I. (2008). Using Wiktionary for Computing Semantic Relatedness. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pages 861–867, Chicago, IL, USA.

Zhang, Z., Otterbacher, J., and Radev, D. (2003). Learning Cross-document Structural Relationships Using Boosting. In Proceedings of the 12th International Conference on Information and Knowledge Management, pages 124–130, New Orleans, LA, USA.

Zhu, T. T. and Lan, M. (2013). ECNUCS: Measuring Short Text Semantic Equivalence Using Multiple Similarity Measurements. In Proceedings of the 2nd Joint Conference on Lexical and Computational Semantics, pages 124–131, Atlanta, GA, USA.

Appendix A

Human Annotations

In Chapter 4, we introduced common evaluation datasets for both an intrinsic and an extrinsic evaluation. In this appendix, we report the detailed annotation results for a number of re-rating studies which we conducted. In Appendix A.1, we report details on the studies for the intrinsic evaluation datasets, and in Appendix A.2 on those for the extrinsic evaluation datasets.

A.1 Intrinsic Evaluation

We conducted two re-rating studies for the evaluation datasets 30 Sentence Pairs (Li et al., 2006) and 50 Short Texts (Lee et al., 2005), which we introduced in Section 4.1.1. Our goal was to evaluate whether the human judgments that come with the datasets are stable across time and subjects. In the following, we therefore report the detailed annotation results for both datasets and compare them with the original text similarity scores.1

A.1.1 30 Sentence Pairs

We asked 10 human subjects per text pair in a crowdsourcing setting2 “How close do these sentences come to meaning the same thing?”, the same question as in the original study by Li et al. (2006). In Table A.1, we present the original text similarity scores as reported by Li et al. (2006) along with the averaged ratings by our subjects. We report both scores in a [0, 5] interval where 0 means not similar at all and 5 is perfectly similar. For better comparison, we scaled the original scores from a [0, 4] to the [0, 5] interval. From the high correlation (Spearman’s rank correlation ρ = 0.91), we conclude that the human judgments are indeed stable across time and subjects. Further details are reported in Section 4.1.1.

1Download available at http://www.ukp.tu-darmstadt.de/data/text-similarity
2Amazon Mechanical Turk (http://mturk.com) via CrowdFlower (http://crowdflower.com)


Table A.1 Human annotations of the 30 Sentence Pairs dataset compared with the original text similarity scores reported by Li et al. (2006)

ID  Humans  Li et al. (2006)     ID  Humans  Li et al. (2006)
 1   1.67   0.05                 51   3.70   0.69
 5   1.60   0.03                 52   4.18   2.43
 9   1.82   0.03                 53   4.60   2.41
13   2.67   0.54                 54   4.33   1.80
17   2.73   0.24                 55   3.50   2.03
21   2.55   0.21                 56   4.55   2.94
25   2.00   0.33                 57   4.30   3.14
29   2.08   0.06                 58   4.08   2.95
33   2.70   0.73                 59   4.58   4.31
37   2.20   0.65                 60   4.50   2.90
41   2.58   1.41                 61   3.75   2.61
47   3.82   1.74                 62   4.90   3.86
48   4.36   1.78                 63   4.82   2.79
49   3.00   1.46                 64   5.00   4.78
50   3.92   2.35                 65   4.73   3.26

A.1.2 50 Short Texts

In this study, we selected a subset of 50 text pairs from the dataset by Lee et al. (2005). We selected the pairs in a way that they have a uniform distribution of judgments across the full similarity range. We then asked 3 human annotators the question “How similar are the given texts?”. In Table A.2, we report the scores originally reported by Lee et al. (2005) as well as the ratings by the three subjects, both within a [1, 5] interval where 1 is not similar at all and 5 is perfectly similar. Again, the resulting Spearman correlation between the aggregated results of the annotators and the original scores is high (ρ = 0.88) and thus shows that judgments are stable across time and subjects. Further details on the 50 Short Texts dataset are reported in Section 4.1.1.

Table A.2 Human annotations of a subset of the 50 Short Texts dataset compared with the original text similarity scores reported by Lee et al. (2005)

ID 1  ID 2  Lee et al. (2005)  Rater 1  Rater 2  Rater 3
1  6  2.50  1  2  2
1  14  5.00  4  4  4
1  32  2.40  1  4  1
1  33  4.80  4  3  4
2  32  1.50  1  2  1
2  45  1.10  1  2  1
3  16  3.40  1  2  1
4  8  3.90  2  4  4

Table A.2 Human annotations of a subset of the 50 Short Texts dataset compared with the original text similarity scores reported by Lee et al. (2005) (continued)

ID 1  ID 2  Lee et al. (2005)  Rater 1  Rater 2  Rater 3
4  10  3.10  2  2  3
4  12  3.60  2  4  3
4  14  1.30  1  1  1
4  16  3.70  2  3  3
4  31  1.60  1  1  1
4  36  4.00  1  3  4
4  44  1.90  1  2  1
5  42  3.80  3  5  2
6  39  2.00  1  4  1
7  24  2.44  1  1  1
7  27  2.70  2  1  1
7  31  1.00  1  1  1
7  36  3.90  2  3  4
8  11  3.10  1  3  3
8  21  4.70  5  4  3
8  46  2.20  1  2  1
9  47  4.60  3  4  2
10  18  1.82  1  1  1
10  28  2.82  1  4  2
11  21  3.50  1  3  2
11  42  4.10  4  2  5
12  16  4.25  3  5  4
12  36  3.60  2  2  5
14  33  4.90  3  4  4
14  45  1.90  1  3  2
18  20  3.33  2  5  3
18  41  1.27  1  2  1
20  37  4.50  3  4  4
21  27  3.00  1  2  1
21  36  3.20  1  2  3
22  37  1.50  1  1  1
23  36  2.90  1  1  1
24  36  2.64  1  1  2
25  26  5.00  5  4  5
26  29  2.10  1  2  1
30  31  2.33  1  3  1
32  45  4.22  2  3  3
32  50  4.60  4  4  4
37  46  1.73  1  2  1
40  43  4.33  2  2  3
40  46  1.44  1  1  1
42  44  2.90  2  2  2
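The comparison reported in this section can be reproduced along the following lines. The sketch below is not the thesis implementation; it assumes SciPy is available, and the helper names and score lists are illustrative placeholders rather than the actual data from Tables A.1 and A.2.

# Minimal sketch, not the thesis code: correlate original scores with
# averaged re-ratings as described in Section A.1. Requires SciPy.
from scipy.stats import spearmanr

def rescale(scores, old_max=4.0, new_max=5.0):
    # Map scores from [0, old_max] to [0, new_max]; done for Table A.1 only
    # for comparability of the reported values (rank correlation is unaffected).
    return [s * new_max / old_max for s in scores]

def average_ratings(ratings_per_item):
    # Aggregate the ratings of several subjects for the same item by averaging.
    return [sum(r) / len(r) for r in ratings_per_item]

original = [0.05, 2.43, 4.78]                  # placeholder original scores
re_rated = [[1, 2, 2], [4, 5, 4], [5, 5, 5]]   # placeholder subject ratings

rho, _ = spearmanr(rescale(original), average_ratings(re_rated))
print(f"Spearman's rho = {rho:.2f}")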

A.2 Extrinsic Evaluation

The Wikipedia Rewrite Corpus (Clough and Stevenson, 2011) and the METER Corpus (Gaizauskas et al., 2001), which we used for an extrinsic evaluation, contain text pairs along with human classifications of the degree of text reuse (see Chapter 7). As each text of these datasets was written by only a single person for a given text reuse category, we conducted an annotation study in which we were mostly interested in the inter-rater agreement of the subjects. As reported in Section 7.4, we asked 3 participants to rate the degree of text reuse and provided them with the original annotation guidelines. In the following, we present the detailed results of the annotation studies.3 Further analyses and the chance-corrected inter-rater agreements can be found in Section 7.4.
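As an illustration of a chance-corrected agreement coefficient for such categorical annotations, the following minimal sketch computes Fleiss' kappa for three raters. Fleiss' kappa is only one common choice and not necessarily the coefficient reported in Section 7.4; the function name and the label lists are illustrative placeholders rather than the data from Tables A.3 and A.4.

# Minimal sketch, not the thesis code: Fleiss' kappa for nominal labels.
from collections import Counter

def fleiss_kappa(item_labels):
    # item_labels: one list of labels per item, one label per rater.
    categories = sorted({label for labels in item_labels for label in labels})
    n_items = len(item_labels)
    n_raters = len(item_labels[0])
    counts = [Counter(labels) for labels in item_labels]

    # Observed agreement per item and overall proportion of each category.
    p_i = [(sum(c[cat] ** 2 for cat in categories) - n_raters)
           / (n_raters * (n_raters - 1)) for c in counts]
    p_cat = {cat: sum(c[cat] for c in counts) / (n_items * n_raters)
             for cat in categories}

    p_observed = sum(p_i) / n_items
    p_expected = sum(p ** 2 for p in p_cat.values())
    return (p_observed - p_expected) / (1 - p_expected)

ratings = [["non", "non", "non"],      # placeholder items, three raters each
           ["cut", "cut", "cut"],
           ["heavy", "light", "light"]]
print(f"Fleiss' kappa = {fleiss_kappa(ratings):.2f}")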

A.2.1 Wikipedia Rewrite Corpus

The rewritten texts that are contained in the Wikipedia Rewrite Corpus are answers to 5 questions about topics of computer science (see Section 4.2.1 for details). The texts are named according to these questions (ID ending in taskn, n = a, ..., e) and were paired with the original question (named orig_taskn in the original dataset). The possible rewrite labels are no plagiarism (non), heavy revision (heavy), light revision (light), and cut & paste (cut). The original classification is reported in the column Gold. The results of this study show that the annotators disagree mostly on the light revision and heavy revision classes. We report details on the analysis in Section 7.4.1.

Table A.3 Human annotations of the Wikipedia Rewrite Corpus compared with the original classifications reported by Clough and Stevenson (2011)

ID Gold Rater 1 Rater 2 Rater 3
g0pA_taska non non non non
g0pA_taskb cut cut cut cut
g0pA_taskc light heavy light light
g0pA_taskd heavy non heavy heavy
g0pA_taske non non non non
g0pB_taska non non non non
g0pB_taskb non non non non
g0pB_taskc cut non heavy light
g0pB_taskd light light non light
g0pB_taske heavy heavy heavy heavy
g0pC_taska heavy non heavy heavy
g0pC_taskb non non non non
g0pC_taskc non non non non
g0pC_taskd cut cut cut light
g0pC_taske light light light light
g0pD_taska cut cut heavy cut
g0pD_taskb light light light heavy

3Download available at http://www.ukp.tu-darmstadt.de/data/text-similarity

Table A.3 Human annotations of the Wikipedia Rewrite Corpus compared with the original classifications reported by Clough and Stevenson (2011) (continued)

ID Gold Rater 1 Rater 2 Rater 3
g0pD_taskc heavy heavy heavy heavy
g0pD_taskd non non non non
g0pD_taske non non non non
g0pE_taska light cut cut cut
g0pE_taskb heavy non cut cut
g0pE_taskc non non non non
g0pE_taskd non non non non
g0pE_taske cut cut cut cut
g1pA_taska non non non non
g1pA_taskb heavy non heavy heavy
g1pA_taskc light non light heavy
g1pA_taskd cut heavy non cut
g1pA_taske non non non non
g1pB_taska non non non non
g1pB_taskb non non non non
g1pB_taskc heavy heavy light heavy
g1pB_taskd light non non heavy
g1pB_taske cut non light cut
g1pD_taska light heavy heavy heavy
g1pD_taskb cut non cut light
g1pD_taskc non non non non
g1pD_taskd non non non non
g1pD_taske heavy non non heavy
g2pA_taska non non non non
g2pA_taskb heavy heavy non light
g2pA_taskc light light cut light
g2pA_taskd cut cut heavy non
g2pA_taske non non non non
g2pB_taska non non non non
g2pB_taskb non non non heavy
g2pB_taskc heavy non heavy heavy
g2pB_taskd light non cut light
g2pB_taske cut cut cut cut
g2pC_taska cut non cut cut
g2pC_taskb non non non non
g2pC_taskc non non non non
g2pC_taskd heavy non heavy cut
g2pC_taske light non heavy heavy
g2pE_taska heavy heavy heavy heavy
g2pE_taskb light light light light
g2pE_taskc cut non non non
g2pE_taskd non non non non
g2pE_taske non non non non
g3pA_taska non non heavy non

Table A.3 Human annotations of the Wikipedia Rewrite Corpus compared with the original classifications reported by Clough and Stevenson (2011) (continued)

ID Gold Rater 1 Rater 2 Rater 3
g3pA_taskb heavy light heavy light
g3pA_taskc light light non light
g3pA_taskd cut cut light cut
g3pA_taske non non heavy heavy
g3pB_taska non non non heavy
g3pB_taskb non non non non
g3pB_taskc heavy non non heavy
g3pB_taskd light non light light
g3pB_taske cut light cut cut
g3pC_taska cut cut cut cut
g3pC_taskb non non non non
g3pC_taskc non non non non
g3pC_taskd heavy heavy heavy light
g3pC_taske light light light light
g4pB_taska non non non non
g4pB_taskb non non non non
g4pB_taskc heavy non cut cut
g4pB_taskd light heavy cut light
g4pB_taske cut cut cut cut
g4pC_taska cut cut cut cut
g4pC_taskb non non non non
g4pC_taskc non non non non
g4pC_taskd heavy light light cut
g4pC_taske light light light cut
g4pD_taska light heavy light non
g4pD_taskb cut non non non
g4pD_taskc non non non non
g4pD_taskd non non non non
g4pD_taske heavy heavy heavy light
g4pE_taska heavy non heavy non
g4pE_taskb light cut heavy cut
g4pE_taskc cut non cut cut
g4pE_taskd non non non non
g4pE_taske non heavy non non

A.2.2 METER Corpus

Several of the texts contained in the METER Corpus have more than a single UK Press Association source, so that it is unclear which (if not all) of the source stories have been used to generate the rewritten story. However, as discussed in Section 4.2.1, for text similarity computation it is important to have aligned pairs of reused texts and source texts. Therefore, Sánchez-Vega et al. (2010) proposed to select only a subset of text pairs where only a single source story is present in the dataset. This leaves 253 pairs of short texts which we used for evaluation. The texts were paired with their respective single source story and annotated by 3 subjects in our annotation study. In the following, we present the results of this study. We use a binary classification into reused (R) and non-reused (N) texts. The original classification is reported in column G. The results show that the annotators fully agree for 61% of all texts. Details on the results for this dataset can be found in Section 7.4.2.
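The full-agreement figure quoted above corresponds to a simple check over the three labels assigned to each text pair. The sketch below is a hypothetical illustration with placeholder labels rather than the actual annotations from Table A.4.

# Minimal sketch, not the thesis code: fraction of items on which all raters
# chose the identical reused (R) / non-reused (N) label.
def full_agreement(item_labels):
    agree = sum(1 for labels in item_labels if len(set(labels)) == 1)
    return agree / len(item_labels)

ratings = [["R", "R", "N"], ["N", "N", "N"], ["R", "R", "R"]]  # placeholders
print(f"Full agreement on {full_agreement(ratings):.0%} of the items")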

Table A.4 Human annotations of the METER Corpus compared with the original classifications reported by Gaizauskas et al. (2001)

ID G R1 R2 R3
showbiz/27.09.99/millenium/millenium116_mirror R R R N
courts/12.08.99/traveller/traveller188_telegraph N N N N
showbiz/14.12.99/panto/panto168_star R N N N
showbiz/12.08.99/blackadder/blackadder71_mail R N N N
showbiz/07.01.00/depp/depp146_star R N N N
courts/02.11.99/wight/wight287_times R R R R
showbiz/12.08.99/blackadder/blackadder66_star N N N N
courts/26.01.00/war/war697_independent R R R R
courts/14.07.99/fowler/fowler203_star R R R R
showbiz/21.06.00/gallagher/gallagher98_sun R N N N
showbiz/10.08.99/enfield/enfield50_express R N N N
showbiz/27.01.00/eastenders/eastenders130_star R N N N
courts/16.07.99/baby/baby85_star N R N N
courts/22.11.99/rubbish/rubbish623_star R R R R
showbiz/08.08.99/shaw/shaw10_mail N R R N
showbiz/12.08.99/blackadder/blackadder73_express N N N N
showbiz/08.08.99/shaw/shaw6_star R R R R
courts/12.08.99/granny/granny190_telegraph R N N R
showbiz/12.08.99/blackadder/blackadder60_sun R N N N
courts/17.04.00/informant/informant755_times R R N R
courts/17.04.00/nursery/nursery757_telegraph N R N N
showbiz/14.12.99/united/united172_times R N R N
courts/02.11.99/chips/chips274_mirror N N R N
courts/12.07.99/walker/walker9_mirror R R N N
showbiz/07.01.00/oasis/oasis148_star R N N N
showbiz/17.04.00/holden/holden111_mirror R R R R
showbiz/21.06.00/winslet/winslet107_guardian R R R R
courts/21.02.00/home/home512_star N R N N
courts/08.12.99/chief/chief372_independent R R R R
courts/28.09.99/head/head588_guardian R R R R
courts/12.08.99/traveller/traveller185_times N N N N
courts/08.12.99/chief/chief349_times R R R R
courts/12.07.99/policeman/policeman38_times R R R R
courts/04.10.99/raider/raider266_independent R R R R
showbiz/09.08.99/stretch/stretch25_star N N N R
courts/16.07.99/fan/fan80_star R R N R

Table A.4 Human annotations of the METER Corpus compared with the original classifications reported by Gaizauskas et al. (2001) (continued)

ID G R1 R2 R3
showbiz/12.08.99/spice/spice62_sun R R R N
courts/17.04.00/farmer/farmer758_telegraph N N R N
courts/22.11.99/babies/babies620_mirror R N R N
courts/09.08.99/royal/royal152_star N R N N
courts/09.08.99/porn/porn162_telegraph R R R N
courts/07.10.99/punch/punch248_times R R R R
courts/26.01.00/driver/driver668_sun R R R R
courts/15.03.00/baby/baby411_guardian R R R N
courts/01.03.00/francis/francis386_mail R R N N
courts/17.04.00/nursery/nursery749_mail N N N R
courts/22.11.99/salesman/salesman637_times R R R R
courts/12.08.99/traveller/traveller171_sun R N N R
courts/01.03.00/francis/francis393_guardian R N N N
showbiz/27.09.99/beegees/beegees119_times R R R R
courts/16.07.99/torture/torture83_star R R R R
courts/21.02.00/mp/mp527_independent R R R R
showbiz/12.08.99/spice/spice68_star N N N N
courts/02.11.99/chips/chips295_telegraph N R N N
showbiz/07.01.00/street/street149_star R R R R
courts/27.01.00/drink/drink716_times R R R R
showbiz/17.04.00/holden/holden112_star R R R R
courts/18.02.00/stakeout/stakeout550_telegraph R R R R
courts/26.01.00/bahamas/bahamas701_times N N N N
courts/12.07.99/policeman/policeman47_telegraph R N R N
courts/22.11.99/rubbish/rubbish657_independent R R R R
courts/04.10.99/dj/dj265_independent R R R R
courts/14.12.99/boy/boy420_star N N N N
showbiz/10.08.99/enfield/enfield35_sun R R N N
courts/01.03.00/wagstaff/wagstaff391_telegraph R R R R
courts/03.04.00/harass/harass486_times R R R N
courts/03.04.00/husband/husband497_guardian R R R R
courts/21.02.00/home/home506_sun N N N N
courts/16.07.99/banker/banker125_telegraph R R R R
courts/22.11.99/pool/pool611_sun R R R N
courts/27.01.00/damages/damages732_sun R R N N
courts/15.03.00/shooting/shooting404_mail N N N N
courts/18.02.00/abuse/abuse551_telegraph N N R R
courts/27.01.00/drink/drink728_independent R R R R
showbiz/14.12.99/united/united166_mirror R N R N
courts/17.04.00/informant/informant766_guardian R R R R
courts/09.08.99/porn/porn145_sun R R N R
courts/18.02.00/cars/cars530_mirror N N N N
showbiz/14.12.99/panto/panto163_sun R R N N
showbiz/22.11.99/christmas/christmas161_independent R R N N

Table A.4 Human annotations of the METER Corpus compared with the original classifications reported by Gaizauskas et al. (2001) (continued)

ID G R1 R2 R3
courts/12.07.99/lovers/lovers55_guardian R R R R
courts/07.10.99/hamilton/hamilton251_telegraph R R N N
courts/27.01.00/drink/drink708_mirror R R R R
courts/14.12.99/boy/boy422_mail N N N N
showbiz/09.08.99/casualty/casualty21_mirror R R R N
courts/12.07.99/police/police53_guardian R R R R
courts/12.08.99/traveller/traveller181_express R R R R
courts/27.09.99/pinochetcharge/pinochetcharge609_guardian N R N R
courts/01.03.00/francis/francis378_mirror R N N N
courts/16.07.99/transplant/transplant97_express N N N N
courts/21.02.00/mp/mp525_guardian R R R R
showbiz/13.08.99/spice/spice80_star R R N R
courts/17.04.00/vanessa/vanessa756_times R R R R
courts/12.07.99/walker/walker48_telegraph R R R N
courts/09.08.99/royal/royal159_times R R R R
courts/17.04.00/nursery/nursery765_guardian R R R R
showbiz/22.11.99/cilla/cilla157_mirror R N N N
courts/03.04.00/fraud/fraud496_guardian R R R N
showbiz/17.04.00/holden/holden109_sun R N N N
courts/01.03.00/wagstaff/wagstaff376_sun R N N N
courts/22.11.99/father/father644_telegraph R N N N
showbiz/07.01.00/posh/posh145_star R N N N
showbiz/12.08.99/street/street72_express N R N N
courts/07.10.99/knickers/knickers257_telegraph R R R R
courts/09.08.99/royal/royal146_sun R R R R
courts/16.07.99/transplant/transplant82_star R N N N
courts/27.09.99/pinochetcharge/pinochetcharge608_telegraph N N R R
courts/26.01.00/driver/driver675_mail N N R N
showbiz/13.08.99/elton/elton78_star R R R R
showbiz/07.01.00/street/street150_express R N R R
courts/02.11.99/chips/chips277_star N R R R
showbiz/12.08.99/blackadder/blackadder63_mirror N N N N
courts/07.01.00/soldier/soldier566_telegraph R R R R
courts/01.03.00/francis/francis385_express R N R R
courts/02.11.99/chips/chips271_sun N N N N
courts/07.01.00/soldier/soldier562_times N R N N
courts/17.04.00/vanessa/vanessa760_telegraph R R N N
courts/17.04.00/nursery/nursery753_times N N N R
showbiz/22.11.99/daytime/daytime154_sun R R R R
showbiz/10.08.99/aled/aled38_mirror R R R R
showbiz/09.08.99/stretch/stretch27_mail N N N N
showbiz/21.06.00/beckham/beckham106_times R N R N
courts/12.08.99/traveller/traveller195_guardian R R R R
courts/09.08.99/royal/royal148_mirror R R N R

Table A.4 Human annotations of the METER Corpus compared with the original classifications reported by Gaizauskas et al. (2001) (continued)

ID G R1 R2 R3
showbiz/08.08.99/shaw/shaw1_sun R R R R
courts/21.02.00/home/home522_telegraph N N R N
courts/01.03.00/francis/francis394_guardian R N N N
courts/21.06.00/broadmoor/broadmoor787_times R R R N
courts/17.04.00/vanessa/vanessa771_independent R R R R
courts/18.02.00/abuse/abuse537_express N R R N
showbiz/10.08.99/enfield/enfield48_mail N N R R
showbiz/09.08.99/stretch/stretch22_mirror N N R R
courts/02.11.99/chips/chips301_guardian N R R R
courts/28.09.99/head/head593_independent R R R R
courts/02.11.99/chips/chips288_times N R R N
courts/21.06.00/blast/blast781_express R R R N
courts/01.03.00/francis/francis380_star R N N N
courts/17.04.00/farmer/farmer763_guardian R R R N
courts/27.01.00/drink/drink720_telegraph R R R R
courts/03.04.00/fraud/fraud480_times R R R R
showbiz/27.01.00/webber/webber136_telegraph R R N R
showbiz/13.08.99/spicefestival/spicefestival79_star R R R R
courts/04.10.99/raider/raider253_telegraph R R R N
courts/18.02.00/presenter/presenter547_telegraph R R R R
courts/29.03.00/lavin/lavin435_sun R R N R
showbiz/21.06.00/madeley/madeley100_mirror R N N R
courts/17.04.00/nursery/nursery740_star N N N R
courts/04.10.99/zoo/zoo240_times R R R R
courts/21.02.00/sewage/sewage518_times N R R N
courts/29.03.00/ruling/ruling451_telegraph N R N N
courts/26.01.00/driver/driver679_times R R R R
courts/07.10.99/hamilton/hamilton259_guardian R R R R
courts/07.01.00/soldier/soldier559_mirror N N N N
showbiz/21.06.00/madeley/madeley99_sun R N R N
showbiz/09.08.99/stretch/stretch18_sun N R N N
courts/07.10.99/knickers/knickers234_star R R R R
showbiz/07.01.00/depp/depp143_sun R N N N
courts/27.01.00/drink/drink726_guardian R R R R
showbiz/10.08.99/enfield/enfield53_times N N N N
courts/21.02.00/sewage/sewage523_telegraph R R R R
courts/17.04.00/nursery/nursery745_express R R R R
showbiz/27.09.99/lottery/lottery115_sun R R R N
showbiz/12.08.99/street/street67_star N N N N
courts/17.04.00/informant/informant769_independent R R R R
showbiz/12.08.99/street/street64_mirror N N N N
courts/12.07.99/lovers/lovers6_sun R R R R
courts/12.07.99/lovers/lovers22_star R R R R
courts/07.10.99/hamilton/hamilton256_telegraph N R N N

Table A.4 Human annotations of the METER Corpus compared with the original classifications reported by Gaizauskas et al. (2001) (continued)

ID G R1 R2 R3
courts/12.07.99/walker/walker30_express R R R R
showbiz/14.12.99/christmas/christmas167_star R N N N
courts/16.07.99/transplant/transplant116_telegraph R R R R
courts/04.10.99/raider/raider227_star R N R R
courts/14.12.99/boy/boy427_telegraph N R N N
courts/03.04.00/harass/harass492_telegraph R R R R
courts/17.04.00/nursery/nursery768_independent R R R R
courts/18.02.00/cars/cars545_telegraph N N R N
showbiz/13.08.99/spice/spice77_sun N N N N
showbiz/07.01.00/oasis/oasis142_sun R N R N
showbiz/09.08.99/saints/saints20_sun R R R R
showbiz/14.12.99/christmas/christmas164_sun R R N N
courts/04.10.99/doctor/doctor233_express N N N N
showbiz/12.08.99/blackadder/blackadder76_independent N N N N
courts/27.01.00/drink/drink704_sun R R R N
courts/17.04.00/vanessa/vanessa767_guardian R R R R
courts/18.02.00/cab/cab548_telegraph N R N N
showbiz/21.06.00/beckham/beckham108_independent R N N N
courts/29.03.00/ruling/ruling447_times N N N N
courts/14.07.99/lara/lara222_telegraph R R R R
courts/28.09.99/brothel/brothel572_star R R R R
courts/21.06.00/blast/blast777_mirror N N N N
courts/15.03.00/baby/baby400_mirror R R R R
showbiz/27.01.00/lamarr/lamarr123_sun R R R N
courts/15.03.00/baby/baby413_independent R N R N
courts/14.12.99/inqbryant/inqbryant418_mirror N N N N
courts/12.08.99/traveller/traveller179_mail N R N N
courts/01.03.00/francis/francis390_telegraph R R R R
courts/16.07.99/baby/baby118_telegraph N N R N
courts/09.08.99/bennett/bennett167_telegraph R R R N
courts/14.07.99/lara/lara223_guardian R R R R
courts/14.12.99/boy/boy428_guardian R R R R
courts/17.04.00/farmer/farmer739_mirror N R R N
courts/27.01.00/damages/damages722_telegraph R R R R
showbiz/14.12.99/united/united178_independent R N R N
courts/12.07.99/silverstone/silverstone37_times R R N N
showbiz/21.06.00/fairbrass/fairbrass103_star R R R N
showbiz/21.06.00/madeley/madeley102_star R R R R
showbiz/07.01.00/street/street141_sun R R R N
showbiz/12.08.99/street/street59_sun N N N R
courts/21.02.00/sewage/sewage509_mirror N N N N
courts/21.02.00/sewage/sewage515_express N N R N
courts/26.01.00/blood/blood681_times R R N R
showbiz/21.06.00/steps/steps101_mirror R N R N

Table A.4 Human annotations of the METER Corpus compared with the original classifications reported by Gaizauskas et al. (2001) (continued)

ID G R1 R2 R3
courts/08.12.99/city/city373_independent R R R R
courts/01.03.00/wagstaff/wagstaff384_express R R R R
courts/02.11.99/chips/chips306_independent N R R R
courts/09.08.99/royal/royal160_telegraph R R R N
courts/18.02.00/abuse/abuse555_independent R R R R
showbiz/21.06.00/steps/steps105_star R R R R
courts/17.04.00/vanessa/vanessa750_mail R R R R
courts/14.12.99/boy/boy430_independent R R R R
courts/01.03.00/football/football382_star R R R N
courts/01.03.00/wagstaff/wagstaff381_star R R R R
courts/02.11.99/chips/chips282_mail N R R R
courts/02.11.99/chips/chips286_express N R R R
courts/18.02.00/cars/cars535_star N N N N
courts/17.04.00/nursery/nursery737_mirror N N N N
courts/16.07.99/torture/torture72_mirror R R R R
showbiz/27.09.99/beegees/beegees120_telegraph R R R R
courts/01.03.00/francis/francis375_sun R N N N
showbiz/12.08.99/blackadder/blackadder74_guardian R R R R
courts/17.04.00/informant/informant762_telegraph N N R N
courts/28.09.99/brothel/brothel580_times R R R R
showbiz/09.08.99/saints/saints23_mirror R R R R
courts/14.12.99/australia/australia417_mirror N N N N
showbiz/12.08.99/spice/spice65_mirror R R R R
courts/12.08.99/granny/granny186_times R R N N
courts/04.10.99/pensioners/pensioners245_telegraph R R R R
courts/09.08.99/camera/camera157_times R R R R
showbiz/27.09.99/spice/spice118_star R N N N
showbiz/12.08.99/street/street70_mail N N N N
courts/12.07.99/strangle/strangle18_star R R R R
courts/12.07.99/policeman/policeman16_star R R R R
showbiz/14.07.99/thomas/thomas86_sun R R N N
courts/17.04.00/farmer/farmer743_star R N R N
showbiz/21.06.00/gallagher/gallagher97_sun R N R N
showbiz/27.09.99/spicemel/spicemel117_star R N N N
courts/01.03.00/francis/francis388_times R R N N
courts/04.10.99/pensioners/pensioners236_express R R R R
courts/29.03.00/ruling/ruling456_independent N N N N
courts/07.01.00/giggs/giggs560_mirror R R R R
courts/01.03.00/diver/diver396_guardian R R R R
courts/12.08.99/traveller/traveller176_star R R N N
showbiz/07.01.00/street/street153_telegraph R R R R