Open Repository and Archive
University of Zurich
Main Library
Strickhofstrasse 39
CH-8057 Zurich
www.zora.uzh.ch

Year: 2006

Document clustering in large German corpora using Natural Language Processing

Forster, Richard

Abstract: Seitdem die Computer und das Internet in unseren Alltag getreten sind, hat die Informationsmenge, zu der wir theoretisch Zugang haben, exponentiell zugenommen. Eine Methode, um diese gewaltige Datenflut zu bewältigen, ist die Clusteranalyse, mit der grosse unstrukturierte Textmengen in Haufen von miteinander verwandten Dokumenten unterteilt werden können. Text-Clustering besteht aus zwei grundlegenden Schritten: der Text-Repräsentation und dem Clustering. Trotz umfangreicher Literatur zur Clusteranalyse fehlt ein eigenständiges Lehrbuch zum Text-Clustering, weshalb der erste Teil dieser Arbeit einer systematischen Übersicht über die Cluster-Algorithmen und die geläufigen Text-Repräsentationsmethoden gewidmet ist. Anschliessend wird ein Schema zur Klassifikation von Text-Clustering-Anwendungen eingeführt, das sich an den zeitkritischen Komponenten orientiert. Der zweite Teil untersucht die Verwendung Natürlichsprachlicher Datenverarbeitung (Natural Language Processing - NLP) bei der Text-Repräsentation. Zu diesem Zweck werden fünf grosse deutsche Korpora zusammengestellt und beschrieben. NLP-Techniken aller Art werden über den fünf Sammlungen zur Anwendung gebracht und evaluiert. Es zeigt sich, dass der Erfolg vieler NLP-Methoden vom jeweiligen Korpus abhängt, wofür hypothetische Erklärungen formuliert werden. Insgesamt sprechen die Ergebnisse sowohl für wie wider den Einsatz von NLP. Für die Mehrheit der untersuchten Fälle kann jedoch ein deutliches Verbesserungspotential durch Natürlichsprachliche Datenverarbeitungsmethoden gezeigt werden.

Ever since the advent of computer systems and, in particular, the Internet, the amount of information theoretically at our disposal has been increasing exponentially. One way to deal with the extraordinary flood of data is cluster analysis. It is used here to divide large unstructured document corpora into groups of more or less closely related documents. Document clustering consists of two fundamental stages: document representation and clustering. Despite a number of detailed textbooks on cluster analysis in general, no such work seems to have been carried out on the specific needs of document clustering. The first part of the thesis is therefore dedicated to comprehensive surveys of existing clustering algorithms and document representation techniques. In addition, a scheme is presented for classifying different clustering applications in accordance with their time-criticality. The second part of the thesis is devoted to an evaluation of Natural Language Processing (NLP) as a means of improving the document representations. To this end, five large German data sets have been compiled and described. NLP techniques ranging from the very simple to complex syntactic and semantic models were evaluated on these five data sets. It emerges that the success of many NLP representation techniques depends on the data under consideration, for which a hypothetical explanation is offered. All in all, evidence is found both pro and contra Natural Language Processing. For the majority of individual cases, distinct potential for improvement through NLP can be shown.

Posted at the Zurich Open Repository and Archive, University of Zurich
ZORA URL: https://doi.org/10.5167/uzh-163398
Dissertation
Published Version

Originally published at: Forster, Richard. Document clustering in large German corpora using Natural Language Processing. 2006, University of Zurich, Faculty of Arts.

Document Clustering in Large German Corpora Using Natural Language Processing

Thesis

presented to the Faculty of Arts

of the University of Zurich

for the degree of Doctor of Philosophy

by

Richard Forster

of Neunkirch (SH)

Accepted on the recommendation of

Professor Dr. Michael Hess and

Professor Abraham Bernstein Ph.D.

2006


Abstract

Ever since the advent of computer systems and, in particular, the Internet, the amount of information theoretically at our disposal has been increasing exponentially. This phenomenal growth in data is not only a blessing: the more information is available, the more difficult it becomes to find one's way to the particular piece of information of interest. As a consequence, investigations into old and new techniques for dealing with the extraordinary flood of data remain topical for information science.

Cluster analysis has a rich and independent history of its own. Relatively recently it has acquired two new application areas in the fields of Information Retrieval and Data Mining. Clustering is used here to divide large unstructured document corpora into groups of more or less closely related documents. The clusters can then be used as a well-arranged interface to a potentially huge and overwhelming number of documents, allowing a prospective user to home in quickly on his specific requirements.

Document clustering consists of two fundamental stages: document representation (the transformation of documents as linear strings of words into suitable data structures) and clustering (the algorithmic grouping of these representations). Despite a number of detailed textbooks on cluster analysis in general, no such work seems to have been carried out on the specific needs of document clustering. Special attention is therefore paid to a comprehensive introduction. The first part of the thesis is dedicated to systematic surveys of existing clustering algorithms (with emphasis on those used for documents) and document representation techniques as encountered in practice. Particular care has been taken with the presentation of a uniform notation since the cluster analysis literature is notoriously rich in notations and multiple names for, often, one and the same concept. In addition, a scheme is presented for classifying different clustering applications in accordance with their time-criticality.

The second part of the thesis is devoted to an evaluation of Natural Language Processing (NLP) as a means of improving the document representations. More generally, the goal has been to help answer the old key question of whether or not the extra effort usually required for sophisticated NLP applications is rewarded by sufficient extra benefits; document clustering provides one typical battle-ground. To this end, five large German data sets have been compiled and described. Each consists of several thousand documents and each was extracted from a specific source (two news outlets, two book services and one encyclopaedic data base). In each data set the documents carried separate content labels derived from meta-information provided by the original source. Serving as “objective truths”, these labels were used to evaluate the performance of different clustering algorithms and document representations.

NLP techniques ranging from the very simple to complex syntactic and semantic models were then tested and evaluated on these five data sets. The techniques were divided into two groups: those aiming at a reduction of the representation complexity (with the twofold goal of achieving qualitative improvements and quantitative savings) and those aiming to enhance document representation with extra features (the goal now only being to improve the results through a more refined representation). Separately, there followed an evaluation of how well the various techniques worked together.
The thesis ends with an interpretation of the results, a discussion of the virtues shown by NLP methods in this particular domain and an overview of future research areas. It emerges that the success of many NLP representation techniques depends on the data under consideration, for which a hypothetical explanation is offered. All in all, evidence is found both pro and contra Natural Language Processing. For the majority of individual cases, distinct potential for improvement through NLP can be shown.

Zusammenfassung

Seitdem die Computer und das Internet in unseren Alltag getreten sind, hat die Informationsmenge, zu der wir theoretisch Zugang haben, exponentiell zugenommen. Dieses Wachstum ist nicht nur ein Segen; denn je mehr Informationen uns zur Verfügung stehen, desto schwieriger wird es, genau die gewünschte Einzelinformation zu finden. Die Erforschung von alten und neuen Strategien zur Bewältigung dieser Datenflut steht deshalb noch immer im Zentrum der Informationswissenschaften.

Die sogenannte Clusteranalyse blickt auf eine abwechslungsreiche Geschichte zurück, doch wurde sie erst vor relativ kurzem für das “Information Retrieval” und “Data Mining” entdeckt. Sie wird hier dazu gebraucht, grosse unstrukturierte Textmengen in Gruppen oder Haufen (“Clusters”) von mehr oder weniger stark miteinander verwandten Dokumenten zu unterteilen. Daraus lassen sich einfache und übersichtliche Schnittstellen zu potentiell riesigen Korpora gewinnen, die es dem Anwender erlauben, schnell zu den für ihn relevanten Texten vorzustossen.

Text-Clustering besteht aus zwei grundlegenden Schritten: der Text-Repräsentation (Umwandlung von Texten als Zeichenketten in geeignete Datenstrukturen) und dem Clustering (Analyse dieser Repräsentationen und Ordnung in Gruppen). Trotz umfangreicher Literatur zur Clusteranalyse fehlt ein eigenständiges Lehrbuch zum Text-Clustering, weshalb in der vorliegenden Arbeit besonderer Wert auf eine umfassende Einleitung gelegt wurde. Der erste Teil der Dissertation besteht aus einer systematischen Übersicht über die Vielfalt der Cluster-Algorithmen (mit Schwerpunkt auf den für das Text-Clustering interessanten Methoden) und über die geläufigen Text-Repräsentationsmethoden. Angesichts der zahlreichen Synonyme und vielfältigen Notationen in der Clusteranalyse-Literatur wird dabei speziell auf eine einheitliche Notation geachtet. Anschliessend wird ein Schema zur Klassifikation von Text-Clustering-Anwendungen eingeführt, das sich an den zeitkritischen Komponenten orientiert.

Der zweite Teil der Dissertation untersucht die Verwendung Natürlichsprachlicher Datenverarbeitung (Natural Language Processing – NLP) bei der Text-Repräsentation. Auf einer höheren Ebene geht es dabei um die Frage, ob der oft beträchtliche Zeitaufwand für NLP durch qualitative Gewinne gerechtfertigt werden kann, mit Text-Clustering als typischem Fallbeispiel.

Zu diesem Zweck werden fünf grosse Textsammlungen in deutscher Sprache zusammengestellt und beschrieben. Jede besteht aus mehreren Tausend Dokumenten und entstammt einer spezifischen Quelle (zwei Nachrichten-Korpora, zwei Büchervertriebe und eine Enzyklopädie). In jeder Sammlung werden die Dokumente anhand vorhandener Angaben bestimmten Klassen zugeordnet. Diese Klassen gelten fortan als “objektive Wahrheit” und dienen der Evaluation verschiedener Clustering-Ergebnisse.

NLP-Techniken aller Art werden über den fünf Sammlungen zur Anwendung gebracht und evaluiert. Zwei grosse Gruppen von Methoden werden hierbei unterschieden: auf der einen Seite die “Reduktionstechniken”, mit der doppelten Zielsetzung, die Datenkomplexität zu vermindern und die Clustering-Qualität zu steigern, und auf der anderen Seite die “Erweiterungstechniken”, die neue zusätzliche Textmerkmale generieren und ausschliesslich der Clustering-Qualität verpflichtet sind. Mit separaten Experimenten wird anschliessend untersucht, wie gut die verschiedenen Techniken miteinander kombiniert werden können.
Die Dissertation schliesst mit einer Interpretation der Ergebnisse, einer Diskussion des Wertes Natürlichsprachlicher Datenverarbeitung (NLP) im gegebenen Kontext und einem Ausblick auf offene Forschungsfragen. Es zeigt sich, dass der Erfolg vieler NLP-Methoden vom jeweiligen Korpus abhängt, wofür hypothetische Erklärungen formuliert werden. Insgesamt fördert die Untersuchung Ergebnisse zutage, die sowohl für wie wider den Einsatz von NLP sprechen. Für die Mehrheit der untersuchten Fälle kann jedoch ein deutliches Verbesserungspotential durch Natürlichsprachliche Datenverarbeitungsmethoden gezeigt werden.

Contents

Abstract
Zusammenfassung
Contents
List of Figures
List of Tables

1 Introduction to Document Clustering
  1.1 Motivation
  1.2 Thesis Structure
  1.3 Principal Concepts
  1.4 Purpose and Applications of Document Clustering
    1.4.1 Supporting Retrieval of Documents
    1.4.2 Supporting Presentation of IR Results
    1.4.3 Direct Access to Document Collections
  1.5 Systems in Practice
  1.6 Notation

2 Cluster Analysis
  2.1 Data Representation
    2.1.1 Vector Space Model
    2.1.2 Metric Space Model
    2.1.3 Graph Model
  2.2 Similarity Measures
    2.2.1 Inter-Object Similarity
    2.2.2 Inter-Cluster Similarity
  2.3 Non-Hierarchical Algorithms
    2.3.1 Iterative Partitional Clustering
      2.3.1.1 Cluster Criterion Function
      2.3.1.2 The Initial Partition
      2.3.1.3 Allocation of Documents to Clusters
      2.3.1.4 Stop Criterion and Post-Processing
    2.3.2 Alternative Clustering Algorithms
      2.3.2.1 Single-Pass Sequential Algorithms
      2.3.2.2 Density-Based Clustering
      2.3.2.3 Probabilistic Models
      2.3.2.4 Self-Organising Maps (Kohonen Maps)
      2.3.2.5 Genetic Algorithms
      2.3.2.6 Decision Trees


  2.4 Hierarchical Algorithms
    2.4.1 Hierarchical Agglomerative Clustering
      2.4.1.1 Cluster Criterion Function
      2.4.1.2 Local Criterion Functions
      2.4.1.3 Global Criterion Functions
      2.4.1.4 Stop Criterion
      2.4.1.5 Modifications to HAC
    2.4.2 Hierarchical Divisive Clustering
  2.5 Cluster Solutions and Properties
    2.5.1 Shape
    2.5.2 Structure
    2.5.3 Model Selection Problem
  2.6 Evaluation and Validation
    2.6.1 Approaches to Evaluation and Validation
    2.6.2 External Criteria
      2.6.2.1 Overlap Indices
      2.6.2.2 The Q0-Measure
      2.6.2.3 Distance Measures
      2.6.2.4 Precision and Recall Measures
    2.6.3 Internal Criteria
    2.6.4 Relative Criteria
    2.6.5 Ranked-List Criteria
    2.6.6 End-User Criteria
    2.6.7 Complexity and Performance

3 Document Representation
  3.1 Document Space
    3.1.1 Restricted vs. Open Topics
    3.1.2 Collection Size
    3.1.3 Document Size
    3.1.4 Language
    3.1.5 Intra- and Inter-Document Structure
    3.1.6 Document Quality
    3.1.7 Knowledge about the Document Universe
  3.2 Document Vectorisation
    3.2.1 Bag-of-Words
    3.2.2 Annotated Bag-of-Words
    3.2.3 Word Sequences
      3.2.3.1 Statistical Phrases
      3.2.3.2 Syntactic Phrases
      3.2.3.3 Other Approaches
    3.2.4 Non-Textual Features
  3.3 Feature Weighting
    3.3.1 Local Feature Weighting
    3.3.2 Global Feature Weighting
    3.3.3 Normalisation
  3.4 Feature Refinement
    3.4.1 Feature Selection
      3.4.1.1 Stopword Removal

      3.4.1.2 Pruning
      3.4.1.3 POS Selection
      3.4.1.4 Advanced Weighting
    3.4.2 Feature Standardisation
      3.4.2.1 Truncation and Stemming
      3.4.2.2 Lemmatising
      3.4.2.3 Compound Splitting
      3.4.2.4 Semantic Concepts
    3.4.3 Feature Extraction
      3.4.3.1 Double Clustering
      3.4.3.2 Latent Semantic Analysis
      3.4.3.3 Random Indexing
  3.5 Time Constraints
  3.6 Cluster Presentation
    3.6.1 Clustering Structure
    3.6.2 Cluster Description
    3.6.3 Interactive Clustering

4 Experimental Setup
  4.1 Document Data
    4.1.1 Brief Review of English Collections
    4.1.2 Five German Data Sets
      4.1.2.1 Springer Data Set
      4.1.2.2 Amazon Data Set
      4.1.2.3 Schweizerische Depeschenagentur Data Set
      4.1.2.4 Wikipedia Data Set
      4.1.2.5 Neue Zürcher Zeitung Data Set
      4.1.2.6 Summary
  4.2 Software
  4.3 Evaluation Methodology
    4.3.1 Cluster Validity
    4.3.2 Confusion Matrix and ROC Diagram
    4.3.3 Matrix Size
    4.3.4 Interpreting Multiple Experiments
  4.4 Algorithms and Parameters
    4.4.1 Choice of Algorithm
    4.4.2 Choice of Weighting Scheme
      4.4.2.1 Inverse Document Frequency Squared
      4.4.2.2 Validity of IDF Squared

5 Reduced Document Representations Using NLP
  5.1 Baseline
    5.1.1 Preparation
    5.1.2 Interpretation
  5.2 Bag-of-Lemmata
    5.2.1 POS Tagging and Lemmatising
    5.2.2 Experimental Results
  5.3 Statistical Reduction Techniques
    5.3.1 Pruning

    5.3.2 Latent Semantic Analysis
    5.3.3 Conclusions
  5.4 Stopwords
    5.4.1 Explicit Stoplists
    5.4.2 Stopword Extraction Techniques
      5.4.2.1 Self-Validation
      5.4.2.2 Cross-Validation
    5.4.3 Conclusions
  5.5 Part-of-Speech Selection
  5.6 Comparison of Matrix Reduction Techniques
  5.7 Feature Weighting Techniques
    5.7.1 Part-of-Speech Weighting
    5.7.2 Stopword Weighting
    5.7.3 Weighting Front Nouns
  5.8 Summary of Reduction Experiments and NLP

6 Enhanced Document Representations Using NLP
  6.1 Using Morphological Information
    6.1.1 Mechanical Compounds
    6.1.2 Organic Compounds
    6.1.3 Restrictive Compound Splitting
    6.1.4 Conclusions
  6.2 Using Syntactic Information
    6.2.1 Bigrams
    6.2.2 Multi-part Names
    6.2.3 Noun Phrases
    6.2.4 Conclusions
  6.3 Using Semantic Information
    6.3.1 GermaNet
    6.3.2 Word Sense Disambiguation
    6.3.3 Semantic Mapping
    6.3.4 Restrictions on Semantic Mapping
    6.3.5 Conclusions
  6.4 Summary of Enhancement Experiments and NLP

7 Combining Document Representation Techniques
  7.1 Combining Matrix Reduction Techniques
  7.2 Combining Matrix Enhancement Techniques
  7.3 Combining Matrix Reduction and Enhancement Techniques
  7.4 Conclusions

8 Summary and Outlook
  8.1 Document Representation Techniques for Clustering
  8.2 The Case of Natural Language Processing
  8.3 Future Research

A Glossary

B Proofs
  B.1 Equivalence of Euclidean and Cosine Similarity
  B.2 Minimum Variance Simplification
  B.3 Comparative Examination of Internal Cluster Criteria
  B.4 External Clustering Criteria

C Stoplists
  C.1 German
  C.2 English

D Experimental Result Tables
  D.1 Experiments in Chapter 4 (Setup)
  D.2 Experiments in Chapter 5 (Matrix Reduction)
  D.3 Experiments in Chapter 6 (Matrix Enhancement)
  D.4 Experiments in Chapter 7 (Combining)

Bibliography

Acknowledgements

Curriculum Vitae

List of Figures

1.1 Two main components of document clustering
1.2 Standard scenario of information retrieval
1.3 Results page of the Killerinfo search interface

2.1 What is a cluster?
2.2 Trivial clustering algorithm
2.3 Iterative clustering algorithm
2.4 Cluster boundary: Euclidean and cosine similarity
2.5 Two non-convex clusters
2.6 A dendrogram

3.1 Resolving power according to Luhn's curve
3.2 Sample page from the Open Directory Project
3.3 Vivísimo's Internet search interface

4.1 Categories of the springer data set
4.2 Categories of the amazon data set
4.3 Categories of the sda data set
4.4 Categories of the wiki data set
4.5 Categories of the nzz data set
4.6 POS distributions
4.7 Visual comparison of different clustering algorithms
4.8 Evaluation of Cluto's built-in weighting models
4.9 Double-weighting experiments
4.10 Different idf and idf² variants
4.11 Squared idf applied to Cluto test sets

5.1 Baseline results
5.2 Bag-of-lemmata versus bag-of-words
5.3 Global pruning with similarity preservation
5.4 Pruning with upper and lower bounds
5.5 Local pruning
5.6 Stopword removal with manually compiled stoplist
5.7 Self-validation of stopword discrimination measures
5.8 Cross-validation of stopword discrimination measures
5.9 POS-based feature selection
5.10 Extra weight for POS categories
5.11 Stopword weighting instead of elimination


5.12 Selected linguistic and statistical reduction methods in comparison

6.1 Splitting “mechanical” compounds
6.2 Splitting all compounds
6.3 Modifications to compound splitting
6.4 Clustering with bigram features
6.5 Clustering with multi-part name features
6.6 Clustering with noun phrases
6.7 Clustering with synsets
6.8 Clustering with refined synset selection

7.1 Combining reduction techniques
7.2 Combining matrix enhancement techniques
7.3 Sub-sampling of matrix enhancement techniques
7.4 Combining matrix enhancement and reduction techniques

List of Tables

2.1 Lance-Williams coefficients

3.1 Local feature weighting variants
3.2 Global feature weighting variants
3.3 Vector normalisation
3.4 Van Rijsbergen's stoplist
3.5 Stemming example
3.6 Clustering scenarios and time-criticality
3.7 Representation methods and application stages

4.1 Twenty Newsgroups data set
4.2 BankSearch data set
4.3 Summary of the five German data sets
4.4 POS distributions
4.5 Correlation of evaluation measures with ordinary clusters
4.6 Correlation of evaluation measures with random data
4.7 Parameter choices for Cluto
4.8 Comparison of different clustering algorithms
4.9 Time demands of different algorithms

5.1 Confusion matrix of springer baseline
5.2 Confusion matrix of sda baseline
5.3 Confusion matrix of nzz baseline
5.4 Confusion matrix of amazon baseline
5.5 Confusion matrix of wiki baseline
5.6 Generic mapping of Tree-Tagger and Gertwol tag sets
5.7 Time demands of different pruning techniques
5.8 Top ten stopwords from eight measures
5.9 Overlap between stopword discrimination measures
5.10 Self-validation with rank-sums
5.11 Cross-validation rank-sums for different stopword extraction techniques
5.12 Experiments with reduced number of clusters
5.13 Stopword removal with mutilated nzz texts

6.1 Compound statistics: mechanical, organic and pseudo-compounds
6.2 “Organic” compounds and their POS tags
6.3 Twenty most frequent bigrams
6.4 Twenty most frequent multi-part names


6.5 Gojol-parsing success rates
6.6 Twenty most frequent noun phrases
6.7 GermaNet statistics
6.8 Twenty most frequently used synsets (part one)
6.9 Twenty most frequently used synsets (part two)

7.1 nzz confusion matrix with split “AUSL” category
7.2 Results with “optimal” (ex-post) document representation

B.1 Internal clustering criteria in comparison

D.1 Evaluation of Cluto's built-in weighting models
D.2 Double-weighting experiments
D.3 Different idf and idf² variants
D.4 Squared idf applied to Cluto test sets
D.5 Baseline results
D.6 Bag-of-lemmata versus bag-of-words
D.7 Global pruning with similarity preservation
D.8 Pruning with upper and lower bounds
D.9 Local pruning
D.10 Latent Semantic Analysis
D.11 Stopword removal with manually compiled stoplist
D.12 Self-validation of stopword discrimination measures (averages)
D.13 Self-validation of stopword discrimination measures
D.14 Cross-validation of stopword discrimination measures: unified lists
D.15 Cross-validation of stopword discrimination measures: intersected lists
D.16 Subset experiments for stopword removal (amazon)
D.17 Subset experiments for stopword removal (wiki)
D.18 POS-based feature selection
D.19 Extra weight for POS categories
D.20 Stopword weighting instead of elimination
D.21 Extra weight for leading nouns
D.22 Linguistic and statistical reduction methods in comparison
D.23 Splitting “mechanical” compounds
D.24 Splitting all compounds
D.25 Modifications to compound splitting
D.26 Subset experiments for compound splitting (amazon)
D.27 Subset experiments for compound splitting (wiki)
D.28 Clustering with bigram features
D.29 Clustering with multi-part name features
D.30 Clustering with noun phrases
D.31 Clustering with synsets
D.32 Clustering with refined synset selection
D.33 Combining matrix reduction techniques
D.34 Combining matrix enhancement techniques
D.35 Combining matrix enhancement and reduction techniques

Chapter 1

Introduction to Document Clustering

There is a tsunami of data that is crashing onto the beaches of the civilised world. This is a tidal wave of unrelated, growing data formed in bits and bytes, coming in an unorganised, uncontrolled, incoherent cacophony of foam. It is filled with flotsam and jetsam. It is filled with the sticks and bones and shells of inanimate and animate life. None of it is easily related, none of it comes with any organisational methodology. (...) The tsunami is a wall of data— data produced at greater and greater speed, greater and greater amounts to store in memory, amounts that double, it seems, with each sunset. On tape, on disks, on paper, sent by streams of light. Faster and faster, more and more and more.

Richard Saul Wurman (Information Architects, 1996)

Following the tragic natural catastrophe in December 2004 in East Asia the occasionally encountered term “information tsunami” has assumed a new ring, and for reasons of piety it had probably better be dropped altogether from information scientists' vocabulary. For the moment, however, it is worthwhile to dwell a little longer on the parallels between the lethal oceanic phenomenon and its comparatively harmless counterpart in cyberspace. In both cases we are dealing with phenomena of absolutely staggering dimensions having dramatic implications for our lives. In both cases the consequences exceed everything known before, beating all human anticipations by far. In both cases there is no means of preventing or containing the phenomenon—all we can do is look out for strategies and techniques to cope with it as well as possible and to keep the damage to a minimum.


With the advent of the Internet the mass of data daily pouring down on us, or at least standing at our free disposal, has started to grow exponentially. And it keeps growing. A recent study estimated the size of the “indexable” Web at over 11.5 billion pages (Gulli and Signorini, 2005), and there is no end in sight. In fact, as entire libraries are undergoing a process of digitisation, we may have seen just the beginning.

This unimaginably huge flow of data is having a far-reaching impact on our societies, economies and daily lives. The immense quantity of available data offers us not only an unprecedented wealth of riches and possibilities; it also confronts us with a multitude of challenges and problems not encountered before, because coping with the data is anything but trivial. The dangers accompanying the surge of information are not to be underestimated and are well illustrated by buzzwords such as “Infoglut” (Allen, 1992), “Information Fatigue Syndrome” (Lewis, 1996), “TechnoStress” (Weil and Rosen, 1997), “Data Smog” (Shenk, 1997), “Data Asphyxiation” (Winkle, 1998) and “Information Pollution” (Nielsen, 2003). Whole new fields of psychology have been opened just to deal with the information overload and with our difficulties and perplexity resulting therefrom. At the other end, an explosion of new research into fields such as data mining, information storage, information retrieval, knowledge extraction and knowledge management has set in. It is directed at developing the new tools that are desperately needed to cope with one of the big challenges of the 21st century: the information challenge.

The present thesis cannot but deal with a very tiny aspect of the global endeavour to fight the information flood. First, we restrict ourselves to textual data. Second, of the dozens (if not hundreds) of strategies and techniques that are being developed to cope with the information load, we pick out two: document clustering and natural language processing (NLP). We subject them to an in-depth study and try to assess the benefits of their combined application. The results should offer guidance to future investigations and other combinations of NLP with information retrieval techniques.

The rest of the present chapter is divided as follows: the first two sections motivate and introduce document clustering and natural language processing, giving as well an overview of the structure of the thesis (Sections 1.1 and 1.2). The next section presents a few central concepts in some more detail (Section 1.3). Two further sections deal with the purpose and applications of document clustering in theory and practice (Sections 1.4 and 1.5), while the final section introduces the formal notation used in the following chapters (Section 1.6).

1.1 Motivation

The present thesis focusses on two aspects of automated information processing and their combined application:

• Document Clustering as an approach to bring order into large sets of unordered documents,

• Natural Language Processing as an approach to extract good descriptions of individual documents.

In particular, it is the goal of this thesis to establish which natural language processing techniques are useful in a pre-processing step and to which extent they can improve clustering results. The work will be further characterised by the choice of five independent, diverse and relatively large sets of German documents that have been specifically collected for the present study.

Cluster analysis is an old and well-established procedure for bringing a general order into large sets of all kinds of data, including texts. In information retrieval, it has long remained in the background because of the predominant pursuit of powerful search techniques (i.e. ways to find particular documents or documents about a particular topic). But with the development of adequate solutions to the straightforward search problem (as demonstrated by Google¹ and other leading search companies), more complex information handling strategies, including document clustering, have received a new boost.

Natural language processing has its own long history with a wide number of applications such as machine translation, speech generation, question-answering systems, text summarisation and many more. Applied to the document clustering problem, NLP is used to replace the traditional view of a document as a random conglomerate of letters and digits (words) by an analytical linguistic view of these symbols. We want to show that an accordingly refined representation of documents in terms of linguistic concepts leads to superior clustering results, and in particular we want to find out which NLP techniques are the most suitable for the task.

The application of NLP techniques in information retrieval (IR) has frequently been discussed in the past and it remains a controversial issue.² In the end, it all boils down to a classic trade-off problem: do the extra benefits warrant the extra effort that is required by NLP? Notable experts such as Smeaton (1997) and Sparck Jones (1999) have taken a rather critical view of the general use of NLP in text retrieval systems. By narrowing the discussion to one specific aspect of information retrieval and subjecting our data to a thorough examination by an arsenal of different NLP techniques, we hope to be able to reach a more reliable verdict, even if only for one particular area of IR. A further aim is to characterise the circumstances under which the use of NLP is recommendable.

¹ www.google.com
² See, for instance, Strzalkowski (1999), Feldman (1999), Zhou and Zhang (2003).

1.2 Thesis Structure

The structure of this thesis is governed by the two main components of a document clustering system (see Figure 1.1):

• The document representation component, which takes as input the documents (raw texts, Web pages, etc.), then extracts pertinent features and transforms them into a data structure that is suitable for clustering. Normally, a vector representation is chosen for the individual documents, resulting in an n × m document-feature matrix for the whole set. Element (i, j) of the matrix then indicates the strength/presence of feature j in document i.

• The cluster analysis component, which takes as input the data structure generated in the previous step and groups similar rows (documents) together. Typically, n data points are thus partitioned into k clusters. Cluster analysis is independent of the original domain (documents) and has applications in the most diverse areas of science. Its basic ingredients are a measure of (dis)similarity between data points and an algorithm for grouping them. (A small code sketch of these two components follows below.)
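To make the two components concrete, here is a minimal end-to-end sketch. scikit-learn and the toy documents are assumptions made purely for illustration; they are not the software or the data used in this thesis (see Chapter 4 for the actual setup).

```python
# Illustrative sketch of the two components of Figure 1.1; scikit-learn is
# an assumed stand-in, not the software actually used in the thesis.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "Der Hund bellt im Garten.",
    "Die Katze schlaeft auf dem Sofa.",
    "Hunde und Katzen sind beliebte Haustiere.",
    "Die Boerse schloss heute im Minus.",
]

# Document representation component: raw texts -> n x m document-feature
# matrix; element (i, j) holds the weight of feature j in document i.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)   # n = 4 documents, m features

# Cluster analysis component: group the n row vectors into k clusters.
k = 2
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
print(labels)   # one cluster index per document
```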

Depending on the application, the clusters thus gained can be further processed, e. g. by a cluster visualisation component. Some of these post-processing approaches will be touched upon in the text, but they are mostly outside the scope of this study. The goal is to integrate NLP techniques into the document representation component against the background of document cluster analysis. The thesis is divided into eight chapters as follows:


n text documents → [Document Representation] → n document vectors → [Cluster Analysis] → n documents grouped in k clusters

Figure 1.1: The two main components of a document clustering system. The document representation component maps text documents onto a suitable internal structure (usually a vector); the cluster analysis component determines a suitable grouping based on the input matrix built from the document vectors.

Chapter 1 (Introduction): The rest of the present chapter describes various applications and purposes of document clustering. The reader is introduced to the different situations and contexts in which clustering is used to organise text. A brief, annotated list of on-line resources is also given. Finally, the main concepts are introduced formally as a preparation for the following two theoretical chapters.

Chapter 2 (Cluster Analysis): First, we get acquainted with the main principles, data structures and algorithms that are used in the Clustering Component. We learn to know typical similarity measures, the main clustering algorithms (iterative, hierarchical and some others) and evaluation and validation techniques. Even though the discussion is kept at a fairly general level, special attention is paid to those clustering issues that are relevant to the domain of textual cluster analysis.

Chapter 3 (Document Representation): We then turn to the Document Representation component. We learn more about the mapping of documents onto data structures and about the processes of feature weighting, selection, standardisation and extraction. We introduce statistical and linguistic methods and conclude with a brief section on time constraints and cluster presentation.

Chapter 4 (Experimental Setup): After disseminating the theoretical background we turn to the experimental setup. We describe the five corpora used in the further experiments, the evaluation procedure for different representation methods and the clustering software.

Chapter 5 (Reduced Representations Using NLP): The first part of the experiments is devoted to representation reduction techniques. We examine different methods (including stopword removal, part-of-speech selection and several others) to filter and reduce the document representations. The aim is to reduce clustering complexity and to increase clustering quality.

Chapter 6 (Enhanced Representations Using NLP): We then proceed to representation enhancement techniques. We examine how morphological, syntactic and semantic information can be used to achieve richer and better document representations. The focus here is on improving clustering quality without unnecessarily increasing clustering complexity.

Chapter 7 (Combining Representation Techniques): After having analysed all techniques separately, we examine how well they combine with each other. We first look at reduction and enhancement techniques among themselves, and then at selected combinations from both groups.

Chapter 8 (Summary and Outlook): The last chapter concludes with a summary of our findings and a brief analysis of the implications for Natural Language Processing. An outlook points to further research tasks that appear most rewarding in the light of the present study.

The thesis is rounded off by an appendix consisting of a glossary, a proof section, the stoplists and a detailed numeric account of the experimental results.

1.3 Principal Concepts

Let us characterise some of the principal concepts of the further discussion in a few more words:

Natural Language Processing (NLP). Natural Language Processing encompasses all those theories, hypotheses and practical applications which are directed at automatically processing text based on knowledge of language, linguistics and human interaction. The history of NLP dates back to the very beginning of the computer age and has been a central aspect of linguistic research and artificial intelligence ever since. Jurafsky and Martin (2000, 4), after a brief overview of NLP history, distinguish six central areas of language processing:

• phonetics and phonology: the study of sounds,
• morphology: the study of meaningful components of words,
• syntax: the study of structural relationships between words,
• semantics: the study of meaning,
• pragmatics: the study of language used as a means to achieve certain goals,
• discourse: the study of complex linguistic interactions.

The present work features NLP techniques from the fields of morphology, syntax and semantics.

Information Retrieval (IR). Information retrieval is the science aiming to provide end-users with efficient access methods to large collections of data, usually in the form of documents. A concise definition is lacking, but the following quotations capture the essence:

• “Information retrieval is concerned with the representation, storage, organization, and accessing of information items. ... In actuality, many of the items found in ordinary retrieval systems are characterized by an emphasis on narrative information. Such narrative information must be analyzed to determine the information content and to assess the role each item may play in satisfying the information needs of the system users.”—Salton and McGill (1983, 1–2).

• “Information retrieval is the term conventionally ... applied to the type of activity discussed in this volume. An information retrieval system does not inform (i. e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request.”—Lancaster (1968, quoted in van Rijsbergen, 1979, 1).

1. User has information need.
2. User transforms information need into query. A query normally consists of one or more query terms explicitly or implicitly combined by Boolean operators.
3. IR system processes query. It matches the query to the documents in the collection and returns the best matches.
4. User is given a list of document references, often sparsely annotated and sorted by relevance (as perceived by the system).
5. User pursues one or more documents and/or modifies the query in the hope of obtaining better results (→ returns to step three).
6. User's information need is satisfied after reading the documents, and user is happy. As new information needs arise, he returns to step one.

Figure 1.2: The standard scenario of information retrieval.

• “Information retrieval [is] an area of science and technology that deals with cataloging, categorization, classification, and search of large amounts of information, particularly in textual form. An outcome of an information retrieval process is usually a set of documents containing information on a given topic, and may consist of newspaper-like articles, memos, reports of any kind, entire books.”—Strzalkowski (1999, xiii).

• “Information retrieval is the task of finding relevant documents from a database or collection in response to a user's information need.”—Smeaton (1997, 15).

The standard IR scenario is described in Figure 1.2. In order to work, it requires powerful algorithms and storage methods. Often it is further refined by extra steps, feedback-loops, etc. IR has been a primary concern of computer scientists since the 1960s, and it continues to play an extraordinary role which is well demonstrated by the unique importance of search engines for our own daily “Web behaviour” as well as for the marketing specialists and the emerging “search engine optimisation” industry.

Text/Document Retrieval. As we have just seen, information retrieval is concerned with bringing documents to users' attention. Whether the documents contain the information actually sought cannot be determined, though, and it is up to the user to extract and assess the information from the documents. Naturally, it would be even more desirable for systems to handle actual pieces of information and extract and present them rather than dealing only in coarse documents. In order to distinguish between such as yet mostly experimental systems and those in use at present, it has been suggested that today's ordinary search systems had better be described as text or document retrieval systems (van Rijsbergen, 1979, 1), with the term information retrieval reserved for systems that will really be capable of handling information. In the present work, however, the terms information retrieval, document retrieval and text retrieval will be used as synonyms.

Document Clustering. A document clustering system analyses a large set of documents and then groups similar documents together. As a result it produces a number of more or less well-defined clusters (groups) of documents. The process autonomously creates distinct groups based on inter-document similarities. It thus imposes an order onto the previously unordered documents. Ideally, this order corresponds to a structure already inherent but not trivially discernible in the data. The clusters may be entirely independent of each other. Alternatively, there may also be an order between the clusters. Some clustering systems produce a hierarchical order while others arrange clusters in a two- or three-dimensional space. Clustering is an unsupervised learning task, as no a priori information is required.

Document Categorisation/Classification. Categorisation and classification (henceforth used synonymously) are strongly related to clustering. In both cases a potentially very large number of objects is to be grouped into a scheme of flat or hierarchical “classes”. Yet, there are important differences between the classification/categorisation task and the clustering task (Willett, 1988):³ Classification is the process of assigning one or several objects to a predefined scheme. In other words, the individual classes are all given a priori and the task is to determine the best-fitting class for each new object. The classes have to be determined by the user, either intensionally (e. g. by a Boolean expression on a number of features that completely describe the class) or extensionally (by giving a sample of members for each class). Training an algorithm to recognise an extensional classification scheme is a supervised learning task; the algorithm is tuned on a pre-labelled training set. Our focus is strictly on clustering (i.e. with no pre-defined classes), but the close relationship makes it unavoidable from time to time to compare our findings with those of the classification literature. Even though classification uses similar or identical data structures, the algorithms are often very different from those used for clustering. (A small sketch of this contrast follows at the end of this section.)

Text Data Mining. Text data mining is a sub-discipline of data mining. It is the task of automatically extracting and processing information from textual data. Text data mining is a very wide but relatively unexplored and far less structured field than text retrieval (Hearst, 1999a; Williams, 2000a,b). In the long run it will no doubt play a key role in the global quest to accommodate the information avalanche. As with IR techniques, the question also arises how to make optimal use of NLP techniques for text data mining.
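Returning to the distinction between clustering and classification drawn above, the contrast can be shown in a few lines. This is a hedged sketch only; scikit-learn, the toy texts and the class names are illustrative assumptions, not part of the thesis setup.

```python
# Clustering (unsupervised) vs. classification (supervised) in miniature;
# scikit-learn and the toy data are assumptions for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB

texts = ["stocks fell sharply", "markets rallied today",
         "the match ended in a draw", "the striker scored twice"]
classes = ["finance", "finance", "sport", "sport"]  # given a priori

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Clustering: no labels are given; the algorithm must discover the groups.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Classification: a predefined (extensional) scheme; the model is tuned
# on the pre-labelled training set, then assigns new objects to classes.
model = MultinomialNB().fit(X, classes)
new = vectorizer.transform(["stocks fell again"])
print(clusters, model.predict(new))
```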

1.4 Purpose and Applications of Document Clustering

Document clustering has been used for such diverse text data mining tasks as automatic generation of Frequently Asked Questions (FAQs) (Wen et al., 2001; Wen and Zhang, 2003) and automatic functional annotation of gene products (Renner and Aszódi, 2000), as well as for many other tasks. By far the most typical applications of document clustering, however, have so far been found in information retrieval and they will be discussed in the rest of this section.

1.4.1 Supporting Retrieval of Documents

Historically speaking, document clustering was first used with the aim of improving the retrieval performance of existing IR systems.⁴

³ Unfortunately and rather confusingly, the terms “clustering” and “classification” have occasionally also been used as substitutes (e. g. by van Rijsbergen, 1979).
⁴ For an early overview of document clustering see Jardine and van Rijsbergen (1971).

Traditional IR systems compare a query with every document in the collection and return those that best meet certain similarity criteria. An alternative is cluster-based retrieval, which involves these differences:

1. Documents are clustered in an off-line process (in advance).
2. Queries are matched against clusters instead of documents.
3. One or several clusters are returned rather than individually chosen documents.

If implemented successfully, this procedure promises a number of advantages over an ordinary IR system:

• Faster execution because of a reduced number of comparisons for each query.
• Better recall⁵ because relevant documents are still found even if they happen to use a slightly different terminology.
• Higher precision⁶ because single documents that accidentally share many terms with the query but in terms of content belong to a different subject are not retrieved.

That being said, it must be noted that it is not at all obvious that the procedure can be made to work. As observed by van Rijsbergen, cluster-based retrieval can only work if the cluster hypothesis (Jardine and van Rijsbergen, 1971) holds:

Cluster Hypothesis. Closely associated documents tend to be relevant to the same requests (van Rijsbergen, 1979, 45).⁷

Provided the hypothesis is true—i. e. if relevant and non-relevant documents for a specific query are well separated—the aforementioned improvements may be gained by statically clustering the document collection. However, van Rijsbergen (1979, 45–48) mentions two further important requirements:

• A clustering algorithm which is able to exploit the relevant associations among documents, which is stable with regard to new additions and small errors, which is independent of the order in which the documents are processed and which easily scales up.
• An algorithm to efficiently retrieve the best-matching cluster(s) for any given query.

In practice results have been rather mixed and the desired goals were often missed (Jardine and van Rijsbergen, 1971; van Rijsbergen and Sparck Jones, 1973; Voorhees, 1985, 1986; Willett, 1988; Hearst and Pedersen, 1996). Therefore, cluster-based retrieval is no longer in the focus of the document clustering community. A lack of distinct success and the difficulties transporting the concepts to modern large-scale Web search engines have led to the demise of this particular clustering application. Other clustering approaches in the retrieval domain include query clustering to identify similar queries and answer them in a uniform way and keyword clustering as a technique of query expansion (Sparck Jones, 1971; Xu and Croft, 1996; Chang and Hsu, 1998). (A code sketch of basic cluster-based retrieval follows after the notes below.)

⁵ For a definition see Section 1.6 (Eq. 1.17).
⁶ For a definition see Section 1.6 (Eq. 1.18).
⁷ For a recent discussion of the cluster hypothesis see the work by Tombros. He claims that the cluster hypothesis is always true if an appropriate, query-dependent similarity measure is chosen (Tombros and van Rijsbergen, 2001; Tombros, 2002; Tombros et al., 2002).
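To fix ideas, here is a minimal sketch of the procedure described above (off-line clustering, on-line matching of queries against cluster representatives). scikit-learn and the toy data are assumptions for illustration; this is not one of the systems cited.

```python
# Cluster-based retrieval in miniature: cluster off-line, match the query
# against k centroids instead of n documents, return the best cluster.
# scikit-learn and the toy data are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

docs = ["solar power plants", "wind turbines at sea",
        "football world cup", "champions league final"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Off-line step: cluster the whole collection once, in advance.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# On-line step: compare the query with the k centroids only.
query = vectorizer.transform(["offshore wind power"])
best = int(np.argmax(cosine_similarity(query, km.cluster_centers_)))

# Return the documents of the best-matching cluster.
print([d for d, c in zip(docs, km.labels_) if c == best])
```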

1.4.2 Supporting Presentation of IR Results

The application of IR technology to such a huge and diversified text collection as the World Wide Web (WWW) has given rise to a number of new issues. Storage, indexing, updating and retrieval all needed thorough re-thinking. The same was true (and still is) for the presentational aspects. New methods are needed to efficiently handle result sets comprising not just a few dozens but hundreds and thousands of documents.

This challenge has been tackled from various directions, but research is still far from completed. Much effort has gone into the devising of sophisticated ranking algorithms which aim to maximise the likelihood of returning highly relevant quality documents at the top of the list. Most prominent among these methods is Google's PageRank citation ranking (Page et al., 1998).

Beyond the Ranked List

However, relatively often it is impossible to determine which documents are most relevant with regard to a specific user's request. This may be because the query is very short and allows multiple interpretations, or because the query as a whole or individual parts thereof are inherently ambiguous. Joshi and Jiang (2001) differentiate between three possible reasons:

• Polysemy: search words have multiple meanings.
• Phrases: a phrase may have a meaning different from that of the individual words (e. g. “Sunday Times”).
• Term dependency: words in a composite term may be depending on each other (e. g. “Digital Equipment Corporation”).

One approach to resolve these ambiguities consists of building a model of the individual user, thus taking into account the context of each query. An example of such an Information Management Assistant tracing the individual user's behaviour is Watson (Budzik and Hammond, 1999). Another approach is to leave the ambiguities unresolved and refrain from making any presumptions about the user's intent. Instead of a list, the user is presented with a clustered view of the retrieved documents, a concept put forth by Willett (1985) and much explored ever since. Compared to the common ordered results list, the clustered view approach comes with a number of advantages:

• Fewer assumptions are made on the user's intention.
• Within a few system interactions the user can reach a much larger number of documents than in a sequential list. Especially with cluster hierarchies, significant gains in time can be achieved (Leuski, 2001).
• Good cluster descriptions immediately give the user an overview of the main topics within the retrieved set of documents.
• The often difficult task of formulating concise queries loses in importance.⁸ Instead, the query may remain fairly general and unspecific, with common navigational skills replacing the more demanding query formulation techniques (cf. Chang and Hsu, 1998).

⁸ In fact, even without clustering interfaces available, most users prefer very short queries. Spink et al. (2001) report an average query length of 2.4 terms, with less than 5% of the queries featuring Boolean operators. Clustering would thus seem to be a highly desirable feature. (The study was based on 1997 data from the Excite search engine.)

In fact, using a clustering interface to a search engine combines the two major Web content access techniques: searching and browsing (Large et al., 1999, 143). Back in 1994 Bowman et al. had predicted that a combination of the two techniques would be necessary to keep pace with the fast development of the Web. Much document clustering work in recent years has been explicitly devoted to clustering Web search results.⁹

⁹ E. g. Zamir and Etzioni (1998, 1999); Zamir (1999); Maarek et al. (2000); He et al. (2001); Stefanowski and Weiss (2003).

Document clustering as a means of search result presentation (also known as “ad-hoc” or “ephemeral” clustering) is different from document clustering for retrieval purposes. It can be characterised as follows:

• Clustering is performed dynamically (“on-the-fly”). Time is thus at a high premium and fast algorithms of crucial importance.

• The clustering algorithm must be scalable and flexible, managing small result sets just as well as very large ones.

• Clustering must be able to deal with a very large variety of documents of different lengths, structures, languages, etc.

• The system must be able to come up with sensible, humanly understandable descriptions of the clusters (cluster digests). Furthermore, navigation and visualisation of the clusters must be user-friendly.

• Link-structure, HTML tags and other special features may be exploited for document representation and clustering.

• Unlike in a static environment, questions concerning updates and maintenance of the cluster structure are irrelevant.

In analogy to Jardine and van Rijsbergen’s original cluster hypothesis for retrieval, a second cluster hypothesis for ephemeral clustering may be formulated:

Second Cluster Hypothesis. Documents in a collection can be grouped in such a way that significant benefits result for different users intending to review different specific parts or the whole of the document collection.

Implicitly, this hypothesis is assumed to be true by most researchers in the field. However, there is valid reason to cast doubts, especially since studies have shown that users often have strong individual preferences and divergent views of what constitutes a “good” or “correct” clustering (Macskassy et al., 1998). Moreover, the suitability of a particular clustering may depend on the search situation. With different types of searches having different types of pages as objects (Pirolli et al., 1996) the same may be true of clustering.

Information Seeking Theory

In order to improve document clustering applications in IR, the context of the search situation (Vakkari et al., 1997) needs closer attention, leading us into the field of information seeking theory.

Systematic research into the technical, practical, cognitive and emotional aspects of information seeking dates back to the 1960s (e. g. Taylor, 1968). But only in the 1990s did it become apparent that for a quickly growing number of people the traditional information skills such as filtering, acquiring and storing information were no longer enough. The ability to recognise information needs in time, to express them and to satisfy them by actively seeking for appropriate information has become essential, both in professional and everyday life. Accordingly, the research into information seeking, which had been rather neglected for a long time, started to attract new attention (Vakkari et al., 1997, 7) and various attempts have been made to bring IR and information and library sciences closer together again (Vakkari, 1999).

Various information seeking models have been described in the literature (Bates, 1989; Yang, 1997; Marchionini and Shneiderman, 1988; Marchionini, 1995) and particular attention has been paid to the information seeking process (Marchionini, 1989, 1992, 1995; Large et al., 1999; Shneiderman et al., 1997; Shneiderman, 1998). A more specific branch of research deals with the various search types and behaviours that occur in practice (Ellis, 1989; Ellis and Haugan, 1997; Choo et al., 2000a,b; Yang, 1997; Rosenfeld and Morville, 1998; Devlin and Burke, 1997; Saunders and Sheffield, 1998; Broder, 2002). Rosenfeld and Morville, for instance, distinguish four general types of information search (Rosenfeld and Morville, 1998, 101–103):

Known-item searching. The user looks for a clearly defined bit of information. He knows that his question has a single, correct answer. Known-item searching may be regarded as the “easiest” concept, well supported by traditional IR instruments. Two typical known-item search tasks are homepage finding and named-page finding (Ogilvie and Callan, 2003).

Existence searching. The user looks for some fairly well-defined bit of information, but he does not know whether it actually exists, and he may have difficulties understanding and describing his need accurately.

Exploratory searching. The user has an open question or idea on which he wants to learn more. He may not know what exactly he wishes to find, and there is unlikely to be a single, correct answer to his need.

Comprehensive searching (research). Just like exploratory searching, comprehensive searching has an open character. The difference is that the searcher wants to uncover everything related to the particular topic, not just a few useful pages.

For most searches of the first kind (known-item searches), a straightforward ranked list will do. But for existence and, in particular, exploratory searching, a clustering interface offers new and perhaps more efficient access methods to a set of retrieved documents. Finally, a comprehensive search, too, may profit from the order gained by presenting results in clusters. Further systematic research is needed to combine the insights from information seeking theory with document clustering and other sophisticated IR techniques. Intuitively, however, a clustered view of search results has a number of promising applications.

1.4.3 Direct Access to Document Collections

Document clustering has further been employed in several innovative collection access methods which do not rely directly on text retrieval. Usually they are based on a combination of search and browse activities. Some of these methods are briefly described below. In nature they are often quite similar to the approaches discussed in the previous two sections.

Korfhage (1991) presents a system which, in principle, can display all documents of a collection in relation to a few user-defined reference points, thus allowing situation-dependent navigational access. For obvious reasons, the system fails to scale up to Web standards.

Scatter/Gather (Cutting et al., 1992, 1993; Hearst and Pedersen, 1996) is an algorithm which works on a result set or a pre-clustered collection. Rather than calculating a detailed cluster hierarchy, the crude top-level clustering is only refined further as the need arises, i.e. as the user navigates his way up and down the hierarchy.

WebACE (Boley et al., 1999a) is a system monitoring a user’s activity on the Web and building a user profile by clustering frequently visited pages. In a second phase the system supports the user by presenting him with new documents found on the Web based on that user profile.

WebCluster (Harper et al., 1999; Muresan, 2002) is a system offering mediated access to a potentially very large collection (such as the Web) by presenting users with a static clustered view of a selected, domain-specific sub-collection. The actual search is only launched after the user has identified the clusters that appear most pertinent to his information need.

1.5 Systems in Practice

Here follows a non-exhaustive list of systems that implement a more or less sophisticated clustering of documents and which can (or could) be found on-line on the Internet:

Grouper. A research project at the University of Washington using an innovative algorithm called Suffix Tree Clustering (Zamir and Etzioni, 1999; Zamir, 1999). Since the year 2000 the clustering interface Grouper and the underlying meta-search engine HuskySearch are no longer publicly available.

NorthernLight.[10] This used to be a Web search engine which displayed its search results grouped into a number of predefined classes. Thus, it did not actually use clustering but a classification algorithm. After a change in proprietorship the company now specialises in solutions which also include a clustering algorithm. It no longer provides a free Web search service, however.

WiseNut.[11] A Web search engine powered by LookSmart which groups results into a varying number of clusters. The algorithm seems to work exclusively with page titles.

Teoma.[12] A Web search engine with a standard list of most relevant documents, but which offers in a sidebar a number of rudimentary refinement options derived from a clustering of the Web pages into so-called “communities”.

KartOO.[13] A commercial content search solution with a highly advanced and dynamic graphical interface for displaying contents and documents, also including clustering functionality.

Vivísimo.[14] A commercial content clustering solution, coming with a popular and free meta-search engine as an illustration of its document clustering capabilities. Its special feature is hierarchical conceptual clustering, resulting in clusters which, it is claimed, can be described “concisely, accurately, and distinctively” (Vivísimo, 2003). A number of enthusiastic white papers at their Web site promote the many virtues of clustering technology.

[10] www.northernlight.com
[11] www.wisenut.com
[12] www.teoma.com
[13] www.kartoo.net
[14] www.vivisimo.com

Figure 1.3: Killerinfo. A view of the results page of the Killerinfo meta-search engine, with clusters in the right half of the window.

Search Allinone MetaSearch.[15] Another, more recent meta-search engine which presents a twofold view of the results: an ordered list and grouped results.

Killerinfo.[16] A meta-search engine with considerable stress on the hierarchically clustered presentation of search results. Figure 1.3 shows the Killerinfo interface.

Entrieva.[17] A company developing various professional knowledge management tools which use, among other things, taxonomies and algorithms for automatic taxonomy generation.

1.6 Notation

This section introduces the formal notation used in the following chapters. It refers to a standard document clustering application within a standard IR system. Special systems may require notations and definitions differing from those given here. For simplicity’s sake and in accordance with the majority of the relevant literature, the distinction between row and column vectors is only made explicit where necessary. In those cases, $x$ refers to a column vector and $x^T$ to a row vector.

Document. Let $D_i$ abstractly denote an individual document $i$.

[15] www.searchallinone.com
[16] www.killerinfo.com
[17] www.entrieva.com

Document Universe. Let $\Omega$ be the document universe, i.e. the set of all documents known to the IR system:

$\Omega = \{ D_i \mid i \in \mathbb{N} \}$,  (1.1)

with $\mathbb{N}$ the set of natural numbers.

Document Set. Let $\mathcal{S}$ be the document set, i.e. those $n$ documents that form the input to the clustering process:

$\mathcal{S} = \{ D_1, D_2, \ldots, D_n \} \subseteq \Omega$.  (1.2)

Note: For many—but not all—applications $\mathcal{S}$ equals $\Omega$.

Feature Set. Let

$\mathcal{F} = \{ f_1, f_2, \ldots, f_m \}$,  (1.3)

with $\mathcal{F}$ a set of $m$ features and $f_i$ an individual feature $i$. Each feature stands for a concrete or abstract document property.

Document Vector. Let

$d_i = (d_{i1}, d_{i2}, \ldots, d_{im})$,  (1.4)

with $d_i$ the document vector of document $D_i$ in an $m$-dimensional feature space $\mathcal{F}$. The $j$th component of $d_i$ (written as $d_{ij}$) corresponds to the value or strength of feature $f_j$ in document $D_i$. $d_{ij}$ is usually a non-negative real number:

$d_{ij} \in \mathbb{R}_0^+$.  (1.5)

Document Feature Matrix. Let

$H = (d_1, d_2, \ldots, d_n)^T$,  (1.6)

with $H$ the document feature matrix, defined by the individual vector representations $d$ of all $D \in \mathcal{S}$ (one row per document). The document feature matrix is usually the input to the clustering algorithm.

Document Vectorisation Function. Let $\tau$ be a function which transforms a text document into an $m$-dimensional vector representation in feature space $\mathcal{F}$:

$d_i = \tau(D_i)$, with $\tau: \Omega \to \mathbb{R}^m$,  (1.7)

with $\mathbb{R}$ the set of real numbers.

Feature Transformation Function. Let $\varphi$ be a function which transforms a document vector from one feature space ($\mathcal{F}_1$) into another ($\mathcal{F}_2$), sometimes making use of additional information from the document feature matrix $H$:

$d_i' = \varphi(d_i, H)$, with $\varphi: (\mathbb{R}^{m_{\mathcal{F}_1}}, \mathbb{R}^{n \times m}) \to \mathbb{R}^{m_{\mathcal{F}_2}}$.  (1.8)

Document Frequency. Let

$df(j, H) = \sum_i |\operatorname{sgn}(h_{ij})|$,  (1.9)

with the document frequency $df(j, H)$ the number of documents with a non-zero value for feature $f_j$.

Cluster. Let a cluster $C_i$ be a subset of $\mathcal{S}$:

$C_i \subseteq \mathcal{S}$,  (1.10)

and let $n_i$ be the number of objects in cluster $C_i$:

$n_i = |C_i|$.  (1.11)

Cluster Solution. Let

$\mathcal{C} = \{ C_1, C_2, \ldots, C_k \mid C_i \subseteq \mathcal{S} \ \forall\, i \in 1 \ldots k \}$.  (1.12)

A cluster solution $\mathcal{C}$ is thus defined as a set of $k$ clusters.

Cluster Algorithm. Let

$\mathcal{C} = \kappa(H)$, with $\kappa: \mathbb{R}^{n \times m} \to \mathcal{P}(\mathcal{S})$,  (1.13)

with $\kappa$ denoting the cluster algorithm and $\mathcal{P}(\mathcal{S})$ the power set of $\mathcal{S}$.

Cluster Representative. Let

$r_i = (r_{i1}, r_{i2}, \ldots, r_{im})$,  (1.14)

with $r_i$ a representative vector for cluster $C_i$ in an $m$-dimensional feature space $\mathcal{F}$.

Individual Cluster Criterion Function. An individual criterion function $E(C)$ measures the quality of a single cluster:

$E: \mathcal{P}(\mathcal{S}) \to \mathbb{R}$,  (1.15)

with $\mathcal{P}(X)$ the power set of $X$ and $\mathbb{R}$ the set of real numbers.

Overall Cluster Criterion Function. An overall criterion function $\Psi(\mathcal{C})$ measures the quality of an entire cluster solution:

$\Psi: \mathcal{P}(\mathcal{P}(\mathcal{S})) \to \mathbb{R}$,  (1.16)

with $\mathcal{P}(\mathcal{P}(\mathcal{S}))$ the set of all possible cluster solutions.

Type and Token. Within documents it is common to refer to word types and word tokens. The former refer abstractly to features in a document or a corpus, while the latter refer to individual occurrences. Formally speaking, the tokens of a document form a bag (which allows multiple occurrences of the same element). The types are the set created by eliminating all duplicates from the token bag.

Recall and Precision. In IR two widespread performance measures are defined via the set of documents in a collection that are relevant to a particular query ($A$) and the set of documents that are actually retrieved by the system ($B$):

$\mathrm{Recall}\ (R) = \frac{|A \cap B|}{|A|}$,  (1.17)

$\mathrm{Precision}\ (P) = \frac{|A \cap B|}{|B|}$.  (1.18)

Their weighted harmonic mean, the so-called $F$-measure, is also used frequently (see Equation 2.77 for an example).

For guidance to the notation used in the experimental results tables, please consult Section 4.3.1.
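To make the retrieval measures concrete, here is a minimal Python sketch (the document sets and numbers are invented for the example) computing recall, precision and their harmonic mean, the balanced F-measure:

```python
def recall_precision_f1(relevant, retrieved):
    """Compute recall, precision and their harmonic mean (F1)
    from a set of relevant (A) and a set of retrieved (B) documents."""
    overlap = len(relevant & retrieved)          # |A ∩ B|
    r = overlap / len(relevant)                  # Eq. 1.17
    p = overlap / len(retrieved)                 # Eq. 1.18
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return r, p, f1

# Hypothetical toy example: documents identified by numbers.
relevant = {1, 2, 3, 4}       # A: documents relevant to the query
retrieved = {2, 3, 5}         # B: documents returned by the system
print(recall_precision_f1(relevant, retrieved))  # (0.5, 0.666..., 0.571...)
```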

Chapter 2

Cluster Analysis

For instance, bird differs from bird by gradation, or by excess and defect; some birds have long feathers, others short ones, but all are feathered. Bird and Fish are more remote and only agree in having analogous organs; for what in the bird is feather, in the fish is scale.

Aristotle (On the Parts of Animals)

For millennia, grouping similar natural phenomena together and classifying them into categories has been a fundamental and vital strategy for mankind to find its way in a complex and dangerous environment. It is thus not surprising that the first systematic classification schemes for animals date back to Aristotle (384–322 BC) and his pupil Theophrastos (c. 371–c. 287 BC). The urge to bring order into things by creating clusters and classification schemes has grown ever since.

But even though evolution has shaped the human mind into a classifier par excellence, the information age has also shown us our limits: although we have no problem dealing with small-sized data sets in 2-dimensional spaces, our mind is ill-equipped for dealing with thousands and millions of data points and unstructured high-dimensional feature spaces. This is where systematic cluster analysis must step in, which in turn may provide the foundations for a systematic classification scheme and later serve as a basis for an automated categorising procedure.

Cluster analysis has an almost infinite number of applications in a wide spectrum of sciences such as archaeology, astronomy, biology, chemistry, medicine, market research, pattern recognition, psychiatry and many others (Duda and Hart, 1973; Everitt, 1993; Berry and Linoff, 1997). In particular, biological taxonomy has long been a driving force in the development of powerful and sophisticated algorithms (Sneath and Sokal, 1973). These in turn have boosted research and application in the other areas as well.

Clusters

Clustering is a process aimed at finding clusters in data. But what exactly is a cluster? Practice has shown this to be a surprisingly difficult question to answer. Jain and Dubes (1988, 1) define a cluster as a “number of similar objects collected or grouped together”. Capturing the phenomenon more formally is not easy and may even be misleading (Everitt, 1993, 6). Everitt mentions further definition attempts, including some based on internal cohesion and external isolation of individual clusters.

Figure 2.1: What is a cluster? Situations may arise where no cluster solution is universally “correct”. In this example different solutions are equally plausible.

Why too rigid a definition of “cluster” may not be desirable is illustrated by Figure 2.1. How many clusters can be discerned? And if the answer is to be two, which data points belong together? It becomes evident that despite the great efforts to get a tight formal grip on clustering, a task-inherent arbitrariness will always remain, and for many situations a single “correct” criterion for defining the clusters cannot be guaranteed to exist. Depending on the context and individual preferences, different solutions may suit best. This insight has been confirmed for document clustering by a study on human performance in clustering Web pages: the participants developed distinctly different cluster solutions for the same set of documents (Macskassy et al., 1998). Especially in a high-dimensional domain such as document clustering, the optimal solution is thus often subjective to some degree.

The rest of this chapter seeks to provide an overview of the more important formal aspects of cluster analysis. Sections 2.1 to 2.4 deal with data structures, similarity measures and clustering algorithms. Sections 2.5 and 2.6 deal with cluster properties and evaluation/validation methods. Individual clustering techniques and terminology vary from one application area to another. The present introduction will focus on those aspects relevant for the document clustering domain.[1]

2.1 Data Representation

In order to automatically cluster real-world phenomena (objects), they need to be captured and translated into formal representations. We distinguish three principal ways of viewing the data: the vector space model, the metric space model and the graph model.

2.1.1 Vector Space Model

In the vector space model, each object is represented by an $m$-dimensional vector ($d_i$). Each dimension represents a certain property $f_j \in \mathcal{F}$ found in the data, while the individual values $d_{ij}$ indicate the presence and/or extension of that particular property $f_j$ for the given object $i$. Absence is usually indicated by zeroes.

[1] For a general in-depth introduction to cluster analysis see Jain and Dubes (1988).

It is the task of the system designer to identify a suitable feature space $\mathcal{F}$ and to provide a mapping $\tau$ from the real-world objects into this feature space (which will form the topic of Chapter 3). The features can have binary, nominal, ordinal or numeric values; some models even allow a mix of different scales.

For document clustering it is customary for the vectors to consist of real, non-negative numbers ($d_{ij} \in \mathbb{R}_0^+$).
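As an illustration of the vector space model, the following minimal Python sketch (an assumption-laden toy, not the representation pipeline used later in this thesis) maps a few raw texts onto term-frequency document vectors, yielding a small document feature matrix $H$:

```python
from collections import Counter

def build_matrix(documents):
    """Map raw texts onto term-frequency vectors in a common feature space.
    Returns the feature list (f_1 ... f_m) and the matrix H (one row per document)."""
    tokenised = [doc.lower().split() for doc in documents]
    features = sorted(set(tok for doc in tokenised for tok in doc))  # feature set F
    H = []
    for doc in tokenised:                      # d_i = tau(D_i)
        counts = Counter(doc)
        H.append([counts.get(f, 0) for f in features])
    return features, H

# Toy corpus (invented for the example)
docs = ["the bird has feathers", "the fish has scales", "bird and fish"]
features, H = build_matrix(docs)
print(features)   # the m-dimensional feature space
print(H)          # the n x m document feature matrix
```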

2.1.2 Metric Space Model

Under certain circumstances, the vector space model is not applicable—for instance if the data is (partially) non-numeric or if the base data is missing. In such a case it may still be possible to represent the objects formally, provided we have enough information about the relationships of the objects among themselves in the form of an $n \times n$ similarity or distance matrix.

By applying a similarity measure it is always possible to transform an $n \times m$ vector space model into an $n \times n$ metric space model, but not vice versa. For a discussion of algorithms taking account of the restrictions of metric space models see Baeza-Yates et al. (2003); for ROCK, an algorithm suitable for categorical data, see Guha et al. (2003).

2.1.3 Graph Model

The vast majority of clustering algorithms and applications use the vector or metric space model, but there exists an alternative in the graph-based view of the document space:

• Graph partitioning as a means of data/document clustering has enjoyed notable popularity (see Ding et al., 2001; He et al., 2001). Here the data is represented in an undirected weighted graph $G = (V, E, W)$, where the vertices $V$ correspond to the individual objects and the edges $E$ to the associations between the objects, while $W$ is a non-negative weighting scheme for these associations (often based on similarity measures such as those in Section 2.2). For document clustering a completely connected graph is often chosen, but alternative graphs (e.g. inspired by hyper-link structures among the documents) also come into consideration. The clustering task is to split the graph into disjoint sub-graphs (partitions) following a certain objective function. In the simplest case (MINcut) the objective function is to minimise the cut size, defined as the sum of the weights of all edges that need to be cut to separate the partitions.[2] Since this method often leads to very skewed partitions, various improvements have been suggested, such as ratio cut, normalised cut and min-max cut (Ding et al., 2001). The latter, for instance, aims simultaneously to minimise the cut size and maximise the weights within a partition.[3] Finding optimal solutions to the graph partitioning problem is NP-complete, but graph theory being a well-understood subject, a number of powerful approximation methods are available. (A small sketch after this list illustrates the two cut measures.)

[2] $\mathrm{Cut}(A, B) = \sum_{i \in A,\, j \in B} W_{ij}$, with $A, B \subset V$ and $A \cap B = \emptyset$.  (2.1)

[3] $\mathrm{Mcut} = \frac{\mathrm{Cut}(A, B)}{W(A)} + \frac{\mathrm{Cut}(A, B)}{W(B)}$, with $W(X) = \sum_{i, j \in X} W_{ij}$.  (2.2)

• Association Rule Hypergraph Partitioning is a related approach. A hypergraph differs from a common graph in that an edge does not connect just two but arbitrarily many vertices. After discovering association rules (Agrawal et al., 1993) between data points, so-called frequent itemsets are built—overlapping sets of objects sharing certain distinctive features. After projecting these frequent itemsets onto hyperedges between the objects, similar graph partitioning methods as above can be applied (Moore et al., 1997; Boley et al., 1999a,b; Clifton et al., 2004; Noel et al., 2003).

• Co-clustering is another form of graph-based clustering. It involves a bipartite graph representing objects (documents) on the one side and features (words) on the other. A special algorithm then clusters both sides simultaneously (Dhillon, 2001).
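The following minimal Python sketch (with an invented toy graph; the weight matrix and the partition are assumptions for illustration only) computes the plain cut size of Equation 2.1 and the min-max cut objective of Equation 2.2 for a given two-way partition:

```python
def cut(W, A, B):
    """Cut size: sum of the weights of all edges running between A and B (Eq. 2.1)."""
    return sum(W[i][j] for i in A for j in B)

def within(W, X):
    """W(X): sum of the weights of all edges inside X (Eq. 2.2)."""
    return sum(W[i][j] for i in X for j in X)

def mcut(W, A, B):
    """Min-max cut objective: favours small cuts between and heavy weights within partitions."""
    c = cut(W, A, B)
    return c / within(W, A) + c / within(W, B)

# Toy symmetric weight matrix over 4 vertices (hypothetical similarities)
W = [[0.0, 0.9, 0.1, 0.0],
     [0.9, 0.0, 0.2, 0.1],
     [0.1, 0.2, 0.0, 0.8],
     [0.0, 0.1, 0.8, 0.0]]
A, B = {0, 1}, {2, 3}
print(cut(W, A, B))   # 0.4
print(mcut(W, A, B))  # 0.4/1.8 + 0.4/1.6 = 0.472...
```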

In the following discussion, graph and metric space models will be left aside and we will focus on the widespread vector space model.

2.2 Similarity Measures

Clustering being the process of identifying groups of objects that are similar to each other, a measure of similarity/dissimilarity between two objects lies at the core of all clustering algorithms. Virtually all of them are based on an explicit definition of such a measure in the vector space, and many algorithms in fact take as input an $n \times n$ similarity (proximity) matrix rather than the bare $n \times m$ object-feature matrix.

The present section gives a short overview of the different similarity coefficients encountered in document clustering practice. For a more detailed and formal discussion of similarity coefficients refer to Sneath and Sokal (1973) and Jain and Dubes (1988). Section 2.2.1 will deal with similarity between two individual objects and Section 2.2.2 with the problem of measuring similarity between entire groups of objects. In each case the discussion relies on the vector space model.

2.2.1 Inter-Object Similarity

Measures for determining the similarity between two vectors range from simple geometrical to sophisticated statistical methods. An ordinary similarity measure $s$ assumes values from 0 (complete dissimilarity) to 1 (total identity).

Metric Distance Coefficients. The most intuitive similarity coefficients measure the distance between two points in a vector space and will be denoted by $\hat{s}$.[4] Usually, distance metrics have the form of a Minkowski metric:

$\hat{s}(d_i, d_k) = \left( \sum_{j=1}^{m} |d_{ij} - d_{kj}|^r \right)^{1/r}$, where $r \geq 1$.  (2.3)

The three most common parameters are

[4] A distance metric $\hat{s}$ can always be easily transformed into an ordinary $[0, 1]$ similarity measure $s$, e.g. with $s = \frac{1}{1 + \hat{s}}$ or $s = e^{-\hat{s}^2}$ (Strehl et al., 2000).

• $r = 2$, the well-known Euclidean distance (also known as the $L_2$ norm):

$\hat{s}_{Euclid}(d_i, d_k) = \sqrt{\sum_{j=1}^{m} (d_{ij} - d_{kj})^2} = \|d_i - d_k\|_2$,  (2.4)

• $r = 1$, the Manhattan or city block distance:

$\hat{s}_{Manhattan}(d_i, d_k) = \sum_{j=1}^{m} |d_{ij} - d_{kj}| = \|d_i - d_k\|_1$,  (2.5)

• $r \to \infty$, the Chebyshev distance:

$\hat{s}_{sup}(d_i, d_k) = \max_{1 \leq j \leq m} |d_{ij} - d_{kj}| = \|d_i - d_k\|_\infty$.  (2.6)

Most often preference is given to the Euclidean metric.
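As a quick illustration, this minimal Python sketch (toy vectors invented for the example) computes the three special cases of the Minkowski metric directly from Equations 2.4–2.6:

```python
def euclidean(a, b):
    """L2 norm of the difference vector (Eq. 2.4)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """L1 (city block) distance (Eq. 2.5)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    """L-infinity distance: the largest per-feature difference (Eq. 2.6)."""
    return max(abs(x - y) for x, y in zip(a, b))

d_i, d_k = [1.0, 0.0, 2.0], [0.0, 0.0, 4.0]
print(euclidean(d_i, d_k))   # sqrt(1 + 4) = 2.236...
print(manhattan(d_i, d_k))   # 1 + 2 = 3.0
print(chebyshev(d_i, d_k))   # 2.0
```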

Association Coefficients. Association coefficients aim to measure the agreement between two vectors in the individual positions. Among the numerous formulae that have been suggested, the Dice coefficient

$s_{Dice}(d_i, d_k) = 2 \sum_{j=1}^{m} d_{ij} d_{kj} \Big/ \left( \sum_{j=1}^{m} d_{ij}^2 + \sum_{j=1}^{m} d_{kj}^2 \right)$  (2.7)

and the (extended) Jaccard coefficient

$s_{Jaccard}(d_i, d_k) = \sum_{j=1}^{m} d_{ij} d_{kj} \Big/ \left( \sum_{j=1}^{m} d_{ij}^2 + \sum_{j=1}^{m} d_{kj}^2 - \sum_{j=1}^{m} d_{ij} d_{kj} \right)$  (2.8)

enjoy the greatest popularity. Both coefficients measure the commonness of features between two objects, divided by a normalisation coefficient. Initially, these measures were developed for binary data; the above are generalised forms for the non-binary data typically found in the information retrieval domain. Several very similar variations of these two coefficients exist (cf. Salton and McGill, 1983; Strehl et al., 2000; Hammouda, 2001). Another association coefficient is the MinMax distance used in the experiments of Bellot and El-Bèze (2000):

$\hat{s}_{MinMax}(d_i, d_k) = 1 - \frac{\sum_{j=1}^{m} \min(d_{ij}, d_{kj})}{\max\left( \sum_{j=1}^{m} d_{ij},\ \sum_{j=1}^{m} d_{kj} \right)}$.  (2.9)

Cosine similarity. For document clustering, however, the most popular coefficient has been cosine similarity, which measures the angle between two document vectors (Salton and McGill, 1983) and thus ignores any differences in length:

$s_{Cosine}(d_i, d_k) = \frac{d_i^T d_k}{\|d_i\|_2 \cdot \|d_k\|_2} = \sum_{j=1}^{m} d_{ij} d_{kj} \Big/ \sqrt{\sum_{j=1}^{m} d_{ij}^2 \cdot \sum_{j=1}^{m} d_{kj}^2}$.  (2.10)

If the vectors are normalised to unit length (as is often done in document clustering, see Section 3.3.3), it can be shown that cosine similarity is equivalent to the Euclidean distance metric. In fact, the following relation holds (see the Appendix for the proof, Section B.1):

$\hat{s}_{Euclid}(d_i, d_j) = \sqrt{2 - 2 s_{Cosine}(d_i, d_j)}$, if $\|d_i\|_2 = \|d_j\|_2 = 1$.  (2.11)

Cosine similarity is efficient to compute and, having repeatedly produced reliable results for document clustering, is the common choice in most IR clustering systems.[5]

[5] Tombros and van Rijsbergen (2001), working on an ad-hoc clustering system for document retrieval, propose to extend cosine similarity by a bias towards terms that were present in the initial user query. In their model, overlap in query terms is rewarded by an extra addition to the similarity score. They report significant improvements over plain cosine measures. However, for obvious reasons the effect of their method is much greater for queries with many “OR”-connected terms than for the more typical queries consisting of a few “AND”-connected terms, where by definition each retrieved document must contain all the query terms.

Statistical Coefficients. More sophisticated similarity measures with a sound statistical foundation exist, including the Pearson correlation coefficient (cf. Sneath and Sokal, 1973):

$s_{Pearson}(d_i, d_k) = \frac{1}{2} \left( \frac{(d_i - \bar{d}_i)^T (d_k - \bar{d}_k)}{\|d_i - \bar{d}_i\|_2 \cdot \|d_k - \bar{d}_k\|_2} + 1 \right)$,  (2.12)

where $\bar{d}_i$ is the average feature value of $d_i$ over all dimensions ($\bar{d}_i = \sum_{j=1}^{m} d_{ij} / m$).

The significance of the computationally demanding correlation coefficients for document clustering is small. A study by Strehl et al. (2000) compared Pearson correlation with the Euclidean, cosine, and extended Jaccard coefficients. Their results indicate that the cosine and Jaccard coefficients are the most useful for clustering, with the Pearson coefficient not too far behind and Euclidean distance performing poorly.

A probabilistic approach, measuring the expected overlap between two documents under a particular “corpus model”, is presented by Goldszmidt and Sahami (1998). In their experiments the probabilistic measure performed favourably compared to cosine similarity, which they show to be a special case of probabilistic overlap.

2.2.2 Inter-Cluster Similarity

A substantial number of clustering algorithms (notably the hierarchical agglomerative algorithms discussed in Section 2.4.1) require similarity computations not just among individual objects but also among entire clusters (with the single-object cluster being just a special case). Often the comparison is reduced to a comparison between two individual vectors that “represent” the two clusters. Different methods for determining those “representatives” may lead to very different results. The choice of method therefore often reflects the designer’s assumptions/wishes about the clusters to be found.

In the following discussion the vector $_{[j]}r_i^x$ denotes the vector representing cluster $C_i$, where $x$ indicates the method used. If the choice of the cluster representative depends on the second cluster under consideration, the optional prefix $j$ indicates that second cluster. Unless stated otherwise, it is assumed that all clusters are disjoint ($C_1 \cap C_2 = \emptyset$).

Given two clusters $C_1$ and $C_2$, the following methods are commonly used to compute a measure of similarity $S(C_1, C_2)$ between them (cf. also Sneath and Sokal, 1973; Jain and Dubes, 1988):

Centroid. Each cluster is represented by an averaged version (the mean) of all its constituents:

$S(C_1, C_2) = s(r_1^c, r_2^c)$, with $r_i^c = \frac{\sum d_j}{n_i}$, $d_j \in C_i$.  (2.13)

Sometimes centroids are normalised to have unit length:

$\hat{r}_i^c = \frac{\sum d_j}{\|\sum d_j\|_2}$, $d_j \in C_i$.  (2.14)

Schütze and Silverstein (1997) work with truncated centroids, but they have not had many followers. Centroids are usually straightforward to calculate and maintain.

Medoid. The centroid being an artificial and sometimes unnatural choice, a “most central” member of the cluster, a medoid, is occasionally chosen instead. The medoid can be defined as follows:

$S(C_1, C_2) = s(r_1^m, r_2^m)$, with $r_i^m = \arg\max_{d_j \in C_i} s(d_j, r_i^c)$.  (2.15)

One of the medoid’s advantages is the reduced influence of outliers, but it also requires more time to determine than the centroid. Medoids have been used in various clustering algorithms such as PAM[6], CLARA[7] and CLARANS[8]. See Kaufman and Rousseeuw (1990) and Chu et al. (2002) for an overview of medoid-based algorithms.

[6] PAM is an acronym for Partitioning Around Medoids.
[7] CLARA is an acronym for Clustering LARge Applications.
[8] CLARANS is an acronym for Clustering Large Applications based upon RANdomized Search.

Nearest-Neighbour. Nearest-neighbour representation (also known as the single-linkage method) chooses for each pair of clusters the nearest two vectors:

$S(C_1, C_2) = \max s(d_a, d_b)$, with $d_a \in C_1$, $d_b \in C_2$.  (2.16)

The representative vector thus depends on the second cluster ($C_j$) and may be different for every comparison:

$_j r_i^{nn} = \arg\max_{d_a \in C_i} s(d_a, d_b)$, with $d_b \in C_j$.  (2.17)

Nearest-neighbour comparisons thus always take the “best” match. All the other cluster members have no influence.

Furthest-Neighbour. The opposite of nearest neighbour is furthest neighbour (complete linkage), which uses the most distant member:

$S(C_1, C_2) = \min s(d_a, d_b)$, with $d_a \in C_1$, $d_b \in C_2$,  (2.18)

and

$_j r_i^{fn} = \arg\min_{d_a \in C_i} s(d_a, d_b)$, with $d_b \in C_j$.  (2.19)

All members are thus taken into account (“complete linkage”) and the “worst” match is chosen.

Group Average. The group average method does not use an individual vector to represent a cluster. Instead, the arithmetic mean of all individual vector comparisons between the two clusters is taken:

$S(C_1, C_2) = \frac{1}{n_1 \cdot n_2} \sum_{d_a \in C_1,\, d_b \in C_2} s(d_a, d_b)$.  (2.20)

Minimum Variance. The minimum variance method generates a distance measure between two groups by measuring the information loss (= error sum of squares = ESS) incurred by their merging into one group. The smaller the information loss, the more similar the two groups:

$\hat{S}(C_1, C_2) = ESS(C_1 \cup C_2) - ESS(C_1) - ESS(C_2)$, with $ESS(C_i) = \sum_{d_j \in C_i} \|d_j - r_i^c\|_2^2$.  (2.21)

As $ESS(C_i) = n_i V(C_i)$, the same can be expressed in terms of variance $V$:

$\hat{S}(C_1, C_2) = (n_1 + n_2) V(C_1 \cup C_2) - n_1 V(C_1) - n_2 V(C_2)$, with $V(C_i) = \frac{1}{n_i} \sum_{d_j \in C_i} (d_j - r_i^c)^T (d_j - r_i^c)$.  (2.22)

Equation 2.21 resp. 2.22 can be further simplified to[9]

$\hat{S}(C_1, C_2) = \frac{\|r_1^c - r_2^c\|_2^2}{\frac{1}{n_1} + \frac{1}{n_2}}$.  (2.23)

[9] Cf. Appendix B.2.

Other Cluster Representative Selection Models. Clustering with the aforementioned methods is often suitable for some distributions but unsuitable for others. For instance, furthest neighbour is good at recognising small, tight clusters, but bad at dealing with outliers and loose clusters, while nearest neighbour often results in undesirably long and “thin” clusters, and minimum variance has a marked tendency towards producing clusters of equal sizes. This prompted the development of more sophisticated cluster comparison methods:

• One such method is implemented in CURE[10] (Guha et al., 1998, 2003). The method works on the nearest-neighbour principle, but instead of all the members of a cluster it uses a specific set of “representatives”. These are a constant number of data points for each cluster, calculated as follows: the first representative is the point furthest away from the centroid of the cluster; the second representative is the cluster point furthest away from the first representative; the third is the one furthest away from the second; and so on, until the desired number $r$ of representatives is found. Before single-linkage sets in, all representatives are shrunk towards the cluster centroid by a constant factor $\alpha \in [0, 1]$. It has been shown that CURE is able to detect clusters of unusual forms that are not identified by other measures.

• In the domain of document clustering a related idea has been put forth by Bellot and El-Bèze (2000). As in the medoid approach, their goal is to work with real documents only. Instead of a single medoid, however, they use the $r$ documents nearest to the centroid. Comparisons between these representatives and/or individual documents are again based on the single-linkage principle.

• Cutting et al. (1992) use a centroid approach, but rather than computing the centroid of all documents they suggest using a so-called “trimmed profile” (a sort of centroid), which cancels out the effects of outliers. The trimmed profile is computed from the $r$ documents nearest to the “untrimmed” centroid ($r$ being either a constant or a fraction of the number of documents in the cluster).

Other approaches to measuring similarity exist, for example those based on the Mahalanobis distance, which also takes the within-cluster correlations of individual variables (terms) into account (cf. Everitt, 1993), or those based on the number of “shared nearest neighbours” (cf. Ertöz et al., 2003).
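To make the basic linkage variants of Section 2.2.2 concrete, here is a minimal Python sketch (illustrative only, using cosine similarity and toy clusters invented for the example) of the centroid, nearest-neighbour, furthest-neighbour and group-average comparisons:

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def centroid(cluster):
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]          # Eq. 2.13

def centroid_sim(c1, c2):
    return cosine(centroid(c1), centroid(c2))

def single_link(c1, c2):                                    # Eq. 2.16
    return max(cosine(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):                                  # Eq. 2.18
    return min(cosine(a, b) for a in c1 for b in c2)

def group_average(c1, c2):                                  # Eq. 2.20
    return sum(cosine(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

C1 = [[1.0, 0.1], [0.9, 0.2]]    # toy cluster 1
C2 = [[0.1, 1.0], [0.3, 0.9]]    # toy cluster 2
for f in (centroid_sim, single_link, complete_link, group_average):
    print(f.__name__, round(f(C1, C2), 3))
```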

2.3 Non-Hierarchical Algorithms

In theory, cluster analysis is absolutely trivial: three simple steps suffice (see Figure 2.2). In practice, however, this approach turns out to be unfeasible, for the number of possible cluster solutions increases dramatically with the number of objects and clusters. For $n$ objects and $k$ non-hierarchical, non-overlapping clusters the number of possible cluster solutions is approximately $k^n / k!$ (Duda and Hart, 1973, 226), which means that for the modest number of 50 objects and 5 clusters already over $10^{32}$ cluster solutions would have to be checked. For most real applications, exhaustive search is therefore out of the question and heuristic optimisation methods are compulsory.

Cluster algorithms come in an abundance of flavours. Generally, they can be divided into two large classes: hierarchical and non-hierarchical algorithms.
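The magnitude of the search space claimed above is easy to verify; a two-line Python check (with the numbers from the 50-object, 5-cluster example) confirms it:

```python
import math

n, k = 50, 5
print(k ** n / math.factorial(k))   # ~7.4e32 possible cluster solutions
```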

[10] CURE is an acronym for Clustering Using REpresentatives.

1. Select a suitable clustering criterion function $\Psi(\mathcal{C})$, also known as the objective function. It is a measure of the quality of a cluster solution $\mathcal{C}$.

2. Enumerate all possible cluster solutions $\mathcal{C}_1, \ldots, \mathcal{C}_z$ and compute the value of the objective function for each solution.

3. Pick the solution with the highest value.

Figure 2.2: Trivial clustering algorithm. In theory cluster analysis is a trivial task which can be solved by three simple steps.

• Hierarchical algorithms produce not just a bunch of clusters: by either merging or dividing clusters they create entire hierarchies, similar to those known from biological taxonomy. They are dealt with in Section 2.4.

• Non-hierarchical algorithms work mostly on the hill-climbing principle and produce a flat selection of clusters. Their expressiveness is therefore smaller than that of the hierarchical algorithms. Non-hierarchical algorithms are dealt with in the rest of this section.

2.3.1 Iterative Partitional Clustering

The most typical non-hierarchical clustering algorithms are the partitional iterative algorithms. They come with a complexity order of O(n) or O(n log n) (Willett, 1988) and are often used to good effect in reasonable time, with many successful applications in document clustering. The standard forms of iterative partitional clustering are dealt with in the present section, while Section 2.3.2 gives a brief overview of some alternatives and deviations.

A partitional iterative algorithm is characterised by

• a series of iterations or repetitions by which the cluster solution is incrementally optimised with regard to a certain cluster criterion;

• the objects being divided into real partitions, i.e. all objects belong to exactly one cluster, with no overlap between clusters and no superimposed structure (such as a hierarchy).

Most iterative algorithms are non-deterministic, i.e. different initial seeds may lead to entirely different solutions because the hill-climbing technique looks for a local optimum. An iterative clustering process involves the five principal steps shown in Figure 2.3 (cf. Modha and Spangler, 2003; Everitt, 1993, Chapter 5). Identifying the suitable number of clusters k in step 2 is often crucial and is further discussed in Sections 2.5.3 and 2.6.4. The other steps are dealt with one by one in what follows.

2.3.1.1 Cluster Criterion Function

We distinguish local and global cluster criteria. Local criteria work by simply determining the nearest cluster for each document. The algorithm thus keeps moving documents to their nearest clusters. Since the clusters change in the process, the problem is not solved in a single step and many iterations may be necessary until a stable solution is found. The nearest cluster can be found by using the similarity measures outlined in Section 2.2.2.

1. Decide upon a cluster criterion.

2. Select an initial (random) partition with k clusters.

3. With regard to the cluster criterion, evaluate all possible moves of individual documents from one cluster to another.

4. Perform those move(s) that actually improve the cluster solution.

5. Repeat steps 3 and 4 until a stop criterion is met.

Figure 2.3: Iterative clustering algorithm. The five principal steps of an iterative clustering algorithm.

Global criteria are defined as properties of the whole cluster solution: $\Psi(\mathcal{C})$. Each possible move of a document between two clusters is thus evaluated in terms of its effect on $\Psi$, which depending on its nature is either minimised or maximised. As shown below, some of the local decision criteria can be expressed by global equivalents and vice versa. Following Zhao and Karypis (2001) we discuss three types of global criterion functions: internal, external and hybrid criteria.[11]

Internal Criterion Functions. Clustering criteria measuring the compactness (cohesiveness) of the clusters are known as internal criterion functions. They are by far the most common and are calculated as a sum of individual values $E(C)$ for each cluster:

$\Psi(\mathcal{C}) = \sum_{C_i \in \mathcal{C}} E(C_i)$.  (2.24)

In an attempt to bring systematic order into these internal cluster criterion functions $E(C)$, we suggest writing them as a combination of three abstract factors:

$E(C_i) = w_i \frac{S_i}{a_i}$,  (2.25)

where the individual components have the following meaning:

• Weight $w_i$ indicates whether clusters are weighted by their size ($w_i = n_i$) or whether all clusters are weighted uniformly ($w_i = 1$). In most cases the former approach is chosen.

• Distortion measure $S$ measures how much a cluster is distorted, i.e. how much it differs from the “perfect” case where all documents in the cluster are identical. There are two important choices to be made: (1) the selection of a similarity measure (typically cosine or squared Euclidean distance), and (2) whether to compute pairwise similarities among all cluster members ($\sum\sum s(d_i, d_j)$) or similarities between the members and a centroid vector ($\sum s(d, r^c)$).

• Averaging factor $a$ depends on $S$ and is used to make the distortion measures comparable among clusters of different sizes. If $S$ is a sum of pairwise similarities ($\sum\sum s(d_i, d_j)$), then $a = 2 \cdot n_i^2$, so that $S/a$ indicates the average similarity between two documents. If centroid comparison is chosen, then $a = n_i$ and $S/a$ is the average similarity of a document with its cluster centroid.

[11] We also follow their notation in describing the individual criteria as $I_1$, $I_2$, $I_3$, $E_1$, $E_2$, $H_1$ and $H_2$.

Various combinations of $w$, $S$, $a$ are possible, and several can be shown to have equivalent effect. For a comparative examination of the various functions and values for $w$, $S$, $a$ see Appendix B.3. In practice only a few have been used regularly:

• Classical k-Means. The classical k-means algorithm (MacQueen, 1967) uses the local decision criterion which assigns each object to the nearest cluster centre as measured by the Euclidean distance. There is an equivalent global decision criterion, characterised by $w = |C_i|$, $S = \sum \hat{s}_{Euclid}(d, r^c)^2$ and $a = n_i$:

$I_3: \min_{\mathcal{C}} \Psi(\mathcal{C}) = \sum_{C_i \in \mathcal{C}} E(C_i)$, with

$E(C_i) = n_i \cdot \frac{\sum_{d_j \in C_i} \hat{s}_{Euclid}(d_j, r_i^c)^2}{n_i} = \sum_{d_j \in C_i} \|d_j - r_i^c\|_2^2$.  (2.26)

This is the same as the minimum variance approach known from Equation 2.22. As further shown by Duda and Hart (1973, 219–220), squared Euclidean centroid comparison ($\sum \hat{s}_{Euclid}(d_j, r_i^c)^2$, as in Eq. 2.26) is identical to squared Euclidean pairwise similarity ($\sum\sum \hat{s}_{Euclid}(d_j, d_k)^2$).[12]

In document clustering, classical k-means has only been moderately successful. Its relative failure is usually attributed to the inadequacy of the Euclidean distance measure in the high-dimensional document space (e.g. Strehl et al., 2000).

• Vector Space k-Means. The so-called vector space variant of k-means uses the cosine similarity measure for finding the nearest cluster centroid. It enjoys great popularity for document clustering (Salton and McGill, 1983; Cutting et al., 1992; Dhillon and Modha, 2001; Zhao and Karypis, 2003). The equivalent global cluster criterion is characterised by $w = |C_i|$, $S = \sum s_{Cosine}(d, r^c)$ and $a = n_i$:

$I_2: \max_{\mathcal{C}} \Psi(\mathcal{C}) = \sum_{C_i \in \mathcal{C}} E(C_i)$, with

$E(C_i) = n_i \cdot \frac{\sum_{d_j \in C_i} s_{Cosine}(d_j, r_i^c)}{n_i} = \sum_{d_j \in C_i} \cos(d_j, r_i^c)$.  (2.27)

[12] The latter is $I_1$ in the notation of Zhao and Karypis (2001). Note that the centroid used here is the one defined in Equation 2.13, which does not have unit length even if the document vectors do. Except for an unimportant linear transformation, the two Euclidean measures can also be shown to be equivalent to pairwise cosine similarity if the documents are normalised to unit length. See Appendix B.3 for the details of these two equivalences.

An extension of the vector space variant is toric k-means for vectors consisting of two or more totally independent parts (Modha and Spangler, 2000).[13] A further generalisation of these and other k-means clustering criteria is described by Modha and Spangler (2003). Frigui and Nasraoui (2004) introduce an interesting algorithm which allows each cluster to have its own individual weighting component for each dimension. Cluster memberships and cluster feature weightings are then simultaneously optimised.
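The following Python sketch (a bare-bones toy, with invented data and none of the refinements discussed below) illustrates the vector space variant: documents and centroids are normalised to unit length and each document is assigned to the centroid with the highest cosine similarity, in the spirit of the criterion of Equation 2.27:

```python
import random

def unit(v):
    n = sum(x * x for x in v) ** 0.5
    return [x / n for x in v] if n else v

def kmeans_cosine(docs, k, iterations=20, seed=0):
    """Vector space (spherical) k-means: cosine similarity to unit centroids."""
    random.seed(seed)
    docs = [unit(d) for d in docs]
    centroids = [list(d) for d in random.sample(docs, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in docs:  # assign each document to the most similar centroid
            sims = [sum(x * y for x, y in zip(d, c)) for c in centroids]
            clusters[sims.index(max(sims))].append(d)
        for i, members in enumerate(clusters):  # recompute unit centroids (Eq. 2.14)
            if members:
                centroids[i] = unit([sum(col) for col in zip(*members)])
    return clusters

docs = [[5, 1], [4, 1], [1, 5], [1, 4], [3, 3]]   # toy 2-d "documents"
for i, cluster in enumerate(kmeans_cosine(docs, k=2)):
    print(i, cluster)
```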

Comparison of Classical and Vector Space Variants. In order to compare the above two variants of k-means, once more following Zhao and Karypis (2001), we define the composite vector $y_i$ of cluster $C_i$ as

$y_i = \sum_{d_j \in C_i} d_j$.  (2.28)

The objective functions of the two variants of k-means can then be re-stated as follows (see Appendix B.3, cases C and E):

Classical k-means (Euclid): $\Psi(C_1, \ldots, C_k) = n - \sum_{i=1}^{k} \frac{\|y_i\|_2^2}{n_i}$,  (2.29)

Vector space variant (cosine): $\Psi(C_1, \ldots, C_k) = \sum_{i=1}^{k} \frac{\|y_i\|_2^2}{\|y_i\|_2} = \sum_{i=1}^{k} \|y_i\|_2$.  (2.30)

Abstracting from an irrelevant linear transformation, these two measures differ only in the divisor of $\|y_i\|_2^2$: in the Euclidean case the divisor is $n_i$, while for the cosine variant it is $\|y_i\|_2$, resulting in two different clustering behaviours.

In a geometrical interpretation, the characteristics of the two measures can be visualised by the boundary drawn between two neighbouring clusters in a two-dimensional space (see also Strehl et al., 2000). Of course, the cluster boundary is equal to the set of points which have equal distance (similarity) to both cluster centres (see Figure 2.4). The points of Euclidean equidistance lie on the dashed perpendicular bisector of the straight line connecting $r_1^c$ and $r_2^c$. The points of cosine equidistance rest on the dotted bisector of the angle between $r_1^c$ and $r_2^c$. The two boundaries are obviously quite distinct from each other. Only if the centroid vectors have identical length (as happens when they are normalised to unit length) do the two boundaries coincide.

External Criterion Functions. In addition to the traditional internal clustering criteria measuring cluster cohesion, Zhao and Karypis (2001) also suggest a criterion to measure the separation between clusters. Their proposed formula results in clusters whose centroids are as far away from the overall collection centroid as possible, with each cluster being weighted by the number of its members. In analogy to Equation 2.28, let there be a composite vector $y_S$ of the entire collection:

$y_S = \sum_{i=1}^{k} y_i = \sum_{j=1}^{n} d_j$.  (2.31)

[13] In Modha and Spangler’s example the documents are represented by a combination of three independent vectors: a word space vector, a vector of incoming links and a vector of outgoing links.


Figure 2.4: Cluster boundaries. Boundary between two cluster centres $r_1^c$ and $r_2^c$. Dashed line: boundary under Euclidean similarity. Dotted line: boundary under cosine similarity.

The collection centroid $r_S^c$ can then be defined as follows:

$r_S^c = \frac{y_S}{n}$.  (2.32)

A simple external criterion is defined by

$E_1: \min_{\mathcal{C}} \Psi(\mathcal{C}) = \sum_{C_i \in \mathcal{C}} E(C_i)$, with

$E(C_i) = n_i \cdot \cos(r_i^c, r_S^c) = n_i \cdot \frac{y_i^T y_S}{\|y_i\| \cdot \|y_S\|}$.  (2.33)

Zhao and Karypis also examine the analogous measure with the Euclidean distance ($E_2$). It can be shown that this leads to the same result as maximising Euclidean within-cluster compactness (see Appendix B.4).

Hybrid Criterion Functions. Hybrid criteria are formulated as combinations of internal and external criteria. Zhao and Karypis (2001) examine two such cases. The first is a combination of average pairwise cosine similarity[14] and the external criterion $E_1$ defined in Equation 2.33 above, leading to the $H_1 = \frac{I_1}{E_1}$ criterion:

$H_1: \max_{\mathcal{C}} \Psi(\mathcal{C}) = \frac{\sum_{i=1}^{k} \|y_i\|^2 / n_i}{\sum_{i=1}^{k} n_i \, (y_i^T y_S) / \|y_i\|}$.  (2.34)

Their second hybrid criterion $H_2 = \frac{I_2}{E_1}$ is a combination of average centroid similarity with the cosine (Eq. 2.27) and the same external criterion defined above (Eq. 2.33):

$H_2: \max_{\mathcal{C}} \Psi(\mathcal{C}) = \frac{\sum_i \|y_i\|}{\sum_i |C_i| \cdot (y_i^T y_S) / \|y_i\|}$.  (2.35)

[14] See case A in Appendix B.3. For clarity’s sake it has here been multiplied by the (immaterial) factor two. Remember that average pairwise cosine is related to the average Euclidean centroid distance.

Apart from these internal, external and hybrid functions, the study by Zhao and Karypis (2001) was supplemented with two criterion functions based on the graph model. Their extensive tests showed that internal cosine centroid similarity ($I_2$, Eq. 2.27) and the corresponding hybrid criterion $H_2$ (Eq. 2.35) performed best, followed by the hybrid criterion $H_1$ (Eq. 2.34). For a discussion of further criteria see Everitt (1993).

2.3.1.2 The Initial Partition

Iterative optimisation methods often depend heavily on the initial configuration. The more local extrema there are, the more important becomes the choice of a promising starting point. In other words, the suitable choice of the initial partition or the initial seeds for the iterative process may be crucial.

In his seminal paper on the k-means algorithm, MacQueen (1967) chose k random objects as initial cluster seeds. Related work by Jancey (1966) used random points in the feature space, while Forgy (1965) used random partitions (cf. also Lance and Williams, 1967b). Various methods have since been suggested to find better initial partitions (cf. Everitt, 1993). In particular the following methods have been adopted for document clustering:

• Buckshot. The buckshot algorithm (Cutting et al., 1992) finds initial seeds by clustering a sub-sample of size $\sqrt{kn}$ with the help of a slower but “better” algorithm. Cutting et al. suggest using group-average agglomerative clustering (see Section 2.4.1) for this purpose. The centroids of the k clusters thus found are then used as seeds for k-means. A similar approach is used by Bellot and El-Bèze (2000), who obtain the initial partition through an adaptation of single-link clustering.

• Fractionation. The fractionation algorithm (Cutting et al., 1992) also uses an agglomerative sub-routine to find initial seeds, but it is more complicated since it works with all documents and agglomerates them in a multi-step hierarchical process. The algorithm starts by sorting all documents according to a simple criterion (a lexical sort on the jth most frequent term in each document) and sequentially dividing them up into n/m “buckets” of fixed size m > k. Within these buckets an agglomerative sub-routine is used to obtain pm clusters, with p ∈ (0, 1) being a fixed “reduction factor”. In the next iteration step the documents in these clusters are represented by their centroid. These centroids are again divided into buckets of size pm, and these buckets are again clustered. This process is repeated until only the desired number of clusters k remains. This final partition is then used to generate the initial seeds for k-means. Cutting et al. show that this algorithm has rectangular running time O(mn) and is thus less complex than ordinary hierarchical methods.

• Iterative Refinement. An iterative refinement method is suggested by Bradley and Fayyad (1998). They take several small sub-samples which they cluster with k-means. From the resulting cluster centroids they compute in a second step the refined initial centroids, which are then used for the whole collection.

• Random Sampling. This widely used method does not actually result in an initial clustering, but aims to overcome the same sensitivity problem. Instead of a single k-means run, multiple runs are performed with different random seeds. In the end only the result of the best such run is kept (e.g. Zhao and Karypis, 2003). Of course, there is still no guarantee that a good initial starting point has been found.

Larsen and Aone (1999) compare random partition, buckshot and fractionation, finding comparatively little difference in the quality of the results as long as the clusters were updated continuously (see the following section).

2.3.1.3 Allocation of Documents to Clusters

After the establishment of an initial configuration, the iterative process of (re)allocating the individual objects sets in. Here too a number of small but sometimes important distinctions can be made. One important decision is whether to re-calculate the cluster centroids continuously (i.e. immediately after the re-assignment of a single object) or non-continuously (i.e. only after one round of assigning each object to the nearest cluster has been completed). Non-continuous updating used to be the standard choice for quite a while, but recent studies have clearly preferred a random-order continuous updating strategy (MacQueen, 1967; Larsen and Aone, 1999; Steinbach et al., 2000; Zhao and Karypis, 2003). In the latter case we must distinguish between “moving” individual documents from one cluster to the next and “clearing” the clusters between iterations, meaning that each iteration starts with empty clusters (“bare centroids”) before each object is freshly assigned to the nearest centroid.

Larsen and Aone moreover suggest a method called vector average damping, which lends a higher weight to the documents which have just joined a cluster vis-à-vis those that were assigned earlier. The theoretical foundation for doing so is unclear, but the procedure appears to have been successful. Several authors (Larsen and Aone, 1999; Bellot and El-Bèze, 2000; Wang and Kitsuregawa, 2001, and others) use a minimum similarity threshold which must be met before an object can be assigned to a cluster. Outliers not meeting the criterion are either dumped altogether or collected in a special “junk” cluster (Hearst and Pedersen, 1996).

2.3.1.4 Stop Criterion and Post-Processing

The choice of a suitable stop criterion depends on the individual purpose and specifics of the algorithm. Since k-means and related approaches can be shown to converge in finite time, the simplest stop criterion is to wait until no further changes occur. An alternative is to define a minimum threshold for the cluster criterion function—either an absolute value or a minimal improvement required per iteration. Time-critical applications often use a fixed number of iterations (e.g. Cutting et al., 1992). Larsen and Aone (1999) find that with continuous updating a single iteration usually leads to reasonably good results already.

Cutting et al. (1992) and Larsen and Aone (1999) also describe further ways to refine the cluster solution obtained by k-means. Their algorithms identify potentially “bad” clusters, which are split, and “close” clusters, which are joined. For a more detailed approach to cluster merging and splitting see Ding and He (2002). Liu et al. (2002) describe another refinement method which identifies discriminative features for each cluster.[15] Based on these discriminative features the documents are (re)assigned to individual clusters. Just like the original k-means algorithm, this process is iterated until it converges, with the discriminative features freshly determined after each iteration.

[15] They call a feature discriminative if its discriminative feature metric (DFM) exceeds a certain minimum threshold $\theta$:

$DFM(f_j) = \log \frac{\max_i g_j(C_i)}{\left( \sum_i g_j(C_i) - \max_i g_j(C_i) \right) / (k - 1)} > \theta$,  (2.36)

where $g_j(C_i)$ indicates the frequency of feature $f_j$ in cluster $C_i$. It is not entirely clear whether these frequencies should be document frequencies or actual term frequencies, nor whether absolute or relative frequencies are meant (which can make a difference if the clusters differ in size).

2.3.2 Alternative Clustering Algorithms

Apart from the iterative partitional methods just examined and the yet-to-be-treated group of hierarchical algorithms (Section 2.4), countless further clustering algorithms exist. This section aims to introduce a few which have a certain significance for document clustering.

2.3.2.1 Single-Pass Sequential Algorithms

Under heavy time and space constraints the use of a sequential single-pass algorithm may become mandatory. A variety of variants exists (cf. Kaufman and Rousseeuw, 1990, 156–157), but the basic principle remains the same: the data set is traversed just once; each object is added to the “nearest” cluster unless the distance exceeds a certain threshold, in which case a new cluster is built from this document. The approach is efficient but has a number of drawbacks: the result is strongly dependent on the order of the data, it is difficult to decide upon a suitable threshold, it is impossible to predict the number of clusters that result, and no theoretical foundation exists in the form of a global optimisation criterion.

A similar method has been suggested, among others, by MacQueen (1967) as a variant of the original k-means algorithm. His algorithm includes tests whether two clusters exceed a certain similarity threshold, in which case the clusters are merged. Such incremental clustering algorithms may be especially useful in a Web search context where the documents arrive one by one and need to be clustered quickly (Hatzivassiloglou et al., 2000; Hammouda, 2001). See also BIRCH (Section 2.4.1.5) and DBSCAN (Section 2.3.2.2). Unfortunately, the cluster solutions thus produced are often of limited quality.

A specific and fast algorithm for document clustering is Suffix Tree Clustering (Zamir and Etzioni, 1998, 1999). Scanning the documents one by one, a data structure called a suffix tree is constantly updated. It serves to identify phrases common to different documents. In a second processing phase, clusters are distilled from the phrases and the corresponding documents. The algorithm appears to be a promising method for fast on-line clustering.
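The basic single-pass principle can be stated in a few lines of Python (a hedged sketch with an invented distance threshold and toy data; real systems would add the refinements discussed above):

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_pass(objects, threshold):
    """Traverse the data once; join the nearest cluster (tracked by its
    centroid) or open a new one if the nearest is too far away."""
    centroids, clusters = [], []
    for obj in objects:
        if centroids:
            dists = [euclidean(obj, c) for c in centroids]
            i = dists.index(min(dists))
            if dists[i] <= threshold:
                clusters[i].append(obj)
                n = len(clusters[i])        # incremental centroid update
                centroids[i] = [(c * (n - 1) + x) / n
                                for c, x in zip(centroids[i], obj)]
                continue
        centroids.append(list(obj))          # start a new cluster
        clusters.append([obj])
    return clusters

data = [[0.1, 0.2], [0.2, 0.1], [5.0, 5.1], [5.2, 4.9], [0.15, 0.15]]
print(single_pass(data, threshold=1.0))     # two clusters for this toy data
```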

2.3.2.2 Density-Based Clustering

Density-based clustering relies on a spatial notion of “density”. A data point p lies within a dense area if there is a minimal number m of points which lie in an ε-neighbourhood of p. Separate conditions apply for the points at the border of a dense region. Density-connected points form a cluster.

The standard DBSCAN[16] algorithm (Ester et al., 1996) has a complexity of the order O(n log n). Its great advantage (compared to the common partitional clustering methods) is that it is not restricted to convex clusters but can recognise arbitrarily shaped clusters (cf. Figure 2.5). On the other hand, it is difficult to identify appropriate parameters m and ε. DBSCAN is rarely used for document clustering, one exception being the work by Wen et al. (2001) and Wen and Zhang (2003).

[16] DBSCAN is an acronym for Density Based Spatial Clustering of Applications with Noise.


Figure 2.5: Two non-convex clusters. They cannot be identified by traditional k-means and related partitional methods. For density-based approaches they pose no problem.

2.3.2.3 Probabilistic Models The k-means, the hierarchical and many other algorithms share a lack of a firm statistical foun- dation, a deficit that is specifically addressed by the class of probabilistic cluster algorithms. They view the clustering problem as one of identifying hidden component functions which generate the observed data. Once identified, each component function can then be interpreted as forming a cluster. For a detailed introduction to these finite mixture models see Titterington et al. (1985). Below follows just a short summary (see also Duda and Hart, 1973; Everitt, 1993; Bradley et al., 1998). The goal is to find function parameters Θ = (θ1,...,θk) and weights Π = (π1, . . . , πk) which lead to an optimal estimation of the following probability density function:

p(d Θ, Π) = π f (d θ )+ ... + π f (d θ ), d . (2.37) | 1 1 | 1 k k | k ∈S The functions f1 ...fk denote the individual component densities, while π1 . . . πk denote k different weights or prior probabilities (mixing parameters) of the components (with i=1 πi = 1; π 0). i ≥ Since no direct solutions exist for estimating Θ and Π, iterative optimisation methodsP must be used. The most popular of these is the expectation-maximisation (EM) method (Dempster et al., 1977), a maximum-likelihood approach which maximises the following log-likelihood function:

L(S | Θ, Π) = Σ_{d ∈ S} log [ Σ_{i=1}^{k} π_i f_i(d | θ_i) ].  (2.38)

The algorithm iteratively and alternately re-estimates the mixture probabilities π_i in an expectation step and the model parameters θ_i in a maximisation step until L converges to a local maximum. In the end, cluster membership probabilities can be calculated for each document according to Bayes’ principle:

p(d ∈ C_i) = p(C_i | d) = p(C_i) p(d | C_i) / p(d) = π_i p(d | f_i) / Σ_{j=1..k} π_j p(d | f_j).  (2.39)

Choosing for each document the cluster with the highest probability results in a non-overlapping partition of all documents. A powerful algorithm for solving this very general model is AutoClass (Cheeseman et al., 1988; Cheeseman and Stutz, 1996). In document clustering it has only occasionally been used (e. g. Goldszmidt and Sahami, 1998; Boley et al., 1999a), restricted and extended models (see below) having enjoyed more popularity. In practice, a number of simplifications are frequently encountered:

❼ The component functions f1 ...fk are almost always assumed to be multivariate normal distributions (Gaussian distributions), which reduces the parameter space to mean vectors and covariance matrices (see e. g. Liu et al., 2002):

θ_i = (μ_i, Σ_i).  (2.40)

The individual component density functions thus have this common form:

f(d | μ_i, Σ_i) = 1 / sqrt( (2π)^m |Σ_i| ) · e^{ −(1/2) (d − μ_i)^T Σ_i^{−1} (d − μ_i) },  (2.41)

with |Σ_i| being the determinant of the covariance matrix, Σ_i^{−1} the matrix inverse, π the circle constant and T indicating vector transposition.17

❼ During EM-optimisation, instead of viewing the documents as the sum of probabilities over all clusters, they are assigned to one single cluster (e. g. Fasulo, 1999).

❼ Often the assumption is made that the individual attributes (terms) occur independently of each other within a document. This assumption is also inherent to the classical bag-of-words model used in various other algorithms.

❼ The number of components (clusters) is fixed at k. This can often be problematical when k does not correspond to the number of underlying components. Smyth (1996) discusses various optimisation methods aimed at obtaining the best value for k (see also Liu et al., 2002).

It can be shown that by adding two further restrictions the EM-approach can be reduced to the classical k-means algorithm (Bradley and Fayyad, 1998). They are:

17The EM-steps for multivariate Gaussians then look as follows (Liu et al., 2002):

Expectation step. The new expectations π̂ are calculated from the old probabilities π and model estimates:

π̂_i = (1/n) Σ_{j=1}^{n} p(C_i | d_j),  with  (2.42)

p(C_i | d_j) = π_i f(d_j | μ_i, Σ_i) / Σ_{l=1}^{k} π_l f(d_j | μ_l, Σ_l).  (2.43)

Maximisation step. The model parameters are updated to maximise the log-likelihood:

μ̂_i = Σ_{j=1}^{n} p(C_i | d_j) d_j / Σ_{j=1}^{n} p(C_i | d_j),  (2.44)

Σ̂_i = Σ_{j=1}^{n} p(C_i | d_j)(d_j − μ̂_i)(d_j − μ̂_i)^T / Σ_{j=1}^{n} p(C_i | d_j).  (2.45)

❼ Σ_i = σ²I, where I is the identity matrix; i. e. the clusters are modelled as spherical Gaussian distributions,

❼ the mixture weights are all equal (π_1 = ... = π_k = 1/k).

Fasulo (1999) reviews work by Banfield and Raftery (1993) which explores various other restrictions on the general mixture model. Bradley et al. (1998) present a method to scale EM-clustering to large databases. A latent class approach (cf. Latent Semantic Analysis, Section 3.4.3.2) is discussed in a probabilistic setting by Hofmann (1999a,b) and Vinokourov and Girolami (2000).

Despite their theoretical superiority, mixture models are not universally preferred over k-means or hierarchical models. One reason is the often large number of free parameters, which not only increases computation cost but also leads to numerous undesirable local maxima, without there being a sure way to overcome them. Furthermore, the EM-algorithm sometimes converges only very slowly and different initial seeds may be required (Everitt, 1993). Finally, mixture models fail if the underlying components are not identifiable (Duda and Hart, 1973, 190–191, 205).
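As a concrete illustration of the EM iteration sketched in footnote 17, here is a hedged sketch for the common spherical restriction Σ_i = σ_i²I. The initialisation, the smoothing constants and the fixed iteration count are ad-hoc choices, and the variance update reuses the distances to the previous means for brevity.

    import numpy as np

    def em_spherical(X, k, iters=50, seed=0):
        n, m = X.shape
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(n, k, replace=False)].copy()  # means: k random documents
        var = np.full(k, X.var() + 1e-9)                # one variance per cluster
        pi = np.full(k, 1.0 / k)                        # mixing weights
        for _ in range(iters):
            # E-step: responsibilities p(C_i | d_j), cf. Eq. (2.43)
            d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            logp = np.log(pi) - 0.5 * m * np.log(2 * np.pi * var) - d2 / (2 * var)
            logp -= logp.max(axis=1, keepdims=True)     # numerical stabilisation
            r = np.exp(logp)
            r /= r.sum(axis=1, keepdims=True)
            # M-step: weights, means, variances, cf. Eqs. (2.42), (2.44), (2.45)
            nk = r.sum(axis=0) + 1e-12
            pi = nk / n
            mu = (r.T @ X) / nk[:, None]
            var = (r * d2).sum(axis=0) / (m * nk) + 1e-9
        return pi, mu, var, r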

2.3.2.4 Self-Organising Maps (Kohonen Maps)

Neural networks in the form of self-organising feature maps (Kohonen, 1984, 2001) are a popular feature reduction technique and enjoy a large following in unsupervised learning research. For experiments in the document clustering area see Kaski et al. (1998); Pullwitt and Der (2001); Bakus et al. (2002); Henderson et al. (2002a,b).18

WEBSOM (Kaski et al., 1998) is a two-level approach to document clustering with SOMs. The first level maps the words onto a reduced “word category map” (in a semantic sense) by taking the immediate neighbouring words into account. The second level maps the documents as encoded by these word categories onto a final two-dimensional document map, where similar documents are represented by the same or by nearby nodes.

SOM algorithms impose an order onto the clusters by arranging them in a two-dimensional field, with related clusters placed close together. This approach only makes sense if a sufficiently large number of clusters is sought.
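A minimal on-line SOM training loop may clarify the mechanics; the grid size and the decay schedules for the learning rate and the neighbourhood radius are illustrative assumptions, not WEBSOM’s actual settings.

    import numpy as np

    def train_som(X, rows=10, cols=10, epochs=20, seed=0):
        rng = np.random.default_rng(seed)
        n, m = X.shape
        W = rng.normal(size=(rows, cols, m))           # one codebook vector per node
        grid = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols),
                                     indexing="ij")).astype(float)
        T, t = epochs * n, 0
        for _ in range(epochs):
            for x in X[rng.permutation(n)]:
                frac = 1.0 - t / T
                lr = 0.5 * frac                        # decaying learning rate
                radius = max(1.0, 0.5 * max(rows, cols) * frac)
                bmu = np.unravel_index(                # best-matching unit
                    np.argmin(((W - x) ** 2).sum(axis=2)), (rows, cols))
                g2 = ((grid - np.array(bmu)) ** 2).sum(axis=2)
                h = np.exp(-g2 / (2 * radius ** 2))    # neighbourhood kernel
                W += lr * h[:, :, None] * (x - W)      # pull nearby nodes towards x
                t += 1
        return W

After training, each document is assigned to the node (or cluster of nodes) holding its best-matching codebook vector, which yields the two-dimensional arrangement described above.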

2.3.2.5 Genetic Algorithms

Another class of methods for multi-dimensional optimisation which has gained considerable fame in artificial intelligence is that of the genetic algorithms (GAs). See Cristofor and Simovici (2001) for a recent overview of GAs for clustering. Experiments with documents are rare, and Jones et al. (1995) conclude that GAs are less suitable for this area than other algorithms.

2.3.2.6 Decision Trees

Bellot and El-Bèze (2000) describe a method to grow an unsupervised decision tree for clustering. This tree does not cluster whole documents but individual sentences. There are similarities with the suffix tree clustering mentioned above (page 33), but the idea has been little explored so far.

18Another neural network technique used for document clustering is Adaptive Resonance Theory under Constraints (ART-C) (He et al., 2002, 2003).

[Figure 2.6 shows a dendrogram over five documents d1 ... d5, with merge levels 0–4 on the vertical axis.]

Figure 2.6: A dendrogram, visualising a hierarchical arrangement of objects in a binary tree. The further up a horizontal connection, the larger the distance between the two nodes. The vertical axis can either be a pure ordinal scale or it can be used to indicate those distances.

2.4 Hierarchical Algorithms

The underlying principle of hierarchical clustering algorithms is simple and elegant: given a criterion function (measure of similarity), all objects can be arranged deterministically in a (binary) tree which provides a complete order for all objects. The tree has the individual objects as its leaves, while the nodes can be interpreted as clusters containing the objects found when the tree is traversed further downwards. The tree is typically depicted as a dendrogram (see Figure 2.6). The higher the level at which two nodes are connected, the larger the distance between them.

Hierarchical clustering strategies work either in a bottom-up (agglomerative) direction (Section 2.4.1) or in a top-down (divisive) direction (Section 2.4.2). Two other crucial factors (recurring in the discussion below) are the criterion function and the stop criterion (respectively the manner of transforming the tree into an actual set of clusters).

2.4.1 Hierarchical Agglomerative Clustering

Hierarchical agglomerative clustering (HAC) is one of the best explored cluster algorithms of all, with a particularly long tradition in taxonomy (Sneath and Sokal, 1973). Many important facets of its application in document clustering were treated in a seminal paper by Willett (1988).

In HAC all n objects are initially placed in n separate clusters. These clusters are then iteratively merged two at a time until one single cluster remains or some other stop criterion is met. The “merging history” can be depicted in a dendrogram (see Figure 2.6).

From an information-theoretic point of view, HAC is an iterative information reduction technique which is stopped when the desired granularity is reached (Ward, 1963).

2.4.1.1 Cluster Criterion Function

The key aspect of HAC is the choice of a criterion function which governs at each step the choice of clusters to be merged. As with the iterative partitional methods (Section 2.3.1), there exist both local decision criteria for merging the two “most similar” clusters and globally defined optimisation functions (cf. Willett, 1988; Everitt, 1993; Zhao and Karypis, 2002, 2003).

2.4.1.2 Local Criterion Functions

Single-Linkage. Single-linkage clustering is based on the nearest-neighbour principle (Eq. 2.16). It can be implemented by efficient algorithms (for an overview, see Willett, 1988); however, the quality of the resulting clustering has often been found to be unsatisfactory because of the “chaining effect” (forming loose clusters with little internal cohesion; Willett, 1988). A number of methods have been developed to dampen the chaining effect, one recent suggestion being CURE, which was briefly described in Section 2.2.2.

Complete-Linkage. Complete-linkage clustering is based on the furthest-neighbour principle (Eq. 2.18). It is computationally more expensive than single-linkage but works better in the presence of noise. It tends to build more compact clusters.

Group-Average (UPGMA19). Group-average hierarchical clustering is based on the group average similarity measure (Eq. 2.20). It has been adopted successfully in numerous do- mains.

Weighted Group-Average (WPGMA20). WPGMA is the weighted counterpart of UPGMA, but rarely seen, especially in the document domain. In WPGMA the individual objects in a cluster do not contribute equally to the average similarity but are weighted by the step number at which they joined the cluster. At each merging step the two groups are given equal weight in the newly merged cluster, meaning that the objects in the smaller group have a proportionally higher weight in future average calculations. The difference is illustrated by these two equations:

UPGMA:

S(C_1, {C_2 ∪ C_3}) = 1 / (n_1 (n_2 + n_3)) · Σ_{d_a ∈ C_1} [ Σ_{d_b ∈ C_2} s(d_a, d_b) + Σ_{d_b ∈ C_3} s(d_a, d_b) ];  (2.46)

WPGMA:

S(C_1, {C_2 ∪ C_3}) = (1/2) S(C_1, C_2) + (1/2) S(C_1, C_3)
= 1 / (2 n_1) · Σ_{d_a ∈ C_1} [ Σ_{d_b ∈ C_2} s(d_a, d_b) / n_2 + Σ_{d_b ∈ C_3} s(d_a, d_b) / n_3 ].  (2.47)

Centroid Comparison (UPGMC21). Hierarchical clustering based on centroids measures similarity between two clusters by comparing the two centroids (cf. Eq. 2.13, resp. Eq. 2.14). Jain and Dubes (1988, 80) recommend that it be only used in conjunction with Euclidean distance.

Weighted Centroid Comparison (WPGMC22). In strict analogy with the group average method, there exists an even rarer weighted centroids method.

19UPGMA: Unweighted Pairwise Group Method with Averages.
20WPGMA: Weighted Pairwise Group Method with Averages.
21UPGMC: Unweighted Pairwise Group Method with Centroids.
22WPGMC: Weighted Pairwise Group Method with Centroids.

Method           | α_1                      | α_2                      | β                       | γ
Single-Linkage   | 0.5                      | 0.5                      | 0                       | −0.5
Complete-Linkage | 0.5                      | 0.5                      | 0                       | 0.5
UPGMA            | n_1/(n_1+n_2)            | n_2/(n_1+n_2)            | 0                       | 0
WPGMA            | 0.5                      | 0.5                      | 0                       | 0
UPGMC            | n_1/(n_1+n_2)            | n_2/(n_1+n_2)            | −n_1 n_2/(n_1+n_2)²     | 0
WPGMC            | 0.5                      | 0.5                      | −0.25                   | 0
Ward's Method    | (n_1+n_3)/(n_1+n_2+n_3)  | (n_2+n_3)/(n_1+n_2+n_3)  | −n_3/(n_1+n_2+n_3)      | 0

Table 2.1: The Lance-Williams coefficients for local decision criteria (see Eq. 2.48).

Ward’s Method. The popular clustering method suggested by Ward (1963) derives from the minimum variance criterion introduced in Equation 2.21. At each step the algorithm chooses to merge those two clusters resulting in a minimal overall increase of “information loss”. This method has performed well in various different applications, but one of the disadvantages of Ward’s method is that it tends to produce spherical clusters even when other shapes would be more appropriate. For an application of Ward’s method to document clustering see El-Hamdouchi and Willett (1986).

The local criteria just discussed can be summarised in the cluster similarity update scheme introduced by Lance and Williams (1967a):

S(C_1 ∪ C_2, C_3) = α_1 · S_{1,3} + α_2 · S_{2,3} + β · S_{1,2} + γ · |S_{1,3} − S_{2,3}|,  with S_{i,j} = S(C_i, C_j).  (2.48)

The individual methods can then be written in terms of α1, α2, β and γ as in Table 2.1.
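The update scheme lends itself directly to a naive O(n³) implementation. The sketch below instantiates the UPGMA row of Table 2.1 on a dissimilarity matrix (β and γ vanish there, so the corresponding terms of Eq. 2.48 are omitted); it is meant as an illustration of the mechanics, not as an efficient implementation.

    import numpy as np

    def hac_lance_williams(D):
        """Agglomerate with the UPGMA coefficients; D is symmetric, zero diagonal."""
        D = D.astype(float).copy()
        np.fill_diagonal(D, np.inf)
        size = {i: 1 for i in range(len(D))}
        active = set(range(len(D)))
        merges = []
        while len(active) > 1:
            _, a, b = min((D[i, j], i, j)
                          for i in active for j in active if i < j)
            na, nb = size[a], size[b]
            for c in active - {a, b}:          # Eq. (2.48): alpha terms only
                D[a, c] = D[c, a] = (na * D[a, c] + nb * D[b, c]) / (na + nb)
            size[a] += nb
            active.remove(b)
            D[b, :] = D[:, b] = np.inf         # retire cluster b
            merges.append((a, b))
        return merges

Swapping in another row of Table 2.1 only changes the one update line, which is precisely the appeal of the Lance–Williams formulation.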

2.4.1.3 Global Criterion Functions

The global criterion functions I_1 ... H_2 discussed in detail in Section 2.3.1 can also be used for hierarchical clustering. In particular, it should be noted that Ward’s method is equivalent to the I_3 resp. I_1 criterion.

Zhao and Karypis (2002) compared these global and several local criterion functions in the context of different document clustering algorithms. For HAC they obtained the best results with UPGMA, followed closely by I_2. Tombros et al. (2002) tested the four most popular local criteria (single-link, complete-link, UPGMA and Ward’s method) in a dynamic retrieval environment. They also found that group average (UPGMA) performed best, thus confirming the results of some older static retrieval studies.

Other global criteria include the Chameleon algorithm (Karypis et al., 1999), which combines inter- and intra-cluster connectivity measures (working on graphs), and the information bottleneck method (Slonim and Tishby, 2000, cf. Section 3.4.3.1), which merges at each step those two clusters resulting in a minimal loss of mutual information between objects and features.

2.4.1.4 Stop Criterion

The HAC algorithm is usually terminated when... (or the complete cluster tree “cut” at the point where...)

❼ ...the number of clusters has been reduced to a predefined external value k (fixed number of clusters), or

❼ ...the distance between the next two clusters to be merged exceeds a certain predefined threshold (criterion-driven number of clusters).

Both cases require a user-defined external parameter to be set and it is also possible to combine the two. The problem of fixing k is further discussed in Section 2.5.3. If the cluster tree is very skewed (“outlier” documents or tiny clusters being merged in only at the very top of the tree), it may be desirable (depending on the application) to prune the tree first in order to achieve a more balanced cluster solution.

2.4.1.5 Modifications to HAC

Time complexity is a constant issue in HAC, and much effort has been invested into the design of time- and storage-efficient HAC implementations. For most criteria, algorithms of the order O(n² log n) exist; for some, the complexity can be reduced to O(n²). In earlier years single-linkage was particularly popular because of the availability of efficient quadratic algorithms (cf. Willett, 1988; Maarek et al., 2000). Maarek et al. propose a simplified algorithm for complete-linkage by which complexity can also be reduced to quadratic order. They replace the binary dendrogram by a more flexible, non-binary tree structure.

BIRCH23 is another simplified method for large data samples which uses a non-binary balanced tree structure. Rather than merging individual clusters, BIRCH scans the data linearly and updates the tree structure after every element (Zhang et al., 1996).

Constrained Agglomerative Clustering is a two-phase method suggested by Zhao and Karypis (2002, 2003) in the context of document clustering. By combining partitional and agglomerative clustering, they aim to exploit the advantages offered by the two approaches. In the first phase they make use of a partitional clustering algorithm to divide the data set into a number of large partitions which are individually clustered by an agglomerative algorithm. After the partitions have been clustered internally, the partitions themselves are also clustered agglomeratively.

2.4.2 Hierarchical Divisive Clustering

Hierarchical agglomerative clustering is perhaps the best-explored clustering method of all, but comparatively little attention has been paid so far to the divisive “top-down” alternative. Recent document clustering studies (Zhao and Karypis, 2002, 2003) indicate, however, that we may see more of it in the future.

Hierarchical Divisive Clustering (HDC) starts with one big cluster containing all the objects and proceeds by iteratively splitting clusters until the desired granularity is reached. One of the problems of the divisive approach is that by accident two very similar documents may easily be split into two different clusters at an early stage, after which there is usually no easy remedy to correct the mistake (Everitt, 1993, 55, also 82–88).

Below follow brief descriptions of two recent HDC approaches used for document clustering. Basically, these methods must answer two questions: (1) which cluster to split next, and (2) how to split that cluster. In addition, a stop criterion may be defined in analogy to HAC.

23BIRCH is an acronym for Balanced Iterative Reducing and Clustering using Hierarchies.

Principal Direction Divisive Partitioning (PDDP) (Moore et al., 1997; Boley et al., 1999a,b). In PDDP the next cluster to be split is the one with the highest scatter value (i.e. the error sum of squares—see Equation 2.21). Cluster splitting is performed by translating the documents around the origin of the vector space and then computing the principal direction (direction of maximum variance) of the object-feature matrix of the cluster.24 The hyperplane orthogonal to the principal direction and going through the origin is used to split the documents into two groups/clusters. The stop criterion is reached when the scatter value of all individual clusters is less than the scatter value of all the cluster centroids.

Repeated Bi-Sections (Steinbach et al., 2000). The bisecting approach makes use of an ordinary, non-hierarchical partitional clustering method to split clusters. Steinbach et al. choose at each step the largest cluster, which is split into two groups by the popular k-means algorithm with k = 2 (Section 2.3.1). Since the result of k-means depends on the (random) initialisation, at each step several tentative bi-sections are tested and only the solution resulting in the clusters with the highest average pairwise intra-cluster similarity (measured by the cosine similarity) is actually used.25 As a stop criterion Steinbach et al. chose a fixed number of clusters.

The approach of Steinbach et al. was further refined and generalised by Zhao and Karypis (2001, 2002, 2003), who tested a variety of global cluster criteria and different methods for determining the next cluster to be split. In particular, instead of simply splitting the largest cluster (which tends to produce clusters of similar size), they recommend as an alternative to take all clusters into consideration and then choose the split resulting in the best value for the criterion function. Since their algorithms and software were used for our own experiments, further details can be found in Section 4.2.

The repeated bi-sections approach just described combines various elements of hierarchical and non-hierarchical algorithms. Combined with an effective post-processing (refinement) method, it appears to be an excellent answer to the document clustering task (Zhao and Karypis, 2003).
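A hedged sketch of the repeated bi-sections idea follows. The 2-means routine, the number of trial bisections and the fallback for degenerate splits are simplifying assumptions; the split quality uses the centroid-length identity from footnote 25.

    import numpy as np

    def two_means(X, seed, iters=20):
        rng = np.random.default_rng(seed)
        c = X[rng.choice(len(X), 2, replace=False)].copy()
        lab = np.zeros(len(X), dtype=int)
        for _ in range(iters):
            lab = np.argmin(((X[:, None, :] - c[None]) ** 2).sum(2), axis=1)
            for j in (0, 1):
                if (lab == j).any():
                    c[j] = X[lab == j].mean(axis=0)
        return lab

    def bisecting(X, k, trials=5):
        X = X / np.linalg.norm(X, axis=1, keepdims=True)   # normalise documents
        clusters = [np.arange(len(X))]
        while len(clusters) < k:
            idx = clusters.pop(max(range(len(clusters)),
                                   key=lambda j: len(clusters[j])))
            best, best_q = None, -1.0
            for t in range(trials):
                lab = two_means(X[idx], seed=t)
                if lab.min() == lab.max():                 # degenerate split
                    continue
                q = sum(np.linalg.norm(X[idx][lab == j].mean(axis=0)) ** 2
                        for j in (0, 1))                   # footnote 25 identity
                if q > best_q:
                    best, best_q = lab, q
            if best is None:                               # fall back: halve by index
                best = np.arange(len(idx)) % 2
            clusters += [idx[best == 0], idx[best == 1]]
        return clusters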

2.5 Cluster Solutions and Properties

The present section is devoted to a number of issues primarily related to the properties of the final cluster solution. In particular, a number of methods will be mentioned that deviate from the simple “one document, one cluster” principle.

24The principal direction is defined as the eigenvector corresponding to the largest eigenvalue of the covariance matrix. The method thus bears a certain similarity to Latent Semantic Analysis (Section 3.4.3.2).
25Average pairwise intra-cluster similarity measured by the cosine is equal to the square of the Euclidean length of the non-normalised cluster centroid (derived from normalised object vectors). See Appendix B.3, case A.

2.5.1 Shape

As has been mentioned in the previous discussion, some of the standard clustering algorithms fail to recognise clusters with unusual, non-convex forms or densities (cf. Figure 2.5). It has therefore been argued that k-means is a poor choice as it is limited to the creation of convex spherical clusters (Voronoi/Dirichlet partitions).

It is not immediately clear whether this restriction is relevant for document clustering, which is characterised by very high-dimensional spaces. Next to nothing is known of the spatial distribution of real-world text documents in the feature spaces typically used. Intuition and previous experiments suggest, however, that groups occurring in real-world document sets do not have such extraordinary shapes that they could not be recognised by the traditional algorithms.

2.5.2 Structure

Significant static characteristics of a cluster solution can be described by these four attributes:

Completeness. In most of the traditional algorithms and in our discussion so far it has been assumed that cluster solutions are complete, i. e. that every object is assigned to a cluster:

C_1 ∪ ... ∪ C_k ≡ S.  (2.49)

However, so-called outliers may have undesirable effects on the cluster solution. In particular, if the aim is not to accommodate all patterns but rather to extract the main groups of a data set (e. g. the main topics of a corpus or the “nuggets”; Ertöz et al., 2003), it might be better to discard outliers or collect them in a separate “junk cluster”, as is done by Hearst and Pedersen (1996) and Maarek et al. (2000). For a discussion of outliers and their timely elimination see also Guha et al. (2003).

A different approach is applied by Wang and Kitsuregawa (2001). In a pre-processing step they try to filter out “low-quality” documents (measured in their case by in- and outgoing links). The actual cluster solution is thus restricted to “high-quality” documents, the assumption being that these better describe the interesting characteristics of the collection.

Exclusivity. Traditional cluster solutions are exclusive, i. e. each object is assigned to exactly one cluster, resulting in a partition without overlap:

C_i ∩ C_j ≡ ∅,  with i, j ∈ {1, ..., k} ∧ i ≠ j.  (2.50)

However, in a rich textual environment this can be a severe restriction, since a single document may deal with several different topics and may therefore justifiably belong to more than one cluster with equal right. The restriction cannot be overcome in traditional hierarchical methods, but for approaches based on an intermediate structure—such as sentences or phrases (Zamir and Etzioni, 1998; Bellot and El-Bèze, 2000)—assigning a document to multiple clusters is no problem. Probabilistic models (Section 2.3.2.3) can also be used to assign each object to all those clusters meeting a minimal probability threshold.

Degree of Membership. A concept closely related to exclusivity is the degree of membership of an object in a cluster. So far we have assumed this to be a binary decision, but this need not be the case. Often we will find situations where certain objects are “better” or “more typical” members of a cluster than others. It is thus possible to define a degree of membership function δ on a continuous scale rather than the customary binary scale:

δ(D_i, C_j) ∈ [0, 1].  (2.51)

The concept of non-binary degrees of membership comes in two different variations:

❼ Σ_{j=1}^{k} δ(D_i, C_j) = 1: The membership values for each object sum to 1, which is typical of the probabilistic models discussed in Section 2.3.2.3.

❼ Σ_{j=1}^{k} δ(D_i, C_j) ∈ [0, k]: The membership values for all object-cluster pairs are independent of each other and may sum to arbitrary values for each document. This principle is used by clustering methods derived from fuzzy logic (Zadeh, 1965). Unlike the probabilistic model, fuzzy clustering permits each object to be highly relevant to several clusters. See Miyamoto (1990) for an overview of fuzzy methods in information retrieval and clustering. A fast fuzzy version of the k-medoid algorithm is demonstrated by Krishnapuram et al. (1999); a method based on data bubbles is discussed by Newton and O’Brien (2002).

Of course, a continuous membership function δ is only useful if it can be communicated to the end-user in a meaningful way (e. g. by visualisation) or if it can otherwise be used in post-processing. In particular within an IR system, the danger of confusing the user by multiple occurrences of the same document must be addressed.

Nestedness. As already seen, cluster solutions can be either nested (hierarchical) or unnested (flat). It has been argued (e. g. Maarek et al., 2000) that for browsing a document collection, a hierarchical solution is both more informative and more effective than a flat one, since traversal of a hierarchy takes logarithmic time as opposed to the linear time needed to scan a flat partition. On the other hand, creating a hierarchy is usually more expensive.

For evaluation purposes, nested cluster solutions are usually transformed into a flat partition, as the latter is by far better explored.

2.5.3 Model Selection Problem

One of the most crucial questions in many real-world cluster applications (including document clustering) is determining a suitable number of clusters k, also known as the model selection problem (or at least a crucial part thereof). Without a priori knowledge there is no simple way of knowing that number. The following methods exist for arriving at a useful value for k:

External choice. Often k is simply fixed by the application designer. His choice is usually based on either experience or the restrictions of the application-interface. Alternatively, it may be left to the end-user to choose an appropriate value of k. From a practical perspective, a fixed value k is often the simplest and fastest solution.

Search for k. More sophisticated approaches involve testing several values of k and choosing the one performing best on a global validation scale. Various procedures have been suggested to achieve this. Pelleg and Moore (2000) use the Bayesian Information Criterion (BIC) to choose between different models (cluster solutions) M_k:

BIC(M_k) = L_k(S) − (p_k / 2) · log n,  (2.52)

where L_k is the log-likelihood of the data S at the maximum-likelihood point, while p_k is the number of parameters in M_k (for k-means this is k − 1 class probabilities + m·k centroid coordinates + m·k individual variances). The model M_k maximising the BIC is eventually chosen.

Liu et al. (2002) suggest performing multiple randomly initiated runs for each value of k under consideration and then selecting that value of k whose cluster solution is the most stable (i. e. the value for which the solutions from different runs are most similar to each other and which is thus believed to be most likely to correspond to the “true” solution). Similarity between solutions for a given k is measured by a mutual information metric.

Duda and Hart (1973, 241–243) describe an approach based on hypothesis testing. Smyth (1996) suggests a Monte-Carlo cross-validation technique. Li et al. (2004) present an approach based on the eigenvalues of the HH^T matrix. Rezaee et al. (1998) give an overview of several methods used in fuzzy clustering. A number of validation techniques discussed in Section 2.6.4 can also be used on a similar basis to tackle the model selection problem.

Criterion-driven determination of k. In hierarchical algorithms the number of clusters can often be determined by a certain stop criterion such as a maximum distance beyond which no two clusters can be merged. The number of clusters is therefore implicitly defined via the criterion function and a corresponding external threshold. The latter, however, remains subject to the arbitrary choice of the designer or end-user.

Similar criterion-driven values of k may also result from partitional algorithms where thresholds can be set which govern splitting and merging of clusters during the main process (e. g. with single-pass clustering or in a general post-processing refinement phase). In these cases k is outside the control of the user, but the threshold must again be determined externally.

Although much effort has gone into determining the “right” number of clusters, in the document clustering domain this may actually be an ill-posed question. As shown in a practical user study (Macskassy et al., 1998), when asked to cluster documents manually, test persons showed widely different preferences. The study further showed that although different users preferred different numbers of clusters, they were usually quite consistent in the number of clusters they chose in different situations. This may indicate that k as a user-defined parameter may after all be a sensible solution in IR. Further studies in that direction would be helpful.
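A sketch of the BIC-driven search for k in the spirit of Eq. (2.52): each candidate k is clustered with a basic k-means, scored with a spherical Gaussian log-likelihood (a simplification of the per-dimension variances mentioned in the text), and the best-scoring k is kept. All constants are illustrative.

    import numpy as np

    def kmeans(X, k, iters=50, seed=0):
        rng = np.random.default_rng(seed)
        c = X[rng.choice(len(X), k, replace=False)].copy()
        lab = np.zeros(len(X), dtype=int)
        for _ in range(iters):
            lab = np.argmin(((X[:, None, :] - c[None]) ** 2).sum(2), axis=1)
            for j in range(k):
                if (lab == j).any():
                    c[j] = X[lab == j].mean(axis=0)
        return lab, c

    def bic_score(X, lab, c):
        n, m = X.shape
        k = len(c)
        ll = 0.0
        for j in range(k):
            Xj = X[lab == j]
            nj = len(Xj)
            if nj == 0:
                continue
            var = ((Xj - c[j]) ** 2).mean() + 1e-9       # spherical variance
            ll += (nj * np.log(nj / n)
                   - 0.5 * nj * m * np.log(2 * np.pi * var)
                   - 0.5 * ((Xj - c[j]) ** 2).sum() / var)
        p = (k - 1) + m * k + m * k      # weights + means + variances, as in the text
        return ll - 0.5 * p * np.log(n)

    # Choose the k with the largest BIC, e. g.:
    # best_k = max(range(2, 16), key=lambda k: bic_score(X, *kmeans(X, k)))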

2.6 Evaluation and Validation

This section discusses different methods and formulae commonly used to measure and compare the quality of different cluster solutions. Following Frakes and Baeza-Yates (1992), we distinguish between validation as a means of comparing clusters and cluster solutions and evaluation as a means of comparing different algorithms or, in our case, different document representation techniques. The two are very closely related, and evaluation often simply relies on particular validation techniques. Another important aspect of evaluation is algorithmic complexity (Section 2.6.7).

2.6.1 Approaches to Evaluation and Validation

The problem of clustering validity is extensively treated in Jain and Dubes (1988, Chapter 4). They differentiate between three types of objects that can be considered: hierarchical groupings, flat partitions and single clusters. Furthermore, they differentiate between three types of criteria, to which two more IR-specific categories have been appended in the list below:

❼ External Criteria. If data-independent labels (categories) are available, the question may be asked how well a given cluster solution corresponds to these external labels. In document clustering, manually labelled collections are a very popular means of evaluating new algorithms, but care is required because of human errors in the labelling as well as the possibility of the algorithm discovering a valid structure that had been overlooked by the human indexer. Again, to be able to make reliable statements a suitable baseline must be defined. External criteria have also been used repeatedly to compare the relative merits of two or more cluster solutions.

❼ Internal Criteria. Here the principal question is: how well does a particular cluster solution or individual cluster fit the data? Various measures or indices exist, but to obtain a meaningful answer they must be put into relation with a carefully chosen null hypothesis, for which appropriate baseline distributions must be derived. In practice these baseline distributions are rarely available and must thus be estimated with suitable statistical models (e. g. Monte Carlo simulations). Once such a baseline is determined, it is possible to assess how far a specific clustering solution differs from random cluster assignments.

❼ Relative Criteria. Relative criteria are used to assess the relative merits of two (or more) cluster solutions in the absence of external data (labels). This includes questions such as which is the “true” number of clusters k underlying a certain data set.

❼ Ranked-List Comparison. If clustering is specifically used as a means of organising documents in a large collection as a preparation for cluster-based retrieval—as was mostly the case in the earlier years of the document clustering discipline—the cluster solution can be evaluated by traditional IR instruments such as precision and recall.

❼ End-User Criteria. Regardless of all the theoretical background, the best evaluation method for practical applications (such as search engine results clustering) is the user’s judgement. After all, the usefulness of any clustering system depends on the users’ satisfaction, and it can never be fully measured by abstract coefficients. However, reliably measuring users’ satisfaction is extremely difficult in itself (Stefanowski and Weiss, 2003), and so far no suitable methodology seems to have been suggested. For a recent study comparing human and algorithmic perception of mere similarity (without clustering), however, see Lee et al. (2005).

The subsequent sections cover some of the widely used measures and indices, without aiming at a complete overview. For a thorough treatment compare Jain and Dubes (1988) and Halkidi et al. (2002).

2.6.2 External Criteria

Usually, pre-classified document collections are flat, and thus most practical validity indices apply to partitions. See Jain and Dubes (1988) for how external criteria may work with hierarchies, and Rezaee et al. (1998) and Halkidi et al. (2001) for fuzzy clustering. For partitions a detailed treatment can also be found in Jain and Dubes (1988) and Dom (2002). Our discussion focusses on a few central aspects.

Let there be l a priori labelled classes. Each object belongs to one such class L_i:

L_1 ... L_l ⊂ S,  (2.53)
∧ ∀ i, j: i ≠ j → L_i ∩ L_j = ∅,  (2.54)
∧ ⋃ L_i = S,  i ∈ {1 ... l}.  (2.55)

The subject under examination being a flat exclusive partition, each object also belongs to exactly one cluster C_j. This membership information can be summarised in a two-dimensional contingency table H ≡ {h(L, C)}, where h(L_i, C_j) is the number of documents in cluster C_j with label L_i.

2.6.2.1 Overlap Indices

From the contingency table H the following numbers can be defined:26

a_00 = Σ_{i,j} binom(h(L_i, C_j), 2),  (2.56)
a_01 = Σ_i binom(|L_i|, 2) − a_00,  (2.57)
a_10 = Σ_i binom(|C_i|, 2) − a_00,  (2.58)
a_11 = binom(n, 2) − (a_00 + a_01 + a_10).  (2.59)

26Here binom(x, y) = x! / (y! (x − y)!) denotes the binomial coefficient.

Then a_00 is the number of object pairs where both objects belong to the same label and to the same cluster. a_01 is the number of pairs that belong to the same label but different clusters, while a_10 counts the pairs in the same cluster but with different labels. Finally, a_11 is the number of object pairs that share neither label nor cluster. In IR parlance, a_00 are the true positives, a_10 the false positives, a_01 the false negatives and a_11 the true negatives. Further we define:

ã = (a_00 + a_01)(a_00 + a_10).  (2.60)

Based on these coefficients, four common external validity indices are known (Jain and Dubes, 1988; Dom, 2002):

Rand: (a_00 + a_11) / binom(n, 2).  (2.61)

Jaccard: a_00 / (a_00 + a_01 + a_10).  (2.62)

Fowlkes/Mallows: a_00 / sqrt(ã).  (2.63)

Γ statistic: ( binom(n, 2) · a_00 − ã ) / sqrt( ã · (binom(n, 2) − a_00 − a_01) · (binom(n, 2) − a_00 − a_10) ).  (2.64)

All these indices increase with rising overlap between labels and clusters. See Jain and Dubes (1988) for how to construct a baseline to decide whether or not an index value indicates a significant overlap.
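The four overlap indices reduce to simple counting once the contingency table is known, as the following sketch shows (H is assumed to hold labels in rows and clusters in columns):

    from math import comb
    import numpy as np

    def pair_counts(H):
        """a00, a01, a10, a11 of Eqs. (2.56)-(2.59) from contingency table H."""
        n = int(H.sum())
        a00 = sum(comb(int(x), 2) for x in H.ravel())
        a01 = sum(comb(int(x), 2) for x in H.sum(axis=1)) - a00
        a10 = sum(comb(int(x), 2) for x in H.sum(axis=0)) - a00
        a11 = comb(n, 2) - (a00 + a01 + a10)
        return a00, a01, a10, a11

    def rand_index(H):
        a00, a01, a10, a11 = pair_counts(H)
        return (a00 + a11) / comb(int(H.sum()), 2)       # Eq. (2.61)

    def jaccard_index(H):
        a00, a01, a10, _ = pair_counts(H)
        return a00 / (a00 + a01 + a10)                   # Eq. (2.62)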

2.6.2.2 The Q_0-Measure

In recent years, indices based on the mutual information concept have increasingly gained attention (e. g. Strehl et al., 2000). Dom (2002) worked out a general cluster quality measure Q_0, which

is applicable also to cases where the numbers of clusters and classes need not be equal. Q_0 is the sum of two components: the first measures the empirical conditional entropy Ĥ(L|C) between the labels L and the clusters C (which is equivalent to the empirical mutual information), the second component measures the cost of encoding the contingency table H and favours solutions with a smaller number of clusters. The smaller the value of Q_0, the better the corresponding cluster solution.

Q_0(L, C) = Ĥ(L|C) + (1/n) Σ_{i=1}^{k} log binom(|C_i| + |L| − 1, |L| − 1),  (2.65)

with Ĥ(L|C) = − Σ_{i=1}^{l} Σ_{j=1}^{k} ( h(L_i, C_j) / n ) · log ( h(L_i, C_j) / |C_j| ).  (2.66)

2.6.2.3 Distance Measures

A number of theoretically less well-founded measures have also been adopted, including a series of measures based on precision and recall (see the following sub-section) and some other distance measures:

❼ Average Distance: Bradley and Fayyad (1998) use the average distance between the class (label) centroids and the cluster centroids as a measure of how well a cluster solution matches the “ground truth”.

❼ Editing Distance: Pantel and Lin (2002) propose to measure the difference between labels and clusters by the number of operations that are needed to transform the latter into the former. The three possible editing operations are (a) merging two clusters, (b) moving an object from one cluster to another, and (c) copying an object from one cluster to another. Unlike the indices discussed above, editing distance also works with overlapping clusters. Pantel and Lin suggest transforming this distance measure dist(C, L) into a cluster quality measure as follows:

Q(C, L) = 1 − dist(C, L) / dist(B, L),  (2.67)

where B is a baseline clustering with each document in its own cluster.

2.6.2.4 Precision and Recall Measures

Several indices have been suggested based on precision and recall. For a detailed discussion see Yang (1999). For individual clusters and labels, we define precision P and recall R as follows:

Precision: P(L_i, C_j) = h(L_i, C_j) / |C_j|,  (2.68)

Recall: R(L_i, C_j) = h(L_i, C_j) / |L_i|.  (2.69)

The following measures are regularly seen in practice:

Purity. The purity measure used e. g. by Strehl et al. (2000) and Zhao and Karypis (2001) is the maximum possible precision for each cluster:

Purity: P̃(C_i) = max_{j=1...l} P(L_j, C_i).  (2.70)

When these purity values are summed up over all clusters, they can be weighted by cluster size or not. Zhao and Karypis (2001) use weighted purity values:

Weighted Purity: Ψ(C|L) = Σ_{i=1}^{k} (n_i / n) · P̃(C_i).  (2.71)

Modha and Spangler (2003), following Yang (1999), call this micro-precision/micro-recall. This measure is equivalent to “accuracy” and also to the “classification error” (e. g. Goldszmidt and Sahami, 1998), which counts the number of objects in a cluster that have a “minority label” ( Σ_i ( n_i − max_j h(L_j, C_i) ) ).

The unweighted alternative is called macro-precision by Yang (1999) and Modha and Spangler (2003):

Unweighted Purity: Ψ(C|L) = (1/k) Σ_{i=1}^{k} P̃(C_i).  (2.72)

They further define a macro-recall as follows:

Macro-Recall: Ψ(C|L) = (1/k) Σ_{i=1}^{k} R(L̄_i, C_i),  (2.73)

with L̄_i = arg max_{L_j} P(L_j, C_i).  (2.74)

Entropy. One of the most popular individual measures is the entropy measure. Like a number of other measures it needs to be handled carefully in situations with a variable number of clusters k because a solution with each document in its own cluster will always score optimally. In terms of precision, cluster entropy can be defined as follows:

Entropy: E(C_i) = − Σ_{j=1}^{l} P(L_j, C_i) · log P(L_j, C_i).  (2.75)

Optionally, the entropy values can be standardised to a [0, 1] scale by multiplying them with the constant factor 1/log l (e. g. Zhao and Karypis, 2001). An overall weighted entropy measure is then defined as

Weighted Entropy: Ψ(C|L) = Σ_{i=1}^{k} (n_i / n) · E(C_i).  (2.76)

F-Measure. Another highly popular measure is the F-score (Larsen and Aone, 1999), which is based on the F-measure in information retrieval, F = (β² + 1) P R / (β² P + R) (Jardine and van Rijsbergen, 1971). Larsen and Aone set β = 1, a value also used in various subsequent clustering studies (Steinbach et al., 2000; Bakus et al., 2002; Zhao and Karypis, 2003). The F-score of a cluster is its maximum possible F-measure:

F-Score(C_i) = max_{j ∈ {1...l}} 2 · P(L_j, C_i) · R(L_j, C_i) / ( P(L_j, C_i) + R(L_j, C_i) ).  (2.77)

Again, an overall evaluation function can be built by taking weighted or unweighted averages of individual F-scores.
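The precision/recall-based measures can likewise be computed directly from the contingency table. The sketch below follows Eqs. (2.70)–(2.77) with β = 1; the size-weighted averaging of the F-scores is one of the possible choices mentioned in the text.

    import numpy as np

    def weighted_purity(H):                   # Eq. (2.71)
        return sum(H[:, j].max() for j in range(H.shape[1])) / H.sum()

    def weighted_entropy(H):                  # Eqs. (2.75)-(2.76)
        n, e = H.sum(), 0.0
        for j in range(H.shape[1]):
            nj = H[:, j].sum()
            p = H[:, j][H[:, j] > 0] / nj
            e += (nj / n) * -(p * np.log(p)).sum()
        return e

    def weighted_f_score(H):                  # Eq. (2.77), size-weighted average
        n, total = H.sum(), 0.0
        for j in range(H.shape[1]):
            P = H[:, j] / H[:, j].sum()       # precision per label, Eq. (2.68)
            R = H[:, j] / H.sum(axis=1)       # recall per label, Eq. (2.69)
            F = np.where(P + R > 0, 2 * P * R / (P + R + 1e-12), 0.0)
            total += (H[:, j].sum() / n) * F.max()
        return total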

To sum up, a large number of methods exist to assess a cluster solution in the presence of an “objective” ground truth. Often they are used to evaluate performance of algorithms or feature representations in experimental setups. In actual applications, however, such a priori knowledge usually does not exist as it would make clustering superfluous.27

2.6.3 Internal Criteria

In the absence of external labels, cluster solutions can still be assessed by internal criterion functions such as cluster compactness and cluster separation (and generally all those global criteria discussed in Sections 2.3.1 and 2.4.1). However, establishing suitable baselines (null hypotheses) for comparison is very difficult in these cases. Therefore two other internal criteria are more popular, one for partitions and one for hierarchies:

Hubert’s Γ Statistic. Given two n × n proximity tables X = [X(i,j)] and Y = [Y(i,j)], Hubert’s Γ is defined as follows:

Γ(X, Y) = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} X(i,j) · Y(i,j).  (2.78)

In normalised form, Γ is the sample correlation coefficient of the entries in the two matrices. If we take X to be a cosine or Euclidean proximity matrix of the objects in S, and Y to be a matrix indicating whether two objects are in the same cluster (proximity 0) or in two different clusters (proximity 1), then useful values for Γ can be calculated, as well as an appropriate random baseline.

Cophenetic Correlation Coefficient (CPCC). The cophenetic correlation coefficient also uses an ordinary geometric proximity table X, while the cophenetic proximity table Y measures cluster proximity between two documents by the dendrogram level at which they appear in the same cluster for the first time. If the number of entries above the main diagonal of the matrices is defined as M = n(n−1)/2 and the respective means as m_X = (1/M) Σ_{i,j} X(i,j) and m_Y = (1/M) Σ_{i,j} Y(i,j), then

CPCC = ( (1/M) Σ_{i,j} X(i,j) Y(i,j) − m_X m_Y ) / ( sqrt( (1/M) Σ_{i,j} X(i,j)² − m_X² ) · sqrt( (1/M) Σ_{i,j} Y(i,j)² − m_Y² ) ).  (2.79)

CPCC assumes values between −1 and 1. The nearer to 1, the better the clustering. For details about the generation of the baseline values which allow one to distinguish between random results and significant clusterings see Jain and Dubes (1988).
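Both indices are correlations over the upper triangles of two proximity matrices, so a few lines suffice to compute them. In this sketch X is assumed to be a distance matrix and Y the 0/1 cluster co-membership matrix described above.

    import numpy as np

    def hubert_gamma(X, labels, normalised=True):
        labels = np.asarray(labels)
        Y = (labels[:, None] != labels[None, :]).astype(float)  # 1 = different cluster
        iu = np.triu_indices(len(labels), k=1)   # entries above the main diagonal
        if normalised:                           # sample correlation, cf. Eq. (2.79)
            return float(np.corrcoef(X[iu], Y[iu])[0, 1])
        return float((X[iu] * Y[iu]).sum())      # raw statistic, Eq. (2.78)

The CPCC is obtained in the same way by substituting the cophenetic proximity table for Y.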

2.6.4 Relative Criteria

Assessing the relative merits of two cluster solutions does not require any baselines, which removes the main restriction of the previous section. Typically, cluster solutions from different parameter settings are compared, and the goal is to determine the “best” of these solutions. Individual parameters (such as k, the number of clusters) can be adjusted through multiple tests and observation of the behaviour of the index criterion under varying parameter values. Depending on the nature of the index that is used, either a global maximum/minimum occurs or at least a significant “knee” appears in the criterion function, which points to the optimal parameter value. Numerous criteria have been tested in practice (Jain and Dubes, 1988; Halkidi et al., 2002), of which we discuss two representative examples.28

27For evaluations with two external “ground truths” (two classification schemes) see Rosell et al. (2004).

Davies–Bouldin Index. The index suggested by Davies and Bouldin (1979) for different values of k is typically instantiated as follows to capture the concepts of cluster separation (denominator) and cluster compactness (numerator):

DB(C_k) = (1/k) Σ_{i=1}^{k} max_{j≠i ∈ {1...k}} ( (s_i + s_j) / ||r_i^c − r_j^c||_2 ),  (2.80)

with si the square root of the average error (within-cluster variance) of cluster Ci (cf. Eq. 2.22):

s_i = sqrt( V(C_i) ) = sqrt( Σ_{d ∈ C_i} (d − r_i^c)^T (d − r_i^c) / n_i ).  (2.81)

By maximising the (s_i + s_j) / ||r_i^c − r_j^c|| term, the worst constellation for each cluster in the cluster solution is considered. The smaller the Davies–Bouldin index, the better therefore a given cluster solution.
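A direct transcription of Eqs. (2.80)–(2.81) in a few lines (labels is assumed to be a flat assignment of objects to clusters):

    import numpy as np

    def davies_bouldin(X, labels):
        labels = np.asarray(labels)
        ks = np.unique(labels)
        cents = np.array([X[labels == k].mean(axis=0) for k in ks])
        s = np.array([np.sqrt(((X[labels == k] - c) ** 2).sum(axis=1).mean())
                      for k, c in zip(ks, cents)])       # Eq. (2.81)
        db = 0.0
        for i in range(len(ks)):                         # Eq. (2.80)
            db += max((s[i] + s[j]) / np.linalg.norm(cents[i] - cents[j])
                      for j in range(len(ks)) if j != i)
        return db / len(ks)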

SD Index. A recently suggested relative criterion is the SD index (Halkidi et al., 2000). It also relies on measures of compactness and separation, but is calculated differently:29

❼ Average Scattering. Cluster compactness is measured by the average “scattering” of the clusters, which is defined via the length of the variance vector σ of each cluster (the operator ⊗ denotes component-wise vector multiplication between vectors of equal length, i. e. a ⊗ b ≡ I a^T b, with I the identity matrix):

Scatter(C_k) = (1/k) · Σ_{i=1}^{k} ||σ(C_i)||_2 / ||σ(S)||_2,  (2.82)

with σ(C_i) = (1/n_i) Σ_{d ∈ C_i} (d − r_i^c) ⊗ (d − r_i^c).  (2.83)

The chosen formula gives stronger weight to deviations in individual positions, as shown by the following comparison of the variable parts of the compactness formulae:

DB: s_i = (√n_i / n_i) · sqrt( Σ_{d_j ∈ C_i} Σ_{f ∈ {1...m}} (d_jf − r_f^c)² ),  (2.84)

SD: ||σ(C_i)||_2 = (1/n_i) · sqrt( Σ_{d_j ∈ C_i} Σ_{f ∈ {1...m}} (d_jf − r_f^c)^4 ).  (2.85)

28See also the criteria mentioned in the preceding section.
29For a further disquisition on compactness and separation see He et al. (2003).

❼ Total Separation. The total separation between clusters is measured as in the following formula. Again the individual clusters are given more weight by summing the inverses of their total distances from each other rather than just inverting the entire sum:

Dis(C_k) = (D_max / D_min) · Σ_{i=1}^{k} ( Σ_{j=1}^{k} ||c_i − c_j||_2 )^{−1},  (2.86)

with D_max = max_{i≠j ∈ {1...k}} ||c_i − c_j||_2,
D_min = min_{i≠j ∈ {1...k}} ||c_i − c_j||_2.

Given these two factors, the SD index is calculated as a linear combination

SD(C_k) = α · Scatter(C_k) + Dis(C_k).  (2.87)

Halkidi et al. suggest setting the weighting factor α to Dis(C_kmax), where k_max is the maximum number of clusters under consideration. The smaller the two components of the SD index are, the better the clustering. Whether or not the SD index can establish itself as a viable alternative to the older relative criteria remains to be seen.

As the DB and SD indices do not systematically depend on the cluster number k, their values for different k are comparable and thus allow the determination of an “optimal” number of clusters k.

2.6.5 Ranked-List Criteria

If clustering is used to produce a ranked list (e. g. in a search engine) and if corresponding external relevance judgements are available, a number of further indices can be calculated. Tombros et al. (2002), following Jardine and van Rijsbergen (1971), compute the E-values of all clusters and choose the value of the most effective cluster (i. e. the one with the lowest E-value) as clustering validity index MK1. The E-measure is defined as E = 1 − F, with F being computed from precision and recall as in Equation 2.77. The optimal cluster efficiency MK1 is compared with the ranked-list measures MK1-k and MK3, which are defined as follows: if k is the number of documents in the most effective cluster, MK1-k is the E-value of the top k documents on the ranked list. MK3, on the other hand, is the optimal value attainable by the ranked list (i. e. if it is cut off at the point where E reaches its minimum). Depending on the values used for β in Equation 2.77, they obtain quite different verdicts for cluster-based versus ranked-list retrieval.

Zamir and Etzioni (1998) compare clustering and ranked lists by taking precision values of the top 10% of returned documents. For clustering, the clusters are first sorted by recall; documents are then added from top to bottom of the best cluster, then from the second cluster, etc., until 10% of the whole set are reached. See also Hearst and Pedersen (1996) and Bellot and El-Bèze (2000) for related efficiency measures.

Most of these methods depend on the assumption that the user is able to immediately recognise and pick the most relevant clusters. A short survey by Hearst and Pedersen (1996) showed that this was the case about 80% of the time. Of course, ranked-list comparisons are based on a one-sided view of document clustering, concentrating on the relevance/irrelevance aspect, but ignoring other purposes such as document or concept organisation, exploratory search, etc.

2.6.6 End-User Criteria

The number of actual user studies on clustering usefulness is very small. Both Hearst and Pedersen (1996) and Zamir and Etzioni (1999) make use of log analysis, but as yet no reliable methodology seems to have been developed.

2.6.7 Complexity and Performance

The complexity of various clustering algorithms ranges from O(n) for most partitional methods to O(n²), O(n² log n) or even O(n³) for the hierarchical algorithms. For a comparison of the complexity as well as other characteristics of a large number of clustering algorithms see Steinbach et al. (2000); Strehl et al. (2000); Halkidi et al. (2001); Zhao and Karypis (2002).

Comparing actual running times is notoriously difficult and strongly dependent on the hardware used. For on-line systems fast response times are obviously highly desirable. Working with document snippets, Zamir and Etzioni (1998) showed that using linear algorithms it was possible to cluster up to 600 documents in less than six seconds on a Pentium 200 processor. Of course, in view of the constant progress in hardware development, such a figure has only anecdotal significance. Questions of scalability in a document context are addressed explicitly by Boley et al. (1999b); Weiss et al. (2000a); Dhillon and Modha (2001); Dhillon et al. (2001).

Chapter 3

Document Representation

Number is the first principle and the matter in things and in their conditions and states; and the odd and the even are elements of number, and of these the one is infinite and the other finite, and unity is the product of both of them, for it is both odd and even, and number arises from unity, and the whole heaven, as has been said, is numbers.

Pythagoras (in Aristotle’s Metaphysics)

At least for computers the Pythagorean world view holds: all is numbers. The translation of an ordinary text document into numbers, more precisely into a vector, is therefore necessary, and it forms the topic of this chapter. Success or failure of a document clustering application often depends heavily on the choice of a suitable representation method.

The following discussion of representation techniques is dominated by the ultimate purpose, document clustering. However, there is a very strong overlap between the methods used here and those applied in related fields such as text categorisation or text storage and retrieval. All are concerned with finding a “good” feature space in which to accurately represent the documents, their main characteristics and their similarities/dissimilarities.

We begin with an exposition of the initial document space (Section 3.1), followed by an examination of functions transforming a linear document into a feature vector (“vectorisation”, Section 3.2). These vectors are usually weighted and/or further refined (Sections 3.3 and 3.4). We then conclude the discussion with the implications of these methods for clustering (Section 3.5) and a section on the visualisation of document clusters (Section 3.6).

3.1 Document Space

Before examining different methods of projecting documents into feature spaces, we shall look at the original document space and some of its properties.


3.1.1 Restricted vs. Open Topics

Much document clustering work has been performed and evaluated on dedicated collections, with documents stemming from a single or from very similar sources (e. g. Usenet postings, Reuters news bulletins) or belonging to a single general domain (e. g. medical abstracts). In the literature, little use has been made of specific knowledge about such restricted domains. An exception is the domain-dependent ontology investigated by Hotho et al. (2002).

For cluster applications on top of a Web search engine, such methods would hardly be appropriate since Web indices typically know no topical boundaries. With texts being retrieved from a wealth of different sources and domains, the amount of formal and semantic ambiguity sharply increases, making a succinct representation more difficult. At the same time, it is especially in these open and highly ambiguous domains that clustering can be of the greatest service by organising very heterogeneous search results.

With regard to topics, a vexing problem are those documents that deal with several, perhaps even totally unrelated topics. Such a document is not only difficult to assign to a single cluster, it also distorts the delineation between clusters. One approach to getting to grips with this problem is to allow fuzzy or overlapping clusters (see Section 2.5.2).

3.1.2 Collection Size

The size n of the document collection S is crucial for the speed of every clustering algorithm. In particular, ad-hoc clustering of search results must take the user’s notorious impatience into account. But even for off-line clustering, most of the more sophisticated algorithms become impractical if the documents number millions or more. For a majority of the approaches presented hereafter, the trade-off between quantity (speed) and quality (accuracy) is resolved in favour of the latter. For work aimed explicitly at developing fast algorithms for very large text collections see Weiss et al. (2000a), Dhillon et al. (2001) and Newton and O’Brien (2002).1

In some instances, an upper limit for the number of documents to be clustered is set. For example, Zamir and Etzioni (1998, 1999) and Tombros et al. (2002) restrict themselves to the top-ranked x documents from their retrieval systems.

In almost none of the clustering approaches does the number of documents n play a role other than that of a very basic parameter (e. g. for estimating the number of desired clusters k). However, for classification tasks, Perlich et al. (2003) have shown that the choice of a suitable algorithm may depend on the collection size. It might thus be interesting to investigate whether there are parallels in the field of clustering, i. e. whether the optimal choice of clustering and/or representation method might depend on the observed set size n. At least it makes sense to define a minimum number of documents for clustering to take place at all. If there are just five documents, a simple list is just as useful as the most sophisticated clustering.

3.1.3 Document Size

A property quite typical of the document domain is that the individual objects can have arbitrary and highly varying lengths. Particularly on the World Wide Web, but also in more specialised collections, document sizes differ enormously. Since clustering algorithms usually rely on a measure of similarity between two documents, the question must be addressed whether and how to take these length differences into account (cf. Section 3.3).

Because of the limited availability of full texts and a lack of adequate hardware, the IR systems of the 1970s and 1980s worked mostly with simple document titles or at best with abstracts. With the digital availability of most resources and the increased storage and processing capabilities of modern days, most present-day retrieval systems use full-text representations. However, for certain time-critical applications a restriction to partial documents must be considered. For instance, Zamir and Etzioni (1998), Krishnapuram et al. (1999) and Joshi and Jiang (2001) show that the short descriptions (“snippets”) returned by search engines can suffice to produce viable results.

1Of course, fast clustering algorithms for very large data sets were also explored for other domains, leading to algorithms such as BIRCH (Zhang et al., 1996), CURE (Guha et al., 1998) or DBSCAN (Ester et al., 1996).

3.1.4 Language

Many document representation algorithms such as stemming and stopword removal (see further below) rely on at least a rudimentary language model. It is therefore desirable to identify the language of the documents to be clustered. Most work has been done for English texts, but papers on clustering in other languages (French, German, Polish) are also available (Bellot and El-Bèze, 2000; Hotho et al., 2002; Stefanowski and Weiss, 2003). It is to be expected that different languages favour different document representation methods.

In the era of the World Wide Web, multi-lingual document collections are a common occurrence. Little effort has gone into clustering such document sets. In the absence of a special effort in that direction it is to be expected that most algorithms would tend to group documents of different languages into separate clusters. Silva et al. (2001) report on an experiment with clustering of legislation texts in three different languages. The clustering algorithm had no trouble recognising and separating the different languages. An attempt to resolve the cross-lingual difficulties is the highly ambitious Universal Networking Language (UNL) project. Should it be made to work, cross-lingual clustering would become a realistic possibility (Choudhary and Bhattacharyya, 2002).

3.1.5 Intra- and Inter-Document Structure Document retrieval algorithms have long been known to exploit specifics of the document struc- ture. In the simplest case this happens when only title and abstract fields are indexed but not the whole body of the text. Other methods lend extra weight to terms in the title or those that are typographically emphasised. In richly-formatted documents such as HTML texts “keywords”, “description” and other meta tags offer themselves as candidates for special treatment. In docu- ment clustering few efforts have so far been undertaken to make use of these internal structures. One example is Weiss et al. (2000a), who filter out a few keywords from the text while making sure that all the title words are still included in the feature set. The idea of exploiting internal document structures for clustering ought to be further explored. However, in view of the great diversity of documents on the World Wide Web and the great number of pages where tags and meta tags are used inconsistently or incorrectly (and not seldom abusively—cf. Henzinger et al., 2002, for this and other challenges faced by the major search engines) caution is required. Inter-document relationships as a means of identifying good representations and good clusters have received more attention. In particular hyperlink analysis has been famously used to obtain measures of document “quality” or “relevance” (Kleinberg, 1998; Page et al., 1998).2 The number of common direct or indirect hyperlinks is a natural measure of similarity between two documents. It is not surprising that various studies have attempted to make use of these relationships. Pirolli et al. (1996), Pitkow and Pirolli (1997), Weiss et al. (1996) and Modha and Spangler (2000) present promising approaches to document clustering based on a combination of content and

2 These approaches are inspired by the ancient concept of co-citation analysis (cf. also van Rijsbergen, 1979, 61–62, Wen and Zhang, 2003).

Wang and Kitsuregawa (2001) and Noel et al. (2003) present approaches relying exclusively on link analysis. Further experiments must show whether this is a reliable clustering method; one of its undeniable advantages is its relative language-independence.3

3.1.6 Document Quality

Compared to earlier document clustering studies, which were based on relatively narrow and uniform sets of documents, modern approaches in a WWW environment face great varieties not only in document length and structure but also in document quality4. Search engine companies have put an enormous effort into establishing good algorithms for separating high-quality from low-quality documents, so that the list of results is topped by documents which are considered both highly relevant and of high quality (cf. Henzinger et al., 2002). Rough measures of quality can be gained by different means. A popular method is the aforementioned link analysis (Kleinberg, 1998; Page et al., 1998, etc.), but interesting experiments have also been conducted with stylistic elements (Karlgren, 1999). Finally, document type (e. g. HTML, PDF or PostScript files) could also be used as an indicator of a document’s quality, though the usual caveats apply.5
So far document quality has hardly ever been considered a factor in document clustering, but various applications can be imagined. For example, a clustering algorithm might first filter out the high-quality documents of the result set and calculate a good clustering solution based on just these. Afterwards the documents of lower quality could be filled in, without having a direct influence on the cluster definitions (cf. the “initial partition” task discussed in Section 2.3.1.2).

3.1.7 Knowledge about the Document Universe

Clustering algorithms can be divided into those which work exclusively on the document selection S and those which make use of additional base information from the entire document universe Ω. Such information can consist of term statistics over the whole text universe or information about the documents in Ω which link into or are linked to from S.
The majority of clustering methods investigated at present is inspired by text-based clustering of Web search engine results. In this case the document universe Ω is often intractable, so that usually no distinction between S and Ω is made.

3.2 Document Vectorisation

This section deals with different methods of mapping a text document (i.e. a linear string of word tokens) onto a feature vector. The general form of a vectorisation function is

d_i = τ(D_i, ζ) ,   (3.1)

wherein ζ denotes the assumptions and parameters of the language and real-world model (e. g. a list of named entities, a grammar, an ontology, etc.). It should be noted that the rest of the document collection S is usually not considered part of ζ. Considerations which also involve the other documents are reserved for a separate step, the refinement and weighting phase (Sections 3.3 and 3.4).

3In subsequent work Wang and Kitsuregawa also included content information (Wang and Kitsuregawa, 2002). 4No attempt at defining quality will be made here; a general notion is sufficient for the present purposes. 5In certain contexts (such as academic research) document type might even be a useful primary clustering criterion. 3.2. DOCUMENT VECTORISATION 57

The following sections describe several different τ functions. In practice, they can also be combined, leading to a document vector consisting of multiple distinct sub-vectors:

d_i = ( τ_1(D_i, ζ_1), …, τ_z(D_i, ζ_z) )^T .   (3.2)

3.2.1 Bag-of-Words

In the simplest and most popular case a document is represented by the unordered set of individual word tokens making up that document (the “bag-of-words”). Identical tokens are grouped together and summed. The total feature space F thus equals the set of unique words (word types) occurring in S. The sequence of features is arbitrary; if a feature is absent from a document, the corresponding value of the document vector is set to zero.
In early IR applications binary vectors were often adopted (van Rijsbergen, 1979). They just indicate whether a term is present or absent in a given text. In recent times, and especially for clustering applications, multi-valued non-negative vectors have been preferred, where each value indicates the number of occurrences of an individual term in the text. Given the large variety of words in natural language it comes as no surprise that these vectors have a very large number of dimensions and that they are usually very sparse (in a typical document vector more than 95% of all values are zeroes).
The bag-of-words model (BOW) is the standard for thousands of IR applications. Splitting a document into word tokens is very fast and the frequencies are easily added up. Moreover, no external model ζ is required.6
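As an illustration, the following minimal sketch (in Python, with invented toy data) maps token lists onto term-frequency vectors over a fixed feature sequence:

    from collections import Counter

    def bag_of_words(tokens, vocabulary):
        # Count token occurrences and arrange them along the fixed feature sequence.
        counts = Counter(tokens)
        return [counts.get(term, 0) for term in vocabulary]

    docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"]]
    vocabulary = sorted({t for d in docs for t in d})       # the feature space F
    vectors = [bag_of_words(d, vocabulary) for d in docs]
    # vocabulary: ['cat', 'dog', 'sat', 'the']
    # vectors:    [[1, 0, 1, 1], [0, 1, 2, 1]]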

3.2.2 Annotated Bag-of-Words

The simple BOW model can be refined by annotating the individual tokens with additional, context-dependent information before summing. Such annotations usually refer to the position and function of the individual token within the document, and the most typical application is part-of-speech (POS) tagging.
By annotating words with their parts-of-speech (word classes, lexical tags) before transforming them into a vector, we can preserve potentially important information about their function in the text which can be further exploited in the refinement phase. Besides, POS tags allow a differentiation between words that look the same but have different functions (e. g. “can” as a verb and “can” as a noun). Research into automated POS tagging dates back to the 1960s. See Jurafsky and Martin (2000, chapter 8) for an overview of part-of-speech classes, tagging and the three major forms of algorithms: rule-based, stochastic and transformation-based tagging.
For clustering, POS tagging on its own is usually not worth the (considerable) extra effort. However, it is a prerequisite for several of the more sophisticated refinement methods discussed further below.7

6 We will not discuss tokenisation (i. e. delineating and identifying tokens, handling punctuation, etc.) any further, even though no uniquely correct method exists but several related procedures. See Fox (1992), Manning and Schütze (1999, 124–134) and Baeza-Yates and Ribeiro-Neto (1999, 165–167) for further details.
7 For a discussion of several annotation strategies see also work by Kanejiya et al. (2004).
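A hedged sketch of the annotated bag-of-words just described, using NLTK’s default English tagger (an assumption; NLTK and its tokeniser/tagger models must be installed, and tag names follow the Penn Treebank convention):

    import nltk  # assumes nltk plus its 'punkt' and tagger resources are installed

    tokens = nltk.word_tokenize("I can open the can.")
    tagged = nltk.pos_tag(tokens)
    features = ["%s/%s" % (word.lower(), tag) for word, tag in tagged]
    # 'can/MD' (modal verb) and 'can/NN' (noun) now remain distinct features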

3.2.3 Word Sequences

Various attempts have been made to create features of higher complexity than just word tokens. Many of these methods take the individual word context into account and thus aim to reduce the quite significant information loss incurred by vectorisation. Such models are often described as “phrase models”. To distinguish the linguistic sense of a phrase (a group of words forming a grammatical unit in the syntax of a sentence) from an arbitrary statistical view (a sequence of words), we call the former syntactic phrases and the latter statistical phrases (Croft et al., 1991).

3.2.3.1 Statistical Phrases

The use of phrases in a non-linguistic sense has long been known in document retrieval, but has often produced rather disappointing results (Salton and McGill, 1983; Fagan, 1989; Croft et al., 1991). Recently some fresh attempts have been made for clustering.

Fixed length (N-grams). Skogmar and Olsson (2002) report on a small-scale experiment using bi- and trigrams instead of unigrams (words), but they find no improvement in the quality of the clustering. Modha and Spangler (2003) combine single words with 2- and 3-word phrases (keeping, of all three categories, only those features that meet a certain minimal document frequency condition). No comparison with a “single words only” approach is reported.8 Word order is usually discarded in n-gram (or n-word combination) indexing (Chu et al., 2003).
Broder et al. (1997) use a shifting N-gram approach (e. g. with N = 3 the text “To be or not to be” is translated into the four tuples ⟨To, be, or⟩, ⟨be, or, not⟩, ⟨or, not, to⟩, ⟨not, to, be⟩). For their goal of identifying identical or nearly identical Web documents, this computationally rather expensive approach seemed to work, since a much cruder measure of similarity can be applied than for user-oriented document clustering.9
Letter-based N-gram models which do not examine complete words or sentences but groups of successive letters (sliding N-grams) were used by Larocca Neto et al. (2000), apparently with positive results. They do not indicate which values of N they used, but values in the range 3 to 5 seem reasonable. For instance, with N = 3 the word “Data” is transformed into the set of features {_DA, DAT, ATA, TA_}.
Joshi and Jiang (2001) also use sliding N-grams on the letter level, but without an extra “end token” (such as “_” above). By using binary weighting and a simple overlap coefficient they claim to overcome minor spelling mistakes as well as the language barrier.

Arbitrary length. Several methods have been suggested for extracting useful word sequences of arbitrary length. Zamir and Etzioni (1998) introduced an efficient algorithm based on a tree structure. Their work on suffix tree clustering was taken up by Hammouda and Kamel (2002), Stefanowski and Weiss (2003) and others.
Hannappel et al. (1999) use the LZW algorithm for data compression (Ziv and Lempel, 1977; Welch, 1984) to identify frequent phrases which they cluster by a simple mechanism.
Bakus et al. (2002) use a “hierarchical phrase grammar” on a training set to identify relevant phrases. These are determined by merging tokens (either individual terms or previously

8 Some more experience is available from text classification. Work by Mladenić and Grobelnik (1998) showed no further improvements by adding N-grams with N > 3. Caropreso et al. (2001) describe an N-gram approach with permutation of the individual words (word stems) and filtering of “interesting” N-grams, but their experiments only produced mixed results.
9 The title of their study, “Syntactic Clustering of the Web”, is rather misleading since no syntactic analysis is performed.

merged terms) to new bigrams as long as the mutual information10 association measure between the two tokens is positive. Afterwards, the phrases thus found are used as a static repository to identify phrases in all further documents. New phrases cannot be found at that stage. Bakus et al. report an improvement in clustering results compared to single words.
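The letter-based variant is easily sketched. The function below reproduces the “Data” example with N = 3; the underscore as boundary marker is an assumption for readability:

    def letter_ngrams(word, n=3, boundary="_"):
        # Slide a window of n letters over the word, padded with boundary markers.
        s = boundary + word.upper() + boundary
        return {s[i:i + n] for i in range(len(s) - n + 1)}

    print(sorted(letter_ngrams("Data")))   # ['ATA', 'DAT', 'TA_', '_DA']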

3.2.3.2 Syntactic Phrases

Sentences are made up of words, but the words cannot be arranged completely arbitrarily. Each language has a syntax which governs the arrangement of words. Thus, words are usually combined in groups (phrases) that fulfil specific functions in a sentence: noun phrases, verb phrases, prepositional phrases, adjective phrases, adverbial phrases. For an introduction to grammar, the formal description of a language, and parsing, the process of analysing a sentence and decomposing it into its constituents, see e. g. Jurafsky and Martin (2000, chapters 9–12).
Decomposing a sentence into its phrases intuitively promises a more accurate vector description than breaking it down to individual words. However, several IR experiments in earlier years have failed to show a substantial advantage of the more complex syntactic approaches compared to statistical methods (Mitra et al., 1997; Kraaij and Pohlmann, 1998; Sparck Jones, 1999).
For document clustering the use of syntactic phrases (usually in addition to single words) has not yet been tested systematically. The results of studies in the document classification field are generally not too encouraging (see the review in Caropreso et al., 2001). Nevertheless, Arampatzis et al. (2000a) report good results with composite terms consisting of either adjacent word pairs (nouns and adjectives) or decomposed noun phrases.
Wen et al. (2001) mention the possibility of applying a noun phrase recogniser. In their actual work they preferred to rely on an existing phrase dictionary with a limited number of well-defined noun phrases. It seems questionable whether this method can be used in a more general clustering context. Nevertheless, keeping lists of standard complex expressions (e. g. “bond issue”) is a worthwhile enhancement to the bag-of-words model (cf. Basili et al., 2000).
Relatively well explored is the topic of named entity (proper noun) recognition. Clifton et al. (2004) use such a module to identify persons, locations and organisations (see also Hotho et al., 2002). While these studies replace BOW by named entities, Hatzivassiloglou et al. (2000) and Liu et al. (2002) suggest combining BOW and named entities. See also Basili et al. (2000), whose text classifier identifies named entities as part of a more complex linguistic analysis.
Chu et al. (2003) discuss an approach using phrases from a domain-specific knowledge source (here: medical terms) to represent a document. In a suitable environment the use of such a phrase lexicon is bound to prove very useful.

3.2.3.3 Other Approaches

Two further methods which make use of context information are lexical affinities and sentence clustering.

Lexical Affinities. Maarek et al. (2000) use binary, alphabetically ordered lexical affinities instead of single words for clustering. Lexical affinities are collocations consisting of open-class lexical terms (nouns, adjectives, adverbs and verbs) occurring within a ±5 window of each other (Maarek and Smadja, 1989).

10 Mutual information between two tokens t_1 and t_2 is measured by

MI(t_1, t_2) = log_2 [ P(t_1, t_2) / ( P(t_1) · P(t_2) ) ] ,   (3.3)

where P(x) is the event probability of token x and P(x, y) the joint probability of the two tokens x and y.

Sentence-Based Models. Using whole sentences as features is another obvious approach. Pullwitt and Der (2001) present a double-clustering method in which they first cluster all the sentences of all the documents into a fixed number of “sentence categories”. Afterwards the documents are represented in terms of these categories rather than individual words. Given a sufficiently large document collection and documents of sufficient length, they report encouraging results.
Bellot and El-Bèze (2000) suggest clustering the individual sentences in a tree structure. Afterwards they take the leaf nodes and replace all sentences by their original documents. The usefulness of the structure thus gained remains to be seen, though.
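A simplified sketch of lexical-affinity extraction as described above. As an approximation of Maarek and Smadja’s procedure, the window is counted over the retained open-class words only, and the coarse tag names are an assumption:

    OPEN_CLASS = {"NOUN", "VERB", "ADJ", "ADV"}

    def lexical_affinities(tagged_tokens, window=5):
        # Alphabetically ordered pairs of open-class words within the given window.
        words = [w.lower() for w, tag in tagged_tokens if tag in OPEN_CLASS]
        pairs = set()
        for i, w1 in enumerate(words):
            for w2 in words[i + 1:i + 1 + window]:
                if w1 != w2:
                    pairs.add(tuple(sorted((w1, w2))))
        return pairs

    tagged = [("big", "ADJ"), ("data", "NOUN"), ("needs", "VERB"),
              ("fast", "ADJ"), ("clustering", "NOUN")]
    print(sorted(lexical_affinities(tagged)))   # ('big', 'data'), ('big', 'fast'), ...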

3.2.4 Non-Textual Features

As has already been mentioned, hyperlinks can also be used for document representation, even though they are more common for graph models than for the vector space model (cf. Section 2.1.3). But exceptions exist, such as the model of Modha and Spangler (2000), which represents each document by three separate sets of vectors: a word vector, an in-link vector and an out-link vector. The in- and out-link vectors show the links and the number of their occurrences between the present document and the set of all other documents meeting a certain minimal condition. The basic assumption is that documents sharing many in-linking or out-linking documents tend to be similar in content as well.
Other meta-data such as geographical location (Govindarajan and Ward, 1999), stylistic data and coefficients (Karlgren, 1999) or document length, file type and creation/last update time stamps may be used for clustering. While on their own these methods are hardly sufficient for a broader context, they may offer interesting additional possibilities to some of the more comprehensive approaches. In restricted environments more sophisticated meta-features could also be useful. For instance, in clustering scientific texts it might be beneficial to be able to differentiate between technical reports, conference proceedings, journal articles, etc.

3.3 Feature Weighting

It has long been customary in Information Retrieval to further modify the data gained by the vectorisation function in order to make the data better fit a particular model. One such modification is term or feature weighting.
In the BOW model such term weighting schemes have led to substantial improvements of retrieval performance (Salton and McGill, 1983). Feature weighting has also been routinely used for clustering, but some authors express doubts as to its usefulness (e. g. Sneath and Sokal, 1973; Willett, 1988; Tombros et al., 2002).
A feature weighting scheme w is an instance of a feature transformation function φ where F_1 = F_2, and it has three typical components (Salton and Buckley, 1988; Dhillon and Modha, 2001):

d′_ij = φ(d_ij, H) = t(d_ij) · g(d_·j) · s(d_i·) ,   (3.4)

where d_·j and d_i· stand for the respective column and row vectors. The individual components are
t(d_ij): a local feature weighting component,
g(d_·j): a global feature weighting component,
s(d_i·): a normalisation component.

Name    t(d_ij) =                        Notes

none    d_ij                             Simple “feature count”, the standard choice in most systems.

bin     sgn(d_ij)                        Binary vectors, very rarely used these days. One relatively recent occurrence in clustering is the genetic algorithm of Jones et al. (1995). See also Lee et al. (2005) who claim better performance for similarity measuring with binary vectors.

maxtf   0.5 + 0.5 · d_ij / max_l d_il    Augmented normalised term frequency (between 0.5 and 1) (Salton and Buckley, 1988). Very rare for clustering.

log     sgn(d_ij) · log(d_ij)            Logarithmic dampening is the second most popular option and also used relatively frequently. Modified versions such as 1 + log(d_ij) (e. g. Singhal et al., 1996b) or log(1 + d_ij) (Rüger and Gauch, 2000) are also common, with t(d_ij) = 0 if d_ij = 0 in each case.

sqrt    sgn(d_ij) · sqrt(|d_ij|)         Square root dampening was used by Cutting et al. (1992). Larsen and Aone (1999) compare sqrt, log and none, finding little reason to use the former two.

Table 3.1: Local feature weighting variants. sgn(x) = −1, 0, 1 depending on whether x is negative, zero or positive; furthermore we define log(0) = 0.

3.3.1 Local Feature Weighting

Different local weighting schemes are described in Table 3.1. The various smoothing functions are motivated by very high values in certain columns, but in practice they are not often used for document clustering.

3.3.2 Global Feature Weighting

The global weighting component is used to de-emphasise very frequent terms since they usually have little or no discriminative power. The common solution is to multiply them by the logarithm of the so-called “inverse document frequency”.11 See Table 3.2 for the details. “Global frequencies” usually mean those with regard to the document collection S, but base frequencies from Ω would be preferable (though often unavailable).

3.3.3 Normalisation

Many IR systems use a normalisation function to eliminate differences in document/vector length, as these are generally believed to have no influence on document quality or relevance. See Table 3.3 for the typical normalisation factors. An additional benefit in the Euclidean space is that the calculation of the cosine similarity measure can be reduced to the simple vector product:

s_Cosine(d_i, d_k) = d_i^T d_k ,   if ‖d_i‖_2 = ‖d_k‖_2 = 1 .   (3.5)

11 “Inverse document frequency” is the accepted term, even though “inverse collection frequency” would be a more accurate description.

Name      g(d_·j) =                                     Notes

none      1                                             Omit global weighting altogether.

idf       log( n / n_j )                                The classical “inverse document frequency” measure to de-emphasise frequent features.

idf-prob  log( (n − n_j) / n_j )                        A probabilistically motivated variant of the inverse document frequency, only rarely used in practice.

mi        log[ (h_ij/N) / ( (h_i·/N) · (h_·j/N) ) ]     The mutual information weighting scheme (Pantel and Lin, 2002) combines elements of local and global weighting. Despite a solid theoretical background it is only rarely used.

Table 3.2: Global feature weighting variants. Exceptionally, n_j is here re-defined as n_j = df(j, H), and N = h_·· is the total feature sum (see also Salton and Buckley, 1988).

Name  s(d_i·) =                          Notes

none  1                                  Omit normalisation altogether.

L1    1 / ‖t(d_ij) · g(d_·j)‖_1          Manhattan normalisation is occasionally used for document clustering.

L2    1 / ‖t(d_ij) · g(d_·j)‖_2          Normalisation to Euclidean unit length is the standard procedure.

Table 3.3: Vector length normalisation. Shrinking vectors to unit length removes differences between long and short documents and makes it easier to compare them.

In accordance with other IR applications most clustering algorithms use normalisation, even though in the mid-Nineties research showed that in a retrieval context giving extra weight to longer documents might pay off (cf. Singhal et al., 1996a,b).
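Combining one choice from each of Tables 3.1 to 3.3 (here: logarithmic dampening, idf, L2 normalisation) yields the familiar tf-idf pipeline. A minimal sketch, assuming a small dense document-feature matrix without empty feature columns:

    import math

    def weight_matrix(matrix):
        # t = 1 + log(tf), g = log(n / n_j), s = L2 normalisation (Tables 3.1-3.3).
        n, m = len(matrix), len(matrix[0])
        df = [sum(1 for i in range(n) if matrix[i][j] > 0) for j in range(m)]
        weighted = []
        for row in matrix:
            w = [(1 + math.log(v)) * math.log(n / df[j]) if v > 0 else 0.0
                 for j, v in enumerate(row)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0  # guard all-zero rows
            weighted.append([x / norm for x in w])
        return weighted

After normalisation, the cosine similarity of Eq. (3.5) reduces to the dot product of two rows.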

In practice there is no limit to the number of different variations for feature weighting. For instance, Allan et al. (1997) use the following formula, which is also employed for clustering by Leuski (2001):

d̂_ij = 0.4 + 0.6 · [ d_ij / ( d_ij + 0.5 + 1.5 · d_i· / ((1/n) · d_··) ) ] · [ log( (n + 0.5) / df(j, H) ) / log(n + 1) ] .   (3.6)

For k-means clustering a re-sampling method was introduced by Modha and Spangler (2003) to determine an optimal feature weighting. However, it comes at a considerable computational cost and it is not yet clear under which circumstances the outcome justifies the means.

3.4 Feature Refinement

The bag-of-words and related document representations are often very high-dimensional and sparse, prompting the use of further refinement and reduction methods. A number of statistical reduction methods are thus known to bring down the size of large document-feature matrices. A second set of refinement methods derives from the fact that in document clustering the individual features are not just abstract and atomic entities, but may be further analysed on an orthographic, syntactic or semantic level, giving rise to a multitude of further representation methods.
In Section 1.6 the vector transformation function φ was introduced, transforming a document vector from one feature space (F_1) into another (F_2):

d′_i = φ(d_i, H), with φ : R^(m_F1) → R^(m_F2) .   (3.7)

Without aiming at too rigorous definitions, we can divide the feature refinement methods for illustrative purposes as follows:
Feature Selection. Methods that remove part of the features and leave the rest untouched. In other words, individual columns are removed from the matrix and the new feature space is a subset of the original feature space: F_2 ⊆ F_1.
Feature Standardisation. Methods that bring the features (words) into standard forms. Some new features may have to be created and several old features may “collapse” onto one new feature. Essentially the features are still easily recognisable as words and the feature spaces bear a strong resemblance: F_2 ∼ F_1.
Feature Extraction. Methods that define new, potentially very abstract features that are remote from the original features. The two feature spaces are thus no longer similar: F_2 ≁ F_1.
Below follows a discussion of various such refinement methods. It may be noted that several, but not all, of them may reasonably be subjected to a linear combination.

3.4.1 Feature Selection

Feature selection (or feature filtering) aims to remove those columns from the document-feature matrix that contribute little or nothing towards explaining the desired properties of the data. The aim is twofold: to reduce “noise” and to decrease complexity.
Following work by Luhn (1958) it is generally assumed that words (lemmata, phrases, ...) with very high frequencies and those with very low frequencies do not contribute significantly to the content of a text. Several procedures have been developed to cut off features that are either too common or too rare. Figure 3.1 shows Luhn’s hypothesis about the “resolving power” of features with regard to their frequency. Words to the right of D are too infrequent to convey substantial content, but they are less problematic than those to the left of C, which occur all too frequently and threaten to bury the content under their “noise”.

3.4.1.1 Stopword Removal

Probably the oldest feature refinement method in IR of all is stopword removal. Stopwords are loosely defined as frequent words carrying little or no useful information by themselves. Very typical cases are words such as “to”, “be”, “or”, “not”, etc.
Despite their ubiquitous application, stopwords are relatively unexplored. Only little work has gone into the systematic determination of stopwords (Fox, 1992; Wilbur and Sirotkin, 1992; Sinka and Corne, 2003a,b).

[Figure 3.1 shows a diagram: word frequency plotted against words arranged by frequency, with the resolving power of significant words peaking between an upper cut-off C and a lower cut-off D.]

Figure 3.1: Luhn’s curve. Resolving power (dotted line) of words with regard to their frequency (straight line) (Luhn, 1958). C and D are upper and lower cut-off points. Words bearing real content normally lie in the range between those two.

a, about, above, across, after, afterwards, again, against, all, almost, alone, along, already, also, although, always, among, amongst, an, and, another, any, anyhow, anyone, anything, anywhere, are, around, as, at, be, became, because, become, becomes, becoming, been, before, beforehand, behind, being, below, beside, besides, between, beyond, both, but, by, can, cannot, co, could, down, during, each, eg, either, else, elsewhere, enough, etc, even, ever, every, everyone, everything, everywhere, except, few

Table 3.4: Stoplist. The first 70 of the 250 terms on the popular stoplist devised by van Rijsbergen (1979).

Most applications rely on the old list provided by van Rijsbergen (1979, see Table 3.4), on the one derived from the Brown corpus (Fox, 1992) or on ad-hoc creations.
For clustering in a particular domain, a specific stoplist should be used. For instance, for clustering sport reports, terms such as “winner” and “loser” carry no discriminative power and should be omitted lest they actually obstruct clustering. Stoplists for search engines should usually be enlarged by typical Web terms such as “home”, “back”, “contact”, “webmaster”, etc., which usually add nothing at all to distinguish between the content of different documents.
Even though relatively little attention is paid to stoplists, their usefulness is almost universally accepted and there is rarely an IR application that does not benefit from a careful application of stopword removal.

3.4.1.2 Pruning

We use the term “pruning” for all those methods that reduce the feature space not by looking at the properties of individual features but by looking at the feature matrix and the frequencies of the features. The aim is again twofold: first of all to discard features that have little or no discriminative power, i. e. those at the edges of Luhn’s curve (though pruning mainly addresses the infrequent features and not the very frequent ones). Secondly, the goal is to reduce the size of the matrix to the most significant dimensions and thus speed up similarity calculations.

Local Pruning deals with each document vector individually and without regard to the other documents. Typically, the q most frequent features of each vector are retained, though a concentration on medium-frequency features could also be imagined.
Weiss et al. (2000a,b) use an approach where each document is represented by the words in its title and in other special keyword tags as well as the q most frequent terms in the text body (after stopword removal), with q set to a low value (e. g. q = 8). Similar procedures are described by Cutting et al. (1993), who use the 50 highest weighted features of each vector, and by Schütze and Silverstein (1997), who perform successful experiments with q = 20 and q = 50. In each document the values for the less frequent words are simply set to zero. In a different setting Pantel and Lin (2002) also fared well with a local reduction technique. On the other hand, Larsen and Aone (1999) find that q = 25 leads to acceptable but worse results than q = 250.

Local pruning does not necessarily lead to an overall reduction of dimensions. Still, substantial savings in computational cost are possible because the number of non-zero matrix elements is reduced to a fraction of the original number.
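A sketch of local pruning, keeping the q highest-valued features of a single document vector (the default q = 8 follows the low setting cited above):

    import heapq

    def prune_local(vector, q=8):
        # Zero out everything except the q highest-valued features of this vector.
        if q >= len(vector):
            return list(vector)
        keep = set(heapq.nlargest(q, range(len(vector)), key=lambda j: vector[j]))
        return [v if j in keep else 0 for j, v in enumerate(vector)]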

Global Pruning looks at the overall feature frequencies to determine the final selection. The clustering engine of Hannappel et al. (1999) simply restricts document representation to the q most frequent terms in the collection (after stemming and stopword removal); from these representations the engine then distills so-called topics which are used for cluster description. Bradley and Fayyad (1998) use the top 302 words (without stemming or stopword removal). Related approaches are discussed by Yang and Pedersen (1997) for text classification.
Dhillon and Modha (2001) eliminate non-content-bearing “high-frequency” and “low-frequency” words, while the search result clustering algorithm of Rüger and Gauch (2000) uses a special weighting scheme to determine the most useful terms:

w(f_j) = ( df(j, H_S) / df(j, H_Ω) ) · log( n / df(j, H_S) ) .   (3.8)

Only the q features with the highest weights are kept. The first factor favours words that are specific to the selected documents (compared to the document universe Ω), while the second factor favours terms with a medium document frequency. Rüger and Gauch report that for n = 1000 and q between 9 and 11 the clustering quality was significantly better than for the full set of features.12
Volk and Stepanov (2001) present two sampling methods by which they iteratively find highly discriminative word subsets via an entropy threshold. However, the more effective of the two methods, word set re-sampling, comes with prohibitively high computational costs of the order O(n^3).
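Equation (3.8) translates directly into code; df_S and df_Omega stand for df(j, H_S) and df(j, H_Ω):

    import math

    def rueger_gauch_weight(df_S, df_Omega, n):
        # First factor: specificity to the selection S compared to the universe;
        # second factor: preference for medium document frequencies (Eq. 3.8).
        return (df_S / df_Omega) * math.log(n / df_S)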

3.4.1.3 POS Selection

A linguistically motivated, more general alternative to stopword filtering is feature selection based on part-of-speech tags. For instance, Basili et al. (2000) consider just three “open” word classes (nouns, verbs and adjectives)13 for their classification tasks, while Rüger and Gauch (2000) rely only on nouns. On the other hand, a more detailed POS indexing scheme is described by Arampatzis et al. (2000a,b). They report encouraging results in the field of text classification, which makes their approach appear promising for text clustering as well.
Henderson et al. (2002a,b) use syntax analysis to select just those terms that function as heads in noun or verb phrases, leading to a reduced document representation while still providing acceptable results. Attempts to further restrict the heads to noun phrases occurring in subject or object position (with regard to the main verb phrase) turned out to be too severe, leading to a significant reduction in clustering quality.

12 Prior to applying their weighting scheme, Rüger and Gauch already reduce the feature space to medium-frequency nouns (i. e. nouns f_j with (1/3) · df(j, H_Ω) ≥ df(j, H_S) ≥ 3). The method obviously requires sufficient knowledge about the document universe.
13 Open word classes are those that have an unlimited number of members, with new creations (through compounding, derivation, invention, borrowing, etc.) being constantly added. A typical open word class is the noun class. Closed word classes are much smaller and more stable; new members are only added in exceptional cases. A typical closed word class is the pronoun class.

3.4.1.4 Advanced Weighting

The feature selection methods just described can be interpreted as a binary feature weighting procedure. The restriction that the weights have to be either 0 or 1 can, of course, be lifted. As a result a large number of individual weighting schemes can be imagined, either as additions to or as substitutions for the general weighting schemes discussed in Section 3.3. Lending extra weight to selected terms (e. g. title tokens or proper names) is a popular means of stressing some “relevant” features without risking the accidental loss of some other important feature. For instance, Hatzivassiloglou et al. (2000) identify noun phrase heads and proper names and give them extra weight.

3.4.2 Feature Standardisation

Features, and in particular words, can be standardised in numerous ways. The aim of such standardisation procedures is to collate features that are in some respect very similar to each other. Noise generators such as spelling, style or other linguistic preferences unrelated to the content of the document should thereby be eliminated.
Formally speaking, feature standardisation is defined by an m′ × m weighting matrix W:

d′ = φ(d, H) = W d .   (3.9)

Column W_·j is a mapping vector for the original feature f_j. For most standardisation methods the new feature set F′ is smaller than the old set F and each original feature is mapped onto exactly one new feature:

m′ ≪ m ,   (3.10)
W_·j = (0 … 0, 1, 0 … 0)^T .   (3.11)

The simplest and most universal standardisation method is mapping all features to lower case spelling (“case folding”). A discussion of several more sophisticated mappings follows in the sub-sections.

3.4.2.1 Truncation and Stemming

Truncation and stemming are simple methods aiming to remove word suffixes which lead to a diversification of words that could otherwise be regarded as principally identical.

Truncation: Mechanically chopping off (from the end) or retaining a fixed number of letters of each word. This archaic reduction technique is hardly ever used in document clustering nowadays since the resulting collations are too arbitrary.

Stemming: Algorithmically reducing words to an (often) artificial word “stem” (not to be confused with lexical stems). Stemming rules are usually quite simply formulated and do not require any real linguistic (morphological) knowledge. Porter’s extremely popular algorithm for English language stemming (Porter, 1980, see Table 3.5 for some examples14) has found wide adoption in IR in general and document clustering in particular. However, some studies such as Riloff (1995) as well as Sinka and Corne (2002) cast doubt on the universal usefulness of stemming.15

14 See also www.tartarus.org/~martin/PorterStemmer/ and snowball.tartarus.org.
15 For an approach to stemming which is based on the clustering of words see Baeza-Yates et al. (2003).

Fried → fri            oscillating → oscil
eggs → egg             puzzled → puzzl
Woody → woodi          comfortably → comfort
into → into            critically → critic
enhanced → enhanc      compulsive → compuls
playfulness. → play.

Table 3.5: Stemming example. Porter’s original stemming algorithm applied to a not entirely random sentence.
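The examples of Table 3.5 can be reproduced with any Porter implementation, e. g. the one shipped with NLTK (assumed installed; NLTK’s variant includes minor extensions, so individual outputs may deviate slightly from the original algorithm):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["fried", "oscillating", "eggs", "puzzled", "comfortably"]:
        print(word, "->", stemmer.stem(word))
    # fried -> fri, oscillating -> oscil, eggs -> egg, ...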

3.4.2.2 Lemmatising

Lemmatising is the linguistic technique of reducing words to their lexical base forms, which is particularly useful for inflecting languages. Since word forms can often belong to more than one lexical stem (e. g. “can” as a noun or verb), word forms disambiguated with POS tags are preferable input to bare word forms. Using both lexical and morphological knowledge, so-called lemmatisers can then transform word forms into the original lemmata.
It is often believed that lemmatising brings only few additional benefits compared to the more aggressive stemming while requiring significant extra effort. It is therefore used relatively rarely (cf. Smeaton, 1997). One study using lemmatising is the work by Henderson et al. (2002a,b).

3.4.2.3 Compound Splitting

Some languages such as German and Swedish allow multiple words to be joined together in various ways so that they form new words. These languages are rich in compounds. Compound words such as Mark Twain’s famous “Personaleinkommensteuerschätzungskommissionsmitgliedsreisekostenrechnungsergänzungsrevisionsfund” obviously contain a whole lot of information about individual concepts (Personal–Einkommen–Steuer–Schätzung–Kommission–Mitglied–Reise–Kosten–Rechnung–Ergänzung–Revision–Fund) which is lost if the word is just regarded as one big black box or if only its suffix is normalised by stemming or lemmatising.
With clustering experiments in languages other than English still being relatively rare, not much study has yet gone into compound splitting techniques. One exception is the work by Rosell (2003), but a number of questions relating to appropriate splitting points and weights still require further investigation. For instance, compounds that have long been in use and have become part of the established vocabulary (e. g. “Strohfeuer”) should normally not be split.

Unlike for the previously discussed techniques, the weight vector W_·j for a compound usually has not just one but several non-zero positions, meaning that a feature from the original representation can be reflected in more than one position of the new document vector.16
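A naive greedy splitter illustrates the basic idea. The tiny lexicon is a toy assumption; real splitters must also handle linking elements (such as the German “Fugen-s”) and decide which lexicalised compounds to leave intact:

    def split_compound(word, lexicon, min_len=3):
        # Greedily match the longest known prefix, then continue on the remainder.
        parts, rest = [], word.lower()
        while rest:
            for end in range(len(rest), min_len - 1, -1):
                if rest[:end] in lexicon:
                    parts.append(rest[:end])
                    rest = rest[end:]
                    break
            else:
                return [word.lower()]   # no full decomposition found
        return parts

    lexicon = {"personal", "einkommen", "steuer"}
    print(split_compound("Personaleinkommensteuer", lexicon))
    # ['personal', 'einkommen', 'steuer']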

3.4.2.4 Semantic Concepts

The ultimate goal of most clustering efforts is not to bring together documents with similar words or word forms, but documents with related content. It is therefore natural to look out for approaches that represent documents by their semantic concepts.
The most ambitious such attempt for clustering has been proposed by Choudhary and Bhattacharyya (2002), who want to represent documents in Universal Networking Language (Uchida et al., 1999). The UNL language models sentences as graphs, with “universal words” as nodes and semantic relations as edges. Each document could then be represented by its universal words, with the number of connecting edges being the frequency d_ij of each word. At present, however, the translation of natural language texts into UNL is still illusory.
More realistic approaches use existing ontologies. For example, Clifton et al. (2004) present an approach based on WordNet (Miller et al., 1990) to expand the keyword list and catch synonym as well as hyper-/hyponym relations between terms. Based on their initial experience they define rules for when to apply WordNet information.
Work by Chu et al. (2003) uses a medical knowledge source with hierarchically organised concepts (similar to WordNet). They describe a sophisticated formula to exploit the knowledge built into the hierarchy. Essentially, the similarity between two documents is the sum of pairwise similarities between all their features (concepts). The contribution of a feature pair (f_1, f_2) with f_1 from D_1 and f_2 from D_2 is 1 if the two are identical, but otherwise it is calculated by taking three contributing factors into account: (1) “hopping distance” in the concept hierarchy between f_1 and f_2; (2) generality of f_1 and f_2 (the higher up in the hierarchy, the smaller the respective similarity); and (3) orthographic similarity of the word stems of f_1 and f_2, which is used as a precaution against missing links in the concept hierarchy.
A domain-specific ontology is also used by Hotho et al. (2002) to achieve both generalisation and feature reduction (in German). Their algorithm projects the words of the text onto concepts of the ontology, keeping and expanding the concepts that are sufficiently “supported” by the text. Each document is thus represented by a limited number of well-chosen concepts, speeding up clustering and increasing accuracy.
Ontologies are attractive because they allow mapping of orthographically very different but semantically close concepts onto a single feature. However, in practice the application is not easy because words are often ambiguous and have multiple meanings (Gonzalo et al., 1998). Feature representation through semantic concepts thus often suffers from over-generalisation: as a result of spurious connections between individual words, some documents are suddenly regarded as related even though they may have nothing at all in common. Often an additional word-sense disambiguation analysis is necessary to achieve a reliable representation (cf. Schütze, 1998; Leacock et al., 1998; Stevenson, 2003); taking the most frequent sense has often been preferred as a less costly alternative (Chu et al., 2003).
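As a small hedged illustration of concept mapping, the sketch below replaces a word by its WordNet synsets via NLTK (assuming the WordNet data is installed); restricting the output to the first synset corresponds to the “most frequent sense” heuristic mentioned above:

    from nltk.corpus import wordnet as wn

    def concept_features(word, most_frequent_only=False):
        # Map a word onto WordNet synset identifiers instead of its surface form.
        synsets = wn.synsets(word, pos=wn.NOUN)
        if most_frequent_only and synsets:
            synsets = synsets[:1]
        return [s.name() for s in synsets]

    print(concept_features("car", most_frequent_only=True))   # e.g. ['car.n.01']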

3.4.3 Feature Extraction

Feature extraction methods break out of the ordinary BOW feature frame. Starting from the existing document features they build new and sometimes totally abstract features that can no longer be easily determined by the eye or by looking at an individual document. Most of the methods in this section perform transformations on the entire document-feature matrix and not just on individual rows or columns. Another recurring element is the goal of bundling features that are thought to convey similar meanings.

16 But not always (e. g. “Sohnessohn”!).

3.4.3.1 Double Clustering

Clustering being a way to bring order into a large number of objects, it is only natural that it has been suggested more than once to use clustering not just for the documents but also for the features. Such double-clustering approaches either first cluster the features and then the documents based on these new features, or cluster both simultaneously.

Frequent Itemsets. Association rule mining (Agrawal et al., 1993) can be used to establish sets of frequently co-occurring terms. Several specific algorithms have been suggested for clustering using such frequent itemsets. A graph-based method was briefly described in Section 2.1.3; other algorithms were developed by Beil et al. (2002) and Fung (2002).
Information Bottleneck Method. A statistical approach is used by Slonim and Tishby (2000). Their double-clustering method uses the information bottleneck method (Tishby et al., 1999) to achieve a significant reduction in dimensionality with minimal loss of information. Based on a mutual information measure, words are first clustered hierarchically such that the new features (word clusters) maintain as much information on the documents as possible. In a second step the documents, represented on the basis of these word clusters, are themselves clustered hierarchically. The algorithm has performed well in comparison to several standard techniques but is computationally more expensive, having a complexity of O(n^3). In later work the sequential information bottleneck method was presented, with reduced time and storage complexities (Slonim et al., 2002).

For word clustering methods see also Pereira et al. (1993) and Dhillon et al. (2002) and for clustering based on sentence clusters Pullwitt and Der (2001, cf. also Section 3.2.3.3). Adaptive Subspace Iteration (Li et al., 2004) is another recent iterative algorithm working on a two-step principle.

3.4.3.2 Latent Semantic Analysis Inspired by other work in statistics, various attempts have been made at tackling the document clustering problem with algebraic methods. By far the most important such method is Latent Semantic Analysis (LSA)/Latent Semantic Indexing (LSI) (Deerwester et al., 1990). Based on the assumption that there is an underlying semantic structure in the data, which is partially obscured by “noise” (randomness of word distribution), latent semantic indexing aims to reduce the high-dimensional, redundant word space to a low-dimensional, orthonormal “concept” space. Singular Value Decomposition (SVD, cf. Golub and van Loan, 1996, 70–71) is applied to the document term matrix. By ordering the dimensions and retaining just those N dimensions corresponding× to the highest eigenvalues, the best possible N-dimensional approximation to the original matrix is obtained (i. e. the approximation with the lowest least-square distance to the original). The method was originally devised for retrieval purposes, but experiments have shown that in clustering it also permits great dimensionality reductions at a comparatively small price in quality. Sch¨utze and Silverstein (1997) show that LSA-reduction from several thousand to 20, 50 or 150 dimensions results in clustering solutions on par with those in the full term space. A similar result was obtained by Hasan and Matsumoto (1999) with 30 dimensions. Lerman (1999) discusses the question of finding the optimal number of dimensions. In a small collection of 1000 3.5. TIME CONSTRAINTS 71 documents she finds that as few as six dimensions produce the best results, but for larger and more diverse collections the optimal number is expected to be higher. Kanejiya et al. (2004) show how to use LSA with syntactically or semantically annotated word tokens. LSA generally reduces the number of dimension very drastically (and results in very compact matrices instead of the sparse ones usually encountered in document clustering). The drawback of Latent Semantic Analysis is that SVD is of complexity O(n3) and even the approximations are computationally very expensive.

3.4.3.3 Random Indexing Random indexing (Kanerva et al., 2000; Sahlgren and C¨oster, 2004) is a recent method aiming to significantly reduce matrix size and at the same time avoiding to have to perform costly calculations on the entire matrix (“the huge matrix step”). The principal idea is to replace the truly orthogonal feature dimensions by nearly orthogonal vectors. In high-dimensional spaces the number of “nearly orthogonal” vectors is much higher than that of the truly orthogonal vectors and the data can thus be represented in a lower dimensionality without much loss of information. Moreover the approach is incremental and does not require the whole matrix to be known in advance.

3.5 Time Constraints

The feature vectorisation, weighting and refinement methods that have been discussed in Sections 3.2 to 3.4 present a large mix of techniques, each of which comes with its specific advantages and disadvantages. In particular, much time is often spent producing representations that are more accurate or more compact. In the present section we attempt to provide an overview of the suitability of these methods for different clustering tasks depending on their time constraints. We suggest distinguishing three stages of a clustering application:

Vectorisation: Storing and vectorising the documents plus all feature weighting and refinement operations that can be performed for each document individually.

Matrix assembly: Gathering the document vectors into a matrix and performing all matrix- dependent weighting and mapping operations.

Clustering: The actual clustering phase.
Table 3.6 shows a scheme for classifying clustering applications with regard to the time-criticality of these three phases and introduces four scenarios with different requirements: off-line clustering, repeated clustering, ad-hoc clustering and instant clustering.17 Table 3.7 in turn shows which methods occur at which stages, providing a first rough guideline as to which techniques to use for which kind of application.
If the actual clustering is performed on-line, fast clustering algorithms are necessary. Suffix tree clustering, hierarchical divisive clustering and some of the partitional methods appear to be the methods of choice. Hierarchical agglomerative, probabilistic and double-clustering methods appear to be better suited if speed is less critical.

17 Those situations where the documents arrive one by one and clustering is done in parallel with document processing (incremental clustering, cf. Section 2.3.2.1) probably all belong to the instant-clustering scenario.

Scenario                Vectorisation stage   Matrix assembly stage   Clustering stage

Off-line clustering.                                                  
Applications with no time-critical elements, such as pre-clustering a document collection for cluster-based retrieval.

Repeated clustering.                                                  x
Applications which perform different cluster applications on a constant matrix. Typically, one or several clustering algorithms are repeatedly run with different parameters. Such parameters could be set by a user, who is thus given some control of the clustering interface.

Ad-hoc clustering.                            x                       x
Applications without a predetermined document set. Both the selection of documents and clustering are performed on-line (e. g. clustering of search engine results).

Instant clustering.     x                     x                       x
Applications which cannot rely on predetermined vector representations. They start with raw documents which must be analysed on the fly. An example is a clustering interface to a meta search engine.

Table 3.6: Cluster scenarios. A basic scheme of clustering applications with regard to their time-critical components. An x marks a time-critical stage.

Section  Method                                                      Vectorisation  Matrix assembly  Clustering

Document Vectorisation
3.2.1    Simple document tokenisation (BOW)                          x
3.2.2    Vectorisation with linguistic annotations                   X
         (POS tagging, sentence parsing, etc.)
3.2.3.1  Statistical phrase extraction                               x
3.2.3.2  Syntactic phrase extraction                                 x
3.2.3.1  Suffix tree construction                                    x

Feature Weighting
3.3.1    Local feature weighting                                     x
3.3.2    Global feature weighting                                                   x
3.3.3    Vector normalisation (depends on previous feature           (x)            (x)
         representation methods selected)

Feature Refinement
3.4.1.1  Stopword removal                                            x
3.4.1.2  Local pruning                                               x
3.4.1.2  Global pruning                                                             x
3.4.1.2  Global pruning with re-sampling                                            X
3.4.1.3  POS selection (requires prior POS tagging)                  x
3.4.1.4  Various advanced weighting schemes                          x
3.4.2.1  Truncation and stemming                                     x
3.4.2.2  Lemmatising/morphological analysis                          X
3.4.2.3  Compound splitting (depending on methods)                   (X)            (X)
3.4.2.4  Semantic mapping (and word sense disambiguation)            X
3.4.3.1  Sequential double-clustering                                               X
3.4.3.1  Simultaneous double-clustering                                                              X
3.4.3.2  Latent semantic analysis                                                   X
3.4.3.3  Random indexing                                             x

Table 3.7: Representation methods and application stages. Different document representation and clustering methods occurring at different stages of a clustering application. X indicates major, x minor computational costs.

3.6 Cluster Presentation

Many real-world applications require a way of presenting the final cluster solution to the end-user in a practical and informative fashion. Below we discuss several aspects of cluster presentation: visualising the cluster structure (Section 3.6.1), describing individual clusters (Section 3.6.2) and interaction between user and system (Section 3.6.3).

3.6.1 Clustering Structure

The traditional solution for displaying clusters is text-based and on the surface similar to the well-known Open Directory Project18, which is a hierarchically structured collection of manually selected Web pages. At each step of his travel through the directory the user sees one particular node of the hierarchy: first a list of all sub-categories, followed by a list of directly relevant document links (see Figure 3.2).
Some applications display an expandable and reducible hierarchy in a separate frame, giving the user a better overview of the overall structure and allowing the popular “Explorer” browsing strategy. See, for instance, the interface of the Vivísimo meta search engine (Figure 3.3). An alternative is to display the hierarchy in the main window and list the top one or two documents under each entry. Only upon the user’s explicit choice of a certain cluster are all entries displayed (e. g. Maarek et al., 2000). A “dynamic” explorer structure is described by Osdin et al. (2002).
More advanced two- and three-dimensional graphical representations are discussed by Hearst (1999b,c). A clustering algorithm producing eo ipso a two-dimensional mapping is WEBSOM (Section 2.3.2). It is not yet established how useful such sophisticated spatial arrangements are for the end-user or whether the customary text-oriented models are not actually preferable.
A minor problem with browsable hierarchies is that the binary structure created by a standard hierarchical algorithm is usually not very desirable. However, there are several ways of solving the problem easily; for example, by displaying the eight great-grandchildren of every node instead of just the two immediate children nodes. Other approaches use non-binary trees (Maarek et al., 2000), which also solve the problem handily.
The Scatter/Gather algorithm (Cutting et al., 1992, 1993) is based on a different idea: initially, the documents are divided into a flat partition (“scattered”). The user then selects one or several clusters; these are collected (“gathered”) and again scattered into several sub-groups. Thus, the hierarchy is only developed on the fly as demand arises.

3.6.2 Cluster Description

Probably even more important for the success or failure of a clustering application than visualisation of the structure is an appropriate description or summary of the individual clusters (“cluster annotation”). Below follow brief descriptions of various elements that have been used for cluster annotation. Clearly, there is always a trade-off between the amount of information provided and the clarity and simplicity of the presentation.

Keywords/Key Features (Cluster Digest). The basic description of a cluster is almost always derived from its most typical clustering features. These may be the features with the highest weights on the cluster centroid, but special weighting schemes (including tf-idf) are also applied and may lead to better results.

18 www.dmoz.org

Figure 3.2: Open Directory Project. Typical view of the Open Directory hierarchy with sub-classes and cross-references on top, followed by individual documents.

Figure 3.3: Vivísimo. Clustering interface to the Vivísimo meta search engine (www.vivisimo.com), with clusters on the left and individual document references on the right.

For instance, Chang and Hsu (1998) suggest using for each cluster C_i the features (terms) f_j with what they call the highest normalised cue validity (ncv):

ncv(f_j, C_i) = [ p(f_j, C_i) / ( p(f_j, C_i) + p(f_j, C̄_i) + ε ) ] · log( 1 / p(f_j, Ω) ) ,   (3.12)

where the individual feature sum of feature f_j in document set A is fs(j, A) = Σ_{d_i = φ(τ(D_i)) ∧ D_i ∈ A} d_ij, the cluster complement is C̄_i = S \ C_i, and the term probability is p(f_j, C_i) = fs(j, C_i) / Σ_{q=1}^{m} fs(q, C_i). ε is a very small value ensuring that the first term is smaller than one.
Cluster digests derived from phrases (Zamir and Etzioni, 1999; Hannappel et al., 1999) or named entities (Clifton et al., 2004) are considered more humanly understandable than those derived from simple bag-of-words models, which may result in some rather weird or “ugly” descriptions, especially if word stemming has been applied in a previous processing step. Transforming the stemmed forms into nouns by lexical look-up may provide a solution. For a discussion of cluster description by phrases and sentences see Silva et al. (2001).

Size. Indicating the number of documents in a cluster is a simple but often helpful information. The number of sub-clusters may also be indicated in a hierarchical setting.

Cluster Quality. Chang and Hsu (1998) also display the average group similarity of each cluster as a measure of cohesion. Other compactness and separation measures are also used to indicate the relative “quality” of each cluster.

Representative(s). A popular method is to describe a cluster by giving title and link of its most representative document(s). In Chang and Hsu (1998) a document description consisting of the first few lines of text is also given. Several methods for choosing these representatives have been used. Chang and Hsu consider those documents nearest to the cluster centroid and choose the one with the least URL depth (as measured by the number of slashes). Leuski (2001) finds that in a search engine experiment the highest ranked document in a cluster is a better representative than the one nearest to the centre.

Sentences. Osdin et al. (2002) suggest picking descriptive sentences from the documents in each cluster. The sentences chosen are those that best match the original query terms. The result is similar to the popular document “snippets” returned by most of the current search engines.

Well-Linked Sites. In addition to a representative document, Modha and Spangler (2000) take link structure into account to select four more document titles/links for describing the cluster (called “breakthrough”, “review”, “citations” and “references”). These are cluster documents that show the most typical “in-link” and “out-link” patterns and those with most in- and out-links from and to the cluster (in relative terms).

Clearly, the careful selection of key words/features is central to the success of most document clustering applications (Stefanowski and Weiss, 2003). This has also been recognised by the commercial clustering application Vivísimo, which already in the clustering process discards clusters which cannot be described concisely, accurately and distinctively (Vivísimo, 2003). Details of their algorithm are not disclosed, though the work by Pericliev and Valdés-Pérez (1998) and Palmer et al. (2001) gives an idea of how the search for clear conceptual descriptions may work.
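For illustration, Eq. (3.12) in code form; the three probabilities are assumed to be pre-computed as defined above:

    import math

    def ncv(p_cluster, p_complement, p_universe, eps=1e-9):
        # Normalised cue validity of a feature for a cluster, cf. Eq. (3.12).
        return (p_cluster / (p_cluster + p_complement + eps)
                * math.log(1.0 / p_universe))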

3.6.3 Interactive Clustering

Relatively little research has so far been devoted to the interactive aspects of document clustering. However, as stressed by Kreulen et al. (2001) and Spangler and Kreulen (2002), data mining (and document clustering) should no longer be regarded as a "hands-off" task. Instead, user feedback can play a critical role, and we expect to see more research in that direction. Spangler and Kreulen describe an approach that is ultimately geared towards a classification scheme created by an interactive process of clustering and human intervention. A more traditional form of interaction is the relevance feedback used by Leuski (2001) to dynamically re-order documents within a cluster, so that those deemed most representative with regard to the user's prior browsing are listed on top. While the experiments presented in this study focus on technical aspects of document representation, practical issues and in particular user interaction are important topics for future research.

Chapter 4

Experimental Setup

Then of the two propositions, both of them Aristotelian doctrines, the second—which says it is necessary to prefer the senses over arguments— is a more solid and definite doctrine than the other, which holds the heavens to be inalterable. Therefore it is better Aristotelian philosophy to say “Heaven is alterable because my senses tell me so,” than to say, “Heaven is inalterable because Aristotle was so persuaded by reasoning.”

Galileo Galilei in Dialogue Concerning the Two Chief World Systems (1632)

It is the goal of this thesis to increase empirical knowledge about the usefulness of natural language processing techniques in a specific setting: document representations of German-language texts for clustering. The present chapter describes the experimental setup consisting of the document data (Section 4.1), the clustering software (Section 4.2) and the evaluation methodology (Section 4.3). Finally, there follows a description of a few preliminary experiments that were conducted to determine the central parameters of the clustering software (Section 4.4).

Following common practice in document clustering evaluation, we chose a setup with labelled (i.e. categorised) documents. These labels are not known to the algorithm. A cluster solution is evaluated by identifying the overlap/similarity of the clusters and the categories (see Sections 2.6.2 and 4.3).

4.1 Document Data

In the absence of suitable reference collections for clustering experiments in German, we constructed our own database. Following a brief review of some commonly-used English data sets, we present our German ones in some detail.


alt.atheism               rec.sport.hockey
comp.graphics             sci.crypt
comp.os.ms-windows.misc   sci.electronics
comp.sys.ibm.pc.hardware  sci.med
comp.sys.mac.hardware     sci.space
comp.windows.x            soc.religion.christian
misc.forsale              talk.politics.guns
rec.autos                 talk.politics.mideast
rec.motorcycles           talk.politics.misc
rec.sport.baseball        talk.religion.misc

Table 4.1: The 20 Newsgroups data set.

4.1.1 Brief Review of English Collections

Experiments with clustering algorithms have been performed on a large variety of test collections. Many experiments were conducted on data sets generated ad hoc, making it impossible to compare results across different studies. The following list names some of the more widely used standard test sets for English.

Reuters Test Collection. In recent years, probably the most popular test collection for text clustering, categorisation and a number of other tasks has been the Reuters-21578 corpus (Lewis, 1997; Hettich and Bay, 1999): a collection of 21,578 articles which appeared on the Reuters newswire in 1987. Each article has been manually indexed with a variable number of topics (135), people (267), places (175), organisations (56) and (stock) exchanges (39). The whole collection is formatted in SGML and has an uncompressed size of 28 MB. The present polished version has been available since 1997.¹

Reuters Corpus Volume 1 (RCV1). The sequel to the popular Reuters-21578 collection is the Reuters Corpus Volume 1 (Rose et al., 2002), which contains 806,791 English-language articles collected between 20 August 1996 and 19 August 1997. The collection is formatted in XML and has a size of 3.7 GB. The articles are tagged with one or more topics (103) and region codes (366) as well as zero or more industry codes (376). Topics and industry codes are part of a hierarchical structure.

20 Newsgroups. Another popular test collection has been the 20 Newsgroups (20NG) set (Lang, 1995; Hettich and Bay, 1999), a collection of nearly 20,000 Usenet messages evenly distributed over 20 newsgroups. Some of these groups are very similar to each other, while others are distinctly different from the rest (see Table 4.1). The data set has an uncompressed size of 61.6 MB.

OHSUMED Collection. With 348,566 labelled titles/abstracts of medical articles from the period 1987–1991, this is one of the popular test sets regularly used at the annual Text REtrieval Conferences (TREC).²

¹ An earlier version was Reuters-22173.
² http://trec.nist.gov

Associated Theme        Categories
Banking & Finance       Commercial Banks; Building Societies; Insurance Agencies
Programming Languages   Java; C/C++; Visual Basic
Science                 Astronomy; Biology
Sport                   Soccer; Motor Sport; Sport (without soccer and motor sport)

Table 4.2: The BankSearch Dataset. Each class contains 1,000 documents.

Web Page Collections. Modern document clustering studies are very often geared towards the clustering of Web pages. Standard test collections were unavailable for a long time, so that each researcher had to construct their own test set. In 1997 a separate TREC Web Track³ (initially "Very Large Collection Track") was started, which resulted in two large standard collections of Web documents: WT100g (VLC2) (100 GB; over 18 million documents, based on an Internet Archive crawl in 1997) and .GOV (18.1 GB; ca. 1.25 million documents from the .gov domain, 2002). These pages are not indexed, however, and relevance judgements exist only for documents and topics used in past TREC tasks (Craswell and Hawking, 2002). Various researchers have made use of the Open Directory Project⁴ or the Yahoo! Directory⁵ to gather labelled Web document collections. Sinka and Corne (2002, 2004) give an overview of such endeavours and provide the BankSearch Web Document Dataset as a large benchmark for Web document clustering.⁶ The collection consists of 11,000 hypertext documents belonging to eleven more or less diverse categories which are part of four different associated themes (see Table 4.2). At present it cannot yet be judged whether Sinka and Corne's data set will gain wider acceptance as a reference database.

4.1.2 Five German Data Sets

In the absence of comparable standard data collections for German texts, we decided to build a repository of our own. The following considerations played a role:

• Number of data sets: in order to reach higher generality, there should be not just one but several entirely independent corpora drawn from different sources.

• Size: our corpora should be sufficiently large; the larger the better.

³ es.cmis.csiro.au/TRECWeb
⁴ www.dmoz.org
⁵ www.yahoo.com
⁶ Used to be available until summer 2005 from www.pedal.reading.ac.uk/banksearchdataset/.

• Number of categories: at least some of the corpora should feature more than just a handful of categories.

• Uneven distribution: the categories should have different sizes, making the clustering task both harder and more realistic.

• Length: for similar reasons, documents should have different lengths. Moreover, in order to reduce random effects they should not be too short.

HTML constructs and other tags were stripped from all documents, which were then stored in an XML format. Below we describe the five individually gathered data sets, concluding with a summary of the key properties of all five. In all instances, documents that were too short (fewer than twenty tokens, fewer than fifteen types or fewer than five nouns) were excluded. It was also attempted to remove all documents recognised as non-German. Duplicates were eliminated as far as possible.⁷
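As an illustration, the length filter just described can be written as a small predicate; the POS tag name "sub" for nouns anticipates the generic scheme of Section 5.2.1, and the function itself is our sketch rather than the original tooling.

    def keep_document(tokens, pos_tags):
        # Minimal-length filter used when compiling the corpora: at least
        # 20 tokens, 15 types and 5 nouns; pos_tags is parallel to tokens.
        if len(tokens) < 20:
            return False
        if len(set(tokens)) < 15:
            return False
        if sum(1 for t in pos_tags if t == "sub") < 5:
            return False
        return True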

4.1.2.1 Springer Data Set

Name                       springer
Source                     A collection of German book descriptions found at www.springeronline.com in December 2004.
Categories                 The German Springer titles come in 28 different categories. Of these, the seven largest were retained. See Figure 4.1.
Number of Categories       7
Number of Documents        3,836
Documents per Category     258–1,148 (average: 548)
Document Length (Tokens⁸)  20–269 (average: 88)
Document Length (Types)    15–185 (average: 70)
Tokens in Corpus           335,658
Types in Corpus            46,232
Token Length               2–63 letters (average: 7.24)

⁷ The entire data cleansing problem was treated rather in an ad-hoc manner in this study. See for example Sung et al. (2003) for a more careful and detailed examination which, incidentally, uses clustering techniques for data cleansing.
⁸ In the subsequent descriptions, "document length in word tokens" is counted as all individual occurrences of text units in a document with at least two letters. "Document length in word types" is the number of different such tokens in each document.

Figure 4.1: Distribution of springer documents per category. (Bar chart over the seven categories ARCH: Architektur & Design, BWLM: Betriebswirtschaft & Management, INFO: Informatik, JURA: Rechtswissenschaften, MEDI: Medizin, TECH: Technik, VOLK: Volkswirtschaft; category sizes range from 258 to 1,148 documents.)

4.1.2.2 Amazon Data Set

Name                      amazon
Source                    Descriptions of German books from Amazon.⁹ Information was taken from content descriptions, reviews, publisher's announcements and author descriptions. For most titles only a part of these four were available. Book title and author were not included separately (but may occur in the running text). Reviewers' names were deleted as far as possible. As the information differs greatly between books, proper data cleansing proved very difficult and clustering may therefore have become a more challenging task.
Categories                Books were retrieved via the Amazon content categories. Some of the small categories were then merged, leading to the labels in Figure 4.2.
Number of Categories      21
Number of Documents       69,583
Documents per Category    323–15,412 (average: 3,313)
Document Length (Tokens)  20–3,248 (average: 187)
Document Length (Types)   15–1,142 (average: 126)
Tokens in Corpus          13,006,027
Types in Corpus           527,266
Token Length              2–202 letters (average: 5.98)

In many respects the amazon set was the most "difficult" data set. Encoding and presentation differed significantly between and sometimes even within documents. Among the consequences is the occurrence of a few irregular "tokens" (ca. 20) with lengths of up to 202 letters which do not carry any ordinary meaning. Moreover, certain parts or even entire descriptions turned out to be exact duplicates (with different ISBNs), and despite strenuous effort some near or total duplicates survived the elimination process (as was found only much later).

In themselves these difficulties are not too severe, as they represent typical "real-world" phenomena that cluster applications will always face. On the other hand, higher-quality data is of course desirable, as the results are then less likely to be influenced by random effects.

⁹ www.amazon.de

Figure 4.2: Distribution of amazon documents per category. (Bar chart over the 21 categories ARCH: Architektur, BELL: Belletristik, BIOC: Biologie/Chemie, BIOG: Biografien/Erinnerungen, BUSI: Börse/Geld/Business/Karriere, COMP: Computer/Internet/Informatik, EROS: Erotik, GERM: Germanistik, HIST: Geschichtswissenschaft, INGE: Ingenieurwissenschaften, KIND: Kinder-/Jugendbücher, KULT: Film/Kultur/Comics, KUNS: Kunstwissenschaft, LIFE: Kochen/Lifestyle, MATH: Mathematik, PUBL: Medienwissenschaft, RATG: Ratgeber, REIS: Reisen, RELI: Religion/Esoterik, SCIF: Science Fiction/Fantasy/Horror, SPOR: Sport; category sizes range from 323 to 15,412 documents.)

Figure 4.3: Distribution of sda documents per category. (Bar chart over the five categories AUSL: Ausland, INLA: Inland, KULT: Kultur, VERM: Vermischtes, WIRT: Wirtschaft; category sizes range from 4,511 to 18,809 documents.)

4.1.2.3 Schweizerische Depeschenagentur Data Set

Name                      sda
Source                    Collection of newswire articles of the year 2004, courtesy of the Schweizerische Depeschenagentur (SDA).¹⁰
Categories                German SDA news come in five major categories. See Figure 4.3.
Number of Categories      5
Number of Documents       68,682
Documents per Category    4,511–18,809 (average: 13,736)
Document Length (Tokens)  20–992 (average: 192)
Document Length (Types)   18–524 (average: 133)
Tokens in Corpus          13,182,229
Types in Corpus           413,954
Token Length              2–50 letters (average: 6.22)

¹⁰ www.sda.ch

4.1.2.4 Wikipedia Data Set

Name                      wiki
Source                    Collection of articles from the German version of the free Internet encyclopaedia Wikipedia.¹¹
Categories                Entries in the Wiki encyclopaedia are assigned to one or more categories, which are hierarchically organised. For our purposes most of the top categories were used. Documents which belonged to several main categories were assigned, partly manually and partly through a simple majority algorithm, to a single category. If the decision was too difficult, they were discarded. The categories are shown in Figure 4.4.
Number of Categories      22
Number of Documents       56,047
Documents per Category    321–6,483 (average: 2,548)
Document Length (Tokens)  20–18,534 (average: 279)
Document Length (Types)   15–4,076 (average: 167)
Tokens in Corpus          15,660,537
Types in Corpus           893,408
Token Length              2–202 letters (average: 6.33)

Data cleansing was slightly complicated by various HTML codes which were not all successfully eliminated (as demonstrated by the survival of half a dozen very long code strings). Clustering results are hardly affected by these very rare phenomena. However, the fact that various documents were of an overlapping nature and might have been assigned to more than one category may have made clustering more difficult.

¹¹ de.wikipedia.org

Figure 4.4: Distribution of wiki documents per category. (Bar chart over the 22 categories ASTR: Astronomie, BIOL: Biologie, BUSI: Wirtschaft, CHEM: Chemie, FREI: Freizeit/Spiele, HIST: Geschichte/Politik, INFO: Informatik, JURA: Recht, KUNS: Kunst, LIFE: Essen und Trinken, LITE: Literatur, MATH: Mathematik, MEDZ: Medizin, ORGA: Organisation, PHIL: Philosophie/Psychologie, PHYS: Physik, RELI: Religion, SEXU: Sexualität, SOZI: Soziologie, SPOR: Sport, SPRA: Sprache/Schrift/Linguistik, TECH: Technik; category sizes range from 321 to 6,483 documents.)

Figure 4.5: Distribution of nzz documents per category. (Bar chart over the seven categories AUSL: Ausland/International, FEUI: Feuilleton, INLA: Inland/Schweiz, SPOR: Sport, VERM: Vermischtes, WIRT: Wirtschaft, ZURI: Zürich; category sizes range from 2,899 to 8,682 documents.)

4.1.2.5 Neue Z¨urcher Zeitung Data Set

Name                      nzz
Source                    A selection of newspaper articles from 1993, courtesy of the Neue Zürcher Zeitung (NZZ).¹² Authors' names and short signatures, newswire abbreviations and date/place-lines as well as titles and subtitles have been omitted. Sport results, stock prices, content lists etc. were excluded.
Categories                Texts were taken from and labelled with the seven major departments of the paper. See Figure 4.5. The number of documents included per category was randomly chosen; the relative numbers therefore do not reflect the actual frequencies of articles in the respective departments.
Number of Categories      7
Number of Documents       35,861
Documents per Category    2,899–8,682 (average: 5,123)
Document Length (Tokens)  342–3,505 (average: 726)
Document Length (Types)   139–1,691 (average: 422)
Tokens in Corpus          26,033,926
Types in Corpus           808,497
Token Length              2–69 letters (average: 6.25)

¹² www.nzz.ch

                         springer   amazon      sda         wiki        nzz
Categories               7          21          5           22          7
Documents                3,836      69,583      68,682      56,047      35,861
Avg. documents/category  548        3,313       13,736      2,548       5,123
Avg. length (in tokens)  88         187         192         279         726
Avg. length (in types)   70         126         133         167         422
Types in corpus          46,232     527,266     413,954     893,408     808,497
Tokens in corpus         335,658    13,006,027  13,182,229  15,660,537  26,033,926
Avg. token length        7.24       5.98        6.22        6.33        6.25

Table 4.3: Summary of the five German data sets.

4.1.2.6 Summary

The five experimental data sets are summarised in Table 4.3. Although in terms of content there are some similarities (two news corpora and two book corpora), the numbers show that no two sets are directly comparable and a large number of variations in terms of sizes, lengths and text diversity is covered. Table 4.4 shows the distribution of part-of-speech categories for the five corpora (see Section 5.2.1 below for details and explanations). Figure 4.6 illustrates the distribution of the main word classes. In view of the different text lengths, corpus sizes and scopes, direct numeric comparisons between the clustering results of two corpora require great caution and are likely to be invalid.

Figure 4.6: POS distributions of tokens and types. The bars reflect the respective proportions of the open word classes (sub, nam, nam1, nam2, vrb, adj, adv), with the closed classes being summed under "Rest" (see Table 4.4 for the exact figures). The individual POS labels are discussed in Section 5.2.1. (Two bar charts, one for token and one for type percentages, each comparing the five corpora springer, amazon, sda, wiki and nzz.)

      springer        amazon          sda             wiki            nzz
POS   tokens  types   tokens  types   tokens  types   tokens  types   tokens  types
sub   0.3073  0.6287  0.2393  0.5639  0.2740  0.6949  0.2805  0.5978  0.2486  0.6776
vrb   0.1085  0.0567  0.1281  0.0592  0.1487  0.0361  0.1195  0.0356  0.1326  0.0320
adj   0.1396  0.1797  0.1123  0.1454  0.0804  0.0908  0.0940  0.1094  0.1053  0.1125
art   0.1383  0.0001  0.1040  0.0000  0.1417  0.0000  0.1175  0.0000  0.1331  0.0000
pro   0.0552  0.0037  0.0945  0.0005  0.0570  0.0006  0.0632  0.0003  0.0760  0.0003
app   0.1085  0.0023  0.0941  0.0004  0.1331  0.0004  0.1109  0.0002  0.1176  0.0003
kon   0.0716  0.0012  0.0595  0.0001  0.0404  0.0002  0.0506  0.0001  0.0533  0.0001
adv   0.0270  0.0109  0.0541  0.0112  0.0383  0.0056  0.0420  0.0063  0.0569  0.0053
nam   0.0158  0.0663  0.0444  0.1832  0.0428  0.1377  0.0643  0.2237  0.0312  0.1384
nam1  0.0024  0.0090  0.0212  0.0093  0.0089  0.0090  0.0199  0.0064  0.0082  0.0063
nam2  0.0016  0.0077  0.0099  0.0103  0.0064  0.0104  0.0109  0.0072  0.0079  0.0080
ptk   0.0117  0.0023  0.0187  0.0009  0.0174  0.0008  0.0139  0.0005  0.0197  0.0006
unk   0.0089  0.0293  0.0149  0.0144  0.0021  0.0125  0.0054  0.0098  0.0028  0.0172
num   0.0035  0.0019  0.0049  0.0012  0.0088  0.0012  0.0074  0.0025  0.0068  0.0013

Table 4.4: POS distributions. Relative distribution of the part-of-speech tags within the five data sets.

4.2 Software

All experiments in this study have been performed with the fast clustering toolkit Cluto (Karypis, 2003). Cluto is a set of powerful general-purpose clustering algorithms which have, at least partly, been designed for documents (Zhao and Karypis, 2001, 2002, 2003). The software is freely available¹³ and has immediately gained a large following, in particular in the IR area (e.g. Casillas et al., 2003; Boulis and Ostendorf, 2004; Purandare and Pedersen, 2004; Zhong and Ghosh, 2005). Cluto includes several clustering algorithms (graph-based, partitional, hierarchical agglomerative and divisive) which can be run with a variety of criterion functions. As input it accepts either a document-document similarity matrix or a document-feature matrix. Several parameters can be used to control the clustering procedure. We relied mostly on the default options; see Section 4.4 below for further details.

4.3 Evaluation Methodology

Our primary focus is on the quality of a clustering solution (Section 4.3.1), for which a means of visualisation is described in Section 4.3.2. In a number of situations a method can in addition be evaluated from a quantitative aspect (Section 4.3.3). Finally, the problem of interpreting multiple results is briefly discussed (Section 4.3.4).

4.3.1 Cluster Validity

Using labelled data enabled us to use an external and relatively objective evaluation criterion to measure the relative success of different document representation and clustering techniques. Tables 4.5 and 4.6 show the average correlation of different evaluation measures from Section 2.6.2. For the first of the two tables, 4,204 ordinary cluster solutions from subsequent experiments were compared, while 145 random cluster assignments were examined for the second table. Most measures appear to be relatively closely correlated, with the biggest exceptions being the Rand coefficient for well-behaved clustered data and the Γ statistic for randomly distributed data.

The Q0 measure advocated in Dom's comparative study (2002) is the theoretically best-supported measure. If the number of clusters equals the number of labels (as in our case), it is equivalent to weighted entropy (Eq. 2.76).¹⁴ In our further experiments we thus used weighted entropy values, which are one of the two values already provided by Cluto (the other being weighted purity). In the following chapters each document representation method is thus evaluated by the weighted entropy of the cluster solutions it produces: the smaller the entropy, the better the fit between clustering and original labelling, and the better the representation method. In order to obtain more reliable results, each cluster run was performed not just once but on ten random permutations of the ordered input set.¹⁵
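For concreteness, a small Python sketch of the weighted entropy computation; the normalisation by log q (with q the number of label categories) is a common convention and is assumed here rather than taken from Cluto's source.

    import math
    from collections import Counter

    def weighted_entropy(cluster_ids, labels):
        # Size-weighted average of each cluster's label entropy,
        # normalised by log(q); lower values indicate a better fit.
        n, q = len(labels), len(set(labels))
        by_cluster = {}
        for c, lab in zip(cluster_ids, labels):
            by_cluster.setdefault(c, []).append(lab)
        total = 0.0
        for members in by_cluster.values():
            counts = Counter(members)
            h = -sum(k / len(members) * math.log(k / len(members))
                     for k in counts.values())
            total += len(members) / n * h / math.log(q)
        return total

    # A perfect clustering has weighted entropy 0:
    print(weighted_entropy([0, 0, 1, 1], ["a", "a", "b", "b"]))   # -> 0.0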

¹³ www-users.cs.umn.edu/~karypis/cluto
¹⁴ This equivalence between weighted entropy and Q0 is confirmed by the correlation coefficient of 0.999 in both tables.
¹⁵ The ten permutations were the same for each experiment. The permutations became necessary after experiments with Cluto's random initialisation parameter –seed had failed to produce the promised effect.

      Rnd    Jac    F/M    Γ      Q0     wP     uP     wMR    uMR    wE     uE     wF     uF
Rnd   1.000  0.008  0.016  0.221  0.123  0.100  0.139  0.033  0.072  0.104  0.273  0.018  0.026
Jac   0.008  1.000  0.996  0.972  0.928  0.937  0.827  0.935  0.956  0.938  0.851  0.978  0.949
F/M   0.016  0.996  1.000  0.971  0.941  0.946  0.848  0.948  0.962  0.951  0.872  0.987  0.957
Γ     0.221  0.972  0.971  1.000  0.881  0.942  0.791  0.920  0.951  0.895  0.779  0.964  0.932
Q0    0.123  0.928  0.941  0.881  1.000  0.960  0.912  0.869  0.932  0.999  0.949  0.945  0.965
wP    0.100  0.937  0.946  0.942  0.960  1.000  0.905  0.897  0.959  0.965  0.885  0.964  0.971
uP    0.139  0.827  0.848  0.791  0.912  0.905  1.000  0.844  0.859  0.904  0.970  0.877  0.903
wMR   0.033  0.935  0.948  0.920  0.869  0.897  0.844  1.000  0.950  0.880  0.845  0.965  0.910
uMR   0.072  0.956  0.962  0.951  0.932  0.959  0.859  0.950  1.000  0.942  0.863  0.981  0.983
wE    0.104  0.938  0.951  0.895  0.999  0.965  0.904  0.880  0.942  1.000  0.941  0.954  0.969
uE    0.273  0.851  0.872  0.779  0.949  0.885  0.970  0.845  0.863  0.941  1.000  0.886  0.911
wF    0.018  0.978  0.987  0.964  0.945  0.964  0.877  0.965  0.981  0.954  0.886  1.000  0.976
uF    0.026  0.949  0.957  0.932  0.965  0.971  0.903  0.910  0.983  0.969  0.911  0.976  1.000

Table 4.5: Correlation of evaluation measures (with ordinary clusters). Absolute correlation coefficients of different evaluation indices with clustered data. (Abbreviations: Rnd=Rand, Jac=Jaccard, F/M=Fowlkes/Mallows, Γ=Γ statistic, Q0=Dom's Q0 measure, P=Purity, MR=Macro-Recall, E=Entropy, F=F-Measure, w=weighted, u=unweighted. See Section 2.6.2 for the definitions.)

      Rnd    Jac    F/M    Γ      Q0     wP     uP     wMR    uMR    wE     uE     wF     uF
Rnd   1.000  0.994  0.997  0.035  0.998  0.873  0.860  0.987  0.971  0.998  0.995  0.992  0.982
Jac   0.994  1.000  0.999  0.020  0.989  0.823  0.812  0.996  0.988  0.987  0.986  0.988  0.989
F/M   0.997  0.999  1.000  0.022  0.993  0.839  0.828  0.994  0.984  0.991  0.990  0.990  0.988
Γ     0.035  0.020  0.022  1.000  0.047  0.058  0.082  0.009  0.049  0.050  0.053  0.026  0.068
Q0    0.998  0.989  0.993  0.047  1.000  0.877  0.865  0.979  0.966  0.999  0.997  0.987  0.979
wP    0.873  0.823  0.839  0.058  0.877  1.000  0.987  0.797  0.743  0.889  0.882  0.871  0.808
uP    0.860  0.812  0.828  0.082  0.865  0.987  1.000  0.786  0.734  0.877  0.881  0.862  0.800
wMR   0.987  0.996  0.994  0.009  0.979  0.797  0.786  1.000  0.990  0.978  0.977  0.988  0.990
uMR   0.971  0.988  0.984  0.049  0.966  0.743  0.734  0.990  1.000  0.963  0.963  0.968  0.992
wE    0.998  0.987  0.991  0.050  0.999  0.889  0.877  0.978  0.963  1.000  0.998  0.990  0.980
uE    0.995  0.986  0.990  0.053  0.997  0.882  0.881  0.977  0.963  0.998  1.000  0.990  0.980
wF    0.992  0.988  0.990  0.026  0.987  0.871  0.862  0.988  0.968  0.990  0.990  1.000  0.987
uF    0.982  0.989  0.988  0.068  0.979  0.808  0.800  0.990  0.992  0.980  0.980  0.987  1.000

Table 4.6: Correlation of evaluation measures (with random data). Absolute correlation coefficients of different cluster evaluation indices with random clusters. (For the abbreviations see Table 4.5.)

Results are usually arranged in tabular form as follows:

         springer       amazon         sda            wiki           nzz
method1  0.505 [0.005]  0.517 [0.003]  0.518 [0.002]  0.473 [0.008]  0.436 [0.011]
method2  0.523 [0.008]  0.488 [0.005]  0.530 [0.008]  0.436 [0.005]  0.451 [0.001]
method3  0.546 [0.014]  0.494 [0.002]  0.522 [0.001]  0.425 [0.003]  0.390 [0.001]

The first column indicates the specific algorithms, methods and parameters that were tested, while the following five columns show the average entropy values of the five test collections, with standard deviations in square brackets. Occasionally, technical reasons prevented some documents from being clustered (typically when a radical feature representation method left certain documents without any features at all). These results are either marked with an apostrophe or else the number of "lost" objects is indicated in a footnote.

Note: Meaningful numeric comparisons are only possible for different methods within the same data set. Numeric comparisons across the five data sets do not make sense.

4.3.2 Confusion Matrix and ROC Diagram

A popular means of visualising an individual cluster solution with regard to external labels is the confusion matrix, which shows for each label-cluster pair its number of common documents (i.e. the contingency table from Section 2.6.2). For example:

      L1   L2   L3   L4    Σ
C1     2    0   15    2   19
C2    32    2    5   12   51
C3     1   18    1    0   20
C4     4    3    1   16   24
Σ     39   23   22   30  114

However, with an increasing number of dimensions the table soon becomes unwieldy; it is thus only rarely used for illustration purposes. Using the contingency table coefficients a00 to a11 defined in Equations 2.56 to 2.59, the clustering solution can also be characterised by its recall value a00/(a00 + a01) and its precision value a00/(a00 + a10). Alternatively, the clustering solution can be shown as a point in a ROC diagram (Receiver Operating Characteristic diagram), which plots recall on the x-axis versus fallout on the y-axis, with fallout defined as a10/(a10 + a11). Several clustering solutions can then be displayed in one comparative diagram, though the need to label the data points makes this an impractical tool if a larger number of cluster solutions under different parameter settings are to be compared. We therefore relied on the tabular form mentioned at the end of the previous section. In fact, in most cases we derived simple visualisations (cf. the next section for an example) while the underlying exact numbers are listed in Appendix D.
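A sketch of the pairwise coefficients and the three derived measures, assuming the reading of a00–a11 implied by the formulae above (naive O(n²) enumeration, sufficient for illustration):

    from itertools import combinations

    def pairwise_counts(cluster_ids, labels):
        # a00: same cluster & same label; a01: different cluster, same label;
        # a10: same cluster, different label; a11: different in both.
        a00 = a01 = a10 = a11 = 0
        for i, j in combinations(range(len(labels)), 2):
            same_c = cluster_ids[i] == cluster_ids[j]
            same_l = labels[i] == labels[j]
            if same_c and same_l:
                a00 += 1
            elif same_l:
                a01 += 1
            elif same_c:
                a10 += 1
            else:
                a11 += 1
        return a00, a01, a10, a11

    a00, a01, a10, a11 = pairwise_counts([0, 0, 1, 1], ["a", "b", "a", "b"])
    recall = a00 / (a00 + a01)       # here 0.0
    precision = a00 / (a00 + a10)    # here 0.0
    fallout = a10 / (a10 + a11)      # here 0.5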

4.3.3 Matrix Size

The quantitative impact of different document representation techniques on clustering can be measured in three ways:

• by time: the time necessary to calculate a cluster solution from the input matrix;

• by dimensionality: the number of features (columns) of the input matrix;

• by matrix size: the number of non-zero elements in the input matrix.

Since time requirements are often difficult to measure exactly and depend strongly on the hardware and software, a matrix-dependent measure was preferred. With sparse and high-dimensional matrices (such as those in document clustering) there can be significant differences in the way dimensionality and matrix size vary. For most of Cluto's cluster algorithms the time and memory requirements were observed to be more closely related to the number of non-zero elements than to the number of matrix dimensions, so matrix size was selected to measure the quantitative aspects. The matrix size is usually indicated by percentages in round parentheses after the entropy values. The percentage values show the relation between matrix size before and after application of the given method:

                    springer      amazon        sda           wiki          nzz
base                0.425 (100%)  0.455 (100%)  0.449 (100%)  0.381 (100%)  0.351 (100%)
reduction method 1  0.412 (67%)   0.460 (71%)   0.446 (70%)   0.393 (77%)   0.348 (80%)
reduction method 2  0.448 (70%)   0.460 (75%)   0.447 (71%)   0.394 (79%)   0.350 (82%)
reduction method 3  0.406 (68%)   0.454 (70%)   0.438 (70%)   0.389 (79%)   0.350 (80%)

We visualise this secondary target variable (the percentages) by vertical lines on top of the entropy data points, as in this diagram:

[Example diagram: entropy values of seven methods (method1–method7) on the springer set, y-axis from 0.4 to 0.5, with a vertical line on top of each data point representing the relative matrix size.]

The length of each line is directly related to the percentage number. Thus, the shorter the line and the nearer to the bottom of the figure (i.e. the lower the entropy), the better the corresponding clustering method. From the example we could conclude that method 7 performed best but required an extraordinarily large amount of computational resources; method 3 might therefore be preferable as it provides a good result with a much smaller matrix size. In order to numerically evaluate a series of experiments in which both the qualitative and the quantitative aspects are of interest, the two measures can be roughly combined by assigning ascending rank scores to each entropy value and also to each matrix size value (for each of the five data sets separately). Each experimental run thus receives two rank scores, which can be added to yield a rough measure of comparison within the given series of experiments. For instance, for the wiki data set in the previous example the rank scores would be as follows:

                    Entropy: value  rank   Size: value  rank   Rank-sum
reduction method 1  0.393           2      77%          1      3
reduction method 2  0.394           3      79%          2.5    5.5
reduction method 3  0.389           1      79%          2.5    3.5

Reduction method 1 would thus have the lowest rank-sum and be considered best. Being a non-parametric comparison method, the rank-sum must be interpreted with appropriate care. On the other hand, unlike the entropy or matrix size values, rank-sums can also be added up across all five data sets.
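The rank-sum computation is easily reproduced; the following sketch (ours) assigns average ranks to ties, which yields the 2.5 values in the example above.

    def ranks_with_ties(values):
        # Ascending ranks; tied values share the average of their positions.
        order = sorted(range(len(values)), key=lambda i: values[i])
        ranks = [0.0] * len(values)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1               # 1-based average rank
            for k in range(i, j + 1):
                ranks[order[k]] = avg
            i = j + 1
        return ranks

    entropies = [0.393, 0.394, 0.389]
    sizes = [0.77, 0.79, 0.79]
    rank_sums = [e + s for e, s in zip(ranks_with_ties(entropies),
                                       ranks_with_ties(sizes))]
    print(rank_sums)                            # [3.0, 5.5, 3.5], as in the table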

4.3.4 Interpreting Multiple Experiments

When working with several data sets and a large number of different scenarios (algorithms, parameters, etc.), the question naturally arises how to interpret the results. IR is primarily concerned with practical systems, and the motto "it's good as long as it works" is often adhered to. Nevertheless, statistical significance tests to back up research results would be welcome. See Hull (1993) for a discussion of statistical evaluation methods in IR as well as a few arguments pro and contra their adoption. Generally speaking, an Analysis of Variance (ANOVA) for repeated measures would be an appropriate evaluation tool for our experiments. However, there are only five data sets and, moreover, it appears very questionable whether the clustering results follow a normal distribution. For these two reasons the use of an ANOVA cannot be justified after all. For non-parametric alternatives (for instance the Friedman test) the paucity of data sets is even graver: it becomes virtually impossible to prove an effect at a 5% significance level. With no suitable statistical method available, the interpretation of the results had therefore, albeit reluctantly, to be conducted in an "intuitive" fashion. Nevertheless, the large number of individual experiments reported as well as the indication of standard deviations should still allow the careful reader to form an opinion and put the results into perspective.¹⁶

4.4 Algorithms and Parameters

Cluto offers a wide choice of clustering algorithms and settings. This section describes our main choices as well as a few preliminary experiments motivating these decisions. Table 4.7 shows the actual parameter choices for the experiments in the following chapters. A brief discussion follows in the sub-sections.

¹⁶ With regard to the standard deviation it should be kept in mind, however, that the ten clustering runs of each experiment were quite often not normally distributed.

Parameter   Choice    Alternatives                Comment

(program)   vcluster  scluster                    All experiments were conducted with the vcluster program, which takes a vector matrix as input, whereas scluster works with similarity matrices.

–clmethod   rbr       rb, direct, agglo, graph, bagglo    For explanations of the different algorithms and a comparative experiment see Section 4.4.1.

–sim        cos       corr, [dist], [jacc]        Cosine similarity (the default option) was given preference over the correlation coefficient. Jaccard and Euclidean distance are only applicable with the graph algorithm.

–crfun      i2        i1, e1, g1, g1p, h1, h2, slink, wslink, clink, wclink, upgma    As criterion function the default option I2 was selected. The alternatives refer to the different criterion functions discussed in Sections 2.3.1.1 and 2.4.1, whereby slink, wslink, clink and wclink refer to weighted and unweighted formulae for single- and complete-link.

–cstype     largess   large, best                 The "large sub-size" criterion for sparse and high-dimensional matrices was selected to determine which cluster to bi-sect next. For experiments with the alternatives see Section 4.4.1.

–rowmodel   none      maxtf, sqrt, log            Local weighting was turned off (default option). For experiments with local weighting see Section 4.4.2.

–colmodel   idf       none                        Global weighting with inverse document frequency g(d_·j) = log_2(n/n_j) was applied. In reality, the model was further extended to a squared variant of idf—see the experiments in Section 4.4.2.

–ntrials    10        N+                          Ten cluster solutions were computed for each input matrix and the one performing best with regard to the internal cluster criterion was selected (Cluto's default choice).

–niter      10        N+                          A maximum of 10 refinement steps was performed at each clustering step (Cluto's default choice).

Table 4.7: Parameter choices for Cluto (cf. Karypis, 2003).
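Assuming Cluto's usual command-line syntax (options first, then the matrix file and the desired number of clusters), the choices of Table 4.7 would correspond to an invocation roughly like the following, where corpus.mat is a hypothetical input matrix for a seven-category corpus:

    vcluster -clmethod=rbr -sim=cos -crfun=i2 -cstype=largess \
        -rowmodel=none -colmodel=idf -ntrials=10 -niter=10 corpus.mat 7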

4.4.1 Choice of Algorithm

The choice of the repeated bi-sectional algorithm with refinement (rbr) was based on preliminary experiments with a bag-of-lemmata model after stopword removal.¹⁷ The results of the experiment are shown in Table 4.8 and summarised below. Figure 4.7 illustrates the same results graphically.

agglo: The agglomerative clustering method (cf. Section 2.4.1) turned out to be too demanding on the memory resources for all but the springer data set. Its result on the latter does not compare well with the other algorithms.

bagglo: The biased agglomerative algorithm (agglomerative with a partitional pre-processing step, cf. Section 2.4.1.5) encountered the same restrictions as the pure agglomerative method. For the springer set it showed a clearly better result than the former.

direct: The direct, simultaneous partitional method (a k-means algorithm, cf. Section 2.3.1) worked quite well for all data sets except the springer set.

graph: The graph-partitional method (using the MINcut criterion, cf. Section 2.1.3) showed excellent results for the springer set—far better than all other methods. However, for the other four (larger) sets the nearest-neighbour graph was disconnected, leading to additional clusters and various documents not being clustered at all. For the purpose of reliably comparing different representation techniques this clustering method was therefore not suitable.

rb: The repeated bi-sectional method (cf. Section 2.4.2) worked generally well. Of the three –cstype parameter choices (governing which cluster to split next), "large" and "largeSS" were better than "best". "Large" selects the largest cluster, "best" the cluster whose bi-section optimises the cluster criterion function, and "largeSS" the cluster whose bi-section leads to the largest reduction in the dimensions accounting for the majority of within-cluster similarity of the objects.

rbr: The repeated bi-sectional refined method (cf. Section 2.4.2) was overall the most successful and was selected for the further experiments. For –cstype the value "largeSS" was chosen, which the Cluto manual recommends for sparse and high-dimensional data sets as are typical of document clustering.

Table 4.9 gives a brief indication of the (relative) time demands of the different algorithms. Clustering was performed with the sda and springer data sets on a SunBlade 1500 with a 1.1 GHz UltraSPARC-IIIi processor and 1 GB RAM.

4.4.2 Choice of Weighting Scheme

The general basics of feature weighting were described in Section 3.3, where the general weighting model d'_{ij} = t(d_{ij}) \cdot g(d_{\cdot j}) \cdot s(d_{i\cdot}) was introduced. Since the cosine has been chosen as the similarity measure, (Euclidean) normalisation s(d_{i\cdot}) becomes irrelevant. Thus, only local and global weighting components had to be evaluated.

¹⁷ The actual details of the bag-of-lemmata model and the stopword removal procedure are further discussed in the next chapter.

               springer       amazon          sda             wiki            nzz
agglo          0.857          n/a             n/a             n/a             n/a
bagglo         0.540          n/a             n/a             n/a             n/a
direct         0.539 [0.013]  0.506 [0.010]   0.518 [0.003]   0.426 [0.009]   0.430 [0.029]
graph          0.430 [0.003]  0.562ᵃ [0.005]  0.566ᵇ [0.007]  0.518ᶜ [0.004]  0.622ᵈ [0.020]
rb (best)      0.526 [0.007]  0.507 [0.002]   0.571 [0.006]   0.479 [0.005]   0.467 [0.001]
rb (large)     0.548 [0.014]  0.506 [0.002]   0.518 [0.002]   0.458 [0.005]   0.429 [0.003]
rb (largess)   0.505 [0.005]  0.517 [0.003]   0.518 [0.002]   0.473 [0.008]   0.436 [0.011]
rbr (best)     0.523 [0.008]  0.488 [0.005]   0.530 [0.008]   0.436 [0.005]   0.451 [0.001]
rbr (large)    0.546 [0.014]  0.494 [0.002]   0.522 [0.001]   0.425 [0.003]   0.390 [0.001]
rbr (largess)  0.491 [0.008]  0.496 [0.002]   0.522 [0.001]   0.410 [0.008]   0.402 [0.022]

ᵃ 2 extra clusters, 300 documents not clustered.
ᵇ 5 extra clusters, 310 documents not clustered.
ᶜ 3 extra clusters, 455 documents not clustered.
ᵈ 4 extra clusters, 236 documents not clustered.

Table 4.8: Comparison of Cluto's different clustering algorithms (–clmethod parameter); in parentheses (best, large, largess) different values for the –cstype parameter. For explanations of the abbreviations see the text. (The table shows entropy values and, in square brackets, standard deviations.)

               springer  sda
agglo          22.0s     n/a
bagglo         24.2s     n/a
direct         2.6s      132s
graph          5.1s      870s
rb (best)      2.1s      122s
rb (large)     1.8s      104s
rb (largess)   2.2s      121s
rbr (best)     2.3s      133s
rbr (large)    1.9s      115s
rbr (largess)  2.5s      132s

Table 4.9: Time demands of different clustering algorithms for two data sets. The agglomerative algorithms could not cope with the sda data set because of memory overflow.

Figure 4.7: Visual comparison of Cluto's different clustering algorithms. For the underlying numbers see Table 4.8. Lower values indicate better clustering results. (For clarity's sake the five data series—springer, amazon, sda, wiki and nzz—are shown in five separate sections of the diagram, with the algorithms agglo, bagglo, direct, graph, rb (best/large/largess) and rbr (best/large/largess) on the x-axis; the agglomerative algorithms appear as "n/a" for all sets but springer.)

Cluto offers these built-in weighting options:

Local weighting (–rowmodel parameter): t(d_ij) =

• none: d_ij (normal frequency),
• maxtf: 0.5 + 0.5 · d_ij / max_l d_il,
• log:¹⁸ sgn(d_ij) · log_2 |d_ij|,
• sqrt: sgn(d_ij) · √|d_ij|.

Global weighting (–colmodel parameter): g(d_·j) =

• none: 1,
• idf:¹⁹ log_2(n / df(j, H)).

Figure 4.8 shows the results of all possible combinations of these weighting models. The same input data was used as in the previous section. As expected, weighting term frequencies with idf improved performance dramatically. The overall effect of local frequency adjustments was less clear-cut, though it is evident that the two news corpora did profit from local frequency smoothing.
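Re-implemented in Python/NumPy for illustration (our sketch, not Cluto's actual code), the two groups of options read:

    import numpy as np

    def local_weight(d, model):
        # -rowmodel options applied to one document vector d of term counts.
        if model == "maxtf":
            return 0.5 + 0.5 * d / d.max()
        if model == "log":
            w = np.zeros_like(d, dtype=float)
            nz = d != 0
            w[nz] = np.sign(d[nz]) * np.log2(np.abs(d[nz]))
            return w
        if model == "sqrt":
            return np.sign(d) * np.sqrt(np.abs(d))
        return d.astype(float)                       # "none"

    def global_weight(D, model):
        # -colmodel options applied to a documents-by-terms matrix D.
        if model == "idf":
            df = np.maximum((D > 0).sum(axis=0), 1)  # document frequencies df(j, H)
            return D * np.log2(D.shape[0] / df)
        return D.astype(float)                       # "none"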

4.4.2.1 Inverse Document Frequency Squared

More or less by accident, we then extended the scheme by applying a two-step weighting scheme:

The “Double-Weighting Scheme”.

1. An "external" log-idf weighting was applied to all matrix values, using the natural logarithm:²⁰ t(d_ij) = 1 + ln d_ij and g(d_·j) = ln(n / df(j, H)).

2. Then an additional weighting step was performed with Cluto's built-in row and column model parameters (of which the none-idf combination was favoured and used as default in the further chapters).

The effects of this double-weighting scheme are shown in Figure 4.9. The results appear to be very promising: double-weighting led to improvements throughout our experiments. First, it may be noted that—accidentally or not—the binary logarithm performed somewhat worse than the natural logarithm for all data sets (log-idf in the first versus none-none in the second table).

¹⁸ Should only be used for non-negative numbers, as it is rarely desirable to map a number and its negative inverse onto one and the same value (e.g. d_ij = 4 and d_ij = −1/4 would lead to the same result).
¹⁹ It was actually impossible to reproduce this effect externally, i.e. weighting the vectors with log_2(n/df(j, H)) before feeding them into Cluto led to different results than applying the idf parameter, and the differences seemed too big to be fully accounted for by mere rounding differences.
²⁰ I.e. the more common logarithm with base e (natural logarithm) was used for the external weighting, whereas Cluto works with the binary logarithm (base 2).

Figure 4.8: Evaluation of Cluto weighting models, with the rbr(largess) algorithm and different options for the –colmodel and –rowmodel parameters (all eight combinations of rowmodel ∈ {none, log, maxtf, sqrt} and colmodel ∈ {none, idf}, evaluated on the five corpora). (For the numbers underlying this figure refer to Table D.1.)

Figure 4.9: Double-weighting. Evaluation of Cluto's built-in weighting models after prior application of an external log-idf weighting scheme to the data. The connected lines repeat the results from Figure 4.8 and show that the new results are almost always better. (For the underlying numbers refer to Table D.2.)

More important, however, is the fact that the twofold application of idf seems to improve performance in four of five cases. The best combination was external log-idf together with Cluto's none-idf weighting. This weighting scheme—effectively resulting in

d'_{ij} = \varphi_{cluto}(\varphi_{external}(d_{ij}, H), H)
        = t_{external}(d_{ij}) \cdot g_{external}(d_{\cdot j}) \cdot g_{cluto}(d_{\cdot j})
        = (1 + \ln d_{ij}) \cdot \ln\frac{n}{df(j, H)} \cdot \log_2\frac{n}{df(j, H)}    (4.1)

—was therefore used throughout the following chapters.
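A compact sketch of Equation 4.1 (our illustration; in the actual experiments only the external factor (1 + ln d_ij) · ln(n/df(j, H)) was pre-computed, while the final log_2 factor came from Cluto's built-in -colmodel=idf option):

    import numpy as np

    def log_idf_squared(D):
        # Eq. 4.1 on a documents-by-terms count matrix D:
        # (1 + ln d_ij) * ln(n/df_j) * log2(n/df_j) for the non-zero entries.
        n = D.shape[0]
        df = np.maximum((D > 0).sum(axis=0), 1)      # document frequencies df(j, H)
        W = np.zeros_like(D, dtype=float)
        nz = D > 0
        W[nz] = 1.0 + np.log(D[nz])
        return W * np.log(n / df) * np.log2(n / df)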

Repeated comparisons at later stages of our work confirmed this choice, as the results were virtually always much better than when relying on Cluto's internal weighting options alone. The double-weighting scheme of Equation 4.1 will further be referred to as log-idf², and the global component alone (g_external · g_cluto) as "idf squared".

4.4.2.2 Validity of IDF Squared

Heretofore, the squared variant of idf weighting has been used very rarely in IR (two recent exceptions being Goharian et al. 2001 and Henzinger et al. 2003). To our knowledge it has not yet been tested for clustering at all. As we found that throughout our study the log-idf² scheme out-performed the standard alternatives, we decided to perform a few more tests with idf² in order to establish whether the effect might not have been purely accidental or caused by a quirk of Cluto's term weighting component. Figure 4.10 reports on the results with a number of different external/internal combinations, as well as natural/binary logarithms. Columns 2–7 report on experiments with simple idf, while columns 8–14 report on experiments with idf². It transpires that neither the choice of logarithm nor the distribution between external and internal weighting makes a difference. The effect of idf² versus idf is stable: four data sets show a difference in favour of idf², and only one case is inconclusive (wiki). In addition, it can again be observed that for the nzz set, more than for any other, it is crucial to smooth term frequencies locally. One reason could be the higher average text length of the nzz corpus, leading to more skewed within-document frequency distributions.

In a second attempt to validate the usability of idf² we applied the same measures to the three English test document sets that form part of the Cluto standard distribution. With sizes of 204 to 8,580 documents, they are distinctly smaller than the average of our German data sets (although on average the texts are longer). Figure 4.11/Table D.4 reports on the results of this small comparative study. They support idf² partially. First, we note that logarithmic local weighting, which has been found generally useful (or at least not harmful) for our German data sets, tended to worsen the scores of the Cluto standard data sets. Comparing the bare idf and idf² scores (columns 2–4 and 8–9), we find each method favoured in one case, while in the case of the tr23 set there was no real difference.

Conclusions. The above experiments showed that for clustering the inverse document frequency squared (idf²) must be considered a serious alternative to common idf. Of the eight data sets examined, five showed better results with idf², two showed barely any differences and only one set (Cluto's "sports" set) fared better with simple idf. Thus, idf² deserves to be tested more extensively in future work.

Figure 4.10: Evaluation of different idf and idf² variants. idf_n and log_n refer to formulae using the natural logarithm, idf_2 and log_2 to those with the binary logarithm; idf_{n·2} refers to the combined use of idf_n followed by idf_2, and idf_2² and idf_n² to the twofold application of idf_2 and idf_n, respectively. (Bar chart of entropy values for the five German corpora under 14 numbered weighting configurations, each a different combination of external weighting and Cluto's built-in weighting. For the underlying numbers refer to Table D.3.)

Figure 4.11: Experiments with three standard sets. Different weighting schemes applied to the three test collections k1b, sports and tr23 that come as part of the Cluto standard distribution; the 14 weighting configurations (columns) are the same as in Figure 4.10. Scores on the left refer to normal weighting, scores on the right to idf² weightings. (For the underlying numbers refer to Table D.4.)

Chapter 5

Reduced Document Representations Using Natural Language Processing

I have no dictionary, and I do not want one; I can select words by the sound, or by orthographic aspect. Many of them have French or German or English look, and these are the ones I enslave for the day’s service. That is, as a rule. Not always. If I find a learnable phrase that has an imposing look and warbles musically along I do not care to know the meaning of it; I pay it out to the first applicant, knowing that if I pronounce it carefully he will understand it, and that’s enough.

Mark Twain (Italian without a Master, 1904)

Text processing programs can often be as ignorant of language in general as Mark Twain pretended to be of the idiom spoken in Italy. Nobody cares as long as the desired effect is achieved. But just as the famous writer might eventually have found communication even smoother with proper knowledge of the Italian language, the question must be asked whether and to what extent users may profit if their programs are enriched by an examination of the words and symbols beyond "the sound, or the orthographic aspect". The present chapter describes a number of relatively simple document representation methods. They all have in common that the representation is reduced in one way or another, either by omitting certain features or by mapping them onto some standardised form. In our earlier terminology (Section 3.4) these can be feature selection, standardisation or even extraction methods. The next chapter is then devoted to more demanding approaches, which mostly aim to enhance the feature space by way of feature extraction. In both chapters we have a particular eye on linguistically motivated techniques. All are evaluated on the basis of our five German data sets.


5.1 Baseline

In order to evaluate the impact of feature representation techniques, a baseline needs to be determined—the set of results that can be achieved by simple standard means. All further experiments can then be measured against this baseline.

5.1.1 Preparation

Following common practice, the simple bag-of-words model was chosen as a baseline. Vectorisation was performed with the tokenising module that comes as part of the Gertwol software (cf. Section 5.2.1 below). A number of refinements typically encountered in practice were then applied to this simple BOW model:

1. Removal of punctuation, numbers and other "non-alphabetic" tokens (here defined as tokens with fewer than two ordinary letters).

2. Removal of stopwords (see Section 5.4 for more details).

3. Stemming of all tokens with Porter's German stemmer, including conversion to lower case.¹
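A minimal sketch of this three-step preparation, using NLTK's Snowball implementation of the German (Porter-style) stemmer and NLTK's German stopword list as stand-ins for the resources actually used:

    import re
    from nltk.stem.snowball import SnowballStemmer   # Snowball/Porter German stemmer
    from nltk.corpus import stopwords                # requires nltk.download("stopwords")

    stemmer = SnowballStemmer("german")
    stop = set(stopwords.words("german"))

    def bow_stop_stem(text):
        # Step 1: keep only tokens containing at least two ordinary letters.
        tokens = [t.lower() for t in re.findall(r"\w+", text)
                  if len(re.findall(r"[^\W\d_]", t)) >= 2]
        # Steps 2 and 3: stopword removal, then stemming (lower-casing done above).
        return [stemmer.stem(t) for t in tokens if t not in stop]

    print(bow_stop_stem("Die Häuser wurden 1999 neu gebaut."))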

The results (Figure 5.1/Table D.5) are in themselves quite interesting in that they only partially support the prevailing view that stopword removal and stemming automatically lead to better results. As a matter of fact, the impact of these methods on the nzz corpus appears to be negligible, while the amazon and wiki corpora show even better results if stopping is omitted. We will return to these anomalies further below.²

5.1.2 Interpretation

From these experiments we can construct two baselines:

                         springer       amazon         sda            wiki           nzz
Baseline 1 ("standard")  0.410 [0.010]  0.468 [0.006]  0.431 [0.002]  0.416 [0.019]  0.348 [0.002]
  [= BOWstop, stem]
Baseline 2 ("best")      0.408 [0.004]  0.458 [0.009]  0.431 [0.002]  0.386 [0.010]  0.346 [0.001]

Baseline 1 shows the results achieved by the state-of-the-art, non-linguistic feature representation method (stopping and stemming). It is the baseline usually referred to in the further text, even though from the results in Figure 5.1 it appears debatable whether or not stopword removal should actually be included. Baseline 2 is constructed by individually taking the best result from Table D.5 for each data set. This baseline is more difficult and more desirable to beat, but less representative, since the methods were picked ex post for each set.

¹ Porter's German stemmer is available from www.snowball.tartarus.org/algorithms/german/stemmer.html.
² Curiously, straightforward improvements for these techniques and their combination can be observed in all five data sets if the double-weighting method (Section 4.4.2.1) is given up in favour of the (less powerful, but standard) log-idf method:

                 springer       amazon         sda            wiki           nzz
  BOW            0.627 [0.045]  0.552 [0.007]  0.551 [0.001]  0.626 [0.019]  0.452 [0.018]
  BOWstop        0.536 [0.017]  0.507 [0.004]  0.523 [0.000]  0.427 [0.008]  0.412 [0.020]
  BOWstem        0.514 [0.013]  0.533 [0.004]  0.532 [0.000]  0.531 [0.020]  0.433 [0.021]
  BOWstop, stem  0.469 [0.010]  0.500 [0.005]  0.522 [0.000]  0.404 [0.008]  0.398 [0.001]

Figure 5.1: Baseline and variants. BOW = standard bag-of-words; BOWstop = bag-of-words after stopword removal; BOWstem = bag-of-words after stemming; BOWstop, stem = bag-of-words after stopword removal and stemming. (For the underlying numbers refer to Table D.5.)

To illustrate the cluster distributions under Baseline 1, Tables 5.1 to 5.5 give the respective confusion matrices of a typical run in each data set. It can be observed that for the sets with many categories (amazon and wiki) there is no perfect 1:1 match between categories and clusters; categories with very many documents are spread over several clusters, and so are categories with very few members. Interestingly, a similar behaviour can be detected in the nzz set, where the FEUI (Feuilleton) category is distributed over two big clusters whereas INLA (Inland) and ZURI (Zürich) are merged. The cluster algorithm thus detects a different principal structure—and it actually happens to correspond to a reality: in the physical archive of the Neue Zürcher Zeitung no distinction is made between the two departments Inland and Zürich, as the two are closely interconnected. Even if the 1:1 relation between clusters and labels thus appears to be broken in various instances, the entropy measure still manages to provide a sensible and sensitive assessment of the degree of correspondence between clusters and labels. On the whole, the baseline clustering results may appear relatively good. Expressed in terms of the "classification error" (Eq. 2.71), however, we still have error rates of 24.7% (springer), 44.7% (amazon), 27.3% (sda), 39.0% (wiki) and 21.0% (nzz).

     TECH  BWLM  MEDI  INFO  ARCH  JURA  VOLK
C1   378   50    11    23    0     9     47
C2   101   367   14    51    5     3     13
C3   3     1     993   2     34    6     0
C4   85    43    17    246   1     0     5
C5   38    11    47    12    214   9     4
C6   13    13    20    3     0     461   1
C7   38    130   46    7     4     28    229

Table 5.1: Confusion matrix of the springer baseline 1 result (classification error: 24.7%).

     INLA   AUSL   KULT  VERM   WIRT
C1   15516  89     150   674    1401
C2   201    11104  19    1009   51
C3   2355   5061   4311  4779   1912
C4   478    163    21    10271  25
C5   259    54     10    61     8708

Table 5.2: Confusion matrix of the sda baseline 1 result (classification error: 27.3%).

     INLA  VERM  SPOR  FEUI  AUSL  WIRT  ZURI
C1   3031  173   37    32    104   310   1904
C2   220   4156  43    346   1259  36    384
C3   0     36    3125  4     0     0     25
C4   7     401   2     2442  3     1     270
C5   229   533   21    5514  229   33    277
C6   140   67    1     74    7023  86    1
C7   118   14    0     8     64    3040  38

Table 5.3: Confusion matrix of the nzz baseline 1 result (classification error: 21.0%).

[Table 5.4: Confusion matrix of the amazon baseline 1 result (classification error: 44.7%); a 21×21 matrix over the clusters C1–C21 and the categories KIND, ARCH, RATG, KULT, BELL, BIOG, RELI, BUSI, HIST, BIOC, GERM, INGE, COMP, MATH, SPOR, SCIF, EROS, KUNS, REIS, LIFE and PUBL.]

amazon RELI 1490 168 106 22 21 24 15 17 16 68 40 3 0 1 3 4 2 4 9 3 2 BUSI aeie1result 1 baseline 122 692 24 87 16 75 10 52 12 11 72 60 46 80 72 74 52 12 4 3 3 HIST 1851 1244 139 268 102 97 37 16 24 40 41 17 24 82 2 1 7 0 6 0 6 BIOC 359 291 169 20 82 37 40 31 82 22 11 21 25 31 46 21 4 9 7 7 8 GERM casfiainerr 44.7%) error: (classification 1772 725 211 143 55 10 12 15 47 22 35 58 0 1 2 7 4 1 2 1 3 INGE 1201 467 190 131 10 14 10 20 62 2 3 6 1 5 4 7 0 0 4 1 0 COMP 2067 629 15 11 10 1 0 0 1 0 0 0 0 0 0 2 0 1 0 2 5 MATH 168 171 514 958 910 30 44 16 74 14 64 16 52 73 12 10 13 34 8 8 1 SPOR 193 141 49 11 38 0 3 1 1 1 0 0 3 3 0 2 0 1 6 0 1 SCIF 181 120 36 6 1 0 3 4 0 1 0 0 8 0 0 1 7 1 0 0 0 EROS 420 14 16 15 20 15 80 1 3 7 0 0 4 3 4 0 9 8 1 4 4 KUNS 106 52 40 42 19 16 10 2 4 4 2 5 5 2 1 2 2 4 0 3 2 REIS 103 308 95 18 31 10 23 36 17 15 38 13 36 4 0 6 8 7 0 7 8 LIFE 174 16 32 41 56 24 10 24 33 0 1 3 1 0 0 1 1 4 5 3 9 PUBL C C C C C C C C C C C C C C C C C C C C C C 1 5 4 3 2 11 10 9 8 7 6 15 14 13 12 21 20 19 18 17 16 22 1375 1664 719 857 945 222 212 184 19 52 10 11 21 19 29 52 44 26 al 5.5: Table 5 9 8 0 HIST 1781 50 64 19 7 0 0 0 1 3 0 1 0 5 1 2 0 1 0 1 0 6 INFO ofso arxo the of matrix Confusion 787 444 60 10 14 24 16 4 0 1 0 2 1 0 0 1 1 0 1 4 9 0 MATH 495 130 399 999 220 295 19 74 21 59 15 20 15 28 51 51 46 84 21 26 13 5 SPRA 1002 53 59 32 12 51 6 0 1 4 0 1 0 2 0 1 2 9 1 3 0 0 PHIL 1075 1089 1918 104 305 57 52 20 70 49 14 73 1 4 8 4 1 3 8 2 7 1 KUNS wiki 3059 254 281 269 34 44 18 51 61 54 49 37 33 42 2 2 2 6 4 3 4 2 LITE aeie1result 1 baseline 1003 1286 2003 128 432 237 205 26 16 15 23 21 10 50 56 15 18 1 0 5 1 1 BUSI 4463 110 488 281 339 279 13 12 22 21 19 38 23 28 0 3 1 7 4 1 5 3 BIOL 281 133 32 80 10 91 17 39 30 61 67 38 48 93 33 60 72 3 0 4 5 0 casfiainerr 39.0%). error: (classification FREI 1227 56 28 18 30 80 0 2 1 2 5 5 0 4 5 0 7 8 2 0 0 0 MEDZ 2176 436 486 121 142 59 12 16 51 0 3 9 0 4 8 8 3 2 3 0 7 1 TECH 2286 12 20 32 36 10 20 51 17 14 63 56 14 0 1 0 7 5 6 0 1 0 SPOR 267 326 138 941 158 748 938 28 62 26 69 5 2 6 8 2 7 3 3 2 1 0 RELI 889 21 34 98 13 4 0 2 3 3 0 2 2 0 2 1 3 5 0 1 7 1 JURA 1052 17 25 2 0 0 0 2 1 3 1 0 5 9 4 1 1 0 3 5 1 0 LIFE 1436 259 79 10 12 36 5 0 1 1 0 1 0 1 9 1 7 9 1 0 0 1 ASTR 142 221 652 59 20 0 0 0 0 0 4 9 0 1 6 2 5 1 0 0 2 0 PHYS 313 214 56 12 21 49 56 19 4 9 0 5 1 3 1 7 6 1 1 2 0 6 ORGA 1207 16 58 22 26 33 0 0 0 0 3 1 6 0 0 0 0 0 0 8 0 6 CHEM 252 10 10 23 0 1 3 1 4 0 0 0 0 4 2 0 5 3 2 1 0 0 SEXU 138 220 11 18 84 33 52 10 53 34 7 6 7 3 6 7 4 5 1 3 0 4 SOZI 116 CHAPTER 5. REDUCED DOCUMENT REPRESENTATIONS USING NLP

5.2 Bag-of-Lemmata

The bag-of-lemmata (BOL) is the starting point of several higher-level feature refinement meth- ods. In order to obtain a BOL representation of our documents (instead of a BOW model), the documents must be syntactically analysed before vectorisation (cf. Section 3.2.2).

5.2.1 POS Tagging and Lemmatising

POS tagging and lemmatising were performed in two separate steps:

1. For POS tagging we used Helmut Schmid's Tree-Tagger (cf. Schmid, 1994, 1999).

2. For lemmatising and morphological analysis we relied on the German morphological analyser Gertwol (cf. Haapalainen and Majorin, 1995). If POS tagging failed or resulted in an unknown tag, we used the POS information generated by Gertwol. If more than one lemma was available, Volk's adapted algorithm for determining the most probable lemma for adjectives, verbs and nouns was used (Volk, 1999). Similarly, if Gertwol failed to analyse a word, the lemma returned by the Tree-Tagger was used.3

For our purposes the respective tag sets of the two programs were mapped onto a simplified generic scheme with 15 POS categories, as in Table 5.6.
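To make the fallback logic of this two-step procedure explicit, the following sketch shows how the two analyses might be merged per token. The record fields and the helper most_probable_lemma are hypothetical stand-ins of our own; the actual Tree-Tagger and Gertwol interfaces differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Analysis:
    tt_tag: Optional[str]      # Tree-Tagger POS tag (None if tagging failed or tag unknown)
    tt_lemma: Optional[str]    # lemma guessed by the Tree-Tagger
    gw_tag: Optional[str]      # POS tag according to Gertwol
    gw_lemmata: List[str]      # Gertwol lemma candidates (possibly several, possibly none)

def most_probable_lemma(candidates: List[str], tag: Optional[str]) -> str:
    """Placeholder for Volk's (1999) disambiguation heuristic for adjectives,
    verbs and nouns; here we simply take the first candidate."""
    return candidates[0]

def merge(token: str, a: Analysis) -> Tuple[str, Optional[str]]:
    # Prefer the Tree-Tagger tag; fall back to Gertwol's if tagging failed.
    tag = a.tt_tag if a.tt_tag is not None else a.gw_tag
    # Prefer a Gertwol lemma (disambiguating if several are offered); fall
    # back to the Tree-Tagger lemma if Gertwol failed, then to the token itself.
    if a.gw_lemmata:
        lemma = most_probable_lemma(a.gw_lemmata, tag)
    elif a.tt_lemma is not None:
        lemma = a.tt_lemma
    else:
        lemma = token
    return lemma, tag
```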

Table 5.6: Generic mapping of the Tree-Tagger and Gertwol tag sets.

Generic tag   Tree-Tagger                  Gertwol          Description
adj           ADJA, ADJD                   A                Adjectives.
adv           ADV                          ADV              Adverbs.
app           APPO, APPR, APPRART, APZR    PRÄP             Appositions (prepositions, postpositions, etc.).
art           ART                          ART, DET         Determiners.
kon           KOKOM, KON, KOUI, KOUS       KONJ             Conjunctions (coordinating, subordinating, etc.).
nam           NE                           EIGEN            Proper names.
nam1          [NE]                         EIGEN Vorname    Proper names, recognised by Gertwol as first names.
nam2          [NE]                         EIGEN Famname    Proper names, recognised by Gertwol as family names.
namall                                                      Used for the union of nam, nam1 and nam2.

3 In order to avoid confusion further on, it should be noted that Gertwol always resolves contracted prepositions such as "beim" (→ "bei-der"), "zur" (→ "zu-der") or "im" (→ "in-der").

Table 5.6 (continued):

Generic tag   Tree-Tagger                  Gertwol                         Description
num           CARD                         NUM, KARD, ORD, BRUCH, RÖM      Numbers of all sorts.
pro           PAV, PDAT, PDS, PIAT,        PRON, PRONADV                   Pronouns of all colours (since they all have
              PIDAT, PIS, PPER, PPOSAT,                                    little interest for clustering, no further
              PPOSS, PRELAT, PRELS, PRF,                                   distinction is made).
              PWAT, PWAV, PWS
ptk           ITJ, PTKA, PTKANT, PTKNEG,   INTERJ, PRÄF                    Particles and isolated prefixes.
              PTKVZ, PTKZU, PTKZV
pun           $(, $,, $.                   KOMMA, PUNKT, DOPPELPUNKT,      Punctuation marks.
                                           VKLAMMER, HKLAMMER, FRAGMENT,
                                           PROZENTZEICHEN, FRAGEZEICHEN,
                                           SEMIKOLON, SCHRÄGSTRICH,
                                           AUSRUFEZEICHEN, BINDESTRICH,
                                           ANFÜHRUNGSZEICHEN,
                                           GEDANKENSTRICH,
                                           PARAGRAPHZEICHEN,
                                           STERN-ZEICHEN, PLUS-ZEICHEN,
                                           MINUS-ZEICHEN

Table 5.6 (continued):

Generic tag   Tree-Tagger                  Gertwol                         Description
pun (cont.)                                GLEICHHEITSZEICHEN,
                                           NUMMERNZEICHEN, ET-ZEICHEN,
                                           AUSLASSUNGSPUNKTE
sub           NN                           S                               Nouns.
unk           FM, TRUNC, XY                ERSTGLIED                       Unknown tokens, often first elements from
                                                                           complex expressions (e. g. the term "Berg-"
                                                                           in "Berg- und Talfahrt").
vrb           VAFIN, VAIMP, VAINF, VAPP,   V                               Verbs.
              VMFIN, VMINF, VMPP, VVFIN,
              VVIMP, VVINF, VVIZU, VVPP

5.2.2 Experimental Results

Is the BOL representation more suitable than BOW? Figure 5.2 seems to indicate so. Disregarding the nzz corpus, which shows only minimal variation, we find BOL to be better than BOW on each occasion. However, if stemming is added to the equation, it transpires that on the whole there are no big differences between stemming and the less radical lemmatising approach. Combining lemmatising and stemming (BOLstem) does not seem to have much (positive) impact.

Conclusions. Lemmatising improves results compared to simple BOW. On its own, however, the extra effort invested in lemmatising does not pay off, as the alternative, stemming, offers a similar effect at a much lower cost. The tendency towards over-generalisation inherent in stemming does not seem to be a disadvantage. In terms of dimensions (features), lemmatising reduced the feature space in our experiments by 15–23 percent, whereas stemming led to a reduction of 25–30 percent. At the cost of increased pre-processing time, both techniques help to reduce computational demands in the clustering phase.

Even though tagging/lemmatising does not show better results than stemming, it may nevertheless be useful if the extra information serves as input for the more sophisticated methods discussed further below in this and the next chapter.

Figure 5.2: Bag-of-lemmata versus bag-of-words. Entropies of the BOW, BOL, BOWstem and BOLstem representations on all five data sets. (For the underlying numbers refer to Table D.6.)

5.3 Statistical Reduction Techniques

This and the next four sections deal with matrix size reduction. In the present section we examine matrix reduction techniques that make no use of any knowledge of the language and rely solely on the given data. Models that make use of additional (linguistic) knowledge are dealt with in the sections that follow.

Reducing the feature matrix for clustering can be done with two separate and sometimes opposing goals:

• Qualitative improvement: Matrix reduction as a means of improving quality by reducing "noise", i. e. by removing data that is more likely to obscure the underlying cluster structure than to reveal it.

• Quantitative improvement: Matrix reduction as a means of reducing the space and time requirements of the clustering process, or of allowing a larger number of documents to be clustered under equal space and time restrictions.

We measure the two aspects using the methodology outlined in Section 4.3.3.

5.3.1 Pruning

We examine three approaches to matrix pruning:

Global pruning with similarity preservation. The –colprune=α/100 parameter prompts Cluto to remove all features that are not necessary to account for α percent of the overall similarity between documents. According to the manual, substantial dimensionality reductions are thus possible without seriously affecting clustering quality. It recommends setting α between 80 and 100 (the latter meaning no pruning at all; it is the default value).

Global pruning with upper and lower bounds. As an alternative we examined the exclusion of very high- and low-frequency words, using upper and lower bounds on the document frequency of the features. Upper bounds between 0.5 and 10 percent and lower bounds between 0 and 0.1 percent were tested.

Local pruning. Finally we evaluated a local pruning technique, reducing each document to its α most frequent terms. The procedure was tested in two variations: first with all features available, and then with only those features that actually occur in more than one document. Since the fraction of these "shared" features is only between 34 and 45 percent of all features (but accounting for 90 to 98 percent of the non-zero elements!), the results are expected to be different. The advantage of the first method is that it can already be implemented at the vectorisation stage, whereas the second is only possible at matrix assembly time.
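As a minimal sketch of the two simpler pruning variants, assuming a SciPy sparse documents × features matrix (the function names and the tie-handling are our own, not Cluto's):

```python
import numpy as np
from scipy import sparse

def prune_by_document_frequency(X: sparse.csr_matrix, lower: float, upper: float):
    """Global pruning: keep only features whose document frequency
    (as a fraction of all documents) lies within [lower, upper]."""
    df = np.asarray((X > 0).sum(axis=0)).ravel() / X.shape[0]
    keep = (df >= lower) & (df <= upper)
    return X[:, keep]

def prune_locally(X: sparse.csr_matrix, alpha: int) -> sparse.csr_matrix:
    """Local pruning: keep only the alpha highest-weighted terms of each
    document (ties at the cutoff may keep a few more)."""
    X = X.tolil(copy=True)
    for i in range(X.shape[0]):
        if len(X.data[i]) > alpha:
            cutoff = np.sort(np.array(X.data[i]))[-alpha]
            kept = [(c, v) for c, v in zip(X.rows[i], X.data[i]) if v >= cutoff]
            X.rows[i] = [c for c, _ in kept]
            X.data[i] = [v for _, v in kept]
    return X.tocsr()
```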

All experiments used the BOL model. The results are shown in Figures 5.3, 5.4 and 5.5, while Table 5.7 indicates the time used by Cluto to process the different input matrices (using the sda data set as an example4).

We find that if applied very moderately (α = 99), pruning with similarity preservation (Figure 5.3) gives qualitatively good results for three of the five data sets, while considerably saving time (ca. 25% for the sda data set). Larger pruning factors cause the results to deteriorate, even though the additional time gains are rather substantial.

4 As we are only interested in the relative time requirements, evaluation on a single data set is enough for our purposes. For the specifics of the machine used for these experiments refer to the description on page 99.

Figure 5.3: Global pruning with similarity preservation, using Cluto's –colprune parameter (pruning to 100%, 99%, 95%, 90% and 85%). (For the underlying numbers refer to Table D.7.)

Figure 5.4: Global pruning with upper and lower bounds on document frequency (upper bounds of 10%, 5%, 1% and 0.5%; lower bounds of 0%, 0.005%, 0.01%, 0.05% and 0.1%). The length of the vertical line on top of each data point indicates the number of non-zero elements in the resulting feature-document matrix: the shorter the line, the smaller the matrix and the less time is needed for clustering. (For the underlying numbers refer to Table D.8.)

Figure 5.5: Local pruning: keeping only the α most frequent features of each document (after log-idf weighting), for α = 25, 50, 100, 150, 200, 250, 300 and α → ∞. The left half of the figure shows the procedure with all features; in the right half those features not occurring in other documents ("unique features") were removed before making the selection. The differences are not big except for the cases where documents would have been lost altogether. Note: the scale is not the same for each data set, nor is it always the same between different figures; here, for instance, the nzz scale differs from that of the previous figure. Generally, though, we attempted not to switch too often. The length of the vertical line on top of each data point indicates the number of non-zero elements in the resulting feature-document matrix: the shorter the line, the smaller the matrix and the less time is needed for clustering. (For the underlying numbers refer to Table D.9.)

The results gained with lower and upper bounds on document frequency (Figure 5.4) are less straightforward to interpret. Most striking is the fact that while the other data sets show a gradual decline in the cluster results with shrinking matrix size, the sda results behave in the opposite way: the more we reduce the available features, the better the clustering! The best result actually occurs with the 0.005–1% range, which retains as few as 32% of the non-zero elements of the sda input matrix. But also for the other data sets we find that substantial quantitative reductions are possible at quite small cost in clustering quality. In particular it stands out that 20 percent of the nzz set (the 0.1–0.5% range) can lead to results comparable to the full set, whereas for springer and wiki an upper barrier of 10% document frequency helps reduce the matrix size by 30% or more with little or no loss of quality. We also note, however, that for the amazon set all pruning leads to an immediate loss in quality.5

From these experiments it is not possible to deduce recommendations more specific than the general observation that both a generous upper bound (10%) and a relatively small lower bound (excluding terms occurring in fewer than 5–10 documents) appear to be useful under many circumstances.

Local feature pruning strategies (Figure 5.5) show roughly the same picture as global strategies, but the effects are less severe even when the matrix is much reduced. Again a qualitative improvement for the sda set can be observed (except for α = 10), while the springer set profits at least quantitatively. The wiki set responds unexpectedly well even to considerable reductions, while the amazon set shows good results for the higher α values. Finally, the nzz results suffer most from local reduction. This is not too surprising given that its documents are on average much longer than those of the other sets: with increasing document length, stopwords tend to occur and be repeated more often than the typical content words (which, in a longer text, are rarely present from beginning to end).

5.3.2 Latent Semantic Analysis

Separately, we briefly tested Latent Semantic Analysis (see Section 3.4.3.2) as a matrix reduction technique. Using Singular Value Decomposition, the sparse, high-dimensional document matrix was projected into compact subspaces of ρ = 5 ... 300 feature dimensions, i. e. the original n × m matrix H was approximated by H ≈ U Sρ V^T, and the new n × ρ matrix H′ = U Sρ was used for clustering.

It turned out that the (quadratic) memory demands of Singular Value Decomposition exceeded our computational capabilities by far for all except the springer data set6 (cf. Table D.10). Since those results that were available could not be described as very encouraging either, LSA was regarded as too expensive, and its success as too uncertain, to warrant further investigations in this context.7
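For completeness, a sketch of the projection just described, using SciPy's sparse truncated SVD (a modern stand-in for the MATLAB routine mentioned in the footnotes; the memory situation, not the mathematics, was our limiting factor):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

def lsa_project(H: sparse.csr_matrix, rho: int) -> np.ndarray:
    """Approximate the n x m matrix H by U S_rho V^T and return the
    n x rho representation H' = U S_rho used for clustering."""
    U, s, _ = svds(H.asfptype(), k=rho)
    order = np.argsort(s)[::-1]      # svds returns singular values in ascending order
    return U[:, order] * s[order]    # columns of U scaled by the singular values
```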

5.3.3 Conclusions

Whereas our LSA experiments proved inconclusive, it could be shown that feature pruning is potentially a very useful technique, both from a quantitative and a qualitative point of view.

5 In order to put the entropy decreases into perspective, compare with the results that we found prior to applying our "external" extra weighting scheme (Figure 4.8). It then transpires that even though the amazon results get immediately worse with pruning, it is still possible to more than halve the number of non-zero elements and still achieve a better clustering result than with any of Cluto's native global weighting strategies.
6 The bag-of-lemmata being used after external weighting and retaining only shared features.
7 SVD was performed with 1 GB RAM and with MATLAB's svds routine, which is based on the LAPACK package (Anderson et al., 1999). Attempts with the SVDPACKC package (Berry et al., 1993) did not solve the computational difficulties.

method                                    retained   time   lost
No pruning (100%; α → ∞)
  complete feature matrix                   100%     218s
Global pruning with similarity preservation(a)
  prune to 99%                                       162s
  prune to 95%                                       127s
  prune to 90%                                       111s
  prune to 85%                                       104s
Global pruning, upper and lower bounds
  0 – 10%                                    67%     153s
  0.005 – 10%                                63%     134s
  0.01 – 10%                                 61%     125s
  0.05 – 10%                                 55%      99s
  0.1 – 10%                                  51%      89s
  0 – 5%                                     58%     144s
  0.005 – 5%                                 54%     123s
  0.01 – 5%                                  52%     116s
  0.05 – 5%                                  46%      90s
  0.1 – 5%                                   42%      75s
  0 – 1%                                     36%     112s
  0.005 – 1%                                 32%      91s
  0.01 – 1%                                  31%      84s
  0.05 – 1%                                  24%      59s   {2}
  0.1 – 1%                                   21%      50s   {7}
  0 – 0.5%                                   28%     101s
  0.005 – 0.5%                               24%      77s   {1}
  0.01 – 0.5%                                23%      70s   {1}
  0.05 – 0.5%                                16%      49s   {15}
  0.1 – 0.5%                                 13%      40s   {56}
Local pruning (all features)
  α = 300                                   100%     189s
  α = 250                                   100%     189s
  α = 200                                    97%     182s
  α = 150                                    90%     177s
  α = 100                                    74%     157s
  α = 50                                     42%     120s
  α = 25                                     21%      79s
  α = 10                                      8%      43s   {83}
(a) Times include the pruning process.

Table 5.7: Time demands of different pruning techniques for the sda data set (retained: percentage of non-zeroes from the input matrix kept after pruning; time: duration needed by Cluto for clustering, including I/O-processing and reporting time; lost: number of documents that could not be clustered because no feature survived pruning).

However, it was also shown that the benefits to be gained from pruning are strongly dependent on the concrete task (data).

From a qualitative perspective, it was surprising to find that one data set (sda) derived great benefits from practically all pruning methods, even very severe ones, whereas for other sets (nzz and in particular amazon) it turned out to be nearly impossible to improve clustering quality by way of matrix pruning.8

From a quantitative perspective, it is encouraging that it appears possible to reduce matrix size by between 20% (amazon) and 80% (sda and nzz) without qualitative loss. Based on the present data, it is not possible, though, to indicate definitively which pruning method and which parameters to use in which situation. Global pruning with similarity preservation quickly led to a decline in the results. Global pruning with upper and lower bounds performed well for sda and nzz, while for amazon and wiki local pruning seemed preferable.

Generally speaking, pruning is easier if the underlying categorical structure (class labels) is crude. The springer, sda and nzz sets come with only 5 or 7 classes and presumably contain more redundant information that can be stripped before clustering. The amazon and wiki sets, with over 20 classes, are more dependent on the individual feature, and pruning is more likely to remove important information. As a tentative conclusion we suggest that for larger and more detailed clusterings local pruning is preferable, while for coarse clustering tasks a generous global pruning strategy with lower and upper bounds can be used.

In the following sections we examine whether language knowledge can help us identify beneficial reductions with greater certainty than has been the case with a strictly statistical approach.

5.4 Stopwords

Stopword lists are one of the oldest complexity reduction techniques and are used in practically all current IR systems. Their effectiveness seems unquestionable. Under these circumstances it is somewhat surprising that comparatively little research has gone into exploring the nature of stopwords. The length and content of stoplists can vary strongly, and there is no standard procedure for creating them. Which words to consider stopwords often depends on the actual application and context.

Since we know of no universally accepted stopword list for German, we manually compiled a list of our own, drawing on various sources. The list has 1417 entries derived from ca. 800 lemmata. In addition we used a specific stopword list of 588 English words and word forms (including a few Roman numerals and typical HTML abbreviations), because various texts were found to contain English text strings. Both lists are given in Appendix C.

In the present section we first look at the influence of our stoplist on cluster results. In a second step we evaluate a number of automated stopword extraction techniques. Note that all experiments outside this section use the initial, manually compiled lists of Appendix C.

8 Since the results might lead one to believe that any pruning would have been good for sda, no matter how the reduction was chosen, we performed an additional experiment wherein we removed every second column from the matrix. In accordance with our expectations, the results were worse than with the systematic pruning methods. It was a surprise, however, to find the nzz results quite stable even with this random pruning method (remember that the numbers in parentheses refer not to dimensions but to non-zero elements and may therefore diverge from 50%):

                        springer       amazon         sda            wiki           nzz
no reduction            0.421 (100%)   0.448 (100%)   0.456 (100%)   0.388 (100%)   0.349 (100%)
50% random reduction    0.500 (49%)    0.510 (49%)    0.474 (48%)    0.460 (49%)    0.352 (51%)

5.4.1 Explicit Stoplists

Figure 5.6 reports on the stopword experiments, with our manual stoplist applied both to the BOW and BOL models. The results are quite in accordance with our findings in the previous section: a small qualitative impact on springer and nzz, negative consequences for amazon and wiki, and a clear improvement only for the sda set. This evidence forces us to conclude that, contrary to widespread practice, stopword removal cannot be universally recommended for all document clustering tasks.

As it may be argued that our rather long and comprehensive stoplist is perhaps responsible for these counter-intuitive results, the experiment was repeated with a much smaller stoplist, i. e. the stopwords used by Google.9 This alternative list is restricted to 127 German and 35 English words. It turns out that the situation does not change much: the direction of the effects remained the same, even though, not unexpectedly, on a smaller scale.

5.4.2 Stopword Extraction Techniques

In view of the surprisingly mixed experiences just reported, the question arises whether human intuitions about stopwords are perhaps inaccurate and whether the task of stoplist generation could, and perhaps should, be automated. In order to explore this subject we tested several measures in a classification context, with the aim of gauging the "stopwordliness" of individual words/lemmata. Intuitively speaking, a good stopword should appear frequently enough to be worth removing and should be unrelated to the document content. This content-independence is here crudely reflected in the distribution of class labels. Taking the BOL model as our base, and partly inspired by research in the domains of corpus comparison (Kilgariff, 1997, 2001; Roeck et al., 2004a,b) as well as text categorisation (Wilbur and Sirotkin, 1992; Yang and Pedersen, 1997; Pekar et al., 2004), we tested eight techniques to assess a term's "stopwordliness" within a given labelled corpus.

We start with a few preliminary definitions for a given feature fi and label Lj:

• A = the number of documents containing fi and having label Lj,

• B = the number of documents containing fi but having a label different from Lj,

• C = the number of documents not containing fi and having label Lj,

• D = the number of documents neither containing fi nor having label Lj,

• p1 = A/(A + C) = the fraction of documents with label Lj that contain fi,

• p2 = A/(A + B) = the fraction of documents containing fi that have label Lj,

• p3 = (A/(A + C)) / ((A/(A + C)) + (B/(B + D))),

• p4 = (A/(A + B)) / ((A/(A + B)) + (B/(B + D))).
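As a small illustration of these counts and fractions, assuming a boolean occurrence vector and a label array (the zero-guards are our own addition):

```python
import numpy as np

def contingency(present: np.ndarray, labels: np.ndarray, label):
    """A, B, C, D and p1..p4 for one feature/label pair; present[d] is
    True iff the feature occurs in document d."""
    in_class = labels == label
    A = int(np.sum(present & in_class))    # feature present, label L_j
    B = int(np.sum(present & ~in_class))   # feature present, other label
    C = int(np.sum(~present & in_class))   # feature absent, label L_j
    D = int(np.sum(~present & ~in_class))  # feature absent, other label
    p1 = A / (A + C) if A + C else 0.0
    p2 = A / (A + B) if A + B else 0.0
    r = B / (B + D) if B + D else 0.0      # rate of the feature outside L_j
    p3 = p1 / (p1 + r) if p1 + r else 0.0
    p4 = p2 / (p2 + r) if p2 + r else 0.0
    return A, B, C, D, p1, p2, p3, p4
```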

We can then define the following measures of stopwordliness s(fi):

Document Frequency. One of the simplest stopword detection criteria is document frequency. It assumes that the more often a word occurs, the less content it bears. Stopwordliness is thus measured simply by s(fi) = Σ A (summed over the labels L1 ... Lk).

9 The Google stopwords were taken from the empirical list given at www.ranks.nl/stopwords/.

Figure 5.6: Stopword removal with a manually compiled stoplist of around 800 German lemmata and 588 English word forms, applied to the BOW and BOL models. As a reference, the results with the much shorter "Google stoplist" (see text) are also given in the fourth and eighth column. (For the underlying numbers refer to Table D.11.)

df      discr   χ2            E                   E′                    E″          KL      WKL
der     der     der           Menschenverstand    spendabel             erst        der     der
die     die     anlangen      Vizevorsitzend      diagnostiziert        der         die     die
ein     ein     Auftraggeber  zweischneidig       Geldgeschäft          Freitag     durch   ein
in      und     unwahr        Autobiographie      wund                  drei        ein     und
und     durch   wasser-       Herstellen          Weichmacher           das         und     in
sein    sein    miserabel     zerstreut           Urteilsverkündigung   Dienstag    sein    sein
werden  in      Chemiker      einbrocken          Greif                 Donnerstag  in      er
an-der  dann    Filter        Fachkonferenz       Sicherheitsprüfung    Mittwoch    dann    werden
haben   werden  dumm          zusammenschweißen   aufrütteln            durch       er      durch
von     von     öfters        Geschäftsidee       Stippvisite           weiter      werden  an-der

Table 5.8: Top ten stopwords as found by eight measures in the labelled sda corpus.

Discriminative Power. Following Liu et al. (2002) we use discriminative power to establish how much each term contributes to a specific category. We use the following definition of discriminative power, based on relative document frequencies: discr(fi) = max p1 · log( max p1 / ((Σ p1 − max p1)/(k − 1)) ), where the maximum and the sum are taken over the labels L1 ... Lk. Our measure then becomes s(fi) = 1 − discr(fi).

Chi-Square Test (χ2). The chi-square test is a typical measure for evaluating how much a certain term contributes towards explaining a certain classification. Here we want to use it to identify terms that do not contribute significantly to any of the categories. The χ2 statistic is defined10 as Σ (O − E)²/E over the labels L1 ... Lk, where the observed value is O = A and the expected value is E = (A + B)(A + C)/n. We then calculate s(fi) = 1 − χ2(fi).

Entropy. The standard entropy of a term's distribution is s(fi) = E = −(log n)^−1 Σ p2 log p2.

E′. Following Xiao (2003) we define E′ as E′(fi, Lj) = p1(1 + p3 log2 p3 + q log2 q), with q = 1 − p3. The smaller E′, the less informative the feature is about Lj, and thus s(fi) = 1 − max E′(fi, Lj), the maximum being taken over L1 ... Lk.

E″. The empirical measure E″ is identical to E′ except for the replacement of p3 by p4.

Kullback-Leibler Divergence (Relative Entropy). The Kullback-Leibler divergence measures the "distance" between two distributions (the distribution of all documents versus the distribution of documents containing fi). With q = (A + C)/n it is defined as KL(fi) = Σ p1 log(p1/q). Stopwordliness can then be measured as s(fi) = 1 − KL(fi).

Weighted Kullback-Leibler Divergence. Finally, a weighted version of the Kullback-Leibler divergence was tested: s(fi) = 1 − KL/(A + B). The more frequent a term, the higher s(fi).

Table 5.8 shows the top 10 words for the sda corpus for each technique. The large differences between the measures are also illustrated by Table 5.9, showing the number of common words among the top 100 lists for each pair of methods.
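As an illustration, the two KL-based measures can be computed per feature from label-wise count arrays A and C as defined above (a sketch under our own conventions; the epsilon guards are not part of the original definitions):

```python
import numpy as np

def kl_stopwordliness(A: np.ndarray, C: np.ndarray, n: int,
                      weighted: bool = False) -> float:
    """s(f) = 1 - KL(f), with KL(f) = sum_j p1 log(p1/q), p1 = A_j/(A_j+C_j)
    and q = (A_j+C_j)/n; the weighted variant divides KL by the document
    frequency A+B before the subtraction."""
    p1 = np.where(A + C > 0, A / np.maximum(A + C, 1), 0.0)
    q = (A + C) / n
    kl = float(np.sum(np.where(p1 > 0, p1 * np.log(p1 / np.maximum(q, 1e-12)), 0.0)))
    if weighted:
        kl /= max(float(A.sum()), 1.0)   # A summed over labels equals A+B
    return 1.0 - kl
```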

10 For the data sets with less than 10 categories, the χ2 statistic was only considered valid if none of the A values was zero and if not more than two A values were less than five.

        df   discr  χ2    E    E′   E″   KL   WKL
df       –    38     1    0     1   83   43    73
discr   38     –    11    0     4   36   68    48
χ2       1    11     –    2     1    1   15     2
E        0     0     2    –     0    0    0     0
E′       1     4     1    0     –    1    4     1
E″      83    36     1    0     1    –   43    75
KL      43    68    15    0     4   43    –    57
WKL     73    48     2    0     1   75   57     –

Table 5.9: Overlap between stopword discrimination measures. Number of common words among the top 100 of each stopword extraction technique as calculated for the sda data set.

5.4.2.1 Self-Validation

For each data set and each of the eight measures we calculated "stopwordliness" and sorted the terms for each such scenario in descending order. To gain a first impression of the stopword candidates thus found, they were applied to the document sets from which they had just been derived (i. e. the springer stopwords were tested on the springer corpus, the amazon stopwords on the amazon corpus, etc.). While this self-validation procedure does not constitute a valid general evaluation, it serves as an indicator of how well each measure determines the superfluousness of a word in the original context (i. e. with the labels known a priori). We used different stoplist lengths, with the cut-off point α set to 50, 100, 250, 500, 1000, 1500, 2000, 2500 and 5000.

Figure 5.7 reports on these experiments in summary fashion by giving just the average results for the nine α values; for the detailed lists consult the Appendix (Table D.13). The lengths of the vertical lines again indicate the reduction in matrix size caused by each stoplist. Because all morphological, syntactic and semantic information was stripped, one term on the initial stopword candidate list can occasionally eliminate more than one term in the feature set (e. g. if "der"/pro was found to be a stopword candidate, it is afterwards responsible for eliminating "der"/pro, "der"/art, "Der"/nam, etc.).

On the whole it transpires that the concept of "stopwordliness" is rather difficult to capture, and even with prior knowledge of the correct labels it can be difficult to identify suitable stopwords. For the springer and sda sets several methods were "effortlessly" able to identify useful stopwords, whereas for the amazon and wiki sets it proved very difficult to find lists that improved clustering performance. Either we have so far failed to find a suitable stopword identification method, or we must conclude that for certain clustering tasks there are no stopwords (at least not in significant numbers). This being said and kept in mind, we can make a few observations about the individual measures of stopwordliness:

• Document frequency (df) seems to be useful in the very high frequency range. If α (the number of words on the stoplist) is chosen to be in the range of 100, the results are relatively good, but with larger α, performance deteriorates as increasingly more content terms are excluded as well.

        springer  amazon   sda    wiki    nzz    avg
df        81.9     79.6    45.8   84.0   70.7   72.4
discr     55.3     66.6    66.8   57.8   69.6   63.2
χ2        84.3     76.4   109.0   87.2   84.9   88.8
E         56.1     86.1    87.0   82.1   85.1   79.3
E′       107.3     76.2   129.3   90.4   79.8   96.6
E″        49.8     63.2    44.8   67.6   79.3   60.9
KL        55.5     73.2    62.7   57.3   66.6   63.1
WKL       70.3     77.6    41.1   72.4   67.4   65.8

Table 5.10: Self-validation with rank-sums. Comparison of different stopword extraction techniques with summed rank-score sums for cluster quality and matrix size reduction. The lower the score, the better.

• Discriminative power (discr) shows relatively good results: it considerably reduces the number of non-zero elements in the matrix while leading to consistent results (improving performance for springer and sda while not worsening it too much with the others).

• χ2 and E′ both neglect the frequency component and tend to over-emphasise rare terms. The elimination of the latter has comparatively little impact; moreover, there is a big chance that the words thus eliminated would be very important in a different corpus.

• Pure entropy (E) suffers from not taking account of the different category sizes. In cases of uneven distributions of documents over labels, the resulting stoplists contain too many irregularities.

• E″ performs along the lines of discr, with roughly similar complexity reductions and cluster quality.

• Relative entropy (KL) takes proper account of the different category sizes and, indirectly, also of frequency.11 Despite being based on a well-known theoretical concept, its results are only about equal to those of E″ which, using the empirical coefficient p4, lacks such a foundation.

• The attempt to pay more attention to frequency by weighting KL with document frequency (WKL) does not appear to improve the results. With larger α, difficulties similar to those of document frequency (df) start to occur.

All in all it would seem that discr, E″ and KL are the best measures for finding stopword candidates (see also the summed rank-sum scores12 in Table 5.10). The perplexing difficulty of finding suitable stopwords for the amazon and wiki corpora, even with a priori knowledge of the labels, has yet to be explained.

5.4.2.2 Cross-Validation

For each data set and each combination of stopwordliness measure and α, new stopword lists were then created from the lists of the four other sets. In a first run the four lists were united; in

11 Because rare features are less likely to be distributed in accordance with the category sizes than high-frequency ones.
12 Cf. Section 4.3.3 for definitions.

Figure 5.7: "Self-validation" of different stopword discrimination measures (the BOL baseline versus df, discr, χ2, E, E′, E″, KL and WKL). Note: the figure shows only the averages over the nine different α values. See Tables D.12 and D.13 for the detailed results.

Figure 5.8: Cross-validation of different stopword discrimination measures (union and intersection of the four other sets' candidate lists, for each of the eight measures, compared against the manual stoplist and the BOL baseline). Results were omitted if the intersection led to an empty stoplist. Only selected values for α (100 and 1000, resp. 500 and 2500) are shown. For the complete lists of results see Tables D.14 and D.15 in the Appendix.

a second run the intersection of the four lists was used. In both cases the new list was evaluated on the fifth set. Thus, for use on the springer set, stoplists from amazon, sda, wiki and nzz were joined (or intersected), and so on. The results of this cross-validation technique allow us to form a judgement on the usability of automatic stopword extraction methods. The full results are given in the Appendix (Tables D.14 and D.15); a few typical results are rendered in Figure 5.8.

On the whole, the results resemble those of the previous section. The difficulty of finding a good reduction for the amazon set is confirmed by the fact that just four stoplist combinations led to a (minimal) qualitative improvement. The sda results were again improved easily by stopword removal, while wiki could deteriorate if the stoplist was too long, and nzz again proved relatively stable.

The results show that automatic stopword extraction can work satisfactorily, with more than one stopword detection method leading to acceptable results. For a number of individual cases it can even be shown that those lists are superior to the manual list. Using multiple sources with

               springer  amazon   sda    wiki    nzz    avg
Union:
df               66.6     69.9    66.5   78.7   83.8   73.1
discr            50.1     69.4    56.3   72.6   79.8   65.6
χ2              109.9     78.6   100.3   74.4   80.2   88.7
E                71.5     74.2    78.8   72.7   63.4   72.1
E′              107.0     83.7   123.9   82.9   88.9   97.3
E″               51.2     72.3    50.8   63.9   61.8   60.0
KL               77.4     77.4    61.7   70.8   68.7   71.2
WKL              63.4     70.7    55.2   76.9   85.6   70.4
manual list      92.5     89.5    79.0  102.0   82.0   89.0
Intersection:
df               54.9     60.1    47.7   51.9   51.5   53.2
discr            65.7     64.4    61.1   60.2   70.4   64.3
χ2               78.5     67.4   107.2   75.2  103.8   86.4
E                51.5     61.1    65.2   77.9   71.4   65.4
E′               93.0     65.5   106.0   83.5   88.0   87.2
E″               46.6     66.7    52.6   57.6   50.4   54.8
KL               71.9     62.7    53.0   73.1   59.4   64.0
WKL              42.6     61.1    40.8   53.3   42.8   48.1
manual list      59.5     64.5    28.0   76.5   49.5   55.6

Table 5.11: Cross-validation rank-sums for different stopword discrimination techniques. Rudimentary comparison of different stopword discrimination techniques using summed rank-scores for cluster quality and matrix size reduction (calculated separately for union and intersection of candidate lists). The smaller the number, the better. The intersection rank sums (lower part of the table) are lower because cases with empty stoplists were omitted.

intersection seems to be a reliable procedure for determining a new stoplist.

Comparing the stopword measures among themselves is not trivial, because the influence on matrix dimensions, matrix size and cluster quality varies from measure to measure. The rank scores described previously are rather crude, but still offer a way of obtaining an overview: rank sums of clustering entropies and matrix sizes were calculated, and for each method these rank sums were again added up and the average taken (all done separately for the union and intersection experiments). The average rank sum (Table 5.11) for each stopword discrimination measure is thus an average value over all the lists arising from α = 50 ... 5000. For the manual list (last row) just a single measurement exists, of course.

It would seem that on average the best stopword measures are E″ and weighted relative entropy (WKL). The discriminative value is a further viable option, but somewhat worse at detecting frequent stopwords. Simple document frequency has also been shown to perform quite well if the different candidate lists are intersected, which eliminates some of its greatest drawbacks. Entropy (E), E′ and χ2 are less suitable for stopword detection, mainly because they fail to take the quantitative aspect properly into consideration, leading to the inclusion of too many rare words in the stoplists.
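The combination step itself is straightforward; a minimal sketch (the inputs are assumed to be the ranked candidate lists of the four other corpora, cut off at some α):

```python
def combine_stoplists(candidate_lists, mode="intersection"):
    """Union or intersection of stopword candidate lists from several
    corpora; per the results above, intersecting multiple sources is a
    reliable way to derive a stoplist for an unseen corpus."""
    sets = [set(lst) for lst in candidate_lists]
    combined = set.union(*sets) if mode == "union" else set.intersection(*sets)
    return sorted(combined)
```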

5.4.3 Conclusions

On the whole, we have found stopword removal to be a relatively useful technique for matrix size reduction, with results that actually compare favourably to those achieved by purely statistical reduction methods. However, when compared to the full feature set, a decline in the results for the amazon and wiki sets still remained in most cases. We therefore conclude that stopword removal should not be blindly applied to all clustering tasks. For a further discussion of the phenomenon see Section 5.6 below.

It was further demonstrated that several measures exist (E″, (W)KL, discr) that capture a word's "stopwordliness" with respect to a given labelled document set with an adequacy on a par with human judgement. These measures have been successfully used to extract new stoplists by combining the lists of different data sets. The results are encouraging, as the best measures tend to perform somewhat better than the manually created stoplist.

5.5 Part-of-Speech Selection

In this section we discuss matrix reduction based on the part-of-speech tags of the individual tokens (see Section 5.2.1 for the details of POS tagging). Documents are thereby reduced to words belonging to specific POS categories. The results (Figure 5.9) prompt a number of observations:

• Nouns (sub) are not only the most numerous features but also strictly essential for clustering.

• Adjectives (adj) are the second most useful ingredient. Their addition improves results in all five cases. [see column 2 in the figure]

• Names (nam and namall) come next. Their addition to nouns helps in four of five cases. [7, 8] However, their addition to nouns and adjectives only helps in two cases, casting a certain doubt on their usefulness. [12]

• Adding verbs, appositions, numbers or unknown tokens [3–6] is less helpful. Positive and negative effects occur with about equal frequency.

• Verbs, even though an open word class, often seem to have too general a meaning [3, 15, 17, 18], i. e. they function more or less as stopwords. However, the results remain unconvincing even with the exclusion of modal verbs.13

• Based on the above data it is difficult to give a general recommendation. Ceteris paribus it would seem that either sub/adj [2] or sub/adj/namall [12] offers the best results. In four of five cases, both these combinations lead to qualitative improvements over ordinary stopword removal [20] (with the additional benefit of a heavier matrix size reduction).

Feature selection based on POS tags is thus a viable alternative to stopword removal. The general tendency is the same as for the previous sections: “difficulties” with amazon and wiki, relatively straightforward improvements for springer and sda.
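A sketch of the selection step, assuming each document is tokenised into (lemma, generic tag) pairs using the tag set of Table 5.6 (for namall, the nam/nam1/nam2 tags would have to be mapped to the union tag beforehand):

```python
def pos_select(documents, keep=frozenset({"sub", "adj", "namall"})):
    """Reduce each tokenised document to the lemmata whose generic POS
    tag is in `keep` (here sub/adj/namall, one of the best combinations)."""
    return [[lemma for lemma, tag in doc if tag in keep] for doc in documents]

# Example: only the noun and the adjective survive.
doc = [("Haus", "sub"), ("schön", "adj"), ("und", "kon"), ("stehen", "vrb")]
print(pos_select([doc]))   # [['Haus', 'schön']]
```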

13 Removing modal verbs (können, dürfen, mögen/möchten, müssen, sollen and wollen) from a sub/vrb representation improved the results only slightly (with one negative exception):

                        springer        amazon          sda             wiki            nzz
sub, vrb (all)          0.448 [0.018]   0.467 [0.008]   0.486 [0.001]   0.403 [0.011]   0.389 [0.024]
sub, vrb (no modals)    0.455 [0.018]   0.464 [0.007]   0.485 [0.001]   0.399 [0.016]   0.378 [0.020]

Figure 5.9: POS-based feature selection, comparing the BOW and BOL baselines with various combinations of sub, adj, vrb, nam/namall, app, adv, num and unk features (columns 1–21; stopword removal and stemming serve as references). Vertical lines correlate with the number of features surviving the selection step. (For the underlying numbers refer to Table D.18.)

5.6 Comparison of Matrix Reduction Techniques

The previous sections discussed several matrix size reduction techniques. We found that substantial reductions in matrix size, of up to 50% or more, were sometimes possible without negatively affecting clustering results. It was also shown that in terms of clustering quality the data sets behaved quite differently, and that even the universally approved stopword removal technique can have negative effects on cluster quality.

Of the methods discussed, the more sophisticated ones, such as stopword removal with extracted stoplists and especially POS selection (sub/adj), tended to lead to better cluster quality than the statistical pruning techniques (when measured at a comparable reduction factor).

Throughout these experiments, we were rather surprised to find that for amazon and wiki none of the matrix reduction techniques managed to improve cluster quality (i. e. benefits only occurred in terms of matrix size). We hypothesise that this phenomenon is correlated with the number of clusters to be built: the larger the number of clusters/labels (21 resp. 22 for amazon and wiki, as opposed to 5 resp. 7 for the other three sets), the more detail is needed and the more sensitive the results are to feature reduction.14 In order to test this hypothesis, we repeated the manual stopword experiment for a number of subsets from amazon and wiki. For each value of k = 5, 7 ... 17, 19 we selected at random five subsets of k labels and clustered the documents of those subsets separately. For each subset we then compared the results before and after stopword removal, with results as in Table 5.12. They confirm our hypothesis that stopword removal gradually gains in effectiveness as the number of clusters shrinks.

Furthermore, it is generally notable how little the nzz results are influenced by most techniques (keep in mind that in the various figures the choice of scale leads to a stronger magnification of the nzz effects than with the other sets). We believe that the large amount of redundancy (as manifested by the large average document length) may account for this phenomenon. In other words, unlike with short texts, the large amount of content evidence drowns out the "noise". A brief experiment was conducted in which the nzz documents were mutilated after 20, 50, 100, 150, 200, 250, 300, 400 and 500 tokens. The results before and after application of the manual stoplist (Table 5.13) indeed showed a rising sensitivity towards the presence/absence of stopwords when documents are mutilated.

It can therefore be concluded that stopword removal is more likely to increase clustering quality ...

• with a small number of clusters than with a large one,

• with short documents than with long ones.

14 Unlike the springer, sda and nzz texts, the amazon and wiki texts are also written by a much more diverse authorship and with far less general stylistic or orthographic conventions than is the case with the former three. However, it appears difficult to explain how this could lead to such strong divergences as have been repeatedly observed.

number of               stopword removal   stopword removal   no
clusters/labels         harmful            helpful            effect
k = 5       amazon      0                  5
            wiki        1                  3                  1
k = 7       amazon      0                  5
            wiki        2                  3
k = 9       amazon      1                  4
            wiki        0                  5
k = 11      amazon      1                  3                  1
            wiki        1                  4
k = 13      amazon      1                  4
            wiki        2                  3
k = 15      amazon      3                  1                  1
            wiki        2                  3
k = 17      amazon      3                  2
            wiki        2                  3
k = 19      amazon      3                  2
            wiki        2                  3
all         amazon      1                  0
(k = 21/22) wiki        1                  0

Table 5.12: Reduced number of clusters. Experiments on amazon and wiki with just a subset of the document collections and a reduced number of clusters. For each value of k five subsets were chosen and the results before and after stopword removal (with BOL) compared. On each line, the column holding the majority of the five subsets indicates whether stopword removal was on average harmful or helpful. The individual results are given in the Appendix (Tables D.16 and D.17).
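The subset selection behind Table 5.12 can be sketched as follows (the seed and the clustering call itself are our own placeholders):

```python
import random

def label_subsets(all_labels, k, repetitions=5, seed=42):
    """Draw `repetitions` random subsets of k class labels, as in the
    reduced-cluster-number experiment of Table 5.12."""
    rng = random.Random(seed)
    pool = sorted(set(all_labels))
    return [rng.sample(pool, k) for _ in range(repetitions)]
```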

n          20     50     100    150    200    250    300    400    500    → ∞
BOL        0.750  0.522  0.446  0.428  0.415  0.407  0.401  0.387  0.376  0.349
BOLstop    0.727  0.500  0.441  0.426  0.414  0.407  0.402  0.387  0.378  0.349

Table 5.13: Text mutilation. Stopword removal with nzz texts mutilated after the first n tokens.

5.7 Feature Weighting Techniques

Although they are rarely viewed this way, the reduction techniques discussed in the previous four sections can be regarded as a special, binary case of feature weighting. If our main aim is not so much to reduce matrix size as to improve clustering quality, it is natural to ask whether a continuous feature weighting approach might not be preferable to this rigid 0/1 method. In the present section we briefly discuss three techniques that aim to increase the weights of "important" terms without discarding the "unimportant" ones altogether.

5.7.1 Part-of-Speech Weighting

In the first experiment, certain POS categories were given extra weight vis-à-vis the rest (Figure 5.10). Based on the conclusions of Section 5.5 we concentrated on nouns, adjectives and proper names. In accordance with the earlier experiments, we find that springer and sda benefit from the extra weighting of nouns, adjectives and, to a lesser degree, names, whereas the amazon and wiki sets often show an increase in entropy. It is notable, however, that compared to the earlier experiments the nzz set appears to profit most from the POS weighting method. Though probably not statistically significant, amazon also improves minimally if adjectives and nouns are given some extra weight. nam tokens seem generally less important than one would intuitively think.

Nevertheless, judging by the evidence it appears that, even without further knowledge about the data set, giving moderate extra weight to nouns and adjectives is a relatively safe way of attempting to boost clustering quality without running the risk of eliminating important terms.
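A sketch of the weighting step, assuming a dense documents × features matrix and a parallel list of generic POS tags, one per feature (the weight would be applied before the usual normalisation of the document vectors):

```python
import numpy as np

def weight_pos(X: np.ndarray, feature_tags, boost: float = 1.5,
               boosted=frozenset({"sub", "adj"})) -> np.ndarray:
    """Multiply the columns of features carrying a boosted POS tag by
    `boost`, leaving all other features untouched."""
    factors = np.array([boost if tag in boosted else 1.0 for tag in feature_tags])
    return X * factors   # broadcasts along the feature axis
```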

5.7.2 Stopword Weighting

From what has just been said, it is natural to ask whether such a smoothing approach cannot be applied to stopwords as well: rather than removing them altogether, we can try to decrease their weights. Figure 5.11 indeed shows some improvements if the stopwords are weighted with a smoothing factor 0 < γ < 1, but the effects are insignificant and can also occur in the other direction. Down-weighting stopwords might make sense if nothing is known about the clustering behaviour with regard to stopwords. Otherwise (i. e. if we already know whether the reaction tends to be positive, as with sda, or negative, as with amazon or wiki), a strict 0/1 decision is preferable, in particular since down-weighting, unlike removal, does not come with a reduction in matrix size.

5.7.3 Weighting Front Nouns

Table D.21 reports on a related experiment. Since a large number of texts start by mentioning some of their very key concepts in the first one or two sentences, we gave extra weight to the first five or ten nouns in each text. The results are rather sobering; a small, insignificant improvement could only be observed with the nzz set. Perhaps such an approach should only be considered for texts of sufficient length.

Figure 5.10: Lending extra weight to chosen POS categories (sub, namall, sub/adj, sub/namall and sub/adj/namall, each with weights 1.5, 2 and 3). (For the underlying numbers refer to Table D.19.)

Figure 5.11: Stopword weighting instead of elimination, for γ = 1, 0.9, 0.75, 0.5, 0.25, 0.1 and 0: from treating stopwords like all other words (γ = 1) to complete stopword removal (γ = 0). (For the underlying numbers refer to Table D.20.)

5.8 Summary of Reduction Experiments and NLP

Let us try to summarise the impact of NLP methods on the representation reduction techniques discussed in this chapter. In Figure 5.12 we have selected from the flood of experimental data what appeared to be the most significant numbers for the various methods. The results achieved by purely "statistical" methods are numbered S1–S5, those making use of linguistic knowledge L1–L6. M1–M2 show some results obtained by "statistical" methods but on a linguistic basis (lemmata).

A simple comparison shows that the bag-of-lemmata (L1) leads to better results than the bag-of-words (S1). However, with stemming (S2) the BOW results become just as good as those from lemmatising. It has been demonstrated that the creation of stoplists is amenable to statistical methods, with results that compete with a manually created list. Our stopword discrimination experiments were all conducted with lemmata (M1), but there is no reason to doubt that the same procedure would also work with the bag-of-words. Stopword removal has comparable effects on the bag-of-words (S3, S4) and the bag-of-lemmata (L2).

If we compare the aggressive statistical matrix reduction approach (stopping and stemming; S4) with a linguistic alternative (lemmatising and restriction to nouns, names and adjectives; L3), the latter scores better. The comparison between statistical pruning of lemmata (M2) and POS selection (L3–L5) also turns out in favour of the linguistic variant(s). Finally, the POS weighting approach (L4, L5) leads to results that are more or less equal to the best of S1–S4, while beating the individual choices more or less clearly. The comparison of the "best each" lines (S5, L6) further confirms that the linguistic methods have the potential to improve clustering quality beyond the possibilities of statistical methods.

At the same time it must be noted that the evidence is not overwhelming, the outcome of the experiments rarely reaching a level of firm statistical significance. We have also observed that the effects of the representation methods were not the same for all data sets: what proved useful in one instance could prove disadvantageous in another.

Figure 5.12: Selected linguistic and statistical reduction methods in comparison. S1: BOW; S2: BOWstem; S3: BOWstop; S4: BOWstop,stem; S5: best of S1–S4; L1: BOL; L2: BOLstop; L3: POS selection: sub, adj, namall; L4: POS weighting (1.5): sub, adj; L5: POS weighting (1.5): sub, adj, namall; L6: best of L1–L5; M1: stopwords with E″ (union, α = 250); M2: local pruning (0–10%). (For the underlying numbers refer to Table D.22.)

Chapter 6

Enhanced Document Representations Using Natural Language Processing

Say the word and you'll be free
Say the word and be like me
Say the word I'm thinking of
Have you heard the word is love?
It's so fine, it's sunshine
It's the word, love
In the beginning I misunderstood
But now I've got it, the word is good.

The Beatles (“The Word”, 1965)

The word may be good, as in the Beatles song, but sometimes it may not be good enough. In the present chapter we shall try to look beyond the individual word, in order to capture some of its meaning and context as well. We investigate how, and to what extent, information on the morphological, syntactic and semantic levels can help us to improve document representation and hence clustering results.

On the morphological level, we try to analyse words and extract additional information from the way they are built (Section 6.1). On the syntactic level we look at words in their immediate context within a sentence (Section 6.2), while on the semantic level we try to abstract from the outward word forms and penetrate to the actual meaning of each word (Section 6.3). It is clear that these techniques are not at all aimed at optimising speed or reducing matrix size: their sole purpose is to improve clustering quality.

6.1 Using Morphological Information

Morphological analysis has already been used at an earlier stage, when word forms were mapped to their lemmata (Section 5.2). However, a detailed analysis of each word can also be used for

further purposes. Especially in languages with a strong tendency towards compounding, it can be useful to split compound terms into their constituents (cf. Section 3.4.2.3).

The present section reports on a number of experiments in which the compounds in each data set were analysed and split up according to different rules. All are based on the morphological analyses provided by Gertwol. In order to keep the conditions equal, stopword removal was performed for all sets prior to compound splitting, despite the previous observation that stopword removal may not be good in all cases. As we are mainly interested in the relative effect of compound splitting, this was considered an acceptable simplification (but see Section 6.1.4 below for a further discussion of this aspect).

6.1.1 Mechanical Compounds

First we look at features which can be split into constituents without a real morphological analysis, i. e. features such as "Rechtschreibe-Kommission" and "links/rechts". We call such features of the type A-B or A/B "mechanical compounds". The impact of splitting mechanical compounds (which are treated as a unit by Gertwol) appears to be rather small, both with the BOW and BOL models (Figure 6.1). They can easily be recognised and split up without further ado; in fact, this is what a number of tokenisers do automatically. As a rule we retained both the parts and their original compounds as features. The few experiments where the original compounds were dismissed after splitting are marked with a star (*).

6.1.2 Organic Compounds

Let us now turn to real compounds, which are not discernible as such on the symbolic level but require a true morphological analysis. We call them "organic" compounds. Gertwol provides the necessary information by indicating strong compound boundaries with the symbol "#". For instance, the term "Freistaat" (free state) is analysed thus:

"Freistaat" → "Frei#staat",

giving us the two constituents "frei" and "staat". Often, however, compound resolution is less straightforward. For instance, some words consist of more than two constituents or allow several different segmentations. An example is "Fussballerleben", for which Gertwol offers three principal analyses:1

Fussballerleben → 1) Fuß#ball#er|leb∼en (foot + ball + experience),
                  2) Fuß#baller#leb∼en (foot + bang! {"ballern"} + life),
                  3) Fuß|ball∼er#leb∼en (football player + life).

As before, we follow Volk's heuristic for choosing the most likely lemma (Volk, 1999). Here it favours interpretation 3), which is indeed the most likely usage of the term. This method may not always lead to the correct segmentation (indeed, there are cases where even a human will be unable to predict the correct or most likely segmentation without knowledge of the context), but it is fairly accurate and leads to consistent solutions.

Our ultimate aim in compound splitting is to strengthen those individual constituents of a compound that also occur at least once somewhere else in the corpus, either freely or as part of a different compound. Not all compounds behave as neatly as "Freistaat", which produces

1 Note that ∼ indicates suffixes, while | points to "weak" compound boundaries (which are of little interest to us).


Figure 6.1: Splitting mechanical compounds. BOW and BOL extended by splitting "mechanical" compounds (A-B and A/B features). Stars ("*split") refer to experiments in which the compound token was discarded after analysis and replaced by its constituents. (For the underlying numbers refer to Table D.23.)

two perfectly ordinary lemmata, ready for immediate use as features ("frei", "staat"). In fact, compound building often involves some sort of inflection of the partaking lemmata, so some sort of standardisation becomes necessary when moving in the opposite direction. Usually the last element in a compound behaves regularly, and inflection occurs with the earlier, modifying constituent(s). The necessary standardisation can be achieved in two ways: either by simple stemming (which, of course, must then be applied to all features, not just the constituents), or by further analysis. In accordance with the linguistic approach followed in this work, we opted for the latter solution: using Gertwol a second time, we tried to reduce all compound parts to proper lemmata. Four possible cases can be distinguished (a small parsing sketch follows the list):

• “Freistaat”—the modifying part (“frei”) is already in lemmatised form and is therefore easily recognised by Gertwol.

• “Hundeleine”—extra material appears between the parts, “gluing” them together. Fortunately, Gertwol analyses these so-called fugenlaute (such as -e-, -s-, -en-, -n-) properly and offers “Hund\e#leine”, where the backslash symbol allows us to identify and remove the extra -e-, retrieving “hund” and “leine”, for which Gertwol will provide suitable POS tags.

• “Fahrschule”—the opposite case is also possible: a lemma is abbreviated, usually to its stem. Although these stems may occur only rarely on their own in a normal text, if fed to Gertwol, they are often recognised as special (imperative) verb forms. Thus, when we analyse “fahr” separately, Gertwol will (correctly) retrieve the lemma “fahren/VRB”.

❼ “Adreßetikett”—in some other cases Gertwol will not know what to do with a nominal stem such as “adreß” in “Adreß#etikette” (or “geschicht” in “Geschicht s#schreib ung”). In order to still retrieve an ordinary lemma, we feed Gertwol these stems\ with the∼ extra suffixes -e, -en and -er appended. At least one of them usually allows Gertwol to come up with a plausible interpretation.
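To make the first stage of this resolution concrete, the following sketch splits Gertwol-style analysis strings along the conventions just described (“#” strong boundary, “\” fugenlaut, “∼” suffix, “|” weak boundary, with ASCII “~” standing in for “∼”); it is a minimal illustration assuming this notation, and the second Gertwol pass that re-lemmatises stems such as “geschicht” is not shown:

    def split_gertwol(analysis):
        """Split a Gertwol-style analysis string at the strong compound
        boundaries ('#'); within each part, cut off fugenlaut material
        marked by '\\' and delete the weak-boundary and suffix markers
        ('|' and '~')."""
        constituents = []
        for part in analysis.split("#"):
            part = part.split("\\")[0]                  # 'Hund\\e' -> 'Hund'
            part = part.replace("|", "").replace("~", "")
            constituents.append(part.lower())
        return constituents

    print(split_gertwol("Frei#staat"))                  # ['frei', 'staat']
    print(split_gertwol("Hund\\e#leine"))               # ['hund', 'leine']
    print(split_gertwol("Geschicht\\s#schreib~ung"))    # ['geschicht', 'schreibung']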

This procedure allows us to split almost all compound terms in our corpora into sensible lemmata. One anomaly must be mentioned: there is a relatively large number of word stems which can generate both nouns and verbs. Unless those word stems are nouns themselves, they are always analysed as verbs (imperatives), even though that might not be intuitively correct. For instance, we would prefer “Schul#haus” to be reduced to “Schule/SUB” and “Haus/SUB”, but what we obtain is “schulen/VRB” and “Haus/SUB”. Using word statistics it would be possible to distinguish cases where a noun interpretation (e. g. by adding the suffix -e) is more likely than the verbal one, but such a distinction was omitted here.

Compounding is a major facet of the German language, so we should expect compound resolution to have a noticeable impact on our results. Figure 6.2 confirms this expectation. The positive impact on springer and sda is comparatively huge, and the effect on the other three data sets is still quite remarkable after seeing how “difficult” they had been in the previous chapter. As to the question whether the initial compounds should be retained or dropped after being replaced by their constituents, we observe that keeping the compounds makes a clear difference: the speed gains obtained by dropping the original compounds must be paid for with a decidedly lower quality of the clustering results. Our experiments thus lead us to a different conclusion from Rosell (2003), who discards the compound terms after splitting.

[Figure 6.2 here. Entropy axes: springer 0.5–0.4, amazon 0.48–0.44, sda 0.45–0.38, wiki 0.42–0.38, nzz 0.365–0.345. Columns: BOLstop; BOLstop,split; BOLstop,∗split.]

Figure 6.2: Splitting all compounds (after stopword removal). The third column refers to experiments in which the compounds were discarded after splitting (which had a very strong negative impact on the nzz result). (For the underlying numbers refer to Table D.24.)

                                     springer   amazon      sda     wiki      nzz
types (overall)                        35,087  440,883  317,783  618,225  757,072
pseudo-compounds                        1,349    9,840    5,357   10,805   15,492
mechanical compounds (non-pseudo)       2,001   45,329   51,373   90,093   74,479
purely organic compounds               13,997  126,153  116,112  227,374  256,380
all organic compounds                  14,562  134,237  130,951  244,430  273,789
all compounds                          17,313  181,474  172,651  328,395  345,074
percentage of organic compounds         41.5%    30.5%    41.2%    32.3%    44.3%
percentage of all compounds             49.4%    41.2%    54.3%    43.4%    55.8%

Table 6.1: Compound statistics (after stopword removal). Pseudo-compounds start or end with a hyphen. Mechanical compounds have an embedded hyphen or slash. Purely organic compounds have neither slash nor hyphen but word boundaries detected by Gertwol.

6.1.3 Restrictive Compound Splitting

The previous section has shown that compound resolution is in most cases a very effective means of improving document representation and clustering results. Nevertheless, the question arises whether the approach can be modified so as to improve results further, in particular for those data sets that showed only moderate improvements.

In German, compounds are very frequent. Table 6.1 provides an overview of the different compound types occurring in our corpora. Terms beginning or ending with a hyphen are shown separately as “pseudo-compounds”. Many such pseudo-compounds of the “-X” type are artifacts resulting from non-conforming typesetting (and left unaltered in previous processing steps). In our corpora the number of “real” (i. e. non-pseudo) organic compounds amounts to 30–45% of all types, i. e. a very substantial part of the whole vocabulary.

Table 6.2 shows organic compounds and their POS categories. Actual compounds occur in the adj, sub and nam classes (the latter mostly as geographical terms, e. g. “Nord#atlantik”). The rest are almost without exception spurious cases. Verbal compounds are usually either groups such as “gesagt/geschrieben”, English terms (“Real-Time”), misspellings or mis-tagged terms. The same applies to the few cases listed under art, app, kon, ptk, pro and pun. unk usually refers to first parts in constructions such as “Hals- und Beinbruch” whose POS is not yet further identified by our initial tagging/lemmatising procedure. Compound adverbs are usually of the numeric sort (“sieben#hundertsten#mal”—seven hundredth time) or foreign-language names with a wrong tag. nam1 are usually hyphenated first-name combinations (“Hans-Peter”) and nam2 hyphenated combinations of family names (“Rimsky-Korsakov”). This being said, looking at the large adj and sub classes, we find that most of their members are indeed proper compounds and worth considering for splitting (unlike the exceptions in the other POS classes).

Given the large number of compounds, it might pay off to examine a compound more closely before splitting it. Indeed, a compound that is used very often over a prolonged time tends to assume an independent meaning of its own and loses its transparency (e. g. “Nieder#lage”). Other compounds (so-called “bahuvrihis”) have a meaning that is not at all contained in any of the constituents (e. g. “Grün#schnabel”—greenhorn). In these cases it is obviously undesirable to split the compounds.

Deciding which compounds to split and which to keep is a non-trivial task (cf. Matthiesen, 1999). In principle, we want to avoid splitting compounds that are lexicalised (have become

       springer   amazon      sda     wiki      nzz
adj       1,624   14,203    7,689   21,206   22,199
adv          26      311      161      451      468
app          11       27       25       34       27
kon           1        1        1        2        1
nam         162    5,898    3,422   14,429    5,832
nam1         16      921      447    1,262      869
nam2          7      326      100      390      308
num           7      317      182    1,427      541
sub      14,164  148,877  155,390  277,519  299,956
unk         214    1,094      928    1,471    2,690
vrb           6      718      105      817      228
art: 1; pro: 2, 1, 13, 3; ptk: 1, 3, 8, 1; pun: 1, 1 (sparse rows with empty cells; the column positions of these entries are not recoverable from the source)

Table 6.2: “Organic” compounds and their POS tags (types).

“common words”), since their meaning is often no longer associated with the constituents. Using various properties of compounds, a wide range of criteria can be imagined to select a subset of compounds or to place restrictions on the constituents. We subjected the following plausible approaches to a further test:

l⟨number⟩: Require compounds to exceed a certain length threshold. Idea: longer strings are more likely to owe their existence to “ad-hoc” compounding than short strings, and therefore it makes more sense to split them up. Short compounds are more likely to be frequent constructions, with a corresponding tendency towards lexicalisation.

d⟨number⟩: Put a maximum document frequency limit on compounds. Idea: compounds occurring with a certain frequency in the corpus are good features in themselves and do not need to be split up. Furthermore, frequent features also tend to be lexicalised. Rare features, on the other hand, are more likely to gain in usefulness if split up.

c{AN}: Apply POS restrictions on the compounds, splitting only compounds belonging to certain POS classes (A=adj, N=sub). Idea: not all word types are equally likely to contain interesting information. Note that nam, nam1 and nam2 features were exempted from splitting in all experiments, even without any restrictions.

sS{ANV}: Apply similar POS restrictions on the final constituent of a compound; discard that constituent unless it belongs to the desired POS group (A=adj, N=sub, V=vrb).

sM{ANV}: Apply POS restrictions on the leading constituents of a compound (all but the last).

r{1,000|10,000}: Restrict compound splitting to compounds that do not occur in a standard lexicon. Simple lexica consisting of the 1,000 and 10,000 most frequent German terms were used, found in the “Wortschatz-Lexikon” of Leipzig University.2

rC: Restrict constituents to words that occur freely in the corpus. This approach keeps the additional complexity within limits and avoids splitting compounds whose constituents are not likely to be relevant in the context of that corpus anyway.

2http://wortschatz.uni-leipzig.de/html/wliste.html

Finally, with compounds of three or more parts the question arises whether or not they should be split at each boundary. For example, it might make more sense to divide “Handballtorwart” (handball goalkeeper) just into “Handball” and “Torwart” rather than “Hand”, “Ball”, “Tor” and “Wart”. The splitting points can be determined in various ways; we tested two approaches:

x: Consider all single and all multi-part constituents (i. e. Handballtorwart → [Hand, Handball, Handballtor, Ball, Balltor, Balltorwart, Tor, Torwart, Wart]). Naturally, not all features thus found have a real meaning; e. g. “Balltor” is unlikely ever to enter a German dictionary. Nevertheless, for clustering it is unlikely to do any harm and may under certain circumstances even be useful.

X: Consider the longest multi-part constituents at the beginning and at the end of each compound (e. g. Handballtorwart → [Handballtor, Balltorwart]). Typically, we would want to combine this method with a restriction to constituents that also occur independently in the corpus (rC). Then the analysis is likely to become: Handballtorwart → [Handballtor, Torwart] (which definitely makes sense).
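The two strategies can be sketched as follows, working on lower-cased part lists (the function names and the corpus_vocab parameter are illustrative choices of ours):

    def constituents_x(parts):
        """Variant x: all single and multi-part constituents, i. e. every
        contiguous run of parts shorter than the whole compound."""
        n = len(parts)
        return ["".join(parts[i:j])
                for i in range(n)
                for j in range(i + 1, n + 1)
                if j - i < n]

    def constituents_X(parts, corpus_vocab=None):
        """Variant X: the longest multi-part constituents at the beginning
        and the end of the compound; with restriction rC only constituents
        that occur freely in the corpus are eligible."""
        n = len(parts)
        prefixes = ["".join(parts[:j]) for j in range(n - 1, 1, -1)]  # longest first
        suffixes = ["".join(parts[i:]) for i in range(1, n - 1)]
        if corpus_vocab is not None:                                  # rC restriction
            prefixes = [p for p in prefixes if p in corpus_vocab]
            suffixes = [s for s in suffixes if s in corpus_vocab]
        return prefixes[:1] + suffixes[:1]

    parts = ["hand", "ball", "tor", "wart"]
    print(constituents_x(parts))
    # -> ['hand', 'handball', 'handballtor', 'ball', 'balltor',
    #     'balltorwart', 'tor', 'torwart', 'wart']
    print(constituents_X(parts))
    # -> ['handballtor', 'balltorwart']
    print(constituents_X(parts, corpus_vocab={"handballtor", "torwart"}))
    # -> ['handballtor', 'torwart']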

Figure 6.3 reports the results of these compound refinement experiments. The benchmark is the second column, i. e. the results gained by simply splitting all compounds into their smallest parts (“split all”). It appears that systematic improvements are very difficult to achieve. Neither minimum length nor POS restrictions lead to improvements, nor do lexical restrictions help. Even special treatment for multi-part sub-compounds does not generally make much difference (an exception being the XrC variant on wiki). The only promising path seems to be setting upper limits for the document frequencies of each compound, omitting to split the most frequent ones. But where to set this maximum document frequency limit seems impossible to decide: for the springer and nzz sets low thresholds seem to be best, while amazon, sda and wiki fare better with an upper limit in the 100–200 band. Whether frequencies are measured in absolute or relative terms, no more precise conclusion can be drawn than that limiting document frequency is the most promising restriction method for compound splitting.

6.1.4 Conclusions

In a productive language such as German, making use of morphological information by means of splitting compounds has been shown to improve clustering results. Even here, however, differences among the individual data sets could be observed, and for amazon and wiki the effects were once more relatively small, often just balancing the loss in clustering accuracy that had been incurred by stopword removal. Various refinement techniques have been tested for compound splitting. Most of them have failed to achieve their aim; only skipping (very) frequent compounds has shown some promise.

In order to have comparable test situations for all data sets, our investigations have been based on the BOLstop model. However, since stopword removal had been shown to be rather harmful for the amazon and wiki sets, we repeated some of the compound splitting experiments for these two sets, but this time with stopwords retained. Surprisingly, the resulting scores turned out to be quite similar (in absolute terms) to those with stopword removal. In other words, if stopwords were retained in the document representation, compound splitting did not achieve much. If stopwords were omitted, compound splitting was more successful, but not really beyond the point which had already been reached by pure lemmatising.

[Figure 6.3 here. Entropy axes: springer 0.40–0.36, amazon 0.47–0.45, sda 0.42–0.38, wiki 0.42–0.38, nzz 0.355–0.345. Columns: BOL and BOLstop without splitting; “split all”; minimum word length l10–l25; maximum document frequency d1–d500; compound POS restrictions cN, cNA; constituent POS restrictions sSN, sSN-MN, sMN; lexical restrictions rT3, rT4, rC; multi-part constituent variants x, xrC, X, XrC.]

Figure 6.3: Modifications to compound splitting. On the whole the baseline BOLstop,split (second column) was difficult to beat. (For the underlying numbers refer to Table D.25.)

When the same experiments are repeated with subsets of the amazon and wiki data sets (i. e. considering only a part of all labels, as in Section 5.6—see there for further details of the setup), it transpires that at least with the amazon set compound resolution helps in a large majority of cases with up to 13 labels (both with and without stopwords). For the wiki set, on the other hand, the results remain without a clear interpretation (see the detailed results in the Appendix, Tables D.26 and D.27).

6.2 Using Syntactic Information

In this section we look at techniques using syntactic information to generate new features. Typically these new features take the form of a combination of existing features which are thought to belong together and to convey information not present in the individual parts. Non-linguistic literature often refers to all such multi-term sequences as “phrases”. We begin by looking at bigrams, the simplest such “phrases”, which do not require any syntactic knowledge (Section 6.2.1). Then we move on to multi-part names (Section 6.2.2) and noun phrases (Section 6.2.3). All “phrases” discussed in this section make use of context information, and thus they must be prepared at the vectorisation step (see Section 3.2.3). Each “phrase representation” was first evaluated on its own, and afterwards in combination with the BOLstop model. For the new features we experimented with weighting factors of 1, 2, 3 and 10.

6.2.1 Bigrams

Using simple bigrams as features, either in place of or in addition to normal terms, is a practice adopted relatively frequently (cf. Section 3.2.3.1). Here it mainly serves as a reference point for the methods discussed further below. In detail, we followed these steps to create the representation of each document:

1. Lemmatise all words (in the original order).

2. Remove all stopwords.

3. Build successive pairs of words (bigrams) not transgressing sentence boundaries.

4. Bring the members of each bigram into a standard order.3

Example (underlined words = words not on the stopword list):

“Wie macht man sich das Leben leicht? Man schlafe bis Mittag, nehme ein gesundes Mahl zu sich und halte darauf bis zur Bettgehzeit ein erfrischendes Nickerchen.” (How does one make life easy? Sleep until noon, take a wholesome meal, and then enjoy a refreshing nap until bedtime.)

→ (Leben, leicht) (Mittag, schlafen) (Mittag, gesund) (Mahl, gesund) (Mahl, halten) (Bettgehzeit, halten) (Bettgehzeit, erfrischend) (Nickerchen, erfrischend)
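A minimal sketch of steps 2–4, assuming the lemmatisation of step 1 has already been applied, reproduces this example (Python’s default string order places capital letters before lowercase letters, consistent with the sorting convention chosen here):

    def document_bigrams(sentences, stopwords):
        """Bigram features for one document; `sentences` is a list of
        sentences, each given as a list of lemmata (step 1)."""
        bigrams = []
        for sentence in sentences:
            content = [w for w in sentence if w not in stopwords]  # step 2
            for a, b in zip(content, content[1:]):                 # step 3
                bigrams.append(tuple(sorted((a, b))))              # step 4
        return bigrams

    stop = {"wie", "machen", "man", "sich", "das", "bis", "nehmen",
            "ein", "zu", "und", "darauf", "zur"}
    doc = [["wie", "machen", "man", "sich", "das", "Leben", "leicht"],
           ["man", "schlafen", "bis", "Mittag", "nehmen", "ein", "gesund",
            "Mahl", "zu", "sich", "und", "halten", "darauf", "bis", "zur",
            "Bettgehzeit", "ein", "erfrischend", "Nickerchen"]]
    print(document_bigrams(doc, stop))
    # -> [('Leben', 'leicht'), ('Mittag', 'schlafen'), ('Mittag', 'gesund'),
    #     ('Mahl', 'gesund'), ('Mahl', 'halten'), ('Bettgehzeit', 'halten'),
    #     ('Bettgehzeit', 'erfrischend'), ('Nickerchen', 'erfrischend')]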

3Alphabetical order was used, with capital letters before lowercase letters, though any consistent sorting principle would have done. A cursory experiment had revealed that omitting this step and keeping the original order (i. e. distinguishing between [A,B] and [B,A]) was unlikely to improve the results.

It is evident that not all bigrams reach the same level of meaningfulness. Table 6.3 lists the twenty most frequent such bigrams for each data set. It transpires, in particular when looking at the first two sets, that a few of the most frequent bigrams come from typical abstract expressions of either the language or the particular domain in general, without contributing to the description of a particular feature or subject matter (e. g. Buch/behandeln or Auflage/neu). In fact, these are “stop phrases” rather than subject-specific collocations. We also notice a few irregularities caused by imperfections in the document preparation process, such as “Systematik/ffc0c0” or “<br/clear” in the wiki set.4 In the amazon set a number of typical review sources show up which might have been skipped at the document cleansing step (“Neue Zürcher Zeitung”, “Perlentaucher Medien GmbH”, “Uni-Studentenrezension” and “Buch der 1000 Bücher”, the last of which immediately produces two “1000/Buch” tokens, of course).

Looking at the outcome of the bigram experiments (Figure 6.4), we find that despite significantly increasing the number of shared features compared to the BOLstop representation, bigrams on their own lead to semi-acceptable results in only one case (sda). Similarly, as additional features we find bigrams useful only for the sda set; in the other cases they tend to make clustering results worse rather than better.

6.2.2 Multi-part Names

We would expect proper names (of people, companies, products, places, countries and regions) to be useful features for clustering. A first step towards stressing the importance of names had been our POS experiments wherein individual nam tokens were given special treatment (Sections 5.5 and 5.7.1). That approach had proven only partially successful. It is natural to expect results to improve if we turn our attention from simple name tokens to multi-part names, i. e. names consisting of several parts (such as a first name and a family name). To this end we collected name sequences from the input text with the following greedy algorithm:

1. Collect nam, nam1 and nam2 tokens (including typical name particles such as “von”) as well as abbreviations.5

2. Stop at the first occurrence of a token belonging to other classes or of a punctuation mark (except commas6).

3. From the resulting name sequence eliminate:

   • foreign name particles (“of”, “the”, “and”, “de”, “le”, “la”, “with”, “van”, “et” and “San”);

   • German name particles (“von”, “der”, “zu”, “und”, “auf”, “zur”);

   • typical abbreviations (“Mrd.”, “Mio.”, “Tsd.”, “Fr.”, “Dr.”, “Prof.”, “Ing.”, “Jur.”, “Chr.”, “ca.”, “Nr.”, “Jh.” and all single letters).

4. Discard a sequence if it consists exclusively of abbreviations or lower-case letters.

5. Bring the tokens into standard (alphabetical) order and uniform capitalisation.7

4On the other hand, despite looking unusual, the combination “Koordinate/Äquinoktium” is a perfectly legitimate feature which occurs in numerous astronomical Wikipedia entries where stellar bodies are identified by their coordinates in space.
5I. e. sub tokens to which Gertwol had added an “ABK” (abbr.) tag.
6An extra experiment had shown that adding commas to the name delimiters did not make much difference.
7The latter seems recommendable because names and abbreviations are particularly prone to individual and inconsistent capitalisation practices.
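A schematic rendering of the greedy algorithm might look as follows. The (word, tag) token format and the “abk” tag are simplifying assumptions of this sketch, the abbreviation test uses a trailing full stop as a crude proxy, and single-token sequences are dropped since only multi-part names are of interest:

    FOREIGN_PARTICLES = {"of", "the", "and", "de", "le", "la", "with", "van", "et", "San"}
    GERMAN_PARTICLES = {"von", "der", "zu", "und", "auf", "zur"}
    ABBREVIATIONS = {"Mrd.", "Mio.", "Tsd.", "Fr.", "Dr.", "Prof.", "Ing.",
                     "Jur.", "Chr.", "ca.", "Nr.", "Jh."}
    NAME_TAGS = {"nam", "nam1", "nam2"}

    def multi_part_names(tokens):
        """Greedily collect multi-part names from a stream of (word, tag)
        pairs; the tag 'abk' stands in for Gertwol's ABK mark."""
        sequences, current = [], []
        for word, tag in tokens:
            if tag in NAME_TAGS or tag == "abk" or word in GERMAN_PARTICLES:
                current.append(word)                 # step 1
            elif word == ",":                        # commas do not delimit (step 2)
                continue
            else:                                    # step 2: sequence ends here
                if len(current) > 1:
                    sequences.append(current)
                current = []
        if len(current) > 1:
            sequences.append(current)

        names = []
        for seq in sequences:
            seq = [w for w in seq                    # step 3
                   if w not in FOREIGN_PARTICLES | GERMAN_PARTICLES | ABBREVIATIONS
                   and len(w.rstrip(".")) > 1]       # drop single letters
            # step 4: only abbreviations or lower-case material left?
            if len(seq) < 2 or all(w.islower() or w.endswith(".") for w in seq):
                continue
            # step 5: standard order and uniform capitalisation
            names.append(tuple(sorted(w.capitalize() for w in seq)))
        return names

    tokens = [("Gerhard", "nam1"), ("Schröder", "nam2"), ("sagte", "vrb"),
              ("Dr.", "abk"), ("Hans", "nam1"), ("Scharoun", "nam2")]
    print(multi_part_names(tokens))
    # -> [('Gerhard', 'Schröder'), ('Hans', 'Scharoun')]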

springer: Diagnostik/Therapie (79), Buch/behandeln (79), Buch/vermitteln (69), Grundlage/theoretisch (62), Entwicklung/neu (62), Buch/beschreiben (61), Buch/enthalten (58), Auflage/neu (55), Klinik/Praxis (54), umfassend/Überblick (52), aktuell/Überblick (48), Buch/umfassend (48), Erkenntnis/neu (47), Schwerpunkt/bilden (46), Buch/liefern (44), Darstellung/umfassend (43), Buch/richten (43), diagnostisch/therapeutisch (40), Abbildung/Tabelle (40), Buch/Leser (39)

amazon: Buch/lesen (2269), Geschichte/erzählen (1334), Zürcher/neu (1283), Zeitung/Zürcher (1280), Medium/Perlentaucher (1269), GmbH/Medium (1259), Jahr/alt (1204), -Studentenrezension/Uni (1041), Frau/jung (1030), 20./Jahrhundert (919), Buch/neu (882), Buch/empfehlen (868), Frau/Mann (804), Fall/jed (796), Buch/spannend (755), Buch/enthalten (749), 1000/Buch (702), Hand/legen (676), Zeitung/allgemein (665), kennen/lernen (657)

sda: Jahr/vergangen (3940), Franke/Million (2863), Nachrichtenagentur/sda (1852), Galle/St. (1735), Mitteilung/heißen (1697), Jahr/alt (1613), Mensch/töten (1591), Bush/George (1565), Woche/vergangen (1496), Mio/Prozent (1247), Kanton/Zürich (1201), Prozent/steigen (1113), Leben/Mensch (1095), George/US-Präsident (1091), Alte/Jahr (1091), Euro/Mrd. (1085), Ende/Jahr (1070), Galler/St. (1034), Bern/Kanton (1025), Jahr/Prozent (1013)

wiki: Systematik/ffc0c0 (2292), 19./Jahrhundert (1438), 20./Jahrhundert (1316), Bundestag/deutsch (943), Politiker/deutsch (859), Alte/Jahr (858), Rolle/spielen (805), Jahr/alt (789), Schriftsteller/deutsch (758), Koordinate/Äquinoktium (689), Bundesrepublik/Deutschland (638), Mitglied/deutsch (598), 18./Jahrhundert (597), 2000/Jahr (571), Jahr/lang (567), 2000/Äquinoktium (534), 1970er/Jahr (527), Spiel/olympisch (525), 1990er/Jahr (493), 1980er/Jahr (478)

nzz: Jahr/vergangen (2754), Franke/Million (2694), Jahr/alt (1412), Ende/Jahr (1402), Jahr/sechziger (1284), Rolle/spielen (1228), Staat/vereinigt (1192), Jahr/lang (1096), Dollar/Million (1081), Clinton/Präsident (1072), Woche/vergangen (1051), NZZ/Nr. (1008), Union/europäisch (1005), Frau/Mann (993), 19./Jahrhundert (991), Kanton/Zürich (972), Stadt/Zürich (841), Jahr/fünfziger (834), 20./Jahrhundert (809), Anfang/Jahr (787)

Table 6.3: The 20 most frequent bigrams in each data set (in parentheses: document frequencies).

[Figure 6.4 here. Entropy axes: springer 0.6–0.4, amazon 0.54–0.44, sda 0.45–0.38, wiki 0.48–0.38, nzz 0.405–0.345. Columns: BOLstop; shared bigrams (ordered); combined 1:1, 1:2, 1:3 and 1:10.]

Figure 6.4: Clustering with bigram features. “Combined 1:10” means that the frequencies for the bigrams were multiplied by ten (before tf-idf weighting). The length of the vertical lines refers, as before, to the number of non-zero elements. (For the underlying numbers refer to Table D.28.)

springer: Österreich Schweiz (12), Günther Winkler (9), Europa Usa (9), Eth Zürich (9), New York (7), Commerce Electronic (5), Aachen Rwth (5), Rudolf Schwarz (3), Japan Usa (3), Janeiro Rio (3), Hans Scharoun (3), Donald Knuth (3), Din En Iso (3), Diener Diener (3), Business Electronic (3), Angeles Los (3), Aldo Rossi (3), Re Swiß (2), New Paris York (2), München Tu (2)

amazon: New York (1009), Mb Ram (424), Science Vol. (353), Agatha Christie (345), Mann Thomas (298), Angeles Los (212), Grimm Wilhelm (160), Grimm Jacob (155), Hermann Hesse (148), Harry Potter (145), Fontane Theodor (145), Ost West (137), Luther Martin (119), Alfred Hitchcock (118), Franz Kafka (115), Christus Jesus (112), Adolf Hitler (110), Karl May (108), Astrid Lindgren (106), New Times York (100)

sda: New York (1592), Bush George (1533), Kanton Zürich (1054), Bern Kanton (860), Hussein Saddam (658), Annan Kofi (650), Ariel Scharon (604), Kanton Solothurn (578), Arafat Jassir (561), Gerhard Schröder (523), Galle Kanton (452), Chirac Jacques (451), Blair Tony (425), Berlusconi Silvio (415), El Sadr (391), Colin Powell (375), Hans-rudolf Merz (351), Kanton (336), Fdp Svp (276), Kanton Luzern (273)

wiki: New York (1558), Angeles Los (460), Christus Jesus (309), König Portugal (216), Friedrich Wilhelm (207), König Schweden (200), König Spanien (197), Immanuel Kant (195), Luther Martin (184), China Kaiser (175), Jersey New (171), Japan Kaiser (167), König Polen (166), Adolf Hitler (160), Dänemark König (157), Burundi König (157), Goethe Johann Wolfgang (152), Johannes Paul (146), Shakespeare William (144), Karl Marx (139)

nzz: New York (1511), Kanton Zürich (901), Angeles Los (384), New Times York (254), Bern Kanton (253), Boutros Ghali (237), Eth Zürich (236), Ost West (230), Janeiro Rio (209), Hussein Saddam (199), Aviv Tel (171), Aires Buenos (162), Berlusconi Silvio (142), Mann Thomas (141), Aargau Kanton (129), Paulo São (128), Johannes Paul (125), John Major (121), Felipe González (118), Amnesty International (114)

Table 6.4: The 20 most frequent multi-part names in each data set.

[Figure 6.5 here. Entropy axes: springer 0.6–0.4, amazon 0.54–0.44, sda 0.45–0.38, wiki 0.48–0.38, nzz 0.405–0.345. Columns: BOLstop; multi-part names; combined 1:1, 1:2, 1:3 and 1:10.]

Figure 6.5: Clustering with multi-part name features. Since multi-part names were rare (0.08–1.71% of the number of ordinary features), the representation relying solely on them (second column) was unsatisfactory (with 3,721–39,661 documents not being clustered at all). (For the underlying numbers refer to Table D.29.)

                             springer   amazon    sda   wiki    nzz
Fully parsed                      94%      91%    96%    68%    92%
Majority of text parsed            5%       7%     4%    24%   5.8%
Majority of text not parsed      0.5%     1.5%     0%     3%   1.5%
Not parsed at all                0.5%     0.5%     0%     5%   0.7%

Table 6.5: Degree of parsing success. Extent to which the documents in each data set were parsed using Gojol’s dependency parser.

Table 6.4 shows the twenty most frequent multi-part names found for each data set. We encounter mostly well-known and sensible named-entity concepts, even though the inclusion of abbreviations has produced features such as “MB RAM” and “Science Vol.”. Even though these do not fulfil our expectation of a proper name, there is still a strong link between the two parts of such pairs, suggesting that the combination is hardly harmful. Some further anomalies are caused by Gertwol tagging “Kanton”, “König” or “Kaiser” as proper names in some contexts, which at least in a narrow sense they are not. Again, however, the resulting features (e. g. “China/Kaiser”, “Kanton/Solothurn”) are often very sensible and convey extra information not expressed in the parts.

The outcome of the clustering experiment with multi-part names is shown in Figure 6.5. Evidently, multi-part names are far too rare to adequately represent the documents on their own, but as additional features they can perhaps prove useful, though the effect remains small. It emerges that the best results are achieved by giving extra, perhaps double, weight to the multi-part names, with the usual pattern repeating itself: springer, sda and nzz tend to profit from the extra features, whereas the opposite is true for amazon and wiki.

Turning again to the lists of the twenty most frequent names, readers acquainted with the respective corpora will easily recognise most of the names as valid features for all five data sets. However, it might be argued that although many of the names may be useful as cluster descriptors (cf. Section 3.6.2), they add only little information not already contained in the individual parts. For instance, the feature “Buenos Aires” will probably have almost the same distribution as “Buenos” and “Aires” alone, rendering the composite feature quite superfluous. Still, this does not explain why the results get almost immediately worse when these features are added to amazon and wiki.

We may conclude that, at least with our simple, domain-independent multi-part name recogniser, the positive impact of multi-part name features is minimal, the most suitable corpus being the nzz set. Perhaps a more sophisticated proper name recognition system might improve results, although a significant difference would appear rather unlikely to us.

6.2.3 Noun Phrases

We now turn to “real” phrases, i. e. phrases in a linguistic sense. For this purpose Vlad V. Gojol’s dependency parser was used (for a brief description see Clematide, 2002). Since the program proved not very stable with regard to irregularly formatted input, not all texts were parsed in full. Table 6.5 shows the extent to which each data set had been parsed. From the available parses we then extracted noun phrases (NPs) and standardised them as follows:

1. Resolve nested phrases. E. g. “Der lange Schatten des kleinen Mannes” is turned into three phrases: “Der lange Schatten”, “des kleinen Mannes” and the original “Der lange Schatten des kleinen Mannes”.

2. Keep nouns, names, adjectives and verbs (i. e. those words carrying Gojol’s N, V and ADJA tags). Discard all other tokens.

3. Discard all stopwords.

4. Stem all remaining terms and bring them into alphabetical order.8

5. Remove any phrases consisting of just a single word and eliminate duplicates arising from the same initial phrase.

Table 6.6 lists the most frequent noun phrases in each data set. We easily recognise quite a few typical and useful phrases, but also a preponderance of temporal specifiers of limited worth for clustering (“laufend/jahr”, “vergang/woch” etc.), which occur mostly in the news sets sda and nzz, but also in wiki. An extended, domain-specific stopword or “stop phrase” approach might be useful in order to avoid these phrases. However, in some situations they might actually be useful, so it is dangerous to drop them regardless of the context (corpus).

The clustering results shown in Figure 6.6 indicate that noun phrases can indeed be useful additions (for sda and in particular nzz). For the other sets the effect tends to be negative, however. Once more we find that adding useful extra information to the amazon and wiki BOL/BOLstop models is extremely hard.
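To make steps 2–5 of the standardisation concrete, here is a minimal sketch; the (word, tag) phrase format is an assumption, step 1 (resolving nested phrases) is taken to have happened upstream, and the stemmer is pluggable:

    def standardise_nps(phrases, stopwords, stem):
        """Standardise parsed noun phrases (steps 2-5 above); each phrase
        is a list of (word, tag) pairs carrying the parser's N, V and
        ADJA tags, nested phrases having been resolved beforehand."""
        features = set()
        for phrase in phrases:
            kept = [w for w, tag in phrase
                    if tag in {"N", "V", "ADJA"}             # step 2
                    and w.lower() not in stopwords]          # step 3
            stems = sorted(stem(w) for w in kept)            # step 4
            if len(stems) > 1:                               # step 5: no one-word phrases
                features.add(tuple(stems))                   # duplicates collapse in the set
        return features

    # e.g. with NLTK's German Snowball stemmer (an assumption, not the
    # stemmer actually used in the thesis):
    #   from nltk.stem.snowball import GermanStemmer
    #   nps = standardise_nps(parsed_phrases, stoplist, GermanStemmer().stem)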

6.2.4 Conclusions

Our three experiments in this section (bigrams, multi-part names and noun phrases) made it clear that enhancing the BOLstop model for clustering is not at all trivial. Using syntactic information to create new features mostly failed to bring special dividends, the only exception being the nzz set, where both multi-part names and noun phrases were found useful to a degree not encountered in the previous chapter. A natural hypothesis is that richer and longer texts (as in the nzz data set) lend themselves better to syntactic methods than shorter and less well structured texts, but the present basis of just five data sets does not allow us to draw any definite conclusions.

The usefulness of “phrases” as cluster descriptors has not been examined in this study. It would not be surprising, however, if multi-part names in particular turned out to be good cluster descriptors, even in situations where they fail to improve the actual clustering quality.

8For practical reasons this solution was preferred to the lemmata because the parser output does not correspond to the input text in strict linear order.

springer: grundlag/theoret (53), umfass/überblick (50), entwickl/neu (46), erkenntniss/neu (38), praxis/täglich (35), klinik/praxis (35), darstell/umfass (34), diagnost/therapi (32), aktuell/entwickl (32), buch/ziel (31), arbeit/täglich (30), maßstäb/neu (27), abbild/tabell (27), europä/union (26), deutsch/gesellschaft (26), anwend/praktisch (26), aufschlussreich/text (25), praxis/wissenschaft (24), geschicht/kult-objekt (24), form/funktion/markenkommunikation/maßstäb/neu (24)

amazon: medi/perlentauch (1265), neu/zeitung/zürcher (1222), studentenrezension/uni (985), frau/jung (751), buch/bücher (735), süddeutsch/zeitung (568), allgemein/zeitung (521), buch/neu (478), buch/gut (476), freier/schriftstell (457), ganz/welt (438), jung/mädchen (430), buch/end (414), deutsch/sprach (409), jung/mann (379), mb/ram (372), erwachs/kind (360), leb/neu (352), geschicht/spannend (342), neu/roman (329)

sda: jahr/vergang (3829), vergang/woch (1537), jahr/laufend (1156), euro/mrd (1019), frank/mrd (941), mio/prozent (935), dollar/mrd (857), europä/union (823), fr/mrd (746), monat/vergang (665), nation/vereint (664), kanton/zürich (657), anfrag/nachrichtenagentur (505), tag/vergang (503), verkehr/öffentlich (491), arme/israel (471), angeleg/auswärtig (451), schwer/verletz (435), europä/zentralbank (424), mio/verlust (420)

wiki: bundestag/deutsch (840), deutsch/polit (599), koordinat/rektaszension/äquinoktium (593), deutsch/schriftstell (529), hellig/scheinbar (514), gleich/jahr (507), olymp/spiel (505), 1970er/jahr (429), bundestag/deutsch/mitglied (415), 1990er/jahr (410), kathol/kirch (405), 1980er/jahr (397), end/jahrhundert (376), 1960er/jahr (362), dat/technisch (358), staat/vereinigt (353), gregorian/kalend (352), deutsch/schauspiel (349), gregorian/kalend/tag (333), deutsch/sprach (329)

nzz: jahr/vergang (2666), staat/vereinigt (1160), jahr/sechzig (1127), vergang/woch (1031), europä/union (973), jahr/laufend (893), nr/nzz (853), fünfzig/jahr (769), jahr/neunzig (682), nation/vereint (652), jahr/letzt (597), stadt/zürich (584), kanton/zürich (555), hand/öffentlich (508), bosnisch/serb (507), dreissig/jahr (491), kalt/krieg (489), ganz/land (480), halb/jahr (449), jahr/zwanzig (433)

Table 6.6: The 20 most frequent noun phrases in each data set.

[Figure 6.6 here. Entropy axes: springer 0.5–0.4, amazon 0.48–0.44, sda 0.45–0.38, wiki 0.42–0.38, nzz 0.365–0.345. Columns: BOLstop; noun phrases (NPs); combined 1:1, 1:2, 1:3 and 1:10.]

Figure 6.6: Clustering with noun phrases. Since shared noun phrases were relatively rare, between 424 and 6,312 documents could not be clustered in the second column of the figure; these documents had no NPs in common with other documents. (For the underlying numbers refer to Table D.30.)

6.3 Using Semantic Information

In a perfect world we would choose to represent each document not by its words and phrases, but by its content, i.e. in terms of semantic concepts. However, for open domains and documents without a uniform structure such a representation is unfeasible, at least for the time being. What we can do is try to make use of semantic knowledge that is typically stored in ontologies. This is not sufficient for a full-fledged semantic representation, but at least it allows us to recognise certain closely related concepts (words) and represent them as single features. Section 6.3.1 introduces the ontology used for our experiments, Section 6.3.2 deals with the problem of finding the right concept for each term and Section 6.3.3 describes the actual usage and the experimental results. Two further sections are devoted to refinements and a brief summary.

6.3.1 GermaNet

GermaNet (Hamp and Feldweg, 1997) is a lexical-semantic wordnet for German, developed at the University of Tübingen, Germany, and akin to the English WordNet (Miller et al., 1990). Sets of lemmatised synonyms and near-synonyms (so-called “synsets”) lie at the bottom of GermaNet. In addition to these synonym relations, a number of further semantic relations are modelled between the synsets:

Antonyms. The opposite of synonyms, e. g. “warm” and “kalt” (warm and cold).

Hyper- and hyponyms. Hierarchical relations between similar concepts of different degrees of specificity. E. g. “Hund” (dog) is a hyponym of “Haustier” (domestic animal) and the latter a hypernym of the former.

Mero- and holonyms. Relations between concepts of which one forms part of another. E. g. “Arm” is a holonym of “Hand” and the latter a meronym of the former.

Pertainyms. Relations between denominal adjectives and their nominal bases, deverbal nominalisations and their verbal bases, and deadjectival nominalisations and their adjectival bases. E. g. “fehlerlos” (flawless) and “Fehler” (flaw).

Entailment. Relations between two verbs of which one entails the other. In our version of GermaNet only a few examples were encoded. E. g. “erwarten” (to expect, in the sense of expecting a baby) entails “zeugen” (to beget).

Cause. The cause relation holds between verbs (114 cases) and between verbs and resulting adjectival states (95 cases). E. g. “einladen” (to invite) and “besuchen” (to visit) or “öffnen” (to open) and “offen” (open).9
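For the experiments that follow, it helps to picture a synset as a small record carrying its lemmata and these relations. A minimal data model might look like the following sketch (purely illustrative; it is not the format in which GermaNet is actually distributed):

    from dataclasses import dataclass, field

    @dataclass(eq=False)        # identity-based equality: the relation graph is cyclic
    class Synset:
        """Minimal stand-in for a GermaNet synset: a set of (near-)synonymous
        lemmata plus the relations described above, each a list of Synsets."""
        lemmas: set
        hypernyms: list = field(default_factory=list)
        hyponyms: list = field(default_factory=list)
        antonyms: list = field(default_factory=list)
        holonyms: list = field(default_factory=list)
        meronyms: list = field(default_factory=list)
        pertainyms: list = field(default_factory=list)

    # "Hund" (dog) is a hyponym of "Haustier" (domestic animal):
    haustier = Synset({"Haustier"})
    hund = Synset({"Hund"}, hypernyms=[haustier])
    haustier.hyponyms.append(hund)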

Table 6.7 reports the amount of data and relations available in GermaNet, version 4.0. In order to achieve balanced and logical hypernym hierarchies, a number of artificial concepts were introduced, such as “angestellter Mensch” (employed person), for which no proper and exact German term exists. No global hyponym hierarchy exists; instead, there are several independent sub-hierarchies (which can also contain cross-references). Besides, a number of short phrases have also been included in GermaNet; we have not tried to make use of these in our experiments.

9However, it appears to us that many of the relations between verbs are of limited usefulness. E. g. “zersumpfen” → “sumpfen”, “zerdreschen” → “dreschen”, “bremsen, zügeln” → “verlangsamen” or, indeed, “leeren” → “leeren”.

                              nouns   adjectives   verbs    total
Single terms                 34,501        6,886   7,742   49,129
Short phrases                 2,110           80     203    2,393
Artificial concepts             212          116      94      422

Relationships among terms and phrases:
Synsets                      27,241        5,106   8,733   41,080
Antonym pairs                   605          500     235    1,340
Hyper-/hyponym relations     30,075        4,997   9,202   44,274
Mero-/holonym relations       3,914            —       —    3,914
Pertainyms                        8        1,515     132    1,655
Entailment relations              —            —       7        7
Cause relations                   —            —     209      209

Table 6.7: GermaNet 4.0 statistics. The two adverbs forming a separate class have been omitted. Of the nouns, 1,712 represent proper names. “Short phrases” describes entries consisting of more than a single term (e. g. “wilde Tulpe”).

It should be noted that a single term can be part of several synsets, depending on its different meanings. The peak of polysemy is reached by the 26 senses encoded for “halten” (to hold), followed by “kommen” (to come) with 18 senses. However, the large majority of terms belong to only one (90%) or two (7%) synsets.

6.3.2 Word Sense Disambiguation

For the ten percent of words that can have multiple senses, the classical word sense disambiguation problem arises. Despite substantial efforts (see, for instance, Schütze, 1998; Merlo et al., 2003; Stevenson, 2003) it still poses many difficulties. The two main approaches to tackling the problem are based on co-occurrence statistics and on knowledge sources (Leacock et al., 1998).

For our purposes (distinguishing between multiple synset candidates) we must rely on the information available from GermaNet. For each sense of a given polysemous word in GermaNet we extract a “support set”. The support set consists of all terms connected with that sense, i. e. all members of the synset in question as well as all its antonyms, holonyms, hypernyms, meronyms, etc. If a polysemous word occurred in a document, we then looked at each possible sense and calculated its support by counting the occurrences of the support set members in that document. In case of a tie, the sense with the lower number as encoded in GermaNet is used.

The sizes of the support sets may vary substantially for different senses, thus giving certain senses more opportunities for “scoring”. However, based on the assumption that better connected synsets are also more frequent and thus more likely to correspond to the senses actually sought, this inequality was not considered a serious flaw. While this method seems to work to a certain degree, it should also be mentioned that often the support sets remain empty for all candidates. In the absence of large, GermaNet-annotated corpora we had to be satisfied with this rather basic disambiguation method.
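Using the Synset sketch from Section 6.3.1, the support-set disambiguation can be rendered schematically as follows (an illustration of the idea, not the exact implementation used here):

    from collections import Counter

    def support_set(synset):
        """All lemmata connected with one sense: the synset members plus
        the members of every directly related synset."""
        related = (synset.hypernyms + synset.hyponyms + synset.antonyms +
                   synset.holonyms + synset.meronyms + synset.pertainyms)
        terms = set(synset.lemmas)
        for other in related:
            terms |= other.lemmas
        return terms

    def disambiguate(word, senses, document_lemmas):
        """Choose the sense whose support set occurs most often in the
        document; ties (including all-zero support) fall back to the
        lowest-numbered sense, here simply the first in the list."""
        counts = Counter(document_lemmas)
        best, best_score = senses[0], -1
        for sense in senses:                       # list order = GermaNet numbering
            score = sum(counts[t] for t in support_set(sense) if t != word)
            if score > best_score:                 # strict '>': earlier sense wins ties
                best, best_score = sense, score
        return best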

6.3.3 Semantic Mapping

For our document clustering experiments, we tried to exploit the ontological information provided by GermaNet in different ways. At the core of all experiments lie the synsets, with different mapping variants being characterised by different uses of the additional relationships outlined in Section 6.3.1 above and by different disambiguation strategies:

as: Map each lemma to all its synsets (where available).

fs: Map each lemma to its first synset (as defined in GermaNet).10

fs+cep: Map each lemma to its first synset, but replace all synsets in a pertainym, entailment or cause relationship by the corresponding more general synset.

fs+o: Map each lemma to its first synset and merge opposites, i. e. synsets in an antonym relation with one another.

fs+hy[n]: Map each lemma to its first synset. Then replace all “leaves” in the hyper-/hyponym hierarchies by the next higher synset(s) (i. e. replace all synsets with no hyponyms by their hypernyms). Execute this step n times. The intention is to use only those relations that occur at the bottom of the hierarchies, where terms have a specific nature. Clifton et al. (2004) aim for a similar effect by traversing the hierarchies top-down and retaining just those relations that occur at level five or higher.

fs+ho: Map each lemma to its first synset. For all synsets with a holonym, add the synset frequency value to that holonym.

ds: Disambiguate word senses with support sets (cf. Section 6.3.2) and choose the most likely synset for each lemma.
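As an illustration of two central variants, the following sketch shows the fs mapping and the leaf-lifting step behind fs+hy[n]; representing the lexicon as a plain dictionary and reusing the Synset class from above are assumptions of this sketch:

    def fs_mapping(lemma, germanet):
        """fs: map a lemma to its first synset; lemmata unknown to
        GermaNet are kept as plain features."""
        senses = germanet.get(lemma)           # assumed: dict lemma -> [Synset, ...]
        return senses[0] if senses else lemma

    def lift_leaves(features, n=1):
        """fs+hy[n]: replace synsets without hyponyms ("leaves") by their
        hypernym(s); executed n times."""
        for _ in range(n):
            lifted = []
            for f in features:
                if isinstance(f, Synset) and not f.hyponyms and f.hypernyms:
                    lifted.extend(f.hypernyms)
                else:
                    lifted.append(f)
            features = lifted
        return features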

The results (Figure 6.7) do not suggest real improvements in clustering quality with any of the methods tested. In particular, using all senses of a given word (as) makes the results significantly worse. Using hypernym relations (hy) is occasionally successful, but it is not clear under which circumstances, and there is a danger of the results turning worse as well. See, for instance, the widely varying results of the springer set with different integration levels (hy[1], hy[2], hy[3]).11

In order to better understand the effects of sense mapping, Tables 6.8 and 6.9 list the twenty most frequent synsets for each data set under model fs. At least from an intuitive point of view, the large majority of these most frequent “unifications” do not seem helpful for our purposes (i. e. they do not help distinguish actual document topics). On the contrary, they tend to increase similarity between documents based on the use of more or less functional words (for instance, “Thema”/“Schwerpunkt”). Since the distribution of such expressions often has little in common with that of the main topics in a collection, these synsets must have a rather detrimental effect on the clustering results. We also note a number of synsets which a proper word sense disambiguation would avoid (e. g. “Vater”/“Herr”, “Fall”/“Deklination”) and synsets of doubtful value such as “Antrag”/“Angebot”. On the other hand, pairs such as “Team”/“Mannschaft”, “Uni”/“Universität” and “festnehmen”/“verhaften” are very likely to increase clustering performance.

10Unfortunately, unlike in WordNet, the different senses in GermaNet are not ordered by frequency. This leads to first senses such as “god” for “father”, “Jesus” for “son”, and so on. Sticking to the first synset available is thus less satisfactory than in the study of Clifton et al. (2004), who exploit the fact that in the English WordNet the senses are ordered by frequency and that in about 80% of all cases the most frequent synset is indeed the one actually sought.

11Cf. also the hy/N experiments in Table D.32 in the Appendix, which confirm the volatility of the hypernym experiments.

[Figure 6.7 here. Entropy axes: springer 0.5–0.4, amazon 0.48–0.44, sda 0.45–0.38, wiki 0.42–0.38, nzz 0.365–0.345. Columns: BOLstop plus synonyms: none, as, fs, fs+cep, fs+o, fs+hy[1], fs+hy[2], fs+hy[3], fs+ho, ds.]

Figure 6.7: Clustering with synsets. (For the underlying numbers refer to Table D.31.)

springer: [thema(442) / schwerpunkt(258) / motto(4)]; [bereich(387) / gebiet(204)]; [ueberblick(359) / uebersicht(124)]; [wissen(202) / kenntnis(113)]; [oekonomisch(125) / wirtschaftlich(111)]; [praktisch(405) / praxisorientiert(104) / praxisbezogen(93)]; [vorstellen(180) / praesentieren(84)]; [diskutieren(170) / eroertern(76) / besprechen(27)]; [notwendig(176) / erforderlich(68) / unerlaesslich(22)]; [methode(428) / vorgehen(65) / vorgehensweise(64)]; [nutzen(67) / nutzung(64) / benutzung(7)]; [durchfuehrung(81) / realisierung(58) / verwirklichung(6)]; [ermoeglichen(218) / erlauben(54)]; [experte(88) / fachmann(53) / expertin(2)]; [mensch(122) / person(51) / persoenlichkeit(16) / individuum(8)]; [diagnose(72) / befund(46)]; [autor(642) / autorin(45) / verfasser(30) / verfasserin(2)]; [fortschritt(64) / weiterentwicklung(45)]; [veraenderung(89) / aenderung(43)]; [fachgebiet(76) / fachrichtung(41) / fachbereich(9) / sachgebiet(2)]

amazon: [autor(10788) / autorin(4120) / verfasser(667) / verfasserin(104)]; [mensch(9235) / person(2480) / persoenlichkeit(1140) / individuum(203)]; [freund(5891) / freundin(2104) / liebhaber(567) / liebhaberin(24)]; [geschichte(15837) / vergangenheit(2055) / historie(199)]; [ende(6377) / schluss(1826) / endpunkt(29) / schlusspunkt(20) / aus(3)]; [vorstellen(2493) / praesentieren(1612)]; [bereich(2414) / gebiet(1553)]; [anfang(3124) / beginn(1362) / anbeginn(36)]; [raum(1484) / platz(1318)]; [vater(3748) / herr(1303)]; [thema(6943) / schwerpunkt(1069) / motto(368)]; [wissen(2160) / kenntnis(1055)]; [ausgabe(1859) / edition(1054)]; [uni(1117) / universitaet(1049)]; [wirklichkeit(1196) / realitaet(993)]; [faszinierend(2022) / fesselnd(989) / begeisternd(43)]; [darstellung(3045) / schilderung(948)]; [rat(923) / tip(845) / ratschlag(38)]; [tat(934) / verbrechen(745) / straftat(22) / delikt(8)]; [eindrucksvoll(868) / beeindruckend(742)]

sda: [mensch(8505) / person(5971) / persoenlichkeit(370) / individuum(55)]; [bereich(3535) / gebiet(1956)]; [anfang(4362) / beginn(1888) / anbeginn(3)]; [antrag(1777) / angebot(1679) / offerte(143) / avance(28)]; [erhoehen(3051) / steigern(1576)]; [brand(1589) / feuer(1376)]; [platz(1799) / raum(1361)]; [ende(9285) / schluss(1243) / schlusspunkt(36) / endpunkt(3)]; [ermoeglichen(1298) / erlauben(1201)]; [sprecher(4291) / sprecherin(1189)]; [ausloesen(1253) / verursachen(1038) / hervorrufen(88)]; [stelle(2711) / position(1015) / anstellung(113) / posten(23)]; [protest(1015) / beschwerde(1009) / beanstandung(56) / reklamation(28) / anfechtung(13)]; [praesentieren(1565) / vorstellen(990)]; [besuch(1320) / besucher(957) / besucherin(295)]; [sitzung(986) / konferenz(932)]; [festnehmen(2236) / verhaften(909) / greifen(850) / inhaftieren(255) / arretieren(3)]; [anstieg(1087) / zunahme(849) / ansteigen(15)]; [einwohner(1004) / bewohner(818) / einwohnerin(113) / bewohnerin(109)]; [einschaetzung(991) / schaetzung(748)]

Table 6.8: The 20 most frequently used synsets (under the fs mapping) for each data set (ordered by the occurrences of the second most frequent member in each synset).

wiki: [mensch(5207) / person(2824) / persoenlichkeit(593) / individuum(479)]; [bereich(3461) / gebiet(2543)]; [anfang(3007) / beginn(1978) / anbeginn(41)]; [platz(2059) / raum(1942)]; [zahl(2405) / anzahl(1899)]; [eigenschaft(2376) / merkmal(1509) / attribut(186)]; [ermoeglichen(2206) / erlauben(1465)]; [benutzen(2372) / brauchen(1314) / gebrauchen(115) / benuetzen(66)]; [vater(2481) / herr(1220)]; [stelle(2074) / position(1206) / anstellung(118) / posten(8)]; [herstellung(1171) / produktion(1027)]; [schauspieler(1286) / schauspielerin(920)]; [zeichen(1150) / symbol(881)]; [ueberwiegend(1121) / vorwiegend(856)]; [veraenderung(1028) / aenderung(836)]; [notwendig(1692) / erforderlich(821) / unerlaesslich(60)]; [mannschaft(817) / team(804)]; [fall(4064) / deklination(801) / kasus(38)]; [bekannt(6777) / populaer(789)]; [wirkung(1410) / effekt(786)]

nzz: [mensch(6283) / person(5693) / persoenlichkeit(1237) / individuum(537)]; [anfang(6259) / beginn(4130) / anbeginn(66)]; [bereich(5491) / gebiet(4106)]; [platz(4664) / raum(4023)]; [stelle(4143) / position(3230) / anstellung(179) / posten(55)]; [ende(11872) / schluss(2861) / schlusspunkt(106) / endpunkt(47)]; [geschichte(5626) / vergangenheit(2780) / historie(171)]; [erlauben(2618) / ermoeglichen(2533)]; [praesentieren(3071) / vorstellen(2241)]; [besucher(2203) / besuch(2121) / besucherin(92)]; [gegensatz(3221) / gegenteil(1721)]; [kontakt(2140) / umgang(1623)]; [aenderung(1756) / veraenderung(1607)]; [aktivitaet(1650) / taetigkeit(1592)]; [team(2253) / mannschaft(1586)]; [traditionell(3098) / konservativ(1568) / restaurativ(9)]; [verstaerken(1958) / heben(1539) / mehren(303) / vermehren(155)]; [einwohner(1625) / bewohner(1523) / bewohnerin(63) / einwohnerin(24)]; [vater(2096) / herr(1491)]; [realitaet(1592) / wirklichkeit(1509)]

Table 6.9: The 20 most frequently used synsets (under the fs mapping) for each data set (ordered by the occurrences of the second most frequent member in each synset).

6.3.4 Restrictions on Semantic Mapping

From the above observations it can be concluded that semantic unification is helpful in some cases and harmful in others. This section describes a few attempts to narrow the selection down to “useful” synsets. Apart from the semantic relationships in which they take part (such as holonymy), synsets can be characterised by their POS classes, by the number of their members, by their frequencies, by the frequencies of the individual terms and by the polysemy of the terms belonging to the synset. Apart from the number of members in a synset, all of these properties could provide clues for a successful sub-selection of all synsets. Four restriction techniques were thus tested:

[N|A|V]: restrict synset mapping to certain POS categories (assumption: not all word classes are equally good candidates for synset unification),

a[n]: restrict synset mapping to terms taking part in at most n synsets (since with too many options, word sense disambiguation is less likely to work and over-generalisation is likely to creep in),

df[i‰]: restrict synset mapping to terms whose document frequency does not exceed a certain limit (in analogy to compound splitting, we are more interested in bringing comparatively rare features together),

t[i‰]: restrict synset mapping to synsets whose document frequency does not exceed a certain limit.

In addition we tried to transfer the concept of stopwords to synsets, identifying “stop synsets” from the other four data sets according to the E′′ measure (cf. Section 5.4.2):

su[n]: exclude stop synsets, gained by combining the top n stop candidates from the other four data sets (according to the E′′ measure),

si[n]: exclude stop synsets, gained by intersecting the top n stop candidates from the other four data sets (according to the E′′ measure).

Finally, for the hypernym experiments (hy[n]) we tried a restriction similar to that of Clifton et al. (2004):

m[n]: restrict synset mapping to integrated synsets (i. e. synsets and their hypernyms) with not more than n members (synonyms and hyponyms), an attempt to avoid over-generalisation.
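Most of these restrictions reduce to a simple admission test per lemma before the mapping is applied; the following sketch combines several of them (all parameter names and the doc_freq callback are illustrative choices of ours):

    def may_map(lemma, senses, doc_freq, a_max=None, df_max=None,
                pos=None, allowed_pos=None, stop_synsets=()):
        """Combine the restriction criteria: a[n] (maximum polysemy),
        df (maximum term document frequency), POS class, and the
        su/si "stop synset" exclusion."""
        if a_max is not None and len(senses) > a_max:
            return False          # too polysemous to map reliably
        if df_max is not None and doc_freq(lemma) > df_max:
            return False          # frequent terms are good features as they are
        if allowed_pos is not None and pos not in allowed_pos:
            return False
        if senses and senses[0] in stop_synsets:
            return False          # exclude "stop synsets"
        return True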

The results of a few selected such experiments can be found in Figure 6.8, while the complete results are given in the Appendix (Table D.32). A certain tendency towards improvement can be observed, but there is no systematic pattern. Setting a limit on the document frequency of the partaking terms (df) appears to be the most promising strategy, relatively speaking. For the springer and wiki sets the “stop synset” approach (su/si) also appears capable of improving results, and for springer and sda the upper limit for synset frequency (t). All in all, however, none of the restrictions promises a reliable separation of the good from the bad.

[Figure 6.8 here. Entropy axes: springer 0.45–0.40, amazon 0.47–0.45, sda 0.43–0.40, wiki 0.42–0.38, nzz 0.355–0.345. Columns: BOLstop plus synsets: none; fs; fs N; fs A; fs V; fs a[1]; fs a[10]; fs df[2]; fs df[10]; fs df[50]; fs t[2]; fs t[10]; fs t[50]; fs su[50]; fs su[250]; fs si[100]; fs si[500].]

Figure 6.8: Clustering with refined synset selection (selected results; for the complete list see Table D.32 in the Appendix).

6.3.5 Conclusions

Using semantic information (in the form of the GermaNet ontology) has, after all, failed to fulfil our expectations. Only in a few exceptional cases did the clustering results improve through the use of synsets, whereas in several other cases the results declined. Attempts to filter out those synsets that blur the distinctions between different topics (rather than accentuating similarities between related documents) were equally fruitless. It appears that general ontologies contain too much information on unspecific concepts, which outweighs the positive effects gained by identifying semantically close topical terms. Further research might try to overcome this problem, but we remain sceptical. A few complementary experiments with “selected” synsets (i. e. synsets which were chosen either manually or according to one of the measures used for stopword extraction) confirmed the difficulties; even under such “artificial” circumstances the results did not differ markedly from those reported here.

6.4 Summary of Enhancement Experiments and NLP

We have examined techniques to enhance and improve document representations on three different levels: morphological, syntactic and semantic. Success varied markedly.

Working with semantic representations has proved to be the most difficult approach, and at least with our present tools it has proved a failure. Further research in that direction will have to concentrate on the difficult distinction between useful and useless semantic concepts (in terms of topic characterisation). The verdict for multi-word features based on syntactic analysis is not too different. However, positive tendencies could be observed with the nzz set, prompting us to hypothesise that the usefulness of such methods may depend on the length and richness of the individual texts. Finally, morphological analysis for compound splitting was found to be by far the most effective document enhancement method, with often large positive effects on clustering results. Of course, the usefulness of morphological analysis is connected to the tendency of a language to build compounds and inflect word forms. To what extent the success can be transferred to languages other than German must remain open.

Throughout this chapter we have, again, observed that the different data sets show different behaviour and that results of coarse clustering tasks (with few clusters) are more volatile and more easily improved than the fine-grained clustering tasks of the amazon and wiki data sets with their 21 and 22 labels, respectively. Concrete reasons for this rather counter-intuitive fact have not been found.

Chapter 7

Combining Document Representation Techniques

In the present chapter I shall consider the part which crossing plays in two opposed directions: firstly, in obliterating characters, and consequently in preventing the formation of new races; and secondly, in the modification of old races, or in the formation of new and intermediate races, by a combination of characters. I shall also show that certain characters are incapable of fusion.

Charles Darwin (The Variation of Animals and Plants under Domestication, 1868)

Although it is a long, long way from evolutionary theory to document clustering, there are some parallels. As we have already seen in Chapter 2, cluster analysis has its roots in biological classification tasks, and both evolution and clustering can be viewed as optimisation processes working with masses of individuals with large numbers of features. And just as in evolutionary theory the question arises whether two individuals can match and produce an offspring that cultivates the best features of both, the same can be asked of document representation techniques: which can be combined to improve clustering results, and which fail to work together?

The present chapter tries to answer this question by examining a few selected feature representation techniques, aiming to integrate the major findings of the previous two chapters and concentrating specifically on the natural language processing techniques. After the difficulties already encountered with various document processing techniques, and the sometimes inconsistent behaviour of one and the same method (remember, for instance, the inconclusive evidence for stopword removal, which proved beneficial for some data sets and harmful for others), expectations must naturally not be set too high.


7.1 Combining Matrix Reduction Techniques

From the reduction techniques (Chapter 5) we selected four (six variants in all) which had shown promising results if used on their own:

- Stopword removal with our standard stoplist;

- POS selection, with two variants: once with sub and adj (= pos1) and once with sub, adj and namall (= pos2);

- POS weighting, with the same two variants (wgt1 = sub/adj and wgt2 = sub/adj/namall) and weighting factor 1.5;

- Pruning, where individual “best” parameters were used. Concretely: global pruning for springer (0.05–10%), sda (0.005–1%) and nzz (0.01–10%) and local pruning for amazon (α = 150) and wiki (α = 100);¹ both strategies are sketched below.
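The two pruning strategies amount to a few lines of code. The following is a minimal sketch only (hypothetical helper functions, not the implementation used in the experiments), assuming a dense document-term matrix:

```python
import numpy as np

def global_prune(X, low=0.05, high=10.0):
    """Drop features (columns) whose document frequency lies below `low`
    or above `high` percent of all documents."""
    df = (X > 0).sum(axis=0) / X.shape[0] * 100.0   # document frequency in %
    keep = (df >= low) & (df <= high)
    return X[:, keep]

def local_prune(X, alpha=150):
    """Keep only the `alpha` highest-weighted features of each document."""
    X = X.copy()
    for row in X:                                   # one document per row
        if np.count_nonzero(row) > alpha:
            cutoff = np.sort(row)[-alpha]           # alpha-th largest weight
            row[row < cutoff] = 0.0                 # zero out the rest
    return X
```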

The results with different combinations of these techniques are given in Figure 7.1. The emerging picture is as diverse as that of the individual techniques: stopword removal remained good for sda and harmful for amazon and wiki, with relatively little impact on springer and nzz. Pruning led to a similar picture, though the negative impact on amazon and wiki remained very small, while the effects on the nzz set went in both directions. POS selection proved useful for springer, but showed negative tendencies for amazon, wiki and nzz, whereas for sda it did not produce the same good results in combination as when used as the only reduction technique. POS weighting led to worse results for amazon and wiki, whereas the positive effects on springer and sda remained small. Only the nzz set seemed to profit consistently from POS weighting.

If we compare these inconclusive verdicts with the results of the techniques used individually, we can at least note that the combined effects correspond more or less to the individual effects. For instance, for the amazon set we had found that most techniques on their own worsened clustering quality; their combination seems to have an even stronger negative effect. For wiki, pruning was the best method on its own, and in the combinations it is also the one that performs best, relatively speaking. For springer, POS selection was the best individual technique and remained so in the combinations, whereas for nzz the same can be said of POS weighting. Only the relative disappointment of POS selection in combination with pruning or stopword removal on the sda set is rather unexpected.

We thus conclude that the reduction techniques behave more or less rationally. Techniques that have been found good for a given set individually are also likely to produce satisfactory results in combination. On the other hand, techniques that were unsuccessful on their own seldom gain by being combined with other reduction techniques.

¹Of course, in combination with other reduction techniques the optimal parameters for pruning might vary, but a brief trial showed that the original values as used here do not perform too badly.

[Figure 7.1 (y-axis: entropy, per data set) compares the individual reduction techniques (BOL alone, prune, stop, wgt1 (sub, adj), wgt2 (sub, adj, namall), pos1 (sub, adj), pos2 (sub, adj, namall)) with their combinations (prune + stop; pos1/pos2 and wgt1/wgt2, each combined with stop, prune, and stop + prune) for all five data sets.]

Figure 7.1: Combining reduction techniques. (For the underlying numbers refer to Table D.33.)

7.2 Combining Matrix Enhancement Techniques

From among the representation enhancement methods (Chapter 6), we chose three for further combining:

- Compound splitting, though for simplicity’s sake we did not make use of any of the further modifications discussed in Section 6.1.3;

- Multi-part proper names, with a weighting factor of two;

- Noun phrases, also with a weighting factor of two.

The results depicted in Figure 7.2 contain few surprises. Combining techniques that improved the BOLstop results tended to lead to equally good or better results, while the inclusion of disadvantageous techniques also tended to have a negative impact on the combined results. In particular, compound splitting was quite often better off without the addition of names or NPs, even though on their own these may have been useful. On the whole the results confirm our findings of Chapter 6, and they encourage combining enhancement techniques that are really successful. At the same time, the springer and sda results caution against combining a very successful technique (compound splitting) with an only moderately successful technique (names, or names and NPs, respectively). In these cases, relying on the main technique (compound splitting) appears to promise better results.

Finally, the very bad result on the nzz set arising from compound splitting combined with NPs ought to be singled out; it is all the more surprising as both techniques were quite successful on their own. A closer inspection of the individual experimental runs behind this average entropy of 0.359 reveals that the ten individual results fall into two groups. In six cases, the “AUSL” documents, which are usually almost all grouped together, are split into two big groups, leading to a final entropy value of 0.367 (see the sample confusion matrix in Table 7.1). In the other four cases the usual picture emerges (with only “FEUI” split up and “INLA” and “ZURI” being grouped together as before), with values of 0.346 to 0.349. It must be left open to speculation why the clustering process is steered in that unfavourable direction as often as six times out of ten. If compound splitting and NP addition are considered individually, the phenomenon does not occur in a single case out of twenty.²

This tendency to produce outliers was also demonstrated in a different experiment (see Figure 7.3). Here sub-samples of various sizes (from 10 to 90 percent of each data set) were clustered. While the springer and amazon sets show a more or less consistent entropy reduction with larger data sets, and sda and wiki show relatively stable behaviour (with a slight tendency in the opposite direction), the nzz series is by far the least consistent, with various up and down turns. A closer inspection shows that many individual cluster results show either a “split” of the FEUI (Feuilleton) or AUSL (Ausland) clusters, while INLA (Inland) and ZURI (Zürich) regularly join each other.

²The same anomaly occurred in the later combinational experiments reported in Figure 7.4/Table D.35, by the way.

[Figure 7.2 (y-axis: entropy, per data set) compares the individual enhancement techniques (BOLstop alone, split, names, NPs) with their combinations (split + names, split + NPs, names + NPs, split + names + NPs), each applied on top of BOLstop, for all five data sets.]

Figure 7.2: Combining matrix enhancement techniques. (For the underlying numbers refer to Table D.34.)

        FEUI  AUSL  INLA  SPOR  ZURI  VERM  WIRT
C1      6842    92   106    12   374   400    22
C2       156  4779    92     4    54   724   130
C3        64  3183   144     1     5    45    13
C4        45    55  2986    28  1861   223   257
C5         4     0     0  3153    24    47     0
C6      1297   511   274    30   540  3906    21
C7        12    62   143     1    41    35  3063

Table 7.1: nzz confusion matrix with the category “AUSL” being split (into clusters C2 and C3), leading to a significant entropy increase to 0.367.
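For illustration, entropy values such as the one cited for Table 7.1 can be recomputed from a confusion matrix. The sketch below assumes the common definition of clustering entropy (per-cluster entropy of the category distribution, weighted by cluster size and normalised by the logarithm of the number of categories); the exact formula used throughout this thesis is the one defined in Chapter 2, whose normalisation may differ.

```python
import numpy as np

def clustering_entropy(M):
    """M: confusion matrix, one cluster per row, one category per column."""
    M = np.asarray(M, dtype=float)
    n = M.sum()
    total = 0.0
    for row in M:
        n_j = row.sum()                 # cluster size
        p = row[row > 0] / n_j          # category distribution within cluster
        e_j = -(p * np.log(p)).sum() / np.log(M.shape[1])
        total += (n_j / n) * e_j        # weight by cluster size
    return total
```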

[Figure 7.3 (y-axis: entropy, per data set) shows, for each of the methods BOLstop alone, split, names, NPs, split/names, split/NPs, names/NPs and split/names/NPs, a mini-series of entropy values over growing sub-sample sizes.]

Figure 7.3: Sub-sampling of matrix enhancement techniques. For each method ten sub-samples of 10, 20, ... 80, 90 percent of all documents were taken and clustered. The ten data points per data set and method shown in the diagram are averaged entropy values for all nine sub-sample sizes (10 to 90%), with the full-set value (100%) added at the right end of each mini-series. Only springer and amazon show a consistent improvement in clustering quality with a growing amount of data available.

7.3 Combining Matrix Reduction and Enhancement Techniques

In the third step of our combination experiment, we tried to combine techniques that had been identified as potentially useful in the previous two sections, namely: pruning, stopword removal, POS feature weighting, compound splitting and noun phrase identification. Figure 7.4 reports on the outcome. By comparison with the results from individual use and from the prior combinations of the various techniques, we can draw a number of tentative conclusions:

Pruning is able to reduce matrix size considerably, but it often leads to a deterioration of the clustering results. The pruning parameters, which were chosen individually for each data set (cf. Section 7.1), seem to be very sensitive to the actual circumstances, and even within the same data set good parameters are not easily transferable. Deciding on which pruning parameters to use is therefore most difficult.

Stopword removal has here been shown to be a more reliable means of matrix reduction. In combination with the other techniques, stopword removal tended to improve results for springer, sda and wiki. For amazon and nzz it seems at least better than the pruning technique, even though the quality of the results is not greatly influenced by stopword removal.

POS weighting, which had been promising on several occasions when used on its own, did not combine so well with the other techniques; the sda, wiki and nzz results were better without it.

Noun phrases (NPs) had been a successful addition for the nzz data set, but did not profit from combinations with other techniques. In general, the addition of NPs did not have much extra impact, although its positive effects could still be felt in the sda experiments. On the nzz set, however, the combination of NPs and compound splitting unexpectedly induced the clustering algorithm to produce a much inferior clustering with great regularity (in fact, the same effect that has been discussed in the previous section).

Compound splitting was found essential for springer and generally useful for the sda and wiki data. For amazon it was beneficial only if stopword removal had been performed, while for nzz it was less successful in combination with other techniques than these were on their own. It is also noteworthy that compound splitting profits relatively often from prior stopword removal.

[Figure 7.4 (y-axis: entropy, per data set) compares the individual techniques (BOL; BOL + prune, stop, wgt1, wgt2, split, NPs) with combinations of split, wgt1/wgt2, prune and NPs, each applied once on top of BOL and once on top of BOLstop, for all five data sets.]

Figure 7.4: Combining matrix enhancement and reduction techniques. (For the underlying numbers refer to Table D.35.)

7.4 Conclusions

Our experiments have shown that on the whole, though not always, the effects of combining different techniques can be roughly gauged by looking at the individual results. In particular, we have found a case (compound splitting and NPs for nzz) where a significant deterioration was caused by the combination of two individually successful techniques. Combinations thus carry a risk of evoking unwanted side-effects, and it would therefore seem advisable not to combine too many techniques.

Of the two main matrix reduction techniques, we have found stopword removal to be more reliable than pruning, since the latter depends strongly on the choice of suitable parameters. These parameters may change if several techniques are combined, and it is thus practically very difficult to exhaust the potential previously shown to be inherent in straight pruning.

Of the enhancement techniques, the usefulness of compound splitting has been confirmed. Moreover, it seems advisable to combine it with stopword removal. Additional multi-word features such as phrases and names did not have much impact in the combination experiments.

Tentatively, and based on the results of this and earlier chapters, we can conclude that the following document representation techniques were most promising (under a combined qualitative/quantitative view):

- springer: stopword removal and compound splitting, (POS selection);
- amazon: stopword removal and compound splitting;
- sda: stopword removal and compound splitting, (NPs);
- wiki: stopword removal and compound splitting;
- nzz: (stopword removal), POS weighting, NPs, names.

As an additional illustration, we combined these techniques and chose the individually “optimal” parameters (as found earlier) for each data set.³ The results (Table 7.2) correspond more or less to our expectations. Again we find, however, that the combination of two “good” techniques does not automatically improve clustering results. In particular, compound splitting turns out counter-productive if combined with the best possible stopword removal results for both amazon and wiki (as opposed to the case with the standard stopword set, where splitting had proved useful). For three of the five data sets (springer, sda, nzz) clear improvements over the baselines can be shown by using linguistic methods. For the other two, the improvements remained smaller (relatively speaking) and, given that methods and parameters were selected ex post, not too conclusive.

³The parameters were chosen as follows:

                  springer               amazon      sda         wiki                 nzz
stop (extracted)  E′′:U-1000             WKL:U-100   E′′:U-1000  E′′:U-100 / df:U-50  E:U-1000
split             df 15                  df 200      df 150      normal               —
POS selection     sub, adj, namall, unk  —           —           —                    —
POS weighting     —                      —           —           —                    2×: sub, adj, namall
Names (weight)    —                      —           —           —                    2×
NPs (weight)      —                      —           2×          —                    2×

springer
  Baseline 1                    0.410[0.010]
  Baseline 2                    0.408[0.004]
  BOL                           0.421[0.015]
  BOLstop                       0.422[0.018]
  stop*                         0.407[0.020]
  stop* + split*                0.355[0.003]
  stop* + split* + pos*         0.399[0.019]
  stop* + pos* + split*         0.360[0.011]

amazon
  Baseline 1                    0.468[0.006]
  Baseline 2                    0.458[0.009]
  BOL                           0.448[0.007]
  BOLstop                       0.463[0.005]
  stop*                         0.446[0.007]
  stop* + split*                0.458[0.007]

sda
  Baseline 1                    0.431[0.002]
  Baseline 2                    0.416[0.019]
  BOL                           0.456[0.001]
  BOLstop                       0.421[0.001]
  stop*                         0.373[0.008]
  stop* + split*                0.372[0.009]
  stop* + split* + NPs*         0.373[0.013]

wiki
  Baseline 1                    0.416[0.019]
  Baseline 2                    0.386[0.010]
  BOL                           0.388[0.007]
  BOLstop                       0.408[0.019]
  stop1*                        0.384[0.011]
  stop1* + split*               0.389[0.006]
  stop2*                        0.381[0.007]
  stop2* + split*               0.402[0.012]

nzz
  Baseline 1                    0.348[0.002]
  Baseline 2                    0.345[0.001]
  BOL                           0.349[0.005]
  BOLstop                       0.349[0.005]
  stop*                         0.344[0.001]
  stop* + wgt*                  0.340[0.004]
  stop* + wgt* + NPs            0.339[0.003]
  stop* + wgt* + NPs + names    0.338[0.003]

Table 7.2: Results with “optimal” (ex-post) document representation. Stars indicate the use of individually chosen parameters as listed in footnote 3.

Chapter 8

Summary and Outlook

Harp not on that: nor do not banish reason For inequality; but let your reason serve To make the truth appear where it seems hid And hide the false seems true.

William Shakespeare (Measure for Measure, ca. 1604)

In this final chapter we present a brief résumé of our experimental findings, putting them into relation with the different clustering scenarios introduced in Section 3.5, and trying to draw conclusions and recommendations from them (Section 8.1). We then move on to a discussion of our main hypothesis and the benefits gained from natural language processing (Section 8.2), before concluding with a short outlook on future research areas (Section 8.3).

8.1 Document Representation Techniques for Clustering

In the following few paragraphs we try to summarise the experimental findings of this study. Our main goal was to establish the usefulness of natural language processing (NLP) for a particular task, i. e. clustering German-language documents. Inevitably, though, our experiments led us to examine a number of non-linguistic issues as well, with sometimes interesting consequences. In Section 3.5 we had introduced a model of four different clustering scenarios: off-line, repeated, ad-hoc and instant clustering, each of which operates with its own specific time and memory constraints. A proper assessment of document representation techniques needs to take these aspects into account, in addition to clustering quality.

IDF squared. Our first major discovery (Section 4.4.2) concerns the overall (global) weighting scheme. We found that the customary idf scheme performed markedly worse than the squared variant idf² on at least four of the five German data sets. idf² has so far received very little attention in the information retrieval literature but, following this study, probably deserves to be investigated more thoroughly. The large step forward in clustering quality and the small cost involved suggest that this method should be strongly considered in all clustering applications, even if matrix assembly is time-critical, as in ad-hoc and instant clustering.
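A sketch of what the idf² variant amounts to in practice (raw term frequencies tf and a hypothetical weight_matrix helper; the exact local tf variants used in the experiments are those of Chapter 4):

```python
import numpy as np

def weight_matrix(tf, squared=False):
    """tf: (n_docs, n_terms) raw term-frequency matrix."""
    n_docs = tf.shape[0]
    df = (tf > 0).sum(axis=0)                  # document frequency per term
    idf = np.log(n_docs / np.maximum(df, 1))   # customary idf
    if squared:
        idf = idf ** 2                         # the idf^2 variant
    return tf * idf                            # broadcast over columns
```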

183 184 CHAPTER 8. SUMMARY AND OUTLOOK

Lemmatising. Unless motivated by additional processing steps later on (which require lemmata as input), we have found no evidence that lemmatising justifies the considerable extra effort it requires in comparison with simple, crude stemming (Section 5.2). For applications with a time-critical document vectorisation stage (instant clustering), lemmatising is thus not recommended.
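“Simple, crude stemming” can be as little as rule-based suffix stripping; the following sketch is purely illustrative (a hypothetical suffix list, not the stemmer used in the experiments):

```python
# Strip a common German suffix if the remaining stem is long enough.
GERMAN_SUFFIXES = ("ungen", "heit", "keit", "en", "er", "es", "e", "n", "s")

def crude_stem(word, min_len=4):
    for suffix in sorted(GERMAN_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            return word[: -len(suffix)]
    return word
```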

Pruning. We observed that in some (but not all) cases “pruning” (removal of information based only on statistics, i. e. occurrence frequencies) is not only a means of reducing the complexity of the clustering task but can also improve cluster results (Section 5.3). In fact, in some cases it was possible to remove up to 80% of the data (non-zero elements) without seriously affecting the outcome of the clustering algorithm. It also transpired, however, that the effects of pruning differ widely between the individual data sets. In particular, it is difficult to determine the cut-off points for pruning; they seem to vary strongly as well. Generally speaking, the results gave the impression that for tasks with few clusters a generous global pruning strategy (based on document frequencies) was applicable, whereas for the more delicate tasks (with many clusters) a local pruning strategy (individually for each document) seemed preferable. The number of data sets was insufficient, however, to draw firm conclusions.

Stopword removal. Removing stopwords from documents is an old technique requiring no linguistic skills either (though creating the stopword list may do). Our combination experiments (Chapter 7) showed that stopword removal was a more reliable way of reducing matrix size than pruning. On the other hand, quite contrary to intuition and general consensus, we had to note that stopword removal is not universally recommendable: for two of our five data sets it seems that stopword removal hampered the clustering process (Section 5.4). We discussed the possibility that the particular stopword list, which is comparatively large, may have been responsible for this result, but found that this was not so. We further examined a number of statistical techniques for creating stopword lists from labelled document collections. It emerged that, using appropriate discrimination measures and parameters, it is quite possible to come up with suitable stopword lists by an automated process. However, the phenomenon of certain cluster results being virtually impossible to improve through stopword removal persisted. Based on our experiments, it is difficult to give a definite recommendation with regard to stopword removal. It remains a prime tool for quick and simple matrix size reduction at the vectorisation stage, with relatively little risk of going badly astray. Nevertheless, it has been shown that so-called stopwords should not be removed mindlessly and are sometimes better retained. Furthermore, we found evidence that the stopword question is more important for short documents than for longer ones.
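The generic idea behind such automated extraction can be sketched as follows: terms that are frequent overall and spread evenly across the categories of a labelled collection (i. e. have a high class entropy) make good stopword candidates. This is only an illustration of the principle; the concrete discrimination measures evaluated in the thesis (e.g. the E′′- and WKL-based rankings mentioned in footnote 3 of Chapter 7) are defined in Chapter 5.

```python
import numpy as np

def stopword_candidates(term_class_counts, terms, top_k=1000):
    """term_class_counts: (n_terms, n_classes) occurrence counts per term."""
    counts = np.asarray(term_class_counts, dtype=float)
    totals = counts.sum(axis=1)
    p = counts / np.maximum(totals, 1.0)[:, None]   # class distribution per term
    safe_p = np.where(p > 0, p, 1.0)                # avoid log(0)
    entropy = -(p * np.log(safe_p)).sum(axis=1)     # high = evenly spread
    score = entropy * np.log1p(totals)              # evenly spread AND frequent
    ranked = np.argsort(-score)
    return [terms[i] for i in ranked[:top_k]]
```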

POS selection. Selecting features based on their POS tags revealed that nouns and adjectives were the most important words, followed by proper names, while verbs had little impact (Section 5.5). The qualitative consequences of POS restriction varied between the data sets. In comparison with pruning it was found that POS selection was relatively reliable, leading to better results than the pruning methods at comparable reduction factors. From a qualitative point of view it is not possible to recommend POS selection for all situations. From a quantitative point of view, however, it is a promising means of reducing matrix size considerably (by between 40 and 50 percent), with little risk of losing vital information. Like stopword removal, POS selection can already be performed at the vectorisation stage and is thus a good candidate for speeding up ad-hoc clustering systems.
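As a vectorisation-stage filter, POS selection amounts to little more than the following sketch (tag names as in the pos1/pos2 variants of Section 7.1; the tagger itself, and the example tag "vrb", are assumed):

```python
KEEP_TAGS = {"sub", "adj"}            # pos1; add "namall" for pos2

def pos_filter(tagged_tokens, keep=KEEP_TAGS):
    """tagged_tokens: iterable of (lemma, pos_tag) pairs."""
    return [lemma for lemma, tag in tagged_tokens if tag in keep]

# e.g. pos_filter([("Haus", "sub"), ("schoen", "adj"), ("laufen", "vrb")])
# -> ["Haus", "schoen"]
```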

POS/stopword weighting. As an alternative to the removal of stopwords and “stop POS tags”, we examined the possibility of down-weighting them (Section 5.7.1). Of course, this precludes any gains in speed or memory requirements. Since the qualitative improvements were also questionable, this smoothing alternative to removal was found unsatisfactory.

Compound splitting. One of the most interesting investigations concerned the morphological analysis of lemmata and the subsequent splitting of compounds into their constituents. Since German is very rich in compounds (30–45% of all word types in our data), the effect could be expected to be considerable. In order to avoid too many new and irrelevant features, it was found useful to combine compound splitting with stopword removal. Compound splitting then indeed proved a very powerful tool, improving the results for the majority of data sets by a large margin (Section 6.1). Yet the success was not as uniform as might have been expected. With two sets, the addition of compound splitting was only successful if stopword removal had been performed before (Section 7.3). However, stopword removal having proven disadvantageous for these same two sets in the first place, the combined effect of stopword removal and compound splitting was only about enough to cancel out the disadvantage of using stopword removal (nor was compound splitting without stopword removal any better). Thus, our data only partially supports the opinion that compound splitting is absolutely compulsory for a productive language such as German, since its effect cannot be separated entirely from stopword removal. Our attempts to refine compound splitting (to cases where it was really useful) were not crowned with much success. We only found some evidence that it might pay off to leave highly frequent compound terms intact. We also found that in our experiments (unlike in those of Rosell, 2003) it was advisable to keep the compounds as features even after splitting. The morphological analysis necessary for compound splitting can be performed at the vectorisation stage, which means that it is suitable for all clustering types except instant clustering.
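A minimal sketch of this policy, assuming some morphological analyser split_compound (the role played by Gertwol in the experiments) and a hypothetical document-frequency threshold for leaving frequent compounds intact:

```python
def expand_compounds(lemmas, split_compound, doc_freq, df_threshold=150):
    """Add compound constituents as extra features, keeping the compound."""
    features = []
    for lemma in lemmas:
        features.append(lemma)                      # keep the compound itself
        if doc_freq.get(lemma, 0) <= df_threshold:  # leave frequent compounds intact
            parts = split_compound(lemma)           # e.g. "Haustuer" -> ["Haus", "Tuer"]
            if len(parts) > 1:
                features.extend(parts)
    return features
```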

Multi-part names and NPs. Using syntactic information to create composite features (names and noun phrases) brought only partial success (Section 6.2). The dividends were biggest with the nzz data set. As this happened to be by far the data set with the longest texts (and perhaps the richest language), we hypothesize that there is a connection between the richness of a document and the benefits that can be gained by calculating such extra features. Nevertheless, the overall results left some doubts as to whether the extra effort was justified; further investigations will be necessary. These features, too, can be calculated at the vectorisation stage. It should be kept in mind, however, that syntactic parsing in particular (necessary for NP extraction) is very time-consuming, and storing all phrases for each document increases memory requirements by a substantial factor.

Semantics. Experiments using a German ontology (GermaNet) could not be brought to success (Section 6.3). In our opinion this failure is not so much due to the quality or extent of the ontology, but mainly due to the difficulty of distinguishing between concept relations that were succinct and relevant to our specific purposes and relations that were too general. On top of this comes the word sense disambiguation problem, though it may be secondary to the problem of selecting just the “good” parts of the ontology. Based on our experiments, we cannot recommend the use of a general ontology without important methodological improvements.

Cluster granularity. We have observed in various instances that refinement and pruning techniques that were successful for the 5- and 7-class data sets had no or a negative impact on the 21- and 22-class data. We also observed that the same data sets behaved quite “normally” when only subsets with fewer classes were considered (cf. Sections 5.6 and 6.1.4). Our hypothesis from these (few) observations is that “crude” clustering tasks with relatively few clusters are more likely to profit from document representation refinement than the more complex tasks. In the latter case the equilibrium appears to be more sensitive to changes, and the benefits gained in one area of the clustering landscape are more likely to be outweighed by erroneous re-adjustments of cluster frontiers in another area. So far, this is only a working hypothesis based on a very small number of samples, and future dedicated experiments are necessary to throw more light on the phenomenon.

8.2 The Case of Natural Language Processing

The primary goal of this study was to test the hypothesis that natural language processing (NLP) tools are able to improve the input to document clustering algorithms. Perusal of our findings in the preceding section reveals that there can be no simple answer to this question. For several of the techniques under examination we found partial, sometimes very strong evidence in favour of NLP. But we also noted a significant number of exceptions, and so the decision will have to depend on the available resources and the concrete goals of a clustering application.

Of the NLP applications, we have found compound splitting (necessitating proper morphological analysis and lemmatising) as well as noun phrase extraction (in suitably rich corpora) to be serious candidates for improving clustering quality. For complexity reduction we have found part-of-speech filtering to be a reliable means of excluding much superfluous information.

Compared to the actual clustering process, the document preparation step is time-consuming, in particular if it involves NLP. For instance, morphological analysis with Gertwol took almost 20 times longer for the springer set than clustering the springer data. For instant clustering purposes, NLP techniques are thus hardly an option unless the number of documents is very small. With most of the NLP taking place at the vectorisation stage, it is, however, equally well-suited to off-line, repeated and ad-hoc clustering. Given texts of sufficient length, the use of the above-mentioned NLP techniques can thus be recommended.

Finally, it should be noted that the main focus of our experiments was on clustering quality, with a secondary focus on matrix size (which is related to clustering speed as well as storage requirements). We did not, however, include the aspect of cluster description (“cluster digest”) in our investigations at all. Such considerations will no doubt move the balance somewhat in favour of the linguistic methods, which offer additional options for coming up with humanly understandable, “good” cluster descriptions.

8.3 Future Research

The present study, with its five relatively large data sets, has shown that linguistic methods bear the potential for qualitative improvements in a typical number-crunching application such as document clustering. On the other hand, it was also shown that there is no guarantee of such improvements. One goal of future research must be to investigate more precisely the circumstances that favour the use of linguistic tools. Our hypothesis from the present experiments is that the length and richness of the texts as well as the number of clusters play a role, but the evidence needs to be broadened. We worked with five data sets in this study; in order to gain statistically reliable and significant results, 30 or more data sets of similar size would be desirable.

On the technical level, we consider a more thorough investigation of ontology usage necessary and possibly quite rewarding. We have been unable in this study to come up with a suitable algorithm for separating useful from useless ontology information, but still feel that improvements should be possible, even for unstructured and basically unlimited application areas (as was the case with our five data sets).

The nature of stopwords has been shown to be a topic deserving additional investigation. In particular, it would be desirable to have more knowledge about situations where stopword removal can be harmful, a possibility hitherto often neglected. Besides, it would be desirable to subject our automatic stopword identification techniques to larger-scale evaluations and to different IR applications. The same applies to the idf² weighting method: it needs further practical evaluation, but has made a very promising start.

As regards document clustering itself, the choice of document representation techniques, clustering algorithm and cluster description method will often depend on the final purpose. More research is still needed to identify, classify and characterise typical applications. As a start, we have put forth a scheme of four typical scenarios (from off-line to instant clustering). It is also worthwhile to investigate the potential of clustering applications in a more global research and information-seeking context, from which new requirements and adaptations of existing algorithms may be derived. In particular, we see a promising area in the exploration of “dynamic” clustering algorithms which allow the user to interact actively with the clustering system and to influence the clustering results directly; for instance, by tuning certain weighting functions (e. g. for proper names), defining “stopwords” on the fly or manipulating the cluster structure by causing the system to merge, split or dissolve certain clusters.

Overall, there can be little doubt that, as a means of accessing large and as yet unstructured bodies of textual data, document clustering will remain a promising instrument in the information scientist’s toolbox.

Appendix A

Glossary

Terms within double-quotes were introduced in the context of the present thesis.

Bahuvrihi. An exocentric compound, i. e. a compound whose meaning is not contained in any of the constituents (e.g. redskin).

BIRCH. Balanced Iterative Reducing and Clustering using Hierarchies. A single-pass hierarchical clustering algorithm for large data sets (Zhang et al., 1996).

BOL. Bag-Of-Lemmata. An alternative to the bag-of-words model (→ BOW), based on lemmata instead of word forms.

BOW. Bag-Of-Words. A document model which represents a document by the sum of its word tokens without preserving the initial word order.

Categorisation/Classification. The task of assigning similar objects to predefined cate- gories/groups.

CLARA. Clustering LArge Applications. A medoid-based clustering algorithm.

CLARANS. Clustering Large Applications based on RANdomized Search. Another medoid-based clustering algorithm.

Cluster Analysis/Clustering. The task of grouping similar objects together based on their properties without a priori assumptions about the evolving groups.

Cluto. CLUstering TOolkit. A powerful clustering software developed at the University of Minnesota (Karypis, 2003).

Compound. A word made up from two or more other words.

CURE. Clustering Using REpresentatives. A clustering algorithm suggested by Guha et al. (1998).

Data Mining. The task of extracting information from large bodies of data with automated tools.

Document Frequency. The document frequency of a term is the number of documents in a collection that share that term.


EM. Expectation-Maximisation. A probabilistic maximum-likelihood algorithm for clustering (Dempster et al., 1977).

Fugenlaut. A letter working as “glue” in a German compound (e. g. the letter “n” in “Fuge-n- laut”).

GermaNet. A German ontology built on the same principles as → WordNet.

HAC. Hierarchical Agglomerative Clustering. The class of clustering algorithms building a bottom-up hierarchy of clusters.

HDC. Hierarchical Divisive Clustering. The class of clustering algorithms building a top-down hierarchy of clusters.

HTML. HyperText Markup Language. A set of formatting commands used for documents on the Internet.

Hyperlink. An interactive reference between documents on the → WWW.

Hypernym. A superordinate word for a given word.

Hyponym. A subordinate word for a given word.

IR. Information Retrieval. The discipline concerned with the storage and retrieval of informa- tion (most often in the form of documents).

ISBN. International Standard Book Numbering. A system assigning unique identifiers to books.

LSA. Latent Semantic Analysis. See → LSI.

LSI. Latent Semantic Indexing. An algebraic matrix complexity and noise reduction technique using → SVD. It does not contain any semantics in a narrower sense, but aims to reduce a feature matrix to the principal “concepts”.

LZW Algorithm. A data compression algorithm introduced by Ziv and Lempel (1977) and Welch (1984).

“Mechanical Compound.” A → compound that can be recognised without linguistic analysis as it combines the constituents by a hyphen or a slash.

Meta Search Engine. Search engine that does not use a data repository (index) of its own, but gathers search results from multiple other search engines and presents an integrated view of the combined results.

NLP. Natural Language Processing. An umbrella term for all algorithms, models, data and theoretical background used for the automated analysis of electronic textual data involving knowledge about natural languages.

OHSUMED. A large database of labelled medical abstracts often used in → IR.

“Organic Compound.” A → compound that can only be recognised and split up with the help of a morphological analysis.

PAM. Partitioning Around Medoids. A medoid-based clustering algorithm (cf. Kaufman and Rousseeuw, 1990).

PDDP. Principal Direction Divisive Partitioning. An efficient clustering algorithm for con- structing a cluster hierarchy “top-down” (Moore et al., 1997).

PDF. Portable Document Format. A popular document storage format ensuring an identical, platform-independent rendering of the content.

POS. Part-Of-Speech. The grammatical function of a word in a sentence.

PostScript. A popular document format for printing purposes.

ROCK. A clustering algorithm for categorical data as opposed to numeric data (Guha et al., 2003).

SGML. Standard Generalized Markup Language. A meta-language for document annotation.

“Shared Feature.” A feature that occurs in more than one of the documents to be clustered. The opposite is a → “unique feature”.

Snippet. A small preview of a document in a search engine results page: usually one or two “most relevant” passages from the document.

SOM. Self-Organising Map. A classification technology derived from neural network research (a sub-domain of Artificial Intelligence).

Stemming. The process of truncating words at the end by removing letters and transforming them according to a few pre-defined, language-dependent rules (but without recourse to a lexicon).

Stopword. A functional word bearing little or no actual content.

“Stopwordliness.” The degree to which a term acts as a → stopword.

SVD. Singular Value Decomposition. An algebraic technique very similar to principal component analysis, used to reduce a matrix to its main components.

Synonym. A word with an (almost) identical meaning as another given word.

Tagging. The process of annotating a text with meta-data, for instance with → POS tags.

Text Data Mining. The sub-discipline of → data mining concerned with automatically extracting information from large bodies of textual data.

Tokenisation. The process of recognising word boundaries and splitting a text string into a sequence of individual entities (usually words).

TREC. Text REtrieval Conference. An annual international conference for text and infor- mation retrieval, organised by the American NIST (National Institute of Standards and Technology).

Truncation. The process of truncating words at the end by removing a fixed number of letters.

“Unique Feature.” A feature that occurs in only one of the documents to be clustered. The opposite is a → “shared feature”.

UNL. Universal Networking Language. An ambitious project aiming at a universal semantic representation of concepts and texts.

UPGMA. Unweighted Pairwise Group Method with Averages. The most popular → HAC criterion.

UPGMC. Unweighted Pairwise Group Method with Centroids. An occasionally used → HAC criterion.

URL. Uniform Resource Locator. A unique and universal address for a document or resource on the → WWW.

Vectorisation. The process of analysing a document and mapping it to a feature vector.

Web. Short for World Wide Web (→ WWW).

WordNet. A popular general ontology for English introduced by Miller et al. (1990).

WPGMA. Weighted Pairwise Group Method with Averages. A rarely used → HAC criterion.

WPGMC. Weighted Pairwise Group Method with Centroids. An → HAC criterion with little practical importance.

WWW. World Wide Web. The entire collection of static and dynamic documents and data linked together on the Internet.

XML. EXtensible Markup Language. A popular annotation language derived from → SGML.

Appendix B

Proofs

The proofs given in this Appendix partially overlap and refine those given by Zhao and Karypis (2001).

B.1 Equivalence of Euclidean and Cosine Similarity

Here follows the proof of the equivalence of cosine similarity and Euclidean similarity, provided that the document vectors all have unit length. Equation 2.11 gave the following formula, which shall now be proven:

$$\hat{s}_{\mathrm{Euclid}}(d_i, d_j) = \sqrt{2 - 2\,s_{\mathrm{Cosine}}(d_i, d_j)}, \qquad \text{with } \|d_i\|_2 = \|d_j\|_2 = 1. \tag{B.1}$$

By definition (2.4)

$$\hat{s}_{\mathrm{Euclid}}(d_i, d_j) = \sqrt{\sum_{k=1}^{m} (d_{ik} - d_{jk})^2} \tag{B.2}$$

$$= \sqrt{\sum_{k=1}^{m} d_{ik}^2 + \sum_{k=1}^{m} d_{jk}^2 - 2\sum_{k=1}^{m} d_{ik} d_{jk}}\,, \tag{B.3}$$

and because of the document vectors having unit length, this can be simplified to

$$\hat{s}_{\mathrm{Euclid}}(d_i, d_j) = \sqrt{2 - 2\cos(d_i, d_j)}. \tag{B.4}$$

□
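The identity is easily checked numerically, for instance:

```python
import numpy as np

rng = np.random.default_rng(0)
d_i, d_j = rng.normal(size=5), rng.normal(size=5)
d_i, d_j = d_i / np.linalg.norm(d_i), d_j / np.linalg.norm(d_j)  # unit length

euclid = np.linalg.norm(d_i - d_j)
cosine = float(d_i @ d_j)
assert np.isclose(euclid, np.sqrt(2 - 2 * cosine))
```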


B.2 Minimum Variance Simplification

In order to prove Equation 2.23, let $r^c_{1\cup 2}$ be the centroid of the joint cluster $C_1 \cup C_2$. Since

$$\sum_{d_j \in C_1} \|d_j - r^c_{1\cup 2}\|_2^2 = \sum_{d_j \in C_1} \|d_j - r^c_1\|_2^2 + n_1 \|r^c_1 - r^c_{1\cup 2}\|_2^2, \tag{B.5}$$

we can rewrite Equation 2.21 as follows

$$\hat{S}(C_1, C_2) = \mathrm{ESS}(C_1 \cup C_2) - \mathrm{ESS}(C_1) - \mathrm{ESS}(C_2) \tag{B.6}$$

$$= n_1 \|r^c_1 - r^c_{1\cup 2}\|_2^2 + \sum_{d_j \in C_1} \|d_j - r^c_1\|_2^2 + n_2 \|r^c_{1\cup 2} - r^c_2\|_2^2 + \sum_{d_j \in C_2} \|d_j - r^c_2\|_2^2 - \sum_{d_j \in C_1} \|d_j - r^c_1\|_2^2 - \sum_{d_j \in C_2} \|d_j - r^c_2\|_2^2 \tag{B.7}$$

$$= n_1 \|r^c_1 - r^c_{1\cup 2}\|_2^2 + n_2 \|r^c_{1\cup 2} - r^c_2\|_2^2. \tag{B.8}$$

Since

$$r^c_{1\cup 2} = \frac{n_1 r^c_1 + n_2 r^c_2}{n_1 + n_2}, \tag{B.9}$$

Equation B.8 can be further simplified to

$$\hat{S}(C_1, C_2) = \frac{\|r^c_1 - r^c_2\|_2^2}{\frac{1}{n_1} + \frac{1}{n_2}}. \tag{B.10}$$

□
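Equation B.10 can likewise be verified numerically: the increase in the error sum of squares caused by merging two clusters equals the scaled squared distance between their centroids.

```python
import numpy as np

def ess(X):
    """Error sum of squares of a cluster (rows = documents)."""
    return ((X - X.mean(axis=0)) ** 2).sum()

rng = np.random.default_rng(1)
C1, C2 = rng.normal(size=(4, 3)), rng.normal(size=(6, 3))
n1, n2 = len(C1), len(C2)

lhs = ess(np.vstack([C1, C2])) - ess(C1) - ess(C2)
rhs = np.linalg.norm(C1.mean(axis=0) - C2.mean(axis=0)) ** 2 / (1 / n1 + 1 / n2)
assert np.isclose(lhs, rhs)
```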

B.3 Comparative Examination of Internal Cluster Criteria

In this section we examine the different internal clustering criteria (see Section 2.3.1.1) that can be expressed according to Equation 2.25 as

$$E(C_i) = w\,\frac{S}{a}. \tag{B.11}$$

Again we assume that all documents are normalised to unit length:

$$\forall\, d \in \mathcal{D}: \|d\| = 1. \tag{B.12}$$

Furthermore, let $y_i$ again be the composite vector of cluster $C_i$ (Eq. 2.28):

$$y_i = \sum_{d_j \in C_i} d_j, \tag{B.13}$$

with the length of the composite vector

$$\|y\| = \sqrt{y^T y} = \sqrt{\sum_{i=1}^{|C|} \sum_{j=1}^{|C|} d_i^T d_j}\,. \tag{B.14}$$

We then examine the function $\Psi(\mathcal{C}) = \sum E(C_i)$ for the following different choices of $S$:

- Case A: $\sum\sum s_{\mathrm{Cosine}}(d_i, d_j)$;

- Case B: $\sum\sum s_{\mathrm{Euclid}}(d_i, d_j)^2$;

- Case C: $\sum s_{\mathrm{Cosine}}(d, r^c)$;

- Case D: $\sum s_{\mathrm{Cosine}}(d, \hat{r}^c)$;

- Case E: $\sum s_{\mathrm{Euclid}}(d, r^c)^2$;

- Case F: $\sum s_{\mathrm{Euclid}}(d, \hat{r}^c)^2$.

(For the definitions of $r^c$ and $\hat{r}^c$ refer to Equations 2.13 and 2.14.)

Case A

$$S(C) = \sum_{i=1}^{|C|} \sum_{j=1}^{|C|} \cos(d_i, d_j) = \sum_{i=1}^{|C|} \sum_{j=1}^{|C|} d_i^T d_j = \|y\|^2. \tag{B.15}$$

Case B

$$S(C) = \sum_{i=1}^{|C|} \sum_{j=1}^{|C|} \|d_i - d_j\|^2 = \sum_{i=1}^{|C|} \sum_{j=1}^{|C|} (d_i - d_j)^T (d_i - d_j) = \sum_{i=1}^{|C|} \sum_{j=1}^{|C|} \left( d_i^T d_i + d_j^T d_j - 2\, d_i^T d_j \right) = 2|C|^2 - 2\|y\|^2. \tag{B.16}$$

Case C

$$S(C) = \sum_{i=1}^{|C|} \cos(d_i, r^c) = \sum_{i=1}^{|C|} \frac{d_i^T \cdot y/|C|}{\|d_i\| \cdot \|y\|/|C|} = \frac{\|y\|^2/|C|}{\|y\|/|C|} = \|y\|. \tag{B.17}$$

Case D

Since the cosine does not depend on vector length, this measure is obviously the same as that in case C:

$$S(C) = \sum_{i=1}^{|C|} \cos(d_i, \hat{r}^c) = \sum_{i=1}^{|C|} d_i^T \cdot \frac{y}{\|y\|} = \|y\|. \tag{B.18}$$

Case E

$$S(C) = \sum_{i=1}^{|C|} \|d_i - r^c\|^2 = \sum_{i=1}^{|C|} \left( d_i^T d_i + r^{cT} r^c - 2\, d_i^T r^c \right) = |C| + |C| \cdot \frac{\|y\|^2}{|C|^2} - 2\,\frac{\|y\|^2}{|C|} = |C| - \frac{\|y\|^2}{|C|}. \tag{B.19}$$

Case F

$$S(C) = \sum_{i=1}^{|C|} \|d_i - \hat{r}^c\|^2 = \sum_{i=1}^{|C|} \left( d_i^T d_i + \hat{r}^{cT} \hat{r}^c - 2\,\frac{d_i^T y}{\|y\|} \right) = 2|C| - 2\|y\|. \tag{B.20}$$

Case   S(C)                   a       Ψ unweighted (w = 1)                  Ψ weighted (w = |C|)
A      ΣΣ cos(d_i, d_j)       |C|²    max Σ_k ‖y_k‖²/|C_k|²                 max Σ_k ‖y_k‖²/|C_k|
B      ΣΣ ‖d_i − d_j‖²        2|C|²   min Σ_k (1 − ‖y_k‖²/|C_k|²)           min Σ_k (|C_k| − ‖y_k‖²/|C_k|)
C/D    Σ cos(d, r^c)          |C|     max Σ_k ‖y_k‖/|C_k|                   max Σ_k ‖y_k‖
E      Σ ‖d − r^c‖²           |C|     min Σ_k (1 − ‖y_k‖²/|C_k|²)           min Σ_k (|C_k| − ‖y_k‖²/|C_k|)
F      Σ ‖d − r̂^c‖²           |C|     min Σ_k (2 − 2‖y_k‖/|C_k|)            min Σ_k (2|C_k| − 2‖y_k‖)

Table B.1: Summary of different internal clustering criteria.

Summary. Table B.1 summarises the twelve criterion functions Ψ optimised for the six different choices of S, each once in the weighted and once in the unweighted form. Evidently, cases B and E lead to the same result. In addition, minimising B or E leads to the same result as maximising A. Similarly, C, D and F are also equivalent, so that all in all there are effectively just four distinctly different criteria.

B.4 External Clustering Criteria

In this section we prove that the external Euclidean criterion $\max \sum \|r^c - r^c_S\|^2$ is equivalent to the internal Euclidean criterion $\min \sum \|d - r^c\|^2$, as claimed towards the end of Section 2.3.1.1. In the following proof let

$$c_i = r^c_i \quad \text{and} \quad c_S = r^c_S. \tag{B.21}$$

See also Equations 2.13 and 2.32.

$$\Psi(\mathcal{C}) = \sum_{i=1}^{k} |C_i| \cdot \|c_i - c_S\|^2 \tag{B.22}$$

$$= \sum_{i=1}^{k} |C_i| \left( c_i^T c_i + c_S^T c_S - 2\, c_i^T c_S \right) \tag{B.23}$$

$$= \sum_{i=1}^{k} |C_i| \left( \frac{y_i^T y_i}{|C_i|^2} + \frac{\left(\sum y\right)^T \sum y}{n^2} - 2\,\frac{y_i^T \sum y}{|C_i|\, n} \right) \tag{B.24}$$

$$= \sum_{i=1}^{k} \frac{\|y_i\|^2}{|C_i|} + \frac{\|y_S\|^2}{n} - \frac{2}{n}\,\|y_S\|^2 \tag{B.25}$$

$$= \sum_{i=1}^{k} \frac{\|y_i\|^2}{|C_i|} - \frac{\|y_S\|^2}{n}. \tag{B.26}$$

Since the second term in Equation B.26 is a constant, maximising this last equation leads to the same result as minimising the weighted function for case E in Table B.1, i. e. $\min \sum_k \left( |C_k| - \|y_k\|^2/|C_k| \right)$.

□

Appendix C

Stoplists

C.1 German ab an-das anstatt beide betraechtlich abend an-der anstelle beidem betreffend aber ander art beiden betreffs abgehandelt andere auch beider bevor abseits anderem auf beiderlei beziehentlich absolut anderen auf-das beides bezueglich abzueglich anderer auffaelligerweise beidseits bezug acht andererseits aufgrund beieinander bieten achtziger anderes aufs beim bietet aehnlich anderlei aus beinahe bin aeusser anderm ausdruecklich beispiel binnen aeussern andermal auseinander beispiele bis aeusserst andern ausgangs beispielen bisher all andernorts ausgenommen beispielsweise bisherig alle anders ausgerechnet bekam bisherigen allein anderseits ausschliesslich bekanntlich bislang allem anderswo aussen bekennen bisschen allen andre ausser bekommen bist allenfalls andrem ausserdem bekommt bitte aller andren ausserhalb beliebig bleiben allerdings andrer ausserordentlich bereit bleibt allerlei andres auswaerts bereits blieb alles anfangs ausweislich berichten blieben allesamt angeben bald besonder bloss allgemeinen angeblich bar besondere brachte allmaehlich angesichts bedeutend besonderen bringen allzu anhand bedeutet besonderer bzw. allzuviel anheim bedeutsam besonders ca. als anhin beginnen besser d.h. also anlaesslich behufs bestehen da am ans bei betonen dabei an anschliessend bei-der betonte dadurch

199 200 APPENDIX C. STOPLISTS dafuer denen dort ebensovielem entsprechen dagegen denjenigen dran ebensovielen entsprechend daher denn drauf ebensovieler entweder dahinter dennoch drei ebensovieles er damals denselben dreierlei ebensowenig erfreulicherweise damit denselbigen dreimal ebensowenigem erhalten danach der dritt ebensowenigen erklaeren daneben derart dritte ebensoweniges erklaerte dank derartig drittel ehemalig erst danke deren dritten ehemalige erste dann dergleichen drittens ehemaligen ersten dar derjenige drueber eher erstenmal daran derjenigen drum ehrenhalber erstens darauf derlei drunter eigen erster daraufhin derselbe du eigene erstere daraus derselben duerfe eigenen ersterem darf derselbige duerfen eigener ersteren darfst derselbigen duerfest eigenes ersterer darin derzeit duerfet eigens ersteres darueber derzeitig duerft eigentlich erstes darum des duerfte eigentliche erstmalig darunter deshalb duerften eigentlichen erstmals das desjenigen duerftest ein es dasjenige desselben duerftet einbegriffen esteres dass desselbigen durch eine etliche dasselbe dessen durchaus einem etlichem dasselbige deswegen durchs einen etlichen datum deutlich durchwegs einer etlicher davon dich durfte einerlei etliches davor die durften einerseits etwa dazu diejenige durftest eines etwas dazwischen diejenigen durftet einfach etwelche de dies eben eingangs etwelchem dein diese ebendies eingehend etwelchen deine dieselbe ebendiese einig etwelcher deinem dieselben ebendiesem einige etwelches deinen dieselbige ebendiesen einigem euch deiner dieselbigen ebendieser einigen euer deines diesem ebendieses einiger euere deinesgleichen diesen ebenfalls einigermassen euerem deinige dieser ebenjene einiges eueren deinigen dieses ebenjenem einmal euerer deins diesmal ebenjenen eins eueres dem diesseits ebenjener einschliesslich euerm demjenigen ding ebenjenes eis euern demnaechst dinge ebenosweinger endlich euers demselben dir ebenso entgegen eure demselbigen direkt ebensoviel entlang eurem den doch ebensoviele entscheiden euren C.1. 
GERMAN 201 eurer gegeben gewissen heim ihrigem eures gegebenenfalls gewollt heisst ihrigen euresgleichen gegen geworden her ihrs eurige gegeneinander gibt heraus im eurigen gegenueber gilt herein immer falls gegenwaertig ging herum immerhin fast gehabt gingen hervor imstande fern gehalten glauben heute in fernab gehe gleich hier in-das ferner gehen gleichermassen hieran in-der fest gehoeren gleichzeitig hierauf indem finden geht groesse hieraus indes floeten gekommen groesser hierbei indessen folgend gekonnt groessere hierdurch ineinander folgende gelegentlich groesseren hierein infolge folgenden gelegt groesste hierfuer inklusive folgendermassen gelingen groessten hiermit inmitten folgt gelingt gross hierueber inne folgte gelungen grosse hierum innen fordern gemacht grossen hierunter inner fort gemaess grosser hiervon inneren fragen gemeinsam grosses hiervor innerhalb fragt gemeinsame gruenden hierzu ins frei gemeinsamen grundsaetzlich hin insbesondere freilich gemocht gut hingegen insgesamt frueh gemusst hab hinsichtlich inskuenftig fruehen genannt habe hinter insofern frueher genannte haben hinterm international fruehere genannten habest hintern inzwischen frueheren genau habet hinters irgendein fruehestens genauer habt hoechstens irgendeine fruehzeitig genauso haette hoeher irgendeinem fuehren generell haetten hoehere irgendeinen fuenf genug haettest hoeheren irgendeiner fuer gerade haettet hohe irgendeines fuer-das geradezu haeufig hohen irgendeins fuers gering haeufigsten hoher irgendetwas gab gern halber ich irgendjemand gaengig gerne halt ihm irgendwas gaengigen gesagt handeln ihn irgendwelche galt gesamt hast ihnen irgendwelchen ganz gesamte hat ihr irgendwelcher ganztags gesamten hatte ihre irgendwem gar gesamthaft hatten ihrem irgendwen gebe gesehen hattest ihren irgendwer geben gesollt hattet ihrer irgendwessen geblieben gewesen hauptsaechlich ihres irgendwo gedurft gewiss haus ihresgleichen ist gegangen gewisse heilig ihrige ja 202 APPENDIX C. 
STOPLISTS jaehrlich kleinen letzten meinesgleichen monatlich je kleiner letztendlich meinige morgen jede kleinere letzter meinigem muesse jedem kleineren letztere meinigen muessen jeden kleines letzterem meins muessend jedenfalls knapp letzteren meint muessest jeder koenne letzterer meinte muesset jedermann koennen letzteres meist muesst jedermanns koennest letztlich meiste muesste jederzeit koennet lieber meistem muessten jedes koennt liegen meisten muesstest jedesmal koennte liegt meistens muesstet jedoch koennten liess merken muss jedwede koenntest liessen merklich musst jedwedem koenntet links merkt musste jedweden kommen live merkwuerdig mussten jedweder kommend los mich musstest jedwedes kommenden machen mindestens musstet jegliche kommt macht minus nach jeglichem konnte machte mir nachdem jeglichen konnten machten mit nachhaltig jeglicher konntest mag miteinander nachher jegliches konntet magst mithilfe naechst jemand kuerzlich mal mitsamt naechste jenachdem kurz man mitteilen naechsten jene kurze manch mittels naeher jenem kurzem manche mitten naemlich jenen kurzen manchem mittler nah jener laenger manchen mittlerweile nahe jenes laengs mancher mochte naheliegenderweise jenseits laengsseits mancherlei mochten nahezu jetzt laengst manches mochtest nahm jeweilig laengstens manchmal mochtet namens jeweiligen laesst mangels moechte namhaft jeweils laeuft mass moechten natuerlich kann lag massgeblich moechtest natuerlicherweise kannst langem maximal moechtet neben kaum langen mehr moege nebeneinander kehrt lass mehrere moegen nebst kein lassen mehreren moegend nehmen keine laufend mehrerer moegendst nein keinem laut mehrerlei moegendste nennen keinen lauter mehrmals moegendsten neun keiner lediglich mein moegest nicht keinerlei leicht meine moeget nichts keines leichter meinem moeglich nie keineswegs leider meinen moeglicherweise niemals keins letzt meiner moeglichst niemand kleine letzte meines moegt nimmt C.1. GERMAN 203 nix saemtlichem selten sonstwas ueberall noch saemtlichen setzen sonstwem ueberaus nochmals saemtlicher setzt sonstwen ueberdies nun saemtliches setzte sonstwer ueberhand nunmehr sagen sich sooft ueberhaupt nur sagt sicher sosehr ueberm ob sagte sie soviel uebermorgen oben sah sieben soviele uebern ober samt siebenmal sovielem uebers oberhalb schade siebziger sovielen ueblich obgleich schaffen siehe sovieler ueblicherweise obwohl scheinbar sieht sovieles uebrig obzwar scheinen sind sowas uebrigen oder scheint so soweit uebrigens offenbar schien sobald sowie um offensichtlich schier sodann sowohl umfangreich oft schliesslich sofern spaet umfangreiche ohne schon sofort spaeter umfangreichen ohnehin schreiben sog. 
spaeteren umfangreiches optimal schrittweise sogar spaetestens umgekehrt paar schwer sogenannt sprechen ums passieren sechs sogenannte staendig unangesehen passiert sechsmal sogenannten stand unbedingt per sehen sogleich standen unbeschadet plus sehr solang stark und preis sei solch statt unerachtet prioritaer seid solche stattdessen unfern pro seien solchem stecken ungeachtet punkto seiest solchen steckt ungefaehr quasi seiet solcher stehen unlaengst rasch sein solcherlei stehend unrem rasche seine solches steht uns recht seinem soll stellen unser rechts seinen solle stellt unsere rechtzeitig seiner sollen stellte unsereinem reich seines sollend stets unsereinen reichen seinesgleichen sollest teil unsereiner reichlich seinige sollet teilen unsereines reichlichem seinigen sollst teilte unsereins reichlicher seins sollt teilweise unserem reichliches seit sollte terminlich unseren reicht seitdem sollten toll unserer rein seitens solltest trat unseres relativ seither solltet trotz unseresgleichen religioes seitlich somit trotzdem unserige ruecksichtlich selb sonder tun unserigen rund selben sondern tut unserm saemtlich selber sonst typisch unsern saemtliche selbst sonstjemand ueber unsers 204 APPENDIX C. STOPLISTS unsre viermal weisen wiederum worunter unsrem viert weiss wieso wovon unsren viertens weist wieviel wovor unsrer voellig weit wievielerlei wozu unsres voll weitaus wievielmal wozwischen unsrige voller weiter wieweit wuerde unsrigen vollstaendig weitere will wuerden unten vom weiteren willen wunder unter von weiterer willst wurde untereinander von-der weiteres wir wurden unterhalb vor weiterhin wird z.B. unterm vorbehaltlich welch wirklich zahlreich untern vorbei welche wo zahlreiche unters vorerst welchem wobei zahlreichen unterwegs vorher welchen wodurch zahlreicher unweigerlich vorlieb welcher woechentlich zehn unweit vorliegend welches wofuer zeigen unwesentlich vorliegende wem wogegen zeigt usw. vorliegenden wen woher zeigte van vorm wenden woherum zeit vergehen vormals wendet wohin zeitlich vermag vorne wenig wohinauf zentral vermittels vornehmlich wenige wohinaus ziehen vermittelst vors wenigem wohinein zieht vermoege vorsitzen wenigen wohinter ziemlich vermutlich waehren weniger wohinunter zog verschieden waehrend weniges wohl zu verschiedentlich waere wenigste wolle zu-der versuchsweise waeren wenigstem wollen zu-die versus waerest wenigsten wollest zudem verwenden waeret wenigstens wollet zueinander verwendet waerst wenigster wollt zuerst verwendete waert wenigstes wollte zufaelligerweise verwendeten wahrlich wenn wollten zufolge verwendung wann wer wolltest zugaenglich via war werde wolltet zugleich viel waren werden womit zugunsten viele warest wert wonach zugute vielem warst wesentlich woneben zukuenftig vielen wart weshalb worab zuletzt vieler warum wessen woran zuliebe vielerlei was wessentwegen worauf zum vieles weder wessentwillen woraufhin zumal vielfach weg weswegen woraus zumindest vielleicht wegen wett worden zunaechst vielmals weh wichtig worein zunichte vielmehr weil wider worin zur vielzahl weis wie worueber zurzeit vier weise wieder worum zusaetzlich C.2. ENGLISH 205 zusaetzliche zuvielem zuwenigem zwar zweite zuvielen zuwenigen zwecks zusammen zweiten zustande zuvieler zuweniger zwei zuvieles zuweniges zweierlei zweitens zuungunsten zuvor zuwider zweifellos zwischen zuviel zuwenig zuzueglich zweimal zuviele zuwenige zwanzig zweit zwoelf

C.2 English a always aware called day ability am away calling days able amid b calls despite aboard amidst back can did about among backed cannot didn above amongst backing can’t didn’t absolute & backs carried do absolutely an be carries does across and became carry doesn act announce because carrying doesn’t acts announced become cellspacing doing actual announcement becomes center done actually announces becoming certainly don’t add another been change down additional anti before changed downward additionally any began changes downwards after anyone begin choose e afterwards anything begins chooses each again appaling behind chose eight against appalingly being clearly either ago appear believe close else ahead appeared believed closed elsewhere aimless appears between closes especially aimlessly are bgcolor closing etc ain’t aren’t border com even al around both come eventually albeit as brang comes ever align ask bring coming every all asked brings consider everybody allow asking brought considerable everyone almost asks build considering exactly along at builds contains example alongside await built could examples already awaited busy couldn f also awaits but couldn’t far alternate awaken by d feel alternately awakened c dare felt although awakens call daren few 206 APPENDIX C. STOPLISTS

ffc0c homepage les no running final hour less non runs finally hours let none s find how like nor said first however ll not same five i lya now say float i’d m o says for if made of see found ii main off seek four iii mainly often seeking fourth i’ll make old seeks from i’m makes on seen gave important making once send get in man one sent gets inc many only set getting include margin-left or sets give included may other seven gives includes me our several go including means out she goes inside meant over she’s going into meanwhile own should gone is men owns shouldn good isn methods p shown got isn’t might particularly side great it missed per since h it’d more percent six had it’ll moreover present sixes happen it’s most presentation slow happened its mostly presented slowed happens itself move presenter slows has iv moved presenting small have i’ve moving presents smaller haven’t j mr primarily so he just mrs put some he’d k much q someone held kind must quickly something her kinds mustn r somewhat here l my ran somewhere hereby la myself rather soon heretofore larger need recent sought herewith largest needs remain spread hers last neither remaining stay herself later never respond stayed he’s latest new responded still high le newer responding substantially him least news responds such himself leave night return suppose his leaves nights right t hitherto leaving nine run take C.2. ENGLISH 207 taken to upward we’d won’t takes to-day us we’ll would taking together use well wouldn tell too used went wouldn’t tells took uses we’re wow th toward using were wows than towards usual we’ve www that tried usually what x the tries v whatever xii their try various when them trying ve whenever xiii themselves two very where xiv then u via whereever xix there unable view whether xv thereby under viewed which xvi therefore underneath w whichever xvii these undid wait while xviii they undo waited who xx they’d undoes waits whoever y they’re undone want whom yeah they’ve undoubtedly wanted whomsoever year thi undue wants whose -year-old thing unfortunately was whosever you things unless wasn who’ve you’ll this unnecessarily wasn’t why those unofficially watched wide your though unsure watching wider you’re three until way will your’s through unusually ways with yours throughout up we without yourself thus upon weblink won yourselves

Appendix D

Experimental Result Tables

This appendix collects the experimental result tables omitted from the main body of the text. All of them are at least partially illustrated by figures in the main text.

D.1 Experiments in Chapter 4 (Setup)

rowmodel  colmodel  springer      amazon        sda           wiki          nzz
none      none      0.789[0.008]  0.644[0.002]  0.636[0.000]  0.588[0.012]  0.581[0.017]
log       none      0.829[0.041]  0.649[0.001]  0.627[0.001]  0.583[0.011]  0.480[0.000]
maxtf     none      0.804[0.026]  0.622[0.003]  0.581[0.000]  0.558[0.008]  0.476[0.000]
sqrt      none      0.790[0.015]  0.619[0.002]  0.580[0.001]  0.557[0.011]  0.475[0.000]
none      idf       0.491[0.008]  0.496[0.002]  0.522[0.001]  0.410[0.008]  0.402[0.022]
log       idf       0.491[0.005]  0.497[0.003]  0.501[0.001]  0.408[0.004]  0.370[0.016]
maxtf     idf       0.486[0.004]  0.491[0.005]  0.490[0.000]  0.414[0.012]  0.371[0.001]
sqrt      idf       0.487[0.004]  0.486[0.005]  0.491[0.001]  0.407[0.006]  0.369[0.000]

Table D.1: Evaluation of Cluto weighting models, with the rbr (largeSS) algorithm. (The table shows entropy values, with standard deviations in square brackets.) [This table belongs to Section 4.4.2]
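
The row and column models named above are defined in Section 4.4.2 rather than in this appendix. As a rough guide, assuming the standard term-weighting conventions that such models usually follow (the exact formulas in the thesis may differ in detail), the variants can be read as:

    % tf(t,d): raw frequency of term t in document d;
    % N: number of documents; df(t): number of documents containing t.
    \[ \mathrm{none}\colon\; tf(t,d) \qquad
       \mathrm{log}\colon\; 1 + \log tf(t,d) \qquad
       \mathrm{maxtf}\colon\; 0.5 + 0.5\,\frac{tf(t,d)}{\max_{t'} tf(t',d)} \qquad
       \mathrm{sqrt}\colon\; \sqrt{tf(t,d)} \]
    \[ \mathrm{idf}\ \text{(column model)}\colon\; w(t,d)\cdot\log\frac{N}{df(t)} \]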


rowmodel  colmodel  springer      amazon        sda           wiki          nzz
none      none      0.488[0.002]  0.486[0.005]  0.495[0.000]  0.400[0.006]  0.364[0.000]
log       none      0.582[0.012]  0.550[0.004]  0.517[0.000]  0.457[0.008]  0.441[0.028]
maxtf     none      0.653[0.007]  0.609[0.004]  0.535[0.001]  0.539[0.009]  0.417[0.017]
sqrt      none      0.617[0.004]  0.583[0.005]  0.526[0.000]  0.491[0.007]  0.456[0.018]
none      idf       0.422[0.018]  0.463[0.005]  0.421[0.001]  0.408[0.019]  0.349[0.005]
log       idf       0.463[0.003]  0.483[0.004]  0.470[0.000]  0.394[0.007]  0.361[0.001]
maxtf     idf       0.474[0.005]  0.482[0.005]  0.478[0.000]  0.400[0.016]  0.376[0.023]
sqrt      idf       0.468[0.004]  0.481[0.003]  0.473[0.000]  0.392[0.009]  0.391[0.034]

Table D.2: Double-weighting. Evaluation of Cluto’s built-in weighting models after prior application of an external log-idf weighting scheme to the data. (The table shows entropy values, with standard deviations in square brackets.) [Section 4.4.2.1]

      external           Cluto
  #   local  global      local  global   springer      amazon        sda           wiki          nzz
  1   —      —           —      —        0.789[0.008]  0.644[0.002]  0.636[0.000]  0.588[0.012]  0.581[0.017]
  2   —      —           —      idf2     0.491[0.008]  0.496[0.002]  0.522[0.001]  0.410[0.008]  0.402[0.022]
  3   —      idf2        —      —        0.490[0.005]  0.497[0.002]  0.522[0.001]  0.408[0.007]  0.402[0.021]
  4   —      idfn        —      —        0.491[0.009]  0.497[0.004]  0.522[0.000]  0.407[0.005]  0.397[0.019]
  5   —      —           log2   idf2     0.491[0.005]  0.497[0.003]  0.501[0.001]  0.408[0.004]  0.370[0.016]
  6   log2   idf2        —      —        0.489[0.003]  0.497[0.002]  0.502[0.001]  0.408[0.022]  0.381[0.025]
  7   logn   idfn        —      —        0.488[0.002]  0.486[0.005]  0.495[0.000]  0.400[0.006]  0.364[0.000]
  8   —      (idf2)^2    —      —        0.416[0.020]  0.475[0.004]  0.428[0.003]  0.417[0.025]  0.438[0.020]
  9   —      (idfn)^2    —      —        0.416[0.018]  0.474[0.008]  0.430[0.003]  0.403[0.019]  0.444[0.007]
 10   logn   idfn        —      idf2     0.422[0.018]  0.463[0.005]  0.421[0.001]  0.408[0.019]  0.349[0.005]
 11   logn   idfn·idf2   —      —        0.424[0.018]  0.463[0.004]  0.422[0.001]  0.413[0.017]  0.350[0.005]
 12   log2   idf2        —      idf2     0.417[0.019]  0.467[0.007]  0.422[0.004]  0.394[0.017]  0.362[0.006]
 13   log2   (idf2)^2    —      —        0.414[0.019]  0.465[0.007]  0.421[0.004]  0.407[0.023]  0.360[0.009]
 14   logn   (idfn)^2    —      —        0.414[0.019]  0.463[0.004]  0.421[0.004]  0.412[0.010]  0.351[0.005]

Table D.3: Evaluation of different idf and idf^2 variants. idfn and logn refer to formulae using the natural logarithm, idf2 and log2 to those with the binary logarithm. (Since log2 x = ln x / ln 2, the two bases scale the whole matrix by a constant factor, which should not affect the cosine-based clustering.) The differences between rows 2/3, 5/6, 10/11 and 12/13 can be explained by rounding errors in the intermediate matrix emerging from the external weighting step. In principle, the members of each of these pairs are exactly equivalent. [Section 4.4.2.2]

                     k1b      sports     tr23
# of documents       2,340    8,580      204
# labels             6        7          6
# types in corpus    21,839   126,373    5,832
# tokens in corpus   552,213  1,869,981  493,387
avg. text length     236      218        2419 (a)

      external           Cluto
  #   local  global      local  global   k1b           sports        tr23
  1   —      —           —      —        0.248[0.000]  0.353[0.006]  0.494[0.000]
  2   —      —           —      idf2     0.192[0.017]  0.198[0.007]  0.411[0.005]
  3   —      idf2        —      —        0.194[0.017]  0.195[0.005]  0.411[0.005]
  4   —      idfn        —      —        0.187[0.012]  0.194[0.005]  0.411[0.005]
  5   —      —           log2   idf2     0.190[0.016]  0.214[0.022]  0.458[0.000]
  6   log2   idf2        —      —        0.190[0.016]  0.209[0.024]  0.458[0.000]
  7   logn   idfn        —      —        0.211[0.007]  0.211[0.024]  0.464[0.000]
  8   —      (idf2)^2    —      —        0.170[0.014]  0.221[0.020]  0.411[0.018]
  9   —      (idfn)^2    —      —        0.171[0.014]  0.219[0.015]  0.404[0.017]
 10   logn   idfn        —      idf2     0.192[0.023]  0.265[0.031]  0.431[0.020]
 11   logn   idfn·idf2   —      —        0.191[0.027]  0.253[0.024]  0.432[0.016]
 12   log2   idf2        —      idf2     0.178[0.016]  0.237[0.002]  0.435[0.023]
 13   log2   (idf2)^2    —      —        0.183[0.021]  0.238[0.001]  0.435[0.023]
 14   logn   (idfn)^2    —      —        0.191[0.021]  0.246[0.000]  0.429[0.015]

(a) The average text length for tr23 is strongly influenced by a small number of very large documents. Excluding the twelve longest documents reduces the average to 677 tokens; excluding a further twelve reduces it to 341.

Table D.4: Experiments with three standard sets. Different weighting schemes applied to the three test collections k1b, sports and tr23 that ship with the standard Cluto distribution. Scores in the upper half refer to normal weighting, scores in the lower half to idf2 weightings. [Section 4.4.2.2]

D.2 Experiments in Chapter 5 (Matrix Reduction)

                 springer      amazon        sda           wiki          nzz
BOW              0.476[0.016]  0.460[0.004]  0.474[0.003]  0.392[0.012]  0.348[0.001]
BOWstop          0.449[0.015]  0.466[0.006]  0.436[0.002]  0.416[0.020]  0.346[0.001]
BOWstem          0.408[0.004]  0.458[0.009]  0.463[0.002]  0.386[0.010]  0.346[0.002]
BOWstop, stem    0.410[0.010]  0.468[0.006]  0.431[0.002]  0.416[0.019]  0.348[0.002]

Table D.5: Baseline and variants. BOW = standard bag-of-words; BOWstop = bag-of-words after stopword removal; BOWstem = bag-of-words after stemming; BOWstop, stem = bag-of-words after stopword removal and stemming. [Section 5.1.1]
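
As an illustration of how these four baselines relate, the following is a minimal sketch of the representation pipeline (not the thesis's actual code; the stoplist here is a stand-in, and the stemmer is NLTK's Snowball stemmer for German):

    from collections import Counter
    from nltk.stem.snowball import GermanStemmer

    STOPLIST = {"der", "die", "das", "und", "ist"}   # stand-in for the real stoplist
    stemmer = GermanStemmer()

    def bag_of_words(text, remove_stopwords=False, stem=False):
        tokens = [t.lower() for t in text.split()]   # naive tokenisation
        if remove_stopwords:                         # -> BOWstop
            tokens = [t for t in tokens if t not in STOPLIST]
        if stem:                                     # -> BOWstem
            tokens = [stemmer.stem(t) for t in tokens]
        return Counter(tokens)                       # term -> raw frequency tf(t, d)

    # BOWstop, stem combines both reductions:
    # bag_of_words(doc, remove_stopwords=True, stem=True)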

          springer      amazon        sda           wiki          nzz
BOW       0.476[0.016]  0.460[0.004]  0.474[0.003]  0.392[0.012]  0.348[0.001]
BOL       0.421[0.015]  0.448[0.007]  0.456[0.001]  0.388[0.007]  0.349[0.005]

BOWstem   0.408[0.004]  0.458[0.009]  0.463[0.002]  0.386[0.010]  0.346[0.002]
BOLstem   0.407[0.004]  0.454[0.010]  0.455[0.001]  0.390[0.007]  0.357[0.004]

Table D.6: Bag-of-lemmata versus bag-of-words. [Section 5.2.2]

               springer  amazon  sda    wiki      nzz
100%           0.421     0.448   0.456  0.388     0.349
prune to 99%   0.421     0.483   0.428  0.409     0.346
prune to 95%   0.443     0.498   0.476  0.416     0.366
prune to 90%   0.488     0.518   0.497  0.433{1}  0.400
prune to 85%   0.530     0.523   0.506  0.454{1}  0.404

Table D.7: Global pruning with similarity preservation, using Cluto's -colprune parameter. (In curly brackets: number of documents that could not be clustered.) [Section 5.3.1]

                springer     amazon       sda          wiki         nzz
All             0.421 (100%) 0.448 (100%) 0.456 (100%) 0.388 (100%) 0.349 (100%)
Shared          0.420 (90%)  0.450 (97%)  0.457 (98%)  0.389 (94%)  0.349 (97%)
0 – 10 %        0.416 (69%)  0.460 (72%)  0.448 (67%)  0.394 (76%)  0.353 (64%)
0.005 – 10 %    —            0.476 (66%)  0.442 (63%)  0.390 (68%)  0.355 (62%)
0.01 – 10 %     —            0.482 (64%)  0.438 (61%)  0.404 (64%)  0.347 (60%)
0.05 – 10 %     0.412 (60%)  0.495 (56%)  0.434 (55%)  0.414 (55%)  0.351 (54%)
0.1 – 10 %      0.436 (54%)  0.498 (51%)  0.464 (51%)  0.421 (50%)  0.356 (51%)
0 – 5 %         0.419 (58%)  0.469 (63%)  0.418 (58%)  0.423 (67%)  0.356 (53%)
0.005 – 5 %     —            0.473 (57%)  0.416 (54%)  0.428 (59%)  0.356 (50%)
0.01 – 5 %      —            0.480 (55%)  0.407 (52%)  0.416 (55%)  0.351 (48%)
0.05 – 5 %      0.430 (48%)  0.491 (47%)  0.403 (46%)  0.414 (46%)  0.353 (43%)
0.1 – 5 %       0.432 (42%)  0.498 (42%)  0.410 (42%)  0.416’(42%)  0.360 (39%)
0 – 1 %         0.453 (37%)  0.484 (43%)  0.392 (36%)  0.427 (47%)  0.357 (31%)
0.005 – 1 %     —            0.489 (37%)  0.381 (32%)  0.431 (39%)  0.353 (29%)
0.01 – 1 %      —            0.493 (35%)  0.383 (31%)  0.435’(35%)  0.351 (27%)
0.05 – 1 %      0.451 (27%)  0.510’(27%)  0.397’(24%)  0.439’(26%)  0.362 (21%)
0.1 – 1 %       0.473’(21%)  0.525’(22%)  0.416’(21%)  0.447’(21%)  0.377 (18%)
0 – 0.5 %       0.497’(29%)  0.513 (35%)  0.407 (28%)  0.434’(40%)  0.356 (25%)
0.005 – 0.5 %   —            0.513’(29%)  0.415’(24%)  0.432’(31%)  0.356 (22%)
0.01 – 0.5 %    —            0.514’(27%)  0.410’(23%)  0.432’(28%)  0.350 (20%)
0.05 – 0.5 %    0.504’(20%)  0.536’(19%)  0.422’(16%)  0.455’(19%)  0.376 (15%)
0.1 – 0.5 %     0.543’(14%)  0.568’(14%)  0.446’(13%)  0.485’(14%)  0.388 (11%)

Table D.8: Global pruning with upper and lower bounds on document frequency. Percentage numbers in parentheses indicate how many of the non-zero elements were still left after pruning. Apostrophes indicate that some documents could not be clustered because all their features had been eliminated; the number of documents thus lost was far below 1% in all cases. The springer results for lower bounds 0.005% and 0.01% have been omitted (marked —): owing to the comparatively small number of documents, these bounds are equivalent to no bound at all. [Section 5.3.1]
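
A minimal sketch of pruning with document-frequency bounds, as explored in Table D.8 (illustrative only; function and parameter names are invented, not taken from the thesis):

    from collections import Counter

    def df_prune(docs, lower=0.001, upper=0.10):
        # docs: list of token lists; bounds are fractions of the corpus size
        n = len(docs)
        df = Counter()
        for doc in docs:
            df.update(set(doc))                  # count each term once per document
        keep = {t for t, f in df.items()
                if lower * n <= f <= upper * n}  # drop too-rare and too-frequent terms
        return [[t for t in doc if t in keep] for doc in docs]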

          springer        amazon       sda          wiki         nzz
All features (shared and unique)
α → ∞     0.421 (100%)    0.448 (100%) 0.456 (100%) 0.388 (100%) 0.349 (100%)
α = 300   0.424 (100%)    0.456 (97%)  0.457 (100%) 0.389 (85%)  0.386 (78%)
α = 250   0.424 (100%)    0.457 (95%)  0.458 (100%) 0.393 (81%)  0.400 (67%)
α = 200   0.423 (100%)    0.461 (91%)  0.453 (97%)  0.390 (75%)  0.418 (54%)
α = 150   0.426 (100%)    0.454 (82%)  0.451 (90%)  0.388 (66%)  0.429 (41%)
α = 100   0.422 (99%)     0.468 (69%)  0.446 (74%)  0.392 (53%)  0.450 (27%)
α = 50    0.429 (74%)     0.485 (42%)  0.429 (42%)  0.412’(31%)  0.481 (14%)
α = 25    0.457’(38%)     0.509’(22%)  0.446 (21%)  0.454’(17%)  0.630’(7%)
α = 10    0.851’(15%)     0.660’(9%)   0.606’(8%)   0.646’(7%)   0.721’(3%)
Shared features only
α → ∞     0.420 (90%)     0.450 (97%)  0.457 (98%)  0.389 (94%)  0.349 (97%)
α = 300   0.423 (90%)(a)  0.453 (94%)  0.456 (98%)  0.397 (82%)  0.374 (77%)
α = 250   0.423 (90%)(a)  0.456 (92%)  0.457 (98%)  0.390 (78%)  0.395 (67%)
α = 200   0.423 (90%)(a)  0.459 (88%)  0.454 (95%)  0.386 (73%)  0.412 (54%)
α = 150   0.423 (90%)(a)  0.451 (80%)  0.451 (88%)  0.393 (64%)  0.429 (41%)
α = 100   0.420 (90%)     0.470 (67%)  0.447 (73%)  0.382 (52%)  0.446 (27%)
α = 50    0.424 (73%)     0.484 (42%)  0.428 (42%)  0.407 (31%)  0.451 (14%)
α = 25    0.414 (38%)     0.503 (22%)  0.437 (21%)  0.438 (17%)  0.598 (7%)
α = 10    0.575 (15%)     0.614 (9%)   0.540 (8%)   0.553 (7%)   0.676 (3%)

(a) The springer results for α ≥ 150 differ slightly from those for α → ∞, even though essentially the same matrix was used in all instances. The difference is explained by the variability of the non-deterministic clustering, triggered by a different arrangement of the matrix columns.

Table D.9: Local pruning: keeping only the α most frequent features of each document (after log-idf weighting). The first half of the table shows the procedure with all features. In the second half, those features not occurring in other documents (“unique features”) were removed before making the selection. The differences are small, except in cases where documents would otherwise have been lost altogether. (In parentheses: percentage of non-zero elements left after reduction.) [Section 5.3.1]
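
A sketch of the local pruning scheme of Table D.9 (illustrative; names are invented): keep the α best log-idf-weighted features of each document.

    import math
    from collections import Counter

    def local_prune(docs, alpha):
        n = len(docs)
        df = Counter()
        for doc in docs:
            df.update(set(doc))
        pruned = []
        for doc in docs:
            tf = Counter(doc)
            # log-idf weight per term, then keep the alpha best-scoring terms
            w = {t: (1 + math.log(f)) * math.log(n / df[t]) for t, f in tf.items()}
            keep = set(sorted(w, key=w.get, reverse=True)[:alpha])
            pruned.append({t: f for t, f in tf.items() if t in keep})
        return pruned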

          springer      amazon  sda  wiki  nzz
LSA-5     0.633[0.000]  n/a     n/a  n/a   n/a
LSA-25    0.541[0.004]  n/a     n/a  n/a   n/a
LSA-50    0.542[0.003]  n/a     n/a  n/a   n/a
LSA-75    0.536[0.005]  n/a     n/a  n/a   n/a
LSA-100   0.527[0.006]  n/a     n/a  n/a   n/a
LSA-125   0.538[0.006]  n/a     n/a  n/a   n/a
LSA-150   0.541[0.008]  n/a     n/a  n/a   n/a
LSA-175   0.544[0.006]  n/a     n/a  n/a   n/a
LSA-200   0.535[0.006]  n/a     n/a  n/a   n/a
LSA-225   0.546[0.016]  n/a     n/a  n/a   n/a
LSA-250   0.547[0.023]  n/a     n/a  n/a   n/a
LSA-275   0.546[0.024]  n/a     n/a  n/a   n/a
LSA-300   0.557[0.029]  n/a     n/a  n/a   n/a

Table D.10: Latent Semantic Analysis. LSA-ρ refers to the clustering experiment with the ρ SVD-dimensions that had the highest corresponding eigenvalues. (For LSA experiments Cluto's -cstype parameter was switched from largeSS to large, as the former is less suitable for dense matrices such as those arising from LSA.) [Section 5.3.2]
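
The LSA-ρ representation can be sketched as follows (illustrative; dense numpy SVD is feasible only for a small collection such as springer, presumably why the larger data sets are marked n/a):

    import numpy as np

    def lsa(doc_term_matrix, rho):
        # rows = documents, columns = terms (already weighted)
        u, s, vt = np.linalg.svd(doc_term_matrix, full_matrices=False)
        # document coordinates in the rho strongest latent dimensions
        return u[:, :rho] * s[:rho]

    # The documents are then clustered on these dense rho-dimensional vectors.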

                   springer     amazon       sda          wiki         nzz
BOW                0.476 (100%) 0.460 (100%) 0.474 (100%) 0.392 (100%) 0.348 (100%)
BOWstop            0.449 (59%)  0.466 (58%)  0.436 (57%)  0.416 (65%)  0.346 (64%)
BOWstop, stem      0.410 (58%)  0.468 (55%)  0.431 (54%)  0.416 (61%)  0.348 (60%)
BOWstop[Google]    0.461 (72%)  0.462 (76%)  0.467 (75%)  0.398 (81%)  0.348 (84%)
BOL                0.421 (100%) 0.448 (100%) 0.456 (100%) 0.388 (100%) 0.349 (100%)
BOLstop            0.422 (61%)  0.463 (60%)  0.421 (59%)  0.408 (67%)  0.349 (67%)
BOLstop, stem      0.419 (60%)  0.468 (59%)  0.423 (59%)  0.411 (66%)  0.348 (66%)
BOLstop[Google]    0.437 (76%)  0.459 (80%)  0.451 (79%)  0.390 (84%)  0.347 (88%)

Table D.11: Stopword removal with manually compiled stoplist. (In parentheses: percentage of non-zero elements left after reduction.) [Section 5.4.1]

        springer     amazon       sda          wiki         nzz
BOL     0.421 (100%) 0.448 (100%) 0.456 (100%) 0.388 (100%) 0.349 (100%)
df      0.519 (40%)  0.485 (50%)  0.419 (46%)  0.418 (57%)  0.357 (58%)
discr   0.393 (48%)  0.458 (61%)  0.427 (65%)  0.389 (66%)  0.352 (70%)
χ2      0.403 (99%)  0.451 (99%)  0.449 (94%)  0.388 (99%)  0.351 (90%)
E       0.392 (52%)  0.459 (83%)  0.432 (79%)  0.398 (71%)  0.352 (81%)
E′      0.419 (86%)  0.451 (99%)  0.456 (99%)  0.388 (99%)  0.348 (96%)
E′′     0.388 (47%)  0.457 (62%)  0.401 (57%)  0.393 (67%)  0.355 (65%)
KL      0.396 (46%)  0.462 (60%)  0.423 (63%)  0.389 (65%)  0.352 (68%)
WKL     0.458 (41%)  0.473 (52%)  0.401 (49%)  0.397 (59%)  0.356 (60%)

Table D.12: “Self-validation” of different stopword discrimination measures (averages). The numbers show the averages for the nine different α values. See Table D.13 for the detailed results. [Section 5.4.2.1]
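
The measures themselves (df, discr, χ2, E, E′, E′′, KL, WKL) are defined in Section 5.4.2; only their effects are tabulated here. As a rough illustration of the general pattern (rank all terms by some corpus statistic and treat the top of the ranking as stopword candidates), consider document frequency and a simple entropy score; this is a sketch under assumed definitions, not the thesis's exact formulas:

    import math
    from collections import Counter

    def stopword_candidates(docs, size=250):
        n = len(docs)
        df, tf = Counter(), Counter()
        for doc in docs:
            df.update(set(doc))
            tf.update(doc)
        # df measure: the most document-frequent terms are stopword candidates
        by_df = sorted(df, key=df.get, reverse=True)[:size]
        # entropy-style measure: terms spread evenly over documents score high
        def entropy(term):
            probs = [doc.count(term) / tf[term] for doc in docs if term in doc]
            return -sum(p * math.log(p) for p in probs)
        by_entropy = sorted(df, key=entropy, reverse=True)[:size]
        return by_df, by_entropy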

               springer     amazon       sda          wiki         nzz
BOL            0.421 (100%) 0.448 (100%) 0.456 (100%) 0.388 (100%) 0.349 (100%)
df (50)        0.424 (72%)  0.450 (80%)  0.454 (77%)  0.399 (83%)  0.349 (88%)
df (100)       0.405 (64%)  0.460 (72%)  0.448 (69%)  0.387 (78%)  0.348 (81%)
df (250)       0.434 (52%)  0.469 (63%)  0.421 (59%)  0.397 (69%)  0.349 (72%)
df (500)       0.441 (43%)  0.474 (54%)  0.411 (50%)  0.420 (62%)  0.354 (63%)
df (1000)      0.461 (34%)  0.471 (46%)  0.393 (41%)  0.422 (54%)  0.362 (53%)
df (1500)      0.496’(29%)  0.486 (41%)  0.394 (36%)  0.422 (49%)  0.363 (47%)
df (2000)      0.534’(25%)  0.498 (37%)  0.415 (32%)  0.440’(45%)  0.368 (43%)
df (2500)      0.594’(23%)  0.512’(34%)  0.424 (30%)  0.433’(42%)  0.354 (40%)
df (5000)      0.887’(16%)  0.541’(26%)  0.411’(22%)  0.439’(34%)  0.363 (30%)
discr (50)     0.413 (76%)  0.455 (84%)  0.453 (85%)  0.385 (85%)  0.351 (89%)
discr (100)    0.421 (70%)  0.458 (80%)  0.451 (80%)  0.384 (80%)  0.348 (85%)
discr (250)    0.410 (62%)  0.460 (73%)  0.448 (74%)  0.395 (74%)  0.350 (80%)
discr (500)    0.410 (53%)  0.459 (66%)  0.438 (67%)  0.392 (69%)  0.346 (75%)
discr (1000)   0.363 (44%)  0.458 (59%)  0.423 (63%)  0.398 (63%)  0.351 (69%)
discr (1500)   0.372 (38%)  0.455 (54%)  0.418 (58%)  0.393 (60%)  0.354 (65%)
discr (2000)   0.371 (34%)  0.453 (50%)  0.413 (56%)  0.384 (57%)  0.356 (61%)
discr (2500)   0.379 (31%)  0.459 (47%)  0.411 (54%)  0.387 (55%)  0.355 (59%)
discr (5000)   0.397’(23%)  0.464’(37%)  0.391 (47%)  0.386 (48%)  0.354 (51%)
χ2 (50)        0.417 (91%)  0.452 (100%) 0.456 (99%)  0.385 (100%) 0.351 (96%)
χ2 (100)       0.414 (86%)  0.452 (100%) 0.457 (99%)  0.390 (100%) 0.352 (95%)
χ2 (250)       0.401 (73%)  0.446 (100%) 0.455 (97%)  0.386 (100%) 0.353 (94%)
χ2 (500)       0.381 (45%)  0.449 (100%) 0.456 (97%)  0.385 (100%) 0.349 (92%)
χ2 (1500)      0.434’(38%)  0.450 (99%)  0.449 (94%)  0.389 (99%)  0.349 (90%)
E (50)         0.415 (97%)  0.453 (99%)  0.457 (100%) 0.391 (98%)  0.349 (99%)
E (100)        0.410 (87%)  0.460 (98%)  0.456 (100%) 0.397 (94%)  0.349 (99%)
E (250)        0.410 (67%)  0.454 (97%)  0.451 (99%)  0.410 (83%)  0.349 (98%)
E (500)        0.388 (55%)  0.459 (94%)  0.447 (95%)  0.405 (74%)  0.353 (95%)
E (1000)       0.369 (42%)  0.466 (90%)  0.435 (81%)  0.406 (67%)  0.353 (81%)
E (1500)       0.351 (36%)  0.461 (83%)  0.425 (69%)  0.403 (61%)  0.354 (73%)
E (2000)       0.389 (31%)  0.464 (73%)  0.415 (62%)  0.398 (58%)  0.355 (69%)
E (2500)       0.390 (29%)  0.453 (67%)  0.413 (59%)  0.387 (55%)  0.349 (65%)
E (5000)       0.407’(21%)  0.463’(47%)  0.386 (47%)  0.382 (47%)  0.354 (52%)
E’ (50)        0.420 (95%)  0.452 (100%) 0.456 (99%)  0.391 (100%) 0.350 (97%)
E’ (100)       0.413 (94%)  0.449 (100%) 0.457 (99%)  0.384 (100%) 0.352 (97%)
E’ (250)       0.428 (93%)  0.448 (100%) 0.457 (99%)  0.390 (100%) 0.349 (97%)
E’ (500)       0.425 (91%)  0.451 (100%) 0.455 (99%)  0.389 (100%) 0.348 (97%)
E’ (1000)      0.427 (90%)  0.452 (100%) 0.456 (98%)  0.384 (99%)  0.347 (97%)
E’ (1500)      0.429 (85%)  0.445 (99%)  0.457 (98%)  0.394 (99%)  0.346 (96%)
E’ (2000)      0.415 (81%)  0.451 (99%)  0.457 (98%)  0.388 (99%)  0.349 (96%)
E’ (2500)      0.412 (77%)  0.449 (99%)  0.456 (98%)  0.390 (99%)  0.345 (96%)
E’ (5000)      0.404 (67%)  0.460 (98%)  0.452 (98%)  0.385 (99%)  0.346 (95%)
E” (50)        0.418 (80%)  0.455 (91%)  0.451 (86%)  0.391 (96%)  0.349 (97%)
E” (100)       0.404 (65%)  0.457 (80%)  0.448 (71%)  0.389 (91%)  0.352 (93%)
E” (250)       0.398 (55%)  0.455 (67%)  0.428 (61%)  0.389 (77%)  0.348 (82%)
E” (500)       0.384 (49%)  0.457 (62%)  0.409 (56%)  0.397 (64%)  0.354 (65%)
E” (1000)      0.366 (42%)  0.456 (57%)  0.401 (53%)  0.401 (59%)  0.356 (56%)
E” (1500)      0.367 (38%)  0.457 (55%)  0.378 (50%)  0.396 (56%)  0.356 (51%)
E” (2000)      0.367 (36%)  0.457 (53%)  0.366 (48%)  0.387 (54%)  0.359 (49%)
E” (2500)      0.382 (32%)  0.457 (51%)  0.364 (46%)  0.406 (53%)  0.358 (47%)
E” (5000)      0.405’(25%)  0.459 (45%)  0.360 (43%)  0.381’(47%)  0.359 (42%)
KL (50)        0.425 (75%)  0.452 (84%)  0.453 (84%)  0.387 (85%)  0.350 (88%)
KL (100)       0.416 (69%)  0.452 (78%)  0.451 (79%)  0.384 (80%)  0.354 (84%)
KL (250)       0.411 (60%)  0.457 (71%)  0.446 (72%)  0.390 (74%)  0.347 (78%)
KL (500)       0.393 (51%)  0.466 (64%)  0.438 (66%)  0.388 (68%)  0.350 (72%)
KL (1000)      0.380 (40%)  0.467 (56%)  0.419 (60%)  0.389 (62%)  0.348 (66%)
KL (1500)      0.385 (35%)  0.466 (52%)  0.415 (57%)  0.394 (58%)  0.349 (62%)
KL (2000)      0.391 (31%)  0.465 (49%)  0.415 (54%)  0.395 (55%)  0.355 (59%)
KL (2500)      0.409 (29%)  0.468 (45%)  0.410 (52%)  0.396 (53%)  0.356 (56%)
KL (5000)      0.351’(21%)  0.465 (37%)  0.359 (42%)  0.377 (45%)  0.355 (46%)
WKL (50)       0.413 (73%)  0.453 (80%)  0.453 (78%)  0.389 (84%)  0.351 (88%)
WKL (100)      0.416 (66%)  0.457 (74%)  0.449 (71%)  0.390 (78%)  0.352 (82%)
WKL (250)      0.404 (55%)  0.460 (64%)  0.431 (62%)  0.391 (70%)  0.347 (73%)
WKL (500)      0.388 (45%)  0.468 (56%)  0.414 (54%)  0.394 (64%)  0.348 (65%)
WKL (1000)     0.406 (36%)  0.466 (48%)  0.383 (45%)  0.394 (56%)  0.354 (56%)
WKL (1500)     0.429 (30%)  0.480 (42%)  0.371 (40%)  0.397 (51%)  0.355 (50%)
WKL (2000)     0.454’(26%)  0.484 (39%)  0.361 (36%)  0.404 (47%)  0.355 (46%)
WKL (2500)     0.466’(24%)  0.484 (36%)  0.367 (33%)  0.411’(44%)  0.363 (42%)
WKL (5000)     0.747’(17%)  0.508’(28%)  0.380’(24%)  0.407’(36%)  0.374 (33%)
BOLstop        0.422 (61%)  0.463 (60%)  0.421 (59%)  0.408 (67%)  0.349 (67%)

Table D.13: Self-validation of different stopword discrimination measures. For each data set stopwords were identified according to the different measures and then applied to the same data set. [Section 5.4.2.1]

               springer     amazon       sda          wiki         nzz
BOL            0.421 (100%) 0.448 (100%) 0.456 (100%) 0.388 (100%) 0.349 (100%)
df (50)        0.420 (72%)  0.446 (77%)  0.450 (76%)  0.381 (82%)  0.351 (85%)
df (100)       0.415 (67%)  0.456 (70%)  0.445 (68%)  0.390 (76%)  0.352 (78%)
df (250)       0.398 (58%)  0.453 (59%)  0.432 (58%)  0.391 (67%)  0.352 (67%)
df (500)       0.424 (48%)  0.459 (52%)  0.418 (50%)  0.408 (59%)  0.355 (57%)
df (1000)      0.409 (39%)  0.466 (44%)  0.416 (40%)  0.415 (51%)  0.349 (47%)
df (1500)      0.406 (34%)  0.482’(39%)  0.413 (35%)  0.433 (47%)  0.347 (41%)
df (2000)      0.432 (30%)  0.489’(36%)  0.423 (32%)  0.435’(43%)  0.364 (38%)
df (2500)      0.437 (28%)  0.503’(33%)  0.420 (29%)  0.435’(41%)  0.368 (35%)
df (5000)      0.490’(22%)  0.559’(26%)  0.437’(22%)  0.451’(34%)  0.400 (27%)
discr (50)     0.419 (70%)  0.453 (76%)  0.448 (75%)  0.387 (81%)  0.350 (85%)
discr (100)    0.419 (64%)  0.460 (71%)  0.445 (68%)  0.394 (76%)  0.351 (79%)
discr (250)    0.426 (56%)  0.461 (64%)  0.424 (61%)  0.398 (68%)  0.353 (69%)
discr (500)    0.408 (50%)  0.457 (57%)  0.415 (54%)  0.403 (62%)  0.348 (61%)
discr (1000)   0.400 (42%)  0.458 (49%)  0.384 (47%)  0.414 (55%)  0.346 (52%)
discr (1500)   0.397 (37%)  0.458 (44%)  0.384 (42%)  0.398 (50%)  0.348 (46%)
discr (2000)   0.400 (34%)  0.468 (41%)  0.375 (39%)  0.400 (47%)  0.352 (42%)
discr (2500)   0.409 (31%)  0.475’(38%)  0.376 (36%)  0.399 (44%)  0.380 (39%)
discr (5000)   0.410’(24%)  0.498’(29%)  0.396 (28%)  0.408’(36%)  0.431 (30%)
χ2 (50)        0.420 (85%)  0.451 (89%)  0.452 (88%)  0.384 (90%)  0.347 (96%)
χ2 (100)       0.432 (85%)  0.457 (84%)  0.449 (83%)  0.389 (86%)  0.349 (92%)
χ2 (250)       0.416 (83%)  0.459 (74%)  0.444 (73%)  0.389 (78%)  0.349 (83%)
χ2 (500)       0.425 (79%)  0.455 (60%)  0.433 (62%)  0.392 (67%)  0.352 (70%)
χ2 (1000)      0.419 (77%)  0.462 (54%)  0.441 (57%)  0.401 (61%)  0.347 (63%)
χ2 (1500)      0.432 (75%)  0.468 (49%)  0.461 (53%)  0.401 (57%)  0.347 (58%)
χ2 (2000)      0.439 (72%)  0.480 (44%)  0.461 (49%)  0.402 (53%)  0.347 (53%)
χ2 (2500)      0.423 (70%)  0.483 (41%)  0.440 (46%)  0.406 (49%)  0.352 (49%)
χ2 (5000)      0.455 (57%)  0.501’(30%)  0.429 (36%)  0.418’(39%)  0.352 (36%)
E (50)         0.419 (94%)  0.452 (95%)  0.452 (95%)  0.392 (96%)  0.349 (95%)
E (100)        0.425 (87%)  0.452 (83%)  0.445 (83%)  0.396 (88%)  0.350 (86%)
E (250)        0.407 (65%)  0.463 (66%)  0.436 (66%)  0.394 (74%)  0.351 (73%)
E (500)        0.412 (49%)  0.459 (56%)  0.429 (56%)  0.403 (64%)  0.346 (62%)
E (1000)       0.415 (37%)  0.462 (47%)  0.413 (49%)  0.406 (55%)  0.344 (53%)
E (1500)       0.413’(32%)  0.462 (42%)  0.413 (44%)  0.398 (50%)  0.346 (47%)
E (2000)       0.428’(29%)  0.470’(39%)  0.417 (41%)  0.397’(46%)  0.347 (43%)
E (2500)       0.433’(27%)  0.475’(37%)  0.407 (38%)  0.396’(44%)  0.346 (40%)
E (5000)       0.437’(22%)  0.509’(29%)  0.454 (30%)  0.407’(37%)  0.421 (31%)
E’ (50)        0.417 (89%)  0.449 (93%)  0.456 (92%)  0.387 (94%)  0.349 (98%)
E’ (100)       0.417 (89%)  0.451 (93%)  0.454 (92%)  0.387 (94%)  0.349 (98%)
E’ (250)       0.422 (88%)  0.455 (90%)  0.454 (89%)  0.393 (92%)  0.350 (96%)
E’ (500)       0.420 (88%)  0.453 (89%)  0.451 (88%)  0.388 (90%)  0.347 (94%)
E’ (1000)      0.420 (87%)  0.449 (86%)  0.449 (86%)  0.386 (88%)  0.346 (92%)
E’ (1500)      0.421 (87%)  0.449 (83%)  0.447 (83%)  0.392 (85%)  0.347 (88%)
E’ (2000)      0.417 (86%)  0.459 (79%)  0.440 (79%)  0.395 (82%)  0.348 (83%)
E’ (2500)      0.414 (86%)  0.462 (76%)  0.434 (76%)  0.392 (78%)  0.346 (79%)
E’ (5000)      0.424 (84%)  0.477 (66%)  0.424 (67%)  0.399 (69%)  0.349 (70%)
E” (50)        0.414 (79%)  0.455 (76%)  0.445 (75%)  0.391 (81%)  0.349 (81%)
E” (100)       0.414 (60%)  0.459 (66%)  0.435 (63%)  0.384 (73%)  0.350 (76%)
E” (250)       0.403 (50%)  0.458 (59%)  0.419 (57%)  0.393 (65%)  0.350 (67%)
E” (500)       0.406 (42%)  0.459 (52%)  0.401 (51%)  0.400 (59%)  0.348 (58%)
E” (1000)      0.407 (35%)  0.465 (45%)  0.373 (44%)  0.403 (52%)  0.348 (51%)
E” (1500)      0.421 (31%)  0.470 (41%)  0.374 (39%)  0.395 (48%)  0.347 (46%)
E” (2000)      0.405 (29%)  0.475 (38%)  0.382 (36%)  0.401’(45%)  0.345 (43%)
E” (2500)      0.420 (27%)  0.474’(36%)  0.393 (34%)  0.403’(43%)  0.346 (40%)
E” (5000)      0.445’(22%)  0.494’(30%)  0.395 (28%)  0.404’(37%)  0.389 (32%)
KL (50)        0.425 (70%)  0.459 (76%)  0.449 (74%)  0.386 (80%)  0.352 (84%)
KL (100)       0.422 (66%)  0.461 (71%)  0.445 (68%)  0.386 (76%)  0.350 (78%)
KL (250)       0.436 (60%)  0.462 (63%)  0.424 (61%)  0.391 (68%)  0.352 (69%)
KL (500)       0.444 (54%)  0.459 (56%)  0.413 (54%)  0.401 (61%)  0.346 (61%)
KL (1000)      0.423 (46%)  0.458 (48%)  0.389 (47%)  0.404 (54%)  0.346 (51%)
KL (1500)      0.415 (39%)  0.470’(43%)  0.387 (42%)  0.409 (50%)  0.346 (46%)
KL (2000)      0.403 (36%)  0.470’(40%)  0.374 (39%)  0.402 (47%)  0.346 (42%)
KL (2500)      0.409 (33%)  0.478’(37%)  0.395 (36%)  0.405 (44%)  0.387 (39%)
KL (5000)      0.443 (26%)  0.506’(29%)  0.440 (28%)  0.409’(36%)  0.430 (30%)
WKL (50)       0.428 (71%)  0.460 (77%)  0.449 (74%)  0.392 (80%)  0.350 (85%)
WKL (100)      0.417 (67%)  0.454 (71%)  0.445 (69%)  0.385 (76%)  0.350 (78%)
WKL (250)      0.409 (58%)  0.457 (61%)  0.421 (59%)  0.399 (66%)  0.350 (67%)
WKL (500)      0.412 (51%)  0.458 (53%)  0.398 (52%)  0.398 (59%)  0.349 (58%)
WKL (1000)     0.412 (41%)  0.459 (45%)  0.379 (43%)  0.413 (51%)  0.352 (48%)
WKL (1500)     0.407 (35%)  0.471’(40%)  0.389 (38%)  0.410’(47%)  0.351 (42%)
WKL (2000)     0.417 (32%)  0.470’(37%)  0.390 (34%)  0.423’(43%)  0.411 (38%)
WKL (2500)     0.421 (29%)  0.485’(34%)  0.393 (31%)  0.425’(41%)  0.440 (35%)
WKL (5000)     0.461’(23%)  0.527’(27%)  0.416 (24%)  0.442’(34%)  0.437 (27%)
BOLstop        0.422 (61%)  0.463 (60%)  0.421 (59%)  0.408 (67%)  0.349 (67%)

Table D.14: Cross-validation of different stopword discrimination measures (union). Stopword lists are derived by unifying the stopword candidate lists of the other four data sets. [Section 5.4.2.2]

               springer     amazon       sda          wiki         nzz
BOL            0.421 (100%) 0.448 (100%) 0.456 (100%) 0.388 (100%) 0.349 (100%)
df (50)        0.420 (76%)  0.455 (84%)  0.455 (81%)  0.388 (86%)  0.349 (91%)
df (100)       0.426 (72%)  0.449 (80%)  0.451 (77%)  0.386 (83%)  0.350 (87%)
df (250)       0.421 (68%)  0.463 (74%)  0.446 (71%)  0.386 (79%)  0.350 (82%)
df (500)       0.416 (64%)  0.459 (69%)  0.441 (67%)  0.388 (75%)  0.347 (77%)
df (1000)      0.422 (57%)  0.460 (62%)  0.436 (60%)  0.387 (68%)  0.347 (70%)
df (1500)      0.423 (51%)  0.458 (58%)  0.428 (57%)  0.391 (65%)  0.351 (65%)
df (2000)      0.425 (47%)  0.463 (54%)  0.425 (54%)  0.392 (61%)  0.354 (61%)
df (2500)      0.410 (44%)  0.463 (52%)  0.421 (51%)  0.399 (59%)  0.347 (58%)
df (5000)      0.422 (36%)  0.473 (44%)  0.430 (43%)  0.414 (51%)  0.347 (49%)
discr (50)     0.416 (84%)  0.449 (90%)  0.456 (87%)  0.388 (92%)  0.350 (95%)
discr (100)    0.412 (80%)  0.452 (86%)  0.455 (83%)  0.386 (89%)  0.349 (93%)
discr (250)    0.417 (75%)  0.450 (82%)  0.450 (78%)  0.388 (85%)  0.353 (89%)
discr (500)    0.416 (71%)  0.455 (77%)  0.447 (74%)  0.389 (82%)  0.350 (85%)
discr (1000)   0.422 (68%)  0.461 (73%)  0.443 (69%)  0.389 (78%)  0.351 (80%)
discr (1500)   0.426 (66%)  0.457 (69%)  0.441 (65%)  0.390 (75%)  0.347 (76%)
discr (2000)   0.425 (64%)  0.462 (67%)  0.436 (63%)  0.387 (72%)  0.348 (73%)
discr (2500)   0.425 (62%)  0.462 (65%)  0.429 (60%)  0.397 (70%)  0.350 (70%)
discr (5000)   0.439 (55%)  0.465 (59%)  0.412 (53%)  0.402 (63%)  0.351 (62%)
χ2 (50)        ————         ————         ————         ————         ————
χ2 (100)       ————         ————         ————         ————         ————
χ2 (250)       ————         0.453 (99%)  ————         0.389 (99%)  ————
χ2 (500)       ————         0.451 (98%)  ————         0.386 (98%)  ————
χ2 (1000)      ————         0.451 (98%)  ————         0.386 (98%)  ————
χ2 (1500)      0.418 (100%) 0.449 (98%)  ————         0.389 (98%)  0.351 (100%)
χ2 (2000)      0.418 (100%) 0.449 (98%)  ————         0.389 (98%)  0.350 (100%)
χ2 (2500)      0.418 (100%) 0.448 (97%)  0.457 (99%)  0.392 (98%)  0.352 (100%)
χ2 (5000)      0.418 (100%) 0.452 (93%)  0.457 (96%)  0.385 (94%)  0.350 (98%)
E (250)        0.416 (100%) 0.450 (98%)  0.456 (98%)  ————         0.351 (99%)
E (500)        0.414 (99%)  0.446 (95%)  0.454 (95%)  0.390 (100%) 0.348 (96%)
E (1000)       0.414 (91%)  0.453 (85%)  0.447 (89%)  0.388 (94%)  0.349 (92%)
E (1500)       0.420 (82%)  0.458 (78%)  0.445 (81%)  0.389 (88%)  0.349 (88%)
E (2000)       0.410 (71%)  0.459 (73%)  0.442 (74%)  0.391 (82%)  0.349 (82%)
E (2500)       0.406 (68%)  0.454 (70%)  0.438 (70%)  0.391 (79%)  0.350 (80%)
E (5000)       0.414 (54%)  0.463 (58%)  0.416 (55%)  0.404 (65%)  0.354 (64%)
E’ (50)        ————         ————         ————         ————         ————
E’ (100)       ————         ————         ————         ————         ————
E’ (250)       ————         ————         ————         ————         ————
E’ (500)       ————         ————         ————         ————         ————
E’ (1000)      ————         ————         ————         ————         ————
E’ (1500)      ————         ————         ————         ————         ————
E’ (2000)      ————         ————         ————         ————         ————
E’ (2500)      ————         0.448 (100%) 0.456 (100%) 0.388 (100%) 0.349 (100%)
E’ (5000)      0.421 (100%) 0.448 (100%) 0.456 (100%) 0.389 (100%) 0.349 (100%)
E” (50)        ————         0.449 (100%) ————         ————         0.350 (100%)
E” (100)       0.418 (97%)  0.452 (97%)  0.456 (98%)  0.391 (96%)  0.350 (96%)
E” (250)       0.425 (87%)  0.454 (89%)  0.449 (89%)  0.387 (92%)  0.349 (87%)
E” (500)       0.412 (67%)  0.460 (71%)  0.446 (70%)  0.390 (77%)  0.348 (80%)
E” (1000)      0.417 (63%)  0.459 (66%)  0.438 (65%)  0.390 (74%)  0.347 (76%)
E” (1500)      0.408 (60%)  0.462 (64%)  0.429 (62%)  0.389 (71%)  0.346 (72%)
E” (2000)      0.413 (58%)  0.466 (62%)  0.425 (60%)  0.385 (70%)  0.346 (71%)
E” (2500)      0.412 (56%)  0.461 (60%)  0.420 (58%)  0.392 (68%)  0.352 (69%)
E” (5000)      0.421 (52%)  0.464 (55%)  0.417 (53%)  0.390 (63%)  0.349 (63%)
KL (50)        0.419 (83%)  0.451 (89%)  0.457 (86%)  0.387 (91%)  0.347 (95%)
KL (100)       0.423 (80%)  0.447 (86%)  0.454 (81%)  0.391 (88%)  0.349 (92%)
KL (250)       0.420 (75%)  0.454 (79%)  0.448 (74%)  0.391 (84%)  0.353 (87%)
KL (500)       0.448 (70%)  0.460 (75%)  0.447 (71%)  0.394 (79%)  0.350 (82%)
KL (1000)      0.422 (66%)  0.458 (69%)  0.440 (65%)  0.394 (74%)  0.347 (75%)
KL (1500)      0.435 (64%)  0.460 (66%)  0.428 (61%)  0.391 (71%)  0.346 (71%)
KL (2000)      0.425 (61%)  0.460 (64%)  0.421 (59%)  0.392 (68%)  0.350 (68%)
KL (2500)      0.418 (59%)  0.468 (62%)  0.422 (56%)  0.394 (67%)  0.349 (65%)
KL (5000)      0.421 (52%)  0.464 (54%)  0.392 (50%)  0.400 (59%)  0.353 (57%)
WKL (50)       0.421 (78%)  0.446 (85%)  0.455 (83%)  0.387 (87%)  0.348 (91%)
WKL (100)      0.426 (73%)  0.456 (80%)  0.451 (77%)  0.384 (83%)  0.348 (88%)
WKL (250)      0.426 (69%)  0.454 (73%)  0.447 (70%)  0.387 (78%)  0.347 (81%)
WKL (500)      0.418 (64%)  0.464 (69%)  0.443 (67%)  0.388 (74%)  0.347 (76%)
WKL (1000)     0.409 (57%)  0.461 (62%)  0.423 (60%)  0.388 (67%)  0.347 (69%)
WKL (1500)     0.416 (52%)  0.458 (58%)  0.420 (56%)  0.395 (64%)  0.349 (64%)
WKL (2000)     0.404 (49%)  0.464 (55%)  0.405 (53%)  0.396 (61%)  0.349 (60%)
WKL (2500)     0.388 (45%)  0.465 (52%)  0.401 (51%)  0.402 (59%)  0.352 (57%)
WKL (5000)     0.413 (36%)  0.470 (44%)  0.409 (43%)  0.403 (51%)  0.345 (48%)
BOLstop        0.422 (61%)  0.463 (60%)  0.421 (59%)  0.408 (67%)  0.349 (67%)

Table D.15: Cross-validation of different stopword discrimination measures (intersection). Stopword lists are derived by intersecting the stopword candidate lists of the other four data sets. [Section 5.4.2.2]

amazon
size  selected labels  BOL  BOLstop
5   ARCH, BUSI, HIST, KUNS, RATG  0.511  0.451
    BELL, COMP, INGE, LIFE, REIS  0.276  0.270
    BIOC, EROS, KIND, MATH, RELI  0.283  0.280
    BIOG, GERM, KULT, PUBL, SCIF  0.496  0.480
    BUSI, KIND, PUBL, RELI, SPOR  0.285  0.271
7   ARCH, BIOG, EROS, INGE, KUNS, PUBL, RELI  0.390  0.379
    BELL, BUSI, GERM, KIND, LIFE, RATG, SCIF  0.394  0.392
    BELL, COMP, EROS, KIND, MATH, PUBL, SCIF  0.320  0.315
    BIOC, COMP, HIST, KULT, MATH, REIS, SPOR  0.549  0.541
    BIOG, GERM, KULT, KUNS, RATG, RELI, SPOR  0.553  0.538
9   ARCH, BIOC, BUSI, EROS, HIST, KIND, KUNS, MATH, RATG  0.356  0.352
    ARCH, BIOG, BUSI, GERM, HIST, KULT, KUNS, PUBL, RELI  0.547  0.528
    BELL, BIOC, COMP, EROS, INGE, KIND, LIFE, MATH, SCIF  0.376  0.400
    BELL, BIOG, COMP, GERM, INGE, KULT, LIFE, PUBL, REIS  0.438  0.427
    BIOC, COMP, GERM, HIST, RATG, REIS, RELI, SCIF, SPOR  0.495  0.475
11  ARCH, BELL, BIOC, BIOG, BUSI, COMP, EROS, GERM, HIST, INGE, KIND  0.451  0.440
    ARCH, BIOC, BUSI, EROS, HIST, KIND, KUNS, MATH, RATG, RELI, SPOR  0.403  0.393
    BELL, BIOG, COMP, GERM, INGE, KULT, LIFE, PUBL, REIS, SCIF, SPOR  0.424  0.424
    BIOC, BIOG, BUSI, HIST, INGE, KIND, LIFE, MATH, RATG, RELI, SCIF  0.420  0.424
    KIND, KULT, KUNS, LIFE, MATH, PUBL, RATG, REIS, RELI, SCIF, SPOR  0.362  0.357
13  ARCH, BELL, BIOC, BIOG, BUSI, COMP, EROS, GERM, HIST, INGE, KIND, KULT, KUNS  0.461  0.454
    ARCH, BELL, BIOC, BUSI, EROS, HIST, KIND, KUNS, MATH, RATG, RELI, SCIF, SPOR  0.403  0.402
    ARCH, BELL, KIND, KULT, KUNS, LIFE, MATH, PUBL, RATG, REIS, RELI, SCIF, SPOR  0.388  0.392
    BELL, BIOC, BIOG, COMP, GERM, INGE, KULT, LIFE, PUBL, RATG, REIS, SCIF, SPOR  0.463  0.459
    BIOC, BIOG, BUSI, EROS, HIST, INGE, KIND, LIFE, MATH, RATG, REIS, RELI, SCIF  0.416  0.413
15  ARCH, BELL, BIOC, BIOG, BUSI, COMP, EROS, GERM, HIST, INGE, KIND, KULT, KUNS, LIFE, MATH  0.445  0.449
    ARCH, BELL, BIOC, BIOG, BUSI, EROS, HIST, KIND, KUNS, MATH, RATG, REIS, RELI, SCIF, SPOR  0.428  0.440
    ARCH, BELL, BIOC, BIOG, KIND, KULT, KUNS, LIFE, MATH, PUBL, RATG, REIS, RELI, SCIF, SPOR  0.432  0.432
    BELL, BIOC, BIOG, BUSI, COMP, GERM, INGE, KULT, LIFE, MATH, PUBL, RATG, REIS, SCIF, SPOR  0.461  0.473
    BIOC, BIOG, BUSI, EROS, GERM, HIST, INGE, KIND, LIFE, MATH, PUBL, RATG, REIS, RELI, SCIF  0.420  0.416
17  ARCH, BELL, BIOC, BIOG, BUSI, COMP, EROS, GERM, HIST, INGE, KIND, KULT, KUNS, LIFE, MATH, PUBL, RATG  0.456  0.454
    ARCH, BELL, BIOC, BIOG, BUSI, COMP, EROS, HIST, KIND, KUNS, MATH, PUBL, RATG, REIS, RELI, SCIF, SPOR  0.422  0.451
    ARCH, BELL, BIOC, BIOG, BUSI, EROS, KIND, KULT, KUNS, LIFE, MATH, PUBL, RATG, REIS, RELI, SCIF, SPOR  0.429  0.431
    BELL, BIOC, BIOG, BUSI, COMP, EROS, GERM, INGE, KULT, KUNS, LIFE, MATH, PUBL, RATG, REIS, SCIF, SPOR  0.457  0.459
    BIOC, BIOG, BUSI, EROS, GERM, HIST, INGE, KIND, KULT, KUNS, LIFE, MATH, PUBL, RATG, REIS, RELI, SCIF  0.438  0.431
19  ARCH, BELL, BIOC, BIOG, BUSI, COMP, EROS, GERM, HIST, INGE, KIND, KULT, KUNS, LIFE, MATH, PUBL, RATG, REIS, RELI  0.439  0.461
    ARCH, BELL, BIOC, BIOG, BUSI, COMP, EROS, GERM, HIST, KIND, KUNS, LIFE, MATH, PUBL, RATG, REIS, RELI, SCIF, SPOR  0.437  0.457
    ARCH, BELL, BIOC, BIOG, BUSI, EROS, GERM, HIST, KIND, KULT, KUNS, LIFE, MATH, PUBL, RATG, REIS, RELI, SCIF, SPOR  0.449  0.450
    ARCH, BIOC, BIOG, BUSI, COMP, EROS, GERM, HIST, INGE, KIND, KULT, KUNS, LIFE, MATH, PUBL, RATG, REIS, RELI, SCIF  0.443  0.438
    BELL, BIOC, BIOG, BUSI, COMP, EROS, GERM, HIST, INGE, KIND, KULT, KUNS, LIFE, MATH, PUBL, RATG, REIS, SCIF, SPOR  0.459  0.456

Table D.16: Subset experiments for stopword removal (amazon). [Section 5.6]

wiki
size  selected labels  BOL  BOLstop
5   ASTR, FREI, KUNS, MEDZ, SEXU  0.288  0.274
    BIOL, HIST, LIFE, ORGA, SOZI  0.386  0.362
    BUSI, INFO, LITE, PHYS, SPOR  0.246  0.243
    CHEM, JURA, MATH, RELI, SPRA  0.304  0.337
    FREI, LITE, RELI, SPOR, TECH  0.372  0.372
7   ASTR, CHEM, INFO, LIFE, MEDZ, RELI, SPOR  0.193  0.199
    BIOL, FREI, JURA, LITE, ORGA, SEXU, SPRA  0.325  0.298
    BIOL, HIST, INFO, LITE, PHYS, RELI, SPRA  0.422  0.435
    BUSI, HIST, KUNS, MATH, PHYS, SOZI, TECH  0.500  0.416
    CHEM, JURA, MATH, MEDZ, SEXU, SPOR, TECH  0.195  0.189
9   ASTR, BUSI, FREI, INFO, KUNS, LITE, MEDZ, PHYS, SEXU  0.327  0.320
    ASTR, CHEM, FREI, JURA, KUNS, MATH, MEDZ, RELI, SPOR  0.272  0.270
    BIOL, BUSI, HIST, INFO, LIFE, LITE, ORGA, PHYS, SPRA  0.390  0.389
    BIOL, CHEM, HIST, JURA, LIFE, MATH, ORGA, RELI, SOZI  0.337  0.314
    BUSI, HIST, JURA, KUNS, SEXU, SOZI, SPOR, SPRA, TECH  0.430  0.410
11  ASTR, BIOL, BUSI, CHEM, FREI, HIST, INFO, JURA, KUNS, LIFE, LITE  0.386  0.394
    ASTR, BUSI, FREI, INFO, KUNS, LITE, MEDZ, PHYS, SEXU, SPOR, TECH  0.375  0.371
    BIOL, CHEM, HIST, JURA, LIFE, MATH, ORGA, RELI, SOZI, SPRA, TECH  0.370  0.366
    BUSI, CHEM, FREI, KUNS, LIFE, LITE, ORGA, PHYS, SEXU, SPOR, SPRA  0.373  0.353
    LITE, MATH, MEDZ, ORGA, PHYS, RELI, SEXU, SOZI, SPOR, SPRA, TECH  0.400  0.388
13  ASTR, BIOL, BUSI, CHEM, FREI, HIST, INFO, JURA, KUNS, LIFE, LITE, MATH, MEDZ  0.386  0.392
    ASTR, BIOL, BUSI, FREI, INFO, KUNS, LITE, MEDZ, PHYS, SEXU, SPOR, SPRA, TECH  0.388  0.402
    ASTR, BIOL, LITE, MATH, MEDZ, ORGA, PHYS, RELI, SEXU, SOZI, SPOR, SPRA, TECH  0.364  0.338
    BIOL, BUSI, CHEM, HIST, JURA, LIFE, MATH, ORGA, RELI, SEXU, SOZI, SPRA, TECH  0.406  0.393
    BUSI, CHEM, FREI, INFO, KUNS, LIFE, LITE, ORGA, PHYS, SEXU, SOZI, SPOR, SPRA  0.379  0.349
15  ASTR, BIOL, BUSI, CHEM, FREI, HIST, INFO, JURA, KUNS, LIFE, LITE, MATH, MEDZ, ORGA, PHYS  0.389  0.382
    ASTR, BIOL, BUSI, CHEM, FREI, INFO, KUNS, LITE, MEDZ, PHYS, SEXU, SOZI, SPOR, SPRA, TECH  0.387  0.405
    ASTR, BIOL, BUSI, CHEM, LITE, MATH, MEDZ, ORGA, PHYS, RELI, SEXU, SOZI, SPOR, SPRA, TECH  0.384  0.382
    BIOL, BUSI, CHEM, FREI, HIST, JURA, LIFE, MATH, ORGA, PHYS, RELI, SEXU, SOZI, SPRA, TECH  0.384  0.398
    BUSI, CHEM, FREI, INFO, JURA, KUNS, LIFE, LITE, ORGA, PHYS, RELI, SEXU, SOZI, SPOR, SPRA  0.375  0.371
17  ASTR, BIOL, BUSI, CHEM, FREI, HIST, INFO, JURA, KUNS, LIFE, LITE, MATH, MEDZ, ORGA, PHYS, RELI, SEXU  0.404  0.401
    ASTR, BIOL, BUSI, CHEM, FREI, HIST, INFO, KUNS, LITE, MEDZ, PHYS, RELI, SEXU, SOZI, SPOR, SPRA, TECH  0.438  0.428
    ASTR, BIOL, BUSI, CHEM, FREI, INFO, LITE, MATH, MEDZ, ORGA, PHYS, RELI, SEXU, SOZI, SPOR, SPRA, TECH  0.381  0.384
    BIOL, BUSI, CHEM, FREI, HIST, INFO, JURA, LIFE, MATH, MEDZ, ORGA, PHYS, RELI, SEXU, SOZI, SPRA, TECH  0.390  0.393
    BUSI, CHEM, FREI, INFO, JURA, KUNS, LIFE, LITE, MATH, MEDZ, ORGA, PHYS, RELI, SEXU, SOZI, SPOR, SPRA  0.372  0.348
19  ASTR, BIOL, BUSI, CHEM, FREI, HIST, INFO, JURA, KUNS, LIFE, LITE, MATH, MEDZ, ORGA, PHYS, RELI, SEXU, SOZI, SPOR  0.397  0.377
    ASTR, BIOL, BUSI, CHEM, FREI, HIST, INFO, JURA, KUNS, LITE, MEDZ, ORGA, PHYS, RELI, SEXU, SOZI, SPOR, SPRA, TECH  0.416  0.417
    ASTR, BIOL, BUSI, CHEM, FREI, INFO, JURA, KUNS, LITE, MATH, MEDZ, ORGA, PHYS, RELI, SEXU, SOZI, SPOR, SPRA, TECH  0.389  0.397
    ASTR, BUSI, CHEM, FREI, HIST, INFO, JURA, KUNS, LIFE, LITE, MATH, MEDZ, ORGA, PHYS, RELI, SEXU, SOZI, SPOR, SPRA  0.430  0.412
    BIOL, BUSI, CHEM, FREI, HIST, INFO, JURA, KUNS, LIFE, LITE, MATH, MEDZ, ORGA, PHYS, RELI, SEXU, SOZI, SPRA, TECH  0.414  0.399

Table D.17: Subset experiments for stopword removal (wiki). [Section 5.6]

POS selection                              springer     amazon       sda          wiki         nzz
adj (a)                                    0.671’(17%)  0.630’(16%)  0.647’(11%)  0.651’(14%)  0.509 (17%)
vrb (b)                                    0.904’(12%)  0.695’(15%)  0.716’(17%)  0.725’(13%)  0.650 (16%)
sub                                        0.440 (37%)  0.479 (31%)  0.440 (35%)  0.408 (36%)  0.367 (37%)
sub, adj                                   0.403 (54%)  0.457 (47%)  0.394 (46%)  0.405 (50%)  0.355 (54%)
sub, vrb                                   0.448 (50%)  0.467 (46%)  0.486 (52%)  0.403 (50%)  0.389 (53%)
sub, app                                   0.447 (47%)  0.470 (39%)  0.456 (46%)  0.404 (43%)  0.365 (43%)
sub, num                                   0.442 (38%)  0.478 (32%)  0.457 (36%)  0.401 (37%)  0.367 (38%)
sub, unk                                   0.441 (38%)  0.474 (32%)  0.444 (35%)  0.401 (37%)  0.369 (37%)
sub, nam                                   0.443 (39%)  0.474 (36%)  0.413 (40%)  0.399 (44%)  0.351 (41%)
sub, namall                                0.450 (39%)  0.478 (39%)  0.406 (41%)  0.403 (48%)  0.349 (42%)
sub, nam, unk                              0.444 (40%)  0.474 (37%)  0.393 (40%)  0.397 (45%)  0.349 (41%)
sub, nam, unk, num                         0.439 (40%)  0.473 (38%)  0.398 (41%)  0.404 (46%)  0.350 (42%)
sub, unk, namall                           0.446 (40%)  0.477 (40%)  0.404 (42%)  0.406 (49%)  0.350 (43%)
sub, adj, namall                           0.402 (57%)  0.468 (55%)  0.416 (53%)  0.393 (61%)  0.348 (59%)
sub, adj, unk                              0.417 (55%)  0.454 (48%)  0.423 (47%)  0.416 (50%)  0.355 (54%)
sub, adj, unk, namall                      0.398 (58%)  0.468 (56%)  0.415 (53%)  0.407 (62%)  0.355 (60%)
sub, adj, vrb                              0.412 (67%)  0.454 (62%)  0.480 (63%)  0.402 (63%)  0.364 (69%)
sub, adj, vrb, namall                      0.410 (69%)  0.461 (70%)  0.436 (69%)  0.388 (75%)  0.350 (75%)
sub, adj, vrb, unk, namall                 0.411 (70%)  0.460 (71%)  0.434 (70%)  0.390 (75%)  0.350 (76%)
sub, adj, vrb, adv, app, namall, num, unk  0.414 (84%)  0.457 (86%)  0.448 (87%)  0.388 (89%)  0.347 (90%)
all                                        0.421 (100%) 0.448 (100%) 0.456 (100%) 0.388 (100%) 0.349 (100%)
BOLstop                                    0.422 (61%)  0.463 (60%)  0.421 (59%)  0.408 (67%)  0.349 (67%)
BOWstop, stem (Baseline 1)                 0.410 (62%)  0.468 (61%)  0.431 (61%)  0.416 (68%)  0.348 (69%)

(a) 4, 307, 116, 747 and 0 documents, respectively, were omitted because of a lack of shared adjectives.
(b) 16, 679, 10, 120 and 0 documents, respectively, were omitted because of a lack of shared verbs.

Table D.18: POS-based feature selection. Percentages indicate the proportion of features surviving the selection step. [Section 5.5]
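
A sketch of POS-based selection (illustrative; the tag names follow the labels of Table D.18, and `tagged_docs` holds (lemma, POS) pairs as a tagger might emit them):

    def pos_select(tagged_docs, keep_pos=frozenset({"sub", "adj", "namall"})):
        # keep only lemmata whose POS tag belongs to the selected categories
        return [[lemma for lemma, pos in doc if pos in keep_pos]
                for doc in tagged_docs]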

POS classes         weight  springer      amazon        sda           wiki          nzz
—                   —       0.421[0.015]  0.448[0.007]  0.456[0.001]  0.388[0.007]  0.349[0.005]
sub                 1.5     0.419[0.017]  0.451[0.006]  0.439[0.001]  0.394[0.015]  0.344[0.001]
sub                 2       0.422[0.017]  0.455[0.009]  0.418[0.023]  0.396[0.019]  0.345[0.003]
sub                 3       0.432[0.012]  0.461[0.004]  0.433[0.030]  0.404[0.015]  0.351[0.003]
namall              1.5     0.431[0.010]  0.469[0.012]  0.445[0.001]  0.401[0.009]  0.359[0.008]
namall              2       0.421[0.013]  0.490[0.006]  0.442[0.000]  0.419[0.011]  0.373[0.003]
namall              3       0.445[0.012]  0.508[0.007]  0.440[0.002]  0.433[0.009]  0.396[0.015]
sub, adj            1.5     0.415[0.013]  0.446[0.005]  0.446[0.002]  0.387[0.018]  0.344[0.001]
sub, adj            2       0.414[0.011]  0.447[0.005]  0.451[0.009]  0.390[0.017]  0.347[0.002]
sub, adj            3       0.419[0.014]  0.449[0.007]  0.443[0.026]  0.400[0.017]  0.350[0.004]
sub, namall         1.5     0.422[0.019]  0.462[0.008]  0.437[0.000]  0.393[0.010]  0.343[0.005]
sub, namall         2       0.418[0.012]  0.461[0.005]  0.426[0.003]  0.398[0.009]  0.343[0.003]
sub, namall         3       0.431[0.018]  0.462[0.004]  0.417[0.002]  0.397[0.016]  0.341[0.001]
sub, adj, namall    1.5     0.408[0.007]  0.455[0.010]  0.439[0.001]  0.388[0.006]  0.342[0.001]
sub, adj, namall    2       0.414[0.007]  0.456[0.008]  0.435[0.001]  0.390[0.008]  0.342[0.001]
sub, adj, namall    3       0.416[0.012]  0.460[0.010]  0.431[0.001]  0.396[0.017]  0.341[0.001]

Table D.19: Lending extra weight to chosen POS categories. [Section 5.7.1]
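
Instead of discarding features, this weighting variant multiplies the frequencies of the chosen POS categories by a factor (1.5, 2 or 3) before the usual tf-idf weighting. A minimal sketch, again with invented names:

    from collections import Counter

    def pos_weight(tagged_doc, boosted=frozenset({"sub", "adj"}), factor=1.5):
        weights = Counter()
        for lemma, pos in tagged_doc:
            weights[lemma] += factor if pos in boosted else 1.0
        return weights   # boosted pseudo-frequencies, fed into tf-idf as usual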

stopword weight γ   springer      amazon        sda           wiki          nzz
1                   0.421[0.015]  0.448[0.007]  0.456[0.001]  0.388[0.007]  0.349[0.005]
0.9                 0.417[0.012]  0.453[0.007]  0.450[0.002]  0.386[0.009]  0.349[0.004]
0.75                0.413[0.012]  0.458[0.008]  0.443[0.001]  0.390[0.011]  0.348[0.005]
0.5                 0.417[0.018]  0.461[0.004]  0.424[0.001]  0.416[0.023]  0.347[0.004]
0.25                0.412[0.016]  0.465[0.009]  0.422[0.002]  0.411[0.020]  0.347[0.004]
0.1                 0.425[0.016]  0.460[0.003]  0.422[0.000]  0.408[0.012]  0.349[0.004]
0                   0.422[0.018]  0.463[0.005]  0.421[0.001]  0.408[0.019]  0.349[0.005]

Table D.20: Stopword weighting instead of elimination. From treating stopwords like all other words (γ = 1) to complete stopword removal (γ = 0). [Section 5.7.2]
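
In other words, every term weight is scaled by γ whenever the term appears on the stoplist (a sketch of the scheme exactly as the caption describes it):

    \[ w_\gamma(t,d) \;=\;
       \begin{cases}
         \gamma \cdot w(t,d) & \text{if } t \text{ is a stopword},\\
         w(t,d)              & \text{otherwise},
       \end{cases}
       \qquad 0 \le \gamma \le 1 . \]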

 n   weight  springer      amazon        sda           wiki          nzz
 0   —       0.421[0.015]  0.448[0.007]  0.456[0.001]  0.388[0.007]  0.349[0.005]
 5   2       0.430[0.012]  0.459[0.009]  0.452[0.005]  0.399[0.021]  0.349[0.005]
 5   3       0.424[0.015]  0.467[0.007]  0.450[0.004]  0.419[0.009]  0.347[0.005]
10   2       0.423[0.015]  0.461[0.009]  0.446[0.003]  0.384[0.015]  0.347[0.004]
10   3       0.435[0.016]  0.465[0.003]  0.448[0.004]  0.397[0.014]  0.346[0.003]

Table D.21: Lending extra weight to the first n nouns. [Section 5.7.3]

                                            springer  amazon  sda    wiki   nzz
S1: BOW                                     0.476     0.460   0.474  0.392  0.348
S2: BOWstem                                 0.408     0.458   0.463  0.386  0.346
S3: BOWstop                                 0.449     0.466   0.436  0.416  0.346
S4: BOWstop, stem                           0.410     0.468   0.431  0.416  0.348
S5: best of S1–S4                           0.408     0.458   0.431  0.386  0.346
L1: BOL                                     0.421     0.448   0.456  0.388  0.349
L2: BOLstop                                 0.422     0.463   0.421  0.408  0.349
L3: POS selection: sub, adj, namall         0.402     0.468   0.416  0.393  0.348
L4: POS weighting (1.5): sub, adj           0.415     0.446   0.446  0.387  0.344
L5: POS weighting (1.5): sub, adj, namall   0.408     0.455   0.439  0.388  0.342
L6: best of L1–L5                           0.402     0.446   0.416  0.387  0.342
M1: stopwords with E′′ (Union, α = 250)     0.403     0.458   0.419  0.393  0.350
M2: local pruning (0–10%)                   0.416     0.460   0.448  0.394  0.353

Table D.22: Linguistic and statistical reduction methods in comparison. [Section 5.8]

D.3 Experiments in Chapter 6 (Matrix Enhancement)

                        springer     amazon       sda          wiki         nzz
BOWstop                 0.449 (100%) 0.466 (100%) 0.436 (100%) 0.416 (100%) 0.346 (100%)
BOWstop, stem           0.410 (97%)  0.468 (96%)  0.431 (96%)  0.416 (94%)  0.348 (94%)
BOWstop, split, stem    0.424 (102%) 0.466 (99%)  0.416 (101%) 0.407 (98%)  0.347 (97%)
BOWstop, *split, stem   0.417 (98%)  0.470 (97%)  0.419 (97%)  0.409 (95%)  0.348 (94%)

BOLstop                 0.422 (100%) 0.463 (100%) 0.421 (100%) 0.408 (100%) 0.349 (100%)
BOLstop, split          0.435 (104%) 0.464 (103%) 0.412 (105%) 0.406 (104%) 0.350 (102%)
BOLstop, *split         0.425 (101%) 0.462 (101%) 0.415 (101%) 0.402 (101%) 0.351 (100%)

Table D.23: Splitting mechanical compounds. BOW and BOL extended by splitting “mechanical” compounds (A-B and A/B features). Stars refer to experiments in which the compound token was discarded after analysis and replaced by its constituents. [Section 6.1.1]
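
A sketch of such “mechanical” splitting (illustrative): tokens joined by a hyphen or slash are split into their parts; with discard=True the compound token itself is dropped, as in the starred variants above.

    import re

    def split_mechanical(tokens, discard=False):
        out = []
        for tok in tokens:
            parts = re.split(r"[-/]", tok)
            if len(parts) > 1:
                if not discard:
                    out.append(tok)                    # keep the compound itself
                out.extend(p for p in parts if p)      # add the constituents
            else:
                out.append(tok)
        return out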

                  springer     amazon       sda          wiki         nzz
BOLstop           0.422 (100%) 0.463 (100%) 0.421 (100%) 0.408 (100%) 0.349 (100%)
BOLstop, split    0.360 (138%) 0.450 (122%) 0.373 (131%) 0.384 (124%) 0.346 (124%)
BOLstop, *split   0.355 (115%) 0.467 (108%) 0.407 (111%) 0.400 (106%) 0.407 (106%)

Table D.24: Splitting all compounds (after stopword removal). [Section 6.1.2]

                                         springer      amazon        sda           wiki          nzz
BOLstop                                  0.422 (100%)  0.463 (100%)  0.421 (100%)  0.408 (100%)  0.349 (100%)
split all
BOLstop, split                           0.360 (138%)  0.450 (122%)  0.373 (131%)  0.384 (124%)  0.346 (124%)
l: minimum word length
l10   word length >= 10                  0.358 (134%)  0.459 (118%)  0.371 (126%)  0.386 (120%)  0.344 (121%)
l15   word length >= 15                  0.363 (119%)  0.463 (107%)  0.410 (112%)  0.389 (109%)  0.354 (109%)
l20   word length >= 20                  0.383 (106%)  0.461 (102%)  0.418 (103%)  0.409 (102%)  0.350 (102%)
l25   word length >= 25                  0.419 (101%)  0.462 (100%)  0.421 (101%)  0.413 (101%)  0.348 (100%)
d: maximum document frequency
d1    doc. freq. <= 1                    0.359 (115%)  0.459 (104%)  0.420 (104%)  0.408 (106%)  0.344 (104%)
d5    doc. freq. <= 5                    0.350 (124%)  0.462 (108%)  0.413 (108%)  0.389 (111%)  0.343 (108%)
d15   doc. freq. <= 15                   0.347* (129%) 0.462 (112%)  0.393 (112%)  0.392 (115%)  0.342* (111%)
d25   doc. freq. <= 25                   0.351 (132%)  0.462 (113%)  0.381 (114%)  0.389 (116%)  0.344 (113%)
d50   doc. freq. <= 50                   0.353 (134%)  0.454 (116%)  0.375 (118%)  0.389 (119%)  0.346 (116%)
d75   doc. freq. <= 75                   0.358 (135%)  0.457 (117%)  0.369 (120%)  0.394 (120%)  0.346 (117%)
d100  doc. freq. <= 100                  0.360 (135%)  0.453 (117%)  0.371 (121%)  0.386 (120%)  0.346 (118%)
d150  doc. freq. <= 150                  0.363 (136%)  0.449 (118%)  0.368* (123%) 0.390 (121%)  0.346 (120%)
d200  doc. freq. <= 200                  0.360 (137%)  0.445* (119%) 0.370 (124%)  0.384 (122%)  0.346 (120%)
d250  doc. freq. <= 250                  0.360 (137%)  0.450 (119%)  0.370 (125%)  0.389 (122%)  0.346 (121%)
d500  doc. freq. <= 500                  0.359 (138%)  0.449 (120%)  0.373 (127%)  0.390 (123%)  0.346 (122%)
c: compound POS restriction
cN    POS = sub                          0.358 (133%)  0.453 (119%)  0.375 (128%)  0.389 (122%)  0.346 (121%)
cNA   POS = sub or adj                   0.359 (137%)  0.451 (121%)  0.375 (130%)  0.388 (124%)  0.347 (124%)
s: constituent POS restriction
sSN     compound = [all]+N               0.360 (135%)  0.451 (120%)  0.376 (128%)  0.392 (122%)  0.345 (122%)
sSN-MN  compound = N+N                   0.356 (130%)  0.459 (116%)  0.373 (123%)  0.391 (118%)  0.345 (118%)
sMN     compound = N+[all]               0.358 (134%)  0.452 (118%)  0.377 (126%)  0.390 (120%)  0.345 (120%)
r: lexical restriction
rT3   skip 1,000 most frequent terms     0.359 (138%)  0.450 (122%)  0.374 (130%)  0.396 (124%)  0.346 (124%)
rT4   skip 10,000 most frequent terms    0.357 (135%)  0.450 (120%)  0.383 (127%)  0.397 (122%)  0.346 (122%)
rC    only from corpus                   0.364 (130%)  0.460 (118%)  0.380 (125%)  0.386 (120%)  0.346 (120%)
x/X: multi-part constituents
x     all combinations                   0.355 (139%)  0.455 (122%)  0.385 (131%)  0.390 (125%)  0.344 (125%)
xrC   all combinations in corpus         0.356 (130%)  0.456 (118%)  0.384 (125%)  0.391 (120%)  0.344 (120%)
X     outer combinations                 0.357 (138%)  0.450 (122%)  0.378 (131%)  0.396 (125%)  0.344 (125%)
XrC   outer combinations in corpus       0.361 (130%)  0.458 (118%)  0.384 (125%)  0.383* (120%) 0.344 (120%)

Table D.25: Modifications to compound splitting. The best result for each data set is marked with an asterisk. On the whole, the splitting baseline BOLstop, split (“split all”) was difficult to beat. [Section 6.1.3]

size  method                  subset 1  subset 2  subset 3  subset 4  subset 5  eval
5     BOL                     0.511     0.276     0.283     0.496     0.285     0
5     BOL + split             0.433     0.274     0.276     0.492     0.274     5
7     BOL                     0.390     0.394     0.320     0.549     0.553     3
7     BOL + split             0.469     0.409     0.328     0.544     0.548     2
9     BOL                     0.356     0.547     0.376     0.438     0.495     0
9     BOL + split             0.353     0.541     0.369     0.427     0.459     5
11    BOL                     0.451     0.403     0.424     0.420     0.362     1
11    BOL + split             0.430     0.383     0.423     0.429     0.350     4
13    BOL                     0.461     0.403     0.388     0.463     0.416     0
13    BOL + split             0.459     0.397     0.377     0.455     0.413     5
15    BOL                     0.445     0.428     0.432     0.461     0.420     4
15    BOL + split             0.470     0.429     0.440     0.457     0.430     1
17    BOL                     0.456     0.422     0.429     0.457     0.438     4
17    BOL + split             0.458     0.433     0.435     0.448     0.446     1
19    BOL                     0.439     0.437     0.449     0.443     0.459     3
19    BOL + split             0.455     0.441     0.446     0.436     0.466     2

5     BOLstop                 0.451     0.270     0.280     0.480     0.271     0
5     BOLstop + split (d100)  0.439     0.264     0.274     0.479     0.267     5
7     BOLstop                 0.379     0.392     0.315     0.541     0.538     0.5
7     BOLstop + split (d100)  0.379     0.391     0.314     0.482     0.504     4.5
9     BOLstop                 0.352     0.528     0.400     0.427     0.475     0
9     BOLstop + split (d100)  0.341     0.513     0.396     0.419     0.450     5
11    BOLstop                 0.440     0.393     0.424     0.424     0.357     2
11    BOLstop + split (d100)  0.439     0.374     0.410     0.429     0.365     3
13    BOLstop                 0.454     0.402     0.392     0.459     0.413     0
13    BOLstop + split (d100)  0.449     0.396     0.384     0.455     0.410     5
15    BOLstop                 0.449     0.440     0.432     0.473     0.416     3
15    BOLstop + split (d100)  0.451     0.435     0.434     0.455     0.420     2
17    BOLstop                 0.454     0.451     0.431     0.459     0.431     2
17    BOLstop + split (d100)  0.456     0.433     0.432     0.448     0.427     3
19    BOLstop                 0.461     0.457     0.450     0.438     0.456     1.5
19    BOLstop + split (d100)  0.461     0.436     0.457     0.429     0.453     3.5

Table D.26: Subset experiments for compound splitting (amazon). [Section 6.1.4]

size  method                  subset 1  subset 2  subset 3  subset 4  subset 5  eval
5     BOL                     0.288     0.386     0.246     0.304     0.372     3
5     BOL + split             0.297     0.346     0.248     0.340     0.358     2
7     BOL                     0.193     0.325     0.422     0.500     0.195     2
7     BOL + split             0.187     0.324     0.423     0.398     0.203     3
9     BOL                     0.327     0.272     0.390     0.337     0.430     0
9     BOL + split             0.309     0.252     0.380     0.320     0.418     5
11    BOL                     0.386     0.375     0.370     0.373     0.400     2
11    BOL + split             0.408     0.366     0.390     0.369     0.376     3
13    BOL                     0.386     0.388     0.364     0.406     0.379     1
13    BOL + split             0.378     0.400     0.347     0.402     0.367     4
15    BOL                     0.389     0.387     0.384     0.384     0.375     3
15    BOL + split             0.378     0.405     0.399     0.391     0.372     2
17    BOL                     0.404     0.438     0.381     0.390     0.372     3
17    BOL + split             0.406     0.420     0.388     0.408     0.367     2
19    BOL                     0.397     0.416     0.389     0.430     0.414     2
19    BOL + split             0.391     0.404     0.394     0.403     0.418     3

5     BOLstop                 0.274     0.362     0.243     0.337     0.372     1
5     BOLstop + split (d100)  0.250     0.340     0.238     0.364     0.360     4
7     BOLstop                 0.199     0.298     0.435     0.416     0.189     2
7     BOLstop + split (d100)  0.181     0.313     0.424     0.394     0.190     3
9     BOLstop                 0.320     0.270     0.389     0.314     0.410     3
9     BOLstop + split (d100)  0.326     0.286     0.348     0.313     0.462     2
11    BOLstop                 0.394     0.371     0.366     0.353     0.388     2
11    BOLstop + split (d100)  0.400     0.351     0.352     0.369     0.353     3
13    BOLstop                 0.392     0.402     0.338     0.393     0.349     4
13    BOLstop + split (d100)  0.384     0.406     0.360     0.395     0.365     1
15    BOLstop                 0.382     0.405     0.382     0.398     0.371     3
15    BOLstop + split (d100)  0.377     0.401     0.444     0.479     0.581     2
17    BOLstop                 0.401     0.428     0.384     0.393     0.348     2
17    BOLstop + split (d100)  0.381     0.411     0.392     0.390     0.351     3
19    BOLstop                 0.377     0.417     0.397     0.412     0.399     0
19    BOLstop + split (d100)  0.365     0.408     0.391     0.407     0.392     5

Table D.27: Subset experiments for compound splitting (wiki). [Section 6.1.4]

                           springer       amazon         sda            wiki           nzz
BOLstop                    0.422 (100%)   0.463 (100%)   0.421 (100%)   0.408 (100%)   0.349 (100%)
Shared bigrams (ordered)   0.780(a) (25%) 0.618(b) (55%) 0.457 (58%)    0.499(c) (41%) 0.456 (46%)
Combined 1:1               0.419[0.012]   0.481[0.013]   0.404[0.011]   0.402[0.011]   0.368[0.029]
Combined 1:2               0.477[0.011]   0.485[0.007]   0.403[0.010]   0.405[0.017]   0.392[0.029]
Combined 1:3               0.510[0.015]   0.519[0.018]   0.410[0.018]   0.430[0.016]   0.446[0.003]
Combined 1:10              0.613[0.014]   0.542[0.011]   0.414[0.010]   0.440[0.014]   0.447[0.005]

Features BOLstop, shared   11,539    187,797    142,217    265,150    248,591
Features BOLstop, total    35,087    440,883    317,783    757,072    618,225
Bigrams, shared            11,662    697,427    616,568    582,724    997,461
Bigrams, total             118,570   2,878,712  2,859,784  4,292,656  6,788,070

(a) 28 documents lost. (b) 33 documents lost. (c) 112 documents lost.

Table D.28: Clustering with bigram features. “Combined 1:10” means that the frequencies for the bigrams were multiplied by an extra factor of ten, etc. (before tf-idf weighting). The percentages in parentheses refer, as before, to the number of non-zero elements, whereas the numbers in the last four rows refer to the number of dimensions; the two therefore do not stand in a direct relationship. [Section 6.2.1]
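
A sketch of the combined representation (illustrative; names are invented): bigram counts are boosted by the extra factor before tf-idf weighting.

    from collections import Counter

    def unigram_bigram_counts(tokens, bigram_factor=1):
        counts = Counter(tokens)                       # unigram features
        for a, b in zip(tokens, tokens[1:]):
            counts[a + "_" + b] += bigram_factor       # ordered bigram features, boosted
        return counts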

                           springer        amazon          sda             wiki            nzz
BOLstop                    0.422 (100%)    0.463 (100%)    0.421 (100%)    0.408 (100%)    0.349 (100%)
Multi-part names           0.616(a) (0.08%) 0.692(b) (1.03%) 0.776(c) (1.22%) 0.763(d) (1.71%) 0.795(e) (0.79%)
Combined 1:1               0.414[0.017]    0.461[0.005]    0.420[0.000]    0.412[0.020]    0.345[0.002]
Combined 1:2               0.411[0.018]    0.470[0.009]    0.417[0.002]    0.412[0.017]    0.343[0.002]
Combined 1:3               0.422[0.017]    0.470[0.008]    0.417[0.004]    0.414[0.019]    0.342[0.002]
Combined 1:10              0.414[0.016]    0.484[0.008]    0.423[0.004]    0.429[0.013]    0.345[0.001]

Shared BOLstop features    11,539   187,797   142,217   265,150   248,591
Total BOLstop features     35,087   440,883   317,783   757,072   618,225
Shared multi-part names    37       10,862    6,883     17,369    11,150
Total multi-part names     441      44,027    22,002    67,337    44,145

(a) 3721 documents lost. (b) 39661 documents lost. (c) 36013 documents lost. (d) 31935 documents lost. (e) 9424 documents lost.

Table D.29: Clustering with multi-part name features. [Section 6.2.2]

                           springer       amazon         sda            wiki           nzz
BOLstop                    0.422 (100%)   0.463 (100%)   0.421 (100%)   0.408 (100%)   0.349 (100%)
Noun phrases (NPs)         0.900(a) (5%)  0.729(b) (9%)  0.546(c) (9%)  0.650(d) (6%)  0.467(e) (8%)
Combined 1:1               0.426[0.012]   0.452[0.005]   0.411[0.001]   0.415[0.016]   0.343[0.004]
Combined 1:2               0.442[0.010]   0.466[0.011]   0.410[0.002]   0.411[0.016]   0.337[0.001]
Combined 1:3               0.457[0.018]   0.492[0.006]   0.409[0.014]   0.418[0.018]   0.344[0.010]
Combined 1:10              0.588[0.020]   0.529[0.004]   0.426[0.015]   0.407[0.013]   0.407[0.027]

Features BOLstop, shared   11,539   187,797   142,217    265,150    248,591
Features BOLstop, total    35,087   440,883   317,783    757,072    618,225
Noun phrases, shared       3,781    202,438   170,374    130,687    238,804
Noun phrases, total        55,421   948,115   1,072,672  1,427,018  2,549,167

(a) 424 documents lost. (b) 6312 documents lost. (c) 462 documents lost. (d) 5750 documents lost. (e) 264 documents lost.

Table D.30: Clustering with noun phrases. [Section 6.2.3]

BOLstop + ...   springer        amazon          sda             wiki            nzz
—               0.422 (100%)    0.463 (100%)    0.421 (100%)    0.408 (100%)    0.349 (100%)
syn as          0.567 (163.53%) 0.519 (165.40%) 0.527 (171.96%) 0.421 (157.89%) 0.362 (171.02%)
syn fs          0.428 (99.70%)  0.463 (99.51%)  0.429 (99.95%)  0.408 (99.40%)  0.350 (99.48%)
syn fs+cep      0.425 (99.75%)  0.460 (99.54%)  0.428 (99.96%)  0.402 (99.43%)  0.350 (99.52%)
syn fs+o        0.435 (99.53%)  0.465 (99.45%)  0.428 (99.88%)  0.408 (99.23%)  0.348 (99.38%)
syn fs+hy[1]    0.419 (100.13%) 0.467 (99.66%)  0.461 (99.11%)  0.401 (97.71%)  0.353 (98.32%)
syn fs+hy[2]    0.389 (103.03%) 0.476 (102.41%) 0.461 (102.67%) 0.398 (99.29%)  0.347 (99.63%)
syn fs+hy[3]    0.456 (106.36%) 0.496 (103.53%) 0.455 (103.73%) 0.437 (98.06%)  0.341 (96.49%)
syn fs+ho       0.454 (109.80%) 0.487 (109.42%) 0.450 (112.81%) 0.414 (108.89%) 0.355 (108.44%)
syn ds          0.429 (99.42%)  0.461 (99.10%)  0.422 (99.14%)  0.393 (98.83%)  0.351 (98.66%)

Table D.31: Clustering with synsets. [Section 6.3.3]

                 springer        amazon          sda             wiki            nzz
BOLstop          0.422 (100%)    0.463 (100%)    0.421 (100%)    0.408 (100%)    0.349 (100%)
BOLstop + syn ...
fs               0.428 (-0.30%)  0.463 (-0.49%)  0.429 (-0.05%)  0.408 (-0.60%)  0.350 (-0.52%)
fs+A             0.422 (-0.05%)  0.460 (-0.08%)  0.422 (-0.03%)  0.406 (-0.07%)  0.345 (-0.10%)
fs+N             0.421 (-0.23%)  0.465 (-0.37%)  0.425 (+0.02%)  0.394 (-0.48%)  0.351 (-0.37%)
fs+V             0.423 (-0.02%)  0.463 (-0.03%)  0.423 (-0.04%)  0.404 (-0.04%)  0.348 (-0.06%)
fs+a1            0.434 (-0.18%)  0.462 (-0.28%)  0.420 (-0.28%)  0.401 (-0.35%)  0.353 (-0.33%)
fs+a10           0.408 (-0.30%)  0.460 (-0.48%)  0.427 (-0.05%)  0.415 (-0.60%)  0.351 (-0.52%)
fs+a2            0.423 (-0.24%)  0.460 (-0.39%)  0.422 (-0.31%)  0.407 (-0.50%)  0.353 (-0.47%)
fs+a3            0.426 (-0.31%)  0.459 (-0.46%)  0.428 (-0.37%)  0.415 (-0.59%)  0.350 (-0.56%)
fs+a4            0.430 (-0.30%)  0.463 (-0.46%)  0.428 (-0.08%)  0.407 (-0.60%)  0.349 (-0.51%)
fs+df01          0.416 (+0.00%)  0.464 (+0.00%)  0.422 (+0.00%)  0.406 (-0.01%)  0.348 (+0.00%)
fs+df05          0.421 (+0.00%)  0.465 (-0.02%)  0.421 (-0.02%)  0.403 (-0.04%)  0.347 (-0.01%)
fs+df1           0.417 (+0.00%)  0.462 (-0.04%)  0.421 (-0.03%)  0.402 (-0.06%)  0.347 (-0.01%)
fs+df2           0.410 (+0.00%)  0.462 (-0.07%)  0.420 (-0.06%)  0.404 (-0.10%)  0.346 (-0.02%)
fs+df5           0.416 (-0.01%)  0.463 (-0.15%)  0.422 (-0.09%)  0.402 (-0.18%)  0.349 (-0.07%)
fs+df10          0.431 (-0.02%)  0.462 (-0.18%)  0.421 (-0.14%)  0.411 (-0.28%)  0.350 (-0.12%)
fs+df25          0.427 (-0.08%)  0.462 (-0.27%)  0.425 (-0.25%)  0.418 (-0.43%)  0.348 (-0.21%)
fs+df50          0.425 (-0.14%)  0.463 (-0.34%)  0.429 (-0.25%)  0.411 (-0.50%)  0.351 (-0.34%)
fs+sf50          0.417 (-0.14%)  0.465 (-0.36%)  0.427 (-0.34%)  0.407 (-0.50%)  0.350 (-0.48%)
fs+sf250         0.430 (-0.12%)  0.464 (-0.31%)  0.425 (-0.29%)  0.408 (-0.40%)  0.354 (-0.36%)
fs+sf500         0.416 (-0.11%)  0.464 (-0.26%)  0.425 (-0.27%)  0.412 (-0.37%)  0.351 (-0.29%)
fs+si50          0.419 (-0.28%)  0.464 (-0.48%)  0.427 (-0.05%)  0.407 (-0.59%)  0.348 (-0.52%)
fs+si100         0.426 (-0.28%)  0.465 (-0.47%)  0.428 (-0.05%)  0.393 (-0.59%)  0.351 (-0.51%)
fs+si150         0.421 (-0.28%)  0.464 (-0.47%)  0.427 (-0.04%)  0.403 (-0.58%)  0.351 (-0.50%)
fs+si250         0.418 (-0.25%)  0.462 (-0.40%)  0.429 (-0.01%)  0.402 (-0.54%)  0.352 (-0.46%)
fs+si500         0.412 (-0.23%)  0.464 (-0.35%)  0.426 (+0.00%)  0.400 (-0.52%)  0.351 (-0.42%)
fs+su10          0.421 (-0.24%)  0.464 (-0.36%)  0.428 (-0.05%)  0.407 (-0.53%)  0.351 (-0.49%)
fs+su50          0.413 (-0.16%)  0.464 (-0.34%)  0.427 (-0.29%)  0.406 (-0.50%)  0.353 (-0.45%)
fs+su100         0.426 (-0.13%)  0.463 (-0.31%)  0.426 (-0.24%)  0.396 (-0.44%)  0.352 (-0.39%)
fs+su250         0.404 (-0.10%)  0.464 (-0.26%)  0.424 (-0.21%)  0.412 (-0.36%)  0.352 (-0.30%)
fs+t2            0.404 (-0.01%)  0.462 (-0.10%)  0.421 (-0.09%)  0.419 (-0.09%)  0.347 (-0.03%)
fs+t5            0.410 (-0.02%)  0.462 (-0.32%)  0.420 (-0.12%)  0.409 (-0.13%)  0.350 (-0.06%)
fs+t10           0.412 (-0.04%)  0.464 (-0.42%)  0.422 (-0.16%)  0.411 (-0.21%)  0.350 (-0.08%)
fs+t20           0.416 (-0.07%)  0.463 (-0.48%)  0.423 (-0.23%)  0.401 (-0.33%)  0.352 (-0.14%)
fs+t50           0.421 (-0.11%)  0.464 (-0.34%)  0.430 (-0.27%)  0.412 (-0.46%)  0.351 (-0.27%)
fs+cep           0.425 (-0.25%)  0.460 (-0.46%)  0.428 (-0.04%)  0.402 (-0.57%)  0.350 (-0.48%)
fs+cep+NA        0.431 (-0.23%)  0.466 (-0.42%)  0.428 (+0.01%)  0.398 (-0.53%)  0.352 (-0.43%)
fs+o             0.435 (-0.47%)  0.465 (-0.55%)  0.428 (-0.12%)  0.408 (-0.77%)  0.348 (-0.62%)
fs+o+N           0.438 (-0.33%)  0.465 (-0.37%)  0.423 (-0.01%)  0.410 (-0.53%)  0.349 (-0.38%)
fs+o+Ndf2        0.413 (+0.01%)  0.464 (-0.05%)  0.420 (-0.04%)  0.415 (-0.09%)  0.348 (-0.02%)
fs+o+Ndf5        0.407 (+0.00%)  0.464 (-0.11%)  0.421 (-0.07%)  0.404 (-0.16%)  0.346 (-0.05%)
fs+hy[1]         0.419 (+0.13%)  0.467 (-0.34%)  0.461 (-0.89%)  0.401 (-2.29%)  0.353 (-1.68%)
fs+hy[1]+A       0.429 (+0.03%)  0.462 (-0.19%)  0.425 (-0.03%)  0.398 (-0.25%)  0.353 (-0.39%)
fs+hy[1]+N       0.413 (+0.07%)  0.470 (-0.22%)  0.456 (-0.96%)  0.395 (-2.03%)  0.354 (-1.32%)
fs+hy[1]+m8      0.427 (-0.09%)  0.466 (-0.42%)  0.430 (-0.35%)  0.396 (-0.99%)  0.352 (-0.65%)
fs+hy[1]+m15     0.429 (+0.05%)  0.461 (-0.40%)  0.436 (-0.44%)  0.407 (-1.34%)  0.351 (-0.94%)
fs+hy[2]         0.389 (+3.03%)  0.476 (+2.41%)  0.461 (+2.67%)  0.398 (-0.71%)  0.347 (-0.37%)
fs+hy[2]+A       0.428 (+0.18%)  0.466 (-0.11%)  0.423 (+0.17%)  0.416 (-0.29%)  0.350 (-0.57%)
fs+hy[2]+N       0.429 (+2.55%)  0.477 (+1.82%)  0.460 (+1.92%)  0.395 (-0.63%)  0.348 (-0.10%)
fs+hy[2]+m8      0.415 (+0.31%)  0.466 (+0.66%)  0.422 (+0.52%)  0.394 (-0.26%)  0.349 (+0.10%)
fs+hy[2]+m15     0.412 (+0.91%)  0.468 (+1.06%)  0.431 (+1.39%)  0.411 (-0.30%)  0.346 (+0.35%)
fs+hy[3]         0.456 (+6.36%)  0.496 (+3.53%)  0.455 (+3.73%)  0.437 (-1.94%)  0.341 (-3.51%)
fs+hy[3]+A       0.431 (+0.29%)  0.469 (-0.66%)  0.420 (-0.17%)  0.404 (-1.05%)  0.344 (-2.40%)
fs+hy[3]+N       0.473 (+5.40%)  0.482 (+3.43%)  0.464 (+3.03%)  0.437 (-1.04%)  0.346 (-1.16%)
fs+hy[3]+m8      0.413 (+0.29%)  0.463 (+0.25%)  0.424 (+0.42%)  0.419 (+0.04%)  0.351 (+0.22%)
fs+hy[3]+m15     0.434 (+0.62%)  0.467 (+0.58%)  0.424 (+0.28%)  0.397 (-0.32%)  0.354 (+0.17%)
fs+ho            0.454 (+9.80%)  0.487 (+9.42%)  0.450 (+12.81%) 0.414 (+8.89%)  0.355 (+8.44%)
as               0.567 (+63.5%)  0.519 (+65.4%)  0.527 (+72.0%)  0.421 (+57.9%)  0.362 (+71.0%)
ds               0.429 (-0.58%)  0.461 (-0.90%)  0.422 (-0.86%)  0.393 (-1.17%)  0.351 (-1.34%)
ds+A             0.423 (-0.06%)  0.468 (-0.10%)  0.421 (-0.04%)  0.409 (-0.09%)  0.349 (-0.14%)
ds+V             0.420 (-0.05%)  0.462 (-0.08%)  0.423 (-0.11%)  0.402 (-0.13%)  0.349 (-0.18%)
ds+N             0.431 (-0.47%)  0.464 (-0.72%)  0.422 (-0.71%)  0.392 (-0.95%)  0.349 (-1.02%)
ds+Ndf2          0.410 (-0.03%)  0.464 (-0.09%)  0.422 (-0.08%)  0.407 (-0.13%)  0.349 (-0.03%)
ds+Ndf5          0.419 (-0.07%)  0.460 (-0.18%)  0.421 (-0.15%)  0.401 (-0.25%)  0.350 (-0.08%)
ds+Ndf10         0.424 (-0.11%)  0.463 (-0.28%)  0.421 (-0.24%)  0.399 (-0.38%)  0.351 (-0.14%)
ds+Ndf20         0.438 (-0.19%)  0.462 (-0.38%)  0.420 (-0.39%)  0.390 (-0.58%)  0.351 (-0.23%)
ds+a1            0.417 (-0.20%)  0.461 (-0.33%)  0.419 (-0.37%)  0.417 (-0.47%)  0.349 (-0.41%)
ds+a2            0.419 (-0.33%)  0.464 (-0.54%)  0.423 (-0.55%)  0.405 (-0.74%)  0.349 (-0.73%)
ds+a4            0.416 (-0.52%)  0.462 (-0.76%)  0.425 (-0.75%)  0.402 (-1.04%)  0.352 (-1.13%)

Table D.32: Clustering with refined synset selection. [Section 6.3.4]
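
A note on reading these tables: the parenthesized figures appear to give the size of the document-term matrix relative to the respective baseline representation (variants that only reweight existing entries, such as wgt1 and wgt2 in Table D.33, leave the dimensions unchanged and accordingly show 100%). Table D.31 states this size as an absolute percentage of the BOLstop baseline, whereas Table D.32 reports the same quantity as a signed difference, so the two conventions convert as size = 100% + delta; for example, the syn fs row of Table D.31 (99.70%) corresponds to the fs row of Table D.32 (-0.30%).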

D.4 Experiments in Chapter 7 (Combining)

                         springer       amazon         sda            wiki           nzz
BOL + ...
—                        0.421 (100%)   0.448 (100%)   0.456 (100%)   0.388 (100%)   0.349 (100%)
individual:
prune                    0.412 (60%)    0.451 (80%)    0.381 (32%)    0.382 (52%)    0.347 (60%)
stop                     0.422 (61%)    0.463 (60%)    0.421 (59%)    0.408 (67%)    0.349 (67%)
pos1 (sub, adj)          0.403 (54%)    0.457 (47%)    0.394 (46%)    0.405 (50%)    0.355 (54%)
pos2 (sub, adj, namall)  0.402 (57%)    0.468 (55%)    0.416 (53%)    0.393 (61%)    0.348 (59%)
wgt1 (sub, adj)          0.415 (100%)   0.446 (100%)   0.446 (100%)   0.387 (100%)   0.344 (100%)
wgt2 (sub, adj, namall)  0.408 (100%)   0.455 (100%)   0.439 (100%)   0.388 (100%)   0.342 (100%)
combined:
prune + stop             0.411 (46%)    0.459 (53%)    0.388 (31%)    0.406 (40%)    0.346 (54%)
pos1 + prune             0.403 (42%)    0.460 (43%)    0.384 (22%)    0.401 (34%)    0.364 (41%)
pos1 + stop              0.416 (50%)    0.461 (42%)    0.390 (43%)    0.401 (45%)    0.356 (49%)
pos1 + stop + prune      0.414 (38%)    0.457 (39%)    0.379 (22%)    0.398 (32%)    0.364 (39%)
pos2 + prune             0.397 (43%)    0.474 (50%)    0.403 (27%)    0.403 (38%)    0.343 (46%)
pos2 + stop              0.404 (52%)    0.465 (49%)    0.412 (49%)    0.409 (57%)    0.357 (54%)
pos2 + stop + prune      0.405 (39%)    0.463 (45%)    0.392 (26%)    0.406 (36%)    0.342 (44%)
wgt1 + stop              0.416 (61%)    0.465 (60%)    0.417 (59%)    0.398 (67%)    0.344 (67%)
wgt1 + prune             0.418 (60%)    0.449 (80%)    0.379 (32%)    0.390 (52%)    0.348 (60%)
wgt1 + stop + prune      0.418 (46%)    0.468 (53%)    0.379 (31%)    0.395 (40%)    0.348 (54%)
wgt2 + stop              0.418 (61%)    0.466 (60%)    0.416 (59%)    0.416 (67%)    0.343 (67%)
wgt2 + prune             0.409 (60%)    0.462 (80%)    0.390 (32%)    0.389 (52%)    0.344 (60%)
wgt2 + stop + prune      0.414 (46%)    0.462 (53%)    0.385 (31%)    0.403 (40%)    0.344 (54%)

Table D.33: Combining matrix reduction techniques. [Section 7.1]
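
To make the row labels concrete: stop removes stopwords and prune discards terms below a frequency threshold, which is why both shrink the matrix below 100% of its baseline size, while the wgt variants merely reweight existing entries. The following minimal sketch illustrates the two reduction steps on a toy corpus; the stopword list, the document-frequency threshold, and the assumption that pruning is document-frequency based are illustrative only and do not reproduce the exact configuration evaluated above.

    # Illustrative sketch of "stop" (stopword removal) and "prune"
    # (frequency-based term pruning) on a toy corpus. All names and
    # thresholds here are hypothetical, not the thesis's actual setup.
    from collections import Counter

    docs = [
        "der Hund jagt die Katze",
        "die Katze jagt die Maus",
        "der Hund schläft",
    ]
    stopwords = {"der", "die", "das"}   # stop: drop function words
    min_df = 2                          # prune: drop terms in < 2 documents

    tokenized = [d.lower().split() for d in docs]
    # stop: remove stopwords from every document
    tokenized = [[t for t in doc if t not in stopwords] for doc in tokenized]
    # prune: keep only terms occurring in at least min_df documents
    df = Counter(t for doc in tokenized for t in set(doc))
    vocab = sorted(t for t, n in df.items() if n >= min_df)
    # resulting term-document count matrix (rows = documents)
    matrix = [[doc.count(t) for t in vocab] for doc in tokenized]
    print(vocab)    # ['hund', 'jagt', 'katze']
    print(matrix)   # [[1, 1, 1], [0, 1, 1], [1, 0, 0]]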

                      springer       amazon         sda            wiki           nzz
BOLstop + ...
—                     0.422 (100%)   0.463 (100%)   0.421 (100%)   0.408 (100%)   0.349 (100%)
individual:
split                 0.360 (138%)   0.450 (122%)   0.373 (131%)   0.384 (124%)   0.346 (124%)
names                 0.411 (100%)   0.470 (101%)   0.417 (101%)   0.412 (102%)   0.343 (101%)
NPs                   0.442 (108%)   0.466 (115%)   0.410 (116%)   0.411 (109%)   0.337 (112%)
combined:
split + names         0.357 (139%)   0.460 (123%)   0.392 (132%)   0.395 (126%)   0.342 (125%)
split + NPs           0.360 (147%)   0.465 (137%)   0.383 (146%)   0.397 (134%)   0.359 (136%)
names + NPs           0.437 (108%)   0.483 (116%)   0.400 (117%)   0.419 (111%)   0.336 (113%)
split + names + NPs   0.362 (147%)   0.484 (138%)   0.384 (147%)   0.387 (135%)   0.342 (137%)

Table D.34: Combining matrix enhancement techniques. [Section 7.2]
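
The split rows, by contrast, add the constituents of German compounds to the representation, which is why their matrix sizes lie above 100%. Below is a minimal sketch of lexicon-driven compound splitting; the greedy strategy and the toy lexicon are assumptions made for illustration and are not the decomposition procedure actually used in these experiments.

    # Illustrative sketch of German compound splitting ("split").
    # The toy lexicon and the greedy strategy are hypothetical.
    LEXICON = {"haus", "tür", "schlüssel"}

    def split_compound(word):
        """Greedily split word into known lexicon entries, left to right."""
        for i in range(3, len(word) - 2):
            head, tail = word[:i], word[i:]
            if head in LEXICON:
                rest = split_compound(tail)
                if rest:
                    return [head] + rest
        return [word] if word in LEXICON else []

    # The constituents join the vocabulary alongside the compound,
    # enlarging the matrix:
    print(split_compound("haustürschlüssel"))  # ['haus', 'tür', 'schlüssel']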

                             springer       amazon         sda            wiki           nzz
individual:
BOL                          0.421 (100%)   0.448 (100%)   0.456 (100%)   0.388 (100%)   0.349 (100%)
BOL + prune                  0.412 (60%)    0.451 (80%)    0.381 (32%)    0.382 (52%)    0.347 (60%)
BOL + stop                   0.422 (61%)    0.463 (60%)    0.421 (59%)    0.408 (67%)    0.349 (67%)
BOL + wgt1                   0.415 (100%)   0.446 (100%)   0.446 (100%)   0.387 (100%)   0.344 (100%)
BOL + wgt2                   0.408 (100%)   0.455 (100%)   0.439 (100%)   0.388 (100%)   0.342 (100%)
BOL + split                  0.363 (124%)   0.453 (114%)   0.409 (119%)   0.390 (117%)   0.346 (117%)
BOL + NPs (+stop)            0.442 (65%)    0.466 (69%)    0.410 (69%)    0.411 (73%)    0.337 (75%)
combined: BOL + ...
split + prune                0.372 (76%)    0.458 (88%)    0.391 (35%)    0.404 (55%)    0.347 (67%)
split + prune + NPs          0.388 (81%)    0.472 (97%)    0.397 (44%)    0.414 (61%)    0.367 (75%)
split + wgt1                 0.369 (124%)   0.462 (114%)   0.398 (119%)   0.391 (117%)   0.355 (117%)
split + wgt1 + prune         0.371 (76%)    0.469 (88%)    0.389 (35%)    0.386 (55%)    0.360 (67%)
split + wgt1 + prune + NPs   0.368 (81%)    0.462 (97%)    0.385 (44%)    0.394 (61%)    0.359 (75%)
split + wgt2                 0.360 (124%)   0.457 (114%)   0.394 (119%)   0.397 (117%)   0.347 (117%)
split + wgt2 + prune         0.367 (76%)    0.467 (88%)    0.396 (35%)    0.395 (55%)    0.351 (67%)
split + wgt2 + prune + NPs   0.363 (81%)    0.464 (97%)    0.396 (44%)    0.389 (61%)    0.351 (75%)

BOLstop + ...
split                        0.360 (84%)    0.450 (73%)    0.373 (78%)    0.384 (83%)    0.346 (83%)
split + prune                0.363 (63%)    0.461 (63%)    0.401 (34%)    0.400 (46%)    0.348 (63%)
split + prune + NPs          0.374 (68%)    0.476 (73%)    0.398 (43%)    0.387 (52%)    0.368 (71%)
split + wgt1                 0.360 (84%)    0.451 (73%)    0.375 (78%)    0.390 (83%)    0.355 (83%)
split + wgt1 + prune         0.358 (63%)    0.454 (63%)    0.392 (34%)    0.388 (46%)    0.360 (63%)
split + wgt1 + prune + NPs   0.360 (68%)    0.457 (73%)    0.387 (43%)    0.389 (52%)    0.359 (71%)
split + wgt2                 0.363 (84%)    0.450 (73%)    0.379 (78%)    0.391 (83%)    0.348 (83%)
split + wgt2 + prune         0.366 (63%)    0.467 (63%)    0.393 (34%)    0.388 (46%)    0.353 (63%)
split + wgt2 + prune + NPs   0.363 (68%)    0.468 (73%)    0.390 (43%)    0.388 (52%)    0.353 (71%)

Table D.35: Combining matrix enhancement and reduction techniques. [Section 7.3]

Bibliography

[Agrawal et al., 1993] Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Database Mining: A Performance Perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6):914–925, 1993.

[Allan et al., 1997] James Allan, James P. Callan, W. Bruce Croft, Lisa Ballesteros, Donald Byrd, Russell C. Swan, and Jinxi Xu. INQUERY Does Battle With TREC-6. In Proceedings of the Sixth Text REtrieval Conference (TREC-6), pages 169–206, Gaithersburg, MD, 1997.

[Allen, 1992] Dennis Allen. Managing Infoglut. BYTE, 17(6), 1992.

[Anderson et al., 1999] Ed Anderson, Zhaojun Bai, Christian H. Bischof, Susan Blackford, James Demmel, Jack Dongarra, Jeremy Du Croz, Anne Greenbaum, Sven Hammarling, Alan McKenney, and Danny C. Sorensen. LAPACK User's Guide, Third Edition. SIAM, Philadelphia, 1999. URL (last visited on 17 January 2006): www.netlib.org/lapack.

[Arampatzis et al., 2000a] Avi Arampatzis, Theo P. van der Weide, Cornelis H.A. Koster, and Patrick van Bommel. An Evaluation of Linguistically-motivated Indexing Schemes. In Proceedings of the 22nd BCS-IRSG Annual Colloquium on Information Retrieval Research, pages 34–45, Cambridge, United Kingdom, 2000.

[Arampatzis et al., 2000b] Avi Arampatzis, Theo P. van der Weide, Patrick van Bommel, and Cornelis H.A. Koster. Linguistically Motivated Information Retrieval. In Encyclopedia of Library and Information Science, volume 69, pages 201–222. Marcel Dekker, New York/Basle, 2000.

[Baeza-Yates and Ribeiro-Neto, 1999] Ricardo A. Baeza-Yates and Berthier Ribeiro-Neto, editors. Modern Information Retrieval. Addison-Wesley, Reading, MA, 1999.

[Baeza-Yates et al., 2003] Ricardo A. Baeza-Yates, Benjamin Bustos, Edgar Chávez, Norma Herrera, and Gonzalo Navarro. Clustering in Metric Spaces with Applications to Information Retrieval. In Wu et al. (2003), pages 1–34.

[Bakus et al., 2002] Jan Bakus, Mahmoud Hussin, and Mohamed S. Kamel. SOM-Based Document Clustering Using Phrases. In Proceedings of the 9th International Conference on Neural Information Processing (ICONIP'2002), volume 5, pages 2212–2216, Singapore, 2002.

[Banfield and Raftery, 1993] Jeffrey D. Banfield and Adrian E. Raftery. Model-Based Gaussian and Non-Gaussian Clustering. Biometrics, 49:803–821, 1993.


[Basili et al., 2000] Roberto Basili, Alessandro Moschitti, and Maria Teresa Pazienza. Language Sensitive Text Classification. In Proceedings of 6th International Conference 'Recherche d'Information Assistée par Ordinateur' (RIAO'00), pages 331–343, Paris, France, 2000.

[Bates, 1989] Marcia J. Bates. The Design of Browsing and Berrypicking Techniques for the Online Search Interface. Online Review, 13(5):407–424, 1989.

[Beil et al., 2002] Florian Beil, Martin Ester, and Xiaowei Xu. Frequent Term-Based Text Clustering. In Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining (KDD-02), pages 436–442, Edmonton, Canada, 2002.

[Bellot and El-Bèze, 2000] Patrice Bellot and Marc El-Bèze. Clustering by Means of Unsupervised Decision Trees or Hierarchical and K-means-like Algorithm. In Proceedings of 6th International Conference 'Recherche d'Information Assistée par Ordinateur' (RIAO'00), pages 344–363, Paris, France, 2000.

[Berry and Linoff, 1997] Michael Berry and Gordon Linoff. Data Mining Techniques: For Marketing, Sales and Customer Support. John Wiley and Sons, New York, 1997.

[Berry et al., 1993] Michael Berry, Theresa Do, Gavin O'Brien, Vijay Krishna, and Sowmini Varadhan. SVDPACKC (Version 1.0) User's Guide. Technical Report CS-93-194. Department of Computer Science, University of Tennessee, 1993. Revised March 1996.

[Boley et al., 1999a] Daniel Boley, Maria Gini, Robert Gross, Eui-Hong (Sam) Han, Kyle Hastings, George Karypis, Vipin Kumar, Bamshad Mobasher, and Jerome Moore. Document Categorization and Query Generation on the World Wide Web Using WebACE. Journal of Artificial Intelligence Review, 13(5–6):365–391, 1999.

[Boley et al., 1999b] Daniel Boley, Maria Gini, Robert Gross, Eui-Hong (Sam) Han, Kyle Hastings, George Karypis, Vipin Kumar, Bamshad Mobasher, and Jerome Moore. Partitioning-Based Clustering for Web Document Categorization. Decision Support Systems, 27(3):329–341, 1999.

[Boulis and Ostendorf, 2004] Constantinos Boulis and Mari Ostendorf. Combining Multiple Clustering Systems. In 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2004), page 63, Pisa, Italy, 2004.

[Bowman et al., 1994] C. Mic Bowman, Peter B. Danzig, Udi Manber, and Michael F. Schwartz. Scalable Internet Resource Discovery: Research Problems and Approaches. Communications of the ACM, 37(8):98–107, 1994.

[Bradley and Fayyad, 1998] Paul S. Bradley and Usama M. Fayyad. Refining Initial Points for K-Means Clustering. In Proceedings of the 15th International Conference on Machine Learning (ICML'98), pages 91–99, San Francisco, CA, 1998.

[Bradley et al., 1998] Paul S. Bradley, Usama M. Fayyad, and Cory Reina. Scaling EM (Expectation-Maximization) Clustering to Large Databases. Technical Report MSR-TR-98-35. Microsoft Research, 1998. Revised 1999.

[Broder et al., 1997] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic Clustering of the Web. In Proceedings of the 6th International World Wide Web Conference (WWW-6), pages 391–404, Santa Clara, CA, 1997.

[Broder, 2002] Andrei Broder. A Taxonomy of Web Search. SIGIR Forum, 36(2):3–10, 2002.

[Budzik and Hammond, 1999] Jan Budzik and Kristian Hammond. Watson: Anticipating and Contextualizing Information Needs. In Proceedings of the 62nd Annual Meeting of the American Society for Information Science, pages 727–740, Medford, NJ, 1999.

[Caropreso et al., 2001] Maria Fernanda Caropreso, Stan Matwin, and Fabrizio Sebastiani. A Learner-Independent Evaluation of the Usefulness of Statistical Phrases for Automated Text Categorization. In A.G. Chin, editor, Text Databases and Document Management: Theory and Practice, pages 78–102. Idea Group Publishing, Hershey, 2001.

[Casillas et al., 2003] Arantza Casillas, Mayte González de Lena, and Raquel Martínez. Partitional Clustering Experiments with News Documents. In 4th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-03), pages 615–618, Mexico City, Mexico, 2003.

[Chang and Hsu, 1998] Chia-Hui Chang and Ching-Chi Hsu. Hypertext Information Retrieval for Short Queries. In Proceedings of the IEEE Knowledge and Data Engineering Exchange Workshop, Taipei, Taiwan, 1998.

[Cheeseman and Stutz, 1996] Peter Cheeseman and John Stutz. Bayesian Classification (AutoClass): Theory and Results. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 61–83. AAAI Press/MIT Press, 1996.

[Cheeseman et al., 1988] Peter Cheeseman, James Kelly, Matthew Self, John Stutz, Will Taylor, and Don Freeman. AutoClass: A Bayesian Classification System. In Proceedings of the Fifth International Conference on Machine Learning (ICML'88), pages 54–64, Ann Arbor, MI, 1988.

[Choo et al., 2000a] Chun Wei Choo, Brian Detlor, and Don Turnbull. Information Seeking on the Web: An Integrated Model of Browsing and Searching. First Monday, 2000. URL (last visited on 17 January 2006): http://firstmonday.org/issues/issue5_2/choo.

[Choo et al., 2000b] Chun Wei Choo, Brian Detlor, and Don Turnbull. Web Work: Information Seeking and Knowledge Work on the World Wide Web. Kluwer Academic Publishers, Dordrecht, 2000.

[Choudhary and Bhattacharyya, 2002] Bhoopesh Choudhary and Pushpak Bhattacharyya. Text Clustering using Semantics. In Proceedings of the 11th International World Wide Web Conference (WWW-11), Honolulu, HI, 2002. (Poster).

[Chu et al., 2002] Shu-Chuan Chu, John F. Roddick, and Jeng-Shyang Pan. An Incremental Multi-Centroid, Multi-Run Sampling Scheme for k-medoids-based Algorithms—Extended Report. Technical Report KDM-02-003. Flinders University, Adelaide, Australia, 2002.

[Chu et al., 2003] Wesley W. Chu, Zhenyu Liu, and Wenlei Mao. Techniques for Textual Document Indexing and Retrieval via Knowledge Sources and Data Mining. In Wu et al. (2003), pages 135–160.

[Clematide, 2002] Simon Clematide. Selektive Evaluation von robusten Parsern. In Sechste Konferenz zur Verarbeitung natürlicher Sprache (Konvens 2002), pages 23–29, Saarbrücken, Germany, 2002.

[Clifton et al., 2004] Chris Clifton, Robert Cooley, and Jason Rennie. TopCat: Data Mining for Topic Identification in a Text Corpus. Transactions on Knowledge and Data Engineering, 16(8):949–964, 2004.

[Craswell and Hawking, 2002] Nick Craswell and David Hawking. Overview of the TREC-2002 Web Track. In Proceedings of the 11th Text REtrieval Conference (TREC 2002), Gaithersburg, MD, 2002.

[Cristofor and Simovici, 2001] Dana Cristofor and Dan A. Simovici. An Information-Theoretical Approach to Genetic Algorithms for Clustering. Technical Report TR-01-02. University of Massachusetts, Boston, 2001.

[Croft et al., 1991] W. Bruce Croft, Howard R. Turtle, and David D. Lewis. The Use of Phrases and Structured Queries in Information Retrieval. In Proceedings of the 14th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR'91), pages 32–45, Chicago, IL, 1991.

[Cutting et al., 1992] Douglass R. Cutting, David Karger, Jan Pedersen, and John W. Tukey. Scatter/Gather: A Cluster-Based Approach to Browsing Large Document Collections. In Proceedings of the 15th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR'92), pages 318–329, Copenhagen, Denmark, 1992.

[Cutting et al., 1993] Douglass R. Cutting, David Karger, and Jan Pedersen. Constant Interaction-Time Scatter/Gather Browsing of Very Large Document Collections. In Proceedings of the 16th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR'93), pages 126–134, Pittsburgh, PA, 1993.

[Davies and Bouldin, 1979] David L. Davies and Donald W. Bouldin. A Cluster Separation Measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1:224–227, 1979.

[Deerwester et al., 1990] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society of Information Science, 41(6):391–407, 1990.

[Dempster et al., 1977] Arthur Pentland Dempster, Nan M. Laird, and Donald B. Rubin. Maximum Likelihood Estimation from Incomplete Data via the EM Algorithm (with Discussion). Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

[Devlin and Burke, 1997] Brendan Devlin and Mary Burke. Internet: The Ultimate Reference Tool? Internet Research: Electronic Networking Applications and Policy, 7(2):101–108, 1997.

[Dhillon and Modha, 2001] Inderjit S. Dhillon and Dharmendra S. Modha. Concept Decompositions for Large Sparse Text Data using Clustering. Machine Learning, 42(1–2):143–175, 2001.

[Dhillon et al., 2001] Inderjit S. Dhillon, James Fan, and Yuqiang Guan. Efficient Clustering of Very Large Document Collections. In R.L. Grossman, Ch. Kamath, V. Kumar, Ph. Kegelmeyer, and R.R. Naburu, editors, Data Mining for Scientific and Engineering Applications, pages 357–382. Kluwer Academic Publishers, Dordrecht, 2001.

[Dhillon et al., 2002] Inderjit S. Dhillon, Subramanyam Mallela, and Rahul Kumar. Enhanced Word Clustering for Hierarchical Text Classification. In Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining (KDD-02), pages 191–200, Edmonton, Canada, 2002.

[Dhillon, 2001] Inderjit S. Dhillon. Co-clustering Documents and Words Using Bipartite Spectral Graph Partitioning. In Proceedings of the 7th International Conference on Knowledge Discovery and Data Mining (KDD-01), pages 269–274, San Francisco, CA, 2001. A longer version appears as Technical Report #2001-05, University of Texas, Austin, 2001.

[Ding and He, 2002] Chris H.Q. Ding and Xiaofeng He. Cluster Merging and Splitting in Hierarchical Clustering Algorithms. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002), pages 139–146, Maebashi City, Japan, 2002.

[Ding et al., 2001] Chris H.Q. Ding, Xiaofeng He, Hongyuan Zha, Ming Gu, and Horst D. Simon. A Min-max Cut Algorithm for Graph Partitioning and Data Clustering. In Proceedings of the First IEEE International Conference on Data Mining (ICDM 2001), pages 107–114, San Jose, CA, 2001.

[Dom, 2002] Byron E. Dom. An Information-Theoretic External Cluster-Validity Measure. In Proceedings of the 18th Conference in Uncertainty in Artificial Intelligence (UAI ‘02), pages 137–145, Edmonton, Canada, 2002. Appears also as IBM Technical Report RJ 10219, 2001.

[Duda and Hart, 1973] Richard O. Duda and Peter E. Hart. Pattern Classification and Scene Analysis. John Wiley and Sons, New York, 1973.

[El-Hamdouchi and Willett, 1986] Abdelmoula El-Hamdouchi and Peter Willett. Hierarchic Document Clustering Using Ward's Method. In Proceedings of the 9th International ACM Conference on Research and Development in Information Retrieval (SIGIR'86), pages 149–156, Pisa, Italy, 1986.

[Ellis and Haugan, 1997] David Ellis and Merete Haugan. Modelling the Information Seeking Patterns of Engineers and Research Scientists in an Industrial Environment. Journal of Documentation, 53(4):384–403, 1997.

[Ellis, 1989] David Ellis. A Behavioral Approach to Information Retrieval System Design. Journal of Documentation, 45(3):171–212, 1989.

[Ertöz et al., 2003] Levent Ertöz, Michael Steinbach, and Vipin Kumar. Finding Topics in Collections of Documents: A Shared Nearest Neighbor Approach. In Wu et al. (2003), pages 83–104.

[Ester et al., 1996] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), pages 226–231, Portland, OR, 1996.

[Everitt, 1993] Brian S. Everitt. Cluster Analysis. Edward Arnold, London, 3rd edition, 1993.

[Fagan, 1989] Joel L. Fagan. The Effectiveness of a Nonsyntactic Approach to Automatic Phrase Indexing for Document Retrieval. Journal of the American Society for Information Science, 40(2):115–132, 1989.

[Fasulo, 1999] Daniel Fasulo. An Analysis of Recent Work on Clustering Algorithms. Technical Report #01-03-02. University of Washington, 1999.

[Feldman, 1999] Susan Feldman. NLP Meets the Jabberwocky: Natural Language Processing in Information Retrieval. Online, 23:62–72, 1999.

[Forgy, 1965] Edward W. Forgy. Cluster Analysis of Multivariate Data: Efficiency vs. Interpretability of Classifications. Biometrics, 21:768–769, 1965.

[Fox, 1992] Christopher Fox. Lexical Analysis and Stoplists. In Frakes and Baeza-Yates (1992), pages 102–130.

[Frakes and Baeza-Yates, 1992] William B. Frakes and Ricardo A. Baeza-Yates, editors. Information Retrieval: Data Structures & Algorithms. Prentice-Hall, Upper Saddle River, 1992.

[Frigui and Nasraoui, 2004] Hichem Frigui and Olfa Nasraoui. Simultaneous Clustering and Dynamic Keyword Weighting for Text Documents. In M.W. Berry, editor, Survey of Text Mining, pages 45–72. Springer, New York, 2004.

[Fung, 2002] Benjamin Chin Ming Fung. Hierarchical Document Clustering Using Frequent Itemsets. Master's thesis. Simon Fraser University, Burnaby, Canada, 2002.

[Goharian et al., 2001] Nazli Goharian, Tarek El-Ghazawi, and David Grossman. Enterprise Text Processing: A Sparse Matrix Approach. In Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC-2001), pages 71–75, Las Vegas, NV, 2001.

[Goldszmidt and Sahami, 1998] Moises Goldszmidt and Mehran Sahami. A Probabilistic Approach to Full-Text Document Clustering. Technical Report ITAD-433-MS-98-044. SRI International, Menlo Park, CA, 1998.

[Golub and van Loan, 1996] Gene H. Golub and Charles F. van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 3rd edition, 1996.

[Gonzalo et al., 1998] Julio Gonzalo, Felisa Verdejo, Irina Chugur, and Juan Cigarran. Indexing with WordNet Synsets Can Improve Text Retrieval. In Proceedings of the COLING/ACL '98 Workshop on Usage of WordNet for NLP, pages 38–44, Montreal, Canada, 1998.

[Govindarajan and Ward, 1999] Jayesh Govindarajan and Matthew O. Ward. GeoViser—Geographic Visualization of Search Engine Results. In 10th International Workshop on Database and Expert Systems Applications (DEXA-1999), pages 269–273, Florence, Italy, 1999.

[Guha et al., 1998] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: an Efficient Clustering Algorithm for Large Databases. In Proceedings of the 1998 ACM International Conference on Management of Data (SIGMOD-98), pages 73–84, Seattle, WA, 1998.

[Guha et al., 2003] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. Techniques for Clustering Massive Data Sets. In Wu et al. (2003), pages 35–82.

[Gulli and Signorini, 2005] Antonio Gulli and Alessio Signorini. The Indexable Web is More than 11.5 Billion Pages. In Proceedings of the 14th International World Wide Web Conference, pages 902–903, Chiba, Japan, 2005.

[Haapalainen and Majorin, 1995] Mariikka Haapalainen and Ari Majorin. GERTWOL und Morphologische Disambiguierung für das Deutsche. In Proceedings of the 10th Nordic Conference of Computational Linguistics (NoDaLiDa-95), Helsinki, Finland, 1995.

[Halkidi et al., 2000] Maria Halkidi, Michalis Vazirgiannis, and Yannis Batistakis. Quality Scheme Assessment in the Clustering Process. In 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2000), pages 265–276, Lyon, France, 2000.

[Halkidi et al., 2001] Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. On Clustering Validation Techniques. Journal of Intelligent Information Systems, 17(2–3):107–145, 2001.

[Halkidi et al., 2002] Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. Cluster Validity Methods. SIGMOD Record, 31(2, 3):40–45, 19–27, 2002.

[Hammouda and Kamel, 2002] Khaled M. Hammouda and Mohamed S. Kamel. Phrase-based Document Similarity Based on an Index Graph Model. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002), pages 203–210, Maebashi City, Japan, 2002.

[Hammouda, 2001] Khaled M. Hammouda. Web Mining: Clustering Web Documents, A Preliminary Review. Technical report. University of Waterloo, Ontario, Canada, 2001.

[Hamp and Feldweg, 1997] Birgit Hamp and Helmut Feldweg. GermaNet—a Lexical-Semantic Net for German. In Proceedings of the ACL Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 9–15, Madrid, Spain, 1997.

[Hannappel et al., 1999] Peter Hannappel, Reinhold Klapsing, Gustaf Neumann, and Adrian Krug. MSEEC—A Multi Search Engine with Multiple Clustering. In Proceedings of the 10th Information Resources Management Association International Conference (IRMA'99), Hershey, PA, 1999.

[Harper et al., 1999] David J. Harper, Mourad Mechkour, and Gheorghe Muresan. Document Clustering for Mediated Information Access. In Proceedings of the 21st BCS-IRSG Annual Colloquium on IR Research, pages 92–107, Glasgow, Scotland, 1999.

[Hasan and Matsumoto, 1999] Md Maruf Hasan and Yuji Matsumoto. Document Clustering: Before and After the Singular Value Decomposition. Technical Report TR:99. Information Processing Society of Japan, Sapporo, Japan, 1999. (SIGNotes Natural Language 134, pages 47–55).

[Hatzivassiloglou et al., 2000] Vasileios Hatzivassiloglou, Luis Gravano, and Ankineedu Maganti. An Investigation of Linguistic Features and Clustering Algorithms for Topical Document Clustering. In Proceedings of the 23rd International ACM Conference on Research and Development in Information Retrieval (SIGIR'00), pages 224–231, Athens, Greece, 2000.

[He et al., 2001] Xiaofeng He, Chris H.Q. Ding, Hongyuan Zha, and Horst D. Simon. Automatic Topic Identification Using Webpage Clustering. In Proceedings of the IEEE International Conference on Data Mining (ICDM'01), pages 195–202, San Jose, CA, 2001.

[He et al., 2002] Ji He, Ah-Hwee Tan, and Chew-Lim Tan. ART-C: a Neural Architecture for Self-Organization under Constraints. In Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN '02), volume 3, pages 2550–2555, Honolulu, HI, 2002.

[He et al., 2003] Ji He, Ah-Hwee Tan, Chew Lim Tan, and Sam Yuan Sung. On Quantitative Evaluation of Clustering Systems. In Wu et al. (2003), pages 105–134.

[Hearst and Pedersen, 1996] Marti A. Hearst and Jan O. Pedersen. Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. In Proceedings of the 19th International ACM Conference on Research and Development in Information Retrieval (SIGIR'96), pages 76–84, Zurich, Switzerland, 1996.

[Hearst, 1999a] Marti A. Hearst. Untangling Text Data Mining. In Proceedings of the 37th Meeting of the Association for Computational Linguistics (ACL'99), pages 3–10, College Park, MD, 1999.

[Hearst, 1999b] Marti A. Hearst. The Use of Categories and Clusters for Organizing Retrieval Results. In Strzalkowski (1999), pages 333–374.

[Hearst, 1999c] Marti A. Hearst. User Interfaces and Visualization. In Baeza-Yates and Ribeiro-Neto (1999), pages 257–339.

[Henderson et al., 2002a] James Henderson, Paola Merlo, Ivan Petroff, and Gerold Schneider. Using NLP to Efficiently Visualize Text Collections with SOMs. In Proceedings of the 3rd International Workshop on Natural Language and Information Systems (NLIS 2002), pages 210–214, Aix-en-Provence, France, 2002.

[Henderson et al., 2002b] James Henderson, Paola Merlo, Ivan Petroff, and Gerold Schneider. Using Syntactic Analysis to Increase Efficiency in Visualizing Text Collections. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), pages 335–341, Taipei, Taiwan, 2002.

[Henzinger et al., 2002] Monika R. Henzinger, Rajeev Motwani, and Craig Silverstein. Challenges in Web Search Engines. SIGIR Forum, 36(2):11–22, 2002.

[Henzinger et al., 2003] Monika R. Henzinger, Bay-Wei Chang, Brian Milch, and Sergey Brin. Query-free News Search. In Proceedings of the 12th International World Wide Web Conference (WWW-12), pages 1–10, Budapest, Hungary, 2003.

[Hettich and Bay, 1999] Seth Hettich and Steven D. Bay. The UCI KDD Archive. 1999. URL (last visited on 17 January 2006): http://kdd.ics.uci.edu.

[Hofmann, 1999a] Thomas Hofmann. The Cluster-Abstraction Model: Unsupervised Learning of Topic Hierarchies from Text Data. In Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI'99), volume 2, pages 682–687, Stockholm, Sweden, 1999.

[Hofmann, 1999b] Thomas Hofmann. Probabilistic Latent Semantic Indexing. In Proceedings of the 22nd International ACM Conference on Research and Development in Information Retrieval (SIGIR'99), pages 50–57, Berkeley, CA, 1999.

[Hotho et al., 2002] Andreas Hotho, Alexander Maedche, and Steffen Staab. Text Clustering Based on Good Aggregations. Künstliche Intelligenz (KI), 16(4):48–54, 2002.

[Hull, 1993] David Hull. Using Statistical Testing in the Evaluation of Retrieval Experiments. In Proceedings of the 16th International ACM Conference on Research and Development in Information Retrieval (SIGIR'93), pages 329–338, Pittsburgh, PA, 1993.

[Jain and Dubes, 1988] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, 1988.

[Jancey, 1966] R.C. Jancey. Multidimensional Group Analysis. Australian Journal of Botany, 14:127–130, 1966.

[Jardine and van Rijsbergen, 1971] Nicholas Jardine and Cornelis J. van Rijsbergen. The Use of Hierarchic Clustering in Information Retrieval. Information Storage and Retrieval, 7(5):217–240, 1971.

[Jones et al., 1995] Gareth Jones, Alexander M. Robertson, Chawchat Santimetvirul, and Peter Willett. Non-Hierarchic Document Clustering Using a Genetic Algorithm. Information Research News, 5(3):2–6, 1995.

[Joshi and Jiang, 2001] Anupam Joshi and Zhihua Jiang. Retriever: Improving Web Search Engine Results Using Clustering. In A. Gangopadhyay, editor, Managing Business With Electronic Commerce: Issues and Trends, pages 59–81. Idea Group Publishing, Hershey, 2001.

[Jurafsky and Martin, 2000] Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, Upper Saddle River, 2000.

[Kanejiya et al., 2004] Dharmendra Kanejiya, Arun Kumar, and Surendra Prasad. Statistical Language Modeling with Performance Benchmarks using Various Levels of Syntactic-Semantic Information. In Proceedings of the International Conference on Computational Linguistics (COLING 2004), pages 1161–1167, Geneva, Switzerland, 2004.

[Kanerva et al., 2000] Pentti Kanerva, Jan Kristofersson, and Anders Holst. Random Indexing of Text Samples for Latent Semantic Analysis. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society, page 1036, Philadelphia, PA, 2000.

[Karlgren, 1999] Jussi Karlgren. Stylistic Experiments in Information Retrieval. In Strzalkowski (1999), pages 147–166.

[Karypis et al., 1999] George Karypis, Eui-Hong (Sam) Han, and Vipin Kumar. Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling. IEEE Computer, 32(8):68–75, 1999.

[Karypis, 2003] George Karypis. Cluto—A Clustering Toolkit. Technical Report RE #02-017. University of Minnesota, Department of Computer Science, 2003. Release 2.1.1.

[Kaski et al., 1998] Samuel Kaski, Timo Honkela, Krista Lagus, and Teuvo Kohonen. WEBSOM—Self-Organizing Maps of Document Collections. Neurocomputing, 21(1–3):101–117, 1998.

[Kaufman and Rousseeuw, 1990] Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data. John Wiley and Sons, New York, 1990.

[Kilgariff, 1997] Adam Kilgariff. Using Word Frequency Lists to Measure Corpus Homogeneity and Similarity between Corpora. In Proceedings of the 5th ACL SIGDAT Workshop on Very Large Corpora, pages 231–245, Beijing and Hong Kong, China, 1997.

[Kilgariff, 2001] Adam Kilgariff. Comparing Corpora. International Journal of Corpus Linguistics, 6(1):1–37, 2001.

[Kleinberg, 1998] Jon M. Kleinberg. Authoritative Sources in a Hyperlinked Environment. In Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 668–677, January 1998. Extended version in the Journal of the ACM, 46(5):604–632, 1999. Also appears as IBM Research Report RJ 10076, 1997.

[Kohonen, 1984] Teuvo Kohonen. Self-Organization and Associative Memory. Springer, Berlin, 1st edition, 1984.

[Kohonen, 2001] Teuvo Kohonen. Self-Organizing Maps. Springer, Berlin, 3rd edition, 2001.

[Korfhage, 1991] Robert Korfhage. To See, or Not to See: Is That the Query? In Proceedings of the 14th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR'91), pages 134–141, Chicago, IL, 1991.

[Kraaij and Pohlmann, 1998] Wessel Kraaij and Renée Pohlmann. Comparing the Effect of Syntactic vs. Statistical Phrase Indexing Strategies for Dutch. In Proceedings of Second European Conference on Research and Advanced Technology for Digital Libraries (ECDL '98), pages 605–614, Heraklion, Crete, Greece, 1998.

[Kreulen et al., 2001] Jeff Kreulen, Dharmendra Modha, W. Scott Spangler, and Ray Strong. An Interactive Approach to Document Classification. Technical report. IBM Almaden Research Center, 2001. US Patent 6,424,971.

[Krishnapuram et al., 1999] Raghu Krishnapuram, Anupam Joshi, and Liyu Yi. A Fuzzy Relative of the k-Medoids Algorithm with Application to Web Document and Snippet Clustering. In Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZIEEE-99), pages 1281–1286, Seoul, South Korea, 1999.

[Lancaster, 1968] Frederick Wilfrid Lancaster. Information Retrieval Systems: Characteristics, Testing and Evaluation. John Wiley and Sons, New York, 1968.

[Lance and Williams, 1967a] Godfrey N. Lance and William Thomas Williams. A General Theory of Classificatory Sorting Strategies. I. Hierarchical Systems. The Computer Journal, 9:373–380, 1967.

[Lance and Williams, 1967b] Godfrey N. Lance and William Thomas Williams. A General Theory of Classificatory Sorting Strategies. II. Clustering Systems. The Computer Journal, 10:271–277, 1967.

[Lang, 1995] Ken Lang. Newsweeder: Learning to Filter Netnews. In Proceedings of the 12th International Conference on Machine Learning (ICML'95), pages 331–339, Tahoe City, CA, 1995.

[Large et al., 1999] J. Andrew Large, Lucy A. Tedd, and Richard J. Hartley. Information Seeking in the Online Age: Principles and Practice. Bowker-Saur, London, 1999.

[Larocca Neto et al., 2000] Joel Larocca Neto, Alexandre D. Santos, Celso A.A. Kaestner, and Alex A. Freitas. Document Clustering and Text Summarization. In Proceedings of the 4th International Conference on Practical Applications of Knowledge Discovery and Data Mining (PADD-2000), pages 41–55, London, United Kingdom, 2000.

[Larsen and Aone, 1999] Bjorner Larsen and Chinatsu Aone. Fast and Effective Text Mining Using Linear-Time Document Clustering. In Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (KDD-99), pages 16–22, San Diego, CA, 1999.

[Leacock et al., 1998] Claudia Leacock, George A. Miller, and Martin Chodorow. Using Corpus Statistics and WordNet Relations for Sense Identification. Computational Linguistics, 24(1):147–165, 1998.

[Lee et al., 2005] Michael D. Lee, Brandon Pincombe, and Matthew Welsh. An Empirical Evaluation of Models of Text Document Similarity. In Proceedings of the 27th Annual Meeting of the Cognitive Science Society (CogSci2005), pages 1254–1259, Austin, TX, 2005.

[Lerman, 1999] Kristina Lerman. Document Clustering in Reduced Dimension Vector Space. URL (last visited on 17 January 2006): www.isi.edu/~lerman/papers/Lerman99.pdf. January 1999, Unpublished.

[Leuski, 2001] Anton Leuski. Evaluating Document Clustering for Interactive Information Retrieval. In Proceedings of the 10th International ACM Conference on Information and Knowledge Management (CIKM 2001), pages 33–40, Atlanta, GA, 2001.

[Lewis, 1996] David Lewis. Dying for Information: An Investigation Into the Effects of Information Overload in the USA and Worldwide. Technical report. Reuters Limited, London, United Kingdom, 1996.

[Lewis, 1997] David D. Lewis. Reuters-21578 Text Categorization Test Collection Distribution 1.0. 1997. URL (last visited on 17 January 2006): www.daviddlewis.com/resources/testcollections.

[Li et al., 2004] Tao Li, Sheng Ma, and Mitsunori Ogihara. Document Clustering via Adaptive Subspace Iteration. In Proceedings of the 27th Annual International Conference on Research and Development in Information Retrieval (SIGIR'04), pages 218–225, Sheffield, United Kingdom, 2004.

[Liu et al., 2002] Xin Liu, Yihong Gong, Wei Xu, and Shenghuo Zhu. Document Clustering with Cluster Refinement and Model Selection Capabilities. In Proceedings of the 25th International ACM Conference on Research and Development in Information Retrieval (SIGIR'02), pages 191–198, Tampere, Finland, 2002.

[Luhn, 1958] Hans Peter Luhn. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 2(2):159–165, 1958.

[Maarek and Smadja, 1989] Yoëlle S. Maarek and Frank A. Smadja. Full Text Indexing Based on Lexical Relations. An Application: Software Libraries. In Proceedings of the 12th International ACM Conference on Research and Development in Information Retrieval (SIGIR'89), pages 198–206, Cambridge, MA, 1989.

[Maarek et al., 2000] Yoëlle S. Maarek, Ronald Fagin, Israel Z. Ben-Shaul, and Dan Pelleg. Ephemeral Document Clustering for Web Applications. Technical Report RJ 10186. IBM, 2000.

[MacQueen, 1967] James B. MacQueen. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297, Berkeley, CA, 1967.

[Macskassy et al., 1998] Sofus A. Macskassy, Arunava Banerjee, Brian D. Davison, and Haym Hirsh. Human Performance on Clustering Web Pages: A Preliminary Study. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD-98), pages 264–268, New York, NY, 1998.

[Manning and Schütze, 1999] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.

[Marchionini and Shneiderman, 1988] Gary Marchionini and Ben Shneiderman. Finding Facts vs. Browsing Knowledge in Hypertext Systems. IEEE Computer, 21(1):70–80, 1988.

[Marchionini, 1989] Gary Marchionini. Information Seeking in Electronic Encyclopedias. Machine-Mediated Learning, 3:211–226, 1989.

[Marchionini, 1992] Gary Marchionini. Interfaces for End-User Information Seeking. Journal of the American Society for Information Science, 43(2):156–163, 1992.

[Marchionini, 1995] Gary Marchionini. Information Seeking in Electronic Environments. Cambridge University Press, Cambridge, MA, 1995.

[Matthiesen, 1999] Martin Matthiesen. Morphologie im Textmining. Master's thesis. Universität Bielefeld, 1999.

[Merlo et al., 2003] Paola Merlo, James Henderson, Gerold Schneider, and Eric Wehrli. Learning Document Similarity Using Natural Language Processing. Linguistik online, 17:99–115, 2003.

[Miller et al., 1990] George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. WordNet: An On-line Lexical Database. International Journal of Lexicography, 3(4):235–312, 1990.

[Mitra et al., 1997] Mandar Mitra, Chris Buckley, Amit Singhal, and Claire Cardie. An Analysis of Statistical and Syntactic Phrases. In Proceedings of 5th International Conference 'Recherche d'Information Assistée par Ordinateur' (RIAO'97), pages 200–214, Montreal, Canada, 1997.

[Miyamoto, 1990] Sadaaki Miyamoto. Fuzzy Sets in Information Retrieval and Cluster Analysis. Kluwer Academic Publishers, Dordrecht, 1990.

[Mladenić and Grobelnik, 1998] Dunja Mladenić and Marko Grobelnik. Word Sequences as Features in Text-Learning. In Proceedings of the 7th Electrotechnical and Computer Science Conference (ERK'98), pages 145–148, Ljubljana, Slovenia, 1998.

[Modha and Spangler, 2000] Dharmendra S. Modha and W. Scott Spangler. Clustering Hypertext with Applications to Web Searching. In Proceedings of the 11th ACM Conference on Hypertext and Hypermedia, pages 143–152, San Antonio, TX, 2000.

[Modha and Spangler, 2003] Dharmendra S. Modha and W. Scott Spangler. Feature Weighting in k-Means Clustering. Machine Learning, 52(3):217–237, 2003.

[Moore et al., 1997] Jerome Moore, Eui-Hong (Sam) Han, Daniel Boley, Maria Gini, Robert Gross, Kyle Hastings, George Karypis, Vipin Kumar, and Bamshad Mobasher. Web Page Categorization and Feature Selection Using Association Rule and Principal Component Clustering. In 7th Workshop on Information Technologies and Systems (WITS'97), Atlanta, GA, 1997.

[Muresan, 2002] Gheorghe Muresan. Using Document Clustering and Language Modelling in Mediated Information Retrieval. PhD thesis. School of Computing, The Robert Gordon University, Aberdeen, Scotland, 2002.

[Newton and O'Brien, 2002] Jack Newton and Christopher O'Brien. Scalable Clustering of Documents with Multiple Membership. Technical report. University of Alberta, Canada, 2002.

[Nielsen, 2003] Jakob Nielsen. Information Pollution. Jakob Nielsen's Alertbox, 11 August 2003. URL (last visited on 17 January 2006): www.useit.com/alertbox/20030811.html.

[Noel et al., 2003] Steven Noel, Vijay V. Raghavan, and Chee-Hung Henry Chu. Document Clustering, Visualization, and Retrieval via Link Mining. In Wu et al. (2003), pages 161–194.

[Ogilvie and Callan, 2003] Paul Ogilvie and Jamie Callan. Combining Document Representations for Known-Item Search. In Proceedings of the 26th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR'03), pages 143–150, Toronto, Canada, 2003.

[Osdin et al., 2002] Richard Osdin, Iadh Ounis, and Ryen W. White. Using Hierarchical Clustering and Summarisation Approaches for Web Retrieval: Glasgow at the TREC 2002 Interactive Track. In Proceedings of the 11th Text REtrieval Conference (TREC 2002), Gaithersburg, MD, 2002.

[Page et al., 1998] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-0120. Stanford Digital Library Technologies, 1998.

[Palmer et al., 2001] Christopher R. Palmer, Jerome Pesenti, Raúl E. Valdés-Pérez, Michael G. Christel, Alexander G. Hauptmann, Dorbin Ng, and Howard D. Wactlar. Demonstration of Hierarchical Document Clustering of Digital Library Retrieval Results. In Proceedings of the First ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2001), page 451, Roanoke, VA, 2001.

[Pantel and Lin, 2002] Patrick Pantel and Dekang Lin. Document Clustering with Committees. In Proceedings of the 25th International ACM Conference on Research and Development in Information Retrieval (SIGIR’02), pages 199–206, Tampere, Finland, 2002.

[Pekar et al., 2004] Viktor Pekar, Michael Krkoska, and Steffen Staab. Feature Weighting for Co-occurrence-based Classification of Words. In Proceedings of the International Conference on Computational Linguistics (COLING 2004), pages 799–805, Geneva, Switzerland, 2004.

[Pelleg and Moore, 2000] Dan Pelleg and Andrew W. Moore. X-means: Extending K-means with Efficient Estimation of the Number of Clusters. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pages 727–734, Stanford, CA, 2000.

[Pereira et al., 1993] Fernando C. Pereira, Naftali Tishby, and Lillian Lee. Distributional Clustering of English Words. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, pages 183–190, 1993.

[Pericliev and Valdés-Pérez, 1998] Vladimir Pericliev and Raúl E. Valdés-Pérez. A Procedure for Multi-Class Discrimination and Some Linguistic Applications. In Proceedings of the 17th International Conference on Computational Linguistics and of the 36th Annual Meeting of the Association for Computational Linguistics (COLING-ACL'98), pages 1034–1040, 1998.

[Perlich et al., 2003] Claudia Perlich, Foster Provost, and Jeffrey S. Simonoff. Tree Induction vs. Logistic Regression: A Learning-curve Analysis. Journal of Machine Learning Research, 4:211–255, 2003.

[Pirolli et al., 1996] Peter Pirolli, James Pitkow, and Ramana Rao. Silk from a Sow’s Ear: Extracting Usable Structures from the Web. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI‘96), volume 1, pages 118–125, Vancouver, Canada, 1996.

[Pitkow and Pirolli, 1997] James Pitkow and Peter Pirolli. Life, Death and Lawfulness on the Electronic Frontier. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI‘97), pages 383–390, Atlanta, GA, 1997.

[Porter, 1980] Martin F. Porter. An Algorithm for Suffix Stripping. Program, 14(3):130–137, 1980. Reprinted in K. Sparck Jones and P. Willett, editors, Readings in Information Retrieval, pages 313–316, Morgan Kaufmann, San Francisco, 1997.

[Pullwitt and Der, 2001] Daniel Pullwitt and Ralf Der. Integrating Contextual Information into Text Document Clustering with Self-Organizing Maps. In Advances in Self-Organising Maps, Proceedings of the Workshop on Self-Organising Maps (WSOM'01), pages 54–60, Lincoln, United Kingdom, 2001.

[Purandare and Pedersen, 2004] Amruta Purandare and Ted Pedersen. SenseClusters—Finding Clusters that Represent Word Senses. In Proceedings of Fifth Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-04), pages 26–29, Boston, MA, 2004.

[Renner and Aszódi, 2000] Alexander Renner and András Aszódi. High-throughput Functional Annotation of Novel Gene Products Using Document Clustering. In Proceedings of the Fifth Pacific Symposium on Biocomputing, pages 54–68, Honolulu, HI, 2000.

[Rezaee et al., 1998] Mahmoud Ramze Rezaee, Boudewijn P.F. Lelieveldt, and Johan H.C. Reiber. A New Cluster Validity Index for the Fuzzy c-Mean. Pattern Recognition Letters, 19 (3–4):237–246, 1998.

[Riloff, 1995] Ellen Riloff. Little Words Can Make a Big Difference for Text Classification. In Proceedings of the 18th International ACM Conference on Research and Development in Information Retrieval (SIGIR’95), pages 130–136, Seattle, WA, 1995.

[Roeck et al., 2004a] Anne De Roeck, Avik Sarkar, and Paul Garthwaite. Defeating the Homogeneity Assumption. In Proceedings of the 7th International Conference on the Statistical Analysis of Textual Data (JADT 2004), Louvain-la-Neuve, Belgium, 2004.

[Roeck et al., 2004b] Anne De Roeck, Avik Sarkar, and Paul Garthwaite. Frequent Term Distribution Measures for Dataset Profiling. In Proceedings of the 4th International Conference on Language Resources and Evaluation, Lisbon, Portugal, 2004.

[Rose et al., 2002] Tony G. Rose, Mark Stevenson, and Miles Whitehead. The Reuters Corpus Volume 1—from Yesterday's News to Tomorrow's Language Resources. In Proceedings of the Third International Conference on Language Resources and Evaluation, pages 29–31, Las Palmas de Gran Canaria, Spain, 2002. URL (last visited on 17 January 2006): http://about.reuters.com/researchandstandards/corpus/.

[Rosell et al., 2004] Magnus Rosell, Viggo Kann, and Jan-Eric Litton. Comparing Comparisons: Document Clustering Evaluation Using Two Manual Classifications. In Proceedings of the International Conference on Natural Language Processing (ICON-2004), Hyderabad, India, 2004.

[Rosell, 2003] Magnus Rosell. Improving Clustering of Swedish Newspaper Articles using Stemming and Compound Splitting. In Proceedings of the 14th Nordic Conference on Computational Linguistics (NoDaLiDa-2003), Reykjavik, Iceland, 2003.

[Rosenfeld and Morville, 1998] Louis Rosenfeld and Peter Morville. Information Architecture for the World Wide Web. O'Reilly and Associates, Sebastopol, CA, 1998.

[Rüger and Gauch, 2000] Stefan M. Rüger and Susan E. Gauch. Feature Reduction for Document Clustering and Classification. Technical Report DTR 2000/8. Department of Computing, Imperial College London, 2000.

[Sahlgren and Cöster, 2004] Magnus Sahlgren and Rickard Cöster. Using Bag-of-Concepts to Improve the Performance of Support Vector Machines in Text Categorization. In Proceedings of the International Conference on Computational Linguistics (COLING 2004), pages 487–493, Geneva, Switzerland, 2004.

[Salton and Buckley, 1988] Gerard Salton and Christopher Buckley. Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management, 24(5):513–523, 1988. Reprinted in K. Sparck Jones and P. Willett, editors, Readings in Information Retrieval, pages 323–328, Morgan Kaufmann, San Francisco, 1997.

[Salton and McGill, 1983] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.

[Saunders and Sheffield, 1998] Sam Saunders and Philip W. Sheffield. Searching for Clues—Education Professionals' Use of the Education-line Interface to Uncover Whole Texts in Education and Training. In Proceedings of the Internet Research and Information for Social Scientists Conference (IRISS'98), Bristol, United Kingdom, 1998.

[Schmid, 1994] Helmut Schmid. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing, pages 44–49, Manchester, United Kingdom, 1994.

[Schmid, 1999] Helmut Schmid. Improvements in Part-of-Speech Tagging with an Application to German. In S. Armstrong, K.W. Church, P. Isabelle, S. Manzi, E. Tzoukerman, and D. Yarowsky, editors, Natural Language Processing Using Very Large Corpora, pages 13–25. Kluwer Academic Publishers, Dordrecht, 1999.

[Schütze and Silverstein, 1997] Hinrich Schütze and Craig Silverstein. Projections for Efficient Document Clustering. In Proceedings of the 20th International ACM Conference on Research and Development in Information Retrieval (SIGIR'97), pages 74–81, Philadelphia, PA, 1997.

[Schütze, 1998] Hinrich Schütze. Automatic Word Sense Discrimination. Computational Linguistics, 24(1):97–123, 1998.

[Shenk, 1997] David Shenk. Data Smog: Surviving the Information Glut. Harper Collins Publishers, New York, 1997.

[Shneiderman et al., 1997] Ben Shneiderman, Donald Byrd, and W. Bruce Croft. Clarifying Search: A User-Interface Framework for Text Searches. D-LIB Magazine of Digital Library Research, 1997. URL (last visited on 17 January 2006): www.dlib.org.

[Shneiderman, 1998] Ben Shneiderman. Designing the User Interface: Strategies for Effective Human-Computer Interaction. Addison-Wesley, Reading, MA, 3rd edition, 1998.

[Silva et al., 2001] Joaquim Silva, João Mexia, Carlos A. Coelho, and Gabriel Lopes. Multilingual Document Clustering, Topic Extraction and Data Transformations. In Progress in Artificial Intelligence, Knowledge Extraction, Multi-agent Systems, Logic Programming and Constraint Solving. Proceedings of the 10th Portuguese Conference on Artificial Intelligence (EPIA 2001), pages 74–87, Porto, Portugal, 2001.

[Singhal et al., 1996a] Amit Singhal, Chris Buckley, and Mandar Mitra. Pivoted Document Length Normalization. In Proceedings of the 19th International ACM Conference on Research and Development in Information Retrieval (SIGIR’96), pages 21–29, Zurich, Switzerland, 1996.

[Singhal et al., 1996b] Amit Singhal, Gerard Salton, Mandar Mitra, and Chris Buckley. Document Length Normalization. Information Processing and Management, 32(5):619–633, 1996.

[Sinka and Corne, 2002] Mark Sinka and David Corne. A Large Benchmark Dataset for Web Document Clustering. In A. Abraham, J. Ruiz del Solar, and M. Koeppen, editors, Soft Computing Systems: Design, Management and Applications, pages 881–890. IOS Press, Amsterdam, 2002. 256 BIBLIOGRAPHY

[Sinka and Corne, 2003a] Mark P. Sinka and David W. Corne. Towards Modernised and Web- Specific Stoplists for Web Document Analysis. In Proceedings of the IEEE/WIC International Conference on Web Intelligence (WI ’03), pages 396–402, Halifax, Canada, 2003.

[Sinka and Corne, 2003b] Mark P. Sinka and David W. Corne. Evolving Better Stoplists for Document Clustering and Web Intelligence. In A. Abraham, M. Koeppen, and K. Franke, editors, Design and Application of Hybrid Intelligent Systems, pages 1015–1023. IOS Press, Amsterdam, 2003.

[Sinka and Corne, 2004] Mark P. Sinka and David W. Corne. The BankSearch Web Document Dataset: Investigating Unsupervised Clustering and Category Similarity. Journal of Network and Computer Applications, 28(2):129–146, 2004.

[Skogmar and Olsson, 2002] Klas Skogmar and Johan Olsson. Clustering Documents with Vector Space Model using N-Grams. Course work, Lund Institute of Technology, Sweden, 2002.

[Slonim and Tishby, 2000] Noam Slonim and Naftali Tishby. Document Clustering using Word Clusters via the Information Bottleneck Method. In Proceedings of the 23rd International ACM Conference on Research and Development in Information Retrieval (SIGIR'00), pages 208–215, Athens, Greece, 2000.

[Slonim et al., 2002] Noam Slonim, Nir Friedman, and Naftali Tishby. Unsupervised Document Classification Using Sequential Information Maximization. In Proceedings of the 25th International ACM Conference on Research and Development in Information Retrieval (SIGIR'02), pages 129–136, Tampere, Finland, 2002.

[Smeaton, 1997] Alan F. Smeaton. Information Retrieval: Still Butting Heads with Natural Language Processing? In M.T. Pazienza, editor, Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology, International Summer School (SCIE-97), pages 115–138, Frascati, Italy, 1997.

[Smyth, 1996] Padhraic Smyth. Clustering Using Monte Carlo Cross-Validation. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pages 126–133, Portland, OR, 1996.

[Sneath and Sokal, 1973] Peter H.A. Sneath and Robert R. Sokal. Numerical Taxonomy. W.H. Freeman and Company, San Francisco, 1973.

[Spangler and Kreulen, 2002] W. Scott Spangler and Jeffrey Kreulen. Interactive Methods for Taxonomy Editing and Validation. In Proceedings of the 11th International ACM Conference on Information and Knowledge Management (CIKM 2002), pages 665–668, McLean, VA, 2002.

[Sparck Jones, 1971] Karen Sparck Jones. Automatic Keyword Classification for Information Retrieval. Butterworths, London, 1971.

[Sparck Jones, 1999] Karen Sparck Jones. What is the Role of NLP in Text Retrieval? In Strzalkowski (1999), pages 1–24.

[Spink et al., 2001] Amanda Spink, Dietmar Wolfram, Bernard J. Jansen, and Tefko Saracevic. Searching the Web: The Public and Their Queries. Journal of the American Society for Information Science, 52(3):226–234, 2001.

[Stefanowski and Weiss, 2003] Jerzy Stefanowski and Dawid Weiss. Carrot2 and Language Properties in Web Search Results Clustering. In Web Intelligence: Proceedings of the First International Atlantic Web Intelligence Conference (AWIC 2003), pages 240–249, Madrid, Spain, 2003.

[Steinbach et al., 2000] Michael Steinbach, George Karypis, and Vipin Kumar. A Comparison of Document Clustering Techniques. In Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining (KDD-00), Boston, MA, 2000. Extended version appears as Technical Report #00-034, University of Minnesota, 2000.

[Stevenson, 2003] Mark Stevenson. Word Sense Disambiguation. Center for the Study of Language and Information, Stanford, 2003.

[Strehl et al., 2000] Alexander Strehl, Joydeep Ghosh, and Raymond Mooney. Impact of Similarity Measures on Web-page Clustering. In Proceedings of the 17th National Conference on Artificial Intelligence: Workshop of Artificial Intelligence for Web Search (AAAI-2000), pages 58–64, Austin, TX, 2000.

[Strzalkowski, 1999] Tomek Strzalkowski, editor. Natural Language Information Retrieval. Kluwer Academic Publishers, Dordrecht, 1999.

[Sung et al., 2003] Sam Yuan Sung, Zhao Li, and Tok Wang Ling. Clustering Techniques for Large Database Cleansing. In Wu et al. (2003), pages 227–260.

[Taylor, 1968] Robert S. Taylor. Question-Negotiation and Information Seeking in Libraries. College & Research Libraries, 29(3):178–194, 1968.

[Tishby et al., 1999] Naftali Tishby, Fernando C. Pereira, and William Bialek. The Information Bottleneck Method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377, Urbana, IL, 1999.

[Titterington et al., 1985] Donald Michael Titterington, Adrian F.M. Smith, and Udi E. Makov. Statistical Analysis of Finite Mixture Distributions. John Wiley and Sons, Chichester, 1985.

[Tombros and van Rijsbergen, 2001] Anastasios Tombros and Cornelis J. van Rijsbergen. Query-Sensitive Similarity Measures for the Calculation of Interdocument Relationships. In Proceedings of the 10th International ACM Conference on Information and Knowledge Management (CIKM 2001), pages 17–24, Atlanta, GA, 2001.

[Tombros et al., 2002] Anastasios Tombros, Robert Villa, and Cornelis J. van Rijsbergen. The Effectiveness of Query-Specific Hierarchic Clustering in Information Retrieval. Information Processing and Management, 38(4):559–582, 2002.

[Tombros, 2002] Anastasios Tombros. The Effectiveness of Query-Based Hierarchic Clustering of Documents for Information Retrieval. PhD thesis. University of Glasgow, Scotland, 2002.

[Uchida et al., 1999] Hiroshi Uchida, Meiying Zhu, and Tarcisio Della Senta. UNL: A Gift for a Millennium. Technical report. The United Nations University, Tokyo, Japan, 1999.

[Vakkari et al., 1997] Pertti Vakkari, Reijo Savolainen, and Brenda Dervin, editors. Information Seeking in Context. Proceedings of an International Conference on Research in Information Needs, Seeking and Use in Different Contexts, Tampere, Finland, 1996. Taylor Graham, London, 1997.

[Vakkari, 1999] Pertti Vakkari. Task Complexity, Problem Structure and Information Actions—Integrating Studies on Information Seeking and Retrieval. Information Processing and Management, 35:819–837, 1999.

[van Rijsbergen and Sparck Jones, 1973] Cornelis J. van Rijsbergen and Karen Sparck Jones. A Test for the Separation of Relevant and Non-Relevant Documents in Experimental Retrieval Collections. Journal of Documentation, 29(3):251–257, 1973.

[van Rijsbergen, 1979] Cornelis J. van Rijsbergen. Information Retrieval. Butterworths, London, 2nd edition, 1979.

[Vinokourov and Girolami, 2000] Alexei Vinokourov and Mark Girolami. A Probabilistic Hierarchical Clustering Method for Organizing Collections of Text Documents. In Proceedings of the 15th International Conference on Pattern Recognition (ICPR’2000), volume 2, pages 182–185, Barcelona, Spain, 2000. Extended version available as Technical Report 5, University of Paisley, United Kingdom, 2000.

[Vivísimo, 2006] Vivísimo. Frequently Asked Questions. 2006. URL (last visited on 17 January 2006): http://vivisimo.com/html/faq.

[Volk and Stepanov, 2001] Daniel Volk and Mikhail G. Stepanov. Resampling Methods for Document Clustering. Technical Report cond-mat/0109006. arXiv.org, 2001. URL (last visited on 17 January 2006): http://arxiv.org/pdf/cond-mat/0109006.

[Volk, 1999] Martin Volk. Choosing the right lemma when analysing German nouns. In Multilinguale Corpora: Codierung, Strukturierung, Analyse. 11. Jahrestagung der GLDV, pages 304–310, Frankfurt, Germany, 1999.

[Voorhees, 1985] Ellen M. Voorhees. The Cluster Hypothesis Revisited. In Proceedings of the 8th International ACM Conference on Research and Development in Information Retrieval (SIGIR’85), pages 188–196, Montreal, Canada, 1985.

[Voorhees, 1986] Ellen M. Voorhees. The Efficiency of Inverted Index and Cluster Search. In Proceedings of the 9th International ACM Conference on Research and Development in Information Retrieval (SIGIR’86), pages 164–174, Pisa, Italy, 1986.

[Wang and Kitsuregawa, 2001] Yitong Wang and Masaru Kitsuregawa. Link Based Clustering of Web Search Results. In Proceedings of the Second International Conference on Advances in Web-Age Information Management (WAIM 2001), pages 225–236, Xi’an, China, 2001.

[Wang and Kitsuregawa, 2002] Yitong Wang and Masaru Kitsuregawa. On Combining Link and Contents Information for Web Page Clustering. In Proceedings of the 13th International Database and Expert Systems Applications Conference (DEXA-2002), pages 902–913, Aix-en-Provence, France, 2002.

[Ward, 1963] Joe H. Ward. Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association, 58(301):236–244, 1963.

[Weil and Rosen, 1997] Michelle M. Weil and Larry D. Rosen. TechnoStress: Coping With Technology @Work @Home @Play. John Wiley and Sons, New York, 1997.

[Weiss et al., 1996] Ron Weiss, Bienvenido Vélez, Mark A. Sheldon, Chanathip Namprempre, Peter Szilagyi, Andrzej Duda, and David K. Gifford. HyPursuit: A Hierarchical Network Search Engine that Exploits Content-Link Hypertext Clustering. In Proceedings of the 7th ACM Conference on Hypertext, pages 180–193, Washington, DC, 1996.

[Weiss et al., 2000a] Sholom M. Weiss, Brian F. White, and Chidanand Apte. Lightweight Document Clustering. Technical Report RC-21684. IBM T.J. Watson Research Center, 2000.

[Weiss et al., 2000b] Sholom M. Weiss, Brian F. White, Chidanand V. Apte, and Frederick J. Damerau. Lightweight Document Matching for Help-Desk Applications. IEEE Intelligent Systems, 15(2):57–61, 2000.

[Welch, 1984] Terry A. Welch. A Technique for High-Performance Data Compression. IEEE Computer, 17(6):8–19, 1984.

[Wen and Zhang, 2003] Ji-Rong Wen and Hong-Jiang Zhang. Query Clustering in the Web Context. In Wu et al. (2003), pages 195–226.

[Wen et al., 2001] Ji-Rong Wen, Jian-Yun Nie, and Hong-Jiang Zhang. Clustering User Queries of a Search Engine. In Proceedings of the 10th International World Wide Web Conference (WWW-10), pages 162–168, Hong Kong, China, 2001.

[Wilbur and Sirotkin, 1992] W. John Wilbur and Karl Sirotkin. The Automatic Identification of Stop Words. Journal of Information Science, 18(1):45–55, 1992.

[Willett, 1985] Peter Willett. Query-specific Automatic Document Classification. International Forum on Information and Documentation, 10(2):28–32, 1985.

[Willett, 1988] Peter Willett. Recent Trends in Hierarchical Document Clustering: A Critical Review. Information Processing and Management, 24(5):577–597, 1988.

[Williams, 2000a] Simon Williams. A Survey of Natural Language Processing Techniques for Text Data Mining. Technical Report 2000/127. CSIRO Mathematical and Information Sciences, Australia, 2000.

[Williams, 2000b] Simon Williams. A Survey of Text Data Mining Products. Technical Report 2000/128. CSIRO Mathematical and Information Sciences, Australia, 2000.

[Winkle, 1998] William Van Winkle. Information Overload: Fighting data asphyxiation is difficult but possible. Computer Bits magazine, 1 February 1998.

[Wu et al., 2003] Weili Wu, Hui Xiong, and Shashi Shekhar, editors. Clustering and Information Retrieval. Kluwer Academic Publishers, Dordrecht, 2003.

[Xiao, 2003] Yangzhe Xiao. Evaluation of Kernel Function Modification in Text Classification Using SVM. Language Workshop of the First instructional/informational Conference on Machine Learning (iCML-2003), Rutgers University, Piscataway, NJ, 3–8 December 2003.

[Xu and Croft, 1996] Jinxi Xu and W. Bruce Croft. Query Expansion Using Local and Global Document Analysis. In Proceedings of the 19th International ACM Conference on Research and Development in Information Retrieval (SIGIR’96), pages 4–11, Zurich, Switzerland, 1996.

[Yang and Pedersen, 1997] Yiming Yang and Jan O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. In Proceedings of the 14th International Conference on Machine Learning (ICML’97), pages 412–420, Nashville, TN, 1997.

[Yang, 1997] Shu Ching Yang. Information Seeking as Problem-Solving Using a Qualitative Approach to Uncover the Novice Learners’ Information-Seeking Processes in a Perseus Hypertext System. Library & Information Science Research, 19(1):71–92, 1997.

[Yang, 1999] Yiming Yang. An Evaluation of Statistical Approaches to Text Categorization. Information Retrieval, 1(1-2):69–90, 1999.

[Zadeh, 1965] Lotfi A. Zadeh. Fuzzy Sets. Information and Control, 8(3):338–353, 1965.

[Zamir and Etzioni, 1998] Oren Zamir and Oren Etzioni. Web Document Clustering: A Feasibility Demonstration. In Proceedings of the 21st International ACM Conference on Research and Development in Information Retrieval (SIGIR’98), pages 46–54, Melbourne, Australia, 1998.

[Zamir and Etzioni, 1999] Oren Zamir and Oren Etzioni. Grouper: A Dynamic Clustering Interface to Web Search Results. Computer Networks, 31(11–16):1361–1374, 1999. See also Proceedings of the 8th International World Wide Web Conference (WWW-8), Toronto, Canada, 1999.

[Zamir, 1999] Oren Zamir. Clustering Web Documents: A Phrase-Based Method for Grouping Search Engine Results. PhD thesis. University of Washington, 1999.

[Zhang et al., 1996] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proceedings of the 1996 ACM International Conference on Management of Data (SIGMOD-96), pages 103–114, Montreal, Canada, 1996.

[Zhao and Karypis, 2001] Ying Zhao and George Karypis. Criterion Functions for Document Clustering: Experiments and Analysis. Technical Report TR #01-040. University of Minnesota, 2001.

[Zhao and Karypis, 2002] Ying Zhao and George Karypis. Evaluation of Hierarchical Clustering Algorithms for Document Datasets. In Proceedings of the 11th International ACM Conference on Information and Knowledge Management (CIKM 2002), pages 515–524, McLean, VA, 2002. Appears also as Technical Report TR #02-22, University of Minnesota, 2002.

[Zhao and Karypis, 2003] Ying Zhao and George Karypis. Hierarchical Clustering Algorithms for Document Datasets. Technical Report TR #03-027. University of Minnesota, 2003.

[Zhong and Ghosh, 2005] Shi Zhong and Joydeep Ghosh. Generative Model-based Clustering of Documents: a Comparative Study. Knowledge and Information Systems (KAIS), 8:374–384, 2005.

[Zhou and Zhang, 2003] Lina Zhou and Dongsong Zhang. NLPIR: A Theoretical Framework for Applying Natural Language Processing to Information Retrieval. Journal of the American Society for Information Science and Technology, 54(2):115–123, 2003.

[Ziv and Lempel, 1977] Jacob Ziv and Abraham Lempel. A Universal Algorithm for Sequential Data Compression. IEEE Transactions on Information Theory, 23(3):337–343, 1977.

Acknowledgements

I would like to express my deep gratitude to all those who have actually read this thesis in full and have offered constant support and invaluable advice: Prof. Michael Hess and Prof. Avi Bernstein. Special thanks go to the Neue Zürcher Zeitung (NZZ) and the Schweizerische Depeschenagentur (SDA), which have kindly provided two excellent corpora for use in this study. For technical support, guidance and patience I would like to thank in particular Beat Rageth, Manfred Klenner and Simon Clematide of the Department of Informatics and the Institute of Computational Linguistics at Zurich University. Other colleagues and office friends who have supported me technically or morally during these three years are too numerous to list here individually; to all of them a heartfelt “thank you”. Finally, I cannot end without gratefully acknowledging the unfailing support and generous encouragement that I have received from my family, from my friends and in particular from Silvia.


Curriculum Vitae

Richard Forster
Born 18 January 1975 in . Attended primary school and the gymnasium “Rychenberg” in Winterthur (1982–1994). From 1994 to 2001 studied Wirtschaftsinformatik (information science and economics) at Zurich University, graduating with a diploma thesis in Artificial Intelligence under Prof. Rolf Pfeifer (“Visualising Artificial Ontogeny—Grovis”). Study was interrupted by military service and combined with various activities as an international chess master, trainer and writer. Since 1996 chess editor of the Neue Zürcher Zeitung and contributor to nearly a dozen international magazines. From 1998 to 2003 first occasional, then full-time research for Amos Burn—A Chess Biography, published in 2004 by McFarland Inc., Jefferson NC, USA (972 pp.). From 2003 to 2006 work on the present PhD thesis under Prof. Michael Hess at the Institute of Computational Linguistics in Zurich. Since 2005 work on a new chess history project for the 200th anniversary of the Schachgesellschaft Zürich in 2009.
