Document Clustering in Large German Corpora Using Natural Language Processing


Zurich Open Repository and Archive, University of Zurich
Main Library, Strickhofstrasse 39, CH-8057 Zurich
www.zora.uzh.ch

Year: 2006

Document clustering in large German corpora using Natural Language Processing

Forster, Richard

Abstract: Ever since the advent of computer systems and, in particular, the Internet, the amount of information theoretically at our disposal has been increasing exponentially. One way to deal with the extraordinary flood of data is cluster analysis. It is used here to divide large unstructured document corpora into groups of more or less closely related documents. Document clustering consists of two fundamental stages: document representation and clustering. Despite a number of detailed textbooks on cluster analysis in general, no such work seems to have been carried out on the specific needs of document clustering. The first part of the thesis is therefore dedicated to comprehensive surveys of existing clustering algorithms and document representation techniques. In addition, a scheme is presented for classifying different clustering applications in accordance with their time-criticality. The second part of the thesis is devoted to an evaluation of Natural Language Processing (NLP) as a means of improving the document representations. To this end, five large German data sets have been compiled and described. NLP techniques ranging from the very simple to complex syntactic and semantic models were evaluated on these five data sets. It emerges that the success of many NLP representation techniques depends on the data under consideration, for which a hypothetical explanation is offered. All in all, evidence is found both pro and contra Natural Language Processing. For the majority of individual cases, distinct potential for improvement through NLP can be shown.
Posted at the Zurich Open Repository and Archive, University of Zurich.
ZORA URL: https://doi.org/10.5167/uzh-163398
Dissertation, Published Version

Originally published at: Forster, Richard. Document clustering in large German corpora using Natural Language Processing. 2006, University of Zurich, Faculty of Arts.

Thesis presented to the Faculty of Arts of the University of Zurich for the degree of Doctor of Philosophy by Richard Forster of Neunkirch (SH). Accepted on the recommendation of Professor Dr. Michael Hess and Professor Abraham Bernstein Ph.D., 2006.

Abstract

Ever since the advent of computer systems and, in particular, the Internet, the amount of information theoretically at our disposal has been increasing exponentially. This phenomenal growth in data is not only a blessing: the more information is available, the more difficult it becomes to find one's way to the particular piece of information of interest. As a consequence, investigations into old and new techniques for dealing with the extraordinary flood of data remain topical for information science. Cluster analysis has a rich and independent history of its own. Relatively recently it has acquired two new application areas in the fields of Information Retrieval and Data Mining. Clustering is used here to divide large unstructured document corpora into groups of more or less closely related documents. The clusters can then be used as a well-arranged interface to a potentially huge and overwhelming number of documents, allowing a prospective user to home in quickly on his specific requirements.

Document clustering consists of two fundamental stages: document representation (the transformation of documents as linear strings of words into suitable data structures) and clustering (the algorithmic grouping of these representations). Despite a number of detailed textbooks on cluster analysis in general, no such work seems to have been carried out on the specific needs of document clustering. Special attention is therefore paid to a comprehensive introduction. The first part of the thesis is dedicated to systematic surveys of existing clustering algorithms (with emphasis on those used for documents) and document representation techniques as encountered in practice. Particular care has been taken with the presentation of a uniform notation, since the cluster analysis literature is notoriously rich in notations and multiple names for, often, one and the same concept. In addition, a scheme is presented for classifying different clustering applications in accordance with their time-criticality.

The second part of the thesis is devoted to an evaluation of Natural Language Processing (NLP) as a means of improving the document representations. More generally, the goal has been to help answer the old key question of whether or not the extra effort usually required for sophisticated NLP applications is rewarded by sufficient extra benefits; document clustering provides one typical battle-ground. To this end, five large German data sets have been compiled and described. Each consists of several thousand documents and each was extracted from a specific source (two news outlets, two book services and one encyclopaedic database). In each data set the documents carried separate content labels derived from meta-information provided by the original source.
Serving as “objective truths”, these labels were used to evaluate the performance of different clustering algorithms and document representations. NLP techniques ranging from the very simple to complex syntactic and semantic models were then tested and evaluated on these five data sets. The techniques were divided into two groups: those aiming at a reduction of the representation complexity (with the twofold goal of achieving qualitative improvements and quantitative savings) and those aiming to enhance document representation with extra features (the goal now only being to improve the results through a more refined representation). Separately, there followed an evaluation of how well the various techniques worked together. The thesis ends with an interpretation of the results, a discussion of the virtues shown by NLP methods in this particular domain and an overview of future research areas. It emerges that the success of many NLP representation techniques depends on the data under consideration, for which a hypothetical explanation is offered. All in all, evidence is found both pro and contra Natural Language Processing. For the majority of individual cases, distinct potential for improvement through NLP can be shown.
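As a concrete illustration of the two-stage pipeline described above, the sketch below represents a handful of documents as tf-idf vectors, clusters them with k-means, and scores the result against known content labels, mirroring the thesis's label-based evaluation. scikit-learn and the toy corpus are assumptions of this sketch, not the tooling or data of the 2006 thesis.

    # Two-stage document clustering: (1) representation, (2) clustering,
    # evaluated against "objective truth" labels. Illustrative sketch only.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    docs = [
        "Der Zug nach Bern ist verspätet",
        "Die Bahn meldet Verspätungen im Fernverkehr",
        "Das Orchester spielt heute Brahms",
        "Ein Konzert mit Werken von Brahms und Mahler",
    ]
    labels = [0, 0, 1, 1]  # content labels derived from meta-information

    # Stage 1: document representation (plain word tf-idf; the thesis
    # evaluates NLP-enhanced variants of exactly this step).
    vectors = TfidfVectorizer().fit_transform(docs)

    # Stage 2: clustering of the representations.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

    # Label-based evaluation of the clustering.
    print(adjusted_rand_score(labels, km.labels_))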
Recommended publications
  • Probabilistic Topic Modelling with Semantic Graph
    Probabilistic Topic Modelling with Semantic Graph. Long Chen, Joemon M. Jose, Haitao Yu, Fajie Yuan, and Huaizhi Zhang. School of Computing Science, University of Glasgow, Sir Alwyns Building, Glasgow, UK. [email protected]. Abstract: In this paper we propose a novel framework, topic model with semantic graph (TMSG), which couples a topic model with the rich knowledge from DBpedia. To begin with, we extract the disambiguated entities from the document collection using a document entity linking system, i.e., DBpedia Spotlight, from which two types of entity graphs are created from DBpedia to capture local and global contextual knowledge, respectively. Given the semantic graph representation of the documents, we propagate the inherent topic-document distribution with the disambiguated entities of the semantic graphs. Experiments conducted on two real-world datasets show that TMSG can significantly outperform the state-of-the-art techniques, namely, the author-topic model (ATM) and the topic model with biased propagation (TMBP). Keywords: Topic model, Semantic graph, DBpedia. Introduction: Topic models, such as Probabilistic Latent Semantic Analysis (PLSA) [7] and Latent Dirichlet Allocation (LDA) [2], have been remarkably successful in analyzing textual content. Specifically, each document in a document collection is represented as a random mixture over latent topics, where each topic is characterized by a distribution over words. Such a paradigm is widely applied in various areas of text mining. In view of the fact that the information used by these models is limited to the document collection itself, some recent progress has been made on incorporating external resources, such as time [8], geographic location [12], and authorship [15], into topic models.
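    TMSG itself is not reproduced here; as a minimal sketch of the underlying paradigm the abstract describes (documents as random mixtures over latent topics), the following uses scikit-learn's LDA implementation as a stand-in, with an invented toy corpus.

        # Plain LDA baseline: documents as mixtures over latent topics.
        # TMSG's DBpedia entity graphs are not part of this sketch.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        docs = [
            "the court ruled on the new tax law",
            "parliament passed the tax reform bill",
            "the striker scored twice in the final",
            "the team won the championship match",
        ]

        counts = CountVectorizer(stop_words="english").fit_transform(docs)
        lda = LatentDirichletAllocation(n_components=2, random_state=0)
        doc_topic = lda.fit_transform(counts)  # per-document topic mixtures
        print(doc_topic.round(2))              # each row sums to 1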
  • A Hierarchical Clustering Approach for DBpedia Based Contextual Information of Tweets
    Journal of Computer Science, Original Research Paper. A Hierarchical Clustering Approach for DBpedia based Contextual Information of Tweets. Venkatesha Maravanthe (Dept. of Computer Science and Engineering, VTU Research Resource Centre, Belagavi, India), Prasanth Ganesh Rao and Deepa Shenoy Punjalkatte (Dept. of Computer Science and Engineering, University Visvesvaraya College of Engineering, Bengaluru, India), Anita Kanavalli (Dept. of Computer Science and Engineering, Ramaiah Institute of Technology, Bengaluru, India), Venugopal Kuppanna Raju (Dept. of Computer Science and Engineering, Bangalore University, Bengaluru, India). Received: 21-01-2020; Revised: 12-03-2020; Accepted: 21-03-2020. Corresponding author: Venkatesha Maravanthe, Email: [email protected]. Abstract: The past decade has seen a tremendous increase in the adoption of the Social Web, leading to the generation of an enormous amount of user data every day. The constant stream of tweets, with their innate complex sentimental and contextual nature, makes searching for relevant information a herculean task. Multiple applications use Twitter for various domain-sensitive and analytical use-cases. This paper proposes a scalable context modeling framework for a set of tweets for finding two forms of metadata, termed primary and extended contexts. Further, our work presents a hierarchical clustering approach to find hidden patterns by using the generated primary and extended contexts. Ontologies from DBpedia are used for generating primary contexts and subsequently to find relevant extended contexts. DBpedia Spotlight in conjunction with the DBpedia Ontology forms the backbone of this proposed model. We consider both Twitter trend and stream data to demonstrate the application of these contextual pieces of information in clustering.
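    The DBpedia context generation is not reproduced here; the sketch below isolates the hierarchical clustering step the paper builds on, with plain tf-idf features standing in for the primary and extended contexts.

        # Bottom-up (agglomerative) clustering of short texts; tf-idf
        # vectors stand in for the paper's DBpedia-derived contexts.
        from scipy.cluster.hierarchy import linkage, fcluster
        from sklearn.feature_extraction.text import TfidfVectorizer

        tweets = [
            "new phone launch event today",
            "smartphone release announced today",
            "rain expected all weekend",
            "storm warning issued for the weekend",
        ]

        X = TfidfVectorizer().fit_transform(tweets).toarray()
        Z = linkage(X, method="average", metric="cosine")  # merge tree
        print(fcluster(Z, t=2, criterion="maxclust"))      # cut into 2 clusters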
  • ESS9 Appendix A3 Political Parties Ed
    APPENDIX A3: POLITICAL PARTIES, ESS9 - 2018, ed. 3.0. Countries covered: Austria, Belgium, Bulgaria, Croatia, Cyprus, Czechia, Denmark, Estonia, Finland, France, Germany, Hungary, Iceland, Ireland, Italy, Latvia, Lithuania, Montenegro, Netherlands, Norway, Poland, Portugal, Serbia, Slovakia, Slovenia, Spain, Sweden, Switzerland, United Kingdom. Version notes: ESS9 edition 3.0 (published 10.12.20) adds Denmark and Iceland; ESS9 edition 2.0 (published 15.06.20) adds Croatia, Latvia, Lithuania, Montenegro, Portugal, Slovakia, Spain, Sweden. Austria. Political parties. Language used in data file: German. Year of last election: 2017. Official party names, English names/translation, and size in last election: 1. Sozialdemokratische Partei Österreichs (SPÖ), Social Democratic Party of Austria, 26.9%; 2. Österreichische Volkspartei (ÖVP), Austrian People's Party, 31.5%; 3. Freiheitliche Partei Österreichs (FPÖ), Freedom Party of Austria, 26.0%; 4. Liste Peter Pilz (PILZ), PILZ, 4.4%; 5. Die Grünen – Die Grüne Alternative (Grüne), The Greens – The Green Alternative, 3.8%; 6. Kommunistische Partei Österreichs (KPÖ), Communist Party of Austria, 0.8%; 7. NEOS – Das Neue Österreich und Liberales Forum (NEOS), NEOS – The New Austria and Liberal Forum, 5.3%; 8. G!LT - Verein zur Förderung der Offenen Demokratie (GILT), My Vote Counts!, 1.0%. Description of political parties listed above: 1. The Social Democratic Party (Sozialdemokratische Partei Österreichs, or SPÖ) is a social democratic/center-left political party that was founded in 1888 as the Social Democratic Worker's Party (Sozialdemokratische Arbeiterpartei, or SDAP), when Victor Adler managed to unite the various opposing factions.
  • Query-Based Multi-Document Summarization by Clustering of Documents
    Query-based Multi-Document Summarization by Clustering of Documents. Naveen Gopal K R and Prema Nedungadi, Dept. of Computer Science and Engineering, Amrita School of Engineering / Amrita CREATE, Amrita Vishwa Vidyapeetham, Amritapuri, Kollam - 690525. [email protected], [email protected]. Abstract: Information Retrieval (IR) systems such as search engines retrieve a large set of documents, images and videos in response to a user query. Computational methods such as Automatic Text Summarization (ATS) reduce this information load, enabling users to find information quickly without reading the original text. The challenges to ATS include both the time complexity and the accuracy of summarization. Our proposed Information Retrieval system consists of three different phases: a Retrieval phase, a Clustering phase and a Summarization phase. In the Clustering phase, we extend the Potential-based Hierarchical Agglomerative (PHA) clustering method to a hybrid PHA-ClusteringGain-K-Means clustering approach. Our studies using the DUC 2002 dataset show an increase in both the efficiency and accuracy of clustering. Keywords: Potential-based Hierarchical Agglomerative clustering, k-means, Clustering Gain. Introduction: Automatic Text Summarization [12] reduces the volume of information by creating a summary from one or more text documents. The focus of the summary may be either generic, which captures the important semantic concepts of the documents, or query-based, which captures the sub-concepts of the user query and provides personalized and relevant abstracts based on the matching between the input query and the document collection. Currently, summarization has gained research interest due to the enormous information load generated, particularly on the web, including large text, audio, and video files.
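    The PHA and Clustering Gain components are not shown in this excerpt; the sketch below illustrates only the generic cluster-then-select idea behind query-focused summarization, with plain k-means as a stand-in: cluster the sentences, weight each by query similarity, and emit the top sentence per cluster.

        # Query-focused selection: one top-scoring sentence per cluster.
        # k-means replaces the paper's PHA-ClusteringGain-K-Means hybrid.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.cluster import KMeans
        from sklearn.metrics.pairwise import cosine_similarity

        sentences = [
            "Floods displaced thousands in the region.",
            "Heavy rain caused rivers to overflow.",
            "Rescue teams evacuated stranded residents.",
            "Officials promised new flood defences.",
        ]
        query = ["flood rescue efforts"]

        vec = TfidfVectorizer().fit(sentences + query)
        S, q = vec.transform(sentences), vec.transform(query)

        clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(S)
        scores = cosine_similarity(S, q).ravel()  # query-dependent weights

        best = {}
        for i, c in enumerate(clusters):  # keep the top sentence per cluster
            if c not in best or scores[i] > scores[best[c]]:
                best[c] = i
        print([sentences[i] for i in best.values()])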
  • Electoral Reform and Trade-Offs in Representation
    Electoral Reform and Trade-Offs in Representation. Michael Becher (Institute for Advanced Study in Toulouse and University Toulouse 1 Capitole, [email protected]) and Irene Menéndez González (University of Mannheim, [email protected]). January 26, 2019. Conditionally accepted at the American Political Science Review. Abstract: We examine the effect of electoral institutions on two important features of representation that are often studied separately: policy responsiveness and the quality of legislators. Theoretically, we show that while a proportional electoral system is better than a majoritarian one at representing popular preferences in some contexts, this advantage can come at the price of undermining the selection of good politicians. To empirically assess the relevance of this trade-off, we analyze an unusually controlled electoral reform in Switzerland early in the twentieth century. To account for endogeneity, we exploit variation in the intensive margin of the reform, which introduced proportional representation, based on administrative constraints and data on voter preferences. A difference-in-difference analysis finds that higher reform intensity increases the policy congruence between legislators and the electorate and reduces legislative effort. Contemporary evidence from the European Parliament supports this conclusion. Replication files for this article will be available upon publication from the Harvard Dataverse. For helpful comments on previous versions we are especially grateful to Lucy Barnes,
  • Wordnet-Based Metrics Do Not Seem to Help Document Clustering
    Wordnet-based metrics do not seem to help document clustering. Alexandre Passos and Jacques Wainer, Instituto de Computação (IC), Universidade Estadual de Campinas (UNICAMP), Campinas, SP, Brazil. Abstract: In most document clustering systems documents are represented as normalized bags of words and clustering is done by maximizing cosine similarity between documents in the same cluster. While this representation has been found to be very effective for many different types of clustering, it has some intuitive drawbacks. One such drawback is that documents containing words with similar meanings might be considered very different if they use different words to say the same thing. This happens because in a traditional bag of words, all words are assumed to be orthogonal. In this paper we examine many possible ways of using WordNet to mitigate this problem, and find that WordNet does not help clustering if used only as a means of finding word similarity. Introduction: Document clustering is now an established technique, being used to improve the performance of information retrieval systems [11], as an aid to machine translation [12], and as a building block of narrative event chain learning [1]. Toolkits such as Weka [3] and NLTK [10] make it easy for anyone to experiment with document clustering and incorporate it into an application. Nonetheless, the most commonly accepted model (a bag of words representation [17], a bisecting k-means algorithm [19] maximizing cosine similarity [20], preprocessing the corpus with a stemmer [14], tf-idf weighting [17], and a possible list of stop words [2]) is complex, unintuitive and has some obvious negative consequences.
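    As a minimal illustration of the kind of WordNet-based word similarity the paper evaluates (and finds unhelpful on its own), the following uses NLTK's WordNet interface; the best-pair path similarity is a generic choice, not necessarily the paper's exact metric, and requires the WordNet data (nltk.download("wordnet")).

        # Word-pair similarity from the WordNet graph instead of assuming
        # all words are orthogonal, as a plain bag of words does.
        from nltk.corpus import wordnet as wn

        def word_similarity(w1, w2):
            """Best path similarity over all synset pairs (0 if unrelated)."""
            sims = [s1.path_similarity(s2) or 0.0
                    for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
            return max(sims, default=0.0)

        print(word_similarity("car", "automobile"))  # 1.0: shared synset
        print(word_similarity("car", "banana"))      # low: distant concepts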
  • Challenger Party List
    Appendix: List of Challenger Parties. Operationalization of challenger parties: A party is considered a challenger party if in any given year it has not been a member of a central government after 1930. A party is considered a dominant party if in any given year it has been part of a central government after 1930. Only parties with ministers in cabinet are considered to be members of a central government. A party ceases to be a challenger party once it enters central government (in the election immediately preceding entry into office, it is classified as a challenger party). Participation in national war/crisis cabinets and national unity governments (e.g., Communists in France's provisional government) does not in itself qualify a party as a dominant party. A dominant party will continue to be considered a dominant party after merging with a challenger party, but a party will be considered a challenger party if it splits from a dominant party. Using this definition, the following parties were challenger parties in Western Europe in the period under investigation (1950–2017); parties that became dominant parties during the period are indicated with an asterisk, and the year given is the last election in the dataset in which the party appears as a challenger. Austria: ALÖ, Alternative List Austria, 1983; DU, The Independents—Lugner's List, 1999; FPÖ, Freedom Party of Austria, 1983 *; Fritz, The Citizens' Forum Austria, 2008; Grüne, The Greens—The Green Alternative, 2017; LiF, Liberal Forum, 2008; Martin, Hans-Peter Martin's List, 2006; Nein, No—Citizens' Initiative against
  • Evaluation of Text Clustering Algorithms with N-Gram-Based Document Fingerprints
    Evaluation of Text Clustering Algorithms with N-Gram-Based Document Fingerprints. Javier Parapar and Álvaro Barreiro, IRLab, Computer Science Department, University of A Coruña, Spain. {javierparapar,barreiro}@udc.es. Abstract: This paper presents a new approach designed to reduce the computational load of existing clustering algorithms by trimming down the document size using fingerprinting methods. A thorough evaluation was performed over three different collections and considering four different metrics. The presented approach to document clustering achieved good effectiveness values with considerable savings in memory space and computation time. Introduction and motivation: A document's fingerprint can be defined as an abstraction of the original document that usually implies a reduction in terms of size. Data clustering, on the other hand, consists of partitioning the input data collection into sets that ideally share common properties. This paper studies the effect of using document fingerprints as input to clustering algorithms to achieve better computational behaviour. Clustering has a long history in Information Retrieval (IR) [1], but only recently have Liu and Croft [2] demonstrated that cluster-based retrieval can also significantly outperform traditional document-based retrieval effectiveness. Other successful applications of clustering algorithms are document browsing, search results presentation and document summarisation. Our approach aims to be useful in operational systems where computing time is a critical point and the use of clustering techniques can significantly improve the quality of the outputs of tasks such as those above. Next, section 2 introduces the background in clustering and document representation.
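    The paper's specific fingerprinting methods are not detailed in this excerpt; the following is a generic minhash-style sketch of the idea: hash the word n-grams of a document and keep the k smallest hashes as a fixed-size surrogate, which can then be fed to a clustering algorithm in place of the full text.

        # N-gram fingerprint: a compact, fixed-size document surrogate.
        import hashlib

        def fingerprint(text, n=3, k=8):
            words = text.lower().split()
            grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
            hashes = {int(hashlib.md5(g.encode()).hexdigest(), 16) for g in grams}
            return sorted(hashes)[:k]  # k smallest hashes of the n-grams

        doc = "clustering large collections requires trimming down document size"
        print(fingerprint(doc))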
  • Word Sense Disambiguation in Biomedical Ontologies with Term Co-Occurrence Analysis and Document Clustering
    Int. J. Data Mining and Bioinformatics, Vol. 2, No. 3, 2008. Word Sense Disambiguation in biomedical ontologies with term co-occurrence analysis and document clustering. Bill Andreopoulos (corresponding author), Dimitra Alexopoulou and Michael Schroeder, Biotechnological Centre, Technische Universität Dresden, Germany. E-mail: [email protected], [email protected], [email protected]. Abstract: With more and more genomes being sequenced, a lot of effort is devoted to their annotation with terms from controlled vocabularies such as the GeneOntology. Manual annotation based on relevant literature is tedious, but automation of this process is difficult. One particularly challenging problem is word sense disambiguation. Terms such as ‘development’ can refer to developmental biology or to the more general sense. Here, we present two approaches to address this problem by using term co-occurrences and document clustering. To evaluate our method we defined a corpus of 331 documents on development and developmental biology. Term co-occurrence analysis achieves an F-measure of 77%. Additionally, applying document clustering improves precision to 82%. We applied the same approach to disambiguate ‘nucleus’, ‘transport’, and ‘spindle’, and we achieved consistent results. Thus, our method is a viable approach towards the automation of literature-based genome annotation. Keywords: WSD; word sense disambiguation; ontologies; GeneOntology; text mining; annotation; Bayes; clustering; GoPubMed; data mining; bioinformatics. Reference to this paper should be made as follows: Andreopoulos, B., Alexopoulou, D. and Schroeder, M. (2008) ‘Word Sense Disambiguation in biomedical ontologies with term co-occurrence analysis and document clustering’, Int. J. Data Mining and Bioinformatics, Vol. 2, No. 3.
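    A toy rendition of the co-occurrence idea, with invented corpora and senses: build a per-sense co-occurrence profile of the ambiguous term from labelled documents, then pick the sense whose profile best matches a new context.

        # Disambiguate "development" by per-sense co-occurrence profiles.
        from collections import Counter

        sense_docs = {
            "dev_biology": ["embryo development stage cell differentiation",
                            "larval development in drosophila"],
            "dev_general": ["development of new software tools",
                            "economic development report"],
        }

        profiles = {s: Counter(w for d in docs for w in d.split())
                    for s, docs in sense_docs.items()}

        def disambiguate(context):
            # Score each sense by summed counts of the context words.
            return max(profiles,
                       key=lambda s: sum(profiles[s][w] for w in context.split()))

        print(disambiguate("cell stage of the embryo"))  # -> dev_biology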
  • A Query Focused Multi Document Automatic Summarization
    PACLIC 24 Proceedings. A Query Focused Multi Document Automatic Summarization. Pinaki Bhaskar and Sivaji Bandyopadhyay, Department of Computer Science & Engineering, Jadavpur University, Kolkata – 700032, India. [email protected], [email protected]. Abstract: The present paper describes the development of a query focused multi-document automatic summarization system. A graph is constructed, where the nodes are sentences of the documents and edge scores reflect the correlation measure between the nodes. The system clusters similar texts having related topical features from the graph using edge scores. Next, query dependent weights for each sentence are added to the edge score of the sentence and accumulated with the corresponding cluster score. The top-ranked sentence of each cluster is identified and compressed using a dependency parser. The compressed sentences are included in the output summary. The inter-document clusters are revisited in order until the length of the summary is less than the maximum limit. The summarizer has been tested on the standard TAC 2008 test data sets of the Update Summarization Track. Evaluation of the summarizer yielded accuracy scores of 0.10317 (ROUGE-2) and 0.13998 (ROUGE-SU-4). Keywords: Multi Document Summarizer, Query Focused, Cluster based approach, Parsed and Compressed Sentences, ROUGE Evaluation. Introduction: Text Summarization, as the process of identifying the most salient information in a document or set of documents (for multi document summarization) and conveying it in less space, became an active field of research in both the Information Retrieval (IR) and Natural Language Processing (NLP) communities. Summarization shares some basic techniques with indexing, as both are concerned with identifying the essence of a document.
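    As a minimal sketch of the paper's first step, the following builds a graph with sentences as nodes and cosine similarities as edge scores; connected components stand in for the paper's own clustering, and the edge threshold is an arbitrary assumption.

        # Sentence graph: nodes are sentences, edges carry similarity scores.
        import networkx as nx
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        sentences = [
            "The volcano erupted on Monday morning.",
            "The volcano spewed ash over the island.",
            "Voters lined up at polling stations.",
            "Polling stations reported record voter turnout.",
        ]

        sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
        G = nx.Graph()
        G.add_nodes_from(range(len(sentences)))
        for i in range(len(sentences)):
            for j in range(i + 1, len(sentences)):
                if sim[i, j] > 0.1:  # edge score = correlation measure
                    G.add_edge(i, j, weight=sim[i, j])

        print(list(nx.connected_components(G)))  # e.g. [{0, 1}, {2, 3}]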
  • Semantic Based Document Clustering Using Lexical Chains
    International Research Journal of Engineering and Technology (IRJET), e-ISSN: 2395-0056, p-ISSN: 2395-0072, Volume 04, Issue 01, Jan 2017. www.irjet.net. Semantic Based Document Clustering Using Lexical Chains. Shabana Afreen (M.Tech Scholar) and Dr. B. Srinivasu (Associate Professor), Dept. of Computer Science and Engineering, Stanley College of Engineering and Technology for Women, Hyderabad, Telangana, India. Abstract: Traditional clustering algorithms do not consider the semantic relationships among documents and so cannot accurately represent clusters of documents. To overcome these problems, introducing semantic information from an ontology such as WordNet has been widely used to improve the quality of text clustering. However, there exist several challenges, such as extracting core semantics from texts, assigning appropriate descriptions for the generated clusters, and the diversity of vocabulary. In this project we report our attempt towards integrating WordNet with lexical chains to alleviate these problems. The proposed approach exploits the way we can identify the theme of the document based on disambiguated core semantic content. Introduction: Document clustering means organizing large document collections into smaller meaningful and manageable groups, which plays an important role in information retrieval, browsing and comprehension [1]. Problem description: Feature vectors generated using a bag of words result in very high-dimensional vectors, as the number of features selected is directly proportional to the dimension. Extracting core semantics from texts and selecting features accordingly will reduce the number of terms to those which carry high semantic core content. The quality of the extracted lexical chains depends highly on the quality and quantity of the concepts within a document.
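    A naive sketch of the linking criterion behind lexical chains: two nouns join the same chain when their WordNet synsets coincide or stand in a direct hypernym relation. Real chainers score, disambiguate and prune candidates; this only illustrates the WordNet test (and requires nltk.download("wordnet")).

        # Group document nouns into chains of semantically related words.
        from nltk.corpus import wordnet as wn

        def related(w1, w2):
            s1 = set(wn.synsets(w1, pos=wn.NOUN))
            s2 = set(wn.synsets(w2, pos=wn.NOUN))
            if s1 & s2:  # shared sense: synonyms
                return True
            hyper1 = {h for s in s1 for h in s.hypernyms()}
            hyper2 = {h for s in s2 for h in s.hypernyms()}
            return bool(hyper1 & s2) or bool(hyper2 & s1)  # hypernym link

        chains = []
        for word in ["dog", "puppy", "canine", "banana"]:
            for chain in chains:
                if any(related(word, w) for w in chain):
                    chain.append(word)
                    break
            else:
                chains.append([word])
        print(chains)  # e.g. [['dog', 'puppy', 'canine'], ['banana']]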
  • Latent Semantic Sentence Clustering for Multi-Document Summarization
    UCAM-CL-TR-802, Technical Report Number 802, ISSN 1476-2986. University of Cambridge, Computer Laboratory, 15 JJ Thomson Avenue, Cambridge CB3 0FD, United Kingdom. Phone: +44 1223 763500. http://www.cl.cam.ac.uk/. Latent semantic sentence clustering for multi-document summarization. Johanna Geiß. July 2011. © 2011 Johanna Geiß. This technical report is based on a dissertation submitted April 2011 by the author for the degree of Doctor of Philosophy to the University of Cambridge, St. Edmund’s College. Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet: http://www.cl.cam.ac.uk/techreports/. Summary: This thesis investigates the applicability of Latent Semantic Analysis (LSA) to sentence clustering for Multi-Document Summarization (MDS). In contrast to more shallow approaches, like measuring the similarity of sentences by word overlap in a traditional vector space model, LSA takes word usage patterns into account. So far LSA has been successfully applied to different Information Retrieval (IR) tasks, like information filtering and document classification (Dumais, 2004). In the course of this research, different parameters essential to sentence clustering using a hierarchical agglomerative clustering algorithm (HAC), in general and in combination with LSA in particular, are investigated. These parameters include, inter alia, information about the type of vocabulary, the size of the semantic space and the optimal number of dimensions to be used in LSA. These parameters have not previously been studied and evaluated in combination with sentence clustering (chapter 4). This thesis also presents the first gold standard for sentence clustering in MDS.
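    A miniature of the LSA-plus-HAC setup the summary describes: project tf-idf sentence vectors into a low-dimensional latent space with truncated SVD (a standard way of computing LSA), then cluster the projections agglomeratively. The component count and toy sentences are illustrative assumptions.

        # LSA: truncated SVD over tf-idf, then hierarchical clustering.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.decomposition import TruncatedSVD
        from scipy.cluster.hierarchy import linkage, fcluster

        sentences = [
            "The physician examined the patient.",
            "The doctor treated the patient carefully.",
            "Stock markets fell sharply on Monday.",
            "Markets dropped amid investor fears.",
        ]

        X = TfidfVectorizer().fit_transform(sentences)
        lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

        Z = linkage(lsa, method="average", metric="cosine")  # HAC merge tree
        print(fcluster(Z, t=2, criterion="maxclust"))        # two clusters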