Terry Ruas Final Dissertation (PDF)


Semantic Feature Extraction Using Multi-Sense Embeddings and Lexical Chains

by Terry L. Ruas

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer and Information Science) in the University of Michigan-Dearborn, 2019.

Doctoral Committee: Professor William Grosky, Chair; Assistant Professor Mohamed Abouelenien; Rajeev Agrawal, US Army Engineer Research and Development Center; Associate Professor Marouane Kessentini; Associate Professor Luis Ortiz; Professor Armen Zakarian.

Terry L. Ruas, [email protected], ORCID iD 0000-0002-9440-780X. © Terry L. Ruas 2019

DEDICATION

To my father, Júlio César Perez Ruas, to my mother, Vládia Sibrão de Lima Ruas, and to my dearest friend and mentor, William Grosky. This work is also a result of those who made me laugh, cry, live, and die, because if it were not for them, I would not be who I am today.

ACKNOWLEDGEMENTS

First, I would like to express my gratitude to my parents, who always encouraged me, no matter how dire the situation seemed. With the same importance, I am thankful to my dear friend and advisor, Prof. William Grosky, without whose continuous support of my Ph.D. study and related research nothing would have been possible. His patience, motivation, and immense knowledge are more than I could ever have wished for. His guidance helped me during all the time spent on the research and writing of this dissertation. I could not have asked for better company throughout this challenge. If I were to write down all the great moments we had together, one book would not be enough.

Besides my advisor, I would like to thank the rest of my dissertation committee: Dr. Mohamed Abouelenien, Dr. Rajeev Agrawal, Dr. Marouane Kessentini, Dr. Luis Ortiz, and Prof. Armen Zakarian, for their insightful comments and encouragement, but also for the hard questions, which inspired me to widen my research from various perspectives.

My sincere thanks also go to Prof. Akiko Aizawa, who was my supervisor during my Ph.D. internship at the National Institute of Informatics (NII) in Tokyo, Japan. The connections and experiences I had there were essential for my growth as a researcher. Working in her lab helped me polish and mature my research work, assisting me in becoming a better scholar.

I would also like to thank my dear friend Edward Williams, the person who first introduced me to the University of Michigan during a conference in Argentina in 2013. I had the pleasure of sharing countless lunches and talks with him during my stay in the United States. If it were not for him, I would probably still be working in my previous job at IBM in Brazil.

I would like to leave a special appreciation to Charles Henrique Porto Ferreira, with whom I had the pleasure to work during my last year as a doctoral student. His visit provided stimulating discussions, great quality time, and an amazing perspective on how to conduct research with integrity and organization. Anyone who has the opportunity to work with him should feel honored.

Last, but not least, I would like to thank my dear friends in Brazil, whom I chose to be my family. It would not be fair to list them and risk leaving someone out. For this reason, I express my gratitude to all of them, who directly and indirectly made my journey easier during tough times.

Absque sudore et labore nullum opus perfectum est…

TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER
I. Introduction
   1.1 Problem Context
   1.2 Main Objectives
   1.3 Dissertation Structure
II. Background and Related Work
   2.1 Word Sense Disambiguation
      2.1.1 Related Work in Word-Sense Disambiguation
      2.1.2 Keyword Extraction and Related Work
   2.2 Lexical Chains
      2.2.1 Related Work in Lexical Chains
   2.3 Word Embeddings
      2.3.1 Multi-Sense Embeddings
      2.3.2 Related Work in Word Embeddings
      2.3.3 Related Work in Multi-Sense Embeddings
      2.3.4 Related Work in Document Classification and Embeddings
III. Exploring Multi-Sense Embeddings and Lexical Chains
   3.1 Semantic Architecture Representation
   3.2 Synset Disambiguation, Annotation, and Embeddings
      3.2.1 Most Suitable Sense Annotation (MSSA)
      3.2.2 Most Suitable Sense Annotation N Refined (MSSA-NR)
      3.2.3 Most Suitable Sense Annotation - Dijkstra (MSSA-D)
      3.2.4 From Synset to Embeddings (Synset2Vec)
      3.2.5 Complexity Analysis
   3.3 Extending Lexical Chains
      3.3.1 Flexible Lexical Chains II (FLLC II)
      3.3.2 Fixed Lexical Chains II (FXLC II)
      3.3.3 From Lexical Chains to Embeddings (Chains2Vec)
      3.3.4 Building Lexical Chains
IV. Experiments and Validation Tasks
   4.1 Word Similarity Task
      4.1.1 Training Corpus
      4.1.2 Hyperparameters, Setup and Details
      4.1.3 Benchmark Details for Word Similarity Task
      4.1.4 No Context Word Similarity
      4.1.5 Context Word Similarity
   4.2 Further Discussions and Limitations on MSSA
   4.3 Document Classification Task
      4.3.1 Datasets Details
      4.3.2 Machine Learning Classifiers
      4.3.3 Word Embedding Models Characteristics
      4.3.4 Document Embeddings Models Characteristics
      4.3.5 Experiment Configuration
      4.3.6 Document Classification Task Results
      4.3.7 Lexical Chains Behavior Analysis
   4.4 Further Discussions and Limitations on FLLC II and FXLC II
V. Final Considerations
   5.1 Future Directions
      5.1.1 Scientific Paper Mining and Recommendation
      5.1.2 Related Work in Scientific Paper Mining and Recommendation
VI. Early Findings and Contributions
   6.1 Word Sense Disambiguation Techniques
      6.1.1 Path-Based Measures: Wu & Palmer
      6.1.2 Information Content-Based Measures: Jiang & Conrath
      6.1.3 Feature-Based Measures: Tversky
      6.1.4 Hybrid Measures: Zhou
   6.2 Best Synset Disambiguation and Lexical Chains Algorithms
      6.2.1 Best Synset Disambiguation Algorithm (BSD)
      6.2.2 Flexible Lexical Chains Algorithm (FLC)
      6.2.3 Flexible to Fixed Lexical Chains Algorithm (F2F)
      6.2.4 Fixed Lexical Chains Algorithm (FXLC)
      6.2.5 Distributed Semantic Extraction Mapping
      6.2.6 Semantic Term Frequency-Inverse Document Frequency (TF-IDF)
   6.3 Proof-Of-Concept Experiments
      6.3.1 Semantic Extraction Based on Lexical Chains
      6.3.2 Semantic Extraction - Reuters-21578 Benchmarking
      6.3.3 Keyconcept Extraction through Lexical Chains
BIBLIOGRAPHY

LIST OF TABLES

3.1 Sentence example transformation from words into synsets.
3.2 Sample of related synsets extracted from WordNet.
3.3 Lexical chains construction.
3.4 Fictional pre-trained synset embeddings model.
3.5 Vector average for lexical chains centroids.
3.6 Cosine similarity between lexical chains elements and their centroids.
3.7 Final chains for FLLC II and FXLC II algorithms.
4.1 Dataset token details. WD10 - English Wikipedia Dump 2010 (April); WD18 - English Wikipedia Dump 2018 (January).
4.2 Spearman correlation score (ρ) for the RG65 benchmark. Highest results reported in bold face.
4.3 Spearman correlation score (ρ) for the MEN benchmark. The results of Chen et al. [32] and word2vec are reported in Mancini et al. [99] (MSSG/NP-MSSG). Highest results reported in bold face.
4.4 Spearman correlation score (ρ) for the WordSim353 benchmark. Huang et al. [73] results are reported in Neelakantan et al. [122] (MSSG/NP-MSSG). Highest results reported in bold face.
4.5 Spearman correlation score (ρ) for the SimLex999 benchmark. Chen et al. [32] and word2vec results are reported in Mancini et al. [99] (MSSG/NP-MSSG). Highest results reported in bold face.
4.6 Spearman correlation score for the MC28 benchmark. Highest results reported in bold face.
4.7 Spearman correlation score (ρ) for the SCWS benchmark. Huang et al. [73] results are reported in Neelakantan et al. [122] (MSSG/NP-MSSG). Highest results reported in bold face.
4.8 Technical details about the datasets after pre-processing.
4.9 Grid-search configuration parameters.
4.10 Word embeddings used and their main characteristics. * For USE, Cer et al. [29] report its training data as a collection of sources from Wikipedia, web news, web question-answer pages, discussion forums, and the Stanford Natural Language Inference corpus.
4.11 Classification accuracy for the BOW approach against the proposed techniques for each classifier and dataset. Values in bold represent the best result of that row. Underlined values represent the best value for that dataset. K-NN - K Nearest Neighbors; RF - Random Forest; LR - Logistic Regression; SVM - Support Vector Machine; NB - Naïve Bayes.
4.12 Classification accuracy for word embeddings models against the proposed techniques for each classifier and dataset. Values in bold represent the best result of that row. Underlined values represent the best value for that dataset. K-NN - K Nearest Neighbors; RF - Random Forest; LR - Logistic Regression; SVM - Support Vector Machine; NB - Naïve Bayes.
4.13 Classification accuracy for document embeddings models against the proposed techniques for each classifier and dataset. Values in bold represent the best result of that row. Underlined values represent the best value for that dataset. K-NN - K Nearest Neighbors; RF - Random Forest; LR - Logistic Regression; SVM - Support Vector Machine; NB - Naïve Bayes.
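Tables 3.1-3.7 walk through the chain-building pipeline step by step: words are mapped to synsets, related synsets are grouped into chains, chain centroids are computed from pre-trained synset vectors, and each member is compared to its centroid by cosine similarity. The sketch below illustrates only that centroid-and-similarity step, using a tiny hand-made synset embedding table in place of the dissertation's pre-trained model; the synset names, vector values, and threshold are illustrative, not taken from the text.

```python
import math

# Toy stand-in for a pre-trained synset embedding model (cf. Table 3.4);
# a real model would map WordNet synset IDs to dense vectors.
synset_vectors = {
    "dog.n.01":    [0.9, 0.1, 0.0],
    "canine.n.02": [0.8, 0.2, 0.1],
    "bark.v.03":   [0.7, 0.3, 0.2],
    "tree.n.01":   [0.1, 0.9, 0.4],
}

def centroid(synsets):
    """Average the vectors of a chain's synsets (cf. Table 3.5)."""
    vecs = [synset_vectors[s] for s in synsets]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u, v):
    """Cosine similarity between two vectors (cf. Table 3.6)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# One candidate lexical chain: synsets judged related (cf. Table 3.3).
chain = ["dog.n.01", "canine.n.02", "bark.v.03"]
c = centroid(chain)

# Keep only members close enough to the chain centroid; the 0.9 threshold
# is an arbitrary illustration, not a value used in the dissertation.
kept = [s for s in chain if cosine(synset_vectors[s], c) >= 0.9]
print(kept)
```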
Recommended publications
  • Web Document Analysis: How Can Natural Language Processing Help in Determining Correct Content Flow?
Web Document Analysis: How can Natural Language Processing Help in Determining Correct Content Flow?
Hassan Alam, Fuad Rahman and Yuliya Tarnikova, BCL Technologies Inc. ([email protected])

Abstract: One of the fundamental questions for document analysis and subsequent automatic re-authoring solutions is the semantic and contextual integrity of the processed document. The problem is particularly severe in web document re-authoring, as the segmentation process often creates an array of seemingly unrelated snippets of content without providing any concrete clue to aid the layout analysis process. This paper presents a generic technique based on natural language processing for determining 'semantic relatedness' between segments within a document and applies it to a web page re-authoring problem.

1. Introduction: In web document analysis, document decomposition and subsequent analysis is a very viable solution [1]. […]

• What is the definition of 'relatedness'?
• If other segments are geometrically embedded within closely related segments, can we determine if this segment is also related to the surrounding segments?
• When a hyperlink is followed and a new page is accessed, how do we know which exact segment within that new page is directly related to the link we just followed?

It is very difficult to answer these questions with a high degree of confidence, as the absence of precise information about the geometric and the linguistic relationships among the candidate segments makes it impossible to produce a quantitative measurement of the closeness or relatedness of these segments. This paper has proposed a natural language processing (NLP) based method to determine relationships among different textual segments. The technique is generic and is applicable to […]
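The excerpt motivates, but does not spell out, how 'semantic relatedness' between two page segments might be scored. As a purely illustrative baseline, not the method of Alam et al., one can compare segments by the cosine of their term-frequency vectors; the example segments below are invented.

```python
from collections import Counter
import math
import re

def term_vector(text):
    """Lowercased term-frequency vector for a text segment."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# Invented example segments from a hypothetical web page.
seg_a = "Ticket prices and opening hours for the city museum."
seg_b = "The museum is open daily; tickets can be bought online."
seg_c = "Download our mobile banking app for secure payments."

print(cosine(term_vector(seg_a), term_vector(seg_b)))  # some lexical overlap
print(cosine(term_vector(seg_a), term_vector(seg_c)))  # little to none
```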
  • Question-Answering Using Semantic Relation Triples
Question-Answering Using Semantic Relation Triples
Kenneth C. Litkowski, CL Research, 9208 Gue Road, Damascus, MD 20872 ([email protected], http://www.clres.com)

Abstract: This paper describes the development of a prototype system to answer questions by selecting sentences from the documents in which the answers occur. After parsing each sentence in these documents, databases are constructed by extracting relational triples from the parse output. The triples consist of discourse entities, semantic relations, and governing words to which the entities are bound in the sentence. Database triples are also generated for the questions. Question-answering consists of matching the question database records with the records for the documents.

The prototype system was developed specifically to respond to the TREC-8 Q&A track, with an existing parser and some existing capability for analyzing parse output. The system was designed to investigate the […]

[…] can be used in such tasks as word-sense disambiguation and text summarization. The CL Research question-answering prototype extended functionality of the DIMAP dictionary creation and maintenance software, which includes some components intended for use as a lexicographer's workstation. The TREC-8 Q&A track provided an opportunity not only for examining use of the computational lexicons, but also for their generation as well, since many dictionaries (particularly specialized ones) contain encyclopedic information as well as the usual genus-differentiae definitions. The techniques developed for TREC and described herein are now being used for parsing dictionary definitions to help construct computational lexicons that contain more information about semantic relations, which in turn will be useful for natural language processing tasks, including question-answering.
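The abstract describes answering questions by matching relation triples extracted from parsed sentences against triples extracted from the question. A minimal sketch of that matching idea follows; the triple format, relation labels, and toy data are invented for illustration and are not Litkowski's actual database schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    """(discourse entity, semantic relation, governing word), as in the abstract."""
    entity: str
    relation: str
    governor: str

# Toy "document database" of triples extracted from parsed sentences.
doc_triples = {
    "sent-1": [Triple("Columbus", "AGENT", "reached"),
               Triple("America", "THEME", "reached"),
               Triple("1492", "TIME", "reached")],
    "sent-2": [Triple("Columbus", "AGENT", "born"),
               Triple("Genoa", "LOCATION", "born")],
}

# Triples generated for the question "Where was Columbus born?";
# the unknown answer is represented as a wildcard entity.
question_triples = [Triple("Columbus", "AGENT", "born"),
                    Triple("?", "LOCATION", "born")]

def score(sentence_triples, question_triples):
    """Count question triples matched by a sentence; '?' matches any entity."""
    hits = 0
    for q in question_triples:
        for t in sentence_triples:
            if q.relation == t.relation and q.governor == t.governor and q.entity in ("?", t.entity):
                hits += 1
                break
    return hits

best = max(doc_triples, key=lambda sid: score(doc_triples[sid], question_triples))
print(best)  # sent-2, whose LOCATION triple supplies the answer "Genoa"
```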
  • Text Relatedness Based on a Word Thesaurus
Journal of Artificial Intelligence Research 37 (2010) 1-39. Submitted 07/09; published 01/10.
Text Relatedness Based on a Word Thesaurus
George Tsatsaronis ([email protected]), Department of Computer and Information Science, Norwegian University of Science and Technology, Norway; Iraklis Varlamis ([email protected]), Department of Informatics and Telematics, Harokopio University, Greece; Michalis Vazirgiannis ([email protected]), Department of Informatics, Athens University of Economics and Business, Greece

Abstract: The computation of relatedness between two fragments of text in an automated manner requires taking into account a wide range of factors pertaining to the meaning the two fragments convey, and the pairwise relations between their words. Without doubt, a measure of relatedness between text segments must take into account both the lexical and the semantic relatedness between words. Such a measure that captures well both aspects of text relatedness may help in many tasks, such as text retrieval, classification and clustering. In this paper we present a new approach for measuring the semantic relatedness between words based on their implicit semantic links. The approach exploits only a word thesaurus in order to devise implicit semantic links between words. Based on this approach, we introduce Omiotis, a new measure of semantic relatedness between texts which capitalizes on the word-to-word semantic relatedness measure (SR) and extends it to measure the relatedness between texts. We gradually validate our method: we first evaluate the performance of the semantic relatedness measure between individual words, covering word-to-word similarity and relatedness, synonym identification and word analogy; then, we proceed with evaluating the performance of our method in measuring text-to-text semantic relatedness in two tasks, namely sentence-to-sentence similarity and paraphrase recognition.
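The abstract's two-level idea, a thesaurus-based word-to-word relatedness measure lifted to a text-to-text measure, can be illustrated with a much-simplified sketch. This is not the actual Omiotis or SR measure: it uses WordNet path similarity for word pairs and averages each word's best match in the other text, and it assumes the NLTK WordNet data has been downloaded.

```python
# Requires: pip install nltk && python -c "import nltk; nltk.download('wordnet')"
from itertools import product
from nltk.corpus import wordnet as wn

def word_relatedness(w1, w2):
    """Best WordNet path similarity over all synset pairs of the two words."""
    pairs = product(wn.synsets(w1), wn.synsets(w2))
    scores = [s1.path_similarity(s2) or 0.0 for s1, s2 in pairs]
    return max(scores, default=0.0)

def text_relatedness(t1, t2):
    """Average, over words of t1, of the best relatedness to any word of t2."""
    words1, words2 = t1.lower().split(), t2.lower().split()
    if not words1 or not words2:
        return 0.0
    return sum(max(word_relatedness(a, b) for b in words2) for a in words1) / len(words1)

print(text_relatedness("doctor cures patient", "physician treats illness"))
```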
  • Predicting Thread Linking Structure by Lexical Chaining
Predicting Thread Linking Structure by Lexical Chaining
Li Wang, Diana McCarthy and Timothy Baldwin
Dept. of Computer Science and Software Engineering, University of Melbourne; NICTA Victoria Research Laboratory; Lexical Computing Ltd
[email protected], [email protected], [email protected]

Abstract: Web user forums are valuable means for users to resolve specific information needs, both interactively for participants and statically for users who search/browse over historical thread data. However, the complex structure of forum threads can make it difficult for users to extract relevant information. Thread linking structure has the potential to help tasks such as information retrieval (IR) and threading visualisation of forums, thereby improving information access. Unfortunately, thread linking structure is not always available in forums. This paper proposes an unsupervised approach to predict forum thread linking structure using lexical chaining, a technique which […]

[…]ticipate in discussions or obtain/provide answers to questions, the vast volumes of data contained in forums make them a valuable resource for "support sharing", i.e. looking over records of past user interactions to potentially find an immediately applicable solution to a current problem. On the one hand, more and more answers to questions over a wide range of domains are becoming available on forums; on the other hand, it is becoming harder and harder to extract and access relevant information due to the sheer scale and diversity of the data. Previous research shows that the thread linking structure can be used to improve information retrieval (IR) in forums, at both the post level (Xi et al., 2004; Seo et al., 2009) and thread level (Seo et al., 2009; Elsas and Carbonell, 2009).
  • Automated Text Summarization Base on Lexicales Chain and Graph Using of Wordnet and Wikipedia Knowledge Base
IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 1, No 3, January 2012. ISSN (Online): 1694-0814, www.IJCSI.org
Automated Text Summarization Base on Lexicales Chain and graph Using of WordNet and Wikipedia Knowledge Base
Mohsen Pourvali and Mohammad Saniee Abadeh
Department of Electrical & Computer, Qazvin Branch, Islamic Azad University, Qazvin, Iran; Department of Electrical and Computer Engineering at Tarbiat Modares University, Tehran, Iran

Abstract: The technology of automatic document summarization is maturing and may provide a solution to the information overload problem. Nowadays, document summarization plays an important role in information retrieval. With a large volume of documents, presenting the user with a summary of each document greatly facilitates the task of finding the desired documents. Document summarization is a process of automatically creating a compressed version of a given document that provides useful information to users, and multi-document summarization is to produce a summary delivering the majority of information content from a set of documents about an explicit or implicit main topic. The lexical cohesion structure of the text can be exploited to […]

[…] delivering the majority of information content from a set of documents about an explicit or implicit main topic [14]. Authors of the paper [10] provide the following definition for a summary: "A summary can be loosely defined as a text that is produced from one or more texts that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually significantly less than that. Text here is used rather loosely and can refer to speech, multimedia documents, hypertext, etc. The main goal of a summary is to present the main ideas in a document in less space."
  • Automating the Construction of Lexical Chains Using Roget's Thesaurus
    Not As Easy As It Seems: Automating the Construction of Lexical Chains Using Roget’s Thesaurus Mario Jarmasz and Stan Szpakowicz School of Information Technology and Engineering University of Ottawa Ottawa, Canada, K1N 6N5 {mjarmasz, szpak}@site.uottawa.ca Abstract. Morris and Hirst [10] present a method of linking significant words that are about the same topic. The resulting lexical chains are a means of identifying cohesive regions in a text, with applications in many natural language processing tasks, including text summarization. The first lexical chains were constructed manually using Roget’s International Thesaurus. Morris and Hirst wrote that automation would be straightforward given an electronic thesaurus. All applications so far have used WordNet to produce lexical chains, perhaps because adequate electronic versions of Roget’s were not available until recently. We discuss the building of lexical chains using an electronic version of Roget’s Thesaurus. We implement a variant of the original algorithm, and explain the necessary design decisions. We include a comparison with other implementations. 1 Introduction Lexical chains [10] are sequences of words in a text that represent the same topic. The concept has been inspired by the notion of cohesion in discourse [7]. A sufficiently rich and subtle lexical resource is required to decide on semantic proximity of words. Computational linguists have used lexical chains in a variety of tasks, from text segmentation [10], [11], to summarization [1], [2], [12], detection of malapropisms [7], the building of hypertext links within and between texts [5], analysis of the structure of texts to compute their similarity [3], and even a form of word sense disambiguation [1], [11].
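As a rough illustration of what automating the construction of lexical chains involves, the sketch below greedily attaches each word to an existing chain when the two are lexically related. WordNet stands in here for Roget's Thesaurus (which the paper actually uses), "related" is reduced to sharing a synset or a direct hypernym, and the whole procedure is a simplification rather than the authors' algorithm. It assumes the NLTK WordNet data has been downloaded.

```python
# Requires: pip install nltk && python -c "import nltk; nltk.download('wordnet')"
from nltk.corpus import wordnet as wn

def related(w1, w2):
    """True if the two words share a synset or an immediate hypernym."""
    syn1, syn2 = set(wn.synsets(w1)), set(wn.synsets(w2))
    if syn1 & syn2:
        return True
    hyp1 = {h for s in syn1 for h in s.hypernyms()}
    hyp2 = {h for s in syn2 for h in s.hypernyms()}
    return bool(syn1 & hyp2 or syn2 & hyp1 or hyp1 & hyp2)

def build_chains(words):
    """Attach each word to the first chain holding a related word, else start a new chain."""
    chains = []
    for w in words:
        for chain in chains:
            if any(related(w, member) for member in chain):
                chain.append(w)
                break
        else:
            chains.append([w])
    return chains

print(build_chains(["car", "truck", "banana", "vehicle", "apple", "fruit"]))
```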
  • Final Report 2004-2008
Computational Linguistics for Metadata Building: Final Report 2004-2008
[Figure: screenshot of the CLiMB Toolkit for the ARTstor Art History Survey Collection]
Principal Investigator: Judith L. Klavans, Ph.D. Project Manager: Carolyn Sheffield.
Project Team: University of Maryland: Mairead Hunter, Tatyana Lavut, Jimmy Lin, Tandeep Sidhu, Dagobert Soergel, Rachel Wadsworth, Deborah Wallace; Drexel University: Eileen Abels, Joan Beaudoin, Laura Jenemann, Dwight Swanson; Columbia University: Tom Lippincott, Rebecca Passonneau, Tae Yano.
University of Maryland, College Park

Executive Summary
1. Project Background
2. The CLiMB Toolkit
   2.1. System Specifications
   2.2. Workflow Studies and Usability Testing
   2.3. Importing Collections
   2.4. Examining Images
   2.5. Reviewing Texts
  • ALEXIA - Acquisition of Lexical Chains for Text Summarization
UNIVERSITY OF BEIRA INTERIOR, Department of Computer Science
ALEXIA - Acquisition of Lexical Chains for Text Summarization
Cláudia Sofia Oliveira Santos
A Thesis submitted to the University of Beira Interior for the Degree of Master of Computer Science Engineering. Supervisor: Prof. Gaël Harry Dias, University of Beira Interior. Covilhã, Portugal, February 2006

Acknowledgments: There are many people without whose support and contributions this thesis would not have been possible. I am especially grateful to my supervisor, Gaël Dias, for his constant support and help, both professional and personal, for getting me interested in the subject of Natural Language Processing, and for introducing me to the summarization field. I would like to express my sincere thanks to Guillaume Cleuziou for all his help and suggestions about how to use his clustering algorithm, for making its code available, and for always being there at an e-mail distance. I would like to thank all my friends, but in a special way Ricardo, Suse, Eduardo, and Rui Paulo, for all their help and support when sometimes I thought I would not make it, and for being really good friends. I would also like to thank the HULTIG research group and the members of the LIFO of the University of Orléans (France), in particular Sylvie Billot, and one more time Guillaume, who helped me feel less lonely in France and helped me in the beginning of this work. Finally, I am eternally indebted to my family. I would especially like to thank my parents and my sister, Joana, who were always there for me, always gave me support and encouragement all through my course, and always expressed their love.
  • Automatically Identifying Lexical Chains by Means of Statistical Methods— a Knowledge-Free Approach
Automatically Identifying Lexical Chains by Means of Statistical Methods: A Knowledge-Free Approach
Master-Thesis by Steffen Remus, October 2012
Technische Universität Darmstadt, Department of Computer Science, Ubiquitous Knowledge Processing (UKP) Lab
First reviewer: Prof. Dr. Chris Biemann; second reviewer: Dr. Sabine Bartsch

Declaration on the Master-Thesis: I hereby affirm that I wrote the present Master-Thesis without the help of third parties and using only the cited sources and aids. All passages taken from sources are marked as such. This work has not been submitted in the same or a similar form to any examination authority before. Darmstadt, 24 October 2012 (S. Remus)

Abstract: The identification of lexical chains is an important building block in modern natural language processing applications such as summarization or text segmentation. In order to extract lexical chains, it is necessary to identify the important lexico-semantic relations in a text. Traditional approaches mostly rely on the use of knowledge resources such as thesauri or lexical databases like WordNet. Hence, the quality of the extracted lexical chains highly depends on the quality and quantity of the entries in the particular knowledge resource. Statistical methods, on the other hand, have proven to deliver good results in many natural language processing applications where the only prerequisite is a sufficiently large and qualitatively good data collection. This thesis examines the suitability of statistical methods for the task of identifying lexico-semantic relations in order to build proper lexical chains.
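The abstract contrasts knowledge-based chaining with statistical, knowledge-free identification of lexico-semantic relations. One standard statistical ingredient for such approaches is co-occurrence significance, for example pointwise mutual information (PMI); the toy corpus and the idea of thresholding PMI to seed chains below are illustrative and not taken from the thesis.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus; a knowledge-free method sees only raw sentences like these.
corpus = [
    "the dog chased the cat",
    "the cat climbed the tree",
    "the dog barked at the cat",
    "stocks fell as the market closed",
]

word_counts = Counter()
pair_counts = Counter()
for sentence in corpus:
    words = set(sentence.split())
    word_counts.update(words)
    pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))

n = len(corpus)

def pmi(w1, w2):
    """Pointwise mutual information of two words co-occurring in a sentence."""
    joint = pair_counts[frozenset((w1, w2))] / n
    if joint == 0:
        return float("-inf")
    return math.log2(joint / ((word_counts[w1] / n) * (word_counts[w2] / n)))

# Word pairs whose PMI exceeds some threshold could be treated as related
# and used to seed lexical chains.
print(pmi("dog", "cat"), pmi("dog", "market"))
```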
  • Automatic Text Summarization Using Lexical Chains: Algorithms and Experiments
AUTOMATIC TEXT SUMMARIZATION USING LEXICAL CHAINS: ALGORITHMS AND EXPERIMENTS
Maheedhar Kolla, B.Tech, Jawaharlal Nehru Technological University, 2002
A Thesis Submitted to the School of Graduate Studies of the University of Lethbridge in Partial Fulfillment of the Requirements for the Degree of Master of Science. Department of Mathematics and Computer Science, University of Lethbridge, Lethbridge, Alberta, Canada. © Maheedhar Kolla, 2004

Abstract: Summarization is a complex task that requires understanding of the document content to determine the importance of the text. Lexical cohesion is a method to identify connected portions of the text based on the relations between the words in the text. Lexical cohesive relations can be represented using lexical chains. Lexical chains are sequences of semantically related words spread over the entire text. Lexical chains are used in a variety of Natural Language Processing (NLP) and Information Retrieval (IR) applications. In the current thesis, we propose a lexical chaining method that includes the glossary relations in the chaining process. These relations enable us to identify topically related concepts, for instance dormitory and student, and thereby enhance the identification of cohesive ties in the text. We then present methods that use the lexical chains to generate summaries by extracting sentences from the document(s). Headlines are generated by filtering the portions of the extracted sentences which do not contribute towards the meaning of the sentence. Generated headlines can be used in real-world applications to skim through document collections in a digital library. Multi-document summarization is gaining demand with the explosive growth of online news sources. It requires identification of the several themes present in the collection to attain good compression and avoid redundancy.
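The core pipeline in the abstract, building lexical chains and then extracting the sentences the strongest chains pass through, can be sketched as follows. The chains here are supplied by hand and the scoring rule (a sentence scores one point per strong-chain word it contains) is a simplification for illustration, not Kolla's actual algorithm.

```python
# Simplified chain-based sentence extraction: score each sentence by how many
# of its words belong to "strong" lexical chains, then keep the top sentences.
# The chains are given by hand here; a real system would build them automatically.
document = [
    "The university opened a new dormitory for first-year students.",
    "Construction costs rose sharply last quarter.",
    "Students move into the residence hall next month.",
]

# Hand-made lexical chains; chain strength is taken to be its length.
chains = [
    {"university", "dormitory", "students", "residence", "hall"},
    {"construction", "costs"},
]
strong_words = set().union(*(c for c in chains if len(c) >= 3))

def score(sentence):
    """Number of strong-chain words appearing in the sentence."""
    tokens = {w.strip(".,").lower() for w in sentence.split()}
    return len(tokens & strong_words)

summary_size = 2
summary = sorted(document, key=score, reverse=True)[:summary_size]
print(summary)
```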
  • Enhanced Word Embeddings Using Multi-Semantic Representation Through Lexical Chains
ENHANCED WORD EMBEDDINGS USING MULTI-SEMANTIC REPRESENTATION THROUGH LEXICAL CHAINS
A Preprint, January 25, 2021
Terry Ruas (University of Michigan - Dearborn; University of Wuppertal), Charles Henrique Porto Ferreira (Federal University of ABC), William Grosky (University of Michigan - Dearborn), Fabrício Olivetti de França and Débora Maria Rossi de Medeiros (Federal University of ABC)
[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract: The relationship between words in a sentence often tells us more about the underlying semantic content of a document than its actual words, individually. In this work, we propose two novel algorithms, called Flexible Lexical Chain II and Fixed Lexical Chain II. These algorithms combine the semantic relations derived from lexical chains, prior knowledge from lexical databases, and the robustness of the distributional hypothesis in word embeddings as building blocks forming a single system. In short, our approach has three main contributions: (i) a set of techniques that fully integrate word embeddings and lexical chains; (ii) a more robust semantic representation that considers the latent relation between words in a document; and (iii) lightweight word embeddings models that can be extended to any natural language task. We intend to assess the knowledge of pre-trained models to evaluate their robustness in the document classification task. The proposed techniques are tested against seven word embeddings algorithms using five different machine learning classifiers over six scenarios in the document classification task. Our results show that the integration between lexical chains and word embeddings representations sustains state-of-the-art results, even against more complex systems.
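The evaluation described above feeds document representations built from lexical chains and word embeddings into standard machine learning classifiers. Below is a minimal sketch of only that final step, with random vectors standing in for real chain- or embedding-based document representations and scikit-learn's logistic regression as one of the classifier families mentioned; the data, dimensions, and split are placeholders, not the paper's setup.

```python
# Document classification on top of fixed-size document vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_docs, dim = 200, 50

# Pretend class 0 and class 1 documents come from slightly shifted distributions;
# in practice each row would be a chain/embedding-based document vector.
X = np.vstack([rng.normal(0.0, 1.0, (n_docs // 2, dim)),
               rng.normal(0.5, 1.0, (n_docs // 2, dim))])
y = np.array([0] * (n_docs // 2) + [1] * (n_docs // 2))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```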