Semantic Feature Extraction Using Multi-Sense Embeddings and Lexical Chains

by

Terry L. Ruas

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer and Information Science) in the University of Michigan-Dearborn 2019

Doctoral Committee:
Professor William Grosky, Chair
Assistant Professor Mohamed Abouelenien
Rajeev Agrawal, US Army Engineer Research and Development Center
Associate Professor Marouane Kessentini
Associate Professor Luis Ortiz
Professor Armen Zakarian

Terry L. Ruas
[email protected]
ORCID iD: 0000-0002-9440-780X

© Terry L. Ruas 2019

DEDICATION

To my father, Júlio César Perez Ruas, to my mother, Vládia Sibrão de Lima Ruas, and to my dearest friend and mentor, William Grosky. This work is also a result of those who made me laugh, cry, live, and die, because if it were not for them, I would not be who I am today.

ACKNOWLEDGEMENTS

First, I would like to express my gratitude to my parents, who always encouraged me, no matter how dire the situation seemed. With the same importance, I am thankful to my dear friend and advisor, Prof. William Grosky, without whose continuous support of my Ph.D. study and related research nothing would have been possible. His patience, motivation, and immense knowledge are more than I could ever have wished for. His guidance helped me during all the time spent on the research and writing of this dissertation. I could not have asked for better company throughout this challenge. If I were to write down all the great moments we had together, one book would not be enough.

Besides my advisor, I would like to thank the rest of my dissertation committee: Dr. Mohamed Abouelenien, Dr. Rajeev Agrawal, Dr. Marouane Kessentini, Dr. Luis Ortiz, and Prof. Armen Zakarian, for their insightful comments and encouragement, but also for the hard questions, which inspired me to widen my research from various perspectives.

My sincere thanks also go to Prof. Akiko Aizawa, who was my supervisor during my Ph.D. internship at the National Institute of Informatics (NII) in Tokyo, Japan. The connections and experiences I had there were essential for my growth as a researcher. Working in her lab helped me polish and mature my research work, assisting me in becoming a better scholar.

I also would like to thank my dear friend Edward Williams, the person who first introduced me to the University of Michigan during a conference in Argentina in 2013. I had the pleasure of sharing countless lunches and talks with him during my stay in the United States. If it were not for him, I would probably still be working at my previous job at IBM in Brazil.

I would like to leave a special appreciation for Charles Henrique Porto Ferreira, with whom I had the pleasure of working during my last year as a doctoral student. His visit provided stimulating discussions, great quality time, and an amazing perspective on how to conduct research with integrity and organization. Anyone who has the opportunity to work with him should feel honored.

Last, but not least, I would like to thank my dear friends in Brazil, whom I chose to be my family. It would not be fair to list them and risk leaving someone out. For this reason, I express my gratitude to all of them, who directly and indirectly made my journey easier during tough times.

Absque sudore et labore nullum opus perfectum est... (Without sweat and toil, no work is made perfect.)

TABLE OF CONTENTS

DEDICATION ii
ACKNOWLEDGEMENTS iii
LIST OF TABLES viii
LIST OF FIGURES x
LIST OF ABBREVIATIONS xii
ABSTRACT xv

CHAPTER
I. Introduction 1
    1.1 Problem Context 2
    1.2 Main Objectives 4
    1.3 Dissertation Structure 5
II. Background and Related Work 7
    2.1 Word Sense Disambiguation 7
        2.1.1 Related Work in Word Sense Disambiguation 9
        2.1.2 Keyword Extraction and Related Work 10
    2.2 Lexical Chains 12
        2.2.1 Related Work in Lexical Chains 15
    2.3 Word Embeddings 17
        2.3.1 Multi-Sense Embeddings 20
        2.3.2 Related Work in Word Embeddings 22
        2.3.3 Related Work in Multi-Sense Embeddings 25
        2.3.4 Related Work in Document Classification and Embeddings 29
III. Exploring Multi-Sense Embeddings and Lexical Chains 32
    3.1 Semantic Architecture Representation 32
    3.2 Synset Disambiguation, Annotation, and Embeddings 34
        3.2.1 Most Suitable Sense Annotation (MSSA) 35
        3.2.2 Most Suitable Sense Annotation N Refined (MSSA-NR) 39
        3.2.3 Most Suitable Sense Annotation - Dijkstra (MSSA-D) 42
        3.2.4 From Synsets to Embeddings (Synset2Vec) 43
        3.2.5 Complexity Analysis 44
    3.3 Extending Lexical Chains 45
        3.3.1 Flexible Lexical Chains II (FLLC II) 49
        3.3.2 Fixed Lexical Chains II (FXLC II) 52
        3.3.3 From Lexical Chains to Embeddings (Chains2Vec) 53
        3.3.4 Building Lexical Chains 54
IV. Experiments and Validation Tasks 59
    4.1 Word Similarity Task 59
        4.1.1 Training Corpus 59
        4.1.2 Hyperparameters, Setup, and Details 60
        4.1.3 Benchmark Details for Word Similarity Task 63
        4.1.4 No Context Word Similarity 65
        4.1.5 Context Word Similarity 74
    4.2 Further Discussions and Limitations on MSSA 77
    4.3 Document Classification Task 80
        4.3.1 Dataset Details 80
        4.3.2 Machine Learning Classifiers 82
        4.3.3 Word Embedding Models Characteristics 85
        4.3.4 Document Embedding Models Characteristics 86
        4.3.5 Experiment Configuration 87
        4.3.6 Document Classification Task Results 90
        4.3.7 Lexical Chains Behavior Analysis 98
    4.4 Further Discussions and Limitations on FLLC II and FXLC II 102
V. Final Considerations 106
    5.1 Future Directions 111
        5.1.1 Scientific Paper Mining and Recommendation 111
        5.1.2 Related Work in Scientific Paper Mining and Recommendation 114
VI. Early Findings and Contributions 119
    6.1 Word Sense Disambiguation Techniques 119
        6.1.1 Path-Based Measures: Wu & Palmer 119
        6.1.2 Information Content-Based Measures: Jiang & Conrath 120
        6.1.3 Feature-Based Measures: Tversky 120
        6.1.4 Hybrid Measures: Zhou 121
    6.2 Best Synset Disambiguation and Lexical Chains Algorithms 122
        6.2.1 Best Synset Disambiguation Algorithm (BSD) 124
        6.2.2 Flexible Lexical Chains Algorithm (FLC) 128
        6.2.3 Flexible to Fixed Lexical Chains Algorithm (F2F) 130
        6.2.4 Fixed Lexical Chains Algorithm (FXLC) 131
        6.2.5 Distributed Semantic Extraction Mapping 133
        6.2.6 Semantic Term Frequency-Inverse Document Frequency (TF-IDF) 134
    6.3 Proof-of-Concept Experiments 135
        6.3.1 Semantic Extraction Based on Lexical Chains 136
        6.3.2 Semantic Extraction - Reuters-21578 Benchmarking 139
        6.3.3 Keyconcept Extraction through Lexical Chains 144
BIBLIOGRAPHY 148

LIST OF TABLES

Table
3.1 Sentence example transformation from words into synsets. 55
3.2 Sample of related synsets extracted from WordNet. 56
3.3 Lexical chains construction. 56
3.4 Fictional pre-trained synset embeddings model. 57
3.5 Vector average for lexical chain centroids. 57
3.6 Cosine similarity between lexical chain elements and their centroids. 58
3.7 Final chains for the FLLC II and FXLC II algorithms. 58
4.1 Dataset token details. WD10 - English Wikipedia Dump 2010 (April); WD18 - English Wikipedia Dump 2018 (January). 60
4.2 Spearman correlation score (ρ) for the RG65 benchmark. Highest results reported in bold face. 68
4.3 Spearman correlation score (ρ) for the MEN benchmark. The results of Chen et al. [32] and word2vec are reported in Mancini et al. [99] (MSSG/NP-MSSG). Highest results reported in bold face. 69
4.4 Spearman correlation score (ρ) for the WordSim353 benchmark. Huang et al. [73] results are reported in Neelakantan et al. [122] (MSSG/NP-MSSG). Highest results reported in bold face. 71
4.5 Spearman correlation score (ρ) for the SimLex999 benchmark. Chen et al. [32] and word2vec results are reported in Mancini et al. [99] (MSSG/NP-MSSG). Highest results reported in bold face. 73
4.6 Spearman correlation score (ρ) for the MC28 benchmark. Highest results reported in bold face. 74
4.7 Spearman correlation score (ρ) for the SCWS benchmark. Huang et al. [73] results are reported in Neelakantan et al. [122] (MSSG/NP-MSSG). Highest results reported in bold face. 76
4.8 Technical details about the datasets after pre-processing. 82
4.9 Grid-search configuration parameters. 84
4.10 Word embeddings used and their main characteristics. * For USE, Cer et al. [29] report its training data as a collection of sources from Wikipedia, web news, web question-answer pages, discussion forums, and the Stanford Natural Language Inference corpus. 85
4.11 Classification accuracy for the BOW approach against the proposed techniques for each classifier and dataset. Values in bold represent the best result in that row. Underlined values represent the best value for that dataset. K-NN - K-Nearest Neighbors; RF - Random Forest; LR - Logistic Regression; SVM - Support Vector Machine; NB - Naïve Bayes. 92
4.12 Classification accuracy for word embedding models against the proposed techniques for each classifier and dataset. Values in bold represent the best result in that row. Underlined values represent the best value for that dataset. K-NN - K-Nearest Neighbors; RF - Random Forest; LR - Logistic Regression; SVM - Support Vector Machine; NB - Naïve Bayes. 94
4.13 Classification accuracy for document embedding models against the proposed techniques for each classifier and dataset. Values in bold represent the best result in that row. Underlined values represent the best value for that dataset. K-NN - K-Nearest Neighbors; RF - Random Forest; LR - Logistic Regression; SVM - Support Vector Machine; NB - Naïve Bayes.