LexBank: A Multilingual Lexical Resource for Low-Resource Languages
LEXBANK: A MULTILINGUAL LEXICAL RESOURCE FOR LOW-RESOURCE LANGUAGES

by
Feras Ali Al Tarouti
M.S., King Fahd University of Petroleum & Minerals, 2008
B.S., University of Dammam, 2001

A dissertation submitted to the Graduate Faculty of the University of Colorado Colorado Springs in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Department of Computer Science
2016

© Copyright by Feras Ali Al Tarouti 2016
All Rights Reserved

This dissertation for Doctor of Philosophy degree by Feras Ali Al Tarouti has been approved for the Department of Computer Science by
Jugal Kalita, Chair
Tim Chamillard
Rory Lewis
Khang Nhut Lam
Sudhanshu Semwal
Date

Al Tarouti, Feras A. (Ph.D., Computer Science)
LexBank: A Multilingual Lexical Resource for Low-Resource Languages
Dissertation directed by Professor Jugal Kalita

In this dissertation, we present new methods to create essential lexical resources for low-resource languages. Specifically, we develop methods for enhancing automatically created wordnets. As a baseline, we start by producing core wordnets, for several languages, using methods that need limited freely available resources for creating lexical resources (Lam et al., 2014a,b, 2015b). Then, we establish the semantic relations between synsets in wordnets we create. Next, we introduce a new method to automatically add glosses to the synsets in our wordnets. Our techniques use limited resources as input to ensure that they can be felicitously used with languages that currently lack many original resources. Most existing research works with languages that have significant lexical resources available, which are costly to construct. To make our created lexical resources publicly available, we developed LexBank which is a web-based system that provides language services for several low-resource languages.

To my mother, father and my wife.
Acknowledgments

I would like to express my appreciation to my wife and the mother of my kids, Omima, for the unlimited support she gave me during my journey toward my Ph.D. I am also very grateful for the support and guidance provided by my advisor, Dr. Jugal Kalita. In addition, I would like to thank my dissertation committee members, Dr. Sudhanshu Semwal, Dr. Tim Chamillard, Dr. Rory Lewis and Dr. Khang Nhut Lam, for their guidance and consultation.

Table of Contents

1 Introduction
    1.1 Motivation
    1.2 Research Focus
        1.2.1 Assamese Language
            1.2.1.1 Assamese Script
            1.2.1.2 Assamese Morphology
            1.2.1.3 Assamese Syntax
        1.2.2 Vietnamese Language
            1.2.2.1 Vietnamese Script
            1.2.2.2 Vietnamese Morphology
            1.2.2.3 Vietnamese Syntax
    1.3 Research Contributions
2 Case Study: The Current Status of and Challenges in Processing Information in Arabic
    2.1 Introduction
    2.2 Fundamentals of Arabic
        2.2.1 Arabic Script
        2.2.2 Arabic Morphology
        2.2.3 Arabic Syntax
    2.3 Summary
3 Literature Review
    3.1 Automatic Construction of Wordnets
    3.2 Wordnet Management Tools
    3.3 Creating Bilingual Dictionaries
    3.4 Summary
4 Automatically Constructing Structured Wordnets
    4.1 Constructing Core Wordnets
    4.2 Constructing Wordnet Semantic Relations
    4.3 Experiments and Evaluation
    4.4 Summary
5 Enhancing Automatic Wordnet Construction Using Word Embeddings
    5.1 Introduction
    5.2 Similarity Metrics
    5.3 Generating Word Embeddings
    5.4 Removing Irrelevant Words in Synsets
    5.5 Validating Candidate Relations
    5.6 Selecting Thresholds
    5.7 Experiments
        5.7.1 Generating Vector Representations of Wordnets Words
        5.7.2 Producing Word Embeddings for Arabic
    5.8 Evaluation and Discussion
    5.9 Summary
6 Selecting Glosses for Wordnet Synsets Using Word Embeddings
    6.1 Literature Review
    6.2 Creating Language Model Using Word Embeddings
    6.3 Generating Vector Representation of Wordnet Synsets
    6.4 Automatically Selecting a Synset Gloss From a Corpus Using Synset2Vec
    6.5 Evaluation
        6.5.1 Using Synset2vec to Select Glosses for PWN Synsets
        6.5.2 Using Synset2vec to Select Glosses for Arabic, Assamese and Vietnamese Synsets
        6.5.3 Results and Discussion
    6.6 Summary
7 LexBank: A Multilingual Lexical Resource
    7.1 Introduction
    7.2 Database Design
        7.2.1 The System Settings Database
            7.2.1.1 Users_Info
            7.2.1.2 System_log
        7.2.2 The Lexical Resources Database
            7.2.2.1 CoreWordnet
            7.2.2.2 Sem_Relations
            7.2.2.3 WordnetGlosses
            7.2.2.4 Sem_Relations_Eval_Data
            7.2.2.5 Sem_Relations_Eval_Response
            7.2.2.6 WordnetGlosses_Eval_Data
            7.2.2.7 WordnetGlosses_Eval_Response
    7.3 Application Layer
    7.4 Web Interface Design and Implementation
        7.4.1 Registration Form
        7.4.2 Log-in Form
        7.4.3 The Main Menu
        7.4.4 Searching Wordnet By Lexeme Web Form
        7.4.5 Searching Wordnet by OffsetPos Web Form
        7.4.6 Evaluating Semantic Relations Between Synsets Web Form
        7.4.7 Evaluating Wordnet Synsets Glosses Web Form
        7.4.8 Searching Bilingual Dictionary Web Form
        7.4.9 Users Management Web Form
    7.5 Summary
8 Conclusions
9 Future Work
    9.1 Extending Bilingual Dictionaries
        9.1.1 Related Work
        9.1.2 Extending Bilingual Dictionaries Using Structured Wordnets
    9.2 Integrating Part-of-speech Tagging into Wordnet Construction
    9.3 Wordnet Expansion Using Word Embeddings
    9.4 Producing Vector Representation for Multi-word Lexemes
    9.5 Vector Representation for Multi-lingual Wordnets
Bibliography
Appendices
A Papers Resulting from The Dissertation
B Data Processing Software Code
    B.1 ComputCosineSim.py
    B.2 GenerateVectorForSynset.py
    B.3 GenerateVectorForGloss.py
    B.4 ComputeGlossSynsetSimilarity.py
C Microsoft SQL Server Tables
D LexBank Utility Class
E IRB Approval Letter

List of Tables

3.1 A list of the Java libraries tested in (Finlayson, 2014).
3.2 A comparison between some of the Java libraries for accessing the PWN.
4.1 Wordnet semantic relations.
4.2 Size, coverage and precision of the core wordnets we create for Arabic, Assamese and Vietnamese.
4.3 Precision of the semantic relations established for our Arabic wordnet.
5.1 An example of cosine similarity between words in a candidate synset
5.2 The weighted average similarity between related words in AWN.
5.3 Comparison between the weighted similarity averages obtained using different word2vec settings.
5.4 Comparison between the number of synsets in AWN and our Arabic wordnet using different threshold values.
5.5 Precision of the Arabic wordnet we create.
5.6 Precision of the Assamese wordnet we create.
5.7 Precision of the Vietnamese wordnet we create.
5.8 Examples of related words and their cosine similarity from our Arabic wordnet.
5.9 Examples of related words and their cosine similarity from our Assamese wordnet.
5.10 Examples of related words and their cosine similarity from our Vietnamese wordnet.
6.1 Meanings of the noun “spill” and its synonyms.
6.2 Cosine similarity between the different synset vectors and glosses of the word “abduction” in PWN.
6.3 The precision of selecting glosses for PWN synsets
6.4 Examples of Arabic glosses we produce in our Arabic wordnet.
6.5 Examples of Assamese glosses we produce in our Assamese wordnet.
6.6 Examples of Vietnamese glosses we produce in our Vietnamese wordnet.
6.7 The precision of selecting glosses for Arabic, Assamese and Vietnamese synsets

List of Figures

3.1 An overview of the CSS management tool, adapted from (Nagvenkar et al., 2014)
4.1 IWND
4.2 Core wordnet mapping to structured wordnet.
4.3 Creating wordnet semantic relations using intermediate wordnet.
4.4 The effect of missing synsets in recovering wordnet semantic relations using intermediate wordnet.
4.5 Percentage of synset semantic relations recovered for the Arabic, Assamese and Vietnamese wordnets.
5.1 A histogram of synonyms, semantically.