LEXBANK: A MULTILINGUAL LEXICAL RESOURCE FOR LOW-RESOURCE
LANGUAGES
by
Feras Ali Al Tarouti
M.S., King Fahd University of Petroleum & Minerals, 2008
B.S., University of Dammam, 2001
A dissertation submitted to the Graduate Faculty of the
University of Colorado Colorado Springs
in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
Department of Computer Science
2016
© Copyright by Feras Ali Al Tarouti 2016. All Rights Reserved.
This dissertation for Doctor of Philosophy degree by
Feras Ali Al Tarouti
has been approved for the
Department of Computer Science
by
Jugal Kalita, Chair
Tim Chamillard
Rory Lewis
Khang Nhut Lam
Sudhanshu Semwal
Date
Al Tarouti, Feras A. (Ph.D., Computer Science)
LexBank: A Multilingual Lexical Resource for Low-Resource Languages
Dissertation directed by Professor Jugal Kalita
In this dissertation, we present new methods to create essential lexical resources for low-resource languages. Specifically, we develop methods for enhancing automatically created wordnets. As a baseline, we start by producing core wordnets for several languages, using methods that require only limited, freely available input resources (Lam et al., 2014a,b, 2015b). Then, we establish the semantic relations between synsets in the wordnets we create. Next, we introduce a new method to automatically add glosses to the synsets in our wordnets. Our techniques use limited resources as input to ensure that they can be felicitously used with languages that currently lack many original resources. Most existing research works with languages that have significant lexical resources available, which are costly to construct. To make our created lexical resources publicly available, we developed LexBank, a web-based system that provides language services for several low-resource languages.

To my mother, father and my wife.
Acknowledgments
I would like to express my appreciation to my wife and the mother of my kids, Omima, for the unlimited support she gave me during my journey toward my Ph.D. I am also very grateful for the support and guidance provided by my advisor, Dr. Jugal Kalita. In addition, I would like to thank my dissertation committee members, Dr. Sudhanshu Semwal, Dr. Tim Chamillard, Dr. Rory Lewis and Dr. Khang Nhut Lam, for their guidance and consultation.
Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Research Focus
    1.2.1 Assamese Language
      1.2.1.1 Assamese Script
      1.2.1.2 Assamese Morphology
      1.2.1.3 Assamese Syntax
    1.2.2 Vietnamese Language
      1.2.2.1 Vietnamese Script
      1.2.2.2 Vietnamese Morphology
      1.2.2.3 Vietnamese Syntax
  1.3 Research Contributions
2 Case Study: The Current Status of and Challenges in Processing Information in Arabic
  2.1 Introduction
  2.2 Fundamentals of Arabic
    2.2.1 Arabic Script
    2.2.2 Arabic Morphology
    2.2.3 Arabic Syntax
  2.3 Summary
3 Literature Review
  3.1 Automatic Construction of Wordnets
  3.2 Wordnet Management Tools
  3.3 Creating Bilingual Dictionaries
  3.4 Summary
4 Automatically Constructing Structured Wordnets
  4.1 Constructing Core Wordnets
  4.2 Constructing Wordnet Semantic Relations
  4.3 Experiments and Evaluation
  4.4 Summary
5 Enhancing Automatic Wordnet Construction Using Word Embeddings
  5.1 Introduction
  5.2 Similarity Metrics
  5.3 Generating Word Embeddings
  5.4 Removing Irrelevant Words in Synsets
  5.5 Validating Candidate Relations
  5.6 Selecting Thresholds
  5.7 Experiments
    5.7.1 Generating Vector Representations of Wordnet Words
    5.7.2 Producing Word Embeddings for Arabic
  5.8 Evaluation and Discussion
  5.9 Summary
6 Selecting Glosses for Wordnet Synsets Using Word Embeddings
  6.1 Literature Review
  6.2 Creating Language Model Using Word Embeddings
  6.3 Generating Vector Representation of Wordnet Synsets
  6.4 Automatically Selecting a Synset Gloss From a Corpus Using Synset2Vec
  6.5 Evaluation
    6.5.1 Using Synset2vec to Select Glosses for PWN Synsets
    6.5.2 Using Synset2vec to Select Glosses for Arabic, Assamese and Vietnamese Synsets
    6.5.3 Results and Discussion
  6.6 Summary
7 LexBank: A Multilingual Lexical Resource
  7.1 Introduction
  7.2 Database Design
    7.2.1 The System Settings Database
      7.2.1.1 Users_Info
      7.2.1.2 System_log
    7.2.2 The Lexical Resources Database
      7.2.2.1 CoreWordnet
      7.2.2.2 Sem_Relations
      7.2.2.3 WordnetGlosses
      7.2.2.4 Sem_Relations_Eval_Data
      7.2.2.5 Sem_Relations_Eval_Response
      7.2.2.6 WordnetGlosses_Eval_Data
      7.2.2.7 WordnetGlosses_Eval_Response
  7.3 Application Layer
  7.4 Web Interface Design and Implementation
    7.4.1 Registration Form
    7.4.2 Log-in Form
    7.4.3 The Main Menu
    7.4.4 Searching Wordnet by Lexeme Web Form
    7.4.5 Searching Wordnet by OffsetPos Web Form
    7.4.6 Evaluating Semantic Relations Between Synsets Web Form
    7.4.7 Evaluating Wordnet Synsets Glosses Web Form
    7.4.8 Searching Bilingual Dictionary Web Form
    7.4.9 Users Management Web Form
  7.5 Summary
8 Conclusions
9 Future Work
  9.1 Extending Bilingual Dictionaries
    9.1.1 Related Work
    9.1.2 Extending Bilingual Dictionaries Using Structured Wordnets
  9.2 Integrating Part-of-speech Tagging into Wordnet Construction
  9.3 Wordnet Expansion Using Word Embeddings
  9.4 Producing Vector Representation for Multi-word Lexemes
  9.5 Vector Representation for Multi-lingual Wordnets
Bibliography
Appendices
A Papers Resulting from The Dissertation
B Data Processing Software Code
  B.1 ComputCosineSim.py
  B.2 GenerateVectorForSynset.py
  B.3 GenerateVectorForGloss.py
  B.4 ComputeGlossSynsetSimilarity.py
C Microsoft SQL Server Tables
D LexBank Utility Class
E IRB Approval Letter
List of Tables

3.1 A list of the Java libraries tested in (Finlayson, 2014).
3.2 A comparison between some of the Java libraries for accessing the PWN.
4.1 Wordnet semantic relations.
4.2 Size, coverage and precision of the core wordnets we create for Arabic, Assamese and Vietnamese.
4.3 Precision of the semantic relations established for our Arabic wordnet.
5.1 An example of cosine similarity between words in a candidate synset.
5.2 The weighted average similarity between related words in AWN.
5.3 Comparison between the weighted similarity averages obtained using different word2vec settings.
5.4 Comparison between the number of synsets in AWN and our Arabic wordnet using different threshold values.
5.5 Precision of the Arabic wordnet we create.
5.6 Precision of the Assamese wordnet we create.
5.7 Precision of the Vietnamese wordnet we create.
5.8 Examples of related words and their cosine similarity from our Arabic wordnet.
5.9 Examples of related words and their cosine similarity from our Assamese wordnet.
5.10 Examples of related words and their cosine similarity from our Vietnamese wordnet.
6.1 Meanings of the noun “spill” and its synonyms.
6.2 Cosine similarity between the different synset vectors and glosses of the word “abduction” in PWN.
6.3 The precision of selecting glosses for PWN synsets.
6.4 Examples of Arabic glosses we produce in our Arabic wordnet.
6.5 Examples of Assamese glosses we produce in our Assamese wordnet.
6.6 Examples of Vietnamese glosses we produce in our Vietnamese wordnet.
6.7 The precision of selecting glosses for Arabic, Assamese and Vietnamese synsets.
List of Figures

3.1 An overview of the CSS management tool, adapted from (Nagvenkar et al., 2014).
4.1 IWND.
4.2 Core wordnet mapping to structured wordnet.
4.3 Creating wordnet semantic relations using intermediate wordnet.
4.4 The effect of missing synsets in recovering wordnet semantic relations using intermediate wordnet.
4.5 Percentage of synset semantic relations recovered for the Arabic, Assamese and Vietnamese wordnets.
5.1 A histogram of synonyms, semantically related words, and non-related words extracted from AWN.
6.1 An example of creating a vector for a wordnet synset that includes more than one word.
6.2 An example of creating vectors for wordnet synsets that share a single word.
7.1 An overview of the LexBank system.
7.2 LexBank web site map.
7.3 The registration web form.
7.4 Sequence diagram of the registration process.
7.5 The log-in web form.
7.6 Sequence diagram of the login process.
7.7 The main menu.
7.8 The web form for searching wordnet by lexeme. The form shows the result of searching the Arabic lexeme (مصر), which means Egypt.
7.9 Sequence diagram of the process of searching wordnet using lexeme.
7.10 The web form for searching wordnet by OffsetPos. The form shows the result of searching the Arabic synset (08897065-n).
7.11 The web form for searching wordnet by OffsetPos. The form shows the result of searching the Vietnamese synset (08897065-n).
7.12 The web form for searching wordnet by OffsetPos. The form shows the result of searching the Assamese synset (08897065-n). The third part meronym in Assamese is wrong. It comes from the verb meaning of “desert”, which means to leave without intending to return.
7.13 Sequence diagram of the process of searching wordnet using OffsetPos.
7.14 The web form for evaluating semantic relations between synsets in a wordnet. The form shows an example of evaluating a hyponymy relation between two Assamese lexemes, one for radio telegraph and the other for radio.
7.15 Sequence diagram of the process of evaluating the relation between two lexemes.
7.16 The web form for evaluating wordnet synset glosses. The form shows an example of evaluating the Arabic synset (13108841-n).
7.17 Sequence diagram of the process of evaluating the relation between two lexemes.
7.18 The web form for searching a bilingual dictionary. The form shows the result of translating the Arabic word (مصر), which means Egypt, to Assamese.
7.19 Sequence diagram of the process of searching a bilingual dictionary.
7.20 The web form for managing users in LexBank.
7.21 Sequence diagram of the process of managing users in LexBank.
9.1 The IW approach for creating a new bilingual dictionary.
9.2 Extending bilingual dictionaries using structured wordnets.

Chapter 1
INTRODUCTION
1.1 Motivation
The word lexicon means a repository that stores the vocabulary of a person, language or branch of knowledge such as computer science, the military or medicine (Wikipedia, 2016c). In linguistics, a lexicon is an inventory of basic units of meaning, or lexemes. In practice, a lexicon (we may also call it a dictionary) may be printed as a book, or stored in a computer database that can be searched and used by a program. A lexical resource is a group of lexical units that provide linguistic information. The lexical units can be morphemes, words or multi-word expressions. The basic unit of a lexical resource is usually called a lexical entry. Some lexical resources can be used by humans directly, while others are machine readable. Lexical resources are the base of most Natural Language Processing (NLP) applications.
There are many types of lexical resources. Based on its type, a lexical resource can provide syntactic, morphological, phonological or semantic information. Unilingual dictionaries, bilingual dictionaries and wordnets are examples of lexical resources. There are a few fortunate languages, such as English and Chinese, which have a relatively large number of high-quality lexical resources. These languages are usually called resource-rich. Most high-quality lexical resources for the resource-rich languages have been painstakingly created manually by researchers over many years. Unfortunately, most other human languages lack many such lexical resources. Languages which lack lexical and other resources are called resource-low or resource-poor languages. While some of these languages have some resources, many others barely have any resources at all. Especially poor in this context are the endangered languages around the world.
One important resource that is very helpful in computational processing and in human language learning is a thesaurus providing synonyms and antonyms of words. An extended version of a thesaurus that provides additional relations among words in the computational context is usually called a wordnet. A wordnet is a structured lexical ontology that groups words based on their meaning into sets called synsets. For example, the words helicopter, chopper, whirlybird and eggbeater are grouped in one synset that has the gloss, an aircraft without wings that obtains its lift from the rotation of overhead blades. The wordnet connects synsets with each other based on semantic relations. Wordnets are used in many applications such as word sense disambiguation, machine translation, information retrieval, text classification and text summarization.
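The synset-and-relations structure described above can be sketched as a tiny data model. The class name, fields and relation labels below are our own illustration for exposition, not the actual PWN schema:

```python
# A minimal sketch of a wordnet: words grouped into synsets,
# and synsets linked to each other by named semantic relations.

class Synset:
    def __init__(self, lemmas, gloss):
        self.lemmas = lemmas      # words that share one meaning
        self.gloss = gloss        # short definition of that meaning
        self.relations = {}       # relation name -> list of target synsets

    def add_relation(self, name, target):
        self.relations.setdefault(name, []).append(target)

helicopter = Synset(
    ["helicopter", "chopper", "whirlybird", "eggbeater"],
    "an aircraft without wings that obtains its lift from "
    "the rotation of overhead blades",
)
aircraft = Synset(["aircraft"], "a vehicle that can fly")

# A hypernym relation: helicopter is a kind of aircraft.
helicopter.add_relation("hypernym", aircraft)
```

Real wordnets store many more relation types (hyponymy, meronymy, holonymy), but the shape — synonym sets as nodes, relations as labeled edges — is the same.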
The Princeton WordNet (PWN) is the original English version of such a wordnet and has been produced with diligent manual work, augmented by the development of computational tools, over several decades at Princeton University. Similar complete wordnets have also been produced for a small number of additional languages such as French (Sagot and Fišer, 2008), Finnish (Lindén and Carlson, 2010) and Japanese (Kaji and Watanabe, 2006). Efforts to produce wordnets for a variety of other languages have been proposed, but most are moving slowly, such as the efforts to construct the Asian wordnets (Charoenporn et al., 2008) and Indian wordnets (Bhattacharyya, 2010).
Another important type of resource is the bilingual dictionary, an essential tool for
human language learners. Most existing (online) bilingual dictionaries are between two
resource-rich languages or between a resource-rich language and a resource-poor language.
It is fortunate that many endangered languages have one bilingual dictionary, created usually by explorers, evangelists or other scholars. However, dictionaries or translators between two resource-poor languages do not really exist. Wiktionary1, a dictionary created by volunteers, supports over 172 languages, although coverage is poor for many of them. The online machine translators developed by Google2 and Microsoft3 provide pairwise translations, including translations for single words, for 103 and 53 languages, respectively. While this is a wide range of languages, these machine translators still leave out many widely spoken languages, not to mention endangered ones. 7,097 languages are spoken in the world today (Gordon and Grimes, 2005), of which 400 are spoken by at least a million people.
In previous work we focused on developing new techniques that leverage existing resources for resource-rich languages to build bilingual dictionaries, core wordnets and other resources such as simple translators for resource-poor languages, including a few endangered ones (Lam et al., 2014a,b, 2015b). In this thesis work, we take these resources to the next level by improving the functionality, quality and coverage of these resources. We present several new techniques that we did not use in our previous work. Our ultimate goal is to produce an integrated multilingual lexical resource available online, one that includes several important individual resources for several languages. We believe that our resources will help researchers, speakers, learners and other users of these languages.

1http://en.wiktionary.org/wiki/Wiktionary:Main_Page
2http://translate.google.com/
3http://www.bing.com/translator
1.2 Research Focus
The goal of this dissertation is to create and make available multilingual lexical resources for several languages by bootstrapping from a limited number of existing resources. Our study has the potential not only to construct new lexical resources, but also to provide support for communities using languages with limited resources. Additionally, our research presents novel approaches to generate new lexical resources from a limited number of existing resources.

The main focus of our work is to collect data from disparate sources, develop algorithms for mining and integrating such data, produce lexical resources, and evaluate the resources with regard to the quality and quantity of entries. To develop and test our ideas, we work with a few languages for which we have in-house expertise: Assamese (asm), Arabic (arb), English (eng) and Vietnamese (vie). In Chapter 2 we present a detailed introduction to Arabic. Next, we present brief introductions to Assamese and Vietnamese.
1.2.1 Assamese Language
Assamese is an Indo-European language spoken by more than 15 million people (Hinkle et al., 2013). It is mainly used in the Indian states of Assam, Arunachal Pradesh, Meghalaya, Nagaland and West Bengal. Assamese has four dialects: Standard Assamese, Jharwa, Mayang and Western Assamese (Gordon and Grimes, 2005). We present a brief description of the script, morphology and syntax of Assamese.
1.2.1.1 Assamese Script
The Assamese script consists of 37 consonants, 11 vowels, 147 conjuncts and a few punctuation marks (Hinkle et al., 2013). Unlike English, where written letters may have variable pronunciation, Assamese written letters each have one pronunciation. A consonant that does not occur at the end of a word is assumed to have an implicit vowel a following it. However, when several consonants need to be pronounced together, they are usually written using a conjunct letter.

When a vowel follows a consonant, the vowel is not written explicitly, but implicitly as an operator. These operators are attached to consonants in different positions (Hinkle et al., 2013); they can appear to the left of, to the right of, below or above the consonant. Foreign words can appear in Assamese script as transliterations. However, it is not unusual to write foreign words in foreign alphabets within a piece of Assamese text.
1.2.1.2 Assamese Morphology
Assamese morphology has two types of morphological transformations: derivational and inflectional. Around 48% of Assamese words are constructed using these two types of transformation (Sharma et al., 2008). The derivational transformation in Assamese is usually performed by changing the vowel component of a word, while the inflectional transformation is performed by adding prefixes or suffixes to a word. Assamese is well known for its complex suffixes. It is common in Assamese for a word to include a sequence of suffixes; four to six suffixes in sequence are not uncommon (Saharia et al., 2009).
In Assamese, suffixes are used for many purposes. The most common purpose of suffixes is determination (Sharma et al., 2008). In fact, a large number of Assamese suffixes are determiners. As in other languages, some determiners are attached to nouns and pronouns to make them specific, similar to using this and that in English. Unlike in many other languages, such as English, where affixes are used, determiners in Assamese are also used to turn a singular noun into a plural.
1.2.1.3 Assamese Syntax
Assamese has a relatively free syntax; it is considered a free-word-order language. This means that sentences can be written in different word orders and still have the same meaning. The normal form of a simple Assamese sentence is Subject+Object+Verb (SOV) (Sarma, 2012), although other orders are acceptable.
1.2.2 Vietnamese Language
Vietnamese, the first language of Vietnam, is an Austroasiatic language that arose in Indo-China (Thompson, 1987). It is the first language of more than 75 million people living in Vietnam (Gordon and Grimes, 2005). Also, due to emigration, it is the first language of many people living around the world, especially in East and Southeast Asia. Vietnamese, also called Annamese, has five main dialects that differ mainly in their sound systems: Northern Vietnamese, North-Central Vietnamese, Mid-Central Vietnamese, South-Central Vietnamese and Southern Vietnamese (Wikipedia, 2016a). In the next sections, we present a brief description of the script, morphology and syntax of Vietnamese.
1.2.2.1 Vietnamese Script
Old Vietnamese texts were written using Chinese characters. In the 17th century, the Latin alphabet was introduced to Vietnamese by the French. By the beginning of the 20th century, the Romanized version of Vietnamese had become dominant (Thompson, 1987).

Compared to other languages, Vietnamese has a large number of vowels. It has 11 single vowels in addition to three types of composed vowels: centering diphthongs, closing diphthongs and triphthongs (Gordon and Grimes, 2005). These vowels are created by combining single vowels. Vowels are modified by diacritics. The diacritics, which can be written above or below a vowel, are used to specify the tone of the vowel. These tones have different lengths, pitch heights, pitch melodies and phonations. There are 25 consonants in Vietnamese. Consonants are represented in the written script by a variable number of letters. Some consonants are represented using one letter, and others by a digraph, which is a combination of two letters. Some consonants are represented by more than one digraph or letter (Wikipedia, 2016a).
1.2.2.2 Vietnamese Morphology
In Vietnamese, the majority of words are polysyllabic (Noyer, 1998). Polysyllabic words are words composed of two or more syllables. The construction of polymorphemic words in Vietnamese is done in three ways: combining two words, adding affixes to a stem, or reduplication. Words formed using reduplication morphology are constructed by duplicating a word or a part of a word. There are a small number of affixes in Vietnamese, most of them prefixes and suffixes. One distinct characteristic of Vietnamese is that it does not have any number, gender, case or tense distinctions (Wikipedia, 2016b). Instead, a noun classifier is usually used as a determiner and is added after the word to specify those characteristics.
1.2.2.3 Vietnamese Syntax
Vietnamese sentences follow the Subject+Verb+Object (SVO) word order. To distinguish between verbs and nouns in a Vietnamese sentence, a copula is used before the nouns. Noun phrases are usually composed of a noun and a modifier. The modifier can be a numerator, classifier, prepositional phrase or other descriptive word. As in other languages, pronouns are used to substitute for nouns and noun phrases.
1.3 Research Contributions
The resources created in Khang Nhut Lam's Ph.D. dissertation (Lam, 2015) and reported in (Lam et al., 2014a,b, 2015b) have many holes. For example, the wordnets have only synsets, which are sets of synonyms for words. In this dissertation work, we develop algorithms and models to automatically establish the semantic relations between synsets in our previously created core wordnets for our languages of focus, using both pre-existing resources and bootstrapping with resources we create ourselves. The following are the contributions of this thesis:
• We construct the rest of the structure for our core wordnets with acceptable quality. We focus on the construction of wordnet semantic relations such as Hypernyms, Hyponyms, Member Meronyms, Part Meronyms and Part Holonyms between the synsets. We believe that our work contributes significantly to the repository of resources for languages that lack them.
• We present a method to enhance the quality of the wordnets we create in the first task by filtering out mistakenly created synsets and relations. In this task, we use a state-of-the-art technique, word embeddings (Mikolov et al., 2013). This method gives a solution to the problem of wrong translations produced by the translation method.
• We produce an approach to create a vector representation for synsets. This approach aims to produce a better way of representing meaning. The representation can be used in several areas; in this task we use it to automatically extract glosses from corpora for the wordnet synsets we create in the previous tasks. It can also be used for the word-sense disambiguation (WSD) problem, which occurs with words that have multiple meanings.
• Then, based on the vector representation of synsets, we present a novel approach to add a gloss to each synonym set (synset) in our core wordnets. A gloss is a definition or a sentence that clarifies the meaning of the synset. Glosses are mostly added manually by humans or automatically generated using rule-based generation approaches (Cucchiarelli et al., 2004).
• Finally, we present LexBank, a system that makes our created resources available to the public. We design and implement the system so that it provides useful services, in a friendly manner, for users who seek linguistic resources. We aim to make our system flexible and expandable so it can accommodate additional new languages and resources.

Chapter 2
CASE STUDY: THE CURRENT STATUS OF AND CHALLENGES IN
PROCESSING INFORMATION IN ARABIC
Since Arabic is one of the languages we use in our experiments throughout this dissertation, we present the current status of Arabic language processing as an example in this chapter.
2.1 Introduction
According to Ethnologue (Gordon and Grimes, 2005), Arabic is the official language of more than 223 million people in 25 countries, which makes it one of the most widely spoken languages in the world. Arabic is the language of Islam, the religion of 1.6 billion people around the world. Muslims are required to use Arabic to read the Quran (the Holy Book of Islam) and to perform the rituals of Islam. There are around 30 major dialects of Arabic. These dialects have different phonologies, morphologies, syntax and even lexicons (Habash, 2010). However, these dialects are not used as official languages by themselves; they are used for informal speech. For formal writing and speaking, the official Modern Standard Arabic (MSA) is used. MSA was developed based on Classical Arabic, which is the language of historical literature. Nowadays, dialects are commonly used for writing in social media, but they are rarely used in books, newspapers and literary writing. Most Arabs can speak MSA; however, it is not the natively spoken language of any region (Diab and Habash, 2007). This coexistence between MSA and dialects is problematic for Arabic language processing. It happens to be a problem in most widely spoken languages in the world (Haugen, 1966).
One important survey (Farghaly and Shaalan, 2009) discussed the importance of research in the field of Arabic processing from three perspectives. The first is the perspective of non-Arabic speakers who need to process a huge amount of Arabic text. The Department of Homeland Security in the United States is a good example; with increasing security risks, there is a crucial need to be able to understand the meaning of Arabic documents and retrieve important information from them, such as names, organizations and places. The second perspective is that of Arabic speakers. Machine translation, information retrieval, summarization, and linguistic tools are some of the applications requested by Arabic speakers.
In the rest of this chapter, we give a summary of the features that make the processing of Arabic text so challenging, and some of the solutions and resources that have been designed to address these challenges. First, in Section 2, we discuss the fundamental issues in Arabic, which are the script, the morphological issues, and the syntactical issues. Then, in Section 3, we discuss three of the most valuable resources for Arabic processing. These are the Penn Arabic Treebank (PATB), the Prague Arabic Dependency Treebank (PADT), and the Columbia Arabic Treebank (CATiB).
2.2 Fundamentals of Arabic
In this section we discuss the script, morphology and syntax of Arabic.
2.2.1 Arabic Script
Arabic is written as a right-to-left script. The Arabic script is also used by languages such as Kurdish, Urdu, Persian and Pashto (Habash, 2010). One important aspect of Arabic is that most Arabic letters are composed of two parts: a base form and a mark. There are three kinds of marks in Arabic letters. The first kind consists of dots, which are used to distinguish between letters that share the same base form. An example of letters that share the same base form are the letters (ب) “ba”, (ت) “ta”, and (ث) “tha”. The second kind of mark is the Hamza mark (ء), which can be written above some letters, as in (ؤ) “u”, or under some letters, as in (إ) “I”. Unfortunately, people often misspell words by not writing such marks, making it hard to distinguish between similar letters and causing ambiguity in the text. It is also important to notice that Hamza (ء) can also be considered a letter by itself. An example of a word that has the Hamza letter is the word (سماء), which means “sky”. The third kind of mark is the Hamza mark that distinguishes the letter (ك) “Kaf” from the letter (ل) “Lam”.
Most letters in Arabic have several shapes. The shape of a written letter is determined based on the position of that letter in the word. Let us take the letter (ق) “qaf” as an example. If it appears at the beginning of a word, it has the shape (قـ), whereas it has the shape (ـقـ) if it appears in the middle of a word, and the shape (ـق) if it is at the end of a word. All word processors select the appropriate letter shape based on the rules which govern these shapes, and therefore there is only one key for each letter. Inflectional morphology is also a factor that governs the shape of some Arabic letters. The Arabic letter “Hamza” is a good example. The word (أصدقاء), which means “friends”, becomes (أصدقائي), instead of (أصدقاءي), when we add the letter (ي), which represents the possessive pronoun “my”.
In Arabic, each letter is mapped to one unvarying sound, which makes it a phonetic always has the pronunciation /s/. On the (ﺳـ) language. For example, the Arabic letter
other hand, letter “s” in English has three pronunciations: /z/, /s/, and /sh/ as in “nose”,
“salt”, and “sugar”, respectively. However, in Arabic a short vowel may be added to the
letter to change its sound. There are three short vowels in Arabic, which means that each
letter has three more sounds in addition to the original sound. There are no dedicated letters
to represent short vowels. The short vowels may be specified in the written language using
optional diacritics. To show how the short vowels change the sound of Arabic letters, let us is pronounced as /s/; however, if we (ﺳـ) again. We said that (ﺳـ) look at the Arabic letter
add the short vowel “Dhamma” it will be pronounced as “su” and it may be written, with
If we add the short vowel “Kasra”, it will be pronounced .(ﺳuـ) the “Dhamma” diacritic, as Keep in mind that in MSA, the .(ﺳiـ) as “si” and it may be written with “Kasra” diacritic, as
writing of the diacritics is optional, although a change in a diacritic of a letter can change
the meaning of the word and may even change the morphological structure of the sentence.
Clearly, this is a major source of ambiguity in Arabic processing (Diab and Habash, 2007).
Obviously, with all these problems caused by the Arabic script, Arabic input text
has to be pre-processed to enhance recognition during the actual processing. This pre-
processing, which is called normalization, aims to standardize different Arabic script varia-
tions. Several solutions have been proposed to normalize the Arabic script. For example, (Larkey et al., 2002) normalized the corpus, the queries, and the dictionaries of Arabic using the following steps. They first unified the encoding and removed punctuation in the text. Then they removed all the diacritics and the elongation character called "tatweel". After that, they removed the Hamza mark (ﺀ) from the letter "Alif" to standardize all the variations (ﺃ), (ﺇ), and (ﺁ) to (ﺍ). Also, they replaced (ﻯﺀ) with (ﺉ), (ﻯ) with (ﻱ), and (ﺓ) with (ﻩ). The Stanford Natural Language Processing Group adopted a similar procedure in
the Stanford Arabic Statistical Parser (Green and Manning, 2010). The normalization process, as one might expect, does not come without a price. Since all these removed marks serve to resolve ambiguity, normalizing away the script variations increases the probability of ambiguity (Farghaly and Shaalan, 2009).
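The normalization steps described above can be sketched in Python. The character set and replacement rules below are assumptions based on the steps attributed to Larkey et al.; a production system would use a vetted rule set:

```python
import re

# Unicode ranges are assumptions matching the steps described above.
DIACRITICS = re.compile(r'[\u064B-\u0652]')  # fathatan .. sukun
TATWEEL = '\u0640'                            # elongation character

def normalize_arabic(text: str) -> str:
    """A minimal sketch of Arabic orthographic normalization."""
    text = DIACRITICS.sub('', text)           # strip short-vowel diacritics
    text = text.replace(TATWEEL, '')          # remove tatweel
    # unify Alif variants: hamza-above, hamza-below, madda -> bare Alif
    text = re.sub('[\u0623\u0625\u0622]', '\u0627', text)
    text = text.replace('\u0649', '\u064A')   # Alif maqsura -> Ya
    text = text.replace('\u0629', '\u0647')   # Ta marbuta -> Ha
    return text

print(normalize_arabic('إِسْلَام'))  # -> اسلام
```

The same loss of information discussed above is visible here: once the Alif variants are merged, words that differed only in the Hamza mark become indistinguishable.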
Unlike English and other languages, there are no silent letters in Arabic. An example
of a silent letter in English is the letter “p” in the word “pneumonia”. There are no new
sounds in Arabic produced by combining two letters. For instance, in English, “c” and “h”
are combined to produce three distinct sounds: the sound at the beginning of “cheese”, the
sound at the beginning of “character”, and the sound at the beginning of “chef.”
It is well known that the process of splitting text into sentences is an essential step
in many Natural Language Processing (NLP) applications. In English, this is a relatively easy task since English sentences start with an uppercase letter and finish with a period.
However, splitting Arabic sentences is not as easy as in English since there is no capital
form for Arabic letters (Chinese, Japanese, and Korean have no capitalization either). In addition, punctuation rules in Arabic are not strict, so many writers do not use punctuation properly. In
fact, Arabic writers excessively use coordinations, subordinations and logical connectives
to conjoin the sentences (Farghaly and Shaalan, 2009). Hence, it is not unusual for an
Arabic article to have a complete paragraph which does not include any periods other than the period at the end of the paragraph. Therefore, Arabic texts must go through complicated preprocessing.
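As a rough illustration of why punctuation is the usual (if unreliable) fallback, the following naive splitter breaks text on Arabic sentence-final punctuation. It is a sketch only; a real segmenter would also consider connectives such as (ﻭ) "and", since writers chain clauses with few periods:

```python
import re

# Split after a period, exclamation mark, Arabic question mark (U+061F),
# or Arabic semicolon (U+061B) followed by whitespace. Naive by design:
# a paragraph with no internal periods comes back as one "sentence".
BOUNDARY = re.compile(r'(?<=[.!\u061F\u061B])\s+')

def split_sentences(text: str) -> list:
    return [s for s in BOUNDARY.split(text) if s]

print(split_sentences('ذهبت إلى السوق. ماذا اشتريت؟ اشتريت كتابا.'))
```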
The lack of capitalization obviously makes it hard to detect named entities (Darwish,
2013) which is an essential part of Information Retrieval (IR). In English, extracting named entities such as cities, names of people, addresses and organizations is done with the help of capitalization and punctuation. For example, to recognize a name like “Barack H. Obama”, a simple algorithm can be used to search for an uppercase word followed by an initial with an optional period followed by an uppercase letter. We are not claiming that NER in English is straightforward or simple in general, but since Arabic does not have these features, new methods must be used to address the problem of named entity recognition
(Darwish, 2013).
2.2.2 Arabic Morphology
Arabic has a very rich and complex morphology (Attia, 2008). Similar to the other
Semitic languages, morphology in Arabic is of two types, derivational and inflectional.
Derivational morphology is the process of creating new words. This is done by mapping a root to a pattern. The root holds the meaning while the pattern changes the structure of the root generating a new word with a different part-of-speech. This type of derivational morphology is called nonlinear morphology (Bhattacharya et al., 2005). On the other hand, inflectional morphology is the process of modifying the words with features to create plural, feminine, or definite forms of the word (Habash, 2010).
A morpheme is "a linguistic form which bears no partial phonetic-semantic resemblance to any other form" (Bloomfield, 1933). Morphological processing in NLP is the process of decomposing a word into morphemes. This is a relatively easy task in concatenative morphology. However, in languages with nonconcatenative morphology, like
Arabic, it is a much harder task. In Arabic, words are built by merging a consonantal root and a vocalism (McCarthy, 1981). The root holds a semantic field while the vocalism specifies the grammatical form. An example showing the nonconcatenative morphology of Arabic would be the word (ﻛﺘﺐ) "katab", which means "to write". It is composed by associating the root (ﻛﺘﺐ) /k-t-b/, which has the meaning of "writing", with a vocalism.
Several approaches have been used to decompose Arabic words. The first approach recovers the root by extracting all prefixes and suffixes from the word, and then stripping the rest of the word using a lexicon of roots (Hlal, 1985). This approach is very common; however, it requires a lexicon of all possible Arabic roots, prefixes, infixes and suffixes
(Beesley, 1996; Shaalan et al., 2006). Buckwalter introduced another approach in his morphological analyzer (BAMA) (Buckwalter, 2004). Rather than recovering the root, BAMA recovers the stem and considers it the main building block for the Arabic word. The stem is recovered by just removing the prefixes and the suffixes. Therefore, BAMA decomposes the Arabic word into three parts: Arabic stems, Arabic prefixes and Arabic suffixes.
The decomposition process searches for the prefixes and the suffixes in the word that satisfy constraints governing the possibility of combining them with the stem in the word. BAMA has a bidirectional transliteration schema from Arabic script to Latin script.
That means that developers can work with unstructured Arabic texts without any Arabic language knowledge. For this reason, many recent statistical ANLP systems use BAMA as the foundation for machine translation and information retrieval. However, BAMA has the limitation of giving a general analysis that includes all possible cases of the word without considering the context of the input text. A more refined result can be obtained using a disambiguation module that considers the context of the input text after eliminating the incorrect analyses (Habash and Rambow, 2005).
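The stem-based decomposition scheme can be illustrated with a short sketch that enumerates every prefix-stem-suffix split of a word and keeps only those licensed by compatibility tables. The tables and transliterated entries below are tiny hand-made stand-ins, not Buckwalter's actual data:

```python
# Illustrative stand-ins for BAMA-style lexicons and compatibility tables.
PREFIXES = {'', 'Al', 'wa', 'waAl'}        # e.g. article "Al", conjunction "wa"
SUFFIXES = {'', 'At', 'h', 'y'}
STEMS = {'kitAb', 'katab', 'ktb'}
COMPAT = {('', ''), ('', 'h'), ('', 'y'), ('Al', ''), ('Al', 'At'),
          ('wa', ''), ('waAl', '')}        # allowed (prefix, suffix) pairs

def analyze(word: str):
    """Return all (prefix, stem, suffix) analyses licensed by the tables."""
    analyses = []
    for i in range(len(word) + 1):
        for j in range(i, len(word) + 1):
            pre, stem, suf = word[:i], word[i:j], word[j:]
            if pre in PREFIXES and stem in STEMS and suf in SUFFIXES \
                    and (pre, suf) in COMPAT:
                analyses.append((pre, stem, suf))
    return analyses

print(analyze('AlkitAb'))   # -> [('Al', 'kitAb', '')]
```

Note that, as the text observes, this produces every licensed analysis of the word in isolation; picking the right one requires a separate context-sensitive disambiguation step.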
Dialectal Arabic differs from MSA morphologically, lexically and phonologically
(Habash et al., 2013). Furthermore, there are no standard orthographies and no language academies in dialectal Arabic. Therefore, the tools and resources designed for MSA do not work with dialectal Arabic. Recently, several research efforts have focused on Arabic dialectal texts (Habash and Rambow, 2005; Habash et al., 2013; Zaidan and Callison-
Burch, 2014). The state-of-the-art dialectal Arabic morphological analyzer is the Columbia
Arabic Language and dialect Morphological Analyzer (CALIMA) (Habash et al., 2013).
Arabic is an agglutinative language, which means that Arabic words usually include
affixes and clitics that represent different parts-of-speech. Let us take the word (ﻛَﺘَﺒﺘُﻪ) "katabtoho", which means "I wrote it". This word is a verb that has the subject and the object attached to it. The subject is the diacritic on the fourth letter (ﺕ) "ta", while the object is the suffix (ـﻪ) "ha". This is just a simple example, whereas words usually have more complex structures that include other clitics to specify the gender, person, number and voice. Hence, due to complex phonological rules, the decomposition of words in Arabic is relatively more difficult in comparison to other languages.
2.2.3 Arabic Syntax
According to (Habash, 2010), there are two kinds of sentences in Arabic: sentences that start with a verb (V-SENT), and sentences that start with a noun (N-SENT). Verb-subject-object (VSO) is the primary structure of a V-SENT sentence in Classical and Modern Standard Arabic. However, the object-verb-subject (OVS) and subject-verb-object (SVO) orders are also commonly used. As we mentioned before, Arabic is a pro-drop language, which means that subjectless sentences are perfectly grammatical in Arabic. Also, unlike English, equational sentences like "He a journalist" are allowed without the need of a "to be" verb. Russian, Hungarian, Hebrew, and Quechuan languages also allow this type of sentence.
In Arabic, a constituent question is usually formed by starting with a wh-phrase. However, it is grammatically correct if the constituent question does not start with the wh-phrase. For example, the question (ﺃﻛﻠﺖ ﻣﺎﺫﺍ ﺑﺎﻷﻣﺲ؟) literally means "you ate what yesterday?". Furthermore, relative clauses in Arabic are connected using relative pronouns. For example, in the sentence (ﺍﺣﺒﺒﺖ ﺍﻟﺒﻴﺖ ﺍﻟﺬﻱ ﺍﺷﺘﺮﻳﺘﻪ) there are two clauses: (ﺃﺣﺒﺒﺖ ﺍﻟﺒﻴﺖ), which means "I liked the house", and (ﺍﻟﺬﻱ ﺍﺷﺘﺮﻳﺘﻪ), which means "which I bought". The two clauses are connected using the relative pronoun (ﺍﻟﺬﻱ), which means "which". An Arabic relative pronoun must agree in number and gender with the noun it modifies in the second clause.
2.3 Summary
In this chapter, we presented a short overview of information processing in Arabic.
We summarized challenges that face developers and researchers when processing Arabic text due to many of its features. The lack of capitalization, dropped subjects, missing short vowels and the nonconcatenative nature are some of these features. In addition, there are many dialects in Arabic, which are used in informal speech and writing. These dialects must be treated differently when processing Arabic texts. Much research has been conducted to address the challenges of Arabic text processing. Some valuable resources and techniques have been presented for Arabic. However, more work needs to be done to give Arabic developers and speakers the support they need.

Chapter 3
LITERATURE REVIEW
In this chapter, we provide a summary of the main existing approaches for creating lexical resources. We focus on two types of lexical resources: wordnets and bilingual dictionaries.
3.1 Automatic Construction of Wordnets
Wordnet is a lexical ontology of words. There are two ways to construct wordnets for languages that do not possess such resources: manual construction and automatic construction. We intend to use automatic construction using core wordnets we have created in our earlier work (Lam et al., 2014a,b, 2015b) and other existing resources that are freely available. Other efforts are underway to manually (or mostly manually) create wordnets in a variety of languages, although progress seems slow all around.
High-quality wordnets have been developed for a small number of languages. Wordnets, other than the Princeton WordNet (Fellbaum, 1998; Miller, 1995), are typically constructed by one of two approaches. The first approach, which is called the expansion approach, translates the PWN to target languages (Akaraputthiporn et al., 2009; Barbu, 2007;
Bilgin et al., 2004; Kaji and Watanabe, 2006; Lam et al., 2014b; Lindén and Niemi, 2014;
Oliver and Climent, 2012; Sagot and Fišer, 2008; Saveski and Trajkovski, 2010). In contrast, the second approach, which is called the merge approach, builds the semantic taxonomy of a wordnet in a target language, and then aligns it with the Princeton WordNet by generating translations (Borin and Forsberg, 2014; Gunawan and Saputra, 2010; Maziarz et al., 2013; Rigau et al., 1998). To construct the taxonomic relations between words,
first definitions of words are retrieved from machine readable dictionaries. Then, a genus disambiguation process, which is the process of finding a word with a broad meaning that more specific words fall under, is performed using the definitions to construct a hierarchical class of concepts. Next, classes are merged with the synsets in the PWN using a bilingual dictionary to form the target wordnet.
The expansion approach dominates the merge approach in popularity. Wordnets generated using the merge approach may have different structures from the Princeton WordNet. In contrast, wordnets created using the expansion approach have the same structure as the Princeton WordNet, which provides for a level of uniformity among them, possibly at the cost of some natural language-specific expressiveness (Leenoi et al., 2008).
Many approaches to construct wordnets are semi-automatic and, therefore, can be used only for languages that have some existing lexical resources. Therefore, any attempt to build wordnets for resource-poor languages using these methods would be doomed from the start. Moreover, while wordnets are always difficult to evaluate, it is even harder to evaluate machine-created wordnets in resource-poor languages because these languages do not have gold standards to compare with, and frequently do not have easily-accessible experts to evaluate such resources.
Crouch (Crouch, 1990) clusters documents using a complete link clustering algorithm and generates thesaurus classes or synonym lists based on user-supplied parameters.
Curran and Moens evaluate the performance and efficiency of thesaurus extraction methods and also propose an approximation method that provides better time complexity with little loss in accuracy (Curran and Moens, 2002a,b). Ramirez and Matsumoto develop a multilingual Japanese-English-Spanish thesaurus using two freely available resources:
Wikipedia and the Princeton WordNet (Ramírez et al., 2013). They extract translation tuples from Wikipedia articles in these languages, disambiguate them by mapping to wordnet senses, and extract a multilingual thesaurus with a total of 25,375 entries. One thing we must note about all these approaches is that they are resource-hungry, requiring a large corpus of Wikipedia or non-Wikipedia documents and wordnets. For example, Lin works with a 64-million word English corpus to produce a high quality thesaurus with about 10,000 entries (Lin, 1998). Ramirez and Matsumoto have the entire Wikipedia at their disposal with millions of articles in three languages, although for experiments they use only about
13,000 articles in total (Ramírez et al., 2013). Furthermore, (Miller and Gurevych, 2014) work with more than 19 thousand Wiktionary senses and 16 thousand Wikipedia articles to produce a three-way alignment of WordNet, Wiktionary, and Wikipedia. When we work with low-resource or endangered languages, we do not have the luxury of collecting such big corpora or accessing even a few thousand articles from Wikipedia or the entire
Web. Many such languages have no or very limited Web presence. As a result, we have to work with whatever limited resources are available.
In this work we introduce approaches to generate synonyms, hypernyms, hyponyms and some other semantic relations. To enhance the quality of the wordnets we create, we present several approaches to measure relatedness between concepts or words. Some potential approaches for measuring semantic relationships are a dictionary-based approach (Kozima and Furugori, 1993) and a thesaurus-based approach (Hirst and St-Onge, 1998).
Oliver presented approaches for constructing wordnets using the expand model (Vossen, 1998) and made them available through a Python toolkit (Oliver, 2014). The authors designed three strategies that use three types of resources to construct wordnets: dictionaries, semantic networks (Navigli and Ponzetto, 2010) and parallel corpora. While the construction approaches using dictionaries and semantic networks were direct, the authors used machine translation and automatic sense-tagging to construct their wordnets from parallel corpora. A toolkit1 provides access to the three construction methods besides access to some freely available lexical resources. To test their dictionary-based approach, the authors constructed wordnets for six languages: Spanish, Catalan, French,
Italian, German and Portuguese with precision between 48.09% and 84.8%. Using their semantic network based approach, the authors constructed wordnets for the six languages with precision between 49.43% and 94.58%. The parallel corpus based approach with machine translation achieved precision between 70.26% and 93.81%, while with auto- matic sense-tagging it achieved between 75.35% and 82.44%. The authors stated that their automatically-calculated precision value is very prone to errors.
Another example of constructing wordnets using dictionary based methods is JAWS
(Mouton and de Chalendar, 2010). JAWS is a French wordnet for nouns constructed by translating wordnet nouns using a bilingual dictionary and a syntactic language model. The construction of JAWS starts with copying the structure (the synsets with no words) of the source wordnet. Then, the phrases that are available in the bilingual dictionary are used to
1http://lpg.uoc.edu/wn-toolkit
fill out the initial synsets. Finally, the language model is used to incrementally add new phrases to JAWS. An improved version of JAWS is called WoNeF (Pradet et al., 2014).
The new improved wordnet includes parts of speech and was evaluated using a gold standard produced by two annotators. In addition, WoNeF uses a better translation selection algorithm that uses machine learning to select variable thresholds for translations.
In (Lam et al., 2014b), we presented three methods to construct wordnet synsets for several resource-rich and resource-poor languages. We used some publicly available wordnets, a machine translator and a single bilingual dictionary. Our algorithms translate synsets of existing wordnets to a target language T, then apply a ranking method on the translation candidates to find best translations in T. The approaches we used are applicable to any language which has at least one existing bilingual dictionary translating from English to it.
In the first approach, which we call the direct translation approach (DR), for each synset in PWN, we directly translate the words from English to the target language. In the second approach, which we call IW, we extract candidates from several intermediate wordnets rather than just using PWN to disambiguate the translation. In the third approach, which we call IWND, we try to reduce the number of bilingual dictionaries we use in the second approach. When the intermediate wordnet is not PWN, we translate the extracted words from the wordnets to English, and then we use a single bilingual dictionary to translate the words from English to the target language. In all of these methods, after extracting the candidates, we use a ranking method to select the best translations and insert them as a synset in the target wordnet.
3.2 Wordnet Management Tools
Library      URL
CICWN        http://fviveros.gelbukh.com/wordnet.html
extJWNL      http://extjwnl.sourceforge.net/
Javatools    http://www.mpi-inf.mpg.de/yago-naga/javatools/
Jawbone      http://sites.google.com/site/mfwallace/jawbone/
JawJaw       http://www.cs.cmu.edu/~hideki/software/jawjaw/
JAWS         http://lyle.smu.edu/~tspell/jaws/
JWI          http://projects.csail.mit.edu/jwi/
JWNL         http://sourceforge.net/apps/mediawiki/jwordnet/
URCS         http://www.cs.rochester.edu/research/cisd/wordnet/
WNJN         http://wnjn.sourceforge.net/
WNPojo       http://wnpojo.sourceforge.net/
WordnetEJB   http://wnejb.sourceforge.net/
Table 3.1. A list of the Java libraries tested in (Finlayson, 2014).
Maintaining wordnets is an important area of research. The manual construction of a wordnet is an intensive process that requires a large number of specialists to work for several years. Furthermore, a wordnet is not static. The meanings of many phrases change through time and new phrases appear every year. For example, the country Sudan was divided into two countries, Sudan and South Sudan, in 2011. If one searches the PWN 3.1 for
Sudan, only the senses corresponding to the old Sudan show up since the new sense has not yet been added. Moreover, the representation of wordnets evolves over time. For example, many old wordnets were upgraded to provide the XML representation. In addition, as this section shows, many wordnets are built based on the PWN. Every time PWN gets updated, these wordnets must be updated also to preserve the alignment with PWN. All the previous issues show the need for wordnet maintenance tools.
One recent work on tools for maintaining wordnets is by (Mladenovic et al., 2014).
The tools are designed to provide upgrade, cleaning, validation, search, import and export functionalities for the Serbian wordnet (Christodoulakis et al., 2002). Another recent work develops a Java library, called JWI, for accessing the PWN and compares it with eleven other libraries (Finlayson, 2014). The comparison between the libraries was based on five features: special requirements, supported similarity metrics, ability to edit the wordnet, whether they work with the Maven project or not, and forward-compatibility with Java. Table 3.1 shows the tested libraries and Table 3.2 shows a summary of the comparison.
Library      Maven  Editing  Standalone  Similarity Metrics  Minimum Java
CICWN        Yes    No       No          No                  1.6
extJWNL      No     No       Yes         Yes                 1.6
Javatools    Yes    Yes      No          No                  1.6
Jawbone      Yes    Yes      No          No                  1.6
JawJaw       Yes    Yes      No          No                  1.5
JAWS         Yes    No       No          No                  1.4
JWI          Yes    Yes      No          No                  1.5
JWNL         No     Yes      No          Yes                 1.4
URCS         Yes    No       No          No                  1.6
WNJN         No     No       No          No                  1.5
WNPojo       No     No       No          No                  1.6
WordnetEJB   No     No       No          No                  1.6
Table 3.2. A comparison between some of the Java libraries for accessing the PWN.
Another wordnet management tool was also presented recently for the IndoWordNet2
(Nagvenkar et al., 2014). The tool, which is called the Concept Space Synset Management
Tool3 (CSS), provides an interactive user interface for creating new language synsets and
linking them to other Indian language wordnets. The CSS tool uses role-based access control to restrict access to the wordnet. Figure 3.1 shows an overview of the CSS tool.

2http://www.cfilt.iitb.ac.in/indowordnet/
3http://indradhanush.unigoa.ac.in/conceptspace
Figure 3.1: An overview of the CSS management tool, adapted from (Nagvenkar et al., 2014)
Sense marking is the process of tagging words with senses in a corpus. It is a necessary
task in preparing training data for machine learning techniques. Since sense marking is an
intensive process, sense marking tools are very handy. For example, the Indian Institute
of Technology Bombay has developed a sense marker tool for the IndoWordNet (Prab-
hugaonkar et al., 2014). The sense marking tool shows a highlighted word in a piece of text
and asks the annotator to choose the most appropriate sense from the available senses. The
tool also allows the annotator to add new senses that do not exist in the wordnet.
3.3 Creating Bilingual Dictionaries
Bilingual dictionaries are essential lexical resources which we use in our approaches.
The majority of low-resource languages have bilingual dictionaries that provide phrase translation between them and rich-resource languages. However, only relatively few bilingual dictionaries are available for translation between low-resource languages. Several methods have been presented to automatically construct such dictionaries between low-resource languages. Since the wordnets we create in this dissertation are aligned with each other, we believe that they can be good resources for phrase translation between languages. In this section, we discuss some methods for automatically creating bilingual dictionaries.
Given two input dictionaries L1-Lp and Lp-L2, a naïve method to create a new bilingual dictionary L1-L2 may use Lp as a pivot in a straightforward transitive approach.
However, if a word has more than one sense, being a polysemous word, this method may introduce incorrect translations. After computing an initial bilingual dictionary, past researchers have used several approaches to mitigate the effect of ambiguity in word senses.
Methods used for disambiguation use wordnet distance between source and target words in some way; they look at dictionary entries in both forward and backward directions and use the amount of overlap to compute disambiguation scores (Ahn and Frampton, 2006; Bond and Ogura, 2008; Gollins and Sanderson, 2001; Lam and Kalita, 2013; Shaw et al., 2013;
Soderland et al., 2010; Tanaka and Umemura, 1994).
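The transitive construction, together with one simple back-translation overlap filter, can be sketched as follows. The dictionaries and the specific filtering heuristic below are illustrative, not the method of any one cited paper:

```python
from collections import defaultdict

# Pivot-based dictionary induction with a reverse-lookup filter: keep a
# candidate pair (w1, w2) only if translating w2 back through the pivot
# reaches w1 again (a simple overlap-based disambiguation heuristic).
def build_pivot_dictionary(d_1p, d_p2, d_2p, d_p1):
    """d_1p: L1 -> pivot words, d_p2: pivot -> L2 words;
    d_2p / d_p1: reverse-direction dictionaries used for the check."""
    result = defaultdict(set)
    for w1, pivots in d_1p.items():
        for p in pivots:
            for w2 in d_p2.get(p, set()):
                # back-translate w2 -> pivot -> L1 and require reaching w1
                back = {b for q in d_2p.get(w2, set())
                          for b in d_p1.get(q, set())}
                if w1 in back:
                    result[w1].add(w2)
    return dict(result)

# Hypothetical toy data: Vietnamese "ngân hàng" via the English pivot "bank"
d_1p = {'ngân hàng': {'bank'}}
d_p2 = {'bank': {'banco', 'orilla'}}          # Spanish: bank / riverbank
d_2p = {'banco': {'bank'}, 'orilla': {'shore', 'edge'}}
d_p1 = {'bank': {'ngân hàng'}, 'shore': {'bờ'}, 'edge': {'cạnh'}}

print(build_pivot_dictionary(d_1p, d_p2, d_2p, d_p1))
# -> {'ngân hàng': {'banco'}}; the wrong sense 'orilla' is filtered out
```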
Researchers have also merged information from several sources such as parallel corpora or comparable corpora (Nerima and Wehrli, 2008; Otero and Campos, 2010) and a wordnet (István and Shoichi, 2009; Lam and Kalita, 2013) to address the ambiguity problem. Some researchers extract bilingual dictionaries directly from monolingual corpora, parallel corpora or comparable corpora using statistical methods (Bouamor et al., 2013;
Brown, 1997; Haghighi et al., 2008; Héja, 2010; Ljubešic´ and Fišer, 2011; Nakov and Ng,
2009; Yu and Tsujii, 2009).
Obviously, the quality and quantity of existing resources strongly affect the accuracies of newly-created dictionaries. For instance, Nerima and Wehrli create new English-
German and English-Italian bilingual dictionaries with 21,600 and 26,834 entries, respectively, from 76,311 entries in an English-French dictionary, 45,492 entries in a German-
French dictionary, and 36,672 entries in a French-Italian dictionary (Nerima and Wehrli,
2008). Given parallel corpora of Lithuanian consisting of 1,765,000 tokens and Hungarian including 2,121,000 tokens, Heja can extract only 2,616 correct translation candidates with accuracy over a certain threshold from 4,025 translation candidates (Héja, 2010). Thus, new bilingual dictionaries created using current approaches have very few entries compared to the size of the input dictionaries. Furthermore, most resource-poor languages do not have any corpora, or even online documents. Some languages have only one very small bilingual dictionary, such as the Karbi-English dictionary of 2,341 words.
In (Lam et al., 2015b), we present approaches to automatically build a large number of new bilingual dictionaries for low-resource languages, especially resource-poor and endangered languages, using a single input bilingual dictionary. Our algorithms produce translations of words in a source language to many target languages using publicly available wordnets and a machine translator (MT). Our approaches may produce any bilingual dictionary as long as one of the two languages is English or has a wordnet linked to the
PWN. Using our approaches and starting with 5 available bilingual dictionaries, we created 48 new bilingual dictionaries. Of these, 30 pairs of languages are not supported by the popular MTs: Google4 and Bing5.
3.4 Summary
In this chapter, we have discussed the existing methods for the automatic construction of wordnets. We have also discussed several tools and systems for managing wordnets. Moreover, we covered some of the approaches for automatically creating bilingual dictionaries.
4http://translate.google.com/
5http://www.bing.com/translator

Chapter 4
AUTOMATICALLY CONSTRUCTING STRUCTURED WORDNETS
The core idea behind a wordnet is to group words which are synonyms, or roughly synonymous, into lexical categories called synsets. Then, semantic relations between these synsets are established in a hierarchical manner. In this chapter, we present a method to automatically construct wordnet semantic relations such as Hypernyms,
Hyponyms, Member Meronyms, Part Meronyms and Part Holonyms using PWN.
4.1 Constructing Core Wordnets
In (Lam et al., 2014b), we introduced an approach, which we refer to as the IWND approach, that creates wordnet synsets with relatively high coverage. As Figure 4.1 shows, in IWND, to create wordnet synsets for a target language T we used existing wordnets and a machine translator (MT) and/or a single bilingual dictionary. First, we extracted every synset in Princeton WordNet (PWN) using the unique offset-POS key, which refers to the offset for a synset with a particular part-of-speech (POS). Notice here that each synset may have one or more words, each of which may be in one or more synsets. Words in a synset have the same sense. Then, we extracted the corresponding synsets for each offset-
POS from existing wordnets linked to PWN, in several languages. Next, we translated the extracted synsets in each language to T to produce synset candidates using MT or a dictionary. Then, we applied a ranking method on these candidates to find the correct words for a specific offset-POS in T.
Figure 4.1: Creating wordnet synsets using the IWND algorithm (Lam et al., 2014b).
The ranking method we used in (Lam et al., 2014b) is based on the occurrence count of a candidate. Specifically, the rank of a word w, the so-called rankw, is computed as below.

rankw = (occurw / numCandidates) * (numDstWordNets / numWordNets)

where:
- numCandidates is the total number of translation candidates of an offset-POS,
- occurw is the occurrence count of the word w among the numCandidates candidates,
- numWordNets is the number of intermediate wordnets used, and
- numDstWordNets is the number of distinct intermediate wordnets that have words translated to the word w in the target language.
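A direct implementation of this ranking is straightforward. In the sketch below, candidates for one offset-POS are (intermediate wordnet, translated word) pairs; the wordnet names in the toy input are hypothetical:

```python
from collections import Counter

def rank_candidates(candidates, num_wordnets):
    """Rank translation candidates for one offset-POS.
    candidates: list of (intermediate_wordnet_name, translated_word) pairs.
    num_wordnets: total number of intermediate wordnets used."""
    counts = Counter(word for _, word in candidates)
    num_candidates = len(candidates)
    ranks = {}
    for word in counts:
        # distinct intermediate wordnets that yielded this translation
        num_dst = len({wn for wn, w in candidates if w == word})
        ranks[word] = (counts[word] / num_candidates) * (num_dst / num_wordnets)
    return ranks

# Hypothetical candidates for one synset from three intermediate wordnets
candidates = [('FinnWordNet', 'kitab'), ('WOLF', 'kitab'),
              ('JapaneseWN', 'kitab'), ('WOLF', 'daftar')]
print(rank_candidates(candidates, num_wordnets=3))
# 'kitab' scores (3/4)*(3/3) = 0.75; 'daftar' scores (1/4)*(1/3) ≈ 0.083
```

The second factor rewards agreement across distinct wordnets, so a translation produced twice by a single wordnet ranks lower than one produced once each by two different wordnets.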
4.2 Constructing Wordnet Semantic Relations
Synsets in a wordnet are linked in a hierarchical fashion. The hierarchy in a wordnet
is established using the super-subordinate relation between synsets. For example, nouns
are linked using hyperonymy, which is a relation between a general synset and a specific
one. An example of a hyperonymy relation is the relation between the synsets {food,
solid_food} and {baked_goods}. The hyperonymy relation is transitive; for example, the
synset {bread}, which is a hyponym of the synset {baked_goods}, is also a hyponym of
the synset {food, solid_food}. Table 4.1 shows the semantic relations available in wordnet
(Wikipedia, 2015).
In (Lam et al., 2014b), we constructed core wordnets, which essentially means that
we created synsets with no connections between them. As Figure 4.2 shows, our goal is to
recover the taxonomy of synsets. To establish the semantic relations between the synsets
Phrase Type  Relation          Definition
Nouns        Hypernyms         Y is a hypernym of X if every X is a (kind of) Y
             Hyponyms          Y is a hyponym of X if every Y is a (kind of) X
             Coordinate terms  Y is a coordinate term of X if X and Y share a hypernym
             Meronyms          Y is a meronym of X if Y is a part of X
             Holonyms          Y is a holonym of X if X is a part of Y
Verbs        Hypernyms         The verb Y is a hypernym of the verb X if the activity X is a (kind of) Y
             Troponyms         The verb Y is a troponym of the verb X if the activity Y is doing X in some manner
             Entailments       The verb Y is entailed by X if by doing X you must be doing Y
             Coordinate terms  Those verbs sharing a common hypernym
Table 4.1. Wordnet semantic relations.
Figure 4.2: Core wordnet mapping to structured wordnet.

we created in (Lam et al., 2014b), we rely on the Princeton WordNet (Fellbaum, 2005) as an intermediate resource.
As Figure 4.3 shows, to construct the links between synsets in our wordnet for language T, we extract each synseti from wordnett and map it to synsetj, which is the corresponding synset in the Princeton WordNet. Then, for each synsetj in the Princeton WordNet, we extract each semantic relation rj and the linked synset synsetk. Next, we check the availability of synsetk in wordnett. Finally, if synsetk is available in wordnett, we add a relation between synseti and synsetk to wordnett.
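The projection procedure above can be sketched as follows. The offset-POS key format and the small PWN relation fragment are illustrative stand-ins; a real implementation would read them from the actual wordnet files:

```python
# Hypothetical PWN fragment, keyed by offset-POS; each entry lists
# (relation name, offset-POS of the linked synset).
PWN_RELATIONS = {
    '00021265-n': [('hyponym', '07555863-n')],   # e.g. {food, solid_food}
    '07555863-n': [('hypernym', '00021265-n'),   # e.g. {baked_goods}
                   ('hyponym', '07679356-n')],
    '07679356-n': [('hypernym', '07555863-n')],  # e.g. {bread}
}

def recover_relations(target_synsets):
    """target_synsets: set of offset-POS keys present in the automatically
    created target wordnet. Keep a PWN relation only if both of its
    endpoints exist in the target wordnet."""
    relations = []
    for key in target_synsets:
        for rel_name, other in PWN_RELATIONS.get(key, []):
            if other in target_synsets:
                relations.append((key, rel_name, other))
    return relations

# The {bread}-like synset is absent from the target wordnet, so its two
# links are dropped, leaving a fragment of the original taxonomy.
print(recover_relations({'00021265-n', '07555863-n'}))
```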
Figure 4.3: Creating wordnet semantic relations using intermediate wordnet.
We notice here that although we used some disambiguation methods when we created
the core wordnets, there still are words that are misplaced. This will cause some false
classification of synset relations. Another challenge is that translation leads to loss of some
information. For example, it is very important to distinguish between classes and instances
in wordnets (Miller and Hristea, 2006). There is no guarantee that an instance will not
be translated into the target language as a class and vice versa. Furthermore, as Figure 4.4 shows, since the core wordnets are automatically created, some synsets may be missing in the target language. This leads to fragmentation in the recovered links. All of these issues must be monitored and addressed to obtain acceptable accuracy.
Figure 4.4: The effect of missing synsets in recovering wordnet semantic relations using intermediate wordnet.
4.3 Experiments and Evaluation
In this section, we generate the semantic relations between synsets in three wordnets:
Arabic, Assamese and Vietnamese. We start by creating the core wordnets using the algorithms
we described in Section 4.1. Table 4.2 shows the result of creating the core wordnets for the
three languages. Next we apply our method, which we presented in Section 4.2, to link the
synsets. The algorithm was able to recover a total of 206,766 relations between the Arabic
Language    Synsets   Coverage  Precision (/4.00)
Arabic       93,383   59.95%    3.82
Assamese    107,616   36.95%    3.78
Vietnamese   55,451   36.20%    3.75
Table 4.2. Size, coverage and precision of the core wordnets we create for Arabic, Assamese and Vietnamese.
Relation        Precision
SimilarTo       75.62%
Hypernym        70.41%
Hyponym         71.23%
MemberMeronym   77.54%
PartHolonym     84.29%
Average         75.82%
Table 4.3. Precision of the semantic relations established for our Arabic wordnet.

synsets, 139,502 relations between the Assamese synsets and 146,172 relations between
the Vietnamese synsets. As Figure 4.5 shows, most of the recovered relations are hyponym
and hypernym relations.
To evaluate our algorithm, we examined the relations recovered for the Arabic wordnet. We asked three Arabic speakers to evaluate a sample of 500 relations. The sample consists of the following relations: 100 “hypernym” relations, 100 “hyponym” relations, 100 “similar to” relations, 100 “MemberMeronym” relations and 100 “PartHolonym” relations. The evaluation used true/false questions, where a true answer gives the relation a score of 1 and a false answer gives it a score of 0.
As Table 4.3 shows, the precision of the algorithm ranged from 70.41%, for the “hypernym” relation, to 84.29%, for the “PartHolonym” relation. The average precision score was 75.82%.
Figure 4.5: Percentage of synset semantic relations recovered for the Arabic, Assamese and Vietnamese wordnets.
4.4 Summary
In this chapter, we presented an approach that automatically constructs semantic relations between synsets in a wordnet. The approach depends on the PWN to establish the links between the synsets. We conducted an experiment to evaluate our algorithm. Our approach produces semantic relations between the Arabic synsets with 75.82% precision.

Chapter 5
ENHANCING AUTOMATIC WORDNET CONSTRUCTION USING WORD
EMBEDDINGS
In the previous chapters, we have shown that a wordnet for a new language, possibly resource-poor, can be constructed automatically by translating wordnets of resource-rich languages. The quality of these constructed wordnets is affected by the quality of the resources, such as dictionaries, and the translation methods used in the construction process.
Recent work shows that vector representation of words (word embeddings) can be used to discover related words in text. In this chapter, we propose a method that performs such similarity computation using word embeddings to improve the quality of automatically constructed wordnets.
5.1 Introduction
It is well known that one way to find semantically related words is to use context as a lead (Firth, 1957; Harris, 1954). Words that share the same neighbors are usually related to each other in some way. For example, consider the two sentences:
“He rides his bike to the park every day” and
“He rides his bicycle to the park every day”.
One can conclude that the words “bike” and “bicycle” are similar or semantically related since they appear in similar contexts. This observation led researchers to what are called distributional methods, which are now widely used. In these methods, also known as vector semantics and word embeddings, co-occurrences of the words in a corpus are represented as vectors in a multidimensional space, forming a word-word matrix (Jurafsky and Martin, 2016).
Since a corpus contains a large number of distinct words, these vectors are usually long and sparse. The sparseness arises because a word typically co-occurs with only a limited number of other words in a given corpus. For this reason, special algorithms are used to process and store these sparse vectors. Usually, the co-occurrence counts for a word are limited to a specific window of words before and after it. According to (Jurafsky and Martin, 2016), there are two types of co-occurrence: first-order co-occurrence and second-order co-occurrence. The first type describes words that appear next to each other, while in the second type the words share similar surrounding words.
In order to reduce the effect of stop words, i.e., words that co-occur with most of
the words, usually the pointwise mutual information measure (PMI) (Fano and Hawkins,
1961) is used rather than raw co-occurrence counts. This measure compares the probability of two words co-occurring with the probability that they would occur together by chance alone.
The PMI between two words w1 and w2 is
PMI(w_1, w_2) = log_2 [ P(w_1, w_2) / (P(w_1) P(w_2)) ],    (5.1)
where P(w_1) is the probability of word w_1,
P(w_2) is the probability of word w_2, and
P(w_1, w_2) is the probability of w_1 occurring in the context of w_2.
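As a small illustration, Equation 5.1 can be computed directly from estimated probabilities. This minimal sketch is not part of the dissertation's code; the probability arguments are assumed to be estimated elsewhere from corpus counts.

```python
import math

def pmi(p_w1, p_w2, p_w1_w2):
    """Pointwise mutual information (Equation 5.1): log2 of the observed
    co-occurrence probability over the probability expected if the two
    words occurred together by chance alone."""
    return math.log2(p_w1_w2 / (p_w1 * p_w2))
```

Independent words yield a PMI of 0, and words co-occurring twice as often as chance yield a PMI of 1.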
5.2 Similarity Metrics
There are many ways to compute similarity between vectors (Jurafsky and Martin, 2016). We list three common metrics used to measure similarity or relatedness between two vectors A and B of size N.
• Cosine Similarity: It is the most common measure used in natural language processing. It produces similarity values from 0 to 1. When using raw co-occurrences or PMI, words with cosine similarity near 1 are supposedly very similar, and words with cosine similarity near 0 are supposedly unrelated. Cosine similarity is computed using the following formula:

  cosine(A, B) = sum_{i=1}^{N} A_i B_i / ( sqrt(sum_{i=1}^{N} A_i^2) sqrt(sum_{i=1}^{N} B_i^2) ).    (5.2)
• Jaccard Measure: It was introduced by (Jaccard, 1912) and adapted by (Grefenstette, 2012) for use with vectors. The Jaccard similarity is computed using the following formula:

  Jaccard_sim(A, B) = sum_{i=1}^{N} min(A_i, B_i) / sum_{i=1}^{N} max(A_i, B_i).    (5.3)
• Dice Measure: It was originally used with binary vectors and was adapted by (Curran, 2004) for semantic similarity. The Dice similarity is computed using the following equation:

  Dice_sim(A, B) = 2 sum_{i=1}^{N} min(A_i, B_i) / sum_{i=1}^{N} (A_i + B_i).    (5.4)
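The three metrics follow directly from Equations 5.2–5.4. The sketch below operates on plain Python lists for clarity; a real system would use vectorized arrays.

```python
import math

def cosine(a, b):
    """Cosine similarity (Equation 5.2)."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def jaccard(a, b):
    """Weighted Jaccard similarity (Equation 5.3)."""
    return (sum(min(x, y) for x, y in zip(a, b)) /
            sum(max(x, y) for x, y in zip(a, b)))

def dice(a, b):
    """Dice similarity (Equation 5.4)."""
    return (2 * sum(min(x, y) for x, y in zip(a, b)) /
            sum(x + y for x, y in zip(a, b)))
```

All three return 1.0 for identical non-zero vectors and 0.0 for vectors with disjoint support.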
5.3 Generating Word Embeddings
In order to validate the synsets we create using translation and to obtain relations between them, we use the word2vec algorithm (Mikolov et al., 2013) to generate word representations from an existing corpus. The word2vec algorithm uses a feedforward neural network to predict the vector representation of words within a multi-dimensional language model. Word2vec has two variations: Skip-Gram (SG) and Continuous Bag-Of-Words (CBOW). In the SG version, the neural network predicts the words adjacent to a given word on either side, while in the CBOW model the network predicts the word in the middle of a given sequence of words. In the work presented in this section, we generate representations of words using both models with several different vector and window sizes to find the settings with the highest precision. The purpose of the steps discussed next is to improve the quality of the synsets produced by the translation process, in addition to generating relations among the synsets.
5.4 Removing Irrelevant Words in Synsets
We compute the cosine similarity between word vectors within each single synset in TWN, the wordnet being constructed in language T, to filter out false word members within synsets. To filter the initially constructed synsets in TWN, we pick a threshold value α such that the selected words have cosine similarity larger than α with each other. We describe the proposed filtering process below.
1. Let

   synset_i^c = {word_1, word_2, word_3, word_4}    (5.5)

   be a candidate synset to be potentially included in TWN.

2. Compute the cosine similarity between all possible pairs of words in synset_i^c.

3. Extract the pair of words with the highest cosine similarity.

4. If this pair of words has a cosine similarity larger than α, keep the pair in the final synset synset_i; otherwise, discard synset_i^c itself, as it may have been a low-quality candidate synset generated in the translation process.

5. Next, among the remaining words in synset_i^c, keep a word if it has a connection with any word in synset_i with similarity higher than α.

For example, let us assume that the cosine similarities between the words in synset_i^c are as shown in Table 5.1 and α = 0.70. First, the pair with the highest cosine similarity,
(word_1, word_2), is kept in the final synset_i since its cosine similarity is larger than α. Then, word_3 is discarded since it does not have a cosine similarity larger than α with any of the
Pair               Cosine Similarity
(word_1, word_2)   0.91
(word_1, word_3)   0.22
(word_1, word_4)   0.82
(word_2, word_3)   0.34
(word_2, word_4)   0.72
(word_3, word_4)   0.12

Table 5.1. An example of cosine similarity between words in a candidate synset.
words in the current final synset_i. Finally, word_4 is kept in synset_i since it has a cosine similarity with word_1 that satisfies the threshold α.
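The five filtering steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the actual implementation: `similarity(w1, w2)` stands in for a cosine-similarity lookup against the word2vec model and is assumed symmetric.

```python
from itertools import combinations

def filter_synset(words, similarity, alpha):
    """Sketch of the filtering procedure of Section 5.4.
    words: the candidate synset's members; alpha: the threshold.
    Returns the filtered synset, or None when the whole candidate
    synset is discarded."""
    if len(words) < 2:
        return None
    # Steps 2-3: find the pair with the highest cosine similarity.
    best_pair = max(combinations(words, 2), key=lambda p: similarity(*p))
    # Step 4: discard the candidate if even its best pair is below alpha.
    if similarity(*best_pair) <= alpha:
        return None
    kept = set(best_pair)
    # Step 5: keep any remaining word connected to the kept set above alpha.
    for w in words:
        if w not in kept and any(similarity(w, k) > alpha for k in kept):
            kept.add(w)
    return kept
```

Running this on the similarities of Table 5.1 with α = 0.70 keeps word_1, word_2 and word_4 and drops word_3, matching the worked example above.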
5.5 Validating Candidate Relations
Similarly, we compute the cosine similarity between words within pairs of semantically related synsets. This allows us to verify the constructed relations between synsets in TWN. For example, let

synset_i = {word_i1, word_i2, word_i3, word_i4}, and
synset_j = {word_j1, word_j2, word_j3, word_j4}

be synsets in TWN, and let ρ_ij be a candidate semantic relation between synset_i and synset_j. We compute the cosine similarity between all possible pairs of words from synset_i and synset_j and take the maximum similarity obtained. Then, if this value is larger than a threshold α_ρ, we retain the relation ρ_ij; otherwise, we discard it. The pseudocode of the validation algorithm is shown in Algorithm 1.
Algorithm 1: Validating Semantic Relation
  Data: synset_i, synset_j, relation ρ_ij, threshold α_ρ
  Result: retain or discard the relation ρ_ij
  Similarity_max ← 0;
  foreach word_i in synset_i do
      foreach word_j in synset_j do
          sim ← ComputeCosineSimilarity(word_i, word_j);
          if sim > Similarity_max then
              Similarity_max ← sim;
          end
      end
  end
  if Similarity_max < α_ρ then
      Discard(ρ_ij);
  end
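Algorithm 1 reduces to a single maximum over word pairs. The sketch below is illustrative, with `similarity` standing in for ComputeCosineSimilarity over the language model.

```python
def validate_relation(synset_i, synset_j, similarity, alpha_rho):
    """Sketch of Algorithm 1: a candidate relation between synset_i and
    synset_j is retained only when the best word-to-word cosine
    similarity across the two synsets reaches the threshold alpha_rho."""
    max_sim = max(similarity(wi, wj) for wi in synset_i for wj in synset_j)
    return max_sim >= alpha_rho  # True: retain; False: discard
```

A single strong word pair is enough to keep the relation, which mirrors the use of the maximum rather than the average similarity in Algorithm 1.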
5.6 Selecting Thresholds
To pick the synset similarity threshold α and the threshold α_ρ for each semantic relation we create, we compute the cosine similarity between pairs of synonym words, semantically related words, and non-related words obtained from existing wordnets. Then, based on these data, we select the threshold values associated with high precision and maximum coverage.
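The histogram step behind this threshold selection can be sketched as follows. The bin count and the assumption that similarities lie in [0, 1] are illustrative choices, not taken from the dissertation.

```python
def similarity_histogram(similarities, bins=10):
    """Bucket a list of cosine similarities (assumed in [0, 1]) into
    equal-width bins and return the fraction of pairs in each bin.
    Comparing the histograms of synonym, related and non-related pairs
    suggests where to place a threshold."""
    counts = [0] * bins
    for s in similarities:
        counts[min(int(s * bins), bins - 1)] += 1
    total = len(similarities)
    return [c / total for c in counts]
```

Plotting one such histogram per word-pair category (as in Figure 5.1) makes the separation between non-related and related pairs visible.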
5.7 Experiments
In this section, we discuss the enhancement of the Arabic, Assamese and Vietnamese wordnets we create, using the method described in the previous sections.
5.7.1 Generating Vector Representations of Wordnets Words
For generating vector representations of Arabic words, we use the following freely available corpora:
• Watan-2004 corpus (12 million words) (Abbas et al., 2011),
• Khaleej-2004 corpus (3 million words) (Abbas and Smaili, 2005), and
• 21 million words of Wikipedia1 Arabic articles.

We process and combine the three corpora into a single plain text file. For both Assamese and Vietnamese, we used Wikipedia articles to generate the vector representations for words. The Assamese Wikipedia articles we used contain 1.4 million words, while the Vietnamese articles contain 80 million words.
Figure 5.1: A histogram of synonyms, semantically related words, and non-related words extracted from AWN.
In order to compute the synset similarity threshold α and the threshold α_ρ for each semantic relation, we use the freely available Arabic wordnet (AWN) (Rodríguez et al., 2008). AWN was manually constructed in 2006 and has been semi-automatically enhanced and extended several times. We start by extracting synonym words, semantically related words, and non-related words from AWN. The Python program that we wrote to
1 https://ar.wikipedia.org
Relation            Weighted Average Similarity
Synonyms            0.28
Hypernyms           0.22
TopicDomains        0.23
PartHolonyms        0.28
InstanceHypernyms   0.08
MemberMeronyms      0.29

Table 5.2. The weighted average similarity between related words in AWN.

compute the cosine similarity between the words is listed in Appendix B.1. Then, we use the histogram representation of the cosine similarity of the previous sets of words to set the thresholds. As Figure 5.1 shows, more than 67% of the non-related words have cosine similarity less than 0.1, while about 23% of the synonym words in AWN have a cosine similarity less than 0.1. Furthermore, about 34% of the semantically related words in AWN have cosine similarity less than 0.1. Table 5.2 shows the weighted average cosine similarity between synonyms, hypernyms, topic-domain related words, part-holonyms, instance-hypernyms, and member-meronyms in AWN, where the frequency of each similarity value is the weight.
5.7.2 Producing Word Embeddings for Arabic
In this part of the experiment, we use the word2vec algorithm to produce vector representations of Arabic words. We test the word2vec algorithm with different window sizes to select the window size that produces the highest similarity. We generate word embeddings using the CBOW version with window sizes 3, 5 and 8. Next, we compute the weighted averages of the cosine similarity between the synonyms in AWN. The highest weighted average we obtained was 0.288 with window size 3, while the weighted averages obtained with window sizes 5 and 8 were 0.283 and 0.277, respectively. Then, we compare the SG and the CBOW approaches with different vector sizes. Table 5.3 shows the weighted average cosine similarity obtained between 16,000 pairs of synonyms in AWN using both
Algorithm   Vector Size   Similarity Average
SG          100           0.289
SG          200           0.258
SG          500           0.194
CBOW        100           0.288
CBOW        200           0.259
CBOW        500           0.195
Table 5.3. Comparison between the weighted similarity averages obtained using different word2vec settings.
Threshold   AWN     Our Arabic WordNet
0.000       5,941   17,349
0.100       3,433   2,073
0.288       2,471   943
0.500       1,190   271
0.750       209     13
Table 5.4. Comparison between the number of synsets in AWN and our Arabic wordnet using different threshold values.
variations of word2vec, with window size 3 and vector size set to 100, 200, and 500. We notice that both versions produce very similar results, with a slight advantage to SG at the cost of more execution time. However, for the corpus we use, a smaller vector size produces better precision.
5.8 Evaluation and Discussion
We compute cosine similarity between semantically related words extracted from our initial Arabic, Assamese and Vietnamese wordnets produced in the previous chapter. The language model used to calculate the cosine similarity is created using CBOW with vector size 100 and window size 3. Table 5.4 shows a comparison between the number of Arabic synsets we create and the number of synsets in AWN. We notice that the translation method we use produces a high number of synsets compared to the manually constructed AWN. However, the number of synsets decreases sharply after filtering the initial synonyms using the method described in Section 5.4. Although
                 Threshold Range
Relation         0–0.1    0.1–0.288    0.288–1
Synonyms         34.8%    56.8%        78.4%
Hypernyms        45.2%    57.2%        84.4%
PartHolonym      50.8%    75.2%        90.4%
MemberMeronym    40.8%    56.8%        79.6%
Overall          42.9%    61.5%        83.2%
Table 5.5. Precision of the Arabic wordnet we create.
                 Threshold Range
Relation         0–0.1    0.1–0.288    0.288–1
Synonyms         52.0%    57.6%        88.0%
Hypernyms        37.6%    49.6%        76.0%
PartHolonym      51.2%    46.4%        82.4%
MemberMeronym    62.4%    67.2%        81.6%
Overall          50.8%    55.2%        82.0%

Table 5.6. Precision of the Assamese wordnet we create.

our Arabic wordnet is automatically created, the number of synsets we create is 60% of the number of synsets in the manually created AWN when filtering the synsets using α = 0.1. We evaluate precision by comparing 600 pairs of synonyms, hypernyms, part-holonyms, and member-meronyms within three ranges of cosine similarity values: 0 to 0.1, 0.1 to 0.288, and 0.288 to 1. We asked 3 Arabic speakers to evaluate the pairs on a 0 to 5 scale, where 0 represents the minimum score and 5 represents the maximum score. We compute precision by taking the average score and converting it to a percentage. See Table 5.5.
                 Threshold Range
Relation         0–0.1    0.1–0.288    0.288–1
Synonyms         31.2%    40.2%        57.6%
Hypernyms        31.8%    39.0%        69.4%
PartHolonym      32.2%    42.8%        75.0%
MemberMeronym    22.0%    24.0%        73.8%
Overall          29.3%    36.5%        68.95%

Table 5.7. Precision of the Vietnamese wordnet we create.
Table 5.8. Examples of related words and their cosine similarity from our Arabic wordnet.
The precision of the synonyms, hypernyms, part-holonyms, and member-meronyms we produce is 78.4%, 84.4%, 90.4%, and 79.6% respectively, with the threshold set to 0.288. This is higher than the precision obtained by (Lam et al., 2014b), which produces synonyms with 76.4% precision when using just PWN. The precision of the Assamese and Vietnamese wordnets is shown in Tables 5.6 and 5.7, respectively. As shown in Tables 5.8, 5.9 and 5.10, our results suggest that producing synsets with lower precision reduces the quality of the other created semantic relations. Our results show that pairs with higher cosine similarity are more likely to be semantically related. This confirms the benefit of combining the translation method with word embeddings in the process of automatically generating new wordnets.
5.9 Summary
In this chapter, we discuss an approach for enhancing the automatically generated wordnets we create for low-resource languages. Our approach takes advantage of word embeddings to enhance the translation method for automatic wordnet creation. We present
Table 5.9. Examples of related words and their cosine similarity from our Assamese wordnet.
Table 5.10. Examples of related words and their cosine similarity from our Vietnamese wordnet.

an application of our approach to producing a new Arabic wordnet. Our method automatically produces Arabic synonyms with 78.4% precision and semantically related pairs of words with up to 90.4% precision.
Acknowledgment. This chapter is based on the paper “Enhancing Automatic Wordnet Construction Using Word Embeddings” (Al tarouti and Kalita, 2016), written in collaboration with Jugal Kalita, which appeared in the Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP, San Diego, USA, June 2016. Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Chapter 6

SELECTING GLOSSES FOR WORDNET SYNSETS USING WORD EMBEDDINGS
Word embeddings provide a way to represent words as vectors in a multi-dimensional space such that related words are represented by vectors with similar directions. It has been shown that this model can be used to discover relations between words effectively. In this chapter, we introduce a method to represent wordnet synsets in a similar way. A wordnet synset is a group of synonym words grouped together because they all represent the same concept. Our proposed method can be used in several NLP applications such as word-sense disambiguation and automatic wordnet construction. To test our method, we use it in the task of selecting glosses for wordnet synsets of several languages.

6.1 Literature Review
Several methods have been introduced to produce vector representations of meanings. Clustering is one technique commonly used to separate the vector of a multi-sense word into several vectors that represent the senses of the word. For example, (Neelakantan et al., 2015) modified the skip-gram version of the word2vec algorithm to produce multiple embeddings per word. In this work, the senses of a word are learned online by creating clusters of the contexts of the word. When a new context of the word starts to appear far away from the center of the known contexts, a new vector is created for the new context. A global context-aware neural model was presented by (Huang et al., 2012) to learn the context vectors of words using both local and global context. To evaluate their neural architecture, the authors produced a new dataset that provides similarity judgments, based on human ratings, between words within specific contexts.
Other techniques for producing sense vector representations are based on ontologies. For example, (Chen et al., 2014) modified the objective of the skip-gram model of the word2vec algorithm to assign vector representations to synsets based on their glosses. The work also presented two word-sense-disambiguation algorithms based on the sense vectors. Another approach to learning synset embeddings was introduced by (Rothe and Schütze, 2015). The approach, called AutoExtend, is a neural-network-based learning model. It includes hidden layers for both synset lexemes and embeddings. Foley and Kalita (Foley and Kalita, 2016) compared several models that use WordNet to create sense vectors. They also presented an approach, called the hyponym tree propagation model (HTP), that uses a vector space model (VSM) to produce sense vectors.
6.2 Creating Language Model Using Word Embeddings
We start by creating word embeddings using a corpus and the word2vec software (Mikolov et al., 2013). word2vec is a two-layer feedforward neural-network learning model that produces multi-dimensional vector representations of words. There are two implementations of this learning model: Skip-Gram (SG) implementation and Continuous Bag-Of-Words (CBOW) implementation. In the SG implementation, the model learns the words around a given word, while in the CBOW implementation the model learns the word within a given sequence of words.
6.3 Generating Vector Representation of Wordnet Synsets
In this section, we present our method to produce vector representations of wordnet synsets. We build our method on the vectors of the synonym words produced by the word embedding method. We believe that combining the vectors of synonym words into one vector can produce a way to represent meaning. Next, we describe our proposed method to build the vector representation of synsets, which we call synset2vec. Let
Synset Key   Gloss                                             Synonyms
00076884-n   a sudden drop from an upright position            {spill, tumble, fall}
00329619-n   the act of allowing a fluid to escape             {spill, spillage, release}
04277034-n   a channel that carries excess water over or       {spill, spillway, wasteweir}
             around a dam or other obstruction
15049594-n   liquid that is spilled                            {spill}

Table 6.1. Meanings of the noun “spill” and its synonyms.
synset_i = {word_1, word_2, ..., word_j} be a synset in wordnet_x,
{n_1, n_2, ..., n_j} be the number of synsets for each word in synset_i, and
{V_1, V_2, ..., V_j} be the set of corresponding vectors for {word_1, word_2, ..., word_j} in the word embedding model.
We identify two cases:
1. The first case is when a word that does not have any synonyms represents several synsets, i.e., has more than one meaning. In this case, the vector produced by the word embedding actually represents the combined meanings of the word. For example, in PWN, the word “abduction” is the only word in both synset 00775460-n, “the criminal act of capturing and carrying away by force a family member”, and synset 00333037-n, “moving of a body part away from the central axis of the body”. Hence, the vector for “abduction” actually represents both meanings.
2. The second case is when a word that has one or more synonyms has one or more meanings. In this case, the synonyms may or may not have other meanings as well. For example, the noun “spill” has four meanings in PWN and six synonyms. Table 6.1 shows all the meanings of the noun “spill” and all its synonyms in PWN.
Obviously, to generate a combined vector for a synset, we need a way to limit the effect of the other meanings that the synonyms might hold. To do so, we start with the second case, where the synsets have more than one word. In this case, we normalize the vector of each word by dividing its coordinates by the number of synsets that the word belongs to. This reduces the noise, caused by the other meanings a word can hold, when generating the synset vector. We define the vector of synset_i (V_si) as follows:

V_si = (1/j) (V_1 · (1/n_1) + V_2 · (1/n_2) + ... + V_j · (1/n_j)).

Figure 6.1 shows an example of creating a vector for the synset 00076884-n, which includes three words: spill, tumble and fall.
Figure 6.1: An example of creating a vector for a wordnet synset that includes more than one word.
Next, we produce vectors for the synsets that share a single word, i.e., words that do not have any synonyms and have more than one meaning. In this case, for each synset, we produce the synset vector by combining the word vector with the vector of a word in a related synset, e.g., a hypernym, a hyponym, or a meronym. For example, let synset_i and synset_k be synsets that both include the same single word w, let h_1 be a word from the hypernym of synset_i, and let h_2 be a word from the hypernym of synset_k. We define the vector of synset_i (V_si) as follows:

V_si = (1/2) (V_w · (1/n_w) + V_h1 · (1/n_h1)).

Similarly, we define the vector of synset_k (V_sk) as follows:

V_sk = (1/2) (V_w · (1/n_w) + V_h2 · (1/n_h2)).

Figure 6.2 shows an example of creating vectors for the two synsets of the word “abduction”. In Appendix B.2 we list a Python implementation of the procedure.
Figure 6.2: An example of creating vectors for wordnet synsets that share a single word.
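Both cases reduce to the same normalized average, so they can be sketched with one function (the dissertation's own implementation appears in Appendix B.2). Plain Python lists stand in for the word2vec vectors here.

```python
def synset_vector(word_vectors, sense_counts):
    """Sketch of synset2vec (Section 6.3): average the member word
    vectors after dividing each by the number of synsets its word
    belongs to.  The single-word case of the text is the same formula
    applied to [V_w, V_h] with counts [n_w, n_h], where V_h is the
    vector of a word from a related synset."""
    j = len(word_vectors)
    dim = len(word_vectors[0])
    return [sum(v[k] / n for v, n in zip(word_vectors, sense_counts)) / j
            for k in range(dim)]
```

Dividing by the sense count down-weights highly ambiguous members, so a word with many meanings pulls the synset vector less than an unambiguous one.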
6.4 Automatically Selecting a Synset Gloss From a Corpus Using Synset2Vec
In this section, we give one example of the use of our model. We show how our proposed model can be used for the automatic selection of glosses for wordnet synsets. The automatic selection of a synset gloss is a word-sense disambiguation problem. A gloss is a short sentence which is usually manually attached to a synset to clarify the meaning of the synset. This short sentence can be a definition or an example sentence for one of the members of the synset. We test our method using PWN and then apply it to automatically add glosses to the wordnets created in (Lam et al., 2014b).
In the following steps, we present our method to select a gloss for synset_i, as defined in Section 6.3.
• Let G = {g_1, g_2, ..., g_y} be the set of candidate glosses that include a word belonging to synset_i.

• To select the gloss in G closest to synset_i, we generate a vector for each gloss g_z ∈ G. We list a Python function for this step in Appendix B.3.

• Assume that the gloss g_z consists of the words {w_1, w_2, ..., w_d}, that {m_1, m_2, ..., m_d} is the number of synsets for each word in g_z, and that {V_w1, V_w2, ..., V_wd} is the set of corresponding vectors for {w_1, w_2, ..., w_d}.

• We compute the vector of gloss g_z as follows:

  V_gz = (1/d) (V_w1 · (1/m_1) + V_w2 · (1/m_2) + ... + V_wd · (1/m_d)).

• Then, we compute the cosine similarity between the vector of each gloss g_z and V_si. We present a Python implementation for this step in Appendix B.4.

• Finally, we select the gloss with the highest cosine similarity with V_si.
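The scoring steps above can be sketched as follows (the dissertation's own implementations are listed in Appendices B.3 and B.4). Plain Python lists stand in for the embedding vectors; the function names are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two plain-list vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def gloss_vector(word_vectors, sense_counts):
    """Vector of a gloss: average of its word vectors, each first
    divided by the number of synsets the word belongs to."""
    d = len(word_vectors)
    return [sum(v[k] / m for v, m in zip(word_vectors, sense_counts)) / d
            for k in range(len(word_vectors[0]))]

def select_gloss(synset_vec, gloss_vecs):
    """Return the index of the candidate gloss closest to the synset."""
    return max(range(len(gloss_vecs)),
               key=lambda z: cosine(synset_vec, gloss_vecs[z]))
```

The gloss vector uses the same sense-count normalization as the synset vector, so ambiguous gloss words contribute less to the comparison.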
For instance, as shown in Table 6.2, if we consider the word “abduction”, which belongs to two synsets and does not have any synonyms, we notice that our algorithm was able to distinguish between the two meanings and select the right gloss for both synsets.
6.5 Evaluation
In this section, we introduce two forms of evaluation. First, we apply our method to select glosses for the PWN synsets. In this case, we directly compare our results to the manually attached glosses in PWN. Then, we apply our method to attach glosses to wordnet synsets generated by (Lam et al., 2014b). In this case, we ask human judges to evaluate the resulting glosses for three languages: Arabic, Assamese and Vietnamese. 60
Synset Key   Gloss                                                       Cosine Similarity
00333037-n   the criminal act of capturing and carrying away by force    0.172
             a family member
             moving of a body part away from the central axis of the     0.214
             body
00775460-n   the criminal act of capturing and carrying away by force    0.204
             a family member
             moving of a body part away from the central axis of the     0.189
             body

Table 6.2. Cosine similarity between the different synset vectors and glosses of the word “abduction” in PWN.
6.5.1 Using Synset2vec to Select Glosses for PWN Synsets
In order to evaluate our synset vector representation in the task of selecting glosses for wordnets, we use it in the process of gloss selection for PWN synsets. We take advantage of the glosses manually added to the synsets in PWN to automatically measure the precision of our synset representation. The following steps describe the evaluation process of selecting glosses for PWN synsets.
• For each synset_i in PWN, we construct a set of candidate glosses, extracted from PWN using the following method. First, the gloss attached to synset_i in PWN is added to the candidate set. Next, to generate negative glosses for synset_i, we find words that belong to synset_i and to other synsets as well, i.e., words that have the meaning of synset_i and one or more other meanings, and use the glosses of those other synsets. This allows us to examine the ability of the algorithm to differentiate between the different meanings of synsets.
• We randomly select two types of synsets from PWN: synsets that have single words, i.e. synsets that are represented by only single words, and synsets that include multi- ple synonym words.
• We generate the synset vectors using the algorithm we described in Section 6.3. 61
• Next, we generate the gloss vectors using the method we described in Section 6.4.
• Then, we compute the cosine similarity between the vector of synset_i and the vector of each gloss in the candidate set.
• Finally, we select the gloss with the highest cosine similarity.
6.5.2 Using Synset2vec to Select Glosses for Arabic, Assamese and Vietnamese Synsets
In this section, we examine the precision of our method by applying it for the purpose of selecting glosses from corpora to attach to the wordnets we create in the previous chapters. In this experiment, we use the wordnets of the languages Arabic, Assamese and Vietnamese. Next, we describe the steps for evaluating the glosses selected by our method for the synsets of the target languages.
• For each synset_i in the target wordnet wordnet_t, we generate a set of candidate glosses by extracting the set of sentences that include any member of synset_i from the corpora we described in Section 5.7.
• We randomly select two types of synsets from wordnet_t: synsets that have single words, i.e., synsets that are represented by only single words, and synsets that include multiple synonym words.
• We generate the synset vectors using the algorithm we described in Section 6.3.
• Next, we generate vectors for each sentence in the set of candidate glosses using the method we described in Section 6.4.
• Then, we compute the cosine similarity between the vector of synset_i and the vector of each sentence in the candidate set.

• Next, the top 3 sentences with the highest cosine similarity with synset_i are selected.
Synset Type     Number of Synsets   Precision
Single Member   1400                76.5%
Multi Member    600                 79.6%

Table 6.3. The precision of selecting glosses for PWN synsets.
• Finally, 3 native speakers of the target language are asked to evaluate the selected sentences using a 5 point scale.
6.5.3 Results and Discussion
As shown in Table 6.3, we used our algorithm to select glosses for 1400 single-member synsets from PWN. The algorithm achieved 76.5% precision. In addition, we used it to select glosses for 600 multi-member synsets from PWN. The precision was 79.6% in this case. As expected, the precision of selecting glosses is better with multi-member synsets, since multi-member synsets provide more information about the context of the sense.

In the second evaluation, we randomly selected 300 synsets from the Arabic, Assamese and Vietnamese wordnets we created (100 synsets each). For each synset, we extracted all the sentences that included any member of the synset from the corpora. The sentences were sorted according to their cosine similarity with the synset vector and the top 3 sentences were selected.

As shown in Table 6.7, the precision of selecting glosses for the Arabic synsets is 81.4% when selecting the sentences with the highest cosine similarity with the synset vector. Furthermore, the precision of the top 2 and top 3 sentences is 70.4% and 65.8%, respectively. The overall precision of selecting glosses using our method for the Arabic synsets is 72.6%. Table 6.4 shows some examples of glosses we produce for the Arabic synsets along with their cosine similarity values.

The precision of our method for selecting glosses for the Assamese synsets is 85.2% when selecting the sentences with the highest cosine similarity. Moreover, the top 2 and
Table 6.4. Examples of glosses we produce in our Arabic wordnet.
top 3 selected sentences achieved 83.2% and 84.6%, respectively. The overall precision for Assamese glosses is 84.4%. Table 6.5 shows some examples of glosses we produce for the Assamese synsets along with their cosine similarity values.

Table 6.5. Examples of glosses we produce in our Assamese wordnet.

The top Vietnamese glosses selected by our method have 39.4% precision. The top 2 and top 3 Vietnamese glosses selected by our method have 36.6% and 37% precision, respectively. Table 6.6 shows some examples of glosses we produce for the Vietnamese synsets along with their cosine similarity values.

In general, the precision of recently published algorithms for the task of multilingual word-sense disambiguation is around 68.7% (Apidianaki and Von Neumann, 2013), meaning that our algorithm shows reasonably good performance for English, Arabic and Assamese. However, our method performs poorly with Vietnamese. The reason behind the poor results is that white space does not reliably delimit Vietnamese words (Gordon and Grimes, 2005): a single multi-syllable word may itself contain white spaces, so the meaning of most words can change based on the following words. This makes the process of generating the vectors for both the synsets and sentences extremely difficult, since the word2vec algorithm assumes that words are separated by white spaces. The same problem appears in the process of automatically generating bilingual dictionaries for Vietnamese (Lam et al., 2015a). One possible solution to this problem is replacing the white spaces within single Vietnamese words with a special non-white character. This requires the existence of a language dictionary to distinguish the words that include white spaces within them.
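The proposed workaround could be sketched as a greedy longest-match pass that collapses dictionary-listed multi-word lexemes into single tokens before word2vec training. The lexicon contents, the underscore joiner, and the greedy matching strategy below are all assumptions for illustration:

```python
def join_multiword_tokens(sentence, lexicon, joiner="_"):
    # Greedy longest-match sketch: collapse multi-word lexemes listed
    # in a (hypothetical) language dictionary into single tokens, so
    # that word2vec treats each Vietnamese word as one unit.
    tokens = sentence.split()
    max_len = max((len(entry.split()) for entry in lexicon), default=1)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest possible multi-word match starting at i.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                out.append(phrase.replace(" ", joiner))
                i += n
                break
        else:
            # No dictionary entry matched; keep the single token.
            out.append(tokens[i])
            i += 1
    return " ".join(out)
```

For example, with a lexicon containing the two-syllable word "sinh viên" (student), the sentence "tôi là sinh viên" becomes "tôi là sinh_viên", which word2vec then treats as three tokens instead of four.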
6.6 Summary
In this chapter, we presented a new method for selecting synset glosses from a corpus. Our glosses are example sentences that clarify the meaning of the synset. The method can be used for low-resource languages to attach glosses to wordnets constructed automatically. Our method introduces a vector representation for wordnet synsets in a multi-dimensional space. We construct a synset vector by grouping the word embedding vectors of the synonyms in the synset. Our evaluation showed that our method selects glosses with precision up to 84.4%.

Table 6.6. Examples of glosses we produce in our Vietnamese wordnet.

             Precision
Wordnet      Top 1    Top 2    Top 3    Overall
Arabic       81.4%    70.4%    65.8%    72.6%
Assamese     85.2%    83.2%    84.6%    84.4%
Vietnamese   39.4%    36.6%    37.0%    37.6%

Table 6.7. The precision of selecting glosses for Arabic, Assamese and Vietnamese synsets

Chapter 7

LEXBANK: A MULTILINGUAL LEXICAL RESOURCE
Figure 7.1: An overview of LexBank system.
7.1 Introduction
In this chapter, we discuss the design and implementation of LexBank: a system that provides access to the multilingual lexical resources we create in this dissertation. We aim to give public users the ability to access and use the resources that we have created in our project. The system provides wordnet search services for several resource-poor languages, in addition to bilingual dictionary look-up services. The system also receives evaluation and feedback from users to improve the quality of the resources.

As Figure 7.1 shows, the system is divided into three layers: the web interface, the application layer and the database layer. The web interface allows users to log into the system and access the search services. The web interface also provides a control panel that allows administrators to manage the system. The application layer includes all the software required to securely execute the user's requests. The database layer has two databases: the lexical resources database and the system database. The system database stores user information and the system settings. The design of the system allows inclusion of new language resources and easy modification.
7.2 Database Design
LexBank uses two databases: one for storing the system settings and one for storing the lexical resources. We have used Microsoft SQL Server to construct the databases. The SQL code we use to construct the databases is listed in Appendix C. Next, we describe each database in detail.
7.2.1 The System Settings Database
There are two tables in the settings database: Users_Info and System_log. We describe both tables below.
7.2.1.1 Users_Info
The Users_Info table contains information about registered users. The following are the fields contained in the Users_Info table.
• UserId: a unique short alias, selected by the user, that is used to identify the user in the system.
• UserName: the full name of the user.
• UserEmail: the email address of the user.
• UserPwd: the encrypted password used by the user to access the system.
• UserPriv: a text field that determines the privileges the user has. There are two levels of users in the system. The first level is administrator, which has the privileges of managing users and data in the system. The second level is client, which has the privilege of browsing the available resources.
• UserStatus: this field specifies the status of the user. The status can be Active, Inactive or New.
7.2.1.2 System_log
The System_log table keeps records of all user activities in the system. This helps us with maintenance and with tracking the utilization of the system. The following fields are contained in the System_log table.
• EventId: a unique key that is used to identify the event.
• EventDesc: a text description of the event.
• EventTime: the date and time of the event.
• UserId: the identification key of the user who committed the event.
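The two settings tables can be sketched with the following SQLite script. This is an illustration only: the production system uses Microsoft SQL Server (the actual DDL is in Appendix C), and the column types chosen here are assumptions based on the field descriptions above.

```python
import sqlite3

# In-memory database standing in for the system settings database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users_Info (
    UserId     TEXT PRIMARY KEY,   -- short alias chosen by the user
    UserName   TEXT,               -- full name
    UserEmail  TEXT,               -- email address
    UserPwd    TEXT,               -- encrypted password
    UserPriv   TEXT,               -- 'administrator' or 'client'
    UserStatus TEXT                -- 'Active', 'Inactive' or 'New'
);
CREATE TABLE System_log (
    EventId   INTEGER PRIMARY KEY AUTOINCREMENT,  -- unique event key
    EventDesc TEXT,                               -- event description
    EventTime TEXT,                               -- date and time
    UserId    TEXT REFERENCES Users_Info(UserId)  -- who caused it
);
""")
```

New accounts would be inserted with `UserStatus = 'New'` until an administrator activates them, and every user action appends one row to `System_log`.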
7.2.2 The Lexical Resources Database
The lexical resources database contains the resources we produced in this thesis. For each language supported by the system, the database maintains tables for storing the core wordnet, the semantic relations, the wordnet glosses, the evaluation data for the semantic relations and the evaluation data for the wordnet glosses. Next, we describe each table in this database.
7.2.2.1 CoreWordnet
The CoreWordnet table stores the wordnet synsets we created in this thesis. The core wordnet groups synonym words into sets called synsets. In this table, synsets are identified using the offset-pos of the corresponding synset in PWN. In PWN, the offset-pos consists of two parts: a byte offset used to locate the synset in the data file and the part-of-speech of the synset. The following are the fields in the CoreWordnet table.
• offset-pos: the offset-pos of the wordnet synset which is used as an identifier for the synset.
• Member: a word that belongs to the synset.
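The table layout and a FindSynSet()-style lookup can be illustrated with a small sketch. The rows below are illustrative examples, not actual PWN data, and the real implementation queries the database rather than a Python list:

```python
# CoreWordnet modelled as (offset-pos, member) rows; a lexeme that has
# several senses appears under several offset-pos identifiers.
core_wordnet = [
    ("02084071-n", "dog"),
    ("02084071-n", "domestic_dog"),
    ("10114209-n", "dog"),          # a different, illustrative sense
]

def find_synsets(lexeme, rows):
    # FindSynSet()-style lookup: every synset the lexeme belongs to.
    return sorted({offset for offset, member in rows if member == lexeme})

def find_synset_lexemes(offset_pos, rows):
    # FindSynSetLexemes()-style lookup: all members of one synset.
    return [member for offset, member in rows if offset == offset_pos]
```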
7.2.2.2 Sem_Relations
Whereas the synonymy relation is stored in the CoreWordnet table, other semantic relations such as hypernymy and meronymy are stored in the Sem_Relations table. As we described in Section 4.2, semantic relations are directed. Therefore, we maintain the direction by specifying the synsets on the two sides of each relation. The Sem_Relations table contains the following fields.
• Left_offset-pos: this field specifies the offset-pos of the synset on the left side of the relation.
• Relation: a text field that specifies the relation between the left-side and right-side synsets.
• Right_offset-pos: the offset-pos of the synset on the right side of the relation.
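The directed-triple representation can be sketched as follows. The offsets and relation names below are illustrative; the point is that a relation and its inverse are stored as separate rows because direction matters:

```python
# Each row is a (Left_offset-pos, Relation, Right_offset-pos) triple.
# The hypernym edge and its inverse hyponym edge are distinct rows.
sem_relations = [
    ("02084071-n", "hypernym", "02083346-n"),
    ("02083346-n", "hyponym",  "02084071-n"),
]

def relations_of(offset_pos, rows):
    # FindSynSetRelations()-style lookup: every relation in which the
    # synset appears, on either side.
    return [r for r in rows if offset_pos in (r[0], r[2])]
```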
7.2.2.3 WordnetGlosses
The WordnetGlosses table stores the wordnet glosses we generate in Chapter 6. The following are the fields of the WordnetGlosses table.
• offset-pos: the offset-pos of the wordnet synset.
• Gloss: a text field that contains the gloss of the synset.
7.2.2.4 Sem_Relations_Eval_Data
The Sem_Relations_Eval_Data table contains the semantic relations sample data used in the evaluation. This table contains the following fields.
• RelationKey: a unique identification number used to identify the semantic relation being evaluated.
• Left_offset-pos: the offset-pos of the synset on the left side of the relation being evaluated.
• Word1: this field specifies the word on the left side of the relation being evaluated.
• Relation: a text field that specifies the type of relation being evaluated.
• Right_offset-pos: the offset-pos of the synset on the right side of the relation being evaluated.
• Word2: this field specifies the word on the right side of the relation being evaluated.
• COS: the cosine distance, as measured in Section 5.4, between the left word and the right word in the relation being evaluated.
7.2.2.5 Sem_Relations_Eval_Response
The Sem_Relations_Eval_Response table contains the responses collected from evaluators for the semantic relations we produce. This table consists of the following fields.
• AnswerKey: a unique integer that is generated automatically to identify the response.
• RelationKey: the key of the semantic relation being evaluated.
• Score: an integer value from 1 to 5 that represents the score assigned by the evaluator to the semantic relation.
• UserId: identification key of the evaluator who evaluated the response.
7.2.2.6 WordnetGlosses_Eval_Data
The WordnetGlosses_Eval_Data table holds the sample of wordnet glosses being evaluated by the users. The table includes the following fields.
• GlossKey: an automatically generated unique integer used to identify the gloss being evaluated.
• offset-pos: the offset-pos of the wordnet synset.
• Word: the word which is used in the gloss to represent the wordnet synset.
• Sentence: the sentence selected as gloss for this wordnet synset.
• PWNGloss: the English gloss of the corresponding synset in PWN.
• CosSem: the cosine similarity between the selected sentence and the synset as measured in Section 6.4.
• GlossRank: an integer value that represents the rank of the gloss among the other candidate glosses. The rank is assigned by the system to the gloss being evaluated based on the CosSem value. The gloss with the highest CosSem value receives rank 1.
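The rank assignment can be sketched as a sort over one synset's candidate glosses. The dictionary keys mirror the field names above; the values are illustrative:

```python
def assign_gloss_ranks(candidates):
    # GlossRank sketch: order the candidate glosses of one synset by
    # their CosSem value, highest first; the top gloss receives rank 1.
    ordered = sorted(candidates, key=lambda g: g["CosSem"], reverse=True)
    for rank, gloss in enumerate(ordered, start=1):
        gloss["GlossRank"] = rank
    return ordered
```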
7.2.2.7 WordnetGlosses_Eval_Response
Responses from the users for evaluating the wordnet glosses we produced in Section 6.4 are stored in the WordnetGlosses_Eval_Response table. This table consists of the following fields:
• AnswerKey: a unique integer number that is generated automatically to identify the response.
• GlossKey: the key of the gloss being evaluated.
• Score: an integer value from 1 to 5 that represents the score assigned by the evaluator to the gloss.
• UserId: identification key of the evaluator who evaluated the gloss.
7.3 Application Layer
In this section, we describe the main functions provided by LexBank. To maintain simplicity, we implement most of the functions of the system in one utility class (LexBankUtils.cs) written in Microsoft C#. The utility class, which is listed in Appendix D, consists of the following methods.
• IsUserIdAvailable(): takes a UserId and returns true if it has never been used by another user.
• EncryptPassword(): takes a plain text password and returns an encrypted password.
• DecryptPassword(): takes an encrypted password and returns a decrypted password.
• CreateNewUser(): takes the details of a new user and creates an account for him by storing the data in the Users_Info table.
• IsAuthenticated(): takes the user identification and password and returns true if they match the user information in the users table.
• FindSynSet(): takes a lexeme and returns a list of synsets that include this lexeme.
• FindSynSetLexemes(): takes an OffsetPos of a synset and returns the list of lexemes of this synset.
• IsSynSetAvailable(): takes an OffsetPos of a synset in a specific wordnet, and returns true if the synset is available in the specified wordnet.
• FindSynSetRelations(): takes an OffsetPos of a synset and returns all the semantically related lexemes.
• FindGloss(): takes an OffsetPos of a synset and returns the gloss of the synset.
• ReadRelation(): takes a RelationKey and returns the details of the relation.
• ReadSynsetGloss(): takes a GlossKey and returns the details of the gloss.
• EvaluateRelation(): takes RelationKey, Score and UserId and stores them in the eval- uation table of the semantic Relations.
• EvaluateGloss(): takes GlossKey, Score and UserId and stores them in the evaluation table of the wordnet glosses.
• LogEvent(): takes event description and stores it in the System_log table.
• ChangeUserStatus(): takes UserId of a user and changes his status to a specific new status.
• RetrieveUsers(): a method that returns a list of all the users in the system and their information.
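The interplay of EncryptPassword(), DecryptPassword() and IsAuthenticated() can be sketched as follows. The base64 encoding is a toy, reversible stand-in chosen only so the sketch is self-contained; the real utility class uses a proper .NET encryption routine (Appendix D), and this sketch offers no actual security:

```python
import base64

def encrypt_password(plain):
    # Toy stand-in for EncryptPassword(): reversible, NOT secure.
    return base64.b64encode(plain.encode("utf-8")).decode("ascii")

def decrypt_password(encrypted):
    # Counterpart of DecryptPassword().
    return base64.b64decode(encrypted.encode("ascii")).decode("utf-8")

def is_authenticated(user_id, password, users_info):
    # Mirrors IsAuthenticated(): encrypt the submitted password and
    # compare it with the stored encrypted password for that UserId.
    stored = users_info.get(user_id)
    return stored is not None and stored == encrypt_password(password)
```

This mirrors the login flow described in Section 7.4.2: the submitted password is encrypted and compared with the stored encrypted value rather than decrypting the stored one.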
7.4 Web Interface Design and Implementation
In this section, we describe the design of the web interface of LexBank. The web interface is implemented in ASP.NET using Microsoft Visual Studio 2012. Figure 7.2 shows the site map of the web interface. The interface is accessed by the login web page (frmLogin.aspx). New users need to register to gain access to the system. Registration can be done by filling the web registration form (frmRegister.aspx). Once a user logs into the system, the main menu web page (frmMainMenu.aspx) is shown. The main menu includes links to access the services available in the system. In the following sections, we describe each web page in the system.
Figure 7.2: LexBank web site map
7.4.1 Registration Form
New users need to register in the system using the registration form (frmRegister.aspx). As shown in Figure 7.3, a new user needs to provide the full name, email, email confirmation, user identification, password and password confirmation, and then press the Register button. The registration process starts when a new user submits his information through the registration web form. Once the registration form receives the information, it checks whether all the fields meet the requirements of the system. The requirements include a valid format for the email address and the password. The requirements also include that the user identification has never been used before by an existing user. If the information sent by the user passes the validation process, the registration form calls the CreateNewUser() method from the utility class. The CreateNewUser() method uses the EncryptPassword() method to encrypt the password, and then writes the data into the Users_Info table. The registration process is summarized in the sequence diagram shown in Figure 7.4.

Figure 7.3: The registration web form
7.4.2 Log-in Form
Registered users can log in to the system using the login web page (frmLogin.aspx), which is shown in Figure 7.5. A user with an active account needs to provide his user identification and password to start the login process. As shown in Figure 7.6, when the login web form (frmLogin.aspx) receives the userid and the password, it calls the IsAuthenticated() method from the utility class. The password is then encrypted using the EncryptPassword() method and compared with the encrypted password stored in the users table. If the userid and the password provided by the user match the userid and the password stored in the users table, the main menu of the web interface is shown to the user; otherwise, an error message is shown. The main menu is shown in Figure 7.7.

Figure 7.4: Sequence diagram of the registration process

Figure 7.5: The log-in web form

Figure 7.6: Sequence diagram of the login process
7.4.3 The Main Menu
The main menu includes links to access the services available in the system. The services presented by the web interface are given below.
• Searching wordnet using lexeme, provided by the web page (frmWordnetSearch.aspx).
• Searching wordnet using OffsetPos, provided by the web page (frmSynsetDetails.aspx).
• Evaluating semantic relations between synsets, provided by the web page (frmEval- Relations.aspx).
• Evaluating wordnet glosses, provided by the web page (frmEvalGloss.aspx).
• Searching a bilingual dictionary, provided by the web page (frmDictionarySearch.aspx).
Figure 7.7: The main menu
• User management, provided by the web page (frmManageUsers.aspx).
7.4.4 Searching Wordnet By Lexeme Web Form
The web form (frmWordnetSearch.aspx) allows users to search for the synsets of a lexeme in a specific language. As shown in Figure 7.8, this web form consists of the following components.
• A text box used to allow the user to enter a lexeme.
• A drop-down menu to allow the user to select the language.
• A list box for showing the synsets list of the entered lexeme.
• A list box for showing the synonyms of the entered lexeme.
• A list box for showing the related lexemes.
• A button to start the searching process.
The searching process, as shown in Figure 7.9, starts when the user submits a lexeme and a language to the frmWordnetSearch.aspx web form. Then, the method FindSynset() from the utility class is called to retrieve the synsets that include the entered lexeme and show the result in the synsets list. Next, when the user selects a synset from the synsets list, the frmWordnetSearch.aspx web form calls the FindSynsetLexemes() method from the utility class to show the synonyms of the lexeme in the synonym list. It also calls the FindSynsetRelations() method to obtain the related lexemes and show them to the user in the related lexemes list. The user can also expand the details of a synset shown in the synset list and the related lexemes list by double-clicking on the synset OffsetPos. This shows the frmSynsetDetails.aspx web form, which we describe next.

Figure 7.8: The web form for searching wordnet by lexeme. The form shows the result of searching the Arabic lexeme (مصر), which means Egypt.
7.4.5 Searching Wordnet by OffsetPos Web Form
Wordnet search using OffsetPos is provided by the frmSynsetDetails.aspx web form. An example of searching for the synset with the OffsetPos (08897065-n) in our Arabic, Vietnamese and Assamese wordnets using the frmSynsetDetails.aspx web form is shown in Figures 7.10, 7.11 and 7.12, respectively. This web form consists of the following components.

Figure 7.9: Sequence diagram of the process of searching wordnet using lexeme
• A text box for entering the OffsetPos of the synset.
• A drop-down menu to allow the user to select the language.
• A text box for showing the gloss of the synset.
• A text box for showing the English gloss of the synset.
• A list box to show the synonym list of the synset.
• A list box to show the related synsets and lexemes of the entered synset.
• A button to start the search process.
Figure 7.10: The web form for searching wordnet by OffsetPos. The form shows the result of searching the Arabic synset (08897065-n).
In this form, the user starts the process of searching wordnet by submitting the OffsetPos of the synset and the target language to the frmSynsetDetails.aspx web form. The web form calls the FindGloss() method from the utility class to retrieve the gloss of the synset. It also calls the FindSynSetLexemes() and the FindSynSetRelations() methods to obtain the synonym list and related synsets of the input synset and show them in the form.
7.4.6 Evaluating Semantic Relations Between Synsets Web Form
The web form frmEvalRelations.aspx allows users to evaluate semantic relations between lexemes and synsets in the system. The form shows the relation as a sentence and asks the user to rate the correctness of the sentence using a Likert-type scale. The form consists of the following components.
Figure 7.11: The web form for searching wordnet by OffsetPos. The form shows the result of searching the Vietnamese synset (08897065-n).
Figure 7.12: The web form for searching wordnet by OffsetPos. The form shows the result of searching the Assamese synset (08897065-n). The third part meronym in Assamese is wrong. It comes from the verb meaning of "desert", which means to leave without intending to return.
Figure 7.13: Sequence diagram of the process of searching wordnet using OffsetPos.
Figure 7.14: The web form for evaluating semantic relations between synsets in a wordnet. The form shows an example of evaluating a hyponymy relation between two Assamese lexemes, one for radio telegraph and the other for radio.
• A text box showing the relation key.
• A text box showing the relation in the form of a sentence.
• A text box showing the UserId of the evaluator.
• An option box that allows the user to rate the relation.
• A button to submit the score.
• A button to end the evaluation session.
Figure 7.15: Sequence diagram of the process of evaluating the relation between two lexemes.
The evaluation form frmEvalRelations.aspx starts the evaluation process by calling the ReadRelation() method from the utility class to show the relation details to the user. When the user submits the score he assigns to a relation, the evaluation form frmEvalRelations.aspx stores the score by calling the EvaluateRelation() method from the utility class. Then, the evaluation form reads the next relation and shows it to the user. The user can stop the evaluation process by clicking the End Session button. The user can resume the evaluation process at any time without re-evaluating the relations he has already evaluated.
7.4.7 Evaluating Wordnet Synsets Glosses Web Form
Figure 7.16: The web form for evaluating wordnet synsets glosses. The form shows an example of evaluating Arabic synset (13108841-n).
The glosses of the wordnets can be evaluated using the frmEvalGloss.aspx web form. To evaluate a synset gloss, the form attaches the English gloss of the synset obtained from the PWN to the selected gloss in the target language. Then, the user is asked if the lexeme in the selected gloss has the same meaning as in the PWN gloss. This evaluation form is composed of the following components.
• A text box showing the gloss key.
• A text box showing a lexeme from a synset, a candidate gloss written in the target language, and the English gloss of the synset.
• A text box showing the UserId of the evaluator.
• An option box that allows the user to rate the candidate gloss.
• A button to submit the score.
• A button to end the evaluation session.
Figure 7.17: Sequence diagram of the process of evaluating the relation between two lexemes.
The web form frmEvalGloss.aspx starts the evaluation process for glosses by calling the ReadSynsetGloss() method from the utility class to obtain the lexeme, the candidate gloss and the English gloss of the synset being evaluated. Then, the web form uses these data to construct a question for the user. When the user submits the score he assigns to the candidate gloss, the evaluation form stores the score by calling the EvaluateGloss() method from the utility class. Then, the evaluation form reads the next gloss and shows it to the user. The user can stop the gloss evaluation process by clicking the End Session button. The user can resume the gloss evaluation process any time he wishes without re-evaluating the glosses he has already evaluated.
7.4.8 Searching Bilingual Dictionary Web Form
Figure 7.18: The web form for searching a bilingual dictionary. The form shows the result of translating the Arabic word (مصر), which means Egypt, to Assamese.
The web form (frmDictionarySearch.aspx) allows users to use the bilingual dictionaries we create in (Lam et al., 2015b) to translate words between languages. As shown in Figure 7.18, this form consists of the following components.
• A text box used to allow the user to enter a word.
• A drop-down menu to allow the user to select the source language.
• A drop-down menu to allow the user to select the target language.
• A list box for showing the translations list of the entered word.
• A button to start the searching process.
The translation process, as shown in Figure 7.19, starts when the user submits a word, source language, and target language to the frmDictionarySearch.aspx web form. Then, the method Translate() from the utility class is called to retrieve the translation list from the bilingual dictionary and show it to the user.
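The Translate() lookup can be sketched as a mapping keyed by language pair. This structure and the sample entry are assumptions for illustration, not the system's actual schema (the example Arabic word and its English meaning are taken from the figures in this chapter):

```python
# Bilingual dictionaries keyed by (source_lang, target_lang); each maps
# a word to its list of translations. The entry below is illustrative.
bilingual_dicts = {
    ("arabic", "english"): {"مصر": ["Egypt"]},
}

def translate(word, src_lang, tgt_lang, dicts):
    # Translate()-style lookup: unknown pairs or words return an
    # empty translation list rather than raising an error.
    return dicts.get((src_lang, tgt_lang), {}).get(word, [])
```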
Figure 7.19: Sequence diagram of the process of searching a bilingual dictionary.
Figure 7.20: The web form for managing users in LexBank.
7.4.9 Users Management Web Form
The web form frmManageUsers.aspx allows the administrators of LexBank to manage users. Access to this form is restricted to administrators. The form lists all registered users with related information. An administrator can activate the accounts of new users using this form. He can also deactivate any user from the list. This form can be extended in the future by adding more functionality. As shown in Figure 7.20, this form consists of the following components.
• ID: the UserId of the user.
• Name: the full name of the user.
• Email: the email address of the user.
• Privilege: the privilege assigned to the user. This can be administrator or client.
• Status: the current status of the user.
• Change Status: a command link to change the current status of the user. The status of the user can be changed to Inactive or Active.
Figure 7.21: Sequence diagram of the process of managing users in LexBank.
As summarized in the sequence diagram shown in Figure 7.21, an administrator starts the process of user management by trying to access the frmManageUsers.aspx web form. The web form calls the method IsAdmin() from the utility class to verify whether the user is authorized to access the form. If the user is not authorized, an error message is sent to the user. Otherwise, the web form calls the method RetrieveUsers() to obtain the list of registered users in the system. Then, the administrator can select a user from the list and click the change status link to change the current status of the user. The web form calls the ChangeUserStatus() method from the utility class to store the new status and reload the updated users list on the screen.
7.5 Summary
In this chapter, we have described the design and implementation of LexBank, the multilingual lexical resource we produce in this thesis. The architecture of LexBank consists of three layers: the database layer, the application layer and the web interface layer. The database layer consists of two databases: the system settings database and the resource database. The application layer of the system is implemented using Microsoft C#. It provides administrative and resource-access services to the web interface. The web interface is designed and implemented using Microsoft Visual Studio 2012. The interface includes web forms for managing users and provides different wordnet search services in several languages. The system can be easily updated to accommodate other language services and languages.

Chapter 8

CONCLUSIONS
In this chapter, we summarize the main contributions of this dissertation. This dissertation is motivated by the fact that many languages around the world lack the computational lexical resources that are essential in natural language processing. Our first goal in this dissertation was to develop automatic techniques, relying on few available public resources, for constructing wordnets for low-resource languages. A wordnet is a structured lexical ontology that groups words based on their meaning into sets called synsets. A wordnet is a very important lexical resource that is used in many applications, such as translation, word-sense disambiguation, information retrieval and document classification. The second goal of this dissertation was to design and implement a system that makes the lexical resources we produce available to the public. Below, we list the main contributions of this dissertation.
• We have developed an approach for constructing structured wordnets. This approach extends the approach for constructing core wordnets presented by (Lam et al., 2014b). A core wordnet consists only of synsets that group synonym words into sets, each with a unique ID. In a more comprehensive wordnet, these synsets are semantically connected to represent the relations among their meanings. Our approach produces synsets that are connected by semantic relations. Examples of the semantic relations we produced are: synonyms, hypernyms, topic-domain relations, part-holonyms, instance-hypernyms and member-meronyms.
• We presented an approach for enhancing the quality of automatically constructed wordnets. The approach is based on the vector representation of words (word embeddings). Word embeddings are produced by a machine learning technique that maps words to real-valued vectors in a multi-dimensional space. Our approach uses the word2vec algorithm (Mikolov et al., 2013) to generate word representations from an existing corpus. The word2vec algorithm is a feedforward neural network that predicts the vector representation of words within a multi-dimensional language model. Our approach computes the cosine similarity, using these word2vec vectors, between semantically related words in our constructed wordnets and filters out any entries that do not satisfy a pre-selected threshold value.
• We introduced synset2vec, an algorithm for representing wordnet synsets in a multi-dimensional space. Word embeddings provide an excellent vector representation of words. However, the representation is affected by the fact that many words have multiple meanings. In order to represent meanings rather than words, we combine the vectors of a synset's lexemes into a single vector that represents the meaning. We believe that this vector representation can be used in many important applications. For example, it can be used in word-sense disambiguation, machine translation and gloss selection for wordnet synsets.
• We used our synset2vec algorithm to add glosses to our automatically constructed synsets. Glosses are a very important part of wordnets. A gloss is used to declare or clarify the meaning of a synset in a wordnet. A gloss can be a definition statement or an example sentence that shows the usage of the synonyms of the synset. To select a gloss (an example sentence) from a corpus for a synset, we used synset2vec to generate vector representations of the candidate glosses and the synset. We compute the cosine similarity between each candidate gloss and the synset. Finally, we select the gloss with the highest cosine similarity with the synset and attach it to the synset.
• We have developed LexBank, a web application that gives public users access to the resources we created. LexBank provides useful services, in a friendly manner, for users who seek linguistic assistance. It also includes evaluation web forms that are used to gather feedback from human judges. The design of LexBank is flexible, and it can easily be expanded to accommodate additional languages and resources.

Chapter 9

FUTURE WORK
In this chapter, we propose some potential future work that can be performed as an extension of the work presented in this dissertation. The general goal of the proposed future work is to enhance the quality and extend the coverage of the lexical resources we have created. For example, we produced our core wordnets using machine translation and small dictionaries. The quality of these wordnets is limited by the resources we used to create them. It is well known that these resources do not guarantee high coverage and accuracy for all of the target languages. Below, we list some of the potential future work.

9.1 Extending Bilingual Dictionaries
In this section, we propose a new method to extend the bilingual dictionaries created in (Lam et al., 2015b). To increase the coverage of the bilingual dictionaries, we take advantage of the wordnets we have created in this dissertation. This section is divided into two parts. In the first part, we describe the approach we used in (Lam et al., 2015b) to create the bilingual dictionaries. In the second part, we describe the proposed method to extend these dictionaries.
9.1.1 Related Work
In (Lam et al., 2015b) we created a large number of new bilingual dictionaries using intermediate core wordnets and a machine translator. A dictionary, or lexicon, as defined by Landau (1984), consists of a sorted list of 2-tuples; the entries in the dictionaries are of the form <LexicalUnit, Sense1>, <LexicalUnit, Sense2>, .... The approach for creating dictionaries using intermediate wordnets and a machine translator (IW) is described in Figure 9.1 and Algorithm 2.
Figure 9.1: The IW approach for creating a new bilingual dictionary
Suppose that we would like to construct a bilingual dictionary Dict(S,D), where S is a source language and D is a target language, given the dictionary Dict(S,R), where R is a resource-rich intermediate language. The IW algorithm reads each LexicalEntry from Dict(S,R) and extracts SenseR from it. Then, it retrieves all Offset-POSs of SenseR from the wordnet of language R (Algorithm 2, lines 2-5). All the synonyms of the extracted Offset-POSs are extracted from all the available intermediate wordnets. Then, the algorithm constructs a candidate set candidateSet of final translations in language D by translating all the extracted synonyms to language D using machine translation (Algorithm 3). Each candidate in candidateSet has two attributes: word, which represents a translation in language D, and rank, which counts the occurrences of this translation. The rank attribute is used to order the candidates in descending order, where the top candidate is the best translation. Finally, the sorted candidates are inserted into the new dictionary Dict(S,D) (Algorithm 2, lines 8-10).
Algorithm 2: IW algorithm (taken from (Lam et al., 2015b))
Input: Dict(S,R)
Output: Dict(S,D)
1: Dict(S,D) := ϕ
2: for all LexicalEntry ∈ Dict(S,R) do
3:   for all SenseR ∈ LexicalEntry do
4:     candidateSet := ϕ
5:     Find all Offset-POSs of synsets containing SenseR from the R wordnet
6:     candidateSet = FindCandidateSet(Offset-POSs, D)
7:     sort all candidates in descending order based on their rank values
8:     for all candidate ∈ candidateSet do
9:       SenseD = candidate.word
10:      add tuple <LexicalUnit, SenseD> to Dict(S,D)
11:    end for
12:  end for
13: end for
9.1.2 Extending Bilingual Dictionaries Using Structured Wordnets
In this section, we propose a new method to extend the dictionaries we created in (Lam et al., 2015b) using the structured wordnets that we have created in this dissertation.

Algorithm 3: FindCandidateSet(Offset-POSs, D) (taken from (Lam et al., 2015b))
Input: Offset-POSs, D
Output: candidateSet
1: candidateSet := ϕ
2: for all Offset-POS ∈ Offset-POSs do
3:   for all word in the Offset-POS extracted from the intermediate wordnets do
4:     candidate.word = translate(word, D)
5:     candidate.rank++
6:     candidateSet += candidate
7:   end for
8: end for
9: return candidateSet

The following steps, which are summarized in Figure 9.2, describe the proposed method to extend the dictionaries.
Figure 9.2: Extending bilingual dictionaries using structured wordnets
• We start by extracting each input entry Si of the source language S in the bilingual dictionary from S to D.
• Then, we retrieve the list of synsets of Si from the wordnet of S.
• Next, we extract the corresponding synsets from the wordnet of D.
• For each synset member Dk extracted from the wordnet of D, we create a lexical entry (Si, Dk).
• In addition, for each synset extracted from the wordnet of D, we extract the direct hypernyms, and for each hypernym member Hl we also create a lexical entry (Si, Hl).
• Finally, we add every lexical entry created in the previous steps to the bilingual dictionary from S to D if it does not already exist in the dictionary.
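The steps above can be sketched with toy aligned wordnets. Synsets are keyed by a shared PWN Offset-POS; all table names and data below are illustrative stand-ins for the structured wordnets.

```python
wordnet_S = {"auto": ["02958343-n"]}               # S word -> synset ids
wordnet_D = {"02958343-n": ["car", "automobile"]}  # synset id -> D members
hypernyms_D = {"02958343-n": ["vehicle"]}          # synset id -> D hypernym members

def extend_dictionary(dictionary, source_word):
    """Add (source word, member) and (source word, hypernym) entries
    that are not already present."""
    entries = set(dictionary)
    for offset_pos in wordnet_S.get(source_word, []):
        for member in wordnet_D.get(offset_pos, []):
            entries.add((source_word, member))
        for hyper in hypernyms_D.get(offset_pos, []):
            entries.add((source_word, hyper))
    return sorted(entries)

print(extend_dictionary([("auto", "car")], "auto"))
# -> [('auto', 'automobile'), ('auto', 'car'), ('auto', 'vehicle')]
```

Using a set makes the final deduplication step automatic: an entry already in the dictionary is simply not added twice.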
9.2 Integrating Part-of-speech Tagging into Wordnet Construction
Since our approach for automatic wordnet construction is based on translation, some of the generated synsets include words that have the wrong part-of-speech. One solution is to use a part-of-speech tagger (POS tagger) to correct the form of such words in the synset. A POS tagger is a computer program that specifies the part-of-speech of words in a text written in some language. For example, the Stanford Part-Of-Speech Tagger (Toutanova et al., 2003), which is freely available, provides part-of-speech tagging for Arabic, Chinese, French, Spanish and German. POS taggers are available for Assamese (Saharia et al., 2009) and Vietnamese (Le-Hong et al., 2010) as well. However, since we are dealing with low-resource languages, many of our target languages do not have any POS tagger, and this approach is not applicable to them. To correct the part-of-speech of the words within a synset, we propose the following steps.
• For each synset synseti in a wordnet wordnetT, we extract the part-of-speech of the synset from the Offset-POS of synseti.
• For each word wordj in synseti, we find the part-of-speech of wordj and compare it with the part-of-speech of synseti. If the parts-of-speech of wordj and synseti do not match, we convert wordj to the correct part-of-speech form and update synseti.
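A sketch of the proposed correction follows. The tagger and the morphological converter are toy lookup tables (real taggers exist only for some of the target languages), and pos_of_offset assumes an Offset-POS string that ends in the POS letter, e.g. "00697589-v".

```python
def pos_of_offset(offset_pos):
    """Extract the POS letter from an Offset-POS string like '00697589-v'."""
    return offset_pos.rsplit("-", 1)[1]

toy_tagger = {"decision": "n", "decide": "v"}   # hypothetical tagger output
toy_convert = {("decision", "v"): "decide"}     # hypothetical form converter

def fix_synset_pos(offset_pos, words):
    """Replace words whose tagged POS disagrees with the synset's POS."""
    target = pos_of_offset(offset_pos)
    fixed = []
    for w in words:
        if toy_tagger.get(w) != target:
            w = toy_convert.get((w, target), w)  # convert if a form is known
        fixed.append(w)
    return fixed

print(fix_synset_pos("00697589-v", ["decide", "decision"]))
# -> ['decide', 'decide']
```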
9.3 Wordnet Expansion Using Word Embeddings
One possible way to automatically improve the coverage of a wordnet is by looking for additional related words in a corpus using word embeddings. In Chapter 6, we introduced synset2vec, which produces vector representations of synsets in a multi-dimensional space. Taking advantage of synset2vec, we believe it is possible to look for previously unknown words that are semantically related to a synset and include them in the wordnet. Below, we present a brief description of our idea.
• Assume that we would like to expand a wordnet wordnetT of language T. First, word embeddings for T are generated.
• Next, for each synset synseti in wordnetT, the vector Vi for synseti is generated using synset2vec.
• Then, all the words whose cosine similarity with Vi meets a preselected threshold α are extracted. From these words, only the words that do not already have a semantic relation with synseti are inserted into a candidate set Ci.
• Next, for each word wordj in Ci, a semantic relation rj is selected based on a classification approach.
• Finally, wordj is inserted into wordnetT and connected to synseti using the semantic relation rj.
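The candidate-extraction step can be sketched with toy vectors; selecting the semantic relation rj would require a separate classifier, which is not shown.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy word embeddings for language T (illustrative data).
embeddings = {
    "cat":    np.array([1.0, 0.0]),
    "feline": np.array([0.9, 0.1]),
    "tree":   np.array([0.0, 1.0]),
}

def expansion_candidates(synset_vec, related_words, alpha=0.8):
    """Words close to the synset vector that have no relation to it yet."""
    return [w for w, v in embeddings.items()
            if w not in related_words and cosine(synset_vec, v) >= alpha]

print(expansion_candidates(np.array([1.0, 0.0]), {"cat"}))
# -> ['feline']
```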
9.4 Producing Vector Representation for Multi-word Lexemes
One issue that appears when producing vector representations is that wordnet lexemes can be multi-word phrases. Most of the existing tools for producing word embeddings consider single words only; that is, they produce vectors for lexical units that are surrounded by spaces. Therefore, when we generate a vector for a wordnet synset, we currently skip multi-word lexemes. An enhanced version of our approach for generating vectors for wordnet synsets could include a vector representation for multi-word lexemes. The vectors of the single words within a multi-word lexeme should be aggregated so that the lexeme contributes one vector to the synset. However, one issue that arises is that each single word within the multi-word lexeme may have several meanings when it appears individually. Therefore, careful research is needed to determine a good solution for this problem.
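A naive aggregation that ignores the polysemy problem discussed above would simply average the component word vectors (toy two-dimensional embeddings and illustrative words):

```python
import numpy as np

embeddings = {
    "hot": np.array([1.0, 0.0]),
    "dog": np.array([0.0, 1.0]),
}

def multiword_vector(lexeme):
    """Average the component word vectors of a space-separated lexeme."""
    vecs = [embeddings[w] for w in lexeme.split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else None

print(multiword_vector("hot dog"))
# -> [0.5 0.5]
```

As the text notes, a sense-aware weighting of the components would likely be needed in practice.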
9.5 Vector Representation for Multi-lingual Wordnets
In this dissertation, we produced vector representations for individual wordnets. One line of work that might help in problems such as wordnet expansion and machine translation is obtaining vector representations over the aggregated wordnets of several languages. Since all of the wordnets we create in this dissertation are aligned with PWN, synsets having the same Offset-POS in different wordnets actually represent the same meaning. Therefore, we believe that combining the vectors of aligned synsets from different languages will produce a representation of the meaning across several languages. One can use this representation to discover the closest meaning of new words that are not included in the wordnets. It could also be used to discover a rough translation for words that are not included in a dictionary.
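Since aligned synsets share a PWN Offset-POS, combining them is a straightforward aggregation. A sketch with toy per-language synset vectors (the language codes and vectors are illustrative):

```python
import numpy as np

synset_vecs = {
    "vie": {"02958343-n": np.array([1.0, 0.0])},
    "arb": {"02958343-n": np.array([0.8, 0.2])},
}

def multilingual_synset_vector(offset_pos):
    """Average the synset's vectors across all languages that have it."""
    vecs = [lang[offset_pos] for lang in synset_vecs.values()
            if offset_pos in lang]
    return np.mean(vecs, axis=0) if vecs else None

print(multilingual_synset_vector("02958343-n"))
# -> [0.9 0.1]
```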
BIBLIOGRAPHY
M. Abbas and K. Smaili. Comparison of topic identification methods for arabic language. In International Conference on Recent Advances in Natural Language Processing- RANLP 2005, volume 14, 2005.
M. Abbas, K. Smaïli, and D. Berkani. Evaluation of topic identification methods on arabic corpora. JDIM, 9(5):185–192, 2011.
K. Ahn and M. Frampton. Automatic generation of translation dictionaries using inter- mediary languages. In Proceedings of the International Workshop on Cross-Language Knowledge Induction, pages 41–44. Association for Computational Linguistics, 2006.
P. Akaraputthiporn, K. Kosawat, and W. Aroonmanakun. A Bi-directional Translation Approach for Building Thai Wordnet. In Asian Language Processing, 2009. IALP’09. International Conference on, pages 97–101. IEEE, 2009.
F. Al Tarouti and J. Kalita. Enhancing automatic wordnet construction using word embeddings. In Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP, pages 30–34, San Diego, California, June 2016. Association for Computational Linguistics.
M. Apidianaki and R. J. Von Neumann. Limsi: Cross-lingual word sense disambiguation using translation sense clustering. In Second Joint Conference on Lexical and Computational Semantics (*SEM), volume 2, pages 178–182, 2013.
M. A. Attia. Handling Arabic morphological and syntactic ambiguity within the LFG framework with a view to machine translation. PhD thesis, University of Manchester, 2008.
E. Barbu and V. Barbu Mititelu. Automatic Building of Wordnets. Recent Advances in Natural Language Processing IV: Selected Papers from RANLP 2005, 292:217, 2007.
K. R. Beesley. Arabic finite-state morphological analysis and generation. In Proceedings of the 16th conference on Computational linguistics-Volume 1, pages 89–94. Association for Computational Linguistics, 1996.
S. Bhattacharya, M. Choudhury, S. Sarkar, and A. Basu. Inflectional morphology synthesis for bengali noun, pronoun and verb systems. Proc. of NCCPB, 8, 2005.
P. Bhattacharyya. IndoWordNet. In Proc. of LREC-10, 2010.
O. Bilgin, Ö. Çetinoğlu, and K. Oflazer. Building a wordnet for Turkish. Romanian Journal of Information Science and Technology, 7(1-2):163–172, 2004.
L. Bloomfield. Language. New York: Holt, Rinehart and Winston, 1933.
F. Bond and K. Ogura. Combining linguistic resources to create a machine-tractable Japanese-Malay dictionary. Language Resources and Evaluation, 42(2):127–136, 2008.
L. Borin and M. Forsberg. Swesaurus; or, the frankenstein approach to wordnet construction. In Proceedings of the Seventh Global WordNet Conference (GWC 2014), 2014.
D. Bouamor, N. Semmar, C. France, and P. Zweigenbaum. Using Wordnet and semantic similarity for bilingual terminology mining from comparable corpora. In Proceedings of the 6th Workshop on Building and Using Comparable Corpora, pages 16–23. Citeseer, 2013.
R. D. Brown. Automated dictionary extraction for “knowledge-free” example-based trans- lation. In Proceedings of the Seventh International Conference on Theoretical and Methodological Issues in Machine Translation, pages 111–118, 1997.
T. Buckwalter. Issues in arabic orthography and morphology analysis. In Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages, pages 31–34. Association for Computational Linguistics, 2004.
T. Charoenporn, V. Sornlertlamvanich, C. Mokarat, and H. Isahara. Semi-automatic com- pilation of Asian WordNet. In 14th Annual Meeting of the Association for Natural Lan- guage Processing, pages 1041–1044, 2008.
X. Chen, Z. Liu, and M. Sun. A unified model for word sense representation and disam- biguation. In EMNLP, pages 1025–1035. Citeseer, 2014.
D. Christodoulakis, K. Oflazer, D. Dutoit, S. Koeva, G. Totkov, K. Pala, D. Cristea, D. Tufiş, M. Grigoriadou, I. Tsakou, and others. BalkaNet: A Multilingual Semantic Network for Balkan Languages. In Proceedings of the 1st International Wordnet Conference, Mysore, India, 2002.
C. J. Crouch. An approach to the automatic construction of global thesauri. Information Processing & Management, 26(5):629–640, 1990.
A. Cucchiarelli, R. Navigli, F. Neri, and P. Velardi. Automatic Generation of Glosses in the OntoLearn System. In LREC. Citeseer, 2004.
J. R. Curran. From distributional to semantic similarity. PhD thesis, University of Edinburgh, 2004.
J. R. Curran and M. Moens. Improvements in automatic thesaurus extraction. In Pro- ceedings of the ACL-02 workshop on Unsupervised lexical acquisition-Volume 9, pages 59–66. Association for Computational Linguistics, 2002a.
J. R. Curran and M. Moens. Scaling context space. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 231–238. Association for Computational Linguistics, 2002b.
K. Darwish. Named entity recognition using cross-lingual resources: Arabic as an example. In ACL (1), pages 1558–1567, 2013.
M. Diab and N. Habash. Arabic dialect processing tutorial. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Tutorial Abstracts, pages 5–6. Association for Computational Linguistics, 2007.
R. M. Fano and D. Hawkins. Transmission of information: A statistical theory of commu- nications. American Journal of Physics, 29(11):793–794, 1961.
A. Farghaly and K. Shaalan. Arabic natural language processing: Challenges and solutions. ACM Transactions on Asian Language Information Processing (TALIP), 8(4):14, 2009.
C. Fellbaum. A semantic network of English verbs. WordNet: An electronic lexical database, 3:153–178, 1998.
C. Fellbaum. WordNet and Wordnets. In A. Barber, editor, Encyclopedia of Language and Linguistics, pages 2–665. Elsevier, 2005.
M. A. Finlayson. Java libraries for accessing the Princeton WordNet: Comparison and evaluation. In Proceedings of the 7th Global Wordnet Conference, pages 78–85, 2014.
J. R. Firth. A synopsis of linguistic theory, 1930–1955. 1957.
D. Foley and J. Kalita. Integrating wordnet for multiple sense embeddings in vector seman- tics. In REU on Machine Learning and Applications. University of Colorado, Colorado Springs, 2016.
T. Gollins and M. Sanderson. Improving cross language retrieval with triangulated transla- tion. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 90–95. ACM, 2001.
R. G. Gordon and B. F. Grimes. Ethnologue: Languages of the world, volume 15. SIL international Dallas, TX, 2005.
S. Green and C. D. Manning. Better arabic parsing: Baselines, evaluations, and analysis. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 394–402. Association for Computational Linguistics, 2010.
G. Grefenstette. Explorations in automatic thesaurus discovery, volume 278. Springer Science & Business Media, 2012.
G. Gunawan and A. Saputra. Building synsets for Indonesian Wordnet with monolingual lexical resources. In Asian Language Processing (IALP), 2010 International Conference on, pages 297–300. IEEE, 2010.
N. Habash and O. Rambow. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting on Asso- ciation for Computational Linguistics, pages 573–580. Association for Computational Linguistics, 2005.
N. Habash, R. Roth, O. Rambow, R. Eskander, and N. Tomeh. Morphological analysis and disambiguation for dialectal arabic. In HLT-NAACL, pages 426–432, 2013.
N. Y. Habash. Introduction to Arabic natural language processing. Synthesis Lectures on Human Language Technologies, 3(1):1–187, 2010.
A. Haghighi, P. Liang, T. Berg-Kirkpatrick, and D. Klein. Learning Bilingual Lexicons from Monolingual Corpora. In ACL, volume 2008, pages 771–779, 2008.
Z. S. Harris. Distributional structure. Word, 10(2-3):146–162, 1954.
E. Haugen. Dialect, language, nation. American anthropologist, 68(4):922–935, 1966.
L. Hinkle, A. Brouillette, S. Jayakar, L. Gathings, M. Lezcano, and J. Kalita. Design and evaluation of soft keyboards for brahmic scripts. ACM Transactions on Asian Language Information Processing (TALIP), 12(2):6, 2013.
G. Hirst and D. St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms. WordNet: An electronic lexical database, 305:305–332, 1998.
E. Héja. Dictionary Building based on Parallel Corpora and Word Alignment. In Proceed- ings of the XIV Euralex International Congress, Leeuwarden, pages 6–10, 2010.
Y. Hlal. Morphological analysis of arabic speech. In Workshop Papers Kuwait/Proceedings of Kuwait Conference on Computer Processing of the Arabic Language, pages 273–294, 1985.
E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 873–882. Association for Computational Linguistics, 2012.
V. István and Y. Shoichi. Bilingual dictionary generation for low-resourced language pairs. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, pages 862–870. Association for Computational Linguistics, 2009.
P. Jaccard. The distribution of the flora in the alpine zone. New phytologist, 11(2):37–50, 1912.
D. Jurafsky and J. H. Martin. Speech and Language Processing (3rd Edition Draft). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2016.
H. Kaji and M. Watanabe. Automatic Construction of Japanese WordNet. Proceedings of LREC2006, Italy, 2006.
H. Kozima and T. Furugori. Similarity between words computed by spreading activation on an English dictionary. In Proceedings of the sixth conference on European chapter of the Association for Computational Linguistics, pages 232–239. Association for Compu- tational Linguistics, 1993.
K. N. Lam. Automatically Creating MultiLingual Resources. PhD thesis, University of Colorado, Colorado Springs, Apr. 2015.
K. N. Lam and J. Kalita. Creating Reverse Bilingual Dictionaries. In HLT-NAACL, pages 524–528. Citeseer, 2013.
K. N. Lam, F. Al Tarouti, and J. Kalita. Creating Lexical Resources for Endangered Lan- guages. In Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 54–62, Baltimore, Maryland, USA, June 2014a. Association for Computational Linguistics.
K. N. Lam, F. Al Tarouti, and J. Kalita. Automatically constructing Wordnet synsets. In 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), Baltimore, USA, June 2014b.
K. N. Lam, F. Al Tarouti, and J. Kalita. Phrase translation using a bilingual dictionary and n-gram data: A case study from Vietnamese to English. In Proceedings of NAACL-HLT, pages 65–69, 2015a.
K. N. Lam, F. Al Tarouti, and J. Kalita. Automatically Creating a Large Number of New Bilingual Dictionaries. In Twenty-Ninth AAAI Conference on Artificial Intelligence, Feb. 2015b.
S. I. Landau. Dictionaries. NY: Scribners, 1984.
L. S. Larkey, L. Ballesteros, and M. E. Connell. Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 275–282. ACM, 2002.
P. Le-Hong, A. Roussanaly, T. M. H. Nguyen, and M. Rossignol. An empirical study of maximum entropy approach for part-of-speech tagging of vietnamese texts. In Traitement Automatique des Langues Naturelles-TALN 2010, page 12, 2010.
D. Leenoi, T. Supnithi, and W. Aroonmanakun. Building a Gold Standard for Thai Word- Net. In Proceeding of The International Conference on Asian Language Processing 2008 (IALP2008), pages 78–82, 2008.
D. Lin. Automatic retrieval and clustering of similar words. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 2, pages 768–774. Association for Computational Linguistics, 1998.
K. Lindén and J. Niemi. Is it possible to create a very large wordnet in 100 days? an evaluation. Language resources and evaluation, 48(2):191–201, 2014.
K. Lindén and L. Carlson. FinnWordNet — WordNet på finska via översättning. LexicoNordica, 17(17), 2010.
N. Ljubešić and D. Fišer. Bootstrapping bilingual lexicons from comparable corpora for closely related languages. In Text, Speech and Dialogue, pages 91–98. Springer, 2011.
M. Maziarz, M. Piasecki, E. Rudnicka, and S. Szpakowicz. Beyond the transfer-and-merge wordnet construction: plwordnet and a comparison with wordnet. In RANLP, pages 443–452, 2013.
J. J. McCarthy. A prosodic theory of nonconcatenative morphology. Linguistic inquiry, 12 (3):373–418, 1981.
T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751, 2013.
G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38 (11):39–41, 1995.
G. A. Miller and F. Hristea. WordNet nouns: Classes and instances. Computational Lin- guistics, 32(1):1–3, 2006.
T. Miller and I. Gurevych. Wordnet-wikipedia-wiktionary: Construction of a three-way alignment. In LREC, pages 2094–2100, 2014.
M. Mladenovic, J. Mitrovic, and C. Krstev. Developing and Maintaining a WordNet: Pro- cedures and Tools. In Proceedings of the 7th Global Wordnet Conference (GWC 2014), pages 55–62, 2014.
C. Mouton and G. de Chalendar. JAWS: Just another WordNet subset. Proc. of TALN’10, 2010.
A. S. Nagvenkar, N. R. Prabhugaonkar, V. P. Prabhu, R. N. Karmali, and J. D. Pawar. Con- cept Space Synset Manager Tool. In Proceedings of the 7th Global Wordnet Conference, pages 86–94, 2014.
P. Nakov and H. T. Ng. Improved statistical machine translation for resource-poor languages using related resource-rich languages. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3, pages 1358–1367. Association for Computational Linguistics, 2009.
R. Navigli and S. P. Ponzetto. BabelNet: Building a very large multilingual semantic network. In Proceedings of the 48th annual meeting of the association for computational linguistics, pages 216–225. Association for Computational Linguistics, 2010.
A. Neelakantan, J. Shankar, A. Passos, and A. McCallum. Efficient non-parametric estima- tion of multiple embeddings per word in vector space. arXiv preprint arXiv:1504.06654, 2015.
L. Nerima and E. Wehrli. Generating Bilingual Dictionaries by Transitivity. In LREC, volume 8, pages 2584–2587, 2008.
R. Noyer. Vietnamese 'morphology' and the definition of word. University of Pennsylvania Working Papers in Linguistics, 5(2):5, 1998.
A. Oliver. WN-Toolkit: Automatic generation of wordnets following the expand model. In Proceedings of the 7th Global WordNet Conference, Tartu, Estonia, 2014.
A. Oliver and S. Climent. Parallel corpora for Wordnet construction: machine translation vs. automatic sense tagging. In Computational Linguistics and Intelligent Text Process- ing, pages 110–121. Springer, 2012.
P. G. Otero and J. R. P. Campos. Automatic generation of bilingual dictionaries using inter- mediary languages and comparable corpora. In Computational Linguistics and Intelligent Text Processing, pages 473–483. Springer, 2010.
N. R. Prabhugaonkar, J. D. Pawar, and T. Plateau. Use of Sense Marking for Improving WordNet Coverage. In Proceedings of the 7th Global Wordnet Conference, pages 95–99, 2014.
Q. Pradet, G. de Chalendar, and J. B. Desormeaux. WoNeF, an improved, expanded and evaluated automatic French translation of WordNet. In Proceedings of the 7th Global WordNet Conference, Tartu, Estonia, 2014.
J. Ramírez, M. Asahara, and Y. Matsumoto. Japanese-Spanish thesaurus construction using English as a pivot. arXiv preprint arXiv:1303.1232, 2013.
G. Rigau, H. Rodriguez, and E. Agirre. Building accurate semantic taxonomies from monolingual MRDs. In Proceedings of the 17th international conference on Compu- tational linguistics-Volume 2, pages 1103–1109. Association for Computational Linguis- tics, 1998.
H. Rodríguez, D. Farwell, J. Ferreres, M. Bertran, M. Alkhalifa, and M. A. Martí. Arabic wordnet: Semi-automatic extensions using bayesian inference. In LREC, 2008.
S. Rothe and H. Schütze. Autoextend: Extending word embeddings to embeddings for synsets and lexemes. arXiv preprint arXiv:1507.01127, 2015.
B. Sagot and D. Fišer. Building a free French wordnet from multilingual resources. In OntoLex, 2008.
N. Saharia, D. Das, U. Sharma, and J. Kalita. Part of speech tagger for assamese text. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 33–36. Associa- tion for Computational Linguistics, 2009.
R. C. S. K. Sarma. Structured and logical representations of assamese text for question- answering system. In 24th International Conference on Computational Linguistics, page 27, 2012.
M. Saveski and I. Trajkovski. Automatic construction of wordnets by using machine translation and language modeling. In 13th Multiconference Information Society, Ljubljana, Slovenia, 2010.
K. Shaalan, A. A. Monem, and A. Rafea. Arabic morphological generation from interlin- gua. In Intelligent Information Processing III, pages 441–451. Springer, 2006.
U. Sharma, J. K. Kalita, and R. K. Das. Acquisition of morphology of an indic lan- guage from text corpus. ACM Transactions on Asian Language Information Processing (TALIP), 7(3):9, 2008.
R. Shaw, A. Datta, D. VanderMeer, and K. Dutta. Building a scalable database-driven reverse dictionary. Knowledge and Data Engineering, IEEE Transactions on, 25(3): 528–540, 2013.
S. Soderland, O. Etzioni, D. S. Weld, K. Reiter, M. Skinner, M. Sammer, J. Bilmes, and others. Panlingual lexical translation via probabilistic inference. Artificial Intelligence, 174(9):619–637, 2010.
K. Tanaka and K. Umemura. Construction of a bilingual dictionary intermediated by a third language. In Proceedings of the 15th conference on Computational linguistics-Volume 1, pages 297–303. Association for Computational Linguistics, 1994.
L. C. Thompson. A Vietnamese reference grammar, volume 13. University of Hawaii Press, 1987.
K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 173–180. Association for Computational Linguistics, 2003.
P. Vossen. Introduction to EuroWordNet. In EuroWordNet: A multilingual database with lexical semantic networks, pages 1–17. Springer, 1998.
Wikipedia. Wordnet — wikipedia, the free encyclopedia, 2015. URL http://en. wikipedia.org/w/index.php?title=WordNet&oldid=656664111. [Online; accessed 22-April-2015].
Wikipedia. Vietnamese language — wikipedia, the free encyclopedia, 2016a. URL https://en.wikipedia.org/w/index.php?title=Vietnamese_language&oldid=731154067. [Online; accessed 30-July-2016].
Wikipedia. Vietnamese morphology — wikipedia, the free encyclopedia, 2016b. URL https://en.wikipedia.org/w/index.php?title=Vietnamese_ morphology&oldid=730832239. [Online; accessed 30-July-2016].
Wikipedia. Lexicon — wikipedia, the free encyclopedia, 2016c. URL https:// en.wikipedia.org/w/index.php?title=Lexicon&oldid=718057169. [Online; accessed 3-August-2016].
K. Yu and J. Tsujii. Extracting bilingual dictionary from comparable corpora with dependency heterogeneity. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 121–124. Association for Computational Linguistics, 2009.
O. F. Zaidan and C. Callison-Burch. Arabic dialect identification. Computational Linguistics, 40(1):171–202, 2014.

Appendix A

PAPERS RESULTING FROM THE DISSERTATION

Appendix B
DATA PROCESSING SOFTWARE CODE
B.1 ComputCosineSim.py
###########################
# Program to compute cosine similarity
# between semantically related words in a WordNet
# using Word2Vec
# Author: Feras Al Tarouti
# Date  : Feb 4 2016
###########################
import unicodecsv as csv
import gensim
import editdistance

word2vecmodel = gensim.models.Word2Vec.load_word2vec_format(
    'VieVectors_SG_Size100_W5.bin', binary=True)

with open('LexBankVieSemRelatedWords_WithCOS.csv', 'wb') as out_file:
    writer = csv.writer(out_file)
    writer.writerow(['OffsetPos1', 'Word1', 'Relation', 'OffsetPos2', 'Word2',
                     'COS', 'ld'])
    with open('LexBankVieSemRelatedWords.csv', 'rb') as in_file:
        reader = csv.reader(in_file, delimiter=',', quoting=csv.QUOTE_NONE)
        firstline = True
        rownum = 0
        for row in reader:
            if firstline:
                firstline = False
            else:
                print("Compute Similarity for pairs number: {0}".format(rownum))
                SynsetID1 = row[0]
                Word1 = row[1]
                Relation = row[2]
                SynsetID2 = row[3]
                Word2 = row[4]
                try:
                    cos = round(word2vecmodel.similarity(Word1, Word2), 3)
                except Exception:
                    cos = 0.00  # out-of-vocabulary pair
                ld = editdistance.eval(Word1, Word2)
                newrow = [SynsetID1, Word1, Relation, SynsetID2, Word2, cos, ld]
                writer.writerow(newrow)
                rownum = rownum + 1
B.2 GenerateVectorForSynset.py
###########################
# A function for computing a synset vector
# Author: Feras Al Tarouti
# Date  : May 18 2016
###########################
import numpy as np

def GenerateVectorForSynset(syn, thislemma):
    FinalVector = np.zeros(100)
    VectorList = []                     # the vector set for this synset
    LemmasList = FindLemmasOfSyns(syn)  # the list of lemmas for this synset
    for lemma in LemmasList:
        if lemma != thislemma:
            Vector = GenerateVectorForLemma(lemma)
            if np.count_nonzero(Vector) > 0:
                VectorList.append(Vector)  # add the word vector to the set
    # If this synset has only one word, find a related word
    # and add its vector to the vector set
    if len(VectorList) < 2:
        relatedword = FindRelatedSyn(syn)
        if relatedword != "":
            Vector = GenerateVectorForLemma(relatedword)
            if np.count_nonzero(Vector) > 0:
                VectorList.append(Vector)
    numbofVec = len(VectorList)
    if numbofVec == 0:
        return FinalVector  # no usable vectors: return the zero vector
    for vec in VectorList:
        FinalVector = np.add(FinalVector, vec)
    # compute the average
    scalar = np.divide(float(1), float(numbofVec))
    FinalVector = np.multiply(FinalVector, scalar)
    return FinalVector
B.3 GenerateVectorForGloss.py
###########################
# A function for computing a gloss vector
# Author: Feras Al Tarouti
# Date  : May 18 2016
###########################
def GenerateVectorFor(thisSentence, lemma):
    VectorList = []  # the vector set for this sentence
    FinalVector = np.zeros(100)
    for word in thisSentence.split():
        if word not in stopwrds and word != lemma:
            try:
                Vector = word2vecmodel[word]
                NofSyns = FindNumberOfSyns(word)
                # Scale the vector based on the number of synsets
                # the word belongs to.
                if NofSyns > 1:
                    thisScalar = np.divide(float(1), float(NofSyns))
                    Vector = np.multiply(Vector, thisScalar)
                VectorList.append(Vector)
            except Exception:
                pass  # the word is not in the model; skip it
    if len(VectorList) > 0:
        for vec in VectorList:
            FinalVector = np.add(FinalVector, vec)
        numbofVec = len(VectorList)
        scalar = np.divide(float(1), float(numbofVec))
        FinalVector = np.multiply(FinalVector, scalar)
    return FinalVector
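The gloss-vector function weights each word vector by 1/(number of synsets the word belongs to), so highly ambiguous words contribute less to the sentence vector. A toy sketch of that weighting scheme follows; the names and counts are illustrative (in the listing, `FindNumberOfSyns` supplies the counts):

```python
def weighted_sum(word_vectors, synset_counts):
    # word_vectors: {word: vector}; synset_counts: {word: number of synsets}.
    # Ambiguous words (more than one synset) contribute a scaled-down vector,
    # as in GenerateVectorFor; the result is averaged over the words.
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word, vec in word_vectors.items():
        n = synset_counts.get(word, 1)
        w = 1.0 / n if n > 1 else 1.0
        for i, x in enumerate(vec):
            total[i] += w * x
    k = len(word_vectors)
    return [x / k for x in total]

vecs = {"bank": [4.0, 0.0], "river": [0.0, 2.0]}
counts = {"bank": 4, "river": 1}  # "bank" is ambiguous, so it is down-weighted
print(weighted_sum(vecs, counts))  # [0.5, 1.0]
```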
B.4 ComputeGlossSynsetSimilarity.py
###########################
# A program for computing similarity between a synset and a gloss
# Author: Feras Al Tarouti
# Date  : May 18 2016
# First step : open the synset-gloss file and read each sentence
# Second step: generate the vector for the synset
# Third step : generate the vector for the sentence
# Fourth step: compute the cosine similarity between the synset vector
#              and the sentence vector
# Fifth step : save the result
###########################
import math
import unicodecsv as csv
from decimal import Decimal

with open(InputDataFile, 'rb') as SentencesFile, open(outputfile, 'wb') as out_file:
    reader = csv.reader(SentencesFile, encoding='utf-8', delimiter=',')
    writer = csv.writer(out_file, encoding='utf-8')
    writer.writerow(['ID', 'CosSem'])
    rownum = 0
    for row in reader:
        if rownum != 0:
            print("Computing cosine similarity for row number: {0}".format(rownum))
            thisSenID = row[0]     # the current sentence ID
            thisSynset = row[1]    # the current synset ID
            thisSynMem = row[2]    # the number of members of this synset
            thiswrd = row[3]       # the word used in this sentence
            thiswrdSyns = row[4]   # the number of synsets for this word
            thisSentence = row[5]  # the current sentence
            # Compute a vector for this synset.
            thisSynsetVector = GenerateVectorForSynset(thisSynset, "")
            # Generate a vector for this sentence.
            thisSentenceVector = GenerateVectorFor(thisSentence, "")
            CosDistance = ComputeCosine(thisSynsetVector, thisSentenceVector)
            x = Decimal(CosDistance)
            if math.isnan(x):
                CosDistance = 0
            newrow = [thisSenID, CosDistance]
            writer.writerow(newrow)
        rownum = rownum + 1

Appendix C

MICROSOFT SQL SERVER TABLES
--
-- Database: `LexBank_System`
--
-- --------------------------------------------------------
--
-- Table structure for table `Users_Info`
--
USE [LexBank_System]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Users_Info](
    [UserId] [varchar](50) NOT NULL,
    [UserName] [varchar](100) NOT NULL,
    [UserEmail] [varchar](70) NOT NULL,
    [UserPwd] [varchar](max) NOT NULL,
    [UserPriv] [varchar](15) NOT NULL,
    [UserStatus] [varchar](15) NOT NULL,
    CONSTRAINT [PK_Users_Info] PRIMARY KEY CLUSTERED
    (
        [UserId] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
            ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `System_Log`
--
USE [LexBank_System]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[System_Log](
    [EventId] [int] IDENTITY(1,1) NOT NULL,
    [EventDesc] [varchar](200) NOT NULL,
    [EventTime] [datetime] NOT NULL,
    [UserId] [varchar](50) NOT NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Database: `LexBank_Resources`
--
-- --------------------------------------------------------
--
-- Table structure for table `Arabic_CoreWordnet`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Arabic_CoreWordnet](
    [Offset_Pos] [nvarchar](10) NOT NULL,
    [Member] [nvarchar](200) NOT NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Assamese_CoreWordnet`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Assamese_CoreWordnet](
    [Offset_Pos] [nvarchar](10) NOT NULL,
    [Member] [nvarchar](200) NOT NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Vietnamese_CoreWordnet`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_CoreWordnet](
    [Offset_Pos] [nvarchar](10) NOT NULL,
    [Member] [nvarchar](200) NOT NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Arabic_Sem_Relations`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Arabic_Sem_Relations](
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Assamese_Sem_Relations`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Assamese_Sem_Relations](
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Vietnamese_Sem_Relations`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_Sem_Relations](
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Arabic_WordnetGlosses`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Arabic_WordnetGlosses](
    [Offset_Pos] [varchar](10) NOT NULL,
    [Gloss] [varchar](4000) NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Assamese_WordnetGlosses`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Assamese_WordnetGlosses](
    [Offset_Pos] [varchar](10) NOT NULL,
    [Gloss] [varchar](4000) NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Vietnamese_WordnetGlosses`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_WordnetGlosses](
    [Offset_Pos] [varchar](10) NOT NULL,
    [Gloss] [varchar](4000) NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Arabic_Sem_Relations_Eval_Data`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Arabic_Sem_Relations_Eval_Data](
    [RelationKey] [int] IDENTITY(1,1) NOT NULL,
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word1] [nvarchar](100) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word2] [nvarchar](100) NOT NULL,
    [COS] [real] NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Assamese_Sem_Relations_Eval_Data`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Assamese_Sem_Relations_Eval_Data](
    [RelationKey] [int] IDENTITY(1,1) NOT NULL,
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word1] [nvarchar](100) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word2] [nvarchar](100) NOT NULL,
    [COS] [real] NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Vietnamese_Sem_Relations_Eval_Data`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_Sem_Relations_Eval_Data](
    [RelationKey] [int] IDENTITY(1,1) NOT NULL,
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word1] [nvarchar](100) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word2] [nvarchar](100) NOT NULL,
    [COS] [real] NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Arabic_Sem_Relations_Eval_Response`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Arabic_Sem_Relations_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [RelationKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Assamese_Sem_Relations_Eval_Response`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Assamese_Sem_Relations_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [RelationKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Vietnamese_Sem_Relations_Eval_Response`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_Sem_Relations_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [RelationKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Arabic_WordnetGloss_Eval_Data`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Arabic_WordnetGloss_Eval_Data](
    [GlossKey] [int] IDENTITY(1,1) NOT NULL,
    [Offset-pos] [varchar](10) NOT NULL,
    [Word] [nvarchar](500) NULL,
    [Sentence] [nvarchar](4000) NULL,
    [PWNGloss] [nvarchar](900) NULL,
    [CosSem] [real] NULL,
    [GlossRank] [int] NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Assamese_WordnetGloss_Eval_Data`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Assamese_WordnetGloss_Eval_Data](
    [GlossKey] [int] IDENTITY(1,1) NOT NULL,
    [Offset-pos] [varchar](10) NOT NULL,
    [Word] [nvarchar](500) NULL,
    [Sentence] [nvarchar](4000) NULL,
    [PWNGloss] [nvarchar](900) NULL,
    [CosSem] [real] NULL,
    [GlossRank] [int] NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Vietnamese_WordnetGloss_Eval_Data`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_WordnetGloss_Eval_Data](
    [GlossKey] [int] IDENTITY(1,1) NOT NULL,
    [Offset-pos] [varchar](10) NOT NULL,
    [Word] [nvarchar](500) NULL,
    [Sentence] [nvarchar](4000) NULL,
    [PWNGloss] [nvarchar](900) NULL,
    [CosSem] [real] NULL,
    [GlossRank] [int] NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Arabic_WordnetGloss_Eval_Response`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Arabic_WordnetGloss_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [GlossKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Assamese_WordnetGloss_Eval_Response`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Assamese_WordnetGloss_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [GlossKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Vietnamese_WordnetGloss_Eval_Response`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_WordnetGloss_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [GlossKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

Appendix D

LEXBANK UTILITY CLASS
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Data;
using System.Data.SqlClient;
using System.Web.Configuration;
using System.IO;
using System.Text;
using System.Security.Cryptography;

namespace LexBank2016
{
    public class LexBankUtils
    {
        private string LexBankConnectionString =
            WebConfigurationManager.ConnectionStrings["LexBankData"].ToString();

        public Boolean IsUserIdAvailable(string UserId)
        {
            // Check whether the given user id is already in use.
            Boolean result = false;
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create a new SqlCommand object.
                using (SqlCommand command = new SqlCommand(
                    "SELECT UserId FROM Users_Info WHERE UserId LIKE @UserId", connection))
                {
                    // Define the parameters.
                    command.Parameters.AddWithValue("@UserId", UserId.Trim());
                    // Invoke the ExecuteScalar method.
                    var firstColumn = command.ExecuteScalar();
                    if (firstColumn == null)
                    {
                        result = true;
                    }
                }
            }
            return result;
        }

        public string EncryptPassword(string PlanePassword)
        {
            string EncryptionKey = "LexBank";
            byte[] PlaneBytes = Encoding.Unicode.GetBytes(PlanePassword);
            using (Aes PasswordEncryptor = Aes.Create())
            {
                Rfc2898DeriveBytes PBKDF = new Rfc2898DeriveBytes(EncryptionKey,
                    new byte[] { 0x49, 0x76, 0x61, 0x6e, 0x20, 0x4d, 0x65,
                                 0x64, 0x76, 0x65, 0x64, 0x65, 0x76 });
                PasswordEncryptor.Key = PBKDF.GetBytes(32);
                PasswordEncryptor.IV = PBKDF.GetBytes(16);
                using (MemoryStream ms = new MemoryStream())
                {
                    using (CryptoStream cs = new CryptoStream(ms,
                        PasswordEncryptor.CreateEncryptor(), CryptoStreamMode.Write))
                    {
                        cs.Write(PlaneBytes, 0, PlaneBytes.Length);
                        cs.Close();
                    }
                    PlanePassword = Convert.ToBase64String(ms.ToArray());
                }
            }
            return PlanePassword;
        }

        public string DecryptPassword(string EncryptedPassword)
        {
            string EncryptionKey = "LexBank";
            byte[] DecryptedBytes = Convert.FromBase64String(EncryptedPassword);
            using (Aes PasswordEncryptor = Aes.Create())
            {
                Rfc2898DeriveBytes PBKDF = new Rfc2898DeriveBytes(EncryptionKey,
                    new byte[] { 0x49, 0x76, 0x61, 0x6e, 0x20, 0x4d, 0x65,
                                 0x64, 0x76, 0x65, 0x64, 0x65, 0x76 });
                PasswordEncryptor.Key = PBKDF.GetBytes(32);
                PasswordEncryptor.IV = PBKDF.GetBytes(16);
                using (MemoryStream ms = new MemoryStream())
                {
                    using (CryptoStream cs = new CryptoStream(ms,
                        PasswordEncryptor.CreateDecryptor(), CryptoStreamMode.Write))
                    {
                        cs.Write(DecryptedBytes, 0, DecryptedBytes.Length);
                        cs.Close();
                    }
                    EncryptedPassword = Encoding.Unicode.GetString(ms.ToArray());
                }
            }
            return EncryptedPassword;
        }

        public Boolean CreateNewUser(string UserId, string UserName,
                                     string UserEmail, string UserPwd)
        {
            Boolean result = false;
            string UserPriv = "client";
            string UserStatus = "New";
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create a new SqlCommand object.
                using (SqlCommand command = new SqlCommand(
                    "INSERT INTO Users_Info VALUES(@UserId, @UserName, @UserEmail, " +
                    "@UserPwd, @UserPriv, @UserStatus)", connection))
                {
                    // Define the parameters.
                    command.Parameters.AddWithValue("@UserId", UserId.Trim());
                    command.Parameters.AddWithValue("@UserName", UserName.Trim());
                    command.Parameters.AddWithValue("@UserEmail", UserEmail.Trim());
                    command.Parameters.AddWithValue("@UserPwd", UserPwd.Trim());
                    command.Parameters.AddWithValue("@UserPriv", UserPriv.Trim());
                    command.Parameters.AddWithValue("@UserStatus", UserStatus.Trim());
                    // Invoke the ExecuteNonQuery method.
                    int c = 0;
                    try
                    {
                        c = command.ExecuteNonQuery();
                        if (c == 1)
                            result = true;
                    }
                    catch (Exception e)
                    {
                    }
                }
            }
            return result;
        }

        public bool IsAuthenticated(string userid, string userpassword)
        {
            bool result = false;
            SqlConnection LexBankDataConnection =
                new SqlConnection(LexBankConnectionString);
            SqlCommand AuthCommand = new SqlCommand(
                "SELECT UserId, UserPriv, UserStatus FROM Users_Info " +
                "WHERE UserId = @userid AND UserPwd = @userpassword",
                LexBankDataConnection);
            AuthCommand.Parameters.AddWithValue("@userid", userid);
            AuthCommand.Parameters.AddWithValue("@userpassword",
                EncryptPassword(userpassword.Trim()));
            LexBankDataConnection.Open();
            SqlDataReader reader = AuthCommand.ExecuteReader();
            while (reader.Read())
            {
                string UserStatus = reader["UserStatus"].ToString();
                if (UserStatus == "Active")
                {
                    result = true;
                    LogEvent("Login", DateTime.Now, userid.Trim());
                }
            }
            return result;
        }

        // ... [the declaration and opening of the next method, which collects
        //      strings from a query result into a List, are missing from the
        //      source listing; its closing portion follows] ...

                    // Invoke the ExecuteReader method.
                    SqlDataReader reader = command.ExecuteReader();
                    while (reader.Read())
                    {
                        result.Add(reader.GetString(1).Trim());
                    } // end while
                } // end the second using
            } // end the first using
            return result;
        }

        public Boolean IsSynSetAvailable(string OffsetPos, string Wordnet)
        {
            // Check whether the synset id is included in a wordnet table.
            Boolean result = false;
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create a new SqlCommand object.
                using (SqlCommand command = new SqlCommand(
                    "SELECT Offset_Pos FROM " + Wordnet.Trim() +
                    " WHERE Offset_Pos LIKE @OffsetPos", connection))
                {
                    // Define the parameters.
                    command.Parameters.AddWithValue("@OffsetPos", OffsetPos.Trim());
                    // Invoke the ExecuteReader method.
                    SqlDataReader reader = command.ExecuteReader();
                    if (reader.Read())
                        result = true;
                }
            }
            return result;
        }

        // ... [the declaration and most of the body of a Dictionary-returning
        //      method are missing from the source listing; its closing
        //      portion follows] ...

                    } // end while
                } // end the second using
            } // end the first using
            return result;
        }

        public string FindGloss(string OffsetPos, string GlossTable)
        {
            string result = "Gloss is not available";
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create a new SqlCommand object.
                using (SqlCommand command = new SqlCommand(
                    "SELECT * FROM " + GlossTable + " WHERE Offset_Pos LIKE @OffsetPos",
                    connection))
                {
                    // Define the parameters.
                    command.Parameters.AddWithValue("@OffsetPos", OffsetPos.Trim());
                    // Invoke the ExecuteReader method.
                    SqlDataReader reader = command.ExecuteReader();
                    while (reader.Read())
                    {
                        result = reader.GetString(1).Trim();
                    } // end while
                } // end the second using
            } // end the first using
            return result;
        }

        // ... [a further List-returning method is declared here, but its
        //      signature and body are missing from the source listing] ...

        // This method reads a synset gloss from the table and returns it
        // to be evaluated.
        // ... [the remainder of this method and the declaration of the next
        //      one, which records a relation-evaluation score, are missing
        //      from the source listing; the latter's body follows] ...
            try
            {
                SqlConnection MyConnection = new SqlConnection(LexBankConnectionString);
                string sqls = "INSERT INTO " + EvaluationTable +
                    " ([RelationKey],[Score],[UserID]) " +
                    "VALUES (@RelationKey,@Score,@UserId)";
                var command = new SqlCommand(sqls, MyConnection);
                command.Parameters.AddWithValue("@RelationKey", RelationKey);
                command.Parameters.AddWithValue("@Score", Score);
                command.Parameters.AddWithValue("@UserId", UserId.Trim());
                MyConnection.Open();
                command.ExecuteNonQuery();
                MyConnection.Close();
                return true;
            }
            catch (Exception ex)
            {
                return false;
            }
        }

        private Boolean EvaluateGloss(int GlossKey, int Score, string UserId,
                                      string EvaluationTable)
        {
            try
            {
                SqlConnection MyConnection = new SqlConnection(LexBankConnectionString);
                string sqls2 = "INSERT INTO " + EvaluationTable +
                    " ([GlossKey],[Score],[UserID]) VALUES (@GlossKey,@Score,@UserId)";
                var command = new SqlCommand(sqls2, MyConnection);
                command.Parameters.AddWithValue("@GlossKey", GlossKey);
                command.Parameters.AddWithValue("@Score", Score);
                command.Parameters.AddWithValue("@UserId", UserId);
                MyConnection.Open();
                command.ExecuteNonQuery();
                MyConnection.Close();
                return true;
            }
            catch (Exception ex)
            {
                return false;
            }
        }

        public void LogEvent(string EventDesc, DateTime EventTime, string UserId)
        {
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create a new SqlCommand object.
                using (SqlCommand command = new SqlCommand(
                    "INSERT INTO System_Log([EventDesc], [EventTime], [UserId]) " +
                    "VALUES(@EventDesc, @EventTime, @UserId)", connection))
                {
                    // Define the parameters.
                    command.Parameters.AddWithValue("@EventDesc", EventDesc.Trim());
                    command.Parameters.Add("@EventTime", SqlDbType.DateTime).Value = EventTime;
                    command.Parameters.AddWithValue("@UserId", UserId.Trim());
                    // Invoke the ExecuteNonQuery method.
                    command.ExecuteNonQuery();
                }
            }
        }

        public void ChangeUserStatus(string UserId, string NewStatus)
        {
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create a new SqlCommand object.
                using (SqlCommand command = new SqlCommand(
                    "UPDATE Users_Info SET UserStatus = @UserStatus " +
                    "WHERE UserId = @UserId", connection))
                {
                    // Define the parameters.
                    command.Parameters.AddWithValue("@UserId", UserId.Trim());
                    command.Parameters.AddWithValue("@UserStatus", NewStatus.Trim());
                    // Invoke the ExecuteNonQuery method.
                    command.ExecuteNonQuery();
                }
            }
        }

        public DataTable RetrieveUsers()
        {
            DataTable result = new DataTable();
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create a new SqlCommand object.
                using (SqlCommand command = new SqlCommand(
                    "SELECT [UserId], [UserName], [UserEmail], [UserPriv], " +
                    "[UserStatus] FROM [Users_Info]", connection))
                {
                    SqlDataAdapter dadapter = new SqlDataAdapter(command);
                    dadapter.Fill(result);
                }
            }
            return result;
        }
    }
}

Appendix E

IRB APPROVAL LETTER