LEXBANK: A MULTILINGUAL LEXICAL RESOURCE FOR LOW-RESOURCE

LANGUAGES

by

Feras Ali Al Tarouti

M.S., King Fahd University of Petroleum & Minerals, 2008

B.S., University of Dammam, 2001

A dissertation submitted to the Graduate Faculty of the

University of Colorado Colorado Springs

in partial fulfillment of the

requirements for the degree of

Doctor of Philosophy

Department of Computer Science

2016

© Copyright by Feras Ali Al Tarouti 2016. All Rights Reserved.

This dissertation for Doctor of Philosophy degree by

Feras Ali Al Tarouti

has been approved for the

Department of Computer Science

by

Jugal Kalita, Chair

Tim Chamillard

Rory Lewis

Khang Nhut Lam

Sudhanshu Semwal

Date

Al Tarouti, Feras A. (Ph.D., Computer Science)

LexBank: A Multilingual Lexical Resource for Low-Resource Languages

Dissertation directed by Professor Jugal Kalita

In this dissertation, we present new methods to create essential lexical resources for low-resource languages. Specifically, we develop methods for enhancing automatically created wordnets. As a baseline, we start by producing core wordnets, for several languages, using methods that need limited freely available resources for creating lexical resources

(Lam et al., 2014a,b, 2015b). Then, we establish the semantic relations between synsets in the wordnets we create. Next, we introduce a new method to automatically add glosses to the synsets in our wordnets. Our techniques use limited resources as input to ensure that they can be felicitously used with languages that currently lack many original resources. Most existing research works with languages that have significant lexical resources available, which are costly to construct. To make our created lexical resources publicly available, we developed LexBank, which is a web-based system that provides services for several low-resource languages.

To my mother, father and my wife.

Acknowledgments

I would like to express my appreciation to my wife and the mother of my kids, Omima, for the unlimited support she gave me during my journey toward my Ph.D. I am also very grateful for the support and guidance provided by my advisor Dr. Jugal Kalita. In addition, I would like to thank my dissertation committee members: Dr. Sudhanshu Semwal, Dr. Tim

Chamillard, Dr. Rory Lewis and Dr. Khang Nhut Lam for their guidance and consultation.

Table of Contents

1 Introduction 1

1.1 Motivation ...... 1

1.2 Research Focus ...... 4

1.2.1 Assamese Language ...... 4

1.2.1.1 Assamese Script ...... 5

1.2.1.2 Assamese Morphology ...... 5

1.2.1.3 Assamese Syntax ...... 6

1.2.2 Vietnamese Language ...... 6

1.2.2.1 Vietnamese Script ...... 7

1.2.2.2 Vietnamese Morphology ...... 7

1.2.2.3 Vietnamese Syntax ...... 8

1.3 Research Contributions ...... 8

2 Case Study: The Current Status of and Challenges in Processing Information in

Arabic 11

2.1 Introduction ...... 11

2.2 Fundamentals of Arabic ...... 12

2.2.1 Arabic Script ...... 13

2.2.2 Arabic Morphology ...... 16

2.2.3 Arabic Syntax ...... 18

2.3 Summary ...... 19

3 Literature Review 21

3.1 Automatic Construction of Wordnets ...... 21

3.2 Wordnet Management Tools ...... 26

3.3 Creating Bilingual Dictionaries ...... 29

3.4 Summary ...... 31

4 Automatically Constructing Structured Wordnets 32

4.1 Constructing Core Wordnets ...... 32

4.2 Constructing Wordnet Semantic Relations ...... 34

4.3 Experiments and Evaluation ...... 37

4.4 Summary ...... 39

5 Enhancing Automatic Wordnet Construction Using Embeddings 40

5.1 Introduction ...... 40

5.2 Similarity Metrics ...... 42

5.3 Generating Word Embeddings ...... 43

5.4 Removing Irrelevant Words in Synsets ...... 44

5.5 Validating Candidate Relations ...... 45

5.6 Selecting Thresholds ...... 46

5.7 Experiments ...... 46

5.7.1 Generating Vector Representations of Wordnets Words ...... 46

5.7.2 Producing Word Embeddings for Arabic ...... 48

5.8 Evaluation and Discussion ...... 49

5.9 Summary ...... 51

6 Selecting Glosses for Wordnet Synsets Using Word Embeddings 54

6.1 Literature Review ...... 54

6.2 Creating Language Model Using Word Embeddings ...... 55

6.3 Generating Vector Representation of Wordnet Synsets ...... 55

6.4 Automatically Selecting a Synset Gloss From a Corpus Using Synset2Vec . 58

6.5 Evaluation ...... 59

6.5.1 Using Synset2vec to Select Glosses for PWN Synsets ...... 60

6.5.2 Using Synset2vec to Select Glosses for Arabic, Assamese and Viet-

namese Synsets ...... 61

6.5.3 Results and Discussion ...... 62

6.6 Summary ...... 64

7 LexBank: A Multilingual Lexical Resource 67

7.1 Introduction ...... 67

7.2 Design ...... 68

7.2.1 The System Settings Database ...... 68

7.2.1.1 Users_Info ...... 68

7.2.1.2 System_log ...... 69

7.2.2 The Lexical Resources Database ...... 69

7.2.2.1 CoreWordnet ...... 70

7.2.2.2 Sem_Relations ...... 70

7.2.2.3 WordnetGlosses ...... 70

7.2.2.4 Sem_Relations_Eval_Data ...... 71

7.2.2.5 Sem_Relations_Eval_Response ...... 71

7.2.2.6 WordnetGlosses_Eval_Data ...... 72

7.2.2.7 WordnetGlosses_Eval_Response ...... 72

7.3 Application Layer ...... 73

7.4 Web Interface Design and Implementation ...... 74

7.4.1 Registration Form ...... 75

7.4.2 Log-in Form ...... 76

7.4.3 The Main Menu ...... 78

7.4.4 Searching Wordnet By Web Form ...... 79

7.4.5 Searching Wordnet by OffsetPos Web Form ...... 80

7.4.6 Evaluating Semantic Relations Between Synsets Web Form . . . . 82

7.4.7 Evaluating Wordnet Synsets Glosses Web Form ...... 86

7.4.8 Searching Bilingual Dictionary Web Form ...... 88

7.4.9 Users Management Web Form ...... 89

7.5 Summary ...... 91

8 Conclusions 92

9 Future Work 95

9.1 Extending Bilingual Dictionaries ...... 95

9.1.1 Related Work ...... 95

9.1.2 Extending Bilingual Dictionaries Using Structured Wordnets . . . . 97

9.2 Integrating Part-of-speech Tagging into Wordnet Construction ...... 99

9.3 Wordnet Expansion Using Word Embeddings ...... 100

9.4 Producing Vector Representation for Multi-word Expressions ...... 101

9.5 Vector Representation for Multi-lingual Wordnets ...... 101

Bibliography 102

Appendices 115

A Papers Resulting from The Dissertation 115

B Data Processing Software Code 116

B.1 ComputCosineSim.py ...... 116

B.2 GenerateVectorForSynset.py ...... 118

B.3 GenerateVectorForGloss.py ...... 119

B.4 ComputeGlossSynsetSimilarity.py ...... 120

C Microsoft SQL Server Tables 121

D LexBank Utility Class 133

E IRB Approval Letter 146

List of Tables

3.1 A list of the Java libraries tested in (Finlayson, 2014)...... 26

3.2 A comparison between some of the Java libraries for accessing the PWN. . 27

4.1 Wordnet semantic relations...... 35

4.2 Size, coverage and precision of the core wordnets we create for Arabic,

Assamese and Vietnamese...... 38

4.3 Precision of the semantic relations established for our Arabic wordnet . . . . 38

5.1 An example of cosine similarity between words in a candidate synset . . . . 45

5.2 The weighted average similarity between related words in AWN...... 48

5.3 Comparison between the weighted similarity averages obtained using dif-

ferent settings...... 49

5.4 Comparison between the number of synsets in AWN and our Arabic word-

net using different threshold values...... 49

5.5 Precision of the Arabic wordnet we create...... 50

5.6 Precision of the Assamese wordnet we create...... 50

5.7 Precision of the Vietnamese wordnet we create...... 50

5.8 Examples of related words and their cosine similarity from our Arabic

wordnet...... 51

5.9 Examples of related words and their cosine similarity from our Assamese

wordnet...... 52

5.10 Examples of related words and their cosine similarity from our Vietnamese

wordnet...... 52

6.1 Meanings of the noun “spill” and its ...... 56

6.2 Cosine similarity between the different synset vectors and glosses of the

word “abduction” in PWN...... 60

6.3 The precision of selecting glosses for PWN synsets ...... 62

6.4 Examples of Arabic glosses we produce in our Arabic wordnet...... 63

6.5 Examples of Assamese glosses we produce in our Assamese wordnet. . . . 64

6.6 Examples of Vietnamese glosses we produce in our Vietnamese wordnet. . 65

6.7 The precision of selecting glosses for Arabic, Assamese and Vietnamese

synsets ...... 65

List of Figures

3.1 An overview of the CSS management tool, adapted from (Nagvenkar et al.,

2014) ...... 28

4.1 IWND ...... 33

4.2 Core wordnet mapping to structured wordnet...... 35

4.3 Creating wordnet semantic relations using intermediate wordnet...... 36

4.4 The effect of missing synsets in recovering wordnet semantic relations us-

ing intermediate wordnet...... 37

4.5 Percentage of synset semantic relations recovered for the Arabic, Assamese

and Vietnamese wordnets...... 39

5.1 A histogram of synonyms, semantically related words, and non-related

words extracted from AWN...... 47

6.1 An example of creating a vector for a wordnet synset that include more

than one word...... 57

6.2 An example of creating vectors for wordnet synsets that share a single word. 58

7.1 An overview of LexBank system...... 67

7.2 LexBank web site map ...... 75

7.3 The registration web form ...... 76

7.4 Sequence diagram of the registration process ...... 77

7.5 The log-in web form ...... 77

7.6 Sequence diagram of the login process ...... 78

7.7 The main menu ...... 79

7.8 The web form for searching wordnet by lexeme. The form shows the result of searching the Arabic lexeme (ﻣﺼﺮ), which means Egypt. ...... 80

7.9 Sequence diagram of the process of searching wordnet using lexeme . . . . 81

7.10 The web form for searching wordnet by OffsetPos. The form shows the

result of searching the Arabic synset (08897065-n)...... 82

7.11 The web form for searching wordnet by OffsetPos. The form shows the

result of searching the Vietnamese synset (08897065-n)...... 83

7.12 The web form for searching wordnet by OffsetPos. The form shows the re-

sult of searching the Assamese synset (08897065-n). The third part meronym

in Assamese is wrong. It comes from the verb meaning of “desert” which

means to leave without intending to return...... 83

7.13 Sequence diagram of the process of searching wordnet using OffsetPos. . . 84

7.14 The web form for evaluating semantic relations between synsets in a word-

net. The form shows an example of evaluating a hyponymy relation be-

tween the two Assamese lexemes, one for radio telegraph and the other for

radio...... 84

7.15 Sequence diagram of the process of evaluating the relation between two

lexemes...... 85

7.16 The web form for evaluating wordnet synsets glosses. The form shows an

example of evaluating Arabic synset (13108841-n)...... 86

7.17 Sequence diagram of the process of evaluating the relation between two

lexemes...... 87

7.18 The web form for searching a bilingual dictionary. The form shows the result of translating the Arabic word (ﻣﺼﺮ), which means Egypt, to Assamese. 88

7.19 Sequence diagram of the process of searching a bilingual dictionary. . . . . 89

7.20 The web form for managing users in LexBank...... 89

7.21 Sequence diagram of the process of managing users in LexBank...... 90

9.1 The IW approach for creating a new bilingual dictionary ...... 96

9.2 Extending bilingual dictionaries using structured wordnets ...... 98

Chapter 1

INTRODUCTION

1.1 Motivation

The word lexicon means a repository that stores the vocabulary of a person, language or branch of knowledge such as computer science, the military or medicine (Wikipedia, 2016c). In linguistics, a lexicon is an inventory of basic units of meaning or lexemes. In practice, a lexicon (we may also call it a dictionary) may be printed as a book, or stored in a computer database that can be searched and used by a program. A lexical resource is a group of lexical units that provide linguistic information. The lexical units can be morphemes, words or multi-word expressions. The basic unit of a lexical resource is usually called a lexical entry. Some lexical resources can be used by humans directly while other lexical resources are machine readable. Lexical resources are the base of most Natural

Language Processing (NLP) applications.

There are many types of lexical resources. Based on its type, a lexical resource can provide syntactical, morphological, phonological or semantic information. Unilingual dictionaries, bilingual dictionaries and wordnets are examples of lexical resources. There are a few fortunate languages, such as English and Chinese, which have a relatively large number of high quality lexical resources. These languages are usually called resource-rich.

Most high quality lexical resources for the resource-rich languages have been painstakingly created by researchers manually over many years. Unfortunately, most other existing human languages lack many such lexical resources. Languages which lack lexical and other resources are called resource-low or resource-poor languages. While some of these languages have some resources, there are many other languages that barely have any resources. Especially poor in this context are the endangered languages around the world.

One important resource that is very helpful in computational processing and in human language learning is a thesaurus providing synonyms and antonyms of words. An extended version of a thesaurus that provides additional relations among words in the computational context is usually called a wordnet. A wordnet is a structured lexical ontology of words that groups words based on their meaning using sets that are called synsets. For example, the words helicopter, chopper, whirlybird and eggbeater are grouped in one synset that has the gloss, an aircraft without wings that obtains its lift from the rotation of overhead blades. The wordnet connects synsets with each other based on semantic relations. Wordnets are used in many applications such as word sense disambiguation, information retrieval, text classification and text summarization.
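To make the synset structure concrete, the following minimal Python sketch queries the Princeton WordNet through the NLTK interface (assuming the nltk package and its wordnet corpus are installed); it is only an illustration of the data model, not part of the methods developed in this dissertation.

# Minimal illustration of PWN synsets and semantic relations via NLTK.
# Assumes the nltk package is installed and nltk.download('wordnet') has been run.
from nltk.corpus import wordnet as wn

for synset in wn.synsets('helicopter'):
    print(synset.name())                           # e.g. 'helicopter.n.01'
    print(synset.lemma_names())                    # the synonyms grouped in this synset
    print(synset.definition())                     # the gloss of the synset
    print([h.name() for h in synset.hypernyms()])  # semantic relation to a broader synset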

The Princeton WordNet (PWN) is the original English version of such a wordnet and has been produced with diligent manual work augmented by the development of computational tools, over several decades at Princeton University. Similar complete wordnets have also been produced for a small number of additional languages such as French (Sagot and

Fišer, 2008), Finnish (Lindén and Carlson, 2010) and Japanese (Kaji and Watanabe, 2006).

Efforts to produce wordnets for a variety of other languages have been proposed, but most are moving slowly, such as the effort to construct the Asian wordnets (Charoenporn et al.,

2008) and Indian wordnets (Bhattacharyya, 2010).

Another important type of resource is the bilingual dictionary, an essential tool for

human language learners. Most existing (online) bilingual dictionaries are between two

resource-rich languages or between a resource-rich language and a resource-poor language.

It is fortunate that many endangered languages have one bilingual dictionary, created usually by explorers, evangelists or other scholars. However, dictionaries or translators for translations between two resource-poor languages do not really exist. Wiktionary1, a dictionary created by volunteers, supports over 172 languages, although coverage is poor for many of them. The online translation machines developed by Google2 and Microsoft3 provide pairwise translations, including translations for single words, for 103 and 53 languages, respectively. While this is a wide range of languages, these machine translators still leave out many widely-spoken languages, not to mention endangered ones. 7097 languages are spoken in the world today (Gordon and Grimes, 2005), of which 400 are spoken by at least a million people.

In previous work we focused on developing new techniques that leverage existing resources for resource-rich languages to build bilingual dictionaries, core wordnets and other resources such as simple translators for resource-poor languages, including a few endangered ones (Lam et al., 2014a,b, 2015b). In this thesis work, we take these resources to the next level by improving the functionality, quality and coverage of these resources.

We present several new techniques that we did not use in our previous work. Our ultimate goal is to produce an integrated multilingual lexical resource available online, one that

1http://en.wiktionary.org/wiki/Wiktionary:Main_Page
2http://translate.google.com/
3http://www.bing.com/translator

includes several important individual resources for several languages. We believe that our

resources will help researchers, speakers, learners and other users of these languages.

1.2 Research Focus

The goal of this dissertation is to create and make available multilingual lexical resources for several languages by bootstrapping from a limited number of existing resources.

Our study has the potential not only to construct new lexical resources, but also to provide

support for communities using languages with limited resources. Additionally, our research presents novel approaches to generate new lexical resources from a limited number

of existing resources.

The main focus of our work is to collect data from disparate sources, develop algorithms for mining and integrating such data, produce lexical resources, and evaluate the

resources with regard to the quality and quantity of entries. To develop and test our ideas, we

work with a few languages with in-house expertise. These include Assamese (asm), Arabic

(arb), English (eng) and Vietnamese (vie). In Chapter 2 we present a detailed introduction to Arabic. Next, we present a brief introduction to Assamese and Vietnamese.

1.2.1 Assamese Language

Assamese is an Indo-European language that is spoken by more than 15 million people (Hinkle et al., 2013). It is mainly used in the Indian states of Assam, Arunachal

Pradesh, Meghalaya, Nagaland and West Bengal. Assamese has 4 dialects: Standard Assamese, Jharwa, Mayang and Western Assamese (Gordon and Grimes, 2005). We present a brief description of the script, morphology and syntax of Assamese.

1.2.1.1 Assamese Script

Assamese script consists of 37 consonants, 11 vowels, 147 conjuncts and a few punctuation marks (Hinkle et al., 2013). Unlike English where the written letters might have

variable pronunciation, Assamese written letters have one pronunciation. A consonant that

does not occur at the end of a word is assumed to have an implicit vowel a following it. However, when several consonants need to be pronounced together, they are usually written using a new conjunct letter.

When a vowel follows a consonant, the vowel is not written explicitly, but implicitly as an operator. These operators are attached to consonants in different positions (Hinkle et al., 2013). They can appear to the left, right, below or above the consonants. Foreign words can appear in Assamese script as transliteration. However, it is not unusual to write foreign words in foreign alphabets within a piece of Assamese text.

1.2.1.2 Assamese Morphology

Assamese morphology has two types of morphological transformations: derivational and inflectional. Around 48% of the Assamese words are constructed using those two types of transformation (Sharma et al., 2008). The derivational transformation in Assamese is usually performed by changing the vowel component in the word, while the inflectional transformation is performed by adding prefixes or suffixes to the word. Assamese is well-known for its complex suffixes. It is common in Assamese that a word includes a sequence of suffixes. Four to six suffixes in sequence are not uncommon (Saharia et al., 2009).

In Assamese, suffixes are used for many purposes. The most common purpose of suffixes is determination (Sharma et al., 2008). In fact, a large number of the Assamese suffixes are determiners. As in other languages, some determiners are attached to nouns and pronouns to make them specific. This is similar to using this and that in English.

Unlike in many other languages, such as English, where affixes are used, determiners in

Assamese are also used to transform singular nouns into plurals.

1.2.1.3 Assamese Syntax

Assamese has a relatively free syntax, which means that it is considered a free word order language. This means that sentences could be written in different word orders and still have the same meaning. The normal form of a simple Assamese sentence is Subject+Object+

Verb (SOV) (Sarma, 2012), although other orders are acceptable.

1.2.2 Vietnamese Language

Vietnamese, the first language of Vietnam, is an Austroasiatic language that arose in Indo-China (Thompson, 1987). It is the first language of more than 75 million people living in Vietnam (Gordon and Grimes, 2005). Also, due to emigration, it is the first language of many people living around the world, especially in East and Southeast Asia.

Vietnamese, which is also called Annamese, has five main dialects that differ mainly in their sound systems. The five main dialects of Vietnamese are: Northern Vietnamese,

North-central Vietnamese, Mid-Central Vietnamese, South-Central Vietnamese and Southern Vietnamese (Wikipedia, 2016a). In the next sections, we present a brief description of the script, morphology and syntax of Vietnamese.

1.2.2.1 Vietnamese Script

Old Vietnamese texts were written using Chinese characters. In the 17th century, the

Latin alphabet was introduced to Vietnamese by the French. By the beginning of the 20th century, the Romanized version of Vietnamese became dominant (Thompson, 1987).

Compared to other languages, Vietnamese has a large number of vowels. It has 11 single vowels in addition to three types of composed vowels: centering diphthongs, closing diphthongs and triphthongs (Gordon and Grimes, 2005). These vowels are created by combining single vowels together. Vowels are modified by diacritics. The diacritics, which can be written above or below a vowel, are used to specify the tone of the vowel.

These tones have different lengths, pitch heights, pitch melodies and phonations. There are

25 consonants in Vietnamese. Consonants are represented in written script by a variable number of letters. Some of the consonants are represented using one letter and other consonants are represented by a digraph, which is a combination of two letters. There are some consonants which are represented by more than one digraph or letter (Wikipedia, 2016a).

1.2.2.2 Vietnamese Morphology

In Vietnamese, the majority of words are polysyllabic words (Noyer, 1998). Polysyllabic words are words composed of two or more syllables. The construction of polymorphemic words in Vietnamese is done in three ways: combining two words, adding affixes to a stem or reduplication. Words formed using reduplication morphology are constructed by duplicating a word or a part of a word. There are a small number of affixes in Vietnamese.

Most of them are in the form of prefixes and suffixes. One distinct characteristic of Vietnamese is that it does not have any number, gender, case or tense distinction (Wikipedia,

2016b). However, usually a noun classifier is used as a determiner and is added after the word to specify those characteristics.

1.2.2.3 Vietnamese Syntax

Vietnamese sentences follow the Subject+Verb+Object (SVO) word order. To distinguish between verbs and nouns in a Vietnamese sentence, a copula is used before the nouns. Noun phrases are usually composed of a noun and a modifier. The modifier can be a numerator, classifier, prepositional phrase or other description word. As in other languages, pronouns are used to substitute for nouns and noun phrases.

1.3 Research Contributions

The resources created by Khang's PhD dissertation (Lam, 2015) and reported in (Lam et al., 2014a,b, 2015b) have many holes. For example, the wordnets have only synsets, which are sets of synonyms for words. In this dissertation work, we develop algorithms and models to automatically establish the semantic relations between synsets in our previously created core wordnets for our languages of focus, using pre-existing resources as well as resources we create ourselves by bootstrapping. Following are the contributions produced by this thesis:

• We construct the rest of the structure for our core wordnets with acceptable quality. We focus on the construction of wordnet semantic relations such as Hypernyms,

Hyponyms, Member Meronyms, Part Meronyms and Part Holonyms between the

synsets. We believe that our work contributes significantly to the repository of resources for languages that lack them.

• We present a method to enhance the quality of wordnets, which we create in the first

task, by filtering the mistakenly created synsets and relations. In this task, we use one

of the state-of-the-art techniques, which is word embeddings (Mikolov et al., 2013).

This method gives a solution to the problem of wrong translations produced by the

translation method.

• We produce an approach to create a vector representation for synsets. This approach

aims to produce a better way for representing meaning. This representation can be

used in several areas. In this task we use it to automatically extract glosses from

corpora for wordnet synsets we create in the previous tasks. It can also be used in

the word-sense disambiguation (WSD) problem which occurs with words that have

multiple meanings.

• Then, based on the vector representation of synsets, we present a novel approach

to add a gloss for each synonym set (synset) in our core wordnets. A gloss is a

definition or a sentence that clarifies the meaning of the synset. Glosses are mostly

added manually by humans or automatically generated using a rule-based generation

approach (Cucchiarelli et al., 2004).

• Finally, we present LexBank, which is a system that makes our created resources available to the public. We design and implement the system such that it provides useful services for users who seek linguistic resources in a friendly manner. We aim to make our system flexible and expandable so it can accommodate additional new languages and resources.

Chapter 2

CASE STUDY: THE CURRENT STATUS OF AND CHALLENGES IN

PROCESSING INFORMATION IN ARABIC

Since Arabic is one of the languages we use in our experiments throughout this dissertation, we present the current status of Arabic language processing as an example in this chapter.

2.1 Introduction

According to Ethnologue (Gordon and Grimes, 2005), Arabic is the official language of more than 223 million people in 25 countries, which makes it one of the most widely-spoken languages in the world. Arabic is the language of Islam, which is the religion of

1.6 billion people around the world. Muslims are required to use Arabic to read the Quran

(the Holy Book of Islam) and to perform the rituals of Islam. There are around 30 major dialects in Arabic. These dialects have different phonologies, morphologies, syntax and even vocabularies (Habash, 2010). However, these dialects are not used as official languages by themselves. They are used for informal speech. For formal writing and speaking, the official Modern Standard Arabic (MSA) is used. MSA was developed based on Classical

Arabic, which is the language of historical literature. However, dialects are commonly used for writing nowadays in social media, but they are rarely used in books, newspapers and in literary writing. Most Arabs can speak MSA; however, it is not the natively spoken language of any region (Diab and Habash, 2007). This coexistence between MSA and dialects is problematic for Arabic language processing. This happens to be a problem in most widely spoken languages in the world (Haugen, 1966).

One important survey (Farghaly and Shaalan, 2009) discussed the importance of research in the field of Arabic processing from three perspectives. First, the perspective of non-Arabic speakers who need to process a huge amount of Arabic texts. The Department of Homeland Security in the United States is a good example. With increasing security risks, there is a crucial need to be able to understand the meaning of Arabic documents and retrieve important information from them such as names, organizations and places. The second perspective is that of Arabic speakers. Machine translation, retrieving information, summarization, and linguistic tools are some of the applications which are requested by

Arabic speakers.

In the rest of this chapter, we give a summary of the features that make the processing of Arabic text so challenging and some of the solutions and resources that have been designed to address these challenges. First, in Section 2, we discuss the fundamental issues in Arabic which are the script, the morphological issues, and the syntactical issues. Then, in Section 3, we discuss three of the most valuable resources for Arabic processing. These are The Penn Arabic Treebank (PATB), The Prague Arabic Dependency Treebank (PADT), and The Columbia Arabic Treebank (CATiB).

2.2 Fundamentals of Arabic

In this section we discuss the script, morphology and syntax of Arabic.

2.2.1 Arabic Script

Arabic is written as a right to left script. The Arabic script is also used by languages such as Kurdish, Urdu, Persian and Pashto (Habash, 2010). One important aspect of Arabic is that most of Arabic letters are composed of two parts: a base form and a mark. There are three kinds of marks in Arabic letters. The first kind consists of dots which are used to distinguish between letters that share the same base form. An example of letters that tha”. The second kind“ (ﺙ) ta”, and“ (ﺕ),”ba“ (ﺏ) share the same base form are the letters u”, or“ (ﺅ) which can be written above some letters, as in (ﺀ) of mark is the Hamza mark I”. Unfortunately, people often misspell words by not writing“ (ﺇ) under some letters, as in such marks making it hard to distinguish between similar letters and causing ambiguity in can also be considered a letter by (ﺀ) the text. It is also important to notice that Hamza which means (ﺳﻤﺎﺀ) itself. An example of a word that has the Hamza letter is the word ”Kaf“ (ﻙ) sky”. The third kind of mark is the Hamza mark that distinguishes the letter“ .”Lam“ (ﻝ) from the letter

Most letters in Arabic have several shapes. The shape of a written letter is determined based on the position of that letter in the word. Let us take the letter (ﻕ) "qaf" as an example. If it appears at the beginning of the word, it will have the shape (ﻗـ), whereas it will have the shape (ـﻘـ) if it appears in the middle of the word, and the shape (ـﻖ) if it is at the end of the word. All word processors select the appropriate letter shape based on the rules which govern these shapes, and therefore, there is only one key for each letter.

Inflectional morphology is also a factor that governs the shape of some Arabic letters. The Arabic letter "Hamza" is a good example of that. The word (ﺃﺻﺪﻗﺎﺀ), which means "friends", becomes (ﺃﺻﺪﻗﺎﺋﻲ) instead of (ﺃﺻﺪﻗﺎﺀﻱ) when we add the letter (ﻱ), which means the possessive pronoun "my".

In Arabic, each letter is mapped to one unvarying sound, which makes it a phonetic language. For example, the Arabic letter (ﺳـ) always has the pronunciation /s/. On the other hand, the letter "s" in English has three pronunciations: /z/, /s/, and /sh/ as in "nose", "salt", and "sugar", respectively. However, in Arabic a short vowel may be added to the letter to change its sound. There are three short vowels in Arabic, which means that each letter has three more sounds in addition to the original sound. There are no dedicated letters to represent short vowels. The short vowels may be specified in the written language using optional diacritics. To show how the short vowels change the sound of Arabic letters, let us look at the Arabic letter (ﺳـ) again. We said that (ﺳـ) is pronounced as /s/; however, if we add the short vowel "Dhamma" it will be pronounced as "su" and it may be written, with the "Dhamma" diacritic, as (سُـ). If we add the short vowel "Kasra", it will be pronounced as "si" and it may be written with the "Kasra" diacritic, as (سِـ). Keep in mind that in MSA, the writing of the diacritics is optional, although a change in a diacritic of a letter can change the meaning of the word and may even change the morphological structure of the sentence. Clearly, this is a major source of ambiguity in Arabic processing (Diab and Habash, 2007).

Obviously, with all these problems caused by the Arabic script, Arabic input text has to be pre-processed to enhance recognition during the actual processing. This pre-processing, which is called normalization, aims to standardize different Arabic script variations. There are several solutions which have been proposed to normalize the Arabic script.

For example, (Larkey et al., 2002) normalized the corpus, the queries, and the dictionaries of Arabic using the following steps. They first unified the encoding and removed punctuations in the text. Then they removed all the diacritics and the non-letters called "tatweel". After that, they removed the Hamza mark (ﺀ) from the letter "Alif" to standardize all the variations (ﺃ), (ﺇ), and (ﺁ) to (ﺍ). Also, they replaced (ﻯﺀ) with (ﺉ), (ﻯ) with (ﻱ), and (ﺓ) with (ﻩ). The Stanford Natural Language Processing Group adopted a similar procedure in the Stanford Arabic Statistical Parser (Green and Manning, 2010). The normalization process, as you might expect, does not come without a price. Since the purpose of all these removed marks is to clarify ambiguity, the normalization of the variant scripts causes the ambiguity probability to increase (Farghaly and Shaalan, 2009).
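The short Python sketch below illustrates this kind of script normalization; the character ranges and replacement choices are our own simplification in the spirit of the steps above, not the exact procedure of (Larkey et al., 2002).

# Illustrative Arabic script normalization (a simplified sketch of the steps
# described above; the exact rules used by Larkey et al. may differ).
import re

DIACRITICS = re.compile('[\u064B-\u0652]')     # fathatan .. sukun
TATWEEL = '\u0640'                             # the elongation (tatweel) character

def normalize(text):
    text = DIACRITICS.sub('', text)            # drop the optional diacritics
    text = text.replace(TATWEEL, '')           # drop tatweel
    text = re.sub('[\u0622\u0623\u0625]', '\u0627', text)  # Alif variants -> bare Alif
    text = text.replace('\u0649', '\u064A')    # Alif maqsura -> Ya
    text = text.replace('\u0629', '\u0647')    # Ta marbuta -> Ha
    return text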

Unlike English and other languages, there are no silent letters in Arabic. An example

of a silent letter in English is the letter “p” in the word “pneumonia”. There are no new

sounds in Arabic produced by combining two letters. For instance, in English, “c” and “h”

are combined to produce three distinct sounds: the sound at the beginning of “cheese”, the

sound at the beginning of “character”, and the sound at the beginning of “chef.”

It is well known that the process of splitting text into sentences is an essential step

in many Natural Language Processing (NLP) applications. In English, this is a relatively easy task since English sentences start with an uppercase letter and finish with a period.

However, splitting Arabic sentences is not as easy as in English since there is no capital

form for Arabic letters (Chinese, Japanese, and Korean have no capitalization either). In addition, punctuation rules in Arabic are not strict, so many people do not use them properly. In

fact, Arabic writers excessively use coordinations, subordinations and logical connectives

to conjoin the sentences (Farghaly and Shaalan, 2009). Hence, it is not unusual for an

Arabic article to have a complete paragraph which does not include any periods other than the period at the end of the paragraph. Therefore, texts in Arabic must go through complicated preprocessing.

The lack of capitalization obviously makes it hard to detect named entities (Darwish,

2013) which is an essential part of Information Retrieval (IR). In English, extracting named entities such as cities, names of people, addresses and organizations is done with the help of capitalization and punctuation. For example, to recognize a name like “Barack H. Obama”, a simple algorithm can be used to search for an uppercase word followed by an initial with an optional period followed by an uppercase letter. We are not claiming that NER in English is straightforward or simple in general, but since Arabic does not have these features, new methods must be used to address the problem of named entity recognition

(Darwish, 2013).
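For illustration only, the toy pattern below encodes the uppercase-word / optional-initial / uppercase-word heuristic just described; it is a hypothetical sketch of how far capitalization alone can carry English name spotting, and no comparable orthographic cue exists in Arabic script.

# Toy sketch of the capitalization heuristic described above (illustrative only;
# real English NER is considerably more involved than this pattern).
import re

NAME_PATTERN = re.compile(r'\b[A-Z][a-z]+(?:\s+[A-Z]\.?)?\s+[A-Z][a-z]+\b')

print(NAME_PATTERN.findall('He met Barack H. Obama in Washington yesterday.'))
# -> ['Barack H. Obama']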

2.2.2 Arabic Morphology

Arabic has a very rich and complex morphology (Attia, 2008). Similar to the other

Semitic languages, morphology in Arabic is of two types, derivational and inflectional.

Derivational morphology is the process of creating new words. This is done by mapping a root to a pattern. The root holds the meaning while the pattern changes the structure of the root generating a new word with a different part-of-speech. This type of derivational morphology is called nonlinear morphology (Bhattacharya et al., 2005). On the other hand, inflectional morphology is the process of modifying the words with features to create plural, feminine, or definite forms of the word (Habash, 2010).

A morpheme is "a linguistic form which bears no partial phonetic-semantic resemblance to any other form" (Bloomfield, 1933). Morphological processing in NLP is the process of decomposing a word into morphemes. This is a relatively easy task in concatenative morphology. However, in languages with nonconcatenative morphology, like

Arabic, it is a much harder task. In Arabic, words are built by merging a consonantal root and a vocalism (McCarthy, 1981). The root holds a semantic field while the vocalism specifies the grammatical form. An example showing the nonconcatenative morphology of Arabic would be the word (ﻛﺘﺐ) "katab", which means "to write". It is composed by associating the root (ﻛﺘﺐ) /k-t-b/, which has the meaning of "writing".

Several approaches have been used to decompose Arabic words. The first approach recovers the root by extracting all prefixes and suffixes from the word, then, stripping the rest of the word using a lexicon of roots (Hlal, 1985). This approach is very common; however, it requires a lexicon of all possible Arabic roots, prefixes, infixes and suffixes

(Beesley, 1996; Shaalan et al., 2006). Buckwalter introduced another approach in his morphological analyzer (BAMA) (Buckwalter, 2004). Rather than recovering the root, BAMA recovers the stem and considers it the main building block for the Arabic word. The stem is recovered by just removing the prefixes and the suffixes. Therefore, BAMA decomposes the Arabic word into three parts: Arabic stems, Arabic prefixes and Arabic suffixes.

The decomposition process searches for the prefixes and the suffixes in the word that satisfy constraints governing the possibility of combining them with the stem in the word. BAMA has a bidirectional transliteration schema from Arabic script to Latin script.

That means that developers can work with unstructured Arabic texts without any Arabic language knowledge. For this reason, many recent statistical ANLP systems use BAMA as the foundation for machine translation and information retrieval. However, BAMA has the limitation of giving a general analysis that includes all possible cases of the word without considering the context of the input text. A more refined result can be obtained using a disambiguation module that considers the context of the input text after eliminating the incorrect analyses (Habash and Rambow, 2005).
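As a rough, hypothetical illustration of the lexicon-driven decomposition idea discussed above (the first, root-lexicon approach rather than BAMA itself), the sketch below strips one prefix and one suffix and accepts the remainder only if it appears in a toy stem lexicon; the affix lists and lexicon entries are invented for the example.

# Hypothetical, highly simplified affix stripping in the spirit of the
# lexicon-based decomposition described above (toy affix lists and lexicon).
PREFIXES = ['ال', 'و', 'ب', 'ل']        # e.g. al-, wa-, bi-, li-
SUFFIXES = ['ها', 'ه', 'ات']            # e.g. -ha, -hu, -at
STEMS = {'كتب'}                          # toy stem lexicon (k-t-b, "writing")

def decompose(word):
    """Return (prefix, stem, suffix) if the stripped stem is in the lexicon."""
    for prefix in [''] + PREFIXES:
        for suffix in [''] + SUFFIXES:
            if word.startswith(prefix) and word.endswith(suffix):
                stem = word[len(prefix):len(word) - len(suffix)]
                if stem in STEMS:
                    return prefix, stem, suffix
    return None

print(decompose('كتبها'))   # -> ('', 'كتب', 'ها')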

Dialectal Arabic differs from MSA morphologically, lexically and phonologically

(Habash et al., 2013). Furthermore, there are no standard orthographies and no language academies in dialectal Arabic. Therefore, the tools and resources designed for MSA do not work with dialectal Arabic. Recently, several research efforts have focused on Arabic dialectal texts (Habash and Rambow, 2005; Habash et al., 2013; Zaidan and Callison-

Burch, 2014). The state-of-the-art dialectal Arabic morphological analyzer is the Columbia

Arabic Language and dialect Morphological Analyzer (CALIMA) (Habash et al., 2013).

Arabic is an agglutinative language, which means that Arabic words usually include affixes and clitics that represent different parts-of-speech. Let us take the word (كَتَبتُه) "katabto ho", which means "I wrote it". This word is a verb that has the subject and the object attached to it. The subject is the diacritic on the fourth letter (ﺕ) "ta", while the object is the suffix (ـﻪ) "ha". This is just a simple example whereas words usually have more complex structures that include other clitics to specify the gender, person, number and voice. Hence, due to complex phonological rules, the decomposition of words in Arabic is relatively more difficult in comparison to other languages.

2.2.3 Arabic Syntax

According to (Habash, 2010), there are two kinds of sentences in Arabic: sentences that start with a verb (V-SENT), and sentences that start with a noun (N-SENT). Verb-subject-object (VSO) is the primary structure of a V-SENT sentence in the Classical and Modern

Standard Arabic. However, the object-verb-subject (OVS) and subject-verb-object orders are also commonly used. As we mentioned before, Arabic is a pro-drop language, which means that subjectless sentences are perfectly grammatical in Arabic. Also, unlike English, the use of equational sentences like "He a journalist" is allowed without the need of a "to be" verb. Russian, Hungarian, Hebrew, and Quechuan languages also allow this type of sentence.

In Arabic, the structure of constituent questions is usually composed by starting with a wh-phrase. However, it is grammatically correct if the constituent question does not start with the wh-phrase. For example, the question (ﺃﻛﻠﺖ ﻣﺎﺫﺍ ﺑﺎﻷﻣﺲ؟) literally means "you eat what yesterday?". Furthermore, relative clauses in Arabic are connected using relative pronouns. For example, in the sentence (ﺍﺣﺒﺒﺖ ﺍﻟﺒﻴﺖ ﺍﻟﺬﻱ ﺍﺷﺘﺮﻳﺘﻪ) there are two clauses: (ﺃﺣﺒﺒﺖ ﺍﻟﺒﻴﺖ), which means "I liked the house", and (ﺍﻟﺬﻱ ﺍﺷﺘﺮﻳﺘﻪ), which means "which I bought". The two clauses are connected using the relative pronoun (ﺍﻟﺬﻱ), which means "which". The Arabic relative pronoun must agree with the noun which it modifies in the second clause in number and gender.

2.3 Summary

In this chapter, we presented a short overview of information processing in Arabic.

We summarized challenges that face developers and researchers when processing Arabic text due to many of its features. The lack of capitalization, dropped subjects, missing short vowels and the nonconcatenative nature are some of these features. In addition, there are many dialects in Arabic, which are used in informal speaking and writing. These dialects must be treated differently when processing Arabic texts. Much research has been conducted to address the challenges of Arabic processing. Some valuable resources and techniques have been presented for Arabic. However, more work needs to be done to give Arabic developers and speakers the support they need.

Chapter 3

LITERATURE REVIEW

In this chapter, we provide a summary of the main existing approaches for creating lexical resources. We focus on two types of lexical resources: wordnets and bilingual dictionaries.

3.1 Automatic Construction of Wordnets

A wordnet is a lexical ontology of words. There are two ways to construct wordnets for languages that do not possess such resources: manual construction and automatic construction. We intend to use automatic construction using core wordnets we have created in our earlier work (Lam et al., 2014a,b, 2015b) and other existing resources that are freely available. Other efforts are underway to manually (or mostly manually) create wordnets in a variety of languages, although progress seems slow all around.

High-quality wordnets have been developed for a small number of languages. Wordnets, other than the Princeton WordNet (Fellbaum, 1998; Miller, 1995), are typically constructed by one of two approaches. The first approach, which is called the expansion approach, translates the PWN to target languages (Akaraputthiporn et al., 2009; Barbu, 2007;

Bilgin et al., 2004; Kaji and Watanabe, 2006; Lam et al., 2014b; Lindén and Niemi, 2014;

Oliver and Climent, 2012; Sagot and Fišer, 2008; Saveski and Trajkovski, 2010). In contrast, the second approach, which is called the merge approach, builds the semantic taxonomy of a wordnet in a target language, and then aligns it with the Princeton WordNet by generating translations (Borin and Forsberg, 2014; Gunawan and Saputra, 2010; Maziarz et al., 2013; Rigau et al., 1998). To construct the taxonomic relations between words,

first, definitions of words are retrieved from machine readable dictionaries. Then, a genus disambiguation process, which is the process of finding a word with a broad meaning that more specific words fall under, is performed using the definitions to construct a hierarchical class of concepts. Next, classes are merged with the synsets in the PWN using a bilingual dictionary to form the target wordnet.

The expansion approach dominates the merge approach in popularity. Wordnets generated using the merge approach may have different structures from the Princeton WordNet. In contrast, wordnets created using the expansion approach have the same structure as the Princeton WordNet, which provides for a level of uniformity among them, possibly at the cost of some natural language-specific expressiveness (Leenoi et al., 2008).

Many approaches to construct wordnets are semi-automatic and, therefore, can be used only for languages that have some existing lexical resources. Therefore, any attempt to build wordnets for resource-poor languages using these methods would be doomed from the start. Moreover, while wordnets are always difficult to evaluate, it is even harder to evaluate machine-created wordnets in resource-poor languages because these languages do not have gold standards to compare with, and frequently do not have easily-accessible experts to evaluate such resources.

Crouch (Crouch, 1990) clusters documents using a complete link clustering algorithm and generates thesaurus classes or synonym lists based on user-supplied parameters.

Curran and Moens evaluate the performance and efficiency of thesaurus extraction methods and also propose an approximation method that provides for better time complexity with little loss in accuracy (Curran and Moens, 2002a,b). Ramirez and Matsumoto develop a multilingual Japanese-English-Spanish thesaurus using two freely available resources:

Wikipedia and the Princeton WordNet (Ramírez et al., 2013). They extract translation tuples from Wikipedia articles in these languages, disambiguate them by mapping to wordnet senses, and extract a multilingual thesaurus with a total of 25,375 entries. One thing we must note about all these approaches is that they are resource-hungry, requiring a large corpus of Wikipedia or non-Wikipedia documents and wordnets. For example, Lin works with a 64-million word English corpus to produce a high quality thesaurus with about 10,000 entries (Lin, 1998). Ramirez and Matsumoto have the entire Wikipedia at their disposal with millions of articles in three languages, although for experiments they use only about

13,000 articles in total (Ramírez et al., 2013). Furthermore, (Miller and Gurevych, 2014) work with more than 19 thousand Wiktionary senses and 16 thousand Wikipedia articles to produce a three-way alignment of WordNet, Wiktionary, and Wikipedia. When we work with low-resource or endangered languages, we do not have the luxury of collecting such big corpora or accessing even a few thousand articles from Wikipedia or the entire

Web. Many such languages have no or very limited Web presence. As a result, we have to work with whatever limited resources are available.

In this work we introduce approaches to generate synonyms, hypernyms, hyponyms and some other semantic relations. To enhance the quality of wordnets we create, we present several approaches to measure relatedness between concepts or words. Some potential approaches for measuring semantic relationships are a dictionary-based approach

(Kozima and Furugori, 1993) and thesaurus-based approach (Hirst and St-Onge, 1998).

Oliver (Vossen, 1998) presented approaches for constructing wordnets using the expand model and made them available through a Python toolkit (Oliver, 2014). The authors designed three strategies that use three types of resources to construct wordnets: dictionaries, semantic networks (Navigli and Ponzetto, 2010) and parallel corpora. While the construction approaches of wordnets using dictionaries and semantic networks were direct, the authors used machine translation and automatic sense-tagging to construct their wordnets using parallel corpora. A toolkit1 provides access to the three construction methods besides access to some freely available lexical resources. To test their dictionary based approach, the authors constructed wordnets for six languages: Spanish, Catalan, French,

Italian, German and Portuguese with precision between 48.09% and 84.8%. Using their semantic network based approach, the authors constructed wordnets for the six languages with precision between 49.43% and 94.58%. The parallel corpus based approach with machine translation achieved precision between 70.26% and 93.81%, while with automatic sense-tagging it achieved between 75.35% and 82.44%. The authors stated that their automatically-calculated precision value is very prone to errors.

Another example of constructing wordnets using dictionary based methods is JAWS

(Mouton and de Chalendar, 2010). JAWS is a French wordnet for nouns constructed by translating wordnet nouns using a bilingual dictionary and syntactic language model. The construction of JAWS starts with copying the structure (the synsets with no words) of the source wordnet. Then, the phrases that are available in the bilingual dictionary are used to

1http://lpg.uoc.edu/wn-toolkit 25

fill out the initial synsets. Finally, the language model is used to incrementally add new phrases to JAWS. An improved version of JAWS is called WoNeF (Pradet et al., 2014).

The new improved wordnet includes parts of speech and was evaluated using a gold standard produced by two annotators. In addition, WoNeF uses a better translation selection algorithm that uses machine learning to select variable thresholds for translations.

In (Lam et al., 2014b), we presented three methods to construct wordnet synsets for several resource-rich and resource-poor languages. We used some publicly available wordnets, a machine translator and a single bilingual dictionary. Our algorithms translate synsets of existing wordnets to a target language T, then apply a ranking method on the translation candidates to find the best translations in T. The approaches we used are applicable to any language which has at least one existing bilingual dictionary translating from English to it.

In the first approach, which we call the direct translation approach (DR), for each synset in PWN, we directly translate the words from English to the target language. In the second approach, which we call IW, we extract candidates from several intermediate wordnets rather than just using PWN to disambiguate the translation. In the third approach, which we call IWND, we try to reduce the number of bilingual dictionaries we use in the second approach. When the intermediate wordnet is not PWN, we translate the extracted words from the wordnets to English, and then we use a single bilingual dictionary to translate the words from English to the target language. In all of these methods, after extracting the candidates, we use a ranking method to select the best translations and insert them as a synset in the target wordnet.

3.2 Wordnet Management Tools

Library        URL
CICWN          http://fviveros.gelbukh.com/wordnet.html
extJWNL        http://extjwnl.sourceforge.net/
Javatools      http://www.mpi-inf.mpg.de/yago-naga/javatools/
Jawbone        http://sites.google.com/site/mfwallace/jawbone/
JawJaw         http://www.cs.cmu.edu/~hideki/software/jawjaw/
JAWS           http://lyle.smu.edu/~tspell/jaws/
JWI            http://projects.csail.mit.edu/jwi/
JWNL           http://sourceforge.net/apps/mediawiki/jwordnet/
URCS           http://www.cs.rochester.edu/research/cisd/wordnet/
WNJN           http://wnjn.sourceforge.net/
WNPojo         http://wnpojo.sourceforge.net/
WordnetEJB     http://wnejb.sourceforge.net/

Table 3.1. A list of the Java libraries tested in (Finlayson, 2014).

Maintaining wordnets is an important area of research. The manual construction of a wordnet is an intensive process that requires a large number of specialists to work for several years. Furthermore, a wordnet is not static. The meanings of many phrases change through time and new phrases appear every year. For example, the country Sudan was divided into two countries, Sudan and South Sudan, in 2011. If one searches the PWN 3.1 for

Sudan, only the senses corresponding to the old Sudan show up since the new sense has not yet been added. Moreover, the representation of wordnets evolves over time. For example, many old wordnets were upgraded to provide the XML representation. In addition, as this section shows, many wordnets are built based on the PWN. Every time PWN gets updated, these wordnets must be updated also to preserve the alignment with PWN. All the previous issues show the need for wordnet maintenance tools.

One recent work on tools for maintaining wordnets is by (Mladenovic et al., 2014).

The tools are designed to provide upgrade, cleaning, validation, search, import and export functionalities for the Serbian wordnet (Christodoulakis et al., 2002). Another recent work develops a Java library, which is called JWI, for accessing the PWN and compares it with eleven other libraries (Finlayson, 2014). The comparison between the libraries was based on five features: special requirements, used similarity metrics, ability to edit the wordnet, whether they need to work with the Maven project or not, and forward-compatibility with Java. Table 3.1 shows the tested libraries and Table 3.2 shows a summary of the comparison.

Library        Similarity Metrics   Maven   Editing   Standalone   Minimum Java
CICWN          Yes                  No      No        No           1.6
extJWNL        No                   No      Yes       Yes          1.6
Javatools      Yes                  Yes     No        No           1.6
Jawbone        Yes                  Yes     No        No           1.6
JawJaw         Yes                  Yes     No        No           1.5
JAWS           Yes                  No      No        No           1.4
JWI            Yes                  Yes     No        No           1.5
JWNL           No                   Yes     No        Yes          1.4
URCS           Yes                  No      No        No           1.6
WNJN           No                   No      No        No           1.5
WNPojo         No                   No      No        No           1.6
WordnetEJB     No                   No      No        No           1.6

Table 3.2. A comparison between some of the Java libraries for accessing the PWN.

Another wordnet management tool was also presented recently for the IndoWordNet2

(Nagvenkar et al., 2014). The tool, which is called the Concept Space Synset Management

Tool3 (CSS), provides an interactive user interface for creating new language synsets and

linking them to other Indian language wordnets. The CSS tool uses a role-based access control to restrict the access to the wordnet. Figure 3.1 shows an overview of the CSS tool.
2http://www.cfilt.iitb.ac.in/indowordnet/
3http://indradhanush.unigoa.ac.in/conceptspace

Figure 3.1: An overview of the CSS management tool, adapted from (Nagvenkar et al., 2014)

Sense marking is the process of tagging words with senses in a corpus. It is a necessary

task in preparing training data for machine learning techniques. Since sense marking is an

intensive process, sense marking tools are very handy. For example, the Indian Institute

of Technology Bombay has developed a sense marker tool for the IndoWordNet (Prabhugaonkar et al., 2014). The sense marking tool shows a highlighted word in a piece of text

and asks the annotator to choose the most appropriate sense from the available senses. The

tool also allows the annotator to add new senses that do not exist in the wordnet.

3.3 Creating Bilingual Dictionaries

Bilingual dictionaries are essential lexical resources which we use in our approaches.

The majority of low-resource languages have bilingual dictionaries to provide phrase translation between them and rich-resource languages. However, only relatively few bilingual dictionaries are available for translation between low-resource languages. Several methods have been presented to automatically construct such dictionaries between low-resource languages. Since the wordnets we create in this dissertation are aligned with each other, we believe that they can be good resources for phrase translation between languages. In this section, we discuss some methods for automatically creating bilingual dictionaries.

Given two input dictionaries L1-Lp and Lp-L2, a naïve method to create a new bilingual dictionary L1-L2 may use Lp as a pivot using a straightforward transitive approach.

However, if a word has more than one sense, being a polysemous word, this method may introduce incorrect translations. After computing an initial bilingual dictionary, past researchers have used several approaches to mitigate the effect of ambiguity in word senses.

Methods used for disambiguation use wordnet distance between source and target words in some way, look at dictionary entries in both forward and backward directions and compute the amount of overlap to compute disambiguation scores (Ahn and Frampton, 2006; Bond and Ogura, 2008; Gollins and Sanderson, 2001; Lam and Kalita, 2013; Shaw et al., 2013;

Soderland et al., 2010; Tanaka and Umemura, 1994).
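For illustration, the sketch below performs the naïve transitive composition with hypothetical toy entries; as noted above, it blindly carries over every sense of a polysemous pivot word, which is exactly what the cited disambiguation methods try to filter out.

# Naive transitive composition of two bilingual dictionaries through a pivot
# language Lp (toy, hypothetical entries; real dictionaries are far larger).
l1_to_lp = {'meo': ['cat']}              # hypothetical L1 -> pivot entries
lp_to_l2 = {'cat': ['gato', 'felino']}   # hypothetical pivot -> L2 entries

def compose(d1, d2):
    merged = {}
    for source_word, pivot_words in d1.items():
        for pivot_word in pivot_words:
            # Every sense of the pivot word is carried over, including wrong
            # ones when the pivot word is polysemous.
            merged.setdefault(source_word, []).extend(d2.get(pivot_word, []))
    return merged

print(compose(l1_to_lp, lp_to_l2))       # -> {'meo': ['gato', 'felino']}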

Researchers have also merged information from several sources such as parallel corpora or comparable corpora (Nerima and Wehrli, 2008; Otero and Campos, 2010) and a wordnet (István and Shoichi, 2009; Lam and Kalita, 2013) to address the ambiguity problem. Some researchers extract bilingual dictionaries directly from monolingual corpora, parallel corpora or comparable corpora using statistical methods (Bouamor et al., 2013;

Brown, 1997; Haghighi et al., 2008; Héja, 2010; Ljubešic´ and Fišer, 2011; Nakov and Ng,

2009; Yu and Tsujii, 2009).

Obviously, the quality and quantity of existing resources strongly affect the accura- cies of newly-created dictionaries. For instance, Nerima and Wehrli create new English-

German and English-Italian bilingual dictionaries with 21,600 and 26,834 entries, respec- tively, from 76,311 entries in an English-French dictionary, 45,492 entries in a German-

French dictionary, and 36,672 entries in a French-Italian dictionary (Nerima and Wehrli,

2008). Given parallel corpora of Lithuanian consisting of 1,765,000 tokens and Hungarian including 2,121,000 tokens, Heja can extract only 2,616 correct translation candidates with accuracy over a certain threshold from 4,025 translation candidates (Héja, 2010). Thus, new bilingual dictionaries created using current approaches have very few entries com- pared to the size of the input dictionaries. Furthermore, most resource-poor languages do not have any corpora, or even online documents. Some languages have only one very small bilingual dictionary, such as the Karbi-English dictionary of 2,341 words.

In (Lam et al., 2015b), we present approaches to automatically build a large number of new bilingual dictionaries for low-resource languages, especially resource-poor and endangered languages, using a single input bilingual dictionary. Our algorithms produce translations of words in a source language to many target languages using publicly available wordnets and a machine translator (MT). Our approaches may produce any bilingual dictionary as long as one of the two languages is English or has a wordnet linked to the PWN. Using our approaches and starting with 5 available bilingual dictionaries, we created 48 new bilingual dictionaries. Of these, 30 pairs of languages are not supported by the popular MTs: Google (http://translate.google.com/) and Bing (http://www.bing.com/translator).

3.4 Summary

In this chapter, we have discussed the existing methods for the automatic construction of wordnets. We have also discussed several tools and systems for managing wordnets. Moreover, we covered some of the approaches for automatically creating bilingual dictionaries.

Chapter 4

AUTOMATICALLY CONSTRUCTING STRUCTURED WORDNETS

The core idea behind a wordnet is to group words which are synonyms, or roughly synonymous, into lexical categories that are called synsets. Then, semantic relations between these synsets are established in a hierarchical manner. In this chapter, we present a method to automatically construct wordnet semantic relations such as Hypernyms, Hyponyms, Member Meronyms, Part Meronyms and Part Holonyms using PWN.

4.1 Constructing Core Wordnets

In (Lam et al., 2014b), we introduced an approach, which we refer to as the IWND approach, that creates wordnet synsets with relatively high coverage. As Figure 4.1 shows, in IWND, to create wordnet synsets for a target language T we used existing wordnets and a machine translator (MT) and/or a single bilingual dictionary. First, we extracted every synset in Princeton WordNet (PWN) using the unique offset-POS key, which refers to the offset for a synset with a particular part-of-speech (POS). Notice here that each synset may have one or more words, each of which may be in one or more synsets. Words in a synset have the same sense. Then, we extracted the corresponding synsets for each offset-POS from existing wordnets linked to PWN, in several languages. Next, we translated the extracted synsets in each language to T to produce synset candidates using MT or a dictionary. Then, we applied a ranking method on these candidates to find the correct words for a specific offset-POS in T.

Figure 4.1: Creating wordnet synsets using the IWND algorithm (Lam et al., 2014b).

The ranking method we used in (Lam et al., 2014b) is based on the occurrence count of a candidate. Specifically, the rank of a word w, the so-called rank_w, is computed as below.

$$rank_w = \frac{occur_w}{numCandidates} \times \frac{numDstWordNets}{numWordNets}$$

where:

- numCandidates is the total number of translation candidates of an offset-POS,

- occur_w is the occurrence count of the word w among the numCandidates candidates,

- numWordNets is the number of intermediate wordnets used, and

- numDstWordNets is the number of distinct intermediate wordnets that have words translated to the word w in the target language.
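As a small illustration of this ranking, the sketch below computes rank_w for hypothetical candidate lists; the per-wordnet candidate dictionary and the toy words are assumptions that simply mirror the definitions above.

from collections import Counter

def rank_candidates(candidates_per_wordnet, num_wordnets):
    """Rank translation candidates for one offset-POS.

    candidates_per_wordnet: {intermediate_wordnet_name: [candidate words in T]}
    num_wordnets: total number of intermediate wordnets used
    """
    all_candidates = [w for cands in candidates_per_wordnet.values() for w in cands]
    num_candidates = len(all_candidates)
    occur = Counter(all_candidates)
    ranks = {}
    for w, occur_w in occur.items():
        # number of distinct intermediate wordnets that yielded w
        num_dst = sum(1 for cands in candidates_per_wordnet.values() if w in cands)
        ranks[w] = (occur_w / num_candidates) * (num_dst / num_wordnets)
    return ranks

# Hypothetical candidates produced by translating three intermediate wordnets.
print(rank_candidates({"wn_a": ["casa", "hogar"], "wn_b": ["casa"], "wn_c": ["casa"]}, 3))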

4.2 Constructing Wordnet Semantic Relations

Synsets in a wordnet are linked in a hierarchical fashion. The hierarchy in a wordnet

is established using the super-subordinate relation between synsets. For example, nouns

are linked using hyperonymy, which is a relation between a general synset and a specific

one. An example of a hyperonymy relation is the relation between the synsets {food,

solid_food} and {baked_goods}. The hyperonymy relation is transitive; for example, the

synset {bread}, which is a hyponym of the synset {baked_goods}, is also a hyponym of

the synset {food, solid_food}. Table 4.1 shows the semantic relations available in wordnet

(Wikipedia, 2015).

In (Lam et al., 2014b), we constructed core wordnets, which essentially means that

we created synsets with no connections between them. As Figure 4.2 shows, our goal is to

recover the taxonomy of synsets.

Phrase Type  Relation          Definition
Nouns        Hypernyms         Y is a hypernym of X if every X is a (kind of) Y
Nouns        Hyponyms          Y is a hyponym of X if every Y is a (kind of) X
Nouns        Coordinate terms  Y is a coordinate term of X if X and Y share a hypernym
Nouns        Meronyms          Y is a meronym of X if Y is a part of X
Nouns        Holonyms          Y is a holonym of X if X is a part of Y
Verbs        Hypernyms         The verb Y is a hypernym of the verb X if the activity X is a (kind of) Y
Verbs        Troponyms         The verb Y is a troponym of the verb X if the activity Y is doing X in some manner
Verbs        Entailments       The verb Y is entailed by X if by doing X you must be doing Y
Verbs        Coordinate terms  Those verbs sharing a common hypernym

Table 4.1. Wordnet semantic relations.

Figure 4.2: Core wordnet mapping to structured wordnet.

To establish the semantic relations between the synsets we created in (Lam et al., 2014b), we rely on the Princeton WordNet (Fellbaum, 2005) as an intermediate resource.

As Figure 4.3 shows, to construct the links between synsets in our wordnet for language T, we extract each synset_i from wordnet_t and map it to synset_j, which is the corresponding synset in the Princeton WordNet. Then, for each synset_j in the Princeton WordNet, we extract each semantic relation r_j and the linked synset_k. Next, we check the availability of synset_k in wordnet_t. Finally, if synset_k is available in wordnet_t, we add a relation between synset_i and synset_k to wordnet_t.

Figure 4.3: Creating wordnet semantic relations using intermediate wordnet.
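A minimal sketch of this projection step is shown below; it assumes the target wordnet is given as a set of offset-POS keys and PWN's relations as a mapping from offset-POS to (relation, linked offset-POS) pairs, both placeholders for whatever storage is actually used.

def project_relations(target_synsets, pwn_relations):
    """Recover semantic relations for a target wordnet via PWN.

    target_synsets: set of offset-POS keys present in wordnet_t
    pwn_relations: {offset_pos: [(relation_name, linked_offset_pos), ...]} from PWN
    Returns (synset_i, relation, synset_k) triples whose endpoints both exist in wordnet_t.
    """
    recovered = []
    for offset_pos in target_synsets:
        for relation, linked in pwn_relations.get(offset_pos, []):
            if linked in target_synsets:   # synset_k must also exist in wordnet_t
                recovered.append((offset_pos, relation, linked))
    return recovered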

We notice here that although we used some disambiguation methods when we created the core wordnets, there are still words that are misplaced. This will cause some false classification of synset relations. Another challenge is that translation leads to loss of some information. For example, it is very important to distinguish between classes and instances in wordnets (Miller and Hristea, 2006). There is no guarantee that an instance will not be translated into the target language as a class and vice versa. Furthermore, as Figure 4.4 shows, since the core wordnets are automatically created, some synsets will be missing because they are not available in the target languages. This will lead to fragments in the recovered links. All the previous issues need to be observed and dealt with to obtain acceptable accuracy.

Figure 4.4: The effect of missing synsets in recovering wordnet semantic relations using intermediate wordnet.

4.3 Experiments and Evaluation

In this section, we generate the semantic relations between synsets in three wordnets: Arabic, Assamese and Vietnamese. We start by creating the core wordnets using the algorithms we described in Section 4.1. Table 4.2 shows the result of creating the core wordnets for the three languages. Next, we apply our method, which we presented in Section 4.2, to link the synsets. The algorithm was able to recover a total of 206,766 relations between the Arabic synsets, 139,502 relations between the Assamese synsets, and 146,172 relations between the Vietnamese synsets. As Figure 4.5 shows, most of the recovered relations are hyponym and hypernym relations.

Language     Synsets   Coverage  Precision (/4.00)
Arabic       93,383    59.95%    3.82
Assamese     107,616   36.95%    3.78
Vietnamese   55,451    36.20%    3.75

Table 4.2. Size, coverage and precision of the core wordnets we create for Arabic, As- samese and Vietnamese.

Relation        Precision
SimilarTo       75.62%
Hypernym        70.41%
Hyponym         71.23%
MemberMeronym   77.54%
PartHolonym     84.29%
Average         75.82%

Table 4.3. Precision of the semantic relations established for our Arabic wordnet.

To evaluate our algorithm, we evaluated the relations recovered for the Arabic wordnet. We asked three Arabic speakers to evaluate a sample of 500 relations. The sample consists of the following relations: 100 “hypernym” relations, 100 “hyponym” relations, 100 “similar to” relations, 100 “MemberMeronym” relations and 100 “PartHolonym” relations. The evaluation was done using true/false questions, where True gives a score of 1 and False gives a score of 0 to the relation.

As Table 4.3 shows, the precision of the algorithm was between 70.41%, for the “hypernym” relation, and 84.29%, for the “PartHolonym” relation. The average precision score was 75.82%.

Figure 4.5: Percentage of synset semantic relations recovered for the Arabic, Assamese and Vietnamese wordnets.

4.4 Summary

In this chapter, we presented an approach that automatically constructs semantic relations between synsets in a wordnet. The approach depends on the PWN to establish the links between the synsets. We conducted an experiment to evaluate our algorithm. Our approach produces semantic relations between the Arabic synsets with 75.82% precision.

Chapter 5

ENHANCING AUTOMATIC WORDNET CONSTRUCTION USING WORD EMBEDDINGS

In the previous chapters, we have shown that a wordnet for a new language, possibly resource-poor, can be constructed automatically by translating wordnets of resource-rich languages. The quality of these constructed wordnets is affected by the quality of the resources, such as dictionaries and translation methods, used in the construction process.

Recent work shows that vector representations of words (word embeddings) can be used to discover related words in text. In this chapter, we propose a method that performs such similarity computation using word embeddings to improve the quality of automatically constructed wordnets.

5.1 Introduction

It is well known that one way to find semantically related words is to use context as a lead (Firth, 1957; Harris, 1954). Words that share the same neighbors are usually somehow related to each other. For example, consider the two sentences:

“He rides his bike to the park everyday” and

“He rides his bicycle to the park everyday”.

One can conclude that the words “bike” and “bicycle” are similar or semantically related since they appear in similar contexts. This observation led researchers to what are called distributional methods, which are widely used today. In these methods, also known as vector semantics and word embeddings, co-occurrences of the words in a corpus are represented as vectors in a multidimensional space forming a word-word matrix (Jurafsky and Martin, 2016).

Since a corpus consists of a large number of distinct words, these vectors are usually long and sparse. The sparseness of the vectors is caused by the fact that a word often co-occurs with a limited number of other words in a given corpus. For this reason, special algorithms are used to process and store these sparse vectors. Usually, the co-occurrence of a word is limited to a specific window of words before and after the word. According to (Jurafsky and Martin, 2016), there are two types of co-occurrence: first-order co-occurrence and second-order co-occurrence. The first type describes words that appear next to each other, while in the second type, the words share similar surrounding words.

In order to reduce the effect of stop words, i.e., words that co-occur with most of the words, the pointwise mutual information measure (PMI) (Fano and Hawkins, 1961) is usually used rather than raw co-occurrence counts. This measure compares the probability of two words co-occurring to the probability of the words occurring together by chance alone. The PMI between two words w1 and w2 is

$$\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)} \qquad (5.1)$$

where P(w1) is the probability of word w1,

P(w2) is the probability of word w2, and

P(w1, w2) is the probability of w1 in the context of w2.
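As a small worked example of Eq. 5.1, the following sketch computes PMI from raw counts; the count dictionaries are hypothetical and stand in for counts collected over a fixed co-occurrence window.

import math
from collections import Counter

def pmi(w1, w2, pair_counts, word_counts, total_pairs, total_words):
    """Pointwise mutual information (Eq. 5.1) from raw corpus counts."""
    p_w1_w2 = pair_counts[(w1, w2)] / total_pairs   # P(w1, w2)
    p_w1 = word_counts[w1] / total_words            # P(w1)
    p_w2 = word_counts[w2] / total_words            # P(w2)
    return math.log2(p_w1_w2 / (p_w1 * p_w2))

# Hypothetical counts: "rides" and "bike" co-occur 20 times among 10,000 window pairs.
pair_counts = Counter({("rides", "bike"): 20})
word_counts = Counter({"rides": 50, "bike": 40})
print(pmi("rides", "bike", pair_counts, word_counts, total_pairs=10_000, total_words=100_000))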

5.2 Similarity Metrics

There are many ways to compute similarity between vectors (Jurafsky and Martin, 2016). We list three common metrics used to measure similarity or relatedness between two vectors A and B of size N; a small code sketch of all three metrics follows the list.

• Cosine Similarity: It is the most common measure used in natural language processing. It produces similarity values from 0 to 1. When using raw co-occurrences or PMI, words with a cosine similarity value near 1 are supposedly very similar and words with a cosine similarity value near 0 are supposedly unrelated. Cosine similarity is measured using the following formula:

$$\mathrm{cosine}(\vec{A}, \vec{B}) = \frac{\sum_{i=1}^{N} A_i B_i}{\sqrt{\sum_{i=1}^{N} A_i^2}\,\sqrt{\sum_{i=1}^{N} B_i^2}} \qquad (5.2)$$

• Jaccard Measure: It was introduced by (Jaccard, 1912) and adapted by (Grefenstette, 2012) to be used with vectors. The Jaccard similarity is computed using the following formula:

$$\mathrm{sim}_{Jaccard}(\vec{A}, \vec{B}) = \frac{\sum_{i=1}^{N} \min(A_i, B_i)}{\sum_{i=1}^{N} \max(A_i, B_i)} \qquad (5.3)$$

• Dice Measure: It was originally used with binary vectors and was adapted by (Curran, 2004) to be applied to real-valued vectors. The Dice similarity measure is computed using the following equation:

$$\mathrm{sim}_{Dice}(\vec{A}, \vec{B}) = \frac{2 \sum_{i=1}^{N} \min(A_i, B_i)}{\sum_{i=1}^{N} (A_i + B_i)} \qquad (5.4)$$
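The sketch below implements Eqs. 5.2-5.4 directly over plain Python lists; it is only an illustration of the three metrics, not the code actually used in our experiments.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def jaccard(a, b):
    # generalized Jaccard for non-negative real-valued vectors
    return sum(min(x, y) for x, y in zip(a, b)) / sum(max(x, y) for x, y in zip(a, b))

def dice(a, b):
    return 2 * sum(min(x, y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))

a, b = [1.0, 0.0, 2.0], [1.0, 1.0, 1.0]
print(cosine(a, b), jaccard(a, b), dice(a, b))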

5.3 Generating Word Embeddings

In order to validate the synsets we create using translation and obtain relations between them, we use the word2vec algorithm (Mikolov et al., 2013) to generate word representations from an existing corpus. The word2vec algorithm uses a feedforward neural network to predict the vector representation of words within a multi-dimensional language model. Word2vec has two variations: Skip-Gram (SG) and Continuous Bag-Of-Words (CBOW). In the SG version, the neural network predicts words adjacent to a given word on either side, while in the CBOW model the network predicts the word in the middle of a given sequence of words. In the work presented in this section, we generate representations of words using both models with several different vector and window sizes to obtain the settings with the highest precision. The purpose of the steps discussed next is to improve the quality of synsets produced by the translation process in addition to generating relations among the synsets.
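As a sketch of this step, the snippet below trains CBOW embeddings with the gensim library; gensim and its parameter names (vector_size, window, sg in gensim 4.x) are our assumption here, since any implementation of word2vec can be substituted.

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# corpus.txt: the combined plain-text corpus, one tokenized sentence per line (assumed format)
sentences = LineSentence("corpus.txt")

# CBOW (sg=0); window and vector sizes are varied in Section 5.7.2 to find the best setting.
model = Word2Vec(sentences, vector_size=100, window=3, sg=0, min_count=5)
model.wv.save_word2vec_format("word_vectors.txt")

# Cosine similarity between two words in the learned model (words assumed to be in vocabulary).
print(model.wv.similarity("bike", "bicycle"))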

5.4 Removing Irrelevant Words in Synsets

We compute the cosine similarity between word vectors within each single synset in TWN, the wordnet being constructed in language T, to filter out false word members within synsets. To filter the initially constructed synsets in TWN, we pick a threshold value α such that the selected words have cosine similarity larger than α with each other. We describe the filtering process we propose below.

1. Let

   synset_i^c = {word1, word2, word3, word4}   (5.5)

be a candidate synset to be potentially included in TWN.

2. Compute the cosine similarity between all the possible pairs of words in synset_i^c.

3. Extract the pair of words with the highest cosine similarity.

4. If this pair of words has cosine similarity larger than α, keep the pair in the final synset synset_i; otherwise, discard synset_i^c itself. This may have been a low quality candidate synset generated in the translation process.

5. Next, among the remaining words in synset_i^c, keep a word if it has a connection with any word in synset_i with similarity higher than α.

For example, let us assume that the cosine similarities between the words in synset_i^c are as shown in Table 5.1 and α = 0.70. First, the pair with the highest cosine similarity, (word1, word2), is kept in the final synset_i since its cosine similarity is larger than α. Then, word3 is discarded since it does not have a cosine similarity larger than α with any of the

Pair              Cosine Similarity
(word1, word2)    0.91
(word1, word3)    0.22
(word1, word4)    0.82
(word2, word3)    0.34
(word2, word4)    0.72
(word3, word4)    0.12

Table 5.1. An example of cosine similarity between words in a candidate synset.

words in the current final synset_i. Finally, word4 is kept in synset_i since it does have a cosine similarity with word1 that satisfies the threshold α.
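The filtering steps above can be sketched as follows; the similarity function is a stand-in for cosine similarity over the word embeddings, and the threshold α is the one selected in Section 5.6.

from itertools import combinations

def filter_synset(candidate, similarity, alpha):
    """Filter a candidate synset (iterable of words) following steps 1-5 above.

    similarity(w1, w2): cosine similarity between the two word vectors
    Returns the retained set of words, or None if the candidate synset is discarded.
    """
    words = list(candidate)
    if len(words) < 2:
        return None                       # nothing to compare against
    best_pair = max(combinations(words, 2), key=lambda p: similarity(*p))
    if similarity(*best_pair) <= alpha:
        return None                       # likely a low quality translation candidate
    kept = set(best_pair)
    for w in words:
        if w not in kept and any(similarity(w, k) > alpha for k in kept):
            kept.add(w)
    return kept

# Reproducing the example in Table 5.1 with a hypothetical similarity table.
sims = {frozenset(p): s for p, s in {("w1", "w2"): 0.91, ("w1", "w3"): 0.22,
        ("w1", "w4"): 0.82, ("w2", "w3"): 0.34, ("w2", "w4"): 0.72, ("w3", "w4"): 0.12}.items()}
print(filter_synset(["w1", "w2", "w3", "w4"], lambda a, b: sims[frozenset((a, b))], 0.70))
# {'w1', 'w2', 'w4'}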

5.5 Validating Candidate Relations

Similarly, we compute the cosine similarity between words within pairs of semantically related synsets. This allows us to verify the constructed relations between synsets in TWN. For example, let

synset_i = {word_i1, word_i2, word_i3, word_i4}, and

synset_j = {word_j1, word_j2, word_j3, word_j4}

be synsets in TWN. And let

ρ_ij be a candidate semantic relation between synset_i and synset_j.

We compute the cosine similarity between all the possible pairs of words from synset_i to synset_j and take the maximum similarity obtained. Then, if this value is larger than a threshold α_ρ, we retain the relation ρ_ij; otherwise, we discard it. The pseudocode of the validation algorithm is shown in Algorithm 1.

Algorithm 1: Validating Semantic Relation
Data: synset_i, synset_j, relation ρ_ij, threshold α_ρ
Result: retain or discard the relation ρ_ij
initialization;
Similarity_max ← 0;
foreach word_i in synset_i do
    foreach word_j in synset_j do
        sim ← ComputeCosineSimilarity(word_i, word_j);
        if sim > Similarity_max then
            Similarity_max ← sim;
        end
    end
end
if Similarity_max < α_ρ then
    Discard(ρ_ij);
end
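A direct Python rendering of Algorithm 1, with the same behavior (retain the relation only if the best word-pair similarity reaches α_ρ); the similarity function is again a placeholder for cosine similarity over the embeddings.

def validate_relation(synset_i, synset_j, alpha_rho, similarity):
    """Return True to retain the candidate relation between synset_i and synset_j."""
    max_sim = 0.0
    for word_i in synset_i:
        for word_j in synset_j:
            max_sim = max(max_sim, similarity(word_i, word_j))
    return max_sim >= alpha_rho   # Algorithm 1 discards the relation when max_sim < alpha_rho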

5.6 Selecting Thresholds

To pick the synset similarity threshold value α and the threshold α_ρ for each semantic relation we create, we compute the cosine similarity between pairs of synonym words, semantically related words, and non-related words obtained from existing wordnets. Then, based on these data, we select the threshold values that are associated with higher precision and maximum coverage.

5.7 Experiments

In this section, we discuss the enhancement of the Arabic, Assamese and Vietnamese wordnets we create, using the method we described in the previous sections.

5.7.1 Generating Vector Representations of Wordnets Words

For generating vector representations of the Arabic words we use the following freely available corpora:

• Watan-2004 corpus (12 million words) (Abbas et al., 2011),

• Khaleej-2004 corpus (3 million words) (Abbas and Smaili, 2005), and

• 21 million words of Arabic Wikipedia articles (https://ar.wikipedia.org).

We process and combine the three corpora into a single plain text file. For both Assamese and Vietnamese, we used Wikipedia articles to generate the vector representations for words. The Assamese articles we used contain 1.4 million words, while the Vietnamese articles contain 80 million words.

Figure 5.1: A histogram of synonyms, semantically related words, and non-related words extracted from AWN.

In order to compute the synset similarity threshold value α and the threshold α_ρ for each semantic relation, we use the freely available Arabic wordnet (AWN) (Rodríguez et al., 2008). AWN was manually constructed in 2006 and has been semi-automatically enhanced and extended several times. We start by extracting synonym words, semantically related words, and non-related words from AWN. The Python program that we wrote to compute the cosine similarity between the words is listed in Appendix B.1.


Relation            Weighted Average Similarity
Synonyms            0.28
Hypernyms           0.22
TopicDomains        0.23
PartHolonyms        0.28
InstanceHypernyms   0.08
MemberMeronyms      0.29

Table 5.2. The weighted average similarity between related words in AWN.

Then, we use the histogram representation of the cosine similarity of the previous sets of words to set the thresholds. As Figure 5.1 shows, more than 67% of the non-related words have cosine similarity less than 0.1, while about 23% of the synonym words in AWN have a cosine similarity less than 0.1. Furthermore, about 34% of the semantically related words in AWN have cosine similarity less than 0.1. Table 5.2 shows the weighted average cosine similarity between synonyms, hypernyms, topic-domain related words, part-holonyms, instance-hypernyms, and member-meronyms in AWN, where the frequency of the similarity value is the weight.

5.7.2 Producing Word Embeddings for Arabic

In this part of the experiment, we use the word2vec algorithm to produce vector representations of Arabic words. We test the word2vec algorithm with different window sizes to select the window size that produces the highest similarity. We generate word embeddings using the CBOW version with window sizes 3, 5 and 8. Next, we compute the weighted averages of the cosine similarity between the synonyms in AWN. The highest weighted average we obtained was 0.288 with window size 3, while the weighted averages obtained with window sizes 5 and 8 were 0.283 and 0.277, respectively. Then, we compare the SG and the CBOW approaches with different vector sizes. Table 5.3 shows the weighted average cosine similarity obtained between 16,000 pairs of synonyms in AWN using both

Algorithm   Vector Size   Similarity Average
SG          100           0.289
SG          200           0.258
SG          500           0.194
CBOW        100           0.288
CBOW        200           0.259
CBOW        500           0.195

Table 5.3. Comparison between the weighted similarity averages obtained using different word2vec settings.

Threshold   AWN     Our Arabic WordNet
0.000       5,941   17,349
0.100       3,433   2,073
0.288       2,471   943
0.500       1,190   271
0.750       209     13

Table 5.4. Comparison between the number of synsets in AWN and our Arabic wordnet using different threshold values.

variations of word2vec, with window size 3 and vector sizes of 100, 200, and 500. We notice that both versions produce very similar results, with a slight advantage to SG at the cost of more execution time. However, for the corpus we use, a smaller vector size produces better precision.

5.8 Evaluation and Discussion

We compute cosine similarity between semantically related words extracted from our initial Arabic, Assamese and Vietnamese wordnets produced in the previous chapter. The language model used to calculate the cosine similarity is created using CBOW with vector size 100 and window size 3. Table 5.4 shows a comparison between the number of Arabic synsets we create and the number of synsets in AWN. We notice that the translation method we use produces a high number of synsets compared to the manually constructed AWN. However, the number of synsets sharply decreases after filtering the initial synonyms using the method described in Section 5.4. Although our Arabic wordnet is automatically created, the number of synsets we create is 60% of the number of synsets in the manually created AWN when filtering the synsets using α = 0.1.

Threshold Range   0-0.1    0.1-0.288   0.288-1
Synonyms          34.8%    56.8%       78.4%
Hypernyms         45.2%    57.2%       84.4%
PartHolonym       50.8%    75.2%       90.4%
MemberMeronym     40.8%    56.8%       79.6%
Overall           42.9%    61.5%       83.2%

Table 5.5. Precision of the Arabic wordnet we create.

Threshold Range   0-0.1    0.1-0.288   0.288-1
Synonyms          52.0%    57.6%       88.0%
Hypernyms         37.6%    49.6%       76.0%
PartHolonym       51.2%    46.4%       82.4%
MemberMeronym     62.4%    67.2%       81.6%
Overall           50.8%    55.2%       82.0%

Table 5.6. Precision of the Assamese wordnet we create.

We evaluate precision by comparing 600 pairs of synonyms, hypernyms, part-holonyms, and member-meronyms with three ranges of cosine similarity values: 0 to 0.1, 0.1 to 0.288, and 0.288 to 1. We asked 3 Arabic speakers to evaluate the pairs using a 0 to 5 scale, where 0 represents the minimum score and 5 represents the maximum score. We compute precision by taking the average score and converting it to a percentage. See Table 5.5.

Threshold Range   0-0.1    0.1-0.288   0.288-1
Synonyms          31.2%    40.2%       57.6%
Hypernyms         31.8%    39.0%       69.4%
PartHolonym       32.2%    42.8%       75.0%
MemberMeronym     22.0%    24.0%       73.8%
Overall           29.3%    36.5%       68.95%

Table 5.7. Precision of the Vietnamese wordnet we create.

Table 5.8. Examples of related words and their cosine similarity from our Arabic wordnet.

The precision of the synonyms, hypernyms, part-holonyms, and member-meronyms we produce is 78.4%, 84.4%, 90.4%, and 79.6%, respectively, with the threshold set to 0.288. This is higher than the precision obtained by (Lam et al., 2014b), which produces synonyms with 76.4% precision when just using PWN. The precision of the Assamese and Vietnamese wordnets is shown in Tables 5.6 and 5.7, respectively. As shown in Tables 5.8, 5.9 and 5.10, our results suggest that producing synsets with lower precision reduces the quality of the other created semantic relations. Our results show that pairs with higher cosine similarity are more likely to be semantically related. This confirms the benefit of combining the translation method with word embeddings in the process of automatically generating new wordnets.

5.9 Summary

In this chapter, we discuss an approach for enhancing the automatically generated wordnets we create for low-resource languages. Our approach takes advantage of word embeddings to enhance the translation method for automatic wordnet creation. We present an application of our approach to producing a new Arabic wordnet. Our method automatically produces Arabic synonyms with 78.4% precision and semantically related pairs of words with up to 90.4% precision.

Table 5.9. Examples of related words and their cosine similarity from our Assamese wordnet.

Table 5.10. Examples of related words and their cosine similarity from our Vietnamese wordnet.

Acknowledgment This chapter is based on the paper “Enhancing Automatic Wordnet Construction Using Word Embeddings” (Al tarouti and Kalita, 2016), written in collaboration with Jugal Kalita, that appeared in the Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP, San Diego, USA, June 2016, Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Chapter 6

SELECTING GLOSSES FOR WORDNET SYNSETS USING WORD EMBEDDINGS

Word embeddings provide a way to represent words as vectors in a multi-dimensional space such that related words are represented as vectors with similar direction. It has been shown that this model can be used to discover relations between words effectively. In this chapter, we introduce a method to represent wordnet synsets in a similar way. A wordnet synset is a group of synonym words grouped together because they all represent the same concept. Our proposed method can be used in several NLP applications such as word-sense disambiguation and automatic wordnet construction. To test our method we use it in the task of selecting glosses for wordnet synsets of several languages.

6.1 Literature Review

Several methods have been introduced to produce vector representations of meanings. Clustering is one technique that is commonly used to separate the vector of a multi-sense word into several vectors which represent the senses of the word. For example, (Neelakantan et al., 2015) modified the skip-gram version of the word2vec algorithm to produce multiple word embeddings per word. In this work, the senses of a word are learned online by creating clusters of the contexts of the word. When a new context of the word starts to appear far away from the center of the known contexts, a new vector is created for the new context. A global context-aware neural model was presented by (Huang et al., 2012) to learn the context vectors of words using both local and global context. To evaluate their neural architecture, the authors produced a new dataset that provides similarity ratings, based on human judgments, between words within specific contexts.

Other techniques for producing sense vector representations are based on ontologies. For example, (Chen et al., 2014) modified the objective of the skip-gram model of the word2vec algorithm to assign vector representations to synsets based on their glosses. The work also presented two word-sense disambiguation algorithms based on the sense vectors. Another approach to learn synset embeddings was introduced by (Rothe and Schütze, 2015). The approach, which is called AutoExtend, is a neural network based learning model. It includes hidden layers for both synset and lexeme embeddings. Foley and Kalita (Foley and Kalita, 2016) compared several models which use WordNet to create sense vectors. They also presented an approach, called the hyponym tree propagation model (HTP), that uses a vector space model (VSM) to produce sense vectors.

6.2 Creating Language Model Using Word Embeddings

We start by creating word embeddings using a corpus and the word2vec software (Mikolov et al., 2013). Word2vec is a two-layer feedforward neural-network learning model that produces multi-dimensional vector representations of words. There are two implementations of this learning model: the Skip-Gram (SG) implementation and the Continuous Bag-Of-Words (CBOW) implementation. In the SG implementation, the model learns the words around a given word, while in the CBOW implementation the model learns the word within a given sequence of words.

6.3 Generating Vector Representation of Wordnet Synsets

In this section, we present our method to produce vector representations of wordnet synsets. We build our method based on the vectors of the synonym words produced by the word embedding method. We believe that combining the vectors of synonym words into one vector can produce a way to represent meaning. Next, we describe our proposed method to build the vector representation of synsets, which we call synset2vec. Let

Synset Key   Gloss                                                                           Synonyms
00076884-n   a sudden drop from an upright position                                          {spill, tumble, fall}
00329619-n   the act of allowing a fluid to escape                                           {spill, spillage, release}
04277034-n   a channel that carries excess water over or around a dam or other obstruction   {spill, spillway, wasteweir}
15049594-n   liquid that is spilled                                                          {spill}

Table 6.1. Meanings of the noun “spill” and its synonyms.

synset_i = {word_1, word_2, ..., word_j} be a synset in wordnet_x,

{n_1, n_2, ..., n_j} be the number of synsets for each word in synset_i, and

{V_1, V_2, ..., V_j} be the set of corresponding vectors for {word_1, word_2, ..., word_j} in the word embedding model.

We identify two cases:

1. The first case is when a word, which does not have any synonyms, represents several synsets, i.e., has more than one meaning. In this case, the vector that is produced by the word embedding actually represents the combined meanings of the word. For example, in PWN, the word “abduction” is the only word in both synset 00775460-n, “the criminal act of capturing and carrying away by force a family member”, and synset 00333037-n, “moving of a body part away from the central axis of the body”. Hence, the vector for “abduction” actually represents both meanings.

2. The second case is when a word, which has one or more synonyms, has one or more meanings. In this case, the synonyms might or might not have other meanings as well. For example, the noun “spill” has four meanings in PWN and it has 6 synonyms. Table 6.1 shows all the meanings of the noun “spill” and all its synonyms in PWN.

Obviously, to generate a combined vector for a synset, we need a way to limit the effect of the other meanings that the synonyms might hold. To do so, we start by solving the second case, where the synsets have more than one word. In this case, we normalize the vector of each word by dividing its coordinates by the number of synsets that the word belongs to. This reduces the noise in the generated synset vector caused by the other meanings that a word can hold. We define the vector of synset_i (V_si) as follows:

$$\vec{V}_{s_i} = \frac{1}{j} \left( \vec{V}_1 \cdot \frac{1}{n_1} + \vec{V}_2 \cdot \frac{1}{n_2} + \dots + \vec{V}_j \cdot \frac{1}{n_j} \right)$$

Figure 6.1 shows an example of creating a vector for the synset 00076884-n, which includes three words: spill, tumble and fall.

Figure 6.1: An example of creating a vector for a wordnet synset that include more than one word.

Next, we produce vectors for the synsets that share a single word, i.e., words that do not have any synonyms and have more than one meaning. In this case, for each synset, we produce the synset vector by combining the word vector with the vector of a word in a related synset, e.g., a hypernym, a hyponym, or a meronym. For example, let synset_i and synset_k be synsets that both include the same single word w. And let h1 be a word from the hypernym of synset_i and h2 be a word from the hypernym of synset_k. We define the vector of synset_i (V_si) as follows:

$$\vec{V}_{s_i} = \frac{1}{2} \left( \vec{V}_w \cdot \frac{1}{n_w} + \vec{V}_{h_1} \cdot \frac{1}{n_{h_1}} \right)$$

Similarly, we define the vector of synset_k (V_sk) as follows:

$$\vec{V}_{s_k} = \frac{1}{2} \left( \vec{V}_w \cdot \frac{1}{n_w} + \vec{V}_{h_2} \cdot \frac{1}{n_{h_2}} \right)$$

Figure 6.2 shows an example of creating vectors for the two synsets of the word “abduction”. In Appendix B.2 we list a Python implementation of the procedure.

Figure 6.2: An example of creating vectors for wordnet synsets that share a single word.
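Complementing the Appendix B.2 implementation, here is a minimal numpy sketch of the two cases; the word-vector and synset-count lookups are hypothetical dictionaries standing in for the word2vec model and the wordnet index.

import numpy as np

def synset_vector(words, word_vec, n_synsets):
    """Multi-word case: average the words' vectors, each down-weighted by its polysemy."""
    vecs = [word_vec[w] / n_synsets[w] for w in words]
    return sum(vecs) / len(vecs)

def single_word_synset_vector(word, related_word, word_vec, n_synsets):
    """Single-word case: mix the word's vector with a word from a related synset
    (e.g., a hypernym) to separate the word's different meanings."""
    return 0.5 * (word_vec[word] / n_synsets[word]
                  + word_vec[related_word] / n_synsets[related_word])

# Toy example: "spill" belongs to 4 synsets, "tumble" to 3, "fall" to many (say 12).
word_vec = {w: np.random.rand(100) for w in ["spill", "tumble", "fall"]}
n_synsets = {"spill": 4, "tumble": 3, "fall": 12}
v = synset_vector(["spill", "tumble", "fall"], word_vec, n_synsets)
print(v.shape)   # (100,)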

6.4 Automatically Selecting a Synset Gloss From a Corpus Using Synset2Vec

In this section, we give one example of the use of our model. We show how our proposed model can be used in the automatic selection of glosses for wordnet synsets. The automatic selection of a synset gloss is a word-sense disambiguation problem. A gloss is a short sentence which is usually manually attached to a synset to clarify the meaning of the synset. This short sentence can be a definition or an example sentence of one of the members of the synset. We test our method using PWN and then apply it to automatically add glosses to wordnets created in (Lam et al., 2014b).

In the following steps, we present our method to select a gloss for synset_i as defined in Section 6.3.

• Let G = {g_1, g_2, ..., g_y} be the set of candidate glosses that include a word belonging to synset_i.

• To select the closest gloss to synset_i from G, we generate a vector for each gloss g_z ∈ G. We list a Python function for this step in Appendix B.3.

• Assume that the gloss g_z consists of the words {w_1, w_2, ..., w_d},

{m_1, m_2, ..., m_d} is the number of synsets for each word in g_z, and

{V_w1, V_w2, ..., V_wd} is the set of corresponding vectors for {w_1, w_2, ..., w_d}.

• We compute the vector of gloss g_z as follows:

$$\vec{V}_{g_z} = \frac{1}{d} \left( \vec{V}_{w_1} \cdot \frac{1}{m_1} + \vec{V}_{w_2} \cdot \frac{1}{m_2} + \dots + \vec{V}_{w_d} \cdot \frac{1}{m_d} \right)$$

• Then, we compute the cosine similarity between the vector of each gloss g_z and V_si. We present a Python implementation for this step in Appendix B.4.

• Finally, we select the gloss with the highest cosine similarity with V_si.

For instance, as shown in Table 6.2, if we consider the word “abduction” which belongs to two synsets and does not have any synonyms, we notice that our algorithm was able to distinguish between the two meanings and select the right gloss for both synsets.
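A sketch of the gloss-selection step, using the synset vectors from the previous function; tokenization by whitespace and the lookup dictionaries are simplifying assumptions (Appendices B.3 and B.4 contain the code actually used).

import numpy as np

def gloss_vector(gloss_words, word_vec, n_synsets):
    """Vector of a candidate gloss: polysemy-weighted average of its word vectors."""
    vecs = [word_vec[w] / n_synsets.get(w, 1) for w in gloss_words if w in word_vec]
    return sum(vecs) / len(vecs)

def select_gloss(synset_vec, candidate_glosses, word_vec, n_synsets):
    """Return the candidate gloss whose vector has the highest cosine similarity
    with the synset vector."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = [(cos(synset_vec, gloss_vector(g.split(), word_vec, n_synsets)), g)
              for g in candidate_glosses]
    return max(scored)[1]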

6.5 Evaluation

In this section, we introduce two forms of evaluation. First, we apply our method to select glosses for the PWN synsets. In this case, we directly compare our results to the manually attached glosses in PWN. Then, we apply our method to attach glosses to wordnet synsets generated by (Lam et al., 2014b). In this case, we ask human judges to evaluate the resulting glosses for three languages: Arabic, Assamese and Vietnamese.

Synset Key   Gloss                                                                        Cosine Similarity
00333037-n   the criminal act of capturing and carrying away by force a family member    0.172
00333037-n   moving of a body part away from the central axis of the body                0.214
00775460-n   the criminal act of capturing and carrying away by force a family member    0.204
00775460-n   moving of a body part away from the central axis of the body                0.189

Table 6.2. Cosine similarity between the different synset vectors and glosses of the word “abduction” in PWN.

6.5.1 Using Synset2vec to Select Glosses for PWN Synsets

In order to evaluate our synset vector representation in the task of selecting glosses for wordnets, we use it in the process of gloss selection for PWN synsets. We take advantage of the glosses manually added to the synsets in PWN to automatically measure the precision of our synset representation. The following steps describe the evaluation process of selecting glosses for PWN synsets.

• For each synset_i in PWN, we construct a set of candidate glosses. The candidate glosses are extracted from PWN using the following method. First, the gloss attached to synset_i in PWN is added to the candidate set of glosses. Next, to generate negative glosses for synset_i, we extract words which belong to synset_i and other synsets, i.e., words that have the meaning of synset_i and one or more other meanings, and add the glosses of those other synsets. This allows us to examine the ability of the algorithm to differentiate between the different meanings of synsets.

• We randomly select two types of synsets from PWN: synsets that have single words, i.e., synsets that are represented by only single words, and synsets that include multiple synonym words.

• We generate the synset vectors using the algorithm we described in Section 6.3.

• Next, we generate the gloss vectors using the method we described in Section 6.4.

• Then, we compute the cosine similarity between synset_i and each gloss in the candidate set.

• Finally, we select the gloss with the highest cosine similarity.

6.5.2 Using Synset2vec to Select Glosses for Arabic, Assamese and Vietnamese Synsets

In this section, we examine the precision of our method by applying it for the purpose of selecting glosses from corpora to attach to the wordnets we create in the previous chapters. In this experiment, we use the wordnets of the languages: Arabic, Assamese and Vietnamese. Next, we describe the steps of evaluating glosses selected by our method for the synsets of the target languages.

• For each synset_i in the target wordnet wordnet_t, we generate a set of candidate glosses by extracting the set of sentences that include any member of synset_i from the corpora we described in Section 5.7.

• We randomly select two types of synsets from wordnet_t: synsets that have single words, i.e., synsets that are represented by only single words, and synsets that include multiple synonym words.

• We generate the synset vectors using the algorithm we described in Section 6.3.

• Next, we generate vectors for each sentence in the set of candidate glosses using the method we described in Section 6.4.

• Then, we compute the cosine similarity between synset_i and each sentence in the candidate set.

• Next, the top 3 sentences with the highest cosine similarity with synset_i are selected.

Synset Type     Number of Synsets   Precision
Single Member   1400                76.5%
Multi Member    600                 79.6%

Table 6.3. The precision of selecting glosses for PWN synsets

• Finally, 3 native speakers of the target language are asked to evaluate the selected sentences using a 5 point scale.

6.5.3 Results and Discussion

As shown in Table 6.3, we used our algorithm to select glosses for 1400 single-member synsets from PWN. The algorithm achieved 76.5% precision. In addition, we used it to select glosses for 600 multi-member synsets from PWN. The precision was 79.6% in this case. As expected, the precision of selecting glosses is better with multi-member synsets, since more information about the context of the sense is provided by the multi-member synsets.

In the second evaluation, we randomly selected 300 synsets from the Arabic, Assamese and Vietnamese wordnets we create (100 synsets each). For each synset, we extracted all the sentences that included any member of the synset from the corpora. The sentences were sorted according to their cosine similarity with the synset vector and the top 3 sentences were selected.

As shown in Table 6.7, the precision of selecting glosses for the Arabic synsets is 81.4% when selecting the sentences with the highest cosine similarity with the synset vector. Furthermore, the precision of the top 2 and top 3 sentences is 70.4% and 65.8%, respectively. The overall precision of selecting glosses using our method for the Arabic synsets is 72.6%. Table 6.4 shows some examples of glosses we produce for the Arabic synsets along with their cosine similarity values. The precision of our method for selecting glosses for the Assamese synsets is 85.2% when selecting the sentences with the highest cosine similarity. Moreover, the top 2 and top 3 selected sentences achieved 83.2% and 84.6% precision, respectively. The overall precision for Assamese glosses is 84.4%.

Table 6.4. Examples of Arabic glosses we produce in our Arabic wordnet.

top 3 selected sentences achieved 83.2% and 84.6% respectively. The overall precision for Assamese glosses is 84.4%. Table 6.5 shows some examples of glosses we produce for the Assamese synsets along with the their cosine similarity values. The top Vietnamese glosses selected by our method has 39.4% precision. The top 2 and top 3 Vietnamese glosses selected by our method has 36.6% and 37% precision. Table 6.6 shows some examples of glosses we produce for the Vietnamese synsets along with the their cosine similarity values. In general, the precision of the recently published algorithms for the task of multilin- gual word-sense disambiguation is arround 68.7% (Apidianaki and Von Neumann, 2013), meaning that our algorithm is showing reasonably good performance for English, Arabic and Assamese. However, we notice that our method perform poorly with Vietnamese. The reason behind the poor results with Vietnamese is that Vietnamese words are not separated by white spaces (Gordon and Grimes, 2005). This means that the meaning of most the words can change based on the following words. This makes the process of generating the vectors for both the synsets and sentences extremely difficult since the word2vec algorithm assumes that words are separated by white spaces. The same problem appears in the pro- 64

Table 6.5. Examples of Assamese glosses we produce in our Assamese wordnet.

6.6 Summary

In this chapter, we presented a new method for selecting synset glosses from a corpus. Our glosses are example sentences that clarify the meaning of the synset. The method can be used for low-resource languages to attach glosses to automatically constructed wordnets. Our method provides a vector representation for wordnet synsets in a multi-dimensional space. We construct a synset vector by combining the word embedding vectors of each synonym in the synset. Our evaluation showed that our method selects glosses with precision up to 84.4%.

Table 6.6. Examples of Vietnamese glosses we produce in our Vietnamese wordnet.

              Precision
Wordnet       Top 1    Top 2    Top 3    Overall
Arabic        81.4%    70.4%    65.8%    72.6%
Assamese      85.2%    83.2%    84.6%    84.4%
Vietnamese    39.4%    36.6%    37.0%    37.6%

Table 6.7. The precision of selecting glosses for Arabic, Assamese and Vietnamese synsets.

Chapter 7

LEXBANK: A MULTILINGUAL LEXICAL RESOURCE

Figure 7.1: An overview of the LexBank system.

7.1 Introduction

In this chapter, we discuss the design and implementation of LexBank: a system that provides access to the multilingual lexical resources we create in this dissertation. We aim to give public users the ability to access and use the resources that we have created in our project. The system provides wordnet search services for several resource-poor languages in addition to bilingual dictionary look-up services. In addition, the system receives evaluation and feedback from users to improve the quality of the resources.

As Figure 7.1 shows, the system is divided into three layers: the web interface, the application layer and the database layer. The web interface allows users to log into the system and access the search services. The web interface also provides a control panel for administrators to allow them to manage the system. The application layer includes all the software required to securely execute the users’ requests. The database layer has two databases: the lexical resources database and the system database. The system database stores users’ information and the system settings. The design of the system allows inclusion of new language resources and easy modifications.

7.2 Database Design

LexBank uses two databases: one for storing the system settings and one for storing the lexical resources. We have used Microsoft SQL Server to construct the databases. The SQL code we use to construct the databases is listed in Appendix C. Next, we describe each database in detail.

7.2.1 The System Settings Database

There are two tables in the system settings database: Users_Info and System_log. We describe both tables below.

7.2.1.1 Users_Info

The Users_Info table contains information about registered users. The following are the fields contained in the Users_Info table.

• UserId: a unique short alias name, selected by the user, that is used to identify users in the system.

• UserName: the full name of the user.

• UserEmail: the email address of the user.

• UserPwd: the encrypted password used by the user to access the system.

• UserPriv: a text field that determines the privileges that the user has. There are two levels of users in the system. The first level is administrator which has the privileges of managing users and data in the system. The second level is client which has the privilege of browsing the available resources.

• UserStatus: this field specifies the status of the user. The status can be Active, Inactive or New.

7.2.1.2 System_log

The System_log table keeps records of all the user activities in the system. This helps us in maintenance and keeping track of the utilization of the system. The following fields are contained in the System_log table.

• EventId: a unique key that is used to identify the event.

• EventDesc: a text description of the event.

• EventTime: the date and time of the event.

• UserId: the identification key of the user who committed the event.

7.2.2 The Lexical Resources Database

The lexical resources database contains the resources we produced in this thesis. For each language supported by the system, the database maintains tables for storing the core wordnet, the semantic relations, the wordnet glosses, the evaluation data for the semantic relations and the evaluation data for the wordnet glosses. Next, we describe each table in this database.

7.2.2.1 CoreWordnet

The CoreWordnet table stores the wordnet synsets we created in this thesis. The core wordnet groups the synonym words into sets called synsets. In this table, synsets are identified using the offset-pos of the corresponding synset in PWN. In PWN, the offset-pos consists of two parts: the byte offset used to locate the synset in the data file and the part-of-speech of the synset. The following are the fields in the CoreWordnet table.

• offset-pos: the offset-pos of the wordnet synset which is used as an identifier for the synset.

• Member: a word that belongs to the synset.

7.2.2.2 Sem_Relations

Whereas the synonymy relation is stored in the CoreWordnet table, other semantic relations such as hyperonymy and meronymy are stored in the Sem_Relations table. As we described in Section 4.2, the semantic relations are directed relations. Therefore, we maintain the direction by specifying which synset is on each side of the relation. The Sem_Relations table contains the following fields.

• Left_offset-pos: this field specifies the offset-pos of the synset on the left side of the relation.

• Relation: a text field that specifies the relation between the left side and the right side synsets.

• Right_offset-pos: the offset-pos of the synset on the right side of the relation.

7.2.2.3 WordnetGlosses

The WordnetGlosses table stores the wordnet glosses we generate in Chapter 6. The following are the fields of the WordnetGlosses table.

• offset-pos: the offset-pos of the wordnet synset.

• Gloss: a text field that contains the gloss of the synset.

7.2.2.4 Sem_Relations_Eval_Data

The Sem_Relations_Eval_Data table contains the semantic relations sample data used in the evaluation. This table contains the following fields.

• RelationKey: a unique identification number used to identify the semantic relation being evaluated.

• Left_offset-pos: the offset-pos of the synset on the left side of the relation being evaluated.

• Word1: this field specifies the word on the left side of the relation being evaluated.

• Relation: a text field that specifies the type of relation being evaluated.

• Right_offset-pos: the offset-pos of the synset on the right side of the relation being evaluated.

• Word2: this field specifies the word on the right side of the relation being evaluated.

• COS: the cosine distance, as measured in Section 5.4, between the left word and the right word in the relation being evaluated.

7.2.2.5 Sem_Relations_Eval_Response

The Sem_Relations_Eval_Response table contains the collected responses of the se- mantic relations we produce from evaluators. This table consists of the following fields.

• AnswerKey: a unique integer that is generated automatically to identify the response.

• RelationKey: the key of the semantic relation being evaluated.

• Score: an integer value from 1 to 5 that represents the score assigned by the evaluator to the semantic relation.

• UserId: identification key of the evaluator who evaluated the response.

7.2.2.6 WordnetGlosses_Eval_Data

The WordnetGlosses_Eval_Data table holds the wordnet glosses sample being evaluated by the users. The table includes the following fields.

• GlossKey: an automatically generated unique integer used to identify the gloss being evaluated.

• offset-pos: the offset-pos of the wordnet synset.

• Word: the word which is used in the gloss to represent the wordnet synset.

• Sentence: the sentence selected as gloss for this wordnet synset.

• PWNGloss: the English gloss of the corresponding synset in PWN.

• CosSem: the cosine similarity between the selected sentence and the synset as mea- sured in Section 6.4.

• GlossRank: an integer value that represents the rank of the gloss among the other candidate glosses. The rank is assigned by the system to the gloss being evaluated based on the CosSem value. Glosses with the highest CosSem value have a rank value of 1.

7.2.2.7 WordnetGlosses_Eval_Response

Responses from the users for evaluating the wordnet glosses we produced in Section 6.4 are stored in the WordnetGlosses_Eval_Response table. This table consists of the following fields:

• AnswerKey: a unique integer number that is generated automatically to identify the response.

• GlossKey: the key of the gloss being evaluated.

• Score: an integer value from 1 to 5 that represents the score assigned by the evaluator to the gloss.

• UserId: identification key of the evaluator who evaluated the gloss.

7.3 Application Layer

In this section, we describe the main functions provided by LexBank. In order to maintain simplicity, we implement most of the functions of the system in one utility class (LexBankUtils.cs) written in Microsoft C#. The utility class, which is listed in Appendix D, consists of the following methods.

• IsUserIdAvailable(): takes a userId and returns true if this has never been used by another user before.

• EncryptPassword(): takes a plain text password and returns an encrypted password.

• DecryptPassword(): takes an encrypted password and returns a decrypted password.

• CreateNewUser(): takes the details of a new user and creates an account for him by storing the data in the Users_Info table.

• IsAuthenticated(): takes the user identification and password and returns true if it matches the user information in the users table.

• FindSynSet(): takes a lexeme and returns a list of synsets that include this lexeme.

• FindSynSetLexemes(): takes an OffsetPos of a synset and returns the list of lexemes of this synset.

• IsSynSetAvailable(): takes an OffsetPos of a synset in a specific wordnet, and returns true if the synset is available in the specified wordnet.

• FindSynSetRelations(): takes an OffsetPos of a synset and returns all the semantically related lexemes.

• FindGloss(): takes an OffsetPos of a synset and returns the gloss of the synset.

• ReadRelation(): takes a RelationKey and returns the details of the relation.

• ReadSynsetGloss(): takes a GlossKey and returns the details of the gloss.

• EvaluateRelation(): takes RelationKey, Score and UserId and stores them in the eval- uation table of the semantic Relations.

• EvaluateGloss(): takes GlossKey, Score and UserId and stores them in the evaluation table of the wordnet glosses.

• LogEvent(): takes event description and stores it in the System_log table.

• ChangeUserStatus(): takes UserId of a user and changes his status to a specific new status.

• RetrieveUsers(): a method that returns a list of all the users in the system and their information.

7.4 Web Interface Design and Implementation

In this section, we describe the design of the web interface of LexBank. The web interface is implemented in ASP.NET using Microsoft Visual Studio 2012. Figure 7.2 shows the site map of the web interface. The interface is accessed by the login web page (frmLogin.aspx). New users need to register to gain access to the system. Registration can be done by filling out the web registration form (frmRegister.aspx). Once a user logs into the system, the main menu web page (frmMainMenu.aspx) is shown. The main menu includes links to access the services available in the system. In the following sections, we describe each web page in the system.

Figure 7.2: LexBank web site map

7.4.1 Registration Form

New users need to register in the system using the registration form (frmRegister.aspx). As shown in Figure 7.3, a new user needs to provide the full name, email, email confirmation, user identification, password and password confirmation, and then press the Register button. The registration process starts when a new user submits his information through the registration web form. Once the registration form receives the information, it checks if all the fields meet the requirements of the system. The requirements include a valid format for the email address and the password. The requirements also include that the user identification was never used before by an existing user. If the information sent by the user passes the validation process, the registration form calls the CreateNewUser() method from the utility class. The CreateNewUser() method uses the EncryptPassword() method to encrypt the password, and then writes the data into the Users_Info table. The registration process is summarized in the sequence diagram shown in Figure 7.4.

Figure 7.3: The registration web form

7.4.2 Log-in Form

Registered users can log in to the system using the login web page (frmLogin.aspx), which is shown in Figure 7.5. A user with an active account needs to provide his user identification and password to start the login process. As shown in Figure 7.6, when the login web form (frmLogin.aspx) receives the userid and the password, it calls the IsAuthenticated() method from the utility class. Then, the password is encrypted using the EncryptPassword() method and compared with the encrypted password stored in the users table. If the userid and the password provided by the user match the userid and the password stored in the users table, the main menu of the web interface is shown to the user; otherwise, an error message is shown. The main menu is shown in Figure 7.7.

Figure 7.4: Sequence diagram of the registration process

Figure 7.5: The log-in web form

Figure 7.6: Sequence diagram of the login process

7.4.3 The Main Menu

The main menu includes links to access the services available in the system. The services presented by the web interface are given below.

• Searching wordnet using lexeme, provided by the web page (frmWordnetSearch.aspx).

• Searching wordnet using OffsetPos, provided by the web page (frmSynsetDetails.aspx).

• Evaluating semantic relations between synsets, provided by the web page (frmEvalRelations.aspx).

• Evaluating wordnet glosses, provided by the web page (frmEvalGloss.aspx).

• Searching a bilingual dictionary, provided by the web page (frmDictionarySearch.aspx).

Figure 7.7: The main menu

• User management, provided by the web page (frmManageUsers.aspx).

7.4.4 Searching Wordnet By Lexeme Web Form

The web form (frmWordnetSearch.aspx) allows users to search for the synsets of a lexeme in a specific language. As shown in Figure 7.8, this web form consists of the following components.

• A text box used to allow the user to enter a lexeme.

• A drop-down menu to allow the user to select the language.

• A list box for showing the synsets list of the entered lexeme.

• A list box for showing the synonyms of the entered lexeme.

• A list box for showing the related lexemes.

• A button to start the searching process.

The searching process, as shown in Figure 7.9, starts when the user submits a lexeme and a language to the frmWordnetSearch.aspx web form. Then, the method FindSynset() from the utility class is called to retrieve the synsets that include the entered lexeme and show the result in the synsets list. Next, when the user selects a synset from the synsets list, the frmWordnetSearch.aspx web form calls the FindSynsetLexemes() method from the utility class to show the synonyms of the lexeme in the synonym list. It also calls the FindSynsetRelations() method to obtain the related lexemes and show them to the user in the related lexemes list. The user can also view the details of a synset shown in the synsets list or the related lexemes list by double-clicking on the synset OffsetPos. This shows the frmSynsetDetails.aspx web form, which we describe next.

Figure 7.8: The web form for searching wordnet by lexeme. The form shows the result of searching the Arabic lexeme (مصر), which means Egypt.

7.4.5 Searching Wordnet by OffsetPos Web Form

Wordnet search using OffsetPos is provided by the frmSynsetDetails.aspx web form. An example of searching for a synset with the OffsetPos (08897065-n) in our Arabic, Vietnamese and Assamese wordnets using the frmSynsetDetails.aspx web form is shown in Figures 7.10, 7.11 and 7.12, respectively. This web form consists of the following components.

Figure 7.9: Sequence diagram of the process of searching wordnet using lexeme

• A text box for entering the OffsetPos of the synset.

• A drop-down menu to allow the user to select the language.

• A text box for showing the gloss of the synset.

• A text box for showing the English gloss of the synset.

• A list box to show the synonym list of the synset.

• A list box to show the related synsets and lexemes of the entered synset.

• A button to start the search process.

Figure 7.10: The web form for searching wordnet by OffsetPos. The form shows the result of searching the Arabic synset (08897065-n).

In this form, the user starts the process of searching wordnet by submitting the OffsetPos of the synset and the target language to the frmSynsetDetails.aspx web form. The web form calls the FindGloss() method from the utility class to retrieve the gloss of the synset. It also calls the FindSynsetLexemes() and the FindSynsetRelations() methods to obtain the synonym list and related synsets of the input synset and show them in the form.

7.4.6 Evaluating Semantic Relations Between Synsets Web Form

The web form frmEvalRelations.aspx allows users to evaluate semantic relations between lexemes and synsets in the system. The form shows the relation as a sentence and asks the user to rate the correctness of the sentence using a Likert-type scale. The form consists of the following components.

Figure 7.11: The web form for searching wordnet by OffsetPos. The form shows the result of searching the Vietnamese synset (08897065-n).

Figure 7.12: The web form for searching wordnet by OffsetPos. The form shows the result of searching the Assamese synset (08897065-n). The third part-meronym in Assamese is wrong; it comes from the verb sense of “desert”, which means to leave without intending to return.

Figure 7.13: Sequence diagram of the process of searching wordnet using OffsetPos.

Figure 7.14: The web form for evaluating semantic relations between synsets in a wordnet. The form shows an example of evaluating a hyponymy relation between the two Assamese lexemes, one for radio telegraph and the other for radio.

• A text box showing the relation key.

• A text box showing the relation in the form of a sentence.

• A text box showing the UserId of the evaluator.

• An option box that allows the user to rate the relation.

• A button to submit the score.

• A button to end the evaluation session.

Figure 7.15: Sequence diagram of the process of evaluating the relation between two lexemes.

The evaluation form frmEvalRelations.aspx starts the evaluation process by calling the ReadRelation() method from the utility class to show the relation details to the user. When the user submits the score he assigns to a relation, the evaluation form frmEvalRelations.aspx stores the score by calling the EvaluateRelation() method from the utility class. Then, the evaluation form reads the next relation and shows it to the user. The user can stop the evaluation process by clicking the End Session button and can resume it any time he wishes without re-evaluating the relations he has already evaluated.

7.4.7 Evaluating Wordnet Synsets Glosses Web Form

Figure 7.16: The web form for evaluating wordnet synsets glosses. The form shows an example of evaluating Arabic synset (13108841-n).

The glosses of the wordnets can be evaluated using the frmEvalGloss.aspx web form. To evaluate a synset gloss, the form attaches the English gloss of the synset obtained from the PWN to the selected gloss in the target language. Then, the user is asked if the lexeme in the selected gloss has the same meaning as in the PWN gloss. This evaluation form is composed of the following components.

• A text box showing the gloss key.

• A text box showing a lexeme from a synset, a candidate gloss written in the target language, and the English gloss of the synset.

• A text box showing the UserId of the evaluator.

• An option box that allows the user to rate the candidate gloss.

• A button to submit the score.

• A button to end the evaluation session.

Figure 7.17: Sequence diagram of the process of evaluating a wordnet synset gloss.

The web form frmEvalGloss.aspx starts the evaluation process for glosses by calling the ReadSynsetGloss() method from the utility class to obtain the lexeme, the candidate gloss and the English gloss of the synset being evaluated. Then, the web form uses the previous data to construct a question for the user. When the user submits the score he assigns to the candidate gloss, the evaluation form stores the score by calling the EvaluateGloss() method from the utility class. Then, the evaluation form reads the next gloss and shows it to the user. The user can stop the gloss evaluation process by clicking the End Session button. The user can resume the gloss evaluation process any time he wishes without re-evaluating the glosses he has already evaluated.

7.4.8 Searching Bilingual Dictionary Web Form

Figure 7.18: The web form for searching a bilingual dictionary. The form shows the result of translating the Arabic word (مصر), which means Egypt, to Assamese.

The web form (frmDictionarySearch.aspx) allows users to use the bilingual dictionaries we created in (Lam et al., 2015b) to translate words between languages. As shown in Figure 7.18, this form consists of the following components.

• A text box used to allow the user to enter a word.

• A drop-down menu to allow the user to select the source language.

• A drop-down menu to allow the user to select the target language.

• A list box for showing the translations list of the entered word.

• A button to start the searching process.

The translation process, as shown in Figure 7.19, starts when the user submits a word, a source language, and a target language to the frmDictionarySearch.aspx web form. Then, the method Translate() from the utility class is called to retrieve the translation list from the bilingual dictionary and show it to the user.

Figure 7.19: Sequence diagram of the process of searching a bilingual dictionary.

Figure 7.20: The web form for managing users in LexBank.

7.4.9 Users Management Web Form

The web form frmManageUsers.aspx allows the administrators of LexBank to manage users. Access to this form is restricted to administrators. The form lists all registered users with related information. An administrator can activate the accounts of new users using this form. He can also deactivate any user from the list. This form can be extended in the future by adding more functionality. As shown in Figure 7.20, this form consists of the following components.

• ID: the UserId of the user.

• Name: the full name of the user.

• Email: the email address of the user.

• Privilege: the privilege assigned to the user. This can be administrator or client.

• Status: the current status of the user.

• Change Status: a command link to change the current status of the user. The status of the user can be changed to Inactive or Active.

Figure 7.21: Sequence diagram of the process of managing users in LexBank.

As summarized in the sequence diagram shown in Figure 7.21, an administrator starts the process of user management by trying to access the frmManageUsers.aspx web form. The web form calls the method IsAdmin() from the utility class to verify whether the user is authorized to access the form. If the user is not authorized, an error message is sent to the user. Otherwise, the web form calls the method RetrieveUsers() to obtain the list of registered users in the system. Then, the administrator can select a user from the list and click the Change Status link to change the current status of the user. The web form calls the ChangeUserStatus() method from the utility class to store the new status and reload the updated users list on the screen.

7.5 Summary

In this chapter, we have described the design and implementation of LexBank, the multilingual lexical resource we produce in this thesis. The architecture of LexBank consists of three layers: the database layer, the application layer and the web interface layer. The database layer consists of two databases: the system settings database and the resource database. The application layer of the system is implemented using Microsoft C#. It provides administrative and resource access services to the web interface. The web interface is designed and implemented using Microsoft Visual Studio 2012. The interface includes web forms for managing users and provides different wordnet search services in several languages. The system can be easily updated to accommodate additional linguistic services and languages.

Chapter 8

CONCLUSIONS

In this chapter, we summarize the main contributions of this dissertation. This dissertation is motivated by the fact that many languages around the world lack the computational lexical resources that are essential in natural language processing. Our first goal in this dissertation was to develop automatic techniques, relying on a few freely available public resources, for constructing wordnets for low-resource languages. A wordnet is a structured lexical ontology that groups words based on their meaning into sets called synsets. A wordnet is a very important lexical resource that is used in many applications, such as machine translation, word-sense disambiguation, information retrieval and document classification. The second goal of this dissertation is to design and implement a system that makes the lexical resources we produce available to the public. Below, we list the main contributions of this dissertation.

• We have developed an approach for constructing structured wordnets. This approach was developed by extending the approach for constructing core wordnets presented by (Lam et al., 2014b). A core wordnet consists only of synsets that group synonym words in sets with a unique ID. In a more comprehensive wordnet, these synsets are semantically connected to represent the relations among the meanings of the synsets. Our approach produces synsets that are connected by semantic relations. Examples of the semantic relations we produced are: synonyms, hypernyms, topic-domain relations, part-holonyms, instance-hypernyms and member-meronyms.

• We presented an approach for enhancing the quality of automatically constructed wordnets. The approach is based on the vector representation of words (word embeddings). Word embeddings are produced by a machine learning technique that maps words to vectors of real numbers in a multi-dimensional space. Our approach uses the word2vec algorithm (Mikolov et al., 2013), a feedforward neural network that learns vector representations of words from an existing corpus. Our approach computes the cosine similarity between the embeddings of semantically related words in our constructed wordnets and filters out any entries whose similarity does not meet a pre-selected threshold value. A brief sketch of this filtering step is given at the end of this list of contributions.

• We introduced synset2vec, an algorithm for representing wordnet synsets in a multi-dimensional space. Word embeddings provide an excellent vector representation of words. However, the representation of a word is affected by the fact that many words have multiple meanings. In order to represent meanings rather than words, we combine the vectors of a synset's lexemes into a single vector that represents the meaning. We believe that this vector representation can be used in many important applications. For example, it can be used in word-sense disambiguation, machine translation and gloss selection for wordnet synsets.

• We used our synset2vec algorithm to add glosses to our automatically constructed synsets. Glosses are a very important part of wordnets. A gloss is used to declare or clarify the meaning of a synset in a wordnet. A gloss can be a definition statement or an example sentence that shows the usage of the synonyms of the synset. To select a gloss (an example sentence) from a corpus for a synset, we used synset2vec to generate vector representations of the candidate glosses and the synset. We compute the cosine similarity between each candidate gloss and the synset. Finally, we select the gloss with the highest cosine similarity with the synset and attach it to the synset. A short sketch of this selection step is also given at the end of the list.

• We have developed LexBank, a web application that gives public users access to the resources we have created. LexBank provides useful services, in a friendly manner, for users who seek linguistic assistance. It also includes evaluation web forms that are used to gather feedback from human judges. The design of LexBank is flexible, and it can be easily expanded to accommodate additional languages and resources.
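To make the filtering step concrete, the following is a minimal sketch in Python, assuming a gensim word2vec model in the binary format used in Appendix B; the file name, the filter_relations helper and the example threshold value are illustrative assumptions rather than the tuned values used in the experiments.

import gensim

# load pretrained word embeddings (file name as in Appendix B.1)
model = gensim.models.Word2Vec.load_word2vec_format(
    'VieVectors_SG_Size100_W5.bin', binary=True)

def filter_relations(pairs, threshold=0.1):
    # pairs: iterable of (offset_pos1, word1, relation, offset_pos2, word2)
    kept = []
    for off1, w1, rel, off2, w2 in pairs:
        try:
            cos = model.similarity(w1, w2)   # cosine similarity of the two embeddings
        except KeyError:                     # out-of-vocabulary word
            cos = 0.0
        if cos >= threshold:
            kept.append((off1, w1, rel, off2, w2, round(cos, 3)))
    return kept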
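The gloss selection step itself can be sketched as follows, assuming the GenerateVectorForSynset and GenerateVectorFor functions listed in Appendix B are in scope; the cosine helper and the select_gloss function are written here only for illustration and are not part of the system code.

import numpy as np

def cosine(u, v):
    # cosine similarity of two vectors; 0.0 when either vector is all zeros
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0

def select_gloss(offset_pos, candidate_sentences):
    # vector for the synset (Appendix B.2) and for each candidate gloss (Appendix B.3)
    synset_vec = GenerateVectorForSynset(offset_pos, "")
    best_sentence, best_cos = None, -1.0
    for sentence in candidate_sentences:
        gloss_vec = GenerateVectorFor(sentence, "")
        cos = cosine(synset_vec, gloss_vec)
        if cos > best_cos:
            best_sentence, best_cos = sentence, cos
    return best_sentence, best_cos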

Chapter 9

FUTURE WORK

In this chapter, we propose some potential future work that can be performed as an extension to the work presented in this dissertation. The general goal of the proposed future work is to enhance the quality and extend the coverage of the lexical resources we have created. For example, we produced our core wordnets using machine translation and small dictionaries. The quality of these wordnets is limited by the resources we used to create them. It is well known that these resources do not guarantee high coverage and accuracy for all of the target languages. Below, we list some of the potential future work.

9.1 Extending Bilingual Dictionaries

In this section, we describe a task that can be undertaken as future work. We propose a new method to extend the bilingual dictionaries created in (Lam et al., 2015b). To increase the coverage of the bilingual dictionaries, we take advantage of the wordnets we have created in this dissertation. This section is divided into two parts. In the first part, we describe the approach we used in (Lam et al., 2015b) to create the bilingual dictionaries. In the second part, we describe the proposed method to extend these bilingual dictionaries.

9.1.1 Related Work

In (Lam et al., 2015b) we created a large number of new bilingual dictionaries using intermediate core wordnets and a machine translator. A dictionary or a lexicon, as defined by (Landau, 1984), consists of sorted 2-tuple entries. Each entry is called a LexicalEntry. The first part of a LexicalEntry is the phrase being defined, while the second part is the definition of the phrase. The definition includes the meaning of the LexicalUnit and usually has several Senses, each of which is a separate representation of a single aspect of the meaning of a phrase. In (Lam et al., 2015b), the entries in the dictionaries are of the form <LexicalUnit, Sense1>, <LexicalUnit, Sense2>, .... The approach for creating dictionaries using intermediate wordnets and a machine translator (IW) is described in Figure 9.1 and Algorithm 2.

Figure 9.1: The IW approach for creating a new bilingual dictionary

Suppose that we would like to construct a bilingual dictionary Dict(S,D), where S is a source language and D is a target language, given the dictionary Dict(S,R), where R is a resource-rich intermediate language. The IW algorithm reads each LexicalEntry from Dict(S,R) and extracts SenseR from it. Then, it retrieves all Offset-POSs of SenseR from the wordnet of language R (Algorithm 2, lines 2-5). All the synonyms of the extracted Offset-POSs are extracted from all the available intermediate wordnets. Then, the algorithm constructs a candidate set candidateSet for the final translations in language D by translating all the extracted synonyms to language D using machine translation (Algorithm 3). There are two attributes in each candidate in candidateSet: word, which represents a translation in language D, and rank, which counts the occurrences of this translation. The rank attribute is used to order the candidates in descending order, where the top candidate is the best translation. Finally, the sorted candidates are inserted into the new dictionary Dict(S,D) (Algorithm 2, lines 8-10).

Algorithm 2: IW algorithm (taken from (Lam et al., 2015b))
Input: Dict(S,R)
Output: Dict(S,D)
1: Dict(S,D) := ϕ
2: for all LexicalEntry ∈ Dict(S,R) do
3:   for all SenseR ∈ LexicalEntry do
4:     candidateSet := ϕ
5:     Find all Offset-POSs of synsets containing SenseR from the R Wordnet
6:     candidateSet = FindCandidateSet(Offset-POSs, D)
7:     sort all candidates in descending order based on their rank values
8:     for all candidate ∈ candidateSet do
9:       SenseD = candidate.word
10:      add tuple <LexicalUnit, SenseD> to Dict(S,D)
11:    end for
12:  end for
13: end for
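For concreteness, the IW procedure can be restated in Python roughly as follows; the helper functions senses_of, offset_pos_of, intermediate_synonyms and translate, and the lexical_unit attribute, are assumptions introduced only for illustration and are not part of the published implementation.

from collections import Counter

def find_candidate_set(offset_pos_list, target_lang):
    # Algorithm 3: rank candidate translations by how often they are produced
    ranks = Counter()
    for offset_pos in offset_pos_list:
        for word in intermediate_synonyms(offset_pos):
            ranks[translate(word, target_lang)] += 1
    return ranks

def build_dictionary(dict_s_r, intermediate_lang, target_lang):
    # Algorithm 2: build Dict(S,D) from Dict(S,R) via intermediate wordnets
    dict_s_d = []
    for entry in dict_s_r:                      # each LexicalEntry of Dict(S,R)
        for sense in senses_of(entry):
            offsets = offset_pos_of(sense, intermediate_lang)
            candidates = find_candidate_set(offsets, target_lang)
            for word, _rank in candidates.most_common():   # descending rank order
                dict_s_d.append((entry.lexical_unit, word))
    return dict_s_d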

9.1.2 Extending Bilingual Dictionaries Using Structured Wordnets

In this section, we propose a new method to extend the dictionaries we created in (Lam et al., 2015b) using the structured wordnets that we have created in this dissertation.

Algorithm 3: FindCandidateSet(Offset-POSs, D) (taken from (Lam et al., 2015b))
Input: Offset-POSs, D
Output: candidateSet
1: candidateSet := ϕ
2: for all Offset-POS ∈ Offset-POSs do
3:   for all word in the Offset-POS extracted from the intermediate wordnets do
4:     candidate.word = translate(word, D)
5:     candidate.rank++
6:     candidateSet += candidate
7:   end for
8: end for
9: return candidateSet

The following steps, which are summarized in Figure 9.2, describe the proposed method to extend the dictionaries; a code sketch is given after the list.

Figure 9.2: Extending bilingual dictionaries using structured wordnets

• We start by extracting each input entry Si from the source language S in the bilingual dictionary from S to D.

• Then, we retrieve the list of synsets of Si from the wordnet of S.

• Next, we extract the corresponding synsets from the wordnet of D.

• For each synset member Dk we extracted from the wordnet of D, we create a lexical entry (Si, Dk).

• In addition, for each synset we extracted from the wordnet of D, we extract the direct hypernyms, and for each hypernym member Hl we also create a lexical entry (Si, Hl).

• Finally, we add any lexical entry we have created in the previous steps to the bilingual dictionary from S to D if it does not already exist in the dictionary.
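A minimal sketch of these steps is given below. The helper functions synsets_of, members_of and hypernyms_of stand for simple lookups over the wordnets described in this dissertation; their names, and the representation of the dictionary as a set of word pairs, are illustrative assumptions only.

def extend_dictionary(dictionary, source_lang, target_lang):
    # dictionary: set of (source_word, target_word) pairs; returns only new pairs
    new_entries = set()
    for s_word, _ in list(dictionary):
        for offset_pos in synsets_of(s_word, source_lang):
            # translations from the aligned synset in the target wordnet
            for d_word in members_of(offset_pos, target_lang):
                new_entries.add((s_word, d_word))
            # looser translations from the direct hypernyms of that synset
            for hyper in hypernyms_of(offset_pos, target_lang):
                for h_word in members_of(hyper, target_lang):
                    new_entries.add((s_word, h_word))
    return new_entries - dictionary   # keep only entries not already present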

9.2 Integrating Part-of-speech Tagging into Wordnet Construction

Since our approach for automatic wordnet construction is based on translation, some of the generated synsets include words that have the wrong part of speech. One solution is to use a Part-Of-Speech Tagger (POS Tagger) to correct the wrong form of the words in the synset. A POS Tagger is a computer program that specifies the part of speech of words in a text written in some language. For example, the Stanford Part-Of-Speech Tagger (Toutanova et al., 2003), which is freely available, provides part-of-speech tagging for Arabic, Chinese, French, Spanish and German. POS Taggers are available for Assamese (Saharia et al., 2009) and Vietnamese (Le-Hong et al., 2010) as well. Since we are dealing with low-resource languages, many languages do not have any POS Taggers, and therefore, this approach is not applicable to them. To correct the part of speech of the words within a synset, we propose the following steps, sketched in code after the list.

• For each synset synset_i in a wordnet wordnet_T, we extract the part of speech of the synset from the Offset-POS of synset_i.

• For each word word_j in synset_i, we find the part of speech of word_j and compare it with the part of speech of synset_i. If the parts of speech of word_j and synset_i do not match, we convert the form of word_j to the correct part-of-speech form and update synset_i.
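The correction step could look roughly as follows, assuming a POS tagger for the target language is available as a function tag(word) and a morphological helper convert(word, pos) that produces the form of the word with the required part of speech; both helpers are assumptions, not existing tools.

def correct_synset_pos(wordnet):
    # wordnet: dict mapping Offset-POS (e.g. '08897065-n') to a list of words
    for offset_pos, words in wordnet.items():
        synset_pos = offset_pos.split('-')[1]          # 'n', 'v', 'a' or 'r'
        for i, word in enumerate(words):
            if tag(word) != synset_pos:                # part-of-speech mismatch
                words[i] = convert(word, synset_pos)   # replace with corrected form
    return wordnet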

9.3 Wordnet Expansion Using Word Embeddings

One possible way to automatically improve the coverage of a wordnet is by looking for additional related words in a corpus using word embeddings. In Chapter 6, we introduced synset2vec, which produces vector representations of synsets in a multi-dimensional space. Taking advantage of synset2vec, we believe it is possible to look for previously unknown words that are semantically related to a synset and include them in the wordnet. Below, we present a brief description of our idea, followed by a code sketch.

• Assume that we would like to expand a wordnet wordnet_T of language T. First, word embeddings for T are generated.

• Next, for each synset synset_i in wordnet_T, the vector V_i for synset_i is generated using synset2vec.

• Then, all the words whose cosine similarity with V_i is at least a preselected threshold α are extracted. From these words, only the words that do not have any semantic relation with synset_i are inserted into a candidate set C_i.

• Next, for each word word_j in C_i, a semantic relation r_j is selected based on a classification approach.

• Finally, word_j is inserted into wordnet_T and connected to synset_i using the semantic relation r_j.
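A rough sketch of this expansion loop is shown below, assuming a gensim word2vec model for language T, the GenerateVectorForSynset function from Appendix B, a helper related_words(offset_pos) that returns the words already linked to the synset, and a classifier choose_relation(word, offset_pos) that picks a semantic relation; the helper names and the example threshold are assumptions.

import numpy as np

ALPHA = 0.6   # illustrative similarity threshold

def expand_wordnet(wordnet, model, top_n=20):
    # wordnet: dict mapping Offset-POS to a list of member words
    additions = []
    for offset_pos in wordnet:
        synset_vec = GenerateVectorForSynset(offset_pos, "")
        if np.count_nonzero(synset_vec) == 0:
            continue   # no usable vector for this synset
        # words whose embeddings are most similar to the synset vector
        for word, cos in model.similar_by_vector(synset_vec, topn=top_n):
            if cos >= ALPHA and word not in related_words(offset_pos):
                relation = choose_relation(word, offset_pos)
                additions.append((offset_pos, relation, word))
    return additions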

9.4 Producing Vector Representation for Multi-word Lexemes

One issue that appears when producing vector representations is that wordnet lexemes can be multi-word phrases. Most of the existing tools for producing word embeddings consider single words only. This means that they produce vectors only for lexical units that are delimited by spaces. Therefore, when we try to generate a vector for a wordnet synset, we avoid multi-word lexemes. An enhanced version of our approach for generating vectors for wordnet synsets can be achieved by including a vector representation for multi-word lexemes. The vectors of the single words within a multi-word lexeme should be aggregated into one vector within the synset. However, one issue that arises is that each single word within the multi-word lexeme may have several meanings when it appears individually. Therefore, careful research is needed to determine a good solution for this problem.
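One simple baseline, given as a sketch under the assumption that a gensim-style model supports dictionary-like lookup of single-word vectors, is to average the vectors of the component words of a multi-word lexeme; this ignores the word-sense issue raised above.

import numpy as np

def multiword_vector(lexeme, model, dim=100):
    vectors = []
    for word in lexeme.split():
        try:
            vectors.append(model[word])        # embedding of the component word
        except KeyError:                        # word not in the model
            continue
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)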

9.5 Vector Representation for Multilingual Wordnets

In this dissertation, we produced vector representations for individual wordnets. One direction that might help in problems such as wordnet expansion and machine translation is obtaining a vector representation of the aggregated wordnets of several languages. Since all of the wordnets we create in this dissertation are aligned with PWN, synsets having the same Offset-POS in different wordnets actually represent the same meaning. Therefore, we believe that combining the vectors of aligned synsets from different languages will produce a representation of the meaning across several languages. One can use this representation to discover the closest meaning of new words that are not included within the wordnets. This could also be used to discover a rough translation for words that are not included in a dictionary.
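As a sketch of this idea, aligned synset vectors could be averaged across languages; synset_vector(offset_pos, lang) is an assumed per-language helper (for example, built on GenerateVectorForSynset from Appendix B applied to the wordnet of each language), and the 100-dimensional size follows the models used in Appendix B.

import numpy as np

def multilingual_synset_vector(offset_pos, languages):
    # one vector per language for the same PWN-aligned Offset-POS
    vectors = [synset_vector(offset_pos, lang) for lang in languages]
    vectors = [v for v in vectors if np.count_nonzero(v) > 0]   # skip empty vectors
    return np.mean(vectors, axis=0) if vectors else np.zeros(100)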

BIBLIOGRAPHY

M. Abbas and K. Smaili. Comparison of topic identification methods for arabic language. In International Conference on Recent Advances in Natural Language Processing- RANLP 2005, volume 14, 2005.

M. Abbas, K. Smaïli, and D. Berkani. Evaluation of topic identification methods on arabic corpora. JDIM, 9(5):185–192, 2011.

K. Ahn and M. Frampton. Automatic generation of translation dictionaries using inter- mediary languages. In Proceedings of the International Workshop on Cross-Language Knowledge Induction, pages 41–44. Association for Computational Linguistics, 2006.

P. Akaraputthiporn, K. Kosawat, and W. Aroonmanakun. A Bi-directional Translation Approach for Building Thai Wordnet. In Asian Language Processing, 2009. IALP’09. International Conference on, pages 97–101. IEEE, 2009.

F. Al tarouti and J. Kalita. Enhancing automatic wordnet construction using word em- beddings. In Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP, pages 30–34, San Diego, California, June 2016. Association for Computational Linguistics.

M. Apidianaki and R. J. Von Neumann. Limsi: Cross-lingual word sense disambiguation using translation sense clustering. In Second Joint Conference on Lexical and Computa- tional Semantics (* SEM), volume 2, pages 178–182, 2013. 103

M. A. Attia. Handling Arabic morphological and syntactic ambiguity within the LFG framework with a view to machine translation. PhD thesis, University of Manchester, 2008.

E. Barbu and V. Barbu Mititelu. Automatic building of wordnets. Recent Advances in Natural Language Processing IV: Selected Papers from RANLP 2005, 292:217, 2007.

K. R. Beesley. Arabic finite-state morphological analysis and generation. In Proceedings of the 16th conference on Computational linguistics-Volume 1, pages 89–94. Association for Computational Linguistics, 1996.

S. Bhattacharya, M. Choudhury, S. Sarkar, and A. Basu. Inflectional morphology synthesis for bengali noun, pronoun and verb systems. Proc. of NCCPB, 8, 2005.

P. Bhattacharyya. Indowordnet. In In Proc. of LREC-10, 2010.

O. Bilgin, z. Çetinoglu,˘ and K. Oflazer. Building a wordnet for Turkish. Romanian Journal of Information Science and Technology, 7(1-2):163–172, 2004.

L. Bloomfield. Language. Holt, Rinehart and Winston, New York, 1933.

F. Bond and K. Ogura. Combining linguistic resources to create a machine-tractable Japanese-Malay dictionary. Language Resources and Evaluation, 42(2):127–136, 2008.

L. Borin and M. Forsberg. Swesaurus; or, the frankenstein approach to wordnet construc- tion. In Proceedings of the Seventh Global WordNet Conference (GWC 2014), 2014. 104

D. Bouamor, N. Semmar, C. France, and P. Zweigenbaum. Using Wordnet and semantic similarity for bilingual terminology mining from comparable corpora. In Proceedings of the 6th Workshop on Building and Using Comparable Corpora, pages 16–23. Citeseer, 2013.

R. D. Brown. Automated dictionary extraction for “knowledge-free” example-based trans- lation. In Proceedings of the Seventh International Conference on Theoretical and Methodological Issues in Machine Translation, pages 111–118, 1997.

T. Buckwalter. Issues in arabic orthography and morphology analysis. In Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages, pages 31–34. Association for Computational Linguistics, 2004.

T. Charoenporn, V. Sornlertlamvanich, C. Mokarat, and H. Isahara. Semi-automatic com- pilation of Asian WordNet. In 14th Annual Meeting of the Association for Natural Lan- guage Processing, pages 1041–1044, 2008.

X. Chen, Z. Liu, and M. Sun. A unified model for word sense representation and disam- biguation. In EMNLP, pages 1025–1035. Citeseer, 2014.

D. Christodoulakis, K. Oflazer, D. Dutoit, S. Koeva, G. Totkov, K. Pala, D. Cristea, D. Tufi¸s, M. Grigoriadou, I. Tsakou, and others. BalkaNet: A Multilingual Semantic Network for Balkan Languages. In Proceedings of the 1st International Wordnet Conference, Mysore, India, 2002.

C. J. Crouch. An approach to the automatic construction of global thesauri. Information Processing & Management, 26(5):629–640, 1990.

A. Cucchiarelli, R. Navigli, F. Neri, and P. Velardi. Automatic Generation of Glosses in the OntoLearn System. In LREC. Citeseer, 2004.

J. R. Curran. From distributional to semantic similarity. 2004. 105

J. R. Curran and M. Moens. Improvements in automatic thesaurus extraction. In Pro- ceedings of the ACL-02 workshop on Unsupervised lexical acquisition-Volume 9, pages 59–66. Association for Computational Linguistics, 2002a.

J. R. Curran and M. Moens. Scaling context space. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 231–238. Association for Computational Linguistics, 2002b.

K. Darwish. Named entity recognition using cross-lingual resources: Arabic as an example. In ACL (1), pages 1558–1567, 2013.

M. Diab and N. Habash. Arabic dialect processing tutorial. In Proceedings of the Hu- man Language Technology Conference of the NAACL, Companion Volume: Tutorial Ab-

stracts, pages 5–6. Association for Computational Linguistics, 2007.

R. M. Fano and D. Hawkins. Transmission of information: A statistical theory of commu- nications. American Journal of Physics, 29(11):793–794, 1961.

A. Farghaly and K. Shaalan. Arabic natural language processing: Challenges and solutions. ACM Transactions on Asian Language Information Processing (TALIP), 8(4):14, 2009.

C. Fellbaum. A semantic network of English verbs. WordNet: An electronic lexical database, 3:153–178, 1998.

C. Fellbaum. WordNet and wordnets. In A. Barber, editor, Encyclopedia of Language and Linguistics, pages 2–665. Elsevier, 2005.

M. A. Finlayson. Java libraries for accessing the Princeton WordNet: Comparison and evaluation. In Proceedings of the 7th Global Wordnet Conference, pages 78–85, 2014.

J. R. Firth. A synopsis of linguistic theory, 1930-1955. 1957.

D. Foley and J. Kalita. Integrating wordnet for multiple sense embeddings in vector seman- tics. In REU on Machine Learning and Applications. University of Colorado, Colorado Springs, 2016.

T. Gollins and M. Sanderson. Improving cross language retrieval with triangulated transla- tion. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 90–95. ACM, 2001.

R. G. Gordon and B. F. Grimes. Ethnologue: Languages of the world, volume 15. SIL international Dallas, TX, 2005.

S. Green and C. D. Manning. Better Arabic parsing: Baselines, evaluations, and analysis. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 394–402. Association for Computational Linguistics, 2010.

G. Grefenstette. Explorations in automatic thesaurus discovery, volume 278. Springer Science & Business Media, 2012.

G. Gunawan and A. Saputra. Building synsets for Indonesian Wordnet with monolingual lexical resources. In Asian Language Processing (IALP), 2010 International Conference on, pages 297–300. IEEE, 2010.

N. Habash and O. Rambow. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting on Asso- ciation for Computational Linguistics, pages 573–580. Association for Computational Linguistics, 2005.

N. Habash, R. Roth, O. Rambow, R. Eskander, and N. Tomeh. Morphological analysis and disambiguation for dialectal arabic. In HLT-NAACL, pages 426–432, 2013.

N. Y. Habash. Introduction to arabic natural language processing. Synthesis Lectures on Human Language Technologies, 3(1):1–187, 2010. 107

A. Haghighi, P. Liang, T. Berg-Kirkpatrick, and D. Klein. Learning Bilingual Lexicons from Monolingual Corpora. In ACL, volume 2008, pages 771–779, 2008.

Z. S. Harris. Distributional structure. Word, 10(2-3):146–162, 1954.

E. Haugen. Dialect, language, nation. American anthropologist, 68(4):922–935, 1966.

L. Hinkle, A. Brouillette, S. Jayakar, L. Gathings, M. Lezcano, and J. Kalita. Design and evaluation of soft keyboards for brahmic scripts. ACM Transactions on Asian Language Information Processing (TALIP), 12(2):6, 2013.

G. Hirst and D. St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms. WordNet: An electronic lexical database, 305:305–332, 1998.

E. Héja. Dictionary Building based on Parallel Corpora and Word Alignment. In Proceed- ings of the XIV Euralex International Congress, Leeuwarden, pages 6–10, 2010.

Y. Hlal. Morphological analysis of arabic speech. In Workshop Papers Kuwait/Proceedings of Kuwait Conference on Computer Processing of the Arabic Language, pages 273–294, 1985.

E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 873–882. Association for Computational Linguistics, 2012.

V. István and Y. Shoichi. Bilingual dictionary generation for low-resourced language pairs. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Pro- cessing: Volume 2-Volume 2, pages 862–870. Association for Computational Linguistics, 2009. 108

P. Jaccard. The distribution of the flora in the alpine zone. New phytologist, 11(2):37–50, 1912.

D. Jurafsky and J. H. Martin. Speech and Language Processing (3rd Edition Draft). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2016.

H. Kaji and M. Watanabe. Automatic Construction of Japanese WordNet. Proceedings of LREC2006, Italy, 2006.

H. Kozima and T. Furugori. Similarity between words computed by spreading activation on an English dictionary. In Proceedings of the sixth conference on European chapter of the Association for Computational Linguistics, pages 232–239. Association for Compu- tational Linguistics, 1993.

K. N. Lam. Automatically Creating MultiLingual Resources. PhD thesis, University of Colorado, Colorado Springs, Apr. 2015.

K. N. Lam and J. Kalita. Creating Reverse Bilingual Dictionaries. In HLT-NAACL, pages 524–528. Citeseer, 2013.

K. N. Lam, F. Al Tarouti, and J. Kalita. Creating Lexical Resources for Endangered Lan- guages. In Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 54–62, Baltimore, Maryland, USA, June 2014a. Association for Computational Linguistics.

K. N. Lam, F. A. Tarouti, and J. Kalita. Automatically constructing Wordnet synsets. In 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), Baltimore, USA, June, 2014b.

K. N. Lam, F. Al Tarouti, and J. Kalita. Phrase translation using a bilingual dictionary and n-gram data: A case study from vietnamese to english. In Proceedings of NAACL-HLT, pages 65–69, 2015a. 109

K. N. Lam, F. Al Tarouti, and J. Kalita. Automatically Creating a Large Number of New Bilingual Dictionaries. In Twenty-Ninth AAAI Conference on Artificial Intelligence, Feb. 2015b.

S. I. Landau. Dictionaries. NY: Scribners, 1984.

L. S. Larkey, L. Ballesteros, and M. E. Connell. Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 275–282. ACM, 2002.

P. Le-Hong, A. Roussanaly, T. M. H. Nguyen, and M. Rossignol. An empirical study of maximum entropy approach for part-of-speech tagging of vietnamese texts. In Traitement Automatique des Langues Naturelles-TALN 2010, page 12, 2010.

D. Leenoi, T. Supnithi, and W. Aroonmanakun. Building a Gold Standard for Thai Word- Net. In Proceeding of The International Conference on Asian Language Processing 2008 (IALP2008), pages 78–82, 2008.

D. Lin. Automatic retrieval and clustering of similar words. In Proceedings of the 36th An- nual Meeting of the Association for Computational Linguistics and 17th International

Conference on Computational Linguistics-Volume 2, pages 768–774. Association for Computational Linguistics, 1998.

K. Lindén and J. Niemi. Is it possible to create a very large wordnet in 100 days? an evaluation. Language resources and evaluation, 48(2):191–201, 2014.

K. Lindén and L. Carlson. FinnWordNet - WordNet på finska via översättning. LexicoNordica, 17(17), 2010.

N. Ljubešic´ and D. Fišer. Bootstrapping bilingual lexicons from comparable corpora for closely related languages. In Text, Speech and Dialogue, pages 91–98. Springer, 2011. 110

M. Maziarz, M. Piasecki, E. Rudnicka, and S. Szpakowicz. Beyond the transfer-and-merge wordnet construction: plwordnet and a comparison with wordnet. In RANLP, pages 443–452, 2013.

J. J. McCarthy. A prosodic theory of nonconcatenative morphology. Linguistic inquiry, 12 (3):373–418, 1981.

T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751, 2013.

G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38 (11):39–41, 1995.

G. A. Miller and F. Hristea. WordNet nouns: Classes and instances. Computational Lin- guistics, 32(1):1–3, 2006.

T. Miller and I. Gurevych. Wordnet-wikipedia-wiktionary: Construction of a three-way alignment. In LREC, pages 2094–2100, 2014.

M. Mladenovic, J. Mitrovic, and C. Krstev. Developing and Maintaining a WordNet: Pro- cedures and Tools. In Proceedings of the 7th Global Wordnet Conference (GWC 2014), pages 55–62, 2014.

C. Mouton and G. de Chalendar. JAWS: Just another WordNet subset. Proc. of TALN’10, 2010.

A. S. Nagvenkar, N. R. Prabhugaonkar, V. P. Prabhu, R. N. Karmali, and J. D. Pawar. Con- cept Space Synset Manager Tool. In Proceedings of the 7th Global Wordnet Conference, pages 86–94, 2014.

P. Nakov and H. T. Ng. Improved statistical machine translation for resource-poor lan- guages using related resource-rich languages. In Proceedings of the 2009 Conference on 111

Empirical Methods in Natural Language Processing: Volume 3-Volume 3, pages 1358– 1367. Association for Computational Linguistics, 2009.

R. Navigli and S. P. Ponzetto. BabelNet: Building a very large multilingual semantic network. In Proceedings of the 48th annual meeting of the association for computational linguistics, pages 216–225. Association for Computational Linguistics, 2010.

A. Neelakantan, J. Shankar, A. Passos, and A. McCallum. Efficient non-parametric estima- tion of multiple embeddings per word in vector space. arXiv preprint arXiv:1504.06654, 2015.

L. Nerima and E. Wehrli. Generating Bilingual Dictionaries by Transitivity. In LREC, volume 8, pages 2584–2587, 2008.

R. Noyer. Vietnamese ‘morphology’ and the definition of word. University of Pennsylvania Working Papers in Linguistics, 5(2):5, 1998.

A. Oliver. Wn-toolkit: Automatic generation of wordnets following the expand model. Proceedings of the 7th Global WordNetConference, Tartu, Estonia, 2014.

A. Oliver and S. Climent. Parallel corpora for Wordnet construction: machine translation vs. automatic sense tagging. In Computational Linguistics and Intelligent Text Process- ing, pages 110–121. Springer, 2012.

P. G. Otero and J. R. P. Campos. Automatic generation of bilingual dictionaries using inter- mediary languages and comparable corpora. In Computational Linguistics and Intelligent Text Processing, pages 473–483. Springer, 2010.

N. R. Prabhugaonkar, J. D. Pawar, and T. Plateau. Use of Sense Marking for Improving WordNet Coverage. In Proceedings of the 7th Global Wordnet Conference, pages 95–99, 2014. 112

Q. Pradet, G. de Chalendar, and J. B. Desormeaux. Wonef, an improved, expanded and evaluated automatic french translation of wordnet. Proceedings of the 7th Global Word- NetConference, Tartu, Estonia, 2014.

J. Ramírez, M. Asahara, and Y. Matsumoto. Japanese-Spanish thesaurus construction using English as a pivot. arXiv preprint arXiv:1303.1232, 2013.

G. Rigau, H. Rodriguez, and E. Agirre. Building accurate semantic taxonomies from monolingual MRDs. In Proceedings of the 17th international conference on Compu- tational linguistics-Volume 2, pages 1103–1109. Association for Computational Linguis- tics, 1998.

H. Rodríguez, D. Farwell, J. Ferreres, M. Bertran, M. Alkhalifa, and M. A. Martí. Arabic wordnet: Semi-automatic extensions using bayesian inference. In LREC, 2008.

S. Rothe and H. Schütze. Autoextend: Extending word embeddings to embeddings for synsets and lexemes. arXiv preprint arXiv:1507.01127, 2015.

B. Sagot and D. Fišer. Building a free French wordnet from multilingual resources. In OntoLex, 2008.

N. Saharia, D. Das, U. Sharma, and J. Kalita. Part of speech tagger for assamese text. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 33–36. Associa- tion for Computational Linguistics, 2009.

R. C. S. K. Sarma. Structured and logical representations of assamese text for question- answering system. In 24th International Conference on Computational Linguistics, page 27, 2012.

M. Saveski and I. Trajkovski. Automatic construction of wordnets by using machine trans- lation and language modeling. In 13th Multiconference Information Society, Ljubljana, Slovenia, 2010. 113

K. Shaalan, A. A. Monem, and A. Rafea. Arabic morphological generation from interlin- gua. In Intelligent Information Processing III, pages 441–451. Springer, 2006.

U. Sharma, J. K. Kalita, and R. K. Das. Acquisition of morphology of an Indic language from text corpus. ACM Transactions on Asian Language Information Processing (TALIP), 7(3):9, 2008.

R. Shaw, A. Datta, D. VanderMeer, and K. Dutta. Building a scalable database-driven reverse dictionary. Knowledge and Data Engineering, IEEE Transactions on, 25(3): 528–540, 2013.

S. Soderland, O. Etzioni, D. S. Weld, K. Reiter, M. Skinner, M. Sammer, J. Bilmes, and others. Panlingual lexical translation via probabilistic inference. Artificial Intelligence, 174(9):619–637, 2010.

K. Tanaka and K. Umemura. Construction of a bilingual dictionary intermediated by a third language. In Proceedings of the 15th conference on Computational linguistics-Volume 1, pages 297–303. Association for Computational Linguistics, 1994.

L. C. Thompson. A Vietnamese reference grammar, volume 13. University of Hawaii Press, 1987.

K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language

Technology-Volume 1, pages 173–180. Association for Computational Linguistics, 2003.

P. Vossen. Introduction to eurowordnet. In EuroWordNet: A multilingual database with lexical semantic networks, pages 1–17. Springer, 1998. 114

Wikipedia. Wordnet — wikipedia, the free encyclopedia, 2015. URL http://en.wikipedia.org/w/index.php?title=WordNet&oldid=656664111. [Online; accessed 22-April-2015].

Wikipedia. Vietnamese language — wikipedia, the free encyclopedia, 2016a. URL https://en.wikipedia.org/w/index.php?title=Vietnamese_language&oldid=731154067. [Online; accessed 30-July-2016].

Wikipedia. Vietnamese morphology — wikipedia, the free encyclopedia, 2016b. URL https://en.wikipedia.org/w/index.php?title=Vietnamese_morphology&oldid=730832239. [Online; accessed 30-July-2016].

Wikipedia. Lexicon — wikipedia, the free encyclopedia, 2016c. URL https://en.wikipedia.org/w/index.php?title=Lexicon&oldid=718057169. [Online; accessed 3-August-2016].

K. Yu and J. Tsujii. Extracting bilingual dictionary from comparable corpora with de- pendency heterogeneity. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational

Linguistics, Companion Volume: Short Papers, pages 121–124. Association for Compu- tational Linguistics, 2009.

O. F. Zaidan and C. Callison-Burch. Arabic dialect identification. Computational Linguistics, 40(1):171–202, 2014.

Appendix A

PAPERS RESULTING FROM THE DISSERTATION

Appendix B

DATA PROCESSING SOFTWARE CODE

B.1 ComputCosineSim.py

###########################
# Program to compute cosine similarity
# between semantically related words in a WordNet
# using Word2Vec
# Author: Feras Al Tarouti
# Date : Feb 4 2016
import unicodecsv as csv
import codecs
import gensim
import editdistance

word2vecmodel = gensim.models.Word2Vec.load_word2vec_format(
    'VieVectors_SG_Size100_W5.bin', binary=True)

with open('LexBankVieSemRelatedWords_WithCOS.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(['OffsetPos1', 'Word1', 'Relation', 'OffsetPos2', 'Word2',
                     'COS', 'ld'])
    with open('LexBankVieSemRelatedWords.csv', 'rb') as f:
        reader = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)
        firstline = True
        rownum = 0
        for row in reader:
            if firstline:
                firstline = False
            else:
                print("Compute Similarity for pairs number: {0}".format(rownum))
                SynsetID1 = row[0]
                Word1 = row[1]
                Relation = row[2]
                SynsetID2 = row[3]
                Word2 = row[4]
                try:
                    cos = round(word2vecmodel.similarity(Word1, Word2), 3)
                except Exception:
                    cos = 0.00
                ld = editdistance.eval(Word1, Word2)
                newrow = [SynsetID1, Word1, Relation, SynsetID2, Word2, cos, ld]
                writer.writerow(newrow)
                rownum = rownum + 1

B.2 GenerateVectorForSynset.py

###########################
# A function for computing a synset vector
# Author: Feras Al Tarouti
# Date : May 18 2016
def GenerateVectorForSynset(syn, thislemma):
    FinalVector = np.zeros(100)
    VectorList = []  # define the vector set for this synset
    LemmasList = FindLemmasOfSyns(syn)  # the list of lemmas for this synset

    for lemma in LemmasList:
        if lemma != thislemma:
            Vector = GenerateVectorForLemma(lemma)
            if np.count_nonzero(Vector) > 0:
                VectorList.append(Vector)  # add vector of word to the synset Vector

    # Find out if this synset has only one word;
    # in this case we have to find a related word and add it to the vector sets
    if len(VectorList) < 2:
        # we need to find out a related synset
        relatedword = FindRelatedSyn(syn)
        if relatedword != "":
            Vector = GenerateVectorForLemma(relatedword)
            if np.count_nonzero(Vector) > 0:
                VectorList.append(Vector)  # add vector of word to the synset Vector

    for vec in VectorList:
        FinalVector = np.add(FinalVector, vec)
    # compute the average
    numbofVec = len(VectorList)
    scalar = np.divide(float(1), float(numbofVec))
    FinalVector = np.multiply(FinalVector, scalar)
    return FinalVector

B.3 GenerateVectorForGloss.py

###########################
# A function for computing a gloss vector
# Author: Feras Al Tarouti
# Date : May 18 2016
def GenerateVectorFor(thisSentence, lemma):
    VectorList = []  # define the vector set for this Sentence
    FinalVector = np.zeros(100)
    for word in thisSentence.split():
        skip = False
        if word not in stopwrds and word != lemma:
            try:
                Vector = word2vecmodel[word]
                NofSyns = FindNumberOfSyns(word)
                # Scale the vector based on the number of synsets
                if NofSyns > 1:
                    thisScalar = np.divide(float(1), float(NofSyns))
                    Vector = np.multiply(Vector, thisScalar)
                VectorList.append(Vector)
                skip = False  # we have this word in our model
            except Exception:
                skip = True
    if len(VectorList) > 0:
        for vec in VectorList:
            FinalVector = np.add(FinalVector, vec)
        numbofVec = len(VectorList)
        scalar = np.divide(float(1), numbofVec)
        FinalVector = np.multiply(FinalVector, scalar)
    return FinalVector

B.4 ComputeGlossSynsetSimilarity.py

###########################
# A program for computing similarity between synset and gloss
# Author: Feras Al Tarouti
# Date : May 18 2016
# First Step : Open the synset-gloss files, and read the sentence
# Second Step : Generate the vector for the synset
# Third Step : Generate the vector for the sentence
# Fourth Step : Compute the cosine similarity between the synset vector
#               and the sentence vector
# Fifth Step : Save the result
###########################
with open(InputDataFile, 'rb') as SentencesFile, open(outputfile, 'wb') as out_file:
    reader = csv.reader(SentencesFile, encoding='utf-8', delimiter=',')
    writer = csv.writer(out_file, encoding='utf-8')
    writer.writerow(['ID', 'CosSem'])
    rownum = 0
    for row in reader:
        if rownum != 0:
            print("Computing Cosine Similarity for Row numb: {0}".format(rownum))
            thisSenID = row[0]     # read the current sentence ID
            thisSynset = row[1]    # read the current synsetID
            thisSynMem = row[2]    # read number of members for this synset
            thiswrd = row[3]       # read the word used in this sentence
            thiswrdSyns = row[4]   # read the number of synsets for this word
            thisSentence = row[5]  # read the current sentence

            # Compute a vector for this synset
            thisSynsetVector = GenerateVectorForSynset(thisSynset, "")

            # Generate Vector for this sentence
            thisSentenceVector = GenerateVectorFor(thisSentence, "")

            CosDistance = ComputeCosine(thisSynsetVector, thisSentenceVector)
            x = Decimal(CosDistance)
            if math.isnan(x):
                CosDistance = 0
            newrow = [thisSenID, CosDistance]
            writer.writerow(newrow)

        rownum = rownum + 1

Appendix C

MICROSOFT SQL SERVER TABLES

-- -- Database: `LexBank_System` ------Table structure for table `Users_Info` -- USE [LexBank_System] GO

SET ANSI_NULLS ON GO

SET QUOTED_IDENTIFIER ON GO

SET ANSI_PADDING ON GO

CREATE TABLE [dbo].[Users_Info]( [UserId] [varchar](50) NOT NULL, [UserName] [varchar](100) NOT NULL, [UserEmail] [varchar](70) NOT NULL, [UserPwd] [varchar](max) NOT NULL, [UserPriv] [varchar](15) NOT NULL, [UserStatus] [varchar](15) NOT NULL, CONSTRAINT [PK_Users_Info] PRIMARY KEY CLUSTERED ( [UserId] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) ON [PRIMARY]

GO

SET ANSI_PADDING OFF GO ------122

-- Table structure for table `System_Log` -- USE [LexBank_System] GO

SET ANSI_NULLS ON GO

SET QUOTED_IDENTIFIER ON GO

SET ANSI_PADDING ON GO

CREATE TABLE [dbo].[System_Log]( [EventId] [int] IDENTITY(1,1) NOT NULL, [EventDesc] [varchar](200) NOT NULL, [EventTime] [datetime] NOT NULL, [UserId] [varchar](50) NOT NULL ) ON [PRIMARY]

GO

SET ANSI_PADDING OFF GO ------Database: `LexBank_Resources` ------Table structure for table `Arabic_CoreWordnet` -- USE [LexBank_Resources] GO

SET ANSI_NULLS ON GO

SET QUOTED_IDENTIFIER ON GO

SET ANSI_PADDING ON GO

CREATE TABLE [dbo].[Arabic_CorWordnet]( [Offset_Pos] [nvarchar](10) NOT NULL, [Member] [nvarchar](200) NOT NULL ) ON [PRIMARY]

SET ANSI_PADDING OFF GO ------Table structure for table `Assamese_CoreWordnet` 123

-- USE [LexBank_Resources] GO

SET ANSI_NULLS ON GO

SET QUOTED_IDENTIFIER ON GO

SET ANSI_PADDING ON GO

CREATE TABLE [dbo].[Assamese_CorWordnet]( [Offset_Pos] [nvarchar](10) NOT NULL, [Member] [nvarchar](200) NOT NULL ) ON [PRIMARY]

SET ANSI_PADDING OFF GO ------Table structure for table `Vietnamese_CoreWordnet` -- USE [LexBank_Resources] GO

SET ANSI_NULLS ON GO

SET QUOTED_IDENTIFIER ON GO

SET ANSI_PADDING ON GO

CREATE TABLE [dbo].[Vietnamese_CorWordnet]( [Offset_Pos] [nvarchar](10) NOT NULL, [Member] [nvarchar](200) NOT NULL ) ON [PRIMARY]

SET ANSI_PADDING OFF GO ------Table structure for table `Arabic_Sem_Relations` -- USE [LexBank_Resources] GO

SET ANSI_NULLS ON GO

SET QUOTED_IDENTIFIER ON GO 124

SET ANSI_PADDING ON GO

CREATE TABLE [dbo].[Arabic_Sem_Relations]( [Left_Offset_Pos] [nvarchar](10) NOT NULL, [Relation] [nvarchar](50) NOT NULL, [Right_Offset_Pos] [nvarchar](10) NOT NULL ) ON [PRIMARY]

SET ANSI_PADDING OFF GO ------Table structure for table `Assamese_Sem_Relations` -- USE [LexBank_Resources] GO

SET ANSI_NULLS ON GO

SET QUOTED_IDENTIFIER ON GO

SET ANSI_PADDING ON GO

CREATE TABLE [dbo].[Assamese_Sem_Relations]( [Left_Offset_Pos] [nvarchar](10) NOT NULL, [Relation] [nvarchar](50) NOT NULL, [Right_Offset_Pos] [nvarchar](10) NOT NULL ) ON [PRIMARY]

SET ANSI_PADDING OFF GO ------Table structure for table `Vietnamese_Sem_Relations` -- USE [LexBank_Resources] GO

SET ANSI_NULLS ON GO

SET QUOTED_IDENTIFIER ON GO

SET ANSI_PADDING ON GO

CREATE TABLE [dbo].[Vietnamese_Sem_Relations]( [Left_Offset_Pos] [nvarchar](10) NOT NULL, [Relation] [nvarchar](50) NOT NULL, [Right_Offset_Pos] [nvarchar](10) NOT NULL ) ON [PRIMARY] 125

SET ANSI_PADDING OFF GO ------Table structure for table `Arabic_WordnetGlosses` -- USE [LexBank_Resources] GO

SET ANSI_NULLS ON GO

SET QUOTED_IDENTIFIER ON GO

SET ANSI_PADDING ON GO

CREATE TABLE [dbo].[Arabic_WordnetGlosses]( [Offset_Pos] [varchar](10) NOT NULL, [Gloss] [varchar](4000) NULL ) ON [PRIMARY]

GO

SET ANSI_PADDING OFF GO ------Table structure for table `Assamese_WordnetGlosses` -- USE [LexBank_Resources] GO

SET ANSI_NULLS ON GO

SET QUOTED_IDENTIFIER ON GO

SET ANSI_PADDING ON GO

CREATE TABLE [dbo].[Assamese_WordnetGlosses]( [Offset_Pos] [varchar](10) NOT NULL, [Gloss] [varchar](4000) NULL ) ON [PRIMARY]

GO

SET ANSI_PADDING OFF GO ------Table structure for table `Vietnamese_WordnetGlosses` -- 126

USE [LexBank_Resources] GO

SET ANSI_NULLS ON GO

SET QUOTED_IDENTIFIER ON GO

SET ANSI_PADDING ON GO

CREATE TABLE [dbo].[Vietnamese_WordnetGlosses]( [Offset_Pos] [varchar](10) NOT NULL, [Gloss] [varchar](4000) NULL ) ON [PRIMARY]

GO

SET ANSI_PADDING OFF GO ------Table structure for table `Arabic_Sem_Relations_Eval_Data` -- USE [LexBank_Resources] GO

SET ANSI_NULLS ON GO

SET QUOTED_IDENTIFIER ON GO

SET ANSI_PADDING ON GO

CREATE TABLE [dbo].[Arabic_Sem_Relations_Eval_Data]( [RelationKey] [int] IDENTITY(1,1) NOT NULL, [Left_Offset_Pos] [nvarchar](10) NOT NULL, [Word1] [nvarchar](100) NOT NULL, [Relation] [nvarchar](50) NOT NULL, [Right_Offset_Pos] [nvarchar](10) NOT NULL, [Word2] [nvarchar](100) NOT NULL, [COS] [real] NULL, ) ON [PRIMARY]

GO

SET ANSI_PADDING OFF GO ------Table structure for table `Assamese_Sem_Relations_Eval_Data` -- USE [LexBank_Resources] 127

GO

SET ANSI_NULLS ON GO

SET QUOTED_IDENTIFIER ON GO

SET ANSI_PADDING ON GO

CREATE TABLE [dbo].[Assamese_Sem_Relations_Eval_Data]( [RelationKey] [int] IDENTITY(1,1) NOT NULL, [Left_Offset_Pos] [nvarchar](10) NOT NULL, [Word1] [nvarchar](100) NOT NULL, [Relation] [nvarchar](50) NOT NULL, [Right_Offset_Pos] [nvarchar](10) NOT NULL, [Word2] [nvarchar](100) NOT NULL, [COS] [real] NULL, ) ON [PRIMARY]

GO

SET ANSI_PADDING OFF GO ------Table structure for table `Vietnamese_Sem_Relations_Eval_Data` -- USE [LexBank_Resources] GO

SET ANSI_NULLS ON GO

SET QUOTED_IDENTIFIER ON GO

SET ANSI_PADDING ON GO

CREATE TABLE [dbo].[Vietnamese_Sem_Relations_Eval_Data]( [RelationKey] [int] IDENTITY(1,1) NOT NULL, [Left_Offset_Pos] [nvarchar](10) NOT NULL, [Word1] [nvarchar](100) NOT NULL, [Relation] [nvarchar](50) NOT NULL, [Right_Offset_Pos] [nvarchar](10) NOT NULL, [Word2] [nvarchar](100) NOT NULL, [COS] [real] NULL, ) ON [PRIMARY]

GO

SET ANSI_PADDING OFF GO 128

------Table structure for table `Arabic_Sem_Relations_Eval_Response` -- USE [LexBank_Resources] GO

SET ANSI_NULLS ON GO

SET QUOTED_IDENTIFIER ON GO

SET ANSI_PADDING ON GO

CREATE TABLE [dbo].[Arabic_Sem_Relations_Eval_Response]( [AnswerKey] [int] IDENTITY(1,1) NOT NULL, [RelationKey] [int] NOT NULL, [Score] [int] NOT NULL, [UserId] [varchar](50) NULL ) ON [PRIMARY]

GO

SET ANSI_PADDING OFF GO ------Table structure for table `Assamese_Sem_Relations_Eval_Response` -- USE [LexBank_Resources] GO

SET ANSI_NULLS ON GO

SET QUOTED_IDENTIFIER ON GO

SET ANSI_PADDING ON GO

CREATE TABLE [dbo].[Assamese_Sem_Relations_Eval_Response]( [AnswerKey] [int] IDENTITY(1,1) NOT NULL, [RelationKey] [int] NOT NULL, [Score] [int] NOT NULL, [UserId] [varchar](50) NULL ) ON [PRIMARY]

GO

SET ANSI_PADDING OFF GO ------Table structure for table `Vietnamese_Sem_Relations_Eval_Response` 129

-- USE [LexBank_Resources] GO

SET ANSI_NULLS ON GO

SET QUOTED_IDENTIFIER ON GO

SET ANSI_PADDING ON GO

CREATE TABLE [dbo].[Vietnamese_Sem_Relations_Eval_Response]( [AnswerKey] [int] IDENTITY(1,1) NOT NULL, [RelationKey] [int] NOT NULL, [Score] [int] NOT NULL, [UserId] [varchar](50) NULL ) ON [PRIMARY]

GO

-- --------------------------------------------------------
-- Table structure for table `Arabic_WordnetGloss_Eval_Data`
-- --------------------------------------------------------
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Arabic_WordnetGloss_Eval_Data](
    [GlossKey] [int] IDENTITY(1,1) NOT NULL,
    [Offset-pos] [varchar](10) NOT NULL,
    [Word] [nvarchar](500) NULL,
    [Sentence] [nvarchar](4000) NULL,
    [PWNGloss] [nvarchar](900) NULL,
    [CosSem] [real] NULL,
    [GlossRank] [int] NULL
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
-- Table structure for table `Assamese_WordnetGloss_Eval_Data`
-- --------------------------------------------------------
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Assamese_WordnetGloss_Eval_Data](
    [GlossKey] [int] IDENTITY(1,1) NOT NULL,
    [Offset-pos] [varchar](10) NOT NULL,
    [Word] [nvarchar](500) NULL,
    [Sentence] [nvarchar](4000) NULL,
    [PWNGloss] [nvarchar](900) NULL,
    [CosSem] [real] NULL,
    [GlossRank] [int] NULL
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
-- Table structure for table `Vietnamese_WordnetGloss_Eval_Data`
-- --------------------------------------------------------
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_WordnetGloss_Eval_Data](
    [GlossKey] [int] IDENTITY(1,1) NOT NULL,
    [Offset-pos] [varchar](10) NOT NULL,
    [Word] [nvarchar](500) NULL,
    [Sentence] [nvarchar](4000) NULL,
    [PWNGloss] [nvarchar](900) NULL,
    [CosSem] [real] NULL,
    [GlossRank] [int] NULL
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
-- Table structure for table `Arabic_WordnetGlosses_Eval_Response`
-- --------------------------------------------------------
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Arabic_WordnetGlosses_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [GlossKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
-- Table structure for table `Assamese_WordnetGlosses_Eval_Response`
-- --------------------------------------------------------
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Assamese_WordnetGlosses_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [GlossKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
-- Table structure for table `Vietnamese_WordnetGlosses_Eval_Response`
-- --------------------------------------------------------
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_WordnetGlosses_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [GlossKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO
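In the same way, each *_WordnetGlosses_Eval_Response row refers back to a candidate gloss in *_WordnetGloss_Eval_Data through GlossKey. The following query is a sketch only, not part of the schema scripts, showing how the gloss judgments for Vietnamese could be aggregated:

-- Illustration only: mean evaluator score per candidate Vietnamese gloss.
USE [LexBank_Resources]
GO

SELECT d.[GlossKey], d.[Word], d.[GlossRank],
       AVG(CAST(r.[Score] AS real)) AS AvgScore,
       COUNT(r.[AnswerKey]) AS NumJudgments
FROM [dbo].[Vietnamese_WordnetGloss_Eval_Data] AS d
JOIN [dbo].[Vietnamese_WordnetGlosses_Eval_Response] AS r
    ON r.[GlossKey] = d.[GlossKey]
GROUP BY d.[GlossKey], d.[Word], d.[GlossRank]
ORDER BY AvgScore DESC
GO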

Appendix D

LEXBANK UTILITY CLASS

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Data;
using System.Data.SqlClient;
using System.Web.Configuration;
using System.IO;
using System.Text;
using System.Security.Cryptography;

namespace LexBank2016
{
    public class LexBankUtils
    {
        private string LexBankConnectionString =
            WebConfigurationManager.ConnectionStrings["LexBankData"].ToString();

        public Boolean IsUserIdAvailable(string UserId)
        {
            // Takes a user id and checks whether it is already in use.
            Boolean result = false;

            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create new SqlCommand object.
                using (SqlCommand command = new SqlCommand(
                    "SELECT UserId FROM Users_Info where UserId like @UserId", connection))
                {
                    // Define the parameters.
                    command.Parameters.AddWithValue("@UserId", UserId.Trim());
                    // Invoke ExecuteScalar; a null result means the id is unused.
                    var firstColumn = command.ExecuteScalar();
                    if (firstColumn == null)
                    {
                        result = true;
                    }
                }
            }
            return result;
        }

        public string EncryptPassword(string PlanePassword)
        {
            string EncryptionKey = "LexBank";
            byte[] PlaneBytes = Encoding.Unicode.GetBytes(PlanePassword);
            using (Aes PasswordEncryptor = Aes.Create())
            {
                Rfc2898DeriveBytes PBKDF = new Rfc2898DeriveBytes(EncryptionKey,
                    new byte[] { 0x49, 0x76, 0x61, 0x6e, 0x20, 0x4d, 0x65,
                                 0x64, 0x76, 0x65, 0x64, 0x65, 0x76 });
                PasswordEncryptor.Key = PBKDF.GetBytes(32);
                PasswordEncryptor.IV = PBKDF.GetBytes(16);
                using (MemoryStream ms = new MemoryStream())
                {
                    using (CryptoStream cs = new CryptoStream(ms,
                        PasswordEncryptor.CreateEncryptor(), CryptoStreamMode.Write))
                    {
                        cs.Write(PlaneBytes, 0, PlaneBytes.Length);
                        cs.Close();
                    }
                    PlanePassword = Convert.ToBase64String(ms.ToArray());
                }
            }
            return PlanePassword;
        }

        public string DecryptPassword(string EncryptedPassword)
        {
            string EncryptionKey = "LexBank";
            byte[] DecryptedBytes = Convert.FromBase64String(EncryptedPassword);
            using (Aes PasswordEncryptor = Aes.Create())
            {
                Rfc2898DeriveBytes PBKDF = new Rfc2898DeriveBytes(EncryptionKey,
                    new byte[] { 0x49, 0x76, 0x61, 0x6e, 0x20, 0x4d, 0x65,
                                 0x64, 0x76, 0x65, 0x64, 0x65, 0x76 });
                PasswordEncryptor.Key = PBKDF.GetBytes(32);
                PasswordEncryptor.IV = PBKDF.GetBytes(16);
                using (MemoryStream ms = new MemoryStream())
                {
                    using (CryptoStream cs = new CryptoStream(ms,
                        PasswordEncryptor.CreateDecryptor(), CryptoStreamMode.Write))
                    {
                        cs.Write(DecryptedBytes, 0, DecryptedBytes.Length);
                        cs.Close();
                    }
                    EncryptedPassword = Encoding.Unicode.GetString(ms.ToArray());
                }
            }
            return EncryptedPassword;
        }

        public Boolean CreateNewUser(string UserId, string UserName, string UserEmail, string UserPwd)
        {
            Boolean result = false;
            string UserPriv = "client";
            string UserStatus = "New";
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create new SqlCommand object.
                using (SqlCommand command = new SqlCommand(
                    "INSERT INTO Users_Info VALUES(@UserId,@UserName,@UserEmail,@UserPwd,@UserPriv,@UserStatus)",
                    connection))
                {
                    // Define the parameters.
                    command.Parameters.AddWithValue("@UserId", UserId.Trim());
                    command.Parameters.AddWithValue("@UserName", UserName.Trim());
                    command.Parameters.AddWithValue("@UserEmail", UserEmail.Trim());
                    command.Parameters.AddWithValue("@UserPwd", UserPwd.Trim());
                    command.Parameters.AddWithValue("@UserPriv", UserPriv.Trim());
                    command.Parameters.AddWithValue("@UserStatus", UserStatus.Trim());
                    // Invoke ExecuteNonQuery; one affected row means the user was created.
                    int c = 0;
                    try
                    {
                        c = command.ExecuteNonQuery();
                        if (c == 1)
                            result = true;
                    }
                    catch (Exception e)
                    {
                        // Insertion failed; result stays false.
                    }
                }
            }
            return result;
        }

        public bool IsAuthenticated(string userid, string userpassword)
        {
            bool result = false;
            SqlConnection LexBankDataConnection = new SqlConnection(LexBankConnectionString);
            SqlCommand AuthCommand = new SqlCommand(
                "Select UserId, UserPriv, UserStatus from Users_Info where UserId=@userid and UserPwd=@userpassword",
                LexBankDataConnection);
            AuthCommand.Parameters.AddWithValue("@userid", userid);
            AuthCommand.Parameters.AddWithValue("@userpassword", EncryptPassword(userpassword.Trim()));
            LexBankDataConnection.Open();
            SqlDataReader reader = AuthCommand.ExecuteReader();
            while (reader.Read())
            {
                string UserStatus = reader["UserStatus"].ToString();
                if (UserStatus == "Active")
                {
                    result = true;
                    LogEvent("Login", DateTime.Now, userid.Trim());
                }
            }
            return result;
        }

        public List<string> FindSynSet(string lexeme, string WordNet)
        {
            List<string> result = new List<string>();

            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create new SqlCommand object; the wordnet table name is supplied by the caller.
                using (SqlCommand command = new SqlCommand(
                    "SELECT * FROM " + WordNet + " where Member like @lexeme", connection))
                {
                    // Define the parameters.
                    command.Parameters.AddWithValue("@lexeme", lexeme.Trim());
                    // Invoke ExecuteReader and collect the matching synset ids.
                    SqlDataReader reader = command.ExecuteReader();
                    while (reader.Read())
                    {
                        result.Add(reader.GetString(0).Trim());
                    } // end while
                } // end the second using
            } // end the first using
            return result;
        }

        public List<string> FindSynSetLexemes(string OffsetPos, string WordNet)
        {
            List<string> result = new List<string>();

            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create new SqlCommand object.
                using (SqlCommand command = new SqlCommand(
                    "SELECT * FROM " + WordNet + " where Offset_Pos like @OffsetPos", connection))
                {
                    // Define the parameters.
                    command.Parameters.AddWithValue("@OffsetPos", OffsetPos.Trim());
                    // Invoke ExecuteReader and collect the member lexemes of the synset.
                    SqlDataReader reader = command.ExecuteReader();
                    while (reader.Read())
                    {
                        result.Add(reader.GetString(1).Trim());
                    } // end while
                } // end the second using
            } // end the first using
            return result;
        }

        public Boolean IsSynSetAvailable(string OffsetPos, string Wordnet)
        {
            // Takes a synset id and checks whether it is included in a wordnet.
            Boolean result = false;

            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create new SqlCommand object.
                using (SqlCommand command = new SqlCommand(
                    "SELECT Offset_Pos FROM " + Wordnet.Trim() + " where Offset_Pos like @OffsetPos",
                    connection))
                {
                    // Define the parameters.
                    command.Parameters.AddWithValue("@OffsetPos", OffsetPos.Trim());
                    // Invoke ExecuteReader; any returned row means the synset exists.
                    SqlDataReader reader = command.ExecuteReader();
                    if (reader.Read())
                        result = true;
                }
            }
            return result;
        }

        public Dictionary<string, string> FindSynSetRelations(string OffsetPos, string WordNet, string RelationsTable)
        {
            Dictionary<string, string> result = new Dictionary<string, string>();

            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create new SqlCommand object.
                using (SqlCommand command = new SqlCommand(
                    "SELECT * FROM " + RelationsTable.Trim() + " where Left_Offset_Pos like @OffsetPos",
                    connection))
                {
                    // Define the parameters.
                    command.Parameters.AddWithValue("@OffsetPos", OffsetPos.Trim());
                    // Invoke ExecuteReader and expand each related synset into its lexemes.
                    SqlDataReader reader = command.ExecuteReader();

                    string Relation = "";
                    int c = 0;
                    while (reader.Read())
                    {
                        if (IsSynSetAvailable(reader.GetString(2).Trim(), WordNet))
                        {
                            Relation = reader.GetString(1).Trim() + " :" + reader.GetString(2).Trim();
                            string RelatedOffsetPos = reader.GetString(2).Trim();
                            List<string> RelatedLexemes = FindSynSetLexemes(RelatedOffsetPos, WordNet);

                            foreach (string lexeme in RelatedLexemes)
                            {
                                c++;
                                result.Add(RelatedOffsetPos + c.ToString(), Relation + "-->" + lexeme);
                            }
                        }
                    } // end while
                } // end the second using
            } // end the first using
            return result;
        }

        public string FindGloss(string OffsetPos, string GlossTable)
        {
            string result = "Gloss is not available";

            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create new SqlCommand object.
                using (SqlCommand command = new SqlCommand(
                    "SELECT * FROM " + GlossTable + " where Offset_Pos like @OffsetPos", connection))
                {
                    // Define the parameters.
                    command.Parameters.AddWithValue("@OffsetPos", OffsetPos.Trim());
                    // Invoke ExecuteReader; the second column holds the gloss text.
                    SqlDataReader reader = command.ExecuteReader();
                    while (reader.Read())
                    {
                        result = reader.GetString(1).Trim();
                    } // end while
                } // end the second using
            } // end the first using
            return result;
        }

        public List<string> ReadRelation(string RelationKey, string RelationDataTable)
        {
            // Reads one candidate relation so that it can be shown to an evaluator.
            List<string> Result = new List<string>();

            try
            {
                SqlConnection MyConnection = new SqlConnection(LexBankConnectionString);

                string Sqls = "SELECT [RelationKey], [Word1], [Relation], [Word2] FROM "
                    + RelationDataTable + " where [RelationKey] = @RelationKey";
                SqlCommand Mycommand = new SqlCommand(Sqls, MyConnection);
                // Bind the key referenced in the WHERE clause.
                Mycommand.Parameters.AddWithValue("@RelationKey", RelationKey);
                DataTable MyTable = new DataTable();
                using (SqlDataAdapter Myadapter = new SqlDataAdapter(Mycommand))
                {
                    Myadapter.Fill(MyTable);

                    if (MyTable.Rows.Count > 0)
                    {
                        for (int x = 0; x < 4; x++)
                        {
                            Result.Add(MyTable.Rows[0][x].ToString());
                        }
                    }
                }
                return Result;
            }
            catch (Exception ex)
            {
                return Result;
            }
        }

        public List<string> ReadSynsetGloss(int GlossKey, string TableName)
        {
            // Reads one candidate synset gloss so that it can be shown to an evaluator.
            List<string> Result = new List<string>();

            try
            {
                SqlConnection MyConnection = new SqlConnection(LexBankConnectionString);

                string Sqls = "SELECT [GlossKey], [Word], [Sentence], [PWN_Gloss] FROM "
                    + TableName + " where [GlossKey] = @GlossKey";
                DataTable MyTable = new DataTable();
                SqlCommand Mycommand = new SqlCommand(Sqls, MyConnection);
                Mycommand.Parameters.AddWithValue("@GlossKey", GlossKey);
                using (SqlDataAdapter Myadapter = new SqlDataAdapter(Mycommand))
                {
                    Myadapter.Fill(MyTable);
                    if (MyTable.Rows.Count > 0)
                    {
                        for (int x = 0; x < 4; x++)
                        {
                            Result.Add(MyTable.Rows[0][x].ToString());
                        }
                    }
                }
                return Result;
            }
            catch (Exception ex)
            {
                return Result;
            }
        }

        public Boolean EvaluateRelation(int RelationKey, int Score, string UserId, string EvaluationTable)
        {
            try
            {
                SqlConnection MyConnection = new SqlConnection(LexBankConnectionString);

                string sqls = "INSERT INTO " + EvaluationTable
                    + " ([RelationKey],[Score],[UserID]) values (@RelationKey,@Score,@UserId)";
                var command = new SqlCommand(sqls, MyConnection);
                command.Parameters.AddWithValue("@RelationKey", RelationKey);
                command.Parameters.AddWithValue("@Score", Score);
                command.Parameters.AddWithValue("@UserId", UserId.Trim());
                MyConnection.Open();
                command.ExecuteNonQuery();
                MyConnection.Close();
                return true;
            }
            catch (Exception ex)
            {
                return false;
            }
        }

        private Boolean EvaluateGloss(int GlossKey, int Score, string UserId, string EvaluationTable)
        {
            try
            {
                SqlConnection MyConnection = new SqlConnection(LexBankConnectionString);

                string sqls2 = "INSERT INTO " + EvaluationTable
                    + " ([GlossKey],[Score],[UserID]) values (@GlossKey,@Score,@UserId)";
                var command = new SqlCommand(sqls2, MyConnection);
                command.Parameters.AddWithValue("@GlossKey", GlossKey);
                command.Parameters.AddWithValue("@Score", Score);
                command.Parameters.AddWithValue("@UserId", UserId);

                MyConnection.Open();
                command.ExecuteNonQuery();
                MyConnection.Close();
                return true;
            }
            catch (Exception ex)
            {
                return false;
            }
        }

        public void LogEvent(string EventDesc, DateTime EventTime, string UserId)
        {
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create new SqlCommand object.
                using (SqlCommand command = new SqlCommand(
                    "INSERT INTO System_Log([EventDesc], [EventTime], [UserId]) VALUES(@EventDesc, @EventTime, @UserId)",
                    connection))
                {
                    // Define the parameters.
                    command.Parameters.AddWithValue("@EventDesc", EventDesc.Trim());
                    command.Parameters.Add("@EventTime", SqlDbType.DateTime).Value = EventTime;
                    command.Parameters.AddWithValue("@UserId", UserId.Trim());
                    // Invoke ExecuteNonQuery method.
                    command.ExecuteNonQuery();
                }
            }
        }

        public void ChangeUserStatus(string UserId, string NewStatus)
        {
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create new SqlCommand object.
                using (SqlCommand command = new SqlCommand(
                    "UPDATE Users_Info SET UserStatus=@UserStatus WHERE UserId=@UserId", connection))
                {
                    // Define the parameters.
                    command.Parameters.AddWithValue("@UserId", UserId.Trim());
                    command.Parameters.AddWithValue("@UserStatus", NewStatus.Trim());
                    // Invoke ExecuteNonQuery method.
                    command.ExecuteNonQuery();
                }
            }
        }

        public DataTable RetrieveUsers()
        {
            DataTable result = new DataTable();

            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create new SqlCommand object.
                using (SqlCommand command = new SqlCommand(
                    "SELECT [UserId], [UserName], [UserEmail], [UserPriv], [UserStatus] FROM [Users_Info]",
                    connection))
                {
                    SqlDataAdapter dadapter = new SqlDataAdapter(command);
                    dadapter.Fill(result);
                }
            }
            return result;
        }
    }
}

Appendix E

IRB APPROVAL LETTER