A Two-Phase Approach for Building Vietnamese Wordnet

A Two-Phase Approach for Building Vietnamese WordNet Phuong-Thai Nguyen Van-Lam Pham VNU University of Engineering and Technology VASS Institute of Linguistics [email protected] [email protected] Hoang-An Nguyen Huy-Hien Vu Naiscorp Inc. VNU University of Engineering and Technology [email protected] [email protected] Ngoc-Anh Tran Thi-Thu-Ha Truong Le Quy Don Technical University VASS Institute of [email protected] Lexicography and Encyclopedia [email protected] Abstract Wordnets play an important role not only in Length Words Percentage linguistics but also in natural language pro- 1 6,303 15.69 cessing (NLP). This paper reports major results of a project which aims to construct a 2 28,416 70.72 wordnet for Vietnamese language. We pro- 3 2,259 5.62 pose a two-phase approach to the construction 4 2,784 6.93 of Vietnamese WordNet employing available 5 419 1.04 language resources and ensuring Vietnamese Total 40,181 100 specific linguistic and cultural characteristics. We also give statistical results and analyses to Table 1: Word length statistics from a popular Viet- show characteristics of the wordnet. namese dictionary, made by the Vietnam Lexicography Center (Vietlex). 1 Introduction In order to solve various problems in NLP including than the number of single words. As in many other information retrieval, machine translation, text clas- Asian languages such as Chinese, Japanese and sification, etc. we need language resources such as Thai, there is no word delimiter in Vietnamese. The corpora and dictionaries. Wordnet is one of impor- space is a syllable delimiter but not a word delimiter, tant resources for solving such problems. The first so a Vietnamese sentence can often be segmented wordnet was created at Princeton University for En- in many ways. Secondly, Vietnamese is an isolat- glish language. After that, diverse wordnets were ing language in which words do not change their constructed such as EuroWordNet for European lan- forms according to their grammatical function in a guages, Asian WordNet for Asian languages, etc. sentence. There are a number of important characteristics Constructing wordnets is a complicated task. This of the Vietnamese language that impact the con- task involves answering questions including which struction of wordnet. Firstly, the smallest unit in approach is appropriate, how to ensure specific char- the formation of Vietnamese words is the sylla- acteristics of the language, how to take full advan- ble. Words can have just one syllable, for example tage of available resources. This paper makes an at- ‘đẹp’beautiful, or be a compound of two or more tempt to answer these fundamental questions and re- syllables, for example ‘màu sắc’color. As shown in ports major results of a project aiming to construct a Table 1, single-syllable words only cover a small wordnet for Vietnamese language, whose database proportion while two-syllable words account for the includes 30,000 synonym sets and 50,000 words largest proportion of the whole vocabulary. Form- with 30,000 commonly used by the Vienamese. ing that vocabulary is a set of 7,729 syllables, higher Figure 1 represents major steps in construction statistics and analyses of the wordnet being constructed. Section 5 gives a number of conclusions and future works. 2 Existing Wordnets 2.1 Princeton’s WordNet Since 1978, George Miller (Fellbaum, 1998) had re- searched and developed a database of words and semantic relations between words. This database was called wordnet and was considered a model of men- tal lexicon. Conceivably, wordnet is a large discrete graph in which nodes are synonym sets (synsets) and edges are semantic relations of synsets. A synset is a collection of synonym words of the same part of speech in which each word can be replaced by one of the others in certain contexts. For example, car, auto, automobile, machine, motorcar form a synset. This synset has a hyponymy relation with the synset vehicle because a car is a kind of vehicle. Figure 1: Steps in Vietnamese WordNet construction. 2.2 EuroWordNet EuroWordNet (Vossen, 2002) is a multilingual lexical database of nine European languages. Each lan- process of Vietnamese WordNet. We put these steps guage has its own wordnet. These component word- in two phases. Phase 1 involves steps 1-3, phase 2 nets are linked via Princeton’s WordNet version involves steps 4 and 5. We exploit a number of lan- 1.5. More specifically, their synsets are linked to guage resources including Princeton’s WordNet, a Princeton’s WordNet’s synsets which are equivalent Vietnamese dictionary and an English-Vietnamese or closest in meaning. EuroWordNet accepts differ- bilingual dictionary. ent levels of lexicalization. For example, Princeton’s The class of adverbs in Vietnamese is a closed WordNet contains both lexicalized and unlexicalized class (or a class of function words), while in En- synsets, while Dutch WordNet contains only lexical- glish the class of adverbs is an open class (or a ized ones. Component wordnets have been built by class of content words). Vietnamese adverbs express exploiting available resources such as monolingual time (such as ‘đã’past, ‘đang’continuous), degree dictionaries, bilingual dictionaries, and the Prince- (such as ‘rất’very, ‘hơi’rather), and negation (such ton’s WordNet. as ‘không’not). Therefore the number of adverbs in Vietnamese is much smaller than that in English. For 2.3 Asian WordNet that reason, there are only three parts of speech in This project (Virach et al., 2009) aims to cre- Vietnamese WordNet including noun, verb, and ad- ate wordnets for Asian languages such as Thai, jective. Semantic relations in Vietnamese WordNet Japanese, Korean, etc. Currently, there are data of 13 are similar to those in Princeton WordNet except a languages in Asian WordNet. The authors adopted number of relations such as derivationally related a semi-automatic approach to translate Princeton’s form, participle of verb, etc. WordNet’s synsets into Asian languages using bilin- The remaining part of this paper is organized as gual dictionaries. The authors also built an online follows: Section 2 gives a review of several exist- tool for editing and visualizing contents of the word- ing wordnets. Section 3 introduces our method to net. By using this tool, many people can easily par- construct Vietnamese WordNet. Section 4 presents ticipate in the task of translation. They can also mod- ify translations and can vote for the best one. In ởn’white, ‘làng xã’village, etc. or words relating to terms of wordnet design, Asian WordNet is a spe- history, society and culture of Vietnamese such cial case of EuroWordNet because it was built by as ‘truyện Kiều’a famous story in V ietnam, ‘bánh translation approach. The major limitation of Asian chưng’a kind of cake, etc. Therefore in phase 2, we WordNet is that it lacks specific concepts of Asian select coordinated compound words, reduplicative languages. words, and subordinated compound words to add to the Vietnamese WordNet. We choose words from a 2.4 Laconec popular Vietnamese dictionary, made by the Viet- This is a semantic-based multilingual dictionary nam Lexicography Center (Vietlex). available on the Internet1. According to the information on the website: This dictionary has been devel- 3.2 Guideline Development oped since 2007. The goal of Laconec is to provide Editing data for wordnet is not an easy task, guide- multilingual lexical knowledge word lookup based line documents are required to ensure the cor- on semantics. The core of the system is the large rectness and the consistency of data. In a word- scale Princeton’s Wordnet-like monolingual dictio- net, words are linked by semantic relations, there- naries linked to each other. This dictionary acknowl- fore in the guideline document we focus on de- edges Dr. Francis Bond’s works (Bond and Paik, scribing how to identify semantic relations espe- 2012) and four wordnets including English, Thai, cially synonymy, antonymy, hypernymy, hyponymy, Japanese, and Finnish. holonymy, meronymy, and troponymy. We created diagnostic tests to verify relations between synsets. 3 A Method to Construct Vietnamese For instance, synonymy relation is identified on the WordNet basis of the possibility of a word being replaced by 3.1 Two Phases in Constructing Vietnamese another in a specific context. This can be verified WordNet by the possibility of being mutually substitutable in sentence ‘X is a Noun1 therefore X is a Noun2’. In We construct Vietnamese Wordnet through two addition to the tests there are a number of principles phases (Figure 1). In phase 1 (steps 1 to 3), we fo- which can be used for encoding the relations, for ex- cus on translating a part of Princeton’s WordNet into ample the Economy principle and the Compatibility Vietnamese. In phase 2 (steps 4 and 5), we make use principle (Fellbaum, 1998). Besides, we give guide- of Vietnamese resources to create the wordnet. Con- lines as to handling Vietnamese specific linguistic tents and requirements of these phases are different and cultural characteristics. Last but not least, the and separated. guideline document contains instructions as to how The major work of phase 1 is translating a part of to give definitions and examples, how to exploit re- English Wordnet into Vietnamese. Thus, we firstly sources such as existing dictionaries, and spelling need to determine a list of English synsets to trans- rules. late. Because of the significantly smaller size of our target Vietnamese wordnet, we choose to translate 3.3 Treatment of Vietnamese Specific Words only a part of Princeton’s WordNet. Our criteria for With regard to their structure, Vietnamese words selecting English synsets include: (1) the lexicaliza- can be divided into a number of types includ- tion possibility in Vietnamese; (2) the connectivity ing single-syllable words, coordinated compound of the selected part; (3) the inclusion of common words, subordinated compound words, reduplicative base concepts.

A Two-Phase Approach for Building Vietnamese Wordnet

Compound Word Formation.Pdf

Hunspell – the Free Spelling Checker

Building a Treebank for French

Compounds and Word Trees

A Poor Man's Approach to CLEF

Robust Prediction of Punctuation and Truecasing for Medical ASR

Lexical Analysis

Name Description

Part-Of-Speech Tagging Guidelines for the Penn Treebank Project (3Rd Revision)

Learning to SMILE

Building a Large Annotated Corpus of English: the Penn Treebank

Real-Time Statistical Speech Translation