
Comparing the value of Latent Semantic Analysis on two English-to-Indonesian lexical mapping tasks

Eliza Margaretha and Ruli Manurung
Faculty of Computer Science, University of Indonesia, Depok, Indonesia
[email protected], [email protected]

Abstract

This paper describes an experiment that attempts to automatically map English words and concepts, derived from the Princeton WordNet, to their Indonesian analogues appearing in a widely-used Indonesian dictionary, using Latent Semantic Analysis (LSA). A bilingual semantic model is derived from an English-Indonesian parallel corpus. Given a particular word or concept, the semantic model is then used to identify its neighbours in a high-dimensional semantic space. Results from various experiments indicate that for bilingual word mapping, LSA is consistently outperformed by the basic vector space model, i.e. where the full-rank word-document matrix is applied. We speculate that this is because the 'smoothing' effect LSA has on the word-document matrix, whilst very useful for revealing implicit semantic patterns, blurs the cooccurrence information that is necessary for establishing word translations.

1 Overview

An ongoing project at the Information Retrieval Lab, Faculty of Computer Science, University of Indonesia, concerns the development of an Indonesian WordNet (http://bahasa.cs.ui.ac.id/iwn). To that end, one major task concerns the mapping of two monolingual dictionaries at two different levels: bilingual word mapping, which seeks to find translations of a lexical entry from one language to another, and bilingual concept mapping, which defines equivalence classes over concepts defined in the two resources. In other words, we try to automatically construct two variants of a bilingual dictionary between two languages, i.e. one with sense-disambiguated entries and one without.

In this paper we present an extension of LSA to a bilingual setting that is similar to (Rehder et al., 1997; Clodfelder, 2003; Deng and Gao, 2007), and then apply it to the two mapping tasks described above, specifically towards the lexical resources of Princeton WordNet (Fellbaum, 1998), an English semantic lexicon, and the Kamus Besar Bahasa Indonesia (KBBI; the KBBI is the copyright of the Language Centre, Indonesian Ministry of National Education), considered by many to be the official dictionary of the Indonesian language.

We first provide formal definitions of our two tasks of bilingual word and concept mapping (Section 2) before discussing how these tasks can be automated using LSA (Section 3). We then present our experiment design and results (Section 4), followed by a discussion of the results (Section 5) and a summary (Section 6).

2 Task Definitions

As mentioned above, our work concerns the mapping of two monolingual dictionaries. We refer to these resources as WordNets because we view them as semantic lexicons that index entries based on meaning; however, we do not consider other semantic relations typically associated with a WordNet, such as hypernymy, hyponymy, etc. For our purposes, a WordNet can be formally defined as a 4-tuple, as follows.

A WordNet is a 4-tuple ⟨C, W, χ, ω⟩, where:

• A concept c ∈ C is a semantic entity which represents a distinct, specific meaning. Each concept is associated with a gloss, which is a textual description of its meaning. For example, we could define two concepts, c_1 and c_2, where the former is associated with the gloss "a financial institution that accepts deposits and channels the money into lending activities" and the latter with "sloping land (especially the slope beside a body of water)".

• A word w ∈ W is an orthographic entity, which represents a word in a particular language (in the case of Princeton WordNet, English). For example, we could define two words, w_1 and w_2, where the former represents the orthographic string bank and the latter represents spoon.

• A word may convey several different concepts. The function χ : W → 2^C returns all concepts conveyed by a particular word. Thus, χ(w), where w ∈ W, returns C_w ⊆ C, the set of all concepts that can be conveyed by w. Using the examples above, χ(w_1) = {c_1, c_2}.

• Conversely, a concept may be conveyed by several words. The function ω : C → 2^W returns all words that can convey a particular concept. Thus, ω(c), where c ∈ C, returns W_c ⊆ W, the set of all words that convey c. Using the examples above, ω(c_1) = {w_1}.

We can define different WordNets for different languages, e.g. WN^e = ⟨C^e, W^e, χ^e, ω^e⟩ and WN^i = ⟨C^i, W^i, χ^i, ω^i⟩. We also introduce the notation w^e_k to denote word k in W^e and c^e_k to denote concept k in C^e. For the sake of our discussion, we will assume WN^e to be an English WordNet and WN^i to be an Indonesian WordNet.

If we make the assumption that concepts are language independent, WN^e and WN^i should theoretically share the same set of universal concepts. In practice, however, we may have two WordNets with different conceptual representations, hence the distinction between C^e and C^i. We introduce the relation E ⊆ C^e × C^i to denote the explicit mapping of equivalent concepts in C^e and C^i.

We now describe two tasks that can be performed between WN^e and WN^i, namely bilingual concept mapping and bilingual word mapping. The task of bilingual concept mapping is essentially the establishment of the concept equivalence relation E. For example, given the example concepts in Table 1, bilingual concept mapping seeks to establish ⟨c^e_1, c^i_2⟩, ⟨c^e_2, c^i_3⟩, and ⟨c^e_3, c^i_4⟩.

Concept | Word | Gloss | Example
c^e_1 | time | an instance or single occasion for some event | "this time he succeeded"
c^e_2 | time | a suitable moment | "it is time to go"
c^e_3 | time | a reading of a point in time as given by a clock | "do you know what time it is?"
c^i_1 | kali | kata untuk menyatakan kekerapan tindakan (a word signifying the frequency of an action) | "dalam satu minggu ini, dia sudah empat kali datang ke rumahku" (this past week, she has come to my house four times)
c^i_2 | kali | kata untuk menyatakan salah satu waktu terjadinya peristiwa yg merupakan bagian dari rangkaian peristiwa yg pernah dan masih akan terus terjadi (a word signifying a particular instance of an ongoing series of events) | "untuk kali ini ia kena batunya" (this time he suffered for his actions)
c^i_3 | waktu | saat yg tertentu untuk melakukan sesuatu (a specific time to be doing something) | "waktu makan" (eating time)
c^i_4 | jam | saat tertentu, pada arloji jarumnya yg pendek menunjuk angka tertentu dan jarum panjang menunjuk angka 12 (the point in time when the short hand of a clock points to a certain hour and the long hand points to 12) | "ia bangun jam lima pagi" (she woke up at five o'clock)
c^i_5 | kali | sebuah sungai yang kecil (a small river) | "air di kali itu sangat keruh" (the water in that small river is very murky)
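To make the definitions above concrete, here is a minimal Python sketch of χ, ω, and the word-mapping composition, using toy identifiers modelled on Table 1. The concept IDs and the fragment of E shown here are purely illustrative, not the actual gold mapping:

```python
# Toy WordNet fragments modelled on Table 1. chi maps a word to the
# concepts it conveys; omega (its inverse) maps a concept to its words.
# All identifiers are illustrative, not from any released resource.
chi = {
    "time":  {"c_e1", "c_e2", "c_e3"},   # English: three senses of "time"
    "kali":  {"c_i1", "c_i2", "c_i5"},   # Indonesian: frequency/instance/river
    "waktu": {"c_i3"},
    "jam":   {"c_i4"},
}

def omega(concept):
    """Set of words conveying `concept` (inverse of chi)."""
    return {w for w, senses in chi.items() if concept in senses}

# A hypothetical fragment of the concept-equivalence relation E.
E = {"c_e1": "c_i2", "c_e2": "c_i3", "c_e3": "c_i4"}

def word_mapping(word_e):
    """Bilingual word mapping: union of omega(E(c)) over c in chi(word_e)."""
    return set().union(*(omega(E[c]) for c in chi[word_e] if c in E))

print(sorted(word_mapping("time")))  # ['jam', 'kali', 'waktu']
```

Bilingual concept mapping would populate the dictionary `E`; bilingual word mapping then falls out as the set union computed by `word_mapping`.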

Table 1. Sample concepts in WN^e and WN^i

The task of bilingual word mapping is to find, given a word w^e ∈ W^e, the set of all its plausible translations in W^i, regardless of the concepts being conveyed. We can also view this task as computing the union of the sets of words in W^i that convey the concepts conveyed by w^e. Formally, we compute the set

    ∪_{c^i ∈ C'} ω^i(c^i),  where C' = {c^i | ⟨c^e, c^i⟩ ∈ E and c^e ∈ χ^e(w^e)}.

For example, in Princeton WordNet, χ^e(time), i.e. the set of concepts conveyed by the English orthographic form time, returns more than 15 different concepts, among others c^e_1, c^e_2, and c^e_3 (see Table 1). In Indonesian, assuming the relation E as defined above, the set of words that convey the equivalent of c^e_1, i.e. ω^i(c^i_2), includes kali (as in "kali ini dia berhasil" = "this time she succeeded"). On the other hand, ω^i(c^i_3) may include waktu (as in "ini waktunya untuk pergi" = "it is time to go") and saat (as in "sekarang saatnya menjual saham" = "now is the time to sell shares"), and lastly, ω^i(c^i_4) may include jam (as in "apa anda tahu jam berapa sekarang?" = "do you know what time it is now?").

Thus, the bilingual word mapping task seeks to compute, for the English word time, the set of Indonesian words {kali, waktu, saat, jam, ...}. Note that each of these Indonesian words may convey different concepts, e.g. χ^i(kali) may include c^i_5 in Table 1.

3 Automatic mapping using Latent Semantic Analysis

Latent semantic analysis, or simply LSA, is a method for discerning underlying semantic information from a given corpus of text, and for subsequently representing the contextual meaning of the words in the corpus as vectors in a high-dimensional semantic space (Landauer et al., 1998). As such, LSA is a powerful method for word sense disambiguation.

The mathematical foundation of LSA is provided by the Singular Value Decomposition (SVD). Initially, a corpus is represented as an m × n word-passage matrix M, where cell (i, j) represents the occurrence of the i-th word in the j-th passage. Thus, each row of M represents a word and each column represents a passage. The SVD is then applied to M, decomposing it such that M = UΣV^T, where U is an m × r matrix of left singular vectors, V is an n × r matrix of right singular vectors, and Σ is an r × r diagonal matrix containing the singular values of M.

Crucially, this decomposition factors M using an orthonormal basis that produces an optimal reduced-rank approximation matrix (Kalman, 1996). By reducing the dimensions of the matrix, irrelevant information and noise are removed, and the rank reduction yields useful induction of implicit relations. However, finding the optimal level of rank reduction is an empirical issue.

LSA can be applied to exploit a parallel corpus to automatically perform bilingual word and concept mapping. We define a parallel corpus P as a set of pairs ⟨d^e, d^i⟩, where d^e is a document written in the language of WN^e, and d^i is its translation in the language of WN^i.

Intuitively, we would expect that if two words w^e and w^i consistently occur in documents that are translations of each other, but not in other documents, they would at the very least be semantically related, and possibly even be translations of each other. For instance, imagine a parallel corpus consisting of news articles written in English and Indonesian: in English articles where the word Japan occurs, we would expect the word Jepang to occur in the corresponding Indonesian articles.

This intuition can be represented in a word-document matrix as follows. Let M^e be a word-document matrix of n English documents and m^e English words, and M^i be a word-document matrix of n Indonesian documents and m^i Indonesian words. The documents are arranged such that, for 1 ≤ j ≤ n, the English document represented by column j of M^e and the Indonesian document represented by column j of M^i form a pair of translations. Since they are translations, we can view them as occupying exactly the same point in semantic space, and could just as easily view column j of both matrices as representing the union, or concatenation, of the two articles.

Consequently, we can construct the bilingual word-document matrix

    M = | M^e |
        | M^i |

which is an (m^e + m^i) × n matrix where cell (i, j) contains the number of occurrences of word i in article j. Row i forms the semantic vector of, for i ≤ m^e, an English word, and for i > m^e, an Indonesian word. Conversely, column j forms a vector representing the English and Indonesian words appearing in the translation pair of document j.

This approach is similar to that of (Rehder et al., 1997; Clodfelder, 2003; Deng and Gao, 2007). The SVD process is the same, while the usage is different: for example, (Rehder et al., 1997) employ SVD for cross-language information retrieval, whereas we use it to accomplish word and concept mappings.

LSA can be applied to this bilingual word-document matrix. Computing the SVD of the matrix and reducing its rank should unearth implicit patterns of semantic concepts. The vectors representing English and Indonesian words that are closely related should have high similarity; word translations more so.

To approximate the bilingual word mapping task, we compare the similarity between the semantic vectors representing words in W^e and W^i. Specifically, for the first m^e rows of M, which represent words in W^e, we compute their similarity to each of the last m^i rows, which represent words in W^i. Given a large enough corpus, we would expect all words in W^e and W^i to be represented by rows in M.

To approximate the bilingual concept mapping task, we compare the similarity between the semantic vectors representing concepts in C^e and C^i. These vectors can be approximated by first constructing a textual context set representing a concept c: for example, we can include the words in ω(c) together with the words from its gloss and example sentences. The semantic vector of the concept is then a weighted average of the semantic vectors of the words contained within this context set, i.e. rows in M. Again, given a large enough corpus, we would expect enough of these context words to be represented by rows in M to form an adequate semantic vector for the concept c.

4 Experiments

4.1 Existing Resources

For the English lexicon, we used the most current version of WordNet (Fellbaum, 1998), version 3.0 (more specifically, the SQL version available from http://wnsqlbuilder.sourceforge.net). For each of the 117659 distinct synsets, we only use the following data: the set of words belonging to the synset, the gloss, and example sentences, if any. The union of these resources yields a set of 169583 unique words.

For the Indonesian lexicon, we used an electronic version of the KBBI developed at the University of Indonesia. For each of the 85521 distinct word sense definitions, we use the following data: the list of sublemmas, i.e. inflected forms, along with the gloss and example sentences, if any. The union of these resources yields a set of 87171 unique words.

Our main parallel corpus consists of 3273 English and Indonesian article pairs taken from the ANTARA news agency. This collection was developed by Mirna Adriani and Monica Lestari Paramita at the Information Retrieval Lab, University of Indonesia (publication forthcoming).

A bilingual English-Indonesian dictionary was constructed using various online resources, including a handcrafted dictionary by Hantarto Widjaja (http://hantarto.definitionroadsafety.org), kamus.net, and Transtool v6.1, a commercial translation system. In total, this dictionary maps 37678 unique English words to 60564 unique Indonesian words.

4.2 Bilingual Word Mapping

Our experiment with bilingual word mapping was set up as follows. Firstly, we define a collection of article pairs derived from the ANTARA collection, and from it we set up a bilingual word-document matrix (see Section 3). The LSA process is subsequently applied to this matrix, i.e. we first compute the SVD of the matrix, and then use it to compute a k-rank approximation. Finally, based on this approximation, for a randomly chosen set of vectors representing English words, we compute the nearest vectors representing the most similar Indonesian words. This is conventionally computed using the cosine of the angle between two vectors.

Within this general framework, there are several variables that we experiment with, as follows:

• Collection size. Three subsets of the parallel corpus were randomly created: P100 contains 100 article pairs, P500 contains 500 article pairs, and P1000 contains 1000 article pairs. Each subsequent subset wholly contains the previous subsets, i.e. P100 ⊂ P500 ⊂ P1000.

• Rank reduction. For each collection, we applied LSA with different degrees of rank approximation, namely 10%, 25%, and 50% of the number of dimensions of the original collection. Thus, for P100 we compute the 10, 25, and 50-rank approximations, for P500 the 50, 125, and 250-rank approximations, and for P1000 the 100, 250, and 500-rank approximations.

• Removal of stopwords. Stopwords are words that appear so frequently in a text that they are assumed to be insignificant in representing its specific context; removing them is a common technique for improving the performance of information retrieval systems. It is applied in preprocessing the collections, i.e. all instances of the stopwords are removed from the collections before applying LSA.

• Weighting. Two weighting schemes, namely TF-IDF and Log-Entropy, were applied to the word-document matrix separately.

• Mapping selection. For computing the precision and recall values, we experimented with the number of mapping results to consider: the top 1, 10, 50, and 100 mappings based on similarity were taken.

As an example, Table 2 presents the results of mapping the two English words film and billion to their Indonesian translations, using the P1000 training collection with the 500-rank approximation and no weighting. The former shows a successful mapping, while the latter shows an unsuccessful one.

    (a) film                      (b) billion
    film           0.814         pembebanan   0.973
    filmnya        0.698         kijang       0.973
    sutradara      0.684         halmahera    0.973
    garapan        0.581         alumina      0.973
    perfilman      0.554         terjadwal    0.973
    penayangan     0.544         viskositas   0.973
    kontroversial  0.526         tabel        0.973
    koboi          0.482         royalti      0.973
    irasional      0.482         reklamasi    0.973
    frase          0.482         penyimpan    0.973

Table 2. The 10 most similar Indonesian words for the English words (a) film and (b) billion

Bilingual LSA correctly maps film to its translation, film, despite the fact that the two are treated as separate elements, i.e. their shared orthography is completely coincidental. Additionally, the other Indonesian words it suggests are semantically related, e.g. sutradara (director), garapan (creation), penayangan (screening), etc. On the other hand, the suggested word mappings for billion are incorrect, and the correct translation, milyar, is missing. We suspect this may be due to several factors. Firstly, billion does not by itself invoke a particular semantic frame, and thus its semantic vector might not suggest a specific conceptual domain. Secondly, billion can sometimes be translated numerically instead of lexically. Lastly, this failure may also be due to lack of data: the collection is simply too small to provide useful statistics that represent semantic context. Similar LSA approaches are commonly trained on collections of text numbering in the tens of thousands of articles.

Note as well that the absolute vector cosine values do not accurately reflect the correctness of the word translations. To properly assess the results of this experiment, evaluation against a gold standard is necessary. This is achieved by computing precision and recall against the Indonesian words returned by the bilingual dictionary, i.e. how isomorphic is the set of LSA-derived word mappings to a human-authored set of word mappings? We provide a baseline (FREQ) as comparison, which computes the nearness between English and Indonesian words on the original word-document occurrence frequency matrix. Other approaches are possible, e.g. mutual information (Sari, 2007).

Table 3(a)-(e) shows the different aspects of our experiment results, in each case averaging over the other variables. Table 3(a) confirms our intuition that as the collection size increases, the precision and recall values also increase. Table 3(b) presents the effects of rank approximation: the higher the rank approximation percentage, the better the mapping results. Note that a rank approximation of 100% is equal to the FREQ baseline of simply using the full-rank word-document matrix for computing vector space nearness. Table 3(c) suggests that stopwords seem to help LSA to yield the correct mappings. It is believed that stopwords are not bounded by semantic domains, and thus do not carry any semantic bias; however, on account of the small size of the collection, stopwords that coincidentally appear consistently in a specific domain may carry some semantic information about that domain. Table 3(d) compares the mapping results in terms of weighting usage: weighting can improve the mappings, and Log-Entropy weighting yields the highest results. Table 3(e) shows the comparison of mapping selections. As the number of translation pairs selected increases, the precision value decreases; on the other hand, the possibility of finding more pairs matching those in the bilingual dictionary increases, and thus the recall value increases as well.

    (a) Collection size      FREQ                LSA
                             P       R           P       R
        P100                 0.0668  0.1840      0.0346  0.1053
        P500                 0.1301  0.2761      0.0974  0.2368
        P1000                0.1467  0.2857      0.1172  0.2603

    (b) Rank approximation   P       R
        10%                  0.0680  0.1727
        25%                  0.0845  0.2070
        50%                  0.0967  0.2226
        100%                 0.1009  0.2285

    (c) Stopwords            FREQ                LSA
                             P       R           P       R
        Contained            0.1108  0.2465      0.0840  0.2051
        Removed              0.1138  0.2440      0.0822  0.1964

    (d) Weighting usage      FREQ                LSA
                             P       R           P       R
        No weighting         0.1009  0.2285      0.0757  0.1948
        Log-Entropy          0.1347  0.2753      0.1041  0.2274
        TF-IDF               0.1013  0.2319      0.0694  0.1802

    (e) Mapping selection    FREQ                LSA
                             P       R           P       R
        Top 1                0.3758  0.1588      0.2380  0.0987
        Top 10               0.0567  0.2263      0.0434  0.1733
        Top 50               0.0163  0.2911      0.0133  0.2338
        Top 100              0.0094  0.3183      0.0081  0.2732

Table 3. Results of bilingual word mapping comparing (a) collection size, (b) rank approximation, (c) removal of stopwords, (d) weighting schemes, and (e) mapping selection (P = precision, R = recall)

Most interestingly, however, the FREQ baseline, which uses the basic vector space model, consistently outperforms LSA.

4.3 Bilingual Concept Mapping

Using the same resources as the previous experiment, we ran an experiment to perform bilingual concept mapping by replacing the vectors to be compared with semantic vectors for concepts (see Section 3). For a concept c^e, i.e. a WordNet synset, we constructed a textual context set as the union of ω(c^e), the set of words in the gloss of c^e, and the set of words in the example sentences associated with c^e. To represent our intuition that the words in ω(c^e) play a more important role in defining the semantic vector than the words in the gloss and examples, we applied weights of 60%, 30%, and 10% to the three components, respectively. Similarly, a semantic vector representing a concept c^i, i.e. an Indonesian word sense in the KBBI, was constructed from a textual context set composed of the sublemma, the definition, and the example of the word sense, using the same weightings. We only average word vectors that appear in the collection (depending on the experimental variables used).

We formulated an experiment which closely resembles the word sense disambiguation problem: given a WordNet synset, the task is to select the most appropriate Indonesian sense from a subset of senses that have been selected based on their words appearing in our bilingual dictionary. These specific senses are called suggestions. Thus, instead of comparing the vector representing communication with every single Indonesian sense in the KBBI, in this task we only compare it against suggestions with a limited range of sublemmas, e.g. komunikasi, perhubungan, hubungan, etc.

This setup is thus identical to that of an ongoing experiment here to manually map WordNet synsets to KBBI senses. Consequently, this facilitates assessment of the results by computing the level of agreement between the LSA-based mappings and the human annotations.

To illustrate, Table 4 presents a successful (a) and an unsuccessful (b) example of mapping a WordNet synset. For each example we show the synset ID and the ideal textual context set, i.e. the set of words that convey the synset, its gloss, and its example sentences. We then show the actual textual context set with the notation {{X}, {Y}, {Z}}, where X, Y, and Z are the subsets of words that appear in the training collection. Finally, we show the Indonesian word sense deemed most similar: its vector similarity score, the KBBI ID, its ideal textual context set, i.e. the sublemma, its definition, and its example sentences, and its actual textual context set in the same notation as above.

WordNet Synset ID: 100319939
Words: chase, following, pursual, pursuit
Gloss: the act of pursuing in an effort to overtake or capture
Example: "the culprit started to run and the cop took off in pursuit"
Textual context set: {{following, chase}, {the, effort, of, to, or, capture, in, act, pursuing, an}, {the, off, took, to, run, in, culprit, started, and}}

KBBI ID: k39607 - Similarity: 0.804
Sublemma: mengejar
Definition: berlari untuk menyusul menangkap dsb memburu (to run in order to catch up with, catch, etc.; to chase)
Example: "ia berusaha mengejar dan menangkap saya" (he tried to chase and catch me)
Textual context set: {{mengejar}, {memburu, berlari, menangkap, untuk, menyusul}, {berusaha, dan, ia, mengejar, saya, menangkap}}

(a)

WordNet Synset ID: 201277784
Words: crease, furrow, wrinkle
Gloss: make wrinkled or creased
Example: "furrow one's brow"
Textual context set: {{}, {or, make}, {s, one}}

KBBI ID: k02421 - Similarity: 0.69
Sublemma: alur
Definition: jalinan peristiwa dl karya sastra untuk mencapai efek tertentu pautannya dapat diwujudkan oleh hubungan temporal atau waktu dan oleh hubungan kausal atau sebab-akibat (the weave of events in a literary work that achieves a particular effect, whose links may be realised by temporal or causal relations)
Example: (none)
Textual context set: {{alur}, {oleh, dan, atau, jalinan, peristiwa, diwujudkan, efek, dapat, karya, hubungan, waktu, mencapai, untuk, tertentu}, {}}

(b)

Table 4. Examples of (a) successful and (b) unsuccessful concept mappings

For this experiment, we used the P1000 training collection. The results are presented in Table 5. As a baseline, we select three random suggested Indonesian word senses as a mapping for an English word sense; the reported random baseline (RNDM3) in Table 5 is an average of 10 separate runs. Another baseline (FREQ) was computed by comparing English concepts to their suggestions based on the full-rank word-document matrix. The top 3 Indonesian concepts with the highest similarity values are designated as the mapping results. Subsequently, we compute the Fleiss kappa (Fleiss, 1971) of these results together with the human judgements.

The average level of agreement between the LSA 10% mappings and the human judges (0.2713) is not as high as between the human judges themselves (0.4831). Nevertheless, in general it is better than the random baseline (0.2380) and the frequency baseline (0.2132), which suggests that LSA is indeed managing to capture some measure of the bilingual semantic information implicit within the parallel corpus.

Furthermore, LSA with 10% rank approximation yields higher levels of agreement than LSA with other rank approximations. This contradicts the word mapping results, where LSA with larger rank approximations yields better results (Section 4.2).

In the first example in Table 4, the textual context sets from both the WordNet synset and the KBBI senses are fairly large, and provide sufficient context for LSA to choose the correct KBBI sense. In the second example, however, the textual context set for the synset is very small, due to the words not appearing in the training collection; furthermore, it does not contain any of the words that truly convey the concept. As a result, LSA is unable to identify the correct KBBI sense.

5 Discussion

Previous works have shown LSA to contribute positive gains to similar tasks such as Cross Language Information Retrieval (Rehder et al., 1997). However, the bilingual word mapping results presented in Section 4.2 show the basic vector space model consistently outperforming LSA at that particular task, despite our initial intuition that LSA should actually improve precision and recall.

We speculate that the task of bilingual word mapping may be even harder for LSA than that of bilingual concept mapping, due to its finer alignment granularity.
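For concreteness, the bilingual word-mapping pipeline under discussion (Sections 3 and 4.2) can be sketched on a toy corpus. The vocabulary, the counts, and the choice k = 2 below are all invented for illustration; at realistic scale a sparse SVD (e.g. scipy or gensim) would be used instead:

```python
import numpy as np

# Toy bilingual word-document matrix: English rows first, then the
# Indonesian rows for the paired translations. Columns are the three
# aligned article pairs d1..d3 (all counts invented).
words = ["japan", "film", "jepang", "film_i"]
M = np.array([
    [3., 0., 2.],    # "japan" in the English articles
    [0., 4., 1.],    # "film"  (English)
    [3., 0., 2.],    # "jepang" in the paired Indonesian articles
    [0., 3., 1.],    # "film"  (Indonesian, labelled film_i here)
])

# Rank-k approximation via SVD: M_k = U_k S_k V_k^T
k = 2
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

def top_translations(word, n=1):
    """Rank the Indonesian rows (the last two) by cosine similarity."""
    v = M_k[words.index(word)]
    sims = {}
    for j in (2, 3):  # indices of the Indonesian rows
        u = M_k[j]
        sims[words[j]] = float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))
    return sorted(sims, key=sims.get, reverse=True)[:n]

print(top_translations("japan"))  # ['jepang']
```

The FREQ baseline of Section 4.2 corresponds to skipping the rank reduction and ranking by cosine over the raw rows of M; the weighting schemes (TF-IDF, Log-Entropy) would transform M before the SVD step.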
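The agreement figures reported for concept mapping are Fleiss kappa values (Fleiss, 1971). A minimal sketch of the computation, with a made-up rating matrix where `ratings[i][j]` counts how many raters assigned item i to category j:

```python
import numpy as np

def fleiss_kappa(ratings):
    """Fleiss' kappa for a (items x categories) matrix of rating counts."""
    ratings = np.asarray(ratings, dtype=float)
    n = ratings.sum(axis=1)[0]                        # raters per item (assumed constant)
    p_j = ratings.sum(axis=0) / ratings.sum()         # overall category proportions
    P_i = ((ratings ** 2).sum(axis=1) - n) / (n * (n - 1))  # per-item agreement
    P_bar, P_e = P_i.mean(), (p_j ** 2).sum()
    return (P_bar - P_e) / (1 - P_e)

# Three raters, two items, two categories; perfect agreement gives kappa = 1.
print(fleiss_kappa([[3, 0], [0, 3]]))  # 1.0
```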

    Fleiss kappa values
    Judges   Synsets   Judges only   Judges +    Judges +     Judges +       Judges +       Judges +
                                     RNDM3       FREQ Top3    LSA 10% Top3   LSA 25% Top3   LSA 50% Top3
    >= 2     144       0.4269        0.1318      0.1667       0.1544         0.1606         0.1620
    >= 3     24        0.4651        0.2197      0.2282       0.2334         0.2239         0.2185
    >= 4     8         0.5765        0.3103      0.2282       0.3615         0.3329         0.3329
    >= 5     4         0.4639        0.2900      0.2297       0.3359         0.3359         0.3359
    Average            0.4831        0.2380      0.2132       0.2713         0.2633         0.2623

Table 5. Results of concept mapping

While concept mapping attempts to map a concept conveyed by a group of semantically related words, word mapping attempts to map a word with a specific meaning to its translation in another language.

In theory, LSA employs rank reduction to remove noise and to reveal underlying information contained in a corpus. LSA has a 'smoothing' effect on the matrix, which is useful for discovering general patterns, e.g. clustering documents by semantic domain. Our experiment results, however, generally show the frequency baseline, which employs the full-rank word-document matrix, outperforming LSA. We speculate that the rank reduction blurs some crucial details necessary for word mapping.

The frequency baseline seems to encode more cooccurrence information than LSA: it compares English and Indonesian word vectors that contain the raw frequencies of word occurrences in each document. LSA, on the other hand, encodes more semantic relatedness: it compares English and Indonesian word vectors containing estimates of word frequency in documents according to contextual meaning. Since the purpose of bilingual word mapping is to obtain proper translations for an English word, it may be better explained as an issue of cooccurrence rather than semantic relatedness. That is, the higher the rate of cooccurrence between an English and an Indonesian word, the likelier they are to be translations of each other.

LSA may yield better results in the case of finding words with similar semantic domains. Thus, the LSA mapping results might be better assessed using a resource listing semantically related terms, rather than a bilingual dictionary listing translation pairs. A bilingual dictionary demands more specific constraints than semantic relatedness, as it specifies that the mapping results should be the translations of an English word.

Furthermore, polysemous terms may pose another problem for LSA. Through rank approximation, LSA estimates the occurrence frequency of a word in a particular document. Since the polysemy of English terms and Indonesian terms can be quite different, the estimates for words which are mutual translations can differ. For instance, kali and waktu are Indonesian translations of the English word time; however, kali is also the Indonesian translation of the English word river. Suppose kali and time appear frequently in documents about multiplication, kali and river appear rarely in documents about rivers, and waktu and time appear frequently in documents about time. LSA may then estimate kali with greater frequency in documents about multiplication and time, but with lower frequency in documents about rivers, so that the word vectors of kali and river are not similar, and in bilingual word mapping LSA may not suggest kali as a proper translation for river. Although polysemous words can also be a problem for the frequency baseline, it merely uses raw word frequency vectors, so the problem does not affect other word vectors; LSA, on the other hand, exacerbates this problem by taking it into account when estimating other word frequencies.

6 Summary

We have presented a model for computing bilingual word and concept mappings between two semantic lexicons, in our case Princeton WordNet and the KBBI, using an extension of LSA that exploits implicit semantic information contained within a parallel corpus.

The results, whilst far from conclusive, indicate that for bilingual word mapping, LSA is consistently outperformed by the basic vector space model, i.e. where the full-rank word-document matrix is applied, whereas for bilingual concept mapping LSA seems to slightly improve results. We speculate that this is because LSA, whilst very useful for revealing implicit semantic patterns, blurs the cooccurrence information that is necessary for establishing word translations.

We suggest that, particularly for bilingual word mapping, a finer granularity of alignment, e.g. at the sentential level, may increase accuracy (Deng and Gao, 2007).

Acknowledgment

The work presented in this paper is supported by an RUUI (Riset Unggulan Universitas Indonesia) 2007 research grant from DRPM UI (Direktorat Riset dan Pengabdian Masyarakat Universitas Indonesia). We would also like to thank Franky for help in implementation and Desmond Darma Putra for help in computing the Fleiss kappa values in Section 4.3.

References

Eneko Agirre and Philip Edmonds, editors. 2007. Word Sense Disambiguation: Algorithms and Applications. Springer.

Katri A. Clodfelder. 2003. An LSA Implementation Against Parallel Texts in French and English. In Proceedings of the HLT-NAACL Workshop: Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, pages 111-114.

Yonggang Deng and Yuqing Gao. 2007. Guiding statistical word alignment models with prior knowledge. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 1-8, Prague, Czech Republic. Association for Computational Linguistics.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378-382.

Dan Kalman. 1996. A singularly valuable decomposition: The SVD of a matrix. The College Mathematics Journal, 27(1):2-23.

Thomas K. Landauer, Peter W. Foltz, and Darrell Laham. 1998. An introduction to latent semantic analysis. Discourse Processes, 25:259-284.

Bob Rehder, Michael L. Littman, Susan T. Dumais, and Thomas K. Landauer. 1997. Automatic 3-language cross-language information retrieval with latent semantic indexing. In Proceedings of the Sixth Text Retrieval Conference (TREC-6), pages 233-239.

Syandra Sari. 2007. Perolehan informasi lintas bahasa Indonesia-Inggris berdasarkan korpus paralel dengan menggunakan metoda mutual information dan metoda similarity (Indonesian-English cross-language information retrieval based on a parallel corpus using the mutual information and similarity methods). Master's thesis, Faculty of Computer Science, University of Indonesia. Call number: T-0617.