Improving Wordnets for Under-Resourced Languages Using Machine Translation Bharathi Raja Chakravarthi, Mihael Arcan, John P
Total Page:16
File Type:pdf, Size:1020Kb
Improving Wordnets for Under-Resourced Languages Using Machine Translation Bharathi Raja Chakravarthi, Mihael Arcan, John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Galway, Ireland [email protected] , [email protected], [email protected] Abstract individual words. Wordnets can be constructed by either the merge or the expand approach (Vossen, Wordnets are extensively used in natural 1997). Princeton WordNet (Miller, 1995; Fell- language processing, but the current ap- baum, 2010) was manually created within Prince- proaches for manually building a word- ton University covering the vocabulary in En- net from scratch involves large research glish language only. Then, based on the Prince- groups for a long period of time, which are ton WordNet, wordnets for several languages were typically not available for under-resourced created. As an example, EuroWordNet (Vossen, languages. Even if wordnet-like resources 1997) is a multilingual lexical database for sev- are available for under-resourced lan- eral European languages, structured in the same guages, they are often not easily accessi- way as Princeton’s WordNet. The Multiword- ble, which can alter the results of applica- net (Pianta et al., 2002) is strictly aligned with tions using these resources. Our proposed Princeton WordNet and allows to access senses method presents an expand approach for in Italian, Spanish, Portuguese, Hebrew, Roma- improving and generating wordnets with nian and Latin language. Many others have fol- the help of machine translation. We ap- lowed for different languages. The IndoWordNet ply our methods to improve and extend (Bhattacharyya, 2010) was compiled for eighteen wordnets for the Dravidian languages, i.e., out of the twenty-two official languages of India Tamil, Telugu, Kannada, which are sev- and made available for public use. It is based on erly under-resourced languages. We report the expand approach like EuroWordNet, but from evaluation results of the generated word- the Hindi wordnet, which is then linked to En- net senses in term of precision for these glish. On the Global WordNet Association web- languages. In addition to that, we carried site,1 a comprehensive list of wordnets available out a manual evaluation of the translations for different languages can be found, including In- for the Tamil language, where we demon- doWordNet and EuroWordNet etc. strate that our approach can aid in improv- This paper describes the effort towards gen- ing wordnet resources for under-resourced erating and improving wordnets for the under- Dravidian languages. resourced Dravidian languages. Since studies (Federico et al., 2012; Laubli¨ et al., 2013; Green 1 Introduction et al., 2013) have shown significant productiv- As computational activities and the Internet cre- ity gains when human translators post-edit ma- ates a wider multilingual and global commu- chine translation output rather than translating text nity, under-resourced languages acquire political from scratch, we use the available parallel cor- 2 as well as economic interest to develop Natural pora from multiple sources, like OPUS, to cre- Language Processing (NLP) systems for these lan- ate a machine translation system to translate the guages. In general, creating NLP systems requires wordnet senses in the Princeton WordNet into an extensive amount of resources and manual ef- the mentioned under-resourced languages. Trans- 3 fort, however, under-resourced languages lack in lation tools such as Google Translate, or open both. source SMT systems such as Moses (Koehn et Wordnets are lexical resources, which provide 1http://globalwordnet.org/ a hierarchical structure based on synsets (a set of 2http://opus.lingfil.uu.se/ one or more synonyms) and semantic features of 3http://translate.google.com/ al., 2007) trained on generic data are the most In their further work (Rajendran et al., 2010), they common solutions, but they often result in unsat- emphasize the need for an independent wordnet isfactory translations of domain-specific expres- for the Dravidian languages, based on EuroWord- sions. Therefore, we follow the idea of Arcan et al. Net. This is due the observation that the mor- (2016b), where the authors automatically identify phology and lexical concepts of these languages relevant sentences in English containing the Word- are different compared to other Indian languages. Net senses and translate them within the context, The authors have combined the Tamil wordnet and which showed translation quality improvement of wordnets in other Dravidian languages to form the the targeted entries. The effectiveness of our ap- IndoWordNet. proach is evaluated by comparing the generated Mohanty et al. (2017) built SentiWordNet for translations with the IndoWordNet entries, auto- the Odia language, which is one of the official lan- matically and manually, respectively. This paper guages of India. Being an under-resourced lan- reports our first outcomes in improving wordnet guage, Odia lacks proper machine translation sys- for under-resourced Dravidian languages such as tem to translate the vocabulary of the available re- Tamil(ISO 639-2: tam), Telugu (ISO 639-2: tel) source from English into Odia. The authors have and Kannada (ISO 639-2: kan). created SentiWordNet for Odia using resources of other Indian languages and the IndoWordNet. Al- 2 Related work though the IndoWordNet structure does not map directly to the SentiWordNet, instead synsets are Scannell (2007) describes the start of the creation matched. The authors used these for translation of a resource for the Irish language using the Web from source lexicon to target lexicon. Aliabadi as a resource for NLP approaches. This work et al. (2014) have created a wordnet for the Kur- started by creating a resource for Irish language dish language, one of the under-resourced lan- using the Web as a resources for NLP. Since 2000, guages in western Iranian language family. They the author and his collaborators developed many have created Kurdish translation for the “core” resources like monolingual corpora, bilingual cor- wordnet synsets (Vossen, 1997), which is a set pora and parsers etc, for many under-resourced of 5,000 essential concepts. They used a dictio- languages, but they did not cover all languages in nary to translate its literals (words), adopted an the world. A six-level typology was proposed by indirect evaluation alternative in which they look Alegria et al. (2011) that separated languages into at the effectiveness of using KurdNet for rewrit- six levels. According to the authors, except for ing Information Retrieval queries. Similarly, the top ten languages in the world all the other lan- work by Horvath´ et al. (2016) focuses on the semi- guages are under-resourced languages. The third automatic construction of wordnet for the Mansi and fourth level languages are the languages which language, which is spoken by Mansi people in have some resource on the internet. These six level Russia, an endangered under-resourced languages typologies is a relative definition for the under- with a low number of native speakers. The au- resourced language, but still can be useful for our thors have used the Hungarian wordnet as a start- study of under-resourced languages. ing point. With the help of a Hungarian-Mansi dic- IndoWordNet covers official Indian languages, tionary, which was used to create possible transla- from the major three families: Indo-Aryan, Dra- tions between the languages, the Mansi wordnet vidian and Sino-Tibetan languages. In general, In- was continuously expanded. dian languages are rich in morphology and each Previous works did lots of manual effort to cre- of the three language families has different mor- ate wordnet-like resources, which was funded by phology structure. It was compiled for eighteen public research for a long period of time. How- out of the twenty-two official languages and made ever, IndoWordNet is not complete and biased to- publicly available.4 Similarly to EuroWordNet it wards Hindi, because the authors created a Hindi- is based on the expand approach, but the central Tamil bilingual dictionary, rather than a wordnet. language is Hindi, which is then linked to English. As explained in Rajendran et al. (2010), the mor- The IndoWordNet entries are updated frequently. phology and lexical concepts of Dravidian lan- For the Tamil language, Rajendran et al. (2002) guages are different from Hindi, which illustrates proposed a design template for the Tamil wordnet. that the IndoWordNet may not be the most suitable 4http://www.cfilt.iitb.ac.in/ resource to represent the wordnet for the targeted indowordnet/index.jsp Dravidian languages. To evaluate and improve the wordnets for the script is used to write other under-resourced lan- targeted Dravidian languages, we follow the ap- guages like Tulu, Konkani and Sankethi. In the proach of Arcan et al. (2016b), which uses the ex- Kannada language, the derivation of words is ei- isting translations of wordnets in other languages ther by combining two distinct words or by affixes. to identify contextual information for wordnet Different to Tamil, Kannada and Telugu inherits senses from a large set of generic parallel corpora. some of the affixes from Sanskrit. We use this contextual information to improve the translation quality of WordNet senses. We show 3.2 Machine Translation that our approach can help overcome drawbacks Statistical Machine Translation (SMT) sys- of simple translations of words without context. tems assume that we have a set of example translations(S(k), T (k)) for k = 1 : : : :n, where 3 Background S(k) is the kth source sentence, T (k) is the kth target sentence which is the translation of S(k) Our specific aim of this work is to generate and in the corpus. SMT systems try to maximize the improve wordnets for under-resourced languages. conditional probability p(tjs) of target sentence For our task we chose the expand approach and au- t given a source sentence s by maximizing tomatically translated the Princeton WordNet en- separately a language model p(t) and the inverse tries within a disambiguate context to obtain en- translation model p(sjt).