
Improving Wordnets for Under-Resourced Languages Using Machine Translation

Bharathi Raja Chakravarthi, Mihael Arcan, John P. McCrae
Insight Centre for Data Analytics
National University of Ireland Galway
Galway, Ireland
[email protected], [email protected], [email protected]

Abstract

Wordnets are extensively used in natural language processing, but the current approaches for manually building a wordnet from scratch involve large research groups working for a long period of time, which are typically not available for under-resourced languages. Even if wordnet-like resources are available for under-resourced languages, they are often not easily accessible, which can alter the results of applications using these resources. Our proposed method presents an expand approach for improving and generating wordnets with the help of machine translation. We apply our method to improve and extend wordnets for the Dravidian languages Tamil, Telugu and Kannada, which are severely under-resourced languages. We report evaluation results of the generated wordnet senses in terms of precision for these languages. In addition to that, we carried out a manual evaluation of the translations for the Tamil language, where we demonstrate that our approach can aid in improving wordnet resources for under-resourced Dravidian languages.

1 Introduction

As computational activities and the Internet create a wider multilingual and global community, under-resourced languages acquire political as well as economic interest, motivating the development of Natural Language Processing (NLP) systems for these languages. In general, creating NLP systems requires an extensive amount of resources and manual effort; under-resourced languages, however, lack both.

Wordnets are lexical resources which provide a hierarchical structure based on synsets (a set of one or more synonyms) and semantic features of individual words. Wordnets can be constructed by either the merge or the expand approach (Vossen, 1997). Princeton WordNet (Miller, 1995; Fellbaum, 2010) was manually created within Princeton University, covering the vocabulary of the English language only. Then, based on the Princeton WordNet, wordnets for several other languages were created. As an example, EuroWordNet (Vossen, 1997) is a multilingual lexical database for several European languages, structured in the same way as Princeton's WordNet. MultiWordNet (Pianta et al., 2002) is strictly aligned with Princeton WordNet and gives access to senses in Italian, Spanish, Portuguese, Hebrew, Romanian and Latin. Many others have followed for different languages. The IndoWordNet (Bhattacharyya, 2010) was compiled for eighteen out of the twenty-two official languages of India and made available for public use. It is based on the expand approach like EuroWordNet, but it starts from the Hindi wordnet, which is then linked to English. On the Global WordNet Association website (http://globalwordnet.org/), a comprehensive list of wordnets available for different languages can be found, including IndoWordNet and EuroWordNet.

This paper describes our effort towards generating and improving wordnets for the under-resourced Dravidian languages. Since studies (Federico et al., 2012; Läubli et al., 2013; Green et al., 2013) have shown significant productivity gains when human translators post-edit machine translation output rather than translating text from scratch, we use the available parallel corpora from multiple sources, such as OPUS (http://opus.lingfil.uu.se/), to create a machine translation system that translates the wordnet senses in the Princeton WordNet into the mentioned under-resourced languages. Translation tools such as Google Translate (http://translate.google.com/), or open-source SMT systems such as Moses (Koehn et al., 2007) trained on generic data, are the most common solutions, but they often result in unsatisfactory translations of domain-specific expressions. Therefore, we follow the idea of Arcan et al. (2016b), where the authors automatically identify relevant English sentences containing the WordNet senses and translate them within that context, which showed translation quality improvements for the targeted entries. The effectiveness of our approach is evaluated by comparing the generated translations with the IndoWordNet entries, automatically and manually, respectively. This paper reports our first outcomes in improving wordnets for under-resourced Dravidian languages, namely Tamil (ISO 639-2: tam), Telugu (ISO 639-2: tel) and Kannada (ISO 639-2: kan).

2 Related work

Scannell (2007) describes the start of the creation of resources for the Irish language using the Web as a resource for NLP approaches. Since 2000, the author and his collaborators have developed many resources, such as monolingual corpora, bilingual corpora and parsers, for many under-resourced languages, but they did not cover all languages of the world. A six-level typology was proposed by Alegria et al. (2011) that separates languages into six levels. According to the authors, except for the top ten languages in the world, all other languages are under-resourced; the third- and fourth-level languages are those which have some resources on the Internet. This six-level typology is a relative definition of an under-resourced language, but it is still useful for our study of under-resourced languages.

IndoWordNet covers official Indian languages from the three major families: Indo-Aryan, Dravidian and Sino-Tibetan. In general, Indian languages are rich in morphology, and each of the three language families has a different morphological structure. IndoWordNet was compiled for eighteen out of the twenty-two official languages and made publicly available (http://www.cfilt.iitb.ac.in/indowordnet/index.jsp). Similarly to EuroWordNet, it is based on the expand approach, but the central language is Hindi, which is then linked to English. The IndoWordNet entries are updated frequently.

For the Tamil language, Rajendran et al. (2002) proposed a design template for the Tamil wordnet. In their further work (Rajendran et al., 2010), they emphasize the need for an independent wordnet for the Dravidian languages, based on EuroWordNet. This is due to the observation that the morphology and lexical concepts of these languages differ from those of other Indian languages. The authors have combined the Tamil wordnet and wordnets in other Dravidian languages to form part of the IndoWordNet.

Mohanty et al. (2017) built a SentiWordNet for the Odia language, which is one of the official languages of India. Being an under-resourced language, Odia lacks a proper machine translation system to translate the vocabulary of the available resources from English into Odia. The authors created SentiWordNet for Odia using resources of other Indian languages and the IndoWordNet; although the IndoWordNet structure does not map directly to SentiWordNet, synsets are matched instead, and these matches were used for translation from the source to the target lexicon. Aliabadi et al. (2014) have created a wordnet for the Kurdish language, one of the under-resourced languages in the western Iranian language family. They created Kurdish translations for the "core" wordnet synsets (Vossen, 1997), a set of 5,000 essential concepts. They used a dictionary to translate its literals (words) and adopted an indirect evaluation alternative in which they look at the effectiveness of using KurdNet for rewriting queries. Similarly, the work by Horváth et al. (2016) focuses on the semi-automatic construction of a wordnet for the Mansi language, spoken by the Mansi people in Russia, an endangered under-resourced language with a low number of native speakers. The authors used the Hungarian wordnet as a starting point and, with the help of a Hungarian-Mansi dictionary used to create candidate translations between the languages, the Mansi wordnet was continuously expanded.

Previous works invested a lot of manual effort into creating wordnet-like resources, funded by public research over long periods of time. However, IndoWordNet is not complete and is biased towards Hindi, because the authors created a Hindi-Tamil bilingual dictionary rather than a wordnet. As explained in Rajendran et al. (2010), the morphology and lexical concepts of Dravidian languages are different from Hindi, which illustrates that the IndoWordNet may not be the most suitable resource to represent a wordnet for the targeted Dravidian languages.

To evaluate and improve the wordnets for the targeted Dravidian languages, we follow the approach of Arcan et al. (2016b), which uses the existing translations of wordnets in other languages to identify contextual information for wordnet senses from a large set of generic parallel corpora. We use this contextual information to improve the translation quality of WordNet senses. We show that our approach can help overcome the drawbacks of simple translations of words without context.

3 Background

The specific aim of this work is to generate and improve wordnets for under-resourced languages. For our task we chose the expand approach and automatically translated the Princeton WordNet entries within a disambiguated context to obtain entries for the Dravidian languages.

3.1 Dravidian languages

The Dravidian languages are a family of languages spoken primarily in the southern part of India and also spread over South Asia. They are divided into four groups: South, South-Central, Central, and North. Dravidian morphology is agglutinating and exclusively suffixal. Words are built from small elements called morphemes, of which the two broad classes are stems and affixes; words are made up of morphemes concatenated according to the morphology of the language. Tamil is also a free word-order language: due to the nature of its morphology, the noun phrase and verb phrase may appear in any permutation and still produce the same sense of the sentence (Steever, 1987).

The four major literary Dravidian languages are Tamil, Telugu, Malayalam, and Kannada. Tamil, Malayalam, and Kannada fall under the South Dravidian subgroup, whereas Telugu belongs to the South-Central Dravidian subgroup (Vikram and Urs, 2007). All four languages have official status with the Government of India and use their own unique script. Outside India, Tamil also has official status in Sri Lanka and Singapore. The Tamil script is descended from the Southern Brahmi script and has 12 vowels, 18 consonants and one aytam (special sound). The Telugu script is also a descendant of the Southern Brahmi script; it has 16 vowels and 36 consonants, which is more than the Tamil alphabet. The Kannada and Telugu scripts are most similar and are often considered regional variants of each other. The Kannada script is also used to write other under-resourced languages like Tulu, Konkani and Sankethi. In the Kannada language, words are derived either by combining two distinct words or by affixes. Differently from Tamil, Kannada and Telugu inherit some of their affixes from Sanskrit.

3.2 Machine Translation

Statistical Machine Translation (SMT) systems assume that we have a set of example translations (S(k), T(k)) for k = 1, ..., n, where S(k) is the k-th source sentence and T(k) is the k-th target sentence, its translation in the corpus. SMT systems try to maximize the conditional probability p(t|s) of a target sentence t given a source sentence s by maximizing separately a language model p(t) and the inverse translation model p(s|t). A language model assigns a probability p(t) to any sentence t, and a translation model assigns a conditional probability p(s|t) to a source/target pair of sentences. By Bayes' rule,

p(t|s) ∝ p(t) p(s|t)    (1)

This decomposition into a translation and a language model improves the fluency of the generated texts by making full use of the available corpora. The language model is not only meant to ensure fluent output, but also supports difficult decisions about word order and word translation (Koehn, 2010). We used the Moses toolkit (Koehn et al., 2007), which provides end-to-end support for the creation and evaluation of machine translation systems, and evaluate with the BLEU score (Papineni et al., 2002). There are two major criteria for automatic SMT evaluation, completeness and correctness, both of which are considered by BLEU, an automatic evaluation technique based on the geometric mean of n-gram precisions. The BLEU score is language independent, fast, and shows good correlation with human evaluation campaigns; therefore we use this metric to evaluate our work.

3.3 Available Corpora for Machine Translation

This section describes the data collection and pre-processing steps. The English-Tamil parallel corpus which we used to train our SMT system was collected from various sources and combined into a single parallel corpus. We used the EnTam corpus (Ramasamy et al., 2012), which was pre-processed from raw Web data into a sentence-aligned corpus. It contains text from the news domain (http://www.wsws.org/), sentences from Tamil cinema articles (http://www.cinesouth.com/) and the Bible (http://biblephone.intercer.net/). For the news corpus, the authors downloaded web pages that have matching file names in both English and Tamil. For the cinema corpus, all the English articles had a link to the corresponding Tamil translation. The collection of the Bible corpus followed a similar pattern. We also took the English-Tamil part of the parallel corpora for six Indian languages created with the help of Mechanical Turk for Wikipedia documents (Post et al., 2012); since the data was created by non-expert translators hired over Mechanical Turk, it is of mixed quality. From the OPUS website, we collected the Gnome, KDE, Ubuntu and movie subtitle corpora (Tiedemann, 2012). We furthermore manually aligned the Tamil text Tirukkural (http://www.projectmadurai.org/), and combined all the parallel corpora into a single corpus. We first tokenized the sentences in English and Tamil and then true-cased only the English side of the parallel corpus, since the Tamil language does not have casing. Finally, we cleaned up the data by eliminating sentences longer than 80 words.

To obtain the parallel corpora for Telugu and Kannada, we used the corpora available on the OPUS website. The same pre-processing procedure was followed for Telugu and Kannada, since both languages are close to the Tamil language. Table 1 shows the statistics of the parallel corpora for the three language pairs. From this table we can see that the English-Tamil parallel corpus is much larger than those of the other language pairs; on the other hand, the number of sentences for English-Kannada is very small. Once we had obtained the parallel corpora, we created the SMT systems for the English-Tamil, English-Telugu, and English-Kannada language pairs.

                           English-Tamil            English-Telugu          English-Kannada
                           English      Tamil       English     Telugu      English    Kannada
Number of tokens           7,738,432    6,196,245   258,165     226,264     68,197     71,697
Number of unique words     134,486      459,620     18,455      28,140      7,740      15,683
Average word length        4.2          7.0         3.7         4.8         4.5        6.0
Average sentence length    5.2          7.9         4.6         5.6         5.3        6.8
Number of sentences              449,337                  44,588                 13,543

Table 1: Statistics of the parallel corpora used to train the translation systems.

We define the following data sets:

• Development set: 2000 randomly selected sentences from the parallel corpus, used to measure the system performance of the phrase-based translation model.

• Test set: a blind set of 1000 sentences randomly chosen from the parallel corpus, used to test the system. There is no overlap between these sets of data.

• Training set: a larger parallel corpus used to train the phrase-based translation model; it is the corpus remaining after the development and test sets are extracted.

In this work, we focus on three languages from the Dravidian family, namely Tamil, Telugu, and Kannada. This is mainly due to the available parallel corpora, and we believe that this method can be extended to other under-resourced languages without much effort.
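To make the pre-processing and data splits described above concrete, the following Python sketch applies the same steps to a sentence-aligned English-Tamil corpus: it lowercases the English side (a simplification standing in for the Moses truecaser that was actually used), drops pairs longer than 80 tokens, and holds out 2,000 development and 1,000 test sentences. The file names and the whitespace tokenizer are illustrative assumptions, not part of the original pipeline.

import random

MAX_LEN = 80                      # sentence pairs longer than this are discarded
DEV_SIZE, TEST_SIZE = 2000, 1000

def load_parallel(en_path, ta_path):
    # Read a sentence-aligned corpus: one sentence per line on each side.
    with open(en_path, encoding="utf-8") as f_en, open(ta_path, encoding="utf-8") as f_ta:
        return [(en.strip(), ta.strip()) for en, ta in zip(f_en, f_ta)]

def clean(pairs):
    # Whitespace-tokenize, lowercase only the English side (Tamil has no casing),
    # and drop pairs where either side is empty or exceeds MAX_LEN tokens.
    kept = []
    for en, ta in pairs:
        en_tok, ta_tok = en.lower().split(), ta.split()
        if 0 < len(en_tok) <= MAX_LEN and 0 < len(ta_tok) <= MAX_LEN:
            kept.append((" ".join(en_tok), " ".join(ta_tok)))
    return kept

def split(pairs, seed=1):
    # Randomly hold out development and test sets; the remainder is training data.
    random.seed(seed)
    random.shuffle(pairs)
    dev = pairs[:DEV_SIZE]
    test = pairs[DEV_SIZE:DEV_SIZE + TEST_SIZE]
    train = pairs[DEV_SIZE + TEST_SIZE:]
    return train, dev, test

corpus = clean(load_parallel("corpus.en", "corpus.ta"))   # hypothetical file names
train, dev, test = split(corpus)

The training portion is then what would be passed to Moses for phrase extraction and to KenLM for building the 5-gram language model.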
3.4 Resource Scarceness

There are few resources which can be used to automatically create a wordnet for under-resourced languages. One way to cross the language barrier is with the help of machine translation. As with other data-driven methods, SMT tends to improve in translation quality when using a large amount of training data: if the training method sees a specific word or phrase multiple times during training, it is more likely to learn a correct translation. SMT suffers from the scarcity of parallel corpora, from Dravidian word order and from the morphological complexity of these languages. For the Dravidian languages, when translating from or to English, the translation models suffer because of syntactic differences, while the morphological differences contribute to data sparsity. In addition, the small corpora used for training lead to incomplete word coverage, which may cause out-of-vocabulary (OOV) issues.

Besides the resource scarceness, another issue observed with the corpora for the Dravidian languages was code-switching content in the data. Code-switching is an act of alternating between elements of two or more languages, and it is prevalent in multilingual countries (Barman et al., 2014). With English being the most used language in the digital world, people tend to mix English words with their native languages. That might be the case in other languages as well.

4 Methodology

The principal approaches for constructing wordnets are the merge approach and the expand approach. In the merge approach, the synsets and relations are built independently and then aligned with WordNet; its drawbacks are that it is time-consuming and requires a lot of manual effort. In the expand model, on the contrary, a wordnet can be created automatically by translating synsets using different strategies, whereby the synsets are built in correspondence with the existing wordnet synsets. We followed the expand approach and created machine translation systems to translate the sentences which contained the WordNet senses from English into the target languages.

4.1 Training Machine Translation Parameters

In the following, we take as a baseline a parallel corpus that has been aligned at the sentence level. To obtain the translations, we use the Moses SMT toolkit in its baseline setup, with a 5-gram language model created from the training data with KenLM (Heafield, 2011). The baseline SMT systems were built for three language pairs: English-Tamil, English-Telugu, and English-Kannada. The test set mentioned in Section 3.3 was used to evaluate our systems. From Table 1 and Table 2 we can see that the size of the parallel corpus has an impact on the BLEU score on the test set, which is the evaluation criterion for the translation model.

                     Original   Non-code-mixed
English→Tamil        20.29      20.61
English→Telugu       28.81      28.25
English→Kannada      14.64      14.45

Table 2: Automatic translation evaluation on 1000 randomly selected sentences in terms of the BLEU metric.

4.2 Context Identification

Since manual translation of wordnets using the expand approach is a very time-consuming and expensive process, we apply SMT to automatically translate WordNet entries into the targeted Dravidian languages. While a domain-unadapted SMT system can only return the most frequent translation when given a term by itself, it has been observed that the translation quality of single-word expressions improves when the word is given in the disambiguated context of a sentence (Arcan et al., 2016a; Arcan et al., 2016b). Therefore, existing translations of WordNet senses in languages other than English were used to select the most relevant sentences for wordnet senses from a large set of generic parallel corpora. The goal is to identify sentences that share the same semantic information with respect to the synset of the WordNet entry that we want to translate. To ensure a broad lexical and domain coverage of English sentences, existing parallel corpora for various language pairs were merged into one parallel data set, i.e., Europarl (Koehn, 2005), DGT (translation memories generated by the Directorate-General for Translation; Steinberger et al., 2014), the MultiUN corpus (Eisele and Chen, 2010), EMEA, KDE4, OpenOffice (Tiedemann, 2009), and OpenSubtitles2012 (Tiedemann, 2012). Similarly, wordnets in a variety of languages, provided by the Open Multilingual Wordnet web page (http://compling.hss.ntu.edu.sg/omw/), were used.

As a motivating example, we consider the word vessel, which is a member of three synsets in Princeton WordNet. Its most frequent translation, e.g., as given by Google Translate, is Schiff in German and nave in Italian, corresponding to i60833 'a craft designed for water transportation' (we use the CILI identifiers for synsets (Bond et al., 2016)). For the second sense, i65336 'a tube in which a body fluid circulates', we assume that we know the German translation for this sense is Gefäß, and in our approach we look for sentences in a parallel corpus where the words vessel and Gefäß both occur, obtaining a context such as 'blood vessel' that allows the SMT system to translate this sense correctly. This alone is not sufficient, as Gefäß is also a translation of i60834 'an object used as a container'; however, in Italian these two senses are distinct (vaso and recipiente respectively), thus by using as many languages as possible we maximize our chances of finding a well-disambiguated context.
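As a minimal sketch of this context selection step, assume we hold a generic English-German corpus in memory and already know one German translation of the targeted sense, as in the vessel/Gefäß example above. The data structures and the simple token and substring matching below are illustrative assumptions, not the exact procedure of Arcan et al. (2016b).

def find_disambiguating_contexts(parallel_corpus, en_lemma, pivot_translation, max_hits=20):
    # Return English sentences that contain the WordNet lemma and whose aligned
    # sentence in the pivot language contains a known translation of the intended sense.
    hits = []
    for en_sent, pivot_sent in parallel_corpus:
        if en_lemma in en_sent.lower().split() and pivot_translation.lower() in pivot_sent.lower():
            hits.append(en_sent)
            if len(hits) >= max_hits:
                break
    return hits

# Toy corpus; in the paper the pool is merged from Europarl, DGT, MultiUN, EMEA, etc.
corpus = [
    ("The blood vessel was damaged .", "Das Blutgefäß wurde beschädigt ."),
    ("The vessel left the harbour .", "Das Schiff verließ den Hafen ."),
]

# Sense i65336 ('a tube in which a body fluid circulates'), German pivot 'Gefäß':
contexts = find_disambiguating_contexts(corpus, "vessel", "Gefäß")
# Only "The blood vessel was damaged ." is selected; it is then translated with the
# SMT system and the word aligned to "vessel" is taken as the sense translation.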

4.3 Code-mixing

Code-switching and code-mixing are a phenomenon found among bilingual communities all over the world (Ayeomoni, 2006; Yoder et al., 2017). Code-mixing is the mixing of words, phrases, and sentences from two or more languages within the same sentence or between sentences. In many bilingual or multilingual communities, such as India or Singapore, language interaction often happens in which two or more languages are mixed, and it increasingly occurs in monolingual cultures due to globalization. In many contexts and domains, English is mixed with native languages within an utterance more than in the past due to the Internet boom. Due to the history and popularity of the English language, Indian languages are more frequently mixed with English on the Internet than with other native languages (Chanda et al., 2016).

A major part of our corpora comes from movie subtitles and technical documents, which makes them even more prone to code-mixing of English into the Dravidian languages. In our corpus, movie speeches are transcribed to text, and they differ from other written genres: the vocabulary is informal, there are non-verbal sounds like "ah", and scripts are mixed in the case of English and the native languages (Tiedemann, 2008). Two examples of code-mixing are shown in Figure 1. The parallel corpus is initially segregated into English script and native script; all of the annotations are done using an automatic process. All words from a language other than the native script of our experiment are taken out of the corpus if they occur on the native-language side of the parallel corpus, and sentence pairs are removed from both sides if the target-language side does not contain any native-script words. Table 3 shows the percentage of code-mixed text removed from the original corpus. The goal of this approach is to investigate whether the code-mixing criteria and the corresponding training are directly related to an improvement of the translation quality, measured with automatic and manual evaluation. We assumed that code-mixed text can be found by differences in script, and did not consider code-mixing written in the native script, or in Latin script used to write the native language, as was done by Das and Gambäck (2013).

Source sentence: "இப்போது, நான் அதை loving."
Transliteration: Ippōtu, nāṉ atai loving
Target sentence: "Right now, I'm loving it."

Source sentence: "முன்னிருப்பு GNOME பொருள்"
Transliteration: Muṉṉiruppu GNOME poruḷ
Target sentence: "Default GNOME Theme"

Figure 1: Examples of code-mixing in the Tamil-English parallel corpus. In the first example the verb "loving" is code-mixed in Tamil; in the second example the noun "GNOME" is code-mixed.

                     English-Tamil                    English-Telugu                  English-Kannada
                     English         Tamil            English        Telugu           English        Kannada
Tokens (tok)         0.5% (45,847)   1.1% (72,833)    2.8% (7,303)   4.9% (12,818)    3.5% (2,425)   9.0% (6,463)
Sentences (sent)           0.9% (4,100)                     3.1% (1,388)                     3.4% (468)

Table 3: Number of sentences (sent) and number of tokens (tok) removed from the original corpus.

5 Evaluation

The most reliable way to evaluate a wordnet is manual evaluation, but a manual evaluation of the whole WordNet is time-consuming and very expensive. Therefore, we performed an automatic evaluation of our translations and measured the precision. In order to determine the correctness of our work, we furthermore randomly selected 50 WordNet entries for manual evaluation.

5.1 Automatic Evaluation

In this paper, we compare our results to the IndoWordNet. Once the translation of the disambiguated contexts containing the target entries was finished, we used the word alignment information to extract the translation of each WordNet entry. Since several disambiguated sentences were used per WordNet entry, we took the translations for each context and then combined the results, counting the most frequent one. The top-10 word entries were compared to IndoWordNet for an exact match.

We took precision at 10, precision at 5, precision at 2, and precision at 1, and did this comparison for all three languages, i.e. Tamil, Telugu, and Kannada. As an additional experiment, we removed the code-mixed part of the corpus and created a new translation system, which was used again to translate the same WordNet entries. Table 4 shows the results of the automatic evaluation of the translation of the entries into the targeted Dravidian languages. The table shows the precision at the different levels for the translations, based on the translation models generated from the original corpus and from the non-code-mixed corpus.

English→Tamil           P@10    P@5     P@2     P@1
original corpus         0.120   0.109   0.083   0.065
non-code mixed          0.125   0.115   0.091   0.073

English→Telugu          P@10    P@5     P@2     P@1
original corpus         0.047   0.046   0.038   0.028
non-code mixed          0.047   0.045   0.038   0.027

English→Kannada         P@10    P@5     P@2     P@1
original corpus         0.009   0.010   0.008   0.005
non-code mixed          0.011   0.011   0.009   0.007

Table 4: Results of the automatic evaluation of the wordnets against IndoWordNet at different precision levels, where P@10 denotes precision at 10.

The non-code-mixed system often outperforms the baseline in terms of precision, although the difference is less visible for Telugu; this is likely due to the short sentences in the Telugu corpus. These differences in precision are significant in the manual evaluation of the Tamil tests with 50 samples. The wide difference between manual and automatic evaluation can be explained in part by differing word forms. Table 4 shows how our system differs from the baseline SMT system and how it benefits the wordnet translation. This is clear evidence that an SMT system trained without code-mixing, as described above, achieves an improvement over the baseline without using any additional training data. However, it has been shown in Arcan et al. (2016b) that better performance on WordNet translation can be achieved if the corpora contain a sufficient amount of parallel sentences: their translation evaluation based on the BLEU metric on unigrams (similar to precision at 1, P@1) showed a range between 0.55 and 0.70 BLEU points for well-resourced languages like Slovene, Spanish, Croatian and Italian. Restricting the task to a small data set tends to hurt the translation performance, but it can be useful to aid in the creation or improvement of new resources for the under-resourced languages.

5.2 Manual Evaluation

In order to be able to evaluate our method in contrast to stand-alone approaches, we manually evaluated it in comparison with the IndoWordNet entries. To select the sample for the manual evaluation, we proceeded as follows: we randomly extracted a sample of 50 wordnet entries from the WordNet. First, each of these 50 entries was compared to the IndoWordNet for an exact match. Subsequently, regardless of this decision, each of the 50 entries was evaluated and classified according to its quality. The classification is the following:

• Agrees with IndoWordNet: an exact match is found in IndoWordNet.

• Inflected form: the root of the word is found with a different inflection, which can make the translation correct but imprecise.

• Transliteration: the word is transliterated, which can be caused by the unavailability of the translation form in the parallel corpus, since some foreign words are used in transliteration.

• Spelling variant: since our data is in the day-to-day language of Tamil while IndoWordNet is skewed towards the classical sense of the language, our method produces spelling variants, which count as wrong or misspelled words according to IndoWordNet.

• Correct, but not in IndoWordNet: IndoWordNet is large and covers eighteen languages, but it lacks some wordnet entries for the Dravidian languages. We verified that we had identified the correct sense by referring to the wordnet gloss.

• Incorrect: this error class can be caused by an inappropriate term or a mistranslation.

                                 Original   Non-code-mixed
Agrees with IWN                  18%        20%
Inflected form                   12%        22%
Transliteration                  4%         4%
Spelling variant                 2%         2%
Correct, but not in IWN          18%        24%
Incorrect                        46%        28%

Table 5: Manual evaluation of wordnet creation for the Tamil language compared with IndoWordNet (IWN) at precision at 10, presented in percentages.

The examples in Figure 2 list Tamil translations of wordnet entries from our experiment. When neither the word nor its translation has appeared in the training corpus, the SMT system cannot translate the word and chooses to produce the word in English. On the other side, these examples may provide some insights into the words.
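The precision-at-k comparison of Section 5.1 can be sketched as follows, assuming the word-aligned candidate translations for each synset have already been extracted from the translated contexts. Representing IndoWordNet as a plain dictionary and using exact string matching are simplifying assumptions.

from collections import Counter

def precision_at_k(candidates_per_synset, indowordnet, ks=(1, 2, 5, 10)):
    # candidates_per_synset: synset id -> list of candidate target-language words,
    #                        one per disambiguated context (duplicates allowed).
    # indowordnet:           synset id -> set of reference words for that synset.
    # A synset counts as a hit at k if any of its k most frequent candidates
    # exactly matches an IndoWordNet entry for the same synset.
    scores = {}
    for k in ks:
        hits = total = 0
        for synset, candidates in candidates_per_synset.items():
            refs = indowordnet.get(synset)
            if not refs:
                continue                      # synset not covered by IndoWordNet
            top_k = [word for word, _ in Counter(candidates).most_common(k)]
            hits += any(word in refs for word in top_k)
            total += 1
        scores[k] = hits / total if total else 0.0
    return scores

# Toy example with hypothetical romanised entries:
candidates = {"i60833": ["kappal", "kappal", "padagu"], "i65336": ["kulaai"]}
references = {"i60833": {"kappal"}, "i65336": {"naalam"}}
print(precision_at_k(candidates, references))   # {1: 0.5, 2: 0.5, 5: 0.5, 10: 0.5}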

ILI code     Gloss                                                                                            IWN entry (meaning)        Our translation (meaning)   Comments
14647235-n   any of several compounds containing chlorine and nitrogen; used as an antiseptic in wounds       'nitrogen'                 'nitrogen'                  Spelling variant
01026095-v   give the name or identifying characteristics of; refer to by name or identity                    'name'                     'name'                      Inflected form, different part-of-speech
00461782-n   a game in which balls are rolled at an object or group of objects with the aim of knocking       'ball'                     'bowling'                   Correct translation, sense missing in IWN
             them over or moving them
04751305-n   noticeable heterogeneity                                                                          'diverseness, diversity'   'diverseness, diversity'    Agrees with IWN
01546111-v   be standing; be upright                                                                           'to lift'                  'to stand'                  Correct translation, sense missing in IWN

Figure 2: Examples of the manual evaluation of Tamil wordnet entries (given by their English meanings) in comparison to the IndoWordNet (IWN).
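The script-based filtering of Section 4.3, which produced the non-code-mixed corpora compared in Tables 2, 4 and 5, can be approximated with Unicode block checks. The block ranges below are the standard Unicode ranges for the three scripts; the exact token-level rules of the original cleaning are an assumption based on the description given above.

# Unicode blocks: Tamil U+0B80-U+0BFF, Telugu U+0C00-U+0C7F, Kannada U+0C80-U+0CFF.
SCRIPT_RANGES = {
    "tamil": (0x0B80, 0x0BFF),
    "telugu": (0x0C00, 0x0C7F),
    "kannada": (0x0C80, 0x0CFF),
}

def in_script(token, script):
    lo, hi = SCRIPT_RANGES[script]
    return any(lo <= ord(ch) <= hi for ch in token)

def remove_code_mixing(pairs, script):
    # pairs: list of (english_sentence, target_sentence).
    # Drop a sentence pair entirely if the target side has no native-script token,
    # and strip foreign-script tokens (e.g. embedded English words) from the target side.
    cleaned = []
    for en, tgt in pairs:
        tgt_tokens = tgt.split()
        if not any(in_script(tok, script) for tok in tgt_tokens):
            continue
        native_only = [tok for tok in tgt_tokens if in_script(tok, script)]
        cleaned.append((en, " ".join(native_only)))
    return cleaned

Counting sentences and tokens before and after such a filter gives removal percentages of the kind reported in Table 3.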

We should note that this evaluation was carried out both for the original, uncleaned corpus and for the cleaned (non-code-mixed) corpus. We observed that the cleaned data produces better results compared to the original data, which contains many code-mixed entries. From Table 5, we can see that there is a significant improvement in the "inflected form" and "correct, but not found in IndoWordNet" categories. This shows that our method can help to improve the wordnet entries for under-resourced languages.

6 Discussion

While our automatic evaluation results are a little disappointing, which is perhaps unsurprising in the context of under-resourced languages as there is very little data available for these languages, our manual evaluation shows that this is far from reality. Evaluating using a resource such as IndoWordNet is always likely to be problematic, as the resource is far from complete and does not claim to cover all words in the Dravidian languages studied in this paper. Moreover, IndoWordNet is overly skewed towards the classical words of these languages, whereas the majority of our parallel corpus consists of day-to-day conversational text. Despite the low precision in determining the exact match to the IndoWordNet, our technique yields 48% precision at 10 in the manual evaluation, although the automatic evaluation gives only 12% at precision at 10. Our method relies on IndoWordNet for evaluation, but IndoWordNet is biased towards one particular language, Hindi. The resulting wordnet entries, though noisy, are suitable for aiding wordnet creation for under-resourced languages.

The handling of code-mixing in this paper appears to improve the quality of the proposed translations, outperforming the baseline results for wordnet entries once code-mixed text was removed from the data. Thus we believe that the method presented here is applicable to resource creation for under-resourced languages.

7 Conclusion

In this paper we showed the challenges in building wordnets for under-resourced languages and presented evidence that our method can aid the creation or improvement of wordnets for under-resourced languages. We experimented with the available data to create SMT systems for three Dravidian languages and used those as a baseline. To improve the results, we removed the code-mixed terms from the corpus. Our results indicate that the proposed removal of code-mixed text from the corpus results in gains for the wordnet entries with limited data.

8 Acknowledgements

This work was supported by the Science Foundation Ireland under Grant Number SFI/12/RC/2289 (Insight).

References

Iñaki Alegria, Xabier Artola, Arantza Diaz De Ilarraza, and Kepa Sarasola. 2011. Strategies to develop language technologies for less-resourced languages based on the case of Basque.

Purya Aliabadi, Mohammad Sina Ahmadi, Shahin Salavati, and Kyumars Sheykh Esmaili. 2014. Towards building KurdNet, the Kurdish WordNet. In Proceedings of the 7th Global WordNet Conference (GWC'14), pages 1–6.

Mihael Arcan, Mauro Dragoni, and Paul Buitelaar. 2016a. Translating ontologies in real-world settings. In Proceedings of the 15th International Semantic Web Conference (ISWC-2016), Kobe, Japan.

Mihael Arcan, John P. McCrae, and Paul Buitelaar. 2016b. Expanding wordnets to new languages with multilingual sense disambiguation. In International Conference on Computational Linguistics (COLING-2016), Osaka, Japan.

Moses Omoniyi Ayeomoni. 2006. Code-switching and code-mixing: Style of language use in childhood in Yoruba speech community. Nordic Journal of African Studies, 15(1):90–99.

Utsab Barman, Amitava Das, Joachim Wagner, and Jennifer Foster. 2014. Code mixing: A challenge for language identification in the language of social media. In Proceedings of the First Workshop on Computational Approaches to Code Switching, pages 13–23.

Pushpak Bhattacharyya. 2010. IndoWordNet. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).

Francis Bond, Piek Vossen, John P. McCrae, and Christiane Fellbaum. 2016. CILI: the Collaborative Interlingual Index. In Proceedings of the Global WordNet Conference 2016.

Arunavha Chanda, Dipankar Das, and Chandan Mazumdar. 2016. Columbia-Jadavpur submission for EMNLP 2016 code-switching workshop shared task: System description. EMNLP 2016, page 112.

Amitava Das and Björn Gambäck. 2013. Code-mixing in social media text. The last language identification frontier? TAL, 54(3):41–64.

Andreas Eisele and Yu Chen. 2010. MultiUN: A multilingual corpus from United Nation documents. In Proceedings of the Seventh Conference on International Language Resources and Evaluation, pages 2868–2872. European Language Resources Association (ELRA).

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring user productivity in machine translation enhanced computer assisted translation. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas (AMTA).

Christiane Fellbaum. 2010. WordNet. In Theory and Applications of Ontology: Computer Applications, pages 231–243. Springer.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 439–448. ACM.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Csilla Horváth, Ágoston Nagy, Norbert Szilágyi, and Veronika Vincze. 2016. Where bears have the eyes of currant: Towards a Mansi wordnet. In Proceedings of the Eighth Global WordNet Conference, pages 130–134.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Conference Proceedings: the Tenth Machine Translation Summit. AAMT.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York, NY, USA, 1st edition.

Samuel Läubli, Mark Fishel, Gary Massey, Maureen Ehrensberger-Dow, and Martin Volk. 2013. Assessing post-editing efficiency in a realistic translation environment. Machine Translation Summit XIV, page 83.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Gaurav Mohanty, Abishek Kannan, and Radhika Mamidi. 2017. Building a SentiWordNet for Odia. EMNLP 2017, page 143.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Emanuele Pianta, Luisa Bentivogli, and Christian Girardi. 2002. Developing an aligned multilingual database. In Proceedings of the 1st International Conference on Global WordNet.

Matt Post, Chris Callison-Burch, and Miles Osborne. 2012. Constructing parallel corpora for six Indian languages via crowdsourcing. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 401–409. Association for Computational Linguistics.

S. Rajendran, S. Arulmozi, B. Kumara Shanmugam, S. Baskaran, and S. Thiagarajan. 2002. Tamil WordNet. In Proceedings of the First International Global WordNet Conference, Mysore, volume 152, pages 271–274.

S. Rajendran, G. Shivapratap, V. Dhanlakshmi, and K. P. Soman. 2010. Building a wordnet for Dravidian languages. In Proceedings of the Global WordNet Conference (GWC 10). Citeseer.

Loganathan Ramasamy, Ondřej Bojar, and Zdeněk Žabokrtský. 2012. Morphological processing for English-Tamil statistical machine translation. In Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages (MTPIL-2012), pages 113–122.

Kevin P. Scannell. 2007. The Crúbadán Project: Corpus building for under-resourced languages. In Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop, volume 4, pages 5–15.

Sanford B. Steever. 1987. Tamil and the Dravidian languages. The World's Major Languages, pages 725–746.

Ralf Steinberger, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski, and Signe Gilbro. 2014. An overview of the European Union's highly multilingual parallel corpora. Language Resources and Evaluation, 48(4):679–707.

Jörg Tiedemann. 2008. Synchronizing translated movie subtitles. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2008/.

Jörg Tiedemann. 2009. News from OPUS: A collection of multilingual parallel corpora with tools and interfaces. In Advances in Natural Language Processing, volume V, chapter V, pages 237–248. Borovets, Bulgaria.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

T. N. Vikram and Shalini R. Urs. 2007. Development of prototype morphological analyzer for the South Indian language of Kannada, pages 109–116. Springer Berlin Heidelberg, Berlin, Heidelberg.

Piek Vossen. 1997. EuroWordNet: a multilingual database for information retrieval. In Proceedings of the DELOS Workshop on Cross-language Information Retrieval, pages 5–7.

Michael Miller Yoder, Shruti Rijhwani, Carolyn Penstein Rosé, and Lori Levin. 2017. Code-switching as a social act: The case of Arabic Wikipedia talk pages. ACL 2017, page 73.