Comparing Two Techniques for Learning Transliteration Models Using a Parallel Corpus
Total Page:16
File Type:pdf, Size:1020Kb
Comparing Two Techniques for Learning Transliteration Models Using a Parallel Corpus Hassan Sajjad Nadir Durrani Helmut Schmid Alexander Fraser Institute for Natural Language Processing University of Stuttgart sajjad,durrani,schmid,fraser @ims.uni-stuttgart.de { } Abstract overlapping words will help to bridge the script- We compare the use of an unsupervised ing gap between Hindi and Urdu. The remaining transliteration mining method and a rule- words must be converted into the other language based method to automatically extract lists with a bilingual dictionary which is beyond the of transliteration word pairs from a par- scope of this work. allel corpus of Hindi/Urdu. We build In this paper, the term transliteration pair refers joint source channel models on the auto- to a word pair where the words are translitera- matically aligned orthographic transliter- tions of each other and the term transliteration ation units of the automatically extracted unit refers to a character pair where the charac- lists of transliteration pairs resulting in two ters are transliterations of each other. We are transliteration systems. We compare our interested in building joint source channel mod- systems with three transliteration systems els for transliteration. Because we do not have a available on the web, and show that our list of transliteration pairs to use as training data systems have better performance. We per- in building such a transliteration model, we use form an extensive analysis of the results two methods to extract the list of transliteration of using both methods and show evidence pairs from a parallel corpus of Hindi/Urdu. The that the unsupervised transliteration min- first method uses the transliteration mining algo- ing method is superior for applications rithm of Sajjad et al. (2011) to automatically ex- requiring high recall transliteration lists, tract transliteration pairs. This approach does not while the rule-based method is useful for use any language specific knowledge. The sec- obtaining high precision lists. ond method uses handcrafted transliteration rules specific to the mapping between Hindi and Urdu 1 Introduction to extract transliteration pairs. We automatically Urdu and Hindi are closely related languages align the two lists of extracted transliteration pairs which have a similar phonological, semantic and at the character level and learn two transliteration syntactic structure. Hindi is derived from San- models. We compare the results with three other skrit and Urdu is a mixture of Persian, Arabic, transliteration systems. Both of our transliteration Turkish and Sanskrit. Both share closed class vo- systems perform better than the other systems. cabulary which they inherit from Sanskrit. They The 1-best output of the transliteration system differ however in the open class vocabulary and built on the list extracted using the rule-based in the writing script used. Hindi is written in method is better than the 1-best output of the sys- Devanagari script and borrows most of the open tem built on the automatically extracted list. The class vocabulary from Sanskrit. Urdu is written rule-based extraction method is focused on obtain- in Perso-Arabic script and borrows most of the ing a high precision list as compared to the auto- open class vocabulary from Persian, Arabic, Turk- matic method which obtains a higher recall list. ish and Sanskrit. Both languages have lived to- The 10-best and 20-best output of the transliter- gether for centuries and now share a large part ation system built on the automatically extracted of their vocabulary with each other. In an initial list is better than the N-best outputs of the sys- study on a small parallel corpus, we found that tem built on the list extracted using the rule-based both languages share approximately 82% (tokens) method. The wide coverage of transliteration and 62% (types) of the vocabulary. Transliterating units in the automatically extracted list helps the 129 Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 129–137, Chiang Mai, Thailand, November 8 – 13, 2011. c 2011 AFNLP transliteration system to produce difficult translit- erations which are hard to learn using the rule- based list. The transliteration task between Hindi and Urdu is non-trivial. The missing short vowels in the writing of Urdu and a missing short vowel in the writing of Hindi are a particular problem, and we Table 1: Ambiguous Hindi characters (charac- identify other areas of difficulty. We provide a de- ters which can transliterate to many different Urdu tailed error analysis to account for the complexi- characters) ties in Hindi to Urdu transliteration motivated by netic Alphabets) and X-SAMPA1 to develop a linguistic phenomena. phoneme-based mapping scheme between Urdu The paper is organized as follows. Previous and Hindi (J C. Wells, 1995). work on transliteration is summarized in Section Malik et al. (2008) reported an accuracy of 2. The two methods used to extract lists of translit- 97.9% for transliterating Hindi to Urdu. How- eration pairs are described in Section 3. The joint ever, this number is not comparable to ours. Some probability model for transliteration is explained Hindi characters can be ambiguously transliterated in Section 4. The evaluation and the results in to several Urdu characters (see Table 1). Malik et comparison with three other transliteration sys- al. (2008) do not deal with these ambiguous char- tems are presented in Section 5. A detailed dis- acters and count any occurrence of an ambiguous cussion and error analysis is presented in Section character as a correct transliteration in all scenar- 6. Section 7 concludes. ios. We discuss this further in Section 6. In the previous work, a transliteration system 2 Previous Work is built on transliteration units learned either au- Transliteration can be done with phoneme-based tomatically from a list of transliteration pairs (Li or grapheme-based models. Knight and Graehl et al., 2004), (Pervouchine et al., 2009) or using a (1998), Stalls and Knight (1998), Al-Onaizan and heuristic-based method (Ekbal et al., 2006). We do Knight (2002) and Pervouchine et al. (2009) use not have a list of transliteration pairs for the train- the phoneme-based approach for transliteration. ing of our Hindi to Urdu transliteration system. Kashani et al. (2007) and Al-Onaizan and Knight Therefore we use two methods to extract transliter- (2002) use a grapheme-based model to translit- ation pairs from parallel data of Hindi/Urdu. In the erate from Arabic into English. Al-Onaizan and first approach, we use the transliteration mining al- Knight (2002) compare a grapheme-based ap- gorithm proposed by Sajjad et al. (2011) to extract proach, a phoneme-based approach and a linear transliteration pairs. This method does not use any combination of both for transliteration. They build language dependent information. In the second a conditional probability model. The grapheme- approach, we use a rule-based method to extract based model performs better than the phoneme- transliteration pairs. Both processes are imperfect, based model and the hybrid model. This motivates meaning that there is noise in the extracted list of our use of grapheme-based models. transliteration pairs. We build a joint source chan- nel model as described by Li et al. (2004) and Ek- In this paper, we use a grapheme-based ap- bal et al. (2006) on the extracted list of translitera- proach for transliteration from Hindi to Urdu. The tion pairs. The following sections describe the two phoneme-based approach would involve the con- mining approaches and the model in detail. version of Hindi and Urdu text into a phonemic representation which is not a trivial task as the 3 Extraction of Transliteration Pairs short vowel ‘a’ is not written in Hindi text and no short vowels are written in Urdu text. The diffi- We automatically word-align the parallel corpus culty of this additional step would be likely to lead and extract a word list, later referred to as “list of to additional errors. word pairs“ (see Section 5, for details on training Malik et al. (2008) and Malik et al. (2009) data). We use two methods to extract translitera- work on transliteration from Hindi to Urdu and tion pairs from the list of word pairs. In the first Urdu to Hindi respectively. They use the rules 1SAMPA and XSAMPA are used to represent the IPA of SAMPA (Speech Assessment Methods Pho- symbols using 7-bit printable ASCII characters. 130 approach, we automatically extract transliteration iteration which best predicts the held-out data is pairs using the transliteration mining algorithm as selected as the stopping iteration for the transliter- proposed in Sajjad et al. (2011). We align the ation mining algorithm. transliteration pairs at character level using a char- We first ran Algorithm 2 on the list of word pairs acter aligner. In the second approach, we use an for 100 iterations. It returned the 45th iteration as edit distance metric and handcrafted equivalence the best stopping iteration for Algorithm 1. Then rules to extract transliteration pairs from a paral- we ran Algorithm 1 for 45 iterations and obtained lel corpus. We align the list of transliteration pairs a list of 2245 transliteration pairs. Due to data at character level using the edit distance metric. sparsity, there were two Hindi characters which The transliteration system is then trained on these were missing in the extracted list of transliteration character aligned transliteration pairs which is de- pairs. We could either add complete word ex- scribed in Section 5. The following subsections amples or just transliteration units of the missing describe the extraction methods in detail. Hindi characters to the list of transliteration pairs. Adding examples will provide context information 3.1 Automatic Extraction of Transliteration which may bias the results of the evaluation.