<<

Soundex-based Translation Correction in Urdu–English Cross-Language Information Retrieval

Manaal Faruqui Prasenjit Majumder Sebastian Padó Computer Science & Engg. Computer Science & Engg. Computational Linguistics Indian Institute of Technology DAIICT Gandhinagar Heidelberg University Kharagpur, India Gandhinagar, India Heidelberg, Germany [email protected] [email protected] [email protected]

Abstract sorting (Hussain et al., 2007), and information ex- traction (Mukund et al., 2010). Many other pro- Cross-language information retrieval is dif- cessing tasks are still missing, and the size of the ficult for languages with few processing Urdu internet is minuscule compared to English tools or resources such as Urdu. An easy and other major languages, making Urdu a prime way of translating content words is pro- candidate for a CLIR source language. vided by Google Translate, but due to lex- A particular challenge which Urdu poses for icon limitations named entities (NEs) are CLIR is its writing system. Even though it is a transliterated letter by letter. The resulting Central Indo-Aryan language and closely related to NEs errors (zynydyny zdn for Zinedine Zi- Hindi, its development was shaped predominantly dane) hurts retrieval. We propose to replace by Persian and Arabic, and it is written in Perso- English non-words in the translation out- Arabic script rather than Devanagari. CLIR with put. First, we determine phonetically sim- a target language that uses another script needs to ilar English words with the Soundex algo- transliterate (Knight and Graehl, 1998) any ma- rithm. Then, we choose among them by a terial that cannot be translated (typically out-of- modified Levenshtein distance that models vocabulary items like Named Entities). The diffi- correct transliteration patterns. This strat- culties of Perso-Arabic in this respect are (a), some egy yields an improvement of 4% MAP vowels are represented by letters which are also (from 41.2 to 45.1, monolingual 51.4) on consonants and (b), short© vowels© are customarily the FIRE-2010 dataset. omitted. For example, in AKñKð (Winona) the first ð is used for the W but the second is used for O. Also 1 Introduction the i sound is missing after ð (W). Cross-language information retrieval (CLIR) re- In this paper, we consider Urdu–English CLIR. search is the study of systems that accept queries Starting from a readily available baseline (using in one language and return text documents in a dif- Google Translate to obtain English queries), we ferent language. CLIR is of considerable practical show that transliteration of Named Entities, more importance in countries with many languages like specifically missing vowels, is indeed a major fac- India. One of the most widely used languages is tor in wrongly answered queries. We reconstruct Urdu, the official language of five Indian states as missing vowels in an unsupervised manner through well as the national language of Pakistan. There are an approximate string matching procedure based around 60 million speakers of Urdu – 48 million in on phonetic similarity and orthographic similarity India and 11 million in Pakistan (Lewis, 2009). by using Soundex code (Knuth, 1975) and Leven- Despite this large number of speakers, NLP shtein distance (Gusfield, 1997) respectively, and for Urdu is still at a fairly early stage (Hussain, find a clear improvement over the baseline. 2008). Studies have been conducted on POS tag- ging (Sajjad and Schmid, 2009), corpus construc- 2 Translation Strategies for Urdu–English tion (Becker and Riaz, 2002), word segmenta- We present a series of strategies for translating tion (Durrani and Hussain, 2010), lexicographic Urdu queries into English so that they can be pre-

25 Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 25–29, Chiang Mai, Thailand, November 8-12, 2011. sented to a monolingual English IR system that 2.3 NER-centered Approximate Matching works on some English document collection. In- (GTR+SoExNER) spection of the strategies’ errors led us to develop An analysis of the output of the GTR+SoEx model a hierarchy of increasingly sophisticated strategies. showed that the model indeed ensured that all words in the translation were English words, but 2.1 Baseline model (GTR) that it “overcorrected”, replacing correctly trans- As our baseline, we aimed for a model that is state- lated, but infrequent, English words by more fre- of-the-art, freely available, and can be used by users quent words with the same Soundex code. Unfor- without the need for heavy computational machin- tunately, Google Translate does not indicate which ery. We decided to render the Urdu query into words in its output are out-of-vocabulary. English with the Google Translate web service.1 Recall that our original motivation was to im- prove coverage specifically for out-of-vocabulary 2.2 Approximate Matching (GTR+SoEx) words, virtually all of which are Named Entities. Thus, we decided to apply Soundex matching only Google Translate appears to have a limited Urdu to NEs. As a practical and simple way of identi- lexicon. Words that are out of vocabulary (OOV) fying malformed NEs, we considered those words are transliterated letter by letter into the Latin alpha- in the Google Translate output which did not oc- bet. Without an attempt to restore short (unwritten) cur in the English document base at all (i.e., which vowels, these match the actual English terms only were “non-words”). We manually verified that this very rarely. For example, Singur, the name of a heuristic indeed identified malformed Named En- village in India gets translated to Sngur. tities in our experimental materials (see Section 3 To address this problem, we attempt to map below for details). We found a recall of 100% (all these incomplete transliterations onto well-formed true NEs were identified) and a precision of 96% (a English words using approximate string match- small number of non-NEs was classified as NEs). ing. We use Soundex (Knuth, 1975), an algorithm The GTR+SoExNER strategy applies Soundex which is normally used for “phonetic normaliza- matching to all NEs, but not to other words in the tion”. Soundex maps English words onto their first Google Translate output. letter plus three digits which represent equivalence classes over consonants, throwing away all vowels 2.4 Disambiguation in the process. For example, Ashcraft is mapped (GTR+SoExNER+LD(mod)) onto A261, where 2 stands for the “gutturals” and “sibilants” S and K, 6 for R, and 1 for the “labio- Generally, a word that has been wrongly translit- dental” F. All codes beyond the first three are ig- erated from Urdu maps onto the same Soundex nored. The same soundex code would be assigned, code as several English words. The num- for example, to Ashcroft, Ashcrop, or even Azaroff. ber of English words per transliteration is 7. This can be seen as a sort of ambiguity, and the strat- The two components which make Soundex a well- suited choice for our purposes are exactly (a), the egy adopted by the previous models is to just forming of equivalence classes over consonants, choose the most frequent candidate, similar to the which counteracts introduced by one-to- “predominant” sense baseline in word sense dis- many correspondences between Latin and Arabic ambiguation (McCarthy et al., 2004). We found however that the most frequent candidate is of- letters; and (b), the omission of vowels. ten wrong, since Soundex conflates fairly different Specifically, we use Soundex as a hash function, words (cf. Section 2.2). For example, Subhas, the mapping all English words from our English docu- first name of an Indian freedom fighter, receives ment collection onto their Soundex codes. The the soundex code S120 but it is mapped onto the GTR+SoEx model then attempts to correct all English term Space (freq=7243) instead of Subhas words in the Google Translate output by replac- (freq=2853). ing them with the English word sharing the same We therefore experimented with a more in- Soundex code that has the highest frequency in the formed strategy that chooses the English candi- English document collection. date based on two variants of Levenshtein distance. 1http://translate.google.com. All queries The first model, GTR+SoExNER+LD, uses stan- were translated in the first week of January 2011. dard Levenshtein distance with a cost of 1 for

26 each insertion, deletion and substitution. Our fi- a pre-, we experimented with adding nal model, GTR+SoExNER+LDmod uses a modi- either only the single most similar term for each fied version of Levenshtein distance which is opti- OOV item (1-best) or the best n terms (n-best). mized to model the correspondences that we expect. We consistently found better results for 1-best and Specifically, the addition of vowels and the replace- report results for this condition only. ment of consonants by vowels come with no cost, to favour the recovery of English vowels that are Monolingual model We also computed a mono- unexpressed in Urdu or expressed as consonants lingual English model which did not use the trans- (cf. Section 1). Thus, the LDmod between zdn and lated Urdu queries but the original English ones zidane would be Zero. instead. The result for this model can be seen as an upper bound for Urdu-English CLIR models. 3 Experimental Setup Evaluation We report two evaluation measures. Document Collection and Queries Our experi- The first one is Average Precision (MAP), an ments are based on the FIRE-20102 English data, evaluation measure that is highest when all correct consisting of documents and queries, as our exper- items are ranked at the top (Manning et al., 2008). imental materials. The document collection con- MAP measures the global quality of the ranked doc- sists of about 124,000 documents from the English- ument list; however improvements in MAP could language newspaper “The Telegraph India”3 from result from an improved treatment of marginally 2004-07. The average length of a document was relevant documents, while it is the quality of the 40 words. The FIRE query collection consists of top-ranked documents that is most important in 50 English queries which were of the same domain practice and correlates best with extrinsic measures as that of the document collection. The average (Scholer and Turpin, 2009). Therefore we also number of relevant documents for a query was 76 consider P@5, the precision of the five top-ranked (with a minimum of 13 and a maximum of 228). documents. The first author, who has an advanced knowledge of Urdu, translated the English FIRE queries man- 4 Results and Discussion ually into Urdu. One of the resulting Urdu query is Table 2 shows the results of our . shown in Table 1, together with the Google trans- Monolingual English retrieval achieves a MAP of lations back into English (GTR) which form the 51.4, while the CLIR baseline (Google Translate basis of the CLIR queries in the simplest model. only – GTR) is 41.3. We expect the results of our Every query has a title, and a description, both of experiments to fall between these two extremes. which we used for retrieval. The bottom row (en- We first extend the baseline model with Soundex tity) shows the Translate output and from the best matching for all terms in the title and description model (Soundex matching with modified Leven- (GTR+SoEx), we actually obtain a result way be- shtein distance). The bold-faced terms correspond low the baseline (MAP=36.7). The reason is that, to names that are corrected successfully, increasing as discussed in Section 2.2, Soundex is too coarse- the query’s precision from 49% to 86%. grained for non-NEs, grouping words such as red Cross-lingual IR setup We implemented the and road into the same equivalence class, thus models described in Section 2, using the Terrier pulling in irrelevant terms. This analysis is sup- IR engine (Ounis et al., 2006) for retrieval from ported by the observation, mentioned above, that the FIRE-2010 English document collection. We 1-best always performs better than n-best. used the PL2 weighting model with the term fre- We are however able to obtain a clear improve- quecy normalisation parameter of 10.99. The doc- ment of about 1.5% absolute by limiting Soundex ument collection and the queries were stemmed matching to automatically identified Named En- using the Porter Stemmer (Porter, 1980). We ap- tities, up to MAP=43.0 (GTR+SoExNER). How- plied all translation strategies defined in Section 2 ever, this model still relies completely on fre- as query expansion modules that enrich the Google quency for choosing among competitors with the Translate output with new relevant query terms. In same Soundex code, leading to errors like the Subhas/Space mixup discusssed in Section 2.4. 2http://www.isical.ac.in/~fire/2010/ data_download.html The use of Levenshtein distance, representing a 3http://www.telegraphindia.com/ more informed manner of disambiguation, makes

27  © á © © © © © © © title UR íª¯@ð A¿ ÿPAÓ ÿ Qå ÓI » YËPð A¿ à@YK P áK YJ K P title EN (GTR) Zynydyny zydan World© Cup head butt incident ñ» øP©Q KAÓ ÿ© à@© YK P© á Ó k. áK Q» €C K ñ» H@ QK© ðAJƒX ÿ„@ desc UR © © © à@© YK P© ÿ© øñËAK@I k. @PAÓ á ÓÉJK© ZA¯ ÿ» 200T I» YËPð ÿ Qå .  © © á ËñK. á KAK . P@ñÃAK© ¬Cg ÿ» desc EN (GTR) Find these documents from public opinion zdn to mtrzzy, from Italian to zydan about offensive comments, World Cup finals in 2006 head to kill incidents are mentioned entity EN (GTR) Zynydyny Zydan zdn Mtrzzy entity (GTR+SoExNER+LDmod) zinedine zaydan zidane materazzi

Table 1: A sample query

Model MAP P@5 5 Related Work GTR 41.3 62.4 GTR+SoEx 36.7 59.2 There are several areas of related work. The first GTR+SoExNER 43.0 62.4 is IR in Urdu, where monolingual work has been GTR+SoExNER+LD 45.0 65.2 done (Riaz, 2008). However, to our knowledge, our study is the first one to address Urdu CLIR. GTR+SoExNER+LDmod 45.3 65.6 Monolingual English 51.4 71.6 The second is machine transliteration, which is a widely researched area (Knight and Graehl, 1998) Table 2: Results for Urdu-English CLIR models on but which usually requires some sort of bilingual the FIRE 2010 collection (Mean Average Precision resource. Knight and Graehl (1998) use 8000 and Precision of top five documents) English-Japanese place name pairs, and Mandal et al. (2007) hand-code rules for Hindi and Bengali to English. In contrast, our method does not require any bilingual resources. Finally, Soundex codes a considerable difference, and leads to a final have been applied to Thai-English CLIR (Suwan- MAP of 45.33 or about 4% absolute increase visat and Prasitjutrakul, 1998) and Arabic name for the (GTR+SoExNER+LDmod) model. A search (Aqeel et al., 2006). They have also been bootstrap analysis (Efron and Tibshi- found useful for indexing Named Entities (Ragha- rani, 1994) confirmed that the difference between van and Allan, 2004; Kondrak, 2004) as well as IR GTR+SoExNER+LDmod and GTR model is signif- more generally (Holmes and McCabe, 2002). icant (p<0.05). All models are still significantly worse than the monolingual English model. 6 Conclusion The P@5 results are in tandem with the MAP re- In this paper, we have considered CLIR from Urdu sults for all models, showing that the improvement into English. With Google Translate as translation which we obtain for the best model leads to top-5 system, the biggest hurdle is that most named enti- lists whose precision is on average more than 3% ties are out-of-vocabulary items and transliterated better than the baseline top-5 lists. This difference incorrectly. A simple, completely unsupervised is not significant, but we attribute the absence of postprocessing strategy that replaces English non- significance to the small sample size (50 queries). words by phonetically similar words with minimal edit distance is able to recover almost half of the In a qualitative analysis, we found that many re- loss in MAP that the cross-lingal setup incurs over maining low-MAP queries still suffer from missing a monolingual English one. Directions for future or incorrect Named Entities. For example, Noida work include monolingual query expansion in Urdu (an industrial area near New Delhi), was translit- to improve the non-NE part of the query and train- erated to Nuydh and then incorrectly modified to ing a full Urdu-English transliteration system. Nidhi (an Indian name). This case demonstrates the limits of our method which cannot distinguish well Acknowledgments We thank A. Tripathi and H. among NEs which differ mainly in their vowels. Sajjad for invaluable discussions and suggestions.

28 References Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Syed Uzair Aqeel, Steve Beitzel, Eric Jensen, David Retrieval. Cambridge University Press, Cambridge, Grossman, and Ophir Frieder. 2006. On the devel- UK, 1st edition. opment of name search techniques for Arabic. Jour- nal of the American Society for Information Science Diana McCarthy, Rob Koeling, Julie Weeds, and John and Technology, 57(6):728–739. Carroll. 2004. Finding predominant senses in un- tagged text. In Proceedings of the Annual Meet- Dara Becker and Kashif Riaz. 2002. A study in urdu ing of the Association for Computational Linguistics, corpus construction. In Proceedings of the 3rd COL- pages 280–287, Barcelona, Spain. ING workshop on Asian language resources and in- ternational standardization, pages 1–5, Taipei, Tai- Smruthi Mukund, Rohini Srihari, and Erik Peterson. wan. 2010. An information-extraction system for urdu— a resource-poor language. ACM Transactions on Nadir Durrani and Sarmad Hussain. 2010. Urdu word Asian Language Information Processing, 9:15:1– segmentation. In Proceedings of the Annual Confer- 15:43. ence of the North American Chapter of the Associa- tion for Computational Linguistics, pages 528–536, Iadh Ounis, Gianni Amati, Vassilis Plachouras, Ben Los Angeles, California. He, Craig Macdonald, and Christina Lioma. 2006. Terrier: A High Performance and Scalable Informa- Bradley Efron and . 1994. An Intro- tion Retrieval Platform. In Proceedings of ACM SI- duction to the Bootstrap. Monographs on GIR’06 Workshop on Open Source Information Re- and Applied Probability 57. Chapman & Hall. trieval, Seattle, WA, USA.

Dan Gusfield. 1997. Algorithms on strings, trees, and Martin F. Porter. 1980. An algorithm for suffix strip- sequences: computer science and computational bi- ping. Program, 14(3):130–137. ology. Cambridge University Press, Cambridge, UK. Hema Raghavan and James Allan. 2004. Us- ing soundex codes for indexing names in ASR David Holmes and M. Catherine McCabe. 2002. Im- documents. In Proceedings of the HLT-NAACL proving precision and recall for soundex retrieval. 2004 Workshop on Interdisciplinary Approaches In Proceedings of the International Conference on to Speech Indexing and Retrieval, pages 22–27, Information Technology: Coding and Computing, Boston, MA. pages 22–27, Washington, DC, USA. Kashif Riaz. 2008. Baseline for Urdu IR evaluation. Sarmad Hussain, Sana Gul, and Afifah Waseem. 2007. In Proceeding of the 2nd ACM workshop on Im- Developing lexicographic sorting: An example for proving non-English web searching, pages 97–100, urdu. ACM Transactions on Asian Language Infor- Napa, CA, USA. mation Processing, 6(3):10:1–10:17. Hassan Sajjad and Helmut Schmid. 2009. Tagging Sarmad Hussain. 2008. Resources for urdu language Urdu text with parts of speech: A tagger comparison. processing. In Proceedings of the Workshop on In Proceedings of the 12th Conference of the Euro- Asian Language Resources at IJCNLP 2008, Hyder- pean Chapter of the ACL, pages 692–700, Athens, abad, India. Greece.

Kevin Knight and Jonathan Graehl. 1998. Ma- Falk Scholer and Andrew Turpin. 2009. Metric and rel- chine transliteration. Computational Linguistics, evance mismatch in retrieval evaluation. In Informa- 24(4):599–612. tion Retrieval Technology, volume 5839 of Lecture Notes in Computer Science, pages 50–62. Springer. Donald E. Knuth. 1975. Fundamental Algorithms, volume III of The Art of Computer Programming. Prayut Suwanvisat and Somboon Prasitjutrakul. 1998. Addison-Wesley, Reading, MA. Thai-English cross-language transliterated word re- trieval using soundex technique. In Proceesings Grzegorz Kondrak. 2004. Identification of confus- of the National Computer Science and Engineering able drug names: A new approach and evalua- Conference, Bangkok, Thailand. tion methodology. In In Proceedings of the Inter- national Conference on Computational Linguistics, pages 952–958, Geneva, Switzerland.

M. Paul Lewis, editor. 2009. Ethnologue – Languages of the World. SIL International, 16th edition.

Debasis Mandal, Mayank Gupta, Sandipan Dandapat, Pratyush Banerjee, and Sudeshna Sarkar. 2007. Bengali and Hindi to English CLIR evaluation. In Proceedings of CLEF, pages 95–102.

29