Proceedings of the Conference on Language & Technology 2009
Adaptive Transliteration Based on Cross-Script Trie Generation: A Case of Roman-Urdu
Usman Afzal, Dr. Naveed Iqbal Rao and Ahmad Muqeem Sheri
Image Processing Center, Military College of Signals, National University of Sciences and Technology, Rawalpindi, Pakistan
{usman.afzal, naveedi}@mcs.edu.pk, [email protected]
Abstract

In the past few years, many tools and techniques have been developed to support language localization. Keyboard layouts and IMEs have been developed and standardized to support local language scripts, but people commonly use Romanized versions of their local languages for Internet and mobile chat. Like many other Asian languages, Urdu is also used in a Romanized form (Roman-Urdu), which is very popular among Urdu speakers. There is no single globally accepted standard for writing Roman-Urdu, so people write Roman-alphabet spellings by intuition, based on their own language experience. Therefore, retrieving the original Urdu word from a given Roman spelling (reverse transliteration) is a challenging problem.

In this paper, we propose an adaptive cross-script trie model which solves the reverse-transliteration problem effectively. The model consists of three layers: a) pre-processing, b) cross-script mapping, and c) trie generation. The pre-processing layer simplifies the input word for the next layer. The cross-script mapping layer is the core of the model; it performs mapping and transformation across the scripts. This layer is based entirely on an analysis of vowel and consonant mappings, which is another contribution we present in this paper. Our model returns all possible equivalent Urdu words as a trie for any given Roman-Urdu word. The trie generation layer uses trie-pruning to remove false branches. Experimental results demonstrate the significance and effectiveness of our model against variations of Roman spellings in the test data.

1. Introduction

Writing a language in its customary script is a basic need of the speakers of any language. With advancements in technology and language computing, many techniques and tools have been developed to support language localization, especially for Asian languages [1], [2]. Standardization of keyboard layouts and development of different Input Method Editors (IMEs) have been the primary focus of language localization efforts. The purpose of an IME is to allow users to input characters of their local languages using standard Latin keyboards. Microsoft Windows XP provides many IMEs for East Asian languages such as Chinese, Japanese, and Korean [3]. Pronunciation-based IMEs are very helpful for users who are less familiar with language-specific keyboards and who write transliterated versions of their languages using the Roman alphabet. Like many other Asian languages (Arabic, Persian, and Hindi), Urdu is also written using the Roman alphabet, and this form is called Roman-Urdu. Although different interfaces and layouts are available to input Urdu characters, the majority of Urdu speakers prefer to use Roman-Urdu, especially for Internet and mobile chat. The obvious reason is unfamiliarity with, and infrequent use of, Urdu keyboards.

Transliteration is the transformation of text from one script to another, usually based on phonetic equivalences. The Roman alphabet is most commonly used for transliteration of languages which have non-Latin scripts. A transliteration scheme defines an unambiguous one-to-one mapping of characters across the scripts. Urdu has many different transliteration schemes [4]-[11] and, therefore, speakers follow no single standard when writing Roman-Urdu. Roman-alphabet spellings are highly dependent on accent and educational level, so everyone writes Roman spellings of Urdu words by intuition. As a result, there exist multiple spelling combinations equivalent to one Urdu word, as shown in Figure 2(a). This diversity of Roman spellings makes retrieval of the original word (technically known as reverse transliteration) very difficult. Transliteration is not trivial to automate, but reverse transliteration is an even more challenging problem.
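To make the three-layer idea from the abstract concrete, the following is a minimal sketch in Python, not the authors' implementation. Everything specific in it is an assumption made for illustration: the tiny `MAPPING` table, the toy `LEXICON`, the choice to pre-process by lowercasing and keeping only letters, and the choice to prune branches by checking prefixes against a word list. The paper's actual mapping rules and pruning criterion are not given in this section.

```python
# Hypothetical, heavily simplified cross-script mapping (an assumption, not the
# paper's table). Each Roman letter maps to candidate Urdu letters; the empty
# string means the vowel may be left unwritten in Urdu script.
MAPPING = {
    "s": ["س", "ص", "ث"],
    "z": ["ز", "ذ", "ض", "ظ"],
    "t": ["ت", "ط"],
    "b": ["ب"],
    "r": ["ر"],
    "a": ["", "ا"],
}
LEXICON = {"صبر", "ضبط"}  # toy word list (assumption), used only for pruning
END = "$"                  # marks a complete word inside the trie


def preprocess(word):
    """Layer 1 (assumed behaviour): lowercase and keep letters only."""
    return "".join(ch for ch in word.lower() if ch.isalpha())


def build_trie(roman, lexicon):
    """Layers 2 and 3: map each Roman letter to Urdu candidates and grow a
    nested-dict trie, pruning branches that cannot lead to a lexicon word."""
    valid = {w[:i] for w in lexicon for i in range(len(w) + 1)}
    root = {}

    def extend(node, i, urdu):
        # node is the trie node spelled by `urdu`; i is the next Roman index.
        if i == len(roman):
            if urdu in lexicon:
                node[END] = urdu
                return True
            return False
        found = False
        for letter in MAPPING.get(roman[i], []):
            candidate = urdu + letter
            if candidate not in valid:
                continue  # prune: no lexicon word starts with this prefix
            if letter == "":
                found |= extend(node, i + 1, candidate)
            else:
                child = node.setdefault(letter, {})
                if extend(child, i + 1, candidate):
                    found = True
                elif not child:
                    del node[letter]  # prune a dead-end branch
        return found

    extend(root, 0, "")
    return root


def words(trie):
    """Collect every complete Urdu word stored in the trie."""
    out = []
    for key, value in trie.items():
        if key == END:
            out.append(value)
        else:
            out.extend(words(value))
    return out


for spelling in ("Sabar", "sabr", "Zabt"):
    print(spelling, words(build_trie(preprocess(spelling), LEXICON)))
```

In this sketch, the variant spellings "Sabar" and "sabr" both resolve to the same Urdu word, which is the behaviour reverse transliteration needs. A real system would require a far richer mapping (digraphs such as "sh" or "kh", long vowels) and a full lexicon.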
Figure 2: Transliteration two-fold mapping. (a) Urdu to Roman script; (b) Roman to Urdu script.
Figure 3: Cross-script trie generation model (layers: Pre-Processing, Cross-Script Mapping, Trie Generation).

cannot be employed for reverse transliteration problem of Roman-Urdu unless a parallel corpus is available.

(a) Transliteration scheme
(1) Z^abt^ {z bt, discipline}
(2) Sabar {s