Urdu-Hindi-Urdu Machine Translation: Some Problems

Amba Kulkarni, Rahmat Yousufzai, Pervez Ahmed Azmi
Department of Sanskrit Studies, University of Hyderabad, Hyderabad, India
[email protected], [email protected]

Abstract

In this paper we discuss the problems in Urdu-Hindi-Urdu Machine Translation at various levels. Because of the large common vocabulary, it may seem that transliteration alone can overcome the language barrier between Urdu and Hindi. However, the tendency of Urdu to use words of Persian and Arabic origin, and the tendency of Hindi to use words of Sanskrit origin, call for a proper Machine Translation system. We point out the problems at various levels of Machine Translation and suggest an alternative approach. Following this alternative approach, a working system has been built and is available at http://sanskrit.uohhyd.ernet.in/~anusaaraka/urdu/Urdu-Hindi-Translation.

1. Introduction:

Urdu and Hindi are very widely spoken languages, particularly in the Indian subcontinent. Both are of Indian origin and have drawn from Sanskrit through Shourseni, Apbhransh and Khadi Boli. The syntax of both languages is almost the same, and many words and expressions are common to both. The common language with the common vocabulary is referred to as Hindustani, which can be written in both scripts, Devanagari and Perso-Arabic. The use of two scripts has divided the world of Hindustani into two. Urdu has a tendency to use words of Persian and Arabic origin, whereas Hindi has a tendency to adopt words from Sanskrit. Thus Hindustani with more words from Perso-Arabic becomes Urdu, while Hindustani with more words from Sanskrit becomes Hindi. During the Mughal Empire and the years thereafter, a lot of Perso-Arabic words entered the common vocabulary of Hindi.

This raises an important issue. The common vocabulary of Hindi and Urdu tempts a developer of Urdu-Hindi-Urdu Machine Translation towards transliteration. At the same time, the presence of Perso-Arabic words in Urdu and Sanskrit words in Hindi, along with certain structural differences, demands that modules such as a Morphological Analyser, POS Tagger, Chunker, etc. be part of a Machine Translation system. In this paper we discuss the problems at various levels of Machine Translation and finally suggest a model for providing easy access to Urdu text through Hindi and vice versa.

2. Transliteration Module:

A large common vocabulary makes an Urdu-Hindi transliteration module an important component of an MT system.

Unlike the majority of Indian scripts, which originated from the Brahmi script, Urdu uses the Perso-Arabic script. Urdu has 38 consonants while Hindi has 33 consonants which are part of Devanagari. Further, Hindi has adopted a few more consonants such as क़, ख़, ग़, ज़, ड़ to represent faithfully the Urdu consonants (ق، خ، غ، ز، ڑ). These are typically generated by placing a nukta (.) character below the corresponding consonants. Urdu does not have special symbols for aspirated consonants. An aspirated consonant is represented orthographically as the corresponding non-aspirated consonant followed by do-chashmi he (ھ). For example, bha (भ) = be (ب) + do-chashmi he (ھ). Urdu does not have conjunct consonants as in Hindi, and thus there is no concept of halant in the Urdu alphabet. However, to represent a conjunct, the diacritic mark jazam ( ْ ) is used.

Hindi has vowels and vowel modifiers. Urdu, on the other hand, does not have any pure vowels except alif (ا). The semi-vowels waw (و), choti ye (ی) and badi ye (ے), along with alif (ا), play the role of long vowels when required. Urdu does not have any short vowel letters; instead, it has the diacritic marks zer ( ِ ), zabar ( َ ), pesh ( ُ ), jazam ( ْ ), tashdeed ( ّ ) and tanveen ( ً ), which are used very rarely or only in basic/elementary texts. Literary texts, newspapers and web sites rarely use these marks, leaving the text ambiguous.

2.1. Urdu-Hindi Transliteration:

The requirements of a good Urdu-to-Hindi transliteration scheme then are:

a) Words common to both Urdu and Hindi should be transliterated correctly as per their conventional spellings.

b) Perso-Arabic words in Urdu that are not common in Hindi should be transliterated to the phonetically closest spellings.

The problem of Urdu-Hindi transliteration then reduces to:

- identifying consonant clusters as conjuncts,
- identifying missing short vowels,
- disambiguating the semi-vowels waw (و), choti ye (ی) and badi ye (ے).

In addition, there are less frequent occurrences of noon (ن) and noon-e-ghunna (ں), which need to be mapped to the corresponding nasalized consonants; similarly, he (ہ) at the end of a word needs to be mapped to either ा or ह, etc.

The following are some of the possible approaches:

a) Have a good-coverage Urdu-Hindi dictionary of common Hindustani words, written in both Urdu and Devanagari script. This approach is definitely the best one. However, until such a dictionary is made available in electronic form, one needs an alternative approach.

b) Have a good-coverage Hindi monolingual dictionary. The word transliterated from Urdu is searched in this dictionary for the best match, and all possible answers are returned. If the Hindi lexicon is available with frequency distribution data, then the frequency information is used to prune out less frequent matches.

c) The above resources may also be used to try various Machine Learning techniques.

A frequency distribution of Hindi words (CIIL corpus) was readily available, and hence we followed approach (b). The results are summarized as follows.

Urdu text        Size in words    Correct Transliteration (%)
Tourism text1    824              98.4%
Tourism text2    666              99.1%
Health Text      334              95%

2.2. Hindi-Urdu Transliteration:

The problem of Hindi-to-Urdu transliteration is easier on account of the following:

- The conjuncts in Hindi need to be split into a sequence of full consonants; in other words, the additional 'halant' character in Devanagari needs to be deleted.
- Vowels get mapped to the corresponding diacritical marks and may be dropped easily if not required.
- The long vowels are mapped to either waw (و) or ye (ی، ے), according to Panini's rule स्थानेऽन्तरतमः (Panini: 1.1.50): the one which is the closest with respect to the place of articulation is the best match.

The major issues then are:

- Though Hindi has extended the Devanagari script by adopting the nukta character and coining new consonants with it, there is no uniformity among Hindi users in the use of these adapted consonants. This leads to wrong Urdu spellings in the transliteration.
- The consonants missing in Hindi also introduce some errors. However, since the words that use (ض ص) are basically of Perso-Arabic origin and are not used frequently in Hindi, the transliteration from Hindi to Urdu, as far as these consonants are concerned, does not pose much of a problem.
- It is the alif (ا) and ain (ع) which are the major trouble-givers. Unless one refers to a dictionary, the correct spelling cannot be guessed in such cases. To handle this ambiguity, we use the Urdu-Hindi bilingual dictionary.

The following table shows the performance of the current system using the above-mentioned rules.

Hindi text    Size in words    Correct Transliteration (%)
text1         381              95.6%
text2         415              95.7%
text3         482              97.6%

3. Morphological Analyzer:

The Finite State Transducer approach to morphology has become very common and popular among the developers of morphological analyzers and generators. The past decade has seen an upsurge of morphological analyzers for a variety of languages: European, Indian, Arabic, etc. Since Urdu borrows heavily from Hindi as well as from Persian and Arabic, it has a mixed morphology. The morphology of Hindi is very simple and can be best captured by the word and paradigm model (Bharati, 1995).

The morphology of the Perso-Arabic words, on the other hand, is item-and-process based. A simple word and paradigm model is not sufficient, since the orthography does not reflect the underlying vowel combination. In the case of Persian and Arabic, it is the vowel combinations which determine the paradigm. Thus, as tried by Beesley (1998) for Arabic, a two-level analysis is required, one level representing the combinations of consonants in the roots and the other representing the vowel combinations.

[...] is 95% for verbs. But for nouns it was found to be only 60%. The major problems were due to the non-availability of root words in the dictionary, and not to any missing paradigms. Unlike Hindi or any other Indian language, it was a little difficult in the case of Urdu to decide the default paradigm. In Indian languages the words are marked with vowels, and the vowels at the end of a word decide the paradigm. However, since Urdu orthography does not mark the vowels, it was difficult to decide the default paradigm. We used a pronunciation dictionary, which contains the missing vowels, to decide the paradigm.

4. Standardization Issues:

Urdu data entry operators do not enter the data in a standardized format, which creates problems in transliteration and also increases the ambiguity. The problems in the e-representation of Urdu texts may be classified into three categories:

a) Wrong spellings:
ی and ے in the middle of a word are not differentiated: مےل and میل are written in the same way (میل). Many a time [?] is written as ی, and ی is written as [?], because the two look the same in Urdu text editors. ں in the middle of a word is written as ن, making the transliteration difficult.

b) Rare use of diacritic marks:
The diacritic marks zabar, zer, pesh, tashdeed and jazam are normally not written; hence differentiating between اِس and اُس is difficult.
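
The standardization problems of Section 4 can be illustrated with a small sketch. It is not the system described in the paper: the function names `skeleton` and `build_index`, and the particular set of characters that are folded together, are assumptions made only for illustration. The idea is to reduce each Urdu word to a key that ignores the differences data entry operators tend to blur (diacritics, ye variants, medial noon-e-ghunna), so that inconsistently typed words can still be looked up in a dictionary.

```python
# Hypothetical normalization sketch; the character choices are assumptions.
DIACRITICS = {
    "\u064E",  # zabar
    "\u0650",  # zer
    "\u064F",  # pesh
    "\u0652",  # jazam
    "\u0651",  # tashdeed
    "\u064B",  # tanveen (do zabar)
}
# Arabic yeh and alef maksura look like choti ye in many fonts.
YE_VARIANTS = {"\u064A": "\u06CC", "\u0649": "\u06CC"}
CHOTI_YE = "\u06CC"
BADI_YE = "\u06D2"
NOON = "\u0646"
NOON_GHUNNA = "\u06BA"


def skeleton(word):
    """Return a lookup key that ignores commonly confused spellings."""
    # Drop diacritics and unify the ye variants.
    bare = [YE_VARIANTS.get(ch, ch) for ch in word if ch not in DIACRITICS]
    last = len(bare) - 1
    out = []
    for i, ch in enumerate(bare):
        # Only non-final positions are folded; word-final forms are distinct.
        if i < last and ch == BADI_YE:
            ch = CHOTI_YE
        elif i < last and ch == NOON_GHUNNA:
            ch = NOON
        out.append(ch)
    return "".join(out)


def build_index(words):
    """Group dictionary words by their skeleton key."""
    index = {}
    for w in words:
        index.setdefault(skeleton(w), []).append(w)
    return index
```

With such a key, اِس and اُس fall into the same bucket, which makes the ambiguity explicit: the key finds both candidates, and choosing between them still needs context or a dictionary with diacritics.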
