Urdu to Punjabi Machine Translation: an Incremental Training Approach

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 7, No. 4, 2016 Urdu to Punjabi Machine Translation: An Incremental Training Approach Umrinderpal Singh Vishal Goyal Gurpreet Singh Lehal Department of Computer Science, Department of Computer Science, Department of Computer Science, Punjabi University Punjabi University Punjabi University Patiala, Punjab, India Patiala, Punjab, India Patiala, Punjab, India Abstract—The statistical machine translation approach is sentence corpus. Collecting parallel phrases were more highly popular in automatic translation research area and convenient as compared to the parallel sentences. promising approach to yield good accuracy. Efforts have been made to develop Urdu to Punjabi statistical machine translation II. URDU AND PUNJABI: A CLOSELY RELATED LANGUAGE system. The system is based on an incremental training approach PAIR to train the statistical model. In place of the parallel sentences Urdu2 is the national language of Pakistan and has official corpus has manually mapped phrases which were used to train the model. In preprocessing phase, various rules were used for language status in few states of India like New Delhi, Uttar tokenization and segmentation processes. Along with these rules, Pradesh, Bihar, Telangana, Jammu and Kashmir where it is text classification system was implemented to classify input text widely spoken and well understood throughout in the other states of India like Punjab, Rajasthan, Maharashtra, to predefined classes and decoder translates given text according 1 to selected domain by the text classifier. The system used Hidden Jharkhand, Madhya Pradesh and many other . The majority Markov Model(HMM) for the learning process and Viterbi of Urdu speakers belong to India and Pakistan, 70 million algorithm has been used for decoding. Experiment and native Urdu speakers are in India and around 10 million evaluation have shown that simple statistical model like HMM speakers in Pakistan2 and thousands of Urdu speakers living in yields good accuracy for a closely related language pair like US, UK, Canada, Saudi Arabia and Bangladesh. The word Urdu-Punjabi. The system has achieved 0.86 BLEU score and in Urdu is derived from Turkic word ordu which means army manual testing and got more than 85% accuracy. camp2. The Urdu language was developed in 6th to 13th century. Urdu vocabulary mainly derived from Arabic, Keywords—Machine Translation; Urdu to Punjabi Machine Persian, and Sanskrit and it is very closely related to modern Translation; NLP; Urdu; Punjabi; Indo-Aryan Languages Hindi language. Urdu is written in Nastaliq style and script is written from right to left using heavily derided alphabets from I. INTRODUCTION Persian which is an extension of Arabic alphabets. 3Punjabi is The machine translation is a burning topic in the area of an Indo-Aryan language and 10th most widely spoken artificial intelligence. In this digital era where across the world language in the world there are around 102 million native different communities are connected to each other and sharing speakers of Punjabi language across worldwide4. Punjabi a vast amount of resources. In this kind of digital environment, speaking people mainly lived in India‟s Punjab state and in different natural languages are the main obstacle to Pakistan‟s Punjab. Punjabi is the official language of Indian communicate. To remove this barrier researcher from different states like Punjab, Haryana, and Delhi and well understood by countries and big companies are putting efforts to develop many other northern Indian regions. Punjabi is also a popular machine transition system to resolve this barrier. Various language in Pakistani Punjab region but still did not get kinds of approaches have been developed to decode natural official language status. In India, Punjabi is written in languages like Rule based, Example-based, Statistical and Gurmukhi script means from Guru‟s mouth and in Pakistan various hybrid approaches. Among all these approaches, Shahmukhi is used means from the king‟s mouth. Despite statistical based approach is a quite dominant and popular in from the different scripts use to write Punjabi, both languages the machine translation research community. The statistical share all other linguistics features from grammar to systems yield good accuracy as compared to other approaches vocabulary in common. but statistical models need a huge amount of training data. In comparison to European languages Asian languages are Urdu and Punjabi are closely related languages and both resources poor languages therefore it is challenging task to belong to same family tree and share many linguistic features collect parallel corpus for training these statistical model. like grammatical structure and vast amount of vocabulary etc. There are many machine translation systems which have been for example: وٍ پٌجبثی یوًیورسٹی کب طبلت علن ہے ۔ :developed for Indo-Aryan languages [Garje G V, 2013]. Most Urdu of the work have been done using rule-based or hybrid approaches because the non-availability of resources. The Punjabi: । proposed system based on an incremental training process for ਉਸ ਩ੰਜਾਫੀ ਮੂਨੀਵਯਸ਷ਟੀ ਦਾ ਸਵਸਦਆਯਥੀ ਸੈ training the machine learning algorithm. Efforts have been English: He is a student of Punjabi University. made to develop parallel phrase corpus in place of parallel 227 | P a g e www.ijacsa.thesai.org (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 7, No. 4, 2016 Despite from script and writing order where Urdu is Transliteration: raam ne dee sata ko apanee kitaab. written in right to left using Arabic script and Punjabi from رام ًے دے دے اپٌی کتبة ستب کو :left to right using Gurumukhi script, every other linguistic Urdu feature is the same in both sentences. Both sentences shares Transliteration: raam ne de apanee kitaab sata ko. same grammatical order and most of the vocabulary, this is رام ًے اپٌی کتبة ستب کو دے دی :also true in care of more complex sentences. By analysis of Urdu both languages, we found that both languages share many Transliteration: raam ne apanee kitaab sata de dee. similarities and are used by a vast community of India and Pakistan. Therefore, we need a natural language processing Above example shows that same sentence can be written system which can help these people to share and understand in various ways due to free word order and all sentences give text and knowledge. The efforts have been made to develop a exactly the same meaning. Therefore, it is always difficult to machine translation system for Urdu to Punjabi text to form every possible rule to interpreter‟s source language text overcome this language barrier between both the communities. to do machine translation. With the help of this machine translation system, native D. Segmentation issues in Urdu: Urdu word segmentation Punjabi reader can understand Urdu text by translating into issue is a primary and most significant task [Lehal, G. Punjabi text. 2009]. Urdu is effected with two kinds of segmentation III. CHALLENGES TO DEVELOP URDU TO PUNJABI MT issues, space insertion and space omission [Durrani, Nadir SYSTEM et.al. 2010]. Urdu is written in Nastaliq style which makes the white space completely an optional concept. For A. Resource poor languages: Urdu and Punjabi languages example, are new in natural language processing area like any other ا لبفلےکےصذراحوذشیرڈوگراًےکہ :Non-Segmented Indo-Aryan language. Both languages are resource-poor لبفلے کے صذر احوذ شیر ڈوگرا ًے کہ :language, very small or no annotated corpus is available Segmented Text for development of a full-fledged system. Urdu reader can read this non-segmented text easily but To develop a machine translation system based on the this is still difficult for computer algorithms to understand. In statistical model, one should need a huge parallel corpus to preprocessing phase, modules like tokenization need to training the model. For rule-based approach or hybrid machine identify individual words for further processing, without translation system, one should need a good part of speech resolving the segmentation issue, no NLP system can process tagger or stemmer and large parallel dictionaries. To best of Urdu text efficiently and yield less accuracy. our knowledge, Urdu-Punjabi language pair does not have these resources in a vast amount to train or develop the E. Morphological rich languages: Urdu and Punjabi are system. Therefore, development of resources is one of the key morphological rich languages, where one word can be challenges to work on this language pair. inflected in many ways. For example, word „chair‟ ,kursiya/کرسیب kursi can take any form like/کرسی kurseye etc. One should need to/کرسئے ,kurseo/کرسیو B. Spelling variation: Due to lack of spelling standardization rules, there are many spelling variation for the same word. incorporate all the inflation in our knowledge base to [Singh, UmrinderPal et.al 2012] Both languages use tons translate them into the target language. Adding all the of loan words from English. Therefore, many variations inflation forms of all words in training data is a big come in existence, for example, word „Hospital‟ can be challenge otherwise, it will effect on the accuracy of the .hasptaal/asptaal. system ہسپتبل/ اسپتبل written in two ways in Urdu It is always a challenging task to cover all variation of a word. There is no standardization in spelling. Therefore, it F. Word without diacritical marks: Urdu has derived all depends on a writer which spelling he/she choose to various diacritical marks from Arabic to produce vowel write foreign language words. sounds, like Zabar, Zer, Pesh , Shad , hamza , Khari- Zabar, do-Zabar and do-Zer [Sani, Tajinder Singh 2011]. C. Free word order: Urdu and Punjabi are free word order In naturally written text diacritical marks are used very languages. Both languages have unrestricted word order or rarely. Due to missing of diacritical marks, an Urdu word phrase structures to form the sentences that make the can be mapped to many different target language often used without دل/machine translation task more challenging.

Urdu to Punjabi Machine Translation: an Incremental Training Approach

(And Potential) Language and Linguistic Resources on South Asian Languages

Linguistic Survey of India Bihar

English Language in Pakistan: Expansion and Resulting Implications

Map by Steve Huffman; Data from World Language Mapping System

Genealogical Classification of New Indo-Aryan Languages and Lexicostatistics

Map by Steve Huffman Data from World Language Mapping System 16

FACTORS AFFECTING PROFICIENCY AMONG GUJARATI HERITAGE LANGUAGE LEARNERS on THREE CONTINENTS a Dissertation Submitted to the Facu

THE CONSTITUTION (AMENDMENT) BILL, 2015 By

Spotlight on Tamil

Hindi-Urdu (HNU) 1

Majority Language Death

National Languages Policy Recommendation Commission 1994(2050VS)