Hindustani Or Hindi Vs. Urdu: a Computational Approach for the Exploration of Similarities Under Phonetic Aspects
Total Page:16
File Type:pdf, Size:1020Kb
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 11, 2020 Hindustani or Hindi vs. Urdu: A Computational Approach for the Exploration of Similarities Under Phonetic Aspects Muhammad Suffian Nizami Tafseer Ahmed Muhammad Yaseen Khan Department of Computer Science Center for Language Computing Center for Language Computing FAST National University Mohammad Ali Jinnah University Mohammad Ali Jinnah University Chiniot-Faisalabad, Pakistan Karachi, Pakistan Karachi, Pakistan Abstract—The semantic coexistence is the reason to adopt the language spoken by other people. In such human habitats, different languages share words typically known as loan words which appears not only as of the principal medium of enriching language vocabulary but also for creating influence upon each other for building stronger relationships and forming multilin- gualism. In this context, the spoken words are usually common but their writing scripts vary or the language may have become a digraphia. In this paper, we presented the similarities and relatedness between Hindi and Urdu (that are mutually intelligible and major languages of Indian sub-continent). In general, the method modifies edit-distance; and works in the fashion that instead of using alphabets from the words it uses articulatory features from the International Phonetic Alphabets (IPA) to get the phonetic edit distance. This paper also shows the results for the languages consonant under the method which quantifies the evidence that the Urdu and Hindi languages are 67.8% similar on average despite the script differences. Keywords—Lexical Similarity; Urdu; Hindi; Edit Distance; Pho-netics; Natural Language Processing; Computational Linguistics Figure 1: Major languages of the world presented in the format of family tree. Courtesy Minna Sundberg. I. INTRODUCTION In the Indian sub-continent hundreds of different languages are spoken throughout the area it spans, most of which belong to the Dravidian and Aryan families. It is the accepted theories describing Urdu as a creole language; and maintained fact by linguists that the Aryan family of languages evolved that it is the ‘Khariboli’ of the central zone of India with the from Sanskrit [1]. Historically, during the medieval period of exception that if its vocabulary draws words from the Persian India, Sanskrit was the language of rulers and of the people and Arabic it would become Urdu, and similarly, if it uses from the upper-class, this period also shows the witness for words from Sanskrit it would be Hindi. The Urdu and Hindi Prakrits and the other languages derived from the Sanskrit are the mutually intelligible languages; however, are the victim [2]. Followed by time, we see the rule of Persian language of language-split which resulted in the usage of modified in Indian courts; and at the near end of Mughal era, Urdu Perso-Arabic script called ‘Nastaliq’ and Devanagari script, eventually became the official language of the court [2], [3]. respectively, for writing. Amongst many other characteristics Many of the researchers argue that the languages Urdu and of these scripts the two which appear salient are: Nastaliq is Hindi are same because they share the same grammar and a a cursive scripted and supposed to be written in right-to-left large number of words in their common vocabulary; while in direction; whereas, Hindi is block-letter and follows left-to- the same context, many other researchers express their findings right direction. Figure 1 depicts the languages of the world in refutal. The debate engages the development and origin of and their respective sizes (in terms of size of leaves spread), the Urdu. A common understanding behind the development where specifically Urdu and Hindi appear on the top-left, in of Urdu language shows that it is a creole language which came the Indo-European!Indo-Iranian!Indo-Aryan branch. In the into being through the mixing of local Indian people and the same context, the two languages in the world of today can foreign invaders from the different background and ethnicities be combined under the common term ‘Hindustani’ and also [4]; and hence, often referred with the ‘camp language’. In recognized as separate Persianised and Sanskritised registers contrast, veteran Urdu lexicographer Parekh [5] rejected the of the Hindustani language. www.ijacsa.thesai.org 749 j P a g e (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 11, 2020 they use a single Hindi ’غ‘ and ’گ‘ for Urdu alphabets ;’غ‘ alphabet‘ग’ and add bindi in it (‘ग़’) to substitute it for and ’ک‘ in a very similar fashion, it uses ‘क’ and ‘क़’ for respectively. In a contrasting manner, for the two Urdu ’ق‘ Hindi corresponds with the ’ع‘ and ’ا‘ alphabets such as ,’ھ‘ and ’ح‘ three alphabets ‘अ’, इ’, and उ’; similarly, for ,’ث‘ Hindi has only one alphabet ‘ह’; and lastly, for ’ہ‘ and Hindi has got only one alphabet i.e. ‘स’; so ’ص‘ and ,’س‘ it has got a single ’ی‘ and ’ے‘ with the Urdu alphabets alphabet i.e. ‘य’. Collectively all of the alphabet marking are mapped in the figure 2. Thus, with these many-to-many mappings among the alphabets of two scripts, we can easily anticipate the production of very severe semantic mistakes. (Da.li:l] (humiliated] ’ذ ‘ For example, take the Urdu word for which the Hindi may have the chance to pronounce as ‘जलील’ [dZa"li:l] (exalted, magnificent). Similarly, for the multi-words, the Urdu language has to give an additional space hence a single word would consist of multiple tokens; for ,’ بڑھ‘ ¸ ’ان‘ n¦p@óH] (illiterate) which is@] ’ انبڑھ‘ ,example however; Hindi language has no compulsion of giving a white- ان‘ space in between tokens, so for the given Urdu example it will render ‘अनपढ़’(pronounced as per same IPA and , ’ بڑھ meant into the same thing). Thus, in order to find the similarity between the two languages, we are required to transform every cognate as per a Figure 2: Alphabets of Urdu and Hindi languages in modified similar scheme. For such scheme, Romanized transliteration is Perso-Arabic script (shown in blue colour) and Devanagari the popular way but it undergoes with the same issue i.e. many- script (show in white colour) respectively. ’ص‘ and ,’س‘ ,’ث‘ to-many alphabet mapping; for example will have only substitute in the Latin script i.e. ‘S’ et cetera [6]. The alternate approach, as used in this paper, is taking the Before we proceed further, another behaviour is noteworthy IPA into account for the transformation. In addition to it, this for which we see that since the division of India happened paper presents a modified version of conventional edit distance, in 1947, a special focus has been made on official grounds namely ‘Phonetic Edit Distance’ (PED), where the articulatory for inducting Persian and Arabic words in Urdu and Sanskrit features of the IPA are employed. This will also help us in Hindi by the right governments and print and electronic to see the relatedness of the same word, spoken/pronounced media associates of the two countries (i.e., Pakistan and India). by the people of core Urdu and Hindi backgrounds, at a Thus, where these languages share a vast vocabulary and slight/negligible distance; instead of getting a hard distance morphological structure, their speakers attempt to distinguish through standard edit distance metric (yielded on romanized them through the word borrowing or with loan words from the transliteration). Likewise, Nizami et al. [7], we considered to source languages as mentioned earlier. Hence, it is observable find the similarities w.r.t the Parts-of-Speech (PoS); such that that it may be very difficult for the youth of contemporary time it would be more interesting to find the right cognate of book to comprehend the news bulletin announced officially in the as a noun in the list of nouns of the other language, rather the pure Urdu and Hindi. However, movies and other channels of make a generalized look-up on all possible words. entertainment can be accounted for as the principal medium of vocabulary enhancement. The rest of the paper organization is as The literature review about the lexical similarity of languages and earlier techniques With more than 329.1 million native speakers all around the is in Section II, the methodology is described in Section III, world, [2] and being a victim of digraphia; the main challenges the detailed results are shared in Section IV, followed by the and point of research investigation—w.r.t the similarities be- conclusion and future works in Section V, and bibliographical tween Hindi and Urdu languages—taken into the consideration references in the end. for this paper are discussed in the subsequent paragraphs. The difference in writing script leads the matter not only II. LITERATURE REVIEW towards the inabilities of reading but also to the pronunciation. In this section, the literature review, the existing tech- We see that the pronunciation of certain Perso-Arabic alphabets niques, and approaches for lexical similarity concerning script are improper w.r.t the core Hindi speakers such that they and sound are described. as ;’ظ‘ and ,’ض‘ ,’ز‘ ,’ذ‘ ,’ج‘ are not able to differentiate they tend to pronounce ‘ज’ for all of them. It is seen very The two most frequent method for resolving the problem of often to associate a diacritic symbol, namely bindi, in the this kind are string matching algorithms and employment of the Soundex algorithm. The edit distance algorithm has different ’ظ‘ and ’ض‘ transformation of ‘ज’!‘ज़’ for differentiating from the rest of aforementioned Urdu alphabets. Similarly, variants for different types of tasks like string alignment and www.ijacsa.thesai.org 750 j P a g e (IJACSA) International Journal of Advanced Computer Science and Applications, Vol.