International Journal of Application or Innovation in Engineering & Management (IJAIEM) Web Site: www.ijaiem.org Email: [email protected] Volume 4, Issue 2, February 2015 ISSN 2319 - 4847

Fundamental Tools of Modern Multilingual Speech Processing Technology in Urdu Language Speech Processing

Mohammed Waseem Ashfaque1, Quadri N.Naveed2, Sayyada Sara Banu3, Quadri S. S Ali Ahmed4

1. Department of Computer Science & IT, College of Management and Computer Technology, Aurangabad, India

2. Department of Computer Science & IT, King Khalid University, Abha, Saudi Arabia

3. College of Computer Science and Information System, Jazan University, Saudi Arabia

4. Department of Computer Science & IT, College of Management and Computer Technology, Aurangabad, India

ABSTRACT

This paper describes the fundamental tools of modern multilingual speech processing technology, taking the Urdu language as a case study, and presents a comprehensive review of Urdu speech processing technology. Standard Urdu is a standardized register of the Indian language. Urdu is historically associated with the Muslims of the region of India. It is the national language and lingua franca of Pakistan, an official language of six Indian states, and one of the 22 scheduled languages in the Constitution of India. 1920, is a history of the Urdu language from its origins to the development of an Urdu literature. Urdu and Hindi share an Indo-Aryan base, but Urdu is associated with the Nastaliq script style of Persian calligraphy and reads right-to-left, whereas Hindi resembles Sanskrit and reads left-to-right. The earliest linguistic influences in the development of Urdu probably began with the Muslim conquest of Sindh in 711. Written Urdu has no explicit word boundaries, similar to several Asian languages, such as Hindi and Persian. It does have explicit marks for tones, as in the languages of neighboring countries like Pakistan, Bangladesh and Afghanistan. Given these unique features of Urdu, research and development of language and speech processing specifically for the Urdu language is necessary and quite challenging.

Keywords: Multilingual speech processing, speech synthesis, speech recognition, speech-to-speech translation, language identification.

1. INTRODUCTION

1.1 Urdu Language at a Glance

Urdu formed from Khariboli, a Prakrit, or vernacular, spoken in North India, by adding Persian and Arabic words to it. Contrary to a widely held misconception, it was not formed in the camp of the Mughal armies.[1][2][3][4][5][6] The word Urdu is, however, derived from the same Turkic word ordu (army) that has given English horde.[7] Turkish borrowings in Urdu are nevertheless minimal.[8] The words that Urdu has borrowed from Turkish and Arabic have been borrowed through Farsi and hence are a Persianized version of the original word; for instance, the Arabic teh marbuta (ة) changes to heh (ه) or teh (ت).[9] The Mughal Empire's official language was Persian.[10] With the advent of the British Raj, the Persian language was replaced by Hindustani written in the Persian script, and this script was used by both Hindus and Muslims. The name Urdu was first used by the poet Ghulam Hamadani Mushafi around 1780.[11][12] From the 13th century until the end of the 18th century, Urdu was commonly known as Hindi. The language was also known by various other names such as Hindavi and Dehlavi. The communal nature of the language lasted until it replaced Persian as the official language in 1837 and was made co-official, along with English. Urdu was promoted in British India by British policies to counter the previous emphasis on Persian.[13]

Fig:1 Phonetic Alphabets of Urdu With Other Languages


This triggered a Hindu backlash in northwestern India, which argued that the language should be written in the native Devanagari script. Thus a new literary register, called "Hindi", replaced traditional Hindustani as the official language of Bihar in 1881, establishing a sectarian divide of "Urdu" for Muslims and "Hindi" for Hindus, a divide that was formalized with the division of India and Pakistan after independence (though there are Hindu poets who continue to write in Urdu to this day, with post-independence examples including Gopi Chand Narang and Gulzar). At independence, Pakistan established a highly Persianized literary form of Urdu as its national language. There have been attempts to "purify" Urdu and Hindi, by purging Urdu of Sanskrit loan words, and Hindi of Persian loan words, and new vocabulary draws primarily from Persian and Arabic for Urdu and from Sanskrit for Hindi. This has primarily affected academic and literary vocabulary, and both national standards remain heavily influenced by both Persian and Sanskrit.[14] English has exerted a heavy influence on both as a co-official language.[15] Urdu is the national language and one of the two official languages of Pakistan, along with English, and is spoken and understood throughout the country, whereas the state-by-state languages (languages spoken throughout various regions) are the provincial languages. Only 7.57% of Pakistanis have Urdu as their mother language,

[16] but Urdu is understood all over Pakistan.

1.2 Official Status

It is used in education, literature, office and court business.[17] It holds in itself a repository of the cultural and social heritage of the country.[18] Although English is used in most elite circles, and Punjabi has a plurality of native speakers, Urdu is the lingua franca and national language of Pakistan. In practice English is used instead of Urdu in the higher echelons of government.[19] Urdu is also one of the officially recognized languages in India and has official language status in the Indian states of Uttar Pradesh, Bihar,[20] Telangana, Jammu and Kashmir and the national capital, New Delhi. In Jammu and Kashmir, section 145 of the Kashmir Constitution provides: "The official language of the State shall be Urdu but the English language shall unless the Legislature by law otherwise provides, continue to be used for all the official purposes of the State for which it was being used immediately before the commencement of the Constitution."[21] The importance of Urdu in the Muslim world is visible in the Islamic holy cities of Mecca and Medina in Saudi Arabia, where most informational signage is written in Arabic, English and Urdu, and sometimes in other languages.[22]

1.3 Multilingual speech processing

Multilingual speech processing (MLSP) is a distinct field of research in speech and language technology that combines many of the techniques developed for monolingual systems with new approaches that address specific challenges of the multilingual domain. Research in MLSP should be of particular interest to countries such as India and Switzerland, where there are several officially recognized languages and many additional languages are commonly spoken.
In such multilingual environments, the language barrier can pose significant difficulties in both commerce and government administration, and technological advances that could help break down this barrier would be of great cultural and economic value [23][24][25][26]. In this paper, we discuss current trends in MLSP and how these relate to advances in the general domain of speech and language technology. Examples of current advances and trends in the field of MLSP are provided in the form of case studies of research being conducted at Idiap Research Institute in the framework of international and national research programmes. In language, there are normally two forms of communication, namely, the spoken form and the written form. Depending upon the granularity of representation, both these forms can have different or common representations in terms of basic units. For instance, in the spoken form, phonemes/syllables can be considered as the basic unit. Similarly, in the case of (most) written forms, graphemes/characters are the smallest basic units. However, above the smaller units, a word can be seen as a common unit for both spoken and written forms. The word in turn can be described in terms of the smaller basic units corresponding to the spoken form or the written form. For a given language, there can be a

consistent relationship between spoken form and written form. However, the relationship may not be the same across languages. Addressing the issue of a common basic representation across languages is central to MLSP, although this can raise different challenges in applications such as ASR and TTS.[27][28]

1.4 Automatic speech recognition

In the field of ASR, advances have been largely promoted by the US government via the National Institute of Standards and Technology (NIST). Besides running evaluations, one of the main contributions of this has been to provide databases. This has led naturally to an English language focus for ASR research and development. That is not to say that ASR research is English centric; rather, many of the algorithmic advances have been made initially in that language. Such advances have, however, translated easily to different languages, largely owing to the robustness of statistical approaches to the different specificities of languages. State-of-the-art ASR systems are stochastic in nature and are typically based on hidden Markov models (HMMs). From a spoken language perspective, the different components of a typical HMM-based ASR system for the English language are as follows [29][30][31][32].

(i) Feature extractor that extracts the relevant information from the speech signal, yielding a sequence of feature observations/vectors. Feature extraction is typically considered a language independent process (i.e., common feature extraction algorithms are used in most systems regardless of language), although in some cases (such as tonal languages) specific processing should be explored.
(ii) Acoustic model that models the relation between the feature vector and units of spoken form (sound units such as phones).
(iii) Lexicon that integrates lexical constraints on top of the spoken unit level representation, yielding a unit representation that is typically common to both spoken form and written form, such as a word or morpheme.
(iv) Language model that models syntactical/grammatical constraints of the spoken language using the unit representation resulting after integrating lexical constraints.[33][34]

1.5 Text-to-speech synthesis

In the domain of TTS, issues in multilingual speech processing have been dominated by two main areas. This is partially due to the limitations of the dominant unit selection paradigm, which directly uses the recordings from voice talents in the generation of synthesized speech. Multilingual techniques in unit selection are thus bound by the limited availability of multilingual voice talents, and more so by the availability of such recordings to the research community, although some research has been conducted to overcome this. The first area concerns the construction of TTS voices in multiple languages, which is dealt with in detail by Sproat (1997). This area of research is largely concerned with the development of methodologies that can be made portable across languages, and includes issues in text parsing, intonation and waveform generation. Many of these general issues are mirrored in challenges faced by the multilingual ASR community, as has been discussed in the opening of this section, although many of the related challenges are arguably more complex in TTS due to the extensive use of supra-segmental features and the greater degree of language dependence of such features; there is no 'international prosodic alphabet'.[35][36][37]

1.6 Automatic language recognition

The objective of automatic language recognition is to recognize the spoken language by automatic analysis of speech data. Automatic language recognition can be classified into two tasks: (i) automatic language identification and (ii) automatic language detection.
In principle, this classification is similar to speaker identification and speaker verification in speaker recognition research. The goal of automatic language identification (LID) is to classify a given input speech utterance as belonging to one out of L languages. Various possible applications of LID can be found in multilingual speech processing, call routing, interactive voice response applications, and front-end processing for speech translation. There are a variety of cues, including phonological, morphological, syntactical or prosodic cues, that can be exploited by an LID system. In the literature, different approaches have been proposed to perform LID, such as using only low-level spectral information, using phoneme recognizers in conjunction with phonotactic constraints, or using medium- to high-level information (e.g. lexical constraints, language models) through speech recognition [38][39].

1.7 Recent trends in speech and language processing

We can identify a number of advances in speech and language processing that have significantly impacted MLSP. One of the most important developments has been the rise of statistical machine translation, which has resulted in substantial funding and, consequently, research activity being directed towards machine translation and its related multilingual applications (speech-to-speech, speech-to-text translation, etc.). Several notable projects have been pursued in recent years, to mention only a few: Spoken Language Communication and Translation System for Tactical Use (TRANSTAC, a DARPA initiative), Technology and Corpora for Speech to Speech Translation (TC-STAR, an FP6 European project), and the Global Autonomous Language Exploitation (GALE, a DARPA initiative) [40][41][42][43][44]. Research in these projects needs to focus not only on the optimization of individual components, but also on how the components can be integrated together effectively to provide overall improved output. This is not a trivial task.
For instance, speech recognition systems are typically optimized to reduce word error rate (WER) (or character/letter error rate, CER/LER, in some languages). The WER measure gives equal importance to the different types of error that can be committed by

the ASR system, namely, deletion, insertion and substitution. If the ASR system output (i.e., the text transcript) is processed by a machine translation system, then a deletion error is probably more expensive than the other two errors, as all information is lost. It then follows that the optimal performance with respect to WER may not provide the best possible translated output and vice versa. Indeed, in the GALE project (in the context of translation of Mandarin language speech data to English), it was observed that CER is less correlated with translation error rate when compared to objective functions such as SParseval (a parsing-based word string scoring function)[45][46][47][48].
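
The three error types behind WER can be made concrete with a small sketch. The Python code below is illustrative only and is not taken from any of the cited systems: it counts substitutions, deletions and insertions along a minimum edit-distance alignment of whitespace-separated words, and computes WER as (S + D + I) / N, where N is the number of reference words. The function names and the `w_sub`, `w_del`, `w_ins` weights are assumptions introduced here; the weighted variant simply re-weights the counts of the unweighted alignment to show how a deletion could be penalized more heavily when the transcript feeds a translation system.

```python
def edit_counts(reference: str, hypothesis: str) -> tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) of a minimum-cost
    word-level alignment between a reference and a hypothesis transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # Each cell holds (total cost, substitutions, deletions, insertions).
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]  # empty reference: insertions only
    for i in range(1, len(ref) + 1):
        curr = [(i, 0, i, 0)]  # empty hypothesis: deletions only
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                best = prev[j - 1]  # words match, no extra cost
            else:
                c, s, d, n = prev[j - 1]
                substitution = (c + 1, s + 1, d, n)
                c, s, d, n = prev[j]
                deletion = (c + 1, s, d + 1, n)
                c, s, d, n = curr[j - 1]
                insertion = (c + 1, s, d, n + 1)
                best = min(substitution, deletion, insertion)  # lowest total cost first
            curr.append(best)
        prev = curr
    _, subs, dels, ins = prev[-1]
    return subs, dels, ins


def weighted_error_rate(reference: str, hypothesis: str,
                        w_sub: float = 1.0, w_del: float = 1.0,
                        w_ins: float = 1.0) -> float:
    """Hypothetical weighted error rate; the weights are illustrative only."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference transcript must contain at least one word")
    subs, dels, ins = edit_counts(reference, hypothesis)
    return (w_sub * subs + w_del * dels + w_ins * ins) / n_ref


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: all three error types carry equal weight."""
    return weighted_error_rate(reference, hypothesis)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat on a mat"
    print(edit_counts(ref, hyp))
    print(word_error_rate(ref, hyp))
    print(weighted_error_rate(ref, hyp, w_del=2.0))
```

With equal weights the function reduces to ordinary WER; raising `w_del` is one simple way to express the observation that deletions may cost more in a translation pipeline, although a real system would tune such an objective against translation quality.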

2. LITERATURE REVIEW

Dialects:-

Urdu has a few recognised dialects, including Dakhni, Rekhta, and Modern Vernacular Urdu (based on the Khariboli dialect of the Delhi region). Dakhni (also known as Dakani, Deccani, Desia, Mirgan) is spoken in the Deccan region of southern India. It is distinguished by its mixture of vocabulary from Marathi and Konkani, as well as some vocabulary from Arabic, Persian and Turkish that is not found in the standard dialect of Urdu. Dakhni is widely spoken in all parts of Maharashtra, Telangana, Andhra Pradesh and Karnataka. Urdu is read and written as in other parts of India. A number of daily newspapers and several monthly magazines in Urdu are published in these states. In terms of pronunciation, the easiest way to recognize a native speaker is their pronunciation of the letter "qāf" (ق) as "kẖe" (خ). The Pakistani variant of the language is becoming increasingly divergent from the Indian dialects and forms of Urdu, as it has absorbed many loan words, proverbs and phonetics from Pakistan's indigenous languages such as Pashto, Punjabi, Balochi and Sindhi. Furthermore, due to the region's history, the Urdu dialect of Pakistan draws heavily from the Persian and Arabic languages, and the intonation and pronunciation are more formal compared with corresponding Indian dialects. In addition, Rekhta (or Rekhti), the language of Urdu poetry, is sometimes counted as a separate dialect, one famously used by several poets of high acclaim in the bulk of their work. These included Mirza Ghalib, Mir Taqi Mir and Muhammad Iqbal. Urdu spoken in the Indian state of Odisha is different from Urdu spoken in other areas; it is a mixture of Oriya and Bihari.

Comparison with Hindi language:-

Standard Urdu is often contrasted with Standard Hindi.
Apart from religious associations, the differences are largely restricted to the standard forms: Standard Urdu is conventionally written in the Nastaliq style of the Persian alphabet and relies heavily on Persian and Arabic as a source for technical and literary vocabulary,[49] whereas Standard Hindi is conventionally written in Devanāgarī and draws on Sanskrit.[50] However, both have large numbers of Arabic, Persian and Sanskrit words, and most linguists consider them to be two standardized forms of the same language,[51][52] and consider the differences to be sociolinguistic,[53] though a few classify them separately.[54] Old Urdu dictionaries also contain most of the Sanskrit words now present in Hindi.[55][56] Mutual intelligibility decreases in literary and specialized contexts that rely on educated vocabulary. Further, it is quite easy in a longer conversation to distinguish differences in vocabulary and pronunciation of some Urdu phonemes. Due to religious nationalism since the partition of British India and continued communal tensions, native speakers of both Hindi and Urdu frequently assert them to be distinct languages, despite the numerous similarities between the two in a colloquial setting. The barrier created between Hindi and Urdu is eroding: Hindi speakers are comfortable with using Persian-Arabic borrowed words[57] and Urdu speakers are also comfortable with using Sanskrit terminology.[58][59]

2.1 Vocabulary

The language's Indo-Aryan base has been enriched by borrowing from Persian and Arabic. There are also a smaller number of borrowings from Chagatai, Portuguese, and more recently English. Many of the words of Arabic origin have been adopted through Persian and have different pronunciations and nuances of meaning and usage than they do in Arabic.

Levels of formality:-

Urdu in its less formalised register has been referred to as a rek̤h̤tah (ریختہ, [reːxt̪aː]), meaning "rough mixture". The more formal register of Urdu is sometimes referred to as zabān-i Urdū-yi muʿallá (زبانِ اُردُوئے معلّٰى, [zəbaːn eː ʊrd̪u eː moəllaː]), the "Language of the Exalted Camp", referring to the Imperial army.[60] The etymology of the words used in the Urdu language for the most part decides how polite or refined one's speech is. For example, Urdu speakers would distinguish between پانی pānī and آب āb, both meaning "water"; the former is used colloquially and has older Indic origins, whereas the latter is used formally and poetically, being of Persian origin. If a word is of Persian or Arabic origin, the level of speech is considered to be more formal and grand. Similarly, if Persian or Arabic grammar constructs, such as the izafat, are used in Urdu, the level of speech is also considered more formal and grand. If a word is inherited from Sanskrit, the level of speech is considered more colloquial and personal.[61] This distinction is similar to the division in English between words of Latin, French and Old English origins.

2.2 Politeness and Smoothness

Urdu syntax and vocabulary reflect a three-tiered system of politeness called ādāb. Due to its emphasis on politeness and propriety, Urdu has always been considered an elevated, somewhat aristocratic, language in South Asia. It continues to conjure a subtle, polished affect in South Asian linguistic and literary sensibilities and thus continues to be

preferred for song-writing and poetry, even by non-native speakers. Any verb can be conjugated as per three or four different tiers of politeness. For example, the verb to speak in Urdu is bolnā (بولنا) and the verb to sit is baiṭhnā (بیٹھنا). The imperatives "speak!" and "sit!" can thus be conjugated five different ways, each marking subtle variation in politeness and propriety. These permutations exclude a host of auxiliary verbs and expressions that can be added to these verbs to add an even greater degree of subtle variation. For extremely polite, formal or ceremonial situations, nearly all commonly used verbs have equivalent Persian/Arabic synonyms.

Table:1 Comparison with other languages

Register | "Speak!" (Urdu) | Transliteration | Hindi | "Sit!" (Urdu) | Transliteration | Hindi
Disparaging / Extremely Casual | تُو بول! | [TŪ] BOL! | [तू] बोल! | تُو بیٹھ! | [TŪ] BAIṬH! | [तू] बैठ!
Casual and Intimate | تُم بولو۔ | [TUM] BOLO. | [तुम] बोलो | تُم بیٹھو۔ | [TUM] BAIṬHO. | [तुम] बैठो
Polite and Intimate | آپ بولو۔ | [ĀP] BOLO. | [आप] बोलो | آپ بیٹھو۔ | [ĀP] BAIṬHO. | [आप] बैठो
Formal yet Intimate | آپ بولیں۔ | [ĀP] BOLEṈ. | [आप] बोलें | آپ بیٹھیں۔ | [ĀP] BAIṬHEṈ. | [आप] बैठें
Polite and Formal | آپ بولئے۔ | [ĀP] BOLI'E. | [आप] बोलिए | آپ بیٹھئے۔ | [ĀP] BAIṬHI'E. | [आप] बैठिए
Ceremonial / Extremely Formal (Persian) | آپ فرمائیے۔ | [ĀP] FARMĀ'IYE. | [आप] फ़रमाइये | آپ تشریف رکھئے۔ | [ĀP] TASHRĪF RAKHI'E. | [आप] तशरीफ़ रखिए

Urdu script:-

Urdu is written right-to-left in an extension of the Persian alphabet, which is itself an extension of the Arabic alphabet. Urdu is associated with the Nastalīq style of Persian calligraphy, whereas Arabic is generally written in the Naskh or Ruq'ah styles. Nastalīq is notoriously difficult to typeset, so Urdu newspapers were handwritten by masters of calligraphy, known as katib or khush-navees, until the late 1980s. One handwritten Urdu newspaper, The Musalman, is still published daily in Chennai.[62]

2.3 Kaithi script

Urdu was also written in the Kaithi script. A highly Persianized and technical form of Urdu was the lingua franca of the law courts of the British administration in Bengal, Bihar, and the North-West Provinces & Oudh. Until the late 19th century, all proceedings and court transactions in this register of Urdu were written officially in the Persian script. In 1880, Sir Ashley Eden, the Lieutenant-Governor of Bengal, abolished the use of the Persian alphabet in the law courts of Bengal and Bihar and ordered the exclusive use of Kaithi, a popular script used for both Urdu and Hindi.[63]

2.4 Devanagari script

More recently in India, Urdu speakers have adopted Devanagari for publishing Urdu periodicals and have innovated new strategies to mark Urdu in Devanagari as distinct from Hindi in Devanagari. Such publishers have introduced new orthographic features into Devanagari for the purpose of representing the Perso-Arabic etymology of Urdu words. One example is the use of अ (Devanagari a) with vowel signs to mimic contexts of ع ('ain), in violation of Hindi

orthographic rules. For Urdu publishers, the use of Devanagari gives them a greater audience, whereas the orthographic changes help them preserve a distinct identity of Urdu.[64]

2.5 Roman script

Urdu is occasionally written in the Roman script. Roman Urdu has been used since the days of the British Raj, partly as a result of the availability and low cost of Roman movable type for printing presses. The use of Roman Urdu was common in contexts such as product labels. Today it is regaining popularity among users of text-messaging and Internet services and is developing its own style and conventions. Habib R. Sulemani says,

2.6 Encoding Urdu in Unicode & Writing/Sound systems

Like other writing systems derived from the Arabic alphabet, Urdu uses the 0600–06FF Unicode range.[63] Certain glyphs in this range appear visually similar (or identical when presented using particular fonts) even though the underlying encoding is different. This presents problems for information storage and retrieval. For example, the University of Chicago's electronic copy of John Shakespear's "A Dictionary, Hindustani, and English"[64] includes the word 'بهارت' (India). Searching for the string "بھارت" returns no results, whereas querying with the (identical-looking in many fonts) string "بهارت" returns the correct entry.[65] This is because the Urdu letter do chashmi he (U+06BE), used to form aspirate digraphs in Urdu, is visually identical in its medial form to the Arabic letter hā (U+0647; phonetic value /h/). In Urdu, the /h/ phoneme is represented by the character U+06C1, called gol he (round he), or chhoti he (small he).

Table:2 Encoding Urdu in Unicode
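
The look-alike problem described in Section 2.6 can be illustrated with a short Python sketch. It is an assumption-laden example, not the method used by any particular dictionary site: it lists the code points of two spellings that render alike, and builds a hypothetical `search_key` that folds the heh variants U+06BE and U+06C1 onto U+0647 so that both spellings match during lookup. The folding table `HEH_VARIANTS` and both function names are introduced here for illustration; folding discards the aspirate versus /h/ distinction, so it is suitable only for a search index and never for stored text.

```python
import unicodedata

# Hypothetical folding table: heh look-alikes mapped to ARABIC LETTER HEH.
HEH_VARIANTS = {
    "\u06BE": "\u0647",  # do chashmi he (aspirate marker in Urdu)
    "\u06C1": "\u0647",  # gol he / chhoti he (Urdu /h/)
}


def describe(text: str) -> list[str]:
    """List each character as 'U+XXXX NAME' so hidden differences become visible."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]


def search_key(text: str) -> str:
    """Fold heh variants to a single code point, for search matching only."""
    return "".join(HEH_VARIANTS.get(ch, ch) for ch in text)


# Two spellings that look identical in many fonts but differ in encoding.
urdu_spelling = "\u0628\u06BE\u0627\u0631\u062A"        # uses U+06BE
arabic_heh_spelling = "\u0628\u0647\u0627\u0631\u062A"  # uses U+0647

if __name__ == "__main__":
    for line in describe(urdu_spelling):
        print(line)
    print(urdu_spelling == arabic_heh_spelling)
    print(search_key(urdu_spelling) == search_key(arabic_heh_spelling))
```

A plain string comparison treats the two spellings as different, which is why the search described above fails, whereas comparing the folded keys treats them as the same entry.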

2.7 Recent and ongoing MLSP research at Idiap Research Institute

At Idiap, multilingual speech processing is an important research objective given Switzerland's location within Europe, at the intersection of three major linguistic cultures, and Swiss society itself being inherently multilingual. Towards this end, we have been conducting research in MLSP as part of several internationally and nationally funded projects, including for instance:

(i) EMIME (Effective Multilingual Interaction in Mobile Environments): This FP7 EU project commenced in March 2008 and is investigating the personalization of speech-to-speech translation systems. The EMIME project aims to achieve its goal through the use of HMM-based ASR and TTS; more specifically, the main research goal is the development of techniques that enable unsupervised cross-lingual speaker adaptation for TTS.
(ii) GALE (Global Autonomous Language Exploitation): Idiap was involved in this DARPA-funded project as part of the SRI-led team. The project primarily involved machine translation and information distillation. We have mostly studied the development of new discriminative MLP-based features and MLP combination methods for the ASR components. Despite cessation of the project, we have continued our research on this topic.
(iii) MULTI (MULTImodal Interaction and MULTImedia Data Mining): a Swiss National Science Foundation (SNSF) project carrying out fundamental research in several related fields of multimodal interaction and multimedia data mining. A recently initiated MULTI sub-project is conducting research in MLP-based methods for language identification and multilingual speech recognition with a focus on Swiss languages.
(iv) COMTIS (Improving the coherence of machine translation output by modelling intersentential relations): a new machine translation project funded by the SNSF that was started in March 2010. The project is concerned with modelling the coherency between sentences in machine translation output, thereby improving overall translation performance.

In the remainder of the paper, we present three case studies conducted in MLSP at Idiap that have resulted from participation in the above-mentioned research projects.


3. METHODOLOGY

3.1 Fundamental analyses and tools

Certain fundamental tools are critical to research and development of advanced speech processing technology. This section summarizes existing speech, text, and pronunciation modeling tools developed mainly for Urdu. One of the key issues is the analysis of the behavior of tones in Urdu, which is often modulated by phrase and sentence level intonation. Other attributes of speech which have been explored for Urdu include stress and pause. Work conducted on speech segmentation and basic phoneme modeling is also described. For text processing, word and sentence segmentation are reviewed. Fundamental tools for pronunciation modeling are described at the end of this section. The speech analysis tools are as follows:

• Tone analysis
• Intonation analysis
• Stress analysis
• Pause analysis

3.2 Speech segmentation

Syllable-unit based speech recognition has been pursued by some researchers since this approach makes it easier to incorporate prosodic features, such as tones, into the syllable units. Moreover, some papers have shown a high accuracy for tone recognition when applying this method to isolated syllables and hand-segmented syllables extracted from continuous speech. Automatic segmentation of continuous speech into syllable units is therefore an important issue. One proposed approach uses several prosodic features, including short-time energy, zero-crossing rate, and pitch, together with some heuristic rules for syllable segmentation. Another suggests the use of a dual-band energy contour for phoneme segmentation. This method decomposes input speech into a low-frequency and a high-frequency component using wavelet transformation, computes the time-domain normalized energy of both components, and introduces some heuristic rules for selecting endpoints of syllables and phonemes based on energy contours.
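
Two of the prosodic features named above, short-time energy and zero-crossing rate, can be sketched in a few lines of Python. The example is a minimal illustration under stated assumptions, not a reconstruction of the published methods: it uses a single fixed energy threshold in place of the heuristic rules, and it omits pitch and the wavelet-based dual-band decomposition entirely. The function names, the frame length of 160 samples, the hop of 80 samples and the threshold value are all hypothetical choices made for this sketch.

```python
import math


def frame_signal(samples, frame_len=160, hop=80):
    """Split a sample sequence into overlapping fixed-length frames."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def segment_by_energy(samples, frame_len=160, hop=80, threshold=0.01):
    """Return (start_frame, end_frame) pairs for runs of frames whose
    short-time energy exceeds the threshold; end_frame is exclusive."""
    frames = frame_signal(samples, frame_len, hop)
    segments, start = [], None
    for index, frame in enumerate(frames):
        active = short_time_energy(frame) > threshold
        if active and start is None:
            start = index                      # a high-energy run begins
        elif not active and start is not None:
            segments.append((start, index))    # the run ended at the previous frame
            start = None
    if start is not None:
        segments.append((start, len(frames)))  # run continues to the end of the signal
    return segments


if __name__ == "__main__":
    # Synthetic test signal: silence, a 200 Hz tone sampled at 8 kHz, silence.
    tone = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(800)]
    signal = [0.0] * 800 + tone + [0.0] * 800
    print(segment_by_energy(signal))
    print(zero_crossing_rate(tone[:160]))
```

In practice the threshold would be adapted to the recording conditions, and the zero-crossing rate would be combined with energy to separate unvoiced consonants from silence; here it is only computed, not used in the decision.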
Although there is no comparative experiment using other typical techniques, dividing the speech signal into detailed frequency bands before applying energy-based segmentation rules seems to be an effective approach.

3.3 Phoneme modeling

During the early 2000s, a series of research papers on Urdu vowel modeling was published by Chulalongkorn University. Basic features such as formant frequencies were extracted and grouped according to vowels. This research provided the first analysis of Urdu phonemes using a large real-speech corpus, not just a small set of sample utterances. There is still a lack of sufficient research on prosody analysis for Urdu; data collection is not a trivial task, and thus research on prosody modeling has been substantially delayed. Since 2000, research on basic linguistic features has decreased, and the engineering of high-level speech applications such as speech recognition and synthesis has taken priority.

3.4 Text analyses and tools:-

3.4.1 Morphological analysis

Morphological analysis is a process for segmenting words into sub-word units called morphemes. The morpheme is the shortest textual chunk that still has meaning. Morphological analysis not only helps in finding the pronunciation of a word, but also serves as a necessary process for reducing the number of lexical words in automatic speech recognition (ASR). For Urdu, a morpheme may be as short as a syllable. Since Urdu contains a large number of monosyllabic words and many polysyllabic words are concatenations of monosyllabic words, morphological analysis in Urdu consists mainly of a joint process of word segmentation and syllabification.

3.4.2 Pronunciation modeling tools

Pronunciation modeling is an important component for various language processing technologies such as speech recognition, text-to-speech synthesis, and spoken document retrieval. The task is to find the pronunciation of a given text by using a pronunciation dictionary and/or a phonological analyzer.
This subsection describes the details of these two techniques for Urdu.

3.4.3 Pronunciation dictionary

At present, many Urdu dictionaries are available both offline and online, in a monolingual format, a bilingual format such as Urdu to English, or a multilingual format, i.e., Urdu to multiple languages. Dictionaries have been constructed for varying purposes, generally for translation and writing tasks. Only a few dictionaries are designed for use in natural language processing research. Some sample entries from an Urdu dictionary are listed in Table 3.

3.4.4 Urdu speech synthesis

A general structure of TTS is shown in Fig. 2. Based on this structure, research on TTS can be classified into four major groups. The first group focuses on the general problem of developing Urdu TTS engines, while the other three groups focus on the subcomponents of these systems: text processing, linguistic/prosodic processing, and waveform synthesis.


Table 3: Some words in an Urdu dictionary

| English | Urdu | Transliteration | Notes |
|---|---|---|---|
| Hello (peace be upon you). | السلامُ علیکم۔ | assalām-u-alaikum. | lit. "Peace be upon you." (from Arabic). Often shortened to 'salām'. |
| (Reply to salam) Peace be upon you, too. | وَعلیکُم السلام۔ | wa-alaikumussalām. | lit. "And upon you, peace." Response to assalāmu alaikum. |
| Hello. | آداب (عَرض ہے)۔ | ādāb (arz hai). | lit. "Regards (are expressed)", a very formal secular greeting. |
| Goodbye. | خُدا حافِظ، اللّٰہ حافِظ۔ | khuda hāfiz, allah hāfiz. | lit. "May God be your guardian". "Khuda" from Persian for "God", "Allah" from Arabic for "God". |
| Yes. | ہاں۔ | hāⁿ. | casual |
| Yes. | جی۔ | jī. | formal |
| Yes. | جی ہاں۔ | jī hāⁿ. | confident formal |
| No. | نَہ۔ | nā. | rare |
| Thank you. | شُکرِیَہ۔ | shukriyā. | from Arabic shukran |
| Please come in. | تَشریف لائیے | tashrīf la'iyē. | lit. "(Please) bring your honour". |

3.4.5 Development of Urdu TTS

The development of an Urdu TTS engine was published in 1983, in which a speech-unit concatenation algorithm was applied to Urdu; it was followed by two publications on similar techniques. This paper reviews the progress of Urdu speech technology in five areas of research: fundamental analyses and tools, text-to-speech synthesis (TTS), automatic speech recognition (ASR), speech applications, and language resources. The latest speech recognition systems merge interdisciplinary technologies from signal processing, pattern recognition, natural language, and linguistics into a unified statistical framework. These systems, which have applications in a broad range of signal processing problems, represent a revolution in digital signal processing. Such tools are now capable of recognizing continuous speech input for vocabularies of several thousand words in operational environments. Multilingual speech processing has been a topic of ongoing interest to the research community and continues to present emerging challenges for researchers.

Fig:2 Architecture of Urdu Text to Speech (TTS)

The synthesis approach recognized as giving the most human-like speech quality is corpus-based unit selection. For Urdu, such an approach was implemented in the latest version of the Vaja engine developed at NECTEC. Although the newest Vaja engine produces much higher sound quality than the former unit-concatenation based engine, the synthetic speech sometimes sounds unnatural, especially when synthesizing non-Urdu words written with Urdu characters. Both text processing and speech synthesis must improve in quality before Urdu engines can reach the level of the state-of-the-art engines that have been successfully developed for other languages, such as English and Japanese.


3.4.6 Text processing

Text processing in TTS works by finding the most probable sequence of phonemes given an input text. For Urdu, several crucial subtasks are involved. Urdu text has no explicit sentence or word boundary marker. A space is often placed between sentences, but not all spaces are sentence boundary markers. Therefore, text is first tokenized into text chunks using specific characters such as punctuation marks, digits, and spaces. The next subtask is to determine which space characters function as sentence boundary markers. Algorithms for sentence break detection have been described in the literature. In these algorithms, text chunks are segmented into word sequences, and a part-of-speech (POS) tag is assigned to each word. A space is judged to be a sentence break based on the POS of the surrounding words. Word segmentation is therefore considered one of the most crucial subtasks, since it is commonly the first necessary step in such processing systems. Algorithms for Urdu word segmentation have also been described. The next subtask is text normalization, in which non-verbal words are replaced by their orthographic forms. Non-verbal words include numerals, symbols (e.g., ‘‘$’’ and ‘‘=’’), and abbreviations. The final step in text processing is grapheme-to-phoneme conversion. This issue was discussed earlier in the context of pronunciation modeling.

3.4.7 Linguistic/prosodic processing

In TTS, important prosodic features include phrasal pauses, F0, duration, and energy. These basic features form higher-level suprasegmental features such as stress and intonation. In Urdu, lexical tone is an additional feature that needs to be generated. The lexical tone itself is often modified by stress and phrasal intonation. A number of research sites are investigating Urdu speech prosody. Some have focused exclusively on prosody analysis, whereas others have attempted to model prosody and have used the models for Urdu TTS. This section focuses only on the latter.
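As an illustration of the tokenization and text-normalization subtasks of Section 3.4.6, a minimal Python sketch is given here. It is not taken from any of the reviewed systems: the word tables use English glosses as hypothetical stand-ins for Urdu spoken forms, numerals are simply read digit by digit, and the token pattern is an assumption that does not handle Urdu diacritics.

```python
import re

# Hypothetical spoken forms; English glosses stand in for Urdu words.
SYMBOL_WORDS = {"$": "dollar", "=": "equals", "%": "percent"}
ABBREVIATIONS = {"km": "kilometre", "dr": "doctor"}
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

# A token is a run of digits, a run of letters, or one other
# non-space character (punctuation or symbol).
TOKEN_RE = re.compile(r"\d+|[^\W\d_]+|\S")


def tokenize(text):
    """Split text into chunks at spaces, digits, and punctuation."""
    return TOKEN_RE.findall(text)


def normalize(text):
    """Replace numerals, symbols, and abbreviations by spoken words."""
    words = []
    for token in tokenize(text):
        if token.isdecimal():
            # Simplification: read the number digit by digit.
            words.extend(DIGIT_WORDS[int(digit)] for digit in token)
        elif token.lower() in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token.lower()])
        elif token in SYMBOL_WORDS:
            words.append(SYMBOL_WORDS[token])
        else:
            words.append(token)
    return words
```

In a full system the output of `normalize` would be passed on to word segmentation and grapheme-to-phoneme conversion; the digit-by-digit reading would be replaced by proper number-to-words rules for the target language.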
Lexical tones in Urdu are almost completely described by textual forms. Simple rules or a pronunciation dictionary are useful for determining the lexical tones. Only some specific words, mostly those that originated in the Pali or Sanskrit languages, diverge from the rules.

3.4.8 Urdu speech recognition

Fig. 3 illustrates the major components of a general ASR system. For other languages, such as English, there have been major advancements in each of these areas. For Urdu ASR, we cluster research issues into five groups: general Urdu ASR applications, acoustic modeling, language modeling, noise robustness, and prosody.

3.4.9 Development of Urdu ASR

The first discussion of Urdu isolated-syllable recognition appeared in a thesis in which distinctive features were extracted from speech utterances and used for recognition. Since 1995, many academic institutions in Pakistan have presented their work on Urdu ASR. Development started with isolated word recognition, including numeral speech recognition.

Fig:3 Architecture of Urdu ASR

3.4.10 Urdu speech applications

Since basic research on Urdu speech, as well as the development of large Urdu resources, is still limited and, in certain areas, still in its initial stages, research into high-level speech applications presents quite a challenge. According to past literature reviews, three applications have been explored: spoken dialogue systems, speech translation, and speaker recognition. It should be noted that speaker recognition is not particularly language-dependent.

[Diagram: three recognizer configurations, each taking speech input through phonetic-feature analysis and tonal-feature analysis. (1) A tonal-syllable recognizer using HMMs of tonal syllables, producing a tonal-syllable sequence. (2) A base-syllable recognizer using HMMs of base syllables and a tone recognizer using HMMs of tones, producing a base-syllable sequence and a tone sequence. (3) A base-syllable recognizer using HMMs of base syllables and a tone recognizer using HMMs of tones, producing a tonal-syllable sequence.]


4. FUTURE WORK PLAN

4.1 Phonetically Rich and Balanced Speech Corpus

A speech corpus is an important requirement for developing any ASR system. The developed corpus contains recordings of 500 Urdu sentences. 400 sentences were written, recorded, and used for training the acoustic model. For testing the acoustic model, 48 additional sentences representing Urdu proverbs were created by an Urdu language specialist. The speech corpus was recorded by 60 (30 male and 30 female) native Urdu speakers from 11 different Indian countries. It was recorded in a soundproof studio using Sound Forge 8.0 software and took nearly six months to complete, from April 2012 to October 2012. The corpus was enriched with a variety of native Urdu speakers, taking into consideration the following characteristics:

• Age categories
• Various nationalities
• Various Indian states
• Various professions
• Various academic qualifications

Since this speech corpus contains written and spoken training and testing data from a variety of native Urdu speakers who represent different genders, age categories, nationalities, regions, and professions, and is also based on phonetically rich and balanced sentences, it is expected to be useful for developing many Urdu speech- and text-based applications, such as speaker-dependent and speaker-independent ASR, text-to-speech (TTS) synthesis, speaker recognition, and many others.

Fig:4 Flowchart carrying towards Result

4.2 Speech Data Preparation and Analysis

This section covers all the preparation and pre-processing steps we developed to produce ready-to-use speech data, which are later used for training the acoustic model.

4.3 Automatic Urdu Speech Segmentation

During the recording sessions, speakers were asked to utter the 500 sentences sequentially, starting with the training sentences followed by the testing sentences. Recordings for a single speaker were saved into one “.wav” file, and sometimes up to four “.wav” files, depending on the number of sessions the speaker needed to finish recording the 415 sentences. Saving every single recording as soon as it was uttered would have been time consuming. Therefore, these bigger “.wav” files needed to be segmented into smaller ones, each holding a single recording of a single sentence. We developed a Matlab program that has two functions. The first function, “read.m”, reads the original bigger “.wav” files, identifies the starting and ending points of each sentence utterance, and generates a text file, “segments.txt”, that automatically assigns a name to each utterance and pairs the name with the corresponding starting and ending points. The second function, “segment.m”, reads the automatically generated text file “segments.txt” and, based on the starting and ending points it contains, segments the original bigger “.wav” file into smaller “.wav” files carrying the names identified in “segments.txt”.

4.4 Manual Classification and Validation of Correct Speech Data

The automatic Urdu speech segmentation explained in the previous subsection outputs all possible “.wav” files into a single directory. Therefore, these “.wav” files were manually classified into the corresponding sentence directories. Wrongly pronounced utterances were ignored at this stage. As a result, only correct utterances are considered for the further pre-processing steps.

4.5 Directory Structure, Sound Filenames Convention and Automatic Generation of Training and Testing Transcription Files

Each speaker has a single folder that contains three sub-folders, namely “Training Sentences”, “Testing Sentences”, and “Others”. The “Training Sentences” sub-folder contains 400 sub-folders representing the 400 training sentences, whereas the “Testing Sentences” sub-folder contains 48 sub-folders representing the 51 testing sentences. The sub-folder “Others” contains out-of-content utterances for each speaker. Each sentence sub-folder contains two further sub-folders, namely “Correct” and “Wrong”. Utterances classified under the sub-folder “Correct” are the ones used for the further pre-processing steps.

4.6 Speech Corpus Testing and Evaluation for Urdu ASR Systems

This section covers three major requirements for Urdu ASR, namely the Urdu phonetic dictionary, acoustic model training, and language model training. An evaluation of the Urdu ASR is finally presented.
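The second step of the segmentation procedure in Section 4.3, cutting a large recording according to “segments.txt”, can be sketched in Python with the standard `wave` module. This is not the authors' Matlab code: the function names are hypothetical, and the line format `<name> <start_seconds> <end_seconds>` is an assumption, since the paper does not specify the layout of “segments.txt”.

```python
import os
import wave


def read_segments(path):
    """Parse lines of the assumed form '<name> <start_sec> <end_sec>'."""
    segments = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            name, start, end = parts
            segments.append((name, float(start), float(end)))
    return segments


def cut_wav(wav_path, segments, out_dir):
    """Write one small .wav file per segment; return the output paths."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with wave.open(wav_path, "rb") as source:
        params = source.getparams()
        rate = source.getframerate()
        for name, start, end in segments:
            source.setpos(int(start * rate))
            frames = source.readframes(int((end - start) * rate))
            out_path = os.path.join(out_dir, name + ".wav")
            with wave.open(out_path, "wb") as target:
                target.setparams(params)  # frame count is fixed on close
                target.writeframes(frames)
            written.append(out_path)
    return written
```

A call such as `cut_wav("speaker01.wav", read_segments("segments.txt"), "utterances")` (with illustrative file names) would produce one file per listed utterance, ready for the manual classification step.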

Fig:5 System of Speech translation Mechanism

4.7 Software/Interfacing

Efforts are underway to develop more sophisticated and user-friendly Urdu support on computers and the Internet; the most widespread Urdu software is the InPage Desktop Publishing package. Microsoft has included Urdu language support in all new versions of Windows, and Microsoft Office 2007 is available in Urdu through Language Interface Pack support. Most Linux desktop distributions also allow easy installation of Urdu support and translations. Apple implemented the Urdu language keyboard across mobile devices in its iOS 8 update in September 2014, and the latest Android operating system also provides Urdu plug-ins on various devices.

Fig:6 Supporting Urdu Plugins in various devices


5. CONCLUSION

In this paper we have worked towards developing a compatible, high-performance Urdu ASR system based on the fundamental tools currently available in multilingual speech processing technology, as applied to Urdu speech processing, and on a phonetically rich and balanced speech corpus. This work also includes creating fully diacritized transcriptions and building an Urdu phonetic dictionary and an Urdu statistical language model. Recognition results show that this Urdu ASR system is speaker independent and performs considerably better than many other reported Urdu recognition results.

