Key Issues in Vowel Based Splitting of Telugu Bigrams
Total Page:16
File Type:pdf, Size:1020Kb
(IJACSA) International Journal of Advanced Computer Science and Applications, Special Issue on Natural Language Processing 2014 Key Issues in Vowel Based Splitting of Telugu Bigrams T. Kameswara Rao Dr. T. V.Prasad Assoc. Professor and Head, CSE Dept Former Dean of Computing Sciences, Brahma’s Inst. of Engg. and Tech Visvodaya Technical Academy, Rajupalem, Nellore, AP, India Kavali, AP, India. Abstract—Splitting of compound Telugu words into its violated and special rule is applied. The person who is aware components or root words is one of the important, tedious and of this kind of special cases can only tokenize properly. yet inaccurate tasks of Natural Language Processing (NLP). Except in few special cases, at least one vowel is necessarily Likewise majority of Indian languages follow the features involved in Telugu conjunctions. In the result, vowels are often of Sanskrit; undergo conjunction which is inevitable that lead repeated as they are or are converted into other vowels or in generating compounds that are essentially bigrams, trigrams consonants. This paper describes issues involved in vowel based or n-grams. Bigram is a compound formed by two words and splitting of a Telugu bigram into proper root words using Telugu trigrams by three words, and so on. As Telugu is one of them, grammar conjunction (‘sandhi’) rules for MT. one can see the nature of uniting the words to form n-grams in Telugu also. Though Telugu is highly influenced by other Keywords—Telugu word splitting; vowel based splitting; languages, especially most of it is by Sanskrit [7], Telugu is compound word splitting; bigrams; trigrams; n-grams; NLP not originated from Sanskrit [4]. Even though Telugu was originally intended to be totally free from Sanskrit, it has I. INTRODUCTION tremendous impact and deep penetration into Telugu. In 1816, Sanskrit is considered as the mother language for almost Francis White Ellis raised this issue. Later Bishop Robert all Indian languages, since a majority of the Indian languages Cardwell proved that a family of twelve Dravidian languages are based on grammar rules similar to that of Sanskrit Telugu, Tamil, Kannada/Canarese, Malayalam, Tulu, grammar [6]. Sanskrit is grammatically very well structured Kodagu/Coorg, Tuda, Kota, Gond, Khond/Ku, Rajmahal and and very rich in its inflections [7]. It is the oldest language on Oraon are not originated from Sanskrit in his book titled “A the earth to have a powerful structured grammar. Panini (300 Comparative Grammar of Dravidian Languages” in 1856 [5]. BCE) the greatest grammarian developed Sanskrit grammar As a proof of that, pure Telugu literature work is available in with more than 4000 rules [8], [10]. Unlike western languages, the form of ‘yayAti caritramu’ by ‘ponnagaMTi telaganna’ th Sanskrit is the best example that unites the words to form a written in 16 century [1][13]. Later Telugu mingled with compound word (or simply compound). According to Sanskrit heavily by ‘samskrutAndhra kavulu’ (Sanskrit – Bloomfield and Chomsky (1957), sentence is the largest Andhra - a synonym of Telugu - poets) when they translated grammatical unit [16]. epics in Sanskrit literature like Ramayana, Mahabharata and bhAgavata, etc. Learning or speaking Sanskrit was a great There is a possibility and custom to write a complete honor in those days and literature work in Sanskrit was highly sentence as a single compound in Sanskrit. For instance honored. That can be one of the reasons to Sanskritize Telugu “jalObhaumamantarikshamitidvidhAbhavati” – for to enhance its value. convenience, it can be tokenized as “jalaH bhaumaM antarikshaM iti dvidhA bhavati” means ‘water is of two types, Additionally, there are numerous dialects in Indian one is on the earth, and another is in space’ (‘jalaH’ – water, languages - even many of them do not have script and are ‘bhaumaM’ – on the earth, ‘antarikshaM’ – in space, ‘iti’ – based on their culture, territory, and have tremendous impact like this, ‘dvidhA’ – two categories, ‘bhavati’ – is). of non-Indian languages like Urdu, Persian, Arabic, English, etc. For instance, most of the Telugu language is affected by Sanskrit scholars are to be very careful about tokenization. Urdu in ‘telaMgANa’ territory. ‘tarfIdu, aafIsu, pennu, Lack of appropriate knowledge on the grammar or less pEparu, kaburlu, bassu’, etc. are words from those languages attention to each and every letter gives immature tokenization adapted in Telugu [4]. Such words, their conjunctions and that leads to yield distorted or quite opposite meaning in some their corrupted / colloquial forms are almost understandable by special cases [8]. For example ‘viSvAmitraH’ is the word to be local humans but not easily by non-locals. For example ‘nI tokenized; its meaning is friend of the universe. It can be jimmaDa’ is the word very frequently used by the natives of tokenized as ‘viSva’ + ‘amitraH’ according to ‘savarNadIrga eastern Andhra. It means ‘let your tongue fall’ (literally sandhi’, which is not to be applied here because it gives ‘jimma’ is the colloquial form of ‘jihva’ – Sanskrit word for opposite meaning i.e., enemy of the universe. For this kind of tongue, ‘aDa’ is the corrupted form of ‘paDu’ – a Telugu special cases, Sanskrit gives exemptions strictly. So it should word). be ‘viSva’ + ‘mitraH’, where regular conjunction rule is to be 9 | P a g e www.ijacsa.thesai.org (IJACSA) International Journal of Advanced Computer Science and Applications, Special Issue on Natural Language Processing 2014 II. VOWELS According to ‘pANini’ Sanskrit grammar, vowels and their TABLE II. TELUGU VOWELS AND THEIR ROMAN EQUIVALENT forms are given as in TABLE I [2] (pronunciations are given Telugu * vowel అ ఆ ఇ ఈ ఉ ఊ ఋ ౠ ఌ in Appendix). Roman A A I I U U R Ru Z English TABLE I. CHARACTERISTICS OF VOWEL ‘A ( )’ अ Telugu * vowel ౡ ఎ ఏ ఐ ఒ ఓ ఔ అం అః Vowel Time Types Roman Z E E Y O O W aM aH required to Roman English dEva nAgari English pronounce a अ One unit anudAtta, udAtta, Note: *Vowels ‘ , (z, Z)’ are not used now-a-days, (short vowel) (Eka mAtra) svarita (hrasva) ఌ ౡ they are not considered in this paper. A आ Two units anudAtta, udAtta, (Long vowel) (dvi mAtra) svarita (dIrgha) In Telugu, vowels are classified into two types as follows A3 आ3 Three units anudAtta, udAtta, [14]. (Longer vowel) (tri mAtra) svarita (pluta) ‘hrasvAs’ – ‘a, i, u, R, z, e’ Note: ‘pluta’ is applied in calling somebody who is at a ‘dIrghAs’ – ‘A, I, U, Ru, Z, E, Y, O, W’ distance. For example, ‘hE rAmA3’. Here ‘3’ indicates the ‘dIrghAs’ again classified into two types as follows. ‘pluta’ of the vowel ‘A’. If ‘pluta’ is not applied here, the ‘vakrAs’ – ‘e, E, o, O’ person cannot be called. ‘vakratamAs’ – ‘Y, W’ Again each type is classified in to two different forms, III. PROCESS OF SPLITTING WORDS namely ‘anunAsika’ and ‘ananunAsika’. ‘anunAsika’ is a nasal sound while ‘ananunAsika’ is not. A total of six types for each Due to many practical issues involved in maintaining a database with all combinations of compounds, it is better to vowel ‘a, A, A3’ yields 18 different forms of vowel ‘a(अ)’. maintain only standard or root words. Compounds of the Likewise ‘i(इ), R(ऋ)’ also have 18 different forms each. ‘z(ऌ), source language are split to obtain the original words using reverse engineering in accordance to the conjunction E( ), Y( ), O( ), and W ( )’ can be obtained in 12 forms ए ऐ ओ औ (‘sandhi’) rules. This will make the morphological analysis for each as they are not derived long forms. A huge total of easier. 132 vowels are there in Sanskrit. Mostly these are used in ‘Vedas’. But only thirteen vowels ‘a, A, i, I, u, U, R, Ru, z, E, Proper stemming and correcting of corrupted forms for Y, O and W’ are used in general usage. Two more vowels ‘aM splitting of n-grams into individual tokens is necessary for (anusvAra), aH (visarga)’ are used in Sanskrit. Two special better understanding the context. This plays an important role vowels are there appears only in Sanskrit named in translation also whereas understanding is also a kind of ‘jihvamUlIyam’ and ‘upadhmAnIyam’. If ‘visarga’ is appeared translation. Splitting of compounds into root words is an as prior character of consonant ‘k’, it is considered as ‘artha- important phase in NLP for the applications like MT [9]. visarga’ and called ‘jihvamUlIyam’, e.g. ‘aMtaHkaraNam’. If Building a computational model to analysis natural language is ‘visarga’ is appeared as prior character of consonant ‘p’, it is the goal of NLP [15]. For MT from Telugu to any other called ‘upadhmAnIyam’. Ex. ‘vAyuHpaMkam’. language including Indian languages, one of the issues of dealing with source language words is that each word need to Telugu includes two more short vowels ‘e’ and ’o’ and one be stored in the database together with different more long vowel ‘Z’ to the above listed Sanskrit vowels to suffixes/prefixes (also known as inflections) thus comprise a total of eighteen vowels [2]. All proper Telugu tremendously increasing the storage space. This is in the case words end with vowels only. That’s why Telugu language is like Telugu that has about 800 different dialects within the called ‘ajanta’ (= ‘ach’ + ‘anta’, literally ‘ach’ meaning vowel state of Andhra Pradesh. But most of the conjunctions are and ‘anta’ meaning ending) language. Consonants are called common and are computable. The best way to translate them ‘hal’ in Telugu. They are 37 in number. Unlike Telugu, words is to split back into root words as they formed and then of almost all Indian languages end in consonants and hence translate individual root words. Compounds are formed with called ‘halanta’ languages. All western languages are also two or more root words.