<<

(IJACSA) International Journal of Advanced Computer Science and Applications, Special Issue on Natural Language Processing 2014 Key Issues in Vowel Based Splitting of Telugu Bigrams

T. Kameswara Rao Dr. T. V.Prasad Assoc. Professor and Head, CSE Dept Former Dean of Computing Sciences, Brahma’s Inst. of Engg. and Tech Visvodaya Technical Academy, Rajupalem, Nellore, AP, India Kavali, AP, India.

Abstract—Splitting of Telugu words into its violated and special rule is applied. The person who is aware components or root words is one of the important, tedious and of this kind of special cases can only tokenize properly. yet inaccurate tasks of Natural Language Processing (NLP). Except in few special cases, at least one vowel is necessarily Likewise majority of Indian languages follow the features involved in Telugu conjunctions. In the result, vowels are often of ; undergo conjunction which is inevitable that lead repeated as they are or are converted into other vowels or in generating compounds that are essentially bigrams, trigrams consonants. This paper describes issues involved in vowel based or n-grams. Bigram is a compound formed by two words and splitting of a Telugu bigram into proper root words using Telugu trigrams by three words, and so on. As Telugu is one of them, conjunction (‘sandhi’) rules for MT. one can see the of uniting the words to form n-grams in Telugu also. Though Telugu is highly influenced by other Keywords—Telugu word splitting; vowel based splitting; languages, especially most of it is by Sanskrit [7], Telugu is compound word splitting; bigrams; trigrams; n-grams; NLP not originated from Sanskrit [4]. Even though Telugu was originally intended to be totally free from Sanskrit, it has I. INTRODUCTION tremendous impact and deep penetration into Telugu. In 1816, Sanskrit is considered as the mother language for almost Francis White Ellis raised this issue. Later Bishop Robert all Indian languages, since a majority of the Indian languages Cardwell proved that a family of twelve are based on grammar rules similar to that of Sanskrit Telugu, Tamil, Kannada/Canarese, Malayalam, Tulu, grammar [6]. Sanskrit is grammatically very well structured Kodagu/Coorg, Tuda, Kota, Gond, Khond/Ku, Rajmahal and and very rich in its inflections [7]. It is the oldest language on Oraon are not originated from Sanskrit in his book titled “A the earth to have a powerful structured grammar. Panini (300 Comparative Grammar of Dravidian Languages” in 1856 [5]. BCE) the greatest grammarian developed As a proof of that, pure work is available in with more than 4000 rules [8], [10]. Unlike western languages, the form of ‘yayAti caritramu’ by ‘ponnagaMTi telaganna’ th Sanskrit is the best example that unites the words to form a written in 16 century [1][13]. Later Telugu mingled with compound word (or simply compound). According to Sanskrit heavily by ‘samskrutAndhra kavulu’ (Sanskrit – Bloomfield and Chomsky (1957), sentence is the largest Andhra - a synonym of Telugu - poets) when they translated grammatical unit [16]. epics in like Ramayana, Mahabharata and bhAgavata, etc. Learning or speaking Sanskrit was a great There is a possibility and custom to write a complete honor in those days and literature work in Sanskrit was highly sentence as a single compound in Sanskrit. For instance honored. That can be one of the reasons to Sanskritize Telugu “jalObhaumamantarikshamitidvidhAbhavati” – for to enhance its value. convenience, it can be tokenized as “jalaH bhaumaM antarikshaM iti dvidhA bhavati” means ‘water is of two types, Additionally, there are numerous dialects in Indian one is on the earth, and another is in space’ (‘jalaH’ – water, languages - even many of them do not have script and are ‘bhaumaM’ – on the earth, ‘antarikshaM’ – in space, ‘iti’ – based on their culture, territory, and have tremendous impact like this, ‘dvidhA’ – two categories, ‘bhavati’ – is). of non-Indian languages like , Persian, , English, etc. For instance, most of the is affected by Sanskrit scholars are to be very careful about tokenization. Urdu in ‘telaMgANa’ territory. ‘tarfIdu, aafIsu, pennu, Lack of appropriate knowledge on the grammar or less pEparu, kaburlu, bassu’, etc. are words from those languages attention to each and every letter gives immature tokenization adapted in Telugu [4]. Such words, their conjunctions and that leads to yield distorted or quite opposite meaning in some their corrupted / colloquial forms are almost understandable by special cases [8]. For example ‘viSvAmitraH’ is the word to be local humans but not easily by non-locals. For example ‘nI tokenized; its meaning is friend of the universe. It can be jimmaDa’ is the word very frequently used by the natives of tokenized as ‘viSva’ + ‘amitraH’ according to ‘savarNadIrga eastern Andhra. It means ‘let your tongue fall’ (literally sandhi’, which is not to be applied here because it gives ‘jimma’ is the colloquial form of ‘jihva’ – Sanskrit word for opposite meaning i.e., enemy of the universe. For this kind of tongue, ‘aDa’ is the corrupted form of ‘paDu’ – a Telugu special cases, Sanskrit gives exemptions strictly. So it should word). be ‘viSva’ + ‘mitraH’, where regular conjunction rule is to be

9 | P a g e www.ijacsa.thesai.org (IJACSA) International Journal of Advanced Computer Science and Applications, Special Issue on Natural Language Processing 2014

II. VOWELS According to ‘pANini’ Sanskrit grammar, vowels and their TABLE II. TELUGU VOWELS AND THEIR ROMAN EQUIVALENT forms are given as in TABLE I [2] (pronunciations are given Telugu * vowel అ ఆ ఇ ఈ ఉ ఊ ఋ ౠ ఌ in Appendix). Roman A A I I U U R Ru Z English TABLE I. CHARACTERISTICS OF VOWEL ‘A ( )’ अ Telugu * vowel ౡ ఎ ఏ ఐ ఒ ఓ ఔ అం అః Vowel Time Types Roman Z E E Y O O W aM aH required to Roman English dEva nAgari English pronounce a अ One unit anudAtta, udAtta, Note: *Vowels ‘ , (z, Z)’ are not used now-a-days, (short vowel) (Eka mAtra) svarita (hrasva) ఌ ౡ they are not considered in this paper. A आ Two units anudAtta, udAtta, (Long vowel) (dvi mAtra) svarita (dIrgha) In Telugu, vowels are classified into two types as follows A3 आ3 Three units anudAtta, udAtta, [14]. (Longer vowel) (tri mAtra) svarita (pluta)  ‘hrasvAs’ – ‘a, i, u, R, z, e’ Note: ‘pluta’ is applied in calling somebody who is at a  ‘dIrghAs’ – ‘A, I, U, Ru, Z, E, Y, O, W’ distance. For example, ‘hE rAmA3’. Here ‘3’ indicates the ‘dIrghAs’ again classified into two types as follows. ‘pluta’ of the vowel ‘A’. If ‘pluta’ is not applied here, the  ‘vakrAs’ – ‘e, E, o, O’ person cannot be called.  ‘vakratamAs’ – ‘Y, W’

Again each type is classified in to two different forms, III. PROCESS OF SPLITTING WORDS namely ‘anunAsika’ and ‘ananunAsika’. ‘anunAsika’ is a nasal sound while ‘ananunAsika’ is not. A total of six types for each Due to many practical issues involved in maintaining a database with all combinations of compounds, it is better to vowel ‘a, A, A3’ yields 18 different forms of vowel ‘a(अ)’. maintain only standard or root words. Compounds of the Likewise ‘i(इ), R(ऋ)’ also have 18 different forms each. ‘z(ऌ), source language are split to obtain the original words using reverse engineering in accordance to the conjunction E( ), Y( ), O( ), and W ( )’ can be obtained in 12 forms ए ऐ ओ औ (‘sandhi’) rules. This will make the morphological analysis for each as they are not derived long forms. A huge total of easier. 132 vowels are there in Sanskrit. Mostly these are used in ‘Vedas’. But only thirteen vowels ‘a, A, i, I, u, U, R, Ru, z, E, Proper stemming and correcting of corrupted forms for Y, O and W’ are used in usage. Two more vowels ‘aM splitting of n-grams into individual tokens is necessary for (anusvAra), aH (visarga)’ are used in Sanskrit. Two special better understanding the context. This plays an important role vowels are there appears only in Sanskrit named in translation also whereas understanding is also a kind of ‘jihvamUlIyam’ and ‘upadhmAnIyam’. If ‘visarga’ is appeared translation. Splitting of compounds into root words is an as prior character of consonant ‘k’, it is considered as ‘artha- important phase in NLP for the applications like MT [9]. visarga’ and called ‘jihvamUlIyam’, e.g. ‘aMtaHkaraNam’. If Building a computational model to analysis natural language is ‘visarga’ is appeared as prior character of consonant ‘p’, it is the goal of NLP [15]. For MT from Telugu to any other called ‘upadhmAnIyam’. Ex. ‘vAyuHpaMkam’. language including Indian languages, one of the issues of dealing with source language words is that each word need to Telugu includes two more short vowels ‘e’ and ’o’ and one be stored in the database together with different more long vowel ‘Z’ to the above listed Sanskrit vowels to suffixes/prefixes (also known as inflections) thus comprise a total of eighteen vowels [2]. All proper Telugu tremendously increasing the storage space. This is in the case words end with vowels only. That’s why Telugu language is like Telugu that has about 800 different dialects within the called ‘ajanta’ (= ‘ach’ + ‘anta’, literally ‘ach’ meaning vowel state of Andhra Pradesh. But most of the conjunctions are and ‘anta’ meaning ending) language. Consonants are called common and are computable. The best way to translate them ‘hal’ in Telugu. They are 37 in number. Unlike Telugu, words is to split back into root words as they formed and then of almost all Indian languages end in consonants and hence translate individual root words. Compounds are formed with called ‘halanta’ languages. All western languages are also two or more root words. While the root words can be retrieved categorized as ‘halanta’ languages as their words commonly from database, the inflections thus obtained needs serious end in consonants except Italian that ends in vowels. This is focus. Inability to sufficiently handle the inflections may the reason why Telugu is called ‘Italian of the East’ and one of result in false word formations and distorted meaning. But the secrets behind sweetness of Telugu vocabulary. mere splitting the compound may not give complete meaning There are eighteen vowels in Telugu language as shown in all the time. To understand the meaning of a compound, first TABLE II. All the vowels are called ‘ach’ or ‘svara’ identify the meaning of components and then the relationship according to , their Roman equivalents are as between them [11]. For instance, a compound in TABLE II. ‘rAmunitOkapirAju’ is formed by two words ‘rAmunitO + kapirAju’.

10 | P a g e www.ijacsa.thesai.org (IJACSA) International Journal of Advanced Computer Science and Applications, Special Issue on Natural Language Processing 2014

First word is inflected, and second word is a root word. In Sanskrit, there are five important classifications of ‘rAmunitO’ literally means with ‘rAma’ and ‘kapirAju’ means ‘sandhis’. They are 1) ‘ach sandhi’, 2) ‘hal sandhi’ 3) ‘visarga Hanuman. If the inflection is not observed in the first word, it sandhi’ 4) ‘prakruti bhAva sandhi’ and 5) ‘svAdi sandhi’ [1]. may be split as ‘rAmunitOka’ (literally means the tail of But only first three ‘sandhis’ are used very frequently. ‘ach’ ‘rAma’) + ‘pirAju’ (an absurd word), which gives a distorted and ‘visarga sandhis’ works with vowels and ‘hal sandhis’ meaning. works with consonants. ‘sandhi’ classifications are given in TABLE III. The scope of this paper is limited to deal with bigrams only for obtaining better MT and aims to propose solutions to TABLE III. LIST OF SANSKRIT ‘SANDHIS’ the issues of vowel based splitting. Issues related to handling of different types of dialects and their corrupted forms have sandhi Names not been considered. More specifically, handling of Type ‘ach’ savarNa dIrgha, guN, vRddhi, yaNAdESa, vAntAdESa, compounds formed according to the grammar rules, and their yAntAdESa, pUrva rUpa, para rUpa, avaJnAdESA splitting based on vowels together with certain special cases in ‘hal’ Scutva, shTutva, jaStva, anunAsika, pUrva savarNa, para Telugu have been discussed. savarNa, chatva ‘visarga’ This ‘sandhi’ shows six types of differences, but names are IV. CONJUNCTION RULES not given to them. Splitting is a process opposite to the conjunction. Though these three kinds of ‘sandhis’ are used by Telugu Conjunction is called ‘sandhi’ and splitting is called ‘sandhi as it is, they are treated as Sanskrit ‘sandhis’. Telugu defines vicchEda’ in Sanskrit. Telugu also use the word ‘sandhi’ to around thirty ‘sandhis’ (TABLE IV) according to its grammar. represent conjunction. At least two words are required for These Telugu ‘sandhis’ fall under ‘ach sandhis’, ‘hal sandhis’ conjunction. First word is called ‘pUrva pada’ and the second or work with both vowels as well as consonants [1]. word is called ‘para pada / uttara pada’ [12]. While most part of the word remains unchanged, technically, the actual TABLE IV. LIST OF TELUGU ‘SANDHIS’ ‘sandhi’ occurs between two letters, i.e., ‘pUrva svaram’( last S.No ‘sandhi’ name S.No ‘sandhi’ name letter of the ‘pUrva pada’) and ‘para svaram’ (first letter of the ‘para pada’) [1]. Telugu language adapted many of the 1 ukAra (utva) 16 penvAdi 2 yaDAgama 17 AmrEDita ‘sandhi’ rules from Sanskrit as it uses much grammar of 3 AkAra 18 muvarNalOpa Sanskrit in addition to its own grammar rules. Sanskrit 4 IkAra 19 paDvAdi grammar rules were adapted into Telugu since majority words 5 apadAdisvara 20 aligAgama of Telugu language were taken from Sanskrit. Sanskrit 6 dvirukta TakAra 21 anukaraNa grammar describes three ways to form a ‘sandhi’. They are 7 TugAgama 22 visandhi 8 RugAgama 23 paMpavarNAdESa  ‘Agamamu’ (literally means coming in Sanskrit): one 9 gasaDadavAdESa 24 trika new letter comes according to ‘sandhi’ rules, and is 10 saraLAdESa(druta) 25 lu--na-la included between the conjunction characters, without 11 puMpvAdESa 26 dugAgama removing any of them. Ex. ‘’ +’ amma’ = 12 pugAgama 27 allOpa ‘mAyamma’. ‘A’ and ‘a’ are involved in ‘sandhi’, the 13 prAtAdi 28 nakArAdESA 14 nugAgama 29 mivarNalOpa new letter ‘y’ is included between ‘A’ and ‘a’. ‘tut’ is 15 itvAdESa 30 ukAra vikalpa sandhi introduced in ‘tuDAgama’, ‘dud’ in ‘dhuDAgama’, ‘jam’ in ‘jamuDAgama’ and so on, are the examples of Though there are many Sanskrit and Telugu ‘sandhis’, ‘Agama sandhis’ in Sanskrit and ‘yaDAgama, only some of them for vowel based splitting have been TugAgama, rugAgama’ etc. are the examples of considered which are resulting in a vowel in compound ‘Agama sandhis’ in Telugu. (TABLE V) irrespective of they are classified as ‘ach sandhi’,  ‘AdESamu’ (literally means rule in Sanskrit): one new ‘hal sandhi’ or ‘visarga sandhi’. Some special cases are also letter replaces the two ‘sandhi’ letters. Ex. ‘parama’ + discussed in this paper even they are involving a consonant. ‘ISvaruDu’ = ‘paramESvaruDu’. ‘a’ and ‘I’ are involved in ‘sandhi’ and both are replaced with ‘E’. TABLE V. LIST OF ‘SANDHIS’ RESULTS A VOWEL IN COMPOUND ‘yaNAdESa, anunAsika’ etc., are the examples of S.No ‘sandhi’ name Result vowel S/T ‘AdESa sandhis’ in Sanskrit and ‘pumpvAdESa, 1 savarNadIrgha A,I,U,Ru S gasaDadavAdESa’ etc., are the examples of ‘AdESa sandhis’ in Telugu. 2 guNa E,O,ar S 3 vRddhi Y, W S  ‘EkAdESamu’: one character of the ‘sandhi’ letters are 4 visarga O, H S omitted and second one continues to exist in the 5 akAra, ikAra, ukAra a,A,i,I,u,U,e,E,Y,o,O,W T compound. Ex. ‘rAmuDu’ + ‘ataDu’ = ‘rAmuDataDu’. ‘u’ and ‘a’ are involved in ‘sandhi’, but letter ‘u’ is 6 yaNAdESa y + vowel T dropped and only ‘a’ is continued. ‘savarNadIrgha, 7 jastva sandhi g/j/D/d/b + vowel S guNa, vRddhi’ etc., are the examples of ‘EkAdESa 8 dviruktaTakAra TT + vowel T sandhis’ in Sanskrit, and ‘akAra, ukAra, ikAra sandhis’ *S – Sanskrit, T – Telugu are the examples in Telugu.

11 | P a g e www.ijacsa.thesai.org (IJACSA) International Journal of Advanced Computer Science and Applications, Special Issue on Natural Language Processing 2014

1) ‘savarNa dIrgha sandhi’: This results a vowel Rule1: when ‘pUrva svara’ is ‘aH’ and ‘para svara’ is ‘A/I/U/Ru’ accordingly when one of the following (TABLE 6) ‘a/u/g/gh/j /jh/D/Dh/d/dh/n/b/bh/m/y/r/l/v/h’, then ‘pUrva pattern occurs. svara’ is replaced with ‘O’ in the compound. Note: Pattern is the combination of ‘purvasvara’ and ‘parasvara’ TABLE IX. ALL PATTERNS OF ‘VISARGA SANDHI- 1’ S.No Pattern Res Example TABLE VI. ALL PATTERNS OF ‘SAVARNADIRGHA SANDHI’ 1 aH + a O saH + ahaM = sOhaM 2 aH + u O vijayaH + ullAsa = vijayOllAsa S.No Pattern Res Example 3 aH + g O tiraH + gamana = tirOgamana 1 a + a A phAla + aksha = phAlAksha 4 aH + gh O manaH + ghana = manOghana 2 a + A A rAma + Alayamu = rAmAlayamu 5 aH + j O saraH + = sarOja 3 A + a A pUjA + arhuDu = pUjArhuDu 6 aH + jh O manaH + jhari = manOjhari 4 A + A A prajA + Anati = prajAnati 7 aH + D O naraH + DiMbha = narODiMbha 5 i + i I kavi + iMdruDu = kavIMdruDu 8 aH + Dh O SivaH + Dhamar = SivODhamar 6 i + I I naMdi + ISvara = naMdISvara 9 aH + d O yaH + dEvaH = yOdEvaH 7 I + i I vANI + iMdra = vANIMdra 10 aH + dh O tapaH + dhana = tapOdhana 8 I + I I vasumatI + ISa = vasumatISa 11 aH + n O yaSaH + nagara = yaSOnagara 9 u + u U su + ukti = sUkti 12 aH + b O manH + buddhi = manObuddhi 10 u + U U mRdhu + Uruvu = mRdhUruvu 13 aH + bh O manaH + duHkh = manOduHkh 11 U + u U vadhU + unnati = vadhUnnati 14 aH + m O SiraH + maNi = SirOmaNI 12 U + U U vadhU + Uruvu = vadhUruvu 15 aH + y O manaH + yaMtra = manOyaMtra 13 R + R Ru pitR + RNamu = pitRuNamu 16 aH + r O rajH + rAgamu = rajOrAgamu 14 R + Ru Ru Examples are not given since no words 17 aH + l O jalaH + lahari = jalOlahari 15 Ru + R Ru start or end with ‘Ru’ in Telugu. 18 aH + v O tapaH + vanaM = tapOvanaM 16 Ru + Ru Ru 19 aH + h O manaH + hara = manOhara

2) ‘guNa sandhi’: This results in a vowel ‘E/O/ar’ Note: In this case, ‘visarga’ should be preceded by ‘a’ accordingly when one of the following (TABLE VII) pattern else, this rule is not applicable. Ex. ‘dhanuH’ + ‘chalanamu’ occurs. = ‘dhanuScalanamu’.

TABLE VII. ALL PATTERNS OF ‘GUNA SANDHI’ Rule2: H + k / kh / p / ph gives ‘visarga’ as it is in the compound (TABLE X). S.No Pattern Res Example 1 a + i E bhUtala + itara = bhUtalEtara TABLE X. ALL PATTERNS OF ‘VISARGA SANDHI-2’ 2 a + I E svarga + ISuDu = svargESuDu 3 A + i E mahA + ikshu = mahEkshu S.No Pattern Res Example 4 A + I E mahA + ISuDu = mahESuDu 1 H + k Hk tapaH + kaMpa = tapaHkampa 5 a + u O dAma + udara = dAmOdara 2 H + kh Hkh hariH + khaDga= hariHkhaDga 6 a + U O Nava + Uha = navOha 3 H + p Hp dhanuH + puMja= dhanuHpuMja 7 A + u O mahA + uttama = mahOttama 4 H + ph Hph SaSiH + phalamu = SaSiHphalamu 8 A + U O mahA + UrU= mahOrU 9 a + R ar brahma + Rshi = brahmarshi Note: For this rule any vowel can precede the ‘visarga’ 10 A + R ar mahA + Rshi = maharshi and that vowel appears in the compound with preceding character of ‘visarga’. 3) ‘vRddhi sandhi’: This results a vowel ‘Y/W’ accordingly when one of the following (TABLE VIII) pattern 5) ‘ukAra sandhi’: If ‘pUrva svara’ is ‘u’ and ‘paras occurs. vara’ is a vowel, then ‘u’ is replaced by the vowel in result (TABLE XI). TABLE VIII. ALL PATTERNS OF ‘VRDDHI SANDHI’ TABLE XI. ALL PATTERNS OF ‘UKARA SANDHI’ S.No Pattern Res Example 1 a + E Y Eka + Eka = EkYka S.No Pattern Res Example 2 a + Y Y Sarva + YSvarya = sarvYSvarya 1 u + a a iTlu + anenu = iTlanenu 3 A + E Y kAMtA + Eka = kAMtYka 2 u + A A kAlu + ADu = kAlADu 4 A + Y Y mahA + YSvarya = mahYSvarya 3 u + i i vADu + ippuDu = vADippuDu 5 a + O W Eka + Oshadhi = EkWshadhi 4 u + I I kAlu + IDcu = kAlIDcu 6 a + W W rAma + Wnnatya = rAmWnnatya 5 u + u U nEDu + unnADu = nEDunnADu 7 A + O W mahA + Odhana = mahWdhana 6 u + U U mEmu + Ugamu = mEmUgamu 8 A + W W kAMtA + Wnnati = kAMtWnnati 7 u + e E Enugu + ekku = Enugekku 4) ‘visarga sandhi’: This has five rules of which only two 8 u + E E vAgu + EtAmu = vAgEtAmu are considered since these two rules results in a vowel ‘O/H’ 9 u + Y Y siddhamu + Y = siddhamY 10 u + o o rAmuDu + okaDu = rAmuDokaDu accordingly when one of the following (TABLE IX) pattern 11 u + O O ippuDu + Orpu = ippuDOrpu occurs. 12 u + W W tinu + WshadhaM= tinWshadhaM 13 u + M M ipuDu + aMtamu = ipuDaMtamu

12 | P a g e www.ijacsa.thesai.org (IJACSA) International Journal of Advanced Computer Science and Applications, Special Issue on Natural Language Processing 2014

Note: if ‘ukAra sandhi’ rule is applied to split ‘vAgISuDu’, 3) If it is a vowel, then try all possible combinations to it becomes ‘vAgu’ + ‘ISuDu’, which is a wrong splitting. It split the word according to the ‘sandhi’ rules listed in Tables should actually be split as ‘vAk’ + ‘ISuDu’. Such conflicts 6 through 13. should be handled carefully and may require manual checks. 4) If the compound is formed according to ‘sandhi’ rules 6) ‘akAra sandhi’: This ‘sandhi’ has four rules but only of two words, then it is split into two words. one of them is considered since remaining results in a 5) The process is recursively processed till all the words consonant. thus separated are found in the dictionary/database. Rule: when ‘pUrva svara’ is ‘a’ and ‘parasvara’ is any Example: ‘SivArcana’ – formed by the root words ‘Siva’ vowel, then ‘a’ is replaced by the vowel in result (TABLE + ‘arcana’. While using vowel based splitting, the vowels of XII). ‘SivArcana’ i.e., ‘i, A, a’ are to be checked (TABLE XIV).

TABLE XII. ALL PATTERNS OF ‘AKARA SANDHI’ TABLE XIV. POSSIBLE PATTERNS OF THIS SPLITTING OF ‘SIVARCANA’ S.No Pattern Res Example V Pattern Sandhi Result NA/A 1 a + a a rAma + anna = rAmanna i u + i ukAra Su + ivArcana NA 2 a + A A ciMta + Aku = ciMtAku i a + i akAra + ivArcana NA 3 a + i I puTTina + illu = puTTinillu i i + i ikAra Si + ivArcana NA 4 a + I I cinna + Iga = cinnIga A a + a savarNadIrgha Siva + arcane A 5 a + u u cUDaka + uMDu = cUDakuMDu A a + A savarNadIrgha Siva + Arcana NA 6 a + U U Kotta + Uyala = kottUyala A A + a savarNadIrgha SivA + arcane NA 7 a + e e sIta + ekkaDa = sItekkaDa A A + A savarNadIrgha SivA+ Arcana NA 8 a + E E tella + Enugu = tellEnugu A u + A ukAra Sivu + Arcana NA 9 a + Y Y nava + YSvarya = navYSvarya A a + A akAra Siva + Arcana NA 10 a + o o cIma + okaTi = cImokaTi A i + A ikAra Sivi + Arcana NA 11 a + O O konta + Opika = kontOpika a u + a ukAra SivArcu + ana NA 12 a + W W maha + WnnatyaM= mahWnnatyaM a a + a akAra SivArca + ana NA a i + a ikAra SivArci + ana NA 7) ‘ikAra sandhi’: if ‘pUrva svara’ is ‘i’ and ‘parasvara’ Note: If V (vowel) is the last character of the compound, is a vowel, then ‘i’ is replaced by the vowel in result (TABLE then there will be a chance of split when the letter is a long XIII). vowel like ‘A, I, U, E, Y, O’ e.g., ‘vaccADA’ = ‘vaccADu’ + ‘A’ (means, ‘did he come?’). TABLE XIII. ALL PATTERNS OF ‘IKARA SANDHI’ S.No Pattern Res Example This is occurs almost in interrogative cases. But there is no 1 i + a a Emi + aMTivi = EmaMTivi chance for short vowels to be the result of conjunction. There 2 i + A A nallani + Avu = nallanAvu is no need to check the last character of the compound, if it is 3 i + i I vacciri + ipuDu = vacciripuDu ‘a, i, u, R, e or o’ assuming it is the result of ‘sandhi’. From all 4 i + I I ciTTi + ItakAya = ciTTItakAya the patterns listed in TABLE XIX, ‘a + a’ pattern of 5 i + u u idi + unnadi = idunnadi ‘savarNadIrgha sandhi’ is applicable to split ‘SivArcana’ into 6 i + U U cakkani + Uru = cakkanUru ‘Siva + arcana’. Amongst these 13 patterns, only one pattern 7 i + e e idi + evaridi = idevaridi is suitable to split the compound properly. 8 i + E E Takkari + Enugu = TakkarEnugu 9 i + Y Y idi + YrAvatamu = idYrAvatamu When one pattern splits the compound successfully, then 10 i + o O nETiki + okkaTi = nETikokkaTi there is no need to go for further splitting until unless the 11 i + O O idi + Orugallu = idOrugallu compound is formed by three or more. Unnecessary splitting 12 i + W W ciTTi + Wshadhi = ciTTWshadhi may yield improper or unacceptable root words. As a rule of

thumb, best results are obtained by splitting in such a way that Note: There are some special issues in this ‘sandhi’, like first word extracted from the compound is as long as possible. ‘cEsi’ + ‘ipuDu’ = ‘cEsiyipuDu’, ‘vacci’ + ‘iccenu’ = Even if a proper word is obtained from the compound much ‘vacciyiccenu’. before finishing, splitting process is not to be stopped until all V. VOWEL BASED SPLITTING RULES vowels of the compound are checked. Ex. ‘adhikAramaDugu’ is a bigram formed by two proper words ‘adhikAramu’ Technically, whatever the rules used for conjunction, they (authority) and ‘aDugu’ (to ask) by the rule of ‘ukAra sandhi’. are used in reverse order to obtain those root words back. This But it can also be assumed as a trigram formed by three proper approach can be considered as a reverse engineering process. words ‘adhi’ (to overcome), ‘kAramu’ (chilli powder), Algorithm: ‘aDugu’ (to ask). If splitting process is stopped at the earlier stage when it found a proper word (for instance, ‘adhi’), it 1) A compound in Telugu, which is to be translated, is yields useless or distorted meaning when translated. taken and is transliterated into Roman Telugu. Sometimes, some words are not to be treated as 2) Each character is checked to determine whether it is a compounds and should be translated as a whole. For instance, vowel. ‘adhikAri’ literally meaning “” is the word to be treated as single word and should not be split. If it is split, it becomes,

13 | P a g e www.ijacsa.thesai.org (IJACSA) International Journal of Advanced Computer Science and Applications, Special Issue on Natural Language Processing 2014

‘adhika + ari’ by the pattern ‘a + a = A’ from TABLE XVII. ALL PATTERNS OF ‘YANADESA SANDHI’ ‘savarNadIrgha sandhi’. ‘adhika’ (means ‘more’) and ‘ari’ S.No Pattern Res Example (means ‘enemy’). Both are root words and ‘sandhi’ seems to 1 i + a ati + aMta = atyaMta be proper but the meaning yields ‘more enemy’, an incorrect 2 i + A yA gWri + Arcana = gWryArcana translation. The primary requirement in translation is that the 3 i + u Yu ati + unnati = atyunnati meaning of the context should not be disturbed. 4 i + O yO dadhi + OdanaM = dadhyOdanaM 5 u + a va madhu + annamu = madhvannamu VI. SPECIAL CASES OF ‘ACH SANDHI’ 6 u + A vA guru + AJna = gurvAJna 7 R + A pirR + Arjitamu = pitrArjitamu All the ‘sandhis’ and the cases discussed above are related 3. ‘jastva sandhi’: This ‘sandhi’ also results in specific to single independent vowel. There are special cases in which consonants along with vowels in compound. These specific either next or previous letters of the vowel is also to be ‘consonant + vowel’ patterns are helpful (Except in some checked in splitting. This ensures that the compound is formed cases - refer Note of ‘ukAra sandhi’), in tracing exactly the by a particular ‘sandhi’. root words by applying ‘jastva sandhi’ rules. For some ‘sandhi’ rules, both the previous and next letters Rule: when ‘pUrva svara’ is ‘k/c/T/t/p’ and ‘parasvara’ is of the vowel are to be checked (TABLE XVIII). Following are a vowel/ g/j/D/d/b/h/y/v/r’ then, ‘g/j/D/d/b’ come as ‘AdESam’ the examples. (TABLE XVIII). 1) ‘guNa sandhi’: In specific cases, this ‘sandhi’ results in two letters instead of one in compound (TABLE VII). TABLE XVIII. VOWEL PATTERNS OF ‘JASTVA SANDHI’ Sometimes more than one letter also to be checked since, to S.No Pattern Res Example reduce time complexity in splitting i.e. six patterns causes to 1 k + I gI vAk + ISuDu = vAgISuDu 2 c + a Ja ac + aMtamu = ajaMtamu result in vowel ‘a’ but only three patterns can result ‘ar’. Ex. 3 T + a shaT + aMgamu = shaDaMgamu For ‘brahmarshi’ – is a compound formed by two root words 4 t + A dA sat + AcAramu = sadAcAramu ‘brahma’ and ‘Rshi’. All patterns are given in (TABLE XV, 5 p + a kakup + aMtamu = kakubaMtamu XVI). 6 t + E dE tat + Ekamu = tadEkamu 7 t + I dI jagat + ISuDu = jagadISuDu

TABLE XV. POSSIBLE SPLITTING BY OBSERVING ONLY VOWEL ‘A’ 3) ‘dvirukta TakAra sandhi’: Occurrence of ‘T’ two times is called ‘dvirukta TakAramu’. Pattern sandhi Split forms Result NA/A a + a akAra brahma + arshi Fail NA Note: ‘dvi’ means two, ‘ukta’ means to tell, ‘TakAramu’ u + a ukAra brahmu + arshi Fail NA means the letter ‘’. i + a ikAra brahmi + arshi Fail NA aH + part2 visarga brahmaH + rshi Fail NA Rule: If the words ‘kuru, ciru, kaDu, niDu, naDu’ connected a + R guNa brahma + Rrshi Fail NA with a vowel, then the letters ‘ru’, ‘Du’ are replaced with ‘TT’ A + R guNa brahmA + Rrshi Fail NA (TABLE XIX).

TABLE XVI. POSSIBLE SPLITTING BY OBSERVING TWO LETTERS ‘AR’ TABLE XIX. ALL PATTERNS OF ‘DVIRUKTA TAKARA SANDHI’ Pattern sandhi Split forms Result NA/A S.No Pattern Res Example a + R guNa brahma + Rshi Succeeded A 1 ru + e TTe ciru + eluka = ciTTeluka A + R guNa brahmA + Rshi Fail NA 2 ru + u TTu kuru + usuru = kuTTusuru From the Tables 15 and 16 it is observed that when a 3 Du + a TTa kaDu + aluka = kaTTaluka conjunction results in two or more letters, the total numbers of 4 Du + i TTi naDu + illu = naTTillu letters are to be observed for splitting. 5 Du + U TTU niDu + Urpu = niTTUrpu 2) ‘yaNAdESa sandhi’: Though ‘yaNAdESa sandhi’ is an 6 Du + A TTA kaDu + Ayata = kaTTayata ‘ach sandhi’, it results in generating a consonant in the 7 Du + e TTe kaDu + edura = kaTTedura compound along with the vowel. Rule: When ‘pUrva svara’ is ‘i/u/R’ and ‘para svara’ is While checking vowels of the compound in vowel based ‘i/u/R’ then, ‘y’ replaces ‘i’, ‘v’ replaces ‘u’ and ‘r’ + vowel splitting, if the previous two letters of the vowel are ‘TT’, then replaces ‘R’ as ‘AdESam’ in the result (TABLE XVII). ‘dvirukta TakAra sandhi’ rules are followed in reverse Note: ‘ya, va, ra’ are called ‘yaNNs’ in Sanskrit grammar. engineering to find out easily the root words. When ‘sandhi’ is formed, ‘yaNNs’ comes as ‘AdESam’. That’s 4) ‘visarga sandhi’: This ‘sandhi’ also results in specific why this ‘sandhi’ is named ‘yaNAdESa sandhi’[3]. consonants along with vowels in some special cases. These While checking vowels of the compound in vowel based ‘consonant + vowel’ patterns are helpful in tracing root words splitting, if the vowel if preceded with the letter ‘y/v/r’ then by applying ‘visarga sandhi’ rules. consider both the letters ‘y/v/r’ + vowel to apply ‘yaNAdESa Rule 1: when ‘pUrva svara’ is ‘H’ and ‘para svara’ is sandhi’ rules in reverse engineering to find root words ‘S/sh/s’ then, ‘S/sh/s’ come as ‘AdESam’ respectively (TABLE effectively. XX).

14 | P a g e www.ijacsa.thesai.org (IJACSA) International Journal of Advanced Computer Science and Applications, Special Issue on Natural Language Processing 2014

TABLE XX. VOWEL PATTERNS OF ‘VISARGA SANDHI-1’ But in normal conversations, corrupted form ‘baLLu’ is S.No Pattern Res Example intermittently used for representing plurals for both. Likewise, ‘paLLu’ can also act as plural form for ‘paMDu’ (fruit) and 1 (v)H + S (v)SS tapaH + Sakti = tapaSSakti ‘pannu’ (teeth). ‘paMDlu’ and ‘paLLu’ are the plural forms of 2 (v)H + sh (v)shsh catuH + shashTi = catushshashTi the above respectively. When these are involved in 3 (v)H + s (v)ss manaH + sAkshi = manassAkshi conjunction, splitting becomes much difficult. For example Note: ‘v’ stands for ‘vowel’ ‘baLLunnavi’ is the compound formed by either ‘baDi + lu + unnavi’ (schools are there) or ‘baMDi’ + ‘lu’ + ‘unnavi’ (carts Rule 2: when ‘pUrva svara’ is ‘H’, for ‘para svara’ ‘c/ch’, are there). ‘AdESa’ is ‘S’, for ‘para svara’ ‘T,Th’, ‘AdESa’ is ‘sh’, for c) Colloquial forms: ‘para svara’ ‘t,th’, ‘AdESa’ is ‘s’ respectively (TABLE XXI). Colloquial or corrupted form of language is inevitable. These corrupted forms become impossible to split until unless TABLE XXI. VOWEL PATTERNS OF ‘VISARGA SANDHI-2’ they are maintained in Database. For example, S.No Pattern Res Example ‘rAmoDoccADu’ (Rama came) is the compound formed by 1 (v)H + c/ch (v)sc/ch manaH + calana= manascalana ‘rAmuDu’ + ‘occADu’ where ‘occADu’ is the corrupted form 2 (v)H + T/Th (v)shT/Th ushaH + TaMka = ushashTaMka of ‘vaccADu’. For successful splitting, either ‘occADu’ must 3 (v)H + t/th (v)stth manaH + tApam = manastApam also be available in database or a rule must be made to Note: ‘v’ stands for ‘vowel’ morph/consider it as ‘vaccADu’. While checking vowels of the compound in this splitting, if d) Problems caused by conjunctions: next two letters of the vowel are ‘SS/shsh/ss/sc/shT/st’, then Some conjunctions create difficulties in identifying the applying ‘visarga sandhi’ rules give better results in reverse root verb. For example, ‘mEDipaMDujUDu’ (see the fig fruit) engineering to find out the root words easily. is the compound of root words ‘mEDipaMDu’ +’ cUDu’. But according to conjunction rule (‘saraLAdESa sandhi’) they VII. DISCUSSIONS must be ‘mEDipaMDunu’ + ‘cUDu’ before conjunction. Here a) Inflections: ‘nu’ of the ‘pUrva pada’ is removed and ‘c’ of the ‘parapada’ Inflections are called ‘vibhaktis’ which play an important is converted to ‘j’ and ‘jUDu’ is not available in database. role in Telugu grammar. In Telugu, inflections occur at the ‘gasaDadavAdESa sandhi’ also creates similar difficulties rear part of a word which leads in altering the original form of in splitting. For example, ‘tallidaMDrulu’(parents) is the the root word. If the word is inflected then it is not possible to compound formed by ‘talli’ + ‘taMDrulu’. According to carryout splitting straightaway. All inflections must be conjunction rule, first letter of the ‘parapada’ is converted separated and splitting is applied to obtain root words. from ‘t’ to ‘d’ and ‘daMDrulu’ is not available in database. If Conjunction is possible not only with root words but also with ‘da’ is morphed to ‘ta’ for splitting, it leads to another inflected words. difficulty. For example, ‘Akalidappikalu’ (hunger and thirst) is For example, ‘bhUmyAkASamutO’ is the compound which the compound formed by ‘Akali’ + ‘dappikalu’ by is inflected (‘tO’) at rear end. It is separated first and split rule ‘gasaDadavAdESa sandhi’. If ‘da’ of this compound is is applied to obtain ‘bhUmi’ + ‘AkASamu’ (TABLE XVII). changed, then it becomes ‘tappulu’ (mistakes) thus providing wrong translation. If ‘pUrvapada’ is inflected and participated in conjunction, then it is difficult to find out root word. For example, in the VIII. CONCLUSION sentence ‘rAmuNNeMduku cUSAvu’ (why did you see Rama?) Though there are many issues involved in splitting, the compound is ‘rAmuNNeMduku’ and is to be split. It is splitting plays key role in MT. It paves a way to translate the known that this is formed by two words ‘rAmuNNi’ + source language as much as possible. Issues involved in the ‘eMduku’. Since the first word is inflected (‘rAmuDu’ + ‘ni’) splitting can be solved by applying appropriate properly and such words are not available in database as they are. In evolved morphological processes. such cases, splitting should be applied using morphology rules. All possible patterns to observe in a compound for vowel based splitting given in TABLE XXII. Applying longest b) Plural forms: pattern as much as possible gives good results. Apply the rule Plural forms are very common in any language. If they are of appropriate ‘sandhi’ for splitting. When there are no multi- involved in conjunction, splitting becomes difficult. For letter patterns available in compound, then it becomes example, ‘kukkalarupulu’ (barking of dogs) is a compound mandatory to observe only single vowel for splitting. This may formed by ‘kukkala’ + ‘arupulu’. ‘pUrva pada’ is a plural lead ambiguity in some cases. However, Vowel based splitting term and is inflected. Formation of some plural words is not can separate at least one proper word from compound from proper. For example ‘baLLu’ can be the plural form of either left to right, if any. One can find more patterns for some ‘baDi’ (school), or ‘baMDi’ (a cart). But ‘baDulu’ and special cases and can be included to split the compound very ‘baMDlu’ are the right plural forms of ‘baDi’ and ‘baMDi’ precisely. respectively.

15 | P a g e www.ijacsa.thesai.org (IJACSA) International Journal of Advanced Computer Science and Applications, Special Issue on Natural Language Processing 2014

TABLE XXII. TWO/THREE-LETTERS TO OBSERVE IN THIS SPLITTING [6] Akshar Bharati, Amba Kulkarni and V Sheeba, “Building a Wide Coverage Sanskrit Morphological Analyzer: A Practical Approach”, The S.No V If Nx / Pr Pattern Split Sandhi First Nat. Symp. on Modelling and Shallow Parsing of Indian pattern Languages, IIT Bombay, 2006. 1 a is – r Ar a + R guNa [7] Malhar Kulkarni, Chaitali Dangarikar, Iravati Kulkarni, Abhishek 2 a is – r ar a + Ru guNa Nanda, and Pushpak Bhattacharya, “Introducing Sanskrit Wordnet”, 3 a is – r ar A + R guNa Proc. of Global Wordnet Conf. 2010, , India, 2010. 4 a is – r ar A + Ru guNa [8] Arthur. A. McDonell, “Sanskrit Grammar for Students”, 3rd Edition, 5 a is – r ar aH+ part2 visarga Oxford University Press, 1926. 6 i is – r ir iH + part2 visarga [9] Joshi Shripad S. “Sandhi Splitting of Marathi Compound Words”, Int. J. 7 I is – r Ir IH + part2 visarga on Adv. Computer Theory and Engg., Vol. 2 Issue 2, 2012 8 u is – r ur uH+ part2 visarga 9 V is – y y+ V i + vowel yaNAdESa [10] S. Varakhedi, V. Jaddipal and V.Sheeba, “An Effort To Develop A 10 a is – v va u + a yaNAdESa Tagged Lexical Resource For Sanskrit”, FISSCL Paris, Oct 2007. 11 A is – v vA u + A yaNAdESa [11] Anil Kumar, Vipul Mittal, Amba Kulkarni “ 12 A is – v vA U + A yaNAdESa Processor”, Sanskrit Computational Linguistics, pp. 57-69, 2010 13 a is – r ra R + a yaNAdESa [12] Shathaka Sagaram, available at http://shathakasagaram.blogspot.in/ 14 V is – g g+ V k + V Jastva 2011/05/blog-post_03.html 15 V is – j j + V c + V Jastva [13] Jasti Suryanarayana, “Sanskrit for Telugu Students” , Sri Balaji Printers, 16 V is – D D + V T + V Jastva Tirupati, 1993 17 V is – d d+ V t + V Jastva [14] Divakrala Venkata Avadhani, “Telugu in Thirty Days”, Andhra Pradesh 18 V is – b b+ V p + V Jastva Sahithya Academy, 1976. 19 V is – TT TT + V Du/ru + V dviruktTakAr [15] Akshar Bharati, Vineet Chaitanya, Rajeev Sangal, “Natural Language 20 V is – SS V + SS V+H + S visarga Processing – A Paninian Perspective”, Prentice-Hall of India, 1994. 21 V is – shsh V + shsh V+H+sh Visarga [16] T. Suryakanthi, Dr. S. V. A. V. Prasad and Dr. T. V. Prasad “Translation 22 V is – ss V + ss V+H+s visarga of Pronominal Anaphora from English to Telugu Language”, Int. J. of 23 V is – sc V + sc V+H + c visarga Adv. Computer Sc. App., Vol. 4 Issue 4, 2013. 24 V is – sch V + sch V+H + ch visarga 25 V is – shT V + shT V+H + T visarga APPENDIX 26 V is – shTh V + shTh V+H + Th visarga 27 V is – st V + st V+H+t visarga Pronunciations: The letters should be pronounced 28 V is – sth V + sth V+H+th visarga normally as in English, except when they are italicized. If so, * Here ‘pr’ for previous, ‘nx’ for next, and ‘V’ for vowel. follow the TABLE XXIII.

REFERENCES TABLE XXIII. ROMAN TELUGU PRONUNCIATION [1] Malladi Krishna Prasad, “Telugu Vyaakaranamu”, Sri Venkateswara Book Depot, 2012. RT Usage as RT Usage as RT Usage as [2] Dr. Samudrala Vemkata Ramga Ramanujacharya, “Samskruta Vaani” a a in – That R Ru in – Ruk o O in - Obey Rohini Publications, 1997. A a in - jack Ru roo in – roof O oa in - Roar i i in - His e e in - When W Ou in - out [3] Kambhampati Ramagopala Krishnamurti, “Telugu Vyaakaranamu”, Sri Sailaja Publications, 1991. I Ea in- east U oo in – fool aM um in - sum u u in – Put E a in - Hate aH aH in – aH [4] A.H. Arden, “A Progressive Grammar of the Telugu Language”, 2nd U oo in –fool Y I in - Ice Edition, Society for promoting Christian Knowledge, Madras, 1905. *RT stands for Roman Telugu and the capitalized letters [5] Robert Caldwell, “A Comparative Grammar of The Dravidian or South- Indian Family of Languages”, 2nd Edition, 1875. should be pronounced with greater emphasis on them.

16 | P a g e www.ijacsa.thesai.org