The Journey of Indian Languages: Perspectives on Culture and Society ISBN : 978-81-938282-6-7

Morphological Analyzer for Sanskrit Language

Jaideepsinh K. Raulji
1 Lecturer, Ahmedabad University, Ahmedabad, Gujarat, India.
2 Research Scholar, Dr. Babasaheb Ambedkar Open University, Ahmedabad, Gujarat, India
Email: [email protected]

Dr. Jatinderkumar R. Saini
1 Professor & I/C Director, Narmada College of Computer Application, Bharuch, Gujarat, India
2 Research Supervisor, Dr. Babasaheb Ambedkar Open University, Ahmedabad, Gujarat, India
Email: [email protected]

Word-level processing requires knowledge of the structure and formation of words. Morphology is the branch of linguistics concerned with the systematic formation of words in a language from morphemes. Sanskrit is a predecessor of most Indian languages and a member of the Indo-European family. Knowledge of Sanskrit may help in understanding the structure and architecture of other Indo-Aryan languages spoken widely in the Indian subcontinent. Here, morphological analysis of Sanskrit words is carried out through the language's strong feature of postposition and preposition markers. After a Sanskrit sentence is tokenized, the retrieved words are compared against indeclinable and pronoun databases. The remaining unmatched words are checked for postposition and preposition markers to identify verb forms and then noun forms. The results form a basis for understanding the surface structure of a sentence and can be used to improve related systems such as Information Retrieval, Part of Speech Taggers and Machine Translation.

Keywords : Morphology, Inflection, Declension, Subanta, Tinanta.

1. INTRODUCTION :

Sanskrit belongs to the Indo-European family of languages and is considered a primary language of the Vedic civilization. It is one of the 22 languages listed in the Eighth Schedule of the Constitution of India. The literary work on Sanskrit grammar, the Ashtadhyayi, is a treatise by the sage Panini. In linguistics, morphology is the branch that deals with word formation, analysis and generation. Computational Morphology (CM) is the application of morphological rules in the field of computational linguistics. Morphological analysis is vital for building any basic NLP application, and for an inflectionally rich language like Sanskrit it provides ample information about a word along with the syntactic and semantic role it plays in a sentence.
Grammatical information such as gender, number, person and tense is marked through inflectional suffixes [4]. Computational Morphology deals with the processing of words in their graphemic and phonemic forms. Its most basic task can be defined as taking a string of characters or phonemes as input and delivering an analysis as output.

Sanskrit has a two-fold morphology: nominal (subanta forms) and verbal (tinanta forms). Sanskrit is rich in inflections, and inflectional morphology yields two kinds of padas: subanta padas (nominal words) and tinanta padas (verbal forms). The Paninian analysis of Sanskrit places every usable word under one of these two categories (subanta

Volume:3| 2019 www.baou.edu.in Page 198

and tinanta). With respect to inflection, no clear difference between nouns and adjectives is identified.

2. RELATED WORK :

Sanskrit is a free word order language, so its syntactico-semantic relations depend solely on word inflections. Hence, according to Pawan G, et al [1], dependency parsers are more suitable for the analysis of Sanskrit syntax. Because the language is not strictly positional, sentential discourse requires strong morphological analysis. To develop an algorithm for a Sanskrit parser, Shashank S and Raghav A [2] used a morphological analyzer, as Sanskrit words have rich case endings. They converted Devanagari format to ISCII format. Using a DFA, the root word is retrieved along with its attributes. Each word is checked against avyaya, pronoun, verb and noun trees sequentially. The whole analysis is done on the basis of a paradigm table. Amba K and Devanand S [3] graphically represented Sanskrit morphology as described by Panini. They built an FST for analyzing Sanskrit inflectional forms. Namrata T and Suresh J [5] built a rule-based POS tagger in which rules are stored in a database and each word is compared to the database after suffix stripping. They also introduced parsing of Sanskrit sentences using Lexical Functional Grammar [6][8]. Akshar B, et al [9] built a morphological analyzer using a modular programming approach, with modules for Sandhi and Samasa analysis and formation, and Subanta, Tinanta and Kridanta analyzers. A morphological and comparative study of Sanskrit and English was carried out by Promila B, et al [10] in their framework for English to Sanskrit MT. Sulabh B, et al [12] surveyed Sanskrit tagsets, Part of Speech tagging methods and techniques, and the issues in implementing statistical methods owing to the scarce availability of Sanskrit corpora. A survey focusing on Sanskrit grammar, models used for POS tagging and NLP analysis methods was done by Sharadha A, et al [15].
Notable work on analyzing inflectional morphology was carried out by Girish Nath Jha, et al [20]. Avyayas are recognized with the help of an avyaya database; verbs are recognized with a verb database in which the inflectional forms of the 450 most common verb roots are stored in a verb dictionary for matching; and subantas are recognized through database pattern matching [20].

3. SANSKRIT MORPHOLOGY :

Natural Language Processing is the scientific study of languages from a computational perspective [17]. Panini's grammar consists of nearly 4000 rules divided into 8 chapters. It describes the entire Sanskrit language with the detailed structure of its grammar. It is a peculiarity of Panini's word formation that he recognizes derivation by suffixes only. Panini's grammar even begins with the alphabet arranged on scientific principles. Morphology also refers to the grammatical information hidden in a word. Inflections are built into words, hence in most scenarios an auxiliary verb is not required. In Sanskrit, a fully inflected word expresses its grammatical role independently of other grammatical units. Sanskrit is rich in inflections. Inflectional morphology yields two kinds of padas: subanta padas (nominal words) and tinanta padas (verbal forms).

3.1 NOUN FORMS / SUBANTA PADA :

Inflectional forms, or declensions, of nouns, substantives and adjectives are considered Subantas in Sanskrit. Morphologically, nouns and adjectives behave in a similar way. The basic form of a noun is called a Pratipadika. A noun has 3 genders, namely masculine, feminine and neuter, and 3 numbers, namely Singular (only one), Dual (only two) and Plural (more than two). There are 8 cases in each number, namely Nominative, Accusative, Instrumental, Dative, Ablative, Genitive, Locative and Vocative. The case markers for substantives and adjectives remain almost the same as for nouns. Case Markers (Vibhakti) for a verb


give information about Tense, Aspect and Modality (TAM). Vibhaktis are so important to noun endings that even if the word order is changed the meaning remains the same. But if the case markings are changed, the semantics of the whole sentence is altered. Hence the vibhaktis are crucial in determining semantic roles. Karaka defines the relationship between a nominal and a verbal root. There are 3 persons in Sanskrit:

1. Uttam Purush (First Person) : It refers to the speaker, e.g. (अहम् गच्छामि) I am going.

2. Madhyam Purush (Second Person) : It refers to the person addressed, e.g. (त्वम् गच्छसि) You are going.

3. Pratham Purush (Third Person) : It refers to a person or thing spoken about, e.g. (सः गच्छति) He is going.

Following are the case markers (Karaka System) for noun forms.

Table 1 (below) – Nominal Case Markings for Gender - Masculine

Vibhakti [Cases] Masculine

Singular Dual Plural

Nominative ःः, ःा, ः ःः, ः , ः , ः , ःाःः, नः, ःः, ः ःः, ःान ् यः, वः

Accusative म ् ः , ः , ः , ःान,् न ् , ः न ् , ः न ्, ःः

Instrumental ः न, ः ण , ःा , ःा땍याम ् , 땍याम ् ः ःः , 땍यः , म ः ना , , याम ्,

Dative ःाय , ः , य ःा땍याम ् , 땍याम ् ः 땍यः , 땍यः , , याम ् यः

Ablative ःाि ् , ःः ःा땍याम"् , ः 땍यः, 땍यः , , ः ःः , ः ःः 땍याम ्, याम ् यः

Genitive य़ , ःः , ः ःः य ः , ः ःः , व ः ःानाम ्, ःाणाम ् , ः ःः , नः , , न ः , , णाम ् , नाम ् , ः नाम ्, ःाम ्

Locative ः , िः , ः , य ः , ः ःः , व ः ः षु , षु , क्षु , िु तन , व , न ः


Table 2 (below) – Nominal Case Markings for Gender - Feminine

Vibhakti [Cases] Feminine

Singular Dual Plural

Nominative ःा , ः ःः , ः , य , ः , ः , ःः , यः , ठः , ः ःः , ः ःः , वः उः

Accusative म ् ः , य , ः , ः , ःः , ः ःः

Instrumental या , याम ्, वा , 땍याम ्, याम ् म ः , िःःः ःा

Dative य , य , व , व 땍याम ्, याम ् 땍यः , ः

Ablative याः , ः ःः , 땍याम ्, 땍याम 땍यः ः ःः , वाः , न ः

Genitive याः , वाः य ः , व ः नाम ्, ःाम ्

Locative याम ् , ःाम ् , य ः , व ः , ः ःः िु , षु , क्षु िः

Table 3 (below) – Nominal Case Markings for Gender - Neuter

Vibhakti [Cases] Neuter

Singular Dual Plural

Nominative म ् , िः , ठः ः , ः , िःण , ःातन , णण , , ःु ण , न रीणण , तन , िति , िः

Accusative म ्, िः , ःु , ः , िःण , न ःातन , णण ,


ः णण , तन , , िति , िः

Instrumental ः न , णा , ना , 땍याम ्, 땍याम ः ःः , म ः

Dative ःाय , ण , न 땍याम ्, 땍याम 땍यः

Ablative ःाि ्, णणः , नः 땍याम ् ः 땍यः , 땍यः

Genitive य़ , णः , नः य ः , ण ः , न ः ःानाम ् , नाम ् , णाम ्

Locative ः , तन , णण य ः , ण ः , न ः षु , क्षु
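The lookup implied by Tables 1 to 3 can be sketched as a mapping from each ending to every case and number it may mark. This is a minimal illustration under stated assumptions, not the rule base of the implemented system: the four endings are a small masculine subset taken from the standard paradigm of राम, and the names `MASCULINE_ENDINGS` and `analyze_subanta` as well as the longest-match policy are hypothetical.

```python
# Illustrative subset of masculine case endings (an assumption for this
# sketch, not the authors' full rule base). Each ending maps to every
# (case, number) reading it can mark.
MASCULINE_ENDINGS = {
    "ः": [("Nominative", "Singular")],
    "म्": [("Accusative", "Singular")],
    "भ्याम्": [
        ("Instrumental", "Dual"),
        ("Dative", "Dual"),
        ("Ablative", "Dual"),
    ],
    "षु": [("Locative", "Plural")],
}


def analyze_subanta(word):
    """Return every (ending, case, number) reading for the longest matching ending."""
    # Longest ending first, so "भ्याम्" wins over the shorter "म्" it contains.
    for ending in sorted(MASCULINE_ENDINGS, key=len, reverse=True):
        # Require a non-empty stem in front of the ending.
        if word.endswith(ending) and len(word) > len(ending):
            return [(ending, case, number)
                    for case, number in MASCULINE_ENDINGS[ending]]
    return []


if __name__ == "__main__":
    for w in ["रामः", "रामम्", "रामाभ्याम्", "रामेषु"]:
        print(w, analyze_subanta(w))
```

A form such as रामाभ्याम् returns three readings (Instrumental, Dative and Ablative Dual), which is the kind of multiple tagging that arises when one marker is shared by several cases.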

3.2 PRONOUN

All categories of pronoun (Personal, Demonstrative, Relative, Reflexive, Indefinite, Correlative, Reciprocal, Possessive, Pronominal), with their cases, numbers and genders, are added directly to the pronoun database. Each tokenized word is first compared against this database; if it matches, it is declared a pronoun and the algorithm does not process it further.

3.3 INDECLINABLES or AVYAYAS :

A word whose form remains the same in all genders, numbers and cases is an Indeclinable (Avyaya). The indeclinables consist of Prepositions, Adverbs, Particles, Conjunctions and Interjections. In the implemented system, all indeclinables are added to a database and words are compared against it directly after tokenization.

3.4 VERB FORMS / TINANTA :

Verbs are divided into 10 conjugational classes. For implementation purposes they are grouped here into only two categories: the 1st, 4th, 6th and 10th classes form the first category and the remaining classes form the second. In Sanskrit there are two kinds of verbs, namely Primitive and Derivative. There are 6 tenses and 4 moods. Tenses and moods are referred to as Lakaras in Sanskrit. The tenses are Present, Aorist, Imperfect, Perfect, First Future and Second Future. The moods are Imperative, Potential, Benedictive and Conditional. There are also two sets of personal terminations, namely Parasmaipada and Atmanepada. Parasmaipada verbs are those where the fruits of the action do not go to the one who acts, i.e. other-serving verbs. Atmanepada verbs are those where the fruits of the action go to the one who acts, i.e. self-serving verbs. In the implemented system, verb endings are extensively covered with their classes, tenses, moods and personal terminations. Voices are not implemented in the system. Based on inflectional endings, the listing is as follows.

Table 4 (below) - Verbal Conjugational Classes Group 1, Verbal Classes 1st, 4th, 6th and 10th.

Present Tense [Parasmaipada] Present Tense [Atmanepada]

S D P S D P


FP FP मम वः मः ः वह मह

SP SP मि थः थ ि ः थ 鵍व

TP TP ति िः िति ि ः ि ति

Imperfect (Past) [Atmanepada] Imperfect (Past) [Parasmaipada]

S D P S D P

FP FP अ..ः अ..वहह अ..महह अ..म ् अ..व अ..म

SP SP अ..थाः अ..थाम ् अ..鵍वम ् अ..ःः अ..िम ् अ..ि

TP TP अ..ि अ..िाम ् अ..ति अ..ि ् अ..िाम ् अ..न ्

Imperative Mood [Parasmaipada] [Atmanepada]

S D P S D P

FP FP ःातन ःाव ःाम ः ःावह ःामह

SP - SP िम ् ि व थाम ् 鵍वम ्

TP TP िु िाम ् ति ु िाम ् ः िाम ् तिाम ्

Potential Mood [Atmanepada] Potential Mood [Parasmaipada]

S D P S D P

FP FP ः य ः वहह ः महह ः यम ् ः व ः म

SP SP ः थाः ः थाम ् ः 鵍वम ् ः ःः ः िम ् ः ि

TP TP ः ि ः यािाम ् ः रन ् ः ि ् ः िाम ् ः युः

Table 5 (below) - Verbal Conjugational Classes Group 2, Verbal Classes 2nd ,3rd ,5th ,7th ,8th ,9th


Present Tense[Parasmaipada] Present Tense[Atmanepada]

S D P S D P

FP FP मम वः मः ः वह मह

SP SP मि/षष थः थ ि ः थ 鵍व

TP TP ति िः िति ि ःाि ि

Imperfect (Past) [Parasmaipada] Imperfect (Past) [Atmanepada]

S D P S D P

FP FP अ..म ् अ..व अ..म अ..िः अ..वहह अ..महह

SP SP अ..ःः अ..िम ् अ..ि अ..थाः अ..ःाथाम ् अ..鵍वम ्

TP TP अ..ि ् अ..िाम ् अ..न ् अ..ि अ..िाम ् अ..ि

Imperative Mood [Parasmaipada] Imperative Mood [Atmanepada]

S D P S D P

FP FP ःातन ःाव ःाम ः ःावह ःामह

SP - SP िम ् ि व / �व ःाथाम ् 鵍वम ्

TP TP िु िाम ् ति ु िाम ् ःािाम ् िाम ्

Potential Mood [Parasmaipada] Potential Mood [Atmanepada]

S D P S D P

FP FP याम ् याव याम ः य ः वहह ः महह

SP SP थाः यािम ् याि ः थाः ः थाम ् ः 鵍वम ्


TP TP याि ् यािाम ् युः ः ि ः यािाम ् ः रन ्

First / Periphrastic Future [Parasmaipada] First / Periphrastic Future [Atmanepada]

S D P S D P

FP FP िािम िावः िामः िाह िावह िामह

SP SP िामि िाथः िाथ िाि िािाथ िा鵍व

TP TP िा िार िारः िा िार िारः

Second Future [Parasmaipada] Second Future [Atmanepada]

S D P S D P

FP FP �यामम �यावः �यामः �य �यावह �यामह

SP SP �यमि �यथः �यथ �यि �य थ �य鵍व

TP TP �यति �यिः �यिति �यि �य ि �यति

�य can also have forms like क्ष्य / �य can also have forms like क्ष्य / य य

Conditional Mood [Atmanepada] Conditional Mood [Parasmaipada] S D P S D P FP अ..�य अ..�याव अ..�यामहह FP अ..�य अ..�या अ..�या हह म ् व म SP अ..�य अ..�य थाम ् अ..�य鵍व SP अ..�यः अ.. अ..�यि थाः म ् �यिम ् T अ..�यि अ..�य िाम ् अ..�यति T P अ..�यि ् अ.. अ..�यन ् P


�यिाम ् �य can also have forms like क्ष्य / य

�य can also have forms like क्ष्य / य

Perfect (Past) [Parasmaipada] Perfect (Past) [Atmanepada]

S D P S D P

FP FP व म ः वह मह

SP - SP थ थःु ष ःाथ 鵍व

TP - TP िुः ःुःः ः ःाि िःर

Aorist (Past)[Parasmaipada] Aorist (Past)[Atmanepada]

S D P S D P

FP FP अ..म ् अ..व अ..म अ..म ् अ..वहह अ..महह

SP SP अ..ःः अ..िम ् अ..ि अ..ःाःः अ..थाम ् अ..鵍वम ्

TP TP अ..ि ् अ..िाम ् अ..न ् अ..ि ् अ..िाम ् अ..ःुःः

Benedictive [Parasmaipada] Benedictive [Atmanepada]

S D P S D P

FP FP यािम ् याव याम ष य ष वहह ष महह

SP SP याः यािम ् याि ष �ठाः ष याथाम ् ष 鵍वम ्

TP TP याि ् यािाम ् यािुः ष �ट ष यािाम ् ष रन ्

ष can also have forms like ि
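The verb tables above pair an optional prefix (the augment अ, written अ.. in the Imperfect rows) with a personal ending. Such affix pairs can be encoded as regular expressions, as in the following sketch. It is an illustration under stated assumptions rather than the rule base of the implemented system: only a few standard Parasmaipada endings are listed, the rule order (most specific first, first match wins) is a design choice made here, and `VERB_RULES` and `analyze_tinanta` are hypothetical names.

```python
import re

# Illustrative subset of Parasmaipada rules (assumed encoding, not the
# authors' rule base). Each rule: (pattern, tense, person, number).
# More specific endings come first, so "न्ति" is tried before "ति".
VERB_RULES = [
    (re.compile(r".+न्ति"), "Present", "Third", "Plural"),
    (re.compile(r".+ति"), "Present", "Third", "Singular"),
    (re.compile(r".+सि"), "Present", "Second", "Singular"),
    (re.compile(r".+मि"), "Present", "First", "Singular"),
    # Imperfect: augment "अ" before the stem, ending "त्" after it.
    (re.compile(r"अ.+त्"), "Imperfect", "Third", "Singular"),
]


def analyze_tinanta(word):
    """Return (tense, person, number) for the first matching rule, else None."""
    for pattern, tense, person, number in VERB_RULES:
        if pattern.fullmatch(word):
            return (tense, person, number)
    return None


if __name__ == "__main__":
    for w in ["गच्छामि", "गच्छसि", "गच्छति", "गच्छन्ति", "अगच्छत्"]:
        print(w, analyze_tinanta(w))
```

A purely affix-based rule can also fire on a non-verb that happens to end in the same syllable (for example the noun मति), which is one reason a pronoun and indeclinable lookup before this step and a dictionary check after it are useful.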

4. METHODOLOGY IMPLEMENTED

The four pillars of NLP can be considered to be Lexical (word analysis), Syntactic (arrangement of words), Semantic (meaning of words) and Discourse (World Knowledge


Resolution) Analysis [17]. Here we have tried to implement the basic pillar of NLP, viz. Word Analysis, as a Morphological Analyzer. The text encoding used is UTF-8 with Devanagari script. The implementation proceeds as follows:

1. Build the Indeclinable (Avyaya) and Pronoun databases.
2. Input raw Sanskrit text.
3. Normalize the input text. Normalization covers removal of unwanted spaces, punctuation marks and foreign letters/alphabets.
4. Tokenize the text on spaces, converting it to words.
5. Compare each word with the pronoun database. If it matches, generate the appropriate result; otherwise forward the word to the next module.
6. If the word does not match as a pronoun, match it against the indeclinable database. If it matches, generate the appropriate result; otherwise forward it to the verb forms rule base.
7. Process verb forms (Tinanta) with a regex pattern matcher whose affixes are developed and added manually to the verb form rule base, as shown in the verb forms tables.
8. If verb form matching fails, match the word against Subanta (noun form) affixes through the regex pattern matcher.
9. If all the rules fail, look the word up in the Sanskrit dictionary.
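The cascade in steps 3 to 9 can be sketched as follows. This is a minimal illustration, not the implemented system: the database contents, the handful of affix patterns, the empty `DICTIONARY`, and the names `normalize`, `classify` and `analyze` are all assumptions made for the example. Normalization is approximated by keeping only code points in the Unicode Devanagari block (U+0900 to U+097F) and dropping the danda and double danda.

```python
import re

# Tiny stand-ins for the databases and rule bases (all entries are assumptions).
PRONOUNS = {"अहम्", "त्वम्", "सः"}
AVYAYAS = {"च", "न", "अपि"}
VERB_PATTERNS = [re.compile(p) for p in (r".+ति", r".+मि", r".+सि")]
NOUN_PATTERNS = [re.compile(p) for p in (r".+ः", r".+म्")]
DICTIONARY = set()  # placeholder for the fallback Sanskrit dictionary


def normalize(text):
    """Step 3: drop foreign letters, punctuation and surplus spaces."""
    # Replace anything outside the Devanagari block (and whitespace) by a space.
    text = re.sub(r"[^\u0900-\u097F\s]", " ", text)
    # Danda and double danda sit inside the block but are punctuation.
    text = re.sub(r"[\u0964\u0965]", " ", text)
    return " ".join(text.split())


def classify(word):
    """Steps 5 to 9: try each module in order and stop at the first match."""
    if word in PRONOUNS:
        return "Pronoun"
    if word in AVYAYAS:
        return "Avyaya"
    if any(p.fullmatch(word) for p in VERB_PATTERNS):
        return "Tinanta"
    if any(p.fullmatch(word) for p in NOUN_PATTERNS):
        return "Subanta"
    if word in DICTIONARY:
        return "Dictionary"
    return "Unknown"


def analyze(text):
    """Steps 3 and 4 followed by classification of each token."""
    tokens = normalize(text).split()  # step 4: tokenize on spaces
    return [(token, classify(token)) for token in tokens]


if __name__ == "__main__":
    print(analyze("सः वनम् गच्छति।"))
```

Because each module returns as soon as it matches, a word found in the pronoun or indeclinable database never reaches the affix rules; the order of the checks therefore matters as much as the rules themselves.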

5. CONCLUSION :

The algorithm and its implementation are designed for Sandhi-free text. The result for a compounded word will be based completely on inflectional affixes. Hence the efficiency of the tool can be increased by adding a compound dissolution module. The system performs well for pronouns and indeclinables. Some case markings for noun forms (subanta) remain the same across 2 or more cases or numbers. The inflectional markings are also repeated in verbal forms (tinanta) across the 10 conjugational classes and personal terminations like atmanepada and parasmaipada. Hence multiple taggings are to be expected for a single verb form or noun form. Still, the performance of the tool is improved by incorporating a morphologically rich dictionary. It is possible to store all the inflected forms of words in a


dictionary, but the size of the dictionary might increase to an unmanageable level, and it also defeats the concept of linguistic generalization.

REFERENCES :

[1] Pawan Goyal, Gerard Huet, Amba Kulkarni, Peter Scharf, Ralph Bunker, "A Distributed Platform for Sanskrit Processing", Proceedings of COLING 2012, pp 1011-1028.
[2] Shashank Saxena and Raghav Agrawal, "Sanskrit as a Programming Language and Natural Language Processing", Global Journal of Management and Business Studies, Vol 3, No. 10, pp 1135-1142, 2013.
[3] Amba Kulkarni and Devanand Shukl, "Sanskrit morphological analyser: Some issues", Indian Linguistics 70.1-4 (2009): 169-177.
[4] Vishal Goyal and Gurpreet Singh Lehal, "Hindi Morphological Analyzer and Generator", First International Conference on Emerging Trends in Engineering and Technology, IEEE, 2008.
[5] Namrata Tapaswi and Suresh Jain, "Treebank Based Deep Grammar Acquisition and Part of Speech Tagging for Sanskrit Sentences", CSI 6th International Conference on Software Engineering (CONSEG), IEEE, Sept 2012.
[6] Namrata Tapaswi, Suresh Jain, Vaishali Chourey, "Parsing Sanskrit Sentences using Lexical Functional Grammar", International Conference on Systems and Informatics (ICSAI), IEEE, 2012.
[7] Ved Kumar Gupta, Namrata Tapaswi and Suresh Jain, "Knowledge Representation of Grammatical Constructs of Sanskrit Language Using Rule Based Sanskrit Language to English Language Machine Translation", International Conference on Advances in Technology and Engineering (ICATE), Jan 2013.
[8] Namrata Tapaswi and Suresh Jain, "Knowledge Representation of Grammatical Constructs of Sanskrit Language and Modular Architecture of ParGram", International Conference on Advances in Technology and Engineering (ICATE), Jan 2013.
[9] Akshar Bharati, Amba Kulkarni, and V Sheeba, "Building a Wide Coverage Sanskrit Morphological Analyzer : A Practical Approach", The First National Symposium on Modelling and Shallow Parsing of Indian Languages, IIT Bombay, 2006.
[10] Promila Bahadur, Ajai Jain, Durg Singh Chauhan, "Architecture of English to Sanskrit Machine Translation", SAI Intelligent Systems Conference, London UK, 2015.
[11] Vishal Goyal, Gurpreet Singh Lehal, "Hindi Morphological Analyzer and Generator", First International Conference on Emerging Trends in Engineering and Technology, IEEE, 2008.
[12] Sulabh Bhatt, Krunal Parmar and Miral Patel, "Sanskrit Tag-sets and Part-of-Speech Tagging Methods - A Survey", International Journal of Innovative and Emerging Research in Engineering, Vol 2, Issue 1, 2015.
[13] Oliver Hellwig, "SanskritTagger: A Stochastic Lexical and POS Tagger for Sanskrit", In: Huet G., Kulkarni A., Scharf P. (eds) Sanskrit Computational Linguistics. Lecture Notes in Computer Science, vol 5402. Springer, Berlin, Heidelberg, 2009.
[14] Aashish Pappu and Ratna Sanyal, "Vaakkriti : Sanskrit Tokenizer", Indian Institute of Information Technology, Allahabad (U.P.), India, IJCNLP, 2008, pp. 577-582.
[15] Sharadha Adinarayanan, N. Sri Ranjaniee and Naren. J, "Part of Speech Tagger for Sanskrit : A State of Art Survey", International Journal of Applied Engineering Research, ISSN 0973-4562, 2015.
[16] Akshar Bharati, Vineet Chaitanya, Rajeev Sangal, "Natural Language Processing - A Paninian Perspective" [Book], PHI Learning Pvt Ltd, August 2016.
[17] Ela Kumar, "Natural Language Processing", IK International Publishing House Pvt Ltd, Reprint 2012.
[18] M.R. Kale, "A Higher Sanskrit Grammar", Motilal Banarsidass Publishers Pvt. Ltd., 11th Reprint 2016.
[19] Subhash Chandra and Girish Nath Jha, "Morphological Analysis of Nominal inflections in Sanskrit", presented at Platinum Jubilee International Conference, L.S.I. at Hyderabad University, Hyderabad, pp-34, 2005.
[20] Girish Nath Jha, Muktanand Agrawal, Subhash Chandra, Sudhir Mishra, Diwakar Mani, Diwakar Mishra, Manji Bhadra, Surjit Singh, "Inflectional Morphology Analyzer for Sanskrit", In A. Kulkarni and G. Huet, editors, Sanskrit Computational Linguistics 1 and 2, pages 219-238, Springer Verlag LNAI 5402, 2009.
