A Benchmark Corpus and Neural Approach for Derivative Nouns Analysis Arun Kumar Singh Dr. Prathosh A. P. Indian Institute of Technology - Delhi Department of Electrical Engineering [email protected] [email protected] Sushant Dave Shresth Mehta Institute of Technology - Delhi Indian Institute of Technology - Delhi [email protected] [email protected] Prof. Brejesh Lall Indian Institute of Technology - Delhi [email protected] Abstract dona, 1997) wrote a treatise on Sanskrit grammar named Ashtadhyayi, in which Panini formalized This paper presents first benchmark corpus rules on linguistics, syntax and grammar for San- of Sanskrit Pratyaya (suffix) and inflectional words (padas) formed due to suffixes along skrit. Panini’s grammar is globally appreciated for with neural network based approaches to pro- its insightful analysis of Sanskrit and completeness cess the formation and splitting of inflectional of its descriptive coverage of the spoken standard words. Inflectional words spans the primary language of Panini’s time. and secondary derivative nouns as the scope Ashtadhyayi 1 is the oldest surviving text and the of current work. Pratyayas are an important most comprehensive source of grammar on San- dimension of morphological analysis of San- skrit today and provides often unique information skrit texts. There have been Sanskrit Computa- Ash- tional Linguistics tools for processing and ana- on Vedic, regional and socio-linguistic usage. lyzing Sanskrit texts. Unfortunately there has tadhyayi literally means eight chapters and these not been any work to standardize & validate eight chapters contain around 4000 sutras or rules these tools specifically for derivative nouns in total. These rules completely define the San- analysis. In this work, we prepared a Sanskrit skrit language as it is known today. Ashtadhyayi is suffix benchmark corpus called Pratyaya-Kosh remarkable in its conciseness and contains highly to evaluate the performance of tools. We also systematic approach to grammar. Because of its present our own neural approach for derivative well defined syntax and extensively well codified nouns analysis while evaluating the same on most prominent Sanskrit Morphological Anal- rules, many researchers have made attempts to cod- ysis tools. This benchmark will be freely dedi- ify the Panini’s sutras as computer programs to cated and available to researchers worldwide analyze Sanskrit texts. This paper tries to address and we hope it will motivate all to improve the problem of unavailability of benchmark cor- morphological analysis in Sanskrit Language. pus and provides morphological analysis method for derivative nouns as a result of Sanskrit suffixes 1 Introduction applied on root verbs and nouns using a machine Sanskrit is considered as one of the oldest Indo- learning approach. Aryan languages. The oldest known Sanskrit texts 1.1 Introduction of Pratyaya in Sanskrit are estimated to be dated around 1500 BCE. A large arXiv:2010.12937v1 [cs.CL] 24 Oct 2020 corpus of religious, philosophical, socio-political Different ways of inflectional word formation as and scientific texts of multi cultural Indian Subcon- mentioned by (Kiparsky, 2008) are as below: tinent are in Sanskrit. Sanskrit, in its multiple vari- - [Root + Suffix] Root: desideratives, intensives. ants and dialects, was the Lingua Franca of ancient - [Word + Suffix] Root: denominal verbs. - [Root + Suffix] Stem: primary (krit) suffixes. India (Coward, 1990). Therefore, Sanskrit texts - [Word + Suffix] Stem: secondary (taddhit) suf- are an important resource of knowledge about an- fixes. cient India and its people. Earliest known Sanskrit - [Word + Word] Stem: compounding. - [Root + Suffix] Word: verb inflection. documents are available in the form called Vedic - [Stem + Suffix] Word: noun inflection. Sanskrit. Rigveda, the oldest of the four Vedas, that are the principal religious texts of ancient India, Here in this introduction, we will explain more is written in Vedic Sanskrit. In sometime around about primary and secondary suffixes. Sanskrit is a 5th BCE, a Sanskrit scholar named Panini (Car- 1https://www.britannica.com/topic/Ashtadhyayi rich inflected language and depends on nominal and simple algorithm was developed which is described verbal inflections for communication of meaning in (Bharati et al., 2002). This algorithm has been (Murali et al., 2014). A fully inflected unit is called used to develop morph analyzer for different In- pada. The subanta padas are the inflected nouns dian languages (IIIT-H). Separate modules have and the tinanta padas are the inflected verbs. been developed to handle subanta, tinanta and Kri- danta words. A list of lexicon is extracted from 1.1.1 Kridanta subanta (Primary Derivative Monier William’s dictionary. Kridanta analyzer Nouns) works based on rule based approach to retrieve These are formed when the primary affixes called pratyaya and upasarga form lexicon dictionary. It krit are added to verbs to derive substantives, adjec- doesn’t provide any module to analyze other deriva- tives or indeclinable. DAtum nAma karoti iti krit. tional suffixes such as taddhit but claims for a pro- Kridanta play a vital role in understanding Sanskrit vision of plug in. Krishna et al.(2017) proposed language. Many morphological analyzers are lack- an approach for analysis of derivational nouns in ing the complete analysis of Kridanta. Examples Sanskrit. This approach attempted to build a semi of krit pratyaya are as below: supervised model that can identify usage of derived paW (root verb)+ tavya (krit suffix) = paWitavyam words in a corpus and map them to their corre- paW (root verb)+ tumun (krit suffix) = paWitum sponding source words. Special interest was given krit suffixes are mainly of seven types viz. tavyat, to the usage of secondary derivative affixes in San- tavya, anIyara,Ryat, yat, kyap, kelimer. skrit ie. taddhit. When it comes to automating taddhit, the Sanskrit Heritage System (Goyal and 1.1.2 Taddhitanta subanta (Secondary Huet, 2013) is an existing system that can recog- Derivative Nouns) nize taddhit and perform the analysis, but it does The secondary derivative affixes called taddhit de- not generate the taddhit and only the lexicalized rive secondary nouns from primary nouns. Some taddhit are recognized during the analysis. examples of taddhit pratyaya are as below: Krishna and Goyal(2015) attempted to auto- Indra (noun) + aR (taddhit suffix) = Endra mate the process of deriving taddhit in complete Dana (noun) + vatup (taddhit suffix) = Danavat adherence to Ashtadhyayi. The proposed system adopted a completely object oriented approach in taddhit suffixes are mainly of fourteen types. modelling Ashtadhyayi. The rules of Ashtadhyayi 2 Existing Work on Sanskrit Derivative were modeled as classes and were the environment Nouns Analysis that contains the entities for derivation. In their work the rule group was achieved through forma- Chandra and (2006) presented a model in the tion of inheritance network. The current resources form of a tool for recognition and analysis of nom- available for finding out derivational nouns in open inal morphology (Sanskrit subanta padas) in San- domain are not very accurate. Two most popular skrit text for Machine Translation. The tool recog- publicly available set of Pratyaya Analysis tools nizes nominal morphology with the help of avyaya viz. JNU Sanskrit Kridanta Analyzer 2 and UoH and verb database and does analysis of Sanskrit Morphological Generator 3 are mentioned in table subanta padas. Murali et al.(2014) provided an 1. approach to deal with Kridanta. Morphological dictionaries for upapadas, upasargas, roots and suf- 3 Motivation fixes were created and rule based avyaya analyzer was developed which does Morphological Analy- Researchers have earlier attempted to develop mor- sis based on identification & filtering of upapadas, phological analyzer for Sanskrit such as cdac 4, upsargas based on dictionary. Sanskrit academy 5 and (Huet, 2003)). But having The method adopted by Bharati et al.(2006) constraint of limited coverage or not being avail- was a paradigm based approach where a student able openly, it is difficult to reuse it for further is taught the word forms of a common word .g. applications in NLP area. Also it is not possible deva in Sanskrit and that it is the default paradigm 2 for ’a’ ending masculine words. Further the list of http://sanskrit.jnu.ac.in/kridanta/ktag.jsp 3http://sanskrit.uohyd.ac.in/scl/ exceptional words and the forms where they dif- 4http://www.cdac.in/html/ihg/ihgidx.asp fer are taught separately. Following this method, a 5http://www.sanskritacademy.org/ Figure 1: Inflectional Word Hierarchy

Tool Name Description Sanskrit Kridanta Ana- The ”Sanskrit Kridanta Analyzer” was partially completed as lyzer (JNU) part of M.Phil. research submitted to Special Center for Sanskrit Studies, JNU in 2008 by Surjit Kumar Singh (M.Phil 2006-2008) under the supervision of Dr. Girish Nath Jha . It facilitates the split of Kridanta into root verb and krit pratyaya Morphological Generator Morphological Generator shows the inflectional, and (some) (UoH) derivational forms of a given noun or a verb. This tool has been developed by University of Hyderabad (UoH)

Table 1: Open Available Tools for Derivative Nouns (Pada) Analysis for a person from a non-Sanskrit background to pared Pratyaya-Kosh in the form of root, suffix and develop applications and systems in Sanskrit mor- derivative noun due to affix. This data can be easily phology, even though Sanskrit is well codified lan- used to develop analyzers and can be trained to guage. Among various facets of Morphological learn rules using various algorithms automatically. Analysis, suffix analyzer is necessary for a good coverage morphological analyzer. Our proposed 4 Method Sanskrit noun derivatives analyzer will facilitate 4.1 Kridanta Pada Formation Method the work in below areas. Kridanta Pada formation task is similar to language • Text to speech synthesis system translation problem where a sequence of characters • Neural Machine Translation from non-Sanskrit to San- or words produces another sequence and lengths skrit language and vice versa of the inputs and outputs are not fixed. RNNs are • Sanskrit Language morphological analysis widely used to solve such problems. Sequence to Pratyayas are not easily available for building sequence model introduced by (Sutskever et al., Sanskrit based applications either in printed or e- 2014) is especially suited to such problems, there- forms. Though some sources are available such as fore it was used in this work. The training and test Panini Pratyayartha Kosha Taddhita Prakaranam data were in ITRANS format 8. This and Panini Kridanta Pratyaya Artha Kosha by data was converted to SLP1 9 as SLP1 was found Dr. Gyanprakash Shastri 6 , Laghu Siddhant Kau- more suited for proposed approach. The code was mudi 7 and Kridant Karak Prakaranam by Bhim- implemented in python 3.5 with Keras API running sen Shashtri but one has to have good understand- on TensorFlow backend. ing of Sanskrit to analyze it and use it to de- Proposed model expects 2 words (root + krit velop some analysis tool or system. We have pre- pratyaya) concatenated using a ”+” symbol as input

6https://archive.org/ 8https://www.aczoom.com/itrans/ 7https://chaukhambapustak.com/ 9https://en.wikipedia.org/wiki/SLP1 Figure 2: Architecture for Learning Derivative Noun (Pada) Formation and outputs the primary derivative noun (Kridanta) to the encoder part improved the accuracy slightly as shown in the example below: so we opted for using attention in our model. Both tul + lyuw = tolanam the encoder and decoder use the hidden unit size (latent dimension) as 128. The training vectors Results achieved from this approach seem to be were divided in batches of 32 vectors and total of good. Proposed approach tries to address this prob- 70 epochs were run to get the final results. lem by giving all the characters of the first input word and all the characters of the second input word 4.2 Taddhitanta Pada Formation Method concatenated with a ‘+’ symbol as the input to the Taddhitanta are formed from taddhit pratyayas encoder LSTM of the sequence to sequence model. much like Kridanta are formed from krit pratyayas. Due to unavailability of enough data we based our Below example shows this formation: analysis on the 10 pratyayas out of 12. Input se- sUrasena + Rya = sOrasenyaH quence is set as the two input words concatenated with a ‘+’ character between the 2 words. Out- This is similar to Kridanta formation and thus put sequence is the Kridanta (primary derivative we used the same sequence to sequence model. noun) single word. Characters ‘&’ and ‘$’ were However the data that we had in case of taddhit used as start and end markers respectively in the pratyaya was limited and imbalanced for training output sequence. The maximum lengths of input the neural network. The rules for Taddhitanta for- and output sequences were 17 and 18 respectively. mation moreover are considered to be more com- So we used ‘*’ for padding shorter input sequences. plicated than those for Kridanta formation. We We used the same dictionary for input and output believe these are the reasons we did not get very characters. The dictionary contained just 53 char- high but a satisfactory accuracy on Taddhitanta un- acters and characters usually don’t have too much like Kridanta which performed much better. The correlation as can be observed for words. This is accuracy mentioned is calculated as: the reason that we used one hot encoded represen- (Number of correct formations) / (Total number tation of the characters instead of training any word of test samples) embedding. The best results were achieved with an The character wise accuracy, however is quite LSTM (Hochreiter and Schmidhuber, 1997) as - high ( 98%), which means that the model incor- sic RNN cell for decoder and bidirectional LSTM rectly predicts just one or two characters rather as basic RNN cell for encoder. Adding attention than whole words in incorrect formations. 4.3 Derivative Noun (Pada) Split Method both the cases, pada formation and pada split, to- Conceptually, the model architecture explained in tal records (tuples) were considered as 21,980 for Section 4.1 for Kridanta formation and Taddhi- Kridanta and 3088 for Taddhitanta. 80% examples tanta formation respectively, can also be used for used for training & validation ( 17,584 for Kridanta, splitting the Kridanta and Taddhitanta back into 2470 for Taddhitanta) and remaining 20% exam- root and pratyaya, where the input and output are ples used for model testing(4396 for Kridanta and swapped with compound word as input and the two 618 for Taddhitanta). In case of Derivative noun or initial words concatenated with ‘+’ character as pada formation, evaluation is based on exact match output. The accuracy achieved with this approach of whole pada. To evaluate the pada split, both was better than that obtained for pada formation. suffix as well as original root stem or noun should Our model seems to perform fairly well even for be correct to consider it as success. Even if the longer sequences of length greater than 15. model confuses between a & A (SLP1 encoding), it is considered a failure. Results from method de- 5 Data and Evaluation Results scribed above were bench marked with results from other publicly available tools as mentioned in table 5.1 Pratyaya-Kosh 1. Test set from Pratyaya-Kosh was kept same for Data for Pratyaya-Kosh was extracted from 4 dif- comparing with UoH & JNU tools. UoH tool does ferent sources viz. Panini Pratyayarth kosh, San- pada split for both primary as well as secondary skrit Hindi Kosh dictionary, KridantaRupaMala & derivative nouns where as JNU tool does pada split Sankrit Abhyas Portal developed based on rules of for only primary derivative nouns. However both Ashtadhyayi. Majority of data was extracted from these tools does not provide feature for pada for- Sanskrit Abhyas Portal. Table2 shows details of mation analysis. Evaluation criteria was kept strict data sources. for proposed method as mentioned above while the This Pratyaya-Kosh corpus consists of Kridanta success criteria for UoH & JNU tool was relaxed in (Primary derivative nouns) and Taddhitanta (Sec- a way if either of original stem verb/noun or suffix ondary Derivative nouns) in a form which can be is found in the result, it is considered as success. directly ingested for machine learning based ap- The comparison is shown in the table5. Every proaches. Each record is a tuple in the form of cell in the table4&5 indicates successful test (verb stem, krit suffix, Kridanta) and (noun, tad- cases, overall test cases and success percentage. dhit suffix, Taddhitanta). Corpus size of Kridanta It is evident form results that proposed method padas is 24,757 which includes padas formed by improves upon the existing Morphological Analy- 12 different krit suffixes, where as corpus size of sis tools and methods by a significant margin. In Taddhitanta padas is 3,088 which are formed by 17 addition, proposed model does not require any ex- different taddhit suffixes. Details of the corpus are ternal lexicon for analysis and its prediction mech- given in table3. anism works better than dictionary based tools and approaches. 5.2 Derivative Noun (Pada) Analysis Evaluation 6 Conclusion Training and test data was taken from Pratyaya- Kosh. In case of Kridanta Padas, training data was In this research work, we propose Pratyaya-Kosh, a chosen for 10 krit suffixes out of 12 due to data un- benchmark corpus to help researchers new to San- balancing issue. Satf & SAnac suffixes have less skrit in building based Morphological Analyzer data as compared to other krit suffixes, hence Kri- for Sanskrit derivative nouns. Also we propose danta Padas were left out corresponding to these neural approach for learning derivative noun for- 2 suffixes while training and testing. In case of mation without use of any external resources such Taddhitanta padas analysis, all suffixes were taken as language models, morphological or phonetic into consideration for training and testing. Analy- analyzers and still manage to outperform existing sis of padas (primary derivative nouns & secondary approaches. In future we intend to extend current derivative nouns) were done in two ways as pada work to verb derivative and indeclinable derivative formation and pada split into its original verb stem using machine learning methods. Proposed models & suffix in case of Kridanta pada and noun & suf- can be further refined by using additional training fix in case of Taddhitanta pada. For analysis of data. Benchmark corpus (Pratyaya-Kosh) will be Source Name Author Source Reference Sankrit Abhyas Portal Sharat Kotian http://www.sanskritabhyas.in/en Panini Pratyayarth kosh Dr. Gyanprakash Shastri https://archive.org/details/PaniniPratyay arthaKoshaTaddhi- taPrakaranamDr.GyanprakashShastri KridantaRupaMala Pandit S. Ramasubba Sas- https://archive.org/details/ tri KridantaRupaMalaVol.1-5 Sanskrit Hindi Kosh dic- V.S. Apte https://archive.org/details/ San- tionary skritHindiKoshV.S.Apte

Table 2: Pratyaya-Kosh Data Sources

krit Pratyaya Corpus Size taddhit Pratyaya Corpus Size lyuw 2198 152 anIyar 2198 QaY 58 Rvul 2198 Qak 124 tumu n 2198 Rya 104 tavya 2198 Rini 36 tfc 2198 WaY 150 ktvA 2198 Wak 308 Lyap 2198 aR 610 ktavatu 2198 aY 246 Kta 2198 iY 96 Satf 1687 Itac 92 SAnac 1090 Matup 168 -- Mayaw 10 -- Tal 312 -- Tva 311 -- yaY 113 -- Yat 198

Table 3: Pratyaya-Kosh Corpus Details

Model Kridanta Formation Ac- Taddhitanta Formation curacy Accuracy JNU (Sanskrit Kridanta -- Analyzer) UoH (Morphological Gen- -- erator) Proposed Method 3718 / 4396 (84.58 %) 495 / 618 (80.09%)

Table 4: Benchmark Results of Derivative noun (Pada) formation

Model Kridanta Split Predic- Taddhitanta Split Predic- tion Accuracy tion Accuracy JNU (Sanskrit Kridanta 1710 / 4396 (38.9%) - Analyzer) UoH (Morphological Gen- 3492 / 4396 (79.43%) 146 / 618 (23.62%) erator) Proposed Method 3903 / 4396 (88.79 %) 255 / 618 (41.26%)

Table 5: Benchmark Results of Derivative noun (Pada) Split made available on git hub.

References Akshar Bharati, Vineet Chaitanya, Rajeev Sangal, and Brendan Gillon. 2002. Natural language processing: A paninian perspective. Akshar Bharati, Amba Kulkarni, V Sheeba, and Rashtriya Vidyapeetha. 2006. Building a wide cov- erage sanskrit morphological analyzer: A practical approach. George Cardona. 1997. Panini: A Survey of Research. Motilal Banarsidass. Subhash Chandra and Girish Jha. 2006. Sanskrit sub- anta recognizer and analyzer for machine transla- tion. Harold G. Coward. 1990. Note: Sanskrit was both a literary and spoken language in ancient india. The Philosophy of the Grammarians, in Encyclopedia of Indian Philosophies, 5:3–12, 36–47, 111–112. Pawan Goyal and Gerard´ Huet. 2013. Completeness analysis of a sanskrit reader.

Sepp Hochreiter and Jurgen¨ Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

Gerard´ Huet. 2003. Towards computational processing of sanskrit. Paul Kiparsky. 2008. On the architecture of panini’s grammar. volume 5402, pages 33–94. Amrith Krishna and Pawan Goyal. 2015. Towards - tomating the generation of derivative nouns in san- skrit by simulating panini. ArXiv, abs/1512.05670. Amrith Krishna, Pavankumar Satuluri, Harshavardhan Ponnada, Muneeb Ahmed, Gulab Arora, Kaustubh Hiware, and Pawan Goyal. 2017. A graph based semi-supervised approach for analysis of deriva- tional nouns in sanskrit. In TextGraphs@ACL. N. Murali, R. J. Ramasreee, and K. V. L. N. Acharyulu. 2014. Kridanta analysis for sanskrit. Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing sys- tems, pages 3104–3112.