Neural Compound-Word (Sandhi) Generation and Splitting in Sanskrit Language
Total Page:16
File Type:pdf, Size:1020Kb
Neural Compound-Word (Sandhi) Generation and Splitting in Sanskrit Language Sushant Dave Arun Kumar Singh Indian Institute of Technology - Delhi Indian Institute of Technology - Delhi [email protected] [email protected] Dr. Prathosh A. P. Prof. Brejesh Lall Indian Institute of Technology - Delhi Indian Institute of Technology - Delhi [email protected] [email protected] ABSTRACT large corpus of religious, philosophical, socio-political and scientific This paper describes neural network based approaches to the pro- texts of multi cultural Indian Subcontinent are in Sanskrit. Sanskrit, cess of the formation and splitting of word-compounding, respec- in its multiple variants and dialects, was the Lingua Franca of an- tively known as the Sandhi and Vichchhed, in Sanskrit language. cient India [5]. Therefore, Sanskrit texts are an important resource Sandhi is an important idea essential to morphological analysis of knowledge about ancient India and its people. Earliest known of Sanskrit texts. Sandhi leads to word transformations at word Sanskrit documents are available in the form called Vedic Sanskrit. boundaries. The rules of Sandhi formation are well defined but Rigveda, the oldest of the four Vedas, that are the principal religious complex, sometimes optional and in some cases, require knowl- texts of ancient India, is written in Vedic Sanskrit. In sometime th edge about the nature of the words being compounded. Sandhi around 5 century BCE, a Sanskrit scholar named pARini [4] wrote split or Vichchhed is an even more difficult task given its non a treatise on Sanskrit grammar named azwADyAyI, in which pARini uniqueness and context dependence. In this work, we propose formalized rules on linguistics, syntax and grammar for Sanskrit. 1 the route of formulating the problem as a sequence to sequence azwDyAyI is the oldest surviving text and the most comprehensive prediction task, using modern deep learning techniques. Being source of grammar on Sanskrit today. azwADyAyI literally means the first fully data driven technique, we demonstrate thatour eight chapters and these eight chapters contain around 4000 sutras model has an accuracy better than the existing methods on mul- or rules in total. These rules completely define the Sanskrit language tiple standard datasets, despite not using any additional lexical as it is known today. azwADyAyI is remarkable in its conciseness or morphological resources. The code is being made available at and contains highly systematic approach to grammar. Because of https://github.com/IITD-DataScience/Sandhi_Prakarana its well defined syntax and extensively well codified rules, many researchers have made attempts to codify the pARini’s sutras as CCS CONCEPTS computer programs to analyze Sanskrit texts. • Computing methodologies ! Learning paradigms. 1.1 Introduction of Sandhi and Sandhi Split in KEYWORDS Sanskrit Bi-LSTM, Sanskrit, Word Splitting, Compound Word, Morphologi- Sandhi refers to a phonetic transformation at word boundaries, cal Analysis where two words are combined to form a new word. Sandhi literally means ’placing together’ (samdhi-sam together + daDAti to place) ACM Reference Format: is the principle of sounds coming together naturally according to Sushant Dave, Arun Kumar Singh, Dr. Prathosh A. P., and Prof. Brejesh Lall. certain rules codified by the grammarian pARini in his azwADyAyI. Neural Compound-Word (Sandhi) Generation and Splitting in Sanskrit There are 3 different types of Sandhi as defined in azwADyAyI. Language. In 8th ACM IKDD CODS and 26th COMAD, Jan 02–04, 2021, IIIT Bangalore, India. ACM, New York, NY, USA, 7 pages. (1) Swara (Vowel) Sandhi: Sandhi between 2 words where both arXiv:2010.12940v1 [cs.CL] 24 Oct 2020 the last character of first word and first character of second 1 INTRODUCTION word, are vowels. (2) Vyanjana (Consonant) Sandhi: Sandhi between 2 words where Sanskrit is one of the oldest of the Indo-Aryan languages. The old- at least one among the last character of first word and first est known Sanskrit texts are estimated to be dated around 1500 character of second word, is a consonant. BCE. It is the one of the oldest surviving languages in the world. A (3) Visarga Sandhi: Sandhi between 2 words where the last char- Permission to make digital or hard copies of all or part of this work for personal or acter of first word is a Visarga(a character in Devanagri script classroom use is granted without fee provided that copies are not made or distributed that when placed at end of a word, indicate an additional H for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM sound) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, An example for each type Sandhi is shown below: to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. vidyA + AlayaH = vidyAlayaH (Vowel Sandhi) CODS-COMAD ’21, Jan 02–04, 2021, IIIT Bangalore, India vAk + hari = vAgGari (Consonant Sandhi) © Association for Computing Machinery. 1https://www.britannica.com/topic/Ashtadhyayi CODS-COMAD ’21, Jan 02–04, 2021, IIIT Bangalore, India Sushant Dave, Arun Kumar Singh, Dr. Prathosh A. P., and Prof. Brejesh Lall punaH + api = punarapi (Visarga Sandhi) 2 MOTIVATION Sandhi Split on the other hand, resolves Sanskrit compounds and Analysis of sandhi is critical to analysis of Sanskrit text. Researchers “phonetically merged” (sandhified) words into its constituent mor- have pointed out how a good Sandhi Split tool is necessary for a phemes. Sandhi Split comes with additional challenge of not only good coverage morphological analyzer [1]. Good Sandhi and Sandhi splitting of compound word correctly, but also predicting where Split tools facilitate the work in below areas. to split. Since Sanskrit compound word can be split in multiple • Text to speech synthesis system ways based on multiple split locations possible, split words may be • Neural Machine Translation from non-Sanskrit to Sanskrit syntactically correct but semantically may not be meaningful. language and vice versa • Sanskrit Language morphological analysis tadupAsanIyam = tat + upAsanIyam tadupAsanIyam = tat + up + AsanIyam Most Sandhi rules combine phoneme at the end of a word with phoneme at the beginning of another word to make one or two new phonemes. It is important to note that Sandhi rules are meant to 1.2 Existing Work on Sandhi facilitate pronunciation. The transformation affects the characters The current resources available for doing Sandhi in open domain at word boundaries and the remaining characters are generally are not very accurate. Three most popular publicly available set unaffected. The rules of Sandhi for all possible cases are laidoutin of Sandhi tools viz. JNU2, UoH3 & INRIA4 tools are mentioned in azwADyAyI by pARini as 281 rules. The rules of Sandhi are complex table 1. and in some cases require knowledge of the words being combined, An analysis and description of these tools is present in the paper as some rules treat different categories of words differently. This on Sandhikosh [3]. The same paper introduced a dataset for Sandhi means that performing Sandhi requires some lexical resources to and Sandhi Split verification and compared the performance of the indicate how the rules are to be implemented for given words. Work tools in table 1 on that dataset. Neural networks have been used done in Sandhikosh [3] shows that currently available rule based for Sandhi Split by many researchers, for example [2], [7] and [6]. implementations are not very accurate. This paper tries to address The task of doing Sandhi has been mainly addressed as a rule based the problem of low accuracy of existing implementations using a algorithm e.g. [14]. There is no research on Sandhi using neural machine learning approach. networks in public domain so far. This paper describes experiments with Sandhi operation using neural networks and compares results 3 METHOD of suggested approach with the results achieved using existing Sandhi tools [3]. 3.1 The Proposed Sandhi Method Sandhi task is similar to language translation problem where a 1.3 Existing Work on Sandhi Split sequence of characters or words produces another sequence. RNNs are widely used to solve such problems. Sequence to sequence Many researchers like [9] and [11] have tried to codify pARini’s model introduced by [16] is especially suited to such problems, rules for achieving Sandhi Split along with a lexical resource. [13] therefore same was used in this work. The training and test data proposed a statistical method based on Dirichlet process. Finite were in ITRANS Devanagari format 7. This data was converted to state methods have also been used [12]. A graph query method has SLP1 8 as SLP1 was found more suited for proposed approach. The been proposed by [10]. code was implemented in python 3.5 with Keras API running on Lately, Deep Learning based approaches are increasingly be- TensorFlow backend. ing tried for Sandhi Split. [6] used a one-layer bidirectional LSTM Proposed model expects 2 words as input and outputs a new word to two parallel character based representations of a string. [15] as per the Sandhi rules of azwADyAyI as given in below example. and [7] proposed deep learning models for Sandhi Split at sentence sAmAnyaDvaMsAn + aNgIkArAt => sAmAnyaDvaMsAnaNgIkArAt level. [2] uses a double decoder model for compound word split. The method proposed in this paper describes an RNN based, two Results achieved from this approach are rather poor. Analysis stage deep learning method for Sandhi Split of isolated compound showed that this is mainly due to the excessive length of words words without using any lexical resource or sentence information. in some cases. Proposed approach tries to address this problem by In addition to above, there exist multiple Sandhi Splitters in the taking last = characters of the first input word and first < charac- open domain.