Automatic Speech Recognition in Sanskrit: A New Speech Corpus and Modelling Insights

Devaraja Adiga1∗   Rishabh Kumar1∗   Amrith Krishna2
Preethi Jyothi1   Ganesh Ramakrishnan1   Pawan Goyal3
1IIT Bombay, Mumbai, India; 2University of Cambridge, UK; 3IIT Kharagpur, WB, India
∗ Joint first author

Abstract

Automatic speech recognition (ASR) in Sanskrit is interesting, owing to the various linguistic peculiarities present in the language. The Sanskrit language is lexically productive, undergoes euphonic assimilation of phones at word boundaries, and exhibits variations in spelling conventions and in pronunciations. In this work, we propose the first large-scale study of automatic speech recognition (ASR) in Sanskrit, with an emphasis on the impact of unit selection in Sanskrit ASR. We release a 78-hour ASR dataset for Sanskrit, which faithfully captures several of the linguistic characteristics expressed by the language. We investigate the role of different acoustic model and language model units in ASR systems for Sanskrit. We also propose a new modelling unit, inspired by syllable-level unit selection, that captures character sequences from one vowel in the word to the next vowel. We further highlight the importance of choosing graphemic representations for Sanskrit and show the impact of this choice on word error rates (WER). Finally, we extend these insights from Sanskrit ASR to building ASR systems in two other Indic languages, Gujarati and Telugu. For both these languages, our experimental results show that the use of phonetic-based graphemic representations in ASR yields performance improvements compared to ASR systems that use the native scripts.1

1Dataset and code can be accessed from www.cse.iitb.ac.in/~asr and https://github.com/cyfer0618/Vaksanca.git.

1 Introduction

Sanskrit is a language with fairly advanced disciplines of phonetics (Śikṣā), prosody (Chandas), and grammar (Vyākaraṇa). The language has a rich oral tradition and tends to follow a phonemic orthography, resulting in systematic grapheme-phoneme correspondences. Connected speech leads to phonemic transformations in utterances, and in Sanskrit these are faithfully preserved in writing as well (Krishna et al., 2018). This phenomenon is called Sandhi and is defined as the euphonic assimilation of sounds, i.e., the modification and fusion of sounds at or across the boundaries of grammatical units (Matthews, 2007, p. 353). Phonemic orthography is beneficial for a language when it comes to designing automatic speech recognition (ASR) systems, specifically for unit selection at both the Acoustic Model (AM) and Language Model (LM) levels.
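To make the notion of Sandhi concrete, the toy sketch below applies one well-known assimilation, the voicing of a word-final voiceless stop before a vowel (as in vāk + arthaḥ → vāgarthaḥ), to strings in the SLP1 encoding discussed later in this section. The rule table and the function join_with_sandhi are our own illustration, covering only this single rule; they are not part of the authors' pipeline.

```python
# Illustrative only (a toy example, not the paper's preprocessing):
# one Sandhi assimilation applied to SLP1-encoded words, in which a
# word-final voiceless stop is voiced before a following vowel.
VOICED_OF = {"k": "g", "w": "q", "t": "d", "p": "b"}   # SLP1 voiceless -> voiced stops
SLP1_VOWELS = set("aAiIuUfFxXeEoO")                     # SLP1 vowels

def join_with_sandhi(first: str, second: str) -> str:
    """Fuse two SLP1 words, voicing a final voiceless stop before a vowel."""
    if first and second and first[-1] in VOICED_OF and second[0] in SLP1_VOWELS:
        first = first[:-1] + VOICED_OF[first[-1]]
    return first + second

if __name__ == "__main__":
    # vAk ("speech") + arTaH ("meaning") -> vAgarTaH, written as a single token
    print(join_with_sandhi("vAk", "arTaH"))   # vAgarTaH
```

It is precisely such fusions, together with the speakers' freedom to undo them at arbitrary boundaries, that complicate word-level unit selection for Sanskrit ASR, as discussed next.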
Notwithstanding the aforementioned commonalities preserved in both the speech and the text in Sanskrit, designing a large-scale ASR system raises several challenges. The Unicode encodings for the scripts used to write Sanskrit, both Roman and Devanāgarī, do not preserve the correspondence with the phonemic encoding. Further, mapping the graphemes in Unicode to the corresponding phonemes either leads to ambiguity and redundancy, or often requires multi-grapheme combinations.

The language is lexically productive, which results in long compound words with multiple components in usage. Speakers therefore segment compounds at arbitrary lexeme boundaries, since it is not always possible to utter a compound in one breath while still conveying its meaning clearly. Similarly, arbitrary segmentations at word boundaries occur in utterances of long text sequences where multiple lexical items are fused together via Sandhi. These segmentations are accompanied by the corresponding Sandhi-based transformations, resulting in a phonetic sequence different from the original one. Finally, Sanskrit might be one of those rare natural languages in which the number of non-native proficient speakers is manifold in comparison to that of native speakers (Krishna et al., 2020). This makes the ASR task further challenging, as the speakers are prone to carry the influence of their respective mother tongues into their Sanskrit utterances.

While there exist several computational models for processing Sanskrit texts (Kulkarni, 2013; Kumar et al., 2010; Shukla et al., 2010; Kulkarni et al., 2010a; Goyal et al., 2012; Kulkarni et al., 2010c; Mishra et al., 2013; Saluja et al., 2017; Anoop and Ramakrishnan, 2019; Krishna et al., 2021), large-scale systems for processing speech in Sanskrit are almost non-existent. First, we present a new dataset, with 78 hours of speech covering about 46,000 sentences, for ASR in Sanskrit. Keeping in mind the rich and long cultural heritage the language carries, we prepare our dataset to be diverse both chronologically and in terms of domain coverage. Further, the dataset contains utterances from 27 different speakers, representing 6 different native languages. The dataset splits have disjoint speakers, with 12 in the training set and 5 each in the validation, test and out-of-domain test sets. Further, we explicitly mark the segmentation decisions made by a speaker to segment long compound words and fused phrases, and include the corresponding transformations due to Sandhi.

Using this dataset, we propose a new, large-vocabulary Sanskrit ASR system, which, to the best of our knowledge, is the first such system for Sanskrit. The phonemic orthography followed in Sanskrit has influenced our design choices in terms of unit selection at the level of the acoustic and language models. We investigate three different encoding schemes used to model LM tokens, namely, word-based encoding, byte pair encoding (BPE) and a new vowel-split encoding inspired by existing linguistic theories of syllabic structure popularly used within text-to-speech systems (Kishore and Black, 2003; Mishra et al., 2013). Further, to address the redundancy issues in Unicode representations, we make use of the Sanskrit Library Phonetic (SLP1) encoding scheme proposed by Scharf and Hyman (2011). SLP1 is designed such that it preserves the phonemic orthography. Building on the study by Scharf and Hyman (2011), we focus on two graphemic representations only, viz., the native script (Devanāgarī) and SLP1.
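The vowel-split encoding mentioned above can be illustrated with a minimal sketch over SLP1 text. We read a unit as running from one vowel up to, but not including, the next vowel, with any word-initial consonants attached to the first unit; this boundary convention and the function vowel_split are our own illustrative assumptions, and the segmentation actually used in the paper may differ in detail.

```python
# A minimal sketch of vowel-split segmentation over SLP1-encoded words.
# Assumption (ours): a unit runs from one vowel up to, but not including,
# the next vowel; word-initial consonants are attached to the first unit.
SLP1_VOWELS = set("aAiIuUfFxXeEoO")

def vowel_split(word: str) -> list[str]:
    """Split an SLP1 word into vowel-anchored units."""
    units, current, seen_vowel = [], "", False
    for ch in word:
        if ch in SLP1_VOWELS and seen_vowel:
            units.append(current)          # a new vowel starts a new unit
            current, seen_vowel = "", False
        current += ch
        seen_vowel = seen_vowel or ch in SLP1_VOWELS
    if current:
        units.append(current)
    return units

if __name__ == "__main__":
    # The 19-character compound discussed under Word Length in Section 2
    # splits into eight vowel-anchored units:
    print(vowel_split("vAgarTapratipattaye"))
    # ['vAg', 'arT', 'apr', 'at', 'ip', 'att', 'ay', 'e']
```

Units of this kind are coarser than characters yet much finer than Sanskrit's long compound words, which is why they are explored here for both the acoustic and language models.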
Finally, we extend our insights to model ASR systems for two more Indian languages, viz., Telugu and Gujarati. We extend SLP1 to include graphemes relevant for these languages that are missing from Sanskrit. We report the performance of these ASR systems on two publicly available ASR datasets.

Our main contributions in this work are:
1) We present (in Section 2) a new, large-vocabulary Sanskrit ASR system and the first ever ASR-based study for Sanskrit, using a new, large and diverse labeled speech corpus वाक्सञ्चयः (/Vāksañcayaḥ/).
2) We investigate (in Sections 3 and 4) different modelling choices for both acoustic models and language models in Sanskrit ASR systems, along with different graphemic representations. We propose a new word segmentation technique based on splitting at vowels that can be used with both the acoustic model and the language model.
3) We also contextualize our findings for Sanskrit by providing comparisons with ASR systems built for two other Indian languages, viz., Gujarati and Telugu.

2 A new Sanskrit Speech Corpus: वाक्सञ्चयः (/Vāksañcayaḥ/)

Our corpus, वाक्सञ्चयः (/Vāksañcayaḥ/), has more than 78 hours of data with an overall vocabulary size of 91,000 words and recordings of about 46,000 sentences, each with a sampling rate of 22 kHz. The contents span three time periods, categorised into pre-classical literature (1,500 BCE - 100 BCE), classical literature (300 CE - 800 CE) and modern literature (900 CE to now). The corpus is intended to address the challenges in interfacing the speech and the text covered by the disciplines of phonetics (Śikṣā) and grammar (Vyākaraṇa). Hence, we confine our corpora to texts written only in prose (Gadya).2 In Sanskrit literature, the frequency of commonly used words changes from one topical domain to another, specifically from one Śāstra (branch of learning) to another (Adiga et al., 2018). Our corpus contains samples from diverse domains, including philosophy, literature, commentary on poetry and grammar. It also includes contemporary recordings such as stories, live lectures, spiritual discourses and radio programmes/podcasts, thereby covering a wide range of Sanskrit vocabulary.

2We do not include verses in our current dataset, as modelling ASR systems for verses would require additional resources on both the acoustic model and the language model fronts.

The recordings were primarily collected with the help of volunteers, who recorded their speech using the Recorder app on Android phones and the Audacity platform, and from various sources available online.3 oTranscribe3 was used to transcribe the audio files.

Dataset               Speakers   Hours   Utterances
Train                 12         56      34,309
Validation            4          7       3,190
Test                  6          11      6,004
Out-of-domain Test    5          5       2,618

Table 1: Overview of Sanskrit speech corpus.

Word Length: The tokens in Sanskrit texts can be very long owing to "Sandhi" and the lexically productive process of compounding ("Samāsa"). For instance, consider the compound word वागर्थप्रतिपत्तये (/vāgarthapratipattaye/). It forms a 19-letter word in SLP1 (vAgarTapratipattaye), and is formed by combining the three Sanskrit stems वाक्, अर्थ, प्रतिपत्ति (/vāk, artha, pratipatti/), as per the rules of Sandhi and Samāsa. In Figure 1, we present the distribution of the number of characters (in SLP1 format) per word across the three languages that we experimentally analyse in this work, viz., Sanskrit, Telugu and Gujarati. The plots are normalized with respect to the