Development of Speech Corpus and Automatic Speech Recognition of Angami

Viyazonuo Terhiija, Priyankoo Sarmah and Samudra Vijaya
Indian Institute of Technology Guwahati, Guwahati
[email protected], [email protected], [email protected]

Abstract—Development of speech technologies for under-resourced languages is important. In this paper, we describe the development of a speech corpus in Angami, an under-resourced language, and the implementation of an automatic speech recognition system. Angami is a tone language belonging to the Tibeto-Burman language family. It is spoken in the state of Nagaland in North-East India. The speech corpus and the speech recognition system were developed for the variety of Angami spoken in Kohima village. In this work, we report the creation of a database of Angami sentences, read by eleven Angami speakers. The outcome of this speech database creation effort (speech data, transcriptions and a pronunciation dictionary) was used in the development of an automatic speech recognition system using Kaldi, a public domain toolkit. The performance of various versions of the system, using different types of acoustic models, is presented and discussed. While the word error rate on training data is under 5%, the error on unseen test data from a new speaker is higher by a factor of 2 to 5, depending on the speaker. The average word error rate of a 'leave one speaker out' cross validation evaluation of the context independent phone model is 17.3%. The results and inferences of a few experiments conducted to discover optimal settings of the system are presented.

Index Terms—Angami, speech database, ASR

I. INTRODUCTION

Nagaland is a small, multi-lingual state in north-east India, located at the foothills of the Himalayan range, as shown in Fig. 1. The state is bounded by the Barail mountain range to the southeast and the Patkai range to the northeast. Nagaland is home to sixteen indigenous tribes, of which fourteen are Naga and the other two are the Kukis and the Kacharis. The residents of Nagaland generally speak multiple languages: English as the official language, Nagamese as a lingua franca, and their respective native languages as a subject in school. There are more than a dozen major (spoken by 10,000 or more people) indigenous languages spoken in the state of Nagaland [1]. The Angami are one of the major Naga communities in Nagaland and are primarily found in the Kohima district [2]. Angami is a tone language of the Tibeto-Burman language family.

Fig. 1. The state of Nagaland in North East India.

With the advancement of telecommunication networks and the availability of inexpensive mobile telephone services, the use of the internet has increased in India. Such interaction with a smartphone would be easier if people could talk to the device instead of typing keywords or navigating complex menus. In this context, machine recognition of spoken language becomes important. Such speech recognition systems have been implemented for some major languages of north-east India [3], [4]. However, no such system exists for any indigenous language of Nagaland. This paper reports the creation of a speech database for standard Angami and its use in the implementation of an Automatic Speech Recognition (ASR) system.

The paper is organised as follows. A brief overview of the Angami language is given in section II. The development of a database of the Angami language spoken by native Angami speakers is described in section III. The details of the implementation of an ASR system are given in section IV. The preliminary results of the experiments conducted with the ASR systems are given in section V. A summary of the work and the conclusions are given in section VI.

(This work is part of an ongoing project titled "Sociolinguistic Study of Phonetic Variations among the Clans and Khels of Two Southern Angami Villages", funded by the Indian Council of Social Science Research (ICSSR), Government of India.)

978-1-7281-2449-0/19/$31.00 ©2019 IEEE

II. AN OVERVIEW OF THE ANGAMI LANGUAGE

Angami (also referred to as Tenyidie, ISO 639-3: njm) [5] is spoken by 152,796 people in Nagaland [1]. Angami is an ethno-cultural as well as a linguistic group.

The Angami community is broadly divided into four groups, namely, the Northern, Southern, Western and Chakro (literally meaning people residing below the highway), based on geography and administrative convenience. Each group consists of ten to twenty small villages, and each village is said to have its own variety. The variety of Angami spoken around Kohima village is considered to be the standard form of the Angami language.

Some of the prominent works on Angami are descriptions of grammar in Standard Angami and the western variety [6]–[11]. Dialectal variations and internal variations of the language based on kinship ties and geography have also been reported [12], [13]. Angami is a register tone language with four lexical tones. Previous works have stated tonal inventories ranging from four to five tones [7]–[11], [13]. In a recent study of tone and vowel interaction in Angami, a pilot study was conducted to determine the tonal inventory [14]. The study found that there is no acoustic difference between T2 and T3 produced by Standard Angami speakers. Hence, T2 and T3 were merged and treated as one entity. Examples of tonal minimal pairs in Angami are shown in Table I, where T1 represents the highest tone and T5 the lowest. Tones in Angami not only have lexical significance but also grammatical functions. However, such tones are quite limited in number. Angami has six vowels, /a, i, ɛ, u, o, ə/. Examples of vowel minimal pairs with the cluster /kɹ/ in the onset position are shown in Table II. There are 40 consonant phonemes in the language.

TABLE I
TONAL MINIMAL PAIR SETS IN STANDARD ANGAMI

  Word  Tone   Gloss
  sɛ    T1     to use
        T2/T3  to erect
        T4     three
        T5     snatch
  pɛ    T1     to incline
        T2/T3  fat/bridge
        T4     shiver
        T5     shoot
  ɹi    T1     to twist
        T2/T3  to hold
        T4     also
        T5     to mix

TABLE II
MINIMAL PAIRS IN ANGAMI

  Word  Meaning
  kɹa   plenty
  kɹɛ   laugh
  kɹi   clingy
  kɹu   flow
  kɹə   nest
  kɹo   allergy

Angami uses the Roman script for writing. The IPA symbols of the forty consonants, along with the corresponding labels (in the Roman script) used in this experiment, are shown in Table III.

TABLE III
THE IPA LABELS OF CONSONANTS OF ANGAMI ALONG WITH THE CORRESPONDING LABELS USED IN THIS WORK

  Label  IPA    Label  IPA
  p      p      ph     pʰ
  b      b      t      t
  th     tʰ     d      d
  k      k      kh     kʰ
  g      g      m      m
  mh     mʰ     n      n
  nh     nʰ     ny     ɲ
  nyh    ɲʰ     ng     ŋ
  f      f      v      v
  s      s      z      z
  sh     ʃ      zh     ʒ
  h      h      pf     pf
  pfh    pfʰ    bv     bv
  ts     ts     tsh    tsʰ
  dz     dz     c      tʃ
  ch     tʃʰ    j      dʒ
  l      l      lh     lʰ
  r      ɹ      rh     ɹʰ
  w      w      wh     wʰ
  y      j      yh     jʰ

A. Standard Angami and variations

The Angami community can be categorized into four groups, as mentioned in section II. Several villages constitute each group. It is a folk belief of the Angamis that each village has its own variety, and the identity of a native speaker is the variety (s)he speaks. Due to the variations in Angami speech, Rev. J. E. Tanquist (one of the earliest American missionaries) convened a meeting of the elders of the community in 1939 and formed the Angami Literature Committee to establish a common pattern of writing and a spelling system [15]. This gave birth to standard Angami, which is also known as Tenyidie. In 1971, the literature committee changed its name to the Ura Academy. It acts as a catalyst of the development of the language and of the socio-cultural aspects of the community. With the development of the language, Standard Angami was introduced in the school and university curriculum. Previous linguistic studies on Angami, including descriptions of grammar, were based on data collected from the Khonoma, Mezoma and Jotsoma dialects, spoken in the western region of Kohima district [6], [16]. Studies by Giridhar and Kuolie focussed on Standard Angami [8], [10]. Studies of variations in Angami across the Southern Angami villages, and also of the internal variations in Kohima village based on clans, have been conducted [12], [13].
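The Roman-script labels of Table III serve as the phone inventory for the pronunciation dictionary described later. As a minimal illustration (not part of the paper's toolchain; the mapping below covers only a subset of the 40 consonants, and the helper name is ours), a Roman-script word can be split into these labels greedily, longest label first:

```python
# Subset of the Table III label-to-IPA mapping (consonants) plus four of the
# six vowels; orthographic symbols for the remaining two vowels are omitted.
CONSONANTS = {
    "p": "p", "ph": "pʰ", "b": "b", "t": "t", "th": "tʰ", "d": "d",
    "k": "k", "kh": "kʰ", "g": "g", "m": "m", "n": "n", "ng": "ŋ",
    "ts": "ts", "tsh": "tsʰ", "c": "tʃ", "ch": "tʃʰ", "j": "dʒ",
    "s": "s", "sh": "ʃ", "z": "z", "zh": "ʒ", "r": "ɹ", "l": "l",
    "w": "w", "y": "j", "h": "h", "v": "v", "f": "f",
}
VOWELS = {"a": "a", "i": "i", "u": "u", "o": "o"}
LABELS = {**CONSONANTS, **VOWELS}

def to_phones(word):
    """Split a word into phone labels, preferring the longest match
    (e.g. 'tsh' before 'ts' before 't')."""
    phones, i = [], 0
    while i < len(word):
        for n in (3, 2, 1):  # try the longest label first
            chunk = word[i:i + n]
            if chunk in LABELS:
                phones.append(chunk)
                i += len(chunk)
                break
        else:
            raise ValueError("no label matches at: " + word[i:])
    return phones
```

Longest-first matching matters because labels such as 'tsh' contain shorter valid labels as prefixes; a lexicon entry would then pair the word with the resulting label sequence, one word per line.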
III. DEVELOPMENT OF THE ANGAMI SPEECH DATABASE

Here, we describe the steps in the development of the Angami speech database. The first subsection describes the preparation of the text materials that were read by the informants. The subsequent subsections describe the pronunciation dictionary, the geographical areas of data collection, the speech recording procedure, the segmentation and annotation of the speech files, and an overview of the speech database.

A. Text corpus

The sentences to be read were a collection of passages from books, articles and poetry. The distribution of the sources of the text data is shown in Fig. 2. The length of the sentences varied from 5 to 15 words. The text corpus used in this work contains 579 unique sentences comprising 1694 unique words.

Fig. 2. The distribution of the sources of sentences.

B. Pronunciation dictionary

The pronunciation dictionary, in the format specified by the Kaldi toolkit, was manually created. Since the speakers of Angami use the Roman script, the first column of the pronunciation dictionary consists of Angami words in the Roman script. The second column contains the sequence of phonemes corresponding to the word in the first column. The labels used to denote the phonemes of spoken Angami are listed in Table III.

C. Recording of speech data

Data were collected from eleven native speakers of Angami residing in Kohima village, and two speakers residing in Kohima town. The research associates took prior appointments with each of the speakers. The speakers did the recording voluntarily, without any remuneration. All the speakers are multilingual; they speak at least three languages, namely, Angami, English and Nagamese. The speakers have a minimum education of a high school certificate, and thus were able to read the text given to them. The speakers were presented with printed passages, and the investigator requested the speaker to read the given text aloud while the speech was digitally recorded. After recording the speech, the speakers were requested to provide a few personal details such as their age, the languages known to them, and the clan/kinship ties to which they belong.

The data were recorded in ambient conditions, and not in a controlled acoustic environment. A hand-held Tascam linear PCM recorder was used. The characteristics of the digitized speech files were as follows: 44,100 Hz, 24-bit, mono, MS wav format. The speech data were collected in September 2018.

The speech recordings were then transferred to a computer for further processing. A passage, an article or a poem read by a speaker had been recorded in a single speech file. The text corresponding to such speech was divided into sentences or phrases. Each text was numbered serially and stored in an Excel file. Using the Praat 6.0.43 toolkit [18], each sound file was segmented and annotated with the corresponding sentence number in the Excel file. Care was taken to ensure that the speech in a sound file matched the text with the same serial number. Out of the recordings from 13 speakers, two were removed due to high levels of noise and disturbances from the environment. Thus, the speech database used in this work consists of speech recordings from a total of 11 speakers: 6 male and 5 female. The average age of the speakers is 28.

IV. IMPLEMENTATION OF THE SPEECH RECOGNITION SYSTEM

An Automatic Speech Recognition (ASR) system was implemented using the open source Kaldi toolkit [19]. The details of the implementation are described in this section.

A. Feature extraction and model training

We used the default settings of the Kaldi toolkit to implement the first version of the Angami ASR system. Mel Frequency Cepstral Coefficients (MFCC) and their first and second time derivatives were computed from 25 ms long speech frames separated by 10 ms each.

The standard recipe of Kaldi permits training several types of hidden Markov models (HMM) to represent a linguistic unit such as a context independent phone (monophone) or a context dependent phone (triphone). Depending on the type of model or the feature transformations carried out, three types of triphone models can be trained; these are called 'tri1', 'tri2' and 'tri3'. The probability density function associated with a state of an HMM is modeled by a Gaussian Mixture Model (GMM) in the case of these 4 acoustic models. An acoustic model where the emission probabilities are modeled by a subspace GMM is denoted as SGMM. Further, the state dependent likelihoods of feature vectors can be computed from the posterior probabilities estimated by a Deep Neural Network (DNN). The standard recipe allows training and evaluation of all these 6 acoustic models.

The word level bigram language model was trained from the transcriptions of the training data using the IRSTLM toolkit [20].

V. RESULTS AND DISCUSSION

The performance of various versions of the Angami ASR system is presented and discussed in this section. The measure of performance is the Word Error Rate (WER, in %); the lower the value of the WER, the better the system. The word sequence hypothesised by the decoder for a wave file is aligned with the reference transcription (ground truth). The numbers of words inserted, deleted and substituted are computed. This is repeated for each (test) wav file fed to the decoder. The WER is computed as

WER(%) = 100 (I + D + S) / N

where I, D and S are the numbers of insertion, deletion and substitution errors, respectively, and N is the total number of words in the reference transcriptions of the test data set.

Fig. 3. Performance of the Angami ASR system, employing different types of acoustic models, as a function of the number of training iterations.

Fig. 4. Performance of the Angami DNN-HMM based ASR system, with varying numbers of nodes (from 64 to 500) in the 4 hidden layers, as a function of the number of training iterations. The WER increases if the number of nodes per hidden layer is much larger than 100.
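The WER computation defined above can be sketched with a standard dynamic-programming (Levenshtein) word alignment. This is an illustrative re-implementation, not the scoring code of the Kaldi toolkit:

```python
def wer(ref, hyp):
    """Word Error Rate: 100 * (I + D + S) / N, where N = len(ref).
    Computed via Levenshtein alignment of reference and hypothesis
    word lists; assumes a non-empty reference."""
    # d[i][j] = minimum edit cost of aligning ref[:i] with hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # match or substitution
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # vs. deletion, insertion
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("to use".split(), "to mix".split())` counts one substitution out of two reference words, giving 50%.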
1) Performance on training data: The Angami speech database consists of 579 speech files from 11 speakers. Since the number of files in this preliminary database is small, we used all 579 files for training the ASR system. Subsequently, each of the 579 files was fed to the decoder of the ASR system, and the WER was computed. In other words, we measured the performance of the system in recognizing the speech files with which it was trained. Such an evaluation of the ASR system with respect to seen (train) data was carried out for systems employing 6 different types of acoustic models: mono, tri1, tri2, tri3, SGMM and DNN. The Baum-Welch algorithm used for training the acoustic models is an iterative algorithm. In every iteration, the parameters of the model are tuned, following the maximum likelihood criterion, to represent the training data better. Consequently, the WER is expected to decrease as a function of the increasing number of training iterations.

The word error rates of the 6 versions of the ASR system are shown graphically as a function of the number of training iterations in Fig. 3. As expected, the WER shows a decreasing trend with the number of iterations. The WER of the ASR system that models context independent phones with GMM-HMM ('mono') is 27% after the first iteration. The WER reduces to a single digit after 11 iterations of training. The WER of the systems that model context dependent phones either with GMM-HMM ('tri1', 'tri2', 'tri3') or with subspace-GMM-HMM ('SGMM') is below 5% right from the beginning. These models enjoy the triple benefits of (i) being initialized with trained monophone models, (ii) exploiting phonetic context effects, and (iii) recognising 'seen' speech data. In contrast, the ASR system using DNN-HMM acoustic models has a high WER value initially. The WER decreases with an increasing number of iterations. The WER curve of the DNN-HMM closely follows that of the mono model. A possible explanation for this behaviour is that the deep neural network has four hidden layers with hundreds of nodes in each layer. The tuning of a large number of weight parameters to achieve better phone classification accuracy takes time. We conducted an experiment to investigate the performance of the DNN-HMM based system as a function of the number of parameters.

2) Optimal number of nodes in the hidden layers of the DNN: The default number of hidden layers in the DNN is four. Each hidden layer had 300 nodes to begin with; thus, there are 300² weights on the connections between adjacent hidden layers. In view of the high WER values associated with the DNN-HMM ASR system, we explored the possibility of reducing the WER by discovering the optimum number of nodes in a hidden layer. In a similar experiment, increasing the number of nodes in the hidden layer from 64 to 1024 decreased the WER of a DNN-HMM system [21]. So, we increased the number of nodes from 300 to 500. Unfortunately, the WER increased. However, the WER decreased when the number of nodes was reduced to 150. Decreasing the number further, down to 64, did not have a significant effect on the WER. Thus, about 100 nodes per hidden layer of the DNN seems to be optimal with respect to the small database of 579 speech files. This was found to be true for varying numbers of iterations of training the DNN.

Fig. 4 shows the WER of the DNN-HMM based system, trained with all 579 speech files and tested with the same 579 speech files, as a function of the number of iterations of training the DNN. The WER is consistently high when the number of nodes in each hidden layer is 500. The WERs are nearly equal when this number is either 150 or 64. The WER values of the DNN-HMM models in Fig. 3 refer to the DNN-HMM model with 64 nodes in each hidden layer.

3) Performance on test data: In the previous subsection, we presented and discussed the performance of the ASR systems with respect to training data. The WERs of the systems on test ('unseen') data are expected to be higher, especially in view of the small amount of speech data available for training.

The complete Angami speech database consists of 579 speech files spoken by 11 speakers. The number of files per speaker varies from 24 to 100. So, we conducted a 'leave-one-speaker-out' cross validation experiment to assess the efficacy of the trained models on 'unseen' data from a speaker whose characteristics are not known to the system. In the first experiment, we kept aside all the speech files from the first speaker as the test data, and trained the system with the speech files from the remaining 10 speakers. The WER of the system when fed with all the files from the first speaker was computed. This evaluation procedure was repeated for each of the remaining 10 speakers.

Since the set of sentences read by a speaker is mutually exclusive from the set of sentences read by the remaining speakers, we trained the bigram language model from the combined transcriptions of the test and train files. Otherwise, some words spoken by the test speaker might not be part of the training vocabulary. In that case, the grammar network trained with the transcriptions of the training data would not contain such 'unseen' words. Such 'out-of-vocabulary' words would be mis-recognised as one or more words of the training vocabulary. This would artificially increase the WER value.

The WER of the ASR systems on 'unseen' test data was found to be larger for the models of context dependent phones in comparison with the 'mono' model. So, here we report the WER of only the 'mono' ASR system. The mono system was trained with 20 iterations. The WER for the unseen test data varied from 6.8% to 35.7% in the 11 rounds of evaluation corresponding to the 11 speakers taken as the test speaker. The average WER was 17.3%. There was no apparent correlation of the WER with attributes such as the gender or the number of test files belonging to the test speaker.

On inspection, it was observed that speakers sometimes uttered extra words in addition to the text provided. The performance is likely to improve if such speech disfluencies are detected and incorporated in the reference transcriptions. Moreover, there are a large number of parameters, such as the number of shared HMM states (senones), that can be tuned to achieve optimal performance with respect to unseen test data. However, it may be better to train the system with a larger amount of speech data from more speakers, and then tune the parameters of the system for the best performance on test data.

VI. SUMMARY AND CONCLUSIONS

An automatic speech recognition system was implemented for Standard Angami, an under-resourced language spoken in the Nagaland state of North-East India. The system was trained with sentences read by native Angami speakers. The performance of the ASR system was presented and discussed in the light of the small size of the speech database. Collection of a larger database from more speakers is needed to improve the performance of the system.

REFERENCES

[1] Office of the Registrar General & Census Commissioner, India, "Statement-1 Part-B: Languages not specified in the eighth schedule (non-scheduled languages)". Online: http://www.censusindia.gov.in/2011Census/Language-2011/Statement-1.pdf.
[2] Eberhard, David M., Gary F. Simons, and Charles D. Fennig (eds.), "Ethnologue: Languages of the World", twenty-second edition, Dallas, Texas: SIL International. Online version: http://www.ethnologue.com/map/IN 05.2019.
[3] Abhishek Dey, Wendy Lalhminghlui, Priyankoo Sarmah, K. Samudravijaya, S. R. Mahadeva Prasanna, Rohit Sinha and S. R. Nirmala, "Mizo Phone Recognition System", Proc. IEEE INDICON 2017, IIT Roorkee, 15-17 December, 2017.
[4] Barsha Deka, Nirmala S. R. and Samudravijaya K., "Development of Assamese Continuous Speech Recognition System", Proc. of the 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages, pp. 215-219, 29-31 August, Gurugram, India, 2018.
[5] "Naga, Angami: a language of India", SIL International. Online version: https://www.ethnologue.com/language/njm, 2018.
[6] Grierson, George A., "Tibeto-Burman Family: Specimens of the Bodo, Naga, and Kachin Groups, Volume III Part II of Linguistic Survey of India", Office of the Superintendent of Government Printing, Calcutta, 1903.
[7] Burling, Robbins, "Angami phonemics and word list", Indian Linguistics, vol. 21, pp. 51–60, 1960.
[8] Giridhar, Puttushetra Puttuswamy, "Angami grammar", Central Institute of Indian Languages, 1980.
[9] Ravindran, N., "Angami phonetic reader", Central Institute of Indian Languages, 1984.
[10] Kuolie, D., "Structural description of Tenyidie: a Tibeto-Burman language of Nagaland", Ura Academy, Publication Division, 2006.
[11] Chase, N., "A descriptive analysis of the Khwunomia dialect of Angami", Ph.D. dissertation, University of Poona, 1992.
[12] Terhiija, Viyazonuo, Sarmah, Priyankoo and Vijaya, Samudra, "Acoustic Analysis of Vowels in Two Southern Angami Dialects", Oriental COCOSDA 2018, 7-8 May, Miyazaki, Japan, 2018.
[13] Suokhrie, Kelhouvinuo, "Clans and clanlectal contact", Asia-Pacific Language Variation, vol. 2, pp. 188-214, 2017.
[14] Lalhminghlui, Wendy, Terhiija, Viyazonuo and Sarmah, Priyankoo, "Vowel-Tone Interaction in Two Tibeto-Burman Languages", INTERSPEECH 2019, 15-19 September, Graz, Austria, 2019.
[15] Liezietsu, Shiirhozelie, "Ura Academy and the evolution of Tenyidie", in Angami Society at the beginning of the 21st Century, edited by Kikhi, K. et al., Akansha Publishing House, 2009.
[16] McCabe, Robert Blair, "Outline Grammar of the Angami Naga Language", Calcutta: Superintendent of Government Printing, 1887.
[17] Crystal, David, "A dictionary of linguistics and phonetics", 6th edition, Blackwell Publishing, 2008.
[18] Boersma, Paul and Weenink, David, "Praat: A system for doing phonetics by computer", 1992.
[19] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi Speech Recognition Toolkit", IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, IEEE Signal Processing Society, Dec. 2011. Online: https://publications.idiap.ch/downloads/papers/2012/Povey ASRU2011 2011.pdf
[20] Marcello Federico, Nicola Bertoldi and Mauro Cettolo, "IRSTLM: An open source toolkit for handling large scale language models", Proceedings of Interspeech, pp. 1618-1621, 2008.
[21] Barsha Deka, Priyankoo Sarmah and Samudra Vijaya, "Assamese Database and Speech Recognition", unpublished.