Synthesizing Audio for Hindi WordNet

Diptesh Kanojia†,♣,⋆, Preethi Jyothi†, Pushpak Bhattacharyya†
†Indian Institute of Technology Bombay, India
♣IITB-Monash Research Academy, India
⋆Monash University, Australia
†{diptesh, pjyothi, pb}@cse.iitb.ac.in

Abstract

In this paper, we describe our work on the creation of a voice model using a speech synthesis system for the Hindi WordNet. We use pre-existing "voices" and also create a "voice" of our own from publicly available speech corpora using the Festival Speech Synthesis System (Black, 1997). Our contribution is two-fold: (1) we scrutinize multiple speech synthesis systems, provide an extensive report on the currently available state-of-the-art systems, and develop voices using the existing implementations of the aforementioned systems; and (2) we use these voices to generate sample audios for randomly chosen words, manually evaluate the generated audio, and produce audio for all WordNet words using the winning voice model. We also produce audios for the Hindi WordNet glosses and example sentences. We further describe our efforts to use a pre-existing implementation of WaveNet, a neural network model for generating raw audio (Oord et al., 2016), to generate speech for Hindi. Our lexicographers perform a manual evaluation of the audio generated using multiple voices. A qualitative and quantitative analysis reveals that the voice model generated by us performs the best, with an accuracy of 0.44.

1 Introduction

WordNets have proven to be a rich lexical resource for many NLP sub-tasks such as Machine Translation (MT) and Cross-Lingual Information Retrieval (Knight and Luk, 1994; Richardson and Smeaton, 1995). They are lexical structures composed of synsets and semantic relations (Fellbaum, 1998). Such a lexical knowledge base is at the heart of an intelligent information processing system for Natural Language Processing and Understanding. The first WordNet was built in English at Princeton University1. Then followed EuroWordNet for European languages2 (Vossen, 1998), and then IndoWordNet3 (Bhattacharyya, 2010).

IndoWordNet consists of 18 Indian languages with an average of 27000+ synsets across the languages and 40000+ synsets for the Hindi language. It uses Hindi WordNet4 (Narayan et al., 2002) as a pivot to link all these languages, and it contains more than 25000 linkages to the Princeton WordNet. Cognitive theories of multimedia learning (Mayer, 2002) indicate that audio cues are effective aids in a learning scenario and also help in retaining the material learned (Bajaj et al., 2015).

"Our goal is to enrich the semantic lexicon of Hindi WordNet by augmenting it with word audios generated automatically using a speech synthesis voice model."

Manually recording audio for all the words is a tedious task. These recording efforts could be minimized by using text-to-speech (TTS) systems to automatically synthesize speech for all the words. However, one cannot be sure about the quality of these synthesized clips. We build multiple TTS systems and systematically analyze the quality of the resulting synthesized clips, with the help of lexicographers. We envision that this addition to Hindi WordNet will further its use in the education domain, for students and language enthusiasts alike.

1 http://wordnet.princeton.edu
2 http://www.illc.uva.nl/EuroWordNet/
3 http://www.cfilt.iitb.ac.in/indowordnet/
4 http://www.cfilt.iitb.ac.in/wordnet/webhwn/

1.1 Speech Synthesis: An Introduction

There are four basic approaches to synthesizing speech: 1) waveform concatenation, 2) formant synthesis, 3) articulatory synthesis, and 4) statistical parametric (HMM-based) synthesis. Concatenative synthesis produces a very natural-sounding synthesized version of the utterances. There can be glitches in the output owing to the nature of automatic segmentation of the waveforms, but the speech produced does sound natural.

Apart from the first method, i.e. waveform concatenation, all approaches to speech synthesis are based on the source-filter model. The synthesis method can then be broken down into two components: a model of the source (models of periodic vibration and models of supra-glottal sources) and a model of the vocal tract transfer function. In articulatory synthesis, computational models of the articulators are constructed that allow the system to simulate the various configurations that human speech organs can attain during speech production; acoustic-phonetic theory is used to compute the transfer function for the vocal tract shape. In formant synthesis, formant transitions must be modeled closely, since these transitions are most important in identifying the sounds being produced, and designing such a set of rules is still a difficult task. The simplest approach to synthesis bypasses most of these problems: it involves taking real recorded/coded speech, cutting it into segments, and concatenating these segments back together during synthesis. This is called concatenative synthesis.
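The source-filter decomposition sketched above can be illustrated with a few lines of code. The following toy example, which is not part of the original system, drives a pair of second-order resonators (standing in for vocal-tract formants) with an impulse-train source; the sample rate, F0, formant frequencies, and bandwidths are placeholder values chosen only for readability, and `resonator` is a hypothetical helper.

```python
import numpy as np
from scipy.signal import lfilter

# Toy source-filter synthesis: an impulse train (periodic glottal source)
# is passed through two resonators that approximate vocal-tract formants.
# All numeric values are illustrative placeholders.

fs = 16000           # sampling rate (Hz)
f0 = 120             # fundamental frequency of the source (Hz)
duration = 0.5       # seconds

# 1) Source: an impulse train at F0 models periodic vocal-fold vibration.
n = int(fs * duration)
source = np.zeros(n)
source[::fs // f0] = 1.0

# 2) Filter: each formant is approximated by a two-pole resonator.
def resonator(freq, bandwidth, fs):
    """Return (b, a) filter coefficients for one formant resonance."""
    r = np.exp(-np.pi * bandwidth / fs)
    theta = 2 * np.pi * freq / fs
    return [1.0 - r], [1.0, -2.0 * r * np.cos(theta), r * r]

speech = source
for freq, bw in [(700, 130), (1220, 70)]:   # roughly /a/-like formants
    b, a = resonator(freq, bw, fs)
    speech = lfilter(b, a, speech)

speech = speech / np.max(np.abs(speech))     # normalize to [-1, 1]
print("synthesized", len(speech), "samples at", fs, "Hz")
```

Changing `f0` alters the perceived pitch, while changing the resonator frequencies alters the vowel quality, which is exactly the separation of source and filter described above.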
2 Related Work

A significant amount of work has been done in the area of speech synthesis, or text-to-speech conversion, for English, Japanese, Chinese, and Russian (Takano et al., 2001; Zen et al., 2007; Zen et al., 2009; Wang et al., 2000; Sproat, 1996). Text-to-speech conversion systems for Indian languages have also emerged in the recent past (Patil et al., 2013). Although these systems do not produce the most "natural" sounding output, they are usable to an extent. Manual evaluations of the speech synthesis systems built for the Hindi language show that there is still a need for improved and additional phonetic coverage (Kishore et al., 2003; Raj et al., 2007). Bengu et al. (2002) create an online context-sensitive dictionary using the Princeton WordNet and implement a Java-based speech interface for the text-to-speech (TTS) engine. Kanojia et al. (2016) automatically collect images for IndoWordNet and augment the web interface with them, but due to the lack of tagged images openly available for use, they do not collect enough images. To the best of our knowledge, there has been no other work specifically in the direction of synthesizing audio for WordNet words, or of synthesizing audio for Indian language WordNets.

3 Our Approach

Among the three main sub-types of concatenative synthesis, we choose to perform unit selection synthesis and build cluster units from the speech data recorded by a single speaker. We use the Festival system to create a synthetic voice for Hindi. We followed the documentation of the Festival framework along with the FestVox5 implementation to train a voice on the Hindi speech corpora provided by the IndicTTS Consortium6 for research purposes. Figure 1 displays a generic speech synthesis system which uses concatenative synthesis, i.e. unit selection corpus-based speech synthesis, to generate speech given an input text.

Figure 1: A Unit selection based Concatenative Speech Synthesis System

5 http://festvox.org/
6 https://www.iitm.ac.in/donlab/tts/index.
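FestVox voice building expects the recording prompts in a simple listing (conventionally `etc/txt.done.data`), with one `( utterance-id "text" )` entry per recorded WAV file. As a minimal sketch, not part of the original pipeline, the script below converts an ID-to-transcript mapping into that format; the input file name `transcripts.tsv` and its tab-separated layout are assumptions made for illustration, and `write_prompt_file` is a hypothetical helper.

```python
# Convert an "utterance-id <TAB> transcript" listing into the FestVox
# prompt format, one line per utterance:
#   ( hin_0001 "transcript text" )
# The input file name and column layout are illustrative assumptions.

import csv

def write_prompt_file(tsv_path: str, out_path: str) -> int:
    """Write a FestVox-style txt.done.data file; return the prompt count."""
    count = 0
    with open(tsv_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for row in csv.reader(src, delimiter="\t"):
            if len(row) < 2:
                continue                       # skip malformed lines
            utt_id, text = row[0].strip(), row[1].strip()
            text = text.replace('"', "'")      # double quotes break the format
            dst.write(f'( {utt_id} "{text}" )\n')
            count += 1
    return count

if __name__ == "__main__":
    # The output is conventionally placed at etc/txt.done.data in the voice
    # build directory before running the FestVox scripts.
    n = write_prompt_file("transcripts.tsv", "txt.done.data")
    print("wrote", n, "prompts")
```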

3.1 Dataset

We use the Female Voice - Hindi and Female Voice - English datasets provided by the IndicTTS forum to train our system. The dataset is publicly available for research purposes. We download the complete dataset, i.e., 7.22 hours of audio with English and 5.18 hours of monolingual audio. We also download the dictionary provided on the website to supply to the synthesis system as input. We use a total of 2318 female Hindi utterances downloaded from the IndicTTS consortium, and 1378 word audios manually recorded by us, to train the voice model.

3.2 Architecture and Methodology

While the training input to the system is a corresponding speech-text parallel corpus, in which a WAV file containing audio is aligned to its corresponding text using an ID, a textual unit such as a word or a phrase is given as input in the testing phase. The output is an audio waveform stored in the WAV format.

The system needs a dictionary for letter-to-sound conversion, and we generate one which contains the unique words and, in parallel, the corresponding syllabification of each word with the beginning and ending syllables clearly marked. For example, the Hindi word "kamaane", which means "to earn", would be represented as:

("कमाने" nil ((("क_beg") 0) (("मा_mid") 1) (("ने_end") 0))),

with 0 for lower stress and 1 for high stress. Since the output of our work would be used to generate pronunciations of words, phrases, and short sentences, we need a natural-sounding voice and hence choose to build cluster units of the recorded speech data available.
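Such lexicon entries can be generated programmatically once a word has been syllabified and assigned per-syllable stress. The sketch below is illustrative only: `lexicon_entry` and `position_tag` are hypothetical helpers, the syllabification is assumed to be given, and the exact entry syntax accepted by a particular Festival voice should be checked against its lexicon documentation.

```python
# Render Festival-style lexicon entries of the form shown above, e.g.
#   ("कमाने" nil ((("क_beg") 0) (("मा_mid") 1) (("ने_end") 0)))
# from a word plus its syllables and per-syllable stress (0 or 1).

from typing import List, Tuple

def position_tag(index: int, total: int) -> str:
    """Mark a syllable as the beginning, middle, or end of the word."""
    if index == 0:
        return "beg"
    if index == total - 1:
        return "end"
    return "mid"

def lexicon_entry(word: str, syllables: List[Tuple[str, int]]) -> str:
    """Return one word as a Scheme-like lexicon entry string."""
    total = len(syllables)
    parts = []
    for i, (syl, stress) in enumerate(syllables):
        parts.append(f'(("{syl}_{position_tag(i, total)}") {stress})')
    return f'("{word}" nil ({" ".join(parts)}))'

if __name__ == "__main__":
    # "kamaane" (to earn), with stress on the middle syllable.
    print(lexicon_entry("कमाने", [("क", 0), ("मा", 1), ("ने", 0)]))
```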
3.3 Implementation Details

We implement the unit selection based method for generating audio and also try a pre-implemented neural network based method. Although Hindi audio could not be generated using the latter, we successfully generate audio using the former method.

3.3.1 Unit Selection based Concatenative Synthesis

We perform unit selection synthesis using a large corpus of recorded speech labeled with the text being spoken (the corpus is described in Section 3.1). During corpus creation, each recorded utterance is segmented into some or all of the following: a) individual phones, b) diphones, c) half-phones, d) syllables, e) morphemes, f) words, g) phrases, and h) sentences. A specially modified speech recognizer set to a "forced alignment" mode, with some manual correction, is typically used to divide the recordings into segments (utterances); visual representations such as the waveform and the spectrogram are used to aid this division. An index of the units in the speech database is then created based on the segmentation and on acoustic parameters such as the fundamental frequency (pitch), duration, position in the syllable, and neighboring phones. At run time, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically achieved using a specially weighted decision tree (the HTS system uses HMMs and considers posterior and prior probabilities to decide the best chain). Such a system also requires utterances covering the phones of the language and a considerable amount7 of recorded speech.

Both of the above requirements can also be generated programmatically from a given text corpus which corresponds to a speech database, although the text needs to be in parallel correspondence with the recorded audio files (usually in a WAV audio file format: 16 kHz, mono channel).

7 Conventionally, hours of speech are required.
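The run-time search described above, i.e. choosing the best chain of candidate units, is essentially a shortest-path problem over target costs (how well a unit matches the desired context) and join costs (how smoothly adjacent units concatenate). The sketch below is a simplified dynamic program illustrating that idea; it is not Festival's clunits implementation, and the toy units, features, and cost functions are placeholders.

```python
# Toy unit-selection search: pick one candidate unit per target position so
# that the summed target and join costs are minimized (Viterbi-style DP).

def select_units(targets, candidates, target_cost, join_cost):
    """targets: desired unit specs; candidates: per-position lists of
    database units. Returns the lowest-cost sequence of chosen units."""
    # best[i][j] = (cumulative cost, backpointer) for candidate j at position i
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for cand in candidates[i]:
            tc = target_cost(targets[i], cand)
            prev = [best[i - 1][k][0] + join_cost(candidates[i - 1][k], cand)
                    for k in range(len(candidates[i - 1]))]
            k_best = min(range(len(prev)), key=prev.__getitem__)
            row.append((prev[k_best] + tc, k_best))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        if best[i][j][1] is not None:
            j = best[i][j][1]
    return list(reversed(path))

# Toy usage: units are (phone, pitch) pairs with simple distance-based costs.
targets = [("k", 120), ("a", 118), ("m", 115)]
candidates = [[("k", 119), ("k", 130)],
              [("a", 121), ("a", 140)],
              [("m", 116), ("m", 100)]]
tcost = lambda t, c: abs(t[1] - c[1]) + (0 if t[0] == c[0] else 100)
jcost = lambda a, b: 0.5 * abs(a[1] - b[1])
print(select_units(targets, candidates, tcost, jcost))
```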
3.3.2 Neural Network based Raw Audio Generation

We also use pre-implemented models from around the web to reproduce TTS systems, but, as reported in many places, such systems require huge amounts of data and exorbitant amounts of time to generate even the smallest of samples. We use a WaveNet implementation (basveeling/wavenet)8 to generate raw audio for a piano music dataset. Due to various errors in the implementation when trying to use it to generate audio conditioned on text, we could not use this implementation for any form of text-to-speech generation.

8 https://github.com/basveeling/wavenet

3.3.3 Other Experiments

We also use other pre-trained voices available on the FestVox website to generate audio for comparison with the audio generated via our voice model. We downloaded the following voices:

1. Hindi - Male Voice,
2. Hindi - Female Voice,
3. Marathi - Female, and
4. Marathi - Male.

We generate audio using these voices. A brief record of our survey of the various speech synthesis systems available is provided in Table 1. We also use the default Festival diphone-based voice for Hindi, provided with the system, for comparison, and we survey other potential speech synthesis frameworks and list them in the table for reference.

Technique Explored                 Usable   Voice Models Generated   For Hindi TTS
Festival+FestVox (IndicTTS Data)   Yes      Many                     Yes
Flite Voice (Hindi - Female)       Yes      1                        Yes
Flite Voice (Hindi - Male)         Yes      0                        Yes
Flite Voice (Marathi - Female)     Yes      0                        Yes
Flite Voice (Marathi - Male)       Yes      0                        Yes
Festival (diphone)                 Yes      1                        Yes
WaveNet (basveeling)               Yes      1                        No
DeepVoice                          Yes      0                        No
Merlin                             No       0                        No
MaryTTS                            No       0                        No
Tacotron                           No       0                        No
SampleRNN                          No       0                        No
Char2Voice                         No       0                        No

Table 1: Our tryst with speech synthesis: an overall picture of the area explored

4 Results & Evaluation

We accumulate 6 usable voice models, produce word audios with them, and randomly sample word audios from them. Among these models, the one which we successfully generated using unit selection based concatenative synthesis sounded the most natural on a brief review.

Speech synthesis evaluation is a subjective issue. Different speech voices are used to train the various speech systems, and no agreed-upon metric for the quality of such output has been produced yet. The quality of the production technique is another factor on which speech synthesis depends, and hence the evaluation of speech synthesis systems has been compromised by differences between such factors (production techniques, recording facilities, etc.). Speech synthesis systems require human annotators for the evaluation of their output, and the annotation is done based on the naturalness and intelligibility of the output. A recent work proposes a novel approach that formulates objective intelligibility assessment as an utterance verification problem using hidden Markov models, thereby alleviating the need for human reference speech (Ullmann et al., 2015), although no analogous method exists to assess the naturalness of a speech synthesis output.

We generate word audios for approximately 4000 words using the four best voice models. For the evaluation of our synthesized data, we create a crowd-sourced listening experiment. We randomly choose 30 Hindi words and also get audio recorded for them with the help of our lexicographers. We create a PHP-MySQL based web interface, shown as a screenshot in Figure 2, and crowd-source the results. The interface shows a user three different audio samples, and the user is asked to choose the "Most Natural" audio from among them.

Figure 2: A Unit selection based Concatenative Speech Synthesis System

We receive a total of 442 responses for the 30 word samples; thus, we assume that 14 people completed the test. The results of our initial evaluation based on naturalness are as follows: (i) the mean win percentage of our voice model is over 44%, beating both the other voices by an acceptable margin; (ii) pre-recorded human speech was rated best somewhat less than 30% of the time; and (iii) diphone-based synthesized speech scored around 26% on this scale.

We also randomly chose 535 words and generated synthesized outputs from the four best models; these outputs were presented to two lexicographers for further analysis. They used the following scale to rate the output: (i) unusable (#0), for audio clips which are either distorted or too noisy for the user to comprehend; (ii) usable (#1), for audio clips which are moderately usable, i.e., the user can comprehend the underlying words but the clips could be synthesized better; and (iii) good (#2), for audio clips that are really good and convey the words clearly. For each of the 535 words, the lexicographers were also asked to mark which of the four synthesized clips they liked the most. The resulting tallies are given in Table 2; a brief sketch of the aggregation follows the table.

          #0    #1    #2    #1+#2   Most Liked
Model 1   79    55    99    154     101
Model 2   37    78    112   190     90
Model 3   72    86    58    144     51
Model 4   55    117   107   224     70

Table 2: Results of the manual evaluation of synthesized speech clips
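The per-model tallies above are straightforward counts over the lexicographer judgments. The following minimal sketch, not part of our actual toolchain, assumes the judgments are available as (model, rating) pairs plus one most-liked model name per evaluated word; the data layout and the `tally` helper are illustrative assumptions.

```python
# Aggregate lexicographer judgments into per-model tallies as in Table 2:
# counts of #0/#1/#2 ratings, the usable total (#1 + #2), and the number of
# times each model's clip was marked "most liked".

from collections import Counter, defaultdict

def tally(ratings, most_liked):
    """ratings: iterable of (model, rating) with rating in {0, 1, 2};
    most_liked: iterable of model names, one per evaluated word."""
    by_model = defaultdict(Counter)
    for model, rating in ratings:
        by_model[model][rating] += 1
    liked = Counter(most_liked)
    return {m: (c[0], c[1], c[2], c[1] + c[2], liked[m])
            for m, c in by_model.items()}

if __name__ == "__main__":
    # Tiny illustrative input (three words, two models), not our real data.
    ratings = [("Model 1", 2), ("Model 2", 1),
               ("Model 1", 0), ("Model 2", 2),
               ("Model 1", 1), ("Model 2", 2)]
    most_liked = ["Model 2", "Model 1", "Model 2"]
    for model, row in sorted(tally(ratings, most_liked).items()):
        print(model, "-> #0/#1/#2/#1+#2/most-liked:", row)
```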
The evaluation results in Table 2 clearly show that Model 1 was marked as the most liked audio clip most often, while Model 4 performed the best in terms of producing the largest number of usable audio clips (obtained by summing the clips with ratings #1 and #2).

A qualitative analysis of the synthesized clips highlighted the following issues, particularly with respect to the clips that were marked "unusable": i) flap or tap sounds (ड़, ढ़) were pronounced incorrectly; ii) the intonation for heavy syllables was at times incorrectly rendered, and for words such as 'एकदम' the audio had a specific stress pattern which should ideally have been neutral, making it sound unnatural; iii) there were a few examples of unnecessary lengthening of a syllable, e.g., in बीमारी (beemari, sickness) there was unnecessary stress on 'बी' and hence it was lengthened; iv) incorrect syllable breaks were observed in some words, e.g., नापसंद (naapasand, non-favourite) was pronounced as नाप-संद, which is incorrect; and v) consonant clusters were sometimes mispronounced, e.g., कुत्ता (kutta, dog) was incorrectly pronounced as कु-ता or कुत-ता.

Eventually, we employ the best voice model for generating the word, gloss, and example audios. Using unit selection based concatenative synthesis, we generated audios for 151831 words and 40337 synset glosses/example sentences.

5 Conclusion and Future Work

We present our work on generating voice models using the Festival speech synthesis system, and we describe our efforts to use deep learning based implementations for generating such a model. A survey of the current state-of-the-art techniques available for speech synthesis was also carried out. We download pre-generated voice models available for Hindi and provide a detailed qualitative and quantitative analysis by comparing them with the voice model generated by us. We evaluate our models via crowd-sourcing and select the best voice model to generate audios for all the words, glosses, and example sentences of Hindi WordNet. We believe our work will help students and language learners understand the Hindi language, and help them pronounce it as well.

In the future, we plan to improve the voice model by analyzing the speech output and incorporating more data for training. We also plan to implement WaveNet and other such neural network based techniques for raw audio generation, training models to produce speech for a given text.
References

Jatin Bajaj, Akash Harlalka, Ankit Kumar, Ravi Mokashi Punekar, Keyur Sorathia, Om Deshmukh, and Kuldeep Yadav. 2015. Audio cues: Can sound be worth a hundred words? In International Conference on Learning and Collaboration Technologies, pages 14–23. Springer.

G. Bengu, Guyangu Liu, Ritesh Adval, and Frank Shih. 2002. Educational application of an online context sensitive dictionary. In The 17th International Symposium on Computer and Information Sciences.

Pushpak Bhattacharyya. 2010. IndoWordNet. In Lexical Resources Engineering Conference 2010 (LREC 2010), Malta, May.

Alan Black. 1997. The Festival speech synthesis system: System documentation (1.1.1). Technical Report HCRC/TR-83.

Christiane Fellbaum. 1998. WordNet. Wiley Online Library.

Diptesh Kanojia, Shehzaad Dhuliawala, and Pushpak Bhattacharyya. 2016. A picture is worth a thousand words: Using OpenClipArt library for enriching IndoWordNet. In Eighth Global WordNet Conference, GWC 2016.

SP Kishore, Alan W Black, Rohit Kumar, and Rajeev Sangal. 2003. Experiments with unit selection speech databases for Indian languages. Carnegie Mellon University.

Kevin Knight and Steve K Luk. 1994. Building a large-scale knowledge base for machine translation. In AAAI, volume 94, pages 773–778.

Richard E Mayer. 2002. Multimedia learning. Psychology of Learning and Motivation, 41:85–139.

Dipak Narayan, Debasri Chakrabarti, Prabhakar Pande, and Pushpak Bhattacharyya. 2002. An experience in building the Indo WordNet: a WordNet for Hindi. In First International Conference on Global WordNet, Mysore, India.

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

Hemant A Patil, Tanvina B Patel, Nirmesh J Shah, Hardik B Sailor, Raghava Krishnan, GR Kasthuri, T Nagarajan, Lilly Christina, Naresh Kumar, Veera Raghavendra, et al. 2013. A syllable-based framework for unit selection synthesis in 13 Indian languages. In Oriental COCOSDA held jointly with the 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), pages 1–8. IEEE.

Anand Arokia Raj, Tanuja Sarkar, Sathish Chandra Pammi, Santhosh Yuvaraj, Mohit Bansal, Kishore Prahallad, and Alan W Black. 2007. Text processing for text-to-speech systems in Indian languages. In SSW, pages 188–193.

Ray Richardson and Alan F Smeaton. 1995. Using WordNet in a knowledge-based approach to information retrieval.

Richard Sproat. 1996. Multilingual text analysis for text-to-speech synthesis. Natural Language Engineering, 2(4):369–380.

Satoshi Takano, Kimihito Tanaka, Hideyuki Mizuno, Masanobu Abe, and S Nakajima. 2001. A Japanese TTS system based on multiform units and a speech modification algorithm with harmonics reconstruction. IEEE Transactions on Speech and Audio Processing, 9(1):3–10.

Raphael Ullmann, Ramya Rasipuram, Hervé Bourlard, et al. 2015. Objective intelligibility assessment of text-to-speech systems through utterance verification. In Proceedings of Interspeech, number EPFL-CONF-209096.

Piek Vossen. 1998. A Multilingual Database with Lexical Semantic Networks. Springer.

Ren-Hua Wang, Zhongke Ma, Wei Li, and Donglai Zhu. 2000. A corpus-based Chinese speech synthesis with contextual dependent unit selection. In Sixth International Conference on Spoken Language Processing.

Heiga Zen, Takashi Nose, Junichi Yamagishi, Shinji Sako, Takashi Masuko, Alan W Black, and Keiichi Tokuda. 2007. The HMM-based speech synthesis system (HTS) version 2.0. In SSW, pages 294–299.

Heiga Zen, Keiichi Tokuda, and Alan W Black. 2009. Statistical parametric speech synthesis. Speech Communication, 51(11):1039–1064.