Building from Insufficient Data – Classroom Experiences with Web-based Development Tools

John Kominek, Tanja Schultz, Alan W Black

Language Technologies Institute Carnegie Mellon University, Pittsburgh, USA {jkominek,tanja,awb}@cs.cmu.edu

Abstract

To make the goal of building voices in new languages easier and more accessible to non-experts, the combined tasks of phoneme set definition, text selection, prompt recording, lexicon building, and voice creation in Festival are now integrated behind a web-based development environment. This environment has been exercised in a semester-long laboratory course taught at Carnegie Mellon University. Here we report on the students' efforts in building voices for the languages Bulgarian, English, German, Hindi, Konkani, Mandarin, and Vietnamese. In some cases intelligible voices were built from as little as ten minutes of recorded speech.

1. Introduction

In the past decade, the performance and capability of automatic speech processing systems, including speech recognition and speech synthesis, has matured significantly. With the addition of machine translation linking speech input to speech output, the prospect of two people of different languages communicating together becomes a tantalizing possibility. In light of the increasing trend towards globalization, it has become important to support multiple input and output languages beyond the dominant Western languages (English, German, Spanish, etc.). Due to the high costs and long development times typical of ASR, TTS, and MT, the need for new techniques to support the rapid adaptation of speech processing systems to previously uncovered languages becomes paramount [1].

The 3-year project and software toolkit known as SPICE1 is an initiative intended to dramatically reduce the difficulty of building and deploying speech technology systems. It is designed to support any pair of languages in the world for which a writing system exists, and for which sufficient text and speech resources can be made available. This is accomplished by integrating and presenting in a web-based development interface several core technologies that have been developed at Carnegie Mellon University. These include the Janus ASR trainer and decoder [2], the GlobalPhone multilingual phoneme inventory and speech database [3], the CMU/Cambridge language modeling toolkit [4], the Festival speech synthesis software [5] and FestVox voice building toolkit [6], the Lexlearner pronunciation dictionary builder [7], the Lemur information retrieval system [8], and the CMU statistical machine translation system [9].

A new addition to this suite is an embeddable Javascript applet that provides within-browser recording and playback facilities. By this means any two people who would previously be separated by a language barrier can potentially speak with each other through our recognition/translation/synthesis server (presuming access to a compliant Internet browser). Also, our in-browser recorder provides a solution to an enduring problem of system development: namely, that of data collection. It is no longer necessary that the system developer be able to locate native speakers of a particular language living nearby.

The SPICE software toolkit is in an early stage of development. To stress and evaluate the current state of the system, a hands-on laboratory course, "Multilingual speech-to-speech translation," was offered for credit at Carnegie Mellon University. It ran for a single semester from January to May 2007, taught by three instructors: Tanja Schultz (ASR), Alan W Black (TTS), and Stephan Vogel (MT) [10]. All participants are graduate students studying language technologies. Students were paired into teams of two and asked to create working speech-to-speech translation systems, for a limited domain of their choosing, by the end of the course. The languages tackled were English, German, Bulgarian, Mandarin, Vietnamese, Hindi, and Konkani, a secondary language of India that does not have its own writing system but is transcribed through various competing foreign scripts. Interim results from this course are described in [11]. Here we report on the students' attempts at building synthetic voices in their languages, including the role that lexicon creation played in their efforts.

Creating a speech-to-speech translation system is a very ambitious task, meaning that only a portion of the students' time was allocated to TTS. Consequently, the students attempted to make good with less material than is typical, i.e. much less than the one hour of speech of an Arctic-sized corpus [12]. Some hopeful students relied on less than 10 minutes of speech to build a voice from scratch – insufficient data, without doubt, hence the title of this paper.

2. TTS as a part of Speech-to-Speech Translation

The speech synthesis system at the heart of SPICE is the CLUSTERGEN statistical parametric synthesizer [13], now released as part of the standard Festival distribution. We chose this technology because experience has shown that it (and the similar HTS [14]) degrades gracefully as the amount of training data decreases.

1 Speech Processing – Interactive Creation and Evaluation Toolkit for new Languages.

6th ISCA Workshop on Speech Synthesis, Bonn, Germany, August 22-24, 2007

Unlike the normal development procedure, however, in which the user executes a series of Unix scripts and hand-verifies the intermediate results, in SPICE all operations are orchestrated behind a web interface. The ASR and TTS components share resources, including text collection, prompt extraction, speech collection, phoneme selection, and dictionary building. The interdependencies are depicted in Figure 1.

Figure 1. High level component dependencies in the SPICE system.

2.1. Development Work Flow – TTS

• Text collection. The development process begins with text collection. Text may be collected from the world-wide web by pointing the SPICE webcrawler at a particular homepage, e.g. of an online news site. For additional control, the user can upload a prepared body of text. All of the students used this option, so that they could verify that the text was all within their chosen domain. Given the multitude of character encodings used worldwide, we imposed the constraint that the text had to be encoded in utf-8 (which includes ASCII, but not the vast population of 8-bit code pages). Some began with large collections, others small; cf. section 3.

• Prompt selection. Typically one would convert the collected text to phonemes, then select sentence-length utterances that provide a balanced coverage of predicted units, i.e. of diphones or triphones. At this stage, though, we have no means of predicting acoustics, and so prompts are selected on the basis of phoneme coverage.

• Audio collection. Depending on the source text, the prompt list is of varying length. We instructed the students to record at least 200 sentence-length prompts, adding material as needed if they had a shortfall.

• Grapheme definition. Once text is provided, the SPICE software culls all of the characters and asks the user to define basic attributes of each. These include class membership (letter, digit, punctuation, other) and casing.

• Phoneme selection. In this, perhaps the most critical stage, students select and name a phoneme set for their language. The interface assists this by providing a list of available phonemes laid out similar to the official IPA charts. An example wavefile is available for most phonemes to help in the selection. While this interface was intended to allow a phoneme set to be built up from scratch, not one student did that. Instead, they started from one or two reference lists and used the interface to make refinements.

• G2P rules. The development of grapheme-to-phoneme (or letter-to-sound) rules proceeds in two stages. First, the user is asked to assign a default phoneme to each grapheme, including those that are unspoken (e.g. punctuation). Explaining this request required multiple clarifications, as students tended initially to provide word sound-outs – declaring, for example, that 'w' is not associated with /W/ but is pronounced /D UH B AH L Y UW/ ("double u").

• Pronunciation lexicon. The second stage of G2P rule building goes on behind the scenes as the user accumulates their pronunciation lexicon. Words are selected from the supplied text in an order that favors the most frequent words first. Each word is presented with a suggested pronunciation, which the user may accept or manually correct. As an additional aid, each suggestion is accompanied by a wavefile synthesized using a universal discrete-phoneme synthesizer. Figure 2 shows a screen shot. After each word is supplied to the system, the G2P rules are rebuilt, thereby incrementally providing better predictions, similar to [15]. When the user is satisfied with the size of their lexicon, the final G2P rules are compiled. These are then used to predict pronunciations for all the remaining lower-frequency words.

• Speech synthesis. With the necessary resources provided, the standard FestVox scripts have been modified to a) automatically import the phoneme set and look up IPA feature values, b) import the pronunciation dictionary, and c) compile the G2P rules into a transducer. The recorded prompts are labeled and utterance structures created. With these, the language-specific F0, duration, and sub-phonetic spectral models are trained. The user can then test their synthesizer by entering text into a type-in box.
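The incremental lexicon-building loop described above can be sketched as follows. This is a simplified stand-in for SPICE's Lexlearner, not its actual API: the per-letter majority-vote G2P is a toy model, and `ask_user` stands in for the web form where the user accepts or corrects a suggested pronunciation.

```python
from collections import Counter, defaultdict

# Toy G2P model: one phoneme per letter, learned by majority vote over the
# lexicon entries collected so far (a stand-in for Lexlearner's rule induction).
def g2p_train(lexicon):
    votes = defaultdict(Counter)
    for word, phones in lexicon.items():
        if len(word) == len(phones):          # use only 1:1-alignable entries
            for letter, phone in zip(word, phones):
                votes[letter][phone] += 1
    return {l: c.most_common(1)[0][0] for l, c in votes.items()}

def g2p_predict(rules, word):
    # Fall back to the uppercased letter when no rule has been learned yet.
    return [rules.get(letter, letter.upper()) for letter in word]

def build_lexicon(corpus_words, ask_user, n_words=200):
    """Present words most-frequent-first; after each confirmed word,
    rebuild the G2P rules so that later suggestions improve."""
    lexicon, rules = {}, {}
    for word, _ in Counter(corpus_words).most_common(n_words):
        suggestion = g2p_predict(rules, word)
        lexicon[word] = ask_user(word, suggestion)   # user accepts or corrects
        rules = g2p_train(lexicon)
    return lexicon, rules
```

With a simulated user that corrects each suggestion against a reference dictionary, the loop converges quickly on toy data: the rules learned from a handful of confirmed words already predict unseen words built from the same letters.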

Language              Prompts   Time
Bulgarian             358       14:42
English                96        4:00
German                424       22:33
Hindi                 191        9:51
Konkani               195        7:49
Mandarin              199       37:47
Vietnamese (rec)      203       10:41
Vietnamese (built)     77        3:38

Table 2. Size of speech recordings (time in mm:ss). Whitespace was not trimmed from the prompts. Due to gaps in the lexicon, the Vietnamese voice was built from only a third of the available recordings.

Figure 2. Web interface to lexicon builder.
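Selecting prompts "on the basis of phoneme coverage" (section 2.1) is naturally expressed as a greedy set cover. The sketch below assumes a caller-supplied `phonemize` function; it illustrates the criterion and is not SPICE's actual selection code.

```python
def select_prompts(sentences, phonemize, n_prompts=200):
    """Greedily pick the sentence that adds the most not-yet-covered
    phonemes; once coverage is complete, fall back to the source order."""
    covered, chosen = set(), []
    remaining = list(sentences)
    while remaining and len(chosen) < n_prompts:
        best = max(remaining, key=lambda s: len(set(phonemize(s)) - covered))
        if not set(phonemize(best)) - covered:
            best = remaining[0]          # nothing new to cover: keep order
        covered |= set(phonemize(best))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Using `str.split` as a stand-in phonemizer, `select_prompts(["a b", "a", "c d", "a b c"], str.split, 3)` returns `["a b c", "c d", "a b"]`: the richest sentence first, then whatever extends coverage.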

3. Descriptive Statistics

Prior to tackling their own language, students were encouraged to complete an English walk-through. The "English walk-through" is a prepared body of text and audio accompanied by step-by-step instructions. The walk-through material is based on the first 200 utterances of the rms Arctic database [12]. Having been designed for phonetic coverage and balance, this subset offered sufficient support for both ASR acoustic model adaptation and TTS construction; 200 utterances approximates the minimum required for success. In the tables below the figures for English are from this database.

3.1. Corpus Size

There are several relevant measures when referring to the size of the database used in voice construction. First is the original text corpus, from which the SPICE tools select a prompt list suitable for recording. Students are permitted to modify their prompt list, adding sentences they want covered in their domain and deleting those deemed inappropriate. During recording it is not unusual for additional prompts to be skipped (due to containing unpronounceable character sequences). As a rule of thumb, students were advised to record a minimum of 200 sentence-length prompts, even though the English voice was built from 96.

              Text corpus               Selected prompts
Language      Utts    Types   Tokens    Utts   Types   Tokens
Bulgarian     23049   69607   508349    563    928     3517
English         200     798     1792     96    446      864
German        46328   49304   446765    435    1003    2913
Hindi          1543     558    12185    192    557     1524
Konkani         761    2008     3422    200    503      890
Mandarin       9925   22252   196120    199    1608    3669
Vietnamese      203     408     1520    203    400     1524

Table 1. Size of language corpora in utterances and words (left), and of the selected prompt list (right).

3.2. Word and Character Coverage

When building a voice for a new (i.e. previously uncovered) language, the development of a pronunciation dictionary is a major element of the task. In the name of expediency, pronunciations for the 757 words needed for the English voice were extracted from CMU-DICT [16]. The students did not have this luxury and instead used the Lexlearner component of SPICE to create a dictionary based on their supplied text. The one exception is Mandarin, for which the student uploaded a larger prepared dictionary.

Students were allowed to modify, supplant, and even replace the automatically selected prompts with their own list. This was true for Bulgarian, Hindi, and Vietnamese. Such allowance is a consequence of working in a multi-purpose system: the language modeling component of ASR generally requires a large body of text, whereas TTS can often be improved if the text is targeted to the intended domain. The Hindi text, for example, was drawn from the Emille corpus [17], but the prompt list targeted the domain of cooking, restaurants, and food recipes. In such cases there is a mismatch between the intended usage and the assembled lexicon. Consequently the voice is forced to rely on grapheme-to-phoneme rules to "carry the day."

              Dict     Text corpus      Selected prompts
Language      words    Types   Tokens   Types   Tokens
Bulgarian     396      0.57    49.36    0.0     0.0
English       757      95.55   99.77    100.0   100.0
German        1037     1.96    60.39    31.80   66.36
Hindi         356      64.03   86.87    0.0     0.0
Konkani       318      14.54   15.93    16.70   14.94
Vietnamese    288      70.34   59.54    1.25    0.46

Table 3. Dictionary coverage of the original text corpus and selected prompts, in percent. For Bulgarian, Hindi, and Vietnamese the recorded prompts were not derived from the text.
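The type and token coverage percentages of the kind reported in Tables 1 and 3 can be computed in a few lines. A minimal sketch, assuming naive whitespace tokenization rather than whatever normalization the SPICE tools apply:

```python
from collections import Counter

def coverage(dictionary, text_tokens):
    """Percent of word types and of word tokens covered by the dictionary."""
    counts = Counter(text_tokens)
    in_dict = [w for w in counts if w in dictionary]
    type_cov = 100.0 * len(in_dict) / len(counts)
    token_cov = 100.0 * sum(counts[w] for w in in_dict) / sum(counts.values())
    return type_cov, token_cov
```

Token coverage exceeds type coverage whenever the dictionary favors frequent words, which is exactly the pattern visible for Bulgarian and German in Table 3.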

6th ISCA Workshop on Speech Synthesis, Bonn, Germany, August 22-24, 2007 324 During construction of a pronunciation lexicon, it is the G2P Rules system that selects words and asks the user for the correct 300 all rules / ave letter sequence of phonemes. Words are ordered from the most to Language least frequently occurring. The benefit of this can be seen for words words letter perplex. Hindi, Bulgarian, and German. In German, the 1000 most Bulgarian 51 54 1.019 1.002 frequent words is enough to cover 60% of the 450k text. However, coverage of the prompts is poor when the student English 360 727 25.07 3.350 opted not to go with the automatically selected list (breaking German 236 523 4.023 1.932 our original assumptions). Transcripts for the prompts then depend on the fidelity of grapheme-to-phoneme rules learned Hindi 190 212 3.655 1.693 from a few hundred words. Since this is not optimal, a better Konkani 205 223 7.964 2.356 alternative is to solicit coverage of the recorded prompts first, before proceeding onto the larger body of text, is necessary. Vietnamese 139 139 2.837 2.524 Grapheme coverage may be even more important than word coverage, due to the fact that Festival will reject an Table 5. Comparison of G2P complexity. Note that entire utterance if it contains with an undefined Vietnamese is limited to 288 words. pronunciation. This information is solicited during the development process (see section 2.1), but the system does not have checks in place to strictly enforce complete 4.1. Case Study: Hindi coverage. Oversights thus slip through, particularly when the To demonstrate the effort required to build a challenging student appends data to their text collection without revisiting lexicon, we report on the case of Hindi. Text was extracted the character definition protocol. 
The languages where this from the Emille Lancaster Corpus [17], comprising 210 became problematic were Konkani (uppercase letters) and thousand words and 10.2 million tokens from the domain of Vietnamese (various omissions). current news. The Hindi speaker in the course used the SPICE toolkit to perform the following tasks: a) provide default Language Graphemes Text corpus letter-to-sound rules for each grapheme, b) provide count types tokens pronunciations for the most frequent 200 words, c) correct automatically generated pronunciations for the 200 Bulgarian 74 85.14 99.81 words, and d) correct automatically generated pronunciations English 51 100.0 100.0 for the 200 words randomly selected from the remainder of the corpus. Error rates for these 200 words were then compared German 53 100.0 100.0 for three cases: a) G2P rules based solely on the default Hindi 67 85.71 99.82 assignment, b) rules trained on the first 200 words, and c) words trained on the first 400 words. As summarized in Table Konkani 52 57.69 93.14 6, word accuracy increased from 23 to 41 to 51%. Additional Vietnamese 57 80.70 86.05 detail is plotted in Figure 3.

Table 4. Grapheme coverage. Values are in percent. G2P Rules Test Set Training Num 1-200 201-400 400+ Size Rules words words words 4. G2P Rule Learning Default 49 52.7 32.3 22.6 The complexity of the relation between graphemes and phonemes of a language of course varies dramatically from LTS-200 127 51.0 40.9 language to language. Of the languages described here, LTS-400 216 50.5 Bulgarian has the most straightforward, while English is highly irregular and Mandarin, being ideographic, exhibits no Table 6: Word accuracy on 187-word test set, for relation at all. Clearly, languages with a simple relation letter-to-sound rules based on 0, 200, and 400 extrapolate more readily to unseen items, thus increasing the training words (Emille corpus). chances of a successful voice. And as previously pointed out, those projects with poor word coverage from the lexicon The impact of these numbers can be seen in Figure 4, depend heavily on the G2P rules. From Table 3 these are where projected token coverage of the Emille Hindi corpus is Bulgarian, Hindi, Konkani, and Vietnamese. compare to the optimal coverage offered by a complete The relative difficulty of languages (excepting Mandarin) dictionary. can be seen by comparing the number of rules and rate of G2P rule growth with respect to the vocabulary size. These values are summarized in Table 5. Average letter perplexity – another indicative figure – is also included. As expected, English has the most complex G2P relationship. Bulgarian has a script that is nearly perfectly phonetic.
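The "average letter perplexity" figure in Table 5 is not defined in the text. One plausible reading (an assumption on our part, not the paper's stated definition) is the perplexity of the phoneme distribution that each grapheme maps to, averaged over graphemes, estimated here from 1:1-alignable lexicon entries:

```python
import math
from collections import Counter, defaultdict

def avg_letter_perplexity(lexicon):
    """Estimate P(phoneme | grapheme) from the lexicon, compute each
    grapheme's perplexity 2**H, and average over graphemes.
    NOTE: an assumed reconstruction of the metric, not the paper's code."""
    dist = defaultdict(Counter)
    for word, phones in lexicon.items():
        if len(word) == len(phones):
            for g, p in zip(word, phones):
                dist[g][p] += 1
    perplexities = []
    for counts in dist.values():
        total = sum(counts.values())
        entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
        perplexities.append(2 ** entropy)
    return sum(perplexities) / len(perplexities)
```

Under this reading, a perfectly phonetic script (every grapheme mapping to one phoneme) scores 1.0, matching the near-1.0 value Table 5 reports for Bulgarian.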

5. Voice Quality Assessment

Each student was asked to provide a subjective impression of their voice, based on synthesis of in-domain prompts and of "random" things they thought to type in. Three of the voices were a success, with the German voice receiving the most positive feedback (three people in the course speak German) – though see section 5.1 for a more quantitative measurement. The Hindi voice was also deemed good, with the caveat that sentence-initial words tended to be confusing. The relative success of this pair can be attributed to reliable G2P rules, which allow the voice to go beyond the explicit lexicon (see Table 3). The English voice, built from just four minutes of speech, was surprisingly understandable, though words outside the lexicon were not uncommonly mispronounced.

The Vietnamese voice was poor – our two native speakers had trouble understanding what was said – though the tone contour was often correct. Since the Vietnamese voice was built from a mere three minutes of speech this result is understandable. Less understandable is the case of Mandarin, which for the amount of data available should have been a good voice. We don't know yet whether this is attributable to errors in processing (i.e. bugs in the software), or some deeper limitation confronted by tonal languages. The Konkani voice has been jokingly heralded as the best of its kind in the world (being the only one!) but is, the speaker admitted, incomprehensible. The details of why need to be determined. At this point we can safely conclude that 15% word coverage of the prompt list is insufficient for this language.

Language      Time     Impression of quality
Bulgarian     14:42    (no feedback at time of writing)
English        4:00    understandable, mispronounces words
German        22:33    good, including …
Hindi          9:51    good, most words understood
Konkani        7:49    incomprehensible
Mandarin      37:47    fair
Vietnamese     3:38    poor

Table 7. Impressionistic quality of voices as assessed by native speakers. For convenience the total length of each database is repeated from Table 2.

Figure 3. Error rate of randomly sampled Hindi words taken from blocks along the log frequency scale (Emille corpus), for the default letter-to-sound rules alone and with the first 200 and 400 words of training. Each dot represents 20 words.

Figure 4. Coverage of word tokens for Hindi when trained on the 400 most frequent words (roughly 40% coverage at 100 words, 50% at 200, and 60% at 400), against the optimal coverage offered by a complete dictionary. A comparable curve for English is provided as a reference.

5.1. Word Comprehension

To establish a more quantitative assessment of intelligibility, we chose two of the better synthesizers for listening tests: German and Hindi. One of the German students provided transcriptions of the German voice. Twenty sentences were randomly selected from within the application domain and synthesized, with an additional four extracted from out of domain. The tester was allowed to listen to the synthesized sentences more than once, and to note which words became more clear after multiple listenings. Transcripts were double-checked for typographic errors. The Hindi listener was not a part of the class and thus was not familiar with the domain.

For the German in-domain sentences, 76% of the words were transcribed correctly, versus 55% for the out-of-domain sentences. For Hindi the corresponding rates are very similar: 76% and 54%. Details are tabulated in Table 8.

Words correct (German)
1-4     5-8     9-12    13-16   17-20   21-24
3/5     7/8     8/8     4/5     8/8     9/11
1/6     6/8     3/6     3/6     6/7     5/9
4/6     2/8     5/6     3/5     8/9     5/11
8/8     4/6     7/7     9/9     6/7     3/9

Words correct (Hindi)
7/7     6/6     4/8     6/6     4/4     4/6
4/6     5/12    3/5     9/11    5/6     0/6
10/10   3/7     5/8     4/6     6/7     4/5
10/11   7/7     5/8     5/7     2/3     5/5

Table 8. Words correct on randomly selected sentences for German (top) and Hindi (bottom). Sentences 1-20 are in-domain; 21-24 out-of-domain.
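The per-sentence scores in Table 8 are counts of correctly transcribed words. A minimal scorer is sketched below, using position-wise matching, which assumes the transcript preserves word order; a real evaluation might align reference and transcript by edit distance instead:

```python
def words_correct(reference, transcript):
    """Count position-wise word matches: returns (correct, total)."""
    ref, hyp = reference.split(), transcript.split()
    correct = sum(r == h for r, h in zip(ref, hyp))
    return correct, len(ref)

def percent_correct(pairs):
    """Aggregate word accuracy over (reference, transcript) sentence pairs."""
    correct = sum(words_correct(r, h)[0] for r, h in pairs)
    total = sum(words_correct(r, h)[1] for r, h in pairs)
    return 100.0 * correct / total
```

Aggregating in-domain and out-of-domain pairs separately yields the 76% versus 55% split reported above.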

6. Conclusions

By integrating ASR, TTS, and lexicon building into a single, simplified, web-based development framework, the aim of SPICE is to make speech technology available to developers who are not experts in language technology. Admittedly, the students participating in this lab do not fit the bill of naive users – our ultimate target audience. All are graduate students in the Language Technologies Institute and, due to the course credits on offer, were not just technically proficient but motivated. Their experience and observations have helped us identify deficiencies that need to be addressed before the software can reliably be employed by less sophisticated users – those that "just know their language."

For the task of voice building, more data-validity checks need to be incorporated, so that, for example, the user does not reach the end of a failed voice-building attempt only to discover that the phoneme set, the character definitions, or the lexicon is in some way deficient. In a similar vein: faced with the sizable task of providing pronunciations for thousands of words, our users have requested that they only be presented with the essential fraction, i.e. only words that the system is unsure about.

This raises a deep and challenging question: can the system be sufficiently "self-aware" that it knows when it needs more information, and when it can stop? At a practical level, we'd like the system to know when it has an amount of speech sufficient for building a good quality voice. Possibly we can resort to proxy measures of quality, such as mean cepstral distortion and the average prediction error of the prosodic models. Additionally, determining a suitable stopping point may involve iterative cycles of feedback from the user to perform transcription tests and point out misspoken words. These remain open issues.

7. Acknowledgments

We thank Stephan Vogel for providing software and expertise in statistical machine translation. We also thank the following people for active participation: Nguyen Bach, Sameer Badaskar, Matthias Bracht, Joerg Brun, Roger Hsiao, Thuy Linh Nguyen, Kemal Oflazar, Sharath Rao, Emilian Stoimenov, and Andreas Zollmann. Extra thanks go to Joerg Brun and Sameer Badaskar for transcribing test sentences. Sameer Badaskar also wrote the embedded recorder used for collecting speech. Matthew Hornyak kept all the various software in running order and deserves our undying gratitude.

This work is in part supported by the US National Science Foundation under grant number 0415021, "SPICE: Speech Processing Interactive Creation and Evaluation Toolkit for new Languages." Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

8. References

[1] Schultz, T. and Kirchhoff, K. (Eds.), Multilingual Speech Processing, Academic Press, 2006.
[2] Schultz, T., Westphal, M., and Waibel, A., The GlobalPhone Project: Multilingual LVCSR with Janus-3, 2nd SQEL Workshop, Plzen, Czech Republic, 1997.
[3] Schultz, T., GlobalPhone: A Multilingual Speech and Text Database developed at Karlsruhe University, ICSLP, Denver, USA, 2002.
[4] Clarkson, P. and Rosenfeld, R., Statistical Language Modeling Using the CMU-Cambridge Toolkit, Proceedings of ESCA Eurospeech, 1997.
[5] Taylor, P., Black, A., and Caley, R., The architecture of the Festival speech synthesis system, Proceedings of the Third ESCA Workshop on Speech Synthesis, Jenolan Caves, Australia, 1998, pp. 147-151.
[6] Black, A. and Lenzo, K., The FestVox Project: Building Synthetic Voices, http://festvox.org/bsv, 2000.
[7] Kominek, J. and Black, A. W., Learning Pronunciation Dictionaries: Language Complexity and Word Selection Strategies, Proceedings of the Human Language Technology Conference of the NAACL, pp. 232-239, New York City, USA, 2006.
[8] The Lemur Toolkit, www.lemurproject.org.
[9] Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A., The CMU Statistical Machine Translation System, Proceedings of MT Summit IX, New Orleans, USA, September 2003.
[10] 11-733/735 Multilingual Speech-to-Speech Translation Seminar/Lab, http://penance.is.cs.cmu.edu/11-733.
[11] Schultz, T., Black, A. W., Badaskar, S., Hornyak, M., and Kominek, J., SPICE: Web-based Tools for Rapid Language Adaptation in Speech Processing Systems, InterSpeech 2007, Antwerp, Belgium.
[12] Kominek, J. and Black, A. W., CMU Arctic Speech Databases, Speech Synthesis Workshop 5, Pittsburgh, USA, 2004.
[13] Black, A. W., CLUSTERGEN: a statistical parametric synthesizer using trajectory modeling, InterSpeech 2006, Pittsburgh, USA, September 2006.
[14] Zen, H., Tokuda, K., and Kitamura, T., An introduction of trajectory model into HMM-based speech synthesis, Speech Synthesis Workshop 5, Pittsburgh, USA, 2004.
[15] Davel, M. and Barnard, E., Efficient generation of pronunciation dictionaries: human factors during bootstrapping, ICSLP 2004, Jeju, Korea.
[16] CMUDICT, www.speech.cs.cmu.edu/cgi-bin/cmudict.
[17] Hindi EMILLE Lancaster Corpus, www.elda.org/catalogue/en/text/.
