Virtual Storytelling: Emotions for the narrator

Master’s thesis, August 2007

H.A. Buurman

Committee:
dr. M. Theune
dr. ir. H.J.A. op den Akker
dr. R.J.F. Ordelman

Human Media Interaction
Faculty of Electrical Engineering, Mathematics & Computer Science
University of Twente

Preface

During my time here as a student in the Computer Science department, the field of language appealed to me more than all the other fields available. For my internship I worked on an assignment involving speech recognition, so when a graduation project concerning speech generation became available I applied for it; it seemed like a nice complement to my experience with speech recognition. Though the project took me into previously unknown fields of psychology, linguistics, and (unfortunately) statistics, I now feel a lot more at home working with and on Text-to-Speech applications. Along the way, many people supported me, motivated me and helped me by participating in experiments. I would like to thank these people, starting with Mariët Theune, who kept providing me with constructive feedback and never stopped motivating me. I would also like to thank the rest of my graduation committee, Rieks op den Akker and Roeland Ordelman, who despite their busy schedules found time now and then to provide alternative insights and support. My family also deserves my gratitude for their continued motivation and interest. Lastly I would like to thank all the people who helped me with the experiments. Without you, this would have been a lot more difficult.

Herbert Buurman

Samenvatting (Summary)

The development of virtual storytelling never stands still. As long as a virtual storyteller does not perform at a level comparable to a human storyteller, there is something left to improve. The Virtual Storyteller of the HMI department of the University of Twente uses a Text-to-Speech application that automatically converts input text into speech. It is certainly not trivial to make this automatically generated speech sound human, especially when it comes to telling stories: there are many aspects of storytelling in which narrators use their voice to enhance the performance. This report describes the research into how storytellers use their voice to convey the emotions of characters in the story. It looks specifically at how the storyteller changes his voice to let a character in the story say something in an emotional way. An experiment was conducted to identify the emotional charge of speech fragments of characters. These fragments were then analysed in an attempt to find out how the emotional charges are linked to the way the storyteller uses his voice. The analysis was then used to construct a model that the Text-to-Speech application uses to generate emotional speech instead of neutral speech. The model has been implemented in an open-source Text-to-Speech application that uses a Dutch voice. This allows the Virtual Storyteller to create text marked up with an emotion, which is then converted into emotional speech.

Abstract

The development of virtual story-telling is an ever ongoing process. As long as it does not perform on the same level as a human storyteller, there is room for improvement. The Virtual Storyteller, a project of the HMI department of the University of Twente, uses a Text-to-Speech application that creates synthetic speech from text input. It is definitely not a trivial matter to make synthetic speech sound human, especially when story-telling is involved: there are many facets of story-telling where storytellers use their voice in order to enhance their performance. This thesis describes a study of how storytellers use their voice to convey the emotions that characters in the story are experiencing; specifically, how the storyteller changes his voice to make a character say something in an emotional way. An experiment has been conducted to identify the emotional charge in fragments of speech by story characters. These fragments are then analysed in an attempt to find out how the emotional charge is linked to the way the storyteller changes his voice. The analysis is then used in the creation of a model that is used by the Text-to-Speech application to synthesise emotional speech instead of neutral speech. This model is implemented in an open-source Text-to-Speech application that uses a Dutch voice. This allows the Virtual Storyteller to create text marked with an emotion, which is then used to synthesise emotional speech.

Contents

1 Introduction
  1.1 Story-telling, what is it and why do we do it?
  1.2 Virtual story-telling
  1.3 Goals
  1.4 Approach

2 Literature
  2.1 Story-telling
  2.2 An overview of speech synthesis
    2.2.1 Concatenative synthesis
    2.2.2 Formant synthesis
    2.2.3 Articulatory synthesis
    2.2.4 Speech synthesis markup languages
  2.3 Prosody
  2.4 Definitions of emotions

3 Choosing the voice
  3.1 Text-to-Speech engine
  3.2 Limitations of Festival
  3.3 Conclusion

4 The search for emotion
  4.1 Test material
  4.2 Preliminary experiment
  4.3 Target participants
  4.4 Test environment
  4.5 Test structure
  4.6 Experiment results
  4.7 Discussion
  4.8 Conclusion

5 The birth of a model
  5.1 Speech analysis in literature
  5.2 Speech analysis
  5.3 Comparison with other measurements
  5.4 Discussion
    5.4.1 Quality and quantity of fragments
  5.5 Conclusion

6 Plan B
  6.1 The Emotion Dimension Framework
  6.2 A new model
    6.2.1 The link with the Virtual Storyteller
  6.3 The new model
  6.4 Conclusion

7 Implementation
  7.1 Narrative model
  7.2 Input format
  7.3 Additional Constraints
  7.4 SSML parser implementation details
  7.5 Festival web interface

8 Evaluation experiment
  8.1 Material
  8.2 Interface
    8.2.1 Introductory page
    8.2.2 Main page design
  8.3 Results
  8.4 Discussion
  8.5 Conclusion

9 Conclusions and Recommendations
  9.1 Summary
  9.2 Recommendations

A Tables
  A.1 Presence test
  A.2 The Model
  A.3 Evaluation Experiment

B Maintainer Manual
  B.1 Festival code manual
    B.1.1 Source code locations
    B.1.2 Compiling instructions
    B.1.3 Code structure
    B.1.4 Data structure
    B.1.5 SSML implementation
  B.2 VST XML Preparser
  B.3 SSML Function list
  B.4 SSML Variable list

Chapter 1

Introduction

This chapter gives a quick introduction to story-telling, the virtual story-telling project, and how the performance of story-telling can be enhanced using a TTS engine.

1.1 Story-telling, what is it and why do we do it?

Story-telling is a very multifaceted subject, not consistently defined and very open to subjective interpretation. Many people define story-telling differently, but in my opinion the National Story-telling Association [16] uses the most refined definition.

Story-telling is the art of using language, vocalisation, and/or physical movement and gesture to reveal the elements of a story to a specific, live audience.

These story elements are informational in nature. They range from simple absolute facts (consistent with the world we live in or not) and reports of facts by characters (which makes the reports subjective and therefore not necessarily true), to information on a higher level that consists not only of facts and reports but also of guidelines on how to proceed in possibly unknown situations (like how to fish), wisdom, which involves morals and ethics, and feelings, which can include emotions [6]. Clearly, stories involve passing on information at some level.

There are many reasons why one would tell stories; we will call these story-telling goals: to educate others in a historical, cultural [50], or ethical sense; to entertain; or to foster personal development, either linguistic (vocabulary and listening skills) or psychological (self-awareness and world comprehension) [18][8]. If a story is told for entertainment purposes, it will always contain one or more of the other story-telling goals because of the story elements present in it: a story about nothing is very likely not entertaining, so the story has to contain story elements, which in turn serve the other story-telling goals. The reverse is also true: entertainment is a requirement for the successful transfer of the story contents. If the story is not entertaining, people start to lose attention, and part of the story elements, and with them part of the story goals, will be missed. As such, entertainment is both a requirement for and a result of a successful story-telling act.

Some people distinguish between story-telling with a one-way information stream, such as listening to or watching prerecorded audio, video or text, and story-telling with a two-way information stream, such as live performances by storytellers. The only difference is that in the latter case the performer can adapt his or her performance to the feedback given by the audience, which can be anything from noticing that people are not paying attention to massive jeering.


1.2 Virtual story-telling

At the University of Twente, research is being done to create a virtual storyteller (henceforth abbreviated as VST) [54]. This project involves creating the plot of a story, creating the language to express the story, and presenting the story to the audience. The VST system performs the roles of both the author and the narrator. The characters of the story are represented by intelligent emotional agents which, guided by the plot agent where necessary, reside in a world managed by a world agent. The actions of these agents are monitored and, after additional language processing [47], transformed into a story. This story is then presented to the audience by means of a Text-to-Speech application. The internal emotional state of the agents, however, is only used to influence the actions of the agents and is not used at all to enhance the presentation of the story. The VST system currently only generates a narrated story, without any character dialogue. Most fairy-tales, however, do contain character dialogue, and it is not unexpected that future versions of the VST system will also create character dialogue. At the moment, the VST system is unaware of any audience and is unable to get (and thus use) feedback from the audience. The current naturalness of the story-telling voice used in the VST system is nowhere near that of a real-life storyteller. It depends on predefined rules for accent placement and pronunciation which are not designed for a story-telling performance and will therefore need updating to conform to the style a professional storyteller uses. The VST also does not use emotions in the narration of a story.

1.3 Goals

This project focuses on how a given story can best be told, relying on how popular storytellers tell their stories. In particular, the aspects of story-telling related to the Text-to-Speech (TTS) system will be looked at: how to make the voice sound when telling a story. The wording of the story itself is predetermined; the focus lies solely on presenting the text to the audience in the most lifelike manner. The ultimate target performance of the TTS system is to mimic a real storyteller, voice actor, or simply someone who is good at reading aloud. This means that the presentation of the story elements must be transparent to the story-telling act itself; people should not notice how good the storyteller is, but should be completely immersed in the story [51].

The current TTS implementation of the VST system uses a closed-source commercial engine. This limits the amount of influence one can exert in order to change the way the speech is generated. At the moment, the process which augments the voice of the TTS engine with story-telling styles takes place outside the engine because of the inability to influence the necessary prosodic parameters inside it [27]. As such, one of the goals is to implement this process using an open-source TTS engine in which the influence that can be exerted on the parameters needed for the generation of speech is greater than in the current situation.

I will investigate whether the generated speech can be improved to sound more like a natural storyteller. In order to allow a TTS system to mimic a good storyteller, it is important that the material (instructions or audio) on which measurements are based comes from a storyteller or voice artist who is generally accepted as being 'good'. Koen Meijs has performed an in-depth investigation of the prosodic configurations of the narrative story-telling style and two types of usage of suspense in story-telling, and devised a rule set that allows the morphing of a neutral voice in order to create suspense and emphasise words the way storytellers do [27]. This, combined with the fact that the emotional state of the agents in the VST system is currently unused with respect to the narration of the story, is the reason that I will focus on the emotion/attitude aspect of the story-telling act. An initial literature study on emotions and story-telling showed that the subject is too vast to be completely analysed in this thesis. Because of this, I have chosen to limit my research to the study of emotional utterances of characters inside a story and how to recreate these utterances using TTS. The end result is a number of applications which make the transformation from emotion-tagged text (produced by the VST story generator) to emotional synthetic speech possible. It also includes the narrative styles that were implemented by Meijs in an earlier version of the narration module of the VST.

1.4 Approach

First, literature was consulted to determine how emotions are used in the art of story-telling. The results are given in section 2.1. Next, the reader is given a brief overview of current TTS technologies in section 2.2. A brief literature study on prosody and how the term is used in this thesis is presented in section 2.3, followed by a short overview of how emotions are categorised in section 2.4. In chapter 3, several TTS engines are listed and a choice is made as to which TTS engine will be used in the remainder of this project.

It quickly became apparent that in order to study aspects of fragments of emotional speech, these fragments must first be annotated with emotion terms. To this end, a perception test was created in which fragments of a told story are linked with the emotion perceived. This is reported in chapter 4. The fragments which scored well above chance were deemed eligible for detailed analysis in order to determine which prosodic parameters were largely responsible for the perceived emotion. The analysis and its results, as well as the first model created, are detailed in chapter 5. The analysis, however, was not conclusive, and the model was discarded in favour of another model, which is presented in chapter 6.

After this, the new model was implemented in the chosen TTS system. In order to interface the new model with the VST system, an XML format was devised. The TTS system was extended with the ability to use SSML (see section 2.2.4) as input, and a preprocessor application was designed and implemented that converts the VST XML files into SSML. All this is documented in chapter 7. In order to confirm (or deny) the proper functioning of the new model, an evaluation test was performed. Test subjects were presented with two variants of an utterance: one variant uses the new model to change the prosody, the other does not. The test subjects were then asked whether they recognised the target emotion in the modified version. The test and its results are presented in chapter 8.

Chapter 2

Literature

This chapter discusses the different contributing factors of story-telling, gives an overview of existing speech synthesis technologies, sets a definition for 'prosody', and presents the emotion classification used further in this project.

2.1 Story-telling

Even though the term is widely known, and many people have practised or witnessed acts of story-telling themselves, a few things need to be distinguished in order to have specific subjects on which to focus the research. In the life-cycle of a story, three different parties can be distinguished:

• The author: the author creates the story world, people, and events within the story.
• The narrator: the narrator presents (tells, shows) the story to the reader.
• The reader (or audience): the reader's contribution is to understand and interpret the story.

The VST performs the roles of the author and the narrator, creating the story from scratch and presenting it to whoever is present. This research focuses on the narrator, in order to improve the VST narrator's naturalness. When a narrator uses vocalisation to tell a story, several different aspects of story-telling technique can be distinguished [27, pp. 10-11]:

• Specific narrative style: A narrator who tells a story uses a completely different speaking style than a news reader. A storyteller will use certain tempo and pitch variations to emphasise and clarify certain elements of the sentence that are important. This technique is especially observable in the case of stories for children.

• Presence of emotion and attitude: The presence of emotional or attitudinal expression in the speech, based on the plot, increases the engagement of the listener of the story.

• Tension course: Every story contains a certain dramatic tension based on the events that take place in the story. Depending on this tension course the speaker creates silences or evolves towards a climax. When, for example, a sudden event takes place, the narrator communicates the tension change involved in this event in his speech to engage the listeners.

• Use of voices for characters: A narrator can use various voices in order to realise a distinction among characters in the story (for example a witch with a high grating voice).

With the exception of character voices, Meijs [27] has modelled the above-mentioned aspects of the narrative story-telling style. Since the goal of this thesis is to enhance the VST system with the ability to produce emotional expressions during speech synthesis, we must ask ourselves where and why, in a story-telling act, emotional expressions occur.

There are multiple reasons why an emotional charge can be present in a fragment of a story, for example:

• The storyteller is (accidentally or not) experiencing an emotion while telling the story, which influences his performance.
• The storyteller wants to make a character of the story sound as if it is experiencing an emotion, which he does either by recalling an emotional event and thus experiencing the target emotion, which influences his performance, or by faking the emotion.
• The storyteller wants the audience to experience a specific emotion.

In a paper investigating expressiveness in story-telling, Véronique Bralé et al. [3] state (p. 859):

Expressiveness is used here to refer to voluntary communicative affective expression as opposed to involuntary and direct expression of the speaker’s emotional states during his/her speech communication process. Of course, in the specific case of story-telling, we are not interested in the speaker’s own emotional states, but either in the expression of the emotions felt by the story characters or in the expression aiming at inducing emotional states to the listener.

I wholeheartedly agree with this. While involuntarily expressed emotions can influence a storyteller's performance, today's TTS systems do not have emotions of their own, and therefore research should be directed towards voluntarily expressed emotions. Furthermore, there is a difference between emotional expressions and expressions intended to cause emotions. When analysing the expressions themselves, it is easier to recognise the former than the latter: almost all human beings can determine whether an expression is emotional (you can hear whether someone sounds afraid, angry, bored, etc.), but figuring out whether an expression is intended to cause emotions just by analysing the expression is not as easy (except when this is done by expressing the emotion of the character: people might start to feel affection for a scared little character). The cues of emotions experienced by in-story characters are present in the expression itself (and in the context in which the expression is performed), which is not necessarily the case when inducing emotional states in the listener. Inducing an emotional state may or may not work; there is no telling whether it is effective on the entire audience just by analysing the speech. Because this thesis' goals focus on improving the quality of a TTS system by analysing the speech of professional storytellers, I will focus on expressions aimed at showing emotions, not inducing them.

Emotions of characters can be communicated indirectly by the speaker, by describing a character, or directly, by allowing the characters to express their own emotions. The most obvious method of emotion expression used in story-telling is a character utterance. A character might scream "Run...run for your lives!" in fear (and thus in a fearful voice); this is a direct expression of emotion. The VST system currently does not feature character dialogue, but since character dialogue is used in fairy-tales, I expect it will be included at some point in the future. The system does employ emotional agents to construct a story but, at the moment, uses the emotional states of the agents only to motivate their decisions. These emotional states can therefore also be used in the narration of the story, by changing the prosody (more on prosody in section 2.3) of the speech produced for characters by the TTS system. With this, the VST system can be extended to allow the characters to sound emotional. That is, once character dialogue has been added.

There is another aspect of character utterances used by professional storytellers which deserves special attention, because it complicates matters when using a TTS system to perform the role of a narrator: the non-linguistic vocal effects such as laughing, giggling, tremulousness, sobbing and crying. They cannot be generated using a diphone database (see section 2.2.1 for an explanation of what this is). It would be possible to record a database with samples of these effects performed by the same speaker that was used to record the diphone database, but besides the effort of doing so, there are a few other problems that arise when intending to use such a collection of vocal effects:

• How do we know when to use which sample; what distinguishes one giggle from another?
• How many different recordings of each effect would we need in order for the audience not to notice that the recordings are reused?
• How well can these recordings be inserted into the generated speech audio stream without unintentionally disrupting its continuity?

There has been little or no research into the different forms of these effects and their prosodic configurations, such as why one giggle differs from another and how this can be measured. This makes it impossible to use them in TTS systems without extensive additional research.

Lastly, there is the use of voices for characters. Although its use is quite influential, it is my observation that storytellers (knowingly or not) modify their voice by changing spectral parameters (see table 2.1), which are at the moment not freely modifiable in most TTS systems. To my knowledge, no research has been done on the specific voice changes involved in using character voices.

2.2 An overview of speech synthesis

This section gives a quick overview of the various types of speech synthesis, based on the works of [13] and [26]. The production of human speech by means of machines is called speech synthesis. This can be done with specialised machines built specifically for this purpose, or with general-purpose machines which can be configured to perform the speech synthesis, like computers. These systems are usually referred to as Text-to-Speech (TTS) systems, as the input of such a system consists of text and the output consists of synthesised human speech. (For a full and in-depth overview of all speech synthesis methods and their history, reading [26] is highly recommended.)

The function of a TTS system can be divided into two main parts: a front-end and a back-end. Since the input of a TTS system is in text form, this text needs to be interpreted and ambiguities have to be removed. This is the task of the front-end. Its output, some form of representation of what exactly is to be 'said', is the input of the back-end, which converts it to human speech output. The interpretation of the input text consists of converting numbers and abbreviations to the text that people say when they read such items. For example, the input '6' will be translated to 'six' and '500' to 'five hundred'. Items like dates, abbreviations, and phone numbers (which you do not want to interpret as one big number) are changed to their full written-out form by the front-end. After this, phonetic transcriptions are made for the entire text, and the whole text is chunked into sentences, phrases and prosodic units. The data then goes to the back-end.

The back-end converts the preprocessed data directly to audio (be it directly audible, or a playable waveform data file). This can be done in several ways: using concatenative synthesis (unit selection synthesis, diphone synthesis, or domain-specific synthesis), formant synthesis or articulatory synthesis.
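As an illustration of the front-end step just described, the following is a minimal sketch of text normalisation. The lookup tables and number-expansion rules are invented for this example and are far smaller and simpler than those of a real front-end such as Festival's.

```python
# A minimal sketch (not any engine's actual front-end) of the text-normalisation
# step: numbers and abbreviations are expanded to the words a reader would say,
# before phonetic transcription takes place.
import re

# Hypothetical, tiny lookup tables; a real front-end uses far larger resources.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def expand_number(token: str) -> str:
    """Very naive number expansion: '500' -> 'five hundred', '6' -> 'six'."""
    n = int(token)
    if n < 10:
        return DIGIT_WORDS[n]
    if n % 100 == 0 and n < 1000:
        return DIGIT_WORDS[n // 100] + " hundred"
    # Fallback: spell the number out digit by digit.
    return " ".join(DIGIT_WORDS[int(d)] for d in token)

def normalise(text: str) -> str:
    words = []
    for token in text.split():
        if token.lower() in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token.lower()])
        elif re.fullmatch(r"\d+", token):
            words.append(expand_number(token))
        else:
            words.append(token)
    return " ".join(words)

print(normalise("The 6 knights rode 500 miles"))
# -> "The six knights rode five hundred miles"
```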

2.2.1 Concatenative synthesis

The concatenation of pieces of previously recorded speech is the basis of concatenative synthesis. These pieces can be anything from something as small as a single phone (sound) up to complete sentences. All these pieces are categorised by certain acoustic correlates (like fundamental frequency, duration, the phones directly in front and behind, position in the syllable, etc.) and put in a database. When any input needs to be converted to speech, each bit of input is checked and the best matching unit in the database is selected for use. This is the 'unit selection synthesis' method.
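The unit-selection idea can be made concrete with a small sketch. This is a toy example with assumed feature names, not an actual engine: each candidate unit carries its acoustic correlates, and for every target the candidate with the lowest target cost is picked. The join cost between consecutive units, which real systems also weigh, is omitted.

```python
# Toy unit selection: pick the database unit whose features best match the target.
from dataclasses import dataclass

@dataclass
class Unit:
    phone: str
    f0: float        # fundamental frequency (Hz)
    duration: float  # seconds
    audio: bytes     # the recorded waveform snippet (placeholder)

def target_cost(unit: Unit, wanted_f0: float, wanted_dur: float) -> float:
    # Weighted distance between the unit's features and the target features.
    return abs(unit.f0 - wanted_f0) / 50.0 + abs(unit.duration - wanted_dur) / 0.05

def select_unit(database: list[Unit], phone: str,
                wanted_f0: float, wanted_dur: float) -> Unit:
    # Assumes the database contains at least one unit for the requested phone.
    candidates = [u for u in database if u.phone == phone]
    return min(candidates, key=lambda u: target_cost(u, wanted_f0, wanted_dur))
```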


The diphone synthesis method uses diphones (phone-to-phone transitions) which were extracted from prerecorded speech. Each language has a different diphone set, because certain transitions occur in one language but not in another, and the phone sets themselves differ as well. For speech synthesis, these diphones are concatenated, and the resulting audio data is then modified using digital signal processing methods like PSOLA, MBROLA and LPC (Linear Predictive Coding). A comparative research project on these algorithms was done by Dutoit [13]. Domain-specific synthesis works on a slightly higher level, in that it concatenates whole pre-recorded words and phrases instead of smaller units or diphones. This results in quite natural-sounding synthesis, but greatly limits the number of different sentences that can be synthesised without needing an exceptionally huge database.
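A short sketch of the diphone method described above: a phoneme sequence is rewritten as the list of phone-to-phone transitions that must be looked up in the recorded diphone database. The phoneme symbols and the underscore used as a silence marker are arbitrary choices for this example.

```python
# Turn a phoneme sequence into the diphones a concatenative back-end must fetch,
# including the silence-to-phone transitions at the utterance edges.
def to_diphones(phonemes: list[str]) -> list[str]:
    padded = ["_"] + phonemes + ["_"]          # "_" marks silence
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

print(to_diphones(["h", "E", "l", "o"]))
# -> ['_-h', 'h-E', 'E-l', 'l-o', 'o-_']
```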

2.2.2 Formant synthesis

Formant synthesis is based on acoustic models instead of pre-recorded human speech. These models contain rules on how to use fundamental frequency, volume, voicing, noise and other acoustic correlates over time in order to create speech. Because of this, no large databases of human speech recordings are required, which allows this method to be applied in areas where memory is a limited resource. Also, because no human speech recordings are used, the synthesised speech is not bound to the qualities of a single recorded voice: the system can create different voices in different moods by changing not just the fundamental frequency and tempo, but also spectral parameters, breathiness, creakiness and more. On the downside, the synthesised speech sounds a lot less natural than speech synthesised with concatenative methods.

2.2.3 Articulatory synthesis

Articulatory synthesis tries to synthesise human speech while staying close to the way humans generate speech: models are made of the function of the entire human vocal tract in order to generate speech just as we do. Needless to say, these models and their functions are extremely complex, and few synthesisers are actually able to employ this method sufficiently.

2.2.4 Speech synthesis markup languages

Besides raw text, a few different formats exist for providing speech synthesisers with input. These formats allow extra data to be supplied to the synthesiser, such as changes in speed, volume and frequency. Among these is SABLE [44], which claims to have been expanded upon SSML [49]; SSML was in turn later improved based on what SABLE had become. There is a proposed extension to SSML that includes higher-level prosodic parameters such as ToBI labels and abstract concepts like emotions [14]. Another format is JSML [24], which is tied to the Java programming language. Although each of these was proposed as a new standard, none of them has been widely adopted.
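For illustration, here is a minimal SSML 1.0 fragment along the lines of the W3C specification (the Dutch sentences are invented). The prosody element asks the synthesiser to change rate, pitch and volume for the enclosed text; this is the kind of markup used later in this thesis as input for emotional speech.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="nl-NL">
  Er was eens een koning.
  <!-- The prosody element overrides rate, pitch and volume for its contents. -->
  <prosody rate="slow" pitch="high" volume="loud">
    Wie is daar?
  </prosody>
</speak>
```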

2.3 Prosody

As stated earlier, the methods for telling a story are language, vocalisation and/or physical movement and gesture. A goal of this project is to increase the naturalness of a TTS system's story-telling capabilities, and since a TTS system is unable to perform physical movements and gestures, those methods are ignored in this project. The text of the story is predetermined, created by another part of the VST system, so the language aspect is also excluded from the scope of this project, leaving only the vocalisation. This vocalisation is explained below using the terms prosody and paralinguistic features.

As with the definition of story-telling, the definitions of prosody and of paralinguistic functions vary depending on who you ask. The Merriam-Webster dictionary defines prosody as "the rhythmic and intonational aspect of language". When looking up paralanguage, which Merriam-Webster defines as "optional vocal effects (as tone of voice) that accompany or modify the phonemes of an utterance and that may communicate meaning", it becomes clear right away that there is some overlap with prosody. Schötz [45] compared the various definitions and observations. While this did not result in clear definitions of either prosody or paralinguistics, it does shed some light on the matter. Firstly, she shows that there are multiple perspectives from which to observe the various properties of prosody: an acoustic level, a human perceptual level and a functional level. The first two are linked in such a way that the data is correlated, but the terms used to express them differ. For example, what is measured as the 'amplitude' of a (speech) signal on the acoustic level is called the 'loudness' of the voice on the human perceptual level. The functional level describes what each property is used for. Since most functions are defined not by one specific property but by a combination of properties, there is a bit of overlap here. The functions are realised by certain configurations and variations in the perceptual correlates, which are in turn realised by different configurations and variations of the acoustic correlates. Table 2.1 has been copied from [45], and shows the properties of prosody as mentioned in the literature on prosody she researched. These properties also contribute to the paralinguistic functions, such as the signalling of sex, age, physical state, emotion and attitude.

The table introduces some linguistic terminology, explained here. F0, also known as the fundamental frequency, is the base frequency of a speech signal. Formants are certain frequencies which are more prominent than others in the human voice; they are formed by the resonance properties of the oral and nasal cavities, as well as the pharynx. The F0 correlates with what we perceive as the pitch of someone's voice (as can be seen in table 2.1). The F0 contour is the measured behaviour of the F0 over time. The F0 range is the domain of frequencies over which the F0 varies. The F0 level is the absolute minimum F0 observed in an utterance. Voice quality is mostly measured in terms of breathiness and creakiness: whispering has a very high breathiness quality, while talking in a really low voice makes the creakiness quality quite apparent. The distinctive function of prosody allows us to distinguish between words that are spelled the same but mean something different, by using variations in prosody. Formant undershoot happens when trying to talk faster; it shows up as a reduction in the amount of effort taken to utter vowels. For example, when saying "iai" repeatedly, you will notice that the amount of effort (and with it the purity of the vowels) decreases as you go faster. Paralinguistic functions are used to signal properties such as speaker age, sex, physical condition, emotions and attitude.

Every utterance a person makes expresses not just the words that were said, but also how they were said. This is influenced by the physical and emotional state of the speaker and by the informational structure, and is called the prosody of the utterance. In this report, the term 'prosody' will mean exactly that: everything about an utterance that is not the words themselves. The paralinguistic functions that give information about the physical and emotional state of a person, like gender, age, anger, happiness or illness, are a subset of the prosody of an utterance. Hence, the emotional aspects of an utterance that are researched in this report will be labelled under prosody.
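As a small illustration of the distinction between acoustic and perceptual correlates, the sketch below reads a few of the acoustic measures named above (F0 level, F0 range, mean F0, rough duration) off an F0 track. The 10 ms frame step and the function name are assumptions made for this example, not the analysis set-up used later in this thesis.

```python
# Compute simple acoustic correlates from an F0 track
# (one value per 10 ms frame, 0.0 for unvoiced frames).
def f0_statistics(f0_track: list[float], frame_step: float = 0.010) -> dict:
    voiced = [f for f in f0_track if f > 0.0]
    if not voiced:
        return {"f0_level": None, "f0_range": None, "f0_mean": None, "duration": None}
    return {
        "f0_level": min(voiced),                 # lowest F0 in the utterance
        "f0_range": max(voiced) - min(voiced),   # span over which F0 varies
        "f0_mean": sum(voiced) / len(voiced),
        "duration": len(f0_track) * frame_step,  # rough utterance length (s)
    }

print(f0_statistics([0.0, 180.0, 190.0, 210.0, 0.0, 175.0]))
```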

2.4 Definitions of emotions

As with defining prosody, classifying human emotions is a subject on which many people hold quite different views and opinions. As mentioned in [31], several authors have created lists of emotional terms [23][38][15][19], but there is no single list or categorisation of emotions that is commonly accepted. The tree-structured list of emotions described in [37] (see table 2.2) will be used in the remainder of this project. This table is extensive enough to accommodate most terms used by others, and yet structured so that the primary emotion can be derived for each of the terms used. The tree structure of the table represents the division of emotional terms into subsets: primary emotions are terms of which the secondary emotions are a (more specific) subset, and similarly secondary emotions are terms of which the tertiary emotions are a (more specific) subset. How prosody is used to signal emotion, and how this can be analysed (by showing which acoustic correlates play a role in this process), is dealt with in chapter 5, specifically sections 5.1 and 5.3.

Melody
  Acoustic correlates: F0 (contour, range, level, movements)
  Perceptual correlates: pitch (contour, range, level, movements), intonation, tone
  Functions: distinctive (lexical tone, stress), prominence (stress, accent, focus), structural (grouping, boundary, sentence type, speaking style), paralinguistic functions

Rhythm
  Acoustic correlates: duration (of segments, syllables including silence), quantity
  Perceptual correlates: length, speech rate (tempo), rhythmicality?
  Functions: prominence (stress, accent, focus), structural (grouping, boundary, sentence type, speaking style), paralinguistic functions

Dynamics
  Acoustic correlates: intensity (amplitude)
  Perceptual correlates: loudness
  Functions: prominence (stress, accent, focus), structural (grouping, boundary, sentence type, speaking style), paralinguistic functions

Spectrum
  Acoustic correlates: phonation type, formant frequencies
  Perceptual correlates: segmental quality, voice quality
  Functions: less reduction (formant undershoot) in stressed syllables, paralinguistic functions

Table 2.1: Functions, acoustic and perceptual correlates of prosody (copied from [45]).

Love
  Affection: Adoration, affection, love, fondness, liking, attraction, caring, tenderness, compassion, sentimentality
  Lust: Arousal, desire, lust, passion, infatuation
  Longing: Longing

Joy
  Cheerfulness: Amusement, bliss, cheerfulness, gaiety, glee, jolliness, joviality, joy, delight, enjoyment, gladness, happiness, jubilation, elation, satisfaction, ecstasy, euphoria
  Zest: Enthusiasm, zeal, zest, excitement, thrill, exhilaration
  Contentment: Contentment, pleasure
  Pride: Pride, triumph
  Optimism: Eagerness, hope, optimism
  Enthrallment: Enthrallment, rapture
  Relief: Relief

Surprise
  Surprise: Amazement, surprise, astonishment

Anger
  Irritation: Aggravation, irritation, agitation, annoyance, grouchiness, grumpiness
  Exasperation: Exasperation, frustration
  Rage: Anger, rage, outrage, fury, wrath, hostility, ferocity, bitterness, hate, loathing, scorn, spite, vengefulness, dislike, resentment
  Disgust: Disgust, revulsion, contempt
  Envy: Envy, jealousy
  Torment: Torment

Sadness
  Suffering: Agony, suffering, hurt, anguish
  Sadness: Depression, despair, hopelessness, gloom, glumness, sadness, unhappiness, grief, sorrow, woe, misery, melancholy
  Disappointment: Dismay, disappointment, displeasure
  Shame: Guilt, shame, regret, remorse
  Neglect: Alienation, isolation, neglect, loneliness, rejection, homesickness, defeat, dejection, insecurity, embarrassment, humiliation, insult
  Sympathy: Pity, sympathy

Fear
  Horror: Alarm, shock, fear, fright, horror, terror, panic, hysteria, mortification
  Nervousness: Anxiety, nervousness, tenseness, uneasiness, apprehension, worry, distress, dread

Table 2.2: Emotions categorised by Parrott [37]; each primary emotion is listed with its secondary emotions and their tertiary emotions.

Chapter 3

Choosing the voice

This chapter explains which TTS system was selected for use and details the features of the chosen TTS system.

3.1 Text-to-Speech engine

As stated earlier, we would like to use an engine in which the amount of influence that can be exerted over the speech-generation process is as large as possible. An open-source TTS engine immediately has an advantage here, because the entire engine can be investigated at the code level, existing code can be modified to do precisely what you want it to do, and code can be added to extend the engine's current functionality with something new. In addition, because the VST system generates stories in Dutch, the engine has to support the generation of Dutch speech. There are a few open-source TTS systems available:

• The Festival Speech Synthesis System (http://www.cstr.ed.ac.uk/projects/festival/). Festival offers a general framework for building speech synthesis systems as well as including examples of various modules. As a whole it offers full text-to-speech through a number of APIs: from shell level, through a Scheme command interpreter, as a C++ library, from Java, and an Emacs interface. Festival is multilingual (currently English (British and American) and Spanish), though English is the most advanced. Other groups release new languages for the system, and full tools and documentation to build new voices are available through Carnegie Mellon's FestVox project.

• Flite: a small, fast run-time synthesis engine (http://www.speech.cs.cmu.edu/flite/). Flite (festival-lite) is a small, fast run-time synthesis engine developed at CMU and primarily designed for small embedded machines and/or large servers. Flite is designed as an alternative synthesis engine to Festival for voices built using the FestVox suite of voice-building tools.

• FreeTTS (http://freetts.sourceforge.net/docs/index.php). FreeTTS is a speech synthesis system written entirely in the Java programming language. It is based upon Flite, the small run-time speech synthesis engine developed at Carnegie Mellon University.

• GNUspeech (http://www.gnu.org/software/gnuspeech/). GNUspeech is an extensible text-to-speech package based on real-time, articulatory, speech-synthesis-by-rules. That is, it converts text strings into phonetic descriptions, aided by a pronouncing dictionary, letter-to-sound rules, and rhythm and intonation models; transforms the phonetic descriptions into parameters for a low-level articulatory synthesiser; and uses these to drive an articulatory model of the human vocal tract, producing an output suitable for the normal sound output devices used by GNU/Linux.

• The Epos Speech Synthesis System (http://epos.ure.cas.cz/). Epos is a language-independent rule-driven Text-to-Speech (TTS) system primarily designed to serve as a research tool. Epos is (or tries to be) independent of the language processed, the linguistic description method, and the computing environment.

• MARY (http://mary.dfki.de/). MARY is a Text-to-Speech synthesis system for German, English and Tibetan. It was originally developed as a collaborative project of DFKI's language technology lab and the Institute of Phonetics at Saarland University and is now being maintained by DFKI. MARY is designed to be highly modular, with a special focus on transparency and accessibility of intermediate processing steps. This makes it a suitable tool for research and development.

Of the above, there only appears to be a maintained Dutch language and voice module for Festival, namely NeXTeNS (http://nextens.uvt.nl/), which stands for "Nederlandse Extensie voor Text naar Spraak" (Dutch extension for TTS). Although it would be possible to create a Dutch front-end (language parser) and back-end (voice) for GNUspeech, Epos or MARY, this would mean a significant amount of extra work. Therefore, the Festival system, together with the Dutch extension NeXTeNS, is the TTS system that will be used.

3.2 Limitations of Festival

Festival, using a diphone concatenation MBROLA engine, uses a fixed database of prerecorded diphones uttered by one speaker (per database). The TTS engine allows modification of all the attributes that fall under the category of ’Acoustic correlates’, with the exception of those under ’Spectrum’ and as such should be able to achieve expressiveness in the attributes listed next to ’Perceptual correlates’ and ’Functions’; once again except those under ’Spectrum’. Because of the use of diphone concatenation, the voice qualities, as listed under ’Spectrum’, such as: modal voice, falsetto, whisper, creak, harshness and breathiness cannot be changed easily. This is caused by the fact that diphone concatenation uses a database of fixed diphones previously recorded from a single speaker, and thus all recorded diphones from that database will have the same voice quality (as listed under ’Spectrum’). If we want to vary the voice quality on-the-fly, the only method to achieve this is to load a different diphone database recorded from the same speaker but under different (emotional or physical) circumstances. Still, if these databases were available (in the proper format) the change of the ’overall’ voice might be too noticeable and the loading time might seriously hamper the continuity of the produced speech output.
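To illustrate the kind of control that remains available despite these limitations, the sketch below rescales an MBROLA-style .pho description (one line per phoneme: phoneme name, duration in milliseconds, then optional pairs of pitch-target position in percent and F0 in Hz). Scaling durations and pitch targets in this way is roughly what the diphone back-end does allow, while the spectral 'colour' of the recorded diphones stays fixed. The phoneme lines and scaling factors are invented example values, not part of the model developed later.

```python
# Scale durations and F0 targets in an MBROLA-style .pho description.
def scale_pho(pho_text: str, dur_factor: float, f0_factor: float) -> str:
    out_lines = []
    for line in pho_text.splitlines():
        parts = line.split()
        if len(parts) < 2 or line.startswith(";"):
            out_lines.append(line)          # comments / empty lines unchanged
            continue
        phone, dur, targets = parts[0], float(parts[1]), parts[2:]
        scaled = [phone, str(round(dur * dur_factor))]
        # Pitch targets come in (position %, F0 Hz) pairs.
        for pos, hz in zip(targets[0::2], targets[1::2]):
            scaled += [pos, str(round(float(hz) * f0_factor))]
        out_lines.append(" ".join(scaled))
    return "\n".join(out_lines)

neutral = "w 60 20 140 80 160\ni 90 50 170"
print(scale_pho(neutral, dur_factor=0.8, f0_factor=1.2))   # faster and higher
```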

3.3 Conclusion

The condition that the TTS platform must be open-source and have a Dutch front-end and back-end quickly limits the available options to one: Festival. This system is designed to be used as a research platform and is therefore not the fastest or most memory-efficient TTS system available, but it gets the job done. The limitations described in the previous section are a result of the use of a diphone concatenation engine. This back-end engine is used in most TTS systems, as it gives good results and takes (relatively) little effort. It would be possible to use a different technique, but such a technique would have to be implemented first, and creating a Dutch TTS back-end from scratch is not the objective of this thesis.

Chapter 4

The search for emotion

In this chapter, the setup and results of a series of experiments are presented. The experiments were designed to test for the presence of, and if present identify, the emotional charge in fragments of speech from a story told by a well-known storyteller. These fragments are then analysed in terms of acoustic correlates. The resulting measurements are used to construct a model that links the various identified emotions to these acoustic correlates, in order to make a TTS system produce emotional-sounding utterances.

As deduced earlier in section 2.1, we are interested in the direct expression of emotions by characters in a story. In order to obtain audio with emotional utterances of in-story characters, prerecorded fairy-tales were scanned for fragments consisting of utterances spoken by characters inside the story. An example of a short utterance which can contain an emotional charge is: "Who goes there?". Obviously, figuring out which emotion is experienced just by looking at this bit of text is rather difficult. Given the context, it becomes much clearer: "As he hesitantly sneaked down the corridor towards the strange sound he was hearing, Jan almost jumped against the ceiling when he suddenly felt something lightly touch his shoulder. 'W..Who goes there?'". Even though the utterance itself is short, it clearly shows that there should be a difference between how the narrative part is pronounced and how Jan's question is pronounced. In most cases, stories also describe how utterances are pronounced, by adding descriptions like ", he said" or ", he yelled", or even more detailed ones like ", he said, terrified". These added descriptions can be used to determine which emotion is to be used when reading a story aloud. Back to the example: Jan is (obviously) scared out of his wits, and thus his question is likely spoken with a matching frightened voice. Because the emotional charge of Jan's utterance is so obviously present, the prosodic attributes specific to this emotion should be easy to detect when analysing an audio fragment of it.

By looking at the utterances of in-story characters, and matching the perceived emotions with the emotional state we believe the character should be experiencing (from context), we can identify the utterances which have been uttered effectively (in the sense that the perceived emotional charge of the utterance can be associated with the believed emotional state of the character) and use those to create a model that allows a TTS system to express emotions experienced by in-story characters.

It is possible that the emotional charge of an utterance does not contain merely one single emotion, but a mixture of several emotions. For example, the emotional charge of an utterance by someone who is "pleasantly surprised" will likely contain the emotions (or influences thereof) happiness and surprise, and as such, the utterance will probably display prosodic attributes of both these emotions. This makes identifying the prosodic attributes of a single emotion more difficult than when only one emotion is present. Assuming that the more prominent an emotion is in the (possible) mixture of emotions in an emotional charge, the more prominent its prosodic attributes will be, the utterance can be categorised as having the most prominent emotion as 'the' emotion of that utterance without distorting the measured prosodic attributes (because the most prominent prosodic attributes belong to 'the' emotion).


Besides being able to link various emotions to certain story fragments, a baseline is needed, consisting of fragments which are considered to have no emotional charge (emotionally 'neutral'). This means that the test data for the experiment also has to contain fragments for which people will find no emotional charge to be present. This allows us to directly compare emotionally neutral utterances to emotionally charged utterances. If there are no 'no emotional charge present' fragments for a certain character, the emotionally neutral baseline will be missing, and a model specifying how the voice changes between 'no emotion' and any other emotion cannot be set up from those fragments.

This chapter is set up as follows. In section 4.1 the constraints are described for the audio material that is to be analysed. Then the interface of the experiment and the degree of freedom the participants have in answering its questions are investigated in a preliminary experiment, described in section 4.2. After that, the constraints on the participants and the environment in which the experiment took place are specified in sections 4.3 and 4.4. The interface, determining what a participant gets as input and what he should answer, is finalised in section 4.5, and finally the results of the experiment are discussed in section 4.6.

4.1 Test material

Because the aim of this research is to mimic professional storytellers using TTS, the material used for the experiments comes from professional storytellers. There is a large amount of told fairy-tales available on tape, for example the "Lecturama Luistersprookjes", which were commercially available in 1986. These stories were a huge success, and thus it can be concluded that the quality of the storytellers is generally agreed to be good. This qualifies the stories to be used as a basis for extracting features of story-telling that is generally accepted as being of good quality. The fairy-tales have another feature which makes them more interesting for this test: they are child-oriented. This means that the narrators and authors employ strong (exaggerated) and simple methods of conveying the story contents, so that children will pick them up more easily. Sentences and words are short and simple, and accents and emotions are emphasised.

In order to be able to morph a TTS voice from sounding neutral (which is the default) to sounding emotional, both neutral- and emotional-sounding fragments of in-story character dialogue from fairy-tales are needed. However, there is a risk that all of the fragments of a certain character will be labelled with the same emotion (for example, happy), simply because the character sounds happy all the time. This would result in unusable fragments, because without a baseline 'neutral' fragment, there is no way of knowing which specific acoustic correlates make the character sound happy. The only useful information, if other fragments by the same character were labelled with a different emotion, would be how to change the acoustic correlates to make a happy voice sound, for example, sad. Since TTS systems do not employ voices that sound happy by default, this information is rather useless, and the entire set of fragments of that character cannot be used to change an emotionally neutral TTS voice into one that carries an emotional charge.

The test material was selected from the audio available of the Lecturama Luistersprookjes with the following criteria:

• The storyteller must be the same throughout all the fragments.
• The fragments must contain as few different characters as possible.
• The syntactical part of the fragments must preferably not be biased towards a certain emotion.

The reasoning behind these criteria is as follows. A single storyteller means that only one voice is used for all fragments, so all biological attributes that determine the voice quality (like gender and age) are the same throughout the entire set of fragments. Similarly, having the most fragments from the smallest number of different in-story characters means that the prosodic configuration of all the fragments of one character is similar; after all, it is the 'voice' of that specific character. Finally, fragments whose text leads to a clear choice of emotion (for example "Oh no...help me...help me!", which indicates distress) are excluded, so that no information about the emotional charge is given away by anything other than the vocalisation of the fragment. This is done to make sure that participants base their choice (of which emotion, if any, is present in the fragment) on the prosodic aspects of the utterance as much as possible.

With this in mind, 52 stories were analysed. The stories used in the preliminary tests (see below) were excluded from the final material. Of the remaining stories, three contained enough material to work with. One of those three stories consisted of a dialogue between two people and resulted in two times seven usable fragments. The other two stories yielded one set of fragments each, one consisting of four and the other of 12 fragments.

As mentioned in the beginning of this chapter, the context can show which emotion a character is currently experiencing. To this end, the fragments which have been selected for the presence test have been analysed, and the emotional state derived from the context has been written down. This is done by looking for explicit statements about the character's emotional state and by looking at the descriptions of how a character speaks. The motivations of the characters are also taken into account. For example, Magda is being secretive/protective about a received letter, so when she tells her husband to close the door, she might do so in a slightly fearful voice. The entire analysis consists of quite an extensive amount of text and has therefore been included in the appendix as table A.6. The fragments there are presented in chronological order. A quick situational update has been constructed by me and put between parentheses. The character quote which has been used in the presence test has been emphasised; the rest of the text is spoken by the narrator. The emotions from this context analysis are compared to the results of the presence test later on.

4.2 Preliminary experiment

The Lecturama Luistersprookjes feature two different storytellers, one male and one female1. As such, two experiments were performed, one with fragments told by the male storyteller and one with fragments told by the female storyteller, in order to determine which one would be the best to use in the final experiment. These two experiments were performed by 3 people, of which two agreed that the female voice sounded better. The third found the male voice better, because the female voice was, in his opinion, too theatrical. But theatrical means that the prosodic cues are presented more explicitly, and therefore easier to pick up on. This is exactly what we wanted and thus the stories told by the female voice were used in the final experiment. In order to determine if an open answering mechanism was the best way for the user to give his answers, as well as to determine which emotions should be presented as a possible answer, another experiment was performed. In this experiment, which was performed by three people knowledge- able in the field of linguistics and myself, the question ”Which emotion do you hear?” was an open question. The experiment consisted of 27 fragments. This resulted in people entering whatever term they thought of (which was not necessarily an established term for an emotion). After the experiment was performed, all answers were analysed, and identical answers were counted. Since numerous terms were entered which did not directly indicate an emotion, some were ’rounded’ to the nearest applicable emotion using table 2.2, which shows a list of primary, secondary and tertiary emotions originally presented in [37]. The different terms used as answers, the frequency of the term usage, and the reasoning behind the rounding to the nearest emotion are shown in table 4.1. The diversity of the answers, as well as the number of different terms used, were - as far as the goals of this preliminary experiment were concerned - a clear indication that multiple-choice answers were needed instead of open answers. This is to steer the participants of the test away from using terms that are not really emotions in the first place, which would result in a lot of different terms that occur with a low frequency which decreases the importance of each individual term. If thirty people each use a different term for one sentence, no conclusive result can be ascertained about the significance of any singular term. Because we want to be able to single out one emotion

1 Although no explicit names are present on the cover of these cassettes, they are probably Frans van Dusschoten and either Trudy Libosan or Martine Bijl.

Term used (Dutch)   Times used   Emotion derived
Angstig             4            Fear
Belerend            1            Unknown
Berustend           4            Resigned
Beteuterd           1            Unknown
Blij verrast        1            Surprise
Blijheid            8            Joy
Boosheid            2            Anger
Droevig             3            Sadness
Enthousiast         1            Enthusiasm - Zest - Joy
Gedreven            1            Passionate/enthusiasm - Zest - Joy
Gelaten             1            Unknown
Geruststellend      3            Caring - Affection - Love
Irritatie           1            Irritation - Anger
Laatdunkend         1            Unknown
Leedvermaak         5            Malicious Delight - Delight - Cheerfulness - Joy
Lijdzaam            1            Unknown
Medelijden          5            Compassion - Affection - Love
Minachtend          1            Contempt - Disgust - Anger
None                41           None
Opgelucht           1            Relief - Joy
Opgetogen           5            Delight - Joy
Opgewekt            1            Cheerfulness - Joy
Opgewonden          1            Excitement - Zest - Joy
Standvastheid       1            Unknown
Verheugd            3            Gladness - Cheerfulness - Joy
Verrast             1            Surprise
Verruktheid         1            Delight - Cheerfulness - Joy
Verwachting         1            Unknown
Verwijtend          2            Unknown
Verzekering         1            Unknown
Verzuchtend         1            Unknown
Wanhoop             1            Despair - Sadness
Zeurend             1            Unknown

Table 4.1: emotion modifications in the preliminary experiment

Because we want to be able to single out one emotion and link it to a fragment as being 'the' emotion of that fragment's emotional charge, the use of a large number of terms should be avoided. To be on the safe side, an 'Other' option was still left in for users to enter a custom term should they find it necessary. The most frequently used terms in the answers of this test were:
• 41x none (geen)
• 8x joy (blijheid)
• 5x sympathy (medelijden)
• 5x delight (opgetogen)
• 4x fear (angstig)
• 4x resigned (berustend)
• 3x sadness (droevig)
Given the large number of 'none' answers, it can also be concluded that many samples did not contain a clear emotion. This could also explain the large number of slightly varying terms used, but without a larger scale test (which was not the purpose of this preliminary test anyway), it is not possible to confirm or deny whether the number of terms was a result of the samples not containing a clear emotion, or a result of personal preferences on the terms used. Because the audio material used for this test was limited to samples from a few stories, it is likely that other stories contain emotions other than those in this list. Therefore the samples used in the preliminary tests cannot be considered representative of the samples used in the final experiment. Also, since both 'joy' and 'delight' are tertiary emotions linked to the primary emotion 'joy', and 'sympathy' is a tertiary emotion to 'sadness', it was decided to use all the primary emotions listed in table 2.2 as possible answers in the multiple-choice test. When linking the various terms used in the preliminary experiment to their primary emotions (insofar as the translation allowed this), the emotion count was as follows (terms that did not translate to an emotion were excluded):
• 41x none
• 27x joy
• 8x love
• 4x sadness
• 4x anger
• 4x fear
• 2x surprise
As such, these options were used in the public test: None, Love, Joy, Surprise, Anger, Sadness, Fear and Other.

4.3 Target participants

In order to obtain enough data to base statistically viable judgements on, the more participants, the better. There is, however, the possibility that some people will be less inclined to finish the entire test and to answer seriously and truthfully, even though it seemed like a good idea to them when they started. Earlier large tests involving language, like the Blizzard Challenge 2005 [1], where voices from different TTS systems were compared by people of different backgrounds (professional linguists, students, undergraduates), showed that in order for a test to yield statistically viable results, you need to focus on experts and people who have enough incentive to contribute positively to the test. This means that people who have no ties whatsoever to this test, neither scientific nor social (a social tie meaning that they know people involved in either the testing or the creation and are genuinely willing to help), were excluded from participation in order to remove potentially unhelpful (or even counter-productive) results. The professional and social aspects should be incentive enough for the participants to aid in performing this test. As a result, only people in the EWI department, family, and some other students were invited to participate in this test. This limits the total number of potential participants, but likely increases the overall number of completed test runs.

4.4 Test environment

In order to reach a maximum number of participants, it is a good idea to make the test available over the Internet. This allows people who are willing to contribute to do so without the extra effort of travelling to wherever they would otherwise need to go to take the test. This did put further constraints on the test with respect to, for instance, browser compatibility. Since the test was based on audio fragments, the quality of the audio equipment people had at the machine they used could also have influenced the test to some extent. However, because the original audio came from old cassette tapes - meaning the quality was not very high to begin with - the influence of the equipment used to listen to the audio was assumed to be minimal. Measures needed to be taken to allow people to get used to the audio fragments and to listen to them in an optimal way. This means that a demonstration page had to be created. This demonstration page contained a fragment which the users could use to adjust the volume and other audio-related settings, so that they could hear the fragments in the way they thought best before actually starting the test.

4.5 Test structure

The fragments presented to the user consisted of a piece of a story containing a character quote. The fragments were presented simultaneously as audio and text. This was done so that people did not have to puzzle over what was being said, because they could read it at the same time. This works similarly to subtitling of a TV show or movie, where people can read the subtitles without actually realising they are doing so and still focus on the show or movie itself. The user was presented with a multiple-choice answering method (shown at the same time as the text and audio) containing the question "Which emotion do you hear?" and the possible (mutually exclusive) answers, consisting of a few predetermined emotions, "None" and "Other". When the "Other" field was selected, the user was required (this was checked) to enter something in an edit box that was enabled as soon as the "Other" field was selected. At the bottom of the screen a "next" button allowed the user to continue to the next test item. This button was disabled by default, and was enabled only if either "None", a specific emotion, or "Other" was chosen. If "Other" was chosen, the edit box had to be non-empty. The time it took from the page loading until the "next" button was pressed was saved. This allowed us to measure some sort of confidence factor correlated with the answer (for example: if it takes a long time, the user is probably thinking hard about what to answer, thereby possibly decreasing the confidence of the answer itself).

Figure 4.1: Screenshot of a question of the presence test

In order to discourage the user from spending too much time thinking about which emotion to pick (which would indicate that the emotion was not that clearly present), it would have been possible to implement a time limit which, when a timer expired, disabled the emotion-selection input fields. This would still allow the user to enter a comment, or continue entering comments, for the specific question, but leave the user unable to choose or alter the previously made choice of the perceived emotion. It was deemed unnecessary to implement this, since the elapsed time per question was measured anyway. The time it would take an average user to complete the test (barring excessive commenting) was aimed to be about 20 minutes. This comes down to about 20 fragments with a length varying from as little as 1 up to 10 seconds, which would leave the user about 55 seconds on average to supply an answer. The fragments were supplied in the same fixed order for all users. This was to ensure that at least the starting fragments would have statistical significance in the possible event that users did not finish the test. For statistical purposes, it is better to have a small number of fragments labelled by many people than a large number of fragments labelled by few. In order to avoid fragments originating from the same story ending up right after each other (thereby giving the user too much context information about the underlying story, which could lead to easy emotion prediction), the order was shuffled once at the initialisation of the test server and checked afterwards.
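A minimal sketch of such a shuffle-and-check step is shown below (Python); the fragment and story identifiers are made up for illustration, and this is not the actual test-server code.

    import random

    def shuffle_fragments(fragments, max_tries=1000):
        """Shuffle (fragment_id, story_id) pairs until no two adjacent fragments
        come from the same story; the resulting order is then fixed for all users."""
        order = list(fragments)
        for _ in range(max_tries):
            random.shuffle(order)
            if all(a[1] != b[1] for a, b in zip(order, order[1:])):
                return order
        raise RuntimeError("no valid order found")

    # Hypothetical fragment ids paired with their source story (a, dedk, ddw)
    fragments = [("a_01", "a"), ("dedk_01", "dedk"), ("a_02", "a"), ("ddw_01", "ddw")]
    fixed_order = shuffle_fragments(fragments)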

4.6 Experiment results

The test was performed by 32 people, who all completed the entire test. All subjects were native Dutch speakers. The fragments used in the test were collected from three fairy tales: "Assepoester", "Dot en de kangeroe", and "De drie wensen". The sentences, the name of the character that speaks them, and the internal sample name (which contains the name of the fairy tale the fragment originated from) are shown in table A.1. The names of the fairy tales have been abbreviated as a, dedk and ddw (the first letter of every word). Table A.2 contains the frequency of the various answers per question, in the order in which each fragment appears in the story. As can be seen at first glance, there is usually one emotional term that has been chosen more often than the others. Also, the number of 'other' answers is reasonably high for the characters Magda, Frederik and Tante. Table A.3 lists the full fragment text, as well as the list of 'other' results given (in Dutch). It was surprising to see that, despite built-in checks that would not allow someone to continue to the next question if 'other' had been selected while no text had been entered in the accompanying text box, some people managed to do just that. Also, some people entered quite a lot of text in the 'other' emotion text box instead of the comment field, which resulted in loss of text due to database design constraints. The 'other' emotion text box was limited to 30 characters (which should be enough to contain any single emotional term), while the comment area was unrestricted in size. This happened 4 times, all by the same person. He or she phrased the 'other' emotions as choices between two emotions and continued to provide contingency answers. For example: "Enthousiasme, anders verrassing" or "Opgelucht; als dat geen emotie" (note the 30 character limit). In such a case, the first mentioned emotion was used as the answer. In an attempt to emphasise the statistical weight of the various answers, it was decided to count emotions listed under 'other' towards the corresponding primary emotion listed in the emotions table 2.2. This was done under the assumption that if emotions are closely related, the method of expressing these emotions and their intensity is probably closely related as well. This resulted in 33 'other' answers being moved to one of the listed primary emotions. There were also a couple of answers that were moved to another emotion. These are shown in table 4.2. The phrase "wat nou weer?" is a clear indication of irritation, which is a secondary emotion to anger. "Geruststelling" translates to appeasing, which, in my opinion, is an expression of compassion, which is a tertiary emotion to love. The last two entries where emotions were changed were answered in the presence test in such a way that either emotion could be applied. I moved these answers to the category that already had the most answers. For the 'Blijheid-Verrassing' change, there were 3 other people who answered 'Blijheid' and 15 others who answered 'Verrassing'. For the 'Liefde-Droefheid' change, there were 4 other people who answered 'Liefde' and 18 who answered 'Droefheid'. These figures were enough to convince me that these two in-between answers could safely be merged into the larger category.

Old emotion   New emotion   Comment
Surprise      Anger         vraag toon, op toon van "wat nou weer?"
None          Love          soort geruststelling, maar ik kan het niet echt koppelen aan een emotie
Happiness     Surprise      Hangt een beetje tussen blijheid en verrassing in
Love          Sadness       Ook droefheid

Table 4.2: remaining emotion modifications

Modifying the emotions in the above 4 examples and the 33 'other' answers led to slight changes in the agreement values per question, but did not change the overall outcome. Table A.4 lists the question number, the agreed-upon emotion, and the agreement factors (as a value between 0 and 1) before and after moving some of the 'other' emotions to their primary emotion. Before the modification, 132 'Other' answers were present; afterwards, 99 remained. Table A.5 contains the list of comments and which emotion they were changed to. The 'other' descriptions were moved to the following emotions:
• 19x Anger
• 4x Love
• 5x Surprise
• 3x Joy
• 1x Sadness
• 1x Fear
In order to find out whether the test participants confused specific emotions, and if so, which emotions and to what degree, a coincidence matrix was created. The figures shown in table 4.3 are percentages. The values show the confusion, averaged over all questions. The higher the figure, the more people disagreed between the two emotions (row and column), except for the diagonal elements, where the figure represents how much people agreed on the emotion. This shows that most confusion comes from some people claiming a fragment had no emotion while others claimed it had one (totalling 22.08% of all the cases where, for a single fragment, someone picked "None" and someone else picked an emotion). When looking at specific emotions being confused with each other (so not counting 'Other'), the percentage totals 13.22%. Of this percentage, the main contributors are the pairs Fear - Sadness (2.87%), Surprise - Anger (2.15%), and Anger - Sadness (2.11%). On the whole, the main source of confusion comes from 'None' versus any emotion (22.08%), and from 'Other' versus any emotion (including 'None') with 12.08%. The confusion between 'None' and 'Other' was 4.54%. This leads to the suspicion that the fragments did not contain as many prominent emotions as was initially expected, or that the emotions were not prominent enough to stand out.

           None    Happiness  Sadness  Fear   Love   Anger   Surprise  Other
None       14.86   3.08       2.84     2.92   2.02   3.13    3.54      4.54
Happiness  3.08    6.04       0.70     0.37   0.12   0.17    1.23      1.30
Sadness    2.84    0.70       9.28     2.87   0.88   2.11    0.16      1.64
Fear       2.92    0.37       2.87     5.57   0.58   0.36    0.88      1.26
Love       2.02    0.12       0.88     0.58   0.78   0.42    0.23      0.36
Anger      3.13    0.17       2.11     0.36   0.42   11.51   2.15      1.84
Surprise   3.54    1.23       0.16     0.88   0.23   2.15    6.42      1.86
Other      4.54    1.30       1.64     1.26   0.36   1.84    1.86      2.01

Table 4.3: Relative coincidence matrix of given answers in % (N=32, Items=30)

In order to determine the degree of agreement amongst the participants, Krippendorff's [25] alpha value was used. This method uses a statistical analysis that results in an alpha value, which is a measure of the reliability of the results of the test. A value of 1.0 indicates perfect reliability, whereas a value of 0.0 indicates that the participants made up their answers by throwing dice. The alpha value of the presence experiment turned out to be 0.284, which indicates that the test participants were in quite poor agreement on the emotion perceived in the fragments. Another way to look at this is to calculate the average agreement per emotion. The percentages were averaged per emotion; the results are shown in table 4.4. The percentages for all questions are also shown in figure 4.2. The horizontal axis shows the fragments, labelled with the emotion that was perceived most for each fragment, ordered from best match (left) to worst match (right).
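Both the relative coincidence matrix of table 4.3 and the Krippendorff alpha value can be derived from the same coincidence counts. The following Python sketch (nominal data, complete answers from every rater) illustrates the computation on made-up toy data; it is not the analysis script that was actually used.

    import numpy as np

    def coincidence_matrix(answers):
        """answers: one list of rater labels per fragment (complete data)."""
        cats = sorted({label for item in answers for label in item})
        idx = {c: i for i, c in enumerate(cats)}
        o = np.zeros((len(cats), len(cats)))
        for item in answers:
            m = len(item)                       # number of raters for this fragment
            for a in range(m):
                for b in range(m):
                    if a != b:
                        o[idx[item[a]], idx[item[b]]] += 1.0 / (m - 1)
        return cats, o

    def krippendorff_alpha_nominal(o):
        n_c = o.sum(axis=1)                     # marginal totals per category
        n = n_c.sum()                           # total number of pairable answers
        d_obs = o.sum() - np.trace(o)           # observed disagreement (off-diagonal mass)
        d_exp = (n ** 2 - (n_c ** 2).sum()) / (n - 1)   # expected disagreement
        return 1.0 - d_obs / d_exp

    # Toy data: 3 fragments, 4 raters each (the real test had 30 fragments and 32 raters)
    answers = [["Anger", "Anger", "Anger", "None"],
               ["Fear", "Sadness", "Fear", "Fear"],
               ["None", "None", "Other", "None"]]
    cats, o = coincidence_matrix(answers)
    relative = 100 * o / o.sum()                # one plausible normalisation, cf. table 4.3
    alpha = krippendorff_alpha_nominal(o)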

Figure 4.2: Agreement on emotion per fragment

Emotion              None    Joy     Sadness  Fear    Anger   Surprise
Average percentage   48.9%   73.5%   60.8%    48.5%   69.8%   48.6%

Table 4.4: Average agreement percentages per emotion

As can be seen in the figure, only a few fragments across the range of emotions have an agreement below 50%. Only "no emotion" (None) has most of its fragments rated with an agreement below 50%. However, 50% agreement on one option is, given the choice of eight options, well above chance level (12.5%). Putting the emotions from the test side by side with the emotions determined from the context of each fragment results in table 4.5 (the fragments are listed in chronological order per fairy tale). It is interesting to see that the emotions resulting from the test matched the emotions ascertained from the context in 18 of the 30 fragments (60%). Also, in two fragments on which the users agreed on a specific emotion with 81.25% agreement, the perceived emotion did not match the emotion ascertained from the context.

Character   Sentence                                                   Test        Context
Tante       Wat jij moet aantrekken?                                   Anger       Anger
Tante       Dat gebeurt misschien nog wel                              None        None
Tante       Die onbekende prinses is toch verdwenen                    None        None
Tante       Ik zal het eerst proberen                                  None        None
Frederik    Wat is er met jou aan de hand?                             Surprise    Anger
Magda       Kom binnen en doe de deur dicht                            Fear        Fear
Frederik    Ik zou best een paar worstjes lusten                       None        None
Magda       Dat wens ik ook                                            None        None
Magda       Er is een brief van de feeën gekomen                       Surprise    Happiness
Magda       En daarin staat dat we drie wensen mogen doen              Happiness   Happiness
Frederik    We moeten heel goed nadenken Magda                         None        None
Magda       Ik heb al een lijstje gemaakt                              None        None
Magda       Luister wat ik heb bedacht                                 None        None
Frederik    Wat? Is er geen eten                                       Surprise    Anger
Frederik    Wat ben je toch lui Magda                                  Anger       Anger
Magda       Wat ben je toch dom                                        Anger       Anger
Frederik    Waarom hebben we ook zo'n ruzie gemaakt over die wensen    Sadness     Sadness
Frederik    Nee, nee, het was jouw schuld niet lieverd                 None        Love
Dot         Ja ik ben de weg kwijtgeraakt.                             None        Sadness
Dot         Maar dat is de baby van mijn kangeroe                      Anger       Surprise
Dot         Ik vind het anders helemaal niet grappig                   Sadness     Anger
Dot         Ik heb nog nooit een vogelbekdier gezien                   Surprise    None
Dot         Maar iemand moet toch weten waar mijn huis is?             Fear        Fear
Dot         Wie zijn dat?                                              Fear        None
Dot         Ik heb mensen nog nooit zo raar zien doen                  Surprise    Fear
Dot         Kangeroe, laat mij maar uitstappen                         Fear        None
Dot         Laat mij toch hier                                         Sadness     Sadness
Dot         Oh lieve kangeroe                                          Sadness     Sadness
Dot         Dank u wel voor uw hulp                                    Happiness   Happiness
Dot         Ik u ook                                                   None        Sadness

Table 4.5: Comparing emotions from the test with the context of the fragments

This was the case with the sentences "Maar dat is de baby van mijn kangeroe", where anger was perceived but the context revealed surprise, and "Ik vind het anders helemaal niet grappig", where sadness was perceived but the context showed that the statement was made while being angry. The fragments to be used in the detailed acoustic analysis were filtered on the following requirements:
• The level of agreement on a specific emotion of the fragment must exceed random chance (12.5%); the higher, the better.
• At least one of the fragments per character must be emotionally neutral.
Neither requirement discounted any of the fragments. Because all the fragments met the above criteria, they were all used for the analysis with which a neutral-to-emotion conversion model was constructed. This resulted in 2-4 fragments per emotion over all characters. However, the poor value of Krippendorff's alpha implied that the emotions were not that clearly present; at least not clearly enough for the majority of the test users to agree upon a single emotion. This means that the data resulting from an acoustic analysis could be equally inconsistent.
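A sketch of this filtering step in Python, assuming each fragment is represented as a small record with its character, majority emotion and agreement level (the field names are illustrative, not the actual data format):

    def select_fragments(fragments, chance_level=0.125):
        """Keep fragments whose agreement exceeds chance, restricted to characters
        that also have at least one fragment labelled 'None' (the neutral baseline)."""
        neutral_chars = {f["character"] for f in fragments if f["emotion"] == "None"}
        return [f for f in fragments
                if f["agreement"] > chance_level and f["character"] in neutral_chars]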

4.7 Discussion

An alternative approach to obtaining fragments of emotional speech of which the emotion is known is to employ a (voice) actor. A professional actor is able to simulate emotions in his or her speech to such a degree that it is mostly indistinguishable from 'normal' spontaneous emotional speech. Using an actor as a source of emotional speech bypasses the problem of having to label the recorded speech fragments with the emotions present, since these are already known in advance. However, this deviates from the plan to use speech from commonly accepted storytellers. Another method of obtaining emotional speech is to use a corpus of spontaneous emotional speech. The downside to this approach is that such a corpus consists of speech from a large number of different people, making it harder to find both neutral and emotional speech fragments from the same speaker. This also deviates from the plan to use speech from commonly accepted storytellers, because a corpus usually consists of speech fragments from a diverse range of people.

4.8 Conclusion

The experiment setup proved adequate. All people who participated finished the test. The agreement percentages per emotion in figure 4.2 were all well above chance level and therefore statistically sufficient to label each fragment with the emotion that was perceived by most of the experiment participants. The Krippendorff alpha value of 0.284 does, however, give an indication that people did not agree much with each other on a global scale. This is probably caused by the fact that emotions are extremely subjective, and people are not always equally good at recognising and formulating perceived emotions, let alone agreeing on them. What sounds surprised to one person apparently sounds angry to another. These conflicting perceptions are a major contributor to the low alpha value.

Chapter 5

The birth of a model

In this chapter, the results of the acoustic analysis are presented. First, specific aspects of speech that have been used to analyse emotional speech in similar research in the literature are discussed in section 5.1. Then, the process of measuring and analysing the data is described in section 5.2. The results are compared to those found in the literature in section 5.3, discussed in section 5.4, and conclusions are drawn in section 5.5. The result is a model that, given an emotion, returns the acoustic modifications needed to transform a neutral voice into one that carries that emotion.

5.1 Speech analysis in literature

In earlier research done by Jianhua Tao et al. [53], Eva Navas et al. [35], Rank & Pirker [41] and Sylvie Mozziconacci [31], several interesting prosodic attributes for analysing emotion in speech utterances are used. The research in these works contains detailed absolute measurements, which promised to be useful when the need to compare measured prosodic attributes arose. More on comparing measurements can be read in section 5.2. Since the first three mentioned research projects focused on the Chinese, Basque and German languages respectively, not all the employed attributes are usable in an analysis of the Dutch language as well, but there are certainly some attributes that are:
• Syllable duration
• Maximum value of the pitch curve and its first derivative
• Mean value of the pitch curve and its first derivative
• Range (Max-Min) of the pitch curve and its first derivative
• Standard deviation of the pitch curve and its first derivative
• Maximum value of the intensity levels and its first derivative
• Mean value of the intensity levels and its first derivative
• Range of the intensity levels and its first derivative
• Standard deviation of the intensity levels and its first derivative
• The amount of F0 jitter
• The amount of intensity shimmer
• The harmonics-to-noise ratio
• The F0 value at the end of a sentence (EndFrequency)
Here, F0 jitter is used to describe the variation of F0 from one pitch period to another. Similarly, shimmer is related to microvariations in the intensity curve. The harmonics-to-noise ratio, also known as voice creakiness, is related to the spectrum of the glottal excitation signal. Longer pulses have less high-frequency energy than shorter pulses. A creaky voice is the result of skipped (or missing) pulses. Unfortunately, Navas and Tao measure jitter and shimmer in different ways. Navas uses the number of zero-crossings of the derivative curves, while Tao measures jitter as follows:


For F0 jitter, normally, a quadratic curve is fitted to the acoustic measurement with a moving window covering five successive F0 values. The curve is then subtracted from the acoustic measurements. F0 Jitter was calculated as the mean pitch period-to-period variation in the residual F0 values.

Praat uses yet another way of measuring jitter and shimmer, and so the different measurements cannot be compared to one another.
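The following Python/numpy sketch is one possible reading of Tao's description (a moving window of five F0 values, a quadratic fit per window, and the mean period-to-period variation of the residuals). It is an interpretation for illustration, not the exact computation used in [53] or by Praat.

    import numpy as np

    def f0_jitter_tao(f0, win=5):
        """Jitter as mean period-to-period variation of F0 after removing a
        local quadratic trend fitted over a moving window of `win` values."""
        f0 = np.asarray(f0, dtype=float)
        half = win // 2
        residual = np.empty_like(f0)
        for i in range(len(f0)):
            lo, hi = max(0, i - half), min(len(f0), i + half + 1)
            x = np.arange(lo, hi)
            coeffs = np.polyfit(x, f0[lo:hi], 2)        # local quadratic trend
            residual[i] = f0[i] - np.polyval(coeffs, i)
        return np.mean(np.abs(np.diff(residual)))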

5.2 Speech analysis

Firstly, all the audio fragments made available for the presence test were converted from their original source. The sample format used in the presence test was: Microsoft ADPCM 4 bit, 22050 Hz, mono. These samples were also normalised in volume, to avoid some samples being significantly louder (or softer) than others, which is a source of great annoyance (and distraction) when having to listen to a large number of samples. The measurements, however, were done using the original samples (PCM signed 16 bit, 22050 Hz, stereo). All the stories the samples are taken from were recorded as one sample per story. This means that all the samples that come from a single story were recorded at the same volume levels. The fragments were annotated with the words spoken, and data measurements were taken using Praat [39]. Praat does not handle stereo samples, and takes only the first channel available. Fortunately, the channels were almost completely the same, so the measurements are not affected by this. Measurements consisted of F0 values and intensity values of the audio signal per word. Pauses and other non-speech sounds like drawing breath were not measured. Pauses can, to some degree, be derived from the period of vocal inactivity between words. Non-speech sounds cannot be recreated by the TTS system and were thus not measured. The measurements were saved as series of time-value pairs, where F0 was measured in Hz and intensity in dB. Jitter, shimmer and the harmonics-to-noise ratio were measured using all the data from the start of the first word until the end of the last word of each sentence. Praat reports the amount of jitter in five different ways, but because no other data was found in publications that use the same method of measuring jitter as Praat does, the choice was made to include jitter (local), since that was the default setting (see footnote 1) in the measurement data. The same applies to shimmer (see footnote 2). The time-axis values in the measured time-value pairs for F0 and intensity produced by Praat's analysis were asynchronous. In order to synchronise them, the data was imported into Matlab, where it was interpolated using Matlab's interp1 function over the union of the two asynchronous sets of time values. The mentioned analysis items concerning F0 and intensity levels were then measured on the interpolated data per word, in order to be able to analyse whether the acoustic attributes of specific words contributed more significantly to the perception of an emotion or not. In order to measure the amount of F0 accent put on each word, the average and maximum F0 values of each word were compared with the overall F0 trend line of the entire sentence. To obtain this trend line, a method was constructed that calculates it from the measured F0 contour. The aim was to find an algorithm that uses the F0 measurements as input and produces a set of F0 values that match the declining F0 trend line as closely as possible. This was done, for regular declarative sentences, by calculating a first-order regression line of all F0 values of the sentence. Because this regression line includes the values of the accented words, the line actually lies above the trend line. This can be seen in Figure 5.1, where the F0 contour is given in red, the regression line is shown in blue, and the trend line is green. By creating a new regression line which uses all the values of the original F0 contour except those values that are higher than the regression line, the trend line is created.

This method does not work for sentences with a non-falling F0 contour, or for those that have a falling and then a rising F0 contour. It is theorised that calculating the trend line separately for each F0 segment (falling, rising, etc.) could result in better trend lines, but this was not pursued further. The sentence in the figure is "Er is een brief van de feeën gekomen". By comparing the mean F0 with the mean trend line value of each word, we get the following values: Er (101%), is (98%), een (98%), brief (132%), van (107%), de (86%), feeën (158%), gekomen (100%). This nicely shows the accentuated words "brief" and "feeën", as well as the 'muffled' pronunciation of "de". The maximum values are (obviously) a little higher all around: Er (107%), is (101%), een (104%), brief (161%), van (111%), de (93%), feeën (176%), gekomen (118%).

1 The Praat help file states: "This is the average absolute difference between consecutive periods, divided by the average period. MDVP calls this parameter Jitt, and gives 1.040% as a threshold for pathology."
2 The Praat help file states: "This is the average absolute difference between the amplitudes of consecutive periods, divided by the average amplitude. MDVP calls this parameter Shim, and gives 3.810% as a threshold for pathology."

If this were analysed in conjunction with a POS tagger and/or other grammar-related rules, it is my opinion that this could result in useful information about how certain emotions are realised through the manipulation of specific words. This is not restricted to just the F0 contour and trend line, but applies, for example, also to word duration, pause duration and intensity levels. However, investigating this was expected to take too much time and focus away from the main research, and it was left as future work.

[Figure 5.1 here: plot of the F0 contour, regression line and trend line; x-axis: Time (seconds), y-axis: F0 (Hz).]

Figure 5.1: F0 contour (red), regression line (blue) and trend line (green) of the sentence "Er is een brief van de feeën gekomen" ("A letter from the fairies arrived")
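A minimal numpy sketch of this trend-line procedure (a first regression over all F0 values, then a second regression over only the points that do not lie above the first line); variable names are illustrative and this is not the original Matlab code.

    import numpy as np

    def f0_trend_line(t, f0):
        """Approximate the declination (trend) line of a falling F0 contour."""
        t, f0 = np.asarray(t, float), np.asarray(f0, float)
        line = np.polyfit(t, f0, 1)                   # regression over all points
        below = f0 <= np.polyval(line, t)             # drop points above that line
        trend = np.polyfit(t[below], f0[below], 1)    # refit on the remaining points
        return np.polyval(trend, t)                   # trend value at every time point

    # Per-word accent can then be expressed as mean(F0 of word) / mean(trend under word).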

In order to extract per-sentence measurements, the unmodified per-word data elements were concatenated to form sentences (note that inter-word pauses were not analysed, but the time axis is consistent per fragment, and thus per sentence). These values were then interpolated like the per-word data, but with one notable difference: the time axis the data was interpolated to was not the union of the time values of the F0 measurements and the intensity measurements, but a set of fixed intervals over the same time period as the F0 and intensity measurements. This was done in order to get more accurate measurements for the derivatives of the F0 and intensity values: when calculating the derivative dy/dx, less erratic results are obtained if dx remains constant. Speech rate was calculated by subtracting the first time value of the first word from the last time value of the last word, and dividing the number of syllables in the sentence by the resulting duration. Tables 5.1 to 5.5 show the results of the measurements performed on the fragments, per fragment and per character. The emotion labelling is the one ascertained by the presence test. In order to show to what degree the values differ (or not) per emotion for each measured element, figures were created. In these figures, for example Figure 5.2, the measured element is shown on the x-axis, and the various emotions are shown on the y-axis. As can be seen, the emotions 'None' and 'Surprise' (verrassing) have a similar mean F0 range, while 'Fear' (angst) has a range so large that it encompasses the remaining values for the other emotions. Figure 5.3 shows the measurements of the F0 range per emotion. The collection of figures for each character and attribute is available in the digital appendix.
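A sketch of this resampling and derivative step in Python (numpy's interp playing the role of Matlab's interp1), together with the speech-rate computation; the function names are illustrative.

    import numpy as np

    def resample_and_derive(t, values, step=0.01):
        """Interpolate a time-value series onto a fixed grid (constant dx) and
        return the grid, the resampled values and their first derivative."""
        grid = np.arange(t[0], t[-1], step)
        resampled = np.interp(grid, t, values)
        derivative = np.gradient(resampled, step)     # dy/dx with constant dx
        return grid, resampled, derivative

    def speech_rate(start_first_word, end_last_word, n_syllables):
        """Syllables per second over the span from first to last word."""
        return n_syllables / (end_last_word - start_first_word)

    # Per-sentence statistics (max, mean, range, standard deviation) then follow
    # directly from `resampled` and `derivative`, e.g. f0_range = resampled.ptp().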

Sentence                              20       17       27       2        24       21
Emotion                               Fear     Fear     Fear     Sadness  Sadness  Sadness
Max F0 (Hz)                           491      433      494      477      415      469
Mean F0 (Hz)                          348      331      402      381      356      387
F0 Range (Hz)                         220      192      202      202      160      155
StdDev F0 (Hz)                        61       59       55       64       38       38
Max dF0 (Hz/sec)                      4417     3182     2788     5263     1184     3342
Mean dF0 (Hz/sec)                     11       -119     -79      -170     -340     -70
dF0 Range (Hz/sec)                    8317     7081     6688     9163     5084     7242
StdDev dF0 (Hz/sec)                   997      1250     1241     1107     772      917
Max Int (dB)                          84.27    78.7     83.77    83.07    82.93    84.77
Mean Int (dB)                         77.66    73.45    75.34    75.98    76.85    74.86
Int Range (dB)                        21.84    18.86    24.25    24.74    23.26    22.75
StdDev Int (dB/sec)                   4.13     4.09     5.84     5.18     4.68     5.2
Max dInt (dB/sec)                     656.44   499.73   865.05   561.92   572.29   550.22
Mean dInt (dB/sec)                    -14.41   -25.78   -41      -48.94   -24.32   -24.18
dInt Range (dB/sec)                   747.97   588.45   1019.45  960.45   791.94   579.42
StdDev dInt (dB/sec)                  176.71   199.33   252.25   239      207.62   163.63
Jitter (local)                        1.96%    2.85%    3.73%    2.01%    2.22%    3.06%
Shimmer (local)                       7.93%    13.08%   12.30%   9.14%    8.46%    11.98%
Fraction of locally unvoiced frames   19.33%   27.27%   11.32%   21.86%   12.84%   3.14%
Mean harmonics-to-noise ratio (dB)    12.917   6.004    6.663    11.791   10.67    9.44
Speech Rate (Syll/sec)                5.86     4.81     5.35     6.93     4.55     4.6
End Frequency (Hz)                    455      315      298      287      349      383

Table 5.1: Analysis of fragments by character Dot part 1

For the character Dot (with the most fragments in total), the emotions fear and sadness were each linked to three fragments, and the emotions surprise and 'none' were each linked to two fragments, leaving the emotions happiness and anger with only one fragment each. This number of fragments is too small to create a good many-to-many relation (see footnote 3) between a neutral-sounding fragment (the 'none' emotion) and any other emotion. Also, a lot of the measured data is quite inconsistent within the different characters, as can be seen reflected in the large spread of the values for, for example, the emotion fear ('angst') in figure 5.2. If these values were clustered closely together, it would be possible to correlate that specific value range with an emotion, but with ranges this wide, the current number of fragments is not enough to demonstrate a correlation. This applies to all the attributes measured for all characters; the two figures 5.2 and 5.3 are the most insightful because the character Dot had the most fragments. The character Tante had only four fragments, and only one of those was labelled with an emotion other than 'None', leaving little to be shown in a figure like figure 5.2.

3 Linking two sets of values to each other.

Sentence                              8          18       6        16       14        23
Emotion                               Happiness  Anger    None     None     Surprise  Surprise
Max F0 (Hz)                           519        548      414      344      528       410
Mean F0 (Hz)                          351        389      305      296      281       312
F0 Range (Hz)                         301        297      218      117      358       187
StdDev F0 (Hz)                        81         75       63       38       107       54
Max dF0 (Hz/sec)                      5509       6328     2642     1442     7961      2302
Mean dF0 (Hz/sec)                     63         -181     -321     -401     168       -142
dF0 Range (Hz/sec)                    9409       10228    6541     5342     11861     6202
StdDev dF0 (Hz/sec)                   1463       1044     1037     645      1318      803
Max Int (dB)                          83.53      84.36    82.37    78.48    82.56     80.32
Mean Int (dB)                         77.51      77.33    75.3     73.34    73.52     72.17
Int Range (dB)                        20.38      26.43    21.84    16.46    24.81     22.89
StdDev Int (dB/sec)                   4.8        5.41     4.72     4.4      4.63      3.85
Max dInt (dB/sec)                     520.48     719      481.16   335.91   681.32    698.13
Mean dInt (dB/sec)                    -42.63     -59.41   -45.71   -67.54   -38.59    -33.76
dInt Range (dB/sec)                   828.05     891.6    674.18   455.58   1010.52   951.04
StdDev dInt (dB/sec)                  181.03     221.09   193.21   115.93   205.06    195.99
Jitter (local)                        2.06%      1.72%    2.89%    2.80%    2.48%     2.53%
Shimmer (local)                       8.09%      7.59%    13.11%   12.31%   10.64%    11.42%
Fraction of locally unvoiced frames   8.02%      14.38%   32.56%   26.36%   20.66%    19.75%
Mean harmonics-to-noise ratio (dB)    11.271     14.556   7.236    9.192    9.306     9.93
Speech Rate (Syll/sec)                5.02       5.39     4.07     4.87     5.27      4.38
End Frequency (Hz)                    397        287      280      228      492       244

Table 5.2: Analysis of fragments by character Dot part 2

All the figures are, however, available in the digital appendix, which is located on the CD that comes with this report. The wide ranges of the measured prosodic attributes can be caused by the emotions really being expressed differently (or not being the same emotion to begin with), by the labelling being wrong, or by the measured attributes simply not contributing to some specific emotions in an easily discernible way. Despite these incoherent measurements, it is possible to extract a linear relation between each emotion and the 'None' emotion per character. This is done by dividing the average values of a category linked to an emotion by the average values of the same category linked to the 'None' emotion. The results of this can be found in tables A.7 to A.10, where all the absolute values have been replaced by relative values with respect to the average values of the 'None' emotion. The 'None' emotion has been left out of these tables, as its values would obviously all be ones. A short extract of these tables, showing only a few of the measured prosodic attributes of the utterances, can be seen in Table 5.6. The numbers are simple multiplication factors to be used in the formula:

Attribute(target emotion) = Attribute(neutral) × Factor(target emotion)

The numbers for the other measured prosodic attributes of all the characters were calculated in the same way. These numbers do not show anything significant on their own, but will be used in the next section in the comparison with measurements from literature. Averaging the values like this results in a crude model, but only a large amount of additional analysable data could fix this. Of all these measured categories, only a few are directly manipulable in Festival: intonation, intensity and duration. This means that we will only be able to use the measurements of F0, intensity and speech rate.
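A small Python sketch of how the factors in table 5.6 can be derived: average each attribute over a character's 'None' fragments and divide the per-emotion averages by that baseline. The data structure is illustrative, not the format actually used.

    from collections import defaultdict

    def emotion_factors(fragments):
        """fragments: per-character list of dicts with an 'emotion' key plus
        measured attributes such as 'mean_f0', 'f0_range', 'speech_rate'."""
        by_emotion = defaultdict(list)
        for frag in fragments:
            by_emotion[frag["emotion"]].append(frag)

        def mean(frags, attr):
            return sum(f[attr] for f in frags) / len(frags)

        attrs = [a for a in fragments[0] if a != "emotion"]
        neutral = {a: mean(by_emotion["None"], a) for a in attrs}
        return {emo: {a: mean(frags, a) / neutral[a] for a in attrs}
                for emo, frags in by_emotion.items() if emo != "None"}

    # factors["Fear"]["mean_f0"] is then the factor in
    # attribute(fear) = attribute(neutral) * factor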

Sentence                              28       4          30       22       13        29        11
Emotion                               Fear     Happiness  Anger    None     None      None      Surprise
Max F0 (Hz)                           318      406        482      303      411       376       322
Mean F0 (Hz)                          254      234        402      224      249       245       213
F0 Range (Hz)                         125      281        221      156      252       207       195
StdDev F0 (Hz)                        28       87         66       46       75        51        52
Max dF0 (Hz/sec)                      3197     5221       2998     3537     3266      1360      3773
Mean dF0 (Hz/sec)                     -84      -95        -181     72       104       -526      -216
dF0 Range (Hz/sec)                    7097     9121       6897     7437     7166      5260      7673
StdDev dF0 (Hz/sec)                   964      1041       1211     705      1164      845       898
Max Int (dB)                          75.18    79.12      83.03    81.08    80.46     74.81     73.17
Mean Int (dB)                         69.70    69.08      73.99    70.98    70.60     68.88     65.76
Int Range (dB)                        17.68    21.92      26.91    26.94    23.54     19.33     17.06
StdDev Int (dB/sec)                   4.11     5.55       6.74     5.73     6.47      4.68      2.90
Max dInt (dB/sec)                     234.25   629.95     643.33   485.13   692.21    667.31    428.63
Mean dInt (dB/sec)                    -71.25   -61.40     -81.75   -63.36   -85.10    -100.81   -55.85
dInt Range (dB/sec)                   579.05   843.03     839.88   921.05   977.00    941.32    665.02
StdDev dInt (dB/sec)                  150.79   204.95     246.24   242.48   317.15    222.24    166.76
Jitter (local)                        3.35%    3.01%      1.92%    3.45%    4.29%     3.78%     3.51%
Shimmer (local)                       13.47%   11.51%     9.74%    12.66%   13.79%    14.82%    14.44%
Fraction of locally unvoiced frames   35.19%   27.62%     21.52%   24.51%   35.75%    33.77%    25.00%
Mean harmonics-to-noise ratio (dB)    8.983    9.546      12.416   7.951    5.834     7.624     6.785
Speech Rate (Syll/sec)                4.48     4.33       5.25     6.43     6.48      4.35      5.53
End Frequency (Hz)                    245      127        431      149      160       171       127

Table 5.3: Analysis of fragments by character Magda

[Figure 5.2 here: for character Dot, mean F0 (Hz, roughly 280-420) per fragment, grouped by the perceived emotions Geen, Angst, Droefheid, Verrassing, Blijheid and Boosheid.]

Figure 5.2: Mean F0 values of sentences by character Dot

5.3 Comparison with other measurements

As mentioned earlier in section 5.1, the available literature also provides a source of earlier measurements to compare the measurements obtained in section 5.2 with. Unfortunately, not everybody uses absolute measurements, and in order to find more literature to compare the measurements with, the search was extended to include literature with relative measurements.

Sentence                              7        10       19       9        25       5         12
Emotion                               Anger    Sadness  None     None     None     Surprise  Surprise
Max F0 (Hz)                           184      320      210      153      251      244       351
Mean F0 (Hz)                          147      231      154      135      166      185       214
F0 Range (Hz)                         69       141      95       40       115      116       245
StdDev F0 (Hz)                        17       24       30       9        36       40        84
Max dF0 (Hz/sec)                      1896     3971     1782     792      861      940       2178
Mean dF0 (Hz/sec)                     -90      -59      -5       -9       -104     30        143
dF0 Range (Hz/sec)                    5795     7871     5682     4692     4760     4840      6078
StdDev dF0 (Hz/sec)                   359      811      393      216      412      287       879
Max Int (dB)                          81.01    82.23    81.53    77.42    79.49    82.39     83.65
Mean Int (dB)                         72.18    71.81    71.58    69.69    68.14    76.64     69.45
Int Range (dB)                        22.13    23.99    24.93    22.31    21.70    25.46     27.83
StdDev Int (dB/sec)                   4.70     4.68     5.87     3.60     5.77     4.89      8.54
Max dInt (dB/sec)                     599.70   685.39   856.91   386.55   267.01   605.08    185.47
Mean dInt (dB/sec)                    -29.11   -7.80    -27.52   -27.75   -51.92   -34.58    -94.83
dInt Range (dB/sec)                   554.74   731.12   817.40   866.45   557.29   852.40    765.83
StdDev dInt (dB/sec)                  169.80   212.52   175.87   148.36   165.36   188.13    185.96
Jitter (local)                        2.71%    2.54%    2.10%    1.47%    4.15%    1.36%     4.81%
Shimmer (local)                       12.26%   10.94%   7.40%    9.15%    18.52%   9.36%     14.13%
Fraction of locally unvoiced frames   21.43%   8.19%    17.23%   19.62%   39.39%   10.59%    39.56%
Mean harmonics-to-noise ratio (dB)    7.267    11.05    11.17    10.323   3.511    12.033    4.317
Speech Rate (Syll/sec)                4.69     5.88     4.28     3.60     4.48     7.14      3.36
End Frequency (Hz)                    116      221      184      129      143      227       115

Table 5.4: Analysis of fragments by character Frederik

[Figure 5.3 here: for character Dot, F0 range (Hz, roughly 100-400) per fragment, grouped by the perceived emotions Geen, Angst, Droefheid, Verrassing, Blijheid and Boosheid.]

Figure 5.3: F0 ranges of sentences by character Dot

Marc Schröder [46] has performed research in the field of emotional speech with regard to TTS systems. In his PhD thesis, which contains a large amount of literature research, he includes 11 lists of prosody rules for emotional speech synthesis systems presented in previous publications. This list has been copied below. Additionally, research done by Jianhua Tao et al., who also investigate the acoustic attributes of the emotions fear, anger, sadness and joy, has been added, resulting in the following list of literature containing measurements of emotional speech that can be (and in most cases is) used to create prosody rules for emotional TTS systems (see footnote 4):

Sentence                              15       3        1        26
Emotion                               Anger    None     None     None
Max F0 (Hz)                           484      436      191      501
Mean F0 (Hz)                          274      173      157      295
F0 Range (Hz)                         349      316      63       388
StdDev F0 (Hz)                        114      51       16       168
Max dF0 (Hz/sec)                      4130     1459     2968     2530
Mean dF0 (Hz/sec)                     3        -187     -76      -293
dF0 Range (Hz/sec)                    8030     5359     6868     6430
StdDev dF0 (Hz/sec)                   1601     545      593      897
Max Int (dB)                          80.31    78.06    76.03    78.56
Mean Int (dB)                         73.72    69.87    69.95    67.86
Int Range (dB)                        26.40    23.39    17.52    22.30
StdDev Int (dB/sec)                   5.13     4.88     3.25     6.72
Max dInt (dB/sec)                     591.59   602.11   519.41   810.14
Mean dInt (dB/sec)                    -54.25   -36.72   -28.16   -46.01
dInt Range (dB/sec)                   738.17   956.91   764.43   683.27
StdDev dInt (dB/sec)                  229.14   221.84   174.82   255.53
Jitter (local)                        3.33%    3.31%    2.80%    3.18%
Shimmer (local)                       11.16%   12.76%   13.39%   7.17%
Fraction of locally unvoiced frames   22.54%   32.67%   23.53%   41.83%
Mean harmonics-to-noise ratio (dB)    7.086    1.196    9.06     7.785
Speech Rate (Syll/sec)                6.43     4.69     6.25     7.68
End Frequency (Hz)                    356      151      129      114

Table 5.5: Analysis of fragments by character Tante

• Burkhardt & Sendlmeier [4] for German
• Cahn [5] for American English
• Gobl & Ni Chasaide [20] for Irish English
• Heuft et al. [21] for German
• Iriondo et al. [22] for Castilian Spanish
• Campbell & Marumoto [7] for Japanese
• Montero et al. [28][29] for Spanish
• Mozziconacci [30]; Mozziconacci & Hermes [32] for Dutch
• Murray & Arnott [34] for British English
• Murray et al. [33] for British English
• Rank & Pirker [41] and Rank [40] for Austrian German
• Tao et al. [53] for Mandarin
The emotions appearing in the publications of the above list are mostly the same as the six emotions investigated in this thesis. The only exceptions are 'Boredom', which appears only in some of the publications and was not present as an option in this thesis, and 'Love', which was not in any of the publications but is present in this research. Another point is that a lot of different scales and units are used in the above list of publications, which makes comparing them a rather imprecise process. Nevertheless, the prosody rules from the publications in the list above have been copied from Schröder's work (appendix A in [46]) into tables A.11 through A.25 in the appendix, with the addition of the results of Tao et al. and the exclusion of the results of Gobl & Ni Chasaide. The results of Gobl & Ni Chasaide have been excluded because they focused only on voice quality, and Festival cannot influence voice quality.

4 Note that this list contains research concerning TTS systems and the recognition of synthesised emotional speech using modified prosodic attributes; the references in section 5.1, with the exception of Tao et al. and Mozziconacci, do not use TTS.

Character: Dot
Emotion    Max F0   Mean F0   F0 Range   Max Int   Mean Int   Int Range   Speech Rate
Fear       1.25     1.20      1.22       1.02      1.02       1.13        1.19
Joy        1.37     1.17      1.80       1.04      1.04       1.06        1.12
Anger      1.45     1.30      1.77       1.05      1.04       1.38        1.21
Sadness    1.20     1.25      1.03       1.04      1.02       1.23        1.20
Surprise   1.24     0.99      1.62       1.01      0.98       1.25        1.08

Character: Magda
Emotion    Max F0   Mean F0   F0 Range   Max Int   Mean Int   Int Range   Speech Rate
Fear       0.88     1.06      0.61       0.95      0.99       0.76        0.78
Joy        1.12     0.98      1.37       1.00      0.98       0.94        0.75
Anger      1.33     1.68      1.07       1.05      1.05       1.16        0.91
Surprise   0.89     0.89      0.95       0.93      0.94       0.73        0.96

Character: Frederik
Emotion    Max F0   Mean F0   F0 Range   Max Int   Mean Int   Int Range   Speech Rate
Anger      0.90     0.97      0.83       1.02      1.03       0.96        1.14
Sadness    1.56     1.52      1.69       1.03      1.03       1.04        1.43
Surprise   1.46     1.32      2.16       1.04      1.05       1.16        1.27

Character: Tante
Emotion    Max F0   Mean F0   F0 Range   Max Int   Mean Int   Int Range   Speech Rate
Anger      1.29     1.31      1.36       1.04      1.06       1.25        1.04

Table 5.6: Linear relation between emotions and neutral for all characters

The tables with the prosody rules were then summarised to produce, per emotion, a table with the range of the following measured items: Mean F0, F0 Range, Tempo, Loudness and an 'other' category. Because of the different units used in most of the publications, the measurements that do not use compatible measurement units can only be compared on a qualitative basis. Tables 5.7 to 5.11 show the summaries of the prosody rules for synthesising emotional speech found in the literature. The quantitative figures represent the various figures used in the literature that had comparable units, and the overall assessment contains a qualitative conclusion by me. The scale used for the qualitative terms is as follows (from very negative to very positive): Much lower, Lower, Decreased, Normal, Increased, Higher, Much higher. This scale was used in order to be able to compare all the different units used in all the prosody rule sets.

Happy/Joy   Quantitative figures                             Overall assessment
F0 Mean     -30%..+50%                                       Varied, bigger tendency to go up
F0 Range    'slightly increased' to 'much higher' (+138%)    Overall usage varies between higher and much higher
Tempo       -20%..+30%                                       Increased (around 20%), mainly caused by pause duration decrease
Loudness    +0..+3dB                                         Increased, more variance
Other                                                        Stressed syllables F0 higher

Table 5.7: Summary of happiness prosody rules

Sadness     Quantitative figures   Overall assessment
F0 Mean     +37Hz..-30Hz           Decreased
F0 Range    +40%..-50%             Lower
Tempo       -8%..-40%              Lower
Loudness    -5%..-30%              Decreased
Other                              Articulation precision down, jitter up

Table 5.8: Summary of sadness prosody rules

Anger       Quantitative figures   Overall assessment
F0 Mean     -50%..+69%             Varies; cold anger: decreased, hot anger: higher
F0 Range    +12.5%..+100%          Higher
Tempo       +2%..+50%              Higher
Loudness    +10%..+6dB             Higher
Other                              Articulation precision up (especially for stressed syllables)

Table 5.9: Summary of anger prosody rules

Fear        Quantitative figures   Overall assessment
F0 Mean     +5%..+200%             Much higher
F0 Range    -5%..+100%             Increased
Tempo       +12%..+50%             Higher (probably depending on emotion strength)
Loudness    +10..+20%              Increased, low variance
Other                              Stressed syllables F0 up

Table 5.10: Summary of fear prosody rules

Surprise    Quantitative figures   Overall assessment
F0 Mean     +0%..+15%              Higher
F0 Range    +15%..+35%             Increased
Tempo                              Increased
Loudness    +3dB..+5dB             Increased

Table 5.11: Summary of surprise prosody rules

The raw acoustic measurements from the fragments have been converted to a linear relation. This linear relation is, per character, a comparison between the values of each fragment (except the 'None' emotional fragments) and the average of the values for the 'None' emotional fragments of that same character. The resulting linear factors have been translated into a qualitative statement using the following 7-point scale (from low to high):
• Much lower (− − −): factor is smaller than 0.5 (-50% and down)
• Lower (−−): factor is between 0.5 and 0.8 (-50% to -20%)
• Decreased (−): factor is between 0.8 and 0.95 (-20% to -5%)
• Normal (o): factor is between 0.95 and 1.05 (5% variation with regard to the neutral average; see footnote 5)
• Increased (+): factor is between 1.05 and 1.25 (+5% to +25%)
• Higher (++): factor is between 1.25 and 1.6 (+25% to +60%)
• Much higher (+++): factor is bigger than 1.6 (+60% and up)
This scale is not symmetric with regard to neutrality, because human perception is logarithmic. For example, we notice small frequency changes much better when they are changes around a low frequency (40 Hz with changes of 5 Hz). At higher frequencies, however, small changes go unnoticed (15000 Hz with changes of 5 Hz).

The asymmetry of the above scale is an attempt to roughly model this logarithmic perception in the qualitative perceptive categories ([43], chapter 9.5). The qualitative measurements gained this way were compared with the qualitative results from the literature based on matching emotions. The result can be found in tables 5.12 to 5.15.

5 The 5% is a margin for measurement error. Also, the minimum values in the summary tables above differ from the relative norm by 5% or more, with the exception of 3 specific measurements.
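The scale above translates directly into a small mapping function; the Python sketch below uses the thresholds as listed (the exact boundary handling is my own choice, since the list leaves the boundaries open).

    def qualitative(factor):
        """Map a relative factor (emotion average / neutral average) to the 7-point scale."""
        if factor < 0.5:
            return "---"   # much lower
        if factor < 0.8:
            return "--"    # lower
        if factor < 0.95:
            return "-"     # decreased
        if factor <= 1.05:
            return "o"     # normal (5% margin for measurement error)
        if factor <= 1.25:
            return "+"     # increased
        if factor <= 1.6:
            return "++"    # higher
        return "+++"       # much higher

    # e.g. Magda's mean F0 factor for anger (1.68, table 5.6) maps to "+++",
    # matching the corresponding entry in table 5.14.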

Emotion       Fear   Fear        Fear    Happiness    Anger
Mean F0       +      +           ++      ++           ++
F0 Range      −−     −           −−      +            +
Mean Int      o      +           o       +            +
Speech Rate   o      +           +       o            +
Check         F0     F0,Int,SR   F0,SR   F0,F0R,Int   F0,F0R,Int,SR

Table 5.12: Qualitative comparison of fragments by character Dot, part 1

Emotion       Sadness   Sadness   Sadness   Surprise   Surprise
Mean F0       ++        ++        +         −          +
F0 Range      −−        −−        −−        ++         −−
Mean Int      o         o         +         o          o
Speech Rate   ++        o         −         +          −
Check         F0R       F0R       F0R,SR    F0R,SR     F0

Table 5.13: Qualitative comparison of fragments by character Dot, part 2

Emotion       Fear   Happiness   Anger        Surprise
Max F0        −      +           ++           −
Mean F0       +      o           +++          −
F0 Range      −−     ++          +            o
Mean Int      o      o           +            −
Speech Rate   −−     −−          −            o
Check         F0     F0,F0R      F0,F0R,Int   (none)

Table 5.14: Qualitative comparison of fragments by character: Magda

Emotion       Anger      Sadness    Surprise        Surprise   Anger
Char          Frederik   Frederik   Frederik        Frederik   Tante
Max F0        −          ++         +               +++        ++
Mean F0       o          ++         +               ++         ++
F0 Range      −          +++        ++              +++        ++
Mean Int      o          o          +               o          +
Speech Rate   +          ++         +++             −−         o
Check         SR         (none)     F0,F0R,Int,SR   F0,F0R     F0,F0R,Int

Table 5.15: Qualitative comparison of fragments by characters: Frederik and Tante

The "Check" row lists the attributes whose direction of change (increase, decrease, no change) matches the literature, with 'F0' representing Mean F0, 'F0R' representing F0 Range, 'Int' representing the mean intensity (loudness) and 'SR' representing the speech rate. An entry of (none) means that none of these attributes matched.

5.4 Discussion

5.4.1 Quality and quantity of fragments

Tables 5.12 to 5.15 show that almost all (17 of the 19) fragments qualitatively matched the prosody rules from the literature on one or more measured attributes. This is not really difficult, considering that 4 of the 5 emotions (happiness, anger, fear and surprise) are linked to increases in F0, F0 range, loudness and rate of speech. The absolute measurements are quite diverse, but so are the different prosody rule sets found in the literature. The surprisingly low number of matches for loudness may be the result of volume normalisation done in the recording studios, which would have decreased the amount of variance in the entire story, making it possible to discern emotions by loudness only when extremes are involved. The low number of matches for speech rate could have its origins in the fact that these are child-oriented fairy tales: storytellers cannot start talking too fast, or they risk the audience (children) being unable to keep up. This can be attributed to a general story-telling style, but in order to verify that this style overrides the speech rate changes associated with certain emotions, more research on this specific subject is required. The number of fragments (12 for the character Dot, 7 for Magda, 7 for Frederik, and 4 for Tante) is too low to be of much statistical use, especially considering that those fragments are divided over various emotions (6, 5, 5 and 2 for the characters Dot, Magda, Frederik and Tante respectively). Considering the low Krippendorff's alpha value (0.284), which indicates that the people who had to link the fragments to an emotion were in quite poor agreement, it can be argued that the emotions were not always present, or not as clearly present as hoped for, which caused the fragments to be improperly labelled. In order to avoid emotion labelling issues like this in the future, another approach to categorising emotions is discussed in section 6.1. The various measurements performed on the fragments did not show recognisable patterns in any of the attributes that were targeted for use in the TTS system (i.e. those that the TTS system can influence: Mean F0, Max F0, F0 Range, Mean Intensity, Max Intensity, and Speech Rate). This could be caused by the emotion labelling, which would cause measurements to be counted towards the 'wrong' emotion. It could also be caused by the fact that the same emotion can be expressed in different ways (with different prosodic configurations). The per-word analysis, in combination with a grammar parser, could shed some more light on whether an emotional charge is focused on specific words, and to what degree. For example, the prosody rules by Burkhardt & Sendlmeier [4] and Cahn [5] mention the influence of the F0 of stressed syllables, the rate of speech for specific words, and the amount and speed of pauses, but it is my opinion that it is too time-consuming to research these effects in detail at this moment, especially with the current set of fragments. The origin of the diversity of the measured acoustic correlates of the emotional speech fragments could lie with the source selected for the speech fragments. If an actor had been employed to create emotional speech fragments containing a targeted emotion, the whole process of emotion labelling, which led to quite divergent results (according to Krippendorff's alpha value), would not have been necessary.
It would also have created the opportunity to have multiple speech fragments containing the same text but different emotions. This would make the acoustic analysis easier, because the utterances would contain the same words in the same order. The acoustic analysis was performed by measuring a lot of attributes, most of which turned out to be of no use at all. It would have been better to focus on the parameters which are currently modifiable by the TTS system. There are also parameters which the TTS system currently does not allow to be modified, but which could be modified if the appropriate changes were made to the TTS system. These parameters, such as pause and vowel duration, should have been included in the acoustic analysis. All in all, the measurements on their own are too few and too incoherent to create a model that could be expected to yield good results. In order to create a model that is not a complete stab in the dark, I decided to abandon the measurements in the previous section and use the common elements of the best recognised rule sets found in the literature as a basis for the model in this research. These rule sets were not all based on (and tested with) the Dutch language, which is why this was a backup for the presence experiment, but at the moment it is the best data available. This comes down to the rules shown in table 5.16, a table of the prosody modification rules which were recognised best in a perception test, as displayed in Schröder [46], p. 94.

Joy (Burkhardt & Sendlmeier [4], German, recognition 81% (1/9)):
    Mean F0: +50%; F0 range: +100%; Tempo: +30%

Sadness (Cahn [5], American English, recognition 91% (1/6)):
    Mean F0: "0", reference line "-1", less final lowering "-5"; F0 range: "-5", steeper accent shape "+6"; Tempo: "-10", more fluent pauses "+5", hesitation pauses "+10"; Loudness: "-5"

Anger (Murray & Arnott [34], British English):
    Mean F0: +10Hz; F0 range: +9 s.t.; Tempo: +30 wpm; Loudness: +6dB

Fear (Burkhardt & Sendlmeier [4], German, recognition 52% (1/9)):
    Mean F0: +150%; F0 range: +20%; Tempo: +20%

Surprise (Cahn [5], American English, recognition 44% (1/6)):
    Mean F0: "0", reference line "-8"; F0 range: "+8", steeply rising contour slope "+10", steeper accent shape "+5"; Tempo: "+4", less fluent pauses "-5", hesitation pauses "-10"; Loudness: "+5"

Boredom (Mozziconacci [30], Dutch, recognition 94% (1/7)):
    Mean F0: end frequency 65Hz (male speech); F0 range: excursion size 4 s.t.; Tempo: duration relative to neutrality: 150%

Table 5.16: The model: best recognised prosody rules from literature [46] pg. 94
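Purely as an illustration, the rule-sets of table 5.16 can be represented as a small lookup structure. The sketch below (in Python, with simplified parameter names of my own choosing) only covers the rules that are stated in physical units or percentages, and leaves out Cahn's scale values for sadness and surprise, which are defined on Cahn's own -10..+10 scale.

```python
# Illustrative sketch only: the best-recognised prosody rule-sets of table 5.16
# as a lookup table. The key and parameter names are my own simplification;
# Cahn's scale values (sadness, surprise) are omitted because they are not
# expressed in physical units.
PROSODY_RULES = {
    "joy":     {"mean_f0": "+50%",  "f0_range": "+100%", "tempo": "+30%"},                        # Burkhardt & Sendlmeier [4]
    "anger":   {"mean_f0": "+10Hz", "f0_range": "+9st",  "tempo": "+30wpm", "loudness": "+6dB"},  # Murray & Arnott [34]
    "fear":    {"mean_f0": "+150%", "f0_range": "+20%",  "tempo": "+20%"},                        # Burkhardt & Sendlmeier [4]
    "boredom": {"end_f0": "65Hz",   "f0_excursion": "4st", "duration": "150%"},                   # Mozziconacci [30]
}

def rules_for(emotion):
    """Return the literature rule-set for an emotion, or None if none exists."""
    return PROSODY_RULES.get(emotion.lower())

print(rules_for("Fear"))   # {'mean_f0': '+150%', 'f0_range': '+20%', 'tempo': '+20%'}
```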

5.5 Conclusion

As discussed above, the number of fragments used for the acoustical analysis was far too small to provide a solid statistical base to build a model upon. These few fragments, with possibly incorrect emotion labelling and, in retrospect, a labelling method that turned out to be too coarse, resulted in values per emotion that were too erratic (also when compared with measurements found in literature), which made it impossible to devise a model that would translate an emotional term into a well-defined set of prosodic attribute changes. This led me to the drastic action of dropping the fragments and the results of their analysis, and adopting a new approach. This approach is explained in the next chapter.

Chapter 6

Plan B

Though the results of the fragment analysis could not be used in the construction of a model that morphs neutral speech into emotional speech, there is an alternative approach, which was found while looking for literature that would explain the conflicting measurements of the previous chapter. This chapter introduces another way of describing emotions, the Emotion Dimension Framework (EDF); the method by which the EDF is linked to the emotional terms used earlier in this thesis is described in section 6.1. This method and the EDF, together with an acoustical analysis performed by Schröder on the Belfast database of spontaneous emotional speech, resulted in a new model (hereafter referred to as model B). This model is explained in section 6.2.

6.1 The Emotion Dimension Framework

There are other methods of describing emotions than just using emotional labels. Schröder [46] devotes a chapter of his thesis to a literature survey of the various frameworks for emotions that people have used: emotion categories, prototype descriptions, physiology-based descriptions, appraisal-based descriptions, circumplex models, and emotion dimensions. From these frameworks, Schröder chose to use the Emotion Dimension Framework to describe emotions in his thesis, with the following remark ([46] pg. 32):

I tend to agree with Lazarus in that emotion dimensions are a reduced account of emotions, in which many important aspects of emotions are ignored, such as eliciting conditions and specific action tendencies. Nevertheless, I consider a dimensional description as a useful representation, capturing those aspects that appear as conceptually most important in many different methodologies, and providing a means for measuring similarity between emotional states. In particular, a dimensional description is particularly well-suited for . . . generating a voice prosody which is compatible with an emotional state expressed through a different channel, rather than fully defining that emotional state. A reduced description is sufficient for this purpose, and does not preclude being used in conjunction with richer descriptions.

From this remark we can conclude that the EDF does not describe emotions completely, but does describe them sufficiently to generate voice prosody that is compatible with an emotional state. This also applies to this thesis, where we likewise want to generate voice prosody that makes an utterance sound emotional. An EDF uses multiple dimensions to represent emotional states; depending on the researcher, the axes and their scales may vary. Schröder has investigated these different emotion frameworks in [46] Chapter 2, with the EDFs in section 2.6 specifically. These details will not be reproduced here; the interested reader is invited to read Schröder's thesis. Of all the frameworks available, he chose to use a three-dimensional EDF with the following axes:


• activation (also known as activity or arousal), ranging from passive to active.
• evaluation (also called pleasure or valence), ranging from negative to positive.
• power (also known as dominance or control).

Of these axes, the activation and evaluation dimensions are used in the same way as in the Feeltrace [10] tool. This tool can be used to track the activation and evaluation dimensions of a perceived emotion over time while listening to audio data (which can be speech, music or other audio that can be perceived as having emotional content). About the power axis, Schröder says this ([46] p. 105):

The rich characterisation of these emotion words obtained in previous experiments [9] allows the addition of the power dimension, as associated with the emotion word, to each clip. Thus each clip is positioned on the three dimensions, with activation and evaluation changing over time during the clip and power remaining static.

However, even after carefully reading [9], it remains unclear how exactly these power values have been associated with the emotion words shown in table 6.1. In order to link the activation and evaluation axes to emotion words, Schröder used Feeltrace ([46] pg. 131). By having people who were trained in the use of Feeltrace assign both an emotion word and Feeltrace activation and evaluation values to audio fragments, the emotion word became linked to activation and evaluation values. This resulted in table 6.1. The values are on a scale from -100 to 100.

Emotion        Activation   Evaluation   Power
neutral             1.8         -1.7        0
bored              -6.8        -17.9      -55.3
disappointed        2.4        -24.9      -37.2
sad               -17.2        -40.1      -52.4
worried             4.6        -26.3      -62.3
afraid             14.8        -44.4      -79.4
angry              34.0        -35.6      -33.7
interested         16.8         16.6       -6.1
excited            36.1         30.5       -5.8
loving              1.2         33.3       14.9
affectionate        0.7         37.3       21.4
pleased            19.0         38.6       51.9
confident          13.8         14.1       32.9
happy              17.3         42.2       12.5
amused             23.4         16.8       -5.0
content           -14.9         33.1       12.2
relaxed           -18.5         25.7       -5.2

Table 6.1: Positions on the three emotion dimensions for some emotion words

6.2 A new model

Schröder performed an acoustic analysis of the Belfast database of spontaneous emotional speech in [46] chapter 11. He used this analysis to extract correlations between a number of acoustic variables (such as F0 median, F0 range, pause duration, intensity median, and intensity range) and the three emotion dimensions: activation, evaluation and power. These linear regression coefficients have been copied from [46] table 11.7 into table 6.2. The table only shows the values for female speech; Schröder also extracted data for male speech, but both tables are not necessary for understanding the model. As an example, Schröder calculates the angry-sounding F0 median as the sum of the base value and, for each emotion dimension, the product of the regression coefficient for that dimension with the value of that dimension associated with the emotion category 'angry'. This calculation is expressed in formula 6.1, given after table 6.2.

Acoustic variable (unit): constant, followed by the significant linear regression coefficients for the activation, evaluation and power dimensions.

F0 median (Hz): 200.1 | 0.370 | -0.0523 | -0.190
F0 range (Hz): 28.45 | 0.243 | -0.0531
med. magn. F0 rises (Hz): 14.57 | 0.107 | (-0.0140)
range magn. F0 rises (Hz): 20.94 | 0.212 | -0.0352
med. magn. F0 falls (Hz): 19.46 | 0.177 | -0.0488
range magn. F0 falls (Hz): 24.68 | 0.244 | -0.0727
med. dur. F0 rises (sec): 0.2140 | 0.000379 | 0.000258 | 0.000216
rng. dur. F0 rises (sec): 0.1914 | 0.000305 | (-0.000164) | (0.000180)
med. dur. F0 falls (sec): 0.2448 | 0.000537 | (0.000140) | (0.000173)
rng. dur. F0 falls (sec): 0.1924 | 0.000311 | -0.000309 | 0.000338
med. slope F0 rises (Hz/sec): 76.35 | 0.403 | -0.166 | -0.106
med. slope F0 falls (Hz/sec): 85.85 | 0.516 | -0.191 | -0.101
F0 rises p. sec. (1/sec): 2.712 | -0.00593 | -0.00217 | (-0.00202)
F0 falls p. sec. (1/sec): 2.486 | -0.00578 | -0.00273
duration pauses (sec): 0.4367 | -0.00122 | (0.0003322) | -0.000775
'tune' duration (sec): 1.424 | 0.00355 | (-0.00105)
intensity peaks p. sec. (1/sec): 4.090
fricat. bursts p. sec. (1/sec): 1.410 | -0.00464 | 0.00247
intensity median (cB): 531.2 | 0.0513 | -0.0667 | 0.0615
intensity range (cB): 103.2 | 0.149
intens. dynamics at peaks (cB): 25.89 | 0.0525 | -0.0256
spectral slope non-fric. (dB/oct): -7.396 | 0.0147 | -0.00293 | 0.00181
Hamm. 'effort': 32.98 | 0.0229 | 0.00898
Hamm. 'breathy': 8.084 | -0.0235
Hamm. 'head': 24.68 | -0.0121 | -0.0282
Hamm. 'coarse': 16.38 | -0.0178 | -0.0114
Hamm. 'unstable': 8.297 | 0.00569 | -0.0170

Table 6.2: Significant linear regression coefficients for acoustic variables predicted by emotion dimensions for female speech. Coefficients printed in bold in the source are significant at p < .001; coefficients printed as plain text are significant at p < .01; and coefficients printed in brackets are significant at p < .05.

Formula 6.1 is written for F0, but can be used for any of the acoustic correlates from table 6.2.

F0(emotion) = F0_base + Σ_{I ∈ {A, E, P}} LRC(I) * ValueOf(I, emotion)    (6.1)

Here LRC stands for Linear Regression Coefficient, and the ValueOf function performs the table lookup that translates the emotion word into the values of the three dimensions (A)ctivation, (E)valuation and (P)ower. The LRC values are taken from table 6.2. With this formula and table 6.2, it is possible to calculate the emotional variant of each of the acoustic variables mentioned in the table for each of the emotion terms used in table 6.1.
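As a worked example of formula 6.1, the sketch below computes the F0 median of angry-sounding female speech from the regression constants in table 6.2 and the dimension values in table 6.1. It is only an illustration of the calculation, not the implementation used later in this thesis.

```python
# Minimal sketch of formula 6.1: acoustic value = constant + sum over the three
# emotion dimensions of (regression coefficient * dimension value).
# Values below are copied from tables 6.1 and 6.2 (female speech).

# Dimension values (Activation, Evaluation, Power) from table 6.1
EMOTION_AEP = {
    "angry": (34.0, -35.6, -33.7),
    "happy": (17.3,  42.2,  12.5),
    "bored": (-6.8, -17.9, -55.3),
}

# Regression coefficients (constant, A, E, P) for the F0 median row of table 6.2
F0_MEDIAN = (200.1, 0.370, -0.0523, -0.190)   # Hz

def predict(coeffs, emotion):
    const, lrc_a, lrc_e, lrc_p = coeffs
    a, e, p = EMOTION_AEP[emotion]
    return const + lrc_a * a + lrc_e * e + lrc_p * p

print(round(predict(F0_MEDIAN, "angry"), 1))  # approximately 220.9 Hz
```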

6.2.1 The link with the Virtual Storyteller

Not all the researched emotions are present in the VST system. Currently, it uses a list of emotions derived from the OCC model [36]. The emotions used are listed in research done by Rensen on the Virtual Storyteller in [42]. The list of emotions is divided between positive and negative emotions; each pair consists of two opposing emotions:

• "Happy for" vs "Pity"
• "Gloating" vs "Resentment"
• "Hope (for self & others)" vs "Fear (for self & others)"
• "Joy" vs "Distress"
• "Pride" vs "Shame"
• "Admiration" vs "Reproach"
• "Love" vs "Hate"

These emotions were chosen with regard to agent decisions on how to act. The use of emotions in the narration of the story is new, and not all the emotions used in the VST system occur as often as, for example, the big six (Anger, Disgust, Fear, Joy, Sadness, Surprise). The reverse is also true: not all of the frequently used emotions are present in the VST system; for example, sadness and surprise are missing. Because most of the emotion terms used in the VST system have not been researched with respect to TTS prosody rules, no model yet exists to morph a neutral voice into an emotional voice for those emotions. In order to create a model that produces the best expected results, the choice was made to use the prosody rules in table 5.16 for the emotions that have them, and to use the transformation based on Schröder's formula and the acoustic correlation data in table 6.2 for all the other emotion terms, as far as they are present in table 6.1. Emotion terms that do not have an existing model, or that are not in table 6.1, can not be synthesised until the relation between the emotion word and the values on the activation, evaluation and power dimensions can be determined.

6.3 The new model

The model now uses two methods of changing certain prosodic attributes of a TTS voice based on the input emotions:

• If an emotion category has a prosody rule-set, that rule-set is used.
• Otherwise, the emotion category is translated into the dimensional model, and the three dimensions are then used to determine which prosodic attributes need to be altered and by how much.

This decision procedure is presented in graphical form in figure 6.1 (a schematic sketch of the same logic is given after table 6.3). Table 6.3 contains the changes the model will apply to the F0, F0 Range, Speech Rate, and Intensity. The model contains other emotions (namely those of table 6.1), but the emotions in table 6.3 were the ones used in the evaluation experiment and are therefore shown explicitly here. The prosodic changes for the emotions joy, sadness, anger, fear, surprise and boredom were derived from table 5.16; the prosodic changes for the emotion love were calculated using Schröder's data.

Emotion    F0        F0 Range   Speech Rate   Intensity
Joy        +50%      +100%      +30%
Sadness    -25%      -25%       -50%
Anger      +10Hz     +9st       +100%
Fear       +150%     +20%       +20%
Surprise   +80%      +40%       +25%
Boredom    -4st      67%
Love       -1.78Hz   -3.30Hz    -2.15%

Table 6.3: The model’s changes on prosodic parameters
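The selection between the two methods amounts to a simple dispatch, summarised in the sketch below. It is a schematic rendering of figure 6.1 only: the helper functions are placeholders, and the table excerpts merely stand in for tables 5.16/6.3 and 6.1.

```python
# Schematic sketch of the model's dispatch logic (figure 6.1); the helpers are
# placeholders, not the actual implementation.
RULE_SET_EMOTIONS = {"joy", "sadness", "anger", "fear", "surprise", "boredom"}
EMOTION_AEP = {"love": (1.2, 33.3, 14.9), "worried": (4.6, -26.3, -62.3)}  # excerpt of table 6.1

def apply_rule_set(emotion):
    # Placeholder: would return the percentage/Hz/semitone changes of table 6.3.
    return {"method": "rule-set", "emotion": emotion}

def dimensional_model(activation, evaluation, power):
    # Placeholder: would evaluate formula 6.1 against the coefficients of table 6.2.
    return {"method": "dimensional", "aep": (activation, evaluation, power)}

def prosody_changes(emotion):
    if emotion in RULE_SET_EMOTIONS:
        return apply_rule_set(emotion)
    if emotion in EMOTION_AEP:
        return dimensional_model(*EMOTION_AEP[emotion])
    raise ValueError("No prosody model available for '%s'" % emotion)
```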

6.4 Conclusion

The new approach uses a different emotion description framework, namely three scales per emotion: activation, evaluation and power. Each of these scales has a range of -100 to 100 and is used in conjunction with the acoustic analysis of emotional speech performed by Schröder, in which the values on these scales are linked to various prosodic (acoustic) attributes. The application Feeltrace [10] can be used to determine the activation and evaluation values of the emotion in audio fragments. When someone also annotates the same audio fragments with emotional terms, the emotional term becomes linked to the activation and evaluation values assigned to the fragment using Feeltrace. This provides a good method for adding other emotional terms to the list of emotional terms supported by model B. However, the origin of the values on the power scale provided by Schröder in table 6.1 is still unclear, preventing the addition of power values for other emotional terms.

Model B uses existing - and tested - prosody rules to morph the neutral (default) TTS voice into an emotional one for the emotions joy, sadness, anger, fear, surprise and boredom. These prosody rules have been tested by the researchers who created them, and scored an acceptable recognition rate in perception tests, as shown in table 5.16. The emotions that do not have existing prosody rules are morphed using the three-dimensional EDF and Schröder's acoustic correlates in tables 6.1 and 6.2. The process that model B follows is shown in figure 6.1.

Figure 6.1: Flowchart of the model

Chapter 7

Implementation

In this chapter, the process of implementing the model of the previous chapter in the TTS engine Festival is detailed. The prerequisites that are not yet met by the Festival engine are also discussed, as well as what is required to fulfil them. Finally, the implementation in Festival-NeXTeNS of the prior work by Koen Meijs [27] is also covered in this chapter.

7.1 Narrative model

Koen Meijs analysed the narrative speaking style in [27] by comparing the use of prosody between a storyteller and someone reading the news. This analysis resulted in the following items:

• A narrative speaking style. This focuses on the enhancement of accented syllables of key words in sentences. These accents are annotated in the input with tags. Accents are enhanced by adding a 'bump' to the speaker's normal F0. This bump is defined by the formula below (a small code sketch of this bump follows after this list):

Target Pitch = Current Pitch * (1 + sin(π * t) / (Average Pitch / Target Pitch Increase))    (7.1)

with t = 0.25..0.75 and all pitch variables in Hz. The Target Pitch Increase variable has been set by Meijs to 30 Hz. The intensity of accents is increased by 2 dB and the duration of the vowels of the accented syllable is multiplied by a factor 1.5. The global speech rate for the narrative speaking style was determined to be 3.6 syllables per second.

• The sudden climax is the first of two climax types specified in Meijs' research. The sudden climax changes the pitch, intensity and the duration of a part of a certain utterance. The start and end of the sudden climax have to be annotated in the input. The pitch is increased in a linear fashion, ending 80 Hz higher at the end of the climax. After the sudden climax, the next syllable starts at the same frequency as the climax ended with. The intensity of the speech gets a 6 dB boost starting at the beginning of the sudden climax and is decreased back to normal during the course of the climax. The speech rate remains unchanged.

• The second climax type is the increasing climax. This climax is an extended version of the sudden climax. It has a starting and ending position just like the sudden climax, but also a climax top indicator. The pitch of the increasing climax gets a boost of 25 Hz at the start, after which a pitch 'bump', just like the narrative speaking style accent, of maximally 35 Hz is added to the pitch between the start and the end of the climax, with the maximum lying at the designated top of the climax. The intensity remains unchanged during an increasing climax. The duration of accented vowels of words that have sentence accent is multiplied by a factor that is gradually increased from 1 at the start of the climax to 1.5 at the top. At the top of the climax, a pause of 1.04 s is inserted. After the top, the duration of accented vowels is gradually reduced back to normal.


The values in this list are taken from [27] section 7.4.
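To make the accent 'bump' of formula 7.1 concrete, the sketch below evaluates it for a few points in the middle half of an accented syllable (t = 0.25..0.75), using the 30 Hz target pitch increase reported above. It illustrates the formula only; it is not the Festival/NeXTeNS code.

```python
import math

def accent_bump(current_pitch_hz, average_pitch_hz, t, target_increase_hz=30.0):
    """Formula 7.1: pitch 'bump' on an accented syllable, with t in [0.25, 0.75]."""
    return current_pitch_hz * (
        1.0 + math.sin(math.pi * t) / (average_pitch_hz / target_increase_hz)
    )

# Example: a 200 Hz syllable with a 200 Hz average pitch peaks at t = 0.5.
for t in (0.25, 0.5, 0.75):
    print(t, round(accent_bump(200.0, 200.0, t), 1))
# 0.25 -> ~221.2 Hz, 0.5 -> 230.0 Hz, 0.75 -> ~221.2 Hz
```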

7.2 Input format

The current narrator implementation requires that its input is formatted using SSML. In the implementation created by Meijs, this markup language has been extended with certain tags to allow the indication of the aforementioned narrative speaking style and climax types. In my opinion, changing SSML this way is not the proper approach. An intermediary XML format can indicate these styles just as well, and this intermediary XML can then be translated into SSML. During this translation, the changes in pitch, speech rate and intensity dictated by Meijs' narration rules can be translated into properly placed tags that are already part of the SSML specification. This leads to the data flow depicted in figure 7.1.

Figure 7.1: System data flowchart

The SSML specification is still limited with respect to other prosodic attributes such as proximity (to the speaker), shouting indicators, breathiness, creakiness, articulation, ToBI, attitudes and emotions. There is an extension to SSML, which I will refer to as e-SSML in this thesis, which attempts to resolve this limitation in [14]. However, the emotion attribute suggested in [14] is limited to emotion terms, which is not compatible with the emotion dimension framework adopted in model B. To this end, the emotion attribute is included in the intermediary XML format, and is then automatically translated into the appropriate changes in pitch, speech rate and intensity that SSML already supports. The intermediary XML format supports the following tags (closing tags are not listed):

• with attribute "extend", which either has the value "yes" or "no". The "extend" attribute signals whether the duration multiplier needs to be applied or not. The sentence accent tag is placed around a single word.1
• with the attribute "type", which either has the value "immediate" or "increasing". The "immediate" value signifies a sudden climax, the "increasing" value signifies an increasing climax. The climax tag is placed around one or more words (in case of the immediate type) or parts of one sentence (in case of the increasing type).
• In case of an increasing climax type, the empty tag signals that the top of the climax has been reached, and the F0 contour (which has been rising until then) can decrease again.2
• with attributes "character" and "emotion". The value of the character attribute is the name of the character uttering the quote. The value of the emotion attribute is a literal string naming the emotion (in English).

The language generation module in the VST will need to add Meijs' narrative tags, as well as the quote tags, to the plain text of the story. At the moment this is not implemented yet; the language generation module produces plain text only. The XML data will then be translated into e-SSML, where the sentence accents, climaxes and quotes will be replaced by prosody tags with the appropriate attributes. This translation from XML to e-SSML is done by an application hereafter referred to as the "pre-processor":

1This differs from the implementation by Meijs, which specified the tag to be placed around a single syllable. Because placing a tag inside a word breaks the Festival intonation algorithm, this has been changed.
2This differs from Meijs, who put the tag in front of the word that was at the top of the climax; I feel that putting it right after the top is a more logical position because it signals the start of the pitch decline.

• A sentence accent tag will be replaced by a prosody tag with contour and rate attributes.
• An increasing climax tag will be replaced by prosody tags with rate, volume and pitch attributes. Because the speech rate changes over time, separate tags are required.
• An immediate climax tag will be replaced by prosody tags with pitch and volume attributes.
• For quote tags, there are two possible methods, because the e-SSML specification allows an "emotion" attribute in a prosody tag. This provides the option of simply passing the desired emotion term on to the TTS system, instead of converting the emotion term into a set of appropriate prosody tags. The tables used in transforming an emotionally neutral utterance into an utterance with a different emotion - containing the values to be used in the prosody attributes per emotion - are present either in the TTS environment or in the pre-processor. If the tables are present in the TTS environment, the quote tags can simply be replaced by prosody tags with only an emotion attribute, leaving the actual work to the TTS system. If the tables are present in the pre-processor, the quote tags will be replaced by prosody tags with pitch, rate, breathiness, creakiness, articulation, volume and contour attributes, configured using the emotion translation tables.

It is easier to put the emotion translation tables in the pre-processor, because this requires less tampering with the TTS environment. The only requirement for the TTS system is then that it can parse e-SSML and act upon it. The tables can be modified without requiring the user to have advanced knowledge of the inner workings of the TTS system. It also leaves open the option of using other methods of describing the target emotion than emotion terms: if future versions only use the emotion dimension framework, then only the pre-processor needs to be changed, instead of the (much larger) TTS system.

As an example, a VST XML file is shown below, as well as the translated SSML file (a schematic sketch of the quote translation is given after these listings). They contain a short example story in Dutch:

Er was eens een verhaal. Dit was niet zomaar een verhaal, nee, het was een heel spannend verhaal. Jantje liep in het bos. Het was een groot donker bos en in de verte kon hij een wolf horen huilen. De wind waaide door de boomtoppen die hierdoor woest heen en weer zwiepten. Misschien was dit toch niet zo'n goed idee. zei hij tegen zichzelf. Jantje begon naar de rand van het bos te lopen maar toen sloeg de bliksem plotseling vlakbij in..

Example story in VST XML

Er was eens een verhaal. Dit was niet zomaar een verhaal, nee, het was een heel spannend verhaal. Jantje liep in het bos. Het was een groot donker bos en in de verte kon hij een wolf horen huilen. De wind waaide door de boomtoppen die hierdoor woest heen en weer zwiepten. Misschien was dit toch niet zo'n goed idee. zei hij tegen zichzelf. Jantje begon naar de rand van het bos te lopen maar toen sloeg de bliksem plotseling vlakbij in..

Example story in SSML
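To make the translation of a quote tag concrete, the sketch below illustrates the pre-processor's handling of a quote in schematic form: an element carrying an emotion attribute is replaced by an SSML prosody element whose attributes are filled from the emotion translation table. The element name "quote", the table excerpt and the choice of the emotion value are assumptions made for the sake of the example, not the actual implementation.

```python
# Hedged sketch of the pre-processor's quote handling. The intermediary XML
# element name ("quote") is hypothetical; the prosody values are an excerpt of
# table 6.3, and the example emotion is chosen arbitrarily.
import xml.etree.ElementTree as ET

EMOTION_TABLE = {
    "fear": {"pitch": "+150%", "range": "+20%",  "rate": "+20%"},
    "joy":  {"pitch": "+50%",  "range": "+100%", "rate": "+30%"},
}

def quote_to_prosody(quote):
    """Replace a hypothetical quote element by an SSML prosody element."""
    settings = EMOTION_TABLE.get(quote.get("emotion", ""), {})
    prosody = ET.Element("prosody", attrib=settings)
    prosody.text = quote.text
    return prosody

quote = ET.fromstring('<quote character="Jantje" emotion="fear">'
                      "Misschien was dit toch niet zo'n goed idee.</quote>")
print(ET.tostring(quote_to_prosody(quote), encoding="unicode"))
```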

7.3 Additional Constraints

Because the TTS front-end of Festival is designed for the English language, NeXTeNS uses its own modules for accent placement, intonation and duration modification, etc. As such, the code for applying the narrative speaking style and emotions has to be interfaced with NeXTeNS' modules instead of those of Festival. The e-SSML specification does not allow speech rate changes to be defined in terms of vowel duration and pause duration, which model B needs for the prosody changes based on Schröder's work: Schröder has correlated pause duration and 'tune' (vowel) duration with the three dimensions of the EDF. Updating the e-SSML specification with attributes to support pause and vowel duration, as well as implementing it in NeXTeNS, was estimated to take too much time away from the main implementation. As such, the pause and vowel duration specification and implementation have been left as future work.

7.4 SSML parser implementation details

Festival uses tree structures to store data while processing the text. At first, only the input text is stored, but as the various modules (tokenisation, accent placement, intonation, duration, etc.) do their jobs, their input comes from the same tree structures and their output is saved in the tree structures as well. When parsing the SSML document, the tags are also stored in these tree structures (in their own tree), complete with all attributes and values. These SSML tags are then linked to the words, and thereby to all the data that is related to the words, in the tree structures. This way, whenever a module is done with its work, for example the intonation module, the SSML tags can be applied by scanning the tree of SSML tags for tags with intonation modifications and applying those tags to the words they are linked to. To make this happen, every module whose outcome can be influenced by an SSML tag calls a specific function that applies the SSML tags, after the module has performed its own task first (a schematic sketch of this pattern is given below). In the end, the input text is also present in the tree structures as a list of segments, each with a start time and a duration, as well as zero or more pitch targets per segment. These segments are then supplied to the MBROLA program, which selects the proper diphones, modifies the pitch to match the pitch information that came with the segments, and eventually produces audio data with the synthetic speech.

The Maintainer manual, which can be found in appendix B, goes into greater detail about the implementation. It contains instructions on how to install Festival and NeXTeNS, and a list of the functions that are called when Festival has to parse SSML data. It also lists the procedures defined, together with a short description of what they do.
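The 'module first, SSML overrides afterwards' pattern described above can be sketched schematically as follows. This is written in Python for readability only; the actual Festival/NeXTeNS code is Scheme and C++, and the data layout and function names here are invented.

```python
# Schematic sketch (not Festival code) of the pattern described above: each
# module first does its normal work on the utterance data, after which a shared
# routine applies any linked SSML tags that override that module's output.

def apply_ssml_overrides(utterance, stage):
    """Overwrite values produced by `stage` with values from linked SSML tags."""
    for word in utterance["words"]:
        for tag in word.get("ssml_tags", []):
            if stage in tag["affects"]:
                word[stage].update(tag["overrides"])

def run_pipeline(utterance, modules):
    # modules: list of (stage_name, module_function) pairs, e.g. accent
    # placement, intonation and duration modules.
    for stage, module in modules:
        module(utterance)                       # module performs its own task first
        apply_ssml_overrides(utterance, stage)  # then SSML tags win over defaults
    return utterance
```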

7.5 Festival web interface

In order to facilitate the use of the implemented Festival SSML support, a number of web pages were made using PHP3 and MySQL4. PHP is a scripting language especially suited for web development, and MySQL is an SQL database server. The web pages allow the user to enter a single utterance and four prosodic attributes: F0, F0 range, speech rate and volume. There is also a combo-box with the seven emotional terms used (Happiness, Sadness, Anger, Fear, Surprise, Boredom and Love) which automatically fills the prosodic attribute fields with the proper values. With a push of the button, the entered utterance, together with the prosodic attributes, is encoded into SSML and sent to the Festival TTS server. After the synthesis process is complete, the returned speech is saved on the computer where the web-server is running. This is done because the TTS process and the transfer of the speech data can take five seconds or more; if the user wants to listen to the same text more than once, the system plays the speech stored on the web-server instead of having to run the TTS process each time (a small sketch of this caching behaviour is given below). Each utterance for which speech data is saved on the web-server is listed in a section on the right side of the web interface. The user can play or delete a selected file, delete all the files, and refresh the list.

In order to separate the speech files of different users, a small login function has been implemented. A user is identified by a username and the IP-address from which the web-server detects the connection. This login function is not meant as a security measure, but simply to avoid user A adding to, or deleting from, user B's speech files.

Should anything go wrong during the TTS process, for example when the text parser encounters syllables which can not be synthesised using the Dutch language (the default language setting), Festival will not return an error code indicating what went wrong. This results in only very crude error detection ("Something went wrong, check everything and try again."). Future work on the Festival server mode is required in order to provide detailed error messages in case something goes wrong. A small section at the bottom of the web interface is reserved for action logging and error notification; if anything goes wrong, it will be shown there.

The interface is easily modifiable, making it easy to change the user input from a single utterance with prosodic attributes to, for example, a user-supplied SSML file. This file can then be sent directly to the Festival server. The resulting audio is still saved on the computer where the web-server is running. A screenshot of the main interface is shown in figure 7.2. The source code of the Festival web interface has been included in the digital appendix.
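The caching behaviour described above can be sketched as follows. The real interface is written in PHP; this Python sketch only illustrates the idea, and synthesise_with_festival is a placeholder for the request to the Festival server.

```python
# Illustrative sketch of the web interface's caching behaviour (the real
# implementation is PHP). synthesise_with_festival() is a placeholder for the
# request sent to the Festival TTS server.
import hashlib
import os

CACHE_DIR = "speech_cache"

def get_speech(utterance, prosody, synthesise_with_festival):
    """Return a path to a wav file, synthesising only if it is not cached yet."""
    key = hashlib.sha1(repr((utterance, sorted(prosody.items()))).encode("utf-8")).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".wav")
    if not os.path.exists(path):                 # only run the (slow) TTS process once
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(path, "wb") as f:
            f.write(synthesise_with_festival(utterance, prosody))
    return path
```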

3http://www.php.net
4http://www.mysql.org

Figure 7.2: Screenshot of the main Festival web interface

Chapter 8

Evaluation experiment

In this chapter, the design and execution of another experiment is presented. The goal of this experiment is to show whether the model created in chapter 5 produces emotional utterances in which the emotion is recognisable. The models used to morph neutral speech into emotional speech were tested with good results by their respective creators, but it is unknown whether those models still work as well in the current application. The test subjects are presented with fragments based on sentences that have no distinct emotional semantic aspect; the meaning of the sentences does not point towards a single emotion. The subjects are then asked to verify whether the emotional content can be identified and, if not, why not. The results of this experiment can be used to update the models with different parameters in order to move towards a more easily recognised expression of emotion. Given that the models used were not all constructed from data obtained from Dutch speakers, it is not clear whether these models are just as effective when applied to the Dutch language. Also, the models that NeXTeNS uses to implement Dutch intonation contours, for example, might not produce the expected results when combined with an emotion-generating model. In order to find out whether, and to what degree, the models result in identifiable emotional utterances, the evaluation test was created.

The chapter is set up as follows. First, the material presented to the test subjects is discussed in section 8.1. Because the environment and participant constraints are the same as for the emotion perception experiment described in chapter 4 (sections 4.1 and 4.3), they are not duplicated here. The interface of the test is presented in section 8.2; this section details the type of question asked and the type of answers a test subject could submit. Finally, the experiment results are presented and discussed in section 8.3.

8.1 Material

The goal of this test is to verify whether the model has applied the appropriate prosodic changes to a sentence to allow people to recognise the intended emotion. To this end, the speech fragments presented to the test subject must not contain any semantic indication of the emotion used. In other words, the emotion should be recognised from the prosodic aspect of the fragment, not from the semantics or the syntax of the presented sentence. Thus the sentences have to be either emotionally neutral or emotionally ambiguous, forcing the test subject to perceive the emotion from the prosodic aspect of the fragment. To this end, the following emotionally neutral/ambiguous sentences have been used:

• "Waar heb je dat vandaan gehaald?" (Where did you get that from?)
• "Dat had je niet moeten doen." (You shouldn't have done that)
• "Morgen komt mijn schoonfamilie op bezoek." (Tomorrow my in-laws will come visit)

Each of the above sentences was morphed to each of the implemented emotions: sadness, surprise, anger, fear, love, joy, and boredom. Additionally, one emotionally biased sentence was used

for each of the emotions the model implements. Finding a sentence biased towards sadness turned out to be harder than initially expected, because almost all candidate sentences could also be used to express anger. Because of this, the sentence used for anger and sadness is the same:

• Anger, sadness: "Dat had je best zelf kunnen verzinnen." (You could've come up with that on your own easily)
• Love: "Wat ben je toch een schat." (You're such a darling)
• Fear: "Ik hoop niet dat we hem kwijt zijn geraakt." (I hope we didn't lose him)
• Boredom: "Wanneer komt hier nou eens een eind aan." (When will this finally end)
• Surprise: "Maar dat had ik niet verwacht." (But I didn't expect that)
• Joy: "Deze test is bijna afgelopen." (This test is almost over)

These added sentences serve as a failsafe in case the model produces quite unexpected results. If the model produces prosody that does not agree with the intended emotion (which is reinforced by the semantics of the sentence), a negative result will be returned. On the other hand, if the produced prosody is even slightly compatible with the intended emotion, these sentences will have a very high recognition rate.

8.2 Interface

8.2.1 Introductory page

As with the presence experiment, the subject is first shown an introductory page which explains what will be presented to him and what will be asked of him. The introductory page also provides the subject with four example fragments of synthesised speech. This familiarises the subject with the quality and quirks of the synthetic voice, and also allows the subject to adjust the volume of his audio equipment to an acceptable level. Also, the number of fragments that will be presented during the experiment is given, in order to give the subject an idea of the approximate duration of the experiment.

8.2.2 Main page design

The question "What to present to the test subject?" is closely linked to the question "What to ask from the test subject?". The results of the presence experiment have shown that multiple-choice emotion identification produced results with a very low agreement factor. Because of this, I have decided that the evaluation test should use a different answering mechanism. Instead of asking the test subject to identify an emotion, the question is simplified to "Does this sound like (emotion) X?". Here the target emotion is explicitly provided to the test subject in order to avoid mixing two problems: incorrectly identified emotions and incorrectly rendered emotions. Because the focus here is to find out whether or not the model results in correctly rendered emotions, the first problem needs to be eliminated. This does introduce a bias towards the target emotion, which can result in more positive answers than would otherwise be the case, but in my opinion, if the target emotion had to be identified by the test subject, this experiment would suffer from the same problems that befell the presence experiment (see section 5.5).

In an attempt to get more usable feedback than a mere "yes" or "no", the test subject can indicate, if the fragment does not sound like emotion X, what is causing this. This is done by providing a six-point scale for each of the four prosodic aspects used in imitating the emotion: Pitch, Pitch Variance, Speech Rate, and Volume. The six-point scale consists of the ratings "way too low", "too low", "ok", "too high", "way too high", and "don't know" for Pitch, Speech Rate and Intensity, and the ratings "way too little", "too little", "ok", "too much", "way too much", and "don't know" for Pitch Range. The default setting for the six-point scales is "don't know". These feedback indicators will help in pointing out the most obvious flaws in the model. There is also a comment field, where people can leave other comments about the fragments that have not been addressed yet.

So now the question of "What to ask" has been answered, leaving us with "What to present". Here a few options present themselves:

• Present a single emotional utterance.
• Present a neutral and the emotional variant of an utterance.
• Present multiple emotional utterances.

When presenting a single utterance, the test subject can only judge based on the currently presented utterance. This defeats the purpose of this experiment, which is to confirm whether the model transforms neutral utterances properly into emotional utterances; that requires the subject to know how the neutral utterance sounds as well as the emotional one. Also, because not many people are used to a synthetic voice, the quality (or lack thereof) and quirks that come with a TTS system will likely interfere with the perception of the emotion. This disqualifies the option of presenting single utterances. It also makes the option of comparing a neutral with an emotional variant preferable, because here the test subject can first familiarise himself with the neutral variant, and thus with the synthetic voice, and then compare it to the emotional variant of the utterance.

When comparing two (or more) emotional utterances, the subject is actually judging two (or more) utterances at the same time, which is almost the same as judging two single utterances in succession. On top of that, if there are any flaws in the model that make one fragment sound "weird", the subject's concentration will be influenced, which in turn will influence the judgement of the other utterance(s). Besides, it does not make much sense to judge whether a happy-sounding utterance sounds happier than an angry utterance sounds angry: we do not want to compare two different emotion emulations, we want to verify whether each single emotion is emulated properly.

Therefore the most logical option is to present the test subject with a neutral and an emotional variant of the same utterance. The resulting interface is shown in figure 8.1.

Figure 8.1: Screenshot of the main interface

8.3 Results

The experiment was performed by 19 people, two of whom did not finish the entire experiment. The partial results of those two people have been included. The raw results are presented in tables 8.1 and 8.2. Just as with the presence test, Krippendorff's alpha value was calculated in order to determine the amount of agreement amongst the people who performed the experiment. This resulted in an alpha value of 0.256, which shows that overall, the participants did not agree all that much.
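For reference, agreement values like this can be computed with, for example, the third-party krippendorff Python package; whether that particular tool was used for this thesis is an assumption, and the ratings matrix below is made up rather than taken from the experiment data.

```python
# Sketch only: Krippendorff's alpha for nominal ratings, using the third-party
# 'krippendorff' package (pip install krippendorff). The matrix is invented:
# rows are raters, columns are fragments, NaN marks a missing judgement.
import numpy as np
import krippendorff

ratings = np.array([
    [1, 1, 0, 1, np.nan],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1],
])
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(round(alpha, 3))
```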

Fragment | Emotion | Biased | #Yes | #No | F0 opinions: −− − ok + ++ ?
 9 | Fear     | Yes | 17 |  0 | 0 0 0 0 0 0
10 | Sadness  | Yes | 11 |  6 | 0 1 3 0 0 2
26 | Joy      | Yes |  5 | 12 | 0 0 1 6 3 2
 4 | Anger    | Yes | 14 |  5 | 0 0 2 3 0 0
 7 | Love     | Yes |  1 | 16 | 0 3 6 0 0 7
16 | Surprise | Yes |  9 |  8 | 0 2 2 0 0 4
14 | Boredom  | Yes | 11 |  6 | 0 2 2 1 0 1
 6 | Fear     | No  | 14 |  4 | 0 0 2 0 1 1
12 | Fear     | No  | 11 |  6 | 0 1 0 2 0 3
17 | Fear     | No  |  6 | 11 | 0 0 2 5 0 4
 1 | Sadness  | No  |  9 | 10 | 1 2 3 3 0 1
 2 | Sadness  | No  | 10 |  9 | 0 1 6 1 0 1
 5 | Sadness  | No  |  9 | 10 | 0 2 6 0 0 2
11 | Joy      | No  |  1 | 16 | 0 0 2 11 3 0
20 | Joy      | No  |  3 | 14 | 0 0 0 5 8 1
23 | Joy      | No  |  2 | 15 | 0 0 2 3 7 3
22 | Anger    | No  | 12 |  5 | 0 0 3 1 0 1
25 | Anger    | No  | 12 |  5 | 0 1 2 2 0 0
27 | Anger    | No  |  5 | 12 | 0 1 2 4 3 2
13 | Love     | No  |  0 | 17 | 1 1 7 2 0 6
15 | Love     | No  |  1 | 16 | 1 2 1 0 0 12
24 | Love     | No  |  2 | 15 | 0 5 2 0 0 8
 3 | Surprise | No  |  7 | 12 | 0 1 6 3 0 2
 8 | Surprise | No  |  7 | 10 | 0 2 6 1 0 1
19 | Surprise | No  | 12 |  5 | 0 0 3 1 0 1
18 | Boredom  | No  | 13 |  4 | 0 0 3 0 0 1
21 | Boredom  | No  | 12 |  5 | 0 0 3 1 0 1
28 | Boredom  | No  | 12 |  5 | 0 0 4 0 0 1

Table 8.1: Raw results of the evaluation experiment, ordered by bias and emotion, part 1. #Yes and #No represent the number of people who judged the emotion to be recognisable or not. The opinions on the four prosodic attributes are split between this table and table 8.2. Opinions range from way too low (−−) to way too high (++)

Next, the results for the emotionally biased sentences (7 in total) were separated from the unbiased sentences (21 in total). Then the acceptance percentage for each emotion was calculated as the number of people who accepted the emotional variant of the sentence divided by the total number of judgements (for example, for the biased sadness sentence: 11 / (11 + 6) ≈ 65%). This is shown in tables 8.3 and 8.4 for the unbiased and biased sentences respectively.

It can fairly easily be concluded from the biased sentences (and confirmed from the unbiased sentences) that the prosody changes for the emotions love and happiness do not get accepted very well. The prosody changes for the emotions anger, fear and boredom were somewhat accepted.

Fragment | F0 Range: −− − ok + ++ ? | Speech Rate: −− − ok + ++ ? | Intensity: −− − ok + ++ ?
 9 | 0 0 0 0 0 0 | 0 0 0 0 0 0 | 0 0 0 0 0 0
10 | 2 3 1 0 0 0 | 0 1 2 1 0 2 | 0 2 3 0 0 1
26 | 0 0 2 8 1 1 | 0 3 4 0 0 5 | 0 1 6 0 0 5
 4 | 0 1 3 0 0 1 | 0 2 2 0 0 1 | 0 1 2 1 0 1
 7 | 0 1 4 4 0 7 | 3 3 3 0 0 7 | 0 0 8 1 0 7
16 | 0 0 4 1 0 3 | 0 5 1 0 0 2 | 0 1 4 0 0 3
14 | 0 2 3 0 0 1 | 0 3 1 1 0 1 | 0 0 5 0 0 1
 6 | 0 1 1 1 0 1 | 0 0 1 1 0 2 | 0 0 3 0 0 1
12 | 0 0 2 2 0 2 | 0 1 1 1 0 3 | 0 0 4 0 0 2
17 | 0 0 2 4 1 4 | 0 4 1 2 0 4 | 0 0 5 1 0 5
 1 | 1 3 4 1 0 1 | 0 3 3 2 1 1 | 1 2 5 1 0 1
 2 | 0 2 5 1 0 1 | 1 1 2 3 0 2 | 0 0 8 0 0 1
 5 | 1 2 3 2 0 2 | 0 3 2 3 0 2 | 0 2 6 1 0 1
11 | 0 0 10 6 0 0 | 0 6 6 1 0 3 | 0 3 12 0 0 1
20 | 0 0 4 8 0 2 | 0 2 7 2 0 3 | 0 3 8 0 0 3
23 | 0 0 2 9 1 3 | 0 2 4 2 0 7 | 0 4 5 0 0 6
22 | 0 2 1 0 0 2 | 0 2 0 0 0 3 | 0 0 4 0 0 1
25 | 0 1 1 1 0 2 | 0 2 0 1 0 2 | 0 3 1 0 0 1
27 | 0 0 8 0 0 4 | 0 1 2 3 0 6 | 0 2 6 0 0 4
13 | 0 1 8 3 0 5 | 2 4 1 3 1 6 | 0 1 9 2 0 5
15 | 0 1 3 0 0 12 | 1 3 1 1 0 10 | 0 0 3 2 0 11
24 | 0 1 5 0 0 9 | 1 3 2 1 0 8 | 0 0 5 1 0 9
 3 | 0 0 4 6 2 0 | 0 4 5 1 0 2 | 0 0 9 0 0 3
 8 | 0 1 5 2 1 1 | 0 4 4 1 0 1 | 0 2 6 0 0 2
19 | 0 0 3 1 0 1 | 0 2 2 0 0 1 | 0 0 4 0 0 1
18 | 0 1 2 0 0 1 | 0 1 2 0 0 1 | 0 0 2 1 0 1
21 | 3 1 1 0 0 0 | 0 2 2 0 0 1 | 0 0 2 2 0 1
28 | 1 3 1 0 0 0 | 0 1 3 0 0 1 | 0 0 3 1 0 1

Table 8.2: Raw results of the evaluation experiment ordered by Bias and Emotion, part 2. The opinions on the four prosodic attributes are split between this table and table 8.1. Opinions range from way too low (−−) to way too high (++)

Emotion      sadness  surprise  anger  fear  love  happiness  boredom
Percentage   49%      49%       57%    60%   6%    12%        73%

Table 8.3: Average acceptance percentages of unbiased emotional sentences

Emotion      sadness  surprise  anger  fear  love  happiness  boredom
Percentage   65%      53%       74%    100%  6%    29%        65%

Table 8.4: Acceptance percentages of biased emotional sentences

Those for the emotions sadness and surprise were judged fifty-fifty. The values for the biased sentences are all slightly higher (except love), which was to be expected because the semantics of the sentence pointed to the emotion, giving the test subject a nudge in the 'right' direction.

When looking at the feedback given about the nature of the non-acceptance of the sentences, as represented in tables 8.5 and 8.6 for the unbiased and biased sentences respectively and in figures 8.2 to 8.5, several things become clear:

• Most obviously, the F0 and F0 Range for happiness were judged too high.
• The F0 of the emotions sadness, surprise and boredom was found acceptable.
• The intensity of all the emotions except love was found acceptable.
• Almost all people did not know what to think of the prosody of the sentences with the emotion love, except that it was wrong.
• People generally did not agree on the speech rate of the sentences.

It should be noted that when people accepted the emotional prosody, they did not get the opportunity to provide feedback other than the general comments. This was done under the assumption that if people agreed with the prosody, there would be no need to change any parameters.

              Emotion:      sadness  surprise  anger  fear  love  happiness  boredom
F0            OK            52%      56%       32%    19%   21%   9%         71%
              Don't Know    14%      15%       14%    38%   54%   9%         21%
F0 Range      OK            41%      44%       45%    24%   33%   36%        29%
              Don't Know    14%      7%        36%    33%   54%   11%        7%
Speech Rate   OK            24%      41%       9%     14%   8%    38%        50%
              Don't Know    17%      15%       50%    43%   50%   29%        21%
Intensity     OK            66%      70%       50%    57%   35%   56%        50%
              Don't Know    10%      22%       27%    38%   52%   22%        21%

Table 8.5: Feedback on the prosodic attributes of the unbiased sentences.

              Emotion:      sadness  surprise  anger  fear  love  happiness  boredom
F0            OK            50%      25%       40%    N/A   38%   8%         33%
              Don't Know    33%      50%       0%     N/A   44%   17%        17%
F0 Range      OK            17%      50%       60%    N/A   25%   17%        50%
              Don't Know    0%       38%       20%    N/A   44%   8%         17%
Speech Rate   OK            33%      13%       40%    N/A   19%   33%        17%
              Don't Know    33%      25%       20%    N/A   44%   42%        17%
Intensity     OK            50%      50%       40%    N/A   50%   50%        83%
              Don't Know    17%      38%       20%    N/A   44%   42%        17%

Table 8.6: Feedback on the prosodic attributes of biased sentences

Lastly, there are the textual comments that people entered. Because of the amount of text, the comments have been copied without alteration to tables A.26 through A.53 in the appendix. Summarising those comments, the aspect that most people had problems with was the word emphasis, which they found to conflict with what they would expect. Further comments, excluding comments about the four prosodic parameters that were changed, were about:

• more speech-rate changes within a sentence (sadness, surprise, love, happiness, boredom)
• pause durations (sadness, fear)
• alternative F0 contour suggestions (sadness, surprise, fear, love, boredom). For example, the F0 contour at the end of the surprised sentence "Dat had je niet moeten doen." went downwards, while two people explicitly mentioned that they expected it to go up, because that is what they expect from surprised utterances.
• change of articulation (sadness, love). People expected the articulation to suffer because of the experienced emotion.
• addition of non-speech elements like breathing or sighing (surprise, boredom)
• more variation in intensity, especially related to stressed words (anger)
• whispering (love)
• choice of words (love)

Stress on the wrong words, stress on the wrong syllables, too much stress, and too little stress were reported for all emotions, making it obvious that this is a big influence on how people recognise emotion. It was also mentioned that a misplaced stress greatly reduces the recognition rate, because it sounds so unnatural that it distracts people from the task, or because the misplaced stress causes the perceived emotion to differ from the intended emotion. For example, the sad-intended sentence "Dat had je niet moeten doen." had stress on the word "niet", which caused people to perceive it as angry instead.

Figure 8.2: Average feedback on the pitch of the unbiased sentences

8.4 Discussion

It would seem that, as with the earlier presence test, the perception of emotion is highly subjective and varies per person. This can be seen from the low Krippendorff alpha value of 0.256 (on a scale from 0 to 1), which indicates that most people might just as well have been guessing randomly throughout the experiment. As the emotion recognition rates in table 8.7 show, boredom was recognised best, followed by fear and anger. The emotional variants of the sad and surprised sentences were rejected by half the people, and love and happiness were almost never accepted.

Emotion      sadness  surprise  anger  fear  love  happiness  boredom
Percentage   0.49     0.49      0.57   0.60  0.06  0.12       0.73

Table 8.7: Acceptance percentages of unbiased emotional sentences

The high acceptance rate for boredom could come from the fact that the model used was created by Mozziconacci [30] and used the Dutch language. The other models were originally used for languages other than Dutch, so it is possible that the configurations of the prosodic aspects require fine-tuning when they are applied to another language. The low acceptance rate for love is definitely caused by the strangely small prosodic parameter changes derived by the model. This caused the neutral and emotional fragments to sound almost alike, and as a result people did not find the emotional fragment more loving-sounding than the neutral fragment. The low acceptance rate for happiness is caused by the extreme pitch that the model produced: +50% F0 and +100% F0 Range. This is amplified by the way in which the downstep contour model [2] uses the F0 Range to determine the global F0 contour: it causes the F0 at the start of the sentence to rise by 50% (F0) plus 130% (F0 Range implementation), making the total F0 extremely high. The other rejections were mainly caused by problems people had with the accent placement. This is apparently a very important issue and will need to be addressed. The speech-rate changes (vowel duration and pause duration) that people also missed were present in the model by Schröder, but were not implemented yet because SSML does not include these changes in its specification.

Figure 8.3: Average feedback on the pitch variance of the unbiased sentences

8.5 Conclusion

With the exception of love and happiness, it would seem that with small improvements the emotions will all be recognised above chance level, given that they currently are at or above chance level already. Despite the fact that a lot of work remains to be done, these initial results look promising enough to be used in the VST system. The downstep model will have to be analysed in order to fix the F0 Range issue that caused the extremely low acceptance rate for happiness; if this is fixed, the acceptance rate will likely go up to at least chance level. The speech rate will have to be made better adjustable. This implies that the current use of SSML will have to be extended to a markup language that includes methods to adjust the vowel and pause duration at will. In any case, the feedback obtained from the comments opens up a lot of possibilities for future work.

Figure 8.4: Average feedback on the speech rate of the unbiased sentences

Figure 8.5: Average feedback on the volume of the unbiased sentences

Chapter 9

Conclusions and Recommendations

9.1 Summary

When this project started, the original goals were to improve the quality of the Virtual Storyteller [54] by analysing speech from a generally appreciated source of stories, and to implement the existing improvements created by Meijs [27] using an open-source Text-to-Speech engine. After a literature study, the first goal was narrowed down to improving the quality of sentences uttered by in-story characters by adding emotions. Even though the VST currently does not produce character dialogue, given the high degree to which such dialogue is present in existing fairy-tales, it is expected that this functionality will be added to the VST sooner or later.

After finding out through the literature that there are multiple conflicting terminologies in use for both prosody and emotions, the choice was made to use the findings of Schötz [45] for the definition of prosody, and the emotional terms determined by Parrott [37] as the method of describing emotions in this thesis. An analysis of available Text-to-Speech engines revealed that, given the two requirements of being open-source and having Dutch voice support, only the Festival Speech Synthesis System1 met these requirements; the Dutch voice support is called NeXTeNS2.

The next step in realising the goals was to analyse speech from fragments recorded by well-known and accepted storytellers. This was done using fairy-tales from the Lecturama Luistersprookjes. From these fairy-tales, fragments were extracted that exhibited various emotions, as well as neutral fragments, to be used as a baseline for the neutral-to-emotional morphing model. These fragments were then presented to a group of people in a blind test in order to be labelled with the emotion they were expressing. The emotions used for labelling were the primary emotions of the table by Parrott [37], as well as an 'other' option in which the participant could enter other emotional terms. The agreement of the people who participated in this test, expressed with Krippendorff's alpha value [25], was 0.284, which, on a scale of zero to one, is not much. This indicated that overall there was not all that much agreement amongst the people, as can also be seen from the average agreement percentages, ordered by emotion, in table 4.4. When comparing the perceived (and most agreed-upon) emotions with the emotions derived from the context of the story, 60% of them match. However, two of the fragments on which the agreement percentage was very high mismatched with the context. This does not invalidate the fragments for further analysis, specifically because it is the perception of the emotion that is under investigation. Emotions were perceived at an above-chance rate for all fragments, and all characters present in those fragments had at least one fragment labelled as neutral and at least one fragment labelled as emotional, so all the fragments were used in the audio analysis.

The fragments were analysed using Praat [39], and various acoustic parameters which according to the literature would be useful for this task were measured.

1http://www.cstr.ed.ac.uk/projects/festival/
2http://nextens.uvt.nl/

This data was then grouped per associated emotion in order to extract a neutral-to-emotion model. However, the number of fragments turned out to be too small to compensate for the wide variety in the acoustic measurements, and no model could be constructed from it. It was then decided to construct the neutral-to-emotion model using models from literature. For each of the emotions joy, sadness, anger, fear, surprise, and boredom there were existing models in the literature. Also, research by Schröder [46] pointed out that using emotional terms is quite ambiguous, and provided data and insight for an alternative method of describing emotions. This alternative method places emotions on a three-axis scale: Activation, Evaluation, and Power. This method and the accompanying data were also incorporated in the model, to create neutral-to-emotion transformations for emotions that are present in the VST but for which no existing model was available in the literature. Unfortunately the data is not complete, insofar as not all the emotional terms can yet be translated onto the three-axis scale. This means that not all the emotions present in the VST are present in the model.

In order to test the model, a link between the VST and Festival first had to be established. The VST can create an XML document in which the text to be rendered is annotated with the desired emotions and story-telling styles. Festival does not have SSML input support, so in order to be able to signal prosodic changes for pieces of text input, partial SSML support was implemented. Also, an application was designed and implemented that translates the VST XML into SSML, which is then used by Festival to produce speech.

The method used to test the model was quite straightforward. Given the earlier 'chaotic' results (considering the Krippendorff alpha value), it was decided that the answering options of the evaluation experiment should not be as open as they were in the perception test. Participants had to agree or disagree on whether a modified fragment did indeed 'show' the target emotion. They could also provide feedback on the settings of the prosodic parameters used in the model. The Krippendorff alpha value [25] of the results of this test turned out to be 0.256: despite the simplification of the answering method, the overall agreement was still quite low. However, the emotion acceptance percentages in table 8.4 showed some similarity to the average emotion agreement percentages from the presence test, with the exception of the emotion happiness (joy). The emotions that were recognised best were (in decreasing order): boredom, fear, anger, surprise and sadness; the last two had an acceptance percentage of 49%. The high recognition of the emotion boredom may be attributed to the fact that the model for boredom was originally created and tested using the Dutch language, whereas the other emotional models were originally created for different languages. Compared to the neutral fragments, the changed emotional fragments were mostly noticeably different, and five emotions were accepted at a reasonable rate. This does not mean that the model is finished, especially when looking at the large amount of feedback received from the evaluation test.
The model still requires a lot of additional work: tweaking the parameters, and further expanding the capabilities of Festival and NeXTeNS to support more alterations of prosodic parameters, such as manual accent placement and accent modification, and speech rate changes that go beyond a single fixed rate and include details like vowel and pause duration. In conclusion, the model used is a good step in the right direction, but a lot of additional work (see the next section) is required in order to create emotional-sounding speech that is accepted without doubt.

9.2 Recommendations

As is not uncommon in research, this project raises more questions than it answers. Concerning the story-telling act, a few things should be researched in order to make the VST perform more like real-life storytellers:

• Fairy tales use a lot of non-speech (but still vocal) sound effects such as laughter, giggling and coughing. However, the complexity of these sound effects (having only one ’giggle’ sample gets annoying very quickly) and the fact that the person on whom the TTS voice is based would have to supply the samples make the inclusion of non-speech vocal sound effects quite hard. Festival has the functionality to inject digital audio samples into the synthesised speech audio stream, so it is technically possible to use such non-speech sound effects, and research into this would be useful to further advance the quality of the VST.

• Storytellers emulate character voices by modifying their voice. This modification includes voice quality, and is therefore very hard to implement in today's TTS systems. More research into how storytellers change their voices, expressed in prosodic parameter changes, would be useful in order to enhance the VST with character dialogue. For example, big characters (giants, elephants) have a low and hollow voice, while small and fast characters usually have a high-pitched voice and talk faster. Alternatively, different TTS voices could be used to implement character voices; however, switching between voices currently produces an unacceptable delay while the new voice is activated, which is a problem when the system is used in a real-time story-telling performance.

Regarding the model, further research into the following items is recommended, in order to refine the model and complete the currently incomplete data used to convert emotions to prosodic parameters:

• The current model leans heavily on earlier research for other languages and uses incomplete data. The missing data can be obtained by performing an acoustic analysis of a large corpus of Dutch emotional speech, similar to what Schröder did with the Belfast database of spontaneous emotional speech. Model B uses Schröder's method of calculating the necessary prosodic parameter changes from the correlations between acoustic parameters and the activation, evaluation and power dimensions of the emotion dimensional framework. Obtaining these correlations from a Dutch emotional speech corpus with the same method, and using them in Model B, will quite likely sound more natural to Dutch listeners.

• The information used to translate emotional terms to the three-dimensional AEP scale is incomplete and based on English emotional speech. This information will need to be replaced by annotating Dutch emotional speech fragments with emotional terms as well as with the Feeltrace tool. Once all the emotion terms used in the VST can be translated into these AEP dimensions, the models that directly translate an emotional term into prosodic changes can be dropped in favour of the method that combines the AEP data with the analysis of Dutch emotional utterances to provide the necessary prosodic parameter changes.

The TTS system could also use some enhancement in the following areas:

• It became apparent that SSML is not expressive enough for custom syllable accent placement and accent parameter modification, nor for detailed speech rate modification. An extension to SSML must be created and implemented that can handle these detailed modifications. The feedback from the evaluation experiment made clear that accent placement is a major contributing factor towards the acceptance of emotional prosody, or a huge distraction when accents are placed incorrectly. NeXTeNS needs to be enhanced to allow manual accent placement and accent modification, which, according to comments in the source code, is currently broken.

• The emotional TTS application MARY uses a voice database in which the same diphones are recorded at multiple levels of vocal effort. This helps in changing the voice quality, something that is otherwise impossible when using TTS with a single diphone database. MARY currently supports English, German and Tibetan. This method should be investigated and implemented for Festival and Dutch voices as well.

• Various functions are used to change the pitch frequency; however, human perception of frequency changes is logarithmic [43], so pitch changes should be expressed in semitones (st) rather than in Hz. All future work should therefore use pitch changes in semitones (a small conversion sketch in code follows after this list). The conversion formula for a change of δHz at base frequency BaseHz is:

st = 12 · log2((BaseHz + δHz) / BaseHz)    (9.1)

This also helps when using different TTS voices, which do not always start at the same frequency. By using semitones, the pitch increase is always perceived in the same way, regardless of the starting frequency at which the change occurs.
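A minimal sketch of this conversion, assuming Python as the implementation language; it shows why the semitone unit is preferable: the same step in Hz corresponds to clearly different perceived intervals at different base frequencies.

import math

def hz_delta_to_semitones(base_hz, delta_hz):
    """Express a pitch change of delta_hz Hz, applied at base_hz, in semitones (eq. 9.1)."""
    return 12.0 * math.log2((base_hz + delta_hz) / base_hz)

# The same +50 Hz step is a much larger perceived change on a low voice than on a high one.
print(round(hz_delta_to_semitones(100.0, 50.0), 2))   # 7.02 st
print(round(hz_delta_to_semitones(300.0, 50.0), 2))   # 2.67 st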

Bibliography

[1] C.L. Bennett. Large Scale Evaluation of Corpus-based Synthesizers: Results and Lessons from the Blizzard Challenge 2005. In INTERSPEECH-2005, pp. 105-108, 2005.

[2] R. van den Berg, C. Gussenhoven, and T. Rietveld. Downstep in Dutch: implications for a model. In G.J. Docherty and D.R. Ladd (eds.), Papers in Laboratory Phonology II, Cambridge University Press, pp. 335-359, 1992.

[3] V. Bralé, V. Maffiolo, I. Kanellos, and T. Moudenc. Towards an Expressive Typology in Storytelling: A Perceptive Approach. Lecture Notes in Computer Science, volume 3784, pp. 858-865, Springer-Verlag Berlin Heidelberg, 2005.

[4] F. Burkhardt and W.F. Sendlmeier. Verification of Acoustical Correlates of Emotional Speech using Formant-Synthesis. In Proceedings of the ISCA Workshop on Speech and Emotion, pp. 151-156, Northern Ireland, 2000.

[5] J.E. Cahn. The Generation of Affect in Synthesized Speech. Journal of the American Voice I/O Society, 8:1-19, 1990.

[6] Call of Story. What is Storytelling? http://www.callofstory.org/en/storytelling/default.asp

[7] N. Campbell and T. Marumoto. Automatic labelling of voice-quality in speech databases for synthesis. In Proceedings of the 6th International Conference on Spoken Language Processing, Beijing, China, 2000.

[8] G.A. Chesin. Storytelling and Storyreading. In Peabody Journal of Education, Vol. 43, No. 4, pp. 212-214, 1966.

[9] R. Cowie, E. Douglas-Cowie, B. Apolloni, J. Taylor, A. Romano, and W. Fellenz. What a neural net needs to know about emotion words. In N. Mastorakis, editor, Computational Intelligence and Applications, pp. 109-114. World Scientific & Engineering Society Press, 1999.

[10] R. Cowie, E. Douglas-Cowie, S. Savvidou, E. McMahon, M. Sawey, and M. Schröder. ’Feeltrace’: An Instrument For Recording Perceived Emotion in Real Time. In Proc. ISCA ITRW on Speech and Emotion: Developing a Conceptual Framework, Belfast, pp. 19-24, 2000.

[11] L. Deng, D. Yu, and A. Acero. A Quantitative Model for Formant Dynamics and Contextually Assimilated Reduction in Fluent Speech. In INTERSPEECH, pp. 981-984, 2004.


[12] E. Douglas-Cowie, R. Cowie, and M. Schröder. A new emotion database: considerations, sources and scope. In Proceedings of the ISCA Workshop on Speech and Emotion, pp. 39-44, Northern Ireland, 2000.

[13] T. Dutoit. High Quality Text-to-Speech Synthesis: A Comparison of Four Candidate Algorithms. In Proceedings of ICASSP, pp. 565-568, 1994.

[14] E. Eide, R. Bakis, W. Hamza, and J. Pitrelli. Multilayered Extensions to the Speech Synthesis Markup Language for Describing Expressiveness. In EUROSPEECH-2003, pp. 1645-1648.

[15] P. Ekman. Emotion in the human face, second edition. Cambridge University Press, New York, 1982.

[16] National Storytelling Association. What is Storytelling? http://www.eldrbarry.net/roos/st defn.htm, 1997.

[17] S. Faas. Virtual Storyteller: An approach to computational story telling. M.Sc. thesis, University of Twente, Enschede, The Netherlands, June 2002.

[18] Author Unknown. Storytelling: Definition & Purpose. http://falcon.jmu.edu/~ramseyil/storydefinition.htm

[19] N.H. Frijda. The Emotions. Cambridge University Press, Cambridge, England, 1986.

[20] C. Gobl and A. Ní Chasaide. Testing affective correlates of voice quality through analysis and resynthesis. In Proceedings of the ISCA Workshop on Speech and Emotion, pp. 178-183, Northern Ireland, 2000.

[21] B. Heuft, T. Portele, and M. Rauth. Emotions in time domain synthesis. In Proceedings of the 4th International Conference on Spoken Language Processing, Philadelphia, USA, 1996.

[22] I. Iriondo, R. Guaus, A. Rodríguez, P. Lázaro, N. Montoya, J.M. Blanco, D. Bernadas, J.M. Oliver, D. Tena, and L. Longhi. Validation of an acoustical modelling of emotional expression in Spanish using speech synthesis techniques. In Proceedings of the ISCA Workshop on Speech and Emotion, pp. 161-166, Northern Ireland, 2000.

[23] C.E. Izard. Human emotions. Plenum Press, New York, 1977.

[24] Sun Microsystems, Inc. JSpeech Markup Language. http://java.sun.com/products/java-media/speech/forDevelopers/JSML/index.html, version 0.6, 1999.

[25] K. Krippendorff. Content Analysis: An Introduction to Its Methodology. Sage Publications, Thousand Oaks, CA, 1980.

[26] S. Lemmetty. Review of Speech Synthesis Technology. M.Sc. thesis, Helsinki University of Technology, Department of Electrical and Communications Engineering, 1999.

[27] K. Meijs. Generating natural narrative speech for the Virtual Storyteller. M.Sc. thesis, Human Media Interaction Group, Department of Electrical Engineering, Mathematics and Computer Science, University of Twente, Enschede, The Netherlands, March 2004.

[28] J.M. Montero, J. Gutiérrez-Arriola, S. Palazuelos, E. Enríquez, S. Aguilera, and J.M. Pardo. Emotional Speech Synthesis: From Speech Database to TTS. In Proceedings of the 5th International Conference on Spoken Language Processing, volume 3, pp. 923-926, Sydney, Australia, 1998.

[29] J.M. Montero, J. Gutiérrez-Arriola, J. Colás, E. Enríquez, and J.M. Pardo. Analysis and modelling of emotional speech in Spanish. In Proceedings of the 14th International Congress of Phonetic Sciences, pp. 957-960, San Francisco, USA, 1999.

[30] S.J.L. Mozziconacci. Speech Variability and Emotion: Production and Perception. PhD thesis, Technical University Eindhoven, 1998.

[31] S.J.L. Mozziconacci. Modeling Emotion and Attitude in Speech by Means of Perceptually Based Parameter Values. In User Modelling and User-Adapted Interaction, 11, pp. 297-326, 2001.

[32] S.J.L. Mozziconacci and D.J. Hermes. Role of intonation patterns in conveying emotion in speech. In Proceedings of the 14th International Congress of Phonetic Sciences, pp. 2001-2004, San Francisco, USA, 1999.

[33] I.R. Murray, M.D. Edgington, D. Campion, and J. Lynn. Rule-based emotion synthesis using concatenated speech. In Proceedings of the ISCA Workshop on Speech and Emotion, pp. 173-177, Northern Ireland, 2000.

[34] I.R. Murray and J.L. Arnott. Implementation and testing of a system for producing emotion-by-rule in synthetic speech. Speech Communication, 16:369-390, 1995.

[35] E. Navas, I. Hernáez, and I. Luengo. An Objective and Subjective Study of the Role of Semantics and Prosodic Features in Building Corpora for Emotional TTS. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 4, pp. 1117-1127, July 2006.

[36] A. Ortony, G.L. Clore, and A. Collins. The Cognitive Structure of Emotions. Cambridge University Press, Cambridge, UK, 1988.

[37] W. Parrott. Emotions in Social Psychology. Psychology Press, Philadelphia, 2001.

[38] R. Plutchik. Emotion: a psychoevolutionary synthesis. Harper & Row, New York, 1980.

[39] P. Boersma and D. Weenink. Praat: doing phonetics by computer. Version 4.5.02 [Computer program], http://www.praat.org/, 2007.

[40] E. Rank. Erzeugung emotional gefärbter Sprache mit dem VieCtoS-Synthesizer. Technical Report 99-01, ÖFAI, http://www.ofai.at/cgi-bin/tr-online?number+99-01, 1999.

[41] E. Rank and H. Pirker. Generating Emotional Speech with a Concatenative Synthesizer. In Proceedings of the International Conference on Spoken Language Processing, Vol. 3, pp. 671-674, Sydney, Australia, 1998.

[42] S.H.J. Rensen. De virtuele verhalenverteller: Agent-gebaseerde generatie van interessante plots. M.Sc. thesis, Faculty of Computer Science, University of Twente, Enschede, March 2004.

[43] A.C.M. Rietveld and V.J. van Heuven. Algemene Fonetiek. Coutinho, Bussum, 1997.

[44] R. Sproat, A. Hunt, M. Ostendorf, P. Taylor, A. Black, K. Lenzo, and M. Edgington. SABLE: A Standard for TTS Markup. Bell Labs - Lucent Technologies, Sun Microsystems, Inc., Boston University, CSTR - University of Edinburgh, Carnegie Mellon University, BT Labs.

[45] S. Schötz. Prosody in Relation to Paralinguistic Phonetics - Earlier and Recent Definitions, Distinctions and Discussions. Term paper, Department of Linguistics and Phonetics, Lund University, 2003.

[46] M. Schröder. Speech and Emotion Research: An Overview of Research Frameworks and a Dimensional Approach to Emotional Speech Synthesis. PhD thesis, Vol. 7 of Phonus, Research Report of the Institute of Phonetics, Saarland University, Germany, 2004.

[47] N. Slabbers. Narration for Virtual Storytelling. M.Sc. thesis, Human Media Interaction Group, Department of Electrical Engineering, Mathematics and Computer Science, University of Twente, Enschede, The Netherlands, March 2006.

[48] C. Sobin and M. Alpert. Emotion in Speech: The Acoustic Attributes of Fear, Anger, Sadness, and Joy. In Journal of Psycholinguistic Research, Vol. 28, No. 4, pp. 347-365, 1999.

[49] W3C Consortium. Speech Synthesis Markup Language (SSML) Version 1.0. http://www.w3.org/TR/speech-synthesis/, September 2004.

[50] Various Authors. The Storytelling FAQ. http://www.timsheppard.co.uk/story/faq.html, 2003.

[51] B.W. Sturm. The ’Storylistening’ Trance Experience. Journal of American Folklore, 113, pp. 287-304, 2000.

[52] Sun Microsystems. API Markup Language Specification, Version 0.6. http://java.sun.com/products/java-media/speech/forDevelopers/JSML/index.html, October 1999.

[53] J. Tao, Y. Kang, and A. Li. Prosody Conversion From Neutral Speech to Emotional Speech. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 4, pp. 1145-1154, July 2006.

[54] M. Theune, S. Faas, A. Nijholt, and D. Heylen. The Virtual Storyteller. In ACM SIGGROUP Bulletin, Volume 23, Issue 2, ACM Press, pp. 20-21, 2002.

Appendix A

Tables

A.1 Presence test

Nr  Character  Sentence
1   Tante      Die onbekende prinses is toch verdwenen.
2   Dot        Ik vind het anders helemaal niet grappig
3   Tante      Dat gebeurt misschien nog wel.
4   Magda      En daarin staat dat we drie wensen mogen doen
5   Man        Wat is er met jou aan de hand?
6   Dot        Ja, ik ben de weg kwijtgeraakt.
7   Man        Wat ben je toch lui, Magda.
8   Dot        Dank u wel voor uw hulp.
9   Man        Nee, nee, het is jouw schuld niet lieverd.
10  Man        Waarom hebben we ook zo’n ruzie gemaakt over die wensen?
11  Magda      Er is een brief van de feeën gekomen.
12  Man        Wat? Is er geen eten?
13  Magda      Luister wat ik heb bedacht.
14  Dot        Ik heb nog nooit een vogelbekdier gezien.
15  Tante      Wat jij moet aantrekken?
16  Dot        Ik u ook.
17  Dot        Wie zijn dat?
18  Dot        Maar dat is de baby van mijn kangoeroe.
19  Man        We moeten heel goed nadenken, Magda.
20  Dot        Maar iemand moet toch weten waar mijn huis is?
21  Dot        Oh lieve kangoeroe.
22  Magda      Ik heb al een lijstje gemaakt.
23  Dot        Ik heb mensen nog nooit zo raar zien doen.
24  Dot        Laat mij toch hier.
25  Man        Ik zou best een paar worstjes lusten.
26  Tante      Ik zal het eerst proberen.
27  Dot        Kangoeroe, laat mij maar uitstappen.
28  Magda      Kom binnen en doe de deur dicht
29  Magda      Dat wens ik ook.
30  Magda      Wat ben je toch dom.

Table A.1: Sentences presented in the presence test, in order of appearance


Nr  None  Happiness  Sadness  Fear  Love  Anger  Surprise  Other  Character
6    9    3    8    8    2    0    0    2   Dot
18   1    0    1    0    0   20    3    7   Dot
2    0    0   26    0    1    3    0    2   Dot
14  13    4    0    0    0    0   14    1   Dot
20   2    0   14   15    0    0    0    1   Dot
17   5    0    1   13    1    0    6    6   Dot
23   7    1    0    3    0    0   17    4   Dot
27   5    0    1   22    2    0    0    2   Dot
24   2    1   17   11    0    1    0    0   Dot
21   2    2   18    1    5    0    0    4   Dot
8    2   27    1    0    0    0    0    2   Dot
16  15    0    6    3    6    1    0    1   Dot
28   9    0    1   11    2    0    3    6   Magda
29  13    1   11    1    1    4    0    1   Magda
11   8    2    0    0    0    0   14    8   Magda
4    2   17    0    1    0    0    7    5   Magda
22  19    5    1    0    2    0    2    3   Magda
13  18    7    0    1    0    0    0    6   Magda
30   3    0    4    1    2   20    0    2   Magda
5    7    0    2    0    0    5   10    8   Frederik
25  15    9    0    1    0    2    0    5   Frederik
19  22    0    1    2    0    1    0    6   Frederik
12   0    0    0    0    1    9   18    4   Frederik
7    2    0    1    0    0   27    0    2   Frederik
10   1    1   16    0    0    7    0    7   Frederik
9   19    0    0    0    6    1    1    5   Frederik
15   4    0    0    1    0   12    7    8   Tante
3   14    0    0    0    1    5    2   10   Tante
1   13    0    4    3    0    6    0    6   Tante
26  16    5    1    0    0    1    1    8   Tante

Table A.2: Raw results of the presence test, ordered by character. The numbers represent the amount of test subjects that chose each emotion per question.

Character Question Question text Text entered by test subjects as ’Other’ emotion Dot 6 Ja, ik ben de weg kwijtgeraakt. Schaamte Dot 18 Maar dat is de baby van mijn kangoeroe. ’Het is oneerlijk’, 2x Jaloezie, verontwaardiging, verongelijkt, irritatie Dot 2 Ik vind het anders helemaal niet grappig Verontwaardiging, irritatie Dot 14 Ik heb nog nooit een vogelbekdier gezien. Verbazing Dot 20 Maar iemand moet toch weten waar mijn huis is? Wanhoop Dot 17 Wie zijn dat? 2x Nieuwschierigheid, verlegen/angst, anticipatie Dot 23 Ik heb mensen nog nooit zo raar zien doen. Opgewonden, 3x verbazing Dot 27 kangoeroe, laat mij maar uitstappen. ’Zielig kindje’ Continued on next page A.1. PRESENCE TEST 73

Table A.3 – continued from previous page Character Question Question text Text entered by test subjects as ’Other’ emotion Dot 24 Laat mij toch hier.

Dot 21 Oh lieve kangoeroe. ’Blij en droef’, opluchting, ’mix tussen blij, droef en liefde’ Dot 8 Dank u wel voor uw hulp. Opgelucht, dankbaarheid Dot 16 Ik u ook.

Magda 28 Kom binnen en doe de deur dicht Samenzweerderig, voorzichtig, spanning, 2x geheimzinnig Magda 29 Dat wens ik ook. Hoop Magda 11 Er is een brief van de fee¨engekomen. Samenzweerderig, 2x spanning, enthousiasme, opgewekt, 2x opgewonden, anticipatie Magda 4 En daarin staat dat we drie wensen mogen doen 2x Opgewonden, verrukking/enthousiasme, zekerheid Magda 22 Ik heb al een lijstje gemaakt. hoop, enthousiasme/verassing, kalmte Magda 13 Luister wat ik heb bedacht. Opgetogen, trots, opgewonden, vrolijkheid, zekerheid Magda 30 Wat ben je toch dom. Teleurstelling Man 5 Wat is er met jou aan de hand? Nors, chagrijnig, bazig, desinteresse, verbazing Man 25 Ik zou best een paar worstjes lusten. Gedrevenheid, hongerig, knorrig, hoop (+anticipatie) Man 19 We moeten heel goed nadenken, Magda. 2x bedachtzaam, 2x ernst Man 12 Wat? Is er geen eten? 2x Verbazing Man 7 Wat ben je toch lui, Magda. Teleurstelling (negatief richting boos), Irritatie Man 10 Waarom hebben we ook zo’n ruzie gemaakt over die wensen? 3x Spijt, teleurstelling, berouw Man 9 Nee, nee, het is jouw schuld niet lieverd. 2x Troost, 2x geruststellend Tante 15 Wat jij moet aantrekken? 3x Verontwaardiging, bazig, irritatie of verbazing, 2x verbazing Tante 3 Dat gebeurt misschien nog wel. 2x Hoop, 6x Irritatie, Kalmte Tante 1 Die onbekende princes is toch verdwenen. Onbezorgdheid, onverschilligheid, arrogantie, apathy, bevestiging Tante 26 Ik zal het eerst proberen. 2x Vastberadenheid, ijverzucht, bazig, eigenzinnig, zelfverzekerd, minachting Table A.3: This table contains the text of each fragment in the presence test, as well as the text entered in the ’Other’ emotion field. The entries are ordened by character. 74 APPENDIX A. TABLES

Question  Emotion    Agreement Before  Agreement After  Change
6         None       0.28125           0.28125          0
18        Anger      0.625             0.8125           0.1875
2         Sadness    0.8125            0.8125           0
14        Surprise   0.4375            0.5              0.0625
20        Fear       0.46875           0.5              0.03125
17        Fear       0.40625           0.40625          0
23        Surprise   0.53125           0.59375          0.0625
27        Fear       0.6875            0.6875           0
24        Sadness    0.53125           0.53125          0
21        Sadness    0.5625            0.59375          0.03125
8         Happiness  0.84375           0.84375          0
16        None       0.46875           0.46875          0
28        Fear       0.34375           0.34375          0
29        None       0.40625           0.40625          0
11        Surprise   0.4375            0.5              0.0625
4         Happiness  0.53125           0.625            0.09375
22        None       0.59375           0.59375          0
13        None       0.5625            0.5625           0
30        Anger      0.625             0.65625          0.03125
5         Surprise   0.3125            0.28125          -0.03125
25        None       0.46875           0.46875          0
19        None       0.6875            0.6875           0
12        Surprise   0.5625            0.5625           0
7         Anger      0.84375           0.90625          0.0625
10        Sadness    0.5               0.5              0
9         None       0.59375           0.5625           -0.03125
15        Anger      0.375             0.40625          0.03125
3         None       0.4375            0.4375           0
1         None       0.40625           0.40625          0
26        None       0.5               0.5              0

Table A.4: Counting ’other’ emotions towards their primary emotion. The agreement values represent the fraction of people who chose this specific answer, relative to the total number of people who answered that question. ’Agreement Before’ is the agreement before, and ’Agreement After’ the agreement after, counting various ’Other’ emotion answers towards a primary emotion. The ’Other’ emotion descriptions and the primary emotions they were counted towards are listed in table A.5.

Character Sentence Context Emotion Dot 6 (Dot is verdwaald, begint te huilen en komt Sadness kangoeroe tegen die haar bessen voert. Hier- door kan Dot de dieren verstaan.) Toen ho- orde ze een stem die duidelijker klonk dan de anderen. Het was de kangoeroe die tegen haar zei ”Ik ben heel verdrietig, want ik ben mijn baby-kangoeroe kwijtgeraakt. Ben jij ook iets kwijtgeraakt?” -”Ja”, fluisterde Dot, ”Ik ben de weg kwijtgeraakt” Continued on next page A.1. PRESENCE TEST 75

Table A.6 – continued from previous page Character Zin Context Emotie Dot 2 (Dot is in slaap gevallen en wordt wakker met Anger een enorme slang op haar buik.) Dot voelde haar hart bonzen. Opeens hoorde ze buiten heel hard lachen en roepen ”Hihi, wat een grap. Maar je hoeft niet bang te zijn, als je stil blijft liggen gebeurt er niets. Ik zal de slang doodmaken, hahaha, wat een grap.” Dot zag buiten de grot een vogel op de onderste tak van een boom zitten. Het was een reuze-ijsvogel, maar omdat’ie steeds zo lachte noemde Dot hem een lachvogel. ”Ik vind het anders hele- maal niet grappig”, zei Dot. Dot 14 Onderweg zei Dot tegen de kangoeroe: ”Ik heb None nog nooit een vogelbekdier gezien. Hoe zit’ie eruit?” Dot 20 Dot vertelde het vogelbekdier dat ze de weg Fear naar huis was kwijtgeraakt. Maar het was al- sof het vogelbekdier niet naar haar luisterde. Hij bleef maar mopperen over de mensen. ”Maar iemand moet toch weten waar mijn huis is?”, riep Dot wanhopig. Dot 17 Midden in de nacht werd Dot wakker. De None maan stond hoog aan de hemel. Ze zag de kangoeroe angstig heen en weer lopen. In de verte klonk het geroffel van trommels. ”Wie zijn dat?” , vroeg Dot. ”Dat zijn jagers”, zei de kangoeroe, ”We moeten vluchten.” Dot 23 ”Maar ze zullen ons toch geen kwaad doen?”, Fear zei Dot.”Bovendien zou ik de jagers best willen zien.” ”Kom dan maar mee, maar je moet wel heel stil zijn.”, zei de kangoeroe. Dot klom in de buidel en de kangoeroe verdween tussen de struiken. Ze verstopten zich achter een rots. Dot gluurde over de rand van de buidel. Ze zag een groot vuur en daar omheen dansten mannen die zich beschilderd hadden met rode en witte strepen. Ze hadden lange speren in hun hand. ”Ik ben bang”, fluisterde Dot. ”Ik heb mensen nog nooit zo raar zien doen.” Dot 27 Opeens begon een hond van de jagers te blaf- None fen. De kangoeroe schrok en sprong weg. De jagers zagen haar en renden achter haar aan. Ze konden de kangoeroe heel goed zien om- dat de maan helder scheen. De honden van de jagers blaften en kwamen steeds dichterbij. De kangoeroe begon moe te worden. ”kango- eroe”, riep Dot, ”Laat mij maar uitstappen, zonder mij kun je veel harder lopen.” Continued on next page 76 APPENDIX A. TABLES

Table A.6 – continued from previous page Character Zin Context Emotie Dot 24 De kangoeroe wist dat er nog maar ´e´enkans Sadness was om te ontsnappen. Ze moest naar de rots aan de overkant springen. Ze pakte Dot op, en zette haar weer in de buidel. ”Laat mij toch hier.”, riep Dot. Dot 21 Maar de kangoeroe nam een aanloop en sprong Sadness over het ravijn. Dot hield haar adem in. De kangoeroe landde met haar voorpoten op de rand van de rots aan de andere kant. Ze probeerde zich omhoog te trekken, maar dat lukte niet. Langzaam zakte ze langs de steile rotswand omlaag. De kangoeroe probeerde met haar achterpoten ergens steun te vinden. Gelukkig lukte dat. Voorzichtig klom ze naar de rand van de rots en viel toen uitgeput voorover. Dot klom uit de buidel en sloeg haar armen om de hals van de kangoeroe. ”Oh lieve kangoeroe.”, huilde ze. Dot 8 (De roerdomp geeft advies om de kangoeroe Happiness weer bij te brengen. Dot doet dit en het werkt.) ”Dank u wel voor uw hulp.”, zei Dot tegen de roerdomp. Dot 16 Opgewonden zei ze tegen Dot ”Ik heb het Sadness kwikstaartje gevonden en hij kan je de weg naar huis wijzen.”. Ze nam Dot mee naar het kwikstaartje. ”Dag Dot”, zei het vogeltje. ”Een heleboel mensen zijn je aan het zoeken, en ze zijn allemaal heel verdrietig. Het is nu te donker om te vertrekken, maar morgenvroeg zal ik je thuisbrengen.” Dot en de kangoeroe bleven die avond nog lang praten. ”Ik zal je missen”, zei de kangoeroe. ”Ik u ook.”, zei Dot Dot 18 (Dot komt thuis; de kangoeroe loopt mee.) Surprise Dot en haar ouders liepen naar de boerderij, en de kangoeroe liep achter hen aan. De deur van de boerderij stond open, en daar kwam een baby-kangoeroe naar buiten huppelen die meteen in de buidel van de kangoeroe sprong. ”Kijk nu toch eens”, zei de moeder van Dot. ”Huppel is in de buidel van de grote kangoeroe gesprongen.” ”Huppel?”, vroeg Dot. ”Vader heeft vorige week in een struik een baby kan- goeroe gevonden”, zei moeder. ”Maar dat is de baby van mijn kangoeroe”, riep Dot. ”En wat zijn ze blij dat ze elkaar weer gevonden hebben” Continued on next page A.1. PRESENCE TEST 77

Table A.6 – continued from previous page Character Zin Context Emotie Frederik 5 Op een avond kwam Frederik thuis. Hij was Anger moe en had een boze bui. Z’n vrouw Magda zat in de keuken. Ze keek’m heel vreemd aan. Op haar schoot lag een verkreukelde brief. ”Wat is er met jou aan de hand?”, bromde Frederik Magda 28 ”Kom binnen en doe de deur dicht”, zei Magda Fear geheimzinnig. Magda 11 Er is een brief van de fee¨engekomen. A bit of residual fear (see above) from the suspense, and enthusi- asm/happiness (good news) Magda 4 En daarin staat dat we drie wensen mogen Idem doen Frederik 19 Frederik pakte de brief en las hem. ”We None moeten heel goed nadenken, Magda.”, zei hij. ”We kunnen rijk en beroemd worden, maar dan moeten we wel de juiste dingen wensen.” Magda 22 ”Ik heb al een lijstje gemaakt.”, zei Magda, en Enthusiast, Self- sprong op. assured, No clear emotion Magda 13 ”Luister wat ik heb bedacht.” Idem Frederik 12 [lijst van wensen]..”Oh help, ik heb er hele- Quite unpleasantly maal niet aan gedacht om te koken.” -”Wat?!”, surprised, Anger riep Frederik, ”Is er geen eten? Hoe kan ik nu and Surprise (but een wens bedenken met een lege maag?” Anger is more prominent) Frederik 7 ”Wat ben je toch lui, Magda.” Anger Frederik 25 Ik zou best een paar worstjes lusten. None Magda 30 [plop, wens vervuld] ”Je hebt een wens verk- Anger noeid!”, riep Magda kwaad, ”Wat ben je toch dom. Oh, wat maak je me boos. Ik..ik..zou willen dat die worstjes aan die rare neus van je hingen!” Frederik 10 ”Oh, wat was ik gelukkig toen ik nog een Compassionate, gewone neus had. Waarom hebben we ook Regret → Sadness zo’n ruzie gemaakt over die wensen?” Frederik 9 ”Je hebt gelijk Frederik, het spijt me erg”, zei Love Magda. ”Nee, nee, het was jouw schuld niet lieverd.”, zei Frederik Magda 29 ”Ik wilde maar dat de fee¨enhun wensen zelf None gehouden hadden en dat hier alles hetzelfde gebleven was.” -”Dat wens ik ook.”, zei Magda Continued on next page 78 APPENDIX A. TABLES

Table A.6 – continued from previous page Character Zin Context Emotie Tante 15 (Er zijn uitnodigingen voor het bal van de Surprised, Angry, prins. Assepoester vraagt wat zij aan zal gaan malicious delight → trekken.) Drie verbaasde gezichten staarden Anger haar aan. ”Jij? Wat jij moet aantrekken?”, zei haar stiefmoeder, ”Je denkt toch niet dat jij naar het bal gaat?!” Tante 3 (Het bal is afgelopen; iedereen is weer thuis.) Sounding like ”Het is allemaal de schuld van Assepoester”, ’quiet, quiet (shut zeurde Bertha, ”Als ze mijn jurk netter had up!) it’ll be all gestreken was de prins zeker verliefd op mij right’. No promi- geworden.” -”En als ze mijn pruik beter had nent emotion geborsteld, was ik vast zijn vrouw geworden.”, snauwde Truida. -”Dat gebeurt misschien nog wel”, zei hun moeder, ”Die onbekende prinses is toch verdwenen”. Tante 1 ”Die onbekende prinses is toch verdwenen” Idem (see entry above) Tante 26 Er werd op de deur geklopt. Buiten stond de A little bit angry, dienaar van de prins. Hij droeg het glazen but mostly ’bossy’, muiltje op een roodfluwelen kussen. Bertha conceited, wanting trok hem naar binnen. ”Geef hier”, riep ze. to execute her plan. ”Nee, ik eerst”, schreeuwde Truida. ”Uit No prominent emo- de weg”, zei hun moeder, ”Ik zal het eerst tion. proberen”. Table A.6: Contextually determined emotions for all fragments in chronological order A.1. PRESENCE TEST 79

Amount  ’Other’ description                          Emotion
9       Irritatie                                    Anger
2       Opgewonden                                   Happiness
1       Verrukking/Enthousiasme                      Happiness
2       Boos, Nors                                   Anger
1       Chagrijn                                     Anger
1       Schaamte                                     Sadness
2       Teleurstelling (wel negatief richting boos)  Anger
2       Troost                                       Love
2       Geruststelling                               Love
1       Anticipatie                                  Surprise
1       Spanning                                     Surprise
2       Verontwaardigd                               Anger
3       Verbazing                                    Surprise
1       Het is oneerlijk!                            Anger
2       Jaloezie                                     Anger
1       Wanhoop                                      Fear

Table A.5: ’Other’ emotion modifications. This table shows which emotion descriptions entered in the ’Other’ emotion field of the presence test were counted towards which primary emotion. The frequency of occurrence is also shown.

A.2 The Model

Emotion Fear Sadness Surprise Anger Happiness Max F0 1.222541571 1.176908446 1.237693444 1.446686958 1.368386684 Mean F0 1.22096678 1.23665635 0.986873708 1.295474026 1.169113631 F0 Range 1.227974598 1.068087732 1.624904677 1.769815201 1.798990234 StdDevF0 1.144609034 1.012592061 1.594976358 1.487377441 1.612019038 Max dF0 1.764232076 1.578677128 2.513194704 3.099383509 2.698161157 Mean dF0 0.149527059 0.568871714 -0.03690214 0.502971066 -0.174601861 dF0 Range 1.262625257 1.198860051 1.52000323 1.721444638 1.583566202 StdDevdF0 1.336404415 1.117496082 1.261316472 1.241494057 1.739518713 Max Int 1.013216473 1.042599479 1.012666159 1.049024599 1.038589472 Mean Int 1.01659747 1.020614328 0.980106942 1.040509672 1.042852761 Int Range 1.125848395 1.240066621 1.245496503 1.380150725 1.064411834 StdDev Int 1.08943575 1.083128561 0.929914123 1.186451268 1.051758967 Max dInt 1.670342008 1.373827918 1.688307687 1.759963975 1.274011138 Mean dInt 0.489232713 0.645709265 0.638872538 1.049197917 0.752841381 dInt Range 1.423211049 1.363003352 1.736252097 1.578373138 1.465878257 StdDev dInt 1.387627348 1.302425219 1.297319177 1.430406868 1.171174659 Jitter (local) 1.00052789 0.891606546 0.880344888 0.605666021 0.724265353 Shimmer (local) 0.826593234 0.804091267 0.867820614 0.597403619 0.636428009 Fraction of locally un- 0.654995227 0.424314462 0.685804586 0.487935056 0.272272526 voiced frames Mean harmonics-to-noise 1.151753104 1.292366691 1.170927684 1.772096421 1.372169467 ratio (dB) Speech Rate 1.193327469 1.283287789 1.079747999 1.205114685 1.123126667 End Frequency (Hz) 1.480549012 1.318385599 1.447170816 1.129407833 1.56099078

Table A.7: Average multiplication factor for acoustic attributes between neutral and emotional for character Dot

Emotion Fear Surprise Anger Happiness Max F0 0.890946353 0.901228283 1.348455543 1.135880073 Mean F0 1.07307999 0.901691465 1.702473667 0.989371087 F0 Range 0.61126367 0.956089217 1.080763621 1.37465485 StdDevF0 0.452049268 0.854298815 1.080507285 1.436863831 Max dF0 1.305834232 1.541143376 1.224249931 2.132524287 Mean dF0 0.398043712 1.025169677 0.858002797 0.449972053 dF0 Range 1.117956755 1.208712793 1.086490626 1.436801627 StdDevdF0 1.031215753 0.961134934 1.295703086 1.113682286 Max Int 0.964502855 0.938664771 1.065208966 1.015050033 Mean Int 0.996704345 0.940415721 1.058104753 0.987905108 Int Range 0.764244353 0.737648958 1.163385399 0.947564396 StdDev Int 0.738180981 0.519817765 1.209601489 0.995071354 Max dInt 0.397925232 0.728139234 1.092853739 1.07013404 Mean dInt 0.867955482 0.680352088 0.995869964 0.747961285 dInt Range 0.610153628 0.700741642 0.884988776 0.888312683 StdDev dInt 0.559125593 0.618342965 0.913035978 0.759933716 Jitter (local) 0.864892793 0.905450788 0.495995867 0.77680186 Shimmer (local) 0.980676153 1.050838822 0.708759416 0.837948979 Fraction of locally unvoiced frames 1.167701738 0.829682814 0.714157359 0.91677659 Mean harmonics-to-noise ratio (dB) 1.303300689 0.984403337 1.80137831 1.384983678 Speech Rate 0.827697782 1.022271831 0.969561539 0.799737037 End Frequency (Hz) 1.533635516 0.795571655 2.693629421 0.795947031

Table A.8: Average multiplication factor for acoustic attributes between neutral and emotional for character Magda

Emotion Sadness Surprise Anger Max F0 1.586019623 1.4748666 0.91095222 Mean F0 1.535743574 1.325401347 0.974543103 F0 Range 1.811447525 2.317549844 0.887884548 StdDevF0 1.078497895 2.74759046 0.751333614 Max dF0 3.085304688 1.211111339 1.472668297 Mean dF0 1.079211158 -1.597625791 1.659623152 dF0 Range 1.517467854 1.052387228 1.117292524 StdDevdF0 2.585356493 1.859113206 1.143930008 Max Int 1.03462185 1.044617315 1.019276762 Mean Int 1.02794861 1.045610033 1.033178061 Int Range 1.029211251 1.142897043 0.949404417 StdDev Int 0.989561789 1.419458103 0.992874148 Max dInt 1.219646585 0.703387521 1.067154224 Mean dInt 0.196349952 1.629129185 0.732939084 dInt Range 1.027048338 1.136606218 0.779266319 StdDev dInt 1.310939021 1.153795271 1.047416081 Jitter (local) 0.902560455 1.097261735 0.962660028 Shimmer (local) 0.843981481 0.906095679 0.946141975 Fraction of locally unvoiced frames 0.289218615 0.885618775 0.756851312 Mean harmonics-to-noise ratio (dB) 1.505347047 1.113684354 0.989987058 Speech Rate 1.455052002 1.298921153 1.160345416 End Frequency (Hz) 1.415737943 1.091533039 0.741638676

Table A.9: Average multiplication factor for acoustic attributes between neutral and emotional for character Frederik

Emotion Anger Max F0 1.398987257 Mean F0 1.210709341 F0 Range 1.547731536 StdDevF0 1.241029656 Max dF0 1.865671392 Mean dF0 -0.014529202 dF0 Range 1.313470011 StdDevdF0 2.220270054 Max Int 1.038961144 Mean Int 1.069813547 Int Range 1.290352744 StdDev Int 1.030493169 Max dInt 0.889904493 Mean dInt 1.462883259 dInt Range 0.900107957 StdDev dInt 1.064881864 Jitter (local) 1.090938883 Shimmer (local) 1.085355771 Fraction of locally unvoiced frames 0.68982659 Mean harmonics-to-noise ratio (dB) 1.381825273 Speech Rate 1.039568905 EndFrequency (Hz) 2.687566592

Table A.10: Average multiplication factor for acoustic attributes between neutral and emotional for character Tante
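To illustrate how the factors in tables A.7 to A.10 are meant to be read, the following minimal sketch (Python, not code from this project) scales a set of neutral measurements by the factors for one character and emotion. The neutral values and the chosen parameter subset are invented for the example; the factors are rounded from table A.7.

# Rounded subset of table A.7 (character Dot): factors that map a neutral
# measurement onto its emotional counterpart.
FACTORS_DOT = {
    "Anger": {"mean F0": 1.2955, "F0 range": 1.7698, "speech rate": 1.2051},
    "Fear":  {"mean F0": 1.2210, "F0 range": 1.2280, "speech rate": 1.1933},
}

def emotional_targets(neutral, emotion, factors=FACTORS_DOT):
    """Scale neutral acoustic measurements by the per-emotion factors."""
    return {name: value * factors[emotion].get(name, 1.0)
            for name, value in neutral.items()}

# Hypothetical neutral measurements for one fragment: Hz, Hz, syllables per second.
neutral = {"mean F0": 210.0, "F0 range": 80.0, "speech rate": 4.2}
print(emotional_targets(neutral, "Anger"))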

Happiness Recognized% F0 Mean F0 Range Tempo Burkhardt 62% (1/9) 0% 100% +20% or ”-20%” and Sendlmeier [4] Cahn [5] 48% (1/6) -3 reference +10 contour slope +2 fluent pauses line (=pitch ”+5” (I.e. pitch ”-5”, hesitation value between range is expanding pauses ”-8” accents) ”-8”, throughout the but less final utterance) accent lowering ”-4” shape (=steepness of F0 contour at accents) ”+10” Heuft et al. very fast [21] Iriondo et al. untested increased (10- increased (120%) decrease of silence [22] 50%) duration (20%) Montero et al. 19% (1/5) higher than pause duration: [28, 29] neutral ca. half of neutral (both sentence- final and intra- sentence) Murray and not recognized +9 s.t.; reduce the +30 wpm; dura- Arnott [34] amount of pitch fall tion of stressed at end of utterance vowels +50%; by 10% of F0 modify phoneme durations for reg- ular stressing (=¿ time between two stressed phonemes = 550ms or multi- ple thereof) Murray et al. raised, high- slightly increased [33] pitched Tao et al. [53] -0.38 37.20% 73% -2.40%

Table A.11: Prosody rules for a synthesized happy voice, part 1. From [46] appendix A

Happiness Loudness Voice Quality Other Burkhardt modal or tense; ”lip- Wave pitch contour model: main and spreading feature”; stressed syllables are raised Sendlmeier F1/F2+10% (”+100%”), syllables in between [4] are lowered (”-20%”) Cahn [5] breathiness ”- stress frequency (no. of accents) 5”brilliance ”-2” ”+5”; precision of articulation ”- 3” Heuft et al. [21] Iriondo et al. stable inten- fast inflections of tone; F0 rise [22] sity and fall times similar, no high plateaux, pitch-energy relation asynchronous (energy peaks 100 to 150ms earlier than pitch peaks, sounds like laughter) Montero et al. [28, 29] Murray and +3dB eliminate abrupt changes in pitch Arnott [34] between phonemes Murray et al. [33] Tao et al. [53] 3.30% Recognized number is calculated as the difference between the mean level labeles for the synthesized emotion minus the mean level la- beled for the original speech. The number for the strong variant of the emotion is shown here.

Table A.12: Prosody rules for a synthesized happy voice, part 2. From [46] appendix A

Joy Recognized% F0 Mean F0 Range Tempo Burkhardt 50% 100% 30% and Sendlmeier [4] Mozziconacci 62% (1/7) end frequency excursion size 10 duration rel. to [30, 32] 155Hz (male s.t. neutrality: 83% speech)

Table A.13: Prosody rules for a synthesized joyous voice, part 1. From [46] appendix A

Joy Loudness Voice Quality Other Burkhardt modal or tense; ”lip- Wave pitch contour model: main and spreading feature”; stressed syllables are raised Sendlmeier F1/F2+10% (”+100%”), syllables in between [4] are lowered (”-20%”) Mozziconacci final intonation pattern 1andA or [30, 32] 5andA; avoid final patterns A, EA and 12.

Table A.14: Prosody rules for a synthesized joyous voice, part 2. From [46] appendix A

Sadness Recognized% F0 Mean F0 Range Tempo Burkhardt Crying despair despair -20% pitch variabil- despair -20%sorrow and 69% (1/9), +100%sor- ity (within sylla- -40% Sendlmeier Quiet Sorrow row -20% bles) ”-20%” [4] 38% (1/9) Cahn [5] 91% (1/6) 0 reference line -5 steeper accent -10 more fluent ”-1”, less final shape ”+6” pauses ”+5”, hes- lowering ”-5” itation pauses ”+10” Heuft et al. low fast [21] Iriondo et al. decreased (10- decreased (30- duration of si- [22] 30%) 50%), less than lences increased 30Hz (50-100%) Campbell and 52% (1/3) lower (-10Hz) reduced (0.6875) 8% slower (dura- Marumoto [7] tion * 1.08) Montero et al. 67% (1/5) lower than pause duration: ca. [28, 29] neutral a third more than for neutral (both sentence-final and intra-sentence) Mozziconacci 47% (1/7) end frequency excursion size 7 s.t. duration rel. to [30, 32] 102Hz neutrality 129% Murray and well recognized -30 Hz -2. S.t. -40 wpm Arnott [34] with neutral text Murray et al. lowered slightly slower [33] Rank and 69% (1/4) -3 0.5 duration of vowels: Pirker [41] 1.4, voiced cons: 1.5, unvoiced cons. 1.2; pause duration 3.0, pause duration variability 0.5 Tao et al. [53] -0.59 -13.90% -30.90% 4%

Table A.15: Prosody rules for a synthesized sad voice, part 1. From [46] appendix A

Sadness Loudness Voice Quality Other Burkhardt breathy F0 flutter (jitter): despair”FL and 200”sorrow”FL 300” Sendlmeier [4] Cahn [5] -5 breathiness ”+10”, stress frequency ”+1” precision of brilliance”-9” articulation ”-5” Heuft et al. [21] Iriondo et al. decreased null inflections of intonation; dis- [22] (10-25%) course fragmentation increased (10%) Campbell and intensity Marumoto [7] range re- duced by 5% Montero et al. [28, 29] Mozziconacci Final intonation pattern 3C; avoid [30, 32] final pattern 5andA Murray and -2 dB spectral tilt +65% decrease articulation precision (re- Arnott [34] duce ”a” and ”I” vowels); elimi- nate abrupt changes in pitch be- tween phonemes; replace upward inflections with downward inflec- tions; add 80ms pause after each word with more than 3 phonemes. Murray et al. a bit of artificial ”laryn- [33] gealisation” Rank and amp. Vowels creaky rate 0.02, glottal F0 jitter 0.0005; articulation pre- Pirker [41] 0.7, voiced noise 0.4 cision 0.95 cons. 0.7, unvoiced cons 0.8; amp. Shim- mer 0.0 Tao et al. [53] -6.80% Recognized number is calculated as the difference between the mean level labeles for the synthesized emotion minus the mean level la- beled for the original speech. The number for the strong variant of the emotion is shown here.

Table A.16: Prosody rules for a synthesized sad voice, part 2. From [46] appendix A

Anger Recognized% F0 Mean F0 Range Tempo Burkhardt Hot Anger hot +50%cold 30% and 29% (1/9); 20% Sendlmeier cold anger [4] 60% (1/9) Cahn [5] 44% (1/6) -5 reference +10 steep accent +8 less fluent line ”-3”, shape ”+10” pauses ”-5”, hes- extreme fi- itation pauses nal lowering ”+-7” ”+10” Heuft et al. very low narrow [21] Iriondo et al. Fury very wide (can ex- slower; reduction of [22] ceed 140Hz) the number of si- lencces (25%); in- crease the duration of silences (10%) Campbell and 65% (1/3) raised (+7Hz) wider (*1.125) 2% faster (duration Marumoto [7] * 0.98) Montero et al. Cold anger 7% like neutral nearly no declina- pause duration: [28, 29] (1/5) tion (final peaks roughly 2/3 of as high as initial neutral (both peaks) sentence-final and intra-sentence) Mozziconacci 51% (1/7) end frequency excursion size 10 duration rel to neu- [30, 32] 110Hz s.t. trality 79% Murray and well recognized +10Hz +9 s.t. +30 wpm Arnott [34] with neutral text Murray et al. Anger raised increased faster [33] Rank and 40% (1/4) 0 2 duration of vowels Pirker [41] 0.75, voiced cons 0.85, unvoiced cons 0.9, pause duration 0.8, pause duration variability 0.0 Tao et al. [53] -1.14 23.30% 93.00% -9%

Table A.17: Prosody rules for a synthesized angry voice, part 1. From [46] appendix A

Anger Loudness Voice Quality Other Burkhardt tense cold: vowel articulation preci- and sion (formant overshoot) +30% for Sendlmeier stressed syllables, -20% for un- [4] stressed syllables Cahn [5] 10 breathiness ”-5”, bril- precision of articulation ”+5” liance ”+10” Heuft et al. [21] Iriondo et al. raising in- most characteristic for variation of the intonation struc- [22] tensity from fury: increase of energy ture (20-80Hz); pitch rises faster the begin in 500-636Hz and 2000- than falls; monosyllabic high pitch to the end 2500Hz bandwidths (10- plateaux; downward declination (5-10dB) 15dB) (approx: topline 245-150Hz, base- line 190-90Hz) Campbell and intensity Marumoto [7] range in- creased by 10% Montero et al. [28, 29] Mozziconacci Final intonation pattern 5andA [30, 32] or A or EA; avoid final pattern 1andA or 3C Murray and +6 dB laryngealisation increase pitch of stressed vowels Arnott [34] +78%F4 frequency (2ary: +10% of pitch range, 1ary: 175Hz +20%; emphatic: +40%) Murray et al. artificial ”laryngealisa- [33] tion” Rank and amp vowels articulation precision 1.05 Pirker [41] 1.3, voiced cons 1.2, un- voiced cons 1.1, amp shimmer 0.0 Tao et al. [53] 2.00% Recognized number is calculated as the difference between the mean level labeles for the synthesized emotion minus the mean level la- beled for the original speech. The number for the strong variant of the emotion is shown here.

Table A.18: Prosody rules for a synthesized angry voice, part 2. From [46] appendix A

Fear Recognized% F0 Mean F0 Range Tempo Burkhardt Fear 52% (1/9) 150% 20% 30% and Sendlmeier [4] Cahn [5] Scared 52% +10 reference +10 steeply ris- +10 no fluent (1/6) line ”+10”, no ing contour slope pauses ”-10”, but (or negative?) ”+10” steep accent hesitation pauses final lowering shape ”+10” ”+10” ”-10” Heuft et al. Fear narrow very fast [21] Iriondo et al. Fear increased decreased (5%) faster (decrease of [22] (5-10%) duration of phonic groups by 20-25%); decrease duration of silences (10%) Mozziconacci Fear 41% (1/7) end frequency excursion size 8 s.t. duration rel to neu- [30, 32] 200Hz trality 89% Murray and Fear (not rec- +20Hz +3 s.t.; end +20 wpm Arnott [34] ognized with F0 baseline fall neutral text) +100Hz Murray et al. Fear even more very much in- [33] raised creased than happy (”squeaky”) Rank and Fear 18% (1/4) 1.5 2 duration of vowels Pirker [41] 0.65, voiced cons 0.55, unvoiced cons 0.55, pause duration 0.6, pause duration variability 0.5 Tao et al. [53] -0.92 -12.40% -31.50% -3.10%

Table A.19: Prosody rules for a synthesized fearful voice, part 1. From [46] appendix A

Fear Loudness Voice Quality Other Burkhardt falsetto and Sendlmeier [4] Cahn [5] 10 brilliance ”+10” laryn- stress frequency ”+10”, loudness gealisation ”-10” ”+10” Heuft et al. [21] Iriondo et al. raised inten- fast variations of pitch (by 60-100 [22] sity (10%) Hz in 20-30ms); bi- or trisyllabic energy glob- high pitch plateaux; high plateaux ally rising rising (approx: topline 180-250Hz, baseline 140-140Hz) Mozziconacci final intonation pattern 12; avoid [30, 32] final pattern A or EA Murray and laryngealisation +50% increase articulation precision (un- Arnott [34] reduce vowels; duration of plosives +30%) Murray et al. [33] Rank and amp vowels creaky rate 0.003; glot- F0 jitter 0.35; articulation preci- Pirker [41] 1.2, voiced tal noise 0.5 sion 0.97 cons 1.0 un- voiced cons 1.1 amp shimmer 0.05 Tao et al. [53] -5.10% Recognized number is calculated as the difference between the mean level labeles for the synthesized emotion minus the mean level la- beled for the original speech. The number for the strong variant of the emotion is shown here.

Table A.20: Prosody rules for a synthesized fearful voice, part 2. From [46] appendix A

Surprise Recognized% F0 Mean F0 Range Tempo Cahn [5] Surprised 44% 0 reference line +8 steeply ris- +4 less fluent (1/6) ”-8” ing contour slope pauses ”-5” hes- ”+10” steeper itation pauses accent shape ”+5” ”-10” Iriondo et al. Surprise increased (10- increased (15-35%) faster (decreased [22] 15%) high inflections in duration of phonic intonation groups by 10%) Montero et al. Surprise 76% much higher pause duration: [28, 29] (1/5) than neutral ca 60% of neutral (ca 200Hz for (both sentence- male speaker) final and intra- sentence)

Table A.21: Prosody rules for a synthesized surprised voice, part 1. From [46] appendix A

Surprise Loudness Voice Quality Other Cahn [5] 5 brilliance ”-3” Iriondo et al. increased (3- [22] 5 dB) Montero et al. [28, 29]

Table A.22: Prosody rules for a synthesized surprised voice, part 2. From [46] appendix A

Neutral Recognized% F0 Mean F0 Tempo Other Range Burkhardt & Neutral 55% (1/9) Sendlmeier [4] Heuft et al. [21] Neutral low narrow slow Montero et al. Neutral 76% (1/5) [28, 29] Mozziconacci Neutral 83% (1/7) end fre- excursion final intonation pat- [30, 32] quency size 5 s.t. tern 1&A; avoid final 65Hz patterns 3C and 12

Table A.23: Prosody rules for a synthesized neutral voice. From [46] appendix A

Boredom Recognized% F0 Mean F0 Range Tempo Burkhardt Boredom 71% -20% -50% reduced pitch -20% additional and (1/9) variability (within lengthening of Sendlmeier syllables) ”20%” stressed syllables [4] (=> almost flat (40%) intonation contour) Mozziconacci Boredom 94% end frequency excursion size 4 s.t. duration rel to neu- [30, 32] (1/7) 65Hz trality 150%

Table A.24: Prosody rules for a synthesized bored voice, part 1. From [46] appendix A

Boredom Loudness Voice Quality Other Burkhardt modal reduced vowel precision (formant and undershoot) stressed syllables ”- Sendlmeier 20%”, unstressed syllables ”-50%” [4] Mozziconacci final intonation pattern 3C; avoid [30, 32] final patterns 5andA and 12

Table A.25: Prosody rules for a synthesized bored voice, part 2. From [46] appendix A

A.3 Evaluation Experiment

Alleen trager en minder luid, ipv emotie de droefenis komt denk ik meer tot uitdrukking in tempowisselingen: ruimte na waar bv (heeft ook met uitstraling van vermoeidheid te maken) Het klinkt als een langzamer afgespeelde versie van de neutrale. In het woord ”vandaan” zit te veel toonhoogtevariatie, die maakt dat het raar klinkt. Doordat het zo langzaam wordt afgespeeld klinkt het meer als een achterdochtige vrouw dan als een bedroefde. het klonk eerder moe dan bedroefd klinkt in elk geval meer bedroefd dan ”standaard”... Maar klinkt wel heel erg monotoon. vooral omdat de zin rustiger wordt uitgesproken klinkt het bedroefd ze zegt VERhaald volgens mij...

Table A.26: Comments on the sad sentence 1: ”Waar heb je dat vandaan gehaald?”

het is precies hetzelfde als de vorige, alleen dan een andere tekst... ’k mis het echte bedroefde erin.. maarjah... het klinkt vooral suf Het woordt ’moeten’ begint te hoog en valt daardoor uit de toon met de rest van de zin. hooguit zou de klemtoon meer op ’dat’ kunnen liggen Ik vind het accent op ”niet” nogal vreemd (ik zou accent op ”doen” verwachten) en daardoor klonken ze allebei wat raar. NB: eigenlijk vind ik dit fragment wel bedroefd(er) klinken, net als het vorige toonhoogte van de laatste 2 woorden mag wel verschillend zijn, doen lager dan moeten. Wat betere variaties in woordlengtes

Table A.27: Comments on the sad sentence 2: ”Dat had je niet moeten doen.”

aan het einde van de zin moet het omhoog gaan als je verrast ben dan ga je volgens mij op het eind van de zin omhoog, en in het fragment eindigt de zin laag. Het heeft wel iets weg van verrast, maar het kan ook geagiteerd zijn doordat er nog veel nadruk op het woord ’niet’ ligt. jammer dat je er geen schrikademhaling bij kunt doen, voor de zin, een inademingsgeluid bv. Klinkt eerder boos dan verrast. Opnieuw: gekke plaatsing van het accent. (Deze plaatsing leidt bovendien eerder tot een boze interpretatie!) spreeksnelheid zou iets rustiger kunnen variatie in spreeksnelheid tijdens de zin zou misschien nog wat toe kunnen voegen

Table A.28: Comments on the surprised sentence 3: ”Dat had je niet moeten doen.”

het boze zit ’m in de accenten, in dit geval de variatie in geluidssterkte. ik zou meer nadruk leggen op ’best zelf’. daar zou je ook iets kunnen vertragen. het klinkt eerder sjachereinig/boos Het tweede deel ’kunnen verzinnen’ staat teveel los van het eerste gedeelte. in het eerste komt meer de dominantie van boos naar voren, maar in het tweede gedeelte ontbreekt dat. Ik vind het vooral verontwaardigd klinken (maar dat is immers ook een vorm van boos) klinkt eerder bedroefd Klinkt geirriteerd en uit de hoogte, iets wat bij boosheid wel kan kloppen ja. Klinkt verontwaardigd

Table A.29: Comments on the angry sentence 4: ”Dat had je best zelf kunnen verzinnen.”

Het deel ’schoon’ uit ’schoonfamilie’ is te hoog en maakt dat de hele zin raar klinkt. Het klinkt ook meer als een statement wat aan een dombo uitgelegd wordt dan als en bedroefd persoon. het klinkt meer mistroostig dan bedroefd. Het gaat wel de bedroefde kant op. Laatste gedeelte van de zin is qua toonhoogte te vlak. Het klinkt wat teneergeslagen, alleen ’schoonfamilie’ zit niet lekker in elkaar. Ik vind het eerder sarcastisch klinken dan bedroefd. Maar dat ligt volgens mij niet aan boven- staande aspecten. Eerder door de manier waarop ”mijn schoonfamilie” wordt uitgesproken; beetje overdreven gearticuleerd.

Table A.30: Comments on the sad sentence 5: ”Morgen komt mijn schoonfamilie op bezoek.”

Een beetje paniekerig, dus wel bang ja. Het tweede gedeelte van de zin gaat qua toonhoogte naar beneden zoals het een standaard zin betaamd. Dan komt het normaler over. Het eerste gedeelte vind ik wel goed. het was iets te langzaam klemtoon op dat en niet, doen laten zekken in toonhoogte.

Table A.31: Comments on the scared sentence 6: ”Dat had je niet moeten doen.”

de nadruk ligt verkeerd. Als je deze zin liefdevol uitspreekt ligt de nadruk zeker op ’schat’ Fragment 2 klinkt net als fragment 1. Er moet meer nadruk op het woord ’schat’ en dat moet met meer nadruk en een zuchtje uitgesproken worden. (jaja, alsof zo’n automatisch ding dat kan..) Hij is ook te snel, het kan niet liefdevol zijn omdat er geen gedachte (=tijd) achter zit. Heb je er uberhaupt wel wat aan verandert? Het klinkt bijna hetzelfde... alleen bij het woordje ’een’ zit een rare vervorming. Zoiezo zou ik de nadruk niet op ’toch’ leggen zeker niet bij een emotie als liefdevol. Meer specifiek voor deze zin zou ik zeggen: sterker accent op ”ben”, en ”schat” langer aanhouden. snelheidsvariatie in de zin maakt emotie: in deze zin kan ’ben je toch’ sneller uitgesproken worden en iets hoger dan de rest.

Table A.32: Comments on the loving sentence 7: ”Wat ben je toch een schat.”

agitatie komt wel naar voren, misschien moet je dat verrast noemen? Beetje hyper, opgewonden (misschien eerder dan verrast) ik denk dat de klemtonen verkeerd liggen, het ligt nu op zowel ’morgen’ als ’schoonfamilie’, terwijl, als je echt verrast bent, de klemtoon alleen op ’schoonfamilie’ legt. en de woorden worden te snel uitgesproken. Ja, maar spreeksnelheid misschien net iets te hoog Klinkt eerder boos, omdat het snel uitgesproken wordt en de woorden zijn erg afgebeten. maar de klemtoon moet wat sterker meer opgewonden morgen komt mijn schoonfamilie op bezoek mi mi mi mi sol fa mi re do naar het einde toe van de zin kan het iets rustiger Wederom gaat de intonantie van de zin in het tweede gedeelte naar beneden wat het verrassings element eruit haalt.

Table A.33: Comments on the surprised sentence 8: ”Morgen komt mijn schoonfamilie op bezoek.”

de stem valt hier wat weg, das erg jammer de toonvariatie in de laatste 4 worden is te klein. kwijt kan meer benadrukt worden door iets meer tijd voor dat woord te nemen. Het klinkt wel wat banger, maar misschien een optie om ook de nadruk op kwijt te leggen? Daar gaat het namelijk op en als je bang bent leg je daar onbewust de nadruk op... (denk ik) nadruk ligt wat veel op ’niet’. Is bij andere fragmenten ook zo ook hier weer de klemtonen opmerking, ze zegt: ik hoop NIET dat we hem kwijt zijn geraakt, maar ik denk dat je normaalgesproken zou zeggen: ik hoop niet dat we hem KWIJT zijn geraakt. En ze slikt het einde van de zin in, maar verder klinkt het wel wat paniekerig.” Paniek! Bezorgdheid. zelfde als vorige, klemtoon sterker

Table A.34: Comments on the scared sentence 9: ”Ik hoop niet dat we hem kwijt zijn geraakt.”

Dat had je best zelf kunnen verzinnen. klemtoon, afknijping op einde van het laatste woord. het zal alleen bedroefd klinken als de stem verder ’normaal’ is. iets te traag Ja, maar spreeksnelheid net iets te traag. klink moe (alweer) Klinkt als iemand die levensmoe is, maar niet bedroefd. Bij bedroefdheid moeten woorden minder helder uigesproken worden. Er ligt ook veel nadruk op het woord ’zelf’.

Table A.35: Comments on the sad sentence 10: ”Dat had je best zelf kunnen verzinnen.”

blijheid toon is goed, echter ontbreekt de klemtoon op de ”vandaan” om aan te geven dat het een vraag is. door de hoogte wordt het een totaal andere stem. De bedoeling is m.i. om de 2e stem blij te laten klinken (waarschijnlijk iets te hoog) eerder bang hehehe ’t klinkt net als een kabouter die iets vraagt :) klemtoon op dat (hoger), vraagteken krijgt kleur door ’haald’ hoger te maken. Klinkt eerder paniekerig en haastig dan blij. Juist bij zo’n ’domme’ zin als dit verwacht je een langerektere ’vandaan’ of ’waar’. De ’ge’ van ’gehaald’ is afleidend evenals het omhooggaan van de ’daan’ in ’vandaan’. klinkt eerder verrast dan blij ze zegt echt VERhaald, niet GEhaald ;) ik denk dat als je blij bent, dat je dan juist harder gaat praten, niet zachter.

Table A.36: Comments on the happy sentence 11: ”Waar heb je dat vandaan gehaald?”

bang word in het totaal niet goed uitgebeeld de standaard arzelingen van angst komen niet over deze klinkt meer verdrietig eigenlijk... Klinkt eerder bezorgd dan echt bang. Sowieso is ook in het neutrale fragment de nadruk op ’vandaan’ te groot. Het neutrale fragment klinkt zelfs al enigzins beschuldigend. Dat verdwijnt niet geheel door het versnellen. Bij echte angst krijgen mensen vaak ook een soort spraakgebrek, zoals niet uit de woorden kunnen komen of stotteren.”

Table A.37: Comments on the scared sentence 12: ”Waar heb je dat vandaan gehaald?”

The second one sounds a bit strange; I don't know what causes it.
There is no change in it, a real pity.
The end again goes down too fast and too much... Otherwise neat...
In the second sentence the stress on 'schoonfamilie' is not pleasant to the ear. I don't perceive anything loving in this sentence.
There is a pitch bump in the word 'schoonfamilie' that makes the word sound silly. Also, with loving I think more of soft whispering and somewhat less tightly articulated words.
Sounds a bit strange.
Maybe it has a loving effect if, besides 'schoonfamilie', you also raise the words 'op bezoek'; do let 'zoek' drop a little again, though.
She sounds hoarse in the word 'schoonfamilie', but I can't call that loving.

Table A.38: Comments on the loving sentence 13: ”Morgen komt mijn schoonfamilie op bezoek.”

The word 'eind' needs to be stretched longer to sound properly bored, as does the word 'wanneer'. Apart from that, the speed (slowness, really) does a lot.
I think boredom is best expressed through variation in speed. You could maybe stretch the word 'wanneer' a bit, and put the words 'eind aan' at the same pitch (maybe you already had that).
Could you also stretch 'wanneer' a bit more and let 'aan' run off upwards?
Stress: when you are bored you say "wanneer komt HIER nou eens een eind aan", preferably together with a sigh. The fragment sounds enormously electronic.
Seems a lot like sad, but the stress on 'eind' needs to be better.
Could also be sad.

Table A.39: Comments on the bored sentence 14: "Wanneer komt hier nou eens een eind aan."

Stretch 'dat' a little in time; make 'doen' lower than the word before it.
I think there is only a change at the little word 'had'...
Again little difference between the two fragments. Loving should above all be softer, with less sharp words.

Table A.40: Comments on the loving sentence 15: ”Dat had je niet moeten doen.”

There should be more emphasis on the word 'dat'. 'Dat' and the 'wacht' of 'verwacht' should go up more.
It does sound surprised, but the end of the sentence is swallowed, and I would put the stress on the word DAT, not on the word NIET.
It could have been a bit faster.
I think the stress also largely determines the emotion. With 'surprised' you expect a stress shift; in this sentence I would move the stress to 'dat'.
Stress is wrong.
Maybe an idea to put the stress on 'dat'?
Should go up (in pitch) at the end.
The emphasis is on the wrong word again. The pitch of 'dat' could be higher, 'wacht' lower. Take a bit more time for 'dat'.
Could be a bit slower. Again a strange accent. I would put the accent on 'dat': "Maar DAT had ik niet verwacht!"

Table A.41: Comments on the surprised sentence 16: ”Maar dat had ik niet verwacht.”

Angry rather, indignant.
The word 'schoonfamilie' has too much pitch variation. The voice could also sound a bit more 'strangled'.
Sounds a bit strange at the end, but otherwise quite good.
Let the pitch drop a little more at the end.
Maybe the voice is a bit on the high side.

Table A.42: Comments on the scared sentence 17: ”Morgen komt mijn schoonfamilie op bezoek.”

A different sentence might have been better, something like "had je dat echt moeten doen?" with the stress on 'echt'.
The 'doen' should go down instead of up. Also, a sigh could sound throughout the whole sentence, and more time could be taken for the word 'niet'.
Make 'doen' lower. Boredom only comes out when you add a word like 'nou', as in 'nou niet moeten'.
It is soporific, but it does not sound bored; more like someone who has just got out of bed, or is still under the influence or something...

Table A.43: Comments on the bored sentence 18: "Dat had je niet moeten doen."

That 'surprised' emotion is okay, except that the words are swallowed and the stresses are in the wrong places.
Angry rather; with surprise the accent would be on 'dat' (or, if need be, on 'vandaan'). Nothing is wrong with the other aspects of the intonation.
In the context of a story this can be surprised enough. Still, I miss the variation in timing: give 'waar' more time, make 'dat' and 'haald' a bit higher.
The stress on 'dat' needs to be stronger.
Sounds like a mother in a hurry who wants to know NOW where her child got that sweet from. Agitated and hurried, then, not surprised. The 'dat' should be spoken with more volume, higher and more drawn out.
Maybe just a bit too fast.

Table A.44: Comments on the surprised sentence 19: ”Waar heb je dat vandaan gehaald?”

This is too high and sounds more like 'scared' to me.
Anxious rather. Very 'pinched'.
Scared rather.
It sounds like Paulus de Boskabouter! When you are happy you don't suddenly get a very high, pinched little voice, in my opinion.
Could be happy and scared.
Sounds like someone with a squeezed throat or after a blow below the belt != happy. The emphasis on 'niet moeten' still gives it something of an accusation. The little word 'doen' could be a bit longer. There should also be an undertone of not meaning it, but that is tricky and context-dependent: you are probably happy because the person did do it.
'Niet' can be higher, 'doen' lower in pitch.

Table A.45: Comments on the happy sentence 20: ”Dat had je niet moeten doen.”

The 'daan' of 'vandaan' should be longer and not so high. The last syllable of 'gehaald' could also be longer. More emphasis on 'dat', so that it sounds more like 'dat nou weer'. Apart from that, the speed is okay and so is the pitch.
Does sound bored, but could be a bit less slow.
Take more time for 'waar' and 'dat'; 'dat' could be a bit higher still, and let 'haald' drop again a bit.
Here too, for 'bored' I would expect a stress shift, or actually no stress at all. Bored people speak very monotonously.

Table A.46: Comments on the bored sentence 21: "Waar heb je dat vandaan gehaald?"

'Doen' a bit lower in pitch again.
I think the first part, "dat had je", sounds good; after that it seems as if someone else is recording it...
It sounds more frustrated, as if you are close to a nervous breakdown, like an overstressed teacher in front of the class, you know the type ;)
Here I would expect several clear stresses to occur. And more staccato, pauses between the words.
Stress on 'niet' stronger.
'Niet' could be higher still = more emphasis. Apart from that it is good. The unprocessed fragment already sounds somewhat irritated as well.

Table A.47: Comments on the angry sentence 22: ”Dat had je niet moeten doen.”

Too fast, like a stretched rubber band; could be lower and calmer.
Do let the pitch of 'zoek' drop, and take a bit more time for 'morgen'.
The last syllable of 'bezoek' could be longer, in anticipation. 'Schoonfamilie' could have more emphasis. Also more emphasis on 'morgen', also in anticipation. Apart from that the fragment is far too high, just like a mouse. And out of excitement (= happiness) one may start talking louder.
Could hardly understand it.
Could be anything (angry, resolute) but not happy. Don't know what causes it.

Table A.48: Comments on the happy sentence 23: ”Morgen komt mijn schoonfamilie op bezoek.”

Just little intonation... unless the first fragment is also loving ;)
These are, in my opinion, the same again, so no comparison is possible.
Because of the emphasis on the question intonation it sounds like an accusation. Loving could be softer, less tightly articulated.
I don't know whether you can pronounce such a sentence lovingly; it is too monotonous now. Whether you get this emotion with stresses on 'waar' (loudness) and 'dat' (higher), I wonder.

Table A.49: Comments on the loving sentence 24: ”Waar heb je dat vandaan gehaald?”

That is veeery mild...
You could make 'dat' a bit higher in tone and stronger, and also raise 'haald' a bit to make it sound less dull.
Does sound fairly angry, partly because the question already sounds like an accusation. The word 'waar' could be louder and longer for more emphasis. Apart from that the speed is fine. The second syllable of 'vandaan' still gets a bit too much emphasis because it goes up.
She sounds more afraid of the possible answer she will get than angry.
I think that when you are angry your voice gets lower and louder, to begin with.

Table A.50: Comments on the angry sentence 25: ”Waar heb je dat vandaan gehaald?”

Spoken too fast, pitch a bit too high.
It also sounds happy when it is a bit calmer and lower, but the tempo is maybe just a bit too fast.
More length in 'gelooooopen', anticipation. The 'af' of 'afgelopen' may start fairly high. Apart from that it sounds more like that ant from the Fabeltjeskrant than like a happy person.
Take more time for 'bijna'; you can also make it sound a bit higher, then there is more relief in it!

Table A.51: Comments on the happy sentence 26: "Deze test is bijna afgelopen."

The words 'morgen' and 'schoonfamilie' could be lower and louder than the rest of the sentence = more threatening. Apart from that, the word 'schoonfamilie' is just strange again, with too many pitch hops.
The same here as I mentioned before, namely the stress in 'schoonfamilie'.
Yes, but the pitch variation is a bit vague. But still juuust not quite there.
Strange though, that pitch on 'schoon'. I think you achieve more by making it sound a bit louder.
Maybe you can make 'morgen' a bit stronger (louder).
Wow! That one is really frustrated! What kind of in-laws would those be?
Could be angry, but some of the earlier fragments sounded angrier than this one... There are also some strange artefacts in it.

Table A.52: Comments on the angry sentence 27: ”Morgen komt mijn schoonfamilie op bezoek.”

A bit of a "what does that matter anyway" tone :)
Very nice, that drawn-out 'schoon'. Despite the tinniness of the voice this does sound bored.
A bit too slow. In itself it sounds fairly bored.
Yes, but the speech rate is maybe just a bit too slow.
The speed is roughly right; still, you should vary the speed more over the sentence as a whole.
Too slow.
Quite exaggerated, though (especially the speed).

Table A.53: Comments on the bored sentence 28: "Morgen komt mijn schoonfamilie op bezoek."

Appendix B

Maintainer Manual

B.1 Festival code manual

Festival uses a very adaptable and modular approach in its text-to-speech conversion. The source code is partly in C++ and partly in Scheme (specifically SIOD), a Lisp variant. This makes tracing the code path more difficult, as calls are made from Scheme code to C++, from C++ to Scheme, and Lisp instructions are also invoked internally from C++. Furthermore, Festival uses libraries from the Edinburgh Speech Tools (EST), which contain the extended SIOD implementation. SIOD stands for "Scheme In One Defun". The original SIOD implementation has been extended by Alan Black to include features used in the EST and Festival. Documentation on Festival is available at http://festvox.org/docs/manual-1.4.3/ and documentation on Scheme is available at http://tinuviel.cs.wcu.edu/res/ldp/r4rs-html/. The full list of function and variable descriptions extracted from the source code is available in the digital attachment.

B.1.1 Source code locations

It is assumed that the Festival and Edinburgh Speech Tools directories are located in the same directory: for instance, Festival could be located in /usr/src/festival and the EST in /usr/src/speech_tools. Of course, symlinks are fine as well. The SIOD implementation is located in speech_tools/siod. The Festival C++-based modules are located in festival/src/modules, and the Festival Scheme files in festival/lib. The NeXTeNS source code additions are located in festival/lib/net_nl, festival/voices/dutch, festival/src/modules/NextensTimbl and festival/src/modules/NextensMbt.

B.1.2 Compiling instructions

Compile the EST first, then compile Festival. Note that the NeXTeNS modules are not properly cleaned on a 'make clean'.

B.1.3 Code structure

Each part of the pipeline, from the input parser, the tokenizer, the syntax checker and the POS tagger to the duration and intonation modules, is listed in the utterance type specifications. These specifications can be overridden by voice initialisation routines, such as net_nl_ib (the NeXTeNS default voice). It all starts with a call to tts_file, which starts an input parser that scans the input and puts it into an utterance object, which is then passed on to the actual synthesising function (starting in tts.scm, then text.cc and text_modes.cc, and ending again in tts.scm with the calling of the tts_hooks). The tts_hooks then call utt.synth (synthesis.scm), which in turn calls all the functions in the utterance type list.
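
For orientation, the same pipeline can be driven by hand from the Festival prompt. The following is a minimal sketch, assuming Festival has been started with the NeXTeNS Dutch voice selected; utt.synth applies exactly the functions listed in the utterance type specification.

    ;; Minimal sketch: create a Text utterance, synthesise and play it.
    ;; Assumes the NeXTeNS Dutch voice has already been selected.
    (set! utt1 (Utterance Text "Morgen komt mijn schoonfamilie op bezoek."))
    (utt.synth utt1)   ; runs Initialize, Text, Token, ..., Wave_Synth in order
    (utt.play utt1)    ; plays the resulting waveform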


The NeXTeNS "Text" utterance type consists of calls to the following functions (set in net_nl_synthesys); a sketch of how such an utterance type is declared follows Table B.1:

Initialize (in festival/src/modules/base/modules.cc)
Text (in festival/src/modules/Text/text.cc)
Token (in festival/lib/token.scm)
POS (in festival/lib/pos.scm)
Syntax (in festival/lib/net_nl/net_nl_syntax.scm)
Phrasify (in festival/lib/phrase.scm)
Intonation (in festival/lib/intonation.scm)
Tune (in festival/lib/net_nl/net_nl_tune.scm)
Word (in festival/lib/lexicons.scm)
Pauses (in festival/lib/pauses.scm)
PostLex (in festival/lib/postlex.scm)
Duration (in festival/lib/duration.scm)
Int_Targets (in festival/lib/intonation.scm)
Wave_Synth (in festival/lib/synthesis.scm)

Table B.1: NeXTeNS Text utterance type
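
For reference, utterance types in Festival are declared with the defUttType macro. The following is only a sketch of what a declaration containing the modules of Table B.1 looks like; the actual NeXTeNS definition (in net_nl_synthesys) is the authoritative version and may differ in detail.

    ;; Sketch only: an utterance type declaration mirroring Table B.1.
    (defUttType Text
      (Initialize utt)
      (Text utt)
      (Token utt)
      (POS utt)
      (Syntax utt)
      (Phrasify utt)
      (Intonation utt)
      (Tune utt)
      (Word utt)
      (Pauses utt)
      (PostLex utt)
      (Duration utt)
      (Int_Targets utt)
      (Wave_Synth utt))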

The Festival functions in the list above (those not in the net_nl directory) are configured via an extra parameter, which contains the method to use in each module. These parameters are set in festival/lib/voices/dutch/net_nl_ib_mbrola/festvox/net_nl_ib_mbrola.scm. The parameters refer to functions which are then called from the functions in the list above (a sketch of how such parameters are set follows Table B.2). The list of parameters from the net_nl_ib_mbrola voice is as follows:

Token_Method: Token_Any (in festival/src/modules/Text/token.cc)
Syntax_Method: nil
Phrasify_Method: nl::prosit-break (in festival/lib/net_nl/net_nl_break_prosit.scm)
Int_Method: nl::prosit_accent_placement (in festival/lib/net_nl/net_nl_accent_prosit.scm)
Tune_Method: nl::basic_tune_choice (in festival/lib/net_nl/net_nl_tune.scm)
Word_Method: nl::word (in festival/lib/net_nl/net_nl_lex.scm)
Pause_Method: nl::pauses (in festival/lib/net_nl/net_nl_pauses.scm)
Duration_Method: KUN_Duration (in festival/lib/net_nl/net_nl_dur_kun.scm)
Int_Target_Method: ToDI-intonation (in festival/lib/net_nl/net_nl_int_todi.scm)
Synth_Method: nl::mbrola_synth (in festival/lib/net_nl/net_nl_mbrola.scm)

Table B.2: List of parameters from net_nl_ib_mbrola
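
These method parameters are installed with Parameter.set in the voice initialisation. The lines below are a sketch only; the real definitions live in the net_nl_ib_mbrola voice file named above and may reference the method functions differently.

    ;; Sketch only: installing a few of the methods from Table B.2.
    (Parameter.set 'Duration_Method KUN_Duration)
    (Parameter.set 'Int_Target_Method ToDI-intonation)
    (Parameter.set 'Synth_Method nl::mbrola_synth)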

B.1.4 Data structure

Throughout the TTS process, a single data object, the utterance object, is passed along to almost all the functions. The functions read from, modify and add to the data of this utterance object. This object contains multiple trees of EST_Value objects. An utterance can contain many "relations", which are sets of trees. Each unique EST_Value object can occur in a relation only once, but it can occur in multiple relations simultaneously. An item can have zero or more key-data pairs called "features". The Initialize function creates an utterance object with just an empty Token relation, which is then filled with tokens read from the parsed text in the Text function. The tokens are then tokenized and translated into words, which are stored in the Word relation. In order to link the words to the token(s) they were derived from, they are also added as subnodes to the token in the Token relation. The nl::prosit-break function adds "pbreak" features to the items in the Word relation, which are translated in KUN_Duration into pause segments for the MBROLA output.

The nl::word function creates the NeXTeNS-specific trees in the utterance object, namely: Syl-Part, Syllable, Foot, ProsWord1, ProsWord2, Word-Pros, ProsTree and Syl-Int. Each word in the Word relation is also in the ProsWord1 relation. Compound words (samengestelde woorden) are split up, and each part is put in the ProsWord2 relation. Each ProsWord2 item has Appendix or Foot subnodes, which in turn consist of Syllable subnodes. Syllable nodes have one or more of Onset, Nucleus and Coda as subnodes (depending on the type of syllable), which in turn each have one Segment subnode. The segments are directly linked to the MBROLA output that Festival passes on to the MBROLA synthesiser. The whole tree, from ProsWord1 to Segment, is stored in the ProsTree relation. The ToDI-intonation function translates the ToDI accents present in the features of items in the Intonation relation into F0 targets of items in the Segment relation. The nl::mbrola_synth function reads the data from the Segment relation and feeds it to the MBROLA engine. The resulting wave file is then reloaded into Festival.
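
A minimal sketch of inspecting this utterance data from Scheme is given below. It uses only generic Festival item and relation accessors; the relation and feature names are the ones described above, and the exact tree shape may differ per word.

    ;; Sketch: print each word, its pbreak feature and, if the word is also
    ;; a node in the ProsTree relation, the names of its prosodic daughters.
    (define (dump-words utt)
      (mapcar
       (lambda (word)
         (format t "%s pbreak=%l\n" (item.name word) (item.feat word "pbreak"))
         (let ((pword (item.relation word 'ProsTree)))
           (if pword
               (mapcar
                (lambda (daughter) (format t "  %s\n" (item.name daughter)))
                (item.daughters pword)))))
       (utt.relation.items utt 'Word)))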

B.1.5 SSML implementation

The XMLreader checks the ssml_elements variable for every tag encountered. If it is one of the supported tags that apply to a span of tokens, it is stored in the SSMLTree relation, with the tokens to which the tag applies as subnodes. Because a tag can apply to multiple tokens, and a token can be influenced by multiple tags, a single tree is insufficient to store this (a token would have to appear in multiple trees within the same relation, which is not allowed). To this end, when a token is already in the SSMLTree, a copy is made and added instead. In order to keep track of which copy belongs to which original, all tokens added to the SSMLTree are also added to the LookupTree relation, with the original tokens as root elements and their copies as subnodes. This allows easy iteration over all recognised SSML tags and the tokens they apply to (and vice versa). The various implemented SSML tags influence different prosodic attributes: speech rate, F0, F0 range and volume. There are a few places in the functions above where code from ssml-mode.scm is called in order to apply the information present in the SSML tags to the utterance data. Calling functions like this is done via hooks: calls to the function apply_hooks are made, with a list of functions as an argument, in specific places. This is the list of hooks for the SSML handling (hook name and function called):

post_phrasing_hooks: nl::ssml_apply_break_changes
nl::post_dur_kun_hooks: nl::ssml_apply_duration_changes
nl::during_ToDI-intonation_hooks: nl::ssml_apply_range_changes
nl::post_ToDI-intonation_hooks: nl::ssml_apply_intonation_changes
after_analysis_hooks: ssml_apply_volume_changes

Table B.3: List of SSML hooks

The names of the hooks speak for themselves: the post_phrasing_hooks are called after the Phrasify function, the nl::post_dur_kun_hooks are called after KUN_Duration, etc. The after_analysis_hooks are called before the actual MBROLA synthesis, but after everything else. The reason that the F0 range changes are applied during the ToDI-intonation process is that the F0 is calculated on the fly, based on a global parameter N. This was needed in order to be able to change the F0 range for just the words enclosed between the <prosody> and </prosody> tags, without interfering too much with the existing code.
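
As an illustration of the hook mechanism, the sketch below registers an extra function on one of the hook lists of Table B.3; the added function is hypothetical and only has to accept and return the utterance object.

    ;; Sketch: add a hypothetical extra step to the after-analysis hooks.
    (define (my_extra_ssml_step utt)
      (format t "utterance has %d words\n"
              (length (utt.relation.items utt 'Word)))
      utt)

    (set! after_analysis_hooks
          (append after_analysis_hooks (list my_extra_ssml_step)))

    ;; Inside the pipeline the hooks are then run with, for example:
    ;; (apply_hooks after_analysis_hooks utt)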

B.2 VST XML Preparser

The Preparser converts XML from the VST System into SSML.

B.3 SSML Function list

This section lists the functions present in ssml-mode.scm, along with the arguments each expects and a short description of what it does or returns. A short usage sketch follows Table B.4.

Table B.4: SSML Function list

Each entry below is listed as function name (arguments), followed by a description.

ssml_init_globals (none): Initialises the global variables used in SSML processing.
css2_time_to_ms (text): Returns the amount of time specified in text, according to the CSS2 specification, in ms.
ssml_init_func (none): Initialisation for SSML mode.
ssml_exit_func (none): Exit function for SSML mode.
ssml_setup_voice_params (none): Sets up the original values of various voice parameters.
ssml_select_language (ATTLIST): SSML-mode language selection; translates xml:lang parameters into languages.
print_item (item): Prints an EST_Item.
print_list (list): Recursively prints anything in list.
ssml_print_item_recursive (item level): Recursive version of print_item; level contains the prefix used for indenting child items.
ssml_print_utt_relation (UTT relation): Prints all items in the given relation.
ssml_push_language (language): Adds language to the top of the stack and sets the next voice to be activated to the default.
ssml_pop_language (none): Removes the top language from the stack. It is up to the caller to also call ssml_pop_voice.
ssml_push_voice (voice): Adds voice to the top of the stack and activates it. The top of the stack is the car element.
ssml_pop_voice (none): Removes the top voice from the stack and activates the prior voice. If there is no prior voice, the default voice for the current language is activated.
ssml_mark_tokens (utt type): Marks all tokens from the last one backwards to the start token. The type of mark given has to match the last item on the SSMLStack.
ssml_add_SSMLStack_item (UTT item features): Adds an item with the given name and features to the SSMLStack.
ssml_print_SSMLStack (UTT): Prints the SSMLStack associated with the utterance.
ssml_pop_SSMLStack_item (UTT itemname): Pops the item with itemname from the top of the SSMLStack if it is there.
nl::ssml_get_word_segments (utt word): Returns a list of segment items (EST_Val items) belonging to the word (an item of the 'Word relation).

nl::ssml_apply_break_changes (utt): Walks through the 'Word tree and adds break features to words when indicated by SSML tags.
nl::ssml_change_segment_durations (segments duration): Expects duration to be a list ("absolute", value) or ("relative", value) and segments to be a list of segments, and applies the duration to all segments.
nl::ssml_sum_segment_durs (number segments): Returns the total sum of the 'dur feature of all segments in the given list.
nl::ssml_fix_segment_endtimes (utt): Walks through the segments of the 'Segment relation and recalculates their end times from the segments' 'dur feature.
nl::ssml_apply_duration_changes (utt): Walks through the 'Word tree and gathers prosody tags, then applies the following speech-rate-related tags: FIXME and FIXME.
nl::ssml_interpolate_target_f0 (utt target segment): Interpolates the F0 for the given segment from the surrounding segments that have a target pitch. The interpolation is linear over the segments' time (not the number of segments). Returns the interpolated segment F0.
nl::ssml_change_segment_pitch (utt segments pitch): Updates the pitch for all segments. If a segment has no pitch but needs a change, it is obtained through linear interpolation from surrounding segments with pitch. pitch can have the following values: ("absolute", value) with value in Hz, ("relative", value) with value a multiplication factor, or ("shift", value) with value in Hz.
ssml_get_previous_segment (utt segment): Returns the segment prior to segment, even if it is not in the same word.
ssml_get_next_segment (utt segment): Returns the segment after segment, even if it is not in the same word.
ssml_parse_pitch_attribute (attribute): Checks the attribute for W3C compliance and returns a modification tuple. The modification tuple is ("absolute", value) with value in Hz, ("relative", value) with value a multiplication factor, or ("shift", value) with value in Hz.
ssml_parse_contour_attribute (attribute): Parses the attribute value, checking for W3C compliance, and returns a list of tuples (value, modification tuple) with value a multiplication factor.
nl::ssml_apply_intonation_changes (utt): Applies the pitch-related prosody tags.

nl::ssml_change_range (utt range): Changes the parameter 'N to range. range is a tuple containing ("absolute", value), ("relative", value) or ("shift", value). Saves the old value of 'N so that it can be restored before the next word is processed.
nl::ssml_restore_f0range (none): Restores the F0 range parameter from the backup.
nl::ssml_apply_range_changes (utt target): Is called for every word in R:Word before intonation is applied. Checks for the range attribute and parameter, and sets 'N (NeXTeNS uses this parameter for the F0 range) accordingly. target is the current target in R:ToneTargets.
ssml_parse_volume_attribute (attribute): Checks the attribute for W3C compliance and returns a linear multiplication factor.
ssml_apply_volume_changes (utt): Scans the SSMLTree for volume changes and rescales the entire utterance accordingly.
ssml_after_analysis_debug (utt): Your favourite debug code here.
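
To illustrate how the pitch-modification tuples described above are meant to be used, the following sketch raises the pitch of the segments of the first word by 20 percent. It assumes the function names and argument lists exactly as listed in Table B.4, and an utterance utt that has already been processed up to the intonation stage.

    ;; Sketch only: relative pitch change on the segments of the first word.
    (let* ((word (car (utt.relation.items utt 'Word)))
           (segs (nl::ssml_get_word_segments utt word)))
      (nl::ssml_change_segment_pitch utt segs (list "relative" 1.2)))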

B.4 SSML Variable list

W3C_NUMBER: A regular expression to verify whether a number conforms to the W3C specification.
W3C_SSML_PITCH_VALUE: A regular expression to verify whether a pitch attribute value conforms to the W3C specification.
W3C_SSML_CONTOUR_VALUE: A regular expression to verify whether a contour attribute value conforms to the W3C specification.
W3C_SSML_VOLUME_VALUE: A regular expression to verify whether a volume attribute value conforms to the W3C specification.
ssml_pitch_base_map: The lookup table that translates x-high, high, medium, default, low and x-low into appropriate pitch multiplication factors.
ssml_pitch_range_map: The lookup table that translates x-high, high, medium, default, low and x-low into appropriate pitch-range multiplication factors.
ssml_rate_speed_map: The lookup table that translates x-fast, fast, medium, default, slow and x-slow into appropriate speech-rate multiplication factors.
ssml_volume_level_map: The lookup table that translates x-loud, loud, default, medium, soft, x-soft and silent into appropriate volume percentages.
nl::ssml_pause_map: The lookup table that translates x-strong, strong, medium, weak, x-weak and none into appropriate break-duration values (ms).
ssml_elements: The big list of SSML tags and what should be done when each is encountered.

Table B.5: SSML Variables
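
As an indication of the shape of these lookup tables: they are ordinary association lists. The sketch below uses made-up placeholder factors, not the values from the actual implementation, and a renamed variable so as not to clash with the real one.

    ;; Sketch with placeholder values only.
    (set! example_rate_speed_map
          '((x-slow  0.5)
            (slow    0.8)
            (medium  1.0)
            (default 1.0)
            (fast    1.3)
            (x-fast  1.6)))

    ;; Looking up a factor:
    (cadr (assoc 'fast example_rate_speed_map))   ; => 1.3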