Prosodic and Voice Quality Cross-Language Analysis of Storytelling Expressive Categories Oriented to Text-To-Speech Synthesis
DOCTORAL THESIS

Title: Prosodic and Voice Quality Cross-Language Analysis of Storytelling Expressive Categories Oriented to Text-To-Speech Synthesis
Author: Raúl Montaño Aparicio
Centre: Escola Tècnica Superior d’Enginyeria Electrònica i Informàtica La Salle, Universitat Ramon Llull
Department: Grup de Recerca en Tecnologies Mèdia (GTM)
Supervised by: Dr. Francesc Alías Pujol

To my parents, Esperanza & Juan.

ABSTRACT

For ages, the oral interpretation of tales and stories has been a worldwide tradition tied to entertainment, education, and the perpetuation of culture. Over recent decades, several works have analysed this particular speaking style, which is rich in subtle expressive nuances conveyed by specific acoustic cues. In parallel, there has been growing interest in the development of storytelling applications, such as those related to interactive storytelling.

This thesis deals with one of the key aspects of audiovisual storytellers: improving the naturalness of expressive synthetic speech by analysing storytelling speech in detail, and providing better non-verbal language to a speaking avatar by synchronizing that speech with its gestures. To that end, it is necessary to understand in detail both the acoustic characteristics of this speaking style and the interaction between speech and gestures. Regarding the acoustic characteristics, the related literature has analysed storytelling speech mainly in terms of prosody, and has only suggested that voice quality may play an important role in modelling its subtleties.
In this thesis, the role of both prosody and voice quality in indirect storytelling speech is analysed across languages in order to identify the main expressive categories it is composed of, together with the acoustic parameters that characterize them. To do so, an analysis methodology is proposed to annotate this speaking style at the sentence level based on storytelling discourse modes (narrative, descriptive, and dialogue), and narrative sub-modes are also introduced. Following this annotation methodology, the indirect speech of a story oriented to a young audience (covering its Spanish, English, French, and German versions) is analysed in terms of prosody and voice quality through statistical and discriminant analyses, after classifying the sentence-level utterances of the story into their corresponding expressive categories. The results confirm the existence of storytelling categories containing subtle expressive nuances across the considered languages, beyond the narrators’ personal styles. Moreover, a comparison of their acoustic patterns with those obtained from emotional speech data provides evidence that these storytelling expressive categories are conveyed with subtler speech nuances than basic emotions. The analyses also show that prosody and voice quality contribute almost equally to the discrimination among storytelling expressive categories, which are conveyed with similar acoustic patterns across languages. It is also worth noting the strong relationship observed across narrators in the expressive category selected for each utterance, even though, to our knowledge, they had been given no prior instructions.

In order to transfer all these expressive categories to a corpus-based Text-To-Speech system, a speech corpus would need to be recorded for each category. However, building ad-hoc speech corpora for each and every specific expressive style is a very daunting task.
In this work, we introduce an alternative based on an analysis-oriented-to-synthesis methodology designed to derive rule-based models from a small but representative set of utterances, which can then be used to generate storytelling speech from neutral speech. The experiments conducted on increasing suspense as a proof of concept show the viability of the proposal in terms of naturalness and storytelling resemblance.

Finally, regarding the interaction between speech and gestures, an analysis is performed in terms of time and emphasis, oriented to driving a 3D storytelling avatar. To that end, strength indicators are defined for speech and gestures. After validating them through perceptual tests, an intensity rule is obtained from their correlation. Moreover, a synchrony rule is derived to determine temporal correspondences between speech and gestures. These analyses have been conducted on aggressive and neutral performances to cover a broad range of emphatic levels, as a first step towards evaluating the integration of a speaking avatar after the expressive Text-To-Speech system.

Ah, they’ll never ever reach the moon, at least not the one that we’re after.
(Leonard Cohen)

ACKNOWLEDGMENTS

I would like to begin by thanking those to whom this thesis is dedicated, my parents, for all their support in letting me do what I wanted with my academic and professional life, as well as for their daily help in making everything a little easier for me. Thanks also to the rest of my close family, who have always encouraged me to pursue my goals.

Next, I want to thank my thesis supervisor, Francesc Alías, for being quite possibly the best supervisor in history. Having him by my side on this adventure has been a real stroke of luck. I have learned a great deal from him, and I consider myself a better professional and person largely thanks to his advice. He is also surely the hardest-working person I know and, why not say it, a cool guy.
Many other people have also accompanied me almost daily on this adventure. If I may, I would first like to mention my comrade in arms, Marc Freixes, a great person with whom I have shared a desk, work, and good times. I hope everything goes well for you and that you become a doctor next year. There are also the people who have been there from the beginning or from some point along the way: Àngels, Lluís, Ramón, Arnela, Patri, Sevillano, Rosa, Adso, Alan, Marc Antonijoan, Àngel, Ale, Davide, Marcos, Ferran, Barco, Alejandro, Pascal (I hope I have not left anyone out)... As I often say, thank you for being the way you are; it will be hard to find a group of people as special as you.

There have also been professors who have helped me and taught me many things, such as Joan Claudi Socoró, Ignasi Iriondo, and David Miralles. To all of them, my most sincere thanks for guiding me and for valuing my work.

Others who have encouraged me in their own way are the people at the La Salle restaurant, such as Lídia, Carmen, Lucía, Gustavo, and Carlos. Healthy, fun people who also make very good coffee!

Thanks to my lifelong friends who, although they often told me they did not know what I was doing or what I would do afterwards (well, they were not really the only ones), are an incredible group that I feel very lucky to be part of. We will surely celebrate the completion of this thesis with a few beers at the KPC.

I must also thank those who awarded me the grant that allowed me to carry out my doctorate, namely the Secretaria d’Universitats i Recerca and the Departament d’Economia i Coneixement of the Generalitat de Catalunya and the European Social Fund, for the FI grant (No. 2013FI_N 00790, 2014FI_B1 00201, and No. 2015FI_B2 00110), as well as for the funding of the Grup de recerca en Tecnologies Mèdia of La Salle - Universitat Ramon Llull (refs. 2009-SGR-293 and 2014-SGR-0590), and for the partial support from project CEN-20101019, granted by the Spanish Ministerio de Ciencia e Innovación.

Finally, I would like to thank Laia for being by my side these last two and a half years. Being with you has made my life much happier and has made me a better person. Now that the doctorate is over, one stage of our lives closes and another begins, in which we will share many more experiences together.

PUBLICATIONS

This thesis is based on the following publications:

- R. Montaño, F. Alías, and J. Ferrer. Prosodic analysis of storytelling discourse modes and narrative situations oriented to Text-to-Speech synthesis. In 8th ISCA Workshop on Speech Synthesis, pages 171–176, Barcelona, Spain, 2013.
  Contributions of the author to the work:
  • Annotation methodology definition
  • Speech corpora annotation and segmentation
  • Prosodic and statistical analyses
  • Writing of the publication, including the creation of Tables and Figures

- A. Fernández-Baena, R. Montaño, M. Antonijoan, A. Roversi, D. Miralles, and F. Alías. Gesture synthesis adapted to speech emphasis. Speech Communication, 57:331–350, 2014.
  Contributions of the author to the work:
  • Gestures and speech annotation and segmentation
  • Classification analyses of speech and gestures
  • Gestures and speech correlation analysis
  • Writing of some parts of the publication, including the creation of some Tables and Figures

- R. Montaño and F. Alías. The role of prosody and voice quality in text-dependent categories of storytelling across languages. In Proc. Interspeech, pages 1186–1190, Dresden, Germany, 2015.

This thesis is also based on the following publications currently under review:

- R. Montaño, M. Freixes, F. Alías, and J. C. Socoró. Generating Storytelling Speech from a hybrid US-aHM Neutral TTS synthesis framework using a rule-based prosodic model. In Proc. Interspeech, San Francisco, USA, 2016. Submitted.
  Contributions of the author to the work:
  • Annotation, segmentation, and analysis of the set of utterances
  • Statistical analysis of the perceptual test results
  • Writing of the publication, including the creation of Figures

- R. Montaño and F. Alías. The Role of Prosody and Voice Quality in Indirect Storytelling Speech: Analysis Methodology and Expressive Categories. Speech Commun., 2016a. Second Round of Revisions.

- R. Montaño and F. Alías. The Role of Prosody and Voice Quality in Indirect Storytelling Speech: A Cross-language Perspective. Speech Commun., 2016b. First Round of Revisions.

CONTENTS

1 INTRODUCTION
1.1 Framework of the thesis