A Review of Deep Learning Based Speech Synthesis
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
THE DEVELOPMENT of ACCENTED ENGLISH SYNTHETIC VOICES By
THE DEVELOPMENT OF ACCENTED ENGLISH SYNTHETIC VOICES by PROMISE TSHEPISO MALATJI DISSERTATION Submitted in fulfilment of the requirements for the degree of MASTER OF SCIENCE in COMPUTER SCIENCE in the FACULTY OF SCIENCE AND AGRICULTURE (School of Mathematical and Computer Sciences) at the UNIVERSITY OF LIMPOPO SUPERVISOR: Mr MJD Manamela CO-SUPERVISOR: Dr TI Modipa 2019 DEDICATION In memory of my grandparents, Cecilia Khumalo and Alfred Mashele, who always believed in me! ii DECLARATION I declare that THE DEVELOPMENT OF ACCENTED ENGLISH SYNTHETIC VOICES is my own work and that all the sources that I have used or quoted have been indicated and acknowledged by means of complete references and that this work has not been submitted before for any other degree at any other institution. ______________________ ___________ Signature Date iii ACKNOWLEDGEMENTS I want to recognise the following people for their individual contributions to this dissertation: • My brother, Mr B.I. Khumalo and the whole family for the unconditional love, support and understanding. • A distinct thank you to both my supervisors, Mr M.J.D. Manamela and Dr T.I. Modipa, for their guidance, motivation, and support. • The Telkom Centre of Excellence for Speech Technology for providing the resources and support to make this study a success. • My colleagues in Department of Computer Science, Messrs V.R. Baloyi and L.M. Kola, for always motivating me. • A special thank you to Mr T.J. Sefara for taking his time to participate in the study. • The six Computer Science undergraduate students who sacrificed their precious time to participate in data collection. -
Gender, Ethnicity, and Identity in Virtual
Virtual Pop: Gender, Ethnicity, and Identity in Virtual Bands and Vocaloid Alicia Stark Cardiff University School of Music 2018 Presented in partial fulfilment of the requirements for the degree Doctor of Philosophy in Musicology TABLE OF CONTENTS ABSTRACT i DEDICATION iii ACKNOWLEDGEMENTS iv INTRODUCTION 7 EXISTING STUDIES OF VIRTUAL BANDS 9 RESEARCH QUESTIONS 13 METHODOLOGY 19 THESIS STRUCTURE 30 CHAPTER 1: ‘YOU’VE COME A LONG WAY, BABY:’ THE HISTORY AND TECHNOLOGIES OF VIRTUAL BANDS 36 CATEGORIES OF VIRTUAL BANDS 37 AN ANIMATED ANTHOLOGY – THE RISE IN POPULARITY OF ANIMATION 42 ALVIN AND THE CHIPMUNKS… 44 …AND THEIR SUCCESSORS 49 VIRTUAL BANDS FOR ALL AGES, AVAILABLE ON YOUR TV 54 VIRTUAL BANDS IN OTHER TYPES OF MEDIA 61 CREATING THE VOICE 69 REPRODUCING THE BODY 79 CONCLUSION 86 CHAPTER 2: ‘ALMOST UNREAL:’ TOWARDS A THEORETICAL FRAMEWORK FOR VIRTUAL BANDS 88 DEFINING REALITY AND VIRTUAL REALITY 89 APPLYING THEORIES OF ‘REALNESS’ TO VIRTUAL BANDS 98 UNDERSTANDING MULTIMEDIA 102 APPLYING THEORIES OF MULTIMEDIA TO VIRTUAL BANDS 110 THE VOICE IN VIRTUAL BANDS 114 AGENCY: TRANSFORMATION THROUGH TECHNOLOGY 120 CONCLUSION 133 CHAPTER 3: ‘INSIDE, OUTSIDE, UPSIDE DOWN:’ GENDER AND ETHNICITY IN VIRTUAL BANDS 135 GENDER 136 ETHNICITY 152 CASE STUDIES: DETHKLOK, JOSIE AND THE PUSSYCATS, STUDIO KILLERS 159 CONCLUSION 179 CHAPTER 4: ‘SPITTING OUT THE DEMONS:’ GORILLAZ’ CREATION STORY AND THE CONSTRUCTION OF AUTHENTICITY 181 ACADEMIC DISCOURSE ON GORILLAZ 187 MASCULINITY IN GORILLAZ 191 ETHNICITY IN GORILLAZ 200 GORILLAZ FANDOM 215 CONCLUSION 225 -
Masterarbeit
Masterarbeit Erstellung einer Sprachdatenbank sowie eines Programms zu deren Analyse im Kontext einer Sprachsynthese mit spektralen Modellen zur Erlangung des akademischen Grades Master of Science vorgelegt dem Fachbereich Mathematik, Naturwissenschaften und Informatik der Technischen Hochschule Mittelhessen Tobias Platen im August 2014 Referent: Prof. Dr. Erdmuthe Meyer zu Bexten Korreferent: Prof. Dr. Keywan Sohrabi Eidesstattliche Erklärung Hiermit versichere ich, die vorliegende Arbeit selbstständig und unter ausschließlicher Verwendung der angegebenen Literatur und Hilfsmittel erstellt zu haben. Die Arbeit wurde bisher in gleicher oder ähnlicher Form keiner anderen Prüfungsbehörde vorgelegt und auch nicht veröffentlicht. 2 Inhaltsverzeichnis 1 Einführung7 1.1 Motivation...................................7 1.2 Ziele......................................8 1.3 Historische Sprachsynthesen.........................9 1.3.1 Die Sprechmaschine.......................... 10 1.3.2 Der Vocoder und der Voder..................... 10 1.3.3 Linear Predictive Coding....................... 10 1.4 Moderne Algorithmen zur Sprachsynthese................. 11 1.4.1 Formantsynthese........................... 11 1.4.2 Konkatenative Synthese....................... 12 2 Spektrale Modelle zur Sprachsynthese 13 2.1 Faltung, Fouriertransformation und Vocoder................ 13 2.2 Phase Vocoder................................ 14 2.3 Spectral Model Synthesis........................... 19 2.3.1 Harmonic Trajectories........................ 19 2.3.2 Shape Invariance.......................... -
Expression Control in Singing Voice Synthesis
Expression Control in Singing Voice Synthesis Features, approaches, n the context of singing voice synthesis, expression control manipu- [ lates a set of voice features related to a particular emotion, style, or evaluation, and challenges singer. Also known as performance modeling, it has been ] approached from different perspectives and for different purposes, and different projects have shown a wide extent of applicability. The Iaim of this article is to provide an overview of approaches to expression control in singing voice synthesis. We introduce some musical applica- tions that use singing voice synthesis techniques to justify the need for Martí Umbert, Jordi Bonada, [ an accurate control of expression. Then, expression is defined and Masataka Goto, Tomoyasu Nakano, related to speech and instrument performance modeling. Next, we pres- ent the commonly studied set of voice parameters that can change and Johan Sundberg] Digital Object Identifier 10.1109/MSP.2015.2424572 Date of publication: 13 October 2015 IMAGE LICENSED BY INGRAM PUBLISHING 1053-5888/15©2015IEEE IEEE SIGNAL PROCESSING MAGAZINE [55] noVEMBER 2015 voices that are difficult to produce naturally (e.g., castrati). [TABLE 1] RESEARCH PROJECTS USING SINGING VOICE SYNTHESIS TECHNOLOGIES. More examples can be found with pedagogical purposes or as tools to identify perceptually relevant voice properties [3]. Project WEBSITE These applications of the so-called music information CANTOR HTTP://WWW.VIRSYN.DE research field may have a great impact on the way we inter- CANTOR DIGITALIS HTTPS://CANTORDIGITALIS.LIMSI.FR/ act with music [4]. Examples of research projects using sing- CHANTER HTTPS://CHANTER.LIMSI.FR ing voice synthesis technologies are listed in Table 1. -
Wormed Voice Workshop Presentation
Wormed Voice Workshop Presentation micro_research December 27, 2017 1 some worm poetry and songs: The WORM was for a long time desirous to speake, but the rule and order of the Court enjoyned him silence, but now strutting and swelling, and impatient, of further delay, he broke out thus... [Michael Maier] He worshipped the worm and prayed to the wormy grave. Serpent Lucifer, how do you do? Of your worms and your snakes I'd be one or two; For in this dear planet of wool and of leather `Tis pleasant to need neither shirt, sleeve, nor shoe, 2 And have arm, leg, and belly together. Then aches your head, or are you lazy? Sing, `Round your neck your belly wrap, Tail-a-top, and make your cap Any bee and daisy. Two pigs' feet, two mens' feet, and two of a hen; Devil-winged; dragon-bellied; grave- jawed, because grass Is a beard that's soon shaved, and grows seldom again worm writing the the the the,eeeronencoug,en sthistit, d.).dupi w m,tinsprsool itav f bometaisp- pav wheaigelic..)a?? orerdi mise we ich'roo bish ftroo htothuloul mespowouklain- duteavshi wn,jis, sownol hof." m,tisorora angsthyedust,es, fofald,junss ownoug brad,)fr m fr,aA?a????ck;A?stelav aly, al is.'rady'lfrdil owoncorara wns t.) sh'r, oof ofr,a? ar,a???????a? fu mo towess,eethen hrtolly-l,."tigolav ict,a???!ol, w..'m,elyelil,tstreamas..n gotaillas.tansstheatsea f mb ispot inici t.) owar.**1 wnshigigholoothtith orsir.tsotic.'m, sotamimoledug imootrdeavet..t,) sh s,tranciror."wn sieee h asinied.tiear wspilotor,) bla av.nicord,ier.dy'et.*tite m.)..*d, hrouceto hie, ig il m, bsomoug,.t.'l,t, olitel bs,.nt,.dotr tat,)aa? htotitedont,j alesil, starar,ja taie ass.nishiceroouldseal fotitoonckysil, m oitispl o anteeeaicowousomirot. -
Attuning Speech-Enabled Interfaces to User and Context for Inclusive Design: Technology, Methodology and Practice
CORE Metadata, citation and similar papers at core.ac.uk Provided by Springer - Publisher Connector Univ Access Inf Soc (2009) 8:109–122 DOI 10.1007/s10209-008-0136-x LONG PAPER Attuning speech-enabled interfaces to user and context for inclusive design: technology, methodology and practice Mark A. Neerincx Æ Anita H. M. Cremers Æ Judith M. Kessens Æ David A. van Leeuwen Æ Khiet P. Truong Published online: 7 August 2008 Ó The Author(s) 2008 Abstract This paper presents a methodology to apply 1 Introduction speech technology for compensating sensory, motor, cog- nitive and affective usage difficulties. It distinguishes (1) Speech technology seems to provide new opportunities to an analysis of accessibility and technological issues for the improve the accessibility of electronic services and soft- identification of context-dependent user needs and corre- ware applications, by offering compensation for limitations sponding opportunities to include speech in multimodal of specific user groups. These limitations can be quite user interfaces, and (2) an iterative generate-and-test pro- diverse and originate from specific sensory, physical or cess to refine the interface prototype and its design cognitive disabilities—such as difficulties to see icons, to rationale. Best practices show that such inclusion of speech control a mouse or to read text. Such limitations have both technology, although still imperfect in itself, can enhance functional and emotional aspects that should be addressed both the functional and affective information and com- in the design of user interfaces (cf. [49]). Speech technol- munication technology-experiences of specific user groups, ogy can be an ‘enabler’ for understanding both the content such as persons with reading difficulties, hearing-impaired, and ‘tone’ in user expressions, and for producing the right intellectually disabled, children and older adults. -
Voice Synthesizer Application Android
Voice synthesizer application android Continue The Download Now link sends you to the Windows Store, where you can continue the download process. You need to have an active Microsoft account to download the app. This download may not be available in some countries. Results 1 - 10 of 603 Prev 1 2 3 4 5 Next See also: Speech Synthesis Receming Device is an artificial production of human speech. The computer system used for this purpose is called a speech computer or speech synthesizer, and can be implemented in software or hardware. The text-to-speech system (TTS) converts the usual text of language into speech; other systems display symbolic linguistic representations, such as phonetic transcriptions in speech. Synthesized speech can be created by concatenating fragments of recorded speech that are stored in the database. Systems vary in size of stored speech blocks; The system that stores phones or diphones provides the greatest range of outputs, but may not have clarity. For specific domain use, storing whole words or suggestions allows for high-quality output. In addition, the synthesizer may include a vocal tract model and other characteristics of the human voice to create a fully synthetic voice output. The quality of the speech synthesizer is judged by its similarity to the human voice and its ability to be understood clearly. The clear text to speech program allows people with visual impairments or reading disabilities to listen to written words on their home computer. Many computer operating systems have included speech synthesizers since the early 1990s. A review of the typical TTS Automatic Announcement System synthetic voice announces the arriving train to Sweden. -
Biomimetics of Sound Production, Synthesis and Recognition
Design and Nature V 273 Biomimetics of sound production, synthesis and recognition G. Rosenhouse Swantech - Sound Wave Analysis & Technologies Ltd, Haifa, Israel Abstract Biomimesis of sound production, synthesis and recognition follows millenias of years of adaptation as it developed in nature. Technically such communication means were initiated since people began to build speaking machines based on natural speech concepts. The first devices were mechanical and they were in use till the end of the 19th century. Those developments later led to modern speech and music synthesis, initially applying pure mechanics. Since the beginning of the 20th century electronics has taken up the lead to independent electronic achievements in human communication abilities. As shown in the present paper, this development was intentionally made along history in order to satisfy human needs. Keywords: speaking machines, biomimetics, speech synthesis. 1 Introduction Automation is an old attempt of people to tame machines for the needs of human beings. It began mainly, as far as we know, since 70 DC (with Heron). In linguistics, the history of speaking machines began in the 2nd century when people tried to build speaking heads. However, actually, the process was initiated in the 18th century, with Wolfgang von Kempelen who invented a speaking mechanism that simulated the speech system of human beings. It opened the way to inventions of devices for producing artificial vowels and consonants. This development was an important step towards modern speech recognition -
Observations on the Dynamic Control of an Articulatory Synthesizer Using Speech Production Data
Observations on the dynamic control of an articulatory synthesizer using speech production data Dissertation zur Erlangung des Grades eines Doktors der Philosophie der Philosophischen Fakultäten der Universität des Saarlandes vorgelegt von Ingmar Michael Augustus Steiner aus San Francisco Saarbrücken, 2010 ii Dekan: Prof. Dr. Erich Steiner Berichterstatter: Prof. Dr. William J. Barry Prof. Dr. Dietrich Klakow Datum der Einreichung: 14.12.2009 Datum der Disputation: 19.05.2010 Contents Acknowledgments vii List of Figures ix List of Tables xi List of Listings xiii List of Acronyms 1 Zusammenfassung und Überblick 1 Summary and overview 4 1 Speech synthesis methods 9 1.1 Text-to-speech ................................. 9 1.1.1 Formant synthesis ........................... 10 1.1.2 Concatenative synthesis ........................ 11 1.1.3 Statistical parametric synthesis .................... 13 1.1.4 Articulatory synthesis ......................... 13 1.1.5 Physical articulatory synthesis .................... 15 1.2 VocalTractLab in depth ............................ 17 1.2.1 Vocal tract model ........................... 17 1.2.2 Gestural model ............................. 19 1.2.3 Acoustic model ............................. 24 1.2.4 Speaker adaptation ........................... 25 2 Speech production data 27 2.1 Full imaging ................................... 27 2.1.1 X-ray and cineradiography ...................... 28 2.1.2 Magnetic resonance imaging ...................... 30 2.1.3 Ultrasound tongue imaging ...................... 33 -
Siren Songs and Echo's Response: Towards a Media Theory of the Voice
Published as _Article in On_Culture: The Open Journal for the Study of Culture (ISSN 2366-4142) SIREN SONGS AND ECHO’S RESPONSE: TOWARDS A MEDIA THEORY OF THE VOICE IN THE LIGHT OF SPEECH SYNTHESIS CHRISTOPH BORBACH [email protected] Christoph Borbach is a research fellow at the research training group Locating Media at the University of Siegen, where he conducts a research project entitled “Zeitkanäle|Kanalzeiten” (“Time Channels|Channel Times”) on the media history of the operationalization of delay time from a media archaeological perspective. Borbach studied Musicology, Media, and History at the Humboldt University of Berlin. His bachelor’s thesis dealt with radio theories between ideology and media epistemology. For his master’s thesis, he studied the media-technical implementations of echoes. His research interests include media theory of the voice, media archaeology of the echo, operationalization of the sonic, time-critical detection technologies and their visualiza- tion strategies, and the occult of media/media of the occult. KEYWORDS speech synthesis, phonography, voice theory, embodied voices, speaking machines, technotraumatic affects PUBLICATION DATE Issue 2, November 30, 2016 HOW TO CITE Christoph Borbach. “Siren Songs and Echo’s Response: Towards a Media Theory of the Voice in the Light of Speech Synthesis.” On_Culture: The Open Journal for the Study of Culture 2 (2016). <http://geb.uni-giessen.de/geb/volltexte/2016/12354/>. Permalink URL: <http://geb.uni-giessen.de/geb/volltexte/2016/12354/> URN: <urn:nbn:de:hebis:26-opus-123545> On_Culture: The Open Journal for the Study of Culture Issue 2 (2016): The Nonhuman www.on-culture.org http://geb.uni-giessen.de/geb/volltexte/2016/12354/ Siren Songs and Echo’s Response: Towards a Media Theory of the Voice in the Light of Speech Synthesis _Abstract In contrast to phonographical recording, storage, and reproduction of the voice, most media theories, especially prominent media theories of the human voice, neglected the aspect of synthesizing human-like voices by non-human means. -
Speech Generation: from Concept and from Text
Speech Generation From Concept and from Text Martin Jansche CS 6998 2004-02-11 Components of spoken output systems Front end: From input to control parameters. • From naturally occurring text; or • From constrained mark-up language; or • From semantic/conceptual representations. Back end: From control parameters to waveform. • Articulatory synthesis; or • Acoustic synthesis: – Based predominantly on speech samples; or – Using mostly synthetic sources. 2004-02-11 1 Who said anything about computers? Wolfgang von Kempelen, Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine, 1791. Charles Wheatstone’s reconstruction of von Kempelen’s machine 2004-02-11 2 Joseph Faber’s Euphonia, 1846 2004-02-11 3 Modern articulatory synthesis • Output produced by an articulatory synthesizer from Dennis Klatt’s review article (JASA 1987) • Praat demo • Overview at Haskins Laboratories (Yale) 2004-02-11 4 The Voder ... Developed by Homer Dudley at Bell Telephone Laboratories, 1939 2004-02-11 5 ... an acoustic synthesizer Architectural blueprint for the Voder Output produced by the Voder 2004-02-11 6 The Pattern Playback Developed by Franklin Cooper at Haskins Laboratories, 1951 No human operator required. Machine plays back previously drawn spectrogram (spectrograph invented a few years earlier). 2004-02-11 7 Can you understand what it says? Output produced by the Pattern Playback. 2004-02-11 8 Can you understand what it says? Output produced by the Pattern Playback. These days a chicken leg is a rare dish. It’s easy to tell the depth of a well. Four hours of steady work faced us. 2004-02-11 9 Synthesis-by-rule • Realization that spectrograph and Pattern Playback are really only recording and playback devices. -
A Unit Selection Text-To-Speech-And-Singing Synthesis Framework from Neutral Speech: Proof of Concept 39 II.1 Introduction
Adding expressiveness to unit selection speech synthesis and to numerical voice production 90) - 02 - Marc Freixes Guerreiro http://hdl.handle.net/10803/672066 Generalitat 472 (28 de Catalunya núm. Rgtre. Fund. ADVERTIMENT. L'accés als continguts d'aquesta tesi doctoral i la seva utilització ha de respectar els drets de ió la persona autora. Pot ser utilitzada per a consulta o estudi personal, així com en activitats o materials d'investigació i docència en els termes establerts a l'art. 32 del Text Refós de la Llei de Propietat Intel·lectual undac F (RDL 1/1996). Per altres utilitzacions es requereix l'autorització prèvia i expressa de la persona autora. En qualsevol cas, en la utilització dels seus continguts caldrà indicar de forma clara el nom i cognoms de la persona autora i el títol de la tesi doctoral. No s'autoritza la seva reproducció o altres formes d'explotació efectuades amb finalitats de lucre ni la seva comunicació pública des d'un lloc aliè al servei TDX. Tampoc s'autoritza la presentació del seu contingut en una finestra o marc aliè a TDX (framing). Aquesta reserva de drets afecta tant als continguts de la tesi com als seus resums i índexs. Universitat Ramon Llull Universitat Ramon ADVERTENCIA. El acceso a los contenidos de esta tesis doctoral y su utilización debe respetar los derechos de la persona autora. Puede ser utilizada para consulta o estudio personal, así como en actividades o materiales de investigación y docencia en los términos establecidos en el art. 32 del Texto Refundido de la Ley de Propiedad Intelectual (RDL 1/1996).