Robust Structured Voice Extraction for Flexible Expressive Resynthesis
Total pages: 16
File type: PDF, size: 1020 KB
Recommended publications
Masterarbeit (Master's Thesis)
Master's thesis: Erstellung einer Sprachdatenbank sowie eines Programms zu deren Analyse im Kontext einer Sprachsynthese mit spektralen Modellen (creation of a speech database and of a program for its analysis, in the context of speech synthesis with spectral models). Submitted for the degree of Master of Science to the Department of Mathematics, Natural Sciences and Computer Science of the Technische Hochschule Mittelhessen by Tobias Platen, August 2014. Referee: Prof. Dr. Erdmuthe Meyer zu Bexten; co-referee: Prof. Dr. Keywan Sohrabi. The front matter includes the statutory declaration that the work was produced independently, using only the cited literature and aids, and has neither been submitted to any other examination authority nor published.

Contents (excerpt): 1 Introduction (1.1 Motivation; 1.2 Goals; 1.3 Historical speech synthesizers: the speaking machine, the Vocoder and the Voder, linear predictive coding; 1.4 Modern speech synthesis algorithms: formant synthesis, concatenative synthesis). 2 Spectral models for speech synthesis (2.1 Convolution, Fourier transform and vocoders; 2.2 Phase vocoder; 2.3 Spectral model synthesis: 2.3.1 Harmonic trajectories, 2.3.2 Shape invariance).
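The chapter on convolution, the Fourier transform and vocoders rests on the convolution theorem. As an illustrative aside (not taken from the thesis itself), a minimal NumPy sketch can verify that time-domain convolution equals pointwise multiplication in the frequency domain, once both signals are zero-padded to the full linear-convolution length:

```python
import numpy as np

# Two short test signals.
x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([0.5, -0.5, 1.0])

# Linear convolution in the time domain.
direct = np.convolve(x, h)

# Convolution theorem: zero-pad to length len(x) + len(h) - 1 so that
# circular convolution via the FFT equals linear convolution.
n = len(x) + len(h) - 1
via_fft = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)

print(np.allclose(direct, via_fft))  # True
```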
Expression Control in Singing Voice Synthesis
Expression Control in Singing Voice Synthesis: Features, Approaches, Evaluation, and Challenges. Martí Umbert, Jordi Bonada, Masataka Goto, Tomoyasu Nakano, and Johan Sundberg. IEEE Signal Processing Magazine, November 2015, p. 55. Digital Object Identifier 10.1109/MSP.2015.2424572. Date of publication: 13 October 2015.

In the context of singing voice synthesis, expression control manipulates a set of voice features related to a particular emotion, style, or singer. Also known as performance modeling, it has been approached from different perspectives and for different purposes, and different projects have shown a wide extent of applicability. The aim of this article is to provide an overview of approaches to expression control in singing voice synthesis. We introduce some musical applications that use singing voice synthesis techniques to justify the need for an accurate control of expression. Then, expression is defined and related to speech and instrument performance modeling. Next, we present the commonly studied set of voice parameters that can change […] voices that are difficult to produce naturally (e.g., castrati). More examples can be found with pedagogical purposes or as tools to identify perceptually relevant voice properties [3]. These applications of the so-called music information research field may have a great impact on the way we interact with music [4]. Examples of research projects using singing voice synthesis technologies are listed in Table 1.

Table 1. Research projects using singing voice synthesis technologies.
Cantor: http://www.virsyn.de
Cantor Digitalis: https://cantordigitalis.limsi.fr/
Chanter: https://chanter.limsi.fr
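As a concrete taste of what "controlling a voice parameter" can mean in practice (an illustrative sketch, not taken from the article), the fundamental-frequency contour of a sung note is often modeled as a base pitch plus vibrato, a small sinusoidal modulation; all numeric values below are assumptions chosen for illustration:

```python
import numpy as np

fs = 100.0                 # control-signal rate in Hz (assumed)
t = np.arange(0.0, 2.0, 1.0 / fs)

f0_base = 440.0            # nominal pitch of the note, in Hz
vibrato_rate = 5.5         # singers' vibrato is typically around 5-7 Hz
vibrato_depth = 0.5        # modulation depth in semitones (illustrative)

# Vibrato as a sinusoidal deviation in semitones, mapped back to Hz.
semitone_dev = vibrato_depth * np.sin(2 * np.pi * vibrato_rate * t)
f0 = f0_base * 2.0 ** (semitone_dev / 12.0)

print(f0.min(), f0.max())  # the contour oscillates around 440 Hz
```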
Design and Implementation of Text to Speech Conversion for Visually Impaired People
Itunuoluwa Isewon (corresponding author), Jelili Oyelade and Olufunke Oladipupo, Department of Computer and Information Sciences, Covenant University, PMB 1023, Ota, Nigeria. International Journal of Applied Information Systems (IJAIS), ISSN 2249-0868, Foundation of Computer Science FCS, New York, USA, Volume 7, No. 2, April 2014, www.ijais.org.

ABSTRACT: A text-to-speech synthesizer is an application that converts text into spoken words, by analyzing and processing the text using natural language processing (NLP) and then using digital signal processing (DSP) technology to convert this processed text into a synthesized speech representation of the text. Here, we developed a useful text-to-speech synthesizer in the form of a simple application that converts input text into synthesized speech, reads it out to the user, and can save the result as an MP3 file. The development of a text-to-speech synthesizer will be of great help to people with visual impairment and will make working through large volumes of text easier. (Figure 1 of the paper gives a simple but general functional diagram of a TTS system [2].)

Keywords: text-to-speech synthesis, natural language processing, digital signal processing.

2. OVERVIEW OF SPEECH SYNTHESIS: Speech synthesis can be described as the artificial production of human speech [3]. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware.
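The pipeline the abstract describes (text in, synthesized speech out, saved as an MP3) can be prototyped in a few lines. The paper does not say which libraries it builds on; as an assumption for illustration only, the gTTS Python package provides exactly this text-in, MP3-out behavior:

```python
from gtts import gTTS  # pip install gTTS; an assumed stand-in, not the paper's stack

text = "A text-to-speech synthesizer converts written text into spoken words."

# gTTS sends the text to an online TTS service and returns the audio.
tts = gTTS(text=text, lang="en")
tts.save("output.mp3")  # the saved file can then be played back to the user
```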
Attuning Speech-Enabled Interfaces to User and Context for Inclusive Design: Technology, Methodology and Practice
Univ Access Inf Soc (2009) 8:109-122, DOI 10.1007/s10209-008-0136-x. Long paper. Mark A. Neerincx, Anita H. M. Cremers, Judith M. Kessens, David A. van Leeuwen and Khiet P. Truong. Published online: 7 August 2008. © The Author(s) 2008.

Abstract: This paper presents a methodology to apply speech technology for compensating sensory, motor, cognitive and affective usage difficulties. It distinguishes (1) an analysis of accessibility and technological issues for the identification of context-dependent user needs and corresponding opportunities to include speech in multimodal user interfaces, and (2) an iterative generate-and-test process to refine the interface prototype and its design rationale. Best practices show that such inclusion of speech technology, although still imperfect in itself, can enhance both the functional and affective information and communication technology experiences of specific user groups, such as persons with reading difficulties, the hearing-impaired, the intellectually disabled, children and older adults.

1 Introduction: Speech technology seems to provide new opportunities to improve the accessibility of electronic services and software applications, by offering compensation for limitations of specific user groups. These limitations can be quite diverse and originate from specific sensory, physical or cognitive disabilities, such as difficulties to see icons, to control a mouse or to read text. Such limitations have both functional and emotional aspects that should be addressed in the design of user interfaces (cf. [49]). Speech technology can be an 'enabler' for understanding both the content and 'tone' in user expressions, and for producing the right […]
Voice Synthesizer Application Android
See also: speech synthesis. Speech synthesis is the artificial production of human speech. The computer system used for this purpose is called a speech computer or speech synthesizer, and can be implemented in software or hardware. A text-to-speech (TTS) system converts ordinary language text into speech; other systems render symbolic linguistic representations, such as phonetic transcriptions, as speech. Synthesized speech can be created by concatenating fragments of recorded speech that are stored in a database. Systems differ in the size of the stored speech units: a system that stores phones or diphones provides the greatest output range, but may lack clarity; for specific usage domains, storing whole words or sentences allows high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other characteristics of the human voice to create a fully synthetic voice output. The quality of a speech synthesizer is judged by its similarity to the human voice and by how clearly it can be understood. A clear text-to-speech program allows people with visual impairments or reading disabilities to listen to written words on a home computer. Many computer operating systems have included speech synthesizers since the early 1990s. (Overview of a typical TTS system; example: an automatic announcement system in which a synthetic voice announces an arriving train in Sweden.)
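The concatenative approach described above can be illustrated with a toy sketch (an illustration of the general idea, not any particular synthesizer's code): stored unit waveforms are joined with short crossfades so the seams do not click.

```python
import numpy as np

def crossfade_concat(units, fade=64):
    """Concatenate waveform units with linear crossfades of `fade` samples."""
    out = units[0].astype(float)
    ramp = np.linspace(0.0, 1.0, fade)
    for u in units[1:]:
        u = u.astype(float)
        # Blend the tail of `out` with the head of the next unit.
        out[-fade:] = out[-fade:] * (1.0 - ramp) + u[:fade] * ramp
        out = np.concatenate([out, u[fade:]])
    return out

# Toy "units": two sine fragments standing in for recorded diphones.
fs = 16000
t = np.arange(0, 0.1, 1.0 / fs)
a = np.sin(2 * np.pi * 220 * t)
b = np.sin(2 * np.pi * 330 * t)
speech = crossfade_concat([a, b])
print(len(speech))  # units overlap by 64 samples: 2 * 1600 - 64 = 3136
```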
Observations on the Dynamic Control of an Articulatory Synthesizer Using Speech Production Data
Dissertation for the degree of Doctor of Philosophy, Faculties of Philosophy of the Universität des Saarlandes, submitted by Ingmar Michael Augustus Steiner of San Francisco. Saarbrücken, 2010. Dean: Prof. Dr. Erich Steiner. Reviewers: Prof. Dr. William J. Barry and Prof. Dr. Dietrich Klakow. Submitted: 14 December 2009; defended: 19 May 2010.

Contents (excerpt): acknowledgments; lists of figures, tables, listings and acronyms; Zusammenfassung und Überblick (German summary and overview); summary and overview. 1 Speech synthesis methods: 1.1 Text-to-speech (1.1.1 Formant synthesis; 1.1.2 Concatenative synthesis; 1.1.3 Statistical parametric synthesis; 1.1.4 Articulatory synthesis; 1.1.5 Physical articulatory synthesis); 1.2 VocalTractLab in depth (1.2.1 Vocal tract model; 1.2.2 Gestural model; 1.2.3 Acoustic model; 1.2.4 Speaker adaptation). 2 Speech production data: 2.1 Full imaging (2.1.1 X-ray and cineradiography; 2.1.2 Magnetic resonance imaging; 2.1.3 Ultrasound tongue imaging).
A Unit Selection Text-to-Speech-and-Singing Synthesis Framework from Neutral Speech: Proof of Concept
Adding expressiveness to unit selection speech synthesis and to numerical voice production. Doctoral thesis by Marc Freixes Guerreiro, Universitat Ramon Llull. http://hdl.handle.net/10803/672066. (The repository page carries the standard TDX rights notice in Catalan and Spanish: access to and use of the thesis must respect the author's rights; it may be used for personal consultation or study and in research and teaching activities under Art. 32 of the consolidated Spanish Intellectual Property Law (RDL 1/1996); any other use requires the author's prior, express authorization; the author's name and the thesis title must always be clearly indicated; reproduction for profit and public communication from sites other than TDX, including framing, are not authorized.)
Tools for Creating Audio Stories
Steven Surmacz Rubin. Technical Report No. UCB/EECS-2015-237, Electrical Engineering and Computer Sciences, University of California at Berkeley, December 15, 2015. http://www.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-237.html

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley. Committee in charge: Professor Maneesh Agrawala (chair), Professor Björn Hartmann, Professor Greg Niemeyer. Fall 2015. Acknowledgement: advisor Maneesh Agrawala.

Abstract: Audio stories are an engaging form of communication that combines speech and music into compelling narratives. One common production pipeline for creating audio stories involves three main steps: recording speech, editing speech, and editing music. Existing audio recording and editing tools force the story producer to manipulate speech and music tracks via tedious, low-level waveform editing.
Chapter 2: A Survey on Speech Synthesis Techniques
Producing synthetic speech from natural-language text utterances comes with a unique set of challenges and is currently underserved, owing to the unavailability of a generic model for all available languages. This chapter presents a study of the existing speech synthesis techniques along with their major advantages and deficiencies. The classification of the standard speech synthesis techniques is presented in Figure 2.1, which divides them into articulatory, formant, concatenative and statistical approaches, with concatenative synthesis further split into unit selection, domain-specific, diphone and syllable-based methods. The chapter also discusses the current status of text-to-speech technology in Indian languages, focusing on the issues to be resolved in proposing a generic model for different Indian languages.

*S. P. Panda, A. K. Nayak, and S. Patnaik, "Text to Speech Synthesis with an Indian Language Perspective", International Journal of Grid and Utility Computing, Inderscience, Vol. 6, No. 3/4, pp. 170-178, 2015.

2.1 Articulatory Synthesis: Articulatory synthesis models the natural speech production process of humans. As a speech synthesis method it is not among the best when the quality of the produced speech is the main criterion; for studying the speech production process, however, it is the most suitable method and the one adopted by researchers [48]. To understand articulatory speech synthesis, the human speech production process must first be understood. The human speech production organs (the main articulators, i.e. the tongue, the jaw and the lips, as well as other important parts of the vocal tract) are shown in Figure 2.2 [49], along with the idealized model that is the basis of almost every acoustical model.
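Of the techniques in the chapter's classification, formant synthesis is the easiest to sketch in code. The toy example below (an illustration of the general source-filter idea, not code from the surveyed paper) passes a periodic pulse train, a crude stand-in for glottal excitation, through a cascade of second-order resonators placed at typical formant frequencies; the formant values are illustrative assumptions for a rough /a/:

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                      # sample rate in Hz

def resonator(freq, bw):
    """Second-order IIR resonator (one formant): center `freq`, bandwidth `bw`, in Hz."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    a = [1.0, -2 * r * np.cos(theta), r * r]
    b = [1.0 - r]               # rough gain normalization
    return b, a

# Source: impulse train at roughly 120 Hz, standing in for glottal pulses.
n = fs // 2                     # half a second of audio
source = np.zeros(n)
source[:: fs // 120] = 1.0

# Filter: cascade of three formant resonators (values chosen for illustration).
signal = source
for freq, bw in [(700, 130), (1220, 70), (2600, 160)]:
    b, a = resonator(freq, bw)
    signal = lfilter(b, a, signal)

print(signal.shape)             # the result is a crude, buzzy vowel-like sound
```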
Chapter 8 Speech Synthesis
(Preliminary proofs, unpublished work © 2008 Pearson Education, Inc.; to be published by Pearson Prentice Hall, Upper Saddle River, New Jersey.)

"And computers are getting smarter all the time: Scientists tell us that soon they will be able to talk to us. (By 'they' I mean 'computers': I doubt scientists will ever be able to talk to us.)" Dave Barry

In Vienna in 1769, Wolfgang von Kempelen built for the Empress Maria Theresa the famous Mechanical Turk, a chess-playing automaton consisting of a wooden box filled with gears, and a robot mannequin sitting behind the box who played chess by moving pieces with his mechanical arm. The Turk toured Europe and the Americas for decades, defeating Napoleon Bonaparte and even playing Charles Babbage. The Mechanical Turk might have been one of the early successes of artificial intelligence were it not for the fact that it was, alas, a hoax, powered by a human chess player hidden inside the box. What is perhaps less well known is that von Kempelen, an extraordinarily prolific inventor, also built between 1769 and 1790 what is definitely not a hoax: the first full-sentence speech synthesizer. His device consisted of a bellows to simulate the lungs, a rubber mouthpiece and a nose aperture, a reed to simulate the vocal folds, various whistles for the fricatives, and a small auxiliary bellows to provide the puff of air for plosives.
A. Acoustic Theory and Modeling of the Vocal Tract
By H. W. Strube, Drittes Physikalisches Institut, Universität Göttingen.

A.1 Introduction: This appendix is intended for those readers who want to inform themselves about the mathematical treatment of vocal-tract acoustics and about its modeling in the time and frequency domains. Apart from providing a fundamental understanding, this is required for all applications and investigations concerned with the relationship between geometric and acoustic properties of the vocal tract, such as articulatory synthesis, determination of the tract shape from acoustic quantities, inverse filtering, etc. Historically, the formants of speech were conjectured to be resonances of cavities in the vocal tract. In the case of a narrow constriction at or near the lips, such as for the vowel [u], the volume of the tract can be considered a Helmholtz resonator (the glottis is assumed almost closed). However, this can only explain the first formant. Also, the constriction, if any, is usually situated farther back. Then the tract may be roughly approximated as a cascade of two resonators, accounting for two formants. But all these approximations by discrete cavities proved unrealistic. Thus researchers have now adopted a more reasonable description of the vocal tract as a nonuniform acoustical transmission line. This can explain an infinite number of resonances, of which, however, only the first 2 to 4 are of phonetic importance. Depending on the kind of sound, the tube system has a different topology: for vowel-like sounds, pharynx and mouth form one tube; for nasalized vowels, the tube is branched, with transmission from the pharynx through mouth and nose; for nasal consonants, transmission is through pharynx and nose, with the closed mouth tract as a "shunt" line.
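As a worked illustration of the transmission-line view (a standard textbook calculation, not taken from the appendix itself): an idealized uniform tube, closed at the glottis and open at the lips, has quarter-wavelength resonances, which for a typical vocal-tract length land near the familiar neutral-vowel formant values.

```latex
% Quarter-wavelength resonances of a uniform tube closed at one end:
F_n = \frac{(2n - 1)\,c}{4L}, \qquad n = 1, 2, 3, \dots

% With speed of sound c = 350 m/s and tract length L = 0.175 m:
F_1 = \frac{350}{4 \cdot 0.175} = 500~\mathrm{Hz}, \quad
F_2 = 1500~\mathrm{Hz}, \quad
F_3 = 2500~\mathrm{Hz}
```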
A Parametric Sound Object Model for Sound Texture Synthesis
Daniel Möhlmann. A Parametric Sound Object Model for Sound Texture Synthesis. Dissertation for the degree of Doktor der Ingenieurwissenschaften (Dr.-Ing.), submitted to Faculty 3 (Mathematics and Computer Science) of the Universität Bremen, June 2011. Reviewers: Prof. Dr. Otthein Herzog (Universität Bremen) and Prof. Dr. Jörn Loviscach (Fachhochschule Bielefeld).

Abstract: This thesis deals with the analysis and synthesis of sound textures based on parametric sound objects. An overview is provided about the acoustic and perceptual principles of textural acoustic scenes, and technical challenges for analysis and synthesis are considered. Four essential processing steps for sound texture analysis are identified, and existing sound texture systems are reviewed, using the four-step model as a guideline. A theoretical framework for analysis and synthesis is proposed. A parametric sound object synthesis (PSOS) model is introduced, which is able to describe individual recorded sounds through a fixed set of parameters. The model, which applies to harmonic and noisy sounds, is an extension of spectral modeling and uses spline curves to approximate spectral envelopes, as well as the evolution of parameters over time. In contrast to standard spectral modeling techniques, this representation uses the concept of objects instead of concatenated frames, and it provides a direct mapping between sounds of different length. Methods for automatic and manual conversion are shown. An evaluation is presented in which the ability of the model to encode a wide range of different sounds has been examined. Although there are aspects of sounds that the model cannot accurately capture, such as polyphony and certain types of fast modulation, the results indicate that high quality synthesis can be achieved for many different acoustic phenomena, including instruments and animal vocalizations.
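The abstract's central representational idea, approximating a spectral envelope with a spline curve, can be sketched generically (an illustration of the general technique, not the PSOS implementation; the control-point values are made up): fit a smooth spline through a handful of points in the frequency-magnitude plane, then evaluate it on a dense grid.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Control points for a spectral envelope: (frequency in Hz, magnitude in dB).
freqs = np.array([0, 500, 1200, 2500, 4000, 8000], dtype=float)
mags_db = np.array([-20.0, -6.0, -12.0, -9.0, -24.0, -48.0])

# A cubic spline through the control points gives a smooth envelope curve.
envelope = CubicSpline(freqs, mags_db)

# Evaluate the envelope densely, e.g. at FFT bin centers.
grid = np.linspace(0, 8000, 512)
env_db = envelope(grid)
print(env_db.shape, env_db[:3])
```

A handful of spline control points is a far more compact description than one magnitude per FFT bin per frame, which is what makes a fixed-size parameter set, and hence direct mapping between sounds of different lengths, possible.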