Speech Technology

Total Pages: 16

File Type: PDF, Size: 1020 KB

Speech Technology (Specialization Module)
Timo Baumann, [email protected]
Universität Hamburg, Department of Informatics, Natural Language Systems Group

not „just“ Text-to-Speech Synthesis

Synthesis examples
● first singing (digital) computer (IBM, 1961) → hand-tuned vocoding
● extension of the same technique today: espeak → rule-based vocoding system
● based on natural speech: DreSS-FR, MBROLA → diphone synthesis
● a more modern system: MaryTTS → general concatenative speech synthesis
● smaller memory footprint than the above → HMM-based speech synthesis (to be covered in 2 weeks)

Input and Output for Spoken Dialogue Systems
● Recognition: reduction of the signal to the words ➔ abstraction from details
● Synthesis: the words themselves only insufficiently describe the signal ➔ naturalness only with the addition of details
[Diagram: words ↔ sound; speech recognition maps sound to words, speech synthesis maps words back to sound]

What is missing in written language?

Written vs. Spoken Language (Timo's list)
● abbreviations, dates, numbers, currencies, …
● homographs: Bass [beɪs] vs. [bæs]
● text does not have any melody or rhythm
– prosody is important to convey meaning
– punctuation is only partially helpful
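A synthesizer's front end has to expand such abbreviations, numbers and dates before any phonetic processing can start. As a rough illustration of the model-plus-exception-list approach discussed later under "What a computer can do", here is a minimal Python sketch; the abbreviation table and the toy number expander are invented for this example, not taken from the lecture:

```python
import re

# Illustrative exception list; a real front end has thousands of entries.
ABBREVIATIONS = {
    "Dr.": "doctor",
    "No.": "number",
    "e.g.": "for example",
}

UNITS = ["zero", "one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def expand_number(n: int) -> str:
    # Toy expander: cardinals 0-99 only; dates, ordinals and currencies
    # would each need their own rules in a real system.
    if n < 10:
        return UNITS[n]
    if n < 20:
        return TEENS[n - 10]
    if n < 100:
        tens, unit = divmod(n, 10)
        return TENS[tens] + ("-" + UNITS[unit] if unit else "")
    return " ".join(UNITS[int(d)] for d in str(n))  # fallback: digit by digit

def normalize(text: str) -> str:
    # Exceptions first, then the general rule for digit strings.
    for abbrev, expansion in ABBREVIATIONS.items():
        text = text.replace(abbrev, expansion)
    return re.sub(r"\d+", lambda m: expand_number(int(m.group())), text)

print(normalize("Dr. Smith lives at No. 42."))
# -> doctor Smith lives at number forty-two.
```

Note that homographs like "Bass" cannot be resolved by such string substitution: they need part-of-speech or semantic context, which is exactly the kind of detail the slides say is missing from written text.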
Information Structure
● the linguistic means of structuring information in order to optimize information transfer within discourse
– Topic / Focus
– Given / New information
● not directly conveyed in the textual representation – but to a certain degree by prosody
● to reconstruct the structure, listeners also use
– the context of the utterance in the whole conversation
– world knowledge
(Sonderforschungsbereich 2003–2015: http://www.sfb632.uni-potsdam.de)

Prosody
● supra-segmental properties of speech
● phenomena:
– pitch (i.e., melody / fundamental frequency)
– loudness / intensity
– duration, pauses
● phonetically: accentuation and phrasing
● phonologically: (word) stress, intonation, juncture
Prosody: Phonology – Phonetics – Phenomena

Focus and Accentuation
● "I didn't say we should kill him." – accenting a different word each time yields a different reading (the slide repeats once per focus position):
– someone else said we should kill him
– I am denying that I said we should kill him
– I wrote it down or implied it, but I didn't say it
– I said someone else should do the job
– I said that we absolutely must kill him
– getting him a little nervous would have been enough
– we got the wrong guy

Information Structure
● information structure is an active area of research:
– it is unknown how exactly to represent IS (cross-linguistically, cross-genre, in dialogue, …)
– it is unknown how (exactly) IS influences speech
● problem of premature implementation: can we really expect a computer to successfully perform speech synthesis even before the basic research has been done?

What a computer can do
● problems that are well understood:
– find solutions based on a model
– use lists of exceptions where the model is faulty
● problems that are somewhat understood:
– use heuristics to get the details right
– try to avoid taking a stand
● problems that aren't yet understood:
– require additional instructions in the input
– guess

What a computer can do: focus
● human listeners are predictive (and forgiving):
– it's worse to be very wrong occasionally than to say everything a little bit wrongly
– human listeners will select the correct interpretation (using their world knowledge) from the available options
● solution: put a small accentuation on all possible focus points
● however: the system does not take a stand, so it sounds indifferent, bored

Process diagram of Speech Synthesis
✓ problem: text is missing detail
phones + their durations + pitch curve (support values) → waveform synthesis

Waveform Synthesis
from the target sequence (phones + durations + pitch):
1. formant-based: rules to determine target formants and other parts of the signal; rules to determine the transitions
2. pattern-based: a database of many short speech segments; segments are concatenated one after the other
3. model-based: approach in 2 weeks

Diphone Synthesis
● concatenation of short speech snippets
● units run from the center of one phone to the center of the next: #h + ha + a:l + lo: + o:# + #v + vi: + i:g + ge: + e:t + ts + s# ("Hallo, wie geht's")
– concatenation within the "stable" phase of the phone
– coarticulation is (largely) covered
● 40 phones → ~1600 diphones
– recorded from one speaker → one voice
– additional signal processing for duration/pitch changes

General Concatenative Synthesis
● alternatives for the mapping target → speech snippets:
– more speech material in the database
– selection of the material that best fits the target sequence
● selection becomes a search for the best concatenation (see the sketch below):
– costs of the fit of the concatenation between snippets
– costs of the fit of the snippets to the target sequence
● computationally expensive (search)
● very high memory demands (500 MB+ per voice)
● results can sound very natural
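The "search for the best concatenation" can be made concrete: with a list of candidate snippets per target unit, minimizing the summed target and join costs is a classic dynamic-programming (Viterbi-style) search. A minimal sketch, assuming black-box cost functions; the function names and data layout are illustrative, not taken from any particular synthesizer:

```python
def select_units(targets, candidates, target_cost, join_cost):
    """targets[i]: desired spec (e.g. phone + duration + pitch) at position i;
    candidates[i]: list of database units that could realize targets[i];
    target_cost(spec, unit) and join_cost(prev_unit, unit) return floats."""
    # best[i][j]: cheapest total cost of any unit sequence that ends with
    # candidate j at position i; back[i][j]: predecessor for backtracking.
    best = [[target_cost(targets[0], c) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, ptr = [], []
        for c in candidates[i]:
            # cheapest predecessor, accounting for how well the units join
            j, cost = min(
                ((j, best[i - 1][j] + join_cost(p, c))
                 for j, p in enumerate(candidates[i - 1])),
                key=lambda jc: jc[1],
            )
            row.append(cost + target_cost(targets[i], c))
            ptr.append(j)
        best.append(row)
        back.append(ptr)
    # pick the cheapest final unit and walk the back-pointers
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    chosen = [j]
    for i in range(len(targets) - 1, 0, -1):
        j = back[i][j]
        chosen.append(j)
    chosen.reverse()
    return [candidates[i][j] for i, j in enumerate(chosen)]

# Toy demo: pick numbers close to the targets, preferring smooth joins.
units = select_units(
    targets=[1.0, 2.0, 3.0],
    candidates=[[0.8, 1.5], [1.9, 2.6], [2.5, 3.2]],
    target_cost=lambda t, u: abs(t - u),
    join_cost=lambda a, b: 0.1 * abs(b - a),
)
print(units)  # -> [0.8, 1.9, 3.2]
```

With N targets and K candidates per position this takes O(N·K²) cost evaluations, which is why the slides call the search computationally expensive; production systems prune the candidate lists aggressively.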
What do you like better: formant-based or pattern-based synthesis?

Summary
Thank you.
[email protected]
https://nats-www.informatik.uni-hamburg.de/SLP16
Universität Hamburg, Department of Informatics, Natural Language Systems Group

Further Reading
● speech synthesis in general:
– P. Taylor (2009): Text-to-Speech Synthesis. Cambridge University Press. ISBN 978-0521899277. InfBib: A TAY 43070.
● the MaryTTS speech synthesis system:
– Schröder & Trouvain (2003): "The German Text-to-Speech Synthesis System MARY: A Tool for Research, Development and Teaching", International Journal of Speech Technology 6(4).

Notes

Desired Learning Outcomes
● the goal in speech synthesis is to add variation for naturalness (the opposite of ASR)
● problems/ambiguities in linguistic pre-processing
– prosody and pitch: ToBI, information structure
– major synthesis techniques: formant, diphone (PSOLA technique; see the sketch below)
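The PSOLA technique named above is the classic way to perform the "additional signal processing for duration/pitch changes" that diphone synthesis relies on. A deliberately simplified TD-PSOLA sketch in Python; it assumes the pitch marks are already given and uses one average pitch period, where real implementations work fully pitch-synchronously:

```python
import numpy as np

def td_psola(x, marks, pitch_factor=1.0, time_factor=1.0):
    """Toy TD-PSOLA. x: mono signal (1-D array); marks: sample indices of
    pitch marks (assumed given, len(marks) >= 3). pitch_factor > 1 raises
    the pitch, time_factor > 1 lengthens the output."""
    # 1) analysis: cut a two-period, Hann-windowed grain around each mark
    grains = []
    for i in range(1, len(marks) - 1):
        lo, hi = marks[i - 1], marks[i + 1]
        grains.append(x[lo:hi].astype(float) * np.hanning(hi - lo))
    period = float(np.mean(np.diff(marks)))   # average pitch period (samples)
    hop = period / pitch_factor               # spacing of the output grains
    out_len = int(len(x) * time_factor)
    out = np.zeros(out_len + max(len(g) for g in grains))
    # 2) synthesis: walk the output at the new period and overlap-add; each
    #    output instant is mapped back to the nearest analysis grain, so
    #    grains get duplicated (slowing down) or skipped (speeding up)
    t = 0.0
    while t < out_len:
        g = grains[min(int(t / (time_factor * period)), len(grains) - 1)]
        start = int(t)
        out[start:start + len(g)] += g
        t += hop
    return out[:out_len]
```

For example, pitch_factor=2**(2/12) shifts the pitch up by two semitones at unchanged duration, while time_factor=1.3 lengthens the signal at unchanged pitch.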
Recommended publications
  • Expression Control in Singing Voice Synthesis
Expression Control in Singing Voice Synthesis: features, approaches, evaluation, and challenges
Martí Umbert, Jordi Bonada, Masataka Goto, Tomoyasu Nakano, and Johan Sundberg
IEEE Signal Processing Magazine, November 2015, p. 55. DOI: 10.1109/MSP.2015.2424572
In the context of singing voice synthesis, expression control manipulates a set of voice features related to a particular emotion, style, or singer. Also known as performance modeling, it has been approached from different perspectives and for different purposes, and different projects have shown a wide extent of applicability. The aim of this article is to provide an overview of approaches to expression control in singing voice synthesis. We introduce some musical applications that use singing voice synthesis techniques to justify the need for an accurate control of expression. Then, expression is defined and related to speech and instrument performance modeling. Next, we present the commonly studied set of voice parameters that can change … voices that are difficult to produce naturally (e.g., castrati). More examples can be found with pedagogical purposes or as tools to identify perceptually relevant voice properties [3]. These applications of the so-called music information research field may have a great impact on the way we interact with music [4]. Examples of research projects using singing voice synthesis technologies are listed in Table 1.
Table 1. Research projects using singing voice synthesis technologies:
– Cantor: http://www.virsyn.de
– Cantor Digitalis: https://cantordigitalis.limsi.fr/
– Chanter: https://chanter.limsi.fr
  • Design and Implementation of Text to Speech Conversion for Visually Impaired People
Itunuoluwa Isewon*, Jelili Oyelade, Olufunke Oladipupo (* corresponding author)
Department of Computer and Information Sciences, Covenant University, PMB 1023, Ota, Nigeria
International Journal of Applied Information Systems (IJAIS), ISSN 2249-0868, Volume 7, No. 2, April 2014, www.ijais.org
ABSTRACT: A text-to-speech synthesizer is an application that converts text into spoken word: it analyzes and processes the text using Natural Language Processing (NLP) and then uses Digital Signal Processing (DSP) technology to convert the processed text into a synthesized speech representation of the text. Here, we developed a useful text-to-speech synthesizer in the form of a simple application that converts inputted text into synthesized speech and reads it out to the user, which can then be saved as an mp3 file. The development of a text-to-speech synthesizer will be of great help to people with visual impairment and make working through large volumes of text easier.
(Figure 1: A simple but general functional diagram of a TTS system. [2])
Keywords: text-to-speech synthesis, Natural Language Processing, Digital Signal Processing
2. OVERVIEW OF SPEECH SYNTHESIS: Speech synthesis can be described as the artificial production of human speech [3]. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware.
  • Attuning Speech-Enabled Interfaces to User and Context for Inclusive Design: Technology, Methodology and Practice
Mark A. Neerincx, Anita H. M. Cremers, Judith M. Kessens, David A. van Leeuwen, Khiet P. Truong
Univ Access Inf Soc (2009) 8:109–122. DOI: 10.1007/s10209-008-0136-x. Long paper, published online 7 August 2008. © The Author(s) 2008.
Abstract: This paper presents a methodology to apply speech technology for compensating sensory, motor, cognitive and affective usage difficulties. It distinguishes (1) an analysis of accessibility and technological issues for the identification of context-dependent user needs and corresponding opportunities to include speech in multimodal user interfaces, and (2) an iterative generate-and-test process to refine the interface prototype and its design rationale. Best practices show that such inclusion of speech technology, although still imperfect in itself, can enhance both the functional and affective information and communication technology experiences of specific user groups, such as persons with reading difficulties, hearing-impaired, intellectually disabled, children and older adults.
1 Introduction: Speech technology seems to provide new opportunities to improve the accessibility of electronic services and software applications, by offering compensation for limitations of specific user groups. These limitations can be quite diverse and originate from specific sensory, physical or cognitive disabilities, such as difficulties to see icons, to control a mouse or to read text. Such limitations have both functional and emotional aspects that should be addressed in the design of user interfaces (cf. [49]). Speech technology can be an 'enabler' for understanding both the content and 'tone' in user expressions, and for producing the right …
  • Project Melt
PROJECT MELT: The Middle English Language Translator (5/8/2013)
By Zachary Lytle, Tam Ayers, and Michael Goheen
Client: Dr. Karen Swenson, English Department at Virginia Tech
www.project-melt.org (see attached page for site map), [email protected]
In conjunction with the Virginia Tech English Department, a group of Computer Science students sought to build an online Middle English (12th to 15th century) translator with text-to-speech pronunciation. This document outlines their research, development, and future plans for the project.
Table of contents: Introduction (Middle English Background, Client Representation, Application Requirements, Concept Map); Text-to-Speech Analysis (Speech Introduction …)
  • Voks: Digital Instruments for Chironomic Control of Voice Samples Grégoire Locqueville, Christophe D’Alessandro, Samuel Delalez, Boris Doval, Xiao Xiao
Voks: Digital instruments for chironomic control of voice samples
Grégoire Locqueville, Christophe d'Alessandro, Samuel Delalez, Boris Doval, Xiao Xiao
Speech Communication, Elsevier / North-Holland, 2020, vol. 125, pp. 97–113. DOI: 10.1016/j.specom.2020.10.002. HAL Id: hal-03009712, https://hal.archives-ouvertes.fr/hal-03009712 (submitted on 2 Dec 2020)
Affiliations: Institut Jean le Rond d'Alembert, Sorbonne Université, CNRS, UMR 7190, 4 place Jussieu, Paris, France; Lunii, 18 rue Dubrunfaut, 75012 Paris, France
Abstract: This paper presents Voks, a new family of digital instruments that allow for real-time control and modification of pre-recorded voice signal samples. An instrument based on Voks is made of Voks itself, the synthesis software, and a given set of chironomic (hand-driven) interfaces. Rhythm can be accurately controlled thanks to a new methodology, based on syllabic control points.
  • Robust Structured Voice Extraction for Flexible Expressive Resynthesis
Robust Structured Voice Extraction for Flexible Expressive Resynthesis
A dissertation submitted to the Department of Electrical Engineering and the Committee on Graduate Studies of Stanford University in partial fulfillment of the requirements for the degree of Doctor of Philosophy. Pamornpol Jinachitra, June 2007. Copyright © 2007 Pamornpol Jinachitra. All rights reserved.
Principal adviser: Julius O. Smith III. Readers: Robert M. Gray, Jonathan S. Abel. (Each certified the dissertation as fully adequate in scope and quality; approved for the University Committee on Graduate Studies.)
Abstract: Parametric representation of audio allows for a reduction in the amount of data needed to represent the sound. If chosen carefully, these parameters can capture the expressiveness of the sound while reflecting the production mechanism of the sound source, and thus allow for intuitive control in order to modify the original sound in a desirable way. To achieve the desired parametric encoding, algorithms are needed which can robustly identify the model parameters even from noisy recordings. As a result, not only do we get an expressive and flexible coding system, we can also obtain model-based speech enhancement that reconstructs the speech embedded in noise cleanly and free of the musical noise usually associated with filter-based approaches.
  • D10.1: State of the Art of Accessibility Tools (Revision 1.0, as of 28th February 2011)
EuDML Deliverable D10.1: State of the Art of Accessibility Tools (Revision 1.0, as of 28th February 2011)
Project: The European Digital Mathematics Library (EuDML), Grant Agreement number 250503; co-funded by the European Commission within the ICT Policy Support Programme. Dissemination level: Public.
Authors: Volker Sorge, Mark Lee, Alan P. Sexton (University of Birmingham); Petr Sojka (Masaryk University). Contributor: Martin Jarmar (Masaryk University).
Revision history: 0.1 (15 Oct 2010, content outline); 0.2 (28 Jan 2011, Section 2 completed); 0.3 (4 Feb 2011, MU experience part added); 0.4 (13 Feb 2011, Sections 3–5 completed); 0.5 (14 Feb 2011, development version); 0.6 (14 Feb 2011, typo corrections and polishing); 0.7 (19 Feb 2011, conclusions on language translation added); 1.0 (28 Feb 2011, reviewer-suggested changes and final editing).
Statement of originality: this deliverable contains original unpublished work except where clearly indicated otherwise; previously published material and the work of others is acknowledged through appropriate citation, quotation or both.
Abstract: The purpose of this report is to present the state of the art in accessibility tools that can be used to provide access to mathematical literature for visually impaired users as well as print-impaired users (i.e., people with specific learning disabilities like dyslexia), together with an overview of current automated translation technology.
  • Singing Voice Resynthesis Using Concatenative-Based Techniques
Singing Voice Resynthesis using Concatenative-based Techniques
Nuno Miguel da Costa Santos Fonseca
Faculty of Engineering, University of Porto, Department of Informatics Engineering, December 2011
Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor in Informatics Engineering, supervised by Prof. Aníbal João de Sousa Ferreira (Department of Electrical and Computer Engineering) and Prof. Ana Paula Cunha da Rocha (Department of Informatics Engineering), Faculty of Engineering, University of Porto.
This dissertation had the kind support of FCT (Portuguese Foundation for Science and Technology, an agency of the Portuguese Ministry for Science, Technology and Higher Education) under grant SFRH/BD/30300/2006, and has been articulated with research project PTDC/SAU-BEB/104995/2008 (Assistive Real-Time Technology in Singing), whose objectives include the development of interactive technologies helping the teaching and learning of singing.
Copyright © 2011 by Nuno Miguel C. S. Fonseca. All rights reserved. No parts of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic …
  • Speech Synthesis
Contents: 1 Introduction (1.1 Quality of a Speech Synthesizer; 1.2 The TTS System); 2 History (2.1 Electronic Devices); 3 Synthesizer Technologies (3.1 Waveform/Spectral Coding; 3.2 Concatenative Synthesis: 3.2.1 Unit Selection Synthesis, 3.2.2 Diphone Synthesis, 3.2.3 Domain-Specific Synthesis; 3.3 Formant Synthesis; 3.4 Articulatory Synthesis; 3.5 HMM-Based Synthesis; 3.6 Sine Wave Synthesis); 4 Challenges (4.1 Text Normalization Challenges: 4.1.1 Homographs, 4.1.2 Numbers and Abbreviations; 4.2 Text-to-Phoneme Challenges; 4.3 Evaluation Challenges); 5 Speech Synthesis in Operating Systems (5.1 Atari; 5.2 Apple; 5.3 AmigaOS; 5.4 Microsoft Windows); 6 Speech Synthesis Markup Languages; 7 Applications (7.1 Contact Centers; 7.2 Assistive Technologies; 7.3 Gaming and Entertainment); 8 References
© Specialty Answering Service. All rights reserved.
1 Introduction: The word 'synthesis' is defined by Webster's Dictionary as 'the putting together of parts or elements so as to form a whole'. Speech synthesis generally refers to the artificial generation of the human voice, either in the form of speech or in other forms such as a song. The computer system used for speech synthesis is known as a speech synthesizer. There are several types of speech synthesizers (both hardware-based and software-based) with different underlying technologies.
  • A Unit Selection Text-to-Speech-and-Singing Synthesis Framework from Neutral Speech: Proof of Concept
A unit selection text-to-speech-and-singing synthesis framework from neutral speech: proof of concept
Marc Freixes*, Francesc Alías and Joan Claudi Socoró
Freixes et al., EURASIP Journal on Audio, Speech, and Music Processing (2019) 2019:22, https://doi.org/10.1186/s13636-019-0163-y (Research, Open Access)
Abstract: Text-to-speech (TTS) synthesis systems have been widely used in general-purpose applications based on the generation of speech. Nonetheless, there are some domains, such as storytelling or voice output aid devices, which may also require singing. To enable a corpus-based TTS system to sing, a supplementary singing database should be recorded. This solution, however, might be too costly for eventual singing needs, or even unfeasible if the original speaker is unavailable or unable to sing properly. This work introduces a unit selection-based text-to-speech-and-singing (US-TTS&S) synthesis framework, which integrates speech-to-singing (STS) conversion to enable the generation of both speech and singing from an input text and a score, respectively, using the same neutral speech corpus. The viability of the proposal is evaluated considering three vocal ranges and two tempos on a proof-of-concept implementation using a 2.6-h Spanish neutral speech corpus. The experiments show that challenging STS transformation factors are required to sing beyond the corpus vocal range and/or with notes longer than 150 ms. While score-driven US configurations allow the reduction of pitch-scale factors, time-scale factors are not reduced due to the short length of the spoken vowels. Moreover, in the MUSHRA test, text-driven and score-driven US configurations obtain similar naturalness rates of around 40 for all the analysed scenarios.
  • Speech Synthesis
6. Text-to-Speech Synthesis (most of these slides come from Dan Jurafsky's course at Stanford)

History of Speech Synthesis
• In 1779, the Danish scientist Christian Kratzenstein builds models of the human vocal tract that can produce the five long vowel sounds.
• In 1791 Wolfgang von Kempelen (the creator of the Turk chess-playing machine) devises the bellows-operated "automatic-mechanical speech machine". It added models of the tongue and lips that allowed it to produce vowels as well as consonants.
• In 1837 Charles Wheatstone produces a "speaking machine" based on von Kempelen's design.
• In the 1930s, Bell Labs developed the VOCODER, a keyboard-operated electronic speech analyzer and synthesizer that was said to be clearly intelligible. It was later refined into the VODER, which was exhibited at the 1939 New York World's Fair.

Von Kempelen:
• Small whistles controlled the consonants
• Rubber mouth and nose; the nose had to be covered with two fingers for non-nasals
• Unvoiced sounds: mouth covered, auxiliary bellows driven by a string provides the puff of air
(Images: Von Kempelen's speaking machine, from Traunmüller's web site; the Bell Labs VOCODER machine, Homer Dudley, 1939)

Homer Dudley's VODER
• Synthesizing speech by electrical means; shown at the 1939 World's Fair
• Manually controlled through a complex keyboard; operator training was a problem
• One of the first "talking" computers

Closer to a natural vocal tract: Riesz 1937

The UK Speaking Clock
• July 24, 1936
• Photographic storage on 4 glass disks: 2 disks for minutes, 1 for the hour, one for the seconds
• Other words in the sentence distributed across the 4 disks, so all 4 are used at once
  • A Score-to-Singing Voice Synthesis System for the Greek Language
Proceedings of the International Computer Music Conference (ICMC07), Copenhagen, Denmark, 27–31 August 2007
A SCORE-TO-SINGING VOICE SYNTHESIS SYSTEM FOR THE GREEK LANGUAGE
Varvara Kyritsi (University of Athens, Department of Informatics and Telecommunications); Anastasia Georgaki (University of Athens, Department of Music Studies); Georgios Kouroupetroglou (University of Athens, Department of Informatics and Telecommunications). Contact: [email protected], [email protected], [email protected]
ABSTRACT: In this paper, we examine the possibility of generating Greek singing voice with the MBROLA synthesizer, by making use of an already existing diphone database. Our goal is to implement a score-to-singing synthesis system, where the score is written in a score editor and saved in the MIDI format. However, MBROLA accepts phonetic files as input rather than MIDI ones, and because of this we construct a MIDI-file to phonetic-file converter, which applies irrespective of the underlying language. The evaluation of the Greek singing voice synthesis system is achieved through a psychoacoustic experiment.
… [15], [11]. Consequently, a SVS system must include control tools for various parameters, such as phone duration, vibrato, pitch, volume etc. Since 1980, various projects have been carried out all over the world, through different optical views, languages and methodologies, focusing on the mystery of the synthetic singing voice [10], [11], [13], [14], [17], [18], and [19]. The different optical views extend from the development of a proper technique for the naturalness of the sound quality to the design of the rules that determine the adjunction of phonemes or diphones into phrases and their expression. Special studies on the emotion of the …