Developing a Speech Unit Framework


Developing a Speech Unit Framework

Sushil Bastola

Metropolia University of Applied Sciences
Bachelor of Engineering, Software Engineering
Bachelor's Thesis
30 August 2018

Author: Sushil Bastola
Title: Developing a Speech Unit Framework
Number of Pages: xx pages + x appendices
Date: 30 August 2018
Degree: Bachelor of Engineering
Degree Programme: Information Technology
Professional Major: Software Engineering
Instructors: Janne Salonen, Principal Lecturer

This project aims to build a speech unit framework for Kone to automate the process of generating announcements, translating them into multiple languages and exporting them to the audio formats required by different device types, such as elevators, doors and gates, using speech synthesis.

Traditionally, Kone has generated announcements manually. Specialized personnel speak into recording devices, and the recorded voices are later installed as announcements on the different devices. The process is repeated for each language using native speakers from the corresponding countries, and the resulting audio files are saved and installed on the different device types. This process can be expensive, inconsistent and laborious in the long run, since the speech must be translated into many different languages.

This project built a system that uses speech synthesizers from AWS to automate the process of creating announcements. The system can generate audio in selected languages with custom settings and filters. The audio can be exported as a zip file in a specific format and then installed on the corresponding device type.

To conclude, the project succeeded in automating the process of generating announcements using speech synthesizers. The resulting system minimizes the inconsistency of the traditional approach and offers a faster, more reliable and cheaper solution to the problem.
Keywords: AWS, TTS, Polly, Synthesizers, Microservice, Architecture, Speech

Contents

List of Abbreviations
1 Introduction
2 Overview of TTS
  2.1 Brief History
  2.2 Implementation Techniques
  2.3 Tools and Technologies
    2.3.1 Speech synthesis software and APIs
  2.4 Applications of Speech Synthesis
    2.4.1 Applications for the Blind
    2.4.2 Applications for the Deafened and Vocally Handicapped
    2.4.3 Educational Applications
    2.4.4 Applications for Telecommunications and Multimedia
3 Kone
  3.1 Brief History
  3.2 Main Expertise
    3.2.1 Elevators, escalators and automatic building door solutions
    3.2.2 Maintenance and modernization
    3.2.3 Advanced people flow solutions
4 Addressing the Problem
5 Implementation
  5.1 Technologies
    5.1.1 Docker
    5.1.2 React
    5.1.3 Redux
    5.1.4 Koa
    5.1.5 Sequelize
    5.1.6 JWT
    5.1.7 Postgres
    5.1.8 Swagger
  5.2 Application Architecture
    5.2.1 Microservices
  5.3 Amazon Polly
  5.4 Execution
    5.4.1 Requirement Analysis
    5.4.2 System Design
    5.4.3 Execution
    5.4.4 Integration and Deployment
  5.5 Outcome and Observations
6 Conclusion
7 References
Appendices

List of Abbreviations

TTS      Text to Speech
NTTS     Neural Text to Speech
AWS      Amazon Web Services
API      Application Programming Interface
DOM      Document Object Model
MVC      Model View Controller
ORM      Object Relational Mapping
SQL      Structured Query Language
JS       JavaScript
OS       Operating System
MVCC     Multi-Version Concurrency Control
JWT      JSON Web Token
JSON     JavaScript Object Notation
HMAC     Hash-based Message Authentication Code
RSA      Rivest, Shamir, and Adleman
RFC      Request for Comments
ECDSA    Elliptic Curve Digital Signature Algorithm
SSML     Speech Synthesis Markup Language
MP3      MPEG Audio Layer III
MVP      Minimum Viable Product
CRUD     Create, Read, Update, Delete
OAS      OpenAPI Specification
YAML     YAML Ain't Markup Language
UI       User Interface
PAT      Parametric Artificial Talker
VOCODER  Voice Coder
IoT      Internet of Things
UWP      Universal Windows Platform
PHP      PHP: Hypertext Preprocessor
ASP      Active Server Pages
HTML     Hypertext Markup Language
ES6      ECMAScript 6

1 Introduction

Speech synthesis is the process of artificially producing human speech, usually by computer. The software that produces the artificial speech is called a speech synthesizer. A TTS system converts ordinary language text into speech, while other systems render symbolic linguistic representations into speech. [1]

The project aimed to replace the traditional way of generating announcements in many different languages with manual human effort by speech synthesis technologies. With speech synthesizers, the process can be automated using online cloud services that provide TTS conversion in real time. Since the technology of speech synthesis has evolved drastically in the past few years, these services are trustworthy and resilient. The approach of using speech synthesis is therefore cheaper, faster, more reliable and more consistent than the traditional method. The main goal of the project was to use speech synthesis, rather than human speakers, for the process of generating announcements.
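To make the cloud-based approach concrete, the following is a minimal sketch of a real-time TTS request using Amazon Polly, the AWS synthesizer used later in this project, through the AWS SDK for JavaScript. The region, voice, sample text and output file name are illustrative assumptions, not the project's actual configuration.

```javascript
// Minimal sketch: one real-time TTS request against Amazon Polly.
// Assumes the aws-sdk package (v2) and AWS credentials in the environment.
const AWS = require('aws-sdk');
const fs = require('fs');

const polly = new AWS.Polly({ region: 'eu-west-1' });

const params = {
  OutputFormat: 'mp3',              // Polly also supports 'ogg_vorbis' and 'pcm'
  VoiceId: 'Joanna',                // one voice would be chosen per target language
  Text: 'The elevator is going up.',
};

polly.synthesizeSpeech(params, (err, data) => {
  if (err) throw err;
  // data.AudioStream is a Buffer containing the encoded audio.
  fs.writeFileSync('announcement.mp3', data.AudioStream);
});
```

Generating the same announcement in several languages then reduces to looping over (language, voice, text) combinations and collecting the resulting audio files.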
2 Overview of TTS

2.1 Brief History

The earliest attempts at creating artificial speech date back over two hundred years. They relied on mechanical devices to produce speech, since electrical signals had not yet been harnessed. The Danish scientist Christian Kratzenstein, working in St. Petersburg, used a mechanical device to model the human vocal tract and produce artificially synthesized vowels. He first built acoustic resonators resembling the human vocal tract and then activated the resonators with vibrating reeds. The outline of the device is shown in figure 2. [1]

This invention was followed by an improved version created by Wolfgang von Kempelen, which added models of the lips and tongue and could produce consonants as well as vowels. This design was refined by other scientists over the following decades. [1]

Later, in the 1930s, Bell Labs developed the VOCODER, a keyboard-operated electronic speech analyzer and synthesizer, which was refined into the VODER, the first electronic speech synthesizer. In 1953, PAT (Parametric Artificial Talker) was invented, consisting of three formant resonators connected in parallel. Following the arrival of the relatively cheap TMS5100 chip, Texas Instruments launched a product called Speak & Spell in 1978, designed to help children with reading. [1]

The technology of speech synthesis has grown more complex and sophisticated in the modern era. Algorithms such as HMMs and neural networks are now used, which are more accurate and resilient. Since the technology has evolved to the point where it is trustworthy, it is used in numerous fields, including health care, education, entertainment and telecommunications. [1]

2.2 Implementation Techniques

Speech synthesis can be implemented in several different ways; the techniques can be classified into three types.

The first technique is articulatory synthesis, which tries to model the human vocal organs and vocal cords. The articulators and vocal cords are modeled with sets of area functions between the glottis and the mouth. When speaking, the vocal tract muscles contract, causing the articulators to move and change the shape of the vocal tract, which in turn produces different sounds. This technique usually produces high-quality synthetic speech, but it is also difficult to model, since a large amount of data must be processed. [1]

The second technique is formant synthesis, which is based on the source-filter model of speech. There are two basic structures, cascade and parallel, but some combination of the two is usually used to obtain better-quality output. Formant synthesis allows an infinite number of sound combinations, which makes it more flexible than some other techniques. A cascade formant synthesizer has band-pass resonators connected in series, with the output of each resonator supplied as input to the following one. This structure is simpler to implement, since it only uses formant frequencies as control information, and it has been found useful for non-nasal voices. [1]

Figure 3: Simple layout of a cascade formant synthesizer.

Figure 4: Simple layout of a parallel formant synthesizer.

In a parallel formant synthesizer, the resonators are connected in parallel. The excitation signal is supplied to all formants simultaneously and their outputs are summed. This structure carries more control information, since the bandwidth and the gain of each formant can be controlled individually. This type of synthesizer has been found better for nasal voices, fricatives and stop consonants. [1]
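The cascade structure can be stated compactly in code. The following is a minimal sketch of a second-order digital band-pass resonator, the standard building block of formant synthesizers, chained into a three-formant cascade; the formant frequencies, bandwidths and the crude impulse-train source are illustrative values only, not taken from the thesis.

```javascript
// Minimal sketch of a cascade formant synthesizer.
// Each resonator is a standard second-order digital band-pass filter:
//   y[n] = A*x[n] + B*y[n-1] + C*y[n-2]
function makeResonator(freq, bandwidth, sampleRate) {
  const T = 1 / sampleRate;
  const C = -Math.exp(-2 * Math.PI * bandwidth * T);
  const B = 2 * Math.exp(-Math.PI * bandwidth * T) * Math.cos(2 * Math.PI * freq * T);
  const A = 1 - B - C; // normalizes the gain at the formant frequency
  let y1 = 0, y2 = 0;  // the previous two output samples
  return function filter(x) {
    const y = A * x + B * y1 + C * y2;
    y2 = y1;
    y1 = y;
    return y;
  };
}

const sampleRate = 16000;
// Illustrative formant frequencies and bandwidths (Hz), roughly vowel-like.
const formants = [[700, 130], [1220, 70], [2600, 160]];
const cascade = formants.map(([f, bw]) => makeResonator(f, bw, sampleRate));

// Cascade structure: a crude 100 Hz impulse train (the "glottal source")
// passes through each resonator in series.
const samples = [];
for (let n = 0; n < sampleRate; n++) {
  let s = n % 160 === 0 ? 1 : 0;
  for (const resonator of cascade) s = resonator(s);
  samples.push(s);
}
// `samples` now holds one second of a static, vowel-like sound.
```

A parallel synthesizer would instead feed the source to every resonator, scale each output by a per-formant gain and sum the results, which is where the extra control information mentioned above comes from.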
The third technique is concatenative synthesis. It uses a large collection of recorded speech data, large enough to cover the sounds of the language. The units of speech data are modified and used as necessary, and the method depends heavily on runtime selection and editing of the speech units available in the database. The selection and storage of speech units is computationally heavy and requires a lot of memory. In addition, concatenative synthesizers are usually limited to one speaker and one voice. [1]

2.3 Tools and Technologies

2.3.1 Speech synthesis software and APIs

Several technology giants offer TTS APIs to their customers in order to make the development of TTS applications faster, easier and more convenient. Companies such as Amazon, Google and Microsoft have played a big role in the rapid development of the TTS field in recent years, while companies such as IVONA, NeoSpeech and ReadSpeaker have also been in this business for a while.

The following is a list of some of the companies working with speech synthesis.

Acapela

Acapela provides TTS software and services, with SDK solutions for Windows, Mac OS X, Windows Server, Linux server, UWP, iOS, Android, embedded Linux and Windows Mobile. Most of the solutions they provide are cloud based. Formed from the merger of three companies, Acapela supports TTS services in more than 30 languages, narrowing its support down even to accents, dialects and local voices.
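Many such services accept SSML in addition to plain text, which is how announcement-style pauses and emphasis are typically controlled. The following is a minimal sketch assuming Amazon Polly's API (the TextType parameter and the break and emphasis tags are standard; the announcement text is illustrative).

```javascript
// Minimal sketch: an SSML request, again assuming Amazon Polly (aws-sdk v2).
const AWS = require('aws-sdk');
const fs = require('fs');

const polly = new AWS.Polly({ region: 'eu-west-1' });

// <break> inserts a pause; <emphasis> stresses a phrase.
const ssml = `
<speak>
  Doors closing.
  <break time="500ms"/>
  <emphasis level="strong">Please stand clear.</emphasis>
</speak>`;

polly.synthesizeSpeech(
  { OutputFormat: 'mp3', VoiceId: 'Joanna', TextType: 'ssml', Text: ssml },
  (err, data) => {
    if (err) throw err;
    fs.writeFileSync('door-announcement.mp3', data.AudioStream);
  }
);
```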