Using speech synthesis to create stimuli for speech perception research

Prof. Simon King
The Centre for Speech Technology Research
University of Edinburgh, UK
www.cstr.ed.ac.uk

INSPIRE 2013, Sheffield

Contents

• Part I : Motivation
• why synthesis might be a useful tool
• Part II : Core techniques
• formant synthesis
• articulatory synthesis
• physical modelling
• vocoding
• concatenation of diphones
• concatenation of units from a large inventory
• statistical parametric speech synthesis (HMMs)
• Part III : The state of the art
• controllable HMM synthesis with articulatory and formant controls

Part I

Motivation

Goal: investigate speech perception

• How?

• form a hypothesis

• design experiment

• design the stimuli
• create the stimuli (design is limited by the methods available for creation)
• play stimuli to listeners

• obtain responses

• analyse responses

• support / refute hypothesis

Designing stimuli

• Usually speech or speech-like sounds

• Natural speech

• elicited from one or more speakers

• Manipulated natural speech

• filtered - e.g., delexicalised

• edited - e.g., modify temporal structure, remove acoustic cues, splice, ...

• Synthetic speech

• several methods available

• which should we choose?

• Other synthetic sounds - e.g., sine wave speech

The limits of manually manipulating natural speech

• Manual editing means that only limited forms of modification are possible

• remove information

• splice together individual natural sounds

• Laborious

• Highly skilled

• Therefore very slow to create stimuli

• Places limits on the experiments that can be performed

• a bias towards certain types of stimuli

Doing it automatically - decomposing speech

• The speech signal we observe (the waveform) is the product of interacting processes operating at different time scales

• at any moment in time, the signal is affected not just by the current phoneme, but by many other aspects of the context in which it occurs

• the context is complex - it’s not just the preceding/following sounds

• We have a conflict: we want to simultaneously:

• model speech as a linear string of units, for engineering simplicity

• take into account all the long-range effects of context, before, during and after the current moment in time

Speech is produced by several interacting processes

Resolving this conflict: take context into account

• The context in which a speech sound is produced affects that sound

• articulatory constraints: where the articulators are coming from / going to

• phonological effects

• prosodic environment

Modern speech synthesis

4 examples: diphones, unit selection, HMMs (x2)

http://www.cstr.ed.ac.uk/projects/festival/morevoices.html

Part II

Core techniques

Part II

Core techniques - formant synthesis

Stimulus design: a simple consonant-vowel sequence

[Figure 1 (Mayo (Segmental), JASA): spectrograms (0-8000 Hz, 0-300 ms) of synthetic consonant-vowel stimuli - [sa] ("sigh") vs [ʃa] ("shy"), [de] vs [be], [ta] vs [da], [ti] vs [di].]

Stimulus design: a continuum

[Figures 2 and 3 (Mayo (Segmental), JASA): a 9-point "sigh" ..... "shy" continuum created by varying the frequency of the frication noise, using the vowel from "sigh"; number of /s/ vs /ʃ/ responses across the continuum for adults, 7-year-olds, 5-year-olds and 3- to 4-year-olds.]

(audio: 9 point continuum)

The Klatt vocal tract model

• F0 and gain
• Up to six vocal tract resonances: formants F1, F2, ..., F6, each with a bandwidth B1, ..., B6
• Aspiration and frication noise
• Nasal zero (the 'anti-resonance' introduced when the nasal cavity is opened by lowering the velum)

[Figure: block diagram of the Klatt synthesiser - voicing source with spectral tilt, aspiration and frication noise sources, a cascade of formant resonators (F1-F5 with bandwidths B1-B5), a nasal pole/zero branch, and a parallel branch with per-formant amplitudes (A2-A5) and a bypass path.]

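As a rough illustration of the cascade formant idea above (not the Klatt synthesiser itself), here is a minimal NumPy/SciPy sketch: an impulse-train voice source with a crude spectral tilt, passed through second-order resonators defined by formant frequencies and bandwidths. All parameter values are illustrative.

import numpy as np
from scipy.signal import lfilter

fs = 16000                                          # sample rate (Hz)
f0 = 120.0                                          # fundamental frequency (Hz)
dur = 0.5                                           # seconds
formants = [(700, 130), (1220, 70), (2600, 160)]    # (Fi, Bi) pairs for a rough [a]

# Voice source: impulse train at F0, then a one-pole filter as a crude spectral tilt
n = int(dur * fs)
source = np.zeros(n)
source[::int(fs / f0)] = 1.0
source = lfilter([1.0], [1.0, -0.95], source)

# Cascade of second-order resonators, one per formant
speech = source
for F, B in formants:
    r = np.exp(-np.pi * B / fs)
    theta = 2 * np.pi * F / fs
    a = [1.0, -2 * r * np.cos(theta), r ** 2]       # resonator poles from (F, B)
    b = [sum(a)]                                    # scale for unity gain at DC
    speech = lfilter(b, a, speech)

speech /= np.abs(speech).max()                      # normalise for playback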
Creating speech using the Klatt model

Manipulating speech via the Klatt model

(audio: original, Klatt, Klatt (stylised))

Pros and cons of formant-based systems

• Pros

• Allows incorporation of linguistic knowledge in a transparent way

• Precise control over every parameter value

• Low memory requirements and low computational cost (for simple models like Klatt)

• Cons

• Speech quality is ultimately limited by the vocal tract model

• Skilled and laborious work to create high-quality output

• Work involved leads to a strong bias towards very short stimuli

Text-to-speech using formant synthesis

• The best-known system is MITalk (1970s), but hardware-based predecessors include PAT (Edinburgh, 1950s-1960s), OVE (KTH, Sweden, 1960s), and others

• Uses rules to drive an abstract & simplified vocal tract model

• MITalk was also implemented in hardware (DECtalk, as used by Stephen Hawking)

• This type of system takes a long time to develop: rule sets written by experts

Example: MITalk
http://festvox.org/history/klatt.html

Driving the Klatt model from text with rules

• A synthesiser like MITalk determines values for vowel formants using rules

• start with a fixed default (target) value for every vowel

• modify using co-articulation rules, reduction, etc.

• The Klatt vocal tract model is still used to create stimuli for phonetic experiments

• reasonable results can be obtained by experts

• but driving it automatically with rules is another matter

• It is only used for text-to-speech in legacy applications

Part II

Core techniques - articulatory synthesis

VocalTractLab

(audiovisual example)

HLSyn from Sensimetrics

• "quasi-articulatory" synthesiser
• specify the vocal tract in terms of both physical dimensions and formant frequencies
• fewer parameters than Klatt (13, instead of 40-60)
• no longer available as a product

[Figure: the 13 physiologically-based HL parameters (f0, ag, ap, al, ab, an, ue, ps, dc, f1, f2, f3, f4) are mapped, via relations that include a circuit model used to calculate pressures and flows, onto the 40-50 acoustic KL parameters (AV, OQ, AF, ..., F1, F2, B1, B2, ...), which define the sources and transfer functions that produce the speech output.]

(audio examples: original, copy synthesis) - credit: Prosynth project, UCL

TADA : TAsk Dynamic Application

• Based on Browman & Goldstein's Task Dynamic model of speech production
• Synthesis achieved using HLSyn
• Runs in MATLAB (a stand-alone version is also available)

(audiovisual example)

Pros and cons of articulatory synthesis

• Pros

• Allows incorporation of linguistic knowledge in a transparent way

• Interesting way to explore and understand speech production

• Reasonably accurate control over articulator positions

• Cons

• Not a comprehensive model

• HLSyn = 4 formants and a few physical parameters

• VocalTractLab = synthetic sources + physically-modelled filter

• Speech quality is ultimately very limited by the vocal tract model

• Skilled and laborious work to create high-quality output

Part II

Core techniques - physical modelling

Physical modelling & simulation

[Figure and text excerpt: a two-dimensional digital waveguide mesh (DWM) model of the vocal tract, excited at the glottal end, simulating the vowel /u/; noise-excited spectra from the 1-D and 2-D models are compared. From IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 3, May 2006.]

Part II

Core techniques - vocoding

Factorising speech

Factorising speech

Source filter model

The STRAIGHT vocoder (Kawahara)

[Figure: schematic structure of Matlab STRAIGHT - the input speech is decomposed into F0 information, interference-free (smoothed) spectral information, and mixed-mode excitation source information; after F0 and spectral modification, a minimum-phase filter driven by an excitation source with group-delay manipulation produces the output speech. Figure taken from Banno et al., "Implementation of Realtime STRAIGHT", Acoust. Sci. & Tech. 28, 3 (2007).]

STRAIGHT - graphical interface for manipulation

STRAIGHT - morphing between natural speech samples

[Figures: screenshots of the STRAIGHT morphing GUI - temporal anchor assignment over a distance matrix between two examples, and frequency anchor assignment on spectrum slices of example A and example B. From Proceedings of the 2009 APSIPA Annual Summit and Conference, Sapporo, Japan, October 4-7, 2009.]

STRAIGHT - a continuum created via morphing

“sigh” ...... “shy”

Pros and cons of vocoding

• Pros

• Starting point is natural speech

• Extremely high quality (for small modifications, at least)

• Can modify various aspects of speech independently

• Cons

• Imprecise control

• no direct control of individual formants, voice onset time

• instead, manipulation of spectral envelope, etc

• Still needs natural speech samples as a starting point

Part II

Core techniques - concatenation

Let's create a word

train peas

trees

Part II

A side issue: text processing

Text processing

• Text processing breaks the original input text into units suitable for further processing; this involves tasks such as
• expanding abbreviations
• part-of-speech (POS) tagging
• letter-to-sound rules
• prosody prediction
• We end up with a 'linguistic specification' - in other words, all the information required to generate a speech waveform, such as
• phone sequence
• phone durations
• pitch contour

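For illustration only, a toy sketch of what a 'linguistic specification' data structure might contain; real systems differ, and the field names here are invented:

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class PhoneSpec:
    phone: str            # e.g. "ae"
    duration_ms: float    # predicted duration
    stressed: bool        # lexical stress of the containing syllable

@dataclass
class LinguisticSpec:
    text: str
    phones: List[PhoneSpec] = field(default_factory=list)
    f0_contour: List[Tuple[float, float]] = field(default_factory=list)  # (time_ms, Hz)

spec = LinguisticSpec(text="the cat sat")
spec.phones.append(PhoneSpec("dh", 45.0, False))
spec.phones.append(PhoneSpec("ax", 60.0, False))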
From multi-level / tiered linguistic information to a linear string of context dependent units

pitch accent / phrase initial / phrase final
sil dh ax k ae t s ae t sil    "the cat sat"    DET NN VB    ((the cat) sat)

For the phoneme [ax]: left context: sil dh; right context: k ae; position in phrase: initial; syllable stress: unstressed; etc....

Front end plus waveform generation

• Front end

• input is text

• output is a linguistic specification

pitch accent / phrase initial / phrase final
sil dh ax k ae t s ae t sil    "the cat sat"    DET NN VB    ((the cat) sat)

• Waveform generation

• concatenation, or

• generate from a model

Part II

Core techniques - concatenation of diphones

Diphones

• The second half of one phone plus the first half of the following phone

• Concatenation points (joins) are in mid-phone ‘stable’ positions

• There will be one join per phone

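Converting a phone string into a diphone sequence is mechanical; a toy sketch (using '#' for silence at the utterance edges):

phones = ["#", "t", "r", "iy", "z", "#"]                      # "trees"
diphones = [f"{a}-{b}" for a, b in zip(phones, phones[1:])]   # adjacent phone pairs
print(diphones)   # ['#-t', 't-r', 'r-iy', 'iy-z', 'z-#']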
Time-domain

• The inventory contains the waveform plus pitch-marks for each speech unit (i.e., diphone)

Units have their original duration and F0, which will get modified during waveform generation. The pitch-marks are needed by PSOLA-type algorithms.

PSOLA (Pitch Synchronous OverLap and Add)

• The first method we consider for modifying F0 and duration is a time domain version of PSOLA called TD-PSOLA.

• It operates directly on waveforms

How TD-PSOLA works

• Deal with individual pitch periods (each of which is essentially the impulse response of the vocal tract)

• The pitch periods themselves are not modified

• To increase F0, periods are moved closer together; where they overlap, we add the waveforms

• To decrease F0, periods are moved further apart

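A minimal sketch of the overlap-add operation just described, assuming a waveform and its pitch-mark sample indices are already available; a real TD-PSOLA implementation also handles unvoiced regions, duration modification and edge effects:

import numpy as np

def td_psola_change_f0(wav, marks, factor):
    """Raise (factor > 1) or lower (factor < 1) F0; duration is roughly preserved.
    wav: 1-D float array; marks: ascending list of pitch-mark sample indices."""
    marks = list(marks)
    out = np.zeros(len(wav))
    # 1. Re-space the pitch marks: new spacing = local pitch period / factor
    syn_marks = [marks[1]]
    while syn_marks[-1] < marks[-2]:
        i = int(np.argmin(np.abs(np.asarray(marks) - syn_marks[-1])))
        i = min(max(i, 1), len(marks) - 2)
        step = max(1, int(round((marks[i + 1] - marks[i]) / factor)))
        syn_marks.append(syn_marks[-1] + step)
    # 2. At each new mark, copy the nearest original two-period grain and overlap-add it
    for t in syn_marks:
        i = int(np.argmin(np.abs(np.asarray(marks) - t)))
        i = min(max(i, 1), len(marks) - 2)
        left, right = marks[i] - marks[i - 1], marks[i + 1] - marks[i]
        grain = wav[marks[i] - left : marks[i] + right] * np.hanning(left + right)
        if t - left >= 0 and t + right <= len(out):
            out[t - left : t + right] += grain
    return out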
TD-PSOLA

• Decreasing F0:

TD-PSOLA

• Increasing F0:

Overlap and add

TD-PSOLA

• Increasing duration:

duplicate

Pros and cons of diphone synthesis

• Pros

• fast to generate speech

• direct control over F0 and duration of each phone

• Cons

• sounds quite bad

• not really of interest in speech perception research

• but ... is a precursor to the next method: unit selection

Part II

Core techniques - concatenation of units from a large inventory

“unit selection”

Concatenative systems (“cut and paste”)

• Most common method in commercial use
• Example systems
• CHATR, Ximera – ATR, Japan
• Festival – University of Edinburgh, UK
• rVoice – Rhetorical, UK (now Nuance)
• Natural Voices – AT&T, USA
• RealSpeak – ScanSoft (now Nuance)
• Vocalizer – Nuance
• Loquendo TTS – Loquendo, Italy (now Nuance)
• InterPhonic – iFlyTek, China
• IVONA – IVO software, Poland (now Amazon)
• SVOX, Switzerland (now Nuance)
• Cepstral, USA
• Phonetic Arts, UK (now Google)
• CereVoice – Cereproc, UK

Components of a concatenative system

• A pipeline of processes takes us from input text to output waveform

• This pipeline can be broken into two main parts

• the ‘front end’

• waveform generation

• The front end infers additional information including pronunciation, intonation and phrasing to produce a ‘linguistic specification’

• Waveform generation creates a waveform that meets this specification

Examples: diphones vs. unit selection

Unit selection

• In an ideal world, we would concatenate a sequence of speech units from precisely matching contexts

• Unfortunately, that would mean pre-recording every sentence in the language

• In practice, if we can’t find the speech sound from a precisely matching context, then we choose a version of that sound from a similar context

• in other words, a context that will have a similar effect on the sound

• For example:

• can’t find “phrase-final [a] in the context [n]_[t]”

• choose “phrase-medial [a] in the context [m]_[d]”

Coverage affects quality

• A larger database gives better coverage, but

• takes longer to record

• takes more storage space

• leads to higher computational cost during unit selection search

• 400 sentence database

• 2000 sentence database

Target sequence and candidates

target # dh ax k ae t s ae t #

[Figure: candidate units from the database arranged in columns, one column per target position; some positions (e.g. ae) have many candidates, others have fewer.]

Linguistic criteria

• The ideal unit sequence would comprise units taken from identical linguistic contexts to those in the sentence being synthesised

• of course, this will not be possible in general, so we must use less-than-ideal units from non-identical (i.e., mismatched) contexts

• need to quantify how close to ideal they are, so we can choose amongst them

• The mismatch between the linguistic context of a candidate unit and the ideal (i.e., target) context is measured by the target cost

Acoustic criteria

• After units are taken from the database, they will be joined (concatenated)

• Cannot simply join two fragments of speech and hope that it will sound OK - it generally will not !

• Why? Because of mismatches in acoustic properties around the join point, such as

• differences in the spectrum, F0, or energy

• The acoustic mismatch between consecutive candidate units is measured by the join cost

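Such a join cost can be sketched as a weighted distance between acoustic feature vectors measured either side of the join; the features and weights below are illustrative, not those of any particular system:

import numpy as np

def join_cost(left_unit_end, right_unit_start, weights=(1.0, 0.5, 0.3)):
    """Each argument is a dict with 'mfcc' (vector), 'f0' (Hz) and 'energy',
    measured at the end of the left unit and the start of the right unit."""
    w_mfcc, w_f0, w_energy = weights
    d_mfcc = np.linalg.norm(np.asarray(left_unit_end["mfcc"]) -
                            np.asarray(right_unit_start["mfcc"]))
    d_f0 = abs(left_unit_end["f0"] - right_unit_start["f0"])
    d_energy = abs(left_unit_end["energy"] - right_unit_start["energy"])
    return w_mfcc * d_mfcc + w_f0 * d_f0 + w_energy * d_energy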
Target cost and join cost

[Figure: the target cost compares the linguistic features of a candidate with the target specification (phonetic context, stress, syllable position, word position, phrase position); the join cost compares acoustic features across the join between consecutive candidates (MFCCs, F0, energy).]

The components of the join cost

[Figure: the left and right units shown as waveform, spectrogram, energy and F0 tracks; the join cost compares these acoustic features at the join point.]

Target cost details

• The target cost measures the mismatch in linguistic features (which encode the context in which the unit appears) between target and candidate units.

• Festival’s multisyn engine uses a handcrafted function which is a simple weighted sum of sub-costs, one per linguistic feature

• these sub-costs are each assigned a weight which determines how important they are with respect to each other

• the weights are set by hand (well, by ear...)

• automatic optimisation of the weights is very hard

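The weighted-sum form described above could be sketched like this; the features and weights are illustrative, not Festival multisyn's actual ones:

def target_cost(target, candidate, weights=None):
    """Weighted sum of per-feature mismatch penalties between target and candidate contexts."""
    weights = weights or {"left_phone": 1.0, "right_phone": 1.0,
                          "stress": 0.5, "phrase_position": 0.5}
    cost = 0.0
    for feature, w in weights.items():
        if target[feature] != candidate[feature]:
            cost += w          # penalty of 1, scaled by the feature weight
    return cost

t = {"left_phone": "n", "right_phone": "t", "stress": 1, "phrase_position": "final"}
c = {"left_phone": "m", "right_phone": "d", "stress": 1, "phrase_position": "medial"}
print(target_cost(t, c))       # 2.5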
Why must a search be performed?

• The total cost of a particular candidate unit sequence under consideration is the sum of

• the target cost for every unit in the sequence

• the join cost between every pair of consecutive units in the sequence

• The choice of which candidate to use in a particular position depends on which units are chosen for the other positions

• therefore, it is not possible to make independent decisions about the best candidate for each unit

• there is a single globally-optimal sequence

• a search is required, to find this sequence

Viterbi search, with pruning

[Figure: the lattice of candidate units for "the cat sat" (# dh ax k ae t s ae t #); the Viterbi search finds the lowest-total-cost path through one candidate per position, pruning unlikely paths as it goes.]

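The search itself is a standard dynamic-programming (Viterbi) pass over the candidate lattice. A minimal sketch, parameterised by whichever target_cost and join_cost functions are chosen (for example the toy versions sketched earlier); pruning is omitted for clarity:

def viterbi_unit_selection(targets, candidates, target_cost, join_cost):
    """targets: one spec per position; candidates: one list of candidate units per position.
    Returns the candidate sequence with the lowest total (target + join) cost."""
    # best[i][j] = (cost of cheapest path ending in candidate j at position i, backpointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for cand in candidates[i]:
            tc = target_cost(targets[i], cand)
            prev_cost, prev_j = min(
                (best[i - 1][k][0] + join_cost(prev_cand, cand), k)
                for k, prev_cand in enumerate(candidates[i - 1]))
            row.append((prev_cost + tc, prev_j))
        best.append(row)
    # Trace back from the cheapest final candidate
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))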
Generation of prosody in unit selection

• The front end produces a description of prosody, from which we generate values for F0 and duration

• In diphone synthesis, signal processing would then be used to impose this onto the concatenated speech units

• but signal processing results in significant artefacts, especially if modification factors are large

• This all assumes that our predictions of prosody (e.g., ToBI tones and break indices, phrase break positions) are accurate

• Unit selection takes a different approach, either

• select units whose prosody matches the generated values, or

• select units whose contexts match the predicted description - they will then already have the correct prosodic properties (F0 contour, duration, etc.)

Pros and cons of unit selection

• Pros

• Fewer joins than the diphone method

• Can change the voice relatively easily without changing any software

• Can sound very much like a particular individual

• Can be very natural sounding indeed

• Cons

• Can still sometimes hear the joins between units

• Many (or most) sentences will include at least one error or artefact

• Large database of speech required for best quality - expensive to make

• Control over most aspects of the speech is limited

Part II

Statistical parametric speech synthesis

Front end plus waveform generation

• Front end

• input is text

• output is a linguistic specification

pitch accent / phrase initial / phrase final
sil dh ax k ae t s ae t sil    "the cat sat"    DET NN VB    ((the cat) sat)

• Waveform generation

• concatenation (e.g., unit selection), or

• generate from a model

Concatenation vs. generation from a model

• Concatenation builds up the utterance from units of recorded speech:

• Generation uses a sequence of models to generate the speech:

model 1 model 2 model 3 model 4

Model-based systems trained on data

• Older model-based approaches such as MITalk

• Rule-based front end processes text and generates a set of parameters

• These parameters are used to drive a simple vocal tract model (Klatt)

• Modern statistical parametric speech synthesis

• Uses a statistical model, trained on data

• Output of the model is typically the set of parameters needed to drive a source-filter model vocoder

• Commonly known as ‘HMM based speech synthesis’ and implemented using the software toolkit HTS

• Overcomes the need for hand written rules: learns from data instead

Modelling a coded representation of speech

• Waveform is not suitable for direct modelling, so use another representation

speech waveform

speech parameters

models model 1 model 2 model 3 model 4

HMMs are generative models

Learning the models from data

• For each training utterance

• create a linguistic specification using the front end

• convert this to a linear sequence of context-dependent phone labels

• assemble an HMM for this utterance by concatenating the corresponding models

• Train the HMM parameters in the same way as for automatic speech recognition

• in simplified terms:

• alignment of the data to model states
• update model parameters
• iterate the two steps above
• ‘alignment’ actually uses soft counting - Expectation-Maximisation

Comparison with ASR

• Differences from automatic speech recognition include

• Synthesis uses a much richer model set, with a lot more context

• For speech recognition: triphone models

• For speech synthesis: “full context” models

• “Full context” = both phonetic and prosodic factors

• Observation vector for HMMs contains the necessary parameters to generate speech, such as spectral envelope + F0 + multi-band noise amplitudes

HMM-based speech synthesis

• Figure adapted from: An HMM-based approach to multilingual speech synthesis, Tokuda, Zen & Black, in Text to speech synthesis: New paradigms and advances; Prentice Hall: New Jersey, 2004

extract spectrum, F0, aperiodic energy

learn model

stored model

generate from model

reconstruct

Examples of statistical parametric speech synthesis

• Statistical parametric speech synthesis method

• Standard voices are built from relatively large amounts of data

• typically 2 to 5 hours of material

• Experienced / professional speakers

2 examples

Generating speech from a HMM

• HMMs are used to generate a parameterised form that we will call ‘speech parameters’

• From the parameterised form, we can generate a waveform

• The parameterised form contains sufficient information to generate speech:

• spectral envelope

• fundamental frequency (F0)

• aperiodic (noise-like) components (e.g., for sounds like ‘sh’ and ‘f’)

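At its most naive, 'generating from the model' could simply mean writing out each state's mean vector for the number of frames assigned by the duration model; the toy sketch below does exactly that, and produces the stepwise trajectories discussed under 'Trajectory generation' later:

import numpy as np

def naive_generate(state_means, state_durations):
    """state_means: one parameter mean vector per HMM state, in order;
    state_durations: number of frames to spend in each state."""
    frames = []
    for mean, dur in zip(state_means, state_durations):
        frames.extend([np.asarray(mean)] * dur)   # repeat the state mean
    return np.vstack(frames)                      # shape: (total frames, parameter dim)

# e.g. a 3-state model of one phone, 2-dimensional parameters, durations 3/5/4 frames
traj = naive_generate([[1.0, 0.2], [1.5, 0.1], [0.8, 0.3]], [3, 5, 4])
print(traj.shape)   # (12, 2)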
Generating speech from the model

Constructing the HMM

• Linguistic specification (from the front end) is a sequence of phonemes, annotated with contextual information

• There is one 5-state HMM for each phoneme, in every required context

• To synthesise a given sentence,

• use front end to predict the linguistic specification

• concatenate the corresponding HMMs

• generate from the HMM

Trajectory generation

• Using an HMM to generate speech parameters

• because of the Markov assumption, the most likely output is the sequence of the means of the Gaussians in the states visited

• this is piecewise constant, and ignores important dynamic properties of speech

• Maximum likelihood parameter generation algorithm (Tokuda and colleagues)

• solves this problem, by correctly using statistics of the dynamic (‘delta’) properties during the generation process

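The standard result (due to Tokuda and colleagues) can be stated compactly. Stack the static parameters of all frames into a vector c and let W be the matrix that appends deltas and delta-deltas, so the full observation sequence is o = Wc. With the state-sequence mean vector μ and covariance Σ, maximising the likelihood N(Wc; μ, Σ) with respect to c gives

$$ W^{\top}\Sigma^{-1}W\,\hat{c} \;=\; W^{\top}\Sigma^{-1}\mu \qquad\Longrightarrow\qquad \hat{c} \;=\; \bigl(W^{\top}\Sigma^{-1}W\bigr)^{-1}W^{\top}\Sigma^{-1}\mu $$

a smooth trajectory whose statics and dynamics are jointly consistent with the model statistics.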
Trajectory generation

[Figure: a generated speech parameter trajectory plotted against time.]

Key insight: use the statistical properties of dynamic features

[Figure: the static parameter c and its delta ∂c from the model states, and the generated trajectory of c, plotted against time.]

Taking the delta and delta-delta features into account

Part II

Statistical parametric speech synthesis

Context-dependent models, learning from data, sparsity, complexity control

Context is the key

• Dealing with context-dependency is essential for good quality

• engineer the system in terms of a simple linear string of units

• then account for context by having a different version of each unit for every different context

• But, how do we know what all the different contexts are?

• If we enumerate all possible contexts, they will be practically infinite

• there are an infinite number of different sentences in a language

• context potentially spans the whole sentence (or further)

• However, what is important is the effect that the context has on the current speech sound - so next we can think about reducing the number of effectively different contexts

Some contexts are (nearly) equivalent

• This insight is the key to unit selection - that’s what the target cost is doing

• In HMM-based synthesis, models will be shared across groups of contexts

• We cannot record and store a different version of every speech sound in every possible context

• there are far too many of them

• some of them will be almost identical, so recording all of them is not necessary

• We can have each speech sound in a variety of different contexts

Flattening the linguistic specification: attaching all features to the segment level

[Figure: the hierarchy of phrases, words (with pitch accent and boundary tone events), syllables and phones (P); all of this information is attached down to each phone segment.]

Context-dependent models

pitch accent phrase initial phrase final

sil dh ax k ae t s ae t sil "the cat sat" DET NN VB ((the cat) sat)

sil^dh-ax+k=ae, "phrase initial", "unstressed syllable", ... • “Author of the ...”

pau^pau-pau+ao=th@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$..... pau^pau-ao+th=er@1_2/A:0_0_0/B:1-1-2@1-2&1-7#1-4$..... pau^ao-th+er=ah@2_1/A:0_0_0/B:1-1-2@1-2&1-7#1-4$..... ao^th-er+ah=v@1_1/A:1_1_2/B:0-0-1@2-1&2-6#1-4$..... th^er-ah+v=dh@1_2/A:0_0_1/B:1-0-2@1-1&3-5#1-3$..... er^ah-v+dh=ax@2_1/A:0_0_1/B:1-0-2@1-1&3-5#1-3$..... ah^v-dh+ax=d@1_2/A:1_0_2/B:0-0-2@1-1&4-4#2-3$..... v^dh-ax+d=ey@2_1/A:1_0_2/B:0-0-2@1-1&4-4#2-3$.....

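A toy sketch of assembling such a label from a flattened specification; the real HTS label format encodes many more features with its own delimiter grammar, and the format string below is a simplified invention:

def full_context_label(phones, i, stress, pos_in_phrase):
    """Quinphone plus a couple of prosodic features for phone i in a padded phone list."""
    ll, l, c, r, rr = phones[i - 2], phones[i - 1], phones[i], phones[i + 1], phones[i + 2]
    return f"{ll}^{l}-{c}+{r}={rr}@stress:{stress}@pos:{pos_in_phrase}"

phones = ["pau", "pau", "ao", "th", "er", "ah", "v", "pau", "pau"]   # "author of"
print(full_context_label(phones, 3, 0, "initial"))
# pau^ao-th+er=ah@stress:0@pos:initial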
The problems caused by context-dependent models

• We cannot be sure to have examples of every unit type in every possible context in the training data

• In reality, the context is so rich (it spans the whole sentence) that almost every single token in the training data is the only token of its type

• and the vast majority of possible types have no training examples

• Two key problems to solve

• train models for types that we have too few examples of

• create models for types that we have no examples of

• Joint solution: parameter sharing amongst groups of similar models

Some contexts exert similar effects

• Key insight

• we can group contexts according to the effect that they have on the centre phoneme

• for example

• the [ae] in the contexts p-ae+t and b-ae+t may be very similar

• how to group these contexts?

• how to represent them so we can form useful groupings?

• use their phonetic features

• place, manner, voicing, ....

Grouping contexts according to phonetic features

• Could try to write rules to express our knowledge of how co-articulation and other context effects work

• “all bilabial stops have a similar effect on the following vowel”

• “all nasals have a similar effect on the preceding vowel”

• ... etc

• Of course, it’s better to learn this from the data, for 2 reasons

• find those groupings that actually make a difference to the acoustics

• scale the granularity of the groups according to how much data we have

• But we still want to make use of our phonetic knowledge

Combining phonetic knowledge with data-driven learning

[Figure: a small decision tree asking "vowel to right?", "nasal to left?", "/uw/ to right?" (yes/no at each node); each leaf is a tied state.]

Decision tree-based state clustering [Odell;'95]

[Figure: decision tree-based state clustering. The states of full-context models such as k-a+b, t-a+n, w-a+t, w-a+sil, gy-a+sil, gy-a+pau are routed through yes/no questions ("L=voice?", "L="w"?", "R=silence?", "L="gy"?"); each leaf node holds a tied state shared by all contexts that reach it, and these tied states are used to synthesize.]

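A toy sketch of how a tree like the one above routes a context to a tied state: each internal node asks a yes/no question about the context, and each leaf names one shared (tied) state. The questions, phone sets and leaf names here are illustrative:

# Each context is a dict {"L": left phone, "C": centre phone, "R": right phone}
VOICED = {"a", "e", "i", "o", "u", "w", "n", "m", "g", "b", "d"}

def is_l_voiced(ctx):   return ctx["L"] in VOICED
def is_r_silence(ctx):  return ctx["R"] in {"sil", "pau"}

# A hand-built tree: (question, yes_subtree, no_subtree); leaves are tied-state names
tree = (is_l_voiced,
        (is_r_silence, "leaf_voicedL_silR", "leaf_voicedL_other"),
        "leaf_unvoicedL")

def route(ctx, node):
    if isinstance(node, str):          # reached a leaf: the tied state
        return node
    question, yes, no = node
    return route(ctx, yes if question(ctx) else no)

print(route({"L": "w", "C": "a", "R": "t"}, tree))    # leaf_voicedL_other
print(route({"L": "k", "C": "a", "R": "b"}, tree))    # leaf_unvoicedL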
Which contexts really matter?

• That will depend on the model parameters we are dealing with

• Spectral envelope

• mainly phonetic context, followed by stress/prominence

• F0

• mainly suprasegmental features (position in phrase, etc)

• Duration

• some combination of both phonetic context and suprasegmental features (e.g., “is this segment phrase final?”)

• No problem - can share parameters differently within each “stream”

Multi-stream HMM structure: the observation vector is divided into streams

The observation vector $o_t$ is split into streams: one stream holds the spectral parameters and their dynamics ($c_t$, $\Delta c_t$, $\Delta^2 c_t$), and further streams hold $p_t$ (log F0), $\delta p_t$ and $\delta^2 p_t$. The state output probability is the product of per-stream probabilities, each raised to a stream weight $w_s$:

$$ b_j(o_t) \;=\; \prod_{s=1}^{S} \Big[\, b_j^{(s)}\!\big(o_t^{(s)}\big) \,\Big]^{w_s} $$

Decision tree-based context clustering

• As with unit selection, it is impossible to record all the phonetic/linguistic contexts required. Each individual HMM will be used for a group of similar contexts. Decision trees are used to ‘tie’ models across contexts.

[Figure: stream-dependent tree-based clustering - separate decision trees for the mel-cepstrum models, for F0, and for the state duration model of each HMM.]

Connecting unit selection and HMM-based methods

• Both methods share the same approach to the front end

• Both are trying to solve the same problem in waveform generation:

• recorded speech (the inventory for unit selection / the training data for HMMs) cannot include one example of every desired unit

• need a method to use ‘similar’ speech

• unit selection: quantify ‘similar’ using the target cost function

• HMMs: cluster ‘similar’ unit types together and train a single model

• the key concept is the same in both cases

• need to measure the suitability of using the ‘wrong unit type’

• that is, measure the ‘equivalence’ of speech unit types

Target cost vs. HMM clustering: finding the underlying classes of speech sounds?

• Target cost is a weighted sum of penalties
• left phonetic context
• right phonetic context
• phrase position
• syllable stress

[Figure: a decision tree asking "L=voiced?", "R=consonant?", "L=stop?", "Phrase final?", "Syllable stressed?", with leaves such as "L = unvoiced stop; phrase final" and "L = unvoiced but not a stop; not phrase final".]

Target cost is low when target and candidate contexts are equivalent. Each leaf contains a group of equivalent contexts.

Comparison with unit-selection synthesis

• System construction
• Training and optimising an HMM-based speech synthesiser is
• almost entirely automatic
• based on objective measures (maximum likelihood criterion, minimum cepstral distortion, etc)
• Optimising a unit selection system (e.g., choosing target cost weights) is usually
• done by trial and error
• based on subjective measures (listening)

• Data
• Unit selection needs 5-10,000+ sentences of data from one speaker
• Training a speaker-dependent HMM needs 1000+ sentences
• Adapting a speaker-independent HMM needs 1-100 sentences

Pros and cons of statistical parametric speech synthesis

• Pros

• a parametric model (= control)

• based on natural speech (= sounds good)

• automatic training of models from data

• speech is generated via a vocoder

• high degree of control over speech signal is possible (in theory)

• Cons

• quite a bit of experience required to obtain best-quality results

• speech quality is probably somewhat limited by the vocoder

Part II

Statistical parametric speech synthesis

Adaptation

ML-based Piecewise Linear Regression

[Figure: a regression class tree (with an occupancy threshold) partitions the Gaussians of the average voice model; each regression class k has its own transformation function $f_k$ that maps the average voice model towards the target speaker model. The transforms act on the mean vector $\mu_i$ and covariance matrix $\Sigma_i$ of each Gaussian pdf $i$ in the class, in increasingly powerful forms: a bias, $\hat{\mu}_i = \mu_i + \epsilon_k$; MLLR, $\hat{\mu}_i = \zeta_k \mu_i + \epsilon_k$; and CMLLR, which transforms means and covariances jointly, $\hat{\mu}_i = \zeta_k \mu_i + \epsilon_k$, $\hat{\Sigma}_i = \zeta_k \Sigma_i \zeta_k^{\top}$.]

Demonstration: Various Voices

The HTS-2007 system can adapt the average voice model into ...

US English Indian English Celebrity

Male Female Male Male

Pros and cons of adaptive HMM synthesis

• Pros

• can adapt all aspects of the model

• automatic learning from data

• Cons

• adaptation is “dumb”

• simply mimics all aspects of the adaptation data

• quality of speech signal can degrade if adaptation data are very different from the speech used to train the average voice model

Part III

State of the art

Part III

State of the art - capabilities & limits of current text-to-speech synthesis

What we can do today

• Unit selection

• can sound excellent but needs considerable engineering effort per voice

• HMM-based synthesis

• automatically learned from data; can adapt the models to new data easily

• Prosody

• ‘plausible’ default prosody for isolated sentences in read-text style

• Intelligibility

• HMM synthesis as intelligible as natural speech in quiet conditions

• Naturalness

• No synthesiser ever judged to be as natural as human speech

Open problems in text-to-speech synthesis

• Unit selection

• not much active research, yet still commercially very attractive

• HMM-based synthesis

• still lacks quality of unit selection; vocoder + statistical model both need improving

• Prosody

• Unsolved

• Intelligibility

• need to maintain intelligibility in noisy conditions, like humans can

• Naturalness

• Unsolved

Part III

State of the art - controllable speech synthesis

Motivation

• Speaker adaptation requires adaptation data
• It’s a shallow method
• Adaptation takes place at the surface level
• features or model parameters
• no ‘deep model’ underlying the adaptation process
• just a non-linear transform of the whole feature (or model) space

• How about speech modification without requiring new speech data?
• perhaps based on other information, such as articulation, listener characteristics, environment, ...
• our starting point: knowledge of speech production, in the form of articulatory measurement data

Articulatory data used

• Male native British English (RP accent) speaker
• 1,263 phonetically-balanced utterances
• 7 articulatory points: UL, LL, LI, TT, TB, TD
• Carstens 3D Electromagnetic Articulograph
• Audio suitable for speech synthesis

Introducing articulation into the HMM

• Model joint distributions of acoustic and articulatory parameters
• Acoustic distributions (for spectral parameters) are dependent on articulation
• Dependency = linear transform
• No loss of quality
• Note: can use the articulatory dependency to modify y_t

[Figure: graphical models of an acoustic-only HMM and of the joint EMA + acoustic model, in which the acoustic observation y_t depends on the articulatory observation.]

Change tongue height = change the vowel

[Figure: the tongue-height control, varied from -1.5 cm to +1.5 cm around the default value, applied to words such as "set", "peck", "led"; changing tongue height changes the vowel.]

Training and synthesis

[Figure: training - speech analysis of the database yields formant, spectral and F0 features; labels and questions are used for HMM training, first without and then with the formant dependency. Synthesis - text analysis produces labels; formant features are generated, then acoustic (spectral and F0) features conditioned on them via the formant function f(Y), and a high-quality vocoder produces the synthesized speech.]

Solution 2: Formant-controllable HMM synthesis

[Audio examples: starting from a default "det", direct manipulation of formants produces e.g. "dat" and "dit"; interpolation between consonants produces e.g. "get".]

Control of a vowel

mark      -3    -2    -1     0    +1    +2    +3
F1 (Hz) +150  +100   +50     0  -100  -200  -300
F2 (Hz) -300  -200  -100     0  +100  +200  +300

Samples

Formant-controllable HMM synthesis: controllable over whole sentences

Carrier sentence: "Please say what this word is: b__t."

[Audio examples: formant modifications applied over whole sentences - default; F1 +100 Hz; F1 -100 Hz; F1 +300 Hz; F2 -300 Hz; F1 -100 Hz with F3 +300 Hz.]

Vowel triangle and Lombard speech


Simulating hyper-articulation by expanding the vowel triangle

[Audio examples in two noise conditions (car noise at -30 dB; babble noise at -15 dB) for three processes: baseline; multi-compand (waveform modification); vowel triangle enlarged via F1 and F2 (applied after multi-compand).]

Pros and cons of controllable HMM speech synthesis

• Pros

• powerful control in terms of meaningful parameters

• articulator positions

• formants

• anything else you can provide at training time

• Cons

• bleeding edge technology

• only just starting to gather evidence that it is suitable for speech perception research

Conclusions

Take-home messages

• many tools are available - familiarise yourself with them before choosing

• don’t dismiss “old” tools

• they might still be the right choice for some tasks

• don’t only use “old” tools

• just because they are widely used does not mean they are perfect

• consider whether any of these newer tools are suitable for your needs

• STRAIGHT vocoder

• controllable HMM synthesis

• selection of natural stimuli from a large corpus of natural speech

• are there perception questions that can only be answered with new tools?

End matter

Credits

• The work described in this talk is not all mine!
• Contributors from Edinburgh include
• Junichi Yamagishi
• Korin Richmond
• Cassie Mayo
• Alice Turk
• Zhenhua Ling
• Ming Lei
• and many others
• The material related to STRAIGHT, Praat, VocalTractLab, TADA and some other images in these slides was borrowed from the various authors’ papers and slides.

Software tools

• STRAIGHT
• http://www.wakayama-u.ac.jp/~kawahara/index-e.html
• Festival
• http://www.cstr.ed.ac.uk/projects/festival/
• HTS
• http://hts.sp.nitech.ac.jp
• Praat
• http://www.fon.hum.uva.nl/praat/
• VocalTractLab
• http://www.vocaltractlab.de
• TADA
• http://www.haskins.yale.edu/tada_download/

References

• Too many to include here!

• Please do not hesitate to email me to request further information and references to papers about anything covered in this talk, or indeed anything else related to speech synthesis

[email protected]

• As the very first thing to read about statistical parametric speech synthesis, you might try this

• Simon King. A tutorial on HMM speech synthesis (invited paper). In Sadhana - Academy Proceedings in Engineering Sciences, Indian Institute of Sciences, 2010 (DOI 10.1007/s12046-011-0048-y)

• available from http://www.cstr.ed.ac.uk/publications/users/simonk.html
