Using Speech Synthesis to Create Stimuli for Speech Perception Research
Using speech synthesis to create stimuli for speech perception research
Prof. Simon King
The Centre for Speech Technology Research, University of Edinburgh, UK
www.cstr.ed.ac.uk
INSPIRE 2013, Sheffield

© Copyright Simon King, University of Edinburgh. For personal use only. Re-use or distribution not permitted.

Contents
• Part I: Motivation
  • why synthesis might be a useful tool
• Part II: Core techniques
  • formant synthesis
  • articulatory synthesis
  • physical modelling
  • vocoding
  • concatenation of diphones
  • concatenation of units from a large inventory
  • statistical parametric speech synthesis (HMMs)
• Part III: The state of the art
  • controllable HMM synthesis with articulatory and formant controls

Part I: Motivation

Goal: investigate speech perception
• How?
  • form a hypothesis
  • design the experiment
  • design the stimuli
  • create the stimuli
    (stimulus design is limited by the methods available for creation)
  • play the stimuli to listeners
  • obtain responses
  • analyse the responses
  • support / refute the hypothesis

Designing stimuli
• Usually speech or speech-like sounds
• Natural speech
  • elicited from one or more speakers
• Manipulated natural speech
  • filtered - e.g., delexicalised
  • edited - e.g., modify temporal structure, remove acoustic cues, splice, ...
• Synthetic speech
  • several methods available
  • which should we choose?
• Other synthetic sounds - e.g., sine wave speech

The limits of manually manipulating natural speech
• Manual editing permits only limited forms of modification
  • remove information
  • splice together individual natural sounds
• Laborious
• Highly skilled
• Therefore very slow to create stimuli
• Places limits on the experiments that can be performed
  • a bias towards certain types of stimuli

Doing it automatically - decomposing speech
• The speech signal we observe (the waveform) is the product of interacting processes operating at different time scales
  • at any moment in time, the signal is affected not just by the current phoneme but by many other aspects of the context in which it occurs
  • the context is complex - it is not just the preceding/following sounds
• We have a conflict: we want to simultaneously
  • model speech as a linear string of units, for engineering simplicity
  • take into account all the long-range effects of context, before, during and after the current moment in time

Speech is produced by several interacting processes

Resolving this conflict: take context into account
• The context in which a speech sound is produced affects that sound (a toy sketch of the standard engineering resolution follows below)
  • articulatory constraints: where the articulators are coming from / going to
  • phonological effects
  • prosodic environment
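One common way engineering systems resolve this conflict, used by the diphone and HMM techniques covered in Part II, is to keep the linear string of units but make each unit's identity context-dependent. A minimal sketch in Python (not from the slides; the triphone-style labels and the `sil` padding are illustrative assumptions, and real HMM systems attach many more context features such as syllable position, stress and prosodic phrase):

```python
# Toy sketch: keep the linear string of units, but rename each unit
# with its immediate left and right neighbours (triphone-style labels).
def context_labels(phones):
    padded = ["sil"] + phones + ["sil"]   # pad with silence at the edges
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(context_labels(["s", "a", "i"]))
# ['sil-s+a', 's-a+i', 'a-i+sil']
```

The string stays linear, but each unit now names (some of) the context it was produced in, so a model can learn context-dependent realisations.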
Modern speech synthesis
• 4 examples: diphones, unit selection, HMMs (x2)
• http://www.cstr.ed.ac.uk/projects/festival/morevoices.html

Part II: Core techniques

Part II: Core techniques - formant synthesis

Stimulus design: a simple consonant-vowel sequence
[Figure: spectrograms (0-8000 Hz, 0-300 ms) of the consonant-vowel sequences [sa], [ʃa], [de], [be], [ta], [da], [ti], [di]; audio: "sigh", "shy". From Mayo (Segmental), JASA, Fig. 1.]

Stimulus design: a continuum
• a 9-point continuum from /s/ to /ʃ/ ("sigh" ... "shy"), created by varying the frequency of the frication noise and using the vowel from "sigh" throughout (audio: 9-point continuum)
[Figure: number of /sa/ vs /ʃa/ responses as a function of frication-noise frequency, for adults, 7-year-olds, 5-year-olds and 3- to 4-year-olds. From Mayo (Segmental), JASA, Figs. 2-3.]

The Klatt vocal tract model
• F0 and gain
• Up to six vocal tract resonances: formants F1, F2, ..., F6
• Aspiration and frication noise
• Nasal zero (the 'anti-resonance' introduced when the nasal cavity is opened by lowering the velum)
[Figure: Klatt synthesiser block diagram - voicing source with spectral tilt, tracheal formant/anti-formant (TF, TAF), nasal formant/anti-formant (NF, NAF), cascade resonators F1-F5 with bandwidths B1-B5, a parallel branch with amplitudes A2-A5, aspiration and frication noise sources, and a bypass path.]
(A minimal code sketch of the cascade-of-resonators idea follows the pros-and-cons list below.)

Creating speech using the Klatt model

Manipulating speech via the Klatt model
• audio: original, Klatt, Klatt (stylised)

Pros and cons of formant-based systems
• Pros
  • Allows incorporation of linguistic knowledge in a transparent way
  • Precise control over every parameter value
  • Low memory requirements and low computational cost (for simple models like Klatt)
• Cons
  • Speech quality is ultimately limited by the vocal tract model
  • Creating high-quality output is skilled and laborious work
  • The work involved leads to a strong bias towards very short stimuli
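To make the cascade architecture concrete, here is a minimal sketch of a Klatt-style formant synthesiser in Python. It is an illustrative toy, not the full Klatt model: just an impulse-train voice source fed through three cascaded second-order resonators, with formant frequencies and bandwidths for a generic [a] assumed for the example.

```python
import numpy as np
from scipy.signal import lfilter

FS = 16000  # sample rate (Hz)

def resonator(x, f, bw, fs=FS):
    """Second-order digital resonator (as in Klatt 1980):
    y[n] = A*x[n] + B*y[n-1] + C*y[n-2]."""
    c = -np.exp(-2 * np.pi * bw / fs)
    b = 2 * np.exp(-np.pi * bw / fs) * np.cos(2 * np.pi * f / fs)
    a = 1 - b - c
    return lfilter([a], [1, -b, -c], x)

def vowel(dur=0.3, f0=120, formants=((700, 90), (1100, 110), (2450, 170))):
    """A steady-state vowel: impulse train at f0 through a cascade of
    formant resonators (default values roughly appropriate for [a])."""
    n = int(dur * FS)
    source = np.zeros(n)
    source[::FS // f0] = 1.0          # crude impulse-train voice source
    out = source
    for f, bw in formants:            # the cascade: one resonator per formant
        out = resonator(out, f, bw)
    return out / np.abs(out).max()    # normalise to +/-1

audio = vowel()  # write to a WAV file or play to hear a buzzy, vowel-like sound
```

The full Klatt synthesiser adds noise sources, a parallel branch, nasal and tracheal poles/zeros, spectral tilt and, crucially, time-varying parameter tracks for every one of its 40-60 parameters; producing convincing speech by setting those tracks by hand is exactly the skilled, laborious work the pros-and-cons list above refers to.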
Text-to-speech using formant synthesis
• The best-known system is MITalk (1970s), but hardware-based predecessors include PAT (Edinburgh, 1950s-1960s), OVE (KTH, Sweden, 1960s), and others
• Uses rules to drive an abstract & simplified vocal tract model
• MITalk was also implemented in hardware (DECtalk, as used by Stephen Hawking)
• This type of system takes a long time to develop: the rule sets are written by experts
• http://festvox.org/history/klatt.html
• Example: MITalk

Driving the Klatt model from text with rules
• A synthesiser like MITalk determines values for vowel formants using rules
  • start with a fixed default (target) value for every vowel
  • modify using co-articulation rules, reduction, etc.
• The Klatt vocal tract model is still used to create stimuli for phonetic experiments
  • reasonable results can be obtained by experts
  • but driving it automatically with rules is another matter
• It is now only used for text-to-speech in legacy applications

Part II: Core techniques - articulatory synthesis

VocalTractLab
• (audiovisual example)

HLSyn from Sensimetrics
• a "quasi-articulatory" synthesiser
• specifies the vocal tract in terms of both physical dimensions and formant frequencies
• fewer parameters than Klatt (13, instead of 40-60)
• no longer available as a product
• (audio examples: original, copy synthesis) - credit: Prosynth project, UCL
[Figure: HLSyn architecture - the 13 physiologically-based HL parameters (f0, ag, ap, al, ab, an, ue, ps, dc, f1-f4) are mapped, via relations that include a circuit model used to calculate pressures and flows, to the 40-50 acoustic KL parameters (AV, OQ, AF, ..., F1, F2, B1, B2, ...), which drive the sources and transfer functions that produce the speech output. Spectrogram of example output (0-5000 Hz, 0-450 ms). From MIT course 6.542J, Lab 16.]

TADA: TAsk Dynamic Application
• Based on Browman & Goldstein's Task Dynamic model of speech production
• Synthesis is achieved using HLSyn
• (audiovisual example)
• Runs under MATLAB (Release 14, ver. 7.0 or higher; type 'tada' at the command line) or as a stand-alone application
[Screenshot: the TADA GUI - in the centre, the temporal display: the gestural activations that are input to the task dynamic model (the gestural score), and the time functions of the tract variable and articulator values that are the model's outputs; at the left, the spatial display: the vocal tract shape and area function at the time of the cursor in the temporal display; at the right, four panels of buttons and readouts. From the TADA manual.]
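The core equation behind the task dynamic model is simple enough to sketch: while a gesture is active, it drives a tract variable x (e.g. lip aperture) toward its target x0 like a critically damped mass-spring system, m*x'' + b*x' + k*(x - x0) = 0 with b = 2*sqrt(m*k). A toy illustration in Python follows; the stiffness, duration and lip-aperture values are invented for the example and are not TADA's actual parameter settings.

```python
import numpy as np

def gesture(x_init, x0, k=200.0, m=1.0, dur=0.25, dt=0.001):
    """Integrate one gesture's tract-variable trajectory with Euler steps.
    Critical damping (b = 2*sqrt(m*k)) means the variable approaches its
    target smoothly, with no overshoot or oscillation."""
    b = 2.0 * np.sqrt(m * k)
    x, v = x_init, 0.0
    traj = []
    for _ in range(int(dur / dt)):
        acc = (-b * v - k * (x - x0)) / m   # m*x'' = -b*x' - k*(x - x0)
        v += acc * dt
        x += v * dt
        traj.append(x)
    return np.array(traj)

# Lip aperture closing from 10 mm toward full closure (target 0 mm),
# as in a bilabial stop gesture:
lip_aperture = gesture(x_init=10.0, x0=0.0)
```

In TADA, a whole utterance is a gestural score of such activations overlapping in time; the resulting tract-variable trajectories are mapped to articulator movements and then, via HLSyn, to sound.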