Using speech synthesis to create stimuli for speech perception research

Prof. Simon King
The Centre for Speech Technology Research
University of Edinburgh, UK
www.cstr.ed.ac.uk

INSPIRE 2013, Sheffield

Contents

• Part I : Motivation
• why synthesis might be a useful tool
• Part II : Core techniques
• formant synthesis
• articulatory synthesis
• physical modelling
• vocoding
• concatenation of diphones
• concatenation of units from a large inventory
• statistical parametric speech synthesis (HMMs)
• Part III : The state of the art
• controllable HMM synthesis with articulatory and formant controls

Part I

Motivation

Goal: investigate speech perception

• How?

• form a hypothesis

• design experiment

• design the stimuli
• create the stimuli (design is limited by the methods available for creation)
• play stimuli to listeners

• obtain responses

• analyse responses

• support / refute hypothesis

Designing stimuli

• Usually speech or speech-like sounds

• Natural speech

• elicited from one or more speakers

• Manipulated natural speech

• filtered - e.g., delexicalised

• edited - e.g., modify temporal structure, remove acoustic cues, splice, ...

• Synthetic speech

• several methods available

• which should we choose?

• Other synthetic sounds - e.g., sine wave speech

The limits of manually manipulating natural speech

• Manual editing means that only limited forms of modification are possible

• remove information

• splice together individual natural sounds

• Laborious

• Highly skilled

• Therefore very slow to create stimuli

• Places limits on the experiments that can be performed

• a bias towards certain types of stimuli

Doing it automatically - decomposing speech

• The speech signal we observe (the waveform) is the product of interacting processes operating at different time scales

• at any moment in time, the signal is affected not just by the current phoneme, but by many other aspects of the context in which it occurs

• the context is complex - it’s not just the preceding/following sounds

• We have a conflict: we want to simultaneously:

• model speech as a linear string of units, for engineering simplicity

• take into account all the long-range effects of context, before, during and after the current moment in time

Speech is produced by several interacting processes

Resolving this conflict: take context into account

• The context in which a speech sound is produced affects that sound

• articulatory constraints: where the articulators are coming from / going to

• phonological effects

• prosodic environment

Modern speech synthesis

4 examples: diphones, unit selection, HMMs (x2)

http://www.cstr.ed.ac.uk/projects/festival/morevoices.html

Part II

Core techniques

Part II

Core techniques - formant synthesis

Stimulus design: a simple consonant-vowel sequence

[Figure 1 (Mayo (Segmental), JASA): spectrograms (0-8000 Hz, 0-300 ms) of synthetic consonant-vowel stimuli - [sa] ("sigh") vs [ʃa] ("shy"), [de] vs [be], [ta] vs [da], [ti] vs [di].]

Stimulus design: a continuum

[Figures 2 and 3 (Mayo (Segmental), JASA): a 9-point "sigh" ..... "shy" continuum created by varying the frequency of the frication noise, using the vowel from "sigh"; number of /s/ vs /ʃ/ responses across the continuum for adults, 7-year-olds, 5-year-olds and 3- to 4-year-olds.]

(audio: 9 point continuum)

The Klatt vocal tract model

• F0 and gain
• Up to six vocal tract resonances: formants F1, F2, ..., F6, each with a bandwidth B1, ..., B6
• Aspiration and frication noise
• Nasal zero (the 'anti-resonance' introduced when the nasal cavity is opened by lowering the velum)

[Figure: block diagram of the Klatt synthesiser - voicing source with spectral tilt, aspiration and frication noise sources, a cascade of formant resonators (F1-F5 with bandwidths B1-B5), a nasal pole/zero branch, and a parallel branch with per-formant amplitudes (A2-A5) and a bypass path.]

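As a rough illustration of the cascade formant idea above (not the Klatt synthesiser itself), here is a minimal NumPy/SciPy sketch: an impulse-train voice source with a crude spectral tilt, passed through second-order resonators defined by formant frequencies and bandwidths. All parameter values are illustrative.

import numpy as np
from scipy.signal import lfilter

fs = 16000                                          # sample rate (Hz)
f0 = 120.0                                          # fundamental frequency (Hz)
dur = 0.5                                           # seconds
formants = [(700, 130), (1220, 70), (2600, 160)]    # (Fi, Bi) pairs for a rough [a]

# Voice source: impulse train at F0, then a one-pole filter as a crude spectral tilt
n = int(dur * fs)
source = np.zeros(n)
source[::int(fs / f0)] = 1.0
source = lfilter([1.0], [1.0, -0.95], source)

# Cascade of second-order resonators, one per formant
speech = source
for F, B in formants:
    r = np.exp(-np.pi * B / fs)
    theta = 2 * np.pi * F / fs
    a = [1.0, -2 * r * np.cos(theta), r ** 2]       # resonator poles from (F, B)
    b = [sum(a)]                                    # scale for unity gain at DC
    speech = lfilter(b, a, speech)

speech /= np.abs(speech).max()                      # normalise for playback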
Creating speech using the Klatt model

Manipulating speech via the Klatt model

(audio: original, Klatt, Klatt (stylised))

Pros and cons of formant-based systems

• Pros

• Allows incorporation of linguistic knowledge in a transparent way

• Precise control over every parameter value

• Low memory requirements and low computational cost (for simple models like Klatt)

• Cons

• Speech quality is ultimately limited by the vocal tract model

• Skilled and laborious work to create high-quality output

• Work involved leads to a strong bias towards very short stimuli

Text-to-speech using formant synthesis

• The best-known system is MITalk (1970s), but hardware-based predecessors include PAT (Edinburgh, 1950s-1960s), OVE (KTH, Sweden, 1960s), and others

• Uses rules to drive an abstract & simplified vocal tract model

• MITalk was also implemented in hardware (DECtalk, as used by Stephen Hawking)

• This type of system takes a long time to develop: rule sets written by experts

Example: MITalk
http://festvox.org/history/klatt.html

Driving the Klatt model from text with rules

• A synthesiser like MITalk determines values for vowel formants using rules

• start with a fixed default (target) value for every vowel

• modify using co-articulation rules, reduction, etc.

• The Klatt vocal tract model is still used to create stimuli for phonetic experiments

• reasonable results can be obtained by experts

• but driving it automatically with rules is another matter

• It is only used for text-to-speech in legacy applications

Part II

Core techniques - articulatory synthesis

VocalTractLab

(audiovisual example)

HLSyn from Sensimetrics

• "quasi-articulatory" synthesiser
• specify the vocal tract in terms of both physical dimensions and formant frequencies
• fewer parameters than Klatt (13, instead of 40-60)
• no longer available as a product

[Figure: the 13 physiologically-based HL parameters (f0, ag, ap, al, ab, an, ue, ps, dc, f1, f2, f3, f4) are mapped, via relations that include a circuit model used to calculate pressures and flows, onto the 40-50 acoustic KL parameters (AV, OQ, AF, ..., F1, F2, B1, B2, ...), which define the sources and transfer functions that produce the speech output.]

(audio examples: original, copy synthesis) - credit: Prosynth project, UCL

TADA : TAsk Dynamic Application

• Based on Browman & Goldstein's Task Dynamic model of speech production
• Synthesis achieved using HLSyn
• Runs in MATLAB (a stand-alone version is also available)

(audiovisual example)

Pros and cons of articulatory synthesis

• Pros

• Allows incorporation of linguistic knowledge in a transparent way

• Interesting way to explore and understand speech production

• Reasonably accurate control over articulator positions

• Cons

• Not a comprehensive model

• HLSyn = 4 formants and a few physical parameters

• VocalTractLab = synthetic sources + physically-modelled filter

• Speech quality is ultimately very limited by the vocal tract model

• Skilled and laborious work to create high-quality output

Part II

Core techniques - physical modelling

Physical modelling & simulation

[Figure and text excerpt: a two-dimensional digital waveguide mesh (DWM) model of the vocal tract, excited at the glottal end, simulating the vowel /u/; noise-excited spectra from the 1-D and 2-D models are compared. From IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 3, May 2006.]

Part II

Core techniques - vocoding

Factorising speech

Factorising speech

Source filter model

The STRAIGHT vocoder (Kawahara)

[Figure: schematic structure of Matlab STRAIGHT - the input speech is decomposed into F0 information, interference-free (smoothed) spectral information, and mixed-mode excitation source information; after F0 and spectral modification, a minimum-phase filter driven by an excitation source with group-delay manipulation produces the output speech. Figure taken from Banno et al., "Implementation of Realtime STRAIGHT", Acoust. Sci. & Tech. 28, 3 (2007).]

STRAIGHT - graphical interface for manipulation

STRAIGHT - morphing between natural speech samples

[Figures: screenshots of the STRAIGHT morphing GUI - temporal anchor assignment over a distance matrix between two examples, and frequency anchor assignment on spectrum slices of example A and example B. From Proceedings of the 2009 APSIPA Annual Summit and Conference, Sapporo, Japan, October 4-7, 2009.]

STRAIGHT - a continuum created via morphing

“sigh” ...... “shy”

Pros and cons of vocoding

• Pros

• Starting point is natural speech

• Extremely high quality (for small modifications, at least)

• Can modify various aspects of speech independently

• Cons

• Imprecise control

• no direct control of individual formants, voice onset time

• instead, manipulation of spectral envelope, etc

• Still needs natural speech samples as a starting point

Part II

Core techniques - concatenation

Let's create a word

train peas

trees

Part II

A side issue: text processing

Text processing

• Text processing breaks the original input text into units suitable for further processing; this involves tasks such as
• expanding abbreviations
• part-of-speech (POS) tagging
• letter-to-sound rules
• prosody prediction
• We end up with a 'linguistic specification' - in other words, all the information required to generate a speech waveform, such as
• phone sequence
• phone durations
• pitch contour

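For illustration only, a toy sketch of what a 'linguistic specification' data structure might contain; real systems differ, and the field names here are invented:

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class PhoneSpec:
    phone: str            # e.g. "ae"
    duration_ms: float    # predicted duration
    stressed: bool        # lexical stress of the containing syllable

@dataclass
class LinguisticSpec:
    text: str
    phones: List[PhoneSpec] = field(default_factory=list)
    f0_contour: List[Tuple[float, float]] = field(default_factory=list)  # (time_ms, Hz)

spec = LinguisticSpec(text="the cat sat")
spec.phones.append(PhoneSpec("dh", 45.0, False))
spec.phones.append(PhoneSpec("ax", 60.0, False))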
From multi-level / tiered linguistic information to a linear string of context dependent units

pitch accent / phrase initial / phrase final
sil dh ax k ae t s ae t sil    "the cat sat"    DET NN VB    ((the cat) sat)

For the phoneme [ax]: left context: sil dh; right context: k ae; position in phrase: initial; syllable stress: unstressed; etc....

Front end plus waveform generation

• Front end

• input is text

• output is a linguistic specification

pitch accent / phrase initial / phrase final
sil dh ax k ae t s ae t sil    "the cat sat"    DET NN VB    ((the cat) sat)

• Waveform generation

• concatenation, or

• generate from a model

Part II

Core techniques - concatenation of diphones

Diphones

• The second half of one phone plus the first half of the following phone

• Concatenation points (joins) are in mid-phone ‘stable’ positions

• There will be one join per phone

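Converting a phone string into a diphone sequence is mechanical; a toy sketch (using '#' for silence at the utterance edges):

phones = ["#", "t", "r", "iy", "z", "#"]                      # "trees"
diphones = [f"{a}-{b}" for a, b in zip(phones, phones[1:])]   # adjacent phone pairs
print(diphones)   # ['#-t', 't-r', 'r-iy', 'iy-z', 'z-#']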
Time-domain

• The inventory contains the waveform plus pitch-marks for each speech unit (i.e., diphone)

Units have their original duration and F0, which will get modified during waveform generation. The pitch-marks are needed by PSOLA-type algorithms.

PSOLA (Pitch Synchronous OverLap and Add)

• The first method we consider for modifying F0 and duration is a time domain version of PSOLA called TD-PSOLA.

• It operates directly on waveforms

How TD-PSOLA works

• Deal with individual pitch periods (each of which is essentially the impulse response of the vocal tract)

• The pitch periods themselves are not modified

• To increase F0, periods are moved closer together; where they overlap, we add the waveforms

• To decrease F0, periods are moved further apart

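A minimal sketch of the overlap-add operation just described, assuming a waveform and its pitch-mark sample indices are already available; a real TD-PSOLA implementation also handles unvoiced regions, duration modification and edge effects:

import numpy as np

def td_psola_change_f0(wav, marks, factor):
    """Raise (factor > 1) or lower (factor < 1) F0; duration is roughly preserved.
    wav: 1-D float array; marks: ascending list of pitch-mark sample indices."""
    marks = list(marks)
    out = np.zeros(len(wav))
    # 1. Re-space the pitch marks: new spacing = local pitch period / factor
    syn_marks = [marks[1]]
    while syn_marks[-1] < marks[-2]:
        i = int(np.argmin(np.abs(np.asarray(marks) - syn_marks[-1])))
        i = min(max(i, 1), len(marks) - 2)
        step = max(1, int(round((marks[i + 1] - marks[i]) / factor)))
        syn_marks.append(syn_marks[-1] + step)
    # 2. At each new mark, copy the nearest original two-period grain and overlap-add it
    for t in syn_marks:
        i = int(np.argmin(np.abs(np.asarray(marks) - t)))
        i = min(max(i, 1), len(marks) - 2)
        left, right = marks[i] - marks[i - 1], marks[i + 1] - marks[i]
        grain = wav[marks[i] - left : marks[i] + right] * np.hanning(left + right)
        if t - left >= 0 and t + right <= len(out):
            out[t - left : t + right] += grain
    return out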
TD-PSOLA

• Decreasing F0:

TD-PSOLA

• Increasing F0:

Overlap and add

TD-PSOLA

• Increasing duration:

duplicate

Pros and cons of diphone synthesis

• Pros

• fast to generate speech

• direct control over F0 and duration of each phone

• Cons

• sounds quite bad

• not really of interest in speech perception research

• but ... is a precursor to the next method: unit selection

Part II

Core techniques - concatenation of units from a large inventory

“unit selection”

Concatenative systems (“cut and paste”)

• Most common method in commercial use
• Example systems
• CHATR, Ximera – ATR, Japan
• Festival – University of Edinburgh, UK
• rVoice – Rhetorical, UK (now Nuance)
• Natural Voices – AT&T, USA
• RealSpeak – ScanSoft (now Nuance)
• Vocalizer – Nuance
• Loquendo TTS – Loquendo, Italy (now Nuance)
• InterPhonic – iFlyTek, China
• IVONA – IVO software, Poland (now Amazon)
• SVOX, Switzerland (now Nuance)
• Cepstral, USA
• Phonetic Arts, UK (now Google)
• CereVoice – Cereproc, UK

Components of a concatenative system

• A pipeline of processes takes us from input text to output waveform

• This pipeline can be broken into two main parts

• the ‘front end’

• waveform generation

• The front end infers additional information including pronunciation, intonation and phrasing to produce a ‘linguistic specification’

• Waveform generation creates a waveform that meets this specification

Examples: diphones vs. unit selection

Unit selection

• In an ideal world, we would concatenate a sequence of speech units from precisely matching contexts

• Unfortunately, that would mean pre-recording every sentence in the language

• In practice, if we can’t find the speech sound from a precisely matching context, then we choose a version of that sound from a similar context

• in other words, a context that will have a similar effect on the sound

• For example:

• can’t find “phrase-final [a] in the context [n]_[t]”

• choose “phrase-medial [a] in the context [m]_[d]”

Coverage affects quality

• A larger database gives better coverage, but

• takes longer to record

• takes more storage space

• leads to higher computational cost during unit selection search

• 400 sentence database

• 2000 sentence database

Target sequence and candidates

target # dh ax k ae t s ae t #

[Figure: candidate units from the database arranged in columns, one column per target position; some positions (e.g. ae) have many candidates, others have fewer.]

Linguistic criteria

• The ideal unit sequence would comprise units taken from identical linguistic contexts to those in the sentence being synthesised

• of course, this will not be possible in general, so we must use less-than-ideal units from non-identical (i.e., mismatched) contexts

• need to quantify how close to ideal they are, so we can choose amongst them

• The mismatch between the linguistic context of a candidate unit and the ideal (i.e., target) context is measured by the target cost

Acoustic criteria

• After units are taken from the database, they will be joined (concatenated)

• Cannot simply join two fragments of speech and hope that it will sound OK - it generally will not !

• Why? Because of mismatches in acoustic properties around the join point, such as

• differences in the spectrum, F0, or energy

• The acoustic mismatch between consecutive candidate units is measured by the join cost

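Such a join cost can be sketched as a weighted distance between acoustic feature vectors measured either side of the join; the features and weights below are illustrative, not those of any particular system:

import numpy as np

def join_cost(left_unit_end, right_unit_start, weights=(1.0, 0.5, 0.3)):
    """Each argument is a dict with 'mfcc' (vector), 'f0' (Hz) and 'energy',
    measured at the end of the left unit and the start of the right unit."""
    w_mfcc, w_f0, w_energy = weights
    d_mfcc = np.linalg.norm(np.asarray(left_unit_end["mfcc"]) -
                            np.asarray(right_unit_start["mfcc"]))
    d_f0 = abs(left_unit_end["f0"] - right_unit_start["f0"])
    d_energy = abs(left_unit_end["energy"] - right_unit_start["energy"])
    return w_mfcc * d_mfcc + w_f0 * d_f0 + w_energy * d_energy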
Target cost and join cost

[Figure: the target cost compares the linguistic features of a candidate with the target specification (phonetic context, stress, syllable position, word position, phrase position); the join cost compares acoustic features across the join between consecutive candidates (MFCCs, F0, energy).]

The components of the join cost

[Figure: the left and right units shown as waveform, spectrogram, energy and F0 tracks; the join cost compares these acoustic features at the join point.]

Target cost details

• The target cost measures the mismatch in linguistic features (which encode the context in which the unit appears) between target and candidate units.

• Festival’s multisyn engine uses a handcrafted function which is a simple weighted sum of sub-costs, one per linguistic feature

• these sub-costs are each assigned a weight which determines how important they are with respect to each other

• the weights are set by hand (well, by ear...)

• automatic optimisation of the weights is very hard

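The weighted-sum form described above could be sketched like this; the features and weights are illustrative, not Festival multisyn's actual ones:

def target_cost(target, candidate, weights=None):
    """Weighted sum of per-feature mismatch penalties between target and candidate contexts."""
    weights = weights or {"left_phone": 1.0, "right_phone": 1.0,
                          "stress": 0.5, "phrase_position": 0.5}
    cost = 0.0
    for feature, w in weights.items():
        if target[feature] != candidate[feature]:
            cost += w          # penalty of 1, scaled by the feature weight
    return cost

t = {"left_phone": "n", "right_phone": "t", "stress": 1, "phrase_position": "final"}
c = {"left_phone": "m", "right_phone": "d", "stress": 1, "phrase_position": "medial"}
print(target_cost(t, c))       # 2.5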
Why must a search be performed?

• The total cost of a particular candidate unit sequence under consideration is the sum of

• the target cost for every unit in the sequence

• the join cost between every pair of consecutive units in the sequence

• The choice of which candidate to use in a particular position depends on which units are chosen for the other positions

• therefore, it is not possible to make independent decisions about the best candidate for each unit

• there is a single globally-optimal sequence

• a search is required, to find this sequence

Viterbi search, with pruning

[Figure: the lattice of candidate units for "the cat sat" (# dh ax k ae t s ae t #); the Viterbi search finds the lowest-total-cost path through one candidate per position, pruning unlikely paths as it goes.]

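The search itself is a standard dynamic-programming (Viterbi) pass over the candidate lattice. A minimal sketch, parameterised by whichever target_cost and join_cost functions are chosen (for example the toy versions sketched earlier); pruning is omitted for clarity:

def viterbi_unit_selection(targets, candidates, target_cost, join_cost):
    """targets: one spec per position; candidates: one list of candidate units per position.
    Returns the candidate sequence with the lowest total (target + join) cost."""
    # best[i][j] = (cost of cheapest path ending in candidate j at position i, backpointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for cand in candidates[i]:
            tc = target_cost(targets[i], cand)
            prev_cost, prev_j = min(
                (best[i - 1][k][0] + join_cost(prev_cand, cand), k)
                for k, prev_cand in enumerate(candidates[i - 1]))
            row.append((prev_cost + tc, prev_j))
        best.append(row)
    # Trace back from the cheapest final candidate
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))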
Generation of prosody in unit selection

• The front end produces a description of prosody, from which we generate values for F0 and duration

• In diphone synthesis, signal processing would then be used to impose this onto the concatenated speech units

• but signal processing results in significant artefacts, especially if modification factors are large

• This all assumes that our predictions of prosody (e.g., ToBI tones and break indices, phrase break positions) are accurate

• Unit selection takes a different approach, either

• select units whose prosody matches the generated values, or

• select units whose contexts match the predicted description - they will then already have the correct prosodic properties (F0 contour, duration, etc.)

Pros and cons of unit selection

• Pros

• Fewer joins than the diphone method

• Can change the voice relatively easily without changing any software

• Can sound very much like a particular individual

• Can be very natural sounding indeed

• Cons

• Can still sometimes hear the joins between units

• Many (or most) sentences will include at least one error or artefact

• Large database of speech required for best quality - expensive to make

• Control over most aspects of the speech is limited

Part II

Statistical parametric speech synthesis

Front end plus waveform generation

• Front end

• input is text

• output is a linguistic specification

pitch accent / phrase initial / phrase final
sil dh ax k ae t s ae t sil    "the cat sat"    DET NN VB    ((the cat) sat)

• Waveform generation

• concatenation (e.g., unit selection), or

• generate from a model

Concatenation vs. generation from a model

• Concatenation builds up the utterance from units of recorded speech:

• Generation uses a sequence of models to generate the speech:

model 1 model 2 model 3 model 4

Model-based systems trained on data

• Older model-based approaches such as MITalk

• Rule-based front end processes text and generates a set of parameters

• These parameters are used to drive a simple vocal tract model (Klatt)

• Modern statistical parametric speech synthesis

• Uses a statistical model, trained on data

• Output of the model is typically the set of parameters needed to drive a source-filter model vocoder

• Commonly known as ‘HMM based speech synthesis’ and implemented using the software toolkit HTS

• Overcomes the need for hand written rules: learns from data instead

Modelling a coded representation of speech

• Waveform is not suitable for direct modelling, so use another representation

speech waveform

speech parameters

models model 1 model 2 model 3 model 4

HMMs are generative models

Learning the models from data

• For each training utterance

• create a linguistic specification using the front end

• convert this to a linear sequence of context-dependent phone labels

• assemble an HMM for this utterance by concatenating the corresponding models

• Train the HMM parameters in the same way as for automatic speech recognition

• in simplified terms:

• alignment of the data to model states
• update model parameters
• iterate the two steps above
• ‘alignment’ actually uses soft counting - Expectation-Maximisation

Comparison with ASR

• Differences from automatic speech recognition include

• Synthesis uses a much richer model set, with a lot more context

• For speech recognition: triphone models

• For speech synthesis: “full context” models

• “Full context” = both phonetic and prosodic factors

• Observation vector for HMMs contains the necessary parameters to generate speech, such as spectral envelope + F0 + multi-band noise amplitudes

HMM-based speech synthesis

• Figure adapted from: An HMM-based approach to multilingual speech synthesis, Tokuda, Zen & Black, in Text to speech synthesis: New paradigms and advances; Prentice Hall: New Jersey, 2004

extract spectrum, F0, aperiodic energy

learn model

stored model

generate from model

reconstruct

Examples of statistical parametric speech synthesis

• Statistical parametric speech synthesis method

• Standard voices are built from relatively large amounts of data

• typically 2 to 5 hours of material

• Experienced / professional speakers

2 examples

Generating speech from a HMM

• HMMs are used to generate a parameterised form that we will call ‘speech parameters’

• From the parameterised form, we can generate a waveform

• The parameterised form contains sufficient information to generate speech:

• spectral envelope

• fundamental frequency (F0)

• aperiodic (noise-like) components (e.g., for sounds like ‘sh’ and ‘f’)

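At its most naive, 'generating from the model' could simply mean writing out each state's mean vector for the number of frames assigned by the duration model; the toy sketch below does exactly that, and produces the stepwise trajectories discussed under 'Trajectory generation' later:

import numpy as np

def naive_generate(state_means, state_durations):
    """state_means: one parameter mean vector per HMM state, in order;
    state_durations: number of frames to spend in each state."""
    frames = []
    for mean, dur in zip(state_means, state_durations):
        frames.extend([np.asarray(mean)] * dur)   # repeat the state mean
    return np.vstack(frames)                      # shape: (total frames, parameter dim)

# e.g. a 3-state model of one phone, 2-dimensional parameters, durations 3/5/4 frames
traj = naive_generate([[1.0, 0.2], [1.5, 0.1], [0.8, 0.3]], [3, 5, 4])
print(traj.shape)   # (12, 2)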
Generating speech from the model

Constructing the HMM

• Linguistic specification (from the front end) is a sequence of phonemes, annotated with contextual information

• There is one 5-state HMM for each phoneme, in every required context

• To synthesise a given sentence,

• use front end to predict the linguistic specification

• concatenate the corresponding HMMs

• generate from the HMM

Trajectory generation

• Using an HMM to generate speech parameters

• because of the Markov assumption, the most likely output is the sequence of the means of the Gaussians in the states visited

• this is piecewise constant, and ignores important dynamic properties of speech

• Maximum likelihood parameter generation algorithm (Tokuda and colleagues)

• solves this problem, by correctly using statistics of the dynamic (‘delta’) properties during the generation process

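The standard result (due to Tokuda and colleagues) can be stated compactly. Stack the static parameters of all frames into a vector c and let W be the matrix that appends deltas and delta-deltas, so the full observation sequence is o = Wc. With the state-sequence mean vector μ and covariance Σ, maximising the likelihood N(Wc; μ, Σ) with respect to c gives

$$ W^{\top}\Sigma^{-1}W\,\hat{c} \;=\; W^{\top}\Sigma^{-1}\mu \qquad\Longrightarrow\qquad \hat{c} \;=\; \bigl(W^{\top}\Sigma^{-1}W\bigr)^{-1}W^{\top}\Sigma^{-1}\mu $$

a smooth trajectory whose statics and dynamics are jointly consistent with the model statistics.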
Trajectory generation

[Figure: a generated speech parameter trajectory plotted against time.]

Key insight: use the statistical properties of dynamic features

[Figure: the static parameter c and its delta ∂c from the model states, and the generated trajectory of c, plotted against time.]

Taking the delta and delta-delta features into account

Part II

Statistical parametric speech synthesis

Context-dependent models, learning from data, sparsity, complexity control

Context is the key

• Dealing with context-dependency is essential for good quality

• engineer the system in terms of a simple linear string of units

• then account for context by having a different version of each unit for every different context

• But, how do we know what all the different contexts are?

• If we enumerate all possible contexts, they will be practically infinite

• there are an infinite number of different sentences in a language

• context potentially spans the whole sentence (or further)

• However, what is important is the effect that the context has on the current speech sound - so next we can think about reducing the number of effectively different contexts

Some contexts are (nearly) equivalent

• This insight is the key to unit selection - that’s what the target cost is doing

• In HMM-based synthesis, models will be shared across groups of contexts

• We cannot record and store a different version of every speech sound in every possible context

• there are far too many of them

• some of them will be almost identical, so recording all of them is not necessary

• We can have each speech sound in a variety of different contexts

Flattening the linguistic specification: attaching all features to the segment level

[Figure: the hierarchy of phrases, words (with pitch accent and boundary tone events), syllables and phones (P); all of this information is attached down to each phone segment.]

Context-dependent models

pitch accent phrase initial phrase final

sil dh ax k ae t s ae t sil "the cat sat" DET NN VB ((the cat) sat)

sil^dh-ax+k=ae, "phrase initial", "unstressed syllable", ... • “Author of the ...”

pau^pau-pau+ao=th@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$..... pau^pau-ao+th=er@1_2/A:0_0_0/B:1-1-2@1-2&1-7#1-4$..... pau^ao-th+er=ah@2_1/A:0_0_0/B:1-1-2@1-2&1-7#1-4$..... ao^th-er+ah=v@1_1/A:1_1_2/B:0-0-1@2-1&2-6#1-4$..... th^er-ah+v=dh@1_2/A:0_0_1/B:1-0-2@1-1&3-5#1-3$..... er^ah-v+dh=ax@2_1/A:0_0_1/B:1-0-2@1-1&3-5#1-3$..... ah^v-dh+ax=d@1_2/A:1_0_2/B:0-0-2@1-1&4-4#2-3$..... v^dh-ax+d=ey@2_1/A:1_0_2/B:0-0-2@1-1&4-4#2-3$.....

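A toy sketch of assembling such a label from a flattened specification; the real HTS label format encodes many more features with its own delimiter grammar, and the format string below is a simplified invention:

def full_context_label(phones, i, stress, pos_in_phrase):
    """Quinphone plus a couple of prosodic features for phone i in a padded phone list."""
    ll, l, c, r, rr = phones[i - 2], phones[i - 1], phones[i], phones[i + 1], phones[i + 2]
    return f"{ll}^{l}-{c}+{r}={rr}@stress:{stress}@pos:{pos_in_phrase}"

phones = ["pau", "pau", "ao", "th", "er", "ah", "v", "pau", "pau"]   # "author of"
print(full_context_label(phones, 3, 0, "initial"))
# pau^ao-th+er=ah@stress:0@pos:initial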
The problems caused by context-dependent models

• We cannot be sure to have examples of every unit type in every possible context in the training data

• In reality, the context is so rich (it spans the whole sentence) that almost every single token in the training data is the only token of its type

• and the vast majority of possible types have no training examples

• Two key problems to solve

• train models for types that we have too few examples of

• create models for types that we have no examples of

• Joint solution: parameter sharing amongst groups of similar models

Some contexts exert similar effects

• Key insight

• we can group contexts according to the effect that they have on the centre phoneme

• for example

• the [ae] in the contexts p-ae+t and b-ae+t may be very similar

• how to group these contexts?

• how to represent them so we can form useful groupings?

• use their phonetic features

• place, manner, voicing, ....

Grouping contexts according to phonetic features

• Could try to write rules to express our knowledge of how co-articulation and other context effects work

• “all bilabial stops have a similar effect on the following vowel”

• “all nasals have a similar effect on the preceding vowel”

• ... etc

• Of course, it’s better to learn this from the data, for 2 reasons

• find those groupings that actually make a difference to the acoustics

• scale the granularity of the groups according to how much data we have

• But we still want to make use of our phonetic knowledge

Combining phonetic knowledge with data-driven learning

[Figure: a small decision tree asking "vowel to right?", "nasal to left?", "/uw/ to right?" (yes/no at each node); each leaf is a tied state.]

Decision tree-based state clustering [Odell;'95]

[Figure: decision tree-based state clustering. The states of full-context models such as k-a+b, t-a+n, w-a+t, w-a+sil, gy-a+sil, gy-a+pau are routed through yes/no questions ("L=voice?", "L="w"?", "R=silence?", "L="gy"?"); each leaf node holds a tied state shared by all contexts that reach it, and these tied states are used to synthesize.]

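A toy sketch of how a tree like the one above routes a context to a tied state: each internal node asks a yes/no question about the context, and each leaf names one shared (tied) state. The questions, phone sets and leaf names here are illustrative:

# Each context is a dict {"L": left phone, "C": centre phone, "R": right phone}
VOICED = {"a", "e", "i", "o", "u", "w", "n", "m", "g", "b", "d"}

def is_l_voiced(ctx):   return ctx["L"] in VOICED
def is_r_silence(ctx):  return ctx["R"] in {"sil", "pau"}

# A hand-built tree: (question, yes_subtree, no_subtree); leaves are tied-state names
tree = (is_l_voiced,
        (is_r_silence, "leaf_voicedL_silR", "leaf_voicedL_other"),
        "leaf_unvoicedL")

def route(ctx, node):
    if isinstance(node, str):          # reached a leaf: the tied state
        return node
    question, yes, no = node
    return route(ctx, yes if question(ctx) else no)

print(route({"L": "w", "C": "a", "R": "t"}, tree))    # leaf_voicedL_other
print(route({"L": "k", "C": "a", "R": "b"}, tree))    # leaf_unvoicedL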
Which contexts really matter?

• That will depend on the model parameters we are dealing with

• Spectral envelope

• mainly phonetic context, followed by stress/prominence

• F0

• mainly suprasegmental features (position in phrase, etc)

• Duration

• some combination of both phonetic context and suprasegmental features (e.g., “is this segment phrase final?”)

• No problem - can share parameters differently within each “stream”

Multi-stream HMM structure: the observation vector is divided into streams

The observation vector $o_t$ is split into streams: one stream holds the spectral parameters and their dynamics ($c_t$, $\Delta c_t$, $\Delta^2 c_t$), and further streams hold $p_t$ (log F0), $\delta p_t$ and $\delta^2 p_t$. The state output probability is the product of per-stream probabilities, each raised to a stream weight $w_s$:

$$ b_j(o_t) \;=\; \prod_{s=1}^{S} \Big[\, b_j^{(s)}\!\big(o_t^{(s)}\big) \,\Big]^{w_s} $$

Decision tree-based context clustering

• As with unit selection, it is impossible to record all the phonetic/linguistic contexts required. Each individual HMM will be used for a group of similar contexts. Decision trees are used to ‘tie’ models across contexts.

[Figure: stream-dependent tree-based clustering - separate decision trees for the mel-cepstrum models, for F0, and for the state duration model of each HMM.]

Connecting unit selection and HMM-based methods

• Both methods share the same approach to the front end

• Both are trying to solve the same problem in waveform generation:

• recorded speech (the inventory for unit selection / the training data for HMMs) cannot include one example of every desired unit

• need a method to use ‘similar’ speech

• unit selection: quantify ‘similar’ using the target cost function

• HMMs: cluster ‘similar’ unit types together and train a single model

• the key concept is the same in both cases

• need to measure the suitability of using the ‘wrong unit type’

• that is, measure the ‘equivalence’ of speech unit types

Target cost vs. HMM clustering: finding the underlying classes of speech sounds?

• Target cost is a weighted sum of penalties
• left phonetic context
• right phonetic context
• phrase position
• syllable stress

[Figure: a decision tree asking "L=voiced?", "R=consonant?", "L=stop?", "Phrase final?", "Syllable stressed?", with leaves such as "L = unvoiced stop; phrase final" and "L = unvoiced but not a stop; not phrase final".]

Target cost is low when target and candidate contexts are equivalent. Each leaf contains a group of equivalent contexts.

Comparison with unit-selection synthesis

• System construction
• Training and optimising an HMM-based speech synthesiser is
• almost entirely automatic
• based on objective measures (maximum likelihood criterion, minimum cepstral distortion, etc)
• Optimising a unit selection system (e.g., choosing target cost weights) is usually
• done by trial and error
• based on subjective measures (listening)

• Data
• Unit selection needs 5-10,000+ sentences of data from one speaker
• Training a speaker-dependent HMM needs 1000+ sentences
• Adapting a speaker-independent HMM needs 1-100 sentences

Pros and cons of statistical parametric speech synthesis

• Pros

• a parametric model (= control)

• based on natural speech (= sounds good)

• automatic training of models from data

• speech is generated via a vocoder

• high degree of control over speech signal is possible (in theory)

• Cons

• quite a bit of experience required to obtain best-quality results

• speech quality is probably somewhat limited by the vocoder

Part II

Statistical parametric speech synthesis

Adaptation

ML-based Piecewise Linear Regression

[Figure: a regression class tree (with an occupancy threshold) partitions the Gaussians of the average voice model; each regression class k has its own transformation function $f_k$ that maps the average voice model towards the target speaker model. The transforms act on the mean vector $\mu_i$ and covariance matrix $\Sigma_i$ of each Gaussian pdf $i$ in the class, in increasingly powerful forms: a bias, $\hat{\mu}_i = \mu_i + \epsilon_k$; MLLR, $\hat{\mu}_i = \zeta_k \mu_i + \epsilon_k$; and CMLLR, which transforms means and covariances jointly, $\hat{\mu}_i = \zeta_k \mu_i + \epsilon_k$, $\hat{\Sigma}_i = \zeta_k \Sigma_i \zeta_k^{\top}$.]

Demonstration: Various Voices

The HTS-2007 system can adapt the average voice model into ...

US English Indian English Celebrity

Male Female Male Male

Pros and cons of adaptive HMM synthesis

• Pros

• can adapt all aspects of the model

• automatic learning from data

• Cons

• adaptation is “dumb”

• simply mimics all aspects of the adaptation data

• quality of speech signal can degrade if adaptation data are very different from the speech used to train the average voice model

Part III

State of the art

Part III

State of the art - capabilities & limits of current text-to-speech synthesis

What we can do today

• Unit selection

• can sound excellent but needs considerable engineering effort per voice

• HMM-based synthesis

• automatically learned from data; can adapt the models to new data easily

• Prosody

• ‘plausible’ default prosody for isolated sentences in read-text style

• Intelligibility

• HMM synthesis as intelligible as natural speech in quiet conditions

• Naturalness

• No synthesiser ever judged to be as natural as human speech

Open problems in text-to-speech synthesis

• Unit selection

• not much active research, yet still commercially very attractive

• HMM-based synthesis

• still lacks quality of unit selection; vocoder + statistical model both need improving

• Prosody

• Unsolved

• Intelligibility

• need to maintain intelligibility in noisy conditions, like humans can

• Naturalness

• Unsolved

Part III

State of the art - controllable speech synthesis

Motivation

• Speaker adaptation requires adaptation data
• It’s a shallow method
• Adaptation takes place at the surface level
• features or model parameters
• no ‘deep model’ underlying the adaptation process
• just a non-linear transform of the whole feature (or model) space

• How about speech modification without requiring new speech data?
• perhaps based on other information, such as articulation, listener characteristics, environment, ...
• our starting point: knowledge of speech production, in the form of articulatory measurement data

Articulatory data used

• Male native British English (RP accent) speaker
• 1,263 phonetically-balanced utterances
• 7 articulatory points: UL, LL, LI, TT, TB, TD
• Carstens 3D Electromagnetic Articulograph
• Audio suitable for speech synthesis

Introducing articulation into the HMM

• Model joint distributions of acoustic and articulatory parameters
• Acoustic distributions (for spectral parameters) are dependent on articulation
• Dependency = linear transform
• No loss of quality
• Note: can use the articulatory dependency to modify y_t

[Figure: graphical models of an acoustic-only HMM and of the joint EMA + acoustic model, in which the acoustic observation y_t depends on the articulatory observation.]

Change tongue height = change the vowel

[Figure: the tongue-height control, varied from -1.5 cm to +1.5 cm around the default value, applied to words such as "set", "peck", "led"; changing tongue height changes the vowel.]

Training and synthesis

[Figure: training - speech analysis of the database yields formant, spectral and F0 features; labels and questions are used for HMM training, first without and then with the formant dependency. Synthesis - text analysis produces labels; formant features are generated, then acoustic (spectral and F0) features conditioned on them via the formant function f(Y), and a high-quality vocoder produces the synthesized speech.]

Solution 2: Formant-controllable HMM synthesis

[Audio examples: starting from a default "det", direct manipulation of formants produces e.g. "dat" and "dit"; interpolation between consonants produces e.g. "get".]

Control of a vowel

mark      -3    -2    -1     0    +1    +2    +3
F1 (Hz) +150  +100   +50     0  -100  -200  -300
F2 (Hz) -300  -200  -100     0  +100  +200  +300

Samples

Formant-controllable HMM synthesis: controllable over whole sentences

Carrier sentence: "Please say what this word is: b__t."

[Audio examples: formant modifications applied over whole sentences - default; F1 +100 Hz; F1 -100 Hz; F1 +300 Hz; F2 -300 Hz; F1 -100 Hz with F3 +300 Hz.]

Vowel triangle and Lombard speech


Simulating hyper-articulation by expanding the vowel triangle

[Audio examples in two noise conditions (car noise at -30 dB; babble noise at -15 dB) for three processes: baseline; multi-compand (waveform modification); vowel triangle enlarged via F1 and F2 (applied after multi-compand).]

Pros and cons of controllable HMM speech synthesis

• Pros

• powerful control in terms of meaningful parameters

• articulator positions

• formants

• anything else you can provide at training time

• Cons

• bleeding edge technology

• only just starting to gather evidence that it is suitable for speech perception research

Conclusions

Take-home messages

• many tools are available - familiarise yourself with them before choosing

• don’t dismiss “old” tools

• they might still be the right choice for some tasks

• don’t only use “old” tools

• just because they are widely used does not mean they are perfect

• consider whether any of these newer tools are suitable for your needs

• STRAIGHT vocoder

• controllable HMM synthesis

• selection of natural stimuli from a large corpus of natural speech

• are there perception questions that can only be answered with new tools?

End matter

Credits

• The work described in this talk is not all mine!
• Contributors from Edinburgh include
• Junichi Yamagishi
• Korin Richmond
• Cassie Mayo
• Alice Turk
• Zhenhua Ling
• Ming Lei
• and many others
• The material related to STRAIGHT, Praat, VocalTractLab, TADA and some other images in these slides was borrowed from the various authors’ papers and slides.

Software tools

• STRAIGHT
• http://www.wakayama-u.ac.jp/~kawahara/index-e.html
• Festival
• http://www.cstr.ed.ac.uk/projects/festival/
• HTS
• http://hts.sp.nitech.ac.jp
• Praat
• http://www.fon.hum.uva.nl/praat/
• VocalTractLab
• http://www.vocaltractlab.de
• TADA
• http://www.haskins.yale.edu/tada_download/

References

• Too many to include here!

• Please do not hesitate to email me to request further information and references to papers about anything covered in this talk, or indeed anything else related to speech synthesis

[email protected]

• As the very first thing to read about statistical parametric speech synthesis, you might try this

• Simon King. A tutorial on HMM speech synthesis (invited paper). In Sadhana - Academy Proceedings in Engineering Sciences, Indian Institute of Sciences, 2010 (DOI 10.1007/s12046-011-0048-y)

• available from http://www.cstr.ed.ac.uk/publications/users/simonk.html
