Presentations

Work in pairs in 6-minute mini-interviews (3 minutes each). Ask questions around the topics:
• What is your previous experience of speech synthesis?
• Why did you decide to take this course?
• What do you expect to learn?
Write down the answers of your partner. Present during the presentation round. Submit the answers to me.
Why? To let me know more about your background and expectations, so that the course content can be adapted. To get to know each other. To "start you up"…


The course

This is what the course book will look like… Until then, refer to http://svr-www.eng.cam.ac.uk/~pat40/book.html

Course pages: www.speech.kth.se/courses/GSLT_SS

Lecture content (impossible to cover the entire book):
1) History, Unit selection, HMM synthesis, Text issues, Prosody
2) Vocal tract models, Formant synthesis, Evaluation
3) Term paper presentations, assignment correction
To do until next time:
1) Assignment 1: Unit selection calculations
2) Term paper topic selection

Definition & Main scope

The automatic generation of synthesized sound or visual output from any phonetic string.

Synthesis approaches

By concatenation: elementary speech units are stored in a database and then concatenated and processed to produce the speech signal.
By rule: speech is produced by mathematical rules that describe the influence of phonemes on one another.

History


van Kempelen and Wheatstone's version

Wolfgang von Kempelen's book Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine (1791). Charles Wheatstone built his own version of von Kempelen's speaking machine.

The essential parts • pressure chamber = lungs, • a vibrating reed = vocal cords, • a leather tube = vocal tract.

The machine was
• hand operated
• could produce whole words and short phrases.
Why is it of interest to us? Parametric features!


First electronic synthesis

• Homer Dudley presented VODER (Voice Operating Demonstrator) at the World Fair in New York in 1939.
• The device was played like a musical instrument, with the voicing/noise source on a foot pedal and the signal routed through ten bandpass filters.

First formant synthesizers

1950's: PAT (Parametric Artificial Talker) by Walter Lawrence: 3 electronic formant resonators, input signal (noise), 6 functions to control 3 formant frequencies, voicing, amplitude, fundamental frequency, and noise amplitude.
1950's: OVE (Orator Verbis Electris) by Gunnar Fant.
From the 1950's: other synthesizers, including the first DAVO (Dynamic Analog of the VOcal tract).
An excellent historical trip of speech synthesis: Dennis Klatt's History of Speech Synthesis at http://www.cs.indiana.edu/rhythmsp/ASA/Contents.html


Let us take a look at OVE
• OVE I (1953): on your computer today; the original next time, together with OVE II (1962).

OVE Instructions (http://www.speech.kth.se/courses/GSLT_SS/ove.html)
1. Test how the five different source models change the output. What is the difference in the formant pattern between different sources? Look at the number of formants, the peak amplitude, the bandwidth.
2. Alter a) the frequency and b) the shape of the source signal. What happens with the formant frequencies in the two cases? Relate these changes to human speech production.

3. Change the Frequency values F1-F4. Start with a neutral vowel (F1=500 Hz, F2=1500 Hz, F3=2500 Hz, F4=3500 Hz). Explain the attenuation in formant peak amplitude for higher frequencies (hint: try a rectangle source and change the shape to 99).

Now move one of the formant peaks so that it is about 200 Hz from the closest peak. What happens with the neighbour peak? Change the bandwidth of the formants. What is the relation between the bandwidth and the formant peak amplitude?

4. Move around the cursor in the vowel space and see how the shape of the output waveform (green curve in the bottom panel) changes.

If you have time, try to generate the sentences "How are you?" and "I love you!".


Formant amplitudes

Speech analysis & manipulation


Why signal processing?
• Need to separate the source from the filter for modelling (linear predictive analysis)
• Need to model the sound source (prosody, speaker characteristics)
• Need to alter speech units in concatenative synthesis (amplitude, cepstrums)
• Need to make concatenations smooth in concatenative synthesis (PSOLA)

The source-filter theory
The signal (c–d) is the result of a linear filter (b) excited by one or several sources (a).


The source-filter theory
source (glottis) → filter (vocal tract) → radiation (lips), viewed in both the TIME and FREQUENCY domains.
More to come on the vocal tract filter in lecture 7.

Source functions
• The voiced quasi-periodic source (glottis pulses) – vowels.
Parameters: on/off, fundamental frequency F0, intensity, shape.
• Frication source – fricatives
• Transient noise – plosives
• No source – voiceless occlusions


Voice source types

Type | Articulatory | Acoustic
Modal | Normal, efficient. Complete glottis closures. | Standard source. Steep spectral slope.
Breathy | Lack of tension. Never close completely. Slow "glottal return". | Audible aspiration. Glottal pulse symmetry. Higher F0 intensity.
Whispery | Low glottal tension. Triangular glottal opening. High medial compression. Medium longitudinal tension. | High aspiration levels. Greater pulse asymmetry. Less time in open state.
Creaky | High adductive tension and medial compression. Little longitudinal tension. | Very low F0. Irregular F0 & amplitude.

The quasi-periodic source
Why is there a damping slope in the transfer function?
A pulse train with period T0 in the time domain has harmonics at f = nF0 = n/T0 in the frequency domain.

Simple vowel synthesis
Source → Filter (F1 F2 F3 F4) → Waveform
Triangle source and formant filters in cascade: bandpass filters with frequency, bandwidth, level.
So, how do we find the source from a speech signal?

Linear Prediction (LP)
A method to separate the source from the filter. Predicts the next sample as a linear combination of the past p samples:

x̃[n] = Σ_{k=1..p} a_k·x[n−k]

• The coefficients a_1 … a_p describe a filter that is the inverse of the transfer function.
• Minimization of the prediction error results in an all-pole filter which matches the signal spectrum.
• This removes the formants and can hence be used to find the source.
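A minimal numpy sketch of this idea (the autocorrelation method with the Levinson-Durbin recursion; the function names and the test signal are illustrative, not course code):

import numpy as np

def lp_coefficients(x, p):
    """Estimate LP coefficients a_1..a_p by the autocorrelation method
    (Levinson-Durbin recursion)."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)])
    a = np.zeros(p)               # predictor coefficients a_1..a_p
    err = r[0]                    # prediction error energy
    for i in range(p):
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err   # reflection coeff.
        a_new = a.copy()
        a_new[i] = k
        a_new[:i] = a[:i] - k * a[:i][::-1]
        a, err = a_new, err * (1.0 - k * k)
    return a

# Inverse filtering e[n] = x[n] - sum_k a_k x[n-k] removes the formants,
# leaving an estimate of the (residual) source signal.
x = np.random.randn(16000)        # stand-in for a speech signal
a = lp_coefficients(x, p=12)
e = x.copy()
for k, ak in enumerate(a, start=1):
    e[k:] -= ak * x[:-k]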


Spectral Fourier analysis
• A Fourier transform of the filter coefficients a_1 … a_p gives the frequency response of the inverse filter.
• A periodic waveform can be described as a sum of harmonics.
• The harmonics are sine waves with different phases, amplitudes and frequencies.
• The frequencies are multiples of the fundamental frequency.

Fourier Transforms
• Fourier transform (FT): a non-periodic signal has a continuous frequency spectrum; a periodic signal has a discrete spectrum.
• Discrete FT (DFT): Fourier transform of a sampled signal. N samples in both the time and frequency domain. The spectrum is mirrored around half the sampling frequency.
• Fast FT (FFT): clever algorithm to calculate the DFT. Reduces the number of multiplications: DFT ~N², FFT ~(N/2)·log₂(N).


Windowing
• The analysis of a long speech signal is made on short frames: analysis window 10–50 ms (20 ms in the example).
• The truncation of the signal results in artefacts (sidelobes).
• The artefacts are reduced if the signal is multiplied with a window that gives less weight to the sides.

Effect spectrum
• The FFT analysis gives complex values: amplitude and phase for each frequency component.
• The phase is often not interesting, only the signal's energy at different frequencies.
• The effect (power) spectrum shows the power spectrum for a short section of the signal: Windowing → FFT → Square → Logarithm.
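For illustration (not from the slides), the frame → window → FFT → square → logarithm chain in a few lines of numpy; the 20 ms frame length and the Hamming window are assumed choices:

import numpy as np

fs = 16000                                     # assumed sampling frequency (Hz)
frame = np.random.randn(int(0.020 * fs))       # one 20 ms analysis frame

windowed = frame * np.hamming(len(frame))      # reduce sidelobes from truncation
spectrum = np.fft.rfft(windowed)               # complex: amplitude and phase
power_db = 10 * np.log10(np.abs(spectrum) ** 2 + 1e-12)   # square + logarithm
freqs = np.fft.rfftfreq(len(windowed), d=1.0 / fs)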


Cepstrum Analysis
• The dominating method for ASR, used in HMM synthesis.
• Inverse Fourier transform of the logarithmic frequency spectrum: "spectral analysis of the spectrum".
• The coarse structure of the spectrum is described with a small number of parameters.
• Orthogonal (uncorrelated) coefficients.
• Anagrams: spectrum–cepstrum, filtering–liftering, frequency–quefrency, phase–saphe.

Cepstrum from filterbank
C_j = (2/N)·Σ_{i=1..N} A_i·cos(jπ(i − 0.5)/N)
[Figure: the spectra of /a:/ and /s/ multiplied by cosine weight functions W1–W4 give cepstral coefficients C1–C4.]
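A toy version of this "spectral analysis of the spectrum" (my own minimal sketch, not course code): the real cepstrum is the inverse FFT of the log magnitude spectrum, and the low quefrencies describe the coarse (filter) structure:

import numpy as np

def real_cepstrum(frame):
    """Inverse Fourier transform of the logarithmic magnitude spectrum."""
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-12)
    return np.fft.ifft(log_mag).real

frame = np.random.randn(512)   # stand-in for a windowed speech frame
c = real_cepstrum(frame)
coarse = c[:13]                # low quefrencies ~ vocal-tract (filter) shape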


Mel Frequency Cepstral Coefficients (MFCC)
Pipeline: FFT → Mel filter bank → Cepstrum transform
The Mel scale is perceptually motivated: linear below 1000 Hz, logarithmic above 1000 Hz.
[Figure: Mel filter bank up to ~6000 Hz; Mel-spectrum and Mel-cepstrum of /a:/ with coefficients C1–C4.]
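A hedged sketch of the FFT → mel filter bank → cepstrum chain; the mel formula 2595·log10(1 + f/700) is the standard one, while the filter count and the triangular filter shapes here are illustrative assumptions:

import numpy as np

def mfcc_frame(frame, fs, n_filters=24, n_coeffs=13):
    """Toy MFCC: power spectrum -> triangular mel filter bank -> log -> DCT."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)

    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)     # Hz -> mel
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)  # mel -> Hz
    centres = imel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))

    fbank = np.zeros(n_filters)
    for i in range(n_filters):
        lo, mid, hi = centres[i], centres[i + 1], centres[i + 2]
        rising = np.clip((freqs - lo) / (mid - lo), 0.0, 1.0)
        falling = np.clip((hi - freqs) / (hi - mid), 0.0, 1.0)
        fbank[i] = np.sum(power * np.minimum(rising, falling))

    log_e = np.log(fbank + 1e-12)
    # Same cosine transform as the filterbank-cepstrum formula above
    j = np.arange(n_coeffs)[:, None]
    i = np.arange(1, n_filters + 1)[None, :]
    return (np.cos(j * np.pi * (i - 0.5) / n_filters) @ log_e) * 2.0 / n_filters

coeffs = mfcc_frame(np.random.randn(512), fs=16000)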

Concatenative synthesis

Nothing new under the sun…
• Peterson et al. (1958)
• Dixon and Maxey (1968)
• "Diadic Units" (Olive, 1977)

Let's get the terms straight

Concatenative synthesis
Definition: all kinds of synthesis based on the concatenation of units, regardless of type (sound, formant trajectories, articulatory parameters) and size (diphones, triphones, syllables, longer units). (Everyday use: concatenation of same-size sound units.)

Unit selection
Definition: all kinds of synthesis based on the concatenation of units where there are several candidates to choose from, regardless of whether the candidates have the same, fixed size or the size is variable. (Everyday use: concatenation of variable-sized sound units.)


Why has concatenation conquered?
Concatenative synthesis is the state-of-the-art:
• Storing the segment database is no longer an issue
• Advances in ensuring smoothness in concatenations (rule-based synthesis output used to be smoother)
• Unit selection provides (piece-wise) high quality speech
• Change of applications
• Certain sounds are too hard to be produced by rule: vowels are easy to create by rule, but bursts and voiceless stops are too difficult; we do not fully understand their production mechanisms

Database preparation
• Choose the speech units (phone, diphone, sub-word unit, cluster-based unit selection)
• Compile and record utterances
• Segment the signal and extract speech units
• Store segment waveforms (along with context) and information in a database: dictionary, waveform, pitch marks
  e.g. "ch-l r021 412.035 463.009 518.23" = diphone, file, start time, middle time, end time
• Pitch mark file: a list of each pitch mark position in the file
• Extract parameters; create a parametric segment database (for data compaction and prosody matching)
• Perform amplitude equalization (prevents mismatches)
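To make the index format concrete, a small parser for entries like the one above (the field meanings follow the slide; the class and function names are my own):

from dataclasses import dataclass

@dataclass
class DiphoneEntry:
    diphone: str    # e.g. "ch-l"
    wavefile: str   # e.g. "r021"
    start: float    # start time of the unit
    middle: float   # phone boundary inside the diphone
    end: float      # end time of the unit

def parse_index_line(line: str) -> DiphoneEntry:
    name, wav, start, middle, end = line.split()
    return DiphoneEntry(name, wav, float(start), float(middle), float(end))

entry = parse_index_line("ch-l r021 412.035 463.009 518.23")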


Diphone & Triphone synthesis
Sequences of a particular sound/phone in all its environments of occurrence, or all/most two-phone sequences occurring in a language: auto 'car' -> _a, au, ut, to, o_
• Rationale: the 'center' of a phonetic realization is the most stable region, whereas the transition from one segment to another contains the most interesting phenomena, and is thus the hardest to model.

Diphone synthesis
[Figure: the words /sɑːl/ and /rɑːk/ split into half-phones (*s1 s2, ɑ1 ɑ2, l1 l2*, …); the diphones for /sɑːk/ combine *s1, s2-ɑ1, ɑ2-k1, k2*, whereas the triphone is *s ɑ1 ɑ2 k*.]


Diphone synthesis
• 1200 diphones can already create a quite good sounding synthesis
– Speaker dependence (one set from one speaker)
– Various digital signal processing techniques -> 'robotic' sound
– Segmental quality, transition between diphones
– Only partial coverage of co-articulation
Examples: MBROLA, BT Laureate, Festival

Diphone "synthesis" lab (http://www.speech.kth.se/courses/GSLT_SS/lab1.html)
1. Record the "database", the word list: "Dockad, yttern, töm, flöde, möta, lätt, blomster, lyssnarna." in one go, in that order and without pausing.
2. Segment the wordlist into diphones: cut out each diphone and put them in a new Wavesurfer window, with pauses separating each diphone.
3. Identify the diphones that you need to create the sentence "Dom flyttade möblerna."
4. Copy and paste diphones from the database window into a new synthesis window.
5. Play the sentence, fine tune durations and concatenations.


Equalization
• Segments extracted from different words, with different phonetic contexts, have amplitude and timbre mismatches.
• Equalization: related endings of segments are imposed similar amplitude spectra.
• Amplitude equalization: smooth modification of the energy levels at the beginning and at the end of segments. The energy of all the phones of a given phoneme is given the average value; the difference is distributed on the neighbourhood.
• Timbre conflicts are tackled at run-time, by smoothing individual couples of segments when necessary, so that some of the phonetic variability is still maintained.

Concatenation with PSOLA
• Time-Domain Pitch-Synchronous OverLap-Add (TD-PSOLA)
• High speech quality
• Very low computational cost (7 operations/sample)
• A window (2 pitch periods long) is multiplied with the signal
• The signal is broken into a set of localized signals (non-zero only at the window intervals)


Altering pitch with PSOLA
• Relative shifting of the localized signals: the spacing between them reflects the pitch period (spaced further apart → lower pitch).
• Good results for modification factors in [0.6 – 1.5].

Altering duration & amplitude
• Increase the number of PSOLA iterations (overlaps), i.e. frame duplication, to increase duration.
• Decrease the number of PSOLA iterations (overlaps) to decrease duration.
• Amplitude: multiply the signal by a constant; if the constant > 1 the amplitude increases, if < 1 it decreases.
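A heavily simplified TD-PSOLA sketch of the pitch modification above, assuming idealized, uniformly spaced pitch marks and Hann windows two pitch periods long (a real system uses measured pitch marks):

import numpy as np

def psola_pitch_shift(x, marks, factor):
    """Extract two-period, Hann-windowed grains around each pitch mark and
    overlap-add them at a new spacing. factor > 1 places the grains closer
    together (higher pitch); factor < 1 spaces them further apart (lower)."""
    y = np.zeros(int(len(x) / factor) + 2048)
    for i in range(1, len(marks) - 1):
        period = marks[i] - marks[i - 1]
        grain = x[marks[i] - period : marks[i] + period] * np.hanning(2 * period)
        centre = int(marks[0] + (marks[i] - marks[0]) / factor)
        y[centre - period : centre + period] += grain
    return y[: int(len(x) / factor)]

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 120 * t)                  # stand-in "voiced" signal
marks = np.arange(200, len(x) - 200, fs // 120)  # idealized pitch marks
higher = psola_pitch_shift(x, marks, factor=1.2)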


MBROLA
• Algorithm: Multi-Band Resynthesis OverLap and Add
• A time-domain PSOLA-like algorithm with efficient smoothing of the spectral envelope
• Very high data compression ratios (up to 10)
• Synthesizer: concatenation of diphones.
In: list of phonemes and prosodic info (duration of phonemes and a piecewise linear description of pitch). Out: speech samples on 16 bits (linear), at the sampling frequency of the diphone database.
• Project goal: generate a set of speech synthesizers for as many languages as possible, free for non-commercial applications.

Unit Selection
• Larger database of recorded units: e.g. diphones, phones, syllables, words, etc.
• Multiple occurrences of the units cover a wide space of the spectral and prosodic parameters
• Units nearest in this space to the targets will be chosen and will require only minor modification
• The corpus is segmented into phonetic units, indexed, and used as-is
• Selection is made on-line
• The trend is towards longer and longer units (1999 → 2005)


Best Unit Selection
Target cost
– prosodic and spectral closeness to the target: target pitch, power, duration
Concatenation cost
– units occurring beside each other in the recorded database are given a zero cost
Cost function
– target + concatenation cost (weighted sum)
The Viterbi algorithm is used to find the overall minimum cost path.
Assignment 1: practical exercises with the calculation of target and concatenation cost.

Target & Concatenation cost
Target cost = the difference in each frame between the target and the candidates. (Oh no! A different number of frames! Beware of this pitfall.)
Concatenation cost = the difference between the end of diphone 1 and the start of diphone 2.
Distance measures:
• Manhattan (city block) distance: D = Σ_i |x_i − y_i|
• Euclidean distance: D = √(Σ_i (x_i − y_i)²)
• Mahalanobis distance: D = Σ_i (x_i − y_i)²/σ_i²
• Kullback-Leibler distance: D = Σ_{i=1..N} (x_i − y_i)·log(x_i/y_i)
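The four distance measures transcribe directly into numpy (the vectors and variances here are hypothetical placeholders):

import numpy as np

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def mahalanobis_diag(x, y, var):
    """Diagonal-covariance Mahalanobis distance with per-dimension variances."""
    return np.sum((x - y) ** 2 / var)

def kullback_leibler(x, y):
    """Symmetric KL-style distance between positive feature vectors."""
    return np.sum((x - y) * np.log(x / y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.5, 2.5, 2.0])
var = np.array([0.5, 1.0, 2.0])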


Viterbi – best path search
• All possible sequences are hypothesized in parallel
• A threshold excludes improbable hypotheses
Based on:
• previous path probability (getting to state i)
• transition probability (getting from i to j)
• observation likelihood (state j matches input)
[Figure: trellis of candidate units for Phone 1 and Phone 2 over utterance time.]

Pros & cons of Unit selection
Advantages:
• Piece-wise very high waveform quality, thanks to minimal signal manipulation
• Non-linguistic features of the speaker's voice built in
Disadvantages:
• Discontinuities between units
• Hit or miss for target selection
• Quality differences between different sized units
• Fixed voice
• Fixed non-linguistic features
Are there any valid alternatives?
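A minimal dynamic-programming sketch of the Viterbi minimum-cost search above, run over toy candidate lists (the costs and names are invented for illustration):

import numpy as np

def best_unit_sequence(target_costs, concat_cost):
    """target_costs[t][i] = target cost of candidate i at position t;
    concat_cost(t, i, j) = cost of joining candidate i at t with j at t+1.
    Returns the minimum-cost candidate index per position, and the cost."""
    best = [np.asarray(target_costs[0], dtype=float)]
    back = []
    for t in range(1, len(target_costs)):
        cur = np.asarray(target_costs[t], dtype=float)
        prev = best[-1]
        join = np.array([[prev[i] + concat_cost(t - 1, i, j)
                          for i in range(len(prev))] for j in range(len(cur))])
        back.append(join.argmin(axis=1))      # best predecessor per candidate
        best.append(cur + join.min(axis=1))
    path = [int(best[-1].argmin())]           # trace back the best path
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return list(reversed(path)), float(best[-1].min())

costs = [[1.0, 2.0], [0.5, 0.1, 0.9], [0.3, 0.4]]   # toy unit candidates
cc = lambda t, i, j: 0.0 if i == j else 0.2         # toy concatenation cost
path, total = best_unit_sequence(costs, cc)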


An example of voice conversion
[Figure: block diagram of a voice-conversion system. The source speaker's speech is analysed with an LP model and a formant-trajectory HMM; spectral warping/rotation factors map the source parameters onto the target speaker's models; formant estimation, tracking and mapping plus pitch modification drive the reconstruction of the mapped speech. Audio examples: American male, American female, transformed (American male to female).]


HMM synthesis
A speech synthesis technique based on HTK (Hidden Markov Model Toolkit).
Developed by the HTS working group at the Department of Computer Science, Nagoya Institute of Technology, and the Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology. http://hts.sp.nitech.ac.jp

Hidden Markov Models
• An HMM is a machine with a limited number of possible states.
• The transition between two states is regulated by probabilities.
• Every transition results in an observation with a certain probability.
• The states are hidden; only the observations are visible.
[Figure: left-to-right HMM with states i, j, k, l; self-transition probabilities Pii, Pjj, Pkk, Pll; transitions Pij, Pjk, Pkl; observations Oi, Oj, Ok, Ol.]
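A toy generator for such a machine (a left-to-right HMM with Gaussian observations; every number here is invented for illustration):

import numpy as np

rng = np.random.default_rng(0)

p_stay = np.array([0.7, 0.6, 0.8])   # self-transition probabilities P_ii
means = np.array([1.0, 5.0, 2.0])    # observation mean per hidden state
stddev = 0.3

state, observations = 0, []
while state < len(p_stay):
    # Each step emits a visible observation; the state itself stays hidden.
    observations.append(rng.normal(means[state], stddev))
    if rng.random() > p_stay[state]:
        state += 1                   # move on with probability P_i,i+1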


HMM in speech synthesis
1. Transcription & segmentation of speech databases
2. Construction of an inventory of speech segments
3. Run-time selection of speech segments
High quality speech can be synthesized using waveform concatenation algorithms (e.g., PSOLA). However, to obtain various voice qualities, a large amount of speech data is necessary.
→ Speech synthesis from the HMMs themselves: voice quality can be changed by transforming the HMM parameters appropriately.

Basic idea
Start the training of the HMMs with a good guess of the parameters. The guess is improved through comparison with the training observations.
In the synthesis we find the optimal sequence of states, through concatenation of HMMs.
The output is vocoded (Mel Log Spectrum Approximation), but it is always smooth and stable.

The training part
• The training is automatic. You need: the text + recordings of about 1000 sentences.
• The training of 1000 sentences takes 24 hours and generates a voice of less than 1 MB.
• Separate HMMs for: spectrum, F0, duration.
• Training in two steps:
1. Context-independent models
2. Use these models to create context-dependent models.

Clustering
• Groups a large database into clusters
• Three trees: duration, F0 and spectrum
• Division based on yes/no questions:
– grouping acoustically similar phonemes
– features
– context
• Why clustering:
– Storing all contexts requires much space
– It may be difficult to find alternatives for missing models
– Many models are very similar = redundancy


Synthesis: Delta, delta-delta...
For each phoneme we need:
• Mel-cepstrum, with first and second derivatives (mcep, Δ, Δ²)
• (F0, Δ, Δ²) + information about voicing
• Duration. Can be generated implicitly by the F0 and spectrum HMMs, but the result is more natural with explicit modeling.
• Δ and Δ² are used to smooth the parameter sequences.
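Δ and Δ² can be illustrated with a symmetric difference over neighbouring frames (a common textbook form, shown as an assumption rather than the exact HTS regression window):

import numpy as np

def deltas(c):
    """c: (frames, coeffs) parameter track. Returns (c[t+1] - c[t-1]) / 2
    per frame, with the edge frames repeated."""
    padded = np.vstack([c[:1], c, c[-1:]])
    return (padded[2:] - padded[:-2]) / 2.0

mcep = np.random.randn(100, 13)   # stand-in mel-cepstrum track
d = deltas(mcep)                  # delta
dd = deltas(d)                    # delta-delta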


Use of HMM synthesis
• Various voices: speaker adaptation, speaker interpolation, eigenvoices
• Very low bit rate speech coder
• Security of speaker identification systems

Speaker adaptation


Speaker interpolation
www.sp.nitech.ac.jp/~tokuda/HTS_demo/speaker_inter/index.html

Test of speaker verification


Very low bit-rate speech coding

Swedish HMM synthesis
Master thesis by Anders Lundgren. Language-specific parts:
• Text-to-phoneme transcription (RulSys or Festival)
• Translation of the phonemic transcription to HTK SAMPA
• Module to generate contextual information (syllable division, word accent placement)
• Decision tree paths for the clustering of HMMs: features, contextual information


Listening test
• Separate evaluation of prosody and spectrum
• Six voice variants:
– HTS
– Prosody from HTS, spectrum from MBROLA
– Prosody from RULSYS, spectrum from HTS
– TMH's synthesis reference system: prosody from RULSYS, spectrum from MBROLA

Clarity
[Figure: bar chart rating each variant from "Much worse" to "Much better" relative to the reference.]


Naturalness
[Figure: bar chart rating the naturalness of each variant from "Much worse" to "Much better".]
More on how to evaluate in lecture 9!

Previous TTS experience
[Figure: pie chart of Yes/No answers.]

Text-to-speech
"The automatic generation of synthesized sound from any text string."
From text (e.g. "hello"):
Linguistic analysis (morphologic analysis, syntax analysis) – lexicon and rules
→ Prosodic analysis – rules and lexicon
→ Phonetic description – rules and choice of units
→ Joining parts – rules
→ Sound generation


Text Analysis Challenges
• Homographs
– My latest project is to learn how to better project my voice.
– The girl with the bow in her hair was told to bow deeply when greeting her superiors.
• Numbers (models, dates)
– On May 5 2005, the university bought 2005 computers
– a Boeing model 747 can contain 747 people
• Abbreviations
– Yesterday it rained 3 in. Take 1 out, then put 3 in.
– St. John St.
Let us try!

Preprocessor
• Sentence end detection (semicolon; period as ratio, time and decimal point, or sentence ending)
• Abbreviations (e.g. – for instance): changed to their full form with the help of lexicons
• Acronyms (I.B.M. can be read as a sequence of characters, NASA the default way)
• Numbers (once detected, first interpreted as rational numbers, then as time of day, date or ordinal depending on their context)
• Idioms (e.g. "in spite of", "as a matter of fact" – combined into a single FSU using a special lexicon)
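A toy preprocessor fragment showing the flavour of such rules (the abbreviation table and digit spelling are deliberately tiny stand-ins; real systems use large lexicons and contextual classification):

import re

ABBREVIATIONS = {"e.g.": "for instance"}   # toy lexicon; note that e.g. "St."
                                           # is ambiguous (Saint/Street) and
                                           # needs context, ignored here

def expand_abbreviations(text):
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return text

def expand_small_integers(text):
    """Spell out lone digits; a real preprocessor also classifies dates,
    times, ordinals and model numbers from context."""
    words = ["zero", "one", "two", "three", "four",
             "five", "six", "seven", "eight", "nine"]
    return re.sub(r"\b\d\b", lambda m: words[int(m.group())], text)

print(expand_small_integers(expand_abbreviations("Take 1 out, then put 3 in.")))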


Morphological Analysis
The task is to propose all possible parts-of-speech categories for each word taken individually, on the basis of their spelling.
• Function words (determiners, pronouns, prepositions, conjunctions..): limited in number; can be stored in a lexicon.
• Content words: infinite in number. Morphology describes words using a reduced set of abstract, semantically bearing units called morphemes. Inflectional, derivational and compound words are decomposed into morphemes, using regular grammars with lexicons of stems and affixes.
• Word he: = pronoun, = masculine, = /hΙ/

Contextual Analysis
• Considers words in their context
• Reduces the list of their parts-of-speech categories to a very restricted number of highly probable hypotheses, given the corresponding possible parts of speech of neighboring words.
• Achieved by N-grams, multi-layer perceptrons (neural networks), local stochastic grammars (provided by expert linguists), etc.


Letter-to-phonemes
• Module responsible for the automatic determination of the phonetic transcription of the incoming text.
• Cannot just look up in a pronunciation dictionary. Spellings do not follow the rule "one character = one phoneme":
– a single character can correspond to two phonemes – x as /ks/
– several characters can produce one phoneme – th in thought
– a single character can be pronounced in different ways – c in ancestor, ancient, epic
• Rule based – rules applied based on spelling and sentence analysis
• Dictionary based – a large dictionary of correct spellings
• Hybrid approach – combines the above; usually used

Dictionary or Rule Based
Dictionary: store a maximum of phonological knowledge in a lexicon. Compounding rules describe how the morphemes of dictionary items are modified. Hand-corrected, expensive. The lexicon is never complete: it needs an out-of-vocabulary pronouncer that transcribes by rule.
Rules: a set of letter-to-sound (grapheme-to-phoneme) rules. Words pronounced in such a particular way that they have their own rule are stored in an exceptions dictionary. Fast & easy, but lower accuracy.
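The hybrid approach reduces to: look the word up in the lexicon first, and fall back to letter-to-sound rules for out-of-vocabulary words. A sketch with toy data (both the lexicon entry and the rule set are invented):

LEXICON = {"thought": "TH AO T"}   # toy hand-corrected dictionary entry

RULES = [("th", "TH"), ("ch", "CH"), ("x", "K S"), ("a", "AE"),
         ("e", "EH"), ("i", "IH"), ("o", "AA"), ("u", "AH")]

def letter_to_sound(word):
    """Greedy longest-match letter-to-sound rules (no stress, no context)."""
    phones, i = [], 0
    while i < len(word):
        for graph, phone in RULES:
            if word.startswith(graph, i):
                phones.append(phone)
                i += len(graph)
                break
        else:
            phones.append(word[i].upper())   # crude fallback per letter
            i += 1
    return " ".join(phones)

def transcribe(word):
    return LEXICON.get(word) or letter_to_sound(word)   # dictionary first

print(transcribe("thought"), "|", transcribe("box"))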


Letter-to-sound difficulties
• Consonants reduced or deleted in clusters (e.g. /t/ in softness)
• Assimilation leads to a change of some phonological features of a given phoneme (e.g. obstacle)
• Homographs pronounced differently (e.g. record, contrast)
• Phonetic liaisons (e.g. in French, a word immediately followed by a vocalic sound results in pronunciation of characters that otherwise disappear)
• Unstressed vowels transformed into schwas (short central phonetic elements) or deleted (e.g. interesting)
• New words and proper nouns dependent on the language of origin (e.g. in Swedish "jeans", "comme il faut")

Creating rules
• Writing rules by hand is difficult
• Automatic process built from a lexicon
– Find alignments: c h e c k e d → ch - eh - k - t
• Provides phone string plus stress
Accuracy per lexicon (letters / words correct):
OALD 95.80% / 74.56%
CMUDICT 91.99% / 57.80%
BRULEX 99.00% / 93.03%
DE-CELEX 98.79% / 89.38%
Thai 95.60% / 68.76%


Phrasing
Determines where phrase boundaries occur:
– insert pauses on phrase boundaries
– determined by a CART tree trained on a big data corpus

Intonation: Word accent
Word accent: decided depending on word class, position in the sentence and in the phrase, and the word classes of preceding and following words (e.g. Swedish 'tomten', 'stegen').
For each syllable of each word: decide if and which accent.


Intonation: F0 contour
[Figure: example F0 contour annotated with: large pitch range (female); authoritative (final fall); emphasis for "Finance" (H*); the final rise signals more information to come.]
• Word stress and sentence intonation
– each word has at least one syllable which is spoken with higher prominence
– in each phrase the stressed syllable can be accented, depending on the semantics and syntax of the phrase
• Prosody relies on syntax, semantics, pragmatics: a personal reflection of the reader.

Pitch contour modeling
• Tonetics (the British school)
– tone groups composed of syllables {unstressed, stressed, accented or nuclear}
– nuclear syllables have nuclear tones {fall, rise, fall-rise, rise-fall}
• ToBI (Tones and Break Indices)
– phrases split into intermediate phrases composed of syllables
– relative tone levels: high (H) or low (L) (plus diacritics) at every intonational or intermediate phrase boundary (%) and on every accented syllable
• Stylization method (prosodic pattern measured from natural speech)


Prosody modeling
Prosody is critical for obtaining the right intonation (or else speech may sound unnatural or unintelligible).

• Fixed durations, flat F0
• Declining F0
• "Hat" accents on stressed syllables
• Accents and end tones
• Statistically trained

• Prosody targets (to put emphasis, stress) typically include: –Pitch – Phone durations –Energy • Prosody parameters can be trained


Synthesis markup

SABLE: marking emphasis
– The boy saw the girl in the park with the telescope. / The boy saw the girl in the park with the telescope. (the same sentence with emphasis placed on different constituents)
– Dialogue: What will the weather be like today in Boston? It will be rainy today in Boston. When will it rain in Boston? It will be rainy today in Boston. Where will it rain today? It will be rainy today in Boston.
– Language switching: Some English first and then some Spanish. Hola amigos. Namaste.
– Spelling and pronunciation: Good morning. My name is Stuart, which is spelled s-t-u-a-r-t, though some people pronounce it differently. My telephone number is 2787. I used to work in Buccleuch Place, but no one can pronounce that.


Articulatory synthesis
Benefits:
• Produce speech in the same way as humans
• Can be made with few parameters
• The changes are intuitive (raise the tongue tip, round the lips)
Disadvantages:
• Computationally demanding
• Problems with consonants
• Articulatory measurements required
• State-of-the-art articulatory synthesis still sounds bad

Articulation as filter: vocal tract models


Articulatory models
• Geometric: parameters control the different parts of the tongue, jaw, lips etc.
• Functional
• Physiological: muscle model; articulations are created through activation of different muscles.

Articulatory basis
Measurements (X-rays, MRI etc.) are used to model the dimensions of the tube: in the midsagittal plane, and to get the relation between midsagittal distance and area in each plane.


3D articulatory synthesis
Why?
• Two-dimensional models simulate the third dimension as area = a·(distance)^d, where a and d are decided empirically and vary through the tube.
• A three-dimensional model gives:
– the cross-sectional area directly
– lateral modeling (/l/)
– visual synthesis (pronunciation training)

3D MRI measurements
Corpus: one neutral reference and 43 Swedish articulations.
13 vowels: /ɑ:, e:, æ:, i:, y:, u:, ʉ:, o:, ø:, œ:, a, u, ɔ/
10 consonants: /p, t, k, l, r, s, f, ʂ, ɧ, ɕ/ in VCV contexts with /a ɪ ʊ/
3×18 slices orthogonal to the midsagittal plane in 43 s. Supine position.


3D Reconstruction
One contour per image. Reconstruct a 3D shape for each articulation.

Articulatory model
Six articulatory parameters defined using a component analysis of the 3D tongue shapes: jaw height, tongue body, tongue dorsum (and, below, tongue tip, tongue advance, tongue width).


Articulatory model
Tongue tip, tongue advance, tongue width.
Add vocal tract walls: symmetric walls, extracted from the MR images. Collision handling for the tongue against walls, palate and jaw.

Multimodal articulatory synthesis
Qualisys optical motion tracking: 4 IR cameras, 28 reflectors, 3 reference reflectors on a headmount. Audio & video recorders.
Movetrack electromagnetic articulograph: 6 coils; upper lip, upper & lower incisors, and three tongue coils at 8 (T1), 20 (T2) and 52 (T3) mm from the tip.


Synthesis: LSP
[Figure: training and synthesis scheme. Training: the speech signal Si is LPC-analysed to LSPs Li and pitch Pi; 3D articulatory data Ai are resampled and model-fitted to 14 articulatory parameters APi; a linear estimator or neural network is trained to map articulatory parameters to LSPs. Synthesis: articulatory parameters APk are mapped to LSPs L*k and pitch Pk, which drive an LSP filter to produce the synthetic speech S*k.]

Multimodal synthesis

From articulation to acoustics
[Figure: vocal tract model → cross-sections → area function → tubes → electric circuit equivalent → waveform; alternatives: 3D air flow calculations, 2D airflow dynamics.]

Area & transfer functions
Area function: area vs. distance. Transfer function: amplitude vs. frequency.

Formants
[Figure: articulatory model parameter settings and the resulting formants.]


Vocal tract models lab (http://www.speech.kth.se/courses/GSLT_SS/lab2.html)
• Synthesize /aa, ii, uu/ with a) a two-tube model, b) a three-parameter model, c) an area function model, d) an articulatory model.
Formant values are: /aa/: 650 1000 2500; /ii/: 290 2050 2400; /uu/: 300 700 2100
• Investigate what happens if a nasal tract is added for each model.
• Compare the four methods regarding flexibility, complexity, intuitivity. What are the advantages and disadvantages of each of them?
• Use the articulatory model to investigate how the seven parameters influence the vocal tract shape and acoustics. Start from a neutral vocal tract (set all values to 0) and vary each parameter.
• Move or place your own articulators in the same way; do your intuitive thoughts about the effect of your articulatory movements correspond to the results in the model?
• Experiment with the parameters in 'Tract Configuration' and 'Physical Constants'. What influence do they have on the synthesis?

Equivalent circuit
Acoustically – mechanically – electrically:
flow – speed – current
pressure – force – voltage
acoustic mass – mass – inductance
acoustic spring – mechanical spring – capacitance
The tube has an acoustic mass ~ L = ρ/A.
The air functions as a spring ~ C = A/(ρc²).
There are friction losses ~ R, G.
A is the cross-sectional area of the tube, ρ is the air density and c is the speed of sound in air.
[Figure: T-network circuit with elements R, L, C, G for a tube section of area A.]


Assumptions
• The tube has rigid walls.
• Since the cross-sections are small compared to the tube length, we have a plane wave.
• Two-directional waves with reflections between the tubes: r = (A2 − A1)/(A2 + A1).
• The "current" and "voltage" are sinusoidal:
U(x) = U₊e^(−γx) + U₋e^(γx)
I(x) = I₊e^(−γx) + I₋e^(γx)
with γ = √((R + jωL)(G + jωC)). The indices + and − indicate the direction of travel.

In an electric circuit:
The transfer function is the quotient H = Iout/Iin. For a tube section modelled as a T-network with series impedances Za and shunt impedance Zb:
Uin = Iin·(Za + Zb) − Iout·Zb
Uout = Iin·Zb − Iout·(Za + Zb)
Long calculations give:
Z₀ = √((R + jωL)/(G + jωC)), Zb = Z₀/sinh(γl), Za = Z₀·tanh(γl/2)
If we assume that the tube is lossless, Z₀ and γ are simplified:
Z₀ = √(L/C) = √((ρ/A)/(A/(ρc²))) = ρc/A, γ = jω·√(LC) = jω/c


Example: the neutral vowel
Za = Z₀·tanh(γl/2) = Z₀·(cosh(γl) − 1)/sinh(γl), Zb = Z₀/sinh(γl)
H = Iout/Iin = Zb/(Za + Zb) = 1/cosh(γl)
Lossless: γ = jω/c ⇒ H = 1/cosh(jωl/c) = 1/cos(ωl/c)
Poles: cos(ωl/c) = 0 ⇒ ωl/c = π/2 + nπ ⇒ Fn = (2n − 1)·c/(4l), n = 1, 2, 3, …
For a typical male speaker, l = 17.5 cm and c = 350 m/s, we get F = {500 Hz, 1500 Hz, 2500 Hz, 3500 Hz, …}
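A quick numerical check of the closed-open tube formula above:

c, l = 350.0, 0.175   # speed of sound (m/s) and vocal tract length (m)
formants = [(2 * n - 1) * c / (4 * l) for n in (1, 2, 3, 4)]
print(formants)       # [500.0, 1500.0, 2500.0, 3500.0]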

Two tubes
Connect two homogeneous tubes, each modelled as a T-network (tube 2 at the glottis, tube 1 at the lips). Kirchhoff's equations at the two nodes give, after even longer calculations:
H = Iout/Iin = 1/[cosh(jωl₁/c)·cosh(jωl₂/c)·(1 + (A₂/A₁)·tanh(jωl₁/c)·tanh(jωl₂/c))]
Poles when (A₂/A₁)·tan(ωl₁/c)·tan(ωl₂/c) = 1


Example with two tubes
A male speaker produces a vowel with a constricted pharynx (the pharyngeal area is one eighth of that in the oral cavity). Calculate the first two formant frequencies.

l₁ = l₂ = l/2, A₂/A₁ = 1/8, l = 17.5 cm for a male speaker, c = 350 m/s

(A₂/A₁)·tan(ωl₁/c)·tan(ωl₂/c) = 1 ⇒ tan²(ωl₁/c) = A₁/A₂ = 8 ⇒ tan(2πFn·0.0875/350) = ±√8
Fn = (2000/π)·(arctan(±√8) + nπ) ⇒ F₁ = 784 Hz, F₂ = 1216 Hz
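The same example solved numerically, scanning for sign changes of g(f) = (A2/A1)·tan(ωl1/c)·tan(ωl2/c) − 1 (a brute-force sketch; the analytic route above is exact):

import numpy as np

c, l1, l2, ratio = 350.0, 0.0875, 0.0875, 1.0 / 8.0   # A2/A1 = 1/8

def g(f):
    w = 2 * np.pi * f
    return ratio * np.tan(w * l1 / c) * np.tan(w * l2 / c) - 1.0

freqs = np.arange(50.0, 2000.0, 0.5)
vals = g(freqs)
# With l1 = l2 the poles of tan do not flip the sign of g, so every sign
# change on the grid is a genuine resonance.
roots = [float(f) for f, v0, v1 in zip(freqs, vals[:-1], vals[1:]) if v0 * v1 < 0]
print(roots[:2])   # ~[784, 1216] Hz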

Consonants
The source is somewhere else than at the glottis.
Some cavities may be closed: e.g. the mouth cavity for nasals.

Consonants
Two homogeneous tubes in series, with the power source U between them:
H = Iut/Iin = (Z₂⁻¹·sinh θ₂)/[cosh θ₁·cosh θ₂·(1 + (Z₁/Z₂)·tanh θ₁·tanh θ₂)]
The same poles as before, but now zeroes as well, when |sinh θ₂| = 0!

Formant Synthesis

OVE II

Digital resonators
Model the poles directly instead!
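A second-order digital resonator is the standard building block for one pole pair; this sketch uses the classic difference equation y[n] = A·x[n] + B·y[n−1] + C·y[n−2], with coefficients set from a formant frequency and bandwidth (a textbook formulation, not necessarily OVE II's exact implementation):

import numpy as np

def resonator(x, f, bw, fs):
    """All-pole second-order resonator at frequency f (Hz), bandwidth bw (Hz)."""
    T = 1.0 / fs
    C = -np.exp(-2.0 * np.pi * bw * T)
    B = 2.0 * np.exp(-np.pi * bw * T) * np.cos(2.0 * np.pi * f * T)
    A = 1.0 - B - C                    # unity gain at 0 Hz
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = A * x[n] + B * y[n - 1] + C * y[n - 2]
    return y

# Cascade of formant resonators over a pulse-train source: a crude vowel.
fs = 16000
source = np.zeros(fs)
source[:: fs // 100] = 1.0             # 100 Hz pulse train (1 s)
y = source
for f, bw in [(500, 60), (1500, 90), (2500, 120)]:   # neutral-vowel formants
    y = resonator(y, f, bw, fs)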


Formants and bandwidths
An all-pole model: resonances when the denominator is zero.
Bandwidths: a function of energy losses due to heat conduction, viscosity, cavity-wall motions, radiation of sound from the lips, and the real part of the glottal source impedance.

Synthesis by rule
Lab exercise 3


Formant synthesis lab (http://www.speech.kth.se/courses/GSLT_SS/lab3.html)
• The task is to adapt the synthesis of "Dom flyttade möblerna" to sound like a target speaker.
• Start Wavesurfer and open the reference sentence in the "Speech Analysis" configuration.
• Create a transcription pane: right-click > "Create Pane > Transcription". Download the automatic transcription and load it by right-clicking in the transcription pane > "Load transcription".
• Start a new Wavesurfer (File > New), choose "Formant Synthesis sw".
• Type "Dom flyttade möblerna" into the Text slot and Synthesize.
• Edit the parameters that are displayed: F0, F1–F4. Left-click and drag the parameter track; insert new control points by right-clicking on a parameter track.
• To make a phoneme longer/shorter: click in the transcription window and drag to the left/right.
• How close do you get by just editing pitch, formants and duration?

Data-driven formant synthesis
[Figure: formant tracks (F1–F4, 0–4000 Hz) for "M O B I: L sil".]
• Keeps the flexibility of the formant synthesis
• More natural sounding than rule-driven synthesis
• Speaker adaptation

Formant unit selection
• Formants are chosen through unit selection from a formant diphone library of about 2000 diphones.
• Formant trajectories are scaled and interpolated to fit rule-generated durations.
[Figure: formant tracks (F1–F4, 0–4000 Hz) for "M O B I: L sil".]

Synthesis comparison
[Figure: data-driven vs. rule-driven formant tracks for the same utterance.]


Listening test evaluations
• 15 subjects, 20 sentences, continuous scale from "Unnatural" to "Natural".
• 4 types of stimuli:
1. Rule-based synthesis
2. Data-driven formant synthesis
3. + Data-driven fricative synthesis
4. + Replace the voiceless fricatives (/f/, /s/, /sj/, /tj/, /rs/) and plosives (/k/, /p/, /t/, /rt/) with recorded versions.
• 12 subjects, 10 sentences, binary scale.
• Data-driven synthesis with manually corrected formant data was preferred in 73% of the cases over rule-driven synthesis.

Evaluation results
[Figure: naturalness ratings for rule-driven and data-driven synthesis: overall, sentences without critical errors, sentences with critical errors, and hand-corrected sentences.]


Evaluation

Evaluation: Why?
• Monitoring the development
– Initial: choosing a "good" voice, a good inventory
– Progress evaluation
– Diagnostic evaluation: find out where things go wrong and why
• Performance evaluation
– For developers: comparative evaluation (intelligibility, quality)
– For users: overall quality evaluation (comprehensibility, adequacy, usability)


Diagnostic evaluation: different levels
• Segmental: intelligibility tests on the ability to distinguish individual sounds.
– Diagnostic Rhyme Test (DRT)
– Modified Rhyme Test (MRT)
– Minimal Pair Intelligibility Tests (MPIT)
– Phonetically Balanced Word List (PB)
– Nonsense words
• Sentence: comprehension of words or short sentences
• Comprehension: more than one sentence
• Prosody: assessment of intonation and emotions
• Subjective opinions
Standard procedures are available only for segmental evaluation.

Diagnostic Rhyme Test
• Consonant intelligibility in word-initial position.
• 96 word-pairs to test 6 characteristics:
Voicing: veal – feel
Nasality: reed – deed
Sustension: vee – bee, sheat – cheat
Sibilation: sing – thing
Graveness: weed – reed
Compactness: key – tea
• Forced choice.
• Intelligibility = number of correct identifications compared to all words.
• Diagnostic information given in confusion matrices.


Pros and cons of DRT
Pros:
• Limited number of stimuli, not too time consuming
• Naive listeners can take part
• Easy to interpret the results
• Closed response format
• Confusion matrices help to localise the problems
Cons:
• Consonantal intelligibility only in word-initial position (=> Modified Rhyme Test: initial and final)
• Isolated contexts
Example choice: din, sin, fin, pin, win, tin.

Minimal pairs intelligibility test
• Nonsense sentences. Forced choice (two alternatives):
a) "the uniform towels snitch a sniffer" / b) "the uniformed towels snitch a sniffer" – forced choice between uniform and uniformed
• Phonetic features:
– Consonant and vowel substitution: copper–chopper, tutor–teeter
– Consonant insertion/deletion: attitudes–altitudes
– One-feature substitutions: ringers–riggers
– Two-feature substitutions: burnish–furnish
– Word initial: gasket–basket
– Word internal: musty–musky
– Word final: familiar–familial
• Segment location: stressed, unstressed
• Word location: initial, medial, final


Evaluation problems
• It is unrealistic to test one level at a time: they are not independent.
• Can we really evaluate the intelligibility of TTS at segmental level?
• Is intelligibility more important than naturalness?
• Limitations of subjective tests:
– Learning effects
– Concentration problems
– Choice of listeners: naive or expert?
• Is it possible to build objective tests?

Comprehension test: check mental load (De Logu et al. 1998)
Single-task performance measure: listen to and understand 2 passages and answer ten multiple-choice questions. Subjects: 2 groups, one listening to synthetic speech, one listening to natural speech. Results: no significant differences between understanding synthetic and natural speech.
Multiple-task performance measure: listen to and understand 1 passage and at the same time detect clicks occurring in the passage. Subjects: the same. Results: subjects who listened to synthetic speech took longer to identify the clicks.
Comprehension tests are difficult to construct due to the intervention of cognitive factors.

Subjective opinion tests
• Listeners are presented with a set of stimuli to be rated on:
– Overall impression & acceptance (quality)
– Listening effort, comprehension (intelligibility)
– Pronunciation, speaking rate & voice quality (naturalness)
Mean Opinion Score (MOS): evaluates the general speech quality.
5 excellent, 4 good, 3 fair, 2 poor, 1 bad.
Degradation Mean Opinion Score (DMOS): evaluates how disturbances are perceived.
5 inaudible, 4 audible but not annoying, 3 slightly annoying, 2 annoying, 1 very annoying.

Comparing systems
• No standard procedures are available to carry out comparative evaluations of systems.
• Most common is to use preference scores: (system A much better) – system A better – no difference – system B better – (system B much better).
[Figure: preference-score bar chart comparing the HTS prosody/spectrum variants.]


Synthesis of the future

Speaker adaptation
• Why?
– Make the synthesis more human-sounding, more diverse, more personalized.
– Synthesize ordinary speech of ordinary people!
• What?
– The non-linguistic (?) features of the acoustic signal: voice quality, gender, age, dialect, sociolect.
• How?
– Record the speaker as target, or adapt the synthesis (by statistics or rules).
• Various contexts: low, raspy voices; strong, commanding voices; children's and old persons' voices; promotional voices; emotional voices, etc.


Speaker characteristics
Linguistic vs. individual components:
• The linguistic component: semantic information that is part of the speaker's language (e.g. question intonation).
• The paralinguistic component: the speaker's attitudinal or emotional states, sociolect and regional dialect.
• The extralinguistic component: the individuality, gender and age of a certain speaker. It can be judged independently of the language.
To adapt a speech synthesizer to a certain speaker, we need both the para- and extralinguistic components.

Speaker Variability: Dialect
• Different dialects use different phonemes for the same word
– e.g. British vs. American "better"; British vs. Australian "say"
• Different dialects use different allophones for the same phoneme
– Swedish: Öga/Öra, Äga/Ära (Värmland–Östergötland)
• Differences in prosody and accent.


Speaker Variability: Sociolect
New York City department store study: [r] in 'fourth floor'
Shop | [r]% | Status
Saks 5th Av. | 62% | high
Macy's | 51% | middle
Klein's | 20% | low

Individual Differences
• Within-speaker variability: can change F0 and voice quality.
• Between-speaker variability: cannot change basic physiology (lungs, vocal folds, vocal tract…), which limits the ranges of F0 and voice qualities.
• Difficult to change:
– Sociolect: level of education / social environment
– Personal history
[Figure: mean normalized F0 in vowels (in Bark) for different languages: Dutch, French, Swedish, RP English, American English.]
Swedish examples: Liiidingö, sju


Sounding Gay
Fricative duration:
• Crist (1997) – 5 out of 6 speakers exhibited longer /s/ in gay stereotyped speech
• Linville (1998) – gay speakers had longer /s/
• Rogers, Smyth, and Jacobs (2000) – both /s/ and /z/ were longer in gay-sounding speech
• Levon (2004) – altering sibilant duration alone insufficient to change perception of gayness

Emotions
Creating emotional speech synthesis of a text requires:
a. Signal processing: algorithms for altering the acoustic prosodic parameters of the speech.
b. Prosody modeling: creating typical patterns corresponding to different emotions.
c. Text analysis: finding textual cues to prosody and the expressive intention of a text.


Two approaches
1. Design a general method of assigning a given expressive intention to any text.
– An ongoing and challenging task, involving research on signal processing, speech acoustics and human communication.
2. Enrich synthetic messages with expressive phrases and sounds, which convey expressive intentions.
– A commercially available solution: e.g., Loquendo, IBM.
Example: "Yes sir, the package will be on your desk tomorrow. And I say that with the utmost confidence. I will take care of it. How will I take care of it? I don't know how I'm going to take care of it. If I knew how to take care of it…"

Emotion analysis
How to determine synthesis parameters for different emotions?
– Professional acting
– Amateur acting
– Read a text with different emotions
Acted and read speech is widely used, but… does it reflect the way emotions are expressed in spontaneous speech?
Alternatives:
• Wizard of Oz scenarios
• Customer calls to call centres: lots of real emotional speech – but, permissions?
• TV shows (Oprah, Ricki Lake, Dr. Phil etc.)


Emotion databases
Again, two approaches:
1. Create large databases for each emotion you want to synthesize and use the entries as such.
• E.g. diphones
• Duplicate the database for each emotion…
2. Modify the default output signal from the synthesizer using emotion rules.
• Small set of phonetically balanced sentences (25 or so)
• Sentences without emotional content, e.g. "The competitor has made twenty five offers, closing only five contracts"
• Compare with a neutral style.

Emotion correlates
[Figure: Loquendo measurements of mean F0, F0 range, syllable duration and RMS energy (relative change in %) for angry, happy and sad, broken down by syllable position: FS = first stressed syllable of the sentence or after a speech pause, S = stressed syllable, LS = last stressed syllable of the sentence or before a speech pause, U = unstressed syllable.]


Emotion synthesis scheme
Input text → analysis prosodic parameters → acoustic unit selection (energy, duration, pitch, signal) → expressive style ("E" rules, "D" rules, "P" rules) → synthesis prosodic parameters (energy, duration, pitch) → time and pitch scaling + gain function (PSOLA) → output waveform.

Examples
Loquendo's Susan: [Figure: F0 contours (Hz vs. time) of the same sentence in neutral, sad, happy and angry styles.]
Many more on http://emosamples.syntheticspeech.de/


Evaluation results
• Texts without emotional content:
– "The competitor has made twenty five offers, closing only five contracts"
• Volunteers listened to samples in random order and evaluated from 0 to 5 how much sad, angry, happy or neutral each stimulus sounded.
[Figure: mean ratings of the neutral, angry, happy and sad TTS stimuli.]

Emotional questions
But, the closer we get to "real" emotions, the more difficult it is to recognize them!
• Up to 95% correct identification on acted speech
• Up to 79% on read speech
• Up to 73% on lab-recorded dialogue data
What is the goal of expressive synthesis? To convey an emotion? To make the synthesized emotion sound natural?
And, how many emotions do we have? Four? Seven? (Ekman: neutral + sadness, happiness, anger, fear, disgust, surprise)


Why 'real speech' synthesis?
• 1000 hours of everyday conversation
• Recorded with a head-mounted mic to DAT and Minidisc
• Analyzed acoustically, manually transcribed, & perceptually labeled
• No studio use, no recording constraints
• Japanese native-language speakers, mixed ages, in everyday situations
=> A paralinguistic speech corpus (the NATR approach)

But how to synthesize?
• 'Yeah!', 'Right on!', 'Fantastic!', 'Hi!'
• Why can't we synthesize 'real speech'?
– Because we assume that words alone carry most of the meaning in speech
– But the '85%' (?) of speech which is non-verbal is largely monosyllabic
– Monosyllables can be very repetitive – unless they vary in another dimension
• Voice quality: spectral features, voice source features and temporal features (e.g., voice on-/offsets, jitter, creak, etc.).


Acoustic Analysis
Measures computed between quasi-syllable boundaries (quasi-syllabic nuclei, with phonetic labels if available): sonorant energy, F0 contour, formant/cepstrum (FFT), cepstral distance, composite measures, glottal AQ (pressed ↔ breathy), estimated vocal-tract area functions; the variance in the delta-cepstrum serves as a measure of reliability.

Discourse Act Labelling
a greeting, b closing, c introduce-self, d introduce-topic, e give-information, f give-opinion, g affirm, h negate, i accept, j reject, k acknowledge, l interject (back-channel), m thank, n apologize, o argue, p suggest/offer, q notice, r request-action, s connector, t complain, u flatter, w talking-to-self, x disfluency, y acting, z repeat; r* request (a–z), v* verify (a–z).
Speaking style (and voice) vary greatly, depending upon (a) the situation, (b) who we are speaking to, (c) how we feel about what we are saying!


Concept-to-speech: why?
So is future text-to-speech synthesis just cut and paste from an enormous database?
Ways to say "question about my bill" to AT&T (count, phrase): 105 question about my bill; 63 question on my bill; 57 calling about my bill; 43 talk to somebody about my bill; 41 talk to someone about my bill; 32 questions about my bill; 30 problem with my bill; 23 speak to someone about my bill; 22 calling about a bill; 20 calling about my phone bill; 16 questions on my bill; 16 question about a bill; 15 talk about my bill; 11 question about my phone bill; 11 question about my billing; 11 discuss my bill; 10 speak with someone about my bill; 10 calling about my billing; 9 problem with my phone bill; 9 calling about my telephone bill; 8 speak to someone in billing; 8 question about the bill; 7 speak to somebody about my bill; 7 speak to a billing; 7 question on my phone bill; 7 calling regarding my bill; 7 calling concerning my bill; 6 talk to somebody in billing; 6 questions about my billing; 6 question on my billing; 6 problem with my billing; 6 information about my bill; 6 calling about my A T and T bill; 5 talk to someone about my phone bill; 5 talk to someone about a bill; 5 talk to somebody about my billing; 5 talk to somebody about a bill; 5 speak to someone in the billing; 5 speak to someone about a bill; 5 questions on my billing; 5 question on the bill; 5 question on a bill; 5 question my bill; 5 calling in regards to my bill; 5 calling about the bill; 4 talk to someone about my telephone bill; 4 talk to somebody about my account; 4 talk to billing; 4 speak with someone in billing; 4 question about my telephone bill; 4 information on my bill; 4 calling regarding my statement; …; 1 talk to someo- to someone about my moms telephone bill; 1 question about the new A T and T billing. Total: 1083 variations in 1912 matches.
Humans do not read a text aloud, we talk!

Concept-to-speech: what?
• Input: abstract presentation goal or a machine-generated message.
• Output: syntactic structure for concept-to-speech synthesis.
• Language-independent text planning component.
• Language-specific domain grammars.
• Enriched information passed to synthesis.


Concept-to-speech: how?

Slot filling or generation? Either: Put key information into different carrier phrases

Or: Generate utterances based on content.

