For Speech Synthesis) (And Coding) Phil Garner Idiap Research InsTute

Parametric vocoding (for speech synthesis) (and coding) Phil Garner Idiap Research Ins1tute 1 Contents • Part 1: Background applicaons • Some jus1ficaon for all this • Part 2: Approaches to vocoding • A few insights into what’s involved • Probably incomplete, probably a couple of years out of date • Part 3: Shortcomings • Some things I tried to address 2 Isn’t this a hearing conference? During the evolu1on of human speech, the ar1culatory motor system has presumably structured its output to match those rhythms the auditory system can best apprehend. Similarly, the auditory system has likely become tuned to the complex acous1c signal produced by combined jaw and ar1culator rhythmic movements. Both auditory and motor systems must, furthermore, build on the exis1ng biophysical constraints provided by the neuronal infrastructure. – Giraud and Poeppel (2012) aributed to Heimbauer et al. (2011) and Liberman and Mangley (1985) 3 Is it all moot? • There is now wavenet (le) • Synthesises waveform directly • Also • Lyrebird hOps://lyrebird.ai • Char2Wav hOp://josesotelo.com/speechsynthesis/ • Tacotron (not strictly waveform synthesis) • Do we really need to understand the produc1on mechanism? 4 Part 1 Background applicaons 5 Text to speech synthesis • It works! • Basic TTS in its raw form is a solved problem • hOp://www.cstr.ed.ac.uk/projects/fes1val/ 6 The EMIME scenario Speech in Traditional speech-to-speech translation Speech out Recognition Translation Synthesis ASR Adaptation TTS models models Speaker-to-speaker adaptation 7 Example of adaptaAon Natural speech Average voice Adapted voice • Note: • 16 kHz speech (not such high quality for synthesis) • Wall St. Journal samples (an ASR database, more males) • VTLN adaptaon (quite simple, gender characteris1cs) 8 What can be done with adaptaon hOp://homepages.inf.ed.ac.uk/jyamagis/Demo-html/map-new.html 9 Vocal tract length normalisaAon • Distance between larynx and lips • Longer in males • “Lower voice” • Shorter in females • “Higher voice” • Adaptaon does more than alter vocal tract length • But VTL is easy to visualise 10 Bilinear warping • Bilinear warping • can stretch or shrink a vocal tract • Enough to dis1nguish male / female • It can also be represented as a linear transform • Very quick to learn • Only one speech example! • Only one parameter to learn Synthesised frequency Synthesised frequency • Can be prior for more sophis1cated techniques Model frequency 11 Fast, difficult adaptaAon • Note: “RJS” (original) • 48 kHz samples (quite good!) • RJS is a HTS2010 / Fes1val / Blizzard challenge voice • VTLN + SMAPLR Roger (synthe1c) Robyn (synthe1c) adaptaon • Roger: 1 sentence • Robyn: 4 sentence Roger (real) Robyn (real) 12 Speech synthesis • TTS involves lots of listening tests e.g., Alex’s page • Which sounds most like the original? • Which is most natural? • None of them! They all sound unnatural! • There is a characteris1c “buzzy” quality • The effect of the vocoder is Random box plot stronger than that of the thing we’re trying to modify 13 Part 2 Approaches to vocoding 14 Recogni1on Synthesis Dialogue Sentence Translaon Seman1c level Word Phone /s/ /p/ /i/ /ch/ /k/ /ow/ /d/ Spectral Acous1c Spaal “One cannot understand speech recogni4on without understanding speech synthesis. One cannot understand either without understanding speech coding.” - (in the spirit of) James Baker, ICASSP 2012, Kyoto. 15 Similar views… …pinched from Andrea There is no hope at the high levels if the lower ones don’t work! 16 Non-parametric vs. parametric • MPEG-like codecs do not care • In TTS, we want to be able to about content control • What comes out is what went in • Voice character • May use LP, may not • Pitch • mp3 uses LP • Emoon • Generally structure doesn’t maer • Structure maers! • Examples: • STRAIGHT, YANG, Vocaine • WORLD: hOp://ml.cs.yamanashi.ac.jp/world/ english/ 17 The source-filter model Donald Ayler, 1942–2007 (with Albert Ayler, 1936–1970) © The Wire Magazine 18 Source-filter for speech x = Excitaon System (Simplis1c frequency-domain view) 19 Parametric vocoding • Parametric vocoding breaks the signal into physiologically mo1vated components • System • Spectral shape • Excitaon • Pitch • May include voicing detec1on • Harmonic component (think vowels) • Depends on voicing • Noise component (think consonants) • Harmonic to noise rao • Could be band-dependent 20 Spectral shape • Spectral shape is quite well understood! Two choices: • DFT • STRAIGHT is a DFT method • Linear predic1on • Tends to be lower dimensional • AOrac1ve as it has a physiological basis 21 STRAIGHT &c • STRAIGHT is essen1ally a pitch- synchronous DFT • Gets rid of pitch artefacts • Later iteraon: “Tandem STRAIGHT” Figure 18: Pitch synchronous analysis: 3D representation. Figure 19: Isometric Gaussian spectrogram: 3D representation. Examplese using real speech samples are shown to • STRAIGHT is known as a vocoder illustrate operations and various representations in the proposed method. • Uses mixed excitaon (see later) 5.1 F0 extraction • See also: CheapTrick (Morise, Figure 17 shows results of the source information 2015) analysis using the proposed instantaneous-frequency- based procedure for a real speech example ‘right’ spo- ken by a female subject. The top plot shows the input Figure 18: Pitch synchronous analysis: 3D representa- 22 waveform of an utterance of the word ‘right’ by a fe- Kawahara, Masuda-tion. Katsuse & de Cheveigné, 1998 Figure 19: Isometric Gaussian spectrogram: 3D repre- male speaker. The second panel shows the total power sentation. of wavelet outputs. The center plot is the extracted fundamental frequency. The thin line in the plot represents the voiced part, and the thick dots represent data points Examplese using real speech samples are shown to classified as unvoiced. The next plot shows ‘funda- Figure 20: Spectrogram with reducedillustrate phasic operations interfer- and various representations in the mentalness’ values associated with the extracted fun- ence: 3D representation. proposed method. damental frequencies. The next plot illustrates output power that corresponds to channels consisting of the fundamental component. The bottom image illustrates 5.1 F0 extraction a ‘fundamentalness’ map for all channels. Note that the Those fluctuations in spectral valleys are artifacts voiced part corresponds to the salient dark blob in this due to sharp discontinuities at theFigure both edges 17 shows of the results of the source information map. rectangular time window. It cananalysis be concluded using by the com- proposed instantaneous-frequency- parison with Figure 19. Figurebased 19 shows procedure a 3D spec- for a real speech example ‘right’ spo- 5.2 Spectral envelope extraction trogram using an isometric Gaussiankenby time a window female thatsubject. The top plot shows the input is defined in Equation 2 and is pitchwaveform adaptive. of an As utterance the of the word ‘right’ by a fe- The extracted F0 information is used to control the Gaussian window has the minimummale uncertainty, speaker. The regu- second panel shows the total power spectral envelope extraction procedure. The pitch syn- lar local peaks virtually representof waveletthe sampled outputs. values The center plot is the extracted fun- of the underlying spectral envelope. These peak val- chronous analysis using a rectangular time window damental frequency. The thin line in the plot represents whose length is set equal to the fundamental period was ues change more smoothly than the pitch synchronous the voiced part, and the thick dots represent data points also conducted for comparison. Figure 18 shows a 3D spectrogram, both in the time domain and in the fre- representation of the pitch synchronous spectrogram. quency domain, indicating that theclassified underlying as spectral unvoiced. The next plot shows ‘funda- Figure 20: Spectrogram with reduced phasic interfer- The figure illustrates voiced portion. There are con- envelope is much smoother thanmentalness’ the conventional values pitch associated with the extracted fun- ence: 3D representation. siderable spectral fluctuations in spectral valleys while synchronous analysis suggests. damental frequencies. The next plot illustrates output spectral peaks in the lower frequency region shows Figure 20 shows a spectrogrampower calculated that corresponds by a re- to channels consisting of the smooth behavior. duced phasic interference procedure.fundamental It also indicates component. The bottom image illustrates a ‘fundamentalness’ map for all channels. Note that the Those fluctuations in spectral valleys are artifacts 12 voiced part corresponds to the salient dark blob in this due to sharp discontinuities at the both edges of the map. rectangular time window. It can be concluded by comparison with Figure 19. Figure 19 shows a 3D spec- 5.2 Spectral envelope extraction trogram using an isometric Gaussian time window that is defined in Equation 2 and is pitch adaptive. As the The extracted F0 information is used to control the Gaussian window has the minimum uncertainty, regu- spectral envelope extraction procedure. The pitch syn- lar local peaks virtually represent the sampled values chronous analysis using a rectangular time window of the underlying spectral envelope. These peak val- whose length is set equal to the fundamental period was ues change more smoothly than the pitch synchronous also conducted for comparison. Figure 18 shows a 3D spectrogram, both in the time domain and in the fre- representation of the pitch synchronous spectrogram. quency domain, indicating

Load more