
Parametric vocoding (for synthesis) (and coding)
Phil Garner, Idiap Research Institute

1 Contents

• Part 1: Background applications
  • Some justification for all this

• Part 2: Approaches to vocoding
  • A few insights into what’s involved
  • Probably incomplete, probably a couple of years out of date

• Part 3: Shortcomings
  • Some things I tried to address

2 Isn’t this a hearing conference?

During the evolution of human speech, the articulatory motor system has presumably structured its output to match those rhythms the auditory system can best apprehend. Similarly, the auditory system has likely become tuned to the complex acoustic signal produced by combined jaw and articulator rhythmic movements. Both auditory and motor systems must, furthermore, build on the existing biophysical constraints provided by the neuronal infrastructure.

– Giraud and Poeppel (2012), attributed to Heimbauer et al. (2011) and Liberman and Mattingly (1985)

3 Is it all moot?

• There is now WaveNet (left)
  • Synthesises the waveform directly

• Also
  • Lyrebird: https://lyrebird.ai
  • Char2Wav: http://josesotelo.com/speechsynthesis/
  • Tacotron (not strictly waveform synthesis)

• Do we really need to understand the production mechanism?

4 Part 1: Background applications

5 Text to speech

• It works!
  • Basic TTS in its raw form is a solved problem
  • http://www.cstr.ed.ac.uk/projects/festival/

6 The EMIME scenario

[Figure: traditional speech-to-speech translation. Speech in → Recognition (ASR models) → Translation → Synthesis (TTS models) → speech out, with speaker-to-speaker adaptation linking the ASR and TTS models.]

7 Example of adaptation

• Audio samples: natural speech, average voice, adapted voice

• Note:
  • 16 kHz speech (not such high quality for synthesis)
  • Wall St. Journal samples (an ASR database, more males)
  • VTLN adaptation (quite simple, gender characteristics)

8 What can be done with adaptation

http://homepages.inf.ed.ac.uk/jyamagis/Demo-html/map-new.html

9 Vocal tract length normalisation

• Distance between larynx and lips
  • Longer in males: “lower voice”
  • Shorter in females: “higher voice”
• Adaptation does more than alter vocal tract length
  • But VTL is easy to visualise

10 Bilinear warping

• Bilinear warping
  • can stretch or shrink a vocal tract
  • Enough to distinguish male / female
  • It can also be represented as a linear transform

• Very quick to learn
  • Only one speech example!
  • Only one parameter to learn

• Can be a prior for more sophisticated techniques

[Figure: bilinear warping curves, synthesised frequency against model frequency]
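To make the one-parameter claim concrete, here is a minimal NumPy sketch of the first-order all-pass (bilinear) frequency warp commonly used for VTLN; the function name and the example warp factors are illustrative, not from the slides.

```python
import numpy as np

def bilinear_warp(omega, alpha):
    """First-order all-pass (bilinear) frequency warp.

    omega: normalised frequency in radians (0..pi), the model frequency.
    alpha: the single warp parameter in (-1, 1); its sign stretches or
           shrinks the effective vocal tract length.
    Returns the warped (synthesised) frequency in radians.
    """
    return omega + 2.0 * np.arctan(
        alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))

# One parameter is enough to sweep between male- and female-like warps:
omega = np.linspace(0.0, np.pi, 256)
curves = {a: bilinear_warp(omega, a) for a in (-0.1, 0.0, 0.1)}
```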

11 Fast, difficult adaptation

• Note:
  • 48 kHz samples (quite good!)
  • RJS is a HTS2010 / Festival / Blizzard challenge voice
  • VTLN + SMAPLR adaptation
  • Roger: 1 sentence
  • Robyn: 4 sentences

• Audio samples: “RJS” (original), Roger (synthetic), Robyn (synthetic), Roger (real), Robyn (real)

12 Speech synthesis

• TTS involves lots of listening tests, e.g., Alex’s page
  • Which sounds most like the original?
  • Which is most natural?
• None of them! They all sound unnatural!
  • There is a characteristic “buzzy” quality
  • The effect of the vocoder is stronger than that of the thing we’re trying to modify

[Figure: random box plot]

13 Part 2: Approaches to vocoding

14 Recognition, Synthesis, Dialogue

[Figure: processing levels shared by recognition, synthesis and dialogue: sentence (translation, semantic level) → word → phone (/s/ /p/ /i/ /ch/ /k/ /ow/ /d/) → spectral → acoustic → spatial]

“One cannot understand speech recognition without understanding speech synthesis. One cannot understand either without understanding …”
– (in the spirit of) James Baker, ICASSP 2012, Kyoto.

15 Similar views…

…pinched from Andrea

There is no hope at the high levels if the lower ones don’t work!

16 Non-parametric vs. parametric

• MPEG-like codecs do not care about content
  • What comes out is what went in
  • May use LP, may not
    • mp3 uses LP
  • Generally structure doesn’t matter

• In TTS, we want to be able to control
  • Voice character
  • Pitch
  • Emotion
  • Structure matters!

• Examples:
  • STRAIGHT, YANG, Vocaine
  • WORLD: http://ml.cs.yamanashi.ac.jp/world/english/

17 The source-filter model

Donald Ayler, 1942–2007 (with Albert Ayler, 1936–1970) © The Wire Magazine

18 Source-filter for speech

[Figure: excitation × system = speech (simplistic frequency-domain view)]
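As a toy rendering of that picture, the sketch below filters an impulse-train excitation through a made-up all-pole system; the pitch, sample rate and filter coefficients are all invented for illustration, not taken from the talk.

```python
import numpy as np
from scipy.signal import lfilter

fs, f0 = 16000, 120.0          # sample rate and pitch, both invented

# Excitation: a bare impulse train at the pitch period
excitation = np.zeros(fs)      # one second of signal
excitation[::int(fs / f0)] = 1.0

# System: one all-pole resonance near 700 Hz, standing in for LP analysis
r, theta = 0.9, 2 * np.pi * 700 / fs
a = [1.0, -2 * r * np.cos(theta), r * r]
speech = lfilter([1.0], a, excitation)
```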

19 Parametric vocoding

• Parametric vocoding breaks the signal into physiologically motivated components
• System
  • Spectral shape
• Excitation
  • Pitch
    • May include voicing detection
  • Harmonic component (think …)
    • Depends on voicing
  • Noise component (think …)
    • Harmonic to noise ratio
    • Could be band-dependent

20 Spectral shape

• Spectral shape is quite well understood! Two choices:

• DFT
  • STRAIGHT is a DFT method

• Linear prediction
  • Tends to be lower dimensional
  • Attractive as it has a physiological basis

21 STRAIGHT &c

• STRAIGHT is essentially a pitch-synchronous DFT
  • Gets rid of pitch artefacts
• Later iteration: “Tandem STRAIGHT”
• STRAIGHT is known as a vocoder
  • Uses mixed excitation (see later)
• See also: CheapTrick (Morise, 2015)

Kawahara, Masuda-Katsuse & de Cheveigné, 1998

[Figures from the paper: pitch-synchronous analysis and isometric Gaussian spectrogram, 3D representations]

22 Linear prediction

[Figure: linear prediction spectral envelope, frame 220]
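A sketch of what “linear prediction spectral shape” means in practice: autocorrelation of a windowed frame, the Levinson-Durbin recursion, then the all-pole envelope. The frame here is a random stand-in, and the order and FFT size are my placeholder choices.

```python
import numpy as np

def levinson(r, order):
    """Levinson-Durbin: autocorrelation r -> LP coefficients a (a[0] = 1)
    and the final prediction error (squared gain)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] += k * a[i - 1:0:-1]   # reflect previous coefficients
        a[i] = k
        err *= 1.0 - k * k
    return a, err

frame = np.random.randn(400)          # stand-in for a 25 ms frame at 16 kHz
frame = frame * np.hanning(len(frame))
r = np.correlate(frame, frame, 'full')[len(frame) - 1:]
a, g = levinson(r, order=16)
envelope = np.sqrt(g) / np.abs(np.fft.rfft(a, 512))   # LP spectral envelope
```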

23 Excitation

Vocal folds do not produce impulses!

24 Excitation

• Our best guess is the LF model
  • Opening phase
  • Closing phase
  • Discontinuity when vocal folds come into contact
• This is the glottal flow derivative

[Figure: LF model glottal flow derivative E(t), with timing parameters tp, te, ta, tc and negative peak −Ee]

Gunnar Fant, Johan Liljencrants, and Qi-Guang Lin. “A Four-Parameter Model of Glottal Flow”. In: STL-QPSR 26.4 (1985). Paper presented at the French-Swedish Symposium, Grenoble, April 22–24, 1985, pp. 1–13. URL: http://www.speech.kth.se/qpsr
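Below is a rough NumPy/SciPy sketch of one period of the LF glottal flow derivative. Be warned that the published model fixes both the growth rate alpha and the return-phase constant epsilon implicitly (continuity and zero net flow); this sketch only solves the first of those and crudely approximates epsilon as 1/ta, so it is a caricature of the LF model, not a faithful implementation.

```python
import numpy as np
from scipy.optimize import brentq

def lf_pulse(fs, t_p, t_e, t_a, t_c, Ee=1.0):
    """One period of an LF-like glottal flow derivative (sketch).

    t_p: time of peak glottal flow (sinusoid peak, requires t_p < t_e < 2*t_p)
    t_e: time of the negative peak -Ee (the main excitation instant)
    t_a: effective duration of the return phase
    t_c: closure time (end of the period)
    """
    wg = np.pi / t_p                       # glottal formant frequency (rad/s)
    # Opening phase: exponential sinusoid, with alpha solved so that
    # exp(alpha * t_e) * sin(wg * t_e) == -1 (i.e. E(t_e) == -Ee)
    alpha = brentq(lambda a: np.exp(a * t_e) * np.sin(wg * t_e) + 1.0,
                   -1e4, 1e4)
    eps = 1.0 / t_a                        # crude; the exact LF fit is implicit
    t = np.arange(0.0, t_c, 1.0 / fs)
    open_ph = np.exp(alpha * t) * np.sin(wg * t) * (t <= t_e)
    ret_ph = (-(np.exp(-eps * (t - t_e)) - np.exp(-eps * (t_c - t_e)))
              / (eps * t_a)) * (t > t_e)   # exponential recovery to zero
    return Ee * (open_ph + ret_ph)

pulse = lf_pulse(fs=16000, t_p=0.004, t_e=0.005, t_a=0.0003, t_c=0.008)
```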

25 LF in frequency domain

• LF shapes the vocalisation, not the glottal noise
• HF is just noise
• There is a cut-off
• The whole thing is subject to the system response

[Figure: glottal spectrum against frequency, showing the LF contribution, the noise floor, and the system response]

26 Separation of … from noise

• Mixed excitation (Yoshimura et al., Eurospeech 2001)
• The mixed excitation is implemented using a multi-band mixing model (MLSA: Mel Log Spectrum Approximation)

[Figure: traditional and mixed excitation models. A periodic pulse train (with position jitter) and white noise are bandpass filtered, weighted by bandpass voicing strengths, summed, then passed through the MLSA filter and a pulse dispersion filter to give synthetic speech.]
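In the spirit of the figure above (and of MELP), a sketch of the multi-band mixing: the pulse train and the noise are split into the same five bands and mixed by per-band voicing strengths. The band edges follow the paper; the filter design, tap count and example strengths are my own choices, and the pulse jitter, MLSA and dispersion filters are omitted.

```python
import numpy as np
from scipy.signal import firwin, lfilter

fs = 16000
bands = [(0, 1000), (1000, 2000), (2000, 4000), (4000, 6000), (6000, 8000)]

def mixed_excitation(pulses, noise, strengths, numtaps=65):
    """Mix a pulse train and noise band by band; strengths in [0, 1]
    give each band's voicing (1 = all pulses, 0 = all noise)."""
    out = np.zeros_like(pulses)
    for (lo, hi), v in zip(bands, strengths):
        if lo == 0:                                   # bottom band: low-pass
            h = firwin(numtaps, hi, fs=fs)
        elif hi >= fs / 2:                            # top band: high-pass
            h = firwin(numtaps, lo, fs=fs, pass_zero=False)
        else:                                         # interior: band-pass
            h = firwin(numtaps, [lo, hi], fs=fs, pass_zero=False)
        out += v * lfilter(h, [1.0], pulses) + (1 - v) * lfilter(h, [1.0], noise)
    return out

pulses = np.zeros(fs)
pulses[::int(fs / 120)] = 1.0                         # 120 Hz pulse train
exc = mixed_excitation(pulses, np.random.randn(fs),
                       strengths=[1.0, 0.9, 0.7, 0.3, 0.1])
```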
27 Separation of harmonics from noise

• HNM: Harmonic plus noise model (Y. Stylianou, PhD, 1996)
  • Determine a maximum voiced frequency (MVF)
  • Model everything below that as harmonics
  • Model everything above that as noise

• These models are pretty good!

[Figure: spectrum (dB) against frequency, harmonics below the MVF, noise above]
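A stationary-frame sketch of the harmonic plus noise split, assuming we already have f0, a maximum voiced frequency and per-harmonic amplitudes (e.g. sampled from the spectral envelope); all the names and sizes here are mine.

```python
import numpy as np
from scipy.signal import firwin, lfilter

def hnm_frame(fs, f0, mvf, amps, n):
    """Harmonics up to the maximum voiced frequency (MVF), high-passed
    noise above it; amps holds one amplitude per harmonic."""
    t = np.arange(n) / fs
    k = np.arange(1, int(mvf / f0) + 1)            # harmonics below the MVF
    harm = (amps[:len(k), None]
            * np.cos(2 * np.pi * f0 * k[:, None] * t)).sum(axis=0)
    hp = firwin(129, mvf, fs=fs, pass_zero=False)  # noise takes the rest
    return harm + lfilter(hp, [1.0], np.random.randn(n))

frame = hnm_frame(fs=16000, f0=120.0, mvf=4000.0, amps=np.ones(64), n=1024)
```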

28 Part 3: Shortcomings

29 Degradation in vocoding

• The cause(s) of unnaturalness is (are) not well understood. There are several plausible theories:

• Voicing detection
  • For sure, incorrect voicing will cause artefacts.
  • Note: voiced fricatives have both periodic and aperiodic components.
• Incorrect aperiodicity
  • Aperiodicity is the part that is not voicing.
• Framing
  • Mains hum is “buzzy”; the buzz is the frame rate.
• Plus:
  • It's worse in TTS than just vocoding.
  • It's worse if you mess with the signal.

30 Pitch Extractor

[Figure: example pitch track, Hz against frame index]

• Issues:
  • Not defined for unvoiced segments
  • Tendency towards doubling and halving errors
  • 1001 different algorithms


31 Is pitch extraction solved?

Junichi’s method:

“f0 is first extracted using a wide range over a whole database, then a range is determined for each speaker and f0 is extracted again using three methods. Finally a median value of the three methods is chosen.”

• So, something is broken!
• A hypothesis:
  • Can we fix vocoding by fixing gross pitch errors?

32 Basic pitch extraction

Paul Boersma. “Accurate Short-Term Analysis of the Fundamental Frequency and the Harmonics-to-Noise Ratio of a Sampled Sound”. In: Proceedings of the Institute of Phonetic Sciences, Amsterdam 17. University of Amsterdam, 1993, pp. 97–110

• Boersma’s algorithm is in Praat
  • Pretty good
  • Essentially autocorrelation, but with lots of extra hacks

• Autocorrelation also gives a harmonic to noise ratio
  • You can convert HNR to a variance on the estimate
  • If so, you don’t need a voicing decision (see the sketch below)
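A bare-bones version of the autocorrelation estimate, without Boersma's window correction and interpolation hacks; the HNR formula follows his paper, the rest of the names and the search range are mine.

```python
import numpy as np

def pitch_and_hnr(frame, fs, fmin=60.0, fmax=400.0):
    """Crude autocorrelation pitch estimate plus harmonics-to-noise ratio.
    The frame must be longer than fs/fmin samples."""
    x = frame - frame.mean()
    r = np.correlate(x, x, 'full')[len(x) - 1:]
    r /= r[0]                                   # normalise so r[0] == 1
    lo, hi = int(fs / fmax), int(fs / fmin)     # admissible lag range
    lag = lo + np.argmax(r[lo:hi])
    rmax = r[lag]
    f0 = fs / lag
    # Boersma: HNR = 10 log10( r(tau_max) / (1 - r(tau_max)) )
    if 0.0 < rmax < 1.0:
        hnr = 10.0 * np.log10(rmax / (1.0 - rmax))
    else:
        hnr = -np.inf if rmax <= 0.0 else np.inf
    return f0, hnr
```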

33 Use HNR as precision in a Kalman filter
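The idea, sketched under assumptions of mine: a random-walk model for the pitch track, with each frame's observation variance derived from its HNR, so low-HNR (unvoiced) frames are merely uninformative rather than discarded. The HNR-to-variance mapping below is a placeholder; the SSP paper cited later derives this properly.

```python
import numpy as np

def kalman_smooth_pitch(f0, hnr, q=10.0):
    """Kalman smooth a raw pitch track (random-walk state model), using
    per-frame HNR as observation precision: no voicing decision needed.

    f0:  raw pitch estimates per frame (Hz)
    hnr: harmonics-to-noise ratio per frame (dB)
    q:   process (random walk) variance, a tuning constant
    """
    # Placeholder mapping: high HNR -> confident -> small variance (Hz^2)
    rvar = 1e4 / (1.0 + 10.0 ** (np.asarray(hnr) / 10.0))

    n = len(f0)
    m = np.empty(n); p = np.empty(n)     # filtered mean and variance
    m[0], p[0] = f0[0], rvar[0]
    for t in range(1, n):                # forward filtering pass
        pp = p[t - 1] + q                # predict: the state just persists
        k = pp / (pp + rvar[t])          # Kalman gain
        m[t] = m[t - 1] + k * (f0[t] - m[t - 1])
        p[t] = (1.0 - k) * pp
    for t in range(n - 2, -1, -1):       # backward (RTS) smoothing pass
        g = p[t] / (p[t] + q)
        m[t] += g * (m[t + 1] - m[t])
        p[t] += g * g * (p[t + 1] - (p[t] + q))
    return m                             # continuous pitch contour
```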

34 The SSP vocoder

• Given pitch and harmonic to noise ratio, you can build a vocoder! Ingredients:
  • Continuous pitch extraction
  • Linear Prediction based spectral shape
  • “Assorted” excitation methods
• Implemented in Python using NumPy

• Excitation examples
  • Original (the natural recording at 22 kHz)
  • Whisper (pure Gaussian noise excitation)
  • Robot (pure impulse excitation, constant pitch)
  • Impulse (impulse & noise excitation, continuous pitch)

35 Tentative pitch conclusion

• LPC actually works very well
• Accurate pitch seems not to matter
  • At least, it’s not the most important issue
• Rather, the synthetic speech is still “buzzy”
  • It’s more likely to be excitation related
• Note: the Kaldi pitch tracker is a good choice too

Philip N. Garner, Milos Cernak, and Petr Motlicek. “A Simple Continuous Pitch Estimation Algorithm”. In: IEEE Signal Processing Letters 20.1 (Jan. 2013), pp. 102–105. DOI: 10.1109/LSP.2012.2231675

Pegah Ghahremani et al. “A Pitch Extraction Algorithm Tuned for Automatic Speech Recognition”. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Florence, Italy, May 2014, pp. 2513–2517

36 LF automatic parameter extraction

• LF opening phase is an exponential sinusoid
• This is the same as a second order system

[Figure: mass-spring-damper system with mass m (kg), spring constant k (N/m) and damping c (N·s/m)]
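Spelling the equivalence out (my notation, not from the slides): match the LF opening phase to the impulse response of a mass-spring-damper system.

```latex
% LF opening phase vs. second-order (mass-spring-damper) system
E(t) = E_0\, e^{\alpha t}\sin(\omega_g t)
\quad\longleftrightarrow\quad
m\ddot{x} + c\dot{x} + kx = f(t)

% The second-order impulse response (underdamped case) is
x(t) \propto e^{-\frac{c}{2m}t}\,
  \sin\!\left(t\,\sqrt{\frac{k}{m} - \frac{c^2}{4m^2}}\right)

% so matching terms gives
\alpha = -\frac{c}{2m},
\qquad
\omega_g^2 = \frac{k}{m} - \frac{c^2}{4m^2}
```

A growing open phase (alpha > 0) corresponds to negative damping, i.e. a pole pair outside the unit circle once discretised, which is exactly the maximum-phase component the complex cepstrum picks out on the following slides.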

37 Rough production model

• There is a minimum / maximum phase distinction
  • … are inside the unit circle
  • Excitation is outside the circle
• Usual (real) signal processing does not distinguish min/max phase
• …but complex cepstrum does

[Figure: the glottal formant]

38 Algorithm

• Calculate complex cepstrum
• Replace min-phase portion with max-phase portion
• Invert (one sided) cepstrum, auto-correlation, order two LP

• See also: R. Maia et al. / Speech Communication 55 (2013)
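A sketch of the recipe, with one big simplification I should flag: a proper complex cepstrum needs phase unwrapping, whereas this version builds the maximum-phase part from the log magnitude alone (mirroring the usual minimum-phase construction). The frame, FFT size and order are placeholders; see Maia et al. for the real method.

```python
import numpy as np

def glottal_formant(frame, n=1024):
    """Slide recipe: cepstrum -> keep the anticausal (max-phase) part ->
    power spectrum -> autocorrelation -> order-2 LP -> pole angle."""
    logmag = np.log(np.abs(np.fft.fft(frame, n)) + 1e-9)
    c = np.fft.ifft(logmag).real            # real cepstrum (no phase here)
    cmax = np.zeros(n)
    cmax[0] = c[0] / 2.0
    cmax[n // 2:] = c[n // 2:]              # negative quefrencies: max phase
    spec = np.exp(2.0 * np.fft.fft(cmax).real)  # |max-phase part|^2
    r = np.fft.ifft(spec).real              # autocorrelation (Wiener-Khinchin)
    # Order-2 Levinson-Durbin, written out by hand
    k1 = -r[1] / r[0]
    e1 = r[0] * (1.0 - k1 * k1)
    k2 = -(r[2] + k1 * r[1]) / e1
    a = np.array([1.0, k1 + k2 * k1, k2])
    pole = np.roots(a)[0]                   # conjugate pair; take one
    return np.abs(np.angle(pole))           # glottal formant, radians/sample

theta = glottal_formant(np.random.randn(512))  # stand-in for a speech frame
```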

39 Glottal formant

(180° is 1000 Hz)

40 Examples

• The file that I was using during development
  • EMIME sample
• Milos got some rather good results with a LibriVox voice
  • Luke original
  • Luke STRAIGHT
  • Luke glottal

41 Excitation conclusion

• A better excitation model can get rid of the buzzy character
• Key seems to be in low pass filtering the impulsive part
  • This LPF is represented by the LF model
  • Only makes sense when added to noise
• Currently has a few issues
  • Automatic extraction is tricky
    • Gives all but one parameter
  • Introduces other distortions
  • Defined as differential glottal flow
    • Actual glottal flow would be better

42 Closing

• Conclusions
  • Pitch and HNR lead to a Kalman smoothed continuous pitch estimate
  • Vocoding without a voicing decision is easier than with
  • Glottal excitation is available from the complex cepstrum
  • Glottal modelling leads naturally to a maximum voiced frequency
• Corollaries
  • It’s all available open source in Python: https://github.com/idiap/ssp
  • Some in C++: https://github.com/idiap/libssp
