Lecture Notes
Speech synthesis
Olov Engwall, Speech synthesis, 2008

Presentations
Work in pairs in 6-minute mini-interviews (3 minutes each). Ask questions around the topics:
• What is your previous experience of speech synthesis?
• Why did you decide to take this course?
• What do you expect to learn?
Write down the answers of your partner. Present them during the presentation round. Submit the answers to me.
Why?
• To let me know more about your background and expectations, so that I can adapt the course content.
• To get to know each other.
• To "start you up"…

The course
This is what the course book will look like… Until then, refer to http://svr-www.eng.cam.ac.uk/~pat40/book.html
Course pages: www.speech.kth.se/courses/GSLT_SS
Lecture content (impossible to cover the entire book):
1) History, Concatenative synthesis, Unit selection, HMM synthesis, Text issues, Prosody
2) Vocal tract models, Formant synthesis, Evaluation
3) Term paper presentations, assignment correction
To do until next time:
1) Assignment 1: Unit selection calculations
2) Term paper topic selection

Definition & Main scope
The automatic generation of synthesized sound or visual output from any phonetic string.

Synthesis approaches
• By concatenation: elementary speech units are stored in a database and then concatenated and processed to produce the speech signal.
• By rule: speech is produced by mathematical rules that describe the influence of phonemes on one another.

History

von Kempelen
Wolfgang von Kempelen's book Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine (1791).

Wheatstone's version
Charles Wheatstone's version of von Kempelen's speaking machine.
The essential parts of the machine:
• pressure chamber = lungs,
• a vibrating reed = vocal cords,
• a leather tube = vocal tract.
The machine was
• hand operated,
• able to produce whole words and short phrases.
Why is it of interest to us? Parametric features!

First electronic synthesis
• Homer Dudley presented VODER (Voice Operating Demonstrator) at the World Fair in New York in 1939.
• The device was played like a musical instrument, with the voicing/noise source on a foot pedal and the signal routed through ten bandpass filters.

First formant synthesizers
• 1950's: PAT (Parametric Artificial Talker), Walter Lawrence. 3 electronic formant resonators, input signal (noise), 6 functions to control 3 formant frequencies, voicing amplitude, fundamental frequency, and noise amplitude.
• 1950's: OVE (Orator Verbis Electris) by Gunnar Fant.
• From 1950's: other synthesizers, including the first articulatory synthesis, DAVO (Dynamic Analog of the Vocal tract).
An excellent historical trip of speech synthesis: Dennis Klatt's History of Speech Synthesis at http://www.cs.indiana.edu/rhythmsp/ASA/Contents.html

Let us take a look at OVE
http://www.speech.kth.se/courses/GSLT_SS/ove.html
• OVE I (1953)
• On your computer today, and the original next time, together with OVE II (1962)

OVE Instructions
1. Test how the five different source models change the output. What is the difference in the formant pattern between different sources? Look at the number of formants, the peak amplitude, the bandwidth.
2. Alter a) the frequency and b) the shape of the source signal. What happens with the formant frequencies in the two cases? Relate these changes to human speech production.
3. Change the frequency values F1-F4. Start with a neutral vowel (F1=500 Hz, F2=1500 Hz, F3=2500 Hz, F4=3500 Hz). Explain the attenuation in formant peak amplitude for higher frequencies (hint: try a rectangle source and change the shape to 99). Now move one of the formant peaks so that it is about 200 Hz from the closest peak.
What happens with the neighbour peak? Change the bandwidth of the formants. What is the relation between the bandwidth and the formant peak amplitude?
4. Move the cursor around in the vowel space and see how the shape of the output waveform (green curve in the bottom panel) changes.
If you have time, try to generate the sentences "How are you?" and "I love you!".

Formant amplitudes
[Figure: formant amplitudes]

Speech analysis & manipulation

Why signal processing?
• Need to separate the source from the filter for modelling (linear predictive analysis)
• Need to model the sound source (prosody, speaker characteristics)
• Need to alter speech units in concatenative synthesis (amplitude, cepstra)
• Need to make concatenations smooth in concatenative synthesis (PSOLA)

The source-filter theory
The signal (c-d) is the result of a linear filter (b) excited by one or several sources (a).
[Figure: time and frequency views of the chain source (glottis) → filter (vocal tract) → radiation (lips)]
More to come on the vocal tract filter in lecture 7.

Source functions
• The voiced quasi-periodic source (glottis pulses) – vowels. Parameters:
  – on/off,
  – fundamental frequency F0,
  – intensity,
  – shape
• Frication source – fricatives
• Transient noise – plosives
• No source – voiceless occlusions

Voice source types
Type | Articulatory | Acoustic
Modal | Normal, efficient. Complete glottis closures. | Standard source. Steep spectral slope.
Breathy | Lack of tension. Never closes completely. | Audible aspiration. Slow "glottal return". Glottal pulse symmetry. Higher F0 intensity.
Whispery | Low glottal tension. Triangular glottal opening. High medial compression. Medium longitudinal tension. | High aspiration levels. Greater pulse asymmetry. Less time in open state.
Creaky | High adductive tension and medial compression. Little longitudinal tension. | Very low F0. Irregular F0 & amplitude.

The quasi-periodic source
Why is there a damping slope in the transfer function?
[Figure: glottal pulse train in time (period T0) and its line spectrum in frequency, with harmonics at f = nF0 = n/T0]

Simple vowel synthesis
Source → Filter (F1, F2, F3, F4) → Waveform
Triangle source and formant filters in cascade: bandpass filters with frequency, bandwidth, level.
So, how do we find the source from a speech signal?

Linear Prediction (LP)
A method to separate the source from the filter.
Predicts the next sample as a linear combination of the past p samples:
$$\tilde{x}[n] = \sum_{k=1}^{p} a_k\, x[n-k]$$
The coefficients a_1 … a_p describe a filter that is the inverse of the transfer function.
• Minimization of the prediction error results in an all-pole filter which matches the signal spectrum.
• This inverse filter removes the formants and can hence be used to find the source.

Spectral Fourier analysis
• A Fourier transform of the filter coefficients a_1 … a_p gives the frequency response of the inverse filter.
• A periodic waveform can be described as a sum of harmonics.
• The harmonics are sine waves with different phases, amplitudes and frequencies.
• The frequencies are multiples of the fundamental frequency.

Fourier Transforms
• Fourier transform (FT): a non-periodic signal has a continuous frequency spectrum.
• Discrete FT (DFT): Fourier transform of a sampled signal.
  – N samples in both the time and frequency domain.
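The "Simple vowel synthesis" and "Linear Prediction (LP)" slides above can be tried out together in a few lines of Python. This is only a minimal sketch under assumptions that are not in the slides: an 8 kHz sampling rate, a single unit pulse as source (instead of the triangle source), two formants at 500 Hz and 1500 Hz with 100 Hz bandwidth and no level control, and the autocorrelation method with the Levinson-Durbin recursion as one common way to minimize the prediction error. All function names are illustrative.

```python
import math

FS = 8000  # sampling rate in Hz (an assumption for this sketch)

def resonator_coeffs(freq_hz, bandwidth_hz, fs=FS):
    """Feedback coefficients (c1, c2) of a two-pole formant resonator
    y[n] = x[n] + c1*y[n-1] + c2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / fs)  # pole radius from bandwidth
    theta = 2.0 * math.pi * freq_hz / fs        # pole angle from frequency
    return 2.0 * r * math.cos(theta), -r * r

def resonate(x, freq_hz, bandwidth_hz, fs=FS):
    """Run the signal x through one formant resonator."""
    c1, c2 = resonator_coeffs(freq_hz, bandwidth_hz, fs)
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        out = sample + c1 * y1 + c2 * y2
        y.append(out)
        y1, y2 = out, y1
    return y

def autocorrelation(x, max_lag):
    """r[k] = sum over n of x[n] * x[n-k], for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def lpc(x, order):
    """Levinson-Durbin recursion. Returns ([a_1..a_p], residual energy)
    for the predictor x~[n] = sum_k a_k * x[n-k]."""
    r = autocorrelation(x, order)
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err

def inverse_filter(x, a):
    """Prediction error e[n] = x[n] - sum_k a_k * x[n-k]."""
    return [x[n] - sum(a[k - 1] * x[n - k]
                       for k in range(1, len(a) + 1) if n - k >= 0)
            for n in range(len(x))]

# Source -> filter: one pulse through two formant resonators in cascade.
source = [1.0] + [0.0] * 1999
vowel = resonate(resonate(source, 500, 100), 1500, 100)

# Analysis: estimate the all-pole filter, then remove it again.
a, err = lpc(vowel, 4)
residual = inverse_filter(vowel, a)
print("LP coefficients:", [round(c, 4) for c in a])
print("first residual samples:", [round(e, 6) for e in residual[:5]])
```

In this idealized case the residual is the single pulse again, because the synthetic signal really is all-pole and dies out inside the analysed segment. For real speech the match is only approximate and the residual is an estimate of the source.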
• For a real signal, the DFT spectrum is mirrored around half the sampling frequency.
• A periodic signal has a discrete spectrum.
• Fast FT (FFT): clever algorithm to calculate the DFT.
  – Reduces the number of multiplications: DFT $\sim N^2$, FFT $\sim (N/2)\log_2 N$. For N = 1024 that is about 1,048,576 versus 5,120.

Windowing
• The analysis of a long speech signal is made on short frames: analysis window 10–50 ms (20 ms in example).
• The truncation of the signal results in artefacts (sidelobes).
• The artefacts are reduced if the signal is multiplied with a window that gives less weight to the sides.

Effect spectrum (power spectrum)
• The FFT analysis gives complex values: amplitude and phase for each frequency component.
• The phase is often not interesting, only the signal's energy at different frequencies.
• The effect spectrum shows the power spectrum for a short section of the signal.
Processing chain: Windowing → FFT → Square → Logarithm

Cepstrum Analysis
• The dominating method for ASR, used in HMM synthesis
• Inverse Fourier transform of the logarithmic frequency spectrum: "spectral analysis of the spectrum"
• The coarse structure of the spectrum is described with a small number of parameters
• Orthogonal coefficients (uncorrelated)
• Anagram: spectrum-cepstrum, filtering-liftering, frequency-quefrency, phase-saphe

Cepstrum from filterbank
$$C_j = \sqrt{\frac{2}{N}} \sum_{i=1}^{N} A_i \cos\left(\frac{j\pi(i-0.5)}{N}\right)$$
[Figure: the spectra of /a:/ and /s/, multiplied by the weight functions W1–W4, give the cepstral coefficients C1–C4 of /a:/ and /s/]

Mel Frequency Cepstral Coefficients (MFCC)
Processing chain: FFT → Mel filter bank → Cepstrum transform
Mel filter bank: linear below 1000 Hz, logarithmic above 1000 Hz
[Figure: Mel-spectrum of /a:/ (dB against Mel, up to ~6000 Hz) and Mel-cepstrum of /a:/ (C1–C4)]
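The "Cepstrum from filterbank" formula can be checked with a small Python sketch. It assumes that the formula reads C_j = sqrt(2/N) · Σ_{i=1..N} A_i · cos(jπ(i − 0.5)/N), with A_i taken as the log amplitude of filterbank channel i. The function name and the two toy spectra are made up for illustration and are not measured data.

```python
import math

def filterbank_cepstrum(log_amplitudes, num_coeffs):
    """C_j = sqrt(2/N) * sum_{i=1..N} A_i * cos(j*pi*(i-0.5)/N), j = 1..num_coeffs."""
    n = len(log_amplitudes)
    scale = math.sqrt(2.0 / n)
    return [scale * sum(a_i * math.cos(j * math.pi * (i - 0.5) / n)
                        for i, a_i in enumerate(log_amplitudes, start=1))
            for j in range(1, num_coeffs + 1)]

N = 8  # number of filterbank channels in this toy example

# Toy spectrum 1: the same level in every channel.
flat = [60.0] * N

# Toy spectrum 2: the same level plus a slow tilt shaped like weight function W1.
tilted = [60.0 + 10.0 * math.cos(math.pi * (i - 0.5) / N) for i in range(1, N + 1)]

flat_cepstrum = filterbank_cepstrum(flat, 4)
tilted_cepstrum = filterbank_cepstrum(tilted, 4)
print("flat:  ", [round(c, 6) for c in flat_cepstrum])
print("tilted:", [round(c, 6) for c in tilted_cepstrum])
```

The constant level contributes nothing to C1–C4, and the slow tilt ends up in C1 alone. This is the sense in which a few coefficients describe the coarse structure of the spectrum, and in which the cosine weight functions are orthogonal.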
The Mel scale is perceptually motivated.

Concatenative synthesis

Nothing new under the sun…
• Peterson et al. (1958)
• Dixon and Maxey (1968)
• "Dyadic units" (Olive, 1977)

Let's get the terms straight
Concatenative synthesis
Definition: All kinds of synthesis based on the concatenation of units, regardless of type (sound, formant trajectories, articulatory parameters) and size (diphones, triphones, syllables, longer units).
(Everyday use: Concatenation of same-size sound units.)
Unit selection
Definition: All kinds of synthesis based on the concatenation of units where there are several candidates to choose from, regardless of whether the candidates have the same, fixed size or the size is variable.