Topic: Spectrogram, Cepstrum and Mel-Frequency Analysis

Speech Technology: A Practical Introduction Topic: Spectrogram, Cepstrum and Mel-Frequency Analysis

Kishore Prahallad Email: [email protected]

Carnegie Mellon University & International Institute of Information Technology Hyderabad 1 Speech Technology - Kishore Prahallad ([email protected]) Topics

• Spectrogram • Cepstrum • Mel-Frequency Analysis • Mel-Frequency Cepstral Coefficients

2 Speech Technology - Kishore Prahallad ([email protected]) Spectrogram

3 Speech Technology - Kishore Prahallad ([email protected]) Speech signal represented as a sequence of spectral vectors

FFT FFT FFT

Spectrum

4 Speech Technology - Kishore Prahallad ([email protected]) Speech signal represented as a sequence of spectral vectors

FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT

Spectrum

5 Speech Technology - Kishore Prahallad ([email protected]) Speech signal represented as a sequence of spectral vectors

FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT

Spectrum

Amp.

Hz 6 Speech Technology - Kishore Prahallad ([email protected]) Speech signal represented as a sequence of spectral vectors

FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT

Spectrum

Rotate it by 90 degrees

Amplitude 7 Speech Technology - Kishore Prahallad ([email protected]) Speech signal represented as a sequence of spectral vectors

FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT

Spectrum

Hz • MAP spectral amplitude to a grey level (0- 255) value. 0 represents black and 255 represents white. • Higher the amplitude, darker the corresponding region. Amplitude 8 Speech Technology - Kishore Prahallad ([email protected]) Speech signal represented as a sequence of spectral vectors

FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT

Spectrum

Time 9 Speech Technology - Kishore Prahallad ([email protected]) Speech signal represented as a sequence of spectral vectors

Time Vs Frequency FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT representation of a speech signal is referred to as spectrogram Spectrum

Time 10 Speech Technology - Kishore Prahallad ([email protected]) Some Real Spectrograms

Dark regions indicate peaks (formants) in the spectrum

11 Speech Technology - Kishore Prahallad ([email protected]) Why we are bothered about spectrograms Phones and their properties are better observed in spectrogram

12 Speech Technology - Kishore Prahallad ([email protected]) Why we are bothered about spectrograms Sounds can be identified much better by the Formants and by their transitions

13 Speech Technology - Kishore Prahallad ([email protected]) Why we are bothered about spectrograms Sounds can be identified much better by the Formants and by their transitions

Hidden Markov Models implicitly model these spectrograms to perform speech recognition

14 Speech Technology - Kishore Prahallad ([email protected]) Usefulness of Spectrogram

• Time-Frequency representation of the speech signal

• Spectrogram is a tool to study speech sounds (phones)

• Phones and their properties are visually studied by phoneticians

• Hidden Markov Models implicitly model spectrograms for speech to text systems

• Useful for evaluation of text to speech systems – A high quality text to speech system should produce synthesized speech whose spectrograms should nearly match with the natural sentences.

15 Speech Technology - Kishore Prahallad ([email protected]) Cepstral Analysis

16 Speech Technology - Kishore Prahallad ([email protected]) A Sample Speech Spectrum

Frequency (Hz) • Peaks denote dominant frequency components in the speech signal • Peaks are referred to as formants • Formants carry the identity of the sound 17 Speech Technology - Kishore Prahallad ([email protected]) What we want to Extract? – Spectral Envelope • Formants and a smooth curve connecting them • This Smooth curve is referred to as spectral envelope

Frequency (Hz)

18 Speech Technology - Kishore Prahallad ([email protected]) Spectral Envelope

Spectrum

Spectral Envelope

Spectral details

19 Speech Technology - Kishore Prahallad ([email protected]) Spectral Envelope

Spectrum log X[k]

Spectral log H[k] Envelope

Spectral log E[k] details

20 Speech Technology - Kishore Prahallad ([email protected]) Spectral Envelope

Spectrum log X[k] log X[k] = log H[k] + log E[k]

1. Our goal: We want to Spectral log H[k] separate spectral Envelope envelope and spectral details from the spectrum. Spectral log E[k] details 2. i.e Given log X[k], obtain log H[k] and log E[k], such that log X[k] = log H[k] + log E[k] 21 Speech Technology - Kishore Prahallad ([email protected]) How to achieve this separation ?

22 Speech Technology - Kishore Prahallad ([email protected]) Play a Mathematical Trick

Spectrum • Trick: Take FFT of the spectrum!! • An FFT on spectrum referred to as Inverse FFT (IFFT). Spectral Envelope • Note: We are dealing with spectrum in log domain (part of the trick) • IFFT of log spectrum would represent the signal in pseudo- Spectral frequency axis details 23 Speech Technology - Kishore Prahallad ([email protected]) Play a Mathematical Trick

Spectrum

Spectral Envelope

Spectral A pseudo-frequency details axis 24 Speech Technology - Kishore Prahallad ([email protected]) Play a Mathematical Trick

Spectrum

Low Freq. High Freq. region region

Spectral Envelope

Spectral A pseudo-frequency details axis 25 Speech Technology - Kishore Prahallad ([email protected]) Play a Mathematical Trick

Spectrum

Low Freq. High Freq. region region

Spectral Envelope

IFFT

Spectral A pseudo-frequency details axis 26 Speech Technology - Kishore Prahallad ([email protected]) Play a Mathematical Trick

Spectrum

Low Freq. HighTreat Freq. this as a region region sine wave with 4 cycles per sec. Spectral Envelope

IFFT

Spectral A pseudo-frequency details axis 27 Speech Technology - Kishore Prahallad ([email protected]) Gives a peak at 4Play Hz in a Mathematical Trick frequency axis Spectrum

Low Freq. HighTreat Freq. this as a region region sine wave with 4 cycles per sec. Spectral Envelope

IFFT

Spectral A pseudo-frequency details axis 28 Speech Technology - Kishore Prahallad ([email protected]) Gives a peak at 4Play Hz in a Mathematical Trick frequency axis Spectrum

Low Freq. HighTreat Freq. this as a region region sine wave with 4 cycles per sec. Spectral Envelope

IFFT

Spectral A pseudo-frequency details axis 29 Speech Technology - Kishore Prahallad ([email protected]) Play a Mathematical Trick

Spectrum

Low Freq. High Freq. region region

Spectral Envelope

IFFT

Spectral A pseudo-frequency details axis 30 Speech Technology - Kishore Prahallad ([email protected]) Play a Mathematical Trick Gives a peak at 100 Hz in Spectrum frequency Low Freq.axis High Freq. region region

Treat this as a sine wave with Spectral 100 cycles per Envelope sec.

IFFT

Spectral A pseudo-frequency details axis 31 Speech Technology - Kishore Prahallad ([email protected]) Play a Mathematical Trick

Spectrum

Low Freq. High Freq. region region

Spectral Envelope IFFT

IFFT

Spectral A pseudo-frequency details axis 32 Speech Technology - Kishore Prahallad ([email protected]) Play a Mathematical Trick

Spectrum

Spectral Envelope

Spectral A pseudo-frequency details axis 33 Speech Technology - Kishore Prahallad ([email protected]) Play a Mathematical Trick

log X[k] = log H[k] + log E[k] Spectrum

IFFT

log H[k] Spectral Envelope

log E[k]

Spectral A pseudo-frequency details axis 34 Speech Technology - Kishore Prahallad ([email protected]) Play a Mathematical Trick x[k] = h[k] + e[k] log X[k] = log H[k] + log E[k] Spectrum

IFFT

log H[k] Spectral Envelope

log E[k]

Spectral A pseudo-frequency details axis 35 Speech Technology - Kishore Prahallad ([email protected]) Play a Mathematical Trick x[k] = h[k] + e[k] log X[k] = log H[k] + log E[k] Spectrum

IFFT

log H[k] Spectral Envelope

In practice all you have access to only log X[k] and hence log E[k] you can obtain x[k]

Spectral A pseudo-frequency details axis 36 Speech Technology - Kishore Prahallad ([email protected]) Play a Mathematical Trick x[k] = h[k] + e[k] log X[k] = log H[k] + log E[k] Spectrum

IFFT

log H[k] Spectral Envelope

If you know x[k] Filter the low frequency region to get log E[k] h[k]

Spectral A pseudo-frequency details axis 37 Speech Technology - Kishore Prahallad ([email protected]) Play a Mathematical Trick

x[k] = h[k] + e[k] log X[k] = log H[k] + log E[k] Spectrum

IFFT

A pseudo-frequency log H[k] Spectral axis Envelope

• x[k] is referred to as Cepstrum • h[k] is obtained by considering log E[k] the low frequency region of x[k]. • h[k] represents the spectral envelope and is widely used as feature for speech recognition Spectral details 38 Speech Technology - Kishore Prahallad ([email protected]) Cepstral Analysis

X[k] = H[k]E[k] || X[k ||] = || H[k ||] || E[k ||] ||.|| −denotes magnitude Take Log on both sides log || X[k ||] = log || H[k ||] + log || E[k ||] Taking inverse FFT on both sides x[k] = h[k] + e[k]

39 Speech Technology - Kishore Prahallad ([email protected]) Mel-Frequency Analysis

40 Speech Technology - Kishore Prahallad ([email protected]) Review: What we did

• We captured spectral envelope (curve connecting all formants) • BUT: Perceptual experiments say human ear concentrates on certain regions rather than using whole of the spectral envelope….

Frequency (Hz) 41 Speech Technology - Kishore Prahallad ([email protected]) Mel-Frequency Analysis

• Mel-Frequency analysis of speech is based on human perception experiments • It is observed that human ear acts as filter – It concentrates on only certain frequency components • These filters are non-uniformly spaced on the frequency axis – More filters in the low frequency regions – Less no. of filters in high frequency regions

42 Speech Technology - Kishore Prahallad ([email protected]) Mel-Frequency Filters

43 Speech Technology - Kishore Prahallad ([email protected]) Mel-Frequency Filters More no. of filters in low Lesser no. of filters in freq. region high freq. region

44 Speech Technology - Kishore Prahallad ([email protected]) Mel-Frequency Cepstral Coefficients (MFCC) • Spectrum Mel-Filters Mel-Spectrum • Say log X[k] = log (Mel-Spectrum) • NOW perform Cepstral analysis on log X[k] – log X[k] = log H[k] + log E[k] – Taking IFFT – x[k] = h[k] + e[k] • Cepstral coefficients h[k] obtained for Mel- spectrum are referred to as Mel-Frequency Cepstral Coefficients often denoted by *MFCC*

45 Speech Technology - Kishore Prahallad ([email protected]) Speech signal represented as a sequence of spectral vectors

FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT

Spectrum

Mel-Filters

Cepstral Analy.

46 Speech Technology - Kishore Prahallad ([email protected]) Speech signal represented as a sequence of CEPSTRAL vectors

FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT FFT

Spectrum

Cepstral Vectors

47 Speech Technology - Kishore Prahallad ([email protected]) Why we are going to use MFCC

• Speech synthesis – Used for joining two speech segments S1 and S2 – Represent S1 as a sequence of MFCC – Represent S2 as a sequence of MFCC – Join at the point where MFCCs of S1 and S2 have minimal Euclidean distance • Used in speech recognition – MFCC are mostly used features in state-of-art speech recognition system

48 Speech Technology - Kishore Prahallad ([email protected]) Summary: Process of Feature Extraction • Speech is analyzed over short analysis window • For each short analysis window a spectrum is obtained using FFT • Spectrum is passed through Mel-Filters to obtain Mel- Spectrum • Cepstral analysis is performed on Mel-Spectrum to obtain Mel-Frequency Cepstral Coefficients • Thus speech is represented as a sequence of Cepstral vectors • It is these Cepstral vectors which are given to pattern classifiers for speech recognition purpose

49 Speech Technology - Kishore Prahallad ([email protected]) Additional Reading

• Chapter 6 – Pg: 273 – 281 – Pg: 304 – 311 – Pg: 314 - 316

50 Speech Technology - Kishore Prahallad ([email protected])