Hidden Markov Models and Gaussian Mixture Models

Hidden Markov Models and Gaussian Mixture Models Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition| ASR Lectures 4&5 26&30 January 2017 ASR Lectures 4&5 Hidden Markov Models and Gaussian Mixture Models1 Overview HMMs and GMMs Key models and algorithms for HMM acoustic models Gaussians GMMs: Gaussian mixture models HMMs: Hidden Markov models HMM algorithms Likelihood computation (forward algorithm) Most probable state sequence (Viterbi algorithm) Estimting the parameters (EM algorithm) ASR Lectures 4&5 Hidden Markov Models and Gaussian Mixture Models2 Fundamental Equation of Statistical Speech Recognition If X is the sequence of acoustic feature vectors (observations) and W denotes a word sequence, the most likely word sequence W∗ is given by W∗ = arg max P(WjX) W Applying Bayes' Theorem: p(XjW) P(W) P(WjX) = p(X) / p(XjW) P(W) W∗ = arg max p(XjW) P(W) W | {z } | {z } Acoustic Language model model NB: X is used hereafter to denote the output feature vectors from the signal analysis module rather than DFT spectrum. ASR Lectures 4&5 Hidden Markov Models and Gaussian Mixture Models3 Acoustic Modelling Recorded Speech Decoded Text (Transcription) Hidden Markov Model Signal Analysis Acoustic Model Search Lexicon Training Space Data Language Model ASR Lectures 4&5 Hidden Markov Models and Gaussian Mixture Models4 Hierarchical modelling of speech Utterance Generative Model "No right" NO RIGHT Word n oh r ai t Subword HMM Acoustics ASR Lectures 4&5 Hidden Markov Models and Gaussian Mixture Models5 Calculation of p(X jW ) Speech signal Spectral analysis Feature vector sequence x X= x1x2 n p(X |/s/) p(X |/y/) p(X |/n/) p(X |/r/) p(X|/sayonara/) = 1 3 5 7 p(X2 |/a/) p(X4 |/o/) p(X6 |/a/) p(X8 |/a/) Acoustic (phone) model [HMM] /a/ /u//s/ /p/ /i/ /r/ /t/ /w/ NB: some conditional independency is assumed here. ASR Lectures 4&5 Hidden Markov Models and Gaussian Mixture Models6 How to calculate p(X1j=s=)? Assume x1; x2; ··· ; xT1 corresponds to phoneme /s/, the conditional probability that we observe the sequence is t d p(X1j=s=) = p(x1; ··· ; xT1 j=s=); xi = (x1i ; ··· ; xdi ) 2 R We know that HMM can be employed to calculate this. (Viterbi algorithm, Forward / Backward algorithm) To grasp the idea of probability calculation, let's consider an extremely simple case where the length of input sequence is just one (T1 = 1), and the dimensionality of x is one (d = 1), so that we don't need HMM. p(X1j=s=) −! p(x1j=s=) ASR Lectures 4&5 Hidden Markov Models and Gaussian Mixture Models7 How to calculate p(X1j=s=)? (cont.) p(xj=s=) : conditional probability (conditional probability density function (pdf) of x) A Gaussian / normal distribution function could be employed for this: 2 (x−µs ) 1 − 2 P(xj=s=) = e 2σs p 2 2πσs 2 The function has only two parameters, µs and σs Given a set of training samples fx1; ··· ; xN g, we can estimate µs and σs 1 PN 2 1 PN 2 µ^s = N i=1xi; σ^s = N i=1(xi − µs ) For a general case where a phone lasts more than one frame, we need to employ HMM. ASR Lectures 4&5 Hidden Markov Models and Gaussian Mixture Models8 Acoustic Model: Continuous Density HMM P(s s ) P(s2 s2) P(s s ) 1 | 1 | 3 | 3 sI s1 s2 s3 sE P(s1 sI) P(s2 s1) P(s s ) P(sE s3) | | 3 | 2 | p(x s1) p(x s ) p(x s3) | | 2 | x x x Probabilistic ﬁnite state automaton Paramaters λ: Transition probabilities: akj = P(S =j jS =k) Output probability density function: bj (x) = p(xjS =j) NB: Some textbooks use Q or q to denote the state variable S. x corresponds to ot in Lecture slides 02. ASR Lectures 4&5 Hidden Markov Models and Gaussian Mixture Models9 Acoustic Model: Continuous Density HMM sI s1 s2 s3 sE x1 x2 x3 x4 x5 x6 Probabilistic ﬁnite state automaton Paramaters λ: Transition probabilities: akj = P(S =j jS =k) Output probability density function: bj (x) = p(xjS =j) NB: Some textbooks use Q or q to denote the state variable S. x corresponds to ot in Lecture slides 02. ASR Lectures 4&5 Hidden Markov Models and Gaussian Mixture Models9 HMM Assumptions s(t 1) s(t) + − s(t 1) x(t 1) x(t) x(t + 1) − NB: unfolded version over time 1 Markov process: The probability of a state depends only on the previous state: P(S(t)jS(t−1); S(t−2);:::; S(1)) = P(S(t)jS(t−1)) A state is conditionally independent of all other states given the previous state 2 Observation independence: The output observation x(t) depends only on the state that produced the observation: p(x(t)jS(t); S(t−1);:::; S(1); x(t−1);:::; x(1)) = p(x(t)jS(t)) An acoustic observation x is conditionally independent of all other observations given the state that generated it ASR Lectures 4&5 Hidden Markov Models and Gaussian Mixture Models 10 Output distribution P(s s ) P(s2 s2) P(s s ) 1 | 1 | 3 | 3 sI s1 s2 s3 sE P(s1 sI) P(s2 s1) P(s s ) P(sE s3) | | 3 | 2 | p(x s1) p(x s ) p(x s3) | | 2 | Single multivariate Gaussian with mean µj , covariance matrix Σj : x x x bj (x) = p(xjS =j) = N (x; µj ; Σj ) M-component Gaussian mixture model: M X bj (x) = p(xjS =j) = cjm N (x; µjm; Σjm) m=1 Neural network: bj (x) ∼ P(S =j jx) = P(S =j) NB: NN outputs posterior probabiliies ASR Lectures 4&5 Hidden Markov Models and Gaussian Mixture Models 11 Background: cdf Consider a real valued random variable X Cumulative distribution function (cdf) F (x) for X : F (x) = P(X ≤ x) To obtain the probability of falling in an interval we can do the following: P(a < X ≤ b) = P(X ≤ b) − P(X ≤ a) = F (b) − F (a) ASR Lectures 4&5 Hidden Markov Models and Gaussian Mixture Models 12 Background: pdf The rate of change of the cdf gives us the probability density function (pdf), p(x): d p(x) = F (x) = F 0(x) dx Z x F (x) = p(x)dx −∞ p(x) is not the probability that X has value x. But the pdf is proportional to the probability that X lies in a small interval centred on x. Notation: p for pdf, P for probability ASR Lectures 4&5 Hidden Markov Models and Gaussian Mixture Models 13 The Gaussian distribution (univariate) The Gaussian (or Normal) distribution is the most common (and easily analysed) continuous distribution It is also a reasonable model in many situations (the famous \bell curve") If a (scalar) variable has a Gaussian distribution, then it has a probability density function with this form: 1 −(x − µ)2 p(x jµ, σ2) = N (x; µ, σ2) = p exp 2πσ2 2σ2 The Gaussian is described by two parameters: the mean µ (location) the variance σ2 (dispersion) ASR Lectures 4&5 Hidden Markov Models and Gaussian Mixture Models 14 Plot of Gaussian distribution Gaussians have the same shape, with the location controlled by the mean, and the spread controlled by the variance One-dimensional Gaussian with zero mean and unit variance (µ = 0, σ2 = 1): pdf of Gaussian Distribution 0.4 mean=0 variance=1 0.35 0.3 0.25 0.2 p(x|m,s) 0.15 0.1 0.05 0 −4 −3 −2 −1 0 1 2 3 4 x ASR Lectures 4&5 Hidden Markov Models and Gaussian Mixture Models 15 Properties of the Gaussian distribution 1 −(x − µ)2 N (x; µ, σ2) = p exp 2πσ2 2σ2 pdfs of Gaussian distributions 0.4 mean=0 variance=1 0.35 mean=0 0.3 variance=2 0.25 0.2 p(x|m,s) 0.15 mean=0 variance=4 0.1 0.05 0 −8 −6 −4 −2 0 2 4 6 8 x ASR Lectures 4&5 Hidden Markov Models and Gaussian Mixture Models 16 Parameter estimation Estimate mean and variance parameters of a Gaussian from data x1; x2;:::; xT Use the following as the estimates: T 1 X µ^ = x (mean) T t t=1 T 1 X σ^2 = (x − µ^)2 (variance) T t t=1 ASR Lectures 4&5 Hidden Markov Models and Gaussian Mixture Models 17 Exercise | maximum likelihood estimation (MLE) Consider the log likelihood of a set of T training data points fx1;:::; xT g being generated by a Gaussian with mean µ and variance σ2: T 2 2 1 X (xt − µ) 2 L = ln p(fx1;:::; x gjµ, σ ) = − − ln σ − ln(2π) T 2 σ2 t=1 T 1 X 2 T 2 T = − (xt − µ) − ln σ − ln(2π) 2σ2 2 2 t=1 By maximising the the log likelihood function with respect to µ show that the maximum likelihood estimate for the mean is indeed the sample mean: T 1 X µ = x : ML T t t=1 ASR Lectures 4&5 Hidden Markov Models and Gaussian Mixture Models 18 The multivariate Gaussian distribution T The D-dimensional vector x = (x1;:::; xD ) follows a multivariate Gaussian (or normal) distribution if it has a probability density function of the following form: 1 1 p(xjµ; Σ) = exp − (x − µ)T Σ−1(x − µ) (2π)D=2jΣj1=2 2 T The pdf is parameterized by the mean vector µ = (µ1; : : : ; µD ) 0 σ11 : : : σ1D 1 .

Load more