Factor Graph Decoding for Speech Presence Probability Estimation
ITG-Fachbericht 267: Speech Communication · 05. – 07.10.2016 in Paderborn
ISBN 978-3-8007-4275-2 © 2016 VDE VERLAG GMBH · Berlin · Offenbach

Thomas Glarner, Mohammad Mahdi Momenzadeh, Lukas Drude, Reinhold Haeb-Umbach
Department of Communications Engineering, Paderborn University, 33098 Paderborn, Germany
Email: {glarner,drude,haeb}@nt.uni-paderborn.de
Web: nt.uni-paderborn.de

Abstract

This paper is concerned with speech presence probability estimation employing an explicit model of the temporal and spectral correlations of speech. An undirected graphical model is introduced, based on a Factor Graph formulation. It is shown that this undirected model cures some of the theoretical issues of an earlier directed graphical model. Furthermore, we formulate a message passing inference scheme based on an approximate graph factorization, identify this inference scheme as a particular message passing schedule based on the turbo principle and suggest further alternative schedules. The experiments show an improved performance over speech presence probability estimation based on an IID assumption, and a slightly better performance of the turbo schedule over the alternatives.

1 Introduction

Speech Presence Probability (SPP) estimation refers to the estimation of the presence of speech in the Short Time Fourier Transform (STFT) domain at the resolution of individual time-frequency (TF) bins. SPP estimators are used, e.g., in several state-of-the-art noise tracking algorithms [1], for a-priori SNR estimation [2], or for mask estimation in acoustic beamforming [3] and therefore are an important component of modern speech enhancement systems.

SPP estimators exploit the correlations of the signal in time and/or frequency direction in some way or another. In a popular approach, the SPP estimation relies on a decision-directed estimate of the a priori signal-to-noise ratio [4], which in turn depends on the clean speech estimate, thus coupling SPP estimation and speech enhancement. Gerkmann et al. decouple the two and employ spectral and temporal smoothing of the a posteriori SNR and the SPP using heuristically chosen parameters [5], while Momeni et al. apply a sliding window directly to the complex-valued observations in the STFT domain to determine the statistics of the signal and noise, from which in turn the SPP is estimated [6].

Tran Vu and Haeb-Umbach propose a formulation that directly models the underlying statistical dependencies in time and frequency direction, employing a two-dimensional hidden Markov Model (2DHMM) [7]. For each TF-slot a latent binary variable is introduced which indicates presence or absence of speech. The temporal and spectral correlations of a speech signal are then captured by the transition probabilities of the 2D Markov grid, which can be learned in a training phase. An efficient inference algorithm has been developed which is an instance of the turbo principle known from coding literature [8].

In this paper we extend and generalize this approach in two important aspects. First, we propose an equivalent Undirected Graphical Model (UGM), i.e., a Markov Random Field (MRF), and show that this has important theoretical advantages over the directed HMM. While MRFs have already been used in speech enhancement for providing an MMSE estimator for the spectral amplitude [9], they are employed here for SPP estimation for the first time. Second, viewing the aforementioned turbo algorithm as a particular message passing scheme for inference in graphical models with loops, alternative schemes will be investigated, thus introducing general factor graph decoding as an approach to SPP estimation.

The remainder of the paper is organized as follows. In Section 2, the signal model is presented, and the SPP estimation problem is formulated. Section 3 introduces the UGM, a Factor Graph formulation and the inference algorithm. The UGM is contrasted with the Directed Graphical Model of [7] in Section 4, giving reasons why the UGM is to be preferred. In Section 5, the inference algorithm is generalized to different message passing schedules based on differing propagation paths through the graph, while Section 6 provides an experimental comparison of different scheduling types with respect to their performance on the SPP estimation task.

2 Problem Formulation

As usual, our signal model is formulated in the STFT domain and assumes an additive mixture of speech and noise. According to [7], a latent binary variable $Z(m,k)$ is introduced which indicates presence or absence of speech:

$$X(m,k) = \begin{cases} N(m,k) & \text{if } Z(m,k)=0, \\ S(m,k)+N(m,k) & \text{if } Z(m,k)=1, \end{cases} \tag{1}$$

where $1 \le m \le M$ is the STFT frame index, $1 \le k \le K$ is the frequency index, $S(m,k)$, $N(m,k)$, and $X(m,k)$ are the STFT Coefficients corresponding to the speech, noise, and observed microphone signal, respectively.

Following [5, 7] observations will be modeled in terms of the a-posteriori SNR

$$\zeta(m,k) = |X(m,k)|^2 / \hat{\Phi}_{NN}(k), \tag{2}$$

rather than of the microphone signal directly. Here, $\hat{\Phi}_{NN}(k)$ is a (coarse) estimate of the noise power spectral density. With the assumption of complex Gaussian distributed random noise, the observation likelihood is computed from an exponential distribution with a parameter depending on the latent variable:

$$p(\zeta(m,k) \mid Z(m,k)=i) = \lambda_i \exp(-\lambda_i \zeta(m,k)). \tag{3}$$

Here, $\lambda_0 = 1$ while $\lambda_1$ depends on the a-priori SNR and is set empirically.

The task of SPP estimation is the inference of the posterior probability of the latent variable:

$$\gamma(m,k) = \Pr(Z(m,k) \mid X(1{:}M, 1{:}K)). \tag{4}$$

Here, $1{:}J$ denotes that the corresponding index variable runs from 1 to $J$. Thus, in the above equation, the observations from all frames and frequencies are employed for the estimation of $Z(m,k)$.

[Figure 1: Part of the factor graph for the Undirected Model. Axes: time ($m-1$, $m$, $m+1$) and frequency ($k-1$, $k$, $k+1$); nodes $Z(m,k)$ with observations $X(m,k)$, connected by the factors $T(m,m+1;k)$ and $T(k,k+1;m)$.]

[Figure 2: Factorizations with message flow. (a) Horizontal factorization, (b) vertical factorization.]

It is assumed that the observations are conditionally independent if the latent variables are known. Therefore, all dependencies are modeled by the latent variables. The SPP can in theory be calculated by applying the Bayes Theorem, factorizing into the joint state probability and the observation likelihoods, and marginalizing over all state variables except $Z(m,k)$:

$$\Pr(Z(m,k) \mid X(1{:}M,1{:}K)) \propto \sum_{\sim Z(m,k)} \Pr(Z(1{:}M,1{:}K)) \prod_{\forall (m,k)} p(X(m,k) \mid Z(m,k)), \tag{5}$$

where the operator $\sum_{\sim x}$ is taken from [10] and represents summing over all variables except $x$. However, a direct calculation of the marginalization is not tractable in practice as it requires $2^{MK-1}$ additions where $MK$ is usually in the order of $10^5$. To handle this issue, it is necessary to constrain the statistical dependencies between the latent variables.

The following sections cover different formalisms to introduce reasonable constraints with the help of graphical models.

3 Undirected Graphical Model

In this Section an Undirected Graphical Model (UGM) is proposed to model temporal and spectral correlations in a speech signal. UGMs are also known as MRFs in the literature [11]. The Factor Graph framework is used [10] to provide a higher cohesion between the mathematical formulation and the graphical representation. This has the advantage of explicitly visualizing the statistical depen-

where $P(Z(1{:}M,1{:}K))$ is the set of all possible state pairs in $Z(1{:}M,1{:}K)$ as given by the graph, and $T(z_i,z_j)$ is the dependency factor capturing the statistical dependency between a pair of nodes. It is an undirected analog to the transition matrix entries in the 2DHMM and can be identified as a two-node potential function in the context of MRFs. Similar to [12], the dependency factors are defined as

$$\begin{aligned} T(m-1,m;k) &= \frac{\Pr(Z(m-1,k), Z(m,k))}{\Pr(Z(m-1,k)) \Pr(Z(m,k))}, \\ T(k-1,k;m) &= \frac{\Pr(Z(m,k-1), Z(m,k))}{\Pr(Z(m,k-1)) \Pr(Z(m,k))}, \end{aligned} \tag{7}$$

where $T(m-1,m;k) := T(Z(m-1,k), Z(m,k))$ is used for brevity. This still allows the interpretation as a (scaled) probability. Furthermore, the factor is a symmetric function with respect to the state variables $Z(m,k)$ and does therefore not introduce a directivity as conditional probabilities would do.

Unfortunately, the sum-product algorithm given in [10] is not directly applicable to perform inference, because the given latent state variable structure as shown by Fig. 1 exhibits a grid with tight loops. If only vertical or horizontal connections existed, the model would reduce to several independent chains and the inference problem could be solved by running a modified version of the forward-backward algorithm [13] for each chain. However, this assumption would drop the information carried in the neglected direction and is therefore discarded.

Instead, similar approximations as introduced in [7] are used: A local horizontal (i.e. time-directed) tree structure for a fixed frequency index $k$ can be obtained by neglecting all horizontal connections for other frequencies.
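
The single-chain case mentioned above (only horizontal or only vertical connections) can be illustrated with a small sketch in plain Python. It is not the turbo scheme of the paper, only sum-product message passing along one isolated chain, using the exponential likelihood of Eq. (3) and a dependency factor as in Eq. (7). The function names (`exp_likelihood`, `dependency_factor`, `chain_spp`), the joint probability table, the prior, and the value $\lambda_1 = 0.1$ are hypothetical choices made for illustration. The sketch further assumes that the chain prior factorizes into single-node priors times pairwise dependency factors, which holds for a stationary chain.

```python
import math


def exp_likelihood(zeta, lam):
    """Eq. (3): p(zeta | Z=i) = lam_i * exp(-lam_i * zeta)."""
    return lam * math.exp(-lam * zeta)


def dependency_factor(joint):
    """Eq. (7) for one pair of binary nodes.

    joint[i][j] is the pairwise joint probability Pr(Z_a=i, Z_b=j);
    the result is T[i][j] = joint[i][j] / (Pr(Z_a=i) * Pr(Z_b=j)).
    """
    p_a = [joint[i][0] + joint[i][1] for i in range(2)]
    p_b = [joint[0][j] + joint[1][j] for j in range(2)]
    return [[joint[i][j] / (p_a[i] * p_b[j]) for j in range(2)]
            for i in range(2)]


def chain_spp(zetas, prior, T, lam=(1.0, 0.1)):
    """Posterior Pr(Z_m = 1 | all zetas) on a single chain.

    Assumption (not from the paper): lam[1] = 0.1 is an illustrative
    value; the paper only states that lambda_1 is set empirically.
    """
    M = len(zetas)
    # Local factor per node: prior times observation likelihood.
    phi = [[prior[i] * exp_likelihood(z, lam[i]) for i in range(2)]
           for z in zetas]
    fwd = [[1.0, 1.0] for _ in range(M)]  # messages arriving from the left
    bwd = [[1.0, 1.0] for _ in range(M)]  # messages arriving from the right
    for m in range(1, M):
        msg = [sum(fwd[m - 1][i] * phi[m - 1][i] * T[i][j] for i in range(2))
               for j in range(2)]
        total = sum(msg)
        fwd[m] = [v / total for v in msg]  # normalize for numerical stability
    for m in range(M - 2, -1, -1):
        msg = [sum(bwd[m + 1][j] * phi[m + 1][j] * T[i][j] for j in range(2))
               for i in range(2)]
        total = sum(msg)
        bwd[m] = [v / total for v in msg]
    spp = []
    for m in range(M):
        belief = [fwd[m][i] * phi[m][i] * bwd[m][i] for i in range(2)]
        spp.append(belief[1] / (belief[0] + belief[1]))
    return spp


# Illustrative numbers only: a symmetric joint table with marginals (0.7, 0.3).
joint = [[0.6, 0.1], [0.1, 0.2]]
prior = [0.7, 0.3]
T = dependency_factor(joint)
zetas = [0.5, 4.0, 7.0, 0.2]  # a-posteriori SNR values along one chain
for m, gamma in enumerate(chain_spp(zetas, prior, T), start=1):
    print(f"frame {m}: SPP = {gamma:.3f}")
```

On a chain without loops these beliefs are exact marginals, which is why the grid structure of Fig. 1, and not the message passing itself, is what forces the approximations discussed in this section.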