Speech Detection Using High Order Statistics (HOS)

Juraj Kačur 1), Matúš Jurečka 2)
1) Slovak University of Technology, Faculty of Electrical Engineering and Information Technology, Department of Telecommunications, Ilkovičová 3, Bratislava, Slovakia, e-mail: [email protected], phone: +421-2-68279416
2) Slovak University of Technology, Faculty of Electrical Engineering and Information Technology, Department of Telecommunications, Ilkovičová 3, Bratislava, Slovakia, e-mail: [email protected]

Abstract
In this article we discuss one of many possible applications of high order statistics (HOS) to the speech detection problem. As a speech feature we used skewness, i.e. the third-order cumulant with two zero lags, in a simple voiced speech detection algorithm. We present its properties and compare them with those of the energy feature in symmetric noises, in speech signals and at various signal to noise ratios (SNR). These results proved the skewness feature to be much more suitable than energy for detection purposes, given those environment settings. Furthermore, we executed detection accuracy tests, which showed skewness outperforming the classical approach, especially at low SNRs.

Key words: speech detection, WSS noises, symmetric noises, HOS, cumulants, skewness.

1. Introduction
Speech detection algorithms can be found in many different signal or speech processing applications, usually as parts of complex systems. These algorithms are widely employed, especially in the following areas: speech compression and transmission, speech recognition, speech enhancement, medicine, etc. Although many of them exist, there is as yet no universal detection algorithm that works reliably in all possible noises and settings. As the requirements on their performance keep growing, it is worth exploring this area more and more thoroughly.

The task of speech detection consists of labelling the end points of words uttered in isolation, or of marking the active segments of a speech signal. Finding word boundaries in continuous speech is much more difficult and can be done only via a speech recognition process. Another detection problem is posed by speech itself, which contains high energy vowels of various lengths as well as unvoiced consonants of low energy and higher frequency components; even intervals of silence are regular parts of speech. The situation deteriorates dramatically if some background noise is mixed with the useful signal. It is important to stress that noises can have almost any spectral and time characteristics, which sometimes makes their separation from speech almost impossible.

The process of speech detection includes: finding proper speech features that distinguish speech from noise, evaluating their behaviour in time, and a decision taking algorithm. All these stages are closely related to each other and will be presented chronologically.

2. High Order Statistic
2.1 Introduction & Theory
The decision to use some HOS parameters in our speech detection algorithm was based on the outstanding properties and features of HOS techniques compared with other classical approaches. Although HOS has been investigated for 40 years, there is still a lot to be explored and tested, notably in real speech processing applications.

HOS is a non-linear statistical branch of signal processing. It provides a new, more integrated view and description of stochastic signals. Commonly used features like intensity, energy, and various spectral techniques can be expressed in terms of HOS as first and second order statistic features. These important and well-known parameters suppress some of the information (such as phase), which may cause ambiguity in some applications. These drawbacks can be eliminated by using HOS, like third and fourth order moments and cumulants and their spectra: the bispectrum and the trispectrum respectively.

High order moments can be expressed using the moment generating function as follows:

$$\mathrm{Mom}[x_1^{k_1},\dots,x_n^{k_n}] = E\{x_1^{k_1}\cdots x_n^{k_n}\} = (-j)^r \left.\frac{\partial^r \Phi(\omega_1,\dots,\omega_n)}{\partial\omega_1^{k_1}\cdots\partial\omega_n^{k_n}}\right|_{\omega_1=\dots=\omega_n=0} \tag{1}$$

where r = k1 + ... + kn and the moment generating function is the Fourier transform of the probability density function (pdf) of the given stochastic process (2), taken here with a positive exponent so that the factor (-j)^r in (1) applies:

$$\Phi(\omega_1,\omega_2,\dots,\omega_n) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \mathrm{pdf}(x_1,\dots,x_n)\, e^{\,j(\omega_1 x_1+\dots+\omega_n x_n)}\, dx_1 \cdots dx_n \tag{2}$$

Even though these moments definitely provide us with additional information, they do not necessarily exhibit properties suitable for further processing of stochastic signals. Therefore a new set of features, called cumulants, was derived. They are defined according to (3):

$$\mathrm{Cum}[x_1^{k_1},\dots,x_n^{k_n}] = (-j)^r \left.\frac{\partial^r \ln \Phi(\omega_1,\dots,\omega_n)}{\partial\omega_1^{k_1}\cdots\partial\omega_n^{k_n}}\right|_{\omega_1=\dots=\omega_n=0} \tag{3}$$

This set of parameters exhibits interesting properties, which make them a very powerful tool for signal processing. The most important ones, which are also of direct use to our application, are listed next:

• If the sets of random variables x1,..,xn and y1,..,yn are mutually independent, then the following equality (4) holds:

$$\mathrm{Cum}(x_1+y_1, x_2+y_2, \dots, x_n+y_n) = \mathrm{Cum}(x_1,x_2,\dots,x_n) + \mathrm{Cum}(y_1,y_2,\dots,y_n) \tag{4}$$

• If the random variables can be divided into any two or more statistically independent sets, then their joint cumulant is identically equal to zero.
• Let a1, a2,..,an be constants, then:

$$\mathrm{Cum}(a_1 x_1, a_2 x_2, \dots, a_n x_n) = a_1 a_2 \cdots a_n\, \mathrm{Cum}(x_1,x_2,\dots,x_n) \tag{5}$$

• Any random vector that is jointly Gaussian has all its cumulants of order three and higher identically equal to zero.
• Odd order cumulants of processes with symmetrical pdfs are equal to zero.
• Cumulants of order higher than 1 are shift invariant, i.e. they do not change when a constant is added to the variables.

Other interesting properties can be found in [1]. The definition given by (3) is somewhat awkward and is practically out of use. Usually, the real cumulants' calculations for stationary processes are done by (6). Let

$$m_n(\tau_1,\tau_2,\dots,\tau_{n-1}) = E\{x(k)\,x(k+\tau_1)\cdots x(k+\tau_{n-1})\}$$

then

$$\begin{aligned}
c_1 &= m_1 \quad \text{(mean)}\\
c_2(\tau_1) &= m_2(\tau_1) - (m_1)^2\\
c_3(\tau_1,\tau_2) &= m_3(\tau_1,\tau_2) - m_1\,[\,m_2(\tau_1) + m_2(\tau_2) + m_2(\tau_2-\tau_1)\,] + 2(m_1)^3\\
c_4(\tau_1,\tau_2,\tau_3) &= m_4(\tau_1,\tau_2,\tau_3) - m_2(\tau_1)\,m_2(\tau_3-\tau_2) - m_2(\tau_2)\,m_2(\tau_3-\tau_1) - m_2(\tau_3)\,m_2(\tau_2-\tau_1)\\
&\quad - m_1\,[\,m_3(\tau_2-\tau_1,\tau_3-\tau_1) + m_3(\tau_2,\tau_3) + m_3(\tau_1,\tau_3) + m_3(\tau_1,\tau_2)\,]\\
&\quad + (m_1)^2\,[\,m_2(\tau_1) + m_2(\tau_2) + m_2(\tau_3) + m_2(\tau_3-\tau_1) + m_2(\tau_3-\tau_2) + m_2(\tau_2-\tau_1)\,] - 6(m_1)^4
\end{aligned} \tag{6}$$

Zero lag cumulants, i.e. τn = 0, have statistical meanings and names, and can be related to moments by (7), provided the processes are zero mean:

$$\begin{aligned}
E\{x(k)^2\} &= c_2(0) && \text{variance}\\
E\{x(k)^3\} &= c_3(0,0) && \text{skewness}\\
E\{x(k)^4\} - 3\,c_2(0)^2 &= c_4(0,0,0) && \text{kurtosis}
\end{aligned} \tag{7}$$

Combining those properties together, we can easily see their great potential in signal processing, for example in system identification, Gaussian noise suppression, detection and classification of various non-linear distortions, and many more. Detailed information can be found in [1], [2], [3].

2.2 Real life application of HOS
In the previous section the definitions and basic properties of HOS were given. However, in real applications we must not forget its limitations. These are mainly related to the calculation (estimation) of cumulants from real data sets. Real data are always limited in size, which inevitably leads to a certain estimation dispersion (due to the approximation of the mean operation). This is just the case for speech, which is highly non-stationary. Therefore speech must be segmented into intervals which can be regarded as stationary and thus can be easily described statistically. For speech signals those intervals are usually about 10-30 ms long. Obviously, the more parameters we are to estimate, the more data we need, which conflicts with the short intervals available in real life. Note that higher order moments or cumulants require many more parameters to be estimated than the mean or the covariance sequence does.

Furthermore, when only one realization of the stochastic process is present (almost always), we have to resort to the assumption that the examined process is ergodic, which is usually not exactly the case. Ergodicity is a strict condition (if a process is ergodic it must also be stationary; the reverse implication does not hold), which enables us to obtain various statistical characteristics of the tested stochastic process by applying time averages to the samples of a single realization. Otherwise, we would need many realizations and would have to average over them rather than over time, which is usually unfeasible. Unfortunately, there is no exact test for ergodicity, so the decision usually relies on the designer's experience. Speech signals are regarded as "ergodic enough" within their stationary intervals. To relax the strict conditions and limitations mentioned above, a key role is played by more sophisticated estimation methods. Descriptions and explanations of proper calculation methods that are unbiased, consistent and reduce estimation dispersion can be found in [1], [3], [5].

As we decided to use the third order cumulants, let us show how their real calculations (estimations) are usually done. Using the last property of cumulants listed in section 2.1 and the computation formula (6) for c3(τ1,τ2), the process is as follows [6]. Let X(k) be the input signal of length N samples. This signal is divided into K segments of the same length M, and from each such segment its mean is subtracted. Then the partial cumulants (in fact moments, since the mean is zero) are calculated according to (8):

$$m_3^{\,i}(\tau_1,\tau_2) = \frac{1}{M}\sum_{l=l_1}^{l_2} \hat{X}^i(l)\,\hat{X}^i(l+\tau_1)\,\hat{X}^i(l+\tau_2), \qquad
\begin{aligned}
l_1 &= \max(0,\,-\tau_1,\,-\tau_2),\\
l_2 &= \min(M-1,\; M-1-\tau_1,\; M-1-\tau_2),
\end{aligned}
\qquad \tau_1 < L,\ \tau_2 < L,\ i = 1,2,\dots,K \tag{8}$$

where the index l is counted from the start of the i-th segment and

$$\hat{X}^i(l) = \begin{cases}
X\big((i-1)M + l\big) - \dfrac{1}{M}\displaystyle\sum_{j=0}^{M-1} X\big((i-1)M + j\big), & 0 \le l < M\\[2ex]
0, & \text{otherwise}
\end{cases} \tag{9}$$

Finally, a consistent and asymptotically unbiased estimate of the third order cumulants from real data is given by (10):

$$\hat{c}_3(\tau_1,\tau_2) = \frac{1}{K}\sum_{i=1}^{K} m_3^{\,i}(\tau_1,\tau_2) \tag{10}$$

3 Detection algorithm
We based our detection algorithm on the skewness parameter (c3(0,0)), assuming symmetric pdfs of the noise generating processes, an assumption that many real life applications satisfy. This assumption proved to be correct for all available and tested noises appearing in our experiments. Theoretically this parameter should be equal to zero in those noises; due to approximations and estimation dispersion it is not, though it remains small enough. This feature is also negligibly small in unvoiced speech segments, but it increases in voiced segments, especially in vowels, even if those are corrupted by symmetric noises. This can be due to the slightly asymmetric pdfs of vowels and their high intensity, which emphasizes even small discrepancies. In addition, transition processes between phonemes may further raise the skewness inside active speech segments. The scheme of the detection algorithm is shown in figure 1.

Figure 1 A block diagram of the proposed voice activity detection system for WSS noises, using skewness. (Blocks: input x(n), segmentation, skewness calculation, threshold calculation, detection speech/noise.)
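As a rough illustration of the estimation procedure in (8)-(10), the following Python sketch averages per-segment third-order moments of the mean-removed signal. The function name `third_order_cumulant` and its parameters are hypothetical and are not taken from the paper or from [3]. The summation limits are assumed to be those that keep all three indices inside a segment, and samples that do not fill a whole segment are assumed to be dropped.

```python
def third_order_cumulant(x, seg_len, tau1=0, tau2=0):
    """Estimate c3(tau1, tau2) by averaging per-segment third-order moments.

    x       : sequence of samples (hypothetical input name)
    seg_len : segment length M
    """
    if seg_len <= 0:
        raise ValueError("seg_len must be positive")
    num_segs = len(x) // seg_len  # K; an incomplete trailing segment is dropped (assumption)
    if num_segs == 0:
        raise ValueError("signal is shorter than one segment")

    # Summation limits that keep l, l + tau1 and l + tau2 inside the segment.
    lo = max(0, -tau1, -tau2)
    hi = min(seg_len - 1, seg_len - 1 - tau1, seg_len - 1 - tau2)

    total = 0.0
    for i in range(num_segs):
        seg = x[i * seg_len:(i + 1) * seg_len]
        mean = sum(seg) / seg_len
        centred = [s - mean for s in seg]          # mean removal, cf. (9)
        acc = 0.0
        for l in range(lo, hi + 1):
            acc += centred[l] * centred[l + tau1] * centred[l + tau2]
        total += acc / seg_len                     # per-segment moment, cf. (8)
    return total / num_segs                        # average over segments, cf. (10)
```

With `tau1 = tau2 = 0` the function returns an estimate of the skewness feature c3(0,0); for a signal with a symmetric amplitude distribution the result should stay close to zero.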
First, the signal is segmented into blocks of 40 ms, a compromise between reliable estimation and the limited length of stationary speech intervals. Neighbouring blocks overlap by 25 ms, which keeps the course of c3(0,0) stable and smooth (desirable for the decision taking algorithm). From each segment c3(0,0) is calculated following the process presented in section 2.2, applying (9), (8) and (10) in successive steps and setting τ1 and τ2 to zero. The last stage is the decision taking algorithm, which operates on a simple threshold rule and is additionally refined by a method testing the rising or falling character of the course of skewness close to the threshold value. The threshold was set in compliance with several observations of c3(0,0) in different noises, also taking into account its usual values in vowels. However, no experiments aimed at optimizing its value were executed.

4 Experiments and results
Various tests were executed using a set of 36 specially chosen Slovak words, each of which was uttered in isolation, i.e. there was an interval of silence of at least 400 ms before and after the word. Each word was stored in a separate sound file with 8 kHz sampling frequency and 16 bit accuracy. These words were artificially corrupted by 5 real wide sense stationary noises, which sounded different from one another, and by one generated uniformly distributed white noise. Each word was corrupted by all of those noises at the following SNRs: 0, 3, 7, 10, 13 and 17 dB. Thus we created enough test samples to provide reliable results.

Following the description of our detection algorithm, it is clear that it is focused mainly on the localization of vowels. Thus all experiments and comparisons were carried out on them. As the energy parameter is simple, common, similar to skewness (in the sense of calculation) and a good indicator of the location of vowels in speech, we related all our comparisons to it.
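The block segmentation and threshold rule of section 3, together with noise mixing at a chosen SNR, might be sketched in Python as below. This is only a minimal sketch under stated assumptions: all function names are hypothetical, each 40 ms block is treated as a single segment (K = 1), the threshold is applied to the absolute value of c3(0,0), the slope-based refinement near the threshold is omitted, and the SNR is assumed to be measured over the whole signal, which the text does not specify.

```python
import math


def frames(x, fs, frame_ms=40, overlap_ms=25):
    """Split x into overlapping blocks (40 ms long, 25 ms overlap by default)."""
    size = int(fs * frame_ms / 1000)
    hop = size - int(fs * overlap_ms / 1000)
    return [x[s:s + size] for s in range(0, len(x) - size + 1, hop)]


def skewness_c3(frame):
    """Zero-lag third-order cumulant c3(0,0) of one block, mean removed first."""
    mean = sum(frame) / len(frame)
    return sum((v - mean) ** 3 for v in frame) / len(frame)


def energy(frame):
    """Mean squared value of one block, the reference feature."""
    return sum(v * v for v in frame) / len(frame)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power matches snr_db, then add it.

    Powers are taken over the whole signal (an assumption of this sketch).
    """
    p_noise = energy(noise)
    if p_noise == 0:
        raise ValueError("noise has zero power")
    gain = math.sqrt(energy(speech) / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def detect(x, fs, threshold):
    """Flag blocks whose |c3(0,0)| exceeds a fixed threshold (assumed rule)."""
    return [abs(skewness_c3(f)) > threshold for f in frames(x, fs)]
```

At 8 kHz the defaults give blocks of 320 samples advanced by 120 samples. The threshold value itself has to be chosen from observations of c3(0,0) in the noises of interest, as described in section 3.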
Figure 2 shows the course of a noisy Slovak word, accompanied by the course of c3(0,0) with both the detected and the real vowel boundaries marked. In general we made two sets of experiments. One aimed to highlight the sensitivity of skewness to the presence of vowels in noisy environments, and the other tested the overall accuracy of vowel localization. Because we based our detection on a simple threshold of the skewness value, a suitable comparison was made via the mean values of c3(0,0) in the noisy vowels and in the noise only, against the energy feature in the same settings. As skewness and energy have completely different values (energy was about 10^3 times higher than c3(0,0)), we had to use ratios of those features so that the comparisons were fair. Figure 3 depicts the ratios of the averaged values of cumulants c3(0,0) in the noisy vowels and in the noise only. The same is also done for energy and for all available noises. In figure 4 the average detection error over all words and noises is shown as a function of SNR for both the skewness based detection and the energy based one.

Figure 2 A) Digitized Slovak word "trojstup" corrupted by the white noise, SNR = 3 dB. B) Course of skewness constructed for case A). The vertical dashed lines mark the real localization of vowels and the solid lines are the detected positions.

Figure 3 Ratios of mean values of energy and cumulants in the noisy vowels (0 dB) to the energy and cumulants in the noises only, for different noises: 1) white (uniformly distributed), 2) hair dryer, 3) microwave oven, 4) mixer, 5) TV set, 6) vacuum cleaner.

Figure 4 Mean detection error for both the energy based detection algorithm and the one using cumulants, as a function of SNR. Detection error is listed in the number of speech segments.

5 Conclusions

• Skewness (c3(0,0)) proved to be very sensitive to the presence of vowels, even if they were severely corrupted by a wide range of WSS noises. This is well documented in figure 3, where the minimal ratio of the mean value of cumulants in corrupted vowels to the mean value of cumulants in the noises only was 12 (microwave oven, SNR 0 dB), whereas the ratios constructed from energy parameters hardly reach 2.7.

• The sensitivity of c3(0,0) increases as the present noise converges to white-like noise. In this situation the ratio reached the value of 291, whereas the energy ratio was only 2.7.

• The detection error for vowels (figure 4) using the simple threshold method on cumulants also provided better results, especially at low SNR values, where the energy parameter dramatically lost its detection capability. However, at high SNR (less corrupted speech) energy almost reached the accuracy of the cumulant based approach.

• Cumulants c3(0,0) are easy to compute (instead of raising each sample to the power of 2 as for energy, each sample is raised to the power of 3).

Reference:
[1] Chrysostomos L. Nikias, Athina P. Petropulu: Higher-Order Spectra Analysis, PTR Prentice Hall, Upper Saddle River, New Jersey, 1993, ISBN 0-13-678210-8

[2] J. Fackrell: Bispectral analysis of speech signals, PhD thesis, The University of Edinburgh, 1996

[3] A. Swami, J. Mendel, Ch. Nikias: High-Order Spectral Analysis Toolbox, for use with Matlab, 1998

[4] John R. Deller, John G. Proakis, John H. Hansen: Discrete-Time Processing of Speech Signals, Macmillan Publishing Company, Englewood Cliffs, 1993

[5] Jiří Ján: Číslicová filtrace, analýza a restaurace signálů, Vysoké učení technické v Brně, 1997

[6] Martin Cooke, Steve Beet: Visual Representations of speech signals, John Wiley&Sons, West Sussex England, 1993