Comparison of Voice Activity Detection Algorithms for Wireless
Total Page:16
File Type:pdf, Size:1020Kb
Proc. IEEE Canadian Conf. Electrical, Computer Engineering (St. John’s, NF), pp. 470-473, May 1997 COMPARISON OF VOICE ACTIVITY DETECTION ALGORITHMS FOR WIRELESS PERSONAL COMMUNICATIONS SYSTEMSy Khaled El-Maleh and Peter Kabal TSP Lab oratory, Dept. of Electrical Eng. McGill University Montreal, Queb ec, Canada H3A 2A7 email: fkhaled,[email protected] ABSTRACT (VBR) co ding has b een recently adopted for CDMA- based cellular and PCS systems to enhance capacity V oice activity detection (VAD) algorithms havebe- b y reducing interference. A V AD device is an indis- come an in tegral part of many of the recently stan- p ensable part of any VBR co dec as it controls b oth the dardized wireless cellular and P ersonal Communica- a verage bit rate, and the o v erall qualit y of the co der tions Systems (PCS). In this pap er, we presentacom- [3]. parative study of the p erformance of three recently pro- This pap er is organized in ve parts. Section 2 posed VAD algorithms under various acoustical back- starts with a general review of the basics of a V AD ground noise conditions. We also prop ose new ideas to design. It then describ es, in some detail, three selected enhance the p erformance of a VAD algorithm in wire- V AD designs for the this study. Section 3 rep orts the less PCS sp eech applications. results of a simulation study that has b een p erformed to study the p erformance of the three VADs under var- 1. INTRODUCTION ious background noise conditions. We then discuss and analyze the simulation results in Section 4. Finally, Conv ersational sp eech is a sequence of consecutive seg- conclusions are presented in Section 5. ments of silence and sp eech. In wireless telephony,the user is often roaming and th us encountering di erent 2. VOICE ACTIVITY DETECTION types and levels of background acoustical noises. This ALGORITHMS bac kground noise which contaminates the signal results in either noise only or sp eech plus noise segments. In The basic principle of a VAD device is that it extracts manyspeech pro cessing applications, it is desirable to some measured features or quantities from the input detect sp eech in the noise. This pro cess is called voice signal and then compare these values with thresholds, activit y detection (VAD) [1]. The VAD op eration can usually extracted from noise only p erio ds. Voice ac- be viewed as a decision problem in which the detec- tivity(VAD=1) is declared if the measured values ex- tor decides b etw een noise only,orofspeech plus noise. ceed the thresholds. Otherwise, no sp eec h activityor This is a challenging problem in noisy acoustical envi- noise (VAD=0) is present. What generally character- ronments. izes a VAD design is the wa y it selects its features, and V oice activity detection is used in a variet y of sp eech the wa y it de nes and up dates the thresholds. In gen- communication systems such as sp eech co ding, sp eech eral, a V AD algorithm outputs a binary decision in a recognition, hands-free telephony , audio-conferencing, frame-by-frame basis where a \frame" of the input sig- and echo cancellation. The recently prop osed multiple- nal is a short unit of time such as 20{40 ms. Accuracy, access sc hemes, such as CDMA, and enhanced TDMA robustness to noise conditions, simplicity , adaptation, for cellular and PCS systems use some form of V AD and real-time pro cessing are some of the required fea- [1]. Moreov er, in GSM-based wireless systems a VAD tures of a go o d VAD. mo dule is used for discontinuous transmission to sav e In the early V AD algorithms, short-time energy, the battery life of p ortable units [2]. V ariable bit rate zero crossing rate, and LPC co ecients w ere among the common features used in the detection pro cess [4]. yThis work w as supp orted b y a grant from the Canadian Cepstral features [5], formant shap e [6], a least-square Institute for T elecommunications Research under the NCE pro- gram of the Gov ernment of Canada p erio dicity measure [7] are some of the recent ideas in VAD designs. 2.3. The TOS VAD In this work, we consider three recently prop osed Symmetrically distributed (non-skewed) pro cesses are VAD algorithms. These include the VAD used in the characterized to have a third-order cumulant (TOC) GSM cellular system [1,2], the VAD used in the en- that is identically zero at all lags. However, sp eechsig- hanced variable rate co dec (EVRC) of the North Amer- nals have b een observed exp erimentally to be skewed ican CDMA-based PCS and cellular systems [3], and a enough to pro duce signi cantly non-zero TOC at all third-order statistics (TOS)-based VAD [8]. lags. Under the assumption that many noises can be mo deled as Gaussian or symmetrically distributed pro- 2.1. The GSM VAD cesses, it is p ossible to discriminate sp eech from noise. In [8], a novel time domain Gaussianity test is used in In the GSM VAD, an adaptive noise-suppressor lter the sp eech detection pro cess. The test statistic of this ^ is used to lter the input signal frame. The co ecients VAD, d is de ned as of the lter are computed during noise-only p erio ds. 1 t ^ ^ The energy of the ltered signal is compared to a noise- d =^c C c^ : (1) 3y 3y 0 dep endent threshold. As b oth the lter co ecients and In this equation, c^ is the third-order cumulant 3y the threshold are computed during noise-only frames, ^ of a given frame, and C is the covariance matrix of 0 sp ecial measures are taken to identify noise frames. the TOC estimated from R initial noise-only frames. These include b oth signal stationarity and p erio dicity ^ Sp eech is detected if d exceeds a selected threshold, tests. The ma jor weakness of this VAD lies on the sta- otherwise the frame contains noise. One ma jor feature tionarity assumption of background noise. This is not of this VAD is that it has a xed noise-indep endent always the case for many of the commonly encountered 2 threshold, T given as ( ), where is a pre-selected Q noises in wireless telephony. probability of false alarm (P )andQ is the number of F Toimprove the p erformance of the GSM VAD for lags used in the TOC computation. The value of the both stationary and non-stationary noises, Srinivasan 2 threshold is obtained from the chi-square ( ) table Q and Gersho [1] prop osed several new features to the ba- [8]. sic VAD design. These include a multi-band (4 bands) energy comparison, sp ectral atness measurement, and 3. SIMULATION RESULTS using the fraction of the energy of the low frequency band. This improved GSM VAD is more powerful as We have implemented the aforementioned VAD algo- it relies on multiple-thresholds to make the nal deci- rithms and tested their p erformance for di erent noise sion. Some of these thresholds are determined empiri- environments and at various noise levels. For the pur- cally and the others are dynamically up dated based on pose of this study, we have recorded several acousti- signal measurements. cal environmental noises (bus, street, restaurant) and used some noise signals from the NOISEX-92 database (car noise, babble) [9]. Background noise was digi- 2.2. The EVRCVAD tally added to clean sp eech with SNR values of 20, 10, and 0 dB. The p erformance of a given VAD al- The EVRC co der [3] uses a 3-rate determination algo- gorithm is a function of b oth the noise level (SNR) and rithm (RDA) to select the appropriate rate and co ding the structure of the background noise (stationary, non- strategy for each input frame. The lowest rate signi- stationary, white, or p erio dic). In Figures 1{8, weshow es a noise-only frame. For our comparative study, on each gure the binary output of eachVAD sup er- we have changed this RDA to output a binary VAD imp osed on a noisy sp eech signal. Due to the space ag. The basic idea of this VAD is similar to the GSM limitations, we show the VAD results only for a high VAD or its improved version. However, the novel part SNR (20 dB) and for a very noisy environment (0 dB). of this VAD is its dynamic up dating of the thresholds It is common in mo dern VAD algorithms to use a in a way that cop es with di erent background noise `hangover' p erio d of few frames to delayany pre-mature environments and conditions. The sp ectrum of the in- transition from sp eech to noise [1,2,3]. This is to min- put signal is divided into two bands and the energy in imize the probability of missing sp eech esp ecially for each band is compared against two thresholds. Sp eech low-energy unvoiced sp eech. These hangover mecha- is detected if the energy in each band is greater than nisms are generally not e ective in correcting isolated the corresp onding lowest threshold. The thresholds are VAD errors (i.e `one' among a sequence of zeros or vice scaled versions of estimated sub-band noise energies versa). For manyVAD applications (esp ecially sp eech from previous frames.