Proc. IEEE Canadian Conf. Electrical, Computer Engineering (St. John’s, NF), pp. 470-473, May 1997

COMPARISON OF VOICE ACTIVITY DETECTION ALGORITHMS FOR WIRELESS PERSONAL COMMUNICATIONS SYSTEMS†

Khaled El-Maleh and Peter Kabal

TSP Laboratory, Dept. of Electrical Eng.

McGill University

Montreal, Quebec, Canada H3A 2A7

email: fkhaled,[email protected]

ABSTRACT

Voice activity detection (VAD) algorithms have become an integral part of many of the recently standardized wireless cellular and Personal Communications Systems (PCS). In this paper, we present a comparative study of the performance of three recently proposed VAD algorithms under various acoustical background noise conditions. We also propose new ideas to enhance the performance of a VAD algorithm in wireless PCS speech applications.

† This work was supported by a grant from the Canadian Institute for Telecommunications Research under the NCE program of the Government of Canada.

1. INTRODUCTION

Conversational speech is a sequence of consecutive segments of silence and speech. In wireless communications, the user is often roaming and thus encounters different types and levels of background acoustical noise. This background noise, which contaminates the signal, results in either noise-only or speech-plus-noise segments. In many speech processing applications, it is desirable to detect speech in the noise. This process is called voice activity detection (VAD) [1]. The VAD operation can be viewed as a decision problem in which the detector decides between noise only and speech plus noise. This is a challenging problem in noisy acoustical environments.

Voice activity detection is used in a variety of speech communication systems such as speech coding, speech recognition, hands-free telephony, audio-conferencing, and echo cancellation. The recently proposed multiple-access schemes, such as CDMA and enhanced TDMA, for cellular and PCS systems use some form of VAD [1]. Moreover, in GSM-based wireless systems a VAD module is used for discontinuous transmission to save the battery life of portable units [2]. Variable bit rate (VBR) coding has been recently adopted for CDMA-based cellular and PCS systems to enhance capacity by reducing interference. A VAD device is an indispensable part of any VBR codec as it controls both the average bit rate and the overall quality of the coder [3].

This paper is organized in five parts. Section 2 starts with a general review of the basics of a VAD design. It then describes, in some detail, the three VAD designs selected for this study. Section 3 reports the results of a simulation study of the performance of the three VADs under various background noise conditions. We then discuss and analyze the simulation results in Section 4. Finally, conclusions are presented in Section 5.

2. VOICE ACTIVITY DETECTION ALGORITHMS

The basic principle of a VAD device is that it extracts some measured features or quantities from the input signal and then compares these values with thresholds, usually extracted from noise-only periods. Voice activity (VAD=1) is declared if the measured values exceed the thresholds. Otherwise, no speech activity, i.e., noise only (VAD=0), is declared. What generally characterizes a VAD design is the way it selects its features and the way it defines and updates the thresholds. In general, a VAD algorithm outputs a binary decision on a frame-by-frame basis, where a "frame" of the input signal is a short unit of time such as 20–40 ms. Accuracy, robustness to noise conditions, simplicity, adaptation, and real-time processing are some of the required features of a good VAD.

In the early VAD algorithms, short-time energy, zero crossing rate, and LPC coefficients were among the common features used in the detection process [4]. Cepstral features [5], formant shape [6], and a least-square periodicity measure [7] are some of the recent ideas in
VAD designs.

In this work, we consider three recently proposed VAD algorithms. These include the VAD used in the GSM cellular system [1,2], the VAD used in the enhanced variable rate codec (EVRC) of the North American CDMA-based PCS and cellular systems [3], and a third-order statistics (TOS)-based VAD [8].

2.1. The GSM VAD

In the GSM VAD, an adaptive noise-suppressor filter is used to filter the input signal frame. The coefficients of the filter are computed during noise-only periods. The energy of the filtered signal is compared to a noise-dependent threshold. As both the filter coefficients and the threshold are computed during noise-only frames, special measures are taken to identify noise frames. These include both signal stationarity and periodicity tests. The major weakness of this VAD lies in its assumption that the background noise is stationary. This is not always the case for many of the commonly encountered noises in wireless telephony.

To improve the performance of the GSM VAD for both stationary and non-stationary noises, Srinivasan and Gersho [1] proposed several new features for the basic VAD design. These include a multi-band (4 bands) energy comparison, a spectral flatness measurement, and the fraction of the energy in the low-frequency band. This improved GSM VAD is more powerful as it relies on multiple thresholds to make the final decision. Some of these thresholds are determined empirically and the others are dynamically updated based on signal measurements.

2.2. The EVRC VAD

The EVRC coder [3] uses a 3-rate determination algorithm (RDA) to select the appropriate rate and coding strategy for each input frame. The lowest rate signifies a noise-only frame. For our comparative study, we have changed this RDA to output a binary VAD flag. The basic idea of this VAD is similar to the GSM VAD or its improved version. However, the novel part of this VAD is its dynamic updating of the thresholds in a way that copes with different background noise environments and conditions. The spectrum of the input signal is divided into two bands and the energy in each band is compared against two thresholds. Speech is detected if the energy in each band is greater than the corresponding lowest threshold. The thresholds are scaled versions of estimated sub-band noise energies from previous frames. For more details about the VAD implementation, see [3].

2.3. The TOS VAD

Symmetrically distributed (non-skewed) processes are characterized by a third-order cumulant (TOC) that is identically zero at all lags. However, speech signals have been observed experimentally to be skewed enough to produce significantly non-zero TOC at all lags. Under the assumption that many noises can be modeled as Gaussian or symmetrically distributed processes, it is possible to discriminate speech from noise. In [8], a novel time domain Gaussianity test is used in the speech detection process. The test statistic of this VAD, $\hat{d}$, is defined as

$$\hat{d} = \hat{c}_{3y}^{\,t} \, \hat{C}_0^{-1} \, \hat{c}_{3y}. \qquad (1)$$

In this equation, $\hat{c}_{3y}$ is the third-order cumulant of a given frame, and $\hat{C}_0$ is the covariance matrix of the TOC estimated from $R$ initial noise-only frames. Speech is detected if $\hat{d}$ exceeds a selected threshold; otherwise the frame contains noise. One major feature of this VAD is that it has a fixed noise-independent threshold, $T$, given as $\chi^2_Q(\alpha)$, where $\alpha$ is a pre-selected probability of false alarm ($P_F$) and $Q$ is the number of lags used in the TOC computation. The value of the threshold is obtained from the chi-square ($\chi^2_Q$) table [8].

3. SIMULATION RESULTS

We have implemented the aforementioned VAD algorithms and tested their performance for different noise environments and at various noise levels. For the purpose of this study, we have recorded several acoustical environmental noises (bus, street, restaurant) and used some noise signals from the NOISEX-92 database (car noise, babble) [9]. Background noise was digitally added to clean speech with SNR values of 20, 10, and 0 dB. The performance of a given VAD algorithm is a function of both the noise level (SNR) and the structure of the background noise (stationary, non-stationary, white, or periodic). In Figures 1–8, we show on each figure the binary output of each VAD superimposed on a noisy speech signal. Due to space limitations, we show the VAD results only for a high SNR (20 dB) and for a very noisy environment (0 dB).

It is common in modern VAD algorithms to use a "hangover" period of a few frames to delay any premature transition from speech to noise [1,2,3]. This is to minimize the probability of missing speech, especially low-energy unvoiced speech. These hangover mechanisms are generally not effective in correcting isolated VAD errors (i.e., a "one" among a sequence of zeros or vice versa). For many VAD applications (especially speech coding), it is desirable to clean up such errors. We
have developed an isolated error correction mechanism (IECM) that significantly corrects the VAD decision in a way that makes it more useful to speech applications. The basic idea of the IECM is that we delay the decision by 2 to 3 frames to monitor the VAD decisions in neighboring frames. If the current frame VAD decision is different from its close neighbors, then its VAD flag is changed to be similar to the other frames. This is repeated for each frame to remove any isolated errors. In Figure 7, we show the effectiveness of this algorithm in enhancing the VAD results.

4. DISCUSSION

In this paper, the simulation results shown in Figures 1–6 are the VAD decisions after isolated error correction. The results show a consistent superiority of the EVRC VAD in detecting speech for almost all types of noise and even for very low SNR. However, it occasionally detects noise as speech (false alarm), especially for babble (simultaneous background conversations) noise as the SNR gets low. The TOS VAD is ranked second overall in performance and shows almost-perfect detection results for babble noise at 0 dB.

The GSM VAD exhibits good performance under stationary noise environments, while it has difficulty distinguishing speech from noise in non-stationary noises such as bus, babble, and street noise. Also, its performance deteriorates for low SNR (below 20 dB). The GSM VAD results clearly improve with the modifications suggested in [1], but the robustness of the VAD is still not guaranteed at low SNRs. We have observed that high-energy voiced speech segments are always detected by all VADs under very noisy conditions. However, low-energy unvoiced speech is commonly missed.

Conventionally, the input to the VAD is the conversational speech signal. The linear prediction (LP) residual has been used before as a tool in voicing decision algorithms to classify speech as voiced or unvoiced. In this work, we have also evaluated using the LP residual as the input signal to the VAD algorithm. Thus a standard linear prediction analysis is done first and then the output of the LP analysis filter (residual signal) is fed to the VAD to make the binary decision. The results show that the accuracy of the VAD decisions has been improved in almost all cases when we used the LP residual instead of directly using the input signal (see Figure 8).

5. CONCLUSIONS

We have presented in this paper an experimental comparative study of three VAD algorithms under different background noise conditions. The results show a consistent superiority of both the EVRC and the TOS VADs when compared with the GSM-based VADs. We have also shown that VAD decisions were improved by using the proposed isolated error correction mechanism and the LP residual as the input signal to the VAD.

6. ACKNOWLEDGMENTS

We would like to thank B. V. Nguyen, P. Lim and C. M. Leung for participating in an early stage of this study.

7. REFERENCES

[1] K. Srinivasan and A. Gersho, "Voice activity detection for cellular networks," in Proc. of the IEEE Speech Coding Workshop, pp. 85–86, October 1993.

[2] D. K. Freeman, G. Cosier, C. B. Southcott, and I. Boyd, "The voice activity detector for the Pan-European digital cellular mobile telephone service," in Proc. Intl. Conf. Acoust., Sp., & Sig. Proc., pp. 369–372, Glasgow, May 1989.

[3] TIA Document, PN-3292, Enhanced Variable Rate Codec, Speech Service Option 3 for Wideband Spread Spectrum Digital Systems, January 1996.

[4] L. R. Rabiner and M. R. Sambur, "Voiced-unvoiced-silence detection using the Itakura LPC distance measure," in Proc. Intl. Conf. Acoust., Sp., & Sig. Proc., pp. 323–326, May 1977.

[5] J. A. Haigh and J. S. Mason, "Robust voice activity detection using cepstral features," in IEEE TENCON, pp. 321–324, China, 1993.

[6] J. D. Hoyt and H. Wechsler, "Detection of human speech in structured noise," in Proc. Intl. Conf. Acoust., Sp., & Sig. Proc., pp. II-237–II-240, Australia, May 1994.

[7] R. Tucker, "Voice activity detection using a periodicity measure," in IEE Proceedings-I, Vol. 139, No. 4, pp. 377–380, August 1992.

[8] M. Rangoussi and G. Carayannis, "Higher order statistics based Gaussianity test applied to on-line speech processing," in Proc. of the IEEE Asilomar Conf., pp. 303–307, 1995.

[9] H. J. M. Steeneken and F. W. M. Geurtsen, "Description of the RSG.10 noise database," Report IZF 1988-3, TNO Institute for Perception, Soesterberg, The Netherlands.
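The isolated error correction mechanism described in Section 3 can be illustrated with a short sketch. This is not the authors' implementation: the function name and the `span` parameter (how many neighbors are inspected on each side, which corresponds to the decision delay in frames) are assumptions made for illustration, and the paper does not state how edge frames are handled, so they are left unchanged here.

```python
def correct_isolated_errors(flags, span=1):
    """Flip any VAD flag that disagrees with all of its close neighbors.

    flags: sequence of 0/1 frame decisions (1 = speech, 0 = noise).
    span:  hypothetical parameter, number of neighbors checked on each side;
           looking `span` frames ahead implies delaying the decision.
    """
    flags = list(flags)
    corrected = list(flags)
    for i in range(span, len(flags) - span):
        # Neighbors are read from the uncorrected decisions.
        neighbors = flags[i - span:i] + flags[i + 1:i + 1 + span]
        unanimous = all(f == neighbors[0] for f in neighbors)
        if unanimous and flags[i] != neighbors[0]:
            corrected[i] = neighbors[0]  # isolated error: follow the neighbors
    return corrected


if __name__ == "__main__":
    print(correct_isolated_errors([0, 0, 1, 0, 0, 1, 1, 0, 1, 1]))
```

With `span=1`, a single "one" inside a run of zeros (or a single zero inside a run of ones) is removed, while runs of two or more frames are kept. A hangover scheme would not remove such errors, which is the motivation given in Section 3.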

Figure 1: VAD results for car noise at 20 dB SNR.

Figure 2: VAD results for babble noise at 20 dB SNR.

Figure 3: VAD results for car noise at 0 dB SNR.

Figure 4: VAD results for babble noise at 0 dB SNR.

Figure 5: VAD results for bus noise at 0 dB SNR.

Figure 6: VAD results for street noise at 0 dB SNR.

[Figures 1–6: plots not reproduced. Each figure has four panels (GSM, iGSM, EVRC, TOS) showing the VAD results with error correction; horizontal axis: sample number (×10^4).]

Figure 7: TOS VAD: effect of isolated error correction mechanism. [Plot not reproduced: TOS VAD, car noise, 10 dB; panels "without error correction" and "with error correction"; horizontal axis: sample number (×10^4).]

Figure 8: TOS VAD: effect of input signal. [Plot not reproduced: TOS VAD, car noise, 0 dB; panels "signal domain" and "residual domain"; horizontal axis: sample number (×10^4).]
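The residual-domain input compared in Figure 8 is obtained by passing each frame through an LP analysis filter (Section 4). The following is a minimal sketch of one standard way to do this, using the autocorrelation method with the Levinson-Durbin recursion. The paper does not give the prediction order, windowing, or analysis method, so the order of 10, the absence of a window, and all function names are assumptions for illustration only.

```python
def autocorrelation(x, max_lag):
    """Biased autocorrelation r[0..max_lag] of one frame (no windowing)."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations; returns a[0..order] with a[0] = 1.

    The prediction-error (analysis) filter is A(z) = sum_j a[j] z^-j.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: stop early
            break
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, m):
            new_a[j] = a[j] + k * a[m - j]
        new_a[m] = k
        a = new_a
        err *= 1.0 - k * k
    return a


def lp_residual(frame, order=10):
    """Filter the frame with its own LP analysis filter (zero initial state)."""
    a = levinson_durbin(autocorrelation(frame, order), order)
    return [
        sum(a[j] * frame[n - j] for j in range(order + 1) if n - j >= 0)
        for n in range(len(frame))
    ]


if __name__ == "__main__":
    frame = [0.5 ** n for n in range(40)]
    print(levinson_durbin(autocorrelation(frame, 1), 1))
```

The residual returned by `lp_residual` would then be passed to the VAD in place of the frame itself. In a real coder the filter state would normally carry over between frames rather than being reset to zero as it is here.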