JOURNAL OF CRITICAL REVIEWS

ISSN- 2394-5125 VOL 7, ISSUE 17, 2020

A REVIEW OF THE RECENT ADVANCES IN VOICE ACTIVITY DETECTION TECHNIQUES FOR SPEECH PROCESSING

Jayaprakasha Honnatteppanavar1, Nagaraja B.G.2

1Research Scholar, Visvesvaraya Technological University, Belagavi
2Professor & HOD, Dept. of Electronics & Communication Engg., Jain Institute of Technology, Davanagere

Email : [email protected], [email protected]

Received: 14 March 2020 Revised and Accepted: 8 July 2020

ABSTRACT—This paper presents a brief review of the recent advances in voice activity detection (VAD) as used in digital speech processing. The survey summarizes the research carried out in this field so that, once the related works have been analyzed, the research problem can be formulated. The article is therefore intended as a ready reckoner for any researcher who wants to begin work in this field.

KEYWORDS—Speech, Voice, Simulation, Processing, Matlab, Recognition, Phonetics, Result, VAD, Audio, Noise.

I. ORGANIZATION OF THE REVIEW/SURVEY PAPER

The paper is organized as follows. A short introduction to voice activity detection is presented in section II. The literature on voice / speech / audio processing is reviewed in greater detail in section III. A comparison of the work done by different authors is given in section IV, followed by the drawbacks of the existing works in section V. The paper concludes with brief remarks in section VI, followed by the acknowledgments and the list of references.

II. INTRODUCTION

In recent years, digital speech signal processing technology has progressed rapidly and has been applied in many fields, such as communication, multimedia and the human-machine interface. One of its core tasks is voice activity detection (VAD) [8]. VAD refers to a class of digital signal processing methods that decide whether a short segment of a captured speech signal contains voiced or unvoiced speech, with or without noise. A VAD system is essentially a set of decision rules applied to selected and estimated features of the speech signal. Such detectors are one of the main pre-processing blocks in a large number of digital speech processing applications, for example speech enhancement, speech recognition, male-female speaker identification, determination of age and sex from speech, speech synthesis, speech identification, speech density estimation, speaker and language identification, speech compression and speech transmission. In other words, VAD is a form of digital speech processing that operates on one-dimensional data, since speech is a 1D signal that conveys information. The captured signal may be noisy or de-noised. In many DSP-based speech applications, the VAD system is a front-end processing stage that separates an audio stream into time intervals that contain speech activity and intervals where speech is absent [1]. A typical VAD scheme is shown in Fig. 1, where a noisy speech signal is the input, processing is carried out by different algorithms (filtering techniques) and a noise-free speech signal is obtained as the output. A large number of studies in the literature have addressed the VAD problem.

A VAD approach consists of two successive steps, viz., a feature extraction model and a weighted discrimination model.
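The two steps can be illustrated with a minimal sketch, assuming short-time energy as the only feature and a fixed threshold as the discrimination rule. The function names, frame length, hop size and threshold below are illustrative assumptions and are not taken from any of the surveyed works.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames (a trailing partial frame is dropped)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Step 1 (feature extraction): mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def energy_vad(samples, frame_len=160, hop=80, threshold=0.01):
    """Step 2 (discrimination): a frame is flagged as speech when its energy exceeds the threshold."""
    return [short_time_energy(f) > threshold
            for f in frame_signal(samples, frame_len, hop)]

# Toy input at an assumed 8 kHz sampling rate: 0.1 s of very weak hum, then 0.1 s of a 200 Hz tone.
fs = 8000
quiet = [0.001 * math.sin(2 * math.pi * 50 * n / fs) for n in range(800)]
tone = [0.5 * math.sin(2 * math.pi * 200 * n / fs) for n in range(800)]
decisions = energy_vad(quiet + tone)
print(decisions)
```

A fixed threshold is adequate only at high SNR; the adaptive and statistical schemes discussed in the rest of the paper replace this second step.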


[Fig. 1 block diagram. Blocks: input speech signal (with noise), VAD pre-processing, feature extraction, VAD decision, VAD decision correction, output signal (w/o noise). Additional labels: threshold, filters (remove noise), computations.]

Fig. 1. A typical VAD scheme (prototype) for speech processing application – model 1

Another typical VAD schematic is shown in Fig. 2, where the voice and the noise serve as the inputs to the processing system, the output being high-quality automatic speech recognition [23].

Fig. 2. A typical VAD scheme (prototype) for speech processing application – model 2

A large number of algorithms have been developed for VAD, and their complexity varies with the type of application. Some of the standard industrial VAD algorithms currently in use are G.729, the adaptive multi-rate (AMR) VAD, the advanced front-end (AFE) and the SILK codec used in Skype. What most of these have in common is run-time estimation of the background noise: an adaptive threshold is calculated from the noise-only parts of the signal and is then used to decide whether a frame contains speech. Many industrial VADs take feature vectors as the input, apply a threshold to them and declare a frame as speech after the noise removal process. Some of these processes rely on hidden Markov models (HMMs), Gaussian mixture models (GMMs), support vector machines (SVMs), artificial neural networks (ANNs), convolutional neural networks (CNNs) and deep neural networks (DNNs). Another typical VAD model is shown in Fig. 3 [2].

[Fig. 3 block diagram. Blocks: input feature vector/s, noise model/s, threshold computations, VAD, output feature vector/s. Annotation: updation of the noise model at intermittent times.]

Fig. 3. A typical VAD scheme (prototype) for speech processing application – model 3

In general, VAD systems can be classified into three types: time-domain (TD), frequency-domain (FD) and statistical approaches (SA). A majority of the TD and FD methods are based on heuristic rules that reflect the speech production characteristics through features such as linear prediction coefficients (LPC), short-time energy (STE), zero-crossing rate (ZCR), Mel-frequency cepstral coefficients (MFCC), spectral entropy (SE) and periodicity measures (PM). Energy-based VAD approaches are straightforward and are popular in speech, voice, audio and speaker recognition applications. Statistical model-based VAD approaches include the HMM, GMM, SVM and the likelihood ratio test (LRT), and can be used for noisy speech analysis [3]. As mentioned in [12], the most important properties of any VAD are reliability, robustness, real-time processing, adaptation, accuracy and simplicity, without any a priori knowledge of the noise. Among these properties, robustness in very noisy environments is the most difficult to achieve. Under high-SNR conditions even the simplest VAD algorithm works, but at very low SNR all VAD algorithms degrade to some extent. Note also that a VAD algorithm should have very low complexity so that the computations are fast, which is essential for any real-time implementation (RTI). Hence, robustness against noise and simplicity are the two vital characteristics of any practicable VAD. Voice activity detectors are often used to identify the sections of a noisy signal that contain speech activity, and they constitute a key module in many speech processing applications, including hearing aids and cochlear implants [4].

Even though there are many use cases for speech analysis, all the algorithms developed share a common feature: they operate on a signal that is corrupted by noise (arising from different sources), and the presence of the actual speech has to be detected before the signal is processed further [11].
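Two of the heuristic features named in this section, the zero-crossing rate and the spectral entropy, can be computed per frame as in the following sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the entropy is taken over a one-sided power spectrum obtained with a naive DFT (adequate for short frames, slow for long ones), and no windowing is applied.

```python
import cmath
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (frame needs at least two samples)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def spectral_entropy(frame):
    """Shannon entropy in bits of the normalised one-sided power spectrum (naive DFT)."""
    n = len(frame)
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        power.append(abs(coeff) ** 2)
    total = sum(power)
    if total == 0:
        return 0.0  # an all-zero frame carries no spectral information
    probs = [p / total for p in power if p > 0]
    return -sum(p * math.log2(p) for p in probs)

# A pure tone concentrates its power in one bin (low entropy);
# a single impulse has a flat spectrum (maximum entropy for this frame size).
tone = [math.sin(2 * math.pi * 2 * i / 16) for i in range(16)]
impulse = [1.0] + [0.0] * 15
print(spectral_entropy(tone), spectral_entropy(impulse))
```

In a heuristic detector such features are compared against thresholds or combined in a rule; which feature separates speech from noise best depends on the noise type, which is one reason why no single feature is sufficient in practice.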


III. LITERATURE SURVEY

In this section, a brief review of the literature on voice activity detection in digital speech processing is presented, one work after another. Voice activity detection addresses the problem of distinguishing speech segments from background noise, and many approaches have been proposed for this purpose. In [1], Drugman et al. investigated the joint use of source-based and filter-based features for voice activity detection, with ANN classifiers used for training, and reported superior discrimination power for the source-related features. Nasibov and Kinnunen [2] proposed a fast algorithm for the decision fusion of voice activity detectors, in which the outputs of several VADs are combined to obtain a more accurate binary speech/non-speech classification of the input. It is well known that an accurate and effective VAD is essential for the design and development of a robust speech recognition system. In this context, Zhang et al. [3] developed a hierarchical framework approach for voice activity detection and speech enhancement in which a modified Wiener filter (WF) is used to reduce the noise, and reported efficient results with respect to the SNR levels. One major drawback was that they did not work on multiple noise reductions and stayed within only a few limits. Sehgal and Kehtarnavaz [4] worked on a CNN smartphone app for real-time (RT) VAD with low audio latency. The app they developed acts as a switch for noise reduction in the DSP pipeline of a hearing device.
One small lacuna was the memory consumption of the app, which was 17.5 MB before the app was started and 20.8 MB after it was started. When robustness is required, VAD design becomes more complicated, particularly in the presence of non-stationary background noise. In this scenario, McEachern [5] carried out extensive thesis work on the heuristic nature of speech signals using statistical speech models supported by deep neural networks, with data augmentation and training on RT audio streams. The work had a couple of drawbacks: the models had trouble generalizing to subtle differences, and making the CNN-GRU hybrid model more robust to unseen environments was not carried out. A robust mel-energy based VAD for non-stationary noise, with an application to speech waveform compression, was proposed by Waheeduddin in the thesis work [6], where an adaptive threshold related to signal-to-noise ratio (SNR) estimates was used to arrive at very low noise levels. Three types of noise were considered in the experiments, viz., white, babble and vehicle noise, a mix of stationary and non-stationary noise. One lacuna was that the performance was not up to the mark when high levels of combined stationary and non-stationary noise were considered. In [7], Kinnunen and Rajan developed a practical, self-adaptive VAD for speaker verification with a mix of noisy telephone and microphone data. They used speech enhancement as a pre-processing step and trained speech and non-speech models on an utterance-by-utterance basis from the MFCCs. However, the performance deteriorated under severe noise conditions, such as in markets or crowded places.
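The general idea of an adaptive threshold driven by a running noise estimate, which recurs in several of the works above, can be sketched as follows. This is not the mel-energy method of [6] or the self-adaptive scheme of [7]; it is a generic illustration in which the function name, the initialisation from the first few frames (assumed speech-free), the threshold factor and the smoothing constant are all assumptions.

```python
def adaptive_energy_vad(energies, init_frames=5, factor=3.0, alpha=0.9):
    """Label each frame energy as speech (True) or non-speech (False).

    The noise floor starts as the mean of the first `init_frames` energies and
    is updated only on frames that are judged to be non-speech.
    """
    noise = sum(energies[:init_frames]) / init_frames
    decisions = []
    for e in energies:
        is_speech = e > factor * noise
        if not is_speech:
            # Exponential smoothing lets the floor follow slow changes in the noise level.
            noise = alpha * noise + (1 - alpha) * e
        decisions.append(is_speech)
    return decisions

# Frame energies: steady noise, a loud burst, then noise at a slightly higher level.
energies = [1.0, 1.1, 0.9, 1.0, 1.0, 1.2, 8.0, 9.5, 7.0, 1.1, 2.0, 2.1, 2.0]
labels = adaptive_energy_vad(energies)
print(labels)
```

Because the floor is frozen during speech, a long burst does not inflate the threshold; the trade-off is that an abrupt, permanent rise in noise can be mistaken for speech until the estimate catches up.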
A new VAD algorithm based on the wavelet transform was proposed by Jiang et al. in [8], where the authors exploited the difference between the spectral distributions of noise and human voice. The authors first took the wavelet transform (WT) of the speech signal, then decomposed it into sub-bands using band-pass (BP) filtering, and finally detected the voice by comparing the sub-band energies of the noise and the voice components. One hindrance was that the method was not effective under very high SNRs, which proved fatal in such cases. In general, VAD is an important process in voice communication (VC) systems to avoid unnecessary coding and transmission of noise. Currently, many VADs suffer from high false alarm rates and low sensitivity at low SNRs, especially at 0 dB and below. In this context, Ong et al. [9] worked on an RT robust VAD that uses the upper envelope of a weighted entropy measure along with a dual-rate adaptive nonlinear (NL) filter, and devised a method to separate the speech and non-speech parts in voice-to-voice communication. Their concept worked well when the number of filter-bank taps is ≤ 50; beyond that, the million-instructions-per-second requirement of the VAD increases, which is a major drawback. Meduri et al. [10] carried out an extensive survey and evaluation of VAD algorithms for various applications in their thesis and reported very good results in comparison with others. The authors thoroughly investigated modern algorithms based on energy thresholds, zero-crossing detection and other statistical measures of the speech signal, implemented in the Matlab environment, under both noisy and noise-free conditions.
The insertion errors were found to be low for the VAD based on statistical measures but very high for the LED method, a major setback in their work that makes the methods less effective in speech applications. Some of the measurement errors were due to manual mismarking of the test signals used as inputs. Graf et al. [11] evaluated different features for voice activity detection based on various parameters of the speech / audio signal and performed a comparative analysis. They categorized the features by parameters such as power, modulation, harmonicity, amplitude, phase, frequency, strength, entropy, energy and time. In [12], Moattar and Homayounpour developed a simple but efficient, noise-robust RT VAD algorithm that uses short-term features such as the spectral flatness of the speech signal together with the short-term energy. It is intended for a host of online audio processing tasks and was evaluated on a number of speech corpora with additive noise. Two deficiencies were found in the method: first, it was still vulnerable to some noises (vehicular noise); second, its average speech hit rate was relatively lower than that of the G.729 VAD standard. Sakhnov et al. [13] devised an energy-based voice detector with an adaptive scaling factor, which serves as an alternative energy-based algorithm for speech/silence classification. The authors were able to track non-stationary speech signals and to calculate the instantaneous values of the audio signal for obtaining the threshold through the scaling factor. The noise power was estimated from the minimum and maximum values of the short-term energy, but one lacuna was the larger number of computations involved. Ishaq et al. [14] studied VAD and garbage modelling for a mobile automatic speech recognition application.
An ANN was used to solve the speech problems, and three tasks were carried out: first, building a novel acoustic model; second, improving the current VAD process; and third, garbage modelling of the out-of-vocabulary (OOV) words, with an overall improvement in the word error rate. One drawback was the use of an ANN, which limits efficiency and could be rectified by using DNNs or CNNs. A confidence measure could also be used to increase the overall accuracy of the system. In [15], Hu and Loizou evaluated different objective quality measures used for speech enhancement, i.e., for improving the signal strength and reducing the noise in audio processing systems that use hybrid noise suppression algorithms. A couple of composite objective measures were also designed by combining individual objective measures through parametric and non-parametric regression analysis, yielding very good simulated results. One demerit was that the majority of the measures predicted the signal distortion and the overall quality equally well, but not the background distortion, a setback in their work. Nautsch et al. [16] studied the decision robustness of voice activity segmentation in unconstrained mobile speaker recognition environments and showed that noise need not affect the segmentation if the design is robust. The work was simulated for different environmental noise conditions, and a Hamming distance-based prediction method was proposed to address the noise issues. The authors used a likelihood ratio comparison of speech and non-speech models, which included the most dominant frequency component features, for the selection of audio model training patterns. Their approach yielded good gains but could have been extended by incorporating a spectral flatness measure (SFM) to improve the SNR.
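The spectral flatness measure referred to above is commonly defined as the ratio of the geometric mean to the arithmetic mean of the power spectrum. The sketch below assumes this standard definition; the function name and the small constant `eps` (added to avoid taking the logarithm of zero) are illustrative assumptions and are not taken from [12] or [16].

```python
import math

def spectral_flatness(power, eps=1e-12):
    """Geometric mean divided by arithmetic mean of a power spectrum.

    Values near 1 indicate a flat, noise-like spectrum; values near 0 indicate
    a peaky spectrum such as that of voiced speech or a tone.
    """
    n = len(power)
    log_mean = sum(math.log(p + eps) for p in power) / n
    arith_mean = sum(power) / n + eps
    return math.exp(log_mean) / arith_mean

flat = spectral_flatness([1.0, 1.0, 1.0, 1.0])     # white-noise-like spectrum
peaky = spectral_flatness([100.0, 0.0, 0.0, 0.0])  # all power in one bin
print(flat, peaky)
```

The measure is often reported in decibels and compared against a threshold alongside the frame energy, so that a loud but noise-like frame is not mistaken for speech.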
In [17], Sahidullah and Saha compared speech activity detection (SAD) techniques for speaker recognition under various environments and for various applications. They reviewed some of the SAD techniques and their applications. Speaker verification systems (SVS) using SAD were experimentally evaluated on the NIST speech database using a GMM-UBM based classifier in clean and noisy conditions. One lacuna was that sub-optimal results were obtained in some cases: when a speech signal was distorted by noise, its frequency bands were affected unequally, a major setback that could be improved using Gaussian models. Park et al. [18] worked on VAD in very crowded, noisy environments using the double-combined Fourier transform (FT) and line fitting approach to design more efficient, noise-robust VADs. The motivation is that the frame energy feature is unstable in crowded or noisy environments, so transformation techniques can be used to remove the noise and obtain effective results. The algorithm they developed is a hybrid one, combined with an edge detection filter for effective detection of the end points, and works on the AURORA-2.0 and SITEC databases. Stegmann and Schroder [19] developed a robust VAD system based on the wavelet transform. Their algorithm exploits the flexibility of the transform in the time-frequency domain to compute robust VAD parameters. The authors used a simple noise model for the WT and for G.729, consisting of a Qth-order pole filter driven by white Gaussian noise with properly selected filter coefficients. One drawback was that, as the order Q was decreased, weightage was given to the WT-l, as clipping became more audible for G.729, particularly at low SNR values (< 10 dB). Semi-supervised speech activity detection with an application to automatic speaker verification in daily life was studied by Sholokhov et al.
in their research paper [20]. A model based on Gaussian mixture modelling (GMM) of the speech and non-speech frames of a recording was proposed. Their proposed SAD did not require any off-line training data, unlike supervised SADs. The work was found to perform better for long recordings; for shorter recordings it was not as effective. In [21], Sohn et al. developed a statistical model-based VAD and showed that it can be used for generic applications where moderate accuracy is required, such as variable-rate speech coding; it uses the decision-directed parameter estimation approach for the likelihood ratio test. Along with this, they also introduced an effective hang-over scheme based on first-order Markov process modelling of speech occurrences. Venuprasad and Lassen [22] studied a VAD system based on the wavelet transform. With s(n) a clean audio / speech signal and w(n) the additive noise, the authors modelled the noisy speech signal as y(n) = s(n) + w(n); the DWT was then taken, the coefficients were found, optimization was carried out and the output was produced with good results. However, one problem was that the available bandwidth was limited, which proved to be a hindrance.

Table I: A comparison of five of the surveyed works (advantages and lacunas)

Ref. No. | Type of concepts used | Advantages | Disadvantages
[3] | Hierarchical framework approach | Used the modified Wiener filter (WF) approach, low SNR | Did not work on multiple noise reductions and stayed within only a few limits
[4] | CNN smartphone app for RT VAD coupled with low audio latency | Reduces the noise in the DSP pipeline of the hearing device | Memory consumption of the app
[6] | Adaptive threshold related to the signal-to-noise ratio (SNR) estimates | Very low noise reduction levels | Performance was not up to the mark when high levels of combined stationary and non-stationary noise were considered
[8] | Algorithm based on wavelet transforms | Utilizes the difference between the spectral distributions of noise and human voice | Suffers from high false alarm rates and low sensitivity at low SNRs
[9] | RT robust VAD utilizing the upper envelope of the weighted entropy measure | Separates speech and non-speech parts in voice-to-voice communication | Works well when the number of filter-bank taps is ≤ 50; beyond that, the million-instructions-per-second requirement of the VAD increases

A number of techniques have been adopted in state-of-the-art VAD designs. In early work, short-time energy, zero-crossing rate and linear prediction coefficients were among the common features used for speech detection [24]. Speech cepstral coefficients [25], spectral entropy [26], least-square periodicity measures [27] and wavelet transform (DWT) coefficients [28] are examples of more recently proposed VAD features. However, none of them offers a perfect solution, since each has to cope with the varying nature of human speech in the presence of background noise. Nagaraja and Jayanna reviewed feature extraction and modelling techniques for multilingual speaker recognition in [29] and addressed multilingual speaker identification by combining evidence from the LP residual and multi-taper MFCCs in [30].
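The hang-over idea mentioned for [21] earlier in this section can be illustrated with a much simpler device than the Markov-model scheme of that paper: a fixed-length counter that keeps the decision at speech for a few frames after the last detected speech frame, so that weak word endings are not clipped. The function name and the hang-over length below are illustrative assumptions.

```python
def apply_hangover(decisions, hang_frames=2):
    """Extend every run of speech frames by up to `hang_frames` trailing frames."""
    smoothed = []
    counter = 0
    for is_speech in decisions:
        if is_speech:
            counter = hang_frames      # re-arm the hang-over on every speech frame
            smoothed.append(True)
        elif counter > 0:
            counter -= 1               # still inside the hang-over period
            smoothed.append(True)
        else:
            smoothed.append(False)
    return smoothed

raw = [False, True, True, False, False, False, False, True, False]
print(apply_hangover(raw))
```

A longer hang-over reduces clipping of low-energy speech tails at the cost of passing more noise frames, so its length is usually tuned to the application.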

IV. COMPARISON OF THE WORK DONE BY AUTHORS

A large number of papers on voice activity detection were collected and studied, and only the important ones are presented here [1]-[30]. Five of the noteworthy works are compared in Table I with regard to the strategy used, its advantages and its drawbacks/lacunas.


V. DRAWBACKS OF THE EXISTING METHODS

A large number of researchers have worked on extending VAD-based systems from the application point of view. Only the important works are presented in this literature survey, and they will be used in our future research on the design and simulation of an effective VAD-based system. Most of the works presented in the previous sections have drawbacks, for example:

• only the single-signal case was considered, not multiple signals,
• no noise was considered,
• the number of computations was rather high,
• few worked on cross-talk interference,
• robustness was not considered,
• many used linearized models,
• non-linear mathematical models were not considered,
• linearization was done about an operating point,
• traditional methodologies were used for noise reduction,
• few worked on multiplexing concepts,
• the MIMO case of signal input was tried by very few,
• hardware and real-time implementation was attempted by very few,
• low spectral efficiency,
• SNR was very high in some of the works,
• hybrid designs were not used for efficient speech analysis,

and so on.

VI. CONCLUSIONS

A brief review of the research work done in the field of voice activity detection for speech processing was presented in this survey paper. Some of the drawbacks of the surveyed works were also highlighted. These lacunas can be taken up as identified problems to be defined and solved with well-optimized solutions. The review paper is intended to serve as a ready reckoner for researchers. The literature survey presented here will be used further to define the research problem and to verify it through simulation results in the Matlab environment, in comparison with the work done by earlier authors in the field.

ACKNOWLEDGMENTS

This research work was supported by the VTU Research Centre, Dept. of Electronics & Communication Engg., Jain Institute of Technology, Davanagere and Visvesvaraya Technological University, Belagavi, Karnataka.

VII. REFERENCES

[1]. Thomas Drugman, Yannis Stylianou, Yusuke Kida, Masami Akamine, “Voice Activity Detection: Merging Source and Filter-based Information”, IEEE Signal Processing Letters, Vol. 23, No. 2, pp. 252-256, Feb. 2016.
[2]. Zaur Nasibov & Dr. Tomi Kinnunen, “Decision fusion of voice activity detectors”, Master's Thesis, School of Computing, University of Eastern Finland, April 16, 2012.
[3]. Yan Zhang, Zhen-min Tang, Yan-ping Li and Yang Luo, “A Hierarchical Framework Approach for Voice Activity Detection and Speech Enhancement”, Hindawi Publishing Corporation, The Scientific World Journal, Article ID 723643, 8 pages, 2014.
[4]. Abhishek Sehgal & Nasser Kehtarnavaz, “A Convolutional Neural Network Smartphone App for Real-Time Voice Activity Detection”, IEEE Access, Vol. 6, pp. 9017-9026, 2018.
[5]. Matthew McEachern, “Neural Voice Activity Detection and its Practical Use”, Master of Engg. in Electr. Engg. & Comp. Sci. Thesis, Dept. of Electrical Engg. & Comp. Sci., Massachusetts Inst. of Tech. (MIT), June 2018.
[6]. Waheeduddin, Syed Q., “A Novel Robust Mel-Energy Based Voice Activity Detector for Nonstationary Noise and Its Application for Speech Waveform Compression”, Master of Science in Electrical Engg. Thesis, Dept. of Electrical Engg., Louisiana State University and Agricultural and Mechanical College, LSU Master’s Theses, 2006.


[7]. Tomi Kinnunen and Padmanabhan Rajan, “A practical, self-adaptive voice activity detector for speaker verification with noisy telephone and microphone data”, IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, pp. 7229-7233, 26-31 May 2013.
[8]. Jiang Shaojun, Gu Haitao, Yin Fuliang, “A New Algorithm for Voice Activity Detection Based on Wavelet Transform”, Proc. of 2004 Int. Symp. on Intelligent Multimedia, Video & Speech Processing, Hong Kong, October 20-22, 2004.
[9]. Wei Qing Ong, Alan Wee Chiat Tan, V. Vijayakumar Vengadasalam, Cheah Heng Tan and Thean Hai Ooi, “Real-Time Robust Voice Activity Detection Using the Upper Envelope Weighted Entropy Measure and the Dual-Rate Adaptive Nonlinear Filter”, Entropy (MDPI), Vol. 19, Issue 11, Article 487, pp. 1-21, 2017.
[10]. Seshashyama Sameeraj Meduri, Rufus Ananth, Dr. Benny Sällberg, Dr. Sven Johansson, “A Survey and Evaluation of Voice Activity Detection Algorithms”, M.S. Thesis, Dept. of Electrical Engg., School of Engg., Blekinge Tekniska Högskola, SE 37175, Karlskrona, Sweden, Jun. 2011.
[11]. Simon Graf, Tobias Herbig, Markus Buck and Gerhard Schmidt, “Features for voice activity detection: a comparative analysis”, EURASIP Journal on Advances in Signal Processing (Springer Open Access), Article No. 2015:91, 2015.
[12]. M.H. Moattar and M.M. Homayounpour, “A simple but efficient real-time voice activity detection algorithm”, 17th European Signal Processing Conference (EUSIPCO 2009), Glasgow, Scotland, pp. 2549-2553, August 24-28, 2009.
[13]. Kirill Sakhnov, Ekaterina Verteletskaya, and Boris Simak, “Approach for Energy-Based Voice Detector with Adaptive Scaling Factor”, IAENG Int. Jour. of Comp. Sci., Vol. 36, Issue 4, IJCS_36_4_16, Nov. 2009.
[14]. Muhammad Salman Ishaq, Mikko Kurimo, Matti Varjokallio, Leo Hämäläinen, “Voice Activity Detection and Garbage Modelling for a Mobile Automatic Speech Recognition Application”, MS Thesis, School of Electrical Engg., Aalto University, Helsinki, Finland, Jan. 2017.
[15]. Yi Hu and Philipos C. Loizou, “Evaluation of Objective Quality Measures for Speech Enhancement”, IEEE Trans. on Audio, Speech & Language Processing, Vol. 16, No. 1, pp. 229-238, Jan. 2008.
[16]. Andreas Nautsch, Reiner Bamberger, Christoph Busch, “Decision Robustness of Voice Activity Segmentation in unconstrained mobile Speaker Recognition Environments”, Gesellschaft Für Informatik e.V., Bonn, Germany, IEEE International Conference of the Biometrics Special Interest Group (BIOSIG), Darmstadt, ISBN: 978-3-8857-9654-1, pp. 1-7, 21-23 Sept. 2016.
[17]. Md Sahidullah, Goutam Saha, “Comparison of Speech Activity Detection Techniques for Speaker Recognition”, Journal of Computer Science - Multimedia; Computer Science – Sound, 7 pages, Oct. 2012.
[18]. Jinsoo Park, Wooil Kim, David K. Han, Hanseok Ko, “Voice Activity Detection in Noisy Environments Based on Double-Combined Fourier Transform and Line Fitting”, Hindawi Publishing Corporation, The Scientific World Journal, Vol. 2014, Article ID 146040, 12 pages, 2014.
[19]. Joachim Stegmann, Gerhard Schroder, “Robust voice-activity detection based on the wavelet transform”, 1997 IEEE Workshop on Speech Coding for Telecommunications Proceedings. Back to Basics: Attacking Fundamental Problems in Speech Coding, Pocono Manor, PA, USA, pp. 99-100, 7-10 Sept. 1997.
[20]. Alexey Sholokhov, Md Sahidullah, Tomi Kinnunen, “Semi-Supervised Speech Activity Detection with an Application to Automatic Speaker Verification”, Elsevier’s Science Direct Comp. Speech & Lang. Jour., Vol. 47, pp. 132-156, Jan. 2018.
[21]. Jongseo Sohn, Nam Soo Kim, Wonyong Sung, “A Statistical Model-Based Voice Activity Detection”, IEEE Signal Processing Letters, Vol. 6, No. 1, Jan. 1999.
[22]. Pranav Venuprasad, Jacob T. Lassen, “Voice Activity Detection based on Wavelet Transform”, Conference Paper, 2015.
[23]. http://alango.com/voice-activity-detection.php
[24]. B.S. Atal and L.R. Rabiner, “A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition”, IEEE Trans. Acoustics, Speech, Signal Processing, vol. 24, pp. 201-212, June 1976.
[25]. J.A. Haigh and J.S. Mason, “Robust voice activity detection using cepstral features”, Proc. of IEEE Region 10 Annual Conf. Speech and Image Technologies for Computing and Telecommunications, Beijing, China, pp. 321-324, Oct. 1993.
[26]. S.A. McClellan and J.D. Gibson, “Spectral entropy: An alternative indicator for rate allocation”, IEEE Int. Conf. on Acoustics, Speech, Signal Processing, Adelaide, Australia, pp. 201-204, Apr. 1994.
[27]. R. Tucker, “Voice activity detection using a periodicity measure”, IEE Proc.-I, vol. 139, pp. 377-380, Aug. 1992.
[28]. J. Stegmann and G. Schroder, “Robust voice-activity detection based on the wavelet transform”, Proc. IEEE Workshop on Speech Coding for Telecommunications, Pocono Manor, PA, pp. 99-100, Sept. 1997.


[29]. Nagaraja B.G. & H.S. Jayanna, “Feature extraction and modeling techniques for multilingual speaker recognition: a review”, Int. Jour. of Signal & Imaging Systems Engg., Inderscience Journal, Vol. 9, No. 2, pp. 67-78, 2016.
[30]. Nagaraja B.G. & H.S. Jayanna, “Multilingual speaker identification by combining evidences from LP residual and multi-taper MFCC”, Jour. of Intelligent Systems (JISYS), Vol. 22, No. 3, pp. 241-251, Jun. 2013.
