INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH VOLUME 8, ISSUE 10, OCTOBER 2019 ISSN 2277-8616

Multi-Class Support Vector Machine Based Continuous Voiced Odia Numerals Recognition

Prithviraj Mohanty, Ajit Kumar Nayak

Abstract: With the rapid advancement of automatic speech-recognition technologies, speech-based machine interaction has attracted the attention of many researchers seeking to move their approaches from the research laboratory to real-life applications. A continuous voiced numeral recognition system is useful for physically challenged (blind) or elderly people who want to hold a telephone conversation, set the PIN for their debit and credit cards, or enter a security code for an application without physically touching the system. The work presented in this paper focuses on the recognition of continuous Odia numerals using a multi-class Support Vector Machine (SVM). Three popular feature extraction techniques, PLP, LPC and MFCC, are used to extract feature parameters from the voiced numerals, which are fed as input to the recognition process. Different kernel mapping functions (polynomial, sigmoid, Radial Basis Function (RBF) and wavelet) are used to map the non-linear input feature space of the framed signals to a linear high-dimensional feature space. To recognize the Odia numerals, multi-class SVM models are constructed using the One-Versus-All (OVA) and Half-Versus-Half (HVH) techniques. Diverse experiments have been performed for the proposed system, and the results are analyzed over the multi-class SVM models for different feature extraction techniques and kernel mapping functions. It has been observed that the OVA SVM model with MFCC for feature extraction and wavelet as the kernel mapping function provides better accuracy than the other variations tested.

Index Terms: SVM, PLP, LPC, MFCC, Sigmoid, RBF, Wavelet, OVA, HVH.

1 INTRODUCTION
Automatic speech recognition (ASR) by machine is regarded as one of the most active and exciting fields of research and has been studied for more than 50 years. The main objective of an ASR system is to transcribe input voiced utterances into the corresponding text. An ASR system can be used to authenticate users through their voice signals and to execute activities given as spoken commands [1]. ASR is also one of the most active applications of speech processing and is commonly used for human-machine interaction. ASR applications are crucial in voice-based activities, automatic debit and credit card activation, and safety and investigation services. Today, most smartphones, laptops and tablets incorporate ASR-based software such as OK Google, Apple Siri and Microsoft Cortana to make life simpler and more productive [2].

Automatic numeral recognition of spoken utterances has attracted considerable attention because numerical data such as account numbers, debit and credit card numbers and telephone numbers can be entered into a machine conveniently by voice. Isolated word recognition and spoken digit recognition are mostly applicable to data entry automation, generation of PIN codes for various services, and automation in banking and security systems. Similarly, continuous voice-based numeral recognition is generally helpful for automatically dialing telephone numbers [3]. In early research on speech recognition systems, classifiers based on the hidden Markov model (HMM), a statistical method, were employed to evaluate the acoustic probability. The parameters of HMM models are computed using the maximum likelihood estimation (MLE) algorithm [4]. Since the parameters are estimated using only the input class data and depend mainly upon the prior probability distribution, HMM exhibits poor performance in classification. A technique that classifies better is therefore required. During the last few decades, researchers have proposed alternative approaches that perform better than HMM. Most of these approaches are based on Artificial Neural Networks [5] or the hybrid HMM-ANN approach [6]. A newer machine learning technique such as SVM, which has good generalization and convergence properties, can be used as a better classifier. Generally SVM is a linear classifier, but the use of different kernel mapping functions permits SVM to function as a non-linear classifier in a high-dimensional feature space [7].

Much progress has already been made in ASR for widely spoken languages such as English, French, Chinese, Japanese and Mandarin. ASR systems exist for these languages and continue to evolve. Currently, much research is under way on Indian languages such as Hindi, Bengali, Kannada, Telugu, Marathi, Punjabi and Gujarati. ASR for an Indian language like Odia is still less advanced owing to the absence of computational linguistic resources. ASR for Odia has been found inefficient even though the language is spoken by approximately 33 million people in India, and only a few research works exist for it. The non-availability of advanced ASR software for Odia, together with regional interest, motivates more research effort towards it. In this paper, we propose a system for recognizing continuous voiced Odia numerals using support vector machines. The system is implemented using voiced mobile numbers spoken in Odia. First, the continuous voiced numerals are preprocessed and segmented into isolated numerals. Then different feature extraction methods, MFCC, LPC and PLP, are applied. These features are input to a multi-class SVM for recognition of the numerals. The input voiced numeral signal contains many different features that are non-linear in nature and cannot be classified easily by a linear SVM. So, various kernel mapping functions such as polynomial, sigmoid, RBF and wavelet are used to map the non-linear input space to a linear feature space [8]. The efficiency of the system is computed using these mapping functions along with different multi-class SVM classifiers. The rest of the paper is organized as follows. Section 2 describes related work on word and digit recognition for different languages. Section 3 outlines the proposed model along with the fundamentals of the SVM classifier. The experimental results and their comparison are presented in Section 4. Finally, Section 5 concludes the paper and suggests future directions.

Prithviraj Mohanty, Department of CS&IT, ITER, S’O’A (Deemed to be) University, Bhubaneswar, India. E-mail: [email protected]
Dr. Ajit Kumar Nayak, Department of CS&IT, ITER, S’O’A (Deemed to be) University, Bhubaneswar, India. E-mail: [email protected]

2 RELATED WORK
Continuous voiced numeral recognition is treated as a design technique for a voiced dialer system. Such a system is especially significant for physically challenged (blind) or aged people who want to hold a telephone conversation without physically dialing the numbers. It is also helpful for illiterate people who can speak the numerals but cannot recognize them accurately. A number of research works have been proposed to recognize numerals in different languages. Our discussion focuses on work related to isolated word recognition and digit/numeral recognition for different spoken languages, with different parametric representations of speech and various classification methods. Isolated word and digit recognition for various languages with different techniques, along with their performance, has been presented in [9]. An Odia word recognition system based on the HMM model, intended for visually impaired students in school and public education, was proposed in [11]. S. Mohanty et al. developed a model in which speech recognition and speaker verification are achieved by HMM and SVM respectively [12]. Isolated voiced Odia digit recognition, with implementation particulars using the HTK tool, is presented in [13]. Pandit et al. developed an ASR system that uses the dynamic time warping method for recognizing Gujarati digits [14]. A survey of different methods and approaches for spoken English digit recognition has been presented in [15]. Ali et al. proposed a model for Urdu digit recognition using three classification models: SVM, Random Forest (RF) and Linear Discriminant Analysis (LDA). Their experimental outcomes suggest that SVM performs better than the other two classifiers [16]. Biometric classification using voiced keywords as input for the Brazilian Portuguese language has been proposed in [17]. SVM with RBF and linear kernel mapping functions is used in their experiment for speaker identification. A speaker-independent Malayalam digit recognition system using MFCC as the parameter technique and SVM as the classifier has been proposed in [18]. The experimental results show that the average recognition accuracy of the system is 97.6%. Hedge et al. developed an isolated word recognition model for the Kannada language [19]. The system, implemented using SVM as the classifier combined with MFCC feature extraction, a noise reduction method and end-point detection, produced a good accuracy rate for input words. Isolated English digit recognition using ANN as the classifier has been proposed in [20]. The model was evaluated using a set of 63 feature vectors for each input signal, obtained from a combination of different feature extraction techniques. The authors tested the model on 280 samples and obtained an accuracy of 85%. In [35], data classification using some heuristic methods has been proposed. A novel method known as Diminishing Learning (DL), which reduces the number of support vector points without compromising the classification accuracy, has been suggested in [8]. The system has been evaluated for digit recognition with PLP and MFCC as parameter extraction and with different approaches to SVM classification. A continuous speech recognition system with SVM, which takes decisions at frame level and uses a token passing method to find the sequence of recognized words, has been proposed in [21]. Mittal et al. proposed a multiclass SVM for recognition of spoken Hindi digits. They used MFCC, LPC and a mix of both for feature extraction, and different kernel mapping functions of SVM for classification. The system was tested with different approaches (one vs. all and ten one vs. all) to SVM classification with a varying number of frames per signal. A performance comparison of various feature extraction techniques along with other recognition techniques has been reported in [22]. A new technique in which wavelet analysis and SVM are used for speaker verification has been proposed by Returi et al. [23]. The filter banks of the wavelet are used to extract features that distinguish normal from abnormal input voices. Furthermore, the SVM approach is used to segregate a particular speaker's signal from multiple dialogs. The reported accuracy is 95% under appropriate classification. Whispered speech recognition using the SVM and HMM approaches has been proposed in [24]. The experimental outcomes obtained by the authors suggest that HMM provides the better result for speaker-independent (SI) recognition, while SVM is superior for speaker-dependent (SD) recognition. A hybrid technique that uses both HMM and SVM can be considered to provide an average accuracy for both SI and SD environments.

3 PROPOSED MULTI-CLASS SVM BASED CONTINUOUS VOICED ODIA NUMERALS RECOGNITION SYSTEM

3.1 Frame-work for MSVM-CVONR
The proposed model for continuous voiced Odia numerals recognition using multi-class SVM (MSVM-CVONR) is represented in Fig. 1. Two phases, a training phase and a testing phase, are considered in developing the proposed model. The training phase comprises pre-processing and end-point detection, framing, windowing, feature extraction using the PLP/LPC/MFCC technique, and building the multi-class SVM. In this phase the recorded continuous utterances of numerals are first preprocessed (noise and silence removal) and divided into individual signals for the numerals present in the utterance.

Fig. 1. Framework for Multi-class SVM based Continuous Voiced Odia Numerals Recognition System

Then the individual voices for the numerals are
further divided into a sequence of overlapping frames. Every frame is then weighted by a Hamming or other windowing function. Several acoustic coefficient parameters are obtained from the individual frames. From the many feature extraction methods available, three popular techniques, PLP, LPC and MFCC, are considered for our proposed work [25]. Each numeral of the continuous voiced Odia numerals has its own special features in amplitude, energy, pitch and spectrum. Using signal processing techniques, we can manipulate these feature parameters to ultimately identify the different voiced numerals. In the testing phase, the unseen continuous voiced signals are first pre-processed and end-point detection is performed in order to obtain the individual signals for the isolated numerals present in the utterances. Then framing and a windowing function are applied to the isolated numerals. In the next step, parameters are extracted from each framed signal and used as the input to the multi-class SVM classifier to detect the particular numeral.

3.2 Preprocessing and End-point Detection
In a voiced signal, low-frequency formants generally have higher amplitude than high-frequency formants. A pre-emphasis of the high frequencies is required in order to balance the amplitude of all formants. This can be achieved by filtering the voiced signal with a 1st order finite impulse response (FIR) filter, whose transfer function is defined in the Z-domain as [26]:

$$H(z) = 1 - 0.95\,z^{-1}$$

The end-point detection method is used to find the starting and ending point of each spoken numeral present in the recorded continuous voiced numerals. It locates the region where actual speech is present and is also used to eliminate the silent and noisy areas of the input signal. Two end-point algorithms are mostly used: short-time energy, which uses the energy level of the signal, and zero crossing rate, which uses the number of zero crossings in each signal frame [27].

3.3 Feature Extraction
The technique used to extract parameters from the input speech waveform at a reduced data rate for further processing and analysis is known as feature extraction. These features are essential for speech recognition. Feature extraction is generally regarded as the front-end part of speech or voice processing [28]. It transforms the processed speech signal into a compact but meaningful representation that is more discriminative and consistent than the original signal. In speech recognition the front end comprises the feature extraction process, while the back end uses a pattern matching or speaker modelling technique. Acceptable recognition of speech therefore depends heavily on the quality of the extracted features. For automatic speech recognition (ASR) systems, the parametric representation of waveforms should be reliable, remaining essentially the same under slight alterations of the speech signal. Feature extraction involves feature parameters of more than one dimension for each speech signal. An extensive range of choices exists for the parametric representation of the waveform, such as MFCC, LPC, PLP, Linear Prediction Cepstral Coefficients (LPCC), Line Spectral Frequencies (LSF) and the Discrete Wavelet Transform (DWT). In our proposed work, we consider MFCC, LPC and PLP as the feature extraction techniques.

Mel-Frequency Cepstral Coefficient: The leading and most prominent method used for extraction of spectral features is Mel-Frequency Cepstral Coefficients (MFCC). It extracts parameters from speech in a way similar to how humans hear speech, and at the same time it de-emphasizes other information. The Mel scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. A reference point is set by a tone of 1000 Hz at 40 dB above the audible threshold. The signal is decomposed using a filter bank. MFCCs are then obtained as the discrete cosine transform (DCT) of the real logarithm of the short-term energy expressed on the Mel frequency scale [4]. The MFCC technique has been applied to a wide range of signal analysis tasks and is considered to perform well compared to other feature extraction methods. For each input frame of a signal, it evaluates the cepstral coefficients, the delta cepstral energy and the power spectrum deviation. For each individual frame of a signal, MFCC computes the first 12 coefficients along with a zeroth (null) coefficient, 13 delta coefficients (obtained using the 1st order derivative of the initial 13 coefficients) and 13 acceleration coefficients (obtained using the 2nd order derivative of the initial 13 coefficients) [10]. So, altogether 39 features are extracted for each frame of the input speech waveform. Fig. 2 depicts the block diagram of the MFCC feature extraction method.

Fig. 2. Block diagram for MFCC feature extraction

Linear Predictive Coefficient: LPC is one of the most powerful speech analysis techniques. It models the human vocal tract and has gained popularity as a formant estimation technique. It computes the power spectrum of the signal and is generally used at low bit rates. The least square error method is used for generating the parameters. Using this technique, a speech sample is approximated as a linear combination of its previous samples. The LPC coefficients describe the formants, that is, the frequencies at which the resonant peaks occur. Using this method, the positions of the formants of a signal are computed from the linear predictive coefficients over a framed window. From the resulting linear prediction filter, it also finds the peaks of the spectrum. With slight variations of LPC, other features can be derived, such as linear prediction cepstral coefficients (LPCC), log area ratio (LAR), reflection coefficients (RC), line spectral frequencies (LSF) and Arcus Sine Coefficients (ARCSIN) [25]. Fig. 3 represents the block diagram of the LPC feature extraction technique.

Perceptual Linear Prediction:
The PLP model was first developed by Hermansky. The PLP method models human speech using concepts from the psychophysics of hearing. It also discards unnecessary information, and the coefficients are transformed to match the human auditory system. PLP approximates three main aspects: the critical-band resolution, the equal-loudness curve and the intensity-loudness relation [25]. Fig. 4 depicts the steps of PLP coefficient computation. For our experiment, altogether 18 cepstral coefficients are extracted from each framed signal.

Fig. 3. Block diagram for LPC feature extraction

Fig. 4. Block diagram for PLP feature extraction

3.4 Multi-class SVM Classifier
Support vector machines are a reasonably new form of machine learning technique applied to classification, regression, outlier detection and clustering. However, they are mostly used in classification problems. For a 2-dimensional feature space, the task is to create a decision boundary that separates two class labels. For an m-dimensional feature space, SVM creates a hyperplane which is used to classify the different class labels. In such a scenario, SVM finds the hyperplane that maximizes the margin and simultaneously reduces the misclassification rate. The ideal way to separate two groups of data is a straight line (one dimension), a flat plane (two dimensions) or an N-dimensional hyperplane. However, there are circumstances where a nonlinear region separates the groups more effectively. SVM handles this by using kernel functions that map the input space into another space where a hyperplane can be used for the separation. That means a non-linear function is learned by a linear learning machine in a higher-dimensional feature space, while the capacity of the system is controlled by a factor that is independent of the dimensionality of the input feature space. This technique is known as the kernel trick. Fig. 5 depicts the mapping of a non-linear space to a linear higher-order feature space.

Fig. 5. Non-linear to Linear mapping using kernel mapping of SVM

The basic equation of an SVM classifier can be represented as:

$$f(z_{in}) = w_o^{T}\,\varphi(z_{in}) + b_i \qquad (1)$$

where $z_{in}$ is the test input vector and $w_o$ is a vector normal to the hyperplane that separates the classes in the feature space. The feature mapping function $\varphi(\cdot)$, which may be linear or non-linear, produces the feature space. The parameter $b_i$ is the bias term. The decision plane or hyperplane is determined by minimizing the misclassification and maximizing the boundary between the two classes. The size of the decision boundary is $2/\lVert w_o \rVert$. Let $\langle x_i, y_i \rangle$ for $i = 1, 2, 3, \ldots, n$ denote the training data set, where $y_i$ is the target output for the training sample $x_i$. The basic equation of SVM, modified by applying constraints that maximize the margin and minimize the misclassification, is expressed as:

$$\min_{w_o,\, b_i,\, \xi}\ \frac{1}{2}\lVert w_o \rVert^{2} + P \sum_{i=1}^{n} \xi_i \qquad (2)$$

subject to $y_i\,(w_o^{T}\varphi(x_i) + b_i) \ge 1 - \xi_i$, $\xi_i \ge 0$, for all $i$, where $P$ is the penalty parameter and $\xi_i$ is the slack variable. The parameter $P$ balances the complexity of the decision plane against the number of incorrectly classified points. An erroneous selection of the penalty parameter can cause a severe loss in performance. The parameter values are computed using cross validation over an exponentially growing sequence of $P$. Using quadratic programming optimization, the solution to the above equation can be represented as [29]:

$$\max_{\beta}\ \sum_{i=1}^{n} \beta_i - \frac{1}{2}\,\beta^{T} M \beta \qquad (3)$$

subject to $0 \le \beta_i \le P,\ \forall i$ and $\sum_{i=1}^{n} \beta_i y_i = 0$, where $M$ is an $n \times n$ matrix. The $(i, j)$th element of $M$ is given by:

$$M_{ij} = y_i\, y_j\, \varphi(x_i)^{T} \varphi(x_j) \qquad (4)$$

A Lagrange multiplier $\beta_i$ is associated with each individual training sample $x_i$. Training samples whose $\beta_i$ values are non-zero are called support vector points. Solving the above quadratic problem yields [30]:

$$w_o = \sum_{i=1}^{n} \beta_i\, y_i\, \varphi(x_i) \qquad (5)$$

Using the support vector points, the above equation can be rewritten as:

$$w_o = \sum_{svp=1}^{n_{svp}} \beta_{svp}\, y_{svp}\, \varphi(x_{svp}) \qquad (6)$$

where $n_{svp}$ is the number of support vector points, and $\beta_{svp}$, $y_{svp}$ and $x_{svp}$ denote the parameters corresponding to each support vector point. The SVM classifier for a test input can now be expressed as:

$$f(z_{in}) = \sum_{svp=1}^{n_{svp}} \beta_{svp}\, y_{svp}\, K_l(x_{svp}, z_{in}) + b_i \qquad (7)$$

where $K_l(x_{svp}, z_{in}) = \varphi(x_{svp})^{T} \varphi(z_{in})$ is a kernel function. It can be observed from the above equation that an increase in the number of support vector points increases the computational requirements and makes the classifier slow for non-linear kernels. Reducing the support vector points also reduces the number of times the kernel must be evaluated. Table 1 lists some of the popular kernel functions used in SVM classifiers. The kernel mapping function must satisfy the Karush-Kuhn-Tucker and Mercer's conditions [31] in order to interpret the input feature space. The hyperplane that solves the objective function should lie in the mapped feature space.

For the multi-class classifier, the output is defined as $Y(z_{in}) \in \{1, 2, \ldots, m\}$, where $m$ is the number of classes. Lagrange multipliers ($\beta_{svp}^{j}$) and bias ($b^{j}$) values are associated with each of the $m$ classes, where $svp \in \{1, 2, \ldots, n_{svp}\}$. The decision function can be represented as:

$$Y(z_{in}) = \arg\max_{j \in \{1, \ldots, m\}} f_j(z_{in}) \qquad (8)$$

where $f_j(z_{in})$ is evaluated for each class $j \in \{1, 2, \ldots, m\}$ using Eq. (7). In effect, this associates a hyperplane with each individual class and assigns the test input to the class whose hyperplane is farthest away from it.

TABLE 1
KERNEL MAPPING FUNCTIONS USED IN SVM

4 RESULT ANALYSIS AND DISCUSSION
4.1 Data Set Preparation
For our proposed MSVM-CVONR system, the training corpus is developed with 100 mobile numbers recorded by 20 different speakers (10 male and 10 female) aged between 20 and 60 years. Speakers are chosen from different regions of Odisha. Each speaker is asked to speak 10 sets of mobile numbers, so a total of 10 x 20 = 200 sound files are recorded. The voiced mobile numbers are recorded in a room environment. Some of the utterances are recorded with Audacity (a recording and editing software for speech) using a headset microphone with the volume level set to 1.0. The recordings of the mobile numbers in Odia are made at a sampling frequency of 16 kHz with 16 bits per sample. Some of the recordings are also made on smartphones using a sound recording application. Each sound file, containing the recording of a single mobile number, is preprocessed in order to remove continuous silent periods and noise. The file is then segmented into 10 portions, each corresponding to an Odia numeral. For preprocessing and end-point detection, each recorded sound file is processed with Audacity. In total, 200 x 10 = 2000 isolated Odia numeral files result, each labelled with the numeral it corresponds to. For testing, 5 speakers (3 male and 2 female) are considered. Each speaker records utterances of 10 mobile numbers, so a total of 5 x 10 = 50 wave files are created for testing. Preprocessing and segmentation are again done using Audacity. After preprocessing and end-point detection, each individual speech signal is segmented into a series of consecutive frames of approximately 20 to 30 ms. Framing divides the input signal into quasi-stationary segments. Each frame may overlap its previous frame. Windowing is then required in order to minimize the signal discontinuities at the start and end of each frame. Mostly, the Hamming window function is applied, since it produces the least amount of distortion. In the next step, the feature extraction techniques (PLP/LPC/MFCC) are applied to generate the equivalent parametric representation of the framed signal, which is used as the input to the multi-class SVM for classification.

4.2 Odia Numerals Recognition
The numeral recognition problem is similar to a pattern classification problem with a very large feature dimension. Fig. 6 represents the block diagram of numeral recognition. It consists of feature extraction, followed by self-organizing mapping of the extracted features and the SVM classifier. Voiced Odia numerals are separated into frames, and PLP/LPC/MFCC coefficients are evaluated from each individual frame. For each frame, 11 coefficients are extracted using the PLP method and 18 feature coefficients using the LPC method. Similarly, with MFCC a total of 39 coefficients (13 energy coefficients + 13 1st order MFCC + 13 2nd order MFCC) are computed. The total number of PLP/LPC/MFCC coefficients is not the same for every numeral signal, because each isolated numeral may consist of a different number of frames. Multi-class SVM based recognition always needs the same number of feature values for each distinct numeral. Kohonen's self-organizing feature map (SOFM) is used [32] in order to make the size of the input feature uniform for all numerals. It also decreases the input parameter size for classification. This technique transforms the feature vector sequence into trajectories in a square matrix of fixed dimension [33]. The feature map is trained using Kohonen's self-organizing learning method. It has the useful property that, after training, similar input vectors, whose Euclidean distance is small, are found at nearby vertices of the feature map. The size of the SOFM may be 32x32, 24x24, 18x18 or any other combination. Using the SOFM weights, the transformed features for each numeral are obtained. The SOFM matrix shown in Fig. 6 contains black dots and white dots, which denote 1 and 0 respectively.

Fig. 6. Block diagram of Multi-class SVM classifier with SOFM array

From Eq. 7, it can be concluded that the computational complexity depends on the number of support vector points. Support vector points are obtained during the learning phase of SVM. The most popular technique used for multi-class SVM is One-Versus-All (OVA). It is used in our proposed work for Odia numeral recognition. The OVA learning mechanism is depicted in Fig. 7. The whole data set is utilized for creating a decision boundary between classes, and the support vector points are obtained for each class. A multi-class parallel SVM classifier may be designed as shown in Fig. 8. In this approach, the decision boundary is always strict with respect to individual classes, hence each classifier may identify the input test data more precisely. As a result, the overall recognition accuracy is improved. Table 2 lists the Odia numerals with their equivalent words, the words in Roman script and in English, the Odia symbol and the International Phonetic Alphabet (IPA).

Fig. 7. OVA learning mechanism used in SVM

TABLE 2
ODIA NUMERALS WITH EQUIVALENT WORDS, WORDS IN ROMAN AND ENGLISH LANGUAGE, ODIA SYMBOL AND IPA
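The role of the SOFM in Section 4.2, turning a variable number of frame vectors into a fixed-size binary matrix, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes NumPy, a plain Kohonen update with a Gaussian neighbourhood, and synthetic random vectors standing in for 39-dimensional MFCC frames. The function names and the learning-rate and neighbourhood schedules are hypothetical choices made for illustration.

```python
import numpy as np

def best_matching_unit(weights, v):
    """Grid coordinates (row, col) of the node whose weight vector is closest to v."""
    dist = np.linalg.norm(weights - v, axis=2)
    return np.unravel_index(np.argmin(dist), dist.shape)

def train_sofm(vectors, grid=24, epochs=2, lr0=0.5, seed=0):
    """Train a grid x grid Kohonen map on the rows of `vectors` (assumed schedule)."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(grid, grid, vectors.shape[1]))
    rows, cols = np.indices((grid, grid))
    total_steps = epochs * len(vectors)
    step = 0
    for _ in range(epochs):
        for v in vectors[rng.permutation(len(vectors))]:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                       # decaying learning rate
            sigma = max((grid / 2.0) * (1.0 - frac), 0.5)  # shrinking neighbourhood
            r, c = best_matching_unit(weights, v)
            # Gaussian neighbourhood centred on the winning node
            h = np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2.0 * sigma ** 2))
            weights += lr * h[:, :, None] * (v - weights)
            step += 1
    return weights

def trajectory_matrix(weights, frames):
    """Mark every node hit by a frame with 1; the size is independent of len(frames)."""
    matrix = np.zeros(weights.shape[:2], dtype=np.uint8)
    for v in frames:
        matrix[best_matching_unit(weights, v)] = 1
    return matrix

# Synthetic stand-ins for MFCC frames (39 values per frame); not real speech data.
rng = np.random.default_rng(1)
training_frames = rng.normal(size=(300, 39))
sofm = train_sofm(training_frames, grid=24, epochs=2)
numeral_a = trajectory_matrix(sofm, rng.normal(size=(20, 39)))  # short utterance
numeral_b = trajectory_matrix(sofm, rng.normal(size=(55, 39)))  # longer utterance
print(numeral_a.shape, numeral_b.shape, numeral_a.flatten().size)
```

Both utterances map to a 24x24 matrix of zeros and ones, so flattening either gives the same 576 inputs for the SVM regardless of how many frames the numeral contained.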

TABLE 3
CONFUSION MATRIX FOR MSVM-CVON WITH CLASSIFICATION ORDER OF THE VOICED MOBILE NUMBER “୮୨୪୯୫୬୭୧୨୦”

Fig. 8. Parallel SVM classifier using OVA

The confusion matrix for the MSVM-CVON classifier using the wavelet kernel (a=10) with MFCC feature inputs, SOFM size 24x24 and OVA, for an Odia voiced mobile number, is presented in Table 3. The matrix gives the number of instances in which each numeral is correctly classified and the number of times it is misclassified. Ten utterances of the same mobile number are considered for recognition. The table shows that the numeral chhaa (Six), presented 10 times, is correctly classified 8 times and misclassified 2 times. Similarly, for naa (Nine) the correct classifications and misclassifications are 7 and 3 respectively.
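To make the kernel choices concrete, the following sketch gives textbook forms of the four kernel families named in the paper, together with the support-vector decision value of Eq. (7). These are assumptions: the exact expressions in Table 1 are not reproduced in the text. In particular, the wavelet kernel below is the commonly used Morlet-based product form with dilation `a`, and every parameter default except `a=10` is a hypothetical illustration value. The two toy support vectors are invented for the demonstration.

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2, c=1.0):
    return (dot(x, z) + c) ** degree

def sigmoid_kernel(x, z, gamma=0.1, c=0.0):
    return math.tanh(gamma * dot(x, z) + c)

def rbf_kernel(x, z, sigma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def wavelet_kernel(x, z, a=10.0):
    """Assumed Morlet-based wavelet kernel: product of cos(1.75u) * exp(-u^2 / 2)."""
    k = 1.0
    for xi, zi in zip(x, z):
        u = (xi - zi) / a
        k *= math.cos(1.75 * u) * math.exp(-0.5 * u * u)
    return k

def decision_value(z, support_vectors, betas, labels, bias, kernel):
    """Eq. (7): sum over support vectors of beta * y * K(x_sv, z), plus the bias."""
    return sum(b * y * kernel(sv, z)
               for sv, b, y in zip(support_vectors, betas, labels)) + bias

# Toy, hand-picked support vectors (hypothetical, not taken from the paper).
svs = [(0.0, 0.0), (2.0, 2.0)]
betas = [1.0, 1.0]
labels = [-1, +1]

near_positive = decision_value((1.8, 2.1), svs, betas, labels, 0.0, wavelet_kernel)
near_negative = decision_value((0.1, -0.2), svs, betas, labels, 0.0, wavelet_kernel)
print(near_positive > 0, near_negative < 0)
```

The cost of `decision_value` grows linearly with the number of support vectors, which is why the #SVP columns in the result tables matter for classification speed.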

4.3 Performance Comparison of Odia numeral recognition using different Classifiers
In this section, the performance of MSVM-CVON for recognition of Odia numerals is evaluated using two different classification schemes, One-Versus-All (OVA) and Half-Versus-Half (HVH) [34], and the results are compared. The architectures of OVA and HVH are represented in Fig. 8 and Fig. 9 respectively.

Fig. 9. HVH SVM classifier architecture

For the OVA classifier, the test input is evaluated by all the binary SVM classifiers for the individual numerals, and the resulting class is obtained through a decision rule that selects a particular numeral. If m is the total number of numerals, then the total number of binary SVM classifiers required is m, and the maximum number of binary classifiers required to classify any test input is also m. Since the Odia language has 10 numerals, 10 binary SVM classifiers are required. The HVH classifier, on the other hand, uses a divide and conquer approach. First a binary classifier is constructed by dividing the Odia numerals 0 to 9 into two classes, 0 to 4 in one class and 5 to 9 in the other. If the test input lies in the class 0-4, it is further divided into two classes, 0-2 and 3-4. A similar process is carried out until a single class is reached. The total number of binary classifiers required for m numerals is m-1, and the maximum number of binary classifiers required to classify any test input is (m/2)-1. In our case, since the number of numerals is 10, the total number of binary classifiers and the maximum number required to test an input are 9 and 4 respectively. The implementation has been done using Python. The recognition rates for the two classifiers, OVA and HVH, with different SOFM sizes, different feature extraction techniques and various kernel mapping functions are presented in Table 4 and Table 5 respectively.
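The classifier counts quoted for the 10 Odia numerals can be checked with a short sketch that builds the HVH split tree described above and contrasts it with the OVA decision rule. This is an illustrative assumption, not the authors' code: `build_hvh`, `hvh_predict` and `ova_predict` are hypothetical names, and the `goes_left` callback stands in for a trained binary SVM at each node. For a balanced split like this one the tree depth is the ceiling of log2(m), which for m = 10 gives the same value, 4, that the text reports.

```python
def build_hvh(classes):
    """Recursively halve the class list; every internal node is one binary classifier."""
    if len(classes) == 1:
        return classes[0]
    mid = (len(classes) + 1) // 2  # 0-9 -> 0-4 | 5-9, then 0-4 -> 0-2 | 3-4
    return (build_hvh(classes[:mid]), build_hvh(classes[mid:]))

def leaves(node):
    """All class labels below a node."""
    if not isinstance(node, tuple):
        return [node]
    return leaves(node[0]) + leaves(node[1])

def count_classifiers(node):
    """Total number of binary classifiers (internal nodes) in the tree."""
    if not isinstance(node, tuple):
        return 0
    return 1 + count_classifiers(node[0]) + count_classifiers(node[1])

def max_evaluations(node):
    """Largest number of binary decisions needed to reach any single class."""
    if not isinstance(node, tuple):
        return 0
    return 1 + max(max_evaluations(node[0]), max_evaluations(node[1]))

def hvh_predict(node, goes_left):
    """Walk the tree; goes_left(left_classes, right_classes) mimics a binary SVM."""
    evaluations = 0
    while isinstance(node, tuple):
        left, right = node
        node = left if goes_left(leaves(left), leaves(right)) else right
        evaluations += 1
    return node, evaluations

def ova_predict(scores):
    """OVA rule: evaluate all m classifiers and pick the largest decision value."""
    return max(range(len(scores)), key=lambda j: scores[j])

numerals = list(range(10))
tree = build_hvh(numerals)
print(count_classifiers(tree), max_evaluations(tree))

# Oracle stand-in for trained SVMs: the true numeral is 7.
print(hvh_predict(tree, lambda left, right: 7 in left))

# Hypothetical OVA decision values for one test input (index = numeral).
print(ova_predict([-1.2, -0.4, -2.0, 0.1, -0.9, -1.5, 0.8, -0.3, -1.1, -0.7]))
```

OVA always evaluates all 10 binary classifiers per input, whereas the HVH walk stops after at most 4, at the cost of an early wrong split being unrecoverable.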

TABLE 4 AVERAGE RECOGNITION RATE (%) OF PROPOSED MSVM-CVON CONSIDERING DIFFERENT SOFM SIZE WITH OVA CLASSIFIER

SOFM size  Feature extraction technique  Polynomial ARR  Polynomial #SVP  Sigmoid ARR  Sigmoid #SVP  RBF ARR  RBF #SVP  Wavelet ARR  Wavelet #SVP
18x18      PLP                           85.25           870              83.75        743           85.5     840       87.5         840
18x18      LPC                           89.5            876              88.75        745           89.5     810       90.5         770
18x18      MFCC                          90.25           908              89           880           91       814       92           785
24x24      PLP                           86.25           890              87.75        710           86.5     820       90.5         800
24x24      LPC                           91.5            880              93.75        675           91.5     760       95           720
24x24      MFCC                          92.25           920              94.5         840           96       780       96.5         750
32x32      PLP                           86              880              87.75        770           86.5     821       90.5         810
32x32      LPC                           90.5            786              92.7         725           90.75    788       92.7         712
32x32      MFCC                          92.25           926              92           863           96       805       93.5         770

TABLE 5 AVERAGE RECOGNITION RATE (%) OF PROPOSED MSVM-CVON CONSIDERING DIFFERENT SOFM SIZE WITH HVH CLASSIFIER

SOFM size  Feature extraction technique  Polynomial ARR  Polynomial #SVP  Sigmoid ARR  Sigmoid #SVP  RBF ARR  RBF #SVP  Wavelet ARR  Wavelet #SVP
18x18      PLP                           83.25           834              82.55        783           84.75    845       88.5         845
18x18      LPC                           86.5            887              87.5         795           88.5     830       91.5         795
18x18      MFCC                          90.2            911              88.3         940           91.5     845       92           815
24x24      PLP                           84.5            856              87.75        754           86.5     830       90           820
24x24      LPC                           91.5            893              93.5         770           90.45    765       95.25        725
24x24      MFCC                          92.25           930              94.25        850           95.5     787       95.75        784
32x32      PLP                           85              895              86.5         787           85.5     825       88.5         825
32x32      LPC                           91.5            810              90.7         745           91.75    890       91.7         812
32x32      MFCC                          92.2            946              91.5         890           96.3     855       93           790

** ARR: Average Recognition Rate; #SVP: Number of Support Vector Points
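As an aid to reading Tables 4 and 5, the following sketch compares the two classifiers on the 24x24 rows only. The ARR values are transcribed from the tables above; the dictionary layout and the helper names (`best_configuration`, `ova_minus_hvh`) are illustrative assumptions and not part of the reported implementation.

```python
KERNELS = ("Polynomial", "Sigmoid", "RBF", "Wavelet")

# ARR (%) for SOFM size 24x24, transcribed from Table 4 (OVA) and Table 5 (HVH).
ARR_24X24 = {
    "OVA": {
        "PLP": (86.25, 87.75, 86.5, 90.5),
        "LPC": (91.5, 93.75, 91.5, 95.0),
        "MFCC": (92.25, 94.5, 96.0, 96.5),
    },
    "HVH": {
        "PLP": (84.5, 87.75, 86.5, 90.0),
        "LPC": (91.5, 93.5, 90.45, 95.25),
        "MFCC": (92.25, 94.25, 95.5, 95.75),
    },
}


def best_configuration(rates):
    """Return (feature, kernel, ARR) with the highest ARR among one classifier's rows."""
    return max(
        (
            (feature, kernel, arr)
            for feature, row in rates.items()
            for kernel, arr in zip(KERNELS, row)
        ),
        key=lambda item: item[2],
    )


def ova_minus_hvh(feature):
    """Per-kernel ARR difference (OVA minus HVH), rounded to two decimals."""
    ova = ARR_24X24["OVA"][feature]
    hvh = ARR_24X24["HVH"][feature]
    return {kernel: round(a - b, 2) for kernel, a, b in zip(KERNELS, ova, hvh)}


best_ova = best_configuration(ARR_24X24["OVA"])
best_hvh = best_configuration(ARR_24X24["HVH"])
mfcc_gap = ova_minus_hvh("MFCC")

print(best_ova, best_hvh, mfcc_gap)
```

Within the 24x24 rows, both classifiers reach their highest ARR with MFCC and the wavelet kernel (96.5 for OVA and 95.75 for HVH). For MFCC, the OVA minus HVH gap is 0, 0.25, 0.5 and 0.75 percentage points for the polynomial, sigmoid, RBF and wavelet kernels respectively.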


Fig. 10. Comparison of ARR of OVA and HVH with various kernel mapping functions (considering a SOFM size of 24x24 with MFCC as the feature extraction technique)

Fig. 11. Comparison of ARR of OVA and HVH with different feature extraction techniques (considering a SOFM size of 24x24 with wavelet as the kernel mapping function)

a) It may be seen from Tables 4 and 5 that the OVA multi-class classifier has a better average recognition rate than the HVH classifier for a SOFM size of 24x24 with the various kernel mapping functions.
b) It may also be seen from Tables 4 and 5 that both multi-class classifiers perform well with a SOFM size of 24x24, MFCC as the feature extraction method and wavelet as the kernel mapping function.
c) It may be noticed that the MFCC feature extraction technique performs better than the other two feature extraction methods used. The number of SVPs is higher for MFCC and differs only slightly from that of the other methods. The wavelet kernel mapping function is also observed to perform better than the other kernel mapping functions for the multi-class SVM classifier.
d) In Fig. 10, it may be noticed that, with the SOFM size fixed at 24x24 and MFCC as the feature extraction technique, the average recognition rates of OVA and HVH match for the polynomial and sigmoid kernel mapping functions, but OVA performs better for the RBF and wavelet functions.
e) Considering a SOFM size of 24x24 and wavelet as the kernel mapping function, it may be noticed from Fig. 11 that OVA has a better recognition rate with the different feature extraction techniques.
f) Finally, the best recognition rates obtained by OVA and HVH (considering a SOFM size of 24x24, wavelet as the kernel mapping function and MFCC as the feature extraction technique) are compared with our previous work on continuous voiced Odia digit recognition using an N-gram model [10], as depicted in Fig. 12. It is observed that the OVA and HVH classifiers have a better average recognition rate than our previous work, which was implemented using a Hidden Markov Model (HMM) with a trigram language model approach.

Fig. 12. Comparison of ARR of OVA and HVH (considering the best values obtained for MSVM-CVONR with a SOFM size of 24x24, the wavelet kernel function and MFCC as the feature extraction technique) with Continuous Voiced Odia Digit Recognition (CVODR, implemented using HMM with a trigram approach)

5. CONCLUSION

Speech signals of Indian languages contain complex features and are sensitive to varying accents, so greater emphasis is needed in order to construct the desired ASR systems. In this work, we have demonstrated the use of a multi-class SVM classifier for recognition of continuous voiced Odia numerals. We have considered PLP, LPC and MFCC as the parameter extraction techniques for the input voiced numerals, with different kernel mapping functions for the SVM classifier. The experimentation has also been carried out with different SOFM array sizes. The experimental results reveal that the OVA SVM classifier has a better recognition rate than the HVH SVM classifier. Further, analysis of the reported results shows that the wavelet-kernel multi-class OVA SVM classifier with a SOFM array size of 24x24 attains better accuracy than the other kernel mapping functions and the other SOFM array sizes. We expect these outcomes will encourage further exploration of Odia ASR. In future, other parameter extraction methods along with an SVM-HMM hybrid model can be considered for Odia numeral recognition. Further experiments might be conducted with different frame sizes for the input voiced signals. We suggest a more robust classifier, which can aid in achieving better performance. In addition, the work can be extended to use a deep learning model, which may achieve higher performance on a larger data set.

REFERENCES

[1] L. R. Rabiner and B. H. Juang, "Fundamentals of Speech Recognition," Prentice-Hall International, New Jersey, 1993.
[2] G. Chen, C. Parada and G. Heigold, "Small-footprint Keyword Spotting using Deep Neural Networks," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2014, pp. 4087–4091.

[3] P. Mohanty and A. K. Nayak, "Design of an Odia Voice Dialer System," 2019 5th National language (5th NLC-2019) IOSR at Ravenshaw University, Cuttack, Odisha, 04-06 February 2019.
[4] J. H. Martin and D. Jurafsky, "Speech and language processing: An Introduction to natural language processing, computational linguistics, and speech recognition," Pearson / Prentice Hall, Upper Saddle River, 2009.
[5] Y. Bengio, "Neural Network for Speech and Sequence Recognition," International Thomson Computer Press, London, 1996.
[6] E. Trentin and M. Gori, "A survey of hybrid ANN/HMM models for Automatic Speech Recognition," Neurocomputing, 2001, Vol-37, Issue 1-4, pp. 91–126.
[7] A. Osowska and S. Osowski, "Voice Command Recognition Using Statistical Signal Processing and SVM," in International Work-Conference on Artificial Neural Networks, Springer, 2019, pp. 65–73.
[8] J. Manikandan and B. Venkataramani, "Study and Evaluation of a Multi-class SVM classifier using diminishing learning rate," Neurocomputing (Elsevier), 2010, Vol-73, pp. 1676–1685.
[9] P. Prajapati and M. Patel, "A Survey on Isolated Word and Digit Recognition using Different Techniques," International Journal of Computer Applications (IJCA), 2017, Vol-161, Issue-3, pp. 6–15.
[10] P. Mohanty and A. K. Nayak, "N-Gram Language Model based Continuous Voiced Odia Digit Recognition," International Journal of Recent Technology and Engineering (IJRTE), 2019, Vol-8, Issue-2, pp. 4565–4574.
[11] S. Mohanty and B. K. Swain, "Markov Model Based Oriya Isolated Speech - An Emerging Solution for Visually Impaired Students in School and Public Examination," International Journal of Computer & Communication Technology (IJCCT), 2010, Vol-2, Issue-2, pp. 1–5.
[12] S. Mohanty and B. K. Swain, "Speaker Identification using SVM during Oriya Speech Recognition," International Journal Image, Graphics and Signal Processing, 2015, Vol-7, Issue-10, pp. 28–36.
[13] P. Mohanty and A. K. Nayak, "Isolated Odia Digit Recognition Using HTK: An Implementation View," 2nd International Conference on Data Science and Business Analytics (ICDSBA), 2018, pp. 30–35.
[14] P. Purnima and B. Shardav, "Automatic Speech Recognition of Gujarati Digits Using Dynamic Time Warping," International Journal of Engineering and Innovative Technology (IJEIT), 2014, Vol-3, pp. 69–73.
[15] V. Trivedi, "A Survey On English Digit Speech Recognition Using HMM," International Journal of Science and Research (IJSR), 2013, Vol-2, Issue-3, pp. 247–253.
[16] H. Ali, A. Jianwei and K. Iqbal, "Automatic speech recognition of Urdu digits with optimal classification approach," International Journal of Computer Applications, 2015, Vol-118, No-9, pp. 1–5.
[17] F. G. Barbosa and W. L. S. Silva, "Support vector machines, Mel Frequency Cepstral Coefficients and the Discrete Cosine Transform applied on voice based biometric authentication," SAI Intelligent Systems Conference (IntelliSys), London, 2015, pp. 1032–1039.
[18] C. Kurian, Firoz Shah A. and K. Balakrishnan, "Isolated Malayalam digit recognition using Support Vector Machines," International Conference on Communication, Control and Computing Technologies (ICCCCT), 2010, pp. 692–695.
[19] S. Hegde, K. K. Achary and S. Shetty, "Isolated Word Recognition for Kannada Language Using Support Vector Machine," Communications in Computer and Information Science, 2012, Vol-292, pp. 262–269.
[20] B. P. Das and R. Parekh, "Recognition of Isolated words using features based on LPC, MFCC, ZCR and STE with Neural Network Classifiers," International Journal of Modern Engineering Research (IJMER), 2012, Vol-3, Issue-3, pp. 854–858.
[21] T. Mittal and R. K. Sharma, "Multiclass SVM based Spoken Hindi numerals recognition," The International Arab Journal of Information Technology, 2015, Vol-12, Issue-6A, pp. 666–671.
[22] D. Gupta, P. Bansal and K. Choudhary, "The State of the Art of Feature Extraction Techniques in Speech Recognition," Speech and Language Processing for Human-Machine Communications, 2018, Vol-664, pp. 195–207.
[23] K. D. Returi, V. M. Mohan and Y. Radhika, "A Novel Approach for Speaker Recognition by Using Wavelet Analysis and Support Vector Machines," in Proceedings of the Second International Conference on Computer and Communication Technologies, Springer, New Delhi, 2016, pp. 163–174.
[24] J. Galić, B. Popović and D. Š. Pavlović, "Whispered Speech Recognition using Hidden Markov Models and Support Vector Machines," Acta Polytechnica Hungarica, 2018, Vol-15, No-5, pp. 11–29.
[25] N. Dave, "Feature extraction methods LPC, PLP and MFCC in speech recognition," International journal for advance research in engineering and technology, 2013, Vol-1, Issue-6, pp. 1–4.
[26] C. Bechetti and L. Ricotti, "Speech recognition theory and C++ implementation," John Wiley & Sons, Ltd, 1999, pp. 125–137.
[27] D. B. Hanchate, M. Nalawade, M. Pawar, V. Pophale and P. K. Maurya, "Vocal digit recognition using artificial neural network," 2nd International Conference on Computer Engineering and Technology, 2010, Apr 16, Vol. 6, pp. 88–91.
[28] C. Cornaz, U. Hunkeler and V. Velisavljevic, "An Automatic Speaker Recognition System," Switzerland: Lausanne, 2003. Retrieved from: http://read.pudn.com/downloads60/sourcecode/multimedia/audio/209082/asr_project.pdf
[29] Simon Haykin, "Neural Networks, Second Edition," Prentice-Hall of India, 2003.
[30] Edwin K. P. Chong and S. H. Zak, "An introduction to optimization," John Wiley & Sons, 2013, Vol. 76.
[31] N. Cristianini and J. Shawe-Taylor, "An introduction to support vector machines and other kernel-based learning methods," Cambridge University Press, 2000.


[32] M. T. Hagan, H. B. Demuth and M. Beale, "Neural network design," PWS Publishing Co., 1997.
[33] Z. Huang and A. Kuh, "A combined self-organizing feature map and multilayer perceptron for isolated word recognition," IEEE Transactions on Signal Processing, 1992, Vol-40, Issue-11, pp. 2651–2657.
[34] Hansheng Lei and Venu Govindaraju, "Half-against-half multi-class support vector machines," Springer Book Series, Lecture Notes in Computer Science, 2005, pp. 156–164.
[35] N. Panda and S. K. Majhi, "How Effective is Salp Swarm Algorithm in Data Classification," 1st International Conference on Computational Intelligence in Pattern Recognition (CIPR), 2019, August, pp. 579–588.
