
Research Article Volume 6 Issue No. 4

Khasi Dialects Identification Based on Gaussian Mixture Model
Arjunasor Syiem, Gaurab Krishnan Deka, Tanvira Ismail, L. Joyprakash Singh
Department of ECE, Speech & Image Processing Laboratory, NEHU, Shillong, India

Abstract: Dialect Identification (DID) has recently gained substantial interest in the field of speech processing. Automatic dialect identification refers to the task of identifying the dialect spoken in a particular utterance by a speaker. The dialects of a language normally differ in their phoneme space, word pronunciation and selection, and prosodic traits. This motivates the development of a Dialect Identification system for two Khasi dialects, Khynriem and Bhoi-Jirang. Around 2 hours of read speech data were collected separately for each dialect for training and testing the system. The speech data for each dialect came from 10 male and 8 female speakers. Features were extracted using MFCC and then modeled using GMM. For testing the DID system, we used 140 utterances of each dialect. The dialect whose model gives the test utterance the higher score is taken as the identified dialect.

Index Terms: Acoustic likelihood, Mel-frequency cepstral coefficients (MFCC), Matlab, GMM

I. INTRODUCTION

Speech is the most common and natural means of communication among humans. When a language is used by people from different regions, differences in word usage and lexis can be observed, and even when speakers use a standard form of a word, differences in the spectral properties of the sound produced can be observed [3]. Advances in technology and the need for access to massive online resources have made human-machine communication essential in everyday life. It will help extend the use of information technology to the population who are not well acquainted with the peripheral devices of computers. For an automatic speech recognition system of any language with several dialects, the performance is highly dependent upon the variability captured during the training of the system.

Khasi is an Austroasiatic language spoken primarily in the state of Meghalaya in India by the Khasi community. Khasi is a recognised language of the 6th Schedule. Khasi and Garo are the dominant languages of the Indian state of Meghalaya. Khasi is also spoken by a number of people in the hill districts of Assam bordering Meghalaya and by a sizable population of people living in Bangladesh, close to the Indian border [1]. Khasi itself is divided into numerous so-called dialects, such as Pnar or Synteng, Khynriem, Lyngngam, Amwi and Bhoi. To avoid collecting a prohibitively large dataset covering all existing Khasi dialects, we focus on only two specific Khasi dialects, viz. the Khynriem and the Bhoi-Jirang dialects. The Khynriem dialect has generally been considered the standard dialect of the Khasi language, whereas the Bhoi-Jirang dialect is mainly spoken in Ri Bhoi district and some parts of West Khasi Hills.

Every dialect has its own characteristics. Based on the following characteristics we are able to differentiate the dialects:

Quality of Voice: A speaker's speech is considered normal and good when it possesses good voice quality, that is, when it is pleasant and intelligible.

Pitch: It is the fundamental frequency of the vocal cords. A speaker should use his optimum pitch, where output is maximum with the least vocal effort. That means he should not deviate from his optimum pitch. Pitch should not be too high, too low, monotonous or stereotyped.

Intonation: Intonation means variation or fluctuation of pitch during delivery of speech. Speech is considered to have good characteristics when there is proper use of intonation or inflexion (tone). This means a speaker should not have a limited pitch range in his speech and should not be monotonous. A normal adult should have a pitch range of one and a half octaves for males or two octaves for females.

Loudness: The speaker's loudness should be normal, i.e. in a range of 40-80 dB. Loudness should not be microphonic or macrophonic.

Rhythm: It refers to the easy and smooth flow of speech, or continuity of speech. A good speaker's easy flow of speech can be observed during his delivery of speech, which means he should not have any struggle or effortful speech, and he neither repeats, hesitates, pauses nor stops in syllables.

Stress: It refers to extra pressure given to a particular syllable during speech. A speaker is considered to have good speech if he does not need to lay stress to make his speech clear.

Articulation: The process of production of a single speech sound is called articulation. A speaker should utter speech sounds properly. Intelligibility of speech depends on proper articulation.

Flexibility: It is one of the most important characteristics of speech. Speech should be flexible depending on the situation. Flexibility of speech depends on the manner in which it is said.

Positive Feedback: A speaker should have positive feedback from the listeners, such as clapping. Then the speech is considered to have good characteristics.

Adequate Projection: The voice should be loud enough to reach the listeners. A speaker's voice becomes louder when there is no noise in the room. Hearing impaired children speak with a soft voice because they cannot perceive their own voice in the environment.

Correct Pronunciation: A speaker should have correct pronunciation in normal speech. If the articulation is not proper as per the phonemes of a word, then it is not called correct pronunciation.

Semantic Soundness: The appropriate words are selected and arranged in a proper manner to convey the meaning of what the speaker says. According to the context, the speaker puts all the words together to express an idea for listeners. While expressing it, the speaker follows the linguistic rules to frame sentences correctly so that the listeners understand.

Animation: When an individual speaks, he uses some kinds of body language and gestures to convey meaning. For example, instead of saying NO, people shake their head.

Context: An individual should speak to the topic, i.e. stay related to the subject. He should not deviate from his topic.

II. DATABASE DESCRIPTION

The corpus used for our study is a Khasi corpus. The utterances were digitized at an 8 kHz sampling rate. Two dialects are used in our study, based on two different districts in Meghalaya: East Khasi Hills District (Khynriem) and Ri Bhoi District (Bhoi-Jirang). Khynriem is one of the Khasi dialects spoken in Cherrapunji and commonly spoken in Shillong and its nearby region. This dialect has been used in schools, newspapers and other literary activities, and it has generally been considered the standard dialect of the Khasi language. Bhoi-Jirang is mainly spoken in West Khasi Hills, in Mawkyrwat block [4]. For each dialect, speech was collected from 10 male speakers and 8 female speakers. Speech data were collected by posing arbitrary questions to the speaker, such as to describe one's childhood, the history of the home town, details of the career, views on personal habits and so on. From each speaker, 10-20 minutes of speech was collected from the spontaneous responses to these questions. Altogether, for each dialect the duration of the speech collected is about 3 to 4 hours. Unlike reading study material or uttering small fixed text sentences, responses to general questions usually contain the natural accent of the language. For this reason, we have used the spontaneous responses to the questions as the speech material for the identification of dialects.

The speech data were recorded using a Zoom H4N Handy Portable Digital Recorder. All the speakers are in the age group of 25-50 years. The sessions were recorded on alternate days to capture the variability of the human speech production system. The recording was done in such a way that each speaker spoke in a natural way, and the entire speech database was recorded in a quiet room, without any obstacle in the recording path. The corpus has three different partitions, each organized for a specific task: the training folder is for training the dialect models, the development folder is used for testing the dialect models and, finally, the evaluation folder is for reporting the final test accuracies of the system. For the purpose of this work, each of the speech utterances available has been partitioned into 15 msec length.

TABLE I: Summary of Khasi Corpus

Dialect       Training Data   Testing Data   Sum
Khynriem      350             140            490
Bhoi-Jirang   350             140            490

III. OVERVIEW OF THE DIALECT IDENTIFICATION SYSTEM

The acoustical analysis is the first step in the development of the Dialect Identification system. During the analysis, the signal is segmented into successive frames of 25 ms with a frame shift of 10 ms. Each frame is then multiplied by a Hamming window. A vector of acoustic coefficients giving a compact representation of the spectral features is extracted from each windowed frame.

Fig. 1: Training phase of the Dialect Identification System

Fig. 2: Identification phase of the Dialect Identification System

During the "recognition phase" of the Dialect Identification system, feature vectors extracted from a test utterance are compared to the "acoustic model", and the "Acoustic Likelihood" score of the test utterance is calculated.

IV. FEATURE EXTRACTION TECHNIQUE USED IN DID SYSTEM

Theoretically, it should be possible to recognize speech directly from the digitized waveform. However, because of the large variability of the speech signal, it is better to perform some feature extraction that reduces that variability. In particular, feature extraction eliminates various sources of information, such as whether the sound is voiced or unvoiced and, if voiced, the effect of the periodicity or pitch, the amplitude of the excitation signal, the fundamental frequency and so on. The most dominant method used to extract features from the speech signal is Mel-Frequency Cepstral Coefficients (MFCC). MFCC is a frequency domain feature representation which is a lot more accurate than time domain characteristics [7], [8]. In this system, the speech signal is segmented into frames of 20 to 30 ms, with a frame shift of 10 ms so that the frames overlap each other. This overlapping of the frames makes a smooth transition from one frame to another. Each frame is then multiplied by a Hamming window function to eliminate the discontinuities at the edges. After windowing, the DFT is computed for each frame. Then, a logarithmic Mel-scaled filter is applied to the Fourier transform coefficients. The relationship between the frequency of speech and the Mel scale is commonly written as:

$$ \mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) $$

where $f$ is the frequency in Hz.

The next step is to compute the Discrete Cosine Transform (DCT) of the filter bank output. This gives the MFCC for each frame. This set of coefficients is called an acoustic vector and is very useful for further analysis and processing. The overall procedure of MFCC feature extraction is shown in Fig. 3.

V. GAUSSIAN MIXTURE MODEL

Mixture models are a type of density model comprising a number of component functions, usually Gaussian [6]. These component functions are combined to provide a multimodal density. The GMM attempts to model the probability density of an N-dimensional random vector $x$ as a weighted combination of multivariate Gaussian densities:

$$ p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, b_i(x) $$

where $b_i(x)$ is the $i$-th multivariate Gaussian density with mean $\mu_i$ and covariance $\Sigma_i$, and the mixture weights $w_i$ sum to one.

In a GMM-based dialect identification system, each dialect is modeled by an $M$-th order GMM. The model parameters $\lambda_d$ for dialect D are estimated with an Expectation-Maximization (EM) algorithm from the spectral features extracted from a collection of speech utterances spoken in dialect D. The required GMM parameters are obtained by maximum likelihood training, in which each EM iteration satisfies:

$$ p(X \mid \bar{\lambda}_d) \ge p(X \mid \lambda_d) $$

where $\lambda_d$ is the current model parameter and $\bar{\lambda}_d$ is the new model parameter.
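As an illustration of the EM training described above, the following is a minimal sketch of one re-estimation step for a diagonal-covariance GMM, written in plain Python. The diagonal covariance, the variance floor, the function names and the toy two-dimensional data are all assumptions made for this example; none of them is taken from the system reported here. The sketch records the average log-likelihood of the training frames after every iteration, which should not decrease.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at the vector x."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def component_logs(x, weights, means, variances):
    """log(w_i * b_i(x)) for every mixture component i."""
    return [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]

def avg_log_likelihood(data, weights, means, variances):
    """Average per-frame log-likelihood of the data under the mixture."""
    total = 0.0
    for x in data:
        logs = component_logs(x, weights, means, variances)
        top = max(logs)  # log-sum-exp trick for numerical stability
        total += top + math.log(sum(math.exp(v - top) for v in logs))
    return total / len(data)

def em_step(data, weights, means, variances, var_floor=1e-3):
    """One EM iteration; returns the new (weights, means, variances)."""
    dim = len(data[0])
    # E-step: posterior probability of each component given each frame.
    resp = []
    for x in data:
        logs = component_logs(x, weights, means, variances)
        top = max(logs)
        probs = [math.exp(v - top) for v in logs]
        norm = sum(probs)
        resp.append([p / norm for p in probs])
    # M-step: re-estimate the parameters from the posteriors.
    new_w, new_m, new_v = [], [], []
    for k in range(len(weights)):
        nk = max(sum(r[k] for r in resp), 1e-12)  # guard against empty components
        mean = [sum(r[k] * x[d] for r, x in zip(resp, data)) / nk
                for d in range(dim)]
        var = [max(sum(r[k] * (x[d] - mean[d]) ** 2
                       for r, x in zip(resp, data)) / nk, var_floor)
               for d in range(dim)]
        new_w.append(nk / len(data))
        new_m.append(mean)
        new_v.append(var)
    return new_w, new_m, new_v

# Toy 2-D "feature vectors" forming two clusters (illustrative data only).
data = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [0.1, 0.2],
        [5.0, 5.1], [5.2, 4.9], [4.9, 5.0], [5.1, 5.2]]
weights = [0.5, 0.5]
means = [[1.0, 1.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
history = [avg_log_likelihood(data, weights, means, variances)]
for _ in range(10):
    weights, means, variances = em_step(data, weights, means, variances)
    history.append(avg_log_likelihood(data, weights, means, variances))
```

In practice one such model would be trained per dialect on MFCC vectors, with the number of components chosen experimentally.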

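During the identification step, an unknown speech utterance $X = \{x_1, \dots, x_T\}$ is classified according to the average log-likelihood produced by each dialect model $\lambda_d$, which is given by:

$$ L(X \mid \lambda_d) = \frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid \lambda_d) $$

where $p(x_t \mid \lambda_d)$ is the GMM density of dialect D evaluated at frame $x_t$ and $T$ is the number of frames in the utterance.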
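The identification rule can be sketched as follows: score the test frames against each dialect model by average log-likelihood and pick the dialect with the higher score. This is a hedged illustration in plain Python; the dictionary layout of a model, the function names, the diagonal covariances and the single-component toy models are hypothetical and stand in for trained GMMs.

```python
import math

def frame_log_density(x, model):
    """log p(x | lambda) for one feature vector under a diagonal GMM."""
    logs = []
    for w, mean, var in zip(model["weights"], model["means"], model["variances"]):
        quad = sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                   for xi, m, v in zip(x, mean, var))
        logs.append(math.log(w) - 0.5 * quad)
    top = max(logs)  # log-sum-exp trick for numerical stability
    return top + math.log(sum(math.exp(v - top) for v in logs))

def average_log_likelihood(frames, model):
    """Average of the per-frame log densities over the utterance."""
    return sum(frame_log_density(x, model) for x in frames) / len(frames)

def identify(frames, models):
    """Return the best-scoring dialect name and the score of every dialect."""
    scores = {name: average_log_likelihood(frames, m) for name, m in models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical toy models; real models would come from EM training on MFCCs.
models = {
    "Khynriem": {"weights": [1.0], "means": [[0.0, 0.0]], "variances": [[1.0, 1.0]]},
    "Bhoi-Jirang": {"weights": [1.0], "means": [[3.0, 3.0]], "variances": [[1.0, 1.0]]},
}
test_frames = [[0.2, -0.1], [0.1, 0.3], [-0.2, 0.0]]
label, scores = identify(test_frames, models)
```

Averaging over frames keeps the score comparable across utterances of different lengths.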

VI. EXPERIMENTAL RESULT

We used 140 utterances of each dialect to test the performance of the Dialect Identification system. The performance of a Dialect Identification system is determined by the identification rate (IDR). The dialect whose model gives the test utterance the higher "Acoustic Likelihood" score is considered the identified dialect. The error rate is the number of test utterances that are falsely identified divided by the total number of test utterances. For a given dialect D, the IDR is defined as:

$$ \mathrm{IDR} = \frac{n}{N} \qquad (3) $$

where $n$ is the number of correctly identified utterances in dialect D and $N$ is the total number of utterances in dialect D.

Fig. 3: Steps involved in MFCC Feature Extraction
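The MFCC steps summarized in Fig. 3 can be sketched in plain Python as follows. This is a minimal illustration and not the implementation used for the reported results. The 8 kHz sampling rate, 25 ms frames, 10 ms shift and Hamming window come from the text; the mel-scale constants (2595 and 700), the 20 triangular filters, the 13 cepstral coefficients, the function names and the synthetic 440 Hz test tone are assumptions made for the example. A naive DFT keeps the sketch free of dependencies; an FFT would normally be used.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def frame_signal(signal, rate, frame_ms=25, shift_ms=10):
    """Split the signal into overlapping frames and apply a Hamming window."""
    size = int(rate * frame_ms / 1000)
    shift = int(rate * shift_ms / 1000)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (size - 1))
              for n in range(size)]
    return [[s * w for s, w in zip(signal[start:start + size], window)]
            for start in range(0, len(signal) - size + 1, shift)]

def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spec.append(abs(acc) ** 2 / n)
    return spec

def mel_filterbank(n_filters, n_bins, rate):
    """Triangular filters spaced evenly on the mel scale from 0 Hz to rate/2."""
    top = hz_to_mel(rate / 2)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = (rate / 2) / (n_bins - 1)
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for b in range(n_bins):
            f = b * bin_hz
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(frame, rate, n_filters=20, n_ceps=13):
    """MFCC vector of one windowed frame: DFT, mel filterbank, log, DCT."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(spec), rate)
    log_energy = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
                  for row in bank]
    # DCT-II of the log filterbank energies gives the cepstral coefficients.
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, e in enumerate(log_energy))
            for c in range(n_ceps)]

# Synthetic 0.1 s, 440 Hz tone at 8 kHz (illustrative input only).
rate = 8000
signal = [math.sin(2 * math.pi * 440 * n / rate) for n in range(800)]
frames = frame_signal(signal, rate)
coeffs = mfcc(frames[0], rate)
```

Each frame yields one acoustic vector, and the sequence of vectors for an utterance is what the GMM is trained on or scored against.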

TABLE II: Experimental Results of the Dialect Identification System for Correct Detection

Sl. no.   Dialect       Accuracy obtained using Acoustic Likelihood
1         Khynriem      99%
2         Bhoi-Jirang   96%

Table II shows the accuracy for the two dialects using the Gaussian Mixture Model. It is clear from Table II that the accuracy obtained for the Khynriem dialect is better than that for the Bhoi-Jirang dialect. The gap could be narrowed by collecting a dataset of targeted audio samples and manually annotating each one. Therefore we can conclude that the two dialects are identified with a very low error rate.

VII. SUMMARY AND CONCLUSION

In this paper, we have shown that two colloquial Khasi dialects (Khynriem and Bhoi-Jirang) can be distinguished using a Gaussian Mixture Model with good accuracy. The two Khasi dialects considered in this paper are Khynriem (spoken in East Khasi Hills District of Meghalaya) and Bhoi-Jirang (spoken in Ri-Bhoi District of Meghalaya and some parts of West Khasi Hills District in Mawshynrut block). Table II shows the accuracy for both the Khynriem dialect and the Bhoi-Jirang dialect, which was found to be 99% and 96% respectively. With this result, we are encouraged to increase our database and extend the work to more dialects of Khasi. Identifying more spectral features that capture details significant to Khasi dialects is also one of our future goals.

REFERENCES

[1] "The Khasi language is no longer in danger", United Nations Educational, Scientific and Cultural Organization, 2012-06-04. Retrieved 2012-09-29.

[2] Pedro A. Torres-Carrasquillo, Terry P. Gleason and Douglas A. Reynolds, "Dialect identification using Gaussian mixture model", Lincoln Laboratory, Massachusetts Institute of Technology.

[3] Shweta Sinha, Aruna Jain and S. S. Agrawal, "Acoustic phonetic feature based dialect identification in Hindi speech", International Journal on Smart Sensing and Intelligent Systems, vol. 8, no. 1, March 2015.

[4] S. K. Acharya, "Language of the Khasis", Mainstream, June 1971.

[5] Bin Ma, Donglai Zhu and Rong Tong, "Chinese dialect identification using tone feature based on pitch flux", Institute for Infocomm Research, Singapore, IEEE, 2006.

[6] Lachachi Nour-Eddine and Adla Abdelkader, "GMM-Based Maghreb Dialect Identification System", Journal of Information Processing Systems, vol. 11, March 2015.

[7] Namrata Dave, "MFCC and its applications in speaker recognition", International Journal on Emerging Technologies (received 5 Nov. 2009, accepted 10 Feb. 2010).

[8] Pratik K. Kurzekar, Ratnadeep R. Deshmukh, Vishal B. Waghmare and Pukhraj P. Shrishrimal, "A Comparative Study of Feature Extraction Techniques for Speech Recognition System", International Journal for Advance Research in Engineering and Technology, vol. 3, issue 12, December 2015.

[9] Vibha Tiwari, "Acoustic phonetic feature based dialect identification in Hindi speech", International Journal on Smart Sensing and Intelligent Systems, vol. 8, no. 1, March 2015.

International Journal of Engineering Science and Computing, April 2016 3885 http://ijesc.org/