Pattern Recognition Based Method Used in Identifying Impaired Speech
Recent Researches in Applied Informatics, ISBN: 978-1-61804-034-3

Valentin Velican, Ovidiu Grigore, Corina Grigore

Abstract: The article presents a method for identifying heavily impaired pronunciations of the consonant ‘r’ in Romanian using time-domain signal processing techniques. The study focuses on words that begin with ‘r’ and uses signals recorded mainly from children, since mispronunciations occur mostly at a young age.

Keywords: speech processing, impaired speech evaluation.

Manuscript received October 9, 2001. This work was supported by CNCSIS project no. 846/2009 and by CNCSIS project IDEI_1472/2008.
Valentin Velican is with Polytechnic University of Bucharest, Department of Applied Electronics and Information Engineering.
Ovidiu Grigore is with Polytechnic University of Bucharest, Department of Applied Electronics and Information Engineering (corresponding author, phone: +40(0)21.402.4897; fax: +40(0)21.402.4957; e-mail: ovidiu.grigore@upb.ro).
Corina Grigore is with Carol Davila University of Medicine and Pharmacy

I. INTRODUCTION

Persons with impaired pronunciation are often ignored when speech-processing systems are developed. There are numerous speech recognition, speaker recognition, voice coding, and voice control applications useful in many different fields, but few solutions exist for impaired speech identification. Furthermore, speech therapy is complicated and requires many visits to specialized clinics, which increases the need for an automated system that can be used at home.

More exactly, speech therapy practice is based on a correct diagnosis of the pronunciation problems and on their correction according to a precise methodology. The speech therapist needs a large amount of time to identify precisely not only the problematic sounds, but also the problematic phonemic combinations. The correct “emission” of the affected sounds is practiced both in isolation and in syllables and words. For this reason, a computer application that evaluates mispronunciations can ease the therapy and reduce the specialist's effort by offering a reference standard for home exercises.

Unfortunately, automated methods are sparse and still in the research stage, mostly because of the characteristics of the sounds analyzed and the large number of existing pronunciation defects [1], [2]. Because pronunciation differs from language to language, research in this field is inevitably divided, and results from existing experiments or systems dedicated to a given language [3], [4] are not necessarily useful in other languages. A final product capable of correcting all existing mispronunciations is almost impossible to implement, as most likely each specific defect needs its own identification and correction method.

We therefore concentrated our efforts on developing a method that can be applied to correcting mispronunciations of the consonant ‘r’ in Romanian. We chose to study this defect, also known as rhotacism, because it is one of the most common in Romanian.

II. PROBLEM FORMULATION

Mispronunciations occur in most cases on consonants, sounds that contain less energy than vowels. The main problem encountered is that most signal processing techniques developed for speech work best on signals that carry a good amount of energy, i.e. vowels [5], [6]. Finding common features for the correctly pronounced phoneme becomes rather difficult because the sounds analyzed in our case are consonants. The incorrectly pronounced phonemes are even harder to evaluate: in addition to being altered consonants, they fall into several types (all probably with subtypes defined by different degrees of variation from the standard), which decreases the chances of finding a common feature that could simplify the final evaluation and classification algorithm.

III. PROBLEM SOLUTION

In Romanian, the ‘r’ phoneme is a hard, rhotic consonant and also one of the most commonly mispronounced sounds. The defect is generally known as “rhotacism” and can be observed from young children to adults. In the worst utterances the ‘r’ is replaced by other sounds such as ‘l’ or ‘î’, or is missing from the word entirely [7]. In the mild, linguistically acceptable mispronunciations, the phoneme is replaced with a guttural ‘r’, resembling the pronunciation of ‘r’ in French.

A. Database of Recordings

The database used in the study was recorded from 7 children and 5 adults pronouncing words starting with ‘r’, such as rac, rană, ramă, rață, etc. (in English: crab, wound, frame, duck). Care was taken to collect sounds that have the same “starting consonant, vowel” combination, because the transition from the consonant to the vowel influences the wave shape of the recorded signal [2]. From the entire batch, 30 different recordings (15 correct and 15 incorrect pronunciations) were selected to test the method. It is also important to mention that all of the adults had a correct pronunciation, while only two children pronounced the words correctly.

B. Feature Extraction

The core of this study is the feature extraction method. Studying the shape of the signal in the time domain, it was observed that a correctly pronounced ‘r’ closely resembles an amplitude-modulated signal with an envelope frequency of about 25-30 Hz. In contrast, all other phonemes, and of course the incorrectly pronounced ‘r’, lack this AM shape and in most cases resemble vowels, though with less amplitude (Fig. 1). A good feature by which correctly pronounced ‘r’ phonemes differ from their incorrect counterparts is therefore the shape of the signal in the time domain. More exactly, the clearest difference between these two classes is the signal's envelope.

Figure 1. Correctly (above) and incorrectly (below) pronounced ‘r’ in ‘rac’. The samples belong to kindergarten children.

Extracting the signal's envelope in software can be done in several ways: using the Hilbert transform, squaring the signal and then filtering the result, etc. For this study we have chosen a simpler approach: extracting the maximum signal value in each of N intervals of length L samples, where

$$N = \frac{\mathrm{length}(\mathrm{Signal})}{L} \tag{1}$$

and then linearly interpolating the resulting points, knowing that a straight line that contains $P_1(x_1, y_1)$ and $P_2(x_2, y_2)$ satisfies the equation

$$(y - y_1) = \frac{y_2 - y_1}{x_2 - x_1}\,(x - x_1) \tag{2}$$

The result of applying the above “envelope detector” to both a correct and an incorrect signal can be seen in Fig. 2. It is important to notice that, because of the interpolation, the processed signals have the same number of samples as the original ones.

Figure 2. Envelope extraction of a correctly (above) and an incorrectly (below) pronounced ‘r’ in ‘rac’. The graph is plotted in samples versus amplitude.

To further refine the signals and reduce a potential error in the classification process caused by differences in amplitude, the extracted envelopes are normalized to one by dividing each sample of the analyzed signal by the maximum sample value of that signal.

At this point, as expected, the envelope of the correctly pronounced signals can be described as sinusoidal, whereas in the other case the envelope's amplitude is constant or changes slowly in time. To use this observed characteristic, we first compute a moving average of the envelope and then square the difference between the envelope and the mean vector just computed. This step is described by the following expressions:

$$\mathrm{vectAvg}(k) = \frac{1}{M} \cdot \sum_{u=k}^{M-1+k} \mathrm{Envelope}(u), \qquad k = 1, 2, 3, \ldots \tag{3}$$

Note that

$$k = \mathrm{length\_Envelope} - M + 1 \tag{4}$$

is the last index for which the averaging window of length M still fits inside the envelope vector; for larger k the window overshoots the end of the vector. In this case we decided to reduce the length of the window (M) by one unit for every increase of k, such that the last M - 1 values of vectAvg are computed using M - 1, M - 2, ..., 1 values from the envelope.

Then:

$$\mathrm{vectFeat}(k) = \left(\mathrm{vectAvg}(k) - \mathrm{Envelope}(k)\right)^2, \qquad k = 1, 2, 3, \ldots \tag{5}$$

Going a bit more into detail, (3) defines the continuous (DC) component of the signal, i.e. its mean value. Fig. 3 presents this next step in processing the signals of Fig. 2. The amplitude variation of the correct pronunciations now becomes even more evident compared to the incorrect case, because (3) smooths over the imperfections generated by the “envelope detector” and filters out most of the spikes in the signal. Equation (5) defines the alternating (AC) component of the signal or, better said, a measure of how much the envelope varies in amplitude with time [8] (Fig. 4).

Figure 3. Continuous component extraction from the envelope of a correctly (above) and an incorrectly (below) pronounced ‘r’ in ‘rac’. Notice the normalization to one.

Naturally, correctly pronounced samples of /r/ will have higher values in vectFeat than wrong pronunciations, so vectFeat becomes the needed feature. However, vectFeat is still a large array, so running the classifier at this stage is not computationally efficient. Moreover, input signals have different lengths and therefore generate feature arrays of different lengths, which leads to an undesirable situation: the classifier has to test arrays of different lengths. A feature selection stage is therefore implemented by dividing each vectFeat in the database into INT intervals of equal length sizeINT and calculating the mean on every such interval:

$$\mathrm{vectMean}(i) = \frac{1}{\mathrm{sizeINT}} \sum_{k=(i-1)\cdot \mathrm{sizeINT}+1}^{i \cdot \mathrm{sizeINT}} \mathrm{vectFeat}(k), \qquad i = 1, 2, \ldots, \mathrm{INT} \tag{6}$$

vectMean is now a feature array of equal length for any input signal that condenses the information of vectFeat into far fewer coefficients, which makes it an excellent input for the classification stage (Fig. 5).

[…] right. The majority of these distances, if belonging to class 1, will decide that the current analyzed data belongs to the same class. Otherwise, the contrary will be decided.
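The feature extraction chain of equations (1) to (6) can be illustrated with a short sketch. This is not the authors' implementation: the function names, the use of NumPy, the placement of each block maximum at the block center, the dropping of trailing samples that do not fill a whole interval, and the synthetic test signals (8 kHz sampling rate, 400 Hz carrier, 27 Hz modulation) are all assumptions made only for illustration.

```python
import numpy as np

def extract_envelope(signal, L):
    """Block-maximum envelope, eqs. (1)-(2), normalized to one."""
    x = np.asarray(signal, dtype=float)
    N = len(x) // L                          # eq. (1); remainder dropped (assumption)
    peaks = x[:N * L].reshape(N, L).max(axis=1)
    centers = (np.arange(N) + 0.5) * L       # assumption: peak placed at block center
    # Piecewise-linear interpolation back to the original number of samples.
    env = np.interp(np.arange(len(x)), centers, peaks)
    return env / env.max()                   # assumes a non-zero maximum

def moving_average(env, M):
    """Forward moving average, eq. (3), with a window that shrinks at the end."""
    n = len(env)
    csum = np.concatenate(([0.0], np.cumsum(env)))
    starts = np.arange(n)
    ends = np.minimum(starts + M, n)         # window truncated past the last sample
    return (csum[ends] - csum[starts]) / (ends - starts)

def squared_deviation(env, M):
    """vectFeat of eq. (5): squared distance between envelope and its average."""
    return (moving_average(env, M) - env) ** 2

def interval_means(vect_feat, n_intervals):
    """vectMean of eq. (6): mean of vectFeat over equal-length intervals."""
    size = len(vect_feat) // n_intervals     # sizeINT; remainder dropped (assumption)
    trimmed = np.asarray(vect_feat)[:size * n_intervals]
    return trimmed.reshape(n_intervals, size).mean(axis=1)

def rhotic_features(signal, L, M, n_intervals):
    """Full chain: envelope -> squared deviation -> interval means."""
    env = extract_envelope(signal, L)
    return interval_means(squared_deviation(env, M), n_intervals)

# Synthetic stand-ins for recordings (hypothetical values, not the paper's data).
fs = 8000
t = np.arange(1600) / fs
carrier = np.sin(2 * np.pi * 400 * t)
am_like = (1 + 0.8 * np.sin(2 * np.pi * 27 * t)) * carrier   # "correct r"-like
flat_like = 0.5 * carrier                                     # constant envelope

features_am = rhotic_features(am_like, L=40, M=300, n_intervals=8)
features_flat = rhotic_features(flat_like, L=40, M=300, n_intervals=8)
```

Under these assumptions both signals produce a feature array of the same length (8 coefficients), and the amplitude-modulated signal yields clearly larger values than the constant-envelope one, which is the separation the text describes. The window length M and interval count are free parameters here; the source does not state the values the authors used.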