Development of Rules for Unlimited Text To Speech Synthesis of Hindi

S.S. Agrawal*, Rajesh Verma, Shailendra Nigam, Anuradha Sengar
Central Electronics Engineering Research Institute Centre, CSIR Complex, NPL Campus, Dr. K.S. Krishnan Road, New Delhi 110012
*E-Mail: [email protected]

Abstract: This paper describes the development of linguistic and acoustic rules for implementation in an unlimited Text To Speech Synthesis (TTS) system for Hindi. Klatt's cascade/parallel formant synthesizer has been simulated on a PC for synthesizing high quality Hindi and other Indian spoken languages. Parametric files of commonly spoken syllables have been created as the basic units of sound for concatenation purposes. The input module can take text from the keyboard in Indian Standard Code for Information Interchange (ISCII) format. Pre-processing is done for expanding abbreviations, numerals etc. into phonemic strings of words. The input word sequence is parsed to break the words into syllables, which are sent to the concatenator module. The parametric files of the corresponding syllables are picked up from the database and concatenation rules are applied to form words as smooth as possible. These parametric word files are then subjected to prosodic rules for introducing correct pitch and stress levels. Pitch variations are considered at three levels, i.e. word level, clause level and sentence level. The local peaks and valleys along with their amplitude levels are determined based on the phonological and syntactic information in a sentence. It is very important to correctly assign the pitch value and its rate of rise and fall at the appropriate syllables and at clausal breaks. Stressed syllables have 4 to 6 dB higher intensity compared to unstressed syllables. A Windows 98 based software model of the system is working satisfactorily and is continuously being improved to produce quality synthesis.

DESCRIPTION OF TTS FOR HINDI

The modules of the CEERI Text to Speech Synthesis system for Hindi are shown in Fig 1. Hindi text input in ISCII format is fed to the Text Preprocessor, which expands abbreviations, numerals, punctuation, special characters etc. At the next level the word parser breaks down the individual words into the basic units available in the database. The intonational and stress parser provides information regarding the position of pitch and stress variation at word and sentence level. In the present text to speech system, monosyllables have been chosen as the basic units of Hindi speech to generate an unrestricted vocabulary. The combination of the 29 most frequently occurring consonants and 10 vowels forms the core of our database. The concatenator merges the syllable data from the database to make raw word data. This data is then processed by the Voice Quality Manager (VQM), using the knowledge base consisting of durational, phonological and contextual rules, to improve the speech quality. The Synthesizer then uses this data to generate the speech.
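As an illustration of the word-parsing step described above, the sketch below greedily groups a phoneme string into CV and VC units. The phoneme inventory and the greedy split rule here are simplifying assumptions for illustration, not the actual CEERI parser tables.

```python
# Illustrative sketch of breaking a phonemic string into CV/VC syllable
# units for concatenative synthesis. The vowel set and the greedy split
# rule are assumptions, not the system's actual rules.

VOWELS = {"a", "i", "u", "e", "o"}          # assumed vowel set

def syllabify(phonemes):
    """Greedily group a phoneme list into CV and VC units."""
    units, i = [], 0
    while i < len(phonemes):
        two_left = i + 1 < len(phonemes)
        if two_left and phonemes[i] not in VOWELS and phonemes[i + 1] in VOWELS:
            units.append(phonemes[i] + phonemes[i + 1])   # CV unit
            i += 2
        elif two_left and phonemes[i] in VOWELS and phonemes[i + 1] not in VOWELS:
            units.append(phonemes[i] + phonemes[i + 1])   # VC unit
            i += 2
        else:
            units.append(phonemes[i])                     # lone segment
            i += 1
    return units

print(syllabify(list("kamal")))   # ['ka', 'ma', 'l']
```

A real parser would additionally tag clusters, geminates and nasalization, as the following section describes.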

DEVELOPMENT AND IMPLEMENTATION OF RULES

The word parser takes Hindi text (ISCII codes) as input and breaks it into monosyllables of CV and VC types, clusters and geminates. These syllables form the basic parametric database used in the Hindi TTS. Information regarding punctuation etc. is included so that appropriate silence can be introduced during synthesis. Similarly, information about the sentence being declarative or interrogative is included in the first word of the sentence for implementing the intonational contour. The syllable tags also carry information on whether the syllable is a simple /CV/ or /VC/, a cluster or a geminate, or nasalized, and on the junction type it forms with the succeeding syllable, so that the appropriate set of rules is applied while concatenating. This information helps to expedite the process of synthesis and to decide which category of rules has to be applied.

Fig 1. TTS block diagram

Pitch variations have been determined at three levels, i.e. word, clause and sentence. The F0 values for peaks and valleys, rises and falls at each of these levels are assigned and implemented. Content words in Hindi are usually stressed, and stress generally falls on the penultimate syllable.

Fig 2. Synthetic Speech from TTS

The concatenator generates raw speech data for words and sentences based on the input syllable string stream. The rules for speech quality improvement are implemented on this speech data. Smoothening of parameters at syllable junction boundaries is done to avoid jerks in the output speech.

In the present system, data for syllables containing short vowels such as /U/, /I/ etc. are generated by rules from the corresponding syllables with long vowels, by varying frequency and durational parameters using the relevant rule in the knowledge base.

A large number of general rules for improving the synthetic speech quality have been generated after detailed acoustic analysis in different contexts. The rules are based on the syllabic combinations that may occur in words. In the present system clusters and geminates are treated differently. The information regarding the presence of a cluster or a geminate in a word is incorporated at the parser level, and the rules pertaining to them are applied at the concatenation level. In the case of voiced geminates a predetermined duration of voice bar is inserted before the consonant, while for unvoiced geminates a predetermined duration of silence is inserted. For other consonants similar treatment is carried out. For clusters the parser not only sets a cluster flag but also converts every single consonant of the cluster into a /VC/ by prefixing a vowel to the consonant. This vowel is actually redundant and is used only to convert a single consonant into an available syllable unit. Later, at the concatenation level, the data corresponding to this vowel is stripped off and only the consonant data is retained. Depending upon the consonants forming the cluster, their durations are adjusted as per rule.

CONCLUSION

The present text to speech system for Hindi has been developed on a PC platform, using syllables as the basic units. These basic units are used to generate unrestricted Hindi text. In our system the quality of the basic units is of paramount importance, as it has a major impact on the quality of the speech output. The final speech output is quite intelligible. Since the development of a body of rules for bringing naturalness to the output synthetic speech is a continuing process and is largely language dependent, further studies are being carried out.

ACKNOWLEDGEMENT

The authors are grateful to Dr. Shamim Ahmed, Director CEERI, for his support and thankful to the MoIT for providing the financial assistance. They are also thankful to Prof. K. Stevens for his useful suggestions.

REFERENCES

[1] Agrawal S. S., and Stevens K., Proc. ICSLP-92, 177-180, 1992.
[2] Klatt D. H. and Klatt L., JASA, 87(2), 820-857, 1990.
[3] Verma R., Sarma A. S. S., Shrotriya N., Sharma A. K., Agrawal S. S., Proc. ICPhS-95, Stockholm, Vol. 2, 1995, p. 354-357.

The Investigation of the Topoiyo Language in Indonesia and its Speech Data-base

H. Nakashima (a) and M. Yamaguchi (b)

(a) Faculty of Information Science, Osaka Institute of Technology, 1-79-1 Kitayama, Hirakata, Osaka 573-0196, Japan
(b) Faculty of International Language and Culture, Osaka, Japan

This study deals with the investigation and recording of the Topoiyo language, one of the endangered languages of Sulawesi Island, Republic of Indonesia.

INTRODUCTION

The Topoiyo language is now used in the inland area along the Budong-budong River in Mamuju District, which is in the South Sulawesi Province on Sulawesi Island, Republic of Indonesia. The number of its speakers is said to be between 500 and 1,000, and it is considered to belong genetically to the Kaili-Pamona Family [1],[2],[3]. The languages of the South Sulawesi family are spoken mostly in the South Sulawesi Province, and the languages studied there have been ones with a large population. In the Central Sulawesi Province, the languages which belong to the Kaili-Pamona Family are used, and the main languages there have been studied since the country was colonized by Holland. But the Kaili-Pamona languages and other languages with a small population which are used in the north of the South Sulawesi Province have not been investigated and studied so much.

We have conducted an investigation and have collected linguistic and phonetic data of the Topoiyo language, and we have constructed its speech data-base from the recorded speech materials.

INVESTIGATION OF TOPOIYO LANGUAGE AND ITS SPEECH DATA-BASE

We conducted an investigation there in September 2000 and found the following facts about the Topoiyo language and other languages with a small population around it. Topoiyo is used in the village of Topoiyo in Budong-budong Country in Mamuju District. The population of the village is about 3,000, and there are 100 households which consist of 400-500 Topoiyo people. The speakers of the language are mainly ones older than 40 years. We collected from the people around Topoiyo village data on the number of speakers of the following languages: Bada, about 1,200; Benggaulu, about 6,000; and Panasuan, about 800. No data were available about the Bana language.

The following is the data we gathered in the investigation:
(1) Topoiyo: more than 2,000 words including nouns, verbs, adjectives and adverbs, morphological and syntactic material, and some recordings of vocabularies and simple sentences.
(2) Benggaulu: about 300 basic words and some recordings.
(3) Bada: about 300 basic words and some recordings.
(4) Panasuan: about 300 basic words and some recordings.

We have constructed the speech data-base of the Topoiyo language from the recorded speech materials, and have made a CD-ROM from it.

PHONETIC AND ACOUSTICAL ANALYSIS OF TOPOIYO LANGUAGE

The following is the phonological system of the Topoiyo language.

VOWELS: The vowel system of Topoiyo consists of /i, e, a, o, u/ (/u/ and /o/ are rounded vowels). These vowels are distributed in word-initial, medial and final position.

CONSONANTS: The consonants of Topoiyo are shown in Table 1. The phonological features of the language as a member of the Kaili-Pamona Family are as follows:
(1) There are the nasal+consonant clusters mp, mb, nt, nd, nyc, nyj, ngk, and ngg. The clusters mp, mb, nt, nd and ngg exist word-initially. These word-initial nasal+consonant clusters don't exist in the languages of the South Sulawesi Family.
(2) We confirmed the existence of the voiced labio-dental /v/ from the acoustical analysis. This phoneme /v/ doesn't exist in the South Sulawesi Family.
(3) The final syllables of words are basically open syllables.

CONCLUSION

We have conducted the investigation of Topoiyo, which is an endangered language, and have constructed its speech data-base from the recordings of Topoiyo. We carried out phonetic, acoustic and morphonological analysis on the data we obtained, and we found that Topoiyo belongs to the Kaili-Pamona Family.

ACKNOWLEDGMENTS

This study is partly supported by the Grant-in-Aid for Scientific Research, Ministry of Education, Science and Culture in Japan. We thank Dr. M.L. Manda of Hasanuddin University in Indonesia for his great cooperation in this project.

REFERENCES

1. T. Friberg, South Sulawesi Sociolinguistic Surveys 1983-1987. SIL in Cooperation with the Department of Education and Culture, Ujung Pandang (1987).
2. C.E. Grimes and B.D. Grimes, Languages of South Sulawesi. The Australian National University, Canberra (1987).
3. B.F. Grimes, Ethnologue: Languages of the World. SIL, Dallas (1992).

Table 1. The consonants of Topoiyo.

              bilabial   labio-dental   dental-alveolar   palatal   velar   glottal
  stop        p, b                      t, d              c, j      k, g    q
  fricative              v              s                                   h
  nasal       m                         n                 ny        ng
  lateral                               l
  trill                                 r
  semivowel   w                                           y
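The word-initial cluster inventory reported in the section above can be checked against a wordlist with a sketch like the following. The sample words are invented placeholders, not Topoiyo field data.

```python
# Sketch of scanning a wordlist for word-initial nasal+consonant clusters,
# as reported for Topoiyo above. The example words are invented, not field data.

CLUSTERS = ["mp", "mb", "nt", "nd", "nyc", "nyj", "ngk", "ngg"]

def initial_clusters(words):
    """Return the sorted set of listed clusters attested word-initially."""
    found = set()
    for w in words:
        for c in sorted(CLUSTERS, key=len, reverse=True):  # match longest first
            if w.startswith(c):
                found.add(c)
                break
    return sorted(found)

print(initial_clusters(["mbaso", "ntani", "kita", "nggala"]))  # ['mb', 'ngg', 'nt']
```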

A Study on Acoustical Correlates to Adjective Ratings of Speaker Characteristics on Dynamic Aspects

Yasuki Yamashita (a) and Hiroshi Matsumoto (b)

(a) Nagano Prefectural Institute of Technology, 813-8 Shimonogo, Ueda-shi, Nagano 386-1211, Japan
(b) Shinshu University, 4-17-1 Wakasato, Nagano-shi, Nagano 380-0921, Japan

Abstract. For synthesizing voice quality expressed by adjectives, acoustical correlates to adjective ratings of speaker characteristics are investigated. First, two sentences uttered by each of 20 male and 19 female speakers were rated by 22 listeners on eight semantic-differential scales used in everyday life. This paper focuses on two scales relating to the dynamic aspect of articulation: "resting - busy" and "articulate - inarticulate". For each gender, several acoustic parameters of the top and bottom three speakers on each scale are compared. The results show that the "busy" voices have a significant correlation to the normalized standard deviation of logarithmic F0 per mora duration. On the other hand, "articulate" voices have a wider dispersion on the F1-F2 plane than "inarticulate" voices. At the phoneme level, it is observed that the formant loci of "inarticulate" voices tend to have slower movement than those of "articulate" ones.

INTRODUCTION

In order to synthesize speech with desired speaker characteristics, or to convert the voice quality of a speech signal to any specific one expressed by adjectives, it is required to control the acoustical parameters correlated with the values on the adjective scales. There are several studies on acoustical correlates to the static aspect of speaker characteristics [1,2]. We have also investigated the intercorrelation between the ratings of voice characteristics on 8 Japanese semantic-differential (SD) scales and their acoustic parameters [3]. However, the acoustical correlates to the dynamic aspect of speaker characteristics have not been well investigated, due to their variety.

This paper examines the relationship between the SD scales of two adjectives ("resting - busy" and "articulate - inarticulate") and the dynamics of acoustical parameters: F0 dispersion for "resting" and formant loci for "inarticulate".

SUBJECTIVE EVALUATION

The voice samples used are two sentences (28 and 33 morae) uttered by each of 20 male and 19 female speakers with standard dialect, who were selected from 50 speakers for each gender. These were then evaluated on eight adjective ratings [4] ("clear - hazy", "resting - busy", "powerful - weak", "young - old", "deep - shallow", "sharp - dull", "articulate - inarticulate" and "nasal - less nasal") by 22 listeners for male voices and 8 for female voices. Each listener rates the voice attribute of each voice sample on each pair of adjectives. The six-point rating scale ranges from 3 to -3, in which the values 1 to 3 correspond to the degrees "a little" to "very".

The rates for each speaker were averaged over the two sentences, listeners, and three trials. Fig. 1 shows the speaker distribution of the average rates on the two SD scales of the dynamic aspect of articulation: "resting - busy" and "articulate - inarticulate". The ranges of the male and female speaker distributions on each scale are mostly the same. Furthermore, the correlation between the rates of "resting" and "articulate" is significantly high (0.67 for male and 0.58 for female).

ACOUSTICAL CORRELATES OF DYNAMIC SPEAKER CHARACTERISTICS

Analysis method

In this study, twelve acoustical parameters are extracted: the means and standard deviations of the logarithmic fundamental frequency (F0, sd_F0), the first to third logarithmic formant frequencies (F1, F2, F3, sd_F1, sd_F2, sd_F3) and their bandwidths, and the speech rate in morae per second computed excluding pauses. The speech signal was digitized at a sampling frequency of 16 kHz, and analysis was performed with a frame length of 20 ms and a frame shift of 10 ms. Formant frequencies for each frame were extracted by means of a 12th-order Mel-LPC analysis followed by root solving, and the pitch frequency was estimated by the cepstrum method.
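A minimal numpy sketch of some of the statistics named above: the mean and standard deviation of log F0, the speech rate excluding pauses, and the normalized F0 deviation mentioned in the abstract. The input values are invented examples, and the exact F0 tracking is assumed to have been done already.

```python
import numpy as np

# Sketch of frame-level statistics: mean and standard deviation of log F0,
# speech rate excluding pauses, and the logarithmic F0 dispersion ratio per
# mora duration. The example input values are invented.

def f0_statistics(f0_hz, n_morae, voiced_dur_s):
    """f0_hz: voiced-frame F0 track; returns (mean, sd, rate, normalized sd)."""
    ln_f0 = np.log(np.asarray(f0_hz, dtype=float))
    mean_ln, sd_ln = ln_f0.mean(), ln_f0.std()
    speech_rate = n_morae / voiced_dur_s          # morae per second, pauses excluded
    normalized = (sd_ln / mean_ln) / speech_rate  # dispersion ratio per mora duration
    return mean_ln, sd_ln, speech_rate, normalized

stats = f0_statistics([110, 120, 140, 130, 115], n_morae=28, voiced_dur_s=4.0)
print(round(stats[3], 4))   # 0.0026
```

Values on the order of 0.004 to 0.012 per speaker would match the normalized F0 deviation axis described for Fig. 2 below.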

Results of the "busy" scale

While the correlation between the SD values and each of the twelve parameters was not significant for either gender, the following parameter was found to be highly correlated with the "busy" ratings:

    (sd_ln(F0) / ln(F0)) · (1 / speech rate)

This parameter indicates the logarithmic F0 dispersion ratio per mora duration. The correlation coefficient is 0.83 for the 20 male speakers (1% significance level) and 0.47 for the 19 female speakers (5%), as shown in Fig. 2. On the other hand, this parameter does not show high correlation to the "articulate" ratings.

FIGURE 1. Speaker distributions on two semantic-differential scales associated with the dynamic aspect.

FIGURE 2. The scatter plots of speakers on the plane of the "resting - busy" rates and the normalized deviation of F0.

Results of the "inarticulate" scale

First, Fig. 3 compares the formant loci of /da/ (in "dai piramiddo") for the top and bottom three speakers on the "articulate - inarticulate" scale, at every 5 ms. From Fig. 3, the F1 loci for the "articulate" voices are steeper than those of the "inarticulate" ones.

Fig. 4 compares the mean formant frequencies of the five Japanese vowels for the "articulate" and "inarticulate" voices on the normalized log-F1 and log-F2 plane, whose origin is the mean log-formant frequency over the five vowels for each speaker. The formant frequencies are extracted from steady portions of the vowels. The dispersion for the "articulate" voices tends to be wider than that of the "inarticulate" ones.

FIGURE 3. Average formant loci of /da/ for the top and bottom 3 male speakers on the "articulate - inarticulate" rate.

FIGURE 4. Five Japanese vowels on the normalized log-F1-F2 plane for the top and bottom 3 speakers on the "articulate - inarticulate" scale.

CONCLUSION

This paper has presented the acoustical correlates to the SD values associated with the dynamic aspect of speaker characteristics. The "resting - busy" rating shows high correlation to the standard deviation of the log-F0 normalized by the mora duration. The "inarticulate" rating correlates to slower formant loci and a narrower dispersion on the log-F1-F2 plane for the five vowels.

REFERENCES

1. W.D. Voiers, "Perceptual bases of speaker identity," J. Acoust. Soc. Am. 36, 1065-1073 (1964).
2. G.L. Holmgren, "Physical and psychological correlates of speaker recognition," J. Speech Hear. Res. 10, 57-66 (1967).
3. Y. Yamashita, H. Matsumoto, "Study on Acoustical Correlates to Adjective Ratings of Speaker Characteristics," WESTPRAC VII, 177-180 (2000).
4. H. Kido and H. Kasuya, "Extraction of everyday expression associated with voice quality of normal utterance," J. Acoust. Soc. Japan 55, 405-411 (1999) (in Japanese).

Analysis and Synthesis of Hindi Retroflex Sounds Using PC Based Klatt Synthesizer

Rajesh Verma and Shyam S Agrawal

Speech Technology Group, Central Electronics Engineering Research Institute Centre, CSIR Complex, NPL Campus, Hill Side Road, New Delhi 110 012, India

Emails: [email protected], [email protected] Fax: 91-11-5788347

Abstract

This paper describes the analysis results obtained for achieving high quality synthesis of all the Hindi retroflex consonants /t./, /t.h/, /d./, /d.h/, /n~/, /r/, /r./ and /r.h/, using the PC based cascade/parallel formant synthesizer proposed by Klatt. These sounds were analyzed in five long vowel contexts /a/, /i/, /u/, /e/ and /o/ for a very accurate description of their acoustic characteristics/features. Various parameters like the duration of the closure/voice bar, duration of the burst, voice onset time, duration of aspiration, rate of second formant transition, and burst frequencies and amplitudes have been studied in detail. For the synthesis, the source and vocal tract parameters of the synthesizer configuration were selected very carefully. Special attention was paid to parameters like F2, F3, A2F, A3F and A4F, which play an important role in making the distinction between cognate sounds like /t/ and /t./. Parameters like FNP, FNZ, FTP and FTZ were successfully employed to obtain the nasal sound /n~/. The parametric doc files were modified iteratively until a satisfactory quality of synthetic sound was obtained. The quality of the synthetic speech was evaluated not only by subjective listening but also by matching the spectra of the synthetic speech with the original speech.

SPECIFIC FEATURES OF HINDI SOUNDS

The sounds of Hindi speech can be conveniently divided into two broad categories, vowels and consonants. Hindi speech contains a set of ten pure vowels and about 35 consonants, of which about 29 are of frequent usage. These consonants can be conveniently classified according to the manner and place of production [1].

The Hindi consonants possess certain special features which are not so common in European languages and American English. The most significant differences are in the stops and affricates, which use both voicing and aspiration to distinguish them from other languages. Retroflex stops and nasals do not occur in most forms of English but are very common in Hindi, Malayalam and other Indian languages. Retroflex sounds are made by curling the tip of the tongue up and back so that the underside touches or approaches the back part of the alveolar ridge. The retroflex /n~/, /r./ and /r.h/ generally appear in word-medial and word-final position only; they do not appear in word-initial position. All other retroflexes appear in any word position. Among these retroflex consonants, /r/ has very large allophonic variation in different contexts, i.e., it behaves like a semi-vowel, fricative or retroflex depending upon the phonetic context in which it appears.

ANALYSIS PROCEDURE

CV and VC syllables were analyzed from CVC syllables (where the final consonant was the same as the initial consonant), recorded by a male native Hindi speaker in all five vowel contexts /a/, /i/, /u/, /e/ and /o/, at a sampling rate of 12 kHz. These recorded syllables were analyzed using a PC AT equipped with Sensimetrics Speech Station software. The analysis tools used consisted of digital spectrograms and other techniques like short-time FFT and LPC spectra, pitch, formants, waveforms and envelope displayed together. Using the spectrogram, most of the important acoustical events were studied in the time domain. The duration and sequence of events in each sound/syllable was noted down using the time scale of the spectrogram. Unlike other sounds, which can be described largely in terms of steady-state spectra, stops are transient phonemes and thus are acoustically complex. Hindi CV syllables of the type stop plus vowel consist of at least four phonetic segments, viz. the closure or voicebar, burst, voice onset time (VOT), and a voiced interval.
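The LPC part of this kind of analysis can be sketched as follows. This is a generic autocorrelation-method LPC with candidate formants read from the pole angles, not the Speech Station implementation, and the test frame is synthetic rather than recorded speech.

```python
import numpy as np

# Sketch of short-time LPC analysis: fit a 12th-order all-pole model
# (autocorrelation method) to one windowed frame and read candidate formants
# from the angles of the roots of the prediction-error filter. The 12 kHz
# rate follows the paper; everything else here is an illustrative assumption.

def lpc_formants(frame, order=12, fs=12000):
    w = frame * np.hamming(len(frame))
    r = np.correlate(w, w, "full")[len(w) - 1:len(w) + order]   # lags 0..order
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])             # predictor coefficients
    roots = np.roots(np.concatenate(([1.0], -a)))      # error filter A(z)
    roots = roots[np.imag(roots) > 0]                  # one root per conjugate pair
    freqs = np.sort(np.angle(roots) * fs / (2 * np.pi))
    return freqs[freqs > 90]                           # discard near-DC roots

# Synthetic 25 ms frame with spectral peaks near 700 Hz and 1800 Hz:
t = np.arange(300) / 12000.0
frame = (np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1800 * t)
         + 0.01 * np.random.default_rng(0).standard_normal(300))
print(np.round(lpc_formants(frame)))
```

Two of the returned pole frequencies should lie close to the 700 Hz and 1800 Hz components of the test frame.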

Burst Frequencies and Amplitudes

Burst frequencies depend on the place of articulation of the stop as well as the vowel context. The spectral differences among the stop consonants across the places of articulation are primarily reflected in the spectra of the burst and the formant transitions to the target vowel. The burst frequencies do show slight variation in different vowel contexts. It has been observed that, in the case of retroflex sounds, the first two formants contain most of the energy as compared to the third formant. Retroflex sounds have a very heavy burst release as compared to the corresponding dental sounds. There is also a general lowering of the third and fourth formants.

SYNTHESIS PROCEDURE

A PC based cascade/parallel formant synthesizer based on the Klatt model [2] was used for the synthesis of the retroflex sounds. Based on the analyzed data, a preliminary parametric file was created and synthesis was done to obtain a starting synthetic file. Then the spectrograms of natural and synthetic syllables were compared. A number of source and tract parameters were adjusted iteratively in order to achieve a close imitation of the natural syllables. These include source parameters such as the amplitudes of voicing, frication and aspiration (AV, AF, and AH respectively), Open Quotient (OQ) and Spectral Tilt (TL), and vocal tract parameters like the six formant frequencies, amplitudes, and their bandwidths.

Burst Frequencies: The same formant frequencies of the burst (and transitions) can be used for consonants belonging to the same place of articulation. For example, the same values of the first four formants can be used for all four consonants /t., t.h, d., d.h/ in the retroflex group for a given vowel context (see Table 1).

TABLE 1. Parameters used for generating Burst of Retroflex in /a/ context

Voice Bar: In the case of voiced stops, it has been observed that the center frequency of the voicebar varies between 200-300 Hz, and the amplitude of the first resonance is high while all higher resonances are strongly damped. This type of spectral shape is obtained by using a spectral tilt factor (TL) of 15 and an OQ of 80. AV is set to 45 since the amplitude of the voice bar is low. Sometimes it is necessary to provide additional damping by increasing the bandwidths of the higher formants. In the case of the voiced aspirated stops /t.h/ and /d.h/, a break in the voicebar prior to the aspiration is observed. The duration of the voice bar is around 40 to 60 ms, depending on the vowel context.

VOT: For the unvoiced/voiced unaspirated stops /t., d./, there is practically no VOT; voicing starts immediately after the release of the burst. For the aspirated stops /t.h, d.h/, the VOT is about 50 ms.

Aspiration: In the case of aspirated sounds, it has been observed that the bandwidths of the formants are narrower (i.e. the formant peaks are better defined) for voiced aspirated sounds as compared to unvoiced aspirated sounds. The formant transitions from the burst frequency to the target vowel frequency are part of the aspiration segment. The aspiration duration is about 40 to 60 ms, while the value used for the parameter AH is 55.

The spectrograms of a sample original and synthetic syllable /n~i/ are shown in Figure 1.

FIGURE 1. Spectrograms of original and synthetic syllables /n~i/

CONCLUSIONS

The Hindi retroflex sounds have been successfully synthesized using the cascade/parallel formant synthesizer. FTP and FTZ, i.e., the frequencies of the tracheal pole and zero, play a very important role in improving the nasality in the vowel portion. These synthesis parameters are also used in the development of an unlimited vocabulary Text to Speech synthesis system for Hindi.
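As a rough illustration of the iterative parametric-file approach used throughout this paper, a frame track with the source values quoted above might be assembled as below. The dict-based format, 5 ms frame step and the bandwidth value B1 are illustrative assumptions, not the actual synthesizer file format.

```python
# Illustrative sketch of building a frame-by-frame parameter track for
# Klatt-style synthesis. Parameter names (AV, AH, OQ, TL, F1, ...) follow
# the paper; the data layout and frame step are assumptions.

def make_track(n_frames, **params):
    """One dict of synthesizer parameters per frame, values held constant."""
    return [dict(params) for _ in range(n_frames)]

# 50 ms voice bar for a voiced retroflex stop, using values quoted above
# (AV = 45, OQ = 80, TL = 15, F1 near 250 Hz), assuming 5 ms frames:
voicebar = make_track(10, AV=45, OQ=80, TL=15, F1=250, B1=60)
# About 50 ms of aspiration with AH = 55 for an aspirated stop:
aspiration = make_track(10, AV=0, AH=55, F2=1800, F3=2400)
track = voicebar + aspiration
print(len(track), track[0]["AV"], track[-1]["AH"])   # 20 45 55
```

In practice each such track would be re-synthesized, compared spectrographically against the natural syllable, and adjusted, as the synthesis procedure describes.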

ACKNOWLEDGEMENTS

The authors are grateful to Dr S Ahmad, Director, CEERI, Pilani for his encouragement and useful discussions. They are also thankful to MIT, Govt. of India, for financial support.

REFERENCES

[1] Agrawal S. S., and Stevens K., Proc. ICSLP-92, 177-180, (1992).
[2] Klatt D. H., J. Acoust. Soc. Am., 67(3), 971-995, (1980).

A Novel MFCCs Normalization Technique for Robust Hindi Speech Recognition

Amita Dev, S.S Agrawal

Central Electronics Engineering Research Institute, CSIR Complex, NPL Campus, Dr K S Krishnan Marg, New Delhi 110012, India
E-mail: [email protected]

ABSTRACT

The requirement for robustness in speech recognition systems is becoming increasingly important as speech technology is applied to real world applications. Dramatic degradation in system performance can occur as a result of differences between the training and testing conditions. As noise is introduced into the signal, the statistical parameters of the MFCCs vary considerably. For instance, as the level of the noise increases, the mean shifts, the variance reduces and the distribution tends to be non-Gaussian. As a consequence, the recognition score reduces considerably. In this paper we propose a normalization technique, Frame Cepstral Mean Normalization (FCMN), which normalizes the output of the front-end to have equal frame parameter statistics in all noisy conditions, thereby reducing the mismatch between training and testing conditions. The viability of the proposed normalization technique was verified in various experiments. Even under normal conditions the proposed technique clearly outperforms MFCCs and Delta-MFCCs in terms of recognition accuracy. In a multi-environment speaker independent word recognition task, the proposed normalization technique reduced the error rate by over 43% in the noisy condition with respect to the baseline tests (MFCCs), and for the microphone mismatch and noisy case, over 34% error rate reduction was achieved.

INTRODUCTION

Speech recognition systems work reasonably well in laboratory conditions, but their performance deteriorates drastically when they are deployed in practical situations where there is a microphone mismatch or the speech is corrupted by office noise [1]. For instance, in one of our experiments, an isolated word recognizer based on a classical VQ model using MFCC feature vectors could recognize 500 isolated words perfectly when they were spoken in a laboratory environment. The recognition score reduced to 68% when testing was carried out in a noisy environment at 14 dB SNR. We observed that the recognition score further decreased to 56.0% when the speech was recorded under noisy and microphone mismatch conditions at approx. 9 dB SNR.

The prime objective of this paper is to present the different front-end techniques as applied to word recognition using a VQ based speech recognizer, and to propose a novel MFCC normalization technique for robust Hindi speech recognition.

A need for the proposed normalization technique was strongly felt while working on a speaker independent Hindi word recognition system where the VQ model training was done in a noise free environment using the training utterances. Despite the clean training data, the recognizer yielded poor performance when we relied on an LPC based cepstral coefficient front-end. It was found that these linear prediction cepstral coefficients (LPCCs) were very sensitive to additive noise and channel mismatch distortion, which is very common in practice. In order to cope with the mismatch between training and testing conditions we tried state-of-the-art Mel Frequency Cepstral Coefficient (MFCC) and Delta-MFCC front-ends with certain speech enhancement techniques including preemphasis and cepstral lifters, but we could not significantly improve the performance of the recognizer. This may be attributed to the fact that as noise is introduced into the signal, the statistical parameters of the MFCCs vary considerably with noise [2].
We propose a normalization technique, FCMN, which makes the speech recognition system more robust to environmental changes by normalizing the output of the signal processing front end to have similar frame parameter statistics in different noise conditions.

ROBUST FEATURE SELECTION

The speech data signal was low pass filtered and digitized at a 16 kHz sampling rate with 16-bit quantization. A 16 ms Hamming window shifted in 8 ms steps and a preemphasis factor of 0.97 was used to calculate 24 filter bank energy values.
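The normalization idea stated above can be sketched in Python as follows, assuming the mean and standard deviation are taken over a short sliding window of frames; the window length and the synthetic input here are assumptions for illustration.

```python
import numpy as np

# Sketch of frame-wise cepstral normalization (FCMN idea): each cepstral
# vector is shifted and scaled using mean/variance statistics from a short
# sliding window, so frame statistics stay similar across noise conditions.
# The window length and the synthetic input are illustrative assumptions.

def fcmn(cepstra, win=30):
    """cepstra: (n_frames, n_coeffs). Normalize each frame with sliding stats."""
    out = np.empty_like(cepstra, dtype=float)
    for t in range(len(cepstra)):
        lo = max(0, t - win + 1)
        mu = cepstra[lo:t + 1].mean(axis=0)        # sliding mean (high-pass effect)
        sd = cepstra[lo:t + 1].std(axis=0) + 1e-8  # sliding sd (gain control)
        out[t] = (cepstra[t] - mu) / sd
    return out

c = np.random.default_rng(1).normal(5.0, 2.0, size=(200, 13))
n = fcmn(c)
print(abs(n[100:].mean()) < 0.2, 0.5 < n[100:].std() < 1.5)   # True True
```

As the next section notes, the mean removal acts like a linear high-pass filter and the division by the standard deviation like an automatic gain control.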

Frame Cepstral Mean Normalization

Hence in order to make the system more robust to The suitability of the proposed technique was tested above said distortions we implemented a normalization in various experiments with different microphones and technique [FCMN] by which cepstral coefficients were noise condition and the results for the same have been normalized to have zero mean and unit variance within summarized in Table 1. a given frame.[3] The normalization coefficients were calculated over a relatively short sliding window as Table 1. On the Comparison of Recognition Score using follows. different Front-ends ^ ct-T (j) = (ct-T (j) – µt(j)) / σ t(j) where - ct-T(j) is the jth component of the original feature vector at time t-T ^ - ct-T (j) is the normalized feature vector. - µt(j), σ t(j) is the mean and standard deviation for each feature vector component j. Here the mean removal can be regarded as the linear Firstly the recognizer was tested under normal High Pass Filter and division by standard deviation act conditions and the recognition score achieved was as an Automatic Gain Control. 93.6% with MFCCs, 97.0% with Delta-MFCCs and 100% with FCMN. Subsequently recognizers was DATABASE tested for Microphone Mismatch and it was seen that there was a degradation in the recognition rate as The database for most frequently occurring 500 compare to normal condition. But among all FCMN Hindi words spoken by 50 speakers was used for technique showed better results. As expected there was training purpose. The spoken samples were recorded in further degradation in the recognition rate when testing a studio environment condition using Sennheiser was done for Microphone mismatch and Noisy microphone model MD421 and tape recorder model condition with a low SNR value 9 dB. But proposed Philips sAF6121. The spoken words were repeated technique reduced the error rate by 34.5% with twice by each speaker. In order to study the effect of respect to baseline test . 
channel mismatch and noise, another set of recording was done by 10 speakers (7 males and 3 females) ACKNOWLEDGEMENTS using 2 different microphones i.e Lapple model D109 and Shure model SM 48. This data was used as testing Authors are grateful to Dr. S Ahmad, Director data to evaluate the recognition rates by applying CEERI, Pilani for encouragement and to AICTE for different parameterization techniques.Finally proposed financial support for carrying out this research. normalization technique was applied for extraction of robust feature vectors as shown in Figure 1. REFERENCES
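A minimal sketch of the sliding-window cepstral mean and variance normalization described above, in NumPy. The window length, function name and array layout are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def fcmn(cepstra, half_window=50):
    """Sliding-window cepstral mean/variance normalization (FCMN sketch).

    cepstra: (T, D) array, one D-dimensional cepstral vector per frame.
    half_window: frames on each side of the current frame used to
        estimate the local mean and standard deviation (illustrative value).
    Returns an array of the same shape, normalized toward zero mean and
    unit variance per coefficient within each local window.
    """
    T, D = cepstra.shape
    normalized = np.empty_like(cepstra, dtype=float)
    for t in range(T):
        lo = max(0, t - half_window)
        hi = min(T, t + half_window + 1)
        window = cepstra[lo:hi]
        mu = window.mean(axis=0)      # local mean mu_t(j), one value per coefficient j
        sigma = window.std(axis=0)    # local standard deviation sigma_t(j)
        sigma[sigma == 0.0] = 1.0     # guard against division by zero
        normalized[t] = (cepstra[t] - mu) / sigma
    return normalized
```

The mean subtraction implements the high-pass-filter behaviour noted above, and the division by the local standard deviation acts as the automatic gain control.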

REFERENCES

1. Acero, A., and Stern, R.M., "Environmental Robustness in Automatic Speech Recognition," Proc. IEEE ICASSP, 1990, pp. 849-852.

2. Gong, Y., "Speech Recognition in Noisy Environments: A Survey," Speech Communication, Vol. 16, No. 3, April 1995, pp. 261-291.

3. Dev, A., Sarma, A. S. S., and Agrawal, S. S., "Recognition of Hindi Phonemes using Time Delay Neural Networks and its comparison with other languages," Workshop on Multi-Lingual Speech Communication (MSC-2000), Kyoto, Japan, 2000, pp. 54-58.

The speech data signal was low-pass filtered and digitized at a 16 kHz sampling rate, 16 bit.

FIGURE 1. Normalized Mel Frequency Cepstral Coefficients

Silence as a Cue to Stop Manner in Meaningful /sVC/ Isolated (Hindi) Word Context

RK Upadhyaya a, SHS Rizvi b, SS Agarwal c and A Ahmad b

a Deptt. of Physics, Govt. P. G. College, Rishikesh, Distt. Dehradun, INDIA
b Deptt. of Physics, Aligarh Muslim University, Aligarh
c Central Electronics Engineering Research Institute, New Delhi

A perception study of silence as a cue in isolated meaningful /sVC/ sounds has been conducted through a manipulation experiment on eight sounds with five abutted vowels. For each sound the initial fricative noise /s/ was excerpted and separated from the remaining vocalic portion (RVP); 10 silence durations of 50, 80, 100, 120, 140, 160, 180, 200, 230 and 250 ms were then inserted between the two segments to conduct an extensive perception test on 50 bilingual subjects. It was found that for silence intervals of up to 80 ms listeners report the sounds as the original sounds, but from 100 ms to 160 ms they perceive the sound as /spVC/. On further increase of the silent interval, the sound /spVC/ separated into a hissing noise corresponding to /s/, followed by silence and /VC/ in distinct succession, thus establishing that silence is a sufficient condition for stop manner.

The importance of a silent interval between abutting sound segments as a cue in speech perception is well established [1,2,3,4]. To investigate similar effects in Hindi, 8 meaningful /sVC/ words, /sat/, /sat/, /sik/, /sid/, /sit/, /sIn/, /sen/ and /sun/, were recorded, and audio waveforms and digital spectrograms were obtained. The noise portion of /s/ and the RVP were separated, silent intervals of 10 ms, 20 ms, 30 ms, 40 ms, 50 ms, 60 ms, 70 ms, 80 ms, 90 ms and 100 ms were introduced between the separated /s/ noise and the RVP, and the 'manipulated' tokens were subjected to a perception study conducted by four speech researchers. They found that until a silent interval of about 80 ms was introduced the original sound was heard, and it was then and thereafter that the responses were /spVC/. The initial spectrum of the RVP was then gradually removed until /tat/ was heard as /at/, etc.; the duration removed varied between 20 ms and 25 ms. A new file of these 8 manipulated tokens was created, and silent intervals of 50, 80, 100, 120, 140, 160, 180, 200, 230 and 250 ms were introduced to obtain 80 tokens, which were randomized three times and subjected to perception by 38 male and 12 female young listeners.

Listeners' responses [Table 1] show that the stop /p/ almost invariably springs up as the silent interval is increased from 80 ms to 100 ms. The /p/ responses fall to the 50% level at about 180 ms, beyond which the 'separation effect' dominates. The maximum separation response is 129 [86%] at 250 ms, the average separation response at 250 ms being 74%. Further, there is little effect of the abutted vowel on the stop consonant manner.

We find that our results are consistent with the findings of Dorman et al. [2] and Port [5] in that, even without any strong stop manner cues in the surrounding signal portions, /p/ responses are obtained in a certain range of closure durations. However, our boundary of /p/ responses spans from 80 ms to 160 ms, after which the sounds begin to separate into the unnatural-sounding /s-noise + silence + VC/. Bastian et al. [1] found that the syllable /slIt/ is heard as /splIt/ when a short interval of silence (~40 ms) is introduced, but in our study such short silence closure durations were invariably perceived as the original sound. It is at 80 ms and beyond that a very sharp increase in the /p/ responses is obtained, and this boundary spans from 80 ms to 160 ms, after which the response is /s-noise + silence + VC/.

The investigation shows that neither a very brief nor a very long closure is appropriate for stop manner, although it establishes that silence is important for the perception of stops in prevocalic position.
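The token-construction step described above (excerpted /s/ noise, inserted silent closure, then the RVP) can be sketched as follows; the 16 kHz sampling rate, function name and array representation are assumptions for illustration, not taken from the paper:

```python
import numpy as np

SAMPLE_RATE = 16000  # Hz; assumed for illustration

def make_token(s_noise, rvp, silence_ms):
    """Build one manipulated token: /s/ noise + silent closure + RVP.

    s_noise: samples of the excerpted /s/ frication noise.
    rvp: samples of the remaining vocalic portion.
    silence_ms: closure duration to insert, in milliseconds.
    """
    gap = np.zeros(int(SAMPLE_RATE * silence_ms / 1000), dtype=s_noise.dtype)
    return np.concatenate([s_noise, gap, rvp])

# The ten closure durations used in the final perception test
DURATIONS_MS = [50, 80, 100, 120, 140, 160, 180, 200, 230, 250]
# tokens = [make_token(s_noise, rvp, d) for d in DURATIONS_MS]
```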

Table 1. Responses for /sVC/ stimuli as a function of varying interval of silence (Max. No. of Responses = 150)

                       Silence Duration (ms)
Stimulus  Response    50   80  100  120  140  160  180  200  230  250
/sat/     /sat/      149  146    0    0    0    0    0    0    0    0
          /spat/       1    4  150  150  150  100   76   42   22   21
          /s+at/       0    0    0    0    0   50   74  108  128  129
/sat/     /sat/      150  149    0    0    0    0    0    0    0    0
          /spat/       0    1  150  150  148   77   52   38   20   17
          /skat/       0    0    0    0    0    0    0    0    0    4
          /stat/       0    0    0    0    0    0    0    0    0    1
          /s+at/       0    0    0    0    2   73   98  112  130  128
/sIn/     /sIn/      150  147    0    0    0    0    0    0    0    0
          /spIn/       0    3  149  139  126   80   41   37   28   25
          /skIn/       0    0    0    5    3    2    3    2    2    0
          /stIn/       0    0    0    0    7    2    8    3    2    5
          /s+In/       0    0    1    6   14   66   98  108  118  120
/sik/     /sik/      150  142    0    0    0    0    0    0    0    0
          /spik/       0    8  150  150  150  100   86   64   54   46
          /s+ik/       0    0    0    0    0   50   65   86   96  104
/sid/     /sid/      150  145    0    0    0    0    0    0    0    0
          /spid/       0    5  150  150  144  115   87   64   60   52
          /s+id/       0    0    0    0    6   35   63   86   90   98
/sit∫/    /sit∫/     150  145    0    0    0    0    0    0    0    0
          /spit∫/      0    5  149  150  147   95   70   69   48   47
          /skit∫/      0    0    0    0    0    0    0    1    0    0
          /stit∫/      0    0    0    0    0    0    0    3    0    1
          /s+it∫/      0    0    1    0    3   55   80   77  102  102
/sen/     /sen/      146  144    0    0    0    0    0    0    0    0
          /spen/       4    6  149  147  133   82   65   40   37   35
          /sten/       0    0    0    0    5    5    5    0    6   16
          /sten/       0    0    0    0    0    0    0    9    0    0
          /sken/       0    0    0    0    0    0    0    0    1    0
          /s+en/       0    0    1    3   12   63   80  101  106   99
/sun/     /sun/      150  148    0    0    0    0    0    0    0    0
          /spun/       0    2  143  146  148  115  107   94   72   46
          /skun/       0    0    4    4    0    5    0    1    0    0
          /stun/       0    0    0    0    1    1    0    0    2    0
          /s+un/       0    0    3    0    1   29   43   55   76  104
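The 50% boundary quoted in the text can be read off Table 1 by converting response counts to percentages; a sketch using the /spat/ row of the first /sat/ stimulus (all counts from the table, out of a maximum of 150 responses):

```python
# /spat/ response counts for the first /sat/ stimulus in Table 1
durations_ms  = [50, 80, 100, 120, 140, 160, 180, 200, 230, 250]
spat_counts   = [1, 4, 150, 150, 150, 100, 76, 42, 22, 21]
MAX_RESPONSES = 150

percent_p = [100.0 * c / MAX_RESPONSES for c in spat_counts]

# First duration, after /p/ responses have emerged (>= 100 ms), at which
# they have dropped below the 50% level; for this stimulus the crossing
# lies between 180 ms (50.7%) and 200 ms (28.0%).
first_below_50 = next(
    d for d, p in zip(durations_ms, percent_p) if d >= 100 and p < 50.0
)
```

Averaged over all eight stimuli, this crossing falls near 180 ms, matching the text's statement that the /p/ responses fall to the 50% level at about 180 ms.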

REFERENCES

1. Bastian, J., Eimas, P., and Liberman, A., J. Acoust. Soc. Am. 33, 842a (1961).
2. Dorman, M.F., Raphael, L.J., and Liberman, A.M., J. Acoust. Soc. Am. 65, 1518-1532 (1979).
3. Raphael, L.J., and Dorman, M.F., J. of Phonetics 8, 269-275 (1980).
4. Repp, B.H., Haskins Laboratories, Status Report on Speech Research, SR-77/78, 137-145 (1984).
5. Port, R.F., J. of Phonetics 7, 45-56 (1979).

Towards a Robust Speech Intelligibility Test in Japanese

Kazuhiro Kondo, Ryo Izumi and Kiyoshi Nakagawa

Department of Electrical Engineering, Faculty of Engineering, Yamagata University
4-3-16 Jonan, Yonezawa, Yamagata 992-8510, JAPAN
{kkondo, nakagawa}@eie.yz.yamagata-u.ac.jp

We proposed and initially tested a Japanese version of the well-known Diagnostic Rhyme Test (DRT). We analyzed the Japanese phone inventory and classified it into the six-feature taxonomy used in the DRT. Accordingly, we proposed a Japanese word-pair list with the same rhyming structure, in which the first phones of a pair differ by a single feature. We then collected speech and tested intelligibility with this list with white noise, multi-speaker noise, and pseudo-speech noise added at various levels. The results showed basically the same trend as English: sustention scores were lower with white noise, while graveness showed lower scores for all types of noise. We also tested the effect of word familiarity on DRT-type selection-based intelligibility tests. Word-pair and 4-word lists with the same rhyming structure were constructed; half of the words were in the high-familiarity group, while the other half were in the low-familiarity group. We compared intelligibility scores by group; tests were given with 4-word selection, 2-word selection, and conventional free choice (write-in). The 2-word selection test scores showed significantly less effect of familiarity than the two remaining tests. We also confirmed that word-pair tests showed a much smaller effect of training than the conventional free-choice tests.

INTRODUCTION

Traditionally, Japanese intelligibility tests often used stimuli of randomly selected single-mora, two-morae or three-morae speech. The subjects were free to choose from any combination of valid Japanese syllables. This quickly becomes a strenuous task as the channel distortion increases. Thus, intelligibility tests of this kind are known to be unstable and often do not reflect the physically evident distortion, giving surprising results [1].

English intelligibility tests are also reported to show similar trends. Accordingly, the Diagnostic Rhyme Test (DRT) [2], a closed-set selection test which restricted the reply to two words, was proposed. This test is said to be effective in controlling various factors including the amount of training and phonetic context, and is known to give stable intelligibility scores.

In this paper, we attempt to propose a DRT-type closed-set selection test in Japanese. We initially try to categorize Japanese consonants into the same taxonomy used for the English tests, and accordingly propose a minimum-pair list whose members differ only in the initial consonant and by a single phonetic feature. Initial test results are also shown for various noise types under various SNRs.

It has been known that word familiarity affects word intelligibility tests, in that the subjects tend to be biased towards the more familiar words [3]. However, a closed selection test with only a few choices is expected to minimize this effect. We conducted experiments to quantify this observation as well.

A DIAGNOSTIC RHYME TEST FOR JAPANESE

We first proposed a consonant taxonomy for Japanese with the same feature classification used in English, which was drawn from the classification by Jakobson, Fant and Halle [4]. The consonant taxonomy was then used to compile a word-pair list to be used as stimuli for the DRT. As for English, 16 word pairs per each of the 6 features were proposed, for a total of 192 words. The word pairs are rhyme words, differing only in the initial phoneme. The following is specific to the Japanese list:

• Only two-morae words were initially considered. Longer words will be considered as needed.
• Only words with the same accent type were selected as a word pair.
• We tried to select mostly common nouns. Proper nouns, slang words and obscure words were avoided.

We collected speech from 4 speakers, two per gender. White noise, babble and pseudo-speech noise were mixed into these samples at SNRs of -15, -10, 0 and 10 dB respectively. Speech for the words in the word-pair list was played out in random order. The listeners were given the word-pair list to choose from.

Figure 1 shows the DRT scores per feature under various SNRs for white noise only. These results showed basically the same trend as the English results shown by Voiers [2]; nasality and voicing were fairly immune to noise, while graveness was affected significantly.

Figure 1. Japanese DRT Scores (White Noise)

THE EFFECT OF FAMILIARITY ON RHYME TESTS

One of the major factors known to affect word intelligibility is word familiarity. However, a closed-selection test may not be affected if the choice is limited. We conducted tests to verify this observation. Word familiarity is the average subjective measure of familiarity one feels towards a word. Amano and Kondo have compiled a large database with familiarity ratings for 80,000 Japanese words in the Shinmeikai Dictionary [5] on a 7-point scale. We based all familiarity ratings on their data.

We compiled a word-pair list and a 4-word group list according to familiarity. Two familiarity classes were defined: the low-familiarity class, with familiarity below 4.0, and the high-familiarity class, with familiarity above 6.0. Words with familiarity between 4.0 and 6.0 were left out intentionally, to clearly distinguish the high-familiarity words from the low-familiarity words. The two words in the same pair in the word-pair list, as well as all four words in the same group within the 4-word group list, were rhyme words. For the word-pair list, one word in a pair was in the low-familiarity class, while the other word was in the high class. Likewise, two words in a 4-word group were in the high class, while the remaining two were in the low class. The word-pair list contained 31 pairs of 2-morae words. The 4-word group list contained 7 groups of 2-morae words and 2 groups of 3-morae words.

We collected speech from 2 speakers, one male and one female. White noise was mixed into these samples at SNRs of -10, 0 and 10 dB respectively. Three testing sessions were conducted:

1. 2-word RT: Speech for the words in the word-pair list was played out in random order. The listeners were given the word-pair list to choose from.
2. 4-word RT: Speech for the words in the 4-word group list was played out in random order. The listeners were given the 4 words in the set to choose from.
3. Conventional Intelligibility: Speech for the words in the word-pair list was played out in random order. However, the listeners were asked to write freely what they heard.

The intelligibility scores vs. signal-to-noise ratio for samples with added noise are shown in Figure 2. The difference in the intelligibility scores between the low-familiarity class and the high-familiarity class in the 2-word RT was generally significantly smaller than in the 4-word RT or the conventional intelligibility test. This suggests that the effect of familiarity on intelligibility tests is minimized when the number of choices given is two, but increases with a larger number of choices.

Figure 2. Intelligibility by Test Modes

REFERENCES

1. Nishimura, R., Asano, F., Suzuki, Y., and Sone, T., IEICE Trans. Fundamentals 79-A, 1986-1993 (1996) (in Japanese).
2. Voiers, W., "Diagnostic Evaluation of Speech Intelligibility," in Speech Intelligibility and Speaker Recognition, edited by M. Hawley, Dowden, Hutchinson & Ross, Stroudsburg, PA, 1977, 374-387.
3. Sakamoto, S., Suzuki, Y., Amano, S., Ozawa, K., Kondo, T., and Sone, T., J. Acoust. Soc. Japan 54, 842-849 (1998) (in Japanese).
4. Jakobson, R., Fant, C., and Halle, M., Tech. Rep. 13, Acoustics Laboratory, MIT (1952).
5. Amano, S., and Kondo, T., Lexical Properties of Japanese, Sanseido, Tokyo, 1999.
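A note on scoring: two-alternative rhyme tests such as the DRT are conventionally scored with a correction for guessing, P = 100·(R − W)/T. The paper does not state its scoring formula, so the following is a sketch under that assumed convention:

```python
def drt_score(right, wrong, total):
    """Chance-corrected intelligibility score for a two-alternative
    rhyme test: 100 * (right - wrong) / total. A listener guessing at
    random scores near 0%; a perfect listener scores 100%."""
    return 100.0 * (right - wrong) / total

# Example: out of 160 presentations of one feature, 140 correct, 20 wrong
print(drt_score(140, 20, 160))  # 75.0
```

The correction matters precisely because the response set is closed: raw percent-correct in a two-choice test cannot fall below about 50% under random guessing, whereas the corrected score spans the full 0-100% range.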