
Analysis of nasal consonants using perceptual linear prediction

Yingyong Qi
Department of Speech and Hearing Sciences and Department of Electrical and Computer Engineering, University of Arizona, Tucson, Arizona 85721

Robert A. Fox
Division of Speech and Hearing Science, Ohio State University, Columbus, Ohio 43210

(Received 31 May 1991; accepted for publication 19 November 1991)

Until recently, speech analysis techniques have been built around the all-pole linear predictive model. This study examines the effectiveness of using the perceptual linear predictive method for analyzing nasal consonants. Six speakers (three men and three women) produced 300 CV syllables with initial nasal consonants /m/ and /n/. A threshold-based boundary detection algorithm was developed to extract nasal segments from the CV contexts. Poles of a fifth-order perceptual linear predictive model were calculated, and the frequency of the second pole was used to characterize the place of articulation of nasal consonants. Results indicated that the frequency of the second transformed pole was significantly lower for /m/ than for /n/ and was independent of factors such as vowel context and gender of the speaker. A nasal identification rate of 86% was obtained based on the frequency of the second pole. The use of the perceptual linear predictive method may thus overcome some difficulties associated with analyzing nasal consonants.

PACS numbers: 43.71.Es, 43.71.Cq, 43.72.Ar

INTRODUCTION

The progressive phonetic refinement of broad categories of speech sounds (stop, …, and sonorant) is an important step in implementing speech understanding and/or speech recognition systems (Reddy, 1976); however, knowledge about the acoustic cues that allow for reliable extraction of phonetic characteristics of speech from the acoustic waveform is limited (Harrington, 1988). For example, robust features that enable automatic within-class differentiation of nasal consonants have not been well established.

The acoustic structure of nasal consonants has long been predicted by the acoustic theory of speech production (Fant, 1960; Fujimura, 1962; Flanagan, 1972). The presence of a side-branching resonator (the blocked oral cavity) will introduce an antiresonance (zero) in the spectrum of nasal consonants. Theoretically, the antiresonance can be used to identify the place of articulation of nasal consonants because the frequency of the antiresonance is determined by the dimension of the side-branching resonator. For example, the frequency of the antiresonance should be lower for /m/ than for /n/ because the blocked oral cavity attached to the pharyngeal-nasal passage is longer for /m/ than for /n/. Such differences, however, are difficult to detect using conventional techniques of spectral analysis. The presence of the spectral zero introduces nonlinear equations to parametric methods of spectral analysis (Kay, 1987). The harmonic structure, the absence of a prominent spectral valley, and the significantly damped high-frequency energy in nasal spectra are the sources of some of the difficulties associated with analyzing nasal consonants using nonparametric (periodogram) methods.

Fujimura (1962) examined the acoustic structure of nasal consonants and measured the frequency of the antiresonance using the method of analysis by synthesis. This study provided, for the first time, valuable experimental evidence on the predicted acoustic characteristics of nasal consonants. As pointed out by the author, however, "the procedure ... is by no means simple or mechanical ... and a machine could hardly replace the human operator at the stage of this study."

Kurowski and Blumstein (1987) reported that the place of articulation of nasal consonants in CV syllables could be differentiated by the pattern of spectral energy change in the vicinity of the oral release once the spectra were transformed using an algorithm based on the psychoacoustic characteristics of the human auditory system. Utilizing critical-band analysis based on the Bark scale (Zwicker, 1970), Kurowski and Blumstein examined the spectra of two glottal pulses of the nasal murmur immediately preceding oral release of the nasal stop closure and the first two glottal pulses after the nasal release. The spectral energy increase was reported to be larger between 5-7 Bark than between 11-14 Bark for the nasal /m/, whereas the spectral energy increase was reported to be larger between 11-14 Bark than between 5-7 Bark for the nasal /n/. Although the auditory-based spectral transformation provides a reasonable alternative for analyzing the nasal consonants, these features were inevitably context dependent because spectral information of the following vowel was an important part of the measurement. In addition, there was a certain degree of arbitrariness in the selection of the frequency range for comparing energy increase around nasal release, since the selection was neither systematically optimized given the transformed spectrum nor directly prescribed by theory.

Hermansky (1990) proposed a new approach, called


perceptual linear predictive (PLP) method, for the spectral transformation based on characteristics of the auditory system. While the underlying principles of the spectral transformation were similar to those previously reported (Kewley-Port and Luce, 1984; Kurowski and Blumstein, 1987), the final all-pole modeling of the transformed spectrum parametrically specified the spectrum and, thus, simplified the extraction of acoustic characteristics of speech signals. Hermansky (1990) demonstrated that a fifth-order PLP spectral transformation improved the performance of a template-based speech recognition system.

In light of the newly proposed PLP method of spectral analysis, the present study was undertaken to determine whether acoustic characteristics of nasal consonants could be found that were independent of the following vowel context in a CV syllable and to evaluate whether place of articulation could be automatically identified on the basis of these features. The paper is organized as follows. Section I includes a brief rationale for using the PLP method and the implementation details of the PLP analysis. Section II contains a description of the context-independent acoustic characteristics of nasal consonants revealed by the PLP analysis and the results of nasal identification based on these features. Section III is the discussion and conclusion.

I. METHOD

The integration of spectral energy as part of the auditory-based spectral transformation provides a unique advantage in analyzing the spectral characteristics of nasal consonants. As stated earlier, the absence of a prominent spectral valley and the reduced spectral energy in the mid- to high-frequency ranges make it difficult to accurately locate the center frequency of the antiresonance and to use this frequency to distinguish nasal consonants. There are, however, less-intensive spectral perturbations at different frequency ranges for different nasal consonants due to the side-branch (oral) coupling. For example, there is a relatively broad energy reduction in the mid-frequency (1000-2000 Hz) range within the nasal murmur of nasal /n/ that is not as apparent for the nasal /m/. When a nonuniform integration of power is undertaken as part of the spectral transformation, some otherwise unreliable energy concentrations are expected to manifest themselves as spectral peaks. These spectral peaks could be used to differentiate the place of articulation of nasal consonants. Because the spectral integration in the PLP transformation is based on the characteristics of the auditory system, and spectral peaks in the transformed spectrum have important perceptual implications, it is reasonable to assume that integrated spectral peaks in the transformed spectrum could be used to differentiate the place of articulation of nasal consonants. Thus, instead of attempting to measure the difficult-to-locate antiresonance directly, poles of the PLP-transformed spectra are used to characterize and differentiate nasal consonants.

Hermansky (1990) has claimed that a fifth-order PLP model is most effective in a template-based speech recognition system. The order of the model implies that two pairs of conjugate poles and one real pole are available for each signal frame. Each pair of the conjugate poles provides the frequency and bandwidth of one equivalent spectral peak. The real pole specifies the overall spectral tilt of the transformed spectrum. Because nasal spectra are dominated by only one low-frequency pole (250-300 Hz) (Fujimura, 1962; Fujimura and Lindqvist, 1971), the fifth-order PLP model is adopted here for the analysis. However, the real pole is not considered in the final analysis because spectral tilt can easily be influenced by factors such as recording conditions and glottal characteristics of a particular speaker and is often considered as a phonetically nonessential factor (Klatt, 1982).

A. Subjects and recordings

Six native speakers of American English (three men and three women) provided the speech samples. Each speaker produced ten CV syllables with initial nasal consonants /m/ and /n/ followed by five different vowels /i/, /æ/, /o/, /a/, and /u/. Each syllable was repeated five times. The syllables were randomly ordered. The recordings were made in a quiet room with background noise at about 30 dB (SPL). The microphone was placed approximately 10 cm from the mouth of the speaker. Recorded tokens were low-pass filtered (fc = 4.5 kHz) and digitized at a sampling rate of 10 kHz with a 16-bit quantization level.

B. Nasal segment selection

A 25.6-ms segment of the nasal consonant before the oral release into the following vowel (i.e., within the nasal murmur) was selected for the PLP analysis. The boundary between the nasal consonant and the following vowel was automatically located from the acoustic waveform. The exact algorithm is as follows. The signal was bandpass filtered (1-3.5 kHz, fourth-order Butterworth filter) to enhance the energy discontinuity of the boundary, since energy for the nasal murmur is mostly in the frequency range below about 800 Hz (Mermelstein, 1977). The upper frequency limit for the bandpass filter was used to eliminate unpredictable high-frequency influences. The boundary was determined using a short-term amplitude signal calculated from the filtered waveform. A Hamming window, with a window size of 10 ms and a window step of 2 ms, was used in the calculations. The short-term amplitude was calculated as the absolute sum of sampled values within the window

A_i = Σ_n |s_n · h_n|,  (1)

where A_i is the amplitude signal of the ith frame, s_n is the bandpass-filtered waveform, h_n is the Hamming window, and n is the discrete time index. The boundary was automatically detected and marked as the time location where the increase of short-term amplitude, A_{i+1} − A_i, exceeded a threshold A_t, with the condition that A_i was less than another threshold A_r. The threshold A_t was adaptively established as a fixed percentage of the average amplitude of two consecutive frames

A_t = 0.25 · (A_{i+1} + A_i)/2.  (2)

The second threshold A_r was used to avoid detecting large energy fluctuations within the vowel segment. When more


than one time location was marked within a single syllable because of the finite window length, the last mark was chosen as the boundary between the nasal and the vowel, because the threshold A_r minimized the possibility that a boundary could be marked within the vowel segment. The boundary marks were visually checked using time-synchronized spectrographic displays. All the marked boundaries were well aligned with the onset of high-frequency energy in the spectrograph. The marked original and filtered waveforms and the spectrograms of the original signals are illustrated in Fig. 1 for the sample syllables /ma/ and /na/.

C. PLP spectral analysis

PLP spectral analysis was implemented using the following procedure: (a) 16 simulated critical-band masking patterns separated at a 1-Bark interval (covering the total frequency range of 50-4674 Hz) were each calculated as a 128-point weighting function and were pre-emphasized by a simulated equal-loudness curve [see Fig. 2(a)]; (b) a 128-point power spectrum was calculated for each selected nasal segment using a 256-point FFT; (c) a weighted sum of the power spectrum was calculated for each weighting function, and the summation was compressed by the cubic-root intensity-loudness relationship, obtaining 16 filter outputs; (d) 16 autocorrelation coefficients were derived from the filter outputs using a 32-point IDFT, 5 of which were used to solve the Yule-Walker equations to obtain the fifth-order PLP coefficients. Finally, roots of the fifth-order PLP model were calculated to represent the nasal consonants. The process of the calculation is illustrated in Fig. 2(b).


FIG. 1. The original waveform (top window), the filtered waveform (middle window), and spectrograms of the original waveform (bottom window) for the syllables /ma/ (left) and /na/ (right). The identified boundary marks between the nasal and the vowel are shown in each window.
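To make the boundary-detection procedure of Sec. I B concrete, the following is a minimal sketch, in Python with NumPy/SciPy, of Eqs. (1) and (2) applied to a 10-kHz waveform array. The function name, the particular choice of the second threshold A_r (not specified in the text), and the SciPy Butterworth call are illustrative assumptions of this sketch rather than the authors' original implementation.

    import numpy as np
    from scipy.signal import butter, lfilter

    def find_nasal_vowel_boundary(x, fs=10000, win_ms=10.0, step_ms=2.0,
                                  ratio=0.25, a_r=None):
        # Band-pass filter 1-3.5 kHz (fourth-order Butterworth) to enhance
        # the energy discontinuity at the oral release.
        b, a = butter(4, [1000.0, 3500.0], btype="bandpass", fs=fs)
        y = lfilter(b, a, x)

        # Short-term amplitude: absolute sum of Hamming-windowed samples,
        # 10-ms window advanced in 2-ms steps [Eq. (1)].
        win = int(win_ms * fs / 1000)
        step = int(step_ms * fs / 1000)
        h = np.hamming(win)
        amp = np.array([np.sum(np.abs(y[i:i + win] * h))
                        for i in range(0, len(y) - win, step)])

        # Mark frames where the amplitude increase exceeds the adaptive
        # threshold A_t = 0.25*(A_{i+1}+A_i)/2 [Eq. (2)], provided A_i is
        # still below a second threshold A_r that excludes fluctuations
        # inside the vowel (the paper does not give A_r; half the maximum
        # amplitude is used here purely for illustration).
        if a_r is None:
            a_r = 0.5 * amp.max()
        marks = [i for i in range(len(amp) - 1)
                 if (amp[i + 1] - amp[i]) > ratio * (amp[i + 1] + amp[i]) / 2.0
                 and amp[i] < a_r]

        # The last qualifying mark is kept, as in the text, so that the
        # boundary cannot fall inside the vowel; return it in samples.
        return marks[-1] * step if marks else None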


FIG. 2. (a) The pre-emphasized filter bank used in the spectral transformation (weighting amplitude versus frequency, 0-5000 Hz). (b) Flow chart for the analysis: speech signal → FFT power spectrum → weighted sum over the critical-band weighting functions → transformed power spectrum → IDFT and solution of the Yule-Walker equations → PLP coefficients → root solving → poles of the PLP spectrum.
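The analysis chain of Fig. 2(b) can be sketched roughly as follows, assuming a precomputed 16 × 129 critical-band weighting matrix W built per Fig. 2(a) (1-Bark-spaced masking curves, pre-emphasized by an equal-loudness curve). The mirroring used to form the 32-point symmetric spectrum and the SciPy Toeplitz solver are assumptions of this sketch, not details given in the paper.

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def plp_poles(segment, W, order=5):
        # (b) 256-point FFT power spectrum of the Hamming-windowed 25.6-ms
        #     segment (129 non-redundant bins from 0 Hz to the 5-kHz Nyquist).
        spec = np.abs(np.fft.rfft(segment * np.hamming(len(segment)), 256)) ** 2

        # (c) Weighted sum over each critical-band weighting function, then
        #     cubic-root intensity-loudness compression -> 16 filter outputs.
        filt = (W @ spec) ** (1.0 / 3.0)

        # (d) Autocorrelation from a 32-point inverse DFT of the compressed
        #     spectrum made symmetric by mirroring (an assumed detail); only
        #     the first order+1 lags are needed for the Yule-Walker equations.
        sym = np.concatenate([filt, filt[::-1]])        # 32 points
        r = np.real(np.fft.ifft(sym))[:order + 1]

        # Solve the Yule-Walker equations for the fifth-order predictor and
        # return the roots (poles) of A(z) = 1 + a1*z^-1 + ... + a5*z^-5.
        a = solve_toeplitz(r[:order], -r[1:order + 1])
        return np.roots(np.concatenate(([1.0], a)))

The five roots returned consist, in general, of two conjugate pole pairs plus one real pole, matching the model structure described above; only the conjugate pairs are retained for the analysis.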

D. Statistical analysis

A mixed-design analysis of variance (ANOVA) was undertaken for each frequency and bandwidth of the two spectral peaks to assess the effects of place of articulation, vowel context, and gender of speaker, as well as their interactions. Because the identity of each individual speaker was also a factor that was nested within each gender group, its interaction with those main factors was treated as an error term in the subsequent testing of hypotheses (Fisher, 1946).

The frequency f_i and bandwidth b_i of each transformed pole were obtained first on the Bark scale using the formulas

f_i = 16 · arg(z_i)/π,  (3)

b_i = −32 · ln|z_i|/π,  (4)

where z_i represents the ith conjugate pole (i = 1, 2). The f_i and b_i were then transformed back into linear scale using the inverse relationship between frequency and Bark,

f_Hz = 600 sinh(f_Bark/6),  (5)

where f_Hz is frequency in Hz and f_Bark is frequency in Bark (Schroeder, 1977).
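As a small worked illustration of Eqs. (3)-(5), the following hypothetical helper converts a complex PLP pole to a peak frequency and bandwidth in Hz; the pole angle of π corresponds to the 16-Bark Nyquist of the warped spectrum, and mapping the Bark-scale bandwidth back to Hz via the band edges is one plausible reading of the text, not a detail the paper spells out.

    import numpy as np

    def bark_to_hz(f_bark):
        # Eq. (5): Schroeder's inverse Bark mapping.
        return 600.0 * np.sinh(f_bark / 6.0)

    def pole_to_peak(z):
        # Eq. (3): peak frequency in Bark from the pole angle.
        f_bark = 16.0 * np.angle(z) / np.pi
        # Eq. (4): bandwidth in Bark from the pole magnitude (|z| < 1).
        b_bark = -32.0 * np.log(np.abs(z)) / np.pi
        # Map back to Hz; the bandwidth is taken as the difference of the
        # band edges mapped through Eq. (5) (assumed convention).
        f_hz = bark_to_hz(f_bark)
        b_hz = bark_to_hz(f_bark + b_bark / 2.0) - bark_to_hz(f_bark - b_bark / 2.0)
        return f_hz, b_hz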


II. RESULTS

The distributions of the calculated poles are shown in Fig. 3 for the nasal consonants produced by male and female speakers. The average transformed spectra for the nasal consonants /m/ and /n/ produced by the three male and three female speakers are presented in Fig. 4, together with the overall average transformed spectra. The corresponding average FFT spectra are presented in Fig. 5.

FIG. 3. Pole distributions for nasal consonants produced by male (top) and female (bottom) speakers. Small circles depict /m/ and plus signs depict /n/.

The nasal /m/ and /n/ exhibited quite different characteristics in both the pole distribution plots and the average PLP spectra. The difference between nasal /m/ and /n/ in the average FFT spectra, however, was rather distributed and difficult to quantify. The second peak of the PLP spectrum was characterized, on the average, by a lower center frequency and wider bandwidth for /m/ than for /n/, whereas the first peak did not seem to distinguish the two categories effectively.

The results of the ANOVA revealed that there was a significant difference in the frequency of the second spectral peak between nasal /m/ and /n/ [F(1,4) = 87.23, p < 0.001]. Post hoc testing indicated that the frequency of the second spectral peak was significantly higher for /n/ than for /m/ (α = 0.01). The effects of the vowel context [F(4,16) = 3.31, p = 0.0373], gender of the speaker [F(1,4) = 1.31, p = 0.3157], and the nasal-by-vowel interaction [F(4,16) = 3.32, p = 0.0368] on the frequency of the second spectral peak were not significant. No significant differences were found for the frequency of the first spectral peak or for the bandwidths of either spectral peak.

FIG. 4. Average transformed spectra for nasal consonants produced by (a) male speakers, (b) female speakers, and (c) all speakers. Solid line represents nasal /m/ and dotted line represents nasal /n/.

The average frequencies of the spectral peaks were f1 = 290 Hz and f2 = 1742 Hz for /m/, and f1 = 305 Hz and f2 = 2062 Hz for /n/. The average bandwidths of the spectral peaks were b1 = 305 Hz and b2 = 563 Hz for /m/, and b1 = 305 Hz and b2 = 479 Hz for /n/.


TABLE I. Means and standard deviations (s.d.).

            f1 (Hz)          f2 (Hz)          b1 (Hz)          b2 (Hz)
Syllable   Mean   s.d.     Mean    s.d.     Mean   s.d.     Mean   s.d.
mi         300.3  23.4    1846.2  201.8     278.4  36.2     572.2  178.2
mæ         293.2  25.6    1764.8  210.9     271.8  43.8     504.8  135.4
mo         279.2  19.8    1530.4  229.0     323.8  79.6     531.8   65.8
ma         288.1  20.1    1692.8  213.7     335.0  87.2     603.0  151.0
mu         291.3  25.3    1776.5  224.9     314.8  69.0     604.4  132.6
ni         310.3  23.7    2107.7  170.0     300.0  36.6     526.8  195.0
næ         303.0  22.7    1989.5  153.8     298.8  54.0     482.4  159.8
no         304.5  24.2    2035.4  132.9     317.8  63.4     459.8  135.8
na         304.3  31.5    2052.1  159.5     296.8  45.8     468.0  146.8
nu         302.1  22.3    2123.3  135.6     294.0  36.0     459.4  130.2

Means and standard deviations of these measures are given in Table I; they are plotted in Fig. 6(a) for each CV syllable and in Fig. 6(b) for each individual speaker.

FIG. 5. Average FFT spectra for nasal consonants produced by (a) male speakers, (b) female speakers, and (c) all speakers. Solid line represents nasal /m/ and dotted line represents nasal /n/.

A /m-n/ classification test was undertaken based on the frequency of the second pole. The nasal segments were identified using templates (means of the f2 for /m/ and /n/) that were established from randomly selected tokens (one-third of the whole sample set). A correct nasal identification score of 86% was obtained when these templates were applied to the remaining two-thirds of the tokens. For comparison, the nasal segments were also classified using the cepstral group-delay spectral distance measure

d_GD = Σ_{i=1}^{p} i² · (c_iR − c_iT)²,  (6)

where c_iR and c_iT were the cepstral coefficients of the reference and test PLP spectra, respectively (Yegnanarayana and Reddy, 1979), and p was equal to the order of the PLP model. A correct nasal identification rate of 87% was obtained.
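As an illustration of the two classifiers just described, the following hypothetical Python fragment implements the nearest-template rule on the second-pole frequency and the group-delay distance of Eq. (6); the function names and the handling of the one-third/two-thirds split are assumptions of this sketch.

    import numpy as np

    def classify_by_f2(f2, template_f2_m, template_f2_n):
        # Nearest-template rule on the second-pole frequency; the template
        # means come from a randomly selected third of the tokens.
        return "m" if abs(f2 - template_f2_m) <= abs(f2 - template_f2_n) else "n"

    def group_delay_distance(c_ref, c_test):
        # Eq. (6): d = sum_{i=1..p} i^2 * (c_iR - c_iT)^2, p = PLP order.
        c_ref, c_test = np.asarray(c_ref, float), np.asarray(c_test, float)
        i = np.arange(1, len(c_ref) + 1)
        return float(np.sum(i ** 2 * (c_ref - c_test) ** 2))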
III. DISCUSSION AND CONCLUSION

The acoustic theory of speech production predicts that the place of articulation of nasal consonants is characterized by an antiresonance as well as by heavily context-dependent resonances. Effective methods for measuring the antiresonance frequency, however, are not available. The present study essentially uses the broad-range spectral characteristics closely related to the antiresonance to characterize the place of articulation of nasal consonants. Results demonstrated that differences of energy distribution in the middle frequency range between /m/ and /n/ (see Fig. 5) were clearly reflected in the second transformed spectral peak (see Fig. 4). The frequency of the second transformed spectral peak has been effectively used to identify the place of articulation of nasal consonants. These results conform with the conjecture that the spectral perturbations introduced by the antiresonance are quantitatively more convenient and effective than the exact antiresonance frequency itself in characterizing the place of articulation of nasal consonants once an appropriate spectral transformation is undertaken.

Auditory-based spectral transformations have been applied to speech analysis in a number of studies (e.g., Kewley-Port and Luce, 1984; Kurowski and Blumstein, 1987; Hermansky, 1986). While the implementation details of the transformation might differ among the studies, these studies have demonstrated some of the advantages in speech analysis of applying the two basic operations, amplitude-weighted integration and frequency warping. The PLP all-pole modeling of the transformed spectra (Hermansky, 1990) provides an effective algorithm for applying the transformation to speech analysis. The poles of the PLP all-pole model were used in this study to characterize and identify the place of articulation of nasal consonants. Since the transformed poles can be automatically derived from the recorded acoustic waveform and the identification rate based on the poles was reasonably high, it appears that a perceptually based analysis procedure for automatic identification of place of articulation of nasal consonants has been validated.

The differences revealed in the transformed spectra between /m/ and /n/ are interpretable on the basis of the underlying mechanism of nasal production and the properties of the perceptually based transformation. As stated earlier, the center frequency of the antiresonance of a nasal consonant is difficult to locate directly. However, there are observable spectral perturbations at different frequency ranges for different nasal consonants (see Fig. 4). For example, nasal /n/ may have a higher, less sharply tuned antiresonance than /m/ (Fujimura, 1962) and, thus, the energy reduction associated with the antiresonance is located higher in frequency and has a wider effect for /n/ than for /m/. This difference could be further deepened by the auditory-based transformation, because the integration interval in the transformation is smaller for low frequencies than for high frequencies, making distributed spectral perturbations at higher frequencies more significant. Both factors may have contributed to the differences between /m/ and /n/ in the second pole of the transformed PLP spectra.

The nasal identification rate based on the frequency of the second transformed pole was comparable to that based on cepstral distance, a measure which uses information from the whole spectrum (Gray and Markel, 1976). However, the features characterized by the transformed poles have a more direct perceptual implication than the cepstral distance measure. Hermansky (1990) claimed that the transformed spectral characteristics for vowels support the concept of effective second formant frequency (Fant and Risberg, 1962) and auditory normalization (Bladon and Lindblom, 1981). The fact that the frequency of the second transformed pole is the dominant acoustic feature for differentiating nasal consonants again supports the claimed importance of the effective second formant frequency. The gender-independent nature of the second spectral peak also supports the hypothesis that perceptual normalization may be performed at the auditory level.

With the established correlation between nasal production and its transformed spectrum, a universal or absolute criterion could be developed for nasal classification. Using such a criterion, the identification of place of articulation of nasal consonants would involve only a one-pass or bottom-up process. In contrast, the identification of nasal consonants based on Euclidean spectral distance will always depend on the estimation of prototypes for each sound category, and a decision cannot be made until cross comparisons with each prototype are completed. Such a process offers few insights about the coding and decoding activities involved in human speech communication. The present work represents one of the efforts that probe the relevance of speech production to perception mechanisms.

In this work, however, only the nasal murmur was used to identify the place of articulation of nasal consonants. Although a reasonably high identification rate (86%) was achieved using the PLP method, this rate is far from perfect, especially since chance performance is 50%. Information present in the transition period between the nasal and the following vowel could conceivably be used to increase the identification rate. It remains to be determined how the transitional information could be incorporated into the PLP analysis without resorting to the conventional computational methods of pattern recognition.



FIG. 7. (a) Waveform of the nasal-vowel combination /n/ and /i/ in CV (upper) and VC (lower) context and the corresponding transformed running spectra in CV (upper) and VC (lower) context. (b) Waveform of the nasal-vowel combination /m/ and /æ/ in CV (upper) and VC (lower) context and the corresponding transformed running spectra in CV (upper) and VC (lower) context.
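The "transformed running spectra" of Fig. 7 can be approximated by applying the PLP analysis frame by frame over a token. A rough sketch, reusing the hypothetical plp_poles() helper introduced above, with frame and step sizes chosen only for illustration (25.6-ms frames, 2-ms steps at 10 kHz), is:

    def running_plp_spectra(x, W, frame=256, step=20, order=5):
        # One set of fifth-order PLP poles per analysis frame across the token.
        return [plp_poles(x[start:start + frame], W, order)
                for start in range(0, len(x) - frame, step)]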

The continuous movement of articulators in speech production often imposes contextual effects on the predicted acoustic features of individual speech sounds. While a number of studies have supported the assertion that context-independent or invariant acoustic features for place of articulation exist despite acoustic variations introduced by production dynamics (Stevens and Blumstein, 1978; Kewley-Port, 1982; Kurowski and Blumstein, 1987), several problems remain to be resolved before final conclusions can be reached. The context-independent nature of the acoustic characteristics for nasal consonants in CV context demonstrated in this study lends some support to the acoustic invariance hypothesis. Automatic identification of nasals in a VC context based on the same features, however, has achieved only a slightly better than chance nasal identification rate. The reduced high-frequency energy in the vowel segment because of nasalization has smeared the discontinuity between vowel and nasal in the VC combinations (Stevens et al., 1987). Furthermore, it appears that the second half of the VC segment became a nasalized version of the leading vowel, and the desired nasal murmur could not be identified either from the filtered version of the signal or from a spectrogram. Consequently, a significant contextual effect is present until the very last 20 to 30 ms, where the energy of the signal is close to zero. The absence of discontinuity, particularly in the high-frequency range, and the substantial contextual effect (see Fig. 7) seem to be the major difficulties in reliably obtaining the reported features. Similar difficulties have been reported in previous studies (Kurowski and Blumstein, 1987). Thus it cannot be claimed that the features revealed here represent universal, context-independent features for both syllable-initial and syllable-final nasals.

In summary, this study established a procedure for analyzing the acoustic characteristics of nasal consonants using the PLP method. The second transformed pole was used to automatically identify the place of articulation of nasal con-


Downloaded 21 Sep 2011 to 128.146.89.68. Redistribution subject to ASA license or copyright; see http://asadl.org/journals/doc/ASALIB-home/info/terms.jsp sonantsin a CV context.Reasonably high identification Kay, S. (1987).Modern Spectral Estimation (Prentice-Hall, Englewood rates were achieved. Cliffs, NJ). Kewley-Port,D. (1982)."Measurement of formanttransitions in naturally producedstop consonant-vowel syllables," J. Acoust.Soc. Am. 72, 379- ACKNOWLEDGMENTS 389. The authorsare gratefulto Dr. OsamuFujimura for Kewley-Port,D., and Luce,P. (1984). "Time-varyingfeatures of initial stopconsonants in auditoryrunning spectra: A firstreport," Percept. many suggestionsand commentson earlier versionsof this Psychophys.35, 353-360. paper.This workis supported,in part,by a grantfrom U.S. Klatt, D. (1982). "Predictionof perceivedphonetic distance from critical- West Telecommunication Inc. band spectra:A first step," Proc. IEEE ICASSP-82, 1278-1281. Kurowski,K., and Blumstein,S. (1984). "Perceptualintegration of the murmur and formanttransitions for placeof articulationin diffusenasal consonants,"J. Acoust. Soc. Am. 76, 383-390. Bladon,A., and Lindblom,B. (1981). "Modelingthe judgement of vowel Kurowski,K., andBlumstein, S. (1987). "Acousticproperties for placeof quality differences,"J. Acoust.Soc. Am. 69, 1414-1422. articulationin nasalconsonants," $. Acoust.Soc. Am. 81, 1917-1927. Fant, G. (1960). AcousticTheory of SpeechProduction (Mouton, The Markel,J. D., andGray, A. H. (1976).LinearPrediction ofSpeech (Spring- Hague,The Netherlands). er-Verlag,Berlin), pp. 5-38. Fant,G., andRisberg, A. (1962). "Auditorymatching of vowelswith two Mermelstein,P. (1977). "On detectingnasals in continuousspeech," J. formantsynthetic sounds," STL-QPRS 4, 7-11 (RoyalInstitute of Tech- Acoust. Soc. Am. 61, 581-587. nology,Stockholm). Reddy,D. R. (1976). "Speechrecognition by machine:A review,"Proc. Fisher,R. A. (1946). The Designof Experiments(Oliver and Boyd, Edin- IEEE. 64, 501-531. burgh), 3rd ed. Schroeder,M. (1977)."Recognition ofcomplex acoustic signals," Life Sci- Flanagan,J. (1972). SpeechAnalysis, Synthesis and Perception(Springer- encesResearch Report 5, editedby T. Bullock( AbakonVerlag, Berlin), Verlag, Berlin). p. 324. Fujimura, O. (1962). "Analysisof nasalconsonants," J. Acoust.Soc. Am. Stevens,K. N., andBlumstein, S. (1978). "Invariantcues for placeof ar- 34, 1865-1875. ticulationin stopconsonants," J. Acoust.Soc. Am. 64, 1358-1368. Fujimura,O., and Lindqvist,J. (1971). "Sweep-tonemeasurements of vo- Stevens,K. N., Fant, G., and Hawkins, S. (1987). "Some acousticaland cal tract characteristics," $. Acoust. Soc. Am. 49, 541-558. perceptualcorrelates of nasalvowels," in LehisteFestschrifi, edited by Gray, A. H., andMarkel, J. D. (1976). "Distancemeasures for speechpro- RobertChannon and Linda Shockey (Foris, Providence, RI), pp. 241- cessing,"IEEE Acoust.Speech Signal Proc. ASSP-24, 380-391. 254. Harrington,J. (1988). "Automaticrecognition of Englishconsonants," in Yegnanarayana,B., and Reddy, R. (1979). "A distancemeasure derived Aspectsof SpeechTechnology, edited by S. Michaelsonand Y. Wilks (Ed- fromthe firstderivative of the linearprediction phase spectra," Proc. inburghU.P., Edinburgh,England), pp. 69-143. IEEE ICASSP-79, 744-747. Hermansky, H. (1986). "Perceptuallybased processingin automatic Zwicker, E. (1970). "Masking and psychologicalexcitation as conse- speechrecognition," Proc. IEEE. ICASSP-37, 1971-1974. quencesof ear'sfrequency analysis," in FrequencyAnalysisand Periodic- Hermansky,H. 
(1990). "Perceptuallinear predictive (PLP) analysisof ity Detectionin Hearing,edited by R. Plompand G. F. Smoorenburg speech,"J. A½oust.Soe. Am. 87, 1738-1752. (Sijthoff, Leyden,The Netherlands).

