<<

ACOUSTIC, PSYCHOACOUSTIC, AND PSYCHOPHYSIOLOGICAL MEASURES OF DISTANCE IN THE FINNISH SPACE

Mietta Lennes1, Anne Lehtokoski1, Paavo Alku2, and Risto NŠŠtŠnen1 1. University of Helsinki, Finland 2. University of Turku, Finland

ABSTRACT present study deals with an acoustic and psychoacoustic The aim of this study was to compare different vowel description of the vowel space. representations and distance measures by applying them on A reliable perceptual distance measure would be valuable eighteen isolated semi-synthetic Finnish vowel stimuli. For when studying phonetic changes in vowel phones. The vowel spectral distance measures, the FFT power spectra and auditory systems of different could be compared through the spectra were calculated. Using different combinations of the dimensions and structure of their vowel spaces. Such a measure vowel stimuli, the amplitudes of the mismatch negativity (MMN) would also support clinical evaluation methods and help develop event-related brain response were measured individually for four and test perceptual models of . In order to adult subjects to study the psychophysiological distances develop clinically applicable psychophysiological measures of between Finnish . A correlation was found between the perception, it is essential to have a reliable acoustic or Euclidean distances of both the FFT power spectra and the even psychoacoustic reference measure with which to compare auditory spectra of the vowels in comparison to the the brain responses. corresponding MMN amplitudes. The results are important for comparative studies of the vowel spaces of different languages. 1.2. Acoustic distance An FFT spectrum equally represents all that are 1. INTRODUCTION present in the signal. Plomp [7] developed a distance measure for Many studies have concentrated on searching for invariance in stationary complex tones, which calculates the distance between vowel or looked for physical or perceptual correlates for two spectra. phoneme boundaries. However, vowel phones are not absolute nor stable objects. Therefore it is important to find comparative 1.3. Psychoacoustic distance methods for studying the relative structure and dimensions of the Formants (particularly the frequencies of three lowest formants) vowel spaces of different languages and individuals. The present are usually referred to as the acoustic parameters that mainly study compares the basic premises, possibilities and limitations determine the perceived quality of a vowel. The two dimensions of different acoustic and psychoacoustic vowel representations of the plane that can be printed on paper are a probable cause for and applies spectral and psychophysiological distance measures the general need to use two-dimensional vowel representations. to a set of Finnish vowel stimuli. Formants have provided solutions to this problem, e.g., two- or three-dimensional formant charts are widely used to visualize the 1.1. Vowel distance structure of vowel systems [4, 5, 6]. When measuring spectral distances for acoustic vowel sound Iivonen [3] points out that many factors contribute to the signals, one must obtain a numerical representation of each psychoacoustic vowel qualities. Therefore, it may be argued that sound under study. At present, this implies losing the temporal in stead of formant combinations, the overall spectral information of the sound signal: sample spectra of both members pattern of vowels should be taken into account when developing of the vowel pair must be taken at a certain comparable point in a distance measure. time or the spectral properties of each signal averaged over a To obtain psychoacoustic or auditory spectra of vowels, the selected time interval. These spectral representations of vowel FFT power spectra may be filtered according to a model of sound signals may be further processed to obtain peripheral (and central) auditory processes. The Euclidean psychoacoustically weighed spectra. Another option for the distance version of Plomp's [7] formula has been shown to be a purposes of distance measurements would be to extract certain fairly good correlate of perceptual distance [8, 9]. parameters, e.g., formant frequencies from the vowel spectra. When determining the psychoacoustic spectral properties of 1.4. Psychophysiological distance steady-state speech signals, attention has been paid to several The changes in brain activity elicited by a certain sound stimulus principles: taking the ear sensitivity curve into account, can be studied by recording the auditory event-related potentials realization of the physiological frequency scale of hearing in mel (ERP) for the stimulus. The mismatch negativity (MMN) [10] is or Bark scales, using frequency resolution in accordance with the an ERP component, typically peaking at 100Ð200 ms from critical band, and including the frequency level masking effect stimulus onset. It is elicited when an infrequent "deviant" sound [1]. A recent alternative to the is the ERB scale [2]. stimulus occurs in a sequence of repetitive "standard" sound Iivonen [3] states that in the description of a listener's stimuli. The amplitude of the MMN response has been shown to psychoacoustic vowel space there are three basic problems: "(1) reflect the perceived frequency difference between "standard" the scale problem, (2) the combination of parameters that define and "deviant" simple tone stimuli [11]. However, in the case of the space, and (3) the question of how sensitive the speech sounds, a -specific phoneme identification psychoacoustical ability to resolve vowel differences is". The process may also contribute to the MMN [12].

page 2465 ICPhS99 San Francisco

1.5. Problems and hypotheses Vowel F1 F2 F3 This study attempts to visualize and compare the spectral i 320 2263 2770 distances and formant-based representations of a set of Finnish vowel stimuli. It was hypothesized that the spectral distances are y 320 1822 2200 reflected in the corresponding MMN amplitudes. u 320 566 2500 A basic problem in perceptual distance experiments is the fact that the use of the phonetic labels in dissimilarity judgments o 460 820 2468 cannot be distinguished from "purely" psychoacoustic, A 700 1130 2468 nonlinguistic judgment criteria. Fox [13] reported results of e 450 2198 2770 paired-comparison tests indicating that the similarity judgments of vowels are sensitive to subphonemic variations, i.e., the Ï 654 1695 2468 judgments are based at least partially on the acoustic P 450 1350 2540 representations rather than the phonetic labels of the stimuli. Therefore, it makes sense to compare spectral distances with i\y 320 2043 2485 MMN responses, which can be measured without the subjectÕs y\u 320 1194 2350 attention to the stimuli. u\o 410 730 2484 2. METHODS o\A 560 920 2468 2.1. Subjects i\e 385 2231 2770 Four subjects (all normal-hearing speakers of Finnish, right- handed, aged 21Ð23 years; one male) participated in the e\Ï 552 1947 2619 psychophysiological and vowel identification experiments. Two e\P 450 1650 2655 subjects had previous experience of synthetic vowels and all P\o 455 1000 2504 subjects of ERP experiments. i\P 385 1750 2655 2.2. Stimuli P\A 595 1240 2504 Eighteen isolated steady-state vowel sounds were created using a Table 1. The Finnish semi-synthetic vowel stimuli and their new method, Semi-synthetic Speech Generation (SSG), which formant frequencies (Hz). The formant frequencies for the produces synthetic vowels from natural glottal excitation in intermediate sounds (marked with two vowel symbols separated conjunction with an artificial vocal tract model [14]. The glottal with a slash /) were obtained by taking the mean formant excitation was computed from a sustained male voice phonation frequencies of their neighbours. of the Finnish vowel [A]. The utterance was automatically inverse filtered using a method described by Alku [15], hence Condition Standard Deviant 1 Deviant 2 Deviant 3 cancelling the effects of the vocal tract and lip radiation. The resulting glottal excitation was then re-filtered with an eighth- 1 i y y\u u order all-pole filter with coefficients adjusted to shift the 2 u y i\y i frequencies of the three lowest formants (F1, F2, F3) according to the desired vowel qualities. The formant frequencies (see 3 e P P\o o Table 1) were chosen on auditory grounds by two Finnish 4 o P e\P e phoneticians. Eight stimuli were to represent prototypical vowels 5 i e e\Ï Ï [I, y, u, o, A, e, Ï, P] of Finnish and ten stimuli [i\y, y\u, u\o, o\A, i\e, e\Ï, e\P, P\o, i\P, P\A] were intermediates of these 6 Ï e i\e i prototypes. The stimuli were cut to the duration of 400 ms and 7 u o o\A A filtered with a Hanning filter (10 ms r/f times) to avoid clicks. 8 A o u\o u 2.3. Experiments 9 i P P\A A Four Finnish adult subjects were presented with sequences of 10 A P i\P i different combinations of the vowel stimuli using the mismatch negativity experimental paradigm. Each stimulus condition Table 2. Stimulus conditions for the ERP measurements. involved one frequent (standard) stimulus, which was randomly Intermediate vowel stimuli are marked with a slash /. replaced by one of three deviant stimuli. The different stimulus conditions are shown in Table 2. EEG was measured and the 3. RESULTS stimulus sequences presented to the subject while he or she was 3.1. Spectrum analysis reading a book. Measurement sessions were repeated until a FFT power spectra were calculated with a 512-point Hamming minimum of 120 averaged samples was reached for each subject window from the temporal midpoints of all vowel stimuli. and each deviant stimulus. The average MMN amplitudes for the Frequency components from 0 to 11025 Hz were included in 43 deviant stimuli from electrode Cz were measured (of a 50 ms Hz increments. The auditory spectra were obtained by weighting time frame at the largest negative peak between 100 and 250 ms the power spectra by the ear sensitivity function and the from stimulus onset) and separately averaged for each stimulus spreading function. condition and individual. The formant frequencies of three lowest formants were measured from FFT spectra and found to be almost identical to

page 2466 ICPhS99 San Francisco

the values used in the stimulus creation procedure. Formant dimensional solution for the distances of FFT spectra was not diagrams of the stimuli are plotted in Figure 1 for Bark-scaled acceptable (it did not account for most of the variance). formant values and in Figure 2 for ERB-scaled formants. Therefore, a three-dimensional solution was calculated as well The F0 of each stimulus was measured and found to be (see Figure 5). For a detailed discussion of the MD scaling practically constant between 112 and 113 Hz in all stimulus method, see Basalaj [16]. signals. -1 Bark F2 i\e 15 13 11 9 7 5 e i e\œ œ i\P 3 i\y A y P\o i i\y y y\u u P\A P e\P o i\e i\P 4 o\A y\u u\o u\o e e\P P P\o o 5 F1 e\œ o\A u P\A 6 0,2 œ 0,6 -0,8 A Figure 3. Eighteen semi-synthetic Finnish vowel stimuli plotted 7 Bark according to the Euclidean distances of their FFT power spectra. Multidimensional scaling was used to force the vowel points to a Circle size = F3: 13 15 17 two-dimensional plane. The fit of this two-dimensional solution Figure 1. The vowel stimuli plotted according to three lowest with the distance information was not very good, i.e., the relative formants in Bark scale. Along with circle size, darkness and spectral distances were not as small as suggested by the figure. character size indicate the perceptual height of F3. -0,6 F2 u ERB y\u 23 21 19 17 15 13 11 y 6 i\y i u\o i\P i y y\u u i\e o i\y 8 e\P P P\o i\e i\P u\o e e e\P P P\o o e\œ P\A 10 F1 œ o\A e\œ o\A A P\A œ 12 0,8 A -1,2 0,2 Figure 4. Eighteen semi-synthetic Finnish vowel stimuli plotted 14 ERB according to the Euclidean distances of their auditory spectra. Multidimensional scaling was used to force the vowel points to a Circle size = F3 : 20 25 two-dimensional plane. Note the separation of the circles and the Figure 2. The vowel stimuli plotted according to three lowest similarities of the configuration with a two-dimensional formant formants in ERB scale. Along with circle size, darkness and diagram. This solution fits well with the calculated distances. character size indicate the perceptual height of F3. 3.3. Analysis of the ERP data 3.2. Acoustic and psychoacoustic distance measurements Considerable interindividual variation was found in both the To obtain acoustic and psychoacoustic distance measures, identifications and the MMN amplitudes. Therefore, MMN symmetric Euclidean distance matrices were calculated for both amplitudes were analysed separately for each individual subject. the FFT power spectra and the auditory spectra of each vowel Multidimensional scaling could not be used to visualize the pair. MMN results for three reasons. First, there was not enough data For visual inspection of distances, two- and three- for the results to be reliable on the individual level, and too much dimensional configurations of the vowel stimuli were calculated variability for an average computation. Second, not all stimulus by applying a multidimensional scaling algorithm on the distance combinations were present in the psychophysiological data, matrices. This algorithm constructs an n-dimensional space so preventing MD scaling. Third, the MMN amplitudes for both that the distances between vowel points in the space correspond endpoints of each stimulus "continuum" were compared. No optimally to the input distance matrix. The two-dimensional significant correlation was found. configurations are shown in Figures 3 and 4. The fit of the two-

page 2467 ICPhS99 San Francisco

perceptual distance between two temporally changing vowel -0,2 patterns would be useful when analysing running speech u samples. However, this requires more knowledge of the perception of prosodic phenomena.

y i\y y\u REFERENCES i u\o [1] Karjalainen, M. 1982a. Formanttiparametrien mittauksesta ja P analyysistŠ (English summary: On the analysis and measurement of formant parameters). In A. Iivonen and H. Kaskinen (ed.), X Fonetiikan i\e o o\A th e e\P PŠivŠt Tampereella 20.Ð21.3.1981 (10 Meeting of Finnish e\œ P\A Phoneticians), Publications of the Department of Finnish and General Linguistics, University of Tampere, 7, 123Ð137. i\P [2] Moore, B. C. J. and Glasberg, B. R. 1983. Suggested formulae for Circle size œ A P\o calculating auditory-filter bandwidths and excitation patterns. Journal of -1 0 1 the Acoustical Society of America, 74, 750Ð753. 0,5 [3] Iivonen, A. 1994. A psychoacoustical explanation for the number of 0,4 -0,6 major IPA vowels. Journal of the International Phonetic Association, 24, Figure 5. Eighteen semi-synthetic Finnish vowel stimuli plotted 73Ð90. according to the Euclidean distances of their auditory spectra. [4] Carlson, R., Granstršm, B. and Fant, G. 1970. Some studies Multidimensional scaling was used to force the vowel points to a concerning perception of isolated vowels. Speech Transmission three-dimensional plane. Along with circle size, the darkness and Laboratory Ð Quarterly Progress and Status Report, 2Ð3, 19Ð35. character size indicate the third dimension. [5] Bladon, A. 1983. Two-formant models of vowel perception: shortcomings and enhancements. Speech Communication, 2, 305Ð313. [6] Darwin, C. J. and Gardner, R. B. 1985. Which contribute 3.4. Comparing the acoustic, psychoacoustic, and to the estimation of first formant frequency? Speech Communication, 4, psychophysiological results 231Ð235. The mismatch negativity amplitudes significantly correlated with [7] Plomp, R. 1970. as a multidimensional attribute of complex the Euclidean distances of both the acoustic (p<0.001) and the tones. In Plomp, R. and Smoorenburg, G. F. (eds.) Frequency analysis psychoacoustic (p<0.05) vowel spectra. However, large variation and periodicity detection in hearing. Leiden: Sijthoff. was found in the MMN amplitudes. [8] Karjalainen, M. 1982b. Puheen perifeerisen kuulemisen laskennallisista malleista (English summary: Computational models of 4. DISCUSSION speech analysis in the peripheral auditory system). In A. Iivonen and H. Kaskinen (ed.), .), XI Fonetiikan PŠivŠt (11th Meeting of Finnish It was found that MMN amplitude correlates with the acoustic Phoneticians) Ð Helsinki 1982. Publications of the Department of and psychoacoustic distances of vowel spectral Further analysis , University of Helsinki, 35, 89Ð118. and collection of additional ERP data will be in order, since the [9] Bladon, R. A. W. and Lindblom, B. 1979. Auditory modeling of effect of phonemic categories on the MMN amplitudes could not vowels. Speech Communication Papers presented at the 97th Meeting of be resolved due to a lack of data. Possible learning effects could the Acoustical Society of America, 12Ð16 June 1979, D1: 1Ð4. not be controlled, either. Statistically, MMN may not yet be [10] NŠŠtŠnen, R., Gaillard, A. W. K., and MŠntysalo, S. 1978. Early reliable enough as a psychophysiological measure of individual selective-attention effect reinterpreted. Acta Psychologica, 42, 313Ð329. vowel distance. [11] Lang, H., Nyrke, T., Ek, M., Aaltonen, O., Raimo, I. And NŠŠtŠnen, In this particular case, the acoustic distance measure of FFT R. 1990. Pitch discrimination performance and auditory event-related potentials. In C. H. M. Brunia et al. (ed.), Psychophysiological brain spectra is probably very accurate, because the glottal excitation research, 1, 294Ð298. Tilburg University Press. and therefore also the structure remained constant for [12] NŠŠtŠnen, R., Lehtokoski, A., Lennes, M., Cheour, M., Huotilainen, all stimuli. If the glottal excitation and the time point had not M., Iivonen, A., Vainio, M., Alku, P., Ilmoniemi, R., Luuk, A., Allik, J., been identical, small differences in the frequencies of spectral Sinkkonen, J. and Alho, K. 1997. Language-specific phoneme spikes would result in unreliable distance measures. representations revealed by electric and magnetic brain responses. The spectral distances of the vowel stimuli were visualized Nature, 385, 432Ð434. using a multidimensional scaling technique. This way it was [13] Fox, R. A. 1985. Multidimensional scaling and perceptual features: possible to present the distances of the auditory spectra on a two- evidence of stimulus processing or memory prototypes? Journal of dimensional plane, suggesting that only two parameters explain Phonetics, 13, 205Ð217. [14] Alku, P., Tiitinen, H. and NŠŠtŠnen, R. Submitted. A method for most of the variance between the psychoacoustic representations generating natural-sounding speech stimuli for cognitive brain research. of the vowel stimuli. This is well in accord with the fact that the Clinical Neurophysiology. two most varied parameters, namely F1 and F2, are also most [15] Alku, P. 1992. Glottal wave analysis with Pitch Synchronous weighed in the auditory spectra and therefore bear the most Iterative Adaptive Inverse Filtering. Speech Communication, 11, information. 109Ð118. Although the temporal aspect of speech was excluded from [16] Basalaj, W. 1999. Incremental multidimensional scaling method this study for experimental reasons, this choice should not be for database visualization. Proceedings of Visual Data Exploration and self-evident as it takes the experimenter further away from the Analysis VI (SPIE volume 3643), San Jose, California, USA. processes of natural temporal speech. A method for measuring

page 2468 ICPhS99 San Francisco