Perception of speaker age, sex and quality investigated using stimuli produced with an articulatory model

Hartmut Traunmüller†, Anders Eriksson† and Lucie Ménard‡
† Inst. f. lingvistik, Stockholms universitet
‡ Université du Québec à Montréal

ABSTRACT

This paper deals with the perception of linguistic and paralinguistic qualities conveyed by synthetic vowels produced with an articulatory model in which transfer functions of the French vowels /i y e ø ε œ/ characteristic of five growth stages were each combined with five different F0 values. Listeners had to judge the speaker's age and sex in addition to vowel quality. Four subgroups of listeners were distinguished, according to sex and frequency of contact with children. The results were subjected to regression analysis based on critical band rate (z) and logarithmic values of F0, F1 to F5 and calculated values of F2'. This showed Z1 - 0.6 Z0 to correlate highly with vowel openness and 0.8 Z4 - Z3 with roundedness in addition to Z2'. F0 and the formants above F1 contributed equally to age perception. There were slight but significant differences between listener groups, and there was a tendency to perceive vowels as produced by a younger speaker when perceived as rounded, and by an older speaker when not. This can be understood as due to a choice listeners have in interpreting lower formants as due to liprounding or to a permanently longer vocal tract indicative of a higher age.

1. INTRODUCTION

It is well known that the acoustic properties of sounds vary as a function of various factors: linguistic, organic, expressive, and transmittal [1]. In the present investigation, we focus on the perception of vowel quality (linguistic variable), speaker age and sex (organic variable), and vocal effort (expressive quality). Previous studies have shown that variation along these linguistic and paralinguistic dimensions results in a modification of the same acoustic parameters. In natural speech, it is thus quite difficult to determine the parameters related to each quality.

The acoustic correlates associated with the perceived oral vowels of French were investigated in a recent study [2, 3]. An articulatory model of speech production that allowed simulating the non-uniform vocal tract growth from birth to adulthood and synthesizing vowels was used to study the perceptual effects of variations in F0 and the formant frequencies. In the perception experiments, listeners were asked to identify the vowels. The critical band rate difference or "distance" between F1 and F0 (Z1-Z0) was found to be associated with perceived height (or "openness"), in agreement with a previous suggestion [4], while Z2-Z1 was related to perceived [...], and perceived roundedness could be predicted by F2', computed by a model of the effective second formant.

The present approach is different in that we attempt to find, for each dependent variable, those linear combinations of the critical band rate values of F0, F1, F2, F3 and F2' that produce the best prediction of the subjects' performance. We thereby allow for different weights of Z1 and Z0 as correlates of vowel height. Linear combinations of the logarithms of F0, F1, F2, F3 and F2' will also be tried.

Since the articulatory model used in [2, 3] simulated the vocal tracts of speakers of varying age, and F0 was varied independently, there was some additional perceivable variation that remained uninvestigated. This variation, which is the focus of the present experiment, can be expected to affect the perceived age of the speaker, but also to some extent the perceived sex and vocal effort or the apparent distance between speaker and addressee [5].

Possible differences between groups of listeners are searched for by using subgroups that differ in (1) sex and (2) acquaintance and frequency of contact with children. Perceptual differences conditioned by the latter factor have been reported in a previous investigation [6].

The present experiment was designed in order to determine the main acoustic parameters involved in the perception of speaker age in addition to only two aspects of vowel quality: openness and roundedness among front vowels. The stimuli were generated with the same articulatory model as used in [2, 3]. The experiment was also expected to provide some information on the perception of speaker sex and vocal effort and to show whether listeners are affected by their perception of one quality of a stimulus when judging another quality of the same stimulus. Vowel perception has been suggested [1] and also observed [7] to be affected by listeners' expectations, and these are likely to be reflected in sex- and age-ratings.

2. METHOD

2.1. STIMULI

The stimuli consisted of 5-formant vowels generated by formant synthesis with the Variable Linear Articulatory Model developed by S. Maeda [8]. This model integrates knowledge acquired from previous models with the growth data currently available. The growth process is introduced by modifying the longitudinal dimension of the vocal tract according to two scale factors, one for the anterior part of the vocal tract and the other for the pharynx, interpolating the zone in-between. The associated F0-values also reflect the evolution with age. The model is controlled by seven parameters, directly interpretable in terms of functionally organized articulatory blocks (protrusion and labial aperture; jaw height; tongue body, dorsum and tip position; larynx height). The use of such a model enabled us to generate various stimuli while carefully controlling articulatory and acoustic coherence. A detailed description of the model is given elsewhere [2].

Vocal tracts representative of the following ages were simulated: 0, 4, 8, 12, and 21 years old. For each growth stage, articulatory-acoustic prototypes for the six French oral vowels /i y e ø ε œ/ were determined using criteria based on the Dispersion-Focalization Theory (DFT) of vowel systems [9]. For each vowel, the optimal formant triplet was determined. The values of the 4th and 5th formants were determined by the articulatory commands retrieved by an acoustic-to-articulatory inversion method exploiting the pseudo-inverse of the Jacobian matrix. Figure 1 displays the resulting values of the stimuli.

Figure 1: Values of F1, F2 and F3 of all stimuli.

Mean F0 values of 450, 365, 280, 195, and 110 Hz correspond respectively to 0, 4, 8, 12, and 21 years old. However, each of the 30 vocal tract shapes was excited with each of these values of F0. As a result, a set of 150 stimuli (6 vowels x 5 growth stages x 5 F0 values) was available.

2.2. SUBJECTS

Twelve men and twelve women served as listeners in a perceptual experiment. All were native speakers of Canadian French. Within each sex, there were six subjects with a high and six with a low frequency of contact with children. The subjects were not informed of the objectives of the study before the experiment.

2.3. PROCEDURE

The stimuli were presented to the subjects through headphones and in randomized order. The subjects had to judge four qualities: vowel identity, speaker age, speaker sex, and vocal effort. For each quality, they had to select an icon on a screen. Concerning vowel quality, the ten French oral vowels /i y u e ø o ε œ ɔ a/ were available. The subjects were instructed to select no more than two vowels. As for speaker age, subjects had to select one answer among the following choices: 0, 1, 2, 3, 4, 6, 8, 10, 12, 14, 16, 18, 20, 24, and 28 years old. In their judgment of speaker sex, the listeners also had to indicate their confidence. An answer was selected among the following: doubtless male, male, probably male, uncertain, probably female, female, doubtless female. Finally, estimates of vocal effort were obtained by asking the listeners about the apparent distance between speaker and addressee: 0.5, 0.75, 1, 1.5, 2, 3, 4, 6 or 8 meters. The test lasted about thirty minutes and took place in a quiet room.

3. RESULTS

The results were first subjected to a test for between-listener agreement by calculating Cronbach's alpha. The values obtained were 0.993 for openness, 0.978 for roundedness, 0.977 for speaker age, 0.935 for speaker sex and 0.666 for communicational distance. These values show a very high degree of between-listener conformity in judging openness, roundedness, age and sex. In these cases, it is fruitful to analyze the average behavior of the subjects. The much lower value obtained for distance was due to discrepant behavior that emerged clearly in a subject-by-subject analysis. For this reason, perceived communicational distance was not analyzed further.

In order to allow certain numeric calculations, openness was coded as 1 for [i y u], 2 for [e ø o], 3 for [ε œ ɔ] and 4 for [a], while roundedness was coded as 1 or 0. Sex was coded from -3 (doubtless male) to +3 (doubtless female). In addition to the age value as such, some transformations of it were also studied, in order to see whether these were more linearly related to the independent variables. The square root of the age value proved to be such a transformation.

Figure 2 shows, to the left, the average perceived degree of openness and roundedness of all stimuli. Figure 3 shows, also to the left, the perceived age and sex of the speaker for each stimulus. It can be noticed that the speaker was mostly perceived as more or less distinctly male, but at a younger age as of uncertain sex, with a slight bias in favor of female (sex > 0). These figures also show the predictions of linear regression models for each stimulus. Ideally, the locations of the symbols that represent the stimuli should agree in each pair of figures. In the first pair, much of the discrepancy is due to relaxed boundary conditions.

Table 1 presents the performance of linear regression models that describe the results. These models attempt to predict the average value of perceived openness, roundedness, log(age) and sex of each stimulus based on its acoustic properties. The independent variables considered were the critical band rate values of the fundamental and the first five formants (Z0 to Z5). In cases in which F1 < F0, the effective F1 was assumed to be at F0. Two calculated F2' values, Z2' and a slightly modified version, Z2'N [2, 3], were also considered. The analysis of perceived roundedness was performed with and without these. In additional analyses, log(Fn) was considered instead of Zn. The log(Fn)-based models did slightly worse in describing the results for openness, but slightly better for age and sex.

Table 1. Performance of linear regression models of perceived openness, roundedness and speaker age and sex. Independent variables were critical band rates Zn or logarithmic values Ln of F0 and the formant frequencies.

Depend.
variable   Regression equation                                   r2
OPN        -0.29 +0.45 Z1                                        0.710
           0.44 +0.46 Z1 -0.29 Z0                                0.881
           1.35 +0.51 Z1 -0.30 Z0 -0.08 Z2                       0.904
           -3.74 +4.65 L1 -1.46 L0 -3.16 L2 +2.11 L3             0.892
RND        2.60 -0.13 Z2'N                                       0.674
           2.48 -0.13 Z2'N -0.038 Z0                             0.688
           3.36 -0.17 Z3                                         0.633
           1.03 -0.41 Z3 +0.34 Z4                                0.695
           0.67 -0.34 Z3 +0.62 Z4 +0.05 Z0 -0.11 Z2 -0.22 Z5     0.762
           3.67 -1.63 L2 +0.30 L0 +8.55 L4 -4.99 L3 -3.23 L5     0.762
L_AGE      1.39 -0.20 Z0                                         0.464
           4.70 -0.20 Z0 -0.18 Z4                                0.865
           4.22 -0.20 Z0 -0.14 Z4 -0.05 Z1                       0.900
           3.70 -1.20 L0                                         0.682
           12.01 -1.20 L0 -2.25 L4                               0.877
           11.49 -1.20 L0 -1.72 L4 -0.52 L1                      0.910
SEX        -1.77 +0.45 Z0                                        0.541
           -0.643 +0.45 Z0 +0.24 Z5                              0.698
           -7.38 +2.86 L0                                        0.634
           -17.64 +2.86 L0 +2.72 L5                              0.785

The mean ratings by the groups of male and female listeners and by listeners with more and with less experience with children are shown in Table 2. It can be seen there that men and listeners with more child experience rated the speakers as younger and slightly less male. It can also be seen that more child-aware listeners rated the vowels as slightly more open.

Table 2. Mean ratings by the four groups of listeners (4 x 1800 ratings). Differences that were significant (p < 0.01) in a linear regression model are shown in bold face.

SUBJECT GROUP       OPN     RND     L_AGE    AGE      SEX
Male                2.16    0.53    0.797             -0.46
Female              2.18    0.51    0.860             -0.59
Difference          -0.02   +0.02   -0.063   (-14%)   -0.13
More child-aware    2.22    0.51    0.812             -0.48
Less child-aware    2.12    0.53    0.845             -0.58
Difference          +0.10   -0.02   -0.033   (-7%)    -0.10

Table 3 shows the order in which the independent variables entered into a stepwise linear regression analysis of the results obtained from the individual subjects in each group. It can be seen there that male and more child-aware listeners in their age ratings attached more weight to the higher formants, represented by Z4, while female and less child-aware listeners attached more weight to the fundamental (Z0). It can also be seen that the more child-aware listeners in their roundedness ratings attached a higher weight to Z2 than the other groups did. Other between-group differences were rather marginal in nature.

Figure 2: Openness plotted against roundedness for each stimulus. Left: average perceived; Right: predicted.

Figure 3: Square root of age in years plotted against perceived sex (-3 = definitely male, 0 = uncertain) for each stimulus. Left: average perceived; Right: predicted. F0 = 110 Hz (filled squares), 195 Hz (open squares), 280 Hz (triangles), 365 Hz (circles), 450 Hz (filled circles).

The variance explained for the female group was in each case higher than in the male group, probably due to more between-subject variation within the latter. Due to the between-subject variation, the variance explained is in each case lower than in the models based on the average data (Table 1).

A regression analysis of the individual results showed the listeners to have been significantly affected by the perceived roundedness of a stimulus when judging its openness (and vice versa) and when judging the speaker's age (and vice versa). Stimuli perceived as rounded were perceived as 0.13 units less open and as produced by a 14% younger speaker. Perceived speaker sex showed no significant influence on any other quantities or vice versa.

Table 3. Variables considered in a stepwise regression analysis of the ratings by the four groups of listeners. The more important variables, explaining more than 95% of the ultimate variance explained (r2), are shown in bold face.

SUBJECT GROUP       OPN                   RND                       L_AGE             SEX
Male                Z1, Z0, Z2, Z4, Z5    Z3, Z4, Z2, Z0, Z5, Z1    Z4, Z0, Z1, Z2    Z0, Z4
  r2                .763                  .487                      .550              .222
Female              Z1, Z0, Z2, Z4, Z5    Z3, Z4, Z0, Z2, Z5, Z1    Z0, Z4, Z1, Z3    Z0, Z5
  r2                .782                  .507                      .555              .320
More child-aware    Z1, Z0, Z2, Z4, Z5    Z2, Z0, Z4, Z3, Z5        Z4, Z0, Z1, Z2    Z0, Z5
  r2                .779                  .494                      .564              .269
Less child-aware    Z1, Z0, Z2, Z3        Z3, Z4, Z0, Z5, Z2        Z0, Z4, Z1        Z0, Z4
  r2                .768                  .497                      .530              .267

4. DISCUSSION

Although the analysis confirms that listeners consider F1 in relation to F0 in the perception of vowel openness, the correlate Z1-Z0 [2, 3, 4] explains only 84.1% of the variance, while 90.4% is explained when the weight of Z0 is reduced to about 60% of that of Z1 and a small contribution of Z2 is also taken into account. As for roundedness, a calculated value of Z2' [2, 3] explains much of the variance, 67.4% (68.8% if considered together with Z0), but so does Z3 considered in relation to Z4 (69.5%, and with additional Zn 76.2%).

The perception of speaker sex was found to be mainly based on F0, with a smaller contribution by the higher formants, while F0 and the higher formants contributed to approximately the same extent to the perception of speaker age. The difference observed in the latter case between different groups of listeners may possibly be due to differences in the ease with which F0 and the higher formants are detected by listeners. While this is only a vague suggestion, the results confirm the existence of perceptual differences linked with a subject's sex and frequency of communication with children.

The discrepancies between right and left in Figure 2 are mainly due to the fact that the procedure used here ignores the boundaries (1 < OPN < 4, 0 < RND < 1) and the bias in favor of integer values that results when most listeners agree in their categorization of a vowel, which is reflected in the figure to the left. In Figure 3, to the left, there is an excess of independent variation between perceived age and sex at F0 = 195 Hz, which is not replicated to the right. This suggests that compatibility testing is a component in the perception of a speaker's age and sex that has not been captured adequately in the modeling. It could possibly be captured by also considering interactions.

The observed tendency to perceive vowels as produced by a younger speaker when perceived as rounded can be understood as due to a choice listeners have in interpreting lower formants as due to occasional liprounding or to a more permanently longer vocal tract indicative of a higher age. This choice affects the formant frequencies listeners expect vowels of a given degree of openness to have [1]. These effects are analogous to the effects of listeners' expectations based on prior or different-modality experience reported in [6]. The failure to observe an effect of perceived sex in the present experiment may be due to the absence of sex variation in the vocal tract shapes used.

REFERENCES

[1] H. Traunmüller, "Conventional, biological, and environmental factors in speech communication: A modulation theory", Phonetica 51: 170–183, 1994.

[2] L. Ménard, J.-L. Schwartz, L.-J. Boë, S. Kandel and N. Vallée, "Auditory normalization of French vowels synthesized by an articulatory model simulating growth from birth to adulthood", J. Acoust. Soc. Am. 111: 1892–1905, 2002.

[3] L. Ménard, "Production et perception des voyelles au cours de la croissance du conduit vocal: variabilité, invariance et normalisation", doctoral thesis, Université Grenoble III, 2002.

[4] H. Traunmüller, "Perceptual dimension of openness in vowels", J. Acoust. Soc. Am. 69: 1465–1475, 1981.

[5] A. Eriksson and H. Traunmüller, "Perception of vocal effort and distance from the speaker on the basis of vowel utterances", Perception and Psychophysics 64: 131–139, 2002.

[6] H. Traunmüller and R. van Bezooijen, "The auditory perception of children's age and sex", in Proceedings ICSLP-94, vol. 3: 1171–1174, 1994.

[7] K. Johnson, E.A. Strand and M. D'Imperio, "Auditory-visual integration of talker gender in vowel perception", Journal of Phonetics 27: 359–384, 1999.

[8] L.-J. Boë and S. Maeda, "Modélisation de la croissance du conduit vocal. Espace vocalique des nouveaux-nés et des adultes. Conséquences pour l'ontogenèse et la phylogenèse", Journées d'Études Linguistiques: "La Voyelle dans Tous ses États", Nantes, 98–105, 1997.

[9] J.-L. Schwartz, L.-J. Boë, N. Vallée and C. Abry, "The Dispersion-Focalization Theory of vowel systems", Journal of Phonetics 25: 255–286, 1997.
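
As an illustration of how a regression equation from Table 1 can be applied to a single stimulus, the following minimal Python sketch converts F0 and F1 to critical band rate and evaluates the second openness model (0.44 + 0.46 Z1 - 0.29 Z0). The paper does not state which Hz-to-critical-band-rate formula was used; the sketch assumes the basic Traunmüller (1990) approximation without its end corrections, so its values may differ slightly from those underlying the reported results. The function names and the example frequencies are hypothetical.

```python
def hz_to_bark(f_hz):
    """Critical band rate z (in Bark) for a frequency in Hz.

    ASSUMPTION: basic Traunmueller (1990) approximation, without the
    corrections for very low and very high frequencies. The paper does
    not say which conversion it used.
    """
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def predict_openness(f0_hz, f1_hz):
    """Predicted openness from the second OPN row of Table 1.

    Openness is coded 1 for [i y u] up to 4 for [a]; the linear model
    is not clipped to that range.
    """
    # Rule from the Results section: if F1 < F0, the effective F1 is at F0.
    effective_f1_hz = max(f1_hz, f0_hz)
    z0 = hz_to_bark(f0_hz)
    z1 = hz_to_bark(effective_f1_hz)
    return 0.44 + 0.46 * z1 - 0.29 * z0


if __name__ == "__main__":
    # Hypothetical (F0, F1) pairs in Hz, chosen only for illustration.
    for f0, f1 in [(110.0, 300.0), (110.0, 600.0), (450.0, 300.0)]:
        print(f"F0={f0:.0f} Hz, F1={f1:.0f} Hz -> OPN={predict_openness(f0, f1):.2f}")
```

Because the model is an unconstrained linear combination, its output can fall outside the coded range of 1 to 4, which corresponds to the relaxed boundary conditions mentioned in connection with Figure 2.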
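
The between-listener agreement reported in the Results section was assessed with Cronbach's alpha. The sketch below shows the standard formula, treating each listener as an "item" and each stimulus as a "case". It is not the authors' own computation, and the rating matrix is invented purely for illustration.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a list of rows.

    Each row holds the ratings of one stimulus, with one value per listener.
    alpha = k / (k - 1) * (1 - sum of per-listener variances / variance of row sums)
    """
    n_listeners = len(ratings[0])
    if n_listeners < 2 or len(ratings) < 2:
        raise ValueError("need at least two listeners and two stimuli")
    # Transpose so that each tuple holds all ratings given by one listener.
    per_listener = list(zip(*ratings))
    sum_of_listener_variances = sum(variance(col) for col in per_listener)
    # Variance of the summed rating per stimulus (must be non-zero).
    variance_of_sums = variance([sum(row) for row in ratings])
    return n_listeners / (n_listeners - 1) * (
        1.0 - sum_of_listener_variances / variance_of_sums
    )


if __name__ == "__main__":
    # Hypothetical openness codes (1 to 4) from three listeners for four stimuli.
    example = [
        [1, 1, 2],
        [2, 2, 2],
        [3, 4, 3],
        [4, 4, 4],
    ]
    print(f"alpha = {cronbach_alpha(example):.3f}")
```

Values close to 1 indicate that listeners order the stimuli in nearly the same way, which is the sense in which averaging over listeners is justified for openness, roundedness, age and sex but not for distance.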