<<

Prediction of and

Ren6 Carr6 and Maria Mody Drpartement Signal, Unit6 Associre CNRS, ENST carre@ sig.enst.fr, mody @ sig.enst.fr

Abstract a simple acoustic tube, open at one end and closed at the other, and modeled after a male vocal tract, will A deductive approach is used to predict vowel yield () a good acoustic communication device, and (ii) and consonant places of articulation. Based on an acoustic repertoire that closely matches the phonetic two main criteria, viz. simple and efficient use of richness of human . an acoustic tube, along with maximum acoustic dispersion, the Distinctive Regions Model 2 Criteria for building an acoustic device for (DRM) of derives regions communication that closely correspond to established vowel and consonant places of articulation. Our approach will be based purely on physical laws and communication theory principles, and not on any 1 Introduction specific characteristics of the human production and/or perception systems. The acoustic device is shown in Attempts to explain speech production data often suffer Fig.1. It is a closed-open tube corresponding to the from circularity. That is, the classification of data tends vocal folds at one end and to the lip opening at the to be purely descriptive without shedding light on the other, respectively. Commands deform the area function processes that give rise to them. In criticizing the data- of the acoustic tube consequently producing changes in based approach, Lindblom (1990) favors, instead, the the acoustic signal. These commands constitute our use of substance-based approaches. Here, deductions phonological repertoire and may be viewed as are derived from specific criteria or from characteristics analogous to the 'gestures' proposed by Browman and of the speech production and/or perceptual systems in Goldstein (1989). order to explain the data. In fact, researcher have successfully exploited the substance-based approach in Commands Acoustic explaining vowel systems, using a perceptually-driven Lexical 1(5, 6, 7) I Isignal (1) maximum acoustic dispersion criterion (Liljencrants memory (8) ] : Acoustic tube I and Lindblom, 1972; Lindblorn, 1986). But this Articulatory-acoustic approach, too, may be thought of as being circular, to relation (2, 3, 4) some extent, in that the criterion used may be viewed as, merely, a more encompassing description of the Figure 1. Acoustic device deformed by commands. themselves. The crucial task then is to explain the characteristics of the /production The criteria used to build a 'good' system itself. For example, what is it about the human communication device are the following: speech apparatus that yields three and not four or five • acoustic contrast: maximization of the acoustic places of articulation? Is the vocal tract designed contrast of the produced (1) in order to have primarily for feeding or does it reflect adaptations a good signal-to-noise ratio. The spectral peaks specialized for speech? To answer some of these () having maximum energy are retained as questions, we undertook to investigate what acoustic parameters; deformations (i.. vocal tract configurations/shapes) of • least effort: this refers to the efficiency of the articulatory-acoustic relations (2) with monotonicity (3) and orthogonality (4) of the relations. The use of an efficiency criterion corresponds to increasing the role of the dynamics; °I2 .=/ 468 • simplicity: with respect to the commands of the deformation of the acoustic tube, the commands must be simple J(5) (straight deformations), few in cm number (6), and with a reduced number of degrees (a) 0 5 10 15 20 of constriction (7). This is based on the assumption that fewer commands make for smaller demands on 10 memory resources (8) that may be engaged in the 8 learning and eventual mastery of these commands in the phonological acquisition process. 6 An algorithm to automatically and efficiently 4 deform the area function of an acoustic tube in order to 2 increase or decrease the of a or 0 ---! combination of several formants has been proposed cm elsewhere Carr6 et al. (1994; 1995). As an example, 0 5 10 15 20 figure 2 shows the automatic evolution of the area (b) function of the tube from a closed-open neutral 300O configuration, for increasing and decreasing F2. Four main regions naturally emerge which are 2000 not of equal . The /a/ vowel is an automatic consequence of a back constriction associated with a front cavity, and, the/i/vowel of a front constriction 1000 associated with a back cavity (anti-symmetrical behavior), a cavity, thus, being automatically 0 iterations obtained. If, however, the initial configuration is that of a tube closed at both ends (i.e. closed-closed) the// (c) vowel is automaiicaUy obtained with a central 3000 F2(Hz) L_ constriction (symmetrical behavior). A summary of the main conclusions that may be drawn from the above manipulations are as following: 2000 /i~ U /a/ • configurations using the maximum acoustic contrast criterion correspond to those for the three vowels/a, 1000 / Fl(Hz) i, u/of the vowe[ triangle; • the deformation of the tube is minimum, because of 0 I I I / I the use of the sensitivity function (Fant and Pauli, 0 200 400 600 800 1000 1974) and thus efficient; (d) • the deformation 'commands (or gestures) are simple (recti-linear), limited in number (only one in the case Figure 2. (a)-(b) Iterative deformation of the area of figure 2, for the making of a back constriction is function A(x) of a uniform acoustic tube (U) for automatically as~sociated with a front cavity and increasing and decreasing (upper part) F2. vice-versa, as i9 human), and applied at specific (c)-(d) Corresponding three formant variations and places called distinctive regions. formant trajectories in the F,-F2 plane (lower part).

=,7 In summary, characteristics important and The DRM is a 2-region model when only the integral to a 'good' acoustic communication device first formant is controlled. It is a 4-region model when were obtained when an acoustic tube underwent the first two formants (F~ and F2) are controlled (note specific deformations. It is this type of tube, so- that this 4-region structure is automatically obtained structured in regions, that forms the basis of the with a maximum acoustic contrast algorithm - see Distinctive Region Model (DRM) (Carr6 and Mrayati, figure 2). It is an 8-region model when the first three 1992; Mrayati, et al., 1988). Deformation gestures and formants (F~, F2, F3) are controlled. Thus, increased efficient places of deformation (the regions) are thus control of the number of formants entails an increase in deduced from acoustic theory. complexity and topological accuracy. Varying the cross-sectional area of any one region maximizes the 3 The Distinctive Region Model (DRM) formant frequency variations, with each binary combination of a positive or negative formant variation The DRM model proposed in 1988 (Mrayati, et al., corresponding to a specific region. Thus, it is 1988) is structured in regions, the limits of which hypothesized that the regions of the model could correspond to the zero-crossings of the sensitivity represent the best places of articulation for vowels and function computed on a uniform closed-open tube . (figure 3). Critics of this model, may claim that defining regions from the sensitivity functions for a uniform tube AF F~ would limit the functioning domain for small l perturbations. But synergetic command of the regions allows one to maintain monotonic formant variations as well as the region structure, as shown in figure 2.

P III- I I II \t / \ I \ F.; \xl I \ \ tlF: \ l_ltlF,

Figure 4. Vocal tract (from Perkell (1969)) and the Figure 3. Sensitivity functions (formant variations AF eight regions of the model. for local positive cross-sectional area perturbations) for the first three formants in the case of a uniform closed- Note that the region boundaries correspond to a open tube (upper part). The limits of the 8 regions of percentage of the total length of the tube. Since the the DRM model correspond to the zero-crossings of the physical length of the acoustic tube is around 19.5 cm sensitivity functions (central part). Corresponding for the vowel /u/ and 17.5 cm for the vowel /i/, the positive or negative formant variations for increases in place of articulation would thus move according to the the cross-sectional area of the regions (lower part). pronounced vowel. However, it is the 'effective' length that one must take into account ; for, in addition to the physical length, Lp, effective length incorporates the the central constriction. For consonant production, the length, Lr, which corresponds to the radiation effect. 8-region model is needed (taking into account the first Now then, Lr is imore or less proportional to the lip three formants). Thus, prediction of consonant opening. For the vowel/i/, Lr=2 cm, and for the vowel production requires a more complex model than that for /u/, Lr=0 cm. Thus the effective length remains almost the vowel production (insofar as the number of degrees constant (Mrayat i, et al., 1990) and the places of of constriction is not taken into account). articulation, fixe& with region 7 always corresponding to the teeth (figure 4). Region 8 (with the radiation 4 Vowel places of articulation effect) corresponds to the lips, regions 3, 4, 5, 6 to the , the region 1 to the . Regions 2 and 7, of The DRM model with its places of articulation was small length, are acoustically unimportant. used to produce vowels. For example, figure 6 shows As such, a closed-open model may not be the execution of a command (using a closed-open DRM employed for an/u/production because the vocal tract model) to pass from a front constriction to a back is closed at the source and more or less closed at the constriction along with a labial command. The acoustic lips. A closed-close.d model would be more convenient. results of these commands are also shown in the FrFz The behavior of such a model is symmetrical, allowing plane. The positions of the vowels obtained with such a for a central constriction to be automatically derived closed-open model are given. Thus, to produce vowels (Carr6 and Mrayati, 1992). the first two formants appear to be sufficient, the third one either being deduced from the first two (Ladefoged front constriction and Harshman, 1979) or is a speaker specific characteristic (Boulogne, et al., 1973). central co~tio~~ ~ ~"

from front to back back '/~-[ R6 ~7~8 constriction const@tion 7/~ K5 ~ / l I- --I / ~/d/closure - lJl b R3 /g/closure 3000 F2(Hz) R2 from front to back +lab _[i] ~ constriction Ri 2000

1000 Figure 5. Places of articulation predicted by the closed- open DRM model. The central constriction is predicted F,() by the closed-closed model. 0 I I I I

The places: of articulation predicted by the 1~ 3~ 5~ 7~ 9~ model as the bes~t places for maximum formant variations are represented in figure 5. It can be Figure 6. Commands of the model and corresponding observed that the three main places of articulation for r formant trajectories in the Fj-F2 plane. The trajectory vowel production may be predicted with the 4-region /uf/ was obtained with a central constriction (closed- model (taking into account the first two formants). The closed model). closed-open model predicts the front constriction and the back constriction; the closed-closed model predicts The consequences of a change from a closed- • the bursts (Blumstein and Stevens, 1979; Stevens open to a closed-closed configuration of the DRM and Blumstein, 1981) associated with the model have been studied elsewhere (for more details, are consequences of closure-opening actions. Indeed, see Carr6 and Mrayati (1995)). Intermediate places of it has beeen shown that if the bursts are important, articulation between central and back and central and they are not determinant for identification in front are obtained. This approach allows for a good comparison with the formant transitions (Kewley- prediction of vocalic systems (Cart6, 1996). Port and Pisoni, 1983; Walley and Carrell, 1983). • the consonants which are generally 5 Consonant places of articulation characterized by their noise sources and not by their forrnant transitions have to be studied according to Figure 7 shows the formant variations associated with another criterion. closing and opening of the eight regions of the DRM • the third formant may be essential for differentiating model with its uniform configuration. Using the places certain consonants like/d/and/g/(the slopes of the of articulation defined by this model (taking into first and second forrnant would not be sufficient). account the first three formants), we were able to obtain Indeed, it is noted in Harris et al. (1958) that the the formant transitions measured by Ohman (1966) consonant /d/ cannot be obtained in the vocalic (Carri and Chennoukh, 1995). The use of the region 7 context /i/ without the third formant. The authors gives rise to the production of consonants perceived as concluded that << The effects of third formant cues labial. Thus, the classification proposed by Mrayati et are independent of the two-formant patterns to al. (1988, figure 18) has been revised. Considering the which they are added. When a third formant cue vowel/El as close to the neutral partly contributed to enhances the perception of a particular , it the misclassification. The patterns of Delattre (1955) typically does not do so equally at the expense of the corresponding to/EcE/where c is a plosive consonant other response alternatives >>. Similarly, in are obtained by the model when only regions 5, 6 and 8 categorical perception experiments, Godfrey et al. are used. The regions 3 and 4 of the back part of the (1981) studied the boundary between/d/and/g/by model correspond to the places of articulation of changing the rate and direction of the third formant pharyngeal plosives ( Dakkak, et al., 1994). only, while the first two formants were changed for a/b/-/d/contrast. Additional experiments have to be undertaken to further explore this point. • the use of a uniform tube allowed for all combinations of formant variations (positive and negative), that form the basis of the DRM structure, to be revealed. Does this imply that perceptual ...... 317.C ...... " identification is carried out with reference to a uniform state? This point is worth studying further. ! ...... i....J....~...J...... ~.....J ...... [ ...... [~'.{.~...i 6 Conclusions X.:,, N[K:...... : ...... : ::- : ..... ~'~ ~'~ ,~'f~'~"~#', : t'~l •...... :...... The main places of articulation for vowels and plosive consonants can be predicted from a criterion of Figure 7. Variations of the first three formants efficiency of articulatory-acoustic relations. For corresponding to a closing-opening gesture in a specific vowels, a 4-region model (taking into account the first region. two formants) is needed, and for the plosive consonants, an 8-region model (taking into account the Since the regions of the DRM model can be first three formants). This has important implication for obtained from the formant variations of a uniform tube the role of complexity in the production of vowels and since these regions indeed correspond to the places versus consonants. of articulation, we may hypothesize the following:

3o If the regions 5, 6, 8 are the best places for R. Carr6. 1996. Prediction of vowel systems using a maximum formant variations (maximum dynamic deductive approach. In Proceedings of the ICSLP acoustic contrast)fin the case of non-nasal closure- 96, pages 434-437, Philadelphia. opening, how does: one explain the production of having the same places of articulation? For R. Carr6 and S. Chennoukh. 1995. Vowel-consonant- nasal consonants, !it is not evident that the maximum vowel modeling by superposition of consonant dynamic contrasts are preserved. Perhaps, simplicity of closure on vowel-to-vowel gesture. J of , production (same iplaces of articulation) may be the 23:231-241. more important criterionE. • if the dynamic acoustic contrasts are sufficient?• I. Additional experiments have to R. Carr6, B. Lindblom and P. MacNeilage. 1994. be undertaken to l know if this apparently optimal Acoustic contrast and the origin of the human vowel acoustic behavior: of the human speech apparatus space. J. Acoust. Soc. Am., 95: $2924. represents adaptat!ons specialized for communication, i.e., is it a consequence of communication theory R. Carr6, B. Lindblom and P. MacNeilage. 1995. R61e (ecological point of view) or is it just a coincidence? Is de l'acoustique dans l'6volution du conduit vocal it that speech is just an overlaid function on a system humain. Comptes Rendus de l~4caddmie des designed primarily for feeding? The answer to this Sciences, Paris, t. 30, s6rie IIb: 471-476. question will determine the importance and validity of our deductions. R. Carr6 and M. Mrayati. 1992. Distinctive regions in In summary, the DRM model uses as its basic acoustic tubes. Speech production modeling. J. elements, control commands, that allow one to deduce d~4coustique, 5: 141-159• the places of articulation for vowels and consonants thereby uncovering the physical bases of phonological R. Carr6 and M. Mrayati. 1995. Vowel transitions, distinctions. Furthermore, using this vowel systems, and the Distinctive Region Model. In model have yielded high quality results (Hill, et al., C. Sorin, J. Mariani, H. M61oni and J. Schoetgen, 1995). Editors, Levels in Speech Communication: Relations and Interactions• Elsevier: Amsterdam. References P.C. Delattre, A.M. Liberman and F.S. Cooper. 1955. . A1 Dakkak, M. Mrayati and R. Carr6. 1994. Acoustic loci and transitional cues for consonants. J. Transitions formantiques correspondant h des Acoust. Soc. Am., 27: 769-773. constrictions rehlis6es dans partie arri6re du conduit vocal. Linguistica Communicatio, VI: 59- G. Fant and S. Pauli. 1974. Spatial characteristics of 63. vocal tract modes. In Proc. of the Speech Communication Seminar. Almqvist & Wiksell: S.E. Blumstein arid K.N. Stevens. 1979. Acoustic Stockholm. invariance in speech production: Evidence from measurements of the spectral characteristics of stop J.J. Godfrey, A.K. Syrdal-Lasky, K.K. Millay and consonants. J. Aeoust. Soc. Am., 66: 1001-1017. C.M. Knox. 1981. Performance of dyslexic children on speech perception tests. Journal of Experimental M. Boulogne, R. Carr6 and J.P. Charras. 1973. La Child Psychology, 32: 401-424. fr6quence fondamentale, les formants, 616ments d'identification des locuteurs. Revue d'Acoustique, K.S. Harris, H.F. Hoffman, A.M. Liberman, P.C. 6: 343-350• Delattre and F.S. Cooper. 1958. Effect of third- formant transitions on the perception of the voiced C.P. Browman and L. Goldstein. 1989. Articulatory stop consonants. J. Acoust. Soc. Am., 30: 122-126. gestures as phofiological units• , 6: 201- 252. 31 D. Hill, L. Manzara and C.R. Taube-Schock. 1995. A.C. Walley and T.D. Carrell. 1983. Onset spectra and Real-time articulatory speech-synthesis-by-rules. In formant transitions in the adult's and child's Proc. of A VIOS'95, San Jose. perception of place of articulation in stop consonants. J. Acoust. Soc. Am., 73:1011-1022. D. Kewley-Port and D.B. Pisoni. 1983. Perception of static and dynamic acoustic cues to place of articulation in initial stop consonants. J. Acoust. Soc. Am., 73: 1779-1793.

P. Ladefoged and R. Harshman. 1979. Formant and movements of the tongue. UCILA Working Papers in Phonetics, 45: 39-52.

J. Liljencrants and B. Lindblom. 1972. Numerical simulation of vowel quality systems: The role of perceptual contrast. , 48: 839-862.

B. Lindblom. 1986. Phonetic Universal in Vowel Systems. In J. J. Ohala and J. J. Jaeger, Editors, Experimental Phonology. Academic Press: Orlando.

B. Lindblom. 1990. On the Notion of "Possible Speech ". Phonetic Experimental Research at the Institute of , University of Stockholm (PERILUS), XI: 41-63.

M. Mrayati, R. Carr6 and B. Gu&in. 1988. Distinctive Region and Modes: a new theory of Speech Production. Speech Communication, 7: 257-286.

M. Mrayati, R. Carr6 and B. Gu6rin. 1990. Distinctive regions and modes: articulatory-acoustic-phonetic aspects. A reply to Bo~ and Perrier comments. Speech Communication, 9:231-238.

S. Ohman. 1966. in VCV utterances: spectrographic measurements. J. Acoust. Soc. Am., 39: 151-168.

J. Perkell. 1969. Physiology of speech production. Results and implications of a quantitative cineradiographic study. The MIT Press: Cambridge.

K.N. Stevens and S.E. Blumstein. 1981. The search for invariant acoustic correlates of phonetic features. In P. D. Eimas and J. L. Miller, Editors, Perspectives on the study of speech. Erlbaum: Hillsdale. 3z