An Articulatory Synthesizer for Perceptual Research
Philip Rubin and Thomas Baer
Haskins Laboratories, 270 Crown Street, New Haven, Connecticut 06510

Paul Mermelstein
Bell-Northern Research and INRS Telecommunications, University of Quebec, Verdun, Quebec, Canada H3E 1H7

(Received 15 March 1979; accepted for publication 12 May 1981)

A software articulatory synthesizer, based upon a model developed by P. Mermelstein [J. Acoust. Soc. Am. 53, 1070-1082 (1973)], has been implemented on a laboratory computer. The synthesizer is designed as a tool for studying the linguistically and perceptually significant aspects of articulatory events. A prominent feature of this system is that it easily permits modification of a limited set of key parameters that control the positions of the major articulators: the lips, jaw, tongue body, tongue tip, velum, and hyoid bone. Time-varying control over vocal-tract shape and nasal coupling is possible by a straightforward procedure that is similar to key-frame animation: critical vocal-tract configurations are specified, along with excitation and timing information. Articulation then proceeds along a directed path between these key frames within the time script specified by the user. Such a procedure permits a sufficiently fine degree of control over articulator positions and movements. The organization of this system and its present and future applications are discussed.

PACS numbers: 43.70.Bk, 43.70.Jt, 43.70.Qa, 43.60.Qv

INTRODUCTION

Articulatory synthesis has not usually been considered to be a research tool for studies of speech perception, although many perceptual studies using synthetic stimuli are based upon articulatory premises. In order to address the need for a research articulatory synthesizer, this paper provides a brief description of a software articulatory synthesizer, implemented on a laboratory computer at Haskins Laboratories, that is presently being used for a variety of experiments designed to explore the relationships between perception and production. A related paper, by Abramson et al. (1979), describes in greater detail some of these experiments, which focus on aspects of velar control and the distinction between oral stop consonants and their nasal counterparts. The intent of the present paper is to provide an overview of the actual design and operation of this synthesizer, with specific regard to its use as a research tool for the perceptual evaluation of articulatory gestures.

Briefly, the articulatory synthesizer embodies several submodels. At its heart are simple models of six key articulators. The positions of these articulators determine the outline of the vocal tract in the midsagittal plane. From this outline the width function and, subsequently, the area function of the vocal tract are determined. Source information is specified at the acoustic, rather than articulatory, level, and is independent of the articulatory model. Speech output during each frame is obtained after first calculating, for a particular vocal-tract shape, the acoustic transfer function for both the glottal and fricative sources, if they are used. For voiced sounds the transfer function accounts for both the oral and nasal branches of the vocal tract. Continuous speech is obtained by a technique similar to key-frame animation (see below).

Although the synthesizer is capable of producing short segments of quite natural speech with a parsimonious input specification, its primary application is not based on these characteristics. The most important aspect of its design is that the articulatory model, though simple, captures the essential ingredients of real articulation. Thus, synthetic speech may be produced in which articulator positions or the relative timing of articulatory gestures are precisely and systematically controlled, and the resulting acoustic output may be subjected to both analytical and perceptual analyses. Real talkers cannot, in general, produce utterances with systematic variations of an isolated articulatory variable. Further, for at least some articulatory variables (for example, velar elevation and the corresponding degree of velar port opening) simple variations in the articulatory parameter produce complex acoustic effects. Thus, the synthesizer can be used to perform investigations that are not possible with real talkers, or investigations that are difficult, at best, using acoustic synthesis techniques.

a) Portions of this paper were presented at the 95th Meeting of the Acoustical Society of America, 16-19 May 1978, Providence, R.I.
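The key-frame control scheme described above can be sketched in code. The fragment below is a hypothetical Python illustration, not the original laboratory implementation: the parameter names, time units, numerical values, and the choice of straight-line interpolation between key frames are assumptions made only to show how a time script of critical configurations yields a frame-by-frame sequence of articulator parameters.

from dataclasses import dataclass

@dataclass
class KeyFrame:
    time_ms: float   # position of this key frame in the time script
    params: dict     # articulator parameters at that instant (names assumed)

def params_at(frames, t):
    """Return articulator parameters at time t by linear interpolation
    between the two key frames that bracket t (frames sorted by time)."""
    if t <= frames[0].time_ms:
        return dict(frames[0].params)
    for a, b in zip(frames, frames[1:]):
        if a.time_ms <= t <= b.time_ms:
            w = (t - a.time_ms) / (b.time_ms - a.time_ms)
            return {k: (1 - w) * a.params[k] + w * b.params[k] for k in a.params}
    return dict(frames[-1].params)

# Hypothetical time script for a /ba/-like gesture: lips closed, then opening for the vowel.
script = [
    KeyFrame(0.0,   {"jaw": 0.0,  "lip_opening": 0.0, "velar_port": 0.0}),
    KeyFrame(80.0,  {"jaw": -0.6, "lip_opening": 1.0, "velar_port": 0.0}),
    KeyFrame(300.0, {"jaw": -0.6, "lip_opening": 1.0, "velar_port": 0.0}),
]

frame_step_ms = 10.0
for i in range(int(script[-1].time_ms / frame_step_ms) + 1):
    shape = params_at(script, i * frame_step_ms)
    # Each frame's parameter set would next be converted to a midsagittal
    # outline, an area function, and a transfer function for the glottal
    # and fricative sources before the output waveform is generated.

In this sketch the user supplies only the key frames and their times; every intermediate frame follows automatically, which is the property that makes systematic manipulation of single articulatory variables straightforward.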
I. THE MODEL

The model that we are using was originally developed by Mermelstein (1973) and is designed to permit simple control over a selected set of key articulatory parameters. The model is similar in many respects to that of Coker (Coker and Fujimura, 1966; Coker, 1976), but was developed independently. The two models emphasize different aspects of the speech production process; Coker's implementation stresses synthesis-by-rule, while the focus of the present model is on interactive and systematic control of supraglottal articulatory configurations and on the acoustic and perceptual consequences of this control. The particular set of parameters employed here provides for an adequate description of the vocal-tract shape, while also incorporating both individual control over articulators and physiologically constrained interaction between articulators. Figure 1 shows a midsagittal section of the vocal tract, with the six key articulators labeled: tongue body, velum, tongue tip, jaw, lips, and hyoid bone position. (Hyoid bone position controls larynx height and pharynx width.)

FIG. 1. Key vocal-tract parameters (C, tongue body center; V, velum; T, tongue tip; J, jaw; L, lips; H, hyoid; the vocal-tract center line is also shown).

FIG. 2. Grid system for conversion of midsagittal shape to cross-sectional area values.

These articulators can be grouped into two major categories: those whose movement is independent of other articulators (the jaw, velum, and hyoid bone), and articulators whose positions are functions of the positions of other articulators (the tongue body, tongue tip, and lips). The articulators of this second group all move relative to the jaw. In addition, the tongue tip moves relative to the tongue body. In this manner, individual gestures can be separated into components arising from the movement of several articulators. For example, the lip-opening gesture in the production of a /ba/ is a function of the movement of two articulators: the opening of the lips themselves, and the dropping of the jaw for the vowel articulation. Movements of the jaw and velum have one degree of freedom; all other articulators move with two degrees of freedom. Movement of the velum has two effects: it alters the shape of the oral branch of the vocal tract and, in addition, it modulates the size of the coupling port to the fixed nasal tract. A more detailed description of the models for each of these articulators is provided in Mermelstein (1973). Specification of the six key articulator positions completely determines the outline of the vocal tract in the midsagittal plane.
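The decomposition of gestures across this articulator hierarchy can also be sketched in code. The fragment below is a hypothetical Python illustration, not the model's actual geometry: the two-dimensional vectors, coordinate frame, and numerical values are assumptions chosen only to show how a dependent articulator's absolute position combines its own displacement with that of the structure it rides on, as in the /ba/ example above.

import numpy as np

def articulator_positions(jaw, velum, hyoid, tongue_body_rel, tongue_tip_rel, lips_rel):
    """Compose absolute midsagittal positions from the two groups of
    articulators: those that move independently (jaw, velum, hyoid bone)
    and those specified relative to the structures they ride on (tongue
    body and lips relative to the jaw, tongue tip relative to the tongue
    body). All arguments are (x, y) pairs in an assumed maxilla-fixed frame."""
    jaw = np.asarray(jaw, dtype=float)
    tongue_body = jaw + np.asarray(tongue_body_rel, dtype=float)
    tongue_tip = tongue_body + np.asarray(tongue_tip_rel, dtype=float)
    lips = jaw + np.asarray(lips_rel, dtype=float)
    return {
        "jaw": jaw,
        "velum": np.asarray(velum, dtype=float),
        "hyoid": np.asarray(hyoid, dtype=float),
        "tongue_body": tongue_body,
        "tongue_tip": tongue_tip,
        "lips": lips,
    }

# The lower lip rides on the jaw, so the lip opening for /ba/ decomposes into
# two components: the lips parting and the jaw dropping for the vowel.
upper_lip_y = 0.0                                      # fixed to the maxilla in this sketch
closed = articulator_positions(jaw=(0.0, 0.0), velum=(0.0, 1.0), hyoid=(0.0, -6.0),
                               tongue_body_rel=(2.0, 1.5), tongue_tip_rel=(1.5, 0.5),
                               lips_rel=(4.0, 0.0))
open_ba = articulator_positions(jaw=(0.0, -0.6), velum=(0.0, 1.0), hyoid=(0.0, -6.0),
                                tongue_body_rel=(2.0, 1.5), tongue_tip_rel=(1.5, 0.5),
                                lips_rel=(4.0, -0.4))
lip_opening = upper_lip_y - open_ba["lips"][1]         # 1.0: 0.6 from the jaw, 0.4 from the lips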
Validation of the articulatory model involved systematic comparisons with data derived from x-ray observations of natural utterances. The data used were obtained from Perkell (1969). An interactive procedure was used in which the experimenter matched the model-generated outline to x-ray tracings for the utterance, "Why did Ken set the soggy net on top of his desk?" Acoustic matching procedures are detailed in Mermelstein (1973).

Additional parameters suggested by factor analyses of tongue shapes (Harshman et al., 1977; Kiritani and Sekimoto, 1977; Maeda, 1979) may be included.

Once articulator positions and, thus, the vocal-tract outline have been specified, cross-sectional areas are calculated by superposing a grid structure fixed to the maxilla on the vocal-tract outline and computing the points of intersection of the outline and the grid lines (see Fig. 2). Grid resolution is variable within a limited range. In the default mode, grid lines are 0.25 cm apart where parallel and at 5° intervals where they are radial. Figure 2 shows the lowest resolution case: parallel grid lines are separated by 0.5 cm and the radial sections are at 10° intervals. The intersection of the vocal-tract center line with each grid line is determined by estimating the point at which the distances to the superior-posterior and inferior-anterior outlines are equal. The center line of the vocal tract is formed by connecting the points from adjacent grid lines. The length of this line represents the length of the vocal tract, which is modeled as a sequence of acoustic tubes of piecewise uniform cross-sectional area. Vocal-tract widths are measured using the same algorithm as that for finding the center line.

Sagittal cross-dimensions are converted to cross-sectional areas with the aid of previously published data regarding the shape of the tract along its length. Different formulas are used for the pharyngeal region (Heinz and Stevens, 1964), the oral region (Ladefoged et al., 1971), and the labial region (Mermelstein et al., 1971). More specifically, cross-sectional area in the pharyngeal region is approximated as an ellipse with the vocal-tract width (w) as one axis, while the other axis