Auditory-Visual Speech Processing 2005 (AVSP'05)
British Columbia, Canada, July 24-27, 2005
ISCA Archive: http://www.isca-speech.org/archive

THE HISTORY OF ARTICULATORY SYNTHESIS AT HASKINS LABORATORIES

Philip Rubin 1,2, Gordon Ramsay 1 and Mark Tiede 1,3

1 Haskins Laboratories, New Haven, CT, USA
2 Yale University School of Medicine, Dept. of Surgery, New Haven, CT, USA
3 Massachusetts Institute of Technology, Research Laboratory of Electronics, Cambridge, MA, USA

ABSTRACT

Articulatory synthesis is a computational technique for synthesizing speech by controlling the shape of the vocal tract over time. Research at Haskins Laboratories on articulatory synthesis began with the arrival of Paul Mermelstein in the early 1970s. While at Bell Laboratories, Mermelstein had developed a vocal tract model that is often referred to as the Mermelstein model [1]. This model is similar in many ways to that of Coker [2] and colleagues [3], but was developed independently. The Mermelstein model allowed for a specification of vocal tract shape in the midsagittal plane in terms of a small set of articulators, including the jaw, tongue body, tongue tip, lips, and velum. At Haskins, Philip Rubin, Thomas Baer and Mermelstein turned this model into the first articulatory synthesizer regularly used as a research tool for exploring the relationship between speech perception and production [4].

The Haskins articulatory synthesis model (ASY) was designed to allow for simple control of vocal tract shape by direct manipulation of articulators. Rapid synthesis was required to allow for an experimenter-controlled, analysis-by-synthesis approach, in which researchers could make quick adjustments to the shape of the vocal tract and then synthesize static and dynamic utterances to evaluate whether or not the desired acoustic results were achieved. To do this rapidly in the 1970s and 1980s required compromises in the design of the model and the synthesis program. ASY consists of several submodels. At its heart are simple models of six key articulators. The positions of these articulators determine the outline of the vocal tract in the midsagittal plane. From this outline, the distance function and, subsequently, the area function of the vocal tract are determined. Source information is specified at the acoustic, rather than articulatory, level, and is independent of the articulatory model. Speech output is obtained by calculating the acoustic transfer function for both the glottal and fricative sources for a particular vocal tract shape. For voiced sounds, the transfer function accounts for both the oral and nasal branches of the vocal tract. In its early implementations, continuous speech was obtained by a technique similar to key-frame animation. Later, continuous speech was created by driving the articulatory model from a dynamical model of speech production.

A number of extensions of the original ASY model have been developed. Perhaps the most significant is CASY, the configurable articulatory synthesizer [5]. CASY is a version of the articulatory synthesis program that lets the user superimpose an outline of the vocal tract model on an acquired sagittal image (typically an MRI). The user can then graphically adjust the model parameters to fit the dimensions of the image. Transfer functions and acoustic output can be generated using those model parameters. CASY's model parameters are a superset of those in ASY and include values that were in the original Mermelstein model but could not be adjusted by the user. In addition, the fixed surfaces of the vocal tract are represented parametrically so that they can be adjusted to match any arbitrary speaker's vocal tract. CASY also implements the interdependencies among parts of the vocal tract geometry in the form of a flexible linked list, supporting experimentation with key articulators beyond the fixed arrangement of the original design. Future developments include the integration of a new key articulator controlling the parasagittal shape of the tongue dorsum, motivated by volumetric MRI data, and an improved voice source model.

Since the development of the first articulatory synthesizers in the 1960s and 1970s, considerable progress has been made in other fields that can readily be applied to physical modeling of the vocal tract. In biomechanics, explicit computational models of the processes governing muscle contraction, as well as proprioceptive and exteroceptive feedback, have been coupled to models of rigid-body dynamics to provide a more complete representation of peripheral dynamics. At present, articulatory synthesis is typically limited to purely kinematic representations that do not capture the physical constraints on motion that influence speech motor control. In fluid dynamics, numerical simulations are now routinely constructed of unsteady turbulent flows in complex time-varying geometries, and recent developments in aeroacoustics have shown how to use these results to predict the interaction between aerodynamic and acoustic flow that generates sound. Articulatory synthesizers currently rely on quasi-static, quasi-one-dimensional acoustic models that ignore aerodynamic flow effects and cannot properly simulate the time-varying source and filter properties associated with the complex time-varying geometries seen in consonantal transitions. This paper will also survey these developments, summarize the use of articulatory synthesis in research at Haskins Laboratories [6,7,8,9], and describe current work at Haskins Laboratories that aims to incorporate better models of biomechanics and fluid dynamics into articulatory synthesis.

REFERENCES

[1] Mermelstein, P. Articulatory model for the study of speech production.
Journal of the Acoustical Society of America 53: 1070-1082, 1973.

[2] Coker, C. A model of articulatory dynamics and control. Proc. IEEE 64: 452-460, 1976.

[3] Coker, C. and Fujimura, O. Model for specification of the vocal-tract area function. Journal of the Acoustical Society of America 40: 1271, 1966.

[4] Rubin, P., Baer, T. and Mermelstein, P. An articulatory synthesizer for perceptual research. Journal of the Acoustical Society of America 70: 321-327, 1981.

[5] Rubin, P. E., Saltzman, E., Goldstein, L., McGowan, R., Tiede, M. and Browman, C. CASY and extensions to the task-dynamic model. Proceedings of the 1st ESCA ETRW on Speech Production Modeling and 4th Speech Production Seminar. Autrans: ICP Grenoble, 125-128, 1996.

[6] Browman, C. P. and Goldstein, L. Articulatory phonology: an overview. Phonetica 49: 155-180, 1992.

[7] Saltzman, E. Task dynamic coordination of the speech articulators: a preliminary model. Experimental Brain Research Series 15: 129-144, 1986.

[8] McGowan, R. Recovering articulatory movement from formant frequency trajectories using task dynamics and a genetic algorithm: preliminary model tests. Speech Communication 14: 19-48, 1994.

[9] McGowan, R. S. and Saltzman, E. Incorporating aerodynamic and laryngeal components into task dynamics. Journal of Phonetics 23: 255-269, 1995.
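Note: the quasi-one-dimensional acoustic modeling discussed in the paper can be illustrated with a short sketch. The following is not the Haskins implementation; it is a minimal transmission-line (chain-matrix) calculation under simplifying assumptions (lossless, hard-walled tube sections and an ideal zero-impedance radiation load at the lips), with hypothetical function and variable names. Concatenating the 2x2 chain matrices of the sections and evaluating the volume-velocity transfer function of a uniform 17.5 cm tube recovers the textbook odd quarter-wave resonances near 500, 1500 and 2500 Hz.

```python
import numpy as np

RHO = 1.2  # air density (kg/m^3)
C = 350.0  # speed of sound in the vocal tract (m/s)

def chain_matrix(freq, length, area):
    """2x2 chain (ABCD) matrix of one lossless, hard-walled tube section."""
    k = 2.0 * np.pi * freq / C           # wavenumber
    z = RHO * C / area                   # characteristic acoustic impedance
    return np.array([[np.cos(k * length), 1j * z * np.sin(k * length)],
                     [1j * np.sin(k * length) / z, np.cos(k * length)]])

def transfer_function(freqs, areas, section_len):
    """|U_lips / U_glottis| for a concatenation of uniform sections,
    assuming an ideal zero-impedance load (pressure = 0) at the lips."""
    gains = []
    for f in freqs:
        K = np.eye(2)
        for a in areas:                  # multiply sections glottis -> lips
            K = K @ chain_matrix(f, section_len, a)
        # [P_g, U_g]^T = K [P_l, U_l]^T with P_l = 0  =>  U_g = K[1,1] * U_l
        gains.append(1.0 / abs(K[1, 1]))
    return np.array(gains)

# A uniform 17.5 cm "schwa" tube split into 35 sections of 5 cm^2 each.
areas = np.full(35, 5.0e-4)              # cross-sectional areas in m^2
freqs = np.arange(50.0, 4000.0, 5.0)     # analysis grid in Hz
H = transfer_function(freqs, areas, 0.175 / 35)

# Local maxima of |H| are the formants: (2n-1)c/4L = 500, 1500, 2500 Hz.
peaks = freqs[1:-1][(H[1:-1] > H[:-2]) & (H[1:-1] > H[2:])]
print(peaks[:3])
```

Substituting an area function derived from a midsagittal outline for the uniform tube, and a radiation impedance for the ideal termination, gives the skeleton of the transfer-function stage that synthesizers in the ASY tradition build on; losses, wall vibration, the fricative source path and the nasal branch are all omitted in this sketch.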