INTERSPEECH 2015

A comprehensive 3D biomechanically-driven vocal tract model including inverse dynamics for speech research

Peter Anderson (1), Negar M. Harandi (1), Scott Moisik (2), Ian Stavness (3), Sidney Fels (1)

(1) University of British Columbia, (2) Max Planck Institute for Psycholinguistics, (3) University of Saskatchewan

DOI: 10.21437/Interspeech.2015-518

Abstract

We introduce a biomechanical model of oropharyngeal structures that adds the soft palate, pharynx, and larynx to our previous models of the jaw, skull, hyoid, tongue, and face in a unified model. The model includes a comprehensive description of the upper airway musculature, using point-to-point muscles that may either be embedded within the deformable structures or operate externally. The airway is described by an air-tight mesh that fits and deforms with the surrounding articulators, which enables dynamic coupling to our articulatory speech synthesizer. We demonstrate that the biomechanics, in conjunction with the skinning, supports a range of vocal tract geometries, from physically realistic to simplified, for investigating different approaches to aeroacoustic modeling of the vocal tract. Furthermore, our model supports inverse modeling for investigating plausible muscle activation patterns that generate speech.

Index Terms: biomechanical simulation, upper airway model, speech synthesis, inverse modeling.

1. Introduction

Speech production and perception are an essential part of daily life for most people, yet the mechanisms involved are extremely complicated and remain poorly understood. Speech research is usually performed experimentally on people; however, measurements are limited to those that can feasibly and ethically be performed on humans within the capabilities of the instrumentation. Simulations may complement experiments to address many of these concerns, and provide a framework for testing hypotheses and making predictions that cannot be done experimentally.

Models of the human vocal tract (VT) initially focused on the function of individual components. Examples include the face [1], the tongue [2], the soft palate [3, 4, 5], and the larynx [6]. Recent progress has been made towards developing a comprehensive model of the VT, as seen in the face, tongue, skull, jaw, hyoid model of Stavness et al. [7] and the tongue and epiglottis model by Pelteret et al. [8]. The model we present in this paper is the most complete that the authors are aware of, including finite element (FE) models of the face, tongue, soft palate, pharynx, larynx, and supporting structures of bone and cartilage.

Speech simulation involves the simulation of sound production and propagation through the airway until it radiates into free space beyond the lips. The glottis typically provides the main sound source, though sound may also be generated by other sources as in stop consonants and fricatives [9]. Ishizaka & Flanagan [10] modeled the glottis as a system of two masses connected by springs, which has since been challenged and elaborated [11, 12]. Most sound propagation models attempt to reduce the airway to a 1D area function and calculate sound as if propagating through a tube of varying cross-section [13, 14]. The more sophisticated 1D models include the sub-glottal tract, a split airway to include the nasal passage, resonator cavities, sound sources at locations of (modeled) turbulence, and sound modification by the walls [13]. More recent efforts have simulated sound propagation using the 3D wave equation [15] or the Navier-Stokes equation [16].

In our simulation toolkit Artisynth, we have developed a complete biomechanical model of the upper airway structures, which is coupled to an acoustics synthesizer and may be used to calculate inverse mechanics (given a motion, find the muscle activations) as well as forward mechanics (find the motions caused by muscle activations). In this paper, we present this model and demonstrate how it may be used towards numerous applications in speech production and perception.

2. Methods

2.1. Model Design

Our VT model is composed of deformable FE bodies, rigid bodies, surface skin meshes, and point-to-point muscles. A mid-sagittal view of the VT is shown in Fig. 1, and the components are summarized in Table 1. The laryngeal cartilages (not listed individually in Table 1) include the epiglottis, cricoid, thyroid, cuneiform (left and right) and arytenoid (left and right). Except for the face, all components are symmetrical with respect to the mid-sagittal plane. The model is implemented in Artisynth; we refer the interested reader to Lloyd et al. [17] for details of the forward and inverse mechanics in Artisynth.

The face, tongue, jaw, skull, and hyoid have previously been composed as a unified model [21, 7], and the larynx along with its supporting structures has also been simulated independently [6]; however, here we compose them into a single model including the soft palate and pharynx.

The pharynx geometry was designed by fitting spline surfaces to the approximate contour of the pharynx from cone beam Computed Tomography (CT) data. The constrictor muscles were contoured to smoothly extend toward their respective insertion regions relative to the jaw and cartilage structures of the larynx. The pharynx surface was then smoothly fitted with a quadrangle surface mesh, non-rigidly registered to landmarks of the other VT components, and extruded to form a hex-dominant

Copyright © 2015 ISCA. September 6-10, 2015, Dresden, Germany

tions, or they may be embedded in the FE body as transversely-isotropic material properties. There are 11 tongue muscles, 5 palate muscles, 11 face muscles, 6 , and 46 external muscles. These muscle groupings are approximate, as some muscles may pass through multiple bodies (for example, the palatopharyngeus muscle).

The components of the upper airway are coupled by various means. The skull is anchored in space, and provides the primary anchoring for the model. The nodes of the face, palate, and pharynx that connect with the skull are attached to the skull and are thus immobile. An FE model may also attach to a dynamic rigid body, as occurs for the face-jaw, jaw-tongue, and pharynx-thyroid pairs and between the larynx and its cartilage components. An FE model may attach to another FE model, as occurs in regions between the palate and pharynx, the pharynx and tongue, and the tongue and face. Components may also be connected by point-to-point muscles, as happens, for example, between the cricoid and thyroid.

Figure 1: A mid-sagittal view with all deformable and rigid structures visible.

Component          Type    DOF      Mesh   Ref.
face               FE      8720     hex    [1]
tongue             FE      948      hex    [2]
skull, jaw, hyoid  Rigid   6 (x3)   tri    [18]
soft palate        FE      735      tet    [5, 19]
pharynx            FE      2436     hex    N/A
larynx             FE      3034     hex    [6]
larynx cartilage   Rigid   6 (x7)   tri    [6]
airway             skin    N/A      tri    [20]

Table 1: Summary of components. Includes component name, component type (either FE model, rigid body, or deformable skin surface), degrees of freedom (DOF), mesh type (hexahedral-dominant, tetrahedral, or triangular surface mesh) and references to source publications.

2.2. Biomechanically Driven Acoustics

Articulatory speech synthesizers generate sound based on the biomechanics of speech. Vibration of the vocal folds under the expiratory pressure of the lungs acts as the input to the system, and the VT constitutes a filter in which sound frequencies are shaped. We use the two-mass glottal model proposed by Ishizaka & Flanagan [10], coupled with the 1D linearised Navier-Stokes equation described by van den Doel & Ascher [14].
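To illustrate the filter role of the VT described above, the following minimal sketch estimates the resonances (formants) of a 1D area function. It is not the synthesizer used in this work: instead of the time-domain linearised Navier-Stokes solver of [14] and the two-mass source of [10], it swaps in a simpler frequency-domain model that treats the airway as a chain of lossless cylindrical sections with an ideal open end at the lips. The function names (`chain_d_term`, `formants`), the constants for sound speed and air density, and the uniform test tube are all assumptions made for illustration.

```python
import numpy as np

C_AIR = 350.0    # speed of sound in warm, humid air in m/s (assumed value)
RHO_AIR = 1.14   # air density in kg/m^3 (assumed value)


def chain_d_term(areas_m2, lengths_m, freq_hz):
    """D entry of the glottis-to-lips chain matrix at one frequency.

    Each section is a lossless cylinder. With zero pressure at the lips,
    the volume-velocity transfer function is 1/D, so resonances are the
    zeros of D. Hypothetical helper, not part of any toolkit.
    """
    k = 2.0 * np.pi * freq_hz / C_AIR            # wavenumber
    total = np.eye(2, dtype=complex)
    for area, length in zip(areas_m2, lengths_m):  # glottis first, lips last
        z0 = RHO_AIR * C_AIR / area              # characteristic impedance
        c, s = np.cos(k * length), np.sin(k * length)
        section = np.array([[c, 1j * z0 * s],
                            [1j * s / z0, c]])
        total = total @ section
    return total[1, 1].real                      # D is real for lossless tubes


def formants(areas_m2, lengths_m, f_max=4000.0, step=1.0):
    """Locate zeros of D below f_max by sign change plus linear interpolation."""
    freqs = np.arange(0.5 * step, f_max, step)
    d = np.array([chain_d_term(areas_m2, lengths_m, f) for f in freqs])
    idx = np.where(d[:-1] * d[1:] < 0.0)[0]
    return [float(freqs[i] - d[i] * (freqs[i + 1] - freqs[i]) / (d[i + 1] - d[i]))
            for i in idx]


# Uniform 17.5 cm tube (35 sections of 5 mm, 4 cm^2 each), closed at the glottis.
uniform = formants([4.0e-4] * 35, [0.005] * 35)
print([round(f) for f in uniform])  # [500, 1500, 2500, 3500]
```

For the uniform tube the zeros fall at odd multiples of c/(4L), which gives a quick sanity check on the chain-matrix product. Replacing the constant area list with areas sampled along a deforming airway mesh would show how articulator motion shifts the resonances, although a lossless model of this kind ignores wall losses, radiation, and the nasal branch.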
name, component type (either FE models, rigid body, or de- The airway was modelled as a deformable air-tight mesh, formable skin surface), degrees of freedom (DOF), mesh type referred to as a skin, which is coupled to the articulators as (hexahedral-dominant, tetrahedral, or triangular surface mesh) described in [20]. Each point on the skin is attached to one and references to source publications. or more master components,