
The palatal stop: Results from acoustic-articulatory recovery of articulatory movements

Christian Geng∗, Ralf Winkler†∗ and Bernd Pompino-Marschall‡∗
∗ Research Centre for General Linguistics
† Institute for Communications Research, Technical University Berlin
‡ Humboldt-University Berlin

ABSTRACT

The articulatory data available on the palatal stop in Czech are unsatisfactory: static X-rays, linguograms and palatograms still seem to be the state of the art. This study examines the potential benefit of acoustic-articulatory recovery strategies in the determination of features. Results indicate recovery problems when area functions are used as input data; these problems vanish if linear articulatory models are used. The reconstructions suggest a primary laminal and a secondary dorsal component in the articulation of the Czech palatal stop.

1 INTRODUCTION

To our knowledge, nobody has so far recorded Czech palatal stops using contemporary methods such as electropalatography (EPG) or electromagnetic articulography (EMA). To give a short overview of the data situation and the fragmentary knowledge available (cited after Keating and Lahiri [1] unless stated otherwise), we contrast the Czech palatal stop with the velar stop and the palatal glide. Comparing velar and palatal stops, it was observed that the occlusions of the velars are slightly longer than those of the palatals. Stevens [2] uses occlusion lengths of 2 cm in his simulations. The data further suggest that the occlusion is made with the blade, not the tip or body, of the tongue. Contrasting the palatal stop with the palatal glide [j], both involve a laminal and a dorsal component; the difference lies in which of the two dominates: for the stop, the laminal articulation is dominant and the dorsal one secondary, and the reverse holds for the palatal glide.

Figure 1: Left: tracing of palatogram; the contacted area is shaded. Right: tracing of linguogram; the contacted area is shaded (from Daneš, Hála, Jedlička and Romportl [6], after Keating & Lahiri [1]).

Figure 2: Sagittal X-ray of a Czech palatal stop (from Daneš et al. [6], after Keating & Lahiri [1]).

Various techniques have been proposed to derive useful vocal tract information from the speech signal. The LPC approach uses the fact that the filtering process of the lossless uniform tube model of the vocal tract is the same as that of the optimal inverse filtering of the speech signal with proper boundary conditions at the glottis and the lips [3].

Sorting methods sample the articulatory parameters of an articulatory model and establish tables of vocal tract shapes and related acoustic representations, usually formant frequencies. These tables are then used to match vocal tract geometry and acoustic representations. One approach within a computer-sorting framework was described by Atal, Chang, Mathews and Tukey [4]. Another is described in a more recent paper by Story and Titze [5], who establish a mapping between the first two formant frequencies and vocal tract area functions measured by Magnetic Resonance Imaging (MRI). In the first step, each vocal tract area function in the data set is interpolated to a constant number of sections. From these empirical area functions, a neutral, schwa-like configuration is calculated as the mean of the whole data set and subtracted from each individual (empirical) area function. The resulting matrix of displacements from the neutral configuration is subjected to a Principal Component Analysis (PCA), or, in the authors' preferred term, an empirical orthogonal mode decomposition, and the first two factors are retained. The resulting solution can be used as a generator of an infinite number of area functions. Story and Titze's model sweeps the first two coefficients of the solution through 50 steps each to generate 2500 distinct area functions, which, by an area-to-spectrum transformation and formant peak picking, yield a large table of factor coefficients and associated F1/F2 pairs. In the next step, a data gridding and cubic-spline interpolation technique is applied, which can in turn be used to associate a given F1/F2 pair with the corresponding coefficients and thus to produce the desired area function.

Finally, nonlinear optimization approaches have not yet been mentioned. One of these approaches, Simulated Annealing, is an extension of the classical Metropolis algorithm [7]. Simulated Annealing is a stochastic optimization technique that can handle quite arbitrary degrees of nonlinearity, discontinuity and stochasticity in high-dimensional parameter spaces. The variant used here is Adaptive Simulated Annealing [8], a more flexible variant of classical Simulated Annealing as described in [9].

We can now formulate the aim of this study: from formant information, we try to recover meaningful articulatory information on the palatal stop in Czech.

2 DATA

2.1 Articulatory Data

With one speaker of German, 3D MRI data on 26 held articulations (the tense vowels [i:,y:,e:,E:,ø:,A:,o:,u:], the lax vowels [I,Y,E,œ,a,O,U], the neutral vowels [@] and [5], the nasals [m,n,N], the fricatives [f,s,S,ç,x] and the lateral [l]) were recorded at the radiology department of Virchow Hospital Berlin with the procedure Fast Brain SAG TSE 979 8.0 90 on a Gyroscan NT (Philips Medical Systems). The recordings were made as 18 sagittal slices of 3 mm thickness at steps of 3.5 mm with a pixel dimension of 0.586 x 0.586 mm. One recording lasted about 12 s. Phantoms of the teeth were later inserted by means of interactive software. The 3D area functions were calculated in three steps: (i) from a preliminary semi-polar grid on the midsagittal slice, a geometrical midline of the tract is constructed; (ii) a second grid, perpendicular to this midline, is constructed and smoothed with respect to sudden changes in the direction of airflow; and (iii) the 3D data are determined by calculating the area of the air column at the planes defined by these gridlines through all 18 slices.

2.2 Acoustic Data

The acoustic raw material consists of the palatal stops of Czech in syllable-initial position as described in the Handbook of the International Phonetic Association [10].¹ These recordings were lowpass filtered and resampled at 11 kHz. Formant tracks were created using PRAAT [11]. As the acoustic representation of the recovery model consists of formant information, the first recovered area functions reported are not associated with the complete occlusion, but with the first valid frame of the formant track. The formant tracking procedure did not find realistic formant values for some frames due to the presence of friction. Since the focus here is on articulatory recovery, these values were imputed by means of linear interpolation. The formant track for the voiced cognate is given in Fig. 3.

¹ Available at http://web.uvic.ca/ling/resources/ipa/handbook.htm

[Figure 3 plot: formant frequencies (Hz, 0 to 4000) against time (ms, 0 to 60); legend: voiced, voiceless]

Figure 3: Formant track of the CV transition for [cElo] and [ɟElo].
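The imputation of unreliable formant frames by linear interpolation can be illustrated with a minimal sketch. It assumes that frames rejected by the tracker are marked as NaN; the function name `impute_formant_track` and the example values are hypothetical and are not taken from the recordings used in this study.

```python
import numpy as np

def impute_formant_track(times_ms, formant_hz):
    """Fill unreliable formant frames (marked NaN) by linear interpolation."""
    times = np.asarray(times_ms, dtype=float)
    track = np.asarray(formant_hz, dtype=float)
    valid = ~np.isnan(track)
    if valid.sum() < 2:
        raise ValueError("need at least two valid frames to interpolate")
    # np.interp interpolates linearly between the valid frames and holds the
    # first/last valid value constant outside their range.
    return np.interp(times, times[valid], track[valid])

# Hypothetical F2 track sampled every 10 ms; NaN marks rejected frames.
times_ms = [0, 10, 20, 30, 40]
f2_hz = [2100.0, np.nan, np.nan, 1800.0, 1750.0]
f2_filled = impute_formant_track(times_ms, f2_hz)
print(f2_filled)
```

Frames before the first or after the last valid frame are not extrapolated in this sketch; they simply take the nearest valid value.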

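The mode-based table generation of Story and Titze [5] outlined in the Introduction (mean area function, orthogonal modes, coefficient sweep) might be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the decomposition is applied directly to the area values as the text describes, the sweep range of ±3 standard deviations and the clipping of negative areas are choices made for this example, and the names `orthogonal_modes` and `sweep_table` as well as the random toy data are hypothetical. The area-to-spectrum transformation that would attach formants to each table row is omitted.

```python
import numpy as np

def orthogonal_modes(area_functions, n_modes=2):
    """Mean area function, leading modes, coefficient SDs, explained variance."""
    A = np.asarray(area_functions, dtype=float)   # (n_shapes, n_sections)
    mean = A.mean(axis=0)                         # neutral configuration
    D = A - mean                                  # displacements from neutral
    # PCA via singular value decomposition of the displacement matrix.
    _, s, Vt = np.linalg.svd(D, full_matrices=False)
    modes = Vt[:n_modes]                          # (n_modes, n_sections)
    coeff_sd = s[:n_modes] / np.sqrt(A.shape[0] - 1)
    explained = (s[:n_modes] ** 2).sum() / (s ** 2).sum()
    return mean, modes, coeff_sd, explained

def sweep_table(mean, modes, coeff_sd, steps=50, n_sd=3.0):
    """Sweep each mode coefficient over a regular grid and rebuild area functions."""
    axes = [np.linspace(-n_sd * sd, n_sd * sd, steps) for sd in coeff_sd]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
    coeffs = grid.reshape(-1, len(axes))          # (steps**n_modes, n_modes)
    areas = mean + coeffs @ modes                 # (steps**n_modes, n_sections)
    # Assumption of this sketch: negative areas are clipped to zero.
    return coeffs, np.clip(areas, 0.0, None)

# Toy stand-in for measured area functions: 10 shapes with 44 sections each.
rng = np.random.default_rng(0)
toy_areas = np.abs(rng.normal(2.0, 0.5, size=(10, 44)))
mean, modes, coeff_sd, explained = orthogonal_modes(toy_areas, n_modes=2)
coeffs, table = sweep_table(mean, modes, coeff_sd, steps=50)
print(coeffs.shape, table.shape)
```

With two modes and 50 steps per mode the table holds 2500 candidate area functions; passing `n_modes=3` and a different step count gives a three-factor table in the same way.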
3 RESULTS

For our own data, Story & Titze's method worked best if tense vowels were used as input data, i.e. those that are close to the border of the maximal vowel space (MVS, [12]). In other words, only a subset of our area functions was used. As this method is, in addition, based on the first two principal components only, we resorted to another strategy: a lookup table was generated by sweeping the values of the first three factors, which explained about 83% of the variance, through ±3 standard deviations around the mean of each factor, this time inputting all of our empirical area functions. This resulted in a table of about 64000 area functions.

This table could have been pruned according to suggestions made by e.g. Boë et al. [13] in order to retain only articulatorily meaningful configurations. They removed configurations with constrictions smaller than 20 mm² (the lower limit for laminar airflow), configurations with constrictions centered less than 4.5 cm from the glottis (these are articulatorily impossible), and configurations whose formants do not lie within the range of the MVS (according to the MVS, the minimum of the first formant is at 250 Hz and the second formant ranges between 510 and 2295 Hz). Since the focus here is on the generation and interpretation of vocal tract target configurations rather than on vowel spaces or trajectory generation, an alternative procedure was used: further processing consisted in retaining those factor coefficients that produced area functions with an error of less than three percent in the first three formants.

These remaining factor coefficients were subjected to a k-means cluster analysis with the squared Euclidean distance as the distance measure. Silhouette plots suggested three distinct clusters with good separation, indicated by silhouette values of up to 0.8. This was further substantiated by the fact that the sum of the squared Euclidean distances converged to a stable value over several runs of the analysis, so solutions with three clusters were selected. This analysis was performed in the hope that the partitioning of the factor space would not be too severe, and that clusters could be retained that would point to a region in factor space containing our desired target configuration. This hope was not fulfilled, though: visual inspection of the area functions generated by these factor scores pointed to similar tract configurations. An example configuration is shown in Fig. 4, superimposed on the reconstruction obtained with the reconstruction scheme of Story & Titze, both for the voiceless cognate [cElo]. This high partitioning of the input space might explain why all our attempts to obtain reasonable configurations by means of simulated annealing failed.

[Figure 4 plot: area (cm²) against distance from glottis (cm)]

Figure 4: Reconstructed area functions for the first target frame of the voiceless stimulus [cElo]. Solid line: reconstruction scheme used by Story & Titze. Dotted line: table search.

On the other hand, control analyses with Simulated Annealing using the articulator-based model of Maeda [14] quickly resulted in reasonable reconstructions (see Fig. 5). The only constraint imposed on the articulators was that the control parameter tongue body was not allowed to move more than 1.5 standard deviations around its mean, in order to prevent meaningless configurations. The cost function was specified as the percentage deviation of the original formants from the target formants. This measure amounted to less than 0.002% after annealing was stopped. The reconstruction in Fig. 5 confirms the result of Keating & Lahiri that the tongue blade movement dominates the tongue dorsum movement. Recall that all reconstruction methods used are based on the onset of vocalic structure and not on complete oral closure as shown in Fig. 1.

[Figure 5 plot: vocal tract configuration, Y (cm, 0 to 10) against X (cm, 0 to 12)]

Figure 5: Configuration obtained with Simulated Annealing with the Maeda vocal tract model for the first vocalic frame of [cElo].

4 DISCUSSION

Summarizing our results, one can conclude that the reconstructions obtained by the method of Story & Titze are a function of the input data used. This was shown by factoring our complete database and searching for vocal tract area functions by means of 'brute force'. The other finding underpins the widely accepted fact that linear articulator models are difficult to construct but relatively easy to control, whereas the reverse holds for area function models.² However, it was possible to recover the palatal release by means of a simulated annealing approach using Maeda's articulatory model.

It should not be concealed that the performance of our simulated annealing parametrization is not yet optimal with respect to speed. Otherwise, we would have tried to explore the possibility of finding reasonable articulatory patterns with respect to the rate of increase in cross-sectional area of the constriction after the onset of formant structure. Stevens ([2], p. 369) estimates the rate of increase in area for a velar stop at 25 cm²/s, about one-half to one-fourth of the rate assumed for labials and alveolars. It would be informative to observe the effect of the assumed coronal dominance on the release movement and to relate such data to empirical measurements.

Another important point, which initially was one of the main incentives for beginning this study, was to explore the potential use of vocal tract analogues for the study of speech perception, which involves the simultaneous optimization not only of formant frequencies but also of bandwidths. As this would have required a more sophisticated acoustic and aerodynamic modeling of vocal tract loss mechanisms, it appeared to be beyond the scope of this study.

² E.g. talk by G. Bailly, Venice International University Workshop on Face, Speech and Acoustics, Venice, November 22/23, 2002.

Acknowledgments

Work was supported by German Research Council (DFG) grant GWZ 4/8-1, P.1.

REFERENCES

[1] P. Keating and A. Lahiri, “Fronted velars, palatalized velars, and palatals,” Phonetica, pp. 73–101, 1993.
[2] K. Stevens, Acoustic Phonetics, The MIT Press: Cambridge, Massachusetts, London, England, 1998.
[3] H. Wakita, “Direct estimation of the vocal-tract shape by inverse filtering of acoustic speech waveforms,” IEEE Trans. Audio Electroacoustics, vol. 21, no. 5, pp. 417–427, 1973.
[4] B.S. Atal, J.J. Chang, M.V. Mathews, and J.W. Tukey, “Inversion of articulatory-to-acoustic transformation in the vocal tract by a computer-sorting technique,” JASA, vol. 63, pp. 1535–1555, 1978.
[5] B. Story and I. R. Titze, “Parametrization of vocal tract area functions by empirical orthogonal modes,” Journal of Phonetics, pp. 223–260, 1998.
[6] F. Daneš, B. Hála, A. Jedlička, and M. Romportl, O mluveném slově, Státní Pedagogické Nakladatelství, Prague, 1954.
[7] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller, “Equation of state calculations by fast computing machines,” J. Chem. Phys., vol. 21, no. 6, pp. 1087–1092, 1953.
[8] L. Ingber, “Simulated annealing: Practice versus theory,” Mathematical Computer Modeling, vol. 18, no. 11, pp. 29–57, 1993.
[9] S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, no. 4598, pp. 671–680, 1983.
[10] International Phonetic Association, Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet, Cambridge University Press, Cambridge, UK, Aug. 1999.
[11] P. Boersma and D. Weenink, “Praat, a system for doing phonetics by computer,” www.praat.org, 1992–2003.
[12] L.J. Boë, P. Perrier, B. Guérin, and J.L. Schwartz, “Maximal vowel space,” Eurospeech, vol. 2, pp. 281–284, 1989.
[13] L.J. Boë, P. Perrier, and G. Bailly, “The geometric vocal tract variables controlled for vowel production: proposals for constraining acoustic-to-articulatory inversion,” Journal of Phonetics, vol. 20, no. 1, pp. 27–38, 1992.
[14] S. Maeda, “Compensatory articulation during speech: evidence from the analysis and synthesis of vocal-tract shapes using an articulatory model,” in Speech production and speech modelling, pp. 131–149, 1990.