and Experimental, circa 1960-2000*

John Coleman

1. Experimentation, documentation, instrumentation, computation

Phonetics draws on two traditions: the experimental paths of physiology, acoustics, and psychology, and the anthropological-linguistic paths of detailed description and documentation of the sounds of specific languages, including the quest for previously undocumented possible sounds. Especially since 1960, the former has accelerated and yielded a proliferation of discoveries that will form the bulk of this chapter. But before taking that up, let us not overlook the ongoing progress in documenting the phonetics of specific languages.

Phonetic “specimens” of languages had appeared in the journals and brochures of the International Phonetic Association since the beginning of the century. The 1960s' shift in focus from particular languages towards universal or general properties of language is nowhere better exemplified than in the phonetic surveys of West African languages and (subsequently) of a global spectrum of languages by Peter Ladefoged and his students and colleagues at the UCLA Phonetics Laboratory (Ladefoged 1964, Ladefoged and Maddieson 1996). This work deployed keen observation and some instrumental investigations to discover numerous previously undocumented or neglected possibilities of human speech. In a parallel, phonologically-oriented survey, Maddieson (1984) assembled the UCLA Phonological Segment Inventory Database (UPSID), drawing upon and extending data from the Stanford Phonology Archive (Greenberg et al. 1978), and mined it for new insights into phonological systems and typology.

The mother lode for phonetic discoveries, however, has been instrumental and experimental phonetics, and their symbiotic relationship with the abundant proposals of phonological theory. These, in turn, rested upon significant developments in instrumentation in speech physiology, acoustics, and perception. X-ray stills and cinefluoroscopic movies illustrate the importance of photography as an adjunct technology, which itself developed significantly as cine film gave way to videotape and then digital photography. For phonetics, X-ray microbeam and latterly magnetic resonance imaging (MRI) gave new views of the moving vocal tract. Developments in electronics and computing, in particular, have been central. The sound spectrograph, developed during World War II, yielded a peacetime dividend as successive models were acquired by phonetics laboratories around the world, making it the most important instrument after the microphone. The microphone, the fundamental device for recording and analysing speech, is a paradigm example of an electromechanical transducer, a component that converts movement into measurable electrical signals. Other electrophysiology devices measure tiny currents generated within or passed through the human body: for example, electromyography picks up voltage changes generated by muscle contraction; electroglottography (e.g. the Laryngograph) measures vocal fold opening, typically during phonation; and electropalatography detects the regions of tongue contact against the hard palate. The oscillograph (a.k.a. mingograph, polygraph) – a printing oscilloscope – permitted such electrophysiological signals to be plotted, studied, measured or recorded on magnetic tape, for subsequent "off-line" analysis.

Conversion of analogue signals to digital recordings (A–D conversion) and their subsequent processing using computers developed in the 1960's and early 1970's in telecommunications and

* I thank Ros Temple, Alan Cruttenden, Ethan Sherr-Ziarko and the editors for their invaluable comments on this chapter.

audio laboratories (e.g. Bell Laboratories). Digital signal processing was developed in electronics and computing laboratories that could afford the equipment. By the 1980's, A–D converters and digital recorders were to be found in phonetics laboratories too, together with early microcomputers. Software packages such as ILS-PC (“Interactive Laboratory System”; see Read et al. 1992) provided routines for speech editing, pitch analysis, spectral analysis, and the extraction of acoustic features; as the capabilities of the first PC's rapidly improved, on-screen display of spectrograms and waveforms became possible. The increased power, lower cost and wider availability of laboratory computers meant that software for sophisticated statistical analysis of larger datasets became more widespread, and laboratories that focussed on computational modelling of physiological (e.g. kinematic) and acoustic data could develop increasingly powerful numerical models of speech processes.

These advances in instrumentation underpin the discoveries of phonetics that I take up in the subsequent sections. To give due notice to the great range of discoveries, I shall focus in turn on speech production, speech acoustics and speech perception, further subdividing production into control of air-streams, phonation, articulation and co-articulation. To conclude, I survey the rise of Laboratory Phonology.

2. Speech Production

2.1. Control of air-streams David Abercrombie (1980 [1991]) recalls that when he was a student at University College London in the 1930's, his phonetics laboratory associates Steve Jones and J. R. Firth were impressed by Stetson's Motor Phonetics (Stetson 1928), and its proposal that every syllable is delimited by a contraction of the chest muscles (a “chest pulse”) or by the constriction of a consonant, or both. Onto this is superimposed a series of less frequent but more powerful contractions of the breathing muscles, causing a more considerable increase in air pressure: “stress pulses”. Later, Abercrombie recommended this theory to his poetically-inclined graduate student and Lecturer in Phonetics, Peter Ladefoged. In collaboration with physiologists M. H. Draper and David Whitteridge, Ladefoged obtained data from 18 subjects using electromyography of six thoracic muscles (Draper et al. 1959). Volume of air in the lungs was measured using a whole-body plethysmograph, a large steel barrel in which only the head and neck are exposed. Tracheal air pressure was estimated using a small balloon passed through the nose to just above the oesophagus; three subjects swallowed the balloon for calibration, and (for one subject) the correlation between oesophageal and tracheal pressure was corroborated by direct measurement of tracheal pressure via puncture of the trachea using a 1.5 mm hollow needle! They ascertained that in speech, the regular in-out airflow of quiet respiration gives way to a deeper inspiration in preparation for speech, followed by a long, gradual exhalation over the extent of the utterance. The flow of air from the lungs is not controlled syllable-by-syllable, but stressed syllables seemed to be associated with “stress pulses”.

Most studies of speech aerodynamics have used rather less invasive methods than Draper et al.'s: typically a face-mask, usually with separate chambers for mouth and nose, or a short tube into the mouth, connected to pressure and flow transducers. Oral and nasal airflow in normal and disordered production of oral and nasal vowels and consonants is quite easily examined; see Warren (1976) for a review. The less common air-streams (such as the clicks of some African languages) have been studied by Ladefoged (1964), Ladefoged and Traill (1984), and Demolin (1995).

Ladefoged founded the UCLA Phonetics Laboratory in the 1960's; its studies led the field throughout our period. John J. Ohala, a graduate of the UCLA laboratory, augmented his investigations of speech airflows by a software model of speech aerodynamics (Ohala 1976). In a

series of studies, reviewed in Ohala (1990), he challenged Draper et al.'s support for the “stress pulse” hypothesis (that stress is implemented by expiratory muscle contractions). With reduced airflow, e.g. during the closure portion of stops and affricates, lung volume decreases more slowly, whereas during emphatically stressed syllables it decreases more quickly. Ohala interpreted these results as showing that the primary function of the pulmonic system during speech is simply to produce air pressure from the lungs that is reasonably constant and above some minimal level; the lungs do not produce “stress pulses”. Ohala claimed that the large, rapid decreases in lung volume during aspiration, [h], and fricatives are due to "a passive collapse of the lungs due to the rapid air flow out of the lungs … Active short-term variations in lung volume decrement are probably limited to the production of variations in the loudness of speech" (1990: 35). In short, he argued, Stetson's theory was disproved.

2.2. Phonation Laryngeal activities may be observed visually using a larynx mirror (which resembles a dental mirror) or laryngoscope, but with such means only a restricted range of speech is possible: just “aaah”, perhaps. More recently, fiber-optic endoscopy, using a flexible fiberscope passed through the nose and down into the pharynx, has enabled a variety of phonation types and other laryngeal activities to be filmed with less discomfort for subjects. This technique, originated by Sawashima and Hirose (1968), has only very recently led to substantial advances in our understanding of laryngeal and pharyngeal activity (Edmundson and Esling 2006 and its references).

The mechanism of vocal cord vibration was discovered by van den Berg (1958), whose myoelastic-aerodynamic theory opposes the aerodynamic force of the Bernoulli effect – which causes the vocal folds to snap shut as air is forced between them – against their muscular elasticity. This explanation of voicing, now widely presented as the correct one, replaced Husson's neurochronaxic theory of voicing, which attributed vocal fold vibration to rapid, rhythmic impulses of the recurrent nerve; Dedo and Dunker (1967) put this to the test and disproved it. Phonetic studies of voicing progressed on a number of fronts. A number of labs used electromyography to investigate the complex activity of the larynx muscles in the control of voicing and pitch (e.g. Hirano and Ohala 1969, Hirose and Gay 1972, Lieberman et al. 1970, and many others). Typically, these studies correlate EMG signals with acoustic and/or other physiological measures, such as airflow. The refinement of electroglottography (the Laryngograph, a trademark) by Fourcin (1974) enabled glottal aperture to be recorded noninvasively and visualized on an oscilloscope or computer. As well as being suitable for regular clinical use, even with children, electroglottograms show many linguistically-significant phonation types (e.g. normal voice, creaky voice, breathiness) (Esling 1984).

2.3. Articulation The movements of the lips in speech, being easily visible, have been noted since antiquity. The most striking modern discovery regarding labial (co)articulation (Lubker and Gay 1982) is discussed in section 2.4, below. Though phonologists do not normally count the jaw as a speech articulator, artificially preventing jaw raising would severely impair speech if speakers could not adapt to the disruption. Alternatively, if speakers can achieve articulatory or communicative goals by other means – e.g. by moving the lips more, for a labial constriction, or by moving the tongue more, to shape the vocal tract appropriately – important data about speakers' strategies can be obtained. Bite blocks have been used to assess speakers' ability to adapt to sudden changes in the vocal tract (Warren, Nelson and Allen 1980). Such studies found that even with dramatic increases in jaw aperture, the cross-sectional opening for fricatives was relatively unchanged. In the 1970's, Abbs and colleagues developed apparatus for direct measurement of lip and jaw motion, using flexible paddles. In conjunction with EMG measurements of the lip and jaw muscles, Folkins and Abbs (1975) examined the compensations speakers made in response to applying mechanical loads

without warning to the lower jaw. Speakers adapt rather well to such disruptions to their articulation, suggesting that we do not have fixed articulatory plans (simply memorized movements) but can adapt our articulations quite flexibly to achieve the same speech “goals”. One interpretation of this result is that articulatory “targets” are best stated in terms of constrictions: the cross-sectional area of the constrictions is what matters most in articulation, rather than the means by which it is achieved. Another view is that speakers' goals are acoustic and communicative, and that the articulations used to achieve them matter little, as they are merely means to ends, as Jakobson had often noted (e.g. Jakobson 1951); see further section 3 below.

Tongue movements, being largely concealed during speech, have been studied since the 1930's using various radiographical techniques (see Munhall 2001 for an overview). In view of ethical concerns regarding the relatively high radiation doses involved, X-ray cine filming of articulation drastically reduced after the 1970's. A variety of alternative techniques have been employed instead. X-ray microbeam imaging (Fujimura et al. 1973, Westbury 1991) uses a computer-controlled X-ray beam to track the motions of small gold pellets glued to the articulators. The radiological exposure is lower than that of a dental X-ray, but movement data is only obtained from a handful of points on the articulators, so the motion of the articulators must be inferred from the point data. The tongue-tip, which is very mobile and fast-moving, is particularly difficult to track accurately using such a technique, as it is impossible to put a pellet right on the tongue tip. Nevertheless, a large amount of data has been collected over the years, which has facilitated significant developments in phonetics and phonology. For example, Browman and Goldstein (1992) fit their computational model of articulatory dynamics to a corpus of X-ray microbeam data in order to show that, in English, the articulation of [ə] (the weak, unstressed vowel in the, or the second syllable of nonetheless) is somewhat underspecified, hence more variable than other vowels and accommodating to neighbouring sounds. Electromagnetic midsagittal articulography (EMA) is another point-tracking method, which uses radio-frequency location of tiny coils placed on the articulators. The position of each coil is determined by triangulation, according to signal strength. The apparatus for this is far less expensive than X-ray microbeam, and a growing number of laboratories now have EMA systems. Since the 1980's, a few laboratories have also used ultrasound imaging to study tongue shapes (e.g. Stone et al. 1988), as it is far less invasive.

Hidden articulations (underarticulation). According to phonemic phonology and generative phonology, segments may be deleted or substituted by another sound in certain contexts and styles of speech. However, a variety of articulatory studies have found that such apparently deleted or substituted segments actually remain observable in the articulatory record. For example, Browman and Goldstein (1990) showed that in casual speech pronunciations such as [pɚfɛkt mɛmɚɹi] (“perfect memory”), the word-final alveolar articulation of “perfect” (in its citation form a stop, [t]) is not deleted but reduced: it is overlapped by the other (velar and bilabial) articulations in the vicinity and is consequently acoustically inaudible. Similarly, they found that in the assimilated form [sɛvm̩plʌs] (“seven plus”), the alveolar articulation of the word-final nasal of “seven” was not replaced by the bilabial but hidden by it; the tongue-tip raising-lowering gesture was present even in the assimilated form. Using electropalatography, Nolan (1992) confirmed this gradual assimilation and residual tongue contact of word-final alveolars. In assimilated tokens that Nolan labels “zero-alveolar” (no tongue contact), electropalatography does not directly reveal whether the tongue-tip is raised. However, Mowrey and MacKay (1990) discovered that in “normal sounding” tokens of tongue-twisters, lip and tongue muscle activity revealed “hidden” (inaudible) incorrect articulations. Just as Browman and Goldstein's and Nolan's data challenge discrete-segmental accounts of assimilation and deletion, Mowrey and MacKay's data challenge discrete-segmental descriptions of speech errors.

In traditional, descriptive courses in phonetics and phonology, the pharynx is a somewhat mysterious place that cannot easily be inspected visually. The pharyngeal fricatives [ħ] and [ʕ] – sounds made with a constriction in the throat, known from Arabic and some other languages – are, indeed, relatively rare: they are the least common fricatives in Maddieson's 1984 cross-linguistic survey of phonological inventories. However, the pharynx makes up about half of the vocal tract's length, and acoustic theory shows that pharynx volume is crucial to vowel sounds. For example, the low first formant frequency of close vowels such as [i] is due more to their large pharynx volume than to their narrow oral constriction per se, and vice-versa for the high first formant frequency of open vowels such as [ɑ]. Put simply, a large pharynx, as in [i], resonates at a low frequency and a small pharynx, as in [ɑ], resonates at a high frequency (cf. Mrayati et al. 1988). Obtaining articulatory data about pharynx widths or volumes is particularly difficult: X-ray data was vital to the foundational work on the relationship between the articulation and acoustics of vowels (Chiba and Kajiyama 1941; see also Arai 2001) and consonants (Fant 1960).
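The effect of pharynx (back-cavity) volume on F1 can be illustrated with the familiar Helmholtz-resonator approximation used in acoustic phonetics for constricted vowels such as [i]; the sketch below uses illustrative dimensions, not measurements from any speaker.

```python
import math

# Helmholtz approximation: a back cavity of volume V behind a narrow
# constriction of area A and length L resonates at
#   f = (c / (2*pi)) * sqrt(A / (L * V)),
# so enlarging the pharyngeal cavity lowers the first formant.
c = 35000.0          # speed of sound in warm, moist air, cm/s
A = 0.2              # constriction area, cm^2 (illustrative)
L = 2.0              # constriction length, cm (illustrative)

for V in (30.0, 60.0):                       # back-cavity volume, cm^3
    f1 = (c / (2 * math.pi)) * math.sqrt(A / (L * V))
    print(f"V = {V:.0f} cm^3  ->  F1 approx. {f1:.0f} Hz")
```

With these made-up numbers, doubling the back-cavity volume lowers the predicted F1 from roughly 320 Hz to roughly 230 Hz, in line with the qualitative statement above.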

A central question of vowel articulation is: what are the principal dimensions of tongue movement? The IPA/Daniel Jones system of vowel classification estimates the position of a single point on the tongue (the highest point) on two dimensions, open−close and front−back, a system justified by the practical experience of phoneticians making transcriptions. Harshman et al. (1977) used a factor-analysis method, PARAFAC, to determine numerically the dimensions along which the overall position of the tongue is set, based on a database of X-ray tracings (Ladefoged et al. 1972). They found that (for their English vowel data) two dimensions of movement can account for most of the variation in tongue position, but unlike the IPA/Jones system, the two dimensions are (1) a forward movement of the root of the tongue together with a raising of the front of the tongue, and (2) an upward and backward movement of the tongue. These “front-raising” and “back-raising” movements correspond roughly to the actions of the genioglossus and styloglossus muscles, respectively.1 The overall tongue shape for every English vowel is defined by a combination of front-raising and back-raising (Figure 1).

Ladefoged (1964: 38-40) and Stewart (1967) noted that in several African languages (e.g. Igbo and Twi), advancing the tongue root appears to be a distinctive feature, so that two vowels can have the same tongue height but different tongue root positions (and thus different pharynx widths). Painter (1973) and Lindau (1978) found corroborating evidence from Twi and Akyem, by which time Halle and Stevens (1969) had proposed adding [±advanced tongue root] to the inventory of phonological distinctive features. Various phonologists then began to use the feature [±A(dvanced) T(ongue) R(oot)] to label the contrast between tense and lax vowels in English, even though this identification had not been demonstrated, so the question arose as to whether it was correct. Wood (1992) saw evidence in cinefluoroscopic films of Southern British English and Arabic speakers that ATR has a role in the formation of vowels, though more recent MRI studies have cast doubt on this. MRI has several advantages over X-rays: it does not employ ionising radiation, so the health risks are much reduced and imaging sessions can be quite long; it is a tomographic (“slicing”) technique, so its vocal tract images are unobscured by cheek and gum tissue or the teeth (unlike sideways-on X-ray images); and 3-D (volume) data and movies can easily be collected. Following early trials of MRI of fixed vowel postures (Baer et al. 1987), Tiede (1996) used MRI to compare pharynx widths of the vowel pairs [i] vs. [ɪ], [e] vs. [ɛ] and [u] vs. [ʊ] in American English and Akan; he found that while Akan exhibits larger pharynx widths for [i], [e] and [u], the differences in English are less consistent. This supports Ladefoged and Lindau's conclusions that in English, pharynx width differences follow from the degree of tongue raising; the tongue root does not behave as an independent articulator, and tongue root advancement is not a third dimension of tongue position. So the difference between [i] and [ɪ] in English is not an [±ATR] contrast; some other feature, such as [±tense], is involved.

1 They remark, however, “that we do not think it entirely appropriate to associate either factor with particular muscles or muscle groups.”

Velum position (that is, whether the velum is lowered to allow egress of air via the nose, or raised to prevent it) is fairly easily studied simply by measuring airflow at the nostrils. Such aerodynamic studies of nasality, especially of nasal-oral coordination, have been common in experimental phonetics and phoniatrics at least since Rousselot (1901: 525-582). Within the period of this chapter, Cohn (1990) is an example of how cross-linguistic nasality data from English, French and Sundanese informs our questions regarding the relationship of phonetics to phonology. Additionally, cineradiographic measures of velum control and its coordination with oral movements have been vital in the study of coarticulation, to which I now turn.

2.4. Coarticulation “Coarticulation” refers to several distinct aspects of articulatory timing, in two main species: (i) The temporal coordination of independent articulators; for example, the coordinated movements of tongue and lips in articulating the consonant and vowel in a combination such as [ku]. (ii) The combined movements of a single articulator in executing two articulatory gestures, for example the competing demands on tongue movement in articulating the consonant and vowel in a combination such as [ki]. In these well-worn examples,2 it has long been qualitatively observed that the lip rounding for the [u] of [ku] actually begins during the initial stop, possibly even as early as the initial closing movement of the [k]. In the English pronunciation of key, [ki], the initial consonant has a notably advanced, pre-velar articulation, appropriate to the front articulation of the vowel. These examples illustrate “consonant+vowel” (CV) coarticulation. Similar (though slightly different) coarticulation patterns may be observed in VC (vowel+consonant) combinations. The third possibility, vowel-to-vowel coarticulation, is observed in many languages even across intervening consonants, e.g. in VCV contexts (Öhman 1966). The fourth logical possibility, consonant-consonant coarticulation, has so far been little studied experimentally (cf. Manuel 1995).3

Moll and Daniloff (1971) used cinefluoroscopic movies to study the timing and extent of velum movements in consonant-vowel coarticulation and to test two “competing” models of coarticulation. According to Kozhevnikov and Chistovich (1965), the basic articulatory elements are Consonant + Vowel units.4 Maximal coarticulation is expected to be found within this domain, as the CV complex is stored as a piece; therefore, less coarticulation is expected between successive CV units. A different model of coarticulation was proposed by Henke (1966), in which independent segments (not CV units) are the elementary units. The velum position of each segment can be specified as “closed” (for oral consonants), “opened” (for nasal consonants and for quiet respiration), or “unspecified” (for English vowels, which are not contrastively nasal, but which may be nasalized when in the environment of a nasal consonant). In contemporary phonological notation, we could say that segments are either [+nasal], [−nasal], or [0 nasal]. In Henke's model, the value of underspecified features is determined by “looking ahead” to the next specified value. In other words, specified values spread backwards to fill in underspecified features, as the sketch below illustrates.
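Henke's look-ahead mechanism is simple to state procedurally. The following minimal sketch (my own illustration, not Henke's original program) represents one feature of a segment string as '+', '-' or '0' (unspecified) and fills each unspecified value by copying the next specified value backwards, reproducing the prediction shown in (1) below.

```python
def look_ahead_fill(values):
    """values: list of '+', '-' or '0' (unspecified) for one feature, e.g. [nasal].
    Each '0' is replaced by the next specified value to its right (look-ahead)."""
    filled = list(values)
    next_specified = None
    for i in range(len(filled) - 1, -1, -1):   # scan right-to-left
        if filled[i] == '0':
            if next_specified is not None:
                filled[i] = next_specified
        else:
            next_specified = filled[i]
    return filled

# CVVN sequence, [nasal] values as in example (1):
print(look_ahead_fill(['-', '0', '0', '+']))   # ['-', '+', '+', '+']
# i.e. nasality spreads backwards over both vowels, so velar opening is
# predicted to begin just after the initial oral consonant.
```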

According to Henke's model, in CVVN sequences (N denotes a nasal consonant), velar opening

2 Mentioned at least as early as IPA (1949).
3 Coleman (2003) found that the frication and aspiration at the release of [t] is similar to the subsequent [s] in e.g. utter sip vs. more like the subsequent [ʃ] in utter ship, a kind of “sibilant harmony” previously only attested in some African and indigenous American languages.
4 I avoid the term “syllable” here, because a CVC syllable is analysed by Kozhevnikov and Chistovich into two such units: CV.C.

should commence just after the initial non-nasal consonant (1); in NVC sequences, velar closure should commence just after the initial nasal (2). In contrast, in Kozhevnikov and Chistovich's model, velar opening in CVVN is not expected until the end of CV.V, as these are separate articulatory units from the final N.

(1)  Segments:        C    V    V    N
     [nasal] values:  −    0    0    +
     Predicted beginning of velar opening:
        Henke:  here, just after C (nasality spreads backwards from N over both vowels)
        K & C:  | syllable 1 | syll 2 | syll 3 |   here, in syllable 3

(2)  Segments:        N    V    C
     [nasal] values:  +    0    −
     Predicted beginning of velar closure:
        Henke:  here, just after N
        K & C:  | syllable 1 | syll 2 |   somewhere between here and here

Test sentences were devised to capture a rich variety of patterns of such velum opening and closing movements. For example, “Have Terry enforce one of the rules” contains the CVVN and NVC patterns, [rien] and [nov]. Velum movement was measured by tracing the velum and tongue contours, frame-by-frame, from the X-ray film (Figure 2).

Figure 2. Tracing of velum positions, from Moll and Daniloff (1971). AB is the direction of velum raising, along which the degree of velum opening was measured. VP is a second measure of velopharyngeal opening.

For NC clusters, it was found that velum raising began during the nasal consonant; for NVC sequences, it began during the nasal or the vowel, with some carry-over. These results are neither predicted nor contradicted by either model. In CVVN sequences, velar opening started at or before the beginning of tongue movement toward the first vowel; there is anticipatory nasalization over two vowels, across a word boundary. This is not explained by the Kozhevnikov-Chistovich model,

but is predicted by Henke's.

Öhman (1966) discovered even longer-distance coarticulation than this. Using acoustic measurements of several European languages, he found that in V1CV2 sequences, the pronunciation of V1 varies depending on the identity of the upcoming V2. In particular, the second formant frequency (F2) of V1 is shifted in the direction of the F2 of V2; e.g. F2 is a little lower before /Cu/ than before /Ci/, because the lips are starting to round during the first vowel in anticipation of the coming /u/. To explain this, Öhman proposed that vowels and consonants are not concatenated into a chain, as an alphabetic transcription suggests: instead, he argued that the stop-consonant gestures are superimposed on a continuous motion from one vowel to the next.

Subsequent studies examined anticipatory labialization in greater detail. Lubker and Gay (1982) found that Swedish speakers may anticipate labialization over 4 to 6 consonants in e.g. ditt skrot [ɪtʷsʷkʷrʷu] or ett danskt skåp [ɑnʷsʷkʷtʷ sʷkʷo], or 4 consonants for American English speakers, e.g. its stew [ɪtʷsʷ sʷtʷu]. Magen (1997) found V-to-V coarticulation across a span of three syllables, and West (1999) found articulatory differences across the two syllables [tədə] correlated with a following [l] or [r] in e.g. uttered a {l/r}eap. Heid and Hawkins (2000) most strikingly showed that coarticulation to an upcoming [l] or [r] is evident across the four preceding syllables in phrases such as it might be a lamb/ram.

These studies raise two major questions: (i) How can we quantitatively model such coarticulatory effects? (ii) How do we explain coarticulation? In particular, to what extent does coarticulation reflect purely mechanical properties of the articulators, such as inertia, and to what extent is it intentionally planned by the speaker's (mental) motor system?

Quantitative modelling of articulations begins with articulatory synthesiser programs such as those of Coker (1976) or Browman et al. (1984). Building on the influential review of coarticulation by Fowler (1980), Kelso et al. (1986) proposed a numeric, dynamic model of articulatory control that established a foundation for an extensive programme of research at Haskins Laboratories in Articulatory Phonology (Browman and Goldstein 1985, 1990, 1992). Kelso et al. tested their model against coarticulation data given by Sussman et al. (1973). For a bilabial C, for instance, they represent V1CV2 productions as two temporally overlapping sequences: (i) the tongue-jaw system lowers for V1 and moves towards the position for V2; (ii) overlaid on this, the lip-jaw system is used to form a constriction at the lips.

Henke's idea that articulatory dimensions of some speech elements are unspecified was reprised in a number of models in the 1980's and 90's. Pierrehumbert's (1980) theory of intonational phonology is, in essence, an account of how the larynx movements used for pitch control are coarticulated with mouth movements. While the latter must be specified in some detail, the larynx controls are quite sparsely specified, with relatively few intonational targets (“tones”) compared to articulatory targets (segments). In the utterance in Figure 3, for example, the tongue tip is very active and highly specified: for [t] it contacts the palate, for [s] it comes away a little, for [ɹ] it may be curled a little, for [l] the sides of the tongue are lowered, and so on.
In all, there are 9 tongue-tip positions in this utterance, 9 tongue-dorsum positions, plus a labial closure for “be” and 2 or 3 lip-rounding gestures. Compared to that, Pierrehumbert argues there are just 5 laryngeal targets (the tones labelled on the graph); the intervening large-scale pitch movements come about by interpolating across the underspecified intervals between tonal targets. The smaller “wiggles” in the f0 contour are artefacts of the consonants, and are thus not part of the intonation melody.

Figure 3. Autosegmental-metrical labelling of an f0 contour, from Pierrehumbert (1980: 151). Only five intonational events are proposed: an initial high boundary tone, H%; a low accent, L*, on the stressed first syllable of really; a high accent, H*, on true; a phrase-final low tone, L–; and an utterance-final L%. Horizontal axis: time (units not shown); vertical axis: f0 in hertz.
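As a rough illustration of how such a sparse tonal specification yields a continuous contour, the following sketch (with invented target times and values, not Pierrehumbert's measurements) places f0 targets for the five tones and simply interpolates across the unspecified stretches between them.

```python
import numpy as np

# Hypothetical tonal targets for the five events labelled in Figure 3.
tone_labels = ["H%", "L*", "H*", "L-", "L%"]
tone_times = np.array([0.00, 0.40, 0.95, 1.45, 1.55])    # seconds (invented)
tone_f0 = np.array([230.0, 150.0, 250.0, 120.0, 105.0])  # Hz (invented)

t = np.arange(0.0, 1.56, 0.01)               # 10-ms frames
f0 = np.interp(t, tone_times, tone_f0)       # interpolate across unspecified spans

print(f0[::25].round())                      # a coarse view of the resulting contour
```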

Keating (1990) proposed a refinement to this “target-and-interpolation” approach to coarticulation modelling, representing each target (on each articulatory dimension) with a broader or narrower range – a window – of permitted variation. A narrow window presents a more specific, constrained articulatory target, while a broad window models underspecified, more variable articulations (Figure 4a). Blackburn and Young (2000) translated this idea into a probabilistic form, and used it to model real articulatory data (Figure 4b). According to their version, the movement of an articulator is more likely to pass through the peak of the probability “window”, though if the physics requires it (e.g. because of inertia), the path might pass through the lower-probability “tail” of the bell-curve, as at the right of Figure 4b.

Figure 4. (a) Simulated articulator trajectory (solid line) using the window model of coarticulation. The trajectory is constrained to pass through the “windows” indicated by dashed horizontal lines. From Blackburn and Young (2000), after Keating (1990). (b) Simulated articulator trajectory (dashed line) using a probabilistic coarticulation model. The midpoints of successive phonemes are indicated by dotted vertical lines, and associated with each midpoint is a probability distribution, defining the probability that the articulator will take particular positions at the midpoints. From Blackburn and Young (2000).
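The window idea is straightforward to simulate. The sketch below (my own illustration; the segment windows and frame counts are invented) interpolates an articulatory parameter between successive targets and then clips the path so that it never leaves each segment's permitted window; a broad window therefore behaves like an underspecified target.

```python
import numpy as np

# Permitted (low, high) ranges for one articulatory dimension, per segment.
windows = [(0.80, 0.95),   # narrowly specified target
           (0.20, 0.90),   # underspecified segment: broad window
           (0.10, 0.25)]   # narrowly specified again
frames_per_seg = 20
n = frames_per_seg * len(windows)

midpoints = np.array([(i + 0.5) * frames_per_seg for i in range(len(windows))])
centres = np.array([(lo + hi) / 2 for lo, hi in windows])

path = np.interp(np.arange(n), midpoints, centres)          # smooth transition
lows = np.repeat([lo for lo, _ in windows], frames_per_seg)
highs = np.repeat([hi for _, hi in windows], frames_per_seg)
trajectory = np.clip(path, lows, highs)                     # stay inside each window

print(trajectory[::10].round(2))
```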

Whalen (1990) addressed the question, “to what extent does coarticulation reflect purely mechanical properties of the articulators, such as inertia, vs. to what extent is it intentionally planned by the speaker”. In one experiment, Whalen studied how the first formant frequency (F1) and the duration of vowels are affected by the voicing of the following consonant when speakers either know or don't know what the following consonant will be. It was observed over a century ago that vowels tend to be shorter before voiceless consonants than before voiced consonants (see Chen 1970); also, F1 is lower just before a voiced stop than before a voiceless stop. These are fairly “low-level” phonetic variations that are not usually thought to be intentional. In Whalen's experiment, items such as “ABI” (to be read like “abbey”), “ABU” (“abboo”), “API” and “APU” were presented on a computer monitor for subjects to read aloud, but not fully revealed to start with: initially, either the second or third letter was missing. When the subject started speaking, the program filled in the missing letter. If coarticulation is just mechanical, it was expected that the initial /a/ would always have a duration and an F1 endpoint appropriate to the voicing of the medial consonant, and an F2 that is lower before /u/ than /i/, just as if the consonant or final vowel had not been hidden. However, if coarticulation is planned ahead and controlled from the speaker's brain, /a/ could not be affected by a missing medial consonant or final vowel. Whalen's experiment showed that when the consonant is hidden, the acoustics and duration of the /a/ vary only according to the second vowel, and when the second vowel is hidden, /a/ varies only according to the medial consonant. This demonstrated that coarticulation is largely planned; therefore, some “low level” phonetic aspects of speaking are centrally (mentally) controlled.

3. Advances in speech acoustics

We now turn to some of the main developments in acoustic phonetics, which cannot, of course, be completely separated from articulatory studies. Indeed, how articulations cause their acoustic outcomes, and how similar sounds may be created by distinct articulations are central questions in this period. The acoustic properties of vowels and consonants had been extensively studied before the 1960's (see Lehiste 1967 for the then state of the art); subsequent advances in acoustic phonetics came from developments in digital computers, signal processing, and speech synthesis. Earlier speech synthesisers were electronic circuits (Dunn 1950, Lawrence 1953, Stevens et al. 1953); because it is impractical for a human operator to adjust the many controls of such a device in real time, synthesis researchers in the early 1960's began to use computers, at first just to work the synthesiser's controls (Holmes, Mattingly and Shearme 1964, Liljencrants 1967), and later to replace the circuits completely, by numerically simulating the generation of audio signals in computer software (Rabiner 1968, Holmes 1973, Klatt 1980). Klatt's publication of his synthesiser program promoted its widespread use in many laboratories for generating stimuli in speech perception experiments.

Although most linguists are happily ignorant of the details of computer programs, two key developments in signal processing software underpin many speech analysis and synthesis methods: the Fast Fourier Transform (FFT), and Linear Predictive Coding (LPC). Fourier analysis – a method for calculating the frequency components of complex waves such as sound waves – was used for speech analysis even in the 19th century, but the necessary computations were very time consuming

until the discovery of the FFT algorithm by Cooley and Tukey (1965). The software packages for speech analysis that became common in phonetics laboratories in the 1980's (e.g. ILS: Read et al. 1992), as well as those used today (e.g. Wavesurfer, Praat), depend on the FFT algorithm for calculating the spectrum of selected portions of a waveform. Making a spectrogram or finding its formants requires thousands of such computations, and the main methods of pitch estimation also employ FFTs: typically 100-200 for each second of speech.
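For concreteness, the sketch below (a synthetic test signal and illustrative parameters, not any particular package's implementation) computes the kind of short-term FFT spectrum that underlies spectrographic displays: one windowed 25-ms frame is transformed and its strongest frequency component is read off.

```python
import numpy as np

fs = 10000                                     # sampling rate, Hz
t = np.arange(0, 0.025, 1 / fs)                # one 25-ms analysis frame
frame = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 720 * t)

windowed = frame * np.hamming(len(frame))      # taper the frame edges
spectrum = np.fft.rfft(windowed, n=1024)       # zero-padded FFT
power_db = 20 * np.log10(np.abs(spectrum) + 1e-12)
freqs = np.fft.rfftfreq(1024, d=1 / fs)

print(round(freqs[np.argmax(power_db)]))       # strongest component, approx. 120 Hz
```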

Linear Predictive Coding (Atal and Hanauer 1971) is a numerical method for breaking down a sound wave into a combination of many (typically 12 to 20) slowly-varying numbers – LPC coefficients – plus a residual (error) signal. This error signal contains regularly spaced pulses, corresponding to glottal pitch pulses, which can be analysed in order to estimate the pitch of the voice. Alternatively, the pitch pulses and the coefficients can be recombined to make a synthetic version of the original signal. The pitch can be altered, making LPC synthesis a valuable tool in models of intonation (e.g. Pierrehumbert 1981) and for experiments on the perception of prosody (e.g. 't Hart et al. 1990). Importantly, the FFT of LPC coefficients provides a spectrum which is much smoother than the conventional Fourier spectrum: it has more clearly defined formant frequency peaks, which are often used in plots of vowel quality.
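A minimal sketch of the autocorrelation (Levinson-Durbin) route to LPC coefficients and the smoothed spectral envelope is given below; the synthetic frame and model order are illustrative, and real analysis packages add refinements such as pre-emphasis and more careful windowing.

```python
import numpy as np

def lpc(frame, order):
    """LPC coefficients [1, a1, ..., ap] of A(z), via the Levinson-Durbin recursion."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k                  # prediction error shrinks at each order
    return a

fs = 10000
t = np.arange(0, 0.03, 1 / fs)
frame = np.sin(2 * np.pi * 120 * t) * (1 + 0.5 * np.sin(2 * np.pi * 700 * t))
frame = frame * np.hamming(len(frame))      # toy "voiced" frame, for illustration only

a = lpc(frame, order=12)
envelope_db = -20 * np.log10(np.abs(np.fft.rfft(a, n=1024)) + 1e-12)   # smooth envelope
freqs = np.fft.rfftfreq(1024, d=1 / fs)
print(round(freqs[np.argmax(envelope_db)]))  # frequency of the strongest envelope peak
```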

LPC coefficients can also be converted into numbers that approximate the shape of the vocal tract in terms of a number of concatenated tubular sections (Figure 5). Computer-implemented tube models of the vocal tract have been vital in understanding the relationship between articulation and the resulting acoustics. The model presented by Mermelstein (1973) calculated the formant frequencies from a specified vocal tract shape and then used those frequencies to control a (non-articulatory) formant synthesiser. Later, Maeda (1981), Meyer and Strube (1984) and Sondhi and Schroeter (1987) made substantial improvements in articulatory modelling, building on the pioneering work of Kelly and Lochbaum (1962). Whereas any tubular shape will resonate in a determinate manner, the inverse is not true: a given sound may be produced using various different tube-shapes, which is how ventriloquists can produce sounds resembling labials without moving their lips. A significant development in vocal tract modelling was the “Distinctive Regions and Modes” theory of Mrayati, Carré and Guérin (1988). Rather than using tube sections of equal length, Mrayati et al. deduced a division of the vocal tract into eight unequal lengths, for three formants. These eight sections, they argue, correspond to eight natural places of articulation for consonants: (my labels) labial, dental-alveolar, (pre-)palatal, velar, uvular-pharyngeal, pharyngeal-epiglottal, subepiglottal, and laryngeal. A constriction in one region raises or lowers each formant to some degree; Mrayati et al. showed that the same acoustic effect can be obtained by expanding the symmetrically corresponding region of the vocal tract; for example, the raised F1 of [a], usually achieved by opening the oral cavity wide, can instead be achieved even with clenched teeth, by retracting the tongue in the throat more than usual. This symmetry principle, they argue, explains why the larynx is lower for rounded vowels: larynx depression enlarges the larynx section, which counteracts the narrowing of the labial section, maintaining the geometric symmetry of the vocal tract and keeping tongue movements unchanged. Note that lip rounding and larynx lowering are entirely unconnected in conventional articulatory descriptions. Their theory also predicts that lip-rounding, velarization and pharyngealization have a similar acoustic effect: lowering F3 (“flatness”, in the acoustic taxonomy of e.g. Jakobson et al. 1952).

Figure 5. Approximation to the vocal tract shape using a number of concatenated cylindrical sections, with cross-sectional areas calculated from LPC coefficients.

The quality and naturalness of synthetic speech depends not only on its model of vocal tract resonances, but also on its model of the sound produced in the larynx. Numerical models of the glottal sound source (e.g. Ishizaka and Flanagan 1972) therefore warrant mention here, as do improved methods for inferring the glottal sound wave by “inverse filtering” (e.g. Rothenberg 1973), that is, factoring out the formant frequencies from the sound coming out of the mouth in order to reconstruct the waveform at the glottis. The linguistic utility of this technique is illustrated by Gobl and Ní Chasaide (1992), for example, who used inverse filtering to extract a variety of different phonation types/voice qualities with which to control a synthetic voice model, which they then used in perception experiments. The most significant advance in our understanding of the phonological implementation of voice contrasts, however, is Lisker and Abramson's work on the perception of voice onset time, which I shall discuss in the next section.

4. Speech Perception

From the 1950s into this period, the overarching challenge of speech perception research was to reconcile the continuity and variability of observed speech with the theoretical discreteness and invariance of mental representations of words and their constituent units (phonemes, features, etc.). The quest for invariants began by looking for the essential acoustic properties of speech signals using very pared-down synthetic, speech-like stimuli. Such acoustic properties were related to, or identified with, the elementary distinctive phonological features (Jakobson et al. 1952). Shortly after the invention of the sound spectrograph in the 1940's, researchers at Haskins Laboratories made one of the earliest speech synthesisers, the “Pattern Playback”. This had an electronic optical sensor that converted the dark areas on a spectrogram into the resonant frequencies of a synthesizer. As well as using it to “read” real speech spectrograms, the Haskins team conducted a classic series of experiments using hand-drawn, stylized pseudo-spectrograms, in order to find out which acoustic details of the spectrogram were important for the identification of speech sounds.

Context-sensitivity of perceptual cues At the start of a vowel coming after a consonant, the spectrogram reveals rapid acoustic changes as the vocal tract moves from a shape appropriate to the consonant to the shape needed for the vowel. Because these transitions are a short-lived and mechanical side-effect of the movement from the consonant to the vowel, they might be thought unimportant to perception. Cooper et al. (1952) discovered that this is not so. First, by controlling the frequency of the noise burst that occurs at the release of a plosive, they found that though high frequency bursts were always heard as [t], bursts at lower frequencies were heard as [p] or [k] in a context-sensitive manner: “the identification of the consonant depended, not solely on the frequency position of the burst of noise, but rather on this position in relation to the vowel.” Second, the direction of the formant frequency changes enables

listeners to perceive the voicing and place of articulation of consonants. The trajectory of the first formant transition cues the voicing distinction, and that of the second formant transition relates to place of articulation, depending on the vowel. Delattre et al. (1955) studied that relationship systematically. They found three places of articulation (bilabial, alveolar and velar) perceptually related to three frequency regions, or acoustic loci (Figure 6). The F2 patterns of [b] and [g] are reasonably similar in all vowel contexts, rising for bilabials and falling for velars. For alveolars, however, the F2 pattern is quite different in different contexts: F2 is rising in [di] and [dɛ], but falling in [da], [dɔ], [do] and [du] (Figure 6a). Because opposite transitions encode the same place of articulation, the Haskins researchers became sceptical about finding stable (invariant) acoustic cues for phonological units, and argued instead for a motor theory of speech perception, in which the underlying articulatory events were seen as the units of perception (Liberman et al. 1967, Liberman and Mattingly 1985). It was a debatable inference, however, as a relationally invariant F2 locus for [d], fixed at the F2 of [ɛ], can explain why F2 rises or falls according to the following vowel: see, for example, the group of papers by Lindblom (1996), Stevens (1996), Ohala (1996) and Fowler (1996).

Figure 6. (a) Synthetic spectrograms showing second-formant transitions that produce the voiced stops before various vowels. (From Delattre et al. 1955) (b) Stimulus patterns (shown schematically) and identifications with and without a silent interval between the second-formant locus and the onset of the transition. (A) Second-formant transitions that originate at the d locus and go to various steady-state levels, together with the first formant with which each was paired. (B) The same patterns, except that a silent interval of 50 ms has been introduced between the locus and the start of the transition. (From Delattre et al. 1955)

Lindblom (1996) addressed the problem that the F2 and F3 ranges for /b/, /d/ and /g/ overlap and are dependent on the following vowel. Whereas the earlier Haskins work looked at acoustic patterns in only two dimensions (time vs. frequency), Lindblom plotted values of F2 and F3 at the end of the consonant and during the vowel on a three-dimensional plot (Figure 7), noting: 'The three “clouds” of Fig. [7] enclose the measurements from all the test words. Coarticulation is responsible for their elongated shapes. ... the three configurations do not overlap! … They evidently meet the condition of being “sufficiently distinct”.'

Figure 7 (Lindblom 1996). At the top, a stylized spectrogram to define the measurements, from Öhman (1966): the onset of the F2 transition at the C–V2 boundary (plotted in the main figure along the horizontal axis), the F3 transition onset at the C–V2 boundary (vertical axis) and the F2 value at the V2 steady state (depth axis).

Liberman and co-workers' proposal that humans have a separate perceptual mechanism specifically adapted for speech, distinct from the perception of non-speech environmental sounds, gained support from experiments with speech synthesized using sine-waves (Remez et al. 1981). To a naïve listener, sine-wave speech sounds completely unlike speech, more like an electronic warbling or whistling. But if a listener is cued with either the original human speech from which the sine-waves were generated, or even just the text of that speech, the listener's perception suddenly switches into a quite different mode, in which the formerly mysterious whistling sound becomes highly intelligible as speech. For an on-line demonstration, see Davis (2007).

Categorical perception Liberman et al. (1957) also discovered that synthetic speech stimuli resembling consonant-vowel sequences (Figure 8a) are perceived in terms of discrete categories, rather than as a continuum of fine distinctions. The three hallmarks of such “categorical” perception are: (i) In the identification function (Figure 8b), there are regions on the stimulus continuum where variation makes no difference to subjects' labelling of the stimuli. Thus, stimuli 1 and 2 are both labelled as /b/, 5, 6, 7, and 8 are labelled as /d/ and 11, 12, 13, and 14 as /g/. (ii) In other regions of the continuum (Figure 8b, stimuli 3, 4, 9 and 10), subjects are more variable (uncertain) in their labelling. Between 3 and 4, and between 9 and 10, subjects are essentially guessing which consonant they are hearing. These points of ambiguity lie at category boundaries. (iii) The discrimination function (Figure 8c) plots how accurately subjects can tell the difference between adjacent pairs of stimuli. It shows that subjects are very poor at discriminating between stimulus pairs within a phonological category (they get c. 50% correct: just guessing), but are more accurate at discriminating pairs of stimuli that lie either side of the category boundary (3-4 and 9-10).

Figure 8. Stimuli and responses from pioneering studies of categorical perception (panels a–e).
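The identification function can be summarized numerically by fitting a sigmoid to the labelling proportions and taking the 50% crossover as the category boundary. The sketch below uses invented response proportions (not Liberman et al.'s data) purely to illustrate the procedure.

```python
import numpy as np
from scipy.optimize import curve_fit

stimulus = np.arange(1, 15)   # steps along the /b/-/d/-/g/ continuum
# Invented proportions of /d/ responses at each step (cf. the description of Figure 8b).
prop_d = np.array([.02, .03, .20, .70, .97, .98, .99, .97, .80, .30, .05, .03, .02, .01])

def sigmoid(x, boundary, slope):
    return 1.0 / (1.0 + np.exp(-slope * (x - boundary)))

# Fit only the rising /b/-to-/d/ flank (steps 1-7) for simplicity.
(boundary, slope), _ = curve_fit(sigmoid, stimulus[:7], prop_d[:7], p0=[3.5, 2.0])
print(f"estimated /b/-/d/ category boundary: step {boundary:.2f}")
```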


Voice onset time and rise time (abrupt vs. gradual onset of frication or voicing amplitude, as in /tʃ/ vs. /ʃ/, or /b/ vs. /w/) are also perceived categorically (Cutting and Rosner 1974, Liberman et al. 1956). Lisker and Abramson (1964) identified Voice Onset Time (VOT) as an acoustic feature of phonation contrasts in various languages: voicing that starts before the consonant release (negative VOT) is characteristic of fully voiced consonants, such as voiced sonorants, and plosives in Spanish (not English); voice onset at the same time as the consonant release (VOT = 0 ms) is typical of voiceless unaspirated stops [p t k]; whereas VOT 20-25 ms or more after the consonant release is characteristic of voiceless aspirated consonants [pʰ tʰ kʰ]. To study whether (or how) listeners use VOT differences in the perception of phonological phonation contrasts, Lisker and Abramson (1970) synthesized consonant + /a/ syllables, with 37 different voice onset times ranging from 150 ms before the stop release to 150 ms after it. Their Spanish subjects showed a two-way perceptual contrast between prevoiced [b d g] and voiceless unaspirated [p t k]. For /b/ vs. /p/, their English subjects perceptually divide the VOT continuum into two, unaspirated [p] vs. aspirated [pʰ] (Figure 8d), and similarly for /d/ vs. /t/ (phonetically, [t] vs. [tʰ]) and /g/ vs. /k/ ([k] vs. [kʰ]). Thai listeners, however, perceptually divide the same VOT continuum into three categories for labial plosives: [b] vs. [p] vs. [pʰ] (Figure 8e). Eimas et al. (1971) strikingly discovered that even 1- to 4-month-old infants perceive the VOT continuum categorically, like adults, according to the categories of the ambient language.

Categorical perception is sometimes rather loosely taken as evidence for phoneme categories in general, especially in textbooks and popular science articles. For example, Eimas (1985) summarizes the Haskins place-of-articulation experiments and Lisker and Abramson's VOT perception results, and then asserts the more general claim that “in the perception of speech we are ordinarily aware of discrete phonemic categories”. Kenstowicz (1994) teaches that “the perception of speech sounds is categorial in nature” (p. 185) and “speech sounds are interpreted as exemplars of a given category” (p. 186). But categorical perception cannot be equated with phoneme categories, as some phonological contrasts are not perceived categorically, even by adults. The identification function of the English vowels /ɪ/ vs. /ɛ/ vs. /æ/ resembles Figure 8b, but the discrimination function is almost flat, showing that listeners do not discriminate vowel qualities categorically, and discriminate quite well between vowel sounds within the same phonemic category (Fry et al. 1962). Also, certain intonational contrasts (e.g. emphatic vs. non-emphatic accents) are not perceived categorically (Ladd and Morton 1997).5 The most severe challenge to the equation “categorical perception = phoneme categories” comes from animal studies showing that e.g. chinchillas perceive VOT continua categorically (Kuhl and Miller 1975).

If discrete regions in the identification function are considered a sufficient diagnostic for discrete phonemic categories, ignoring sharp discrimination between categories, we arrive at a prototype theory of phonological categories (Nathan 1986), in which well-identified stimuli nearer the centre of the category (e.g. 5, 6, 7, 8 in Figure 8d) are prototypical exemplars, whereas borderline stimuli (e.g. 3, 4, 9, 10 in Figure 8d) are nonprototypical, ambiguous exemplars. Kuhl (1991) found that adult and infant humans perceive vowel stimuli on an /i/–/ɪ/ continuum in this way, preferring prototypical over non-prototypical exemplars: the perceptual magnet effect. The effect was not found in Rhesus macaques, leading Kuhl (1992) to hypothesize that the ability to partition the acoustic space into categories is innate to humans, but tuned by exposure to ambient languages in early infancy. However, Kluender et al. (1998) found that Kuhl's (1991) result is not limited to human listeners: European starlings (Sturnus vulgaris) respond in a similar way. Furthermore, Kluender et al. found that a simple linear association model trained with vowels from the starlings' training set explains the data well.

Variability and the search for invariance For most of the period, speech perception research was guided by a quest to define acoustic configurations that would be fixed, reliable guides to phonological units, invariant across speakers, and separable from the residual variability and noise (e.g. Jakobson et al. 1952). Miller and Nicely's (1955) study of the perception of consonants in background noise is an early example of this quest. By the beginning of the 1980's, the paired questions “what are the invariant units of speech processes?” and “indeed, are there any?” were central to phonetic research.
In October 1983 a symposium on “invariance and variability in speech processes” was held at MIT, attended by researchers from a wide variety of disciplines concerned with speech. The papers in the proceedings, Perkell and Klatt (1986), address questions which dominated the field until the end of the millennium. Rather in the tradition of Liberman et al. (1957), Stevens and Blumstein (1978) continued the search for stable cues to place of articulation. Stevens was advancing his theory of the “quantal nature of speech” (Stevens 1989), explaining data such as that in Figure 8b by observing that in certain vocal

5 Whether lexical tonal distinctions are perceived categorically is still an open question (Yang 2010).

tract regions, small changes in the position of a constriction have little perceptible effect, whereas in other regions, an equal amount of movement is perceptually very salient. For example, tokens of the consonant [s] produced with the tongue tip in a more advanced or more retracted position are all fairly similar, and may all be heard as /s/. However, when the tongue tip is moved by a small amount from the alveolar position of [s] backwards a little to the postalveolar position, the difference between [s] and [ʃ] is obvious. On another phonetic dimension, Figure 9 illustrates the nonlinear relationship between constriction degree and noise intensity that underlies a division of aperture size into three natural degrees of stricture: stops, fricatives and sonorants. A 5 mm² change in aperture in the region 0-5 mm² or 18-23 mm² has a great effect on the intensity of frication noise, whereas larger changes in the region 8-18 mm² or above 25 mm² have little or no effect on frication noise intensity. Such nonlinear relationships between production and acoustics, like the nonlinear relationships between acoustics and perception, support a natural 'quantization' of phonetic parameters into distinct categories.

Figure 9. The quantal nature of stricture. (From Clements and Ridouane 2006)

The search for invariants was fueled by research into speech recognition technology, and a realisation that the then-prevailing methods were not progressing well. Many speech recognition models were built upon acoustic classifiers which labelled portions of an audio signal with phonetic features, such as 'VOWEL', 'SILENCE', 'VOICED', 'ACUTE', etc., analysing the signal in a bottom-up fashion, from acoustic parameters to phonetic features to phones to words. (Zue 1985 is a canonical example of this approach, and Rabiner and Juang 1993: 46-50 a tutorial overview.) But however intuitively logical this approach to speech recognition appears, it suffers from several problems. By focussing on invariant distinctive features and disregarding 'redundant' variability, we discard information; we neglect, for example, the fact that variation in speech may itself be informative. Voicing and place of articulation are not cued by completely orthogonal acoustic dimensions: even Lisker and Abramson (1964) knew that the voiced/voiceless VOT boundary is earlier for labials than for dentals, and later for velars. Oden and Massaro (1978) showed that listeners exploit this interaction of place and voicing in perception, and that there may be trading relations between multiple cues to a phonological contrast. Instead of seeking independent cues to each phonological feature, they proposed a 'Fuzzy Logical Model of Phoneme Identification' which, as the name implies, uses a probabilistic logic to model the combination of cues.

Ganong (1980) discovered that the location of the voiced vs. voiceless category boundary is not fixed at a certain point along the VOT continuum, as e.g. Figure 8 implies. For a continuum in which one end is a word (e.g. dash) and the other end a nonword (e.g. tash), the voiced/voiceless category boundary is nearer to the voiced end of the continuum, at shorter VOTs, than in nonword/word contrasts such as dask – task. In other words, the location of the phonological category boundary is not acoustically fixed, but depends upon the lexical status of the stimuli. This

Ganong (1980) discovered that the location of the voiced vs. voiceless category boundary is not fixed at a certain point along the VOT continuum, as e.g. Figure 8 implies. For a continuum in which one end is a word (e.g. dash) and the other end a nonword (e.g. tash), the voiced/voiceless category boundary lies nearer to the voiced end of the continuum, at shorter VOTs, than in nonword/word continua such as dask – task. In other words, the location of the phonological category boundary is not acoustically fixed, but depends upon the lexical status of the stimuli. Nor is this bias limited to word vs. nonword stimuli: Connine et al. (1993) showed that the relative frequency of the words at the ends of an acoustic continuum also affects the phonological category boundary. Since cues to phonological features are not readily separable, and interact in complex ways, a move towards more holistic models of speech recognition was inevitable. In addition to Oden and Massaro's Fuzzy Logical Model, two models of word recognition from acoustic properties deserve mention: LAFS (“Lexical Access From Spectra”: Klatt 1979) and TRACE, a connectionist model proposed by McClelland and Elman (1986). Additional evidence that human speech perception integrates multiple cues and combines top-down expectations with bottom-up acoustic processing came from McGurk and MacDonald's (1976) discovery that visual information informs auditory perception (the 'McGurk effect'), and from Warren's (1970) discovery that phonemes excised from speech recordings are nevertheless 'heard' by the listener (the 'phoneme restoration effect'). By the 1980's it was also becoming evident that more holistic approaches to speech recognition, based on pattern-matching stimuli against stored examples of words from a training set, simply perform better than 'feature detectors'. Reputedly tiring of experts, Jelinek and colleagues at the IBM T. J. Watson Research Center pioneered a statistical, data-driven approach to automatic speech recognition (e.g. Bahl et al. 1983) that is now the dominant technology in that field. Though far from perfect, their approach prevails because it outperforms other methods, and it has profoundly influenced computational linguistics more widely, including part-of-speech tagging, dialogue modelling, and machine translation. The statistical, pattern-matching approach addresses the problem of variability in speech and the complex interaction of cues in a rather brute-force manner: by simply memorizing numerous examples of pronunciation variants. Such an approach had been discounted in earlier decades on the grounds that (a) speech recognition is fast, and so allows no time to search through a long list of stored variants, and (b) storage of redundant information is in some sense cognitively 'costly'. Both of these objections have since been reconsidered, and are rejected outright by some. Arguing from the gradual and continuous nature of sound change, Bybee Hooper (1981) presented evidence that phonological representations in the mental lexicon are rich in phonetic detail. Later, this idea gained traction within “Laboratory Phonology” (e.g. Johnson 1997), to which I now turn.
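
As a toy illustration of the pattern-matching idea (and emphatically not of Bahl et al.'s hidden-Markov-model machinery), the sketch below labels an incoming token with the word of its nearest stored exemplar, using dynamic time warping so that exemplars of different lengths and tempos can be compared. The one-dimensional 'feature' values are invented, standing in for the spectral vectors a real system would use.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences
    (plain floats here; a real recognizer would compare spectral vectors)."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def recognise(token, exemplars):
    """Return the word label of the stored exemplar nearest to the token."""
    best_word, _ = min(exemplars, key=lambda e: dtw_distance(token, e[1]))
    return best_word

# Invented exemplars: several memorized pronunciations per word, differing
# in tempo and detail, with no reduction to invariant features.
exemplars = [
    ("dash", [1.0, 1.2, 3.0, 3.1]),
    ("dash", [1.1, 1.1, 1.2, 3.0, 3.2]),
    ("task", [2.0, 2.1, 0.5, 0.4, 0.6]),
]
print(recognise([1.0, 1.1, 3.1], exemplars))  # -> dash
```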

5. Experimental Phonology: putting theory to the test

Since the 1930's there has been something of a sociological rift between theoretical phonologists and experimental phoneticians. From the 1970's onwards, John J. Ohala and colleagues at the Berkeley Phonology Laboratory pressed the arguments for phonology to adopt an experimental approach (Ohala 1974, 1986, 1992, 1995; Ohala and Jaeger 1986). The Laboratory Phonology (LabPhon) conferences, held at 2-3 year intervals from 1987 onwards, have aimed to repair that rift (Pierrehumbert, Beckman and Ladd 2000). They advanced research on themes including timing; syllables; articulatory correlates of prosody; Articulatory Phonology; coarticulation; (under)specification, covert contrast, and neutralization; mental lexical representations; and probabilistic phonology and exemplar phonology. LabPhon contributions built up a growing body of knowledge on the temporal synchronisation of intonational accents with segments, syllables, and phrase boundaries (e.g. Beckman and Edwards 1990, Beckman, Edwards and Fletcher 1992, Jun 1995). Studies of covert contrast, neutralization and the phonological relevance of fine phonetic differences continued earlier work establishing that final devoicing in Polish, German and Catalan is not completely neutralizing, as small phonetic differences other than voicing preserve the word-final phonological 'voiced' vs. 'voiceless' contrast (see Dinnsen 1985 for a review). A "devoiced" /d/ is not identical to a /t/, and a phonological 'devoicing' rule such as [+voice] → [-voice] / _ # is not correct; instead, devoicing should be understood in terms of continuous phonetic parameters, such as VOT and vowel duration. Local (1992) and Nolan (1992; see also section 2.3 above) showed that place of articulation assimilation is non-neutralizing in English; rules such as [s] → [ʃ] / _ [ʃ], or [t] → [k] / _ [DORSAL], are incorrect, as the altered [s] or [t] often remains distinct from lexical [ʃ] or [k]. Peng (2000) found that although Mandarin listeners cannot differentiate the high-rising lexical tone 2 from the high-rising sandhi form of tone 3, speakers maintain a small f0 difference between tone 2 and sandhi tone 3, demonstrating that the categories are not completely neutralized: Mandarin speaker-hearers tacitly 'know' that the two forms are distinct, even though the difference between them is barely perceptible. Scobbie et al. (2000) showed that normally-developing children produce 'covert' phonological contrasts that are not usually noted in IPA transcriptions; see also Macken and Barton (1980) and Kelly and Local (1989: 242-262).

Evidence that speakers produce very fine phonetic differences, to which listeners are sometimes sensitive, is also consistent with the arguments presented by Johnson (1997), Goldinger (1997) and Bybee (2000) that mental phonological representations are not abstract symbols, but are rich in phonetic detail, memorized from experience. These exemplar-based approaches to phonology focus on experience and usage, especially usage frequency; that emphasis is pleasingly consistent with the probabilistic turn in Laboratory Phonology arising from lexical frequency effects in phonetic interpretation (Beckman and Edwards 2000, Newman et al. 2000), and from demonstrations that phonotactic grammaticality judgements are best modelled probabilistically, not as a binary 'grammatical/ungrammatical' contrast (Pierrehumbert 1994, Coleman and Pierrehumbert 1997), leading to a burgeoning literature in the new century.
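
The probabilistic treatment of phonotactics can be illustrated with a toy model, far cruder than Coleman and Pierrehumbert's actual onset-rhyme grammar and trained here on an invented mini-lexicon: transition probabilities between segments are estimated from a word list, and a novel string is scored by the product of its transitions, so that acceptability comes out as a gradient quantity rather than a binary judgement.

```python
from collections import defaultdict

def train_bigrams(words):
    """Estimate segment-to-segment transition probabilities from a (toy)
    training lexicon, with '#' marking word edges."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        segs = ["#"] + list(w) + ["#"]
        for a, b in zip(segs, segs[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(following.values()) for b, n in following.items()}
            for a, following in counts.items()}

def phonotactic_score(word, probs, floor=1e-6):
    """Product of transition probabilities; unseen transitions receive a
    small floor value rather than categorical rejection."""
    p = 1.0
    segs = ["#"] + list(word) + ["#"]
    for a, b in zip(segs, segs[1:]):
        p *= probs.get(a, {}).get(b, floor)
    return p

probs = train_bigrams(["blik", "brik", "stik", "slip"])  # invented mini-lexicon
print(phonotactic_score("blip", probs))  # well-formed nonword: about 0.06
print(phonotactic_score("bnik", probs))  # ill-formed nonword: vanishingly small
```

Scored this way, blip comes out many orders of magnitude more acceptable than bnik, mirroring the gradient well-formedness judgements that listeners give.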


References

Abercrombie, D. (1980) Fifty Years: A Memoir. Work in Progress 13. University of Edinburgh. Reprinted in Forum Linguisticum 5 (2), and in D. Abercrombie (1991) Fifty Years in Phonetics. Edinburgh University Press. 1-11.
Arai, T. (2001) The replication of Chiba and Kajiyama's mechanical models of the human vocal cavity. Journal of the Phonetic Society of Japan 5 (2), 31-38.
Atal, B. and S. Hanauer (1971) Speech analysis and synthesis by linear prediction of the speech wave. Journal of the Acoustical Society of America 50 (2), 637-655.
Baer, T., J. C. Gore, S. Boyce and P. W. Nye (1987) Application of MRI to the analysis of speech production. Magnetic Resonance Imaging 5, 1-7.
Bahl, L. R., F. Jelinek and R. L. Mercer (1983) A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-5 (2), 179-190.
Beckman, M. E. and J. Edwards (1990) Lengthenings and shortenings and the nature of prosodic constituency. In J. Kingston and M. E. Beckman, eds. Papers in Laboratory Phonology I: Between the Grammar and Physics of Speech. Cambridge University Press. 152-200.
Beckman, M. E. and J. Edwards (1994) Articulatory evidence for differentiating stress categories. In P. A. Keating, ed. Phonological structure and phonetic form: Papers in Laboratory Phonology III. Cambridge University Press. 7-33.

Beckman, M. E. and J. Edwards (2000) Lexical frequency effects on young children's imitative productions. In M. B. Broe and J. B. Pierrehumbert, eds. Papers in Laboratory Phonology V: Acquisition and the Lexicon. Cambridge University Press. 208-218.
Beckman, M. E., J. Edwards and J. Fletcher (1992) Prosodic structure and tempo in a sonority model of articulatory dynamics. In G. J. Docherty and D. R. Ladd, eds. Papers in Laboratory Phonology II: Gesture, Segment, Prosody. Cambridge University Press. 68-86.
Bell, A. and J. Hooper, eds. (1978) Syllables and Segments. Amsterdam: North Holland.
Blackburn, C. S. and S. Young (2000) A self-learning predictive model of articulator movements during speech production. Journal of the Acoustical Society of America 107, 1659-1670.
Browman, C. P. and L. Goldstein (1985) Dynamic modeling of phonetic structure. In V. A. Fromkin, ed. Phonetic Linguistics: Essays in Honor of Peter Ladefoged. Academic Press. 35-53.
Browman, C. P. and L. Goldstein (1990) Tiers in articulatory phonology, with some implications for casual speech. In J. Kingston and M. Beckman, eds. Papers in Laboratory Phonology I: Between the Grammar and Physics of Speech. Cambridge University Press. 341-376.
Browman, C. P. and L. Goldstein (1992) “Targetless” schwa: an articulatory analysis. In G. J. Docherty and D. R. Ladd, eds. Papers in Laboratory Phonology II: Gesture, Segment, Prosody. Cambridge University Press.
Browman, C. P., L. Goldstein, J. A. S. Kelso, P. Rubin and E. Saltzman (1984) Articulatory synthesis from underlying dynamics. Journal of the Acoustical Society of America 75, S22-S23 [Abstract].
Bybee Hooper, J. (1981) The empirical determination of phonological representations. In T. Myers et al., eds. The Cognitive Representation of Speech. 347-357.
Bybee, J. (2000) Lexicalization of sound change and alternating environments. In M. B. Broe and J. B. Pierrehumbert, eds. Papers in Laboratory Phonology V: Acquisition and the Lexicon. Cambridge University Press. 250-268.
Chen, M. (1970) Vowel length variation as a function of the voicing of the consonant environment. Phonetica 22, 129-159.
Chiba, T. and M. Kajiyama (1941) The Vowel: Its Nature and Structure. Tokyo: Tokyo-Kaiseikan.
Clements, G. N. and S. R. Hertz (1996) An integrated model of phonetic representation in grammar. Working Papers of the Cornell Phonetics Laboratory 11, 43-116. http://conf.ling.cornell.edu/plab/CPLWorkingPapers/pdfs/wpcpl11.pdf
Clements, G. N. and R. Ridouane (2006) Quantal Phonetics and Distinctive Features: a Review. Proceedings of the ISCA Tutorial and Research Workshop on Experimental Linguistics, 28-30 August 2006, Athens, Greece.
Cohn, A. C. (1990) Phonetic and phonological rules of nasalization. UCLA Working Papers in Phonetics 76. University of California PhD dissertation.
Coker, C. H. (1976) A model of articulatory dynamics and control. Proceedings of the IEEE 64 (4), 452-460.
Coleman, J. S. (1994) Polysyllabic words in the YorkTalk synthesis system. In P. A. Keating, ed. Phonological structure and phonetic form: Papers in Laboratory Phonology III. Cambridge University Press. 293-324.
Coleman, J. S. (2003) Discovering the acoustic correlates of phonological contrasts. Journal of Phonetics 31, 351-372.

Coleman, J. S. and J. Pierrehumbert (1997) Stochastic Phonological Grammars and Acceptability. In Computational Phonology. Third Meeting of the ACL Special Interest Group in Computational Phonology. Somerset, NJ: Association for Computational Linguistics. 49-56.
Connine, C. M., D. Titone and J. Wang (1993) Auditory word recognition: extrinsic and intrinsic effects of word frequency. Journal of Experimental Psychology: Learning, Memory and Cognition 19, 81-94.
Cooley, J. W. and J. W. Tukey (1965) An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation 19, 297-301.
Cooper, F. S., P. C. Delattre, A. M. Liberman, J. M. Borst and L. J. Gerstman (1952) Some Experiments on the Perception of Synthetic Speech Sounds. Journal of the Acoustical Society of America 24 (6), 597-606.
Cutting, J. E. and B. S. Rosner (1974) Categories and boundaries in speech and music. Perception and Psychophysics 16, 564-570.
Davis, M. (2007) An introduction to sine-wave speech. MRC Cognition and Brain Sciences Unit, Cambridge. http://www.mrc-cbu.cam.ac.uk/people/matt.davis/sine-wave-speech/ Accessed 25/3/11.
Dedo, H. H. and E. Dunker (1967) Husson's Theory: An Experimental Analysis of His Research Data and Conclusions. Archives of Otolaryngology 85 (3), 303-313.
Delattre, P. C., A. M. Liberman and F. S. Cooper (1955) Acoustic Loci and Transitional Cues for Consonants. Journal of the Acoustical Society of America 27 (4), 769-773.
Demolin, D. (1995) The phonetics and phonology of glottalized consonants in Lendu. In B. Connell and A. Arvaniti, eds. Phonology and Phonetic Evidence: Papers in Laboratory Phonology IV. Cambridge University Press. 368-385.
Dinnsen, D. A. (1985) A re-examination of phonological neutralization. Journal of Linguistics 21, 265-279.
Draper, M. H., P. Ladefoged and D. Whitteridge (1959) Respiratory muscles in speech. Journal of Speech and Hearing Research 2, 16-27.
Dunn, H. K. (1950) The calculation of vowel resonances, and an electrical vocal tract. Journal of the Acoustical Society of America 22, 740-753.
Edmondson, J. A. and J. H. Esling (2006) The valves of the throat and their functioning in tone, vocal register, and stress: laryngoscopic case studies. Phonology 23 (2), 157-191.
Eimas, P., E. Siqueland, P. Jusczyk and J. Vigorito (1971) Speech perception in infants. Science 171, 303-306.
Eimas, P. D. (1985) The perception of speech in early infancy. Scientific American 252 (1), 34-40.
Esling, J. (1984) Laryngographic study of phonation type and laryngeal configuration. Journal of the International Phonetic Association 14, 56-73.
Fant, [C.] G. [M.] (1960) Acoustic theory of speech production with calculations based on X-ray studies of Russian articulations. The Hague: Mouton.
Firth, J. R. (2006) The phonetic structure of a Cypriotic dialect. Transactions of the Philological Society 104 (3), 319-329.
Firth, J. R. and H. J. F. Adam (1950) Improved techniques in palatography and kymography. Bulletin of the School of Oriental and African Studies 13 (3), 771-774.
Folkins, J. W. and J. H. Abbs (1975) Lip and jaw motor control during speech: responses to resistive loading of the jaw. Journal of Speech and Hearing Research 18, 207-220.

Fourcin, A. (1974) Laryngographic examination of vocal fold vibration. In B. Wyke, ed. Ventilatory and phonatory control systems. Oxford University Press. 315-333.
Fowler, C. A. (1980) Coarticulation and theories of extrinsic timing. Journal of Phonetics 8, 113-133.
Fowler, C. A. (1996) Listeners do hear sounds, not tongues. Journal of the Acoustical Society of America 99 (3), 1730-1741.
Fry, D. B., A. S. Abramson, P. D. Eimas and A. M. Liberman (1962) The identification and discrimination of synthetic vowels. Language and Speech 5, 171-189.
Fujimura, O., S. Kiritani and H. Ishida (1973) Computer controlled radiography for observation of articulatory and other human organs. Computers in Biology and Medicine 3, 371-384.
Ganong III, W. F. (1980) Phonetic categorization in auditory word perception. Journal of Experimental Psychology: Human Perception and Performance 6, 110-125.
Gobl, C. and A. Ní Chasaide (1992) Acoustic characteristics of voice quality. Speech Communication 11, 481-490.
Goldinger, S. D. (1997) Words and voices: perception and production in an episodic lexicon. In K. Johnson and J. W. Mullennix, eds. Talker Variability in Speech Processing. 33-66.
Goldinger, S. D. and T. Azuma (2003) Puzzle-solving science: the quixotic quest for units in speech perception. Journal of Phonetics 31, 305-320.
Grabe, E. and P. Warren (1995) Stress shift: do speakers do it or do listeners hear it? In B. Connell and A. Arvaniti, eds. Phonology and Phonetic Evidence: Papers in Laboratory Phonology IV. Cambridge University Press. 95-110.
Greenberg, J. H., C. A. Ferguson and E. A. Moravcsik, eds. (1978) Universals of Human Language. Volume II: Phonology. Stanford University Press.
Gussenhoven, C. (1991) The English Rhythm Rule as an accent deletion rule. Phonology 8, 1-35.
Halle, M. (1983) On distinctive features and their articulatory implementation. Natural Language and Linguistic Theory 1, 91-105.
Harshman, R., P. Ladefoged and L. Goldstein (1977) Factor analysis of tongue shapes. Journal of the Acoustical Society of America 62 (3), 693-707.
't Hart, J., R. Collier and A. Cohen (1990) A perceptual study of intonation: an experimental-phonetic approach to speech melody. Cambridge University Press.
Hay, J. and K. Drager (2010) Stuffed toys and speech perception. Linguistics 48 (4), 865-892.
Heid, S. and S. Hawkins (2000) An acoustical study of long domain /r/ and /l/ coarticulation. Proceedings of the fifth seminar on speech production: models and data, and CREST workshop on models of speech production: motor planning and articulatory modelling. Munich: Institut für Phonetik und Sprachliche Kommunikation. 77-80.
Henke, W. (1966) Dynamic articulatory model of speech production using computer simulation. MIT PhD dissertation.
Hertz, S. R. (1990) The Delta programming language: an integrated approach to nonlinear phonology, phonetics, and speech synthesis. In J. Kingston and M. Beckman, eds. Papers in Laboratory Phonology I: Between the Grammar and Physics of Speech. Cambridge University Press. 215-257.

Hirano, M. and J. Ohala (1969) Use of hooked-wire electrodes for electromyography of the intrinsic laryngeal muscles. Journal of Speech and Hearing Research 12, 362-373.
Hirose, H. and T. Gay (1972) The activity of the intrinsic laryngeal muscles in voicing control: an electromyographic study. Phonetica 25, 140-164.
Holmes, J. N. (1973) The influence of glottal waveform on the naturalness of speech from a parallel formant synthesizer. IEEE Transactions on Audio and Electroacoustics AU-21 (3), 298-305.
Holmes, J. N., I. G. Mattingly and J. N. Shearme (1964) Speech synthesis by rule. Language and Speech 7, 127-143.
International Phonetic Association (1949) The Principles of the International Phonetic Association. London: University College.
Ishizaka, K. and J. L. Flanagan (1972) Synthesis of voiced sounds from a two-mass model of the vocal cords. Bell System Technical Journal 51 (6), 1233-1268.
Jakobson, R. (1951) For the correct presentation of phonemic problems. Symposium V. Reprinted in R. Jakobson (1962) Selected Writings I: Phonological Studies. The Hague: Mouton. 435-442.
Jakobson, R., C. G. M. Fant and M. Halle (1952) Preliminaries to Speech Analysis: the distinctive features and their correlates. MIT Press.
Johnson, K. (1997) Speech perception without speaker normalization. In K. Johnson and J. W. Mullennix, eds. Talker Variability in Speech Processing. 145-165.
Jun, S.-A. (1995) Asymmetrical prosodic effects on the laryngeal gesture in Korean. In B. Connell and A. Arvaniti, eds. Phonology and Phonetic Evidence: Papers in Laboratory Phonology IV. Cambridge University Press. 235-253.
Kager, R. and E. Visch (1988) Metrical constituency and rhythmic adjustment. Phonology 5, 21-72.
Keating, P. A. (1990) The window model of coarticulation: articulatory evidence. In J. Kingston and M. E. Beckman, eds. Papers in Laboratory Phonology I: Between the Grammar and Physics of Speech. 451-470.
Kelly, J. L. and C. C. Lochbaum (1962) Speech Synthesis. Proceedings of the 4th International Congress of Acoustics, Paper G-42. Reprinted in J. L. Flanagan and L. R. Rabiner, eds. Speech Synthesis. 127-130.
Kelly, J. and J. Local (1989) Doing Phonology. Manchester University Press.
Kelso, J. A. S., E. L. Saltzman and B. Tuller (1986) The dynamical perspective on speech production: data and theory. Journal of Phonetics 14, 29-59.
Klatt, D. H. (1979) Speech perception: a model of acoustic-phonetic analysis and lexical access. Journal of Phonetics 7, 279-312.
Klatt, D. H. (1980) Software for a cascade/parallel formant synthesizer. Journal of the Acoustical Society of America 67 (3), 971-995.
Kluender, K. R., A. J. Lotto and L. L. Holt (1998) Role of experience for language-specific functional mappings of vowel sounds. Journal of the Acoustical Society of America 104, 3568-3582.
Kozhevnikov, V. A. and L. A. Chistovich (1965) Speech: Articulation and Perception. English translation: U. S. Dept. of Commerce, Clearing House for Federal Scientific and Technical Information.

Kuhl, P. K. (1991) Human adults and human infants show a “perceptual magnet effect” for the prototypes of speech categories, monkeys do not. Perception and Psychophysics 50 (2), 93-107.
Kuhl, P. K. (1992) Infants' perception and representation of speech: development of a new theory. In J. J. Ohala, T. M. Nearey, B. L. Derwing, M. M. Hodge and G. E. Wiebe, eds. ICSLP 92 Proceedings: 1992 International Conference on Spoken Language Processing. Volume 1, 449-456.
Kuhl, P. K. and J. D. Miller (1975) Speech perception by the chinchilla: voiced-voiceless distinction in alveolar plosive consonants. Science 190, 69-72.
Ladd, D. R. and R. Morton (1997) The perception of intonational emphasis: continuous or categorical? Journal of Phonetics 25, 313-342.
Ladefoged, P. (1964) A Phonetic Study of West African Languages. Cambridge University Press.
Ladefoged, P. and A. Traill (1984) Linguistic phonetic description of clicks. Language 60, 1-20.
Ladefoged, P., J. L. de Clerk and R. Harshman (1972) Control of the tongue in vowels. In A. Rigault and R. Charbonneau, eds. Proceedings of the Seventh International Congress of Phonetic Sciences. The Hague: Mouton. 349-354.
Ladefoged, P. and I. Maddieson (1996) The Sounds of the World's Languages. Blackwell.
Lawrence, W. (1953) The synthesis of speech from signals which have a low information rate. In W. Jackson, ed. Communication Theory. London: Butterworths. 460-469.
Lehiste, I., ed. (1967) Readings in Acoustic Phonetics. MIT Press.
Liberman, A. M., P. C. Delattre, L. J. Gerstman and F. S. Cooper (1956) Tempo of frequency change as a cue for distinguishing classes of speech sounds. Journal of Experimental Psychology 52 (2), 127-137.
Liberman, A. M., F. S. Cooper, D. P. Shankweiler and M. Studdert-Kennedy (1967) Perception of the speech code. Psychological Review 74 (6), 431-461.
Liberman, A. M. and I. G. Mattingly (1985) The motor theory of speech perception revised. Cognition 21, 1-36.
Lieberman, P., M. Sawashima, K. S. Harris and T. Gay (1970) The articulatory implementation of the breath-group and prominence: crico-thyroid muscular activity in intonation. Language 46 (2), 312-327.
Liljencrants, J. (1967) The OVE III speech synthesiser. Speech Transmission Laboratory Quarterly Progress and Status Report 8 (2-3), 76-81. Stockholm: KTH (Royal Institute of Technology).
Lindau, M. (1978) Vowel features. Language 54 (3), 541-563.
Lindblom, B. (1996) Role of articulation in speech perception: clues from production. Journal of the Acoustical Society of America 99 (3), 1683-1692.
Lisker, L. and A. S. Abramson (1964) A cross-language study of voicing in initial stops: acoustical measurements. Word 20, 384-422.
Lisker, L. and A. S. Abramson (1970) The voicing dimension: some experiments in comparative phonetics. Proceedings of the Sixth International Congress of Phonetic Sciences, Prague, 1967.
Local, J. (1992) Modeling assimilation in nonsegmental, rule-free synthesis. In G. J. Docherty and D. R. Ladd, eds. Papers in Laboratory Phonology II: Gesture, Segment, Prosody. Cambridge University Press. 190-228.
Lubker, J. and T. Gay (1982) Anticipatory labial coarticulation: experimental, biological, and linguistic variables. Journal of the Acoustical Society of America 71 (2), 437-448.

Macken, M. A. and D. Barton (1980) The acquisition of the voicing contrast in English: a study of voice onset time in word-initial stop consonants. Journal of Child Language 7, 41-74.
Maddieson, I. (1984) Patterns of Sounds. Cambridge University Press.
Maeda, S. (1982) A digital simulation method of the vocal-tract system. Speech Communication 1 (3-4), 199-229.
Magen, H. S. (1997) The extent of vowel-to-vowel coarticulation in English. Journal of Phonetics 25, 187-205.
Manuel, S. (1995) Speakers nasalize /ð/ after /n/, but listeners still hear /ð/. Journal of Phonetics 23, 453-476.
McClelland, J. L. and J. L. Elman (1986) The TRACE model of speech perception. Cognitive Psychology 18, 1-86.
McGurk, H. and J. MacDonald (1976) Hearing lips and seeing voices. Nature 264, 746-748.
Mermelstein, P. (1973) Articulatory model for the study of speech production. Journal of the Acoustical Society of America 53 (4), 1070-1082.
Meyer, P. and H. W. Strube (1984) Calculations on the time varying vocal tract. Speech Communication 3 (2), 109-122.
Miller, G. A. and P. E. Nicely (1955) An analysis of perceptual confusions among some English consonants. Journal of the Acoustical Society of America 27, 338-352.
Mohanan, K. P. (1986) The Theory of Lexical Phonology.
Moll, K. L. and R. G. Daniloff (1971) Investigation of the timing of velar movements during speech. Journal of the Acoustical Society of America 50, 678-684.
Mowrey, R. A. and I. R. A. MacKay (1990) Phonological primitives: electromyographic speech error evidence. Journal of the Acoustical Society of America 88 (3), 1299-1312.
Mrayati, M., R. Carré and B. Guérin (1988) Distinctive regions and modes: a new theory of speech production. Speech Communication 7, 257-286.
Munhall, K. G. (2001) Functional imaging during speech production. Acta Psychologica 107, 95-117.
Nathan, G. S. (1986) Phonemes as mental categories. Proceedings of the Twelfth Annual Meeting of the Berkeley Linguistics Society, 212-223.

Newman, R., J. Sawusch and P. Luce (2000) Underspecification and phoneme frequency in speech perception. In M. B. Broe and J. B. Pierrehumbert, eds. Papers in Laboratory Phonology V: Acquisition and the Lexicon. Cambridge University Press. 299-312.
Nolan, F. (1992) The descriptive role of segments: evidence from assimilation. In G. J. Docherty and D. R. Ladd, eds. Papers in Laboratory Phonology II: Gesture, Segment, Prosody. Cambridge University Press. 261-280.
Oden, G. C. and D. W. Massaro (1978) Integration of featural information in speech perception. Psychological Review 85, 172-191.
Ohala, J. J. (1974) Experimental Historical Phonology. In II. Theory and Description in Phonology. Proceedings of the First International Conference on Historical Linguistics, Edinburgh, 2-7 September 1973. North Holland. 353-389.

Ohala, J. J. (1976) A model of speech aerodynamics. Report of the Phonology Laboratory, No. 1. University of California, Berkeley. 93-107.
Ohala, J. J. (1986) Consumer's guide to evidence in phonology. Phonology Yearbook 3, 3-26.
Ohala, J. J. (1990) Respiratory activity in speech. In W. J. Hardcastle and A. Marchal, eds. Speech Production and Speech Modelling. Dordrecht: Kluwer. 25-53.
Ohala, J. J. (1992) The costs and benefits of phonological analysis. In P. Downing, S. D. Lima and M. Noonan, eds. The Linguistics of Literacy. Amsterdam and Philadelphia: John Benjamins. 211-237.
Ohala, J. J. (1995) Experimental phonology. In J. A. Goldsmith, ed. A Handbook of Phonological Theory. Oxford: Blackwell. 713-722.
Ohala, J. J. (1996) Speech perception is hearing sounds, not tongues. Journal of the Acoustical Society of America 99 (3), 1718-1725.
Ohala, J. J. and J. J. Jaeger (1986) Introduction. In J. J. Ohala and J. J. Jaeger, eds. Experimental Phonology. Orlando, FL: Academic Press. 1-12.
Öhman, S. E. G. (1966) Coarticulation in VCV utterances: spectrographic measurements. Journal of the Acoustical Society of America 39, 151-168.
Painter, C. (1973) Cineradiographic data on the feature 'covered' in Twi vowel harmony. Phonetica 28, 97-120.
Peng, S.-H. (2000) Lexical versus 'phonological' representations of Mandarin sandhi tones. In M. B. Broe and J. B. Pierrehumbert, eds. Papers in Laboratory Phonology V: Acquisition and the Lexicon. Cambridge University Press. 152-167.
Perkell, J. S. and D. H. Klatt, eds. (1986) Invariance and Variability in Speech Processes. Hillsdale, New Jersey: Lawrence Erlbaum Associates.
Pierrehumbert, J. B. (1980) The Phonology and Phonetics of English Intonation. PhD dissertation, MIT.
Pierrehumbert, J. B. (1981) Synthesizing intonation. Journal of the Acoustical Society of America 70 (4), 985-995.
Pierrehumbert, J. (1994) Syllable structure and word structure: a study of triconsonantal clusters in English. In P. A. Keating, ed. Phonological structure and phonetic form: Papers in Laboratory Phonology III. Cambridge University Press. 168-188.
Pierrehumbert, J., M. E. Beckman and D. R. Ladd (2000) Conceptual foundations of phonology as a laboratory science. In N. Burton-Roberts, P. Carr and G. Docherty, eds. Phonological Knowledge: conceptual and empirical issues. Oxford University Press. 273-303.
Rabiner, L. R. (1968) Digital-formant synthesizer for speech-synthesis studies. Journal of the Acoustical Society of America 43 (4), 822-828.
Rabiner, L. and B.-H. Juang (1993) Fundamentals of Speech Recognition. Prentice-Hall.
Read, C., E. H. Buder and R. D. Kent (1992) Speech analysis systems: an evaluation. Journal of Speech and Hearing Research 35, 314-332.
Remez, R. E., P. E. Rubin, D. B. Pisoni and T. D. Carrell (1981) Speech perception without traditional speech cues. Science 212, 947-9.
Rothenberg, M. (1973) A new inverse-filtering technique for deriving the glottal air flow waveform during voicing. Journal of the Acoustical Society of America 53 (6), 1632-1645.

Rousselot, Abbé (1901) Principes de Phonétique Expérimentale. Deuxième Partie. Paris and Leipzig: H. Welter.
Sawashima, M. and H. Hirose (1968) New laryngoscopic technique by use of fiber optics. Journal of the Acoustical Society of America 43, 168-169.
Scobbie, J. M., F. Gibbon, W. J. Hardcastle and P. Fletcher (2000) Covert contrast as a stage in the acquisition of phonetics and phonology. In M. B. Broe and J. B. Pierrehumbert, eds. Papers in Laboratory Phonology V: Acquisition and the Lexicon. Cambridge University Press. 194-207.
Sondhi, M. M. and J. Schroeter (1987) A hybrid time-frequency domain articulatory speech synthesizer. IEEE Transactions on Acoustics, Speech and Signal Processing ASSP-35 (7), 955-967.
Stetson, R. H. (1928) Motor Phonetics. Current (1988) edition, edited by J. A. S. Kelso and K. G. Munhall. Boston: College-Hill Press.
Stevens, K. N. (1989) On the quantal nature of speech. Journal of Phonetics 17, 3-46.
Stevens, K. N. (1996) Critique: articulatory-acoustic relations and their role in speech perception. Journal of the Acoustical Society of America 99 (3), 1693-1694.
Stevens, K. N. and S. E. Blumstein (1978) Invariant cues for place of articulation in stop consonants. Journal of the Acoustical Society of America 64, 1358-1368.
Stevens, K. N., S. Kasowski and C. G. M. Fant (1953) An electrical analog of the vocal tract. Journal of the Acoustical Society of America 25 (4), 734-742.
Stewart, J. M. (1967) Tongue root position in Akan vowel harmony. Phonetica 16, 185-204.
Stone, M., T. Shawker, T. Talbot and A. Rich (1988) Cross-sectional tongue shape during vowels. Journal of the Acoustical Society of America 83, 1586-1596.
Sussman, H. M., P. F. MacNeilage and R. J. Hanson (1973) Labial and mandibular dynamics during the production of bilabial consonants: preliminary observations. Journal of Speech and Hearing Research 16, 397-420.
Tiede, M. K. (1996) An MRI-based study of pharyngeal volume contrasts in Akan and English. Journal of Phonetics 24, 399-421.
van den Berg, J. (1958) Myoelastic-aerodynamic theory of voice production. Journal of Speech and Hearing Research 1, 227-244.
Vogel, I., H. T. Bunnell and S. Hoskins (1995) The phonology and phonetics of the Rhythm Rule. In B. Connell and A. Arvaniti, eds. Phonology and Phonetic Evidence: Papers in Laboratory Phonology IV. Cambridge University Press. 111-127.
Warren, D. W. (1976) Aerodynamics of speech production. In N. J. Lass, ed. Contemporary Issues in Experimental Phonetics. 105-137.
Warren, D. W., G. Nelson and G. D. Allen (1980) Effects of increased vertical dimension on oral port size and fricative intelligibility. Journal of the Acoustical Society of America 86, 917-924.
Warren, R. M. (1970) Perceptual restoration of missing sounds. Science 167, 392-393.
West, P. (1999) The extent of coarticulation of English liquids: an acoustic and articulatory study. Proceedings of the International Congress of Phonetic Sciences, San Francisco, Vol. 3, 1901-1904.
Westbury, J. (1991) The significance and measurement of head position during speech production experiments using the x-ray microbeam system. Journal of the Acoustical Society of America 95, 2271-2273.
Whalen, D. H. (1990) Coarticulation is largely planned. Journal of Phonetics 18, 3-35.

Wood, S. A. J. (1992) A radiographic and model study of the tense vs. lax contrast in vowels. In W. U. Dressler, H. C. Luschütsky, O. E. Pfeiffer and J. R. Rennison, eds. Phonologica 1988: Proceedings of the 6th International Phonology Meeting. Cambridge University Press. 283-291.
Yang, B. (2010) A model of Mandarin tone categories: a study of perception and production. PhD dissertation, University of Iowa. http://ir.uiowa.edu/etd/764.
Zue, V. W. (1985) The use of speech knowledge in automatic speech recognition. Proceedings of the IEEE 73, 1602-1615.

Figure 1. a. According to the IPA/Daniel Jones system, vowel positions combine front-back with open-close dimensions. b. According to the Harshman et al. system, vowel positions for English are determined by a weighted combination of front-raising and back-raising vectors (dashed arrows); tongue root advancement arises as a consequence of front-raising. (This figure is illustrative, not measured: Harshman et al.'s factors are not merely linear motions in the plane, but are deformations of the whole tongue surface.) c. For languages with independently controlled tongue root advancement, the Harshman et al. approach could use a third, ATR vector, so that the contrast between tense and lax vowels is attributed to a combination of three factors. [Vowel charts not reproduced.]
