Speech Perception
H Mitterer and A Cutler, Max Planck Institute for Psycholinguistics
Speech Interface Design

... after speech has been initiated, the amount of time to pause for silence detection can be increased.

Conclusion

Designing an effective speech user interface is an iterative process. Not even the most experienced designer will craft a perfect dialogue in the first iteration. To ensure that the system will be used over time, focus on users and do not be afraid to modify the design as new data are collected. Observe users during the task-modeling phase to understand who they are and what their goals are. Listen carefully to users while conducting natural dialogue studies to determine how they speak in the context of the task. Test the design with target users to ensure that the prompts are clear, that feedback is appropriate, and that errors are caught and corrected. In addition, when testing, verify that the design is accomplishing the business goals that were set out to be achieved. If problems are uncovered during the test, revise the design and test again.
By focusing on users and iterating on the design, one can produce an effective, polished speech interface design.

See also: Speech Recognition: Statistical Methods; Speech Synthesis.

Speech Perception
H Mitterer and A Cutler, Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
© 2006 Elsevier Ltd. All rights reserved.

Introduction

When we listen to speech, our aim is to determine the meaning of the acoustic input. Only very rarely is a listener – usually some kind of speech scientist – concerned with the sounds of speech instead of with the content. Yet the study of speech perception has largely dealt with how listeners tell phonemes – the sounds of speech – from one another. Why is this so? Phonemes are, by definition, the minimal units that distinguish between words, i.e., between one meaning and another. Listeners may know tens, indeed hundreds of thousands, of words, and may still be able to learn new words effortlessly on a daily basis. However, languages construct this vast stock of words from, on average, only around 30 separate phonemes (Maddieson, 1984; see Figure 1). Thus, there is both parsimony and validity in the speech perception research program. Parsimony, because study of the perceptual cues to a few dozen phonemes is tractable in a way that study of the perceptual cues to individual words is not, and validity, because understanding how listeners tell, say, a /b/ from a /p/ informs us about the recognition of all word pairs that differ in these phonemes.

Figure 1 Distribution of phoneme inventory size across the representative sample of 317 languages analyzed by Maddieson (1984). The average number of phonemes is 31.

It is important to realize that this research focus on cues to phonemic identity did not proceed from an assumption that phonemes necessarily played a role as units of speech perception. In fact, there have been many competing proposals concerning the basic unit of speech perception, all more or less unsatisfactory (Klatt, 1989), from units below the level of the phoneme (phonological features) to above it (diphones, syllables) to no sublexical unit at all. Rather, the phonemic research focus was the result of rational task analysis: to understand speech, listeners need to recognize words, and the process of recognizing words involves distinguishing a word from all other words that are its closest neighbors in the vocabulary – word from bird, ward, work, and so on. By making such distinctions, the listener is effectively making decisions that are phonemic in nature.

True to the history of speech perception research, therefore, this review is organized first by phonemic category and the type of information that enables listeners to recognize members of each category; following sections deal with the principal theories that have driven speech perception research; and we conclude with a section on how this body of research relates to the recognition of the larger, meaningful units that really concern listeners.
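The neighborhood point above – that recognizing word amounts to distinguishing it from bird, ward, work – can be made concrete with a toy phoneme-transcribed lexicon. The mini-lexicon and its ARPAbet-style transcriptions below are invented for the example, not taken from any real pronunciation dictionary:

```python
# Toy illustration of phonemic neighborhoods: lexicon words that differ
# from a target in exactly one phoneme (substitution), as in word/bird/
# ward/work. Transcriptions are invented, ARPAbet-style approximations.
LEXICON = {
    "word": ("w", "er", "d"),
    "bird": ("b", "er", "d"),
    "ward": ("w", "ao", "d"),
    "work": ("w", "er", "k"),
    "ship": ("sh", "ih", "p"),
}

def neighbors(target: str) -> list[str]:
    """Return lexicon words differing from the target in exactly one phoneme."""
    t = LEXICON[target]
    out = []
    for word, phones in LEXICON.items():
        if word == target or len(phones) != len(t):
            continue
        if sum(a != b for a, b in zip(phones, t)) == 1:
            out.append(word)
    return sorted(out)

print(neighbors("word"))  # ['bird', 'ward', 'work']
```

Each neighbor here differs from the target in a single phoneme, so telling the target from its neighbors is exactly a phonemic decision, as the task analysis in the text suggests.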
Phoneme Perception

Phonemes in Real Speech

If words can indeed be viewed as a sequence of phonemes, it might seem reasonable to assume that these sounds are concatenated one after another in the speech stream. This is far from the case. Speech signals are continuous, and it is hard to discern where one word ends and the next begins, let alone locate the borders of individual phonemes (see Figure 2). An analogy concocted by Hockett (1955: 210) still appears in most introductory textbooks: phonemes in speech are like colored Easter eggs moving on a conveyor belt to be crushed by a wringer. The wringer smears out the eggs so that no part of the conveyor belt is exclusively covered by the remains of one particular Easter egg. As a consequence, the coloring of one egg is mixed with the colorings of the surrounding eggs. This analogy is intended to convey that the speech signal is neither separable nor invariant. It is not separable because no part of the speech signal is exclusively influenced by one phonological unit. It is not invariant because adjacent – in fact, even nonadjacent – phonemes are coarticulated, so that the acoustic correlates of a given phoneme vary. Therefore, running speech is completely different from, for instance, Morse code, in which invariant acoustic signals are concatenated.

Figure 2 An oscillogram and spectrogram of the sentence "This sentence is just for illustrative purposes," spoken by an adult male native speaker of British English born in Scotland.

In making this important point, the lack-of-invariance argument may sometimes be exaggerated. Some phoneme classes – for example, voiceless fricatives – can have quite salient and local cues (see the different tokens of /s/ in Figure 2). Others, however, show a great deal of variation. Stop consonants are a case in point; exactly the same stretch of speech can be heard as /p/ before one vowel but as /k/ before another (see Figure 3). Vowels can also be highly coarticulated; thus the second formant (F2) in a vowel will generally be higher after alveolar consonants than after labial consonants, corresponding to the higher F2 locus in alveolar than in labial consonants.
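The F2 coarticulation effect just described can be sketched as a simple interpolation between a consonantal F2 locus and the vowel's steady-state target. All frequencies below are invented, textbook-style round numbers, and the 50/50 interpolation is an arbitrary simplification, not a model from the literature:

```python
# Toy sketch of consonant-to-vowel coarticulation in the second formant
# (F2): the F2 onset of a vowel is pulled toward the preceding consonant's
# F2 "locus", so the same vowel begins with a higher F2 after an alveolar
# than after a labial. All Hz values are invented for illustration.
F2_LOCUS = {"alveolar": 1800.0, "labial": 1100.0}

def f2_onset(place: str, vowel_f2_target: float, pull: float = 0.5) -> float:
    """Interpolate between the consonant's F2 locus and the vowel's target."""
    return pull * F2_LOCUS[place] + (1.0 - pull) * vowel_f2_target

vowel_f2 = 1300.0  # steady-state F2 target of some vowel (invented)
print(f2_onset("alveolar", vowel_f2))  # 1550.0 -- higher onset
print(f2_onset("labial", vowel_f2))    # 1200.0 -- lower onset
```

The point of the sketch is only that one and the same vowel target yields different acoustic onsets depending on the consonantal context – the non-invariance the text describes.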
Sources of variance arise both within and between speakers. Not only the phonetic context in which a phoneme occurs, but also the rate of speech and the speech register (from formal and careful to casual and careless) can affect how that phoneme is realized within the utterances of a given speaker. So too can the degree of prosodic prominence within the sentence of the word in which the phoneme occurs. Variance between speakers arises most noticeably from physiological differences, e.g., in that the formant frequencies of vowels reflect the resonances of the vocal tract. The resonances are lower the larger the vocal tract; thus children produce higher formant frequencies than adults, and adults also differ. Further interspeaker variance is introduced by dialectal variation.

Variability may sometimes be so extensive that, superficially, a phonemic contrast is blurred. For instance, most languages that use nasals in word-final position allow the place of articulation to be assimilated to the place of a following consonant; the final consonant of sun may be pronounced /n/ in suntan, /m/ in sunbathing, and /ŋ/ in sunglasses. In running speech, sounds may be deleted (e.g., the word just in Figure 2 contains no [t]) or inserted (an utterance of something may contain cues consistent with a /p/ between the two syllables, or an utterance of pensive may contain cues consistent with a /t/).
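The place-assimilation pattern above can be written as a one-line rewrite rule over phoneme strings. The segment labels below are invented conveniences ("V" stands in for an unspecified vowel, "N" for the engma /ŋ/), and the rule is deliberately simplified:

```python
# Toy rewrite rule for word-final nasal place assimilation, as in
# sun -> su[n]tan, su[m]bathing, su[N]glasses ("N" = engma, "V" = vowel;
# segment labels and the rule itself are simplified for illustration).
PLACE = {"p": "labial", "b": "labial", "m": "labial",
         "t": "alveolar", "d": "alveolar", "n": "alveolar",
         "k": "velar", "g": "velar"}
NASAL_AT = {"labial": "m", "alveolar": "n", "velar": "N"}

def assimilate(word: list[str], suffix: list[str]) -> list[str]:
    """Assimilate a word-final /n/ to the place of the following consonant."""
    out = list(word)
    if out and out[-1] == "n" and suffix and suffix[0] in PLACE:
        out[-1] = NASAL_AT[PLACE[suffix[0]]]
    return out + list(suffix)

sun = ["s", "V", "n"]
print(assimilate(sun, ["t", "a", "n"]))  # ['s', 'V', 'n', 't', 'a', 'n']
print(assimilate(sun, ["b", "a"]))       # ['s', 'V', 'm', 'b', 'a']
print(assimilate(sun, ["g", "l"]))       # ['s', 'V', 'N', 'g', 'l']
```

Note that the rule is context-dependent in exactly the way that blurs the surface phonemic contrast: the underlying /n/ surfaces as three different nasals.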
With these two cues, vowels can be reasonably well distinguished from consonants (see Figure 2). The vocal fold vibration gives rise to a periodic source signal with a large number of harmonics. This source signal is then filtered by the vocal tract (see Speech Production). The vocal tract amplifies some of the harmonics due to its resonance characteristics. Regions with amplified harmonics are called formants. The frequencies of these formants depend on the exact shape of the vocal tract, that is, on tongue position and shape, the position of the jaws, etc. (see Speech Production). Accordingly, vowels can be distinguished from one another by their steady-state formant frequencies. The vowel system of a particular language is often presented in a two-dimensional vowel space, with first-formant frequency on the ordinate and second-formant frequency on the abscissa. This representation gives rise to a vowel triangle with the vowels [u] as in shoe, [i] as in she, and [a] as in shah as corners. Other vowels can be associated with different positions in this vowel triangle. Accordingly, listeners can identify vowels
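The vowel-triangle representation can be turned into a toy identifier: given an F1/F2 measurement, pick the nearest corner vowel. The formant values below are rough illustrative figures for an adult male voice, not measurements from any dataset:

```python
# Nearest-neighbour sketch of vowel identification in the two-dimensional
# F1/F2 vowel space; corner-vowel formant values (Hz) are rough,
# illustrative figures, not measurements.
VOWEL_TRIANGLE = {
    "i": (280.0, 2250.0),  # as in "she":  low F1, high F2
    "a": (750.0, 1250.0),  # as in "shah": high F1, mid F2
    "u": (320.0, 900.0),   # as in "shoe": low F1, low F2
}

def identify_vowel(f1: float, f2: float) -> str:
    """Return the corner vowel whose (F1, F2) point is closest (Euclidean)."""
    return min(VOWEL_TRIANGLE,
               key=lambda v: (VOWEL_TRIANGLE[v][0] - f1) ** 2
                           + (VOWEL_TRIANGLE[v][1] - f2) ** 2)

print(identify_vowel(300.0, 2200.0))  # i
print(identify_vowel(700.0, 1300.0))  # a
print(identify_vowel(350.0, 950.0))   # u
```

A real listener's task is harder than this sketch: as the preceding sections note, the same vowel's formants shift with speaker physiology and consonantal context, so fixed reference points would misclassify many natural tokens.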