Perception & Psychophysics
1981, 29 (2), 121-128

The role of second formant transitions in the stop-semivowel distinction

EILEEN C. SCHWAB, JAMES R. SAWUSCH, and HOWARD C. NUSBAUM
State University of New York, Buffalo, New York 14226

An experiment was conducted to assess the relative contributions of three acoustic cues to the distinction between stop consonant and semivowel in syllable-initial position. Subjects identified three series of stimuli which varied perceptually from [ba] to [wa]. The stimuli differed only in the extent, duration, and rate of the second formant transition. In each series, one of the variables remained constant while the other two changed. Obtained identification ratings were plotted as a function of each variable. The results indicated that second formant transition duration and extent contribute significantly to perception. Short second formant transition extents and durations signal stops, while long second formant transition extents and durations signal semivowels. Second formant transition rate did not contribute significantly to this distinction: any particular rate could signal either a stop or a semivowel. These results are interpreted as arguing against models that incorporate transition rate as a cue to phonetic distinctions. In addition, these results are related to a previous selective adaptation experiment. It is shown that the "phonetic" interpretation of the obtained adaptation results was not justified.

This work was supported by NIMH Grant MH31468-01 and NSF Grant BNS7817068 to SUNY/Buffalo and NINCDS Grant NS-12179 to Indiana University (which supported development of the speech synthesizer). The authors would like to thank James Pomerantz for his comments on an earlier draft of this manuscript. Requests for reprints should be sent to any author at the Department of Psychology, 4230 Ridge Lea Road, Buffalo, New York 14226.

Copyright 1981 Psychonomic Society, Inc. 0031-5117/81/020121-08$01.05/0

A fundamental claim of bottom-up theories of speech perception is that phonetic labeling is the direct result of analyzing a number of acoustic features of the speech waveform (e.g., see Fant, 1967). Regardless of the specific mechanism employed for this acoustic analysis, these data-driven theories must specify a set of basic acoustic properties which are coded during speech perception. One problem with this approach, pointed out by Studdert-Kennedy (1977), is that the choice of these features by theorists has been entirely post hoc. At present there are no unifying auditory principles guiding this theoretical feature selection process. In other words, bottom-up theories tend to choose those acoustic properties which can be successfully employed to perform phonetic labeling (see Stevens, 1980). Thus, feature processing theories typically invoke sufficiency criteria without regard for whether or not the acoustic properties employed are perceptually significant for humans (see Norman, 1980, for a related discussion).

It is very tempting to assume that the human perceptual system uses all the acoustic information available in making phonetic decisions. However, demonstrating the sufficiency of a set of acoustic features for phonetic labeling (e.g., Stevens & Blumstein, 1978) does not guarantee that the human speech perceiver takes advantage of this information.

It becomes very important, then, to determine exactly what acoustic information is utilized by humans in the course of phonetic perception. This inventory of acoustic cues will provide a basis for evaluating the psychological validity of theories of speech perception. In addition, assessing the entire repertoire of perceptually significant cues may constrain the types of perceptual mechanisms used in cue extraction (e.g., spectral templates vs. formant trackers). Finally, the full specification of these acoustic-phonetic cues might allow us to determine if there exist any general auditory principles of speech perception (cf. Studdert-Kennedy, 1977). Currently, such principles (if they exist) may be obscured by an incomplete picture of human acoustic information processing during speech perception. Before any auditory principles can be defined, it will be necessary to systematically investigate the separate and conjoint effects of the full spectrum of acoustic information available in speech.

The phonetic distinctions of voicing (e.g., Lisker & Abramson, 1964; Summerfield & Haggard, 1977) and place of articulation (e.g., Dorman, Studdert-Kennedy, & Raphael, 1977; Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967) in stop consonants are examples of two distinctions that have been studied intensively and extensively. Yet, even for these phonetic contrasts, all possible cue specifications and interactions have not been fully determined. For other phonetic distinctions, such as manner of articulation, the research to date has not explored cue structure in sufficient depth, especially when some of the cues are intrinsically related. In many instances, manipulation of one of these cues necessitates a change in at least one other cue. For example, a change in second formant transition extent (the frequency excursion from onset to steady state) may change the overall F2-F3 transition pattern (e.g., rising vs. diverging), the spectrum at syllable onset, perceptual summation of F2 and F3 onsets, and transition rate. Any one, or all, of these features, which are intrinsically interrelated, could be perceptually relevant. This problem is exemplified by considering the phonetic distinction between stop consonants (e.g., [b]) and semivowels (e.g., [w]).

Several earlier studies have examined acoustic cues that serve to distinguish stops and semivowels (Hillenbrand, Minifie, & Edwards, 1979; Liberman, Delattre, Gerstman, & Cooper, 1956; Miller & Liberman, 1979; O'Connor, Gerstman, Liberman, Delattre, & Cooper, 1957; Suzuki, Note 1). The first published study of stop-semivowel cues used two formant stimuli to examine the effect of transition tempo (Liberman et al., 1956). Tempo was varied by increasing the duration of the transitions (and decreasing the rate of the transitions by an appropriate amount) while holding transition frequency extent constant. Subjects identified synthetic stimuli which ranged perceptually from [bɛ] to [wɛ] and [gɛ] to [jɛ] (as in "yet"). Adult subjects were able to utilize the tempo of the F1 and F2 transitions as a cue to distinguish stop consonant from semivowel. These results, indicating the usefulness of the tempo cue, have been extended to infants. Hillenbrand et al. (1979) examined the ability of infants to discriminate between [bɛ] and [wɛ], which were cued by changes in transition tempo. The first experiment used synthetic stimuli similar to those of Liberman et al. (1956). The second experiment used computer-modified tokens of natural speech. In both experiments, infants were able to discriminate stop from semivowel on the basis of the tempo cue.

In another experiment, Liberman et al. (1956) examined the effect of transition tempo before a variety of vowels. For all stimuli, each transition began at the same frequency (120 Hz for F1 and 600 Hz for F2). So, transition extent was constant within a series but varied between series (vowels). Liberman et al. (1956) varied transition extent across vowels in order to determine whether transition rate or duration contributed more to the perception of stops and semivowels. Since there was less variance in the location of the category boundaries when each series was plotted as a function of transition duration (as opposed to formant transition rate), it was concluded that transition duration was the controlling cue. However, it should be noted that for some of their vowels, as F1 transition extent (and hence, rate) increased, F2 transition extent (and hence, rate) decreased. It is possible that these extent cues could interact and thus increase the variance of the boundary locations when identification functions are plotted against transition rate. Since the change in transition rate for the two formants was not consistent across vowels, the relative contributions of duration and rate cues could not be determined unequivocally.

The previous studies manipulated the tempo of all formants in a stimulus and observed the effect on perception. The next two studies manipulated the transition rate and extent of only one formant and observed the effect on perception. Suzuki (Note 1) examined the effect of F1 transition rate and extent on the perception of intervocalic stops and semivowels. It was reported that, in general, large F1 frequency extents were perceived as stops. Suzuki found that an increase in transition rate reduced the frequency extent required to perceive a stop. However, an examination of the data indicates that an increase in transition rate was accompanied by a decrease in transition duration. Thus, the results could also be indicating that a decrease in F1 transition duration reduces the frequency extent required to perceive a stop. Another study examined the acoustic cues that serve to distinguish semivowels and liquids (O'Connor et al., 1957). In part of this study, subjects identified stimuli that varied in F2 frequency extent before a variety of vowels. O'Connor et al. found a relationship between frequency extent and the perception of semivowels. When the F2 transition was in the appropriate direction (rising for [w] and falling for [j]), they found that a decrease in the extent of the F2 transition resulted in a decrease in semivowel responses. Since transition duration was held constant, a decrease in transition extent resulted in a concurrent decrease in transition rate.

These previous studies indicate that the extent, duration, and rate of consonant transitions are major cues to the stop-semivowel distinction. Unfortunately, we cannot evaluate the relative contribution of each cue, since these cues have been confounded in previous studies. Since rate is defined as frequency extent divided by transition duration, we cannot vary each cue separately. A change in one of the three cues automatically results in a change in at least one of the other two cues. Consequently, manipulations must involve at least two of these three cues, or all three, simultaneously. Previous studies varied only one of these three possible pairs. In order to determine the relative importance of each cue in perception, we must vary each of the three possible pairs of cues separately while holding the third cue constant. In the present study, all three pairwise comparisons were made for the F2 transition.
Thus, the present experiment will be able to assess the relative contribution of each cue to the distinction between stop consonant and semivowel.

METHOD

Subjects
The subjects were 14 undergraduates at the State University of New York at Buffalo, who participated to fulfill a course requirement. All subjects were right-handed, native speakers of English with no reported histories of either speech or hearing disorders.

Stimuli
The experimental stimuli consisted of three sets of seven synthetic consonant-vowel syllables which varied perceptually from [ba] to [wa]. All stimuli were generated using a software cascade synthesizer (Klatt, 1980a, or see Kewley-Port, Note 2) in the Speech Perception Laboratory at the State University of New York at Buffalo. The three series (and all stimuli within a series) were the same in all respects except one, the F2 transition. All stimuli were 245 msec in duration and contained five formants. The fundamental frequency contour was the same for all stimuli, with F0 starting at 105 Hz and rising to 120 Hz over the first 120 msec and then falling to 100 Hz at syllable offset. Each F1 began at 245 Hz and rose for 45 msec to a steady-state value of 700 Hz. Transition rate was 13.75 Hz/msec for the first 20 msec and 7.2 Hz/msec for the next 25 msec. Each F3 began at 2,115 Hz and rose for 70 msec to a steady-state value of 2,600 Hz. The F3 transition rate was 16.9 Hz/msec for the first 20 msec and 2.94 Hz/msec for the next 50 msec. The F2 steady-state value was 1,220 Hz. The fourth and fifth formants were constant at 3,300 and 3,850 Hz, respectively. The F1 bandwidth began at 60 Hz, remained constant for 15 msec, and then increased for 30 msec to a final value of 80 Hz. The F2 bandwidth began at 75 Hz, remained constant for 20 msec, and then increased for 40 msec to a final value of 80 Hz. The F3 bandwidth began at 90 Hz and increased for 70 msec to a final value of 140 Hz. The fourth and fifth formant bandwidths remained constant at 250 and 200 Hz, respectively. Amplitude of voicing began at 55 dB and increased during the course of the F2 transition to a value of 60 dB. Amplitude of voicing decreased during the last 50 msec of the vowel to 0 dB.

The duration, frequency extent, and rate of frequency change of the F2 transition were varied. In each series, one cue was held constant and the other two varied to produce a seven element stop-semivowel series. All F2 transitions were linear. In the rate constant series, F2 transition rate was held constant at 10.43 Hz/msec. F2 transition duration and extent ranged from 30 msec and 313 Hz for the Stimulus 1 end of the series to 60 msec and 626 Hz for the Stimulus 7 end of the series in 5-msec and 52-Hz steps. Schematic representations of the initial 145 msec of the first three formants of these two endpoints are shown in Figure 1. In the extent constant series, F2 transition extent was held constant at 470 Hz. F2 transition duration and rate ranged from 15 msec and 31.33 Hz/msec for the Stimulus 1 end of the series to 75 msec and 6.26 Hz/msec for the Stimulus 7 end of the series in 10-msec duration (and log slope) steps. Schematic representations of the initial 145 msec of the first three formants of these two endpoints are shown in Figure 2. In the duration constant series, F2 transition duration was constant at 60 msec. F2 transition extent and rate ranged from 260 Hz and 4.33 Hz/msec for the Stimulus 1 end of the series to 680 Hz and 11.33 Hz/msec for the Stimulus 7 end of the series in 70-Hz and 1.17-Hz/msec steps. Schematic representations of the initial 145 msec of the first three formants of these two endpoints are shown in Figure 3. In addition to the three experimental sets, there was a training set of stimuli. This set consisted of the two endpoints from each of the three experimental sets.

Figure 1. Three formant schematic representation of the two endpoints for the rate constant series. F2 transition rate is indicated by the double-headed arrows between the dashed lines. (Panels: [ba] and [wa]; axes: frequency against time; label: F2 TRANSITION RATE CONSTANT.)

Figure 2. Three formant schematic representation of the two endpoints for the duration constant series. F2 transition duration is indicated by the double-headed arrows between the dashed lines. (Panels: [ba] and [wa]; axes: frequency against time; label: F2 DURATION CONSTANT.)

Figure 3. Three formant schematic representation of the two endpoints for the extent constant series. F2 transition extent is indicated by the double-headed arrows between the dashed lines. (Panels: [ba] and [wa]; axes: frequency against time; label: F2 EXTENT CONSTANT.)
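The constraint behind the three series (rate equals extent divided by duration, so fixing one cue ties the other two together) can be checked with a short sketch. The Python below is illustrative only and is not the synthesizer code used in the study. The class and function names are hypothetical, and the rate constant extents are computed as rate times duration rather than copied from the rounded published values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class F2Transition:
    """One linear F2 transition (hypothetical container, not from the paper)."""
    duration_ms: float
    extent_hz: float

    @property
    def rate_hz_per_ms(self) -> float:
        # Rate is defined as frequency extent divided by transition duration.
        return self.extent_hz / self.duration_ms


def evenly_spaced(start: float, stop: float, n: int) -> list:
    """Return n values from start to stop inclusive, in equal steps."""
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def rate_constant_series(n: int = 7) -> list:
    # Rate fixed at 10.43 Hz/msec; duration runs 30 to 60 msec in 5-msec steps.
    # Extent follows from rate * duration (assumption: the published 313-626 Hz
    # endpoints are rounded versions of these products).
    return [F2Transition(d, 10.43 * d) for d in evenly_spaced(30.0, 60.0, n)]


def extent_constant_series(n: int = 7) -> list:
    # Extent fixed at 470 Hz; duration runs 15 to 75 msec in 10-msec steps.
    return [F2Transition(d, 470.0) for d in evenly_spaced(15.0, 75.0, n)]


def duration_constant_series(n: int = 7) -> list:
    # Duration fixed at 60 msec; extent runs 260 to 680 Hz in 70-Hz steps.
    return [F2Transition(60.0, e) for e in evenly_spaced(260.0, 680.0, n)]


if __name__ == "__main__":
    for name, series in [
        ("rate constant", rate_constant_series()),
        ("extent constant", extent_constant_series()),
        ("duration constant", duration_constant_series()),
    ]:
        print(name)
        for number, t in enumerate(series, start=1):
            print(f"  Stimulus {number}: {t.duration_ms:5.1f} msec, "
                  f"{t.extent_hz:6.1f} Hz, {t.rate_hz_per_ms:5.2f} Hz/msec")
```

Running the script lists duration, extent, and rate for each of the 21 stimuli. The endpoint rates come out close to those reported above, with small differences that presumably reflect rounding in the published figures.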

Procedure
Small groups of two to four subjects each were run at a time. Each subject participated for 1 h. The stimuli were converted to analogue form and presented to subjects in real time under computer control. The stimuli were presented binaurally to subjects through Telephonics TDH-39 matched and calibrated headphones. The intensity of all stimuli was set to 72 dB SPL for a [ba] rate constant stimulus. All subjects participated in a short training condition at the beginning of the session. The subjects were informed that they would be listening to synthetic syllables that would sound like [ba] and [wa]. During the training set, subjects were asked to listen to the stimuli without responding. These stimuli were presented in an alternating order ([ba], then [wa]) with an interstimulus interval of 4 sec. After each stimulus, feedback was provided indicating the stimulus that had been presented. Subjects were presented with 10 occurrences of each endpoint. After the training set, the subjects listened to the experimental sets. They were asked to identify each stimulus by pushing one of six buttons on a computer-controlled response box. Pushing button "1" indicated a good example of a [ba], and pushing button "6" indicated a good [wa]. Variations in quality between these phonetic exemplars were indicated with the buttons "2" through "5." The experimental trials were subject-paced with a maximum 5-sec interstimulus interval. Each stimulus series was presented in a block of 10 repetitions of each of the seven stimuli in random order (70 trials). Subjects listened to two blocks of trials for each series. The order of presentation of the experimental sets was counterbalanced across subjects. By the end of the experimental session, each subject had provided 20 identification responses to each stimulus in each series.

Figure 5. Average identification functions for the two series that vary F2 transition extent. (Axes: average rating against extent of F2 transition in Hz; series: duration constant and rate constant.)
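The block structure described in the Procedure can be sketched as a small trial-list generator. This is a hedged illustration, not the original experiment-control software: the function names are hypothetical, and presenting the two blocks of a series back to back is an assumption, since the text states only that there were two blocks per series and that series order was counterbalanced.

```python
import random

SERIES = ("rate constant", "extent constant", "duration constant")
N_STIMULI = 7          # seven stimuli per series
REPS_PER_BLOCK = 10    # ten repetitions of each stimulus in a block
BLOCKS_PER_SERIES = 2  # two blocks of trials for each series


def make_block(rng: random.Random) -> list:
    """Return one block: 10 repetitions of stimuli 1-7 in random order (70 trials)."""
    trials = [stim for stim in range(1, N_STIMULI + 1)
              for _ in range(REPS_PER_BLOCK)]
    rng.shuffle(trials)
    return trials


def make_session(series_order, seed=None) -> list:
    """Return a list of (series name, block) pairs for one subject.

    Assumption: both blocks of a series are run consecutively, in the
    order given by series_order.
    """
    rng = random.Random(seed)
    return [(name, make_block(rng))
            for name in series_order
            for _ in range(BLOCKS_PER_SERIES)]


if __name__ == "__main__":
    session = make_session(SERIES, seed=1)
    for name, block in session:
        print(name, len(block), "trials; first five:", block[:5])
```

Summing over the two blocks of a series, each stimulus occurs 20 times, which matches the 20 identification responses per stimulus per series reported above.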

RESULTS

The data from four subjects were eliminated from subsequent analysis, since the identification of the endpoints of one or more of the three series was inconsistent and near chance. Average identification rating functions were calculated for the three series for the remaining 10 subjects. In each series, the average ratings range from a good [ba] identification for Stimulus 1 to a good [wa] identification for Stimulus 7.

The identification results for the two series that varied F2 transition duration are shown in Figure 4. Data are plotted as a function of the duration of the F2 transition for these series. The data from the extent constant series (the solid squares) replicate the results of Liberman et al. (1956). When F2 transition extent was held constant, the proportion of [w] responses increased as transition duration increased. For the rate constant series, the same result was found. As transition duration increased, the proportion of [w] responses increased. Each of the 10 subjects showed this same pattern of results.

The results for the two series that varied F2 transition extent are shown in Figure 5. Data are plotted as a function of F2 frequency extent. The pattern of subject responses is similar to that found when F2 transition duration varied. Increasing F2 frequency extent decreased the proportion of [b] responses for both the duration constant and rate constant series. As the F2 transition extent increased, the proportion of [w] responses increased. Again, each of the 10 subjects showed this same pattern of results.

The identification results for the two series that varied F2 transition rate are shown in Figure 6. Data are plotted as a function of F2 rate for these two series. The pattern of subject responses here is very different from the data plotted as a function of F2 extent or duration. When frequency extent was held constant (the solid squares), the proportion of [w] responses decreased as the rate of F2 transitions increased. In contrast, when the F2 transition duration was held constant (the solid triangles), the proportion of [w] responses increased as the rate of the F2 transitions increased. As with the previous data sets, all 10 subjects show the same pattern of results that was found for the group data.

Figure 4. Average identification functions for the two series that vary F2 transition duration. (Axes: average rating against duration of F2 transition in msec; series: extent constant and rate constant.)

Figure 6. Average identification functions for the two series that vary F2 transition rate. (Axes: average rating against rate of F2 transition in Hz/msec; series: duration constant and extent constant.)

DISCUSSION

The results indicate that F2 transition rate is not a sufficient cue for distinguishing stops from semivowels. The use of rate as a cue seems to be totally dependent on the extent of the F2 transition and on the F2 transition duration. No matter what rate was chosen, an appropriate choice of extent or duration could cancel the effect of rate and cause the stimulus to be identified as either [b] or [w].¹ Thus, it appears that the significant cues for the stop-semivowel distinction are the duration and extent of the F2 transition. Short transition durations cue a stop, while long durations cue a semivowel. Small F2 transition extents signal a stop, while large F2 transition extents signal a semivowel.

It should be noted that there was, necessarily, some covariation of acoustic cues in our series. Since the overall duration of the syllables was constant, a change in the duration of the F2 transition resulted in a change in the duration of the F2 steady state. So the two series which increased the F2 transition duration decreased the F2 steady state. Thus, it could be argued that the duration of the F2 steady state contributed to the perception of manner. However, while it has been found that vowel duration affects the perception of manner (Miller & Liberman, 1979), the present stimuli did not vary vowel duration. The F3 steady-state frequency was reached after the F2 had reached its steady state for 20 of the 21 stimuli. In addition, the F1 steady-state duration was constant for all stimuli. Consequently, if F2 steady-state duration was a contributing cue to the manner distinction, it would probably not have been through an effect on vowel duration. The frequency at onset of F2 also covaried with frequency extent. Separate variation of these two acoustic aspects of the stimulus would require using different vowel series, as was done by Liberman et al. (1956). Consequently, the critical variable could be either the extent of the F2 transition or the spectrum at onset.

If we assume that duration and extent are the perceptually relevant cues, the phonetic labeling behavior of our subjects can be described using a simple decision rule. The product of the F2 transition extent (in hertz) and F2 transition duration (in milliseconds) can be used to differentiate stop consonants from semivowels. The expression E · D represents this relationship, where E is the value of F2 extent and D is the value of F2 transition duration. The product of these values can be compared with a criterion to perform phonetic feature assignments. If the product exceeds the criterion, subjects should label the test stimulus as a semivowel. If the product is less than the criterion, subjects should respond using a stop label. For our group data, a criterion of 23,000 (Hz · msec) would be sufficient to distinguish bilabial stops and semivowels. In fact, this criterion, which is based on group data, is sufficient to appropriately label 196 of 210 judgments.² However, we assume that the actual criterion value for individual subjects may vary depending on individual differences. It is also possible that different subjects might rely on the extent and duration cues to differing degrees. This would cause these cues to have different (exponential) weights in the decision rule. In addition, this product rule might be extended to encompass the contributions of other acoustic cues, such as F1 transition duration and extent, and the amplitude profile of the syllable at onset.

Despite the extreme simplicity of this description of phonetic labeling, it is interesting to note the form of the decision rule. For this rule, the assignment of phonetic features is based on the multiplication of the values of two acoustic cues. In this respect, the general form of our stop-semivowel decision rule is in agreement with other work on mathematical descriptions of phonetic decision making. For example, Massaro and Cohen (1977) have shown that voicing judgments can be described by a product rule. Oden and Massaro (1978) have used a similar approach to modeling the classification of stop consonants on the dimensions of voicing and place of articulation. However, the use of a product rule does not provide a process description of speech perception. Consequently, we now turn to considering the implications of the present data for a number of bottom-up process models of speech perception (Klatt, 1980b; Searle, Jacobson, & Kimberley, 1980; Sawusch, Note 3).

Given that F2 transition rate is not perceptually relevant to human classification of stops and semivowels, the human speech processor must operate under one of two possible constraints. The first possibility is that transition rate is never explicitly extracted during speech perception. If the cue is not available, it simply cannot be used. If human speech perception operates under this constraint, transition rate could not be used as a cue to any phonetic distinction. The alternative is that transition rate is explicitly extracted, but is not generally available for all phonetic feature decisions. This alternative would be supported if transition rate were shown to be perceptually relevant for other phonetic distinctions, such as place of articulation. In this case, it would be expected that transition rate would be extracted by a "sealed channel" mechanism (cf. Pomerantz, 1978), specific to a particular phonetic contrast (e.g., place). Through the operation of a sealed channel device, transition rate would appear to be interpreted holistically with other cues. This might be demonstrated in perceptual research by showing that rate was extracted in a phonetic feature dependent fashion.

Since human speech perception must operate under one of these two constraints, it seems reasonable to apply these constraints to theories of speech perception. This provides one criterion that can be used to evaluate bottom-up theories of speech perception. A second test is determining whether these theories utilize (or could implement) extent and duration as cues to the stop-semivowel distinction. Thus, we have two criteria for evaluating the adequacy of data-driven theories for explaining stop-semivowel perception.

Recently, Searle et al. (1980) have proposed a feature detector model of speech perception which has been instantiated as a computer program. This model of speech perception operates in two distinct modes. The first is a learning mode in which phonetic prototypes are constructed. An acoustic feature analysis is performed on the waveforms of "known" utterances. The results of this feature extraction process are then submitted to a discriminant analysis to classify the known utterances into categories.

In the second mode, novel utterances are also analyzed by the feature detectors. The discriminant analysis is then used to locate these feature-analyzed utterances in the multidimensional prototype space. The proximity of the novel utterances to known categories in this space is used as the basis for phonetic labeling. Two of the acoustic features used by this model are transition slope (i.e., rate) and the duration of acoustic events (e.g., onset time). There is, however, no explicit representation of transition extent. If, in the learning stage, this model was given a set of known natural speech stops (e.g., [b]) and semivowels (e.g., [w]), the program should learn to classify [b]s as having short transition durations and rapid transition rates. The model should also learn to identify [w]s as utterances with long transition durations and slow transition rates. If a [b]-[w] series of stimuli were constructed such that transition rate and extent were the only features varying in the series (i.e., duration was constant), the program should classify these sounds on the basis of rate alone. This model would identify any stimulus with a rapid transition rate as [b] and any stimulus with a slow transition rate as [w]. Clearly, this classification scheme is radically different from the procedure used, under similar circumstances, by our subjects, who classified this type of stimulus series according to transition extent. Our subjects classified stimuli with rapid transition rates and large frequency extents as [w] (see Figure 6), while the Searle et al. model would classify these stimuli as [b]. In contrast, stimuli with slow transition rates and small extents were labeled [b] by our subjects but would be labeled as [w] by the model. Thus, the Searle et al. (1980) model clearly violates the constraints we placed on human speech processing.

An alternative feature detector model has been proposed by Sawusch (Note 3). This computer simulation was designed to model both the psychological processes of speech perception and the perceptual effects of selective adaptation for the place of articulation feature in stops. According to this theory, speech perception is divided into a sequence of information transformation stages. At the earliest level of feature extraction, termed "peripheral auditory analysis," auditory cues are extracted in a frequency-specific, ear-specific fashion. Four classes of feature detectors were implemented in this stage to signal transition rise, transition fall, steady state, and low-frequency energy onset-offset (voicing). In this model, transition rate is not explicitly coded by feature detectors. Rather, for each frequency region, there are two rising transition detectors and two falling transition detectors. One each of the rise and fall detectors responds to extreme frequency changes (large extents), while the remaining two respond to gradual changes (short extents). Because only two sets of rise and fall detectors are implemented, rate distinctions are too grossly coded for any possible use in stop-semivowel judgments. However, this distinction is sufficient for making place of articulation decisions.

At the second level of feature analysis, called "integrative auditory analysis," frequency-specific features from peripheral auditory analysis are combined to form frequency-independent auditory patterns. This second level of processing is implemented as a set of integrative decision rules that take into account both auditory features and the auditory context in which those features occur. Within this model, decision rules only analyze feature outputs that are directly relevant. This means that, even if rate were coded at the peripheral level, decision rules at the integrative level could selectively ignore or employ this feature as required. Since this simulation does not explicitly code rate in sufficient detail for distinguishing stops and semivowels, this model exists within the constraints dictated by our results. In order to differentially label stops and semivowels, the simulation would need to utilize the extent information encoded at the first level of feature extraction. The outputs of these detectors could be accumulated over the transition duration (onset to steady state). This analysis represents a product of transition extent and duration and therefore would be consistent with our data.

One model which does predict our results has been proposed by Klatt (1980b). Klatt has described a bottom-up approach to speech perception which uses static spectral templates as fundamental auditory features. These spectral templates are nodes in a discrimination-recognition network. Sample short-term spectra, taken from an input waveform, are compared with these nodes and are scored for closeness of fit. The highest-scoring sequence through the network indicates the recognition path. In this model, transition extent would be indicated by the amount of change in the F2 spectral peak across templates from transition onset to steady-state vowel. Duration is analyzed as a cue by counting the number of times a spectral template is iteratively matched by looping through the same node. With both duration and extent cues being interpreted by the recognition network, this model should emulate human labeling of stops and semivowels. Even more important than the extraction of extent and duration cues by this model is the lack of any means for computing rate in this theory. Indeed, Klatt (1980b) has explicitly stated that evidence demonstrating the perceptual significance of transition rate for the [b]-[w] distinction would be a strong disconfirmation of his model. Since our research demonstrates that F2 transition rate is not utilized by humans making this distinction, the present study supports Klatt's (1980b) proposal.

Our results also have important consequences for the "phonetic" interpretation of a previous selective adaptation study (Cooper, Ebert, & Cole, 1976). In their experiment, subjects identified a speech series under two conditions. In the control condition, the subjects identified stimuli from a [ba]-[wa] series. In the adaptation condition, the subjects listened to repeated occurrences of an adapting stimulus and then identified the test series. A comparison was made between identification of the speech series before and after adaptation. When the adaptor was an endpoint of the test series, the postadaptation identification function shifted towards the adaptor end of the series. For example, using the [ba] endpoint as the adaptor, fewer of the [ba]-[wa] test stimuli were identified as [b] after adaptation.

Two loci for this adaptation effect have been proposed. One locus is an auditory level of speech processing (Ades, 1976; Bailey, 1975; Diehl, 1976). If adaptation occurs at this level, then spectral similarity between the adaptor and test series would predict the direction and magnitude of any adaptation effect. Alternatively, adaptation could occur at a phonetic level of processing where phonetic similarity would predict the direction of the effect (Cooper et al., 1976). Cooper et al. [...] was falling rather than rising. All [ga] transitions were 35 msec in duration. From their [ba]-[wa] data, this 35-msec duration falls within the semivowel category. Cooper et al. found that a normal [ga] adaptor had second and third formant starting frequencies sufficiently close to one another to simulate a burst, which is an acoustic cue for a stop consonant. In order to remove this burst-like effect, the transition frequency extent was reduced for both second and third formants. The F2 transition extent was reduced by approximately 250 Hz. Cooper et al. (1976) hoped to determine the locus of the selective adaptation effect, since the [ga] adaptor had transition durations similar to the [wa] end of their test series while phonetically it was similar to the [ba] end of the series. They hypothesized that an auditory locus would predict a [wa]-like adapting effect while a phonetic locus would predict a [ba]-like adapting effect. The effect of the [ga] adaptor was in the same direction as a [ba] adaptor. This led Cooper et al. (1976) to the conclusion that selective adaptation has an effect at a phonetic level of processing.

Given the results of the present experiment, the results of Cooper et al. can be explained without recourse to a phonetic locus for adaptation. By removing one acoustic cue for a stop, namely a burst, they substituted another stop cue, short frequency extent. Their [ga] adaptor had transition durations only somewhat appropriate for a semivowel, since the [ba]-[wa] stimulus with 35-msec transitions was still labeled as [ba] 20% of the time. However, their [ga] adaptor had a relatively small F2 transition extent, which is a strong stop cue (see Figure 5). Consequently, an auditory level explanation, based on the adaptation of both duration and extent detectors, would seem to be adequate to account for the Cooper et al. data.

In summary, the present experiment explored three acoustic cues to the stop-semivowel distinction. Both F2 transition duration and frequency extent were found to lead to a reliable stop-semivowel distinction. Short transition durations and short frequency extents lead to more stop responses, while long transition durations and large frequency extents lead to more semivowel responses. By comparison, F2 transition rate was found to be an insufficient cue to the stop-semivowel distinction. These results place certain constraints on theories of speech perception. Any theory purporting to explain human speech perception must either extract transition rate in a phonetic
tried to determine the locus of feature dependent (sealed channel) fashion or ignore the selective adaptation effect by examining the ef­ it entirely. This provides us with a test for the psy­ fect of a velar stop adaptor [gal on their bilabial stop­ chological validity of data-driven theories of speech semivowel series ([ba]-[wa]). In order to differentiate perception. Further, given the present results, it does between the auditory and phonetic explanations of not appear to be necessary to involve a phonetic level selective adaptation, Cooper et al. tried to create a of adaptation to explain the adaptation results found [gal adaptor that had an acoustic structure more sim­ for the stop-semivowel manner distinction. Rather, ilar to [wa] than to [ba). Their [gal was similar to the multiple auditory detectors or channels, tuned to the [ba]-[wa] stimuli, except that the initial F2 transition various cues, are sufficient to explain the existing data. 128 SCHWAB,SAWUSCH,ANDNUSBAUM
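The contrast between a rate-based classifier and listeners' extent-based labeling can be made concrete with the constant-duration series described in Note 1 (extents of 300 to 1,200 Hz in 150-Hz steps, all at 40 msec). The sketch below is illustrative only: the function names and the 15-Hz/msec threshold are assumptions introduced here, not values or procedures from the study, and the rule is a simplified stand-in for a model that has learned "rapid rate means stop."

```python
# Illustrative sketch only: the threshold and function names are
# assumptions, not values or procedures taken from the study.

DURATION_MS = 40.0  # constant transition duration in the Note 1 series

# F2 transition extents from Note 1: 300 Hz to 1,200 Hz in 150-Hz steps.
EXTENTS_HZ = list(range(300, 1201, 150))


def transition_rate(extent_hz, duration_ms):
    """Transition rate (Hz/msec) is extent divided by duration."""
    return extent_hz / duration_ms


def rate_only_label(rate_hz_per_ms, threshold=15.0):
    """Hypothetical rate-only rule: rapid transition -> stop, slow -> semivowel."""
    return "b" if rate_hz_per_ms > threshold else "w"


RATES = [transition_rate(e, DURATION_MS) for e in EXTENTS_HZ]
RATE_ONLY_LABELS = [rate_only_label(r) for r in RATES]

for extent, rate, label in zip(EXTENTS_HZ, RATES, RATE_ONLY_LABELS):
    print(f"extent={extent:5d} Hz  rate={rate:5.2f} Hz/msec  rate-only rule -> [{label}]")
```

With the assumed threshold, the rate-only rule assigns [b] to the 900-, 1,050-, and 1,200-Hz stimuli (22.5, 26.25, and 30.0 Hz/msec), whereas Note 1 reports that listeners identified those same three stimuli as [w] on better than 90% of the trials. Because rate and extent rise together when duration is fixed, any rule of this form ranks the stimuli in the opposite order from the listeners, whatever threshold is chosen.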

REFERENCE NOTES

1. Suzuki, H. Mutually complementary effect of rate and amount of formant transition in distinguishing vowel, semivowel, and stop consonant (Quarterly Progress Report of the MIT Research Laboratory of Electronics, No. 96). Boston: MIT, 1970.
2. Kewley-Port, D. KLTEXC: Executive program to implement the KLATT software speech synthesizer (Research on Speech Perception, Progress Report 4). Bloomington: Indiana University, 1978.
3. Sawusch, J. R. The structure and flow of information in speech perception (Research on Speech Perception, Tech. Rep. 2). Bloomington: Indiana University, 1976.

REFERENCES

ADES, A. E. Adapting the property detectors for speech perception. In R. J. Wales & E. Walker (Eds.), New approaches to language mechanisms. Amsterdam: North-Holland, 1976.
BAILEY, P. J. Perceptual adaptation in speech: Some properties of detectors for acoustical cues to phonetic distinctions. Unpublished doctoral dissertation, University of Cambridge, Cambridge, England, 1975.
COOPER, W. E., EBERT, R. R., & COLE, R. A. Perceptual analysis of stop consonants and glides. Journal of Experimental Psychology: Human Perception and Performance, 1976, 2, 92-104.
DIEHL, R. Feature analyzers for the phonetic dimension stop vs. continuant. Perception & Psychophysics, 1976, 19, 267-272.
DORMAN, M. F., STUDDERT-KENNEDY, M., & RAPHAEL, L. J. Stop consonant recognition: Release bursts and formant transitions as functionally equivalent, context-dependent cues. Perception & Psychophysics, 1977, 22, 109-122.
FANT, G. Auditory patterns of speech. In W. Wathen-Dunn (Ed.), Models for the perception of speech and visual form. Cambridge, Mass: M.I.T. Press, 1967.
HILLENBRAND, J., MINIFIE, F. D., & EDWARDS, T. J. Tempo of spectrum change as a cue in speech-sound discrimination by infants. Journal of Speech and Hearing Research, 1979, 22, 147-165.
KLATT, D. H. Software for a cascade/parallel formant synthesizer. Journal of the Acoustical Society of America, 1980, 67, 971-995. (a)
KLATT, D. H. Speech perception: A model of acoustic-phonetic analysis and lexical access. In R. A. Cole (Ed.), Perception and production of fluent speech. Hillsdale, N.J: Erlbaum, 1980. (b)
LIBERMAN, A. M., COOPER, F. S., SHANKWEILER, D. P., & STUDDERT-KENNEDY, M. Perception of the speech code. Psychological Review, 1967, 74, 431-461.
LIBERMAN, A. M., DELATTRE, P. C., GERSTMAN, L. J., & COOPER, F. S. Tempo of frequency change as a cue for distinguishing classes of speech sounds. Journal of Experimental Psychology, 1956, 52, 127-137.
LISKER, L., & ABRAMSON, A. S. A cross-language study of voicing in initial stops: Acoustical measurements. Word, 1964, 20, 384-422.
MASSARO, D. W., & COHEN, M. M. Voice onset time and fundamental frequency as cues to the /zi/-/si/ distinction. Perception & Psychophysics, 1977, 22, 373-382.
MILLER, J. L., & LIBERMAN, A. M. Some effects of later occurring information on the perception of stop consonant and semivowel. Perception & Psychophysics, 1979, 25, 457-465.
NORMAN, D. A. Copycat science or does the mind really work by table look-up? In R. A. Cole (Ed.), Perception and production of fluent speech. Hillsdale, N.J: Erlbaum, 1980.
O'CONNOR, J. D., GERSTMAN, L. J., LIBERMAN, A. M., DELATTRE, P. C., & COOPER, F. S. Acoustic cues for the perception of initial /w,j,r,l/ in English. Word, 1957, 13, 24-43.
ODEN, G. C., & MASSARO, D. W. Integration of featural information in speech perception. Psychological Review, 1978, 85, 172-191.
POMERANTZ, J. R. Are complex visual features derived from simple ones? In E. L. J. Leeuwenberg & H. F. J. M. Buffart (Eds.), Formal theories of visual perception. New York: Wiley, 1978.
SEARLE, C. L., JACOBSON, J. Z., & KIMBERLEY, B. P. Speech as patterns in the 3-space of time and frequency. In R. A. Cole (Ed.), Perception and production of fluent speech. Hillsdale, N.J: Erlbaum, 1980.
STEVENS, K. N. Property-detecting mechanisms and eclectic processors. In R. A. Cole (Ed.), Perception and production of fluent speech. Hillsdale, N.J: Erlbaum, 1980.
STEVENS, K. N., & BLUMSTEIN, S. E. Invariant cues for place of articulation in stop consonants. Journal of the Acoustical Society of America, 1978, 64, 1358-1368.
STUDDERT-KENNEDY, M. Universals in phonetic structure and their role in linguistic communication. In T. H. Bullock (Ed.), Recognition of complex acoustic signals. Berlin: Dahlem Konferenzen, 1977.
SUMMERFIELD, Q., & HAGGARD, M. On the dissociation of spectral and temporal cues to the voicing distinction in initial stop consonants. Journal of the Acoustical Society of America, 1977, 62, 435-448.

NOTES

1. In the present experiment, the fastest F2 transition rate for a [w] stimulus was less than 12 Hz/msec. The conclusion that any F2 transition rate can signal either stop or semivowel gains further support from six subjects' identification of an additional stop-semivowel series ([bI]-[wI]). In this series, all transition durations were constant at 40 msec. F2 transition extent and rate ranged from 300 Hz and 7.5 Hz/msec for the [b] end of the series to 1,200 Hz and 30 Hz/msec for the [w] end of the series in 150-Hz and 3.75-Hz/msec steps. In the group data, the stimuli with extents of 900, 1,050, and 1,200 Hz (with respective rates of 22.5, 26.25, and 30.0 Hz/msec) were all identified as [w] on better than 90% of the trials. Each of the six subjects identified these three stimuli as [w].
2. This rule also fits the [bI]-[wI] data for 37 out of 42 points.

(Received for publication July 17, 1980; revision accepted October 17, 1980.)