
DOI 10.1515/lp-2014-0011 Laboratory Phonology 2014; 5(2): 289 – 335

Noah H. Silbert

Perception of voicing and place of articulation in labial and alveolar English stop consonants

Abstract: Distinctive features define a multidimensional structure that must be implemented in production and perception. A multilevel Gaussian General Recognition Theory model is presented as a model of multidimensional feature perception. The model is fit to data from three experiments probing identification of noise-masked, naturally-produced labial and alveolar English stop consonants [p], [b], [t], and [d] in onset (syllable-initial) and coda (syllable-final) position. The results indicate systematic perceptual deviations from simple place and voicing structure in individual subjects and at the group level. Comparing onset and coda positions shows that syllable position modulates the deviation patterns, and comparing speech-shaped noise and multi-talker babble indicates that deviations from simple feature structure are reasonably robust to variation in noise characteristics. Possible causes of the observed perceptual confusion patterns are discussed, and extensions of this work to studies of feature structure in speech production and investigation of non-native speech perception are briefly outlined.

Noah H. Silbert: University of Cincinnati. E-mail: [email protected]

1 Multiple dimensions in speech perception

Suppose that, in the course of a conversation, a speaker utters the word ‘pin’. If the speaker and the listener both share a native language (and dialect), if the word is enunciated clearly, if there is sufficiently little background noise, and if the listener’s hearing is unimpaired, the listener is very likely to accurately perceive the initial [p] in ‘pin’ (as well as the vowel and final consonant). If ‘pin’ is spoken by a non-native speaker or speaker of another dialect, if it is produced casually or rapidly, if there is substantial noise in the environment, or if the listener’s auditory system malfunctions, she may misperceive the initial [p]. For example, rather than ‘pin’, she may hear ‘bin’; she may misperceive the voiceless [p] as the voiced [b]. Alternatively, she may hear ‘pin’ as ‘tin’; she may misperceive the labial [p] as the alveolar [t]. Or she may hear ‘pin’ as ‘fin’; she may misperceive the stop [p] as the fricative [f].

All of which is to say that the mapping of a speech signal onto a meaningful linguistic representation requires a listener to process information on multiple phonological dimensions. It may be that a segment is a simple combination of its featural components and that perceptual space is accurately described solely by discrete, abstract distinctive features. Alternatively, it may be that perception deviates systematically from a simple featural description and that segments are not perceived as simple combinations of feature values.

Unfortunately, distinctive features are not directly observable in perception (or production). Even in a highly controlled experimental setting, one can only directly observe the properties of experimental stimuli and some characteristics of listeners’ responses (e.g., which button is pressed, response latency, etc.).

Investigations of feature perception are complicated by the fact that the mapping between distinctive features and acoustic cues is many-to-many. For example, voicing1 is cued by, among other acoustic distinctions, stop burst amplitude and voice onset time (VOT); both also play a role in cuing place of articulation. In English, voiceless stops have higher amplitude release bursts and longer VOTs than do voiced stops. Simultaneously, alveolar stops have higher amplitude release bursts and longer VOTs than do labial stops (Volaitis and Miller 1992; Oglesbee 2008).

General Recognition Theory (GRT) provides a powerful mathematical framework for modeling feature perception more or less directly (Ashby and Townsend 1986; Thomas 2001; Silbert 2012). The present work investigates the perception of voicing and place of articulation in the English stops [p], [b], [t], and [d] by applying a multilevel Bayesian Gaussian GRT model to data collected in three experiments employing noise-masked, naturally-produced stimuli.

The goals of the present work are threefold. The first goal is methodological and involves the application of multilevel Gaussian GRT to analyze feature perception at individual subject and group levels. As described below in some detail, Gaussian GRT provides a rigorous, detailed account of patterns of perceptual confusion as a function of distinct perceptual and response-selection processes. Recent extensions of GRT enable inferences about these distinct processes to be drawn simultaneously at the individual listener and group level (Silbert 2012).

The second goal is to provide fine-grained analyses of distinctive feature structure in (English) consonant perception. There are compelling reasons to believe that speech perception relies on features (Bailey and Hahn 2005; Lahiri and Reetz 2010), yet there is also experimental evidence consistent with the notion that the basic perceptual unit is the segment (Nearey 1990; Benkí 2001) and there are compelling models of speech perception that assume that the basic perceptual unit is the segment (McClelland and Elman 1986; Norris et al. 2000). To the extent that a basic multidimensional feature structure accurately describes perceptual space, a feature-based approach to speech perception is supported (Bailey and Hahn 2005; Lahiri and Reetz 2010). Such results would suggest that segment-based models might profitably be modified to employ features rather than segments. On the other hand, to the extent that perceptual structure deviates from a simple feature structure (i.e., to the extent that features interact with one another), segmental perception is supported (Nearey 1990; Benkí 2001). Such results would suggest that segments are not merely simple combinations of particular levels of multiple features.

It is worth keeping in mind that, although segmental perception in its most general form might allow for more or less random deviations from feature-based structure in perception, there is good reason to assume that this most general construal of segmental perception is wrong. We already know, for example, that voiceless stops share articulatory and acoustic properties, and that these differ systematically from the articulatory and acoustic properties of voiced stops (e.g., Volaitis and Miller 1992; Kessinger and Blumstein 1997). Place and manner distinctions behave similarly. Hence, the focus on feature-based vs. segmental perception is aimed not at declaring one or the other correct, but, rather, at investigating the degree to which perceptual structure deviates from simple feature structure. The nature of ‘simple’ feature structure is addressed below in detail and defined with respect to GRT.

The third goal is to probe the effect of (one type of) contextual variation in feature perception by probing perception of the same features – place and voicing – in syllable onset and coda positions. While it is well documented that syllable position can dramatically affect the acoustics of consonantal distinctions (Port and Dalby 1982; Volaitis and Miller 1992), much less is known about how perceptual interactions between features may vary across prosodic contexts.

1 The contrast in English may be better characterized as pertaining to ‘spread glottis’ rather than ‘voicing’, though note that, at least in syllable onset position, English doesn’t make use of both features. That is, the difference between [p] and [b], on the one hand, and between [t] and [d], on the other, is minimal in English, and so, at least in the present investigation, there is no ambiguity. See Kingston and Diehl (1994) for some argument in favor of ‘voicing’ as the appropriate label.

1.1 Interactions vs. independence between dimensions

Much research on feature structure in speech production and perception has fo­ cused on very small portions of phonological space, often a single pair of phones differing on a single feature. This approach has clear value; it has produced a substantial body of knowledge about the production and perception of a number of acoustic cues to a number of features (e.g., VOT as a cue to voicing, formant transitions as cues to place of articulation). Similarly, psycholinguistic work probing the role of feature underspecification in perception has often focused on single pairs of phones (e.g., Lahiri and Reetz 2010). However, because distinctive features define an inherently multidimensional structure, and because the map­ ping between features and acoustic cues is many-to-many, focusing on small sub­ sets of phonological space by necessity misses important properties of the imple­ mentation of features in production and perception. In some cases, researchers have taken an explicitly multidimensional ap­ proach, focusing on multiple cues to multiple features (Sawusch and Pisoni 1974; Oden and Massaro 1978; Eimas et al. 1981) or multiple cues shared between phones occurring in sequence (Nearey 1990, 1997; Smits 2001a, b). Previous stud­ ies of the perception of cues to voicing and place have produced equivocal re­ sults, with some indicating that cues interact (Sawusch and Pisoni 1974) while others suggest that cues are independent (Oden and Massaro 1978). Work focus­ ing on perceptual structure in noise-masked speech has been similarly equivocal, with some studies producing results consistent with simple feature structure (G. A. Miller and Nicely 1955) and others strongly suggestive of featural interac­ tions (Phatak and Allen 2007). Acoustic cues to place of articulation and voicing seem to interact in pho­ netic categorization (Sawusch and Pisoni 1974) and in speeded classification tasks (Eimas, Tartter, Miller, and Keuthen 1978; J. L. Miller 1978; Eimas, Tartter, and Miller 1981). On the other hand, some have argued that cues to place and voicing are, in fact, independent (Oden and Massaro 1978; Massaro and Oden 1980). Cues to phonological distinctions in sequences of phones seem to interact, as well (Nearey 1990, 1992, 1997; Smits 2001a). Recent work has attempted to measure and model processing of all relevant acoustic-phonetic cues to place and voicing in fricative categorization (McMurray and Jongman 2011). However, the focus of this work is not phonological per­ ception, per se, but, rather, the relative performance of a number of different ap­ proaches to dealing with cue variability and context-dependency. While the clas­ sification models used by McMurray and Jongman do make use of a large (and possibly exhaustive) set of cues to place and voicing in , they treat frica­ tives as segments and focus entirely on classification accuracy, which is to say Perception of voicing and place of articulation 293 that they do not model distinctive feature structure or address the fact that pat­ terns of perceptual confusions can be reasonably well accounted for with a sim­ ple feature counting model (Bailey and Hahn 2005). Studies of noise-masked speech provide insight into the role of features in perception, as well. Distinctive feature structure is evident in patterns of accurate identification and confusion of consonants masked by white noise (Miller and Nicely 1955; Allen 2005), signal-correlated noise (Benkí 2003), and multi-talker babble (Cutler et al. 2004). 
Information theoretic analyses indicate that informa­ tion is not transmitted equally well for different distinctive features: voicing and manner distinctions are more salient than place of articulation distinctions, re­ gardless of the type of masker noise and particularly at low signal-to-noise ratios (Miller and Nicely 1955; Benkí 2003; Cutler et al. 2004). More fine-grained analyses of confusion patterns suggest that certain noise types may induce perceptual interactions between features. While white noise seems to induce reasonably well-behaved feature-based perceptual grouping (Miller and Nicely 1955; Allen 2005), speech-weighted noise may induce more eclectic perceptual grouping (Phatak and Allen 2007). Noise maskers may also interact with token-specific acoustic properties of stimuli, influencing overall ­error rates and confusion patterns (Singh and Allen 2012). Inspection of previously reported confusions among just the set of conso­ nants considered here (i.e., [p], [b], [t], and [d]) indicates that place tends to be less salient than voicing (Miller and Nicely 1955; Wang and Bilger 1973; Benkí 2003; Cutler et al. 2004), and that confusability among these consonants can vary depending on whether the consonants appear in onset or coda position (Wang and Bilger 1973; Benkí 2003). The confusion matrices reported by Benkí (2003) seem to indicate, on the one hand, less confusability between onset [b] and [d] than between onset [p] and [t], less confusability between onset [b] and [d] than between coda [b] and [d], and less confusability between onset [p] and [t] than between coda [p] and [t]. The confusion matrices reported by Wang and Bilger (1973) and by Cutler et al. (2004) indicate less confusability between [b] and [d] than between [p] and [t] in general, with less of a difference in confusion patterns between onset and coda for these consonants. Following the general speech-in-noise approach of these studies, the pres­ ent work aims to further elucidate the role of features in the perception of noise-masked speech. As mentioned above, a number of previous studies of speech in noise have focused on the perception of large sets of consonants, ­typically analyzing responses pooled across (sets of) listeners (Miller and Nicely 1955; Benkí 2003; Cutler et al. 2004; Allen 2005; Phatak and Allen 2007; Singh and Allen 2012). By way of contrast, in the experiments described below, feature perception is analyzed for the smaller set of consonants – [p], [b], 294 Noah H. Silbert

[t], and [d] – by fitting a multilevel GRT model to individual listeners’ classifica­ tion data. The focus of the present work is restricted to place and voicing in this subset of English stop consonants in part so that the results are more or less directly com­ parable to similar results in previous studies with similarly narrow scope (e.g., Sawusch and Pisoni 1974; Oden and Massaro 1978), and in part because categories defined by the factorial combination of two levels on each of two dimensions maps directly onto the simplest full factorial GRT model (described in detail below). Of course, much as previous studies focusing on pairs of sounds are unable to probe certain aspects of multidimensional feature structure, an investigation of a small multidimensional subspace such as that formed by [p], [b], [t], and [d] will miss certain properties of the perceptual space populated by the full com­ plement of English consonants. In a number of investigations of consonant con­ fusion patterns, some analyzing much larger sets of consonants, fairly high rates of confusions among the consonants in the set considered here and consonants outside this set have been reported. For example, [b] and [v] are often found to be highly confusable (Miller and Nicely 1955; Wang and Bilger 1973; Phatak and Allen 2007; Silbert 2012), and similarly for [d] and [g], at least with certain noise types (e.g., white noise; Miller and Nicely 1955; Wang and Bilger 1973). We return to this issue in the final discussion. It should be kept in mind that the confusion matrices reported and analyzed in these earlier studies are aggregated across multiple listeners and (often) multi­ ple experimental conditions (e.g., signal-to-noise ratios). It is also worth noting that the (often information theoretic) analyses employed in most of these earlier studies conflate perceptual and decisional processes. The work described below is aimed, in part, at addressing these limitations. Multilevel GRT provides a detailed cognitive account of correct and incorrect classifications as a function of individual listeners’ perceptual and response-­ selection processes as well as a group-level account of same. Of particular utility is the fact that GRT models distinct feature interactions (or lack thereof) at two perceptual levels. At the ‘micro’ level, features may interact (or not) within a given stimulus, whereas at the ‘macro’ level, features may interact (or not) between stimuli. These distinctions are described in detail in the following section.
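As a concrete illustration of the information-theoretic analyses cited above (e.g., Miller and Nicely 1955; Benkí 2003), the following minimal Python sketch computes feature-level transmitted information from a 4 × 4 confusion matrix over [p], [b], [t], [d] by collapsing the matrix over the other feature and taking the mutual information between stimulus and response feature values. The confusion counts and function names are hypothetical and purely illustrative.

```python
import numpy as np

# Hypothetical 4x4 confusion counts; rows = stimuli, columns = responses,
# both ordered [p, b, t, d].
conf = np.array([[180,  30,  60,  10],
                 [ 20, 200,  10,  50],
                 [ 70,  10, 170,  30],
                 [ 10,  60,  20, 190]], dtype=float)

# Feature values for [p, b, t, d]: voicing (0 = voiceless, 1 = voiced)
# and place (0 = labial, 1 = alveolar).
features = {"voicing": np.array([0, 1, 0, 1]),
            "place":   np.array([0, 0, 1, 1])}

def transmitted_information(conf, labels):
    """Mutual information (bits) between stimulus and response feature values,
    after collapsing the segment-level confusion matrix over the other feature."""
    n_levels = labels.max() + 1
    collapsed = np.zeros((n_levels, n_levels))
    for i, si in enumerate(labels):
        for j, rj in enumerate(labels):
            collapsed[si, rj] += conf[i, j]
    p_xy = collapsed / collapsed.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p_xy > 0, p_xy * np.log2(p_xy / (p_x * p_y)), 0.0)
    return terms.sum()

for name, labels in features.items():
    print(name, round(transmitted_information(conf, labels), 3))
```

With counts like these, voicing carries more transmitted information than place, the qualitative pattern reported in the studies cited above.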

2 General Recognition Theory

In the following sections, Gaussian GRT is introduced. A discussion of some related models, a description of the experimental protocols typically employed with GRT, and some recent developments that extend the model follow.

2.1 The structure of General Recognition Theory

GRT is a two-stage model of perception and response selection that relies on two major assumptions. First, it is assumed that internal and/or external noise causes the presentation of any given stimulus to produce a random perceptual effect. Over the course of multiple presentations, this results in a distribution of perceptual effects defined with respect to a multidimensional perceptual space. Second, it is assumed that decision bounds exhaustively partition this perceptual space into mutually exclusive response regions. These regions determine the responses associated with (sets of) perceptual effects (i.e., a perceptual effect in a given region is deterministically given the response associated with that region).

It will be assumed that the perceptual distributions are bivariate Gaussian, an assumption that is common in applications of GRT (Olzak and Wickens 1997; Kadlec and Hicks 1998; Thomas 2001; Silbert et al. 2009; Silbert 2012). In addition to being a common assumption in applications of GRT, Gaussian perceptual distributions have at least two useful properties. First, they provide a straightforward relationship between perceptual independence between features within a given perceptual distribution (i.e., at the ‘micro’ level) and statistical independence: within a given perceptual distribution, perceptual independence is equivalent to statistical independence (i.e., zero correlation). Second, they allow for a clean separation of within-stimulus and between-stimuli notions of perceptual interactions (or lack thereof) by virtue of the fact that the marginal perceptual distributions are not affected by the presence or absence of correlation within a given perceptual distribution (i.e., whether or not the perceptual effects on one dimension – e.g., voicing – vary across levels of the other dimension – e.g., place – is not affected by the presence or absence of correlations within multivariate Gaussian perceptual distributions). Hence, although other (perceptual) distributional assumptions are possible in the GRT framework, barring strong evidence in favor of some non-Gaussian model, these two properties of multivariate Gaussians provide a sound justification for their use in GRT.

It will also be assumed that the decision bounds are linear and parallel to the coordinate axes of the perceptual space. Although this may seem to be a very strong assumption, in the standard 2 × 2 GRT model (i.e., the model of stimuli defined by the factorial combination of two levels on each of two dimensions), deviations from this assumption are not identifiable (Silbert and Thomas 2013); it is useful to think of the decision bounds as both partitioning the perceptual space into response regions and (implicitly) defining the axes of the perceptual space.

Figure 1 shows four illustrative 2 × 2 Gaussian GRT models, each defined with respect to the stops [p], [b], [t], and [d], which consist of the factorial combination of place of articulation (labial vs. alveolar) and voicing (voiced vs. voiceless).

Fig. 1: Illustrative two-dimensional Gaussian GRT models. The ellipses represent contours of equal likelihood; the plus signs indicate the means of the distributions. The vertical and horizontal lines indicate decision criteria. The model in the top left illustrates the ‘simple feature’ model discussed above; the models in the other panels illustrate deviations from this structure. The model in the bottom left illustrates deviations at the ‘macro’, or between-stimuli, level, with the salience between [b] and [d] differing from the salience between [p] and [t]. The model in the top right illustrates deviations at the ‘micro’, or within-stimulus, level, with correlations present within each perceptual distribution. The model in the bottom right illustrates both types of deviation from simple feature structure. See text for additional details.

In visualizations of Gaussian GRT models, it is convenient to take a bird’s eye view of the perceptual space and look straight down on the perceptual distributions. From this perspective, perceptual distributions are illustrated with equal likelihood contours (i.e., sets of points on the bivariate Gaussian distributions that are the same height above the plane defined by the place [x-axes] and voicing [y-axes] dimensions). The means of the distributions are indicated by plus signs inside the equal likelihood contour ellipses. These are the peaks of the distributions, and the tails of the distributions extend indefinitely beyond the equal likelihood contours in each direction. For each stimulus, the perceptual effects inside the ellipse are more likely to occur than are the perceptual effects outside the ellipse. Predicted identification-confusion probabilities are given by the volume (i.e., double integrals) of the perceptual distributions in the appropriate response regions. For example, the predicted probability of responding ‘t’ when presented with [p] would be given by the double integral of the [p] distribution in the ‘t’ response region. Note that, whether or not the equal likelihood contours overlap multiple response regions, the fact that bivariate Gaussian distributions are defined for the entire place-by-voicing plane means that some (possibly very small) proportion of each distribution is in each response region.

In the model illustrated in the top left panel of Figure 1, the perceptual distributions all have zero correlation and are arranged in a square. This model illustrates a case in which perceptual categories consist of simple combinations of their component parts – the ‘simple feature structure’ mentioned above. In this model, perceptual salience on each dimension is constant across levels of the other dimension (i.e., the marginal perceptual distributions on each dimension are constant across levels of the other dimension). Hence, the square arrangement of distributions predicts symmetric patterns of between-category confusions, as between, for example, voiceless [p]-[t] and voiced [b]-[d], or between labial [p]-[b] and alveolar [t]-[d]. The lack of correlation in the perceptual distributions leads to the prediction that confusions within each category will correspond to simple feature differences. This model predicts, for a given proportion of accurate responses, more confusions between stimuli that differ on just one dimension (e.g., [p] and [b]) than between those that differ on both dimensions (e.g., [p] and [d]).
Note that, based on the confusion patterns from previous work discussed above, we don’t expect the simple feature model to provide an ade­ quate account of perceptual data. Rather, this model serves as a baseline against which deviations from simple feature structure can be contrasted. The model illustrated in the bottom left panel shows a case in which correla­ tions are zero but the means are no longer arranged in a square. In this model, perceptual salience for place varies as a function of voicing (though not vice ­versa); [p] and [t] are less perceptually salient than are [b] and [d] (i.e., the [mar­ ginal] perceptual distributions corresponding to [p] and [t] are farther apart from one another than are the [marginal] perceptual distributions corresponding to [b] and [d]). Hence, this model predicts more confusions between [p] and [t] than between [b] and [d]. Note, too, compared to the simple feature model (top left), the salience change in the bottom-left model leads to more predicted ‘p’ re­ sponses than ‘d’ responses to the [b] stimulus and more predicted ‘t’ responses than ‘b’ responses to the [d] stimulus. The model in the top right panel illustrates a case in which the means of the distributions are arranged in a square but the within-distribution correlations are non-zero. The non-zero correlations also cause predicted response probabilities to diverge from the simple feature model’s predictions. Consider, for example, the predicted responses for the [b] stimulus. The correlation in this distribution leads 298 Noah H. Silbert to more predicted ‘b’ and ‘t’ responses and fewer predicted ‘p’ and ‘d’ responses relative to the simple feature model (top left). The correlations in the [p], [t], and [d] distributions produce analogous shifts in predicted confusion probabilities relative to the simple feature model. The model in the bottom right panel illustrates the combination of the cases in the bottom left and top right. In this model, salience on the place dimension varies with the level on the voicing dimension (i.e., [b] and [d] are more salient than are [p] and [t]), and, simultaneously, there are non-zero correlations within the perceptual distributions. This leads, not surprisingly, to fairly complicated patterns of predicted response probabilities. The locations of the decision bounds can vary, as well. In the model in the top left panel of Figure 1, the decision bounds are located exactly halfway between the means of the perceptual distributions. Suppose, however, that the horizontal bound were shifted downward toward the means of the [p] and [t] distributions. This would produce more predicted ‘b’ and ‘d’ responses overall, and fewer ‘p’ and ‘t’ responses overall, for all four stimuli (i.e., a general bias toward voiced responses). Shifts in the vertical bound would produce analogous changes on the place dimension, and, of course, such shifts could occur in conjunction with the shifts in the distributions’ means and non-zero correlations illustrated in the ­other panels of Figure 1, as well.
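The response probabilities predicted by models like those in Figure 1 follow directly from the bivariate Gaussian distributions and the axis-parallel decision bounds: each prediction is the mass of a stimulus’s perceptual distribution in the rectangular response region. The following Python sketch illustrates the computation with arbitrary, illustrative parameter values, not estimates from the experiments reported below.

```python
import numpy as np
from scipy.stats import multivariate_normal

def rect_prob(mean, corr, x_lo, x_hi, y_lo, y_hi):
    """Mass of a unit-variance bivariate Gaussian in an axis-aligned rectangle,
    computed from the joint CDF by inclusion-exclusion."""
    mvn = multivariate_normal(mean=mean, cov=[[1.0, corr], [corr, 1.0]])
    F = lambda x, y: mvn.cdf([x, y])
    return F(x_hi, y_hi) - F(x_lo, y_hi) - F(x_hi, y_lo) + F(x_lo, y_lo)

# Illustrative perceptual distribution means (place, voicing) and correlations
# for [p], [b], [t], [d]; decision bounds at x = 1.0 (place) and y = 1.0 (voicing).
means = {"p": (0.0, 0.0), "b": (0.0, 2.0), "t": (2.0, 0.0), "d": (2.0, 2.0)}
corrs = {"p": 0.0, "b": 0.0, "t": 0.0, "d": 0.4}
x_bound, y_bound = 1.0, 1.0
big = 8.0  # stands in for +/- infinity, given unit variances

# Response regions: 'p' = low place, low voicing; 'b' = low place, high voicing; etc.
regions = {"p": (-big, x_bound, -big, y_bound),
           "b": (-big, x_bound, y_bound, big),
           "t": (x_bound, big, -big, y_bound),
           "d": (x_bound, big, y_bound, big)}

for stim in "pbtd":
    row = {resp: rect_prob(means[stim], corrs[stim], *regions[resp])
           for resp in "pbtd"}
    print(stim, {r: round(p, 3) for r, p in row.items()})
```

Each printed row is one stimulus’s predicted identification-confusion profile; shifting a bound or adding a correlation changes the predicted profile in the ways described in the text.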

2.2 Other modeling options

Of course, GRT is not the only model of multidimensional perception on offer. One could, in principle, use the Fuzzy Logical Model of Perception (FLMP; Oden and Massaro 1978; Massaro and Oden 1980), the Hierarchical Categorization model (HICAT; Smits 2001a, b), or any of a number of Random Utility Models (RUMs, which have typically been implemented as multinomial logistic regres­ sion models,­ as in, e.g., Nearey 1990, 1992, 1997; Train 2003). The relationships among these models and between these models and GRT are complex and involve a ­number of issues. These include (though perhaps are not limited to) the as­ sumed mental processing architecture, the locus of stochastic variation, a priori assumptions about the nature­ of the stochastic variation (e.g., the functional form of the noise; independence vs. correlation), and the presumed mapping be­ tween the perceptual and stimulus-defining dimensions. It is worth noting that FLMP, HICAT, and RUMs have typically been applied to multidimensional arrays of parametrically manipulated stimuli (e.g., Smits 2001a) presented in quiet listening conditions, whereas in the current project, GRT is applied to a 2 × 2 array of noise-masked, naturally-produced CV syllables. Perception of voicing and place of articulation 299

Note, too, that HICAT and RUMs have typically been used to study perception of cues to sequences of phones (e.g., ‘see’, ‘sue’, ‘she’, ‘shoe’), whereas FLMP has been used to study cues within consonants (e.g., [pæ], [bæ], [tæ], [dæ]; Oden and Massaro 1978). While these models have been used with seemingly substantially different aims, nothing in principle would prevent the use of any of these models to ana­ lyze the same kinds of data sets, be it classifications of combinations of cues to (sequences of) phones or identifications of combinations of distinctive features within phones. The choice of model in the present work is motivated largely by two related properties of (Gaussian) GRT, as described above in detail. Specif­ ically, GRT distinguishes between within-stimulus and between-stimuli percep­ tual interactions, and between perceptual and decisional processes. The assump­ tion of Gaussian perceptual distributions further allows for a straightforward quantification of within-stimulus and between-stimuli perceptual interactions (i.e., correlations and equality of marginal perceptual distributions, respectively).

2.2.1 The locus of random variation

In models of perception and response selection, it is typically either assumed that perception is noisy and response selection is deterministic (Green and Swets 1966; Ashby and Townsend 1986; Nearey 1992; Train 2003) or that perception is deterministic and response selection is probabilistic (Nosofsky 1986; Smits 2001b). It is worth noting that the ‘category boundaries’ in HICAT (Smits 2001b) and in the territorial maps associated with multinomial logit RUMs (Nearey 1990) are conceptually distinct from the decision bounds of GRT. These ‘category boundaries’ reflect the points on either side of which one response is more prob­ able than another (Nearey 1990), whereas the decision bounds in GRT define the points on either side of which only one response is possible, independent of the locations at which the perceptual distributions cross over one another. Hence, the ‘category boundaries’ in HICAT and RUM-based territorial maps are not equiv­ alent to GRT decision bounds. Models with noisy perception and deterministic response selection are diffi­ cult, if not impossible, to distinguish empirically from models with deterministic perception and probabilistic response selection, though certain unusual category structures can tease the two types of model apart (Rouder and Ratcliff 2004). Thus, the selection of one or another type of model should correspond as closely as possible to relevant experimental factors. Stimuli presented in quiet listen­ ing conditions are likely better modeled with deterministic perception and 300 Noah H. Silbert

­probabilistic response selection, as in FLMP or HICAT, while noise-embedded stimuli are better modeled with noisy perception and deterministic response se­ lection, as in Signal Detection Theory (SDT; Green and Swets 1966) and GRT.

2.2.2 Assumptions about noise characteristics

In GRT, perceptual noise is assumed to be multivariate Gaussian, and indepen­ dence is explicitly tested rather than assumed. Note that in previous applications of multidimensional SDT to speech perception, independence has been (at least implicitly) assumed (e.g., Kingston and Macmillan 1995; Kingston et al. 1997; Macmillan et al. 1999; Kingston et al. 2008). The GRT model employed here pro­ vides a statistically rigorous method for testing independence (as part of the more general hypothesis that simple feature structure accurately describes the percep­ tion of distinctive features). HICAT relies on the assumptions that acoustic cue distributions for differ­ ent segments have identical covariance matrices and that covariance is zero in the category goodness functions for syllables (i.e., sequences of two phones). Both of these assumptions can be relaxed and tested in the GRT framework. In FLMP, fuzzy logic values are assumed to be independent from one another, though Batchelder and Crowther (1997) show that FLMP is a special case of a more general multinomial processing tree model that can incorporate statistical dependencies.

2.2.3 Assumptions about perceptual and physical dimensions

All of these models are used to describe unobserved, and in principle unobserv­ able, mental (i.e., perceptual and decisional) processes. In studies of cue per­ ception, the mapping between acoustic cues and corresponding perceptual di­ mensions can be justified, at least in part, in psychoacoustic terms (Nearey 1990; Smits 2001b). In the present case, on the other hand, the perceptual dimensions are assumed to correspond to a more abstract phonological structure, which is itself not directly observable in produced speech. A bad fit between model and data may indicate that this assumption is unjustified, whereas a good model fit is merely consistent with this assumption. Note, though, that the assumption is perhaps less strong than it might seem at first. Cast in terms of expected patterns of perceptual confusions, the assump­ tion that the perceptual dimensions correspond to place and voicing boils down to the assumption that the basics of feature structure will be present in the fitted Perception of voicing and place of articulation 301 model. If place and voicing appropriately define the perceptual axes, then the configuration of perceptual distribution means should describe a quadrilateral: the perceptual distributions for labial [p] and [b] should be aligned (at least roughly) in parallel to those for alveolar [t] and [d], and the distributions for voiced [b] and [d] should be aligned (at least roughly) in parallel to those for voiceless [p] and [t]. Note that roughly parallel alignment allows for deviations from configurations describing a square, a rectangle, or a parallelogram (i.e., strictly parallel alignments).
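One way to make the ‘roughly parallel’ criterion concrete is to compare the directions of the vectors connecting fitted perceptual means for corresponding pairs of stimuli, e.g., [p]→[t] vs. [b]→[d] for place and [p]→[b] vs. [t]→[d] for voicing. The sketch below does so with made-up mean estimates; the angular summary is an illustrative choice on my part, not part of the model.

```python
import numpy as np

# Hypothetical fitted perceptual means (place, voicing) for [p], [b], [t], [d].
mu = {"p": np.array([0.0, 0.0]), "b": np.array([0.1, 2.2]),
      "t": np.array([1.6, -0.1]), "d": np.array([2.4, 2.1])}

def angle_deg(v1, v2):
    """Angle (degrees) between two 2-D vectors; small angles = roughly parallel."""
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Place axis: labial-to-alveolar vectors at each voicing level.
place_angle = angle_deg(mu["t"] - mu["p"], mu["d"] - mu["b"])
# Voicing axis: voiceless-to-voiced vectors at each place level.
voicing_angle = angle_deg(mu["b"] - mu["p"], mu["d"] - mu["t"])
print(round(place_angle, 1), round(voicing_angle, 1))
```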

3 A multilevel Gaussian GRT model

The multilevel Gaussian GRT model used here employs, for each individual sub­ ject, the standard 2 × 2 model. Figure 2 shows a schematic representation of the multilevel model. Individual subject data (i.e., confusion matrices) are indicated by the matrices along the bottom of the figure. In each matrix, the rows corre­ spond to stimulus categories and the columns to responses, so, e.g., d12 indicates the number of ‘2’ responses to the ‘1’ stimulus, d13 the number of ‘3’ responses to the ‘1’ stimulus, and so on. Governing data sets 1, 2, . . . , M are the individual-­ subject 2 × 2 Gaussian GRT models, each with four perceptual distributions and two decision bounds, and governing the individual subject GRT models is a group-level model. A detailed description of the model and the distributional as­ sumptions of the model parameters at each level is given by Silbert (2012).
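The following Python sketch illustrates, in schematic form, the generative structure shown in Figure 2: group-level parameters govern subject-level GRT parameters, which generate each subject’s 4 × 4 confusion matrix via noisy perception and deterministic response selection. It is a simplified stand-in for, not a reproduction of, the actual Bayesian model specification in Silbert (2012); the spreads, bounds, and trial counts are placeholders, and subject-level correlations and bound variability are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = ["p", "b", "t", "d"]
group_means = {"p": (0.0, 0.0), "b": (0.0, 2.0), "t": (2.0, 0.0), "d": (2.0, 2.0)}
bounds = (1.0, 1.0)            # shared decision bounds (place, voicing), for simplicity
subject_sd = 0.3               # spread of subject-level means around group-level means
n_subjects, n_trials = 8, 200  # trials per stimulus per subject

def classify(percepts, x_bound, y_bound):
    """Deterministic response selection from axis-parallel decision bounds."""
    place = (percepts[:, 0] > x_bound).astype(int)
    voice = (percepts[:, 1] > y_bound).astype(int)
    return np.array([["p", "b"], ["t", "d"]])[place, voice]

confusions = []
for s in range(n_subjects):
    d = np.zeros((4, 4), dtype=int)              # rows = stimuli, cols = responses
    for i, stim in enumerate(labels):
        # Subject-level perceptual mean drawn around the group-level mean.
        mu = rng.normal(group_means[stim], subject_sd)
        percepts = rng.multivariate_normal(mu, np.eye(2), size=n_trials)
        resp = classify(percepts, *bounds)
        for j, r in enumerate(labels):
            d[i, j] = np.sum(resp == r)
    confusions.append(d)

print(confusions[0])  # one subject's simulated 4x4 confusion matrix
```

Fitting the real model inverts this generative process, estimating subject- and group-level parameters simultaneously from the observed confusion matrices.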

4 Experiment 1: Place and voicing in onset position

Experiment 1 is an investigation of the perception of place of articulation and voicing in syllable-initial stop consonants [p], [b], [t], and [d]. Some previous work investigating the integration of acoustic-phonetic cues to voicing and place (e.g., VOT, formant frequency at voicing onset) in categorization tasks provides some evidence that cues to place and voicing are independent (Oden and Mas­ saro 1978), while other work suggests that cues to place and voicing interact (Sawusch and Pisoni 1974; Eimas et al. 1981; Benkí 2001). However, whereas each of these earlier projects looked at multiple phono­ logical dimensions, the focus in each was firmly on the relationships between particular acoustic-phonetic cues, while the focus of the present work is the more abstract phonological level. Multilevel GRT is applied to the identification of 302 Noah H. Silbert

Fig. 2: Schematic model representation. The matrices along the bottom indicate individual- subject confusion matrices, with dij in a matrix indicating the number of times stimulus i is given response j by a given subject. The schematic models immediately above the matrices indicate GRT models fit to each matrix, and the schematic model at the top indicates a group-level model that governs the individual-level models. Note that all individual and group-level parameters are estimated simultaneously. See text for additional details. noise-masked, naturally-produced stimuli in an effort to model abstract feature perception more directly.

4.1 Stimuli

In order to ensure that the subjects were not simply able to attend to some irrelevant acoustic feature(s) of a particular token of a particular category, a small degree of within-category variability was introduced by using four tokens of each stimulus type – [pa], [ba], [ta], and [da] – all produced by the author (a mid-30s Midwestern male phonetician). Multiple acoustic measurements were taken (see Appendix B) and extensive pilot experimentation was carried out to ensure that no particular token was overly acoustically distinct from the others. The stimuli were recorded during a single session in a quiet room via an Electrovoice RE50 microphone and a Marantz PMD560 solid-state digital recorder at 44.1 kHz sampling rate with 16-bit depth.

In order to avoid ceiling effects, stimuli were embedded in ‘speech-shaped’ noise. The speech-shaped noise was created by filtering white noise such that higher frequencies had relatively lower amplitude than lower frequencies. The filtering was carried out in the frequency domain by multiplying an appropriate white noise spectrum by a Gaussian curve (centered at zero) with a standard deviation of approximately 5.5 kHz. Noise waveforms were generated via inverse Fourier transformation and added to the speech stimuli. On each trial, the speech signal started 200 ms after the onset of the noise, and the noise ended approximately 500 ms after the end of the speech signal.
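A minimal sketch of the noise-generation procedure just described, assuming a 44.1 kHz sampling rate and peak normalization (both illustrative choices):

```python
import numpy as np

def speech_shaped_noise(n_samples, fs=44100, sigma_hz=5500.0, seed=None):
    """White noise filtered in the frequency domain by a zero-centered Gaussian
    envelope (sd ~5.5 kHz), so higher frequencies have relatively lower amplitude."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(n_samples)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    envelope = np.exp(-0.5 * (freqs / sigma_hz) ** 2)
    shaped = np.fft.irfft(spectrum * envelope, n=n_samples)
    return shaped / np.max(np.abs(shaped))   # peak-normalize

noise = speech_shaped_noise(44100, seed=1)   # one second of speech-shaped noise
```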

4.2 Procedure

Each participant was seated in a double-walled sound-attenuating booth with four ‘cubicle’ partitions. One, two, or three participants could run simultane­ ously, each in front of his or her own computer terminal. Stimuli were presented at −3 dB signal-to-noise ratio at approximately 60 dB SPL via Tucker-Davis-­ Technologies Real-Time processor (TDT RP2.1; sampling rate 24,414 Hz), pro­ grammable attenuator (TDT PA5), headphone buffer (TDT HB6), and Sennheiser HD250 II Linear headphones. Before the first session (familiarization and train­ ing), participants read a written instruction sheet, were given verbal instructions, and were prompted for questions about the procedure. Sessions consisted of 1 to 4 experimental blocks. Experimental blocks lasted approximately 25 minutes. Each subject completed 10 blocks, 2 in each of five stimulus presentation base- rate conditions (i.e., conditions in which a subset of stimuli, e.g., [p] and [b], were presented more or less often than the complementary subset). Only the equal base-rate condition is analyzed and discussed here. Each experimental block began with brief written instructions reminding participants to respond as accurately and as quickly as possible and providing explicit guessing advice for trials on which the participant was uncertain of the stimulus identity. After the instructions were cleared from the screen, four on- screen ‘buttons’ corresponding to the buttons on a hand-held button box became 304 Noah H. Silbert visible. On the on-screen buttons, the letters ‘p’, ‘b’, ‘t’, and ‘d’ appeared in black text. Button-response assignments were randomly assigned for each block with the constraint that the basic dimensional structure was always maintained (e.g., ‘p’ and ‘t’ always appeared as neighbors on a single dimension, never on opposite corners). Each trial consisted of the following steps: (1) a visual signal (the word ­‘listen’) presented on the computer monitor; (2) half a second of silence; (3) stim­ ulus presentation; (4) response; (5) feedback; and (6) 1 second of silence. Re­ sponses were collected via a button box with buttons arranged to correspond to the structure of the stimulus space (i.e., two levels on each of two dimensions). Feedback was given visually via color-coded (green for correct, red for incorrect) text above and on the on-screen buttons. Either the word ‘Correct’ or the word ‘Incorrect’ appeared along with brief descriptions of the presented stimulus and the response chosen. The feedback text disappeared and the button text color was reset to black before each successive trial. Each participant received two short (approximately 15 minutes) and two reg­ ular length blocks to familiarize them with the stimuli and ensure that perfor­ mance was consistently above chance. The data analyzed here consist of 800 ­trials completed in two blocks of 400 trials each. Participants were paid $6/hour with a $4/hour bonus for completion of the experiment. The participant with the highest accuracy received a $20 bonus, as did the participant with the fastest overall mean response time. The bonuses were described to participants when they began the experiment. Trial-by-trial feedback and bonuses were imple­ mented in order to ensure that participants had the best possible opportunity to perform well.
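For reference, one common way to realize a target signal-to-noise ratio such as the −3 dB used here is to scale the noise relative to the RMS level of the speech before adding the two waveforms. The sketch below is illustrative only and is not a description of the actual presentation software; the temporal alignment (200 ms noise lead, roughly 500 ms noise tail) would be handled separately.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that 20*log10(rms(speech)/rms(noise)) equals snr_db,
    then add it to the speech (noise assumed at least as long as the speech)."""
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    noise = noise[:len(speech)]
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20.0))
    scaled_noise = noise * (target_noise_rms / rms(noise))
    return speech + scaled_noise

# Example with placeholder waveforms at -3 dB SNR.
fs = 44100
speech = np.sin(2 * np.pi * 150 * np.arange(fs // 2) / fs)   # stand-in "speech"
noise = np.random.default_rng(2).standard_normal(fs)
mixed = mix_at_snr(speech, noise, snr_db=-3.0)
```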

4.3 Subjects

Eight adults (two male, six female) were recruited from the university community. The average age of participants was 21 (19–23). All were native speakers of English with 4.5 (1.5–7) years of second language study on average. All but one were right handed, and all but one were from the Midwest (the other was from the West). All participants were screened to ensure normal hearing.

4.4 Analysis

The multilevel Gaussian GRT model described above was fit to the eight subjects’ data using WinBUGS (Lunn et al. 2000), R (R Development Core Team 2012), and the R package R2WinBUGS (Sturtz et al. 2005). Response counts were tallied by stimulus category, not by individual stimuli. Statistical tests of the response patterns between each pair of individual tokens within each category revealed no consistent token-by-token differences (see Appendix A for the data and the results of the token difference tests). Details about the fitting procedure are described in Silbert (2012), and all software used in fitting the model is available by request from the author.

Table 1: Accuracy results, Experiment 1. p(C) = proportion correct.

Subject  1     2     3     4     5     6     7     8
p(C)     0.78  0.84  0.74  0.61  0.78  0.87  0.86  0.68

4.5 Results

Table 1 shows each subject’s overall accuracy for Experiment 1. Overall accuracy was fairly high, ranging from 61% to 87% correct.

Figure 3 shows a plot of observed (x-axis) and predicted (y-axis) response proportions for all eight subjects. Labial ‘p’ and ‘b’ responses are indicated by open symbols, alveolar ‘t’ and ‘d’ responses by filled symbols. Voiceless ‘p’ and ‘t’ responses are indicated by circles, voiced ‘b’ and ‘d’ by squares. The closer the points are to the dotted diagonal line, the closer the correspondence between observed and fitted response probabilities. The vertical lines indicate the 95% highest density intervals (HDIs), which are the regions containing the most probable 95% of the posterior distribution of predicted response probabilities (i.e., HDIs are a measure of the uncertainty of parameter and predicted probability estimates). Consistent with the assumption that phonological features define the perceptual dimensions, the predicted response probabilities correspond very closely to observed response probabilities; the dotted diagonal line is within the 95% HDI for every point.

Figure 4 shows the fitted model. Within each panel, the x-axis indicates place of articulation, with labial (p, b) to the left and alveolar (t, d) to the right, and the y-axis indicates voicing, with voiceless (p, t) at the bottom and voiced (b, d) at the top. Hence, the perceptual distribution and response region for [p] are located at the bottom left, for [b] at the top left, for [t] at the bottom right, and for [d] at the top right. The smaller panels show the fitted model for individual subjects 1 through 8 (moving counter-clockwise from top left). The plotted equal likelihood contours

Fig. 3: Observed and predicted response probabilities, place × voicing in onset position.

and decision bounds indicate the median value from the sampled posterior distri­ bution for each parameter. Uncertainty of parameter estimates is indicated by 95% HDIs. For the decision bound estimates, these are indicated by lines perpen­ dicular to and crossing the bounds near the top and right of each panel. As described above, the perceptual distribution mean for [p] was fixed at (0,0). For the other three perceptual distributions, the median estimate of the dis­ tribution mean is indicated by the intersection of the horizontal and vertical lines within the equal likelihood contour; the length of these lines indicates the HDIs on each dimension. The median correlation estimate for each perceptual distri­ bution is indicated by the shape of the equal likelihood contour, and the corre­ sponding HDIs are indicated by the vertical line plotted in the corner nearest the contour; the accompanying horizontal lines indicate correlations of 1, 0, and −1, from top to bottom. Hence, if zero is within the HDI for a correlation parameter, the vertical line will intersect the middle horizontal line; the top and bottom hor­ izontal lines indicate the maximum and minimum possible correlation values. The large panel in the middle shows the fitted group-level model. The deci­ sion bounds, perceptual distribution means, and equal likelihood contour shapes Perception of voicing and place of articulation 307

Fig. 4: Fitted multilevel Bayesian Gaussian GRT model, onset place and voicing. The large panel illustrates the group-level parameters, while the smaller panels illustrate the individual- subject-level parameters. Ellipses indicate equal likelihood contours; the degree of deviation from circular indicates the estimate of the correlation within each distribution. The small horizontal lines in the corners of each panel indicate −1, 0, and 1, while the small vertical lines in the corners indicate the uncertainty of the correlation estimates. The large vertical and horizontal lines illustrate the estimated decision bounds, with the smaller perpendicular lines indicating the uncertainty of these estimates. The smaller vertical and horizontal lines inside the ellipses indicate, at their intersection, the estimates of the perceptual distribution means, and, via their lengths, the uncertainty of the estimates of the means. See text for additional details and discussion of results. indicate the median group-level parameters that govern the corresponding deci­ sion bounds, distribution means, and correlations at the individual subject level. The HDIs for these parameters are indicated as in the smaller panels. In each panel, a coarse correspondence between features and the configura­ tion of perceptual distributions is evident: the labials [p] and [b] are to the left, the alveolars [t] and [d] to the right, the voiceless [p] and [t] on the bottom, and the voiced [b] and [d] on the top. Note that this correspondence is not guaranteed by the model a priori (e.g., the distributions for [b] and [d] could, in theory, be ordered differently on the place dimension). 308 Noah H. Silbert

However, the fitted model exhibits a number of consistent deviations from feature structure as well. First, and most obviously, the perceptual salience be­ tween voiced [b] and [d] is substantially greater than the perceptual salience be­ tween voiceless [p] and [t]; there is much less overlap between the [b] and [d] distributions than between the [p] and [t] distributions (or, equivalently, the mar­ ginal distributions on the place dimension exhibit more separation at the voiced level than at the voiceless level). Similarly, though to a smaller degree, the sa­ lience between alveolar [t] and [d] is greater than the salience between labial [p] and [b]. As discussed above, these results are consistent with what reported con­ fusion matrices suggest (e.g., Benkí 2003; Miller and Nicely 1955; Phatak and ­Allen 2007). These differences in salience differ in magnitude and in their relationship to response bias. On the one hand, there is no obvious general bias toward voiced or voiceless responses; the voicing contrasts are arrayed fairly symmetrically above and below the horizontal decision bound. On the other hand, at the voiceless level (i.e., below the horizontal bound), there is a small bias toward labial ‘p’ re­ sponses relative to alveolar ‘t’ responses, while at the voiced level (i.e., above the horizontal bound), there is a large bias toward alveolar ‘d’ responses relative to labial ‘b’ responses. That is, the vertical decision bound is located such that more of the [t] distribution is in the ‘p’ response region than vice versa, whereas more of the [b] distribution is in the ‘d’ response region than vice versa. The same basic patterns hold at the individual and group levels, though there is a fair amount of variability between subjects. Some subjects exhibit a much larger change in voicing salience across levels of place (e.g., subject 2, middle left panel), while others exhibit no such difference (e.g., subject 8, top right panel). Some subjects exhibit more symmetric shifts in place salience (e.g., subjects 6 and 7, middle and bottom right panels, respectively), and some less symmetric shifts (e.g., subject 8, top right). Note, too, that the differences in overall accuracy given in Table 1 are re­ flected in the overall spacing between perceptual distributions. For example, sub­ ject 4’s low accuracy is reflected by a large degree of overlap between the [p], [b], and [t] distributions (second panel from left along the bottom), while subject 6’s high accuracy is reflected by relatively wide spacing between distributions (bottom right panel). For the most part, zero is within the HDIs for the estimated correlation pa­ rameters, indicating that statistical independence tends to hold within per­ ceptual distributions. Voiced alveolar [d] is the only stimulus for which the cor­ relation is consistently non-zero, either at the individual subject or group level. Although zero is within the HDI for five of the eight subjects, the posterior distri­ bution of correlations for [d] tends toward positive values, and the estimates for Perception of voicing and place of articulation 309 three subjects are extreme enough to pull the group-level HDI away from zero. There is a similar tendency for non-zero correlation in the [t] distributions, though it is not as consistent or extreme.
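The 95% HDIs used throughout are the narrowest intervals containing 95% of the posterior samples for a given parameter; a correlation HDI that excludes zero is the criterion for concluding that independence fails within that perceptual distribution. A minimal sketch of this computation from a vector of MCMC samples (the samples below are simulated for illustration):

```python
import numpy as np

def hdi(samples, mass=0.95):
    """Highest density interval: the narrowest interval containing `mass`
    of the sampled posterior values (assumes a unimodal posterior)."""
    sorted_samples = np.sort(samples)
    n = len(sorted_samples)
    n_in = int(np.ceil(mass * n))
    widths = sorted_samples[n_in - 1:] - sorted_samples[:n - n_in + 1]
    start = int(np.argmin(widths))
    return sorted_samples[start], sorted_samples[start + n_in - 1]

# E.g., simulated posterior samples of a correlation parameter centered near 0.25.
samples = np.random.default_rng(3).beta(10, 6, size=4000) * 2 - 1
lo, hi = hdi(samples)
print(round(lo, 3), round(hi, 3))  # does the interval exclude zero?
```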

4.6 Experiment 1 discussion

At least some of the divergences from basic feature structure may be driven by acoustic differences. For example, consider the change in voicing salience across place levels. The amplitude of the higher frequencies of the stop release burst tends to be greater in alveolars than labials (see, e.g., Stevens 2000: Chapter 7), and the difference between voiced and voiceless VOTs tends to be slightly ­larger for alveolars than labials (Volaitis and Miller 1992; Kessinger and Blumstein 1997). For a particular signal-to-noise ratio, then, these kinds of acoustic dif­ ferences may cause the voicing contrast to be more salient in alveolars than in labials. The fact that place salience changes across voicing levels is perhaps a bit more surprising, though a possible explanation may be the fact that some portion of the place information transmitted via formant transitions is displaced or de­ stroyed by long VOTs in voiceless stops (Liberman et al. 1958; Stevens and Klatt 1974), and the masking noise is likely to reduce (or eliminate) any audible for­ mant transitions occurring in the aspiration of the voiceless stops. Acoustic measurements of the tokens used here seem, at least initially, to support the hypothesis that acoustic differences explain the fitted model’s devia­ tion from simple feature structure (see, e.g., Figure B.1 and B.2 in Appendix B). However, the relevance of the apparently close correspondence between, e.g., F1 × F2 space and the fitted perceptual space for this experiment is made less plausible by very similar acoustic measurements for the coda stimuli (see Figure B.4, Appendix B) and very different patterns in the fitted perceptual model (see Figure 6 below). Deviations from simple feature structure can be interpreted as incomplete feature generalization, and the presence of within-distribution independence for three of these consonants may be interpreted as a lack of cross-talk between the cognitive channels processing voicing and place (Ashby 1989). Follow-up studies are currently being conducted to probe the degree to which perceptual devia­ tions from simple feature structure – changes in salience and patterns of within-­ category correlation – are driven by multidimensional production acoustics. Another possible explanation for the perceptual interactions and apparent differences in response bias on the place dimension is a difference in frequency-­ weighted phonological neighborhood densities between the syllables used as 310 Noah H. Silbert stimuli (raw neighborhood densities indicate no difference). The frequency-­ weighted neighborhood densities for these syllables are as follows: for [ta], 29681; for [da], 3460; for [pa], 1447; and for [ba], 13242 (Vaden et al. 2005). If neighbor­ hood density is driving the modeled response bias or perceptual effects, it should be causing similar effects for [p] and [d], on the one hand, and [t] and [b] on the other. Whether neighborhood effects are expected to enhance or inhibit phono­ logical perception (Vitevitch and Luce 1999), there is no obvious correspondence between these neighborhood numbers and the perceptual patterns and response bias results described above. Finally, the multilevel (Bayesian Gaussian) GRT model represents important progress relative to previous work that has employed analyses relying on strong assumptions of independence within distributions (Kingston and Macmillan 1995; Kingston, Macmillan, et al. 1997; Macmillan et al. 1999; Kingston, Diehl, et al. 
2008) and those which have only considered individual subjects’ data sep­ arately (e.g., Thomas 2001; Silbert et al. 2009). The present approach allows rig­ orous statistical testing of independence (i.e., consideration of the correlation parameter HDIs with respect to zero) in addition to enabling inferences about feature perception to be drawn for the population from which the individual sub­ jects were drawn.

5 Experiment 2: Place and voicing in coda position

Experiment 2 is an investigation of place of articulation and voicing perception in syllable-final stops [p], [b], [t], and [d]. While at an abstract level, these are the same features and consonants as those investigated in Experiment 1, the articula­ tion of and acoustic cues for phonological place and voicing distinctions differ considerably between onset and coda positions. For example, VOT is not a cue to voicing in coda position, whereas the ratio of vowel and consonant duration is (Port and Dalby 1982). In addition, because VOT does not affect the amount of place information provided by formant transitions in coda position, these cues to place of articulation should be present in equal degrees in the voiced and voice­ less stops in coda position. More generally, the difference between onset and coda syllable position ­represents a case of contextual variation, a central phenomenon in phonology. Hence, Experiment 2 serves, in part, as an illustration of the utility of the current approach to studying the relationship between higher-level phonological struc­ ture and feature perception. Perception of voicing and place of articulation 311

5.1 Stimuli

Four tokens of each stimulus type – [ap], [ab], [at], [ad] – were produced by the author. The stimuli for both experiments (as well as two others) were recorded on the same equipment during the same session. Approximately half of the stimuli had audible closure release in the absence of masking noise.

5.2 Procedure

The procedure was identical to that employed in Experiment 1, with appropriate changes made to instructions and button labels.

5.3 Subjects

Eight subjects participated in Experiment 2, five of whom also participated in ­Experiment 1 (subjects 1, 2, 4, 6, and 7). This was done in order to hold a subset of the listeners’ auditory systems constant so that direct comparisons across ex­ periments (i.e., across syllable positions) could be made. The average age of par­ ticipants was 21.5 (19–23). All were native speakers of English with, on average, 4.3 (3–6.5) years of second language study. All but one were right handed, and all but one were from the Midwest (the other was from the South). All participants were screened to ensure normal hearing.

5.4 Analysis

Analyses were carried out in the same manner as those in Experiment 1. Statistical tests of response patterns for each pair of tokens within each category revealed no consistent differences between tokens within categories (see Appendix A).

5.5 Results

Table 2 provides a summary of the results from Experiment 2. As in Experiment 1, overall accuracy was fairly high in Experiment 2, though lower here than in Experiment 1, ranging from 54% to 74% correct. (The full confusion matrices are given in Appendix A.)

Table 2: Accuracy results, Experiment 2. p(C) = proportion correct.

Subject    1     2     3     4     5     6     7     8
p(C)     0.65  0.57  0.71  0.54  0.61  0.71  0.58  0.74

Fig. 5: Observed and predicted response probabilities, place × voicing in coda position.

The predicted and observed response probabilities for Experiment 2 are plotted in Figure 5. As in Experiment 1, the model fits the data extremely well; the dotted diagonal line is within the 95% HDI for every point.

The fitted model for Experiment 2 is displayed in Figure 6, as above with eight individual-level perceptual spaces depicted in the small panels, arrayed counter-clockwise from the top left, and the group-level model shown in the large panel in the middle.

Fig. 6: Fitted multilevel Bayesian Gaussian GRT model, coda place and voicing. See text for details. The large panel illustrates the group-level parameters, while the smaller panels illustrate the individual-subject-level parameters. Ellipses indicate equal likelihood contours; the degree of deviation from circular indicates the estimate of the correlation within each distribution. The small horizontal lines in the corners of each panel indicate −1, 0, and 1, while the small vertical lines in the corners indicate the uncertainty of the correlation estimates. The large vertical and horizontal lines illustrate the estimated decision bounds, with the smaller perpendicular lines indicating the uncertainty of these estimates. The smaller vertical and horizontal lines inside the ellipses indicate, at their intersection, the estimates of the perceptual distribution means, and, via their lengths, the uncertainty of the estimates of the means. See text for additional details and discussion of results.

A number of properties of the place and voicing space specific to coda position are immediately apparent. First, for all subjects and at the group level, the voicing distinction is much more salient than is the place distinction; the overlap between the perceptual distributions on the voicing dimension is much smaller than the overlap between distributions on the place dimension. As discussed above, this is consistent with patterns apparent in previously reported confusion matrices (e.g., Wang and Bilger 1973; Benkí 2003; Cutler et al. 2004). Second, although there is some variability between subjects, salience on the place dimension does not vary across levels of voicing; on the place dimension, the perceptual distributions for [p] and [b] are closely (vertically) aligned, as are the distributions for [t] and [d] (i.e., the marginal place distributions do not differ substantially across levels of voicing). On the other hand, as in onset position, voicing salience tends to shift across levels of place. For four subjects (1, 4, 6, and 8), and at the group level, the distinction between [t] and [d] is more salient than the distinction between [p] and [b]. Third, at the group level and for most of the individual subjects, statistical independence appears to hold for all four perceptual distributions, though there is some variation at the individual subject level. For example, subject 2 (middle left panel) exhibits negative correlations in the [b] and [t] distributions, while subjects 4 and 5 (middle two bottom panels) have positive correlations in the [b] and [d] distributions. For most subjects, zero is well within the associated HDIs, although the median correlation values for each consonant are noticeably non-zero (i.e., the plotted equal likelihood contours describe ellipses rather than circles).

There is no consistent response bias with respect to either place or voicing. With respect to place, for subjects 4 and 7 (second from left along the bottom and middle right panels, respectively), there is a strong bias toward alveolar responses, whereas for subjects 5 and 8 (second from right along the bottom and top right panels, respectively), there seems to be a (smaller) bias toward labial responses, and for subject 1 the perceptual distributions appear somewhat shifted such that there is an apparent bias toward labial responses in the voiceless area of perceptual space and a bias toward alveolar responses in the voiced area (i.e., the perceptual distributions are not vertically aligned as closely for this subject as they are for the other subjects). With respect to voicing, a few subjects seem to show a small bias toward voiceless responses (e.g., subjects 1 and 7, top left and middle right panels, respectively), though most subjects (and the group-level model) exhibit no consistent bias toward voiced or voiceless responses.

5.6 Experiment 2 discussion

As noted above, to the extent that the modeled perceptual effects are driven by acoustics, acoustic and articulatory differences between onset and coda consonants were expected to produce different patterns of (deviation from) feature structure in perception. For the onset results, it was hypothesized that place information lost in long VOTs reduced the perceptual salience of place for the voiceless stops relative to the (highly salient) voiced stops. In coda position, perceptual salience of place is essentially identical for the voiced and voiceless stops, despite the fact that F2 cues to place differ for [d] and [t] in both onset and coda positions (at least in the stimuli used here; compare Figure B.2 and Figure B.4, Appendix B).

A similar tendency for voicing salience to change across levels of place was found in both onset and coda position. In onset position, differences in VOT across place of articulation might account for this (Volaitis and Miller 1992; Kessinger and Blumstein 1997), but it is not clear what might be driving this difference in coda position. In any case, determining the cause of this perceptual interaction is beyond the scope of this paper.

While independence tends to hold for most subjects for all four of the stops studied here, when it fails, the patterns of failure differ in onset and coda position. In onset position, failure of independence occurs primarily due to positive correlation in the [d] distribution and negative correlation in the [t] distribution, whereas in coda position, there is less consistency across subjects. There also seems to be less consistency in the patterns of response bias in coda position.

6 Experiment 3

The patterns of perceptual interactions described above may be due to any of a number of factors. Although an exhaustive survey of these factors is well beyond the scope of this paper, it may be that some (or all) of these interactions were induced by the speech-shaped noise masker used in the first two experiments. Ruling this possibility out will provide some evidence that the observed pattern of interactions generalizes beyond Experiment 1. We probe this possibility in Experiment 3 by using a multi-talker babble noise masker with the speech signals used in Experiment 1.

6.1 Stimuli

The stimuli were the same stimuli used in Experiment 1, except that a multi-talker babble noise masker was used instead of a speech-shaped noise masker.

6.2 Procedure

The procedure was identical to that employed in Experiment 1, except that stimuli were presented at –10 dB signal-to-noise ratio. Although there is no established method for matching performance levels across maskers in four-choice classification tasks, pilot experimentation indicated that this signal-to-noise ratio produced approximately the same overall accuracy level as the larger signal-to-noise ratio used with the speech-shaped noise in Experiment 1.
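To make the masking manipulation concrete, the sketch below shows one standard way of scaling a masker so that a speech token is mixed at a target signal-to-noise ratio defined over RMS amplitude. This is an illustrative sketch of the general approach under stated assumptions, not the script actually used to prepare the stimuli; the variable names (speech, babble) and the assumption that both are numeric waveform vectors at the same sampling rate are hypothetical.

```r
# Minimal sketch: mix a speech token with a masker at a target SNR (in dB),
# defined here in terms of RMS amplitude. Not the author's stimulus-preparation
# code; 'speech' and 'babble' are assumed numeric waveforms at one sampling rate.
mix_at_snr <- function(speech, babble, snr_db) {
  rms <- function(x) sqrt(mean(x^2))
  noise <- babble[seq_along(speech)]                   # trim masker to the token's length
  target_noise_rms <- rms(speech) / 10^(snr_db / 20)   # SNR (dB) = 20 * log10(rms_s / rms_n)
  speech + noise * (target_noise_rms / rms(noise))
}

# e.g., a -10 dB SNR mixture, as in Experiment 3:
# mixed <- mix_at_snr(speech, babble, snr_db = -10)
```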

6.3 Subjects

Seven subjects who participated in either Experiment 1 or 2 also participated in Experiment 3. One additional subject who did not participate in either previous experiment also participated in Experiment 3. One subject produced nearly iden­ tical response patterns to all stimuli in Experiment 3; this subject’s data were discarded prior to analysis.

6.4 Analysis

Analyses were carried out in the same manner as those in Experiments 1 and 2. Statistical tests of response patterns for each pair of tokens within each category revealed no consistent differences between tokens within categories (see Appendix A).

6.5 Results

Table 3 provides a summary of the results from Experiment 3. As in Experiments 1 and 2, overall accuracy was reasonably high in Experiment 3. Accuracy for this experiment ranged from 61% to 81% correct. (The full confusion matrices are given in Appendix A.)

Table 3: Accuracy results, Experiment 3. p(C) = proportion correct.

Subject    1     2     3     4     5     6     7
p(C)     0.66  0.81  0.72  0.61  0.68  0.76  0.78

Fig. 7: Observed and predicted response probabilities, place × voicing in onset position with multi-talker babble masking noise.

Predicted and observed response probabilities for Experiment 3 are shown in Figure 7. As in the previous two experiments, there is a close correspondence between observed and predicted response probabilities, with the diagonal line falling well within the 95% HDI for every point.

Figure 8 shows the fitted model for Experiment 3. As above, the small panels show the individual-subject-level models and the larger panel in the middle shows the group-level model. The fitted model for Experiment 3 is similar, though not identical, to the fitted model from Experiment 1.

The configuration of the means of the four perceptual distributions is very similar to the configuration of means in the model fit to the Experiment 1 data. More specifically, the place distinction between voiceless [p] and [t] is substantially less salient than the distinction between voiced [b] and [d], and the voicing distinction between labial [p] and [b] is somewhat less salient than the distinction between alveolar [t] and [d].

In addition, the differences in salience of the place distinctions across levels of voicing are not symmetric about the vertical (place) decision bound, and the differences in salience of the voicing dimension across levels of place are not symmetric about the horizontal (voicing) bound. More of the [b] perceptual distribution is in the 'd' response region than vice versa, and more of the [t] distribution is in the 'd' response region than vice versa. This pattern holds at the group level and for each individual subject, though the magnitudes of the perceptual distribution shifts vary somewhat across subjects.

On the other hand, the pattern of within-distribution correlations is rather different than the pattern seen in Experiment 1. The correlation in the [d] distribution is fairly similar for the two experiments, but the correlations in the other three distributions are not. There are much larger magnitude correlations in all three distributions, with zero well outside the HDI for the (negative) correlation for the [b] category. In addition, there is a fair amount of variability in the correlations across subjects.

Fig. 8: Fitted multilevel Bayesian Gaussian GRT model, onset place and voicing with multi-talker babble noise. See text for details. The large panel illustrates the group-level parameters, while the smaller panels illustrate the individual-subject-level parameters. Ellipses indicate equal likelihood contours; the degree of deviation from circular indicates the estimate of the correlation within each distribution. The small horizontal lines in the corners of each panel indicate −1, 0, and 1, while the small vertical lines in the corners indicate the uncertainty of the correlation estimates. The large vertical and horizontal lines illustrate the estimated decision bounds, with the smaller perpendicular lines indicating the uncertainty of these estimates. The smaller vertical and horizontal lines inside the ellipses indicate, at their intersection, the estimates of the perceptual distribution means, and, via their lengths, the uncertainty of the estimates of the means.

6.6 Experiment 3 discussion

The results of Experiment 3 show that the feature interactions observed in Experiment 1 are not simply an artifact of the particular speech-shaped noise used in that experiment. Experiment 3 shows that while perceptual interactions that occur between perceptual distributions generalize to different noise maskers, the perceptual interactions that occur within perceptual distributions do not. The multi-talker noise masker resulted in configurations of perceptual distribution means in Experiment 3 that were very similar to those seen in Experiment 1, while the different types of noise produced very different patterns of correlation within the perceptual distributions. Hence, the results of Experiment 3 suggest that 'macro' level perceptual interactions (i.e., the global arrangement of perceptual distributions) may be fairly general while the 'micro' level interactions (i.e., correlations within perceptual distributions) likely are not.

It should be noted that 'speech-shaped' noise and multi-talker babble share some acoustic properties, limiting the degree to which the results of Experiment 3 indicate generality of these results. For example, to the extent that the long-term spectrum of a noise masker is responsible for confusion patterns and, thereby, modeled feature interactions, we might expect 'speech-shaped' noise and multi-talker babble to produce similar results. Of course, differences in amplitude envelope fluctuations and spectro-temporal variability between these two noise types may also influence confusion patterns. Ultimately, this is an empirical matter, and the present results provide evidence that some kinds of perceptual interactions are robust to some types of variation in noise characteristics.

7 General discussion

Many previous studies of the structure of cues and features in speech have focused on small regions of phonological space, typically a single pair of speech sounds differing with respect to a single distinctive feature. However, work focused on small phonological subspaces misses interesting aspects of the implementation of features in speech due to the inherent multidimensionality of distinctive features and the complex, many-to-many mapping between cues and features.

Some research has focused directly on multidimensional structure in speech, investigating multiple cues to multiple features (Sawusch and Pisoni 1974; Oden and Massaro 1978; Eimas et al. 1981) or multiple acoustic cues to sequences of consonants and vowels (Nearey 1990, 1997; Smits 2001a, b). Studies aimed at elucidating the perceptual structure of features in noise-masked speech have produced conflicting results, with some studies suggesting that features are statistically independent (e.g., Miller and Nicely 1955) while others indicate that features interact in complex ways (e.g., Phatak and Allen 2007).

Studies focused on the perception of (cues to) voicing and place of articulation have produced disparate conclusions, as well, with some providing evidence that cues interact with each other (e.g., Sawusch and Pisoni 1974) while others are consistent with models assuming cue independence (e.g., Oden and Massaro 1978).

General recognition theory provides a powerful mathematical framework for modeling feature perception, allowing for feature interaction (or lack thereof) at two perceptual levels. The present work described the application of a multilevel Bayesian Gaussian GRT model to data from three experiments employing noise-masked speech stimuli. These three experiments probed perception of the same four consonants – [p], [b], [t], and [d] – in syllable onset (Experiments 1 and 3) and coda (Experiment 2) positions, and masked by speech-shaped noise (Experiments 1 and 2) and multi-talker babble (Experiment 3).

7.1 Feature perception and syllable structure context

In all three experiments, the models fit extremely well and exhibited configurations of perceptual distributions consistent with the essence of the expected distinctive feature structure (i.e., an orderly space well-defined by place and voicing). However, deviations from simple feature structure were evident in both onset and coda position, though the patterns of divergence differed for the two prosodic contexts. These results are largely consistent with previously reported confusion data (e.g., Miller and Nicely 1955; Wang and Bilger 1973; Benkí 2003; Cutler et al. 2004), though the GRT-based approach employed here provides a more detailed quantitative account of deviations from simple feature structure.

In onset position, at both the individual-subject and group level, the perceptual space was characterized by changes in perceptual salience on each dimension across levels of the other. The larger divergence from simple feature structure consisted of a change in the salience of the place distinction across levels of voicing: [d] was perceived as 'more alveolar' than was [t] (i.e., the perceptual distribution for [d] was further toward the alveolar end of the place dimension than was the perceptual distribution for [t]), whereas [p] and [b] were perceived as similarly 'labial' (i.e., the perceptual distributions for [p] and [b] were closely aligned on the place dimension). To a lesser degree, [d] was perceived as 'more voiced' than was [b] (i.e., the perceptual distribution for [d] was further toward the voiced end of the voicing dimension than was the perceptual distribution for [b]), while [p] and [t] were perceived as equivalently voiceless.

Although correlations within the perceptual distributions were mostly very close to zero, there was a tendency toward positive correlation in the perceptual distribution for [d] and a smaller tendency toward negative correlation in the distribution for [t]. These results indicate that strong assumptions about independence between stimulus dimensions (Kingston and Macmillan 1995; Kingston et al. 1997, 2008; Macmillan et al. 1999) may well be unwarranted.

The results for coda position differed in important ways from the onset results. First, and most obviously, in coda position, the salience of the place distinction did not vary across levels of voicing; both [p] and [b], on the one hand, and [t] and [d], on the other, were closely aligned with respect to the place dimension. Correlations within perceptual distributions differed in coda position, as well, primarily by virtue of varying more across subjects and segments than in onset position. On the other hand, the salience of the voicing distinction did vary across levels of place in coda position, such that there was less overlap between the [t] and [d] distributions than between the [p] and [b] distributions, much as was seen in onset position.

Patterns of response bias also differed somewhat between onset and coda position. In onset position, there was a bias toward labial responses at the voiceless level and a bias toward alveolar responses at the voiced level, whereas analogous shifts in response bias showed less consistency in coda position. The sources of response bias, and the reasons for contextual variation in these biases, remain open questions. Segmental and lexical frequencies and the functional load of features, as measured in appropriate corpus analyses, may provide (some portion of) an explanation.

7.2 Feature perception and feature context

In addition to contextual variation due to changes in syllable structure, it seems that the feature context also influences the nature of feature perception. Using the same approach to study the perception of the English labial consonants [p], [b], [f], and [v], a very different pattern of deviations from simple feature structure was observed between voicing and manner of articulation (Silbert 2012). Figure 9 shows the group-level model fits (as described above) for the place and voicing space reported here and the manner and voicing space described by Silbert (2012). Both phonological subspaces contain the stops [p] and [b]. Indeed, the same [p] and [b] stimuli were used in the place and manner experiments; the only difference in the stimuli was the pair of sounds presented along with [p] and [b]. As is clear in Figure 9, the nature of perceptual feature interactions depends, in part, on the more global feature context.

Consider the two onset model fits, for example (top two panels of Figure 9). The salience of the labial-alveolar place distinction varies asymmetrically across levels of voicing in onset position, with voiceless [p] and [t] exhibiting much lower salience than voiced [b] and [d], whereas the salience of the stop-fricative manner distinction varies more symmetrically across levels of voicing, with the voiceless [p] and [f] exhibiting much higher salience than the voiced [b] and [v].

Fig. 9: Group-level Gaussian GRT fits for place × voicing (left column), manner × voicing (right column), onset (top row), and coda (bottom row).

On the other hand, the voicing distinction between the labial stops [p] and [b] tends to be slightly less salient than the voicing distinction between the alveolar stops [t] and [d] and between the labial fricatives [f] and [v].

The coda results show a different pattern (bottom two panels of Figure 9). There is little to no change in place salience across levels of voicing, while the change in manner salience across levels of voicing is similar to, though more extreme than, that seen in onset position. As in onset position, the salience of the voicing distinction changes across both place and manner, though the change in voicing salience across levels of place is more pronounced than the very small change in voicing salience across levels of manner. Note, too, that voicing is more salient relative to place and manner in coda position than it is in onset position (i.e., the whole space seems to be stretched vertically, increasing the distance between the voiced-voiceless pairs, in the coda model fits relative to the onset model fits).

The differences due to feature context are consistent with studies probing larger sets of consonants (e.g., Miller and Nicely 1955; Wang and Bilger 1973; Phatak and Allen 2007). One possible extension of the present work would be to develop and apply three- or four-dimensional GRT models in order to simultaneously model confusions among larger sets of consonants (e.g., [p], [b], [t], [d], [k], [g], [f], [v], etc . . .).

7.3 Implications and limitations

The reported deviations from simple feature structure are broadly consistent with the feature interactions suggested by previous work on confusions induced by speech-weighted noise (Phatak and Allen 2007; Singh and Allen 2012). Such deviations provide at least some support for the segment as a basic perceptual unit (Nearey 1990; Norris et al. 2000). To the extent that features do not generalize across one another in perceptual identification, we might expect other feature-based perceptual research to show similar failures of generalization (Lahiri and Reetz 2010). Similarly, we might expect perceptual learning and perceptual training (e.g., Kraljic and Samuel 2006; Perrachione et al. 2011) to reflect imperfect featural generalization, which could have important implications for the design of training protocols.

The reported differences between onset and coda voicing and place perception, and the differences in the interactions of voicing with place and manner described immediately above, also suggest that the degree and nature of (lack of) feature generalization is variable and depends, in part, on contextual factors. To the extent that these deviations from simple feature structure are robust, they should have effects on other levels of linguistic processing. For example, the array of salience differences across different pairs of consonants described above may influence the psychological proximity of lexical neighbors; if two words differ only with respect to onset [b] and [v], for example, they should be much closer neighbors than two words that differ only with respect to onset [b] and [d].

The application of multilevel GRT has been described here as a new approach to the study of multidimensional feature perception. Of course, the small number of stimuli (produced by a single speaker) may seriously limit the generality of the reported results. The results of Experiment 3 indicate that the results generalize, for the most part, across types of noise masker, but it could be that peculiarities of the stimuli are responsible for (a good portion of) the deviations from simple feature structure.

There are, broadly speaking, two ways to address any limitations introduced by the use of a small number of naturally produced tokens. Perhaps most obviously, one could use a large number of naturally produced tokens. This approach is currently being pursued, and preliminary results from a conceptually related speech production study suggest that similar deviations from simple feature structure can be found in the acoustics of English labial and alveolar stops produced by multiple native speakers (i.e., not just in the stimuli employed here; de Jong et al. 2011).

Another reasonable approach would involve the use of synthetic or resynthesized stimuli. There are a number of reasons to use synthetic stimuli in experimental speech perception research, the most obvious being that they enable isolation of the effects of specific acoustic properties of interest. However, while synthetic or otherwise manipulated stimuli give the experimenter valuable experimental control, there is also a cost to making explicit choices about the acoustic dimension(s) of interest and relying on (often implicit) assumptions about every other acoustic dimension. The generality of the results of experimental work using synthetic stimuli depends, in no small part, on the degree to which observed perceptual patterns depend on the numerous acoustic cues not explicitly manipulated.

In much the same way, the generality of the results reported above depends on the extent to which observed patterns depend on the particular acoustic properties of the stimuli and noise used in each experiment. The use of naturally produced stimuli does not require strong assumptions about cues to distinctive features, but naturally produced speech presented without masking noise is very unlikely to produce errors, which are crucial to gaining insight into the structure of speech perception. Previous work indicates that masking noise characteristics can influence confusion patterns substantially (Allen 2005; Phatak and Allen 2007). The model fits from Experiments 1 and 3 provide a detailed characterization of precisely how perception differs when confusions are induced by speech-shaped noise vs. multi-talker babble.

Of course, the generality of the findings reported here is ultimately an empirical question. Any number of factors may influence the perception of place and voicing, including, but not limited to, low-level phonetic structure (e.g., the identity of a neighboring vowel, the presence of neighboring consonants), sociolinguistic variation (e.g., gender, dialect, or socioeconomic class differences), and lexical properties (e.g., usage frequency, bigram frequency, lexical status). The extent to which features function contrastively in a given language may also influence the nature of perceptual featural interactions. For example, it may be that the perceptual relationships between voicing and place are different in English, on the one hand, and in Dutch, Thai, or Czech, on the other, as the latter three do not have a voiced counterpart to voiceless [k] (Ohala 2005).

A number of directions for future research come readily to mind. First, in addition to acoustics, the results reported above may be driven, in part, by higher-level cognitive factors. Non-words were employed in the present work in order to minimize the influence of lexical factors. Future work could manipulate lexical factors in order to probe the architecture of the cognitive systems underlying speech perception (McClelland and Elman 1986; Norris et al. 2000). The GRT framework provides a powerful set of tools for studying the influence of higher-level factors on perceptual and decisional processes in speech perception.

Second, conceptually related production work is being carried out, in part to produce a sizable corpus of naturally produced stimuli for further perceptual experimentation. A model of the multidimensional phonetic and phonological spaces occupied by a large subset of English consonants is being developed to provide a rigorous quantitative description of feature structure in production. This will be used to make quantitative predictions about perceptual behavior.

Finally, an approach that quantitatively links production and perception would be very useful in the study of non-native speech perception. GRT (or another model of perception and decision-making such as, e.g., the generalized context model; Nosofsky 1986) mapped to a quantitative model of production features could function as a detailed quantitative implementation of the perceptual assimilation model (PAM; e.g., Best et al. 2001; Best and Tyler 2007) and the speech learning model (SLM; e.g., Flege 1995). Both of these models rely crucially on the notion of perceptual similarity, estimates of which can be calculated directly from configurations of multidimensional perceptual distributions and decision bounds (Ashby and Perrin 1988); a minimal illustration of such a calculation is sketched below.

The present work represents a modest step toward building a rigorous quantitative foundation for future work relating production, perception, and higher-level cognitive and linguistic factors.
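As a rough illustration of how a similarity estimate can be derived from a fitted GRT configuration, the sketch below computes, in the spirit of Ashby and Perrin (1988), the probability that a percept drawn from one stimulus's perceptual distribution falls in another stimulus's response region. This is an illustrative sketch under stated assumptions, not an implementation from this study; the means, correlation, and decision bounds shown are hypothetical stand-ins for fitted parameters.

```r
# Illustrative sketch only: a GRT-based similarity estimate in the spirit of
# Ashby and Perrin (1988), taken here as the probability that a percept from
# stimulus A's distribution lands in stimulus B's response region. All parameter
# values below are hypothetical, not fitted values from this study.
library(mvtnorm)

sim_a_to_b <- function(mu_a, rho_a, lower_b, upper_b) {
  Sigma <- matrix(c(1, rho_a, rho_a, 1), nrow = 2)   # marginal variances fixed at 1
  pmvnorm(lower = lower_b, upper = upper_b, mean = mu_a, sigma = Sigma)[1]
}

# e.g., similarity of onset [b] to [d]: integrate a hypothetical [b] distribution
# over the 'd' response region (both dimensions above their decision bounds)
sim_a_to_b(mu_a = c(0.3, 2.0), rho_a = -0.2, lower_b = c(1, 1), upper_b = c(Inf, Inf))
```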

Acknowledgments: This work benefited enormously from the knowledge, advice, and patience of Professors James Townsend, Kenneth de Jong, and Jennifer Lentz. The work was funded by NIH grant 2-R01-MH0577-17-07A1.

References

Agresti, Alan. 1992. A survey of exact inference for contingency tables. Statistical Science 7(1). 131–153. doi:10.1214/ss/1177011454

Allen, Jont B. 2005. Consonant recognition and the articulation index. The Journal of the Acoustical Society of America 117(4). 2212–2223. doi:10.1121/1.1856231

Ashby, F. Gregory. 1989. Stochastic general recognition theory. In Douglas Vickers & Phillip L. Smith (eds.), Human Information Processing: Measures, Mechanisms, and Models, 435–457. Amsterdam: North-Holland.

Ashby, F. Gregory, & Nancy Perrin. 1988. Toward a unified theory of similarity and recognition. Psychological Review 95(1). 124–150.

Ashby, F. Gregory, & James T. Townsend. 1986. Varieties of perceptual independence. Psychological Review 93(2). 154–179.

Bailey, Todd M., & Ulrike Hahn. 2005. Phoneme similarity and confusability. Journal of Memory and Language 52(3). 339–362.

Batchelder, William H., & Court S. Crowther. 1997. Multinomial processing tree models of factorial categorization. Journal of Mathematical Psychology 41(1). 45–55. doi:10.1006/jmps.1997.1146

Benkí, José R. 2001. Place of articulation and first formant transition pattern both affect perception of voicing in English. Journal of Phonetics 29(1). 1–22.

Benkí, José R. 2003. Analysis of English nonsense syllable recognition in noise. Phonetica 60(2). 129–157. doi:10.1159/000071450

Best, Catherine T., Gerald W. McRoberts, & Elizabeth Goodell. 2001. Discrimination of non-native consonant contrasts varying in perceptual assimilation to the listener's native phonological system. The Journal of the Acoustical Society of America 109. 775–794.

Best, Catherine T., & Michael D. Tyler. 2007. Nonnative and second-language speech perception: Commonalities and complementarities. In Ocke-Schwen Bohn & Murray J. Munro (eds.), Language Experience in Second Language Speech Learning: In Honor of James Emil Flege, 13–34. Amsterdam: John Benjamins.

Cutler, Anne, Andrea Weber, Roel Smits, & Nicole Cooper. 2004. Patterns of English phoneme confusions by native and non-native listeners. The Journal of the Acoustical Society of America 116(6). 3668–3678. doi:10.1121/1.1810292

de Jong, Ken, Noah H. Silbert, Kirsten T. Regier, & Aaron Albin. 2011. Statistical relationships in distinctive feature models and acoustic-phonetic properties of English consonants. Journal of the Acoustical Society of America 129. 2455.

Eimas, Peter D., Vivien C. Tartter, & Joanne L. Miller. 1981. Dependency relations during the processing of speech. In Peter D. Eimas & Joanne L. Miller (eds.), Perspectives on the Study of Speech, 283–309. New York: Psychology Press.

Eimas, Peter D., Vivien C. Tartter, Joanne L. Miller, & Nancy J. Keuthen. 1978. Asymmetric dependencies in processing phonetic features. Perception & Psychophysics 23(1). 12–20.

Fisher, R. A. 1922. On the interpretation of χ2 from contingency tables, and the calculation of P. Journal of the Royal Statistical Society 85(1). 87–94. doi:10.2307/2340521

Flege, James E. 1995. Second language speech learning: Theory, findings, and problems. In Winifred Strange (ed.), Speech Perception and Linguistic Experience: Issues in Cross-language Research, 233–277. Weybridge, UK: York Press.

Green, David Marvin, & John A. Swets. 1966. Signal Detection Theory and Psychophysics. Malabar, FL: Robert E. Krieger.

Kadlec, Helena, & Carrie L. Hicks. 1998. Invariance of perceptual spaces and perceptual separability of stimulus dimensions. Journal of Experimental Psychology: Human Perception and Performance 24(1). 80–104.

Kessinger, R. H., & S. E. Blumstein. 1997. Effects of speaking rate on voice-onset time in Thai, French, and English. Journal of Phonetics 25(2). 143–168.

Kingston, John, & Randy L. Diehl. 1994. Phonetic knowledge. Language 70(3). 419–454.

Kingston, John, Randy L. Diehl, Cecilia J. Kirk, & Wendy A. Castleman. 2008. On the internal perceptual structure of distinctive features: The [voice] contrast. Journal of Phonetics 36(1). 28–54.

Kingston, John, & Neil A. Macmillan. 1995. Integrality of nasalization and F1 in vowels in isolation and before oral and nasal consonants: A detection-theoretic application of the Garner paradigm. The Journal of the Acoustical Society of America 97. 1261–1285.

Kingston, John, Neil A. Macmillan, Laura Walsh Dickey, Rachel Thorburn, & Christine Bartels. 1997. Integrality in the perception of tongue root position and voice quality in vowels. The Journal of the Acoustical Society of America 101. 1696–1709.

Kraljic, Tanya, & Arthur G. Samuel. 2006. Generalization in perceptual learning for speech. Psychonomic Bulletin & Review 13(2). 262–268. doi:10.3758/BF03193841

Lahiri, Aditi, & Henning Reetz. 2010. Distinctive features: Phonological underspecification in representation and processing. Journal of Phonetics 38(1). 44–59. doi:10.1016/j.wocn.2010.01.002

Liberman, A. M., P. C. Delattre, & F. S. Cooper. 1958. Some cues for the distinction between voiced and voiceless stops in initial position. Language and Speech 1(3). 153–167.

Lunn, David J., Andrew Thomas, Nicky Best, & David Spiegelhalter. 2000. WinBUGS: a Bayesian modelling framework: concepts, structure, and extensibility. Statistics and Computing 10. 325–337.

Macmillan, Neil A., John Kingston, Rachel Thorburn, Laura W. Dickey, & Christine Bartels. 1999. Integrality of nasalization and F1. II. Basic sensitivity and phonetic labeling measure distinct sensory and decision-rule interactions. The Journal of the Acoustical Society of America 106. 2913–2932.

Massaro, Dominic W., & Gregg C. Oden. 1980. Evaluation and integration of acoustic features in speech perception. Journal of the Acoustical Society of America 67(3). 996–1013.

McClelland, James L., & Jeffrey L. Elman. 1986. The TRACE model of speech perception. Cognitive Psychology 18(1). 1–86. doi:10.1016/0010-0285(86)90015-0

McMurray, Bob, & Allard Jongman. 2011. What information is necessary for speech categorization? Harnessing variability in the speech signal by integrating cues computed relative to expectations. Psychological Review 118(2). 219–246. doi:10.1037/a0022325

Miller, George A., & Patricia E. Nicely. 1955. An analysis of perceptual confusions among some English consonants. The Journal of the Acoustical Society of America 27. 338–352.

Miller, Joanne L. 1978. Interactions in processing segmental and suprasegmental features of speech. Perception & Psychophysics 24(2). 175–180.

Nearey, Terrance M. 1990. The segment as a unit of speech perception. Journal of Phonetics 18. 347–373.

Nearey, Terrance M. 1992. Context effects in a double-weak theory of speech perception. Language and Speech 35(1–2). 153–171.

Nearey, Terrance M. 1997. Speech perception as pattern recognition. The Journal of the Acoustical Society of America 101. 3241–3254.

Norris, Dennis, James M. McQueen, & Anne Cutler. 2000. Merging information in speech recognition: Feedback is never necessary. Behavioral and Brain Sciences 23(3). 299–325.

Nosofsky, Robert M. 1986. Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General 115(1). 39–57.

Oden, Gregg C., & Dominic W. Massaro. 1978. Integration of featural information in speech perception. Psychological Review 85(3). 172–191.

Oglesbee, Eric Nathanael. 2008. Multidimensional stop categorization in English, Spanish, Korean, Japanese, and Canadian French. Indiana University Ph.D. dissertation.

Ohala, John J. 2005. Phonetic explanations for sound patterns: Implications for grammars of competence. Retrieved from http://www.linguistics.berkeley.edu/phonlab/annual_report/documents/2005/Ohala_laver269-288.pdf

Olzak, Lynn A., & Thomas D. Wickens. 1997. Discrimination of complex patterns: orientation information is integrated across spatial scale; spatial-frequency and contrast information are not. Perception 26. 1101–1120.

Perrachione, Tyler K., Jiyeon Lee, Louisa Y. Y. Ha, & Patrick C. M. Wong. 2011. Learning a novel phonological contrast depends on interactions between individual differences and training paradigm design. The Journal of the Acoustical Society of America 130(1). 461–472. doi:10.1121/1.3593366

Phatak, Sandeep A., & Jont B. Allen. 2007. Consonant and vowel confusions in speech-weighted noise. The Journal of the Acoustical Society of America 121(4). 2312–2326. doi:10.1121/1.2642397

Port, Robert F., & Jonathan Dalby. 1982. Consonant/vowel ratio as a cue for voicing in English. Perception & Psychophysics 32(2). 141–152.

R Development Core Team. 2012. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.

Rouder, Jeffrey N., & Roger Ratcliff. 2004. Comparing categorization models. Journal of Experimental Psychology: General 133(1). 63–82.

Sawusch, James R., & David B. Pisoni. 1974. On the identification of place and voicing features in synthetic stop consonants. Journal of Phonetics 2. 181–194.

Silbert, Noah H. 2012. Syllable structure and integration of voicing and manner of articulation information in labial consonant identification. The Journal of the Acoustical Society of America 131(5). 4076–4086. doi:10.1121/1.3699209

Silbert, Noah H., & Robin D. Thomas. 2013. Decisional separability, model identification, and statistical inference in the general recognition theory framework. Psychonomic Bulletin & Review 20. 1–20. doi:10.3758/s13423-012-0329-4

Silbert, Noah H., James T. Townsend, & Jennifer J. Lentz. 2009. Independence and separability in the perception of complex nonspeech sounds. Attention, Perception, & Psychophysics 71(8). 1900–1915.

Singh, Riya, & Jont B. Allen. 2012. The influence of stop consonants' perceptual features on the Articulation Index model. The Journal of the Acoustical Society of America 131(4). 3051–3068. doi:10.1121/1.3682054

Smits, Roel. 2001a. Evidence for hierarchical categorization of coarticulated phonemes. Journal of Experimental Psychology: Human Perception and Performance 27. 1145–1162.

Smits, Roel. 2001b. Hierarchical categorization of coarticulated phonemes: A theoretical analysis. Perception & Psychophysics 63(7). 1109.

Stevens, Kenneth N. 2000. Acoustic Phonetics (Vol. 30). Cambridge, MA: The MIT Press.

Stevens, Kenneth N., & Dennis H. Klatt. 1974. Role of formant transitions in the voiced-voiceless distinction for stops. The Journal of the Acoustical Society of America 55. 653–659.

Sturtz, Sibylle, Uwe Ligges, & Andrew Gelman. 2005. R2WinBUGS: a package for running WinBUGS from R. Journal of Statistical Software 12(3). 1–16.

Thomas, Robin D. 2001. Perceptual interactions of facial dimensions in speeded classification and identification. Perception & Psychophysics 63(4). 625–650.

Train, Kenneth. 2003. Discrete choice methods with simulation. Cambridge: Cambridge University Press.

Vaden, Kenneth, G. Hickok, & H. Halpin. 2005. Irvine phonotactic online dictionary. Retrieved from http://www.iphod.com/

Vitevitch, Michael S., & Paul A. Luce. 1999. Probabilistic phonotactics and neighborhood activation in spoken word recognition. Journal of Memory and Language 40(3). 374–408. doi:10.1006/jmla.1998.2618

Volaitis, Lydia E., & Joanne L. Miller. 1992. Phonetic prototypes: Influence of place of articulation and speaking rate on the internal structure of voicing categories. Journal of the Acoustical Society of America 92(2). 723–735.

Wang, Marilyn D., & Robert C. Bilger. 1973. Consonant confusions in noise: a study of perceptual features. The Journal of the Acoustical Society of America 54(5). 1248–1266. doi:10.1121/1.1914417

Appendix A

Table A.1 gives the confusion matrices, by stimulus category, for Experiments 1 (left) and 2 (right) for all individual subjects (from top to bottom in the order listed in Table 1 and Table 2 and displayed counter-clockwise around the peripheries of Figures 4 and 6). Note that the onset and coda results are only directly comparable for subjects 1, 2, 4, 6, and 7.

In order to ensure that the inferences drawn from the models fit to the by-category confusion matrices are not compromised by differences between individual tokens, non-parametric statistical tests were applied to check for differences in response patterns across tokens. More specifically, Fisher's exact test was used (Fisher 1922; Agresti 1992); this test gives the probability of the observed pattern of counts in a contingency table under the assumption that the rows (here tokens) and columns (here responses) are independent. There are four tokens per category, so there are six pairs of tokens within each category. There were 6 (pairs) × 4 (stimulus categories) × 8 (subjects) = 192 tests.

Results are reported for a Bonferroni-corrected αB = 0.0002 and the less conservative α = 0.001.
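For illustration, the sketch below runs one token-pair comparison of the kind described above in R: Fisher's exact test applied to a 2 × 4 table of response counts for two tokens from the same stimulus category. The counts shown are hypothetical, not data from the experiments, and the Bonferroni criterion is computed assuming a family-wise level of 0.05; this is an illustration of the procedure rather than a re-analysis of the reported results.

```r
# Illustrative sketch of one token-pair comparison: Fisher's exact test on a
# 2 x 4 contingency table (two tokens from one stimulus category x four responses).
# The counts below are hypothetical.
token_pair <- matrix(c(40, 2,  8, 0,    # token 1: 'p', 'b', 't', 'd' responses
                       36, 1, 13, 0),   # token 4
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("token 1", "token 4"), c("p", "b", "t", "d")))
fisher.test(token_pair)$p.value

# Bonferroni-corrected criterion for the 192 token-pair tests, assuming a
# family-wise alpha of 0.05 (cf. the corrected and uncorrected levels above)
0.05 / 192
```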

For the onset data, three tests were statistically significant with αB and seven with α = 0.001. None of the tests were statistically significant for the [pa] or [da] tokens. Only one (with α = 0.001) was statistically significant for [ba] (tokens 1 and 2 for subject 4). Two of the three statistically significant tests with αB were for comparisons of tokens 1 and 4 (for subjects 2 and 5), and the other was for a comparison of tokens 2 and 4. For the less conservative α value, the statistically significant tests were for comparisons of tokens 1 and 3, 1 and 4 (two times), 2 and 3, 2 and 4, and 3 and 4, occurring for subjects 2 (two times), 5 (three times), and 6 (one time).

Table A.1: Confusion matrices by stimulus category, Experiments 1 (left) and 2 (right).

Onset: p  b  t  d        Coda: p  b  t  d

[pa] 169 6 24 1 161 2 37 0 [ba] 7 157 2 34 12 89 1 98 [ta] 94 6 100 0 83 0 117 0 [da] 0 2 1 197 0 45 1 154

[pa] 151 13 36 0 127 9 56 8 [ba] 2 180 4 14 17 104 25 54 [ta] 55 0 145 0 54 14 128 4 [da] 0 4 0 196 11 74 17 98

[pa] 129 20 46 5 159 1 40 0 [ba] 14 141 5 40 0 132 0 68 [ta] 52 4 134 10 65 2 133 0 [da] 0 6 8 186 2 50 1 147

[pa] 121 5 69 5 112 4 73 11 [ba] 59 60 31 50 13 54 8 125 [ta] 65 3 129 3 78 1 119 2 [da] 7 3 14 176 8 47 2 143

[pa] 161 8 31 0 145 6 43 6 [ba] 12 141 10 37 15 127 3 55 [ta] 68 1 130 1 85 6 107 2 [da] 0 4 3 193 13 74 1 112

[pa] 166 5 27 2 124 1 75 0 [ba] 10 178 2 10 1 154 0 45 [ta] 43 2 154 1 41 0 159 0 [da] 2 2 1 195 0 67 0 133

[pa] 164 5 31 0 79 1 118 2 [ba] 5 183 0 12 3 78 9 110 [ta] 57 3 140 0 34 0 162 4 [da] 0 2 1 197 3 52 2 143

[pa] 162 7 27 4 145 6 49 0 [ba] 13 144 6 37 1 157 0 42 [ta] 130 8 55 7 61 0 139 0 [da] 7 6 3 184 0 49 0 151

Table A.2: Confusion matrices by stimulus category, Experiment 3.

p  b  t  d        p  b  t  d

[pa] 131 23 43 3 [pa] 68 41 90 1 [ba] 13 149 10 28 [ba] 24 131 30 15 [ta] 119 13 56 12 [ta] 22 10 163 5 [da] 1 2 5 192 [da] 2 7 6 185

[pa] 153 23 23 1 [pa] 99 25 73 3 [ba] 15 168 16 1 [ba] 8 172 11 9 [ta] 60 3 136 1 [ta] 53 3 141 3 [da] 0 5 2 193 [da] 0 4 3 193

[pa] 138 13 48 1 [pa] 163 24 10 3 [ba] 47 123 20 10 [ba] 16 175 5 4 [ta] 65 6 126 3 [ta] 87 8 96 9 [da] 2 4 7 187 [da] 1 5 4 190

[pa] 114 17 58 11 [ba] 50 92 22 36 [ta] 72 10 107 11 [da] 3 11 14 172

For the coda data, the results were slightly more variable. None of the tests were statistically significant for [ap] with either α value. None were statistically significant for [ad] with αB, and only one was statistically significant for [ad] with α = 0.001 (comparing tokens 2 and 3, subject 3). One test was statistically significant for [ab] for both α values (also comparing tokens 2 and 3, subject 3). One test was statistically significant for [at] with αB (tokens 1 and 3, subject 3), and four were statistically significant with the less conservative α = 0.001. These compared tokens 1 and 3 (subjects 3 and 6) and tokens 3 and 4 (subjects 5 and 6).

Table A.2 shows the category-level confusion matrices for Experiment 3. Tests of differences in response patterns across individual tokens produced six statistically significant differences with αB and eight with α = 0.001. As with the data from Experiments 1 and 2, there were no systematic differences across stimulus categories or subjects. One test was significant for [pa] at the more stringent αB level (comparing tokens 1 and 4), and another two tests were significant for [pa] at the more lenient 0.001 level (one comparing tokens 1 and 4, the other comparing tokens 3 and 4). Three tests were significant for [ba] with αB (two comparing tokens 1 and 2, one comparing tokens 2 and 4). Two tests were significant for [ta] with αB (both comparing tokens 1 and 4). No tests were significant for [da].

The overall pattern of results provides strong evidence that the individual tokens produced essentially identical response patterns. Hence, the analysis of confusion matrices collapsed across tokens is justified by the lack of statistical evidence indicating that responses differed consistently across individual tokens.

Appendix B

Two acoustic measurements taken from the stimulus tokens used here are reported (duration in ms of the consonant and syllable, and F1 and F2 at CV or VC boundary and vowel midpoint). Additional acoustic measurements, spectrograms, and the original sound files are available from the author on request.

Fig. B.1: Consonant and syllable duration (ms), onset stimuli.

Fig. B.2: F1 and F2 (Hz) at CV boundary and vowel midpoint, onset stimuli.

Appendix C

Each individual subject's data are modeled with a two-dimensional Gaussian GRT model consisting of four bivariate perceptual distributions and two decision criteria that partition the perceptual space into four response regions. A few of the model's parameters must be fixed a priori so that unique estimates of the other parameters may be obtained. Hence, the mean of one perceptual distribution (corresponding to [p]) is fixed at (0,0), and all marginal perceptual distribution variances are fixed at 1.
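The mapping from one perceptual distribution and the two decision criteria to four response probabilities can be sketched in R as follows. This is an illustration of the computation just described, not the model code used for fitting; which dimension corresponds to place versus voicing, the quadrant-to-response assignment, and all parameter values are assumptions made for the example.

```r
# Illustrative sketch: response probabilities for one perceptual distribution in a
# two-dimensional Gaussian GRT model. Dimension 1 is taken to be place (labial low,
# alveolar high) and dimension 2 voicing (voiceless low, voiced high); these
# assignments and all parameter values are assumptions for illustration only.
library(mvtnorm)

grt_response_probs <- function(mu, rho, kappa) {
  # mu: perceptual mean (marginal variances fixed at 1); rho: within-distribution
  # correlation; kappa: decision criteria on the place and voicing dimensions
  Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
  regions <- list(
    p = list(lower = c(-Inf, -Inf),         upper = c(kappa[1], kappa[2])),
    t = list(lower = c(kappa[1], -Inf),     upper = c(Inf, kappa[2])),
    b = list(lower = c(-Inf, kappa[2]),     upper = c(kappa[1], Inf)),
    d = list(lower = c(kappa[1], kappa[2]), upper = c(Inf, Inf))
  )
  sapply(regions, function(r)
    pmvnorm(lower = r$lower, upper = r$upper, mean = mu, sigma = Sigma)[1])
}

# e.g., a hypothetical [d] distribution with a modest positive correlation
grt_response_probs(mu = c(2, 2), rho = 0.3, kappa = c(1, 1))
```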

Fig. B.3: Consonant and syllable duration (ms), coda stimuli.

Each subject's responses are tallied and represented as four vectors dik consisting of the counts of the ith subject's four responses to the kth stimulus. These count vectors are modeled as multinomial random variables parameterized by vectors of probabilities θik and the number of presentations of the kth stimulus, Nk. The probability that the ith subject gives the rth response when the kth stimulus is presented (i.e., θirk) is, as described in the main text, the double integral of the kth perceptual distribution over the rth response region. For a given stimulus, the four response probabilities are determined by μ, the mean of the perceptual distribution, ρ, the correlation within that distribution, and κ, the decision criteria.

Across subjects, the kth perceptual mean μ and jth decision criterion κ are modeled as (univariate) Gaussian random variables with means ηk and ψj, respectively, and precisions (i.e., one divided by the variance) τ and χ, respectively. Each correlation ρ is modeled as a truncated (at ±0.975) Gaussian random variable with mean νk and precision π.
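Continuing the sketch above, the multinomial likelihood of one subject's response counts to one stimulus can be evaluated directly from the response probabilities. The counts below are hypothetical, and the parameter values are the same illustrative values used above rather than estimates from the fitted model.

```r
# Illustrative sketch: multinomial (log-)likelihood of one subject's counts for one
# stimulus, given response probabilities from the GRT sketch above. Counts and
# parameter values are hypothetical.
theta_d  <- grt_response_probs(mu = c(2, 2), rho = 0.3, kappa = c(1, 1))
counts_d <- c(p = 0, t = 1, b = 4, d = 195)   # hypothetical responses to a [da] stimulus, N = 200
dmultinom(counts_d, prob = theta_d, log = TRUE)
```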

Fig. B.4: F1 and F2 (Hz) at VC boundary and vowel midpoint, coda stimuli.

Finally, the group-level parameters (ηk, ψj, τ, and χ) governing the individual-level parameters are modeled as normal (means) and gamma (precisions) random variables, respectively. The three group-level stimulus means ηk are modeled with means of (0,2), (2,0), and (2,2), variances of 2, and zero covariance. The shape and rate parameters governing all of the group-level precision parameters were set to 5 and 1, respectively, emphasizing standard deviations near and below one while allowing any value greater than zero. All model parameters are estimated simultaneously.
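A quick simulation makes the prior claim concrete: drawing precisions from a Gamma(shape = 5, rate = 1) distribution and converting them to standard deviations shows that most of the prior mass falls near and below one. This is a sketch of a prior check, not part of the reported model-fitting code.

```r
# Sketch of a prior check (not part of the reported analysis): Gamma(5, 1)
# precisions imply standard deviations concentrated near and below one.
set.seed(1)
prec <- rgamma(1e5, shape = 5, rate = 1)
quantile(1 / sqrt(prec), probs = c(0.025, 0.5, 0.975))
# approximately 0.31, 0.46, and 0.79
```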