
To appear in Journal of Phonetics

Phonetic bases of similarities in cross-language production: Evidence from English and Catalan

Lisa Davidson New York University

Corresponding author:

Lisa Davidson Department of Linguistics New York University 10 Washington Place New York, NY 10003 phone: 212-992-8761 fax: 212-995-4707 [email protected]

Abstract

Previous research has shown that speakers do not produce all non-native phonotactic sequences with equal accuracy. Several reasons for these accuracy differences have been proposed, including markedness, analogical extension from permitted sequences, and language-independent phonetic factors. In this study, evidence from the production of unattested obstruent-initial onset clusters by English and Catalan speakers tests the viability of these explanations. Variables manipulated in this study include the manner, place, and voicing of the consonant clusters, and the input modality of the stimuli: whether speakers were presented with the stimuli in an audio+text condition or in an audio-only condition. Results demonstrate that none of the linguistic factors interacted with language background; all speakers were less accurate on stop-initial sequences than fricative-initial ones, and on voiced sequences than voiceless sequences. It is argued that the fact that the particular accuracy patterns are independent of language background is incompatible with an analogy-based explanation, and is better accounted for by language-independent phonetic factors. However, the role of the native language is reflected in the preferred repair types, which vary by native language. Finally, while the presence of text improves performance, the patterns of accuracy are still largely the same for both audio+text and audio-only input, suggesting that the underlying mechanisms responsible for speech production are independent of input modality.

1.0 Errors in the production of non-native phonotactics

Previous research in cross-language speech production and second language acquisition has shown that speakers do not produce different types of sequences that violate their native phonotactics with equal accuracy. For example, Vietnamese speakers’ second language (L2) accuracy varied on different types of coda consonants that are all unattested in Vietnamese (Hansen, 2004), Japanese and Korean speakers were more accurate on voiceless stop-initial onset clusters than voiced ones (Broselow & Finer, 1991; Eckman & Iverson, 1993), and Mandarin speakers produced voiceless coda stops more accurately than voiced ones, even though Mandarin does not permit stops in codas at all (Broselow, Chen, & Wang, 1998). Various explanations have been provided for the discrepancies in accuracy produced by the speakers in all of these studies. In some cases, phonological concepts of markedness have been invoked to account for the findings. Broselow and Finer (1991) appeal to minimal sonority distance, arguing that speakers are more accurate on sequences that have a greater increase in sonority from the first consonant to the second. In the case of the Mandarin speakers (Broselow et al., 1998), voiced stops are often devoiced, which Broselow et al. attribute to the “emergence of the unmarked” (McCarthy & Prince, 1994); that is, speakers produce a less marked version of a prohibited phonotactic element when faced with a target that is more marked. However, providing traditional markedness explanations of accuracy becomes more difficult when other factors known to affect phonological processing both in perception and production are considered. In a cross-language speech production study, Davidson (2006a) examined the performance of English speakers on word-initial fricative-obstruent and fricative-nasal consonant sequences that are possible in some Slavic languages. Results showed that speakers’ performance depended on the identity of the first consonant.
Specifically, speakers were most accurate on /f/-initial sequences (e.g. /ftake/), followed by /z/-initial sequences (e.g. /zdanu/), and then by /v/-initial sequences (e.g. /vbaɡo/). At least within the voiced clusters (though possibly also across all of the clusters, depending on one’s theory of sonority), the change in place of articulation alone does not change the status of sonority distance for these sequences, so sonority sequencing cannot account for these findings. Davidson (2006a) considered various factors that could potentially explain the pattern of results. First, it was shown that there is no correlation between either the type or token frequency of the experimental clusters in word-medial and word-final position in English and speakers’ accuracy on them in word-initial position (see also similar results in Davidson, Jusczyk, & Smolensky, 2004). Similarly, a purely articulatory account that disregards language-specific constraints on where those sequences can appear cannot be supported either. Such an account would predict that if speakers have experience producing obstruent-initial clusters elsewhere in the word, then they may be able to transfer that articulatory ability to relevant clusters in word-initial position (Ussishkin & Wedel, 2003). This possibility is an extension of the concept of “gestural molecules”, which posits that as speakers become more practiced with combinations of sounds or articulatory gestures, these combinations become organized into motor programs (e.g. Browman & Goldstein, 2001; Byrd & Saltzman, 2002; Goldstein, Pouplier, Chen, Saltzman, & Byrd, 2007). As Ussishkin and Wedel (2003) argue, if speakers develop motor programs consisting of consonant sequences, then those programs could potentially be deployed in new environments.
However, while words like co[zm]ic, a[ft]er, lo[vb]ird, and cau[zd] ensure that speakers have experience coordinating fricatives with obstruents and nasals in word-medial or word-final position, the existence of such words makes large differences in accuracy among these sequences in word-initial position difficult to explain. Instead, it was argued that while speakers did not seem to be relying on frequency information, they could still be using a process of analogical extension in production that builds off information from the lexicon (Baayen, 2003; Bailey & Hahn, 2001; Bybee, 2001; Greenberg & Jenkins, 1964; Skousen, 1989).¹ Although none of these particular sequences are permitted word-initially, the experimental results suggest that featural encoding in the lexicon is accessible to the speakers (e.g. Albright, 2009; Goldrick, 2004; Hayes & Wilson, 2008), since they are more accurate on the clusters that share more features with phonotactically legal sequences. For example, as discussed in Davidson (2006a), performance on /fC/ sequences may be most accurate not only because of the existence of /f/+sonorant sequences, but also because they are only one place feature removed from common /s/+nasal and /s/+obstruent sequences. In other words, /f/ is already permitted in a consonant cluster configuration in English, which may provide an analogical basis for slightly more accurate performance on other unattested /fC/ sequences. On the other hand, /v/+obstruent sequences have no similar /v/+sonorant counterpart, nor are /z/+obstruent sequences permitted in English, though /s/+obstruent sequences may provide a boost for /z/+obstruent sequences, which differ only with respect to voicing. If an analogy explanation of performance is adequate, one possible prediction is that productions by speakers of other languages should also be shaped by the phonotactics of their native languages when producing non-native consonant sequences. However, another pattern of performance is possible: speakers of all languages that do not have these clusters could show the same accuracy patterns as one another, even though the phonotactics of these languages will differ. In that case, another explanation would be required.
If the performance patterns on unattested phonotactic sequences are the same across speakers regardless of their language background, then another possible explanation of the accuracy patterns could be attributed to language-independent phonetic factors. For example, in the case of the fricatives, decreased accuracy on /vC/ and /zC/ relative to /fC/ may result from greater aerodynamic difficulties in producing a voiced fricative-obstruent cluster than a voiceless one. Whereas frication favors oral pressure that is maximally higher than atmospheric pressure, voicing requires oral pressure to be lower than glottal pressure (Ohala, 1994; Stevens, Blumstein, Glicksman, Burton, & Kurowski, 1992). These requirements for implementing frication and voicing may be in conflict with one another. Voiced fricative-obstruent clusters are especially disadvantaged, since they may require the conflicting air pressure requirements to be sustained for longer than the speech can accommodate (Ohala & Kawasaki-Fukumori, 1997; Westbury & Keating, 1986). The accuracy difference between /v/ and /z/ may be attributable to both production and perceptual factors. Previous research has argued that speakers may also incorporate the notion of recoverability into their productions in cases where their utterances might otherwise be poorly perceptible (Chitoran, Goldstein, & Byrd, 2002; Mattingly, 1981; Weinberger, 1994). For example, Chitoran et al. maintained that perceptual recoverability played an important role in determining the amount of temporal overlap in different combinations of stop-stop sequences produced by Georgian speakers. With respect to /v/, it has been shown that low-intensity labiodental fricatives are more confusable than sibilant fricatives (Jongman, Wayland, & Wong, 2000; Miller & Nicely, 1955; Wang & Bilger, 1973), so it is possible that speakers, especially in a laboratory task or L2 situation, would have to make special articulatory effort to produce /v/ in such a way that it is clear and perceptible (Maniwa, Jongman, & Wade, 2008).² If the cost of producing that effort leads speakers to sometimes prefer a repair of the sequence (such as changing or deleting the first consonant), then speakers may alternate between these options in lieu of always correctly producing /v/-initial sequences.

¹ There is a related line of research attempting to provide computational models of well-formedness judgments of unattested phonotactics based on knowledge of the lexicon. Both Hayes and Wilson (2008) and Albright (2009) develop algorithms to account for well-formedness ratings for unattested onset clusters by English speakers (Scholes, 1966). Though the specific learning algorithms proposed by each author differ, both of them embody the notion that natural classes of segments, as defined by features, can be discovered and potentially recombined by learners in order to differentiate between various types of unattested sequences. Since these models are both applied to well-formedness ratings, which differ from the production accuracy data in this study, it remains to be seen whether these types of learning algorithms can also account for how unattested consonant sequences are perceived and/or produced (and explain any differences that might arise between these different types of tasks).

1.1. Motivations for the current study

The current study extends Davidson (2006a) in a number of directions. There are four main issues addressed by this experiment that are detailed in this section.

1.1.1. Types of word-initial consonant clusters

In addition to fricative-initial sequences, this study also includes stop-obstruent and stop-nasal sequences. From an articulatory perspective, stop-obstruent and stop-nasal sequences may be more difficult to produce than fricative-initial sequences because the closure period of stops does not have any cue to place, so especially in word-initial position, speakers must be very careful to release the stop or produce it before an (inserted) vowel in order to indicate the place of articulation either through the burst or through F2 transitions (e.g. Kewley-Port, 1983; Ohde & Stevens, 1983; Stevens & Blumstein, 1978). This may be problematic for speakers of a language like English, however, since stop release before obstruents within words is not obligatory, depends on the order of place of articulation of the sequential consonants, and varies significantly from speaker to speaker (Henderson & Repp, 1982). Thus, producing stop-initial sequences may require stricter articulatory precision than fricative-initial sequences. Another articulatory issue that would be expected to arise is related to voicing; voiced stop-initial obstruent clusters should be more difficult to produce accurately than voiceless ones (Ohala, 1994). In contrast, an analogy account of performance should not predict much difference among stop-initial sequences on the basis of voicing for English, since to the extent that existing clusters influence the production of unattested ones, both voiced stops and voiceless stops can appear in the same kinds of stop-initial cluster (e.g. break, please, great, clean, dream, trade, etc.) for all places of articulation.

1.1.2. Language background of the speakers

This study includes speakers from two different language backgrounds: American English and Barcelona Catalan. Speakers of two different native languages were included for several reasons. First, to some extent, these languages allow for a comparison of the analogy and phonetically-motivated hypotheses with respect to obstruent-initial clusters. Unlike English, Catalan does not allow /s/-nasal or /s/-obstruent word-initial onsets (Hualde, 1992; Recasens, 1993; Wheeler, 2005). Catalan is limited to stop+liquid (/l/ and /ɾ/, with the exception of */tl/ and */dl/), /f/+liquid onsets, and some obstruent+glide clusters (e.g. [ɡw]). Word-medially, however, Catalan allows /s/+stop clusters, which are typically analyzed as having a syllable break between the /s/ and the following stop (e.g. estúpid [əs.tú.pit] ‘stupid’). Like English, Catalan also allows some other obstruent-initial sequences in medial position; some examples are shown in (1). Word-finally, Catalan allows /s/+stop (e.g. [trist] ‘sad’, [fosk] ‘dark’) and stop+/s/ (e.g. [minuts] ‘minutes’, [kɔps] ‘hits, blows’) codas, as well as some sonorant+fricative or liquid+stop combinations not examined in this study (see Hualde, 1992; Recasens, 1993 for examples).

² The concept of recoverability is similar to the motivation behind clear speech. Clear speech refers to the notion that talkers adopt “an intelligibility-enhancing speaking style when they anticipate or sense perceptual difficulty or comprehension failure on the part of a listener” (Maniwa et al., 2008: 1114). While many studies on clear speech have focused on hearing impaired or elderly populations, cochlear implant users, and children with or without learning disabilities (see Maniwa et al., 2008 for extensive references), there has also been research showing that clear speech benefits normal hearing and second language learning populations (e.g. Bradlow & Bent, 2002; Uchanski, Choi, Braida, Reed, & Durlach, 1996).

(1) Sample obstruent-initial word-medial consonant clusters in Catalan

stop-stop: [əktó] ‘actor’, [əptitút] ‘aptitude’, [məɾáɡdə] ‘emerald’
stop-fricative: [kápsə] ‘box’, [putsé] ‘maybe’, [subzónə] ‘sub-zone’
stop-nasal: [ɛ́tnik] ‘ethnic’, [əknɔ́stik] ‘agnostic’, [ətmiráplə] ‘admirable’
fricative-stop: [əspázə] ‘sword’, [nəftəlínə] ‘mothballs’, [əzbést] ‘asbestos’
fricative-fricative: [dəsfiláðə] ‘parade’
fricative-nasal: [əzmál] ‘nail polish’

In addition to the phonotactic illegality of /s/-initial sequences in word-initial position, Catalan also differs from English in the status of the sound /v/. Barcelona Catalan does not contain the phoneme /v/, though ‘v’ does exist as an orthographic symbol and [v] occurs as an allophone of /f/ in voice assimilation contexts across word boundaries, e.g. baf mullat [báv muʎát] ‘wet steam’. The phoneme /b/ (orthographic ‘v’ or ‘b’) is [b] in onset position after a pause, a nasal, or a non-continuant (e.g. enveja [əmbɛ́ʒə] ‘envy’, advocat [ədbukát] ‘lawyer’) and in coda position before a voiced obstruent (e.g. cap dia [káb díə] ‘no day’), but is realized as [β] in intervocalic position (e.g. [áβil] ‘skillful’) or after a continuant (e.g. [bulβós] ‘bulbous’).

These differences between the phonotactics and the phoneme inventories of Catalan and English may help distinguish between the language-specific analogy and universally phonetically-motivated hypotheses. For example, whereas the English results in Davidson (2006a) could be explained by appealing to the permissibility of /f/-initial and /s/-initial word-initial sequences in English, the same cannot be said for Catalan since /s/-initial sequences are prohibited. Thus, for fricative-initial sequences, it might be expected that while performance on /fC/ sequences is more accurate relative to /zC/ and /vC/, differences between the latter two should not arise. With respect to stop-initial sequences, Catalan speakers, like English speakers, should show little or no difference based on either voicing or place of articulation, since both voiced and voiceless stops can combine with liquids to form word-initial clusters.

1.1.3. The nature of the insertion repair

A second reason to compare Catalan speakers to English speakers is to further investigate questions about the types of repairs that were produced in Davidson (2006a). Results from acoustic analysis of the vocoids inserted between the consonants in clusters like /zd/ or /fk/ indicated that they were shorter than lexical schwas and had lower first and second formants. This suggests that rather than trying to produce a phonological schwa target, speakers were failing to accurately coordinate the consonant articulations and allowed enough of an open vocal tract between the consonants that a transitional vocoid was produced. A significantly shorter duration and lower F1 in the inserted vocoids is consistent with the lack of tongue movement toward a schwa target; instead, the tongue is transitioning between two obstruent constrictions and remains high in the oral tract. This repair was called “gestural mistiming” to reflect the idea that it arises from a difficulty in articulatory coordination. The acoustic findings were consistent with an articulatory analysis of tongue motion during the production of /z/-initial sequences using ultrasound, which showed that some speakers’ productions of [zᵊC] showed no evidence of tongue motion toward a schwa, as compared to [səC] and [sC] ([ᵊ] indicates the inserted vocoid produced in a /zC/ cluster) (Davidson, 2005). One question arising from Davidson (2006a) is whether gestural mistiming is likely for all languages that do not allow obstruent-obstruent and obstruent-nasal word-initial clusters, or whether it is only possible for a language like English because [CᵊC] is perceptually similar to [CəC]. This idea has its root in research which claims that perceptual similarity among existing and possible phonological elements is a critical component of phonological processing (e.g. Fleischhacker, 2005; Steriade, 1999b; Steriade, 2003; Zuraw, 2007).
In English, non-final schwa is already a high-mid central reduced vowel (Flemming & Johnson, 2007), and in pre-tonic position in words like ‘potato’ or ‘tomorrow’ schwa can span a relatively large range of durations from practically non-existent to greater than 70ms (Davidson, 2006c). Barcelona Catalan, like English, has a schwa that surfaces only in an unstressed position in a word (Carbonell & Llisterri, 1999; Hualde, 1992; Recasens & Espinosa, 2006; Wheeler, 2005). Although there have been no systematic studies examining the degree of reduction of schwa in Catalan casual or connected speech, we can at least investigate (a) whether Catalan and English speakers produce similar amounts of insertion to repair the experimental CC sequences, and (b) whether the acoustic properties of the inserted vowel are the same for both languages. If perceptual similarity plays a role in determining likely repairs, then it might be expected that Catalan speakers will exhibit gestural mistiming errors similar to those produced by English speakers. If perceptual similarity does not constrain the preference for gestural mistiming, then an analysis of the insertion errors by Catalan speakers could reveal that the vowel produced to repair CC sequences maintains a consistent duration, suggesting phonological epenthesis. However, it is possible that proportions of other types of errors may nevertheless vary, especially since Catalan exhibits a productive prothesis process which inserts [ə] before /s/-initial onset sequences (Bonet & Lloret, 1998), which may extend to other types of clusters as well (Fleischhacker, 2005).

1.1.4. Effects of input modality: auditory versus audio+text input

Finally, this study examines whether the input modality (auditory stimuli only or audio+text input) affects the amount and type of errors produced by speakers attempting phonotactically illegal consonant clusters. In the original study, speakers were always presented with both an auditory and orthographic representation of the stimuli, since the role of the speakers’ perception of the stimuli was explicitly not being tested. However, researchers addressing questions of loanword adaptation and second language acquisition have shown that there is an undeniable influence of orthography in the phonological and phonetic characteristics of speakers’ productions (e.g., Dohlus, 2005; Smith, 2008; Vendelin & Peperkamp, 2006; Young-Scholten, Akita, & Cross, 1999). For example, Young-Scholten et al. (1999) argue that orthography promotes the use of epenthesis by English speakers producing Polish consonant sequences, whereas learners exposed only to auditory input are more likely to delete a phoneme. In this study, we explore not only whether the type of repair changes depending on whether the input is auditory only or audio+text, but we also examine whether phonological factors like place and manner interact with input type to affect the kinds of errors that English and Catalan speakers produce.

The experiment is presented in the following section. In order to examine the questions outlined above, several experimental factors are manipulated. First, speakers are presented with 60 different types of obstruent-obstruent and obstruent-nasal consonant sequences that vary as to manner combination (e.g. stop-stop, fricative-stop, stop-nasal, etc.) and place and voicing of the first and second consonant. Second, half of the stimuli are presented in an audio-only condition, and the other half with an orthographic representation accompanying the auditory stimuli in order to examine the role of input modality. Finally, both the analogy versus language-independent phonetic hypotheses and the question of repair types are addressed by testing speakers of American English and Barcelona Catalan.

2.0 Methods

2.1. Participants

The English-speaking participants were 23 New York University students. They were all native speakers of American English who spoke neither Slavic languages nor any other languages with initial obstruent clusters, such as Hebrew. None of the participants had any known speech or hearing impairments. They were compensated $10 for their participation.

The 14 Catalan-speaking participants were all psychology students at the University of Barcelona who were given course credit for their participation. Although all Catalan speakers also speak Spanish, we attempted to obtain Catalan-dominant participants by requiring that they spoke only Catalan before attending primary school. In addition, further details about their language background were assessed with an extensive demographic questionnaire. The questionnaire asked the participants to rate their language usage at four stages in their lives (before schooling, primary school, high school, and adulthood) and at three different locations (at home, at school, in other places). Ratings were given on a scale of 1 to 7, where 1 was “only Spanish”, 4 was “equal use of Spanish and Catalan”, and 7 was “only Catalan”. In order to be considered a Catalan speaker, participants had to use primarily Catalan at all stages (ratings of 5 to 7). Like the English speakers, no Catalan speakers had learned another language containing the obstruent-initial clusters used in this experiment.
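The inclusion criterion for the Catalan group can be sketched in a few lines of Python. This is only an illustration: the function name and data layout are hypothetical, and the sketch assumes that "at all stages" means every one of the twelve stage-by-location ratings had to fall between 5 and 7, which the text does not spell out.

```python
from itertools import product

# Stages and locations named in the questionnaire description.
STAGES = ("before schooling", "primary school", "high school", "adulthood")
LOCATIONS = ("home", "school", "other places")


def is_catalan_dominant(ratings, threshold=5):
    """Return True if every stage-by-location rating is at or above threshold.

    ratings maps (stage, location) -> integer from 1 ("only Spanish")
    to 7 ("only Catalan"). Applying the threshold to every single cell
    is an assumption of this sketch.
    """
    for key in product(STAGES, LOCATIONS):
        value = ratings[key]
        if not 1 <= value <= 7:
            raise ValueError(f"rating out of range for {key}: {value}")
        if value < threshold:
            return False
    return True


# Toy respondents (invented for illustration, not data from the study).
mostly_catalan = {key: 6 for key in product(STAGES, LOCATIONS)}
balanced_at_school = dict(mostly_catalan)
balanced_at_school[("high school", "school")] = 4
```

With these toy inputs, `is_catalan_dominant(mostly_catalan)` is true and `is_catalan_dominant(balanced_at_school)` is false.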

2.2. Materials

The target words contained either initial consonant clusters (CC) or initial consonant-schwa-consonant sequences (CəC). Each CC word had a CəC counterpart (e.g. [pkádi], [pəkádi]). Consonants were combined into stop-stop, stop-fricative, stop-nasal, fricative-stop, fricative-fricative, and fricative-nasal sequences to create 60 unique consonant sequences. The fricatives were /s/, /f/, /z/, and /v/, the stops were /b/, /d/, /ɡ/, /p/, /t/, /k/, and the nasals were /m/ and /n/. All obstruent-obstruent sequences agreed in voicing, but both voiced and voiceless obstruents were combined with the nasals. Two distinct CCáCV and CəCáCV tokens were created for each onset, for a total of 240 target words (60 consonant combinations × 2 word conditions [CC, CəC] × 2 tokens for each word condition). In addition, 120 CVCV words were created as filler items. The stimuli were recorded by a native Russian speaker using a Marantz PMD-670 digital recorder at a 22 kHz sampling rate. All of the target words are shown in the Appendix.
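The stimulus counts can be checked with a short enumeration. The sketch below is one reading of the design, not the author's procedure: it assumes that every obstruent may serve as C1, that C1 and C2 are never identical, and that obstruent-obstruent pairs must agree in voicing. Under those assumptions the enumeration yields the 60 onsets and 240 target words stated above; the actual inventory is the one listed in the Appendix.

```python
from collections import Counter

# Segment inventory from the Materials section ("g" stands for IPA /ɡ/).
VOICED_STOPS = ["b", "d", "g"]
VOICELESS_STOPS = ["p", "t", "k"]
VOICED_FRICATIVES = ["z", "v"]
VOICELESS_FRICATIVES = ["s", "f"]
NASALS = ["m", "n"]

STOPS = VOICED_STOPS + VOICELESS_STOPS
FRICATIVES = VOICED_FRICATIVES + VOICELESS_FRICATIVES
VOICED = set(VOICED_STOPS + VOICED_FRICATIVES)


def manner(c):
    """Return S, F, or N for stop, fricative, or nasal."""
    if c in STOPS:
        return "S"
    if c in FRICATIVES:
        return "F"
    return "N"


def enumerate_onsets():
    """List (C1, C2) pairs under the assumptions stated in the lead-in."""
    onsets = []
    for c1 in STOPS + FRICATIVES:
        for c2 in STOPS + FRICATIVES + NASALS:
            if c1 == c2:
                continue  # assumption: no identical C1 and C2
            if c2 not in NASALS and (c1 in VOICED) != (c2 in VOICED):
                continue  # obstruent-obstruent pairs agree in voicing
            onsets.append((c1, c2))
    return onsets


onsets = enumerate_onsets()
by_manner = Counter(manner(c1) + manner(c2) for c1, c2 in onsets)
n_targets = len(onsets) * 2 * 2  # word conditions (CC, CəC) x tokens
```

Here `len(onsets)` is 60 and `n_targets` is 240; `by_manner` gives the per-manner breakdown implied by these assumptions.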

2.3. Design and Procedure

In both New York and Barcelona, participants were seated in a sound-attenuated room and faced a window through which they viewed a computer monitor. The stimuli were presented using ePrime. Half of the stimuli were presented both auditorily over computer speakers and visually using orthography appropriate for the native language of the participant, and half were presented only as auditory stimuli. In the case of the English speakers, the orthographic vowel ‘e’ was used to indicate schwa in the CəC tokens (e.g. “pekadi”), whereas ‘a’ was chosen for Catalan since phonological /a/ is one of the vowels reduced to /ə/ in unstressed positions (“pakadi”). The same stimuli were presented in the auditory-only or audio+text modality for each subject, though the order in which the two sections were presented was counterbalanced. Within each half of the experiment, the words were presented in a random order. Participants heard each word twice and then repeated it once into a microphone. Results from similar previous testing indicate that speakers’ accuracy is the same regardless of whether they simply repeat the target words or produce them embedded in a native-language sentence (Davidson et al., 2004). The repetition task was chosen for its greater simplicity. Participants’ responses were recorded with a Shure Beta 85A microphone on a Marantz PMD-670 digital recorder in New York, and with a MicroTrack 24/96 in Barcelona. The next stimulus was presented automatically after 3400 ms. Participants were given six practice trials before the experimental trials; the first three trials were audio+text, and the last three were audio-only.

2.4. Data Analysis

All of the participants’ responses were analyzed by repeatedly listening to the files and examining the spectrograms of the utterances in Praat to determine what, if any, error had been made. Errors in the production of a consonant cluster were labeled as shown in Table 1. If multiple errors occurred, each error was labeled, and if none of the errors found in Table 1 occurred, the token was labeled as ‘correct’. A token was coded for insertion if there was either a period of voicing after frication or a stop burst with formant structure containing a visible second formant that ended with abrupt lowering of intensity at the onset of the second stop, fricative, or nasal. All errors that constituted more than 5% of the responses are discussed in the results section.

Table 1 Response type codes for CC stimuli

| Response Type | Definition | Example |
|---|---|---|
| Vowel Insertion | CC target is produced with a schwa between the consonants in the cluster | /pkadi/ → [pəkadi] |
| Consonant Insertion | Any target is produced with an extra, unintended consonant | /pəkadi/ → [pənkadi] |
| Consonant Deletion (C1 or C2) | Any target is produced with either the first or second member deleted | /pkadi/ → [kadi]; /pkadi/ → [padi] |
| Prothesis | CC target is produced with a vocalic period before the cluster | /pkadi/ → [əpkadi] |
| Segment Change (C1 or C2) | Any target is produced with two segments, but one differs from the original | /pkadi/ → [skadi] |
| Metathesis | Any target is produced with the order of the consonants reversed | /pkadi/ → [kpadi] |
| Syllabic C1 | CC targets are produced with an unusually long C1 or a C1 that is released abnormally early (with no vocalic material following) and is not perceived as being part of the cluster | /pkadi/ → [p::kadi] |
| Other | Target is not produced or is completely unrecognizable | /pkadi/ → ∅; /pkadi/ → [spaga] |
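The bookkeeping implied by this coding scheme (one or more labels per token, with only error types above 5% of responses carried forward) might look like the following Python sketch. The function names are hypothetical, the token list is toy data invented for illustration, and the sketch assumes that proportions are computed over all coded tokens.

```python
from collections import Counter


def error_proportions(coded_tokens):
    """Return the proportion of tokens carrying each label.

    coded_tokens is a list of label lists; an empty list means the token
    was coded as correct. A token with several errors counts once toward
    each of its error types.
    """
    counts = Counter()
    for labels in coded_tokens:
        if not labels:
            counts["correct"] += 1
        for label in set(labels):
            counts[label] += 1
    total = len(coded_tokens)
    return {label: n / total for label, n in counts.items()}


def reportable_errors(proportions, threshold=0.05):
    """Error types whose proportion exceeds the reporting threshold."""
    return sorted(
        label
        for label, p in proportions.items()
        if label != "correct" and p > threshold
    )


# Toy data: 20 tokens, not results from the study.
toy_tokens = (
    [[] for _ in range(10)]
    + [["Vowel Insertion"] for _ in range(6)]
    + [["Prothesis"] for _ in range(2)]
    + [["Metathesis"]]
    + [["Vowel Insertion", "C1 Change"]]
)
toy_props = error_proportions(toy_tokens)
toy_reported = reportable_errors(toy_props)
```

In the toy data, Metathesis and C1 Change each sit at exactly 5% and so fall below the strict "more than 5%" cutoff, leaving Prothesis and Vowel Insertion as the reportable types.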

In addition to the error coding, duration, F1 and F2 of the inserted vocoids as compared to lexical schwas are examined in Section 3.3. All measurements were taken using Praat. The duration of vocalic material between target consonants, whether for the lexical schwa of the #CəC- stimuli or the inserted vocoid in #CᵊC- productions, was measured using the following criteria (S = stop, F = fricative, N = nasal, O = obstruent): for FN clusters, from the offset of frication to the point of abrupt lowering of intensity for the higher formants; for FO, from the offset of frication to the onset of silence or frication for the obstruent; for SN, from the offset of the energy for the burst to the point of abrupt lowering of intensity for the higher formants; and for SO, from the offset of the energy for the burst to the onset of frication or silence. For each of these intervals, the midpoint F1 and F2 values were obtained using linear interpolation Burg LPC, with a time step of 10 ms, window length of 25 ms, and pre-emphasis of 50 Hz.

An examination of the stimuli presented to the English and Catalan speakers indicated that the Russian talker did not produce any vocalic material in the inter-consonantal position for the CC words for either the voiced or voiceless clusters. An example is shown in Figure 1. For the CəC words, the mean duration of the schwa is 50 ms (standard deviation = 13 ms) for voiced sequences and 43 ms (s.d. = 10 ms) for voiceless sequences (including voiceless obstruent + nasal sequences). The average value of F1 for all sequences is 505 Hz (s.d. = 86 Hz), and the average F2 is 1161 Hz (s.d. = 157 Hz).

English and Catalan speakers’ utterances were coded as correct if the clusters produced by the participants matched the manner, place, and voice specifications of the input, and the consonants were produced in the correct linear order, as determined using the spectrogram. Other small variations from the target stimulus, such as duration, did not prevent the token from being coded as correct. In the case of the Catalan speakers, if [β] was produced instead of [v], it was coded as ‘correct’ since it is difficult to distinguish between these two sounds on the spectrogram.
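The correctness criterion can be made concrete with a small feature comparison. The following Python sketch is illustrative only: the coding in the study was done by ear and from spectrograms, and the feature table, place labels, and function name here are assumptions introduced for the example.

```python
# (manner, place, voiced) for each consonant; "g" stands for IPA /ɡ/.
# Place labels are coarse and chosen for this sketch.
FEATURES = {
    "p": ("stop", "bilabial", False),
    "b": ("stop", "bilabial", True),
    "t": ("stop", "coronal", False),
    "d": ("stop", "coronal", True),
    "k": ("stop", "velar", False),
    "g": ("stop", "velar", True),
    "f": ("fricative", "labiodental", False),
    "v": ("fricative", "labiodental", True),
    "s": ("fricative", "coronal", False),
    "z": ("fricative", "coronal", True),
    "β": ("fricative", "bilabial", True),
    "m": ("nasal", "bilabial", True),
    "n": ("nasal", "coronal", True),
}


def is_correct(target, produced, language="English"):
    """Code a produced cluster as correct if manner, place, and voicing
    match the target segment by segment and in the same linear order.

    For Catalan speakers, [β] produced for target [v] is accepted,
    following the exception described in the text.
    """
    if len(target) != len(produced):
        return False  # inserted or deleted material
    for t, p in zip(target, produced):
        if language == "Catalan" and t == "v" and p == "β":
            continue
        if FEATURES.get(t) != FEATURES.get(p):
            return False
    return True
```

For example, `is_correct(("v", "d"), ("β", "d"), language="Catalan")` is true, whereas the same production is false for an English speaker, and a metathesized `("k", "p")` for target `("p", "k")` is false in either language.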

Figure 1. Production of stimulus item [tkali] by Russian speaker.

3.0 Results

3.1. Accuracy

Out of a total of 4393 CC utterances produced by the participants (collapsing across languages), 51% were produced correctly according to the criteria defined in Section 2.4. Another 45% fell into five categories: Consonant Insertion, C1 Change, C2 Change, C1 Deletion, and Prothesis. The final 4% were scattered among the remaining error types in Table 1. Of the total number of CC utterances, 3% were produced with multiple errors (usually no more than two). To simplify the following analyses, they will be limited to the top five error types, and when multiple errors were produced, they are divided up into their respective error categories. Errors will be further examined in Section 3.2.

The results of Davidson (2006a) showed that the voice and place of the first consonant of the fricative-initial sequences in that study influenced speakers’ accuracy on the production of the non-native sequences. In this study, we examine the effects of manner combination, the identity of C1, and the voicing specification of the cluster to examine whether particular manner combinations, C1s or voicing specifications are more likely to be accurately produced.³

3 The effect of the frequency of the experimental clusters in other positions in the word is not exhaustively considered in this study, since it was already shown to have no effect in Davidson (2006a) and Davidson et al. (2004). However, four correlations for English between the frequency of the clusters in any other position in the word and accuracy for manner combination as reported in Section 3.1.1 were carried out just to confirm the lack of effect. The correlations assessed the relationship between accuracy for both the audio-only and audio+text conditions and the frequency per million (sum of the frequency per million of the words containing the relevant cluster in any position in the word) or type frequency (number of individual words containing the relevant cluster in any position in the word) for each of the clusters, based on the Francis and Kucera corpus accessed on the MRC Psycholinguistics Database (http://www.psy.uwa.edu.au/mrcdatabase/uwa_mrc.htm). The words containing the relevant clusters were both mono- and multimorphemic (see Davidson 2006a, p. 109-110 for further clarification of how the frequency data was compiled). Results showed that there were no correlations between accuracy and frequency for any comparisons (accuracy in the text condition and frequency per million: r² = .171, p = .216; accuracy in the audio condition and frequency per million: r² = 0.047, p = .733; accuracy in the text condition and type frequency: r² = .168, p = .224; accuracy in the audio condition and type frequency: r² = 0.065, p = .641).

3.1.1. Accuracy and manner combination

The effect of manner combination was investigated with an ANOVA with language (Catalan, English) as a between-subjects variable, and manner combination (stop-stop [SS], stop-fricative [SF], stop-nasal [SN], fricative-stop [FS], fricative-fricative [FF], fricative-nasal [FN]) and input modality (audio-only or audio+text) as within-subjects variables. The dependent variable was the arcsine transformed proportion of accurate production. Because the cells were not equal, Type III sums of squares were used. Results showed a marginal effect of language [F(1, 422) = 3.40, p = 0.07] and significant main effects of manner combination [F(5, 522) = 11.51, p < 0.001] and input modality [F(1, 422) = 29.22, p < 0.001], and an interaction between input modality and manner combination [F(5, 422) = 2.29, p = 0.045]. No other interactions were significant. The main effect of input modality was due to greater overall accuracy on items presented with text (M = 57%) than those presented as audio-only (M = 47%). The marginal main effect of language was due to slightly greater overall accuracy of English speakers (M = 54%) than Catalan speakers (M = 50%). The Student-Newman-Keuls (SNK) post-hoc test for manner combination showed the following pattern of accuracy, where ‘<’ indicates that the sequences were significantly less accurate and a comma indicates no significant difference: SS (M = 43%), SN (M = 43%) < SF (M = 51%), FF (M = 56%) < FS (M = 60%), FN (M = 60%) (p < 0.05). The manner results, divided by language, are shown in Figure 2. The overall pattern emerging from these results shows that speakers are generally less accurate on stop-initial sequences than on fricative-initial ones; this will be investigated further in the following section on the first consonant type. The lack of interaction between language and any of the other variables indicates that the patterns for manner combination and input modality hold for speakers of both languages.

To investigate the interaction between input modality and manner combination, separate ANOVAs were conducted for the audio and text conditions with manner combination as the independent variable. There was a significant main effect of manner combination for both conditions (p < 0.01), but SNK post-hoc tests indicate different groupings of significance. For the audio condition, the pattern was SN, SS, SF, FF < FN, FS, and for the text condition, the pattern was SS, SN, SF < SF, FS, FN, FF (note that SF was not significantly different from either group). Thus, a decrease in accuracy on FF in the audio condition is primarily responsible for the interaction.


Figure 2. Proportion of responses for correctly produced stimuli broken down by manner combination. The top graph is English, and the bottom is Catalan. Error bars reflect standard error.

3.1.2. Accuracy and C1 type

An ANOVA was conducted with C1 type (b, d, ɡ, p, t, k, f, s, v, z) and input modality as within-subjects variables and language as a between-subjects variable. The dependent variable was the arcsine transformed proportion of accurate production. Because the cells were not equal, Type III sums of squares were used. As in the preceding analyses, speakers were significantly more accurate in the Text condition than in the Audio condition, and language was not significant, so these are not reported again. There was a main effect of C1 type [F(9, 705) = 40.20, p < 0.001], but there were no interactions among any of the variables. These results indicate that the pattern shown for C1 accuracy was the same for both presentation conditions and for speakers of both languages. An SNK post-hoc test showed the following accuracy pattern for C1: /b ɡ v d/ < /z p/ < /k t/ < /f/ < /s/. The accuracy for each C1 broken down by language and input modality is shown in Figure 3.
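Because the dependent variable in these analyses is an arcsine transformed proportion, a brief sketch of such a transform may be useful. The text does not state which variant was applied; the version below assumes the common arcsine square-root form, and the function name `arcsine_transform` is hypothetical.

```python
import math


def arcsine_transform(p):
    """Arcsine square-root transform of a proportion p in [0, 1].

    Assumption: this is the asin(sqrt(p)) variant often used to
    stabilize the variance of proportions before an ANOVA; the study
    does not specify which variant was used.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("proportion must lie between 0 and 1")
    return math.asin(math.sqrt(p))
```

Under this variant, a proportion of 0 maps to 0, a proportion of 0.5 maps to π/4, and a proportion of 1 maps to π/2, so differences near the extremes are stretched relative to differences near the middle of the scale.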

Figure 3. Proportion of responses for correctly produced stimuli broken down by C1 type. English is in the top graph, and Catalan is in the bottom graph.

3.1.3. Accuracy and voicing

Given that accuracy on clusters beginning with voiced initial consonants is lower than for voiceless initial consonants, we can confirm the role of voicing with a separate ANOVA with the voicing specification of the cluster as an independent variable. This variable has three levels: voiced (consists of voiced obstruent-obstruent clusters and voiced obstruents followed by nasal clusters), voiceless obstruents (voiceless obstruent-obstruent clusters), and mixed voicing, which pertains to clusters with an initial voiceless consonant followed by a nasal. The other independent variable is language; because input modality did not have any significant interactions in the preceding analysis, it is not included here. The dependent variable was the arcsine transformed proportion of accurate production. As before, there was no main effect of language. There was a main effect of voicing [F(2, 108) = 26.45, p < 0.001], but no interaction between language and voicing [F(2, 108) = 1.22, p = 0.30]. An SNK post-hoc test indicates that there was no difference between voiceless obstruent (M = 67%) and mixed voicing clusters (M = 66%), but voiced clusters were significantly less accurate (M = 36%).
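The three levels of the voicing variable can be expressed as a small classification sketch. The function name `voicing_level` and the nasal set are assumptions made for illustration (the nasals used in the stimuli are not listed in this section); the obstruent sets follow the C1 inventory given in Section 3.1.2.

```python
VOICED_OBSTRUENTS = set("bdgɡvz")   # both ASCII g and IPA ɡ are accepted
VOICELESS_OBSTRUENTS = set("ptkfs")
NASALS = set("mn")                  # assumption: nasals in the stimuli


def voicing_level(c1, c2):
    """Classify a two-consonant cluster into one of the three voicing levels."""
    if c1 in VOICED_OBSTRUENTS and (c2 in VOICED_OBSTRUENTS or c2 in NASALS):
        return "voiced"
    if c1 in VOICELESS_OBSTRUENTS and c2 in VOICELESS_OBSTRUENTS:
        return "voiceless"
    if c1 in VOICELESS_OBSTRUENTS and c2 in NASALS:
        return "mixed"
    raise ValueError(f"cluster {c1}{c2} does not fit any voicing level")
```

For example, the stimulus cluster in [tkali] would be classified as voiceless, while a voiceless fricative followed by a nasal would fall into the mixed category.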

3.1.4. Summary: Accuracy

The results from manner combination, C1 type, and voicing indicate that although these sequences are not attested in English or Catalan (with the exception of /s/-clusters for English speakers), speakers do not produce them all equally accurately. The findings for manner indicate that speakers are generally less accurate on stop-initial sequences than on fricative-initial ones, though the findings for C1 and voicing specification demonstrate that the voicing of the clusters also contributes to accuracy patterns. That is, voiced stops and voiced fricatives have similarly low rates of accuracy (though speakers are still less accurate on /v/ than on /z/), followed by voiceless stops which are more accurate, and then by voiceless fricatives. There is not a noticeable generalization about place of articulation, though among the voiceless stops, speakers are less accurate on /p/ than on /t/ or /k/. This performance on /p/ may be attributable to the same hypothesis that distinguishes performance on /v/ from /z/; since it is typically produced with weaker and more diffuse energy in the burst than /t/ and /k/ (Stevens, 1998), speakers may need to make more of an effort to produce it accurately and audibly. This may lead them to insert a vocalic element or change the consonant they are producing more often than they do for /t/ or /k/. The manner of C2 does not contribute to accuracy as much as C1. Notably, the lack of interaction for both manner combination and C1 with language background suggests that the phonetic and phonological factors underlying these patterns are similar regardless of the speaker’s native language; this will be further discussed in the General Discussion. 
In addition, similar relationships between manner combinations in the audio and text conditions (with the exception of FF, which decreases dramatically in the audio-only condition) indicate that while there is a general decrement in performance due to the audio-only input, the phonetic and phonological factors underlying these patterns are not dependent on the type of input to the listener. However, overall accuracy does not indicate whether the types of errors that are produced interact with language background or input type. This is analyzed in Section 3.2.

3.2. Errors

The analysis of errors in this section focuses on the relationship between language, input modality, error type, and manner combination. Since the results for the effects of C1 did not show a substantial effect of place of articulation, the C1 findings largely mirror the manner combination results. Because of this, and to present an interpretable analysis, the focus of the error section will be on manner combination. However, the difference between voiced and voiceless C1s is relevant, so they will be discussed qualitatively in Section 3.2.2.

3.2.1. Manner combination

An ANOVA to examine errors was carried out with language as a between-subjects variable, and input modality, manner combination, and error type (insertion, C1 deletion, C1 change, C2 change, prothesis) as within-subjects variables. The dependent variable was proportion of erroneous responses, which was arcsine transformed. Each of the main effects except language was significant: input modality [F(1, 2110) = 37.85, p < 0.001], manner combination [F(5, 2110) = 12.33, p < 0.001], and error type [F(4, 2110) = 176.06, p < 0.001]. The main effect of language was not significant [F(1, 2110) = 1.01, p = 0.32]. There were also significant two-way interactions between language and error type [F(4, 2110) = 19.38, p < 0.001], input modality and error type [F(4, 2110) = 9.86, p < 0.001], and manner combination and error type [F(20, 2110) = 31.60, p < 0.001]. The only significant three-way interaction was language, manner combination, and error [F(20, 2110) = 2.57, p < 0.001]. No other interactions were significant. In this analysis, the only relevant main effect is that for error type, since the main effects of language, input modality, and manner combination (which reflect average error proportions for each of these independent variables) mirror the results already seen for accuracy. Thus, we focus on the main effect of error type and its interactions. A Student-Newman-Keuls post-hoc test indicates that the main effect of error is due to the following pattern (‘<’ indicates a lower proportion of errors in this category, averaging over all other independent variables): C2 Change (M = 0.02) < Prothesis (M = 0.08) < C1 Change (M = 0.12), C1 Deletion (M = 0.12) < Insertion (M = 0.24). All of the two-way interactions were examined by carrying out one-way ANOVAs for each of the error types with language, input modality, or manner combination as the independent variable. In cases where there were more than two levels of the independent variable, SNK post-hoc tests were then used to investigate further. First, the two-way interaction between language and error type is due to differences between English and Catalan for C1 Change. Means for all of the error types are shown in Table 2.

Table 2 Average values for each error type, collapsed over all manner combinations. The significant difference for C1 Change (p < 0.001) is marked with an asterisk.

          Insertion   C1 Change   C1 Deletion   Prothesis
English   0.27        0.09*       0.12          0.07
Catalan   0.21        0.18*       0.11          0.08

The interaction between manner combination and error is a result of different types of errors being preferred for different sequences. Because there is also a significant three-way interaction between language, manner combination, and error, it is more informative to investigate the results by language rather than collapse over them. To examine the interactions, planned comparisons were carried out for a limited number of relevant pairings; all results that are reported are significant at p < 0.01. The full illustration of error types for each language is shown in Figure 4. For both English and Catalan speakers, insertion and C1 deletion are equally common for SF, but insertion is significantly more frequent than any other error type for SS and SN. For English speakers, insertion is the most frequent error for FS, and C1 deletion and insertion are significantly more frequent than the other errors for FF. For Catalan speakers, C1 change is significantly more frequent than all other errors for FN.

Figure 4. Proportion of error types for each manner combination. Each graph represents a different language (English top, Catalan bottom).

The two-way interaction between input modality and error arose because there was no significant difference between the audio and text conditions for insertion errors. All other error types showed a significantly higher proportion of errors in the audio condition than in the text condition (p < 0.001 for all error types, except prothesis, p < 0.02). The interaction between input modality, error, and manner was investigated similarly to the other three-way interaction. Planned comparisons (p < 0.001) show that there are significant increases in C1 deletion for FF, SF and SN in the audio condition, but not for any other sequence. C1 change increased in the audio condition for SN and SS. The amount of prothesis increased for SN and SF sequences. Although there was no four-way interaction, the plots of all four independent variables are provided in Figure 5. One notable point emerges: although Catalan and English speakers use similar amounts of insertion in the text condition for stop-initial sequences, there are considerable differences for fricative-initial sequences. English speakers still primarily repair the sequences with insertion, but Catalan speakers use a variety of other strategies.

Figure 5. Proportion of error types for each manner combination. Each row of graphs represents a different language, and each column corresponds to a different presentation condition. C2 change has been omitted to make the graphs more readable since it is an infrequent error.

3.2.2. C1 Type

Breaking down errors by C1 type largely echoes the results for manner combination: there are more errors for C1s that are stops, and of those errors, insertion is most common for both languages and both presentation conditions. Overall, English and Catalan speakers show roughly similar patterns for all C1s, although Catalan speakers demonstrate somewhat more C1 change for stops than English speakers do in the audio condition. Error types for each C1 for each language and presentation condition are shown in Figure 6. It is clear that speakers in both presentation conditions produce more errors overall for voiced stops than for voiceless stops. In the audio condition, there is more variation of error types for all speakers for all C1s, but in the text condition, differences among the language groups emerge. Whereas English speakers consistently use insertion most often for almost all C1 types (except /p/, which has equal amounts of insertion and C1 deletion, and /z/, which is produced with prothesis in the audio condition), Catalan speakers use it most often for stops. For voiced versus voiceless fricatives, the patterns are less clear because there are fewer errors for these C1s. In both presentation conditions, English speakers again use insertion most often for all fricatives, except /z/ in the audio condition. Catalan speakers prefer C1 change for /v/, but use C1 change and prothesis equally for /z/.


Figure 6. Proportion of error types for each C1 type. Each row of graphs represents a different language, and each column corresponds to a different presentation condition. C2 change has been omitted from these graphs to make them more readable. Vertical lines represent groupings of voiced stops, voiceless stops, voiced fricatives, and voiceless fricatives.

3.2.3. C1 Change: how speakers modified the clusters

The types of consonants produced for C1 change were counted for each language for those C1s that contain more than 5 errors for Catalan speakers and more than 10 errors for English speakers (since there were nearly twice as many English participants). These are shown in Table 3. Results indicate that the most common kind of change for voiced fricatives is devoicing. When voiced stops were changed, the erroneous productions contained stops, fricatives, and nasals, suggesting that there is no preferred strategy for producing voiced stops. /p/ is the only voiceless stop that is commonly changed; Catalan speakers produce it as a different stop and English speakers are split between stop and fricative. For both /z/ and /f/, English speakers have a tendency to produce an /s/, which makes the sequence legal in English. Catalan does not allow /s/-clusters, but speakers often devoice /z/, and the common change of /f/ to another fricative is attributable to productions of /s/.

Table 3. Proportions of changes for target consonants undergoing C1 change, collapsed over input modality. ‘+/- voice’ indicates that the only change made was to change the voicing of the intended consonant. If speakers changed the place of stops or fricatives, these changes are counted in the ‘stop’ or ‘fricative’ categories, respectively. A change from an intended stop to a fricative is counted in the ‘fricative’ category, and a fricative to a stop is marked in the ‘stop’ column. ‘Nasal’ indicates that the speaker produced a nasal consonant. ‘Legal’, for the English speakers, indicates how often speakers produced an /s/ instead of the intended consonant resulting in a phonotactically possible cluster. The category ‘legal’ is overlapping with ‘fricative’, so those rows do not add up to 1.

Language   C1         +/- voice   stop   fricative   nasal   legal
Catalan    b (n=31)   0.10        0.48   0.35        0.06    -
           d (n=10)   0.20        0.40   0.40        0.00    -
           p (n=23)   0.00        0.70   0.30        0.00    -
           v (n=59)   0.81        0.10   0.08        0.00    -
           z (n=28)   0.96        0.00   0.04        0.00    -
           f (n=11)   0.09        0.36   0.55        0.00    -
           s (n=13)   0.92        0.00   0.00        0.00    -
English    b (n=13)   0.31        0.31   0.23        0.15    0.00
           d (n=24)   0.13        0.42   0.29        0.17    0.04
           p (n=17)   0.05        0.37   0.58        0.00    0.00
           v (n=37)   0.70        0.24   0.00        0.03    0.00
           z (n=19)   1.00        0.00   0.00        0.00    1.00
           f (n=11)   0.09        0.09   0.82        0.00    0.82

3.2.4. Summary: Errors

Whereas comparisons of overall accuracy rates for manner combination and C1 showed several similarities across both languages, the types of errors produced by speakers of different languages are more diverse. Insertion is more common in stop-initial sequences for speakers of both languages, but Catalan speakers produce greater proportions of C1 change and prothesis for fricative-initial clusters. The English speakers still produce higher proportions of insertion in fricative-initial sequences, but other error types also increase for these sequences. Overall, results for individual C1 types do not indicate a consistent and substantial difference in error types depending on the voicing of the consonant; whether C1 is a stop or a fricative is a more informative predictor of the error type. However, some interesting distinctions arise depending on input modality. Speakers of both languages seem to use a more consistent strategy in the text condition than in the audio condition. The Catalan speakers vary between C1 change and prothesis for the fricatives, but settle on insertion for stops in the text condition. For English speakers, insertion dominates all manner types in the text condition.

3.3. Insertion: Are inserted vowels the same as schwa?

In order to determine whether the inserted vocoid has the same acoustic characteristics as the lexical, or intentionally produced schwa, we examine duration, F1, and F2. A shorter vowel duration is consistent with the hypothesis that speakers are not attempting to produce an epenthetic lexical schwa with its own articulatory target, and a lower F1 would suggest that the vowel does not have its own target but rather reflects a tongue position consistent with a short and high movement between two obstruent closures. In addition to these measures, Davidson (2006a) also examined F2 values, and showed that for some sequences, the F2 midpoint was significantly lower for the inserted vocoids than for lexical schwas. The finding for F2 was interpreted as being consistent with tongue root movement anticipating the upcoming [a] during the inserted vocoid and/or consonant production even more than for normal vowel-to-vowel coarticulation, since the F2 for a low back vowel would likely be lower than the F2 for a specified schwa target. For this analysis, a subset of the data is used. In order to prevent a massively unequal balance between the CəC words and the CC words that happened to be produced with insertion, the data was culled so that only those CəC words were included for each participant if that person also produced CC with insertion. As above, unequal cell sizes in the ANOVAs are treated by using Type III Sums of Squares.
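The culling step described above can be sketched as a simple filter. This is an illustration under stated assumptions, not the procedure actually used: it assumes that a lexical-schwa token is kept only when the same participant produced the corresponding cluster item with insertion, and the record fields (`participant`, `item`, `type`, `error`) and the function name `cull_tokens` are hypothetical.

```python
def cull_tokens(tokens):
    """Balance the comparison between inserted vocoids and lexical schwas.

    Keeps every cluster token produced with insertion, and keeps a
    lexical-schwa token only if the same participant produced the
    matching cluster item with insertion (an assumed matching rule).
    """
    inserted = {
        (t["participant"], t["item"])
        for t in tokens
        if t["type"] == "cluster" and t["error"] == "Insertion"
    }
    kept = []
    for t in tokens:
        key = (t["participant"], t["item"])
        if t["type"] == "cluster" and t["error"] == "Insertion":
            kept.append(t)
        elif t["type"] == "lexical" and key in inserted:
            kept.append(t)
    return kept


# Toy data, not taken from the study.
tokens = [
    {"participant": "P1", "item": "tk", "type": "cluster", "error": "Insertion"},
    {"participant": "P1", "item": "tk", "type": "lexical", "error": None},
    {"participant": "P1", "item": "zd", "type": "cluster", "error": None},
    {"participant": "P1", "item": "zd", "type": "lexical", "error": None},
]
kept = cull_tokens(tokens)
```

In this toy example only the two "tk" tokens survive, because the participant produced "zd" without insertion.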

3.3.1. Duration

An ANOVA was carried out with phonotactic type (CC, CəC) and input modality as within-subjects variables and language as a between-subjects variable. The dependent variable was the duration of the vowel in milliseconds. Results show that there is a main effect of phonotactics [F (1, 1660) = 169.81, p < 0.001] but no effect of input modality [F < 1] or language [F (1, 1660) = 1.75, p = 0.19]. The only significant two-way interaction is language by input modality [F (1, 1660) = 9.88, p = 0.002]. The three-way interaction was not significant [F < 1]. The main effect of phonotactics is a result of insertion being shorter than lexical schwa. As the lack of an interaction between language and phonotactics indicates, the vowel durations for both CC and CəC are similar for both languages (English CC: M = 40ms, CəC: M = 51ms; Catalan CC: M = 37ms, CəC: M = 52ms). The interaction between input modality and language is due to a significant 5ms difference in duration for the text condition between Catalan and English speakers. However, since anything less than an 8-10ms difference may be below the just noticeable difference (JND) for short, unstressed vowels (Bochner, Snell, & MacKenzie, 1988), this result is not considered to carry any importance. These durations are shown in Table 4.

Table 4 Duration of schwa in milliseconds for audio-only and text conditions. Standard deviation is in parentheses.

          Audio     Text
Catalan   46 (19)   43 (21)
English   44 (18)   48 (20)

3.3.2. First and second formant

An ANOVA similar to the one for duration was carried out with F1 midpoint in Hertz as the dependent variable. Results indicated that there are main effects of phonotactics [F (1, 1660) = 108.6, p < 0.001], language [F (1, 1660) = 30.53, p < 0.001], and input modality [F (1, 1660) = 7.49, p = 0.006]. The only significant two-way interaction was phonotactics by language [F (1, 1660) = 9.88, p = 0.002]. The three-way interaction was not significant [F < 1]. The main effect of phonotactics was a result of F1 for insertion being lower than lexical schwa (CC: M = 433Hz, CəC: M = 502Hz). The effect of language was due to a lower F1 for English speakers collapsing over phonotactics (M = 450Hz), than for Catalan speakers (M = 486Hz) (p < 0.05). Stevens (1998: 228) notes that the JND for spectral prominences in the ranges usually associated with first and second formants is approximately 10-20Hz. Since the difference between the languages is above that range, it will be considered significant. The main effect of input modality is due to different patterns of significance across the audio and text conditions. Since there are no interactions of input modality with either phonotactics or language, F1 for both English and Catalan speakers is lower in the audio condition (M = 459Hz) than in the text condition (M = 477Hz). The interaction between language and phonotactics is due to different relationships among the languages for inserted and lexical schwas. Planned comparisons demonstrated that for inserted schwas, there are no significant differences between English and Catalan. For lexical schwas, the F1 for English is significantly lower than Catalan. The values for F1 for all three independent variables are shown in Table 5.

Table 5 F1 in Hertz for insertion errors (CC) and lexical schwas (CəC). Standard deviation is in parentheses.

Language   Input   CC          CəC
English    Audio   422 (85)    467 (91)
           Text    429 (115)   479 (89)
Catalan    Audio   426 (108)   519 (160)
           Text    457 (202)   540 (154)

An ANOVA for F2 midpoint in Hertz as the dependent variable was also carried out. Results indicated that there is no main effect of phonotactics [F (1, 1660) = 2.32, p = 0.13] or input modality [F (1, 1660) = 2.56, p = 0.11], but language was significant [F (1, 1660) = 14.37, p < 0.001]. The only significant two-way interaction was phonotactics by language [F (1, 1660) = 3.86, p = 0.05]. The three-way interaction was also significant [F (1, 1660) = 5.35, p = 0.02]. The results for F2 are shown in Table 6. The main effect of language was a result of F2 of inserted schwas being lower for English than for Catalan (English: M = 1642Hz, Catalan: M = 1697Hz). The interaction between phonotactics and language is due to the fact that there was no difference in F2 between inserted and lexical schwas for the Catalan speakers, but the F2 for English speakers’ inserted schwas was significantly lower than for lexical schwas (p < 0.001). Planned comparisons to test the three-way interaction further narrow down the results by showing that the two-way interaction is entirely due to a significant difference between inserted and lexical schwas in the text condition for the English speakers (p < 0.001). The F2 difference between inserted and lexical schwas in the audio condition was not significant.

Table 6 F2 in Hertz for insertion errors (CC) and lexical schwas (CəC). Standard deviation is in parentheses.

Language   Input   CC           CəC
English    Audio   1630 (266)   1644 (241)
           Text    1605 (255)   1690 (282)
Catalan    Audio   1667 (272)   1691 (239)
           Text    1732 (295)   1696 (209)

3.4. Summary: Acoustic characteristics of inserted and lexical schwas

The results for schwa duration, F1, and F2 frequency are consistent with the findings of Davidson (2006a): speakers produce inserted schwas that are significantly shorter and have lower F1 values than their lexical schwas. Whereas Davidson (2006a) found a lower F2 for some of the fricative-initial clusters, F2 values among the bigger cluster sample in this study were largely the same between lexical and inserted schwas, except for the English speakers in the text condition. One reason for the general lack of effect for F2 may be that there is already large variability on the backness dimension for schwa production depending on the surrounding consonants. This has been shown both for English schwa production (Flemming & Johnson, 2007), and for Barcelona Catalan (Recasens & Espinosa, 2006). Since the surrounding contexts being compared for the inserted and lexical schwas were the same, and there is already great variability in F2 for the lexical schwas, it is not surprising that F2 for inserted schwas would have similar means and standard deviations. Though statistical analysis was not carried out on the standard deviations for F2, the values shown in Table 6 are similar for both lexical and inserted schwas. The hypothesis advanced for English speakers both on the basis of the acoustic data and also on articulatory data (Davidson, 2005) was that English speakers are not intentionally inserting a phonological vowel, but instead are failing to appropriately coordinate the flanking consonants since they do not have experience with producing the experimental clusters in word-initial position. The main acoustic predictions of this hypothesis are that the vocalic material between the consonants should be shorter than an intentional schwa, and that F1 should be lower. There are two possible reasons for a lower F1.
First, for clusters that are composed of two constrictions of the tongue (such as coronal-dorsal, or homorganic clusters like coronal-coronal), the tongue would be expected to stay higher in the mouth as the articulators transition from one constriction to another instead of lowering to reach a vowel target. Second, for clusters involving a labial consonant, it may be that the tongue body is remaining in a default or rest position, which is higher in the vocal tract than the tongue body position for lexical schwa is (Gick, Wilson, Koch, & Cook, 2004). Gick et al. showed that for North American English, the tongue’s rest position is very similar to the posture for /i/. If speakers have not yet begun to move the tongue blade or body for a consonant constriction or for the /a/ vowel following the cluster, then it is likely that the F1 produced during an open transition would be lower than that for a lexical schwa target. No similar research is available for Catalan, but it is possible that these speakers too have a relatively high tongue resting position. If so, then the finding that the F1 and duration of the inserted schwas was the same for speakers of both languages would be consistent with the hypothesis that speakers of both languages may be using gestural mistiming to maintain perceptual similarity with the target. This will be discussed further in the general discussion.

4.0 General Discussion

This study had three main goals. The first was to determine whether the analogy hypothesis or a language-independent phonetic hypothesis could more adequately explain the accuracy results. This was tested both by adding stop-initial consonant sequences in addition to the fricative-initial sequences studied in Davidson (2006a) and by testing speakers of two different language backgrounds. The second goal was to examine the cross-linguistic similarities or differences of errors on the obstruent-initial sequences, including determining whether the gestural mistiming repair hypothesized by Davidson (2005; 2006a) extended to another language which has schwa in the phoneme inventory. Finally, the last aim was to investigate the role of input type, specifically, whether receiving only auditory information significantly affected speakers’ performance compared to combined orthographic and auditory input. Each of these goals is addressed in more detail below. First we consider whether the analogy hypothesis can explain the results of the current study, both with respect to the larger stimulus set including fricative-initial and stop-initial sequences and to the inclusion of Catalan speakers in addition to English speakers. To start with the English speakers, the analogy hypothesis and the language-independent phonetic hypothesis make similar predictions for fricative-initial sequences, but different predictions for stop-initial sequences. As explained in the introduction, accuracy should be greatest for /fC/ sequences, which are featurally most similar to the most permissible sequences in English, followed by /zC/ and then by /vC/. From a phonetic standpoint, voiced obstruent sequences are more difficult to produce than voiceless sequences, and /v/ might be more difficult to accurately produce than /z/ if speakers must make a greater effort to produce a sound that is both voiced and sufficiently perceptually prominent.
As shown by the graph for accuracy by C1 in Figure 3, this is in fact the case, which replicates the findings of Davidson (2006a). In a comparison of fricative-initial versus stop-initial sequences, the analogy hypothesis would be consistent only with the finding that fricative-initial sequences are more accurate, since English does permit /s/+obstruent and /s/+nasal sequences. When the analyses collapse over individual C1s, the results for manner combination demonstrate that speakers are more accurate on fricative-initial sequences than on stop-initial ones. However, the analysis of individual C1s suggests that voicing plays an important role in determining accuracy in addition to manner. The results for C1 from Figure 3 show that voiced stops and /v/ are the least accurate, followed by /z/ and /p/ (which may have a disadvantage similar to that of /v/), then by the other voiceless stops, and finally by voiceless fricatives, which are the most accurate. The analogy hypothesis does not predict this interspersing of stops and fricatives, nor does it predict that there should be a difference between different types of stops since both voiced and voiceless stops of all places of articulation are found in word-initial stop-sonorant clusters.

Especially troubling for the analogy hypothesis is the finding that language does not interact with either manner combination or with C1 type for the accuracy results. This indicates that regardless of whether their native language is English or Catalan, the speakers all produce the same patterns of accuracy whether the stimuli are analyzed by manner combination or by C1 type. While Catalan does allow /f/+sonorant sequences as well as stop+sonorant sequences similar to those permitted in English, /s/-initial sequences are not phonotactically attested and thus cannot serve as the basis for an accurate analogical extension to other fricative-initial sequences. However, just like the English speakers, Catalan speakers are most accurate on /f/-initial sequences, followed by /z/-initial and then /v/-initial sequences. Furthermore, Catalan speakers are as accurate or more accurate on /s/-initial sequences than on any others, despite the fact that this is the precise cluster for which there is an active prothetic repair process (Bonet & Lloret, 1998). The Catalan speakers also show the same pattern of accuracy as the English speakers do with respect to voicing and manner.

Whereas the analogy hypothesis fails to account for why both languages should pattern the same with respect to accuracy on these non-native sequences, a language-independent phonetic explanation provides a better interpretation of the findings. It is not surprising that fricative-initial sequences are generally more accurate than stop-initial ones, since fricatives have internal cues to place of articulation (Jongman et al., 2000; Wright, 1996), which may mean that the articulatory timing required to produce a fricative-obstruent or fricative-nasal sequence can be more flexible than the coordination necessary for stop-initial sequences.
Word-initial stops, on the other hand, can only be accurately perceived if they have a release burst or, even better, if they are followed by a sonorant which carries cues to place in the formant transitions (e.g. Dorman, Studdert-Kennedy, & Raphael, 1977; Smits, ten Bosch, & Collier, 1996; Stevens & Blumstein, 1978; see Wright, 2004 for a review). Thus, speakers must at least release stops in stop-initial consonant clusters (and should do so in order to match the Russian speaker’s stimuli), but since neither English nor Catalan speakers have experience with this coordination in word-initial position, they make more errors on these sequences. Likewise, the distinction between voiced and voiceless stops also has phonetic explanations. Because it is aerodynamically difficult to maintain vocal fold vibration for the entire duration of a voiced obstruent cluster (Ohala, 1983, 1994), speakers can repair these sequences either by devoicing, or by producing a period of open vocal tract (i.e. gestural mistiming) between the consonants which provides a more appropriate aerodynamic environment for vocal fold vibration to continue. Thus, the articulatory challenges related to producing stop-initial sequences, in conjunction with the difficulty of maintaining voicing on these sequences, are consistent with the hypothesis that voiced stop-initial sequences should be the least accurate of all, and this is true for both language groups.

The second issue addressed by this study involved differences among the errors produced by speakers of different languages. The first part of this issue pertains to whether the gestural mistiming repair generalizes to another language containing schwa in the phoneme inventory. This question arises out of a larger research program examining the role of perceptual similarity in phonological processes (e.g. Côté, 2000; Fleischhacker, 2005; Kawahara, 2006; Steriade, 1999b; Steriade, 2003; Zuraw, 2007).
Davidson (2006a) showed that when English speakers do not accurately produce phonotactically impossible obstruent-initial sequences, they instead overwhelmingly produce them with a transitional vocoid between the two consonants. For the acoustic and articulatory reasons described in the introduction, this transitional vocoid was argued to result from gestural mistiming, in which speakers fail to adequately overlap the consonant gestures, instead producing a short period of open vocal tract between the constrictions. English speakers may prefer this repair to others since it allows for the recoverability of both intended consonants (Chitoran et al., 2002; Mattingly, 1981; Weinberger, 1994), and the transitional vocoid (as opposed to a full schwa) is the least disruptive repair of the target sequence from a similarity perspective. In addition, gestural mistiming may have the advantage of resulting in an acoustic string that is very similar to an attested phonotactic pattern in English, so it is a compromise between producing a well-formed but phonologically prohibited consonant cluster and a well-formed but “too nativized” CəC sequence. Since Catalan also has schwa in the inventory, it is an interesting language to compare to English to examine whether the inserted and lexical schwas bear the same relationship as they do in English. Results show that the Catalan values for duration and F1 frequency for the inserted schwas are not significantly different from those of English. However, the lexical schwas of Catalan are quite different from those of English with respect to F1, indicating that while the vowel quality of lexical schwa for the two languages is not the same, speakers of both languages produce the same kind of transitional schwa in the phonotactically illegal CC sequences that is consistent with gestural mistiming.

Other interesting similarities and differences between the languages emerge when the interaction of error type and manner combination is considered. Whereas English speakers always prefer insertion to an equal or greater degree than other possible repairs for all manner combinations, this is not the case for Catalan speakers. For all fricative-initial sequences, Catalan speakers are equally or more likely to use C1 change than any other repair.
The C1 changes in Table 3 show that Catalan speakers tend to devoice voiced fricatives, and /f/ is often produced as /s/, but Figure 6 also shows that for /z/- and even /s/-initial stimuli, there is as much prothesis as there is C1 change. While Catalan speakers do produce prothetic repairs for the sibilants, they are less likely to extend the repair to the labiodentals.

The combination of results from accuracy and errors leads to an interesting question about the locus of the phonetic bases that seemingly underlie the patterns. A number of researchers have recently championed phonetically-based phonology (e.g. Hayes, 1999; Hayes, Kirchner, & Steriade, 2004; Steriade, 1999a) or substantively-biased phonology (Wilson, 2006; Zuraw, 2007), the notion that phonetic underpinnings and biases based on preferable phonetic patterns form the basis of phonological grammars. To take just one formalization, Optimality Theory, it has been proposed that constraints can directly incorporate phonetic properties, regardless of whether those properties are articulatorily, aerodynamically, or perceptually based. It would be possible to construct a set of phonotactic constraints incorporating the phonetic properties that cause production difficulties for the speakers in this study and then rank them in such a way that they interact with “floating” faithfulness constraints to predict how often different types of onset clusters will be repaired in this type of experimental situation (Anttila, 1997; Davidson, 2006b; Nagy & Reynolds, 1997). Because the relevant constraints would target onset clusters that do not exist in either of the languages used in this study, any ranking among them must be “hidden” in the sense that for the purposes of defining the phonology and lexicon of each language, these constraints would be ranked above all relevant faithfulness constraints (Davidson, 2006b).
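As a rough illustration of how a floating faithfulness constraint could yield different repair rates for different cluster types, the following sketch may help. It is only a toy model under stated assumptions and is not the analysis of Davidson (2006b) or Anttila (1997): the constraint names are hypothetical labels, their fixed order simply mirrors the accuracy pattern reported for Figure 3 in simplified form, and the faithfulness constraint is assumed to land in each possible position of the hierarchy with equal probability.

```python
from fractions import Fraction

# Hypothetical markedness constraints in an assumed fixed ranking,
# highest ranked first. The labels are illustrative, not from the source.
MARKEDNESS = [
    "*VoicedStop-C",
    "*v-C",
    "*z-C",
    "*VoicelessStop-C",
    "*VoicelessFricative-C",
]


def repair_probability(constraint, markedness=MARKEDNESS):
    """Probability that `constraint` outranks a floating faithfulness constraint.

    Faithfulness is assumed to fall with equal probability into any of the
    n + 1 slots of the hierarchy (above all, between neighbours, below all).
    A cluster is repaired whenever its markedness constraint outranks
    faithfulness.
    """
    n = len(markedness)
    rank = markedness.index(constraint)  # 0 = highest ranked
    # Faithfulness sits below this constraint in n - rank of the n + 1 slots.
    return Fraction(n - rank, n + 1)


if __name__ == "__main__":
    for name in MARKEDNESS:
        print(name, repair_probability(name))
```

Under these assumptions, clusters penalized by higher-ranked constraints are repaired more often, which is the qualitative shape of the accuracy differences. Nothing in the sketch bears on whether such a ranking is in fact shared across languages.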
However, this scenario presupposes that both languages do indeed have the same ranking among the relevant constraints, that the order is somehow universal, and that it has not been disturbed in any way by any type of learning algorithm that determines the final state of constraints. The results of this study cannot rule out the possibility that the relevant phonetic properties pertaining to the experimental clusters are encoded in the grammar and that speakers are engaging their phonological grammars when participating in this task. However, the results are also in line with an account positing that speakers whose native languages do not permit these clusters are all subject to the same language-independent articulatory and perceptual limitations when producing the unattested sequences. That is, the accuracy patterns do not necessarily reflect the involvement of the phonological grammar, but rather they are attested because the phonetic difficulties described in previous sections are presumably common to the speech production capabilities of all speakers (see similar claims developed in the Evolutionary Phonology framework and its precursors, Blevins, 2004; Hyman, 2001; Mielke, 2003; Ohala, 1983). Regardless of whether phonological or physical mechanisms best account for the accuracy results, it is clear that the specific phonological grammars of each language cannot be ignored, namely, in explaining the errors. Though accuracy results for either C1 type or manner combination did not interact with language, productive phonological repair processes that differ between the languages do emerge in these results.

The final question raised by this study is how different input modalities affect which sequences are more likely to be repaired and whether the repairs are the same regardless of how the stimuli are presented. First, we focus on whether the sequences that are most likely to be produced accurately change from the text to the audio condition. In general, speakers were significantly less accurate across the board in the audio-only condition (see Figure 2). For the analysis of C1, there was no interaction with input modality.
Although there was an interaction between input modality and manner combination, the only real change was that accuracy on fricative-fricative (FF) sequences for all speakers declined in the audio-only condition compared to when orthography was also presented. Ohala and Kawasaki-Fukumori (1997) argue that these types of clusters are particularly disfavored since speakers prefer to produce sequences that modulate the spectral characteristics over the duration of the signal, which fricative-fricative sequences do not do. If the lack of modulation also affects perception, then it is not surprising that performance on FF sequences declines in the audio-only condition, since speakers may not even hear a low amplitude fricative in the audio-only condition. Indeed, C1 deletion is much higher in the audio condition for FF than in the text condition (see Figure 5). These findings suggest that the phonetic mechanisms which make certain sequences more difficult to produce relative to others are generally independent of input modality. The influence of input modality seems to be more important with respect to the errors that speakers make. Regardless of the preferred repair for a particular sequence and language group, speakers tend to be more consistent in the use of that repair when text input is also presented. Thus, English speakers are more consistent with the insertion repair for all sequences in the text condition, whereas Catalan speakers prefer insertion for the stops and C1 change or prothesis for the fricatives. In the audio-only condition, rates of all error types tend to rise for speakers of both languages, including those for errors that may be especially indicative of misperception of the identity of the consonants (such as C1 deletion or C1 change). Misperception in laboratory tasks, especially for elements which crucially do not contain robust acoustic cues (i.e. 
obstruent clusters in general, Côté, 2000; Steriade, 1999a; Wilson, 2001) has been indirectly shown by poor performance on discrimination tasks which pair phonotactically illegal pseudowords with various types of phonetically similar but legal pseudowords (e.g., Berent, Steriade, Lennertz, & Vaknin, 2007; Davidson, 2007; Dupoux, Kakehi, Hirose, Pallier, & Mehler, 1999; Fleischhacker, 2005; Hallé, Segui, Frauenfelder, & Meunier, 1998), though there is little direct evidence assessing what listeners actually hear (see Davidson, 2007; Haunz, 2003 for data from transcription tasks). This study likewise does not answer the question of what speakers hear, since effects of perception and production are conflated in the speakers’ output in the audio-only condition, but it is reasonable to assume that misperception must play some role since overall accuracy in the audio-only condition decreases and the types of errors that speakers produce are not exactly the same as they are in the audio+text condition. This is consistent with the literature in loanword adaptation which argues that perceptual factors play an important role in determining the form of loans in borrowing languages, especially when borrowers may not have access to the orthography of the source language (e.g. Kang, 2003; Miao, 2005; Peperkamp & Dupoux, 2003; Silverman, 1992; Yip, 2006).

5.0 Conclusion

This study was undertaken as a comprehensive investigation of factors affecting the production of non-native consonant sequences. The results show that performance on obstruent-initial clusters that are not possible in a speaker’s native language is more likely influenced by language-independent phonetic factors such as articulatory and aerodynamic difficulty than it is by considerations such as sonority sequencing or by analogical processes that depend on the phonotactics of the speaker’s native language. Whereas the overall patterns of accuracy seem to be universal across the two languages explored in this study, the place where language-specific differences emerged was in the errors produced by Catalan and English speakers. Likewise, these results suggest a complex interaction between phoneme inventory and the likelihood of some repairs. Whereas English and Catalan speakers show high rates of insertion that are consistent with gestural mistiming for stops, Catalan speakers are more likely to choose other repairs for fricatives, which may be influenced by a productive repair process in the language. Finally, while the use of audio-only stimuli causes a general decrement in performance, the mechanism driving variable production across sequence types is relatively independent of the input modality.

Acknowledgments I would like to thank Tuuli Adams for help with data collection and analysis, and the NYU Phonetics and Experimental Phonology lab group for comments and discussion of this work. I also thank my colleagues at the University of Barcelona, especially Núria Sebastián Gallés, for providing me with the resources for carrying out this research. This work was supported by the National Science Foundation CAREER grant BCS-0449560.

References

Albright, A. (2009). Feature-based generalisation as a source of gradient acceptability. Phonology, 26, 9-41.

Anttila, A. (1997). Deriving variation from grammar. In F. Hinskens, R. v. Hout & W. L. Wetzel (Eds.), Variation, Change, and Phonological Theory (pp. 35-68). Amsterdam: John Benjamins.

Baayen, R. H. (2003). Probabilistic approaches to morphology. In R. Bod, J. Hay & S. Jannedy (Eds.), Probabilistic Linguistics (pp. 229-287). Cambridge: MIT Press.

Bailey, T. M., & Hahn, U. (2001). Determinants of wordlikeness: Phonotactics or lexical neighborhoods? Journal of Memory and Language, 44, 568–591.

Berent, I., Steriade, D., Lennertz, T., & Vaknin, V. (2007). What we know about what we have never heard: Evidence from perceptual illusions. Cognition, 104, 591-630.

Blevins, J. (2004). Evolutionary Phonology. Cambridge: Cambridge University Press.

Bochner, J., Snell, K., & MacKenzie, D. (1988). Duration discrimination of speech and tonal complex stimuli by normally hearing and hearing-impaired listeners. Journal of the Acoustical Society of America, 84(2), 493-500.

Bonet, E., & Lloret, R. M. (1998). Fonologia Catalana. Barcelona: Editorial Ariel.

Bradlow, A., & Bent, T. (2002). The clear speech effect for non-native listeners. Journal of the Acoustical Society of America, 112, 272-284.

Broselow, E., Chen, S., & Wang, C. (1998). The emergence of the unmarked in second language phonology. Studies in Second Language Acquisition, 20, 261-280.

Broselow, E., & Finer, D. (1991). Parameter setting in second language phonology and syntax. Second Language Research, 7(1), 35-59.

Browman, C., & Goldstein, L. (2001). Competing Constraints on Intergestural Coordination and Self-Organization of Phonological Structures. Bulletin de la Communication Parlée, 5, 25-34.

Bybee, J. (2001). Phonology and Language Use. Cambridge: Cambridge University Press.

Byrd, D., & Saltzman, E. (2002). Speech production. In M. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks (2nd ed., pp. 1072-1076). Cambridge: MIT Press.

Carbonell, J. F., & Llisterri, J. (1999). Catalan. In Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet (pp. 61-65). Cambridge: Cambridge University Press.

Chitoran, I., Goldstein, L., & Byrd, D. (2002). Gestural overlap and recoverability: Articulatory evidence from Georgian. In N. Warner & C. Gussenhoven (Eds.), Papers in Laboratory Phonology VII. Berlin: Mouton de Gruyter.

Côté, M.-H. (2000). Consonant Cluster Phonotactics: A Perceptual Approach. Unpublished Ph.D. dissertation, MIT, Cambridge, MA.

Davidson, L. (2005). Addressing phonological questions with ultrasound. Clinical Linguistics and Phonetics, 19(6/7), 619-633.

Davidson, L. (2006a). Phonology, phonetics, or frequency: Influences on the production of non-native sequences. Journal of Phonetics, 34(1), 104-137.

Davidson, L. (2006b). Phonotactics and articulatory coordination interact in phonology: evidence from non-native production. Cognitive Science, 30(5), 837-862.

Davidson, L. (2006c). Schwa elision in fast speech: segmental deletion or gestural overlap? Phonetica, 63(2-3), 79-112.

Davidson, L. (2007). The relationship between the perception of non-native phonotactics and loanword adaptation. Phonology, 24(2), 261-286.

Davidson, L., Jusczyk, P., & Smolensky, P. (2004). The initial and final states: Theoretical implications and experimental explorations of richness of the base. In R. Kager, W. Zonneveld & J. Pater (Eds.), Fixing Priorities: Constraints in Phonological Acquisition (pp. 321-368). Cambridge: Cambridge University Press.

Dohlus, K. (2005). Phonetics or phonology: Asymmetries in loanword adaptations-French and German mid front rounded vowels in Japanese. ZAS Papers in Linguistics, 42, 117-135.

Dorman, M., Studdert-Kennedy, M., & Raphael, L. (1977). Stop-consonant recognition: Release bursts and formant transitions as functionally equivalent, context dependent cues. Perception & Psychophysics, 22, 109-122.

Dupoux, E., Kakehi, K., Hirose, Y., Pallier, C., & Mehler, J. (1999). Epenthetic vowels in Japanese: A perceptual illusion? Journal of Experimental Psychology: Human Perception and Performance, 25, 1568-1578.

Eckman, F., & Iverson, G. (1993). Sonority and markedness among onset clusters in the interlanguage of ESL learners. Second Language Research, 9, 234-252.

Fleischhacker, H. (2005). Similarity in Phonology: Evidence from Reduplication and Loan Adaptation. Unpublished Ph.D. Dissertation, UCLA, Los Angeles.

Flemming, E., & Johnson, S. (2007). Rosa’s roses: reduced vowels in American English. Journal of the International Phonetic Association, 37(1), 83-96.

Gick, B., Wilson, I., Koch, K., & Cook, C. (2004). Language-specific articulatory settings: Evidence from inter-utterance rest position. Phonetica, 61, 220-233.

Goldrick, M. (2004). Phonological Features and Phonotactic Constraints in Speech Production. Journal of Memory and Language, 51(4), 586-603.

Goldstein, L., Pouplier, M., Chen, L., Saltzman, E., & Byrd, D. (2007). Dynamic action units slip in speech production errors. Cognition, 103, 386-412.

Greenberg, J., & Jenkins, J. (1964). Studies in the psychological correlates of the sound system of American English. Word, 20, 157-172.

Hallé, P., Segui, J., Frauenfelder, U., & Meunier, C. (1998). Processing of illegal consonant clusters: A case of perceptual assimilation? Journal of Experimental Psychology: Human Perception and Performance, 24(2), 592-608.

Hansen, J. (2004). Developmental sequences in the acquisition of English L2 syllable codas. Studies in Second Language Acquisition, 26, 85-124.

Haunz, C. (2003). Grammatical and non-grammatical factors in loanword adaptation. In M. J. Solé, D. Recasens & J. Romero (Eds.), Proceedings of the 15th International Congress of Phonetic Sciences. Barcelona, Spain: Universitat Autónoma de Barcelona.

Hayes, B. (1999). Phonetically-driven phonology: The role of Optimality Theory and inductive grounding. In M. Darnell, E. Moravcsik, M. Noonan, F. Newmeyer & K. Wheatley (Eds.), Functionalism and formalism in linguistics, volume I: General papers (pp. 243-285). Amsterdam: John Benjamins.

Hayes, B., Kirchner, R., & Steriade, D. (Eds.). (2004). Phonetically-Based Phonology. Cambridge: Cambridge University Press.

Hayes, B., & Wilson, C. (2008). A maximum entropy model of phonotactics and phonotactic learning. .

Henderson, J., & Repp, B. (1982). Is a stop consonant released when followed by another stop consonant? Phonetica, 39, 71-82.

Hualde, J. I. (1992). Catalan. London: Routledge.

Hyman, L. (2001). The limits of phonetic determinism in phonology: *NC revisited. In E. Hume & K. Johnson (Eds.), The Role of Speech Perception in Phonology. New York: Academic Press.

Jongman, A., Wayland, R., & Wong, S. (2000). Acoustic characteristics of English fricatives. Journal of the Acoustical Society of America, 108(3), 1252-1263.

Kang, Y. (2003). Perceptual similarity in loanword adaptation: English postvocalic word-final stops in Korean. Phonology, 20, 219-273.

Kawahara, S. (2006). A faithfulness ranking projected from a perceptibility scale: The case of [+voice] in Japanese. Language, 82(3), 536-574.

Kewley-Port, D. (1983). Time-varying features as correlates of place of articulation in stop consonants. Journal of the Acoustical Society of America, 73(1), 322-334.

Maniwa, K., Jongman, A., & Wade, T. (2008). Perception of clear fricatives by normal-hearing and simulated hearing-impaired listeners. Journal of the Acoustical Society of America, 123(2), 1114-1125.

Mattingly, I. (1981). Phonetic representation is speech synthesis by rule. In T. Myers, J. Laver & J. Anderson (Eds.), The Cognitive Representation of Speech. Amsterdam: North Holland.

McCarthy, J., & Prince, A. (1994). The emergence of the unmarked: Optimality in prosodic morphology. In M. Gonzales (Ed.), Proceedings of the 24th North East Linguistics Society. Somerville: Cascadilla Press.

Miao, R. (2005). Loanword Adaptation in Mandarin Chinese: Perceptual, Phonological and Sociolinguistic Factors. Unpublished Ph.D. dissertation, Stony Brook University, Stony Brook, NY.

Mielke, J. (2003). The diachronic influence of perception: Experimental evidence from Turkish. Proceedings of the Berkeley Linguistics Society, 29, 557-567.

Miller, G., & Nicely, P. (1955). An analysis of perceptual confusions among some English consonants. Journal of the Acoustical Society of America, 27(2), 338-352.

Nagy, N., & Reynolds, W. (1997). Optimality Theory and variable word-final deletion in Faetar. Language Variation and Change, 9, 37-55.

Ohala, J. (1983). The origin of sound patterns in vocal tract constraints. In P. MacNeilage (Ed.), The Production of Speech. New York: Springer-Verlag.

Ohala, J. (1994). Speech aerodynamics. In R. E. Asher & J. M. Y. Simpson (Eds.), The Encyclopedia of Language and Linguistics (Vol. 8, pp. 4144-4148). New York: Pergamon Press.

Ohala, J., & Kawasaki-Fukumori, H. (1997). Alternatives to the sonority hierarchy for explaining segmental sequential constraints. In S. Eliasson & E. H. Jahr (Eds.), Language and Its Ecology (pp. 343-365). Berlin: Mouton de Gruyter.

Ohde, R. N., & Stevens, K. (1983). Effect of burst amplitude on the perception of stop consonant place of articulation. Journal of the Acoustical Society of America, 74, 706-714.

Peperkamp, S., & Dupoux, E. (2003). Reinterpreting loanword adaptations: The role of perception. In M. J. Solé, D. Recasens & J. Romero (Eds.), Proceedings of the 15th International Congress of Phonetic Sciences (pp. 367-370). Barcelona: Universitat Autónoma de Barcelona.

Recasens, D. (1993). Fonètica i fonologia. Barcelona: Enciclopèdia Catalana.

Recasens, D., & Espinosa, A. (2006). Dispersion and variability of Catalan vowels. Speech Communication, 48(6), 645-666.

Scholes, R. (1966). Phonotactic Grammaticality. The Hague: Mouton.

Silverman, D. (1992). Multiple scansions in loanword phonology: evidence from Cantonese. Phonology, 9, 289-328.

Skousen, R. (1989). Analogical Modeling of Language. Dordrecht: Kluwer.

Smith, J. (2008). Source similarity in loanword adaptation: Correspondence Theory and the posited source-language representation. In S. Parker (Ed.), Phonological Argumentation: Essays on Evidence and Motivation. London: Equinox.

Smits, R., ten Bosch, L., & Collier, R. (1996). Evaluation of various sets of acoustic cues for the perception of prevocalic stop consonants. I. Perception experiment. Journal of the Acoustical Society of America, 100(6), 3852-3864.

Steriade, D. (1999a). Phonetics in phonology: The case of laryngeal neutralization. UCLA Working Papers in Linguistics: Papers in Phonology, 3, 25–146.

Steriade, D. (1999b). The phonology of perceptibility effects: the P-map and its consequences for constraint organization. Unpublished manuscript.

Steriade, D. (2003). Directional asymmetries in place assimilation: a perceptual account. In E. Hume & K. Johnson (Eds.), The role of speech perception phenomena in phonology. San Diego: Academic Press.

Stevens, K. (1998). Acoustic Phonetics. Cambridge, MA: MIT Press.

Stevens, K., & Blumstein, S. (1978). Invariant cues for place of articulation in stop consonants. Journal of the Acoustical Society of America, 64(5), 1358-1368.

Stevens, K., Blumstein, S., Glicksman, L., Burton, M., & Kurowski, K. (1992). Acoustic and perceptual characteristics of voicing in fricatives and fricative clusters. Journal of the Acoustical Society of America, 91(5), 2979-3000.

Uchanski, R., Choi, S., Braida, L., Reed, C., & Durlach, N. (1996). Speaking clearly for the hard of hearing IV: Further studies of the role of speaking rate. Journal of Speech and Hearing Research, 39, 494–509.

Ussishkin, A., & Wedel, A. (2003). Gestural motor programs and the nature of phonotactic restrictions: Evidence from loanword phonology. WCCFL, 22, 505-518.

Vendelin, I., & Peperkamp, S. (2006). The influence of orthography on loanword adaptations. , 116(7), 996-1007.

Wang, M. D., & Bilger, R. C. (1973). Consonant confusions in noise: A study of perceptual features. Journal of the Acoustical Society of America, 54, 1248–1266.

Weinberger, S. (1994). Functional and phonetic constraints on second language phonology. In M. Yavas (Ed.), First and Second Language Phonology (pp. 283-302). San Diego: Singular Publishing Group.

Westbury, J., & Keating, P. (1986). On the naturalness of stop consonant voicing. , 22, 145-166.

Wheeler, M. (2005). Phonology of Catalan. Oxford: Oxford University Press.

Wilson, C. (2001). Consonant cluster neutralisation and targeted constraints. Phonology, 18, 147-197.

Wilson, C. (2006). Learning phonology with substantive bias: An experimental and computational study of velar palatalization. Cognitive Science, 30, 945–982.

Wright, R. (1996). Consonant Clusters and Cue Preservation in Tsou. Unpublished Ph.D. dissertation, UCLA.

Wright, R. (2004). A review of perceptual cues and cue robustness. In B. Hayes, R. Kirchner & D. Steriade (Eds.), Phonetically Based Phonology. Cambridge: Cambridge University Press.

Yip, M. (2006). The symbiosis between perception and grammar in loanword phonology. Lingua, 116(7), 950-975.

Young-Scholten, M., Akita, M., & Cross, N. (1999). Focus on form in phonology: Orthographic exposure as a promoter of epenthesis. In P. Robinson & N. Jongheim (Eds.), Pragmatics and Pedagogy, Vol. 2: The Proceedings of the 3rd Pacific Second Language Research Forum (pp. 227-233). Tokyo: Aoyama Gakuin University.

Zuraw, K. (2007). The role of phonetic knowledge in phonological patterning: corpus and survey evidence from Tagalog infixation. Language, 83(2), 277-316.

Appendix

CC and CəC stimuli in IPA used for all languages. In the text condition, the schwa was represented by ‘e’ for English and by ‘a’ for Catalan. Primary stress was on the /a/ for all words.

C1            C1C2 combination   CC stimuli        CəC stimuli

/s/-initial   sm                 smava   smani     səmava   səmani
              sn                 snabu   snadi     sənabu   sənadi
              sf                 sfano   sfadu     səfano   səfadu
              sp                 spaɡi   spanu     səpaɡi   səpanu
              st                 stamo   staka     sətamo   sətaka
              sk                 skabu   skafe     səkabu   səkafe

/f/-initial   fm                 fmatu   fmake     fəmatu   fəmake
              fn                 fnaɡu   fnada     fənaɡu   fənada
              fs                 fsaɡa   fsake     fəsaɡa   fəsake
              fp                 fpami   fpala     fəpami   fəpala
              fk                 fkada   fkabe     fəkada   fəkabe
              ft                 ftake   ftapi     fətake   fətapi

/z/-initial   zm                 zmafo   zmaɡu     zəmafo   zəmaɡu
              zn                 znaɡi   znafe     zənaɡi   zənafe
              zv                 zvato   zvabu     zəvato   zəvabu
              zb                 zbatu   zbasi     zəbatu   zəbasi
              zd                 zdanu   zdaba     zədanu   zədaba
              zɡ                 zɡano   zɡame     zəɡano   zəɡame

/v/-initial   vm                 vmala   vmape     vəmala   vəmape
              vn                 vnali   vnake     vənali   vənake
              vz                 vzaku   vzamo     vəzaku   vəzamo
              vb                 vbaɡu   vbano     vəbaɡu   vəbano
              vd                 vdale   vdapi     vədale   vədapi
              vɡ                 vɡane   vɡalu     vəɡane   vəɡalu

/p/-initial   pm                 pmaɡu   pmava     pəmaɡu   pəmava
              pn                 pnaka   pnado     pənaka   pənado
              ps                 psaɡi   psatu     pəsaɡi   pəsatu
              pf                 pfale   pfama     pəfale   pəfama
              pt                 ptane   ptaɡa     pətane   pətaɡa
              pk                 pkadi   pkamo     pəkadi   pəkamo

/t/-initial   tm                 tmapa   tmase     təmapa   təmase
              tn                 tnasa   tnapi     tənasa   tənapi
              tf                 tfano   tfani     təfano   təfani
              ts                 tsapi   tsalu     təsapi   təsalu
              tp                 tpaze   tpaku     təpaze   təpaku
              tk                 tkali   tkale     təkali   təkale

/k/-initial   km                 kmapi   kmadu     kəmapi   kəmadu
              kn                 knabo   knase     kənabo   kənase
              ks                 ksami   ksapa     kəsami   kəsapa
              kf                 kfaɡi   kfano     kəfaɡi   kəfano
              kp                 kpado   kpani     kəpado   kəpani
              kt                 ktaba   ktaɡe     kətaba   kətaɡe

/b/-initial   bm                 bmalu   bmada     bəmalu   bəmada
              bn                 bnapa   bnase     bənapa   bənase
              bz                 bzaka   bzaɡi     bəzaka   bəzaɡi
              bv                 bvadu   bvani     bəvadu   bəvani
              bd                 bdava   bdaɡu     bədava   bədaɡu
              bɡ                 bɡapo   bɡadi     bəɡapo   bəɡadi

/d/-initial   dm                 dmafo   dmaka     dəmafo   dəmaka
              dn                 dnala   dnape     dənala   dənape
              dv                 dvasi   dvako     dəvasi   dəvako
              dz                 dzabe   dzatu     dəzabe   dəzatu
              db                 dbake   dbalo     dəbake   dəbalo
              dɡ                 dɡama   dɡani     dəɡama   dəɡani

/ɡ/-initial   ɡm                 ɡmalo   ɡmada     ɡəmalo   ɡəmada
              ɡn                 ɡnasu   ɡnamo     ɡənasu   ɡənamo
              ɡz                 ɡzabo   ɡzale     ɡəzabo   ɡəzale
              ɡv                 ɡvaki   ɡvata     ɡəvaki   ɡəvata
              ɡb                 ɡbame   ɡbafu     ɡəbame   ɡəbafu
              ɡd                 ɡdana   ɡdasi     ɡədana   ɡədasi
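The relationship between each CC stimulus and its CəC counterpart in the appendix is regular: the schwa is placed directly after C1. The sketch below illustrates that relationship and the schwa spellings reported for the text condition. The function names are hypothetical, and the orthographic conversion handles only the schwa; how other IPA symbols (for example /ɡ/) were spelled in the text condition is not stated here and is not modelled.

```python
SCHWA = "ə"

# Spellings of schwa in the text condition, as stated in the appendix.
SCHWA_SPELLING = {"English": "e", "Catalan": "a"}


def insert_schwa(cc_stimulus, vowel=SCHWA):
    """Return the CəC counterpart of a CC stimulus.

    Every C1 in the stimulus list is a single IPA symbol, so the vowel
    is placed after the first character.
    """
    return cc_stimulus[0] + vowel + cc_stimulus[1:]


def schwa_in_text_condition(cc_stimulus, language):
    """Illustrative only: swap in the language-specific schwa spelling."""
    return insert_schwa(cc_stimulus, SCHWA_SPELLING[language])


if __name__ == "__main__":
    print(insert_schwa("zɡano"))
    print(schwa_in_text_condition("smava", "English"))
    print(schwa_in_text_condition("smava", "Catalan"))
```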
