
Differences in the Production and Perception of Chinese Tones

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Chiung-Yun Chang

Graduate Program in Speech and Hearing Science

The Ohio State University 2010

Dissertation Committee:
Robert A. Fox, Advisor
Marjorie K.M. Chan
Ewa Jacewicz


Copyright by

Chiung-Yun Chang

2010


Abstract

The production study examined the three main acoustic properties, fundamental frequency (f0), rms amplitude, and duration, of the four lexical tones in Beijing Mandarin (BM) and Taiwan Mandarin (TM) produced in isolation and in a sentence-medial position by native female speakers of these two regional dialects. Acoustical and statistical analyses showed cross-dialectal differences in tonal realization, especially for Tone 3 produced in isolation. Specifically, a citation BM Tone 3 has a dipping f0 pattern with a turning point around its midpoint, whereas TM Tone 3 has a falling pattern. Similarly, the amplitude envelope of the citation Tone 3 in BM and TM had a double-peak and a falling pattern, respectively. Accordingly, BM Tone 3 in isolation was the only tone that differed statistically significantly from its TM counterpart, with the former being 141 ms longer. In addition, cross-correlation analyses between f0 contours and amplitude contours showed that acoustic divergence in citation Tone 3 resulted in different relational patterns between amplitude contours and f0 contours and among tones in the two dialects. Specifically, the low-falling amplitude contour of TM Tone 3 in isolation was highly correlated with the f0 contours of not only Tone 3 but also Tone 4, which have a similar falling pattern. In contrast, the double-peak amplitude envelope of BM Tone 3 was correlated only with the dipping f0 contour of the same tone. Furthermore, cross-correlation among tonal contours revealed that the f0 patterns of isolated TM Tones 3 and 4 were highly correlated with each other (r=0.996), whereas those of their BM counterparts were barely correlated with each other.

The perception study investigated the effects of speaker and dialect variability on the time-course of native and nonnative tone identification when no syllable-extrinsic contextual information is available for speaker and dialect normalization. Regardless of the dialects of speakers and listeners, Tone 2 had the longest TIP75%, followed by Tones 3, 4, and 1. As hypothesized, a mismatch between speaker dialect and listener dialect had differential impacts on the identification of intact and truncated tones, especially citation Tone 3. Perceptual results showed that TM listeners required significantly less acoustic information than their BM and AE counterparts to identify partial TM Tone 3 at 75% correct. Even with complete acoustic information available, both BM and AE listeners still had significantly lower accuracy in identifying an intact TM Tone 3 produced in isolation. Tonal confusion patterns revealed that BM and AE listeners consistently misidentified TM Tone 3 as Tone 4 but not vice versa, while TM listeners made bi-directional misidentifications. These results are interpreted in terms of the cross-dialectal differences in the realization of citation Tone 3 and differential experience with its phonetic variants. Most importantly, confusion patterns and sensitivity (d′) measures at gate 1 indicated that listeners were able to distinguish low- from high-onset tones and to estimate f0 height from very short multiple-talker stimuli of 30 ms without the mediation of gender detection as proposed by Lee (2009).
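The sensitivity measure mentioned above can be illustrated with a short sketch. It assumes the standard yes-no signal detection formulas, d′ = z(H) − z(F) and c = −(z(H) + z(F))/2, where H and F are the hit and false-alarm rates. The function name `yes_no_sensitivity` and the example rates are hypothetical and are not data from this study.

```python
from statistics import NormalDist


def yes_no_sensitivity(hit_rate, false_alarm_rate):
    """Return (d_prime, criterion) for a yes-no detection task.

    Both rates must lie strictly between 0 and 1; rates of exactly
    0 or 1 have no finite z-score and raise statistics.StatisticsError.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    z_hit = z(hit_rate)
    z_fa = z(false_alarm_rate)
    d_prime = z_hit - z_fa             # separation of the two distributions
    criterion = -0.5 * (z_hit + z_fa)  # response bias (0 = unbiased)
    return d_prime, criterion


# Hypothetical rates for a high- versus low-onset decision (not study data).
d_prime, criterion = yes_no_sensitivity(hit_rate=0.80, false_alarm_rate=0.20)
print(f"d' = {d_prime:.2f}")
```

A d′ near zero would mean that high- and low-onset tones are not being told apart, while larger values indicate better discrimination regardless of response bias.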


Dedication

Dedicated to my parents.


Acknowledgments

I would like to express my deepest gratitude to my advisor, Dr. Robert Fox, for his intellectual guidance, constant encouragement, and support during my research and doctoral study. I am grateful to have had him not only as a dissertation advisor but also as a mentor who has shown me different ways to approach a research question and the need to be persistent in trying to find an answer.

I would like to thank Dr. Marjorie Chan for her constructive comments and insightful assistance throughout the project. Ma Laoshi offers advice and suggestions whenever I need them. I also thank Dr. Ewa Jacewicz for her valuable comments on the draft of the dissertation. Her enthusiasm for research has always been a great source of inspiration.

I wish to thank my friends and fellow graduate students in the SPA labs. It has been a pleasure to share my graduate student life with them. Special thanks to Dr. D.-R. Chen, Dr. C.-J. Sun, Amy Chiang, and Yolanda Holt for their friendship and help over the past few years.

My deepest gratitude goes to my parents and siblings for their unconditional love and support throughout my life. I could not have accomplished my dreams without them.

I also thank the subjects who participated in this study; this dissertation would not have been possible without their participation. This research was supported by the Grants-in-Aid of Research from Sigma Xi, The Scientific Research Society and the AGGRS from the Graduate School at The Ohio State University.


Vita

2010 Ph.D. Speech and Hearing Science, The Ohio State University
2004 M.A. TESOL, The Ohio State University
2003 M.A. Economics, The Ohio State University
2001 B.Com. (Hons.) Economics, University of Auckland
1998 B.A. Spanish, University of Auckland

Publications

Fox, R. A., Jacewicz, E., and Chang, C.-Y. (in press). Auditory spectral integration in the perception of diphthongal vowels. Journal of the Acoustical Society of America, 128(4).

Fox, R. A., Jacewicz, E., and Chang, C.-Y. (under review). Auditory spectral integration in the perception of static vowels. Journal of Speech, Language, and Hearing Research.

Fox, R. A., Jacewicz, E., and Chang, C.-Y. (2007). Vowel perception with virtual formants. In: Proceedings of the XVIth International Congress of Phonetic Sciences, edited by J. Trouvain and W. J. Barry, pp. 689-692. Saarbrücken, Germany.

Chang, C.-Y. and Fox, R. A. (2010). Production and perception of lexical tones in Beijing and Taiwan Mandarin. 159th Meeting of the Acoustical Society of America, Baltimore, MD, 19-23 April. Journal of the Acoustical Society of America, 127: 2023.

Fox, R. A., Jacewicz, E., and Chang, C.-Y. (2010). Speech intelligibility in cross-dialectal multi-talker babble. 159th Meeting of the Acoustical Society of America, Baltimore, MD, 19-23 April. Journal of the Acoustical Society of America, 127: 1903.

Chang, C.-Y. and Fox, R. A. (2009). Time-course of perception of tones. 157th Meeting of the Acoustical Society of America, Portland, OR. Journal of the Acoustical Society of America, 125(4): 2773.

Fox, R. A., Jacewicz, E., and Chang, C.-Y. (2007). Salience of dynamic virtual formants in . Acoustical Society of America, Salt Lake City, UT, 07 June. Journal of the Acoustical Society of America, 121(5): 3189.

Fox, R. A., Jacewicz, E., Chang, C.-Y., and Fox, J. D. (2006). Salience of virtual formants as a function of the frequency separation between spectral components. Acoustical Society of America, Honolulu, HI, 01 December. Journal of the Acoustical Society of America, 120(5): 3252.

Fields of Study

Major Field: Speech and Hearing Science


Table of Contents

Abstract ...... iii
Dedication ...... v
Acknowledgments ...... vi
Vita ...... vii
Publications ...... vii
Table of Contents ...... viii
List of Tables ...... xi
List of Figures ...... xiii

Chapter 1 Introduction ...... 1

Chapter 2 Literature Review ...... 6
2.1. Chinese and Its Dialects ...... 7
2.1.1. Chinese Languages in the PRC and Taiwan & Nomenclatures ...... 8
2.1.2. Establishment and development of Modern Standard Chinese ...... 11
2.1.3. Sociolinguistic Situation in Mainland China ...... 13
2.1.4. Sociolinguistic Situation in Taiwan ...... 15
2.2. Syllables in Mandarin Chinese ...... 17
2.3. Tones in Mandarin Chinese ...... 18
2.3.1. Domain of Tones ...... 19
2.3.2. Main Acoustic Correlates of Tones: f0 ...... 21
2.3.3. Secondary Acoustic Correlates of Tones: Amplitude and Duration ...... 21
2.3.3.1. Intrinsic Amplitude and Amplitude Contours ...... 22
2.3.3.2. Intrinsic Duration ...... 24
2.4. Production of Mandarin Tones ...... 26
2.4.1. Perturbation in Tonal Contour ...... 26
2.4.1.1. The Effect of consonant voicing and aspiration ...... 26
2.4.1.2. The Effect of Segmental Structure ...... 27
2.4.1.3. The Effect of Intrinsic Pitch Differences ...... 27
2.4.1.4. The Effect of Sentence Environment ...... 28
2.4.1.5. Description of Isolated Tones in BM ...... 29
2.4.1.6. Description of Coarticulated Tones ...... 32
2.4.2. Acoustic Comparison of Tones between BM and TM ...... 36
2.4.2.1. Tone 3 ...... 37
2.4.2.2. Other Tones ...... 39
2.4.2.3. Tonal and range ...... 42


2.4.2.4. Durational patterns ...... 43
2.4.2.5. Amplitude contours ...... 44
2.5. Perception of Mandarin Tones ...... 45
2.5.1. f0 Cues to Tone Perception and Recognition ...... 45
2.5.2. Temporal Envelope Cues to Tone perception and Recognition ...... 47
2.5.3. Speaker Variability and Speaker Normalization in Tone Perception ...... 49
2.5.4. Dialectal variability in Tone Perception ...... 53
2.5.5. Lexical Tone Processing using gating task ...... 53
2.5.6. Tone Perception by Nonnative Speakers of Mandarin ...... 59
2.6. Research Outline ...... 60
2.6.1. Production Study ...... 60
2.6.2. Perception Study ...... 61

Chapter 3 Production Study: Methodology ...... 65
3.1. Speech Stimuli ...... 65
3.2. Participants ...... 67
3.3. Procedure ...... 68
3.4. Acoustic analysis ...... 70
3.4.1. Extraction of f0 contours ...... 73
3.4.2. Measurement of vowel duration ...... 76
3.4.3. Calculation of rms amplitude ...... 76
3.4.4. Exclusion of outliers ...... 76
3.4.5. Time Normalization ...... 77
3.4.6. f0, rms, and duration measurements ...... 78
3.5. Statistical Analyses ...... 78

Chapter 4 Production study: Results and Discussion ...... 80

4.1. f0 contours ...... 81
4.1.1. Individual f0 contours and mean f0 contours ...... 81
4.1.2. Group-Averaged f0 contours ...... 95
4.1.3. Statistical Results: dynamic f0 information ...... 102
4.2. Duration ...... 111
4.3. Rms Amplitude Contours ...... 114
4.3.1. Individual rms amplitude contours & mean rms amplitude contours ...... 114
4.3.2. Group-Averaged f0 Contours ...... 127
4.3.3. Statistical Analyses ...... 130
4.3.4. Correlation between f0 and Amplitude Contours ...... 134
4.4. Summary ...... 139

Chapter 5 Perception study: Methodology ...... 145
5.1. Speech Stimuli ...... 145
5.2. Gating Procedure ...... 146
5.3. Participants ...... 151
5.4. Experimental Procedure ...... 153
5.5. Data Analyses ...... 154


Chapter 6 Perception study: Results and Discussion ...... 156
6.1. Response Accuracy at Each Gate ...... 158
6.2. Tone Identification Points (TIPs): TIP75% ...... 160
6.2.1. Baseline Condition ...... 164
6.3. Tonal Confusions ...... 167
6.3.1. Confusion patterns in the baseline condition ...... 168
6.3.2. Confusion patterns at gate 5 ...... 172
6.3.3. Confusion patterns at gate 1 ...... 177
6.4. Signal Detection Theory: Sensitivity (d′) to Mandarin tones ...... 182
6.4.1. Sensitivity to high-low f0 distinction ...... 183
6.4.2. Overall sensitivity in 4-alternative forced-choice (4AFC) tone identification ...... 185
6.5. Discussion and Conclusion ...... 189

Chapter 7 Conclusions and General Discussion ...... 194

Appendix A: Stimuli for the production study ...... 204
Appendix B: Speaker demographic information ...... 208
Appendix C: Listener information ...... 210
Appendix D: Tone confusion matrices at other gates ...... 213
References ...... 222


List of Tables

Table 2.1. to fill each position in a syllable (adapted from Duanmu, 2007) ...... 17
Table 2.2. Representation of four lexical tones in Standard Chinese ...... 19
Table 2.3. Summaries of phonetic studies on isolated BM tones ...... 31
Table 2.4. Allotones of Tone 3 when it occurs in different contexts ...... 34
Table 2.5. Summaries of phonetic studies on isolated TM tones ...... 41
Table 3.1. Summary statistics of two groups of speakers ...... 68
Table 3.2. The surface vowels in Standard Mandarin ...... 72
Table 4.1. Summary statistics of ten f0 parameters (in Hz) of (a) isolated tones, and (b) contextual tones ...... 101
Table 4.2. Summary of main effects and interactions from the first set of repeated-measures ANOVAs for four Mandarin tones ...... 106
Table 4.3. Mean durations of four isolated and contextual tones (in ms) by BM and TM speakers ...... 112
Table 4.4. Summary of main effects and interactions from the three-way repeated-measures ANOVAs for vowel durations ...... 113
Table 4.5. Summary of main effects and interactions from three-way repeated-measures ANOVAs for rms amplitude ...... 132
Table 4.6. (a) Cross-correlations between amplitude contours (Amp.) and f0 contours (f0) for tokens produced in isolation by BM and TM speakers. (b) Cross-correlations between f0 contours of tones (Tone) produced in isolation by BM and TM speakers ...... 136
Table 4.7. (a) Cross-correlations between amplitude contours and f0 contours for tokens produced in context by BM and TM speakers. (b) Cross-correlations between f0 contours for tokens produced in context by BM and TM speakers ...... 137
Table 5.1. Numbers of stimulus tokens presented to listeners at each gate as a function of tone and speaker language for isolated tones ...... 150
Table 5.2. Numbers of stimulus tokens presented to listeners at each gate as a function of tone and speaker language for coarticulated tones ...... 150
Table 5.3. The gender and age distribution for each listener group ...... 151


Table 6.1. Confusion matrices displaying observed tone identification responses and response rate (%) for (a) BM listeners, (b) TM listeners, and (c) AE listeners in the baseline condition (intact syllables) ...... 168
Table 6.2. Confusion matrices displaying observed tone identification responses and response rate (%) for (a) BM listeners, (b) TM listeners, and (c) AE listeners at Gate 5 ...... 173
Table 6.3. Confusion matrices displaying observed tone identification responses and response rate (%) for (a) BM listeners, (b) TM listeners, and (c) AE listeners at Gate 1 ...... 177
Table 6.4. Direction of confusions for BM and TM tones in the following testing conditions: (a) Baseline condition, (b) Gate 5, and (c) Gate 1, arranged by listener group ...... 179
Table 6.5. Confusion matrices displaying observed identification responses to high- and low-onset tone categories with hit (H) and false-alarm (F) rates for (a) BM listeners, (b) TM listeners, and (c) AE listeners at Gate 1 ...... 180
Table 6.6. Group sensitivity (d′) and criterion (c) calculated according to the yes-no method (Macmillan and Creelman, 1991) for each group of listeners ...... 184
Table 6.7. Group overall sensitivity d′ calculated according to an unbiased SDT model for 4AFC identification (Macmillan and Creelman, 2005) for (a) Gate 1, (b) Gate 5 and (c) Baseline condition ...... 187


List of Figures

Figure 3.1. F0 tracking of the syllable ge2 /kɤ/ produced by s18 ...... 75
Figure 4.1. The f0 contours of 13 syllables produced in isolation by BM speakers, arranged by tone ...... 83
Figure 4.1 continued ...... 84
Figure 4.1 continued ...... 85
Figure 4.2. The f0 contours of 13 syllables produced in isolation by TM speakers, arranged by tone ...... 86
Figure 4.2 continued ...... 87
Figure 4.2 continued ...... 88
Figure 4.3. The f0 contours of 13 syllables produced in context by BM speakers, arranged by tone. Tone contours were plotted against percentage time and aligned to 0% ...... 89
Figure 4.3 continued ...... 90
Figure 4.3 continued ...... 91
Figure 4.4. The f0 contours of 13 syllables produced in context by TM speakers, arranged by tone ...... 92
Figure 4.4 continued ...... 93
Figure 4.4 continued ...... 94
Figure 4.5. Normalized group-averaged f0 contours of four Mandarin tones produced in isolation by the BM group (on the left) and the TM group (on the right) ...... 96
Figure 4.6. Normalized group-averaged f0 contours of four Mandarin tones produced in context by the BM group (on the left) and the TM group (on the right) ...... 96
Figure 4.7. The group-averaged f0 contours of four tones measured at five temporal locations in the vowel, arranged by tones ...... 104
Figure 4.7 continued ...... 105
Figure 4.8. Mean durations (in ms) of four isolated and contextual tones by BM and TM speakers ...... 112
Figure 4.9. The rms amplitude contours of 13 syllables produced in isolation by BM speakers, arranged by tone. Amplitude contours were plotted against percentage time and aligned to 0% ...... 115


Figure 4.9 continued ...... 116
Figure 4.9 continued ...... 117
Figure 4.10. The rms amplitude contours of 13 syllables produced in isolation by TM speakers, arranged by tone ...... 118
Figure 4.10 continued ...... 119
Figure 4.10 continued ...... 120
Figure 4.11. The rms amplitude contours of 13 syllables produced in context by BM speakers, arranged by tone. Amplitude contours were plotted against percentage time and aligned to 0% ...... 121
Figure 4.11 continued ...... 122
Figure 4.11 continued ...... 123
Figure 4.12. The rms amplitude contours of 13 syllables produced in context by TM speakers, arranged by tone ...... 124
Figure 4.12 continued ...... 125
Figure 4.12 continued ...... 126
Figure 4.13. Normalized group-averaged rms amplitude contours of four Mandarin tones produced in isolation by the BM group (on the left) and the TM group (on the right) ...... 128
Figure 4.14. Normalized group-averaged rms amplitude contours of four Mandarin tones produced in context by the BM group (on the left) and the TM group (on the right) ...... 128
Figure 5.1. The schematic representation of the series of gates of Mandarin word [po] “peel” ...... 148
Figure 5.2. Fundamental frequencies of the gating sequences of Mandarin word [po] “peel” ...... 149
Figure 6.1. Tone identification accuracy as a function of gates and tones, shown for BM and TM tones and each listener group separately ...... 159
Figure 6.2. Average number of gates required for 75% identification accuracy (i.e., TIP75%) with standard error bars as a function of tone, dialect, and listener dialect ...... 161
Figure 6.3. Identification accuracy for intact tones (raw percent correct scores) as a function of tone, dialect, and listener dialect ...... 165
Figure 6.4. Truncated stimuli of Tones 2, 3, and 4 presented at gate 5 ...... 175


Chapter 1 Introduction

How a listener maintains perceptual constancy for linguistic categories in the face of acoustic-phonetic differences due to speaker variability is an important question in speech perception and has been extensively researched in the literature, particularly with respect to the perception of English vowels. Speaker variability arises from (1) anatomical or physiological differences such as gender, speaker size, and characteristics of the vocal tract and glottis and/or (2) sociolinguistic identity characterized by dialectal background, socioeconomic status, and cultural, social, or geographical allegiance (e.g., Ladefoged and Broadbent, 1957; Evans and Iverson, 2004). In order to extract the linguistic information pertinent to spoken word processing, listeners need to perceptually normalize for or accommodate to speaker-dependent variation inherent in the incoming speech signals.

Speaker variability (1) is germane to the perception of Mandarin Chinese tones since lexical processing involves decoding not only segmental structure but also tone information, which is phonemically contrastive in Mandarin Chinese. For example, the four tones on the syllable ma denote different lexical meanings: Tone 1 mā “mother”, Tone 2 má “hemp”, Tone 3 mǎ “horse”, and Tone 4 mà “to scold”. Tones therefore play an important role in lexical access and in processing spoken words of tone languages. The main acoustic correlate that differentiates tones is the fundamental frequency (f0) (Gandour, 1978; Lin and Repp, 1989). Mandarin tones can be distinctively characterized by two phonetic dimensions in the f0 domain: (i) f0 height or register and (ii) f0 movement or contour (Chao, 1968; Howie, 1974; Ho, 1976; Lin and Repp, 1989; Tseng, 1990; Moore and Jongman, 1997; Jongman et al., 2006).1 Perception experiments have demonstrated that these two acoustic cues play an important role in Mandarin Chinese tone identification and perception (e.g., Howie, 1976; Gandour, 1984; Massaro et al., 1985; Whalen and Xu, 1992). As a result, idiosyncratic speaker differences in pitch range contribute to acoustic variation within phonologically identical tone categories, which can result in perceptually ambiguous tones. For example, a phonologically high level tone produced by a male speaker, or by a female speaker who has a relatively low pitch register, may be acoustically similar in terms of f0 height to a phonologically low level tone produced by a female speaker.

Another speaker-dependent variation is dialect variability (2). While previous

acoustic studies have corroborated that there are four phonologically distinct tones in

Mandarin Chinese (e.g., Chao, 1948; Ho, 1976), some acoustic studies have suggested

discrepancies in tonal realization due to dialect differences between Beijing Mandarin

and Taiwan Mandarin, which are two mutually intelligible regional varieties of the

language (e.g., Fon and Chiang, 1999; Fon et al., 2004; et al., 2006; et al., 2006;

Kuo et al., 2008; Sanders, 2008a, b). Consequently, acoustic overlaps in tone categories

due to cross-dialectal variation in tone production may give rise to perceptual ambiguity.

1 Lin and Repp (1989) suggested that f0 register and f0 contour refer to the f0 characteristics of phonological tones, while f0 height and f0 movement refer to the corresponding phonetic dimensions. They also suggested that “register and contour thus are characterized in discrete terms (high, mid, low; rising, level, falling), whereas height and movement are continuously variable and are described in acoustic terms” (p. 26). f0 movement depicts directions of movement in f0.

In this study, a production study and a related perceptual experiment were conducted to examine cross-dialectal differences in tone production and the effects of dialect and speaker variability on the production and perception of the four lexical tones in Mandarin Chinese by native and non-native listeners. The aims of the production study were to compare tonal characteristics of the four Mandarin tones produced by both male and female speakers from two dialect regions and to investigate any dialectal divergence in tonal realization. The research question is how the four lexical tones in these two regional dialects differ in terms of fundamental frequency (f0), duration, and amplitude when produced in isolation and in a sentential context. The rationale behind the production study is that while previous acoustic studies have documented potential cross-dialect variation in Mandarin tone production, the generalizability of their results is limited by the scale and design of those studies. First, acoustic measurements were based on only two to four native speakers from each dialect region and there was no systematic control over speakers’ linguistic backgrounds and experience. Second, cross-dialect comparisons of acoustic parameters were limited to qualitative and descriptive data.

The purpose of the perception study is to explore the time-course of identification of isolated and contextual tones produced by multiple unfamiliar speakers with a wide range of f0 values from two regional dialects, under the condition in which no syllable-extrinsic contextual information is available for speaker normalization. That is, test stimuli will be presented in a mixed-talker and mixed-dialect condition. Past research has shown that listeners engage in perceptual normalization for speaker variability in pitch range (i.e., speaker normalization) when perceiving Mandarin tones. Specifically, an ambiguous tone is identified with reference to contextual cues extrinsic to the target tone, for example, the perceived f0 range of the speaker or the preceding context tone (Leather, 1983; Fox and Qi, 1990; Moore and Jongman, 1997; Wong and Diehl, 2003). In addition, contextual information has a facilitative effect for native Chinese listeners on the identification of tone stimuli produced by multiple speakers (Lee et al., 2008). Lee and his colleagues found that even for isolated and contextual tones presented without a carrier sentence, both native and nonnative listeners achieve significantly higher tone identification accuracy when stimuli are presented in a single-talker block condition. These results suggest that listeners can gauge the talker’s f0 range from trial to trial in a single-talker presentation mode, which serves as a frame of reference to resolve phonetic category ambiguity resulting from talker variability (Verbrugge et al., 1976, for English vowel perception; Wong and Diehl, 2003). Furthermore, Lee (2009) found that listeners are able to determine f0 height information when neither syllable-extrinsic contextual information nor syllable-intrinsic dynamic f0 information is available.

Despite the accomplishments of this body of work, the studies are remarkably

limited in a number of ways. First, while it has been documented that Taiwan Mandarin

is acoustically different from its Beijing counterpart due to changes in sociolinguistic

conditions/situation (e.g., , 1985; Fon et al., 1999; Fon et al., 2004), no perception

studies, to our knowledge, have investigated the effect of dialect variability on Mandarin

tone perception (cf. Evans and Iverson, 2004, for vowel normalization for two British

English accents; Fox and McGory, 2007, for identification by native and non-native

listeners of American English vowels produced by talkers from two dialect regions).

Second, tone perception has been studied by using traditional identification tasks or by asking subjects to identify acoustically modified tone syllables, resembling the “silent-center” vowel perception experiments (Jenkins et al., 1994). While the “silent-center” tone perception experiments (e.g., Lee et al., 2008; Gottfried and Suiter, 1997) are actually one type of gating experiment, their identification results only provide information as to which tonal portions of the signal, whose window sizes were arbitrarily defined by researchers across studies, contain the critical perceptual cues for tone identification. They do not provide information on the time-course of the tone recognition process. Employing an authentic gating experiment will more accurately portray the online nature of processing tone information. Most importantly, the use of the gating paradigm will reveal the time-course of the listener’s perceptual normalization for speaker and dialect variability. Consequently, a perception study employing a gating task and a mixed-talker and mixed-dialect design will fill the gaps in the understanding of lexical tone processing in Mandarin given speaker and dialect variability and advance the existing knowledge of tone perception in general.
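The gating procedure referred to above can be sketched as follows. This is a minimal illustration, not the procedure used in this dissertation: it assumes onset-aligned gates that grow by a fixed step, with the 30 ms step borrowed from the gate 1 duration mentioned in the abstract, and it assumes that a tone identification point is the first gate at which accuracy reaches a criterion such as 75%. The names `make_gates` and `identification_point`, the sample signal, and the accuracy values are all hypothetical.

```python
def make_gates(samples, sample_rate, gate_ms=30.0):
    """Return onset-aligned truncations of `samples`, each one gate longer.

    The last gate is always the intact signal.
    """
    step = int(round(sample_rate * gate_ms / 1000.0))
    if step < 1:
        raise ValueError("gate duration is shorter than one sample")
    gates = []
    end = step
    while end < len(samples):
        gates.append(samples[:end])
        end += step
    gates.append(samples[:])  # final gate: the whole syllable
    return gates


def identification_point(accuracy_by_gate, criterion=0.75):
    """Return the 1-based number of the first gate whose accuracy meets
    `criterion`, or None if no gate reaches it (an assumed definition)."""
    for gate_number, accuracy in enumerate(accuracy_by_gate, start=1):
        if accuracy >= criterion:
            return gate_number
    return None


# Hypothetical 100-sample signal at 1000 Hz and made-up accuracies.
gates = make_gates(list(range(100)), sample_rate=1000)
print([len(g) for g in gates])
print(identification_point([0.30, 0.55, 0.80, 0.95]))
```

Because each gate contains everything in the previous gate plus one more step, the gate number at which accuracy first reaches the criterion indicates how much of the syllable a listener needs, which is the time-course information that fixed-window truncation does not give.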

The general outline of this dissertation is as follows. Chapter 2 provides a brief introduction to the and of Mandarin Chinese, together with the literature review. Chapter 3 discusses methodology for the production study and results and discussion of the acoustic analyses on the production data are presented in Chapter 4.

Chapters 5 and 6 will respectively present the methodology and results of the perception study. Summary and general discussion will be provided in Chapter 7.


Chapter 2 Literature Review

This literature review is divided into the following sections:

2.1 Chinese and its dialects

2.2 Syllables in Mandarin Chinese

2.3 Tones in Mandarin Chinese

2.4 Production of Mandarin lexical tones

2.5 Perception of Mandarin lexical tones

2.6 Research Outline

The first section provides background information regarding Chinese and its classified dialect families and a concise description of the establishment and development of Modern Standard Chinese (§2.1.2),2 followed by a discussion of the current sociolinguistic situations of two regional dialects of modern Standard Chinese (SC) spoken in mainland China and Taiwan (§2.1.3 and §2.1.4, respectively). The second and third sections of the review delineate syllabic structure and tonal characteristics in modern SC. A review of findings from previous production studies, factors influencing tonal contours, and comparisons of lexical tones in Beijing Mandarin (hereafter BM) and Taiwan Mandarin (hereafter TM) are presented in §2.4. The review of Mandarin tone perception (§2.5) focuses on studies concerning issues of speaker variability and those employing the gating paradigm to investigate the nature of Mandarin tone processing. Tone perception by nonnative speakers of Mandarin will also be addressed. §2.6 outlines the research questions for the production and perception studies in this dissertation.

2 For a more detailed description of Chinese languages and dialects, please refer to such sources as Chappell (2004), Language Atlas of China (1987), Chen (2000), Yuan (1989), Norman (1988), and Zhu (2001). Chen (1999) provides a comprehensive and detailed overview of the historical development of Modern Standard Chinese. Modern Standard Chinese is defined as the standard variety of Chinese (Li, 2006), which can be further specified in terms of a spoken standard (i.e., pǔtōnghuà) and a written standard (i.e., Modern Written Chinese).

2.1. Chinese and Its Dialects

Chinese consists of several varieties of speech which are mutually unintelligible to various degrees but share a single system of writing and a common literary and cultural history (Crystal, 1987). Chinese is traditionally classified into seven dialect families: Mandarin, Wú, Xiāng, Gàn, Kèjiā (Hakka), Yuè (Cantonese), and Mǐn (Li, 1973; Yuan, 1989; Norman, 1988; Chen, 1999; Chappell, 2004). Based on phonological, grammatical, and lexical criteria, these seven traditional dialect families have been further organized into three groupings by Norman (1988): (1) the Northern group, which contains the Mandarin dialects, (2) the Central group, which includes the Wú, Gàn, and Xiāng families, and (3) the Southern group, which includes the Mǐn, Yuè, and Kèjiā families.

The speech varieties within each dialect family constitute genetically related varieties that are more or less mutually intelligible. Norman (1988) proposed that the sub-dialect varieties of Mandarin3 in the Northern group are the most linguistically

homogeneous in terms of phonological, grammatical and lexical features. Nevertheless,

3 Mandarin comprises four mutually intelligible sub-dialects or varieties: (1) Northern, (2) Northwestern, (3) Southwestern, and (4) Eastern Mandarin (Li, 1973; Norman, 1988). Eastern Mandarin is also referred to as Jiāng-Huái (Norman, 1988; Chen, 1999; Zhu, 2001; Chappell, 2004) or Xiàjiāng (i.e., the lower reaches of the Yangtze River). According to another scheme of dialectal categorization, Mandarin can also be divided into Northeastern (東北官話), Jiaoliao (膠遼官話), Jilu (冀魯官話), Beijing (北京官話), Zhongyuan (中原官話), and Lanyin (蘭銀官話, replacing Northwestern) (Zhu, 2001, p. 146).


Zhu (2001) pointed out that the most noticeable differences within Mandarin are the

diverse pitch values of local varieties, for example, Tone 1 has a very high level pitch

[55]4 in Beijing but a very low [11] in neighboring (p. 146). On the contrary,

the Central and Southern dialect groups display considerable phonological and lexical

heterogeneity among local vernaculars (Norman, 1988). In these areas, even speakers of varieties within the same dialect family may have trouble understanding one another. For example, it has been documented that speakers of regional varieties of Min (Chen, 1999; Li, 2006) have difficulty understanding people living in close proximity.

Generally speaking, speakers from different dialect families or even from the same dialect family have difficulty communicating with each other. As a result, it has been widely accepted that Chinese dialects are different languages since a great number of them are mutually unintelligible across different dialect families (e.g., Crystal, 1987;

Norman, 1988; Duanmu, 2007; Chen, 1999; Li, 2006).5 As Li (2006) pointed out,

“Chinese varieties are really more like discrete languages than dialects of the same

language” (p. 154).

2.1.1. Chinese Languages in the PRC and Taiwan & Nomenclatures

Mandarin Chinese is the official language in the People’s Republic of China (PRC)

and Taiwan and one of the official languages in and (Chen, 1999;

Chappell, 2004; Zhu, 2001). Among the Chinese languages, Mandarin Chinese has the

4 The pitch values of the four lexical tones can be represented on a five-point pitch scale (Chao, 1948, 1968), with 1 being the lowest pitch level and 5 the highest of the perceived pitch range of a speaker. Tones 1-4 from monosyllables in isolation or in sentence-final position can be transcribed as 55, 35, 214, and 51, respectively (see Table 2.2 and §2.3.2).

5 Zhou and Ross (2004) pointed out that Western researchers tend to consider Sinitic dialects as different languages while Chinese scholars and native speakers usually consider them as dialects under one single designation, Chinese (p. 2). Please refer to note 3 in the Introduction of Chen (1999) for a detailed discussion of the classification of these two terms.


greatest number of speakers, totaling around 840 million (, 2005; figure was based on 2000 census and the number is increasing) people in the .

Roughly speaking, over 70 per cent of Chinese speakers use one of the Mandarin dialect varieties as their first language (Zhu, 2001; Ethnologue, 2005; Chappell, 2004). Its geographical distribution widely covers areas north of the Yangtze and southwestern provinces in China (Language Atlas of China, 1987; Norman, 1988; Ethnologue, 2005).

The rest of the population speaks one of the other 7 main dialects as their first language:

Wu (spoken by ), (spoken by Cantonese), Min Nan ()6,

Xiang, Hakka, and Gan, ordered by decreasing number of speakers of languages

(Ethnologue, 2005, Table 3).

In Taiwan it is estimated that there are 4 million native speakers of Mandarin

(based on 1993 figures reported in Ethnologue, 2005). In addition to Mandarin, around

73%-80% of the population of 23 million speaks a variant of Southern Min, since the majority of the Chinese population in Taiwan is ethnically Mǐn (The Republic of China Yearbook, 2009).7 There is also a sizable group of native speakers of Hakka, who make up around 20% of the Han population in Taiwan (The Republic of China Yearbook,

2009). Around 2% of the population are descendants of indigenous tribes who speak

various indigenous languages belonging to the Austronesian language family, though the

number of speakers is declining.

6 In addition to Min Nan (Southern Min), the Min dialect can be divided into several sub-dialects, including Northern Min (MǐnBěi), Eastern Min (MǐnDōng), and Central Min (MǐnZhōng).

7 The Chinese refer to themselves and their language as Han, a name which derives from the Han dynasty (202 BC-AD 220), and which is to be distinguished from the non-Han minority groups and languages such as Tibetan and Mongolian (Crystal, 1987, p. 312). Over 95 percent of the population in Taiwan is made up of Han Chinese, and the remainder is composed of Austronesian (Malayo-Polynesian) people and recent immigrants. Among the Han people in Taiwan, the Holo (Mǐn) are the largest subgroup, accounting for around 70% of the population (The Republic of China Yearbook, 2009). The variety of Min spoken in Taiwan is variously referred to as Holo, Minnan, Táiyǔ, and Taiwanese.


The nomenclature is as complicated as the diversity of Chinese languages itself: several different terms are used loosely or interchangeably in different contexts and by different

scholars. Li (2006) provided definitions for several terms that are often seen in the

Chinese literature. Specifically, the term xiàndài hànyǔ ‘Modern Chinese’ or literally

‘modern language of the Han people’ includes both spoken and written Han Chinese

varieties. The term xiàndài biāozhǔn hànyǔ ‘Modern Standard Chinese’8 specifically

refers to the standard variety of Chinese, which can be further specified as spoken standard

and written standard. The official spoken form of Modern Standard Chinese (i.e.,

Modern Spoken Chinese) has different names in each country. It is known as guóyǔ

‘national language’ in Taiwan, pǔtōnghuà ‘common language’ in the PRC, and as huáyǔ

’ in Singapore and .

As noted by several researchers (e.g., Norman, 1988; Fon and Chiang, 1999) and

recent phonetic studies on regional varieties of modern SC (e.g., Li et al., 2006; Deng et

al., 2006), the official standard Chinese is usually spoken with influences from local

dialects by bilingual speakers of the official standard and their own Chinese dialects.

Nevertheless, most Mandarin studies in the speech production and perception literature

use Mandarin as a loose term without clearly specifying which regional variety of

Mandarin was examined. Since one of the objectives of the current study is to examine

whether there is dialectal divergence in the production of lexical tones in pǔtōnghuà

(PTH) and guóyǔ, the following terms will be used with precise designations.

The term Beijing Mandarin (BM) is used to refer to Beijing pronunciation of

Mandarin, which serves as the norm of pronunciation of Standard Chinese in the PRC. It

8 Rogers (2005) pointed out that the term hànyǔ ‘Chinese language’ (漢語) is also sometimes used interchangeably with Modern Standard Chinese, especially in the context of academic writing (p. 22).


is considered to be representative of PTH since Beijing speakers do not speak other

Chinese dialects. In the literature and Chinese language textbooks, BM has been the designated standard (e.g., Cheng, 1985). Distinction is also made between Taiwan

Mandarin (TM) and (e.g. Chen, 1985; Fon and Chiang, 1999).

The former refers to the official language of Taiwan, guóyǔ; the latter refers to “an L3 which exists exclusively on the island, with different proportions of mixture of Taiwanese (L1) and Taiwan Mandarin (L2)” (Fon and Chiang, 1999, p. 32). Similarly,

Chen (1985) defined TM as “the variety which is learned and used, primarily as , by the people of Taiwan, 80% of whom speak Taiwanese (TW), a variety of

Minnan, as a native language” (p. 352). Instead of using Taiwanese, the term

Taiwan Southern Min (TSM; Taiwanese Southern Min) (Sanders, 2008) is used to provide a more specific designation as to which regional sub-dialect of Min is spoken in

Taiwan.

2.1.2. Establishment and development of Modern Standard Chinese

In response to western industrialization and the need for modernization during the

late nineteenth century and the early twentieth century, a modern Chinese language (both

spoken and written) was felt to be needed to strengthen national unification (Chen, 1999;

Zhou, 2001; Zhou and Ross, 2004). Guóyǔ ‘national language’ was to be promoted as

the new national and serve as Modern Standard Chinese in the Nationalist

period (1911-1949). There are two phases in the development of the phonological

system in guóyǔ. 9 The newer pronunciation standard (xīn guóyīn “new national

9 The old phonological system, usually referred to as the “old national pronunciation” (lǎo guóyīn), was a hybrid system based mainly upon the phonology of the Beijing vernacular with some linguistic features adopted from other varieties of Mandarin and dialects, which were no longer phonetically distinctive in the contemporary Beijing dialect (Chen, 1999). This old pronunciation was later referred to as “artificial national pronunciation” (人造國音) since it did not occur in any natural language.

pronunciation”) is entirely based on the modern Beijing dialect for its pronunciation, phonetic realization of the four tones, and other major aspects of phonology (Chen, 1999). The status of the modern Beijing dialect as the sole standard dialect of Modern Spoken Chinese was recognized in 1926 after replacing the older pronunciation standard.

Promotion of guóyǔ was facilitated by the employment of sound annotating symbols, zhùyīn zìmǔ, now renamed zhùyīn fúhào (Chen, 1999) and commonly known in Taiwan as zhù yīn. It was the first set of phonetic annotating scripts officially promulgated by the government, in 1913.

The promotion of guóyǔ by the lasted until 1949 when the People’s Republic of China (PRC) was founded by the Chinese Communist Party (CCP). The standard national language was replaced by the establishment and promotion of pǔtōnghuà ‘common speech’, which became the new name for Modern Spoken Chinese under the governance of the PRC at the beginning of 1956. The term pǔtōnghuà was defined as “the speech form that can be used in all provinces” (Zhu, 1906).

Following the phonological system (new national pronunciation) in guóyǔ, the standard form of pǔtōnghuà was defined as being based on the Beijing dialect of Mandarin for its pronunciation norm at the National Conference on Script Reform in 1955. PTH was formally defined in 1956 with regard to its phonology, lexicon and grammar as follows (J. , 1995, cited in Chen, 1999, p. 20):

Pǔtōnghuà is the standard form of Modern Chinese with the Beijing phonological system as its norm of pronunciation, and Northern dialects as its base dialect, and looking to exemplary modern works in bái huà ‘vernacular ’ for its grammatical norms.

Nevertheless, PTH as the spoken norm of Modern Standard Chinese is not

phonologically and phonetically identical to the Beijing dialect (Chen, 1999). There are

a few linguistic features in the Beijing dialect that are not permitted or present in PTH

(see Chen, 1999 for a more detailed description). Details of phonological and phonetic

comparisons between PTH and Beijing dialect are not our main concerns since there are

no overt differences between the two varieties in pitch contours of four tones.10

The new official system promulgated in 1958 for pǔtōnghuà

denoting the pronunciation norms is hànyǔ pīnyīn or pīnyīn. This facilitated the

development and promotion of PTH, which lasted for around half a century until the end of

the twentieth century.

2.1.3. Sociolinguistic Situation in Mainland China

Active and vigorous promotion and propagation of PTH as the lingua franca in

mainland China have been one of the priorities on the language planning agenda since the mid-1950s. It is the official language taught in schools. However, the spread of PTH was not meant to “wipe out Chinese dialects artificially, but to reduce the scope of dialect use progressively” (People’s Daily, 1955). Since the 1990s, linguists and scholars have started viewing the coexistence of PTH and Chinese dialects not as mutually exclusive but as complementary (ref. , 2004; Li, 2006).

10 Putonghua is mutually intelligible with Beijing and other Mandarin varieties in the northeast part of China (Ethnologue, 2005).


The prevailing linguistic situation in the mainland, due to its linguistic complexity,

can be described as “bidialectism” (Norman, 1988; Chen, 1999) or “ with

increasing (dialect) bilingualism” (Li, 2006, p. 149; Chen, 1999). Chen (1999)

suggested that a “diglossic differentiation” has developed between PTH and local dialects

(p. 53). That is, PTH, as a High variety, is the standard linguistic code used in education

and public affairs such as broadcasting and written communication, and in public places

and cross-dialect communication. Local dialects, as a Low variety, are the linguistic

codes for interpersonal and daily communication in the home and local dialect-speaking

communities. Studies examining the pattern of language use of PTH and local dialects

in three dialect-speaking areas (Wú, Mǐn, Yuè dialects) showed similar patterns, except

for Cantonese-speaking areas (Chen, 1999; see Tables 4.1, 4.2, and 4.3).11 A survey

conducted by Wu and Yin (1984; cited in Chen, 1999, pp. 27-28) showed that 91% of the

population understood PTH (as compared to 41 percent in the early 1950s) and 50%

could speak it. Among the number of people who could speak PTH, 54% were from the

population in the Mandarin-speaking areas and 40% were from other dialect regions.

Based on a national survey carried out in 2004, only 53% of the 1.3 billion population

could speak PTH and at least 40% of Chinese population was unable to communicate in

it (China Daily, 2006).

As can be observed in bidialectal (bilingual) speech communities or diglossic language situations, language contact and the creation of an interlanguage (Selinker, 1972) intermediate between PTH and dialects are inevitable. Norman (1988) pointed out that while PTH is a genuine national language, it is “rarely spoken in its purely standard

11 It should be noted that the patterns of uses of local dialect and PTH have changed since the studies cited in Chen (1999) were conducted in the early 1990s.


form outside the city of Peking and its environs” (p. 213). Similarly, Chen (1999)

commented that speakers from areas outside Beijing speak “adulterated PTH” (p. 41) due

to interference from their native dialects. The adulterated form of PTH or the

“interlanguage” is often labeled “local pǔtōnghuà” (dìfāng pǔtōnghuà) in recent literature (Chen, 1999, p. 42). In some phonetic studies investigating regional varieties of standard Chinese, local PTH is referred to as “accented Chinese” (vs. standard Chinese) (e.g., Li and Wang, 2003; Li et al., 2006). Based on the findings of a few field studies, Chen (1999) pointed out that deviation in the pitch contours of the four lexical tones is one of the most common features characteristic of non-native speakers of the Beijing dialect.12

2.1.4. Sociolinguistic Situation in Taiwan

As mentioned earlier, 73% - 80% of the Han Chinese population in Taiwan is

ethnically Min, who are descendants of successive waves of Chinese immigrants from

Fujian and provinces on the mainland since the 17th century. Along with these successive migration waves, a sizable group of native speakers of Hakka also came to settle in Taiwan. There are also aborigines who settled on the island thousands of years before

these groups arrived. Due to its ethnic makeup, Taiwan is characterized by its cultural

and linguistic diversity. However, language practices and status should not be

understood in terms of population of ethnolinguistic groups. Instead, Taiwan’s public

language history has been one of struggles and changes due to various succeeding

colonial governments and ruling parties (Sandel, 2003, p. 530).

12 The other two features are simplifications of the phonology of PTH: (1) merging of the nasals [n] and [ŋ] as syllabic endings and (2) merging of the dental initials [ts], [tsʰ], and [s] with the respective retroflex initials [tʂ], [tʂʰ], and [ʂ] (Chen, 1999, p. 44).


After the transfer of sovereignty over Taiwan from Japan back to China (the Nationalist government led by the Kuomintang, KMT) in 1945, guóyǔ replaced the former national language, Japanese, and became the sole official language in Taiwan. A large influx of Mandarin-speaking “mainlanders” came with the troops when the Nationalist government lost the to Communists and retreated to Taiwan in 1949. During the KMT’s rule, the government promulgated a Mandarin-only language policy from 1946 to 1987, which promoted rigid monolingualism. In contrast to the lenient language policy adopted by the PRC government toward dialect use, guóyǔ was elevated to the status of the national language at the expense of local Chinese dialects, including Min-Nan, Hakka and the aboriginal languages (Hsiau, 1997; Li, 2006). For example, students were punished for speaking “local dialects” (fāngyán) at schools.13 As a result,

it is estimated in S. (1993: 117) that in 1991, about 90% of the population of

Taiwan spoke guóyǔ, a figure that is much higher than that on the mainland in the early

1980s (cf. Wu and Yin, 1984, cited in Chen, 1999).

In 1990 the new ruling party, the Democratic Progressive Party (DPP), revived bilingual education at the local level and started promoting and incorporating the “mother tongue” (mǔyǔ, the term used to replace the more general term fāngyán) into the school curriculum. The current language situation in Taiwan can also be characterized as diglossic, like that in mainland China: people converse in Taiwan Mandarin in public domains and speak local dialects at home.

13 Nevertheless, people did not get punished for using their local dialects at home, with friends and in informal and private settings.


2.2. Syllables in Mandarin Chinese

The correspondence between syllable and morpheme in Mandarin Chinese is

almost one-to-one, except for very rare disyllabic morphemes and morphemes with the diminutive suffix [ɚ] (Howie, 1974; Duanmu, 2007). Almost all morphemes are composed of a single phonological syllable and the majority of are

monosyllabic (Wang, 1967; Duanmu, 2009). In the Chinese , one

morpheme is represented by one logograph or character.

The maximal size of phonemes in a Chinese syllable is CGVX, where C is a consonant, G a glide, V a vowel and X either a or the offglide of the , and VX the rime (Duanmu, 2007). The medial glide contains a positional variant of a high vowel /i/, /y/ and /u/, which occur only in syllables with a non-high

syllabic vowel (/e/ and /a/). They are also called non-syllabic high vowels (e.g., Howie,

1974). The X can be a non-syllabic vowel (/i/ and /u/) or a nasal coda /n/ or /ŋ/.

Non-syllabic vowels in prevocalic and postvocalic positions (G and X, respectively)

cannot be the same. While nucleus (V) is the only obligatory , other syllabic

components are optional. Phonemes that can be put in the four slots of a syllable are

summarized in Table 2.1.

C                  G                       V            X
most consonants,   positional variants     any vowels   [i, u, n, ŋ]
except [ŋ]         of high vowels
                   [i, y, u] → [j, w, ɥ]

Table 2.1. Phonemes to fill each position in a syllable (adapted from Duanmu, 2007).


Here are some syllables that have the maximal length CGVX (Duanmu, 2010):

[kwai] ‘fast’

[tʰjan] ‘day’

[kwaŋ] ‘light’

[tswan] ‘diamond’
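The CGVX template can be illustrated with a minimal sketch that checks whether a sequence of segments fits the four slots. The function name `parse_cgvx` and the onset inventory are hypothetical and deliberately incomplete; the glide, vowel, and X inventories follow the description in this section, and the only co-occurrence constraint enforced is that G and X may not be the same non-syllabic vowel. The restriction of medial glides to syllables with a non-high vowel is not modeled.

```python
from itertools import product

# Illustrative subset of onsets (assumption); [ŋ] is excluded from the C slot.
ONSETS = {"p", "pʰ", "m", "f", "t", "tʰ", "n", "l", "k", "kʰ", "x", "ts", "tsʰ", "s"}
GLIDES = {"j", "w", "ɥ"}            # positional variants of /i/, /u/, /y/
VOWELS = {"i", "y", "u", "e", "a"}  # vowels named in this section
CODAS = {"i", "u", "n", "ŋ"}        # the X slot
GLIDE_TO_VOWEL = {"j": "i", "w": "u", "ɥ": "y"}


def parse_cgvx(segments):
    """Return the C, G, V, X slots for a list of segments, or None if no parse fits."""
    # Only V is obligatory, so try every combination of optional slots.
    for has_c, has_g, has_x in product((True, False), repeat=3):
        if has_c + has_g + 1 + has_x != len(segments):
            continue
        it = iter(segments)
        c = next(it) if has_c else None
        g = next(it) if has_g else None
        v = next(it)
        x = next(it) if has_x else None
        if c is not None and c not in ONSETS:
            continue
        if g is not None and g not in GLIDES:
            continue
        if v not in VOWELS:
            continue
        if x is not None and x not in CODAS:
            continue
        # Prevocalic and postvocalic non-syllabic vowels cannot be the same.
        if g is not None and x is not None and GLIDE_TO_VOWEL[g] == x:
            continue
        return {"C": c, "G": g, "V": v, "X": x}
    return None


print(parse_cgvx(["k", "w", "a", "i"]))  # a full CGVX syllable such as [kwai]
print(parse_cgvx(["w", "a", "u"]))       # rejected: G and X are both variants of /u/
```

Because only the nucleus is required, a single vowel such as `["a"]` also parses, with the other three slots left empty.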

2.3. Tones in Mandarin Chinese

In modern Standard Chinese, there are four lexical tones (Tones 1-4). As the example in Chapter 1 showed, the four tonal contrasts on the segmental syllable ma denote

different lexical meanings: Tone 1 mā as “mother”, Tone 2 má as “hemp”, Tone 3 mǎ

as “horse”, and Tone 4 mà as “to scold”.

According to the phonological representation of these lexical tones described by

Chao (1948, 1968), Tone 1 is characterized as high level pitch, Tone 2 as rising pitch,

Tone 3 as low-falling-rising (or low-dipping) pitch, and Tone 4 as high falling contour.

Table 2.2 summarizes several ways in which tones are represented in Standard Chinese, using the syllable ma.


Tone number          Tone 1         Tone 2          Tone 3                  Tone 4
Tone value14         55             35              214                     51
Tone contour         High-level     High-rising     Low-dipping             High-falling
                     (level tone)   (rising tone)   (falling-rising tone)   (falling tone)
Tone features        H(igh)         LH              L(ow)†                  HL
(tones as numbers)   ma1            ma2             ma3                     ma4
Pinyin               mā             má              mǎ                      mà
(tones as )
†T3 is L(ow) in non-final positions.

Table 2.2. Representation of four lexical tones in Standard Chinese (adapted from Chao, 1948, and Duanmu, 2009).

There is also a fifth tone, Tone 5 or the neutral tone, which does not have a well-defined and is usually described as toneless.15 Its tonal values can be predicted from its preceding tone (Wang and Li, 1967) and interact with other prosodic features such as and (e.g., , 1992, Cheng, 1966). The neutral tone and its allotones will not be discussed in the current study.

2.3.1. Domain of Tones

While Howie stated that the “distinctive tones are patterns of pitch that coincide with the syllable” (1976, p. 4), there has been no consensus on where tones phonetically lie in Mandarin citation monosyllables. Wang (1967) and Chao (1968) pointed out that the domain of tone is coextensive over the voiced portion of the syllable, which includes the initial consonant when it is voiced.

14 In this study, the expression of tone value in brackets is used interchangeably with Tones 1, 2, 3, 4 (or T1, T2, T3, T4).

15 See Shen (1992) for the typology of Mandarin neutral tones.


Other researchers claimed that the initial voiced consonant and nasal coda should be excluded from serving as tone-bearing units (e.g., Kratochvil, 1968; Dow, 1972; Lin,

1995). Kratochvil (1968) and Dow (1972, p. 102) claimed that the tonal patterns are distinctive over the vocalic part of the syllable.16 Similarly, Lin, M.-C. (1995) argued

that the tonal domain is over only the nucleus, excluding the initial voiced consonant, the

medial, the nasal coda and the vocalic ending. Lin M-C. (1963, 1988) analyzed 147 isolated monosyllables produced by two BM speakers and found that f0 contour of

Mandarin tones contains three parts: (1) pre-onset, (2) basic contour, and (3) post-offset.

In a subsequent study using a gating experiment to examine the relevance of syllabic vowel, medial glide, initial voiced consonant, vocalic ending and nasal coda to tone identification, Lin M-C. (1995) showed that perceived tonal pitch was mainly relevant to

the syllabic vowel and its adjacent transitions. The rises and falls in f0 during the pre-onset and post-offset of an f0 contour that are often observed in a speaker’s tonal production did not play a role in tone identification. Lin M-C. suggested that only the

“basic contour” in an f0 curve (i.e., the vowel) that contains distinctive pitch patterns for

each tone is relevant for tone identification.

Nevertheless, based on the acoustic analysis of the f0 patterns of 136

monosyllables of various syllable structures, Howie (1974) claimed that the domain of

tone is over the rhyme, i.e., the VX part. Howie argued that pitch patterns that occur with

an initial voiced consonant or a non-syllabic vowel are merely anticipatory adjustments

of the voice. That is, any voiced portion preceding the syllabic nucleus belongs to the

tonal transition.

16 Based on Kratochvil’s definition of Mandarin vowels, which includes diphthongs and triphthongs, his view of the tone-bearing unit includes the prevocalic on-glide, whereas Dow (1972) excluded the prevocalic segment.


2.3.2. Main Acoustic Correlates of Tones: f0

Tonal distinctions in Mandarin are implemented mainly by different patterns of

perceived voice pitch within the proper domain of a syllable. The main acoustic

correlate of pitch differentiating lexical tones is the fundamental frequency (f0) (e.g.,

Howie, 1974; Gandour, 1978; Lin and Repp, 1989). Lin and Repp (1989) pointed out that the f0 dimension may be characterized by two phonetic aspects: (1) f0 height and (2) f0 movement.17 Mandarin tones can be contrastively characterized by these two characteristics (e.g., Blicher et al., 1990; Jongman et al., 2006). Representing the

perceived pitch range of a speaker on a five-point pitch scale (Chao, 1948, 1968), with 1

being the lowest pitch level and 5 the highest, four lexical tones can be transcribed by one,

two, three, or even four digits indicating tonal height and contour (Cheng, 1973, p. 98).

For example, the contours of Tones 1-4 from monosyllables in isolation or in sentence-final position can be transcribed as 55, 35, 214, and 51, respectively (see Table 2.2). These tonal patterns have been considered the prescriptive and canonical forms for each tone in the literature. Phonetic studies on the four lexical tones are reviewed in detail in §2.4.
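As a rough illustration of how measured f0 values could be placed on the five-point scale, the sketch below divides a speaker's pitch range into five equal bands on a logarithmic frequency axis. The logarithmic banding, the function names, and the sample values are assumptions made for illustration; they are not the transcription procedure of Chao (1948, 1968) or of the present study.

```python
import math


def f0_to_chao_level(f0_hz, f0_min_hz, f0_max_hz):
    """Map one f0 value (Hz) to a level on the five-point scale (1 = lowest, 5 = highest).

    Assumption: the speaker's pitch range is split into five equal bands on a
    logarithmic frequency axis. This is one possible convention, not the only one.
    """
    if not 0 < f0_min_hz < f0_max_hz:
        raise ValueError("expected 0 < f0_min_hz < f0_max_hz")
    # Clamp values that fall outside the stated pitch range.
    f0_hz = min(max(f0_hz, f0_min_hz), f0_max_hz)
    # Relative position in the range: 0.0 at the floor, 1.0 at the ceiling.
    position = math.log(f0_hz / f0_min_hz) / math.log(f0_max_hz / f0_min_hz)
    # Five equal bands; the ceiling itself belongs to band 5.
    return min(5, int(position * 5) + 1)


def tone_value(f0_points_hz, f0_min_hz, f0_max_hz):
    """Transcribe a sequence of f0 samples as a string of scale digits."""
    return "".join(str(f0_to_chao_level(f, f0_min_hz, f0_max_hz)) for f in f0_points_hz)


# Hypothetical speaker with a 100-200 Hz pitch range (illustrative values only).
print(tone_value([200, 200], 100, 200))       # high level contour
print(tone_value([200, 100], 100, 200))       # high falling contour
print(tone_value([123, 105, 162], 100, 200))  # dipping contour
```

The number of sample points passed in determines the number of digits in the transcription, which mirrors the use of two digits for level, rising, and falling tones and three digits for the dipping tone.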

2.3.3. Secondary Acoustic Correlates of Tones: Amplitude and Duration

17 Lin and Repp (1989) suggested that the terms height and movement are phonetic dimensions of f0 characteristics and that the corresponding phonological descriptions are register and contour. The former are continuous variables described in acoustic terms, while the latter are characterized in discrete terms such as high, mid, low; rising, level, and falling. The term register here does not refer to changes in “vocal register” (please see footnote 1, p. 26). f0 movement depicts the direction of movement in f0. Here I use the terms movement and contour interchangeably.


In addition to f0 information as the primary acoustic property distinguishing tonal identity, other acoustic correlates such as amplitude contour and duration also vary systematically with tones (Howie, 1976; Ho, 1976; Coster and Kratochvil, 1984; Lin, M.-C., 1988; Whalen and Xu, 1992). Previous tone perception studies have shown that these two temporal properties play an important role in Mandarin tone recognition and identification and may serve as secondary cues when the dominant f0 information is absent or not available (e.g., Howie, 1976; Tseng, 1981; Lin, 1988; Blicher et al., 1990; Whalen and Xu, 1992; Fu et al., 1998; Fu and Zeng, 2000; Kong and Zeng, 2006; Kuo et al., 2008).

2.3.3.1. Intrinsic Amplitude and Amplitude Contours

Zee (1980) suggested that intrinsic intensity of a tone interacts with the intrinsic

intensity of vowels that is further conditioned by the tone it carries.18 Massaro et al. (1985) also indicated that intrinsic amplitude seemed to vary as a function of tone. Specifically, Tone 3 was produced with the lowest amplitude while the falling Tone 4 was produced with the highest. Lin, M.-C. (1988) reported similar results; in particular, Tone 3 often had the lowest amplitude while Tones 4, 2, and 1 often had the highest (see Table 5, p. 186).

Based on the data consisting of 24 syllables with 3 vowels produced in isolation

and in sentence environments by five BM speakers, Ho (1976) suggested that the

amplitude contours of isolated BM tones could be described as level or level-falling for

Tone 1, rising or level for Tone 2, rising-falling-rising-falling (i.e., double-peak or

18 Intrinsic intensity is defined in a manner similar to intrinsic duration, which is the duration of a segment as determined by its phonetic quality (Lehiste, 1970).


two-peak contour) for Tone 3, and falling or level-falling for Tone 4. The shape of amplitude

contours remained unchanged for contextual tones, except for Tone 3. Specifically,

contextual Tone 3 had various types of amplitude contours other than the double-peak pattern observed for an isolated Tone 3, including level, falling, rising-falling, and falling-rising contours (p. 361).

Lin, M-C. (1988) analyzed 147 isolated monosyllables produced by two BM speakers and found that tone amplitude could be categorized into five different types: (1) level (plateau-shaped), (2) higher at onset, (3) higher at offset, (4) higher in the middle, and (5) a double-peak amplitude contour (translation in Jongman et al., 2006, p. 212).

Most of the time, the shape of the amplitude contour was type 2 or 3 for Tones 1, 2, and 4, and type 5 for Tone 3. Nevertheless, there was considerable inter- and intra-subject variation in the shape of amplitude contours. For example, while the amplitude contour of Tone 3 always showed the double-peak pattern for the male speaker, it exhibited this pattern only 59.4% of the time for the female speaker.

Although the regulation of vocal fold tension is the primary mechanism of f0 control, an increase in subglottal pressure, ceteris paribus, also leads to an increase in f0.

Since amplitude reflects subglottal pressure, there should be correlation between f0 and

amplitude, either in terms of peak values or contours. Ho (1976) reported a positive correlation between the amplitude peaks and f0 contours in seven phonetic environments, as follows: “high amplitude peaks occur mostly in the environments where the tone contours have high fundamental frequency, and the low peaks occur in the environments where the fundamental frequency of the tone contours is low” (p. 361).


Previous studies have shown a positive correlation between amplitude contours

and f0 contours of citation Mandarin tones (Garding et al., 1986; Sagart et al., 1986;

Whalen and Xu, 1992; Fu et al., 1998; Fu and Zeng, 2000; also Zee, 1980, for Taiwanese

tones). Whalen and Xu (1992) conducted correlation analyses on the amplitude at the center of windows of various durations (100, 80, 60 and 40 ms) and the absolute f0 at the midpoint of the window and found a strong positive correlation between amplitude and f0 for Tones 2, 4, and 3 (r=0.94, 0.94, and 0.65, respectively, p<0.01). Since sounds with lower f0s have fewer pitch periods in a fixed-duration window than do sounds with higher f0s, Whalen and Xu (1992) repeated the correlation analyses between f0 and amplitude on a pitch-period basis and reported a significant positive correlation, though the magnitude was lower (r=0.71).

Fu and Zeng (2000), who analyzed six V syllables produced with all four tones by 10 speakers (5 male and 5 female), also found considerable similarity between the amplitude and f0 contours. The mean correlation coefficients ranged from 0.09 to 0.69, with Tone 3 having the highest correlation coefficient and Tone 1 having the lowest. In addition, the acoustic analysis in Fu et al. (1998) found that the amplitude envelope was highly correlated with f0 contours for Tones 3 and 4. Nevertheless, Fu and Zeng (2000) suggested that such correlation also exhibited “large variability across tones, speakers and syllables” (p. 51). In contrast, Kuo et al. (2008) found that the amplitude contour is not necessarily correlated with the f0 contour of the same tone.
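The kind of contour-to-contour correlation discussed in these studies can be sketched with a plain Pearson coefficient computed over paired samples of an f0 contour and an amplitude contour. The sketch assumes both contours have already been time-normalized to the same number of points; the function name `pearson_r` and all sample values are hypothetical and are not data from any of the cited studies.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("contours must have the same length (at least 2 points)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A perfectly flat contour has no variance, so r is undefined.
        raise ValueError("correlation is undefined for a constant contour")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical contours sampled at five normalized time points (not measured data).
f0_falling_hz = [250, 220, 190, 160, 130]
amp_falling_db = [70, 68, 65, 61, 56]
amp_double_peak_db = [60, 66, 58, 65, 57]

print(round(pearson_r(f0_falling_hz, amp_falling_db), 2))      # similar shapes
print(round(pearson_r(f0_falling_hz, amp_double_peak_db), 2))  # dissimilar shapes
```

A falling amplitude contour paired with a falling f0 contour yields a coefficient near 1, while a double-peak amplitude contour paired with the same f0 contour yields a much weaker one. A completely level contour is rejected here because the coefficient is undefined when one sequence has zero variance.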

2.3.3.2. Intrinsic Duration

Intrinsic vowel duration also varies as a function of the tone that the vowel carries

(Zee, 1980; Tseng, 1981). Previous studies examining the duration of isolated tones


produced by BM speakers have revealed a similar durational pattern, in which Tone 3 has

the longest duration and Tone 4 is the shortest. The durations of Tones 2 and 1 usually fall in the middle of the continuum (Ho, 1976; Tseng, 1981; Lin, M.-C., 1988; Xu, 1992). Specifically, Ho (1976) reported that when CV syllables were produced in isolation, the duration for the four tones was, in descending order: Tones 3, 2, 1, and 4.

Nevertheless, Lin, M.-C. (1985) proposed that there was no exact correspondence

between duration and tones produced in isolation, i.e., there was no intrinsic duration of

tones. Based on the tone production by two BM speakers in his study, Lin found that

Tone 4 was the shortest for the female speaker while Tone 1 was the shortest for the male speaker.

However, vowel duration for Tone 3 was usually the longest among all four tones.

Ho (1976) reported a durational pattern of T3>T2>T4>T1 when tones were

produced in sentence-final position and at the end of a clause in a sentence with falling

intonation. Nevertheless, when they were produced in a sentence-initial or

sentence-medial position, all four tones had almost the same duration. Ho (1976)

concluded that sentence environments definitely had some influence over the duration of

tones.

When the four tones were produced in spontaneous speech, the hierarchy of intrinsic duration was Tones 3, 4 and 2 (Tseng, 1981).19 Since Tone 3 still had the longest duration but Tone 4 was no longer the shortest, Tseng (1981) suggested that intrinsic differences in duration were not maintained in connected spontaneous speech and that duration thus was not a primary phonetic parameter in the production of Mandarin tones.

19 For the measurements of vowel duration in spontaneous speech, Tseng (1981) only used vowel [i] and Tone 1 was excluded from the analyses since there was no occurrence of vowel [i] in the level tone in the spontaneous speech (Tseng, 1981, p. 24).


Similarly, some studies reported that in normal speech, the duration differences between

tones are not a reliable cue to their identity (Coster and Kratochvil, 1984; Kratochvil,

1985, 1987, 1998, cited in Halle et al., 2004, p. 399).

2.4. Production of Mandarin Tones

2.4.1. Perturbation in Tonal Contour

The four tone contours are modified to varying degrees due to initial consonant

perturbation (Howie, 1974; Ho, 1976; Shih, 1988; Xu and Xu, 2003), syllabic structure

(Howie, 1974; Ho, 1976), intrinsic vowel height (Ho, 1976; Zhi and , 1987; Shih,

1988), and sentence environments in which they occur (Ho, 1976). While they maintain

the same basic shapes characterizing their tonality, Ho (1976) pointed out that the

magnitude of effect of these three factors on tones was, in descending order, sentence

environment, vowel, and then the preceding consonant.

2.4.1.1. The Effect of Consonant Voicing and Aspiration

There have been studies investigating the effect of initial consonant voicing and aspiration on the four Mandarin tone contours (Howie, 1974; Ho, 1976; Shih, 1987; Xu and Xu, 2003). Shih (1987) suggested a consonantal effect on the first 50 ms of a tone.

Shih (1987) also indicated that the initial consonant affects tones with an initial H(igh) target, i.e., Tones 1 and 4, but has little effect on tones that begin with an M(id) or L(ow) tonal target,

such as Tones 2 and 3.

It has been shown that the f0 contours of each tone start lower after voiced consonants

than after voiceless ones (Ho, 1976; Shih, 1987). For example, Shih (1987) reported

that an initial nasal lowers the initial H pitch target (i.e., in Tones 1 and 4) by 25 Hz relative to

the expected value. However, this lowering effect of nasals was not observed in Tones 2

and 3. In addition, Howie (1974) found humps or shoulders near the f0 onset for tones

with initial voiced consonants or non-syllabic onglides.

Previous studies have reported conflicting findings regarding the effect of

consonant aspiration and frication on f0 contours. Ho (1976) and Xu and Xu (2003)

found that the onset f0 of a tone is higher when following unaspirated consonants than

when following aspirated consonants. However, Shih (1987) found that aspiration and

frication had a raising effect on an initial H target and suggested that the extra f0 height resulted from the laryngeal gesture that implements the aspiration of the consonants.

2.4.1.2. The Effect of Segmental Structure

Studies have shown that syllabic structure also influences the initial portion of the f0

contour. While the overall basic tonal shapes of each tone are similar to those of CV

syllables, research has found that the f0 contours of monosyllables starting with vowels or

onglides differ from those with initial stops. Specifically, Tones 1 and 4 with initial vowels begin with a “prolonged rise” or “delayed peak (maximum f0)” at some distance

from the syllable onset (Howie, 1974; Ho, 1976; Shih, 1987, 1988). Similarly, syllables

with onglides show a hump or shoulder at the beginning of f0 contour (Howie, 1974).

Shih (1988) pointed out that this small rising slope at the onset f0 is often invisible in

syllables with initial voiceless consonants. In addition, Tones 2 and 3 with initial

vowels do not dip as low as those with initial stops (Howie, 1974; Ho, 1976).

2.4.1.3. The Effect of Intrinsic Pitch Differences


Ho (1976) reported intrinsic pitch differences for different vowels in Mandarin

Chinese, which are consistent with the universality of intrinsic f0 of vowels (Whalen and

Levitt, 1995). High vowels usually have higher fundamental frequency than

mid-vowels, and mid-vowels usually have higher fundamental frequency than the low

vowels with some exceptions (p. 364). Shi and Zhang (1987), examining the f0 values of 9 vowels derived from 400 real and nonsense monosyllables (averaged across all possible consonantal contexts in Standard Chinese) produced in a carrier sentence, found that “the f0 values of the vowels go from high to low as the tongue height of the

associated vowel drops, and the f0 differences between high and low vowels are

significant” (p. 134). Shi and Zhang (1987) concluded that Standard Chinese, as a tone

language, also exhibits the influence of intrinsic pitch of vowels.

2.4.1.4. The Effect of Sentence Environment

Examining the tonal contours of four tones produced in seven sentence

environments, Ho (1976) concluded that four tones maintained their characteristic tonal

distinctions while their tone contours were modified by sentential environments. For

example, the contour of Tone 1 exhibited an appreciable fall throughout the syllable

when produced at the sentence-final position. However, “the fall is moderate compared

to that in the contour of Tone 4” (Ho, 1976, p. 359), which may be due to the final

lowering effect (Fon and Chiang, 1999). Another difference is that the f0 offset of Tone

4 was no longer the minimum f0 in all sentence environments; instead, the f0 minimum occurred at the inflection point of Tone 3. Despite these differences in tonal register and contour, the order of f0 onset for the four tones remained the same as in isolation.


2.4.1.5. Description of Isolated Tones in BM

There have been a large number of acoustic studies examining Mandarin lexical tones (e.g., Wang et al., 1963; Lin, M.-C., 1965; Howie, 1974; Ho, 1976; Tseng, 1981;

Lin, M.-C., 1988). The f0 contours of the four lexical tones are well defined and quite stable when produced in isolation (Chao, 1948; Lin, M.-C., 1965, 1988; Xu, 1997).

They are conventionally demonstrated on monosyllables produced either in isolation (e.g.,

Lin, M.-C., 1965, 1988; Ho, 1976; Xu, 1997) or in citation form in carrier sentences or syllables (e.g., Tseng, 1981; Kuo et al., 2008). These “citation tones” are considered the canonical norms. Table 2.1 presents brief summaries for each study reviewed in this section.

Acoustic analyses on the four citation tones produced by native speakers of BM have shown that tonal contours and pitch values are generally consistent with the prescriptive phonological description proposed by Chao (1948). Tone 1 and Tone 4 are described as high level and high falling, respectively.

Tone 2 is either mid-rising or shows a dip before rising throughout the second half of the vowel. While Tone 2 has often been designated as a rising tone in the literature, phonetic studies have found that Tones 2 and 3 have not only similar f0 onsets but also concave contours (e.g., Shih, 1987; Shen, 1989, 1990; Shen and Lin, 1991; Whalen and

Xu, 1992; Moore and Jongman, 1997). The differences lie in the turning point (inflection point) in the f0 contour and the magnitude of decline in f0 from the onset of the tone to the turning point (∆f0). For instance, based on the production of Tones 2 and 3 by two female native speakers of BM, Moore and Jongman (1997) reported that the turning point was earlier for Tone 2 than for Tone 3 (0%-30% and 28%-54% of total tone duration, respectively), and the ∆f0 was, on average, larger for Tone 3 (44 Hz) than for Tone 2 (15 Hz).
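The two measures described above can be illustrated with a minimal Python sketch. It assumes that a tone's f0 contour is available as equally spaced samples in Hz, that the turning point is taken to be the f0 minimum, and that ∆f0 is the onset f0 minus the f0 at the turning point. The function name and the two contours are hypothetical; they are not data from Moore and Jongman (1997).

```python
def turning_point(f0_contour):
    """Return (relative position of the f0 minimum, delta_f0 in Hz).

    Assumption of this sketch: the turning point is the sample with the
    lowest f0, and position runs from 0.0 (tone onset) to 1.0 (tone offset).
    """
    if len(f0_contour) < 2:
        raise ValueError("need at least two f0 samples")
    min_index = min(range(len(f0_contour)), key=lambda i: f0_contour[i])
    position = min_index / (len(f0_contour) - 1)
    delta_f0 = f0_contour[0] - f0_contour[min_index]
    return position, delta_f0


# Hypothetical contours (Hz), 11 equally spaced samples each.
tone2 = [210, 202, 195, 198, 205, 215, 226, 238, 250, 261, 270]
tone3 = [210, 200, 190, 180, 172, 166, 170, 178, 186, 194, 200]

for label, contour in (("Tone 2", tone2), ("Tone 3", tone3)):
    position, delta_f0 = turning_point(contour)
    print(f"{label}: turning point at {position:.0%}, delta f0 = {delta_f0} Hz")
```

With these invented contours, the Tone 2 minimum falls at 20% of the tone duration with a 15 Hz fall, and the Tone 3 minimum at 50% with a 44 Hz fall; the values were chosen only so that they lie within the ranges cited above.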


| Studies | # speakers (BM) | Stimuli (monosyllables) | Tone 1 | Tone 2 | Tone 3 | Tone 4 |
|---|---|---|---|---|---|---|
| Lin, M.-C. (1963, 1988) | 1 male and 1 female (with another set of 20 speakers) | 147; in isolation | 55 | 25/35 (male/female) | 214/212 (male/female) | 51 |
| Howie (1974, 1976) | 1 male | 136 citation tones in a carrier sentence | 43** | 25** | 212** | 52** |
| Li et al. (2006) | 4 male and female | - | 55 | 35 | 214 | 51 |
| Deng et al. (2006); Shi and Deng (2006) | 2 females and 2 males (18 yrs) | Monosyllabic and bisyllabic words in isolation | 55 | 35 | 212 | 51 |
| Ho (1976) | 5 | 68 in isolation and in other six sentence contexts | level | rising with a dip at 15% | dipping | falling |
| Tseng (1981) | 1 female | Monosyllables in isolation (and with carrier syllables) and in spontaneous speech | level | rising | dipping* | falling* |
| Xu (1997) | 8 male | 48 /ma/ | high level | rising with a dip at 20% | low dipping | high falling with the highest f0 value |
| Lee et al. (2008) | 1 female | 24 in isolation | high level | low rising | low dipping | high falling |
| Lee et al. (2009) | 5 (2 males) | 24 in isolation and in a sentence context (same set of stimuli as in Lee et al., 2008) | high level | low dipping (low rising for 2 male speakers) | low tone | high falling |

*The tonal production of this female BM speaker was compared to that of the male speaker in Howie’s (1976) study and Tseng (1981) suggested that the range of the “dip” (falling and rising) in Tone 3 and the falling in Tone 2 was larger.
**Since stimuli in Howie’s (1974, 1976) studies were produced in a carrier sentence, they were treated as coarticulated tones.

Table 2.3. Summaries of phonetic studies on isolated BM tones.

Tone 3 exhibits a dipping contour. While only four studies converted tonal contours into pitch values based on the five-point pitch scale (Chao, 1948, 1968), several studies (Lin, 1963, 1985; Deng et al., 2006; Shi and Deng, 2006; Lee et al.,

2009) have shown that the final tonal target of Tone 3 in BM differs from the canonical form traditionally described in the literature and textbooks, i.e., [212] vs. [214]. Tone 3 produced by two male Beijing speakers in Lee et al. (2009) was realized as a low tone without a noticeable rise after the dip, and f0 did not dip as low as in the female speakers

in the study. Deng et al. (2006) and Shi and Deng (2006) suggested a tonetic sound

change in Tone 3 in which the final tonal target of Tone 3 produced by speakers of a

younger generation is lower than that produced by speakers of an older generation whose

Tone 3 production still prescriptively maintains the [214] contour shape.
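How a measured f0 value might be mapped onto the five-point pitch scale can be sketched as follows. The text does not state which conversion the cited studies used; the logarithmic normalization within a speaker's f0 range shown here is only one possible choice and is an assumption of this sketch, as are the function name and the example values.

```python
from math import log10


def to_five_point(f0_hz, speaker_min_hz, speaker_max_hz):
    """Map an f0 value (Hz) onto a pitch level from 1 (lowest) to 5 (highest).

    Assumed mapping: the speaker's f0 range is divided into five equal
    bands on a logarithmic scale; this is not taken from the cited studies.
    """
    if not 0 < speaker_min_hz < speaker_max_hz:
        raise ValueError("expected 0 < speaker_min_hz < speaker_max_hz")
    if not speaker_min_hz <= f0_hz <= speaker_max_hz:
        raise ValueError("f0_hz must lie within the speaker's range")
    span = log10(speaker_max_hz) - log10(speaker_min_hz)
    t = 5 * (log10(f0_hz) - log10(speaker_min_hz)) / span  # 0.0 to 5.0
    return min(5, int(t) + 1)


# Hypothetical dipping contour (Hz) for a speaker whose range is 100-200 Hz.
contour = [150, 120, 100, 110, 140]
levels = [to_five_point(f0, 100, 200) for f0 in contour]
print(levels)
```

A logarithmic rather than linear scale is used here only because pitch perception is roughly logarithmic in frequency; a linear division of the range would be an equally simple alternative.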

In addition, the order of f0 onset for the four tones in descending order was Tones

4, 1, 2, and 3 (Ho, 1976). Similarly, Xu (1997) also found that the f0 onset of Tone 4

was the highest among the four tones. In terms of minimum f0 of the four tones, sometimes the inflection point in Tone 3 reaches the lowest value (about 90 Hz) and is even lower than the f0 offset of Tone 4 (e.g., Xu, 1997).

2.4.1.6. Description of Coarticulated Tones

When words are juxtaposed or combined in speech, the normative f0 patterns of

the tones are influenced by the preceding and following tones (e.g., Wang, 1967; Chao,

1968; Shih, 1987; Shen, 1990; Xu, 1994, 1997). These f0 perturbations or variations

result from tone sandhi and/or tonal coarticulation. While some scholars distinguish the former from the latter (e.g., Shen, 1992), others take the position that there is no essential difference between them (e.g., Chen, 2000). Shen (1992) proposed that the former is a


phonological contextual tonal variation and the latter is a phonetic contextual tonal

variation20. The tonal identity is changed under the sandhi process (i.e., tonemic change)

while it is preserved in the tonal coarticulation process (i.e., allotonic variations).

Among the tone sandhi rules, third tone (T3) sandhi is a tonal alternation process

that has been extensively studied. Tone 3 has the most complicated tonal shapes due to

phonological processes (Wang and Li, 1967; Shih, 1987; Tseng, 1981). The

prescriptive tonal contour [214] changes to a surface Tone 2 [35] when preceding another

Tone 3. This sandhi form of Tone 3 is therefore perceptually indistinguishable from

canonical Tone 2 (Wang and Li, 1967). Since in the current study Tone 3 is always

followed by a Tone 4 when produced in the sentence context, this sandhi form of Tone 3

will not concern us.

In addition to this morphotonemic alternation, tonal shapes of Tone 3 change

according to sentential positions at which it occurs. In sentence-final position, it is

usually realized as the canonical dipping tone [214] and can be optionally reduced to a

[21], which is known as the “half third tone” or “half T3 Sandhi” (Chao, 1948, 1968).

Tone 3 reduces to a low falling tone [21] when it occurs in non-prepausal position.

While it has been widely accepted that Tone 3 exhibits the truncated contour [21] when it

20 They are different from each other in terms of (1) mechanisms of tonal variation, (2) phonetic process, and (3) tonal identity. Shen (1992) defined tone sandhi as “when tones are adjacent one another, the realization of a given tone depends heavily upon the neighboring tones” (p. 84) and tonal coarticulation as “when tones are adjacent one another, the overall pitch height of a tone varies considerable depending upon its immediate tonal environment” (p. 84). She pointed out that the mechanism underlying tone sandhi is attributed to language-specific morphophonemic constraints while that of tonal coarticulation is attributed to language-independent biomechanical constraints. The phonetic process involved in tone sandhi may result from tonal assimilation or dissimilation, whereas tonal change is uniquely a result of tonal assimilation in tonal coarticulation (p. 86) (cf. Xu, 1992). Wang (1967, p. 629) defined tone sandhi as the “effect of neighboring tones upon one another.” Wang (1967) also pointed out that tone sandhi is different from effects of other suprasegmental features such as intonation, emphatic and contrastive stresses on F0 contours and from other “intrinsic” factors due to physiological constraints and physical properties of the speech mechanism (p. 631).

precedes a non-Tone 3 (Chao, 1948), there has been no consensus on the tonal shape of

Tone 3 when it is followed by a neutral tone. For example, Howie (1974) observed a low falling-rising [212] contour for the pre-neutral Tone 3 and argued that Chao’s (1948) half third tone occurs only before Tones 1, 2, and 4. Variants of Tone 3 are summarized in Table 2.3.

| Sandhi contexts | [214] | [21] | [35] |
|---|---|---|---|
| before another T3 | − | − | + |
| utterance-final | + | + | − |
| elsewhere | − | + | − |

Table 2.4. Allotones of Tone 3 when it occurs in different contexts (adapted from Chen, 2000, p. 21)

Among the tonal coarticulation rules proposed by Shih (1988), the following two are of particular interest since the target monosyllabic words are preceded by a Tone 1 and followed by a Tone 4 when they are produced in the sentential context. Rule (i) states that the final L target of Tone 4 (HL) changes to an M when it is followed by another tone. Shen (1990a) also found that Tone 4 holds its fall before any full lexical tone and falls to its extremity before a neutral tone or a pause.21 Shen (1990a,b) suggested that this “half fall” of Tone 4 was due to anticipatory coarticulation rather than

21 However, the pre-neutral Tone 4 in Howie’s (1974) study did not fall to its extremity; rather, it had a [52] tonal shape. This may be because the neutral tone is actually a Tone 4 although the citation syllable and the unstressed neutral tone syllable following it become a syntactic word when spoken in the carrier sentence (Howie, 1974, p. 145).

tone sandhi as suggested by Chao (1948). Rule (ii) states that the final H of Tone 2 (LH)

is deleted when the following tone starts with H. Chen (2000) used the phonetic H– to represent the transition tone between M and the following H.

Studies have investigated coarticulation in disyllabic forms (e.g., Shih, 1987,

1988; Xu, 1997) and trisyllabic sequences (e.g., Shen, 1990b; Xu, 1994) and have shown conflicting results regarding the direction and magnitude of tonal coarticulation (Shen,

1990b; Xu, 1997). For example, Xu (1997) found asymmetrical bi-directional effects for Mandarin tones within the bi-tonal combinations. In particular, the magnitude of assimilatory carryover effect is larger than the dissimilatory anticipatory effect. The magnitude of assimilatory carryover effects diminished over the course of the vowel.

Contrary to the asymmetric bidirectional coarticulation effects reported in Xu

(1997), Shen (1990b) reported active and symmetric carryover and anticipatory coarticulation effects based on 1200 tonal combinations from 400 trisyllabic words and phrases. She also pointed out that tonal coarticulation does not affect f0 direction but f0 height and that the coarticulation effect extends over the entire tonal contour.

Therefore, tonal coarticulation contributes to the upward and downward shifting of the entire tonal contours. Similar to Xu (1997), Shen (1990b) pointed out that

“coarticulatory effects do not extend across tones; tonal coarticulation only occurs on two contiguous tones” (p. 294).

Nevertheless, Lee et al. (2008, 2009) found no discernible differences in f0 shapes between the isolated and contextual tones. However, the onset f0 and average f0 of the test syllable were higher when the preceding syllable was Tone 1 rather than Tone 4 (Lee et al., 2008). Lee et al. (2009) suggested that this was most likely due to the fact that


the target syllables in their study were produced at the utterance-final position in the

carrier sentence where syllables tend to be prosodically prominent and minimally

susceptible to neighboring tonal contexts.

In summary, deviations of f0 shapes in coarticulated tones from their prescriptive

forms depend on the adjacent tonal context. It has been found that assimilatory

carry-over effect and dissimilatory anticipatory effect shift the half or entire tonal pattern

of the target word upward or downward. Among the four lexical tones, Tone 3 has the

most complicated tonal shapes due to phonological processes (Wang and Li, 1967; Shih,

1987; Tseng, 1981, p. 26). The tonal variations of Tone 3 in different phonetic environments are summarized in Table 2.3. However, studies have shown some

inconsistent patterns regarding the phonetic realization of Tone 3. For example, Tone 3

has a mild final rise ([212]) when preceding a neutral tone (e.g., Wang and Li, 1967;

Howie, 1974). Some scholars argued that the low-falling pattern is another phonetic variant of Tone 3 in isolation and in sentence-final position, especially for southern speakers of

Standard Chinese (e.g., Shih, 1987, 1988).

2.4.2. Acoustic Comparison of Tones between BM and TM

Pǔtōnghuà (BM) and guóyǔ (TM) can be considered as two regional varieties of modern Standard Chinese (e.g., Tseng, 2004). Phonetic-acoustic studies on tonal behaviors in these two regional varieties have been limited in number and scope. Nevertheless, over the past two decades a number of studies have examined tonal patterns, including tonal contour, tonal range, and tonal register, of monosyllabic and disyllabic words in isolation and in different


sentential positions in these two regional varieties of Mandarin Chinese (e.g., Kubler,

1985; Shih, 1987; Fon and Chiang, 1999; Fon et al., 2004; Li et al., 2006; Deng et al.,

2006; Shi and Deng, 2006; Sanders, 2008a,b). Major findings in these acoustic studies include (1) considerable phonetic differences in tonal patterns of isolated Tone 3, (2) smaller tonal range and lower tonal register in TM (Fon and Chiang, 1999; Li et al., 2006;

Torgerson, 2005; Tseng, 2004), and (3) a narrower tonal distinction (Fon and Chiang,

1999; Chiung, 1999, 2003; Li et al., 2006; Deng et al., 2006; Shi and Deng, 2006).

Summaries of these studies are provided in Table 2.2.

2.4.2.1. Tone 3

Shih (1987, 1988) argued that Tone 3 is “phonetically a low falling tone” (i.e.,

[31]), which starts at the speaker’s mid range and falls to the low range and is often characterized by laryngealization over the second half or at the end of the syllable. Shih

(1987) suggested that the canonical form of Tone 3 in isolation and pre-pausal position

(i.e., [214]) is used in “formal deliberate speech” (p. 6). Furthermore, she claimed that this prescriptive norm of Tone 3 involves “dialectal variances, personal preference, and style of speech” (p. 6). Specifically, Shih (1988) argued that while Northern speakers often use the falling-rising pattern in sentence-final position in all speech acts, southern speakers frequently use the low-falling pattern even in the sentence-final position in casual speech, and use the falling-rising pattern only in deliberate, emphatic speech, or in yes-no questions. As a result, Shih (1987, 1988) modeled Tone 3 as (M) M L to avoid confusion with a Tone 2 that assumes a dipping contour.22

22 The placement of the tonal target for each tone depends on the tonal pattern of each tone and the syllabic structure. While the first and final targets are always aligned with the (syllable) onset or the initial consonant and the offset of the syllable, respectively, the location of the second target is not always aligned with the onset of the rhyme, for example, in Tones 2 and 3. Shih (1987) suggested that the initial target has a value similar to that of the target at the beginning of the rhyme, which accounts for the relatively stable pitch throughout the consonantal area. In syllables with voiceless consonants, the initial target would have no effect, but the articulatory attributes of the consonants might affect the following tonal target (p. 5).

Examining tone production by speakers of TM and dialect, which is the standard southern Min dialect, Li and her colleagues (2006) found that Tone 3 in isolation and word-final position became [31] for both dialect groups while it assumed

[214] for speakers of BM. Similarly, Deng et al. (2006) and Sanders (2008a, b) reported a low-falling f0 pattern for isolated Tone 3 in TM. Sanders suggested that this tonal change has been accelerating across three age groups of TM speakers, being more pronounced in the production of the younger generation.

Most importantly, Sanders (2008b) proposed a model for the tonetic sound change, in which he argued that, similar to a vowel chain shift, the dipping contour in Tone 2 encroaches upon Tone 3 and thus pushes Tone 3 to evolve into a falling contour. The new falling citation contour of Tone 3 then encroaches upon Tone 4, which also has a falling pattern. Instead of shifting Tone 4 to a different tonal contour, its fall comes to occupy a higher pitch range [53]. As a result, Tones 3 and 4 in TM display a low-high register contrast, i.e., [31] versus [53] (ref. Chiung, 1999).

While not specifically investigating the tonal differences between these two dialects, two studies reported similar findings on Tone 3 contours produced by both male and female TM speakers (Feng et al., 2006; Kuo et al., 2008). For example, the production by five male speakers in Feng et al. (2006) showed a contour which “turns up only minimally after the initial lowering” (p. 77). The female speaker in Kuo et al.

(2008) used both falling-rising and low-falling patterns in her production of

sentence-final Tone 3, and the final rise, if present, was of minimal magnitude. While

Kuo et al. (2008) claimed that the male speaker used the falling-rising pattern

consistently in his production of Tone 3, the magnitude of the final rise was also minimal

(ref. Fig. 1, p. 2817).

Similarly, the Tone 3 production by the female speaker in Fon et al. (2004) also

exhibited this mixed pattern. Although Fon et al. (2004) claimed that TM Tone 3 still had the dipping contour [312], Tone 3 was actually produced in its full dipping form less than half of the time (43.2%) by this female TM speaker. Moreover, this falling-rising Tone 3 was still shorter than Tone 2, which was inconsistent with Shih’s

(1988) observation that dipping Tone 3 is usually longer than the low-falling allophonic variant.

2.4.2.2. Other Tones

Among the four lexical tones, Tones 1 and 4 exhibited the same tonal shapes in the

two regional dialects. The differences in Tones 1 and 4 showed up in f0 height or tonal

register (Deng et al., 2006; Shi and Deng, 2006). For example, TM Tone 1 usually

assumed [44] or lower as compared to [55] in the BM counterparts. The final tonal

target of Tone 4 in TM was not as low as that in BM. As mentioned earlier, this [53]

contour may be used to contrast with TM Tone 3 which shares a similar tonal shape but is

realized as [31] in lower register.

Deng and his colleagues reported that TM Tone 2 had a mild rising contour [23],

in which the magnitude of the final rise was much smaller than that of [35] in BM.

Nevertheless, some studies have shown that Tone 2 has a dip in BM (e.g., Shen and Lin,


1991; Moore and Jongman, 1997) and TM (e.g., Fon and Chiang, 1999; Fon et al., 2004).

Specifically, Fon and her colleagues found that TM Tone 2 had a dipping contour and a

comparable f0 height in the mid-low register range similar to the canonical [214] of Tone

3. They suggested that the slopes of both the falling and rising portions were steeper in

Tone 3. Overall, these three TM tones seem to have a narrower tonal contrast when

compared to the BM counterparts.


| Study | # speakers (TM) | Stimuli (in isolation) | Tone 1 (BM / TM) | Tone 2 (BM / TM) | Tone 3 (BM / TM) | Tone 4 (BM / TM) |
|---|---|---|---|---|---|---|
| Shih (1987, 1988) | not available | Monosyllables, bi-syllables | (H)HH | (L)LH or (M)M/M-H | (M)ML | (H)H+L |
| Fon and Chiang (1999) | 1 female TM† | Monosyllables and trisyllables | [55]* / [44] | [35]* / [323] | [214]* / [312] | [51]* / [42] |
| Chiung (1999, 2003) | 11 females and 11 males of TM-TSM bilingual | Monosyllables | ― / [22] | ― / [212] | ― / [31] | ― / [453]** |
| Li et al. (2006) | 1 female and 1 male of TM-TSM bilingual | Monosyllables and bisyllables | [55] / [44] | [35] / [325] or [214] (LLM) | [214] / [31] | [51] / [52] |
| Deng et al. (2006); Shi and Deng (2006) | 2 females and 2 males of BM (18 yrs); 2 females and 2 males of TM† | Monosyllables and bisyllables | [55] / [44] | [35] / [23] | [212] / [211] or [21] | [51] / [51] |
| Sanders (2008a,b) | 21 females and 12 males of monolingual TM and bilingual TM-TSM†† | Monosyllables | - / [44] or [33] | - / [213] | - / [31] or [41] | - / [53] or [51] |
| Feng et al. (2006) | 5 males and 5 females | Monosyllables | ― | ― | ― | ― |
| Kuo et al. (2008) | 1 female and 1 male | Monosyllables in a carrier sentence | ― | ― | ― / Male: low-falling rising; Female: low-falling | ― |

*Phonological tonal values based on the prescriptive description in the literature (e.g., Chao, 1968).
**The initial tonal target of the TM Tone 4 may result from the target word /wn/ (which starts with a nonsyllabic vowel with nasal ending) used in the study.
†Subjects were born to parents of Taiwanese origin and either one or both of them speak TSM. Subjects can converse in simple TSM but claim that their everyday language is mainly TM, which is their L1 and the dominant language.
††Sanders (2008a,b) explicitly assigned subjects to two linguistic/language groups on the basis of a self-reporting survey to identify TM-TSM language use.

Table 2.5. Summaries of phonetic studies on isolated TM tones.

2.4.2.3. Tonal register and range

Comparing tonal ranges (f0,max - f0,min) and registers (average f0 of each tone) of the

citation tones between BM and three other regional varieties of Standard Chinese, Li et al.

(2006) found that TM had the lowest tonal register and narrowest tonal range. Similarly,

Torgerson (2005) and Shi and Deng (2006) found cross-dialectal differences in the

register of individual tones produced by BM and TM speakers. Specifically, TM tones

were produced in a lower register than those produced in BM. In addition, Tseng (2004)

analyzed short dialogues consisting of sentences of different prosodic structures produced

by 6 radio announcers from each dialect region and found that BM demonstrated a

generally higher pitch register than TM. Based on the acoustic data obtained from both

trained radio announcers and untrained speakers, there seem to be cross-dialect

differences in the tonal registers.
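The two quantities compared in these studies follow directly from the definitions given above (range as f0,max - f0,min, register as the average f0 of a tone). The minimal sketch below assumes that each tone's f0 contour is a list of samples in Hz; the function names and the example contours are hypothetical and are not measurements from any of the cited studies.

```python
from statistics import mean


def tonal_range(f0_contour):
    """Tonal range: maximum f0 minus minimum f0 of the contour (Hz)."""
    return max(f0_contour) - min(f0_contour)


def tonal_register(f0_contour):
    """Tonal register: average f0 of the contour (Hz)."""
    return mean(f0_contour)


# Hypothetical Tone 4 contours (Hz) for two illustrative speakers.
contours = {
    "speaker A": [280, 260, 230, 200, 170],
    "speaker B": [240, 225, 210, 195, 180],
}

for speaker, contour in contours.items():
    print(speaker, "range:", tonal_range(contour),
          "register:", tonal_register(contour))
```

In this invented example, speaker B shows both a narrower range and a lower register than speaker A, which is the kind of pattern reported for TM relative to BM.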

Researchers have hypothesized a substratum influence of Taiwan Southern Min

(TSM), which is characterized by a relatively low pitch/tonal register (Lin and Repp,

1989; Deng et al., 2006; Shi and Deng, 2006; Li et al., 2006). For instance, the pitch

values of Tone 1 and Tone 2 in TSM are [44] and [13]/[24], respectively (Deng et al.,

2006).23 In addition, Li et al. (2006) found that TM speakers kept similar tonal range

and register in the production of both TM and TSM tones. This may provide support for the substrate effect of TSM on TM.

According to these acoustic studies, both Beijing and Taiwan Mandarin have four phonologically distinct tones; however, they are acoustically realized in different ways.

The findings of previous phonetic studies are summarized as follows:

23 Tone values [13] and [24] of TSM Tone 2 represent the Northern and Southern TSM dialects, respectively (Deng et al., 2006).

(1) While four lexical tones in BM maintain the contours similar to the

prescriptive forms described in the literature, the final H tonal target of

Tone 3 is not as high as the prescriptive value. That is, production of

Tone 3 by a younger generation of speakers exhibited a [212] tonal

pattern instead of the canonical [214].

(2) TM tones differ from their BM counterparts in terms of f0 height (tonal

registers) and contours. All four tones in TM are characterized by

lower tonal registers. Tones 2 and 3 tend to become

dipping and low-falling, respectively. For TM speakers, Tone 3 has a low-falling

pattern (i.e., half third tone) in isolation and in both pre-pausal and

non-pre-pausal positions. This is consistent with findings in previous

studies.

(3) Scholars have proposed tonetic sound change for BM Tone 3 and TM

Tones 2 and 3 across different generations of speakers. The final

tonal target in Tone 3 in both dialects is under the process of shifting to

a lower register.

2.4.2.4. Durational patterns

Based on the production of 200 isolated tones by eight speakers from Beijing and Taiwan, the durational relationship among the four isolated tones has been reported as

Tones 3>2>1>4 in BM and Tones 1>2>3>4 in TM (Deng et al., 2006; Shi and Deng,

2006). For BM tones produced in the sentence-medial and sentence-final position, the durational relationship became Tones 2>1>3>4. On the other hand, there was no


significant difference in duration among four TM tones when they were produced in the

sentence-medial position. In the sentence-final position, contextual TM tones exhibited

a durational pattern of Tones 2>1>4>3. Based on these findings, Deng et al., (2006) and

Shi and Deng (2006) suggested that the durational relationship among four tones was

more stable in BM since it did not change as a function of sentential positions. In

addition, production of 64 stimuli at the sentence-final position by one male and one

female TM speaker showed a duration pattern of Tones 2>3>1>4 (Kuo et al.,

2008).

As reviewed in the previous section (§2.3.3.2), in BM isolated Tone 3 usually

has the longest duration, Tone 4 has the shortest, and Tones 1 and 2 are

intermediate. In contrast, while Tone 4 in TM is still the shortest, Tone 3 is no longer the longest and usually falls in the middle of the continuum.

2.4.2.5. Amplitude contours

Previous studies investigating amplitude contours of the four lexical tones focused on

BM (see §2.3.3.1); Kuo et al. (2008) is the only study that provides measurements of amplitude contours of TM tones. Kuo et al. (2008) found that the amplitude contour of a tone did not always correlate most highly with the pitch contour of the same tone (p.

2823). For the female speaker whose Tone 3 production demonstrated a low-falling pattern, the amplitude contour of Tone 3 was highly correlated with the pitch contours of both Tones 3 and 4 (r=0.79 and 0.85, respectively), as was the amplitude contour of Tone 4 with the pitch contours of Tones 4 and 3 (r=0.89 and 0.88, respectively). For the male speaker, the amplitude contour of Tone 1 was correlated with the f0 contour of Tone 4


(r=0.69) rather than that of Tone 1 (r=0.38). These findings again confirmed

inconsistent relational patterns of amplitude and pitch contours across tones, speakers,

and syllables (e.g., Fu and Zeng, 2000).
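The kind of comparison summarized above can be sketched in Python. The sketch assumes that the reported r values are Pearson correlations computed at zero lag between time-normalized contours of equal length, which the text does not specify; the function name and all contour values are hypothetical and are not data from Kuo et al. (2008).

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length contours."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("contours must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant contour")
    return cov / sqrt(var_x * var_y)


# Hypothetical time-normalized contours (5 samples each).
amplitude_tone3 = [70, 68, 65, 61, 58]   # rms amplitude in dB, falling
f0_contours = {
    "Tone 1": [250, 251, 250, 252, 251],  # roughly level (Hz)
    "Tone 3": [180, 170, 160, 150, 145],  # low falling (Hz)
    "Tone 4": [260, 240, 215, 190, 170],  # high falling (Hz)
}

for tone, f0 in f0_contours.items():
    print(f"amplitude of Tone 3 vs f0 of {tone}: r = {pearson_r(amplitude_tone3, f0):.2f}")
```

Because a correlation coefficient is insensitive to overall height, a falling amplitude contour correlates highly with any falling f0 contour regardless of register, which is consistent with the pattern described for the low-falling Tone 3.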

2.5. Perception of Mandarin Tones

Studies on tone perception have demonstrated that the three acoustic correlates

reviewed above, fundamental frequency, intrinsic amplitude and duration, are integrated to serve as perceptual cues to tonal identity. The primary cue to tonal identity lies in the domain of pitch (Howie, 1976; Gandour and Harshman, 1978; Tseng, 1981; Gandour,

1984; Massaro et al., 1985; Kuo et al., 2008). The other two cues, duration and amplitude, serve as secondary cues in tone recognition when f0 information is not

available (e.g., Lin and Repp, 1989, for Taiwanese tone perception; Fu et al., 1998; Kuo

et al., 2008).

2.5.1. f0 Cues to Tone Perception and Recognition

Gandour and Harshman (1978) indicated that the following pitch characteristics

(f0 patterns) are commonly exploited to signal tonal distinctions: pitch height, direction of pitch movement, pitch range, magnitude of pitch slope, beginning and ending point of pitch movement. In particular, f0 height/register and f0 contours are two perceptual cues

that play an important role in Mandarin Chinese tone identification and perception (e.g.,

Howie, 1976; Gandour, 1984; Massaro et al., 1985). Investigating the relative perceptual

saliency of these two f0 cues, Gandour (1984) found that Mandarin listeners assigned

slightly more perceptual weight to f0 contour than to f0 height. However, Massaro et al.

(1985) found that these two cues were equally effective in influencing perception of

Tones 1 and 2.

Some studies have shown effects of f0 height/register on perception of certain tone

categories (Massaro et al., 1985; Whalen and Xu, 1992; Gottfried and Suiter, 1997; Fon

et al., 2004; Lee et al., 2008; Lee, 2009). For example, Whalen and Xu (1992) found

that Tone 3 identification was associated with the low f0 register when there was no f0 movement in the fragmented stimuli of very short duration. In addition, several studies have shown that listeners use f0 height to distinguish low-onset tones (Tones 2 and 3) from high-onset tones (Tones 1 and 4) (Gottfried and Suiter, 1997; Lee et al., 2008; Lee et al., 2009).

f0 movement is an important cue for the perception of contour

tones. Studies using synthetic stimuli varying in either (1) the turning point of a concave f0 contour or (2) the f0 fall from the tonal onset to the turning point (i.e., ∆f0) have found that

these are two perceptual cues for differentiating Tones 2 and 3 (Shen and Lin, 1991; Shen

et al., 1993; Moore and Jongman, 1997; Fon et al., 2004). In addition, Fon et al. (2004)

found that, for Tone 2 and Tone 3 syllables, the initial ∆f0 was more likely to evoke a Tone 3

percept while medial and final portions of an f0 contour were more likely to evoke a Tone

2 percept. While some scholars have claimed that the timing of the turning point

constitutes a reliable perceptual cue for distinguishing Tones 2 and 3 (Shen and Lin, 1991;

Shen et al., 1993), f0 height may also contribute to discrimination between these two tones.

Perception of synthetic stimuli differing in initial f0 height and slope of ∆f0 revealed that

a Tone 2 percept comprised a mid initial f0 height and a shallower slope (smaller ∆f0) while

a Tone 3 percept comprised a low initial f0 height and a steeper slope (larger ∆f0) (Fon et al.,

2004).

2.5.2. Temporal Envelope Cues to Tone Perception and Recognition

In addition to f0 cues to tone perception, other acoustic cues such as duration and

amplitude also play an important role in Mandarin Chinese tone recognition and

identification. Several perception experiments have used various noise stimuli

modulated by envelopes of the speech signal (i.e., speech-shaped noise) to separate

the contributions of major temporal envelope cues, such as amplitude contour and duration, to tone perception and recognition (Lin, M.-C., 1988; Whalen and Xu, 1992; Fu et al., 1998;

Fu and Zeng, 2000; Xu et al., 2003; Kuo et al., 2008).

Using signal-correlated noise stimuli to remove both f0 and fine spectral

information from the speech signal, some studies have demonstrated that tone recognition

with only amplitude contour cues is above the chance level. For example, Whalen and

Xu (1992) found correct identification rates of 45%, 55.3%, 69.5% and 92.3% for Tones 1, 2, 3, and 4, respectively.

Fu and Zeng (2000) reported an average of 58.5% correct tone identification.

Specifically, the amplitude envelope cue contributed most to discriminating Tones 3 and 4 (Fu and Zeng, 2000). However, Lin, M.-C. (1988) suggested no effect of

amplitude contour on tone perception.
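As an illustration of how signal-correlated noise preserves the amplitude envelope while discarding periodicity, the following minimal sketch flips the polarity of each sample at random, which is one common way of constructing such noise. Whether the studies cited above used exactly this procedure is an assumption; the function name, the sampling rate, and the decaying 200-Hz test signal are all hypothetical.

```python
import math
import random


def signal_correlated_noise(samples, seed=0):
    """Randomly invert the polarity of each sample (probability 0.5).

    The magnitude of every sample is untouched, so the amplitude envelope
    and the duration survive, while the periodicity that carries f0 and the
    fine spectral structure are destroyed.
    """
    rng = random.Random(seed)
    return [s if rng.random() < 0.5 else -s for s in samples]


# Hypothetical input: 50 ms of a 200-Hz sinusoid with a decaying amplitude.
sample_rate = 16000  # Hz, assumed for illustration
n_samples = int(0.05 * sample_rate)
tone = [
    math.exp(-20 * i / sample_rate) * math.sin(2 * math.pi * 200 * i / sample_rate)
    for i in range(n_samples)
]
noise = signal_correlated_noise(tone)
```

Because only the sign of each sample changes, any envelope measure computed from sample magnitudes is identical for `tone` and `noise`.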

Studies examining the effect of duration as a perceptual cue to Mandarin tone

recognition have shown that duration plays a minor role in identifying tones (e.g., Tseng,

1981; Lin, M.-C., 1988; Fu and Zeng, 2000; Whalen and Xu, 1992). For example, by

presenting native speakers with four types of acoustically truncated vowel [i], including the full

syllable and the initial 75%, 50% and 25% of the vowel only,24 Tseng (1981) found that listeners correctly identified tones in the first half of a vowel, except for Tone

4. This demonstrated that listeners made tone judgments primarily based on the f0 contour present in the fixed-duration stimuli. Tseng (1981) concluded that vowel duration did not affect the perception of tones and that the f0 pattern played the primary role in

both tone production and perception.

Lin, M.-C. (1988) found that listeners were able to achieve a high level of tone recognition (95.8%) when the synthetic tones were set to the typical duration of Tone

4 (which is usually the shortest). The level of tone recognition was only 3% lower than

the recognition of stimuli containing comparable acoustic properties from real speech.

Therefore, Lin, M.-C. (1988) concluded that the duration cue contributed around 3% to tone

recognition. In addition, Whalen and Xu (1992) found that listeners were also able to

identify the tonal category of a signal-correlated noise stimulus when duration as a cue

was removed.

Fu and Zeng (2000) found that performance on stimuli containing only

duration information was poorest (35.6%). Therefore, Fu and Zeng (2000) concluded

that duration cues play a relatively minor role in tone identification while the amplitude

contour cues play a major role. In particular, the duration cue mainly contributes to

discrimination of Tone 3 from other tones, which is consistent with Blicher et al.

(1990). Statistical analyses revealed that duration was not the primary cue in Mandarin

24 Please also note that the actual duration of those acoustically modified stimuli differs across tones, since they were created by removing a certain percentage of vowel duration from the offset and there are intrinsic durational differences across tones. This is different from the gating experiment, in which only a fixed amount of acoustic input is presented to listeners.

tone recognition due to the high variability in the vowel duration (Fu and Zeng, 2000),

which was consistent with the findings in Tseng (1981) and Lin, M.-C. (1988).

2.5.3. Speaker Variability and Speaker Normalization in Tone Perception

Since f0 height (and syllable-intrinsic f0 cues, in general) plays an important role in the perception of lexical tone in Mandarin Chinese, tonal judgments must also be made with reference to the voice pitch range of a speaker. For instance, for tones with similar contours but different registers, a phonologically low tone produced by a high-pitched speaker may be acoustically similar to a phonologically high tone produced by a low-pitched speaker (Moore and Jongman, 1997; Jongman et al., 2006; Lee, 2009).
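The ambiguity described here can be made concrete with a minimal sketch that expresses a raw f0 value as a relative position within a speaker's pitch range. The log-scaled range position used below is only one possible normalization and is an assumption, as are the function name `relative_pitch`, the two speaker ranges, and the 190-Hz token; none of these values come from the studies cited.

```python
import math


def relative_pitch(f0_hz, f0_min_hz, f0_max_hz):
    """Position of f0 within a speaker's range on a log scale.

    Returns 0.0 at the floor of the range and 1.0 at the ceiling.
    """
    if not 0 < f0_min_hz < f0_max_hz:
        raise ValueError("need 0 < f0_min_hz < f0_max_hz")
    span = math.log(f0_max_hz) - math.log(f0_min_hz)
    return (math.log(f0_hz) - math.log(f0_min_hz)) / span


# Hypothetical speaker ranges in Hz (invented for illustration).
high_pitched_speaker = (160.0, 400.0)
low_pitched_speaker = (80.0, 200.0)

token_hz = 190.0  # the same acoustic value heard from either speaker
pos_high = relative_pitch(token_hz, *high_pitched_speaker)  # near the floor
pos_low = relative_pitch(token_hz, *low_pitched_speaker)    # near the ceiling
```

The same 190-Hz token falls near the bottom of the first speaker's range and near the top of the second speaker's range, so a listener who lacks information about the range cannot tell from the raw value alone whether the tone is phonologically low or high.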

Gauging pitch location within a speaker’s f0 range may also be important for

contour tones that have similar f0 movements but are located in different registers of a

speaker’s pitch range (Lee, 2009; cf. Moore and Jongman, 1997). For example, both

Tones 2 and 3 have similar dipping contours and differ in pitch height (e.g., Shen and Lin,

1991; Moore and Jongman, 1997; Fon et al., 2004); thus, Tone 2 produced by a low-pitched speaker and Tone 3 produced by a high-pitched speaker may overlap in f0 range (e.g., Moore and Jongman, 1997). However, Moore and Jongman (1997)

found reduced f0-range normalization effects for synthetic stimuli more closely matching

natural production of Tones 2 and 3. That is, for contour tones, intrinsically dynamic f0 patterns provide sufficient cues, and tone identification may not depend on context or extrinsic f0 information (Moore and Jongman, 1997).

In addition, information about f0 height may still be relevant to the identification of

contextual tones since the canonical f0 contours are perturbed due to tonal coarticulation

(Shih, 1987, 1988; Shen, 1990a,b; Xu, 1994, 1997). For example, the short allophonic

variant of a non-final Tone 3 (except for two adjacent Tone 3s) has a low-falling pattern

(31), which contrasts with a non-final Tone 4 (53) only in register (ref. Lin and Repp,

1989, for similar results for Taiwanese high falling vs. mid falling tones). Since

individuals differ in ranges of voice pitch due to difference in larynx size,

speaker-indexical variation inherent in linguistically identical tonal categories can result

in tone category overlap leading to acoustically and perceptually ambiguous tones.

Previous studies have shown that in lexical tone perception listeners perform

perceptual speaker normalization in which listeners categorize an acoustically ambiguous

target tone, either synthetic or naturally produced, with reference to the perceived f0 range

obtained from contextual cues extrinsic to the target (Leather, 1983; Fox and Qi, 1990;

Moore and Jongman, 1997; Wong and Diehl, 2003, for Cantonese tone perception).

Speaker-indexical acoustic information can be obtained from either a

multiple-talker precursor context (Leather, 1983; Moore and Jongman, 1997; Wong and

Diehl, 2003) or an anchor tone (Qi and Fox, 1990). While the effect of a context tone

on the categorization of the following target tone is assimilatory (Qi and Fox, 1990;

Wong and Diehl, 2003), that of a preceding sentential context is contrastive (Moore and

Jongman, 1997). For example, acoustically identical stimuli are identified as low tones for the high-pitched precursor sentence, but as high tones for the low-pitched precursor condition (Moore and Jongman, 1997). However, using synthetic tone stimuli for independently manipulating important syllable-intrinsic acoustic cues to tonal identity

(e.g., Leather, 1983; Qi and Fox, 1990; Moore and Jongman, 1997) may force listeners to exploit contextual information due to suppression or unavailability of one or more

acoustic cues. This may inflate the apparent significance of talker normalization effects. For

example, Moore and Jongman (1997) concluded that listeners used contextual f0 information only when the intrinsic acoustic cues for target tone contrasts were degraded, and that the available intrinsic acoustic cue needed to be in the same f0 dimension (i.e., f0 range)

as the precursor context.

In addition to resolving speaker-dependent tonal ambiguities by resorting to

extrinsic contextual information cuing speakers’ f0 range, studies have shown that

listeners were able to gauge f0 level and correctly identify isolated intact or fragmented multiple-talker tone syllables (Wong and Diehl, 2003; Lee, 2009) and English vowel syllables (Verbrugge et al., 1976; Honorof and Whalen, 2005) when no syllable-extrinsic contextual information was available. Wong and Diehl (2003) asked 16 native

Cantonese listeners to identify Cantonese level tones produced in isolation by seven male

speakers and found that they achieved an accuracy rate of 48.6% when stimuli were presented in a mixed-talker block condition, which was above the chance level of

33.3%.25 However, multiple and successive presentations of multi-speaker stimuli may

contribute to perceptual familiarization with, and learning of, the speakers’ voice pitch ranges.

Similarly, Lee et al. (2009) reported significantly higher identification accuracy for single-talker stimuli (89%) produced in isolation by one female speaker than for multiple-talker stimuli (86%) produced in isolation by two male and two female speakers. These results corroborated that the listener gauges the talker’s f0 range from trial to

trial in a single-talker presentation mode, which serves as a frame of reference to resolve

25 There are three level tones (Tones 1, 3, and 6) in the Cantonese tonal system.

phonetic category ambiguity resulting from talker variability (Verbrugge et al., 1976;

Wong and Diehl, 2003).

Verbrugge et al. (1976) reported 83% correct identification for isolated

mixed-talker English vowels in /p-p/ syllables vs. 90.5% for isolated single-talker vowels.

This finding suggested that syllable-intrinsic information (when no syllable-extrinsic

contextual information was available) provided sufficient information for listeners to

compensate for talker variability when stimuli were presented in a mixed-talker condition.

In addition, Honorof and Whalen (2005) asked English natives to judge the pitch

location of an isolated multi-speaker vowel [] within a speaker’s f0 range and found that

they were able to locate pitch reliably within a speaker’s f0 range without extrinsic context or prior exposure to the speaker’s voice.

Using multi-speaker (32 speakers) tone fragments consisting of only the and the first six glottal periods and presenting them only once to 40 native listeners, Lee

(2009) showed that identification of these brief stimuli without context, dynamic f0

information, or prior exposure to the speakers’ voices was beyond chance. Listeners

were capable of judging high-low distinction in tones above chance (high: Tones 1 and 4

versus low: Tones 2 and 3). Furthermore, f0 height estimation was correlated with f0, duration, and voice quality measures (F1 bandwidth and spectral tilt). Lee (2009) proposed that the listeners could have used these voice quality differences to detect gender, and then gender detection may in turn be implicated in detecting f0 height from

very short multi-talker stimuli of around 20-49 ms.

2.5.4. Dialectal Variability in Tone Perception

The only study, to my knowledge, that specifically investigated the effect of dialect

variability on Mandarin tone perception by native speakers of Mandarin

Chinese/Standard Chinese in mainland China was conducted by Li and her colleagues.

Li et al. (2006) investigated perception of Tone 3 produced by eighteen speakers of

BM and three other regionally accented varieties of Mandarin (speakers with mid-level accents from

Shanghai, Taiwan, and Xiamen) and found that Tone 3 produced by Beijing speakers was

100% correctly identified, while that produced by Taiwan and Xiamen speakers was

often judged as Tone 4 or uncertain and that produced by Shanghai speakers as Tone 2 or uncertain.

More specifically, perceptual results on Tone 3 produced by Taiwan and Xiamen

speakers revealed that only 20-30% of Tone 3 tokens were correctly identified as the canonical

Tone 3, 35-40% were identified as Tone 4, and more than 10% could not be identified

as any tone of Standard Chinese. The perceptual results corresponded well with the

production of citation tones by speakers of these regions.

2.5.5. Lexical Tone Processing Using a Gating Task

Studies of Mandarin tone perception using the gating technique can be

divided into two categories. The first includes studies that use brief segments from

different locations in the syllable, including “silent-center” tones (Gottfried and Suiter,

1997; Lee et al., 2008; Lee et al., 2009; Lee, 2009), to investigate which tonal portions of the signal contain the critical perceptual cues for tone identification (Tseng, 1981;

Whalen and Xu, 1992; Fon et al., 2004). The second includes studies that employ word-gating experiments (Grosjean, 1980, 1996; Cotton and Grosjean, 1984; Tyler and

Wessels, 1985) to examine online processing of tone information in recognizing

Mandarin Chinese words (Lee, 2000; Wu and Shu, 2003; Lai and Zhang, 2008).

Tseng (1981) presented truncated vowels produced in isolation to listeners for

tone identification and found that they could be perfectly identified in the first half of a

vowel. Misidentification occurred in the initial 25% for contour tones (Tones 2, 3, and 4),

which were often perceived as the level tone (Tone 1) since the particular f0 patterns for

contour tones were absent. Whalen and Xu (1992), using a gating technique, found

similar results. Tone 1 was well identified at all locations while other tones were better

identified at the middle to late portions of the syllable. The first segments, where

there was little f0 change, were most often identified as Tone 1 regardless of the

original tone. In addition, Tones 2 and 3 were often confused with each other, especially toward the end of the syllable. Tone 3 was more often confused with Tone 2, but not vice versa. There was an effect of f0 height on the Tone 3 percept. That is,

when f0 was low enough and there was no f0 movement, a Tone 3 percept was more

frequent. For example, for Tone 3 of syllable /yi/ that started with a low f0 and stayed at

that level, listeners were able to accurately identify it from the first segment. Most

importantly, with the shortest window (40 ms), Tone 4 judgments began to dominate in

the first segment of the syllable (// only) regardless of original tones. Whalen and

Xu (1992) proposed that, with such a short stimulus, the amplitude

drop imposed by the Hamming window may have been perceived as a drop in f0.
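The size of the amplitude drop that a Hamming window imposes on a short fragment can be illustrated with a minimal sketch. It applies the standard Hamming window, w[n] = 0.54 - 0.46 cos(2πn/(N-1)), to a constant-amplitude 40-ms fragment; the 16-kHz sampling rate and the flat test signal are assumptions made for illustration, and the exact windowing used in the original experiment is not specified here.

```python
import math


def hamming(n_points):
    """Standard Hamming window: 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * i / (n_points - 1))
        for i in range(n_points)
    ]


sample_rate = 16000                    # Hz, assumed for illustration
n_points = int(0.040 * sample_rate)    # a 40-ms fragment

flat_fragment = [1.0] * n_points       # hypothetical constant-amplitude signal
windowed = [s * w for s, w in zip(flat_fragment, hamming(n_points))]

peak = max(windowed)
# Level at the final sample relative to the peak near the middle of the window.
edge_drop_db = 20 * math.log10(windowed[-1] / peak)
```

For a fragment of constant amplitude, the window alone lowers the level at the edges by roughly 22 dB relative to the center, so the second half of a 40-ms stimulus carries a steep amplitude fall that is unrelated to the original tone.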

Using fragments extracted from initial, medial and final positions of Tones 2 and

3 of syllable /ba/, Fon et al. (2004) found that regardless of source tones, the initial

falling contour cued a Tone 3 percept while the rising contour (medial and final portions) sounded more Tone 2-like. Similarly, Liu and Samuel (2004) used signal processing techniques to selectively neutralize the f0 information in the falling and rising f0 portions of

Tone 3 and found that the final f0 rise was not needed for the perception of Tone 3.

This was consistent with Whalen and Xu’s (1992) observation that Tones 2 and 3 were especially confusable toward the end of the syllable. Overall, Fon et al. (2004) found that a

Tone 2 percept was more frequently elicited regardless of the source tone (either

Tone 2 or Tone 3). In addition, creaky voice at the end of the initial falling portion did not contribute to evoking more Tone 3 responses. That is, the dipping portion and the creaky voice quality of TM Tone 3 were not as “essential to the percept of Tone 3 as that in Putonghua anymore” (p. 263). Fon and her colleagues suggested that this was likely because the final rising portion in TM Tone 3 was short (shorter than TM Tone

2) and optional (less than half of the productions of Tone 3 were dipping).

Another group of studies has used “silent-center” tones, which parallel

“silent-center” research on American English vowel identification (Strange, 1989), to examine sources of acoustic information for vowels and tones in Mandarin Chinese

(Gottfried and Suiter, 1997; Lee et al., 2008; Lee et al., 2009; Lee, 2009). Four types of tone syllables were used for tone identification, including onset-only, center-only, silent-center and intact syllables.26 This series of studies has shown that tone identification became less accurate (Gottfried and Suiter, 1997; Lee et al., 2008, 2009) and listeners required more time to identify tones as the type of acoustic input changed

26 The initial-only syllable contains only the first six pitch periods of the test syllable (first 20-30 ms); the center-only syllable contains six pitch periods from the beginning of voicing to eight pitch periods from the end of syllable; the silent-center syllable contains the initial six pitch periods and the final eight pitch periods; and the intact condition has the full syllable (Gottfried and Suiter, 1997, p. 212).

from intact to silent-center and onset-only syllables (e.g., Lee et al., 2008). While

listeners made more confusions in the onset-only condition, identification accuracy for

onset-only syllables was still beyond chance even though the f0 contours were relatively

flat and showed no contrastive patterns among the four tones in the first six glottal pulses

(Gottfried and Suiter, 1997; Lee et al., 2008; Lee et al., 2009; Lee, 2009). Acoustic

analyses revealed that there were statistically significant pairwise differences in f0 height, distinguishing high-onset tones (Tones 1 and 4) from low-onset tones (Tones 2 and 3) (Lee et al., 2008; Lee, 2009).
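The four stimulus conditions used in the silent-center studies can be sketched as sample ranges cut at pitch-period boundaries. The sketch below assumes that the boundaries of the voiced portion are already known, that the onset portion is the first six periods, and that the offset portion is the last eight periods; it ignores the initial consonant. The function name `make_conditions` and the evenly spaced example boundaries are hypothetical.

```python
def make_conditions(period_marks, n_onset=6, n_offset=8):
    """Return the sample ranges kept in each stimulus condition.

    period_marks: sorted sample indices of pitch-period boundaries, so a
    syllable with P periods has P + 1 marks.
    """
    n_periods = len(period_marks) - 1
    if n_periods <= n_onset + n_offset:
        raise ValueError("syllable is too short to leave a center portion")
    start, end = period_marks[0], period_marks[-1]
    onset_end = period_marks[n_onset]
    offset_start = period_marks[n_periods - n_offset]
    return {
        "intact": [(start, end)],
        "onset_only": [(start, onset_end)],
        "center_only": [(onset_end, offset_start)],
        # The removed center would be replaced by silence of equal duration.
        "silent_center": [(start, onset_end), (offset_start, end)],
    }


# Hypothetical syllable: 40 pitch periods of 80 samples each
# (a steady 200 Hz at an assumed 16-kHz sampling rate).
marks = [i * 80 for i in range(41)]
conditions = make_conditions(marks)
```

In this invented example the onset-only stimulus spans 480 samples, or 30 ms at 16 kHz, which is in line with the 20-30 ms reported for the first six pitch periods.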

The confusion patterns revealed that for onset-only syllables, native listeners often misidentified Tone 2 as Tone 3 and Tone 4 as Tone 1 (Gottfried and Suiter, 1997).27

Lee et al. (2008) also found Tones 2-3 confusion and Tones 1-4 confusion with asymmetries in the error patterns for the onset-only syllables. Specifically, Tone 2 was misidentified as Tone 3 and Tone 1 as Tone 4 more often than vice versa. The Tones

2-3 confusion patterns were consistent with findings reported in previous tone identification studies (Blicher et al., 1990; Shen and Lin, 1991; Whalen and Xu, 1992;

Gottfried and Suiter, 1997; Wang et al., 1999; Fon et al., 2004).

Another version of the gating technique (word-gating experiments) has been used to examine real-time processing of spoken word recognition, in which fragments of a word of increasing duration are successively presented to listeners and listeners are asked to propose a word for the fragment heard and give a confidence rating (Grosjean, 1980;

Cotton and Grosjean, 1984). Perceptual results are often analyzed in terms of (1)

27 Test tokens were excised coarticulated tone syllables originally produced in a carrier sentence and were presented without the original tonal context in which they were produced (Experiment 2) (Gottfried and Suiter, 1997).

isolation point (IP), (2) confidence ratings, and (3) proposed responses or error patterns at each test gate. IP is defined as the amount of acoustic information (i.e., gates or segments) needed to correctly identify the word or tone of a word without further changes.

Proposed responses are subjects’ responses at each gate before the IP (Lai and Zhang,

2008).
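The definition of the isolation point given above can be expressed as a minimal sketch: the IP is the first gate at which the listener gives the correct response and does not change it at any later gate. The function name `isolation_point`, the tone labels, and the example response sequences are hypothetical and do not come from any of the studies cited.

```python
def isolation_point(responses, target):
    """Return the 1-based gate number of the isolation point, or None.

    responses: the listener's response at each successive gate.
    target: the correct response.
    """
    ip = None
    for gate, response in enumerate(responses, start=1):
        if response == target:
            if ip is None:
                ip = gate      # candidate IP: first gate of a correct run
        else:
            ip = None          # a later change of mind cancels the candidate
    return ip


# Hypothetical responses of one listener to a Tone 2 word over six gates.
responses = ["T1", "T1", "T3", "T2", "T2", "T2"]
ip_gate = isolation_point(responses, "T2")
proposed_before_ip = responses[: ip_gate - 1]   # material for error analysis
```

Here the IP falls at gate 4, and the three earlier responses are the proposed responses from which confusion patterns before the IP would be tabulated.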

Wu and Shu (2003) presented segments of 120 isolated Mandarin monosyllables in 40-ms increments to native listeners in a duration-blocked presentation format starting at the 80-ms gate.28 Results showed that IP was longest for Tone 2 and there were no significant differences in the IPs for the other tones. When intrinsic duration differences among tones were taken into account, Tone 3 needed the least acoustic information to be correctly identified. On average, IP was 157 ms and an isolated word was correctly identified with 55.2% of acoustic input. This is consistent with the result in previous gating studies that tones can be correctly identified within the first half of the syllable.
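A gating schedule of the kind described (40-ms increments, starting at the 80-ms gate) can be sketched as follows. The assumption that the last gate presents the whole syllable, the function name `gate_durations_ms`, and the 300-ms example duration are all hypothetical and are not taken from the original study.

```python
def gate_durations_ms(total_ms, first_gate_ms=80, step_ms=40):
    """List the duration of each gate, measured from syllable onset.

    Gates grow by step_ms from first_gate_ms; the final gate is assumed
    to contain the entire syllable.
    """
    gates = list(range(first_gate_ms, int(total_ms), step_ms))
    gates.append(total_ms)
    return gates


# Hypothetical 300-ms syllable.
gates = gate_durations_ms(300)
```

Because every listener hears the same fixed durations at each gate, longer tones simply contribute more gates, which is why intrinsic duration differences among tones have to be taken into account when IPs are compared across tones.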

Examining tonal confusion at each gate showed the following patterns: (1) Tones 1 and 4 were mostly misidentified with each other, and (2) Tones 2 and 3 were mostly misidentified as Tone 1.29 These findings were partially consistent with the confusion patterns from previous research (Tseng, 1981; Whalen and Xu, 1992; Gottfried and Suiter,

1997; Lee et al., 2008; Lee, 2009). Lee (2009) suggested that since Tone 1 responses dominated even for the low-onset tones (Tones 2 and 3), native listeners did not show sensitivity to f0 height.

28 Wu and Shu (2003) indicated that the first 40-ms gate is too short to provide sufficient acoustic-phonetic information; therefore, the presentation started at the 80-ms gate.

29 Tone 2 was mostly misidentified as Tones 1 and 3, and Tone 1 error was only significantly larger than Tone 3 error at 160 ms. Tone 3 was mostly misidentified as Tone 1, followed by Tones 4 and 2, up to 120 ms.

Lai and Zhang (2008) used a gating task to evaluate the effect of the initial

segment on processing four Mandarin tones. Gates were generated from 32

monosyllables with matching frequencies of occurrence; the initial consonant formed the

first gate and later gates were formed in 40-ms increments. Results showed an earlier IP

for Tone 1, followed by Tone 4 and then Tones 2 and 3. Error patterns before the IPs showed that Tones 1 and 4 were often misidentified as each other. Lai and Zhang

(2008) suggested that the reason that it took longer for Tone 4 to be correctly identified than Tone 1 may be that tone duration at earlier gates was not long enough for subjects to perceive the falling contour. This is contrary to the finding that Tone 4 responses dominated when the gate was as short as 40 ms (Whalen and Xu, 1992). Similar to the finding reported in Wu and Shu (2003), Tone 1 responses dominated at earlier gates for

Tones 2 and 3. Tone 3 tokens were correctly identified at the very beginning (gate 2) and sometimes misidentified as Tone 4 in gates 4-7. Lai and Zhang (2008) surmised that listeners may have used the low register cue to correctly identify Tone 3; however, Tone

4 responses were triggered when enough duration was heard, which warranted a falling tone percept. This implied that listeners might be sensitive to f0 height to a certain degree.

Based on the tone error patterns before the IP, Lai and Zhang suggested a hierarchy of

cues at the onset of tonal identification; namely, at tonal onset the high register cue has more

perceptual weight than the contour cue, which in turn has more perceptual weight

than the low register cue. In other words, high-onset tones, regardless of contours, were not

misidentified as low-onset tones; but low-onset tones were sometimes misidentified as

high-onset tones due to their contour shapes (e.g., low falling pitch at tonal onset of Tone

3 was misidentified as Tone 4).

2.5.6. Tone Perception by Nonnative Speakers of Mandarin

The literature review here focuses on studies concerning effects of modification of

acoustic input (Gottfried and Suiter, 1997; Lee et al., 2009) and speaker variability (Lee

et al., 2009) on Mandarin tone perception by nonnative English listeners. Both studies

(Gottfried and Suiter, 1997; Lee et al., 2009) showed that non-native tone identification

was affected to a greater extent by acoustic modification of speech input than native identification.

Confusion analyses also showed that non-native listeners made more tone errors and that their error

patterns were more variable than those of native listeners. Gottfried and Suiter

(1997) reported the following confusion patterns based on five non-native listeners’

responses: Tone 2→Tone 3, Tone 4→Tone 1, Tone 3→Tone 2, and Tone 3→Tone 4.

While the first two tone confusions on the onset-only syllables were also common for

natives, the latter two confusion patterns were prevalent for only non-natives when they

identified both intact and acoustically modified tone syllables.

Similarly, Lee et al. (2009) showed a Tones 2-3 confusion for both groups of

listeners. However, interpretation of the Tones 2-3 confusion was different for the two groups

of listeners. While natives were able to classify the four tones into high-onset

and low-onset tones (Tones 1-4 vs. Tones 2-3, respectively) in the absence of f0 contour information, non-native listeners did not show consistent evidence of classifying them into low- and high-onset groups. Instead, non-natives consistently confused Tones 2 and 3 even when relatively complete acoustic information was available. This is consistent with previous studies that Tone 2 is often confused with Tone 3 (Blicher et al.,

1990; Shen and Lin, 1991; Whalen and Xu, 1992; Wang et al., 1999; Fon et al., 2004;

Lee et al., 2008). In addition, the ability to classify onset-only tones into high- versus

low-onset tones implies that listeners are capable of estimating f0 height information and

thus engaging in speaker normalization. As a result, it is not clear whether non-native

listeners engage in speaker normalization.

Although their identification accuracy is affected by reduced acoustic input, non-native

listeners are able to deal with speaker variability in perceiving Mandarin tones. For

example, Lee et al. (2009) presented 12 single-talker and multi-talker Mandarin syllables

to nonnative English listeners and found that non-native tone identification (65% vs. 62%)

was affected by speaker variability to the same extent as native performance (89% vs. 86%).

2.6. Research Outline

In light of the previous studies reviewed above, a set of specific research questions

is outlined here for the following production and perception studies.

2.6.1. Production Study

While previous acoustic studies reviewed above have documented cross-dialect

variation in Mandarin tone production by speakers of BM and TM, the generalizability of the reported

phonetic differences in tone realization is limited by small speaker samples, the subset of

acoustic parameters examined, and the qualitative as opposed to quantitative evaluation of acoustic data. Therefore, the objective of the production study is to examine the three main acoustic properties (f0, duration, and amplitude) of the four lexical tones in BM and TM

produced in two constrained phonetic contexts by numerous speakers from each region.

Of particular interest is dialectal divergence in tone production. Acoustic and

subsequent statistical analyses are performed to answer the following research questions:

(1) Are there cross-dialect differences in the acoustic properties of the four lexical

tones when produced in isolation and in a sentential context? In particular, what

is the specific nature of Tone 3 in these two dialects?

(2) Is there cross-dialectal divergence in the production of the four lexical tones in

terms of amplitude envelope and duration?

(3) Is the amplitude contour of a tone positively correlated with the f0 contour of the

same tone and with those of different tones?

2.6.2. Perception Study

One recent phonetic study has shown that native speakers are capable of identifying multiple-speaker, isolated tone fragments (with no dynamic f0 information available) and estimating f0 height beyond chance even when there is no syllable-extrinsic information available (Lee, 2009). Non-native speakers are also capable of dealing with speaker variability, although the accuracy of tone identification by non-native listeners is markedly affected by the reduced acoustic information in speech input as compared to that by native speakers (Lee et al., 2009; Gottfried and Suiter, 1997). Nevertheless, there are still a couple of limitations in these studies. First, the multiple-speaker stimuli used in Lee (2009) and Lee et al. (2009) were produced by speakers of both genders.

As the results showed, gender detection was implicated in the estimation of f0 height when natives identified isolated tone fragments of around 20-49 ms long (Lee, 2009).

As a result, it is not clear whether listeners engage in genuine speaker normalization.

Second, while other types of gating experiments have provided information as to which tonal portions in the signal contain the critical perceptual cues for native tone

identification (e.g., Whalen and Xu, 1992; Fon et al., 2004) and to what extent native and

non-native listeners cope with impoverished acoustic inputs (e.g., Gottfried and Suiter,

1997; Lee et al., 2008, 2009; Lee, 2009), the time-course of the tone recognition process in response to speaker variability and reduced acoustic input is still unknown. Third, because the acoustically modified stimuli used in the “silent-center” tone studies had different durations (i.e., low-onset tones have longer duration than high-onset tones; male stimuli are longer than female stimuli), duration differences among tones may be implicated in tone identification and gender detection. A longer stimulus contains more acoustic information, which presumably aids tone identification. Stimulus duration differences could also be useful for gender detection (ref. Lee, 2009, p. 1132).

Lastly, word frequency and syllabic structure were not systematically controlled for in the previous word gating studies (e.g., Wu and Shu, 2003; Lee et al., 2008, 2009).

The word recognition literature has reported a significant effect of word frequency on lexical access (see Grosjean, 1980). It has also been shown that syllabic structures affect tone contours (Howie, 1974; Ho, 1976; Shih, 1987; see §2.4.1). For example, Lai and Zhang (2008), using a word-gating experiment with the initial consonant forming the first gate, found that the sonorancy of the initial consonant does not necessarily trigger an earlier IP, but it contributes to accuracy of tone identification at gate 1. Furthermore, the Wu and Shu (2003) study did not control for the equal distribution of the four tones. In particular, the number of Tone 4 words was larger than that of the other three tones. While the confusion patterns did not reveal a significant influence of the unbalanced tone distribution on tonal judgment, this may have affected the results.


With regard to dialect variability, no perception study, to our knowledge, has investigated how native listeners and non-native English learners of Mandarin deal with regional dialect variability in addition to speaker variability. It remains unclear whether native listeners engage in perceptual normalization for dialect variability. Non-native English learners of Mandarin are assumed to be familiar with the BM variety since this is the variety of Mandarin they are learning in the U.S.30

Taking these considerations into account, a word-gating experiment was used in which (1) a fixed amount of acoustic information was presented to listeners at each gate, (2) syllabic structure and word frequency were systematically controlled for, and (3) stimuli were produced only by female speakers with various f0 ranges from two regional dialects. The purpose of the study is to examine the effects of speaker and dialect variability on the time-course of native and non-native tone identification when no syllable-extrinsic contextual information is available for speaker and dialect normalization. The hypothesis is that, given speaker variability and a mismatch between speaker and listener dialect, listeners require more acoustic information to correctly identify tones. Tone identification by the non-native listeners will be compromised to a larger extent.

As for speaker variability, native listeners should be able to make low-onset and high-onset distinctions relatively quickly at earlier gates. Non-native listeners are predicted to be able to deal with speaker variability and make a low-onset and high-onset distinction, but at later gates than their native counterparts.

As for dialect variability, both native and non-native listeners should perform better with the stimuli produced by speakers of the same dialect. In particular, non-native listeners are hypothesized to perform better with BM stimuli since it is the variety of Mandarin Chinese they are familiar with through textbooks and their instructors.

30 At institutions where the instructors are from Taiwan, AE learners of Mandarin could be influenced by the variety spoken by the instructor.

Based on the previous findings, it is hypothesized that, for the high-onset tone pair (Tones 1 and 4), Tone 1 should be relatively easier to recognize than the contour Tone 4. For the low-onset tone pair (Tones 2 and 3), Tone 3 is hypothesized to be identified earlier than Tone 2, given the finding that a low-onset f0 gives rise to a Tone 3 percept (e.g., Whalen and Xu, 1992; Fon et al., 2004). Nevertheless, Tone 2 could possibly be identified earlier than Tone 3 given that the former has an earlier turning point (e.g., Shen and Lin, 1991). Furthermore, cross-dialect differences in tonal realization in these two dialects, especially for citation Tone 3, could significantly affect the identification of partial and intact tone stimuli.


Chapter 3 Production Study: Methodology

This chapter describes the methodology by which a speech corpus from 19 speakers of Beijing Mandarin (BM) and Taiwan Mandarin (TM) was elicited, recorded, and acoustically analyzed. Elicitation techniques and recording of speech samples followed procedures commonly used in the speech production literature. Acoustic analyses were carried out with several custom-written Matlab programs created by the author in consultation with Dr. Robert A. Fox.

3.1. Speech Stimuli

The reading list consisted of two parts. The first included 618 Mandarin monosyllabic words and the second contained a set of short sentences. The monosyllabic reading list was constructed with reference to the word list compiled by Howie (1976) and the Inventory of Mandarin Syllables of Wang, Li, and Brotzman (1963). It comprises monosyllables representing all possible syllabic structures in Mandarin phonology: CV, V(Cnasal), and CVCnasal. According to Mandarin phonological rules, V can be a monophthong, a diphthong, or a triphthong, but any consonant in the coda must be a nasal, either /n/ or /ŋ/ (see §2.2).31 Following Howie’s (1974) classification of monosyllables according to the initial consonant, or the medial in syllables without initials (initial syllabic and non-syllabic vowels in the V syllable), stimuli were organized into the following nine groups:

Syllable type      #   Syllable starting with…
V                  1   syllabic vowels
                   2   non-syllabic vowels
VCnasal            3   non-syllabic vowel and nasal coda
CV and CVCnasal    4   voiced continuants: /m/, /n/, and /l/
                   5   voiceless fricatives: /f/, /s/, /ʂ/, and /h/
                   6   aspirated stops: /p/, /t/, and /k/
                   7   unaspirated stops: /b/, /d/, and /ɡ/
                   8   aspirated affricates: /tsʰ/, /tʂʰ/, and /tɕʰ/
                   9   unaspirated affricates: /ts/, /tʂ/, and /tɕ/

31 /n/ occurs in both syllable-initial and syllable-final positions; /ŋ/ occurs only in syllable-final position.

Most of the selected syllables can be produced with each of the four lexical tones. Specifically, 150 segmental syllables have quadruplets and 6 have triplets that are minimally distinguished by tone. The monosyllables were read in two constrained phonetic contexts: in isolation (i.e., citation tones) and in a carrier sentence (i.e., contextual tones), Qing3 shuo1 _____ zi4 (“Please say ____ word”).

The second part of the reading list included four short sentences, which participants produced after recording the first reading list. The monosyllabic words analyzed for the current study and the four short sentences are provided in Appendix A.


3.2. Participants

Since one of the purposes of the current study is to investigate a “genuine” dialect difference between BM and TM, the speakers selected had to speak a dialect representative of the regional varieties of Mandarin under investigation. However, most native speakers of Mandarin Chinese actually speak more than one Chinese language32 due to the diverse linguistic environments and social expectations in these two countries. Therefore, efforts were made to control for the subjects’ linguistic backgrounds, especially for TM speakers. Only speakers who were born in the Beijing area or Taiwan, had lived there before the age of 15, and had had no frequent or extended contact with the other Mandarin variety of interest were recruited for the study.

A total of 19 native speakers of either BM or TM were recruited for the production study. Speakers in the two groups ranged in age from 21 to 33 years. The BM group included three males and six females (N_BM = 9) and the TM group consisted of four males and six females (N_TM = 10). Eight out of nine BM speakers were born and raised in the Beijing area. They were native speakers of BM and did not speak other Chinese languages. One female speaker (S4) also speaks the Xiāng dialect.33 Speakers in the TM group came from various geographical regions in Taiwan. Seven of these speakers are bilingual speakers of Mandarin and TSM.34 Participants were screened by completing a survey on their linguistic backgrounds before the recording.

32 Different Chinese dialects such as Cantonese and Hakka are not mutually intelligible with Mandarin Chinese.

33 S4 was born and raised in Province and speaks Xiang with family and friends. However, she began studying Standard Chinese (Pǔtōnghuà) in elementary school and her tertiary education was done in Beijing. She also passed the official Standard Chinese proficiency test, the Putonghua Proficiency Test (Pǔtōnghuà Shuǐpíng Cèshì, PSC), in Mainland China.

34 Of the remaining three speakers (s11, s14, s19), two (s11, s14) do not speak TSM and were born to parents of Hakka heritage. S19 was born to TSM-speaking parents and reported minimal command of TSM.

All speakers used the respective regional variety of Mandarin Chinese as their native language and as the main language of communication in their daily life. A summary of demographic information about the speakers is presented in Table 3.1 (see Appendix B for the detailed speaker demographic information).

Dialect                  Gender   Mean Age      Total # of Speakers
Beijing Mandarin (BM)    Male     23.3 (0.6)    3
                         Female   25.0 (3.8)    6
Taiwan Mandarin (TM)     Male     30.0 (2.0)    4
                         Female   28.7 (3.0)    6

Table 3.1. Summary statistics of two groups of speakers.

All speakers were graduate students recruited from The Ohio State University community. All were literate in Chinese and able to read the words and short sentences written in either Traditional or Simplified Mandarin Chinese characters (with respective phonetic notations) and had no known speech-language disorder or any sensory difficulty.

They were compensated $10 for their participation.

3.3. Procedure

The recording of speech tokens was done while speakers were seated in a sound-attenuating booth in the Speech Perception and Acoustics Laboratories (SPA Labs) in the Department of Speech and Hearing Science at The Ohio State University. Speech samples were recorded directly onto a hard drive using a head-mounted microphone connected through a preamplifier (Mackie 1202-VLZ3) and A/D converter to a Windows PC, using a 22-kHz anti-aliasing filter at a sampling rate of 44.1 kHz. The head-mounted microphone (Shure SM10A) was positioned about 1.5 in. from the speaker’s lips. All talkers produced a short set of sample tokens or sentences (different from the test stimuli) prior to the recording of the test stimuli to make sure that the equipment and the recording program (Adobe Audition 1.0) were running properly and that the speaker was familiar and comfortable with the task and experimental setup. All speakers were recorded using the same equipment and recording procedures.

Speech materials (monosyllables and short sentences) were printed in both simplified and traditional Chinese characters, with phonetic notations in pinyin and zhuyin, respectively, and were arranged in several blocks. Blocks of words were printed in the order of the nine syllable types, and the presentation order of blocks was randomized for each subject. The speakers first produced the monosyllables with the four lexical tones in isolation, then the monosyllables in the carrier sentence, followed by the short sentences. Subjects were instructed to read the syllables as naturally as possible at a steady speaking rate, and not to insert a pause before the target word in the carrier sentence or in the middle of a sentence. If a token was mispronounced in the judgment of the experimenter (a native speaker of Taiwan Mandarin), the subject was asked to repeat that word or sentence until a satisfactory elicitation was obtained. The recording session lasted around 65 minutes, with breaks provided so that speakers could drink water to reduce the possibility of laryngealization (which can occur after long periods of talking; McKenna, 1996). A total of 23484 monosyllabic words (618 words × 19 speakers × 2 contexts) and 380 short sentences (4 sentences × 19 speakers × 5 repetitions) were obtained.
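The token totals reported here follow directly from the design figures given in Sections 3.1 and 3.3. As a minimal arithmetic check (a Python sketch; the variable names are illustrative and are not taken from the study's own scripts):

```python
# Design figures as stated in Sections 3.1 and 3.3.
n_quadruplets, n_triplets = 150, 6                 # syllables with 4 or 3 tonal variants
n_words = n_quadruplets * 4 + n_triplets * 3       # size of the monosyllabic reading list
n_speakers, n_contexts = 19, 2                     # isolation and carrier sentence
n_sentences, n_repetitions = 4, 5

total_words = n_words * n_speakers * n_contexts
total_sentences = n_sentences * n_speakers * n_repetitions

print(n_words, total_words, total_sentences)
```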

3.4. Acoustic analysis

Because of the very large number of speech tokens recorded and the time needed to complete acoustic analysis on the full set, only a subset of the stimuli from each speaker was analyzed to explore tonal characteristics. The stimulus set analyzed consisted of 13 syllables, each produced with the four lexical tones both in isolation and in the carrier sentence. This set of 13 tone quadruplets included V syllables and CV syllables with initial nasal, fricative, and stop consonants, and excluded words that had a nasal coda. The monosyllables selected for the acoustic analysis are listed below, organized according to Howie’s (1974) groupings of monosyllables:

Syllable type   #   Syllable starting with…
V               1   syllabic vowels: yi /i/, wu /u/ 35
CV              4   voiced continuants: ma /ma/
                5   voiceless fricatives: hu /xu/
                6   aspirated stops: pi /pʰi/, tu /tʰu/
                7   unaspirated stops: bao /pau/, bi /pi/, bo /po/, da /ta/, du /tu/, ge /kɤ/, guo /kuo/

35 Phonetically, these words are produced with a homorganic glide such as [ji] and [wu].

While it would be meaningful to include words of all possible syllabic

combinations to be more representative of speech in daily life, it is desirable and

important to experimentally and carefully control for the segmental structure so

systematic comparisons can be made. The criteria for selecting this set of

monosyllables from certain syllable types were as follows.

First, given the differential effects of consonant voicing, aspiration, and syllabic structure on the f0 of the following vowel (see §2.), words from syllable types 1, 4, 6, and 7 were selected. One syllable with an initial fricative was also included in the f0 analysis since Howie (1974) suggested that syllable type 5 shows the characteristic shapes of Mandarin tones. Syllable types 8 and 9, which involve affricates, were excluded because they were expected to show tonal patterns similar to those with initial stops. Furthermore, syllable types 2 and 3 and CVCnasal syllables were excluded on the basis of findings that initial non-syllabic vowels and nasal codas play an insignificant role in tone perception (Lin M.-C., 1995).

Second, the selection of stimuli took into account the gating tasks to be employed in the tone perception experiments. In order to make the duration of the first gate of the stimuli (i.e., the consonant plus the first 30 ms of the vowel) more comparable to the rest, CV syllables that contain a short stop burst were chosen as test stimuli for the perception experiments to be discussed in Chapter 5. Therefore, more words from syllable type 7 were included in the acoustic analysis so that they could be used in the follow-up perception experiments.

Third, the selected monosyllables include vowels differing in height, backness, and rounding. The Mandarin vowel chart is presented in Table 3.2. Specifically, the speech tokens contain the high front unrounded vowel /i/ and the high back rounded vowel /u/, the mid back unrounded and rounded vowels /ɤ/ and /o/ (allophones of the mid vowel /ə/, depending on the phonetic environment) (Duanmu, 2005; Lin, Y-H., 2008), the low vowel /a/, and two diphthongs, /au/ and /uo/. This was done because Ho (1976) and Zhi and Zhang (1987) showed that Mandarin vowel height also affects, though to a lesser extent, the f0 values associated with the vowels themselves (i.e., intrinsic pitch or f0). With these considerations regarding segmental environments in mind, the selected speech samples should be representative of the tone contrasts in other phonetic environments.

        front        front      central   back         back
        unrounded    rounded              unrounded    rounded
high    i            y                                 u
mid     e                       ə         ɤ            o
low     a                       ac        ɑ

Table 3.2. The surface vowels in Standard Mandarin. [i, y, u, ə, ac] are the five phonemic vowels; ac denotes a central [a] (adapted from Lin, Y-H., 2008, Table 2b).

Acoustic analyses were performed on the syllable nucleus of each target word since it is the tone-bearing unit. Three acoustic properties were examined to characterize each tone category, namely (1) fundamental frequency (f0), (2) amplitude, and (3) duration. We chose to examine these three measures because f0 is the primary acoustic correlate of Mandarin tones (e.g., Howie, 1974; Zee, 1978; Tseng, 1990), while amplitude (e.g., Whalen and Xu, 1992; Fu et al., 1998; Fu and Zeng, 2000; Kuo et al., 2008) and duration (e.g., Blicher et al., 1990; Whalen and Xu, 1992; Liu and Samuel, 2004) usually serve as secondary cues to tonal identity when explicit f0 information is not available for tone identification.

A total of 1960 words, instead of 1976 (13 syllables × 4 tones × 2 contexts × 19 speakers), were analyzed since two sets of isolated and contextual tone quadruplets (syllables da /ta/ and guo /kuo/, a total of 16 tokens) were excluded from s5’s data set due to recording errors. Because of the imbalance in the number of male subjects between the two regional dialect groups and the expected gender differences in voice pitch, analyses and results are presented separately by speaker gender.

3.4.1. Extraction of f0 contours

All of the stimuli were downsampled from 44.10 kHz to a sampling rate of 22.05 kHz before they were submitted for f0 extraction. f0 contours were obtained via a custom f0 tracking program written by Xu (2009) in Praat 5.1.01 (Boersma and Weenink, 2009).36 The program reported f0 values calculated from period-by-period glottal pulse markings. The calculation range in the pitch settings was set to 30 Hz as the minimum f0 (Minf0) and 400 Hz as the maximum f0 (Maxf0). When analyzing a sound in Praat, two panels are displayed: (1) a waveform with the automatic vocal pulse marks and (2) a spectrogram with optional pitch and formant tracks generated by Praat. The first display allowed for interactive hand editing of spurious glottal cycle labels, which result in doubling or halving of the actual f0 (e.g., Ladefoged, 2003). Speech tokens that could not be tracked with this program were analyzed using WaveSurfer 1.8.5 (Sjölander and Beskow, 2005) with comparable pitch settings.37

36 The latest version of Praat can be downloaded from http://www.fon.hum.uva.nl/praat/download_win.html.

Vowel onsets and offsets were manually labeled by consulting both the waveform and the spectrogram displays. Vowel onset was defined as the onset of the first full-cycle glottal period at a zero crossing on the waveform. Vowel offset was determined as the end of periodic energy, and the label was placed at the beginning of the last full-cycle glottal period. Glottal pulse marks were placed on the onset of each glottal cycle within the vowel duration on the waveform. Placement of glottal pulse labels was made with reference to the automatically generated f0 track provided by Praat in the spectrogram display. Vocal pulse markings always included the onsets and offsets of the syllabic nuclei.
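To illustrate how f0 values can be derived from period-by-period pulse marks, the following is a minimal Python sketch. It is not the Praat script of Xu (2009): the function names and the example pulse times are hypothetical, and it assumes that each f0 value is simply the reciprocal of the interval between two consecutive marks and that values outside the 30-400 Hz analysis range are flagged for hand checking (e.g., possible doubling or halving).

```python
def f0_from_pulses(pulse_times_s):
    """Convert glottal pulse times (s) into (time, f0 in Hz) pairs, one per period."""
    contour = []
    for start, end in zip(pulse_times_s, pulse_times_s[1:]):
        period = end - start
        if period <= 0:
            raise ValueError("pulse marks must be strictly increasing")
        contour.append((start, 1.0 / period))
    return contour


def out_of_range(contour, min_f0=30.0, max_f0=400.0):
    """Return the points whose f0 lies outside the analysis range (assumed check)."""
    return [(t, f0) for t, f0 in contour if not (min_f0 <= f0 <= max_f0)]


# Hypothetical pulse marks: two 5-ms periods, then one mislabeled 2-ms period.
marks = [0.000, 0.005, 0.010, 0.012]
contour = f0_from_pulses(marks)
suspect = out_of_range(contour)
```

In this made-up example the first two periods correspond to about 200 Hz, while the last one corresponds to about 500 Hz and would be flagged for manual correction.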

For syllables with a voiced initial, such as the syllable ma /ma/, the nasal-vowel boundary was relatively easy to determine in the waveform, “due to the abrupt shift in the waveform shape and amplitude level commonly seen at the boundary” (Xu, 1997, p. 64). The decision to exclude the nasal from f0 measurements was based on the finding that the fundamental frequency during the initial voiced consonant or non-syllabic vowel is merely the pre-onset of the f0 curve (Lin, M.-C., 1965) or the anticipatory adjustment of the voice pitch (Howie, 1976, p. 146). Examples of f0 extraction applied to CV syllables starting with a voiceless fricative and a voiced nasal by using the Praat tracking program (Xu, 2009) are illustrated in Figures 3.1 and 3.2, respectively.

37 Among the f0 contours analyzed, only one token produced by a female speaker (s14) was analyzed via the WaveSurfer program 1.8.5.

Figure 3.1. f0 tracking of the syllable ge2 /kɤ/ produced by s18. The top panel (1) shows the waveform with automatic vocal pulse marks that can be manually modified, and the bottom panel (2) shows the spectrogram with optional pitch and formant tracks generated by Praat. The onset and offset of the vowel /ɤ/ are marked at the bottom.

Figure 3.2. f0 tracking of the syllable ma2 /ma/ produced by s18.


3.4.2. Measurement of vowel duration

Vowel duration was measured as the temporal distance between the first and last glottal pulse marks (i.e., vowel onset and offset, respectively) on the waveform display in Praat, as seen in Figures 3.1 and 3.2. Durations were specified to the nearest tenth of a millisecond.

3.4.3. Calculation of rms amplitude

The amplitude curve for each tone was obtained by calculating a set of root mean square (rms) values (i.e., quadratic means) from the vowel onset to the vowel offset. The series of rms values was calculated from a 40-ms rectangular window moving in 5-ms steps along the vocalic portion, using a custom-written Matlab program. The program reports and plots the rms values as a function of time. The number of rms values obtained for each token depends on the duration of the vowel.
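The sliding-window rms computation can be sketched as follows. This is a Python illustration rather than the Matlab program used in the study; the function name is hypothetical, and two details are assumptions: the hop size is rounded to a whole number of samples, and only windows that fit entirely within the vowel are evaluated.

```python
import math


def rms_contour(samples, fs, win_ms=40.0, step_ms=5.0):
    """Root-mean-square amplitude in a rectangular window slid along the signal."""
    win = int(round(fs * win_ms / 1000.0))    # window length in samples
    step = int(round(fs * step_ms / 1000.0))  # hop size in samples (rounded)
    values = []
    # Assumption: partial windows at the end of the vowel are skipped.
    for start in range(0, len(samples) - win + 1, step):
        frame = samples[start:start + win]
        values.append(math.sqrt(sum(x * x for x in frame) / win))
    return values


# Synthetic check: a 200-Hz sine of unit amplitude, 100 ms long at 22.05 kHz.
fs = 22050
tone = [math.sin(2 * math.pi * 200 * n / fs) for n in range(2205)]
rms = rms_contour(tone, fs)
```

For a steady sine of unit amplitude, every window should return a value close to 0.707, and longer vowels yield more rms values, as noted in the text.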

3.4.4. Exclusion of outliers

By visual inspection, f0 contours that deviated from the rest of the syllables of a tone contrast were excluded from the acoustic analysis. Tone 2 of the syllable guo /kuo/ produced by s14 in the sentence context was therefore excluded. While the shapes of two contextual Tone 2 contours produced by s17 (syllables bo /po/ and da /ta/) were different from what was expected, they were not excluded since they served as speech stimuli in the perception experiments. After ruling out the outlier and the missing data from s5, 98.6% of the recorded tokens (1231 out of 1248 words = 13 syllables × 4 tones × 2 contexts × 12 female speakers) were included in the acoustic and statistical analyses.

3.4.5. Time Normalization

Since there is between- and within-talker variation in vowel duration, as well as syllable duration differences among tones, time normalization (normalization to the syllable’s duration) is required to make f0 and rms amplitude contours readily comparable across tonal contrasts and speakers and to allow computation of average tonal curves. Specifically, normalization was performed by first aligning the f0 onset to time 0. Second, each temporal point (in ms) at which the fundamental frequency of a voiced segment was measured was converted to a percentage of the segment’s duration, i.e., percentage time. Note that percentage time was calculated with respect to the vowel duration within which f0 values were measured. For example, the first two and the last f0 values of ge2 (gé) by s18 were measured at time points of 35, 39, and 426 ms, respectively. They were first aligned to 0, 4, and 391 ms on the time scale. They were then converted to percentage time by dividing each temporal point by the total duration within which f0 values were measured (i.e., 391 ms in this case) and multiplying the result by 100. Finally, they were represented in percentage time as 0%, 1%, and 100%, respectively.
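The conversion in this worked example can be reproduced with a few lines of Python. This is a sketch for illustration only; the function name is hypothetical and not taken from the study's Matlab programs.

```python
def to_percentage_time(times_ms):
    """Align the first measurement to 0 and express each time as % of the measured span."""
    onset = times_ms[0]
    span = times_ms[-1] - onset
    if span <= 0:
        raise ValueError("the last time point must follow the first")
    return [100.0 * (t - onset) / span for t in times_ms]


# Time points of the first two and the last f0 values of ge2 by s18.
pct = to_percentage_time([35, 39, 426])
print([round(p) for p in pct])  # [0, 1, 100]
```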

Third, the percentage-time values and the corresponding f0 values of each token were transformed to a 1000-point series using a custom-written Matlab program in which a linear interpolation algorithm was applied. This normalization technique was applied to both f0 and rms amplitude contours. As a result, the f0 and rms amplitude contours of each tone were plotted against percentage time starting from 0%.
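The resampling step can be sketched in Python as follows. The study's Matlab program is not reproduced here; the function name is hypothetical, and the sketch assumes that the 1000 points are equally spaced from 0% to 100% inclusive and that the input runs from 0% to 100%.

```python
def resample_linear(pct, values, n_points=1000):
    """Linearly interpolate (pct, values) onto n_points equally spaced from 0% to 100%."""
    if len(pct) != len(values) or len(pct) < 2:
        raise ValueError("need at least two matching (time, value) pairs")
    if any(b <= a for a, b in zip(pct, pct[1:])):
        raise ValueError("percentage times must be strictly increasing")
    grid = [100.0 * i / (n_points - 1) for i in range(n_points)]
    out, j = [], 0
    for g in grid:
        # Advance to the segment [pct[j], pct[j + 1]] that contains g.
        while j < len(pct) - 2 and pct[j + 1] < g:
            j += 1
        x0, x1 = pct[j], pct[j + 1]
        w = (g - x0) / (x1 - x0)
        out.append(values[j] + w * (values[j + 1] - values[j]))
    return grid, out


# Hypothetical dipping contour measured at 0%, 50%, and 100% of the vowel.
grid, f0_1000 = resample_linear([0.0, 50.0, 100.0], [200.0, 100.0, 200.0])
```

Because every token ends up with the same number of points on the same percentage-time grid, contours of different durations can be averaged point by point.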


3.4.6. f0, rms, and duration measurements

Eleven measurements were taken from the pre-normalized and normalized f0 contours for quantitative analyses. Six of them were obtained from the pre-normalized f0 data: maximum f0, minimum f0, f0 range, mean f0, f0 offset, and vowel duration. Maximum and minimum f0 were the greatest and lowest f0 values along a contour, respectively. F0 range was defined as the difference between the maximum and minimum f0 values. Mean f0 was defined as the average of all f0 values along the contour. F0 offset was defined as the f0 value measured at the syllable offset. For each individual normalized f0 curve, five f0 values were obtained at the following temporal points in each contour: 5%, 25%, 50%, 75%, and 95%, i.e., the beginning, one quarter, one half, three quarters, and the end of the contour, respectively. Similarly, five rms amplitude measurements were taken at the 5%, 25%, 50%, 75%, and 95% temporal points. These values provide dynamic information about the f0 and rms amplitude contours. The durations of three segments were measured: the onset duration for the initial consonant, the vowel duration, and the syllable duration.
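The summary measures defined in this section can be sketched in Python as follows. The function names and the example values are hypothetical; in particular, picking the nearest point of the normalized contour for each percentage is an assumption about how the five temporal points were read off.

```python
def f0_summary(raw_f0):
    """Static measures taken from a pre-normalized f0 contour (values in Hz)."""
    return {
        "max": max(raw_f0),
        "min": min(raw_f0),
        "range": max(raw_f0) - min(raw_f0),
        "mean": sum(raw_f0) / len(raw_f0),
        "offset": raw_f0[-1],  # f0 value at the end of the contour
    }


def sample_at(contour, percents=(5, 25, 50, 75, 95)):
    """Read a normalized contour at the given percentage-time points (nearest index)."""
    n = len(contour)
    return {p: contour[round(p / 100 * (n - 1))] for p in percents}


# Hypothetical data: a short raw contour and a 101-point normalized contour.
summary = f0_summary([200.0, 180.0, 150.0, 170.0])
points = sample_at(list(range(101)))
```

The same `sample_at` helper applies unchanged to a normalized rms amplitude contour.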

3.5. Statistical Analyses

The three basic acoustic measures and their related measurements for the four lexical tones were analyzed for each speaker and then pooled according to gender, dialect, and context prior to the subsequent statistical analyses. The research questions were addressed through the following statistics:

(1) Descriptive statistics of the various acoustic measures were used to illustrate the acoustic properties of the four lexical tones in Mandarin.

(2) Overall three-way mixed-design analyses of variance (ANOVAs) were conducted on the three acoustic measures and their related measurements to investigate whether these acoustic measures of each tone in different phonetic contexts differ significantly between the two dialects. To further explore the significant effect of dialect on each measure, two-way repeated-measures ANOVAs were used. Greenhouse-Geisser corrected degrees of freedom were used to compute F-ratios to avoid inflated Type I error rates when the assumption of sphericity was violated (p < .05). Partial eta squared (η²) values were reported for significant main effects and interactions as a measure of effect size. Moreover, post hoc analyses were carried out using either additional ANOVAs on selected subsets of the data or Bonferroni-adjusted t-tests for pair-wise mean comparisons.
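As an illustration of the Bonferroni adjustment used for the post hoc comparisons, the sketch below multiplies each p-value by the number of comparisons in the family and compares the result with the family-wise alpha. The function name and the p-values are invented for the example and are not results from this study; the statistical software used for the actual analyses is not specified in this section.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p, significant?) for each p-value in one family of comparisons."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    return [(p_adj, p_adj < alpha) for p_adj in adjusted]


# Hypothetical p-values from three pair-wise comparisons.
results = bonferroni([0.004, 0.020, 0.300])
for p_adj, significant in results:
    print(f"adjusted p = {p_adj:.3f}, significant: {significant}")
```

Comparing all pairs of the four tones would give a family of six comparisons, so each unadjusted p-value would need to fall below .05 / 6 to remain significant.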


Chapter 4 Production Study: Results and Discussion

This chapter presents the results of the acoustic analyses in the following order: (§4.1) f0 contours, (§4.2) vowel duration, and (§4.3) rms amplitude contours, followed by a general discussion of the data (§4.4). Due to the small number of male speakers recruited from each dialect region (i.e., 3 and 4 males from Beijing and Taiwan, respectively), the analyses and discussion focus on the acoustic data obtained from the 12 female speakers. Individual and mean f0 contours from each speaker are presented first, followed by the group-averaged f0 curves and the corresponding statistical analyses on f0 contours. The same organization is used when reporting results for the rms amplitude contours.

Any main or interaction effects found to be significant at the .05 level or less are reported, unless otherwise specified. Bonferroni-adjusted t-tests with a family-wise alpha level of .05 (αFWE = .05) were used as post hoc analyses for pair-wise mean comparisons. The significance level for the independent-samples t-tests was set at α = .05.

In the discussion section, results from acoustic analyses are presented to answer

the following research questions:


(1) Are there cross-dialect differences in the acoustic properties of the four lexical

tones when produced in isolation and in a sentential context? In particular, what

is the specific nature of Tone 3 in these two dialects?

(2) Is there cross-dialectal divergence in the production of the four lexical tones in

terms of amplitude envelope and duration?

(3) Is the amplitude contour of a tone positively correlated with the f0 contour of the same tone and with the f0 contours of different tones?
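Question (3) concerns the similarity in shape between two time-normalized contours. One simple way to quantify it, sketched below in Python, is the Pearson correlation coefficient computed point by point at zero lag. This is an assumption made for illustration: the exact correlation procedure used in the study is not described in this section, and the function name and contour values are invented.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long contours."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("contours must have the same length (at least 2 points)")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical five-point contours (f0 in Hz, amplitude in dB).
falling_f0 = [270.0, 240.0, 210.0, 180.0, 150.0]
dipping_f0 = [200.0, 150.0, 120.0, 150.0, 200.0]
falling_amp = [70.0, 66.0, 62.0, 58.0, 54.0]

r_same_shape = pearson_r(falling_amp, falling_f0)
r_diff_shape = pearson_r(falling_amp, dipping_f0)
```

In this made-up example, a falling amplitude contour correlates strongly with a falling f0 contour and not at all with a symmetric dipping one.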

4.1. f0 contours

4.1.1. Individual f0 contours and mean f0 contours

Figures 4.1-4.4 show the individual normalized f0 contours of all 13 syllables produced in isolation and in a carrier sentence by the female speakers as a function of tone and speaker dialect. Both isolated and contextual tone contours are plotted against percentage time normalized to their duration and aligned to 0%. The boldfaced red line denotes the mean f0 contour of each tone, averaged over the 13 syllables.

There was both between- and within-speaker variation in the isolated and contextual f0 contours produced within each speaker group. The former can be attributed to individual differences in voice pitch. The latter may arise from potential effects of the initial consonant (Howie, 1974; Ho, 1976; Xu and Xu, 2003), syllabic structure (Howie, 1974; Ho, 1976), and the intrinsic pitch of the vowels (Ho, 1976; Zhi and Zhang, 1987) on f0 contours. Intra-speaker variation also appeared to be more prominent for Tone 3 produced in isolation by BM speakers.

Despite minor intra-speaker variations in the shape of individual normalized f0

curves within one tone category, f0 contours obtained from 13 words of various syllable

types follow almost identical patterns. Therefore, they can be collapsed together to

form a basic f0 contour. The basic f0 contour characteristic of each tone is shown as the mean f0 contour for each speaker (boldfaced in Figures 4.1-4.4).

Since the mean f0 contours of each tone type were quite consistent across speakers

within each dialect group, the normalized group-averaged f0 contour of each tone was

characterized by averaging across all six female speakers from each dialect region.

Cross-dialectal differences in tonal production will be discussed in terms of the “basic f0 contours” shown in the group-averaged f0 contours in the next section.

Figure 4.1. The f0 contours of 13 syllables produced in isolation by BM speakers (panels: S4, S5, S8, S9, S10, and S17), arranged by tone. Tone contours were plotted against percentage time normalized to their duration and aligned to 0%. The boldfaced red line denotes the mean f0 contour of each tone, averaged over 13 contours.

Figure 4.2. The f0 contours of 13 syllables produced in isolation by TM speakers (panels: S2, S3, S14, S15, S18, and S19), arranged by tone. Tone contours were plotted against percentage time and aligned to 0%.

Figure 4.3. The f0 contours of 13 syllables produced in context by BM speakers (panels: S4, S5, S8, S9, S10, and S17), arranged by tone. Tone contours were plotted against percentage time and aligned to 0%.

Figure 4.4. The f0 contours of 13 syllables produced in context by TM speakers (panels: S2, S3, S14, S15, S18, and S19), arranged by tone. Tone contours were plotted against percentage time and aligned to 0%.

4.1.2. Group-Averaged f0 contours

Figures 4.5 and 4.6 show the normalized group-averaged f0 contours of the four Mandarin tones produced in isolation (N_BM,Isolation = 304 and N_TM,Isolation = 312) and in context (N_BM,Context = 304 and N_TM,Context = 311). In the isolated productions of the BM group, Tone 1 occupies the upper half of the frequency range, with an onset f0 of around 264 Hz that stays level throughout the syllable. Tone 2 starts with a lower f0 of around 203 Hz and falls slightly before rising (from about 40% into the vowel) through the rest of the syllable to a frequency level as high as Tone 1. Similarly, Tone 3 starts with a low f0 (around 199 Hz), somewhat lower than the f0 onset of Tone 2, falls to the lowest f0 value among the four tones, around 117 Hz at the midpoint of the vowel (50% into the vowel), and then rises toward the end of the syllable. Tone 4 starts at around 274 Hz, about 10 Hz higher than Tone 1, falls gradually during the first 20% (one fifth) of the vowel, and then falls sharply for the rest of the syllable to 156 Hz.

The shapes of the normalized group-averaged f0 contours of the four tones produced in isolation by the BM group are fairly consistent with the canonical forms of the four Mandarin tones produced in isolation or in utterance-final position by Beijing speakers, as reported in the literature (e.g., Chao, 1948; Howie, 1974; Ho, 1976; Xu, 1997). The f0 contours of Tones 1-4 are characterized as high-level, mid-rising, low-falling-rising, and high-falling, respectively.

Figure 4.5. Normalized group-averaged f0 contours of four Mandarin tones produced in isolation by the BM group (on the left) and the TM group (on the right).

Figure 4.6. Normalized group-averaged f0 contours of four Mandarin tones produced in context by the BM group (on the left) and the TM group (on the right).

The shapes of the group-averaged contours of Tones 1, 2, and 4 produced by TM speakers are, on average, similar to those produced by their BM counterparts. Tone 1 has a level f0 contour centering around 247 Hz. Tone 2 starts around 197 Hz, falls slightly, and then rises from around 60% into the vowel. The magnitude of the rising portion is not as large as in the BM counterpart. Tone 4 starts around 261 Hz, which is higher than the onset of Tone 1, and then falls to 160 Hz at the end of the syllable. As in the BM data, Tone 4 does not fall to an f0 value as low as the minimum in Tone 3. The cross-dialectal differences evident in these three isolated tone contours may be considered relative changes in f0 height (in absolute frequency), which likely reflect speaker variation.

However, the f0 contours of the isolated Tone 3 show noticeable differences

between two dialects. While Tone 3 exhibits the canonical form of low-falling-rising

(i.e., dipping) pattern in the BM group, it shows a low-falling pattern throughout the

syllable for the TM speakers, which deviates from the traditional description. There is no rising component in the second half of the syllable in the production of isolated Tone 3

by TM speakers. In particular, Tone 3 starts with an onset f0 (around 188 Hz) lower than that of Tone 2 and falls throughout the syllable to the minimum f0 of the four tones

(around 123 Hz). This tonal pattern is similar to the “half third tone” in the non-prepausal position. The observed low-falling shape is consistent with findings on Tone 3 in Taiwan Mandarin (Jeng et al., 2005; Li et al., 2006; Kuo et al., 2008; Sanders, 2008). As a result, TM Tone 3 and Tone 4 share an almost identical f0 contour and differ only in pitch height, with the former occupying the lower register region and the latter the higher register region. Differentiation between TM Tone 3 and Tone 4 thus becomes a low-high register contrast, as previously suggested by Chiung (1999) and Sanders (2008b).
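The dipping versus falling distinction described above can be illustrated with a small sketch. The Python snippet below is not part of the reported analysis: it labels a five-point contour (values copied from Table 4.1) by comparing the drop from the onset to the minimum with the recovery from the minimum to the offset. The function name `contour_shape` and the 10 Hz threshold `min_rise_hz` are assumptions introduced only for this example.

```python
# Mean f0 (Hz) of isolated Tone 3 at 5%, 25%, 50%, 75%, 95% of the vowel,
# copied from Table 4.1(a).
BM_TONE3 = [192, 154, 118, 152, 176]
TM_TONE3 = [185, 175, 159, 143, 126]


def contour_shape(f0, min_rise_hz=10.0):
    """Label a sampled f0 contour as 'dipping', 'falling', 'rising', or 'level'.

    min_rise_hz is a hypothetical threshold, not a value from the study.
    """
    lowest = min(f0)
    fall = f0[0] - lowest   # drop from the onset to the minimum
    rise = f0[-1] - lowest  # recovery from the minimum to the offset
    if fall >= min_rise_hz and rise >= min_rise_hz:
        return "dipping"    # minimum lies inside the contour
    if fall >= min_rise_hz:
        return "falling"
    if rise >= min_rise_hz:
        return "rising"
    return "level"


print(contour_shape(BM_TONE3))  # dipping
print(contour_shape(TM_TONE3))  # falling
```

With a different threshold the labels for shallow contours could change, so the rule should be read as a descriptive aid rather than a classification procedure.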

The acoustic properties of the monosyllables produced in a sentential context will

be perturbed by factors such as global prosodic environments and neighboring tonal

context. The onset and offset f0 values of adjacent tones will lead to local contextual f0

variations due to assimilation or anticipation (Xu, 1997). As can be seen in Figure 4.6,

there are no obvious cross-dialect differences in the group-averaged f0 curves of tones

produced in context, except for minor fluctuations in absolute value on the frequency

scale.

In general, coarticulated Tone 1 has a relatively flat f0 curve located in the higher frequency region. Tone 2 also has a rising contour, although the rising portion of the f0 contour is not as prominent as that produced in isolation, especially for Tone 2 in TM. This is expected since the H target is deleted when followed by another H target (either Tone 1 or 4) according to one of the coarticulation rules proposed by Shih (1987, 1988). The most conspicuous local contextual change occurs in Tone 3 produced by the BM group, in which the final rising portion observed in the isolated T3 contour is absent. The contour becomes low-falling, which is the “half third tone” expected under the “half T3 Sandhi” rule. Similarly, contextual Tone 3 produced by the TM group also shows a low-falling pattern. Tone 4 in both dialects has a high-falling contour and falls to an f0 value that is higher than the f0 minimum in contextual Tone 3. This is consistent with the “half fall” of Tone 4 suggested by Shen (1990a, b) and Shih (1988).

In summary, a set of distinctive “basic contours” exists for the four lexical tones

in these two dialects of Mandarin. For isolated tones, the most notable cross-dialectal difference in tonal realization lies in Tone 3, which has a low-dipping shape in BM but a low-falling shape in TM. The magnitude of the rising segment in isolated TM Tone 2 is not as large as that in its BM counterpart.

Isolated Tones 1 and 4 in these two regional dialects have comparable tonal contours, despite minor differences in f0 register (absolute frequency values), especially in Tone 1.

Isolated TM Tone 4 does not fall all the way to the minimum f0 as reported in the

previous acoustic studies (Fong and Chiang, 1999; Chiung, 1999 and 2003; Li et al., 2006;

Sanders, 2008a, b). This tonal pattern may be used to contrast with the low-falling Tone

3 in terms of register (e.g., Sanders, 2008b). Similarly, the final tonal target in isolated

BM Tone 4 does not drop to a minimum f0, which is in contrast to the previous findings

in which the citation Tone 4 in BM was reported as falling throughout the syllable (e.g.,

Li et al., 2006; Deng et al., 2006).

For tones in context, the f0 contours are similar to what is expected from the sandhi process and coarticulation rules discussed in Chapter 2. Since the primary purpose of the study is to examine potential dialectal differences in tonal production by speakers of two regional varieties, the following statistical analyses will focus on potential cross-dialectal differences in producing tones in context, instead of examining differences in f0 contours between citation tones and tones in context. By visual inspection, there were no overt cross-dialectal discrepancies in the general shape of contextual tones.

These findings on isolated and contextual tones will be further explored through statistical analyses of several f0 measurements defined in Chapter 3. Table 4.1 reports the descriptive statistics for ten f0 parameters: f0 values measured at five temporal locations in the vowel (5%, 25%, 50%, 75%, and 95%), maximum f0, minimum f0, f0 range, mean f0, and f0 offset. A set of analyses of variance (ANOVAs) was conducted to examine the effects of dynamic f0 information obtained at five locations in the vowel, context, and dialect on each tone. Results are presented in Section 4.1.3.
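As a rough sketch of how the ten parameters listed above might be derived from a single f0 track, consider the following Python example. The measurement procedure actually used is the one defined in Chapter 3; here, equally spaced f0 samples, linear interpolation between samples, an arithmetic mean over samples, and "offset" taken as the final sample are all assumptions made for illustration, as are the function names.

```python
def f0_at(track, proportion):
    """Linearly interpolate f0 at a proportion (0-1) of the vowel duration.

    track: f0 values (Hz) sampled at equal time steps across the vowel.
    """
    position = proportion * (len(track) - 1)
    lower = int(position)
    upper = min(lower + 1, len(track) - 1)
    weight = position - lower
    return track[lower] * (1 - weight) + track[upper] * weight


def f0_parameters(track):
    """Return the ten summary parameters named in the text for one vowel."""
    params = {f"{p}%": f0_at(track, p / 100) for p in (5, 25, 50, 75, 95)}
    params["max"] = max(track)
    params["min"] = min(track)
    params["range"] = params["max"] - params["min"]
    params["mean"] = sum(track) / len(track)
    params["offset"] = track[-1]  # assumption: offset = final f0 sample
    return params


# Hypothetical falling contour, 21 equally spaced samples from 270 Hz to 170 Hz.
demo = [270 - 5 * i for i in range(21)]
print(f0_parameters(demo))
```

Averaging such per-token parameters within each tone, context, and dialect would give a table with the same layout as Table 4.1.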


(a) Isolation
                 5%     25%    50%    75%    95%    Max.   Min.   Range  Mean   Offset
BM   Tone 1      264    264    263    262    264    272    256    16     263    266
                 (38)   (37)   (35)   (34)   (34)   (37)   (34)   (9)    (35)   (34)
     Tone 2      200    194    201    227    254    262    192    70     215    261
                 (23)   (19)   (20)   (29)   (37)   (39)   (20)   (32)   (24)   (39)
     Tone 3      192    154    118    152    176    206    103    103    163    183
                 (27)   (32)   (34)   (39)   (44)   (32)   (34)   (25)   (23)   (45)
     Tone 4      272    265    233    188    162    276    156    120    233    157
                 (48)   (46)   (32)   (25)   (30)   (50)   (32)   (56)   (33)   (31)
TM   Tone 1      247    247    247    247    250    255    243    12     248    252
                 (35)   (37)   (37)   (36)   (36)   (36)   (36)   (5)    (36)   (36)
     Tone 2      193    186    188    196    217    227    185    42     195    226
                 (24)   (24)   (24)   (25)   (29)   (31)   (25)   (13)   (25)   (31)
     Tone 3      185    175    159    143    126    188    120    68     161    123
                 (22)   (22)   (22)   (26)   (30)   (23)   (33)   (33)   (20)   (35)
     Tone 4      259    248    221    188    163    262    159    103    222    160
                 (34)   (35)   (36)   (34)   (34)   (35)   (33)   (21)   (33)   (33)

(b) Context
                 5%     25%    50%    75%    95%    Max.   Min.   Range  Mean   Offset
BM   Tone 1      276    276    275    275    278    284    270    13     276    279
                 (36)   (36)   (35)   (34)   (34)   (37)   (35)   (11)   (35)   (35)
     Tone 2      193    182    179    191    213    221    176    45     191    217
                 (27)   (21)   (22)   (25)   (38)   (40)   (23)   (34)   (24)   (42)
     Tone 3      206    191    173    153    134    210    128    82     176    133
                 (32)   (25)   (22)   (25)   (34)   (34)   (35)   (46)   (22)   (37)
     Tone 4      284    274    245    203    179    291    174    117    246    176
                 (39)   (36)   (25)   (21)   (28)   (39)   (28)   (53)   (25)   (29)
TM   Tone 1      246    243    242    243    245    250    239    11     244    245
                 (30)   (31)   (32)   (32)   (32)   (32)   (31)   (6)    (31)   (32)
     Tone 2      186    178    175    179    189    195    174    21     181    190
                 (22)   (21)   (21)   (22)   (23)   (24)   (21)   (10)   (21)   (24)
     Tone 3      189    176    158    143    132    190    129    61     163    133
                 (23)   (22)   (30)   (34)   (36)   (25)   (36)   (31)   (24)   (39)
     Tone 4      259    247    224    199    181    260    182    78     227    184
                 (31)   (31)   (31)   (31)   (33)   (33)   (32)   (26)   (29)   (31)

Table 4.1. Summary statistics of ten f0 parameters (in Hz) of (a) isolated tones, and (b) contextual tones.


4.1.3. Statistical Results: dynamic f0 information

As shown in Figures 4.5 and 4.6, each tone is characterized by its distinctive dynamic movement in f0 over time and by its absolute frequency values (i.e., f0 height) as a function of context and dialect. In order to better show the cross-dialect differences in each tone produced in isolation and in a sentential context, f0 contours of both BM and TM tones sampled at five temporal locations were plotted against each other, as shown in Figure 4.7.

An overall three-way repeated-measures ANOVA was performed with within-subject factors of f0 measurements at five temporal locations in the vowel

(hereafter “location”) and phonetic context (isolation and sentential context), and between-subject factor of dialect (BM and TM)38 to assess their effects on each tone.

Table 4.2 lists main effects, two- and three-way interactions obtained in the overall analyses.

Given that the main goal of the current production study is to examine how each tone, produced in isolation and in a sentential context, differs in dynamic f0 movement between two regional dialects, the effect of primary interest, among all main and interaction effects, is the three-way interaction of location, dialect and context (i.e., location × dialect × context). Consequently, the following discussion of statistical analyses will focus on this three-way interaction, since the main and two-way interaction effects do not answer our research questions.

38 This factor is renamed as “speaker dialect” in Chapters 5 and 6, where we discuss the effects of speaker dialect on tone identification by three language groups of listeners (“listener dialect”).


In order to interpret the effect of this three-way interaction when it reaches significance, a set of two-factor mixed factorial ANOVAs was performed separately on each tone in each phonetic context, with location as the within-subject factor and dialect as the between-subject factor. Here, the interaction of location and dialect (location × dialect) is of particular interest since it indicates whether the dynamic f0 contour of a tone varies between the two dialects. If this interaction was significant, independent-samples t-tests were performed at each location to examine where in the dynamic f0 movement the cross-dialect difference occurs.
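The final step of this procedure, an independent-samples t-test at each location, can be sketched as follows. This is a minimal illustration using only the Python standard library: it computes Student's t statistic with a pooled variance and its degrees of freedom, but not the p-value, which would additionally require the t distribution. The f0 values are invented for the example and are not data from the study; the function name `independent_t` is likewise an assumption.

```python
from statistics import mean, variance


def independent_t(group_a, group_b):
    """Student's independent-samples t statistic (pooled variance) and its df."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = (pooled_var * (1 / n_a + 1 / n_b)) ** 0.5
    return (mean(group_a) - mean(group_b)) / standard_error, df


# Hypothetical per-token f0 values (Hz) at two of the five locations.
bm = {"25%": [150, 158, 149, 161, 152], "75%": [150, 155, 149, 153, 151]}
tm = {"25%": [172, 178, 170, 181, 174], "75%": [149, 156, 150, 152, 151]}

for location in bm:
    t, df = independent_t(bm[location], tm[location])
    print(f"{location}: t({df}) = {t:.2f}")
```

In this invented data set the two groups differ clearly at 25% but not at 75%, which is the kind of location-specific pattern the follow-up tests are meant to reveal.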

As shown in Table 4.2, the three-way location × dialect × context interaction is significant for all four tones, so the subsequent statistical analyses outlined above were performed. Since the interpretation of the main effects from the overall three-factor ANOVAs is not germane to our research questions, they will not be discussed in detail and are only summarized in Table 4.2. Discussion and interpretation will therefore focus on the follow-up two-way ANOVAs and t-tests. However, Tone 1 will serve as an example with detailed interpretation of all statistical results.

Figure 4.7. The group-averaged f0 contours of four tones measured at five temporal locations in the vowel, arranged by tones. Panels: (a), (b), (c), (d).

                                  Tone 1            Tone 2            Tone 3            Tone 4
                             df   F          η2     F          η2     F          η2     F          η2
location                          11.12**    .068   293.87**   .661   432.30**   .741   851.96**   .849
                                  (df=1.67)         (df=1.48)         (df=2.24)         (df=1.24)
    sign. pair-wise comparisons: Tone 1: 2-5, 3-5, 4-5; Tone 2: all, for locations 2 and 3; Tone 3: all; Tone 4: all
context                      1    6.9*       .043   266.84**   .639   17.09**    .102   36.54**    .194
    direction: Tone 1: isolation<context; Tone 2: isolation>context; Tone 3: isolation<context; Tone 4: isolation<context
dialect                      1    20.46**    .119   17.11**    .102   ―          ―      6.88*      .043
    direction: Tone 1: TM<BM; Tone 2: TM<BM; Tone 4: TM<BM
location × dialect           4    ―          ―      47.97**    .241   53.03**    .260   13.96**    .084
context × dialect            1    24.62**    .139   16.94**    .101   8.95*      .056   4.66*      .030
location × context           4    5.37*      .034   187.13**   .553   70.48**    .318   20.88**    .121
location × context × dialect 4    8.13**     .051   20.67**    .12    84.10**    .358   4.17**     .027

**p<0.001. *p<0.05.

Table 4.2. Summary of main effects and interactions from the first set of repeated-measures ANOVAs for four Mandarin tones.

Tone 1. As shown in Table 4.2, all main effects and interactions were significant, except for the location × dialect interaction. Bonferroni post hoc analyses showed significant differences in mean f0 values between the two dialects (p<0.001) and between the two phonetic contexts (p=0.009). Specifically, Tone 1 produced by BM speakers is, on average, 24 Hz higher than that produced by TM speakers. Tone 1 in a sentential context is, on average, 4 Hz higher than its isolated counterpart. The effect of context also varied between dialects, as shown by the significant interaction between context and dialect: while the mean f0 of Tone 1 was higher in context than in isolation for BM, it was lower for TM. In addition, the pairwise mean f0 comparisons that reached significance at the α = .05 level were location pairs 2-5, 3-5, and 4-5. The mean f0 value at vowel offset (location 5) was 1.8 and 2.3 Hz higher than that at 25% (location 2), 50% (location 3), and 75% (location 4) into the vowel, respectively. For the contour tones, the effect of location is expected to be significant. The associated effect size, η2, should also be relatively large compared to that of the level Tone 1, since mean f0 values should differ significantly for more pairs of locations in the vowel.

A significant interaction between location and context (location × context) indicates that the mean f0 values at certain locations in a vowel changed when a tone was produced in context, which can be expected due to coarticulation. An interaction between location and dialect (location × dialect) would suggest cross-dialect differences in the mean f0 values at certain locations in a vowel. However, this interaction term is not significant for Tone 1.

Subsequent two-way repeated-measures ANOVAs examining the effects of location and dialect on each tone produced in either phonetic context showed that the location × dialect interaction was significant for Tone 1 produced in isolation [F(1.92,608)=4.09, p=.019, η2=.026] but not for Tone 1 produced in context [F(1.79,608)=2.12, p=.127, η2=.014]. Independent-samples t-tests comparing citation Tone 1 between the two dialects at each location showed that cross-dialectal differences were present at all five locations in the vowel. The same pattern was observed for contextual Tone 1, although the interaction between location and dialect was not statistically significant.

Tone 2. All main effects and interactions are significant, as shown in Table 4.2. Separate two-way ANOVAs indicated that the location × dialect interaction was statistically significant in both contexts [Isolation: F(1.48,608)=63.24, p<.001, η2=.294; Context: F(1.92,604)=21.88, p<.001, η2=.127]. T-tests showed significant cross-dialectal differences in the mean f0 values measured at 50%, 75% and 95% into the vowel (i.e., locations 3, 4 and 5, respectively). That is, BM Tone 2 was significantly higher than its TM counterpart in the second half of the f0 contour, and the cross-dialectal differences increased progressively toward the end of the vowel. Similarly, cross-dialectal differences in contextual Tone 2 contours lie in the last 25% of the f0 contour. The magnitude of dialect differences was smaller for contextual Tone 2.

Tone 3. All main effects were statistically significant, except for the effect of

dialect. While the main effect of dialect is not significant, there was a relatively strong

effect of location × dialect × context interaction (η2=.358). This is not surprising since

Tone 3 is a contour tone that exhibits considerable cross-dialectal variation when

produced in isolation and undergoes changes in the tonal shape for BM speakers

between contexts. The non-significance of the effect of dialect may be due to lumping

f0 variations across locations and between contexts together.


Two-way ANOVAs examining the effect of location and dialect on Tone 3 in

isolation revealed the same result—no significant effect of dialect [F(1,15)=.03, p=.862,

η2=.000] but relatively strong effects of location [F(2.54,608)=143.80, p<.001, η2=.486]

and location × dialect interaction [F(2.54,608)=113.97, p<.001, η2=.429]. The latter

confirms that citation Tone 3 in two dialects has different dynamic f0 contour at some

locations in the vowel among the five points examined.

Individual t-tests showed that isolated Tone 3 in TM differed from the BM

counterpart at 25%, 50% and 95% into the vowel (p<.001). Specifically, TM Tone 3

was respectively 20 and 41 Hz higher in the first two locations, and 50 Hz lower at the

vowel offset. Statistically significant differences in mean f0 values at these three locations again indicate striking cross-dialectal differences in the production of Tone 3 in terms of tonal shape. In particular, citation Tone 3 had a dipping pattern (with an inflection point) in BM while it showed a falling contour in TM (see part (c) of Figure 4.7).
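The location-by-location differences for isolated Tone 3 can be checked against the group means in Table 4.1. The short Python sketch below simply subtracts the BM means from the TM means; because the table values are rounded, a result may differ by about 1 Hz from the figures quoted in the text. The variable names are illustrative only.

```python
LOCATIONS = ["5%", "25%", "50%", "75%", "95%"]

# Mean f0 (Hz) of isolated Tone 3 at the five locations, from Table 4.1(a).
BM_TONE3 = [192, 154, 118, 152, 176]
TM_TONE3 = [185, 175, 159, 143, 126]

# Positive values: TM is higher than BM at that location.
differences = {loc: tm - bm for loc, bm, tm in zip(LOCATIONS, BM_TONE3, TM_TONE3)}
print(differences)
# {'5%': -7, '25%': 21, '50%': 41, '75%': -9, '95%': -50}
```

The sign change between the midpoint and the offset reflects the crossing of a dipping contour and a falling one.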

Two-way ANOVAs examining the effect of location and dialect on contextual Tone 3 showed effects of dialect [F(1,151)=8.43, p=.004, η2=.053], location [F(1,52)=394.27, p<.001, η2=.723], and the location × dialect interaction [F(1.52,604)=5.27, p=.011, η2=.034]. Independent-samples t-tests indicated that cross-dialectal differences in contextual Tone 3 appeared in the first half of the contour. Specifically, contextual BM Tone 3 was 17, 14, and 15 Hz higher than that in TM at these locations. These findings were not surprising since contextual Tone 3 had similar falling contours in the two dialects due to the tone sandhi rule. Unlike Tone 3 in isolation, the cross-dialectal differences in the production of contextual Tone 3 showed up as relative changes in the absolute frequency values in the first portion of the contour rather than as different trajectories of f0 movement.

Tone 4. As shown in Table 4.2, all main and interaction effects from the three-way ANOVA were statistically significant. Notably, the effect of location is the strongest. The large effect size (η2=.849) can be attributed to Tone 4 having a dynamic high-falling f0 contour.

Separate two-way ANOVAs indicated a significant location × dialect interaction for both Tone 4 in isolation [F(1.34,608)=6.96, p=.004, η2=.044] and Tone 4 in context [F(1,152)=12.422, p=.001, η2=.076]. While Tone 4 is realized as high-falling in both dialects, these interaction effects suggest that there are still cross-dialectal differences in the dynamic f0 movement at several of the five locations in the vowel. Individual t-tests showed that such cross-dialectal differences were present in the first half of the f0 contour. Specifically, BM Tone 4 produced in isolation was 13, 17, and 11 Hz higher, and that produced in a sentence context was 24, 27, and 21 Hz higher, than the TM counterpart at 5%, 25%, and 50% into the vowel. As shown in part (d) of Figure 4.7, Tone 4 in both dialects and both contexts has a high-falling pattern, with cross-dialectal differences appearing as higher f0 values in BM over the first half of the tonal contour.


4.2. Duration

The mean vowel durations are presented in Table 4.3 and graphically displayed in Figure 4.8 as a function of tone, context, and dialect. Isolated BM Tone 3 had the

longest duration, followed by Tones 2, 1, and 4 (i.e., Tones 3>2>1>4). However,

isolated Tone 2 in TM was the longest, followed by Tones 1, 3, and 4 (i.e., Tones

2>1>3>4). The tones produced in a sentence context exhibited the same duration

pattern, Tones 2>1>4>3, in the two regional dialects. The patterns of cross-dialectal

differences in tone duration are fairly consistent with the findings reported in Deng et al. (2006) and Kuo et al. (2008).
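The duration orderings stated above follow directly from the cell means in Table 4.3. As a simple check, the Python sketch below sorts the four tones by mean duration within each context and dialect; the dictionary layout and the function name `duration_ranking` are choices made for this illustration.

```python
# Mean vowel durations (ms) from Table 4.3, keyed by (context, dialect);
# each list holds Tones 1-4 in order.
DURATIONS = {
    ("isolation", "BM"): [411, 428, 475, 304],
    ("isolation", "TM"): [433, 448, 333, 314],
    ("context", "BM"): [254, 261, 223, 247],
    ("context", "TM"): [212, 213, 178, 200],
}


def duration_ranking(durations):
    """Return tone numbers ordered from longest to shortest, e.g. '3>2>1>4'."""
    order = sorted(range(1, 5), key=lambda tone: durations[tone - 1], reverse=True)
    return ">".join(str(tone) for tone in order)


for (context, dialect), values in DURATIONS.items():
    print(context, dialect, duration_ranking(values))
```

Only isolated BM tones place Tone 3 first; the other three cells all rank Tone 2 as the longest.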

A three-factor repeated-measures ANOVA was performed with tone and context as within-subject factors and dialect as the between-subject factor to examine their effects on vowel duration. Table 4.4 lists all main and interaction effects. The significant main effect of context revealed that vowels produced in isolation had longer durations than those produced in context [F(1,150)=1342.40, p<0.001, η2=.899]. The main effect of dialect was also significant [F(1,150)=12.41, p=0.001, η2=.076]. Bonferroni-adjusted t-tests indicated that BM tones were, on average, longer than TM tones (means = 325 and 292 ms, respectively). As expected, the effect of tone was significant [F(2.46,450)=254.02, p<0.001, η2=.629], which indicated intrinsic differences in the duration of tones. Post hoc tests showed that Tone 2 was longest, followed by Tones 1, 3, and 4. However, this duration pattern should be interpreted in light of dialect and context, as shown by the significant three-way interaction of tone, context and dialect (tone × dialect × context). Subsequent two-way ANOVAs with tone and dialect as within- and between-subject factors, respectively, were performed on vowel duration in each phonetic context in order to examine cross-dialectal differences in the duration of tones produced in isolation and in context.
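The η2 values reported with these F ratios appear to be consistent with partial eta squared, which can be recovered from an F ratio and its degrees of freedom. This is an inference from the reported numbers rather than something stated in the text, and the function name below is introduced only for illustration. For the tone effect, the match is obtained with the uncorrected numerator df of 3 (four tones), which is likewise an assumption.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F ratio: SS_effect / (SS_effect + SS_error)."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F ratios and degrees of freedom as reported for vowel duration.
print(round(partial_eta_squared(1342.40, 1, 150), 3))  # context: 0.899
print(round(partial_eta_squared(12.41, 1, 150), 3))    # dialect: 0.076
print(round(partial_eta_squared(254.02, 3, 450), 3))   # tone:    0.629
```

The agreement with the reported values (.899, .076, .629) suggests, but does not prove, that this is the effect-size measure used.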

                        Tone 1   Tone 2   Tone 3   Tone 4
Isolated tones     BM   411      428      475      304
                   TM   433      448      333      314
Contextual tones   BM   254      261      223      247
                   TM   212      213      178      200

Table 4.3. Mean durations of four isolated and contextual tones (in ms) by BM and TM speakers.

Figure 4.8. Mean durations (in ms) of four isolated and contextual tones by BM and TM speakers


Duration                    df     F            η2
tone                        2.46   254.02**     .629    T2>T1>T3>T4
context                     1      1342.40**    .899    isolation>context
dialect                     1      12.41*       .076    BM>TM
tone × dialect              3      98.820**     .397
context × dialect           1      5.692*       .037
tone × context              3      234.48**     .610
tone × dialect × context    3      119.41**     .443

**p<0.001. *p<0.05.

Table 4.4. Summary of main effects and interactions from the three-way repeated-measures ANOVAs for vowel durations.

Separate two-way ANOVAs on the duration of vowels in isolation showed significant effects of tone [F(2.48,456)=293.31, p<.001, η2=.659] and of the tone × dialect interaction [F(2.48,456)=138.19, p<.001, η2=.476], but no significant effect of dialect [F(1,152)=3.55, p=.061, η2=.023]. The significant two-way interaction suggests that cross-dialectal differences in duration varied across tones. Independent-samples t-tests showed a statistically significant cross-dialectal difference in the duration of Tone 3, but not in the duration of the other three tones. Specifically, isolated Tone 3 in BM was 141 ms longer than its TM counterpart (p<.001).

A similar two-way ANOVA on tones produced in context indicated significant effects of tone [F(2.28,450)=83.67, p<.001, η2=.358] and dialect [F(1,150)=25.02, p<.001, η2=.143] but no significant interaction between these two factors (p=0.61). The non-significant tone × dialect interaction indicates that the durational pattern across tones did not differ between dialects when the tones were produced in a sentence context.

In summary, when tones are produced in isolation, a cross-dialectal difference in intrinsic (tone) duration was observed for Tone 3. While it had the longest duration in BM, it was intermediate in TM. In the current study, we found no cross-dialectal differences in the durational patterns for the tones produced in a sentence-medial position.

4.3. Rms Amplitude Contours

4.3.1. Individual rms amplitude contours & mean rms amplitude contours

The graphical presentation of the rms amplitude contours of each tone is similar to that of the f0 contours above. Figures 4.9 and 4.10 show the individual normalized rms amplitude contours of all 13 syllables produced in isolation by female speakers, arranged by tone and dialect. Individual normalized rms amplitude contours of all syllables produced in a sentence context are shown in Figures 4.11 and 4.12 in the same layout. Both isolated and contextual rms amplitude contours are plotted as functions of normalized percentage time. The boldface red line denotes the mean rms contour of each tone, averaged over 13 syllables.
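To make the idea of a time-normalized rms contour and its mean concrete, the following Python sketch computes one rms level per equal-duration frame of a vowel and then averages contours point by point across tokens. It is an illustration under stated assumptions, not the procedure used in the study: the number of points (11), the dB reference (amplitude 1.0), and all function names are hypothetical, and each signal is assumed to contain at least as many samples as there are points.

```python
import math


def rms_db(samples):
    """Root-mean-square level of one frame in dB re. amplitude 1.0 (an assumption)."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(mean_square) if mean_square > 0 else float("-inf")


def rms_contour(signal, n_points=11):
    """Split a vowel into n_points equal frames and return one rms value per frame,
    so that tokens of different durations share the same 0-100% time scale."""
    frame = len(signal) // n_points
    return [rms_db(signal[i * frame:(i + 1) * frame]) for i in range(n_points)]


def mean_contour(contours):
    """Point-by-point average across tokens (the role of the bold mean line)."""
    return [sum(values) / len(values) for values in zip(*contours)]


# Two hypothetical tokens of different length and constant amplitude.
short_token = [0.5] * 220
long_token = [0.25] * 330
average = mean_contour([rms_contour(short_token), rms_contour(long_token)])
print([round(value, 1) for value in average])
```

Because both tokens are reduced to the same number of points, they can be averaged despite their different durations, which is the purpose of plotting against percentage time.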

Figure 4.9. The rms amplitude contours of 13 syllables produced in isolation by BM speakers, arranged by tone. Amplitude contours were plotted against percentage time and aligned to 0%. Panels: (BM)S4, (BM)S5, (BM)S8, (BM)S9, (BM)S10, (BM)S17.

Figure 4.10. The rms amplitude contours of 13 syllables produced in isolation by TM speakers, arranged by tone. Amplitude contours were plotted against percentage time and aligned to 0%. Panels: (TM)S2, (TM)S3, (TM)S14, (TM)S15, (TM)S18, (TM)S19.

Figure 4.11. The rms amplitude contours of 13 syllables produced in context by BM speakers, arranged by tone. Amplitude contours were plotted against percentage time and aligned to 0%. Panels: (BM)S4, (BM)S5, (BM)S8, (BM)S9, (BM)S10, (BM)S17.

Figure 4.12. The rms amplitude contours of 13 syllables produced in context by TM speakers, arranged by tone. Amplitude contours were plotted against percentage time and aligned to 0%. Panels: (TM)S2, (TM)S3, (TM)S14, (TM)S15, (TM)S18, (TM)S19.

Like the corresponding f0 contours, the rms amplitude contours show inter- and intra-speaker variability for tones produced in isolation and in context by both speaker groups. Based on visual inspection, amplitude contours for isolated Tone 3 by BM speakers appeared to be the most variable across lexical items and speakers. Since the patterns of the mean rms amplitude contour of each tone were consistent across speakers within each dialect group, the normalized group-averaged rms amplitude contours of each tone were obtained by averaging across all six female speakers from each dialect group. Cross-dialectal differences in tone amplitude will be discussed below in terms of the group-averaged rms amplitude contour of each tone.

4.3.2. Group-Averaged rms Amplitude Contours

Figures 4.13 and 4.14 show the normalized group-averaged rms amplitude contours of four Mandarin tones produced in isolation and in context.

Figure 4.13. Normalized group-averaged rms amplitude contours of four Mandarin tones produced in isolation by the BM group (on the left) and the TM group (on the right).

Figure 4.14. Normalized group-averaged rms amplitude contours of four Mandarin tones produced in context by the BM group (on the left) and the TM group (on the right).

The time-normalized group-averaged rms amplitude contours for Tones 1-4 produced in isolation by the BM group have rising-level-falling, rising with a small dip around 25% of the vowel, rising-falling-rising-falling, and rising-falling patterns, respectively. Generally speaking, amplitude contours of isolated BM tones observed in the current study were consistent with the findings reported in Ho (1976): level or level-falling for Tone 1, rising or level for Tone 2, rising-falling-rising-falling (i.e., double-peak curve) for Tone 3, and falling or level-falling for Tone 4 (ref. §2.3.3.1). Most importantly, the amplitude contour of Tone 3, whether shown in the individual or the group-averaged rms amplitude curves, has a double-peak shape. While Lin (1988) suggested that not all productions of Tone 3 in isolation exhibit this double-peak pattern, results from the current study demonstrated that amplitude contours of Tone 3 produced by all BM speakers have the falling-rising pattern.

A set of basic amplitude contours for the isolated TM tones shows up after

applying the same technique. Specifically, they are rising-falling for Tone 1, level for

Tone 2, and falling for Tones 3 and 4. As compared to the amplitude contours of

isolated BM tones, the most noticeable cross-dialectal differences showed up in the

amplitude curve for Tone 3. While it has a falling-rising pattern in BM, it is falling in

TM. Cross-dialectal differences also occurred in the amplitude contours for Tone 2.

In particular, Tone 2 in BM had a rising amplitude contour while its TM counterpart had a flat one.

In addition, there are noticeable cross-dialect differences in the amplitude curves of Tones 1 and 2 when produced in the sentential context. Specifically, the amplitude for BM Tone 1 in context was relatively flat over time with a final fall, and that for Tone 2 was rising with a dip around the midpoint of the vowel. In contrast, the amplitude of both TM Tones 1 and 2 in context fell slightly. Tones 3 and 4 in the sentential context both had falling patterns in the two dialects, with Tone 4 located in the higher amplitude region.

In summary, the most prominent cross-dialectal differences in amplitude contours of the four tones can be observed in Tone 3 produced in isolation: double-peaked in BM versus falling in TM. While previous studies have reported considerable individual variability in whether a citation Tone 3 is produced with a double-peak amplitude pattern (e.g., Lin, M.-C., 1963; Ho, 1976), all amplitude contours of Tone 3 produced in isolation by the BM female speakers in the current study showed a double-peaked pattern. On the other hand, those produced by their TM counterparts showed a falling contour.

The most noticeable change in amplitude contour between contexts occurs in BM Tone 3: double-peaked in isolation and falling in a sentential context. Ho (1976) suggested that Tone 3, when produced in several sentence environments, shows several different types of amplitude curves in addition to the double-peak pattern observed in isolation. Those include level, falling, rising-falling, and falling-rising contours. When Tone 3 was produced in a sentence-medial position, as in the current study, the amplitude contour consistently became falling for all BM female speakers.

4.3.3. Statistical Analyses

Procedures for the statistical analyses of amplitude contours were similar to those for f0 contours. Three-way repeated-measures ANOVAs with rms amplitude levels measured at five positions in the vowel (i.e., “location”) and context as within-subject factors and dialect as the between-subject factor were conducted to assess the effects of dialect, context, and dynamic amplitude information on each tone. All main and interaction effects are listed in Table 4.5.

Presentation layout of the statistical analyses will follow that outlined in §4.1.3

for the f0 contours. Since we are interested in the cross-dialectal differences in vowel

amplitude of each tone in each phonetic context (i.e., the interaction term location ×

context × dialect in the three-way ANOVA), additional two-way ANOVAs with location and dialect as within- and between-subject factors, respectively, were performed separately on each tone in each context to examine the interaction of location and dialect.


                                  Tone 1            Tone 2            Tone 3            Tone 4
                             df   F          η2     F          η2     F          η2     F          η2
location                          176.948**  .538   67.592**   .309   335.068**  .189   524.578**  .775
                                  (df=1.61)         (df=1.97)         (df=2.46)         (df=1.77)
    sign. pair-wise comparisons: Tone 1: all; Tone 2: 1-5, 2-5, 3-5, 4-5; Tone 3: all; Tone 4: all, except for 1-3
context                      1    18.584**   .109   77.975**   .341   50.035**   .249   22.78**    .130
    direction: Tone 1: isolation>context; Tone 2: isolation>context; Tone 3: isolation>context; Tone 4: isolation>context
dialect                      1    18.258**   .107   12.944**   .079   5.968*     .038   7.942*     .05
    direction: TM
location × dialect           4    ―          ―      14.040**   .007   35.270**   .189   ―          ―
context × dialect            1    4.111*     .026   ―          ―      ―          ―      ―          ―
location × context           4    144.252**  .487   79.859**   .346   32.459**   .177   97.991**   .392
location × context × dialect 4    ―          ―      2.513*     .016   67.727**   .310   ―          ―

**p<0.001. *p<0.05.

Table 4.5. Summary of main effects and interactions from three-way repeated-measures ANOVAs for rms amplitude.

Tone 1. All main and two-way interaction effects are significant, except for the interaction of location and dialect (location × dialect). The three-way location × context × dialect interaction is also not significant. The non-significance of the location × dialect interaction indicates that, across the two contexts, cross-dialectal differences in the amplitude level of Tone 1 did not vary by location in the vowel. Subsequent two-way ANOVAs showed that the interaction was significant for Tone 1 in context [F(1.54,608)=4.93, p=.014, η2=.031] but not for that in isolation [F(1.91,608)=1.22, p=.296, η2=.006]. That is, there were no location-specific cross-dialectal differences in amplitude at the five locations in the vowel for Tone 1 in isolation. Independent-samples t-tests showed that amplitude level differed between the two dialects at all five points in the vowel for Tone 1 in context. In addition, the mean amplitude of contextual Tone 1 in BM was higher than that in TM.

Tone 2. The three-way interaction was statistically significant and so was the

location × dialect interaction in the subsequent two-way ANOVAs for both phonetic

contexts [isolation: F(2.35,608)=8.38, p<.001, η2=.052; context: F(1.81,604)=13.22,

p<.001, η2=.081]. Individual t-tests showed significant cross-dialectal differences in the amplitude level of isolated Tone 2 in the second half of the amplitude contour (2-3 dB higher in BM). As shown in Figure 4.13, the amplitude contour for BM Tone 2 in isolation starts rising around 40 percent into the vowel, whereas that in TM is relatively flat. For Tone 2 in the sentence context, there were statistically significant cross-dialectal differences in amplitude at all five locations in the vowel. The amplitude curve for contextual Tone 2 in BM has a dipping pattern with the dip around the midpoint of the vowel, as shown in Figure 4.14. In contrast, the

amplitude contour of TM Tone 2 in context has a falling pattern.

Tone 3. While the three-way interaction of location, dialect and context

(location × dialect × context) was statistically significant, separate two-way ANOVAs on each tone in each context showed that location × dialect interaction was statistically significant for Tone 3 produced in isolation [F(3.00,608)=77.72, p<.001, η2=.338] but not

in context [F(1.68,604)=1.97, p=.149]. Subsequent t-tests showed that cross-dialectal differences in amplitude for Tone 3 in isolation were present at all locations except the onset. This provides statistical support for the observation that BM Tone 3 in isolation has a double-peaked amplitude contour, as opposed to the falling pattern in TM Tone 3, as shown in Figure 4.13.

Tone 4. As with Tone 1, the location × dialect × context interaction was not significant. Individual ANOVAs indicated that the location × dialect interaction was not significant in either phonetic environment [isolation: F(2.00,608)=1.54, p=.217; context: F(1.57,608)=2.35, p=.110]. For isolated Tone 4, the non-significance of this interaction might arise from the non-significant effect of dialect [F(1,152)=3.324, p=.07].

4.3.4. Correlation between f0 and Amplitude Contours

The present analyses investigate whether rms amplitude and f0 contours covary in the same direction and whether these correlational patterns vary as a function of dialect and context. Previous studies examining the relationship between f0 and amplitude reported inconsistent results as to whether the amplitude of a tone is correlated with its own f0 contour or with the f0 contours of other tones (e.g., Whalen and Xu, 1992; Kuo

et al., 2008). Accordingly, two sets of cross-correlation analyses on f0 contours and


amplitude contours were conducted to address this question. The first set examined the

relationship of the rms amplitude contour of one tone to the f0 contour of each of the four

tones. The second set examined the correlational relationship of the f0 contour of one

tone to the f0 contours of the other three tones. The rationale is as follows. When the amplitude contour of one tone is highly correlated with the f0 contour of a different tone, it may be that the tone's own f0 contour, and not necessarily its rms amplitude contour as such, happens to be highly correlated with the f0 contour of that other tone. That is, a high correlation between the amplitude contour of one tone and the pitch contour of a different tone may arise from the f0 contours of the two tones being highly correlated.

The normalized group-averaged rms and f0 contours were used in the

cross-correlation analysis. Interpretation of the results will be constrained to positive

correlations between the two. Interpretation of the size of a Pearson correlation

coefficient follows the guidelines proposed by Cohen (1988), with .5 to 1.0 as having

large correlation, .3 to .5 as medium, .1 to .3 as small, and .0 to .09 as none. Tables 4.6

and 4.7 show the correlation between the rms amplitude contour of a tone and the pitch

contours of all four tones, arranged by dialect and context.
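The two steps described above (computing a Pearson coefficient between two contours and labelling its size) can be sketched as follows. This is only an illustration under stated assumptions: the contour values are invented five-point examples rather than data from this study, the function names `pearson_r` and `cohen_label` are hypothetical, and values below .1 are all labelled "none" because the text does not specify the range between .09 and .1.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two equally sampled contours."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def cohen_label(r):
    """Size label following the ranges quoted in the text (Cohen, 1988).

    Negative coefficients are not interpreted, mirroring the restriction
    to positive correlations stated above.
    """
    if r < 0:
        return "not interpreted"
    if r >= 0.5:
        return "large"
    if r >= 0.3:
        return "medium"
    if r >= 0.1:
        return "small"
    return "none"

# Hypothetical time-normalized contours sampled at five points (not study data).
f0_falling = [220, 205, 190, 175, 160]   # Hz, falling pitch
f0_dipping = [200, 170, 150, 175, 210]   # Hz, dipping pitch
amp_falling = [72, 70, 67, 63, 58]       # dB, falling amplitude

r_fall = pearson_r(amp_falling, f0_falling)
r_dip = pearson_r(amp_falling, f0_dipping)
print(cohen_label(r_fall), cohen_label(r_dip))
```

With these invented values, a falling amplitude contour correlates strongly with a falling f0 contour and negatively with a dipping one, which is the kind of contrast the cross-correlation tables are meant to capture.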


                    BM speaker (f0)                             TM speaker (f0)
(a) Amp.  1          2          3          4          1          2          3          4
Tone 1    -.173**    -.938**    -.484**    .794**     -.926**    -.993**    .848**     .858**
Tone 2    -.745**    -.477**    -.512**    .247**     -.960**    -.985**    .751**     .750**
Tone 3    -.468**    .224**     .755**     -.062*     -.848**    -.858**    .989**     .992**
Tone 4    .101**     -.999**    -.332**    .933**     -.896**    -.971**    .896**     .910**

(b) Tone  1          2          3          4          1          2          3          4
Tone 1    1          -.145**    .177**     .320**     1          .918**     -.800**    -.781**
Tone 2    -.145**    1          .316**     -.944**    .918**     1          -.780**    -.792**
Tone 3    .177**     .316**     1          -.006      -.800**    -.780**    1          .996**
Tone 4    .320**     -.944**    -.006      1          -.781**    -.792**    .996**     1

**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).

Table 4.6 (a) Cross-correlations between amplitude contours (Amp.) and f0 contours (f0) for tokens produced in isolation by BM and TM speakers. (b) Cross-correlations between f0 contours of tones (Tone) produced in isolation by BM and TM speakers.


                    BM speaker (f0)                             TM speaker (f0)
(a) Amp.  1          2          3          4          1          2          3          4
Tone 1    .618**     -.959**    -.906**    .277**     .985**     -.361**    -.739**    .825**
Tone 2    -.597**    .598**     .714**     -.366**    .972**     -.068*     -.506**    .937**
Tone 3    -.473**    -.675**    .990**     .998**     .069*      -.379**    .986**     .996**
Tone 4    -.717**    -.869**    .917**     .942**     -.221**    -.633**    .900**     .931**

(b) Tone  1          2          3          4          1          2          3          4
Tone 1    1          .953**     -.410**    -.446**    1          .893**     .219**     .139**
Tone 2    .953**     1          -.606**    -.657**    .893**     1          -.238**    -.313**
Tone 3    -.410**    -.606**    1          .989**     .219**     -.238**    1          .996**
Tone 4    -.446**    -.657**    .989**     1          .139**     -.313**    .996**     1

**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).

Table 4.7. (a) Cross-correlations between amplitude contours and f0 contours for tokens

produced in context by BM and TM speakers. (b) Cross-correlations between f0 contours for tokens produced in context by BM and TM speakers.


The rms amplitude contour of isolated BM Tone 1 was highly correlated with the f0 contour of isolated BM Tone 4 (r=.794). This may be explained by the fact that the amplitude contour of BM Tone 1 in isolation exhibited a level-falling pattern. The amplitude contours of Tones 3 and 4 were strongly correlated with their own f0 curves (r=.755 and r=.933, respectively). For Tone 3 this is expected, since both the amplitude contour and the f0 contour of Tone 3 in isolation have a concave shape. Inspection of the cross-correlations among the four isolated tones in BM revealed that the f0 contour of a tone was only slightly positively correlated with those of the other tones (r ≤ .320). As a result, the cross-correlations between amplitude and f0 contours observed for BM tones in isolation resulted from amplitude co-varying with f0.

The correlational patterns for TM tones in isolation are more difficult to interpret

since Pearson correlation coefficients for each amplitude-f0 combination are relatively

high. The amplitude contours of all four tones were highly correlated with the f0 contours of Tones 3 and 4. Large correlation of the amplitude contours of Tones 1 and

2 to the f0 contours of Tones 3 and 4 may be explained by the fact that the former

displayed rising-falling and flat-falling patterns and the latter all had falling contours.

Of particular interest is the cross-dialect difference in correlation between the amplitude and f0 contours of Tone 3. The amplitude contour of TM Tone 3 in isolation was

highly correlated with f0 contours of not only Tone 3 (r=0.989) but also Tone 4

(r=0.992). Similarly, the amplitude contour of Tone 4 was highly correlated with the f0 contour of not only Tone 4 (r=0.910) but also Tone 3 (r=0.896). This finding is consistent with Kuo et al. (2008), in which similar correlational patterns were found for the female TM speaker whose Tone 3 production showed a low-falling pattern.


Cross-correlations among the f0 contours of TM tones in isolation (Table 4.6 (b)) showed that the f0 contours of Tones 3 and 4 (r=0.996) and those of Tones 1 and 2 (r=0.918) were highly correlated with each other. The former is expected since isolated Tone 3 in

TM has a low-falling pattern, which should correlate with the high-falling Tone 4. High correlation between TM Tones 3 and 4 provides additional evidence that TM Tone 3 in isolation is different from the BM counterpart in terms of tonal shape. The strong correlation between f0 contours of Tones 1 and 2 might imply that TM Tone 2 in isolation has a flatter pitch contour (i.e., the rising segment is of smaller magnitude).

Cross-correlations between amplitude and pitch and among f0 contours for contextual tones are shown in Table 4.7 (a) and (b), respectively. Since Tone 3 has a low-falling pattern when produced at a non-prepausal position in a sentence context in both dialects, the amplitude contour of a contextual Tone 3 should be expected to be highly correlated with the f0 contours of both its own tone and Tone 4. As shown in Table 4.7 (a), the amplitude contours of Tones 3 and 4 were highly correlated with the pitch contours of Tones 3 and 4 in both dialects. In addition, the f0 contours of contextual Tones 3 and 4 were also highly correlated with each other in both dialects (BM: r = 0.989; TM: r = 0.996).

4.4. Summary

The primary purpose of the production study was to investigate cross-dialectal differences in the realization of the four lexical tones in Mandarin Chinese in two constrained phonetic contexts. Three acoustic parameters of tones, f0, rms amplitude, and duration, were analyzed. Mixed design ANOVAs and independent-samples t-tests were performed to investigate effects of dynamic f0 and rms amplitude movement (i.e.,


measurements obtained at five temporal locations in the vowel), context, and dialect on

each tone.

Group-averaged time-normalized f0 contours of the four tones produced in isolation

and in a sentence-medial position are shown in Figures 4.5 and 4.6. The most

prominent cross-dialectal difference in citation tones is Tone 3. Tone 3 in Beijing

Mandarin has a low-falling-rising contour whereas it is low-falling in Taiwan Mandarin.

Statistical analyses comparing f0 values measured at five temporal locations in the vowel

(i.e., the dynamic f0 movement) between two dialects corroborated discrepancies in tonal

contours. Specifically, while citation TM Tone 3 was higher (in Hz) in the first 25%

and 50% of the contour, it was 50 Hz lower at the offset. The cross-dialectal

discrepancy in the production of Tone 3 between BM and TM is in accord with previous findings (e.g., Li et al., 2006; Deng et al., 2006; Shi and Deng, 2006; Sanders, 2008a,b).

The final rising portion in isolated TM Tone 2 was not as prominent as that in the BM

counterpart. Statistical analyses showed that BM Tone 2 was significantly higher (in Hz) than the TM counterpart starting from the second half of the f0 contour, and the dialectal differences

increased progressively toward the end of the vowel.

Citation Tone 4 in both dialects had similar falling f0 contours, in which BM Tone

4 was significantly higher than the TM counterpart in the first half of the contour.

In addition, Tone 4 in neither dialect fell to the minimum f0 reported in previous acoustic studies (e.g., Chao, 1948; Li et al., 2006; Deng et al., 2006; Fong and Chiang, 1999; Sanders, 2008a, b). A possible explanation for why TM Tone 4 did not fall to an f0 minimum is that a falling pattern in the higher f0 region may be used by TM speakers to differentiate it from Tone 3, which also has a falling f0 contour but is located in the lower register (cf. Sanders, 2008b). While this proposition has not been explored in the

current statistical analysis, this high-low f0 register contrast between TM Tone 3 and

Tone 4 (of either dialect and phonetic context) can be related to the perceptual results obtained from the perception experiments (see Chapter 6).

While Tone 1 has a high level contour regardless of dialect and phonetic context, statistical tests revealed a cross-dialectal difference in Tone 1 along the entire vowel when produced both in isolation and in a sentence-medial context. This might be attributed to idiosyncratic differences in voice pitch among the speakers selected from these two dialect regions. Based on the results obtained from the three-way repeated-measures ANOVAs, the mean f0 values of the four tones in BM, on average, are

higher (in Hz) than those in TM.

The f0 contours in a sentence context were consistent with the patterns predicted

from the tone sandhi and coarticulation rules (ref. Chapter 2). Discussion focused on

the interpretation of potential cross-dialectal differences in producing tones in context,

instead of examining differences in f0 contours between citation tones and tones in

context. Visual inspection revealed no marked cross-dialectal discrepancies in the general shapes of the f0 contours of tones produced in context: Tone 1 is high-level, Tone 2 mid-rising, Tone 3 low-falling (i.e., half third tone), and Tone 4 high-falling in both dialects.

Statistically significant cross-dialect differences were found in Tones 2, 3 and 4.

Specifically, BM Tone 2 in context was higher (in Hz) in the last 25% of the f0 contour, though the magnitude of the dialectal difference was smaller than that in isolation. We speculate that this may be due to a coarticulation effect. According to the coarticulation


rule proposed by Shih (1987), the rising portion of Tone 2 diminishes when followed by a

Tone 1 or Tone 4. However, it is inconclusive whether the rising segment is present or

not since statistical analyses comparing Tone 2 in isolation and in context within each

dialect have not been performed.

Since Tone 3 has a low-falling f0 contour when produced in a non-sentence-final position, the cross-dialectal difference appeared in the first half of the contour, with BM Tone 3 being higher than the TM counterpart. Similar patterns were found in contextual Tone 4.

In terms of the intrinsic duration of tones produced in isolation, the pattern is Tones 3>2>1>4 in BM and Tones 2>1>3>4 in TM. The only statistically significant difference in duration between the two dialects was for Tone 3. Shih (1988) suggested that the falling-rising pattern of Tone 3 has the longest duration among the four tones, whereas the low-falling pattern has the shortest; the cross-dialectal difference in producing Tone 3 in isolation may therefore account for this durational difference. This cross-dialectal difference in tone duration is consistent with previous studies (Deng et al., 2006; Kuo et al., 2008).

Besides, durational patterns found in BM tones in isolation were consistent with previous results that Tone 3 has the longest duration, Tone 4 the shortest, and Tones 1 and 2 the intermediate (e.g., Ho, 1976; Tseng, 1980).

The rms amplitude contours for four BM tones produced in isolation were rising-level-falling, rising with a small dip around 25% of the vowel, double-peak, and rising-falling, respectively. For TM tones in isolation, they were rising-falling for Tone

1, level for Tone 2 and falling for both Tones 3 and 4. These patterns of amplitude contours are generally consistent with Ho (1976). Cross-dialectal differences in the


amplitude contour of Tone 3 in isolation resemble those observed for the f0 contours of the same tone. Specifically, BM Tone 3 has a dipping pitch contour with a double-peaked amplitude envelope, whereas TM Tone 3 has a falling pitch pattern with a

falling amplitude contour.

Statistical analyses revealed significant cross-dialectal differences in amplitude

contours for Tones 2 and 3 in isolation. In particular, BM Tone 2 differed from that of

TM in the second half of the amplitude contour. Amplitude contour of Tone 3 was different at all five locations, except for the onset.

Amplitude contours of tones in context did not differ between two dialects, except for those of Tones 1 and 2. While amplitude contours were flat and rising with a dip for

Tones 1 and 2 in BM, respectively, those in TM were all falling. These cross-dialectal differences were statistically significant—amplitude contours of BM Tones 1 and 2 in context were higher than those in TM. When produced in a sentence context, amplitude envelope of a BM Tone 3 became falling, which was similar to that in the TM counterpart. This contextual change in amplitude shape of Tone 3 is consistent with Ho

(1976), in which he also reported a falling amplitude contour for Tone 3 produced in a sentence-medial position.

Finally, analyses on cross-correlation between amplitude and f0 contours of tones

corroborate that (1) the amplitude contour of a tone does not necessarily correlate most

highly with the pitch contour of that same tone (Kuo et al., 2008), (2) the size of

correlation (Pearson correlation coefficients) between amplitude and pitch varies considerably across different tones (Whalen and Xu, 1992; Fu and Zhang, 2000), and (3) each tone can assume various amplitude contours of different shape (Ho, 1976; Lin, M-C.,


1988). Most importantly, cross-dialectal differences in tonal contours resulted in different relational patterns between amplitude and f0 contours of a tone in two regional dialects. In particular, while the low-falling amplitude envelope of Tone 3 in isolation was highly correlated with the f0 contours of Tones 3 and 4 (which also have falling patterns) in TM, the double-peak amplitude contour of Tone 3 in BM was only correlated to the dipping f0 curve of BM Tone 3.


Chapter 5 Perception Study: Methodology

This chapter discusses the speech stimuli and experimental procedure used in the tone-gating experiment to examine the effects of speaker and dialect variation on the time-course of lexical tone identification by native listeners and English learners of

Mandarin Chinese. Unlike traditional word-gating experiments, in which listeners are asked to propose a word for the presented speech fragment at each gate, the listeners' task here was to identify the tone of each stimulus.

5.1. Speech Stimuli

The target tone syllables consist of four tone quadruplets of CV structure with unaspirated stops (i.e., syllable type 7) produced in isolation and in a carrier sentence by six female speakers with various f0 ranges from each dialect region. The four monosyllables, bo /po/, da /ta/, du /tu/, and ge /kɤ/, represent real words in Mandarin when produced with each of the four lexical tones (cf. Wu and Shu, 2004). These were selected from the speech tokens submitted for acoustic analyses in the production study (§ 3).

The selection criteria for test stimuli were made with the following considerations to improve stimulus creation from previous gating studies. First, word frequencies39 for quadruplets in a set were closely matched/balanced since word frequency/familiarity affects tone identification (Fox and Unkefer, 1985). Specifically, syllable frequencies with tones were used instead of character frequencies since each syllable in Mandarin can

39 Word frequency was obtained by consulting Da’s (2007) syllable frequency data, in which frequency counts are based on the first 3,500 characters of the Modern Chinese character frequency list.

have several homophones that are represented by different characters. For example, ge1 (gē) has several homophones that can be represented by different characters, such as 歌, 哥, 割, each of which has the same segmental structure and tone information but a different word frequency. In addition, it is unclear which Mandarin character will be activated when listeners hear the gated stimuli. Second, only CV syllables with initial unaspirated stops were selected as test tokens to avoid effects of syllabic structures on tones (Howie, 1974; Ho, 1976; Shih, 1987; see §2.4.1). Third, selected words should be familiar to non-native learners. The recording procedure is described in §3.3. A total of 192 tone syllables (4 syllables × 4 tones × 2 contexts40 × 6 speakers)

were used to create gated stimuli in the experiment.

5.2. Gating Procedure

Before the target tone syllables were submitted for acoustic editing to create gated stimuli, they were first rms-amplitude normalized to a similar intensity level, so that intrinsic stimulus properties such as amplitude, which are likely to contribute to tone recognition, were equalized. Each tone syllable was digitally edited with custom-written

Matlab programs to generate a series of gated or fragmented stimuli of increasing

duration in 30 ms increments starting from the syllable onset. In particular, the first gate

(Gate 1) consists of the initial stop plus the first 30 ms of the vowel. The successive

gates were created in 30 ms increments. For example, Gate 2 is composed of the initial stop plus the first 60 ms of the vowel, i.e., Gate 1 plus the next 30 ms of the vowel. For

both isolated and coarticulated (i.e., those produced in a carrier sentence) syllables, the

first eleven gates were included as the test stimuli since 330 ms (11×30) of a vowel

40 Words produced both in isolation and in a sentence context were used as stimuli.

should provide enough acoustic information to correctly identify tones (e.g., Tseng, 1981;

Wu and Shu, 2003). The last gate, Gate 12, consists of the entire syllable (i.e., intact

stimuli). If the remainder of the syllable at any point was shorter than the 30 ms window, that segment was not used. When generating gated stimuli, a ramping/tapering procedure implemented in Matlab was applied to taper both ends of the fragmented syllable to create more natural-sounding stimuli. There were no perceptible clicks as a result of the acoustic editing. Figure 5.1 illustrates the entire gating sequence for the

Mandarin word [po] on the waveform display. Figure 5.2 illustrates the fundamental frequencies of the Mandarin word [po] produced by six female speakers.
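The gating scheme described above can be sketched in a few lines. This is a hedged illustration in Python, not the Matlab programs used in the study: the function names `gate_end_times` and `taper`, the raised-cosine ramp shape, and the 5 ms ramp length are all assumptions, since the text does not specify the tapering parameters.

```python
import numpy as np

def gate_end_times(vowel_onset_ms, syllable_ms, step_ms=30, max_gates=11):
    """End times (ms from syllable onset) of each gated fragment.

    Gate k ends k * step_ms into the vowel; a trailing remainder shorter
    than step_ms yields no gate. The last entry is the intact syllable
    (Gate 12 in the text).
    """
    vowel_ms = syllable_ms - vowel_onset_ms
    n_gates = min(max_gates, int(vowel_ms // step_ms))
    ends = [vowel_onset_ms + step_ms * k for k in range(1, n_gates + 1)]
    return ends + [syllable_ms]

def taper(samples, sample_rate, ramp_ms=5.0):
    """Apply raised-cosine onset/offset ramps (ramp length is an assumption)."""
    out = np.asarray(samples, dtype=float).copy()
    n = min(int(sample_rate * ramp_ms / 1000), len(out) // 2)
    if n > 0:
        ramp = 0.5 * (1 - np.cos(np.pi * np.arange(n) / n))  # rises from 0 toward 1
        out[:n] *= ramp
        out[-n:] *= ramp[::-1]
    return out

# Hypothetical token: 15 ms stop burst followed by a 200 ms vowel.
print(gate_end_times(15, 215))
```

For this hypothetical token the vowel supports six 30-ms gates, and the leftover 20 ms only appears in the intact syllable. This is why shorter tokens contribute fewer gated stimuli than longer ones.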

The choice of a 30-ms window was motivated by findings from previous studies. First, native listeners were able to distinguish between high-onset and low-onset tones within the first six pitch periods, which usually lasted around 20-50 ms (e.g., Gottfried and Suiter, 1997; Lee et al., 2008, 2009; Lee, 2009). Second, previous word-gating studies (Wu and Shu, 2003; Lai and Zhang, 2008) and gating experiments using fixed-duration segments (e.g., Whalen and Xu, 1992) all used a 40-ms window as the smallest unit. Consequently, a 30-ms window was used in this study with the aim of providing a more fine-grained view of lexical processing.


Figure 5.1 The schematic representation of the series of gates of Mandarin word [po] “peel”. The first gate (gate 1) contains the unaspirated stop and the first 30 ms of the vowel. The subsequent gates were created in 30ms increments. The last gate (gate 12) contains the entire syllable.


Figure 5.2. Fundamental frequencies of the gating sequences of Mandarin word [po] “peel”. BM and TM speakers are represented in blue and red, respectively.


Since each monosyllable has a different duration, the number of gates generated from each word produced by each speaker differs. Tables 5.1 and 5.2 present the numbers of stimulus tokens presented to listeners at each gate as a function of tone, speaker language and context. A total of 1745 stimuli (gated and intact stimuli) were presented to the listeners in the gating experiment.

Speaker lang.  Tone  Gates                                             Total
                     1   2   3   4   5   6   7   8   9   10  11  12
TM             1     12  12  12  12  12  12  12  12  12  12  12  12    144
               2     12  12  12  12  12  12  12  12  12  12  12  12    144
               3     12  12  12  12  12  12  11  11  8   7   7   12    128
               4     12  12  12  12  12  12  10  10  8   7   4   12    123
               Total 48  48  48  48  48  48  45  45  40  38  35  48    539
BM             1     12  12  12  12  12  12  12  12  12  12  11  12    143
               2     12  12  12  12  12  12  12  12  12  12  12  12    144
               3     12  12  12  12  12  12  12  11  11  10  8   12    136
               4     12  12  12  12  12  12  12  8   5   4   3   12    116
               Total 48  48  48  48  48  48  48  43  40  38  34  48    539

Table 5.1. Numbers of stimulus tokens presented to listeners at each gate as a function of tone and speaker language for isolated tones.

Speaker lang.  Tone  Gates                                             Total
                     1   2   3   4   5   6   7   8   9   10  11  12
TM             1     12  12  12  12  12  6   0                   12    78
               2     12  12  12  12  12  6   2                   12    80
               3     12  12  12  11  5   2   0                   12    66
               4     12  12  12  12  11  4   0                   12    75
               Total 48  48  48  47  40  18  2                   48    299
BM             1     12  12  12  12  12  12  8   2               12    94
               2     12  12  12  12  12  12  10  2               12    96
               3     12  12  12  12  12  9   5   1               12    87
               4     12  12  12  12  12  12  7   0               12    91
               Total 48  48  48  48  48  45  30  5               48    368

Table 5.2. Numbers of stimulus tokens presented to listeners at each gate as a function of tone and speaker language for coarticulated tones.
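As a quick arithmetic check, the per-gate totals in Tables 5.1 and 5.2 can be summed to recover the overall stimulus count. The sketch below uses the "Total" rows of the two tables; treating the empty cells of Table 5.2 as zero tokens is an assumption, and the variable names are illustrative only.

```python
# Per-gate "Total" rows (Gates 1-12) from Tables 5.1 and 5.2.
# Empty cells in Table 5.2 are entered as 0 (an assumption).
per_gate_totals = {
    ("isolation", "TM"): [48, 48, 48, 48, 48, 48, 45, 45, 40, 38, 35, 48],
    ("isolation", "BM"): [48, 48, 48, 48, 48, 48, 48, 43, 40, 38, 34, 48],
    ("context", "TM"):   [48, 48, 48, 47, 40, 18, 2, 0, 0, 0, 0, 48],
    ("context", "BM"):   [48, 48, 48, 48, 48, 45, 30, 5, 0, 0, 0, 48],
}

cell_totals = {key: sum(row) for key, row in per_gate_totals.items()}
grand_total = sum(cell_totals.values())

# Gate 12 is the intact syllable; everything before it is a gated fragment.
intact = sum(row[-1] for row in per_gate_totals.values())
gated = grand_total - intact

print(cell_totals)
print(grand_total, gated, intact)
```

The four cell totals (539, 539, 299 and 368) add up to the 1745 stimuli reported in the text, of which 192 are intact syllables and 1553 are gated fragments.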


5.3. Participants

Three groups of listeners, 16 native speakers of Standard Mandarin (PTH), 15 native speakers of Taiwan Mandarin (TM), and 10 nonnative American English learners of Mandarin (AE), participated in the perception experiment. The gender and age distribution for each of the three listener groups is presented in Table 5.3.

Listener lang.  Listener gender  N     Mean age  SD
TM              male             6.0   28.5      3.4
                female           9.0   31.1      10.2
                Total            15.0  30.1      8.1
PTH             male             5.0   23.8      2.6
                female           11.0  26.6      4.1
                Total            16.0  25.8      3.8
AE              male             5.0   23.2      5.4
                female           5.0   24.2      3.0
                Total            10.0  23.7      4.2
Total           male             16.0  25.4      4.5
                female           25.0  27.8      7.1
                Total            41.0  26.8      6.3

Table 5.3. The gender and age distribution for each listener group.

Six listeners in the PTH group were native speakers of BM, and the other ten were speakers of PTH who also spoke another regional dialect belonging to the Mandarin dialect family. Two of them, due to educational experience, had listening proficiency in Shanghainese, which belongs to the Wu dialect family. Nevertheless, all


listeners identified PTH as their main language of communication in daily life. They had, on average, lived in the United States for at least two years and all spoke English.

Almost all the native listeners in the TM group were bilingual speakers of TM and TSM, except for three listeners who identified themselves as having limited or only listening proficiency in TSM. TM listeners on average had lived in the United States for at least two and half years and all spoke English.

The non-native participants included ten Mandarin language students recruited

from the Department of East Asian Languages and Literatures at The Ohio State University. Among the ten nonnative AE listeners, four had studied Mandarin (PTH) for more than five years, and one of them was an American-born Chinese who had studied Mandarin for 10 years41 (group mean=6.12 years, excluding this subject; group mean=6.9 years, including this subject). In addition, four AE learners of Mandarin out of ten

had lived or studied in mainland China for at least 1 year (mean=1.86). The rest had studied Mandarin for less than 3 years (mean number of years of study was 2.4) and two of them had studied in Taiwan and mainland China for 8 and 3 months, respectively.

According to the level of Mandarin classes they have taken at the OSU at the time of testing, their Mandarin proficiency levels can be divided into two groups: advanced and intermediate. Although two non-native listeners had extensive experience with TM variety, it is assumed that they are all familiar with PTH since it is the language of instruction in the Department (and in the U.S.). None had a history of either

speech-language disorder or hearing disorders. They were paid for participating in the

41 While this subject (s81) had intermittently studied Mandarin for 10 years (4 hrs/week since the age of 8), she was placed in the Beginning level (Chinese 101-103) when she entered the Chinese Language Program at The Ohio State University. At this proficiency level, students are introduced to basic conversational Mandarin.

study at a rate of $10 per hour. Appendix 5.1 presents detailed linguistic background

of each participant in each listener group.

5.4. Experimental Procedure

The gating experiment was conducted using a custom-written Matlab program.

Stimuli were presented in a mixed-talker, mixed-dialect and mixed-context condition.

The stimuli were divided into two blocks: gated stimuli (1553 items) and intact stimuli

(192 items, i.e., baseline condition). Gated stimuli were always presented first, followed by the intact stimuli. Gated stimuli were presented in a gate-blocked format in which listeners heard the first gates of all stimuli before they heard the second gates of any stimuli, and so on; i.e., presentation was in ascending order of gate duration. The presentation order of speech fragments within each gate block was randomized across listeners. Similarly,

the presentation of intact syllables was also in a random order.
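The gate-blocked presentation scheme can be sketched as follows. This is an illustrative Python sketch rather than the Matlab program used in the experiment; the function name `presentation_order`, the tuple layout of the stimuli, and the use of a per-listener seed are assumptions made for the example.

```python
import random

def presentation_order(gated, intact, seed=None):
    """Build one listener's trial list.

    gated  : list of (token_id, gate_number) pairs
    intact : list of token_id values for the intact syllables
    Gate blocks run in ascending gate order, each block is shuffled
    independently, and the shuffled intact block comes last.
    """
    rng = random.Random(seed)  # hypothetical: one seed per listener
    order = []
    for gate in sorted({g for _, g in gated}):
        block = [item for item in gated if item[1] == gate]
        rng.shuffle(block)
        order.extend(block)
    tail = [(token, "intact") for token in intact]
    rng.shuffle(tail)
    return order + tail

# Hypothetical mini-set: three tokens, three gates each.
tokens = ["bo1_spk1", "da3_spk2", "ge4_spk3"]
gated = [(tok, g) for tok in tokens for g in (1, 2, 3)]
trials = presentation_order(gated, tokens, seed=7)
print(trials)
```

Shuffling within each block while keeping the blocks in ascending order means that a listener never hears a longer fragment of any token before having heard all the shorter fragments.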

Subjects were tested individually in a sound-treated booth. They were seated in

front of a computer monitor and heard the speech stimuli through high-quality

circumaural headphones (Sennheiser 600) at their original sampling rate (44.1 kHz) at a

standard loudness level used in experimental testing (70-72 dB SPL). They were

instructed to identify the tone of each stimulus presented by moving the mouse to the

appropriate response area labeled “Tone 1”, “Tone 2”, “Tone 3”, and “Tone 4” on the

computer monitor and then clicking the mouse button to register responses. They were

also instructed to make the best guess of the stimulus tones when stimuli presented were

ambiguous to them. Although their responses were not timed, they were told to respond


as quickly as possible without sacrificing accuracy. The inter-stimulus interval was 1 s

after the subject responded to the previous stimulus.

Participants completed a brief audiometric evaluation (a hearing screening)42 to make sure they had normal hearing and also filled out a survey on their linguistic background before participating in the tone-gating experiment. A practice session of 20 intact syllables, which were different from the test stimuli and were produced by male speakers, was given prior to the perception experiment to make sure that the subject was able to correctly identify the four lexical tones in Mandarin Chinese.

5.5. Data Analyses

Analyses of the perceptual results followed the standard procedures in word-gating

literature with appropriate modifications. Specifically, the following data were

collected and analyzed from each subject’s responses: (1) response accuracy at each gate,

(2) the tone isolation point (TIP), i.e., the gate at which tones were correctly identified at a predetermined level of accuracy, and (3) confusion patterns at each gate. In particular, five TIPs were defined as the temporal points or gates at which the tone of the target stimulus was correctly identified in 33%, 50%, 75%, 90% and 100% of responses; that is,

TIP33%, TIP50%, TIP75%, TIP90% and TIP100%. This is different from the TIP traditionally defined in the gating literature (e.g., Lee, 2000; Wu and Shu, 2003; Lai and

Zhang, 2008) as the temporal points or gates at which the tones are correctly identified

without subsequent changes. In our data analyses, TIP100% corresponds to the

traditional TIP. For tones that still could not be correctly identified with 100% accuracy

42 The brief audiometric evaluation consisted of testing subjects on several pure tones at 500, 1000, 2000, and 300 Hz presented at 25 dB SPL to each ear in random order. The performance on the auditory screening was evaluated as either Pass or No Pass.

rate after presenting all eleven gates, gate 12 (the maximum gate number, 11, plus 1) was assigned as the number of gates required for correct identification.
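One way to operationalize the TIP definitions above is sketched below. This is an assumption-laden illustration, not the analysis code used in the study: the function name `tone_isolation_point` and the example accuracy values are hypothetical, and the text does not state whether accuracy must stay at or above the criterion at later gates, so that requirement is exposed as an optional `sustained` flag.

```python
def tone_isolation_point(accuracy_by_gate, threshold, fallback_gate=12,
                         sustained=False):
    """Return the first gate (1-based) whose accuracy meets the threshold.

    accuracy_by_gate : proportions correct at Gates 1..11
    threshold        : criterion, e.g. 0.75 for TIP75%
    sustained        : if True, accuracy must also stay at or above the
                       threshold at all later gates (an assumption)
    If the criterion is never met, fallback_gate (12) is returned,
    matching the rule described in the text.
    """
    for i, acc in enumerate(accuracy_by_gate):
        if acc >= threshold:
            if not sustained or all(a >= threshold for a in accuracy_by_gate[i:]):
                return i + 1
    return fallback_gate

# Hypothetical accuracy profile for one tone (not study data).
accuracy = [0.25, 0.30, 0.40, 0.55, 0.70, 0.80, 0.75, 0.90, 0.95, 1.0, 1.0]
for criterion in (0.33, 0.50, 0.75, 0.90, 1.00):
    print(criterion, tone_isolation_point(accuracy, criterion))
```

For the hypothetical profile, the five criteria are reached at Gates 3, 4, 6, 8 and 10, respectively. A profile that never reaches a criterion is assigned Gate 12.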

In addition, tone responses were tabulated into confusion matrices from which the error patterns could be examined on a tone-by-tone, dialect-by-dialect, and context-by-context basis. Of particular interest is the direction of identification confusions, which provides information on the effects of cross-dialectal differences in tone contours on identification.
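A minimal sketch of this tabulation step is given below. The function names and the trial format (one stimulus-tone/response-tone pair per trial) are assumptions made for illustration; the counts in the example are taken from the TM Tone 3 row for BM listeners in Table 6.1.

```python
from collections import Counter

TONES = (1, 2, 3, 4)

def confusion_matrix(trials):
    """Tabulate (stimulus_tone, response_tone) pairs into a 4x4 matrix.

    Rows are stimulus tones and columns are response tones.
    """
    counts = Counter(trials)
    return [[counts[(s, r)] for r in TONES] for s in TONES]

def row_percentages(matrix):
    """Convert each row of counts to percentages of that row's total."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([100.0 * c / total if total else 0.0 for c in row])
    return result

# Responses of BM listeners to intact TM Tone 3 (counts from Table 6.1):
# 1 x Tone 1, 0 x Tone 2, 97 x Tone 3, 58 x Tone 4.
trials = [(3, 1)] * 1 + [(3, 3)] * 97 + [(3, 4)] * 58
matrix = confusion_matrix(trials)
percentages = row_percentages(matrix)
print(matrix[2], [round(p, 1) for p in percentages[2]])
```

Reading along a row shows the direction of a confusion: here the off-diagonal mass for TM Tone 3 falls almost entirely in the Tone 4 column.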

Mixed-design ANOVAs were performed on the selected TIPs with listener language (BM, TM, and AE) as the between-subject factor and speaker dialect (BM and TM) and stimulus tone (1, 2, 3, and 4) as within-subject factors. When there were significant violations of sphericity, Greenhouse-Geisser adjusted degrees of freedom and F-tests were used for the reported main effects and interactions. In addition to significance values, a measure of effect size, partial eta squared (η2), was reported. When a main effect or an interaction was significant, additional ANOVAs on selected subsets of the data (with appropriate F tests) or Bonferroni post hoc tests were used for pair-wise means comparisons (so that the family-wise Type I error rate was kept at 5%).
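As a worked illustration of the effect-size measure, partial eta squared can be recovered from a reported F ratio and its degrees of freedom with the standard identity η2 = (F × df_effect) / (F × df_effect + df_error). The sketch below applies that identity to two F values reported in §6.2; the function name is an assumption, and the statistics package used for the original analyses is not specified in the text.

```python
def partial_eta_squared(f_value: float, df_effect: float, df_error: float) -> float:
    """Partial eta squared computed from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# Main effect of tone on TIP75%: F(3, 93) = 54.17
eta_tone = partial_eta_squared(54.17, 3, 93)

# Main effect of speaker dialect on TIP75%: F(1, 31) = 52.05
eta_dialect = partial_eta_squared(52.05, 1, 31)

print(round(eta_tone, 2), round(eta_dialect, 2))
```

Both values round to the effect sizes reported for those effects (.64 and .63).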


Chapter 6 Perception Study: Results and Discussion

Perceptual results of the gating experiments for each listener group are reported in this section. Seven listeners (BM=3, TM=2, and AE=2) were excluded from data analyses due to errors in data collection, dropout, or identification accuracy below a predefined threshold43. Only responses to the isolated test syllables will be presented and discussed here.

Discussion of results is organized in terms of (1) response accuracy at each gate, (2) TIPs, and (3) tone confusions at each gate. For TIPs and tone confusions, only a subset of data at selected gates will be presented. Specifically, statistical analyses on TIP75% were performed and will be reported. The patterns of tone confusions at gates 1 and 5 and in the intact syllable condition are shown; the rest are presented in Appendix D. The most critical results will be interpreted in terms of the match and/or mismatch between speaker dialect and listener dialect. For example, identification of TM tones by a native BM or American English listener is a speaker-listener dialect mismatch. Responses were collapsed across target tone syllables produced by speakers of one regional dialect of Mandarin Chinese. Unless otherwise specified, any main or interaction effects found to be significant at a .05 alpha level will be reported and discussed.

As mentioned in §2.6.2, the hypotheses for the perception study include:

43 Exclusions were made due to (1) errors in data collection (2 listeners), (2) drop-out during data collection due to inability to finish the entire experiment (1 listener), and (3) an accuracy rate below 50% in identifying intact tones of his/her own dialect (4 listeners).

(1) Listeners require more acoustic information to correctly identify tones,

given speaker variability and a mismatch of speaker-listener dialect. Tone

identification by the non-native listeners will be compromised to a larger

extent.

(2) Native listeners should be able to make low-onset and high-onset distinctions relatively quickly at earlier gates. Non-native listeners are predicted to be able to deal with speaker variability and make a low-onset and high-onset distinction, but at later gates than their native counterparts.

(3) Both native and nonnative listeners should perform better with the stimuli

produced by speakers of the same dialect. In particular, nonnative listeners

were hypothesized to perform better with BM stimuli since it is the variety

of Mandarin Chinese they are familiar with through textbooks and their

instructors.

(4) For the high-onset tone pair, Tone 1 should be easier to recognize than the contour Tone 4. For the low-onset tone pair, it is hypothesized that Tone 3 will be identified earlier than Tone 2 if listeners use the low register as a cue for the perception of Tone 3 (e.g., Whalen and Xu, 1992). Based on the results obtained from the production study, different tonal realizations of isolated Tone 3 in these two dialects could significantly affect the identification of partial and intact Tone 3 stimuli.


6.1. Response Accuracy at Each Gate

Figure 6.1 presents identification accuracy for BM and TM tones at each gate for the three language groups of listeners. As can be seen from the figure, the identification functions are relatively smooth, especially given the task difficulty inherent in this type of experimental design (a multiple-talker, mixed-dialect and mixed-context gating paradigm).

By visual inspection, Tones 1 and 4 seemed to be correctly identified with less acoustic information (i.e., fewer gates) than the other two tones, as shown by their steeper slopes. This relationship was further explored by examining TIP75%, i.e., the number of gates required to achieve 75% tone identification accuracy, and performing statistical analyses on this set of data (see §6.2). Given the evident difficulty of tone identification, especially for Tones 2 and 3, TIP75% was chosen as the critical accuracy level, in contrast with previous word-gating studies that have traditionally analyzed TIP100%. Analyses for the baseline condition will also be discussed (see §6.2.1 and §6.3.1).


Figure 6.1. Tone identification accuracy as a function of gates and tones, shown for BM and TM tones and each listener group separately.

6.2. Tone Identification Points (TIPs): TIP75%

Figure 6.2 displays the number of gates required to reach the tone identification point with 75% correct (TIP75%) as a function of speaker dialect, tone type, and listener dialect. The overall mixed-design ANOVA was first completed with speaker dialect and tone as within-subject factors and listener dialect as a between-subjects factor. The main effect of tone was significant [F(3, 93)= 54.17, p<.001, η2=.64]. On average, Tone 1 had the earliest TIP75% (3.7 gates, SD=.31), followed by Tone 4 (5.1 gates, SD=.12), Tone 3 (7.3 gates, SD=.43), and Tone 2 (9 gates, SD=.37). All pair-wise means comparisons were significant.

There was a main effect of speaker dialect [F(1, 31)=52.05, p<.001, η2=.63].

TM tones, on average, had longer TIP75% than the BM tones. In other words, TM

tones required, on average, more gates or acoustic information to be identified with an

accuracy rate of 75% than BM tones (TM=7.0 gates or 210 ms of the vowel and BM=5.5

gates or 165 ms of the vowel). Post hoc analysis showed a significant (p<.001) difference in the pair-wise comparison. The speaker dialect × tone interaction was significant [F(2.24, 93)= 25.36, p<.001, η2=.45]. As expected, the effect of speaker dialect was not uniform across the four tones. As can be seen from Figure 6.2, Tones 2 and 3 in either dialect had longer TIP75% than did the other two tones.


Figure 6.2. Average number of gates required for 75% identification accuracy (i.e., TIP75%) with standard error bars as a function of tone, dialect, and listener dialect. TM is represented by blue color, BM by red color, and AE by gray color.


In addition, the main effect of listener dialect was significant [F(2, 31)= 3.52, p=.04, η2=.19]. TM listeners had the earliest TIP75% (M=5.7, SD=.29), followed by BM (M=6.2, SD=.29) and AE listeners (M=6.9, SD=.36). Post hoc pair-wise means comparisons showed that TM listeners required significantly fewer gates than the non-native AE listeners to reach TIP75% (p=.013). The effect of listener dialect also varied by tone, i.e., there was a significant tone × listener dialect interaction [F(4.75, 93)= 3.49, p=.008, η2=.18]. The interaction between speaker dialect and listener dialect was significant [F(2, 31)= 3.72, p=.04, η2=.19]. As conjectured, a match or mismatch between speaker dialect and listener dialect can facilitate or inhibit accurate tone identification responses.

Most importantly, the speaker dialect × tone × listener dialect three-way interaction was significant [F(4.47, 93)= 3.05, p=.019, η2=.16]. Figure 6.2 shows that

dialect variability in Tone 3 minimally affected TM listeners but greatly impacted BM

and AE listeners. Given that our primary interest is to investigate how speaker-listener dialect match and mismatch affect the identification of particular tones, separate two-way repeated-measures ANOVAs were performed on each individual tone. Results showed that while there was a main effect of speaker dialect on processing each tone (overall p<.005), the effect of listener dialect was significant only for Tone 3 identification (p=.004). In addition, the interaction between the two factors was significant only for Tone 3 (p=.008). There was no speaker dialect × listener dialect interaction effect evident in the identification of the other three tones, and no further analyses or interpretations will be provided for them.


Separate one-way ANOVAs on Tone 3 in TM and BM revealed that the three listener groups required different amounts of acoustic information to identify TM Tone 3 [F(2,33)= 9.775, p=.001], but not BM Tone 3 [F(2,33)= 1.763, p=.188]. Post hoc tests showed that the TM listeners had significantly earlier TIP75% (M=5.5 gates) than the BM (M=10 gates) and AE (M=10.8 gates) counterparts in identifying TM Tone 3. That is, TM listeners required significantly fewer gates than the BM and AE counterparts to reach 75% correct identification of TM Tone 3. Native BM and non-native AE listeners needed about the same amount of acoustic-phonetic information to identify TM Tone 3.

These findings suggest that while three groups of listeners required, on average, more acoustic-phonetic information to correctly identify TM tones with 75% accuracy, especially Tones 2 and 3, identification of the truncated TM Tone 3 was most susceptible to a speaker dialect-listener dialect mismatch. Native TM listeners did have a speaker dialect-listener dialect match advantage in terms of requiring less information in achieving the 75% accuracy level. In contrast, a mismatch between speaker dialect and listener dialect prolonged, to the same extent, the number of gates to achieve TIP75% for both BM and AE listeners.

While both BM and AE listeners required longer stretches of the stimulus token to correctly identify fragmented TM Tone 3, it was still not clear whether inter-talker variability would affect the accuracy of identifying the intact version of Tone 3. Therefore, similar statistical analyses were performed on the identification responses to intact tone syllables.


6.2.1. Baseline Condition

Figure 6.3 displays tone identification accuracy for intact syllables as a function of speaker dialect, tone, and listener dialect. The overall mixed-design ANOVA on arcsine-transformed percent correct values44 revealed significant main effects of speaker dialect [F(1, 31)= 63.23, p<.001, η2=.671] and tone type [F(3, 93)= 12.21, p<.001, η2=.28] and a significant interaction between them [F(1.95, 93)= 38.62, p<.001, η2=.56]. Post hoc pair-wise comparisons showed that, on average, BM tones were identified with higher accuracy than TM tones (BM=100% and TM=90%, p<.001). Tone 3 was identified with the least accuracy, around 85% (p≤.001).

Although listeners’ dialect background did not have a significant effect on identifying full tone syllables [F(2, 31)= 2.44, p=.10, η2=.14], this effect varied by speaker dialect [F(2, 31)=5.22, p=.01, η2=.25] and tone type [F(6, 93)= 4.80, p<.001, η2=.56]. The three-way interaction of speaker dialect × tone × listener dialect was significant [F(3.91, 93)= 3.48, p=.01, η2=.18]. Additional within-subject ANOVAs were carried out on each tone to further investigate the effect of speaker-listener dialect match and mismatch on the identification of individual tones. Results showed a significant effect of speaker dialect (overall p<.001) and a significant interaction between speaker dialect and listener dialect for all tones except Tone 2 (p=.812).

44 Raw response scores were transformed into rationalized arcsine units (rau) prior to statistical analysis (Studebaker, 1985).
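The rationalized arcsine transform can be sketched as follows, using the formula usually given for Studebaker's (1985) method: θ = arcsin√(X/(N+1)) + arcsin√((X+1)/(N+1)), rau = (146/π)θ − 23, where X is the number correct out of N items. The function name and the choice of N = 12 (4 words × 3 speakers per tone and dialect) are assumptions for illustration; the text does not state the N that entered the transform.

```python
import math

def rationalized_arcsine_units(n_correct: int, n_items: int) -> float:
    """Rationalized arcsine transform of a score of n_correct out of n_items."""
    theta = (math.asin(math.sqrt(n_correct / (n_items + 1)))
             + math.asin(math.sqrt((n_correct + 1) / (n_items + 1))))
    return (146.0 / math.pi) * theta - 23.0

# Assumed N = 12 items per listener, tone, and dialect.
perfect = rationalized_arcsine_units(12, 12)   # 100% correct
half = rationalized_arcsine_units(6, 12)       # 50% correct
print(round(perfect, 1), round(half, 1))
```

Under this assumption a perfect score maps to roughly 109.9 rau, which would account for transformed scores above 100 in the results, while scores near 50% are left almost unchanged.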

Figure 6.3. Identification accuracy for intact tones (raw percent correct scores) as a function of tone, dialect, and listener dialect. TM is represented by blue color, BM by red color, and AE by gray color.


Individual one-way ANOVAs on Tones 1, 3 and 4 in each (speaker) dialect

showed significant listener group differences in the mean identification scores (in rau) for the following tones: Tone 1 in BM [F(2, 31)=6.67, p=.004], Tone 3 in TM [F(2, 31)=4.72,

p=.02], and Tone 4 in BM [F(2, 31)=9.67, p=.001] and TM [F(2, 31)= 6.34, p=.005].

Bonferroni-adjusted post hoc tests showed that TM listeners (109.9%) outperformed their AE counterparts (93.6%) in identifying BM Tone 1 (p=.004).

Of particular interest is the identification accuracy for Tones 3 and 4. TM

listeners, on average, had higher accuracy in identifying TM Tone 3 (83%) than did the

BM (61.8%) and AE (60.9%) counterparts. Tukey HSD post hoc analysis showed significant differences in the identification of TM Tone 3 between TM and BM listeners (p=.028) and between TM and AE listeners (p=.048), whereas the BM-AE pair-wise comparison did not reach significance. In addition, mean identification accuracy for TM Tone 4 was significantly lower for the TM listeners (96.8%) than for the BM listeners (109.9%, p=.004). Similarly, accuracy in identifying BM Tone 4 was significantly lower for the TM listeners (81.6%) than for the BM (105.6%, p<.001) and the AE (97.2%, p=.048) listeners.

These results confirm that given a speaker-listener dialect mismatch, both BM

and AE listeners have difficulty in correctly identifying an intact TM Tone 3 in

isolation. Regardless of the source dialect of Tone 4 stimuli, TM listeners had

significantly lower identification accuracy than did the other two listener groups.

Nevertheless, TM listeners were still more accurate in identifying Tone 4 in TM than

in BM. Examination of the confusion patterns of Tones 3 and 4 will help explain the misidentification of this pair of tones.


In addition to the effect of speaker dialect on tone identification, the use of multiple-speaker stimuli also introduces another source of speaker variability: variations in f0 range. A more stringent assessment of how listeners deal with this aspect of speaker variability (and thus speaker normalization) in tone identification requires a direct comparison between listeners’ identification of single- and multi-speaker tone stimuli without prior or repeated exposure to a speaker’s voice. However, such a test cannot be completed in the current study due to the nature of the experimental design.

6.3. Tonal Confusions

In order to examine the patterns of tone identification errors made by each listener group, tone identification responses obtained from each gate were tabulated to generate a series of 4×4 confusion matrices. Separate confusion matrices were created for each of the three listener groups, arranged by dialect and gate. Confusion matrices at three selected temporal locations (gate 1, gate 5, and the intact syllable condition) were examined to provide a sketch of the time-course of tone identification and confusion. In addition, patterns of tonal confusions at gate 1 will provide information on whether the listener is able to deal with speaker variability and thus engage in f0 height estimation at the tonal onset.

Tables 6.1 - 6.3 show the confusion matrices for TM, BM and AE listeners in the

intact syllable condition, and at gates 1 and 5. The maximum number of responses in each cell is 156 (4 words per tone × 3 speakers × 13 listeners) for two native listener groups, and 96 (4 words per tone × 3 speakers × 8 listeners) for the non-native AE listeners.


6.3.1. Confusion patterns in the baseline condition

As shown in the identification of intact syllables discussed above (§6.2.1) and the confusion matrices for intact syllables in Table 6.1, the three groups of listeners had relatively higher accuracy in identifying intact BM tones and showed little tonal confusion. Nevertheless, TM listeners tended to identify Tone 4 as Tone 3 (18.6%). This confusion made BM Tone 4 identification by the TM listeners significantly lower than that of the other two groups of listeners (see §6.2.1).

TM tones were, on average, identified with relatively lower accuracy.

Specifically, BM and AE listeners had statistically lower accuracy in identifying intact

TM Tone 3 and they systematically misidentified TM Tone 3 as Tone 4. TM Tone 3 →

Tone 4 misidentification was at a remarkably high rate of 37% for the BM listeners.

Similarly, AE listeners misidentified it as Tone 4 (29%) and occasionally as other tones, as shown in Table 6.1 (c). While the TM listeners showed significantly more accurate identification of TM Tone 3 (83%), it was still misidentified, to a lesser extent, as Tone 4 (16%). Furthermore, TM listeners also at times misidentified TM Tone 4 as Tone 3 (6.4%). While the misidentification of this tone pair was relatively small in magnitude, TM listeners’ identification of TM Tone 4 was significantly inferior to that by BM and AE listeners.


(a) BM listeners
Stimulus     Response
             Tone 1         Tone 2         Tone 3         Tone 4
BM Tone 1    149 (95.5%)    6 (3.8%)       0 (0%)         1 (0.6%)
BM Tone 2    0 (0%)         154 (98.7%)    0 (0%)         2 (1.3%)
BM Tone 3    0 (0%)         2 (1.3%)       154 (98.7%)    0 (0%)
BM Tone 4    0 (0%)         0 (0%)         3 (1.9%)       153 (98.1%)
TM Tone 1    136 (87.2%)    18 (11.5%)     2 (1.3%)       0 (0%)
TM Tone 2    3 (1.9%)       146 (93.6%)    6 (3.8%)       1 (0.6%)
TM Tone 3    1 (0.6%)       0 (0%)         97 (62.2%)     58 (37.2%)
TM Tone 4    0 (0%)         0 (0%)         0 (0%)         156 (100%)

(b) TM listeners
Stimulus     Response
             Tone 1         Tone 2         Tone 3         Tone 4
BM Tone 1    156 (100.0%)   0 (0%)         0 (0%)         0 (0%)
BM Tone 2    0 (0%)         152 (97.4%)    4 (2.6%)       0 (0%)
BM Tone 3    0 (0%)         7 (4.5%)       149 (95.5%)    0 (0%)
BM Tone 4    0 (0%)         0 (0%)         29 (18.6%)     127 (81.4%)
TM Tone 1    141 (90.4%)    14 (9.0%)      1 (0.6%)       0 (0%)
TM Tone 2    5 (3.2%)       146 (93.6%)    5 (3.2%)       0 (0%)
TM Tone 3    0 (0%)         1 (0.6%)       130 (83.3%)    25 (16.0%)
TM Tone 4    1 (0.6%)       0 (0%)         10 (6.4%)      145 (92.9%)

(c) AE listeners
Stimulus     Response
             Tone 1         Tone 2         Tone 3         Tone 4
BM Tone 1    87 (90.6%)     6 (6.3%)       2 (2.1%)       1 (1.0%)
BM Tone 2    2 (2.1%)       87 (90.6%)     7 (7.3%)       0 (0%)
BM Tone 3    0 (0%)         8 (8.3%)       88 (91.7%)     0 (0%)
BM Tone 4    0 (0%)         1 (1.0%)       6 (6.3%)       89 (92.7%)
TM Tone 1    87 (90.6%)     5 (5.2%)       3 (3.1%)       1 (1.0%)
TM Tone 2    2 (2.1%)       83 (86.5%)     11 (11.5%)     0 (0%)
TM Tone 3    2 (2.1%)       7 (7.3%)       59 (61.5%)     28 (29.2%)
TM Tone 4    1 (1.0%)       0 (0%)         1 (1.0%)       94 (97.9%)

Table 6.1. Confusion matrices displaying observed tone identification responses and response rate (%) for (a) BM listeners, (b) TM listeners, and (c) AE listeners in the baseline condition (intact syllables). The numbers of correct responses to each tone lie along the major diagonal.

In order to better understand the motivation of these perceptual confusions, these

tonal confusions need to be interpreted with reference to the acoustic properties of these

tones. Acoustic analyses on the four tones produced in isolation (see Chapter 4) showed that TM Tone 3 has a low-falling contour, which is opposed to the low-falling-rising shape of BM Tone 3. Consequently, a citation TM Tone 3 has an f0 contour similar to

that of Tone 4 in either TM or BM and contrasts with Tone 4 only in terms of f0 height

(register). Similarity in f0 contours of a citation TM Tone 3 and any Tone 4 (in either dialect and phonetic contexts) may easily lead to perceptual confusions between these

two tones when listeners identified multiple-speaker stimuli since identification becomes

a task of locating relative f0 within a speaker’s pitch range.

Perceptual confusions due to the high-low f0 distinction between a TM Tone 3 and a Tone 4 of either dialect resulted in poor identification of TM Tone 3 and systematic TM Tone 3 → Tone 4 confusion by both BM and AE listeners. Although TM listeners showed a native dialect advantage in terms of significantly better TM Tone 3 identification, TM Tone 3 was still misidentified as Tone 4 at a rate of 16%.

Another pair of tonal misidentification that might have been due to acoustic

ambiguity between TM Tone 3 and Tone 4 is the systematic misidentification of Tone 4

as Tone 3 by TM listeners. This also accompanies significantly lower accuracy for

Tone 4 identification by TM listeners (than the other two groups of listeners) regardless

of the source dialect of the speech tokens.

These findings suggest that acoustic ambiguity of this pair had differential impact

on the tone identification by different groups of listeners. While acoustic ambiguity between this tone pair led to TM Tone 3 → Tone 4 confusion for the BM and AE listeners, it did not lead to noticeably lower identification scores for Tone 4 or to a Tone 4 → Tone 3 confusion45. In contrast, TM listeners made not only TM Tone 3 → Tone 4 but also Tone 4 → Tone 3 misidentifications. One possible explanation is

that TM listeners may have learned that a TM Tone 3 produced in isolation has a

low-falling f0 pattern (TM Tone 3 is often realized as low-falling) that is similar to that of

Tone 4. As a result, when identifying multiple-speaker and mixed-dialect Tone 4 stimuli, TM listeners were more likely to confuse Tone 4 with Tone 3 (regardless of the source

dialect). Nevertheless, TM listeners were still more accurate in identifying Tone 4 in

TM than in BM, which implies that (TM) listeners still enjoy native advantage of

speaker-listener match when dealing with acoustic ambiguities.

Furthermore, TM listeners had relatively high accuracy of identifying BM Tone 3

(95.5%), which is not significantly different from that of the native BM listeners. One plausible explanation is that TM listeners are aware that a citation Tone 3 can be realized

with either a low-falling-rising or a low-falling f0 pattern. The former is the prescriptive

form that has been taught in schools. In addition, some native TM speakers always

produce a dipping tone or use a combination of these two tonal patterns for a citation

Tone 3, as shown in the previous acoustic studies on TM tones (e.g., Sanders, 2008; Fon

et al., 2004). Therefore, TM listeners should be familiar with, or at least be aware of,

these two phonetic variants of Tone 3 when they are produced in isolation.

Unlike TM listeners, who have linguistic experience with the low-falling f0 pattern as a phonetic variant of citation Tone 3, BM and AE listeners expect a citation Tone 3 to have a dipping contour. Consequently, when they identified a citation TM Tone 3, the low-falling f0 pattern tended to be misidentified as a Tone 4, which also has a falling f0 contour. Moreover, unlike their TM counterparts, who exhibited Tone 4 → Tone 3 misidentification, BM and AE listeners never misidentified Tone 4 as a TM Tone 3. This further confirms that, for BM listeners, a dipping (low-falling-rising) shape is the prototypical f0 contour for a citation Tone 3.

45 It is worth noting that when identifying BM Tone 4, AE listeners also at times misidentified it as Tone 3, at a rate of 6.3%.

In summary, tone confusion patterns, coupled with identification performance and acoustic analyses on the four intact tones in two regional dialects of Mandarin, have confirmed that dialectal divergence in citation Tone 3 production had differential effects on identification of intact Tones 3 and 4 in BM and TM by the three language groups of listeners. When identifying a citation TM Tone 3 and a Tone 4 in either dialect (i.e., two contour tones that vary only in f0 register), BM and AE listeners consistently misidentified the low-f0 onset tone as the high-f0 onset one but not vice versa. However, TM listeners made not only occasional Tone 3 → Tone 4 but also Tone 4 → Tone 3 misidentifications. The different directions of tonal confusion may be due to differential linguistic experience with phonetic variants of a citation Tone 3.

6.3.2. Confusion patterns at gate 5

Generally speaking, tone responses showed more confusions when listeners identified acoustically shortened syllables. In particular, tonal confusions at this temporal point were more prominent for Tones 2 and 3 than for the other two tones. This is expected since analyses on TIP75% showed that these two tones needed, on average, more than seven gates to achieve 75% correct responses (see §6.2). This is due to the fact that identification of these two contour tones may require a percept of f0 movement (i.e., f0 contour). Table 6.2 shows a general pattern of tonal confusions when the three listener groups identified shorter BM and TM tones.

When identifying BM tones, all three listener groups consistently misidentified Tone 2 as Tone 1 approximately half of the time (out of 12 tokens of BM Tone 2). Specifically, Tone 2 → Tone 1 confusion occurred more often for the BM listeners (58%) than for the AE and TM listeners (around 47%). Another systematic confusion is the Tone 3 → Tone 4 misidentification. Again, this confusion was more apparent for the BM listeners (36% vs. about 20% for the other two listener groups). It seemed that a mismatch between speaker and listener dialect did not impair identification of BM Tone 3, since TM listeners identified it more accurately than the BM listeners (67% vs. 49%). Even the AE listeners achieved relatively high accuracy in identifying BM Tone 3; their confusions were scattered among the other three tones.

General patterns of tone confusions were also exhibited in the identification of truncated TM tones by the three listener groups. While TM Tone 2 was frequently misidentified as Tone 1 by the three listener groups, it was also misidentified, to a lesser degree, as Tone 3 by AE and TM listeners (22%). Confusion patterns for TM Tone 3 are more complicated. While the BM listeners frequently misidentified TM Tone 3 as Tones 4 and 1 (56% combined), the AE listeners often misidentified it as Tone 1, 2, or 4. Similarly, misidentifications of TM Tone 3 by TM listeners were distributed over the other three tones even though they were relatively more accurate than other listeners in identifying TM Tone 3 (60%). In order to relate the perceptual results to the acoustic properties of the fragmented tones, the f0 contours of the truncated test stimuli used at gate 5 in the gating experiments are plotted in Figure 6.4.


(a) BM listeners
Stimulus     Response
             Tone 1         Tone 2         Tone 3         Tone 4
BM Tone 1    146 (93.6%)    5 (3.2%)       1 (0.6%)       4 (2.6%)
BM Tone 2    91 (58.3%)     54 (34.6%)     11 (7.1%)      0 (0%)
BM Tone 3    14 (9.0%)      9 (5.8%)       77 (49.4%)     56 (35.9%)
BM Tone 4    23 (14.7%)     4 (2.6%)       6 (3.8%)       123 (78.8%)
TM Tone 1    140 (89.7%)    12 (7.7%)      2 (1.3%)       2 (1.3%)
TM Tone 2    93 (59.6%)     37 (23.7%)     23 (14.7%)     3 (1.9%)
TM Tone 3    39 (25.0%)     9 (5.8%)       59 (37.8%)     49 (31.4%)
TM Tone 4    16 (10.3%)     7 (4.5%)       7 (4.5%)       126 (80.8%)

(b) TM listeners
Stimulus     Response
             Tone 1         Tone 2         Tone 3         Tone 4
BM Tone 1    134 (85.9%)    17 (10.9%)     1 (0.6%)       4 (2.6%)
BM Tone 2    71 (45.5%)     68 (43.6%)     15 (9.6%)      2 (1.3%)
BM Tone 3    9 (5.8%)       9 (5.8%)       104 (66.7%)    34 (21.8%)
BM Tone 4    20 (12.8%)     14 (9.0%)      18 (11.5%)     104 (66.7%)
TM Tone 1    124 (79.5%)    22 (14.1%)     4 (2.6%)       6 (3.8%)
TM Tone 2    64 (41.0%)     58 (37.2%)     33 (21.2%)     1 (0.6%)
TM Tone 3    19 (12.2%)     20 (12.8%)     94 (60.3%)     23 (14.7%)
TM Tone 4    13 (8.3%)      3 (1.9%)       13 (8.3%)      127 (81.4%)

(c) AE listeners
Stimulus     Response
             Tone 1         Tone 2         Tone 3         Tone 4
BM Tone 1    81 (84.4%)     6 (6.3%)       5 (5.2%)       4 (4.2%)
BM Tone 2    45 (46.9%)     33 (34.4%)     14 (14.6%)     4 (4.2%)
BM Tone 3    10 (10.4%)     11 (11.5%)     57 (59.4%)     18 (18.8%)
BM Tone 4    20 (20.8%)     5 (5.2%)       4 (4.2%)       67 (69.8%)
TM Tone 1    73 (76.0%)     16 (16.7%)     3 (3.1%)       4 (4.2%)
TM Tone 2    44 (45.8%)     31 (32.3%)     21 (21.9%)     0 (0%)
TM Tone 3    18 (18.8%)     18 (18.8%)     47 (49.0%)     13 (13.5%)
TM Tone 4    20 (20.8%)     3 (3.1%)       5 (5.2%)       68 (70.8%)

Table 6.2. Confusion matrices displaying observed tone identification responses and response rate (%) for (a) BM listeners, (b) TM listeners, and (c) AE listeners at Gate 5.

Figure 6.4. Truncated stimuli of Tones 2, 3, and 4 presented at gate 5. TM speakers are represented with blue color and BM speakers with red color.


The majority of the Tone 2 tokens in both BM and TM were relatively flat during the first 150 ms of the vowel, except for a couple of syllables such as /t/ and /tu/ which have a slight initial fall in f0. It is not surprising that listeners frequently misidentified Tone 2 as Tone 1 since these two acoustically shortened tones have similar f0 contour shapes and differ only in f0 height in the early gates.

TM Tone 2 was also occasionally misidentified as Tone 3 by the TM and AE

listeners in addition to the more common confusions with Tone 1. Closer examination

of TM Tones 2 and 3 revealed that some of the tokens share not only similar f0 onset but

also f0 contour shape. This can be expected on the basis that both of these two phonological contour tones occupy the low f0 (register) region and the contour information is absent in this set of gated stimuli, which have relatively short duration.

On the other hand, BM Tone 2 was seldom misidentified as Tone 3. Unlike the

acoustic similarities between TM Tones 2 and 3, these two fragmented tones in BM

have very different f0 patterns, except that both of them have low f0 onset. For more

than half of the BM Tone 3, the f0 drops quickly to its minimum (with a steeper slope)

and contains the initial portion of the f0 break, i.e., the dipping portion that is

characteristic of a typical BM Tone 3 (see Figure 4.5). This will give rise to a percept

of creaky voice. Although different in f0 height or register, 75 percent of the BM Tone

3 fragments in this set have a contour shape which is more similar to that of Tone 4 than to that of Tone 2, which occupies the low register region but has a relatively flat f0 contour. This may explain why, for the BM listeners, BM Tone 2 was often misidentified as Tone 1 but not Tone 3, and BM Tone 3 was usually confused with Tone 4 but not Tone 2.


When comparing TM Tone 3 with its Tone 4 counterparts, it should be noted that they share a similar tonal contour and differ only in tonal register and the slope of the f0 drop.

Nevertheless, TM Tone 3 tended to be confused as one of the other three tones. It is possible that the slope in TM Tone 3 is not as steep as the BM counterparts, which warrants percepts of contour tone within 150 ms of the vowel.

6.3.3. Confusion patterns at gate 1

Confusion matrices for the three listener groups at gate 1 are shown in Table 6.3. As expected, tone identification accuracy was relatively low and responses exhibited a great deal of confusion when listeners identified onset-only tone stimuli. This is because there is little dynamic f0 movement within the first 30 ms of the vowel and the f0 contour is relatively flat (ref. Figure 6.3).

When identifying onset-only syllables, the three listener groups made, regardless of the source dialect of the speech tokens, similar identification errors, and the following patterns of tonal confusions emerged. Tones 1 and 4 were confused with each other, with systematic Tone 4→Tone 1 misidentification more prominent. While BM listeners tended to misidentify Tones 2 and 3 as either Tone 1 or Tone 4, both TM and AE listeners made systematic confusions between Tones 2 and 3. The latter group of listeners also occasionally misidentified Tones 2 and 3 as Tone 4, and BM Tone 1 as Tone 2. Table 6.4 summarizes the direction of systematic confusions between tones as a function of speaker and listener dialect at these selected temporal points.


(a) BM listeners (b) TM listeners (c) AE listeners Response Response Response BM Tone 1 Tone 2 Tone 3 Tone 4 BM Tone 1 Tone 2 Tone 3 Tone 4 BM Tone 1 Tone 2 Tone 3 Tone 4 Stimulus Stimulus Stimulus 1 90 12 7 47 1 58 38 14 46 1 35 25 12 24 57.7% 7.7% 4.5% 30.1% 37.2% 24.4% 9.0% 29.5% 36.5% 26.0% 12.5% 25.0% 2 60 17 27 52 2 25 42 51 38 2 20 29 28 19 38.5% 10.9% 17.3% 33.3% 16.0% 26.9% 32.7% 24.4% 20.8% 30.2% 29.2% 19.8% 3 39 19 32 66 3 27 38 55 36 3 12 18 43 23 25.0% 12.2% 20.5% 42.3% 17.3% 24.4% 35.3% 23.1% 12.5% 18.8% 44.8% 24.0% 4 79 18 11 48 4 52 30 28 46 4 32 23 21 20 50.6% 11.5% 7.1% 30.8% 33.3% 19.2% 17.9% 29.5% 33.3% 24.0% 21.9% 20.8% 178 Response Response Response TM Tone 1 Tone 2 Tone 3 Tone 4 TM Tone 1 Tone 2 Tone 3 Tone 4 TM Tone 1 Tone 2 Tone 3 Tone 4 Stimulus Stimulus Stimulus 1 86 20 15 35 1 66 26 19 45 1 48 24 10 14 55.1% 12.8% 9.6% 22.4% 42.3% 16.7% 12.2% 28.8% 50.0% 25.0% 10.4% 14.6% 2 54 14 35 53 2 20 49 53 34 2 16 24 37 19 34.6% 9.0% 22.4% 34.0% 12.8% 31.4% 34.0% 21.8% 16.7% 25.0% 38.5% 19.8% 3 54 14 41 47 3 16 48 66 26 3 10 31 35 20 34.6% 9.0% 26.3% 30.1% 10.3% 30.8% 42.3% 16.7% 10.4% 32.3% 36.5% 20.8% 4 95 13 8 40 4 68 20 16 52 4 42 17 11 26 60.9% 8.3% 5.1% 25.6% 43.6% 12.8% 10.3% 33.3% 43.8% 17.7% 11.5% 27.1%

Table 6.3. Confusion matrices displaying observed tone identification responses and response rate (%) for (a) BM listeners, (b) TM listeners, and (c) AE listeners at Gate 1.

(a) Baseline condition

Stimulus    BM listeners        TM listeners        AE listeners
BM tones    -                   T4→T3 (18.6%)       -
TM tones    T3→T4 (37%)         T3→T4 (16%)         T3→T4 (29%)
            -                   T4→T3 (6.4%)        -

(b) Gate 5

Stimulus    BM listeners        TM listeners        AE listeners
BM tones    T2→T1 (58%)         T2→T1 (46%)         T2→T1 (47%)
            T3→T4 (36%)         T3→T4 (22%)         T3→T4 (19%)
TM tones    T2→T1 (60%)         T2→T1 (41%)         T2→T1 (46%)
            T3→T4/T1 (31%, 25%); T3→T1/T2 (19%) [listener-group columns for these two entries are not recoverable from the source layout]

(c) Gate 1

Stimulus    BM listeners        TM listeners        AE listeners
BM tones    T1→T4 (30%)         T1→T4/T2 (27%)      T1→T2/T4 (26%)
            T2↔T1/T4 (26%)      T2→T3/T4 (28.5%)    T2→T3/T1 (29%)
            T3→T4/T1 (33.5%)    T3→T2/T4 (24%)      T3→T4 (24%)
            T4→T1 (51%)         T4→T1 (33%)         T4→T1/T2 (29%)
TM tones    T1→T4 (22%)         T1→T4 (29%)         T1→T2 (25%)
            T2→T1/T4 (35%)      T2→T3/T4 (34%)      T2→T3 (39%)
            T3→T1/T4 (32.5%)    T3→T2 (31%)         T3→T2 (32%)
            T4→T1 (61%)         T4→T1 (44%)         T4→T1 (44%)

Table 6.4. Direction of confusions for BM and TM tones in the following testing conditions: (a) Baseline condition, (b) Gate 5, and (c) Gate 1, arranged by listener group.

Based on the confusion patterns in part (c) of Table 6.4 and the f0 height of the confused tone pairs at this gate, listeners seemed to be able to distinguish high-onset tones (Tones 1 and 4) from low-onset tones (Tones 2 and 3). In order to test whether the high- vs. low-onset judgments exceeded chance level, tone responses were recoded according to the f0 onset height of the tone to create another set of confusion matrices. Table 6.5 shows the numbers of actual responses to the high- and low-onset tone categories in 2×2 tables.

Chi-square (χ²) tests⁴⁶ of association were performed to evaluate the relationship between the observed high- and low-onset tone responses and the stimulus tones. The null hypothesis is that tone identification responses are independent of the stimulus tones, i.e., there is no association between them and tone identification is random. The alternative hypothesis is that tone responses are related to the stimulus tones. The purpose of the chi-square analyses is to explore whether tone identification depends on the acoustic patterns of the stimulus tone, especially when the acoustic-phonetic information available in the fragmented syllables is limited. The χ² tests showed that the high-low distinction was not random for either dialect or for any of the three listener groups (p<.001). This finding suggests that both native and non-native listeners were able to estimate f0 height and that their high-low judgments were not random.
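The test of association described above can be illustrated with a minimal sketch applied to the Gate 1 counts reported in Table 6.5. It assumes a plain Pearson chi-square on each 2×2 table without a continuity correction; the text does not say which variant was used, so the statistics below are illustrative rather than a reproduction of the reported analysis. The function and variable names are hypothetical.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of association for a 2x2 table (df = 1).

    Expected counts are (row total * column total) / N.
    No continuity correction is applied (an assumption of this sketch).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Counts from Table 6.5: rows = stimulus (High, Low), columns = response (High, Low).
GATE1_COUNTS = {
    ("BM listeners", "BM tones"): [[264, 48], [217, 95]],
    ("BM listeners", "TM tones"): [[256, 56], [208, 104]],
    ("TM listeners", "BM tones"): [[202, 110], [126, 186]],
    ("TM listeners", "TM tones"): [[231, 81], [96, 216]],
    ("AE listeners", "BM tones"): [[111, 81], [74, 118]],
    ("AE listeners", "TM tones"): [[130, 62], [65, 127]],
}

results = {key: chi_square_2x2(counts) for key, counts in GATE1_COUNTS.items()}

for (listeners, tones), (chi2, p) in results.items():
    print(f"{listeners}, {tones}: chi2 = {chi2:.2f}, p = {p:.2g}")
```

Under these assumptions every one of the six tables yields p < .001, in line with the result reported in the text.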

⁴⁶ The χ² test statistic is a measure of the divergence of the observed response frequencies from the expected response frequencies.

(a) BM listeners

Stimulus    Response: High    Low    Rate
BM High     264               48     H=0.85
BM Low      217               95     F=0.70
TM High     256               56     H=0.82
TM Low      208               104    F=0.67

(b) TM listeners

Stimulus    Response: High    Low    Rate
BM High     202               110    H=0.65
BM Low      126               186    F=0.40
TM High     231               81     H=0.74
TM Low      96                216    F=0.31

(c) AE listeners

Stimulus    Response: High    Low    Rate
BM High     111               81     H=0.58
BM Low      74                118    F=0.39
TM High     130               62     H=0.68
TM Low      65                127    F=0.34

Table 6.5. Confusion matrices displaying observed identification responses to high- and low-onset tone categories with hit (H) and false-alarm (F) rates for (a) BM listeners, (b) TM listeners, and (c) AE listeners at Gate 1. The row total is 312 for both BM and TM listeners and 192 for the AE listeners.

6.4. Signal Detection Theory: Sensitivity (d′) to Mandarin tones

Lee and his colleagues (e.g., Lee, 2009; Lee et al., 2008, 2009) claimed that a significant chi-square test shows that tone identification accuracy exceeds chance level, and they used the expected cell frequencies⁴⁷ generated from the chi-square tests as an indicator of “response bias” in tone identification. A problem in interpreting these statistics is that they do not provide a rigorous measure of identification accuracy that takes listeners’ response strategies into account. For example, high identification accuracy (i.e., percent-correct identification scores) may result from the listener being biased toward one response category rather than being sensitive to the stimuli of that category.

Therefore, a more stringent approach to interpreting listeners’ response behavior, as proposed in signal detection theory (SDT), is to tease apart subjects’ sensitivity in identifying a token out of a set of tone stimuli from their potential response bias. In SDT, listeners’ internal responses to repeated presentations of each stimulus are assumed to follow a normal probability distribution; together, these distributions form a decision space in which listeners place their decision criterion (c) for deciding on a response.

According to SDT (Macmillan and Creelman, 2004), the sensitivity measure (d′) in the yes-no paradigm is the distance, in standard deviation units, between the means of the two underlying normal distributions of listeners’ responses to the two possible stimulus classes. Response bias indicates a subject’s inclination toward one response category over another. The measure of bias in SDT is called the “criterion” (c), which serves as the subject’s decision rule and governs how the decision space is partitioned.

⁴⁷ Mathematically, they are computed from the actual data as (row total × column total)/N, where N is the total number of observations in the table.

Following Macmillan and Creelman (2004), sensitivity (d′) and bias (c) can be calculated as follows, where z(·) denotes the inverse of the standard normal cumulative distribution function:

d′ = z(Hit) − z(False Alarm) (1)

c = −0.5 [z(Hit) + z(False Alarm)] (2)
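Equations (1) and (2) can be applied directly to the counts in Table 6.5. The sketch below is a minimal illustration that treats "High" as the signal class, so that a hit is a High response to a high-onset stimulus and a false alarm is a High response to a low-onset stimulus. The function names are hypothetical, and the z-transform is taken from Python's standard library. Note that z(·) is undefined for rates of exactly 0 or 1, which is why perfect individual scores would need an adjustment.

```python
from statistics import NormalDist


def z(p):
    """Inverse of the standard normal CDF (undefined for p = 0 or p = 1)."""
    return NormalDist().inv_cdf(p)


def dprime_and_criterion(hits, misses, false_alarms, correct_rejections):
    """Yes-no sensitivity d' (Eq. 1) and criterion c (Eq. 2) from raw counts."""
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    d_prime = z(hit_rate) - z(fa_rate)
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, criterion


# BM listeners, BM stimuli at Gate 1 (Table 6.5):
# High stimulus -> 264 High / 48 Low responses; Low stimulus -> 217 High / 95 Low.
d_bm, c_bm = dprime_and_criterion(264, 48, 217, 95)
print(f"d' = {d_bm:.2f}, c = {c_bm:.2f}")
```

For this cell the sketch gives values that round to d′ = 0.51 and c = -0.77. A negative criterion corresponds to a bias toward the "High" response.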

6.4.1. Sensitivity to high-low f0 distinction

In order to examine listeners’ sensitivity to the high-low f0 distinction, group sensitivity (d′) and criterion (c) were calculated for the high vs. low f0 pair in each dialect for each (language) group of listeners. Group d′ and c values,⁴⁸ based on the mean hit and false-alarm rates of all subjects in a listener group, are shown in Table 6.6; they were calculated with the yes-no (one-interval design) method.

Sensitivity values ranged from 0.5 to 1.2, which approximately correspond to 60% to 71% correct for both high-f0 and low-f0 trials. These measures indicate that all three listener groups distinguished f0 height above chance-level performance (i.e., d′ = 0.00). TM listeners exhibited relatively greater sensitivity than the other two listener groups regardless of the dialect of the tone stimuli. In addition, both TM and AE listeners had greater sensitivity to the f0 height distinction in TM tones than in BM tones.

⁴⁸ Individual d′ values for each listener were also calculated. However, the differences between individual and group d′ values were minor, and only the latter are reported in order to avoid the adjustment for perfect scores that are occasionally observed in the individual data (see Francis and Ciocca, 2003). Individual d′ values were nevertheless used for the subsequent ANOVA analyses.

Stimulus class    BM listeners          TM listeners          AE listeners
BM                d′=0.51, c=-0.77      d′=0.62, c=-0.07      d′=0.49, c=0.05
TM                d′=0.49*, c=-0.67     d′=1.15*, c=-0.07     d′=0.88, c=-0.02

*The mean difference is significant at the 0.05 level.

Table 6.6. Group sensitivity (d′) and criterion (c) calculated according to the yes-no method (Macmillan and Creelman, 1991) for each group of listeners.

A two-way repeated-measures ANOVA on individual listeners’ sensitivity (i.e., the difference between the z-transformed hit rate and false-alarm rate, z(H)−z(F)) showed a main effect of speaker dialect [F(1, 31)=9.04, p=.005], but not of listener dialect [F(2, 31)=2.03, p=.148]. Listeners were, on average, more sensitive to the high-low f0 distinction in TM tones (d′=0.89) than in BM tones (d′=0.59), p=.005. There was also a significant interaction between these two factors, F(2, 31)=4.28, p=.023. Thus, while the three groups of listeners did not differ significantly in overall sensitivity to the f0 height distinction, their sensitivity varied with the dialect of the tokens.

A one-way ANOVA of listener sensitivity to the f0 height distinction in each dialect showed significant group differences in sensitivity to TM tones, F(2, 31)=4.60, p=.018, but not to BM tones, F(2, 31)=.32, p=.729. Bonferroni post hoc tests showed that TM listeners were significantly more sensitive to the high-low f0 distinction in TM tones than their BM counterparts (p=.015). These findings corroborate that, with only the 30-ms onset of a tone available, all listener groups were able to identify tones in terms of f0 height above chance level.

6.4.2. Overall sensitivity in 4-alternative forced-choice (4AFC) tone identification

The sensitivity measures obtained from the high-low tone pairs in each dialect for each group of listeners assume that the tone stimuli vary along a single physical dimension, namely f0 height. However, the assumption of unidimensionality may be too strong for a 4-alternative forced-choice tone identification task, in which the listener identifies the tone of the stimulus presented on each trial from a set of four lexical tones.

In order to obtain an overall index of identification performance for each group of listeners on the truncated tone stimuli at gates 1 and 5 and on the intact stimuli, an overall sensitivity statistic was calculated using an SDT analysis that relates the proportion correct, p(c), to d′ (Macmillan and Creelman, 2005). For example, the overall d′ value for BM listeners’ identification of fragmented TM tones at gate 5 is calculated as p(c) = (146+54+77+123)/624 = 0.64, which implies an overall d′ of 1.29 according to Table A5.7 in Macmillan and Creelman (2005). Table 6.7 summarizes the overall sensitivity (d′) to tones in BM and TM presented at gates 1 and 5 and in the intact syllable condition for each listener group.
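The conversion from p(c) to d′ in the worked example can be approximated numerically without the published table. The sketch below assumes the standard unbiased, equal-variance Gaussian model for an m-alternative task, in which p(c) is the integral of φ(x − d′)·Φ(x)^(m−1) over x. That this is the model underlying Table A5.7 is an assumption here, and the function names are hypothetical. The integral is evaluated with the trapezoidal rule and inverted by bisection.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def pc_mafc(d_prime, m=4, step=0.01, lower=-8.0, upper=12.0):
    """Proportion correct for an unbiased observer in an m-alternative task.

    Trapezoidal approximation of the integral of
    pdf(x - d') * cdf(x) ** (m - 1) over [lower, upper].
    """
    n = int(round((upper - lower) / step))
    total = 0.0
    for k in range(n + 1):
        x = lower + k * step
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * _STD_NORMAL.pdf(x - d_prime) * _STD_NORMAL.cdf(x) ** (m - 1)
    return total * step


def dprime_from_pc(pc, m=4, tol=1e-4):
    """Invert pc_mafc by bisection on d' in [0, 6] (p(c) increases with d')."""
    low, high = 0.0, 6.0
    while high - low > tol:
        mid = (low + high) / 2.0
        if pc_mafc(mid, m) < pc:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


# Worked example from the text: 400 correct responses out of 624 trials.
pc_example = (146 + 54 + 77 + 123) / 624
d_example = dprime_from_pc(pc_example)
print(f"p(c) = {pc_example:.2f}, d' = {d_example:.2f}")
```

At d′ = 0 the model returns the chance rate of 0.25 for four alternatives, and for the example above it returns a value close to the d′ of 1.29 quoted from the table.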

A two-way mixed-design ANOVA with speaker dialect as the within-subject factor and listener dialect as the between-subject factor was performed on individual listeners’ overall sensitivity to the first 30 ms of the tone stimuli at gate 1. Results showed no significant main effect of speaker dialect [F(1, 31)=1.74, p=.197, η²=.053] or listener dialect [F(2, 31)=1.528, p=.233, η²=.090], and no significant interaction between these two factors [F(2, 31)=2.06, p=.144, η²=.117]. These results suggest that the three language groups of listeners did not differ in their overall sensitivity to the onset-only syllables in the two dialects. However, one-way ANOVAs on listeners’ sensitivity to the tone onset in each dialect indicated significant group mean differences in overall sensitivity when identifying the onset of the TM tones. Bonferroni post hoc tests showed that TM listeners had higher sensitivity to the onset of TM tones than did their BM counterparts (p=.046). Independent-samples t-tests comparing overall sensitivity to BM and TM tones for each listener group, however, revealed no significant differences between the overall sensitivity indexes for tones in the two dialects.

The same statistical procedure was applied to the overall sensitivity to tones in the fragmented stimuli at gate 5 and in the intact stimuli.⁴⁹ The two-way repeated-measures ANOVA on the overall sensitivity to truncated stimuli at gate 5 showed a main effect of speaker dialect [F(1, 31)=10.19, p=.003, η²=.247], but not of listener dialect [F(2, 31)=1.71, p=.198, η²=.099].

Listeners, on average, showed greater sensitivity to BM tones at gate 5 (p=.003). There was no interaction between speaker dialect and listener dialect [F(2, 31)=1.63, p=.212, η²=.095]. Nevertheless, one-way ANOVAs on listeners’ sensitivity to tones in each dialect revealed significant group mean differences in overall sensitivity when identifying truncated TM tones. Post hoc tests (LSD) showed that TM listeners had greater sensitivity to TM tones than BM and AE listeners (p=.023 and p=.032, respectively). Independent-samples t-tests comparing overall sensitivity to BM and TM tones for each listener group revealed that BM listeners were significantly more sensitive to truncated tones in BM, t(24)=-2.37, p=.026, while the other two listener groups did not show differential sensitivity to either dialect (TM: t(24)=-.33, p=.75; AE: t(14)=-1.15, p=.268).

⁴⁹ For listeners who achieved 100% correct on the identification of intact stimuli, p(c) was adjusted to 0.99, which implies d′=3.80.

(a) Gate 1

Stimulus class    BM listeners    TM listeners    AE listeners
BM                d′=0.19         d′=0.26         d′=0.29
TM                d′=0.15*        d′=0.42*        d′=0.36

(b) Gate 5

Stimulus class    BM listeners    TM listeners    AE listeners
BM                d′=1.29†        d′=1.35         d′=1.22
TM                d′=1.09†*       d′=1.32*        d′=1.06*

(c) Baseline condition

Stimulus class    BM listeners    TM listeners    AE listeners
BM                d′=3.80         d′=2.80         d′=2.53
TM                d′=2.20†        d′=2.45†        d′=2.09†

*The mean difference in overall sensitivity (d′) is significant at the 0.05 level based on the results obtained from the one-way ANOVAs.
†The mean difference in overall sensitivity (d′) is significant at the 0.05 level based on the results obtained from the t-tests.

Table 6.7. Group overall sensitivity d′ calculated according to an unbiased SDT model for 4AFC identification (Macmillan and Creelman, 2005) for (a) Gate 1, (b) Gate 5 and (c) Baseline condition.


In addition, a two-way repeated-measures ANOVA on individual listeners’ sensitivity to the full tone stimuli showed a significant main effect of speaker dialect [BM=3.04 and TM=2.33, F(1, 31)=50.53, p<.001, η²=0.62] and a significant interaction between the two factors [F(2, 31)=4.23, p=.023, η²=0.216]. The effect of listener dialect was not significant [F(2, 31)=0.73, p=0.489, η²=0.045]. These results suggest that although the three listener groups did not differ in overall sensitivity to Mandarin tones, their sensitivity varied with the dialect of the tones. One-way ANOVAs on the overall sensitivity to the four tones in each dialect did not reveal any group mean differences. Nevertheless, independent-samples t-tests showed that all listener groups had lower overall sensitivity to TM tones when they identified the intact tone stimuli [BM: t(24)=-6.52, p<.001; AE: t(14)=-2.55, p=.023; TM: t(24)=-2.08, p=.048].

In summary, statistical analyses of listeners’ identification of, and sensitivity to, the high-low f0 distinction at the onset of the stimuli showed that listeners were sensitive to the f0 height information contained within the first 30 ms of the tone stimuli and were able to identify the four tones in terms of the high-low f0 distinction above chance level. Listeners also demonstrated overall sensitivity to all four tones when the tone stimuli were only 30 ms long. A comparison of the overall sensitivity measures of the three listener groups for tones in each dialect showed that some listeners indeed began to show differential sensitivity to the dialect of the tone stimuli as early as the syllable onset. In particular, TM listeners were more sensitive to TM tones than BM listeners (and, at gate 5, than both BM and AE listeners) when identifying the 30-ms and 150-ms truncated tones and when making the high-low f0 distinction at the onset. However, TM listeners no longer had a “native dialect sensitivity” advantage when identifying intact stimuli.


6.5. Discussion and Conclusion

Analyses of TIP75% showed that, regardless of speaker and listener dialect, Tone 2 had the longest TIP75%, followed by Tone 3, Tone 4 and Tone 1 (9, 7.3, 5.1 and 3.7 gates, respectively). This pattern is consistent with the findings reported in previous gating studies on tones in BM (Wu and Shu, 2003; Lai and Zhang, 2008). In addition, the longer TIP75% for Tone 2 than for Tone 3 provides further evidence for the observation that a low f0 onset (the initial falling contour) may give rise to a Tone 3 percept (e.g., Whalen and Xu, 1992; Fon et al., 2004) and that the final f0 rise is critical for the perception of Tone 2 (Fon et al., 2004). Nevertheless, speaker variability and speaker dialect-listener dialect mismatch did, as hypothesized, have differential impacts on the identification of intact and truncated tones by the three listener groups.

While Tones 2 and 3 had, on average, longer TIP75%s, identification of fragmented TM Tone 3 was hindered by a speaker dialect-listener dialect mismatch. Specifically, TM listeners required significantly less acoustic information than their BM and AE counterparts to reach 75% correct responses. Even with complete acoustic information available, both BM and AE listeners were still significantly less accurate in identifying an intact TM Tone 3 produced in isolation.

Tonal confusion patterns revealed that TM Tone 3 was easily misidentified as Tone 4 by BM and AE listeners. In addition, examination of the overall sensitivity of each listener group to the four intact tones in each dialect revealed that BM and AE listeners were less sensitive to TM tones. While a speaker dialect-listener dialect mismatch generally seemed to have no effect on TM listeners’ identification of truncated and/or intact BM tones, TM listeners were significantly worse at identifying intact BM Tone 4 and often misidentified it as Tone 3.


These confusion patterns can be explained by the cross-dialectal differences in the production of an intact citation Tone 3 and the resulting acoustic similarity between TM Tone 3 and Tone 4 in terms of f0 contour. The acoustic differentiation between these two tones relies on the f0 height distinction. BM and AE listeners consistently misidentified TM Tone 3 as Tone 4 but not vice versa, while TM listeners made bi-directional misidentifications. One possible explanation is that the prototypical tonal contour of an isolated Tone 3 is low-falling-rising for speakers of the BM dialect. Therefore, when they identified an isolated TM Tone 3, which has a low-falling pattern, they misidentified it as Tone 4, and they never misidentified Tone 4 as Tone 3. In contrast, TM listeners are familiar with both the prescriptive form of a citation Tone 3 and the phonetic variant that is characteristic of TM (i.e., low-falling). As a result, TM listeners’ performance was not compromised by the speaker-listener dialect mismatch when they identified the dipping BM Tone 3, and they achieved higher accuracy in identifying the low-falling TM Tone 3. In addition, TM listeners made the TM Tone 3→Tone 4 confusion (with a smaller magnitude than that of the other two groups of listeners) and significantly more Tone 4→Tone 3 confusions regardless of the dialect.

However, as seen in the production study (see Chapter 4), Tone 3 has a low-falling contour when it is produced in a sentence-medial position. BM and AE listeners should therefore also be familiar with this phonetic variant when identifying mixed-dialect and mixed-context stimuli. Preliminary perceptual results from the identification of intact tones produced in the sentential context indicated that BM and AE listeners still performed worse on the contextual TM Tone 3 (BM=55.8%, AE=52.1%) than on the contextual BM Tone 3 (BM=77.4%, AE=65.6%), and they made more contextual Tone 3→Tone 4 misidentifications in TM. These findings provide additional support for the claim that tone identification by BM and AE listeners is more susceptible to a speaker dialect-listener dialect mismatch, especially for Tone 3. In addition, the tonal confusion patterns differed from the often-cited Tone 2→Tone 3 confusion reported in the previous literature (Blicher et al., 1990; Shen and Lin, 1991; Whalen and Xu, 1992; Gottfried and Suiter, 1997; Wang et al., 1999; Fon et al., 2004; Lee et al., 2008).

For the fragmented tone stimuli at gates 1 and 5, the confusion patterns can be explained by the acoustic properties of the edited tone syllables. When identifying onset-only syllables, all listener groups tended to confuse Tones 1 and 4, with Tone 4→Tone 1 misidentification being the dominant error pattern (cf. Lee et al., 2008, who found dominant Tone 1→Tone 4 misidentification). BM listeners also judged Tones 2 and 3 as either Tone 1 or Tone 4. TM and AE listeners, however, often confused Tones 2 and 3 with each other and at times judged them as Tone 4. Tones were most often identified as Tone 1 or Tone 4 given the lack of dynamic movement in f0 (i.e., flat tonal contours) and the limited acoustic information; for example, a stimulus of very short duration tends to give rise to a Tone 4 percept. These error patterns are generally consistent with previous findings (Tseng, 1981; Whalen and Xu, 1992; Gottfried and Suiter, 1997; Lee et al., 2008; Lee et al., 2009; Lee, 2009).

When listeners identified the 150-ms stimuli at gate 5, tonal confusions remained for Tones 2 and 3, since Tones 1 and 4 were identified with at least 75% accuracy regardless of the source dialect. Listeners usually judged Tone 2 as Tone 1 because the f0 contour of Tone 2 in BM and TM was still relatively flat within the first 150 ms of the vowel. This provides additional evidence for the previous finding that the perceptual cue to a Tone 2 percept lies in the medial and final rising portion (Fon et al., 2004). However, it is unclear why, for some listeners (TM and AE), Tone 2→Tone 1 misidentification became dominant at gate 5 when they had often misidentified Tone 2 as Tone 3 at gate 1. In addition, the Tone 3→Tone 4 confusion became dominant in BM, while TM Tone 3 tended to be misjudged as one of the other three tones. The former can be explained by the f0 pattern of the first 150 ms of BM Tone 3 (i.e., low-falling with possible creaky voice), which is very similar to that of Tone 4 except for f0 height.

In addition, these confusion patterns suggest that listeners are able to distinguish high-onset tones (Tones 1 and 4) from low-onset tones (Tones 2 and 3). Sensitivity (d′) measures based on SDT indicated that listeners were indeed sensitive to this f0 height distinction when only 30 ms of acoustic signal was available. Most importantly, TM listeners began to show an advantage from the speaker dialect-listener dialect match when identifying onset-only tone tokens. However, when listeners identified intact stimuli, such “native dialect sensitivity” to all four tones was reduced.

While the main focus of this study is to examine the effect of speaker dialect-listener dialect mismatch on the identification of tones in BM and TM, it is worth noting that the AE listeners recruited in the current study have native-like proficiency in Mandarin Chinese. Their performance on tone identification was not significantly inferior to that of native speakers of BM, even for TM Tone 3. Their tonal confusion patterns did not differ from those of native speakers of either BM or TM. In particular, they did not exhibit the often-cited Tone 2-Tone 3 confusion when identifying intact tones (e.g., Gottfried and Suiter, 1997; Lee et al., 2009). Furthermore, they were also sensitive to the high-low f0 distinction when they identified the onset-only syllables, which contrasts with the inconclusive findings reported in Lee et al. (2009).


Chapter 7 Conclusions and General Discussion

The production study examined the three main acoustic properties, fundamental frequency (f0), amplitude and duration, of the four lexical tones in Beijing Mandarin (BM) and Taiwan Mandarin (TM) produced in isolation and in a sentence-medial position by native male and female speakers of these two dialects. Acoustical and statistical analyses of these acoustic parameters of the tone syllables produced in both constrained phonetic contexts by the female speakers showed cross-dialectal differences in tonal realization in terms of f0 contour and register, amplitude contour and duration.

The most prominent cross-dialectal divergence in the f0 contour or movement of tones produced in isolation was observed in Tone 3. Statistical comparisons of f0 values at five locations in the vowel between the two dialects (see part (c) of Fig. 4.7) showed that TM Tone 3 was significantly higher than its BM counterpart at 25% and 50% into the vowel (by 20 and 41 Hz, respectively), but 50 Hz lower at the vowel offset (i.e., 95%). This confirms that a citation BM Tone 3 has a dipping contour with an inflection point around the midpoint of the vowel, whereas TM Tone 3 has a falling pattern. These findings are consistent with previous studies in which a low-falling contour was reported for native TM speakers (Shih, 1987 and 1988; Chiung, 1999 and 2003; Li et al., 2006; Deng et al., 2006; Shi and Deng, 2006; Sanders, 2008a,b).

Unlike Tone 3, which showed cross-dialectal differences in tonal contour, Tones 1, 2 and 4 had high-level, mid-rising and high-falling contours, respectively, in both dialects when produced in isolation. While the general shape of the tonal contours of these three tones did not differ between the two dialects, there were significant cross-dialectal differences in the relative f0 values over all or part of the contour. In general, isolated BM tones had a higher tonal register (mean f0 values of the four tones) than their TM counterparts. Specifically, BM Tone 1 was higher than its TM counterpart along the entire vowel. BM Tone 2 was significantly higher (in Hz) than its TM counterpart from the second half of the f0 contour onward, and the cross-dialectal differences (in Hz) increased progressively toward the end of the vowel. BM Tone 4 was significantly higher than its TM counterpart in the first half of the contour.

Most importantly, Tone 4 produced in isolation did not fall to the f0 minimum (L) in either dialect, contrary to what previous studies suggested (e.g., Chao, 1948). A possible explanation for TM Tone 4 is that, in order to differentiate the high-falling Tone 4 from the low-falling TM Tone 3, Tone 4 no longer reaches the L target (see Sanders, 2008b; Chiung, 1999). As a result, the contrast between Tone 3 and Tone 4 becomes a high-low f0 register contrast, that is, [31] versus [53].

The tonal contours of syllables produced in a sentential context were consistent with the tone sandhi and coarticulation rules and showed no overt cross-dialectal discrepancies in the shape of the f0 contours. The f0 patterns were high-level, mid-rising, low-falling and high-falling for Tones 1 – 4, respectively, in both dialects. Tone 3 in context exhibited the low-falling contour in both dialects, as predicted by the “half T3 sandhi” (Chao, 1948, 1968). Tone 4 in context exhibited a “half T4” pattern, attributed to anticipatory coarticulation (Shen, 1990a, b) or a sandhi process (Chao, 1948), which prevents the f0 pattern from falling to a minimum (L) before any full lexical tone.


Although the f0 contours of tones in context showed no overt cross-dialectal discrepancies in shape, there were significant cross-dialectal differences in relative f0 values (in Hz) for the four lexical tones. Specifically, BM Tones 1 and 2 in context were higher than their TM counterparts at all five locations and in the last 25% of the f0 contour, respectively. The magnitude of the cross-dialectal difference in the final rising portion of the contextual Tone 2 was smaller than that for Tone 2 in isolation. This may be attributed to the coarticulation effect proposed by Shih (1987), in which the final rising portion of Tone 2 diminishes when followed by a Tone 4 (as in the current study) or a Tone 1. Both Tones 3 and 4 were higher in the first half of the contour in BM than in TM.

Based on these findings, BM tones were, on average, higher in f0 (in Hz) than their TM counterparts. There are two possible explanations. First, this may result from idiosyncratic differences in voice pitch between the selected speakers of the two dialects. Second, it may provide additional support for the previous finding that TM tones have a lower tonal register than their BM counterparts (Tseng, 2004; Torgerson, 2005; Li et al., 2006; Shi and Deng, 2006).

The patterns of the amplitude contours are more complicated, which is consistent with previous findings that each tone can assume amplitude contours of different shapes (Ho, 1976; Lin, M-C., 1988). The rms amplitude contours were rising-level-falling, rising with a small dip, double-peak, and rising-falling for BM Tones 1 – 4 produced in isolation, respectively. In TM, they were rising-falling for Tone 1, rising-level for Tone 2, and rising-falling for both Tones 3 and 4. Similar to the prominent cross-dialectal difference in the f0 contour of Tone 3 in isolation, the amplitude envelope of BM Tone 3 had a double-peak pattern, while that of TM Tone 3 was falling. While previous research has reported considerable inter- and intra-speaker variation in the shape of amplitude contours, especially for BM Tone 3 produced in isolation (Lin, M-C., 1988), the amplitude contour of Tone 3 produced by the female BM speakers in the current study always showed the double-peak pattern, and that produced by their TM counterparts consistently showed the falling pattern. Statistical analyses further confirmed that the amplitude contours of Tone 3 differed significantly between the two dialects at all five locations in the vowel. This provides additional support for our claim that there is dialectal divergence in the tonal realization of Tone 3 in isolation.

Another statistically significant cross-dialectal difference in amplitude contours was observed in Tone 2, which was rising in BM but flat in TM. Specifically, the second half of the amplitude envelope of BM Tone 2 was significantly higher (in dB) than that of TM Tone 2. This may be related to the relatively larger magnitude of the final rise in isolated BM Tone 2 as compared to that in its TM counterpart.

When tones were produced in a sentence context, the amplitude contours did not differ between the two dialects, except for those of Tones 1 and 2. They were rising-flat and rising with a dip for Tones 1 and 2 in BM, respectively, and rising-falling for both Tones 1 and 2 in TM. The amplitude envelope of BM Tone 3 became falling, resembling that of its TM counterpart. Similarly, the amplitude contours of Tone 4 had the rising-falling pattern in both dialects.

Results from the cross-correlation analyses between f0 and amplitude contours showed that cross-dialectal differences in the production of a citation Tone 3 resulted in different relational patterns between the amplitude and f0 contours of tones in the two regional dialects. Specifically, while the low-falling amplitude contour of TM Tone 3 in isolation was highly correlated with the f0 contours of Tones 3 and 4, which also have falling patterns, the double-peak amplitude envelope of BM Tone 3 was correlated only with the dipping f0 contour of the same tone. In addition, the results were consistent with previous studies in two respects. First, the amplitude contour of a tone does not necessarily correlate most highly with the f0 contour of the same tone (Kuo et al., 2008). Second, the size of the correlation between pitch and amplitude, represented by the Pearson correlation coefficient, varies considerably across different tones (Whalen and Xu, 1992; Fu and Zhang, 2000).
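As a rough illustration of how a Pearson correlation between an amplitude contour and an f0 contour behaves, the sketch below correlates contours sampled at five points. All contour values are hypothetical and chosen only to mimic the shapes named in the text (falling f0, falling amplitude, double-peak amplitude); they are not measurements from this study, and the function name is likewise an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("sequences must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical contours at five time points in the vowel (illustration only).
f0_falling_hz = [200, 180, 160, 140, 120]      # falling f0 contour
amp_falling_db = [70, 68, 65, 61, 56]          # falling amplitude envelope
amp_double_peak_db = [60, 66, 58, 65, 57]      # double-peak amplitude envelope

r_falling = pearson_r(amp_falling_db, f0_falling_hz)
r_double_peak = pearson_r(amp_double_peak_db, f0_falling_hz)
print(f"falling amplitude vs. falling f0:     r = {r_falling:.2f}")
print(f"double-peak amplitude vs. falling f0: r = {r_double_peak:.2f}")
```

With these made-up values the falling envelope correlates strongly with the falling f0 contour, while the double-peak envelope correlates only weakly with it, which is the kind of contrast the paragraph above describes.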

The perception study investigated the effects of speaker and dialect variability on

the time-course of native and nonnative tone identification when no syllable-extrinsic

contextual information is available for speaker and dialect normalization. A subset of

the analyses on TIP75% obtained from a series of gating experiments (only tone stimuli

produced in isolation were examined) showed that, regardless of dialects of speakers and

listeners, Tone 2 had the longest TIP75%, followed by Tones 3, 4 and 1 (9, 7.3, 5.1, and

3.7 gates, respectively, each gate is 30 ms long). In other words, Tones 2 and 3 required more acoustic information to be correctly identified with 75% correct responses than

Tones 4 and 1. These results supported our hypotheses that for the high-onset tone pairs

(Tones 1 and 4), level Tone 1 should be relatively easier to be recognized than the contour Tone 4, and for the low-onset tone pairs (Tones 2 and 3), Tone 3 is hypothesized to be identified earlier than Tone 2. The latter confirmed that a low f0 onset gives rise to

a Tone 3 percept (see also Whalen and Xu, 1992; Fon et al., 2004) and that the final f0 rise is indispensable for the perception of Tone 2 (Fon et al., 2004).
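As a worked illustration of the gate arithmetic, the following Python sketch converts the mean TIP75% values reported above from gates to milliseconds, using the stated 30-ms gate duration. The helper first_stable_gate shows one possible way to locate an identification point from per-gate accuracy. Its stability criterion (accuracy stays at or above 75% at every later gate) is an assumption made for illustration and is not necessarily the definition used in this study.

```python
GATE_MS = 30  # each gate adds 30 ms of signal, as stated in the text

# Mean TIP75% in gates, as reported above.
tip75_gates = {"Tone 1": 3.7, "Tone 2": 9.0, "Tone 3": 7.3, "Tone 4": 5.1}

# Convert gates to milliseconds of acoustic signal.
tip75_ms = {tone: round(g * GATE_MS, 1) for tone, g in tip75_gates.items()}

# Tones ordered from the longest to the shortest identification point.
ranking = sorted(tip75_ms, key=tip75_ms.get, reverse=True)


def first_stable_gate(accuracy_by_gate, criterion=0.75):
    """Return the 1-based gate from which accuracy stays at or above the
    criterion (hypothetical definition), or None if this never happens."""
    for i in range(len(accuracy_by_gate)):
        if all(a >= criterion for a in accuracy_by_gate[i:]):
            return i + 1
    return None
```

With these values, Tone 2 needs about 270 ms of signal and Tone 1 about 111 ms, which matches the ordering described in the text.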


Most importantly, as hypothesized, speaker dialect-listener dialect mismatch had

differential impacts on the identification of intact and truncated tones by three listener

groups. Perceptual results from tone identification of truncated stimuli showed the

identification of partial TM Tone 3 was hindered by a speaker dialect-listener

dialect mismatch. In particular, TM listeners required significantly less acoustic

information than the BM and AE counterparts to reach 75% correct response. Even

with complete acoustic information available to listeners (i.e., identification of intact tone

syllables), both BM and AE listeners still had significantly lower accuracy in identifying

an intact TM Tone 3 produced in isolation. Examination of the overall sensitivity of

each listener group to four intact tones in each dialect indicated that BM and AE listeners

were less sensitive to TM tones. While a speaker dialect-listener dialect mismatch did

not adversely affect the identification of fragmented and/or intact BM tones by TM

listeners, TM listeners achieved significantly lower accuracy in identifying intact Tone 4

in BM.

Tonal confusion patterns revealed that TM Tone 3 was often misidentified as

Tone 4 by BM and AE listeners and TM listeners misidentified BM Tone 4 as Tone 3.

These confusion patterns can be explained by the cross-dialectal differences in the tonal

realization of a citation Tone 3 and the resulting acoustic similarities between TM Tone 3

and Tone 4 in terms of f0 contour, as mentioned above. The f0 height distinction thus became

an important acoustic cue for distinguishing TM Tone 3 from Tone 4 in either TM or BM,

which poses difficulty for listeners when they identify multiple-speaker stimuli.

However, BM and AE listeners made consistent TM Tone 3 → Tone 4 misidentification but not vice versa while TM listeners made bi-directional


misidentification. One possible explanation is differential experience with the phonetic

variants of a citation Tone 3. BM (and AE) listeners consider the low-falling-rising contour

to be the prototype for an isolated Tone 3, whereas TM listeners know that a citation

Tone 3 may have either the prescriptive form or the phonetic variant (i.e., low-falling) that is characteristic of TM. Preliminary perceptual results from the identification of intact tones excised from the sentential context in which they were originally produced showed that BM and AE listeners still had lower accuracy in identifying the contextual TM Tone

3 (TM=55.8% vs. BM=77.4%) even though contextual Tone 3 has a low-falling pattern in both dialects. This provides additional support for the claim that tone identification by

BM and AE listeners is more susceptible to a speaker dialect-listener dialect mismatch.

Analyses of the duration of tones produced in isolation showed the ordering Tones

3>2>1>4 (longest to shortest) in BM and Tones 2>1>3>4 in TM. The cross-dialectal difference in duration was significant for Tone 3 only. The dialectal differences in the production of isolated

Tone 3 may result in this durational difference, which is consistent with Shih’s (1988) suggestion that a dipping Tone 3 shows the longest duration while the low-falling Tone 3 has the shortest.

Tone responses showed more confusions when listeners identified acoustically shortened syllables, and the patterns of tonal confusion can be explained by the acoustic properties of the edited tone syllables. At gate 1 (onset-only syllables), all listener groups confused Tones 1 and 4, with the dominant error being Tone 4 → Tone 1 misidentification. This is expected given the lack of dynamic movement in f0 and the limited acoustic information. While BM listeners also judged Tones 2 and 3 as either


Tone 1 or 4, TM and AE listeners often confused Tones 2 and 3 and at times judged them as Tone 4.

At gate 5 (150-ms stimuli), tonal confusions remained for Tones 2 and 3 since these two tones required more than seven gates to reach TIP75%, regardless of the source tone. Listeners often judged Tone 2 as Tone 1 due to the relatively flat f0 contour

of Tone 2 in both dialects within the first 150 ms of the vowel. Interestingly, Tone 2 →

Tone 1 misidentification became dominant for TM and AE listeners, whereas they had often

misidentified onset-only Tone 2 as Tone 3 at gate 1. In addition, Tone 3 → Tone 4

confusion became dominant in BM while TM Tone 3 tended to be misjudged as one of

the other three tones. This is likely because the f0 contour of the first 150 ms of BM

Tone 3 is acoustically similar to that of Tone 4 except for f0 height.

Furthermore, confusion patterns at gate 1 and sensitivity (d′) measures based on the SDT analyses indicated that listeners were able to distinguish low from high onsets and were sensitive to this f0 height distinction even when there was only

30 ms of acoustic signal available and stimuli were produced by female speakers only.

This provides stronger evidence that listeners were able to estimate f0 height from very short multi-talker stimuli of 30 ms without mediation of gender detection as proposed by

Lee (2009). In addition, TM listeners began to show an advantage of speaker dialect-listener dialect match when identifying onset-only tone tokens.
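The sensitivity index d′ referred to above can be sketched in Python as the difference between the z-transformed hit rate and false-alarm rate. The sketch treats a "high-onset" response to a high-onset stimulus (Tones 1 and 4) as a hit and a "high-onset" response to a low-onset stimulus (Tones 2 and 3) as a false alarm. The counts, the function name d_prime, and the log-linear correction are assumptions for illustration; the exact procedure used in the SDT analyses of this study is not reproduced here.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    A log-linear correction (add 0.5 to each count, 1 to each total) is
    applied so that rates of exactly 0 or 1 remain finite. This correction
    is an assumption of the sketch.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


# Hypothetical gate-1 counts (not taken from the thesis data).
d_onset = d_prime(hits=250, misses=62, false_alarms=120, correct_rejections=192)
```

A d′ near zero would indicate that listeners could not separate high from low onsets, and larger positive values indicate greater sensitivity.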

Finally, we hypothesized that tone identification by AE listeners would be

more compromised than that of the native listeners by a speaker

dialect-listener dialect mismatch, and that AE listeners would make the low-onset and high-onset distinction at later gates than

the native counterparts (see chapter 2). However, results from the current study


showed that their performance was not significantly inferior to that of native BM

speakers, even when they identified TM Tone 3. When AE listeners identified intact

tones produced in isolation, their performance was near native-like. In addition, their tonal

confusion patterns did not differ from those of native speakers of either BM or TM.

In particular, they did not exhibit the often-cited Tone 2 – Tone 3 confusion when

identifying intact tones (e.g., Gottfried and Suiter, 1997; Lee et al., 2009).

Furthermore, they were also sensitive to high-low f0 distinction when they identified the

onset-only syllables, which contrasts with the inconclusive findings reported in Lee et al. (2009).

There are several possible explanations as to why AE listeners achieved native-like tone identification accuracy. First, four out of the eight listeners analyzed for the perception study had been taking graduate level Mandarin courses and studying

Mandarin for an average of four years. Second, the majority of the AE learners of

Mandarin had studied in mainland China and/or Taiwan for an extended period of time.

They therefore might be familiar with phonetic variants of Tone 3 in both dialects.

Third, five out of the eight AE listeners analyzed in the perception study had extensive training in music (see Appendix C). A recent study by Lee and Hung (2008) has shown that musical training facilitated Mandarin tone identification by English-speaking listeners who have had no experience with lexical tones. When they identified the tones of the syllable sa produced by 32 speakers, both musicians and nonmusicians identified tones beyond chance at 68% and 44% correct, respectively.

This study has two limitations. First, the effects of language experience on tone identification by AE listeners remain inconclusive because of the relatively small and


unequal numbers of subjects at each proficiency level recruited in the study. According to the level of Mandarin courses they took at OSU and the number of years of language study they had, they were divided into advanced (N=7) and intermediate (N=3)

groups. While the descriptive individual data indicated that not all of the advanced

students identified the tones of the partial and intact stimuli with higher accuracy than

did the intermediate learners, they can be considered as a group of highly proficient

language users.

Second, due to small and unequal numbers of native male speakers recruited for

the production study, it is still inconclusive whether cross-dialectal differences are

present in the productions of isolated Tone 3 by male speakers. Preliminary acoustic

analyses on the production by three native male BM speakers showed that one of them

consistently realized Tone 3 in isolation as a low-falling tone. In addition, Deng et al.

(2006) and Shi and Deng (2006) also reported that the final rising target of BM Tone 3

produced by speakers of a younger generation is lower than that produced by speakers of

an older generation whose Tone 3 production still prescriptively maintains the [214]

contour shape. The male speakers of BM (mean age=23.3) were, on average, younger

than the female and TM counterparts. Therefore, it is possible that there might be

effects of gender and age group in tone production, especially for Tone 3. A study

with both male and female participants from different age groups will be needed to

further investigate the possible tonetic sound change in Tone 3 in both dialects.


Appendix A. Stimuli for the production study

A.1. Monosyllabic words

Syllable      Tone 1           Tone 2         Tone 3          Tone 4
yi /i/        一 one           宜 move        已 cease        易 easy
wu /u/        屋 house         無 none        五 five         物 object
hu /xu/       呼 exhale        狐 fox         虎              戶 door
ma /ma/       媽 mother        麻 hemp        馬 horse        罵 scold
pi /pi/       批 slap          皮 skin        痞 ruffian      僻 secluded
tu /tu/       禿 bald          圖 picture     土 land         兔
bao /pau/     包 bag           雹 hail        寳 treasure     抱 hug
bi /pi/       逼 force         鼻 nose        筆 pen          必 certainly
bo /po/       播 broadcast     伯 uncle       跛 lame         擘 huge
da /t/        搭 construct     答 answer      打 hit          大 big
du /tu/       督 supervise     毒 poison      賭 bet          度 degree
ge /k/        歌               格 cell        鬲 hiccup       個 individual
guo /kuo/     鍋               國 nation      果 fruit        過 pass

A.2 Short sentences

(1) Pinyin: Qǐng shuō “Please say”

Traditional: 請說; Simplified: 请说

(2) Pinyin: Qǐng kàn “Please look”

Traditional: 請看; Simplified: 请看

(3) Pinyin: Biāo míng shēng diào “Identify the lexical tone”

Traditional: 標明聲調; Simplified: 标明声调

(4) Pinyin: Biāo míng nǐ tīng dào de shēng diào “Please identify the lexical tone you just heard”

Traditional: 標明你聽到的聲調; Simplified: 标明你听到的声调


Appendix B: Speaker demographic information Place of Knowledge and Parents' and Subj. Place of # years in Gender Age residence fluency of another grandparents’ L1 # birth Columbus, OH before 15 yrs Chinese dialect Taipei, Taipei TSM:fluent (home) TSM 5 1 M 33 Taiwan Taipei, Hakka (guǎngdōng): Father/grandmother:guǎngxī; Taipei 5 11 M 29 Taiwan basic (home) Mother: Hakka , Nantou (15); TSM: fluent (home) TSM (& grandparents) 0 12 M 29 Taiwan TSM: intermediate Taipei, Ponghu Taipei, (listening); 8yrs and (3-6yrs), TSM 1.5 13 M 29 Taiwan onwards: TM; before Tainan(8yr-) 6yr: TSM

208 Taipei, TSM: fluent TSM 7.5 2 F 32 Taiwan Taipei, Taipei TSM: fluent TSM 4 3 F 26 Taiwan Taipei, TM (grandmother: hakka) Taipei TSM: minimal 0.5 14 F 30 Taiwan TSM: listening Tainan, Tainan TSM: fluent TSM <1yr 15 F 24 Taiwan Kaoshiung, Kaohsiung TSM: fluent TSM 0 18 F 30 Taiwan Taipei, Taipei TSM:minimal TSM 4 19 F 30 Taiwan Beijing, Beijing n/a BM 3 6 M 23 China Beijing, Beijing n/a BM 4 7 M 23 China Beijing, listening knowledge of Beijing dialect 2 16 M 24 China Kantonese (can't speak) 4 F 28 Hunan Hunan Xiang: fluent Xiang (Hunan) 1.5 Beijing, Beijing n/a BM 0.5 5 F 27 China

Beijing, Beijing n/a BM 1.5 8 F 22 China Beijing, Beijing n/a BM 2 9 F 22 China Beijing, Beijing n/a BM, (grandmother) 2mo 10 F 21 China listening knowledge of Beijing, mother: BM; father: Beijing Kantonese & 8mo 17 F 30 China shanghainese shanghainese


Appendix C: Listener information B.1 BM Listeners Subj. # Gender Age Place of birth and Knowledge and fluency of Exp. With Music training residence before 15 yrs another Chinese dialect TM dialect 40 F 22 Cangzhou, Hebei n/a n/a n/a 41 F 31 GuangHan, Sichuanese(Home) n/a n/a 42 F 28 Shandong dialect n/a n/a 43 F 27 Beijing n/a n/a piano; violin; guitar (8-9 yrs) 44 M 23 , Hebei Tangshan dialect n/a n/a 45 F 35 Beijing Shanghainese (listening) n/a piano (3 yrs) 46 F 21 Jilin city n/a n/a piano (6 yrs); 210 singing (3 yrs) 47 F 28 , Hebei Sichuanese (listening) TV n/a 48 F 25 , / Sichuan (listening); n/a n/a Shanghai (4 yrs) Shanghainese (listening) 49 F 28 , Shandong Zibo dialect; friends Piano (10 yrs); Nanjinghua singing (6 yrs) 53 F 24 , Henan dialect n/a piano(1 yr) 70 M 24 Xian, / Xian dialect (listening) n/a violin (1 yr) Beijing (7 yrs) 71 M 28 Beijing n/a friends; TV n/a (ocassionally) 72 M 23 Beijing n/a TV violin (3 yrs) (occasionally) 73 M 21 Beijing n/a n/a n/a 74 F 24 Beijing n/a friends n/a

B.2 TM Listeners

Subj. #  Gender  Age  Place of birth and residence before 15 yrs  Knowledge and fluency of TSM               Music training
10       M       28   Haliang                                      L2                                         n/a
21       F       27   Taipei                                       L2                                         n/a
22       F       27   Taichuang                                    a little                                   piano (5 yrs)
23       F       28   Kaohsiung                                    a little                                   n/a
24       F       28   Taipei                                       not fluent; listening (parents speak TSM)  n/a
25       M       27                                                L2                                         piano (3-4 yrs)
26       M       31   Chiayi                                       L2                                         piano (1.5 yrs)
27       M       29   Taipei                                       L2                                         piano (3 yrs)
28       F       25   Kaohsiung                                    L2                                         piano (15 yrs)
29       M       23   Taipei                                       L2                                         n/a
30       F       27   Tainan                                       L2                                         piano (1-2 yrs)
31       M       33   Taipei                                       L2                                         n/a
32       F       30   Taipei                                       listening                                  piano (2 yrs)
33       F       30   Kaohsiung                                    L2                                         piano (2-3 yrs)
34       F       58   Taoyuang                                     L2                                         n/a

B.3. AE Listeners Subj. Gender Age # yrs Proficiency Trips to Exp. With Other L2s Music training # studying Level (OSU) China/Taiwan other Chinese Mandarin dialects 50 F 25 3 500 level Mainland China (3 n/a n/a n/a mos) 52 F 25 5 600 level (1 yr); n/a n/a n/a 3 travels (2-3 mos); Kuilin (weeks) 54 F 27 1 700 level Taiwan (8 mos), Minnan (Min Japanese (2 yrs); violin (7 yrs) ICLP, Cornell Falcon Dialect) Korean (2 yrs); German (4 yrs) 55 M 29 6.5 700 level Beijing (1 yr); TSM n/a n/a 212 Shanghai (6 mos); Shanghainese Taiwan (1.5 yrs, (Wu Dialect) work) 56 M 28 3 100 level n/a TM n/a piano (8 yrs); band (7 yrs) 57 M 18 2 200 level n/a n/a n/a n/a 58 M 29 5 700 level China: 2.5 yrs Shanghainese n/a Erhu/Chinese (education) (Wu Dialect); violin (7 yrs); Cantonese piano (5 yrs); (Yue Dialect); cello (1 yrs) Minnan (Min Dialect) 59 F 25 8 700 level China: 2.5 yrs; n/a n/a piano (7-8 yrs); 2 travels: 5 mos singing (3 yrs) 80 M 21 3 700 level Beijing (1 yr) n/a n/a n/a 81 F 19 10 100 level n/a n/a piano (15 yrs); violin (10 yrs)

Appendix D: Tone confusion matrices at other gates

BM stimuli (rows: stimulus tone; columns T1 to T4: response Tone 1 to Tone 4)

            (a) BM listeners      (b) TM listeners      (c) AE listeners
Stimulus   T1   T2   T3   T4     T1   T2   T3   T4     T1   T2   T3   T4
Tone 1    105   20    5   26     81   27   19   29     51   22    8   15
Tone 2     67   36   19   34     33   38   59   26     27   31   31    7
Tone 3     36   17   46   57     14   38   81   23     10   30   48    8
Tone 4     88   18    7   43     84   25    9   38     45   16   15   20

TM stimuli (rows: stimulus tone; columns T1 to T4: response Tone 1 to Tone 4)

            (a) BM listeners      (b) TM listeners      (c) AE listeners
Stimulus   T1   T2   T3   T4     T1   T2   T3   T4     T1   T2   T3   T4
Tone 1     90   29    3   34     95   22    5   34     47   22    8   19
Tone 2     59   18   28   51     35   49   53   19     24   38   25    9
Tone 3     33   30   45   48     19   34   75   28     11   29   43   13
Tone 4     87   19    4   46     77   31   17   31     51    9   11   25

Confusion matrices displaying observed tone identification responses and response rate (%) for (a) BM listeners, (b) TM listeners, and (c) AE listeners at Gate 2.
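The matrices list raw response counts. Response rates (%) can be recovered by dividing each cell by its row total, as in the minimal Python sketch below. It uses the Gate 2 counts for BM listeners identifying BM stimuli given above. The function names response_rates and dominant_errors are illustrative assumptions rather than part of the original analysis.

```python
# Gate 2, BM listeners, BM stimuli: rows = stimulus Tone 1-4,
# columns = response Tone 1-4 (raw counts from the appendix).
counts = [
    [105, 20, 5, 26],
    [67, 36, 19, 34],
    [36, 17, 46, 57],
    [88, 18, 7, 43],
]


def response_rates(matrix):
    """Convert each row of counts to percentages of that row's total."""
    rates = []
    for row in matrix:
        total = sum(row)
        rates.append([round(100.0 * c / total, 1) if total else 0.0 for c in row])
    return rates


def dominant_errors(matrix):
    """For each stimulus tone, return the most frequent incorrect response
    as a 1-based tone number (ties go to the higher tone number)."""
    errors = []
    for i, row in enumerate(matrix):
        wrong = [(count, j + 1) for j, count in enumerate(row) if j != i]
        errors.append(max(wrong)[1])
    return errors


rates = response_rates(counts)
errors = dominant_errors(counts)
```

For this matrix, Tone 1 stimuli were identified correctly on 67.3% of trials, and the most frequent error for Tone 4 stimuli was a Tone 1 response.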

BM stimuli (rows: stimulus tone; columns T1 to T4: response Tone 1 to Tone 4)

            (a) BM listeners      (b) TM listeners      (c) AE listeners
Stimulus   T1   T2   T3   T4     T1   T2   T3   T4     T1   T2   T3   T4
Tone 1    111   22    7   16    106   27    2   21     59   10    6   21
Tone 2     57   55   22   22     41   64   42    9     28   43   20    5
Tone 3     14   14   46   82     10   26  100   20      5   28   40   23
Tone 4     80   19    6   51     79   23    4   50     54   14    5   23

TM stimuli (rows: stimulus tone; columns T1 to T4: response Tone 1 to Tone 4)

            (a) BM listeners      (b) TM listeners      (c) AE listeners
Stimulus   T1   T2   T3   T4     T1   T2   T3   T4     T1   T2   T3   T4
Tone 1    110   29    7   10    102   33    6   15     60   13    8   15
Tone 2     72   45   24   15     41   50   46   19     34   34   27    1
Tone 3     44   39   37   36     18   46   76   16     20   32   37    7
Tone 4     88    7    4   57     80   15   10   51     47   12    4   33

Confusion matrices displaying observed tone identification responses and response rate (%) for (a) BM listeners, (b) TM listeners, and (c) AE listeners at Gate 3.

BM stimuli (rows: stimulus tone; columns T1 to T4: response Tone 1 to Tone 4)

            (a) BM listeners      (b) TM listeners      (c) AE listeners
Stimulus   T1   T2   T3   T4     T1   T2   T3   T4     T1   T2   T3   T4
Tone 1    134   16    3    3    132   15    1    8     77   12    3    4
Tone 2     74   61    9   12     43   75   34    4     39   34   20    3
Tone 3     15   10   46   85      7   10  103   36     10   12   42   32
Tone 4     52   16    8   80     46   22    5   83     39   13    8   36

TM stimuli (rows: stimulus tone; columns T1 to T4: response Tone 1 to Tone 4)

            (a) BM listeners      (b) TM listeners      (c) AE listeners
Stimulus   T1   T2   T3   T4     T1   T2   T3   T4     T1   T2   T3   T4
Tone 1    123   22    3    8    131   16    7    2     72   14    5    5
Tone 2     74   40   33    9     66   50   36    4     40   28   23    5
Tone 3     44   28   48   36     20   43   78   15     23   28   37    8
Tone 4     43    5    4  104     40    5   16   95     30   11    3   52

Confusion matrices displaying observed tone identification responses and response rate (%) for (a) BM listeners, (b) TM listeners, and (c) AE listeners at Gate 4.

BM stimuli (rows: stimulus tone; columns T1 to T4: response Tone 1 to Tone 4)

            (a) BM listeners      (b) TM listeners      (c) AE listeners
Stimulus   T1   T2   T3   T4     T1   T2   T3   T4     T1   T2   T3   T4
Tone 1    150    5    0    1    146    7    3    0     83    5    4    4
Tone 2     65   82    9    0     52   92   12    0     30   50   14    2
Tone 3      8    8   86   54      3   13  109   31      6   10   61   19
Tone 4      3    1    4  148      2    3   23  128      8    4   10   74

TM stimuli (rows: stimulus tone; columns T1 to T4: response Tone 1 to Tone 4)

            (a) BM listeners      (b) TM listeners      (c) AE listeners
Stimulus   T1   T2   T3   T4     T1   T2   T3   T4     T1   T2   T3   T4
Tone 1    134   17    5    0    130   20    5    1     78   13    4    1
Tone 2     90   43   23    0     84   54   18    0     44   35   17    0
Tone 3     25    3   67   61     16   12  101   27     14   21   38   23
Tone 4      1    1    5  149      2    1   13  140      4    1    6   85

Confusion matrices displaying observed tone identification responses and response rate (%) for (a) BM listeners, (b) TM listeners, and (c) AE listeners at Gate 6.

BM stimuli (rows: stimulus tone; columns T1 to T4: response Tone 1 to Tone 4)

            (a) BM listeners      (b) TM listeners      (c) AE listeners
Stimulus   T1   T2   T3   T4     T1   T2   T3   T4     T1   T2   T3   T4
Tone 1    149    5    1    1    147    8    1    0     88    4    3    1
Tone 2     49   96   11    0     29  110   17    0     29   59    8    0
Tone 3      9    3  112   32      3    6  133   14      0   12   74   10
Tone 4      0    1    2  153      2    5   17  132      3    1    7   85

TM stimuli (rows: stimulus tone; columns T1 to T4: response Tone 1 to Tone 4)

            (a) BM listeners      (b) TM listeners      (c) AE listeners
Stimulus   T1   T2   T3   T4     T1   T2   T3   T4     T1   T2   T3   T4
Tone 1    146    9    1    0    138   15    2    1     83    9    3    1
Tone 2     87   48   21    0     61   81   14    0     50   29   17    0
Tone 3     21    3   57   62      4    7   97   35      4   13   41   30
Tone 4      0    0    2  128      0    0    6  124      1    0    0   79

Confusion matrices displaying observed tone identification responses and response rate (%) for (a) BM listeners, (b) TM listeners, and (c) AE listeners at Gate 7.

BM stimuli (rows: stimulus tone; columns T1 to T4: response Tone 1 to Tone 4)

            (a) BM listeners      (b) TM listeners      (c) AE listeners
Stimulus   T1   T2   T3   T4     T1   T2   T3   T4     T1   T2   T3   T4
Tone 1    150    6    0    0    145    9    2    0     88    5    2    1
Tone 2     28  122    6    0     27  118   11    0     13   74    9    0
Tone 3      0    5  125   13      3    6  133   14      1    5   71   11
Tone 4      0    0    1  103      0    1    9   94      2    1    1   60

TM stimuli (rows: stimulus tone; columns T1 to T4: response Tone 1 to Tone 4)

            (a) BM listeners      (b) TM listeners      (c) AE listeners
Stimulus   T1   T2   T3   T4     T1   T2   T3   T4     T1   T2   T3   T4
Tone 1    141   12    3    0    134   21    1    0     86    7    2    1
Tone 2     84   56   16    0     70   74   12    0     44   37   15    0
Tone 3      9    2   68   64      1    2  115   25      1   11   45   31
Tone 4      0    0    1  129      0    1    1  128      0    1    0   79

Confusion matrices displaying observed tone identification responses and response rate (%) for (a) BM listeners, (b) TM listeners, and (c) AE listeners at Gate 8.

BM stimuli (rows: stimulus tone; columns T1 to T4: response Tone 1 to Tone 4)

            (a) BM listeners      (b) TM listeners      (c) AE listeners
Stimulus   T1   T2   T3   T4     T1   T2   T3   T4     T1   T2   T3   T4
Tone 1    149    6    0    1    147    7    2    0     87    7    2    0
Tone 2     20  129    7    0     10  133   13    0     14   71   11    0
Tone 3      2   10  118   13      0    6  136    1      0   12   73    3
Tone 4      0    0    0   65      0    0    6   59      0    0    1   39

TM stimuli (rows: stimulus tone; columns T1 to T4: response Tone 1 to Tone 4)

            (a) BM listeners      (b) TM listeners      (c) AE listeners
Stimulus   T1   T2   T3   T4     T1   T2   T3   T4     T1   T2   T3   T4
Tone 1    140   14    2    0    133   22    1    0     81   11    1    3
Tone 2     72   73   11    0     57   92    7    0     36   53    7    0
Tone 3      9    1   35   59      0    1   79   24      2    3   33   26
Tone 4      0    0    2  102      0    0    2  102      0    1    1   62

Confusion matrices displaying observed tone identification responses and response rate (%) for (a) BM listeners, (b) TM listeners, and (c) AE listeners at Gate 9.

BM stimuli (rows: stimulus tone; columns T1 to T4: response Tone 1 to Tone 4)

            (a) BM listeners      (b) TM listeners      (c) AE listeners
Stimulus   T1   T2   T3   T4     T1   T2   T3   T4     T1   T2   T3   T4
Tone 1    154    2    0    0    147    6    3    0     92    3    1    0
Tone 2     15  136    5    0      4  142   10    0      9   78    9    0
Tone 3      1    7  113    9      0    4  125    1      0    5   73    2
Tone 4      0    0    1   51      0    1    3   48      1    1    2   28

TM stimuli (rows: stimulus tone; columns T1 to T4: response Tone 1 to Tone 4)

            (a) BM listeners      (b) TM listeners      (c) AE listeners
Stimulus   T1   T2   T3   T4     T1   T2   T3   T4     T1   T2   T3   T4
Tone 1    134   20    2    0    128   24    4    0     81   11    3    1
Tone 2     64   80   11    1     47   98   11    0     35   47   14    0
Tone 3      1    1   47   42      0    1   77   13      0    5   23   28
Tone 4      0    0    1   90      0    1    0   90      0    0    0   56

Confusion matrices displaying observed tone identification responses and response rate (%) for (a) BM listeners, (b) TM listeners, and (c) AE listeners at Gate 10.

BM stimuli (rows: stimulus tone; columns T1 to T4: response Tone 1 to Tone 4)

            (a) BM listeners      (b) TM listeners      (c) AE listeners
Stimulus   T1   T2   T3   T4     T1   T2   T3   T4     T1   T2   T3   T4
Tone 1    139    4    0    0    138    3    2    0     78    7    1    2
Tone 2      6  146    4    0      1  145   10    0      1   85   10    0
Tone 3      0    1  101    2      1    3  100    0      0    6   56    2
Tone 4      0    0    0   39      0    0    2   37      0    0    1   23

TM stimuli (rows: stimulus tone; columns T1 to T4: response Tone 1 to Tone 4)

            (a) BM listeners      (b) TM listeners      (c) AE listeners
Stimulus   T1   T2   T3   T4     T1   T2   T3   T4     T1   T2   T3   T4
Tone 1    139   16    1    0    136   16    4    0     84    8    3    1
Tone 2     47   98   11    0     36  111    9    0     29   58    9    0
Tone 3      0    2   47   42      0    0   79   12      0    4   30   22
Tone 4      0    0    1   51      0    0    0   52      0    0    0   32

Confusion matrices displaying observed tone identification responses and response rate (%) for (a) BM listeners, (b) TM listeners, and (c) AE listeners at Gate 11.

References

Blicher, D. L., Diehl, R. L., and Cohen, L. B. (1990). Effects of syllable duration on the perception of the Mandarin Tone 2/Tone 3 distinction: evidence of auditory enhancement. Journal of Phonetics, 18, 37-49.

Chao, Y. R. (1948). Mandarin Primer. Cambridge, MA: Harvard University Press.

Chao, Y. R. (1968). A Grammar of Spoken Chinese. Berkeley, CA: University of California Press.

Chappell, H. (2004). Synchrony and diachrony of : A brief history of Chinese dialects. In H. Chappell (Ed.), : synchronic and diachronic perspectives (pp. 1-29). New York: Oxford University Press.

Chen, M. Y. (2000). Tone Sandhi: patterns across Chinese dialects. Cambridge, UK: Cambridge University Press.

Chen, P. (1999). Modern Chinese: history and sociolinguistics. Cambridge, UK: Cambridge University Press.

Cheng, R. L. (1985). A comparison of Taiwanese, Taiwan Mandarin, and Peking Mandarin. Language, 61, 352-377.

Cheng, R. L. (1966). Mandarin phonological structure. Journal of Linguistics, 2(2), 135-158.

Cheng, C.-C. (1973). A quantitative study of Chinese tones. Journal of Chinese Linguistics, 1, 93-110.

Chiung, T. W.-V. (1990). The tonal comparisons and contrasts between Taiwanese and Taiwan Mandarin. Presentation at the Sixth Annual UTA Student Conference in Linguistics. University of Texas at Arlington, Arlington, Texas.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Coster, D. C., and Kratochvil, P. (1984). Tone and stress discrimination in normal Peking dialect speech. In B. Hong (Ed.), New papers in Chinese linguistics (pp. 119-132). Canberra: Australian National University Press.

Cotton, S., and Grosjean, F. (1984). The gating paradigm: A comparison of successive and individual presentation formats. Perception & Psychophysics, 35(1), 41-48.

Crystal, D. (1987). The Cambridge Encyclopedia of Language. Cambridge: Cambridge University Press.


Da, J. (2000). ‘Chinese text computing’ http://lingua.mtsu.edu/chinese-computing/. Murfreesboro, TN: Department of Foreign Languages and Literatures. Middle Tennessee State University.

Deng, D., Feng, S., and , S. (2006). The contrast on tone between Putonghua and Taiwan Mandarin. Xue Xue Bao [Acta Acustica], 31(6), 536-541.

Dow, F. D. M. (1972). An outline of Mandarin phonetics. Canberra, Australia: Australian National University Press.

Duanmu, S. (2010). Chinese syllable structure. To appear in C. Ewen, B. Hume, M. van Oostendorp, and K. Rice (Eds.). Companion to Phonology. Wiley-Blackwell.

Duanmu, S. (2007). The phonology of standard Chinese. New York: Oxford University Press.

Duanmu, S. (2006). Chinese (Mandarin): phonology. In K. Brown (Ed.). Encyclopedia of Language and Linguistics (2nd Edition, pp. 351-355). Oxford, UK: Elsevier Publishing House.

Evans, B. G. and Iverson, P. (2004). Vowel normalization for accent: an investigation of best exemplar locations in northern and southern British English sentences. J. Acoust. Soc. Am., 115, 352-361.

Fon, J. & Chiang, W.-Y. (1999). What does Chao have to say about tones? a case study of Taiwan Mandarin. Journal of Chinese Linguistics, 27 (1), 15-37.

Fon, J., Chiang, W.-Y., & Cheung, C. (2004). Production and perception of two dipping tones (T2 and T3) in Taiwan Mandarin. Journal of Chinese Linguistics, 32 (2), 249-280.

Fox, R. A., and Qi, Y. (1990). Context effects in the perception of Mandarin tone. Journal of Chinese Linguistics, 18, 261-284.

Fox, R. A., and McGory, J. T. (2007). Second language acquisition of a regional dialect of American English by native Japanese speakers. In O.-S. Bohn and M. J. Munro (Eds), Language Experience in Second Language Speech Learning (pp. 117-134). Amsterdam: John Benjamins Press.

Francis, A. L. and Ciocca, V. (2003). Stimulus presentation order and the perception of lexical tones in Cantonese. J. Acoust. Soc. Am., 114 (3), 1611-1621.

Fu, Q.-J., Zeng, F.-G., Shannon, R. V. and Soli, S. D. (1998). Importance of tonal envelope cues in Chinese speech recognition. J. Acoust. Soc. Am., 104 (1), 505-510.

Fu, Q.-J. and Zeng, F.-G. (2000). Identification of temporal envelope cues in Chinese tone recognition. Asia Pacific Journal of Speech, Language, and Hearing, 5, 45-57.

Garding, E., Kratochvil, P., Swantesson, J.-O. and Zhang, J. (1986). Tone 4 and Tone 3 discrimination in modern Standard Chinese. Lang. Speech, 29, 281-293.


Gandour, J. T. (1978). The perception of tone. In V. A. Fromkin (Ed.), Tone: A Linguistic Survey (pp. 41-76). New York: Academic Press.

Gandour, J. T. and Harshman, R. A. (1978). Crosslanguage differences in tone perception: a multidimensional scaling investigation. Lang. Speech, 21, 1-33.

Gandour, J. T. (1984). Tone dissimilarity judgments by Chinese listeners. Journal of Chinese Linguistics, 12, 235-261.

Gordon, R. G. and Grimes, B. F. (Eds.). (2005). Ethnologue: languages of the world (15th ed.). Dallas: SIL International.

Gottfried, T. L., and Suiter, T. L. (1997). Effects of linguistic experience on the identification of Mandarin Chinese vowels and tones. J. Phonetics, 25, 207-231.

Greenberg, S. and Zee, E. (1979). On the perception of contour tones. Perception and Psychophysics, 28 (4), 267-283.

Grosjean, F. (1980). Spoken word recognition processes and the gating paradigm. Perception & Psychophysics, 28(4), 267-283.

Guo, L. (2004). The relationship between Putonghua and Chinese dialects. In M. Zhou (Ed.), Language Policy in the People’s Republic of China: Theory and Practice since 1949 (pp. 45-53). Boston: Kluwer Academic Publishers.

Halle, P. A., Chang, Y.-C., and Best, C. T. (2004). Identification and discrimination of Mandarin Chinese tones by Mandarin vs. French listeners. Journal of Phonetics, 32, 395-421.

Ho, A. T. (1976). The acoustic variation of Mandarin tones. Phonetica, 33, 353-367.

Honorof, D. N. and Whalen, D. H. (2005). Perception of pitch location within a speaker’s F0 range. J. Acoust. Soc. Am., 117 (4), 2193-2200.

Howie, J. M. (1976). Acoustical studies of Mandarin vowels and tones. New York: Cambridge University Press.

Halberstam, B. and Raphael, L. (2004). Vowel normalization: the role of fundamental frequency and upper formants. Journal of Phonetics, 32, 423-434.

Hillenbrand, J., Getty, L. A., Clark, M. J. & Wheeler, K. (1995). Acoustic characteristics of American English vowels. J. Acoust. Soc. Am., 97(5), 3099-3111.

Huang, Shuan-fan (黃宣範). (1993). 語言,社會與族群意識. Yǔyán Shèhuì yǔ Qunzúnzú Yìshì (Language, Society and Group Identity). Taipei: Crane Publishing Co.

Jeng, J.-Y., Weismer, G. and Kent, R. D. (2006). Production and perception of Mandarin tone in adults with cerebral palsy. Clinical Linguistics & Phonetics, 20(1), 67-87.

Jenkins, J. J., Strange, W., and Miranda, S. (1994). Vowel identification in mixed-speaker silent-center syllables. J. Acoust. Soc. Am., 95(2), 1030-1043.


Johnson, K. (1990). The role of perceived speaker identity in F0 normalization of vowels. J. Acoust. Soc. Am., 88, 642-654.

Johnson, K. & Mullennix, J. W. (Eds.) (1997). Talker Variability in Speech Processing. San Diego: Academic Press.

Johnson, K. (2005). Speaker Normalization in speech perception. In Pisoni, D. B. & Remez, R. (Eds). The Handbook of Speech Perception (pp. 363-389). Oxford: Blackwell Publishers.

Jongman, A., Wang, Y., Moore, C., and Sereno, J. A. (2006). Perception and production of Mandarin Chinese tones. In P. Lin, H. Tan, E. Bates, and O. J. L. Tzeng (Eds.). The Handbook of East Asian Psycholinguistics (pp. 209-217). NY: Cambridge University Press.

Kratochvil, P. (1968). The Chinese Language Today. London: Hutchinson University Library.

Kratochvil, P. (1984). Phonetic tone sandhi in Beijing dialect stage speech. Cahiers de Linguistique Asie Orientale, 13, 135-174.

Kubler, C. (1985). The development of Mandarin in Taiwan: A case study of language contact. Taipei: Student Book Co. Ltd.

Kuo, Y.-C., Rosen, S. and Faulkner, A. (2008). Acoustic cues to tonal contrasts in Mandarin: Implications for cochlear implants. J. Acoust. Soc. Am., 123 (5), 2815-2824.

Ladefoged, P. and Broadbent, D. E. (1957). Information conveyed by vowels. J. Acoust. Soc. Am., 29, 98-104.

Lai, Y. and Zhang, J. (2008). Mandarin lexical tone recognition: The gating paradigm. Kansas Working Papers in Linguistics, 30, 183-194.

Leather, J. (1983). Speaker normalization in perception of lexical tone. Journal of Phonetics, 11, 373-382.

Lee, C.-Y. (2009). Identifying isolated, multispeaker Mandarin tones from brief acoustic input: A perceptual and acoustic study. Journal of the Acoustical Society of America, 125, 1125-1137.

Lee, C.-Y., Tao, L., and Bond, Z. S. (2009). Speaker variability and context in the identification of fragmented Mandarin tones by native and nonnative listeners. J. Phonetics, 37, 1-15.

Lee, C.-Y., and Hung, T.-H. (2008). Identification of Mandarin tones by English-speaking musicians and nonmusicians. Journal of the Acoustical Society of America, 124, 3235-3248.


Lee, C.-Y., Tao, L., and Bond, Z. S. (2008). Identification of acoustically modified Mandarin tones by native listeners. Journal of Phonetics, 36, 537-563.

Li, A., Xiong, Z. and Wang, X. (2006). Contrastive study on tonal patterns between accented and standard Chinese. In Qiang Huo, Bin Ma and Eng-Siong Chng (Eds.). Proceedings of Chinese Spoken Language Processing: 5th International Symposium, ISCSLP 2006, Singapore, December 13-16, 2006.

Ladefoged, P. (2003). Phonetic Data Analysis: An introduction to fieldwork and instrumental techniques. Malden, MA: Blackwell Publishing Ltd.

Li, A. and Wang, X. (2003). A contrastive investigation of standard Mandarin and accented Mandarin. In Proceedings of the 5th International Symposium on Chinese Spoken Language 2006.

Li, D. C. S. (2006). Chinese as a lingua franca in . Annual Review of Applied Linguistics, 26, 149-176.

Li, F.-K. (1973). Language and dialects. Journal of Chinese Linguistics, 1 (1), 1-13.

Lin, H.-B., and Repp, B. H. (1989). Cues to the perception of Taiwanese tones. Speech, 32(1), 25-44.

Lin, M.-C. (1965). Yingao xianshiqi yu Putonghua shengdiao yingao texing [The pitch indicator and the pitch characteristics of tones in Standard Chinese]. Acta Acoustica Sinica, 8-15.

Lin, M.-C. (1988). Putonghua shengdiao de shengxue texing he zhijue zhengzhao [The acoustic characteristics and perceptual cues of tones in Standard Chinese]. Zhongguo Yuwen [Chinese Linguistics], 204, 182-193.

Lin, M.-C. (1995). A perceptual study on the domain of tones in Standard Chinese. Chinese Journal of Acoustics, 14 (4), 350-357.

Lin, Y.-H. (2008). Variable vowel adaptation in Standard Mandarin loanwords. J. East Asian Linguist, 17, 363-380.

Liu, S. and Samuel, A. G. (2004). Perception of Mandarin lexical tones when F0 information is neutralized. Language and Speech, 47 (2), 109-138.

Macmillan, N. A., and Creelman, C. D. (2005). Detection Theory: A user’s guide (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates, Inc., Publishers.

Massaro, D. W., Cohen, M. M., and Tseng, C.-Y. (1985). The evaluation and integration of pitch height and pitch contour in lexical tone perception in Mandarin Chinese. Journal of Chinese Linguistics, 13 (2), 267-289.

Moore, C. B. and Jongman, A. (1997). Speaker normalization in the perception of Mandarin Chinese tones. J. Acoust. Soc. Am., 102(3), 1864-1877.

Mullennix, J. W., Pisoni, D. B., and Martin, C. S. (1989). Some effects of talker variability on spoken word recognition. J. Acoust. Soc. Am., 85, 365-378.


Nearey, T. M. (1989). Static, dynamic, and relational properties in vowel perception. J. Acoust. Soc. Am., 85 (5), 2088-2113.

Norman, J. (1988). Chinese. Cambridge, UK: Cambridge University Press.

Peterson, G. and Barney, H. (1952). Control methods used in a study of the vowels. J. Acoust. Soc. Am., 24, 175-184.

Rogers, H. (2005). Writing systems: A linguistic approach. Malden, MA: Blackwell Publishing Ltd.

Sagart, L., Hallé, P., De Boysson-Bardies, B., and Arabia-Guidet, C. (1986). Tone production in modern Standard Chinese: An electromyographic investigation. Cahiers de Linguistique Asie Orientale, 15, 205-221.

Sandel, T. L. (2003). Linguistic capital in Taiwan: The KMT’s Mandarin language policy and its perceived impact on language practices of bilingual Mandarin and Tai-gi speakers. Language in Society, 32, 523-551.

Sanders, R. M. (2008a). Citation tone change in Taiwan Mandarin: Should teachers remain tone deaf? Journal of the Chinese Language Teachers Association, 43, 1-16.

Sanders, R. M. (2008b). Tonetic sound change in Taiwan Mandarin: The case of Tone 2 and Tone 3 citation contours. In M. K. M. Chan and H. Kang (Eds.). Proceedings of the 20th North American Conference on Chinese Linguistics (NACCL-20), Volume 1 (pp. 87-107).

Shen, X. S. (1989). Interplay of the four citation tones and intonation in Mandarin Chinese. Journal of Chinese Linguistics, 17 (1), 61-74.

Shen, X. S. (1990a). On Mandarin Tone 4. Australian Journal of Linguistics, 10, 41-59.

Shen, X. S. (1990b). Tonal coarticulation in Mandarin. Journal of Phonetics, 18, 281-295.

Shen, X. S. (1992). On tone sandhi and tonal coarticulation. Acta Linguistica Hafniensia, 24, 131-152.

Shen, X. S. and Lin, M.-C. (1991). A perceptual study of Mandarin tones 2 and 3. Language and Speech, 34 (2), 145-156.

Shen, X. S., Lin, M.-C., and Yan, J. (1993). F0 turning point as an F0 cue to tonal contrast: A case study of Mandarin tones 2 and 3. J. Acoust. Soc. Am., 93 (4), 2241-2243.

Shi, B. and Zhang, J. (1987). Vowel intrinsic pitch in standard Chinese. In Proceedings XIth International Congress of Phonetic Sciences, 1 (pp. 142-145). Tallinn, Estonia: Academy of Sciences of the Estonian SSR.

Shi, F. and Deng, D. (2006). Putonghua yu Taiwan de yuyin duibi [A phonetic contrast of Mainland Mandarin and Taiwan Mandarin]. In D.-A. Ho et al. (Eds.). Linguistic studies in Chinese and neighboring languages: Festschrift in honor of Professor Pang-hsin Ting on his 70th birthday (pp. 371-393). Language and Linguistics Monograph Series Number W-6, 1.

Shih, C. (1987). The phonetics of the Chinese tonal system. Technical memo, AT&T Bell Labs.

Shih, C. (1988). Tone and intonation in Mandarin. Working Papers, Cornell Phonetics Laboratory No. 3, 83-109.

Studebaker, G. A. (1985). A "rationalized" arcsine transform. Journal of Speech and Hearing Research, 28, 455-462.

Torgerson Jr., R. C. (2005). A comparison of Beijing and Taiwan Mandarin tone register: An acoustic analysis of three native speech styles. M.A. thesis, Brigham Young University.

Tseng, C.-C. (2004). Prosodic properties of intonation in two major varieties of Mandarin Chinese: Mainland China vs. Taiwan. International Symposium on Tonal Aspects of Languages: Emphasis on Tone Languages. Beijing, China. March 28-30.

Tseng, C.-Y. (1981). An acoustic phonetic study on tones in Mandarin Chinese. Ph.D. dissertation, Brown University.

Tyler, L. K., and Wessels, J. (1985). Is gating an on-line task? Evidence from naming latency data. Perception & Psychophysics, 38 (3), 217-222.

The Government Information Office (2009). The Republic of China Yearbook 2009. Taipei, Taiwan, R. O. C.: Government Information Office. Retrieved from http://www.gio.gov.tw/taiwan-website/5-gp/yearbook/ch02.html

Verbrugge, R. R., Strange, W., Shankweiler, D. P., and Edman, T. R. (1976). What information enables a listener to map a talker’s vowel space? J. Acoust. Soc. Am., 60 (1), 198-212.

Wang, W. S-Y. (1967). Phonological features of tones. International Journal of American Linguistics, 33 (2), 93-105.

Wang, W. S-Y. and Li, K.-P. (1967). Tone 3 in Pekinese. Journal of Speech and Hearing Research, 10, 629-636.

Wang, W. S-Y., Li, K. P., and Brotzman, R. L. (1963). Research on Mandarin Phonology. Project on Linguistic Analysis, Ohio State University Research Foundation, 6r, 1-63.

Wang, Y., Spence, M. M., Jongman, A., and Sereno, J. A. (1999). Training American listeners to perceive Mandarin tones. J. Acoust. Soc. Am., 106 (6), 3649-3658.

Wong, P. C. M., and Diehl, R. L. (2003). Perceptual normalization for inter- and intratalker variation in Cantonese level tones. Journal of Speech, Language, and Hearing Research, 46, 413-421.


Whalen, D. H., and Xu, Y. (1992). Information for Mandarin tones in the amplitude contour and in brief segments. Phonetica, 49, 25-47.

Wu, N. and Shu, H. (2003). The gating paradigm and spoken word recognition of Chinese. Acta Psychologica, 35 (5), 582-590.

Wurm, S. A. et al. (1987-1991). Language Atlas of China. Hong Kong: Longman.

Xu, Y. (1994). Production and perception of coarticulated tones. J. Acoust. Soc. Am., 95, 2240-2253.

Xu, Y. (1997). Contextual tonal variations in Mandarin. Journal of Phonetics, 25, 61-83.

Xu, C. X. and Xu, Y. (2003). Effects of consonant aspiration on Mandarin tones. Journal of the International Phonetic Association, 33, 165-181.

Yuan, J. (1989). Hànyǔ fāngyán gàiyào [An Overview of Chinese Dialects], 2nd Edition. Beijing: Wenzi Gaige Chubanshe (Writing-System Reformation Publishing House).

Zee, E. (1980). A spectrographic investigation of Mandarin tone sandhi. UCLA Working Papers in Phonetics, 49, 98-116.

Zhou, M. L. (2001). The spread of Putonghua and language attitude changes in Shanghai and Guangzhou, China. Journal of Asian Pacific Communication, 11, 231-253.

Zhou, M. (2001a). The spread of Putonghua and language attitude changes in Shanghai and Guangzhou, China. Journal of Asian Pacific Communication, 11 (2), 231-253.

Zhou, M. and Ross, H. (2004). Introduction: The context of the theory and practice of China’s language policy. In M. Zhou (Ed.). Language Policy in the People’s Republic of China: Theory and Practice since 1949 (pp. 1-18). Boston: Kluwer Academic Publishers.

Zhu, X.-N. (2001). Chinese Languages: Mandarin. In J. Garry and C. Rubino (Eds.). Facts about the world’s languages, An encyclopedia of the world’s languages: past and present (pp. 146-150).
