Prosody and Speaker State: Paralinguistics, Pragmatics, and Proficiency
Jackson J. Liscombe
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences
COLUMBIA UNIVERSITY
2007

© 2007
Jackson J. Liscombe
All Rights Reserved

ABSTRACT
Prosody and Speaker State: Paralinguistics, Pragmatics, and Proficiency
Jackson J. Liscombe
Prosody—suprasegmental characteristics of speech such as pitch, rhythm, and loudness—is a rich source of information in spoken language and can tell a listener much about the internal state of a speaker. This thesis explores the role of prosody in conveying three very different types of speaker state: paralinguistic state, in particular emotion; pragmatic state, in particular questioning; and the state of spoken language proficiency of non-native English speakers.
Paralinguistics. Intonational features describing pitch contour shape were found to discriminate emotion in terms of positive and negative affect. A procedure is described for clustering groups of listeners according to perceptual emotion ratings, fostering further understanding of the relationship between acoustic-prosodic cues and emotion perception.
Pragmatics. Student questions in a corpus of one-on-one tutorial dialogs were found to be signaled primarily by phrase-final rising intonation, an important cue used in conjunction with lexico-pragmatic cues to differentiate the high rate of observed declarative questions from proper declaratives. The automatic classification of question form and function is explored.
Proficiency. Intonational features including syllable prominence, pitch accent, and boundary tones were found to correlate with language proficiency assessment scores at a strength equal to that of traditional fluency metrics. The combination of all prosodic features further increased correlation strength, indicating that suprasegmental information encodes different aspects of communicative competence.

Contents
I INTRODUCTION 1
II PARALINGUISTICS 4
1 Previous Research on Emotional Speech 6
2 The EPSAT and CU EPSAT Corpora 12
3 Extraction of Emotional Cues 15
4 Intended Emotion 18
  4.1 Automatic classification ...... 20
  4.2 Emotion profiling ...... 21
  4.3 Discussion ...... 27
5 Perceived Emotion 30
  5.1 Perception survey ...... 30
  5.2 Rating correlation ...... 31
  5.3 Label correspondence ...... 34
  5.4 Rater behavior ...... 37
  5.5 Rater agreement ...... 39
  5.6 Rater clustering ...... 43
  5.7 Cluster profiling ...... 47
  5.8 Automatic classification ...... 53
  5.9 Emotion profiling ...... 58
  5.10 Discussion ...... 70
6 Dimensions of Perceived Emotion 72
7 Abstract Pitch Contour 78
  7.1 Annotation ...... 78
  7.2 Analysis ...... 80
8 Non-acted Emotion 85
  8.1 The HMIHY Corpus ...... 86
  8.2 Extraction of Emotional Cues ...... 87
  8.3 Automatic Classification ...... 90
  8.4 Discussion ...... 91
9 Discussion 92
III PRAGMATICS 95
10 Previous Research on Questions 97
  10.1 The syntax of questions ...... 97
  10.2 The intonation of questions ...... 98
  10.3 Declarative questions ...... 100
  10.4 The function of questions ...... 103
  10.5 Student questions in the tutoring domain ...... 105
  10.6 Automatic question identification ...... 107
11 The HH-ITSPOKE Corpus 110
12 Question Annotation 113
  12.1 Question type ...... 113
  12.2 Question-bearing turns ...... 116
13 Question Form and Function Distribution 117
14 Extraction of Question Cues 122
15 Learning Gain 127
16 Automatic Classification of QBTs 129
  16.1 QBT vs. ¬QBT ...... 131
  16.2 QBT form ...... 136
  16.3 QBT function ...... 140
17 Discussion 146
IV PROFICIENCY 151
18 Non-native Speaking Proficiency 152
19 Communicative Competence 154
20 Previous Research 157
21 The DTAST Corpus 161
22 Annotation 163
  22.1 Intonation ...... 163
  22.2 Rhythm ...... 164
23 Automatic Scoring Metrics 166
  23.1 Fluency ...... 166
  23.2 Intonation ...... 168
  23.3 Rhythm ...... 170
  23.4 Redundancy ...... 171
24 Automatic Estimation of Human Scores 176
25 Automatic Detection of Prosodic Events 179
  25.1 Feature extraction ...... 180
  25.2 Performance ...... 183
26 Discussion 187
V CONCLUSION 190
VI APPENDICES 193
A CU EPSAT Corpus 194
B Mean Feature Values Per Emotion 197
  B.1 EPSAT intended ...... 198
  B.2 CU EPSAT perceived: All raters ...... 200
  B.3 CU EPSAT perceived: Cluster 1 ...... 201
  B.4 CU EPSAT perceived: Cluster 2 ...... 202
  B.5 CU EPSAT perceived: Plots ...... 203
VII BIBLIOGRAPHY 206
List of Figures
4.1 Intended emotion distinctive features (Part I) ...... 24
4.2 Intended emotion distinctive features (Part II) ...... 25
5.1 Screen-shot of perceived emotion survey ...... 32
5.2 Histogram of perceived non-emotion ...... 40
5.3 Histogram of degree of perceived emotion ...... 40
5.4 Histogram of inter-rater κqw (perceived emotion) ...... 42
5.5 Histogram of κqw for clusters c1 and c2 (perceived emotion) ...... 46
5.6 Histogram of mean ratings per rater cluster (perceived emotion) ...... 48
5.7 Examples of f0-rising quantizations ...... 61
5.8 Examples of f0-slope/f0-rising quantizations ...... 65
6.1 Screen-shot of dimensional emotion survey ...... 73
6.2 Plot of emotion/dimension correlation ...... 77
8.1 Sample HMIHY dialog ...... 86
11.1 HH-ITSPOKE excerpt ...... 112
11.2 ITSPOKE screenshot ...... 112
List of Tables
1.1 Acoustic correlates of emotions ...... 8
2.1 Intended emotion labels ...... 13
2.2 Perceived emotion labels ...... 13
3.1 EPSAT acoustic-prosodic features ...... 16
4.1 Intended monothetic emotion classification performance ...... 20
4.2 Intended emotion mean feature values ...... 22
4.3 Full intended emotion profiles ...... 26
4.4 Minimal intended emotion profiles ...... 26
5.1 Correlation of perceived emotions ...... 33
5.2 Correspondence of intended and perceived emotions ...... 35
5.3 Distribution of perceived emotion ratings ...... 37
5.4 Clusters of raters of perceived emotion ...... 44
5.5 Age distribution of rater clusters (perceived emotion) ...... 48
5.6 Distribution of speaker/subject gender (perceived emotion) ...... 50
5.7 Perceived emotion frequency of differing samples ...... 50
5.8 Intended emotion frequency of differing samples ...... 51
5.9 Cluster comparison of mean perceived emotion ratings ...... 52
5.10 Classification performance of intended and perceived emotions ...... 54
5.11 Classification performance of perceived emotions ...... 58
5.12 Minimal perceived emotion profiles (unclustered) ...... 60
5.13 Minimal perceived emotion profiles (c1) ...... 63
5.14 Minimal perceived emotion profiles (c2) ...... 66
5.15 Generalized emotion profiles (c1 and c2) ...... 67
5.16 Comparison of full intended and perceived emotion profiles ...... 68
6.1 Correlation of emotion dimension and acoustic feature ...... 74
6.2 Correlation of perceived discrete emotion and dimension ratings ...... 75
7.1 ToBI tone distribution in CU EPSAT ...... 80
7.2 Average emotion rating per pitch accent ...... 81
7.3 Average dimension rating per pitch accent ...... 82
7.4 Average rating per pitch contour ...... 83
8.1 Emotion classification accuracy of HMIHY ...... 91
12.1 Question form labels ...... 115
12.2 Question function mapping ...... 115
12.3 Question function labels ...... 115
13.1 Question form and function distribution ...... 118 13.2 Question