Data mining Mandarin tone contour shapes

Shuo Zhang
CED Applied Research, Bose Corporation
The Mountain Rd, Framingham, MA 01701
shuo [email protected]

Abstract

In spontaneous speech, Mandarin tones that belong to the same tone category may exhibit many different contour shapes. We explore the use of data mining and NLP techniques for understanding the variability of tones in a large corpus of Mandarin newscast speech. First, we adapt a graph-based approach to characterize the clusters (fuzzy types) of tone contour shapes observed in each tone n-gram category. Second, we show correlations between these realized contour shape types and a bag of automatically extracted linguistic features. We discuss the implications of the current study within the context of phonological and information theory.

1 Introduction

One of the central phenomena of interest in lexical tone production is the deviation of surface realizations from the canonical templates of tone categories (Xu, 1997; Prom-on et al., 2009; Surendran, 2007). In a tone language, tone categories differing in pitch movement can distinguish different lexical meanings of a syllable (e.g., in Mandarin, the syllable "ma" spoken with a high level pitch contour means "mother", whereas the same syllable spoken with a falling pitch contour means "to scold"). Even though each tone category is defined by a general pitch contour profile (level, rising, falling, etc.), tones typically exhibit great variability in spontaneous speech. As an example, Figure 1 shows many different realizations of Mandarin Tone 1, observed during speech production experiments in the lab.

Previous works in tone production, speech synthesis, and tone recognition have investigated this variability by asking questions such as: (1) What factors contribute to the variability in tone production (Xu, 1997)? (2) How can we model the tone contour trajectory in synthesized speech (Prom-on et al., 2009)? (3) What features can we use to improve the accuracy of automatic tone recognition (Surendran, 2007)? Each of these works was driven by a particular set of theoretical or practical motivations and offered a slice of understanding of the problem.

In this work, we are interested in looking at the tone variability problem from a data mining perspective: we explore the structure and distribution of tone contour shapes within a large amount of data. By taking a data mining approach, we contrast our work with works that focus on tone recognition or tone learning (whether by machine or by human): we seek to extract tone patterns of empirical significance from a large data set of tones from spontaneous speech.

Working with the MCPST corpus (see Section 3) of Mandarin newscast speech (about 100,000 tones), we ask two questions: (1) For each tone category, what are the (coarse) types/classes of tone contour shapes we observe in this corpus? (2) For a particular tone category, what linguistic factors caused the same tone to be realized as these different types of shapes?

Inspired by works in natural language processing (NLP), we further extend these research questions in two directions. First, we extend our investigation of tone categories to series of n consecutive tones, or tone n-grams. The n-gram is a classic technique in NLP language modeling (footnote 1); in the current context, we study tone n-grams because of the importance of context in tone variability (Xu, 1997): a tone category may be realized differently depending on its neighboring tones. What can we learn from data mining tone contour shapes for tone unigrams, bigrams, and trigrams?

Footnote 1: Readers may refer to the classic NLP textbook chapter if needed: https://web.stanford.edu/~jurafsky/slp3/3.pdf

Second, to study the interface between linguistic structure and prosody in the MCPST data, we use automatic methods (NLP and other) to extract linguistic features from the text, including Named Entity Recognition (NER), coreference resolution, part-of-speech (POS) tagging, dependency parsing, and other phonological, morphological and contextual features. In order to find out the importance of these linguistic factors in shaping tone variability, we run the following experiment: given a particular tone (or tone n-gram) category, how well can we predict the type of tone contour shape it will take in running speech, using linguistic features that exclude information about the pitch contour f0 values?

Figure 1: Samples of Mandarin Tone 1 by the same speaker in lab speech. Data source: (Xu, 1997). The canonical contours of Mandarin Tones 1, 2, 3, and 4 are: high level, low rising, low dipping, and high falling, where low and high denote the pitch starting point of the tone.

Previous works showed that many linguistic factors (such as focus, topic, etc.) affect tone production or prosody (see Section 2). In this work, we extend this to a more comprehensive set of linguistic features, motivated by an information-theoretic account of tone production. We hypothesize that there exist information content inequalities resulting from the probability distributions of events in various linguistic domains (phonological, semantic, etc.). These inequalities affect speakers' speech production, resulting in gradient variants of tone contour shapes within a given tone category. We investigate the relative importance of these factors in predicting the types of contour shapes any particular tone n-gram will take.

The rest of the paper is organized as follows. Section 2 discusses relevant previous works. Next, we describe the data used in this paper in Section 3. In order to characterize the types of contour shapes a tone n-gram will take, we develop a method to derive clusters of tone contour shape types using network analysis (Section 4). In Section 5, we discuss feature engineering and feature extraction from various linguistic domains (syntax, morphology, semantics, information structure, etc.). Section 6 reports machine learning experiments and results on predicting tone contour shape types and the analysis of feature importance. Finally, in Section 7 we discuss the implications of this work in the context of information theory and the phonological theory of speech and tone production.

2 Related Work

There has been a long line of research on the variability of tone contour shapes as well as on the interface between other linguistic factors and prosody (Li, 2009; Buring, 2013). In linguistic research on Mandarin tones, most works have focused on the effect of local tonal context (e.g., neighboring tones and pitch range, such as (Gauthier et al., 2007; Xu, 1997)) and broader context (e.g., focus, topic, information structure, and long-term f0 variations, such as (Xu et al., 2004; Liu et al., 2006; Wang and Xu, 2011)). The data in these works usually consisted of a small number of tone observations obtained in speech production experiments in the lab. They have informed later works on improving the performance of supervised or unsupervised tone recognition ((Levow, 2005; Surendran, 2007), etc.). Other works, such as (Surendran, 2007) and (Yu, 2011), have shown the importance of signals in speech outside of f0 for tone recognition and learning.

In the PENTA (Xu, 1997, 2005) and qTA (quantitative target approximation) models (Prom-on et al., 2009), the surface f0 contour is viewed as the result of asymptotic approximation to an underlying pitch target, which can be a static target (High or Low) or a dynamic target (Rise or Fall). An important contribution of qTA is that it provides a mathematical model of the process that generates a particular realization of a tone template, defined by a pitch target (with slope and intercept parameters) and an acceleration rate. As such, the specific shape of the contour depends on the starting pitch, the ending pitch target, and how fast the pitch moves.
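For concreteness, the qTA mechanism just described can be written compactly. The rendering below is our paraphrase of the model in Prom-on et al. (2009), with symbol names of our choosing: the underlying pitch target is a line, and the surface f0 approaches it with an exponentially decaying transient.

    x(t) = m t + b
    f_0(t) = x(t) + (c_1 + c_2 t + c_3 t^2) e^{-\lambda t}

Here m and b are the slope and intercept of the pitch target, lambda is the rate of target approximation, and c_1, c_2, c_3 are set by the f0 value, velocity, and acceleration carried over from the end of the previous syllable; the realized shape therefore depends on the starting pitch, the target, and how fast the target is approached, exactly the ingredients mentioned above.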

A fundamental theoretical question is how we should view the underlying factors that account for tone surface variability. Previous research offers two opposing theories on this question. The first approach (Cooper et al., 1985; Cooper and Sorenson, 1981) postulates a direct link between communicative functions and surface acoustic forms by finding the acoustic correlates of certain communicative functions, such as focus, givenness, newness, questions, etc. Such approaches have met criticisms from phonologists (Ladd, 1996; Liberman and Pierrehumbert, 1984), who argue that prosodic meanings are not directly mapped onto acoustic correlates. Instead, intonational meanings should first be mapped onto phonological structures, which are in turn linked to surface acoustic forms through phonetic implementation rules. In this work, we attempt to show a new middle ground between these two theories.

3 Data

All the data in this work come from the Mandarin Chinese Phonetic Segmentation and Tone (MCPST) corpus (footnote 2), developed by the Linguistic Data Consortium (LDC). It contains 7,849 Mandarin Chinese newscast speech "utterances" and their phonetic segmentation and tone labels. Utterances are defined as the time-stamped between-pause units in the transcribed news recordings. We used the auto-correlation algorithm implemented in Praat (footnote 3) for f0 (pitch) estimation from the speech audio signal. We obtained f0 pitch contour data for 100,161 syllables. After pre-processing the pitch tracks (e.g., speaker-dependent normalization, f0 outlier detection and removal, pitch interpolation, downsampling), we generate tone unigram, bigram and trigram f0 data sets, giving rise to a total of 75 unigram (5), bigram (16), and trigram (54) data sets for the prediction task (footnote 4). The total number of tone n-grams in these data sets is on the order of 250k. All tone unigram, bigram, and trigram f0 vectors are downsampled to lengths of 30, 100, and 200 samples respectively.

Footnote 2: https://catalog.ldc.upenn.edu/LDC2015S05
Footnote 3: Boersma, Paul and Weenink, David (2019). Praat: doing phonetics by computer [Computer program]. Version 6.0.48, retrieved 17 February 2019 from http://www.praat.org/
Footnote 4: Mandarin Chinese has four regular tone categories plus one neutral tone. Since neutral tones occur infrequently, we did not include them in the analysis of tone bigrams and trigrams due to data sparsity in the conditional distributions. Similarly, we also excluded n-gram categories where the data points are sparse.

4 Deriving tone n-gram contour shape types through network analysis

4.1 Problem formulation

We define a tone n-gram category as a consecutive sequence of n tones ti, for i = 1, ..., n, where ti ∈ {0, 1, 2, 3, 4}, the five tone categories of Mandarin. In this paper we restrict n to {1, 2, 3}.

Given the set S (represented as a network) that contains all observations of f0 vectors belonging to a particular tone n-gram category, an algorithm A, defined in this section, partitions S into k clusters c1, c2, ..., ck, where all tone contours within ci are highly similar to each other, and members of ci are maximally distinct from those of cj for i ≠ j. For a particular tone n-gram category, we define the centroid f0 vector of ci to be its tone contour shape type ti. If we denote by C the set of types {c1, c2, ..., ck}, our goal in this section is to describe the algorithm A that learns a function g : S → C. We adapt an algorithm first proposed by Gulati et al. (2016), which has been shown to be effective in identifying clusters in time-series data such as pitch contours. It also has several advantages over baseline algorithms such as k-means clustering, including outlier pruning and no need to determine the number of clusters beforehand.

At a high level, this method represents all tone contours in a data set as a fully connected network G. It then filters G using heuristics based on the pairwise similarity of tone contour shapes. After the filtering step, only those nodes whose similarity is beyond a threshold remain connected. It then leverages network community detection algorithms to optimize the community structure, effectively deriving the tone contour shape types T = {t1, t2, ..., tk}.

4.2 Network construction

To construct the network described above, we first partition all data in the MCPST corpus by tone n-gram category. For each category, we construct a network where each node stores the f0 vector of an observation, and the edge between two nodes holds the Euclidean distance between the two f0 vectors as its weight. We thus derive an undirected, weighted and fully connected network G of tone n-gram patterns for each n-gram category.

4.3 Network filtering

In this step, we take the fully connected network G of a given tone n-gram category and use a principled method to remove edges from the network. Our goal is to find an appropriate threshold so that all edges whose weights (the distance between two tone n-gram f0 vectors) are greater than the threshold are cut. In the resulting network, only those nodes representing sufficiently similar tone contour shapes remain connected.
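As a concrete illustration of the construction in Sections 4.2 and 4.3, the following Python sketch (our code, not the authors'; it assumes numpy, scipy, and networkx) builds the fully connected weighted network for one tone n-gram category and applies a distance threshold:

    import numpy as np
    import networkx as nx
    from scipy.spatial.distance import pdist, squareform

    def build_contour_network(f0_vectors):
        """Fully connected, weighted, undirected network for one tone n-gram
        category: one node per f0 vector, edge weight = Euclidean distance
        between the two contours (Section 4.2)."""
        dist = squareform(pdist(np.asarray(f0_vectors), metric="euclidean"))
        n = len(f0_vectors)
        G = nx.Graph()
        G.add_nodes_from(range(n))
        for i in range(n):
            for j in range(i + 1, n):
                G.add_edge(i, j, weight=dist[i, j])
        return G

    def apply_threshold(G, threshold):
        """Keep only edges whose distance is below the threshold (Section 4.3);
        the result is treated as an unweighted graph."""
        H = nx.Graph()
        H.add_nodes_from(G.nodes())
        H.add_edges_from((u, v) for u, v, d in G.edges(data=True)
                         if d["weight"] < threshold)
        return H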

Specifically, we decide the threshold value by a six-step process: (1) we search for the appropriate threshold in the set of values Φ ∈ {1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5} for bigrams and trigrams, and Φ ∈ {0.2, 0.4, 0.6, 0.8} for unigrams; these values are empirically chosen; (2) we iterate over this set of values and each time apply a threshold to the network; (3) after applying the threshold, we convert the network to an unweighted network G0 in which only those nodes whose distance is below the threshold remain connected; (4) we produce a randomized network Gr by randomly swapping edges of G0 k times while keeping the degree of the nodes constant, where k is equal to the number of edges in G0; this can be seen as producing a maximally random network given the degree distribution of the current network; (5) we compute the difference in Clustering Coefficient (CC) between G0 and Gr; (6) after repeating this for all values in Φ, we pick the threshold that yields the largest difference of CCs (footnote 5).

Footnote 5: Clustering coefficient (CC) measures the extent to which the nodes in a network tend to cluster together. Intuitively, it expresses how saturated the network is, i.e., how many of the possible connections are actually expressed. The CC for a network of k nodes and n edges is computed as CC = 2n / (k(k − 1)).

Figure 2: Histogram of number of shape classes in n-gram data sets.
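A rough sketch of this six-step search, under our reading of the procedure (the candidate thresholds and the number of swaps follow the text; apply_threshold is the helper from the sketch above, and the choice of networkx routines is ours):

    import networkx as nx

    def select_threshold(G, candidates):
        """Pick the threshold whose filtered network differs most in clustering
        coefficient from a degree-preserving random rewiring."""
        best_t, best_delta = None, -1.0
        for t in candidates:                       # steps (1)-(2)
            G0 = apply_threshold(G, t)             # step (3): unweighted filtered net
            k = G0.number_of_edges()
            if k < 2:
                continue
            Gr = G0.copy()                         # step (4): swap edges k times,
            nx.double_edge_swap(Gr, nswap=k, max_tries=10 * k)  # degrees fixed
            # step (5): networkx's average clustering coefficient; note that the
            # formula given in footnote 5 corresponds to graph density (nx.density).
            cc_diff = abs(nx.average_clustering(G0) - nx.average_clustering(Gr))
            if cc_diff > best_delta:               # step (6): largest CC difference
                best_t, best_delta = t, cc_diff
        return best_t

    # bigrams/trigrams: select_threshold(G, [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5])
    # unigrams:         select_threshold(G, [0.2, 0.4, 0.6, 0.8])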
This indicates how well these tone contour shape types can be predicted with complete infor- 4.6 Evaluation of tone contour profile classes mation of its pitch trajectory, which will serve as an upper bound to our next prediction task using We have leveraged the intrinsic structure of the linguistic factors without information about pitch tone data to derive the tone contour shape types for movements f0 values. 5Clustering coefficient (CC) measures the extent to which the nodes in a network tend to cluster together. Intuitively, it 5 Linguistic features expresses how saturated the network is — how many of the possible connections are actually expressed. The CC for a network of k nodes and n edges is computed as: After we obtained tone contour shape types emerged from each tone n-gram category in the 2n CC = corpus, we now describe the linguistic features k(k − 1) used to predict which type of shape the tone n-

147 2.0 0 0 0 1.00 1 1 1 0.6 1.5 2 3 2 0.75 3 4 3 0.4 1.0 4 6 4 5 7 0.50 0.5 6 0.2 8

0.0 0.25 0.0

0.2 0.5 0.00 1.0 0.4 0.25 1.5 0.6 0.50 2.0 0.8 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 20 40 60 80 100

(a) Tone 1 (b) Tone 2 (c) Tone 2-3

0 0 0 2 0.6 1 1 0.75 1 2 3 2 3 0.4 5 3 0.50 1 4 7 4 5 6 0.2 0.25

0 0.0 0.00

0.2 0.25 1 0.4 0.50

0.75 2 0.6

0 20 40 60 80 100 0 25 50 75 100 125 150 175 200 0 25 50 75 100 125 150 175 200

(d) Tone 4-2 (e) Tone 3-4-2 (f) Tones 2-2-4

Figure 3: Example contour shape type clusters found in randomly selected tone n-gram categories, with two examples from each of tone unigram, bigram, and trigram: (a) Tone 1, (b) Tone 2, (c) Tone 2-3, (d) Tone 4-2, (e) Tone 3-4-2, (f) Tone 2-2-4. Each cluster is represented by a mean pitch vector with error bars and is indexed by the integers shown in the legend. The x-axis shows the number of samples (discrete time index) for the tone contour f0 vector. The y-axis shows the speaker-normalized pitch values.

5 Linguistic features

After obtaining the tone contour shape types that emerge from each tone n-gram category in the corpus, we now describe the linguistic features used to predict which type of shape a tone n-gram will take. All syntactic and semantic features are extracted using Stanford CoreNLP for Chinese (Manning et al., 2014).

5.1 Syntactic features

Syntax and prosody have been the subject of investigation in (psycho)linguistic studies (Bard and Aylett, 1999). We extract the part-of-speech (POS) tags for all syllables in a tone n-gram. In addition, we also extract the dependency function (Chen and Manning, 2014) of all syllables in the tone n-gram. Therefore, there are 2 * N syntactic features, where N is the number of syllables included in the tone n-gram under consideration. The original tag set used in CoreNLP comes from the Penn Chinese Treebank (footnote 6) and is too fine-grained. To avoid data sparseness, we collapsed several categories. For both POS tags and dependency edge function categories, we compute their distributions using the original tag set and collapse any categories that appear fewer than 5 times in the data. For POS tags, we mapped the original 33 tags onto 5 categories. For dependency functions, we collapsed all tags with a subcategory separated by a colon (e.g., "advmod:loc" and "advmod:rcomp" are mapped to "advmod", etc.).

Footnote 6: https://catalog.ldc.upenn.edu/LDC2013T21

5.2 Semantic features

We extract two semantic features for a tone n-gram data point: (1) whether the tone n-gram includes a named entity; (2) whether the tone n-gram includes a singleton (as opposed to being part of a coreference chain in the discourse). Semantic features such as information structure have been postulated to have an effect on the prosody domain (Buring, 2013). In particular, given information may encode prosodic features differently from new information. This could also apply to named entities vs. non-named entities. Named entities point to definite, specific objects in the real world. Whether the token is a singleton (i.e., does not co-refer to an entity with another mention in the text) or part of a coreference chain can be correlated with information structure (Recasens et al., 2013). That is, a singleton may signify new information in discourse, while a non-singleton is part of a coreference chain with a potential antecedent or anaphor, pointing to a potentially different information structure. Both can have distinguishing effects on mental representations and the production of speech prosody (indirectly related to redundancy in (Aylett and Turk, 2004)).
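To make the syntactic feature extraction concrete, here is a small sketch using the stanza Python pipeline as a stand-in for the Stanford CoreNLP toolkit the authors used (the collapsing of colon-separated dependency subtypes follows the text; mapping word-level tags down to syllables and the 33-to-5 POS grouping are left out):

    import stanza

    # Chinese pipeline; depparse needs the tokenize, pos, and lemma processors.
    nlp = stanza.Pipeline("zh", processors="tokenize,pos,lemma,depparse")

    def syntactic_tags(sentence_text):
        """Return (treebank POS tag, collapsed dependency label) per word;
        in the paper, each syllable of a tone n-gram would inherit the tags
        of the word containing it (our reading)."""
        doc = nlp(sentence_text)
        tags = []
        for sent in doc.sentences:
            for word in sent.words:
                dep = word.deprel.split(":")[0]   # "advmod:loc" -> "advmod"
                tags.append((word.xpos, dep))
        return tags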


5.3 Morphological features

In Mandarin Chinese, each word usually consists of one to four syllables. Building on the intuition that the first syllable is usually spoken with higher prominence (e.g., the neutral tone, which does not carry stress, only occurs in word-final positions), we extract a morphological feature for each syllable in the given tone n-gram: whether it crosses a word boundary or not. There are n features in this category in total.

Figure 4: Distribution of classification accuracies for all 75 data sets in unigram (U), bigram (B), and trigram (T). For each n-gram the labels on the x-axis are: data: full feature set; dfp: start/end pitch only baseline; no_syn: without syntactic features; no_Ntone: without prev/next tone features; no_pitch: without start/end pitch features; random: baseline with randomly assigned labels; MLE: MLE baseline.

5.4 Phonological features

A basic representation of phonological features is the identity of the phonemes in each syllable of the n-gram. However, due to the sparseness of this feature representation, we instead designed 7 binary features to encode the phonological properties of the syllables in the tone n-gram: (1) whether the syllable includes a nasal; (2) whether the syllable includes a diphthong; (3) whether the syllable includes a high vowel; (4) whether the syllable includes a low vowel; (5) whether the syllable includes a front vowel; (6) whether the syllable includes a back vowel; (7) whether the syllable includes a round vowel. In addition, we add two contextual tone features: the tone identities of the syllables immediately preceding and following the tone n-gram in question.

5.5 Other features

We add two pitch features to the feature set: the beginning and ending pitch of the tone n-gram. This is based on the notion in generative Mandarin tone modeling (the Parallel ENcoding and Target Approximation model, or PENTA) that in speech production, the actual realized shape of a given tone category highly depends on the starting point of the pitch contour and its distance to the actual pitch target of the current tone, which affects its trajectory as it approximates the target (Prom-on et al., 2009). An additional feature is the position of the current tone n-gram within the current sentence, expressed as a percentage. It is a known effect that pitch tends to decline in speech production as the sentence progresses (Wang and Xu, 2011). Therefore, we also want to account for the effect of sentence position.

5.6 Bag of features

In this task, we note that the unit of feature extraction is not as straightforward as it would be in classic NLP tasks. That is, instead of a typical syntactic constituent (word, phrase, sentence) as the feature extraction unit, our target here is the tone n-gram, a sequence of n syllables that may or may not be a syntactic constituent. As described above, for many features we have adopted a "bag of features" approach (similar to the speech coreference resolution work of Rösiger and Riester (2015)), where each feature describes whether the n-gram contains a certain target value in any position. For the other features, we simply use a set of n features applied to each syllable in the n-gram in question. These are precisely described in Table 1.

6 Predicting tone contour profiles

6.1 Experimental setup

For each of the 5 unigram, 16 bigram, and 54 trigram data sets, the extracted linguistic feature vector (f1, f2, ..., fm) forms the input space X. The contour shape types T form the output space Y. Our goal is to learn a function h : X → Y minimizing the expected loss. We use an SVM with a linear kernel so that we can extract feature importance in subsequent analyses. Each data set is randomly split 90/10 into train and test sets. Since the classes in Y are balanced, we evaluate classifier performance directly with accuracy on the test sets.
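A minimal sketch of the per-data-set setup of Section 6.1 with scikit-learn (the linear kernel and the 90/10 split follow the text; the assumption that X is already numerically encoded, and everything else, is our choice):

    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    def run_one_dataset(X, y):
        """X: feature matrix for one tone n-gram data set (categorical features
        assumed one-hot encoded beforehand); y: learned contour shape types."""
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.1, random_state=0)          # 90/10 split
        clf = SVC(kernel="linear")                        # linear kernel keeps the
        clf.fit(X_train, y_train)                         # coefficients interpretable
        acc = accuracy_score(y_test, clf.predict(X_test))
        mle_baseline = 1.0 / len(set(y))                  # chance level: 1/d
        return clf, acc, mle_baseline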

Table 1: Feature set overview. "1...N" indicates that the feature is computed for each of the N syllables in the n-gram. The total number of features is N*10 + 7 (37 for trigrams, 27 for bigrams, etc.).

Syntactic        Morphological    Semantic       Phonological         Others
POS_Tag_1...N    Tok_Bound_1...N  is_entity      is_nasal_1...N       sent_position
Dep_Func_1...N                    is_singleton   is_dipthong_1...N    start_pitch
                                                 is_round_1...N       end_pitch
                                                 is_front_1...N       prev_tone
                                                 is_back_1...N        next_tone
                                                 is_high_1...N
                                                 is_low_1...N
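For illustration only, one way to lay out the Table 1 features for a single tone n-gram as a dictionary (the field names mirror the table; the per-syllable and context inputs are hypothetical placeholders, not the authors' data structures):

    def ngram_feature_dict(syllables, context):
        """syllables: per-syllable records for the n syllables of the n-gram;
        context: n-gram-level values (entity/coreference flags, pitch, position,
        neighboring tones). Returns 10*N + 7 entries, matching Table 1."""
        feats = {
            "is_entity": context["is_entity"],
            "is_singleton": context["is_singleton"],
            "sent_position": context["sent_position"],
            "start_pitch": context["start_pitch"],
            "end_pitch": context["end_pitch"],
            "prev_tone": context["prev_tone"],
            "next_tone": context["next_tone"],
        }
        phon_keys = ("is_nasal", "is_dipthong", "is_round", "is_front",
                     "is_back", "is_high", "is_low")
        for i, syl in enumerate(syllables, start=1):
            feats[f"pos{i}"] = syl["pos_tag"]
            feats[f"func{i}"] = syl["dep_func"]
            feats[f"tok_bound{i}"] = syl["tok_bound"]
            for key in phon_keys:
                feats[f"{key}{i}"] = syl[key]
        return feats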

6.2 Results

To visualize model performance on a large number of tone n-gram data sets (75), we choose the boxplot because of its efficiency in conveying statistical information about the distribution of results across all data sets. Figure 4 gives an overview of our proposed model's classification accuracies for all unigram (U), bigram (B), and trigram (T) data sets, compared to several baselines. In this figure, data, dfp, no_syn, no_Ntone, and no_pitch denote results using different sets of features: the full set, start/end pitch only, no syntactic features, no prev/next tone features, and no start/end pitch features, respectively. The random baseline uses the same set of linguistic features as our proposed model, but the target tone contour shape type is randomly assigned (while keeping the number of shape types constant). Finally, the Maximum Likelihood Estimation (MLE) baseline reflects chance-level performance if linguistic factors are independent of the output tone contour shape types, and is calculated as 1/d, where d is the number of output classes in a data set.

Figure 5: Feature weights for all unigram data sets.

First, the proposed model significantly outperforms the MLE baselines for unigrams, bigrams, and trigrams. Second, the accuracy for predicting learned tone contour shape types is significantly higher than for randomly assigned clusters. This serves as a sanity check for the validity of these learned tone contour shape types. Overall, this result supports the hypothesis that a variety of linguistic and contextual features are strongly correlated with the realization of a particular category of tone n-grams. In particular, we observe that using the set of features excluding the syntactic features (POS tags and dependency functions) allows the model to achieve the best median accuracy on all n-grams (the no_syn baseline). In Section 6.3, we provide a more detailed analysis and discussion of feature importance.

A significant trend to note in Figure 4 is that we observe a negative correlation between the model accuracy and the baseline (MLE, random) accuracy as the value of n becomes larger. This is striking because it indicates a decrease in predictive power as n grows larger. Moreover, the variance in model accuracy also increases as n becomes larger. From inspecting the tone contour shape types we obtained (such as those shown in Figure 3), we attribute this to three factors from the data perspective: (1) the dimensionality of the unigram, bigram, and trigram f0 vectors in our data sets increases; (2) the number of data sets also increases as n grows larger; (3) the complexity of tone contour shapes tends to increase from unigrams to bigrams to trigrams. On a linguistic level, we hypothesize that the longer the window of n-grams, the stronger the effect of unaccounted factors that come into play (longer-range prosodic factors such as focus and topic, as demonstrated in (Xu et al., 2004)).

To gain a more detailed understanding of the feature importance, Figure 4 also shows how the model performs with partially ablated feature sets across different n-grams. First, to understand the role of start/end pitch vs. non-pitch linguistic features, we observe that the dfp baseline (using only start/end pitch features) has lower results than the other models with linguistic features. This is true for all the markers on the boxplots (min, max, first quartile, median, third quartile) when comparing dfp to no_syn. However, there is also considerable overlap in these accuracy distributions, to different degrees as n varies, which indicates cases where the dfp baseline outperforms the other models with more linguistic features. To examine this possibility, we plotted the difference (delta) in accuracy values over all data sets for the pair of baselines no_syn - dfp in Figure 8. It shows that for 80% of the data sets, the no_syn baseline with linguistic features outperforms the pitch-only baseline, mostly by a margin of less than 20% in accuracy.

Second, comparing across different n-grams, the no_pitch baseline (only linguistic features) performs worst for unigrams and best for trigrams. This shows that the pitch features become less important as n grows larger. The same trend is observed in the weights of the pitch features in the feature importance analysis. This observation is also consistent with (Prom-on et al., 2009): the target approximation of tones operates only on syllable units, therefore the effect of start/end pitch should diminish when n > 1. Third, even though the feature weights are small for the previous/next tones, in these results we did not see an improvement when we excluded these features in the no_Ntone baseline. Finally, we confirm the importance of the non-pitch linguistic features, since the no_pitch baseline significantly outperforms the random and MLE baselines.

Figure 6: Feature weights for all bigram data sets.

6.3 Feature importance

To analyze feature importance, we extracted the weight vectors (coefficients) associated with all features in the linear SVM classifiers (following (Guyon et al., 2002)) for all data sets in each of the n-grams. We take the absolute values of the feature weights and normalize them to be comparable across data sets. We then aggregate the feature weights over all classes and across all data sets in a given n-gram.
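Our reading of this computation, assuming the fitted linear SVM clf from the Section 6.1 sketch (the aggregation over the one-vs-one class pairs and the normalization are our guesses at the exact bookkeeping):

    import numpy as np

    def feature_importance(clf):
        """Absolute linear-SVM coefficients, summed over the one-vs-one class
        pairs and normalized so the values are comparable across data sets."""
        w = np.abs(clf.coef_)        # shape: (n_class_pairs, n_features)
        w = w.sum(axis=0)            # aggregate over classes
        return w / w.sum()           # normalize within the data set

    # Importances from all data sets of one n-gram size can then be pooled
    # per feature and plotted, as in Figures 5-7.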
Figures 5, 6, and 7 show the distributions of feature weights (importance) for all features for the unigram, bigram, and trigram data, respectively. Since feature importance values tend to behave similarly within their linguistic domains, we group them together instead of analyzing the features individually. We observe three levels of feature importance based on the weights, consistent across different n-grams. (1) High: starting pitch, ending pitch, and sentence position (especially when n > 1). The importance of starting and ending pitch is consistent with the qTA model of Mandarin tones (Prom-on et al., 2009). The latter (sentence position) is consistent with the effect of downdrift (Wang and Xu, 2011). (2) Medium: phonological features and the morphological token boundary features, as well as the coreference (singleton) and entity features, which reflect the information structure of the discourse and semantics (givenness and newness of information in speech). (3) Low: syntactic features (POS tags, dependency functions) and contextual features (previous and next tone). This is consistent with (Bard and Aylett, 1999) and (Surendran, 2007) (footnote 7).

Footnote 7: Specifically, (Bard and Aylett, 1999) showed the dissociation of syntax from deaccenting in spontaneous speech, and (Surendran, 2007) showed that context did not help tone recognition as much as expected.

7 Discussion

In this work, we first described a method to mine tone contour shape types from a large amount of tone data. These shape types, as well as those mined from tone n-grams with larger values of n (n > 3), can be further analyzed with linguistic knowledge to better understand the behavior of tones in different tonal contexts in future work. We also showed that, using linguistic and contextual factors, we can predict with reasonable accuracy the contour shape type that a tone n-gram will take in the MCPST data.


Figure 7: Feature weights for all trigram data sets.

Figure 8: Delta (differences in accuracies) for the no_syn - dfp baselines. A point above the y = 0 line indicates that the model with linguistic features outperforms the model with pitch-only features.

We analyzed the feature importance to form a relative ranking of the linguistic factors. These results should be interpreted with caution because they are constrained by the particular representations of the linguistic features used in this study, as well as by the accuracy of the NLP software used to extract them. Nonetheless, we envision that by mining correlates between speech prosody and automatically extracted linguistic features, this line of work could have potential applications in improving the quality and naturalness of prosody in speech synthesis, such as Text-To-Speech (TTS) technologies.

Previous works targeting information theory and information structure in the prosody domain have largely looked at acoustic correlates directly, such as accent and duration, all of which may in turn have an impact on the shape of tone contours in speech production. Therefore, looking at tone contour shapes can be thought of as a different level of manifestation of such phenomena, an amalgamation of single-dimension acoustic correlates (e.g., duration and intensity). It is also a level that is more difficult to quantify and measure in traditional linguistic/phonetic investigations on a smaller scale. One possible extension of this work is to look at specific ways in which tone contour shapes correlate with particular features when those features are perturbed in a certain direction. Moreover, it is also of interest to demonstrate this change in a quantified manner using an information-theoretic formulation.

In Section 2 we raised a fundamental theoretical question of whether there is a direct link between communicative functions and surface acoustic forms, a question on which we found disagreement in the literature (as summarized in (Xu, 2005) and in Section 2 of the current paper). In this paper, we showed that by taking a data-driven approach, we can predict the contour shape type of a prosodic category (such as a tone n-gram) using linguistic factors, even though we are not predicting its exact shape. In doing so, we give an approximate solution for a middle ground between the two theories.

Acknowledgements

We thank Sankalp Gulati, Xavier Serra, Amir Zeldes, Elizabeth Zsiga, George Wilson, Richard Wright and Gina-Anne Levow for insights and discussions, as well as the anonymous reviewers for valuable comments on previous versions of this paper.

References

Matthew Aylett and Alice Turk. 2004. The smooth signal redundancy hypothesis: a functional explanation for relationships between redundancy, prosodic prominence, and duration in spontaneous speech. Language and Speech, 47(1):31–56.

E. G. Bard and M. P. Aylett. 1999. The dissociation of deaccenting, givenness, and syntactic role in spontaneous speech. In Proceedings of ICPhS99, pages 1753–1756.

Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 10008(10):6.

Daniel Buring. 2013. Syntax, information structure, and prosody. In Cambridge Handbooks in Language and Linguistics, pages 860–896. Cambridge University Press.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 740–750, Doha, Qatar.

E. Cooper, J. Eady, and R. Mueller. 1985. Acoustical aspects of contrastive stress in question-answer contexts. The Journal of the Acoustical Society of America, 77:2142–2156.

E. Cooper and M. Sorenson. 1981. Fundamental frequency in sentence production. Springer-Verlag, New York.

Bruno Gauthier, Rushen Shi, and Yi Xu. 2007. Learning phonetic categories by tracking movements. Cognition, 103(1):80–106.

Sankalp Gulati, Joan Serra, Vignesh Ishwar, and Xavier Serra. 2016. Discovering raga motifs by characterizing communities in networks of melodic patterns. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 286–290, Shanghai, China.

Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. 2002. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1):389–422.

R. Ladd. 1996. Intonational Phonology. Cambridge University Press, Cambridge.

Gina-Anne Levow. 2005. Context in multi-lingual tone and pitch accent recognition. In Interspeech, pages 1809–1812.

Kening Li. 2009. The information structure of Mandarin Chinese: Syntax and prosody. PhD Dissertation, Department of Linguistics, University of Washington.

M. Liberman and J. Pierrehumbert. 1984. Intonational invariance under changes in pitch range and length. In M. Aronoff and R. Oehrle (Eds.), Language Sound Structure, pages 157–233. M.I.T. Press, Cambridge, Massachusetts.

Fang Liu, Dinoj Surendran, and Yi Xu. 2006. Classification of statement and question intonation in Mandarin. In Proceedings of Speech Prosody.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Santitham Prom-on, Yi Xu, and Bundit Thipakorn. 2009. Modeling tone and intonation in Mandarin and English as a process of target approximation. The Journal of the Acoustical Society of America, 125(1):405–424.

Marta Recasens, Marie-Catherine de Marneffe, and Christopher Potts. 2013. The life and death of discourse entities: Identifying singleton mentions. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 627–633, Atlanta, Georgia. Association for Computational Linguistics.

Ina Rösiger and Arndt Riester. 2015. Using prosodic annotations to improve coreference resolution of spoken text. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, volume 2.

Dinoj Surendran. 2007. Analysis and Automatic Recognition of Tones in Mandarin Chinese. PhD Thesis, Department of Computer Science, University of Chicago.

Bei Wang and Yi Xu. 2011. Differential prosodic encoding of topic and focus in sentence-initial position in Mandarin Chinese. Journal of Phonetics, 39(4):595–611.

Yi Xu. 1997. Contextual tonal variations in Mandarin. Journal of Phonetics, 25(1):61–83.

Yi Xu. 2005. Speech melody as articulatorily implemented communicative functions. Speech Communication, 46(3-4):220–251.

Yi Xu, Ching X. Xu, and Xuejing Sun. 2004. On the temporal domain of focus. In Speech Prosody, pages 81–84.

Kristine Yu. 2011. The learnability of tones from the speech signal. PhD Dissertation, Department of Linguistics, UCLA.

Shuo Zhang. 2016. Mining linguistic tone patterns with symbolic representation. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 1–9. Association for Computational Linguistics.
