Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News

Gina-Anne Levow University of Chicago [email protected]

Abstract

Automatic topic segmentation, the separation of a discourse stream into its constituent stories or topics, is a necessary preprocessing step for applications such as information retrieval, anaphora resolution, and summarization. While significant progress has been made in this area for text sources and for English audio sources, little work has been done in automatic, acoustic feature-based segmentation of other languages. In this paper, we consider exploiting both prosodic and text-based features for topic segmentation of Mandarin Chinese. As a tone language, Mandarin presents special challenges for the applicability of intonation-based techniques, since the pitch contour is also used to establish lexical identity. We demonstrate that intonational cues such as reduction in pitch and intensity at topic boundaries and increase in duration and pause still provide significant contrasts in Mandarin Chinese. We build a decision tree classifier that, based only on word-level and local context prosodic information, without reference to term similarity, cue phrase, or sentence-level information, achieves boundary classification accuracy of 84.6-95.6% on a balanced test set. We contrast these results with classification using text-based features, exploiting both text similarity and n-gram cues, to achieve accuracies between 77-95.6%, if silence features are used. Finally, we integrate prosody, text, and silence features using a voting strategy that combines decision tree classifiers for each feature subset individually and all subsets jointly. This voted decision tree classifier yields an overall classification accuracy of 96.85%, with 2.8% miss and 3.15% false alarm rates on a representative corpus sample, demonstrating synergistic combination of prosodic and text features for topic segmentation.

1 Introduction

Natural spoken discourse is composed of a sequence of utterances, not independently generated or randomly strung together, but rather organized according to basic structural principles. This structure in turn guides the interpretation of individual utterances and the discourse as a whole. Formal written discourse signals a hierarchical, tree-based discourse structure explicitly by the division of the text into chapters, sections, paragraphs, and sentences. This structure, in turn, identifies domains for interpretation; many systems for anaphora resolution rely on some notion of locality (Grosz and Sidner, 1986). Similarly, this structure represents topical organization, and thus would be useful in information retrieval to select documents whose primary sections are on-topic and, in summarization, to select information covering the different aspects of the topic.

Unfortunately, spoken discourse does not include the orthographic conventions that signal structural organization in written discourse. Instead, one must infer the hierarchical structure of spoken discourse from other cues. Prior research (Nakatani et al., 1995; Swerts, 1997) has shown that human labelers can more sharply, consistently, and confidently identify discourse structure in a word-level transcription when the original audio recording is available than they can on the basis of the transcribed text alone. This finding indicates that substantial additional information about the structure of the discourse is encoded in the acoustic-prosodic features of the utterance. Given the often errorful transcriptions available for large speech corpora, we choose to focus here on fully exploiting the prosodic cues to discourse structure present in the original speech. We then compare the effectiveness of purely prosodic classification to text-based and mixed text-and-prosody classification.

In the current set of experiments, we concentrate on sequential segmentation of news broadcasts into individual stories. While a richer hierarchical segmentation is ultimately desirable, sequential story segmentation provides a natural starting point. This level of segmentation can also be most reliably performed by human labelers and thus can be considered most robust, and segmented data sets are publicly available.

Furthermore, we apply prosody-based segmentation to Mandarin Chinese, in addition to textual features. Not only is the use of prosodic cues to topic segmentation much less well studied in general than the use of text cues, but the use of prosodic cues has been largely limited to English and other European languages.

2 Related Work

Most prior research on automatic topic segmentation has been applied to clean text only and thus has used textual features. Text-based segmentation approaches have utilized term-based similarity measures computed across candidate segments (Hearst, 1994) and also discourse markers to identify discourse structure (Marcu, 2000).

The Topic Detection and Tracking (TDT) evaluations focused on segmentation of both text and speech sources. This framework introduced new challenges in dealing with errorful automatic transcriptions as well as new opportunities to exploit cues in the original speech. The most successful approach (Beeferman et al., 1999) produced automatic segmentations that yielded retrieval results approaching those with manual segmentations, using text and silence features. (Tur et al., 2001) applied both a prosody-only and a mixed text-prosody model to segmentation of TDT English broadcast news, with the best results combining text and prosodic features. (Hirschberg and Nakatani, 1998) also examined automatic topic segmentation based on prosodic cues, in the domain of English broadcast news, while (Hirschberg et al., 2001) applied similar cues to segmentation of voicemail.

Work in discourse analysis (Nakatani et al., 1995; Swerts, 1997) in both English and Dutch has identified features such as changes in pitch range, intensity, and speaking rate associated with segment boundaries and with boundaries of different strengths. They also demonstrated that access to acoustic cues improves the ease and quality of human labeling.

3 Prosody and Mandarin

In this paper we focus on topic segmentation in Mandarin Chinese broadcast news. Mandarin Chinese is a tone language in which lexical identity is determined by a pitch contour, or tone, associated with each syllable. This additional use of pitch raises the question of the cross-linguistic applicability of the prosodic cues, especially pitch cues, identified for non-tone languages. Specifically, do we find intonational cues in tone languages? The fact that emphasis is marked intonationally by expansion of pitch range even in the presence of Mandarin lexical tone (Shen, 1989) suggests encouragingly that prosodic, intonational cues to other aspects of information structure might also prove robust in tone languages.

4 Data Set

We utilize the Topic Detection and Tracking (TDT) 3 (Wayne, 2000) collection's Mandarin Chinese broadcast news audio corpus as our data set. Story segmentation in Mandarin and English broadcast news and newswire text was one of the TDT tasks and also an enabling technology for other retrieval tasks. We use the segment boundaries provided with the corpus as our gold standard labeling. Our collection comprises 3014 news stories drawn from approximately 113 hours over three months (October-December 1998) of news broadcasts from the Voice of America (VOA) in Mandarin Chinese, with 800 regions of other program material including musical interludes and teasers. The transcriptions span approximately 750,000 words. Stories average approximately 250 words in length. No subtopic segmentation is performed. The audio is stored in NIST Sphere format sampled at 16KHz with 16-bit linear encoding.

5 Prosodic Features

We consider four main classes of prosodic features for our analysis and classification: pitch, intensity, silence, and duration. Pitch, as represented by f0 in Hertz, was computed by the "To pitch" function of the Praat system (Boersma, 2001). We selected the highest-ranked pitch candidate value in each voiced region. We then applied a 5-point median filter to smooth out local instabilities in the signal, such as vocal fry or small regions of spurious doubling or halving. Analogously, we computed the intensity in decibels for each 10ms frame with the Praat "To intensity" function, followed by similar smoothing.

For consistency and to allow comparability, we compute all figures for word-based units, using the automatic transcriptions provided with the TDT Mandarin data. The words are used to establish time spans for computing pitch or intensity mean or maximum values, to enable durational normalization and the pairwise comparisons reported below, and to identify silence or non-speech duration.

It is well established (Ross and Ostendorf, 1996) that for robust analysis pitch and intensity should be normalized by speaker, since, for example, average pitch is largely incomparable for male and female speakers.
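The 5-point median smoothing applied to the pitch and intensity tracks can be sketched as follows. This is a minimal illustration, not the Praat implementation; the shrinking-window treatment of track edges is an assumption.

```python
def median_smooth(track, width=5):
    """Apply a running median of the given odd width to a list of f0
    (or intensity) values, damping isolated spikes such as spurious
    pitch halving/doubling; the window shrinks at the track edges."""
    half = width // 2
    smoothed = []
    for i in range(len(track)):
        window = track[max(0, i - half):i + half + 1]
        smoothed.append(sorted(window)[len(window) // 2])
    return smoothed

# A single spurious doubling (200 Hz among ~100 Hz frames) is removed:
print(median_smooth([100, 101, 99, 200, 102, 100, 98]))
# -> [100, 101, 101, 101, 100, 102, 100]
```

A median filter is preferred over a mean here because a single octave error would drag a running mean far from the true contour, while the median ignores it entirely.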

In the absence of speaker identification software, we approximate speaker normalization with story-based normalization, computed as (x - mean) / standard deviation over the story, assuming one speaker per topic. (This is an imperfect approximation, as some stories include off-site interviews, but it seems a reasonable choice in the absence of automatic speaker identification.) For duration, we consider both absolute and normalized word duration, where average word duration is used as the mean in the calculation above.
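The story-based z-score normalization can be sketched as follows; the per-word pitch values shown are hypothetical, and grouping values by story is assumed to happen upstream.

```python
import statistics

def znorm(values):
    """Normalize one story's feature values (pitch, intensity, or
    word duration) to zero mean and unit variance within the story,
    as a proxy for per-speaker normalization."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values) or 1.0  # guard against zero variance
    return [(v - mean) / sd for v in values]

pitches = [210.0, 205.0, 215.0, 190.0]  # hypothetical mean f0 per word
normed = znorm(pitches)
print([round(z, 2) for z in normed])
```

After this transformation, values from a low-pitched male speaker and a high-pitched female speaker occupy comparable ranges, which is the property the classifier features rely on.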
6 Prosodic Analysis

To evaluate the potential applicability of prosodic features to story segmentation in Mandarin Chinese, we performed some initial data analysis to determine whether words in story-final position differed from the same words used elsewhere in the story. This lexical match allows direct pairwise comparison. We anticipated that since words in Mandarin vary not only in phoneme sequence but also in tone sequence, a direct comparison might be particularly important to eliminate sources of variability. Features that differed significantly would form the basis of our classifier feature set.

We found significant differences for each of the features we considered. Specifically, word duration, normalized mean pitch, and normalized mean intensity all differed significantly for words in topic-final position relative to occurrences throughout the story (paired t-test, two-tailed). Word duration increased, while both pitch and intensity decreased. A small side experiment using 15 hours of English broadcast news from the TDT collection shows similar trends, though the magnitude of the change in intensity is smaller than that observed for the Chinese. Furthermore, comparison of average pitch and average intensity for 1, 5, and 10 word windows at the beginning and end of news stories finds that pitch and intensity are both significantly higher at the start of stories than at the end.

These contrasts are consistent with, though in some cases stronger than, those identified for English (Nakatani et al., 1995) and Dutch (Swerts, 1997). The relatively large size of the corpus enhances the salience of these effects. We find, importantly, that reduction in pitch as a signal of topic finality is robust across the typological contrast of tone and non-tone languages. These findings demonstrate highly significant intonational effects even in tone languages and suggest that prosodic cues may be robust across wide ranges of languages.

7 Classification

7.1 Prosodic Feature Set

The results above indicate that duration, pitch, and intensity should be useful for automatic prosody-based identification of topic boundaries. To facilitate cross-speaker comparisons, we use normalized representations of average pitch, average intensity, and word duration. We also include absolute word duration. These features form a word-level, context-independent feature set.

Since segment boundaries and their cues exist to contrastively signal the separation between topics, we augment these features with local context-dependent measures. Specifically, we add features that measure the change between the current word and the next word. (We have posed the task of boundary detection as the task of finding segment-final words, so the technique incorporates a single-word lookahead. We could instead pose the task as identification of topic-initial words and avoid the lookahead, yielding a more on-line process; this is an area for future research.) This contextualization adds four contextual features: change in normalized average pitch, change in normalized average intensity, change in normalized word duration, and duration of the following silence or non-speech region.

7.2 Text Feature Set

In addition to the prosodic features which are our primary interest, we also consider a set of features that exploit textual similarity to identify segment boundaries. Motivated by text and topic similarity measures in the vector space model of information retrieval (Salton, 1989), we compute a vector representation of the words in 50-word windows preceding and following the current potential boundary position. We compute the cosine similarity of these two vectors. We employ tf·idf weighting: each term is weighted by the product of its frequency in the window (tf) and its inverse document frequency (idf) as a measure of topicality. We also consider the same similarity measure computed across 30-word windows. The final text similarity measure we consider is simple word overlap, counting the number of words that appear in both 50-word windows defined above. We did not remove stopwords, and we used the word-based units from the ASR transcription directly as our term units. We expect these measures to be minimized at topic boundaries, where changes in topic are accompanied by changes in topical terminology.

Finally, we identified a small set of word unigram features occurring within a ten-word window immediately preceding or following a story boundary that were indicative of such a boundary. (We used the ngram functionality of BoosTexter (Schapire and Singer, 2000) to identify these units.) These features include the Mandarin Chinese words for "audience", "reporting", and "Voice of America." We used a boolean feature for each such word, corresponding to its presence or absence in the current word's environment, in the classifier formulation.
Figure 1: Prosody-based decision tree classifier labeling words as segment-final or non-segment-final

7.3 Classifier Training and Testing Configuration

We employed Quinlan's C4.5 (Quinlan, 1992) decision tree classifier to provide a readily interpretable classifier. The vast majority of word positions in our collection are non-segment-final. So, in order to focus training and testing on segment boundary identification and to assess the discriminative capability of the classifier, we downsample our corpus to produce a 50/50 split of segment-final and non-final words. (We excluded a small proportion of words for which the pitch tracker returned no results.) We train on 3500 segment-final words and 3500 non-final words, not matched in any way, drawn randomly from the full corpus. We test on a similarly balanced test set of 500 instances.
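The balanced downsampling used to build the training and test sets can be sketched as follows; the toy corpus, label names, and fixed seed are illustrative assumptions.

```python
import random

def balanced_sample(words, labels, n_per_class, seed=0):
    """Randomly downsample to n_per_class segment-final and
    n_per_class non-final words, yielding a 50/50 class split."""
    rng = random.Random(seed)
    finals = [w for w, l in zip(words, labels) if l == "final"]
    nonfinals = [w for w, l in zip(words, labels) if l == "non-final"]
    sample = [(w, "final") for w in rng.sample(finals, n_per_class)] + \
             [(w, "non-final") for w in rng.sample(nonfinals, n_per_class)]
    rng.shuffle(sample)
    return sample

# Hypothetical toy corpus: 2 story-final words among 10.
words = [f"w{i}" for i in range(10)]
labels = ["final" if i in (4, 9) else "non-final" for i in range(10)]
print(len(balanced_sample(words, labels, 2)))  # -> 4
```

Balancing the classes this way makes the 50% baseline meaningful for assessing discriminative power, at the cost of no longer reflecting the true boundary prior; Section 7.6 returns to the representative distribution.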
7.4 Classifier Evaluation

7.4.1 Prosody-only Classification

The resulting classifier achieved 95.6% accuracy, with 2% missed boundaries and 7% false alarms. This effectiveness is a substantial improvement over the sample baseline of 50%. A portion of the decision tree is reproduced in Figure 1. Inspection of the tree indicates the key role of silence as well as the use of both contextual and purely local features of pitch, intensity, and duration. The classifier relies on the theoretically and empirically grounded features of pitch, intensity, duration, and silence, where it has been suggested that higher pitch and wider range are associated with topic initiation, and lower pitch or narrower range with topic finality.

We performed a set of contrastive experiments to explore the impact of different lexical tones on classification accuracy. We grouped words based on the lexical tone of the initial syllable into high, rising, low, and falling. We found no tone-based differences in classification, with all groups achieving 94-96% accuracy. Since the magnitude of the difference in pitch based on discourse position is comparable to that based on lexical tone identity, and the overlap between pitch values in non-final and final positions is relatively small, we obtain consistent results.

7.4.2 Text and Silence-based Classification

In a comparable experiment, we employed only the text similarity, text unigram, and silence duration features to train and test the classifier. These features similarly achieved a 95.6% overall classification accuracy, with 3.6% miss and 5.2% false alarms. Here the best classification accuracy was achieved by the tf·idf-weighted 50-word-window text similarity measure. The text unigram features also contributed to overall classifier effectiveness. A portion of the decision tree classifier using text-based features is reproduced in Figure 2.

7.4.3 Combined Prosody and Text Classification

Finally, we built a combined classifier integrating all prosodic, textual, and silence features. This classifier yielded an accuracy of 96.4%, somewhat better effectiveness, still with more than twice as many false alarms as missed detections. The decision tree utilized all prosodic features. tf·idf-weighted cosine similarity alone performed as well as any of the other text similarity or overlap measures. The text unigram features also contributed to overall classifier effectiveness. A portion of the decision tree classifier using prosodic, textual, and silence features is reproduced in Figure 3.

7.5 Feature Comparison

We also performed a set of contrastive experiments with different subsets of available features to assess the dependence on these features. We grouped features into 5 sets: pitch, intensity, duration, silence, and text similarity. For each of the prosody-only, text-only, and combined prosody-and-text classifiers, we successively removed the feature class at the root of the decision tree and retrained with the remaining features (Table 1).

We observe that although silence duration plays a very significant role in story boundary identification for all feature sets (VOA Mandarin has been observed stylistically to make idiosyncratically large use of silence at story boundaries; personal communication, James Allan), the richer prosodic and mixed text-prosody classifiers are much more robust to the absence of silence information. Further, we observe that intensity and then pitch play the next most important roles in classification. This behavior can be explained by the observation that, like silence or non-speech regions, pitch and intensity changes provide sharp, local cues to topic finality or initiation. Thus the prosodic features provide some measure of redundancy for the silence feature. In contrast, the text similarity measures apply to relatively wide regions, comparing pairs of 50-word windows.

7.6 Further Feature Integration

Finally, we considered effectiveness on a representative test sampling of the full data set, rather than the downsampled, balanced set, adding a proportional number of unseen non-final words to the test set. We observed that although the overall classification accuracy was 95.6% and we were missing only 2% of the story boundaries, we produced a high level of false alarms (4.4%), accounting for most of the observed classification errors. Given the large predominance of non-boundary positions in the real-world distribution, we sought to better understand and reduce this false alarm rate, and hopefully reduce the overall error rate. At the same time, we hoped to avoid a dramatic increase in the miss rate.

To explore this question, we considered the contribution of each of the three main feature types (prosody, text, and silence) and their combined effects on false alarms. We constructed independent feature-set-specific decision tree classifiers for each of the feature types and compared their independent classifications to those of the integrated classifier. We found that while there was substantial agreement across the different feature-based classifiers in the cases of correct classification, erroneous classifications often occurred when the assignment was a minority decision. Specifically, one-third of the false alarms were based on a minority assignment, where only the fully integrated classifier deemed the position a boundary or where it agreed with only one of the feature-set-specific classifiers.

Based on these observations, we completed our multi-feature integration by augmenting the decision tree classification with a voting mechanism. In this configuration, a boundary was only assigned in cases where the integrated classifier agreed with at least two of the feature-set-specific classifiers. This approach reduced the false alarm rate by one-third, to 3.15%, while the miss rate rose only to 2.8%. The overall accuracy on a representative sample distribution reached 96.85%.

8 Conclusion and Future Work

We have demonstrated the utility of prosody-only, text-only, and mixed text-prosody features for automatic topic segmentation of Mandarin Chinese. We have demonstrated the applicability of intonational prosodic features, specifically pitch, intensity, pause, and duration, to the identification of topic boundaries in a tone language. We find highly significant decreases in pitch and intensity at topic-final positions, and significant increases in word duration. Furthermore, these features, in both local and contextualized form, provide the basis for an effective decision tree classifier of boundary positions that uses no term similarity or cue phrase information, only prosodic features.

We observe similar effectiveness for all feature sets when all features are available, with slightly better classification accuracy for the text and hybrid text-prosody approaches. We further observe that the prosody-only and hybrid feature sets are much less sensitive to the absence of individual features, and, in particular, to silence features, as pitch and intensity provide comparably sharp cues to the position of topic boundaries. These findings indicate that prosodic features are robust cues to topic boundaries, both with and without textual cues.

Finally, we demonstrate the joint utility of the different feature sets: prosodic, textual, and silence. The use of a simple voting mechanism exploits the different contributions of each of the feature-set-specific classifiers in conjunction with the integrated classifier. This final combination allows a substantial reduction of the false alarm rate, a reduction in the overall error rate, and only a small increase in the miss rate. Further tuning of relative miss and false alarm rates is certainly possible, but should be tied to a specific task application.

There is still substantial work to be done. We would like to integrate speaker identification for normalization and speaker change detection. We also plan to explore the integration of text and prosodic features for the identification of more fine-grained sub-topic structure, to provide more focused units for information retrieval, summarization, and anaphora resolution. We also plan to explore the interaction of prosodic and textual features with cues from other modalities, such as gaze and gesture, for robust segmentation of varied multi-modal data.
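The voting rule described in Section 7.6 can be sketched as follows; modeling each classifier's output as a boolean "boundary" decision is an illustrative simplification.

```python
def vote_boundary(integrated, prosody, text, silence):
    """Assign a boundary only when the integrated classifier says
    'boundary' AND at least two of the three feature-set-specific
    classifiers (prosody, text, silence) agree with it."""
    agreeing = sum(1 for c in (prosody, text, silence) if c)
    return integrated and agreeing >= 2

print(vote_boundary(True, True, True, False))   # -> True
print(vote_boundary(True, True, False, False))  # -> False (minority decision)
print(vote_boundary(False, True, True, True))   # -> False
```

Requiring agreement suppresses exactly the minority-decision false alarms identified in the error analysis, while leaving jointly supported boundary hypotheses intact.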

                 Prosody-only         Text + Silence       Text + Prosody
Removed          Accuracy  Change     Accuracy  Change     Accuracy  Change
All (none)       95.6%     0          95.6%     0          96.4%     0
Silence          84.6%     -11.5%     77.4%     -19%       89.6%     -6.9%
Intensity        80.4%     -15.9%     --        --         85.4%     -11.4%
Pitch            63.6%     -33.4%     --        --         78.6%     -18.5%

Table 1: Reduction in classification accuracy with removal of features. Each row is labeled with the feature that is newly removed from the set of available features.

References

D. Beeferman, A. Berger, and J. Lafferty. 1999. Statistical models for text segmentation. Machine Learning, 34(1-3):177-210.

P. Boersma. 2001. Praat, a system for doing phonetics by computer. Glot International, 5(9-10):341-345.

B. Grosz and C. Sidner. 1986. Attention, intention, and the structure of discourse. Computational Linguistics, 12(3):175-204.

M. Hearst. 1994. Multi-paragraph segmentation of expository text. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics.

Julia Hirschberg and Christine Nakatani. 1998. Acoustic indicators of topic segmentation. In Proceedings of ICSLP-98.

J. Hirschberg, M. Bacchiani, D. Hindle, P. Isenhour, A. Rosenberg, L. Stark, L. Stead, S. Whittaker, and G. Zamchick. 2001. Scanmail: Browsing and searching speech data by content. In Proceedings of EUROSPEECH 2001.

D. Marcu. 2000. The Theory and Practice of Discourse Parsing and Summarization. MIT Press.

C. H. Nakatani, J. Hirschberg, and B. J. Grosz. 1995. Discourse structure in spoken language: Studies on speech corpora. In Working Notes of the AAAI Spring Symposium on Empirical Methods in Discourse Interpretation and Generation, pages 106-112.

J. R. Quinlan. 1992. C4.5: Programs for Machine Learning. Morgan Kaufmann.

K. Ross and M. Ostendorf. 1996. Prediction of abstract prosodic labels for speech synthesis. Computer Speech and Language, 10:155-185.

Gerard Salton. 1989. Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley, Reading, MA.

Robert E. Schapire and Yoram Singer. 2000. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135-168.

X.-N. Shen. 1989. The Prosody of Mandarin Chinese, volume 118 of University of California Publications in Linguistics. University of California Press.

Marc Swerts. 1997. Prosodic features at discourse boundaries of different strength. Journal of the Acoustical Society of America, 101(1):514-521.

G. Tur, D. Hakkani-Tur, A. Stolcke, and E. Shriberg. 2001. Integrating prosodic and lexical cues for automatic topic segmentation. Computational Linguistics, 27(1):31-57.

C. Wayne. 2000. Multilingual topic detection and tracking: Successful research enabled by corpora and evaluation. In Language Resources and Evaluation Conference (LREC) 2000, pages 1487-1494.

Figure 2: Text-feature-based decision tree classifier labeling words as segment-final or non-segment-final

Figure 3: Prosody, text, and silence based decision tree classifier labeling words as segment-final or non-segment-final