Using Conditional Random Fields For Sentence Boundary Detection In Speech
Yang Liu (ICSI, Berkeley), Andreas Stolcke and Elizabeth Shriberg (SRI and ICSI), Mary Harper (Purdue University)
[email protected], {stolcke,ees}@speech.sri.com, [email protected]
Abstract

Sentence boundary detection in speech is important for enriching speech recognition output, making it easier for humans to read and downstream modules to process. In previous work, we have developed hidden Markov model (HMM) and maximum entropy (Maxent) classifiers that integrate textual and prosodic knowledge sources for detecting sentence boundaries. In this paper, we evaluate the use of a conditional random field (CRF) for this task and relate results with this model to our prior work. We evaluate across two corpora (conversational telephone speech and broadcast news speech) on both human transcriptions and speech recognition output. In general, our CRF model yields a lower error rate than the HMM and Maxent models on the NIST sentence boundary detection task in speech, although it is interesting to note that the best results are achieved by three-way voting among the classifiers. This probably occurs because each model has different strengths and weaknesses for modeling the knowledge sources.

1 Introduction

Standard speech recognizers output an unstructured stream of words, in which important structural features such as sentence boundaries are missing. Sentence segmentation information is crucial and assumed in most of the further processing steps that one would want to apply to such output: tagging and parsing, information extraction, and summarization, among others.

1.1 Sentence Segmentation Using HMM

Most prior work on sentence segmentation (Shriberg et al., 2000; Gotoh and Renals, 2000; Christensen et al., 2001; Kim and Woodland, 2001; NIST-RT03F, 2003) has used an HMM approach, in which the word/tag sequences are modeled by N-gram language models (LMs) (Stolcke and Shriberg, 1996). Additional features (mostly related to speech prosody) are modeled as observation likelihoods attached to the N-gram states of the HMM (Shriberg et al., 2000). Figure 1 shows the graphical model representation of the variables involved in the HMM for this task.

[Figure 1: Graphical model of the HMM for sentence boundary detection. Hidden states are word/event pairs W_i E_i, W_{i+1} E_{i+1}; observations are word/prosodic-feature pairs W_i F_i.]

Note that the words appear in both the states^1 and the observations, such that the word stream constrains the possible hidden states to matching words; the ambiguity in the task stems entirely from the choice of events. This architecture differs from the one typically used for sequence tagging (e.g., part-of-speech tagging), in which the "hidden" states represent only the events or tags. Empirical investigations have shown that omitting words in the states significantly degrades system performance for sentence boundary detection (Liu, 2004). The observation probabilities in the HMM, implemented using a decision tree classifier, capture the probabilities of generating the prosodic features P(F_i | E_i, W_i). An N-gram LM is used to calculate the transition probabilities:

    P(W_i E_i | W_1 E_1 ... W_{i-1} E_{i-1})

^1 In this sense, the states are only partially "hidden".
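To make the HMM formulation concrete, the following is a minimal, self-contained sketch (not the authors' system) of Viterbi decoding over (word, event) states. The bigram transition function stands in for the N-gram LM P(W_i E_i | W_{i-1} E_{i-1}), and the observation function stands in for the decision-tree prosody model P(F_i | E_i, W_i), here using a single invented feature (pause duration after each word); all probabilities and thresholds are made up for illustration.

```python
import math

EVENTS = ["SU", "no-SU"]  # sentence-boundary event vs. none after each word

def transition_logp(prev_word, prev_event, word, event):
    """Stand-in for the N-gram LM term P(W_i E_i | W_{i-1} E_{i-1}).
    Invented heuristic: boundary events are a priori less likely."""
    return math.log(0.3 if event == "SU" else 0.7)

def observation_logp(word, event, pause_dur):
    """Stand-in for the prosody model P(F_i | E_i, W_i); F_i is a pause duration.
    Invented heuristic: long pauses are likelier when a boundary follows."""
    if event == "SU":
        return math.log(0.8 if pause_dur > 0.2 else 0.2)
    return math.log(0.1 if pause_dur > 0.2 else 0.9)

def viterbi(words, pauses):
    """Best event sequence: the words constrain the states, so only the
    event labels are ambiguous, exactly as in the HMM described above."""
    # delta[e] = best log-prob of any event path ending in event e
    delta = {e: transition_logp("<s>", None, words[0], e) +
                observation_logp(words[0], e, pauses[0]) for e in EVENTS}
    back = []
    for i in range(1, len(words)):
        new_delta, ptr = {}, {}
        for e in EVENTS:
            scores = {pe: delta[pe] +
                          transition_logp(words[i - 1], pe, words[i], e)
                      for pe in EVENTS}
            best_prev = max(scores, key=scores.get)
            new_delta[e] = scores[best_prev] + \
                observation_logp(words[i], e, pauses[i])
            ptr[e] = best_prev
        delta = new_delta
        back.append(ptr)
    # Backtrace from the best final event
    e = max(delta, key=delta.get)
    path = [e]
    for ptr in reversed(back):
        e = ptr[e]
        path.append(e)
    return list(reversed(path))

words = ["okay", "so", "we", "met", "yeah"]
pauses = [0.50, 0.05, 0.05, 0.60, 0.40]  # seconds of silence after each word
print(viterbi(words, pauses))  # -> ['SU', 'no-SU', 'no-SU', 'SU', 'SU']
```

In a real system the transition scores would come from an N-gram LM trained on word/event token streams and the observation scores from a decision tree over many prosodic features, but the decoding structure is the same.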