Using Conditional Random Fields For Sentence Boundary Detection In Speech

Yang Liu (ICSI, Berkeley)    Andreas Stolcke, Elizabeth Shriberg (SRI and ICSI)    Mary Harper (Purdue University)
[email protected]    stolcke,[email protected]    [email protected]

Abstract

Sentence boundary detection in speech is important for enriching speech recognition output, making it easier for humans to read and for downstream modules to process. In previous work, we have developed hidden Markov model (HMM) and maximum entropy (Maxent) classifiers that integrate textual and prosodic knowledge sources for detecting sentence boundaries. In this paper, we evaluate the use of a conditional random field (CRF) for this task and relate the results of this model to our prior work. We evaluate across two corpora (conversational telephone speech and broadcast news speech) on both human transcriptions and speech recognition output. In general, our CRF model yields a lower error rate than the HMM and Maxent models on the NIST sentence boundary detection task in speech, although it is interesting to note that the best results are achieved by three-way voting among the classifiers. This probably occurs because each model has different strengths and weaknesses for modeling the knowledge sources.

1 Introduction

Standard speech recognizers output an unstructured stream of words, in which important structural information such as sentence boundaries is missing. Sentence segmentation information is crucial and assumed in most of the further processing steps that one would want to apply to such output: tagging and parsing, information extraction, and summarization, among others.

1.1 Sentence Segmentation Using HMM

Most prior work on sentence segmentation (Shriberg et al., 2000; Gotoh and Renals, 2000; Christensen et al., 2001; Kim and Woodland, 2001; NIST-RT03F, 2003) has used an HMM approach, in which the word/tag sequences are modeled by N-gram language models (LMs) (Stolcke and Shriberg, 1996). Additional features (mostly related to speech prosody) are modeled as observation likelihoods attached to the N-gram states of the HMM (Shriberg et al., 2000). Figure 1 shows the graphical model representation of the variables involved in the HMM for this task. Note that the words appear in both the states¹ and the observations, such that the word stream constrains the possible hidden states to matching words; the ambiguity in the task stems entirely from the choice of events. This architecture differs from the one typically used for sequence tagging (e.g., part-of-speech tagging), in which the "hidden" states represent only the events or tags. Empirical investigations have shown that omitting words in the states significantly degrades system performance for sentence boundary detection (Liu, 2004). The observation probabilities in the HMM, implemented using a decision tree classifier, capture the probabilities of generating the prosodic features P(F_i | E_i, W_i).²

¹ In this sense, the states are only partially "hidden".
² In the prosody model implementation, we ignore the word identity in the conditions, only using the timing or word alignment information.

An N-gram LM is used to calculate the transition probabilities:

    P(W_i E_i | W_1 E_1 ... W_{i-1} E_{i-1}) =
        P(W_i | W_1 E_1 ... W_{i-1} E_{i-1}) P(E_i | W_1 E_1 ... W_{i-1} E_{i-1} W_i)

In the HMM, the forward-backward algorithm is used to determine the event with the highest posterior probability for each interword boundary:

    Ê_i = arg max_{E_i} P(E_i | W, F)    (1)

[Figure 1: A graphical model of the HMM for the sentence boundary detection problem. Only one word+event pair is depicted in each state, but in a model based on N-grams, the previous N-1 tokens would condition the transition to the next state. O are observations consisting of words W and prosodic features F, and E are sentence boundary events.]
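To make the decoding step concrete, the following illustrative Python sketch (not the authors' implementation; the transition and observation values are toy numbers standing in for the N-gram LM and the prosodic decision tree) runs the forward-backward recursion and picks the per-boundary argmax of Equation (1).

```python
# Illustrative sketch: posterior decoding of sentence boundary events with the
# forward-backward algorithm, as in Equation (1). The tables below are made-up
# toy numbers; a real system would derive them from an N-gram LM and a
# prosodic decision tree.
import numpy as np

EVENTS = ["NO_BOUNDARY", "BOUNDARY"]

# Toy stand-in for P(E_i | E_{i-1}) from the word/event N-gram model.
trans = np.array([[0.85, 0.15],
                  [0.60, 0.40]])
prior = np.array([0.9, 0.1])

# Toy stand-in for the prosody-model likelihoods P(F_i | E_i),
# one row per interword boundary in the utterance.
obs = np.array([[0.7, 0.3],
                [0.2, 0.8],
                [0.6, 0.4]])

def boundary_posteriors(prior, trans, obs):
    """Forward-backward over boundary events; returns P(E_i | all observations)."""
    n, k = obs.shape
    fwd = np.zeros((n, k))
    bwd = np.ones((n, k))
    fwd[0] = prior * obs[0]
    for i in range(1, n):
        fwd[i] = (fwd[i - 1] @ trans) * obs[i]
    for i in range(n - 2, -1, -1):
        bwd[i] = trans @ (obs[i + 1] * bwd[i + 1])
    post = fwd * bwd
    return post / post.sum(axis=1, keepdims=True)

post = boundary_posteriors(prior, trans, obs)
for i, p in enumerate(post):
    # Equation (1): pick the event with the highest posterior at each boundary.
    print(i, EVENTS[int(np.argmax(p))], p.round(3))
```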

The HMM is a generative modeling approach since it describes a stochastic process with hidden variables (sentence boundaries) that produces the observable data. This HMM approach has two main drawbacks. First, standard training methods maximize the joint probability of observed and hidden events, as opposed to the posterior probability of the correct hidden variable assignment given the observations, which would be a criterion more closely related to classification performance. Second, the N-gram LM underlying the HMM transition model makes it difficult to use features that are highly correlated (such as words and POS labels) without greatly increasing the number of model parameters, which in turn would make robust estimation difficult. More details about using textual information in the HMM system are provided in Section 3.

1.2 Sentence Segmentation Using Maxent

A maximum entropy (Maxent) posterior classification method has been evaluated in an attempt to overcome some of the shortcomings of the HMM approach (Liu et al., 2004; Huang and Zweig, 2002). For a boundary position i, the Maxent model takes the exponential form:

    P(E_i | T_i, F_i) = (1 / Z(T_i, F_i)) exp( Σ_k λ_k g_k(E_i, T_i, F_i) )    (2)

where Z(T_i, F_i) is a normalization term and T_i represents textual information. The indicator functions g_k correspond to features defined over events, words, and prosody. The parameters λ_k in Maxent are chosen to maximize the conditional likelihood Π_i P(E_i | T_i, F_i) over the training data, better matching the classification accuracy metric. The Maxent framework provides a more principled way to combine the largely correlated textual features, as confirmed by the results of (Liu et al., 2004); however, it does not model the state sequence.
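As a concrete illustration of Equation (2), the sketch below (hypothetical feature names and hand-set weights, not the trained model from this work) computes the Maxent posterior for a single boundary from a few indicator features.

```python
# Illustrative sketch: the Maxent posterior of Equation (2) for one boundary,
# using hypothetical indicator features g_k(E_i, T_i, F_i) and toy weights.
import math

EVENTS = ["NO_BOUNDARY", "BOUNDARY"]

def features(event, textual, prosody_bin):
    """Hypothetical indicator features over the event, words/POS, and binned prosody."""
    return {
        f"word_before={textual['word_before']}_{event}": 1.0,
        f"pos_after={textual['pos_after']}_{event}": 1.0,
        f"prosody_bin={prosody_bin}_{event}": 1.0,
    }

def maxent_posterior(weights, textual, prosody_bin):
    # Unnormalized score exp(sum_k lambda_k * g_k) per event, then divide by Z.
    scores = {}
    for event in EVENTS:
        s = sum(weights.get(name, 0.0) * value
                for name, value in features(event, textual, prosody_bin).items())
        scores[event] = math.exp(s)
    z = sum(scores.values())
    return {event: s / z for event, s in scores.items()}

# Toy weights; in practice they are estimated to maximize conditional likelihood.
weights = {
    "word_before=uhhuh_BOUNDARY": 1.2,
    "pos_after=PRP_BOUNDARY": 0.6,
    "prosody_bin=high_BOUNDARY": 1.5,
    "prosody_bin=high_NO_BOUNDARY": -0.5,
}
print(maxent_posterior(weights, {"word_before": "uhhuh", "pos_after": "PRP"}, "high"))
```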

A simple combination of the results from the Maxent and HMM was found to improve upon the performance of either model alone (Liu et al., 2004) because of the complementary strengths and weaknesses of the two models. An HMM is a generative model, yet it is able to model the sequence via the forward-backward algorithm. Maxent is a discriminative model; however, it makes decisions locally, without using sequential information.

A conditional random field (CRF) model (Lafferty et al., 2001) combines the benefits of the HMM and Maxent approaches. Hence, in this paper we evaluate the performance of the CRF model and relate the results to those obtained using the HMM and Maxent approaches on the sentence boundary detection task. The rest of the paper is organized as follows. Section 2 describes the CRF model and discusses how it differs from the HMM and Maxent models. Section 3 describes the data and features used in the models to be compared. Section 4 summarizes the experimental results for the sentence boundary detection task. Conclusions and future work appear in Section 5.

2 CRF Model Description

A CRF is a random field that is globally conditioned on an observation sequence O. CRFs have been successfully used for a variety of tasks (Lafferty et al., 2001; Sha and Pereira, 2003; McCallum and Li, 2003), but they have not been widely applied to a speech-related task with both acoustic and textual knowledge sources. The top graph in Figure 2 is a general CRF model. The states of the model correspond to event labels E. The observations are composed of the textual features, as well as the prosodic features. The most likely event sequence Ê for the given input sequence (observations) O is

    Ê = arg max_E  exp( Σ_k G_k(E, O) ) / Z(O)    (3)

where the functions G_k are potential functions over the events and the observations, and Z is the normalization term:

    Z(O) = Σ_E exp( Σ_k G_k(E, O) )    (4)

Even though a CRF itself has no restriction on the potential functions G_k(E, O), to simplify the model (considering computational cost and the limited training set size), we use a first-order CRF in this investigation, as shown at the bottom of Figure 2. In this model, an observation O_i (consisting of textual features T_i and prosodic features F_i) is associated with a state E_i.

[Figure 2: Graphical representations of a general CRF and the first-order CRF used for the sentence boundary detection problem. E represents the state tags (i.e., sentence boundary or not). O are observations consisting of words W or derived textual features T and prosodic features F.]

The model is trained to maximize the conditional log-likelihood of a given training set. Similar to the Maxent model, the conditional likelihood is closely related to the individual event posteriors used for classification, enabling this type of model to explicitly optimize discrimination of correct from incorrect labels. The most likely sequence is found using the Viterbi algorithm.³

A CRF differs from an HMM with respect to its training objective function (joint versus conditional likelihood) and its handling of dependent word features. Traditional HMM training does not maximize the posterior probabilities of the correct labels; whereas, the CRF directly estimates the posterior boundary label probabilities P(E | O). The underlying N-gram sequence model of an HMM does not cope well with multiple representations (features) of the word sequence (e.g., words, POS), especially when the training set is small; however, the CRF model supports simultaneous correlated features, and therefore gives greater freedom for incorporating a variety of knowledge sources. A CRF differs from the Maxent method with respect to its ability to model sequence information. The primary advantage of the CRF over the Maxent approach is that the model is optimized globally over the entire sequence; whereas, the Maxent model makes a local decision, as shown in Equation (2), without utilizing any state dependency information.

We use the Mallet package (McCallum, 2002) to implement the CRF model. To avoid overfitting, we employ a Gaussian prior with a zero mean on the parameters (Chen and Rosenfeld, 1999), similar to what is used for training Maxent models (Liu et al., 2004).

³ The forward-backward algorithm would most likely be better here, but it is not implemented in the software we used (McCallum, 2002).
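For illustration, the following sketch (a toy stand-in, not the Mallet-based system used here; the feature scores are invented) shows how Viterbi decoding selects the globally best label sequence for a first-order linear-chain CRF; the normalization term Z(O) of Equation (4) does not affect the argmax.

```python
# Illustrative sketch: Viterbi decoding in a first-order linear-chain CRF, in
# the spirit of Equations (3)-(4). The scores below are made-up stand-ins for
# the summed weighted feature functions at each position and label transition.
import numpy as np

EVENTS = ["NO_BOUNDARY", "BOUNDARY"]

# State-feature scores per position (rows) and label (cols): in a real system
# these come from weighted textual and binned-prosody features.
state_scores = np.array([[1.0, -0.5],
                         [-0.2, 0.9],
                         [0.8, -0.1]])
# Transition-feature scores between successive labels.
trans_scores = np.array([[0.3, -0.1],
                         [0.2, -0.4]])

def viterbi(state_scores, trans_scores):
    """Return the jointly most likely label sequence (the argmax of Equation (3))."""
    n, k = state_scores.shape
    delta = np.zeros((n, k))
    backptr = np.zeros((n, k), dtype=int)
    delta[0] = state_scores[0]
    for i in range(1, n):
        cand = delta[i - 1][:, None] + trans_scores + state_scores[i]
        backptr[i] = cand.argmax(axis=0)
        delta[i] = cand.max(axis=0)
    path = [int(delta[-1].argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(backptr[i][path[-1]]))
    return [EVENTS[s] for s in reversed(path)]

print(viterbi(state_scores, trans_scores))
```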

3 Experimental Setup

3.1 Data and Task Description

The sentence-like units in speech are different from those in written text. In conversational speech, these units can be well-formed sentences, phrases, or even a single word. These units are called SUs in the DARPA EARS program. SU boundaries, as well as other structural metadata events, were annotated by LDC according to an annotation guideline (Strassel, 2003). Both the transcription and the recorded speech were used by the annotators when labeling the boundaries.

The SU detection task is conducted on two corpora: Broadcast News (BN) and Conversational Telephone Speech (CTS). BN and CTS differ in genre and speaking style. The average length of SUs is longer in BN than in CTS, that is, 12.35 words (standard deviation 8.42) in BN compared to 7.37 words (standard deviation 8.72) in CTS. This difference is reflected in the frequency of SU boundaries: about 14% of interword boundaries are SUs in CTS compared to roughly 8% in BN. Training and test data for the SU detection task are those used in the NIST Rich Transcription 2003 Fall evaluation. We use both the development set and the evaluation set as the test set in this paper in order to obtain more meaningful results. For CTS, there are about 40 hours of conversational data (around 480K words) from the Switchboard corpus for training and 6 hours (72 conversations) for testing. The BN data has about 20 hours of Broadcast News shows (about 178K words) in the training set and 3 hours (6 shows) in the test set. Note that the SU-annotated training data is only a subset of the data used for the speech recognition task because more effort is required to annotate the boundaries.

For testing, the system determines the locations of sentence boundaries given the word sequence W and the speech. The SU detection task is evaluated on both the reference human transcriptions (REF) and speech recognition outputs (STT). Evaluation across transcription types allows us to obtain the performance for the best-case scenario when the transcriptions are correct, thus factoring out the confounding effect of speech recognition errors on the SU detection task. We use the speech recognition output obtained from the SRI recognizer (Stolcke et al., 2003).

System performance is evaluated using the official NIST evaluation tools.⁴ System output is scored by first finding a minimum edit distance alignment between the hypothesized word string and the reference transcriptions, and then comparing the aligned event labels. The SU error rate is defined as the total number of deleted or inserted SU boundary events, divided by the number of true SU boundaries. In addition to this NIST SU error metric, we also use the total number of interword boundaries as the denominator, and thus obtain results for the per-boundary-based metric.

⁴ See http://www.nist.gov/speech/tests/rt/rt2003/fall/ for more details about scoring.
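As an illustration of the two metrics, the sketch below (which assumes the hypothesis words have already been aligned to the reference, a step the NIST tools perform via minimum edit distance) counts inserted and deleted boundary events and divides by either the number of reference SUs or the number of interword boundaries.

```python
# Illustrative sketch: computing the NIST SU error rate and the per-boundary
# error rate from aligned reference/hypothesis boundary flags.
def su_error_rates(ref_boundaries, hyp_boundaries):
    """ref/hyp_boundaries: lists of 0/1 flags, one per interword boundary."""
    assert len(ref_boundaries) == len(hyp_boundaries)
    insertions = sum(1 for r, h in zip(ref_boundaries, hyp_boundaries) if h and not r)
    deletions = sum(1 for r, h in zip(ref_boundaries, hyp_boundaries) if r and not h)
    num_ref_su = sum(ref_boundaries)
    nist_error = (insertions + deletions) / num_ref_su              # NIST SU error rate
    per_boundary = (insertions + deletions) / len(ref_boundaries)   # boundary-based rate
    return nist_error, per_boundary

ref = [0, 1, 0, 0, 1, 0, 0, 1]
hyp = [0, 1, 1, 0, 0, 0, 0, 1]
print(su_error_rates(ref, hyp))  # one insertion + one deletion over 3 reference SUs
```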
3.2 Feature Extraction and Modeling

To obtain a good-quality estimate of the conditional probability of the event tag given the observations, P(E_i | O_i), the observations should be based on features that are discriminative of the two events (SU versus not). As in (Liu et al., 2004), we utilize both textual and prosodic information.

We extract prosodic features that capture duration, pitch, and energy patterns associated with the word boundaries (Shriberg et al., 2000). For all the modeling methods, we adopt a modular approach to model the prosodic features, that is, a decision tree classifier is used to model them. During testing, the decision tree prosody model estimates posterior probabilities of the events given the associated prosodic features for a word boundary. The posterior probability estimates are then used in the various modeling approaches in different ways, as described later.
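For the discriminative models, these real-valued posteriors are later discretized into cumulative bins (see Table 1 and the feature list below). The following sketch shows one plausible way to do this; the thresholds are arbitrary, since the exact cutoffs are not stated in this paper.

```python
# Illustrative sketch: turning a real-valued prosody-model posterior into
# cumulative binary bins for the Maxent/CRF feature sets. The thresholds are
# hypothetical and only meant to show the "cumulative" encoding.
def cumulative_bins(posterior, thresholds=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Return indicator features 'prosody>=t' for every threshold the posterior clears."""
    return {f"prosody>={t}": 1 for t in thresholds if posterior >= t}

print(cumulative_bins(0.62))  # {'prosody>=0.1': 1, 'prosody>=0.3': 1, 'prosody>=0.5': 1}
```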

Since words and sentence boundaries are mutually constraining, the word identities themselves (from automatic recognition or human transcriptions) constitute a primary knowledge source for sentence segmentation. We also make use of various automatic taggers that map the word sequence to other representations. Tagged versions of the word stream are provided to support various generalizations of the words and to smooth out possibly undertrained word-based probability estimates. These tags include part-of-speech tags, syntactic chunk tags, and automatically induced word classes. In addition, we use extra text corpora, which were not annotated according to the guideline used for the training and test data (Strassel, 2003). For BN, we use the training corpus for the speech recognition LM. For CTS, we use the Penn Treebank Switchboard data. There is punctuation information in both, which we use to approximate SUs as defined in the annotation guideline (Strassel, 2003).

As explained in Section 1, the prosody model and the N-gram LM can be integrated in an HMM. When various textual information is used, jointly modeling words and tags may be an effective way to model the richer feature set; however, a joint model requires more parameters. Since the training set for the SU detection task in the EARS program is quite limited, we use a loosely coupled approach:

- Linearly combine three LMs: the word-based LM from the LDC training data, the automatic-class-based LMs, and the word-based LM trained from the additional corpus. These interpolated LMs are then combined with the prosody model via the HMM. The posterior probabilities of events at each boundary are obtained from this step, denoted as P_HMM(E_i | W, C, F).

- Apply the POS-based LM alone to the POS sequence (obtained by running the POS tagger on the word sequence W) and generate the posterior probabilities for each word boundary, P_posLM(E_i | POS), which are then combined with the posteriors from the previous step, i.e., P_final(E_i | T, F) = λ P_HMM(E_i | W, C, F) + (1 − λ) P_posLM(E_i | POS); a sketch of this interpolation follows the list.
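The sketch below illustrates this loose coupling for a single boundary; the interpolation weight is a hypothetical tuning constant, not a value reported in this paper.

```python
# Illustrative sketch: interpolating the HMM-derived posterior with the
# POS-LM posterior at each boundary, P_final = lam * P_HMM + (1 - lam) * P_posLM.
def combine_posteriors(p_hmm, p_poslm, lam=0.8):
    """Weighted interpolation of two event posterior distributions."""
    return {event: lam * p_hmm[event] + (1 - lam) * p_poslm[event] for event in p_hmm}

p_hmm = {"NO_BOUNDARY": 0.35, "BOUNDARY": 0.65}
p_poslm = {"NO_BOUNDARY": 0.55, "BOUNDARY": 0.45}
print(combine_posteriors(p_hmm, p_poslm))
```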

The features used for the CRF are the same as those used for the Maxent model devised for the SU detection task (Liu et al., 2004), briefly listed below:

- N-grams of words or various tags (POS tags, automatically induced classes). Different orders N and different position information are used (N varies from one through four).

- The cumulative binned posterior probabilities from the decision tree prosody model.

- Posterior event probabilities estimated by the N-gram LM trained from the extra corpus for the LDC-annotated training and test sets; these posteriors are then thresholded to yield binary features.

- Other features: speaker or turn change, and compound features of POS tags and decisions from the prosody model.

Table 1 summarizes the features and their representations used in the three modeling approaches. The same knowledge sources are used in these approaches, but with different representations. The goal of this paper is to evaluate the ability of these three modeling approaches to combine prosodic and textual knowledge sources, not in a rigidly parallel fashion, but by exploiting the inherent capabilities of each approach. We attempt to compare the models in as parallel a fashion as possible; however, it should be noted that the two discriminative methods better model the textual sources, whereas the HMM better models prosody given its representation in this study.

Table 1: Knowledge sources and their representations in different modeling approaches: HMM, Maxent, and CRF.

Knowledge source                        | HMM (generative model)    | Maxent and CRF (conditional approach)
Sequence information                    | yes                       | no (Maxent) / yes (CRF)
LDC data set (words or tags)            | LM                        | N-grams as indicator functions
Probability from prosody model          | real-valued               | cumulatively binned
Additional text corpus                  | N-gram LM                 | binned posteriors
Speaker turn change                     | in prosodic features      | a separate feature, in addition to being in the prosodic feature set
Compound feature of POS and prosody     | no                        | POS tags and decisions from prosody model

4 Experimental Results and Discussion

SU detection results using the CRF, HMM, and Maxent approaches individually, on the reference transcriptions or speech recognition output, are shown in Tables 2 and 3 for CTS and BN data, respectively. We present results when different knowledge sources are used: word N-gram only, word N-gram and prosodic information, and all the features described in Section 3.2. The word N-grams are from the LDC training data and the extra text corpora. 'All features' means adding textual information based on tags, as well as the 'other features' in the Maxent and CRF models. The detection error rate is reported using the NIST SU error rate, as well as the per-boundary-based classification error rate (in parentheses in the tables) in order to factor out the effect of the different SU priors. Also shown in the tables are the majority vote results over the three modeling approaches when all the features are used.

Table 2: Conversational telephone speech SU detection results reported using the NIST SU error rate (%) and the boundary-based error rate (% in parentheses) using the HMM, Maxent, and CRF individually and in combination. Note that the 'all features' condition uses all the knowledge sources described in Section 3.2. 'Vote' is the result of the majority vote over the three modeling approaches, each of which uses all the features. The baseline error rate when assuming there is no SU boundary at each word boundary is 100% for the NIST SU error rate and 15.7% for the boundary-based metric.

Conversational Telephone Speech        HMM            Maxent         CRF
REF   word N-gram                      42.02 (6.56)   43.70 (6.82)   37.71 (5.88)
      word N-gram + prosody            33.72 (5.26)   35.09 (5.47)   30.88 (4.82)
      all features                     31.51 (4.92)   30.66 (4.78)   29.47 (4.60)
      Vote: 29.30 (4.57)
STT   word N-gram                      53.25 (8.31)   53.92 (8.41)   50.20 (7.83)
      word N-gram + prosody            44.93 (7.01)   45.50 (7.10)   43.12 (6.73)
      all features                     43.05 (6.72)   43.02 (6.71)   42.00 (6.55)
      Vote: 41.88 (6.53)

4.1 CTS Results

For CTS, we find from Table 2 that the CRF is superior to both the HMM and the Maxent model across all conditions (the differences are significant at p < 0.05). When using only the word N-gram information, the gain of the CRF is the greatest, with the differences among the models diminishing as more features are added. This may be due to the impact of the sparse data problem on the CRF, or simply due to the fact that differences among modeling approaches are smaller when features become stronger, that is, the good features compensate for the weaknesses in the models. Notice that with fewer knowledge sources (e.g., using only word N-gram and prosodic information), the CRF is able to achieve performance similar to or even better than the other methods using all the knowledge sources. This may be useful when feature extraction is computationally expensive.

We observe from Table 2 that there is a large increase in error rate when evaluating on speech recognition output. This happens in part because word information is inaccurate in the recognition output, thus impacting the effectiveness of the LMs and lexical features. The prosody model is also affected, since the alignment of incorrect words to the speech is imperfect, thereby degrading prosodic feature extraction. However, the prosody model is more robust to recognition errors than textual knowledge, because of its lesser dependence on word identity. The results show that the CRF suffers most from the recognition errors. By focusing on the results when only word N-gram information is used, we can see the effect of word errors on the models. The SU detection error rate increases more in the STT condition for the CRF model than for the other models, suggesting that the discriminative CRF model suffers more from the mismatch between the training condition (using the reference transcription) and the test condition (features obtained from the errorful words).

We also notice from the CTS results that when only word N-gram information is used (with or without prosodic information), the HMM is superior to the Maxent; only when various additional textual features are included in the feature set does Maxent show its strength compared to the HMM, highlighting the benefit of Maxent's handling of the textual features.

The combined result (using majority vote) of the three approaches in Table 2 is superior to any model alone (although the improvement is not significant). Previously, it was found that the Maxent and HMM posteriors combine well because the two approaches have different error patterns (Liu et al., 2004). For example, Maxent yields fewer insertion errors than the HMM because of its reliance on different knowledge sources. The toolkit we use for the implementation of the CRF does not generate a posterior probability for a sequence; therefore, we do not combine the system output via posterior probability interpolation, which would be expected to yield better performance.
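For clarity, the majority vote used for the 'Vote' rows reduces to the following simple per-boundary rule (illustrative sketch; the variable names and decision lists are hypothetical).

```python
# Illustrative sketch: three-way majority vote over per-boundary decisions of
# the HMM, Maxent, and CRF systems (each using all features).
def majority_vote(hmm, maxent, crf):
    """Each argument is a list of 0/1 boundary decisions; returns the voted list."""
    return [1 if (a + b + c) >= 2 else 0 for a, b, c in zip(hmm, maxent, crf)]

hmm    = [1, 0, 1, 0, 1]
maxent = [1, 0, 0, 0, 1]
crf    = [1, 1, 1, 0, 0]
print(majority_vote(hmm, maxent, crf))  # [1, 0, 1, 0, 1]
```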
4.2 BN Results

Table 3: Broadcast news SU detection results reported using the NIST SU error rate (%) and the boundary-based error rate (% in parentheses) using the HMM, Maxent, and CRF individually and in combination. The baseline error rate is 100% for the NIST SU error rate and 7.2% for the boundary-based metric.

Broadcast News                     HMM            Maxent         CRF
REF   word N-gram                  80.44 (5.83)   81.30 (5.89)   74.99 (5.43)
      word N-gram + prosody        59.81 (4.33)   59.69 (4.33)   54.92 (3.98)
      all features                 48.72 (3.53)   48.61 (3.52)   47.92 (3.47)
      Vote: 46.28 (3.35)
STT   word N-gram                  84.71 (6.14)   86.13 (6.24)   80.50 (5.83)
      word N-gram + prosody        64.58 (4.68)   63.16 (4.58)   59.52 (4.31)
      all features                 55.37 (4.01)   56.51 (4.10)   55.37 (4.01)
      Vote: 54.29 (3.93)

Table 3 shows the SU detection results for BN. Similar to the patterns found for the CTS data, the CRF consistently outperforms the HMM and Maxent, except in the STT condition when all the features are used. The CRF yields relatively less gain over the other approaches on BN than on CTS. One possible reason for this difference is that there is more training data for the CTS task, and both the CRF and Maxent approaches require a relatively larger training set than the HMM. Overall, the degradation in the STT condition for BN is smaller than for CTS. This can be easily explained by the difference in word error rates, 22.9% on CTS and 12.1% on BN. Finally, the vote among the three approaches outperforms any single model on both the REF and STT conditions, and the gain from voting is larger for BN than for CTS.

Comparing Table 2 and Table 3, we find that the NIST SU error rate on BN is generally higher than on CTS. This is partly because the NIST error rate is measured as the percentage of errors per reference SU, and the number of SUs in CTS is much larger than for BN, giving a larger denominator and a relatively lower error rate for the same number of boundary detection errors. Another reason is that the training set is smaller for BN than for CTS. Finally, the two genres differ significantly: CTS has the advantage of frequent backchannels and first person pronouns that provide good cues for SU detection. When the boundary-based classification metric is used (results in parentheses), the SU error rate is lower on BN than on CTS; however, it should also be noted that the baseline error rate (i.e., the prior of SUs) is lower on BN than on CTS.

5 Conclusion and Future Work

Finding sentence boundaries in speech transcriptions is important for improving readability and aiding downstream language processing modules. In this paper, prosodic and textual knowledge sources are integrated for detecting sentence boundaries in speech. We have shown that a discriminatively trained CRF model is a competitive approach for the sentence boundary detection task. The CRF combines the advantages of being discriminatively trained and able to model the entire sequence, and so it outperforms the HMM and Maxent approaches consistently across various testing conditions. The CRF takes longer to train than the HMM and Maxent models, especially when the number of features becomes large; the HMM requires the least training time of all the approaches. We also find that as more features are used, the differences among the modeling approaches decrease. We have explored different approaches to modeling various knowledge sources in an attempt to achieve good performance for sentence boundary detection. Note that we have not fully optimized each modeling approach. For example, for the HMM, using discriminative training methods is likely to improve system performance, but possibly at the cost of reducing the accuracy of the combined system.

In future work, we will examine the effect of Viterbi decoding versus forward-backward decoding for the CRF approach, since the latter better matches the classification accuracy metric. To improve SU detection results in the STT condition, we plan to investigate approaches that model recognition uncertainty in order to mitigate the effect of word errors. Another future direction is to investigate how to incorporate prosodic features more directly and effectively in the Maxent or CRF framework, rather than using a separate prosody model and then binning the resulting posterior probabilities. Important ongoing work includes investigating the impact of SU detection on downstream language processing modules, such as parsing. For these applications, generating probabilistic SU decisions is crucial since that information can be more effectively used by subsequent modules.

6 Acknowledgments

The authors thank the anonymous reviewers for their valuable comments, and Andrew McCallum and Aron Culotta at the University of Massachusetts and Fernando Pereira at the University of Pennsylvania for their assistance with their CRF toolkit. This work has been supported by DARPA under contract MDA972-02-C-0038, NSF-STIMULATE under IRI-9619921, NSF KDI BCS-9980054, and ARDA under contract MDA904-03-C-1788. Distribution is unlimited. Any opinions expressed in this paper are those of the authors and do not reflect the views of the funding agencies. Part of the work was carried out while the last author was on leave from Purdue University and at NSF.

References

S. Chen and R. Rosenfeld. 1999. A Gaussian prior for smoothing maximum entropy models. Technical report, Carnegie Mellon University.

H. Christensen, Y. Gotoh, and S. Renals. 2001. Punctuation annotation using statistical prosody models. In ISCA Workshop on Prosody in Speech Recognition and Understanding.

Y. Gotoh and S. Renals. 2000. Sentence boundary detection in broadcast speech transcripts. In Proceedings of the ISCA Workshop: Automatic Speech Recognition: Challenges for the New Millennium ASR-2000, pages 228–235.

J. Huang and G. Zweig. 2002. Maximum entropy model for punctuation annotation from speech. In Proceedings of the International Conference on Spoken Language Processing, pages 917–920.

J. Kim and P. C. Woodland. 2001. The use of prosody in a combined system for punctuation generation and speech recognition. In Proceedings of the European Conference on Speech Communication and Technology, pages 2757–2760.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning, pages 282–289.

Y. Liu, A. Stolcke, E. Shriberg, and M. Harper. 2004. Comparing and combining generative and posterior probability models: Some advances in sentence boundary detection in speech. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Y. Liu. 2004. Structural Event Detection for Rich Transcription of Speech. Ph.D. thesis, Purdue University.

A. McCallum and W. Li. 2003. Early results for named entity recognition with conditional random fields. In Proceedings of the Conference on Computational Natural Language Learning.

A. McCallum. 2002. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu.

NIST-RT03F. 2003. RT-03F workshop agenda and presentations. http://www.nist.gov/speech/tests/rt/rt2003/fall/presentations/, November.

F. Sha and F. Pereira. 2003. Shallow parsing with conditional random fields. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics Annual Meeting.

E. Shriberg, A. Stolcke, D. Hakkani-Tur, and G. Tur. 2000. Prosody-based automatic segmentation of speech into sentences and topics. Speech Communication, pages 127–154.

A. Stolcke and E. Shriberg. 1996. Automatic linguistic segmentation of conversational speech. In Proceedings of the International Conference on Spoken Language Processing, pages 1005–1008.

A. Stolcke, H. Franco, R. Gadde, M. Graciarena, K. Precoda, A. Venkataraman, D. Vergyri, W. Wang, and J. Zheng. 2003. Speech-to-text research at SRI-ICSI-UW. http://www.nist.gov/speech/tests/rt/rt2003/spring/presentations/index.htm.

S. Strassel. 2003. Simple Metadata Annotation Specification V5.0. Linguistic Data Consortium.