Prosodically-based automatic segmentation and punctuation

Helena Moniz1,2, Fernando Batista2,3, Hugo Meinedo2, Alberto Abad2, Isabel Trancoso2, Ana Isabel Mata1, Nuno Mamede2

1 FLUL/CLUL, University of Lisbon, Portugal; 2 IST / INESC-ID, Lisbon, Portugal; 3 ISCTE, Lisbon, Portugal
{helenam;fmmb;meinedo;alberto;isabel.trancoso;njm}@l2f.inesc-id.pt and [email protected]

Abstract

This work explores prosodic/acoustic cues for improving a baseline phone segmentation module. The baseline version is provided by a large vocabulary continuous speech recognition system. An analysis of the baseline results revealed problems in boundary detection, which we tried to solve by using post-processing rules based on prosodic features (pitch, energy and duration). These rules achieved better results in terms of inter-word pause detection, durations of previously detected silent pauses, and also durations of phones at initial and final sentence-like unit boundaries. These improvements may be relevant not only for retraining acoustic models, but also for the automatic punctuation task. Both tasks were evaluated, and the results based on more reliable boundaries are promising. This work allows us to tackle more challenging problems, combining prosodic and lexical features for the identification of sentence-like units.

Index Terms: automatic phone segmentation, punctuation

1. Introduction

The main motivation of our work is the improvement of the punctuation module of our automatic broadcast news captioning system. Although this system is deployed for several languages, including English, Spanish and Brazilian Portuguese, the current paper refers only to the European Portuguese (EP) version. Like the audio diarization and speech recognition modules that precede them, the punctuation and capitalization modules share low latency requirements.

Although the use of prosodic features in automatic punctuation methods is well studied for some languages, the first implemented version for EP deals only with full stop and comma recovery, and explores a limited set of features, simultaneously targeting low latency and language independence. The aim of this work is to improve the punctuation module, first by exploring additional features, namely prosodic ones, and later by encompassing interrogatives. This paper describes our steps in the first direction.

One of the most important prosodic features is the duration of silent pauses. Even though they may not be directly converted into punctuation, silent pauses are in fact a basic cue for punctuation and speaker diarization. The durations of phones and silent pauses are automatically provided by our large vocabulary continuous speech recognition module. An analysis of these results, however, revealed several problems, namely in the boundaries of silent pauses and in their frequent miss-detection. These problems motivated the use of post-processing rules based on prosodic features, to better adjust the boundaries of silent pauses. The better results achieved with this prosodic module motivated the retraining of both the acoustic models and the punctuation models.

This work was done using a subset of the EP broadcast news corpus, collected during the ALERT European project. Although the corpus used for training/development/evaluation of the speech recognizer includes 51h/5h of orthographically transcribed audio, a limited subset of 1h was transcribed at the word boundary level, in order to allow us to evaluate the efficacy of the post-processing rules. With this sample we could evaluate the robustness of the speech segmentation with several speakers, in prepared non-scripted and spontaneous speech settings, with different strategies regarding speech segmentation and speech rate.

The next section reviews related work. The post-processing rules and their impact on word segmentation results are described in Sections 3 and 4, respectively. Section 5 deals with the retraining of acoustic models. Section 6 is devoted to the description of the punctuation module and the results obtained before and after the retraining. Conclusions and future work are presented in Section 7.

2. Related work

2.1. Segmentation

Speech has flexible structures. Speakers plan what to say on-line and make use of different cues from context; thus spontaneous speech may contain elliptic utterances, backchannel expressions, disfluencies, and overlapping speech, inter alia. It is also characterized by temporal properties (speech rate, elongated linguistic material, etc.) that make its modeling difficult. This set of properties poses interesting challenges both from a linguistic and from an automatic speech recognition point of view [1]. Our work focuses on broadcast news, where one can find both spontaneous and read speech.

The study by [2] shows that there are different levels of segmentation when combining linguistic features with automatic methods. We could say that those different types of segmentation reflect an increasing gradient scale: pause-based; pause-based with lexical information; the previous plus dialog act information; and topic and speaker segmentation. While pause and speaker segmentation are based on audio diarization techniques, the remaining types are related with structural segmentation methods. Audio diarization may comprehend identification of jingles, speech/non-speech detection, speaker clustering and identification, etc. Structural segmentation concerns algorithms based on linguistic information to delimit "spoken sentences" (units that may not be isomorphic to written sentences), and topic and story segmentation. This structural segmentation is the core of our present work.

Several studies (e.g., [3, 1, 4, 5, 2, 6]) have shown that the analysis of prosodic features can be used to model and improve natural language processing systems. Prosodic features such as pause, final lengthening, and pitch reset, inter alia, are among the most salient cues used in algorithms based on linguistic information. Their use is supported by evidence that these cues are language-independent [7], and also by the fact that the languages studied so far have prosodic strategies to delimit sentence-like units (SUs) and paragraphs with pitch amplitude, pitch contours, boundary tones and pauses.

We do know that there is no one-to-one mapping between prosody and punctuation. Silent pauses, for instance, cannot be directly transformed into punctuation marks, for different reasons: prosodic constraints regarding the weight of a constituent; speech rate; style; and different pragmatic functions, such as emphasis, emotion, or on-line planning. However, the correct identification of silent pauses and phone delimitation do contribute to the segmentation of speech into sentence-like units, and do in fact contribute to punctuation.

2.2. Punctuation

Although different punctuation marks can be used in spoken texts, most of them rarely occur and are quite difficult to automatically insert or evaluate. Hence, most studies focus either on full stop or on comma. Comma is usually the most frequent punctuation mark, but it is also the most problematic, because it serves many different purposes. [8] describes a method for inserting commas into text, and presents a qualitative evaluation based on user satisfaction, concluding that the system performance is qualitatively higher than the sentence accuracy rate would indicate.

Detecting positions where a punctuation mark is missing roughly corresponds to the task of detecting an SU, or finding the SU boundaries. SU boundary detection has gained increasing attention during recent years, and it has been part of the NIST rich transcription evaluations. A general HMM (Hidden Markov Model) framework that allows the combination of lexical and prosodic cues for recovering full stop, comma and question marks is used by [9] and [10]. A similar approach was also used for detecting sentence boundaries by [11, 1, 12]. [10] also combines 4-gram language models with a CART (Classification and Regression Tree) and concludes that prosodic information highly improves the results. [13] describes a maximum entropy (ME) based method for inserting punctuation marks into spontaneous conversational speech, where the punctuation task is considered as a tagging task and words are tagged with the appropriate punctuation. It covers three punctuation marks (comma, full stop, and question mark), and the best results on the ASR output are achieved by combining lexical and prosodic features. [6] proposes a multi-pass linear fold algorithm for sentence boundary detection in spontaneous speech, which uses prosodic features, focusing on the relation between sentence boundaries and break indices and duration, and covering their local and global structural properties. Other recent studies have shown that the best performance for the punctuation task is achieved when prosodic, morphologic and syntactic information are combined [2, 12, 14].

3. Word boundaries and silent pauses

3.1. Baseline phone segmentation

The first module of our broadcast news processing pipeline, after jingle detection, performs audio diarization [15]. The second module is the automatic speech recognition module. Audimus is a hybrid automatic speech recognizer [15] that combines the temporal modeling capabilities of Hidden Markov Models with the discriminative pattern classification capabilities of Multi-Layer Perceptrons (MLPs). The vocabulary has 100k words. Modeling context dependency is a particularly hard problem in hybrid systems. Our current solution uses, in addition to monophone units modeled by a single state, multiple-state monophone units, and a fixed set of phone transition units aimed specifically at modeling the most frequent intra-word phone transitions [16]. Using this strategy, the word error rate (WER) for the current test set of 5h was 22.0%.

This recognizer was used in forced alignment mode on our reduced test set of 1h duration, which was manually transcribed at the word boundary level. As explained above, this revealed several problems, namely in the boundaries of silent pauses and in their frequent miss-detection.

3.2. Post-processing rules

Reducing these problems was the motivation for first applying post-processing rules to the baseline results, and later retraining the speech recognition models. These post-processing rules were applied off-line, and used both pitch and energy information. Pitch values were extracted using the Snack Sound Toolkit (http://www.speech.kth.se/snack/), but the only information used was the presence or absence of pitch.

The energy information was also extracted off-line for each audio file. Speech and non-speech portions of the audio data were automatically segmented at the frame level with a bi-Gaussian model of the log energy distribution. That is, for each audio sample a 1-dimensional energy-based Gaussian model with two mixtures is trained. The Gaussian mixture with the "lowest" mean is expected to correspond to silence or background noise, and the one with the "highest" mean corresponds to speech. Then, frames of the audio file having a higher likelihood under the speech mixture are labeled as speech, and those that are more likely generated by the non-speech mixture are labeled as silence.

The integration of this extra information was implemented as a post-processing stage with the following rules:

1. if the word starts with a plosive sound, the duration of the preceding pause is unchanged (typically around 50 to 60 ms for EP);
2. if the word starts or ends with a fricative, the energy-based segmentation is used;
3. if the word starts with a liquid sound, energy and pitch are used;
4. otherwise, boundaries are delimited by pitch.

With these rules, we expected to obtain more adequate word boundaries than with our previous segmentation methods, without imposing thresholds for silent pause durations, recognized by [5] as misleading cues that do not account for differences between speakers, speech rates or speech genres.
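As a rough illustration, the rule dispatch above can be sketched as follows. The phone classes, the `refine_pause_boundary` helper, and the way rule 3 combines the energy- and pitch-based boundaries are all illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of the boundary post-processing dispatch (Section 3.2).
# Phone classes use illustrative SAMPA-like symbols, not the full EP inventory.
PLOSIVES = {"p", "t", "k", "b", "d", "g"}
FRICATIVES = {"f", "s", "S", "v", "z", "Z"}
LIQUIDS = {"l", "L", "r", "R"}

def refine_pause_boundary(word_phones, energy_boundary, pitch_boundary, baseline_boundary):
    """Pick the boundary source (in seconds) for the pause preceding a word:
    plosive onset -> keep baseline; fricative onset/coda -> energy-based;
    liquid onset -> combine energy and pitch (here: average); else pitch."""
    first, last = word_phones[0], word_phones[-1]
    if first in PLOSIVES:
        return baseline_boundary      # rule 1: preceding pause unchanged
    if first in FRICATIVES or last in FRICATIVES:
        return energy_boundary        # rule 2: energy-based segmentation
    if first in LIQUIDS:
        # rule 3: one possible way of using both energy and pitch
        return (energy_boundary + pitch_boundary) / 2.0
    return pitch_boundary             # rule 4: delimit by pitch

print(refine_pause_boundary(["t", "a"], 1.20, 1.25, 1.10))  # plosive onset: keeps 1.1
```

The dispatch is purely a function of the word-initial (and word-final) phone class, which keeps the post-processing stage cheap enough to run over full broadcasts off-line.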


Figure 1: Improvement in terms of correct word boundaries, after post-processing.

Figure 3: Improvement in terms of correct word boundaries, after retraining.
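The frame-level speech/non-speech labelling described in Section 3.2 (a two-mixture Gaussian model of the log energy, with the lower-mean mixture taken as silence) can be sketched as a toy EM fit on synthetic log-energy values. The feature extraction and fitting details of the actual system may differ; this is only a minimal sketch of the bi-Gaussian idea:

```python
import math
import random

def fit_bigaussian(x, iters=50):
    """Fit a 1-D, 2-mixture Gaussian model to log-energy values x with EM."""
    lo, hi = min(x), max(x)
    mu = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]  # spread initial means
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each mixture for each frame
        resp = []
        for xi in x:
            p = [w[k] * math.exp(-(xi - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in (0, 1)]
            s = p[0] + p[1]
            resp.append((p[0] / s, p[1] / s))
        # M-step: re-estimate weights, means and variances
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(x)
            mu[k] = sum(r[k] * xi for r, xi in zip(resp, x)) / nk
            var[k] = max(1e-6, sum(r[k] * (xi - mu[k]) ** 2
                                   for r, xi in zip(resp, x)) / nk)
    return mu, var, w

def label_frames(x):
    """Label each frame: the mixture with the higher mean is 'speech';
    each frame takes the label of the mixture it is more likely under."""
    mu, var, w = fit_bigaussian(x)
    speech = 0 if mu[0] > mu[1] else 1
    labels = []
    for xi in x:
        lik = [w[k] * math.exp(-(xi - mu[k]) ** 2 / (2 * var[k]))
               / math.sqrt(2 * math.pi * var[k]) for k in (0, 1)]
        labels.append("speech" if lik[speech] >= lik[1 - speech] else "silence")
    return labels

# Toy data: low log-energy frames (background) followed by high-energy frames (speech).
random.seed(0)
frames = [random.gauss(-8.0, 0.5) for _ in range(200)] + \
         [random.gauss(-2.0, 0.5) for _ in range(200)]
labels = label_frames(frames)
```

Because the model is trained per audio file, the speech/silence decision adapts to each file's noise floor rather than relying on a global energy threshold.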

4. Results

By comparing the results in terms of word boundaries before and after the post-processing stage on the limited test set of 1h duration, we found that 9.3% of the constituent initial phones and 10.1% of the constituent final phones were modified in terms of boundaries. As for the inter-word pauses, 62.5% of them were modified and 10.9% more were added. Figure 1 illustrates the improvement in terms of correct boundaries when different boundary thresholds are used. The graph shows that most of the improvements are concentrated in an interval corresponding to 5-60 ms. Our manual reference has 443.82 seconds of inter-word pauses; the modified version correctly identified 67.71 seconds of silence more than the original one, but there are still 14.81 seconds of silence that were not detected.

Figure 2 shows an example of a silent pause detection corresponding to a comma. The two automatic transcriptions correspond to the results obtained before (miss-detection) and after post-processing.

Figure 2: Phone segmentation before (top) and after (bottom) post-processing. The original sentence is "o Infarmed analisa cerca de quinhentos [medicamentos], os que levantam mais dúvidas quanto à sua eficácia." ("Infarmed analyses about five hundred [drugs], those that raise the most doubts about their efficacy."). Initial and final word phones are marked with "L-" and "+R", respectively, whereas frequent phone transition units are marked with "=".

Two properties of EP trigger erroneous segmentation: phonetically fricatized voiced and unvoiced plosives [17], such as [d] or [t], and the epenthetic vowel (in EP, [1]), both at the end of a chunk followed by a pause. In EP, the last post-tonic syllable of a chunk, corresponding to a plosive plus vowel, may have the vowel elided and the remaining plosive uttered with fricatization effects, with a burst and unvoiced fricative spectra. To the best of our knowledge, the relationship of this effect with the prosodic structure is still not well known. In our reduced data set, these fricatization effects seem to occur consistently at the end of an intonational phrase, and may phonetically indicate prosodic phrasing.

The epenthetic vowel also causes miss-detections. The automatic segmentation method covers cases of reduction (one of the most frequent processes in EP), but it is still not well adjusted to this specific insertion. The use of alternative pronunciations could be a possible solution. Other cases of erroneous segmentation occur in sequences with disfluencies and long creaky effects, in segments with laughter, and in overlapping speech sequences.

5. Impact on acoustic models training

We have retrained a new acoustic model using the phone boundaries modified by the above-mentioned rules. We verified that with this second model the WER decreases to 21.5%. We also compared the number of correct phone boundaries for a given threshold in the results produced by the two acoustic models; Figure 3 shows the corresponding results. The graph shows that the phone boundaries produced by the second acoustic model are closer to the manual reference.

6. Impact on Punctuation

In order to analyze the impact of the previous work on the punctuation task, we conducted two experiments, using the original and the modified phone boundaries. We used a discriminative approach based on maximum entropy (ME) models, which provide a clean way of expressing and combining different properties of the information. This is especially useful for the punctuation task, given the broad set of available lexical, acoustic and prosodic features. This approach requires all information to be expressed in terms of features, causing the resulting data file to become several times larger than the original one. Classification is straightforward, making it interesting for on-the-fly usage. Experiments described in this paper use the MegaM tool [18] for training the maximum entropy models, which uses conjugate gradient and logistic regression. The evaluation is performed using the following performance metrics: Precision, Recall and SER (Slot Error Rate) [19]. Only punctuation marks are considered as slots and used by these metrics.
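As a simplified illustration of these slot-based metrics, the counts could be computed as follows, assuming the hypothesis marks are already aligned to reference word positions (the function name, dictionaries and positions are invented for the example):

```python
def punctuation_scores(ref, hyp):
    """Compute Precision, Recall and Slot Error Rate over aligned slots.
    `ref` and `hyp` map word positions to punctuation marks ('.' or ',');
    a position absent from a dict carries no punctuation mark there."""
    correct = sum(1 for i, m in hyp.items() if ref.get(i) == m)
    substitutions = sum(1 for i, m in hyp.items() if i in ref and ref[i] != m)
    insertions = sum(1 for i in hyp if i not in ref)
    deletions = sum(1 for i in ref if i not in hyp)
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    # SER: all slot errors divided by the number of reference slots
    ser = (substitutions + insertions + deletions) / len(ref) if ref else 0.0
    return precision, recall, ser

# Reference marks after words 3, 7 and 12; the hypothesis gets one right,
# substitutes one, inserts a spurious one, and misses one.
ref = {3: ",", 7: ".", 12: "."}
hyp = {3: ",", 7: ",", 9: "."}
print(punctuation_scores(ref, hyp))  # (0.3333333333333333, 0.3333333333333333, 1.0)
```

Note that, unlike precision and recall, SER can exceed 1.0, since insertions are also counted against the number of reference slots.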
The SER is computed by dividing the number of punctuation errors by the number of punctuation marks in the reference data.

The punctuation experiments described here consider only the two most frequent punctuation marks: full stop and comma. All the other punctuation marks were converted into one of these two, in accordance with the following rule: ".", ";", "!", "?", "..." => full stop; ",", "-" => comma. The training and evaluation tasks are performed using automatic transcripts produced by our recognition system [15]. The reference punctuation was provided by manual transcripts, initially created by several annotators without linguistic background and recently revised by an expert linguist. Only about 65% of the original punctuation marks were kept in the revised version. The corpus was automatically annotated with part-of-speech information, using MARv [20].

The following features were used for a given word w in position i of the corpus: wi, wi+1, 2wi−2, 2wi−1, 2wi, 2wi+1, 3wi−2, 3wi−1, pi, pi+1, 2pi−2, 2pi−1, 2pi, 2pi+1, 3pi−2, 3pi−1, GenderChgs1, SpeakerChgs1, and TimeGap1, where: wi is the current word, wi+1 is the word that follows, and nwi±x is the n-gram of words starting x positions before or after position i; pi is the part-of-speech of the current word, and npi±x is the n-gram of part-of-speech tags starting x positions before or after position i. GenderChgs1 and SpeakerChgs1 correspond to changes in speaker gender and in speaker clusters; TimeGap1 corresponds to the time period between the current and the following word. For the moment, only standard lexical and acoustic features are being used. Nevertheless, this work is a step towards the use of prosodic features, which have already proved useful for this task.

In order to overcome the small amount of speech data, we used an initial punctuation model, trained on written corpora using only lexical features, to provide the initial weights for the new training runs, which use speech transcripts and additional acoustic features. The initial punctuation model was previously trained on two years of written newspaper corpora, containing about 55 million words. Table 1 shows the results for the two punctuation experiments, revealing that the modified data achieves better performance in terms of SER. Even so, the impact is not as significant as expected.

                Original              Modified
            Prec.  Rec.   SER     Prec.  Rec.   SER
Full stop    68.6  61.6   66.6     68.3  63.9   65.8
Comma        59.5  29.0   90.8     60.0  28.7   90.4
All          64.7  42.6   69.8     64.8  43.4   69.1

Table 1: Punctuation results.

7. Conclusions and future work

Despite the ongoing nature of our work, the results show positive trends. The post-processing rules achieved better results in terms of inter-word pause detection, durations of previously detected silent pauses, and also durations of phones at initial and final sentence-like unit boundaries. Our experiments showed that these improvements had an impact both on the acoustic models and on punctuation.

This work allows us to tackle more challenging problems, combining prosodic and lexical features for the identification of sentence-like units. It is also a first step towards our goal of adding the identification of interrogatives to our punctuation module.

8. Acknowledgments

The PhD thesis of Helena Moniz is supported by FCT grant SFRH/BD/44671/2008. This work was funded by FCT projects PTDC/PLP/72404/2006 and CMU-PT/HuMach/0039/2008. INESC-ID Lisboa had support from the POSI Program of the "Quadro Comunitário de Apoio III".

9. References

[1] E. Shriberg, A. Stolcke, D. Hakkani-Tür, and G. Tur, "Prosody-based automatic segmentation of speech into sentences and topics," Speech Communication, no. 32, pp. 127–154, 2000.
[2] M. Ostendorf, B. Favre, R. Grishman, D. Hakkani-Tür, M. Harper, J. Hillard, J. Hirschberg, J. Heng, J. G. Kahn, Y. Liu, S. Maskey, E. Matusov, H. Ney, A. Rosenberg, E. Shriberg, W. Wang, and C. Wooters, "Speech segmentation and spoken document processing," IEEE Signal Processing Magazine, no. 25, pp. 59–69, 2008.
[3] A. Stolcke, E. Shriberg, T. Bates, M. Ostendorf, D. Hakkani, M. Plauché, G. Tür, and Y. Lu, "Automatic detection of sentence boundaries and disfluencies based on recognized words," in International Conference on Spoken Language Processing, 1998, pp. 2247–2250.
[4] R. Carlson, B. Granström, M. Heldner, D. House, B. Megyesi, E. Strangert, and M. Swerts, "Boundaries and groupings - the structuring of speech in different communicative situations: a description of the GROG project," in Fonetik 2002, 2002, pp. 65–68.
[5] E. Campione and J. Véronis, "A large-scale multilingual study of silent pause duration," in Speech Prosody, 2002.
[6] D. Wang and S. Narayanan, "A multi-pass linear fold algorithm for sentence boundary detection using prosodic cues," in ICASSP, 2004.
[7] J. Vaissière, "Language-independent prosodic features," in Prosody: Models and Measurements, A. Cutler and R. Ladd, Eds. Berlin: Springer, 1983, pp. 55–66.
[8] D. Beeferman, A. Berger, and J. Lafferty, "Cyberpunc: a lightweight punctuation annotation system for speech," in Proc. ICASSP-98, 1998, pp. 689–692.
[9] H. Christensen, Y. Gotoh, and S. Renals, "Punctuation annotation using statistical prosody models," in Proc. of the ISCA Workshop on Prosody in Speech Recognition and Understanding, 2001, pp. 35–40.
[10] J. Kim and P. C. Woodland, "The use of prosody in a combined system for punctuation generation and speech recognition," in Proc. Eurospeech, 2001, pp. 2757–2760.
[11] Y. Gotoh and S. Renals, "Sentence boundary detection in broadcast speech transcripts," in Proc. of the ISCA Workshop: ASR-2000, 2000, pp. 228–235.
[12] Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, and M. Harper, "Enriching speech recognition with automatic detection of sentence boundaries and disfluencies," IEEE Transactions on Audio, Speech, and Language Processing, no. 14, pp. 1526–1540, 2006.
[13] J. Huang and G. Zweig, "Maximum entropy model for punctuation annotation from speech," in Proc. of the ICSLP, 2002, pp. 917–920.
[14] B. Favre, D. Hakkani-Tür, and E. Shriberg, "Syntactically-informed models for comma prediction," in ICASSP, 2009.
[15] H. Meinedo, M. Viveiros, and J. Neto, "Evaluation of a live broadcast news subtitling system for Portuguese," in Interspeech, 2008.
[16] A. Abad and J. Neto, "Incorporating acoustical modelling of phone transitions in a hybrid ANN/HMM speech recognizer," in Interspeech, 2008.
[17] M. C. Viana, "Para a síntese da entoação do Português," Ph.D. dissertation, University of Lisbon, 1987.
[18] H. Daumé III, "Notes on CG and LM-BFGS optimization of logistic regression," 2004, http://hal3.name/megam/.
[19] J. Makhoul, F. Kubala, R. Schwartz, and R. Weischedel, "Performance measures for information extraction," in Proc. of the DARPA BN Workshop, 1999.
[20] R. Ribeiro, L. Oliveira, and I. Trancoso, "Using morphossyntactic information in TTS systems: comparing strategies for European Portuguese," in Proc. of PROPOR 2003, 2003, pp. 26–27.