Speech Segmentation and Spoken Document Processing
Total Page:16
File Type:pdf, Size:1020Kb
[Mari Ostendorf, Benoit Favre, Ralph Grishman, Dilek Hakkani-Tür, Mary Harper, Dustin Hillard, Julia Hirschberg, Heng Ji, Jeremy G. Kahn, Yang Liu, Sameer Maskey, Evgeny Matusov, Hermann Ney, Andrew Rosenberg, Elizabeth Shriberg, Wen Wang, and Chuck Wooters] Speech Segmentation and Spoken Document Processing [Applying language processing techniques developed for written input to the spoken word] rogress in both speech and language processing has spurred efforts to sup- port applications that rely on spoken—rather than written—language input. A key challenge in moving from text-based documents to such “spoken doc- uments” is that spoken language lacks explicit punctuation and formatting, which can be crucial for good performance. This article describes different Plevels of speech segmentation, approaches to automatically recovering segment bound- ary locations, and experimental results demonstrating impact on several language pro- cessing tasks. The results also show a need for optimizing segmentation for the end task rather than independently. INTRODUCTION Dramatic improvements in automatic speech recognition (ASR) technology make it now possible to explore how language processing techniques designed for text can be applied to spoken language. Ever-increasing collections of information are available as speech record- ings, including news broadcasts, meetings, debates, lectures, hearings, oral histories, and Digital Object Identifier 10.1109/MSP.2008.918023 © IMAGESTATE IEEE SIGNAL PROCESSING MAGAZINE [59] MAY 2008 1053-5888/08/$25.00©2008IEEE Authorized licensed use limited to: International Computer Science Inst (ICSI). Downloaded on May 07,2010 at 21:30:46 UTC from IEEE Xplore. Restrictions apply. Web casts, among other types of human-directed (versus com- area of sentence segmentation by combining lexical information puter-directed) communications. Given the vast amount of audio from a word recognizer, with spectral and prosodic cues. Lexical data and the time involved in human listening, clearly some sequence information provides cues related to syntactic and automatic means for data processing is necessary. ASR can auto- semantic constraints and is thus helpful in finding sentence and matically transcribe (albeit imperfectly) the speech in such spo- clause boundaries. For example, a sentence in English is not ken documents into a stream of words. But to derive content of likely to end with a determiner. Such cues, however, are subject interest, one would like to be able to apply language processing to degradation from word recognition errors. Lexical cues can techniques, such as question answering and summarization, also be fairly domain specific and may thus perform poorly when which have traditionally been developed for written input. training and test data come from different speaking contexts. In applying text-based language processing techniques to Spectral information provides cues to speaker and show (or speech, it is important to acknowledge that there are many dif- scene) changes, as well as to nonspeech events such as laughter. ferences between the two genres, as well as within each genre, Prosodic features such as fundamental frequency, duration, and depending on the specific communication context. For example, energy patterns provide information about multiple types of seg- conversational speech is generated on the fly and thus contains ment boundaries. For example, pitch tends to drop before the disfluencies and discourse elements used to manage turn-taking ends of sentences and to an even lower value at the end of a and other forms of interaction. In contrast, news broadcasts are topic or paragraph-like unit. Boundaries are often accompanied typically read from a script and more closely resemble written by pauses and by durational lengthening of phones directly pre- documents. Nevertheless, a shared challenge for the processing ceding the boundary. of most classes of spoken documents—as compared with text For many language processing tasks, it is essentially impossi- documents—is the lack of overt segmentation information. Text ble to process speech without some sort of segmentation. input typically contains punctuation and capitalization, which Historically, spoken language processing has assumed the avail- segments words into sentences and subsentential units. ability of good sentence and document segmentation. Most ini- Sentences are further organized into higher-level units such as tial work on problems such as parsing and summarizing speech speaker quotes, paragraphs, sections, chapters, articles, and so were based on oracle conditions using hand-marked sentence on, via formatting. In contrast, when spoken language is and story boundaries. However, automatically detecting these processed by an automatic speech recognizer, the output is sim- boundaries is challenging, and several studies have demonstrated ply a nonannotated stream of words, as shown in “Unformatted that segmentation and punctuation prediction accuracy signifi- Word Transcripts.” Human listeners can easily segment such cantly impact language processing performance in a variety of spoken input, arriving at the formatted version shown in tasks. Hence, over the past decade researchers have been explor- “Formatted Transcripts.” To do so they can draw on sophisticat- ing methods for improving the accuracy of models for various ed syntactic, semantic, acoustic, prosodic, pragmatic, and dis- levels of segmentation, showing that combining both acoustic course knowledge—not all of which are fully understood. and lexical cues provides significant benefit to detection accuracy Automatic segmentation, on the other hand, is much more beyond a naive pause-based segmentation. More importantly, as difficult. Nevertheless, significant progress has been made in the will be shown here for a variety of language processing tasks, UNFORMATTED WORD TRANSCRIPTS FORMATTED TRANSCRIPTS with more american firepower being considered for the Reporter: With more American firepower being considered persian gulf defense secretary cohen today issued by far for the Persian Gulf, defense secretary Cohen today issued by the administration’s toughest criticism of the u. n. securi- far the administration’s toughest criticism of the U.N. Security ty council without mentioning russia or china by name Council. Without mentioning Russia or China by name, Cohen cohen took dead aim at their reluctance to get tough took dead aim at their reluctance to get tough with Iraq. with iraq frankly i find it uh incredibly hard to accept the Cohen: Frankly I find it incredibly hard to accept the proposi- proposition that in the face of saddam’s uh actions that tion that in the face of Saddam’s actions that members of the uh members of the security council cannot bring them- Security Council cannot bring themselves to declare that this is selves to declare that this is a fundamental or material a fundamental or material breach of conduct on his part. I breach uh of uh conduct on his part i think it challenges think it challenges the credibility of the Security Council. the credibility of the security council in europe today Reporter: In Europe today, Secretary of State Albright trying secretary of state albright trying to gather support for to gather support for tougher measures was told by the tougher measures was told by the british and french that British and French that before they will join the U.S. in using before they will join the u. s. in using force they insist force they insist the Security Council pass yet another resolu- the security council pass yet another resolution british tion. British Prime Minister Blair said if Saddam Hussein then prime minister blair said if saddam hussein then does not does not comply: comply the only option to enforce the security council’s Blair: The only option to enforce the security Council’s will is will is military action military action. IEEE SIGNAL PROCESSING MAGAZINE [60] MAY 2008 Authorized licensed use limited to: International Computer Science Inst (ICSI). Downloaded on May 07,2010 at 21:30:46 UTC from IEEE Xplore. Restrictions apply. these segmentations also lead to much better task performance mark boundaries of different classes of sentences, or dialog than pause-based segmentations. In addition, when looking at acts—including statements, questions, and backchannels. For findings over a range of tasks, it is clear that optimizing segmen- example, dialog act boundaries rather than sentence bound- tation for the task is a useful strategy, i.e., the best tradeoff of aries have been widely used for multiparty meeting speech recall and precision varies depending on the task. [2], [3]. Dialog act boundaries are thus essentially equivalent In the remainder of this article, we describe different types to sentence boundaries in such work but provide additional of segmentation useful for spoken document processing, out- information on utterance function. line popular methods for feature extraction and computation- Sentence-level information is but one of many useful levels al modeling, survey recent results in several language of structure in language, as evidenced by the additional forms of processing applications that punctuation (for example, com- demonstrate the impact of mas) often available in text. For speech segmentation, and dis- DRAMATIC IMPROVEMENTS IN some language analysis tasks, cuss open challenges for leverag- AUTOMATIC SPEECH RECOGNITION such as parsing and entity extrac- ing and augmenting automatic TECHNOLOGY MAKE IT POSSIBLE