An Analysis of Sentence Segmentation Features For
Total Page:16
File Type:pdf, Size:1020Kb
An Analysis of Sentence Segmentation Features for Broadcast News, Broadcast Conversations, and Meetings 12 1 Sebastien Cuendet1 Elizabeth Shriberg Benoit Favre [email protected] [email protected] [email protected] 1 James Fung1 Dilek Hakkani-Tur [email protected] [email protected] 2 1 ICSI SRI International 1947 Center Street 333 Ravenswood Ave Berkeley, CA 94704, USA Menlo Park, CA 94025, USA t sp eaking stylesbroadcast news BN broadcast ABSTRACT dieren conversations BC and facetoface multiparty meetings Information retrieval techniques for sp eech are based on MRDA We fo cus on the task of automatic sentence seg those develop ed for text and thus exp ect structured data as mentation or nding b oundaries of sentence units in the An essential task is to add sentence b oundary infor input otherwise unannotated devoid of punctuation capitaliza mation to the otherwise unannotated stream of words out tion or formatting stream of words output by a sp eech systems We analyze put by automatic sp eech recognition recognizer sentence segmentation p erformance as a function of feature Sentence segmentation is of particular imp ortance for sp eech ual versus automatic for news typ es and transcription man understanding applications b ecause techniques aimed at se sp eech meetings and a new corpus of broadcast conversa mantic pro cessing of sp eech inputsuch as machine trans tions Results show that overall features for broadcast lation question answering information extractionare typ news transfer well to meetings and broadcast conversations ically develop ed for textbased applications They thus as pitch and energy features p erform similarly across cor sume the presence of overt sentence b oundaries in their in p ora whereas other features duration pause turnbased put In addition many sp eech pro cessing tasks and lexical show dierences the eect of sp eech recogni show improved p erformance when sentence b oundaries are tion errors is remarkably stable over features typ es and cor provided F or instance sp eech summarization p erformance p ora with the exception of lexical features for meetings and improves when sentence b oundary information is provided broadcast conversations a new typ e of data for sp eech as observed in Similarly named entity extraction and technology b ehave more like news sp eech than like meetings partofsp eech tagging in sp eech is improved using sentence for this task Implications for mo deling of dierent sp eaking b oundary cues in and the use of sentence b oundaries for styles in sp eech segmentation are discussed machine translation is shown to b e b enecial for machine in Sentence b oundary annotation is also General Terms translation imp ortant for aiding human readability of the output of au Proso dic Mo deling Sentence Segmentation tomatic sp eech recognition systems and could b e used for determining semantically and proso dically coherent b ound for playback of sp eech to users in tasks involving audio Keywords aries search Sp oken Language Pro cessing Sentence Segmentation Broad While sentence segmentation of broadcast news and to cast Conversations Sp ontaneous Sp eech Proso dy Word Bound some extent of meetings has b een studied in previous work ary Classication Bo osting little is known ab out broadcast conversations Indeed data for this task has only recently b ecome available for work sp eech technology Studying the prop erties of broadcast 1. INTRODUCTION in conversations and comparing them with those of meetings We investigate the role of identicallydened lexical and and broadcast news is of interest b oth theoretically and also proso dic features when applied to the same task across three practically esp ecially b ecause there is currently less data available for broadcast conversations than for the other two typ es studied here For example if two sp eaking styles share haracteristics one can p erform adaptation from one to an Permission to make digital or hard copies of all or part of this work for c to improve the p erformance of the sentence segmen personal or classroom use is granted without fee provided that copies are other as proved previously for meetings by using conversa not made or distributed for profit or commercial advantage and that copies tation telephone sp eech bear this notice and the full citation on the first page. To copy otherwise, to tional goal of this study is to analyze how dierent sets of republish, to post on servers or to redistribute to lists, requires prior specific The including lexical features proso dic features and permission and/or a fee. features Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...5.00. MRDA TDT BC their combination p erform on the task of automatic sen Training set size tence segmentation for dierent sp eaking styles More sp ecif Test set size ically we ask the following questions Heldout set size Vo cabulary size Ho w do dierent feature typ es p erform for the dierent Mean sentence length sp eaking styles Table Data set statistics Values are given in What is the eect of sp eech recognition errors on p er num b er of words based on the output of the sp eech formance and how do es this eect dep end on the fea recognizer STT ture typ es or on the sp eaking style For this task are broadcast conversations more like broadcasts or more like conversations general tend to contain syntactically simpler sentences and signicant pronominalization News sp eech is typically read Results have implications not only for the task of sentence from a transcript and more closely resembles written text b oundary detection but more generally for proso dic mo del It contains for example app ositions center emb eddings and ing for natural language understanding across genres prop er noun comp ounds among other characteristics that The next section describ es the data set features and ap contribute to longer sentences Discourse phenomena also proach to sentence segmentation Section rep orts on ex obviously dier across corp ora with meetings containing p eriments with proso dic and lexical features and provides more turn exchanges incomplete sentences and higher rates further analysis and a discussion of usage of various feature of short backchannels such as y eah and uhhuh than typ es or groups and comparison across sp eaking styles A sp eech in news broadcasts and in the broadcast conversa summary and conclusions are provided in Section tions Sentence b oundary lo cations are based on reference tran for all three corp ora Sentences b oundaries are 2. METHOD scriptions annotated in BN transcripts directly For the meeting data are obtained by mapping dialog act b oundaries 2.1 Data and annotations b oundaries to sentence b oundaries The meetings data are lab eled ac To study the dierences b etween the meetings BN and cording to classes of dialog acts backchannels o orgrabb ers BC sp eech for the task of sentence segmentation we use the and o orholders questions statements and incompletes ICSI Meetings MRDA the TDT English Broadcast In order to b e able to compare the three corp ora all dia YQ Broadcast Conversations News and the GALE log act classes are mapp ed to the sentence b oundary class corp ora The BC Corpus frequently lacked sentence b oundary anno The ICSI Meeting Corpus is a collection of meetings tations but included line breaks and capitalization as well including simultaneous multichannel audio recordings word as dialog act tag annotations In order to use this data level orthographic transcriptions The meetings range in we implemented heuristic rules based on human analysis to length from to minutes but generally run just under pro duce punctuation annotations For BC data we similarly an hour each summing to hours We use a meeting mapp ed the typ es of dialog acts dened statements ques subset of this corpus that was also used in the previous re tions backchannels and incompletes to sentence b ound search with the same split into training heldout and aries Note that we have chosen to map incomplete sentence test sets TDT Corpus was collected by LDC and includes b oundaries to the b oundary class even though they are not multilingual raw material news wires and other electronic full b oundaries This is b ecause the rate of incompletes text web audio broadcast radio and television We use while not negligible was to o low to allow for adequate train a subset of TDT English broadcast radio and television ing of a third class in the BC data given the size of the cur data in this study The GALE YQ Broadcast Conversa rently available data We thus chose to group it with the tions Corpus also collected by LDC is a set of instudio b oundary class even though incompletes also share some talk shows with two or more participants including formal characteristics with nonb oundaries Namely material to oneonone interviews and debates with more participants the left of an incomplete resembles nonb oundaries whereas Three shows last for half an hour and the rest of the shows material to the right resembles b oundaries run an hour each for a total of ab out hours In the exp eriments to follow classication mo dels are on a set of data tuned on a heldout set and tested trained 2.2 Automatic speech recognition on an unseen test set within each genre The corp ora are Automatic sp eech recognition results for the ICSI Meet available with the words transcrib ed by humans reference ings data the TDT data and the BC data were obtained and with the words output by the sp eech recognizer STT using the stateoftheart SRI