SummScreen: A Dataset for Abstractive Screenplay Summarization

Mingda Chen1  Zewei Chu∗  Sam Wiseman1  Kevin Gimpel1
1Toyota Technological Institute at Chicago, IL, USA
{mchen,swiseman,kgimpel}@ttic.edu, [email protected]

∗ Work done while the author was at the University of Chicago.

Abstract

We introduce SUMMSCREEN, a summarization dataset comprised of pairs of TV series transcripts and human-written recaps. The dataset provides a challenging testbed for abstractive summarization for several reasons. Plot details are often expressed indirectly in character dialogues and may be scattered across the entirety of the transcript. These details must be found and integrated to form the succinct plot descriptions in the recaps. Also, TV scripts contain content that does not directly pertain to the central plot but rather serves to develop characters or provide comic relief. This information is rarely contained in recaps. Since characters are fundamental to TV series, we also propose two entity-centric evaluation metrics. Empirically, we characterize the dataset by evaluating several methods, including neural models and those based on nearest neighbors. An oracle extractive approach outperforms all benchmarked models according to automatic metrics, showing that the neural models are unable to fully exploit the input transcripts. Human evaluation and qualitative analysis reveal that our non-oracle models are competitive with their oracle counterparts in terms of generating faithful plot events and can benefit from better content selectors. Both oracle and non-oracle models generate unfaithful facts, suggesting future research directions.1

1 Code, data, and trained models are available at https://github.com/mingdachen/SummScreen

Transcript:
[ The apartment ]
Sheldon: What color would you like to be?
Leonard: Well, I'd like to be green, but you know you always take it.
Sheldon: That's not true. Any color's fine with me. Yeah, I could be a - a combination of blue and yellow.
Leonard: Blue and yellow make green.
Sheldon: Well, then it's settled.
Penny: Hi. Ready to go?
Sheldon: Oh, good news, we ordered lunch, so we can all stay here and play Lord of the Rings Risk.
Amy: Sheldon, we said that we would play games with you tonight.
Sheldon: Oh, no, we'll still be playing it tonight, this game can easily take eight hours.
Penny: Sweetie, you really thought I'd want to do this?
Leonard: No.
Penny: Well, did you tell him that?
Leonard: Yes.
Penny: Did you say it out loud with words?
Leonard: No.
Penny: I don't want to spend the whole day playing a board game.
…
Recap: Sheldon and Leonard are happy playing a board game until Amy and Penny say they are tired of doing what the guys want …

Figure 1: Excerpts from an example from SUMMSCREEN. The transcript and recap are from the TV show "The Big Bang Theory". Generating this sentence in the recap requires discerning the characters' feelings (clues in the transcript are underlined) about playing the board game (references are shown in red). Colored boxes indicate utterances belonging to the same conversations.

1 Introduction

Abstractive summarization aims to produce a summary that concisely expresses key points of the input document rather than merely extracting pieces of it. Existing datasets are constructed from various domains, such as news (Sandhaus, 2008; Hermann et al., 2015; Rush et al., 2015; Narayan et al., 2018; Grusky et al., 2018), online forums (Völske et al., 2017), patents (Sharma et al., 2019), meeting dialogues (Janin et al., 2003; Carletta et al., 2005), and webpages (Chen et al., 2020). However, few datasets exist for abstractive summarization of narrative text, which focuses on entities and dialogue among entities, with plot details often communicated indirectly via dialogue.
In this work, we build SUMMSCREEN, an abstractive summarization dataset combining TV series transcripts and episode recaps. Figure 1 shows an example from SUMMSCREEN.

Several aspects of SUMMSCREEN make it a challenging testbed for abstractive summarization. First, the relationship between character dialogue and plot details is not straightforward. Plot events are often expressed indirectly in dialogue, and dialogue contains other information that is not directly relevant to the plot, such as character development and humor. Also, a typical episode has multiple subplots that proceed in parallel, with consecutive scenes often describing different subplots. Solving SUMMSCREEN requires drawing information from utterances across a wide range of the input and integrating the information to form concise plot descriptions. Moreover, since actual TV episodes ground their scripts with audio-visual accompaniment, many details may be omitted from the transcript itself. This omission of details and the other challenging aspects mentioned above have inspired research into other NLP tasks on TV show transcripts, such as entity tracking (Chen and Choi, 2016; Choi and Chen, 2018) and coreference resolution (Chen et al., 2017; Zhou and Choi, 2018).

Another prominent characteristic of TV series transcripts is their focus on characters. To reflect this aspect, we propose two entity-centric metrics to evaluate the quality of generated plot summaries. One is based on bags of characters, which measures the overlap of the characters that appear in both the generated and reference recaps. The other metric measures character relations: the overlap of cooccurrences of character pairs in generations and recaps.
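As a minimal sketch of how these two scores could be computed, consider the Python below. The F1-style formulation, the assumption that character names are pre-extracted per sentence, and the sentence-level cooccurrence window are our illustrative assumptions, not necessarily the paper's exact implementation; the example data are hypothetical.

    from itertools import combinations

    def overlap_f1(predicted, reference):
        """F1 between two sets of items (character names or character pairs)."""
        predicted, reference = set(predicted), set(reference)
        if not predicted or not reference:
            return 0.0
        overlap = len(predicted & reference)
        if overlap == 0:
            return 0.0
        precision = overlap / len(predicted)
        recall = overlap / len(reference)
        return 2 * precision * recall / (precision + recall)

    def bag_of_characters(recap_sentences):
        """All character names mentioned anywhere in a recap.
        Each sentence is represented by the names it mentions."""
        return {name for sent in recap_sentences for name in sent}

    def character_pairs(recap_sentences):
        """Unordered pairs of characters cooccurring in the same sentence
        (the cooccurrence window is an assumption for illustration)."""
        pairs = set()
        for sent in recap_sentences:
            pairs |= {tuple(sorted(p)) for p in combinations(set(sent), 2)}
        return pairs

    # Hypothetical recaps, pre-processed into per-sentence character mentions.
    generated = [["Sheldon", "Leonard"], ["Penny", "Amy", "Sheldon"]]
    reference = [["Sheldon", "Leonard", "Amy", "Penny"]]

    boc_score = overlap_f1(bag_of_characters(generated), bag_of_characters(reference))  # 1.0
    rel_score = overlap_f1(character_pairs(generated), character_pairs(reference))      # 0.8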
We empirically evaluate several types of methods on SUMMSCREEN. We consider nearest neighbor models, which look up similar transcripts or recaps; neural abstractive summarization models; and hybrid models, which use the nearest neighbor models as content selectors followed by abstractive summarization. Oracle extractive approaches outperform all models on all the automatic metrics. These results suggest that the benchmarked methods are unable to fully exploit the input transcripts and that improving content selection may be a promising research direction.

Human evaluations show that our non-oracle hybrid models are competitive with their oracle counterparts in terms of generating faithful plot events, suggesting that hybrid models may be promising approaches for future research. Qualitative analysis shows that neural models tend to generate generic summaries, that hybrid models can benefit from better content selection, and that hybrid models sometimes generate unfaithful details.

2 Related Work

There has been prior work on extractive screenplay summarization (Gorinski and Lapata, 2015; Papalampidi et al., 2020) and on analyzing crime drama (Frermann et al., 2018). The majority of TV show transcripts are dialogues, relating our work to prior work on dialogue and meeting summarization. Relevant datasets have been studied for medical dialogues (Joshi et al., 2020; Krishna et al., 2020), chitchat (SAMSum; Gliwa et al., 2019), meetings (AMI; Carletta et al., 2005; ICSI; Janin et al., 2003; QMSum; Zhong et al., 2021), and news interviews (MediaSum; Zhu et al., 2021).

There have been attempts at summarizing long-form text (other than screenplays), such as books (Mihalcea and Ceylan, 2007), scientific articles (PubMed and arXiv; Cohan et al., 2018), multiple news articles (Multi-News; Fabbri et al., 2019), opinionated text (RottenTomatoes; Wang and Ling, 2016), government reports (GovReport; Huang et al., 2021), and (extractive summarization of) chapters of novels (Ladhak et al., 2020). A more detailed discussion of the differences between these datasets and SUMMSCREEN is in the next section.

Recently there have been efforts to adapt resources for TV shows to different tasks, including question answering (Ma et al., 2018; Yang and Choi, 2019), speaker identification (Ma et al., 2017), sarcasm detection (Joshi et al., 2016), emotion detection (Zahiri and Choi, 2017; Hsu and Ku, 2018), and character relation extraction (Yu et al., 2020).

3 SUMMSCREEN

SUMMSCREEN contains pairs of transcripts from TV series and their corresponding recaps. The transcripts consist of dialogue utterances with speaker names and descriptions of scenes or character actions. The recaps are human-written summaries of the corresponding transcripts. Figure 1 shows an example in SUMMSCREEN from the TV show "The Big Bang Theory". The transcript documents a dialogue involving four characters (Sheldon, Leonard, Penny, and Amy) about playing a board game, and the recap summarizes the dialogue into sentences.

3.1 Dataset Construction

We use two sources to construct SUMMSCREEN: The TV MegaSite, Inc. (TMS)2 and ForeverDreaming (FD),3 both of which provide community-contributed transcripts. As FD does not provide recaps, we obtain recaps of FD shows from Wikipedia and TVMaze.4 To ensure the quality of SUMMSCREEN, we filter out instances based on two criteria. First, the overlap ratio of TV show characters appearing in the recap and its transcript should be higher than 85%. We use this criterion to ensure that the alignments between recaps and transcripts are correct.

2 http://tvmegasite.net/

                                           FD       TMS
    number of shows                        88       10
    number of episodes                     4348     22503
    min. # episodes per show               1        168
    max. # episodes per show               568      3784
    median # episodes per show             9.0      1973.5
    avg. # episodes per show               49.4     2250.0
    avg. # tokens in recaps                113.7    380.6
    avg. # tokens in transcripts           7605.4   6420.7
    avg. # lines in transcripts            447.6    360.8
    avg. # char. utterances in transcripts 330.7    327.0
    avg. # uniq. char. in recaps           5.8      14.3
    avg. # uniq. char. in transcripts      20.6     29.8

Table 1: Detailed dataset statistics for SUMMSCREEN.

ForeverDreaming genres (count): Drama 65, Romance 24, Comedy 23, Crime 18, Action 15, Science-Fiction 12, Adventure 9, Supernatural 9, Mystery 8, Thriller 5, Family 5, Medical 5, Fantasy 4, Horror 4, History 3, Sports 3, Western 3, Children 2, Legal 2, Espionage 1, Music 1.
TVMegaSite genres (count): Drama 10, Romance 6, Family 4, Medical 1.

Figure 2: Left: TV show genres from ForeverDreaming. Right: TV show genres from TVMegaSite.

    ForeverDreaming   Train   Dev   Test
    # shows           66      78    81
    # episodes        3673    338   337
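Returning to the first filtering criterion above, the following is a minimal sketch of the 85% character-overlap check. Only the threshold comes from the paper; the function name and the assumption that character names have already been extracted from each recap and transcript (e.g., from speaker names) are ours.

    def passes_character_overlap(recap_characters, transcript_characters,
                                 threshold=0.85):
        """Keep a (transcript, recap) pair only if most characters named in
        the recap also appear in the transcript, which suggests the two are
        correctly aligned."""
        recap = set(recap_characters)
        if not recap:
            return False  # no named characters: alignment cannot be verified
        overlap_ratio = len(recap & set(transcript_characters)) / len(recap)
        return overlap_ratio > threshold

    # Hypothetical usage: this pair is kept (overlap ratio 1.0 > 0.85).
    keep = passes_character_overlap(["Sheldon", "Leonard", "Penny", "Amy"],
                                    ["Sheldon", "Leonard", "Penny", "Amy"])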