Spotify at TREC 2020: Genre-Aware Abstractive Podcast Summarization
Rezvaneh Rezapour∗ (University of Illinois at Urbana-Champaign, Urbana, IL, USA)
Sravana Reddy (Spotify, Boston, MA, USA)
Ann Clifton (Spotify, New York, NY, USA)
Rosie Jones (Spotify, Boston, MA, USA)

Abstract

This paper describes our submissions to the summarization task of the Podcast Track in TREC (the Text REtrieval Conference) 2020. The goal of this challenge was to generate short, informative summaries that contain the key information present in a podcast episode, using automatically generated transcripts of the podcast audio. Since podcasts vary with respect to their genre, topic, and granularity of information, we propose two summarization models that explicitly take genre and named entities into consideration in order to generate summaries appropriate to the style of the podcasts. Our models are abstractive, and are supervised using creator-provided descriptions as ground truth summaries. As assessed by human evaluators, our best model achieves an aggregate quality score of 1.58, while the creator descriptions and a baseline abstractive system both score 1.49 (an improvement of 9%).

1 Introduction

Advancements in state-of-the-art sequence-to-sequence models and transformer architectures (Vaswani et al., 2017; Dai et al., 2019; Nallapati et al., 2016; Ott et al., 2019; Lewis et al., 2020), together with the availability of large-scale datasets, have led to rapid progress in generating near-human-quality coherent summaries with abstractive summarization. The majority of effort in this domain has been focused on summarizing news datasets such as CNN/Daily Mail (Hermann et al., 2015) and XSum (Narayan et al., 2018). Podcasts are much more diverse than news articles in the level of information as well as in their structure, which varies by theme, genre, level of formality, and target audience. This motivates the TREC 2020 podcast summarization task (Jones et al., 2020), where the objective is to design models that produce coherent text summaries of the main information in podcast episodes.

While no ground truth summaries were provided for the TREC 2020 task, the episode descriptions written by the podcast creators serve as proxies for summaries, and are used for training our supervised models. We make some observations about the creator descriptions towards understanding what goes into a 'good' podcast summary. Our first observation is that the descriptions vary stylistically by genre. For instance, the descriptions of sports podcasts tend to list a series of small events and interviews, usually anchored on the names of players, coaches, and teams. In contrast, descriptions of true crime podcasts tend to contain suspenseful hooks, and avoid giving away much of the plot (Table 1). Some podcasts include the names of hosts or guests in their descriptions, while others only talk about the main theme of the episode.

By definition, a good summary should be concise, preserve the most salient information, and be factually correct and faithful to the given document (Nenkova and McKeown, 2011; Maynez et al., 2020). In addition, we believe that since users' expectations of summaries are shaped by the creator descriptions and by the style and tone of the content, it is important to generate summaries that are appropriate to the style of the podcast.

In this work, we present two types of summarization models that take these observations into consideration.
The goal of the first model, which we shorthand as 'category-aware', is to generate summaries whose style is appropriate for the genre or category of the specific podcast. The second model aims to maximize the presence of named entities in the summaries.

Table 1: Examples of creator descriptions for different genres of podcast episodes, highlighting the prevalence of named entities in sports, suspense in true crime, and colloquial language in comedy.

Sports: "In EP 4 Jon Rothstein is joined by Arizona State Head Coach Bobby Hurley, Creighton Guard Mitchell Ballock, and Merrimack Head Coach Joey Gallo checks in on the Hustle Mania Hotline. This is March and Jon has a special message for all of you heading into the NCAA tournament."

True Crime: "Sometimes, it takes years to connect a killers crimes. On March 6th 1959 a man was born who would, eventually, be tried for the murder of a single woman. It would take years for police to connect him to 4 other murders. Years and a clever investigator who got the DNA he needed."

Comedy: "This weeks episode is a little bit of a throwback.. We have one of our close high school friends Zak on the show, and he is just a blast! We discuss relationships, our favourite movies and our experience's with high school! This episode is pretty chill, so you better have a drink for this one!"

2 Data Filtering

The Spotify Podcast Dataset consists of 105,360 episodes with transcripts and creator descriptions (Clifton et al., 2020), and is provided as a training dataset for the summarization task. We applied a set of heuristic filters to this data with respect to the creator descriptions:

• Removed episodes with unusually short (fewer than 10 characters) or long (more than 1300 characters) creator descriptions.

• Removed episodes whose descriptions were highly similar to the descriptions of other episodes in the same show or to the show description. Similarity was measured by taking the TF-IDF representation of a description and comparing it to others using cosine similarity.

• Removed ads and promotional sentences in the descriptions. We annotated a set of 1000 episodes for such material, and trained a classifier to detect these sentences; more details are described in Reddy et al. (2021).

The dataset after these filters consists of 90,055 episodes. We held out 1000 random episodes for test and validation sets for development, and used 88,055 episodes for training our models. Our final submissions were made on the task's test set of 1027 episodes, disjoint from the training data.

3 Abstractive Model Framework

Our summarization models are built using BART (Lewis et al., 2020), a denoising autoencoder for pretraining sequence-to-sequence models. To generate summaries, we started from a model pretrained on the CNN/Daily Mail news summarization dataset (https://huggingface.co/facebook/bart-large-cnn) and then fine-tuned it on our podcast transcript dataset with respect to the two proposed models (described below in §4).

We used the implementation of BART in the Huggingface Transformers library (Wolf et al., 2020) with the default hyperparameters. We set the batch size to 1, and made use of 4 GPUs for training. The model with the highest ROUGE-L score on the validation set was saved. The maximum sequence limit we set for the input podcast transcript was 1024 tokens (that is, any material beyond the first 1024 tokens was ignored). We constrained the model to generate summaries with at least 50 and at most 250 tokens.

This setup is the same as the bartpodcasts model described in the task overview (Jones et al., 2020), and is one of our baselines for comparison. We also compare our models to the bartcnn baseline, which is the out-of-the-box BART model trained on the CNN/Daily Mail corpus.

4 Models

4.1 Category-Aware Summarization

The idea of building a category-aware summarization model is motivated by the hypothesis that selecting what is important to a summary depends on the topical category of the podcast. This hypothesis is based upon observations of the creator descriptions in the dataset. For example, descriptions of 'Sports' podcast episodes tend to contain names of players and matches, 'True Crime' podcasts incorporate a hook and suspense, and 'Comedy' podcasts are stylistically informal (Table 1).

While we expect some of these patterns to be naturally learned by a supervised model trained on the large corpus, we nudged the model to generate stylistically appropriate summaries by explicitly conditioning the summaries on the podcast genre. Some previous work approaches this problem by concatenating a vector corresponding to an embedding of the conditioning parameters to the inputs in an RNN (Ficler and Goldberg, 2017). More recent work simply concatenates a token corresponding to the parameter to the text input in sequence-to-sequence transformer models (Raffel et al., 2020). We experimented with a few ways of encoding the category labels in the summarization model, and settled upon prepending the labels as special tokens to the transcript input during training and inference, as shown in Figure 1. For episodes with multiple category labels, we included them all as distinct special tokens, in a fixed (lexicographic) order.

Figure 1: Sketch of the Category-Aware Summarization Model. Inputs: the podcast transcript and the podcast categories (e.g., [News, Sports]); output: the podcast summary.

The podcast categories are not given in the metadata files distributed with the podcast dataset. However, they can be derived from the RSS headers associated with each podcast under the itunes:category field.

Table 2: Set of collapsed category labels and their distribution on the task test set. Some episodes are associated with multiple labels.

Category                  Count
Arts                      81
Business                  97
Comedy                    136
Education                 118
Fiction                   5
Games                     7
Government                3
Health                    94
History                   16
Kids & Family             28
Leisure                   45
Lifestyle & Health        42
Music                     33
News                      15
Religion & Spirituality   92
Science                   11
Society & Culture         96
Sports                    137
Stories                   37
TV & Film                 41
Technology                17
True Crime                28

∗ Work done while at Spotify.
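The conditioning scheme of §4.1 (prepending category labels as special tokens to the transcript, in lexicographic order for multi-label episodes) can be sketched as below. The paper does not specify the literal token strings, so the `<cat:...>` format and the function name are illustrative assumptions, not the tokens actually used.

```python
# Sketch of category-aware input construction (assumed token format).
MAX_INPUT_TOKENS = 1024  # transcript material beyond this limit is ignored

def build_model_input(transcript: str, categories: list[str]) -> str:
    """Prepend one special token per category label, sorted lexicographically."""
    prefix = " ".join(f"<cat:{c}>" for c in sorted(set(categories)))
    return f"{prefix} {transcript}".strip()

# Example: a multi-label episode tagged [Sports, News].
# The labels appear in lexicographic order: News before Sports.
inp = build_model_input("Welcome back to the show ...", ["Sports", "News"])

# With Huggingface Transformers, truncation to the 1024-token input limit
# would then be applied at tokenization time, e.g.:
#   tokenizer(inp, truncation=True, max_length=MAX_INPUT_TOKENS)
```

Prepending plain text tokens (rather than concatenating an embedding vector, as in the RNN approach of Ficler and Goldberg, 2017) keeps the conditioning compatible with an off-the-shelf pretrained encoder.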
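Recovering the category labels from each show's RSS header, as described above, can be done with Python's standard library. The sample feed below is a fabricated illustration following Apple's itunes:category convention; real feeds contain many more elements.

```python
# Sketch: extracting <itunes:category> labels from a podcast RSS feed.
import xml.etree.ElementTree as ET

# Namespace used by Apple's podcast RSS extensions.
ITUNES_NS = "http://www.itunes.com/dtds/podcast-1.0.dtd"

def extract_categories(rss_xml: str) -> list[str]:
    """Return the text of every itunes:category element, in document order.
    Categories may be nested (category/subcategory), so iter() is used."""
    root = ET.fromstring(rss_xml)
    cats = []
    for el in root.iter(f"{{{ITUNES_NS}}}category"):
        text = el.get("text")
        if text:
            cats.append(text)
    return cats

# Fabricated minimal feed for illustration.
rss = """<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
  <channel>
    <itunes:category text="Sports">
      <itunes:category text="Basketball"/>
    </itunes:category>
  </channel>
</rss>"""
# extract_categories(rss) returns ["Sports", "Basketball"]
```

The extracted labels would then be collapsed to the reduced category set of Table 2 before being used as conditioning tokens; that mapping step is not shown here.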