Arxiv:2004.03090V1 [Cs.CL] 7 Apr 2020

Interview: A Large-Scale Open-Source Corpus of Media Dialog Bodhisattwa Prasad Majumder,∗ Shuyang Li,∗ Jianmo Ni, Julian McAuley Computer Science and Engineering University of California, San Diego fbmajumde, shl008, jin018, [email protected] Abstract Dataset Spoken # Dialogs # Turns # Words Reddit 7 147 M – 1,800.0 M Existing conversational datasets consist either DailyDialog 7 13,118 102,979 1.4 M of written proxies for dialog or small-scale CALLHOME 3 120 22,214 0.3 M transcriptions of natural speech. We introduce Interview 2P 3 23,714 454,739 21.7 M Interview: a large-scale (105K conversations) Interview 3 105,848 3,199,856 126.7 M media dialog dataset collected from news interview transcripts. Compared to existing large- Table 1: Comparative dialog dataset statistics, includ- scale proxies for conversational data, language ing two-party (2P) and full Interview dataset models trained on our dataset exhibit better zero-shot out-of-domain performance on existing spoken dialog datasets, demonstrating other popular conversational datasets, demonstrat- its usefulness in modeling real-world conver- ing that Interview contains more complex dialog sations. Interview contains speaker role an- and better models the characteristics of natural spo- notations for each turn, facilitating the devel- ken conversations. Our dataset is an order of magni- opment of engaging, responsive dialog sys- tude larger than existing high-quality natural dialog tems. In fact, experiments on two dialog tasks datasets and contains speaker role annotations for show that leveraging such labels improves performance over strong speaker-agnostic base- each turn, facilitating the development of conver- lines, and enabling models to generate more sational agents and assistive systems for settings specific and inquisitive responses in interview- involving specific speaker roles, such as doctor- style conversations. patient interviews or hosted talk shows. In particular, we explore the tasks of role model- 1 Introduction ing in media dialog and role change detection on Interview and find that leveraging role informa- Large repositories of textual communications tion can enable more nuanced, on-topic and natural (e.g. forum and microblog posts) have gained re- dialog generation, as well as improve role change cent popularity as proxies for dialog (Galley et al., classification performance. 2019; Ritter et al., 2010; Lowe et al., 2015). How- In summary, we present Interview, the first ever, conversations in these settings differ from nat- large-scale open-domain media dialog dataset. We arXiv:2004.03090v1 [cs.CL] 7 Apr 2020 ural dialog: turns may be sparsely scattered over explore two tasks for which it serves as a promis- a large temporal span, contain distinct syntax and ing benchmark dataset: speaker role modeling and vocabulary (Maity et al., 2016), and differ greatly speaker change detection. We build simple yet in formality and focus (Li et al., 2017). In this pa- strong models to show quantitatively that role la- per, we investigate how appropriate such data is for bels from Interview improve performance on such modeling natural dialog, and introduce Interview, tasks. Interview’s scale, spoken origins, role di- a new high-quality large-scale open-domain con- versity, and complex utterances make it a better versational dataset grounded in interview settings source for grounded open-domain conversations. with annotations for specific speaker roles. We compare the performance of state-of-the- 2 Related Works art language models fine-tuned on Interview and Broadly speaking, dialog and conversation datasets ∗ denotes equal contribution can be classified as constrained (goal-oriented) or open-domain, written or spoken, and scripted Model Interview DailyDialog CALLHOME or spontaneous (Serban et al., 2018). In the realm GPT2 35.20 57.19 137.21 of written dialog, the closest proxy to natural dia- FT-Interview 17.77 32.85 51.40 FT-DailyDialog 50.05 11.63 82.67 log comes in the form of role-play-style (Bernsen FT-CALLHOME 32.10 33.30 28.19 et al., 1998) conversations, featuring two agents in- structed to participate in a constrained conversation. Table 2: Zero-shot BPE perplexity for GPT2-based This setup has seen recent usage to construct goal- models. Bold denotes best out-of-domain performance. oriented (Byrne et al., 2019; Budzianowski et al., 2018) and grounded conversations (Dinan et al., 2019; Gopalakrishnan et al., 2019). These datasets are learned from a set of grounding facts (Zhang are expensive to collect at scale and are heavily et al., 2018) or other non-conversational metadata constrained/guided by the instructions given to par- (Luan et al., 2017). Our approach eschews exter- ticipants. Several initiatives have recorded and nal grounding and learns speaker embeddings via manually transcribed natural conversations occur- dialog modeling, similar to Li et al.(2016). We, ring in the course of normal life, resulting in small, however, propose to learn speaker embeddings for roles high-quality natural dialog datasets (Canavan et al., different and capture role-dependent lexical 1997; Godfrey et al., 1992; Renals et al., 2007; Mor- profiles in conversation. gan et al., 2001). We explore an alternative venue 3 Interview Dataset1 for collecting a large-scale dataset of natural dialog: conversations and interviews on public radio. We collect a novel dataset of 105K multi-party The US Defense Advanced Research Projects interview transcripts for 7 programs on National 2 Agency (DARPA) has undertaken efforts to collect Public Radio (NPR) over 20 years (1999–2019), broadcast and informal conversation from public total of 10k hours. These transcripts contain a to- and private sources including messaging boards, tal of 3M turns comprising 7.5M sentences (127M SMS (DARPA, 2011), and broadcast newswire con- words) from 184K speakers, of which 287 are hosts. tent (Strassel, 2004; Cohen, 2007). However, it To investigate role-play in media dialog, we curate proves difficult to adopt these datasets as widely a subset, Interview 2P, with two roles: a host and available benchmarks on dialog modeling tasks, a guest, comprising 23K two-party conversations as they come with a substantial cost ($100-$1000 encompassing 455K turns, with 1.24M sentences per dataset/year, covering up to a hundred hours and 21.7M words. of transcribed conversation). In this vein, we con- In these two-party conversations, each speaker tribute an open-access large-scale corpus of cleanly takes an average of nine turns per dialog. Guests annotated broadcast media dialog. tend to speak longer on their turns, with 1.6x as many sentences spoken and 2x as many words Weizman(2008) explores the patterns and dis- per turn, and also use a more diverse vocabulary course within media dialog and contrast the associ- (1.6x size). Meanwhile, hosts ask five times as ated speaker role dynamics with spontaneous nat- many questions as guests, with 40% of their dialog ural conversation. The author manually annotates turns containing questions. When asking questions, and investigates 24 hours of Israeli news television hosts and guests use interrogative forms (See et al., programs. We see an opportunity for the inves- 2019) at the same rate (65%). We note that the host tigation of speaker dynamics and significance of and guest roles have differing discourse patterns, speaker roles at scale with our dataset. which support the notion of role modeling. Dialog modeling of open-domain chit-chat pre- dicts one turn of dialog from one or many context Comparison with Other Datasets To assess turn(s). Structured approaches for dialog modeling how well Interview represents open-domain di- build on hierarchical RNNs (Sordoni et al., 2015a; alog, we look to two datasets in widespread usage: Serban et al., 2016; Sankar and Ravi, 2019), with DailyDialog (Li et al., 2017), 13K short dialogs recent work employing a simple concatenation of written to simulate simple conversations from daily dialog history in a transformer-based architecture life; and CALLHOME (Canavan et al., 1997), tran- (Zhang et al., 2019). We draw inspiration from 1https://www.kaggle.com/shuyangli94/ recent works in dialog generation that model speak- interview-npr-media-dialog-transcripts ers via persistent ‘personas,’ whose representations 2https://www.npr.org/ scriptions from 120 half-hour casual telephone con- ternal world knowledge. Thus, we model host re- versations. We measure the language modeling sponses, which are generally aimed towards mod- performance of a pre-trained transformer model— erating the conversation via follow-up questions 117M-parameter GPT2 (Radford et al., 2019)— and acknowledgements. Role-specific generation both in its original form and versions fine-tuned like this can benefit the development of assistive (FT) on the training splits for Interview, DailyDia- technologies and role-dependent dialog systems. log, and CALLHOME. We evaluated the zero-shot We approach speaker role modeling conditional performance of these models on the test splits of language modeling task: generating the next re- these datasets, with perplexities shown in Table2. sponse Tt;h for host h with the highest likeli- While models fine-tuned on the training set hood given a trace of prior utterances T1:::t;g and performed best on each dataset as expected, we T1:::t−1;h. We use a transformer decoder to gener- observe that 1) models trained on other datasets ate tokens T1:::t from inputs T0:::t−1, but calculate obtain relatively poor zero-shot performance on loss only across the target sequence (gold host re- Interview; and 2) the model trained on Inter- sponse). We mimic the input schema for DialoGPT, view achieved the best out-of-domain performance concatenating all historical turns with separator to- on DailyDialog and CALLHOME by large margins. kens, appending the host target response. This suggests that language models trained on In- Conditioning on Speakers To condition on a terview can learn patterns characteristic of natural open-domain dialog in both simple daily conver- speaker role, we prepend each utterance in the dia- sation and informal long spoken exchanges.

Load more