A Repository of Conversational Datasets

github.com/PolyAI-LDN/conversational-datasets

Matthew Henderson, Paweł Budzianowski, Iñigo Casanueva, Sam Coope, Daniela Gerz, Girish Kumar, Nikola Mrkšić, Georgios Spithourakis, Pei-Hao Su, Ivan Vulić, and Tsung-Hsien Wen

PolyAI Limited, London, UK
[email protected]

Abstract

Progress in machine learning is often driven by the availability of large datasets, and consistent evaluation metrics for comparing modeling approaches. To this end, we present a repository of conversational datasets consisting of hundreds of millions of examples, and a standardised evaluation procedure for conversational response selection models using 1-of-100 accuracy. The repository contains scripts that allow researchers to reproduce the standard datasets, or to adapt the pre-processing and data filtering steps to their needs. We introduce and evaluate several competitive baselines for conversational response selection, whose implementations are shared in the repository, as well as a neural encoder model that is trained on the entire training set.

1 Introduction

Dialogue systems, sometimes referred to as conversational systems or conversational agents, are useful in a wide array of applications. They are used to assist users in accomplishing well-defined tasks such as finding and/or booking flights and restaurants (Hemphill et al., 1990; Williams, 2012; El Asri et al., 2017), or to provide tourist information (Henderson et al., 2014c; Budzianowski et al., 2018). They have found applications in entertainment (Fraser et al., 2018), language learning (Raux et al., 2003; Chen et al., 2017), and healthcare (Laranjo et al., 2018; Fadhil and Schiavo, 2019). Conversational systems can also be used to aid in customer service¹ or to provide the foundation for intelligent virtual assistants such as Amazon Alexa, Google Assistant, or Apple Siri.

¹ For an overview, see poly-ai.com/blog/towards-ai-assisted-customer-support-automation

Modern approaches to constructing dialogue systems are almost exclusively data-driven, supported by modular or end-to-end machine learning frameworks (Young, 2010; Vinyals and Le, 2015; Wen et al., 2015, 2017a,b; Mrkšić and Vulić, 2018; Ramadan et al., 2018; Li et al., 2018, inter alia). The research community, as in any machine learning field, benefits from large datasets and standardised evaluation metrics for tracking and comparing different models. However, collecting data to train data-driven dialogue systems has proven notoriously difficult. First, system designers must construct an ontology to define the constrained set of actions and conversations that the system can support (Henderson et al., 2014a,c; Mrkšić et al., 2015). Furthermore, task-oriented dialogue data must be labeled with highly domain-specific dialogue annotations (El Asri et al., 2017; Budzianowski et al., 2018). Because of this, such annotated dialogue datasets remain scarce, and limited in both their size and in the number of domains they cover. For instance, the recently published MultiWOZ dataset (Budzianowski et al., 2018) contains a total of 115,424 dialogue turns scattered over 7 target domains. Other standard task-based datasets are typically single-domain and smaller by several orders of magnitude: DSTC2 (Henderson et al., 2014b) contains 23,354 turns, Frames (El Asri et al., 2017) comprises 19,986 turns, and M2M (Shah et al., 2018) spans 14,796 turns.

An alternative solution is to leverage larger conversational datasets available online. Such datasets provide natural conversational structure, that is, the inherent context-to-response relationship which is vital for dialogue modeling. In this work, we present a public repository of three large and diverse conversational datasets containing hundreds of millions of conversation examples.
Compared to the most popular conversational datasets used in prior work, such as length-restricted Twitter conversations (Ritter et al., 2010) or very technical, domain-restricted chats from the Ubuntu corpus (Lowe et al., 2015, 2017; Gunasekara et al., 2019), conversations from the three datasets available in the repository are more natural and diverse. What is more, the datasets are large: for instance, after preprocessing around 3.7B comments from Reddit available in 256M conversational threads, we obtain 727M valid context-response pairs. Similarly, the number of valid pairs in the OpenSubtitles dataset is 316 million. To put these figures into perspective, the frequently used Ubuntu corpus v2.0 comprises around 4M dialogue turns. Furthermore, our Reddit corpus includes 2 more years of data and so is substantially larger than the previous Reddit dataset of Al-Rfou et al. (2016), which spans around 2.1B comments and 133M conversational threads, and is not publicly available.

Besides the repository of large datasets, another key contribution of this work is the common evaluation framework. We propose applying consistent data filtering and preprocessing to public datasets, and a simple evaluation metric for response selection, which will facilitate direct comparisons between models from different research groups.

These large conversational datasets may support modeling across a large spectrum of natural conversational domains. Similar to the recent work on language model pretraining for diverse NLP applications (Howard and Ruder, 2018; Devlin et al., 2018; Lample and Conneau, 2019), we believe that these datasets can be used in future work to pre-train large general-domain conversational models that are then fine-tuned towards specific tasks using much smaller amounts of task-specific conversational data. We hope that the presented repository, containing a set of strong baseline models and standardised modes of evaluation, will provide means and guidance to the development of next-generation conversational systems.

The repository is available at github.com/PolyAI-LDN/conversational-datasets.

2 Conversational Dataset Format

Datasets are stored as Tensorflow record files containing serialized Tensorflow example protocol buffers (Abadi et al., 2015). The training set is stored as one collection of Tensorflow record files, and the test set as another. Examples are shuffled randomly (and not necessarily reproducibly) within the Tensorflow record files. Each example is deterministically assigned to either the train or test set using a key feature, such as the conversation thread ID in Reddit, guaranteeing that the same split is created whenever the dataset is generated. By default the train set consists of 90% of the total data, and the test set the remaining 10%.

context/1   Hello, how are you?
context/0   I am fine. And you?
context     Great. What do you think of the weather?
response    It doesn't feel like February.

Figure 1: An illustrative Tensorflow example in a conversational dataset, consisting of a conversational context and an appropriate response. Each string is stored as a bytes feature using its UTF-8 encoding.

Each Tensorflow example contains a conversational context and a response that goes with that context; see e.g. figure 1. Explicitly, each example contains a number of string features:

• A context feature, the most recent text in the conversational context.

• A response feature, text that is in direct response to the context.

• A number of extra context features, context/0, context/1 etc., going back in time through the conversation. They are named in reverse order so that context/i always refers to the ith most recent extra context, so that no padding needs to be done, and datasets with different numbers of extra contexts can be mixed.

• Depending on the dataset, there may be some extra features also included in each example. For instance, in Reddit the authors of the context and response are identified using additional features.
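To make the format concrete, a dataset in this format can be read with Tensorflow's tf.data API. The following is a minimal sketch, not code from the repository: the file pattern and the number of extra context features parsed are our own assumptions.

```python
import tensorflow as tf

# Parse "context", "response", and up to NUM_EXTRA_CONTEXTS extra context
# features from each serialized Tensorflow example. The number of extra
# contexts actually present varies per example, so missing ones default
# to the empty string.
NUM_EXTRA_CONTEXTS = 10  # assumption, not fixed by the format

feature_spec = {
    "context": tf.io.FixedLenFeature([], tf.string),
    "response": tf.io.FixedLenFeature([], tf.string),
}
for i in range(NUM_EXTRA_CONTEXTS):
    feature_spec["context/{}".format(i)] = tf.io.FixedLenFeature(
        [], tf.string, default_value="")

dataset = tf.data.TFRecordDataset(
    tf.io.gfile.glob("train-*.tfrecord"))  # hypothetical file pattern
dataset = dataset.map(
    lambda serialized: tf.io.parse_single_example(serialized, feature_spec))

for example in dataset.take(1):
    print(example["context"].numpy().decode("utf-8"))
    print(example["response"].numpy().decode("utf-8"))
```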
3 Datasets

Rather than providing the raw processed data, we provide scripts and instructions that allow users to generate the data themselves. This allows for viewing and potentially manipulating the pre-processing and filtering steps. The repository contains instructions for generating datasets with standard parameters, split deterministically into train and test portions. These allow for defining reproducible evaluations in research papers. Section 5 presents benchmark results on these standard datasets for a variety of conversational response selection models.

Dataset creation scripts are written using Apache Beam and Google Cloud Dataflow (Akidau et al., 2015), which parallelizes the work across many machines. Using the default quotas, the Reddit script starts 409 workers to generate the dataset in around 1 hour and 40 minutes. This includes reading the comment data from the BigQuery source, grouping the comments into threads, producing examples from the threads, splitting the examples into train and test, shuffling the examples, and finally writing them to sharded Tensorflow record files.

Table 1 provides an overview of the Reddit, OpenSubtitles and AmazonQA datasets, and figure 3 in appendix A gives an illustrative example from each.

Dataset         Built from                                           Training size   Testing size
Reddit          3.7 billion comments in threaded conversations         654,396,778     72,616,937
OpenSubtitles   over 400 million lines from movie and television       283,651,561     33,240,156
                subtitles (also available in other languages)
AmazonQA        over 3.6 million question-response pairs in the          3,316,905        373,007
                context of Amazon products

Table 1: Summary of the datasets included in the public repository. The Reddit data is taken from January 2015 to December 2018, and the OpenSubtitles data from 2018.

3.1 Reddit

Reddit is an American social news aggregation website, where users can post links and take part in discussions on these posts. Reddit is extremely diverse (Schrading et al., 2015; Al-Rfou et al., 2016): there are more than 300,000 sub-forums (i.e., subreddits) covering various topics of discussion. These threaded discussions, available in a public BigQuery database, provide a large corpus of conversational contexts paired with appropriate responses. Reddit data has been used to create conversational response selection data by Al-Rfou et al. (2016); Cer et al. (2018); Yang et al. (2018). We share code that allows generating datasets from the Reddit data in a reproducible manner, with consistent filtering, processing, and train/test splitting. We also generate data using two more years of data than the previous work, 3.7 billion comments rather than 2.1 billion, giving a final dataset with 176 million more examples.

Reddit conversations are threaded. Each post may have multiple top-level comments, and every comment may have multiple children comments written in response. In processing, each Reddit thread is used to generate a set of examples. Each response comment generates an example, where the context is the linear path of comments that the comment is in response to.

Examples may be filtered according to the contents of the context and response features. The example is filtered if either feature has more than 128 characters, or fewer than 9 characters, or if its text is set to [deleted] or [removed]. Full details of the filtering are available in the code, and configurable through command-line flags.

Further back contexts, from the comment's parent's parent etc., are stored as extra context features. Their texts are trimmed to be at most 128 characters in length, without splitting words apart. This helps to bound the size of an individual example.

The train/test split is deterministic based on the thread ID. As long as all the input to the script is held constant (the input tables, filtering thresholds etc.), the resulting datasets should be identical.

The data from 2015 to 2018 inclusive consists of 3,680,746,776 comments, in 256,095,216 threads. In total, 727,013,715 Tensorflow examples are created from this data.
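The deterministic, key-based split described above can be implemented by hashing the key feature into a fixed number of buckets. The following is an illustrative sketch, assuming a 90/10 split over ten buckets; it is not the repository's exact implementation:

```python
import hashlib

TOTAL_BUCKETS = 10
TRAIN_BUCKETS = 9  # 90% train, 10% test

def is_train(thread_id: str) -> bool:
    """Deterministically assign a Reddit thread to the train set.

    Hashing the thread ID (rather than individual examples) keeps all
    examples from a thread on the same side of the split, and the split
    is identical every time the dataset is generated.
    """
    digest = hashlib.sha256(thread_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % TOTAL_BUCKETS < TRAIN_BUCKETS

print(is_train("5h6yvl"))  # same result on every run and machine
```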
3.2 OpenSubtitles

OpenSubtitles is a growing online collection of subtitles for movies and television shows available in multiple languages. As a starting point, we use the corpus collected by Lison and Tiedemann (2016), originally intended for statistical machine translation. This corpus is regenerated every year, in 62 different languages.

Consecutive lines in the subtitle data are used to create conversational examples. There is no guarantee that different lines correspond to different speakers, or that consecutive lines belong to the same scene, or even the same show. The data nevertheless contains a lot of interesting examples for modelling the mapping from conversational contexts to responses.

Short and long lines are filtered out, as is some text such as character names and auditory description text. The English 2018 data consists of 441,450,449 lines, and generates 316,891,717 examples. The data is split into chunks of 100,000 lines, and each chunk is used either for the train set or the test set.
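The consecutive-line pairing described above can be sketched as follows. This is a simplified illustration using our own function name, and it omits the repository's filtering details:

```python
def examples_from_subtitles(lines, max_extra_contexts=10):
    """Create context/response examples from an ordered list of subtitle lines.

    Each line is treated as the response to the line before it; earlier
    lines become extra contexts, with context/0 the most recent one.
    """
    examples = []
    for i in range(1, len(lines)):
        example = {"context": lines[i - 1], "response": lines[i]}
        extras = lines[max(0, i - 1 - max_extra_contexts):i - 1]
        for j, text in enumerate(reversed(extras)):
            example["context/{}".format(j)] = text
        examples.append(example)
    return examples

lines = ["So what are we waiting for?", "Nothing, it...", "It's just if..."]
for example in examples_from_subtitles(lines):
    print(example)
```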

3.3 AmazonQA

This dataset is based on a corpus extracted by Wan and McAuley (2016); McAuley and Yang (2016), who scraped questions and answers from Amazon product pages. This provides a corpus of question-answer pairs in the e-commerce domain. Some questions may have multiple answers, so one example is generated for each possible answer.

Examples with very short or long questions or responses are filtered from the data, resulting in a total of 3,689,912 examples. The train/test split is computed deterministically using the product ID.

4 Response Selection Task

The conversational datasets included in this repository facilitate the training and evaluation of a variety of models for natural language tasks. For instance, the datasets are suitable for training generative models of conversational response (Serban et al., 2016; Ritter et al., 2011; Vinyals and Le, 2015; Sordoni et al., 2015; Shang et al., 2015; Kannan et al., 2016), as well as discriminative methods of conversational response selection (Lowe et al., 2015; Inaba and Takahashi, 2016; Yu et al., 2016; Henderson et al., 2017).

The task of conversational response selection is to identify a correct response to a given conversational context from a pool of candidates, as illustrated in figure 2. Such models are typically evaluated using Recall@k, a typical metric in the information retrieval literature. This measures how often the correct response is identified as one of the top k ranked responses (Lowe et al., 2015; Inaba and Takahashi, 2016; Yu et al., 2016; Al-Rfou et al., 2016; Henderson et al., 2017; Lowe et al., 2017; Wu et al., 2017; Cer et al., 2018; Chaudhuri et al., 2018; Du and Black, 2018; Kumar et al., 2018; Liu et al., 2018; Yang et al., 2018; Zhou et al., 2018; Gunasekara et al., 2019; Tao et al., 2019). Models trained to select responses can be used to drive dialogue systems, question-answering systems, and response suggestion systems. The task of response selection provides a powerful signal for learning implicit semantic representations useful for many downstream tasks in natural language understanding (Cer et al., 2018; Yang et al., 2018).

Input: I watched a great movie yesterday.
Candidate responses:
    No, it won't rain probably.
    Have you applied for that job yet?
    It is nominated for the Golden Globe.
    We are leaving for a ski trip tomorrow.
    It is extremely easy working with them.
    I prefer traveling in the springtime.

Input: Is that place affordable?
Candidate responses:
    Absolutely, call me any time!
    There is no place like home.
    The restaurant serves Japanese food.
    I would say that the prices are reasonable.
    This was their second warning.
    It was so unfortunate to concede the goal.

Figure 2: Two examples illustrating the conversational response selection task: given the input context sentence, the goal is to identify the relevant response from a large pool of candidate responses.

The Recall@k metric allows for direct comparison between models. Direct comparisons are much more difficult for generative models, which are typically evaluated using perplexity scores or human judgement. Perplexity scores are dependent on normalization, tokenization, and the choice of vocabulary, while human judgement is expensive and time consuming.

When evaluating conversational response selection models on these datasets, we propose a Recall@k metric termed 1-of-100 accuracy. This is Recall@1 using 99 responses sampled from the test dataset as negatives. This 1-of-100 accuracy metric has been used in previous studies (Al-Rfou et al., 2016; Henderson et al., 2017; Cer et al., 2018; Kumar et al., 2018; Yang et al., 2018; Gunasekara et al., 2019). While there is no guarantee that the 99 randomly selected negatives will all be bad responses, the metric nevertheless provides a simple summary of model performance that has been shown to correlate with user-driven quality metrics (Henderson et al., 2017). For efficient computation of this metric, batches of 100 (context, response) pairs can be processed such that the other 99 elements in the batch serve as the negative examples.
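The batched computation amounts to a 100x100 similarity matrix whose diagonal holds the scores of the true (context, response) pairs. A minimal sketch, assuming the model produces vector encodings scored by dot-product (the variable names are ours, not the repository's):

```python
import numpy as np

def one_of_100_accuracy(context_vecs, response_vecs):
    """Compute 1-of-100 accuracy for a batch of 100 (context, response) pairs.

    Both arguments have shape (100, dim); row i of each encodes the i-th
    context or response. For each context, the other 99 responses in the
    batch serve as negatives, and the prediction is the highest-scoring
    response.
    """
    scores = context_vecs @ response_vecs.T          # (100, 100)
    predictions = scores.argmax(axis=1)
    return float(np.mean(predictions == np.arange(scores.shape[0])))

# Toy usage with random unit vectors; a random scorer gets ~0.01.
rng = np.random.default_rng(0)
contexts = rng.normal(size=(100, 64))
responses = rng.normal(size=(100, 64))
contexts /= np.linalg.norm(contexts, axis=1, keepdims=True)
responses /= np.linalg.norm(responses, axis=1, keepdims=True)
print(one_of_100_accuracy(contexts, responses))
```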

Sections 4.1 and 4.2 present baseline methods of conversational response selection that are implemented in the repository. These baselines are intended to run quickly using a subset of the training data, to give some idea of the performance and characteristics of each dataset. Section 4.3 describes a more competitive neural encoder model that is trained on the entire training set.

4.1 Keyword-based Methods

The keyword-based baselines use keyword similarity metrics to rank responses given a context. These are typical baselines for information retrieval tasks. The TF-IDF method computes inverse document frequency statistics on the training set, and scores responses using their tf-idf cosine similarity to the context (Manning et al., 2008). The BM25 method builds on top of the tf-idf similarity, applying an adjustment to the term weights (Robertson and Zaragoza, 2009).
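As an illustration of the TF-IDF baseline (a sketch using scikit-learn rather than the repository's implementation; the toy texts are our own), candidate responses are ranked by the tf-idf cosine similarity of each candidate to the context:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Inverse document frequency statistics are computed on (a toy stand-in
# for) the training set; candidates are then scored against the context.
train_texts = ["what a great movie", "it will rain tomorrow",
               "the award nominations are out"]
vectorizer = TfidfVectorizer().fit(train_texts)

context = "I watched a great movie yesterday."
candidates = ["A great movie indeed, it is nominated for the Golden Globe.",
              "No, it won't rain probably."]

scores = cosine_similarity(vectorizer.transform([context]),
                           vectorizer.transform(candidates))[0]
best = max(zip(scores, candidates))
print(best)  # the first candidate wins on the shared terms "great movie"
```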
4.2 Vector-based Methods

The vector-based methods use publicly available neural network embedding models to embed contexts and responses into a vector space. We include the following five embedding models in the evaluation, all of which are available on Tensorflow Hub:

USE: the Universal Sentence Encoder from Cer et al. (2018).
USE-LARGE: a larger version of the Universal Sentence Encoder.
ELMO: the Embeddings from Language Models approach from Peters et al. (2018).
BERT-SMALL: the deep bidirectional transformer model of Devlin et al. (2018).
BERT-LARGE: a larger deep bidirectional transformer model.

There are two vector-based baseline methods, SIM and MAP, applied to each of the above models. The SIM method ranks responses according to their cosine similarity with the context vector. This method relies on the pretrained models and does not use the training set at all.

The MAP method learns a linear mapping on top of the response vector. The final score of a response with vector y, given a context with vector x, is the cosine similarity ⟨·, ·⟩ of the context vector with the mapped response vector:

    ⟨x, (W + αI) · y⟩    (1)

where W and α are learned parameters and I is the identity matrix. This allows learning an arbitrary linear mapping on the response side, while the residual connection gated by α makes it easy for the model to interpolate with the SIM baseline. Vectors are L2-normalized before being fed to the MAP method, so that the method is invariant to scaling.

The W and α parameters are learned on a random sample of 10,000 examples from the training set, using the dot-product loss from Henderson et al. (2017). A sweep over learning rates and regularization parameters is performed using a held-out development set. The final learned parameters are used on the evaluation set.

The combination of the five embedding models with the two vector-based methods results in the following ten baseline methods: USE-SIM, USE-MAP, USE-LARGE-SIM, USE-LARGE-MAP, ELMO-SIM, ELMO-MAP, BERT-SMALL-SIM, BERT-SMALL-MAP, BERT-LARGE-SIM, and BERT-LARGE-MAP.

4.3 Encoder Model

We also train and evaluate a neural encoder model that maps the context and response through separate sub-networks to a shared vector space, where the final score is a dot-product between a vector representing the context and a vector representing the response, as per Henderson et al. (2017); Cer et al. (2018); Kumar et al. (2018); Yang et al. (2018). This model is referred to as POLYAI-ENCODER in the evaluation.

Full details of the neural structure are given in Henderson et al. (2019). To summarize, the context and response are both separately passed through sub-networks that:

1. split the text into unigram and bigram features;
2. convert unigrams and bigrams to numeric IDs, using a vocabulary of known features in conjunction with a hashing strategy for unseen features;
3. separately embed the unigrams and bigrams using large embedding matrices;
4. separately apply self-attention, then reduction over the sequence dimension, to the unigram and bigram embeddings;
5. combine the unigram and bigram representations, then pass them through several dense hidden layers;
6. L2-normalize the final hidden layer to obtain the final vector representation.

Both sub-networks are trained jointly using the dot-product loss of Henderson et al. (2017), with label smoothing and a learned scaling factor, as sketched below.
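The dot-product loss treats the other responses in a training batch as negatives, maximizing the score of each true pair against the rest of the batch. A minimal sketch of such a loss, under our own naming and assuming L2-normalized encodings, a learned scaling factor, and label smoothing as described above:

```python
import tensorflow as tf

def dot_product_loss(context_vecs, response_vecs, scale, label_smoothing=0.1):
    """Batch softmax loss over scaled dot-product scores.

    context_vecs and response_vecs are (batch, dim) L2-normalized encodings
    of matching (context, response) pairs; scale is a learned scalar. For
    row i, response i is the positive and the other rows are negatives.
    """
    scores = scale * tf.matmul(context_vecs, response_vecs, transpose_b=True)
    labels = tf.eye(tf.shape(scores)[0])
    loss_fn = tf.keras.losses.CategoricalCrossentropy(
        from_logits=True, label_smoothing=label_smoothing)
    return loss_fn(labels, scores)

# Toy usage with random unit vectors:
c = tf.math.l2_normalize(tf.random.normal([8, 16]), axis=1)
r = tf.math.l2_normalize(tf.random.normal([8, 16]), axis=1)
print(dot_product_loss(c, r, scale=tf.constant(10.0)).numpy())
```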

5 Evaluation

All the methods discussed in section 4 are evaluated on the three standard datasets from section 3, and the results are presented in table 2. In this evaluation, all methods use only the (immediate) context feature to score the responses, and do not use other features such as the extra contexts.

                   Reddit    OpenSubtitles    AmazonQA

TF-IDF               26.7             10.9        51.8
BM25                 27.6             10.9        52.3

USE-SIM              36.6             13.6        47.6
USE-MAP              40.8             15.8        54.4
USE-LARGE-SIM        41.4             14.9        51.3
USE-LARGE-MAP        47.7             18.0        61.9
ELMO-SIM             12.5              9.5        16.0
ELMO-MAP             19.3             12.3        33.0
BERT-SMALL-SIM       17.1             13.8        27.8
BERT-SMALL-MAP       24.5             17.5        45.8
BERT-LARGE-SIM       14.8             12.2        25.9
BERT-LARGE-MAP       24.0             16.8        44.1

POLYAI-ENCODER       61.3             30.6        84.2

Table 2: 1-of-100 accuracy results for keyword-based baselines, vector-based baselines, and the encoder model for each of the three standard datasets. The latest evaluation results are maintained in the repository. Results are computed on a random subset of 50,000 examples from the test set (500 batches of 100).

The keyword-based TF-IDF and BM25 methods are broadly competitive with the vector-based methods, and are particularly strong for AmazonQA, possibly because rare words such as the product name are informative in this domain. Learning a mapping with the MAP method gives a consistent boost in performance over the SIM method, showing the importance of learning the mapping from context to response versus simply relying on similarity. This approach would benefit from more data and a more powerful mapping network, but we have constrained the baselines so that they run quickly on a single computer. The Universal Sentence Encoder model outperforms ELMo in all cases.

The POLYAI-ENCODER model significantly outperforms all of the baseline methods. This is not surprising, as it is trained on the entire training set using multiple GPUs for several hours. We welcome other research groups to share their results, and we will be growing the table of results in the repository.

6 Conclusion

This paper has introduced a repository of conversational datasets, providing hundreds of millions of examples for training and evaluating conversational response selection systems under a standard evaluation framework. Future work will involve introducing more datasets in this format, more competitive baselines, and more benchmark results. We welcome contributions from other research groups in all of these directions.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. 2015. The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proceedings of the VLDB Endowment, 8.

Rami Al-Rfou, Marc Pickett, Javier Snaider, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2016. Conversational contextual cues: The case of personalization and history for response ranking. CoRR, abs/1606.00372.
Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - A large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of EMNLP.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder. CoRR, abs/1803.11175.

Debanjan Chaudhuri, Agustinus Kristiadi, Jens Lehmann, and Asja Fischer. 2018. Improving response selection in multi-turn dialogue systems by incorporating domain knowledge. In Proceedings of CoNLL.

Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. 2017. A survey on dialogue systems: Recent advances and new frontiers. CoRR, abs/1711.01731.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Wenchao Du and Alan Black. 2018. Data augmentation for neural online chats response selection. In Proceedings of SCAI.

Layla El Asri, Hannes Schulz, Shikhar Sharma, Jeremie Zumer, Justin Harris, Emery Fine, Rahul Mehrotra, and Kaheer Suleman. 2017. Frames: A corpus for adding memory to goal-oriented dialogue systems. In Proceedings of SIGDIAL.

Ahmed Fadhil and Gianluca Schiavo. 2019. Designing for health chatbots. CoRR, abs/1902.09022.

Jamie Fraser, Ioannis Papaioannou, and Oliver Lemon. 2018. Spoken conversational AI in video games: Emotional dialogue management increases user engagement. In Proceedings of IVA.

Chulaka Gunasekara, Jonathan K. Kummerfeld, Lazaros Polymenakos, and Walter S. Lasecki. 2019. DSTC7 task 1: Noetic end-to-end response selection.

Charles T. Hemphill, John J. Godfrey, and George R. Doddington. 1990. The ATIS Spoken Language Systems Pilot Corpus. In Proceedings of the Workshop on Speech and Natural Language, HLT '90.

Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-Hsuan Sung, László Lukács, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. 2017. Efficient natural language response suggestion for Smart Reply. CoRR, abs/1705.00652.

Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014a. The Second Dialog State Tracking Challenge. In Proceedings of SIGDIAL, pages 263–272.

Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014b. The Second Dialog State Tracking Challenge. In Proceedings of SIGDIAL.

Matthew Henderson, Blaise Thomson, and Steve Young. 2014c. Word-based dialog state tracking with recurrent neural networks. In Proceedings of SIGDIAL.

Matthew Henderson, Ivan Vulić, Daniela Gerz, Iñigo Casanueva, Paweł Budzianowski, Sam Coope, Georgios Spithourakis, Tsung-Hsien Wen, Nikola Mrkšić, and Pei-Hao Su. 2019. Training neural response selection for task-oriented dialogue systems. In Proceedings of ACL.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of ACL, pages 328–339.

Michimasa Inaba and Kenichi Takahashi. 2016. Neural utterance ranking model for conversational dialogue systems. In Proceedings of SIGDIAL.

Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kaufmann, Andrew Tomkins, Balint Miklos, Greg Corrado, László Lukács, Marina Ganea, Peter Young, and Vivek Ramavajjala. 2016. Smart Reply: Automated response suggestion for email. In Proceedings of KDD.

Girish Kumar, Matthew Henderson, Shannon Chan, Hoang Nguyen, and Lucas Ngoo. 2018. Question-answer selection in user to user marketplace conversations. In Proceedings of IWSDS.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. CoRR, abs/1901.07291.

Liliana Laranjo, Adam G. Dunn, Huong Ly Tong, Ahmet Baki Kocaballi, Jessica Chen, Rabia Bashir, Didi Surian, Blanca Gallego, Farah Magrabi, Annie Y. S. Lau, and Enrico Coiera. 2018. Conversational agents in healthcare: A systematic review. Journal of the American Medical Informatics Association, 25(9):1248–1258.

Xiujun Li, Sarah Panda, Jingjing Liu, and Jianfeng Gao. 2018. Microsoft dialogue challenge: Building end-to-end task-completion dialogue systems. CoRR, abs/1807.11125.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of LREC.

Bing Liu, Tong Yu, Ian Lane, and Ole J. Mengshoel. 2018. Customized nonlinear bandits for online response selection in neural conversation models. In Proceedings of AAAI.

Ryan Lowe, Nissan Pow, Iulian Serban, Laurent Charlin, Chia-Wei Liu, and Joelle Pineau. 2017. Training end-to-end dialogue systems with the Ubuntu dialogue corpus. Dialogue & Discourse, 8(1).

Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of SIGDIAL.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.

Julian McAuley and Alex Yang. 2016. Addressing complex and subjective product-related queries with customer reviews. In Proceedings of the 25th International Conference on World Wide Web, WWW '16. International World Wide Web Conferences Steering Committee.

Nikola Mrkšić and Ivan Vulić. 2018. Fully statistical neural belief tracking. In Proceedings of ACL.

Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2015. Multi-domain dialog state tracking using recurrent neural networks. In Proceedings of ACL, pages 794–799.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL.

Osman Ramadan, Paweł Budzianowski, and Milica Gašić. 2018. Large-scale multi-domain belief tracking with knowledge sharing. In Proceedings of ACL.

Antoine Raux, Brian Langner, Alan W. Black, and Maxine Eskénazi. 2003. LET'S GO: Improving spoken dialog systems for the elderly and non-natives. In Proceedings of EUROSPEECH.

Alan Ritter, Colin Cherry, and Bill Dolan. 2010. Unsupervised modeling of Twitter conversations. In Proceedings of NAACL-HLT.

Alan Ritter, Colin Cherry, and William B. Dolan. 2011. Data-driven response generation in social media. In Proceedings of EMNLP.

Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4).

Nicolas Schrading, Cecilia Ovesdotter Alm, Ray Ptucha, and Christopher Homan. 2015. An analysis of domestic abuse discourse on Reddit. In Proceedings of EMNLP.

Iulian Serban, Ryan Lowe, Laurent Charlin, and Joelle Pineau. 2016. Generative deep neural networks for dialogue: A short review. CoRR, abs/1611.06216.

Pararth Shah, Dilek Hakkani-Tür, Bing Liu, and Gokhan Tür. 2018. Bootstrapping a neural conversational agent with dialogue self-play, crowd-sourcing and on-line reinforcement learning. In Proceedings of NAACL-HLT.

Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. CoRR, abs/1503.02364.

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. CoRR, abs/1506.06714.

Chongyang Tao, Wei Wu, Can Xu, Wenpeng Hu, Dongyan Zhao, and Rui Yan. 2019. Multi-representation fusion network for multi-turn response selection in retrieval-based chatbots. In Proceedings of WSDM.

Oriol Vinyals and Quoc V. Le. 2015. A neural conversational model. CoRR, abs/1506.05869.

Mengting Wan and Julian McAuley. 2016. Modeling ambiguity, subjectivity, and diverging viewpoints in opinion question answering systems. In Proceedings of ICDM, pages 489–498.

Tsung-Hsien Wen, Milica Gašić, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceedings of EMNLP.

Tsung-Hsien Wen, Yishu Miao, Phil Blunsom, and Steve J. Young. 2017a. Latent intention dialogue models. In Proceedings of ICML.

Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017b. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of EACL.

Jason D. Williams. 2012. A belief tracking challenge task for spoken dialog systems. In Proceedings of the NAACL-HLT Workshop on Future Directions and Needs in the Spoken Dialog Community.
Yu Wu, Wei Wu, Chen Xing, Ming Zhou, and Zhoujun Li. 2017. Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots. In Proceedings of ACL.

Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-Yi Kong, Noah Constant, Petr Pilar, Heming Ge, Yun-hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Learning semantic textual similarity from conversations. In Proceedings of the Workshop on Representation Learning for NLP.

Steve Young. 2010. Still talking to machines (cognitively speaking). In Proceedings of INTERSPEECH.

Zhou Yu, Ziyu Xu, Alan W. Black, and Alexander Rudnicky. 2016. Strategy and policy learning for non-task-oriented conversational systems. In Proceedings of SIGDIAL.

Xiangyang Zhou, Lu Li, Daxiang Dong, Yi Liu, Ying Chen, Wayne Xin Zhao, Dianhai Yu, and Hua Wu. 2018. Multi-turn response selection for chatbots with deep attention matching network. In Proceedings of ACL.

A Appendix

Reddit

context/2   Could someone there post a summary of the insightful moments.
context/1   Basically L2L is the new deep learning.
context/0   What's "L2L" mean?
context     "Learning to learn", using deep learning to design the architecture of another deep network: https://arxiv.org/abs/1606.04474
response    using deep learning with SGD to design the learning algorithms of another deep network *

context_author    goodside
response_author   NetOrBrain
subreddit         MachineLearning
thread_id*        5h6yvl

OpenSubtitles

context/9   So what are we waiting for?
context/8   Nothing, it...
context/7   It's just if...
context/6   If we've underestimated the size of the artifact's data stream...
context/5   We'll fry the ship's CPU and we'll all spend the rest of our lives stranded in the Temporal Zone.
context/4   The ship's CPU has a name.
context/3   Sorry, Gideon.
context/2   Can we at least talk about this before you connect...
context/1   Gideon?
context/0   You still there?
context     Oh my God, we killed her.
response    Artificial intelligences cannot, by definition, be killed, Dr. Palmer.

file_id*    lines-emk

AmazonQA

context     i live in singapore so i would like to know what is the plug cos we use those 3 pin type
response    it's a 2 pin U.S. plug, but you can probably get an adapter, very good hair dryer!

product_id*  B003XNYHWS

Figure 3: Examples from the three datasets. Each example is a mapping from feature names to string features. Features marked with a star (*) are used to compute the deterministic train/test split.