
Dynabench: Rethinking Benchmarking in NLP

Douwe Kiela†, Max Bartolo‡, Yixin Nie⋆, Divyansh Kaushik§, Atticus Geiger¶, Zhengxuan Wu¶, Bertie Vidgen‖, Grusha Prasad∗∗, Amanpreet Singh†, Pratik Ringshia†, Zhiyi Ma†, Tristan Thrush†, Sebastian Riedel†‡, Zeerak Waseem††, Pontus Stenetorp‡, Robin Jia†, Mohit Bansal⋆, Christopher Potts¶ and Adina Williams†

†Facebook AI Research; ‡UCL; ⋆UNC Chapel Hill; §CMU; ¶Stanford University; ‖Alan Turing Institute; ∗∗JHU; ††Simon Fraser University

[email protected]

Abstract

We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and address potential objections to dynamic benchmarking as a new standard for the field.

[Figure 1: Benchmark saturation over time for popular benchmarks (MNIST, ImageNet, Switchboard, GLUE, SQuAD 1.1, SQuAD 2.0), normalized with initial performance at minus one and human performance at zero.]
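For concreteness, the normalization described in the Figure 1 caption can be written as a small helper. This is a sketch under our reading of the caption; the actual scores plotted in the figure are not reproduced here.

```python
def normalize(score, initial, human):
    """Rescale a benchmark score so the first reported result maps to -1
    and estimated human performance maps to 0 (as in Figure 1)."""
    return (score - human) / (human - initial)

# Illustrative benchmark: first system scored 60, human estimate is 90.
print(normalize(60, initial=60, human=90))  # -1.0 (starting point)
print(normalize(92, initial=60, human=90))  # ~0.067 (slightly super-human)
```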
1 Introduction

While it used to take decades for machine learning models to surpass estimates of human performance on benchmark tasks, that milestone is now routinely reached within just a few years for newer datasets (see Figure 1). As with the rest of AI, NLP has advanced rapidly thanks to improvements in computational power, as well as algorithmic breakthroughs, ranging from attention mechanisms (Bahdanau et al., 2014; Luong et al., 2015), to Transformers (Vaswani et al., 2017), to pre-trained language models (Howard and Ruder, 2018; Devlin et al., 2019; Liu et al., 2019b; Radford et al., 2019; Brown et al., 2020). Equally important has been the rise of benchmarks that support the development of ambitious new data-driven models and that encourage apples-to-apples model comparisons. Benchmarks provide a north star goal for researchers, and are part of the reason we can confidently say we have made great strides in our field.

In light of these developments, one might be forgiven for thinking that NLP has created models with human-like language capabilities. Practitioners know that, despite our progress, we are actually far from this goal. Models that achieve super-human performance on benchmark tasks (according to the narrow criteria used to define human performance) nonetheless fail on simple challenge examples and falter in real-world scenarios. A substantial part of the problem is that our benchmark tasks are not adequate proxies for the sophisticated and wide-ranging capabilities we are targeting: they contain inadvertent and unwanted statistical and social biases that make them artificially easy and misaligned with our true goals.

We believe the time is ripe to radically rethink benchmarking. In this paper, which both takes a position and seeks to offer a partial solution, we introduce Dynabench, an open-source, web-based research platform for dynamic data collection and model benchmarking. The guiding hypothesis behind Dynabench is that we can make even faster progress if we evaluate models and collect data dynamically, with humans and models in the loop, rather than in the traditional static way.

Concretely, Dynabench hosts tasks for which we dynamically collect data against state-of-the-art models in the loop, over multiple rounds. The stronger the models are and the fewer weaknesses they have, the lower their error rate will be when interacting with humans, giving us a concrete metric: how well do AI systems perform when interacting with humans? This reveals the shortcomings of state-of-the-art models, and it yields valuable training and assessment data which the community can use to develop even stronger models.

In this paper, we first document the background that led us to propose this platform. We then describe the platform in technical detail, report on findings for four initial tasks, and address possible objections. We finish with a discussion of future plans and next steps.

2 Background

Progress in NLP has traditionally been measured through a selection of task-level datasets that gradually became accepted benchmarks (Marcus et al., 1993; Pradhan et al., 2012). Recent well-known examples include the Stanford Sentiment Treebank (Socher et al., 2013), SQuAD (Rajpurkar et al., 2016, 2018), SNLI (Bowman et al., 2015), and MultiNLI (Williams et al., 2018). More recently, multi-task benchmarks such as SentEval (Conneau and Kiela, 2018), DecaNLP (McCann et al., 2018), GLUE (Wang et al., 2018), and SuperGLUE (Wang et al., 2019) were proposed with the aim of measuring general progress across several tasks. When the GLUE dataset was introduced, "solving GLUE" was deemed "beyond the capability of current transfer learning methods" (Wang et al., 2018). However, GLUE saturated within a year, and its successor, SuperGLUE, already has models rather than humans at the top of its leaderboard. These are remarkable achievements, but there is an extensive body of evidence indicating that these models do not in fact have the human-level natural language capabilities one might be led to believe.

2.1 Challenge Sets and Adversarial Settings

Whether our models have learned to solve tasks in robust and generalizable ways has been a topic of much recent interest. Challenging test sets have shown that many state-of-the-art NLP models struggle with compositionality (Nie et al., 2019; Kim and Linzen, 2020; Yu and Ettinger, 2020; White et al., 2020), and find it difficult to pass the myriad stress tests for social (Rudinger et al., 2018; May et al., 2019; Nangia et al., 2020) and/or linguistic competencies (Geiger et al., 2018; Naik et al., 2018; Glockner et al., 2018; White et al., 2018; Warstadt et al., 2019; Gauthier et al., 2020; Hossain et al., 2020; Jeretic et al., 2020; Lewis et al., 2020; Saha et al., 2020; Schuster et al., 2020; Sugawara et al., 2020; Warstadt et al., 2020). Yet challenge sets may suffer from performance instability (Liu et al., 2019a; Rozen et al., 2019; Zhou et al., 2020) and often lack sufficient statistical power (Card et al., 2020), suggesting that, although they may be valuable assessment tools, they are not sufficient for ensuring that our models have achieved the learning targets we set for them.

Models are susceptible to adversarial attacks, and despite impressive task-level performance, state-of-the-art systems still struggle to learn robust representations of linguistic knowledge (Ettinger et al., 2017), as also shown by work analyzing model diagnostics (Ettinger, 2020; Ribeiro et al., 2020). For example, question answering models can be fooled by simply adding a relevant sentence to the passage (Jia and Liang, 2017). Text classification models have been shown to be sensitive to single input character changes (Ebrahimi et al., 2018b) and first-order logic inconsistencies (Minervini and Riedel, 2018). Similarly, machine translation systems have been found susceptible to character-level perturbations (Ebrahimi et al., 2018a) and synthetic and natural noise (Belinkov and Bisk, 2018; Khayrallah and Koehn, 2018). Natural language inference models can be fooled by simple syntactic heuristics or hypothesis-only biases (Gururangan et al., 2018; Poliak et al., 2018; Tsuchiya, 2018; Belinkov et al., 2019; McCoy et al., 2019). Dialogue models may ignore perturbations of dialogue history (Sankar et al., 2019). More generally, Wallace et al. (2019) find universal adversarial perturbations forcing targeted model errors across a range of tasks. Recent work has also focused on evaluating model diagnostics through counterfactual augmentation (Kaushik et al., 2020), decision boundary analysis (Gardner et al., 2020; Swayamdipta et al., 2020), and behavioural testing (Ribeiro et al., 2020).

2.2 Adversarial Training and Testing

Research progress has traditionally been driven by a cyclical process of resource collection and architectural improvements. Similar to Dynabench, recent work seeks to embrace this phenomenon, addressing many of the previously mentioned issues through an iterative human-and-model-in-the-loop annotation process (Yang et al., 2017; Dinan et al., 2019; Chen et al., 2019; Bartolo et al., 2020; Nie et al., 2020), to find "unknown unknowns" (Attenberg et al., 2015), or in a never-ending or life-long learning setting (Silver et al., 2013; Mitchell et al., 2018). The Adversarial NLI (ANLI) dataset (Nie et al., 2020), for example, was collected in an adversarial setting over multiple rounds to yield "a 'moving post' dynamic target for NLU systems, rather than a static benchmark that will eventually saturate". In its few-shot learning mode, GPT-3 barely shows "signs of life" on ANLI (Brown et al., 2020), i.e., it is barely above random, which is evidence that we are still far away from human performance on that task.

2.3 Other Related Work

While crowdsourcing has been a boon for large-scale NLP dataset creation (Snow et al., 2008; Munro et al., 2010), we ultimately want NLP systems to handle "natural" data (Kwiatkowski et al., 2019) and be "ecologically valid" (de Vries et al., 2020). Ethayarajh and Jurafsky (2020) analyze the distinction between what leaderboards incentivize and "what is useful in practice" through the lens of microeconomics. A natural setting for exploring these ideas might be dialogue (Hancock et al., 2019; Shuster et al., 2020). Other works have pointed out misalignments between maximum-likelihood training on i.i.d. train/test splits and human language (Linzen, 2020; Stiennon et al., 2020).

We think there is widespread agreement that something has to change about our standard evaluation paradigm and that we need to explore alternatives. The persistent misalignment between benchmark performance and performance on challenge and adversarial test sets reveals that standard evaluation paradigms overstate the ability of our models to perform the tasks we have set for them. Dynabench offers one path forward from here, by allowing researchers to combine model development with the stress-testing that needs to be done to achieve true robustness and generalization.

3 Dynabench

Dynabench is a platform that encompasses different tasks. Data for each task is collected over multiple rounds, each starting from the current state of the art. In every round, we have one or more target models "in the loop." These models interact with humans, be they expert linguists or crowdworkers, who are in a position to identify models' shortcomings by providing examples for an optional context. Examples that models get wrong, or struggle with, can be validated by other humans to ensure their correctness. The data collected through this process can be used to evaluate state-of-the-art models and to train even stronger ones, hopefully creating a virtuous cycle that helps drive progress in the field. Figure 2 provides a sense of what the example creation interface looks like.

As a large-scale collaborative effort, the platform is meant to be a platform technology for human-and-model-in-the-loop evaluation that belongs to the entire community. In the current iteration, the platform is set up for dynamic adversarial data collection, where humans can attempt to find model-fooling examples. This design choice reflects the fact that the average case, as measured by maximum likelihood training on i.i.d. datasets, is much less interesting than the worst (i.e., adversarial) case, which is what we want our systems to be able to handle if they are put in critical systems where they interact with humans in real-world settings.

However, Dynabench is not limited to the adversarial setting, and one can imagine scenarios where humans are rewarded not for fooling a model or ensemble of models, but for finding examples that models, even if they are right, are very uncertain about, perhaps in an active learning setting. Similarly, the paradigm is perfectly compatible with collaborative settings that utilize human feedback, or even negotiation. The crucial aspect of this proposal is that models and humans interact live "in the loop" for evaluation and data collection.

One of the aims of this platform is to put expert linguists center stage. Creating model-fooling examples is not as easy as it used to be, and finding interesting examples is rapidly becoming a less trivial task. In ANLI, the verified model error rate for crowdworkers in the later rounds went below 1-in-10 (Nie et al., 2020), while in "Beat the AI", human performance decreased and time per valid adversarial example went up with stronger models in the loop (Bartolo et al., 2020). For expert linguists, we expect the model error rate to be much higher, but if the platform actually lives up to its virtuous-cycle promise, that error rate will go down quickly. Thus, we predict that linguists with expertise in exploring the decision boundaries of machine learning models will become essential.

While we are primarily motivated by evaluating progress, both ANLI and "Beat the AI" show that models can overcome some of their existing blind spots through adversarial training. They also find that the best model performance is still quite far from that of humans, suggesting that while the collected data appears to lie closer to the model decision boundaries, there still exist adversarial examples beyond the reach of current model capabilities.
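To make the round structure concrete, the following is a minimal sketch of one adversarial data-collection round as described above. The function names and data structures are hypothetical illustrations, not the platform's actual API.

```python
from dataclasses import dataclass

@dataclass
class Example:
    context: str
    input_text: str
    target_label: str
    model_prediction: str
    fooled: bool

def collect_round(target_model, contexts, annotators, validators, n_validations=3):
    """One round of human-and-model-in-the-loop data collection (illustrative only)."""
    collected = []
    for annotator in annotators:
        for context in contexts:
            # The annotator writes an input intended to fool the in-the-loop model.
            text, label = annotator.write_example(context)
            prediction = target_model.predict(context, text)
            example = Example(context, text, label, prediction, fooled=(prediction != label))
            # Model-fooling examples are checked by other humans before they count.
            votes = [v.verify(example) for v in validators[:n_validations]]
            if sum(votes) > n_validations / 2:
                collected.append(example)
    # The collected data is used to evaluate the current model and train the next one.
    return collected
```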
[Figure 2: The Dynabench example creation interface for sentiment analysis, with an illustrative example.]

3.1 Features and Implementation Details

Dynabench offers low-latency, real-time feedback on the behavior of state-of-the-art NLP models. The technology stack is based on PyTorch (Paszke et al., 2019), with models served via TorchServe (https://pytorch.org/serve). The platform not only displays prediction probabilities but, through an "inspect model" functionality, allows the user to examine token-level integrated gradients (Sundararajan et al., 2017), obtained via the Captum interpretability library (https://captum.ai/).

For each example, we allow the user to explain what the correct label is, as well as why they think it fooled the model if the model got it wrong, or why the model might have been fooled if it wasn't. All collected model-fooling (or, depending on the task, even non-model-fooling) examples are verified by other humans to ensure their validity.

Task owners can collect examples through the web interface, by engaging with the community, or through Mephisto (https://github.com/facebookresearch/Mephisto), which makes it easy to connect, e.g., Mechanical Turk workers to the exact same backend. All collected data will be open sourced in anonymized form.

In its current mode, Dynabench could be described as a fairly conservative departure from the status quo. It is being used to develop datasets that support the same metrics that drive existing benchmarks. The crucial change is that the datasets are now dynamically created, allowing for more kinds of evaluation, e.g., tracking progress through rounds and across different conditions.
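As an illustration of the "inspect model" functionality described in Section 3.1, the sketch below computes token-level attributions with Captum's LayerIntegratedGradients over a Hugging Face classifier. This is our own minimal example, not the Dynabench serving code; the checkpoint name and target class index are placeholders for whatever model is in the loop.

```python
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "textattack/roberta-base-SST-2"  # placeholder sentiment classifier
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def forward(input_ids, attention_mask):
    # Return class logits so Captum can attribute a chosen target class.
    return model(input_ids=input_ids, attention_mask=attention_mask).logits

enc = tokenizer("The plot was predictable but the acting saved it.", return_tensors="pt")
# Simple all-padding baseline; a more careful baseline would keep special tokens.
baseline = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)

lig = LayerIntegratedGradients(forward, model.roberta.embeddings)
attributions = lig.attribute(
    inputs=enc["input_ids"],
    baselines=baseline,
    additional_forward_args=(enc["attention_mask"],),
    target=1,  # index of the "positive" class; model-specific
)
# Collapse the embedding dimension to get one score per token.
scores = attributions.sum(dim=-1).squeeze(0)
scores = scores / torch.norm(scores)
for token, score in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), scores.tolist()):
    print(f"{token:>12s} {score:+.3f}")
```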

3.2 Initial Tasks

We have selected four official tasks as a starting point, which we believe represent an appropriate cross-section of the field at this point in time. Natural Language Inference (NLI) and Question Answering (QA) are canonical tasks in the field. Sentiment analysis is a task that some consider "solved" (and that is definitely treated as such, with all kinds of ethically problematic repercussions), which we show is not the case. Hate speech detection is very important because such speech can inflict harm on people, yet classifying it remains challenging for NLP.

Natural language inference. Built upon the semantic foundation of natural logic (Sánchez Valencia, 1991, i.a.) and hailing back much further (van Benthem, 2008), NLI is one of the quintessential natural language understanding tasks. NLI, also known as 'recognizing textual entailment' (Dagan et al., 2006), is often formulated as a 3-way classification problem where the input is a context sentence paired with a hypothesis, and the output is a label (entailment, contradiction, or neutral) indicating the relation between the pair.

We build on the ANLI dataset (Nie et al., 2020) and its three rounds to seed the Dynabench NLI task. During the ANLI data collection process, annotators were presented with a context (extracted from a pre-selected corpus) and a desired target label, and asked to provide a hypothesis that fools the target model adversary into misclassifying the example. If the target model was fooled, the annotator was invited to speculate about why, or to motivate why their example was right. The target model of the first round (R1) was a single BERT-Large model fine-tuned on SNLI and MNLI, while the target model of the second and third rounds (R2, R3) was an ensemble of RoBERTa-Large models fine-tuned on SNLI, MNLI, FEVER (Thorne et al., 2018) recast as NLI, and all of the ANLI data collected prior to the corresponding round. The contexts for Round 1 and Round 2 were Wikipedia passages curated in Yang et al. (2018), and the contexts for Round 3 came from various domains. Results indicate that state-of-the-art models (which can obtain 90%+ accuracy on SNLI and MNLI) cannot exceed 50% accuracy on rounds 2 and 3.

With the launch of Dynabench, we have started collection of a fourth round, which has several innovations: not only do we select candidate contexts from a more diverse set of Wikipedia featured articles, but we also use an ensemble of two models with different architectures as target adversaries, to increase diversity and robustness. Moreover, the ensemble of adversaries will help mitigate issues with creating a dataset whose distribution is too closely aligned to a particular target model or architecture. Additionally, we are collecting two types of natural language explanations: why an example is correct, and why a target model might be wrong. We hope that disentangling this information will yield an additional layer of interpretability and yield models that are at least as explainable as they are robust.

Question answering. The QA task takes the same format as SQuAD 1.1 (Rajpurkar et al., 2016): given a context and a question, extract an answer from the context as a continuous span of text. The first round of adversarial QA (AQA) data comes from "Beat the AI" (Bartolo et al., 2020). During annotation, crowdworkers were presented with a context sourced from Wikipedia, identical to those in SQuAD 1.1, and asked to write a question and select an answer. The annotated answer was compared to the model prediction using a word-overlap F1 threshold and, if sufficiently different, was considered to have fooled the model. The target models in round 1 were BiDAF (Seo et al., 2017), BERT-Large, and RoBERTa-Large.

The model in the loop for the current round is RoBERTa trained on the examples from the first round combined with SQuAD 1.1. Despite the super-human performance achieved on SQuAD 1.1, machine performance is still far from human performance on the current leaderboard. In the current phase, we seek to collect rich and diverse examples, focusing on improving model robustness through generative data augmentation, to provide more challenging model adversaries in this constrained task setting. We should emphasize that we do not consider this task structure representative of the broader definition even of closed-domain QA, and we are looking to expand it to include unanswerable questions (Rajpurkar et al., 2018), longer and more complex passages, yes/no questions and multi-span answers (Kwiatkowski et al., 2019), and numbers, dates, and spans from the question (Dua et al., 2019) as model performance progresses.
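The word-overlap check used to decide whether a QA model was fooled can be illustrated as follows. This is a sketch of the standard SQuAD-style token F1; the threshold value is illustrative, since the exact cutoff is not specified here.

```python
import re
from collections import Counter

def normalize(text):
    # Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style).
    text = re.sub(r"[^\w\s]", " ", text.lower())
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def f1(prediction, gold):
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def fooled(model_answer, human_answer, threshold=0.4):  # threshold is illustrative
    """The model counts as fooled when its answer overlaps too little with the
    annotator's answer; such examples are then checked by human validators."""
    return f1(model_answer, human_answer) < threshold

print(fooled("in 1066", "1066"))                           # False: high overlap
print(fooled("William the Conqueror", "the Duke of Normandy"))  # True: no overlap
```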
Sentiment analysis. The sentiment analysis project is a multi-pronged effort to create a dynamic benchmark for sentiment analysis and to evaluate some of the core hypotheses behind Dynabench. Potts et al. (2020) provide an initial report and the first two rounds of this dataset.

The task is structured as a 3-way classification problem: positive, negative, and neutral. The motivation for using a simple positive/negative dichotomy is to show that there are still very challenging phenomena in this traditional sentiment space. The neutral category was added to avoid (and to help trained models avoid) the false presupposition that every text conveys sentiment information (Pang and Lee, 2008). In future iterations, we plan to consider additional dimensions of sentiment and emotional expression (Alm et al., 2005; Neviarouskaya et al., 2010; Wiebe et al., 2005; Liu et al., 2003; Sudhof et al., 2014).

In this first phase, we examined the question of how best to elicit examples from workers that are diverse, creative, and naturalistic. In the "prompt" condition, we provide workers with an actual sentence from an existing product or service review and ask them to edit it so that it fools the model. In the "no prompt" condition, workers try to write original sentences that fool the model. We find that the "prompt" condition is superior: workers generally make substantial edits, and the resulting sentences are more linguistically diverse than those in the "no prompt" condition.

In a parallel effort, we also collected and validated hard sentiment examples from existing corpora, which will enable another set of comparisons that will help us to refine the Dynabench protocols and interfaces. We plan for the dataset to continue to grow, probably mixing attested examples with those created on Dynabench with the help of prompts. With these diverse rounds, we can address a wide range of questions pertaining to dataset artifacts, domain transfer, and the overall robustness of sentiment analysis systems.

Hate speech detection. The hate speech task classifies whether a statement expresses hate against a protected characteristic or not. Detecting hate is notoriously difficult given the important role played by context and speaker (Leader Maynard and Benesch, 2016) and the variety of ways in which hate can be expressed (Waseem et al., 2017). Few high-quality, varied, and large training datasets are available for training hate detection systems (Vidgen and Derczynski, 2020; Poletto et al., 2020; Vidgen et al., 2019).

We organised four rounds of data collection and model training, with preliminary results reported in Vidgen et al. (2020). In each round, annotators are tasked with entering content that tricks the model into giving an incorrect classification. The content is created by the annotators and is therefore synthetic in nature. At the end of each round the model is retrained and the process is repeated. For the first round, we trained a RoBERTa model on 470,000 hateful and abusive statements (derived from https://hatespeechdata.com, in anonymized form). For subsequent rounds the model was trained on the original data plus content from the prior rounds. Due to the complexity of online hate, we hired and trained analysts rather than paying for crowdsourced annotations. Each analyst was given training, support, and feedback throughout their work.

In all rounds annotators provided a label for whether content is hateful or not. In rounds 2, 3, and 4, they also gave labels for the target (i.e., which group has been attacked) and the type of statement (e.g., derogatory remarks, dehumanization, or threatening language). These granular labels help us investigate model errors and improve performance, and they direct the identification of new data for future entry. For approximately half of the entries in rounds 2, 3, and 4, annotators created "perturbations," where the text is minimally adjusted so as to flip the label (Gardner et al., 2020; Kaushik et al., 2020). This helps to identify decision boundaries within the model and minimizes the risk of overfitting given the small pool of annotators.

Over the four rounds, content becomes increasingly adversarial (shown by the fact that target models have lower performance on later rounds' data) and models improve (shown by the fact that the model error rate declines and the later rounds' models have the highest accuracy on each round). We externally validate performance using the HATECHECK suite of diagnostic tests from Röttger et al. (2020). We show substantial improvement over the four rounds, and our final-round target model achieves 94% on HATECHECK, outperforming the models presented by the original authors.

3.3 Dynabenchmarking NLP

Table 1 shows an overview of the current situation for the four tasks. Some tasks are further along in their data collection efforts than others. As we can see, the validated model error rate (vMER), i.e., the number of human-validated model errors divided by the total number of examples, is still very high across all tasks, clearly demonstrating that NLP is far from solved. Note that these error rates are not necessarily comparable across tasks, since the interfaces and in-the-loop models are not identical.

Task        | Rounds | Examples | vMER
NLI         | 4      | 170,294  | 33.24%
QA          | 2      | 36,406   | 33.74%
Sentiment   | 3      | 19,975   | 35.00%
Hate speech | 4      | 41,255   | 43.90%

Table 1: Statistics for the initial four official tasks.
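The vMER figures in Table 1 follow directly from the definition above; a one-line helper makes the computation explicit (the counts in the example are made up, not the numbers behind Table 1):

```python
def validated_model_error_rate(validated_model_errors, total_examples):
    """vMER: human-validated model errors divided by total collected examples."""
    return validated_model_errors / total_examples

# Hypothetical round: 1,200 of 3,600 submitted examples were verified as model errors.
print(f"{validated_model_error_rate(1200, 3600):.2%}")  # 33.33%
```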
To make we acknowledge that crowdsourced texts are likely an analogy with software testing, similar to check- to have unnatural qualities: the setting itself is ar- lists (Ribeiro et al., 2020), it would be a bad idea tificial from the perspective of genuine communi- to throw away old tests just because you’ve written cation, and crowdworkers are not representative some new ones. As long as we factor in previous of the general population. Dynabench could exac- rounds, Dynabench’s dynamic nature offers a way erbate this, but it also has features that can help out from forgetting and cyclical issues: any model alleviate it. For instance, as we discussed earlier, biases will be fixed in the limit by annotators ex- the sentiment analysis project is using naturalistic ploiting vulnerabilities. prompt sentences to try to help workers create more Another risk is that the data distribution might diverse and naturalistic data. be too heavily dependent on the target model in Second, if we rely solely on dynamic adversarial the loop. When this becomes an issue, it can be collection, then we increase the risks of creating un- mitigated by using ensembles of many different ar- natural datasets. For instance, Bartolo et al.(2020) chitectures in the loop, for example the top current 5 show that training solely on adversarially-collected state-of-the-art ones, with multiple seeds. data for QA was detrimental to performance on How do we account for future, not-yet-in-the- non-adversarially collected data. However, they loop models? Obviously, we can’t—so this is a also show that models are capable of simultane- very valid criticism. However, we can assume that ously learning both distributions when trained on an ensemble of model architectures is a reasonable the combined data, retaining if not slightly im- approximation, if and only if the models are not proving performance on the original distribution too bad at their task. This latter point is crucial: we (of course, this may not hold if we have many 5ANLI does not show dramatically different results across more examples of one particular kind). Ideally, we models, suggesting that this is not necessarily a big problem would combine adversarially collected data with yet, but it shows in R2 and R3 that ensembles are possible. take the stance that models by now, especially in ever, it has also been found that model-in-the-loop aggregate, are probably good enough to be reason- counterfactually-augmented training data does not ably close enough to the decision boundaries—but necessarily lead to better generalization (Huang it is definitely true that we have no guarantees that et al., 2020). Given the distributional shift induced this is the case. by adversarial settings, it would probably be wisest to combine adversarially collected data with non- How do we compare results if the benchmark adversarial data during training (ANLI takes this keeps changing? This is probably the main hur- approach), and to also test models in both scenarios. dle from a community adoption standpoint. But To get the most useful training and testing data, it if we consider, e.g., the multiple iterations of Se- seems the focus should be on collecting adversarial mEval or WMT datasets over the years, we’ve al- data with the best available model(s), preferably ready been handling this quite well—we accept with a wide range of expertise, as that will likely that a model’s BLEU score on WMT16 is not com- be beneficial to future models also. That said, we parable to WMT14. 
How do we account for future, not-yet-in-the-loop models? Obviously, we can't, so this is a very valid criticism. However, we can assume that an ensemble of model architectures is a reasonable approximation, if and only if the models are not too bad at their task. This latter point is crucial: we take the stance that models by now, especially in aggregate, are probably good enough to be reasonably close to the decision boundaries, but it is definitely true that we have no guarantees that this is the case.

How do we compare results if the benchmark keeps changing? This is probably the main hurdle from a community adoption standpoint. But if we consider, e.g., the multiple iterations of SemEval or WMT datasets over the years, we have already been handling this quite well: we accept that a model's BLEU score on WMT16 is not comparable to WMT14. That is, it is perfectly natural for benchmark datasets to evolve as the community makes progress. The only thing Dynabench does differently is that it anticipates dataset saturation and embraces the loop so that we can make faster and more sustained progress.

What about generative tasks? For now Dynabench focuses on classification and span extraction tasks, where it is relatively straightforward to establish whether a model was wrong. If instead the evaluation metric is something like ROUGE or BLEU and we are interested in generation, we need a way to discretize an answer to determine correctness, since we would not have ground-truth annotations; this makes it less straightforward to determine whether a model was successfully fooled. However, we could discretize generation by re-framing it as multiple choice with hard negatives, or simply by asking the annotator whether the generation is good enough. In short, going beyond classification will require further research, but it is definitely doable.

Do we need models in the loop for good data? The potential usefulness of adversarial examples can be explained at least in part by the fact that having an annotation partner (so far, a model) simply provides better incentives for generating quality annotations. Having the model in the loop is obviously useful for evaluation, but it is less clear whether the resulting data is necessarily also useful in general for training. So far, there is evidence that adversarially collected data provides performance gains irrespective of the model in the loop (Nie et al., 2020; Dinan et al., 2019; Bartolo et al., 2020). For example, ANLI shows that replacing equal amounts of "normally collected" SNLI and MNLI training data with ANLI data improves model performance, especially when the training set is small (Nie et al., 2020), suggesting higher data efficiency. However, it has also been found that model-in-the-loop counterfactually-augmented training data does not necessarily lead to better generalization (Huang et al., 2020). Given the distributional shift induced by adversarial settings, it would probably be wisest to combine adversarially collected data with non-adversarial data during training (ANLI takes this approach), and to also test models in both scenarios. To get the most useful training and testing data, it seems the focus should be on collecting adversarial data with the best available model(s), preferably with a wide range of expertise, as that will likely be beneficial to future models as well. That said, we expect this to be both task and model dependent. Much more research is required, and we encourage the community to explore these topics.

Is it expensive? Dynamic benchmarking is indeed expensive, but it is worth putting the numbers in context, as all data collection efforts are expensive when done at the scale of our current benchmark tasks. For instance, SNLI has 20K examples that were separately validated, and each of these examples cost approximately $0.50 to obtain and validate (personal communication with the SNLI authors). Similarly, the 40K validated examples in MultiNLI cost $0.64 each (p.c., MultiNLI authors). By comparison, the average cost of creation and validation for ANLI examples is closer to $1.00 (p.c., ANLI authors). This is a substantial increase at scale. However, dynamic adversarial datasets may also last longer as benchmarks. If true, then the increased costs could turn out to be a bargain.

We should acknowledge, though, that dynamic benchmarks will tend to be more expensive than regular benchmarks for comparable tasks, because not every annotation attempt will be model-fooling and validation is required. Such expenses are likely to increase through successive rounds, as the models become more robust to workers' adversarial attacks. The research bet is that each example obtained this way is actually worth more to the community and thus worth the expense.

In addition, we hope that language enthusiasts and other non-crowdworker model breakers will appreciate the honor that comes with being high up on the user leaderboard for breaking models. We are working on making the tool useful for education, as well as gamifying the interface to make it (even) more fun to try to fool models, as a "game with a purpose" (Von Ahn and Dabbish, 2008), for example through the ability to earn badges.

5 Conclusion and Outlook

We introduced Dynabench, a research platform for dynamic benchmarking. Dynabench opens up exciting new research directions, such as investigating the effects of ensembles in the loop, characterising distributional shift, exploring annotator efficiency, investigating the effects of annotator expertise, and improving model robustness to targeted adversarial attacks in an interactive setting. It also facilitates further study of dynamic data collection and more general cross-task analyses of human-and-machine interaction. The current iteration of the platform is only just the beginning of a longer journey. In the immediate future, we aim to achieve the following goals:

Anyone can run a task. Having created a tool that allows for human-in-the-loop model evaluation and data collection, we aim to make it possible for anyone to run their own task. To get started, only three things are needed: a target model, a (set of) context(s), and a pool of annotators.

Multilinguality and multimodality. As of now, Dynabench is text-only and focuses on English, but we hope to change that soon.

Live model evaluation. Model evaluation should not be about one single number on some test set. If models are uploaded through a standard interface, they can be scored automatically along many dimensions. We would be able to capture not only accuracy, for example, but also usage of computational resources, inference time, fairness, and many other relevant dimensions. This will in turn enable dynamic leaderboards, for example based on utility (Ethayarajh and Jurafsky, 2020). This would also allow for backward-compatible comparisons, without having to worry about the benchmark changing, and for automatically putting new state-of-the-art models in the loop, addressing some of the main objections.

One can easily imagine a future where, in order to fulfill reproducibility requirements, authors do not only link to their open-source codebase but also to their model inference point, so that others can "talk with" their model. This will help drive progress, as it will allow others to examine models' capabilities and identify failures to address with newer, even better models. If we cannot always democratize the training of state-of-the-art AI models, at the very least we can democratize their evaluation.

Acknowledgements

We would like to thank Jason Weston, Emily Dinan and Kyunghyun Cho for their input on this project, and Sonia Kris for her support. ZW has been supported in part by the Canada 150 Research Chair program and the UK-Canada Artificial Intelligence Initiative. YN and MB have been supported in part by DARPA MCS N66001-19-2-4031, DARPA YFA17-D17AP00022, and ONR N00014-18-1-2871. CP has been supported in part by grants from Facebook, Google, and by Stanford's Institute for Human-Centered AI.

References

Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. 2005. Emotions from text: Machine learning for text-based emotion prediction. In Proceedings of HLT/EMNLP.

Joshua Attenberg, Panos Ipeirotis, and Foster Provost. 2015. Beat the machine: Challenging humans to find a predictive model's "unknown unknowns". Journal of Data and Information Quality, 6(1).

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Max Bartolo, Alastair Roberts, Johannes Welbl, Sebastian Riedel, and Pontus Stenetorp. 2020. Beat the AI: Investigating adversarial human annotation for reading comprehension. Transactions of the Association for Computational Linguistics, 8:662-678.

Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In Proceedings of ICLR.

Yonatan Belinkov, Adam Poliak, Stuart Shieber, Benjamin Van Durme, and Alexander Rush. 2019. Don't take the premise for granted: Mitigating artifacts in natural language inference. In Proceedings of ACL.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of EMNLP.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, and Dan Jurafsky. 2020. With little power comes great responsibility. In Proceedings of EMNLP.
Michael Chen, Mike D'Arcy, Alisa Liu, Jared Fernandez, and Doug Downey. 2019. CODAH: An adversarially-authored question answering dataset for common sense. In Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP.

Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of LREC.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, pages 177-190. Springer.

Harm de Vries, Dzmitry Bahdanau, and Christopher Manning. 2020. Towards ecologically valid research on language user interfaces. arXiv preprint arXiv:2007.14435.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT.

Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. 2019. Build it break it fix it for dialogue safety: Robustness from adversarial human attack. In Proceedings of EMNLP-IJCNLP.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of NAACL-HLT.

Javid Ebrahimi, Daniel Lowd, and Dejing Dou. 2018a. On adversarial examples for character-level neural machine translation. In Proceedings of COLING.

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018b. HotFlip: White-box adversarial examples for text classification. In Proceedings of ACL.

Kawin Ethayarajh and Dan Jurafsky. 2020. Utility is in the eye of the user: A critique of NLP leaderboard design. In Proceedings of EMNLP.

Allyson Ettinger. 2020. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8:34-48.

Allyson Ettinger, Sudha Rao, Hal Daumé III, and Emily M. Bender. 2017. Towards linguistically generalizable NLP systems: A workshop and shared task. In Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems.

Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, et al. 2020. Evaluating models' local decision boundaries via contrast sets. In Findings of EMNLP.

Jon Gauthier, Jennifer Hu, Ethan Wilcox, Peng Qian, and Roger Levy. 2020. SyntaxGym: An online platform for targeted evaluation of language models. In Proceedings of ACL: System Demonstrations.

Atticus Geiger, Ignacio Cases, Lauri Karttunen, and Christopher Potts. 2018. Stress-testing neural models of natural language inference with multiply-quantified sentences. arXiv preprint arXiv:1810.13033.

Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of ACL.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of NAACL-HLT.

Braden Hancock, Antoine Bordes, Pierre-Emmanuel Mazare, and Jason Weston. 2019. Learning from dialogue after deployment: Feed yourself, chatbot! In Proceedings of ACL.

Md Mosharaf Hossain, Venelin Kovatchev, Pranoy Dutta, Tiffany Kao, Elizabeth Wei, and Eduardo Blanco. 2020. An analysis of natural language inference benchmarks through the lens of negation. In Proceedings of EMNLP.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of ACL.

William Huang, Haokun Liu, and Samuel R. Bowman. 2020. Counterfactually-augmented SNLI training data does not yield better generalization than unaugmented data. arXiv preprint arXiv:2010.04762.

Paloma Jeretic, Alex Warstadt, Suvrat Bhooshan, and Adina Williams. 2020. Are natural language inference models IMPPRESsive? Learning IMPlicature and PRESupposition. In Proceedings of ACL.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of EMNLP.

Divyansh Kaushik, Eduard Hovy, and Zachary C. Lipton. 2020. Learning the difference that makes a difference with counterfactually-augmented data. In Proceedings of ICLR.

Huda Khayrallah and Philipp Koehn. 2018. On the impact of various types of noise on neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation.

Najoung Kim and Tal Linzen. 2020. COGS: A compositional generalization challenge based on semantic interpretation. In Proceedings of EMNLP.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, et al. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452-466.

Jonathan Leader Maynard and Susan Benesch. 2016. Dangerous speech and dangerous ideology: An integrated model for monitoring and prevention. Genocide Studies and Prevention, 9(3):70-95.

Patrick Lewis, Pontus Stenetorp, and Sebastian Riedel. 2020. Question and answer test-train overlap in open-domain question answering datasets. arXiv preprint arXiv:2008.02637.

Tal Linzen. 2020. How can we accelerate progress towards human-like linguistic generalization? In Proceedings of ACL.

Hugo Liu, Henry Lieberman, and Ted Selker. 2003. A model of textual affect sensing using real-world knowledge. In Proceedings of IUI.

Nelson F. Liu, Roy Schwartz, and Noah A. Smith. 2019a. Inoculation by fine-tuning: A method for analyzing challenge datasets. In Proceedings of NAACL-HLT.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330.

Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. 2019. On measuring social biases in sentence encoders. In Proceedings of NAACL-HLT.

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of ACL.

Pasquale Minervini and Sebastian Riedel. 2018. Adversarially regularising neural NLI models to integrate logical background knowledge. In Proceedings of CoNLL.

Tom Mitchell, William Cohen, Estevam Hruschka, Partha Talukdar, Bishan Yang, Justin Betteridge, et al. 2018. Never-ending learning. Communications of the ACM, 61(5):103-115.

Robert Munro, Steven Bethard, Victor Kuperman, Vicky Tzuyin Lai, Robin Melnick, Christopher Potts, Tyler Schnoebelen, and Harry Tily. 2010. Crowdsourcing and language studies: The new generation of linguistic data. In Proceedings of the NAACL-HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk.

Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. 2018. Stress test evaluation for natural language inference. In Proceedings of COLING.

Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. 2020. CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of EMNLP.

Alena Neviarouskaya, Helmut Prendinger, and Mitsuru Ishizuka. 2010. Recognition of affect, judgment, and appreciation in text. In Proceedings of COLING.

Yixin Nie, Yicheng Wang, and Mohit Bansal. 2019. Analyzing compositionality-sensitivity of NLI models. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):6867-6874.

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of ACL.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1):1-135.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32.

Fabio Poletto, Valerio Basile, Manuela Sanguinetti, Cristina Bosco, and Viviana Patti. 2020. Resources and benchmark corpora for hate speech detection: A systematic review. Language Resources and Evaluation.

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics (*SEM).

Christopher Potts, Zhengxuan Wu, Atticus Geiger, and Douwe Kiela. 2020. DynaSent: A dynamic benchmark for sentiment analysis. arXiv preprint arXiv:2012.15349.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL: Shared Task.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of ACL.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP.

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of ACL.

Ohad Rozen, Vered Shwartz, Roee Aharoni, and Ido Dagan. 2019. Diversify your datasets: Analyzing generalization via controlled variance in adversarial datasets. In Proceedings of CoNLL.

Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. In Proceedings of NAACL-HLT.

Paul Röttger, Bertram Vidgen, Dong Nguyen, Zeerak Waseem, Helen Margetts, and Janet Pierrehumbert. 2020. HateCheck: Functional tests for hate speech detection models.

Swarnadeep Saha, Yixin Nie, and Mohit Bansal. 2020. ConjNLI: Natural language inference over conjunctive sentences. In Proceedings of EMNLP.

Victor Sánchez Valencia. 1991. Studies on Natural Logic and Categorial Grammar. Ph.D. thesis, University of Amsterdam.

Chinnadhurai Sankar, Sandeep Subramanian, Chris Pal, Sarath Chandar, and Yoshua Bengio. 2019. Do neural dialog systems use the conversation history effectively? An empirical study. In Proceedings of ACL.

Sebastian Schuster, Yuxing Chen, and Judith Degen. 2020. Harnessing the linguistic signal to predict scalar inferences. In Proceedings of ACL.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In Proceedings of ICLR.

Kurt Shuster, Da Ju, Stephen Roller, Emily Dinan, Y-Lan Boureau, and Jason Weston. 2020. The dialogue dodecathlon: Open-domain knowledge and image grounded conversational agents. In Proceedings of ACL.

Daniel L. Silver, Qiang Yang, and Lianghao Li. 2013. Lifelong machine learning systems: Beyond learning algorithms. In 2013 AAAI Spring Symposium Series.

Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Ng. 2008. Cheap and fast, but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of EMNLP.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP.

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33.

Moritz Sudhof, Andrés Gómez Emilsson, Andrew L. Maas, and Christopher Potts. 2014. Sentiment expression conditioned by affective transitions and social forces. In Proceedings of KDD.

Saku Sugawara, Pontus Stenetorp, Kentaro Inui, and Akiko Aizawa. 2020. Assessing the benchmarking capacity of machine reading comprehension datasets. In Proceedings of the AAAI Conference on Artificial Intelligence, 34:8918-8927.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of ICML.

Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, and Yejin Choi. 2020. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In Proceedings of EMNLP.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and VERification. In Proceedings of NAACL-HLT.

Masatoshi Tsuchiya. 2018. Performance impact caused by hidden bias of training data for recognizing textual entailment. In Proceedings of LREC.

Johan van Benthem. 2008. A brief history of natural logic. In Logic, Navya-Nyaya and Applications: Homage to Bimal Matilal.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30.

Bertie Vidgen and Leon Derczynski. 2020. Directions in abusive language training data: Garbage in, garbage out. arXiv preprint arXiv:2004.01670.

Bertie Vidgen, Tristan Thrush, Zeerak Waseem, and Douwe Kiela. 2020. Learning from the worst: Dynamically generated datasets to improve online hate detection. arXiv preprint arXiv:2012.15761.

Luis Von Ahn and Laura Dabbish. 2008. Designing games with a purpose. Communications of the ACM, 51(8):58-67.

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of EMNLP-IJCNLP.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems 32.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP.

Alex Warstadt, Yu Cao, Ioana Grosu, Wei Peng, Hagen Blix, Yining Nie, et al. 2019. Investigating BERT's knowledge of language: Five analysis methods with NPIs. In Proceedings of EMNLP-IJCNLP.

Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman. 2020. BLiMP: A benchmark of linguistic minimal pairs for English. In Proceedings of the Society for Computation in Linguistics 2020.

Zeerak Waseem, Thomas Davidson, Dana Warmsley, and Ingmar Weber. 2017. Understanding abuse: A typology of abusive language detection subtasks.

Bertie Vidgen, Alex Harris, Dong Nguyen, Rebekah Tromble, Scott Hale, and Helen Margetts. 2019.
In Proceedings of the First Workshop on Abusive Challenges and frontiers in abusive content detec- Language Online, pages 78–84. tion. In Proceedings of the Third Workshop on Abu- sive Language Online, pages 80–93, Florence, Italy. Aaron Steven White, Rachel Rudinger, Kyle Rawlins, Association for Computational Linguistics. and Benjamin Van Durme. 2018. Lexicosyntactic inference in neural models. In Proceedings of the 2018 Conference on Empirical Methods in Natu- ral Language Processing, pages 4717–4724, Brus- sels, Belgium. Association for Computational Lin- guistics. Aaron Steven White, Elias Stengel-Eskin, Siddharth Vashishtha, Venkata Subrahmanyan Govindarajan, Dee Ann Reisinger, Tim Vieira, Keisuke Sakaguchi, Sheng Zhang, Francis Ferraro, Rachel Rudinger, Kyle Rawlins, and Benjamin Van Durme. 2020. The universal decompositional semantics dataset and de- comp toolkit. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5698– 5707, Marseille, France. European Language Re- sources Association.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emo- tions in language. Language Resources and Evalua- tion, 39(2–3):165–210.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.

Zhilin Yang, Saizheng Zhang, Jack Urbanek, Will Feng, Alexander H Miller, Arthur Szlam, Douwe Kiela, and Jason Weston. 2017. Mastering the dungeon: Grounded language learning by mechanical turker descent. arXiv preprint arXiv:1711.07950.

Lang Yu and Allyson Ettinger. 2020. Assessing phrasal representation and composition in transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4896–4907, Online. Association for Computational Linguistics.

Xiang Zhou, Yixin Nie, Hao Tan, and Mohit Bansal. 2020. The curse of performance instability in analysis datasets: Consequences, source, and suggestions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8215–8228, Online. Association for Computational Linguistics.