Arxiv:2102.12060V2 [Cs.CL] 1 Mar 2021

Teach Me to Explain: A Review of Datasets for Explainable NLP Sarah Wiegreffe∗ Ana Marasovic´∗ School of Interactive Computing Allen Institute for AI Georgia Institute of Technology University of Washington [email protected] [email protected] Abstract learning practitioners encounter data cascades, or downstream problems resulting from poor data Explainable NLP (EXNLP) has increas- quality (Sambasivan et al., 2021). It is impor- ingly focused on collecting human- annotated explanations. These explanations tant to constantly evaluate data collection practices are used downstream in three ways: as data critically and establish common and standardized augmentation to improve performance on methodologies (Bender and Friedman, 2018; Ge- a predictive task, as a loss signal to train bru et al., 2018; Jo and Gebru, 2020; Ning et al., models to produce explanations for their 2020; Paullada et al., 2020). Even highly influ- predictions, and as a means to evaluate the ential datasets that are widely adopted in research quality of model-generated explanations. undergo such inspections. For instance, Recht In this review, we identify three predom- et al.(2019) tried to reproduce the ImageNet test inant classes of explanations (highlights, free-text, and structured), organize the set by carefully following the instructions in the literature on annotating each type, point paper (Deng et al., 2009), but multiple models do to what has been learned to date, and give not perform well on the replicated test set. We ex- recommendations for collecting EXNLP pect that such examinations are particularly valu- datasets in the future. able when many related datasets are released con- temporaneously and independently in a short pe- 1 Introduction riod of time, as is the case with EXNLP datasets. Interpreting supervised machine learning models This survey aims to review and summarize the by analyzing local explanations—justifications of literature on collecting EXNLP datasets, highlight their individual predictions1—is important for de- what has been learned to date, and give recom- bugging, protecting sensitive information, under- mendations for future EXNLP dataset construc- standing model robustness, and increasing war- tion. It complements other explainable AI (XAI) ranted trust in a stated contract (Doshi-Velez and surveys and critical retrospectives that focus on Kim, 2017; Molnar, 2019; Jacovi et al., 2021). definitions (Gilpin et al., 2018; Miller, 2019; Clin- ciu and Hastie, 2019; Barredo Arrieta et al., 2020; Explainable NLP (EXNLP) researchers rely on datasets that contain human justifications for the Jacovi and Goldberg, 2020), methods (Biran and true label (overviewed in Tables3–5). Human jus- Cotton, 2017; Ras et al., 2018; Guidotti et al., tifications are used to aid models with additional 2019), evaluation (Hoffman et al., 2018; Yang arXiv:2102.12060v2 [cs.CL] 1 Mar 2021 training supervision (learning-from-explanations; et al., 2019), or a combination of these topics Zaidan et al., 2007), to train interpretable models (Lipton, 2018; Doshi-Velez and Kim, 2017; Adadi that explain their own predictions (Camburu et al., and Berrada, 2018; Murdoch et al., 2019; Verma 2018), and to evaluate plausibility of explana- et al., 2020; Burkart and Huber, 2021), but not on tions by measuring their agreement with human- datasets. Kotonya and Toni(2020a) and Thaya- annotated explanations (DeYoung et al., 2020a). paran et al.(2020) review datasets and methods for Dataset collection is the most under-scrutinized explaining automated fact-checking and machine component of the machine learning pipeline (Par- reading comprehension, respectively; we are the itosh, 2020)—it is estimated that 92% of machine first to review datasets across all NLP tasks. ∗ To organize the literature, we first define our Equal contributions. 1As opposed to global explanation, where one explanation scope (§2), sort out EXNLP terminology (§3), and is given for an entire system (Molnar, 2019). present an overview of existing EXNLP datasets (§4).2 Then, we present what can be learned Input: Jenny forgot to lock her car at the grocery store. from existing EXNLP datasets. Specifically, §5 discusses the typical process of collecting high- lighted elements in the inputs as explanations, and What tense is the What might follow What might discrepancies between this process and that of sentence in? from this event? follow from this evaluating automatic methods. In the same sec- event? Answer: Past Answer: Jenny’s car tion, we also draw attention to how assumptions is broken into. Answer: made in free-text explanation collection influence Jenny’s car is Explanation: Explanation: Thieves broken into. their modeling, and call for better documenting of “Forgot” is the often target unlocked past tense of the cars in public places. EXNLP data collection. In §6, we illustrate that verb “to forget”. not all template-like free-text explanations are un- wanted, and call for embracing the structure of an explanation when appropriate. In §7.1, we present Explaining Human Explaining Observations a proposal for controlling quality in free-text ex- Decisions about the World planation collection. Finally, §8 gathers recommendations from related subfields to further re- Figure 1: A description of the two classes of duce data artifacts by increasing diversity of col- EXNLP datasets (§2). For some input, an ex- lected explanations. planation can explain a human-provided answer, the answer (left and middle) or something about 2 Survey Scope the world (middle and right). The shaded area is our survey scope. These classes can be used to An explanation can be described as a “three-place teach and evaluate models on different skill sets: predicate: someone explains something to some- their ability to justify their predictions (left), rea- one”(Hilton, 1990). But what is the something son about the world (right), or do both (middle). being explained in EXNLP? We identify two types of EXNLP dataset papers, based on their answer to this question: those which explain human de- tional training supervision to predict labels from cisions (task labels), and those which explain ob- inputs, (ii) training supervision for models to ex- served events or phenomena in the world. We il- plain their label predictions, and (iii) an evaluation lustrate these classes in Figure 1. set for the quality of model-produced explanations The first class, explaining human decisions, has of label predictions. We define our survey scope as a long history and strong association with the term focusing on this class of explanations. “explanation” in NLP. In this class, explanations The second class falls under commonsense rea- are not the task labels themselves, but rather ex- soning, and this type of explanation is commonly plain some task label for a given input (see exam- referred to as commonsense inference about the ples in Figure 1 and Table 1). One distinguish- world. While most datasets that can be framed as ing factor in their collection is that they are of- explaining something about the world do not use ten collected by asking “why” questions, such as the term “explanation” (Zellers et al., 2018; Sap “why is [input] assigned [label]?”. The underly- et al., 2019b; Zellers et al., 2019b; Huang et al., ing assumption is that the explainer believes the 2019; Fan et al., 2019; Bisk et al., 2020; Forbes assigned label to be correct or at least likely, ei- et al., 2020), a few recent ones do: ART (Bhaga- ther because they selected it, or another annotator vatula et al., 2020) and GLUCOSE (Mostafazadeh did (discussed further in §7.3). Resulting datasets et al., 2020). These datasets have explanations as contain instances that are tuples of the form (input, instance-labels themselves, rather than as justifi- label, explanation), where each explanation is tied cations for existing labels. For example, ART col- to an input-label pair. Of course, collecting expla- lects plausible hypotheses and GLUCOSE collects nations of human decisions can help train systems certain or likely statements to explain how sen- to model them. This class is thus directly suited tences in a story can follow other sentences. They to the three EXNLP goals: they serve as (i) addi- thus produce tuples of the form (input, explanation), where the input is an event or observation. 2We created a living version of the tables as a website, where pull requests and issues can be opened to add and up- The two classes are not mutually exclusive. date datasets: https://exnlpdatasets.github.io For example, SBIC (Sap et al., 2019a) contains Instance Explanation (highlight) Premise: A white race dog wearing the num- Premise: A white race dog wearing the number eight runs on ber eight runs on the track. Hypothesis: A white race dog the track. Hypothesis: A white race dog runs around his yard. runs around his yard. Label: contradiction (free-text) A race track is not usually in someone’s yard. Question: Who sang the theme song from Russia With Love? (structured) Sentence selection: The theme song was Paragraph: John Barry, arranger of Monty Norman’s “James composed by Lionel Bart of Oliver! fame and sung by Matt Bond Theme”. would be the dominant Bond series com- Monro. Referential equality: “the theme song from russia poser for most of its history and the inspiration for fellow se- with love” (extracted from question) = “The theme song” (ex- ries composer, David Arnold (who uses cues from this sound- tracted from paragraph) Entailment: X was composed by Li- track in his own. ). The theme song was composed by Lionel onel Bart of Oliver! fame and sung by ANSWER. ` AN- Bart of Oliver! fame and sung by Matt Monro. Answer: Matt SWER sung X Monro Table 1: Examples of explanation types discussed in §3. The first two rows show a highlight and free-text explanation for an E-SNLI instance (Camburu et al., 2018). The last row shows a structured explanation from QED for a NATURALQUESTIONS instance (Lamm et al., 2020).

Load more