
Rethinking Cooperative Rationalization: Extraction and Complement Control

Mo Yu*♥   Shiyu Chang*♣♥   Yang Zhang*♣♥   Tommi S. Jaakkola♠
♥ IBM Research   ♣ MIT-IBM Watson AI Lab   ♠ MIT
[email protected]   {shiyu.chang, yang.zhang2}@ibm.com   [email protected]

Abstract

Selective rationalization has become a common mechanism to ensure that predictive models reveal how they use any available features. The selection may be soft or hard, and identifies a subset of input features relevant for prediction. The setup can be viewed as a cooperative game between the selector (a.k.a. the rationale generator) and the predictor making use of only the selected features. The cooperative setting may, however, be compromised for two reasons. First, the generator typically has no direct access to the outcome it aims to justify, resulting in poor performance. Second, there is typically no control exerted on the information left outside the selection. We revise the overall cooperative framework to address these challenges. We introduce an introspective model which explicitly predicts and incorporates the outcome into the selection process. Moreover, we explicitly control the rationale complement via an adversary so as not to leave any useful information out of the selection. We show that the two complementary mechanisms maintain both high predictive accuracy and lead to comprehensive rationales.1

Label: negative
Original Text: really cloudy, lots of sediment , washed out yellow color . looks pretty gross , actually , like swamp water . no head, no lacing.
Rationale from (Lei et al., 2016): ["really cloudy lots", "yellow", "no", "no"]
Rationale from cooperative introspection model: [". looks", "no", "no"]
Rationale from our full model: ["cloudy", "lots", "pretty gross", "no lacing"]

Table 1: An example of the rationales extracted by different models on the sentiment analysis task of beer reviews (appearance aspect). Red words are human-labeled rationales. Details of the experiments can be found in Appendix B.

* Authors contributed equally to this work.
1 The code and data for our method are publicly available at https://github.com/Gorov/three_player_for_emnlp.
2 Therefore the selective rationalization approach is easier to adapt to new applications, such as question answering (Yu et al., 2018), image classification (Chen et al., 2018a) and medical image analysis (Yala et al., 2019).

1 Introduction

Rapidly expanding applications of complex neural models also bring forth criteria other than mere performance. For example, medical (Yala et al., 2019) and other high-value decision applications require some means of verifying the reasons for the predicted outcomes. This area of self-explaining models in the context of NLP applications has primarily evolved along two parallel tracks. On one hand, we can design neural architectures that expose more intricate mechanisms of reasoning, such as module networks (Andreas et al., 2016a,b; Johnson et al., 2017). While important, such approaches may still require adopting specialized designs and architectural choices that do not yet reach accuracies comparable to black-box approaches. On the other hand, we can impose limited architectural constraints in the form of selective rationalization (Lei et al., 2016; Li et al., 2016b; Chen et al., 2018a,b), where the goal is to expose only the portion of the text relevant for prediction. The selection is done by a separately trained model called the rationale generator. The resulting text selection can subsequently be used as an input to an unconstrained, complex predictor, i.e., architectures used in the absence of any rationalization.2 The main challenge of this track is how to properly coordinate the rationale generator with the powerful predictor operating on the selected information during training.

In this paper, we build on and extend selective rationalization. The selection process can be thought of as a cooperative game between the generator and the predictor operating on the selected, partial input text. The two players aim for the shared goal of achieving high predictive accuracy, just having to operate within the confines imposed by rationale selection (a small, concise portion of the input text). The rationales are learned entirely in an unsupervised manner, without any guidance other than their size and form. An example of ground-truth and learned rationales is given in Table 1.

The key motivation for our work arises from the potential failures of cooperative selection. Since the generator typically has no direct access to the outcome it aims to justify, the learning process may converge to a poorly performing solution. Moreover, since only the selected portion is evaluated for its information value (via the predictor), there is typically no explicit control over the remaining portion of the text left outside the rationale. These two challenges are complementary and should be addressed jointly.

Performance The clues in text classification tasks are typically short phrases (Zaidan et al., 2007). However, diverse textual inputs offer a sea of such clues that may be difficult to disentangle in a manner that generalizes to evaluation data. Indeed, the generator may fail to disentangle the information about the correct label, offering misleading rationales instead. Moreover, as confirmed by the experiments presented in this paper, the collaborative nature of the game may enable the players to select a sub-optimal communication code that does not generalize but rather overfits the training data. Regression tasks considered in prior work typically offer richer feedback to the generator, making it less likely that such communication patterns would arise.

We address these concerns by proposing an introspective rationale generator. The key idea is to force the generator to explicitly understand what to generate rationales for. Specifically, we make the label that would be predicted from the full text an additional input to the generator, thereby ensuring better overall performance.

Rationale quality The cooperative game setup does not explicitly control the information left out of the rationale. As a result, it is possible for the rationales to degenerate, containing only select words without the appropriate context. In fact, the introspective generator proposed above can aggravate this problem. With access to the predicted label as input, the generator and the predictor can find a communication scheme by encoding the predicted label with special word patterns (e.g., highlighting "." for positive examples and "," for negative ones). Table 1 shows such degenerate cases for the two cooperative methods.

In order to prevent degenerate rationales, we propose a three-player game that renders explicit control over the unselected parts. In addition to the generator and the predictor as in conventional cooperative rationale selection schemes, we add a third adversarial player, called the complement predictor, to regularize the cooperative communication between the generator and the predictor. The goal of the complement predictor is to predict the correct label using only words left out of the rationale. During training, the generator aims to fool the complement predictor while still maintaining high accuracy for the predictor. This ensures that the selected rationale must contain all or most of the information about the target label, leaving out irrelevant parts, within size constraints imposed on the rationales.

We also theoretically show that the equilibrium of the three-player game guarantees good properties for the extracted rationales. Moreover, we empirically show that (1) the three-player framework on its own helps cooperative games such as (Lei et al., 2016) to improve both predictive accuracy and rationale quality; (2) by combining the two solutions (the introspective generator and the three-player game) we can achieve high predictive accuracy and non-degenerate rationales.
2 Problem Formulation

This section formally defines the problem of rationalization and then proposes a set of conditions that desirable rationales should satisfy, which addresses the problems of previous cooperative frameworks. We use the following notation throughout this paper. Bolded upper-case letters, e.g. X, denote random vectors; unbolded upper-case letters, e.g. X, denote random scalar variables; bolded lower-case letters, e.g. x, denote deterministic vectors or vector functions; unbolded lower-case letters, e.g. x, denote deterministic scalars or scalar functions. p_X(·|Y) denotes the conditional probability density/mass function conditioned on Y. H(·) denotes Shannon entropy. E[·] denotes expectation.

2.1 Problem Formulation of Rationalization

The target application here is text classification on data tokens of the form {(X, Y)}. Denote X = X_{1:L} as a sequence of words in an input text with length L. Denote Y as a label. Our goal is to generate a rationale, denoted as r(X) = r_{1:L}(X), which is a selection of words in X that accounts for Y. Formally, r(X) is a hard-masked version of X that takes the following form at each position i:

    r_i(X) = z_i(X) · X_i,    (1)

where z_i ∈ {0, 1} is the binary mask. Many previous works (Lei et al., 2016; Chen et al., 2018a) follow the above definition of rationales. In this work, we further define the complement of the rationale, denoted as r^c(X), as

    r^c_i(X) = (1 − z_i(X)) · X_i.    (2)

For notational ease, define

    R = r(X),   R^c = r^c(X),   Z = z(X).    (3)

2.2 Rationale Conditions

An ideal rationale should satisfy the following conditions.

Sufficiency: R is sufficient to predict Y, i.e.

    p_Y(·|R) = p_Y(·|X).    (4)

Comprehensiveness: R^c does not contain sufficient information to predict Y, i.e.

    H(Y|R^c) ≥ H(Y|R) + h,    (5)

for some constant h.

Compactness: the segments of X that are included in R should be sparse and consecutive, i.e.,

    Σ_i Z_i ≤ s,   Σ_i |Z_i − Z_{i−1}| ≤ c,    (6)

for some constants s and c.

Here is an explanation of what each of these conditions means. The sufficiency condition is the core requirement of a legitimate rationale: it stipulates that the rationale maintains the same predictive power as X for predicting Y. The compactness condition stipulates that the rationale should be continuous and should not contain more words than necessary. For example, without the compactness condition, a trivial solution to Eq. (4) would be X itself. The first inequality in Eq. (6) constrains the sparsity of the rationale, and the second one constrains the continuity. The comprehensiveness condition requires some elaboration, which we discuss in the next subsection.

2.3 Comprehensiveness and Degeneration

There are two justifications for the comprehensiveness condition. First, it regulates the information outside the rationale, so that the rationale contains all the relevant and useful information, hence the name comprehensiveness. Second, and more importantly, we will show that there exists a failure case, called degeneration, which can only be prevented by the comprehensiveness condition.

Degeneration refers to the situation where, rather than finding words in X that explain Y, R attempts to encode the probability of Y using trivial information, e.g. punctuation and position. Consider the following toy example of binary classification (Y ∈ {0, 1}), where X can always perfectly predict Y. Then the following rationale satisfies sufficiency and compactness: R includes only the first word of X when Y = 0, and only the last word when Y = 1. It is obvious that this R is sufficient to predict Y (by looking at whether the first word or the last word is chosen), and thus satisfies the sufficiency condition. This R is also perfectly compact (only one word). However, this rationale does not provide a valid explanation.

Theoretically, any previous cooperative framework may suffer from the above ill-defined problem, if the generator has sufficient capacity to accurately guess Y.3 This problem happens because these frameworks have no control over the words unselected by R. Intuitively, in the presence of degeneration, some key predictors in X will be left unselected by R. Thus, by looking at the predictive power of R^c, we can determine whether degeneration occurs. Specifically, when degeneration is present, all the useful information is left unselected by R, and so H(Y|R^c) is low. That is why the lower bound in Eq. (5) rules out the degeneration cases.

3 We show such cases of (Lei et al., 2016) in Appendix B.
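To make the hard masking in Eqs. (1)-(3) and the compactness quantities in Eq. (6) concrete, here is a minimal PyTorch-style sketch. The function name split_rationale and the tensor shapes are our own illustration, not part of the paper's released code.

```python
import torch

def split_rationale(x_emb, z):
    """Apply a binary mask to token embeddings, as in Eqs. (1)-(2).

    x_emb: (batch, length, embed_dim) embeddings of X.
    z:     (batch, length) binary mask, 1 = selected as rationale.
    Returns (R, R^c): the hard-masked rationale and its complement.
    """
    z = z.unsqueeze(-1).float()           # (batch, length, 1)
    rationale = z * x_emb                 # r_i(X) = z_i(X) * X_i
    complement = (1.0 - z) * x_emb        # r^c_i(X) = (1 - z_i(X)) * X_i
    return rationale, complement

# Toy usage: 2 sentences of 5 tokens with 8-dim embeddings.
x_emb = torch.randn(2, 5, 8)
z = torch.tensor([[1, 0, 0, 1, 1],
                  [0, 1, 1, 0, 0]])
R, Rc = split_rationale(x_emb, z)
sparsity = z.float().sum(dim=1)                        # first quantity in Eq. (6)
continuity = (z[:, 1:] - z[:, :-1]).abs().sum(dim=1)   # second quantity in Eq. (6)
```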

[Figure 1 appears here: schematic diagrams of the generator (Gen), predictor (Pred), complement predictor (Pred^c), and introspective module in panels (a), (b), and (c).]

Figure 1: Illustration of different rationalization frameworks. (a) The cooperative framework from (Lei et al., 2016), which consists of two players, i.e., a generator (Gen) and a predictor (Pred). (b) A straightforward extension of model (a) with a complement predictor (Pred^c). The generator plays a cooperative game with the predictor and a minimax game with the complement predictor. (c) The introspective three-player framework. The introspective module first predicts the possible outcome ỹ based on the full text x and then generates rationales using both x and ỹ. The third framework (c) is a special case of (b). Such an inductive bias in model design preserves the predictive performance.

3 The Proposed Three-Player Models

This section introduces our new rationalization solutions, which are theoretically guaranteed to find rationales that satisfy Eqs. (4)-(6).

3.1 The Basic Three-Player Model

This section introduces our basic three-player model. The model consists of three players: a rationale generator that generates the rationale R and its complement R^c from the text, a predictor that predicts the probability of Y based on R, and a complement predictor that predicts the probability of Y based on R^c.

Figure 1(b) illustrates the basic three-player model. Compared with (Lei et al., 2016), as shown in Figure 1(a), the three-player model introduces an additional complement predictor, which plays a minimax game in addition to the cooperative game in (Lei et al., 2016). For clarity, we will describe the game backward, starting from the two predictors followed by the generator.

Predictors: The predictor estimates the probability of Y conditioned on R, denoted as p̂(Y|R). The complement predictor estimates the probability of Y conditioned on R^c, denoted as p̂^c(Y|R^c). Both predictors are trained using the cross-entropy loss, i.e.

    L_p = min_{p̂(·,·)} H(p(Y|R); p̂(Y|R)),
    L_c = min_{p̂^c(·,·)} H(p(Y|R^c); p̂^c(Y|R^c)),    (7)

where H(p; q) denotes the cross entropy between p and q, and p(·|·) denotes the empirical distribution. It is worth emphasizing that L_p and L_c are both functions of the generator.

Generator: The generator extracts R and R^c by generating the rationale mask z(·), as shown in Eqs. (1)-(2). Specifically, z(·) is determined by minimizing the weighted combination of four losses:

    min_{z(·)} L_p + λ_g L_g + λ_s L_s + λ_cont L_cont,    (8)

where L_g encourages the gap between L_p and L_c to be large, i.e.

    L_g = max{L_p − L_c + h, 0}.    (9)

It stipulates the comprehensiveness property of the rationale (Eq. (5)). Intuitively, if the complement rationale is less informative of Y than the rationale, then L_c should be larger than L_p. L_s and L_cont impose the sparsity and continuity constraints respectively, which correspond to Eq. (6):

    L_s = max{ Σ_i Z_i − s, 0 },
    L_cont = max{ Σ_i |Z_i − Z_{i−1}| − c, 0 }.    (10)

From Eq. (7), we can see that the generator plays a cooperative game with the predictor, because both try to maximize the predictive performance of R. On the other hand, the generator plays an adversarial game with the complement predictor, because the latter tries to maximize the predictive performance of R^c while the former tries to reduce it. Without the complement predictor, and thus the loss L_g, the framework reduces to the method in (Lei et al., 2016).
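The generator objective in Eq. (8) can be assembled from standard building blocks. The sketch below is our own minimal rendering, assuming the two predictors are ordinary classifiers returning logits; the names (generator_objective, lambda_g, lambda_cont, the word budget s, and the continuity budget c) are illustrative and not taken from the released implementation.

```python
import torch
import torch.nn.functional as F

def generator_objective(logits_r, logits_rc, y, z,
                        s=20.0, c=4.0, h=0.1,
                        lambda_g=1.0, lambda_s=1.0, lambda_cont=1.0):
    """Weighted combination of the four losses in Eq. (8).

    logits_r:  predictor outputs on the rationale R, shape (batch, classes).
    logits_rc: complement predictor outputs on R^c, shape (batch, classes).
    y:         gold labels, shape (batch,).
    z:         binary rationale mask, shape (batch, length).
    """
    L_p = F.cross_entropy(logits_r, y)          # Eq. (7), predictor loss
    L_c = F.cross_entropy(logits_rc, y)         # Eq. (7), complement predictor loss
    L_g = torch.clamp(L_p - L_c + h, min=0.0)   # Eq. (9), comprehensiveness gap
    z = z.float()
    # Eq. (10): hinge penalties on the word budget s and the number of transitions c.
    L_s = torch.clamp(z.sum(dim=1) - s, min=0.0).mean()
    L_cont = torch.clamp((z[:, 1:] - z[:, :-1]).abs().sum(dim=1) - c, min=0.0).mean()
    return L_p + lambda_g * L_g + lambda_s * L_s + lambda_cont * L_cont
```

The predictors themselves are updated by minimizing L_p and L_c over their own parameters, while the generator treats the negative of this objective as its policy-gradient reward, as described next.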

Training During training, the three players perform gradient descent steps with respect to their own losses. For the generator, since z(X) is a set of binary variables, we cannot apply the regular gradient descent algorithm. Instead, we use policy gradient (Williams, 1992) to optimize the model. We maximize the reward that is defined as the negative loss in Eq. (8). In order to have bounded rewards for training stability, the negative losses L_p and L_c are replaced with accuracy.

Theoretical guarantees The proposed framework is able to obtain a rationale that simultaneously satisfies the conditions in Eqs. (4) to (6), as stated in the following theorem:

Theorem 1. A rationalization scheme z(X) that simultaneously satisfies Eqs. (4)-(6) is the global optimizer of Eq. (8).

The proof is given in Appendix A. The basic idea is that there is a correspondence between each term in Eq. (8) and each of the properties in Eqs. (4)-(6). The minimization of each loss term is equivalent to satisfying the corresponding property.

3.2 The Introspection Generator

As discussed in Section 1, in the existing generator-predictor framework the generator may fail to disentangle the information about the correct label, offering misleading rationales instead. To address this problem, we propose a new generator module, called the Introspective Generator, which explicitly predicts the label before making rationale selections.

Figure 1(c) illustrates the model with the introspection generator. Specifically, the improved generator still fits into the basic three-player framework in Section 3.1. The only difference is in how the generator generates the mask z(X), which now breaks down into two steps.

First, the module uses a regular classifier that takes the input X and predicts the label, denoted ỹ(X). For classification tasks, we use the maximum likelihood estimate, i.e.

    ỹ(X) = argmax_y p̃(Y = y|X),    (11)

where p̃(Y = y|X) is the probability predicted by a classifier that is pretrained by minimizing the cross entropy.

Second, a label-aware rationale generator generates the binary mask of the rationales, i.e.

    z(X) = z̃(X, ỹ(X)).    (12)

Note that ỹ is a function of X, so the entire introspective generator is essentially a function of X. In this way, all the formulations in Section 3.1 and Theorem 1 still hold for the three-player game with the introspective generator.

In the implementation, the classifier can use the same architecture as the predictor and the complement predictor. The generator has the same architecture as in Section 3.1, but with the additional input of ỹ(X).

Remark on degeneration Obviously, when working in a cooperative game, the introspection generator will make the degeneration problem more severe: when the classifier p̃(·|X) becomes sufficiently accurate during training, the generator only needs to encode the information of ỹ into R. Therefore our three-player game, while improving over any existing generator-predictor framework on its own, is critical for the introspective model.
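Below is a minimal sketch of the two-step introspective generator in Eqs. (11)-(12), assuming bidirectional LSTM encoders (the architecture described later in Section 4.2). The class name IntrospectiveGenerator and the exact way the label embedding seeds the labeler's hidden state are our own simplification, not the released code.

```python
import torch
import torch.nn as nn

class IntrospectiveGenerator(nn.Module):
    """Two-step generator: predict y~ from X, then tag a mask z(X, y~)."""

    def __init__(self, vocab_size, num_classes, emb=100, hidden=400):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        # Step 1: a pretrained classifier p~(Y|X), Eq. (11).
        self.cls_enc = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.cls_out = nn.Linear(2 * hidden, num_classes)
        # Step 2: a label-aware sequence labeler z~(X, y~), Eq. (12).
        self.label_emb = nn.Embedding(num_classes, 2 * hidden)
        self.tag_enc = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.tag_out = nn.Linear(2 * hidden, 2)  # keep / drop score per token

    def forward(self, x):
        e = self.embed(x)                                  # (B, L, emb)
        h, _ = self.cls_enc(e)
        y_tilde = self.cls_out(h.mean(dim=1)).argmax(-1)   # Eq. (11)
        # The label embedding initializes the labeler's hidden state.
        y_vec = self.label_emb(y_tilde)                    # (B, 2*hidden)
        h0 = y_vec.view(x.size(0), 2, -1).transpose(0, 1).contiguous()
        c0 = torch.zeros_like(h0)
        s, _ = self.tag_enc(e, (h0, c0))
        z_logits = self.tag_out(s)                         # per-token keep/drop scores
        z = z_logits.argmax(-1)                            # hard mask, Eq. (12)
        return z, z_logits, y_tilde
```

During training, z would be sampled from the per-token distribution over z_logits and optimized with policy gradient as described above; the argmax here corresponds to inference.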
4 Experimental Settings

4.1 Datasets

We construct three text classification tasks, including two sentiment classification tasks from the BeerAdvocate review dataset (McAuley et al., 2012)4 and a more complex relation classification task from SemEval 2010 Task 8 (Hendrickx et al., 2009). Table 2 gives examples of these tasks. Finally, as suggested by an anonymous reviewer, we evaluate on the text matching benchmark AskUbuntu, following Lei et al. (2016). The experimental setting and results are reported in Appendix F.

4 http://snap.stanford.edu/data/web-BeerAdvocate.html

Task: Multi-Aspect Beer Review
Label: positive (regarding the aspect of appearance)
Input Text: Clear, burnished copper-brown topped by a large beige head that displays impressive persistance and leaves a small to moderate amount of lace in sheets when it eventually departs. The nose is sweet and spicy and the flavor is malty sweet, accented nicely by honey and by abundant caramel/toffee notes. There's some spiciness (especially cinnamon), but it's not overdone. The finish contains a moderate amount of bitterness and a trace of alcohol. The mouthfeel is exemplary ···

Task: Single-Aspect Beer Review
Label: positive
Input Text: appearance : dark-brown/black color with a huge tan head that gradually collapses , leaving thick lacing .

Task: Relation Classification
Label: Message-Topic(e1,e2)
Input Text: It was a friendly call_e1 to remind them about the bill_e2 and make sure they have a copy of the invoice

Table 2: Examples of the three tasks used for evaluation. The red bold words are sample rationales (created by the authors for illustration purposes only for the last two tasks).

Multi-aspect beer review This is the same data used in (Lei et al., 2016). Each review evaluates multiple aspects of a beer, including appearance, smell, palate, and an overall score. For each aspect, a rating ∈ [0, 1] is labeled. We limit ourselves to the appearance aspect only and use a threshold of 0.6 to create a balanced binary classification task. The task then becomes predicting the appearance sentiment of a beer based on the multi-aspect text input. The advantage of this dataset is that it enables automatic evaluation of rationale extraction. The dataset provides sentence-level annotations on about 1,000 reviews, where each sentence is labeled by the aspect it covers. Note that in this dataset, each aspect is often described by a single sentence with clear polarity. Thus, a generator can select a sentence based on the topic distribution of words. The selected sentence often has very high overlap with the ground-truth annotations and also contains sufficient information for predicting the sentiment. This characteristic makes it a relatively easy task, and thus we further consider two more challenging tasks.

Single-aspect beer review To construct a more challenging task, for each review we extract the sentences that are specifically about the appearance aspect from the aforementioned BeerAdvocate dataset5, and the task is to predict the appearance sentiment from only the extracted sentences. The details of the dataset construction can be found in Appendix C. This new dataset is obviously more challenging in terms of generating meaningful rationales, because the generator is required to select more fine-grained rationales. Since there are no rationale annotations, we rely on subjective evaluations to test the quality of the extracted rationales.

5 We will release the single-aspect dataset.
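As a small illustration of the task construction described above, the following sketch binarizes the [0, 1] appearance ratings with the 0.6 threshold; the field names (score, text) are hypothetical and only for illustration.

```python
def binarize_reviews(reviews, threshold=0.6):
    """Turn [0, 1] appearance ratings into binary sentiment labels.

    Each review is assumed to be a dict like {"text": str, "score": float}.
    Class balancing (e.g., subsampling the majority class) would be applied afterwards.
    """
    data = []
    for review in reviews:
        label = 1 if review["score"] >= threshold else 0
        data.append({"text": review["text"], "label": label})
    return data
```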

Multi-class relation classification To show the generalizability of our proposed approaches to other NLP applications, we further consider the SemEval 2010 Task 8 dataset (Hendrickx et al., 2009). Given two target entities e1 and e2 in a sentence, the goal is to classify the relation type (with direction) between the two entities (including None if there is no relation). Similar to the single-aspect beer review, this is also a fine-grained rationalization task. The major difference is that the number of class labels in this task is much larger, and we want to investigate its effect on degeneration and performance degradation.

4.2 Implementation Details

For both the generators and the two predictors, we use bidirectional LSTMs with hidden dimension 400. In the introspection generator, the classifier consists of the same bidirectional LSTM, and z(X, ỹ) is implemented as an LSTM sequential labeler with the label ỹ transformed to an embedding vector that serves as the initial hidden state of the LSTM. For the relation classification task, since the model needs to be aware of the two target entities, we add relative position features following Nguyen and Grishman (2015). We map the relative position features to learnable embedding vectors and concatenate them with the word embeddings as the inputs to the LSTM encoder of each player. All hyper-parameters are tuned on the development sets according to predictive accuracy. In other words, all the models are tuned without seeing any rationale annotations.
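The relative-position features for relation classification can be sketched as follows: each token's distance to e1 and to e2 is embedded and concatenated with the word embedding. The clipping range and dimensions below are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn

class WordPositionEncoder(nn.Module):
    """Concatenate word embeddings with embeddings of each token's relative
    positions to the two target entities (following Nguyen and Grishman, 2015)."""

    def __init__(self, vocab_size, word_dim=100, pos_dim=25, max_dist=50):
        super().__init__()
        self.max_dist = max_dist
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        # Distances are clipped to [-max_dist, max_dist] and shifted to be non-negative.
        self.pos_emb1 = nn.Embedding(2 * max_dist + 1, pos_dim)
        self.pos_emb2 = nn.Embedding(2 * max_dist + 1, pos_dim)

    def forward(self, tokens, e1_idx, e2_idx):
        # tokens: (B, L) word ids; e1_idx, e2_idx: (B,) entity head positions.
        B, L = tokens.shape
        pos = torch.arange(L, device=tokens.device).unsqueeze(0).expand(B, L)
        d1 = (pos - e1_idx.unsqueeze(1)).clamp(-self.max_dist, self.max_dist) + self.max_dist
        d2 = (pos - e2_idx.unsqueeze(1)).clamp(-self.max_dist, self.max_dist) + self.max_dist
        return torch.cat([self.word_emb(tokens),
                          self.pos_emb1(d1),
                          self.pos_emb2(d2)], dim=-1)  # (B, L, word_dim + 2*pos_dim)
```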

5 Experiments

5.1 Multi-Aspect Beer Review

Model       10% Highlighting (Acc / Prec)   20% Highlighting (Acc / Rec)
Random      65.70 / 17.75                   72.31 / 19.84
Lei2016     82.05 / 86.14                   83.88 / 79.98
+minimax    83.45 / 86.54                   84.25 / 85.16
Intros      87.10 / 68.37                   87.50 / 59.63
+minimax    86.16 / 85.67                   86.86 / 79.40

Table 3: Main results of binary classification on multi-aspect beer reviews. The Acc columns show the prediction accuracy. Prec and Rec are the precision and recall of the extracted rationales compared to the human annotations. The accuracy of sentiment prediction with the full texts is 87.59.

Table 3 summarizes the main results on the multi-aspect beer review task. In this task, a desirable rationale should be both appearance-related and sufficient for sentiment prediction. Since the average sparsity level of human-annotated rationales is about 18%, we consider the following evaluation settings. Specifically, we compare the generated rationales to the human annotations by measuring the precision when extracting 10% of the words and the recall when extracting 20%.6 In addition to precision/recall, we also report the predictive accuracy of the extracted rationales on predicting the sentiment.

6 Previous work (Lei et al., 2016) only evaluates the precision for the 10% extraction.

When only 10% of the words are used, Lei et al. (2016) has a significant performance downgrade compared to the accuracy when using the whole passage (82.05 vs. 87.59). With the additional third player, the accuracy is slightly improved, which validates that controlling the unselected words improves the robustness of rationales. On the other hand, our introspection models are able to maintain higher predictive accuracy (86.16 vs. 82.05) compared to (Lei et al., 2016), while only sacrificing a little highlighting precision (a 0.47% drop).

Similar observations are made when 20% of the words are required to be highlighted, with one exception. Comparing the model of (Lei et al., 2016) with and without the proposed minimax module, there is a huge gap of more than 5% on the recall of the generated rationales. This confirms the motivation that the original cooperative game tends to generate less comprehensive rationales, whereas the three-player framework controls the unselected words to be less informative, so the recall is significantly improved.

It is worth mentioning that when a classifier is trained with randomly highlighted rationales (i.e. random dropout (Srivastava et al., 2014) on the inputs), it performs significantly worse on both predictive accuracy and highlighting quality. This confirms that extracting concise and sufficient rationales is not a trivial task. Moreover, our re-implementation of (Lei et al., 2016) for the original regression task achieves 90.1% precision when highlighting 10% of the words, which suggests that rationalization for binary classification is more challenging than for regression, where finer supervision is available.

In summary, our three-player framework consistently improves the quality of extracted rationales for both the original (Lei et al., 2016) framework and the introspective framework. In particular, without controlling the unselected words, the introspection model experiences a serious degeneration problem, as expected. With the three-player framework, it manages to maintain both high predictive accuracy and high rationalization quality.
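The rationale-quality metrics in Table 3 are token-level precision and recall against the human sentence annotations; a minimal sketch of that computation (our own formulation of the protocol described above, not the authors' evaluation script) is:

```python
def rationale_precision_recall(pred_mask, gold_mask):
    """Token-level precision/recall of a predicted rationale mask
    against a human-annotated mask (both lists of 0/1 per token)."""
    tp = sum(p and g for p, g in zip(pred_mask, gold_mask))
    selected = sum(pred_mask)
    relevant = sum(gold_mask)
    precision = tp / selected if selected else 0.0
    recall = tp / relevant if relevant else 0.0
    return precision, recall

# Example: 10 tokens, 3 selected by the model, 4 annotated as appearance-related.
pred = [0, 1, 1, 0, 0, 0, 1, 0, 0, 0]
gold = [0, 1, 1, 1, 0, 0, 1, 0, 0, 0]
print(rationale_precision_recall(pred, gold))  # (1.0, 0.75)
```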
dent extra predictor on the unselected words from random dropout (Srivastava et al., 2014) on the in- the generator, which does not affect the training of puts), it performs significantly worse on both pre- the generator-predictor framework. dictive accuracy and highlighting qualities. This From the left part of Table4, we observe that the confirms that extracting concise and sufficient ra- original model of (Lei et al., 2016) has a hard time tionales is not a trivial task. Moreover, our re- maintaining the accuracy compared to the classi- implementation of (Lei et al., 2016) for the origi- fier trained with full texts. Transforming it into nal regression task achieves 90.1% precision when a three-player game helps to improve the perfor- highlighting 10% words, which suggest that ratio- mance of the evaluation set while lower the ac- nalization for binary classification is more chal- curacy of the complement predictor. Since the lenging compared to the regression where finer su- extracted rationales are in a similar length, these pervision is available. results suggest that learning with a three-player In summary, our three-player framework con- game successfully enforces the generator not to sistently improves the quality of extracted ratio- leave informative words unselected. Similar to nales on both of the original (Lei et al., 2016) and the multi-aspect results, the cooperative introspec- the introspective framework. Particularly, without tion model suffers from the degeneration problem, controlling the unselected words, the introspection which produces a high complement accuracy. The model experiences a serious degeneration problem three-player game yields a more comprehensive as expected. With the three-player framework, it rationalization with small losses in accuracy. manages to maintain both high predictive accuracy Human evaluation We further conduct subjec- and high quality of rationalization. tive evaluations by comparing the original model of (Lei et al., 2016) with our introspective three- 6Previous work in (Lei et al., 2016) only evaluates the pre- player model. We mask the extracted rationales cision for the 10% extraction. and present the unselected words only to the hu- Model %UNK Acc Accw/o UNK Original Text: of the hundreds of strains of avian influenza a viruses , only four have caused human Lei2016 43.5 63.5 69.0 e1 infectionse : h5n1 , h7n3 , h7n7 , and h9n2 Intros+minimax 54.0† 58.0 66.3 2 Label: Cause-Effect(e1,e2) Lei et al.(2016) : Table 5: Subjective evaluations on the task of controlling [virusese1 , only four have caused] the unselected rationale words. Acc denotes the accuracy in Our Intros+minimax: guessing sentiment labels. Accw/o UNK denotes the sentiment [viruses ], [four have caused human infections ] accuracy for these samples that are not selected as “UNK” e1 e2 for the secondary task. † denotes p-value < 0.005 in t-test. Original Text: i spent a year working for a softwaree A desired rationalization method achieves high “UNK” rate 1 companye2 to pay off my college loans and performance randomly for the Acc predictions. Label: Product-Producer(e1,e2) Lei et al.(2016) :

[a softwaree1 companye2 to] man evaluators. The evaluators are asked to pre- Our Intros+minimax: dict the sentiment label as their first task. If a ratio- [working for a softwaree1 companye2 ], [loans] nalizing method successfully includes all informa- tive pieces in the rationale, subjects should have Table 6: Illustrative examples of generated rationales on the relation classification task. Entities are shown in bold. around 50% of accuracy in guessing the label. In addition, after providing the sentiment label, sub- jects are then asked to answer a secondary ques- pervised signal, i.e., the number of class labels is tion, which is whether the provided text spans are large, the chance of any visible degeneration of the sufficient for them to predict the sentiment. If they Lei et al.(2016)’s model should be low. However, believe there are not enough clues and their sen- we still spot quite a few cases. In the first example, timent classification is based on a random guess, Lei et al.(2016) fails to highlight the second entity they are instructed to select unknown (denoted as while ours does. In the second example, the in- “UNK”) as the answer to the second question. trospective three-player model selects more words AppendixE elaborates why we design subjective than (Lei et al., 2016). In this case, the two entities evaluations in such a way in more details. themselves suffice to serve as the rationales. How- Table5 shows the performance of subjective ever, our model preserves the words like “work- evaluations. Looking at the first column of the ta- ing”. This problem might due to the bias of the ble, our model is better in confusing human, which dataset. For example, some words that are not rel- gives a higher rate in selecting “UNK”. It con- evant to the target entities may still correlate with firms that the three-player introspective model se- the labels. In the case, our model will pick these lects more comprehensive rationales and leave less words as a part of the rationale. informative texts unattended. Furthermore, the re- 6 Related Work sults also show that human evaluators offer worse sentiment predictions on the proposed approach, Model interpretability Besides the two major which is also desired and expected. categories of self-explaining models discussed in Section1, model interpretability is widely stud- 5.3 Relation Classification ied in the general machine learning field. For ex- The predictive performances on the relation clas- ample, evaluating feature importance with gradi- sification task are shown in the right part of the Ta- ent information (Simonyan et al., 2013; Li et al., ble4. We observe consistent results as in previous 2016a; Sundararajan et al., 2017) or local pertur- datasets. Clearly, the introspective generator helps bations (Kononenko et al., 2010; Lundberg and the accuracy and the three-player game regularize Lee, 2017); and interpreting deep networks by lo- the complement of the rationale selections. cally fitting interpretable models (Ribeiro et al., 2016; Alvarez-Melis and Jaakkola, 2018). Examples of the extracted rationales For rela- Besides selective rationalization, the coopera- tion classification, it is difficult to conduct subjec- tive game has been studied in the latter direction tive evaluations because the task requires people above (Lee et al., 2018). It has also been applied to have sufficient knowledge of the schema of rela- to a relevant problem on summarization (Aru- tion annotation. 
To further demonstrate the quality mae and Liu, 2019), where the selected summary of generated rationales, we provide some illustra- should be sufficient for answering questions re- tive examples. Since there is a rich form of su- lated to the document. In this problem, the sum- mary is a special type of rationale of a document. Lucian Busoniu, Robert Babuska, and Bart De Schut- Another related concurrent work (Bastings et al., ter. 2006. Multi-agent reinforcement learning: A 2019) proposes differentiable solution to optimize survey. In 2006 9th International Conference on Control, Automation, Robotics and Vision, pages 1– the cooperative rationalization method. 6. IEEE.

Game-theoretical methods Though not explored much for self-explaining models, the minimax game setup has been widely used in many machine learning problems, such as self-play for chess (Silver et al., 2017), generative models (Goodfellow et al., 2014), and many tasks that can be formulated as multi-agent reinforcement learning (Busoniu et al., 2006). Our three-player game framework also shares a similar idea with (Zhang et al., 2017; Zhao et al., 2017), which aim to learn domain-invariant representations with both cooperative and minimax games.

7 Conclusion

We proposed a novel framework for improving the predictive accuracy and comprehensiveness of selective rationalization methods. This framework (1) addresses the degeneration problem in previous cooperative frameworks by regularizing the unselected words via a three-player game; and (2) augments the conventional generator with introspection, which better maintains performance on downstream tasks. Experiments with both automatic evaluation and subjective studies confirm the advantages of our proposed framework.

References

David Alvarez-Melis and Tommi S Jaakkola. 2018. Towards robust interpretability with self-explaining neural networks. arXiv preprint arXiv:1806.07538.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016a. Learning to compose neural networks for question answering. In Proceedings of NAACL-HLT, pages 1545–1554.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016b. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48.

Kristjan Arumae and Fei Liu. 2019. Guiding extractive summarization with question-answering rewards. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2566–2577.

Joost Bastings, Wilker Aziz, and Ivan Titov. 2019. Interpretable neural predictions with differentiable binary variables. arXiv preprint arXiv:1905.08160.

Lucian Busoniu, Robert Babuska, and Bart De Schutter. 2006. Multi-agent reinforcement learning: A survey. In 2006 9th International Conference on Control, Automation, Robotics and Vision, pages 1–6. IEEE.

Jianbo Chen, Le Song, Martin Wainwright, and Michael Jordan. 2018a. Learning to explain: An information-theoretic perspective on model interpretation. In International Conference on Machine Learning, pages 882–891.

Jianbo Chen, Le Song, Martin J Wainwright, and Michael I Jordan. 2018b. L-Shapley and C-Shapley: Efficient model interpretation for structured data. arXiv preprint arXiv:1808.02610.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2009. SemEval-2010 Task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, pages 94–99. Association for Computational Linguistics.

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. 2017. Inferring and executing programs for visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2989–2998.

Igor Kononenko et al. 2010. An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research, 11(Jan):1–18.

Guang-He Lee, David Alvarez-Melis, and Tommi S Jaakkola. 2018. Game-theoretic interpretability for temporal modeling. arXiv preprint arXiv:1807.00130.

Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 107–117.

Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. 2016a. Visualizing and understanding neural models in NLP. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 681–691.

Jiwei Li, Will Monroe, and Dan Jurafsky. 2016b. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220.

Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774.

Julian McAuley, Jure Leskovec, and Dan Jurafsky. 2012. Learning attitudes and attributes from multi-aspect reviews. In 2012 IEEE 12th International Conference on Data Mining, pages 1020–1025. IEEE.

Thien Huu Nguyen and Ralph Grishman. 2015. Combining neural networks and log-linear models to improve relation extraction. arXiv preprint arXiv:1511.05926.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM.

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. 2017. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3319–3328. JMLR.org.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.

Adam Yala, Constance Lehman, Tal Schuster, Tally Portnoi, and Regina Barzilay. 2019. A deep learning mammography-based model for improved breast cancer risk prediction. Radiology, page 182716.

Mo Yu, Shiyu Chang, and Tommi S Jaakkola. 2018. Learning corresponded rationales for text matching.

Omar Zaidan, Jason Eisner, and Christine Piatko. 2007. Using annotator rationales to improve machine learning for text categorization. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 260–267.

Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2017. Aspect-augmented adversarial networks for domain adaptation. Transactions of the Association for Computational Linguistics, 5:515–528.

Mingmin Zhao, Shichao Yue, Dina Katabi, Tommi S Jaakkola, and Matt T Bianchi. 2017. Learning sleep stages from radio signals: A conditional adversarial architecture. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 4100–4109. JMLR.org.

A Proof of Theorem 1

This appendix provides the proof of Theorem 1. First, we will need the following lemma.

Lemma 1.1. If the predictors have sufficient representation power, we have

    L_p = H(Y|R),   L_c = H(Y|R^c).    (13)

The proof is trivial, noticing that cross entropy is lower bounded by entropy.

The following lemmas show that there is a correspondence between the rationale properties in Eqs. (4)-(6) and the loss terms in Eq. (8).

Lemma 1.2. A rationalization scheme z(X) that satisfies Eq. (4) is the global minimizer of L_p as defined in Eq. (7).

Proof. Notice that

    L_p = H(Y|R) = H(Y|r(X)) ≥ H(Y|X).    (14)

The first equality is given by Lemma 1.1; the second equality is given by Eq. (3). The inequality holds because r(X) is a function of X, and it becomes an equality if and only if

    p_Y(·|r(X)) = p_Y(·|X),    (15)

which is Eq. (4).

Lemma 1.3. A rationalization scheme z(X) that satisfies Eq. (5) is the global minimizer of L_g as defined in Eq. (9).

Proof. According to Lemma 1.1, L_g can be rewritten as

    L_g = max{H(Y|R) − H(Y|R^c) + h, 0},    (16)

which equals 0 if and only if Eq. (5) holds.

Lemma 1.4. A rationalization scheme z(X) that satisfies Eq. (6) is the global minimizer of L_s and L_cont as defined in Eq. (10).

Proof. The proof is obvious. L_s and L_cont are 0 if and only if Eq. (6) holds.

Combining Lemmas 1.2 to 1.4 completes the proof of Theorem 1.
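For completeness, the "trivial" step behind Lemma 1.1 can be spelled out with the standard decomposition of cross entropy into entropy plus a KL term. The short LaTeX derivation below is our own elaboration (the expectation over R is left implicit, following the paper's notation), not text from the paper.

```latex
% Eq. (13): the best achievable predictor loss equals the conditional entropy.
\begin{align*}
H\bigl(p(Y\mid R);\,\hat p(Y\mid R)\bigr)
  &= H(Y\mid R) + \mathrm{KL}\bigl(p(Y\mid R)\,\Vert\,\hat p(Y\mid R)\bigr)
   \;\ge\; H(Y\mid R),\\
\min_{\hat p}\, H\bigl(p(Y\mid R);\,\hat p(Y\mid R)\bigr)
  &= H(Y\mid R), \qquad \text{attained at } \hat p = p .
\end{align*}
```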

B Experimental Setup of Examples in Table 1 and Degeneration Cases of (Lei et al., 2016)

This section provides the details for obtaining the results in Table 1 in the introduction, where the method of (Lei et al., 2016) generates degenerate rationales.

The method of (Lei et al., 2016) works well in many applications. However, as discussed in Sections 1 and 2.2, all cooperative rationalization approaches may suffer from the problem of degeneration. In this section, we design an experiment to confirm the existence of the problem in the original (Lei et al., 2016) model. We use the same single-aspect beer review data constructed from (McAuley et al., 2012), as described in Appendix C. Instead of constructing a balanced binary classification task, we set the samples with scores higher than 0.5 as positive examples. On this task, the prediction model with full inputs achieves 82.3% accuracy on the development set.

During the training of (Lei et al., 2016), we stipulate that the generated rationales must be very concise: we penalize rationales that have more than 3 pieces or in which more than 20% of the words are selected (both with hinge losses). From the results, we can see that Lei et al. (2016) tends to predict color words, like dark-brown or yellow, as rationales. This is a clue of degeneration, since most of the appearance reviews start by describing colors. A degenerate generator can therefore learn to split the vocabulary of colors and communicate with the predictor by using some of the colors for the positive label and others for the negative label. Such a learned generator also fails to generalize well, given the significant performance decrease (76.4% vs. 82.3%). By comparison, our method with the three-player game achieves both higher accuracy and more meaningful rationales.

Original Text (positive): dark-brown/black color with a huge tan head that gradually collapses , leaving thick lacing.
Rationale from (Lei et al., 2016) (Acc: 76.4%): ["dark-brown/black color"]
Rationale from our method (Acc: 80.4%): ["huge tan", "thick lacing"]

Original Text (negative): really cloudy, lots of sediment , washed out yellow color . looks pretty gross, actually , like swamp water . no head, no lacing.
Rationale from (Lei et al., 2016) (Acc: 76.4%): ["really cloudy lots", "yellow", "no", "no"]
Rationale from our method (Acc: 80.4%): ["cloudy", "lots", "pretty gross", "no lacing"]

Table 8: Examples of rationales extracted by different models, where (Lei et al., 2016) gives degenerate results.

C Data Construction of the Single-Aspect Beer Reviews

This section describes how we construct the single-aspect review task from the multi-aspect beer review dataset (McAuley et al., 2012).

In many multi-aspect beer reviews, we can see clear patterns indicating the aspect of the following sentences. For example, sentences starting with "appearance:" or "a:" are likely to be a review of the appearance aspect, and sentences starting with "smell:" or "nose:" are likely to be about the aroma aspect.

We extract all the "X:" patterns and count their frequencies, where each X is a word. The patterns "X:" with a frequency higher than 400 are kept as anchor patterns. The sentences between two anchor patterns "X1: ··· X2:" are very likely the review regarding the aspect of X1. Finally, we extract the review sentences after "appearance:" or "a:" and before the immediately subsequent anchor pattern as the single-aspect review for the appearance aspect. Each such instance is regarded as a new single-aspect review. The score of the appearance aspect of the original multi-aspect review is regarded as the score of this new review.

With this automatically constructed dataset, we form our balanced single-review binary classification task (see Section 4.1 and Appendix D), on which our base predictor model (with all the words as inputs) achieves 87.1% on the development set. This is as high as the number we achieved on the multi-aspect task for the same aspect (87.6%). This result indicates that the noise introduced by our data construction method is insignificant.
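A small sketch of the anchor-pattern procedure described above (count "X:" prefixes, keep those with frequency above 400, and slice out the appearance span); the tokenization and helper names are our own simplification, not the released preprocessing script.

```python
from collections import Counter

APPEARANCE_ANCHORS = {"appearance:", "a:"}

def find_anchor_patterns(reviews, min_freq=400):
    """Count 'X:' tokens across reviews and keep the frequent ones as anchors."""
    counts = Counter(tok.lower() for review in reviews
                     for tok in review.split()
                     if tok.endswith(":") and len(tok) > 1)
    return {pat for pat, freq in counts.items() if freq > min_freq}

def extract_appearance_span(review, anchors):
    """Return the text between an appearance anchor and the next anchor, if any."""
    toks = review.split()
    lowered = [t.lower() for t in toks]
    for i, tok in enumerate(lowered):
        if tok in APPEARANCE_ANCHORS and tok in anchors:
            end = next((j for j in range(i + 1, len(toks)) if lowered[j] in anchors),
                       len(toks))
            return " ".join(toks[i + 1:end])
    return None
```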

D Data Statistics

Table 7 summarizes the statistics of the three datasets used in the experiments. The single-aspect sentiment classification and the relation classification use development sets randomly held out from the original training sets.

Dataset                                   # Classes   Vocab Size   # Train   # Dev    # Annotation/Test
Multi-aspect sentiment classification     2           110,985      80,000    10,000   994
Single-aspect sentiment classification    2           12,043       12,000    1,362    1,695
Relation classification                   19          23,446       7,000     1,000    2,717

Table 7: Statistics of the datasets used in this paper.

E Experiment Designs for the Human Study

This section explains how we designed the human study. The goal is to evaluate how unpredictable the input texts become after the rationales are removed. To this end, we mask the original texts with the rationales generated by (Lei et al., 2016) and by our method. Each rationale word is masked with the symbol '*'. The masked texts from the different methods are mixed and shuffled, so the evaluators cannot know which system an input was generated from.

We have two human evaluators who are not authors of the paper. During evaluation, an evaluator is presented with one masked text and asked to try her/his best to predict its sentiment label. If a rationalizing method successfully includes all informative pieces in the rationale, subjects should have around 50% accuracy in guessing the label.

After the evaluator provides a sentiment label, they are asked to answer a second question about whether the provided text spans are sufficient for them to predict the sentiment. If they believe there are not enough clues and their sentiment classification is based on a random guess, they are instructed to input a UNK label as the answer to the second question.

The reason we ask the evaluators to provide predicted labels first is the following: if the task were to directly annotate whether the masked texts are unpredictable, the annotators would tend to assign more UNK labels to save time, and the ratios of UNK labels would be biased. Our experimental design alleviates this problem, since the evaluators are always required to try their best to guess the labels first. They therefore spend more time thinking about the possible labels instead of immediately assigning a UNK label.

On a small subset of 50 examples, the inter-annotator agreement is 76% on the UNK labels.
F Additional Experiments on AskUbuntu

Setting Following the suggestion from the reviewers, we evaluate the proposed method on the question retrieval task on AskUbuntu (Lei et al., 2016). AskUbuntu is a non-factoid question retrieval benchmark; the goal is to retrieve the most relevant questions given an input question. We use the same data split provided by (Lei et al., 2016).7 Each question consists of two parts, the question title and the question body. The former summarizes a problem with using Ubuntu, while the latter contains the detailed description. In our experiments, we follow the same setting as (Lei et al., 2016) by only using the question bodies. Different from their work, we do not pre-train an encoder by predicting a question title from the corresponding question body. This is because the question title can be considered a rationale of its question body, which might result in potential information leaks to our main rationalization task.

7 https://github.com/taolei87/askubuntu.

Method We formulate question retrieval as a pairwise classification task. Given two questions (i.e., a query and a candidate question), we aim to classify them as a positive pair if they are relevant, and vice versa. We consider the same generator architecture as used in Section 5, in a siamese setting, to extract rationales from questions. The predictor and the complement predictor make their predictions based on the pairwise selected spans. We believe this is the most straightforward way to adapt the proposed framework to the AskUbuntu task. There could be sophisticated, task-specific rationalization approaches to improve the performance on AskUbuntu (e.g., using a ranking model instead of a classification model); however, new designs of the introspective modules would also be required. We leave these investigations to future work.
Implementation Details We consider the following three-step training strategy: 1) pre-train a classifier with the full text; 2) fix the pre-trained classifier, which is used for both the predictor and the complement predictor in the three-player game, and pre-train the rationale generators; and 3) fine-tune all modules end-to-end. This pipeline significantly stabilizes the training and provides better performance.8 We use the same word embeddings as released by (Lei et al., 2016).

Results Table 9 summarizes the results. We observe patterns similar to the previous datasets. The original model from (Lei et al., 2016) fails to maintain the performance compared to the model trained with full texts. Adding the proposed minimax game helps both (Lei et al., 2016) and the introspection model to generate more informative texts as rationales, which improves the MAP of the prediction while lowering the complement MAP.

Model      Highlight Percentage   MAP     MAP^c
All        100%                   51.55   38.97
Lei2016    20%                    43.64   47.84
+minimax   20%                    48.58   46.13
Intros     20%                    45.08   49.27
+minimax   20%                    48.55   48.37

Table 9: Test MAP on the AskUbuntu dataset. MAP^c refers to the MAP score of the complement predictor. A desirable rationalization method will have high MAP and low MAP^c.

Compared to the other tasks, the complement MAPs on AskUbuntu are relatively large. One reason is that the reported results rely significantly on the three-step training strategy. The best MAP on the development set often occurs after a few epochs of end-to-end training (the third step of our training procedure), which may result in premature training of the generators due to early stopping. Another important reason is that there are a larger number of informative words in the questions, which makes it challenging for the generators to include all the useful information.

8 One potential reason that the three-step training strategy performs much better than end-to-end training from scratch is that we sample rationales according to the policy π(·) during training but take the action with the highest probability during inference. During the first few epochs of training, the rationale generator extracts words at almost all positions with probability lower than 0.5. Rationale words can still be sampled during training; however, during inference, no rationale words are selected unless their selection probability is greater than 0.5. Thus, the MAP on the development set is unchanged at the beginning of training. In other words, there is a risk that the predictor already overfits but we cannot perform early stopping of the training.
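A compact, schematic sketch of the three-step training schedule described in the Implementation Details; the method names on the model object (pretrain_classifier, pretrain_generator, and so on) are placeholders for illustration, not APIs from the released code.

```python
def train_three_player(model, train_data, dev_data, epochs=(5, 5, 10)):
    """Three-step schedule: 1) pre-train the classifier on full text,
    2) freeze it as predictor/complement predictor and pre-train the
    generator, 3) fine-tune all three players end-to-end."""
    # Step 1: classifier on full text.
    for _ in range(epochs[0]):
        model.pretrain_classifier(train_data)

    # Step 2: freeze the classifier; it now scores R and R^c while the
    # generator is trained with policy gradient on the Eq. (8) reward.
    model.freeze_classifier()
    for _ in range(epochs[1]):
        model.pretrain_generator(train_data)

    # Step 3: unfreeze and fine-tune jointly, tracking the development score
    # (MAP here) for early stopping.
    model.unfreeze_all()
    best = None
    for _ in range(epochs[2]):
        model.train_all(train_data)
        score = model.evaluate(dev_data)
        if best is None or score > best:
            best = score
    return best
```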