GEval: Tool for Debugging NLP Datasets and Models
Filip Graliński†,§ and Anna Wróblewska†,‡ and Tomasz Stanisławek†,‡ and Kamil Grabowski†,‡ and Tomasz Górecki§
† Applica.ai, Warszawa, Poland
‡ Faculty of Mathematics and Information Science, Warsaw University of Technology
§ Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Poznań
[email protected]

Abstract

This paper presents a simple but general and effective method to debug the output of machine learning (ML) supervised models, including neural networks. The algorithm looks for features that lower the evaluation metric in such a way that it cannot be ascribed to chance (as measured by their p-values). Using this method – implemented as the GEval tool – you can find: (1) anomalies in test sets, (2) issues in preprocessing, (3) problems in the ML model itself. It can give you an insight into what can be improved in the datasets and/or the model. The same method can be used to compare ML models or different versions of the same model. We present the tool, the theory behind it and use cases for text-based models of various types.

1 Introduction

Currently, given the burden of big data and the possibility to build a wide variety of deep learning models, the need to understand datasets, intrinsic parameters and model behavior is growing. These problems are part of the interpretability trend in state-of-the-art research, a good example being the publications at the NeurIPS 2018 conference and its Interpretability and Robustness in Audio, Speech, and Language Workshop (https://irasl.gitlab.io/).

The problem of interpretability is also crucial for using ML models in business cases and applications. Every day, data scientists analyze large amounts of data and build models, and sometimes they just do not understand why the models work in a certain way. Thus, we need fast and efficient tools to look into models in their various aspects, e.g. by analyzing train and test data, the way in which models influence their results, and how their internal features interact with each other. Consequently, the aim of our research and this paper is to present a tool that helps data scientists understand a model and find issues in order to improve the process. The tool will be showcased on a number of NLP challenges.

There are a few extended reviews of interpretability techniques and their types (Guidotti et al., 2018; Adadi and Berrada, 2018; Du et al., 2018). The authors also introduce the purposes of interpretability research: justifying models, controlling them and their changes in time (model debugging), improving their robustness and efficiency (model validation), and discovering weak features (new knowledge discovery). The explanations can be given as: (1) other models that are easier to understand (e.g. linear regression), (2) sets of rules, (3) lists of strong and weak input features, or even (4) textual summaries accessible to humans.

Interpretability techniques are categorized into global and local methods. "Global" stands for techniques that can explain/interpret a model as a whole, whereas "local" stands for methods and models that can be interpreted around any chosen neighborhood. Other dimensions of the interpretability categorization are: (1) intrinsically interpretable methods, i.e. models that approximate more difficult ones and are also easy for humans to understand, or (2) post-hoc explanations that are derived after training the models. Hence, explanations can be model-specific or model-agnostic, i.e. needing (or not) knowledge about the model itself.

As far as model-agnostic (black-box) methods are concerned, one of the breakthroughs in the domain was the LIME method (Local Interpretable Model-Agnostic Explanations) (Tulio Ribeiro et al., 2016). LIME requires access to a model and it changes the analyzed dataset many times (doing perturbations) by removing some features from input samples and measuring changes in the model output. The idea has two main drawbacks. The first is that it requires access to the model to know the model output for perturbed samples. The other disadvantage is that it takes a very long time to process big datasets, which makes the method unfeasible for really large datasets, e.g. several million text documents.
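To make the perturbation idea (and its cost) concrete, here is a minimal sketch – ours, not the actual LIME implementation; the toy model and the function name are purely illustrative. Each feature is dropped in turn, the black-box model is queried again, and the change in its output is recorded; note that this needs a callable model and costs one extra model call per feature of every sample.

    def feature_influence(predict, tokens):
        """predict: a black-box model returning, say, the positive-class probability
        for a list of tokens; tokens: the words of a single input sample."""
        base = predict(tokens)
        influence = {}
        for i, token in enumerate(tokens):
            perturbed = tokens[:i] + tokens[i + 1:]       # drop a single token
            influence[token] = base - predict(perturbed)  # change in the model output
        return influence

    # Toy black-box "model": the positive-class probability grows with the word "good".
    toy_model = lambda toks: min(1.0, 0.2 + 0.4 * toks.count("good"))
    print(feature_influence(toy_model, ["good", "movie", "though"]))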
Other interpretability methods concern the internal model structure in a white-box manner, e.g. L2X (Chen et al., 2018), which instruments a deep learning model with an extra unit (layer); the analyzed model is trained jointly with this unit.

We introduce an automatic, easy-to-use and model-agnostic method that does not require access to models. The only requirement is access to the dataset, i.e. the input sample data points, the model results and the gold standard labels. The method (and a command-line tool), called GEval, is based on statistical hypothesis testing and measuring the significance of each feature (GEval is available at https://gonito.net/gitlist/geval.git/; see also (Graliński et al., 2016) for a discussion of a companion Web application, see Section 4). GEval finds global features that "influence" the model evaluation score in a bad way and worsen its results.

Moreover, we present the workings of GEval using examples from various text-based model types, i.e. named entity recognition, classification and translation.

In the following sections, we introduce the idea of using p-values and hypothesis testing to debug ML models (Section 2), describe the algorithm behind GEval (Section 3), and then show some use cases for state-of-the-art ML models (Section 4).

2 Using p-values for debugging ML Models – general idea

Creating an ML model is not a one-off act, but a whole continuous process, in which a data scientist or ML engineer should analyze the weakest points of a model and try to improve the results by fixing the pre- or post-processing modules or by changing the model hyperparameters or the model type itself. Moreover, regression checks are needed for new releases of a model and its companion software, because, even if the overall result is better, some specific regressions might creep in (some of them for trivial reasons) and go unnoticed in the face of the general improvement otherwise. Thus, one may look for features, in a broad sense, in the test data and, during the ML engineering process, focus on the ones for which the evaluation metric goes down significantly below the general average (in absolute terms or when compared with another model), as they might reveal a bug somewhere in the ML pipeline.

Which features are suspicious? We should look either for the ones for which the evaluation metric decreases abruptly (even if they are infrequent) or for the ones which are very frequent and which influence the evaluation metric in a negative manner, even if just slightly (or for the ones which are somewhere in between these two extremes). We will show (in Section 4) that natural language processing (NLP) tasks are particularly amenable to this, as words and their combinations can be treated as features. Consider, for example, a binary text classification task. If you have an ML model for this task, you could run it on a test set, sort all words (or bigrams, trigrams, or other types of features) using a chi-squared statistical test that confronts the feature (or its lack) with the failure or success of the classification model (using a 2 × 2 contingency table), and look at the top of the list, i.e. at the words with the highest value of the χ² statistic or, equivalently, the lowest p-value. P-values might be easier for humans to interpret, and they are comparable across statistical tests.
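As an illustration of the test just described, the sketch below – ours, not part of GEval; the data layout and function name are assumptions made for the example – computes the χ² statistic and p-value for a single word by building the 2 × 2 contingency table of word presence/absence against classification success/failure, using scipy.

    from scipy.stats import chi2_contingency

    def word_chi2_pvalue(word, items):
        """items: list of (tokens, correct) pairs, where `tokens` is the set of words
        of a test item and `correct` is 1 if the classifier got the item right, else 0."""
        # 2 x 2 contingency table: word present/absent vs. classification success/failure.
        table = [[0, 0], [0, 0]]
        for tokens, correct in items:
            row = 0 if word in tokens else 1
            col = 0 if correct else 1
            table[row][col] += 1
        chi2, p, _, _ = chi2_contingency(table, correction=False)
        return chi2, p

    # Toy usage: score a handful of words (the lower the p-value, the more suspicious).
    items = [({"though", "nice"}, 0), ({"though"}, 0), ({"know"}, 1),
             ({"reading", "know"}, 1), ({"fine"}, 1), ({"though", "fine"}, 1)]
    for word in ("though", "know", "reading"):
        print(word, word_chi2_pvalue(word, items))

Repeating this for every word in the test set and sorting by p-value (ascending) yields a ranking of suspicious words such as the one shown in Table 1.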
As we are not interested in p-values as such (in contrast to hypothesis testing), but rather in comparing them to rank features, there is no need to use procedures for correcting p-values for multiple experiments, such as the Bonferroni correction.

See, for instance, Table 1, where we present the results of such an experiment for a specific classifier in the Twitter sentiment classification task. The average accuracy for the tweets with the word "know" is higher than for the ones containing the word "reading"; still, the result for "know" is more significant (as the word was more frequent). Thus, when debugging this ML model, more focus should be given to "know", and even more to "though", for which a lower average accuracy and a lower p-value were found (this is, of course, related to the fact that this conjunction connects contrastive clauses, which are hard to handle in sentiment analysis).

    WORD      COUNT    +      −      ACC      χ²        P-VALUE
    though    343      254    89     0.7405   35.2501   0.00000
    know      767      619    148    0.8070   13.4284   0.00025
    reading   72       57     15     0.7917   2.226     0.1357

Table 1: Example of words from the Twitter classification task with their statistical properties ("+" and "−" give the numbers of correctly and incorrectly classified tweets containing the word; ACC is the accuracy on those tweets).

This kind of analysis is clearly not white-box debugging (it does not depend on the internals of the models), but even calling it black-box is not accurate, as the model is not re-run. What is needed is just the input and the actual and expected output of the system. Hence, the most appropriate name for this technique should be "no-box" debugging.

In the simplest case of binary classification, the test items are just divided into successes and failures, but in the case of more "gradual" evaluation schemes, the items will be ranked more "smoothly" – from items for which the system output was perfect, through nearly perfect and partially wrong, to completely incorrect. Building on this, we will be able to compare the distribution of evaluation scores for the items that contain a given feature with the distribution for the items that do not.
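A minimal sketch of one way such a distribution comparison could be carried out; the choice of a rank-based Mann–Whitney U test (from scipy), as well as the data layout and function name, are our assumptions for illustration only, not something specified above.

    from scipy.stats import mannwhitneyu

    def feature_score_pvalue(feature, items):
        """items: list of (features, score) pairs, where `score` is the per-item value
        of the evaluation metric (1.0 = perfect output, 0.0 = completely wrong)."""
        with_feature = [score for features, score in items if feature in features]
        without_feature = [score for features, score in items if feature not in features]
        # alternative="less": are the scores of items with the feature shifted downwards?
        _, p = mannwhitneyu(with_feature, without_feature, alternative="less")
        return p

    # Toy usage with made-up per-item scores.
    items = [({"though"}, 0.2), ({"though", "fine"}, 0.5), ({"know"}, 0.9),
             ({"know", "reading"}, 1.0), ({"fine"}, 0.8)]
    print(feature_score_pvalue("though", items))

A low p-value for a feature then indicates that the evaluation scores of the items containing it are systematically lower than those of the remaining items.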