

A Survey on Neural Network Interpretability
Yu Zhang, Peter Tiňo, Aleš Leonardis, and Ke Tang

Abstract—Along with the great success of deep neural networks, there is also growing concern about their black-box nature. The interpretability issue affects people's trust on deep learning systems. It is also related to many ethical problems, e.g., algorithmic discrimination. Moreover, interpretability is a desired property for deep networks to become powerful tools in other research fields, e.g., drug discovery and genomics. In this survey, we conduct a comprehensive review of the neural network interpretability research. We first clarify the definition of interpretability as it has been used in many different contexts. Then we elaborate on the importance of interpretability and propose a novel taxonomy organized along three dimensions: type of engagement (passive vs. active interpretation approaches), type of explanation, and the focus (from local to global interpretability). This taxonomy provides a meaningful 3D view of the distribution of papers from the relevant literature, as two of the dimensions are not simply categorical but allow ordinal subcategories. Finally, we summarize the existing interpretability evaluation methods and suggest possible research directions inspired by our new taxonomy.

Index Terms—Machine learning, neural networks, interpretability, survey.

I. INTRODUCTION

Over the last few years, deep neural networks (DNNs) have achieved tremendous success [1] in computer vision [2], [3], [4], natural language processing [5] and other fields [6], while the latest applications can be found in these surveys [7]–[9]. They have not only beaten many previous machine learning techniques (e.g., decision trees, support vector machines), but also achieved the state-of-the-art performance on certain real-world tasks [4], [10]. Products powered by DNNs are now used by billions of people¹, e.g., in facial and voice recognition. DNNs have also become powerful tools for many scientific fields, such as medicine [11], [12], [13] and astronomy [14], which usually involve massive data volumes.

However, deep learning still has some significant disadvantages. As a really complicated model with millions of free parameters (e.g., AlexNet [2], 62 million), DNNs are often found to exhibit unexpected behaviours. For instance, even though a network could get the state-of-the-art performance and seemed to generalize well on the object recognition task, Szegedy et al. [15] found a way that could arbitrarily change the network's prediction by applying a certain imperceptible change to the input image. This kind of modified input is called an "adversarial example". Nguyen et al. [16] showed another way to produce completely unrecognizable images (e.g., ones that look like white noise), which are, however, recognized as certain objects by DNNs with 99.99% confidence. These observations suggest that even though DNNs can achieve superior performance on many tasks, their underlying mechanisms may be very different from those of humans and have not yet been well understood.

A. An (Extended) Definition of Interpretability

To open the black-boxes of deep networks, many researchers started to focus on the model interpretability. Although this theme has been explored in various papers, no clear consensus on the definition of interpretability has been reached.
Most previous works skimmed over the clarification issue and left it as "you will know it when you see it". If we take a closer look, the suggested definitions and motivations for interpretability are often different or even discordant [17].

One previous definition of interpretability is the ability to provide explanations in understandable terms to a human [18], while the term explanation itself is still elusive. After reviewing previous literature, we make further clarifications of "explanations" and "understandable terms" on the basis of [18]:

Interpretability is the ability to provide explanations in understandable terms to a human,

where

1) Explanations, ideally, should be logical decision rules (if-then rules) or can be transformed to logical rules. However, people usually do not require explanations to be explicitly in a rule form (but only some key elements which can be used to construct explanations).

2) Understandable terms should be from the domain knowledge related to the task (or common knowledge according to the task).

¹ https://www.acm.org/media-center/2019/march/turing-award-2018

TABLE I
SOME INTERPRETABLE "TERMS" USED IN PRACTICE.

Field             Raw input          Understandable terms
Computer vision   Images (pixels)    Super pixels (image patches)^a, visual concepts^b
NLP               Word embeddings    Words
Bioinformatics    Sequences          Motifs (position weight matrix)^c

^a Image patches are usually used in attribution methods [20].
^b Colours, materials, textures, parts, objects and scenes [21].
^c Proposed by [22] and became an essential tool for computational motif discovery.

Our definition enables new perspectives on the interpretability research: (1) We highlight the form of explanations rather than particular explanators. After all, explanations are expressed in a certain "language", be it natural language, logic rules or something else. Recently, a strong preference has been expressed for the language of explanations to be as close as possible to logic [19]. In practice, people do not always require a full "sentence", which allows various kinds of explanations (rules, saliency masks etc.). This is an important angle from which to categorize the approaches in the existing literature. (2) Domain knowledge is the basic unit in the construction of explanations. As deep learning has shown its ability to process data in the raw form, it becomes harder for people to interpret the model with its original input representation. With more domain knowledge, we can get more understandable representations that can be evaluated by domain experts. Table I lists several commonly used representations in different tasks.

We note that some studies distinguish between interpretability and explainability (or understandability, comprehensibility, transparency, human-simulatability etc. [17], [23]). In this paper we do not emphasize the subtle differences among those terms. As defined above, we see explanations as the core of interpretability and use interpretability, explainability and understandability interchangeably. Specifically, we focus on the interpretability of (deep) neural networks (rarely recurrent neural networks), which aims to provide explanations of their inner workings and input-output mappings. There are also some interpretability studies about Generative Adversarial Networks (GANs). However, as a kind of generative model, they are slightly different from common neural networks used as discriminative models. For this topic, we would like to refer readers to the latest work [24]–[29], many of which share similar ideas with the "hidden semantics" part of this paper (see Section II), trying to interpret the meaning of hidden neurons or the latent space.

Under our definition, the source code of the Linux operating system is interpretable, although it might be overwhelming for a developer. A deep decision tree or a high-dimensional linear model (on top of interpretable input representations) is also interpretable. One may argue that they are not simulatable [17] (i.e. a human is able to simulate the model's processing from input to output in his/her mind in a short time). We claim, however, that they are still interpretable.

Besides the above confined scope of interpretability (of a trained neural network), there is a much broader field of understanding the general neural network methodology, which cannot be covered by this paper. For example, the empirical success of DNNs raises many unsolved questions to theoreticians [30]. What are the merits (or inductive bias) of DNN architectures [31], [32]? What are the properties of DNNs' loss surface/critical points [33]–[36]? Why do DNNs generalize so well with just simple regularization [37]–[39]? What about DNNs' robustness/stability [40]–[45]? There are also studies about how to generate adversarial examples [46], [47] and detect adversarial inputs [48].

B. The Importance of Interpretability

The need for interpretability has already been stressed by many papers [17], [18], [49], emphasizing cases where lack of interpretability may be harmful. However, a clearly organized exposition of such argumentation is missing. We summarize the arguments for the importance of interpretability into three groups.

1) High Reliability Requirement: Although deep networks have shown great performance on some relatively large test sets, the real-world environment is still much more complex. As some unexpected failures are inevitable, we need some means of making sure we are still in control. Deep neural networks do not provide such an option. In practice, they have often been observed to have unexpected performance drops in certain situations, not to mention the potential attacks from adversarial examples [50], [51].

Interpretability is not always needed, but it is important for some prediction systems that are required to be highly reliable because an error may cause catastrophic results (e.g., human lives, heavy financial loss). Interpretability can make potential failures easier to detect (with the help of domain knowledge), avoiding severe consequences. Moreover, it can help pinpoint the root cause and provide a fix accordingly. Interpretability does not make a model more reliable or its performance better, but it is an important part of the formulation of a highly reliable system.

2) Ethical and Legal Requirement: A first requirement is to avoid algorithmic discrimination. Due to the nature of machine learning techniques, a trained deep neural network may inherit the bias in the training set, which is sometimes hard to notice. There is a concern of fairness when DNNs are used in our daily life, for instance, in mortgage qualification, credit and insurance risk assessments.

Deep neural networks have also been used for new drug discovery and design [52]. The computational drug design field was dominated by conventional machine learning methods such as random forests and generalized additive models, partially because of their efficient learning algorithms at that time, and also because a domain chemical interpretation is possible. Interpretability is also needed for a new drug to get approved by the regulator, such as the Food and Drug Administration (FDA). Besides the clinical test results, the biological mechanism underpinning the results is usually required. The same goes for medical devices.

Another legal requirement of interpretability is the "right to explanation" [53]. According to the EU General Data Protection Regulation (GDPR) [54], Article 22, people have the right not to be subject to an automated decision which would produce legal effects or similarly significant effects concerning him or her. The data controller shall safeguard the data owner's right to obtain human intervention, to express his or her point of view and to contest the decision. If we have no idea how the network makes a decision, there is no way to ensure these rights.

3) Scientific Usage: Deep neural networks are becoming powerful tools in scientific research fields where the data may have complex intrinsic patterns (e.g., genomics [55], astronomy [14], physics [56] and even social science [57]). The word "science" is derived from the Latin word "scientia", which means "knowledge". When deep networks reach a better performance than the old models, they must have found some unknown "knowledge". Interpretability is a way to reveal it.

C. Related Work and Contributions

There have already been attempts to summarize the techniques for neural network interpretability. However, most of them only provide basic categorization or enumeration, without a clear taxonomy. Lipton [17] points out that the term interpretability is not well-defined and often has different meanings in different studies. He then provides a simple categorization of both the needs (e.g., trust, causality, fair decision-making etc.) and the methods (post-hoc explanations) in interpretability study. Doshi-Velez and Kim [18] provide a discussion on the definition and evaluation of interpretability, which inspired us to formulate a stricter definition and to categorize the existing methods based on it. Montavon et al. [58] confine the definition of explanation to feature importance (also called explanation vectors elsewhere) and review the techniques to interpret learned concepts and individual predictions by networks. They do not aim to give a comprehensive overview and only include some representative approaches. Gilpin et al. [59] divide the approaches into three categories: explaining data processing, explaining data representation and explanation-producing networks. Under this categorization, the linear proxy model method and the rule-extraction method are equally viewed as proxy methods, without noticing many differences between them (the former is a local method while the latter is usually global, and their produced explanations are different, as we will see in our taxonomy). Guidotti et al. [49] consider all black-box models (including tree ensembles, SVMs etc.) and give a fine-grained classification based on four dimensions (the type of interpretability problem, the type of explanator, the type of black-box model, and the type of data). However, they treat decision trees, decision rules, saliency masks, sensitivity analysis, activation maximization etc. equally, as explanators. In our view, some of them are certain types of explanations while some of them are methods used to produce explanations. Zhang and Zhu [60] review the methods to understand the network's mid-layer representations or to learn networks with interpretable representations in the computer vision field.

This survey has the following contributions:

• We make a further step towards the definition of interpretability on the basis of reference [18]. In this definition, we emphasize the type (or format) of explanations (e.g., rule forms, including both decision trees and decision rule sets). This acts as an important dimension in our proposed taxonomy. Previous papers usually organize existing methods into various isolated (to a large extent) explanators (e.g., decision trees, decision rules, feature importance, saliency maps etc.).

• We analyse the real needs for interpretability and summarize them into 3 groups: interpretability as an important component in systems that should be highly reliable; ethical or legal requirements; and interpretability providing tools to enhance knowledge in the relevant science fields. In contrast, a previous survey [49] only shows the importance of interpretability by providing several cases where black-box models can be dangerous.

• We propose a new taxonomy comprising three dimensions (passive vs. active approaches, the format of explanations, and local-semilocal-global interpretability). Note that although many ingredients of the taxonomy have been discussed in the previous literature, they were either mentioned in totally different contexts, or entangled with each other. To the best of our knowledge, our taxonomy provides the most comprehensive and clear categorization of the existing approaches.

The three degrees of freedom along which our taxonomy is organized allow for a schematic 3D view illustrating how diverse attempts at interpretability of deep networks are related. It also provides suggestions for possible future work by filling some of the gaps in the interpretability research (see Figure 2).

D. Organization of the Survey

The rest of the survey is organized as follows. In Section II, we introduce our proposed taxonomy for network interpretation methods. The taxonomy consists of three dimensions: passive vs. active methods, type of explanations, and global vs. local interpretability. Along the first dimension, we divide the methods into two groups, passive methods (Section III) and active methods (Section IV). Under each section, we traverse the remaining two dimensions (different kinds of explanations, and whether they are local, semi-local or global). Section V gives a brief summary of the evaluation of interpretability. Finally, we conclude this survey in Section VII.

II. TAXONOMY

We propose a novel taxonomy with three dimensions (see Figure 1): (1) the passive vs. active approaches dimension, (2) the type/format of produced explanations, and (3) the dimension from local to global interpretability, respectively. The first dimension is categorical and has two possible values, passive interpretation and active interpretability intervention. It divides the existing approaches according to whether they require changing the network architecture or the optimization process. The passive interpretation process starts from a trained network, with all the weights already learned from the training set. Thereafter, the methods try to extract logic rules or extract some understandable patterns.

Dimension 1 — Passive vs. Active Approaches
  Passive: post hoc explanation of trained neural networks
  Active: actively change the network architecture or training process for better interpretability

Dimension 2 — Type of Explanations (in the order of increasing explanatory power)
To explain a prediction/class by
  Examples: provide example(s) which may be considered similar or as prototype(s)
  Attribution: assign credit (or blame) to the input features (e.g. feature importance, saliency masks)
  Hidden semantics: make sense of certain hidden neurons/layers
  Rules: extract logic rules (e.g. decision trees, rule sets and other rule formats)

Dimension 3 — Local vs. Global Interpretability (in terms of the input space)
  Local: explain the network's predictions on individual samples (e.g. a saliency mask for an input image)
  Semi-local: in between; for example, explain a group of similar inputs together
  Global: explain the network as a whole (e.g. a set of rules/a decision tree)

Fig. 1. The 3 dimensions of our taxonomy.

In contrast, active methods require some changes before the training, such as introducing extra network structures or modifying the training process. These modifications encourage the network to become more interpretable (e.g., more like a decision tree). Most commonly such active interventions come in the form of regularization terms.

In contrast to previous surveys, the other two dimensions allow ordinal values. For example, the previously proposed dimension type of explanator [49] produces subcategories like decision trees, decision rules, feature importance, sensitivity analysis etc. However, there is no clear connection among these pre-recognised explanators (e.g., what is the relation between decision trees and feature importance?). Instead, our second dimension is the type/format of explanation. By inspecting various kinds of explanations produced by different approaches, we can observe differences in how explicit they are. Logic rules provide the most clear and explicit explanations while other kinds of explanations may be implicit. For example, a saliency map itself is just a mask on top of a certain input. By looking at the saliency map, people construct an explanation "the model made this prediction because it focused on this highly influential part and that part (of the input)". Hopefully, these parts correspond to some domain-understandable concepts. Strictly speaking, implicit explanations by themselves are not complete explanations and need further human interpretation, which is usually automatically done when people see them. We recognize four major types of explanations here: logic rules, hidden semantics, attribution and explanations by examples, listed in order of decreasing explanatory power. Similar discussions can be found in the previous literature, e.g., Samek et al. [61] provide a short subsection about the "type of explanations" (including explaining learned representations, explaining individual predictions etc.). However, it is mixed up with another independent dimension of the interpretability research which we will introduce in the following paragraph. A recent survey [62] follows the same philosophy and treats saliency maps and concept attribution [63] as different types of explanations, while we view them as being of the same kind, but differing in the dimension below.

The last dimension, from local to global interpretability (w.r.t. the input space), has become very common in recent papers (e.g., [18], [49], [58], [64]), where global interpretability means being able to understand the overall decision logic of a model and local interpretability focuses on the explanations of individual predictions. However, in our proposed dimension, there exists a transition rather than a hard division between global and local interpretability (i.e. semi-local interpretability). Local explanations usually make use of the information at the target input (e.g., its feature values, its gradient). But global explanations try to generalize to as wide ranges of inputs as possible (e.g., sequential covering in rule learning, marginal contribution for feature importance ranking). This view is also supported by the existence of several semi-local explanation methods [65], [66]. There have also been attempts to fuse local explanations into global ones in a bottom-up fashion [19], [67], [68].

To help understand the latter two dimensions, Table II lists examples of typical explanations produced by different subcategories under our taxonomy. (Row 1) When considering a rule as explanation for local interpretability, an example is to provide rule explanations which only apply to a given input x(i) (and its associated output ŷ(i)). One of the solutions is to find out (by perturbing the input features and seeing how the output changes) the minimal set of features x_k . . . x_l whose presence supports the prediction ŷ(i). Analogously, features x_m . . . x_n can be found which should not be present (larger values), otherwise ŷ(i) will change. Then an explanation rule for x(i) can be constructed as "it is because x_k . . . x_l are present and x_m . . . x_n are absent that x(i) is classified as ŷ(i)" [69]. If a rule is valid not only for the input x(i), but also for its "neighbourhood" [65], we obtain semi-local interpretability.
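To make the Row 1 example concrete, here is a minimal sketch (in Python) of the underlying idea: greedily grow a set of features whose presence alone preserves a black-box classifier's prediction, with all other features replaced by a baseline value. The function names, the greedy search and the masking scheme are our own illustrative choices, not the actual optimization used in [69].

```python
import numpy as np

def minimal_sufficient_features(f, x, baseline, target_class, max_size=5):
    """Greedily grow a feature set S such that keeping only the features in S
    (all other features replaced by a baseline value) preserves the prediction.
    `f` is assumed to be a predict-proba function returning an array of class
    probabilities; this is an illustrative stand-in, not a published algorithm."""
    n = x.shape[0]
    S = []
    for _ in range(max_size):
        best_j, best_score = None, -np.inf
        for j in range(n):
            if j in S:
                continue
            masked = baseline.copy()
            masked[S + [j]] = x[S + [j]]          # keep only features in S ∪ {j}
            score = f(masked[None, :])[0, target_class]
            if score > best_score:
                best_j, best_score = j, score
        S.append(best_j)
        masked = baseline.copy()
        masked[S] = x[S]
        if f(masked[None, :])[0].argmax() == target_class:
            break                                  # S alone already supports the prediction
    return S
```

The returned index set S can then be read out as a rule of the form "x(i) is classified as ŷ(i) because the features in S are present".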

And if a rule set or decision tree is extracted from the original network, it explains the general function of the whole network and thus provides global interpretability. (Row 2) When it comes to explaining the hidden semantics, a typical example (global) is to visualize what pattern a hidden neuron is mostly sensitive to. This can then provide clues about the inner workings of the network. We can also take a more pro-active approach to make hidden neurons more interpretable. As a high-layer hidden neuron may learn a mixture of patterns that can be hard to interpret, Zhang et al. [70] introduced a loss term that makes high-layer filters either produce consistent activation maps (among different inputs) or keep inactive (when not seeing a certain pattern). Experiments show that those filters are more interpretable (e.g., a filter may be found to be activated by the head parts of animals). (Row 3) Attribution as explanation usually provides local interpretability. Thinking about an animal classification task, input features are all the pixels of the input image. Attribution allows people to see which regions (pixels) of the image contribute the most to the classification result. The attribution can be computed, e.g., by sensitivity analysis in terms of the input features (i.e. all pixels) or some variants [71], [72]. For attribution for global interpretability, deep neural networks usually cannot have as straightforward an attribution as, e.g., the coefficients w in linear models y = w⊤x + b, which directly show the importance of features globally. Instead of concentrating on input features (pixels), Kim et al. [63] were interested in attribution to a "concept" (e.g., how sensitive is a prediction of zebra to the presence of stripes). The concept (stripes) is represented by the normal vector to the plane which separates having-stripes and non-stripes training examples in the space of a network's hidden layer. It is therefore possible to compute how sensitive the prediction (of zebra) is to the concept (presence of stripes) and thus have some form of global interpretability. (Row 4) Sometimes researchers explain network predictions by showing other known examples providing similar network functionality. To explain a single input x(i) (local interpretability), we can find an example which is most similar to x(i) at the network's hidden layer level. This selection of explanation examples can also be done by testing how much the prediction of x(i) will be affected if a certain example is removed from the training set [73]. To provide global interpretability by showing examples, a method is to add a (learnable) prototype layer to a network. The prototype layer forces the network to make predictions according to the proximity between the input and the learned prototypes. Those learned and interpretable prototypes can help to explain the network's overall function.

With the three dimensions introduced above, we can visualize the distribution of the existing interpretability papers in a 3D view (Figure 2 only provides a 2D snapshot; we encourage readers to visit the online interactive version for better presentation). Table III is another representation of all the reviewed interpretability approaches which is good for quick navigation.

Fig. 2. The distribution of the interpretability papers in the 3D space of our taxonomy. We can rotate and observe the density of work in certain areas/planes and find the missing parts of interpretability research. (See https://yzhang-gh.github.io/tmp-data/index.html)

In the following sections, we will scan through Table III along each dimension. The first dimension results in two sections, passive methods (Section III) and active methods (Section IV). We then expand each section to several subsections according to the second dimension (type of explanation). Under each subsection, we introduce (semi-)local vs. global interpretability methods respectively.

III. PASSIVE INTERPRETATION OF TRAINED NETWORKS

Most of the existing network interpreting methods are passive methods. They try to understand the already trained networks. We now introduce these methods according to their types of produced explanations (i.e. the second dimension).

A. Passive, Rule as Explanation

Logic rules are commonly acknowledged to be interpretable and have a long history of research. Thus rule extraction is an appealing approach to interpret neural networks. In most cases, rule extraction methods provide global explanations as they only extract a single rule set or decision tree from the target model. There are only a few methods producing (semi-)local rule-form explanations, which we will introduce below (Section III-A1), followed by the global methods (Section III-A2). Another thing to note is that although the rules and decision trees (and their extraction methods) can be quite different, we do not explicitly differentiate them here as they provide similar explanations (a decision tree can be flattened to a decision rule set). A basic form of a rule is

If P, then Q.

where P is called the antecedent, and Q is called the consequent, which in our context is the prediction (e.g., class label) of a network. P is usually a combination of conditions on several input features. For complex models, the explanation rules can be of other forms such as the propositional rule, first-order rule or fuzzy rule.

1) Passive, Rule as Explanation, (Semi-)local: According to our taxonomy, methods in this category focus on a trained neural network and a certain input (or a small group of inputs), and produce a logic rule as an explanation.

TABLE II
EXAMPLE EXPLANATIONS OF NETWORKS. Please see Section II for details. Due to lack of space, we do not provide examples for semi-local interpretability here. (We thank the anonymous reviewer for the idea to improve the clarity of this table.)

Local (and semi-local) interpretability applies to a certain input x(i) (and its associated output ŷ(i)), or a small range of inputs-outputs; global interpretability is w.r.t. the whole input space.

Rule as explanation
  Local (and semi-local): Explain a certain (x(i), y(i)) with a decision rule:
    • The result "x(i) is classified as ŷ(i)" is because x1, x4, . . . are present and x3, x5, . . . are absent [69].
    • (Semi-local) For x in the neighbourhood of x(i), if (x1 > α) ∧ (x3 < β) ∧ . . . , then y = ŷ(i) [65].
  Global: Explain the whole model y(x) with a decision rule set. The neural network can be approximated by
    If (x2 < α) ∧ (x3 > β) ∧ . . . , then y = 1,
    If (x1 > γ) ∧ (x5 < δ) ∧ . . . , then y = 2,
    · · ·
    If (x4 . . . ) ∧ (x7 . . . ) ∧ . . . , then y = M.

Explaining hidden semantics (make sense of certain hidden neurons/layers)
  Local (and semi-local): Explain a hidden neuron/layer h(x(i)). (*No explicit methods, but many local attribution methods (see below) can be easily modified to "explain" a hidden neuron h(x) rather than the final output y.)
  Global: Explain a hidden neuron/layer h(x) instead of y(x):
    • An example active method [70] adds a special loss term that encourages filters to learn consistent and exclusive patterns (e.g. head patterns of animals), illustrated in the original table by a filter's actual "receptive fields" [74] on animal images.

Attribution as explanation
  Local (and semi-local): Explain a certain (x(i), y(i)) with an attribution a(i), i.e. the "contribution" of each pixel to the network prediction ŷ(i) (e.g., junco bird) [75]¹, a.k.a. a saliency map, which can be computed by different methods like gradients [71], sensitivity analysis² [72] etc.
  Global: Explain y(x) with attribution to certain features in general. (Note that for a linear model, the coefficients are the global attribution to its input features.)
    • Kim et al. [63] calculate attribution to a target "concept" rather than the input pixels of a certain input. For example, "how sensitive is the output (a prediction of zebra) to a concept (the presence of stripes)?"

Explanation by showing examples
  Local (and semi-local): Explain a certain (x(i), y(i)) with another example x(i)′. By asking how much the network will change ŷ(i) (e.g., fish) if removing a certain training image, we can find the most helpful³ training images [73].
  Global: Explain y(x) collectively with a few prototypes:
    • Add a (learnable) prototype layer to the network. Every prototype should be similar to at least an encoded input. Every input should be similar to at least a prototype. The trained network explains itself by its prototypes [76].

¹ The contribution to the network prediction of x(i).
² How sensitive is the classification result to the change of pixels.
³ Without the training image, the network prediction of x(i) will change a lot. In other words, these images help the network make a decision on x(i).

Dhurandhar et al. [69] construct local rule explanations by finding out features that should be minimally and sufficiently present and features that should be minimally and necessarily absent. In short, the explanation takes this form: "If an input x is classified as class y, it is because features fi, . . . , fk are present and features fm, . . . , fp are absent". This is done by finding small sparse perturbations that are sufficient to ensure the same prediction by their own (or will change the prediction if applied to a target input). (The authors also extended this method to learn a global interpretable model, e.g., a decision tree, based on custom features created from the above local explanations [92].) A similar kind of method is counterfactual explanations [124]. Usually, we are asking based on what features (values) the neural network makes the prediction of class c. However, Goyal et al. [78] try to find the minimum edit on an input image which can result in a different predicted class c′. In other words, they ask: "What region in the input image makes the prediction to be class c, rather than c′?" Kanamori et al. [79] introduced distribution-aware counterfactual explanations, which require the above "edit" to follow the empirical data distribution instead of being arbitrary.

Wang et al. [77] came up with another local interpretability method, which identifies critical data routing paths (CDRPs) of the network for each input. In convolutional neural networks, each kernel produces a feature map that will be fed into the next layer as a channel. Wang et al. [77] associated every output channel on each layer with a gate (non-negative weight), which indicates how critical that channel is. These gate weights are then optimized such that when they are multiplied with the corresponding output channels, the network can still make the same prediction as the original network (on a given input). Importantly, the weights are encouraged to be sparse (most are close to zero). CDRPs can then be identified for each input by first identifying the critical nodes, i.e. the intermediate kernels associated with positive gates. We can explore and assign meanings to the critical nodes so that the critical paths become local explanations. However, as the original paper did not go further on the CDRP representation, which may not be human-understandable, it is still more of an activation pattern than a real explanation.
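The CDRP construction described above can be sketched as follows: attach a non-negative gate to each channel of an intermediate feature map, then optimize the gates so that the gated network keeps the original prediction while an L1 penalty drives most gates to zero. This is a simplified, single-layer PyTorch illustration under assumed module names (features, head), not the exact procedure of [77].

```python
import torch
import torch.nn.functional as F

def critical_channels(features, head, x, l1_weight=0.05, steps=200):
    """features: frozen backbone up to the layer of interest; head: the rest of
    the (frozen) network; x: a single input of shape (1, C, H, W).
    Returns per-channel gate values; channels with positive gates form the
    critical nodes of the routing path for this input (illustrative sketch)."""
    for p in head.parameters():
        p.requires_grad_(False)
    with torch.no_grad():
        fmap = features(x)                        # (1, K, h, w) feature maps
        target = head(fmap).argmax(dim=1)         # prediction to be preserved
    gates = torch.ones(fmap.shape[1], requires_grad=True)
    opt = torch.optim.Adam([gates], lr=0.1)
    for _ in range(steps):
        gated = fmap * gates.clamp(min=0).view(1, -1, 1, 1)   # gate each channel
        logits = head(gated)
        loss = F.cross_entropy(logits, target) + l1_weight * gates.clamp(min=0).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return gates.detach().clamp(min=0)
```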

TABLE III
AN OVERVIEW OF THE INTERPRETABILITY PAPERS.

Passive
  Rule
    Local: CEM [69], CDRPs [77], CVE² [78], DACE [79]
    Semi-local: Anchors [65], Interpretable partial substitution [80]
    Global: KT [81], MofN [82], NeuralRule [83], NeuroLinear [84], GRG [85], Gyan^FO [86], [87]^FZ, [88]^FZ, Trepan [89], [90], DecText [91], Global model on CEM [92]
  Hidden semantics
    Local: — (*no explicit methods, but many methods in the attribution cell below can be applied here)
    Semi-local: —
    Global: [71], [93]–[98], Network dissection [21], Net2Vec [99], Linguistic correlation analysis [100]
  Attribution¹
    Local: LIME [20], MAPLE [101], Partial derivatives [71], DeconvNet [72], Guided backprop [102], Guided Grad-CAM [103], Shapley values [104]–[107], Sensitivity analysis [72], [108], [109], Feature selector [110], Bias attribution [111]
    Semi-local: DeepLIFT [112], LRP [113], Integrated gradients [114], Feature selector [110], MAME [68]
    Global: Feature selector [110], TCAV [63], ACE [115], SpRAy³ [67], MAME [68], DeepConsensus [116]
  By example
    Local: Influence functions [73], Representer point selection [117]
    Semi-local: —
    Global: —

Active
  Rule
    Local: —
    Semi-local: Regional tree regularization [118]
    Global: Tree regularization [119]
  Hidden semantics
    Local: —
    Semi-local: —
    Global: "One filter, one concept" [70]
  Attribution
    Local: ExpO [120], DAPr [121]
    Semi-local: —
    Global: Dual-net (feature importance) [122]
  By example
    Local: —
    Semi-local: —
    Global: Network with a prototype layer [76], ProtoPNet [123]

FO First-order rule. FZ Fuzzy rule.
¹ Some attribution methods (e.g., DeconvNet, Guided Backprop) arguably have certain non-locality because of the rectification operation.
² Short for counterfactual visual explanations.
³ SpRAy is flexible to provide semi-local or global explanations by clustering local (individual) attributions.

We can also extract rules that cover a group of inputs rather than a single one. Ribeiro et al. [65] propose anchors, which are if-then rules that are sufficiently precise (semi-)locally: if a rule applies to a group of similar examples, their predictions are (almost) always the same. It is similar to (actually, built on the basis of) an attribution method, LIME, which we will introduce in Section III-C. However, they are different in terms of the produced explanations (LIME produces attribution for individual examples). Wang et al. [80] attempted to find an interpretable partial substitution (a rule set) to the network that covers a certain subset of the input space. This substitution can be done with no or low cost on the model accuracy according to the size of the subset.

2) Passive, Rule as Explanation, Global: Most of the time, we would like to have some form of an overall interpretation of the network, rather than its local behaviour at a single point. We again divide these approaches into two groups. Some rule extraction methods make use of network-specific information such as the network structure, or the learned weights. These methods are called decompositional approaches in the previous literature [125]. The other methods instead view the network as a black-box and only use it to generate training examples for classic rule learning algorithms. They are called pedagogical approaches.

a) Decompositional approaches: Decompositional approaches generate rules by observing the connections in a network. As many of these approaches were developed before the deep learning era, they are mostly designed for classic fully-connected feedforward networks. Consider a single-layer setting of a fully-connected network (only one output neuron),

y = σ(Σ_i w_i x_i + b)

where σ is an activation function (usually sigmoid, σ(x) = 1/(1 + e^(−x))), w are the trainable weights, x is the input vector, and b is the bias term (often referred to as a threshold θ in the early time, and b here can be interpreted as the negation of θ). Lying at the heart of rule extraction is to search for combinations of certain values (or ranges) of attributes x_i that make y near 1 [82]. This is tractable only when we are dealing with small networks, because the size of the search space will soon grow to an astronomical number as the number of attributes and the possible values for each attribute increase. Assuming we have n Boolean attributes x_i as an input, and each attribute can be true or false or absent in the antecedent, there are O(3^n) combinations to search. We therefore need some search strategies.
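To make the size of this search space tangible, the following deliberately naive sketch enumerates all O(3^n) antecedents (each attribute required-true, required-false, or absent) for a single sigmoid unit and keeps those guaranteed to push the output above a threshold β; practical algorithms such as KT prune this space instead of enumerating it. All names are our own.

```python
import math
from itertools import product

def extract_rules(weights, bias, beta=0.9):
    """Enumerate antecedents over Boolean attributes (each attribute is
    required-true, required-false, or absent) and keep those that force
    sigmoid(w . x + b) > beta no matter how the absent attributes are set.
    Exponential in len(weights); for illustration only."""
    rules = []
    for antecedent in product((True, False, None), repeat=len(weights)):
        z = bias
        for w, a in zip(weights, antecedent):
            if a is True:
                z += w                    # attribute required to be true
            elif a is None:
                z += min(0.0, w)          # absent attribute: assume worst case
            # a is False: attribute required to be false, contributes nothing
        if 1.0 / (1.0 + math.exp(-z)) > beta:
            rules.append(antecedent)
    return rules

# extract_rules([2.0, -1.5, 1.0], bias=-0.5) includes (True, False, True),
# i.e. "if x1 is true and x2 is false and x3 is true, then y = 1".
```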

One of the earliest methods is the KT [81]. KT decision tree extraction (from neural networks) can be found algorithm first divides the input attributes into two groups, in references [89]–[91]. pos-atts (short for positive attributes) and neg-atts, according to Odajima et al. [85] followed the framework of NeuroLinear the signs of their corresponding weights. Assuming activation but use a greedy form of sequential covering algorithm to function is sigmoid, all the neurons are booleanized to true extract rules. It is reported to be able to extract more concise (if close enough to 1) or false (close to 0). Then, all and precise rules. Gyan method [86] goes further than extracting combinations of pos-atts are selected if the combination can on propositional rules. After obtaining the propositional rules by its own make y be true (larger than a pre-defined threshold β the above methods, Gyan uses the Least General Generalization without considering the neg-atts), for instance, a combination (LGG [130]) method to generate first-order logic rules from P {x1, x3} and σ( i∈{1,3} wixi + b) > β. Finally, it takes into them. There are also some approaches attempting to extract account the neg-atts. For each above pos-atts combination, it fuzzy logic from trained neural networks [87], [88], [131]. The finds combinations of neg-atts (e.g., {x2, x5}) that when absent major difference is the introduction of the membership function the output calculated from the selected pos-atts and unselected of linguistic terms. An example rule is P neg-atts is still true. In other words, σ( i∈I wixi + b) > β, If (x = high) ∧ ..., then y = class1. where I = {x1, x3} ∪ {neg-atts}\{x2, x5}. The extracted 1 rule can then be formed from the combination I and has the where high is a fuzzy term expressed as a fuzzy set over the output class 1. In our example, the translated rule is numbers. Most of the above “Rule as Explanation, Global” methods If x1(is true) ∧ x3 ∧ ¬x2 ∧ ¬x5, then y = 1. were developed in the early stage of neural network research, Similarly, this algorithm can generate rules for class 0 (search- and usually were only applied to relatively small datasets ing neg-atts first and then adding pos-atts). To apply to the (e.g., the Iris dataset, the Wine dataset from the UCI Machine multi-layer network situation, it first does layer-by-layer rule Learning Repository). However, as neural networks get deeper generation and then rewrites them to omit the hidden neurons. and deeper in recent applications, it is unlikely for a single In terms of the complexity, KT algorithm reduces the search decision tree to faithfully approximate the behaviour of deep space to O(2n) by distinguishing pos-atts and neg-atts (pos- networks. We can see that more recent “Rule as Explanation” atts will be either true or absent, and neg-atts will be either methods turn to local or semi-local interpretability [80], [118]. false or absent). It also limits the number of attributes in the antecedent, which can further decrease the algorithm B. Passive, Hidden Semantics as Explanation complexity (with the risk of missing some rules). Towell and Shavlik [82] focus on another kind of rules of The second typical kind of explanations is the meaning “M-of-N” style. This kind of rule de-emphasizes the individual of hidden neurons or layers. Similar to the grandmother cell 3 importance of input attributes, which has the form hypothesis in , it is driven by a desire to associate abstract concepts with the activation of some hidden neurons. 
If M of these N expressions are true, then Q. Taking animal classification as an example, some neurons may have high response to the head of an animal while others This algorithm has two salient characteristics. The first one neurons may look for bodies, feet or other parts. This kind of is link (weight) clustering and reassigning them the average explanations by definition provides global interpretability. weight within the cluster. Another characteristic is network 1) Passive, Hidden Semantics as Explanation, Global: simplifying (eliminating unimportant clusters) and re-training. Existing hidden semantics interpretation methods mainly focus Comparing with the exponential complexity of subset searching on the computer vision field. The most direct way is to show algorithm, M-of-N method is approximately cubic because of what the neuron is “looking for”, i.e. visualization. The key to its special rule form. visualization is to find a representative input that can maximize NeuroRule [83] introduced a three-step procedure of ex- the activation of a certain neuron, channel or layer, which tracting rules: (1) train the network and prune, (2) discretize is usually called activation maximization [93]. This is an (cluster) the activation values of the hidden neurons, (3) extract optimization problem, whose search space is the potentially the rules layer-wise and rewrite (similar as previous methods). huge input (sample) space. Assuming we have a network NeuroLinear [84] made a little change to the NeuroRule method, taking as input a 28 × 28 pixels black and white image (as allowing neural networks to have continuous input. Andrews in the MNIST handwritten digit dataset), there will be 228×28 et al. [126] and Tickle et al. [127] provide a good summary possible input images, although most of them are probably of the rule extraction techniques before 1998. nonsensical. In practice, although we can find a maximum b) Pedagogical approaches: By treating the neural net- activation input image with optimization, it will likely be work as a black-box, pedagogical methods (or hybrids of unrealistic and uninterpretable. This situation can be helped both) directly learn rules from the examples generated by with some regularization techniques or priors. the network. It is essentially reduced to a traditional rule We now give an overview over these techniques. The learning or decision tree learning problem. For rule set learning, framework of activation maximization was introduced by Erhan we have sequential covering framework (i.e. to learn rules et al. [93] (although it was used in the unsupervised deep one by one). And for decision tree, there are many classic algorithms like CART [128] and C4.5 [129]. Example work of 3https://en.wikipedia.org/wiki/Grandmother cell ZHANG et al.: A SURVEY ON NEURAL NETWORK INTERPRETABILITY 9 models like Deep Belief Networks). In general, it can be where | · | is the cardinality of a set and the summation P is formulated as over all the inputs x that contains the concept c. Along the same lines, Fong and Vedaldi [99] investigate the embeddings ? x = arg max(act(x; θ) − λΩ(x)) of concepts over multiple kernels by constructing M with a x combination of several kernels. Their experiments show that where act(·) is the activation of the neuron of interest, θ is the multiple kernels are usually required to encode one concept and network parameters (weights and biases) and Ω is an optional kernel embeddings are better representations of the concepts. regularizer. 
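The pedagogical strategy is straightforward to sketch: label inputs with the trained network's own predictions and fit an off-the-shelf decision tree to the relabelled data. The snippet below is a generic scikit-learn illustration of this idea, not a faithful implementation of any particular published method (e.g., Trepan).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def pedagogical_tree(network_predict, X_pool, max_depth=4, n_samples=5000):
    """Fit a decision tree that mimics a black-box network.
    network_predict: function mapping an array of inputs to class labels.
    X_pool: unlabeled inputs (training data or samples drawn around it)."""
    idx = np.random.choice(len(X_pool), size=min(n_samples, len(X_pool)), replace=False)
    X = X_pool[idx]
    y = network_predict(X)                      # the network acts as the labelling oracle
    return DecisionTreeClassifier(max_depth=max_depth).fit(X, y)

# Usage sketch: print(export_text(pedagogical_tree(net_predict, X_train)))
# The tree's fidelity to the network (agreement with network labels on held-out
# samples) should be reported alongside its accuracy on the true labels.
```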
(We use the bold upright x to denote an input Dalvi et al. [100] also analysed the meaning of individual matrix in image related tasks, which allows row and column units/neurons in the networks for NLP tasks. They build a linear indices i and j.) model between the network’s hidden neurons and the output. Simonyan et al. [71] for the first time applied activation The neurons are then ranked according to the significance of maximization to a supervised deep convolutional network the weights of the linear model. For those top-ranking neurons, (for ImageNet classification). It finds representative images their linguistic meanings are investigated by visualizing their by maximizing the score of a class (before softmax). And the saliency maps on the inputs, or by finding the top words by Ω is the L2 norm of the image. Later, people realized high which they get activated. frequency noise is a major nuisance that makes the visualization unrecognizable [94], [98]. In order to get natural and useful C. Passive, Attribution as Explanation visualizations, finding good priors or regularizers Ω becomes the core task in this kind of approaches. Attribution is to assign credit or blame to the input features in terms of their impact on the output (prediction). The explanation Mahendran and Vedaldi [95] propose a regularizer total will be a real-valued vector which indicates feature importance variation with the sign and amplitude of the scores [58]. For simple β models (e.g., linear models) with meaningful features, we might X  2 2 2 Ω(x) = (xi,j+1 − xi,j) + (xi+1,j − xi,j) be able to assign each feature a score globally. When it comes i,j to more complex networks and input, e.g., images, it is hard to which encourages neighbouring pixels to have similar values. say a certain pixel always has similar contribution to the output. This can also be viewed as a low-pass filter that can reduce Thus, many methods do attribution locally. We introduce them the high frequency noise in the image. This kind of methods below and at the end of this section we mention a global is usually called image blurring. Besides suppressing high attribution method on intermediate representation rather than the original input features. amplitude and high frequency information (with L2 decay and Gaussian blurring respectively), Yosinski et al. [96] also 1) Passive, Attribution as Explanation, (Semi-)local: Simi- include other terms to clip the pixels of small values or little larly to the decompositional vs. pedagogical division of rule importance to the activation. Instead of using many hand- extraction methods, attribution methods can be also divided crafted regularizers (image priors), Nguyen et al. [97] suggest into two groups: gradient-related methods and model agnostic using natural image prior learned by a generative model. methods. As Generative Adversarial Networks (GANs) [132] showed a) Gradient-related and methods: Using recently great power to generate high-resolution realistic gradients to explain the individual classification decisions is images [133], [134], making use of the generative model of a a natural idea as the gradient represents the “direction” and GAN appears to be a good choice. For a good summary and rate of the fastest increase on the loss function. The gradients many impressive visualizations, we refer the readers to [98]. 
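A minimal activation-maximization loop of the kind formulated above (gradient ascent on a class score, or any neuron of interest, with an L2 penalty on the image, as in [71]) can be written in a few lines of PyTorch; the starting image, step count and regularization weight below are arbitrary illustrative choices.

```python
import torch

def activation_maximization(model, neuron_index, shape=(1, 3, 224, 224),
                            steps=200, lr=0.1, l2_weight=1e-4):
    """Gradient ascent on an input image to maximize one output unit
    (a class score or any hidden activation exposed by `model`),
    with an L2-norm regularizer on the image."""
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)                       # only the image is optimized
    x = torch.zeros(shape, requires_grad=True)        # start from a blank image
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        act = model(x)[0, neuron_index]               # activation of interest
        loss = -act + l2_weight * (x ** 2).sum()      # maximize act, penalize norm
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()
```

Stronger priors discussed in the text (total variation, blurring, or a GAN-based image prior) would simply replace or extend the L2 term in the loss.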
can also be computed with respect to a certain output class, When applied to certain tasks, researchers can get some insights for example, along which “direction” a perturbation will make from these visual interpretations. For example, Minematsu an input more/less likely predicted as a cat/dog. Baehrens et et al. [135], [136] inspected the behaviour of the first and al. [137] use it to explain the predictions of Gaussian Process last layers of a DNN used for change detection (in video Classification (GPC), k-NN and SVM. For a special case, the stream), which may suggest the possibility of a new background coefficients of features in linear models (or general additive modelling strategy. models) are already the partial derivatives, in other words, the (global) attribution. So people can directly know how Besides visualization, there are also some work trying to the features affect the prediction and that is an important find connections between kernels and visual concepts (e.g., reason why linear models are commonly thought interpretable. materials, certain objects). Bau et al. [21] (Network Dissection) While plain gradients, discrete gradients and path-integrated collected a new dataset Broden which provides a pixel-wise gradients have been used for attribution, some other methods binary mask Lc(x) for every concept c and each input image do not calculate real gradients with the chain rule but only x. The activation map of a kernel k is upscaled and converted backpropagate attribution signals (e.g., do extra normalization (given a threshold) to a binary mask Mk(x) which has the on each layer upon backpropagation). We now introduce these same size of x. Then the alignment between a kernel k and a methods in detail. certain concept c (e.g., car) is computed as In computer vision, the attribution is usually represented as P |Mk(x) ∩ Lc(x)| a saliency map, a mask of the same size of the input image. IoUk,c = P In reference [71], the saliency map is generated from the |Mk(x) ∪ Lc(x)| 10 IEEE TRANSACTIONS ON XXXX, VOL. X, NO. X, MM YYYY

gradients (specifically, the maximum absolute values of the TABLE IV partial derivatives over all channels). This kind of saliency FORMULATION OF GRADIENT-RELATED ATTRIBUTION METHODS. Sc is the output for class c (and it can be any neuron of interest), σ is the nonlinearity maps are obtained without effort as they only require a single 0 ∂Sc(x) in the network and g is a replacement of σ (the derivative of σ) in ∂x backpropagation pass. They also showed that this method is in order to rewrite DeepLIFT and LRP with gradient formulation (see [139] equivalent to the previously proposed deconvolutional nets for more details). xi is the i-th feature (pixel) of x. method [72] except for the difference on the calculation of ReLU layer’s gradients. Guided backpropagation [102] Method Attribution combines above two methods. It only takes into account the ∂Sc(x) Gradient [71], [137] gradients (the former method) that have positive error signal ∂xi

(the latter method) when backpropagating through a ReLU layer. ∂Sc(x) Gradient Input xi · There is also a variant Guided Grad-CAM (Gradient-weighted ∂xi Class Activation Mapping) [103], which first calculates a coarse- g ∂ Sc(x) σ(z) grained attribution map (with respect to a certain class) on the LRP [113] xi · , g = ∂xi z last convolutional layer and then multiply it to the attribution g ref ref ∂ Sc(x) σ(z) − σ(z ) map obtained from guided backpropagation. (Guided Grad- DeepLIFT [112] (xi − x ) · , g = i ∂x z − zref CAM is an extension of CAM [138] which requires a special i Z 1 ref ∂Sc(x˜) global average pooling layer.) Integrated (xi − xi ) · dα 0 ∂x˜i x˜=xref+α(x−xref) However, gradients themselves can be misleading. Consider- Gradient [114] ing a piecewise continuous function, ( x1 + x2 if x1 + x2 < 1 ; discussed above, Wang et al. [111] point out that bias terms can y = contain attribution information complementary to the gradients. 1 if x1 + x2 ≥ 1 . They propose a method to recursively assign the bias attribution it is saturated when x1 +x2 ≥ 1. At points where x1 +x2 > 1, back to the input vector. their gradients are always zeros. DeepLIFT [112] points Those discrete gradient methods (e.g., LRP and DeepLIFT) out this problem and highlights the importance of having provide semi-local explanations as they explain a target input a reference input besides the target input to be explained. w.r.t. another reference input. But methods such as DeconvNet The reference input is a kind of default or ‘neutral’ input and Guided Backprop, which are only proposed to explain and will be different in different tasks (e.g., blank images or individual inputs, arguably have certain non-locality because zero vectors). Actually, as Sundararajan et al. [114] point out, of the rectification operation during the process. Moreover, one DeepLIFT is trying to compute the “discrete gradient” instead can accumulate multiple local explanations to achieve a certain of the (instantaneous) gradient. Another similar “discrete degree of global interpretability, which will be introduced in gradient” method is LRP [113] (choosing a zero vector as Section III-C2. the reference point), differing in how to compute the discrete Although we have many approaches to produce plausible gradient. This view is also present in reference [139] that LRP saliency maps, there is still a small gap between saliency and DeepLIFT are essentially backpropagation for maps and real explanations. There have even been adver- modified gradient functions. sarial examples for attribution methods, which can produce However, discrete gradients also have their drawbacks. As perceptively indistinguishable inputs, leading to the same the chain rule does not hold for discrete gradients, DeepLIFT predicted labels, but very different attribution maps [140]– and LRP adopt modified forms of backpropagation. This makes [142]. Researchers came up with several properties a saliency their attributions specific to the network implementation, in map should have to be a valid explanation. Sundararajan et other words, the attributions can be different even for two al. [114] (integrated gradients method) introduced two require- functionally equivalent networks (a concrete example can be ments, sensitivity and implementation invariance. Sensitivity seen in reference [114] appendix B). Integrated gradients [114] requirement is proposed mainly because of the (local) gradient have been proposed to address this problem. 
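The gradient-related attributions summarized in Table IV share a common computational skeleton. Below is a hedged sketch of the two simplest rows (plain gradient, and gradient multiplied by the input) for a single input in PyTorch; the other rows (LRP, DeepLIFT, integrated gradients) modify how this gradient signal is computed or accumulated.

```python
import torch

def gradient_attributions(model, x, target_class, multiply_by_input=False):
    """Plain-gradient saliency for one input x (shape (1, C, H, W) or (1, d)).
    With multiply_by_input=True this becomes the 'gradient * input' variant
    from Table IV."""
    x = x.clone().detach().requires_grad_(True)
    score = model(x)[0, target_class]        # S_c(x), the class score of interest
    score.backward()
    attribution = x.grad.detach()
    if multiply_by_input:
        attribution = attribution * x.detach()
    return attribution
```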
Integrated gradients [114] were proposed to address this problem. They are defined as the path integral of the gradients along the straight line between the input $x$ and the reference input $x^{\mathrm{ref}}$. The $i$-th dimension of the integrated gradient (IG) is defined as follows,
$$ \mathrm{IG}_i(x) := (x_i - x_i^{\mathrm{ref}}) \int_0^1 \frac{\partial f(\tilde{x})}{\partial \tilde{x}_i}\bigg|_{\tilde{x} = x^{\mathrm{ref}} + \alpha (x - x^{\mathrm{ref}})} \, d\alpha, $$
where $\partial f(\tilde{x})/\partial \tilde{x}_i$ is the $i$-th dimension of the gradient of $f$. For those attribution methods that require a reference point, semi-local interpretability is provided, as users can select different reference points according to what they want to explain. Table IV summarizes the above gradient-related attribution methods (adapted from [139]).

TABLE IV
FORMULATION OF GRADIENT-RELATED ATTRIBUTION METHODS. $S_c$ is the output for class $c$ (it can be any neuron of interest), $\sigma$ is the nonlinearity in the network, and $g$ is a replacement of $\sigma'$ (the derivative of $\sigma$) in $\partial S_c(x)/\partial x$, used to rewrite DeepLIFT and LRP in gradient form (see [139] for more details). $x_i$ is the $i$-th feature (pixel) of $x$.

- Gradient [71], [137]: $\partial S_c(x)/\partial x_i$
- Gradient $\odot$ Input: $x_i \cdot \partial S_c(x)/\partial x_i$
- LRP [113]: $x_i \cdot \partial^g S_c(x)/\partial x_i$, with $g = \sigma(z)/z$
- DeepLIFT [112]: $(x_i - x_i^{\mathrm{ref}}) \cdot \partial^g S_c(x)/\partial x_i$, with $g = \big(\sigma(z) - \sigma(z^{\mathrm{ref}})\big)/\big(z - z^{\mathrm{ref}}\big)$
- Integrated Gradient [114]: $(x_i - x_i^{\mathrm{ref}}) \cdot \int_0^1 \partial S_c(\tilde{x})/\partial \tilde{x}_i \,\big|_{\tilde{x}=x^{\mathrm{ref}}+\alpha(x-x^{\mathrm{ref}})} \, d\alpha$
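Since the path integral above is rarely available in closed form, integrated gradients are usually approximated with a Riemann sum over a small number of interpolation steps. The PyTorch sketch below is a minimal illustration under the assumption of a differentiable `model` that returns class scores for a batch of size one; it is not the reference implementation of [114].

```python
import torch

def integrated_gradients(model, x, x_ref, target_class, steps=50):
    """Riemann-sum approximation of IG_i(x) = (x_i - x_ref_i) *
    integral_0^1 dS_c(x_ref + a (x - x_ref)) / dx_i da."""
    model.eval()
    total_grad = torch.zeros_like(x)
    for k in range(1, steps + 1):
        alpha = k / steps
        # point on the straight line between the reference and the input
        x_interp = (x_ref + alpha * (x - x_ref)).detach().requires_grad_(True)
        score = model(x_interp)[0, target_class]
        total_grad += torch.autograd.grad(score, x_interp)[0]
    return (x - x_ref) * total_grad / steps   # per-feature attribution
```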
In addition to the gradient-style attributions discussed above, Wang et al. [111] point out that bias terms can contain attribution information complementary to the gradients. They propose a method to recursively assign the bias attribution back to the input vector.

Those discrete gradient methods (e.g., LRP and DeepLIFT) provide semi-local explanations, as they explain a target input with respect to another, reference input. Methods such as DeconvNet and Guided Backprop, which were only proposed to explain individual inputs, arguably have a certain non-locality because of the rectification operation during the process. Moreover, one can accumulate multiple local explanations to achieve a certain degree of global interpretability, which will be introduced in Section III-C2.

Although we have many approaches that produce plausible saliency maps, there is still a gap between saliency maps and real explanations. There have even been adversarial examples for attribution methods, which produce perceptively indistinguishable inputs that lead to the same predicted labels but very different attribution maps [140]–[142]. Researchers have therefore come up with several properties a saliency map should have in order to be a valid explanation. Sundararajan et al. [114] (the integrated gradients paper) introduced two requirements, sensitivity and implementation invariance. The sensitivity requirement is proposed mainly because of the (local) gradient saturation problem (which results in zero gradient/attribution). Implementation invariance means that two functionally equivalent networks (which can have different learned parameters, given the over-parametrized setting of DNNs) should have the same attribution. Kindermans et al. [143] introduced input invariance, which requires attribution methods to mirror the model's invariance with respect to transformations of the input. For example, a model with a bias term can easily deal with a constant shift of the input (pixel values). Obviously, (plain) gradient attribution satisfies this kind of input invariance, whereas discrete gradients and other methods using reference points depend on the choice of reference. Adebayo et al. [75] took a different approach. They found that edge detectors can also produce masks which look similar to saliency masks and highlight some features of the input, yet edge detectors have nothing to do with the network or the training data. Thus, they proposed two tests which check whether the attribution method fails (1) if the network's weights are replaced with random noise, or (2) if the labels of the training data are shuffled. An attribution method should fail these tests; otherwise it does not actually reflect the trained network or the training data (in other words, it behaves like an edge detector).

b) Model-agnostic attribution: LIME [20] is a well-known approach which can provide local attribution explanations (when linear models are chosen as the so-called interpretable components). Let $f : \mathbb{R}^d \to \{+1, -1\}$ be a (binary classification) model to be explained. Because the original input $x \in \mathbb{R}^d$ might be uninterpretable (e.g., a tensor of all the pixels in an image, or a word embedding [144]), LIME introduces an intermediate representation $x' \in \{0, 1\}^{d'}$ (e.g., the existence of certain image patches or words). $x'$ can be recovered to the original input space $\mathbb{R}^d$. For a given $x$, LIME tries to find a potentially interpretable model $g$ (such as a linear model or decision tree) as a local explanation,
$$ g_x = \arg\min_{g \in G} \; L(f, g, \pi_x) + \Omega(g), $$
where $G$ is the explanation model family and $L$ is the loss function that measures the fidelity between $f$ and $g$. $L$ is evaluated on a set of perturbed samples around $x'$ (and their recovered inputs), which are weighted by a locality kernel $\pi_x$. $\Omega$ is the complexity penalty of $g$, ensuring that $g$ remains interpretable. MAPLE [101] is a similar method using local linear models as explanations; the difference is that it defines locality as how frequently two data points fall into the same leaf node of a proxy random forest (fit on the trained network).
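To make the objective above concrete, the sketch below fits a kernel-weighted linear surrogate around one input by sampling binary perturbation masks, scoring the recovered inputs with the black-box model, and solving a weighted least-squares problem. It is a bare-bones rendition of the LIME idea under simplifying assumptions (no complexity penalty $\Omega$, NumPy only, a hypothetical `mask_to_input` mapping supplied by the user); it is not the official implementation of [20].

```python
import numpy as np

def lime_local_surrogate(predict_fn, mask_to_input, d_prime,
                         n_samples=1000, width=0.25):
    """Fit a local linear surrogate g around one input.
    predict_fn:    black-box model, maps an input to a scalar score f(x)
    mask_to_input: maps a binary mask z' in {0,1}^{d'} back to input space
    d_prime:       size of the interpretable (binary) representation."""
    Z = np.random.randint(0, 2, size=(n_samples, d_prime))    # perturbed z'
    y = np.array([predict_fn(mask_to_input(z)) for z in Z])   # f on recovered inputs
    # locality kernel pi_x: samples closer to the original (all-ones mask) weigh more
    dist = 1.0 - Z.mean(axis=1)
    w = np.exp(-(dist ** 2) / width ** 2)
    # weighted least squares: minimize sum_i w_i * (y_i - z_i . beta - b)^2
    Zb = np.hstack([Z, np.ones((n_samples, 1))])
    sw = np.sqrt(w)[:, None]
    beta, *_ = np.linalg.lstsq(Zb * sw, y * sw.ravel(), rcond=None)
    return beta[:-1]   # one attribution weight per interpretable component
```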
In game theory, there is the task of "fairly" assigning each player a payoff from the total gain generated by a coalition of all players. Formally, let $N$ be a set of $n$ players and $v : 2^N \to \mathbb{R}$ a characteristic function, which can be interpreted as the total gain of a coalition; obviously, $v(\emptyset) = 0$. A coalitional game can then be denoted by the tuple $\langle N, v \rangle$. Given a coalitional game, the Shapley value [145] is a solution to the payoff assignment problem. The payoff (attribution) for player $i$ can be computed as follows,
$$ \phi_i(v) = \frac{1}{|N|} \sum_{S \subseteq N \setminus \{i\}} \binom{|N|-1}{|S|}^{-1} \big( v(S \cup \{i\}) - v(S) \big), $$
where $v(S \cup \{i\}) - v(S)$ is the marginal contribution of player $i$ to coalition $S$, and the rest of the formula can be viewed as a normalization factor. A well-known alternative form of the Shapley value is
$$ \phi_i(v) = \frac{1}{|N|!} \sum_{O \in \mathcal{S}(N)} \big( v(P_i^O \cup \{i\}) - v(P_i^O) \big), $$
where $\mathcal{S}(N)$ is the set of all ordered permutations of $N$ and $P_i^O$ is the set of players in $N$ that precede player $i$ in the permutation $O$. Štrumbelj and Kononenko adopted this form so that $v$ can be approximated in polynomial time [104] (see also [106] for another approximation method).

Back to the neural network (denoted by $f$), let $N$ be the set of all input features (attributes) and $S \subseteq N$ an arbitrary feature subset of interest. For an input $x$, the characteristic function $v(S)$ is the difference between the expected model output when we know all the features in $S$ and the expected output when no feature value is known (i.e., the expected output over all possible inputs),
$$ v(S) = \frac{1}{|\mathcal{X}^{N \setminus S}|} \sum_{y \in \mathcal{X}^{N \setminus S}} f(\tau(x, y, S)) \;-\; \frac{1}{|\mathcal{X}^{N}|} \sum_{z \in \mathcal{X}^{N}} f(z), $$
where $\mathcal{X}^{N}$ and $\mathcal{X}^{N \setminus S}$ are, respectively, the input spaces over the feature sets $N$ and $N \setminus S$, and $\tau(x, y, S)$ is a vector composed of $x$ and $y$ according to whether each feature is in $S$. However, a practical problem is the exponential computational complexity, let alone the cost of a feed-forward pass for each evaluation of $v(S)$. Štrumbelj and Kononenko [104] approximate the Shapley value by sampling from $\mathcal{S}(N) \times \mathcal{X}$ (a Cartesian product). There are other variants, e.g., using a different $v$; more can be found in reference [105], which proposes a unified view including not only the Shapley value methods but also LRP and DeepLIFT. There is also a Shapley value formulation through the lens of causal graphs [107].
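The permutation form lends itself to a simple Monte Carlo estimator: sample random feature orderings and average the marginal contribution of feature $i$ when it is added after its predecessors. The sketch below assumes a characteristic function `v` is supplied (e.g., built from model calls as above) and is a generic sampling approximation in the spirit of [104], not their exact algorithm.

```python
import random

def shapley_sampling(v, n_features, i, n_samples=1000):
    """Monte Carlo estimate of the Shapley value of feature i.
    v: characteristic function mapping a frozenset of feature indices to a payoff."""
    total = 0.0
    features = list(range(n_features))
    for _ in range(n_samples):
        order = features[:]
        random.shuffle(order)                      # a random permutation O
        pred = frozenset(order[:order.index(i)])   # predecessors P_i^O of feature i
        total += v(pred | {i}) - v(pred)           # marginal contribution
    return total / n_samples
```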

Sensitivity analysis can also be used to evaluate the importance of a feature. Specifically, the importance of a feature can be measured as how much the model output changes upon a change of that feature (or a group of features). There are different kinds of changes, e.g., perturbation and occlusion [72], [108], among others [109]. Chen et al. [110] propose an instance-wise feature selector $E$ which maps an input $x$ to a conditional distribution $P(S \mid x)$, where $S$ is any subset (of a certain size) of the original feature set. The selected features are denoted by $x_S$. They then aim to maximize the mutual information between the selected features and the output variable $Y$,
$$ \max_{E} \; I(X_S; Y) \quad \text{subject to} \quad S \sim E(X). $$
A variational approximation is used to obtain a tractable solution of the above problem.

2) Passive, Attribution as Explanation, Global: A natural way to obtain global attribution is to combine the individual attributions obtained from the local/semi-local methods above. SpRAy [67] clusters the individual attributions and then summarizes groups of prediction strategies. MAME [68] is a similar method that can generate a multilevel (local to global) explanation tree. Salman et al. [116] provide a different approach which makes use of multiple neural networks: each network provides its own local attributions, on top of which a clustering is performed. Those clusters, intuitively the consensus of multiple models, can provide more robust interpretations.

Attribution also does not necessarily assign "credit" or "blame" to the raw input features. Kim et al. [63] propose TCAV (quantitative Testing with Concept Activation Vectors), a method that can compute the model's sensitivity to any user-defined concept. By first collecting some examples with and without a target concept (e.g., the presence of stripes in an animal), the concept can be represented by a normal vector to the hyperplane separating those positive/negative examples (pictures of animals with/without stripes) in a hidden layer. The score of the concept is then computed as the (average) output sensitivity when the hidden-layer representation (of an input $x$) moves an infinitesimally small step along the concept vector. This is a global interpretability method, as it explains how a concept affects the output in general. Besides being manually picked by a human, these concepts can also be discovered automatically by clustering input segments [115].
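A minimal version of the TCAV procedure can be written down directly from the description above: train a linear classifier on hidden-layer activations of concept versus non-concept examples, take its weight vector as the concept activation vector (CAV), and measure the fraction of inputs whose class score increases along that direction. The sketch below assumes the caller supplies hidden-layer activations and gradients of the class score with respect to those activations; it follows the spirit of [63] rather than reproducing its implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def compute_cav(concept_acts, random_acts):
    """CAV = normal vector of a linear classifier separating concept
    activations from random/negative activations in a hidden layer."""
    X = np.vstack([concept_acts, random_acts])
    y = np.array([1] * len(concept_acts) + [0] * len(random_acts))
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    v = clf.coef_.ravel()
    return v / np.linalg.norm(v)

def tcav_score(cav, grads):
    """Fraction of inputs whose class score increases when the hidden
    representation takes an infinitesimal step along the CAV direction.
    grads: one row per input, each a dS_c/d(activation) vector."""
    directional = grads @ cav
    return float((directional > 0).mean())
```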
combination of training points’ activations in the pre-logit During the forward propagation, the feature map is masked layer. The coefficients of the training points indicate whether (element-wise product) by a certain template T ∈ T according the similarity to those points is excitatory or inhibitory. The to the position of the most activated “pixel” in the original above two approaches both provide local explanations. feature map. During the back propagation, an extra loss is plugged in, which is the mutual information between M (the IV. ACTIVE INTERPRETABILITY INTERVENTION DURING feature maps of a filter calculated on all images) and T∪{T−} TRAINING (all the ideal activation patterns plus a negative pattern which Besides passive looking for human-understandable patterns is full of a negative constant). This loss term makes a filter to from the trained network, researchers also tried to impose either have a consistent activation pattern or keep inactivated. interpretability restrictions during the network training process, Experiments show that filters in their designed architecture are i.e. active interpretation methods in our taxonomy. A popular more semantically meaningful (e.g., the “receptive field” [74] idea is to add a special regularization term Ω(θ) to the loss of a filter corresponds to the head of animals). function, also known as “interpretability loss” (θ collects all the weights of a network). We now discuss the related papers C. Active, Attribution as Explanation according to the forms of explanations they provide. Similar to tree regularization which helps to achieve better global interpretability (decision trees), ExpO [120] added an A. Active, (semi-local or global) Rule as Explanation interpretability regularizer in order to improve the quality of Wu et al. [119] propose tree regularization which favours local attribution. That regularization requires a model to have models that can be well approximated by shallow decision fidelitous (high fidelity) and stable local attribution. DAPr [121] trees. It requires two steps (1) train a binary decision tree using (deep attribution prior) took into account additional information  (i) (i) N data points x , yˆ , where yˆ = fθ(x) is the network (e.g., a rough prior about the feature importance). The prior prediction rather than the true label, (2) calculate the average will be trained jointly with the main prediction model (as a path length (from root to leaf node) of this decision tree over regularizer) and biases the model to have similar attribution as all the data points. However, this tree regularization term Ω(θ) the prior. ZHANG et al.: A SURVEY ON NEURAL NETWORK INTERPRETABILITY 13

B. Active, Hidden Semantics as Explanation (Global)

Another line of work aims to make a convolutional neural network learn better (disentangled) hidden semantics. Given the feature visualization techniques mentioned above and some empirical studies [21], [148], CNNs are believed to learn low-level to high-level representations in a hierarchical structure. But even if higher layers have learned some object-level concepts (e.g., head, foot), those concepts are usually entangled with each other; in other words, a high-layer filter may contain a mixture of different patterns. Zhang et al. [70] propose a loss term which encourages high-layer filters to represent a single concept. Specifically, for a CNN, a feature map (the output of a high-layer filter, after ReLU) is an $n \times n$ matrix. Zhang et al. predefine a set $\mathbf{T}$ of $n^2$ ideal feature map templates (activation patterns), each of which is like a Gaussian kernel, differing only in the position of its peak. During the forward propagation, the feature map is masked (element-wise product) by a certain template $T \in \mathbf{T}$, chosen according to the position of the most activated "pixel" in the original feature map. During the back propagation, an extra loss is plugged in, namely the mutual information between $M$ (the feature maps of a filter calculated on all images) and $\mathbf{T} \cup \{T^{-}\}$ (all the ideal activation patterns plus a negative pattern filled with a negative constant). This loss term makes a filter either have a consistent activation pattern or stay inactivated. Experiments show that filters in their designed architecture are more semantically meaningful (e.g., the "receptive field" [74] of a filter corresponds to the heads of animals).

C. Active, Attribution as Explanation

Similar to tree regularization, which helps to achieve better global interpretability (via decision trees), ExpO [120] adds an interpretability regularizer in order to improve the quality of local attribution. That regularizer requires a model to have high-fidelity and stable local attributions. DAPr [121] (deep attribution prior) takes into account additional information (e.g., a rough prior about the feature importance). The prior is trained jointly with the main prediction model (as a regularizer) and biases the model toward attributions similar to the prior.

Besides performing attribution on individual inputs (locally in input space), Dual-net [122] was proposed to decide feature importance population-wise, i.e., to find an "optimal" feature subset collectively for an input population. In this method, a selector network is used to generate an optimal feature subset, while an operator network makes predictions based on that feature subset. The two networks are trained jointly, and after training the selector network can be used to rank feature importance.

D. Active, Explanations by Prototypes (Global)

Li et al. [76] incorporate a prototype layer into a network (specifically, an autoencoder). The network acts like a prototype classifier, where predictions are made according to the proximity between the (encoded) inputs and the learned prototypes. Besides the cross-entropy loss and the (autoencoder) reconstruction error, they also include two interpretability regularization terms, encouraging every prototype to be similar to at least one encoded input and vice versa. After the network is trained, those prototypes can naturally be used as explanations. Chen et al. [123] add a prototype layer to a regular CNN rather than an autoencoder. This prototype layer contains prototypes that are encouraged to resemble parts of an input. When asked for explanations, the network can provide several prototypes for the different parts of the input image, respectively.

V. EVALUATION OF INTERPRETABILITY

In general, interpretability is hard to evaluate objectively, as the end tasks can be quite divergent and may require domain knowledge from experts [58]. Doshi-Velez and Kim [18] proposed three evaluation approaches: application-grounded, human-grounded, and functionally-grounded. The first one measures to what extent interpretability helps the end task (e.g., better identification of errors or less discrimination). Human-grounded approaches, for example, directly let people evaluate the quality of explanations in human-subject experiments (e.g., letting a user choose which of several explanations has the highest quality). Functionally-grounded methods find proxies for explanation quality (e.g., sparsity). The last kind of approach requires no costly human experiments, but how to properly determine the proxy is a challenge.

In our taxonomy, explanations are divided into different types. Although interpretability can hardly be compared across different types of explanations, some measurements have been proposed for this purpose. For logic rules and decision trees, the size of the extracted rule model is often used as a criterion [85], [119], [127] (e.g., the number of rules, the number of antecedents per rule, the depth of the decision tree, etc.). Strictly speaking, these criteria measure more whether the explanations are efficiently interpretable. Hidden-semantics approaches produce explanations of certain hidden units in the network. Network Dissection [21] quantifies the interpretability of hidden units by calculating how well they match certain concepts. For the hidden-unit visualization approaches, there is no good measurement yet. For attribution approaches, the explanations are saliency maps/masks (or feature importance etc., according to the specific task). Samek et al. [149] evaluate saliency maps by the performance degradation when the input image is progressively masked with noise, in order from salient to non-salient patches. A similar evaluation method is proposed in [75], and Hooker et al. [150] suggest using a fixed uninformative value rather than noise as the mask and evaluating the performance degradation on a retrained model. Samek et al. also use entropy as another measure, in the belief that good saliency maps focus on relevant regions and do not contain much irrelevant information and noise. Montavon et al. [58] would like the explanation function (which maps an input to a saliency map) to be continuous/smooth, meaning that the explanations (saliency maps) should not vary too much between similar inputs.
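The perturbation-style evaluation used by Samek et al. [149] can be summarized in a few lines: repeatedly occlude the most salient regions first and record how quickly the class score drops. The helper below is a simplified, single-image, patch-based version of that protocol (square patches, random-noise filling), intended as an illustration rather than the benchmark of [149] or [150].

```python
import torch

def perturbation_curve(model, x, saliency, target_class, patch=8, steps=20):
    """Mask the most salient patch first, then the next, etc., and record the
    class score after each masking step.  x: (1, C, H, W); saliency: (H, W)."""
    model.eval()
    x = x.detach().clone()
    _, C, H, W = x.shape
    # average saliency per non-overlapping patch, then sort patches by relevance
    patch_scores = saliency.reshape(H // patch, patch, W // patch, patch).mean(dim=(1, 3))
    order = torch.argsort(patch_scores.flatten(), descending=True)
    scores = []
    with torch.no_grad():
        for k in range(steps):
            r, c = divmod(order[k].item(), W // patch)
            x[:, :, r*patch:(r+1)*patch, c*patch:(c+1)*patch] = \
                torch.randn(1, C, patch, patch)
            scores.append(model(x)[0, target_class].item())
    return scores   # a fast drop suggests the saliency map found relevant regions
```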
VI. DISCUSSION

In practice, different interpretation methods have their own advantages and disadvantages. Passive (post-hoc) methods have been widely studied because they can be applied in a relatively straightforward manner to most existing networks. One can choose methods that make use of a network's inner information (such as connection weights or gradients), which are usually more efficient (e.g., see Paragraph III-C1a). Otherwise, there are also model-agnostic methods that place no requirement on the model architecture and usually compute the marginal effect of a certain input feature. But this generality is also a downside of passive methods, especially because there is no easy way to incorporate specific domain knowledge/priors. Active (interpretability intervention) methods put forward ideas about how a network should be optimized to gain interpretability. The network can be optimized to be easily fitted by a decision tree, or to have preferred feature attributions, better tailored to a target task. However, the other side of the coin is that such active intervention requires compatibility between networks and interpretation methods.

As for the second dimension, the format of explanations, logical rules are the clearest (they do not need further human interpretation). However, one should carefully control the complexity of the explanations (e.g., the depth of a decision tree); otherwise the explanations will not be useful in practice. Hidden semantics essentially explain a subpart of a network, with most work developed in the computer vision field. Attribution is very suitable for explaining individual inputs, but it is usually hard to get an overall understanding of the network from attribution alone (compared with, e.g., logical rules). Explaining by providing an example has the lowest (most implicit) explanatory power.

As for the last dimension, local explanations are more useful when we care about every single prediction (e.g., a credit or insurance risk assessment). For some research fields, such as genomics and astronomy, global explanations are preferred, as they may reveal some general "knowledge". Note, however, that there is no hard line separating local and global interpretability. With the help of explanation-fusing methods (e.g., MAME), we can obtain multilevel (from local to global) explanations.

VII. CONCLUSION

In this survey, we have provided a comprehensive review of neural network interpretability. First, we discussed the definition of interpretability and stressed the importance of the format of explanations and of domain knowledge/representations. Specifically, there are four commonly seen types of explanations: logic rules, hidden semantics, attribution, and explanations by examples. Then, by reviewing the previous literature, we summarized three essential reasons why interpretability is important: the requirement of high-reliability systems, ethical/legal requirements, and knowledge finding for science. After that, we introduced a novel taxonomy for the existing network interpretation methods. It evolves along three dimensions: passive vs. active, the type of explanation, and global vs. local interpretability. The last two dimensions are not purely categorical but admit ordinal values (e.g., semi-local). This provides, for the first time, a coherent overview of interpretability research rather than a collection of isolated problems and approaches; we can even visualize the distribution of the existing approaches in the 3D space spanned by our taxonomy.

From the perspective of the new taxonomy, there are still several possible research directions in interpretability research. First, the active interpretability intervention approaches are underexplored. Some analyses of the passive methods also suggest that a neural network does not necessarily learn representations which can be easily interpreted by human beings. Therefore, how to actively make a network interpretable without harming its performance is still an open problem. During the survey process, we have seen more and more recent work filling this blank.

Another important research direction may be how to better incorporate domain knowledge in the networks. As we have seen in this paper, interpretability is about providing explanations, and explanations build on top of understandable terms (or concepts) which can be specific to the targeted tasks. We already have many approaches to construct explanations of different types, but the domain-related terms used in the explanations are still very simple (see Table I). If we can make use of terms that are more domain/task-related, we can get more informative explanations and better interpretability.
REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, 2015.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[4] G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, B. Kingsbury et al., "Deep neural networks for acoustic modeling in speech recognition," IEEE Signal Processing Magazine, vol. 29, 2012.
[5] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014.
[6] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., "Mastering the game of go without human knowledge," Nature, vol. 550, 2017.
[7] S. Dong, P. Wang, and K. Abbas, "A survey on deep learning and its applications," Computer Science Review, 2021.
[8] P. Dixit and S. Silakari, "Deep learning algorithms for cybersecurity applications: A technological and status review," Computer Science Review, 2021.
[9] T. Bouwmans, S. Javed, M. Sultana, and S. K. Jung, "Deep neural network concepts for background subtraction: A systematic review and comparative evaluation," Neural Networks, 2019.
[10] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv preprint arXiv:1609.08144, 2016.
[11] M. Wainberg, D. Merico, A. Delong, and B. J. Frey, "Deep learning in biomedicine," Nature Biotechnology, vol. 36, 2018.
[12] H. Y. Xiong, B. Alipanahi, L. J. Lee, H. Bretschneider, D. Merico, R. K. Yuen, Y. Hua, S. Gueroussov, H. S. Najafabadi, T. R. Hughes et al., "The human splicing code reveals new insights into the genetic determinants of disease," Science, vol. 347, 2015.
[13] S. Zhang, H. Hu, T. Jiang, L. Zhang, and J. Zeng, "Titer: predicting translation initiation sites by deep learning," Bioinformatics, vol. 33, 2017.
[14] D. Parks, J. X. Prochaska, S. Dong, and Z. Cai, "Deep learning of quasar spectra to discover and characterize damped Lyα systems," Monthly Notices of the Royal Astronomical Society, vol. 476, 2018.
[15] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199, 2013.
[16] A. Nguyen, J. Yosinski, and J. Clune, "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[17] Z. C. Lipton, "The mythos of model interpretability," arXiv, 2016.
[18] F. Doshi-Velez and B. Kim, "Towards a rigorous science of interpretable machine learning," arXiv, 2017.
[19] D. Pedreschi, F. Giannotti, R. Guidotti, A. Monreale, S. Ruggieri, and F. Turini, "Meaningful explanations of black box AI decision systems," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019.
[20] M. T. Ribeiro, S. Singh, and C. Guestrin, "Why should I trust you?: Explaining the predictions of any classifier," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
[21] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba, "Network dissection: Quantifying interpretability of deep visual representations," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[22] G. D. Stormo, T. D. Schneider, L. Gold, and A. Ehrenfeucht, "Use of the 'perceptron' algorithm to distinguish translational initiation sites in E. coli," Nucleic Acids Research, vol. 10, 1982.
[23] A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins et al., "Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI," Information Fusion, vol. 58, 2020.
[24] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba, "GAN dissection: Visualizing and understanding generative adversarial networks," in Proceedings of the International Conference on Learning Representations (ICLR), 2019.
[25] C. Yang, Y. Shen, and B. Zhou, "Semantic hierarchy emerges in deep generative representations for scene synthesis," International Journal of Computer Vision, 2021.
[26] A. Voynov and A. Babenko, "Unsupervised discovery of interpretable directions in the GAN latent space," in International Conference on Machine Learning, 2020.
[27] A. Plumerault, H. L. Borgne, and C. Hudelot, "Controlling generative models with continuous factors of variations," in International Conference on Learning Representations, 2020.
[28] E. Härkönen, A. Hertzmann, J. Lehtinen, and S. Paris, "GANSpace: Discovering interpretable GAN controls," in Advances in Neural Information Processing Systems, 2020.
[29] M. Kahng, N. Thorat, D. H. Chau, F. B. Viégas, and M. Wattenberg, "GAN Lab: Understanding complex deep generative models using interactive visual experimentation," IEEE Transactions on Visualization and Computer Graphics, 2018.
[30] R. Vidal, J. Bruna, R. Giryes, and S. Soatto, "Mathematics of deep learning," arXiv preprint arXiv:1712.04741, 2017.
[31] J. Bruna and S. Mallat, "Invariant scattering convolution networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.

[32] R. Eldan and O. Shamir, "The power of depth for feedforward neural networks," in Conference on Learning Theory, 2016.
[33] M. Nouiehed and M. Razaviyayn, "Learning deep models: Critical points and local openness," arXiv preprint arXiv:1803.02968, 2018.
[34] C. Yun, S. Sra, and A. Jadbabaie, "Global optimality conditions for deep neural networks," in International Conference on Learning Representations, 2018.
[35] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun, "The loss surfaces of multilayer networks," in Artificial Intelligence and Statistics, 2015.
[36] B. D. Haeffele and R. Vidal, "Global optimality in neural network training," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[37] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, 2014.
[38] P. Mianjy, R. Arora, and R. Vidal, "On the implicit bias of dropout," in International Conference on Machine Learning, 2018.
[39] H. Salehinejad and S. Valaee, "Ising-dropout: A regularization method for training and compression of deep neural networks," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
[40] B. Sengupta and K. J. Friston, "How robust are deep neural networks?" arXiv preprint arXiv:1804.11313, 2018.
[41] S. Zheng, Y. Song, T. Leung, and I. Goodfellow, "Improving the robustness of deep neural networks via stability training," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[42] E. Haber and L. Ruthotto, "Stable architectures for deep neural networks," Inverse Problems, 2017.
[43] B. Chang, L. Meng, E. Haber, L. Ruthotto, D. Begert, and E. Holtham, "Reversible architectures for arbitrarily deep residual neural networks," in Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[44] K. K. Thekumparampil, A. Khetan, Z. Lin, and S. Oh, "Robustness of conditional GANs to noisy labels," in Advances in Neural Information Processing Systems, 2018.
[45] A. Creswell and A. A. Bharath, "Denoising adversarial autoencoders," IEEE Transactions on Neural Networks and Learning Systems, 2018.
[46] U. G. Konda Reddy Mopuri and V. B. Radhakrishnan, "Fast feature fool: A data independent approach to universal adversarial perturbations," in Proceedings of the British Machine Vision Conference (BMVC), 2017.
[47] K. R. Mopuri, U. Ojha, U. Garg, and R. V. Babu, "NAG: Network for adversary generation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[48] Z. Zheng and P. Hong, "Robust detection of adversarial attacks by modeling the intrinsic properties of deep neural networks," in Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
[49] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi, "A survey of methods for explaining black box models," ACM Computing Surveys (CSUR), vol. 51, 2018.
[50] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, "The limitations of deep learning in adversarial settings," in 2016 IEEE European Symposium on Security and Privacy (EuroS&P), 2016.
[51] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard, "Universal adversarial perturbations," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[52] E. Gawehn, J. A. Hiss, and G. Schneider, "Deep learning in drug discovery," Molecular Informatics, vol. 35, 2016.
[53] B. Goodman and S. Flaxman, "European Union regulations on algorithmic decision-making and a 'right to explanation'," AI Magazine, vol. 38, 2017.
[54] European Parliament, Council of the European Union, "Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation)," Official Journal of the European Union, Apr. 2016, https://eur-lex.europa.eu/eli/reg/2016/679/oj.
[55] Y. Park and M. Kellis, "Deep learning for regulatory genomics," Nature Biotechnology, vol. 33, 2015.
[56] P. Baldi, P. Sadowski, and D. Whiteson, "Searching for exotic particles in high-energy physics with deep learning," Nature Communications, vol. 5, 2014.
[57] J. M. Hofman, A. Sharma, and D. J. Watts, "Prediction and explanation in social systems," Science, vol. 355, 2017.
[58] G. Montavon, W. Samek, and K.-R. Müller, "Methods for interpreting and understanding deep neural networks," Digital Signal Processing, vol. 73, 2018.
[59] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal, "Explaining explanations: An overview of interpretability of machine learning," in 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), 2018.
[60] Q.-s. Zhang and S.-c. Zhu, "Visual interpretability for deep learning: a survey," Frontiers of Information Technology & Electronic Engineering, vol. 19, 2018.
[61] W. Samek and K.-R. Müller, "Towards explainable artificial intelligence," in Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Springer, 2019.
[62] F. Bodria, F. Giannotti, R. Guidotti, F. Naretto, D. Pedreschi, and S. Rinzivillo, "Benchmarking and survey of explanation methods for black box models," arXiv preprint arXiv:2102.13076, 2021.
[63] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, and R. Sayres, "Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV)," in International Conference on Machine Learning, 2018.
[64] C. Molnar, G. Casalicchio, and B. Bischl, "Interpretable machine learning: a brief history, state-of-the-art and challenges," in ECML PKDD 2020 Workshops, 2020.
[65] M. T. Ribeiro, S. Singh, and C. Guestrin, "Anchors: High-precision model-agnostic explanations," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[66] P. Adler, C. Falk, S. A. Friedler, T. Nix, G. Rybeck, C. Scheidegger, B. Smith, and S. Venkatasubramanian, "Auditing black-box models for indirect influence," Knowledge and Information Systems, vol. 54, 2018.
[67] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller, "Unmasking Clever Hans predictors and assessing what machines really learn," Nature Communications, vol. 10, 2019.
[68] K. Natesan Ramamurthy, B. Vinzamuri, Y. Zhang, and A. Dhurandhar, "Model agnostic multilevel explanations," Advances in Neural Information Processing Systems, vol. 33, 2020.
[69] A. Dhurandhar, P.-Y. Chen, R. Luss, C.-C. Tu, P. Ting, K. Shanmugam, and P. Das, "Explanations based on the missing: Towards contrastive explanations with pertinent negatives," in Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018.
[70] Q. Zhang, Y. Nian Wu, and S.-C. Zhu, "Interpretable convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[71] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," arXiv preprint arXiv:1312.6034, 2013.
[72] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in European Conference on Computer Vision, 2014.
[73] P. W. Koh and P. Liang, "Understanding black-box predictions via influence functions," in Proceedings of the 34th International Conference on Machine Learning - Volume 70, 2017.
[74] B. Zhou, A. Khosla, À. Lapedriza, A. Oliva, and A. Torralba, "Object detectors emerge in deep scene CNNs," in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[75] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim, "Sanity checks for saliency maps," in Advances in Neural Information Processing Systems 31, 2018.
[76] O. Li, H. Liu, C. Chen, and C. Rudin, "Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[77] Y. Wang, H. Su, B. Zhang, and X. Hu, "Interpret neural networks by identifying critical data routing paths," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[78] Y. Goyal, Z. Wu, J. Ernst, D. Batra, D. Parikh, and S. Lee, "Counterfactual visual explanations," in Proceedings of the 36th International Conference on Machine Learning, vol. 97, 2019.
[79] K. Kanamori, T. Takagi, K. Kobayashi, and H. Arimura, "DACE: Distribution-aware counterfactual explanation by mixed-integer linear optimization," in Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), 2020.
[80] T. Wang, "Gaining free or low-cost interpretability with interpretable partial substitute," in International Conference on Machine Learning, 2019.
[81] L. Fu, "Rule learning by searching on adapted nets," AAAI, 1991.

[82] G. G. Towell and J. W. Shavlik, "Extracting refined rules from knowledge-based neural networks," Machine Learning, vol. 13, 1993.
[83] R. Setiono and H. Liu, "Understanding neural networks via rule extraction," IJCAI, 1995.
[84] R. Setiono and H. Liu, "NeuroLinear: From neural networks to oblique decision rules," Neurocomputing, vol. 17, 1997.
[85] K. Odajima, Y. Hayashi, G. Tianxia, and R. Setiono, "Greedy rule generation from discrete data and its use in neural network rule extraction," Neural Networks, vol. 21, 2008.
[86] R. Nayak, "Generating rules with predicates, terms and variables from the pruned neural networks," Neural Networks, vol. 22, 2009.
[87] J. M. Benitez, J. L. Castro, and I. Requena, "Are artificial neural networks black boxes?" IEEE Transactions on Neural Networks, vol. 8, 1997.
[88] J. L. Castro, C. J. Mantas, and J. M. Benitez, "Interpretation of artificial neural networks by means of fuzzy rules," IEEE Transactions on Neural Networks, vol. 13, 2002.
[89] M. Craven and J. W. Shavlik, "Extracting tree-structured representations of trained networks," Advances in Neural Information Processing Systems, 1996.
[90] R. Krishnan, G. Sivakumar, and P. Bhattacharya, "Extracting decision trees from trained neural networks," Pattern Recognition, vol. 32, 1999.
[91] O. Boz, "Extracting decision trees from trained neural networks," in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
[92] T. Pedapati, A. Balakrishnan, K. Shanmugam, and A. Dhurandhar, "Learning global transparent models consistent with local contrastive explanations," Advances in Neural Information Processing Systems, vol. 33, 2020.
[93] D. Erhan, Y. Bengio, A. Courville, and P. Vincent, "Visualizing higher-layer features of a deep network," University of Montreal, vol. 1341, 2009.
[94] F. Wang, H. Liu, and J. Cheng, "Visualizing deep neural network by alternately image blurring and deblurring," Neural Networks, vol. 97, 2018.
[95] A. Mahendran and A. Vedaldi, "Understanding deep image representations by inverting them," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5188–5196.
[96] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson, "Understanding neural networks through deep visualization," ICML Deep Learning Workshop, Jun. 2015.
[97] A. Nguyen, A. Dosovitskiy, J. Yosinski, T. Brox, and J. Clune, "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks," NIPS, 2016.
[98] C. Olah, A. Mordvintsev, and L. Schubert, "Feature visualization," Distill, 2017.
[99] R. Fong and A. Vedaldi, "Net2Vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[100] F. Dalvi, N. Durrani, H. Sajjad, Y. Belinkov, D. A. Bau, and J. Glass, "What is one grain of sand in the desert? Analyzing individual neurons in deep NLP models," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2019.
[101] G. Plumb, D. Molitor, and A. S. Talwalkar, "Model agnostic supervised local explanations," in Advances in Neural Information Processing Systems, 2018.
[102] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, "Striving for simplicity: The all convolutional net," arXiv preprint arXiv:1412.6806, 2014.
[103] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proceedings of the IEEE International Conference on Computer Vision, 2017.
[104] E. Štrumbelj and I. Kononenko, "An efficient explanation of individual classifications using game theory," Journal of Machine Learning Research, vol. 11, 2010.
[105] S. M. Lundberg and S.-I. Lee, "A unified approach to interpreting model predictions," in Advances in Neural Information Processing Systems, 2017.
[106] M. Ancona, C. Öztireli, and M. Gross, "Explaining deep neural networks with a polynomial time algorithm for Shapley value approximation," in Proceedings of the 36th International Conference on Machine Learning, vol. 97, 2019.
[107] T. Heskes, E. Sijben, I. G. Bucur, and T. Claassen, "Causal Shapley values: Exploiting causal knowledge to explain individual predictions of complex models," Advances in Neural Information Processing Systems, vol. 33, 2020.
[108] R. C. Fong and A. Vedaldi, "Interpretable explanations of black boxes by meaningful perturbation," in Proceedings of the IEEE International Conference on Computer Vision, 2017.
[109] L. M. Zintgraf, T. S. Cohen, T. Adel, and M. Welling, "Visualizing deep neural network decisions: Prediction difference analysis," arXiv preprint arXiv:1702.04595, 2017.
[110] J. Chen, L. Song, M. Wainwright, and M. Jordan, "Learning to explain: An information-theoretic perspective on model interpretation," in Proceedings of the 35th International Conference on Machine Learning, vol. 80, 2018.
[111] S. Wang, T. Zhou, and J. Bilmes, "Bias also matters: Bias attribution for deep neural network explanation," in International Conference on Machine Learning, 2019.
[112] A. Shrikumar, P. Greenside, and A. Kundaje, "Learning important features through propagating activation differences," in Proceedings of the 34th International Conference on Machine Learning, 2017.
[113] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, "On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation," PLoS ONE, vol. 10, 2015.
[114] M. Sundararajan, A. Taly, and Q. Yan, "Axiomatic attribution for deep networks," in Proceedings of the 34th International Conference on Machine Learning - Volume 70, 2017.
[115] A. Ghorbani, J. Wexler, J. Y. Zou, and B. Kim, "Towards automatic concept-based explanations," in Advances in Neural Information Processing Systems, 2019, pp. 9277–9286.
[116] S. Salman, S. N. Payrovnaziri, X. Liu, P. Rengifo-Moreno, and Z. He, "DeepConsensus: Consensus-based interpretable deep neural networks with application to mortality prediction," in 2020 International Joint Conference on Neural Networks (IJCNN), 2020.
[117] C.-K. Yeh, J. Kim, I. E.-H. Yen, and P. K. Ravikumar, "Representer point selection for explaining deep neural networks," in Advances in Neural Information Processing Systems 31, 2018.
[118] M. Wu, S. Parbhoo, M. C. Hughes, R. Kindle, L. A. Celi, M. Zazzi, V. Roth, and F. Doshi-Velez, "Regional tree regularization for interpretability in deep neural networks," in AAAI, 2020, pp. 6413–6421.
[119] M. Wu, M. C. Hughes, S. Parbhoo, M. Zazzi, V. Roth, and F. Doshi-Velez, "Beyond sparsity: Tree regularization of deep models for interpretability," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[120] G. Plumb, M. Al-Shedivat, Á. A. Cabrera, A. Perer, E. Xing, and A. Talwalkar, "Regularizing black-box models for improved interpretability," Advances in Neural Information Processing Systems, vol. 33, 2020.
[121] E. Weinberger, J. Janizek, and S.-I. Lee, "Learning deep attribution priors based on prior knowledge," Advances in Neural Information Processing Systems, vol. 33, 2020.
[122] M. Wojtas and K. Chen, "Feature importance ranking for deep learning," Advances in Neural Information Processing Systems, vol. 33, 2020.
[123] C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. K. Su, "This looks like that: deep learning for interpretable image recognition," in Advances in Neural Information Processing Systems, 2019.
[124] S. Wachter, B. Mittelstadt, and C. Russell, "Counterfactual explanations without opening the black box: Automated decisions and the GDPR," Harvard Journal of Law & Technology, 2018.
[125] M. W. Craven and J. W. Shavlik, "Using sampling and queries to extract rules from trained neural networks," in Machine Learning Proceedings 1994, 1994.
[126] R. Andrews, J. Diederich, and A. B. Tickle, "Survey and critique of techniques for extracting rules from trained artificial neural networks," Knowledge-Based Systems, vol. 8, 1995.
[127] A. B. Tickle, R. Andrews, M. Golea, and J. Diederich, "The truth will come to light: directions and challenges in extracting the knowledge embedded within trained artificial neural networks," IEEE Transactions on Neural Networks, vol. 9, 1998.
[128] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Wadsworth & Brooks/Cole Advanced Books & Software, 1984.
[129] J. R. Quinlan, C4.5: Programs for Machine Learning, The Morgan Kaufmann Series in Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[130] G. D. Plotkin, "A note on inductive generalization," Machine Intelligence, vol. 5, 1970.
[131] S. Mitra and Y. Hayashi, "Neuro-fuzzy rule generation: survey in soft computing framework," IEEE Transactions on Neural Networks, vol. 11, 2000.
[132] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014.

[133] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," ICLR, 2016.
[134] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., "Photo-realistic single image super-resolution using a generative adversarial network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[135] T. Minematsu, A. Shimada, and R.-i. Taniguchi, "Analytics of deep neural network in change detection," in 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2017.
[136] T. Minematsu, A. Shimada, H. Uchiyama, and R.-i. Taniguchi, "Analytics of deep neural network-based background subtraction," Journal of Imaging, 2018.
[137] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller, "How to explain individual classification decisions," Journal of Machine Learning Research, vol. 11, 2010.
[138] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[139] M. Ancona, E. Ceolini, C. Öztireli, and M. Gross, "Towards better understanding of gradient-based attribution methods for deep neural networks," in International Conference on Learning Representations, 2018.
[140] A. Ghorbani, A. Abid, and J. Zou, "Interpretation of neural networks is fragile," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019.
[141] A.-K. Dombrowski, M. Alber, C. Anders, M. Ackermann, K.-R. Müller, and P. Kessel, "Explanations can be manipulated and geometry is to blame," Advances in Neural Information Processing Systems, vol. 32, 2019.
[142] J. Heo, S. Joo, and T. Moon, "Fooling neural network interpretations via adversarial model manipulation," in Advances in Neural Information Processing Systems, 2019.
[143] P.-J. Kindermans, S. Hooker, J. Adebayo, M. Alber, K. T. Schütt, S. Dähne, D. Erhan, and B. Kim, "The (un)reliability of saliency methods," in Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, 2019.
[144] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013.
[145] L. S. Shapley, "A value for n-person games," Contributions to the Theory of Games, vol. 2, 1953.
[146] R. Caruana, H. Kangarloo, J. Dionisio, U. Sinha, and D. Johnson, "Case-based explanation of non-case-based learning methods," in Proceedings of the AMIA Symposium, 1999, p. 212.
[147] J. Bien, R. Tibshirani et al., "Prototype selection for interpretable classification," The Annals of Applied Statistics, vol. 5, 2011.
[148] B. Zhou, D. Bau, A. Oliva, and A. Torralba, "Interpreting deep visual representations via network dissection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[149] W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K.-R. Müller, "Evaluating the visualization of what a deep neural network has learned," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, 2016.
[150] S. Hooker, D. Erhan, P.-J. Kindermans, and B. Kim, "A benchmark for interpretability methods in deep neural networks," in Advances in Neural Information Processing Systems, 2019.

Yu Zhang received the B.Eng. degree from the Department of Computer Science and Engineering, Southern University of Science and Technology, China, in 2017. He is currently pursuing the Ph.D. degree in the Department of Computer Science and Engineering, Southern University of Science and Technology, jointly with the School of Computer Science, University of Birmingham, Edgbaston, Birmingham, UK. His current research interest is interpretable machine learning.

Peter Tiňo (M.Sc., Slovak University of Technology; Ph.D., Slovak Academy of Sciences) was a Fulbright Fellow with the NEC Research Institute, Princeton, NJ, USA, and a Post-Doctoral Fellow with the Austrian Research Institute for AI, Vienna, Austria, and with Aston University, Birmingham, U.K. Since 2003, he has been with the School of Computer Science, University of Birmingham, Edgbaston, Birmingham, U.K., where he is currently a full Professor and Chair in Complex and Adaptive Systems. His current research interests include dynamical systems, machine learning, probabilistic modelling of structured data, evolutionary computation, and fractal analysis. Peter was a recipient of the Fulbright Fellowship in 1994, the U.K.–Hong-Kong Fellowship for Excellence in 2008, three Outstanding Paper of the Year Awards from the IEEE Transactions on Neural Networks in 1998 and 2011 and the IEEE Transactions on Evolutionary Computation in 2010, and the Best Paper Award at ICANN 2002. He serves on the editorial boards of several journals.

Aleš Leonardis is Chair of Robotics at the School of Computer Science, University of Birmingham. He is also Professor of Computer and Information Science at the University of Ljubljana. He was a visiting researcher at the GRASP Laboratory at the University of Pennsylvania, a post-doctoral fellow at the PRIP Laboratory, Vienna University of Technology, and a visiting professor at ETH Zurich and the University of Erlangen. His research interests include computer vision, visual learning, and biologically motivated vision, all in the broader context of cognitive systems and robotics. Aleš Leonardis was a Program Co-chair of the European Conference on Computer Vision 2006, and he has been an Associate Editor of IEEE PAMI and IEEE Robotics and Automation Letters, an editorial board member of Pattern Recognition and Image and Vision Computing, and an editor of the Springer book series Computational Imaging and Vision. In 2002, he coauthored a paper, Multiple Eigenspaces, which won the 29th Annual Pattern Recognition Society award. He is a fellow of the IAPR, and in 2004 he was awarded one of the two most prestigious national (SI) awards for his research achievements.
Ke Tang (Senior Member, IEEE) received the B.Eng. degree from the Huazhong University of Science and Technology, Wuhan, China, in 2002 and the Ph.D. degree from Nanyang Technological University, Singapore, in 2007. From 2007 to 2017, he was with the School of Computer Science and Technology, University of Science and Technology of China, Hefei, China, first as an Associate Professor (2007-2011) and later as a Professor (2011-2017). He is currently a Professor with the Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China. He has over 10000 Google Scholar citations with an H-index of 48, and has published over 70 journal papers and over 80 conference papers. His current research interests include evolutionary computation, machine learning, and their applications. Dr. Tang was a recipient of the Royal Society Newton Advanced Fellowship in 2015 and the 2018 IEEE Computational Intelligence Society Outstanding Early Career Award. He is an Associate Editor of the IEEE Transactions on Evolutionary Computation and has served as an editorial board member for several other journals.