Bachelor Thesis (Computer Science BSc)

Understanding Deep Neural Networks

Authors: David Kempf, Lino von Burg

Main supervisors: Dr. Thilo Stadelmann, Prof. Dr. Olaf Stern

Date: 08.06.2018

DECLARATION OF ORIGINALITY

Bachelor’s Thesis at the School of Engineering

By submitting this Bachelor’s thesis, the undersigned student confirms that this thesis is his/her own work and was written without the help of a third party. (Group works: the performance of the other group members is not considered as third party.)

The student declares that all sources in the text (including Internet pages) and appendices have been correctly disclosed. This means that there has been no plagiarism, i.e. no sections of the Bachelor thesis have been partially or wholly taken from other texts and represented as the student’s own work or included without being correctly referenced.

Any misconduct will be dealt with according to paragraphs 39 and 40 of the General Academic Regulations for Bachelor’s and Master’s Degree courses at the Zurich University of Applied Sciences (Rahmenprüfungsordnung ZHAW (RPO)) and subject to the provisions for disciplinary action stipulated in the University regulations.

City, Date: Signature:


The original signed and dated document (no copies) must be included after the title sheet in the ZHAW version of every submitted Bachelor thesis.

Zurich University of Applied Sciences

Abstract

Artificial Intelligence (AI) comprises a wealth of distinct problem solving methods and attempts, with its various subfields, to tackle different classes of challenges. During the last decade, huge advances have been made in the field and, as a consequence, AI has achieved great results in machine translation and computer vision tasks. In particular, Artificial Neural Networks (ANNs) won several international contests in pattern recognition between 2009 and 2012 and even reached human-competitive performance in recognition applications. Despite these impressive achievements, ANNs are still often perceived as incomprehensible black boxes due to the complexity of the underlying models. As ANNs find their way more and more into everyday processes, the need for understanding their inner workings is becoming central. This thesis therefore surveys the state of the art in debugging Convolutional Neural Networks (CNNs) and explores the question of how helpful the explanations provided by such methods are for improving the understanding of these classifiers. In order to investigate this question, an experimental testbed for applying debugging methods to CNNs was built and an experiment to evaluate these methods was carried out. An image recognition task related to the classification of cats and dogs served as the basis of the investigation. The goal was to analyze how the model arrives at its classification scores when it is presented with input images that belong to neither of the classes it was trained to detect. When using debugging methods to better understand CNNs, it is important that the results obtained from such methods explain not only the reasons for assigning a given image to a specific output class but also the reasons for not assigning it to a different output class. In this way, the results facilitate the understanding of the notion of classes that a CNN has developed from the data used to train it. Hence, part of the conducted experiment was to compare how differently the trained network behaves when it is fed with data that does not contain concepts corresponding to the output classes. Within the context of this work, state-of-the-art methods for debugging CNNs were reviewed and subsumed into a new taxonomy, and a characteristic subset of these methods was applied to a CNN trained on the Kaggle Dogs vs. Cats competition dataset.


Zusammenfassung

Artificial Intelligence (AI) comprises a multitude of different problem solving methods and, with its subfields, attempts to tackle different classes of challenges. During the last decade, very large advances have been made in this field of research and, as a consequence, AI has achieved very good results in machine translation and computer vision. In particular, Artificial Neural Networks (ANNs) won numerous international competitions in pattern recognition between 2009 and 2012 and even achieved human-like performance in recognition applications. Despite these impressive successes, ANNs are still often perceived as incomprehensible black boxes due to the complexity of the underlying models. The more ANNs become part of everyday processes, the more necessary it becomes to have an understanding of their inner workings. This Bachelor thesis therefore reviews the state of the art in debugging Convolutional Neural Networks (CNNs) and investigates the question of how helpful the explanations of such methods are for gaining a better understanding of these classifiers. To investigate this question, an experimental testbed for applying debugging methods to CNNs was built and an experiment for the evaluation of these methods was carried out. An image recognition task connected with the classification of cats and dogs served as the basis for the investigation. The goal was to analyze how the trained model arrives at its classification scores when the network is fed with images that belong to none of the output classes. When debugging methods are used to better understand the function of CNNs, it is important that the results obtained from such methods explain not only the reasons why a given image was assigned to a specific output class, but also the reasons why the same image was not assigned to a different output class. In this way, the results improve the understanding of what notion of the classes the CNN has developed from the training data. Therefore, one part of the experiment compared how differently the network behaves when it is fed with data that does not correspond to the output classes. Within the scope of this work, state-of-the-art methods for debugging CNNs were reviewed and classified into a new taxonomy. A characteristic selection of these methods was applied to a CNN trained on the Dogs vs. Cats competition dataset.


Preface

Acknowledgement

Our supervisor Dr. Stadelmann provided us with many inputs and practical tips while we were working on this Bachelor thesis; we would like to thank him for all his support during the whole work. We thank our second supervisor Prof. Dr. Stern for reviewing the drafts of our report and giving us feedback on them. We would further like to express our thanks to Mrs. Fernando for giving us language-related advice, and to Mr. Amirian for his support with the training of the Neural Network.


Contents

1 Introduction
  1.1 Motivation
  1.2 Research Question
  1.3 Limitations of Learning Capability
  1.4 Structure of Thesis

2 Prerequisites
  2.1 Artificial Neural Networks
  2.2 Convolutional Neural Networks

3 Literature Survey
  3.1 Scope
  3.2 Intuitive Interpretability
  3.3 Taxonomy Features
  3.4 Visualization of Feature Representation
  3.5 Visualization of Activation Attribution
    3.5.1 Code Inversion
    3.5.2 Occlusion Sensitivity
  3.6 Visualization of Learned Class-Concepts
    3.6.1 Optimisation
  3.7 Classification of Methods

4 Experiments
  4.1 Methodology
  4.2 Experimental Testbed
  4.3 Implementation of Debugging Methods

5 Results

6 Summary and Future Work

Listings
Bibliography
List of Figures
List of Tables

A Appendix
  A.1 Announcement
  A.2 Visualizations of ‘Unknown’ Classes

1 Introduction

This introductory chapter sets the frame for the thesis by naming its motivation and the question to investigate. First, the problem of lacking comprehension regarding Artificial Neural Networks is considered in section 1.1. After that, the research question of this work is defined in section 1.2. Section 1.3 then points out the limitations of a Neural Network’s learning capability. Finally, section 1.4 describes the structure of the remaining thesis.

1.1 Motivation

The field of Artificial Intelligence (AI) consists of a variety of distinct problem solving methods. Its several subfields aim at providing support in different matters and have already succeeded in doing so in areas like computer vision and natural language processing. Voice recognition and machine translation are examples of narrow tasks which, at least for some languages, AI systems are able to carry out at human level [RN03]. Although such systems are present in many modern devices, AI applications are still often perceived as little more than toys. Additionally, classifiers such as Neural Networks have frequently been referred to as incomprehensible black boxes, owing to the fact that they independently find structures in the processed material. The increasing application of AI technology makes it necessary to improve our understanding of it.

Artificial Neural Networks (ANNs) as a branch of Artificial Intelligence have been on researchers’ and pundits’ minds for a long time. Warren McCulloch and Walter Pitts described in 1943 a first computational model based on Neural Networks (NNs) which, according to them, would be able to calculate any realistic function. Since then, ANNs have followed the up-and-down course of their parent field, being investigated more and more theoretically, but without reaching the broad practical significance they have today [Sch15]. Since 2009, ANNs have again been gaining popularity, as networks developed within Jürgen Schmidhuber’s research group won several international competitions in pattern recognition and its applications [CMS12].

Pattern recognition methods like ANNs are also integral components of artificial agents in high-risk environments, where reliable behavior is crucial. For example, autonomous vehicles rely on the correct evaluation of visual inputs in order to ensure correct and safe behaviour in complex traffic situations. Hence it is key to understand how a Neural Network learns to distinguish between different inputs and how its final decisions come about. Because of the complexity of the models that are applied to pattern recognition tasks, the understanding of such models is often limited to empirical validation. Such measurements do not guarantee that the achieved precision carries over to real-world data. Even if a trained model performs well on new data, this does not assure that the network works accurately in general, nor that it is robust against targeted attacks specifically [MDFF16].

One type of the latter has recently become known as adversarial examples and is another aspect which points out the necessity of comprehending Deep Neural Networks (DNNs) [SZS+13]. Adversarial examples are a special type of input data which Neural Networks do not classify correctly despite otherwise performing arguably well. These samples are unlikely to occur in real data but are created intentionally to fool a Neural Network: it is enough to modify a given image slightly, imperceptibly to humans, in order to make a Convolutional Neural Network (CNN) misclassify it. Different works show that such attacks do not depend on feeding the modified input directly to a classifier, but also work if it is perceived in real time through a camera [KGB16]. Adversarial examples thereby demonstrate how important it is to have a good understanding of the inner workings of Neural Networks in order to avoid mistakes in application areas where even small errors can lead to catastrophic outcomes.
As a result, methods have to be employed to increase the interpretability of such models.


1.2 Research Question

This thesis therefore examines the question whether so-called debugging methods can be used to disclose patterns within the inner workings of Neural Networks and to make such structures visible in an interpretable way, both explaining the function and facilitating the understanding of these classifiers. To this end, this work focuses on an image recognition task with a CNN. Based on photos of cats and dogs and their true labels (a supervised learning task), it will be examined what features the network learns from the data when trying to classify the pictures. The aforementioned term “patterns” stands for any structure or characteristic property within the examined data which contributes to the final decision of the CNN. This could be, for example, a certain fur-related feature that leads the network to classify pictures with this property under one label or the other (as “cat” or “dog”).

Choosing an image recognition task such as the binary classification of cats and dogs provides the possibility to gain insights into the learning process of a Neural Network. Since this is a well-known task in the literature and a task which humans are able to solve easily, it is possible to concentrate impartially on the structures that the network is learning while training the model. This is based on the working assumption that a network is able to find spurious correlations even if there is no reason for them in the real world. For instance, a network could take a certain feature in a cat image to be a unique property of the class cat, while a human would never associate that particular feature with the concept of a cat. Such a task thereby demonstrates what notion of the data the network develops. This thesis focuses on visualising and interpreting these patterns and not on trying to predict the labels of new input images.

Addressing the problem of understanding ANNs, the goal of this work is to provide a comprehensible survey of the state of the art in debugging CNNs. Different approaches will be examined and applied to the mentioned classification task. Further, the reviewed methods will be subsumed into a new taxonomy using self-defined features. In the context of this work, the term “debugging” refers to the process of inspecting the inner workings of a Neural Network in order to improve the understanding of such classifiers. By contrast, debugging in the classical sense, that is, localizing a bug within the model which results in a poor learning curve, is not covered.

1.3 Limitations of Learning Capability

In the motivation section, it was mentioned that NNs could potentially calculate anything. Accordingly, the universality theorem proves that specific NN architectures can approximate any realistic function [Kub15]. In 2016, Zhang et al. demonstrated a similar result: DNNs can “[...] easily fit random labels” [ZBH+16]. By randomizing the training labels and leaving everything else unchanged in the setting of an image classification task, the authors were still able to achieve zero training error on several standard architectures trained on the CIFAR10 and ImageNet classification benchmarks. In other words, although the labels no longer matched the assigned pictures, the networks could still fit the training data perfectly. Additionally, even after replacing the true training images by completely random Gaussian noise, the CNN continued to fit the data with zero training error [ZBH+16]. These findings seem to confirm what was stated in the preceding section: NNs are capable of finding even difficult and unexpected correlations in data and, consequently, can learn (almost) anything.

However, these results could not be reproduced in the experiment initially conducted within the context of this thesis. Originally, it was intended to carry out the image recognition task on a different data set containing photos of computer science students and their assessment grades. The underlying question was “Are final grades predictable from resume photos?”. It was planned to apply the debugging methods to the model resulting from training a CNN on this task. The training of the network did not lead to a good performance: although common strategies were employed to account for the rather small and imbalanced data set, accuracy did not exceed random performance. On the one hand, this result might be a relieving answer to the provocative student question, since no real connection between appearance and grades is expected to exist. On the other hand, the result seems to diminish the promising universality of NNs. However, an extensive analysis regarding the poor performance has not been performed, since this work focuses on evaluating debugging methods. Nonetheless, the findings of Zhang et al. and the result of the first thesis experiment emphasize once more the importance of finding new and better ways to understand ANNs.

1.4 Structure of Thesis

Chapter 2 first gives an introduction to ANNs, needed to follow the rest of the thesis, and then explains the fundamental components of CNNs. Chapter 3 surveys and describes the different methods for debugging Neural Networks and classifies them into a new taxonomy on the basis of self-defined features. After that, the experiments performed with the debugging methods on the cats and dogs image classification task, and the challenges connected with them, are described in chapter 4. Following this, chapter 5 reports the results from applying the methods to different types of input images, and finally, chapter 6 discusses the experimental findings and sums up the thesis.

2 Prerequisites

The following two sections provide introductory explanations of the basic concepts required to understand the next chapters. Section 2.1 covers the fundamental concepts involved in an Artificial Neural Network. Section 2.2 then introduces Convolutional Neural Networks, which will be the focus of the remaining thesis.

2.1 Artificial Neural Networks

Inspired by the neuroscientific hypothesis that brain activity mainly corresponds to electrochemical activity in networks of neurons, AI research attempted early on to create Artificial Neural Networks. As stated in the introduction, McCulloch and Pitts devised a first computational model resembling such networks in 1943. They showed that the logical connectives AND, OR and NOT could be computed by such a model consisting of artificial neurons, and that in fact any desired functionality could be obtained by some network of arbitrary depth containing large numbers of connected neurons. Unfortunately, it was not yet known how to train such networks. After 1943, AI researchers became more and more interested in the abstract properties of Artificial Neural Networks, also called connectionist systems [RRND10]. These efforts finally led to AI’s first winter. In 1969, Minsky and Papert demonstrated in their book “Perceptrons: An Introduction to Computational Geometry” that perceptrons (the simplest form of ANNs) are crucially restricted in their expressiveness: they cannot represent the exclusive OR (XOR) operation. Additionally, computers did not have the processing power needed to tackle the computational complexity of large ANNs [RRND10, Sch15]. For these reasons, research funding for ANNs dwindled almost completely. In 1974, the rediscovered backpropagation technique solved the XOR problem and made it possible to accelerate the training of more complex networks. Rumelhart and McClelland renewed the interest in NNs in 1986 by demonstrating in their book “Parallel Distributed Processing” how well the backpropagation technique works in Neural Networks. Increasing computing power through the use of GPUs and distributed computing further expanded the application of ANNs to image and visual recognition problems. Deep Learning as a part of machine learning methods started to gain popularity, since networks could be deployed on deeper and deeper architectures [RRND10]. Starting in 2009, ANNs have been able to outperform competing approaches in several contests involving machine learning tasks such as handwriting recognition [GS09, Sch15]. The Neural Networks of Cireşan et al. were even the first to achieve human-competitive performance in traffic sign recognition [CMMS12]. ANNs are now definitely a main focus of research in the discipline of machine learning.

Artificial Neural Networks are computational models and learning systems. Such systems progressively improve their performance on a specific task by iteratively processing examples. They do this without being programmed for that task, i.e. they independently try to extract a pattern from the data of interest. Such a task could be the recognition of objects within an image. For instance, in the experiments carried out for this thesis, an ANN was trained to identify images containing cats and images containing dogs (an image recognition task). After the training of a NN finishes, the results are used to apply the same task to new, unseen data. Therefore, when using an ANN, the goal is to train it well enough on a well-crafted training set of examples so that it is subsequently able to run the same task on data that it has not seen before [RRND10, vGB17].

Generally, ANNs are made up of four components: neurons, connections between neurons and their respective weights, a propagation function and a learning rule [Zel97]. There exists a variety of different ANN architectures, which usually are strong at solving one specific class of problem tasks such as object segmentation. The architectures mainly differ from each other in how they combine and put together the four general components of ANNs. Some architectures introduce further components with special functions. For example, Convolutional Neural Networks, which will be described in the next section, form a popular architecture of ANNs; besides the standard components, CNNs use additional operations like convolution and pooling. The basic unit of an ANN is the artificial neuron. Figure 2.1 depicts the simple model for a neuron proposed by McCulloch and Pitts [RRND10].

Figure 2.1: Mathematical model of an artificial neuron by McCulloch and Pitts. The neuron receives a set of inputs and weights which are linearly combined by the input function and then passed on to the activation function, generating the output of the neuron [RRND10].

The neuron comprises several parts. In the first part, it receives data $a_i$ through its input links from either a preceding unit $i$ or from the initial input itself. Each input link has a weight $w_{i,j}$ assigned to it; the indices $i$ and $j$ denote the link’s beginning and its end, respectively. The weight of a link determines how strongly the values flowing through that link are considered. Data $a_i$ and weights $w_{i,j}$ are linearly combined by the input function $in_j$, which corresponds to the propagation function. The resulting sum is then passed as an input to the second part of the neuron, the activation function $g$. This function calculates the actual output $a_j$ of the neuron. After that, the output is forwarded through the output links of the neuron to the subsequent units [RRND10, Zel97]. The process of values flowing from input to output of the network in this manner is called forward propagation [Kub15]. The three parts of the neuron are summarized in Formula 2.1.

\[ a_j = g\Big( \sum_{i=0}^{n} w_{i,j}\, a_i \Big) \qquad (2.1) \]
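Before the individual parts are discussed, Formula 2.1 can be stated as a minimal code sketch (plain Python with NumPy; the sigmoid activation and all concrete values are illustrative assumptions, not taken from the thesis experiments):

import numpy as np

def sigmoid(x):
    # Logistic activation function g
    return 1.0 / (1.0 + np.exp(-x))

# Example inputs a_i and weights w_ij for one neuron j
a = np.array([1.0, 0.5, -0.3])   # activations from preceding units
w = np.array([0.2, -0.4, 0.7])   # weights of the incoming links

# Formula 2.1: a_j = g(sum_i w_ij * a_i)
a_j = sigmoid(np.dot(w, a))
print(a_j)                        # a value in (0, 1)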

The formula uses the symbol $g$ to represent the activation function, which carries out a significant task: it is responsible for determining what the activation value $a_j$ of the respective unit should be. The weighted sum $\sum_{i=0}^{n} w_{i,j} a_i$ resulting from the input function $in_j$ can be a value of arbitrary size. To implement a learning system consisting of several units, something is needed to decide, based on the weighted sum, whether subsequent units should consider this value produced by neuron $j$ or not.

The activation function does this by mapping the result of the input function $in_j$ to a value in a specific, usually smaller range. When the transformed value exceeds a certain threshold, the activation function activates or fires this unit. In this way, a single unit “[...] implements a linear classifier [...]” [RRND10]. Different sorts of activation functions with different properties exist that can be used in place of $g$. Normally, a hard threshold or a logistic function (sigmoid) is used. Both of these functions are nonlinear. This is important because a network of units employing such a nonlinear activation function is then capable of representing a nonlinear function [RRND10].

When the model for a single neuron is determined, the next step is to use many instances of it to build a network. This can be done in two ways: (1) The neurons are connected to form a directed acyclic graph, which means that data can only flow in one direction from the input to the output of the network; no loops are allowed. This structure is called a feed-forward network. (2) Neurons are additionally connected to neurons lying behind them, which makes it possible to feed the output of a neuron back into the input of a preceding neuron. Such a structure is called a recurrent network but will not be explained further.

Figure 2.2: Two types of networks. (a) A perceptron network with two inputs and one layer of output neurons. (b) A multilayer network with two inputs, one layer of hidden neurons and one layer of output neurons [RRND10].

The neurons of ANNs are usually organized into layers so that the output of neurons in one layer is connected to the inputs of neurons in the next layer via a directed link. In this way, a certain unit can only receive input from units of the immediately preceding layer. Layers are called fully-connected if each neuron of a layer has a link to every neuron of the next layer. Depending on the number of layers used, single-layer networks are distinguished from multilayer networks [RRND10]. Figure 2.2 depicts the two network types.

A single-layer feed-forward Neural Network comprises only one layer of output neurons, thereby connecting the input units directly to the outputs of the network. Such networks are also called perceptrons. More precisely, a perceptron with $y$ output neurons is actually $y$ separate perceptrons, since each weight $w_{i,j}$ always affects only one output neuron, as shown in Figure 2.2 (a). As already mentioned at the beginning of this section, the perceptron cannot represent the XOR function; in general, it fails to learn functions which are not linearly separable. Nonetheless, perceptrons are able to describe some complex boolean functions compactly and are therefore still used today. A multilayer feed-forward Neural Network, in contrast, has additional layers of hidden neurons that are not connected to the outputs of the network. Consequently, every weight $w_{i,j}$ has an influence on each output neuron, as can be seen in Figure 2.2 (b). Any layer that lies between the input and the last layer of output neurons is called hidden, because it is not exposed to the outside [RRND10]. By using more hidden neurons, it is possible to approximate any continuous function of the input units with arbitrary accuracy. This is proved by the universality theorem, which was already mentioned in the introduction. However, what this theorem does not state is how many neurons are required and which individual values the weights of the network should have. Thus, it is known that a “[...] solution exists, yet there is no guarantee we will ever find it.” [Kub15]. Without solving the problem of unknown weights, an ANN will not be able to calculate an approximation for the task function. The weights need to be found somehow; this is achieved by using a learning rule [Zel97].

The backpropagation technique is an example of such a learning rule; it is used to find the weights for a given task and works as follows. At first, all the weights $w_{i,j}$ of the links between layers are initialised with random values. Training an ANN with these random weights would of course lead to a large error in the resulting predictions. In order to make the network learn the task of interest, the values of its weights need to be adjusted with respect to this error so that the ANN gets better at approximating the solution. As illustrated in Figure 2.2 (b), each weight $w_{i,j}$ affects every output neuron. This means that basically each output neuron is a function of the input values and weights. The backpropagation method uses this relationship to approximate the weights by considering the prediction error [RRND10, Kub15].

\[ \frac{\partial}{\partial w} \sum_{k} (y_k - a_k)^2 \qquad (2.2) \]

Formula 2.2 shows how the error is computed with respect to the weights for the neurons of the output layer. $y_k$ denotes the output calculated by the network, $a_k$ the true solution. The gradient of this difference is determined with respect to the weights, and this is done for all $k$ neurons of the output layer. The value resulting from the formula combines the different contributions from the weights to the overall error. Therefore, by backpropagating the error through the whole network, the weights can be adjusted accordingly. The backpropagation is repeated a specific number of times or until the prediction error falls below a predetermined threshold [RRND10, Kub15].
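As an illustration, the following sketch (plain Python with NumPy; the layer sizes, learning rate and sigmoid activation are assumptions made purely for demonstration) performs gradient descent on the squared error of Formula 2.2 for a single-layer network:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy single-layer network: 3 inputs, 2 output neurons
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))        # weights w_ij, initialised with random values
x = np.array([0.5, -1.0, 0.25])    # input activations
a = np.array([1.0, 0.0])           # true solutions a_k

for _ in range(100):
    y = sigmoid(W @ x)             # forward propagation: outputs y_k
    # Gradient of sum_k (y_k - a_k)^2 w.r.t. W (chain rule through the sigmoid)
    grad = (2 * (y - a) * y * (1 - y))[:, None] * x[None, :]
    W -= 0.5 * grad                # adjust the weights against the gradient

print(sigmoid(W @ x))              # predictions move towards [1.0, 0.0]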

2.2 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are one special class of ANNs which have proven effective in specific tasks, especially those related to the machine learning subfield of computer vision. This particular variant of ANN architectures was first proposed by LeCun et al. in 1998, after many iterations of improvements during a whole decade [LBBH98, Cul17]. Since then, CNNs have been receiving more and more attention and are now an important part of machine learning methods. Supervised CNN architectures were the first to achieve human-comparable results in computer vision; the DNNs of Cireşan et al. were even able to outperform humans in the image classification task of traffic sign recognition [CMS12, Sch15]. Apart from their application in the mentioned visual tasks, CNNs are also successful in natural language processing tasks such as sentence classification [Kar16].

CNNs are very strong in tasks where image data needs to be processed. Although it is possible to use standard fully-connected feed-forward networks for this purpose as well, some fundamental problems remain with doing so. First, images are usually very large, containing hundreds of pixels. Thus, a fully-connected first layer with one hundred hidden units “[...] would already contain several thousands of weights” [LBBH98]. Such a high number of parameters leads to an increased system capacity and, consequently, requires a bigger training set. Additionally, a large number of parameters demands more memory and computation time [Cul17]. Secondly, ordinary DNNs are not invariant to translations and distortions of image data. This means that when a DNN is presented with an arbitrary image and subsequently with a shifted version of it, it will not be able to recognize that both images contain the same concept. Thirdly, fully-connected layers consider the individual pixels of an image as independent input features. Since images are locally structured in two dimensions and neighbouring pixels are highly correlated, this spatial information is lost once image data is fed to fully-connected layers [LBBH98, Cul17].

Figure 2.3: A simple CNN architecture [Kar16].

The special architecture of CNNs allows these problems to be handled. CNNs are composed of the same components as ANNs but introduce additional components with special functions: convolutional layers, a non-linearity (ReLU) and pooling layers. Figure 2.3 depicts a simple CNN architecture with its components [Kar16]. These components are normally alternated several times until the fully-connected layer is reached.

The convolutional layers perform a convolution operation on the input; CNNs are named after that operation. The purpose of the convolution step is to extract features from a given input image by using small, square filters. In this way, the problem of missing spatial correlation between pixels is avoided. The filter, also called kernel, is slid over the input matrix (the input image) and, for each position, the element-wise multiplication between the two matrices is computed. The resulting products are added up to obtain a single value. Therefore, one value is generated for each position, and all these values are combined into the output matrix (the feature map). By performing these steps, features within the input image are detected. The actual values of the filter matrix determine what sort of features are extracted. Similar to the concrete values of an ANN’s weights, the concrete values of a CNN’s filters are adjusted with respect to the output error during the training phase [Kar16].

As illustrated in Figure 2.3, convolutional layers are followed by a rectified linear unit (ReLU). This operation replaces any negative value in the pixels of the feature map by zero. Basically, ReLU brings non-linearity into the training process of a CNN. This is important because real-world data is for the most part non-linear, whereas the dot product calculated within the convolution step is a linear operation. By removing all the negative values in the feature maps, ReLU accounts for non-linearity in the network [Kar16].
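The following sketch (plain Python with NumPy; the input and filter values are arbitrary illustrations) implements the described sliding-window convolution for a single filter, followed by the ReLU operation:

import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the input and compute, at each position,
    # the sum of the element-wise products (a 'valid' convolution).
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 input image
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])       # toy 2x2 filter

feature_map = conv2d(image, kernel)
relu_map = np.maximum(feature_map, 0)              # ReLU: negative values become zero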

Figure 2.4: Visualization of the pooling operation, using the maximum [Kar].

Pooling layers combine the results of neurons from one layer into a single neuron of the next layer, thereby reducing the dimensionality of feature maps [CMM+11]. The pooling operation is performed separately on each feature map and uses a spatial region which needs to be defined beforehand, for example a 2 x 2 window. Then, a representative is chosen for each of these regions according to some measure. Figure 2.4 shows the max pooling variant with a 2 x 2 window: max pooling takes the maximum value in each region (a short code sketch of this operation follows at the end of this section). The stride defines the number of positions by which the window is slid over the feature map. In this example, performing the pooling operation in this way results in a reduced feature representation of size 2 x 2. Generally, the size of the pooling output depends on the chosen window size and stride value. Pooling has the important effect of increasing the network’s invariance to translations and distortions in the input image, thereby solving the problem mentioned at the beginning of this section. When a given image is slightly modified, this will likely not change the output of the pooling operation, since the maximum in each pooling region is taken. At the same time, the reduced representation is easier to manage and reduces the complexity in the network, leading to a decreased number of computations [Kar16].

When convolution, ReLU and pooling are performed successively, features are extracted from the input data. Each block of successive convolution, ReLU and pooling learns more complex features than the last one. This means that basic building blocks such as edges and simple forms are combined into larger blocks until the highest-level block finally has a notion of complete objects. Since all spatial information is preserved, the output of the last convolution block can now serve as an input to the fully-connected layers. This group of layers is responsible for classifying the input image by means of the extracted features. The last fully-connected layer directly corresponds to the class predictions, with each output probability indicating the support for the respective class [Kub15, Kar16].

CNNs are trained in the same way as standard ANNs. First, all filters and weights are randomly initialized. Input images are fed to the network and propagated forward through convolutional layers, ReLUs, pooling layers and fully-connected layers. The output probabilities are then compared with the true values, yielding the total error. After that, backpropagation is used to compute the gradients of this error. The filters and weights are adjusted accordingly, and these steps are repeated until the error reaches a predefined threshold [Kar16, Kub15, RRND10]. Further details such as the parameters of the convolution operation and alternative pooling variants are covered in the blog post by Karn [Kar16].
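As referenced above, a minimal sketch of the max pooling operation (plain Python with NumPy; 2 x 2 window and stride 2, with arbitrary example values):

import numpy as np

def max_pool(feature_map, size=2, stride=2):
    h = (feature_map.shape[0] - size) // stride + 1
    w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            region = feature_map[i*stride:i*stride+size,
                                 j*stride:j*stride+size]
            out[i, j] = region.max()   # representative: the maximum of the region
    return out

fm = np.array([[1.0, 1.0, 2.0, 4.0],
               [5.0, 6.0, 7.0, 8.0],
               [3.0, 2.0, 1.0, 0.0],
               [1.0, 2.0, 3.0, 4.0]])

print(max_pool(fm))   # reduced 2x2 representation: [[6. 8.] [3. 4.]]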

3 Literature Survey of Debugging Methods

This chapter identifies and discusses different methods of gaining insight into the inner workings of a DNN. A taxonomy for classifying these methods is developed and applied to the current state of the art of debugging DNNs. First, Section 3.1 defines the scope for the methods of interest. Section 3.2 provides a bird’s-eye view of the problem of interpretability in general. After that, the Taxonomy Features section introduces the characteristics which are used to classify the debugging methods. Finally, the remaining sections describe the considered methods along the hierarchy introduced in Section 3.1.

3.1 Scope

Several distinct ways exist to analyse the processes running within a DNN. Figure 3.1 presents a hierarchical categorization of methods that will be used within this thesis.

Methods for Debugging DNNs
├── Realtime-based
│   └── Data Flow
└── A-posteriori
    ├── Visualization *
    │   ├── Activation Attribution *
    │   └── Class-Concepts *
    ├── Textual Explanation
    └── Numerical Graph

Figure 3.1: Hierarchical taxonomy of methods for debugging DNNs. The debugging types marked with an asterisk (set in bold in the original figure) will be covered within this thesis.

As illustrated in Figure 3.1, approaches for understanding DNNs can be divided into a realtime-based and an a-posteriori group of methods. The former group contains methods for inspecting the data flow during the training phase, thereby analysing performance and architectural aspects. The most obvious example of a method belonging to this group is TensorBoard [ten], a tool that graphically represents a network’s structure. The latter group consists of methods for understanding, after training, the representation that a DNN has learned of its input. Methods of this group can further be subdivided according to the way in which the understanding is attempted (visual-, text- or numeric-based). An example of a visual-based approach is the Deconvolutional Network by Zeiler and Fergus [ZF14], which will be discussed in section 3.5. The InterpNET NN design paradigm by Barratt is a representative of text-based explanation methods; it can be combined with any present classification architecture to generate natural language explanations for the classifications [Bar17]. Thoma lists further methods for both groups [Tho17].

Since this work examines the example case of an image classification task and approaches it using a CNN, it will focus on methods for understanding this kind of DNN. Realtime-based methods such as TensorBoard can be used independently of the task in question. This means that such methods often do not provide informative insights into the processes involved in specific tasks. Moreover, these methods are already widely known and employed by most specialists working with DNNs. Therefore, the thesis focuses on methods for inspecting models a posteriori and ignores methods for runtime inspection.


3.2 Dangers in Human Intuitive Interpretability of Feature Representation in DNNs

The representation of the features of the training data that a DNN learns is highly abstract, consisting of millions of numeric weights [YCN+15]. This representation is not directly interpretable, even to experts. Methods for modelling these representations in ways accessible to and interpretable by human experts are therefore key to understanding how a DNN works. We always have to keep in mind that DNNs, even though they are loosely modelled after (mammalian) brain functions, still do not work in the same way as our brains do. Since we cannot directly interpret the learned representation, we need to transform (and usually simplify) the model in some way to render it interpretable. This process inherently leads to a new model that does not accurately describe the original model, because its aim is to be more persuasive or convincing to us [Her17]. This means that we need to be careful about the methods we use for interpreting learned models. Adversarial examples are one illustration of how this can lead to problems: the small perturbations have some similarities to the representations that visualisation methods produce if no regularisation is applied to the result.

3.3 Taxonomy Features

This section explains the six features used in the subsequent sections to classify the methods, namely: Resolution of Feature, Subject of Explanation, Computational Overhead, Architectural Adaptation Necessary, Model-Agnostic and Descriptive vs. Persuasive.

The first feature, “Resolution of Feature”, names the level of units in the network that is considered by the particular method. Most methods operate on a certain network unit level, e.g. on layer- or class-level. The resolution of a method is an important feature because it helps to place the method’s result in context. For instance, a method that works on neuron-level will (probably) provide an insight into the function of a neuron, but will not be able to tell what notion of the data the levels above have.

“Subject of Explanation” denotes what a method’s result is an explanation for. Different methods aim at explaining different aspects of Neural Networks. For example, the Deconvolutional Network method by Zeiler and Fergus is an explanation for the concept of a unit (feature map), while the Gradient-Based methods by Simonyan et al. are an explanation for the concept of a class. Since a single method cannot explain every detail of a Neural Network, it is important to know what aspect a particular method tries to explain.

The “Computational Overhead” names the additional computational expense generated by applying the method, in relation to training or using the network without the method. Depending on the method, this feature can be interpreted in two ways: if the method of interest is applied during the training phase of the network, then the feature’s value stands for the computational overhead in relation to training the network without the use of the method. If the method is applied after the training phase, the feature tells the same, but in relation to using the network for predictions without applying the method. Using a debugging method could increase the computing time, which might be unsatisfactory in certain cases. Thus, the computational overhead of a method is an important feature as well.

“Architectural Adaptation Necessary” estimates the degree of architectural modifications that need to be made in order to take advantage of the method. For example, the Deconvolutional Network by Zeiler and Fergus requires additional interfaces at each layer of the CNN to pass information to the reconstruction components of the method [ZF14]. The degree of required adaptation will have a strong impact on the decision whether a certain method should be used to debug a CNN or not.

The “Model-Agnostic” feature determines whether a particular method is independent of a specific model. Unfortunately, only few methods are completely model-agnostic and can therefore be applied to various architectures. Often, the model needs to be adapted, as expressed by the preceding feature. This could be a disqualifying factor for the use of a certain method, because in most cases a given model needs to be examined and any change to it is undesirable. In addition, comparing the results of a method relying on a specific model across different architectures might require many modifications and therefore be unfeasible.

The last feature, “Descriptive vs. Persuasive”, includes the problem of interpretability in the method classification by establishing at which place on a scale ranging from descriptive to persuasive a given method operates. As described in the preceding section 3.2, interpreting a model requires careful examination of the method in use [Her17]. For example, a certain visualization method could describe the actual inner workings of a CNN more accurately but might not produce helpful visualizations. In contrast, another method could modify a given model in a way which leads to more convincing visualizations but might not accurately cover the function of the CNN in question.

To classify the methods precisely, Table 3.1 lists the possible values for each feature. Values enclosed by square brackets indicate what kind of information is provided by the corresponding feature.

Table 3.1: Possible values for the different taxonomy features.

Feature                               Values
Resolution of Feature                 Neuron, Channel, Layer, Class
Subject of Explanation                Classification, Concept of Class (image-specific or not), Concept of Unit (image-specific or not)
Computational Overhead                [Relative factor]
Architectural Adaptation Necessary    [Degree of necessary modifications]
Model-Agnostic                        Yes, No
Descriptive vs. Persuasive            [Place on this scale]

3.4 Visualization of Feature Representation in Image Classification CNNs

For CNNs applied to image classification tasks, the intuitive approach is to visualise the representation in relation to its input data. This can either be done for a specific input image by projecting the activation of parts of the network back to the input pixel space or for all of the training data by optimising an input for maximum activation of parts of the network.

3.5 Visualization of Activation Attribution

Table 3.2: Overview of methods described in depth in the following subsections.

Type                    Paper                          Method                                              Section
Code Inversion          Zeiler and Fergus [ZF14]       Deconvolutional Network                             3.5.1
                        Simonyan et al. [SVZ13]        Gradient-Based methods                              3.5.1
                        Springenberg et al. [SDBR14]   Guided Backpropagation                              3.5.1
Occlusion Sensitivity   Zeiler and Fergus [ZF14]       Occlusion Sensitivity Analysis                      3.5.2
                        Selvaraju et al. [SCD+16]      Gradient-weighted Class Activation Mapping          3.5.2
                        Selvaraju et al. [SCD+16]      Guided Gradient-weighted Class Activation Mapping   3.5.2


3.5.1 Code Inversion

Deconvolutional Network

Zeiler and Fergus proposed one of the first approaches for visualizing features at different layers [ZF14]. According to them, the understanding of a CNN “[...] requires interpreting the feature activity in intermediate layers”. They describe a way to project these activities back to the input pixel space. This method thereby shows what input pattern caused a particular activation in the feature maps.

Figure 3.2: Concept of a deconvnet. Top: A deconvnet layer (left) is attached to each CNN layer (right). The deconvnet creates an approximate reconstruction of the activity in the layer beneath. Bottom: Switch variables are used to store the locations of the maxima within each pooling region (colored zones) during pooling operations in the CNN. The black/white bars indicate negative/positive activations in the feature map [ZF14].

This approach uses a Deconvolutional Network (deconvnet), as described by Zeiler et al., to reverse the mapping of pixels to features [ZTF11]. Figure 3.2 (top) illustrates the concept of such a deconvnet. In order to examine a CNN, a deconvnet is attached to each layer of that CNN. Each deconvnet layer uses the same components as a regular CNN layer does, but in reverse: (1) unpooling, (2) rectification and (3) filtering. Passing the three components in this manner approximately inverts the operation of a single CNN layer. To project a specific activation back, all other activations in the layer are set to zero. Then the layer’s feature maps are passed as input to the attached deconvnet layer, where they run through the mentioned components. In this way, the activity in the layer beneath that caused the chosen activation is reconstructed. This is repeated for all layers of the CNN until the input pixel space is reached. Since the max pooling operation is non-invertible, Zeiler and Fergus suggest a set of “switch variables”, as shown in Figure 3.2 (bottom). The switch variables store the locations of the maxima in the pooling regions during the pooling operations in the CNN. The unpooling operation uses these switches to place the reconstructions from the layer above into the appropriate locations, thereby maintaining the structure of the trigger. As a result, an approximate inverse of the pooling operation is obtained. The unpooled maps are then fed into a rectified linear unit to assure positive features. The filtering component approximately inverts the convolution operation in the CNN: it uses transposed versions of the learned filters. The switch variables are specific to a given input image. Thus, processing an activation in this manner reconstructs a small piece of the original image, with structures weighted according to their activation contribution.
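To illustrate the role of the switch variables, the following minimal sketch (plain Python with NumPy; non-overlapping 2 x 2 pooling regions assumed) records the location of each maximum during pooling and uses it to place values back during unpooling:

import numpy as np

def max_pool_with_switches(fm, size=2):
    h, w = fm.shape[0] // size, fm.shape[1] // size
    pooled = np.zeros((h, w))
    switches = np.zeros((h, w, 2), dtype=int)   # stores row/col of each maximum
    for i in range(h):
        for j in range(w):
            region = fm[i*size:(i+1)*size, j*size:(j+1)*size]
            r, c = np.unravel_index(region.argmax(), region.shape)
            pooled[i, j] = region[r, c]
            switches[i, j] = (i*size + r, j*size + c)
    return pooled, switches

def unpool(pooled, switches, shape):
    # Approximate inverse of pooling: zero everywhere except at the stored maxima
    out = np.zeros(shape)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            r, c = switches[i, j]
            out[r, c] = pooled[i, j]
    return out

fm = np.random.rand(4, 4)                       # toy feature map
pooled, sw = max_pool_with_switches(fm)
reconstruction = unpool(pooled, sw, fm.shape)   # structure of the maxima preserved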

The deconvnet method visualizes the input stimuli that caused particular feature activations. An advantage of this is the possibility to observe the evolution of features for a certain input during training, if the features of all CNN layers are projected back. For instance, Zeiler and Fergus use the visualizations of the first layers of the architecture of Krizhevsky et al. [KSH12] to detect problems with this model. The method also “[...] provides a non-parametric view of invariance [on the features]” [ZF14]. One disadvantage of this approach is that it does not visualize the joint activity in a layer, but only a single activation; therefore it is not apparent what notion of the data a CNN layer might have. Furthermore, the method cannot be applied to the fully-connected layers of a CNN.

Gradient-Based

Simonyan et al. introduced two methods which generalize the deconvnet reconstruction procedure [SVZ13]. Both methods use a gradient-based approach. The first method visualizes class models, showing what notion of the data the network has learnt for a certain class. The second method computes a class saliency map for a particular image. Such a saliency map visualizes which pixels of the image influence the class of interest the most.

Class Model Visualization

This method builds on previous work by Erhan et al., who applied a similar approach to deep architectures such as Deep Belief Networks [EBCV09]. Simonyan et al. adapted the concept for use with CNNs. For a class of interest, the method generates an image which represents the class with regard to the class scoring model. This is done by performing an optimisation on the input image.

\[ \arg\max_{I} \big( S_c(I) - \lambda \lVert I \rVert_2^2 \big) \qquad (3.1) \]

Formula 3.1 shows the expression to be optimized: $S_c(I)$ is the score of the class $c$ for an image $I$, calculated by the classification layer of the CNN, and $\lambda$ denotes the regularisation parameter. The goal is to find an L2-regularised image $I$ that yields a high score $S_c$. To do this, Simonyan et al. use the backpropagation method known from the CNN training procedure, but optimize with respect to the input image while the weights stay fixed to those resulting from the preceding training phase. The optimisation is initialised with the zero image, and at the end, the training set mean image is added to the result. According to the authors, “A locally-optimal [image] I can be found [...]” in this way. Figure 3.3 shows three visualizations obtained with the method.

The class-model method produces, for a given class, an image which is representative of that class with regard to the class score. The image shows what structures, memorized by the CNN, contributed to the class score during training and therefore what notion the particular class has gained from the data. One advantage of this is that it becomes visible what the network looks for in a class, which improves the understanding of the classification. On the other hand, the method is not image-specific.
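A minimal sketch of this optimisation (PyTorch assumed; the model handle `model`, the 224 x 224 input resolution and all hyperparameters are illustrative assumptions; the final addition of the training set mean image is omitted):

import torch

def class_model_visualization(model, class_idx, steps=200, lr=0.1, lam=0.01):
    model.eval()
    # The optimisation is initialised with the zero image
    img = torch.zeros(1, 3, 224, 224, requires_grad=True)
    optimizer = torch.optim.SGD([img], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        score = model(img)[0, class_idx]            # class score S_c(I)
        loss = -score + lam * img.norm(p=2) ** 2    # maximise S_c(I) - lambda * ||I||_2^2
        loss.backward()                             # gradient w.r.t. the input image
        optimizer.step()                            # the trained weights stay fixed
    return img.detach()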

Image-Specific Class Saliency Visualization

The second method calculates an image-specific class saliency map by ranking the pixels of a certain input image $I_0$ based on their importance for the class $c$. Formula 3.2 demonstrates the ranking with a linear score model:

\[ S_c(I) = w_c^{T} I + b_c \qquad (3.2) \]


Figure 3.3: Class appearance models for three classes. The corresponding CNN was trained on the ILSVRC-2013 dataset [SVZ13].

$S_c(I)$ represents the score of the class $c$ for an image $I$, where the image is given in vectorized form; $w_c$ is the weight vector and $b_c$ the bias of the model. It is apparent that the magnitude of the elements of $w_c$ determines the importance of the image’s pixels for the class $c$. Since the class score $S_c(I)$ of a CNN is a non-linear function of the image, a different way has to be found to compute the ranking. Simonyan et al. tackle this issue by approximating $S_c(I)$ with a linear function (Formula 3.3).

\[ S_c(I) \approx w^{T} I + b \qquad (3.3) \]

They approximate the class score in the neighbourhood of a given image $I_0$ by calculating the first-order Taylor expansion, where $w$ is the derivative of the class score $S_c(I)$ with respect to the image $I$ at the point (image) $I_0$:

\[ w = \left. \frac{\partial S_c}{\partial I} \right|_{I_0} \qquad (3.4) \]

The magnitude of the derivative identifies the pixels which “[...] need to be changed the least to affect the class score the most” [SVZ13]. Such pixels represent areas within the image which are especially relevant for the specific class; for instance, this could be pixels corresponding to the object location in the image.

Given an m-by-n image $I_0$ and a class $c$, the saliency map $M \in \mathbb{R}^{m \times n}$ is created in two steps. First, the derivative $w$ is computed with the backpropagation method. The elements of $w$ then need to be rearranged in order to obtain an image representation (the saliency map). The type of the image determines how the rearrangement is done. For a greyscale image, the number of elements in $w$ equals the number of pixels in $I_0$; thus, the saliency map can be obtained by a one-to-one mapping with $M_{ij} = |w_{h(i,j)}|$, where $h(i,j)$ is the index of the element in $w$ which corresponds to the image pixel in the $i$-th row and $j$-th column. In the case of a multichannel image, $w$ contains values for all colour channels, so a single class saliency value has to be derived for every pixel $(i,j)$. The authors take the maximum magnitude of $w$ across all colour channels: $M_{ij} = \max_{c} |w_{h(i,j,c)}|$.

Figure 3.4: Image-specific class saliency maps for the highest-scoring class in ILSVRC-2013 test images [SVZ13]. Top: Two randomly selected test set images. Bottom: The corresponding saliency maps.

By using the gradients, this method is able to highlight the input areas that cause the largest change in the output. As illustrated in Figure 3.4 (bottom), the lighter regions of the pictures outline the main objects within the scene. Hence, the “[...] class saliency maps encode the location of the object [...]” which influenced the class score for a given image the most [SVZ13]. This provides intuition about the network’s final decision to assign a certain image to a particular class. Another advantage of the method is that it only requires a single backpropagation pass, making the calculation of the saliency map very quick. In contrast to the class-model method, this approach is image-specific.

Both gradient-based techniques can be used for the visualization of any layer and are not restricted to convolutional ones. As stated at the beginning, the two methods generalize the Deconvolutional Network method. Simonyan et al. are able to prove that both the unpooling and filtering components in a deconvnet are equal to the respective parts of their gradient approach [SVZ13]. However, the rectification component in the deconvnet reconstruction procedure slightly differs from its counterpart in the gradient calculation. The authors conclude that “[...] apart from the RELU layer, computing the approximate feature map reconstruction Rn using a DeconvNet is equivalent to computing the derivative ∂f/∂Xn using backpropagation [...]”.
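Putting the two steps together, a short sketch of the saliency computation (PyTorch assumed; the model handle `model` and the tensor shapes are illustrative assumptions, not part of the cited work):

import torch

def saliency_map(model, img, class_idx):
    # img: input image I_0 as a tensor of shape (1, 3, m, n)
    model.eval()
    img = img.detach().clone().requires_grad_(True)
    score = model(img)[0, class_idx]   # class score S_c(I_0)
    score.backward()                   # single backpropagation pass: w = dS_c/dI at I_0
    # M_ij = max_c |w_h(i,j,c)|: maximum magnitude across the colour channels
    return img.grad.abs().max(dim=1)[0].squeeze(0)   # saliency map of shape (m, n)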

Guided Backpropagation

Springenberg et al. presented the Guided Backpropagation method, which combines Zeiler’s and Fergus’ deconvnet approach with the Gradient-Based concept by Simonyan et al. [SDBR14]. As with the deconvnet, this method aims at projecting a given activation back into input pixel space to visualize features learned by a CNN. The approach builds upon a special architecture consisting only of convolutional layers.

In their work, Springenberg et al. question the necessity of the usual components within standard architectures for object recognition tasks. Normally, modern CNNs for object recognition use the same components: “[...] Alternating convolution and max-pooling layers followed by a small number of fully connected layers” [SDBR14]. The authors state that, over the last years, research has extensively focused on two directions to improve these standard architectures: (1) a lot of extensions for these standard pipelines were proposed, and (2) experiments with different architectural choices for CNNs in the context of large-scale object recognition were carried out. All these different extensions and architectures have their own parameters and training procedures. This motivated the authors to examine which components of CNNs are really necessary to achieve state-of-the-art performance on object recognition datasets. For this purpose, they replace both the pooling layers and the fully-connected layers with standard convolutional ones of small size. This results in a reduced architecture comprising only convolutional layers (“All Convolutional Net”), rectification components (rectified linear units) and a softmax layer. The exact structure of this architecture will not be discussed further.

As described before, the deconvnet reconstruction procedure first needs a forward pass through the network in order to record the positions of the maxima during the pooling operations; otherwise, the pooling operation cannot be inverted [ZF14]. By using these switches, the reconstructions of the deconvnet are related to a certain input image. Since the All Convolutional Net (All-CNN) by Springenberg et al. does not contain any pooling operations, no switches are needed. This means that reconstructions resulting from applying the deconvnet method to an All-CNN architecture are not related to an input image. As is known, higher layers in a CNN learn more invariant and discriminative features, and no single image maximally activates the units of these layers. This is the reason why Springenberg et al. were not able to obtain sharp visualizations for higher layers with the original deconvnet approach. Thus, a connection to an input image is necessary.

15 3.5. Visualization of Activation Attribution CHAPTER 3. LITERATURE SURVEY

The unpooling part in the Gradient-Based methods by Simonyan et al. equals the unpooling component in the deconvnet approach [SVZ13]. If pooling layers are absent, as in the case of the All-CNN, this method will not lead to sharp visualizations either, for the same reason. This indicates that another way is needed to compensate for the image-related information that is extracted with the pooling operations. Springenberg et al. solve this problem by combining the two methods with regard to the rectified linear units. Figure 3.5 demonstrates how the different methods behave when data flows through such a ReLU. It is apparent that the Gradient-Based methods by Simonyan et al. (second row) differ from the deconvnet approach by Zeiler and Fergus (third row) in the values which are set to zero: the former zeros out those values corresponding to the positions which were set to zero in the preceding forward pass, whereas the latter just zeros out the negative values of the current input.

Figure 3.5: Different concepts of propagating back through a ReLU [SDBR14]. First row: The ReLU replaces all negative values with zero during the forward propagation. The remaining rows depict the backpropagation procedure for the Gradient-Based methods, the deconvnet reconstruction and the Guided Backpropagation, respectively.

The fourth row in Figure 3.5 represents the Guided Backpropagation method by Springenberg et al. Rather than masking out values only with respect to either the bottom signal (Gradient-Based methods) or the top gradient signal (deconvnet), this method combines both. According to the authors, the bottom signal of the forward pass is able to preserve image-related information and therefore substitutes the switches. Simultaneously, the top gradient signal prevents negative gradients from flowing backwards, which would decrease the activation of the higher-layer unit to be visualized. The Guided Backpropagation method projects a certain activation back into input pixel space. For a given image and a chosen feature activation, the method thereby visualizes the patterns which caused that particular feature activation in a neuron. In contrast to the deconvnet approach, this method is capable of visualizing any layer of a network, since the underlying All-CNN architecture contains only convolutional layers. At the same time, the method depends on this very specific architecture; therefore, it cannot be applied when a particular, given model needs to be examined. The method produces sharp and clear visualizations.
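
The difference between the three backward rules can be illustrated with a small NumPy sketch; the input and gradient values are arbitrary example numbers in the spirit of Figure 3.5, not taken from [SDBR14].

    import numpy as np

    x = np.array([-1.0, 2.0, -3.0, 4.0])   # input to the ReLU (forward pass)
    R = np.array([0.5, -0.5, 1.5, -2.0])   # signal arriving from the layer above

    backprop  = R * (x > 0)                # gradient methods: mask by forward input
    deconvnet = R * (R > 0)                # deconvnet: mask by sign of the top signal
    guided    = R * (x > 0) * (R > 0)      # Guided Backpropagation: apply both masks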

3.5.2 Occlusion Sensitivity

Occlusion Sensitivity Analysis

Together with the Deconvolutional Network (3.5.1) described within the Code Inversion subsection, Zeiler and Fergus introduced another method concerned with the question of which parts of an input image contribute the most to the final classification [ZF14].


The Occlusion Sensitivity Analysis is done by occluding specific portions of a given input image with a grey square (the occluder). The altered image is then fed to the classifier, and the resulting classification output is compared with the output obtained from classifying the original, unoccluded image. If modifying the input image in this way leads to a different output, then a significant part of the image is occluded; otherwise, the occluded part does not seem to have a significant influence on the classification. This can be repeated for all possible locations of the occluder within the image, i.e. the grey square moves by one position, the resulting picture is again fed to the classifier, and so on. In this way, a map of the correct class probability as a function of the occluder’s position can be created. Such a map makes it possible to localize both the parts of a picture that make the largest contribution to the classification and the parts which are not relevant to it. This method helps to find out whether the model under examination really localizes the object within the picture or whether it just uses broad scene context to classify it. The obvious advantage of this approach is that it does not require modifications to the architecture of a given model; it can be applied immediately after the training phase has finished. On the other hand, this method does not consider the inner parameters of the network of interest, since it only demonstrates how changing an input to the network influences its output. Therefore, it might reveal the structures relevant to the network, but it does not make a statement about why these structures are important for the network.
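
A sketch of this sliding-occluder loop could look as follows; the predict function, occluder size, stride and grey value are illustrative assumptions, not values prescribed by [ZF14].

    import numpy as np

    def occlusion_map(predict, image, class_idx, size=16, stride=8, grey=0.5):
        # predict: assumed function mapping an image batch to class probabilities.
        h, w = image.shape[:2]
        rows = (h - size) // stride + 1
        cols = (w - size) // stride + 1
        heat = np.zeros((rows, cols))
        for i in range(rows):
            for j in range(cols):
                occluded = image.copy()
                occluded[i*stride:i*stride+size, j*stride:j*stride+size, :] = grey
                heat[i, j] = predict(occluded[np.newaxis])[0, class_idx]
        return heat  # correct class probability per occluder position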

Gradient-weighted Class Activation Mapping

The Gradient-weighted Class Activation Mapping (Grad-CAM) method was introduced by Selvaraju et al. [SCD+16]. It is a generalization of the Class Activation Mapping (CAM) approach by Zhou et al. [ZKL+15]. For a given input picture, Grad-CAM creates a localization map highlighting the regions which are relevant to the classification of this picture. Compared to the CAM approach, this method is applicable to a wide range of CNN architectures and is less computationally expensive. Selvaraju et al. state that in the process of designing a model for a specific task, there is typically a trade-off between accuracy and interpretability. For instance, they mention expert systems as being highly interpretable but not very precise. Deep models, on the other hand, sacrifice interpretability for higher performance by incorporating greater abstraction (more layers) and tighter integration. The CAM approach by Zhou et al. can only be applied to a special class of CNN architectures: it requires that feature maps directly precede softmax layers. Therefore, given models that are not based on such an architecture have to be modified by replacing fully-connected layers with convolutional ones and global average pooling [ZKL+15]. Since such “[...] architectures may reach poor accuracies compared to general networks on some tasks [...]”, the CAM method essentially trades off performance for more interpretability [SCD+16]. Thus, in order to generate a localization map for a state-of-the-art CNN architecture, a different approach is needed. Deeper layers in a CNN learn higher-level visual concepts [BCV12, EBCV09]. Additionally, convolutional feature maps retain spatial information, which is then lost in the subsequent fully-connected layers (if present in a given model). This is why the authors expect the last convolutional layer in a model to represent the best compromise between high-level semantics and spatial information. According to them, units within such a layer look for class-specific information in the given picture. The Grad-CAM method uses the gradients of the last convolutional layer to analyse the importance of each neuron within that layer for the final class decision.

Given an image and a class $c$ of interest, the computation of the localization map $L^c_{Grad-CAM} \in \mathbb{R}^{u \times v}$ of width $u$ and height $v$ for that class is done in four steps. First, the given image is propagated forward through the whole model to compute the class scores. Secondly, all gradients are set to zero except for the one representing the class $c$ of interest, which is set to one. Then, this signal is backpropagated to obtain the gradients $\partial y^c / \partial A^k$ of the score $y^c$ with respect to the feature maps $A^k$ of a convolutional layer, which are global-average-pooled in order to obtain the “neuron importance weights” $\alpha^c_k$ (Formula 3.5).

$\alpha^c_k = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{ij}}$   (3.5)


The neuron importance weight $\alpha^c_k$ describes the importance of feature map $k$ for the class $c$. Finally, the backpropagation signal is combined with the forward convolutional feature maps $A^k$ and then a ReLU operation is performed. Formula 3.6 demonstrates this last step.

$L^c_{Grad-CAM} = \mathrm{ReLU}\big(\sum_k \alpha^c_k A^k\big)$   (3.6)

This results in a localization map with image regions coloured according to their importance for the classification. Such a heatmap shows where the model has to look in order to assign the given image to the given class $c$; it has the same size as the feature maps of the last convolutional layer in the model. The ReLU operation in Formula 3.6 ensures that only features with a positive influence on the class of interest are taken into account: pixels belonging to such features increase the class score $y^c$, while negative pixels might contribute to other classes in the same input picture. The heatmap can either be shown on its own or be overlaid on the original input image to simplify the matching of colours with image regions. Grad-CAM highlights the regions in a certain input image which are significant for a class of interest. In contrast to methods such as Guided Backpropagation and the Deconvolutional Network, Grad-CAM is class-discriminative: if a given input picture contains concepts corresponding to several classes, the resulting heatmap for a chosen class will clearly emphasize the regions influencing that class the most. Ideally, regions influencing other classes at the same time will not be coloured, thereby facilitating a clear distinction between different concepts within the same picture. As already mentioned, this method is image-specific, which means that it can be used to examine why a model chose to classify a particular image in a specific way. A disadvantage of the Grad-CAM approach lies in its inability to visualize fine-grained structures as Code Inversion methods do. While Grad-CAM can easily localize relevant image regions, it does not explain the reason for assigning an instance to a certain class.
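
Given the feature maps and gradients as arrays, Formulas 3.5 and 3.6 reduce to a few lines; the sketch below assumes both tensors have already been computed for the last convolutional layer.

    import numpy as np

    def grad_cam(feature_maps, gradients):
        # feature_maps: A^k with shape (H, W, K); gradients: dy^c/dA^k, same shape.
        alpha = gradients.mean(axis=(0, 1))    # global average pooling (Formula 3.5)
        cam = np.tensordot(feature_maps, alpha, axes=([2], [0]))  # sum_k alpha_k A^k
        return np.maximum(cam, 0)              # ReLU keeps positive influences (3.6)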

Guided Gradient-weighted Class Activation Mapping

Besides the Grad-CAM method itself, Selvaraju et al. additionally presented the Guided Gradient-weighted Class Activation Mapping (Guided Grad-CAM) method, which is an extension of the former approach [SCD+16]. Guided Grad-CAM combines the results obtained from the Grad-CAM approach and the Guided Backpropagation method. In contrast to Grad-CAM, Guided Grad-CAM does not only offer class-discriminative visualizations, but simultaneously reveals the fine-grained details in the image as Code Inversion methods do.

According to Selvaraju et al., methods like Guided Backpropagation (3.5.1) and the Deconvolutional Network (3.5.1) yield pixel-space gradient visualizations in high resolution and are thereby able to emphasize the fine-grained details in the image. On the other hand, these methods are not class-discriminative, meaning that there is usually no significant difference between the resulting visualizations for two different classes within the same picture. Hence, when using these methods for pictures containing more than one class concept, it is often not comprehensible which of these concepts relates to the class of interest. Grad-CAM is class-discriminative, but in turn cannot make visible which structures in the image led to its specific classification. This is why the authors wanted to “[. . . ] combine the best of both worlds [. . . ]” [SCD+16].

A Guided Grad-CAM visualization for a given input image is created by combining the results for the same image from the Grad-CAM approach and the Guided Backpropagation method. The localization map $L^c_{Grad-CAM}$ obtained from Grad-CAM first needs to be up-sampled to the input image resolution; Selvaraju et al. use bi-linear interpolation for this. Then, the two visualizations from Grad-CAM and Guided Backpropagation are fused via point-wise multiplication. This results in a visualization showing class-discriminative features as well as the high-resolution details known from the Guided Backpropagation method. Figure 3.6 compares the visualizations obtained from the three methods.

Figure 3.6: Comparison of visualizations provided by different methods. (a) and (g) show the original image, containing the two classes cat and dog. (b) to (d) and (h) to (j) depict visualizations provided by different methods for the original image [SCD+16].

Guided Grad-CAM outputs a visualization depicting the original image with the regions highlighted that are relevant to the class of interest. Moreover, it illustrates which structures led the network to assign the image to that class. For example, it is clear from Figure 3.6 (d) and (j) which regions in the image correspond to the chosen class, namely the region containing the cat for (d) and the region containing the dog for (j). In contrast to the respective Guided Backpropagation visualizations in (b) and (h), Guided Grad-CAM unambiguously stresses only one of the two concepts (cat or dog). At the same time, the method highlights the patterns important for predicting an instance as one of the two classes: (d) emphasizes the stripes on the cat and (j) the characteristic facial features of the dog. An obvious drawback of Guided Grad-CAM is that, in order to make use of it, both the Grad-CAM method and the Guided Backpropagation approach have to be implemented. However, if the implementations are present anyway, it is very simple to combine the two to obtain expressive Guided Grad-CAM visualizations. For the combination with Grad-CAM, it is also possible to use the Deconvolutional Network approach instead of Guided Backpropagation, but according to the authors, using Guided Backpropagation generally leads to less noise in the resulting visualizations [SCD+16].
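
The fusion step itself is short; the sketch below assumes the Grad-CAM map and the Guided Backpropagation result are available as NumPy arrays and uses SciPy for the bi-linear up-sampling (an implementation choice, not mandated by [SCD+16]).

    import numpy as np
    from scipy.ndimage import zoom

    def guided_grad_cam(cam, guided_bp):
        # cam: u-by-v Grad-CAM map; guided_bp: H-by-W-by-C visualization.
        h, w = guided_bp.shape[:2]
        cam_up = zoom(cam, (h / cam.shape[0], w / cam.shape[1]), order=1)  # bi-linear
        return guided_bp * cam_up[..., np.newaxis]  # point-wise multiplication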

3.6 Visualization of Learned Class-Concepts

The latter of these is especially interesting since it shows us an approximation of the notion that a specific unit of the network represents. On a class level, we can see what, to the network, makes a dog a dog and a cat a cat. While it is easy for humans to visually recognize objects and assign a category to them, synthesizing the essence of a category is very difficult; this is the domain of all visual art. Comparing the styles of different painters with the visual representations at different levels of a CNN, a similar structure seems to exist.

3.6.1 Optimisation

Activation Maximization

In 2009, Erhan et al. suggested the Activation Maximization method, which is one of the first optimization-related approaches [EBCV09]. It can be applied to any trained Deep Neural Network. For a given unit in a hidden layer, this method searches for an input sample which maximizes the activation of that target unit.


Before the publication of this work, qualitative comparisons of features extracted by the first layers of deep architectures were already common in the literature. This was usually done by looking at the filters learned by the first layer of a model, i.e. at their representations in the input space [HOT06, HOWT06]. However, the representations learned beyond the first level were not yet sufficiently covered. Therefore, the goal of Erhan et al. was to investigate ways of making visible what units of arbitrary layers beyond the first level compute. In order to calculate such visualizations efficiently and to make them applicable to a wide variety of NN-like models, the authors wanted their method to operate in the input pixel space of images. The importance of explaining the behaviour of units lying in higher layers was confirmed by the experimental results of Erhan et al.: they showed that when evaluating two different models for an image recognition task, considering only the visualizations of the first layers might be deceiving and, consequently, lead to a wrong model choice [EBCV09]. The basic idea of the Activation Maximization method is to search for input patterns which maximize the activation of a chosen unit. According to the authors, this is a reasonable way of improving the understanding of a unit, since “[...] a pattern to which the unit is responding maximally could be a good first-order representation of what a unit is doing” [EBCV09]. The easiest approach to finding such patterns would be to identify input samples within the training or test set which cause the highest activation for a specific unit. Unfortunately, the found samples would then need to be combined appropriately if a higher-level unit is examined. Moreover, it could be that only a part of the input sample contributes to the high activation, and determining that part is difficult. Alternatively, instead of restricting the search for an input pattern to the training or test set, the search can be viewed as an optimization problem, as shown in Formula 3.7 [EBCV09].

$x^* = \arg\max_x \, h_{ij}(\theta, x)$   (3.7)

$\theta$ denotes the parameters (weights and biases) of the NN, and $x$ is the input sample. $h_{ij}(\theta, x)$ stands for the activation of a particular unit $i$ in the layer $j$ as a function of the parameters $\theta$ and the sample $x$. The latter has to be modified in some way so as to lead to the highest possible activation. Formula 3.7 represents a non-convex optimization problem, which means that only finding a local maximum should be feasible. Assuming that $\theta$ is fixed, gradient ascent in the input space can be used to search for a local solution. For a given unit, this is done in three steps: (1) $x$ is randomly initialized in some manner; for example, Erhan et al. initialized the pixels of $x$ in their experiments with arbitrary values from a uniform distribution over the interval [0;1]. (2) Then, the gradient of the activation $h_{ij}(\theta, x)$ of the unit $i$ with respect to $x$ is computed. (3) $x$ is moved one step in the direction of the resulting gradient. Steps (2) and (3) are repeated until the activation does not increase considerably anymore [EBCV09]. This procedure can lead to two outcomes when starting with different randomly sampled initializations: either the same maximum is found for all the start samples, or two or more local maxima are found. In both cases, the respective unit is characterized by the maximum or maxima found. When a set of local maxima is found, it is possible to average the results, to select the one maximizing the activation, or simply to show all the found maxima in order to explain that unit. The use of the Activation Maximization method requires the choice of hyperparameters, namely the learning rate and a stopping criterion; the latter could, for example, be the maximum number of gradient ascent iterations. Erhan et al. further observed in their experiments that the optimal learning rate found by carrying out gradient ascent for a given unit works for the other units in the same layer too, i.e. is able to maximize all of them. In addition, the authors experimented with different random initializations for a particular unit from the third layer of a Stacked Denoising Auto-Encoder [VLBM08]. Unexpectedly, almost all random initializations converged to mostly the same input pattern, which means that the maximum is found consistently. Therefore, Erhan et al. state that the response of a unit to input patterns seems to be unimodal [EBCV09]. Activation Maximization tries to find an input sample which maximizes the activation of a unit of interest; thereby, the method visualizes what structures excite that particular unit. Figure 3.7 shows how the higher layers of a Deep Belief Network look for increasingly complex patterns [HOT06]. Such visualizations are helpful to get a sense of the function of distinct units. According to the authors, Activation Maximization “[. . . ] tends to find sharper patterns” than similar approaches such as the linear combination of previous layers’ filters by Lee et al. [LEN08, EBCV09]. On the other hand, the method only appears to produce useful results if images with limited dimensions are used; Erhan et al. tested the method on image patches of 20 x 20 pixels, and the resulting visualizations were not interpretable anymore. This could be due to the fact that, when applying the method to architectures different from CNNs, it might be impossible to find a simple representation for a given higher-layer unit. The Activation Maximization method can be used for any network architecture that allows computing the gradients [EBCV09].

Figure 3.7: Activation Maximization method applied to an extended version of the MNIST digit classification dataset [GSL07]. The first, second and third columns show the visualizations obtained from the method for 36 different units of the first, second and third hidden layers of a Deep Belief Network, respectively [HOT06, EBCV09].
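
The gradient ascent loop of Formula 3.7 can be sketched as follows; the graph tensors, the fixed step count and the normalized update are illustrative assumptions rather than the exact procedure of [EBCV09].

    import numpy as np
    import tensorflow as tf

    def activation_maximization(sess, input_tensor, unit_activation, shape,
                                steps=200, lr=1.0):
        # unit_activation: scalar tensor h_ij for the unit of interest.
        grad = tf.gradients(unit_activation, input_tensor)[0]
        x = np.random.uniform(0, 1, size=shape)   # random initialization, step (1)
        for _ in range(steps):                    # step count as stopping criterion
            g = sess.run(grad, feed_dict={input_tensor: x[np.newaxis]})[0]  # step (2)
            x += lr * g / (np.linalg.norm(g) + 1e-8)                        # step (3)
        return x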

3.7 Classification of Methods

Table 3.3 shows the collocation of the methods within the developed taxonomy. Values marked with a superscript star symbol (⋆) were identified while conducting the experiments described in chapter 4. The most common Subject of Explanation is Concept of Class. This appears reasonable since, in the context of classification tasks, understanding the notion of a class that a model has developed from the data is very helpful. The most extensive architectural adaptations are necessary when using the Guided Backpropagation method: as mentioned, this method depends on a special architecture comprising only convolutional layers and ReLUs [SDBR14]. On the other hand, it creates sharper visualizations than similar approaches such as the Deconvolutional Network. It is apparent from the table that most of the considered methods are descriptive rather than persuasive. However, more recent methods tend to create more persuasive visualizations, which might be connected to the increasing need for understanding ANNs.


Table 3.3: Comparison of methods by means of taxonomy features.

Method                                              Resolution of Feature   Subject of Explanation                              Computational Overhead   Architectural Adaptation   Model Agnosticism   Descriptive vs. Persuasive

Code Inversion
Deconvolutional Network                             Neuron                  Concept of Unit (image-specific)                    O(n ∗ F)²                Slight                     No                  Descriptive
Class Model Visualization¹                          Class                   Classification, Concept of Class                    O(n)                     No                         Yes                 Descriptive
Image-Specific Class Saliency Visualization¹        Class                   Classification, Concept of Class (image-specific)   O(n)                     No                         Yes                 Descriptive
Guided Backpropagation                              Neuron                  Concept of Unit (image-specific)                    O(n)                     Slight                     No                  Descriptive

Occlusion Sensitivity
Occlusion Sensitivity Analysis                      Class                   Classification                                      O(n ∗ strides²)          No                         Yes                 Descriptive
Gradient-weighted Class Activation Mapping          Layer                   Concept of Class (image-specific)                   O(n)                     Slight                     Yes                 Persuasive
Guided Gradient-weighted Class Activation Mapping   Layer                   Concept of Class (image-specific)                   O(n)                     Slight                     Yes                 Persuasive

Optimization
Activation Maximization                             Neuron                  Concept of Unit                                     O(n ∗ steps)             No                         Partly              Descriptive

¹ Gradient-Based methods by Simonyan et al. [SVZ13]
² F is the number of feature maps in the target layer.

4 Experiments

The goal of this thesis is to examine methods that support developers and researchers in their understanding of the processes inside the ‘black box’ of Neural Networks, specifically of CNNs used for image classification. For this purpose, a task needs enough complexity to reveal the differences between the available methods, but should conversely be simple enough not to hinder the understanding of those methods. The simplest kind of classification task is binary classification, i.e. classifying input elements into one of two groups. Binary classification tasks can be further divided into one vs. one (OvO, training a network to discriminate between two distinct classes) and one vs. all (OvA, training a network to decide whether a given input belongs to a single class or not) classification.

A Neural Network is a model, a representation of reality that makes certain simplifying assumptions which allow it to produce approximate solutions to a set of problems constrained by those simplifications. This means that a network that was trained to discriminate between 50 different classes of objects in images works on the assumption that there exist only objects that are assignable to one of those 50 classes. Of course it is possible to label one of those classes as ‘unknown’, but that just moves the problem to the selection of the data used for training. A Neural Network’s representation of reality can never exceed the boundaries of its training data, as it will always only be able to classify novel data in the context of the data it encountered before (usually in a distinct training phase, but depending on the system, every input that is encountered can influence the network’s parameters and even its architecture [KRFD17]). A potential ‘unknown’ class would need to contain examples with enough variety to capture all possible classes not present in the ‘known’ classes. Otherwise, the network would likely assign novel input that does not belong to a ‘known’ class to one of the ‘known’ classes instead of the ‘unknown’ class, unless some other method of detecting this kind of input (e.g., using confidence thresholds) was employed.

An OvA binary classifier therefore requires either a carefully crafted set of examples of an ‘unknown’ class or a method of detecting ‘unknown’ input by identifying a threshold on one or several parameters indicating that the input is not part of the ‘known’ class. An OvO binary classifier has, of course, no greater ability to detect such ‘unknown’ input, but it only requires two well-defined sets of example data. This led to the choice of a Neural Network trained as an OvO classifier as the model, with the purpose of analyzing its behaviour when presented with ‘unknown’ input.

4.1 Methodology

The architecture of the model needs to be complex enough to adequately solve the task of distinguishing the two classes (reaching around 90% accuracy), but should not complicate the understanding and implementation of the methods that are demonstrated (i.e. it should use a minimal number of layers to achieve the expected accuracy). To maximize the time available for focussing on the core purpose of this thesis, we decided to use transfer learning (using a pretrained model as a base and retraining only part of the model for the specific task), which allows for high accuracy relative to the number of training samples and the training time. Pretrained models only work with the architecture they were trained on, so an architecture needs to offer a pretrained model to be considered. Since the model will not be used in any productive capacity, other constraints (e.g. model size, prediction speed) are of secondary concern.

As the framework to implement both the model and the methods, TensorFlow was chosen. Because of its large community and steady development, there are many model implementations and pretrained weights available. Additionally, it allows low-level access to all parts of the models, which is necessary to implement most of the methods that are described in the survey part of this thesis. Consequently, implementations of many methods were already available, requiring only some adaptations to be usable with a custom model.

All these constraints led to the choice of the VGG16 CNN [SZ14] that is provided as part of the TensorFlow framework (in its TF-slim library [GS16]), together with its accompanying pretrained model parameters, which were trained on the ImageNet Large Scale Visual Recognition Challenge 2012 dataset (ILSVRC-2012, [ILS]). This combination achieves a top-1 accuracy (how often the prediction with the highest probability is correct) of 71.5% and a top-5 accuracy (how often any one of the five predictions with the highest probabilities is correct) of 89.8% before transfer learning and as such provides a solid baseline for our task. The VGG16 architecture reaches our goal of ∼90% accuracy (on the target task, after transfer learning) but is still comparatively simple with only 16 layers (hence VGG16). It also only uses 3×3 convolution filters and 2×2 pixel windows in the pooling layers, which dramatically reduces the number of parameters in the network [SZ14, p. 3].

The VGG16 architecture that is implemented in TF-slim is a slightly modified version of the VGG configuration D described in [SZ14, p. 3, table 1]. The TF-slim authors have replaced the fully connected layers of the original VGG16 configuration with convolutional layers, which makes it possible to efficiently compute the final scores for variable input formats [KJ]. For brevity's sake, the convolutional layers replacing the fully connected layers will still be denominated as fully connected layers in the rest of the thesis, since they are functionally identical and the classifier part of a CNN is usually built and referred to as ‘fully connected layers’.

To train the network, a dataset containing two classes of equal size (i.e. the same number of examples for each class) is required. It needs to be big enough to allow training a classifier, but small enough that training with the full dataset does not take too long. A dataset containing 25000 images of cats and dogs from a Kaggle competition [KAG14] was chosen. The images from classes that are unknown to the network are taken from the ILSVRC-2012 training dataset that the pretrained model was originally trained on. This dataset contains 1000 well-labelled classes that do not overlap, which makes it easy to choose a number of classes for evaluation. The unknown classes were selected to provide a broad distribution over the available classes: llama (an animal, but somewhat different from the known classes), gray wolf (very close to one of the known classes), groom (the closest class to person in ILSVRC-2012), race car (something artificial), broccoli (a plant), valley (a landscape) and website (a comparatively abstract concept). For each of those classes, 10 images are randomly selected from the ILSVRC-2012 training dataset. Additionally, one image with only white pixels and one with only black pixels are added to the test images.
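
A minimal sketch of this model setup with the TF-slim API of TensorFlow 1.x follows; the checkpoint path and the excluded scope are illustrative and may differ from the actual code [KvB18].

    import tensorflow as tf
    from tensorflow.contrib.slim.nets import vgg

    slim = tf.contrib.slim

    images = tf.placeholder(tf.float32, [None, 224, 224, 3])
    # Two output classes (cat, dog) instead of the 1000 ILSVRC-2012 classes.
    logits, _ = vgg.vgg_16(images, num_classes=2, is_training=True)

    # Restore all pretrained weights except the last layer, which is retrained.
    variables = slim.get_variables_to_restore(exclude=['vgg_16/fc8'])
    init_fn = slim.assign_from_checkpoint_fn('vgg_16.ckpt', variables)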

4.2 Experimental Testbed

The training dataset provided by Kaggle for the competition ‘Dogs vs. Cats’ [KAG14] contains 12500 pictures of dogs and 12500 pictures of cats, sorted into subfolders according to their class. To allow for faster processing and simpler handling of input data, the Dogs vs. Cats dataset is converted from JPEG format to TFRecord format. This combines all images and their labels into a number of shards, which are then converted into a TensorFlow dataset object when performing the training. A separate TFRecord shard set is created for the training and the validation data each. The ratio of training to validation data was set to 8 to 2 (20% validation data).

Since the goal is not to maximize classification performance, hyperparameters are mostly left at the default values that the respective implementations in TensorFlow provide. The learning rate is initially set to $10^{-5}$ and then decayed every two steps during the training process, using an exponential decay with a decay rate of 0.8. An Adaptive Moment Estimation (Adam) optimizer is used for parameter updates; this is the optimizer currently recommended in CS231n [KJ].

To perform the transfer learning, 100 iterations, each consisting of a training and an evaluation block, are run. Before the first training block, the pretrained parameters of the model provided by TF-slim are loaded from the filesystem and used to initialize the Neural Network. In each training block, 100 training steps are run. Each step loads a batch of 100 images, passes them through the network and then runs backpropagation and parameter updates. After 100 steps, the model is saved to disk and an evaluation block on both training and validation data is run. During this step, summary data is written that can be inspected via TensorBoard [ten].

Using this approach, the model achieved a validation accuracy of 89.7% after 10000 cumulative training steps. The accuracy during the training process is shown in Figure 4.1. Generally, it is a good sign if the validation accuracy closely follows the training accuracy, but in this case it is unclear why the validation accuracy shows such huge jumps and even starts off with a value of 1.0. Usually this would be a reason to investigate and improve the training process, but since for this task seeing what is going on inside the model (and having the model exhibit some strange behaviour) is more important than getting accurate predictions, this can even be an advantage.

Figure 4.1: Accuracy during the training process. Red: Training accuracy; Green: Validation accuracy (y-axis: accuracy; x-axis: number of training steps)
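
The training configuration described above can be expressed in a few lines; the sketch continues the previous one (reusing its logits tensor), and the cross-entropy loss is our assumption, since the exact loss is not detailed here.

    import tensorflow as tf

    labels = tf.placeholder(tf.int64, [None])  # ground-truth labels for a batch
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    global_step = tf.train.get_or_create_global_step()
    # Initial rate 1e-5, multiplied by 0.8 every two steps (values from the text).
    learning_rate = tf.train.exponential_decay(1e-5, global_step,
                                               decay_steps=2, decay_rate=0.8)
    train_op = tf.train.AdamOptimizer(learning_rate).minimize(
        loss, global_step=global_step)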

4.3 Implementation of Debugging Methods

For each fundamental type of method described in the survey section, we provide an example implementation. The implementations originate from several different sources and all needed to be adapted to work with our model.

Table 4.1: Implemented methods and references to sources

Method                               Source
Occlusion Sensitivity Analysis       Picasso [HR17]
Saliency Maps

Activation Visualization             tf_cnnvis [Bha17]
Deconvolutional Network
Deepdream Activation Maximization

Guided Backpropagation               Grad-CAM-tensorflow [tfg]
Grad-CAM

TensorFlow allows very liberal access to, and modification of, its computational graph. This makes it easy to attach functionality to an existing model or even to replace parts of it. A model can be restored from a checkpoint with little additional knowledge about its structure and then used as a drop-in replacement for other models. Usually, the only information that needs to be provided is an entry and an exit point (the names of the node in the graph that expects the input data and of the node that produces the output, usually the class probabilities).
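
A minimal sketch of such a restore follows; the checkpoint path and tensor names are hypothetical placeholders.

    import tensorflow as tf

    sess = tf.Session()
    # Rebuild the graph structure and restore the trained weights.
    saver = tf.train.import_meta_graph('checkpoints/model.ckpt.meta')
    saver.restore(sess, 'checkpoints/model.ckpt')

    graph = tf.get_default_graph()
    # Entry and exit points, identified by tensor name.
    inputs = graph.get_tensor_by_name('input:0')
    probabilities = graph.get_tensor_by_name('probabilities:0')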


It is also possible to replace the gradients of operations, which for example allows us to redefine the ReLU gradient as required for Guided Backpropagation. In the following paragraphs, the adaptations that had to be made in order to run the methods available at the sources listed in Table 4.1 are described. The source code for the implementations is available as a digital attachment on the CD included with physical copies of this thesis and online at gitlab.com [KvB18].
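
A sketch of such a gradient replacement in TensorFlow 1.x; the backward rule follows Figure 3.5 (fourth row), while the placeholder shape is illustrative.

    import tensorflow as tf

    @tf.RegisterGradient("GuidedRelu")
    def _guided_relu_grad(op, grad):
        # Pass a gradient only where both the incoming (top) gradient and
        # the forward ReLU output are positive (Figure 3.5, fourth row).
        gate_top = tf.cast(grad > 0., grad.dtype)
        gate_fwd = tf.cast(op.outputs[0] > 0., grad.dtype)
        return grad * gate_top * gate_fwd

    graph = tf.get_default_graph()
    with graph.gradient_override_map({'Relu': 'GuidedRelu'}):
        x = tf.placeholder(tf.float32, [None, 224, 224, 3])
        y = tf.nn.relu(x)  # ReLUs built in this scope use the guided gradient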

Picasso: Picasso is a framework for visualizing DNNs. It provides a standardized way of wrapping models as well as visualizations, and a web application that allows a user to upload pictures, have a model classify them and create visualizations to analyze that process. To add a custom model, an entry and an exit point have to be identified (by tensor name) and recorded in a configuration file. Additionally, a small wrapper class that provides functions for preprocessing input images and decoding class labels is required.

tf_cnnvis: tf_cnnvis is a library that provides an API to create three types of visualizations: Activation Visualization (reconstructing the input image from the activation information in any layer), Deconvolution (using deconvolution to reconstruct the input image) and Activation Maximization (based on Deep Dream by Google [MTO]). Again, it is necessary to identify an entry and an exit point and to preprocess the data. These values are then passed to the respective function as parameter values. The library then generates the visualizations and saves them to the filesystem.

Grad-CAM-tensorflow: Grad-CAM-tensorflow only contains example implementations for three different models and required heavy modification for use with a model loaded from a checkpoint. The provided utility script, which contains functions for generating the visualization after the necessary gradients have been computed in the main script, was also modified to add a visualization of the first convolutional layer. The output image was further enhanced by adding a blended version of the original image and the Grad-CAM heatmap. ReLU gradients in the graph are replaced by GuidedRelu gradients, which return a zero gradient if the incoming gradient is negative and the regular ReLU gradient otherwise. After an input image has passed through the network, the gradient for the target class is set to 1 and those of all other classes to 0. This signal is then backpropagated to the last convolutional layer (conv5_3) and used to calculate the Grad-CAM visualization. Guided Backpropagation is run back to the input layer to reconstruct the input image from the output layer. To create the Guided Grad-CAM visualization, the element-wise product of the Grad-CAM map with each channel of the Guided Backpropagation result is calculated and then converted back to an image.

5 Results

The visualizations created during the experiment reveal interesting behaviour in the network. First of all, even though the model achieves a satisfactory accuracy of ∼90% on validation data, analysis of the explanations provided by the visualization methods shows that the reason why the model assigns a class to an input image is often not evident to human intuition. For input images that are assignable to one of the classes known to the model, it could be expected that the model would base its decision on features that are at least somewhat similar to those a human would focus on. Instead, experimentation shows that this assumption has to be checked diligently. Although most example images of cats and dogs were classified correctly, no comprehensible explanation could be produced for about half of them.

Figure 5.1 shows the visualizations of four different methods applied to two distinct cat images. Both cat images were classified correctly as ‘Cat’. The first row is a good example of an interpretable explanation: the trained model was able to identify mainly the region containing the cat face, as can be seen in Figure 5.1 (b). The visualizations in the second row, on the other hand, seem to focus rather on the surroundings of the cat’s face. This is especially clear from Figure 5.1 (c), where the darker areas partly cover the cat’s face and stand for a lower activation. For both rows, the Saliency Map highlights regions that cause the largest change in the output and, reasonably, these regions roughly outline the cat. (e) and (f) show nine different feature activations projected back into input pixel space for the two cats, with (e) representing the cat in the first row of the same figure. The reconstructions for the different units are quite similar, which suggests that the units of higher layers converge in their notion.

Figure 5.2 depicts a similar comparison of visualizations for dog images. The dog in the first row was classified correctly, the dog in the second row was not. The first row again shows an example of visualizations which facilitate a reasonable interpretation of the classification: Grad-CAM in Figure 5.2 (b) successfully stresses the parts of the input image which influence the class ‘Dog’ the most, but it is not clear from the depiction which details the model really considers important for the class. Guided Grad-CAM in Figure 5.2 (c) combines the pixel-space properties of Code Inversion methods with the Grad-CAM heatmap and thereby outputs a visualization which shows this additional detail. For instance, it is apparent from the first row of Figure 5.2 (c) that the outlined ears and the gift ribbon are significant for assigning this image to the class ‘Dog’.

When visually comparing the outcomes of the different methods, the Saliency Map provides the least value: if the saliency visualizations did not lie beside the results of the other methods, it would be difficult to independently recognize the objects within them. Conversely, Saliency Maps seem to confirm what the other methods consider important. For instance, Guided Grad-CAM colours the whiskers of the cat in the second row of Figure 5.1 in red; the corresponding Saliency Map highlights them in white, which means that the whiskers are responsible for a significant change in the output.



Figure 5.1: Two cat images selected from the Kaggle competition dataset ‘Dogs vs. Cats’ [KAG14]. The first row shows an example of a reasonable explanation, the second row an unreasonable one. Column (a) shows the original image; columns (b), (c), (d), (e) and (f) show the visualizations for it obtained from the methods Grad-CAM, Guided Grad-CAM, Saliency Map and Maxpool Deconvolution (second max pooling layer). Best viewed in electronic form.



Figure 5.2: Two dog images selected from the Kaggle competition dataset ‘Dogs vs. Cats’ [KAG14]. The first row shows an example of a reasonable explanation, the second row an unreasonable one. Column (a) shows the original image; columns (b), (c), (d), (e) and (f) show the visualizations for it obtained from the methods Grad-CAM, Guided Grad-CAM, Saliency Map and Maxpool Deconvolution (second max pooling layer). Best viewed in electronic form.

6 Summary and Future Work

We have given an overview of methods that provide insight into the processes inside the purported ‘black box’ of Neural Networks. We then trained a model on a task that is appropriate to demonstrate the utility of those methods when analyzing a Neural Network, and presented a selection of visualizations created for both the classes the model was trained to recognize and classes it did not know. The methods that were reviewed in this thesis certainly help to see what is happening inside the ‘black box’ of a Neural Network. Still, using this insight for debugging and improving a model requires a deep understanding of, and a wealth of experience in working with, Neural Networks. Just as a traditional debugger is a tool of the craft that is, at first, of little help to a novice programmer, these methods require practice to be used to their full potential. Fully integrating these and future methods into a full-fledged Neural Network debugger would likely enhance their utility considerably, since they all provide different perspectives on a model. The combination of these perspectives may one day allow experts to understand and control what happens inside a Neural Network as well as software engineers can with traditional programs. Additionally, a deeper understanding will allow researchers to use the data that now goes into creating these visualizations for automated improvement of the training process and of Neural Networks in general.


Bibliography

[Bar17] Shane Barratt. InterpNET: Neural Introspection for Interpretable Deep Learning. October 2017.
[BCV12] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation Learning: A Review and New Perspectives. arXiv:1206.5538, 35(8):1798–1828, 2012.
[Bha17] Bhagyesh Vikani and Falak Shah. CNN Visualization. https://github.com/InFoCusp/tf_cnnvis/, 2017.
[CMM+11] Dan C. Cireşan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. IJCAI International Joint Conference on Artificial Intelligence, pages 1237–1242, 2011.
[CMMS12] Dan Cireşan, Ueli Meier, Jonathan Masci, and Jürgen Schmidhuber. Multi-column deep neural network for traffic sign classification. Neural Networks, 32:333–338, 2012.
[CMS12] Dan Cireşan, Ueli Meier, and Jürgen Schmidhuber. Multi-column Deep Neural Networks for Image Classification. International Conference of Pattern Recognition, pages 3642–3649, 2012.
[Cul17] Eugenio Culurciello. Neural Network Architectures, 2017.
[EBCV09] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. Technical Report 1341, University of Montreal, pages 1–13, 2009.
[GS09] Alex Graves and Jürgen Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 545–552. Curran Associates, Inc., 2009.
[GS16] S. Guadarrama and N. Silberman. TensorFlow-Slim: a lightweight library for defining, training and evaluating complex models in TensorFlow, 2016.
[GSL07] G. Loosli, S. Canu, and L. Bottou. Training invariant support vector machines using selective sampling. Large Scale Kernel Machines, pages 301–320, 2007.
[Her17] Bernease Herman. The Promise and Peril of Human Evaluation for Model Interpretability. November 2017.
[HOT06] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A Fast Learning Algorithm for Deep Belief Nets. Neural Computation, 18(7):1527–1554, 2006.
[HOWT06] Geoffrey Hinton, Simon Osindero, Max Welling, and Yee Whye Teh. Unsupervised discovery of nonlinear structure using contrastive backpropagation. Cognitive Science, 30(4):725–731, 2006.
[HR17] Ryan Henderson and Rasmus Rothe. Picasso: A Modular Framework for Visualizing the Learning Process of Neural Network Image Classifiers. Journal of Open Research Software, 5(1), September 2017.
[ILS] ILSVRC-2012-CLS dataset.
[KAG14] Cats vs. Dogs Dataset, 2014.
[Kar] Andrej Karpathy. Convolutional Neural Networks (CNNs / ConvNets).
[Kar16] Ujjwal Karn. An Intuitive Explanation of Convolutional Neural Networks, 2016.


[KGB16] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. July 2016.
[KJ] Andrej Karpathy and Justin Johnson. CS231n Convolutional Neural Networks for Visual Recognition.
[KRFD17] Christoph Käding, Erik Rodner, Alexander Freytag, and Joachim Denzler. Fine-Tuning Deep Neural Networks in Continuous Learning Scenarios. Computer Vision – ACCV 2016 Workshops: ACCV 2016 International Workshops, Taipei, Taiwan, November 20–24, 2016, Revised Selected Papers, Part III, pages 588–605, 2017.
[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks, 2012.
[Kub15] M. Kubat. An Introduction to Machine Learning. Springer International Publishing, 2015.
[KvB18] David Kempf and Lino von Burg. Understanding Deep Neural Networks - Source Code, 2018.
[LBBH98] Yann LeCun, Léon Bottou, Y. Bengio, and Patrick Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86:2278–2324, 1998.
[LEN08] Honglak Lee, Chaitanya Ekanadham, and Andrew Y. Ng. Sparse deep belief net model for visual area V2. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 873–880. Curran Associates, Inc., 2008.
[MDFF16] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2574–2582, 2016.
[MTO] Alexander Mordvintsev, Michael Tyka, and Christopher Olah. Inceptionism: Going Deeper into Neural Networks. Google Research Blog, 2015.
[RN03] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach, 2nd edn, 2003.
[RRND10] S. J. Russell, P. Norvig, and E. Davis. Artificial Intelligence: A Modern Approach. Prentice Hall series in artificial intelligence. Prentice Hall, 2010.
[SCD+16] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. October 2016.
[Sch15] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, January 2015.
[SDBR14] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for Simplicity: The All Convolutional Net. December 2014.
[SVZ13] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. December 2013.
[SZ14] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. pages 1–14, 2014.
[SZS+13] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. December 2013.
[ten] Tensorboard Github.
[tfg] Grad-CAM-tensorflow.
[Tho17] Martin Thoma. Analysis and Optimization of Convolutional Neural Network Architectures. July 2017.
[vGB17] Marcel van Gerven and Sander Bothe. Artificial Neural Networks as Models of Neural Information Processing, 2017.


[VLBM08] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. Proceedings of the 25th International Conference on Machine Learning (ICML '08), pages 1096–1103, 2008.
[YCN+15] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding Neural Networks Through Deep Visualization. June 2015.
[ZBH+16] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. 2016.
[Zel97] A. Zell. Simulation neuronaler Netze. Oldenbourg, 1997.
[ZF14] Matthew D. Zeiler and Rob Fergus. Visualizing and Understanding Convolutional Networks. pages 818–833. Springer, Cham, 2014.
[ZKL+15] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning Deep Features for Discriminative Localization. December 2015.
[ZTF11] Matthew D. Zeiler, Graham W. Taylor, and Rob Fergus. Adaptive deconvolutional networks for mid and high level feature learning. Proceedings of the IEEE International Conference on Computer Vision, pages 2018–2025, 2011.

List of Figures

2.1 Mathematical model for an artificial neuron by McCulloch and Pitts. The neuron receives a set of inputs and weights which are linearly combined by the input function and then passed on to the activation function, generating the output of the neuron [RRND10].
2.2 Two types of networks. (a) A perceptron network with two inputs and one layer of output neurons. (b) A multilayer network with two inputs, one layer of hidden neurons and one layer of output neurons [RRND10].
2.3 A simple CNN architecture [Kar16].
2.4 Visualization of the pooling operation, using the maximum [Kar].
3.1 Hierarchical taxonomy of methods for debugging DNNs. The bolded debugging types will be covered within this thesis.
3.2 Concept of a deconvnet. Top: A deconvnet layer (left) is attached to each CNN layer (right). The deconvnet creates an approximate reconstruction of the activity in the layer beneath. Bottom: Switch variables are used to store the locations of the maxima within each pooling region (colored zones) during pooling operations in the CNN. The black/white bars indicate negative/positive activations in the feature map [ZF14].
3.3 Class appearance models for three pictures. The corresponding CNN was trained on the ILSVRC-2013 dataset [SVZ13].
3.4 Image-specific class saliency maps for the highest-scoring class in ILSVRC-2013 test images [SVZ13]. Top: Two randomly selected test set images. Bottom: The corresponding saliency maps.
3.5 Different concepts of propagating back through a ReLU [SDBR14]. First row: The ReLU replaces all negative values with zero during the forward propagation. The remaining rows depict the backpropagation procedure for the Gradient-Based methods, the deconvnet reconstruction and the Guided Backpropagation, respectively.
3.6 Comparison of visualizations provided by different methods. (a) and (g) show the original image, containing the two classes cat and dog. (b) to (d) and (h) to (j) depict visualizations provided by different methods for the original image [SCD+16].
3.7 Activation Maximization method applied to an extended version of the MNIST digit classification dataset [GSL07]. The first, second and third columns show the visualizations obtained from the method for 36 different units of the first, second and third hidden layers of a Deep Belief Network, respectively [HOT06, EBCV09].
4.1 Accuracy during the training process. Red: Training accuracy; Green: Validation accuracy (y-axis: accuracy; x-axis: number of training steps)
5.1 Two cat images selected from the Kaggle competition dataset ‘Dogs vs. Cats’ [KAG14]. The first row shows an example of a reasonable explanation, the second row an unreasonable one. Column (a) shows the original image; columns (b), (c), (d), (e) and (f) show the visualizations for it obtained from the methods Grad-CAM, Guided Grad-CAM, Saliency Map and Maxpool Deconvolution (second max pooling layer). Best viewed in electronic form.
5.2 Two dog images selected from the Kaggle competition dataset ‘Dogs vs. Cats’ [KAG14]. The first row shows an example of a reasonable explanation, the second row an unreasonable one. Column (a) shows the original image; columns (b), (c), (d), (e) and (f) show the visualizations for it obtained from the methods Grad-CAM, Guided Grad-CAM, Saliency Map and Maxpool Deconvolution (second max pooling layer). Best viewed in electronic form.

List of Tables

3.1 Possible values for the different taxonomy features.
3.2 Overview of methods described in depth in the following subsections.
3.3 Comparison of methods by means of taxonomy features.
4.1 Implemented methods and references to sources.

A Appendix

A.1 Announcement

The next page contains the official announcement of this thesis.

Understanding Deep Neural Networks BA18_stdm_4

Supervisors: Thilo Stadelmann, stdm; Olaf Stern, strf
Subject areas: Data Analysis (DA), Software (SOW)
Degree programme: IT
Assigned to: Institut für angewandte Informationstechnologie (InIT)
Group size: 2

Short description: The goal of this thesis is to get introspection into the workings and (finally) decisions of deep (convolutional) neural networks by means of e.g. visualization, saliency maps, most probable input synthesis etc. To this end, it analyses a given learning task on audio or image data (e.g., visual object recognition; speaker clustering) and applies different "debugging" techniques to the learnt models or the training process. This thesis builds upon results of previous thesis projects of the supervisors on deep learning, image and audio data analysis and neural network debugging.

Plan

- Research the state of the art in neural network "debugging"
- Define the problem and task to work on together with the supervisors
- Implement an experimental setup/work bench to train and evaluate neural networks for the chosen task in a structured (probably automated) manner
- Iteratively improve the model, thereby focusing on getting introspection into it by means of "debugging"
- Produce an overview of the state of the art in neural network debugging that exceeds prior theses at ZHAW (e.g., on another domain) in publishable form
- Write a thesis and give a talk about the approach and results of your project

Prerequisites: A basic understanding of deep neural networks and their implementation using TensorFlow is assumed. Enjoying experimenting and programming is a must; wanting to tackle a challenging task in a scientific manner with the goal of achieving publishable results is the mindset of a successful candidate.

The thesis has been agreed with: David Kempf (kempfdav), Lino von Burg (vonbulin)

Further information: https://distill.pub/2017/feature-visualization/

Thursday, 4 January 2018, 12:15

A.2 Visualizations of ‘Unknown’ Classes

[Figure: Grad-CAM and Guided Grad-CAM visualizations for images of the ‘unknown’ classes broccoli (panels (a)–(d)) and groom (panels (e)–(h)).]
