Master of Science Thesis in Electrical Engineering Department of Electrical Engineering, Linköping University, 2019

A Comparative Study of Routing Methods in Capsule Networks

Christoffer Malmgren

LiTH-ISY-EX--19/5188--SE

Supervisors: M.Sc. Gustav Häger, ISY, Linköpings universitet; Ph.D. Erik Ringaby, SICK Linköping
Examiner: Prof. Michael Felsberg, ISY, Linköpings universitet

Computer Vision Laboratory Department of Electrical Engineering Linköping University SE-581 83 Linköping, Sweden

Copyright © 2019 Christoffer Malmgren

Abstract

Recently, the deep neural network structure caps-net was proposed by Sabour et al. [11]. Capsule networks are designed to learn the relative geometry between the features of one layer and the features of the next layer. The capsule network's main building blocks are capsules, which are represented by vectors. The idea is that each capsule will represent a feature as well as traits or subfeatures of that feature. This allows for smart information routing. Capsule traits are used to predict the traits of the capsules in the next layer, and information is sent to the next-layer capsules on whose predictions they agree. This is called routing by agreement. This thesis investigates the theoretical support of new and existing routing algorithms and evaluates their performance on the MNIST [16] and CIFAR-10 [8] datasets. A variation of the dynamic routing algorithm presented in the original paper [11] achieved the highest accuracy and fastest execution time.


Acknowledgments

I want to thank SICK Linköping for the opportunity to make this thesis. Thanks to my examiner Michael Felsberg and supervisors Erik Ringaby and Gustav Häger for their support and helpful reviews. I also want to thank all colleagues at SICK Linköping for showing interest in my thesis and providing insight as well as interesting discussions. Finally, I want to thank my family and friends.

Linköping, May, 2019 Christoffer Malmgren


Contents

Notation

1 Introduction
  1.1 Background
  1.2 Problem formulation
  1.3 Previous work
  1.4 Limitations

2 Theoretical background
  2.1 Neural networks
  2.2 Capsule networks
  2.3 Clustering
    2.3.1 K-means
    2.3.2 Expectation maximization
    2.3.3 Kernel density estimation
    2.3.4 Variational Bayesian methods
  2.4 Routing in current capsule networks
    2.4.1 Dynamic routing
    2.4.2 Optimized dynamic routing
    2.4.3 EM-routing
    2.4.4 Fast dynamic routing based on weighted kernel density estimation

3 Model design
  3.1 Novel routing algorithms
    3.1.1 Mean field Gaussian mixture model
    3.1.2 Mean field soft k-means
    3.1.3 Soft k-means routing
  3.2 Practical modifications
  3.3 Existing algorithms

4 Experiments
  4.1 Implementation
  4.2 Data preprocessing
    4.2.1 MNIST
    4.2.2 CIFAR-10
  4.3 Results

5 Conclusions
  5.1 Method analysis
  5.2 Routing method comparison
  5.3 Contributions
  5.4 Future work

A Appendix
  A.1 Fixed weights Gaussian mixture model expectation maximization
  A.2 DR optimization
  A.3 Mean field Gaussian mixture model
  A.4 Mean field soft k-means
  A.5 Soft k-means
  A.6 Hyper parameters for test accuracy evaluation
  A.7 Validation

Bibliography

Notation

Typesetting

Notation | Meaning
x | A scalar
x (bold) | A 2D vector
X | A multidimensional matrix or tensor
X (calligraphic) | A set
x_{i,...,j} | Indexing or extraction of a matrix value
x(t) | The value of x at time or iteration t
∀ | For all

Operators and functions

Notation | Meaning
Σ | Sum
Π | Product
∫ | Integral
⟨·, ·⟩ | Scalar product
‖·‖_p | The p-norm. If p is not given then p = 2 is assumed
← | Assignment
KL(·, ·) | Kullback–Leibler divergence
N(µ, σ) | Normal distribution with mean µ and standard deviation σ
Γ(α, β) | Gamma distribution with shape α and rate β
p(x, y) | The joint probability of x and y
p(x|y) | The conditional probability of x given y
log | The natural logarithm


Default variable usage

Notation | Meaning
a | Activation of a capsule or node
b | Bias
d | Distance function/measurement or dimension index
f | Function
i, j, k, l, n, t | Indices
m | Mean
p | Probability function
r | Responsibility, variable describing association
u | Capsule traits
v | Predicted capsule traits
w | Weights
x | Data points. In the context of routing the same as v
α | Scale
β | Rate
ε | Small positive number
θ | Parameters
λ | Precision scaling in the normal-gamma distribution
µ | Mean
π | Probability of an event or Archimedes' constant
ρ | Concentration parameters
σ | Standard deviation
τ | Precision
const_x | Constant value w.r.t. x

Abbreviations

Abbreviation | Meaning
gmm | Gaussian mixture model
em | Expectation maximization or EM-routing, context dependent
gmm-em | Expectation maximization of a Gaussian mixture model
cnn | Convolutional neural network
caps-net | Capsule neural network
sgd | Stochastic gradient descent
frms | Fast routing with mean shift
frem | Fast routing with expectation maximization
mfgmm | Mean field Gaussian mixture model
mfskm | Mean field soft k-means
dr | Dynamic routing
skm | Soft k-means

1 Introduction

1.1 Background

The problem considered in this thesis is image classification, a computer vision task in which deep learning has become an increasingly important tool. Recently, two papers [5, 11] were published which point out that convolutional neural networks (cnns), the cutting edge of visual deep learning, have large limitations when it comes to information routing. Routing describes how to redirect the information flow in a network, breaking the standard network structure. Routing in cnns is usually solved by max-pooling, a method that essentially throws away information, since only the node with the maximum activation is considered. What is more, the position of a feature is encoded by node position until the final layers of a cnn. This means that cnns will have trouble learning the relative geometry between objects, which is useful when determining the whole from the parts [11]. Sabour et al. propose a new model: capsule networks (caps-net). In a capsule network nodes are grouped into capsules. The goal is to have every capsule, rather than node, represent a feature. Consequently, each feature will be represented by a vector rather than a scalar, and will be able to indicate not only the probability of the feature being present, but also the feature pose or traits. Here traits refer to subfeatures or characteristics of the encoded feature; though motivated by their ability to encode position and orientation of the feature, the traits might encode any properties of the feature. This means that already at a fairly low layer feature position information may be moved from capsule position to capsule activation. Capsule networks may thus learn reference frames of objects and features [5], which is something humans do [6]. Representing capsules with vectors also opens the possibility for smarter routing. Information is sent by the capsules which agree on the traits of the capsule in the next layer,

which is called routing by agreement [11]. This idea is quite old and can be seen in [19] and [17]. The idea of learning relative geometry of parts is also shared with constellation models and deformable parts models [2, 3, 14].

1.2 Problem formulation

To implement routing by agreement one has to define the meaning of agreement. Similar capsule vectors are a good start, but what distance measure should be used? The goal is to have a capsule spread its activation and trait information to the next layer's capsules on which its trait prediction coincides with the other capsules' predictions. This can be seen as a soft clustering problem, where the lower layer capsules are data points and the higher layer capsules are cluster prototypes. Clustering problems can be solved in multiple ways, and when evaluating a solution one has to consider how well it fits a neural network structure. One has to consider e.g. execution time, how to handle activations, energy propagation and gradient flow. The goals of this thesis are to:

– Analyze routing methods with respect to their theoretical motivation as a clustering method as well as how they need to be modified to fit a network structure.

– Evaluate routing methods by measuring their accuracy and execution time on common image classification datasets.

The analysis will be presented together with the routing methods, that is, in chapters 2 and 3 (Theoretical Background and Model Design). The evaluation will be presented in chapter 4 (Experiments). Both will be further discussed in chapter 5 (Conclusions).

1.3 Previous work

In their first paper Sabour et al. [11] used a basic routing algorithm, with high similarity to the soft spherical k-means algorithm. It uses the cosine of the angle as distance measure, which later was thought to be sub-optimal, since it makes the clustering insensitive when it comes to distinguishing between an acceptable and an excellent match [5]. Sabour et al. later released a second paper [5], with a routing algorithm based on expectation maximization (em) of a Gaussian mixture model (gmm). This routing algorithm is more complex, and introduces the cluster bias in a less intuitive way. Zhang et al. [21] managed to increase the speed of computation using weighted kernel density estimation. These methods are described more thoroughly in section 2.4.

1.4 Limitations

Routing algorithms are the main focus of this thesis and the algorithms described in [5, 11] are compared to other possible routing options. Both theoretical aspects and performance are evaluated. The datasets used are MNIST [16] and CIFAR-10 [8]. The goal is not an absolute ranking, but to compare theoretical motivations, execution time, as well as accuracy on the different data sets individually. Only classification tasks will be considered since they need minimal network alterations. Also, when comparing accuracy one has to consider that the time used for parameter tuning was limited. Some models might be able to achieve higher accuracy, but demand more tuning.

2 Theoretical background

To fully grasp the reasoning behind the routing design some prior knowledge is needed. The following chapter provides some relevant theory about neural networks and clustering methods, as well as connects some existing routing methods to that theory. Section 2.1 introduces neural networks and section 2.2 extends the theory to capsule networks. Section 2.3 describes some clustering methods, which will be used as the foundation for the routing methods in sections 2.4 and 3.1.

2.1 Neural networks

Deep neural networks solve machine learning problems such as classification, detection, and segmentation. A neural network attempts to predict some output data y from some input data x by using the model y = f(x, θ), where θ are the model parameters. Here f is a large model structured in layers. The basic layer can be described as a linear function followed by a non-linear function. The non-linear function is called the activation function and allows the network to model complex relations. The linear functions and layer structure make the model suit computer hardware, so that the network can be evaluated feasibly fast, despite its size. Here follows an example of a layer l with input a and output c:

c_j = g\Big( \sum_i w_{i,j} a_i + b_j \Big). \quad (2.1)

Here w_{i,j} and b_j are model parameters, g is the activation function and the indices i, j are used to distinguish between individual nodes in layer l and layer


Figure 2.1: Sketch of a 3-layer neural network without max-pooling. The filter and node dimensions are indicated by the coordinate system above (im1, im2, channel). The cubes' side lengths represent the number of nodes/filters in the layer. The abbreviations "conv" and "fc" indicate convolutional and fully connected layers.

l + 1 respectively. The output c_j is called the activation of node j and is later used in a similar way to calculate the activations of layer l + 2. The activation c_j can be thought of as an unnormalized probability of some feature specific to node j. The nodes of a layer are typically arranged in a 3D tensor. If the input is naturally 2D, as e.g. images, the following layers will have two dimensions corresponding to the image dimensions. I will refer to them as spatial or image dimensions and to the remaining one as the channel dimension.
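As a concrete illustration of equation (2.1), the following minimal numpy sketch evaluates one fully connected layer. The function name layer_forward and the tanh choice of g are illustrative assumptions, not taken from the thesis implementation.

import numpy as np

def layer_forward(a, W, b, g=np.tanh):
    """Compute c_j = g(sum_i w_{i,j} a_i + b_j) for one layer.

    a: (n_in,) activations of layer l
    W: (n_in, n_out) weights w_{i,j}
    b: (n_out,) biases b_j
    g: elementwise non-linear activation function
    """
    return g(a @ W + b)

# Example: 4 input nodes, 3 output nodes
rng = np.random.default_rng(0)
a = rng.standard_normal(4)
W = rng.standard_normal((4, 3))
b = np.zeros(3)
c = layer_forward(a, W, b)   # shape (3,)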

The w_{i,j} and b_j are learned by evaluating the model on the training data set (ȳ, x̄) and minimizing the error l(ȳ, f(x̄, θ), θ) with respect to θ. Here l is some kind of loss function describing the distance between ȳ and f(x̄, θ) as well as priors on θ. In the large majority of all cases some sgd-based optimizer is used. Sometimes one may assume that the input of a layer is shift invariant. This is the case for e.g. image inputs. If this is the case it makes sense to use convolutional layers, which share the weights so that w_{i,j} = w_{l,k} and b_j = b_k if a_i and c_j have the same spatial relationship as a_l and c_k. One also assumes w_{i,j} = 0 if a_i is far from c_j spatially, thus limiting the field of view of c_j. A network containing convolutional layers is called a convolutional network. It may contain non-convolutional (fully connected) layers as well, and the last few layers will typically be fully connected. However, the convolution locks the size of the layers in the spatial dimension, as can be seen in figure 2.1, meaning that the distance between spatially neighbouring nodes will be one throughout the network. This is undesirable, since in the upper layers features might be larger spatially, and to increase the field of view of c_j there will have to be many non-zero w_{i,j}. This is solved with routing layers. A routing layer redirects the information flow in a network, breaking the standard structure. Max pooling is the most frequently used routing layer.


Figure 2.2: Sketch of a 3-layer neural network with max-pooling. The filter and node dimensions are indicated by the coordinate system above (im1, im2, channel). The cubes' side lengths represent the number of nodes/filters in the layer. The abbreviations "conv" and "fc" indicate convolutional and fully connected layers and "m-p" max-pooling. Here the max-pooling has a patch size of 2 × 2.

It reduces the spatial dimensions by dividing the layer into patches in the spatial dimension, ignoring all but the node with the maximum activation in each patch, see figure 2.2. This is inefficient since it throws away information. Also, by using convolutions, feature position will be encoded by neuron position. Relative geometry between features will thus be learned only in the last few fully connected layers. Here relative geometry refers to the relations of positions and orientations between features.

2.2 Capsule networks

The capsule network builds on the standard convolutional neural network structure by introducing a new layer type: capsule layers. In a capsule layer each building block is a capsule represented by a vector, rather than a node represented by a scalar. This is done in the hope that each capsule will not only represent a feature, but also traits of the feature, such as pose, hue, texture, and deformation [11]. Given the capsule activations and traits of layer l, one calculates the activation and traits of a capsule in layer l + 1 as

(u_j, a_j) = f_{i,j}(v_{i,j}, a_i, r_{i,j}). \quad (2.2)

Here i ∈ L_l and j ∈ L_{l+1}, where L_l is the set of capsule indices in layer l. The arguments (u_j, a_j) are the traits and activation of capsule j, f_{i,j} is a non-linear activation function which may have parameters dependent on i and j, r_{i,j} describes how much the traits and activations of capsule i may affect the traits of capsule j, and v_{i,j} = W_{i,j}(u_i) is the prediction of node j's traits given node i's traits. Here W_{i,j} is a linear prediction function. The parameters of W_{i,j} and f_{i,j} are dependent on i, j and are learned. One can make convolutional capsule layers by sharing the weights between capsules which have the same relative position to each other in image space.

The r_{i,j} should be large for (i, j) such that v_{i,j} is similar to the other v_{k,j}, k ≠ i. Finding groups of similar v_{i,j} can be seen as a clustering problem, as we do not want a lower level feature to be explained by several different higher level features or objects. It makes sense to normalize the r_{i,j} so that Σ_j r_{i,j} = 1, which makes it possible to interpret r_{i,j} as the probability that capsule i is grouped to cluster j, that is: capsule (v_{i,j}, a_i) should contribute to the calculation of (u_j, a_j). The design of both f and the clustering algorithm is rather open and [5, 11, 21] have some different proposals, which will be discussed further in section 2.4. However, we can list some desirable properties which make sense on an intuitive level.

• We want f_{i,j} to depend mainly on the (v_{i,j}, a_i) where r_{i,j} is large.

• We want the implementation of f and the clustering to be feasible in a network structure. This means it has to be cheap to compute and converge somewhat quickly.

• We want the total activation in a layer to have a reasonable total energy. Preferably it should be normalized so that it can be interpreted as probabilities.

• We would like f to introduce some type of non-linearity.

• We would like f to not only consider the previous layer's activation a_i, but also include a bias, as some features are more important than others for the next level of features.

It is also important to note that the clusters live in different spaces. Data point u_i will generate a prediction v_{i,j} which depends on j as well. Also, in the case of convolutional capsule layers different capsules will see different data points, complicating the clustering interpretation further. A simplified example of routing between a 3-capsule and a 2-capsule layer can be seen in figure 2.3.

Figure 2.3: Routing between a 3-capsule and a 2-capsule layer. Here u is represented by circle position and a is represented by circle size. To make the image more comprehensible W_{1,1} is put to the identity transformation.

Given that the network is used for classification, the last layer will have one capsule per class. The capsules can be fed as inputs to a decoder used to create a reconstruction regularizer. The idea is that punishing bad reconstructions will force the capsules in the last layer to learn proper traits, which seems to help with training [11]. This assumes that there exists an implicit constraint space of the same dimension as the capsule vector length, which is able to describe the intra-class variation, as discussed in [18].

2.3 Clustering

In the following sections different clustering methods will be described. Observations will be denoted (x_1, x_2, ..., x_n). The weights or activations of (x_1, x_2, ..., x_n) will be denoted (a_1, a_2, ..., a_n).

2.3.1 K-means

The clusters are denoted by (C_1, C_2, ..., C_k) where the C_j are disjoint sets of observations and ∪_{j=1}^{k} C_j = {x_i : ∀i ∈ [1, n]}. The goal is to solve

\arg\min_{\mathcal{C},\mu} L(\mathcal{C}, \mu_j) = \arg\min_{\mathcal{C},\mu} \sum_{j=1}^{k} \sum_{x_i \in \mathcal{C}_j} a_i \| x_i - \mu_j \|_2 \quad (2.3)

where µ_j is the prototype of cluster j. The minimization can be performed by initializing the clusters randomly and then iteratively solving a cluster update and a mean update. The cluster update is

\forall j, \; \mathcal{C}_j = \{ x_i : \| x_i - \mu_j \|_2 < \| x_i - \mu_k \|_2, \; \forall k \neq j \}, \quad (2.4)

and the mean update

\mu_j = \frac{\sum_{x_i \in \mathcal{C}_j} a_i x_i}{\sum_{x_i \in \mathcal{C}_j} a_i}. \quad (2.5)

The algorithm stops when a cluster update makes no change. One can see that the algorithm converges to a local minimum by noting that L is decreasing during both the mean and the cluster update. By making the cluster assignments fuzzy, we get the soft k-means algorithm. We introduce cluster responsibilities r_{i,j} ∈ [0, 1] : \sum_{j=1}^{k} r_{i,j} = 1, which model p(x_i ∈ \mathcal{C}_j), and try to minimize or maximize some cost or goal function L(r_{i,j}, \mu_j) w.r.t. the cluster prototypes µ_j and r_{i,j}.
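As an illustration, a minimal numpy sketch of weighted soft k-means in this spirit follows. It is an assumed example, not code from any of the cited papers: the responsibilities are taken as a softmax over negative squared distances with an assumed inverse temperature beta, which is only one of several possible choices of goal function.

import numpy as np

def soft_kmeans(x, a, k, iters=3, beta=1.0, seed=0):
    """Weighted soft k-means.

    x: (n, d) data points, a: (n,) non-negative weights, k: number of clusters.
    Returns responsibilities r (n, k) and cluster prototypes mu (k, d).
    """
    rng = np.random.default_rng(seed)
    mu = x[rng.choice(len(x), size=k, replace=False)]           # initialize prototypes
    for _ in range(iters):
        d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(-1)    # (n, k) squared distances
        logits = -beta * d2
        r = np.exp(logits - logits.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)                       # soft assignments, rows sum to 1
        w = r * a[:, None]                                      # weight by activations
        mu = (w.T @ x) / w.sum(axis=0)[:, None]                 # weighted mean update, cf. (2.5)
    return r, mu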

2.3.2 Expectation maximization

Expectation maximization is an algorithm based on Bayesian inference, with the goal to maximize the likelihood L(θ; X) = p(X|θ) of some model parameters θ given observations X and some unobserved variables Z. With the use of Gibbs' inequality one may prove that, given a current guess of the parameters θ^(t), choosing a new guess to be

\theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^{(t)}) = \arg\max_\theta E_{Z|X,\theta^{(t)}}[\log(L(\theta; X, Z))] = \arg\max_\theta \sum_Z p(Z|X, \theta^{(t)}) \log(p(X, Z|\theta)) \quad (2.6)

will not decrease p(X|θ). This means that since p(X|θ) ≤ 1, updating θ in this fashion will cause p(X|θ) to converge to a local maximum.

em can be used as a clustering algorithm by assuming a mixture model (\mathcal{C}_1, \mathcal{C}_2, ..., \mathcal{C}_k), where for all j cluster \mathcal{C}_j \sim D(\theta_j) is a random variable belonging to distribution D with parameters θ_j. Here Z = (z_1, z_2, ..., z_n) describes which cluster X was generated by, through p(x_i | z_i = j) \sim D(\theta_j). Another parameter of the model is (π_1, π_2, ..., π_k). Here π_j describes the prior probability of cluster \mathcal{C}_j.

If one assumes Gaussian distributions (D(θ_j) = N(µ_j, Σ_j)) one gets expectation maximization of a Gaussian mixture model (gmm-em), which is what many refer to when they talk about em. To practically apply the algorithm we have to make some changes. In the network there will be priors or weights on the observed variables, which could be accounted for in multiple ways. It would make logical sense to put p(x_i | z_i = j, a_i) \sim N(µ_j, Σ_j)^{a_i}. However this is not a distribution since it does not integrate to one. The authors of [4] normalize by noticing N(µ_j, Σ_j)^{a_i} \propto N(µ_j, Σ_j / a_i) and introducing \hat{p}(x_i | z_i = j, a_i) = N(µ_j, Σ_j / a_i). Solving the em problem with

r_{i,j}^{(t)} = p(z_i = j | x_i, a_i, \pi_j^{(t)}, \mu_j^{(t)}, \Sigma_j^{(t)}) = \frac{\pi_j^{(t)} N(x_i; \mu_j^{(t)}, \Sigma_j^{(t)}/a_i)}{\sum_{l=1}^{k} \pi_l^{(t)} N(x_i; \mu_l^{(t)}, \Sigma_l^{(t)}/a_i)} \quad (2.7)

and

p(x, z | a, \pi, \mu, \Sigma) = \prod_{j=1}^{k} \prod_{i=1}^{n} \left( \pi_j N(x_i; \mu_j, \Sigma_j/a_i) \right)^{1(z_i = j)} \quad (2.8)

results in

\mu_j^{(t+1)} = \frac{\sum_{i=1}^{n} r_{i,j}^{(t)} a_i x_i}{\sum_{i=1}^{n} r_{i,j}^{(t)} a_i}, \qquad
\Sigma_j^{(t+1)} = \frac{\sum_{i=1}^{n} r_{i,j}^{(t)} a_i \left(x_i - \mu_j^{(t+1)}\right)\left(x_i - \mu_j^{(t+1)}\right)^T}{\sum_{i=1}^{n} r_{i,j}^{(t)} a_i}, \qquad
\pi_j^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} r_{i,j}^{(t)} \quad (2.9)

as can be seen in Appendix A.1.
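The weighted updates (2.7)-(2.9) can be sketched directly in numpy as below. This is an illustrative sketch, not the thesis implementation; diagonal covariances are assumed here for brevity even though the derivation above allows a full Σ_j.

import numpy as np

def weighted_gmm_em(x, a, k, iters=5, eps=1e-6, seed=0):
    """EM for a GMM with per-point weights a_i and diagonal covariances.

    x: (n, d), a: (n,). Returns r (n, k), mu (k, d), var (k, d), pi (k,).
    """
    rng = np.random.default_rng(seed)
    n, d = x.shape
    mu = x[rng.choice(n, size=k, replace=False)]
    var = np.ones((k, d))
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities from N(x_i; mu_j, var_j / a_i), eq. (2.7)
        diff2 = (x[:, None, :] - mu[None, :, :]) ** 2               # (n, k, d)
        scaled_var = var[None, :, :] / a[:, None, None] + eps
        log_p = -0.5 * (diff2 / scaled_var + np.log(2 * np.pi * scaled_var)).sum(-1)
        log_p += np.log(pi)[None, :]
        r = np.exp(log_p - log_p.max(1, keepdims=True))
        r /= r.sum(1, keepdims=True)
        # M-step: weighted means, variances and priors, eq. (2.9)
        w = r * a[:, None]                                          # (n, k)
        mu = (w.T @ x) / (w.sum(0)[:, None] + eps)
        var = np.einsum('nk,nkd->kd', w, (x[:, None, :] - mu[None, :, :]) ** 2)
        var = var / (w.sum(0)[:, None] + eps)
        pi = r.mean(0)
    return r, mu, var, pi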

2.3.3 Kernel density estimation

Kernel density estimation is defined as

f(x, h) = \frac{1}{nh} \sum_{i=1}^{n} k(d(x - x_i)/h) \quad (2.10)

where (x_1, x_2, ..., x_n) are data points drawn from X, d is a distance measure, k is a kernel function, h is a smoothing parameter, and f is an estimate of the probability density function of X. It is desirable to choose k such that \int_{\mathbb{R}^d} k(x) \, dx = 1, k(-x) = k(x), and \lim_{\|x\| \to \infty} k(x) = 0. This density estimate can be used in clustering algorithms as we will see in section 2.4.4.
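A minimal sketch of (2.10), with a Gaussian kernel and the Euclidean norm as assumed choices of k and d, could look as follows.

import numpy as np

def kde(x_query, x_data, h=1.0):
    """Kernel density estimate f(x, h) = 1/(n h) * sum_i k(d(x - x_i) / h).

    Uses a Gaussian kernel and the Euclidean norm as distance (assumed choices),
    with the 1/(n h) normalization of eq. (2.10).
    x_query: (d,), x_data: (n, d).
    """
    n, d = x_data.shape
    u = np.linalg.norm(x_query - x_data, axis=1) / h
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return k.sum() / (n * h)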

2.3.4 Variational Bayesian methods

Let X be observed variables and Z be unobserved variables. We now try to find an estimate q(Z) of p(Z|X). By inspecting

\log p(X) = \int_Z q(Z) \log \frac{p(X, Z)}{q(Z)} \, dZ - \int_Z q(Z) \log \frac{p(Z|X)}{q(Z)} \, dZ \quad (2.11)

we note that the KL-distance between q(Z) and p(Z|X) is minimized when L(q(Z)) = \int_Z q(Z) \log \frac{p(X, Z)}{q(Z)} \, dZ is maximized. To make the maximization tractable one can make use of the approximation q(Z) = \prod_i^n q_i(Z_i), that is, that some of the parameters and unobserved variables are independent given X. This is called the mean field approximation. We may now maximize w.r.t. one q_j(Z_j) at a time:

L(q_j(Z_j)) = \int_Z \prod_i^n q_i(Z_i) \left( \log p(X, Z) - \sum_i^n \log q_i(Z_i) \right) dZ

= \int_{Z_j} q_j(Z_j) \left( \int \log p(X, Z) \prod_{i \neq j} q_i(Z_i) \, dZ_i \right) dZ_j - \int_{Z_j} q_j(Z_j) \log q_j(Z_j) \, dZ_j + \text{const}

= -KL\left( \exp\left( \int \log p(X, Z) \prod_{i \neq j} q_i(Z_i) \, dZ_i \right) \,\Big\|\, q_j(Z_j) \right) + \text{const} \quad (2.12)

which is maximized when

q_j^*(Z_j) \propto \exp\left( \int \log p(X, Z) \prod_{i \neq j} q_i(Z_i) \, dZ_i \right) = \exp\left( E_{q_i(z_i),\, i \neq j}[\log p(x, z)] \right). \quad (2.13)

This is iterated for different j [1]. By introducing a mixture model with priors this may be used as a clustering algorithm, as can be seen in chapter 3.

2.4 Routing in current capsule networks

This section describes the routing algorithms of some of the currently existing capsule network implementations. Notation and definitions from section 2.2 are used throughout this section.

2.4.1 Dynamic routing

This is the original routing algorithm published in [11]. Here capsule i's activity a_i is modeled by the length of u_i and the capsule traits are modeled by the direction of u_i. The predicted traits are v_{i,j} = W_{i,j} u_i, where W_{i,j} is learned. The routing is described in Algorithm 1.

Algorithm 1 Dynamic routing
1: procedure DynamicRouting(v_{i,j}, t, l)
2:   ∀i ∈ L_l, j ∈ L_{l+1}: b_{i,j} ← b_{i,j}^(0)
3:   for t iterations do
4:     ∀i ∈ L_l, j ∈ L_{l+1}: r_{i,j} ← softmax_j(b_{i,j})
5:     ∀j ∈ L_{l+1}: µ_j ← Σ_i r_{i,j} v_{i,j}
6:     ∀j ∈ L_{l+1}: u_j ← squash(µ_j)
7:     ∀i ∈ L_l, j ∈ L_{l+1}: b_{i,j} ← b_{i,j} + ⟨v_{i,j}, u_j⟩
8:   return u_j

Here the r_{i,j} model p(z_i = j), which are initially represented by the log priors b_{i,j}^(0), which are learned. However, b_{i,j}^(0) = 0, ∀i, j has been used as well [11]. Line 4 in Algorithm 1,

r_{i,j} = \text{softmax}_j(b_{i,j}) = \frac{\exp(b_{i,j})}{\sum_{l=1}^{k} \exp(b_{i,l})}, \quad (2.14)

ensures that \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} = 1. The expected pose µ_j of cluster j is calculated as a weighted mean, where the activation is implicit in the length of v_{i,j}. The squash function is

u_j = \text{squash}(\mu_j) = \frac{\|\mu_j\|_2^2}{1 + \|\mu_j\|_2^2} \, \frac{\mu_j}{\|\mu_j\|_2}, \quad (2.15)

which forces the length of u_j to be less than one.

By separating the activation v_{i,j} = a_{i,j} \hat{v}_{i,j}, where \|\hat{v}_{i,j}\| = 1, we can view the algorithm as an attempt to maximize

\sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} a_{i,j} \cos(v_{i,j}, \mu_j) = \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} a_{i,j} \left\langle \frac{v_{i,j}}{\|v_{i,j}\|}, \frac{\mu_j}{\|\mu_j\|} \right\rangle \quad (2.16)

with respect to r_{i,j}, and the boundary condition \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} = 1. Here µ_j is the cluster prototype. Maximizing w.r.t. µ_j we find

\sum_{i \in \mathcal{L}_l} r_{i,j} a_{i,j} \left\langle \frac{v_{i,j}}{\|v_{i,j}\|}, \frac{\mu_j}{\|\mu_j\|} \right\rangle = \left\langle \sum_{i \in \mathcal{L}_l} r_{i,j} a_{i,j} \frac{v_{i,j}}{\|v_{i,j}\|}, \frac{\mu_j}{\|\mu_j\|} \right\rangle \quad (2.17)

which is maximized by \mu_j = \sum_{i \in \mathcal{L}_l} r_{i,j} a_{i,j} \frac{v_{i,j}}{\|v_{i,j}\|} = \sum_{i \in \mathcal{L}_l} r_{i,j} v_{i,j}. Here one could normalize µ_j as it does not matter for the distance measure. However, the final µ_j should be large where r_{i,j} and a_{i,j} are large, so one may refrain from any normalization of µ_j. Maximizing w.r.t. r_{i,j} gives r_{i,j} = \frac{a_{i,j} d_{i,j}}{\sum_j a_{i,j} d_{i,j}}, where d_{i,j} = \cos(v_{i,j}, \mu_j). However, this is not the method in Algorithm 1. In [13] they show that if one adds a Kullback–Leibler regularizer

-\alpha \, KL(r_{i,j}^{(t)} \,|\, r_{i,j}^{(t-1)}) = -\alpha \sum_{i,j} r_{i,j}^{(t)} \left( \log r_{i,j}^{(t)} - \log r_{i,j}^{(t-1)} \right) \quad (2.18)

with α = 1, one gets Algorithm 1, apart from line 6 in Algorithm 1, which is not explained by this viewpoint [13]. So if line 6 were put outside the for loop, so that u_j is replaced with µ_j in line 7, one is maximizing

L(r_{i,j}, \mu_j) = \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} a_{i,j} \cos(v_{i,j}, \mu_j) - KL(r_{i,j}^{(t)} \,|\, r_{i,j}^{(t-1)}). \quad (2.19)

The proof can be seen in Appendix A.2. Practically, having line 6 inside the loop will decrease the vector lengths and might hence act as an extra regularizer. The routing method can be seen as a soft k-means algorithm with the data points v_{i,j} and cluster prototypes µ_j represented as points on the surface of a hypersphere. The distance measure on the sphere is the orthogonal projection described by cos(v_{i,j}, µ_j). As the angle between v_{i,j} and µ_j gets small, the derivative of the distance measure, −sin(v_{i,j}, µ_j), gets close to 0. This makes the algorithm insensitive to the difference between good and really good matches, which was noted by the proposers of the algorithm [5].
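For concreteness, a minimal numpy sketch of Algorithm 1 for a single fully connected capsule layer is given below. It is a sketch under the notation above, not the thesis TensorFlow implementation, and the uniform prior b^(0) = 0 is assumed.

import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Eq. (2.15): shrink vectors to length < 1 while keeping direction."""
    n2 = (s ** 2).sum(axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(v, iters=3):
    """Algorithm 1. v: (n_in, n_out, d) votes v_{i,j}; returns u_j of shape (n_out, d)."""
    n_in, n_out, _ = v.shape
    b = np.zeros((n_in, n_out))                       # b^(0) = 0 (uniform prior assumed)
    for _ in range(iters):
        r = np.exp(b - b.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)             # softmax over j, eq. (2.14)
        mu = np.einsum('ij,ijd->jd', r, v)            # weighted sum of votes
        u = squash(mu)                                # eq. (2.15)
        b = b + np.einsum('ijd,jd->ij', v, u)         # agreement <v_{i,j}, u_j>
    return u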

2.4.2 Optimized dynamic routing

Wang and Liu [13] propose some tweaks to the dynamic routing algorithm.

• Use an entropy regularizer to force r_{i,j} to be close to uniform, instead of the Kullback–Leibler regularizer which forces it to be close to the value in the previous iteration.

• Move the activation function outside the for loop.

• Put a weight on the prediction, which reflects the norm of the prediction function.

This results in Algorithm 2, which maximizes

L(r_{i,j}, \mu_j) = \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} a_{i,j} \cos(o_{i,j}, \mu_j) - \alpha \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} \log r_{i,j} \quad (2.20)

where o_{i,j} = \frac{W_{i,j} u_i}{\|W_{i,j}\|_F}, and α is the regularizer weight, which is a hyperparameter.

Algorithm 2 Optimized dynamic routing
1: procedure OptimizedDynamicRouting(o_{i,j}, t, l, α)
2:   for all capsules i in layer l and j in layer (l+1): b_{i,j} ← 0
3:   for t iterations do
4:     for all capsules i in layer l: r_{i,j} ← softmax_j(b_{i,j})
5:     for all capsules j in layer (l+1): µ_j ← Σ_i r_{i,j} o_{i,j},  µ̂_j = µ_j / ‖µ_j‖_2
6:     for all capsules i in layer l and j in layer (l+1): b_{i,j} ← (1/α) ⟨o_{i,j}, µ̂_j⟩
7:   return squash(µ_j)

The point of normalizing with respect to W_{i,j} is that W_{i,j} will have a large impact on the length of v_{i,j}, which here represents the activation of that capsule. The following equation describes what is known about the normalization

\|v_{i,j}\|_2 = \|W_{i,j} u_i\|_2 \leq \|W_{i,j}\|_2 \|u_i\|_2 \leq \|W_{i,j}\|_F \|u_i\|_2. \quad (2.21)

Meaning \frac{\|v_{i,j}\|_2}{\|W_{i,j}\|_F} \leq \|u_i\|_2. The Frobenius norm is used since it is faster to calculate than the L2-norm. Note that this algorithm always makes use of uniform priors.

2.4.3 EM-routing

Hinton et al. also developed a routing algorithm based on gmm-em, which can be seen in Algorithm 3 [5].

Algorithm 3 EM routing
1: procedure EMRouting(v_{i,j}, t, l, α)
2:   for all capsules i ∈ L_l and j ∈ L_{l+1}: r_{i,j} ← 1/|L_{l+1}|
3:   for t iterations do
4:     for all capsules j ∈ L_{l+1}: M-step(a_i, r_{i,j}, v_{i,j}, j)
5:     for all capsules i ∈ L_l: E-step(µ_j, σ_j, a_j, v_{i,j}, i)
6:   return a_j, µ_j

7: procedure M-step(a, r, v_{i,j}, j)
8:   ∀h: µ_j^h ← Σ_{i∈L_l} r_{i,j} a_i v_{i,j}^h / Σ_{i∈L_l} r_{i,j} a_i
9:   ∀h: (σ_j^h)^2 ← Σ_{i∈L_l} r_{i,j} a_i (v_{i,j}^h − µ_j^h)^2 / Σ_{i∈L_l} r_{i,j} a_i
10:  cost^h ← (β_u + log σ_j^h) Σ_{i∈L_l} r_{i,j} a_i
11:  a_j ← logistic(λ(β_a − Σ_h cost^h))

12: procedure E-step(µ, σ, a_j, v_{i,j}, i)
13:  ∀j ∈ L_{l+1}: p_j ← (1 / √(∏_{h=1}^{H} 2π(σ_j^h)^2)) exp(−Σ_{h=1}^{H} (v_{i,j}^h − µ_j^h)^2 / (2(σ_j^h)^2))
14:  ∀j ∈ L_{l+1}: r_{i,j} ← a_j p_j / Σ_{k∈L_{l+1}} a_k p_k

This algorithm is pretty close to the gmm-em described in section 2.3.2, but three differences are apparent:

• A diagonal covariance matrix Σ has been assumed.

• As in the dynamic routing case the cluster probability has been altered inside the loop.

• v_{i,j} is dependent on j as well, which has been discussed previously.

A diagonal covariance matrix Σ is a quite rough assumption, but it is also motivated by the cost of the time consuming matrix inversion. Why line 14 in Algorithm 3 uses a_j instead of π_j = \frac{1}{|\mathcal{L}_l|} \sum_{i \in \mathcal{L}_l} r_{i,j} a_i is not motivated in [5]. It might work as a regularizer, but this author has not tried to prove convergence. Here β_a, β_u are bias terms, σ^h is the square root of element h in the covariance matrix, λ is a hyperparameter used to scale the preactivation to the active part of the activation function, and logistic is the sigmoid

\text{logistic}(x) = \frac{1}{1 + e^{-x}}.

The learned biases β_a, β_u are individual per channel, but shared across image space. The trait prediction function used is

V_{i,j} = W_{i,j} U_i

where U_i is u_i reshaped to a matrix, V_{i,j} is v_{i,j} reshaped to a matrix and W_{i,j} is a matrix. Note that this demands that the trait vector length is a square number. It also lowers the number of parameters of the prediction function compared to the prediction function in section 2.4.1.
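A condensed numpy sketch of Algorithm 3 for one fully connected capsule layer follows. It is an illustrative sketch with assumed shapes, not the thesis implementation; the learned biases β_a, β_u and the schedule of λ are left as plain arguments.

import numpy as np

def em_routing(v, a_in, iters=3, beta_a=0.0, beta_u=0.0, lam=1.0, eps=1e-8):
    """Algorithm 3. v: (n_in, n_out, d) votes, a_in: (n_in,) lower activations.
    Returns upper activations a_out (n_out,) and means mu (n_out, d)."""
    n_in, n_out, d = v.shape
    r = np.full((n_in, n_out), 1.0 / n_out)
    for _ in range(iters):
        # M-step
        w = r * a_in[:, None]                                        # (n_in, n_out)
        denom = w.sum(0) + eps                                       # (n_out,)
        mu = np.einsum('ij,ijd->jd', w, v) / denom[:, None]
        var = np.einsum('ij,ijd->jd', w, (v - mu[None]) ** 2) / denom[:, None] + eps
        cost = (beta_u + 0.5 * np.log(var)) * denom[:, None]         # cost^h per dimension
        a_out = 1.0 / (1.0 + np.exp(-lam * (beta_a - cost.sum(1))))  # logistic
        # E-step
        log_p = -0.5 * (((v - mu[None]) ** 2) / var[None]
                        + np.log(2 * np.pi * var[None])).sum(-1)
        p = a_out[None, :] * np.exp(log_p - log_p.max(1, keepdims=True))
        r = p / (p.sum(1, keepdims=True) + eps)
    return a_out, mu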

2.4.4 Fast dynamic routing based on weighted kernel density estimation

In [21] they use the kernel model

f(\mu, r) = \frac{1}{|\mathcal{L}_l| z_k} \sum_{j \in \mathcal{L}_{l+1}} \sum_{i \in \mathcal{L}_l} r_{i,j} a_i \, k(d(\mu_j - v_{i,j})), \quad (2.22)

where f is the density estimate which is to be maximized w.r.t. µ and r, z_k normalizes the kernel k, and d is a distance measure. This can be solved in multiple ways and [21] proposes two. One is based on mean shift, which iterates between solving for µ with fixed r and updating r with sgd and step length α followed by a normalization, see Algorithm 4. The second introduces cluster priors π_j and solves with em, see Algorithm 5.

Algorithm 4 FRMS
1: procedure DynamicRoutingBasedOnMeanShift(v, a, k, d, α, t)
2:   for all capsules i ∈ L_l and j ∈ L_{l+1}: r_{i,j} ← 1/|L_{l+1}|
3:   for t iterations do
4:     for all capsules i ∈ L_l and j ∈ L_{l+1}: r̂_{i,j} ← r_{i,j} / Σ_{l∈L_{l+1}} r_{i,l}
5:     for all capsules j ∈ L_{l+1}: µ_j ← Σ_{i∈L_l} r̂_{i,j} a_i k'(d(µ_j − v_{i,j})) v_{i,j} / Σ_{i∈L_l} r̂_{i,j} a_i k'(d(µ_j − v_{i,j}))
6:     for all capsules i ∈ L_l and j ∈ L_{l+1}: r_{i,j} ← r_{i,j} + α a_i k(d(µ_j − v_{i,j}))
7:   return µ_j

Both versions return only µ_j, and a_j has to be calculated afterwards. The paper [21] proposes

a_j = \text{softmax}_j\Big( \sum_{i \in \mathcal{L}_l} \hat{r}_{i,j} a_i \, k\big( \textstyle\sum_{h=0}^{H} d(v_{i,d} - \beta_{j,h} s_{j,h}) + \beta_{j,0} \big) \Big). \quad (2.23)

Algorithm 5 FREM
1: procedure DynamicRoutingBasedOnKdeAndEm(v, a, k, d, t)
2:   for all capsules i ∈ L_l and j ∈ L_{l+1}: r_{i,j} ← 1/|L_{l+1}|
3:   for t iterations do
4:     for all capsules i ∈ L_l and j ∈ L_{l+1}: r̂_{i,j} ← r_{i,j} / Σ_{l∈L_{l+1}} r_{i,l}
5:     for all capsules j ∈ L_{l+1}: µ_j ← Σ_{i∈L_l} r̂_{i,j} a_i v_i / Σ_{i∈L_l} r̂_{i,j} a_i
6:     for all capsules j ∈ L_{l+1}: π_j ← Σ_{i∈L_l} r̂_{i,j} / Σ_{i∈L_l} Σ_{j∈L_{l+1}} r̂_{i,j}
7:     for all capsules i ∈ L_l and j ∈ L_{l+1}: r_{i,j} ← π_j k(d(µ_j − v_i))
8:   return µ_j

Here the β_{j,h} and β_{j,0} are learned parameters. One should note that compared to the gmm-em routing this algorithm is quite cheap to execute, especially if one chooses k to be a triangle function and d to be the L1-norm.

3 Model design

This chapter describes the evaluated methods. Propositions of new routing models with theoretical motivations can be seen in section 3.1. Modifications to the clustering algorithms, necessary because they are used for routing in a neural network, are described in section 3.2. Existing algorithms which will be tested with some modifications are described in section 3.3. Throughout this chapter, if multiple options are described it is implied that all versions are implemented and that the choice of version is treated as a hyperparameter.

3.1 Novel routing algorithms

This section contains the theory behind the proposed routing algorithms. Practical modifications of the algorithms can be seen in section 3.2.

3.1.1 Mean field Gaussian mixture model

The mean field Gaussian mixture model (mfgmm) is designed mainly with the goal to incorporate the biases in a well motivated fashion. The frms, frem, and em-routing all introduce biases on the cluster probabilities; however, it might make sense to introduce biases on the latent variables instead, as has been done in the dr algorithm. Imagine a lower capsule representing a triangle which may or may not be the sail of the boat which the next level capsule is representing.^1 One may now argue to introduce a bias for the triangle being a sail, rather than a bias for a triangle and a bias for the boat. I propose an algorithm which is based on gmm as in em-routing but instead of maximizing model parameter likelihood with em, it introduces priors on all

^1 Shamelessly borrowing the example from https://www.youtube.com/watch?v=pPN8d0E3900

parameters and solves it with the Bayesian mean field approximation. I borrow the notation from section 2.3. Noticing that p(z_i = j | π_i) = p_{i,j} ~ Cat(π_i), ∀i ∈ L_l, is a categorical distribution, which has the Dirichlet distribution as its conjugate prior, we put π_{i,j} ~ Dir(ρ_i), ∀i ∈ L_l. Note that the priors here are data-point individual. Here ρ_i will model the routing bias and will be learned. Practically we want to regularize ρ_i as well to prevent overfitting and make it prefer a uniform prior. This can be done by changing the representation to ρ_{i,j} = e^{b_{i,j}} and adding the term λ_reg Σ_{i,j} ‖b_{i,j}‖_2 to the loss.

Since making matrix inversions is too computationally heavy we must still assume diagonal covariance matrices for the Gaussians, which is the same as independent Gaussians. This results in a conjugate prior which is normal-gamma distributed. That is

\mu_{j,d} \sim N(m_{j,d}^{(0)}, 1/(\tau_{j,d} \lambda_{j,d}^{(0)})), \qquad \tau_{j,d} \sim \Gamma(\alpha_{j,d}^{(0)}, \beta_{j,d}^{(0)}). \quad (3.1)

When applying the clustering in the network, x will also be dependent on j as in previous methods, and we will also add the weights a_i to each data point. All in all the model becomes:

p(x, z, \pi, \mu, \tau) = p(x | z, \mu, \tau) \, p(z | \pi) \, p(\pi) \, p(\mu | \tau) \, p(\tau)

where

p(x | z, \mu, \tau) = \prod_{i \in \mathcal{L}_l} \prod_{j \in \mathcal{L}_{l+1}} \prod_d N(x_{i,j,d} | \mu_{j,d}, 1/(a_i \tau_{j,d}))^{z_{i,j}}

p(z | \pi) = \prod_{i \in \mathcal{L}_l} \prod_{j \in \mathcal{L}_{l+1}} \pi_{i,j}^{z_{i,j}}

p(\pi) = \prod_{i \in \mathcal{L}_l} \frac{\Gamma\big(\sum_{j \in \mathcal{L}_{l+1}} \rho_{i,j}^{(0)}\big)}{\prod_{j \in \mathcal{L}_{l+1}} \Gamma(\rho_{i,j}^{(0)})} \prod_{j \in \mathcal{L}_{l+1}} \pi_{i,j}^{\rho_{i,j}^{(0)} - 1} \quad (3.2)

p(\mu | \tau) = \prod_{j \in \mathcal{L}_{l+1}} \prod_d N(\mu_{j,d} | m_{j,d}^{(0)}, 1/(\tau_{j,d} \lambda_{j,d}^{(0)}))

p(\tau) = \prod_{j \in \mathcal{L}_{l+1}} \prod_d \Gamma(\tau_{j,d} | \alpha_{j,d}^{(0)}, \beta_{j,d}^{(0)}).

While making the estimate we have to make the approximation q(z, π, µ, τ) = q(z) q(π, µ, τ), where q is the estimated probability density function. Since we do not want to make assumptions about the cluster size, α_{j,d}^(0), β_{j,d}^(0), λ_{j,d}^(0) will be the same for all clusters and put to a constant low-information prior, treated as a hyperparameter. The mean prior m_{j,d}^(0) will be set to 0. The calculations can be seen in Appendix A.3. The result is Algorithm 6.

Algorithm 6 Mean field gmm
1: procedure RoutingBasedOnMeanFieldGmm(x, a)
2:   n = 0
3:   for t iterations do
4:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: h_{i,j} = (1/2) Σ_d [ γ(α_{j,d}^(n)) − log(β_{j,d}^(n)) − (a_i α_{j,d}^(n) / β_{j,d}^(n)) (x_{i,j,d} − m_{j,d}^(n))^2 − a_i / λ_{j,d} ]
5:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: g_{i,j} = γ(ρ_{i,j})
6:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: g_{i,j} = g_{i,j} − γ(Σ_{j∈Ω_{L+1}} ρ_{i,j})
7:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: p_{i,j} = e^{h_{i,j} + g_{i,j}}
8:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: r_{i,j} = p_{i,j} / Σ_{j∈Ω_{L+1}} p_{i,j}
9:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: ρ_{i,j} = ρ_{i,j}^(0) + r_{i,j}
10:    ∀j ∈ Ω_{L+1}: λ_j^(n) = Σ_{i∈Ω_L} r_{i,j} a_i + λ_j^(0)
11:    ∀j ∈ Ω_{L+1}: m_j^(n) = Σ_{i∈Ω_L} r_{i,j} a_i x_{i,j} / λ_j^(n)
12:    ∀j ∈ Ω_{L+1}: β_j^(n) = (Σ_{i∈Ω_L} r_{i,j} a_i x_{i,j}^2 − λ_j^(n) (m_j^(n))^2) / 2 + β_j^(0)
13:    ∀j ∈ Ω_{L+1}: α_j^(n) = α_j^(0) + (1/2) Σ_{i∈Ω_L} r_{i,j}
14:    ∀j ∈ Ω_{L+1}: a_j = f(Σ_{i∈Ω_L} a_i r_{i,j})
15:  return µ_j, a_j

Here f is an activation function. This algorithm is computationally costly, and it seems it would be desirable to find a simpler one. Removing the variance prior we get the algorithm in the following section.

3.1.2 Mean field soft k-means

The mean field soft k-means (mfskm) algorithm is built on the following model

p(x, z, \pi, \mu) = p(x | z, \mu) \, p(z | \pi) \, p(\pi) \, p(\mu)

where

p(x | z, \mu) = \prod_{i \in \Omega_L} \prod_{j \in \Omega_{L+1}} \prod_d N(x_{i,j,d} | \mu_{j,d}, 1/(\beta a_i))^{z_{i,j}}

p(z | \pi) = \prod_{i \in \Omega_L} \prod_{j \in \Omega_{L+1}} \pi_{i,j}^{z_{i,j}} \quad (3.3)

p(\pi) = \prod_{i \in \Omega_L} \frac{\Gamma\big(\sum_{j \in \Omega_{L+1}} \rho_{i,j}^{(0)}\big)}{\prod_{j \in \Omega_{L+1}} \Gamma(\rho_{i,j}^{(0)})} \prod_{j \in \Omega_{L+1}} \pi_{i,j}^{\rho_{i,j}^{(0)} - 1}

p(\mu) = \prod_{j \in \Omega_{L+1}} \prod_d N(\mu_{j,d} | m_{j,d}^{(0)}, 1/\tau_{j,d}^{(0)}).

The calculations can be seen in Appendix A.4, which results in Algorithm 7.

Algorithm 7 Mean field soft k-means
1: procedure RoutingBasedOnMeanFieldSoftKMeans(x, a)
2:   n = 0
3:   for t iterations do
4:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: h_{i,j} = −(β a_i / 2) Σ_d [ (x_{i,j,d} − m_{j,d}^(n))^2 + 1/τ_{j,d}^(n) ]
5:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: g_{i,j} = γ(ρ_{i,j})
6:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: g_{i,j} = g_{i,j} − γ(Σ_{j∈Ω_{L+1}} ρ_{i,j})
7:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: p_{i,j} = e^{h_{i,j} + g_{i,j}}
8:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: r_{i,j} = p_{i,j} / Σ_{j∈Ω_{L+1}} p_{i,j}
9:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: ρ_{i,j} = ρ_{i,j}^(0) + r_{i,j}
10:    ∀j ∈ Ω_{L+1}: τ_j^(n) = β Σ_{i∈Ω_L} r_{i,j} a_i + τ_j^(0)
11:    ∀j ∈ Ω_{L+1}: m_j^(n) = β Σ_{i∈Ω_L} r_{i,j} a_i x_{i,j} / τ_j^(n)
12:    ∀j ∈ Ω_{L+1}: a_j = f(Σ_{i∈Ω_L} a_i r_{i,j})
13:  return µ_j, a_j

Here f is an activation function.

3.1.3 Soft k-means routing

The soft k-means algorithm (skm) is motivated to be as simple as possible. We use prototypes µj with the distance function

d(x_{i,j}, \mu_j) = \| x_{i,j} - \mu_j \|^2. \quad (3.4)

We introduce soft assignments r_{i,j}, whose updates will be regularized with the KL-distance to either the prior, the value of the previous iteration, or the negative entropy. This results in the cost function

L(r, \mu) = \sum_{i \in \Omega_L} \sum_{j \in \Omega_{L+1}} r_{i,j} a_{i,j} \, d(x_{i,j}, \mu_j) + KL(r_{i,j}, r'_{i,j}). \quad (3.5)

Minimizing the loss by differentiating and cancelling results in Algorithm 8, as can be seen in Appendix A.5. Here f is an activation function. The priors r^(0) are learned through r_{i,j}^(0) = e^{b_{i,j}^(0)} and regularized with λ_reg ‖b_{i,j}^(0)‖_2, or put to uniform. Here r'_{i,j} can be either r_{i,j}^(0), r_{i,j}^(t-1) or constant, which corresponds to b'_{i,j} being b_{i,j}^(0), b_{i,j}^(t-1) or 0.

Algorithm 8 Soft k-means

1: procedure RoutingBasedOnSoftKMeans(x, a, b^(0))
2:   b = b^(0)
3:   for t iterations do
4:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: r_{i,j} = e^{b_{i,j}} / Σ_{l∈Ω_{L+1}} e^{b_{i,l}}
5:     ∀j ∈ Ω_{L+1}: µ_j = Σ_{i∈Ω_L} r_{i,j} a_i x_{i,j} / Σ_{i∈Ω_L} r_{i,j} a_i
6:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: d_{i,j} = ‖x_{i,j} − µ_j‖^2
7:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: b_{i,j} = b'_{i,j} − a_i d_{i,j}
8:   ∀j ∈ Ω_{L+1}: a_j = f(Σ_{i∈Ω_L} a_i r_{i,j})
9:   return µ_j, a_j
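A compact numpy sketch of Algorithm 8 follows, with the uniform prior b^(0) = 0 and the constant regularizer choice b'_{i,j} = 0 (both assumed settings); it is not the thesis TensorFlow implementation.

import numpy as np

def skm_routing(x, a_in, iters=3, f=np.tanh):
    """Algorithm 8 (soft k-means routing).

    x: (n_in, n_out, d) predictions x_{i,j}, a_in: (n_in,) lower activations.
    Assumes uniform priors b^(0) = 0 and b'_{i,j} = 0.
    Returns mu (n_out, d) and a_out (n_out,).
    """
    n_in, n_out = x.shape[0], x.shape[1]
    b = np.zeros((n_in, n_out))
    for _ in range(iters):
        r = np.exp(b - b.max(1, keepdims=True))
        r /= r.sum(1, keepdims=True)                                      # line 4
        w = r * a_in[:, None]
        mu = np.einsum('ij,ijd->jd', w, x) / (w.sum(0)[:, None] + 1e-8)   # line 5
        d2 = ((x - mu[None]) ** 2).sum(-1)                                # line 6
        b = -a_in[:, None] * d2                                           # line 7 with b'_{i,j} = 0
    a_out = f((a_in[:, None] * r).sum(0))                                 # line 8
    return mu, a_out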

3.2 Practical modifications

The following section contains different modifications of the clustering methods one might want to make since they are to be used as routing methods in a network. Whether or not a modification is applied will be treated as a hyperparameter.

1. Even though many of the routing algorithms contain data and cluster specific priors b_{i,j} or ρ_{i,j}, it might be desirable to add cluster scaling and additive biases b_a, b_u as in [5], see section 2.4.3.

2. In mfgmm and mfskm we might also punish the clusters which get a large standard deviation as in [5], see section 2.4.3. The standard deviation is estimated using σ̂ = √(β/α) and σ̂ = √(1/τ), respectively.

3. The mfgmm and mfskm have priors treated as hyperparameters which are scale dependent. To make them work through different layers and the entire learning process it might be necessary to normalize the data. Normalization might be done with the matrix Frobenius norm as in [13], see section 2.4.2, or with per dimension normalization (see the sketch after this list) by

o_{i,j} = \frac{v_{i,j} - \bar{v}_{i,j}}{\hat{\sigma}_{v_{i,j}}}. \quad (3.6)

Here the mean and standard deviation are estimated over the data dimension i.

4. For skm it is also applicable to regularize by using the activation instead of the preactivation in the inner loops as in [11] and [5], as discussed in sections 2.4.1 and 2.4.3.

5. The activation function might be a logistic as in [5], but in case no additive or scaling bias is used it might also make sense to use a hyperbolic tangent, as zero activation will be unachievable otherwise.

6. The algorithm in [11], see section 2.4.1, will not only learn the pose predictions, but will also learn the importance of the activation a_i for capsule j, since W_{i,j} implicitly predicts the length of v_j. This might be desirable when activation and pose are separate as well. One can replace a_i with a_{i,j} = w_{i,j} a_i where w_{i,j} is a prediction matrix separate from W_{i,j}.

7. When implementing the routing methods the iterations are unrolled. This means that in the forward pass the activations and poses are practically propagated through multiple layers with shared weights. This means that in the backward pass one has to choose whether to let the gradient propagate through the activation/pose only in the last layer (last iteration) or in all layers (all iterations).
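To illustrate item 3, a minimal numpy sketch of the per-dimension normalization in (3.6) is given below (assumed shapes; statistics are taken over the data dimension i as stated above).

import numpy as np

def normalize_predictions(v, eps=1e-8):
    """Per-dimension normalization, eq. (3.6).

    v: (n_in, n_out, d) predictions v_{i,j}; mean and standard deviation are
    estimated over the data dimension i for each (j, d).
    """
    mean = v.mean(axis=0, keepdims=True)
    std = v.std(axis=0, keepdims=True)
    return (v - mean) / (std + eps)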

3.3 Existing algorithms

The algorithms from [11] and [5], see sections 2.4.1 and 2.4.3, will be evaluated as well, but the modifications in section 3.2 will also be tested. They will from now on be referred to as dr and em. To see which options apply to which method, see chapter 4. The dr will have the same options for log routing priors as skm, see section 3.1.3, which causes the routing algorithm in [13], see section 2.4.2, to become a special case. The frms and frem algorithms in section 2.4.4 will not be evaluated since they have already been compared to the em algorithm. They were found to gain some speed at the cost of some accuracy.

4 Experiments

The following chapter contains the experimental setup as well as results. Section 4.1 contains information about implementation details, the evaluated methods and the hyperparameter search. Section 4.2 contains information about the datasets and preprocessing. Section 4.3 contains the results.

4.1 Implementation

The implementation is based on [7], which is written in Python using the machine learning framework TensorFlow. Specifications of the hardware used for inference time measurements can be seen in table 4.1.

Table 4.1: Specifications

GPU: NVIDIA GeForce GTX 1080 Ti
CPU: Intel(R) Xeon(R) CPU E5-2643 v4

Each model will have its accuracy measured on MNIST [16] and CIFAR-10 [8]. The evaluated routing methods are:

1. dr as described in section 3.3.

2. em as in section 3.3.

3. mfgmm as described in 3.1.1.

4. mfskm as described in 3.1.2.

5. skm as described in 3.1.3.

25 26 4 Experiments

The test was performed with a capsule network structured similarly to the one in [11], and is as follows:

1. A 256 channel layer with kernel size 9 and stride 1.
2. A 32 channel primary capsule layer with kernel size 9, stride 2, and pose vector length 16.
3. A 10 channel fully connected capsule layer with pose vector length 16.

This is followed by a decoder used for regularization. The decoder is structured as follows:

1. A 512 channel fully connected layer.
2. A 1024 channel fully connected layer.

3. An output sized fully connected layer.

The reconstruction error is the per-pixel mean square error. Matrix pose estimation as in [5] is used for all routing methods, which is the only difference from the network structure in [11]. Initial tests revealed that the validation accuracy was very sensitive to two hyperparameters: the learning rate and specific activations, see tables 4.2 and 4.3. Not a single one of the tests converged when specific activations were used. Hence, in the following tests specific activations were not applied. The learning rate was optimized with the rest of the hyperparameters fixed, using grid search. Later, the rest of the hyperparameters were optimized using random search, with the optimal learning rate. The optimal parameters were chosen as the set with the best validation accuracy. Log-scale was used when generating random continuous parameters. The parameters tested for all routing methods can be seen in table 4.2. Hyperparameters used for some methods can be seen in table 4.3.

Name | Applicability | Description | Range / fixed
Responsibility regularizer | dr, skm | Regularize the responsibility as described in section 3.1.3 | First, Previous, None / First (r0)
Activation function | mfgmm, mfskm, em, skm | Use the logistic or tanh activation function. Item 5 in section 3.2 | Tanh or Logistic / Logistic
Penalize standard deviation | mfgmm, mfskm, em | Downscale activations of clusters with high standard deviation as described in item 2 in section 3.2 | Bool / True
Inner activation | em, dr, skm | Use the activation instead of the preactivation in the inner routing loop as described in item 4 in section 3.2 | Bool / True
Specific activation | em, mfgmm, mfskm, skm | Add a trainable weight that scales the activation differently for different clusters as described in item 1 in section 3.2 | –

Table 4.3: Routing parameters used for some routing methods, described in section 3.2. "Range" is the search range in the random search and "fixed" is the value used when optimizing the learning rate.

Name | Description | Range / fixed
Learning rate | The learning rate of the optimizer | 10^-5 – 10^-2
Number of iterations | The number of iterations used in the routing algorithm | 2 or 3 / 3
Data loss type | Spread loss from [5] or margin loss from [11] | 2 values / Spread
Weight scale | The scale of weight regularization. Separate for routing biases and prediction weights | 10^-6 – 10^-1 / 10^-4
Reconstruction scale | The scale of regularizing the reconstruction error | 10^-6 – 10^-1 / 5 × 10^-3
Initial standard deviation | Initial standard deviation of pose prediction weights | 10^-5 – 10^0 / 10^-2
Normalize predictions | Normalize the prediction with the Frobenius norm as in [13], with per coordinate normalization, or not at all. Item 3 in section 3.2 | 3 values / None
Gradient stop prediction | The gradient is only propagated through the pose prediction in the last cluster iteration. Item 7 in section 3.2 | Bool / False
Gradient stop activation | The gradient is only propagated through the activation in the last cluster iteration. Item 7 in section 3.2 | Bool / False
Routing bias | The initial log responsibility is learned. Else fixed to zero | Bool / True
Scale bias | Scale the preactivation with a trainable bias before the activation function. Item 6 in section 3.2 | Bool / True
Cluster bias | Add a trainable bias to the preactivation before the activation function. Item 6 in section 3.2 | Bool / True

Table 4.2: Hyperparameters. "Range" is the search range in the random search and "fixed" is the value used when optimizing the learning rate.

Method / Name | Description | Range / fixed
mfgmm: λ^(0) | Parameter prior described in section 3.1.1 | 10^-5 – 10^0 / 10^-2
mfgmm: α^(0) | Parameter prior described in section 3.1.1 | 10^-5 – 10^0 / 10^-2
mfgmm: β^(0) | Parameter prior described in section 3.1.1 | 10^-5 – 10^0 / 10^-2
mfskm: τ^(0) | Parameter prior described in section 3.1.2 | 10^-5 – 10^0 / 5 × 10^-1
mfskm: β | Parameter prior described in section 3.1.2 | 10^-5 – 10^0 / 5 × 10^-1
skm: β | Parameter prior described in section 3.1.3 | 10^-4 – 10^4 / 5 × 10^0

Table 4.4: Routing method specific hyperparameters. "Range" is the search range in the random search and "fixed" is the value used when optimizing the learning rate.

Some routing methods also have specific routing parameters, which can be seen in table 4.4. The training was stopped early if it did not exceed a dataset dependent validation accuracy after 7 epochs or if the loss diverged. The validation thresholds used were 0.8 and 0.2 for MNIST and CIFAR-10 respectively. Training is done using an Adam optimizer with TensorFlow's default β-parameters.
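As an illustration of the log-scale sampling used in the random search, the following sketch draws a learning rate uniformly in log-space over the range in table 4.2. The helper name is an assumption and not taken from the thesis code.

import numpy as np

def sample_log_uniform(low, high, rng):
    """Draw a value uniformly in log-space between low and high."""
    return float(np.exp(rng.uniform(np.log(low), np.log(high))))

rng = np.random.default_rng(0)
learning_rate = sample_log_uniform(1e-5, 1e-2, rng)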

4.2 Data preprocessing

The following section describes the data and the data preprocessing. Processed images from the training data can be seen in figures 4.1 and 4.2.

4.2.1 MNIST

The MNIST images were normalized so that the pixel values were in the range [0, 1].

4.2.2 CIFAR-10

The CIFAR-10 images were normalized so that the pixel values were in the range [0, 1]. During training, random brightness with max delta 32/255 as well as random contrast with lower bound 0.5 and upper bound 1.5 were applied. The number of training, validation and testing images can be seen in table 4.5.
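A minimal TensorFlow sketch of this augmentation, assumed to use the standard tf.image operations and not necessarily identical to the thesis code, is:

import tensorflow as tf

def augment_cifar10(image):
    """Normalize to [0, 1] and apply the stated random brightness/contrast."""
    image = tf.cast(image, tf.float32) / 255.0
    image = tf.image.random_brightness(image, max_delta=32.0 / 255.0)
    image = tf.image.random_contrast(image, lower=0.5, upper=1.5)
    return tf.clip_by_value(image, 0.0, 1.0)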

Figure 4.1: Processed images from the MNIST dataset (training images of a 4, a 5, and a 7). The unit of the coordinate system is pixels.

Figure 4.2: Processed images from the CIFAR-10 dataset (training images of an automobile, a frog, and a cat). The unit of the coordinate system is pixels.

Dataset | Training | Validation | Testing
MNIST | 55000 | 5000 | 10000
CIFAR-10 | 45000 | 5000 | 10000

Table 4.5: Number of training, validation and testing images for the different datasets.

4.3 Results

The test accuracy, execution time and the number of parameters of the decoder and encoder of the model with the best validation accuracy on MNIST can be seen in Table 4.6. Network execution time is averaged over 10 runs.

Routing method | Test accuracy | Network time (s) | Layer time (µs) | Encoder params | Decoder params
dr | 0.9956 | 3.84 ± 0.04 | 15 658 ± 424 | 10 822 686 | 1 411 344
em | 0.9927 | 4.55 ± 0.08 | 21 302 ± 361 | 11 486 260 | 1 411 344
mfgmm | 0.9939 | 4.96 ± 0.09 | 26 512 ± 232 | 11 486 270 | 1 411 344
mfskm | 0.9935 | 4.76 ± 0.07 | 23 592 ± 494 | 11 486 270 | 1 411 344
skm | 0.9912 | 4.16 ± 0.06 | 17 480 ± 682 | 11 486 260 | 1 411 344

Table 4.6: Test accuracy on the MNIST dataset. Network execution time is measured in seconds over 78 batches of size 128. Layer execution time is measured in microseconds over 1 batch of size 128.

The test accuracy, execution time and the number of parameters of the decoder and encoder of the model with the best validation accuracy on CIFAR-10 can be seen in Table 4.7. Execution time is averaged over 10 runs.

Routing method | Test accuracy | Network time (s) | Layer time (µs) | Encoder params | Decoder params
dr | 0.7154 | 7.29 ± 0.15 | 27 447 ± 239 | 11 007 508 | 3 756 544
em | 0.6371 | 8.80 ± 0.15 | 45 659 ± 233 | 11 671 082 | 3 756 544
mfgmm | 0.6915 | 8.81 ± 0.15 | 44 000 ± 545 | 11 671 102 | 3 756 544
mfskm | 0.5919 | 8.36 ± 0.14 | 37 677 ± 146 | 11 671 092 | 3 756 544
skm | 0.6504 | 8.06 ± 0.12 | 35 775 ± 496 | 11 671 102 | 3 756 544

Table 4.7: Test accuracy on the CIFAR-10 dataset. Network execution time is measured in seconds over 78 batches of size 128. Layer execution time is measured in microseconds over 1 batch of size 128.

No evaluation of the standard deviation of the test accuracy was made due to time limitations; however, a rough estimate can be made using the standard deviation of the validation accuracy of dr and skm, as seen in table 5.1. The hyperparameters used for the test evaluation can be seen in appendix A.6. The plots in figures 4.3-4.7 show the validation accuracy for different learning rates, and reveal a clear dependency. Validation accuracies of 0 and −0.1 represent a test interrupted because of slow convergence or divergence, respectively.

Figure 4.3: Learning rate grid search for dr

Figure 4.4: Learning rate grid search for em

Figure 4.5: Learning rate grid search for mfgmm

Figure 4.6: Learning rate grid search for mfskm

Figure 4.7: Learning rate grid search for skm

The plots in appendix A.7 contain the results from the random search optimization of the remaining hyperparameters. The plots reveal that the validation accuracy cannot be predicted by looking at the hyperparameters individually; in other words, no single hyperparameter is more important than all of the other parameters.

5 Conclusions

This chapter contains the method analysis, the routing method comparison and suggestions for future work.

5.1 Method analysis

A hyperparameter search will always be biased since someone decides which parameters are optimized and the priors of the variables. In this thesis all search intervals are based on what this author feels is reasonable. Normally one would decrease or increase the search intervals as one gets a better feel for which parameters performed better. But since no clear patterns were found it is hard to validate whether the parameter search was performed in the right intervals. The test here can be compared to the original paper [11], since the network structure really is quite similar. One should note that their test error on MNIST is quite a lot lower, compare 0.24% to 0.44% of the 10000 images. However, to the best knowledge of this author, no one has been able to reproduce the results of the original paper. The results on the CIFAR-10 dataset are not near the state of the art. But one should consider that the network is quite shallow, and that the dataset has a lot of variation between examples of the same class. The results in [15] and [9] are similar, and [10], [5] and [21] achieve higher accuracy using a deeper model, but are still not at the state of the art. Close to state of the art performance is achieved using ResNet or DenseNet backbones with a form of capsule output layer [20]. However, this model is not a fair comparison. Looking at the plots in appendix A.7 one can note that quite a lot of tests stopped early, either because of diverging loss or slow training, which in some part can be due to wide hyperparameter search intervals. However, no clear dependence on the hyperparameters can be found, at least not when looking at them

individually, with the exception of the learning rate. Also, the plots in figures 4.3-4.7 show that the relation between learning rate and validation accuracy is not very smooth. This is reason to attribute the variation in accuracy to sensitivity to the initialization. To test this, networks with the same hyperparameters as the best dr and skm networks were re-trained. The results can be seen in table 5.1. This shows that the networks are not hypersensitive to the initialization, at least if good hyperparameters are chosen. Note that all tests converged. This means that the hyperparameters are responsible for the divergence, and that the performance of the hyperparameters is dependent on one another. It would hence be interesting to optimize the hyperparameters using another method, which could reveal more about the relationships between good hyperparameters.

Routing method | MNIST | CIFAR-10
dr | 0.9962 ± 0.0004 | 0.7186 ± 0.0045
skm | 0.9946 ± 0.0003 | 0.6641 ± 0.0097

Table 5.1: Validation accuracy measured over 50 re-trainings.

5.2 Routing method comparison

As one can see in section 4.3, dr performs better than the rest of the methods. This is interesting, as the only thing differing from skm is that dr is using the spherical vector representation. The original proposers thought the representation would be disadvantageous, due to its inability to distinguish between an acceptable and an excellent match [5]. This could however be a perk. The inability to see the really good matches forces the network to rely on many acceptable matches. The notion that many features are important for generalization is already seen in other parts of the deep learning field. dr is also faster than the rest of the methods. However, this is attributed to the fact that the node probability and traits are modeled by the same vector, leading to fewer parameters. If this were not the case it would have the same number of operations as skm and probably similar execution time. mfgmm also has high accuracy on both datasets. It is however the slowest method. The rest of the methods have similar accuracy, execution time and number of parameters when comparing over both datasets, and no significant difference in performance can be seen.

5.3 Contributions

The contributions of this thesis are:

• Discussions of the theoretical support of some existing routing methods, as well as proposing some modifications to these methods.

• Proposing three new routing methods.
• Evaluation of the modified and proposed methods on the MNIST and CIFAR-10 datasets with respect to accuracy and execution time.

5.4 Future work

In future work it would be interesting to

• try smarter initialization, like Xavier initialization. This might help with the early stopping and the varying results.

• try deeper networks, since this might help with the results on datasets with larger intra-class variation.
• try higher dimensional vectors in the output layer. This might also help with the results on datasets with larger intra-class variation, as the implicit constraint space is probably larger than 16 dimensions.

• try other common deep learning tricks, e.g. batch normalization.
• compare the algorithms on the multi-digit MNIST dataset, as it is one of the main motivations for the relevance of capsule networks [11].
• use other hyperparameter optimization algorithms, e.g. genetic algorithms or Bayesian methods, as these might improve the results of all methods and reveal dependencies between the parameters in a good hyperparameter set.
• try capsule networks on other character datasets, e.g. Casia [12], as capsule networks perform well on MNIST.


A Appendix

Sections A.1, A.2, A.3 and A.5 contain derivations which were found important but too long to include in the report. Section A.6 contains the hyperparameters used when calculating the test accuracy. Section A.7 contains plots of validation accuracy for different hyperparameter values.


A.1 Fixed weights Gaussian mixture model expectation maximization

Continuing where we left off in Section 2.3.2, we find

Q(\theta, \theta^{(t)}) = \sum_Z p(Z \mid X, \theta^{(t)}) \log p(X, Z \mid \theta)

= \sum_{z_1 \in \{1,\dots,K\}} \sum_{z_2 \in \{1,\dots,K\}} \cdots \sum_{z_n \in \{1,\dots,K\}} \Big( \prod_{i'} \prod_{j'} c_{i',j'}^{\mathbf{1}(z_{i'}=j')} \Big) \log \Big( \prod_{i''} \prod_{j''} \big( \pi_{j''} N(x_{i''}; \mu_{j''}, \Sigma_{j''}/a_{i''}) \big)^{\mathbf{1}(z_{i''}=j'')} \Big)

= \sum_{z_1 \in \{1,\dots,K\}} \cdots \sum_{z_n \in \{1,\dots,K\}} \Big( \prod_{i'} \prod_{j'} c_{i',j'}^{\mathbf{1}(z_{i'}=j')} \Big) \sum_{i''} \sum_{j''} \mathbf{1}(z_{i''}=j'') \log\big( \pi_{j''} N(x_{i''}; \mu_{j''}, \Sigma_{j''}/a_{i''}) \big)

= \sum_{i''} \sum_{j''} \sum_{z_1 \in \{1,\dots,K\}} \cdots \sum_{z_n \in \{1,\dots,K\}} \mathbf{1}(z_{i''}=j'') \Big( \prod_{i'} \prod_{j'} c_{i',j'}^{\mathbf{1}(z_{i'}=j')} \Big) \log\big( \pi_{j''} N(x_{i''}; \mu_{j''}, \Sigma_{j''}/a_{i''}) \big)

= \sum_i \sum_j c_{i,j}^{(t)} \log\big( \pi_j N(x_i; \mu_j, \Sigma_j/a_i) \big)

= \sum_i \sum_j p\big(z_i = j \mid x, a, \pi_l^{(t)}, \mu_l^{(t)}, \Sigma_l^{(t)}\big) \log\big( p(x_i, z_i = j \mid a, \pi_l, \mu_l) \big).    (A.1)

Here the extreme decrease in the number of terms comes from the fact that the data are assumed to be independent. Differentiating w.r.t. \Sigma and \mu and setting the derivatives to zero gives

\mu_j^{(t+1)} = \frac{\sum_{i \in \mathcal{L}_l} r_{i,j}^{(t)} a_i x_i}{\sum_{i \in \mathcal{L}_l} r_{i,j}^{(t)} a_i}

\Sigma_j^{(t+1)} = \frac{\sum_{i \in \mathcal{L}_l} r_{i,j}^{(t)} a_i \big(x_i - \mu_j^{(t+1)}\big)\big(x_i - \mu_j^{(t+1)}\big)^T}{\sum_{i \in \mathcal{L}_l} r_{i,j}^{(t)} a_i}.    (A.2)

Maximizing Q(\theta, \theta^{(t)}) w.r.t. \pi under the constraint \sum_{j \in \mathcal{L}_{l+1}} \pi_j = 1 gives:

\pi_j^{(t+1)} = \frac{1}{|\mathcal{L}_l|} \sum_{i \in \mathcal{L}_l} r_{i,j}^{(t)}.    (A.3)
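As an illustration, a minimal NumPy sketch of the M-step in (A.2)-(A.3) is given below. It assumes diagonal covariance matrices and hypothetical array shapes (N points, K clusters, D dimensions); it is not the thesis implementation.

import numpy as np

def m_step(x, r, a, eps=1e-8):
    # M-step of the fixed-weights GMM, equations (A.2)-(A.3).
    # x: (N, D) data points, r: (N, K) responsibilities, a: (N,) weights a_i.
    # Diagonal covariances are an assumption of this sketch.
    ra = r * a[:, None]                         # r_ij * a_i, shape (N, K)
    denom = ra.sum(axis=0) + eps                # sum_i r_ij a_i
    mu = ra.T @ x / denom[:, None]              # (A.2), first line
    diff2 = (x[:, None, :] - mu[None, :, :]) ** 2            # (N, K, D)
    sigma = (ra[:, :, None] * diff2).sum(axis=0) / denom[:, None]  # (A.2), diagonal
    pi = r.mean(axis=0)                         # (A.3)
    return pi, mu, sigma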

A.2 DR optimization

The Lagrange multiplier solution is

L(r_{i,j}, \lambda_i) = \sum_{i \in \mathcal{L}_l} \Bigg( \sum_{j \in \mathcal{L}_{l+1}} r_{i,j}\, a_{i,j} \cos(v_{i,j}, \mu_j) - \mathrm{KL}\big(r_{i,j} \,\|\, r_{i,j}^{(t-1)}\big) - \lambda_i \Big( \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} - 1 \Big) \Bigg)

= \sum_{i \in \mathcal{L}_l} \Bigg( \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} \big( a_{i,j} d_{i,j} - \log r_{i,j} + \log r_{i,j}^{(t-1)} \big) - \lambda_i \Big( \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} - 1 \Big) \Bigg).    (A.4)

Differentiating w.r.t. r_{i,j} and \lambda_i and setting the derivatives to zero gives

r_{i,j} = \frac{\exp\big(a_{i,j} d_{i,j} - 1 + \log r_{i,j}^{(t-1)} - \lambda_i\big)}{\sum_{l \in \mathcal{L}_{l+1}} \exp\big(a_{i,l} d_{i,l} - 1 + \log r_{i,l}^{(t-1)} - \lambda_i\big)} = \frac{\exp\big(a_{i,j} d_{i,j} + \log r_{i,j}^{(t-1)}\big)}{\sum_{l \in \mathcal{L}_{l+1}} \exp\big(a_{i,l} d_{i,l} + \log r_{i,l}^{(t-1)}\big)},    (A.5)

and with b_{i,j} = \log r_{i,j} one obtains steps 4 and 7 of Algorithm 1, if one were to use \mu_j instead of v_j in step 7.
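A minimal sketch of the resulting routing update is given below (NumPy, not the thesis code). The logits b accumulate the weighted agreements a_{i,j} d_{i,j} and the routing weights are their softmax over the clusters, as in (A.5); taking d_{i,j} as the cosine between the prediction and the current cluster mean is an assumption of the sketch (Algorithm 1 uses the squashed output v_j instead of \mu_j in step 7).

import numpy as np

def softmax(b, axis=-1):
    b = b - b.max(axis=axis, keepdims=True)
    e = np.exp(b)
    return e / e.sum(axis=axis, keepdims=True)

def dr_route(u_hat, a, n_iter=3):
    # u_hat: (N, K, D) predictions, a: (N, K) weights a_ij.
    N, K, D = u_hat.shape
    b = np.zeros((N, K))                       # log r^(0), uniform prior
    for _ in range(n_iter):
        r = softmax(b, axis=1)                 # step corresponding to (A.5)
        mu = (r[:, :, None] * u_hat).sum(0) / (r.sum(0)[:, None] + 1e-8)  # (K, D)
        d = np.einsum('nkd,kd->nk', u_hat, mu)
        d /= (np.linalg.norm(u_hat, axis=2) * np.linalg.norm(mu, axis=1) + 1e-8)
        b = b + a * d                          # accumulate weighted agreement
    return softmax(b, axis=1), mu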

A.3 Mean field Gaussian mixture model

Continuing where we left off in Section 3.1.1, we find

\log q^*(z) = E_{\pi,\mu,\tau}[\log p(x, z, \pi, \mu, \tau)] + \mathrm{const}_z

= E_{\mu,\tau}[\log p(x \mid z, \mu, \tau)] + E_\pi[\log p(z \mid \pi)] + \mathrm{const}_z

= \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} z_{i,j} \Big( E_\pi[\log \pi_{i,j}] + E_{\mu,\tau}\Big[ \sum_d \log N\big(x_{i,j,d} \mid \mu_{j,d}, 1/(a_i \tau_{j,d})\big) \Big] \Big) + \mathrm{const}_z

= \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} z_{i,j} (h_{i,j} + g_{i,j}) + \mathrm{const}_z

= \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} z_{i,j} \log p_{i,j} + \mathrm{const}_z.    (A.6)

Taking the exponent and normalizing we find q^*(z) = \prod_{i \in \mathcal{L}_l} \prod_{j \in \mathcal{L}_{l+1}} r_{i,j}^{z_{i,j}}, where r_{i,j} = p_{i,j} / \sum_{j' \in \mathcal{L}_{l+1}} p_{i,j'}. Now we may note that E_{z \sim q^*(z)}[z_{i,j}] = r_{i,j}. Continuing with the other variables we find

\log q^*(\pi, \mu, \tau) = E_z[\log p(x, z, \pi, \mu, \tau)] + \mathrm{const}_{\pi,\mu,\tau}

= E_z[\log p(x \mid z, \mu, \tau) + \log p(z \mid \pi)] + \log p(\pi) + \log p(\mu \mid \tau) + \log p(\tau) + \mathrm{const}_{\pi,\mu,\tau}

= \log q^*(\pi) + \log q^*(\mu, \tau) + \mathrm{const}_{\pi,\mu,\tau}.    (A.7)

Here

\log q^*(\pi) = E_z\Big[ \log \prod_{i \in \mathcal{L}_l} \prod_{j \in \mathcal{L}_{l+1}} \pi_{i,j}^{z_{i,j}} \Big] + \log \Bigg( \prod_{i \in \mathcal{L}_l} \frac{\Gamma\big(|\mathcal{L}_{l+1}|\, \rho_i\big)}{\prod_{j \in \mathcal{L}_{l+1}} \Gamma(\rho_i)} \prod_{j \in \mathcal{L}_{l+1}} \pi_{i,j}^{\rho_{i,j}^{(0)} - 1} \Bigg) + \mathrm{const}_\pi

= \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} \big(\rho_{i,j}^{(0)} - 1\big) \log \pi_{i,j} + \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} \log \pi_{i,j} + \mathrm{const}_\pi    (A.8)

\Rightarrow \quad q^*(\pi) \propto \prod_{i \in \mathcal{L}_l} \prod_{j \in \mathcal{L}_{l+1}} \pi_{i,j}^{\rho_{i,j}^{(0)} + r_{i,j} - 1}.

This is a Dirichlet distribution, q^*(\pi_i) = \mathrm{Dir}(\rho_i^{(n)}), where \rho_{i,j}^{(n)} = \rho_{i,j}^{(0)} + r_{i,j}. Standard moments give E_{\pi \sim q^*(\pi)}[\log \pi_{i,j}] = \gamma\big(\rho_{i,j}^{(n)}\big) - \gamma\big(\sum_{j \in \mathcal{L}_{l+1}} \rho_{i,j}^{(n)}\big). Lastly, we note

\log q^*(\mu, \tau) = E_z[\log p(x \mid z, \mu, \tau)] + \log p(\mu \mid \tau) + \log p(\tau) + \mathrm{const}_{\mu,\tau}

= \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} \sum_d E[z_{i,j}] \log N\big(x_{i,j,d} \mid \mu_{j,d}, 1/(a_i \tau_{j,d})\big) + \sum_{j \in \mathcal{L}_{l+1}} \sum_d \log\Big( N\big(\mu_{j,d} \mid m_{j,d}^{(0)}, 1/(\tau_{j,d} \lambda_{j,d}^{(0)})\big)\, \Gamma\big(\tau_{j,d} \mid \alpha_{j,d}^{(0)}, \beta_{j,d}^{(0)}\big) \Big) + \mathrm{const}_{\mu,\tau}

= \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} \sum_d r_{i,j} \Big( \tfrac{1}{2} \log \tau_{j,d} - \tfrac{a_i \tau_{j,d}}{2} (x_{i,j,d} - \mu_{j,d})^2 \Big) + \sum_{j \in \mathcal{L}_{l+1}} \sum_d \Big( \big(\alpha_{j,d}^{(0)} - \tfrac{1}{2}\big) \log \tau_{j,d} - \tfrac{\lambda_{j,d}^{(0)} \tau_{j,d}}{2} \big(\mu_{j,d} - m_{j,d}^{(0)}\big)^2 - \beta_{j,d}^{(0)} \tau_{j,d} \Big) + \mathrm{const}_{\mu,\tau},    (A.9)

and by sorting the terms we find q^*(\mu_{j,d}, \tau_{j,d}) \sim N\Gamma^{-1}\big(m_{j,d}^{(n)}, \lambda_{j,d}^{(n)}, \alpha_{j,d}^{(n)}, \beta_{j,d}^{(n)}\big), where

\lambda_j^{(n)} = \sum_{i \in \mathcal{L}_l} r_{i,j} a_i + \lambda_j^{(0)}

m_j^{(n)} = \frac{\sum_{i \in \mathcal{L}_l} r_{i,j} a_i x_{i,j} + \lambda_j^{(0)} m_j^{(0)}}{\lambda_j^{(n)}}

\beta_j^{(n)} = \frac{\sum_{i \in \mathcal{L}_l} r_{i,j} a_i x_{i,j}^2 + \lambda_j^{(0)} m_j^{(0)\,2} - \lambda_j^{(n)} m_j^{(n)\,2}}{2} + \beta_j^{(0)}

\alpha_j^{(n)} = \alpha_j^{(0)} + \frac{1}{2} \sum_{i \in \mathcal{L}_l} r_{i,j}.    (A.10)

For the calculation of \log q^*(z) we need

E_{\mu,\tau}\big[\log N\big(x_i \mid \mu_j, 1/(a_i \tau_j)\big)\big] = \sum_d E_{\mu,\tau}\Big[ \tfrac{1}{2} \log(a_i \tau_{j,d}/2\pi) - \tfrac{a_i \tau_{j,d} (x_{i,d} - \mu_{j,d})^2}{2} \Big]

= \tfrac{D}{2} \log(a_i/2\pi) + \sum_d E_{\mu_d,\tau_d}\Big[ \tfrac{1}{2} \log \tau_{j,d} - \tfrac{a_i \tau_{j,d} x_{i,d}^2}{2} + a_i \tau_{j,d} x_{i,d} \mu_{j,d} - \tfrac{a_i \tau_{j,d} \mu_{j,d}^2}{2} \Big]

= \{\text{from standard moments}\}

= \tfrac{D}{2} \log(a_i/2\pi) + \sum_d \Bigg( \frac{\gamma(\alpha_{j,d}) - \log \beta_{j,d}}{2} - \frac{a_i \alpha_{j,d} x_{i,d}^2}{2\beta_{j,d}} + a_i x_{i,d} m_{j,d} \frac{\alpha_{j,d}}{\beta_{j,d}} - \frac{a_i \tfrac{1}{\lambda_{j,d}} + a_i \tfrac{m_{j,d}^2 \alpha_{j,d}}{\beta_{j,d}}}{2} \Bigg)

= \tfrac{D}{2} \log(a_i/2\pi) + \sum_d \Bigg( \frac{\gamma(\alpha_{j,d}) - \log \beta_{j,d}}{2} - \frac{a_i \alpha_{j,d}}{2\beta_{j,d}} (x_{i,d} - m_{j,d})^2 - \frac{a_i}{2\lambda_{j,d}} \Bigg).    (A.11)

We will also be interested in

E_\mu[\mu] = m    (A.12)

E_\tau[\tau] = \frac{\alpha}{\beta}.    (A.13)

Now we make a few simplifications. Since r_{i,j} = \mathrm{softmax}_j(h_{i,j} + g_{i,j}) we may remove any constants not depending on j. We may thus remove the term \tfrac{D}{2}\log(a_i/2\pi) from (A.11), at least if specific activation is not used, see table 4.3. Assuming m^{(0)} = 0 simplifies the calculations of m^{(n)} and \beta^{(n)}. All in all this results in Algorithm 6, which can be seen in Section 3.1.1.
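For reference, a minimal NumPy sketch of one such mean field update is given below, following (A.8), (A.10) and the simplified (A.11) with m^{(0)} = 0. Scalar priors, the interpretation of \gamma(\cdot) as the digamma function and the array shapes are assumptions of the sketch; Algorithm 6 in Section 3.1.1 remains the authoritative description.

import numpy as np
from scipy.special import digamma

def mfgmm_step(x, r, a, alpha0, beta0, lam0, rho0):
    # x: (N, K, D) predictions x_ij, r: (N, K) responsibilities, a: (N,) activations.
    ra = r * a[:, None]                                     # (N, K)
    lam_n = ra.sum(0)[:, None] + lam0                       # (A.10), shape (K, 1)
    m_n = np.einsum('nk,nkd->kd', ra, x) / lam_n            # (A.10) with m0 = 0
    beta_n = 0.5 * (np.einsum('nk,nkd->kd', ra, x ** 2) - lam_n * m_n ** 2) + beta0
    alpha_n = alpha0 + 0.5 * r.sum(0)[:, None]              # (K, 1)

    # Expected log-likelihood (A.11), without the constant D/2 log(a_i / 2 pi)
    h = (0.5 * (digamma(alpha_n) - np.log(beta_n))[None]
         - 0.5 * a[:, None, None] * (alpha_n / beta_n)[None] * (x - m_n[None]) ** 2
         - 0.5 * a[:, None, None] / lam_n[None]).sum(-1)    # sum over d -> (N, K)

    # Expected log-prior from the Dirichlet posterior (A.8)
    rho_n = rho0 + r
    g = digamma(rho_n) - digamma(rho_n.sum(1, keepdims=True))

    logits = h + g
    logits -= logits.max(1, keepdims=True)
    r_new = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
    return r_new, m_n, alpha_n / beta_n                     # E[mu], E[tau] as in (A.12)-(A.13)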

A.4 Mean field soft k-means

Continuing where we left off in Section 3.1.2, we begin by estimating the log density functions of z and \pi as in Appendix A.3, by

\log q^*(z) = \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} z_{i,j} \Big( E_\pi[\log \pi_{i,j}] + E_\mu\Big[ \sum_d \log N\big(x_{i,d} \mid \mu_{j,d}, 1/(a_i \beta)\big) \Big] \Big) + \mathrm{const}_z

= \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} z_{i,j} (h_{i,j} + g_{i,j})    (A.14)

and

q^*(\pi_i) \sim \mathrm{Dir}\big(\rho_{i,j}^{(0)} + r_{i,j}\big), \quad \text{where} \quad r_{i,j} = E_{z \sim q^*(z)}[z_{i,j}].    (A.15)

But we now find

\log q^*(\mu) = E_z[\log p(x \mid z, \mu)] + \log p(\mu) + \mathrm{const}_\mu

= \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} \sum_d E[z_{i,j}] \log N\big(x_{i,j,d} \mid \mu_{j,d}, 1/(\beta a_i)\big) + \sum_{j \in \mathcal{L}_{l+1}} \sum_d \log N\big(\mu_{j,d} \mid m_{j,d}^{(0)}, 1/\tau_{j,d}^{(0)}\big) + \mathrm{const}_\mu

= \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} \sum_d r_{i,j} \Big( \tfrac{1}{2} \log(\beta a_i/2\pi) - \tfrac{a_i \beta}{2} (x_{i,j,d} - \mu_{j,d})^2 \Big) + \sum_{j \in \mathcal{L}_{l+1}} \sum_d \Big( \tfrac{1}{2} \log(\tau_{j,d}/2\pi) - \tfrac{\tau_{j,d}}{2} \big(\mu_{j,d} - m_{j,d}^{(0)}\big)^2 \Big) + \mathrm{const}_\mu.    (A.16)

Sorting the terms we find q^*(\mu_{j,d}) \sim N\big(m_{j,d}^{(n)}, \tau_{j,d}^{(n)}\big), where

\tau_{j,d}^{(n)} = \beta \sum_{i \in \mathcal{L}_l} a_i r_{i,j} + \tau_{j,d}^{(0)}

m_{j,d}^{(n)} = \frac{m_{j,d}^{(0)} \tau_{j,d}^{(0)} + \beta \sum_{i \in \mathcal{L}_l} r_{i,j} a_i x_{i,j,d}}{\tau_{j,d}^{(n)}}.    (A.17)

Here

g_{i,j} = E_{\pi \sim q^*(\pi)}[\log \pi_{i,j}] = \gamma\big(\rho_{i,j}^{(n)}\big) - \gamma\Big( \sum_{j \in \mathcal{L}_{l+1}} \rho_{i,j}^{(n)} \Big)    (A.18)

and

h_{i,j} = E_\mu\Big[ \sum_d \log N\big(x_{i,j,d} \mid \mu_{j,d}, 1/(\beta a_i)\big) \Big]

= \sum_d E_\mu\Big[ \tfrac{1}{2} \log \tfrac{\beta a_i}{2\pi} - \tfrac{\big(x_{i,j,d}^2 - 2 x_{i,j,d} \mu_{j,d} + \mu_{j,d}^2\big) \beta a_i}{2} \Big]

= \sum_d \Big( \tfrac{1}{2} \log \tfrac{\beta a_i}{2\pi} - \tfrac{\big((x_{i,j,d} - m_{j,d})^2 + 1/\tau_{j,d}\big) \beta a_i}{2} \Big).    (A.19)

As in Appendix A.3 we may ignore the \tfrac{1}{2}\log\tfrac{\beta a_i}{2\pi} term and assume m^{(0)} = 0, which leads to simplifications. All in all this results in Algorithm 7, which can be seen in Section 3.1.2.
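A minimal NumPy sketch of one mean field soft k-means update, following (A.17)-(A.19) with m^{(0)} = 0, is given below. Scalar priors, the digamma interpretation of \gamma(\cdot) and the array shapes are assumptions of the sketch; Algorithm 7 in Section 3.1.2 remains the authoritative description.

import numpy as np
from scipy.special import digamma

def mfskm_step(x, r, a, beta, tau0, rho0):
    # x: (N, K, D) predictions, r: (N, K) responsibilities, a: (N,) activations.
    ra = r * a[:, None]                                   # (N, K)
    tau_n = beta * ra.sum(0)[:, None] + tau0              # (A.17), shape (K, 1)
    m_n = beta * np.einsum('nk,nkd->kd', ra, x) / tau_n   # (A.17) with m0 = 0

    # (A.19), without the 1/2 log(beta a_i / 2 pi) constant
    h = -0.5 * beta * a[:, None] * (
        ((x - m_n[None]) ** 2 + 1.0 / tau_n[None]).sum(-1))

    # (A.18)
    rho_n = rho0 + r
    g = digamma(rho_n) - digamma(rho_n.sum(1, keepdims=True))

    logits = h + g
    logits -= logits.max(1, keepdims=True)
    r_new = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
    return r_new, m_n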

A.5 Soft k-means

Continuing where we left off in Section 3.1.3, we find

L(r, \mu, \lambda_i) = \sum_{i \in \mathcal{L}_l} \Bigg( \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} \big( a_{i,j} d_{i,j} + \log r_{i,j} - \log r_{i,j}^{0} \big) + \lambda_i \Big( \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} - 1 \Big) \Bigg).    (A.20)

Differentiating w.r.t. r_{i,j} and \lambda_i and setting the derivatives to zero gives

a_{i,j} d_{i,j} + \log r_{i,j} - \log r_{i,j}^{0} + 1 + \lambda_i = 0, \quad i \in \mathcal{L}_l,\; j \in \mathcal{L}_{l+1}, \quad \text{and} \quad \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} = 1.    (A.21)

Letting

r_{i,j} = e^{-a_{i,j} d_{i,j} + \log r_{i,j}^{0} - 1 - \lambda_i}    (A.22)

solves the first part of equation (A.21). Letting \lambda_i = \log\big( \sum_{j \in \mathcal{L}_{l+1}} e^{-a_{i,j} d_{i,j} + \log r_{i,j}^{0}} \big) - 1 solves the second part as well, since then

r_{i,j} = \frac{e^{-a_{i,j} d_{i,j} + \log r_{i,j}^{0}}}{\sum_{l \in \mathcal{L}_{l+1}} e^{-a_{i,l} d_{i,l} + \log r_{i,l}^{0}}}.    (A.23)

Differentiating w.r.t. \mu and setting the derivative to zero gives the \mu update. The resulting Algorithm 8 can be seen in Section 3.1.3.
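A minimal NumPy sketch of the resulting soft k-means routing loop is given below. The r update follows (A.23); the \mu update is sketched as the a-and-r-weighted mean, which is the stationary point for a squared Euclidean distance and is an assumption here, since the exact update is given by Algorithm 8.

import numpy as np

def skm_route(x, a, d_fn, n_iter=3):
    # x: (N, K, D) predictions, a: (N, K) weights a_ij,
    # d_fn(x, mu) -> (N, K) distances between predictions and cluster centres.
    N, K, D = x.shape
    log_r = np.full((N, K), -np.log(K))                   # uniform r^0
    for _ in range(n_iter):
        w = np.exp(log_r) * a
        mu = np.einsum('nk,nkd->kd', w, x) / (w.sum(0)[:, None] + 1e-8)
        logits = log_r - a * d_fn(x, mu)                  # (A.23)
        logits -= logits.max(1, keepdims=True)
        r = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
        log_r = np.log(r + 1e-12)
    return r, mu

# e.g. squared Euclidean distance between predictions and cluster means
sq_dist = lambda x, mu: ((x - mu[None]) ** 2).sum(-1)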

A.6 Hyper parameters for test accuracy evaluation

Method / Hyper parameter       MNIST           CIFAR-10

dr
  Learning rate                1.74 × 10^-3    2.09 × 10^-4
  Number of iterations         2               2
  Data loss type               Margin          Margin
  Weight scale routing         3.78 × 10^-5    3.15 × 10^-5
  Weight scale prediction      1.20 × 10^-5    8.59 × 10^-5
  Reconstruction scale         5.02 × 10^-4    2.20 × 10^-6
  Initial standard derivation  1.20 × 10^-3    4.17 × 10^-4
  Normalize predictions        None            None
  Gradient stop prediction     True            False
  Inner activations            True            True
  Addative bias                True            True
  Scale bias                   True            False
  Routing bias learned         True            True
  Regularize log prior         Previous        Previous

em
  Learning rate                1.91 × 10^-4    3.98 × 10^-5
  Number of iterations         2               3
  Data loss type               Margin          Margin
  Weight scale routing         4.66 × 10^-4    7.27 × 10^-6
  Weight scale prediction      3.54 × 10^-5    5.72 × 10^-4
  Reconstruction scale         2.18 × 10^-5    7.11 × 10^-5
  Initial standard derivation  9.32 × 10^-5    1.14 × 10^-3
  Normalize predictions        None            Matrix
  Activation function          Logistic        Tanh
  Gradient stop prediction     False           True
  Gradient stop activation     False           True
  Inner activations            False           True
  Addative bias                False           True
  Scale bias                   True            False
  Punish standard derivation   True            False
  Routing bias learned         True            False

mfskm
  Learning rate                1.20 × 10^-3    9.55 × 10^-4
  Number of iterations         3               2
  Data loss type               Spread          Margin
  Weight scale routing         4.20 × 10^-3    1.50 × 10^-5
  Weight scale prediction      2.57 × 10^-6    7.17 × 10^-4
  Reconstruction scale         9.04 × 10^-5    5.12 × 10^-5
  Initial standard derivation  1.86 × 10^-3    7.70 × 10^-2
  Normalize predictions        None            Std
  Activation function          Logistic        Tanh
  Specific activation          False           False
  Gradient stop prediction     False           True
  Gradient stop activation     True            True
  Addative bias                True            False
  Scale bias                   True            True
  Punish standard derivation   False           False
  Routing bias learned         False           True
  Prior τ                      1.76 × 10^-1    7.40 × 10^-3
  Prior β                      1.49 × 10^-1    1.80 × 10^-2

mfgmm
  Learning rate                5.75 × 10^-4    1.20 × 10^-4
  Number of iterations         2               2
  Data loss type               Margin          Spread
  Weight scale routing         7.75 × 10^-2    1.07 × 10^-4
  Weight scale prediction      2.67 × 10^-5    4.55 × 10^-6
  Reconstruction scale         7.24 × 10^-3    2.73 × 10^-5
  Initial standard derivation  4.78 × 10^-3    1.66 × 10^-3
  Normalize predictions        None            Matrix
  Activation function          Tanh            Logistic
  Gradient stop prediction     False           True
  Gradient stop activation     False           True
  Addative bias                True            True
  Scale bias                   True            True
  Punish standard derivation   True            False
  Routing bias learned         True            True
  Prior α                      9.80 × 10^-2    1.61 × 10^-1
  Prior λ                      6.70 × 10^-2    3.37 × 10^-1
  Prior β                      8.54 × 10^-2    2.94 × 10^-2

skm
  Learning rate                1.20 × 10^-3    1.58 × 10^-4
  Number of iterations         2               3
  Data loss type               Margin          Spread
  Weight scale routing         3.71 × 10^-5    3.92 × 10^-3
  Weight scale prediction      4.09 × 10^-5    2.83 × 10^-5
  Reconstruction scale         8.42 × 10^-4    1.31 × 10^-5
  Initial standard derivation  4.93 × 10^-3    1.62 × 10^-5
  Normalize predictions        Std             Std
  Activation function          Logistic        Logistic
  Gradient stop prediction     False           False
  Gradient stop activation     True            False
  Inner activations            True            False
  Addative bias                False           True
  Scale bias                   True            True
  Routing bias learned         True            True
  Regularize log prior         Previous        Simple
  β                            1.56 × 10^-4    9.88 × 10^1

Table A.1: Hyper parameters used to get the results in table 4.6.

A.7 Validation

This section contains validation accuracy for different hyperparameter values, shown in figures A.1-A.78. Each graph contains the validation accuracy for 50 different values of one hyperparameter. Note that the other hyperparameters were varied as well, since random search was used. For Boolean parameters, 0 is false and 1 is true. For multiple choice parameters, the options are assigned a value which is clearly marked in the graph. Continuous parameters are plotted on a logarithmic scale. The figure caption states which hyperparameter is plotted. If the training stopped early due to a slow increase in validation accuracy, the validation accuracy in the figures is fixed to 0. If the training stopped early due to exploding loss, the validation accuracy in the figures is fixed to -0.1. The number of tests which succeeded and which stopped early are marked in the figures.
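For clarity, the sketch below illustrates the search convention used for these plots: continuous hyperparameters are sampled log-uniformly and early-stopped runs are recorded with the sentinel accuracies 0 and -0.1. The interval bounds, the parameter names and the train_and_validate placeholder are hypothetical; this is not the thesis code.

import numpy as np

rng = np.random.default_rng(0)

def sample_log_uniform(low, high):
    # Continuous hyperparameters are drawn log-uniformly (hence the log-scale plots).
    return float(np.exp(rng.uniform(np.log(low), np.log(high))))

results = []
for _ in range(50):
    params = {
        "learning_rate": sample_log_uniform(1e-5, 1e-2),        # hypothetical bounds
        "weight_scale_routing": sample_log_uniform(1e-6, 1e-1),
        "n_iterations": int(rng.integers(1, 4)),
    }
    # acc = train_and_validate(params)   # placeholder for one training run
    acc = float(rng.uniform(0.9, 1.0))   # dummy value so the sketch runs
    # Early-stopped runs would instead be stored with the sentinel values:
    # acc = 0.0 if training was too slow, acc = -0.1 if the loss exploded.
    results.append((params, acc))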

Figure A.1: Addative cluster bias, dr

Figure A.2: Beta update type, dr

Figure A.3: Stop gradient of activation in inner loops, dr

Figure A.4: Stop gradient of prediction in inner loops, dr

Figure A.5: Activation function is applied in inner loops, dr

Figure A.6: Number of iterations, dr

Figure A.7: Loss type, dr

Figure A.8: Normalize predictions, dr

Figure A.9: Scale of prediction weight regularizer, dr

Figure A.10: Scale of reconstruction loss, dr

Figure A.11: Train the routing bias, dr

Figure A.12: Scale of the routing weights regularizer, dr

Figure A.13: Multiplicative cluster bias, dr

Figure A.14: Standard derivation of initialized weights, dr

Figure A.15: Activation function, em

Figure A.16: Addative cluster bias, em

Figure A.17: Stop gradient of activation in inner loops, em

Figure A.18: Stop gradient of prediction in inner loops, em

Figure A.19: Activation function is applied in inner loops, em

Figure A.20: Number of iterations, em

Figure A.21: Loss type, em

Figure A.22: Normalize predictions, em

Figure A.23: Scale of prediction weight regularizer, em

Figure A.24: Punish high standard derivation, em

Figure A.25: Scale of reconstruction loss, em

Figure A.26: Train the routing bias, em

Figure A.27: Scale of the routing weights regularizer, em

Figure A.28: Multiplicative cluster bias, em

Figure A.29: Standard derivation of initialized weights, em

Figure A.30: Activation function, mfgmm

Figure A.31: Addative cluster bias, mfgmm

Figure A.32: Stop gradient of activation in inner loops, mfgmm

Figure A.33: Stop gradient of prediction in inner loops, mfgmm

Figure A.34: Number of iterations, mfgmm

Figure A.35: Loss type, mfgmm

Figure A.36: Normalize predictions, mfgmm

Figure A.37: Scale of prediction weight regularizer, mfgmm

Figure A.38: Punish high standard derivation, mfgmm

Figure A.39: Scale of reconstruction loss, mfgmm

Figure A.40: Train the routing bias, mfgmm

Figure A.41: Scale of the routing weights regularizer, mfgmm

Figure A.42: Multiplicative cluster bias, mfgmm

Figure A.43: Standard derivation of initialized weights, mfgmm

Figure A.44: α(0) in algorithm 6, mfgmm

Figure A.45: β(0) in algorithm 6, mfgmm

Figure A.46: λ(0) in algorithm 6, mfgmm

Figure A.47: Activation function, mfskm

Figure A.48: Addative cluster bias, mfskm

Figure A.49: Stop gradient of activation in inner loops, mfskm

Figure A.50: Stop gradient of prediction in inner loops, mfskm

Figure A.51: Number of iterations, mfskm

Figure A.52: Loss type, mfskm

Figure A.53: Normalize predictions, mfskm

Figure A.54: Scale of prediction weight regularizer, mfskm

Figure A.55: Punish high standard derivation, mfskm

Figure A.56: Scale of reconstruction loss, mfskm

Figure A.57: Train the routing bias, mfskm

Figure A.58: Scale of the routing weights regularizer, mfskm

Figure A.59: Multiplicative cluster bias, mfskm

Figure A.60: Standard derivation of initialized weights, mfskm

Figure A.61: β(0) in algorithm 7, mfskm

Figure A.62: τ(0) in algorithm 7, mfskm

Figure A.63: Activation function, skm

Figure A.64: Addative cluster bias, skm

Figure A.65: Beta update type, skm

Figure A.66: Stop gradient of activation in inner loops, skm

Figure A.67: Stop gradient of prediction in inner loops, skm

Figure A.68: Activation function is applied in inner loops, skm

Figure A.69: Number of iterations, skm

Figure A.70: Loss type, skm

Figure A.71: Normalize predictions, skm

Figure A.72: Scale of prediction weight regularizer, skm

Figure A.73: Scale of reconstruction loss, skm

Figure A.74: Train the routing bias, skm

Figure A.75: Scale of the routing weights regularizer, skm

Figure A.76: Multiplicative cluster bias, skm

Figure A.77: Standard derivation of initialized weights, skm

Figure A.78: β in algorithm 8, skm

Bibliography

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. Cited on page 12.

[2] Pedro F. Felzenszwalb, Ross B. Girshick, David A. McAllester, and Deva Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32:1627–1645, 2009. Cited on page 2.

[3] M. A. Fischler and R. A. Elschlager. The representation and matching of pictorial structures. IEEE Trans. Comput., 22(1):67–92, January 1973. ISSN 0018-9340. doi: 10.1109/T-C.1973.223602. URL https://doi.org/10.1109/T-C.1973.223602. Cited on page 2.

[4] I. D. Gebru, X. Alameda-Pineda, F. Forbes, and R. Horaud. EM algorithms for weighted-data clustering with application to audio-visual scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(12):2402–2415, Dec 2016. ISSN 0162-8828. doi: 10.1109/TPAMI.2016.2522425. Cited on page 10.

[5] Geoffrey E. Hinton, Sara Sabour, and Nicholas Frosst. Matrix capsules with EM routing. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HJWLfGWRb. Cited on pages 1, 2, 3, 8, 14, 15, 23, 24, 26, 28, 35, and 36.

[6] Geoffrey E. Hinton. A parallel computation that assigns canonical object-based frames of reference. In Proceedings of the 7th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI'81, pages 683–685, San Francisco, CA, USA, 1981. Morgan Kaufmann Publishers Inc. URL http://dl.acm.org/citation.cfm?id=1623264.1623282. Cited on page 1.

[7] Jiawei He and Huadong Liao. CapsLayer: An advanced library for capsule theory. http://naturomics.com/CapsLayer, 2017. Cited on page 25.

[8] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009. Cited on pages iii, 3, and 25.


[9] Prem Qu Nair, Rohan Doshi, and Stefan Keselj. Pushing the limits of capsule networks. 2018. Cited on page 35.

[10] Sai Samarth R. Phaye, Apoorva Sikka, Abhinav Dhall, and Deepti R. Bathula. Dense and diverse capsule networks: Making the capsules learn better. CoRR, abs/1805.04001, 2018. URL http://arxiv.org/abs/1805.04001. Cited on page 35.

[11] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. CoRR, abs/1710.09829, 2017. URL http://arxiv.org/abs/1710.09829. Cited on pages iii, 1, 2, 3, 7, 8, 9, 12, 13, 23, 24, 26, 28, 35, and 37.

[12] Da-Han Wang, Cheng-Lin Liu, Jin-Lun Yu, and Xiang-Dong Zhou. CASIA-OLHWDB1: A database of online handwritten Chinese characters. 2009 10th International Conference on Document Analysis and Recognition, pages 1206–1210, 2009. Cited on page 37.

[13] Dilin Wang and Qiang Liu. An optimization view on dynamic routing between capsules, 2018. URL https://openreview.net/forum?id=HJjtFYJDf. Cited on pages 13, 14, 23, 24, and 28.

[14] Markus Weber. Unsupervised Learning of Models for Object Recognition. PhD thesis, Pasadena, CA, USA, 2000. AAI9972646. Cited on page 2.

[15] Edgar Xi, Selina Bing, and Yang Jin. Capsule network performance on complex data. 2017. URL https://arxiv.org/pdf/1712.03480.pdf. Cited on page 35.

[16] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. The MNIST database of handwritten digits. URL http://yann.lecun.com/exdb/mnist/. Cited on pages iii, 3, and 25.

[17] Richard S. Zemel and Geoffrey E. Hinton. Discovering viewpoint-invariant relationships that characterize objects. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 299–305. Morgan-Kaufmann, 1991. URL http://papers.nips.cc/paper/329-discovering-viewpoint-invariant-relationships-that-characterize-objects.pdf. Cited on page 2.

[18] Richard S. Zemel and Geoffrey E. Hinton. Learning population codes by minimizing description length. Neural Computation, 7(3):549–564, 1995. doi: 10.1162/neco.1995.7.3.549. URL https://app.dimensions.ai/details/publication/pub.1016522705 and http://www.cs.utoronto.ca/~hinton/absps/mdlpop.pdf. Exported from https://app.dimensions.ai on 2018/12/07. Cited on page 9.

[19] Richard S. Zemel, Michael C. Mozer, and Geoffrey E. Hinton. Traffic: Recognizing objects using hierarchical reference frame transformations. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 266–273. Morgan-Kaufmann, 1990. URL http://papers.nips.cc/paper/241-traffic-recognizing-objects-using-hierarchical-reference-frame-transformations.pdf. Cited on page 2.

[20] Liheng Zhang, Marzieh Edraki, and Guo-Jun Qi. CapProNet: Deep feature learning via orthogonal projections onto capsule subspaces. CoRR, abs/1805.07621, 2018. URL http://arxiv.org/abs/1805.07621. Cited on page 35.

[21] Suofei Zhang, Wei Zhao, Xiaofu Wu, and Quan Zhou. Fast dynamic routing based on weighted kernel density estimation. CoRR, abs/1805.10807, 2018. URL http://arxiv.org/abs/1805.10807. Cited on pages 2, 8, 16, and 35.