Master of Science Thesis in Electrical Engineering Department of Electrical Engineering, Linköping University, 2019

A Comparative Study of Routing Methods in Capsule Networks

Christoffer Malmgren

LiTH-ISY-EX--19/5188--SE

Supervisors: M.Sc. Gustav Häger, ISY, Linköpings universitet; Ph.D. Erik Ringaby, SICK Linköping
Examiner: Prof. Michael Felsberg, ISY, Linköpings universitet

Computer Vision Laboratory Department of Electrical Engineering Linköping University SE-581 83 Linköping, Sweden

Copyright © 2019 Christoffer Malmgren

Abstract

Recently, the deep neural network structure caps-net was proposed by Sabour et al. [11]. Capsule networks are designed to learn the relative geometry between the features of one layer and the features of the next layer. The capsule network's main building blocks are capsules, which are represented by vectors. The idea is that each capsule will represent a feature as well as traits or subfeatures of that feature. This allows for smart information routing. Capsule traits are used to predict the traits of the capsules in the next layer, and information is sent to the next-layer capsules on whose predictions they agree. This is called routing by agreement. This thesis investigates the theoretical support of new and existing routing algorithms and evaluates their performance on the MNIST [16] and CIFAR-10 [8] datasets. A variation of the dynamic routing algorithm presented in the original paper [11] achieved the highest accuracy and fastest execution time.


Acknowledgments

I want to thank SICK Linköping for the opportunity to make this thesis. Thanks to my examiner Michael Felsberg and supervisors Erik Ringaby and Gustav Häger for their support and helpful reviews. I also want to thank all colleagues at SICK Linköping for showing interest in my thesis and providing insight as well as interesting discussions. Finally, I want to thank my family and friends.

Linköping, May, 2019 Christoffer Malmgren


Contents

Notation

1 Introduction
  1.1 Background
  1.2 Problem formulation
  1.3 Previous work
  1.4 Limitations

2 Theoretical background
  2.1 Neural networks
  2.2 Capsule networks
  2.3 Clustering
    2.3.1 K-means
    2.3.2 Expectation maximization
    2.3.3 Kernel density estimation
    2.3.4 Variational Bayesian methods
  2.4 Routing in current capsule networks
    2.4.1 Dynamic routing
    2.4.2 Optimized dynamic routing
    2.4.3 EM-routing
    2.4.4 Fast dynamic routing based on weighted kernel density estimation

3 Model design
  3.1 Novel routing algorithms
    3.1.1 Mean field Gaussian mixture model
    3.1.2 Mean field soft k-means
    3.1.3 Soft k-means routing
  3.2 Practical modifications
  3.3 Existing algorithms

4 Experiments
  4.1 Implementation
  4.2 Data preprocessing
    4.2.1 MNIST
    4.2.2 CIFAR-10
  4.3 Results

5 Conclusions
  5.1 Method analysis
  5.2 Routing method comparison
  5.3 Contributions
  5.4 Future work

A Appendix
  A.1 Fixed weights Gaussian mixture model expectation maximization
  A.2 DR optimization
  A.3 Mean field Gaussian mixture model
  A.4 Mean field soft k-means
  A.5 Soft k-means
  A.6 Hyper parameters for test accuracy evaluation
  A.7 Validation

Bibliography

Notation

Typesetting

Notation | Meaning
x | A scalar
x (bold) | A 2D vector
X | A multidimensional matrix or tensor
X (calligraphic) | A set
x_{i,...,j} | Indexing or extraction of a matrix value
x(t) | The value of x at time or iteration t
∀ | For all

Operators and functions

Notation | Meaning
Σ | Sum
Π | Product
∫ | Integral
⟨·, ·⟩ | Scalar product
‖·‖_p | The p-norm. If p is not given then p = 2 is assumed
← | Assignment
KL(·, ·) | Kullback–Leibler divergence
N(µ, σ) | Normal distribution with mean µ and standard deviation σ
Γ(α, β) | Gamma distribution with shape α and rate β
p(x, y) | The joint probability of x and y
p(x|y) | The conditional probability of x given y
log | The natural logarithm


Default variable usage

Notation | Meaning
a | Activation of a capsule or node
b | Bias
d | Distance function/measurement or dimension index
f | Function
i, j, k, l, n, t | Indices
m | Mean
p | Probability function
r | Responsibility, variable describing association
u | Capsule traits
v | Predicted capsule traits
w | Weights
x | Data points. In the context of routing the same as v
α | Scale
β | Rate
ε | Small positive number
θ | Parameters
λ | Precision scaling in the normal-gamma distribution
µ | Mean
π | Probability of an event or Archimedes' constant
ρ | Concentration parameters
σ | Standard deviation
τ | Precision
const_x | Constant value w.r.t. x

Abbreviations

Abbreviation | Meaning
gmm | Gaussian mixture model
em | Expectation maximization or EM-routing, context dependent
gmm-em | Expectation maximization of a Gaussian mixture model
cnn | Convolutional neural network
caps-net | Capsule neural network
sgd | Stochastic gradient descent
frms | Fast routing with mean shift
frem | Fast routing with expectation maximization
mfgmm | Mean field Gaussian mixture model
mfskm | Mean field soft k-means
dr | Dynamic routing
skm | Soft k-means

1 Introduction

1.1 Background

The problem considered in this thesis is image classification, a computer vision task in which deep learning has become an increasingly important tool. Recently, two papers [5, 11] were published which point out that convolutional neural networks (cnns), the cutting edge of visual deep learning, have large limitations when it comes to information routing. Routing describes how to redirect the information flow in a network, breaking the standard network structure. Routing in cnns is usually solved by max-pooling, a method that essentially throws away information, since only the node with the maximum activation is considered. What is more, the position of a feature is encoded by node position until the final layers of a cnn. This means that cnns will have trouble learning the relative geometry between objects, which is useful when determining the whole from the parts [11]. Sabour et al. propose a new model: capsule networks (caps-net). In a capsule network nodes are grouped into capsules. The goal is to have every capsule, rather than node, represent a feature. Consequently, each feature will be represented by a vector rather than a scalar, and will be able to indicate not only the probability of the feature being present, but also the feature pose or traits. Here traits refer to subfeatures or characteristics of the encoded feature; though motivated by their ability to encode position and orientation of the feature, the traits might encode any properties of the feature. This means that already at a fairly low layer feature position information may be moved from capsule position to capsule activation. Capsule networks may thus learn reference frames of objects and features [5], which is something humans do [6]. Representing capsules with vectors also opens the possibility for smarter routing. Information is sent by the capsules which agree on the traits of the capsule in the next layer,

which is called routing by agreement [11]. This idea is quite old and can be seen in [19] and [17]. The idea of learning relative geometry of parts is also shared with constellation models and deformable parts models [2, 3, 14].

1.2 Problem formulation

To implement routing by agreement one has to define the meaning of agreement. Similar capsule vectors are a good start, but what distance measure should be used? The goal is to have a capsule spread its activation and trait information to the next layer's capsules on which its trait prediction coincides with the other capsules' predictions. This can be seen as a soft clustering problem, where the lower layer capsules are data points and the higher layer capsules are cluster prototypes. Clustering problems can be solved in multiple ways, and when evaluating a solution one has to consider how well it fits a neural network structure. One has to consider e.g. execution time, how to handle activations, energy propagation and gradient flow. The goals of this thesis are to:

– Analyze routing methods with respect to their theoretical motivation as a clustering method as well as how they need to be modified to fit a network structure.

– Evaluate routing methods by measuring their accuracy and execution time on common image classification datasets.

The analysis will be presented together with the routing methods, that is, in chapters 2 and 3 (Theoretical Background and Model Design). The evaluation will be presented in chapter 4 (Experiments). Both will be further discussed in chapter 5 (Conclusions).

1.3 Previous work

In their first paper Sabour et al. [11] used a basic routing algorithm, with high similarity to the soft spherical k-means algorithm. It uses the cosine of the angle as distance measure, which later was thought to be sub-optimal, since it makes the clustering insensitive when it comes to distinguishing between an acceptable and an excellent match [5]. Sabour et al. later released a second paper [5], with a routing algorithm based on expectation maximization (em) of a Gaussian mixture model (gmm). This routing algorithm is more complex, and introduces the cluster bias in a less intuitive way. Zhang et al. [21] managed to increase the speed of computation using weighted kernel density estimation. These methods are described more thoroughly in section 2.4.

1.4 Limitations

Routing algorithms are the main focus of this thesis and the algorithms described in [5, 11] are compared to other possible routing options. Both theoretical aspects and performance are evaluated. The datasets used are MNIST [16] and CIFAR-10 [8]. The goal is not an absolute ranking, but to compare theoretical motivations, execution time, as well as accuracy on the different data sets individually. Only classification tasks will be considered since they need minimal network alterations. Also, when comparing accuracy one has to consider that the time used for parameter tuning was limited. Some models might be able to achieve higher accuracy, but demand more tuning.

2 Theoretical background

To fully grasp the reasoning behind the routing design some prior knowledge is needed. The following chapter provides some relevant theory about neural networks and clustering methods, as well as connects some existing routing methods to that theory. Section 2.1 introduces neural networks and section 2.2 extends the theory to capsule networks. Section 2.3 describes some clustering methods, which will be used as the foundation for the routing methods in sections 2.4 and 3.1.

2.1 Neural networks

Deep neural networks solve machine learning problems such as classification, detection, and segmentation. A neural network attempts to predict some output data y from some input data x by using the model y = f(x, θ), where θ are the model parameters. Here f is a large model structured in layers. The basic layer can be described as a linear function followed by a non-linear function. The non-linear function is called the activation function and allows the network to model complex relations. The linear functions and layer structure make the model suit computer hardware, so that the network can be evaluated feasibly fast, despite its size. Here follows an example of a layer l with input a and output c:

c_j = g\Big( \sum_i w_{i,j} a_i + b_j \Big). \quad (2.1)

Here w_{i,j} and b_j are model parameters, g is the activation function and the indices i, j are used to distinguish between individual nodes in layer l and layer


Figure 2.1: Sketch of a 3-layer neural network without max-pooling. The filter and node dimensions are indicated by the coordinate system above (im1, im2, channel). The cubes' side lengths represent the number of nodes/filters in the layer. The abbreviations "conv" and "fc" indicate convolutional and fully connected layers.

l + 1 respectively. The output c_j is called the activation of node j and is later used in a similar way to calculate the activations of layer l + 2. The activation c_j can be thought of as an unnormalized probability of some feature specific to node j. The nodes of a layer are typically arranged in a 3D tensor. If the input is naturally 2D, as e.g. images, the following layers will have two dimensions corresponding to the image dimensions. I will refer to them as spatial or image dimensions and to the remaining one as the channel dimension.
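As a concrete illustration of equation (2.1), the following minimal numpy sketch evaluates one fully connected layer. The function name layer_forward and the tanh choice of g are illustrative assumptions, not taken from the thesis implementation.

import numpy as np

def layer_forward(a, W, b, g=np.tanh):
    """Compute c_j = g(sum_i w_{i,j} a_i + b_j) for one layer.

    a: (n_in,) activations of layer l
    W: (n_in, n_out) weights w_{i,j}
    b: (n_out,) biases b_j
    g: elementwise non-linear activation function
    """
    return g(a @ W + b)

# Example: 4 input nodes, 3 output nodes
rng = np.random.default_rng(0)
a = rng.standard_normal(4)
W = rng.standard_normal((4, 3))
b = np.zeros(3)
c = layer_forward(a, W, b)   # shape (3,)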

The w_{i,j} and b_j are learned by evaluating the model on the training data set (ȳ, x̄) and minimizing the error l(ȳ, f(x̄, θ), θ) with respect to θ. Here l is some kind of loss function describing the distance between ȳ and f(x̄, θ) as well as priors on θ. In the large majority of all cases some sgd-based optimizer is used. Sometimes one may assume that the input of a layer is shift invariant. This is the case for e.g. image inputs. If this is the case it makes sense to use convolutional layers, which share the weights so that w_{i,j} = w_{l,k} and b_j = b_k if a_i and c_j have the same spatial relationship as a_l and c_k. One also assumes w_{i,j} = 0 if a_i is far from c_j spatially, thus limiting the field of view of c_j. A network containing convolutional layers is called a convolutional network. It may contain non-convolutional (fully connected) layers as well, and the last few layers will typically be fully connected. However, the convolution locks the size of the layers in the spatial dimension, as can be seen in figure 2.1, meaning that the distance between spatially neighbouring nodes will be one throughout the network. This is undesirable, since in the upper layers features might be larger spatially, and to increase the field of view of c_j there will have to be many non-zero w_{i,j}. This is solved with routing layers. A routing layer redirects the information flow in a network, breaking the standard structure. Max pooling is the most frequently used routing layer.


Figure 2.2: Sketch of a 3-layer neural network with max-pooling. The filter and node dimensions are indicated by the coordinate system above (im1, im2, channel). The cubes' side lengths represent the number of nodes/filters in the layer. The abbreviations "conv" and "fc" indicate convolutional and fully connected layers and "m-p" max-pooling. Here the max-pooling has a patch size of 2 × 2.

It reduces the spatial dimensions by dividing the layer into patches in the spatial dimension, ignoring all but the node with the maximum activation in each patch, see figure 2.2. This is inefficient since it throws away information. Also, by using convolutions, feature position will be encoded by neuron position. Relative geometry between features will thus be learned only in the last few fully connected layers. Here relative geometry refers to the relations of positions and orientations between features.

2.2 Capsule networks

The capsule network builds on the standard convolutional neural network structure by introducing a new layer type: capsule layers. In a capsule layer each building block is a capsule represented by a vector, rather than a node represented by a scalar. This is done in the hope that each capsule will not only represent a feature, but also traits of the feature, such as pose, hue, texture, and deformation [11]. Given the capsule activations and traits of layer l, one calculates the activation and traits of a capsule in layer l + 1 as

(u_j, a_j) = f_{i,j}(v_{i,j}, a_i, r_{i,j}). \quad (2.2)

Here i ∈ L_l and j ∈ L_{l+1}, where L_l is the set of capsule indices in layer l. The arguments (u_j, a_j) are the traits and activation of capsule j, f_{i,j} is a non-linear activation function which may have parameters dependent on i and j, r_{i,j} describes how much the traits and activations of capsule i may affect the traits of capsule j, and v_{i,j} = W_{i,j}(u_i) is the prediction of node j's traits given node i's traits. Here W_{i,j} is a linear prediction function. The parameters of W_{i,j} and f_{i,j} are dependent on i, j and are learned. One can make convolutional capsule layers by sharing the weights between capsules which have the same relative position to each other in image space.

The r_{i,j} should be large for (i, j) such that v_{i,j} is similar to the other v_{k,j}, k ≠ i. Finding groups of similar v_{i,j} can be seen as a clustering problem, as we do not want a lower level feature to be explained by several different higher level features or objects. It makes sense to normalize the r_{i,j} so that Σ_j r_{i,j} = 1, which makes it possible to interpret r_{i,j} as the probability that capsule i is grouped to cluster j, that is: capsule (v_{i,j}, a_i) should contribute to the calculation of (u_j, a_j). The design of both f and the clustering algorithm is rather open and [5, 11, 21] have some different proposals, which will be discussed further in section 2.4. However, we can list some desirable properties which make sense on an intuitive level.

• We want f_{i,j} to depend mainly on the (v_{i,j}, a_i) where r_{i,j} is large.

• We want the implementation of f and the clustering to be feasible in a network structure. This means it has to be cheap to compute and converge somewhat quickly.

• We want the total activation in a layer to have a reasonable total energy. Preferably it should be normalized so that it can be interpreted as probabilities.

• We would like f to introduce some type of non-linearity.

• We would like f to not only consider the previous layer's activation a_i, but also include a bias, as some features are more important than others for the next level of features.

It is also important to note that the clusters live in different spaces. Data point u_i will generate a prediction v_{i,j} which depends on j as well. Also, in the case of convolutional capsule layers different capsules will see different data points, complicating the clustering interpretation further. A simplified example of routing between a 3-capsule and a 2-capsule layer can be seen in figure 2.3.

Figure 2.3: Routing between a 3-capsule and a 2-capsule layer. Here u is represented by circle position and a is represented by circle size. To make the image more comprehensible W_{1,1} is put to the identity transformation.

Given that the network is used for classification, the last layer will have one capsule per class. The capsules can be fed as inputs to a decoder used to create a reconstruction regularizer. The idea is that punishing bad reconstructions will force the capsules in the last layer to learn proper traits, which seems to help with training [11]. This assumes that there exists an implicit constraint space of the same dimension as the capsule vector length, which is able to describe the intra-class variation, as discussed in [18].

2.3 Clustering

In the following sections different clustering methods will be described. Observations will be denoted (x_1, x_2, ..., x_n). The weights or activations of (x_1, x_2, ..., x_n) will be denoted (a_1, a_2, ..., a_n).

2.3.1 K-means

The clusters are denoted by (C_1, C_2, ..., C_k) where the C_j are disjoint sets of observations and ∪_{j=1}^{k} C_j = {x_i : ∀i ∈ [1, n]}. The goal is to solve

\arg\min_{\mathcal{C},\mu} L(\mathcal{C}, \mu_j) = \arg\min_{\mathcal{C},\mu} \sum_{j=1}^{k} \sum_{x_i \in \mathcal{C}_j} a_i \| x_i - \mu_j \|_2 \quad (2.3)

where µ_j is the prototype of cluster j. The minimization can be performed by initializing the clusters randomly and then iteratively solving a cluster update and a mean update. The cluster update is

\forall j, \; \mathcal{C}_j = \{ x_i : \| x_i - \mu_j \|_2 < \| x_i - \mu_k \|_2, \; \forall k \neq j \}, \quad (2.4)

and the mean update

\mu_j = \frac{\sum_{x_i \in \mathcal{C}_j} a_i x_i}{\sum_{x_i \in \mathcal{C}_j} a_i}. \quad (2.5)

The algorithm stops when a cluster update makes no change. One can see that the algorithm converges to a local minimum by noting that L is decreasing during both the mean and the cluster update. By making the cluster assignments fuzzy, we get the soft k-means algorithm. We introduce cluster responsibilities r_{i,j} ∈ [0, 1] : \sum_{j=1}^{k} r_{i,j} = 1, which model p(x_i ∈ \mathcal{C}_j), and try to minimize or maximize some cost or goal function L(r_{i,j}, \mu_j) w.r.t. the cluster prototypes µ_j and r_{i,j}.
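As an illustration, a minimal numpy sketch of weighted soft k-means in this spirit follows. It is an assumed example, not code from any of the cited papers: the responsibilities are taken as a softmax over negative squared distances with an assumed inverse temperature beta, which is only one of several possible choices of goal function.

import numpy as np

def soft_kmeans(x, a, k, iters=3, beta=1.0, seed=0):
    """Weighted soft k-means.

    x: (n, d) data points, a: (n,) non-negative weights, k: number of clusters.
    Returns responsibilities r (n, k) and cluster prototypes mu (k, d).
    """
    rng = np.random.default_rng(seed)
    mu = x[rng.choice(len(x), size=k, replace=False)]           # initialize prototypes
    for _ in range(iters):
        d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(-1)    # (n, k) squared distances
        logits = -beta * d2
        r = np.exp(logits - logits.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)                       # soft assignments, rows sum to 1
        w = r * a[:, None]                                      # weight by activations
        mu = (w.T @ x) / w.sum(axis=0)[:, None]                 # weighted mean update, cf. (2.5)
    return r, mu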

2.3.2 Expectation maximization

Expectation maximization is an algorithm based on Bayesian inference, with the goal to maximize the likelihood L(θ; X) = p(X|θ) of some model parameters θ given observations X and some unobserved variables Z. With the use of Gibbs' inequality one may prove that, given a current guess of the parameters θ^(t), choosing a new guess to be

\theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^{(t)}) = \arg\max_\theta E_{Z|X,\theta^{(t)}}[\log(L(\theta; X, Z))] = \arg\max_\theta \sum_Z p(Z|X, \theta^{(t)}) \log(p(X, Z|\theta)) \quad (2.6)

will not decrease p(X|θ). This means that since p(X|θ) ≤ 1, updating θ in this fashion will cause p(X|θ) to converge to a local maximum.

em can be used as a clustering algorithm by assuming a mixture model (\mathcal{C}_1, \mathcal{C}_2, ..., \mathcal{C}_k), where for all j cluster \mathcal{C}_j \sim D(\theta_j) is a random variable belonging to distribution D with parameters θ_j. Here Z = (z_1, z_2, ..., z_n) describes which cluster X was generated by, through p(x_i | z_i = j) \sim D(\theta_j). Another parameter of the model is (π_1, π_2, ..., π_k). Here π_j describes the prior probability of cluster \mathcal{C}_j.

If one assumes Gaussian distributions (D(θ_j) = N(µ_j, Σ_j)) one gets expectation maximization of a Gaussian mixture model (gmm-em), which is what many refer to when they talk about em. To practically apply the algorithm we have to make some changes. In the network there will be priors or weights on the observed variables, which could be accounted for in multiple ways. It would make logical sense to put p(x_i | z_i = j, a_i) \sim N(µ_j, Σ_j)^{a_i}. However this is not a distribution since it does not integrate to one. The authors of [4] normalize by noticing N(µ_j, Σ_j)^{a_i} \propto N(µ_j, Σ_j / a_i) and introducing \hat{p}(x_i | z_i = j, a_i) = N(µ_j, Σ_j / a_i). Solving the em problem with

r_{i,j}^{(t)} = p(z_i = j | x_i, a_i, \pi_j^{(t)}, \mu_j^{(t)}, \Sigma_j^{(t)}) = \frac{\pi_j^{(t)} N(x_i; \mu_j^{(t)}, \Sigma_j^{(t)}/a_i)}{\sum_{l=1}^{k} \pi_l^{(t)} N(x_i; \mu_l^{(t)}, \Sigma_l^{(t)}/a_i)} \quad (2.7)

and

p(x, z | a, \pi, \mu, \Sigma) = \prod_{j=1}^{k} \prod_{i=1}^{n} \left( \pi_j N(x_i; \mu_j, \Sigma_j/a_i) \right)^{1(z_i = j)} \quad (2.8)

results in

\mu_j^{(t+1)} = \frac{\sum_{i=1}^{n} r_{i,j}^{(t)} a_i x_i}{\sum_{i=1}^{n} r_{i,j}^{(t)} a_i}, \qquad
\Sigma_j^{(t+1)} = \frac{\sum_{i=1}^{n} r_{i,j}^{(t)} a_i \left(x_i - \mu_j^{(t+1)}\right)\left(x_i - \mu_j^{(t+1)}\right)^T}{\sum_{i=1}^{n} r_{i,j}^{(t)} a_i}, \qquad
\pi_j^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} r_{i,j}^{(t)} \quad (2.9)

as can be seen in Appendix A.1.
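The weighted updates (2.7)-(2.9) can be sketched directly in numpy as below. This is an illustrative sketch, not the thesis implementation; diagonal covariances are assumed here for brevity even though the derivation above allows a full Σ_j.

import numpy as np

def weighted_gmm_em(x, a, k, iters=5, eps=1e-6, seed=0):
    """EM for a GMM with per-point weights a_i and diagonal covariances.

    x: (n, d), a: (n,). Returns r (n, k), mu (k, d), var (k, d), pi (k,).
    """
    rng = np.random.default_rng(seed)
    n, d = x.shape
    mu = x[rng.choice(n, size=k, replace=False)]
    var = np.ones((k, d))
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities from N(x_i; mu_j, var_j / a_i), eq. (2.7)
        diff2 = (x[:, None, :] - mu[None, :, :]) ** 2               # (n, k, d)
        scaled_var = var[None, :, :] / a[:, None, None] + eps
        log_p = -0.5 * (diff2 / scaled_var + np.log(2 * np.pi * scaled_var)).sum(-1)
        log_p += np.log(pi)[None, :]
        r = np.exp(log_p - log_p.max(1, keepdims=True))
        r /= r.sum(1, keepdims=True)
        # M-step: weighted means, variances and priors, eq. (2.9)
        w = r * a[:, None]                                          # (n, k)
        mu = (w.T @ x) / (w.sum(0)[:, None] + eps)
        var = np.einsum('nk,nkd->kd', w, (x[:, None, :] - mu[None, :, :]) ** 2)
        var = var / (w.sum(0)[:, None] + eps)
        pi = r.mean(0)
    return r, mu, var, pi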

2.3.3 Kernel density estimation

Kernel density estimation is defined as

f(x, h) = \frac{1}{nh} \sum_{i=1}^{n} k(d(x - x_i)/h) \quad (2.10)

where (x_1, x_2, ..., x_n) are data points drawn from X, d is a distance measure, k is a kernel function, h is a smoothing parameter, and f is an estimate of the probability density function of X. It is desirable to choose k such that \int_{\mathbb{R}^d} k(x) \, dx = 1, k(-x) = k(x), and \lim_{\|x\| \to \infty} k(x) = 0. This density estimate can be used in clustering algorithms as we will see in section 2.4.4.
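A minimal sketch of (2.10), with a Gaussian kernel and the Euclidean norm as assumed choices of k and d, could look as follows.

import numpy as np

def kde(x_query, x_data, h=1.0):
    """Kernel density estimate f(x, h) = 1/(n h) * sum_i k(d(x - x_i) / h).

    Uses a Gaussian kernel and the Euclidean norm as distance (assumed choices),
    with the 1/(n h) normalization of eq. (2.10).
    x_query: (d,), x_data: (n, d).
    """
    n, d = x_data.shape
    u = np.linalg.norm(x_query - x_data, axis=1) / h
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return k.sum() / (n * h)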

2.3.4 Variational Bayesian methods

Let X be observed variables and Z be unobserved variables. We now try to find an estimate q(Z) of p(Z|X). By inspecting

\log p(X) = \int_Z q(Z) \log \frac{p(X, Z)}{q(Z)} \, dZ - \int_Z q(Z) \log \frac{p(Z|X)}{q(Z)} \, dZ \quad (2.11)

we note that the KL-distance between q(Z) and p(Z|X) is minimized when L(q(Z)) = \int_Z q(Z) \log \frac{p(X, Z)}{q(Z)} \, dZ is maximized. To make the maximization tractable one can make use of the approximation q(Z) = \prod_i^n q_i(Z_i), that is, that some of the parameters and unobserved variables are independent given X. This is called the mean field approximation. We may now maximize w.r.t. one q_j(Z_j) at a time:

L(q_j(Z_j)) = \int_Z \prod_i^n q_i(Z_i) \left( \log p(X, Z) - \sum_i^n \log q_i(Z_i) \right) dZ

= \int_{Z_j} q_j(Z_j) \left( \int \log p(X, Z) \prod_{i \neq j} q_i(Z_i) \, dZ_i \right) dZ_j - \int_{Z_j} q_j(Z_j) \log q_j(Z_j) \, dZ_j + \text{const}

= -KL\left( \exp\left( \int \log p(X, Z) \prod_{i \neq j} q_i(Z_i) \, dZ_i \right) \,\Big\|\, q_j(Z_j) \right) + \text{const} \quad (2.12)

which is maximized when

q_j^*(Z_j) \propto \exp\left( \int \log p(X, Z) \prod_{i \neq j} q_i(Z_i) \, dZ_i \right) = \exp\left( E_{q_i(z_i),\, i \neq j}[\log p(x, z)] \right). \quad (2.13)

This is iterated for different j [1]. By introducing a mixture model with priors this may be used as a clustering algorithm, as can be seen in chapter 3.

2.4 Routing in current capsule networks

This section describes the routing algorithms of some of the currently existing capsule network implementations. Notation and definitions from section 2.2 are used throughout this section.

2.4.1 Dynamic routing

This is the original routing algorithm published in [11]. Here capsule i's activity a_i is modeled by the length of u_i and the capsule traits are modeled by the direction of u_i. The predicted traits are v_{i,j} = W_{i,j} u_i, where W_{i,j} is learned. The routing is described in Algorithm 1.

Algorithm 1 Dynamic routing
1: procedure DynamicRouting(v_{i,j}, t, l)
2:   ∀i ∈ L_l, j ∈ L_{l+1}: b_{i,j} ← b_{i,j}^(0)
3:   for t iterations do
4:     ∀i ∈ L_l, j ∈ L_{l+1}: r_{i,j} ← softmax_j(b_{i,j})
5:     ∀j ∈ L_{l+1}: µ_j ← Σ_i r_{i,j} v_{i,j}
6:     ∀j ∈ L_{l+1}: u_j ← squash(µ_j)
7:     ∀i ∈ L_l, j ∈ L_{l+1}: b_{i,j} ← b_{i,j} + ⟨v_{i,j}, u_j⟩
8:   return u_j

Here the r_{i,j} model p(z_i = j), which are initially represented by the log priors b_{i,j}^(0), which are learned. However, b_{i,j}^(0) = 0, ∀i, j has been used as well [11]. Line 4 in Algorithm 1,

r_{i,j} = \text{softmax}_j(b_{i,j}) = \frac{\exp(b_{i,j})}{\sum_{l=1}^{k} \exp(b_{i,l})}, \quad (2.14)

ensures that \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} = 1. The expected pose µ_j of cluster j is calculated as a weighted mean, where the activation is implicit in the length of v_{i,j}. The squash function is

u_j = \text{squash}(\mu_j) = \frac{\|\mu_j\|_2^2}{1 + \|\mu_j\|_2^2} \, \frac{\mu_j}{\|\mu_j\|_2}, \quad (2.15)

which forces the length of u_j to be less than one.

By separating the activation v_{i,j} = a_{i,j} \hat{v}_{i,j}, where \|\hat{v}_{i,j}\| = 1, we can view the algorithm as an attempt to maximize

\sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} a_{i,j} \cos(v_{i,j}, \mu_j) = \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} a_{i,j} \left\langle \frac{v_{i,j}}{\|v_{i,j}\|}, \frac{\mu_j}{\|\mu_j\|} \right\rangle \quad (2.16)

with respect to r_{i,j}, and the boundary condition \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} = 1. Here µ_j is the cluster prototype. Maximizing w.r.t. µ_j we find

\sum_{i \in \mathcal{L}_l} r_{i,j} a_{i,j} \left\langle \frac{v_{i,j}}{\|v_{i,j}\|}, \frac{\mu_j}{\|\mu_j\|} \right\rangle = \left\langle \sum_{i \in \mathcal{L}_l} r_{i,j} a_{i,j} \frac{v_{i,j}}{\|v_{i,j}\|}, \frac{\mu_j}{\|\mu_j\|} \right\rangle \quad (2.17)

which is maximized by \mu_j = \sum_{i \in \mathcal{L}_l} r_{i,j} a_{i,j} \frac{v_{i,j}}{\|v_{i,j}\|} = \sum_{i \in \mathcal{L}_l} r_{i,j} v_{i,j}. Here one could normalize µ_j as it does not matter for the distance measure. However, the final µ_j should be large where r_{i,j} and a_{i,j} are large, so one may refrain from any normalization of µ_j. Maximizing w.r.t. r_{i,j} gives r_{i,j} = \frac{a_{i,j} d_{i,j}}{\sum_j a_{i,j} d_{i,j}}, where d_{i,j} = \cos(v_{i,j}, \mu_j). However, this is not the method in Algorithm 1. In [13] they show that if one adds a Kullback–Leibler regularizer

-\alpha \, KL(r_{i,j}^{(t)} \,|\, r_{i,j}^{(t-1)}) = -\alpha \sum_{i,j} r_{i,j}^{(t)} \left( \log r_{i,j}^{(t)} - \log r_{i,j}^{(t-1)} \right) \quad (2.18)

with α = 1, one gets Algorithm 1, apart from line 6 in Algorithm 1, which is not explained by this viewpoint [13]. So if line 6 were put outside the for loop, so that u_j is replaced with µ_j in line 7, one is maximizing

L(r_{i,j}, \mu_j) = \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} a_{i,j} \cos(v_{i,j}, \mu_j) - KL(r_{i,j}^{(t)} \,|\, r_{i,j}^{(t-1)}). \quad (2.19)

The proof can be seen in Appendix A.2. Practically, having line 6 inside the loop will decrease the vector lengths and might hence act as an extra regularizer. The routing method can be seen as a soft k-means algorithm with the data points v_{i,j} and cluster prototypes µ_j represented as points on the surface of a hypersphere. The distance measure on the sphere is the orthogonal projection described by cos(v_{i,j}, µ_j). As the angle between v_{i,j} and µ_j gets small, the derivative of the distance measure, −sin(v_{i,j}, µ_j), gets close to 0. This makes the algorithm insensitive to the difference between good and really good matches, which was noted by the proposers of the algorithm [5].
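For concreteness, a minimal numpy sketch of Algorithm 1 for a single fully connected capsule layer is given below. It is a sketch under the notation above, not the thesis TensorFlow implementation, and the uniform prior b^(0) = 0 is assumed.

import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Eq. (2.15): shrink vectors to length < 1 while keeping direction."""
    n2 = (s ** 2).sum(axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(v, iters=3):
    """Algorithm 1. v: (n_in, n_out, d) votes v_{i,j}; returns u_j of shape (n_out, d)."""
    n_in, n_out, _ = v.shape
    b = np.zeros((n_in, n_out))                       # b^(0) = 0 (uniform prior assumed)
    for _ in range(iters):
        r = np.exp(b - b.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)             # softmax over j, eq. (2.14)
        mu = np.einsum('ij,ijd->jd', r, v)            # weighted sum of votes
        u = squash(mu)                                # eq. (2.15)
        b = b + np.einsum('ijd,jd->ij', v, u)         # agreement <v_{i,j}, u_j>
    return u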

2.4.2 Optimized dynamic routing

Wang and Liu [13] propose some tweaks to the dynamic routing algorithm.

• Use an entropy regularizer to force r_{i,j} to be close to uniform, instead of the Kullback–Leibler regularizer which forces it to be close to the value in the previous iteration.

• Move the activation function outside the for loop.

• Put a weight on the prediction, which reflects the norm of the prediction function.

This results in Algorithm 2, which maximizes

L(r_{i,j}, \mu_j) = \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} a_{i,j} \cos(o_{i,j}, \mu_j) - \alpha \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} \log r_{i,j} \quad (2.20)

where o_{i,j} = \frac{W_{i,j} u_i}{\|W_{i,j}\|_F}, and α is the regularizer weight, which is a hyperparameter.

Algorithm 2 Optimized dynamic routing
1: procedure OptimizedDynamicRouting(o_{i,j}, t, l, α)
2:   for all capsules i in layer l and j in layer (l+1): b_{i,j} ← 0
3:   for t iterations do
4:     for all capsules i in layer l: r_{i,j} ← softmax_j(b_{i,j})
5:     for all capsules j in layer (l+1): µ_j ← Σ_i r_{i,j} o_{i,j},  µ̂_j = µ_j / ‖µ_j‖_2
6:     for all capsules i in layer l and j in layer (l+1): b_{i,j} ← (1/α) ⟨o_{i,j}, µ̂_j⟩
7:   return squash(µ_j)

The point of normalizing with respect to W_{i,j} is that W_{i,j} will have a large impact on the length of v_{i,j}, which here represents the activation of that capsule. The following equation describes what is known about the normalization

\|v_{i,j}\|_2 = \|W_{i,j} u_i\|_2 \leq \|W_{i,j}\|_2 \|u_i\|_2 \leq \|W_{i,j}\|_F \|u_i\|_2. \quad (2.21)

Meaning \frac{\|v_{i,j}\|_2}{\|W_{i,j}\|_F} \leq \|u_i\|_2. The Frobenius norm is used since it is faster to calculate than the L2-norm. Note that this algorithm always makes use of uniform priors.

2.4.3 EM-routing

Hinton et al. also developed a routing algorithm based on gmm-em, which can be seen in Algorithm 3 [5].

Algorithm 3 EM routing
1: procedure EMRouting(v_{i,j}, t, l, α)
2:   for all capsules i ∈ L_l and j ∈ L_{l+1}: r_{i,j} ← 1/|L_{l+1}|
3:   for t iterations do
4:     for all capsules j ∈ L_{l+1}: M-step(a_i, r_{i,j}, v_{i,j}, j)
5:     for all capsules i ∈ L_l: E-step(µ_j, σ_j, a_j, v_{i,j}, i)
6:   return a_j, µ_j

7: procedure M-step(a, r, v_{i,j}, j)
8:   ∀h: µ_j^h ← Σ_{i∈L_l} r_{i,j} a_i v_{i,j}^h / Σ_{i∈L_l} r_{i,j} a_i
9:   ∀h: (σ_j^h)^2 ← Σ_{i∈L_l} r_{i,j} a_i (v_{i,j}^h − µ_j^h)^2 / Σ_{i∈L_l} r_{i,j} a_i
10:  cost^h ← (β_u + log σ_j^h) Σ_{i∈L_l} r_{i,j} a_i
11:  a_j ← logistic(λ(β_a − Σ_h cost^h))

12: procedure E-step(µ, σ, a_j, v_{i,j}, i)
13:  ∀j ∈ L_{l+1}: p_j ← (1 / √(∏_{h=1}^{H} 2π(σ_j^h)^2)) exp(−Σ_{h=1}^{H} (v_{i,j}^h − µ_j^h)^2 / (2(σ_j^h)^2))
14:  ∀j ∈ L_{l+1}: r_{i,j} ← a_j p_j / Σ_{k∈L_{l+1}} a_k p_k

This algorithm is pretty close to the gmm-em described in section 2.3.2, but three differences are apparent:

• A diagonal covariance matrix Σ has been assumed.

• As in the dynamic routing case the cluster probability has been altered inside the loop.

• v_{i,j} is dependent on j as well, which has been discussed previously.

A diagonal covariance matrix Σ is a quite rough assumption, but it is also motivated by the cost of the time consuming matrix inversion. Why line 14 in Algorithm 3 uses a_j instead of π_j = \frac{1}{|\mathcal{L}_l|} \sum_{i \in \mathcal{L}_l} r_{i,j} a_i is not motivated in [5]. It might work as a regularizer, but this author has not tried to prove convergence. Here β_a, β_u are bias terms, σ^h is the square root of element h in the covariance matrix, λ is a hyperparameter used to scale the preactivation to the active part of the activation function, and logistic is the sigmoid

\text{logistic}(x) = \frac{1}{1 + e^{-x}}.

The learned biases β_a, β_u are individual per channel, but shared across image space. The trait prediction function used is

V_{i,j} = W_{i,j} U_i

where U_i is u_i reshaped to a matrix, V_{i,j} is v_{i,j} reshaped to a matrix and W_{i,j} is a matrix. Note that this demands that the trait vector length is a square number. It also lowers the number of parameters of the prediction function compared to the prediction function in section 2.4.1.
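A condensed numpy sketch of Algorithm 3 for one fully connected capsule layer follows. It is an illustrative sketch with assumed shapes, not the thesis implementation; the learned biases β_a, β_u and the schedule of λ are left as plain arguments.

import numpy as np

def em_routing(v, a_in, iters=3, beta_a=0.0, beta_u=0.0, lam=1.0, eps=1e-8):
    """Algorithm 3. v: (n_in, n_out, d) votes, a_in: (n_in,) lower activations.
    Returns upper activations a_out (n_out,) and means mu (n_out, d)."""
    n_in, n_out, d = v.shape
    r = np.full((n_in, n_out), 1.0 / n_out)
    for _ in range(iters):
        # M-step
        w = r * a_in[:, None]                                        # (n_in, n_out)
        denom = w.sum(0) + eps                                       # (n_out,)
        mu = np.einsum('ij,ijd->jd', w, v) / denom[:, None]
        var = np.einsum('ij,ijd->jd', w, (v - mu[None]) ** 2) / denom[:, None] + eps
        cost = (beta_u + 0.5 * np.log(var)) * denom[:, None]         # cost^h per dimension
        a_out = 1.0 / (1.0 + np.exp(-lam * (beta_a - cost.sum(1))))  # logistic
        # E-step
        log_p = -0.5 * (((v - mu[None]) ** 2) / var[None]
                        + np.log(2 * np.pi * var[None])).sum(-1)
        p = a_out[None, :] * np.exp(log_p - log_p.max(1, keepdims=True))
        r = p / (p.sum(1, keepdims=True) + eps)
    return a_out, mu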

2.4.4 Fast dynamic routing based on weighted kernel density estimation

In [21] they use the kernel model

f(\mu, r) = \frac{1}{|\mathcal{L}_l| z_k} \sum_{j \in \mathcal{L}_{l+1}} \sum_{i \in \mathcal{L}_l} r_{i,j} a_i \, k(d(\mu_j - v_{i,j})), \quad (2.22)

where f is the density estimate which is to be maximized w.r.t. µ and r, z_k normalizes the kernel k, and d is a distance measure. This can be solved in multiple ways and [21] proposes two. One is based on mean shift, which iterates between solving for µ with fixed r and updating r with sgd and step length α followed by a normalization, see Algorithm 4. The second introduces cluster priors π_j and solves with em, see Algorithm 5.

Algorithm 4 FRMS
1: procedure DynamicRoutingBasedOnMeanShift(v, a, k, d, α, t)
2:   for all capsules i ∈ L_l and j ∈ L_{l+1}: r_{i,j} ← 1/|L_{l+1}|
3:   for t iterations do
4:     for all capsules i ∈ L_l and j ∈ L_{l+1}: r̂_{i,j} ← r_{i,j} / Σ_{l∈L_{l+1}} r_{i,l}
5:     for all capsules j ∈ L_{l+1}: µ_j ← Σ_{i∈L_l} r̂_{i,j} a_i k'(d(µ_j − v_{i,j})) v_{i,j} / Σ_{i∈L_l} r̂_{i,j} a_i k'(d(µ_j − v_{i,j}))
6:     for all capsules i ∈ L_l and j ∈ L_{l+1}: r_{i,j} ← r_{i,j} + α a_i k(d(µ_j − v_{i,j}))
7:   return µ_j

Both versions return only µ_j, and a_j has to be calculated afterwards. The paper [21] proposes

a_j = \text{softmax}_j\Big( \sum_{i \in \mathcal{L}_l} \hat{r}_{i,j} a_i \, k\big( \textstyle\sum_{h=0}^{H} d(v_{i,d} - \beta_{j,h} s_{j,h}) + \beta_{j,0} \big) \Big). \quad (2.23)

Algorithm 5 FREM
1: procedure DynamicRoutingBasedOnKdeAndEm(v, a, k, d, t)
2:   for all capsules i ∈ L_l and j ∈ L_{l+1}: r_{i,j} ← 1/|L_{l+1}|
3:   for t iterations do
4:     for all capsules i ∈ L_l and j ∈ L_{l+1}: r̂_{i,j} ← r_{i,j} / Σ_{l∈L_{l+1}} r_{i,l}
5:     for all capsules j ∈ L_{l+1}: µ_j ← Σ_{i∈L_l} r̂_{i,j} a_i v_i / Σ_{i∈L_l} r̂_{i,j} a_i
6:     for all capsules j ∈ L_{l+1}: π_j ← Σ_{i∈L_l} r̂_{i,j} / Σ_{i∈L_l} Σ_{j∈L_{l+1}} r̂_{i,j}
7:     for all capsules i ∈ L_l and j ∈ L_{l+1}: r_{i,j} ← π_j k(d(µ_j − v_i))
8:   return µ_j

Here the β_{j,h} and β_{j,0} are learned parameters. One should note that compared to the gmm-em routing this algorithm is quite cheap to execute, especially if one chooses k to be a triangle function and d to be the L1-norm.

3 Model design

This chapter describes the evaluated methods. Propositions of new routing models with theoretical motivations can be seen in section 3.1. Modifications to the clustering algorithms, necessary because they are used for routing in a neural network, are described in section 3.2. Existing algorithms which will be tested with some modifications are described in section 3.3. Throughout this chapter, if multiple options are described it is implied that all versions are implemented and that the choice of version is treated as a hyperparameter.

3.1 Novel routing algorithms

This section contains the theory behind the proposed routing algorithms. Practical modifications of the algorithms can be seen in section 3.2.

3.1.1 Mean field Gaussian mixture model

The mean field Gaussian mixture model (mfgmm) is designed mainly with the goal to incorporate the biases in a well motivated fashion. The frms, frem, and em-routing all introduce biases on the cluster probabilities; however, it might make sense to introduce biases on the latent variables instead, as has been done in the dr algorithm. Imagine a lower capsule representing a triangle which may or may not be the sail of the boat which the next level capsule is representing.^1 One may now argue to introduce a bias for the triangle being a sail, rather than a bias for a triangle and a bias for the boat. I propose an algorithm which is based on gmm as in em-routing but instead of maximizing model parameter likelihood with em, it introduces priors on all

^1 Shamelessly borrowing the example from https://www.youtube.com/watch?v=pPN8d0E3900

parameters and solves it with the Bayesian mean field approximation. I borrow the notation from section 2.3. Noticing that p(z_i = j | π_i) = p_{i,j} ~ Cat(π_i), ∀i ∈ L_l, is a categorical distribution, which has the Dirichlet distribution as its conjugate prior, we put π_{i,j} ~ Dir(ρ_i), ∀i ∈ L_l. Note that the priors here are data-point individual. Here ρ_i will model the routing bias and will be learned. Practically we want to regularize ρ_i as well to prevent overfitting and make it prefer a uniform prior. This can be done by changing the representation to ρ_{i,j} = e^{b_{i,j}} and adding the term λ_reg Σ_{i,j} ‖b_{i,j}‖_2 to the loss.

Since making matrix inversions is too computationally heavy we must still assume diagonal covariance matrices for the Gaussians, which is the same as independent Gaussians. This results in a conjugate prior which is normal-gamma distributed. That is

\mu_{j,d} \sim N(m_{j,d}^{(0)}, 1/(\tau_{j,d} \lambda_{j,d}^{(0)})), \qquad \tau_{j,d} \sim \Gamma(\alpha_{j,d}^{(0)}, \beta_{j,d}^{(0)}). \quad (3.1)

When applying the clustering in the network, x will also be dependent on j as in previous methods, and we will also add the weights a_i to each data point. All in all the model becomes:

p(x, z, \pi, \mu, \tau) = p(x | z, \mu, \tau) \, p(z | \pi) \, p(\pi) \, p(\mu | \tau) \, p(\tau)

where

p(x | z, \mu, \tau) = \prod_{i \in \mathcal{L}_l} \prod_{j \in \mathcal{L}_{l+1}} \prod_d N(x_{i,j,d} | \mu_{j,d}, 1/(a_i \tau_{j,d}))^{z_{i,j}}

p(z | \pi) = \prod_{i \in \mathcal{L}_l} \prod_{j \in \mathcal{L}_{l+1}} \pi_{i,j}^{z_{i,j}}

p(\pi) = \prod_{i \in \mathcal{L}_l} \frac{\Gamma\big(\sum_{j \in \mathcal{L}_{l+1}} \rho_{i,j}^{(0)}\big)}{\prod_{j \in \mathcal{L}_{l+1}} \Gamma(\rho_{i,j}^{(0)})} \prod_{j \in \mathcal{L}_{l+1}} \pi_{i,j}^{\rho_{i,j}^{(0)} - 1} \quad (3.2)

p(\mu | \tau) = \prod_{j \in \mathcal{L}_{l+1}} \prod_d N(\mu_{j,d} | m_{j,d}^{(0)}, 1/(\tau_{j,d} \lambda_{j,d}^{(0)}))

p(\tau) = \prod_{j \in \mathcal{L}_{l+1}} \prod_d \Gamma(\tau_{j,d} | \alpha_{j,d}^{(0)}, \beta_{j,d}^{(0)}).

While making the estimate we have to make the approximation q(z, π, µ, τ) = q(z) q(π, µ, τ), where q is the estimated probability density function. Since we do not want to make assumptions about the cluster size, α_{j,d}^(0), β_{j,d}^(0), λ_{j,d}^(0) will be the same for all clusters and put to a constant low-information prior, treated as a hyperparameter. The mean prior m_{j,d}^(0) will be set to 0. The calculations can be seen in Appendix A.3. The result is Algorithm 6.

Algorithm 6 Mean field gmm
1: procedure RoutingBasedOnMeanFieldGmm(x, a)
2:   n = 0
3:   for t iterations do
4:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: h_{i,j} = (1/2) Σ_d [ γ(α_{j,d}^(n)) − log(β_{j,d}^(n)) − (a_i α_{j,d}^(n) / β_{j,d}^(n)) (x_{i,j,d} − m_{j,d}^(n))^2 − a_i / λ_{j,d} ]
5:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: g_{i,j} = γ(ρ_{i,j})
6:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: g_{i,j} = g_{i,j} − γ(Σ_{j∈Ω_{L+1}} ρ_{i,j})
7:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: p_{i,j} = e^{h_{i,j} + g_{i,j}}
8:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: r_{i,j} = p_{i,j} / Σ_{j∈Ω_{L+1}} p_{i,j}
9:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: ρ_{i,j} = ρ_{i,j}^(0) + r_{i,j}
10:    ∀j ∈ Ω_{L+1}: λ_j^(n) = Σ_{i∈Ω_L} r_{i,j} a_i + λ_j^(0)
11:    ∀j ∈ Ω_{L+1}: m_j^(n) = Σ_{i∈Ω_L} r_{i,j} a_i x_{i,j} / λ_j^(n)
12:    ∀j ∈ Ω_{L+1}: β_j^(n) = (Σ_{i∈Ω_L} r_{i,j} a_i x_{i,j}^2 − λ_j^(n) (m_j^(n))^2) / 2 + β_j^(0)
13:    ∀j ∈ Ω_{L+1}: α_j^(n) = α_j^(0) + (1/2) Σ_{i∈Ω_L} r_{i,j}
14:    ∀j ∈ Ω_{L+1}: a_j = f(Σ_{i∈Ω_L} a_i r_{i,j})
15:  return µ_j, a_j

Here f is an activation function. This algorithm is computationally costly, and it seems it would be desirable to find a simpler one. Removing the variance prior we get the algorithm in the following section.

3.1.2 Mean field soft k-means

The mean field soft k-means (mfskm) algorithm is built on the following model

p(x, z, \pi, \mu) = p(x | z, \mu) \, p(z | \pi) \, p(\pi) \, p(\mu)

where

p(x | z, \mu) = \prod_{i \in \Omega_L} \prod_{j \in \Omega_{L+1}} \prod_d N(x_{i,j,d} | \mu_{j,d}, 1/(\beta a_i))^{z_{i,j}}

p(z | \pi) = \prod_{i \in \Omega_L} \prod_{j \in \Omega_{L+1}} \pi_{i,j}^{z_{i,j}} \quad (3.3)

p(\pi) = \prod_{i \in \Omega_L} \frac{\Gamma\big(\sum_{j \in \Omega_{L+1}} \rho_{i,j}^{(0)}\big)}{\prod_{j \in \Omega_{L+1}} \Gamma(\rho_{i,j}^{(0)})} \prod_{j \in \Omega_{L+1}} \pi_{i,j}^{\rho_{i,j}^{(0)} - 1}

p(\mu) = \prod_{j \in \Omega_{L+1}} \prod_d N(\mu_{j,d} | m_{j,d}^{(0)}, 1/\tau_{j,d}^{(0)}).

The calculations can be seen in Appendix A.4, which results in Algorithm 7.

Algorithm 7 Mean field soft k-means
1: procedure RoutingBasedOnMeanFieldSoftKMeans(x, a)
2:   n = 0
3:   for t iterations do
4:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: h_{i,j} = −(β a_i / 2) Σ_d [ (x_{i,j,d} − m_{j,d}^(n))^2 + 1/τ_{j,d}^(n) ]
5:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: g_{i,j} = γ(ρ_{i,j})
6:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: g_{i,j} = g_{i,j} − γ(Σ_{j∈Ω_{L+1}} ρ_{i,j})
7:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: p_{i,j} = e^{h_{i,j} + g_{i,j}}
8:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: r_{i,j} = p_{i,j} / Σ_{j∈Ω_{L+1}} p_{i,j}
9:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: ρ_{i,j} = ρ_{i,j}^(0) + r_{i,j}
10:    ∀j ∈ Ω_{L+1}: τ_j^(n) = β Σ_{i∈Ω_L} r_{i,j} a_i + τ_j^(0)
11:    ∀j ∈ Ω_{L+1}: m_j^(n) = β Σ_{i∈Ω_L} r_{i,j} a_i x_{i,j} / τ_j^(n)
12:    ∀j ∈ Ω_{L+1}: a_j = f(Σ_{i∈Ω_L} a_i r_{i,j})
13:  return µ_j, a_j

Here f is an activation function.

3.1.3 Soft k-means routing

The soft k-means algorithm (skm) is motivated to be as simple as possible. We use prototypes µj with the distance function

d(x_{i,j}, \mu_j) = \| x_{i,j} - \mu_j \|^2. \quad (3.4)

We introduce soft assignments r_{i,j}, whose updates will be regularized with the KL-distance to either the prior, the value of the previous iteration, or the negative entropy. This results in the cost function

L(r, \mu) = \sum_{i \in \Omega_L} \sum_{j \in \Omega_{L+1}} r_{i,j} a_{i,j} \, d(x_{i,j}, \mu_j) + KL(r_{i,j}, r'_{i,j}). \quad (3.5)

Minimizing the loss by differentiating and cancelling results in Algorithm 8, as can be seen in Appendix A.5. Here f is an activation function. The priors r^(0) are learned through r_{i,j}^(0) = e^{b_{i,j}^(0)} and regularized with λ_reg ‖b_{i,j}^(0)‖_2, or put to uniform. Here r'_{i,j} can be either r_{i,j}^(0), r_{i,j}^(t-1) or constant, which corresponds to b'_{i,j} being b_{i,j}^(0), b_{i,j}^(t-1) or 0.

Algorithm 8 Soft k-means

1: procedure RoutingBasedOnSoftKMeans(x, a, b^(0))
2:   b = b^(0)
3:   for t iterations do
4:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: r_{i,j} = e^{b_{i,j}} / Σ_{l∈Ω_{L+1}} e^{b_{i,l}}
5:     ∀j ∈ Ω_{L+1}: µ_j = Σ_{i∈Ω_L} r_{i,j} a_i x_{i,j} / Σ_{i∈Ω_L} r_{i,j} a_i
6:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: d_{i,j} = ‖x_{i,j} − µ_j‖^2
7:     ∀i ∈ Ω_L, j ∈ Ω_{L+1}: b_{i,j} = b'_{i,j} − a_i d_{i,j}
8:   ∀j ∈ Ω_{L+1}: a_j = f(Σ_{i∈Ω_L} a_i r_{i,j})
9:   return µ_j, a_j
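A compact numpy sketch of Algorithm 8 follows, with the uniform prior b^(0) = 0 and the constant regularizer choice b'_{i,j} = 0 (both assumed settings); it is not the thesis TensorFlow implementation.

import numpy as np

def skm_routing(x, a_in, iters=3, f=np.tanh):
    """Algorithm 8 (soft k-means routing).

    x: (n_in, n_out, d) predictions x_{i,j}, a_in: (n_in,) lower activations.
    Assumes uniform priors b^(0) = 0 and b'_{i,j} = 0.
    Returns mu (n_out, d) and a_out (n_out,).
    """
    n_in, n_out = x.shape[0], x.shape[1]
    b = np.zeros((n_in, n_out))
    for _ in range(iters):
        r = np.exp(b - b.max(1, keepdims=True))
        r /= r.sum(1, keepdims=True)                                      # line 4
        w = r * a_in[:, None]
        mu = np.einsum('ij,ijd->jd', w, x) / (w.sum(0)[:, None] + 1e-8)   # line 5
        d2 = ((x - mu[None]) ** 2).sum(-1)                                # line 6
        b = -a_in[:, None] * d2                                           # line 7 with b'_{i,j} = 0
    a_out = f((a_in[:, None] * r).sum(0))                                 # line 8
    return mu, a_out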

3.2 Practical modifications

The following section contains different modifications of the clustering methods one might want to make since they are to be used as routing methods in a network. Whether or not a modification is applied will be treated as a hyperparameter.

1. Even though many of the routing algorithms contain data and cluster specific priors b_{i,j} or ρ_{i,j}, it might be desirable to add cluster scaling and additive biases b_a, b_u as in [5], see section 2.4.3.

2. In mfgmm and mfskm we might also punish the clusters which get a large standard deviation as in [5], see section 2.4.3. The standard deviation is estimated using σ̂ = √(β/α) and σ̂ = √(1/τ), respectively.

3. The mfgmm and mfskm have priors treated as hyperparameters which are scale dependent. To make them work through different layers and the entire learning process it might be necessary to normalize the data. Normalization might be done with the matrix Frobenius norm as in [13], see section 2.4.2, or with per dimension normalization (see the sketch after this list) by

o_{i,j} = \frac{v_{i,j} - \bar{v}_{i,j}}{\hat{\sigma}_{v_{i,j}}}. \quad (3.6)

Here the mean and standard deviation are estimated over the data dimension i.

4. For skm it is also applicable to regularize by using the activation instead of the preactivation in the inner loops as in [11] and [5], as discussed in sections 2.4.1 and 2.4.3.

5. The activation function might be a logistic as in [5], but in case no additive or scaling bias is used it might also make sense to use a hyperbolic tangent, as zero activation will be unachievable otherwise.

6. The algorithm in [11], see section 2.4.1, will not only learn the pose predictions, but will also learn the importance of the activation a_i for capsule j, since W_{i,j} implicitly predicts the length of v_j. This might be desirable when activation and pose are separate as well. One can replace a_i with a_{i,j} = w_{i,j} a_i where w_{i,j} is a prediction matrix separate from W_{i,j}.

7. When implementing the routing methods the iterations are unrolled. This means that in the forward pass the activations and poses are practically propagated through multiple layers with shared weights. This means that in the backward pass one has to choose whether to let the gradient propagate through the activation/pose only in the last layer (last iteration) or in all layers (all iterations).
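To illustrate item 3, a minimal numpy sketch of the per-dimension normalization in (3.6) is given below (assumed shapes; statistics are taken over the data dimension i as stated above).

import numpy as np

def normalize_predictions(v, eps=1e-8):
    """Per-dimension normalization, eq. (3.6).

    v: (n_in, n_out, d) predictions v_{i,j}; mean and standard deviation are
    estimated over the data dimension i for each (j, d).
    """
    mean = v.mean(axis=0, keepdims=True)
    std = v.std(axis=0, keepdims=True)
    return (v - mean) / (std + eps)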

3.3 Existing algorithms

The algorithms from [11] and [5], see sections 2.4.1 and 2.4.3, will be evaluated as well, but the modifications in section 3.2 will also be tested. They will from now on be referred to as dr and em. To see which options apply to which method, see chapter 4. The dr will have the same options for log routing priors as skm, see section 3.1.3, which causes the routing algorithm in [13], see section 2.4.2, to become a special case. The frms and frem algorithms in section 2.4.4 will not be evaluated since they have already been compared to the em algorithm. They were found to gain some speed at the cost of some accuracy.

4 Experiments

The following chapter contains the experimental setup as well as results. Section 4.1 contains information about implementation details, the evaluated methods and the hyperparameter search. Section 4.2 contains information about the datasets and preprocessing. Section 4.3 contains the results.

4.1 Implementation

The implementation is based on [7], which is written in Python using the machine learning framework TensorFlow. Specifications of the hardware used for inference time measurements can be seen in table 4.1.

Table 4.1: Specifications

GPU: NVIDIA GeForce GTX 1080 Ti
CPU: Intel(R) Xeon(R) CPU E5-2643 v4

Each model will have its accuracy measured on MNIST [16] and CIFAR-10 [8]. The evaluated routing methods are:

1. dr as described in section 3.3.

2. em as in section 3.3.

3. mfgmm as described in 3.1.1.

4. mfskm as described in 3.1.2.

5. skm as described in 3.1.3.

25 26 4 Experiments

The test was performed with a capsule network structured similarly to the one in [11], and is as follows:

1. A 256 channel layer with kernel size 9 and stride 1.
2. A 32 channel primary capsule layer with kernel size 9, stride 2, and pose vector length 16.
3. A 10 channel fully connected capsule layer with pose vector length 16.

This is followed by a decoder used for regularization. The decoder is structured as follows:

1. A 512 channel fully connected layer.
2. A 1024 channel fully connected layer.

3. An output sized fully connected layer.

The reconstruction error is the per-pixel mean square error. Matrix pose estimation as in [5] is used for all routing methods, which is the only difference from the network structure in [11]. Initial tests revealed that the validation accuracy was very sensitive to two hyperparameters: the learning rate and specific activations, see tables 4.2 and 4.3. Not a single one of the tests converged when specific activations were used. Hence, in the following tests specific activations were not applied. The learning rate was optimized with the rest of the hyperparameters fixed, using grid search. Later, the rest of the hyperparameters were optimized using random search, with the optimal learning rate. The optimal parameters were chosen as the set with the best validation accuracy. Log-scale was used when generating random continuous parameters. The parameters tested for all routing methods can be seen in table 4.2. Hyperparameters used for some methods can be seen in table 4.3.

Name | Applicability | Description | Range / fixed
Responsibility regularizer | dr, skm | Regularize the responsibility as described in section 3.1.3 | First, Previous, None / First (r0)
Activation function | mfgmm, mfskm, em, skm | Use the logistic or tanh activation function. Item 5 in section 3.2 | Tanh or Logistic / Logistic
Penalize standard deviation | mfgmm, mfskm, em | Downscale activations of clusters with high standard deviation as described in item 2 in section 3.2 | Bool / True
Inner activation | em, dr, skm | Use the activation instead of the preactivation in the inner routing loop as described in item 4 in section 3.2 | Bool / True
Specific activation | em, mfgmm, mfskm, skm | Add a trainable weight that scales the activation differently for different clusters as described in item 1 in section 3.2 | –

Table 4.3: Routing parameters used for some routing methods, described in section 3.2. "Range" is the search range in the random search and "fixed" is the value used when optimizing the learning rate.

Name | Description | Range / fixed
Learning rate | The learning rate of the optimizer | 10^-5 – 10^-2
Number of iterations | The number of iterations used in the routing algorithm | 2 or 3 / 3
Data loss type | Spread loss from [5] or margin loss from [11] | 2 values / Spread
Weight scale | The scale of weight regularization. Separate for routing biases and prediction weights | 10^-6 – 10^-1 / 10^-4
Reconstruction scale | The scale of regularizing the reconstruction error | 10^-6 – 10^-1 / 5 × 10^-3
Initial standard deviation | Initial standard deviation of pose prediction weights | 10^-5 – 10^0 / 10^-2
Normalize predictions | Normalize the prediction with the Frobenius norm as in [13], with per coordinate normalization, or not at all. Item 3 in section 3.2 | 3 values / None
Gradient stop prediction | The gradient is only propagated through the pose prediction in the last cluster iteration. Item 7 in section 3.2 | Bool / False
Gradient stop activation | The gradient is only propagated through the activation in the last cluster iteration. Item 7 in section 3.2 | Bool / False
Routing bias | The initial log responsibility is learned. Else fixed to zero | Bool / True
Scale bias | Scale the preactivation with a trainable bias before the activation function. Item 6 in section 3.2 | Bool / True
Cluster bias | Add a trainable bias to the preactivation before the activation function. Item 6 in section 3.2 | Bool / True

Table 4.2: Hyperparameters. "Range" is the search range in the random search and "fixed" is the value used when optimizing the learning rate.

Method / Name | Description | Range / fixed
mfgmm: λ^(0) | Parameter prior described in section 3.1.1 | 10^-5 – 10^0 / 10^-2
mfgmm: α^(0) | Parameter prior described in section 3.1.1 | 10^-5 – 10^0 / 10^-2
mfgmm: β^(0) | Parameter prior described in section 3.1.1 | 10^-5 – 10^0 / 10^-2
mfskm: τ^(0) | Parameter prior described in section 3.1.2 | 10^-5 – 10^0 / 5 × 10^-1
mfskm: β | Parameter prior described in section 3.1.2 | 10^-5 – 10^0 / 5 × 10^-1
skm: β | Parameter prior described in section 3.1.3 | 10^-4 – 10^4 / 5 × 10^0

Table 4.4: Routing method specific hyperparameters. "Range" is the search range in the random search and "fixed" is the value used when optimizing the learning rate.

Some routing methods also have specific routing parameters, which can be seen in table 4.4. The training was stopped early if it did not exceed a dataset dependent validation accuracy after 7 epochs or if the loss diverged. The validation thresholds used were 0.8 and 0.2 for MNIST and CIFAR-10 respectively. Training is done using an Adam optimizer with TensorFlow's default β-parameters.
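As an illustration of the log-scale sampling used in the random search, the following sketch draws a learning rate uniformly in log-space over the range in table 4.2. The helper name is an assumption and not taken from the thesis code.

import numpy as np

def sample_log_uniform(low, high, rng):
    """Draw a value uniformly in log-space between low and high."""
    return float(np.exp(rng.uniform(np.log(low), np.log(high))))

rng = np.random.default_rng(0)
learning_rate = sample_log_uniform(1e-5, 1e-2, rng)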

4.2 Data preprocessing

The following section describes the data and the data preprocessing. Processed images from the training data can be seen in figures 4.1 and 4.2.

4.2.1 MNIST

The MNIST images were normalized so that the pixel values were in the range [0, 1].

4.2.2 CIFAR-10

The CIFAR-10 images were normalized so that the pixel values were in the range [0, 1]. During training, random brightness with max delta 32/255 as well as random contrast with lower bound 0.5 and upper bound 1.5 were applied. The number of training, validation and testing images can be seen in table 4.5.
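A minimal TensorFlow sketch of this augmentation, assumed to use the standard tf.image operations and not necessarily identical to the thesis code, is:

import tensorflow as tf

def augment_cifar10(image):
    """Normalize to [0, 1] and apply the stated random brightness/contrast."""
    image = tf.cast(image, tf.float32) / 255.0
    image = tf.image.random_brightness(image, max_delta=32.0 / 255.0)
    image = tf.image.random_contrast(image, lower=0.5, upper=1.5)
    return tf.clip_by_value(image, 0.0, 1.0)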

Figure 4.1: Processed images from the MNIST dataset (training images of a 4, a 5, and a 7). The unit of the coordinate system is pixels.

Figure 4.2: Processed images from the CIFAR-10 dataset (training images of an automobile, a frog, and a cat). The unit of the coordinate system is pixels.

Dataset | Training | Validation | Testing
MNIST | 55000 | 5000 | 10000
CIFAR-10 | 45000 | 5000 | 10000

Table 4.5: Number of training, validation and testing images for the different datasets.

4.3 Results

The test accuracy, execution time and the number of parameters of the decoder and encoder of the model with the best validation accuracy on MNIST can be seen in Table 4.6. Network execution time is averaged over 10 runs.

Routing method | Test accuracy | Network time (s) | Layer time (µs) | Encoder params | Decoder params
dr | 0.9956 | 3.84 ± 0.04 | 15 658 ± 424 | 10 822 686 | 1 411 344
em | 0.9927 | 4.55 ± 0.08 | 21 302 ± 361 | 11 486 260 | 1 411 344
mfgmm | 0.9939 | 4.96 ± 0.09 | 26 512 ± 232 | 11 486 270 | 1 411 344
mfskm | 0.9935 | 4.76 ± 0.07 | 23 592 ± 494 | 11 486 270 | 1 411 344
skm | 0.9912 | 4.16 ± 0.06 | 17 480 ± 682 | 11 486 260 | 1 411 344

Table 4.6: Test accuracy on the MNIST dataset. Network execution time is measured in seconds over 78 batches of size 128. Layer execution time is measured in microseconds over 1 batch of size 128.

The test accuracy, execution time and the number of parameters of the decoder and encoder of the model with the best validation accuracy on CIFAR-10 can be seen in Table 4.7. Execution time is averaged over 10 runs.

Routing method | Test accuracy | Network time (s) | Layer time (µs) | Encoder params | Decoder params
dr | 0.7154 | 7.29 ± 0.15 | 27 447 ± 239 | 11 007 508 | 3 756 544
em | 0.6371 | 8.80 ± 0.15 | 45 659 ± 233 | 11 671 082 | 3 756 544
mfgmm | 0.6915 | 8.81 ± 0.15 | 44 000 ± 545 | 11 671 102 | 3 756 544
mfskm | 0.5919 | 8.36 ± 0.14 | 37 677 ± 146 | 11 671 092 | 3 756 544
skm | 0.6504 | 8.06 ± 0.12 | 35 775 ± 496 | 11 671 102 | 3 756 544

Table 4.7: Test accuracy on the CIFAR-10 dataset. Network execution time is measured in seconds over 78 batches of size 128. Layer execution time is measured in microseconds over 1 batch of size 128.

No evaluation of the standard deviation of the test accuracy was made due to time limitations; however, a rough estimate can be made using the standard deviation of the validation accuracy of dr and skm, as seen in table 5.1. The hyperparameters used for the test evaluation can be seen in appendix A.6. The plots in figures 4.3-4.7 show the validation accuracy for different learning rates, and reveal a clear dependency. Validation accuracies of 0 and −0.1 represent a test interrupted because of slow convergence or divergence, respectively.

Figure 4.3: Learning rate grid search for dr

Figure 4.4: Learning rate grid search for em

Figure 4.5: Learning rate grid search for mfgmm

Figure 4.6: Learning rate grid search for mfskm

Figure 4.7: Learning rate grid search for skm

The plots in appendix A.7 contain the results from the random search optimization of the remaining hyperparameters. The plots reveal that the validation accuracy cannot be predicted by looking at the hyperparameters individually; in other words, no single hyperparameter is more important than all of the other parameters.

5 Conclusions

This chapter contains the method analysis, the routing method comparison and suggestions for future work.

5.1 Method analysis

A hyperparameter search will always be biased since someone decides which parameters are optimized and the priors of the variables. In this thesis all search intervals are based on what this author feels is reasonable. Normally one would decrease or increase the search intervals as one gets a better feel for which parameters performed better. But since no clear patterns were found it is hard to validate whether the parameter search was performed in the right intervals. The test here can be compared to the original paper [11], since the network structure really is quite similar. One should note that their test error on MNIST is quite a lot lower, compare 0.24% to 0.44% of the 10000 images. However, to the best knowledge of this author, no one has been able to reproduce the results of the original paper. The results on the CIFAR-10 dataset are not near the state of the art. But one should consider that the network is quite shallow, and that the dataset has a lot of variation between examples of the same class. The results in [15] and [9] are similar, and [10], [5] and [21] achieve higher accuracy using a deeper model, but are still not at the state of the art. Close to state of the art performance is achieved using ResNet or DenseNet backbones with a form of capsule output layer [20]. However, this model is not a fair comparison. Looking at the plots in appendix A.7 one can note that quite a lot of tests stopped early, either because of diverging loss or slow training, which in some part can be due to wide hyperparameter search intervals. However, no clear dependence on the hyperparameters can be found, at least not when looking at them

individually, with the exception of the learning rate. Also, the plots in figures 4.3-4.7 show that the relation between learning rate and validation accuracy is not very smooth. This is reason to attribute the variation in accuracy to sensitivity to the initialization. To test this, networks with the same hyperparameters as the best dr and skm networks were re-trained. The results can be seen in table 5.1. This shows that the networks are not hypersensitive to the initialization, at least if good hyperparameters are chosen. Note that all tests converged. This means that the hyperparameters are responsible for the divergence, and that the performance of the hyperparameters is dependent on one another. It would hence be interesting to optimize the hyperparameters using another method, which could reveal more about the relationships between good hyperparameters.

Routing method | MNIST | CIFAR-10
dr | 0.9962 ± 0.0004 | 0.7186 ± 0.0045
skm | 0.9946 ± 0.0003 | 0.6641 ± 0.0097

Table 5.1: Validation accuracy measured over 50 re-trainings.

5.2 Routing method comparison

As one can see in section 4.3, dr performs better than the rest of the methods. This is interesting, as the only thing differing from skm is that dr is using the spherical vector representation. The original proposers thought the representation would be disadvantageous, due to its inability to distinguish between an acceptable and an excellent match [5]. This could however be a perk. The inability to see the really good matches forces the network to rely on many acceptable matches. The notion that many features are important for generalization is already seen in other parts of the deep learning field. dr is also faster than the rest of the methods. However, this is attributed to the fact that the node probability and traits are modeled by the same vector, leading to fewer parameters. If this were not the case it would have the same number of operations as skm and probably similar execution time. mfgmm also has high accuracy on both datasets. It is however the slowest method. The rest of the methods have similar accuracy, execution time and number of parameters when comparing over both datasets, and no significant difference in performance can be seen.

5.3 Contributions

The contributions of this thesis are:

• Discussions of the theoretical support of some existing routing methods, as well as proposing some modifications to these methods.

• Proposing three new routing methods.
• Evaluation of the modified and proposed methods on the MNIST and CIFAR-10 datasets with respect to accuracy and execution time.

5.4 Future work

In future work it would be interesting to

• try smarter initialization, like Xavier initialization. This might help with the early stopping and the varying results.

• try deeper networks, since this might help with the results on datasets with larger intra-class variation.
• try higher dimensional vectors in the output layer. This might also help with the results on datasets with larger intra-class variation, as the implicit constraint space is probably larger than 16 dimensions.

• try other common deep learning tricks, e.g. batch normalization.
• compare the algorithms on the multi-digit MNIST dataset, as it is one of the main motivations for the relevance of capsule networks [11].
• use other hyperparameter optimization algorithms, e.g. genetic algorithms or Bayesian methods, as these might improve the results of all methods and reveal dependencies between the parameters in a good hyperparameter set.
• try capsule networks on other character datasets, e.g. Casia [12], as capsule networks perform well on MNIST.


A Appendix

Sections A.1, A.2, A.3 and A.5 contain derivations which were found important but too long to include in the report. Section A.6 contains the hyperparameters used when calculating the test accuracy. Section A.7 contains plots of validation accuracy for different hyperparameter values.


A.1 Fixed weights Gaussian mixture model expectation maximization

Continuing where we left off in Section 2.3.2, we find

Q(\theta, \theta^{(t)}) = \sum_Z p(Z \mid X, \theta^{(t)}) \log p(X, Z \mid \theta)

= \sum_{z_1 \in \{1,\dots,K\}} \sum_{z_2 \in \{1,\dots,K\}} \cdots \sum_{z_n \in \{1,\dots,K\}} \Big( \prod_{i'} \prod_{j'} c_{i',j'}^{\mathbf{1}(z_{i'}=j')} \Big) \log \Big( \prod_{i''} \prod_{j''} \big( \pi_{j''} N(x_{i''}; \mu_{j''}, \Sigma_{j''}/a_{i''}) \big)^{\mathbf{1}(z_{i''}=j'')} \Big)

= \sum_{z_1 \in \{1,\dots,K\}} \cdots \sum_{z_n \in \{1,\dots,K\}} \Big( \prod_{i'} \prod_{j'} c_{i',j'}^{\mathbf{1}(z_{i'}=j')} \Big) \sum_{i''} \sum_{j''} \mathbf{1}(z_{i''}=j'') \log\big( \pi_{j''} N(x_{i''}; \mu_{j''}, \Sigma_{j''}/a_{i''}) \big)

= \sum_{i''} \sum_{j''} \sum_{z_1 \in \{1,\dots,K\}} \cdots \sum_{z_n \in \{1,\dots,K\}} \mathbf{1}(z_{i''}=j'') \Big( \prod_{i'} \prod_{j'} c_{i',j'}^{\mathbf{1}(z_{i'}=j')} \Big) \log\big( \pi_{j''} N(x_{i''}; \mu_{j''}, \Sigma_{j''}/a_{i''}) \big)

= \sum_i \sum_j c_{i,j}^{(t)} \log\big( \pi_j N(x_i; \mu_j, \Sigma_j/a_i) \big)

= \sum_i \sum_j p\big(z_i = j \mid x, a, \pi_l^{(t)}, \mu_l^{(t)}, \Sigma_l^{(t)}\big) \log\big( p(x_i, z_i = j \mid a, \pi_l, \mu_l) \big).    (A.1)

Here the extreme decrease in the number of terms comes from the fact that the data are assumed to be independent. Differentiating w.r.t. \Sigma and \mu and setting the derivatives to zero gives

\mu_j^{(t+1)} = \frac{\sum_{i \in \mathcal{L}_l} r_{i,j}^{(t)} a_i x_i}{\sum_{i \in \mathcal{L}_l} r_{i,j}^{(t)} a_i}

\Sigma_j^{(t+1)} = \frac{\sum_{i \in \mathcal{L}_l} r_{i,j}^{(t)} a_i \big(x_i - \mu_j^{(t+1)}\big)\big(x_i - \mu_j^{(t+1)}\big)^T}{\sum_{i \in \mathcal{L}_l} r_{i,j}^{(t)} a_i}.    (A.2)

Maximizing Q(\theta, \theta^{(t)}) w.r.t. \pi under the constraint \sum_{j \in \mathcal{L}_{l+1}} \pi_j = 1 gives:

\pi_j^{(t+1)} = \frac{1}{|\mathcal{L}_l|} \sum_{i \in \mathcal{L}_l} r_{i,j}^{(t)}.    (A.3)
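As an illustration, a minimal NumPy sketch of the M-step in (A.2)-(A.3) is given below. It assumes diagonal covariance matrices and hypothetical array shapes (N points, K clusters, D dimensions); it is not the thesis implementation.

import numpy as np

def m_step(x, r, a, eps=1e-8):
    # M-step of the fixed-weights GMM, equations (A.2)-(A.3).
    # x: (N, D) data points, r: (N, K) responsibilities, a: (N,) weights a_i.
    # Diagonal covariances are an assumption of this sketch.
    ra = r * a[:, None]                         # r_ij * a_i, shape (N, K)
    denom = ra.sum(axis=0) + eps                # sum_i r_ij a_i
    mu = ra.T @ x / denom[:, None]              # (A.2), first line
    diff2 = (x[:, None, :] - mu[None, :, :]) ** 2            # (N, K, D)
    sigma = (ra[:, :, None] * diff2).sum(axis=0) / denom[:, None]  # (A.2), diagonal
    pi = r.mean(axis=0)                         # (A.3)
    return pi, mu, sigma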

A.2 DR optimization

The Lagrange multiplier solution is

L(r_{i,j}, \lambda_i) = \sum_{i \in \mathcal{L}_l} \Bigg( \sum_{j \in \mathcal{L}_{l+1}} r_{i,j}\, a_{i,j} \cos(v_{i,j}, \mu_j) - \mathrm{KL}\big(r_{i,j} \,\|\, r_{i,j}^{(t-1)}\big) - \lambda_i \Big( \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} - 1 \Big) \Bigg)

= \sum_{i \in \mathcal{L}_l} \Bigg( \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} \big( a_{i,j} d_{i,j} - \log r_{i,j} + \log r_{i,j}^{(t-1)} \big) - \lambda_i \Big( \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} - 1 \Big) \Bigg).    (A.4)

Differentiating w.r.t. r_{i,j} and \lambda_i and setting the derivatives to zero gives

r_{i,j} = \frac{\exp\big(a_{i,j} d_{i,j} - 1 + \log r_{i,j}^{(t-1)} - \lambda_i\big)}{\sum_{l \in \mathcal{L}_{l+1}} \exp\big(a_{i,l} d_{i,l} - 1 + \log r_{i,l}^{(t-1)} - \lambda_i\big)} = \frac{\exp\big(a_{i,j} d_{i,j} + \log r_{i,j}^{(t-1)}\big)}{\sum_{l \in \mathcal{L}_{l+1}} \exp\big(a_{i,l} d_{i,l} + \log r_{i,l}^{(t-1)}\big)},    (A.5)

and with b_{i,j} = \log r_{i,j} one obtains steps 4 and 7 of Algorithm 1, if one were to use \mu_j instead of v_j in step 7.
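A minimal sketch of the resulting routing update is given below (NumPy, not the thesis code). The logits b accumulate the weighted agreements a_{i,j} d_{i,j} and the routing weights are their softmax over the clusters, as in (A.5); taking d_{i,j} as the cosine between the prediction and the current cluster mean is an assumption of the sketch (Algorithm 1 uses the squashed output v_j instead of \mu_j in step 7).

import numpy as np

def softmax(b, axis=-1):
    b = b - b.max(axis=axis, keepdims=True)
    e = np.exp(b)
    return e / e.sum(axis=axis, keepdims=True)

def dr_route(u_hat, a, n_iter=3):
    # u_hat: (N, K, D) predictions, a: (N, K) weights a_ij.
    N, K, D = u_hat.shape
    b = np.zeros((N, K))                       # log r^(0), uniform prior
    for _ in range(n_iter):
        r = softmax(b, axis=1)                 # step corresponding to (A.5)
        mu = (r[:, :, None] * u_hat).sum(0) / (r.sum(0)[:, None] + 1e-8)  # (K, D)
        d = np.einsum('nkd,kd->nk', u_hat, mu)
        d /= (np.linalg.norm(u_hat, axis=2) * np.linalg.norm(mu, axis=1) + 1e-8)
        b = b + a * d                          # accumulate weighted agreement
    return softmax(b, axis=1), mu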

A.3 Mean field Gaussian mixture model

Continuing where we left off in Section 3.1.1, we find

\log q^*(z) = E_{\pi,\mu,\tau}[\log p(x, z, \pi, \mu, \tau)] + \mathrm{const}_z

= E_{\mu,\tau}[\log p(x \mid z, \mu, \tau)] + E_\pi[\log p(z \mid \pi)] + \mathrm{const}_z

= \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} z_{i,j} \Big( E_\pi[\log \pi_{i,j}] + E_{\mu,\tau}\Big[ \sum_d \log N\big(x_{i,j,d} \mid \mu_{j,d}, 1/(a_i \tau_{j,d})\big) \Big] \Big) + \mathrm{const}_z

= \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} z_{i,j} (h_{i,j} + g_{i,j}) + \mathrm{const}_z

= \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} z_{i,j} \log p_{i,j} + \mathrm{const}_z.    (A.6)

Taking the exponent and normalizing we find q^*(z) = \prod_{i \in \mathcal{L}_l} \prod_{j \in \mathcal{L}_{l+1}} r_{i,j}^{z_{i,j}}, where r_{i,j} = p_{i,j} / \sum_{j' \in \mathcal{L}_{l+1}} p_{i,j'}. Now we may note that E_{z \sim q^*(z)}[z_{i,j}] = r_{i,j}. Continuing with the other variables we find

\log q^*(\pi, \mu, \tau) = E_z[\log p(x, z, \pi, \mu, \tau)] + \mathrm{const}_{\pi,\mu,\tau}

= E_z[\log p(x \mid z, \mu, \tau) + \log p(z \mid \pi)] + \log p(\pi) + \log p(\mu \mid \tau) + \log p(\tau) + \mathrm{const}_{\pi,\mu,\tau}

= \log q^*(\pi) + \log q^*(\mu, \tau) + \mathrm{const}_{\pi,\mu,\tau}.    (A.7)

Here

\log q^*(\pi) = E_z\Big[ \log \prod_{i \in \mathcal{L}_l} \prod_{j \in \mathcal{L}_{l+1}} \pi_{i,j}^{z_{i,j}} \Big] + \log \Bigg( \prod_{i \in \mathcal{L}_l} \frac{\Gamma\big(|\mathcal{L}_{l+1}|\, \rho_i\big)}{\prod_{j \in \mathcal{L}_{l+1}} \Gamma(\rho_i)} \prod_{j \in \mathcal{L}_{l+1}} \pi_{i,j}^{\rho_{i,j}^{(0)} - 1} \Bigg) + \mathrm{const}_\pi

= \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} \big(\rho_{i,j}^{(0)} - 1\big) \log \pi_{i,j} + \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} \log \pi_{i,j} + \mathrm{const}_\pi    (A.8)

\Rightarrow \quad q^*(\pi) \propto \prod_{i \in \mathcal{L}_l} \prod_{j \in \mathcal{L}_{l+1}} \pi_{i,j}^{\rho_{i,j}^{(0)} + r_{i,j} - 1}.

This is a Dirichlet distribution, q^*(\pi_i) = \mathrm{Dir}(\rho_i^{(n)}), where \rho_{i,j}^{(n)} = \rho_{i,j}^{(0)} + r_{i,j}. Standard moments give E_{\pi \sim q^*(\pi)}[\log \pi_{i,j}] = \gamma\big(\rho_{i,j}^{(n)}\big) - \gamma\big(\sum_{j \in \mathcal{L}_{l+1}} \rho_{i,j}^{(n)}\big). Lastly, we note

\log q^*(\mu, \tau) = E_z[\log p(x \mid z, \mu, \tau)] + \log p(\mu \mid \tau) + \log p(\tau) + \mathrm{const}_{\mu,\tau}

= \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} \sum_d E[z_{i,j}] \log N\big(x_{i,j,d} \mid \mu_{j,d}, 1/(a_i \tau_{j,d})\big) + \sum_{j \in \mathcal{L}_{l+1}} \sum_d \log\Big( N\big(\mu_{j,d} \mid m_{j,d}^{(0)}, 1/(\tau_{j,d} \lambda_{j,d}^{(0)})\big)\, \Gamma\big(\tau_{j,d} \mid \alpha_{j,d}^{(0)}, \beta_{j,d}^{(0)}\big) \Big) + \mathrm{const}_{\mu,\tau}

= \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} \sum_d r_{i,j} \Big( \tfrac{1}{2} \log \tau_{j,d} - \tfrac{a_i \tau_{j,d}}{2} (x_{i,j,d} - \mu_{j,d})^2 \Big) + \sum_{j \in \mathcal{L}_{l+1}} \sum_d \Big( \big(\alpha_{j,d}^{(0)} - \tfrac{1}{2}\big) \log \tau_{j,d} - \tfrac{\lambda_{j,d}^{(0)} \tau_{j,d}}{2} \big(\mu_{j,d} - m_{j,d}^{(0)}\big)^2 - \beta_{j,d}^{(0)} \tau_{j,d} \Big) + \mathrm{const}_{\mu,\tau},    (A.9)

and by sorting the terms we find q^*(\mu_{j,d}, \tau_{j,d}) \sim N\Gamma^{-1}\big(m_{j,d}^{(n)}, \lambda_{j,d}^{(n)}, \alpha_{j,d}^{(n)}, \beta_{j,d}^{(n)}\big), where

\lambda_j^{(n)} = \sum_{i \in \mathcal{L}_l} r_{i,j} a_i + \lambda_j^{(0)}

m_j^{(n)} = \frac{\sum_{i \in \mathcal{L}_l} r_{i,j} a_i x_{i,j} + \lambda_j^{(0)} m_j^{(0)}}{\lambda_j^{(n)}}

\beta_j^{(n)} = \frac{\sum_{i \in \mathcal{L}_l} r_{i,j} a_i x_{i,j}^2 + \lambda_j^{(0)} m_j^{(0)\,2} - \lambda_j^{(n)} m_j^{(n)\,2}}{2} + \beta_j^{(0)}

\alpha_j^{(n)} = \alpha_j^{(0)} + \frac{1}{2} \sum_{i \in \mathcal{L}_l} r_{i,j}.    (A.10)

For the calculation of \log q^*(z) we need

E_{\mu,\tau}\big[\log N\big(x_i \mid \mu_j, 1/(a_i \tau_j)\big)\big] = \sum_d E_{\mu,\tau}\Big[ \tfrac{1}{2} \log(a_i \tau_{j,d}/2\pi) - \tfrac{a_i \tau_{j,d} (x_{i,d} - \mu_{j,d})^2}{2} \Big]

= \tfrac{D}{2} \log(a_i/2\pi) + \sum_d E_{\mu_d,\tau_d}\Big[ \tfrac{1}{2} \log \tau_{j,d} - \tfrac{a_i \tau_{j,d} x_{i,d}^2}{2} + a_i \tau_{j,d} x_{i,d} \mu_{j,d} - \tfrac{a_i \tau_{j,d} \mu_{j,d}^2}{2} \Big]

= \{\text{from standard moments}\}

= \tfrac{D}{2} \log(a_i/2\pi) + \sum_d \Bigg( \frac{\gamma(\alpha_{j,d}) - \log \beta_{j,d}}{2} - \frac{a_i \alpha_{j,d} x_{i,d}^2}{2\beta_{j,d}} + a_i x_{i,d} m_{j,d} \frac{\alpha_{j,d}}{\beta_{j,d}} - \frac{a_i \tfrac{1}{\lambda_{j,d}} + a_i \tfrac{m_{j,d}^2 \alpha_{j,d}}{\beta_{j,d}}}{2} \Bigg)

= \tfrac{D}{2} \log(a_i/2\pi) + \sum_d \Bigg( \frac{\gamma(\alpha_{j,d}) - \log \beta_{j,d}}{2} - \frac{a_i \alpha_{j,d}}{2\beta_{j,d}} (x_{i,d} - m_{j,d})^2 - \frac{a_i}{2\lambda_{j,d}} \Bigg).    (A.11)

We will also be interested in

E_\mu[\mu] = m    (A.12)

E_\tau[\tau] = \frac{\alpha}{\beta}.    (A.13)

Now we make a few simplifications. Since r_{i,j} = \mathrm{softmax}_j(h_{i,j} + g_{i,j}) we may remove any constants not depending on j. We may thus remove the term \tfrac{D}{2}\log(a_i/2\pi) from (A.11), at least if specific activation is not used, see table 4.3. Assuming m^{(0)} = 0 simplifies the calculations of m^{(n)} and \beta^{(n)}. All in all this results in Algorithm 6, which can be seen in Section 3.1.1.
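For reference, a minimal NumPy sketch of one such mean field update is given below, following (A.8), (A.10) and the simplified (A.11) with m^{(0)} = 0. Scalar priors, the interpretation of \gamma(\cdot) as the digamma function and the array shapes are assumptions of the sketch; Algorithm 6 in Section 3.1.1 remains the authoritative description.

import numpy as np
from scipy.special import digamma

def mfgmm_step(x, r, a, alpha0, beta0, lam0, rho0):
    # x: (N, K, D) predictions x_ij, r: (N, K) responsibilities, a: (N,) activations.
    ra = r * a[:, None]                                     # (N, K)
    lam_n = ra.sum(0)[:, None] + lam0                       # (A.10), shape (K, 1)
    m_n = np.einsum('nk,nkd->kd', ra, x) / lam_n            # (A.10) with m0 = 0
    beta_n = 0.5 * (np.einsum('nk,nkd->kd', ra, x ** 2) - lam_n * m_n ** 2) + beta0
    alpha_n = alpha0 + 0.5 * r.sum(0)[:, None]              # (K, 1)

    # Expected log-likelihood (A.11), without the constant D/2 log(a_i / 2 pi)
    h = (0.5 * (digamma(alpha_n) - np.log(beta_n))[None]
         - 0.5 * a[:, None, None] * (alpha_n / beta_n)[None] * (x - m_n[None]) ** 2
         - 0.5 * a[:, None, None] / lam_n[None]).sum(-1)    # sum over d -> (N, K)

    # Expected log-prior from the Dirichlet posterior (A.8)
    rho_n = rho0 + r
    g = digamma(rho_n) - digamma(rho_n.sum(1, keepdims=True))

    logits = h + g
    logits -= logits.max(1, keepdims=True)
    r_new = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
    return r_new, m_n, alpha_n / beta_n                     # E[mu], E[tau] as in (A.12)-(A.13)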

A.4 Mean field soft k-means

Continuing where we left off in Section 3.1.2, we begin by estimating the log density functions of z and \pi as in Appendix A.3, by

\log q^*(z) = \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} z_{i,j} \Big( E_\pi[\log \pi_{i,j}] + E_\mu\Big[ \sum_d \log N\big(x_{i,d} \mid \mu_{j,d}, 1/(a_i \beta)\big) \Big] \Big) + \mathrm{const}_z

= \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} z_{i,j} (h_{i,j} + g_{i,j})    (A.14)

and

q^*(\pi_i) \sim \mathrm{Dir}\big(\rho_{i,j}^{(0)} + r_{i,j}\big), \quad \text{where} \quad r_{i,j} = E_{z \sim q^*(z)}[z_{i,j}].    (A.15)

But we now find

\log q^*(\mu) = E_z[\log p(x \mid z, \mu)] + \log p(\mu) + \mathrm{const}_\mu

= \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} \sum_d E[z_{i,j}] \log N\big(x_{i,j,d} \mid \mu_{j,d}, 1/(\beta a_i)\big) + \sum_{j \in \mathcal{L}_{l+1}} \sum_d \log N\big(\mu_{j,d} \mid m_{j,d}^{(0)}, 1/\tau_{j,d}^{(0)}\big) + \mathrm{const}_\mu

= \sum_{i \in \mathcal{L}_l} \sum_{j \in \mathcal{L}_{l+1}} \sum_d r_{i,j} \Big( \tfrac{1}{2} \log(\beta a_i/2\pi) - \tfrac{a_i \beta}{2} (x_{i,j,d} - \mu_{j,d})^2 \Big) + \sum_{j \in \mathcal{L}_{l+1}} \sum_d \Big( \tfrac{1}{2} \log(\tau_{j,d}/2\pi) - \tfrac{\tau_{j,d}}{2} \big(\mu_{j,d} - m_{j,d}^{(0)}\big)^2 \Big) + \mathrm{const}_\mu.    (A.16)

Sorting the terms we find q^*(\mu_{j,d}) \sim N\big(m_{j,d}^{(n)}, \tau_{j,d}^{(n)}\big), where

\tau_{j,d}^{(n)} = \beta \sum_{i \in \mathcal{L}_l} a_i r_{i,j} + \tau_{j,d}^{(0)}

m_{j,d}^{(n)} = \frac{m_{j,d}^{(0)} \tau_{j,d}^{(0)} + \beta \sum_{i \in \mathcal{L}_l} r_{i,j} a_i x_{i,j,d}}{\tau_{j,d}^{(n)}}.    (A.17)

Here

g_{i,j} = E_{\pi \sim q^*(\pi)}[\log \pi_{i,j}] = \gamma\big(\rho_{i,j}^{(n)}\big) - \gamma\Big( \sum_{j \in \mathcal{L}_{l+1}} \rho_{i,j}^{(n)} \Big)    (A.18)

and

h_{i,j} = E_\mu\Big[ \sum_d \log N\big(x_{i,j,d} \mid \mu_{j,d}, 1/(\beta a_i)\big) \Big]

= \sum_d E_\mu\Big[ \tfrac{1}{2} \log \tfrac{\beta a_i}{2\pi} - \tfrac{\big(x_{i,j,d}^2 - 2 x_{i,j,d} \mu_{j,d} + \mu_{j,d}^2\big) \beta a_i}{2} \Big]

= \sum_d \Big( \tfrac{1}{2} \log \tfrac{\beta a_i}{2\pi} - \tfrac{\big((x_{i,j,d} - m_{j,d})^2 + 1/\tau_{j,d}\big) \beta a_i}{2} \Big).    (A.19)

As in Appendix A.3 we may ignore the \tfrac{1}{2}\log\tfrac{\beta a_i}{2\pi} term and assume m^{(0)} = 0, which leads to simplifications. All in all this results in Algorithm 7, which can be seen in Section 3.1.2.
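A minimal NumPy sketch of one mean field soft k-means update, following (A.17)-(A.19) with m^{(0)} = 0, is given below. Scalar priors, the digamma interpretation of \gamma(\cdot) and the array shapes are assumptions of the sketch; Algorithm 7 in Section 3.1.2 remains the authoritative description.

import numpy as np
from scipy.special import digamma

def mfskm_step(x, r, a, beta, tau0, rho0):
    # x: (N, K, D) predictions, r: (N, K) responsibilities, a: (N,) activations.
    ra = r * a[:, None]                                   # (N, K)
    tau_n = beta * ra.sum(0)[:, None] + tau0              # (A.17), shape (K, 1)
    m_n = beta * np.einsum('nk,nkd->kd', ra, x) / tau_n   # (A.17) with m0 = 0

    # (A.19), without the 1/2 log(beta a_i / 2 pi) constant
    h = -0.5 * beta * a[:, None] * (
        ((x - m_n[None]) ** 2 + 1.0 / tau_n[None]).sum(-1))

    # (A.18)
    rho_n = rho0 + r
    g = digamma(rho_n) - digamma(rho_n.sum(1, keepdims=True))

    logits = h + g
    logits -= logits.max(1, keepdims=True)
    r_new = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
    return r_new, m_n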

A.5 Soft k-means

Continuing where we left off in Section 3.1.3, we find

L(r, \mu, \lambda_i) = \sum_{i \in \mathcal{L}_l} \Bigg( \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} \big( a_{i,j} d_{i,j} + \log r_{i,j} - \log r_{i,j}^{0} \big) + \lambda_i \Big( \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} - 1 \Big) \Bigg).    (A.20)

Differentiating w.r.t. r_{i,j} and \lambda_i and setting the derivatives to zero gives

a_{i,j} d_{i,j} + \log r_{i,j} - \log r_{i,j}^{0} + 1 + \lambda_i = 0, \quad i \in \mathcal{L}_l,\; j \in \mathcal{L}_{l+1}, \quad \text{and} \quad \sum_{j \in \mathcal{L}_{l+1}} r_{i,j} = 1.    (A.21)

Letting

r_{i,j} = e^{-a_{i,j} d_{i,j} + \log r_{i,j}^{0} - 1 - \lambda_i}    (A.22)

solves the first part of equation (A.21). Letting \lambda_i = \log\big( \sum_{j \in \mathcal{L}_{l+1}} e^{-a_{i,j} d_{i,j} + \log r_{i,j}^{0}} \big) - 1 solves the second part as well, since then

r_{i,j} = \frac{e^{-a_{i,j} d_{i,j} + \log r_{i,j}^{0}}}{\sum_{l \in \mathcal{L}_{l+1}} e^{-a_{i,l} d_{i,l} + \log r_{i,l}^{0}}}.    (A.23)

Differentiating w.r.t. \mu and setting the derivative to zero gives the \mu update. The resulting Algorithm 8 can be seen in Section 3.1.3.
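A minimal NumPy sketch of the resulting soft k-means routing loop is given below. The r update follows (A.23); the \mu update is sketched as the a-and-r-weighted mean, which is the stationary point for a squared Euclidean distance and is an assumption here, since the exact update is given by Algorithm 8.

import numpy as np

def skm_route(x, a, d_fn, n_iter=3):
    # x: (N, K, D) predictions, a: (N, K) weights a_ij,
    # d_fn(x, mu) -> (N, K) distances between predictions and cluster centres.
    N, K, D = x.shape
    log_r = np.full((N, K), -np.log(K))                   # uniform r^0
    for _ in range(n_iter):
        w = np.exp(log_r) * a
        mu = np.einsum('nk,nkd->kd', w, x) / (w.sum(0)[:, None] + 1e-8)
        logits = log_r - a * d_fn(x, mu)                  # (A.23)
        logits -= logits.max(1, keepdims=True)
        r = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
        log_r = np.log(r + 1e-12)
    return r, mu

# e.g. squared Euclidean distance between predictions and cluster means
sq_dist = lambda x, mu: ((x - mu[None]) ** 2).sum(-1)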

A.6 Hyper parameters for test accuracy evaluation

Method / Hyper parameter       MNIST           CIFAR-10

dr
  Learning rate                1.74 × 10^-3    2.09 × 10^-4
  Number of iterations         2               2
  Data loss type               Margin          Margin
  Weight scale routing         3.78 × 10^-5    3.15 × 10^-5
  Weight scale prediction      1.20 × 10^-5    8.59 × 10^-5
  Reconstruction scale         5.02 × 10^-4    2.20 × 10^-6
  Initial standard derivation  1.20 × 10^-3    4.17 × 10^-4
  Normalize predictions        None            None
  Gradient stop prediction     True            False
  Inner activations            True            True
  Addative bias                True            True
  Scale bias                   True            False
  Routing bias learned         True            True
  Regularize log prior         Previous        Previous

em
  Learning rate                1.91 × 10^-4    3.98 × 10^-5
  Number of iterations         2               3
  Data loss type               Margin          Margin
  Weight scale routing         4.66 × 10^-4    7.27 × 10^-6
  Weight scale prediction      3.54 × 10^-5    5.72 × 10^-4
  Reconstruction scale         2.18 × 10^-5    7.11 × 10^-5
  Initial standard derivation  9.32 × 10^-5    1.14 × 10^-3
  Normalize predictions        None            Matrix
  Activation function          Logistic        Tanh
  Gradient stop prediction     False           True
  Gradient stop activation     False           True
  Inner activations            False           True
  Addative bias                False           True
  Scale bias                   True            False
  Punish standard derivation   True            False
  Routing bias learned         True            False

mfskm
  Learning rate                1.20 × 10^-3    9.55 × 10^-4
  Number of iterations         3               2
  Data loss type               Spread          Margin
  Weight scale routing         4.20 × 10^-3    1.50 × 10^-5
  Weight scale prediction      2.57 × 10^-6    7.17 × 10^-4
  Reconstruction scale         9.04 × 10^-5    5.12 × 10^-5
  Initial standard derivation  1.86 × 10^-3    7.70 × 10^-2
  Normalize predictions        None            Std
  Activation function          Logistic        Tanh
  Specific activation          False           False
  Gradient stop prediction     False           True
  Gradient stop activation     True            True
  Addative bias                True            False
  Scale bias                   True            True
  Punish standard derivation   False           False
  Routing bias learned         False           True
  Prior τ                      1.76 × 10^-1    7.40 × 10^-3
  Prior β                      1.49 × 10^-1    1.80 × 10^-2

mfgmm
  Learning rate                5.75 × 10^-4    1.20 × 10^-4
  Number of iterations         2               2
  Data loss type               Margin          Spread
  Weight scale routing         7.75 × 10^-2    1.07 × 10^-4
  Weight scale prediction      2.67 × 10^-5    4.55 × 10^-6
  Reconstruction scale         7.24 × 10^-3    2.73 × 10^-5
  Initial standard derivation  4.78 × 10^-3    1.66 × 10^-3
  Normalize predictions        None            Matrix
  Activation function          Tanh            Logistic
  Gradient stop prediction     False           True
  Gradient stop activation     False           True
  Addative bias                True            True
  Scale bias                   True            True
  Punish standard derivation   True            False
  Routing bias learned         True            True
  Prior α                      9.80 × 10^-2    1.61 × 10^-1
  Prior λ                      6.70 × 10^-2    3.37 × 10^-1
  Prior β                      8.54 × 10^-2    2.94 × 10^-2

skm
  Learning rate                1.20 × 10^-3    1.58 × 10^-4
  Number of iterations         2               3
  Data loss type               Margin          Spread
  Weight scale routing         3.71 × 10^-5    3.92 × 10^-3
  Weight scale prediction      4.09 × 10^-5    2.83 × 10^-5
  Reconstruction scale         8.42 × 10^-4    1.31 × 10^-5
  Initial standard derivation  4.93 × 10^-3    1.62 × 10^-5
  Normalize predictions        Std             Std
  Activation function          Logistic        Logistic
  Gradient stop prediction     False           False
  Gradient stop activation     True            False
  Inner activations            True            False
  Addative bias                False           True
  Scale bias                   True            True
  Routing bias learned         True            True
  Regularize log prior         Previous        Simple
  β                            1.56 × 10^-4    9.88 × 10^1

Table A.1: Hyper parameters used to get the results in table 4.6.

A.7 Validation

This section contains validation accuracy for different hyperparameter values, shown in figures A.1-A.78. Each graph contains the validation accuracy for 50 different values of one hyperparameter. Note that the other hyperparameters were varied as well, since random search was used. For Boolean parameters, 0 is false and 1 is true. For multiple choice parameters, the options are assigned a value which is clearly marked in the graph. Continuous parameters are plotted on a logarithmic scale. The figure caption states which hyperparameter is plotted. If the training stopped early due to a slow increase in validation accuracy, the validation accuracy in the figures is fixed to 0. If the training stopped early due to exploding loss, the validation accuracy in the figures is fixed to -0.1. The number of tests which succeeded and which stopped early are marked in the figures.
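For clarity, the sketch below illustrates the search convention used for these plots: continuous hyperparameters are sampled log-uniformly and early-stopped runs are recorded with the sentinel accuracies 0 and -0.1. The interval bounds, the parameter names and the train_and_validate placeholder are hypothetical; this is not the thesis code.

import numpy as np

rng = np.random.default_rng(0)

def sample_log_uniform(low, high):
    # Continuous hyperparameters are drawn log-uniformly (hence the log-scale plots).
    return float(np.exp(rng.uniform(np.log(low), np.log(high))))

results = []
for _ in range(50):
    params = {
        "learning_rate": sample_log_uniform(1e-5, 1e-2),        # hypothetical bounds
        "weight_scale_routing": sample_log_uniform(1e-6, 1e-1),
        "n_iterations": int(rng.integers(1, 4)),
    }
    # acc = train_and_validate(params)   # placeholder for one training run
    acc = float(rng.uniform(0.9, 1.0))   # dummy value so the sketch runs
    # Early-stopped runs would instead be stored with the sentinel values:
    # acc = 0.0 if training was too slow, acc = -0.1 if the loss exploded.
    results.append((params, acc))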

Figure A.1: Addative cluster bias, dr

Figure A.2: Beta update type, dr

Figure A.3: Stop gradient of activation in inner loops, dr

Figure A.4: Stop gradient of prediction in inner loops, dr

Figure A.5: Activation function is applied in inner loops, dr

Figure A.6: Number of iterations, dr

Figure A.7: Loss type, dr

Figure A.8: Normalize predictions, dr

Figure A.9: Scale of prediction weight regularizer, dr

Figure A.10: Scale of reconstruction loss, dr

Figure A.11: Train the routing bias, dr

Figure A.12: Scale of the routing weights regularizer, dr

Figure A.13: Multiplicative cluster bias, dr

Figure A.14: Standard derivation of initialized weights, dr

Figure A.15: Activation function, em

Figure A.16: Addative cluster bias, em

Figure A.17: Stop gradient of activation in inner loops, em

Figure A.18: Stop gradient of prediction in inner loops, em

Figure A.19: Activation function is applied in inner loops, em

Figure A.20: Number of iterations, em

Figure A.21: Loss type, em

Figure A.22: Normalize predictions, em

Figure A.23: Scale of prediction weight regularizer, em

Figure A.24: Punish high standard derivation, em

Figure A.25: Scale of reconstruction loss, em

Figure A.26: Train the routing bias, em

Figure A.27: Scale of the routing weights regularizer, em

Figure A.28: Multiplicative cluster bias, em

Figure A.29: Standard derivation of initialized weights, em

Figure A.30: Activation function, mfgmm

Figure A.31: Addative cluster bias, mfgmm

Figure A.32: Stop gradient of activation in inner loops, mfgmm

Figure A.33: Stop gradient of prediction in inner loops, mfgmm

Figure A.34: Number of iterations, mfgmm

Figure A.35: Loss type, mfgmm

Figure A.36: Normalize predictions, mfgmm

Figure A.37: Scale of prediction weight regularizer, mfgmm

Figure A.38: Punish high standard derivation, mfgmm

Figure A.39: Scale of reconstruction loss, mfgmm

Figure A.40: Train the routing bias, mfgmm

Figure A.41: Scale of the routing weights regularizer, mfgmm

Figure A.42: Multiplicative cluster bias, mfgmm

Figure A.43: Standard derivation of initialized weights, mfgmm

Figure A.44: α(0) in algorithm 6, mfgmm

Figure A.45: β(0) in algorithm 6, mfgmm

Figure A.46: λ(0) in algorithm 6, mfgmm

Figure A.47: Activation function, mfskm

Figure A.48: Addative cluster bias, mfskm

Figure A.49: Stop gradient of activation in inner loops, mfskm

Figure A.50: Stop gradient of prediction in inner loops, mfskm

Figure A.51: Number of iterations, mfskm

Figure A.52: Loss type, mfskm

Figure A.53: Normalize predictions, mfskm

Figure A.54: Scale of prediction weight regularizer, mfskm

Figure A.55: Punish high standard derivation, mfskm

Figure A.56: Scale of reconstruction loss, mfskm

Figure A.57: Train the routing bias, mfskm

Figure A.58: Scale of the routing weights regularizer, mfskm

Figure A.59: Multiplicative cluster bias, mfskm

Figure A.60: Standard derivation of initialized weights, mfskm

Figure A.61: β(0) in algorithm 7, mfskm

Figure A.62: τ(0) in algorithm 7, mfskm

Figure A.63: Activation function, skm

Figure A.64: Addative cluster bias, skm

Figure A.65: Beta update type, skm

Figure A.66: Stop gradient of activation in inner loops, skm

Figure A.67: Stop gradient of prediction in inner loops, skm

Figure A.68: Activation function is applied in inner loops, skm

Figure A.69: Number of iterations, skm

Figure A.70: Loss type, skm

Figure A.71: Normalize predictions, skm

Figure A.72: Scale of prediction weight regularizer, skm

Figure A.73: Scale of reconstruction loss, skm

Figure A.74: Train the routing bias, skm

Figure A.75: Scale of the routing weights regularizer, skm

Figure A.76: Multiplicative cluster bias, skm

Figure A.77: Standard derivation of initialized weights, skm

Figure A.78: β in algorithm 8, skm

Bibliography

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. Cited on page 12.

[2] Pedro F. Felzenszwalb, Ross B. Girshick, David A. McAllester, and Deva Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32:1627–1645, 2009. Cited on page 2.

[3] M. A. Fischler and R. A. Elschlager. The representation and matching of pictorial structures. IEEE Trans. Comput., 22(1):67–92, January 1973. ISSN 0018-9340. doi: 10.1109/T-C.1973.223602. URL https://doi.org/10.1109/T-C.1973.223602. Cited on page 2.

[4] I. D. Gebru, X. Alameda-Pineda, F. Forbes, and R. Horaud. EM algorithms for weighted-data clustering with application to audio-visual scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(12):2402–2415, Dec 2016. ISSN 0162-8828. doi: 10.1109/TPAMI.2016.2522425. Cited on page 10.

[5] Geoffrey E. Hinton, Sara Sabour, and Nicholas Frosst. Matrix capsules with EM routing. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HJWLfGWRb. Cited on pages 1, 2, 3, 8, 14, 15, 23, 24, 26, 28, 35, and 36.

[6] Geoffrey E. Hinton. A parallel computation that assigns canonical object-based frames of reference. In Proceedings of the 7th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI'81, pages 683–685, San Francisco, CA, USA, 1981. Morgan Kaufmann Publishers Inc. URL http://dl.acm.org/citation.cfm?id=1623264.1623282. Cited on page 1.

[7] Jiawei He and Huadong Liao. CapsLayer: An advanced library for capsule theory. http://naturomics.com/CapsLayer, 2017. Cited on page 25.

[8] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009. Cited on pages iii, 3, and 25.


[9] Prem Qu Nair, Rohan Doshi, and Stefan Keselj. Pushing the limits of capsule networks. 2018. Cited on page 35.

[10] Sai Samarth R. Phaye, Apoorva Sikka, Abhinav Dhall, and Deepti R. Bathula. Dense and diverse capsule networks: Making the capsules learn better. CoRR, abs/1805.04001, 2018. URL http://arxiv.org/abs/1805.04001. Cited on page 35.

[11] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. CoRR, abs/1710.09829, 2017. URL http://arxiv.org/abs/1710.09829. Cited on pages iii, 1, 2, 3, 7, 8, 9, 12, 13, 23, 24, 26, 28, 35, and 37.

[12] Da-Han Wang, Cheng-Lin Liu, Jin-Lun Yu, and Xiang-Dong Zhou. CASIA-OLHWDB1: A database of online handwritten Chinese characters. 2009 10th International Conference on Document Analysis and Recognition, pages 1206–1210, 2009. Cited on page 37.

[13] Dilin Wang and Qiang Liu. An optimization view on dynamic routing between capsules, 2018. URL https://openreview.net/forum?id=HJjtFYJDf. Cited on pages 13, 14, 23, 24, and 28.

[14] Markus Weber. Unsupervised Learning of Models for Object Recognition. PhD thesis, Pasadena, CA, USA, 2000. AAI9972646. Cited on page 2.

[15] Edgar Xi, Selina Bing, and Yang Jin. Capsule network performance on complex data. 2017. URL https://arxiv.org/pdf/1712.03480.pdf. Cited on page 35.

[16] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. The MNIST database of handwritten digits. URL http://yann.lecun.com/exdb/mnist/. Cited on pages iii, 3, and 25.

[17] Richard S. Zemel and Geoffrey E. Hinton. Discovering viewpoint-invariant relationships that characterize objects. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 299–305. Morgan-Kaufmann, 1991. URL http://papers.nips.cc/paper/329-discovering-viewpoint-invariant-relationships-that-characterize-objects.pdf. Cited on page 2.

[18] Richard S. Zemel and Geoffrey E. Hinton. Learning population codes by minimizing description length. Neural Computation, 7(3):549–564, 1995. doi: 10.1162/neco.1995.7.3.549. URL https://app.dimensions.ai/details/publication/pub.1016522705 and http://www.cs.utoronto.ca/~hinton/absps/mdlpop.pdf. Exported from https://app.dimensions.ai on 2018/12/07. Cited on page 9.

[19] Richard S. Zemel, Michael C. Mozer, and Geoffrey E. Hinton. Traffic: Recognizing objects using hierarchical reference frame transformations. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 266–273. Morgan-Kaufmann, 1990. URL http://papers.nips.cc/paper/241-traffic-recognizing-objects-using-hierarchical-reference-frame-transformations.pdf. Cited on page 2.

[20] Liheng Zhang, Marzieh Edraki, and Guo-Jun Qi. CapProNet: Deep feature learning via orthogonal projections onto capsule subspaces. CoRR, abs/1805.07621, 2018. URL http://arxiv.org/abs/1805.07621. Cited on page 35.

[21] Suofei Zhang, Wei Zhao, Xiaofu Wu, and Quan Zhou. Fast dynamic routing based on weighted kernel density estimation. CoRR, abs/1805.10807, 2018. URL http://arxiv.org/abs/1805.10807. Cited on pages 2, 8, 16, and 35.