
Everything you wanted to know about Deep Learning for Computer Vision but were afraid to ask

Moacir A. Ponti, Leonardo S. F. Ribeiro, Tiago S. Nazare
ICMC – University of São Paulo
São Carlos/SP, 13566-590, Brazil
Email: [ponti, leonardo.sampaio.ribeiro, tiagosn]@usp.br

Tu Bui, John Collomosse
CVSSP – University of Surrey
Guildford, GU2 7XH, UK
Email: [t.bui, j.collomosse]@surrey.ac.uk

Abstract—Deep Learning methods are currently the state-of-the-art in many Computer Vision and Image Processing problems, in particular image classification. After years of intensive investigation, a few models matured and became important tools, including Convolutional Neural Networks (CNNs), Siamese and Triplet Networks, Auto-Encoders (AEs) and Generative Adversarial Networks (GANs). The field is fast-paced and there is a lot of terminology to catch up for those who want to venture into Deep Learning waters. This paper aims to introduce the most fundamental concepts of Deep Learning for Computer Vision, in particular CNNs, AEs and GANs, including architectures, inner workings and optimization. We offer an updated description of the theoretical and practical knowledge of working with those models. After that, we describe Siamese and Triplet Networks, not often covered in tutorial papers, as well as review the literature on recent and exciting topics such as visual stylization, pixel-wise prediction and video processing. Finally, we discuss the limitations of Deep Learning for Computer Vision.

Keywords—Computer Vision; Deep Learning; Image Processing; Video Processing

I. INTRODUCTION

The field of Computer Vision was revolutionized in the past years (in particular after 2012) due to Deep Learning techniques. This is mainly due to two reasons: the availability of labelled image datasets with millions of images [1], [2], and GPU computing, which allowed speeding up computations. Before that, there were studies exploring hierarchical representations with neural networks such as Fukushima's Neocognitron [3] and LeCun's neural networks for digit recognition [4]. Although those techniques were known to the Machine Learning and Artificial Intelligence communities, the efforts of Computer Vision researchers during the 2000's went in a different direction, in particular using approaches based on Scale-Invariant features, Bag-of-Features, Spatial Pyramids and related methods [5].

After the publication of the AlexNet Convolutional Neural Network model [6], many research communities realized the power of such methods and, from that point on, Deep Learning (DL) invaded the Visual Computing fields: Computer Vision, Image Processing and Computer Graphics. Convolutional Neural Networks (CNNs), Deep Belief Nets (DBNs), Restricted Boltzmann Machines (RBMs) and Autoencoders (AEs) started appearing as a basis for state-of-the-art methods in several computer vision applications (e.g. remote sensing [7], [8], [9] and re-identification [10]). The ImageNet challenge [1] played a major role in this process, starting a race for the model that could beat the current champion in the image classification challenge, but also in detection, object recognition and other tasks. In order to accomplish that, different architectures and combinations of DBM were employed. Also, novel types of networks – such as Siamese Networks, Triplet Networks and Generative Adversarial Networks (GANs) – were designed.

Deep Learning techniques offer an important set of methods suited for tasks within the domains of digital visual content. It is noticeable that those methods comprise a diverse variety of models, components and algorithms that may be dissimilar to what one may be used to in Computer Vision. A myriad of keywords within this context makes the literature in this field almost a different language: Feature maps, Activation, Receptive Fields, Dropout, ReLU, MaxPool, Softmax, SGD, Adam, FC, Generator, Discriminator, Shared Weights, etc. This can make it hard for a beginner to understand and catch up with the recent studies.

There are DL methods we do not cover in this paper, including Deep Belief Networks (DBN), Deep Boltzmann Machines (DBM) and also those using Recurrent Neural Networks (RNN) and Long Short-Term Memory networks (LSTMs). We refer to [11]–[14] for DBN and DBM-related studies, and [15]–[17] for RNN-related studies.

Deep Learning often involves learning hierarchical representations using a series of layers that operate by processing an input and generating a series of representations that are then given as input to the next layer. By having spaces of sufficiently high dimensionality, such methods are able to capture the scope of the relationships found in the original data so that it finds the "right" representation for the task at hand [21]. This can be seen as separating the multiple manifolds that represent data via a series of transformations.

The paper is organized as follows:

• Section II provides definitions and prerequisites.
• Section III aims to present a detailed and updated description of the DL's main terminology, building blocks and algorithms of the Convolutional Neural Network (CNN), since it is widely used in Computer Vision, including:
  1) Components: Convolutional Layer, Activation Function, Feature/Activation Map, Pooling, Normalization, Fully Connected Layers, Parameters, Loss Function;
  2) Algorithms: Optimization (SGD, Momentum, Adam, etc.) and Training;
  3) Architectures: AlexNet, VGGNet, ResNet, Inception, DenseNet;
  4) Beyond Classification: fine-tuning, feature extraction and transfer learning.
• Section IV describes Autoencoders (AEs):
  1) Undercomplete AEs;
  2) Overcomplete AEs: Denoising and Contractive;
  3) Generative AE: Variational AEs.

• Section V introduces Generative Adversarial Networks (GANs):
  1) Generative Models: Generator/Discriminator;
  2) Training and Loss Functions.
• Section VI is devoted to Siamese and Triplet Networks:
  1) SiameseNet with Contrastive Loss;
  2) TripletNet with Triplet Loss.
• Section VII reviews Applications of Deep Networks in Image Processing and Computer Vision, including:
  1) Visual Stylization;
  2) Image Processing and Pixel-Wise Prediction;
  3) Networks for Video Data: Multi-stream and C3D.
• Section VIII concludes the paper by discussing the Limitations of Deep Learning methods for Computer Vision.

II. DEEP LEARNING: PREREQUISITES AND DEFINITIONS

The prerequisites needed to understand Deep Learning for Computer Vision include basics of Machine Learning (ML) and Image Processing (IP). Since those are out of the scope of this paper, we refer to [18], [19] for an introduction to such fields. We also assume the reader is familiar with Linear Algebra and Calculus, as well as Probability and Optimization — a review of those topics can be found in Part I of the Goodfellow et al. textbook on Deep Learning [20].

Machine learning methods basically try to discover a model (e.g. rules, parameters) by using a set of input data points and some way to guide the algorithm in order to learn from this input. In supervised learning, we have examples of expected output, whereas in unsupervised learning some assumption is made in order to build the model. However, in order to achieve success, it is paramount to have a good representation of the input data, i.e. a good set of features that will produce a feature space in which an algorithm can use its bias in order to learn. One of the main ideas of Deep Learning is to solve the problem of finding this representation by learning it from the data: it defines representations that are expressed in terms of other, simpler ones [20]. Another is to assume that depth (in terms of successive representations) allows learning a sequence of parallel instructions that transforms an initial input vector, mapping one space to another.

III. CONVOLUTIONAL NEURAL NETWORKS

Convolutional Neural Networks (CNNs or ConvNets) are probably the most well known Deep Learning model used to solve Computer Vision tasks, in particular image classification. The basic building blocks of CNNs are convolutions, pooling (downsampling) operators, activation functions, and fully-connected layers, which are essentially similar to hidden layers of a Multilayer Perceptron (MLP). Architectures such as AlexNet [6], VGG [22], ResNet [23] and GoogLeNet [24] became very popular, used as subroutines to obtain representations that are then offered as input to other algorithms to solve different tasks.

We can write this network as a composition of a sequence of functions f_l(.) (related to some layer l) that takes as input a vector x_l and a set of parameters W_l, and outputs a vector x_{l+1}:

f(x) = f_L(··· f_2(f_1(x_1, W_1); W_2) ···; W_L)

x_1 is the input image, and in a CNN the most characteristic functions f_l(.) are convolutions. Convolutions and other operators work as building blocks or layers of a CNN: convolution, pooling, normalization and linear combination (produced by the fully connected layer). In the next sections we describe each one of those building blocks.

A. Convolutional Layer

A convolutional layer is composed of a set of filters, each to be applied to the entire input vector. Each filter is nothing but a k × k matrix of weights (or values) w_i. Each weight is a parameter of the model to be learned. We refer to [18] for an introduction to convolution and image filters.

Each filter will produce what can be seen as an affine transformation of the input. Another view is that each filter produces a linear combination of all pixel values in a neighbourhood defined by the size of the filter. Each region that the filter processes is called local receptive field: an output value (pixel) is a combination of the input pixels in this local receptive field (see Figure 1). That makes the convolutional layer different from layers of an MLP, for example; in an MLP each neuron will produce a single output based on all values from the previous layer, whereas in a convolutional layer an output value f(i, x, y) is based on a filter i and local data coming from the previous layer centred at a position (x, y).
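To make the local receptive field concrete, the sketch below (our own illustration, not code from the paper; array names and sizes are illustrative assumptions) computes one output value of a convolutional layer as a linear combination of a k × k × d neighbourhood, and then slides that computation over the image to build one feature map per filter.

```python
import numpy as np

def conv_single_position(image, filt, x, y):
    """Output value f(i, x, y): linear combination of the k x k x d
    local receptive field of `image` centred at (x, y) with filter `filt`."""
    k = filt.shape[0]
    r = k // 2
    patch = image[x - r:x + r + 1, y - r:y + r + 1, :]   # local receptive field
    return np.sum(patch * filt)                          # linear part (bias added below)

def conv_layer(image, filters, bias):
    """Naive convolution (stride 1, no padding) producing one feature map per filter."""
    k = filters.shape[1]
    r = k // 2
    h, w, _ = image.shape
    out = np.zeros((h - 2 * r, w - 2 * r, filters.shape[0]))
    for i, filt in enumerate(filters):                   # one feature map per filter
        for x in range(r, h - r):
            for y in range(r, w - r):
                out[x - r, y - r, i] = conv_single_position(image, filt, x, y) + bias[i]
    return out

# Example: a 64 x 64 RGB image and 4 filters of size 5 x 5 x 3, as in the text.
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
filters = rng.standard_normal((4, 5, 5, 3))
bias = np.zeros(4)
maps = conv_layer(image, filters, bias)
print(maps.shape)  # (60, 60, 4); with zero-padding it would be (64, 64, 4)
```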

Figure 1. A convolution processes local information centred in each position (x, y): this region is called local receptive field, whose values are used as input by some filter i with weights w_i in order to produce a single point (pixel) in the output feature map f(i, x, y).

For example, if we have an RGB image as an input and we want the first convolutional layer to have 4 filters of size 5 × 5 to operate over this image, that means we actually need a 5 × 5 × 3 filter (the last 3 is for each colour channel). Let the input image have size 64 × 64 × 3, and a convolutional layer with 4 filters; then the output is going to have 4 matrices; stacked, we have a tensor of size 64 × 64 × 4. Here, we assume that we used a zero-padding approach in order to perform the convolution keeping the dimensions of the image.

The most commonly used filter sizes are 5 × 5 × d, 3 × 3 × d and 1 × 1 × d, where d is the depth of the tensor. Note that in the case of 1 × 1 the output is a linear combination of all feature maps for a single pixel, not considering the spatial neighbours but only the depth neighbours.

It is important to mention that the convolution operator can have different strides, which define the step taken between each local filtering. The default is 1; in this case all pixels are considered in the convolution. For example, with stride 2, every odd pixel is processed, skipping the others. It is common to use an arbitrary value of stride s, e.g. s = 4 in AlexNet [6] and s = 2 in DenseNet [25], in order to reduce the running time.

B. Activation Function

In contrast to the use of a saturating function such as the logistic or hyperbolic tangent in MLPs, the rectified linear function (ReLU) is often used in CNNs after convolutional layers or fully connected layers [26], but can also be employed before layers in a pre-activation setting [27]. Activation Functions are not useful after Pool layers, because such a layer only downsamples the input data.

Figure 2. Illustration of activation functions: (a) hyperbolic tangent tanh(x) and (b) logistic(x) are often used in MLP networks, while (c) ReLU max[0, x] and (d) PReLU max[ax, x] (plotted with a = 0.1) are more common in CNNs. Note that (d) with a = 0.01 is equivalent to Leaky ReLU.

Figure 2 shows plots of such functions: ReLU cancels out all negative values, and it is linear for all positive values. This is somewhat related to the non-negativity constraint often used to regularize image processing methods based on subspace projections [28]. The Parametric ReLU (PReLU) allows small negative features, parametrized by 0 ≤ a ≤ 1 [29]. It is possible to design the layers so that a is learnable during the training stage. When we have a fixed a = 0.01 we have the Leaky ReLU.
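The activation functions above are simple elementwise operations; the following sketch (our own illustration, not code from the paper) implements them with NumPy, with a treated as a fixed constant for Leaky ReLU and as a learnable parameter for PReLU.

```python
import numpy as np

def relu(x):
    # Cancels out all negative values, linear for positive ones.
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):
    # Fixed small slope a for negative inputs.
    return np.maximum(a * x, x)

def prelu(x, a):
    # Same form as Leaky ReLU, but `a` is a parameter learned during training.
    return np.where(x > 0, x, a * x)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))            # [0.    0.    0.    1.5]
print(leaky_relu(z))      # [-0.02  -0.005  0.    1.5 ]
print(prelu(z, a=0.1))    # [-0.2   -0.05   0.    1.5 ]
```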
C. Feature or activation map

Each convolutional neuron produces a new vector that passes through the activation function and is then called a feature map. Those maps are stacked, forming a tensor that will be offered as input to the next layer.

Note that, because our first convolutional layer outputs a 64 × 64 × 4 tensor, if we set a second convolutional layer with filters of size 3 × 3, those will actually be 3 × 3 × 4 filters. Each one independently processes the entire tensor and outputs a single feature map. In Figure 3 we show an illustration of two convolutional layers producing feature maps.

Figure 3. Illustration of two convolutional layers, the first with 4 filters 5 × 5 × 3 that gets as input an RGB image of size 64 × 64 × 3 and produces a tensor of feature maps. A second convolutional layer with 5 filters 3 × 3 × 4 gets as input the tensor from the previous layer, of size 64 × 64 × 4, and produces a new 64 × 64 × 5 tensor of feature maps. The circle after each filter denotes an activation function, e.g. ReLU.

D. Pooling

Often applied after a few convolutional layers, pooling downsamples the image in order to reduce the spatial dimensionality of the vector. The maxpooling operator is the most frequently used. This type of operation has two main purposes: first, it reduces the size of the data: as we will show in the specific architectures, the depth (3rd dimension) of the tensor often increases and therefore it is convenient to reduce the first two dimensions. Second, by reducing the image size it is possible to obtain a kind of multi-resolution filter bank, by processing the input in different scale spaces. However, there are studies in favour of discarding the pooling layers, reducing the size of the representations via a larger stride in the convolutional layers [30]. Also, because generative models (e.g. variational AEs, GANs, see Sections IV and V) were shown to be harder to train with pooling layers, there is probably a tendency for future architectures to avoid pooling layers.

E. Normalization

It is also common to apply normalization to both the input data and the data after convolutional layers. In input data preprocessing it is common to apply a z-score normalization (centring by subtracting the mean and normalizing the standard deviation to unity) as described by [31], which can be seen as a whitening process. For layer normalization there are different approaches, such as the channel-wise layer normalization, which normalizes the vector at each spatial location in the input map, either within the same feature map or across consecutive channels/maps, using the L1-norm, L2-norm or variations.

In the AlexNet architecture (2012) [6] the Local Response Normalization (LRN) is used: for every particular input pixel (x, y) for a given filter i, we have a single output pixel f_i(x, y), which is normalized using values from adjacent feature maps j, i.e. f_j(x, y). This procedure incorporates information about outputs of other kernels applied to the same position (x, y). However, more recent methods such as GoogLeNet [24] and ResNet [23] do not mention the use of LRN. Instead, they use Batch Normalization (BN) [32]. We describe BN in Section III-J.

F. Fully Connected Layer

After many convolutional layers, it is common to include fully connected (FC) layers that work in a way similar to a hidden layer of an MLP in order to learn weights to classify the representation. In contrast to a convolutional layer, for which each filter produces a matrix of local activation values, the fully connected layer looks at the full input vector, producing a single scalar value. In order to do that, it takes as input the reshaped version of the data coming from the last layer. For example, if the last layer outputs a 4 × 4 × 40 tensor, we reshape it so that it will be a vector of size 1 × (4 × 4 × 40) = 1 × 640. Therefore, each neuron in the FC layer will be associated with 640 weights, producing a linear combination of the vector. In Figure 4 we show an illustration of the transition between a convolutional layer with 5 feature maps of size 2 × 2 and an FC layer with m neurons, each producing an output based on f(x^T w + b), in which x is the feature map vector, w are the weights associated with each neuron of the fully connected layer, and b are the bias terms.

The last layer of a CNN is often the one that outputs the class membership probabilities for each class c using logistic regression:

P(y = c | x; w; b) = softmax_c(x^T w + b) = e^{x^T w_c + b_c} / Σ_j e^{x^T w_j + b_j},

where y is the predicted class, x is the vector coming from the previous layer, and w and b are respectively the weights and the bias associated with each neuron of the output layer.
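As a quick illustration of the FC + softmax output described above (our own sketch; the 640-dimensional input and 5 classes mirror the running example and are not code from the paper), the class probabilities are a normalized exponential of the affine scores x^T W + b:

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; the outputs sum to 1.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

rng = np.random.default_rng(1)
x = rng.random(640)                          # flattened 4 x 4 x 40 tensor from the last layer
W = rng.standard_normal((640, 5)) * 0.01     # one weight vector per class
b = np.zeros(5)

scores = x @ W + b                           # FC layer: one affine score per class
probs = softmax(scores)                      # P(y = c | x; W; b)
print(probs, probs.sum())                    # 5 probabilities summing to 1
```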

Figure 4. Illustration of a transition between a convolutional and a fully connected layer: the values of all 5 activation/feature maps of size 2 × 2 are concatenated in a vector x, and neurons in the fully connected layer have full connections to all values in this previous layer, producing a vector multiplication followed by a bias shift and an activation function in the form f(x^T w + b).

G. CNN architecture and its parameters

Typical CNNs are organized using blocks of convolutional layers (Conv) followed by an activation function (AF), eventually pooling (Pool) and then a series of fully connected layers (FC), which are also followed by activation functions. Normalization of data before each layer can also be applied, as we describe later in Section III-J.

In order to build a CNN we have to define the architecture, which is given first by the number of convolutional layers (and, for each layer, also the number of filters, the size of the filters and the stride of the convolution). Typically, a sequence of Conv + AF layers is followed by Pooling (for each Pool layer we define the window size and the stride that will define the downsampling factor). After that, it is common to have a number of fully connected layers (for each FC layer we define the number of neurons). Note that pre-activation is also possible, as in He et al. [27], in which the data first goes through the AF and then to the Conv layer.

The number of parameters in a CNN is related to the number of weights we have to learn — those are basically the values of all filters in the convolutional layers, all weights of the fully connected layers, as well as bias terms.

— Example: let us build a CNN architecture to work with RGB input images with dimensions 64 × 64 × 3 in order to classify images into 5 classes. Our CNN will have three convolutional layers, two max pooling layers and one fully connected layer, followed by the output layer, as follows:

• Conv 1 (Conv → AF): 10 neurons 5 × 5 × 3
  – outputs 64 × 64 × 10 tensor
• Max pooling 1: downsampling factor 4
  – outputs 16 × 16 × 10 tensor
• Conv 2 (Conv → AF): 20 neurons 3 × 3 × 10
  – outputs 16 × 16 × 20 tensor
• Conv 3 (Conv → AF): 40 neurons 1 × 1 × 20
  – outputs 16 × 16 × 40 tensor
• Max pooling 2: downsampling factor 4
  – outputs 4 × 4 × 40 tensor
• FC 1 (FC → AF): 32 neurons
  – outputs 32 values
• FC 2 (output) (FC → AF): 5 neurons (one for each class)
  – outputs 5 values

Considering that each of the three convolutional layers' filters has p × q × d parameters plus a bias term, and that each FC layer has weights and a bias term associated with each value of the vector received from the previous layer, then we have the following number of parameters:

  (10 × [5 × 5 × 3 + 1] = 760)      [Conv 1]
+ (20 × [3 × 3 × 10 + 1] = 1820)    [Conv 2]
+ (40 × [1 × 1 × 20 + 1] = 840)     [Conv 3]
+ (32 × [640 + 1] = 20512)          [FC 1]
+ (5 × [32 + 1] = 165)              [FC 2]
= 24097

As the reader can notice, even a relatively small architecture can easily have a lot of parameters to be tuned. In the case of the classification problem we presented, we are going to use the labelled instances in order to learn the parameters. But to guide this process we need a way to measure how the current model is performing, and then a way to change the parameters so that it performs better than the current one.
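A minimal sketch of the example architecture above, written with PyTorch layers (an assumption of ours; the paper does not prescribe a framework). The downsampling factor 4 is obtained here with 4 × 4 max pooling, and counting the parameters of this model reproduces the 24097 total.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """64 x 64 x 3 input, 5 output classes, ~24k parameters."""
    def __init__(self, n_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 10, kernel_size=5, padding=2), nn.ReLU(),   # Conv 1: 760 params
            nn.MaxPool2d(4),                                         # 64x64 -> 16x16
            nn.Conv2d(10, 20, kernel_size=3, padding=1), nn.ReLU(),  # Conv 2: 1820 params
            nn.Conv2d(20, 40, kernel_size=1), nn.ReLU(),             # Conv 3: 840 params
            nn.MaxPool2d(4),                                         # 16x16 -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                        # 4*4*40 = 640 values
            nn.Linear(640, 32), nn.ReLU(),       # FC 1: 20512 params
            nn.Linear(32, n_classes),            # FC 2: 165 params
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
print(sum(p.numel() for p in model.parameters()))   # 24097
print(model(torch.zeros(1, 3, 64, 64)).shape)       # torch.Size([1, 5])
```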
H. Loss Function

A loss or cost function is a way to measure how bad the performance of the current model is, given the current input and expected output; because it is based on a training set it is, in fact, an empirical loss function. Assuming we want to use the CNN as a regressor in order to discriminate between classes, then a loss ℓ(y, ŷ) will express the penalty for predicting some ŷ while the true output value should be y. The hinge loss or max-margin loss is an example of this type of function, used in the SVM classifier optimization in its primal form. Let f_i ≡ f_i(x_j, W) be a score function for some class i, given an instance x_j and a set of parameters W; then the hinge loss is:

ℓ_j^{(h)} = Σ_{c ≠ y_j} max(0, f_c − f_{y_j} + ∆),

where class y_j is the correct class for the instance j. The hyperparameter ∆ can be used so that, when minimizing this loss, the score of the correct class needs to be larger than the incorrect class scores by at least ∆. There is also a squared version of this loss, called squared hinge loss.

For the softmax classifier, often used in neural networks, the cross-entropy loss is employed, minimizing the cross-entropy between the estimated class probabilities:

ℓ_j^{(ce)} = − log( e^{f_{y_j}} / Σ_k e^{f_k} ),   (1)

in which k = 1 ··· C is the index of each neuron of the output layer with C neurons, one per class. This function takes a vector of real-valued scores to a vector of values between zero and one with unitary sum. There is an interesting probabilistic interpretation of minimizing Equation 1, in which it can be seen as minimizing the Kullback-Leibler divergence between two class distributions in which the true distribution has zero entropy (since it has a single probability 1) [33]. Also, we can interpret it as minimizing the negative log likelihood of the correct class, which is related to Maximum Likelihood Estimation.

The full loss of some training set (a finite batch of data) is the average over the instances x_j of the outputs f(x_j; W), given the current set of all parameters W:

L(W) = (1/N) Σ_{j=1}^{N} ℓ(y_j, f(x_j; W)).

We now need to minimize L(W) using some optimization method.

– Regularization: there is a possible problem with using only the loss function as presented. This is because there might be many similar W for which the model is able to correctly classify the training set, and this can hamper the process of finding good parameters via minimization of the loss, i.e. make it difficult to converge. In order to avoid ambiguity of solution, it is possible to add a new term that penalizes undesired situations, which is called regularization. The most common regularization is the L2-norm: by adding a sum of squares, we want to discourage individually large weights:

R(W) = Σ_k Σ_l W_{k,l}^2

We then expand the loss by adding this term, weighted by a hyperparameter λ to control the regularization influence:

L(W) = (1/N) Σ_{j=1}^{N} ℓ(y_j, f(x_j; W)) + λR(W).

The parameter λ controls how large the parameters in W are allowed to grow, and it can be found by cross-validation using the training set.

I. Optimization Algorithms

After defining the loss function, we want to adjust the parameters so that the loss is minimized. Gradient Descent is the standard algorithm for this task, and the backpropagation method is used to obtain the gradient with respect to the sequence of weights using the chain rule. We assume the reader is familiar with the fundamentals of both gradient descent and backpropagation, and focus on the particular case of CNNs.

Note that L(W) is based on a finite dataset and, because of that, we are computing Monte Carlo estimates (via randomly selected examples) of the real distribution that generates the parameters. Also, recall that CNNs can have a lot of parameters to be optimized, therefore needing to be trained using thousands or millions of images (many current datasets have more than 1TB of data). But if we have millions of examples to be used in the optimization, then Gradient Descent is not viable, since this algorithm has to compute the gradient for all examples individually. The difficulty here is easy to see, because if we try to run an epoch (i.e. a pass through all the data) we would have to load all the examples into a limited memory, which is not possible. Alternatives to overcome this problem are described below, including SGD, Momentum, AdaGrad, RMSProp and Adam.

– Stochastic Gradient Descent (SGD): one possible solution to accelerate the process is to use approximate methods that go through the data in samples composed of random examples drawn from the original dataset. This is why the method is called Stochastic Gradient Descent: now we are not inspecting all available data at a time, but a sample, which adds uncertainty to the process. We can even compute the Gradient Descent using a single example at a time (a method often used in streams or online learning). However, in practice, it is common to use mini-batches with size B. By performing enough iterations (each iteration will compute the new parameters using the examples in the current mini-batch), we assume it is possible to approximate the Gradient Descent method:

W_{t+1} = W_t − η Σ_{j=1}^{B} ∇L(W_t; x_j^B),

in which η is the learning rate parameter: a large η will produce larger steps in the direction of the gradient, while a small value produces smaller steps in the direction of the gradient. It is common to set a larger initial value for η and then exponentially decrease it as a function of the iterations.

In fact, SGD is a rough approximation, producing a non-smooth convergence. Because of that, variants were proposed to compensate for that, such as the Adaptive Gradient (AdaGrad) [34], Adaptive learning rate (AdaDelta) [35] and Adaptive moment estimation (Adam) [36]. Those variants basically use the ideas of momentum and normalization, as we describe below.

– Momentum: adds a new variable α to control the change in the parameters W. It creates a momentum that prevents the new parameters W_{t+1} from deviating too much from the previous direction:

W_{t+1} = W_t + α(W_t − W_{t−1}) + (1 − α)[−η∇L(W_t)],

where L(W_t) is the loss computed using some examples with the current parameters W_t (often a mini-batch). Note that the magnitude of the step for the iteration t + 1 is now also constrained by the step taken in the iteration t.

– AdaGrad: works by putting more weight on rare or infrequent parameters. It creates a history of how much a given parameter already changed the loss, accumulating the individual gradients g_{t+1} = g_t + ∇L(W_t)^2. Then, the next step is scaled/normalized for each parameter:

W_{t+1} = W_t − η∇L(W_t) / ( √(g_{t+1}) + ε ),

since this historical gradient is computed feature-wise, the infrequent features will have more influence in the next gradient descent step.

– RMSProp: computes running averages of recent gradient magnitudes and normalizes using these averages so that the gradient values are loosely normalized. It is similar to AdaGrad, but here g_t is computed by an exponentially decaying average and not the simple sum of gradients:

g_{t+1} = γ g_t + (1 − γ)∇L(W_t)^2

g is called the second order moment of ∇L (don't confuse it with momentum). The final parameter update is given by adding the momentum:

W_{t+1} = W_t + α(W_t − W_{t−1}) + (1 − α)[ −η∇L(W_t) / ( √(g_{t+1}) + ε ) ],

– Adam: uses an idea that is similar to AdaGrad and RMSProp, but the momentum is used for both the first and second order moments, so now we have α and γ to control the momentum of, respectively, W and g. The influence of both decays over time, so that the step size decreases when it approaches the minimum. We use an auxiliary variable m for clarity:

m_{t+1} = α_{t+1} m_t + (1 − α_{t+1})∇L(W_t)
m̂_{t+1} = m_{t+1} / (1 − α_{t+1})

m is called the first order moment of ∇L (don't confuse it with momentum) and m̂ is m after applying the decaying factor. Then we need to compute the gradients g to use in the normalization:

g_{t+1} = γ_{t+1} g_t + (1 − γ_{t+1})∇L(W_t)^2
ĝ_{t+1} = g_{t+1} / (1 − γ_{t+1})

g is called the second order moment of ∇L (again, don't confuse it with momentum). The final parameter update is given by:

W_{t+1} = W_t − η m̂_{t+1} / ( √(ĝ_{t+1}) + ε )
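The update rules above translate almost directly into code. The sketch below (our own, with made-up hyperparameter values) implements the plain SGD, momentum and Adam-style updates for a generic parameter array W and gradient grad; the Adam bias correction uses the usual 1 − α^t form from the Adam paper, which plays the role of the decaying factor written above.

```python
import numpy as np

def sgd_step(W, grad, lr=0.01):
    return W - lr * grad

def momentum_step(W, W_prev, grad, lr=0.01, alpha=0.9):
    # W_{t+1} = W_t + alpha (W_t - W_{t-1}) + (1 - alpha)(-lr * grad)
    return W + alpha * (W - W_prev) + (1.0 - alpha) * (-lr * grad)

def adam_step(W, grad, m, g, t, lr=0.05, alpha=0.9, gamma=0.999, eps=1e-8):
    # First and second order moments of the gradient, with bias correction.
    m = alpha * m + (1.0 - alpha) * grad
    g = gamma * g + (1.0 - gamma) * grad ** 2
    m_hat = m / (1.0 - alpha ** t)
    g_hat = g / (1.0 - gamma ** t)
    W = W - lr * m_hat / (np.sqrt(g_hat) + eps)
    return W, m, g

# Toy usage: minimize f(W) = ||W||^2, whose gradient is 2W.
W = np.array([1.0, -2.0])
m, g = np.zeros_like(W), np.zeros_like(W)
for t in range(1, 201):
    W, m, g = adam_step(W, 2.0 * W, m, g, t)
print(W)  # approaches the minimum at [0, 0]
```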
J. Tricks for Training CNNs

– Initialization: random initialization of weights is important for the convergence of the network. The Gaussian distribution N(µ, σ) is often used to produce the random numbers. However, for models with more than 8 convolutional layers, the use of a fixed standard deviation (e.g. σ = 0.01) as in [6] was shown to hamper convergence. Therefore, when using rectifiers as activation functions, it is recommended to use µ = 0 and σ = √(2/n_l), where n_l is the number of connections of a response of a given layer l, as well as initializing all bias parameters to zero [29].

– Minibatch size: due to the need of using SGD optimization and its variants, one must define the size of the minibatch of images that is going to be used to train the model, taking into account memory constraints but also the behaviour of the optimization algorithm. For example, while a small batch size can make the loss minimization more difficult, the convergence speed of SGD degrades when increasing the minibatch size for convex objective functions [37]. In particular, assuming SGD converges in T iterations, then minibatch SGD with batch size B runs in O(1/√(BT) + 1/T).

However, increasing B is important to reduce the variance of the SGD updates (by using the average of the loss), and this, in turn, allows taking bigger step-sizes [38]. Also, larger minibatches are interesting when using GPUs, since they give better throughput by performing backpropagation with data reuse using matrix multiplication (instead of several vector-matrix multiplications), and need fewer transfers to the GPU. Therefore, it can be an advantage to choose the batch size so that it fully occupies the GPU memory and to choose the largest experimentally found step size. While popular architectures (as we will discuss in Section III-K) use from 32 to 256 examples in the batch size, a recent paper used a linear scaling rule for adjusting learning rates as a function of minibatch size, also adding a warmup scheme with large step-sizes in the first few epochs to avoid optimization problems. By using 256 GPUs and a batch size of 8192, Goyal et al. [39] were able to train a Residual Network with 50 layers on ImageNet in 1 hour.

– Dropout: a technique proposed by [40] that, during the forward pass of the network training stage, randomly deactivates neurons of a layer with some probability p (in particular in FC layers). It has relationships with the Bagging ensemble method [41] because, at each iteration of the mini-batch SGD, the dropout procedure creates a randomly different network by subsampling the activations, which is trained using backpropagation. This method became known as a form of regularization that prevents the network from overfitting. In the test stage, dropout is turned off and the activations are re-scaled by p to compensate for those activations that were dropped during the training stage.

– Batch normalization (BN): also used as a regularizer, it normalizes a layer's activations at each batch of input data by maintaining the mean activation close to 0 (centring) and the activation standard deviation close to 1, and using parameters γ and β to produce a linear transformation of the normalized vector, i.e.:

BN_{γ,β}(x_i) = γ ( (x_i − µ_B) / √(σ_B^2 + ε) ) + β,   (2)

in which γ and β are parameters that can be learned during the backpropagation stage [32]. This allows, for example, adjusting the normalization and even restoring the data back to its un-normalized form, i.e. when γ = √(σ_B^2) and β = µ_B. BN became a standard in recent years, often replacing the use of both regularization and dropout (e.g. ResNet [23] and Inception V3 [42] and V4).

– Data-augmentation: as mentioned previously, CNNs often have a large set of parameters to be optimized, requiring a huge number of training examples. Because it is very hard to have a dataset with sufficient examples, it is common to employ some type of data augmentation. Also, images in the same dataset often have similar illumination conditions, a low variance of rotation, pose, etc. Therefore, one can augment the training dataset by many operations in order to produce 5 to 10 times more examples [43], such as: (i) cropping images in different positions — note that CNNs often have a low-resolution image input (e.g. 224 × 224), so we can find several cropped versions of an image with higher resolution; (ii) flipping images horizontally — and also vertically if it makes sense, e.g. in the case of remote sensing and astronomical images; (iii) adding noise [44]; (iv) creating new images by using PCA, as in the Fancy PCA proposed by [6]. Note that augmentation must be performed preserving the labels.

– Pre-processing: the input data can be pre-processed in several ways: (i) compute the average image for the whole training data and subtract it from each image; (ii) z-score normalization (as mentioned in Normalization); (iii) PCA whitening, which first tries to decorrelate the data by projecting zero-centred original data onto the eigenbasis, and then takes the data in the eigenbasis and divides every dimension by the corresponding eigenvalue to normalize the scale.

– Fine-tuning: when you have a small dataset it can be a challenge to train a CNN. Even data augmentation can be insufficient, since augmentation will create perturbed versions of the same images. In this case, it is very useful to use a model already trained on a large dataset (for example the ImageNet dataset [1]), with initial weights that were already learned. To obtain a trained model, adapt its architecture to your dataset and re-enter the training phase using such a dataset is a process called fine-tuning. Because it often involves the use of a different dataset, we discuss this in more detail in Section III-L about Transfer Learning and Feature Extraction.
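Before moving on to specific architectures, here is a small NumPy sketch of our own (not from the paper) of two of the tricks above: dropout, with p treated as the probability of keeping a unit, and the per-batch normalization of Equation 2 with learnable γ and β.

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, p=0.5, training=True):
    """Keep each unit with probability p during training (i.e. drop it with
    probability 1 - p); at test time keep everything and re-scale by p."""
    if training:
        mask = rng.random(activations.shape) < p
        return activations * mask
    return activations * p

def batch_norm(x, gamma, beta, eps=1e-5):
    """Equation 2: normalize each feature over the batch, then scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = rng.standard_normal((128, 32))          # 128 examples, 32 activations each
h = dropout(batch, p=0.5, training=True)
y = batch_norm(batch, gamma=np.ones(32), beta=np.zeros(32))
print(h.shape, y.mean().round(6), y.std().round(3))   # mean ~0, std ~1
```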
K. CNN Architectures for Image Classification

There are many proposed CNN architectures for image classification. We chose to cover those that contain significant differences, starting with AlexNet, all designed for image classification (1000 classes) of the ImageNet Challenge [1]. We refer also to Fukushima's Neocognitron [3] and LeNet [4], both important studies for the history of Deep Learning in Computer Vision. We first describe each architecture and later we show an overall comparison in Table I.

– AlexNet [6]: was the champion model in the ImageNet Challenge 2012. With ∼60 million parameters and 650000 neurons, AlexNet was originally designed in two branches allowing parallel processing. It uses Local Response Normalization, maxpooling with overlapping (window size 3, stride 2), a batch size of 128 examples, momentum of 0.9 and weight decay of 0.0005. The weights were initialized using Gaussian-distributed random values with fixed σ = 0.01, bias set to 1 for the 2nd, 4th and 5th convolutional layers, and bias set to 0 for the remaining layers. The learning rate was initialized to 0.01 and arbitrarily divided by 10 three times during the training stage. In Figure 5 we show a diagram of the layers and compare it with VGG-16, the latter described next.

– VGG-Net [22]: this architecture was developed to increase the depth while making all filters with at most 3 × 3 size. The rationale behind this is that 2 consecutive 3 × 3 conv. layers will have an effective receptive field of 5 × 5, and 3 of such layers an effective receptive field of 7 × 7 when combining the feature maps. The authors claim that stacking 3 conv. layers with 3 × 3 filters instead of using just one with filter size 7 × 7 has the advantage of incorporating more rectification layers, making the decision function more discriminative. Also, it incorporates 1 × 1 conv. layers that perform a linear projection of a position (x, y) across all feature maps in a given layer. There are two most commonly used versions of this CNN: VGG-16 and VGG-19, with 16 and 19 weight layers, respectively. In Figure 5 we show a diagram of the layers of VGG-16 and compare it with AlexNet.

Figure 5. Outline of CNN architectures: AlexNet with variable filter sizes (top), and VGG-16 with fixed 3 × 3 filter sizes (bottom). Both end with fully connected layers FC1 and FC2 (4096 neurons each) and an FC3 output layer with 1000 neurons.

Figure 6. Modules to preserve the characteristics of the original vector (identity), allowing the vector to skip weight layers, typically skipping 2 layers: (a) the original vector x, before its modification by weights, is summed with its transformed version f(x); (b) when some layer also includes a pooling operation, the dashed line indicates the original vector needed pooling to be summed; (c) in the bottleneck module illustrated, the depth (256) of the input is reduced by the first 1 × 1 layer of 64 filters and then restored back to 256 by the third layer.

During training, the batch size used is 256. LRN is not used, since it showed no classification improvement but increased running time. Maxpooling uses window size 2 and stride 2. Training was regularized by weight decay using L2 regularization with λ = 0.0005, and dropout with 0.5 ratio for the first two FC layers. Learning rate and initialization were similar to those used in AlexNet, but all bias parameters were set to 0 and the initialization also followed a pre-training approach.

– Residual Networks (ResNet) [23]: the study describing ResNets raises the question of whether we really get better networks by simply stacking more layers. At the time of the publication VGG-19 was considered "very deep", and the authors show that, although depth seems to be correlated with better results, in practice, when increasing the number of layers, the accuracy first saturates and then starts to degrade rapidly, failing to even work well on the training set, therefore underfitting.

He et al. [23] claim that this could, in fact, be an optimization problem, and propose the use of residual blocks for networks with 34 to 152 layers. Those blocks are designed to preserve the characteristics of the original vector x before its transformation by some layer f_l(x), by skipping weight layers and performing the sum f_l(x) + x. In Figure 6 we show a diagram of three different versions of such blocks, and in Figure 7 the ResNet architecture with 34 layers that uses the residual block and the residual block with pooling. The authors claim that, because the gradient is an additive term, it is unlikely to vanish even with many layers. Interestingly, this architecture does not contain any FC hidden layers. After the last convolutional layer, an Average Pooling is computed, followed by the output layer.

The bottleneck blocks (Figure 6-(c)) are designed to compress the depth of the input feature map via 1 × 1 convolutions with a reduced number of filters, and then restore its depth by adding another layer of 1 × 1 convolutions containing a number of filters equal to the depth of the input feature map. The bottleneck blocks are used in ResNets with 50, 101 and 152 layers. For example, the ResNet-50 is basically the ResNet-34 replacing all residual blocks (each containing 2 weight layers) with bottleneck blocks (each with 3 weight layers).

For the ImageNet training, the authors adopt only Batch Normalization (BN), without additional regularization, dropout or Local Response Normalization; data augmentation was performed using crops, horizontal flips and Fancy PCA; they also preprocessed the images by average subtraction. The batch size used was 256 and the remaining parameters are similar to the previously described methods, as shown in Table I.
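A minimal PyTorch sketch of the plain residual block of Figure 6-(a) and the bottleneck block of Figure 6-(c) (our own illustration, simplified: the published ResNet also uses batch normalization and strided/projection shortcuts, which are omitted here).

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """f(x) + x with two 3x3 weight layers, as in Figure 6-(a)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))
        return self.relu(f + x)                  # identity skip connection

class BottleneckBlock(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 restore, e.g. 256 -> 64 -> 64 -> 256 channels."""
    def __init__(self, channels=256, reduced=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, reduced, 1)
        self.conv = nn.Conv2d(reduced, reduced, 3, padding=1)
        self.restore = nn.Conv2d(reduced, channels, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.relu(self.reduce(x))
        f = self.relu(self.conv(f))
        f = self.restore(f)
        return self.relu(f + x)

x = torch.zeros(1, 256, 56, 56)
print(ResidualBlock(256)(x).shape, BottleneckBlock()(x).shape)  # both keep the input shape
```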
Recently, the same authors incorporated ideas from The bottleneck blocks (Figure 6-(c)) are designed to ResNets, producing many variants including Inception V4 compress the depth of the input feature map via 1 × 1 and Inception-ResNet [45]. Here, for didactic purposes, we nIcpinV tflostedsg rnil ii and (iii) principle design the follows while 8-(a), it Figure V3 in shown Inception is in sequences. [24] parallel Inception different VG- and original the convolutions The in of small stacking outputs explored of idea the parallel) concatenates already this (in explores was sequences Inception receptivedifferent idea the effective this However, same GNet. that the Note have field. that filters consecutive 7 ou nIcpinV ic h 4i aial ain of variant a basically is V4 one. the previous since the V3 Inception on focus the outputs. replacing bank filter (b) expanded traditional; with (a) (d) modules Inception two 8. Figure with mappings identity indicate to. lines skips dashed it while layer layers, the skipping in mappings representation identity the indicate of lines size Solid the architecture. match ResNet-34 to of order Outline in pooling 7. Figure 5 1 × k 1 k 1 1 The × × × × × × × 3 a rgnlIcpin()Icpinv3-1 Inception (b) Inception Original (a) 5 1 k k 1 1 1 × 7 c neto 32()Icpinv3-3 Inception (d) v3-2 Inception (c) htaecmuainlyepnie nosmaller into expensive, computationally are that ) 3 neto module Inception etr asconcatenation maps Feature etr asconcatenation maps Feature 3 1 ovltos c ihfcoiainof factorization with (c) convolutions; k 1 1 × × x × × × x 224 × 224 image 3 1 k 1 1

Inception V3 follows the design principles (ii) and (iii) while exploring the factorization of convolutions: it breaks larger filters (e.g. 5 × 5, 7 × 7), which are computationally expensive, into smaller ones, and proposes to factorize a k × k convolution into a sequence of two layers, one with 1 × k convolution filters followed by one with k × 1 filters. In addition, a module replacing the larger convolutions with smaller ones was proposed as in Figure 8-(b), a module with the factorization of k × k convolutions as in Figure 8-(c), and other modules with an expanded filter bank, shown in Figure 8-(d), to increase the dimensionality of the representations.

We show the Inception V3 architecture in Figure 9: the first layers include regular convolution layers followed by pooling, and then the different Inception modules (types V3-1, 2, 3). We highlight the size of the representation in some transitions of the network. This is important because, while the first two dimensions shrink, the third increases, following the design principles (i), (ii) and (iv). Note that the modules of type V3-3 have the intention of increasing the number of feature maps, since each module concatenates the outputs of its parallel branches. Although this architecture has 42 layers (considering the internal layers inside the Inception modules), because of the factorization it has a small size in terms of parameters.

For the Inception V3 the authors also employ a label-smoothing regularization (LSR) that tries to constrain the last layer to output a non-sparse solution. Given an input example, let Y(k) be the ground truth for each class k, Ŷ(k) the predicted output for each class, and P(k) a prior for each class. The LSR can be seen as using a pair of cross-entropy losses, one that penalizes the incorrect label and a second that penalizes it to deviate from the prior label distribution, i.e. (1 − γ)ℓ(Y(k), Ŷ(k)) + γℓ(P(k), Ŷ(k)). The parameter γ controls the influence of the LSR and P(k) represents the prior distribution.
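The label-smoothing regularization above amounts to mixing the one-hot ground truth with a prior before computing the cross-entropy. A small NumPy sketch of our own (uniform prior over C classes; γ = 0.1 is just an illustrative value):

```python
import numpy as np

def smoothed_cross_entropy(probs, true_class, gamma=0.1):
    """Cross-entropy against a label distribution mixed with a uniform prior:
    (1 - gamma) * one_hot + gamma * uniform, as in label-smoothing regularization."""
    n_classes = probs.shape[0]
    target = np.full(n_classes, gamma / n_classes)   # uniform prior P(k) = 1/C
    target[true_class] += 1.0 - gamma                # smoothed ground truth
    return -np.sum(target * np.log(probs + 1e-12))

# Example: a very confident (sparse) prediction is penalized more under LSR.
probs = np.array([0.97, 0.01, 0.01, 0.01])
print(smoothed_cross_entropy(probs, true_class=0, gamma=0.0))  # plain cross-entropy
print(smoothed_cross_entropy(probs, true_class=0, gamma=0.1))  # smoothed version
```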

To train the network for ImageNet, they used LSR with a uniform prior, i.e. the same for all classes: P(k) = 1/1000 for the ImageNet 1000-class problem, and γ = 0.1. The optimization algorithm is RMSProp. They also used gradient clipping with threshold 2. All the other parameters are standard. BN is applied in both convolutional and FC layers.

Figure 9. Outline of the Inception V3 architecture. We highlight the size of the representations before and after the Inception module types. Convolutions do not use zero-padding, except Conv3 (in gray); the stride is indicated in the layer description, as well as for the max pooling and the last feature maps.

– DenseNet [25]: inspired by ideas from both ResNets and Inception, the DenseNet introduces the DenseBlock, a sequence of layers where each layer l takes as input all the feature maps produced by the preceding layers, concatenated. Thus, each layer produces an output which is a function of all previous feature maps, i.e.:

x_l = H_l([x_0, x_1, ··· , x_{l−1}])

Hence, while regular networks with L layers have L connections, each DenseBlock (illustrated in Figure 10) has a number of connections following an arithmetic progression, i.e. L(L + 1)/2 direct connections. The DenseNet is a concatenation of the multiple inputs of H_l() into a single tensor. Each H_l is composed of three operations, in sequence: batch normalization (BN), followed by ReLU and then a 3 × 3 convolution. This unusual sequence of operations is called a pre-activation unit, was introduced by He et al. [27], and has the property of yielding a simplified derivative for each unit that is unlikely to be canceled out, which in turn would improve the convergence process.
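A PyTorch sketch of the dense connectivity just described (our own simplification of H_l as BN → ReLU → 3 × 3 convolution, with each layer consuming the concatenation of all previous feature maps; growth rate and layer count are illustrative values).

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each H_l takes the concatenation [x_0, ..., x_{l-1}] and adds k new maps."""
    def __init__(self, in_channels, growth_rate=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for l in range(n_layers):
            channels = in_channels + l * growth_rate     # k_0 + k * l input maps
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),                # pre-activation unit:
                nn.ReLU(),                               # BN -> ReLU -> Conv
                nn.Conv2d(channels, growth_rate, 3, padding=1),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

block = DenseBlock(in_channels=64, growth_rate=32, n_layers=4)
out = block(torch.zeros(1, 64, 28, 28))
print(out.shape)  # torch.Size([1, 192, 28, 28]) = 64 + 4 * 32 channels
```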

Transition layers are layers between DenseBlocks, composed of BN and a 1 × 1 Conv.Layer, followed by an average pooling with stride 2.

Variations of DenseBlocks: they also experimented using bottleneck layers; in this case each DenseBlock layer is a sequence of operations: BN, ReLU, 1 × 1 Conv., followed by BN, ReLU, 3 × 3 Conv. This variation is called DenseNet-B. Finally, a compression method is also proposed to reduce the number of feature maps at transition layers with a factor θ. When bottleneck and compression are combined, they refer to the model as DenseNet-BC.

Each layer has many input feature maps: if each H_l produces k feature maps, then the l-th layer has k_0 + k × (l − 1) input maps, where k_0 is the number of channels in the input image. However, the authors claim that DenseNet requires fewer filters per layer [25]. The number of filters k is a hyperparameter defined in DenseNet as the growth rate. For the ImageNet dataset, k = 32 is used. We show the outline of the DenseNet architecture used for ImageNet in Figure 11.

Figure 10. Illustration of a DenseBlock with 5 functions H_l and a Transition Layer.

Figure 11. Outline of the DenseNet used for ImageNet. The growth rate is k = 32; each Conv. layer corresponds to BN-ReLU-Conv(1 × 1) + BN-ReLU-Conv(3 × 3). A DenseBlock DB-B 1: 4 × 6 means there are 4 internal layers with 6 filters each. The arrows indicate the size of the representation between layers.

– Comparison and other architectures: in order to show an overall comparison of all the architectures described, we listed in Table I the training parameters, the size of the model in terms of parameters, and the top-1 error on the ImageNet dataset for AlexNet, VGGNet, ResNet, Inception V3 and DenseNet.

Finally, we want to refer to another model that was designed to be small: SqueezeNet [46], based on AlexNet but with about 50× fewer parameters, compressing the representations, and with additional compression so that it could, for example, be implemented in embedded systems.

L. Beyond Classification: fine-tuning, feature extraction and transfer learning

When one needs to work with a small dataset of images, it is not viable to train a Deep Network from scratch, with randomly initialized weights. This is due to the high number of parameters and layers, requiring millions of images, plus data augmentation. However, models such as the ones described in this paper that were trained with large datasets of millions of images (e.g. VGGNet, ResNet, Inception) can be useful even for tasks other than classifying the images from ImageNet, that is, for other datasets. This is because the learned weights can be meaningful for other datasets, even from a different domain, such as medical images [47].

In this context, by starting with some model pre-trained with a large dataset, we can use the networks as Feature Extractors or even to achieve Transfer Learning. It is also possible to perform fine-tuning of a pre-trained model to create a new classifier with a different set of classes. In order to perform the aforementioned tasks, there are many options that involve freezing layers (not allowing them to change), discarding layers, including new layers, etc., as we describe next for each case.

– Building a new CNN classifier via Fine-Tuning: it is likely that a new dataset has different classes when compared to the dataset originally used to train the model (e.g. the 1000-class ImageNet dataset). Thus, if we want to create a new CNN classifier for our new dataset based on a pre-trained model, we have to build a new output layer with a number of neurons equal to the number of classes. We then design a new training set using examples from our new dataset and use them to adjust the weights of the new output layer.

There are different options involving fine-tuning, for example: (i) allow the weights from all layers to be fine-tuned with the new dataset (not only the last layer); (ii) freeze some layers and allow just a subset of layers to change; (iii) create new randomly initialized layers, replacing the original ones. The strategy will depend on the scenario. Usually, the most common approach is (ii), freezing the weights of the first layers but allowing the deeper layers to adapt.

– CNN-based Feature Extractor: in this scenario, it is possible to make use of pre-trained CNNs even when the problem is composed of unlabelled (e.g. clustering problems) or only partially labelled examples (e.g. anomaly detection problems), in which it is not possible to fine-tune the weights without label information [50]. In order to extract features, a common approach is to perform a forward pass with the images and then use the feature/activation maps of some arbitrary layer as the features [48]. For example, using an Inception V3 network, one could get the output from the last Max Pooling layer (see Figure 9), ignoring the output layer. This feature vector with 2048 values can be seen as input for another classifier, such as an SVM, for example. If a more compact representation is needed, dimensionality reduction methods or quantization can be an option, such as PCA [51] or Product Quantization based methods [52], [53].

Table I
COMPARISON OF CNN ARCHITECTURES IN TERMS OF THE PARAMETERS USED TO TRAIN WITH THE IMAGENET DATASET, INCLUDING THEIR TRAINING PARAMETERS, NUMBER OF LAYERS, SIZE OF THE MODEL AND TOP-1 ERROR.

CNN          | Batch size | Learn. rate | Optm. alg. | Optm. param. | Epochs | Regularization | # Layers | Size            | ImageNet top-1 error
AlexNet      | 128        | 0.01        | SGD        | α = 0.9      | 90     | Dropout + L2   | 8        | 240MB           | 37.5%
VGGNet       | 256        | 0.01        | SGD        | α = 0.9      | 74     | Dropout + L2   | 16–19    | 574MB (19 lay.) | 24.4%
ResNet       | 256        | 0.1         | SGD        | α = 0.9      | 120    | BN             | 34–152   | 102MB (50 lay.) | 19.3%
Inception V3 | 32         | 0.045       | RMSProp    | α, γ = 0.9   | 100    | BN + LSR       | 42       | 96MB            | 18.7%
DenseNet     | 128        | 0.1         | SGD        | α = 0.9      | 100    | BN             | 40–250   | 30MB (161 lay.) | ∼18.7%

– CNN as a building block: pre-trained networks can also work as part of a larger architecture. For example, [54] uses pre-trained networks to compose both the photo and the sketch networks of a triplet architecture. We discuss those types of architecture in Section VI.

IV. AUTOENCODERS (AEs)

An AE is a neural network that aims to learn to approximate an identity function. In other words, when we feed an AE with a training example x, it tries to generate an output x̂ that is as similar as possible to x. At first glance, such a task may seem trivial: what would be its utility if the output is just the input itself? In fact, we are not interested in the output itself, but rather in the representations learned in the other layers of the network. This is because AEs are designed in such a way that they are not able to learn to copy exactly the same data. Instead, AEs often discover useful intrinsic data properties.

From an architectural point of view, an AE can be divided into two parts: an encoder f and a decoder g. The former takes the original input and creates a restricted (more compact) representation of it – we call this representation code. Then, the latter takes this code and tries to reconstruct the original input from it (see Figure 12).

Figure 12. General structure of AEs: input x → Encoder → Code → Decoder → output x̂.

Because the code is a limited data representation, an AE cannot learn how to perfectly copy its input. Instead, it tries to learn a code that grasps valuable knowledge regarding the structure of the data. In summary, we say that an AE generalizes well when it understands the data-generating distribution – i.e. it has a low reconstruction error for data generated by such mechanism, while having a high reconstruction error for samples that were not produced by it [55].
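A minimal undercomplete autoencoder sketch in PyTorch (our own illustration; the input size d = 784 and the two-layer encoder/decoder with d/2 and d/4 units echo the architecture illustrated later in Figure 13 but are otherwise arbitrary choices): the encoder produces the code h = f(x), the decoder reconstructs x̂ = g(h), and training minimizes a reconstruction loss such as the mean squared error.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, d=784):
        super().__init__()
        self.encoder = nn.Sequential(            # f: input -> code (d -> d/2 -> d/4)
            nn.Linear(d, d // 2), nn.ReLU(),
            nn.Linear(d // 2, d // 4), nn.ReLU(),
        )
        self.decoder = nn.Sequential(            # g: code -> reconstruction
            nn.Linear(d // 4, d // 2), nn.ReLU(),
            nn.Linear(d // 2, d),
        )

    def forward(self, x):
        h = self.encoder(x)                      # restricted representation (code)
        return self.decoder(h)

model = Autoencoder()
x = torch.rand(16, 784)                          # a batch of flattened images
loss = nn.functional.mse_loss(model(x), x)       # L(x, g(f(x)))
loss.backward()
print(float(loss))
```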

Convolutional AEs: one can build Convolutional AEs by replacing the fully connected layers of a traditional AE with convolutional layers. Those models are useful because they do not need labelled examples and can be designed to obtain hierarchical feature extraction. Masci et al. [56] described Convolutional Autoencoders both for unsupervised representation learning (e.g. to extract features that can be used in shallow or traditional classifiers) and to initialize the weights of CNNs in a fast and unsupervised way.

Now that we have presented the basic structure of an AE and seen the importance of having a limited representation of the data, we give a more detailed explanation of how to restrict the generated code. Basically, there are two main ways of achieving this goal: by constructing an undercomplete AE or an overcomplete regularized AE.

A. Undercomplete AEs

In this case the AE has a code that is smaller than its input. Therefore, this code cannot hold a complete copy of the input data. To train this model we minimize the following loss function:

L(x, g(f(x))),   (3)

where L is a loss function (e.g. mean squared error), x is an input sample, f represents the encoder, g represents the decoder, h = f(x) is the code generated by the encoder and x̂ = g(f(x)) is the reconstructed input data.

Let the decoder g() be linear and L() be the mean squared error; then the undercomplete AE is able to learn the same subspace as PCA (Principal Component Analysis), i.e. the principal component subspace of the training data [20]. Because of this type of behaviour, AEs were often employed for dimensionality reduction.

To illustrate a possible architecture, we show an example of an undercomplete AE in Figure 13, in which the encoder and the decoder are composed of two layers each, so that the code is computed by h = f(x) = f2(f1(x)) and the output by x̂ = g(h) = g2(g1(h)).

Figure 13. An illustration of an undercomplete AE architecture with: input x of dimension d; two layers for the encoder (with d/2 and d/4 neurons), a sequence of two functions producing the code h = f2(f1(x)); followed by two layers for the decoder (with d/4 and d/2 neurons), a sequence of two functions producing the output x̂ = g2(g1(h)).

Note that, by allowing the functions to be nonlinear and adding several layers, we are increasing the capacity of the AE. In those scenarios, despite the fact that their code is smaller than the input, undercomplete AEs can still learn how to copy the input, because they are given too much capacity. As stated in [20], if the encoder and the decoder have enough capacity we could even learn a one-dimensional code such that every training sample x(i) is mapped to code i using the encoder, while the decoder maps this code back to the original input. This particular example – despite not happening in practice – illustrates the problems we may end up having if an undercomplete AE has too much capacity.

B. Overcomplete regularized AEs

Differently from undercomplete AEs, in overcomplete AEs the code dimensionality is allowed to be greater than the input size, which means that in principle the networks are able to copy the input without learning useful properties from the training data. To limit their capability of simply copying the input we can either use a different loss function via regularization (sparse autoencoder) or add noise to the input data (denoising autoencoder).

– Sparse autoencoder: if we decide to use a different loss function, we can regularize an AE by adding a sparsity constraint Ω(h) on the code to Equation 3, as follows:

L(x, g(f(x))) + Ω(f(x)).

Thereby, because Ω favours sparse codes, our AE is penalized if it uses all the available neurons to form the code h. Consequently, it is prevented from just copying the input. This is why those models are known as sparse autoencoders. A possible regularization is an absolute value sparsity penalty such as:

L(x, g(f(x))) + Ω(h) = L(x, g(f(x))) + λ Σ_i |h_i|,

summed over all values i of the code.

– Denoising autoencoder: we can regularize an AE by disturbing its input with some kind of noise. This means that our AE has to reconstruct an input sample x given a corrupted version of it, x̃. Each training example is then a pair (x, x̃), and the loss is given by:

L(x, g(f(x̃))).

By using this approach we restrain our model from copying the input and force it to learn to remove noise and, as a result, to gain valuable insights on the data distribution. AEs that use such strategy are called denoising autoencoders (DAEs).

In principle, DAEs can be seen as MLP networks that are trained to remove noise from input data. However, as in all AEs, learning the output itself is not the main objective: DAEs aim to learn a good internal representation as a side effect of learning a denoised version of the input [57].

– Contractive autoencoder: by regularizing the AE using the gradient of the input x, we can learn functions so that the rate of change of the code follows the rate of change of the input x. This requires a different form of Ω that now also depends on x, such as:

Ω(x, h) = λ ‖∂h/∂x‖²_F,

i.e. the squared Frobenius norm of the Jacobian matrix of partial derivatives associated with the encoder function. The contractive autoencoder, or CAE, has theoretical connections with denoising AEs and manifold learning. In [58] the authors show that denoising AEs make the reconstruction function resist small and finite perturbations of x, while contractive autoencoders make the function resist infinitesimal perturbations of x.

The name contractive comes from the fact that the CAE favours the mapping of a neighbourhood of input points (x and its perturbed versions) into a smaller neighbourhood of output points (thus locally contracting), warping the space. We can think of the Jacobian matrix J at a point x as a linear approximation of the nonlinear encoder f(x). A linear operator is said to be contractive if the norm of Jx is kept less than or equal to 1 for all unit-norm x, i.e. if it shrinks the unit sphere around each point. Therefore the CAE encourages each of the local linear operators to become a contraction.

In addition to the contractive term, we still need to minimize the reconstruction error, which alone would lead to learning the function f() as the identity map. The contractive penalty guides the learning process so that most derivatives ∂f(x)/∂x approach zero, while only a few directions with a large gradient of x rapidly change the code h. Those are likely the directions approximating the tangent planes of the manifold. These tangent directions should ideally correspond to real variations of the data. For instance, if we use images as input, then a CAE should learn tangent vectors corresponding to moving or changing parts of objects (e.g. head and legs) [59].

C. Generative Autoencoders

As mentioned previously, autoencoders can be used to learn manifolds in a nonparametric fashion. It is possible to associate each of the neurons/nodes of a representation with a tangent plane that spans the directions of variation associated with the difference vectors between the example and its neighbours [60]. An embedding associates each training example with a real-valued vector position [20]. If we have enough examples to cover the curvature and warps of the target manifold, we could, for example, generate new examples via some interpolation of such positions. In this section we focus on generative AEs; Generative Adversarial Networks are described later, in Section V.

The idea of using an AE as a generative model is to estimate the density Pdata, or P(x), that generates the data. For example, in [61] the authors show that when training DAEs with a procedure that corrupts the data x with noise using a conditional distribution C(x̃|x), the DAE is trained to estimate the reverse conditional P(x|x̃) as a side effect. Combining this estimator with the known corruption process, they show that it is possible to obtain an estimator of P(x) through a Markov chain that alternates between sampling from the reconstruction distribution P(x|x̃) (decode), applying the stochastic corruption procedure C(x̃|x) (corrupt), and iterating.

– Variational Autoencoders: in addition to the use of CAEs and DAEs for generating examples, Variational Autoencoders (VAEs) emerged as an important method for unsupervised learning of complex distributions and are used as generative models [62]. VAEs were introduced as a constrained version of traditional autoencoders that learn an approximation of the true distribution, and they are currently among the most used generative models [63], [64]. Although VAEs share little mathematical basis with CAEs and DAEs, their structure also contains an encoder and a decoder module, thus resembling a traditional autoencoder. In summary, a VAE uses latent variables z that can be seen as a space of rules enforcing a valid example sampled from P(x). In practice, it tries to find a function Q(z|x) (the encoder) which takes a value x and gives as output a distribution over the z values that are likely to produce x.

To illustrate a training-time VAE implemented as a feed-forward network, we show a simple example in Figure 14. In this example P(x|z) is given by a Gaussian distribution, and therefore we have a set of two parameters θ = {µ, Σ}. Two loss functions are employed: (1) a Kullback-Leibler (KL) divergence LKL between the distribution formed by the estimated parameters N(µ(x), Σ(x)) and N(0, I), in which I is the identity matrix – note that it is possible to obtain a maximum likelihood estimate by minimizing the KL divergence; and (2) a squared error loss between the generated example f(z) and the input example x.

Note that this VAE is basically trying to learn the parameters θ = {µ, Σ} of a Gaussian distribution and, at the same time, a function f(), so that when we sample from the distribution it is possible to use f() to generate a new sample. At test time, we just sample z, input it to the decoder and obtain a generated example f(z) (see the dashed red line in Figure 14). For the reader interested in this topic, we refer to [62] for a detailed description of VAEs.

Figure 14. An example of training a VAE as a feed-forward network with a Gaussian distributed P(x|z), as in [62]: the encoder Q outputs µ(x) and Σ(x); a code is sampled as z ∼ N(µ(x), Σ(x)); the decoder P produces f(z); the losses are LKL(N(µ(x), Σ(x)), N(0, I)) and ‖x − f(z)‖². Blue boxes indicate loss functions; the dashed line highlights the test-time module, in which we sample z and generate f(z).
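As a concrete illustration of the two losses described above, the sketch below implements a VAE training objective with the usual reparameterization trick. It is a minimal sketch under our own assumptions (diagonal Gaussian posterior, MLP encoder/decoder, PyTorch) and not the exact model of [62].

```python
import torch
from torch import nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=20):
        super().__init__()
        self.enc = nn.Linear(input_dim, 400)      # encoder Q(z|x)
        self.mu = nn.Linear(400, latent_dim)      # mean mu(x)
        self.logvar = nn.Linear(400, latent_dim)  # log of diagonal Sigma(x)
        self.dec = nn.Sequential(                 # decoder P(x|z)
            nn.Linear(latent_dim, 400), nn.ReLU(),
            nn.Linear(400, input_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # reparameterization: z = mu + sigma * eps, with eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    rec = F.mse_loss(x_rec, x, reduction='sum')            # ||x - f(z)||^2
    # closed-form KL( N(mu, Sigma) || N(0, I) )
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

model = VAE()
x = torch.rand(16, 784)
x_rec, mu, logvar = model(x)
vae_loss(x, x_rec, mu, logvar).backward()
```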

V. GENERATIVE ADVERSARIAL NETWORKS (GAN)

Generative Adversarial Networks were recently introduced by Goodfellow et al. [65] as a framework for building generative models that are capable of learning, through back-propagation, an implicit probability density function from a set of data samples. One highlight of GANs is that they perform learning and sampling without relying on Markov chains and with scalable performance.

A. Generative Models

Given a training set made of samples drawn from a distribution Pdata, a generative model learns to produce a probability distribution Pmodel, or to sample from one such distribution, that represents an estimation of its Pdata counterpart.

One way a generative model can work is by using the principle of maximum likelihood (ML). In this context it is assumed that the estimated probability distribution Pmodel is parameterized by a set of parameters θ and that, for any given θ, the likelihood L is the joint probability of all samples from the training data x "happening" on the Pmodel distribution:

L(x, θ) = ∏_{i=1}^{m} p_model(x_i; θ).

Training a model to achieve maximum likelihood is then one way of estimating the distribution Pdata that generated the training samples x. This has similarities with training a VAE via minimization of a Kullback-Leibler divergence (which is a way to achieve ML), as shown in Section IV-C. GANs are not based on the ML principle by default but can, for the sake of comparison, be altered to do so.

Generative models based on this principle can be divided into two main groups: (a) models that learn an explicit density function and (b) models that can draw samples from an implicit but not directly specified density function. GANs fall into the second group. There is, however, a plethora of methods from the first group that are currently used in the literature and, in many cases, have incorporated deep learning into their respective frameworks. Fully Visible Belief Networks were introduced by Frey et al. [66], [67] and have recently featured as the basis of WaveNet [68], a state-of-the-art generative model for raw audio. Boltzmann Machines [69] and their counterparts, Restricted Boltzmann Machines (RBMs), are a family of generative models that rely on Markov chains for Pdata approximation and were a key component in demonstrating the potential of deep learning for modern applications [14], [70]. Finally, autoencoders can also be used for generative purposes, in particular denoising, contractive and variational autoencoders, as described previously in Section IV-C.

GANs are currently applied not only to image generation but also in other contexts, for example image-to-image translation and visual style transfer [71], [72].

B. Generator/Discriminator

In the GAN framework, two models are trained simultaneously: one is a generator model that, given an input z, produces a sample x from an implicit probability distribution Pg; the other is the discriminator model, a classifier that, given an input x, produces a single scalar (a label) determining whether x was produced from Pdata or from Pg. It is possible to think of the generator as a money counterfeiter, trained to fool the discriminator, which can be thought of as the police force. The police are always improving their counterfeit detection techniques, at the same time as the counterfeiters are improving their ability to produce better counterfeits.

Formally, GANs are a structured probabilistic model with latent variables z and observed variables x. The two models are represented by functions whose only requirement is that both must be differentiable. The generator is a function G that takes z as input and uses a list of parameters θ(G) to output the so-called observed variables x. The discriminator is, similarly, a function D that takes x as input and uses a list of parameters θ(D) to output a single scalar value, the label. The function D is optimized (trained) to assign the correct labels to both training data and data produced by G, while G itself is trained to minimize correct assignments of D for the data produced by G. There have been many developments in the literature regarding the choice of cost functions for training both G and D. The general framework can be seen as a diagram in Figure 15.

Figure 15. Generative Adversarial Networks framework.

Given that the only requirement for the functions G and D is that both must be differentiable, it is easy to see how MLPs and CNNs fit the place of both.

C. GAN Training

Training GANs is a relatively straightforward process. At each minibatch, values are sampled from both the training set and the set of random latent variables z and, after one step through the networks, one of the many gradient-based optimization methods discussed in Section III-I is applied to update G and D based on their loss functions L(G) and L(D).

D. Loss Functions

The original proposal [65] is that D and G play a two-player minimax game with the value function V(G, D):

min_G max_D V(G, D) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_g(z)}[log(1 − D(G(z)))],

which in turn leads to the following loss functions:

L(D)(θ(D), θ(G)) = −(1/2) E_{x∼p_data(x)}[log D(x)] − (1/2) E_z[log(1 − D(G(z)))]

and, in the minimax game, the generator loss is simply its negative, L(G)(θ(D), θ(G)) = −L(D)(θ(D), θ(G)).

These loss functions and the ones derived from them are interesting for their theoretical background: Goodfellow et al. showed in [65] that learning in this context is reducible to minimizing a Jensen-Shannon divergence between the data and the model distribution, and that it converges if both G and D are convex functions updated in function space. Unfortunately, since in practice G and D are represented by deep neural networks (non-convex functions) and updates are made in parameter space, this theoretical guarantee does not apply.

There is then at least one case where the minimax-based cost does not perform well and does not converge to equilibrium. In the minimax game, if the discriminator ever reaches a point where it is performing its task successfully, the gradient of L(D) approaches zero; but, alarmingly, so does the gradient of L(G), leading to a situation where the generator stops updating. As suggested by Goodfellow et al. [73], one approach to this problem is to use a variation of the minimax game, a heuristic non-saturating game. Instead of defining L(G) simply as the opposite of L(D), we can change it to represent the discriminator's correct hits regarding samples generated by G:

L(G)(θ(D), θ(G)) = −(1/2) E_z[log D(G(z))].

In this game, rather than minimizing the log-probability of the discriminator being correct as in the minimax game, the generator maximises the log-probability of the discriminator being mistaken, which is intuitively the same idea but without the vanishing gradient.
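The sketch below shows one gradient update of this non-saturating game. The MLP generator/discriminator, their sizes and the use of PyTorch are our own illustrative assumptions, not details given in the text.

```python
import torch
from torch import nn

# Tiny MLP generator and discriminator; sizes are illustrative assumptions.
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()     # binary cross-entropy on the discriminator logit

x_real = torch.rand(32, 784)     # a minibatch from the training set
z = torch.randn(32, 64)          # latent variables z

# Discriminator step, L(D): label real data as 1 and generated data as 0.
d_loss = bce(D(x_real), torch.ones(32, 1)) + \
         bce(D(G(z).detach()), torch.zeros(32, 1))
opt_D.zero_grad(); d_loss.backward(); opt_D.step()

# Generator step (non-saturating L(G)): maximize log D(G(z)),
# i.e. train G so that D labels its samples as real.
g_loss = bce(D(G(z)), torch.ones(32, 1))
opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```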

VI. SIAMESE AND TRIPLET NETWORKS

Whilst Deep Learning using CNNs has achieved huge success in classification tasks, end-to-end regression with DL is also an active field. Notable work in this area are the studies of Siamese and Triplet CNNs. The ultimate goal of such networks is to project raw signals into a lower dimensional space (a.k.a. embedding) where their compact representations are regressed so that similar signals are brought together and dissimilar ones are pushed away. These networks often consist of multiple CNN branches, depending on the number of training samples required by the cost function. The CNN branches can have their parameters either shared or not, depending on whether the task is intra- or cross-domain learning.

Figure 16. (a) Siamese and (b) triplet networks.

A. Siamese Networks with Contrastive Loss

A typical Siamese network has two identical CNN branches with an associated embedding function f(.). The standard cost function for a training exemplar (x1, x2), proposed by Hadsell et al. [74], is:

L(W, (Y, x1, x2)) = (1/2)(1 − Y)(D(W, x1, x2))² + (1/2) Y (max{0, m − D(W, x1, x2)})²,   (4)

where D(W, x1, x2) = ‖f(W, x1) − f(W, x2)‖₂. Y = 0 if (x1, x2) is a similar pair, and Y = 1 otherwise; m is a margin defining a desirable threshold for the distance between x1 and x2 if they are dissimilar. Variants of Equation 4 are studied in [75], [76]. Intra-domain regression with shared CNN branches has been employed for embedding digits [74], face verification [76], photo retrieval [77] and sketch-to-edgemap matching [75]. Cross-domain learning between sketches and 3D models was studied in [78], where each CNN branch has its own weights, updated independently during training.

B. Triplet Networks with Triplet Loss

A triplet network has three branches that accept a similar pair (anchor, positive) and additionally a dissimilar sample serving as the negative example. Unlike the contrastive loss function, which enforces a hard regularisation on the similar pair (minimising their distance toward zero), the triplet loss regulates a more flexible approach that only considers the relative difference between the distances of the similar and dissimilar pairs:

L(W, (xa, xp, xn)) = (1/2) max{0, m + ‖f(Wa, xa) − g(Wp, xp)‖₂² − ‖f(Wa, xa) − g(Wp, xn)‖₂²},   (5)

where (xa, xp, xn) is an input triplet, m is the margin, and Wa and Wp are the parameters of the anchor branch f(.) and of the positive/negative branch g(.) respectively. The positive and negative branches are always identical. The anchor branch is typically shared when learning an embedding within a domain, such as face recognition [79] and tracking [80]. For learning between completely different domains, such as sketch vs. photo in sketch-based image retrieval, the anchor branch (fetching the sketch) is kept unshared from the image branches [81]. For mapping among similar domains, like sketches and image edgemaps, the work in [54] shows that a hybrid half-shared model performs best. An intuitive illustration of triplet network learning for mapping between sketches and image edge maps can be seen in Figure 17.

Figure 17. Example of applying a triplet network to a multimodal retrieval problem. The goal in this example is to approximate sketches and natural images of the same category while keeping different categories apart; (a) represents the space as it could be before training and (b) the desired space after successful training. Triplet-loss based learning minimizes the distance between the anchor and positive samples and maximises the distance between the anchor and negative samples.

Due to its looser regularisation, it is often harder to train a triplet network than a Siamese one. Several variants of Equation 5 are proposed in [54], [80] to help triplet networks converge. Furthermore, it is still inconclusive whether or not triplet CNNs are better than Siamese ones: the contrastive loss is seen to outperform the triplet loss in [77], but marginally underperforms its counterpart in [54], [82]. Recently, there are also studies of another loss function that aims to overcome the drawbacks of both contrastive and triplet losses, as claimed in the work of [83].
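To make the two objectives concrete, the sketch below implements Equations 4 and 5 for batches of embeddings produced by the CNN branches. The function names and the PyTorch implementation are our own illustrative assumptions (PyTorch also ships related built-ins such as nn.TripletMarginLoss).

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, y, margin=1.0):
    """Eq. 4: y = 0 for similar pairs, y = 1 for dissimilar pairs."""
    d = F.pairwise_distance(emb1, emb2)                 # D(W, x1, x2)
    similar = 0.5 * (1 - y) * d.pow(2)
    dissimilar = 0.5 * y * torch.clamp(margin - d, min=0).pow(2)
    return (similar + dissimilar).mean()

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Eq. 5: pull the positive towards the anchor, push the negative away."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)       # squared L2 distances
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    return 0.5 * torch.clamp(margin + d_pos - d_neg, min=0).mean()

# Dummy 128-D embeddings standing in for the outputs of the CNN branches.
a, p, n = torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128)
y = torch.randint(0, 2, (8,)).float()
print(contrastive_loss(a, p, y), triplet_loss(a, p, n))
```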
VII. APPLICATIONS

Here we describe Computer Vision and Image Processing applications that use the methods detailed in the previous sections, reviewing some of the recent literature on each topic.

A. Visual Stylization

Deep CNNs have also benefited visual content stylization [84]. Echoing the transformation of visual recognition pipelines from hand-engineered to end-to-end trained solutions, deep learning has delivered a step change in our capability to learn artistic styles by example and to transfer that style to novel imagery, solving a major sub-problem within the Non-Photorealistic Rendering (NPR) field of computer graphics [85].

Stylized output had already been explored in projects such as Google's DeepDream [86], where backpropagation was applied to optimize an input image toward an image maximising a one-hot output on a discriminative network (such as GoogLeNet). Initialising such a network with white noise converges the input toward an optimal trigger image for the desired object category. More interestingly, initialising the optimization from a photograph (plus Gaussian additive, i.e. white, noise) will hallucinate a locally optimal image in which structures resembling the one-hot object are transformed to more closely represent that object.

Gatys et al. [87] went further, observing that such noisy images could be optimized to meet a dual objective: seeking a particular content (scene structure and composition) with a particular style (the appearance properties). Specifically, their 'Neural Styles' technique [87] accepts a pair of images (a photograph and an artwork) as input and seeks to hallucinate an image that matches the content of the input photograph and the style of the input artwork. A simplified structure of the technique is presented in Figure 18. The input image is again initialised from the photograph plus white noise, and in their work a pre-trained discriminative network (VGG19, trained over ImageNet) is used for the optimization. During optimization the network weights are fixed, but the image is converged to minimise the loss L_NS:

L_NS(x; p, a) = α L_content(p, x) + β L_style(a, x),

where x refers to the image being optimized, p refers to the photograph and a refers to the example style image. Weights α = 1, β = 100 are typically chosen to balance the two losses, which otherwise exhibit differing orders of magnitude.

The content loss compares features extracted from a late (semantic) layer of the VGG19 network; ideally these should match between p and x. The original formulation uses the conv4_2 layer of VGG19. Assuming F(.) extracts this image feature:

L_content = (1/2) Σ_{i,j} (F_{i,j}(p) − F_{i,j}(x))².

The style loss is computed using several Gram (inner-product) matrices, each computed from the feature response of a layer in VGG19. For a given layer l, the Gram matrix G^l can be computed as an inner product of the vectorised feature maps i and j within that layer (F^l_{i,j}):

G^l_{i,j}(x) = Σ_k F^l_{i,k}(x) F^l_{j,k}(x).

The style loss simply sums ‖G^l(x) − G^l(a)‖² over several layers l. In some implementations the sum is weighted to afford prominence to certain layers, but this has little practical outcome. The style loss is commonly aggregated over a set of VGG19 layers {conv1_1, conv2_1, ..., conv5_1}. The use of the Gram matrix to encode style similarity was the core insight of this work and fundamentally underpins all subsequent work on deep learning for stylization. For example, recent work seeking to stylize video notes that instability of the Gram matrix over time is directly correlated with temporal incoherence (flicker) and seeks to minimize it explicitly in the optimization [88].

Whilst optimization of the input image in this manner yields impressive aesthetics, the technique is slow. More recently, feed-forward networks have been used to stylize images in the order of seconds rather than minutes [89]. Although current feed-forward designs somewhat degrade style fidelity, their interactive speed has enabled video stylization and a slew of commercial (mobile app) software (e.g. Prisma, deepart.io) in the social media space. A discussion of feed-forward architectures for visual stylization is beyond the scope of this tutorial, but we note that contemporary architectures for video stylization integrate two-stream networks (again, pre-trained): one branch dealing with the image and the other with optical flow, with the latter integrated into the loss function to minimise flicker [88], [90].

Figure 18. Simplified structure of Gatys et al.'s 'Neural Styles' technique: the desired image, the style image and the content image are each passed through a VGG19 network, producing the style and content representations that define the Neural Styles loss.
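A minimal sketch of the Gram-matrix style comparison described above is given below. The feature extractor, the layer choice and the absence of per-layer weighting are placeholders for illustration, not the exact setup of [87].

```python
import torch

def gram_matrix(feat):
    """feat: feature map of shape (channels, height, width) from one layer."""
    c, h, w = feat.shape
    f = feat.view(c, h * w)            # vectorise each feature map
    return f @ f.t()                   # G[i, j] = sum_k F[i, k] * F[j, k]

def style_loss(feats_x, feats_a):
    """Sum of squared Gram-matrix differences over a set of layers."""
    loss = 0.0
    for fx, fa in zip(feats_x, feats_a):
        loss = loss + (gram_matrix(fx) - gram_matrix(fa)).pow(2).sum()
    return loss

# Dummy feature responses for two layers of a pre-trained CNN (e.g. VGG19).
feats_x = [torch.randn(64, 32, 32), torch.randn(128, 16, 16)]
feats_a = [torch.randn(64, 32, 32), torch.randn(128, 16, 16)]
print(style_loss(feats_x, feats_a))
```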
B. Image processing (pixels-to-pixels predictions)

Deep networks for computer vision, as seen so far, are usually designed for image classification and feature learning. In order to adapt such models to output images (instead of class labels or feature vectors), one must train the networks pixels-to-pixels. This kind of architecture is often called a Fully Convolutional or Image Processing Deep Network, from which it is possible to obtain, for example, semantic segmentation [91]–[93], depth estimation [94], [95] and optical flow [96].

Fully Convolutional Networks (FCNs) are defined as networks that operate on an input of any size and produce an output of corresponding (and possibly resampled) spatial dimensions. FCNs were proposed by [91] for semantic segmentation. The authors show that classification networks can be converted into FCNs that output coarse maps but, in order to provide pixel-wise prediction, those coarse maps must then be connected back to each input pixel. This is performed in [91] via a bilinear interpolation that computes each output from the nearest four inputs by a linear map that depends only on the relative positions of the input and output cells:

y_{i,j} = Σ_{α,β=0}^{1} |1 − α − {i/f}| |1 − β − {j/f}| x_{⌊i/f⌋+α, ⌊j/f⌋+β},

in which f is an upscaling factor and {.} returns the fractional part. Note that upsampling with a factor f can be seen as a convolution with a fractional input stride of 1/f. This operation can be used as an upsampling layer (also known as a deconvolution layer) and allows end-to-end learning by backpropagation of a pixelwise loss. A stack of deconvolution layers with activation functions can even learn some form of nonlinear upsampling.

An example of an FCN is shown in Figure 19, in which VGGNet-16 is adapted (see Figure 5 for the original model) by replacing the FC layers with conv layers and adding deconvolutional (upsampling) layers. To perform upsampling, the authors investigated three options: the first, FCN-32s, is a single-stream network that uses just the output of the final layer; the two others explore combinations of outputs from different layers: FCN-16s combines 2× the output of Conv.15 and 1× the output of MaxPool4, while FCN-8s combines 4× the output of Conv.15, 2× the output of MaxPool4 and 1× the output of MaxPool3. The FCN-8s networks obtained the best results on the PASCAL VOC 2011 segmentation dataset.

Figure 19. An FCN based on the VGGNet-16 architecture: the FC layers are replaced by convolutional layers (Conv8 to Conv15, with Max pooling 3 to 5 in between; layers between the input and Conv8 are omitted), followed by three upsampling variants: FCN-32s (32× upsampling of the Conv15 output), FCN-16s (16× upsampling, combining Conv15 and MaxPool4) and FCN-8s (8× upsampling, adding MaxPool3).
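The sketch below shows the two common ways the upsampling (deconvolution) layer described above is realised in practice: fixed bilinear interpolation and a learnable transposed (fractionally strided) convolution. It is an illustrative PyTorch sketch under our own assumptions, not the exact FCN code of [91].

```python
import torch
from torch import nn
import torch.nn.functional as F

coarse = torch.randn(1, 21, 16, 16)    # coarse per-class score map (21 classes)

# (a) Fixed bilinear interpolation: each output is computed from the
#     nearest four inputs, depending only on relative positions.
up_bilinear = F.interpolate(coarse, scale_factor=8, mode='bilinear',
                            align_corners=False)

# (b) Learnable "deconvolution": a transposed convolution with stride f
#     behaves like a convolution with fractional input stride 1/f and can
#     be trained end-to-end by backpropagation of a pixelwise loss.
deconv = nn.ConvTranspose2d(21, 21, kernel_size=16, stride=8, padding=4)
up_learned = deconv(coarse)

print(up_bilinear.shape, up_learned.shape)   # both upsample by a factor of 8
```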
Another example of an image processing network is a CNN-based method that aims to learn optical flow (FlowNet) [96], which works in a similar way. However, because optical flow needs to be computed over a pair of images (e.g. consecutive frames of a video), two architectures are proposed: the first stacks the pair of images, building an initial input with depth 6 (2× RGB images), and then uses an FCN; the second starts with two parallel streams of convolutions (3 conv. layers), one for each image, and has a fusion layer after the 3rd conv. layer based on a correlation operator.

Depth estimation from stereo image pairs without supervision was proposed by Garg et al. [94] using a Convolutional AE. To reconstruct the source image, an inverse warp of the target image is generated using the predicted depth and the known inter-view displacement. On the other hand, the problem of estimating depth from a single image still needs supervised learning to achieve good results. In [95], the authors first apply a superpixel algorithm, extracting patches of size 224 × 224 from the image using the superpixels as centroids. They input individual patches to a CNN composed of 5 conv. layers and 4 FC layers (which they call the unary part) and also compute similarities between superpixel neighbours to be fed to a parallel FC layer (the pairwise part). Finally, they use a Conditional Random Field (CRF) loss layer that allows them to jointly learn the unary and pairwise potentials of a continuous CRF. The pairwise part is responsible for enforcing smoothness by minimizing the distance between adjacent superpixels. Because each input of the unary part is a superpixel, it is possible to predict depth for each superpixel patch. They also explored FCNs pretrained on ImageNet with posterior fine-tuning in order to speed up training time.

C. Video processing and analysis

Video data includes both a spatial domain and a time domain. CNN architectures as presented in the previous sections are not able to model motion (the temporal dimension of a video), because they are usually designed to receive as input an image taken at a given time. In principle it is possible to process a video frame-by-frame, i.e. performing spatial processing only, as in the YOLO method, which achieved static object detection at 155 fps [97]. However, the temporal aspect of a video carries crucial information for applications such as activity recognition [98], classification [99] and anomaly detection [8], [100], among others. In order to use Deep Learning, we need to input information related to more than one frame so that spatial-temporal features are learned. Here we describe two possible approaches to address this problem: the first uses more than one network stream, which is similar to the ideas of Siamese and Triplet networks (see Section VI) and of pixels-to-pixels predictions (Section VII-B); the second approach explores 3D convolutions, as we describe next.
– Multi-stream networks: deep networks for video often use several streams of layers to process video data in the form of both images and optical flow. The Two-Stream Network was proposed by [98] and also used in [99]. In this framework, two CNN streams are employed: the first (spatial stream) takes video frames as input, while the second CNN (temporal stream) takes as input the Optical Flow computed from pairs of frames. Since each Optical Flow results in a vertical and a horizontal component, the input of the temporal CNN is composed of 2L input channels; in [98] they report L = 10 as giving the best results. Each stream has its own softmax classifier and respective loss function. To obtain the class scores, a fusion layer (using the sum) combines the outputs of the two streams, from which an overall training loss is computed and backpropagation is performed. In [99], the authors use an additional LSTM layer on top of the Two-Stream network in order to store information over time and explore long-range dynamics.

In the context of anomaly detection, in which we only have examples from the normal class, a three-stream autoencoder was proposed to extract appearance, motion and joint representations using frames and optical-flow maps [8]. Those representations are then used to train a One Class SVM classifier that is used to detect anomalies in video data.

– Convolutional 3D: the second approach considers sequences of frames as input and creates networks that can jointly model the spatial and temporal dimensions. To do so, 3D convolutions and 3D pooling layers are used [101]. To prove the effectiveness of such approach, the authors developed an architecture named Convolutional 3D (C3D) network. From an architectural point of view, C3D has 8 convolutional, 5 max pooling, 2 FC and one softmax layers, with 0.5 dropout on FC6 and FC7. Similarly to the VGG networks, C3D uses small convolutional kernels (3 × 3 × 3) to reduce the number of weights to be learned, and 3D pooling layers with sizes 1 × 2 × 2 (Pooling 1) and 2 × 2 × 2 (Pooling 2–5). The network takes as input 3 × 16 × 224 × 224 values (16 sequential frames, each with 224 × 224 resolution). Note that applying a 3D convolution to a video volume results in another volume feature map, which preserves the temporal information to be further processed by the next layers. As the authors show, C3D learns representations of appearance and motion: the activation patterns of the first C3D layers refer to appearance in salient frames and thereafter to frames containing salient motion. Although it is able to take as input only fragments of videos containing 16 frames, the method shows state-of-the-art results for action recognition on some challenging datasets (e.g. Sports-1M [102] and UCF101 [103]).
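A minimal sketch of such a 3D convolution block is given below (PyTorch), using the 3 × 3 × 3 kernels and the pooling sizes quoted in the C3D description above; the channel widths are our own illustrative choice.

```python
import torch
from torch import nn

# One C3D-style block: 3D convolutions with 3x3x3 kernels and 3D pooling.
# Input shape: (batch, channels=3, frames=16, height=224, width=224).
block = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),             # Pooling 1: 1x2x2
    nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d(kernel_size=(2, 2, 2)))             # Pooling 2: 2x2x2

clip = torch.randn(1, 3, 16, 224, 224)               # 16 RGB frames
out = block(clip)
print(out.shape)   # (1, 128, 8, 56, 56): the output is still a volume,
                   # so temporal information is preserved for later layers
```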

VIII. LIMITATIONS OF DEEP LEARNING

Deep learning models for computer vision have achieved statistically impressive results. As we show throughout this paper, the idea of stacking layers of operators that are trained to learn hierarchical representations is useful to many applications and contexts: by using different architectures such as CNNs, AEs or GANs, simple or more complex architectures such as Siamese, Triplet and Multi-stream networks, and their combinations, it is possible to implement a vast number of strategies to solve applications.

There are, however, clear limitations with this kind of approach. Recall that neural networks (including deep networks) are basically a way to learn a series of transformations applied to an input vector so that a loss function is minimized. In the case of classification, for example, the network learns how to transform input examples so that different classes are mapped to different regions of the output space. This transformation is given by a large set of weights (parameters) that are updated during a training stage so that the loss is minimized. In order to allow training, the map between input and output must be differentiable, i.e. the transformation function must be continuous and ideally smooth, which represents a significant constraint [21].

Whilst deep networks need very large annotated datasets, with hundreds or thousands of digits drawn by different people to learn a concept of abstraction such as a numeric digit, human beings are able to learn the same concept just by seeing a few examples drawn by one person only. We also easily adapt those concepts over time.
Figure 20. Adversarial examples that are visually unrecognizable, but are classified with more than 99% confidence by Deep Learning methods, as in [104].

Although generative models are a step towards this direction, since they represent an attempt to model the distribution that generates the data categories, they are still a long way from the human learning capacity.

Even considering tasks that are in principle possible to address via Deep Learning, there are many studies showing that perturbations in the input data, such as noise, can significantly impact the results [44]. An interesting paper showed that deep networks can even fail to classify inverted images [105]. It is possible to create images that appear to be just random noise and are unrecognizable to humans, but that CNNs believe to belong to some known class with more than 99% certainty [104], as in the examples shown in Figure 20. Based on those unrecognizable patterns, it is possible to design small perturbations (around 4%) added to images so that the output of such methods is completely wrong [106]. In Figure 21 we adapt a famous example by adding a cheetah gradient feature to an owl image so that the change is imperceptible to humans, but the image is then classified with high confidence as a cheetah by a Deep Network. In order to alleviate such fragility to attacks, recent studies recommend adding adversarial examples to the training set or increasing the capacity (in terms of number of filters/nodes per layer), increasing the number of parameters and therefore the complexity of the space of functions allowed by the model [107], [108]. Therefore, although the performance of such methods can be impressive when considering overall results, the outputs can be individually unreliable.

Figure 21. Adversarial image generated by adding a cheetah gradient feature to a picture of an owl, which is then classified with high confidence as a cheetah by a state-of-the-art Deep Network.

If we analyze those already mentioned scenarios from the point of view of Statistical Learning Theory [109], we could hypothesize that those methods are in fact learning a memory-based model. They work well in several cases because they are trained with up to millions of examples, thus unseen data that is similar to the training data will eventually fall inside the memorized regions. Although recent work including tensor analysis of representations [110] and uncertainty estimation [111] has tried to shed light on the learning behaviour of deep networks, more studies on the theoretical aspects of Deep Learning are still needed.

Nevertheless, the positive impact of Deep Learning in Computer Vision and Image Processing is undeniable. By understanding its limitations and inner workings, researchers can explore those methods for many years to come and continue to develop exciting results for new applications.

ACKNOWLEDGMENT

The authors would like to thank FAPESP (grants #2016-16411-4, #2017/10068-2, #2015/04883-0, #2013/07375-0) and the Santander Mobility Award.

REFERENCES

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.

[2] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.

[3] K. Fukushima, "Neocognitron: A hierarchical neural network capable of visual pattern recognition," Neural Networks, vol. 1, no. 2, pp. 119–130, 1988.

[4] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[5] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, 2006, pp. 2169–2178.

[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, 2012, pp. 1106–1114.

[7] L. Zhang, L. Zhang, and B. Du, "Deep learning for remote sensing data: A technical tutorial on the state of the art," IEEE Geoscience and Remote Sensing Magazine, vol. 4, no. 2, pp. 22–40, 2016.

[8] D. Xu, E. Ricci, Y. Yan, J. Song, and N.
Sebe, “Learning deep representations of appearance and motion for anoma- point of view of Statistical Learning Theory [109], we lous event detection.” in British Matchine Vision Conference could hypothesize that those methods are in fact learning (BMVC), X. Xie, M. W. Jones, and G. K. L. Tam, Eds. the memory-based model. They work well in several cases BMVA Press, 2015. [9] M. Ravanbakhsh, M. Nabi, H. Mousavi, E. Sangineto, [25] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, and N. Sebe, “Plug-and-play CNN for crowd motion “Densely connected convolutional networks,” in Proceedings analysis: An application in abnormal event detection,” of the IEEE Conference on Computer Vision and Pattern CoRR, vol. abs/1610.00307, 2016. [Online]. Available: Recognition, 2017. http://arxiv.org/abs/1610.00307 [26] V. Nair and G. E. Hinton, “Rectified linear units improve [10] I. B. Barbosa, M. Cristani, B. Caputo, A. Rognhaugen, and restricted boltzmann machines,” in Proceedings of the 27th T. Theoharis, “Looking beyond appearances: Synthetic train- International Conference on Machine Learning (ICML-10), ing data for deep cnns in re-identification,” arXiv preprint J. Furnkranz¨ and T. Joachims, Eds. Omnipress, 2010, pp. arXiv:1701.03153, 2017. 807–814.

[11] A. Fischer and C. Igel, “Training restricted boltzmann ma- [27] K. He, X. Zhang, S. Ren, and J. Sun, Identity Mappings chines: An introduction,” Pattern Recognition, vol. 47, no. 1, in Deep Residual Networks. Cham: Springer International pp. 25–39, 2014. Publishing, 2016, pp. 630–645. [Online]. Available: https: //doi.org/10.1007/978-3-319-46493-0 38 [12] H. Larochelle, M. Mandel, R. Pascanu, and Y. Bengio, “Learning algorithms for the classification restricted boltz- [28] M. Ponti, E. S. Helou, P. J. S. Ferreira, and N. D. Mas- mann machine,” Journal of Machine Learning Research, carenhas, “Image restoration using gradient iteration and IEEE Journal of Selected vol. 13, no. Mar, pp. 643–669, 2012. constraints for band extrapolation,” Topics in , vol. 10, no. 1, pp. 71–80, 2016.

[13] R. Salakhutdinov and G. Hinton, “Deep boltzmann ma- [29] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into chines,” in Artificial Intelligence and , 2009, pp. rectifiers: Surpassing human-level performance on 448–455. classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034. [14] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, [30] J. T. Springenberg, A. Dosovitskiy, T. Brox, and no. 7, pp. 1527–1554, 2006. M. Riedmiller, “Striving for simplicity: The all convolutional net,” in ICLR (workshop track), 2015. [15] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, [Online]. Available: http://lmb.informatik.uni-freiburg.de/ Z. Su, D. Du, C. Huang, and P. Torr, “Conditional random Publications/2015/DB15a fields as recurrent neural networks,” in International Con- ference on Computer Vision (ICCV), 2015. [31] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Muller,¨ “Efficient backprop,” in Neural networks: Tricks of the trade. [16] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recog- Springer, 2012, pp. 9–48. nition with deep recurrent neural networks,” in IEEE In- ternational Conference on Acoustics, Speech and Signal [32] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating Processing (ICASSP), 2013, pp. 6645–6649. deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. [17] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty 448–456. of training recurrent neural networks,” in International Con- ference on Machine Learning, 2013, pp. 1310–1318. [33] M. Ponti, J. Kittler, M. Riva, T. de Campos, and C. Zor, “A decision cognizant Kullback–Leibler divergence,” Pattern [18] R. Gonzalez and R. Woods, Processing, Recognition, vol. 61, pp. 470–478, 2017. 3rd ed. Pearson, 2007. [34] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient [19] S. Ben-David and S. Shalev-Shwartz, Understanding Ma- methods for online learning and stochastic optimization,” chine Learning: from theory to algorithms. Cambridge, Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2014. 2121–2159, 2011. [35] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” [20] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. arXiv preprint arXiv:1212.5701, 2012. MIT Press, 2016, http://www.deeplearningbook.org. [36] D. Kingma and J. Ba, “Adam: A method for stochastic [21] F. Chollet, Deep Learning with Python. Manning, 2017. optimization,” in 3rd International Conference on Learning Representations (ICLR), 2015. [22] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. [37] M. Li, T. Zhang, Y. Chen, and A. J. Smola, “Efficient mini- abs/1409.1556, 2014. batch training for stochastic optimization,” in Proceedings of the 20th ACM SIGKDD international conference on [23] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning Knowledge discovery and . ACM, 2014, pp. for image recognition,” CoRR, vol. abs/1512.03385, 2015. 661–670.

[24] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, [38] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization “Rethinking the inception architecture for computer vision,” methods for large-scale machine learning,” arXiv preprint CoRR, vol. abs/1512.00567, 2015. arXiv:1606.04838, 2016. [39] P. Goyal, P. Dollar, R. Girshick, P. Noordhuis, [51] M. Ponti, T. S. Nazare,´ and G. S. Thume,´ “Image quanti- L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, zation as a dimensionality reduction procedure in color and and K. He, “Accurate, large minibatch SGD: texture feature extraction,” Neurocomputing, vol. 173, pp. Training imagenet in 1 hour.” [Online]. Available: 385–396, 2016. https://research.fb.com/publications/imagenet1kin1h/ [52] Y. Kalantidis and Y. Avrithis, “Locally optimized product [40] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, quantization for approximate ,” in and R. R. Salakhutdinov, “Improving neural networks by Proceedings of the IEEE Conference on Computer Vision preventing co-adaptation of feature detectors,” arXiv preprint and Pattern Recognition, 2014, pp. 2321–2328. arXiv:1207.0580, 2012. [53] T. Bui, L. Ribeiro, M. Ponti, and J. Collomosse, “Compact [41] D. Warde-Farley, I. J. Goodfellow, A. Courville, and descriptors for sketch-based image retrieval using a triplet Y. Bengio, “An empirical analysis of dropout in piecewise loss convolutional neural network,” Computer Vision and linear networks,” in ICLR 2014, 2014. [Online]. Available: Image Understanding, 2017. arXivpreprintarXiv:1312.6197 [54] ——, “Generalisation and sharing in triplet convnets [42] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, for sketch based visual search,” arXiv preprint “Rethinking the inception architecture for computer vision,” arXiv:1611.05301, 2016. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826. [55] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Trans. [43] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, “Return of the devil in the details: Delving deep into Aug. 2013. convolutional nets,” in British Conference (BMVC), 2014, p. 1405.3531. [56] J. Masci, U. Meier, D. Cires¸an, and J. Schmidhuber, “Stacked convolutional auto-encoders for hierarchical fea- Artificial Neural Networks and Machine [44] T. Nazare, G. Paranhos da Costa, W. Contato, and M. Ponti, ture extraction,” Learning–ICANN 2011 “Deep convolutional neural networks and noisy images,” in , pp. 52–59, 2011. Iberoamerican Conference on Pattern Recognition (CIARP), [57] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. 2017. Manzagol, “Stacked denoising autoencoders: Learning use- ful representations in a deep network with a local denoising [45] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, criterion,” Journal of Machine Learning Research, vol. 11, “Inception-v4, inception-resnet and the impact of residual no. Dec, pp. 3371–3408, 2010. connections on learning.” in AAAI, 2017, pp. 4278–4284. [58] G. Alain and Y. Bengio, “What regularized auto-encoders [46] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. learn from the data-generating distribution,” The Journal of Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy Machine Learning Research, vol. 15, no. 1, pp. 3563–3593, with 50x fewer parameters and¡ 0.5 mb model size,” in 2014. ICLR, 2016. [59] S. Rifai, G. Mesnil, P. Vincent, X. 
Muller, Y. Bengio, [47] H.-C. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, Y. Dauphin, and X. Glorot, “Higher order contractive auto- J. Yao, D. Mollura, and R. M. Summers, “Deep convolu- encoder,” Machine Learning and Knowledge Discovery in tional neural networks for computer-aided detection: Cnn , pp. 645–660, 2011. architectures, dataset characteristics and transfer learning,” IEEE transactions on , vol. 35, no. 5, pp. [60] B. Scholkopf,¨ A. Smola, and K.-R. Muller,¨ “Nonlinear 1285–1298, 2016. component analysis as a kernel eigenvalue problem,” Neural computation, vol. 10, no. 5, pp. 1299–1319, 1998. [48] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn features off-the-shelf: An astounding baseline for [61] Y. Bengio, L. Yao, G. Alain, and P. Vincent, “Generalized recognition,” in Proceedings of the 2014 IEEE Conference denoising auto-encoders as generative models,” in Advances on Computer Vision and Pattern Recognition Workshops, in Neural Information Processing Systems, 2013, pp. 899– ser. CVPRW ’14. Washington, DC, USA: IEEE Computer 907. Society, 2014, pp. 512–519. [62] C. Doersch, “Tutorial on variational autoencoders,” arXiv [49] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning preprint arXiv:1606.05908, 2016. and transferring mid-level image representations using con- volutional neural networks,” in Proceedings of the IEEE [63] D. J. Rezende, S. Mohamed, and D. Wierstra, conference on computer vision and pattern recognition, “Stochastic backpropagation and approximate inference 2014, pp. 1717–1724. in deep generative models,” in Proceedings of the 31st International Conference on Machine Learning, ser. [50] S. M. Erfani, S. Rajasegarar, S. Karunasekera, and C. Leckie, Proceedings of Machine Learning Research, E. P. Xing and “High-dimensional and large-scale anomaly detection using T. Jebara, Eds., vol. 32, no. 2. Bejing, China: PMLR, a linear one-class svm with deep learning,” Pattern Recog- 22–24 Jun 2014, pp. 1278–1286. [Online]. Available: nition, vol. 58, pp. 121–134, 2016. http://proceedings.mlr.press/v32/rezende14.html [64] D. P. Kingma, “Fast Gradient-Based Inference with [77] F. Radenovic,´ G. Tolias, and O. Chum, “Cnn image re- Continuous Latent Variable Models in Auxiliary Form,” jun trieval learns from bow: Unsupervised fine-tuning with hard 2013. [Online]. Available: http://arxiv.org/abs/1306.0733 examples,” in European Conference on Computer Vision. Springer, 2016, pp. 3–20. [65] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, [78] F. Wang, L. Kang, and Y. Li, “Sketch-based 3d shape re- “Generative adversarial nets,” in Advances in Neural Infor- trieval using convolutional neural networks,” in Proceedings mation Processing Systems 27, 2014. of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1875–1883. [66] B. J. Frey, G. E. Hinton, and P. Dayan, “Does the wake-sleep algorithm produce good density estimators?” in Advances in [79] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A Neural Information Processing Systems 8, D. S. Touretzky, unified embedding for face recognition and clustering,” in M. C. Mozer, and M. E. Hasselmo, Eds. MIT Press, 1996, Proceedings of the IEEE Conference on Computer Vision pp. 661–667. and Pattern Recognition, 2015, pp. 815–823.

[67] B. J. Frey, Graphical models for machine learning [80] X. Wang and A. Gupta, “Unsupervised learning of visual and digital communication. MIT Press, 1998. representations using videos,” in Proceedings of the IEEE [Online]. Available: https://mitpress.mit.edu/books/ International Conference on Computer Vision, 2015, pp. graphical-models-machine-learning-and-digital-communication 2794–2802.

[68] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, [81] P. Sangkloy, N. Burnell, C. Ham, and J. Hays, “The sketchy O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, database: learning to retrieve badly drawn bunnies,” ACM and K. Kavukcuoglu, “WaveNet: A Generative Model Transactions on Graphics (TOG), vol. 35, no. 4, p. 119, for Raw Audio,” sep 2016. [Online]. Available: http: 2016. //arxiv.org/abs/1609.03499 [82] E. Hoffer and N. Ailon, “Deep metric learning using triplet [69] S. E. Fahlman, G. E. Hinton, and T. J. Sejnowski, network,” in International Workshop on Similarity-Based “Massively parallel architectures for ai: Netl, thistle, Pattern Recognition. Springer, 2015, pp. 84–92. and boltzmann machines,” in Proceedings of the Third AAAI Conference on Artificial Intelligence, ser. AAAI’83. [83] O. Rippel, M. Paluri, P. Dollar,´ and L. D. Bourdev, “Metric AAAI Press, 1983, pp. 109–113. [Online]. Available: learning with adaptive density discrimination,” CoRR, vol. http://dl.acm.org/citation.cfm?id=2886844.2886868 abs/1511.05939, 2015. [70] G. E. Hinton, “Learning multiple layers of representation,” [84] P. Rosin and J. C. (Eds.), Image and Video based Artistic pp. 428–434, 2007. [Online]. Available: http://www.cs. Stylization. Springer, 2013. toronto.edu/{∼}fritz/absps/tics.pdf

[71] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired [85] J.-E. Kyprianidis, J. Collomosse, T. Wang, and T. Isenberg, image-to-image translation using cycle-consistent adver- “State of the ’art’: A taxonomy of artistic stylization tech- sarial networks,” in ICLR 2017, 2017, p. arXiv preprint niques for images and video,” IEEE Trans. Visualization and arXiv:1703.10593. Comp. Graphics (TVCG), 2012.

[72] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to- [86] A. Mordvintsev, M. Tyka, and C. Olah, “Deepdream,” https: image translation with conditional adversarial networks,” in //github.com/google/deepdream, 2015. CVPR 2017, 2017, p. arXiv preprint arXiv:1611.07004. [87] L. Gatys, A. Ecker, and M. Bethge, “A neural algorithm of [73] I. Goodfellow, “NIPS 2016 Tutorial: Generative Adversarial artistic style,” arXiv preprint arXiv:1508.06576, 2015. Networks,” dec 2016. [Online]. Available: http://arxiv.org/ abs/1701.00160 [88] D. Chen, J. Liao, L. Yuan, N. Yu, and G. Hua, “Coherent online video style transfer,” in Proc. Intl. Conf. Computer [74] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality Vision (ICCV), 2017. reduction by learning an invariant mapping,” in Computer vision and pattern recognition, 2006 IEEE computer society [89] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky, conference on, vol. 2. IEEE, 2006, pp. 1735–1742. “Texture networks: Feed-forward synthesis of textures and stylized images,” in Proc. Intl. Conf. Machine Learning [75] Y. Qi, Y.-Z. Song, H. Zhang, and J. Liu, “Sketch-based (ICML), 2016. image retrieval via siamese convolutional neural network,” in Image Processing (ICIP), 2016 IEEE International Con- [90] A. Gupta, J. Johnson, A. Alahi, and L. Fei-Fei, “Stability ference on. IEEE, 2016, pp. 2460–2464. in neural style transfer: Characterization and application,” in Proc. Intl. Conf. Computer Vision (ICCV), 2017. [76] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verifica- [91] E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional tion,” in Computer Vision and Pattern Recognition, 2005. networks for semantic segmentation,” IEEE Transactions on CVPR 2005. IEEE Computer Society Conference on, vol. 1. Pattern Analysis and Machine Intelligence, vol. 39, no. 4, IEEE, 2005, pp. 539–546. pp. 640–651, 2017. [92] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A [105] H. Hosseini and R. Poovendran, “Deep neural net- deep convolutional encoder-decoder architecture for scene works do not recognize negative images,” arXiv preprint segmentation,” IEEE transactions on pattern analysis and arXiv:1703.06857, 2017. machine intelligence, 2017. [106] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. [93] L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and Celik, and A. Swami, “The limitations of deep learning A. L. Yuille, “Semantic image segmentation with task- in adversarial settings,” in IEEE European Symposium on specific using cnns and a discriminatively Security and Privacy (EuroS&P). IEEE, 2016, pp. 372– trained domain transform,” in Proceedings of the IEEE 387. Conference on Computer Vision and Pattern Recognition, 2016, pp. 4545–4554. [107] A. Rozsa, M. Gunther,¨ and T. E. Boult, “Are accuracy and robustness correlated,” in Machine Learning and Applica- [94] R. Garg, G. Carneiro, and I. Reid, “Unsupervised cnn for tions (ICMLA), 2016 15th IEEE International Conference single view depth estimation: to the rescue,” in on. IEEE, 2016, pp. 227–232. European Conference on Computer Vision. Springer, 2016, pp. 740–756. [108] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to ad- [95] F. Liu, C. Shen, G. Lin, and I. Reid, “Learning depth from versarial attacks,” arXiv preprint arXiv:1706.06083, 2017. single monocular images using deep convolutional neural fields,” IEEE Transactions on Pattern Analysis and Machine [109] V. 
Vapnik, The nature of statistical learning theory. Intelligence, vol. 38, no. 10, pp. 2024–2039, 2016. Springer science & business media, 2013. [96] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, [110] N. Cohen, O. Sharir, and A. Shashua, “On the expressive V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, power of deep learning: A tensor analysis,” in Conference “Flownet: Learning optical flow with convolutional net- on Learning Theory, 2016, pp. 698–728. works,” in Proceedings of the IEEE International Confer- ence on Computer Vision, 2015, pp. 2758–2766. [111] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approx- [97] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You imation: Representing model uncertainty in deep learning,” only look once: Unified, real-time object detection,” in in International Conference on Machine Learning, 2016, pp. Proceedings of the IEEE Conference on Computer Vision 1050–1059. and Pattern Recognition, 2016, pp. 779–788.

[98] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems, 2014, pp. 568–576.

[99] Z. Wu, X. Wang, Y.-G. Jiang, H. Ye, and X. Xue, “Modeling spatial-temporal clues in a hybrid deep learning framework for video classification,” in Proceedings of the 23rd ACM international conference on Multimedia. ACM, 2015, pp. 461–470.

[100] M. Ponti, T. S. Nazare, and J. Kittler, “Optical-flow fea- tures empirical mode decomposition for motion anomaly detection,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 1403–1407.

[101] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE International Con- ference on Computer Vision, 2015, pp. 4489–4497.

[102] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with con- volutional neural networks,” in CVPR, 2014.

[103] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” CoRR, vol. abs/1212.0402, 2012.

[104] A. Nguyen, J. Yosinski, and J. Clune, “Deep neural networks are easily fooled: High confidence predictions for unrecog- nizable images,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 427–436.