Everything You Wanted to Know About Deep Learning for Computer Vision but Were Afraid to Ask
Total Page:16
File Type:pdf, Size:1020Kb
Everything you wanted to know about Deep Learning for Computer Vision but were afraid to ask Moacir A. Ponti, Leonardo S. F. Ribeiro, Tiago S. Nazare Tu Bui, John Collomosse ICMC – University of Sao˜ Paulo CVSSP – University of Surrey Sao˜ Carlos/SP, 13566-590, Brazil Guildford, GU2 7XH, UK Email: [ponti, leonardo.sampaio.ribeiro, tiagosn]@usp.br Email: [t.bui, j.collomosse]@surrey.ac.uk Abstract—Deep Learning methods are currently the state- (AEs), started appearing as a basis for state-of-the-art meth- of-the-art in many Computer Vision and Image Processing ods in several computer vision applications (e.g. remote problems, in particular image classification. After years of sensing [7], surveillance [8], [9] and re-identification [10]). intensive investigation, a few models matured and became important tools, including Convolutional Neural Networks The ImageNet challenge [1] played a major role in this (CNNs), Siamese and Triplet Networks, Auto-Encoders (AEs) process, starting a race for the model that could beat the and Generative Adversarial Networks (GANs). The field is fast- current champion in the image classification challenge, but paced and there is a lot of terminologies to catch up for those also image segmentation, object recognition and other tasks. who want to adventure in Deep Learning waters. This paper In order to accomplish that, different architectures and has the objective to introduce the most fundamental concepts of Deep Learning for Computer Vision in particular CNNs, combinations of DBM were employed. Also, novel types of AEs and GANs, including architectures, inner workings and networks – such as Siamese Networks, Triplet Networks and optimization. We offer an updated description of the theoretical Generative Adversarial Networks (GANs) – were designed. and practical knowledge of working with those models. After Deep Learning techniques offer an important set of meth- that, we describe Siamese and Triplet Networks, not often covered in tutorial papers, as well as review the literature ods suited for tasks within the domains of digital visual on recent and exciting topics such as visual stylization, pixel- content. It is noticeable that those methods comprise a wise prediction and video processing. Finally, we discuss the diverse variety of models, components and algorithms that limitations of Deep Learning for Computer Vision. may be dissimilar to what one may be used to in Computer Keywords-Computer Vision; Deep Learning; Image Process- Vision. A myriad of keywords within this context makes the ing; Video Processing literature in this field almost a different language: Feature maps, Activation, Receptive Fields, Dropout, ReLu, Max- I. INTRODUCTION Pool, Softmax, SGD, Adam, FC, Generator, Discriminator, The field of Computer Vision was revolutionized in the Shared Weights, etc. This can make it hard for a beginner past years (in particular after 2012) due to Deep Learning to understand and catch up with the recent studies. techniques. This is mainly due to two reasons: the availabil- There are DL methods we do not cover in this paper, ity of labelled image datasets with millions of images [1], including Deep Belief Networks (DBN), Deep Boltzmann [2], and computer hardware that allowed to speed-up compu- Machines (DBM) and also those using Recurrent Neural tations. Before that, there were studies exploring hierarchical Networks (RNN), Reinforcement Learning and Long short- representations with neural networks such as Fukushima’s term memory (LSTMs). We refer to [11]–[14] for DBN and Neocognitron [3] and LeCun’s neural networks for digit DBM-related studies, and [15]–[17] for RNN-related studies. recognition [4]. Although those techniques were known to The paper is organized as follows: Machine Learning and Artificial Intelligence communities, the efforts of Computer Vision researchers during the 2000’s • Section II provides definitions and prerequisites. was in a different direction, in particular using approaches • Section III aims to present a detailed and updated based on Scale-Invariant features, Bag-of-Features, Spatial description of the DL’s main terminology, building Pyramids and related methods [5]. blocks and algorithms of the Convolutional Neural After the publication of the AlexNet Convolutional Neural Network (CNN) since it is widely used in Computer Network Model [6], many research communities realized the Vision, including: power of such methods and, from that point on, Deep Learn- 1) Components: Convolutional Layer, Activation ing (DL) invaded the Visual Computing fields: Computer Function, Feature/Activation Map, Pooling, Nor- Vision, Image Processing, Computer Graphics. Convolu- malization, Fully Connected Layers, Parameters, tional Neural Networks (CNNs), Deep Belief Nets (DBNs), Loss Function; Restricted Boltzmann Machines (RBMs) and Autoencoders 2) Algorithms: Optimization (SGD, Momentum, Adam, etc.) and Training; Deep Learning often involves learning hierarchical repre- 3) Architectures: AlexNet, VGGNet, ResNet, Incep- sentations using a series of layers that operate by processing tion, DenseNet; an input generating a series of representations that are 4) Beyond Classification: fine-tuning, feature extrac- then given as input to the next layer. By having spaces of tion and transfer learning. sufficiently high dimensionality, such methods are able to • Section IV describes Autoencoders (AEs): capture the scope of the relationships found in the original 1) Undercomplete AEs; data so that it finds the “right” representation for the task 2) Overcomplete AEs: Denoising and Contractive; at hand [21]. This can be seen as separating the multiple 3) Generative AE: Variational AEs manifolds that represent data via a series of transformations. • Section V introduces Generative Adversarial Net- III. CONVOLUTIONAL NEURAL NETWORKS works (GANs): Convolutional Neural Networks (CNNs or ConvNets) 1) Generative Models: Generator/Discriminator; are probably the most well known Deep Learning model 2) Training: Algorithm and Loss Lunctions. used to solve Computer Vision tasks, in particular im- • Section VI is devoted to Siamese and Triplet Net- age classification. The basic building blocks of CNNs are works:. convolutions, pooling (downsampling) operators, activation 1) SiameseNet with contrastive Loss; functions, and fully-connected layers, which are essentially 2) TripletNet: with triplet Loss; similar to hidden layers of a Multilayer Perceptron (MLP). • Section VII reviews Applications of Deep Networks in Architectures such as AlexNet [6], VGG [22], ResNet [23] Image Processing and Computer Vision, including: and GoogLeNet [24] became very popular, used as subrou- 1) Visual Stilization; tines to obtain representations that are then offered as input 2) Image Processing and Pixel-Wise Prediction; to other algorithms to solve different tasks. 3) Networks for Video Data: Multi-stream and C3D. We can write this network as a composition of a sequence of functions fl(:) (related to some layer l) that takes as input • Section VIII concludes the paper by discussing the a vector x and a set of parameters W , and outputs a vector Limitations of Deep Learning methods for Computer l l x : Vision. l+1 f(x) = fL (··· f2(f1(x1;W1); W2) ··· );WL) II. DEEP LEARNING: PREREQUISITES AND DEFINITIONS The prerequisites needed to understand Deep Learning x1 is the input image, and in a CNN the most charac- for Computer Vision includes basics of Machine Learning teristic functions fl(:) are convolutions. Convolutions and (ML) and Image Processing (IP). Since those are out of the other operators works as building blocks or layers of a scope of this paper we refer to [18], [19] for an introduction CNN: activation function, pooling, normalization and linear in such fields. We also assume the reader is familiar with combination (produced by the fully connected layer). In the Linear Algebra and Calculus, as well as Probability and next sections we will describe each one of those building Optimization — a review of those topics can be found in blocks. Part I of Goodfellow et al. textbook on Deep Learning [20]. A. Convolutional Layer Machine learning methods basically try to discover a model (e.g. rules, parameters) by using a set of input data A layer is composed of a set of filters, each to be points and some way to guide the algorithm in order to learn applied to the entire input vector. Each filter is nothing but from this input. In supervised learning, we have examples a matrix k × k of weights (or values) wi. Each weight is a of expected output, whereas in unsupervised learning some parameter of the model to be learned. We refer to [18] for assumption is made in order to build the model. However, an introduction about convolution and image filters. in order to achieve success, it is paramount to have a Each filter will produce what can be seen as an affine good representation of the input data, i.e. a good set of transformation of the input. Another view is that each filter features that will produce a feature space in which an produces a linear combination of all pixel values in a algorithm can use its bias in order to learn. One of the neighbourhood defined by the size of the filter. Each region main ideas of Deep learning is to solve the problem of that the filter processes is called local receptive field: an finding this representation by learning it from the data: it output value (pixel) is a combination of the input pixels defines representations that are expressed in terms of other, in this local receptive field (see Figure 1). That makes the simpler ones [20]. Another is to assume that depth (in terms convolutional layer different from layers of an MLP for of successive representations) allows learning a sequence of example; in a MLP each neuron will produce a single output parallel instructions that transforms an initial input vector, based on all values from the previous layer, whereas in a mapping one space to another. convolutional layer, an output value f(i; x; y) is based on tanh(x) logistic(x) 1 1 x -1 x (a) hyperbolic tangent (b) logistic max[0; x] max[ax; x]; a = 0:1 1 1 Figure 1.