1.1. From neural networks to deep learning

François Fleuret
https://fleuret.org/dlc/

Many applications require the automatic extraction of “refined” information from the raw signal (e.g. image recognition, automatic speech processing, natural language processing, robotic control, geometry reconstruction).

(ImageNet)

Our brain is so good at interpreting visual information that the “semantic gap” is hard to assess intuitively.


This: [image of a horse] is a horse.

>>> import torch
>>> from torchvision.datasets import CIFAR10
>>> cifar = CIFAR10('./data/cifar10/', train=True, download=True)
Files already downloaded and verified
>>> x = torch.from_numpy(cifar.data)[43].permute(2, 0, 1)
>>> x[:, :4, :8]
tensor([[[ 99,  98, 100, 103, 105, 107, 108, 110],
         [100, 100, 102, 105, 107, 109, 110, 112],
         [104, 104, 106, 109, 111, 112, 114, 116],
         [109, 109, 111, 113, 116, 117, 118, 120]],

        [[166, 165, 167, 169, 171, 172, 173, 175],
         [166, 164, 167, 169, 169, 171, 172, 174],
         [169, 167, 170, 171, 171, 173, 174, 176],
         [170, 169, 172, 173, 175, 176, 177, 178]],

        [[198, 196, 199, 200, 200, 202, 203, 204],
         [195, 194, 197, 197, 197, 199, 200, 201],
         [197, 195, 198, 198, 198, 199, 201, 202],
         [197, 196, 199, 198, 198, 199, 200, 201]]], dtype=torch.uint8)
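For contrast with these raw pixel values, torchvision's CIFAR10 object also stores the semantic label of each image as an integer index into a list of class names; continuing the session above (the printed label for image #43 is not shown here), the entire semantic content of the image reduces to a single word:

>>> cifar.classes
['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
>>> cifar.classes[cifar.targets[43]]  # the one-word label associated with image #43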


Extracting semantics automatically requires models of extreme complexity, which cannot be designed by hand.

Techniques used in practice consist of

1. defining a parametric model, and
2. optimizing its parameters by “making it work” on training data.

This is similar to biological systems, for which the model (e.g. brain structure) is DNA-encoded, and parameters (e.g. synaptic weights) are tuned through experiences.

Deep learning encompasses software technologies to scale up to billions of model parameters and as many training examples.
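As a minimal sketch of the two steps above (a hypothetical toy example in PyTorch, not taken from the lecture): define a small parametric model, then tune its parameters by gradient descent so that it fits training data.

import torch

# 1. Define a parametric model: here a single linear mapping from R^10 to R.
model = torch.nn.Linear(10, 1)

# Synthetic training data (placeholders for illustration only).
x = torch.randn(1000, 10)
y = x @ torch.randn(10, 1) + 0.1 * torch.randn(1000, 1)

# 2. Optimize the parameters by "making it work" on the training data.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for _ in range(100):
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()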


There are strong connections between standard statistical modeling and machine learning.

Classical ML methods combine a “learnable” model from statistics (e.g. “linear regression”) with prior knowledge in pre-processing.

“Artificial neural networks” pre-dated these approaches, and do not follow this dichotomy. They consist of “deep” stacks of parametrized processing.
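A minimal sketch of the classical dichotomy (a hypothetical example, not from the lecture): prior knowledge enters through a hand-designed feature map, and only the final linear regression is learned from data.

import numpy as np

# Prior knowledge: an engineer decides which summary statistics of the raw
# signal are informative (the "pre-processing").
def features(x):
    return np.stack([x.mean(axis=1),
                     x.var(axis=1),
                     x.max(axis=1) - x.min(axis=1)], axis=1)

x_raw = np.random.randn(500, 64)   # raw signals (hypothetical placeholders)
y = np.random.randn(500)           # targets (hypothetical placeholders)

phi = features(x_raw)                          # pre-processing with prior knowledge
w, *_ = np.linalg.lstsq(phi, y, rcond=None)    # the "learnable" part: a linear model
y_hat = phi @ w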

From artificial neural networks to “Deep Learning”


Networks of “Threshold Logic Unit”

(McCulloch and Pitts, 1943)
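Such a unit outputs 1 when a weighted sum of its binary inputs reaches a threshold, and 0 otherwise. A minimal sketch (with illustrative weights chosen so the unit computes a logical AND, not taken from the original paper):

import torch

def threshold_logic_unit(x, w, b):
    # Fires (outputs 1) when the weighted sum of the binary inputs reaches the threshold.
    return (x @ w + b >= 0).int()

x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
w = torch.tensor([1., 1.])   # illustrative weights
b = torch.tensor(-2.)        # threshold chosen so the unit fires only when both inputs are 1
print(threshold_logic_unit(x, w, b))   # tensor([0, 0, 0, 1], dtype=torch.int32)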


Frank Rosenblatt working on the Mark I perceptron (1956)

1949 – Donald Hebb proposes the Hebbian Learning principle (Hebb, 1949).
1951 – Marvin Minsky creates the first ANN (Hebbian learning, 40 neurons).
1958 – Frank Rosenblatt creates a perceptron to classify 20 × 20 images.
1959 – David H. Hubel and Torsten Wiesel demonstrate orientation selectivity and columnar organization in the cat’s visual cortex (Hubel and Wiesel, 1962).
1982 – Paul Werbos proposes back-propagation for ANNs (Werbos, 1981).

[Figures from Fukushima (1980): Fig. 1, the correspondence between the hierarchy model of Hubel and Wiesel and the neural network of the neocognitron; Fig. 2, a schematic diagram of the interconnections between layers in the neocognitron; Fig. 3, the input interconnections to the cells within a single cell-plane.]

This model follows Hubel and Wiesel’s results.

(Fukushima, 1980)

Network for the T-C problem

Trained with back-prop.

(Rumelhart et al., 1988)

LeNet family

(LeCun et al., 1989)
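As a rough sketch of what a LeNet-style convolutional network looks like in modern PyTorch code (an illustrative reconstruction for 28×28 grayscale inputs, not the exact 1989 architecture):

import torch.nn as nn

lenet_like = nn.Sequential(
    # Convolutional feature extraction with sub-sampling.
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Tanh(), nn.AvgPool2d(2),
    nn.Conv2d(6, 16, kernel_size=5),           nn.Tanh(), nn.AvgPool2d(2),
    # Fully connected classifier on the resulting 16 x 5 x 5 feature maps.
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
    nn.Linear(120, 84),         nn.Tanh(),
    nn.Linear(84, 10),
)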

AlexNet

[Figure 2 of Krizhevsky et al. (2012): the two-GPU AlexNet architecture, in which each GPU runs part of each layer and the GPUs communicate only at certain layers. The network’s input is 150,528-dimensional, and the remaining layers contain 253,440–186,624–64,896–64,896–43,264–4096–4096–1000 neurons.]

(Krizhevsky et al., 2012)

GoogLeNet (Szegedy et al., 2015) and ResNet (He et al., 2015)

[Figures: “GoogLeNet network with all the bells and whistles” (Szegedy et al., 2015), and “Example network architectures for ImageNet” comparing VGG-19, a 34-layer plain network, and a 34-layer residual network in which shortcut connections add each block’s input back to its output (He et al., 2015).]
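The key idea recoverable from the ResNet figure is the identity shortcut: a block outputs its input plus a learned residual, y = x + F(x). A minimal sketch of such a block (an illustrative implementation, not the exact blocks of He et al.):

import torch.nn as nn

class ResidualBlock(nn.Module):
    # Computes y = relu(x + F(x)), where F is two 3x3 convolutions with batch norm.
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.f(x))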

Figure 1: The Transformer - model architecture. (Vaswani et al., 2017)

From Vaswani et al. (2017):

3.1 Encoder and Decoder Stacks

Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.

Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
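As a sketch of the sub-layer pattern described above (a simplified illustration that ignores dropout and other details of the paper):

import torch.nn as nn

class SublayerConnection(nn.Module):
    # Wraps a given sub-layer as LayerNorm(x + Sublayer(x)).
    def __init__(self, d_model=512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        return self.norm(x + sublayer(x))

One encoder layer applies such a wrapper first around multi-head self-attention and then around the position-wise feed-forward network.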

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
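The paragraph above describes attention abstractly; as a concrete sketch, the scaled dot-product form used in the paper can be written in a few lines of PyTorch (a minimal illustration, not the reference implementation):

import math
import torch

def attention(q, k, v):
    # Weights come from the compatibility of the query with each key (scaled dot
    # products passed through a softmax); the output is the weighted sum of the values.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(2, 5, 64)   # (batch, queries, d_k)
k = torch.randn(2, 7, 64)   # (batch, keys, d_k)
v = torch.randn(2, 7, 64)   # (batch, keys, d_v)
out = attention(q, k, v)    # shape (2, 5, 64)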

Deep learning is built on a natural generalization of a neural network: a graph of tensor operators, taking advantage of

• the chain rule (aka “back-propagation”),

• stochastic gradient descent,

• convolutions,

• parallel operations on GPUs.

This does not differ much from networks from the 90s.
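A minimal PyTorch illustration of these ingredients (a toy example, not from the lecture): a small graph of tensor operators, the chain rule applied automatically by autograd, and one stochastic gradient descent step.

import torch

w = torch.randn(3, 3, requires_grad=True)   # the parameters
x = torch.randn(5, 3)                       # a random mini-batch (the "stochastic" part)
loss = torch.tanh(x @ w).pow(2).mean()      # a small graph of tensor operators

loss.backward()                             # chain rule, i.e. back-propagation
with torch.no_grad():
    w -= 0.1 * w.grad                       # one stochastic gradient descent step
    w.grad.zero_()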

This generalization makes it possible to design complex networks of operators dealing with images, sound, text, sequences, etc., and to train them end-to-end.

(Tran et al., 2020)

ImageNet Large Scale Visual Recognition Challenge.

1000 categories, > 1M images

(http://image-net.org/challenges/LSVRC/2014/browse-synsets)

[Plot: ILSVRC classification error over the years 2010–2017, with the error axis running from 0 to 60 and a horizontal line marking human performance.]

(Gershgorn, 2017)

The end

References

K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, April 1980.
D. Gershgorn. The data that transformed AI research—and possibly the world, July 2017.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
D. O. Hebb. The organization of behavior: A neuropsychological theory. Wiley, 1949. ISBN 0-8058-4300-0.
D. Hubel and T. Wiesel. Receptive fields, binocular interaction, and functional architecture in the cat’s visual cortex. Journal of Physiology, 160:106–154, 1962.
A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems (NIPS), 2012.
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.
D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Neurocomputing: Foundations of Research, chapter Learning Representations by Back-propagating Errors, pages 696–699. MIT Press, 1988.
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
A. Tran, A. Mathews, and L. Xie. Transform and tell: Entity-aware news image captioning. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 13035–13045, 2020.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In Proceedings of the 10th IFIP Conference, pages 762–770, 1981.