
Advanced Analytics in Business [D0S07a]
Big Data Platforms & Technologies [D0S06a]
Opening the Black Box (Part 2)

Overview

Artificial neural networks
Deep learning
Opening the black box (part 2)

2 Artificial Neural Networks

3 The basic idea

Very loosely based on how the human brain works: a collection of mathematical “neurons” is created and connected together, allowing them to send signals to each other. Next, the network is asked to solve a problem, which it attempts to do over and over, each time strengthening the connections that lead to success and diminishing those that lead to failure.

4 A brief history

https://beamandrew.github.io/deeplearning/2017/02/23/deep_learning_101_part1.html

5 A brief history

An electronic brain (1940s): since the dawn of computing, researchers have been thinking about the idea of an “intelligent”, perhaps even “conscious” machine. Alan Turing laid out several criteria to assess whether a machine could be said to be intelligent in Computing Machinery and Intelligence (the Turing test).

The perceptron (1950s): early work in machine learning was inspired by the (then) working theories on the human brain. Frank Rosenblatt kickstarts the field by introducing the “perceptron”: a simplified mathematical representation of a neuron. He was convinced that this would quickly lead to true AI.

6 A brief history

“The embryo of an electronic computer that the Navy expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.” – The New York Times, based on Rosenblatt’s statements on the perceptron

Into the AI winter (1960s-80s): Marvin Minsky, considered one of the fathers of AI, is less convinced. Along with Seymour Papert, Minsky wrote a book entitled Perceptrons that in effect ended the optimism around the perceptron: they showed that the perceptron was incapable of learning the simple exclusive-or (XOR) function.

7 A brief history

Backpropagation to the rescue (1980s-90s): interest returns to neural networks. Geoff Hinton shows that neural networks with many hidden layers (i.e. consisting of more than one perceptron) could be effectively trained by a relatively simple procedure, called “backpropagation”. Such networks have the ability to learn any function, a result known as the “universal approximation theorem”, and with that, neural networks were hot again. The idea of using multiple perceptrons in a layered fashion was not new, though it was unclear how exactly such networks could be “trained”. The backpropagation algorithm works by starting from a network’s “error” and “back-propagating” it throughout all its layers to adjust their parameters. This leads to some early successes: multi-layer perceptrons (MLP), and the first convolutional neural networks (CNNs) to recognize handwritten digits by Yann LeCun at AT&T (“LeNet”).

A second AI winter (1990s-early 2000s): the MLP approach didn’t scale well to larger problems, and computing power was lacking for larger networks. Meanwhile, by the 90s, the support vector machine (SVM) was rapidly taking center stage as the method of choice, and neural networks were left behind once again. The field shifts its angle to be a lot more theoretical in nature.

8 A brief history

Deep learning (early 2000s): around 2006, Hinton introduces “unsupervised pretraining” and “deep belief nets”. Train a simple 2-layer unsupervised model, freeze all its parameters, add a new layer on top and just train the parameters for the new layer. Keep adding and training layers until you have a “deep network”.

Deep learning unleashed (2010s): based on Hinton’s work, more and more research papers began to take form. In 2010, a large database known as “ImageNet”, containing millions of labeled images, was created and published by a research group at Stanford. This database was coupled with the annual Large Scale Visual Recognition Challenge (LSVRC), where contestants would build computer vision models. In the first two years of the contest, the top models had error rates of about 25%. In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton entered a submission that would halve the error rate! It combined several critical components that would go on to become mainstays in deep learning models: the use of graphics processing units (GPUs) to train the model, a method to reduce overfitting known as dropout, and the rectified linear activation unit (ReLU). The network went on to become known as “AlexNet” and the paper describing it has been cited nearly 60000 times since it was published.

Deep learning all around (2020s): many innovations would follow after this result.

Appearance of large, high-quality labeled datasets
Massively parallel computing with GPUs, TPUs
New activation functions, improved architectures
Software support
New regularization techniques: dropout, batch normalization, data augmentation to protect against overfitting
New optimizers: from stochastic gradient descent (SGD) to RMSprop, ADAM and others
Focus returns to practice, experiments, empirical approaches

9 Foundations: the perceptron

A perceptron works over a number of numerical inputs: the input neurons or input units. Each neuron has an output: its “activation”. For the input neurons, the activation is simply given by the input data instance.

Every input unit’s output is connected to another neuron (the perceptron unit): inputs are multiplied with a weight and summed. A bias can also be added with an input unit that always outputs 1. The final output of the perceptron unit is equal to the result of an “activation function” over the weighted sum of the inputs.

(This can also be expressed as two operations: a combination/transfer function and an activation function)

10 Foundations: the perceptron

11 So, how to train it?

Feedforward is easy: just plug in the inputs and feed them through the network, obtaining an output. For simple statistical models (for example, linear regression), closed-form analytical formulas to determine the optimum parameter estimates exist. For nonlinear models, such as neural networks, the parameter estimates can be determined numerically using an iterative algorithm.

12 So, how to train it?

Say we use L = (y − ŷ)² to see how good our prediction is for an instance

This is our error or loss function

Weights for this simple perceptron can be updated using: w_{i,t+1} = w_{i,t} + η(y − ŷ)x_i

∂L/∂w_i = ∂(y − ŷ)²/∂w_i = [∂(y − ŷ)²/∂(y − ŷ)] × [∂(y − ŷ)/∂w_i] = 2(y − ŷ) × ∂(y − ŷ)/∂w_i = 2(y − ŷ)(−1) × ∂ŷ/∂w_i

Let o = w₁x₁ + w₂x₂ + b, then:

∂L/∂w_i = 2(y − ŷ)(−1) × ∂σ(o)/∂o × ∂o/∂w_i = 2(y − ŷ)(−1) × σ(o)(1 − σ(o)) × x_i

13 So, how to train it?

Gradient descent: w_{i,t+1} = w_{i,t} − η ∂L/∂w_i

η is the learning rate

∂L/∂w_i = 2(y − ŷ)(−1)σ(o)(1 − σ(o))x_i = −2(y − ŷ)σ(o)(1 − σ(o))x_i

Weights for this simple perceptron can hence be updated using: w_{i,t+1} = w_{i,t} + η(y − ŷ)x_i (the constant factor 2 and the positive σ(o)(1 − σ(o)) term are folded into the learning rate here, as in the classic delta rule)

14 So, how to train it?

We’re nudging the weights in order to minimize the error or “loss”

The learning rate η determines the “speed” of the convergence:

Higher: quicker towards the minimum, but risk of overshooting
Lower: slower towards the minimum, risk of getting trapped in local minima

15 So, how to train it?

The loss is a function of the weights given a piece of training data

Minimize the error using the gradient of the loss function. Gradient descent is the process of minimizing a function by following the gradients of the cost function. This involves knowing the form of the cost as well as its derivative, so that from a given point you know the gradient and can move downhill towards the minimum value.

(Stochastic) gradient descent

One iteration: one instance fed forward, weights are updated
One epoch: one full pass over all instances in the training set
To properly train our perceptron, we need to perform multiple passes over the training set: multiple epochs
Typically, the training set is reshuffled after every epoch

16 So, how to train it?

import math

# Toy data set: two numerical inputs per instance, binary target
X = [[0, 1], [1, 0], [2, 2], [3, 4], [4, 2], [5, 2], [4, 1], [5, 0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

weights = [0, 0, 0]  # Initialize the weights (first weight is the bias)

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def predict(instance, weights):
    output = weights[0]
    for i in range(len(weights) - 1):
        output += weights[i + 1] * instance[i]
    return sigmoid(output)

def train(instance, weights, y_true, l_rate=0.01):
    prediction = predict(instance, weights)
    error = y_true - prediction
    weights[0] = weights[0] + l_rate * error
    for i in range(len(weights) - 1):
        weights[i + 1] = weights[i + 1] + l_rate * error * instance[i]
    return weights
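A minimal training loop over this data set might then look as follows (a sketch, not from the original slides; the epoch count mirrors the results shown next):

import random

def train_epochs(X, y, weights, n_epochs=2000, l_rate=0.01):
    for epoch in range(n_epochs):
        # One epoch = one full pass over the (reshuffled) training set
        indices = list(range(len(X)))
        random.shuffle(indices)
        for i in indices:
            weights = train(X[i], weights, y[i], l_rate)
    return weights

weights = train_epochs(X, y, weights)
for instance, y_true in zip(X, y):
    print(instance, y_true, '->', round(predict(instance, weights), 2))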

17 So, how to train it?

Without training, after 20 epochs, and after 2000 epochs:

Instance  y   ŷ (no training)  ŷ (20 epochs)  ŷ (2000 epochs)
[0, 1]    0   0.5              0.36           0.00
[1, 0]    0   0.5              0.57           0.10
[2, 2]    0   0.5              0.49           0.04
[3, 4]    0   0.5              0.42           0.01
[4, 2]    1   0.5              0.71           0.94
[5, 2]    1   0.5              0.79           0.99
[4, 1]    1   0.5              0.78           0.99
[5, 0]    1   0.5              0.89           0.99

18 However…

x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0

# Weights: [0.004697241052453581, -0.009743527387551375, -0.00476408160440969]

[0, 0] 0 -> 0.5011743081039458
[0, 1] 1 -> 0.4999832898620173
[1, 0] 1 -> 0.49873843109337934
[1, 1] 0 -> 0.4975474276853999

This is the XOR problem

19 So far, not very spectacular…

One neuron on its own is hardly a brain

Multilayer Perceptron (MLP): stack different neurons in layers

Input layer
Hidden layer(s)
Output layer
Connect all outputs with all inputs of the next layer (“fully connected” or “dense” architecture)

The question is now: how to train?

20 Backpropagation

We can’t use the same approach as we did for a single perceptron as we don’t know what the “true outcome” should be for the lower layers

This issue took quite some time to solve. Eventually, a method called “backpropagation” was devised to overcome this. There is some controversy regarding who should be credited with the discovery of backpropagation (e.g. see Griewank, A. (2012). Who invented the reverse mode of differentiation? https://www.math.uni-bielefeld.de/documenta/vol-ismp/52_griewank-andreas-b.pdf). It is basically the chain rule on first partial derivatives, applied in a step-by-step manner.

21 Backpropagation

Note that feedforward is still easy. We can also still compare the predicted output with the expected one, from which we can derive a loss value.

http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html


22 Backpropagation

The idea of backpropagation is to “back-propagate” the error through the network. Using this, we know how to shift the weights.

Using the chain rule of partial derivatives

http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html

Further information:

https://victorzhou.com/blog/intro-to-neural-networks/
http://www.emergentmind.com/neural-network
http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html
https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi (recommended!)
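To make the chain-rule mechanics concrete, here is a minimal numpy sketch (not from the slides) that backpropagates through a small 2-layer MLP with sigmoid activations on the XOR data from before:

import numpy as np

np.random.seed(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Small random initial weights; W1: 2 inputs -> 4 hidden, W2: 4 hidden -> 1 output
W1, b1 = np.random.randn(2, 4), np.zeros((1, 4))
W2, b2 = np.random.randn(4, 1), np.zeros((1, 1))

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

lr = 0.5
for epoch in range(10000):
    # Feedforward
    h = sigmoid(X @ W1 + b1)      # hidden activations
    y_hat = sigmoid(h @ W2 + b2)  # output activations

    # Backpropagation: chain rule applied layer by layer (loss L = 0.5 * (y_hat - y)^2)
    d_out = (y_hat - y) * y_hat * (1 - y_hat)  # dL/d(output pre-activation)
    d_hid = (d_out @ W2.T) * h * (1 - h)       # dL/d(hidden pre-activation)

    # Gradient descent updates
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_hid; b1 -= lr * d_hid.sum(axis=0, keepdims=True)

print(y_hat.round(2))  # should approach [[0], [1], [1], [0]]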

23 Automatic differentiation

The discussion regarding backpropagation reveals that the error function and activation functions should be differentiable

But some are not fully differentiable: e.g. loss functions involving absolute errors, or the ReLU activation function. See e.g. Margossian, C. C., A Review of Automatic Differentiation and its Efficient Implementation, or Baydin et al., Automatic Differentiation in Machine Learning: a Survey.

TensorFlow and many other deep learning frameworks utilize automatic differentiation

Not the same as symbolic or numeric (algebraic) differentiation: based on inspection of the computational graph which links operations over higher-rank matrices together. Many other implementations for automatic differentiation exist:

JAX: Autograd and XLA: https://github.com/google/jax
https://github.com/HIPS/autograd
https://www.juliadiff.org/
https://github.com/denizyuret/AutoGrad.jl
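As a small illustrative sketch (assuming JAX is installed; the loss function here is made up for the example), automatic differentiation gives you gradients of plain Python functions without deriving them by hand:

import jax
import jax.numpy as jnp

def loss(w, x, y):
    y_hat = jax.nn.sigmoid(jnp.dot(x, w))  # perceptron-style prediction
    return (y - y_hat) ** 2

grad_loss = jax.grad(loss)  # gradient of the loss w.r.t. the first argument, w
print(grad_loss(jnp.array([0.1, -0.2]), jnp.array([1.0, 2.0]), 1.0))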

24 Activation functions

Logistic (sigmoid): f(x) = 1/(1 + e^(−x))

Output between 0 and 1

Hyperbolic tangent (tanh): f(x) = (e^x − e^(−x))/(e^x + e^(−x))

Output between -1 and 1

Rectified Linear Unit (ReLU): f(x) = max(0, x)

Output between 0 and +∞ Many modifications exist, very common method

Others:

Linear: f(x) = x
Exponential: f(x) = e^x
Radial Basis Function (RBF)

25 Activation functions

In older approaches, commonly used activation functions were symmetric and had asymptotic ranges

E.g. hyperbolic tangent and sigmoid functions

Why has ReLU become so popular? It looks like a very simple activation function…

The key is that it avoids the vanishing gradient problem in deeper networks. Other activation functions lead to vanishing or – the opposite – exploding gradients. It was found that such traditional activation functions limit training for deeper neural networks with more layers. ReLU and friends avoid this problem (constant gradient).
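A quick numeric illustration (a sketch, not from the slides): the sigmoid's derivative is at most 0.25, so multiplying such factors across many layers shrinks the backpropagated signal, while ReLU's derivative is exactly 1 for positive inputs:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = 0.5
sig_grad = sigmoid(x) * (1 - sigmoid(x))  # ~0.235, always <= 0.25
relu_grad = 1.0                           # constant for x > 0

for n_layers in [1, 5, 10, 20]:
    # Rough upper bound on how the gradient signal scales with depth
    print(n_layers, sig_grad ** n_layers, relu_grad ** n_layers)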

26 The importance of initialization

Another way to solve this: better initialization

Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.

Good starting values for the weights are essential for good training

Prevent layer activation outputs from exploding or vanishing. If either occurs, loss gradients will either be too large or too small to flow backwards beneficially, and the network will take longer to converge, if it is able to do so at all.
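A minimal numpy sketch of these two schemes (following the cited papers; exact variants differ per implementation):

import numpy as np

def glorot_uniform(n_in, n_out):
    # Glorot & Bengio (2010): keep activation variance stable across layers
    limit = np.sqrt(6 / (n_in + n_out))
    return np.random.uniform(-limit, limit, size=(n_in, n_out))

def he_normal(n_in, n_out):
    # He et al. (2015): variance scaled by fan-in, suited for ReLU activations
    return np.random.randn(n_in, n_out) * np.sqrt(2 / n_in)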

An older approach is “preliminary training”

Use random starting weights (uniform or Gaussian) and train a few epochs. Use the best of the final values as the new starting values and continue with those.

27 Loss functions

Note: using more than one output neuron is possible as well (e.g. for a multi-class problem)

Note that the neurons in the hidden layers commonly use different activation functions than the output neurons. E.g. ReLU is common for the hidden layers; the activation function of the output layer depends on the task (regression, binary classification, multiclass). For multiclass, a “softmax” layer is added on top of the output neurons to normalize their outputs so they sum to 1.

For the perceptron model, a naïve error function was used

Many other error (or “loss”) functions exist as well…

For regression:

Mean squared error (MSE): E = (1/2) ∑_{i=1..N} (ŷ_i − y_i)²

For classification:

Cross entropy: E = − ∑_{i=1..N} ∑_{c=1..C} y_ic × ln(ŷ_ic)
Cross entropy for binary classification: E = − ∑_{i=1..N} (y_i × ln(ŷ_i) + (1 − y_i) × ln(1 − ŷ_i))
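As a sketch, the same losses in numpy (y_hat holding the predictions and y the, possibly one-hot encoded, targets):

import numpy as np

def mse(y, y_hat):
    return 0.5 * np.sum((y_hat - y) ** 2)

def cross_entropy(y, y_hat):
    # y and y_hat have shape (N, C); y is one-hot encoded
    return -np.sum(y * np.log(y_hat))

def binary_cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))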

Note that a great deal of research regarding new architectures consists of finding appropriate loss functions

28 (Stochastic) gradient descent

Normal gradient descent (“batch” gradient descent) presents all training instances to the network: one update of the weights follows, based on gradients averaged over the whole training set. Precise, but very time-consuming.

Stochastic gradient descent updates the weights after every instance. Quicker, but more sensitive to particular examples (looks like a “drunk walk” towards the minimum); you definitely need to shuffle the instances every epoch.

Most implementations hence use a “mini-batch” approach: shuffle the training set, present it in small batches, and update the weights after each mini-batch.
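Schematically, the mini-batch variant looks as follows (a sketch; update_fn is a hypothetical stand-in for whatever gradient-based weight update is used):

import numpy as np

def minibatch_sgd(X, y, weights, update_fn, batch_size=32, n_epochs=10):
    n = len(X)
    for epoch in range(n_epochs):
        idx = np.random.permutation(n)  # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            # One weight update per mini-batch, based on averaged gradients
            weights = update_fn(weights, X[batch], y[batch])
    return weights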

29 Backpropagation alternatives

Most implementations will use backpropagation, though other approaches to train an artificial neural network exist as well

Advanced nonlinear optimization algorithms:

Hessian-based Newton methods
Conjugate gradient
Levenberg-Marquardt

Genetic algorithm based
Even no training at all, e.g. https://weightagnostic.github.io/ (well, not training the weights…)

“The invention of backpropagation immediately elicited an outcry from some neuroscientists, who said it could never work in real brains.” – https://www.quantamagazine.org/artificial-neural-nets-finally-yield-clues-to-how-brains-learn-20210218/

30 Optimizers

Even when using backpropagation, different optimization strategies exist other than stochastic gradient descent

Momentum
RMSprop
Adagrad
Adadelta
Eve
Adabound
… (see http://cs231n.github.io/neural-networks-3/)

A lot of research is being put in this field

See https://vis.ensmallen.org/ for examples

31 Learning rate tuning

Recall: the learning rate determines the “speed” of the convergence

Higher: quicker towards the minimum, but risk of overshooting
Lower: slower towards the minimum, risk of getting trapped in local minima
Adaptive learning rate: start high and decrease over time
Momentum based: also prevents overshooting

Finding a good initial learning rate and tuning it during training is critical!

Plot the loss over a small batch for different learning rates, and use the best one to continue training (see e.g. https://github.com/psklight/keras_one_cycle_clr)

32 Learning rate tuning

Smith, L. N. (2017). Cyclical learning rates for training neural networks. https://arxiv.org/abs/1506.01186 (cyclical learning rate)

Cyclical learning rate: bounce the learning rate back and forth during training. An intuitive understanding of why CLR works comes from considering the loss function topology: the difficulty in minimizing the loss arises from saddle points rather than poor local minima (!). Saddle points have small gradients that slow the learning process; however, increasing the learning rate allows more rapid traversal of saddle point plateaus.

33 Learning rate tuning

Smith, L. N. (2018). A disciplined approach to neural network hyper-parameters: learning rate, batch size, momentum, and weight decay. https://arxiv.org/abs/1803.09820 (one cycle policy)

34 Preventing overfitting

Continuous training will continue to lower the error on the training set, but will eventually lead to overfitting (memorizing the training data)

As such, validation is crucial (commonly with a validation split)

Early stopping: stop training when validation error has reached its minimum level

Regularization (penalizing large weights) is another approach, as larger weights generally are a sign that overfitting is occurring
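In Keras, both ideas can be expressed directly (a sketch; the data set names are assumed from the examples further on):

from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l2

# Early stopping: halt training when the validation error stops improving
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Regularization: penalize large weights in a layer
layer = Dense(64, activation='relu', kernel_regularizer=l2(0.01))

# model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])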

35 Preventing overfitting

Dropout is another method: at each training stage, individual nodes are “dropped out” of the net with a given probability, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed.

Improves training and reduces node interactions. Forces the network to learn alternative pathways, e.g. enforces redundancy, leading to better generalization.

Batch normalization has also become popular: one often normalizes the input layer by adjusting and scaling the activations; if the input layer benefits from this, why not do the same for the values in the hidden layers, which are changing all the time?

Batch normalization reduces the amount by which the hidden unit values shift around (covariate shift). It allows using higher learning rates, because batch normalization makes sure that no activation goes really extreme. It also reduces overfitting because it has a slight regularization effect: similar to dropout, it adds some noise to each hidden layer’s activations.
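In Keras, both are just extra layers (a sketch, mirroring the MNIST MLP shown later):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization

model = Sequential([
    Dense(64, activation='relu', input_shape=(784,)),
    BatchNormalization(),  # normalize the hidden activations per mini-batch
    Dropout(0.5),          # randomly drop half of the units during training
    Dense(10, activation='softmax')
])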

36 Example

Let’s summarize a bit…

Using https://playground.tensorflow.org/

37 MLPs are already powerful

import tensorflow.keras as keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.datasets import mnist
from matplotlib import pyplot as plt
import numpy as np
from PIL import Image

(X_train, y_train), (X_test, y_test) = mnist.load_data()
print(X_train.shape)  # (60000, 28, 28)

plt.imshow(X_train[0], cmap='gray'); plt.show()
print(y_train[0])  # 5

X_train = X_train.astype('float32') / 255  # 60000 train samples
X_test = X_test.astype('float32') / 255    # 10000 test samples

num_classes = 10

y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

What’s the / 255 for?

38 MLPs are already powerful

model = Sequential()
model.add(Flatten(input_shape=(28, 28)))
model.add(Dense(8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))

model.summary()

model.compile(
    loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

39 MLPs are already powerful

batch_size = 128
epochs = 20

model.fit(
    X_train, y_train,
    batch_size=batch_size,
    epochs=epochs,
    verbose=2,
    validation_data=(X_test, y_test)
)

score = model.evaluate(X_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

# Epoch 1/20
# - 3s - loss: 0.9510 - acc: 0.7003 - val_loss: 0.4914 - val_acc: 0.8681
# Epoch 2/20
# - 2s - loss: 0.4345 - acc: 0.8772 - val_loss: 0.3707 - val_acc: 0.8979
# ...
# Epoch 20/20
# - 2s - loss: 0.2511 - acc: 0.9295 - val_loss: 0.2701 - val_acc: 0.9262
# Test loss: 0.27007879534959794
# Test accuracy: 0.9262

40 MLPs are already powerful, but how do they learn?

Layer (type)         Output Shape   Param #
flatten_1 (Flatten)  (None, 784)    0
dense_1 (Dense)      (None, 8)      6280 = 784 × 8 + 8
dense_2 (Dense)      (None, 8)      72 = 8 × 8 + 8
dense_3 (Dense)      (None, 10)     90 = 8 × 10 + 10

Total params: 6,442

41 MLPs are already powerful, but how do they learn?

42 MLPs are already powerful, but how do they learn?

[[0 0 0.016035 *0.983951* 0 0.000015 0 0 0 0]]

# This is a three?

43 Deep Learning

44 Deep what?

The deep in deep learning isn’t a reference to any kind of deeper understanding achieved by the approach, but stands for the idea of successive layers of representations

Other appropriate names for the field could have been:

Layered representations learning
Hierarchical representations learning
Differentiable function learning

Modern deep learning often involves tens or even hundreds of successive layers of representations

Enabled by the rise in computational power

Main contributions follow from architecture and loss functions.

“The goal is to create algorithms that can take in very unstructured data, like images, audio waves or text blocks (things traditionally very hard for computers to process) and predict the properties of those inputs.” – Andrew Ng

45 We’ll take a look at the following types

Convolutional neural networks: now
Generative adversarial networks: now (briefly)
Reinforcement learning: now (briefly)

Embeddings and representational learning: when we discuss text mining
Recurrent neural networks: when we discuss text mining (briefly)
Graph convolutional networks and graph neural networks: when we discuss graph mining (briefly)

46 Convolutional neural networks (CNNs)

Our “deep” MLP does pretty well on a simple data set

Black-and-white images
Small
Only 10 classes

How about a data set with pictures of 1000 classes? (Cats, dogs, cars, boats, …)

Increase number of layers? Hidden units? Lots of weights to train!

47 Convolutional neural networks (CNNs)

In 2010, a large database known as “ImageNet” containing millions of labeled images was created and published by a research group at Stanford. In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton entered a submission that would halve the error rate. This model combined several critical components. Probably the most important piece was the use of graphics processing units (GPUs) to train the model. They also introduced a method to reduce overfitting known as dropout, and used the rectified linear activation unit (ReLU). The network went on to become known as “AlexNet”.

And even before this:

First convolutional neural networks (CNNs) to recognize handwritten digits by Yann Lecun at AT&T Bell Labs (“LeNet”)

48 Convolutional neural networks (CNNs)

A series of convolutional and pooling layers, followed by fully connected layers and a softmax output layer

width × height × 3 input layer for colored images
Convolutional layers do most of the heavy lifting: they learn a number of “filters” (“kernels”) while retaining spatial topology, i.e. don’t fully connect everything
Pooling layers apply simple downsampling (i.e. downsizing the image)

49 Convolutional neural networks (CNNs)

Convolutional layer learns filters

Look at every w_f × h_f × d window and apply the filter: a spatially local weighted sum
Do this for every such window by moving it n pixels at a time (the “stride”)
Paddings can be defined for the edges of an image (“padding”)
Every window leads to one (2d) output
The same kernel is used for all positions in the image: parameter sharing!
A convolutional layer learns multiple filters (the “depth”)

50 Convolutional neural networks (CNNs)

In case there is more than one input channel: the kernel has depth n. E.g. in case of a color image, or for filters in later layers. The output is still 2-dimensional (1d CNN units exist as well).

51 Convolutional neural networks (CNNs)

Pooling layer applies downsampling (a resize)

Also works with a size and stride
Same depth as the layer before
“Max pooling” vs. “average pooling”
Global pooling: downsizes the feature map to a single value, aggressively summarizing the presence of a feature in an image. Oftentimes used as an alternative to a fully connected layer to transform from feature maps to an output prediction (see the shape arithmetic below).
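The resulting feature map sizes follow from a simple formula: for input width W, filter size F, padding P and stride S, the output width is (W − F + 2P)/S + 1 (and likewise for the height). A quick sketch:

def conv_output_size(w, f, p, s):
    # (W - F + 2P) / S + 1, per spatial dimension
    return (w - f + 2 * p) // s + 1

print(conv_output_size(28, 3, 0, 1))  # 26: 3x3 filter, no padding, stride 1
print(conv_output_size(28, 3, 1, 1))  # 28: "same" padding keeps the size
print(conv_output_size(26, 2, 0, 2))  # 13: 2x2 max pooling with stride 2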

52 Convolutional neural networks (CNNs)

http://scs.ryerson.ca/~aharley/vis/

model = Sequential()
model.add(Conv2D(16, (3, 3), padding='same', input_shape=(28, 28, 1), activation='relu'))
model.add(Conv2D(16, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
model.summary()

53 Convolutional neural networks (CNNs)

Many variations have been developed

LeNet: the first successful applications of CNNs, developed by Yann LeCun in the 1990s
AlexNet: the first work that popularized CNNs in computer vision
ZF Net: a convolutional network from Matthew Zeiler and Rob Fergus; an improvement on AlexNet by tweaking the architecture hyperparameters
GoogLeNet: main contribution was the development of an “Inception Module” that dramatically reduced the number of parameters in the network
VGGNet: showed that the depth of the network is a critical component for good performance (140M parameters) – still used quite often for transfer learning
ResNet and ResNeXt: feature special skip connections and a heavy use of batch normalization; the architecture is also missing fully connected layers at the end of the network – still very popular
SqueezeNet: achieves AlexNet performance levels with 50x fewer parameters, leading to a very small model that is easy to deploy on e.g. smart devices

54 Convolutional neural networks (CNNs)

Best practices

The input layer (the image) should be divisible by 2 many times
Common numbers include 32 (e.g. the CIFAR-10 data set), 64, 96, 224, 384, and 512
Convolutional layers should use small filters (3x3 or at most 5x5), a stride of 1, and padding
A stride of 1 allows leaving all spatial down-sampling to the pooling layers, with convolutional layers only transforming the input volume depth-wise
If convolutional layers were not padded, the size of the volumes would reduce by a small amount after each convolution, and information at the borders would be “washed away” too quickly
If we would use strides greater than 1 or not zero-pad the input, we would also have to carefully keep track of the input volumes throughout the architecture
Some do argue for a larger convolutional layer at the beginning of the architecture (5x5 kernel size) with sufficient depth
Prefer a stack of small filters over one layer with a large receptive field: more small filters is better than one large one, and the effective receptive field over the image will be similar
Luo, W., Li, Y., Urtasun, R., & Zemel, R. (2016). Understanding the effective receptive field in deep convolutional neural networks. In Advances in neural information processing systems (pp. 4898-4906). https://distill.pub/2019/computing-receptive-fields/
The conventional paradigm of a linear list of layers has been challenged, e.g. ResNet includes skip connections for the residuals, comparable to boosting

55 Convolutional neural networks (CNNs)

Best practices

The pooling layers oversee downsampling the spatial dimensions of the input
A common setting is to use max pooling with 2x2 receptive fields and a stride of 2
Many dislike the pooling operation and think that we can get away without it: if necessary, reduce the size of the representation by occasionally using a larger convolutional stride
This also allows constructing FCNs (fully convolutional networks), which are more flexible as they do not impose the limitation of having to use a fixed input image size
The amount of memory can grow quickly with a CNN; since GPUs are often bottlenecked by memory, it may be necessary to compromise: use a black-and-white picture instead of a colored one, downsize the input layer, or use smaller mini-batches
Many CNN architectures perform convolutions in the lower layers and add a number of fully connected (dense) layers at the top, followed by a softmax layer
This bridges the convolutional structure with traditional neural network classifiers: it treats the convolutional layers as feature extractors, and the resulting features are classified in a traditional way
However, the fully connected layers are prone to overfitting; typically, dropout layers are added to resolve this issue
Another strategy is “global average pooling”: this replaces the fully connected layers and generates one feature map for each class of the classification task in the last layer

56 Data augmentation

Increase size of training set through transformations

Some popular augmentations are grayscaling, horizontal flips, vertical flips, random crops, color jitters, translations, rotations, cutouts, and much more. By applying just a couple of these transformations to your training data, you can easily double or triple the number of training examples (see the sketch below).

https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator
https://github.com/aleju/imgaug

Forces the network to focus on important aspects
Prevents overfitting
Sometimes also applied at prediction time to stabilize predictions (test-time augmentation)
Tends to work very well
Open question: text, audio, …?
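With the ImageDataGenerator linked above, for example (a sketch; the parameter values are illustrative):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,       # random rotations
    width_shift_range=0.1,   # random horizontal translations
    height_shift_range=0.1,  # random vertical translations
    horizontal_flip=True,    # random horizontal flips
    zoom_range=0.1           # random zooms
)

# model.fit(datagen.flow(X_train, y_train, batch_size=32), epochs=20)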

58 Transfer learning

A common misconception is that without huge amounts of data, you can’t create effective deep learning models

While data is a critical part, the idea of transfer learning has helped to lessen the data demands

Transfer learning is the process of taking a pre-trained model and “fine-tuning” it on your own dataset

The idea is that the pre-trained model will act as a feature extractor: you remove the last layer(s) of the network and replace them with your own classifier, and only retrain those weights while keeping the rest “frozen”. Or simply keep the network as is but only retrain the last layers. The lower layers of the network will detect features like edges and curves; rather than training the whole network from a random initialization of the weights, we can use the weights of the pre-trained model and focus on the layers that are higher up for training.

Also heavily applied in the textual domain!
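A minimal Keras sketch of this recipe (assuming a VGG16 base pre-trained on ImageNet and a hypothetical new task with 5 classes):

from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained feature extractor

model = Sequential([
    base,
    Flatten(),
    Dense(256, activation='relu'),
    Dense(5, activation='softmax')  # new classifier head for our 5 classes
])
# Only the new head's weights are trained; the frozen base acts as a feature extractor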

59 Transfer learning

https://teachablemachine.withgoogle.com/v1/

Train an image detector in your browser
How can this train so fast?

Uses a pretrained SqueezeNet on ImageNet (1000 classes)

Hence, we get an output vector of size 1000 for any image. When the user “trains” this system, it simply stores the output vectors for each image together with the given class label.

For a new image (to “predict”), look at k nearest neighbors based on distance to stored output vectors

We then see to which class (A, B, C) these neighbors belong, and simply derive the probabilities based on this set

Works surprisingly well, even for images that are not in ImageNet

The network might think of a “lemon” for pictures of a banana
When we give another banana picture to predict, the network will think “lemon” again
We simply translate this to the user as: “it’s a banana”
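Schematically (a sketch using scikit-learn; embed(), user_images, user_labels and new_image are hypothetical stand-ins, with embed() returning the pretrained network's 1000-dimensional output vector):

from sklearn.neighbors import KNeighborsClassifier

# "Training" = storing the output vectors together with the given labels
train_vectors = [embed(img) for img in user_images]
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_vectors, user_labels)

# "Predicting" = a nearest-neighbor lookup in the embedding space
print(knn.predict_proba([embed(new_image)]))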

60 Transfer learning

61 Other image tasks

Basic CNNs are easy to set up for image classification

Taking an input image and outputting a class number out of a set of categories

For object localization, the goal is not only to produce a class label but also a bounding box that describes where the object is in the picture

RCNN, Fast RCNN, Faster RCNN, MultiBox, Bayesian Optimization, Multi-region, RCNN Minus R, Image Windows

For object segmentation, the task is to output a class label as well as an outline of every object in the input image

Semantic Seg, Unconstrained Video, Shape Guided, Object Regions, Shape Sharing

62 Other image tasks

(BodyPix)

63 Other image tasks

A basic CNN setup can also be used to localize objects of interest by preprocessing the data appropriately. At prediction time, the model is queried for each slice over the image.

64 Other tasks

One-dimensional CNNs have been used for text and time series analysis as well
Capsule networks (Sabour, Frosst & Hinton) try to remove long-standing issues of the traditional CNN architecture
Standard CNNs focus heavily on small texture- and edge-based filters, but have difficulty with pose and overall composition
A smart idea, but it didn’t take off
Maybe there are smarter alternatives: Making Convolutional Networks Shift-Invariant Again: https://arxiv.org/pdf/1904.11486v2.pdf

65 Other tasks

Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., & Brendel, W. (2018). ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. (Compare “Gabor filters”, which focus on texture analysis.)

66 Other tasks

Image classification, segmentation, detection
Face recognition and classification (from proper to “bad” science)
Pose and gait detection
Business applications, e.g. in insurance (take a picture of your car to file a damage claim), fraud detection (forged signatures), etc.
Stylistic and artistic use cases, e.g. photo editing and processing

67 Style transfer

https://mspoweruser.com/popular-ios-app-prisma-coming-windows-10-month/

68 Deep dreaming

https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html

“One way to visualize what goes on is to turn the network upside down and ask it to enhance an input image in such a way as to elicit a particular interpretation. Say you want to know what sort of image would result in “Banana”. Start with an image full of random noise, then gradually tweak the image towards what the neural net considers a banana. By itself, that doesn’t work very well, but it does if we impose a prior constraint that the image should have similar statistics to natural images, such as neighboring pixels needing to be correlated.”

69 Deep dreaming

70 One-shot learning

Deep neural networks are really good at learning from high dimensional data like images or spoken language, but only when they have huge amounts of labelled examples to train on

Humans, on the other hand, are capable of one-shot learning: take a human who’s never seen a tomato before, and show them a single picture of a tomato; they will probably be able to distinguish tomatoes from other fruits with astoundingly high precision.

Trivial to us, but not so much for a computer

1-nearest neighbor (take the nearest known sample based on Euclidean distance): very low accuracy, but still better than random
Hierarchical Bayesian Learning (Lake et al.): better results, but inputs modified or annotated
Naïve deep neural network approach: would horribly overfit
Transfer learning: works better, makes more sense
Siamese networks (Koch et al.): provide two images and train the network to predict whether they have the same category
At prediction time, the network can be used to compare a new image to each image in the support set and pick the best matching category based on this
We want an architecture that takes two inputs and outputs the probability of sharing the same class
Symmetry: p(x1, x2) = p(x2, x1) – which means we cannot just “stitch” both images together into one large image
Siamese network: shared parameters for identical convnets, then joined by a distance function

Also possible: zero-shot learning (no examples for some classes), student-teacher networks (alternative transfer learning approach)

71 Generative adversarial networks (GANs)

Generative adversarial networks (GANs) are deep neural net architectures comprised of two nets, pitting one against the other

The generator network tries to fool the discriminator
The discriminator tries to spot fooling attempts

GANs were introduced in a paper by Ian Goodfellow et al. https://arxiv.org/abs/1406.2661

Yann LeCun called adversarial training “the most interesting idea in the last 10 years in ML” (by now he has changed his mind and self-supervised learning is the coolest thing):

“If intelligence is a cake, the bulk of the cake is self-supervised learning, the icing on the cake is supervised learning, and the cherry on the cake is reinforcement learning (RL)”

72 Generative adversarial networks (GANs)

73 Generative adversarial networks (GANs)

Traditional discriminative algorithms try to classify input data; that is, given the features of an instance, they predict the class to which that instance belongs, or P(Y = 1|X)
Discriminative algorithms map features to labels; they are concerned solely with that correlation
One way to think about generative algorithms is that they do the opposite: instead of predicting a label given certain features, they attempt to predict features given a certain label, or P(X|Y)
The question a generative algorithm tries to answer is: assuming this label, how likely are these features?
While discriminative models care about the relation between y and x, generative models care about “how you get x”
In other words: discriminative models learn the boundary between classes, generative models model the distribution of individual classes

74 Generative adversarial networks (GANs)

See example: https://poloclub.github.io/ganlab/

https://thisxdoesnotexist.com/
https://thispersondoesnotexist.com/
https://thisrentaldoesnotexist.com/
https://www.thiswaifudoesnotexist.net/
https://thisartworkdoesnotexist.com/
https://thiscatdoesnotexist.com/ (okay, maybe this one is less good)

We’re getting better at this: DCGAN, StyleGAN, …

“Deep fakes”:

https://www.theguardian.com/technology/2018/nov/12/deep-fakes-fake-news-truth
https://www.forbes.com/sites/johnbbrandon/2019/10/08/there-are-now-15000-deepfake-videos-on-social-media-yes-you-should-worry/

75 Reinforcement learning

Reinforcement learning allows to create AI agents that learn from the environment by interacting with it

Learns by trial and error
The environment exposes a state to the agent, with a number of possible actions the agent can perform
After each action, the agent receives feedback
The feedback consists of the reward and the next state of the environment

See example: http://projects.rajivshah.com/rldemo/

76 Q-learning

Given one run of the agent through an environment (one episode), we can easily calculate the total reward for that episode: R = r_1 + r_2 + … + r_n

The total future reward from time point t onward can be expressed as: R_t = r_t + r_{t+1} + r_{t+2} + … + r_n

Because the environment is stochastic, it is common to use discounted future reward instead:

R_t = r_t + γ r_{t+1} + γ² r_{t+2} + … + γ^(n−t) r_n

R_t = r_t + γ(r_{t+1} + γ(r_{t+2} + …)) = r_t + γ R_{t+1}

If we set the discount factor γ = 0, our strategy will be short-sighted and we rely only on the immediate rewards
Balance between immediate and future rewards with e.g. γ = 0.9
If our environment is fully deterministic and the same actions always result in the same rewards: γ = 1

A good strategy for an agent would be to always choose an action that maximizes the (discounted) future reward

77 Q-learning

Define a quality function Q(s, a) representing the maximum discounted future reward when performing action a in state s and continuing optimally from that point onwards: Q(s_t, a_t) = max R_{t+1}

But: how can we estimate this future reward?

We know just the current state and action, and not the actions and rewards coming after that
We can’t: Q is just a theoretical construct

If we could find an estimate for Q, we could determine a policy as follows: just pick the action with the highest Q value in a certain state:

π(s) = argmaxa Q(s, a)

Here π represents the policy: the rule how we choose an action in each state

78 Q-learning

Say we have one action: (s, a, r, s′)

Just like with discounted future rewards, we can express the Q-value of state s and action a in terms of the Q-value of the next state s’: Q(s, a) = r + γ × Q(s′, π(s′))

This is called the Bellman equation

The main idea in Q-learning is that we can iteratively approximate the Q-function using the Bellman equation. In the simplest case the Q-function is implemented as a table, with states as rows and actions as columns

79 Q-learning

initialize Q[num_states, num_actions] arbitrarily
observe initial state s
repeat
    select action a according to policy
    execute action a and obtain reward r and new state s'
    select action a' in s' according to policy (which might include an epsilon for "exploration")
    Q[s,a] = Q[s,a] + α(r + γ * Q[s',a'] - Q[s,a])
    move to new state s'
until termination

α is a learning rate that controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account
When α = 1, the update is exactly the same as the Bellman equation
The updating of the Q matrix is only an approximation, and in early stages of learning it may be completely wrong
However, the approximation gets more and more accurate with every iteration, and it has been shown that if we perform this update enough times, the Q-function will converge and represent the true Q-value

80 Q-learning

What we need:

A transition matrix T, describing which state transitions are possible
Can be stochastic (harder)
Might not be fully initialized but determined whilst exploring
A reward matrix R, describing the rewards we get from state transitions
Might not be fully initialized but determined whilst exploring
A Q matrix, the “brain” of our agent
A policy
Alpha (learning rate), gamma (discount factor) and epsilon (exploration versus exploitation)

An example will be posted using Python in the background information
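In the meantime, here is a self-contained sketch on a toy deterministic chain (five states in a row, a reward of 1 only when reaching the rightmost state; all values are illustrative, and the max-based Q-learning update is used):

import numpy as np

n_states, n_actions = 5, 2  # actions: 0 = left, 1 = right; state 4 is the goal
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for episode in range(500):
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy policy (random tie-breaking while Q is still flat)
        if np.random.rand() < epsilon or Q[s].max() == Q[s].min():
            a = np.random.randint(n_actions)
        else:
            a = int(Q[s].argmax())
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Bellman update
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.round(2))  # Q-values for "right" should increase towards the goal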

81 Q-learning: example

http://mnemstudio.org/path-finding-q-learning-tutorial.htm

82 Q-learning: example

Reward matrix R

Indicates the possible actions from a certain state
And the reward per action
We could split this up into a transition and a reward matrix respectively

Q matrix

The “brain” of our agent
Initially all values are 0

83 Q-learning

“If we apply the same preprocessing to game screens as in the DeepMind paper – take the four last screen images, resize them to 84×84 and convert to grayscale with 256 gray levels – we would have 256^(84×84×4) ≈ 10^67970 possible game states”

This means 10^67970 rows in our Q-table, more than the number of atoms in the known universe
One could argue that many states never occur, so we could possibly represent the table as a sparse table containing only visited states
Even so, most of the states are very rarely visited and it would take a lifetime of the universe for the Q-table to converge
Ideally, we would also like to have a good guess for the Q-values of states we have never seen before

84 Deep Q-learning

This is the point where deep learning steps in

We could represent our Q-function with a neural network that takes the state and action as input and outputs the corresponding Q-value: “According to the network, which action leads to the highest payoff in a given state?”

Note that Q̂ in the “target” is a prediction by the network trained so far

85 Deep Q-learning

Estimate the future reward in each state using Q-learning and approximate the Q-function using a convolutional neural network

It turns out that approximation of Q-values using non-linear functions is not very stable

Not easy to converge and takes a long time, almost a week on a single GPU

Hence, experience replay is applied. During gameplay all the experiences (s, a, r, s′) are stored in a replay memory

When training the network, random mini-batches from the replay memory are used instead of the most recent transition
This breaks the similarity of subsequent training samples, which otherwise might drive the network into a local minimum
Helps avoid the neural network overly adjusting its weights to the most recent state, which may affect the action output for other states
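A sketch of the replay memory idea:

import random
from collections import deque

replay_memory = deque(maxlen=100_000)  # drop the oldest experiences when full

def remember(s, a, r, s_next):
    replay_memory.append((s, a, r, s_next))

def sample_minibatch(batch_size=32):
    # Random sampling breaks the correlation between subsequent transitions
    return random.sample(replay_memory, batch_size)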

86 Deep Q-learning

Q-learning attempts to solve the credit assignment problem: it propagates rewards back in time until it reaches the crucial decision point which was the actual cause for the obtained reward

When a Q-table or Q-network is initialized randomly, its predictions are initially random as well. If we pick the action with the highest Q-value, the action will be random and the agent performs crude “exploration”. As the Q-function converges, it returns more consistent Q-values and the amount of exploration decreases.

But this exploration is “greedy”: it settles for the first effective strategy it finds. We need a tradeoff between exploration and exploitation.

A simple and effective fix for the above problem is ε-greedy exploration: with probability ε choose a random action, otherwise go with the “greedy” action with the highest Q-value. In their system, DeepMind actually decreases ε over time from 1 to 0.1 – in the beginning the system makes completely random moves to explore the state space maximally, and then it settles down to a fixed exploration rate.

87 Reinforcement learning

Deep Q-Learning (DQN): https://arxiv.org/abs/1312.5602
Used to play simple Atari games together with a CNN
Double DQN: https://arxiv.org/abs/1509.06461
The Q-learning algorithm is known to overestimate action values under certain conditions
Deep Deterministic Policy Gradient (DDPG): https://arxiv.org/abs/1509.02971
Asynchronous Advantage Actor-Critic (A3C): https://arxiv.org/abs/1602.01783
Continuous DQN (CDQN or NAF): https://arxiv.org/abs/1603.00748
Cross-Entropy Method (CEM): http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.6579&rep=rep1&type=pdf
Dueling network DQN (Dueling DQN): https://arxiv.org/abs/1511.06581
Deep SARSA: http://ieeexplore.ieee.org/document/7849837/
Meta-learning with reinforcement learning: https://pdfs.semanticscholar.org/0b46/4e82c52c6e0e63a05d1ba425a82e0ce9d66f.pdf
Using SGD: http://taichi.graphics/
Evolving agents with a genetic algorithm

88 Reinforcement learning

https://www.alexirpan.com/2018/02/14/rl-hard.html

89 Reinforcement learning

https://openai.com/blog/faulty-reward-functions/

90 Conclusions

Artificial neural networks are back

Powerful
But: require a lot of tuning and configuration, carry a risk of overfitting, and require huge amounts of samples for non-explored problems (although transfer learning helps)
So many architectures: what are best practices?
Black box!
The last 10%: deep learning gets you quickly to 90% of good results, but the last 10% is still very hard to reach

91 Software

92 Conclusions

“Don’t believe the short-term hype, but do believe in the long-term vision. It may take a while for AI to be deployed to its true potential—a potential the full extent of which no one has yet dared to dream—but AI is coming, and it will transform our world in a fantastic way.” – Francois Chollet

“Difference between machine learning and AI: if it is written in Python, it’s probably machine learning. If it is written in PowerPoint, it’s probably AI.” – Mat Velloso

93 Conclusions

Criterion | Traditional algorithms | Deep learning
Accuracy | Fair to good (on structured data) | Good to excellent
Training time | Short (seconds) to medium (hours) | Medium to (very) long (weeks)
Data requirements | Limited (a couple of hundred rows of “small” data) | High (many thousands of e.g. images, though “transfer learning” possible in some cases)
Feature engineering | Manual: trend features, windowing, aggregations, domain-specific approaches | Automatic, done “by the model”
Hyperparameters | Few to some (depending on the algorithm) | Many (architecture, number of hidden layers, activation functions, optimizer, …)
Interpretability | High (white-box models) to reasonable | Low (black-box model, though some explanations can be extracted)
Cost and operational efficiency | Low to reasonable | Reasonable to high (GPU, cloud, parallel computational requirements)

94 Opening The Black Box (Part 2)

95 Adversarial attacks: a motivating example

NIPS 2017

96 Adversarial learning: a motivating example

Su et al, 2017

97 Adversarial learning: a motivating example

Eykholt et al., 2018

98 Adversarial learning: a motivating example

Brown et al., 2018

99 Adversarial learning: a motivating example

https://arxiv.org/pdf/1911.07658.pdf https://github.com/Kayzaks/HackingNeuralNetworks

100 Opening the black box

How do you inspect something which has not thousands but easily millions of parameters?

101 Hinton diagrams

The size of the square indicates the size of the weight
The color of the square indicates the sign of the weight

Suitable for simple MLP models

102 Rule extraction

Decompositional rule extraction algorithms

Are closely intertwined with the internal workings of the neural network
Analyze weights, biases, and activation values

Pedagogical rule extraction algorithms

Consider the neural network as a black box
Use the neural network as an oracle to label and generate additional training observations

103 Decompositional rule extraction

Extract rules that describe the network outputs in terms of the discretized hidden unit activation values
Generate rules that describe the discretized hidden unit activation values in terms of the network inputs
Merge the two sets of rules to obtain a set of rules that relates the inputs and outputs of the network

104 Pedagogical rule extraction

Use the neural network to relabel the training data
Build a simple model (e.g. a decision tree) on the relabeled data
Use the neural network as an oracle to generate additional training data when the data becomes too partitioned (for example, less than S observations for deciding upon splits, with S a user-defined parameter)

105 Two-stage models

106 Layer activations

A straightforward visualization technique is to show the activations of the network during the forward pass. For ReLU CNN networks, the activations usually start out looking relatively blobby and dense, but as the training progresses they usually become sparser and more localized.

107 Convolutional filters

Another strategy is to visualize the weights rather than the activations. These are usually most interpretable on the first convolutional layer, which is looking directly at the raw pixel data, but it is possible to also show the filter weights deeper in the network.

See: http://cs231n.github.io/understanding-cnn/

108 Feature visualization

However, since the weight matrices are generally small and hard to understand, a common strategy is to find an input through optimization (e.g. using SGD) which maximally “excites” a particular filter: “feature visualization”.

Mahendran, A., & Vedaldi, A. (2016). Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision, 120(3), 233-255.
Olah, C., Mordvintsev, A., & Schubert, L. (2017). Feature visualization. Distill, 2(11), e7.

See: https://openai.com/blog/multimodal-neurons/

109 Maximally activating inputs

Similarly, we can take our given set of images, feed them through the network, and keep track of which images maximally activate (or “excite”) a neuron or set of neurons. We can then visualize those top-n images to get an understanding of what the neuron is looking for in its receptive field.

See: https://distill.pub/2019/activation-atlas/

110 Input occlusion

Suppose that a CNN classifies an image as a dog. How can we be certain that it’s actually picking up on the dog in the image, as opposed to some contextual cues from the background or some other miscellaneous object? One way of investigating which part of the image some classification prediction is coming from is by plotting the probability of the class of interest (e.g. the “dog” class) as a function of the position of an occluder object. We iterate over regions of the image, set a patch of the image to be all zero, and look at the probability of the class. We can visualize the probability as a 2-dimensional heat map (a sketch follows below the list).

Can also be used for simple MLP models:

Train the neural network
Prune the input where the input-to-hidden layer weights are closest to zero and retrain the network
If the predictive power increases (or stays the same), repeat the process
If not, reconnect the input and stop
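A sketch of the occlusion procedure described above (assuming a Keras-style model and a single image as a numpy array):

import numpy as np

def occlusion_map(model, image, target_class, patch=8):
    h, w = image.shape[:2]
    heatmap = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = 0  # zero out one patch
            prob = model.predict(occluded[np.newaxis])[0][target_class]
            heatmap[i // patch, j // patch] = prob  # low probability = important region
    return heatmap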

111 Saliency maps

An approach with a similar goal is saliency maps. Again based on determining the changes to a given input image which have the largest effect on the output, which can also be optimized through SGD. See https://github.com/raghakot/keras-vis/ for Keras implementations.

112 Dimensionality reduction

CNNs can be interpreted as gradually transforming the images into a representation in which the classes are separable by a linear classifier. We can get a rough idea about the topology of this space by embedding images into two dimensions, so that their low-dimensional representation approximately preserves the pairwise distances of their high-dimensional representation. To produce an embedding, we can take a set of images and use the CNN to extract the vector of outputs right before the final softmax (classifier) layer. Or, alternatively, a vector constructed using the mean activations of each filter in a convolutional layer. We can then plug these into t-SNE or UMAP and get a 2-dimensional vector for each image (see the sketch below).
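Schematically (a sketch with scikit-learn; model is assumed to be a trained Keras classifier, truncated right before its softmax layer):

from sklearn.manifold import TSNE
from tensorflow.keras.models import Model

# Truncate the trained model just before the final softmax layer
feature_extractor = Model(inputs=model.input, outputs=model.layers[-2].output)

features = feature_extractor.predict(X_test)           # one vector per image
coords = TSNE(n_components=2).fit_transform(features)  # 2d coordinates for plotting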

113 Others

Recall LIME, Shapley values, feature importance plots, partial dependence and individual conditional expectation plots

Model-agnostic techniques which can be used here as well
But: harder on text, imagery…

Data scientists are automating themselves… … though interpretability will stay crucial!

https://cloud.google.com/explainable-ai https://docs.microsoft.com/en-us/azure/machine-learning/how-to-machine-learning-interpretability https://www.ibm.com/watson/explainable-ai https://aws.amazon.com/sagemaker/

“Explainable AI is a set of tools and frameworks to help you develop interpretable and inclusive machine learning models and deploy them with confidence. With it, you can understand feature attributions in AutoML Tables and AI Platform and visually investigate model behavior using the What-If Tool. It also further simplifies model governance through continuous evaluation of models managed using AI Platform.”