International Master in Computer Vision

Fundamentals of machine learning for computer vision

Eva Cernadas

Contents

Machine learning theory (Dr. Jaime Cardoso)

Linear regression and optimization (Dr. Jaime Cardoso)

Clustering

Model selection and evaluation

Classical classification models

Artificial neural networks

Support vector machines (SVM)

Ensembles: bagging, boosting and

Artificial neural networks

● ANN: Artificial Neural Networks

● Neural network: combination of local processing units (neurons)

● Neuron: simple processing unit with several inputs and one output.

● Input connections with weights for each input.

● The output is determined by an activation function of the inputs and weights.

● The weights are persistent: they are the memory of the neural net.

● The neurons are grouped in layers: multi-layer networks have input, hidden (one or more) and output layers.

● The weights are calculated to approximate n-dimensional functions.

Neural network

● The term ANN normally refers to multi-layer perceptrons (MLP).

● Supervised training: calculate the weights using input patterns and desired outputs.

● At test time, the pattern is propagated through the net to compute the output.

● There are supervised and unsupervised neural networks.

Perceptron (I)

● One neuron with n inputs and one output, with a binary (0/1) activation function (for example, for 2-class classification problems):

[Figure: one neuron with inputs $x_1, \dots, x_n$, weights $w_1, \dots, w_n$ and output z, using a step activation function.]

$$z(x) = f\left(\sum_{i=1}^{n} w_i x_i\right), \qquad f(t) = \begin{cases} 1 & t \geq \Gamma \\ 0 & t < \Gamma \end{cases}$$

● A threshold Γ is needed (for example Γ = 0.5) to compute the output (0 or 1). Varying the value of Γ generates a ROC curve.

● Connection weights $w_1, \dots, w_n$: calculated iteratively.

Perceptron (II)

● Gradient descent: minimize the difference between the desired output (y) and the predicted output (z); μ: learning rate.

$$w(t+1) = w(t) + \mu (y_i - z_i)\, x_i, \qquad i = 1, \dots, N$$

Note that $y_i - z_i \in \{-1, 0, 1\}$: the weights change only for misclassified patterns.

● Repeat until all patterns are correctly classified, i.e. $\sum_{i=1}^{N} |y_i - z_i| = 0$. It converges only when the data are linearly separable.
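As an illustration, a minimal NumPy sketch of this perceptron training rule (the function name, the threshold handling and the stopping loop are our own choices, not taken from the slides):

import numpy as np

def train_perceptron(X, y, mu=0.1, gamma=0.5, max_epochs=100):
    # X: (N, n) input patterns; y: (N,) desired outputs in {0, 1}; mu: learning rate
    N, n = X.shape
    w = np.zeros(n)
    for _ in range(max_epochs):
        errors = 0
        for i in range(N):
            z = 1.0 if w @ X[i] >= gamma else 0.0      # step activation with threshold gamma
            if z != y[i]:
                w = w + mu * (y[i] - z) * X[i]         # w(t+1) = w(t) + mu (y_i - z_i) x_i
                errors += 1
        if errors == 0:                                # converges only if data are linearly separable
            break
    return w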

Perceptron (III)

● Differentiable activation function f: sigmoid or hyperbolic tangent:

Logistic sigmoid function, with range (0, 1):

$$f(t) = \sigma(t) = \frac{1}{1 + e^{-at}}$$

Hyperbolic tangent function, with range (−1, 1):

$$f(t) = \tanh(at) = \frac{e^{at} - e^{-at}}{e^{at} + e^{-at}}$$
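The two activations as small NumPy helpers (the slope parameter a follows the formulas above; the default values are ours):

import numpy as np

def sigmoid(t, a=1.0):
    # logistic sigmoid, output in (0, 1)
    return 1.0 / (1.0 + np.exp(-a * t))

def hyperbolic_tangent(t, a=1.0):
    # hyperbolic tangent, output in (-1, 1)
    return np.tanh(a * t)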

Perceptron (V)

● Kernel perceptron: linear classification $z = f(w^T \phi(x) + b)$ in the high-dimensional projected space (f: step function):

$$w = \sum_{i=1}^{N} a_i y_i \phi(x_i), \qquad b = \sum_{i=1}^{N} a_i y_i$$

● Initially $a_i = 0$, $i = 1, \dots, N$; update $a_i = a_i + 1$ when pattern $x_i$ is misclassified, and repeat until no pattern is misclassified.
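A possible sketch of this kernel perceptron update, assuming labels y_i in {-1, +1} and a Gaussian (RBF) kernel; both choices are our assumptions, the slides do not fix them:

import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * sigma ** 2))

def train_kernel_perceptron(X, y, max_epochs=50):
    # a[i] counts how many times pattern i triggered an update
    N = X.shape[0]
    a = np.zeros(N)
    K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])
    for _ in range(max_epochs):
        errors = 0
        for i in range(N):
            # w^T phi(x_i) + b expanded with w = sum_j a_j y_j phi(x_j), b = sum_j a_j y_j
            s = np.sum(a * y * K[:, i]) + np.sum(a * y)
            if y[i] * s <= 0:        # misclassified pattern -> increment a_i
                a[i] += 1
                errors += 1
        if errors == 0:
            break
    return a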

Multi-layer neural network (MLP)

● Each neuron defines a hyperplane that partitions the input space.

● It divides the space into two subspaces, with outputs 0/1 (sigmoid function) or ±1 (hyperbolic tangent).

● Combining several neurons in one layer divides the space into regions with piecewise-linear borders (surfaces).

Multi-Layer Perceptron (MLP)

[Figure: MLP with input pattern $x_1, \dots, x_n$, hidden layers 1 to H−1 (with $I_1, \dots, I_{H-1}$ neurons) and output layer H producing $z_1, \dots, z_M$.]

● Hidden layers: sigmoid/tanh activation.

● Output layer: step activation (classification).

● We need: a training set, a cost function and a differentiable activation function.

● $w_j^k$: weight vector of neuron $j = 1, \dots, I_k$ in layer $k = 1, \dots, H$.

● $a_{ij}^k = (w_j^k)^T h_i^{k-1} + b_j^k$, with $i = 1, \dots, N$ (pattern), $k = 1, \dots, H$ (layer), $j = 1, \dots, I_k$ (neuron).

● $h_i^{k-1}$: output of layer $k-1$ ($I_{k-1}$ values) for pattern $x_i$.

● $h_i^k$: output of layer $k$ for pattern $x_i$: $h_{ij}^k = f(a_{ij}^k)$ with $j = 1, \dots, I_k$.

● $y_{ij}$: desired output for output neuron $j$ and pattern $x_i$.
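With this notation, a sketch of the direct (forward) propagation for one pattern; the data layout (a list of weight matrices W[k] of shape (I_k, I_{k-1}) and bias vectors b[k]) is our own choice:

import numpy as np

def forward(x, W, b, f=np.tanh):
    # h^0 = x; a^k = W^k h^{k-1} + b^k; h^k = f(a^k), for k = 1..H
    h = x
    activations = []          # the a^k, needed later by backpropagation
    outputs = [x]             # h^0, h^1, ..., h^H
    for Wk, bk in zip(W, b):
        a = Wk @ h + bk
        h = f(a)
        activations.append(a)
        outputs.append(h)
    return outputs, activations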

Backpropagation (I)

● Gradient descent:

$$\Delta w_j^k = -\mu \frac{\partial J}{\partial w_j^k}, \qquad \Delta b_j^k = -\mu \frac{\partial J}{\partial b_j^k}, \qquad k = 1, \dots, H$$

● Mean squared error $J_i$ per pattern, evaluated on the output layer, with $J = \sum_{i=1}^{N} J_i$:

$$J_i = \frac{1}{2} \sum_{j=1}^{I_H} (y_{ij} - h_{ij}^H)^2 = \frac{|y_i - h_i^H|^2}{2}$$

● The derivatives are (chain rule):

$$\frac{\partial J_i}{\partial w_j^k} = \frac{\partial J_i}{\partial a_{ij}^k} \frac{\partial a_{ij}^k}{\partial w_j^k}, \qquad \frac{\partial J_i}{\partial b_j^k} = \frac{\partial J_i}{\partial a_{ij}^k} \frac{\partial a_{ij}^k}{\partial b_j^k}$$

Backpropagation (II)

● Define: $\delta_{ij}^k \equiv \frac{\partial J_i}{\partial a_{ij}^k}$

● Since $a_{ij}^k = (w_j^k)^T h_i^{k-1} + b_j^k$, then:

$$\frac{\partial a_{ij}^k}{\partial w_j^k} = h_i^{k-1}, \qquad \frac{\partial a_{ij}^k}{\partial b_j^k} = 1$$

● So:

$$\Delta w_j^k = -\mu \sum_{i=1}^{N} \delta_{ij}^k h_i^{k-1}; \qquad \Delta b_j^k = -\mu \sum_{i=1}^{N} \delta_{ij}^k$$

Backpropagation (III)

● For the output layer (k=H):

Since $h_{ij}^H = f(a_{ij}^H)$:

$$\delta_{ij}^H = (y_{ij} - h_{ij}^H)\, f'(a_{ij}^H) = \varepsilon_{ij}^H f'(a_{ij}^H)$$

$$\Delta w_j^H = -\mu \sum_{i=1}^{N} \varepsilon_{ij}^H f'(a_{ij}^H)\, h_i^{H-1}; \qquad \Delta b_j^H = -\mu \sum_{i=1}^{N} \varepsilon_{ij}^H f'(a_{ij}^H)$$

● For a hidden layer k:

$$\delta_{ij}^k = \frac{\partial J_i}{\partial a_{ij}^k} = \sum_{l=1}^{I_{k+1}} \frac{\partial J_i}{\partial a_{il}^{k+1}} \frac{\partial a_{il}^{k+1}}{\partial a_{ij}^k} = \sum_{l=1}^{I_{k+1}} \delta_{il}^{k+1} \frac{\partial a_{il}^{k+1}}{\partial a_{ij}^k}$$

Backpropagation (IV)

● Since $a_{il}^{k+1} = (w_l^{k+1})^T h_i^k + b_l^{k+1}$ and $h_{ij}^k = f(a_{ij}^k)$, then:

$$a_{il}^{k+1} = \sum_{m=1}^{I_k} w_{lm}^{k+1} f(a_{im}^k) + b_l^{k+1} \quad\Rightarrow\quad \frac{\partial a_{il}^{k+1}}{\partial a_{ij}^k} = w_{lj}^{k+1} f'(a_{ij}^k)$$

● So $\delta_{ij}^k$ is a recursive function that depends on $\delta_{il}^{k+1}$ and $w_l^{k+1}$:

$$\delta_{ij}^k = \sum_{l=1}^{I_{k+1}} \delta_{il}^{k+1} w_{lj}^{k+1} f'(a_{ij}^k) = f'(a_{ij}^k) \sum_{l=1}^{I_{k+1}} \delta_{il}^{k+1} w_{lj}^{k+1}, \qquad k = H-1, \dots, 1$$

● Denoting $\varepsilon_{ij}^k = \sum_{l=1}^{I_{k+1}} \delta_{il}^{k+1} w_{lj}^{k+1}$, we obtain $\delta_{ij}^k = \varepsilon_{ij}^k f'(a_{ij}^k)$.

● The derivative is, respectively, $f'(t) = a f(t)[1 - f(t)]$ for the sigmoid and $f'(t) = a[1 - f^2(t)]$ for the tanh activation function.

Backpropagation (V)

Batch backpropagation pseudocode:

repeat                              # epoch loop
    for i = 1:N
        for k = 1:H                 # direct (forward) propagation
            for j = 1:I_k
                a_ij^k = (w_j^k)^T h_i^{k-1} + b_j^k ;  h_ij^k = f(a_ij^k)
            endfor
        endfor
        for j = 1:I_H               # output layer
            eps_ij^H = y_ij - h_ij^H ;  delta_ij^H = eps_ij^H f'(a_ij^H)
        endfor
        for k = H-1:-1:1            # backpropagation of the error
            for j = 1:I_k
                eps_ij^k = sum_{l=1}^{I_{k+1}} delta_il^{k+1} w_lj^{k+1} ;  delta_ij^k = eps_ij^k f'(a_ij^k)
            endfor
        endfor
    endfor
    for k = 1:H                     # weight update
        for j = 1:I_k
            Delta w_j^k = -mu sum_{i=1}^{N} delta_ij^k h_i^{k-1} ;  Delta b_j^k = -mu sum_{i=1}^{N} delta_ij^k
            w_j^k = w_j^k + Delta w_j^k ;  b_j^k = b_j^k + Delta b_j^k
        endfor
    endfor
until stop criterion                # batch processing

● Initial weights $(w_j^k, b_j^k)$: random and low.

● Stop criterion: 1) J or the gradient of J lower than a threshold; 2) maximum number of epochs.

● Intermediate learning rate μ: avoids slowness and oscillations.

● Falling into a local minimum with high J: select the best result among different initializations.

● Updating the weights pattern by pattern (online) can avoid local minima and converges faster.
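The same algorithm as a compact NumPy sketch for batch training with sigmoid units. Array shapes, initialization values and the sign convention (delta is taken as the derivative of J with respect to a, so the update subtracts mu times the accumulated gradient) are our own choices:

import numpy as np

def sigmoid(t, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * t))

def sigmoid_deriv(t, a=1.0):
    s = sigmoid(t, a)
    return a * s * (1.0 - s)                    # f'(t) = a f(t) [1 - f(t)]

def train_mlp(X, Y, sizes, mu=0.1, nepoch=1000):
    # X: (N, n) inputs; Y: (N, M) desired outputs; sizes = [n, I_1, ..., I_H]
    rng = np.random.default_rng(0)
    W = [rng.normal(0.0, 0.1, (sizes[k + 1], sizes[k])) for k in range(len(sizes) - 1)]
    b = [np.zeros(sizes[k + 1]) for k in range(len(sizes) - 1)]
    for _ in range(nepoch):
        dW = [np.zeros_like(w) for w in W]
        db = [np.zeros_like(v) for v in b]
        for x, y in zip(X, Y):
            hs, acts = [x], []                  # direct propagation
            for Wk, bk in zip(W, b):
                acts.append(Wk @ hs[-1] + bk)
                hs.append(sigmoid(acts[-1]))
            delta = (hs[-1] - y) * sigmoid_deriv(acts[-1])   # output layer
            for k in range(len(W) - 1, -1, -1):              # backpropagation
                dW[k] += np.outer(delta, hs[k])
                db[k] += delta
                if k > 0:
                    delta = sigmoid_deriv(acts[k - 1]) * (W[k].T @ delta)
        for k in range(len(W)):                 # batch weight update
            W[k] -= mu * dW[k]
            b[k] -= mu * db[k]
    return W, b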

Enhancements over backpropagation (I)

● Preprocessing: normalize the inputs to zero mean and a variance consistent with the activation range of f(t).

● Symmetrical activation (tanh) instead of the sigmoid function.

● Momentum (α): inertia in the learning:

$$\Delta w_j^k(t+1) = \alpha\, \Delta w_j^k(t) - \mu \sum_{i=1}^{N} \delta_{ij}^k h_i^{k-1}$$

t = iteration; α ∈ [0.7, 0.95]; $\Delta w_j^k$ is reduced approximately by a factor 1 − α.
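A one-step sketch of the momentum update for a single weight array (grad stands for the summed delta*h term of plain backpropagation; names and defaults are ours):

def momentum_step(w, delta_w_prev, grad, mu=0.1, alpha=0.9):
    # Delta w(t+1) = alpha * Delta w(t) - mu * grad;  w <- w + Delta w(t+1)
    delta_w = alpha * delta_w_prev - mu * grad
    return w + delta_w, delta_w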

Enhancements over backpropagation (II)

● RMSProp (root mean square propagation): use a second-order moment instead of a first-order moment; η, β: hyper-parameters. One epoch = one presentation of the training set. The squared gradient is used to scale the learning rate:

S_w = 0; S_b = 0
for epoch = 1:nepoch
    calculate Delta w and Delta b
    S_w = beta S_w + (1-beta) |Delta w|^2 ;  S_b = beta S_b + (1-beta) Delta b^2
    w(t+1) = w(t) - eta Delta w / (eps + sqrt(S_w)) ;  b(t+1) = b(t) - eta Delta b / (eps + sqrt(S_b))
endfor
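The same update as a NumPy sketch; here the gradient is squared element-wise (the usual RMSProp form) and eps avoids division by zero. Default hyper-parameter values are ours:

import numpy as np

def rmsprop_step(w, S_w, grad_w, eta=0.001, beta=0.9, eps=1e-8):
    # S_w accumulates the squared gradient; the step is scaled by 1/sqrt(S_w)
    S_w = beta * S_w + (1.0 - beta) * grad_w ** 2
    w = w - eta * grad_w / (eps + np.sqrt(S_w))
    return w, S_w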

Enhancements over backpropagation (III)

● Adaptive learning rate: μ(t=0) ∈ [0.03, 0.1]

– μ(t+1) = (1 + ε_i) μ(t) if J(t+1) < J(t)

– μ(t+1) = (1 − ε_d) μ(t) and α = 0 if J(t+1) > (1 + ε_c) J(t), with ε_i = 0.05, ε_d = 0.3 and ε_c = 0.04

● Learning rate decay: t = epoch; μ_0, β, k: hyper-parameters

$$\mu(t) = \frac{\mu_0}{1 + t\beta}, \qquad \mu(t) = 0.95^t\, \mu_0, \qquad \mu(t) = \frac{k\, \mu_0}{\sqrt{t}}$$
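The three decay schedules as Python helpers (the default hyper-parameter values are placeholders, not taken from the slides):

def mu_inverse(t, mu0=0.1, beta=0.01):
    # mu(t) = mu0 / (1 + t * beta)
    return mu0 / (1.0 + t * beta)

def mu_exponential(t, mu0=0.1):
    # mu(t) = 0.95^t * mu0
    return 0.95 ** t * mu0

def mu_sqrt(t, mu0=0.1, k=1.0):
    # mu(t) = k * mu0 / sqrt(t), for t >= 1
    return k * mu0 / t ** 0.5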

Enhancements over backpropagation (IV)

● Different learning rate for each weight: increase μ when the gradient of that weight has the same sign in two consecutive iterations.

● The desired outputs $y_{ij}$ should match the range of the activation function (sigmoid or hyperbolic tangent).

● If $y_{ij} \in [0, 1]$, they can be seen as probabilities and the cross entropy can be used as the cost function:

$$J = -\sum_{i=1}^{N} \sum_{j=1}^{I_H} \left[ y_{ij} \ln h_{ij}^H + (1 - y_{ij}) \ln(1 - h_{ij}^H) \right]$$
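A NumPy sketch of this cost (the clipping constant eps is our addition, used to avoid log(0)):

import numpy as np

def cross_entropy(Y, H, eps=1e-12):
    # Y, H: (N, I_H) arrays of desired outputs and network outputs in [0, 1]
    H = np.clip(H, eps, 1.0 - eps)
    return -np.sum(Y * np.log(H) + (1.0 - Y) * np.log(1.0 - H))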

Enhancements over backpropagation (V)

● The Kullback-Leibler divergence (relative entropy), using softmax activations for the output, can also be used as the cost function:

$$h_{ij}^H = \frac{e^{a_{ij}^H}}{\sum_{l=1}^{I_H} e^{a_{il}^H}}, \qquad J = -\sum_{i=1}^{N} \sum_{j=1}^{I_H} y_{ij} \ln \frac{h_{ij}^H}{y_{ij}}$$
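A NumPy sketch of the softmax output and this cost (the max-shift and the eps constant are our numerical-stability additions):

import numpy as np

def softmax(A):
    # row-wise softmax of the output activations A, shape (N, I_H)
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def relative_entropy(Y, H, eps=1e-12):
    # J = -sum_ij y_ij * ln(h_ij / y_ij); terms with y_ij = 0 contribute nothing
    mask = Y > 0
    return -np.sum(Y[mask] * np.log((H[mask] + eps) / Y[mask]))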

● The number H of layers and the number of neurons $I_k$ in each layer $k = 1, \dots, H$ must be decided: tuned hyper-parameters.

● If there are many neurons (and weights $w_j^k$, which are free parameters), overlearning occurs.

Enhancements over backpropagation (VI)

● In practice, it was shown that backpropagation does not provide good solutions when several hidden layers are used.

● Mathematically, it was proved that an ANN with only one hidden layer is a universal approximator of any function.

● The training error decreases, but not the test error: overlearning.

● Regarding the number of neurons per layer: normally we start with many neurons and use regularization to reduce the less informative weights (pruning).

Enhancements over backpropagation (VII)

● Weight decay: use $J'(W) = J(W) + \lambda |W|^2$, regularized by the norm of the weights $W = \{w_j^k\}$, $k = 1, \dots, H$, $j = 1, \dots, I_k$.

● Alternative to $|W|^2$:

$$\sum_{l=1}^{N_w} \frac{\theta_l^2}{\theta_h^2 + \theta_l^2}$$

where $\theta_l$ is the l-th weight ($l = 1, \dots, N_w$) and $\theta_h$ is a threshold: it tends to suppress the weights $\theta_l < \theta_h$.
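Both penalties as NumPy helpers (parameter defaults are ours); they would be added to the cost J before computing gradients:

import numpy as np

def weight_decay(weights, lam=1e-3):
    # lambda * |W|^2 over all weight arrays
    return lam * sum(np.sum(w ** 2) for w in weights)

def weight_elimination(weights, theta_h=1.0):
    # sum_l theta_l^2 / (theta_h^2 + theta_l^2): tends to suppress weights below theta_h
    return sum(np.sum(w ** 2 / (theta_h ** 2 + w ** 2)) for w in weights)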

Enhancements over backpropagation (VIII)

● Sensitivity analysis: the weights $\theta_l$ with low saliency $s_l = h_{ll} \theta_l^2 / 2$ are periodically suppressed (the saliency measures the effect of suppressing $\theta_l$ on J):

$$h_{ll} = \frac{\partial^2 J}{\partial \theta_l^2}$$

● Early stopping: after each epoch, test the network on a validation set (different from the training set).

● Stop the training when the error on that set starts to increase (a symptom of overlearning).
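A sketch of the early-stopping loop; the patience counter is our addition (the slides only say to stop when the validation error starts to increase):

def train_with_early_stopping(train_epoch, validation_error, patience=10, max_epochs=1000):
    # train_epoch(): runs one training epoch; validation_error(): error on the validation set
    best, wait = float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch()
        err = validation_error()
        if err < best:
            best, wait = err, 0
        else:
            wait += 1
            if wait >= patience:        # validation error keeps increasing: overlearning
                break
    return best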

Enhancements over backpropagation (IX)

● Weight sharing: some connections are forced to have the same weights to guarantee certain invariances (e.g. translation, rotation and scale in images).

● Alternative: use input features which are invariant to these transformations.

Limitations of MLP (I)

● Slow training, finishing in local minima which are not optimal, especially when several hidden layers are used.

● Many hyper-parameters to tune: number of hidden layers (H), number of neurons in each layer ($I_1, \dots, I_H$), learning rate (μ), momentum (α), etc.

● For some time, multi-layer networks were abandoned, because an ANN with one hidden layer is a universal approximator.

● However, the number of neurons needed to learn a problem with only one hidden layer is higher than with several layers.

Limitations of MLP (II)

● Backpropagation is based on $f'(a_{ij}^{k+1})$, where f is the sigmoid or hyperbolic tangent (tanh), whose derivative is nearly zero over most of its domain (saturation).

● This produces vanishing gradients and stops the training, generating many problems.

MLP in Python
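The original slide content is not reproduced here; a minimal example using scikit-learn's MLPClassifier (the library, the iris data set and the hyper-parameter values are our choices, not necessarily those of the course):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# preprocessing: zero mean and unit variance (fitted on the training set only)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# one hidden layer with 20 neurons and tanh activation
clf = MLPClassifier(hidden_layer_sizes=(20,), activation="tanh", max_iter=1000)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))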

MLP in Octave

● Package nnet in octave: https://octave.sourceforge.io/nnet/

– prestd(): preprocesses the data so that the mean is 0 and the standard deviation is 1.

– trastd(): preprocess additional data for neural network simulation (for example the test set).

– newff(): create a feed-forward backpropagation network.

– train(): trains a feed-forward neural network.

– sim(): simulates a previously defined and trained neural network.

MLP in Matlab

● Neural Network Toolbox in Matlab: it provides functions similar to those in Octave.

MLP in R

● Package nnet in R: https://cran.r-project.org/web/packages/nnet/index.html

Extreme Learning Machine (ELM)

● Network with H = 2 layers: only one hidden layer; training uses direct propagation, not backpropagation.

● Input weights $W^1 = \{w_{jl}^1\}$ and biases $\{b_j^1\}$, with $j = 1, \dots, I_1$ and $l = 1, \dots, n$, are initialized with random values.

● Output weights $W^2 = \{w_{jl}^2\}$ and $\{b_j^2\}$, with $j = 1, \dots, I_2$ and $l = 1, \dots, I_1$, are calculated using the pseudo-inverse of the activity matrix H and the desired outputs Y as $W^2 = H^+ Y$: it is a direct method and therefore efficient.

Extreme Learning Machine (ELM)

● It is theoretically demonstrated that the training error is reduced when $I_1 \to N$.

● The activation function for the output is linear or sigmoid:

$$h_{ij} = \sum_{l=1}^{I_1} w_{jl}^2\, f\left(\sum_{m=1}^{n} w_{lm}^1 x_{im} + b_l^1\right), \qquad i = 1, \dots, N; \; j = 1, \dots, I_2$$

● Authors code: https://personal.ntu.edu.sg/egbhuang/elm_random_hidden_nodes.html
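A compact NumPy sketch of the ELM training step described above (our own illustration, not the authors' code; the output bias is omitted and np.linalg.pinv computes the pseudo-inverse H+):

import numpy as np

def train_elm(X, Y, I1=100, seed=0):
    # X: (N, n) inputs; Y: (N, I2) desired outputs; I1: number of hidden neurons
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W1 = rng.uniform(-1.0, 1.0, (n, I1))            # random input weights
    b1 = rng.uniform(-1.0, 1.0, I1)                 # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))        # hidden activity matrix (sigmoid)
    W2 = np.linalg.pinv(H) @ Y                      # direct solution: W2 = H+ Y
    return W1, b1, W2

def predict_elm(X, W1, b1, W2):
    H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))
    return H @ W2                                   # linear output activation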

Other neural networks

● Radial Basis Function (RBF) networks

● Recurrent networks: networks with feedback from the outputs to the inputs.

● Self-Organizing Maps (SOM): unsupervised learning

● Learning vector quantization (LVQ): SOM with supervised learning

● Boltzmann machines

Other neural networks: deep learning (DL)

● Networks with many hidden layers. Types:

– Deep neural network (DNN)

– Deep belief network (DBN)

– Convolutional neural network (CNN)

– Deep network (DAN)

● Unsupervised pre-training of each layer separately: restricted Boltzmann machine (RBM). It does not use the desired outputs.

● Supervised training of the output weights

● Fine tuning of the intermediate and output weights using backpropagation.
