Statistical Machine Learning: Introduction

Dino Sejdinovic, Department of Statistics, University of Oxford

22-24 June 2015, Novi Sad
Slides available at: http://www.stats.ox.ac.uk/~sejdinov/talks.html

Introduction

What is Machine Learning?


Arthur Samuel, 1959: Field of study that gives computers the ability to learn without being explicitly programmed.

Tom Mitchell, 1997: Any computer program that improves its performance at some task through experience.

Kevin Murphy, 2012: To develop methods that can automatically detect patterns in data, and then to use the uncovered patterns to predict future data or other outcomes of interest.

[Figure: data; information, structure, prediction, decisions, actions. http://gureckislab.org]

Larry Page about DeepMind’s ML systems that can learn to play video games like humans.

[Figure: Machine Learning surrounded by related fields: statistics, computer science, cognitive science, psychology, mathematics, engineering, operations research, physics, biology, genetics, business, finance.]

What is Data Science?

’Data Scientists’ Meld Statistics and Software (WSJ article)

Information Revolution

Traditional Statistical Inference Problems
- Well formulated question that we would like to answer.
- Expensive data gathering and/or expensive computation.
- Create specially designed experiments to collect high quality data.

Information Revolution
- Improvements in data processing and data storage.
- Powerful, cheap, easy data capturing.
- Lots of (low quality) data with potentially valuable information inside.

Statistics and Machine Learning in the age of Big Data

- ML becoming a thorough blending of computer science, engineering and statistics
- unified framework of data, inferences, procedures, algorithms
- statistics taking computation seriously
- computing taking statistical risk seriously
- scale and granularity of data
- personalization, societal and business impact
- multidisciplinarity - and you are the interdisciplinary glue
- it’s just getting started

Michael Jordan: On the Computational and Statistical Interface and "Big Data"

Applications of Machine Learning

- spam filtering
- recommendation systems
- fraud detection
- self-driving cars
- stock market analysis
- image recognition

ImageNet: Krizhevsky et al, 2012; Machine Learning is Eating the World: Forbes article

Types of Machine Learning

Supervised learning
- Data contains “labels”: every example is an input-output pair
- classification, regression
- Goal: prediction on new examples

Unsupervised learning
- Extract key features of the “unlabelled” data
- clustering, signal separation, density estimation
- Goal: representation, hypothesis generation, visualization

Semi-supervised Learning: A database of examples, only a small subset of which are labelled.

Multi-task Learning: A database of examples, each of which has multiple labels corresponding to different prediction tasks.

Reinforcement Learning: An agent acting in an environment, given rewards for performing appropriate actions, learns to maximize its reward.

Supervised Learning

We observe the input-output pairs $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$.

Types of supervised learning:
- Classification: discrete responses, e.g. $\mathcal{Y} = \{+1, -1\}$ or $\{1, \ldots, K\}$.
- Regression: a numerical value is observed and $\mathcal{Y} = \mathbb{R}$.

The goal is to accurately predict the response $Y$ on new observations of $X$, i.e., to learn a function $f : \mathbb{R}^p \to \mathcal{Y}$, such that $f(X)$ will be close to the true response $Y$.

Decision Theory

Suppose we made a prediction $\hat{Y} = f(X) \in \mathcal{Y}$ based on observation of $X$. How good is the prediction? We can use a loss function $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ to formalize the quality of the prediction.

Typical loss functions:

Misclassification loss (or 0-1 loss) for classification
$$L(Y, f(X)) = \begin{cases} 0 & f(X) = Y \\ 1 & f(X) \neq Y \end{cases}$$

Squared loss for regression
$$L(Y, f(X)) = (f(X) - Y)^2.$$

Many other choices are possible, e.g., weighted misclassification loss.
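As a minimal sketch, the loss functions listed above could be written in Python as follows. The function names and the `costs` table used for the weighted misclassification loss are illustrative assumptions, not something the slides prescribe.

```python
def zero_one_loss(y, y_hat):
    """Misclassification (0-1) loss: 0 for a correct prediction, 1 otherwise."""
    return 0 if y_hat == y else 1


def squared_loss(y, y_hat):
    """Squared loss for regression: (f(X) - Y)^2."""
    return (y_hat - y) ** 2


def weighted_misclassification_loss(y, y_hat, costs):
    """Weighted misclassification loss.

    `costs` is a hypothetical table mapping (true class, predicted class)
    to the cost of that particular mistake; correct predictions cost 0.
    """
    return 0.0 if y_hat == y else costs[(y, y_hat)]


# Assumed example: missing a +1 is five times as costly as a false alarm.
costs = {(+1, -1): 5.0, (-1, +1): 1.0}
print(zero_one_loss(+1, -1))
print(squared_loss(2.0, 3.5))
print(weighted_misclassification_loss(+1, -1, costs))
```

The weighted variant is one way to encode that different kinds of mistakes may matter differently; the cost values here are made up for illustration.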

In classification, the vector of estimated class probabilities $(\hat{p}(k))_{k \in \mathcal{Y}}$ can be returned, and in this case log-likelihood loss (or log loss) $L(Y, \hat{p}) = -\log \hat{p}(Y)$ is often used.

Risk

Paired observations $\{(x_i, y_i)\}_{i=1}^n$ viewed as i.i.d. realizations of a random variable $(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$ with joint distribution $P_{XY}$.

Risk: For a given loss function $L$, the risk $R$ of a learned function $f$ is given by the expected loss
$$R(f) = \mathbb{E}_{P_{XY}}\left[L(Y, f(X))\right],$$
where the expectation is with respect to the true (unknown) joint distribution of $(X, Y)$.

The risk is unknown, but we can compute the empirical risk:
$$R_n(f) = \frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i)).$$

The Bayes Classifier

What is the optimal classifier if the joint distribution of $(X, Y)$ were known? The density $g$ of $X$ can be written as a mixture of $K$ components (corresponding to each of the classes):
$$g(x) = \sum_{k=1}^K \pi_k g_k(x),$$
where, for $k = 1, \ldots, K$,
- $P(Y = k) = \pi_k$ are the class probabilities,
- $g_k(x)$ is the conditional density of $X$, given $Y = k$.

The Bayes classifier $f_{\mathrm{Bayes}} : x \mapsto \{1, \ldots, K\}$ is the one with minimum risk:
$$R(f) = \mathbb{E}\left[L(Y, f(X))\right] = \mathbb{E}_X\left[\mathbb{E}_{Y|X}\left[L(Y, f(X)) \mid X\right]\right] = \int_{\mathcal{X}} \mathbb{E}\left[L(Y, f(X)) \mid X = x\right] g(x)\,dx$$
The minimum risk attained by the Bayes classifier is called Bayes risk. Minimizing $\mathbb{E}[L(Y, f(X)) \mid X = x]$ separately for each $x$ suffices.

Consider the 0-1 loss. The conditional risk at $x$ simplifies to:
$$\mathbb{E}\left[L(Y, f(X)) \mid X = x\right] = \sum_{k=1}^K L(k, f(x)) P(Y = k \mid X = x) = 1 - P(Y = f(x) \mid X = x)$$

The risk is minimized by choosing the class with the greatest posterior probability:
$$f_{\mathrm{Bayes}}(x) = \arg\max_{k=1,\ldots,K} P(Y = k \mid X = x) = \arg\max_{k=1,\ldots,K} \frac{\pi_k g_k(x)}{\sum_{j=1}^K \pi_j g_j(x)} = \arg\max_{k=1,\ldots,K} \pi_k g_k(x).$$

The functions $x \mapsto \pi_k g_k(x)$ are called discriminant functions. The discriminant function with maximum value determines the predicted class of $x$.

The Bayes Classifier: Example

A simple two Gaussians example: Suppose $X \sim \mathcal{N}(\mu_Y, 1)$, where $\mu_1 = -1$ and $\mu_2 = 1$, and assume equal priors $\pi_1 = \pi_2 = 1/2$.
$$g_1(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x + 1)^2}{2}\right) \quad\text{and}\quad g_2(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x - 1)^2}{2}\right).$$

[Figure: conditional densities and marginal density, plotted for $x \in [-3, 3]$.]

Optimal classification is
$$f_{\mathrm{Bayes}}(x) = \arg\max_{k=1,\ldots,K} \pi_k g_k(x) = \begin{cases} 1 & \text{if } x < 0, \\ 2 & \text{if } x \geq 0. \end{cases}$$
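The two-Gaussian example can be checked numerically by evaluating the discriminant functions $\pi_k g_k(x)$ directly. The following is a small sketch under the stated assumptions (equal priors, unit variances); the helper names `normal_pdf` and `bayes_classify` are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def bayes_classify(x, priors, means, sds):
    """Return the class k (numbered from 1) maximizing pi_k * g_k(x).

    Ties are broken in favour of the lower class index, which differs from
    the convention in the text only at the single point x = 0.
    """
    scores = [p * normal_pdf(x, m, s) for p, m, s in zip(priors, means, sds)]
    return 1 + max(range(len(scores)), key=scores.__getitem__)


priors, means, sds = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
labels = [bayes_classify(x, priors, means, sds) for x in (-2.0, -0.1, 0.1, 2.0)]
print(labels)  # prints [1, 1, 2, 2]
```

Points to the left of zero are assigned to class 1 and points to the right to class 2, in line with the decision rule stated above.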

How do you classify a new observation $x$ if now the standard deviation is still 1 for class 1 but 1/3 for class 2?

[Figure: conditional densities for $x \in [-3, 3]$, on a linear scale and on a log scale.]

Looking at density in a log-scale, optimal classification is to select class 2 if and only if $x \in [0.34, 2.16]$.

Plug-in Classification
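The interval $[0.34, 2.16]$ quoted for the unequal-spread example can be recovered by setting $\log \pi_2 g_2(x) - \log \pi_1 g_1(x) \geq 0$, which with equal priors is a quadratic inequality in $x$. The sketch below solves it, assuming standard deviations 1 and 1/3 for classes 1 and 2; the function name `class2_interval` is an illustrative choice.

```python
import math


def class2_interval(mu1, sd1, mu2, sd2):
    """Interval on which class 2 wins, assuming equal priors and sd2 < sd1.

    log g2(x) - log g1(x) = a*x^2 + b*x + c with a < 0, so class 2 is
    chosen between the two roots of the quadratic.
    """
    a = 1.0 / (2.0 * sd1 ** 2) - 1.0 / (2.0 * sd2 ** 2)
    b = mu2 / sd2 ** 2 - mu1 / sd1 ** 2
    c = mu1 ** 2 / (2.0 * sd1 ** 2) - mu2 ** 2 / (2.0 * sd2 ** 2) + math.log(sd1 / sd2)
    root = math.sqrt(b * b - 4.0 * a * c)
    r1, r2 = (-b + root) / (2.0 * a), (-b - root) / (2.0 * a)
    return min(r1, r2), max(r1, r2)


lo, hi = class2_interval(mu1=-1.0, sd1=1.0, mu2=1.0, sd2=1.0 / 3.0)
print(round(lo, 2), round(hi, 2))  # prints 0.34 2.16
```

Outside this interval the wider class-1 density dominates on both sides, which is why the decision region for class 2 is bounded.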

The Bayes Classifier chooses the class with the greatest posterior probability
$$f_{\mathrm{Bayes}}(x) = \arg\max_{k=1,\ldots,K} \pi_k g_k(x).$$

We know neither the conditional densities $g_k$ nor the class probabilities $\pi_k$! The plug-in classifier chooses the class
$$f(x) = \arg\max_{k=1,\ldots,K} \hat{\pi}_k \hat{g}_k(x),$$
where we plugged in
- estimates $\hat{\pi}_k$ of $\pi_k$, and
- estimates $\hat{g}_k(x)$ of the conditional densities, for $k = 1, \ldots, K$.

Linear Discriminant Analysis

LDA is the simplest example of plug-in classification. Assume multivariate normal conditional density $g_k(x)$ for each class $k$:
$$X \mid Y = k \sim \mathcal{N}(\mu_k, \Sigma),$$
$$g_k(x) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k)\right),$$
- each class can have a different mean $\mu_k$,
- all classes share the same covariance $\Sigma$.

For an observation $x$, the $k$-th log-discriminant function is
$$\log \pi_k g_k(x) = \text{const} + \log \pi_k - \frac{1}{2}(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k) = \text{const} + a_k + b_k^\top x,$$
where const collects the terms that do not depend on $k$, $a_k = \log \pi_k - \frac{1}{2}\mu_k^\top \Sigma^{-1} \mu_k$ and $b_k = \Sigma^{-1}\mu_k$: linear log-discriminant functions $\leftrightarrow$ linear decision boundaries.

The quantity $(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k)$ is the squared Mahalanobis distance between $x$ and $\mu_k$.

If $\Sigma = I_p$ and $\pi_k = \frac{1}{K}$, LDA simply chooses the class $k$ with the nearest (in the Euclidean sense) class mean.

Iris Dataset
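To illustrate LDA as a plug-in classifier, here is a one-dimensional sketch: it estimates the class probabilities, the class means and a shared variance, and then compares the linear log-discriminants $a_k + b_k x$. The function names and the toy data are assumptions, and the pooled variance uses the divisor $n - K$ (a maximum-likelihood fit would divide by $n$ instead).

```python
import math


def fit_lda_1d(xs, ys):
    """Fit 1-D LDA; return dicts a, b with log pi_k g_k(x) = const + a[k] + b[k] * x."""
    classes = sorted(set(ys))
    n = len(xs)
    priors, means, pooled_ss = {}, {}, 0.0
    for k in classes:
        xk = [x for x, y in zip(xs, ys) if y == k]
        priors[k] = len(xk) / n                      # estimate of pi_k
        means[k] = sum(xk) / len(xk)                 # estimate of mu_k
        pooled_ss += sum((x - means[k]) ** 2 for x in xk)
    var = pooled_ss / (n - len(classes))             # shared variance estimate
    a = {k: math.log(priors[k]) - means[k] ** 2 / (2.0 * var) for k in classes}
    b = {k: means[k] / var for k in classes}
    return a, b


def predict_lda_1d(x, a, b):
    """Choose the class with the largest linear log-discriminant."""
    return max(a, key=lambda k: a[k] + b[k] * x)


# Toy data (assumed): class 1 centred at -1, class 2 centred at +1.
xs = [-2.0, -1.0, 0.0, 0.0, 1.0, 2.0]
ys = [1, 1, 1, 2, 2, 2]
a, b = fit_lda_1d(xs, ys)
print(predict_lda_1d(-0.5, a, b), predict_lda_1d(0.5, a, b))  # prints 1 2
```

With equal estimated class probabilities and a shared variance, the rule reduces to picking the nearest class mean, as noted in the last remark on LDA.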

[Figure: scatter plot of the Iris dataset, Petal.Width against Petal.Length.]

Generative vs Discriminative Learning

Generative learning: find parameters which explain all the data available.
$$\hat{\theta} = \arg\max_\theta \sum_{i=1}^n \log P(x_i, y_i \mid \theta)$$
- Examples: LDA, naïve Bayes.
- Makes use of all the data available.
- Flexible framework, can incorporate other tasks, incomplete data.
- Stronger modelling assumptions.

Discriminative learning: find parameters that aid in prediction.
$$\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^n L(y_i, f_\theta(x_i)) \quad\text{or}\quad \hat{\theta} = \arg\max_\theta \sum_{i=1}^n \log P(y_i \mid x_i, \theta)$$
- Examples: logistic regression, support vector machines.
- Typically performs better on a given task.
- Weaker modelling assumptions.
- Can overfit more easily.

Generative Learning

We work with a joint distribution $P_{X,Y}(x, y)$ over data vectors and labels. A learning algorithm: construct $f : \mathcal{X} \to \mathcal{Y}$ which predicts the label of $X$. Given a loss function $L$, the risk $R$ of $f(X)$ is
$$R(f) = \mathbb{E}_{X,Y}\left[L(Y, f(X))\right]$$

For 0/1 loss in classification, the Bayes classifier
$$f_{\mathrm{Bayes}}(x) = \arg\max_{k=1,\ldots,K} P(Y = k \mid x) = \arg\max_{k=1,\ldots,K} P_{X,Y}(x, k)$$
has the minimum risk (Bayes risk), but is unknown since $P_{X,Y}$ is unknown.

Assume a parametric model for the joint: $P_{X,Y}(x, y) = P_{X,Y}(x, y \mid \theta)$. Fit $\hat{\theta} = \arg\max_\theta \sum_{i=1}^n \log P(x_i, y_i \mid \theta)$ and plug it back into the Bayes classifier:
$$\hat{f}(x) = \arg\max_{k=1,\ldots,K} P_{X,Y}(x, k \mid \hat{\theta}).$$

Hypothesis space and Empirical Risk Minimization

Find the best function in $\mathcal{H}$ minimizing the risk:
$$f_\star = \arg\min_{f \in \mathcal{H}} \mathbb{E}_{X,Y}\left[L(Y, f(X))\right]$$

Empirical Risk Minimization (ERM): minimize the empirical risk instead, since we typically do not know $P_{X,Y}$.
$$\hat{f} = \arg\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i))$$

Hypothesis space $\mathcal{H}$ is the space of functions $f$ under consideration. How complex should we allow functions $f$ to be? If hypothesis space $\mathcal{H}$ is “too large”, ERM will overfit. The function
$$\hat{f}(x) = \begin{cases} y_i & \text{if } x = x_i, \\ 0 & \text{otherwise} \end{cases}$$
will have zero empirical risk, but is useless for generalization, since it has simply “memorized” the dataset.

Training and Test Performance

Training error is the empirical risk
$$\frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i)).$$
For 0-1 loss in classification, this is the misclassification error on the training data $\{x_i, y_i\}_{i=1}^n$, which were used in learning $f$.

Test error is the empirical risk on new, previously unseen observations $\{\tilde{x}_i, \tilde{y}_i\}_{i=1}^m$,
$$\frac{1}{m} \sum_{i=1}^m L(\tilde{y}_i, f(\tilde{x}_i)),$$
which were NOT used in learning $f$.

Test error is a much better gauge of how well the learned function generalizes to new data. The test error is in general larger than the training error.

Hypothesis space for two-class LDA
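The gap between training and test error can be made concrete with the memorizing function from the ERM discussion. The sketch below is only an illustration: the toy data and the helper names `make_memorizer` and `empirical_risk` are assumptions.

```python
def make_memorizer(xs, ys):
    """Return f with f(x_i) = y_i on the training inputs and 0 elsewhere."""
    table = dict(zip(xs, ys))
    return lambda x: table.get(x, 0)


def empirical_risk(f, xs, ys, loss):
    """Average loss of f over the pairs (x_i, y_i)."""
    return sum(loss(y, f(x)) for x, y in zip(xs, ys)) / len(xs)


def zero_one(y, y_hat):
    return 0 if y == y_hat else 1


# Assumed toy data: labels in {-1, +1}, test inputs not seen during training.
train_x, train_y = [0.1, 0.4, 0.6, 0.9], [-1, -1, 1, 1]
test_x, test_y = [0.2, 0.3, 0.7, 0.8], [-1, -1, 1, 1]

f_hat = make_memorizer(train_x, train_y)
train_error = empirical_risk(f_hat, train_x, train_y, zero_one)
test_error = empirical_risk(f_hat, test_x, test_y, zero_one)
print(train_error, test_error)  # prints 0.0 1.0
```

The memorizer attains zero training error but misclassifies every unseen point, since its default output 0 is never a valid label here.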

Assume we have two classes $\{+1, -1\}$. Recall that the discriminant functions in LDA are linear. Assuming that data vectors in class $k$ are modelled as $\mathcal{N}(\mu_k, \Sigma)$, choosing class $+1$ over $-1$ involves:
$$a_{+1} + b_{+1}^\top x > a_{-1} + b_{-1}^\top x \iff a_\star + b_\star^\top x > 0,$$
where $a_\star = a_{+1} - a_{-1}$, $b_\star = b_{+1} - b_{-1}$.

Thus, the hypothesis space of two-class LDA consists of functions $f(x) = \mathrm{sign}(a + b^\top x)$. We obtain coefficients $\hat{a}$ and $\hat{b}$, and thus the function $\hat{f}$, through fitting the parameters of the generative model.

Discriminative learning: restrict $\mathcal{H}$ to a class of functions $f(x) = \mathrm{sign}(a + b^\top x)$ and select $\hat{a}$ and $\hat{b}$ which minimize empirical risk.

Space of linear decision functions

Hypothesis space $\mathcal{H} = \{f : f(x) = \mathrm{sign}(a + b^\top x),\ a \in \mathbb{R},\ b \in \mathbb{R}^p\}$. Find $a, b$ that minimize the empirical risk under 0-1 loss:
$$\arg\min_{a,b} \frac{1}{n} \sum_{i=1}^n L(y_i, f_{a,b}(x_i)) = \arg\min_{a,b} \frac{1}{n} \sum_{i=1}^n \begin{cases} 0 & \text{if } y_i = \mathrm{sign}(a + b^\top x_i) \\ 1 & \text{otherwise} \end{cases} = \arg\min_{a,b} \frac{1}{2n} \sum_{i=1}^n \left(1 - \mathrm{sign}(y_i(a + b^\top x_i))\right).$$

Combinatorial problem - not typically possible to solve... Maybe easier with a different loss function? (Logistic regression)

Logistic Regression

Linearity of log-posterior odds

Another way to express the linear decision boundary of LDA:
$$\log \frac{p(Y = +1 \mid X = x)}{p(Y = -1 \mid X = x)} = a + b^\top x.$$

Solve explicitly for conditional class probabilities:
$$p(Y = +1 \mid X = x) = \frac{1}{1 + \exp(-(a + b^\top x))} =: s(a + b^\top x)$$
$$p(Y = -1 \mid X = x) = \frac{1}{1 + \exp(+(a + b^\top x))} = s(-a - b^\top x)$$
where $s(\cdot)$ is the logistic function.

[Figure: the logistic function, rising from 0 through 0.5 to 1 over $[-8, 8]$.]

Logistic Regression
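The relation between the linear log-posterior odds and the logistic function $s$ can be verified numerically. This is a small sketch; the parameter values `a`, `b` and the input `x` are arbitrary assumptions chosen for illustration.

```python
import math


def s(z):
    """Logistic function s(z) = 1 / (1 + exp(-z)), evaluated stably."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def posterior(y, x, a, b):
    """p(Y = y | X = x) = s(y * (a + b^T x)) for y in {+1, -1}."""
    z = a + sum(bj * xj for bj, xj in zip(b, x))
    return s(y * z)


a, b, x = 0.5, [1.0, -2.0], [0.3, 0.1]   # assumed values; a + b^T x = 0.6
p_plus, p_minus = posterior(+1, x, a, b), posterior(-1, x, a, b)
log_odds = math.log(p_plus / p_minus)
print(p_plus + p_minus, log_odds)
```

The two class probabilities sum to one, and their log ratio recovers the linear function $a + b^\top x$.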

Consider maximizing the conditional log likelihood:
$$\ell(a, b) = \sum_{i=1}^n \log p(Y = y_i \mid X = x_i) = -\sum_{i=1}^n \log\left(1 + \exp(-y_i(a + b^\top x_i))\right)$$

Equivalent to minimizing the empirical risk associated with the log loss:
$$R^{\mathrm{emp}}_{\log} = \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp(-y_i(a + b^\top x_i))\right) = -\frac{1}{n} \sum_{i=1}^n \log \underbrace{s(y_i(a + b^\top x_i))}_{\hat{p}_{a,b}(y_i \mid x_i)}$$

[Figure: log loss $\log(1 + \exp(-y(a + b^\top x)))$ and 0-1 loss $1/2 + \mathrm{sign}(-y(a + b^\top x))/2$, plotted over $[-4, 4]$.]

Not possible to find optimal $a, b$ analytically. For simplicity, absorb $a$ as an entry in $b$ by appending '1' into the $x$ vector.

Objective function:
$$R^{\mathrm{emp}}_{\log} = -\frac{1}{n} \sum_{i=1}^n \log s(y_i x_i^\top b)$$

Logistic function:
$$s(-z) = 1 - s(z), \quad \nabla_z s(z) = s(z)s(-z), \quad \nabla_z \log s(z) = s(-z), \quad \nabla_z^2 \log s(z) = -s(z)s(-z)$$

Differentiate wrt $b$:
$$\nabla_b R^{\mathrm{emp}}_{\log} = \frac{1}{n} \sum_{i=1}^n -s(-y_i x_i^\top b)\, y_i x_i$$
$$\nabla_b^2 R^{\mathrm{emp}}_{\log} = \frac{1}{n} \sum_{i=1}^n s(y_i x_i^\top b)\, s(-y_i x_i^\top b)\, x_i x_i^\top$$
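One way to gain confidence in the gradient formula for $R^{\mathrm{emp}}_{\log}$ is to compare it with central finite differences. The sketch below does this on a tiny made-up dataset; the data, the parameter vector and the helper names are all assumptions, and the intercept is absorbed by appending 1 to each input.

```python
import math


def s(z):
    """Logistic function, evaluated stably."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_s(z):
    """log s(z), evaluated stably for large |z|."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def risk(b, xs, ys):
    """Empirical log-loss risk: -(1/n) sum_i log s(y_i x_i^T b)."""
    return -sum(log_s(y * dot(x, b)) for x, y in zip(xs, ys)) / len(xs)


def gradient(b, xs, ys):
    """Analytic gradient: (1/n) sum_i -s(-y_i x_i^T b) y_i x_i."""
    g = [0.0] * len(b)
    for x, y in zip(xs, ys):
        coeff = -s(-y * dot(x, b)) * y / len(xs)
        for j, xj in enumerate(x):
            g[j] += coeff * xj
    return g


def numeric_gradient(b, xs, ys, eps=1e-6):
    """Central finite differences of the risk, one coordinate at a time."""
    g = []
    for j in range(len(b)):
        up, down = list(b), list(b)
        up[j] += eps
        down[j] -= eps
        g.append((risk(up, xs, ys) - risk(down, xs, ys)) / (2.0 * eps))
    return g


xs = [(0.5, 1.0), (-1.2, 1.0), (2.0, 1.0), (0.3, 1.0)]  # last entry is the appended 1
ys = [1, -1, 1, -1]
b = [0.3, -0.2]
max_gap = max(abs(g - h) for g, h in zip(gradient(b, xs, ys), numeric_gradient(b, xs, ys)))
print(max_gap)
```

A small `max_gap` indicates that the analytic and numerical gradients agree at the chosen point.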

The second derivative (Hessian) is positive semi-definite, and positive-definite whenever the $x_i$ span the whole input space: the objective function is convex, and a minimum, when it exists, is then the unique global minimum. Many different algorithms can find optimal $b$, e.g.:

Gradient descent:
$$b^{\mathrm{new}} = b + \epsilon \frac{1}{n} \sum_{i=1}^n s(-y_i x_i^\top b)\, y_i x_i$$

Stochastic gradient descent:
$$b^{\mathrm{new}} = b + \epsilon_t \frac{1}{|I(t)|} \sum_{i \in I(t)} s(-y_i x_i^\top b)\, y_i x_i$$
where $I(t)$ is a subset of the data at iteration $t$, and $\epsilon_t \to 0$ slowly ($\sum_t \epsilon_t = \infty$, $\sum_t \epsilon_t^2 < \infty$).

Newton-Raphson:
$$b^{\mathrm{new}} = b - \left(\nabla_b^2 R^{\mathrm{emp}}_{\log}\right)^{-1} \nabla_b R^{\mathrm{emp}}_{\log}$$
This is also called iteratively reweighted least squares.

Conjugate gradient, L-BFGS and other methods from numerical analysis.

Logistic Regression vs. LDA
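The plain gradient descent update for logistic regression can be sketched in a few lines. The dataset below is a made-up, non-separable example (so that a finite minimizer exists), and the step size and iteration count are assumptions picked for this toy problem rather than recommended defaults.

```python
import math


def s(z):
    """Logistic function, evaluated stably."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def risk(b, xs, ys):
    """Empirical log-loss risk (1/n) sum_i log(1 + exp(-y_i x_i^T b))."""
    return -sum(math.log(s(y * dot(x, b))) for x, y in zip(xs, ys)) / len(xs)


def gradient(b, xs, ys):
    """(1/n) sum_i -s(-y_i x_i^T b) y_i x_i."""
    g = [0.0] * len(b)
    for x, y in zip(xs, ys):
        coeff = -s(-y * dot(x, b)) * y / len(xs)
        for j, xj in enumerate(x):
            g[j] += coeff * xj
    return g


def gradient_descent(xs, ys, step=0.5, iterations=5000):
    """Repeat b_new = b - step * gradient, starting from b = 0."""
    b = [0.0] * len(xs[0])
    for _ in range(iterations):
        g = gradient(b, xs, ys)
        b = [bj - step * gj for bj, gj in zip(b, g)]
    return b


# Inputs with an appended 1 for the intercept; labels overlap, so not separable.
xs = [(-2.0, 1.0), (-1.0, 1.0), (-0.5, 1.0), (0.5, 1.0), (1.0, 1.0), (2.0, 1.0)]
ys = [-1, -1, 1, -1, 1, 1]
b_hat = gradient_descent(xs, ys)
grad_norm = math.sqrt(sum(g * g for g in gradient(b_hat, xs, ys)))
print(b_hat, risk(b_hat, xs, ys), grad_norm)
```

Replacing the full sum in `gradient` by a sum over a random subset of the data, with a decreasing step size, would give the stochastic variant described above.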

Both have linear decision boundaries and model log-posterior odds as
$$\log \frac{p(Y = +1 \mid X = x)}{p(Y = -1 \mid X = x)} = a + b^\top x$$

LDA models the marginal density of $x$ as a Gaussian mixture with shared covariance
$$g(x) = \pi_{-1} \mathcal{N}(x; \mu_{-1}, \Sigma) + \pi_{+1} \mathcal{N}(x; \mu_{+1}, \Sigma)$$
and fits the parameters $\theta = (\mu_{-1}, \mu_{+1}, \pi_{-1}, \pi_{+1}, \Sigma)$ by maximizing the joint log-likelihood $\sum_{i=1}^n \log p(x_i, y_i \mid \theta)$. $a$ and $b$ are then determined from $\theta$.

Logistic regression leaves the marginal density $g(x)$ as an arbitrary density function, and fits the parameters $a, b$ by maximizing the conditional log-likelihood $\sum_{i=1}^n \log p(y_i \mid x_i; a, b)$.

Logistic Regression

Properties of logistic regression:
- Makes fewer modelling assumptions than generative classifiers.
- A simple example of a generalised linear model (GLM). Much statistical theory:
  - assessment of fit via deviance and plots,
  - interpretation of entries of $b$ as odds-ratios,
  - fitting categorical data (sometimes called multinomial logistic regression),
  - well founded approaches to removing insignificant features (drop-in-deviance test, Wald test).

Model Complexity and Generalization

Generalization

Generalization ability: what is the out-of-sample error of learner $f$?
- training error $\neq$ testing error.
- We learn $f$ by minimizing some variant of empirical risk $R^{\mathrm{emp}}(f)$: what can we say about the true risk $R(f)$?
- Two important factors determining generalization ability:
  - Model complexity
  - Training data size

Learning Curves

[Figure: training error and test error (prediction error) against model complexity/flexibility, with regions marked underfit, just right and overfit.]

Fixed dataset size, varying model complexity.

[Figure: two panels showing training error and testing error (prediction error) against training dataset size, each with an overfit region marked.]

Fixed model complexity, varying dataset size. Two models: one simple, one complex. Which is which?

Bias-Variance Tradeoff

Where does the prediction error come from? Example: Squared loss in regression: $\mathcal{X} = \mathbb{R}^p$, $\mathcal{Y} = \mathbb{R}$,
$$L(Y, f(X)) = (Y - f(X))^2$$

Optimal $f$ is the conditional mean:
$$f_*(x) = \mathbb{E}[Y \mid X = x]$$

This follows from $R(f) = \mathbb{E}_X \mathbb{E}\left[(Y - f(X))^2 \mid X\right]$ and
$$\mathbb{E}\left[(Y - f(X))^2 \mid X = x\right] = \mathbb{E}\left[Y^2 \mid X = x\right] - 2 f(x)\, \mathbb{E}[Y \mid X = x] + f(x)^2 = \mathrm{Var}[Y \mid X = x] + \left(\mathbb{E}[Y \mid X = x] - f(x)\right)^2.$$

Optimal risk is the intrinsic conditional variability of $Y$ (noise):
$$R(f_*) = \mathbb{E}_X\left[\mathrm{Var}[Y \mid X]\right]$$

Excess risk:
$$R(f) - R(f_*) = \mathbb{E}_X\left[(f(X) - f_*(X))^2\right]$$

How does the excess risk behave on average? Consider a randomly selected dataset $D = \{(X_i, Y_i)\}_{i=1}^n$ and $f^{(D)}$ trained on $D$ using a model (hypothesis class) $\mathcal{H}$.
$$\mathbb{E}_D\left[R(f^{(D)}) - R(f_*)\right] = \mathbb{E}_D \mathbb{E}_X \left(f^{(D)}(X) - f_*(X)\right)^2 = \mathbb{E}_X \mathbb{E}_D \left(f^{(D)}(X) - f_*(X)\right)^2.$$

Denote $\bar{f}(x) = \mathbb{E}_D f^{(D)}(x)$ (average decision function over all possible datasets).
$$\mathbb{E}_D \left(f^{(D)}(X) - f_*(X)\right)^2 = \underbrace{\mathbb{E}_D \left(f^{(D)}(X) - \bar{f}(X)\right)^2}_{\mathrm{Var}_X(\mathcal{H}, n)} + \underbrace{\left(\bar{f}(X) - f_*(X)\right)^2}_{\mathrm{Bias}_X^2(\mathcal{H}, n)}$$

Now,
$$\mathbb{E}_D R(f^{(D)}) = R(f_*) + \mathbb{E}_X \mathrm{Var}_X(\mathcal{H}, n) + \mathbb{E}_X \mathrm{Bias}_X^2(\mathcal{H}, n)$$

Where does the prediction error come from?
- Noise: Intrinsic difficulty of regression problem.
- Bias: How far away is the best learner in the model (average learner over all possible datasets) from the optimal one?
- Variance: How variable is our learning method if given different datasets?

Learning Curves
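The bias-variance decomposition can be illustrated by simulation at a single input point. The sketch below assumes a target $f_*(x) = x^2$, Gaussian noise, and a deliberately crude hypothesis class of constant functions fitted by the sample mean; all of these choices, and the helper names, are illustrative assumptions. The expectation over datasets is replaced by an average over simulated datasets, for which the decomposition holds exactly.

```python
import random


def f_star(x):
    """Assumed true regression function."""
    return x * x


def sample_dataset(rng, n):
    """Draw n pairs with X uniform on [-1, 1] and Y = f_star(X) + Gaussian noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [f_star(x) + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def fit_constant(xs, ys):
    """Least-squares fit within the class of constant functions."""
    c = sum(ys) / len(ys)
    return lambda x: c


rng = random.Random(0)
x0 = 0.9  # input at which the decomposition is evaluated
preds = [fit_constant(*sample_dataset(rng, n=20))(x0) for _ in range(2000)]

f_bar = sum(preds) / len(preds)                               # average learner at x0
variance = sum((p - f_bar) ** 2 for p in preds) / len(preds)  # Var_X(H, n) at x0
bias_sq = (f_bar - f_star(x0)) ** 2                           # Bias_X^2(H, n) at x0
excess = sum((p - f_star(x0)) ** 2 for p in preds) / len(preds)
print(excess, variance, bias_sq)
```

For this very simple model the squared bias dominates the variance at `x0`, which is the underfitting regime.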

[Figure: the two learning-curve panels (training error and testing error against training dataset size), annotated with bias, variance and the overfit region.]

Learning Curve

[Figure: training error and test error (prediction error) against model complexity/flexibility. Underfit: high bias, low variance. Just right. Overfit: low bias, high variance.]

Artificial Neural Networks

Biological inspiration

- Basic computational elements: neurons.
- Receives signals from other neurons via dendrites.
- Sends processed signals via axons.
- Axon-dendrite interactions at synapses.
- $10^{10}$ to $10^{11}$ neurons.
- $10^{14}$ to $10^{15}$ synapses.

Single Neuron Classifier

[Figure: single neuron with inputs $x^{(1)}, \ldots, x^{(p)}$ and a constant input 1, weights $w_1, \ldots, w_p$ and bias $b$, a summation unit computing $w^\top x + b$, and transfer function $s(\cdot)$ giving the output $\hat{P}(Y = 1 \mid X = x)$.]

- activation $w^\top x + b$ (linear in inputs $x$)
- activation/transfer function $s$ gives the output/activity (potentially nonlinear in $x$)
- common nonlinear activation function $s(a) = \frac{1}{1 + e^{-a}}$: logistic regression
- learn $w$ and $b$ via gradient descent

[Figure: plot over the inputs $x_{i1}$ and $x_{i2}$.]

Overfitting

[Figures: results after 30 and 80 iterations, after 500 and 3000 iterations, and after 10000 and 40000 iterations. Figures from D. MacKay, Information Theory, Inference and Learning Algorithms.]

Prevent overfitting by:
- early stopping: just halt the gradient descent
- regularization: L2-regularization, called weight decay in the neural networks literature.

Multilayer Networks

Data vectors $x_i \in \mathbb{R}^p$, binary labels $y_i \in \{0, 1\}$.
- inputs $x_{i1}, \ldots, x_{ip}$
- output $\hat{y}_i = P(Y = 1 \mid X = x_i)$
- hidden unit activities $h_{i1}, \ldots, h_{im}$

Compute hidden unit activities:
$$h_{il} = s\left(b^h_l + \sum_{j=1}^p w^h_{jl} x_{ij}\right)$$

Compute output probability:
$$\hat{y}_i = s\left(b^o + \sum_{l=1}^m w^o_l h_{il}\right)$$

[Figure: network with inputs $x_{i1}, \ldots, x_{i4}$, hidden units $h_{i1}, \ldots, h_{i7}$ and output $\hat{y}_i$.]

[Figure: plot over the inputs $x_{i1}$ and $x_{i2}$.]

Training a Neural Network

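Objective function: L2-regularized log-loss
$$J = -\sum_{i=1}^n \left[y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)\right] + \frac{\lambda}{2}\left(\sum_{jl} (w^h_{jl})^2 + \sum_l (w^o_l)^2\right)$$
where
$$\hat{y}_i = s\left(b^o + \sum_{l=1}^m w^o_l h_{il}\right), \qquad h_{il} = s\left(b^h_l + \sum_{j=1}^p w^h_{jl} x_{ij}\right).$$

Optimize parameters $\theta = (b^h, w^h, b^o, w^o)$, where $b^h \in \mathbb{R}^m$, $w^h \in \mathbb{R}^{p \times m}$, $b^o \in \mathbb{R}$, $w^o \in \mathbb{R}^m$, with gradient descent.
$$\frac{\partial J}{\partial w^o_l} = \lambda w^o_l + \sum_{i=1}^n \frac{\partial J}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial w^o_l} = \lambda w^o_l + \sum_{i=1}^n (\hat{y}_i - y_i) h_{il},$$
$$\frac{\partial J}{\partial w^h_{jl}} = \lambda w^h_{jl} + \sum_{i=1}^n \frac{\partial J}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial h_{il}} \frac{\partial h_{il}}{\partial w^h_{jl}} = \lambda w^h_{jl} + \sum_{i=1}^n (\hat{y}_i - y_i)\, w^o_l\, h_{il}(1 - h_{il})\, x_{ij}.$$

L2-regularization is often called weight decay. Multiple hidden layers: Backpropagation algorithm.

Multiple hidden layers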
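The gradient expressions for the single-hidden-layer network with L2-regularized log-loss can be checked against finite differences. The following sketch uses a tiny made-up dataset and parameter values (all assumptions); `wh[j][l]` plays the role of $w^h_{jl}$ and `wo[l]` of $w^o_l$.

```python
import math


def s(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, bh, wh, bo, wo):
    """Return hidden activities h and output probability y_hat for one input x."""
    m = len(bh)
    h = [s(bh[l] + sum(wh[j][l] * x[j] for j in range(len(x)))) for l in range(m)]
    y_hat = s(bo + sum(wo[l] * h[l] for l in range(m)))
    return h, y_hat


def objective(X, Y, bh, wh, bo, wo, lam):
    """L2-regularized log-loss J."""
    nll = 0.0
    for x, y in zip(X, Y):
        _, y_hat = forward(x, bh, wh, bo, wo)
        nll -= y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat)
    penalty = sum(w * w for row in wh for w in row) + sum(w * w for w in wo)
    return nll + 0.5 * lam * penalty


def weight_gradients(X, Y, bh, wh, bo, wo, lam):
    """Analytic dJ/dwo[l] and dJ/dwh[j][l] from the formulas above."""
    p, m = len(wh), len(wo)
    g_wo = [lam * wo[l] for l in range(m)]
    g_wh = [[lam * wh[j][l] for l in range(m)] for j in range(p)]
    for x, y in zip(X, Y):
        h, y_hat = forward(x, bh, wh, bo, wo)
        err = y_hat - y
        for l in range(m):
            g_wo[l] += err * h[l]
            for j in range(p):
                g_wh[j][l] += err * wo[l] * h[l] * (1.0 - h[l]) * x[j]
    return g_wo, g_wh


# Assumed toy problem: p = 2 inputs, m = 2 hidden units, n = 3 examples.
X, Y = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8]], [1, 0, 1]
bh, wh, bo, wo, lam = [0.1, -0.2], [[0.3, -0.4], [0.2, 0.5]], 0.05, [0.7, -0.6], 0.1
g_wo, g_wh = weight_gradients(X, Y, bh, wh, bo, wo, lam)

eps = 1e-6
# Central difference for wo[0].
num_wo0 = (objective(X, Y, bh, wh, bo, [wo[0] + eps, wo[1]], lam)
           - objective(X, Y, bh, wh, bo, [wo[0] - eps, wo[1]], lam)) / (2.0 * eps)
# Central difference for wh[1][0].
wh_up, wh_down = [row[:] for row in wh], [row[:] for row in wh]
wh_up[1][0] += eps
wh_down[1][0] -= eps
num_wh10 = (objective(X, Y, bh, wh_up, bo, wo, lam)
            - objective(X, Y, bh, wh_down, bo, wo, lam)) / (2.0 * eps)
print(g_wo[0] - num_wo0, g_wh[1][0] - num_wh10)
```

Only the weight gradients are checked here; the bias gradients follow the same pattern without the regularization term.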

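$$\hat{y}_i = h_i^{L+1}, \qquad h_i^{\ell+1} = s\left(W^{\ell+1} h_i^{\ell}\right)$$
- $W^{\ell+1} = (w^{\ell+1}_{jk})_{jk}$: weight matrix at the $(\ell+1)$-th layer; weight $w^{\ell}_{jk}$ is on the edge between $h^{\ell-1}_{ik}$ and $h^{\ell}_{ij}$
- $s$: entrywise (logistic) transfer function

$$\hat{y}_i = s\left(W^{L+1} s\left(W^{L} \cdots s\left(W^{1} x_i\right)\right)\right)$$

[Figure: layered network with inputs $x_{i1}, \ldots, x_{ip}$ (layer 0, $x_i = h_i^0$), hidden layers $h^1_{i1}, \ldots, h^1_{im}$ up to $h^L_{i1}, \ldots, h^L_{im}$, and output $\hat{y}_i = h_i^{L+1}$.]

Backpropagation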

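$$\hat{y}_i = h_i^{L+1}, \qquad J = -\sum_{i=1}^n \left[y_i \log h_i^{L+1} + (1 - y_i) \log(1 - h_i^{L+1})\right]$$

Gradients wrt $h^{\ell}_{ij}$ are computed by recursive applications of the chain rule, and propagated through the network backwards.
$$\frac{\partial J}{\partial h_i^{L+1}} = -\frac{y_i}{h_i^{L+1}} + \frac{1 - y_i}{1 - h_i^{L+1}}$$
$$\frac{\partial J}{\partial h^{\ell}_{ij}} = \sum_{r=1}^m \frac{\partial J}{\partial h^{\ell+1}_{ir}} \frac{\partial h^{\ell+1}_{ir}}{\partial h^{\ell}_{ij}}$$
$$\frac{\partial J}{\partial w^{\ell}_{jk}} = \sum_{i=1}^n \frac{\partial J}{\partial h^{\ell}_{ij}} \frac{\partial h^{\ell}_{ij}}{\partial w^{\ell}_{jk}}$$

[Figure: three consecutive layers $h^{\ell-1}_{ik}$, $h^{\ell}_{ij}$ and $h^{\ell+1}_{i1}, \ldots, h^{\ell+1}_{im}$, with weight $w^{\ell}_{jk}$ on the edge from $h^{\ell-1}_{ik}$ to $h^{\ell}_{ij}$.]

Neural Networks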

[Figure: two panels over inputs $x_1, x_2 \in [0, 1]$: "Global solution and local minima", showing the solution (global minimum) and local minima 1, 2 and 3; and "Neural network fit with a weight decay of 0.01".]

R package implementing neural networks with a single hidden layer: nnet.

Neural Networks – Discussion

- Nonlinear hidden units introduce modelling flexibility.
- In contrast to user-introduced nonlinearities, features are global, and can be learned to maximize predictive performance.
- Neural networks with a single hidden layer and sufficiently many hidden units can model arbitrarily complex functions.
- Optimization problem is not convex, and objective function can have many local optima, plateaus and ridges.
- On large scale problems, often use stochastic gradient descent, along with a whole host of techniques for optimization, regularization, and initialization.
- Recent developments, especially by Geoffrey Hinton, Yann LeCun, Yoshua Bengio, Andrew Ng and others. See also http://deeplearning.net/.

Dropout Training of Neural Networks

Neural network with single layer of hidden units:

Hidden unit activations:
$$h_{ik} = s\left(b^h_k + \sum_{j=1}^p W^h_{jk} x_{ij}\right)$$

Output probability:
$$\hat{y}_i = s\left(b^o + \sum_{k=1}^m W^o_k h_{ik}\right)$$

[Figure: network with inputs $x_{i1}, \ldots, x_{i4}$, hidden units $h_{i1}, \ldots, h_{i7}$ and output $\hat{y}_i$.]

- Large, overfitted networks often have co-adapted hidden units.
- What each hidden unit learns may in fact be useless, e.g. predicting the negation of predictions from other units.
- Can prevent co-adaptation by randomly dropping out units from network.

Hinton et al (2012).

Model as an ensemble of networks:
$$p(y_i = 1 \mid x_i, \theta) = \sum_{b \subset \{1, \ldots, m\}} q^{|b|} (1 - q)^{m - |b|}\, p(y_i = 1 \mid x_i, \theta, \text{drop out units } b)$$

[Figure: four thinned networks on inputs $x_{i1}, \ldots, x_{i4}$ with output $\hat{y}_i$, retaining hidden units $\{h_{i1}, h_{i2}, h_{i3}, h_{i4}\}$, $\{h_{i1}, h_{i3}, h_{i5}, h_{i7}\}$, $\{h_{i1}, h_{i4}, h_{i6}\}$ and $\{h_{i3}, h_{i4}, h_{i5}\}$ respectively.]

- Weight-sharing among all networks: each network uses a subset of the parameters of the full network (corresponding to the retained units).
- Training by stochastic gradient descent: at each iteration a network is sampled from the ensemble, and its subset of parameters is updated.
- Biological inspiration: $10^{14}$ weights to be fitted in a lifetime of $10^9$ seconds.
- Poisson spikes as a regularization mechanism which prevents co-adaptation: Geoff Hinton on Brains, Sex and Machine Learning.

Classification of phonemes in speech.

Figure from Hinton et al.
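For a very small network, the ensemble formula for dropout, $\sum_b q^{|b|}(1 - q)^{m - |b|}\, p(y_i = 1 \mid x_i, \theta, \text{drop out units } b)$, can be evaluated exactly by enumerating every subset $b$ of dropped hidden units. The sketch below assumes each unit is dropped independently with probability $q$, and that a dropped unit simply contributes nothing to the output sum; the hidden activities and weights are made-up values.

```python
import itertools
import math


def s(z):
    return 1.0 / (1.0 + math.exp(-z))


def ensemble_probability(h, w_out, b_out, q):
    """Exact ensemble prediction over all 2^m dropout patterns.

    h      : hidden unit activities for one input
    w_out  : output weights, one per hidden unit
    b_out  : output bias
    q      : probability of dropping each hidden unit (assumed independent)
    Returns the ensemble probability and the sum of the ensemble weights.
    """
    m = len(h)
    total, weight_sum = 0.0, 0.0
    for r in range(m + 1):
        for dropped in itertools.combinations(range(m), r):
            weight = q ** r * (1.0 - q) ** (m - r)
            kept = [k for k in range(m) if k not in dropped]
            p = s(b_out + sum(w_out[k] * h[k] for k in kept))
            total += weight * p
            weight_sum += weight
    return total, weight_sum


h = [0.2, 0.9, 0.5]          # assumed hidden activities
w_out = [1.5, -2.0, 0.7]     # assumed output weights
prob, weight_sum = ensemble_probability(h, w_out, b_out=0.1, q=0.5)
full, _ = ensemble_probability(h, w_out, b_out=0.1, q=0.0)
print(prob, weight_sum, full)
```

The enumeration costs $2^m$ network evaluations, which is why training samples one thinned network per iteration instead of summing over the whole ensemble.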