Statistical Machine Learning: Introduction
Dino Sejdinovic Department of Statistics University of Oxford
22-24 June 2015, Novi Sad
Slides available at: http://www.stats.ox.ac.uk/~sejdinov/talks.html
Introduction
What is Machine Learning?
Arthur Samuel, 1959: Field of study that gives computers the ability to learn without being explicitly programmed.
Tom Mitchell, 1997: Any computer program that improves its performance at some task through experience.
Kevin Murphy, 2012: To develop methods that can automatically detect patterns in data, and then to use the uncovered patterns to predict future data or other outcomes of interest.
[Figure labels: data; information, structure, prediction, decisions, actions. http://gureckislab.org]
Larry Page about DeepMind’s ML systems that can learn to play video games like humans.
[Figure: Machine Learning surrounded by related fields: statistics, computer science, cognitive science, psychology, mathematics, engineering, operations research, physics, biology, genetics, business, finance.]
What is Data Science?
’Data Scientists’ Meld Statistics and Software (WSJ article)
Information Revolution
Traditional Statistical Inference Problems
  Well formulated question that we would like to answer.
  Expensive data gathering and/or expensive computation.
  Create specially designed experiments to collect high quality data.
Information Revolution
  Improvements in data processing and data storage.
  Powerful, cheap, easy data capturing.
  Lots of (low quality) data with potentially valuable information inside.
Statistics and Machine Learning in the age of Big Data
ML becoming a thorough blending of computer science, engineering and statistics
unified framework of data, inferences, procedures, algorithms
statistics taking computation seriously
computing taking statistical risk seriously
scale and granularity of data
personalization, societal and business impact
multidisciplinarity - and you are the interdisciplinary glue
it’s just getting started
Michael Jordan: On the Computational and Statistical Interface and "Big Data"
Applications of Machine Learning
spam filtering, recommendation systems, fraud detection
self-driving cars, stock market analysis, image recognition
ImageNet: Krizhevsky et al, 2012; Machine Learning is Eating the World: Forbes article
Types of Machine Learning
Supervised learning
  Data contains “labels”: every example is an input-output pair
  classification, regression
  Goal: prediction on new examples
Unsupervised learning
  Extract key features of the “unlabelled” data
  clustering, signal separation, density estimation
  Goal: representation, hypothesis generation, visualization
Semi-supervised Learning: A database of examples, only a small subset of which are labelled.
Multi-task Learning: A database of examples, each of which has multiple labels corresponding to different prediction tasks.
Reinforcement Learning: An agent acting in an environment, given rewards for performing appropriate actions, learns to maximize its reward.
Supervised Learning
We observe the input-output pairs $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$.
Types of supervised learning:
  Classification: discrete responses, e.g. $\mathcal{Y} = \{+1, -1\}$ or $\{1, \dots, K\}$.
  Regression: a numerical value is observed and $\mathcal{Y} = \mathbb{R}$.
The goal is to accurately predict the response $Y$ on new observations of $X$, i.e., to learn a function $f : \mathbb{R}^p \to \mathcal{Y}$, such that $f(X)$ will be close to the true response $Y$.
Decision Theory
Loss function
Suppose we made a prediction $\hat{Y} = f(X) \in \mathcal{Y}$ based on observation of $X$. How good is the prediction? We can use a loss function $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ to formalize the quality of the prediction.
Typical loss functions:
Misclassification loss (or 0-1 loss) for classification
$$L(Y, f(X)) = \begin{cases} 0 & f(X) = Y \\ 1 & f(X) \neq Y \end{cases}$$
Squared loss for regression
$$L(Y, f(X)) = (f(X) - Y)^2.$$
Many other choices are possible, e.g., weighted misclassification loss.
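As a rough illustration, the loss functions listed so far can be written as small Python functions. The function names and the per-class cost dictionary in the weighted variant are assumptions made for this sketch, not notation from the slides.

```python
def zero_one_loss(y, y_pred):
    """Misclassification (0-1) loss: 0 if the prediction is correct, 1 otherwise."""
    return 0 if y_pred == y else 1


def squared_loss(y, y_pred):
    """Squared loss for regression: (f(X) - Y)^2."""
    return (y_pred - y) ** 2


def weighted_zero_one_loss(y, y_pred, weights):
    """Weighted misclassification loss: the cost of an error depends on the true class.

    `weights` maps each true label to the cost of misclassifying it
    (an illustrative choice of weighting).
    """
    return 0.0 if y_pred == y else weights[y]


print(zero_one_loss(+1, -1))
print(squared_loss(2.0, 2.5))
print(weighted_zero_one_loss(+1, -1, {+1: 5.0, -1: 1.0}))
```

In the last call, misclassifying a true +1 is taken to be five times as costly as misclassifying a true -1.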
In classification, the vector of estimated probabilities $(\hat{p}(k))_{k \in \mathcal{Y}}$ can be returned, and in this case log-likelihood loss (or log loss) $L(Y, \hat{p}) = -\log \hat{p}(Y)$ is often used.
Risk
Paired observations $\{(x_i, y_i)\}_{i=1}^n$ viewed as i.i.d. realizations of a random variable $(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$ with joint distribution $P_{XY}$.
Risk: For a given loss function $L$, the risk $R$ of a learned function $f$ is given by the expected loss
$$R(f) = \mathbb{E}_{P_{XY}}[L(Y, f(X))],$$
where the expectation is with respect to the true (unknown) joint distribution of $(X, Y)$.
The risk is unknown, but we can compute the empirical risk:
$$R_n(f) = \frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i)).$$
The Bayes Classifier
What is the optimal classifier if the joint distribution of $(X, Y)$ were known? The density $g$ of $X$ can be written as a mixture of $K$ components (corresponding to each of the classes):
$$g(x) = \sum_{k=1}^K \pi_k g_k(x),$$
where, for $k = 1, \dots, K$,
  $P(Y = k) = \pi_k$ are the class probabilities,
  $g_k(x)$ is the conditional density of $X$, given $Y = k$.
The Bayes classifier $f_{\mathrm{Bayes}} : x \mapsto \{1, \dots, K\}$ is the one with minimum risk:
$$R(f) = \mathbb{E}[L(Y, f(X))] = \mathbb{E}_X\big[\mathbb{E}_{Y|X}[L(Y, f(X)) \mid X]\big] = \int_{\mathcal{X}} \mathbb{E}[L(Y, f(X)) \mid X = x]\, g(x)\, dx.$$
The minimum risk attained by the Bayes classifier is called Bayes risk. Minimizing $\mathbb{E}[L(Y, f(X)) \mid X = x]$ separately for each $x$ suffices.
Consider the 0-1 loss. The risk simplifies to:
$$\mathbb{E}\big[L(Y, f(X)) \,\big|\, X = x\big] = \sum_{k=1}^K L(k, f(x))\, P(Y = k \mid X = x) = 1 - P(Y = f(x) \mid X = x)$$
The risk is minimized by choosing the class with the greatest posterior probability:
$$f_{\mathrm{Bayes}}(x) = \arg\max_{k=1,\dots,K} P(Y = k \mid X = x) = \arg\max_{k=1,\dots,K} \frac{\pi_k g_k(x)}{\sum_{j=1}^K \pi_j g_j(x)} = \arg\max_{k=1,\dots,K} \pi_k g_k(x).$$
The functions $x \mapsto \pi_k g_k(x)$ are called discriminant functions. The discriminant function with maximum value determines the predicted class of $x$.
The Bayes Classifier: Example
A simple two Gaussians example: Suppose $X \sim \mathcal{N}(\mu_Y, 1)$, where $\mu_1 = -1$ and $\mu_2 = 1$, and assume equal priors $\pi_1 = \pi_2 = 1/2$.
$$g_1(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x+1)^2}{2}\right) \quad\text{and}\quad g_2(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x-1)^2}{2}\right).$$
[Figure: conditional densities and marginal density, plotted against x from −3 to 3.]
Optimal classification is
$$f_{\mathrm{Bayes}}(x) = \arg\max_{k=1,\dots,K} \pi_k g_k(x) = \begin{cases} 1 & \text{if } x < 0, \\ 2 & \text{if } x \geq 0. \end{cases}$$
How do you classify a new observation $x$ if now the standard deviation is still 1 for class 1 but 1/3 for class 2?
[Figure: conditional densities on a linear scale and on a log scale, plotted against x from −3 to 3.]
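One way to approach this question numerically is to compare the two log-discriminant functions directly. The sketch below assumes equal priors and means −1 and 1 as in the previous example; the function names are hypothetical. Setting the two log-discriminants equal gives the quadratic $4x^2 - 10x + 4 - \log 3 = 0$, whose roots are the decision boundaries.

```python
import math


def log_discriminant(x, prior, mean, sd):
    """log(pi_k * g_k(x)) for a univariate Gaussian class-conditional density."""
    return (math.log(prior) - math.log(sd) - 0.5 * math.log(2 * math.pi)
            - (x - mean) ** 2 / (2 * sd ** 2))


def bayes_classify(x, priors=(0.5, 0.5), means=(-1.0, 1.0), sds=(1.0, 1.0 / 3.0)):
    """Return the class (1 or 2) whose log-discriminant is largest."""
    scores = [log_discriminant(x, p, m, s) for p, m, s in zip(priors, means, sds)]
    return 1 + scores.index(max(scores))


# Roots of 4x^2 - 10x + (4 - log 3) = 0, obtained by equating the log-discriminants
c = 4.0 - math.log(3.0)
disc = math.sqrt(10.0 ** 2 - 4 * 4.0 * c)
lower, upper = (10.0 - disc) / 8.0, (10.0 + disc) / 8.0
print(lower, upper)
print([bayes_classify(x) for x in (0.0, 1.0, 3.0)])
```

Class 2 is selected only between the two roots: its density is more concentrated around 1, so class 1 wins again far to the right.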
Looking at density in a log-scale, optimal classification is to select class 2 if and only if $x \in [0.34, 2.16]$.
Plug-in Classification
The Bayes Classifier chooses the class with the greatest posterior probability
$$f_{\mathrm{Bayes}}(x) = \arg\max_{k=1,\dots,K} \pi_k g_k(x).$$
We know neither the conditional densities $g_k$ nor the class probabilities $\pi_k$! The plug-in classifier chooses the class
$$f(x) = \arg\max_{k=1,\dots,K} \hat{\pi}_k \hat{g}_k(x),$$
where we plugged in
  estimates $\hat{\pi}_k$ of $\pi_k$, $k = 1, \dots, K$, and
  estimates $\hat{g}_k(x)$ of conditional densities.
Linear Discriminant Analysis
LDA is the simplest example of plug-in classification. Assume multivariate normal conditional density $g_k(x)$ for each class $k$:
$$X \mid Y = k \sim \mathcal{N}(\mu_k, \Sigma),$$
$$g_k(x) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k)\right),$$
  each class can have a different mean $\mu_k$,
  all classes share the same covariance $\Sigma$.
For an observation $x$, the $k$-th log-discriminant function is
$$\log \pi_k g_k(x) = \mathrm{const} + \log \pi_k - \frac{1}{2}(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k) = \mathrm{const} + a_k + b_k^\top x,$$
linear log-discriminant functions ↔ linear decision boundaries.
The quantity $(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k)$ is the squared Mahalanobis distance between $x$ and $\mu_k$.
If $\Sigma = I_p$ and $\pi_k = \frac{1}{K}$, LDA simply chooses the class $k$ with the nearest (in the Euclidean sense) class mean.
Iris Dataset
[Figure: scatter plot of the Iris dataset, Petal.Width against Petal.Length.]
Statistical Learning Theory
Generative vs Discriminative Learning
Generative learning: find parameters which explain all the data available.
$$\hat{\theta} = \arg\max_\theta \sum_{i=1}^n \log P(x_i, y_i \mid \theta)$$
  Examples: LDA, naïve Bayes.
  Makes use of all the data available.
  Flexible framework, can incorporate other tasks, incomplete data.
  Stronger modelling assumptions.
Discriminative learning: find parameters that aid in prediction.
$$\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^n L(y_i, f_\theta(x_i)) \quad\text{or}\quad \hat{\theta} = \arg\max_\theta \sum_{i=1}^n \log P(y_i \mid x_i, \theta)$$
  Examples: logistic regression, support vector machines.
  Typically performs better on a given task.
  Weaker modelling assumptions.
  Can overfit more easily.
Generative Learning
We work with a joint distribution $P_{X,Y}(x, y)$ over data vectors and labels.
A learning algorithm: construct $f : \mathcal{X} \to \mathcal{Y}$ which predicts the label of $X$.
Given a loss function $L$, the risk $R$ of $f(X)$ is
$$R(f) = \mathbb{E}_{X,Y}[L(Y, f(X))]$$
For 0/1 loss in classification, Bayes classifier
$$f_{\mathrm{Bayes}}(x) = \arg\max_{k=1,\dots,K} P(Y = k \mid x) = \arg\max_{k=1,\dots,K} P_{X,Y}(x, k)$$
has the minimum risk (Bayes risk), but is unknown since $P_{X,Y}$ is unknown.
Assume a parametric model for the joint: $P_{X,Y}(x, y) = P_{X,Y}(x, y \mid \theta)$.
Fit $\hat{\theta} = \arg\max_\theta \sum_{i=1}^n \log P(x_i, y_i \mid \theta)$ and plug in back to Bayes classifier:
$$\hat{f}(x) = \arg\max_{k=1,\dots,K} P_{X,Y}(x, k \mid \hat{\theta}).$$
Hypothesis space and Empirical Risk Minimization
Find best function in $\mathcal{H}$ minimizing the risk:
$$f_\star = \arg\min_{f \in \mathcal{H}} \mathbb{E}_{X,Y}[L(Y, f(X))]$$
Empirical Risk Minimization (ERM): minimize the empirical risk instead, since we typically do not know $P_{X,Y}$.
$$\hat{f} = \arg\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i))$$
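A minimal sketch of ERM, assuming a deliberately small hypothesis space: one-dimensional threshold classifiers on a grid, scored with 0-1 loss. The toy sample and all names here are made up for illustration.

```python
def empirical_risk(f, xs, ys, loss):
    """Average loss of f over the sample: (1/n) * sum_i L(y_i, f(x_i))."""
    return sum(loss(y, f(x)) for x, y in zip(xs, ys)) / len(xs)


def zero_one(y, y_pred):
    return 0 if y == y_pred else 1


def make_threshold(t):
    """Classifier f_t(x) = +1 if x >= t, else -1."""
    return lambda x: 1 if x >= t else -1


# Toy one-dimensional sample (made up for illustration)
xs = [-2.0, -1.5, -0.3, 0.2, 0.4, 1.1, 1.8, 2.5]
ys = [-1, -1, -1, 1, -1, 1, 1, 1]

# Finite hypothesis space H: thresholds on a grid from -3.0 to 3.0
thresholds = [t / 10.0 for t in range(-30, 31)]
risks = [empirical_risk(make_threshold(t), xs, ys, zero_one) for t in thresholds]
best_risk = min(risks)
best_t = thresholds[risks.index(best_risk)]
print(best_t, best_risk)
```

No threshold separates this sample perfectly, so the empirical risk minimizer within this restricted class still makes one training error.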
Hypothesis space $\mathcal{H}$ is the space of functions $f$ under consideration. How complex should we allow functions $f$ to be? If hypothesis space $\mathcal{H}$ is “too large”, ERM will overfit. Function
$$\hat{f}(x) = \begin{cases} y_i & \text{if } x = x_i, \\ 0 & \text{otherwise} \end{cases}$$
will have zero empirical risk, but is useless for generalization, since it has simply “memorized” the dataset.
Training and Test Performance
Training error is the empirical risk
$$\frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i)).$$
For 0-1 loss in classification, this is the misclassification error on the training data $\{x_i, y_i\}_{i=1}^n$, which were used in learning $f$.
Test error is the empirical risk on new, previously unseen observations $\{\tilde{x}_i, \tilde{y}_i\}_{i=1}^m$,
$$\frac{1}{m} \sum_{i=1}^m L(\tilde{y}_i, f(\tilde{x}_i)),$$
which were NOT used in learning $f$.
Test error is a much better gauge of how well the learned function generalizes to new data.
The test error is in general larger than the training error.
Hypothesis space for two-class LDA
Assume we have two classes $\{+1, -1\}$. Recall that the discriminant functions in LDA are linear. Assuming that data vectors in class $k$ are modelled as $\mathcal{N}(\mu_k, \Sigma)$, choosing class $+1$ over $-1$ involves:
$$a_{+1} + b_{+1}^\top x > a_{-1} + b_{-1}^\top x \iff a_\star + b_\star^\top x > 0,$$
where $a_\star = a_{+1} - a_{-1}$, $b_\star = b_{+1} - b_{-1}$.
Thus, hypothesis space of two-class LDA consists of functions $f(x) = \mathrm{sign}(a + b^\top x)$.
We obtain coefficients $\hat{a}$ and $\hat{b}$, and thus the function $\hat{f}$, through fitting the parameters of the generative model.
Discriminative learning: restrict $\mathcal{H}$ to a class of functions $f(x) = \mathrm{sign}(a + b^\top x)$ and select $\hat{a}$ and $\hat{b}$ which minimize empirical risk.
Space of linear decision functions
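Hypothesis space $\mathcal{H} = \{f : f(x) = \mathrm{sign}(a + b^\top x),\ a \in \mathbb{R},\ b \in \mathbb{R}^p\}$
Find $a, b$ that minimize the empirical risk under 0-1 loss:
$$\arg\min_{a,b} \frac{1}{n} \sum_{i=1}^n L(y_i, f_{a,b}(x_i)) = \arg\min_{a,b} \frac{1}{n} \sum_{i=1}^n \begin{cases} 0 & \text{if } y_i = \mathrm{sign}(a + b^\top x_i) \\ 1 & \text{otherwise} \end{cases} = \arg\min_{a,b} \frac{1}{2n} \sum_{i=1}^n \left(1 - \mathrm{sign}(y_i(a + b^\top x_i))\right).$$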
Combinatorial problem - not typically possible to solve... Maybe easier with a different loss function? (Logistic regression)
Logistic Regression
Linearity of log-posterior odds
Another way to express linear decision boundary of LDA:
$$\log \frac{p(Y = +1 \mid X = x)}{p(Y = -1 \mid X = x)} = a + b^\top x.$$
Solve explicitly for conditional class probabilities:
$$p(Y = +1 \mid X = x) = \frac{1}{1 + \exp(-(a + b^\top x))} =: s(a + b^\top x)$$
$$p(Y = -1 \mid X = x) = \frac{1}{1 + \exp(+(a + b^\top x))} = s(-a - b^\top x)$$
where $s(\cdot)$ is the logistic function.
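A small sketch of the logistic function and the two class probabilities it induces. The coefficients `a` and `b` and the input `x` are illustrative values rather than fitted ones, and `posterior` is a hypothetical helper name.

```python
import math


def s(z):
    """Logistic function s(z) = 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def posterior(y, x, a, b):
    """p(Y = y | X = x) = s(y * (a + b^T x)) for y in {+1, -1}."""
    z = a + sum(bj * xj for bj, xj in zip(b, x))
    return s(y * z)


a, b = 0.5, [1.0, -2.0]   # illustrative coefficients, not fitted
x = [0.3, 0.1]
p_pos, p_neg = posterior(+1, x, a, b), posterior(-1, x, a, b)
log_odds = math.log(p_pos / p_neg)
print(p_pos + p_neg, log_odds)
```

The two probabilities sum to one, and the log-odds recover the linear function $a + b^\top x$.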
[Figure: the logistic function, plotted from −8 to 8, taking values between 0 and 1.]
Logistic Regression
Consider maximizing the conditional log likelihood:
$$\ell(a, b) = \sum_{i=1}^n \log p(Y = y_i \mid X = x_i) = -\sum_{i=1}^n \log(1 + \exp(-y_i(a + b^\top x_i)))$$
Equivalent to minimizing the empirical risk associated with the log loss:
$$R^{\mathrm{emp}}_{\log} = \frac{1}{n} \sum_{i=1}^n \log(1 + \exp(-y_i(a + b^\top x_i))) = -\frac{1}{n} \sum_{i=1}^n \log \underbrace{s(y_i(a + b^\top x_i))}_{\hat{p}_{a,b}(y_i \mid x_i)}$$
[Figure: log loss (log(1+exp(−y(a+b'x)))) and 0-1 loss (1/2+sign(−y(a+b'x))/2), plotted from −4 to 4.]
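The two losses compared on this slide can be tabulated as functions of the margin $m = y(a + b^\top x)$. This is only a sketch; the function names are assumptions, and the log loss is written in a form that avoids overflow for large negative margins.

```python
import math


def log_loss(margin):
    """log(1 + exp(-m)) for margin m = y * (a + b^T x), computed stably."""
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def zero_one_loss(margin):
    """1/2 + sign(-m)/2: 1 for a negative margin, 0 for a positive one, 1/2 at m = 0."""
    sign_neg_m = (margin < 0) - (margin > 0)
    return 0.5 + 0.5 * sign_neg_m


for m in [-4, -2, 0, 2, 4]:
    print(m, round(log_loss(m), 4), zero_one_loss(m))
```

Unlike the 0-1 loss, the log loss is smooth and convex in the margin, which is what makes the optimization on the following slides tractable.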
Not possible to find optimal $a, b$ analytically.
For simplicity, absorb $a$ as an entry in $b$ by appending '1' into $x$ vector.
Objective function:
$$R^{\mathrm{emp}}_{\log} = \frac{1}{n} \sum_{i=1}^n -\log s(y_i x_i^\top b)$$
Logistic Function
  $s(-z) = 1 - s(z)$
  $\nabla_z s(z) = s(z) s(-z)$
  $\nabla_z \log s(z) = s(-z)$
  $\nabla_z^2 \log s(z) = -s(z) s(-z)$
Differentiate wrt $b$:
$$\nabla_b R^{\mathrm{emp}}_{\log} = \frac{1}{n} \sum_{i=1}^n -s(-y_i x_i^\top b)\, y_i x_i$$
$$\nabla_b^2 R^{\mathrm{emp}}_{\log} = \frac{1}{n} \sum_{i=1}^n s(y_i x_i^\top b)\, s(-y_i x_i^\top b)\, x_i x_i^\top$$
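The gradient and Hessian expressions can be checked numerically and used in a plain gradient descent loop. The sketch below assumes NumPy, synthetic data, and a fixed step size of 1.0; none of these choices come from the slides.

```python
import numpy as np


def s(z):
    """Logistic function (adequate for moderate |z|)."""
    return 1.0 / (1.0 + np.exp(-z))


def risk(b, X, y):
    """Empirical log-loss risk: (1/n) sum_i -log s(y_i x_i^T b)."""
    return np.mean(np.logaddexp(0.0, -y * (X @ b)))


def grad(b, X, y):
    """(1/n) sum_i -s(-y_i x_i^T b) y_i x_i."""
    m = y * (X @ b)
    return -(X * (s(-m) * y)[:, None]).mean(axis=0)


def hess(b, X, y):
    """(1/n) sum_i s(m_i) s(-m_i) x_i x_i^T, with m_i = y_i x_i^T b."""
    m = y * (X @ b)
    w = s(m) * s(-m)
    return (X * w[:, None]).T @ X / len(y)


rng = np.random.default_rng(0)
n = 200
X = np.hstack([rng.normal(size=(n, 2)), np.ones((n, 1))])  # last column absorbs a
y = np.where(X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0, 1.0, -1.0)

b = np.zeros(3)
step = 1.0  # illustrative fixed step size
history = [risk(b, X, y)]
for _ in range(200):
    b = b - step * grad(b, X, y)
    history.append(risk(b, X, y))

# Central finite differences as a check of the gradient formula
eps = 1e-6
num_grad = np.array([(risk(b + eps * e, X, y) - risk(b - eps * e, X, y)) / (2 * eps)
                     for e in np.eye(3)])
grad_error = np.max(np.abs(num_grad - grad(b, X, y)))
print(history[0], history[-1], grad_error)
```

Starting from $b = 0$ the risk equals $\log 2$, since every example is assigned probability 1/2.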
Second derivative is positive-definite: objective function is convex and there is a single unique global minimum.
Many different algorithms can find optimal $b$, e.g.:
Gradient descent:
$$b^{\mathrm{new}} = b + \epsilon \frac{1}{n} \sum_{i=1}^n s(-y_i x_i^\top b)\, y_i x_i$$
Stochastic gradient descent:
$$b^{\mathrm{new}} = b + \epsilon_t \frac{1}{|I(t)|} \sum_{i \in I(t)} s(-y_i x_i^\top b)\, y_i x_i$$
where $I(t)$ is a subset of the data at iteration $t$, and $\epsilon_t \to 0$ slowly ($\sum_t \epsilon_t = \infty$, $\sum_t \epsilon_t^2 < \infty$).
Newton-Raphson:
$$b^{\mathrm{new}} = b - (\nabla_b^2 R^{\mathrm{emp}}_{\log})^{-1} \nabla_b R^{\mathrm{emp}}_{\log}$$
This is also called iterative reweighted least squares.
Conjugate gradient, LBFGS and other methods from numerical analysis.
Logistic Regression vs. LDA
Both have linear decision boundaries and model log-posterior odds as
$$\log \frac{p(Y = +1 \mid X = x)}{p(Y = -1 \mid X = x)} = a + b^\top x$$
LDA models the marginal density of $x$ as a Gaussian mixture with shared covariance
$$g(x) = \pi_{-1} \mathcal{N}(x; \mu_{-1}, \Sigma) + \pi_{+1} \mathcal{N}(x; \mu_{+1}, \Sigma)$$
and fits the parameters $\theta = (\mu_{-1}, \mu_{+1}, \pi_{-1}, \pi_{+1}, \Sigma)$ by maximizing joint likelihood $\prod_{i=1}^n p(x_i, y_i \mid \theta)$. $a$ and $b$ are then determined from $\theta$.
Logistic regression leaves the marginal density $g(x)$ as an arbitrary density function, and fits the parameters $a, b$ by maximizing the conditional likelihood $\prod_{i=1}^n p(y_i \mid x_i; a, b)$.
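To illustrate how $a$ and $b$ can be determined from $\theta$ in LDA, the sketch below fits the Gaussian parameters by maximum likelihood on synthetic two-class data and then reads off the log-posterior odds. It assumes NumPy; the data-generating means and covariance are made up. Expanding the two quadratic forms gives $b = \Sigma^{-1}(\mu_{+1} - \mu_{-1})$ and $a = \log(\pi_{+1}/\pi_{-1}) - \frac{1}{2}(\mu_{+1}^\top \Sigma^{-1} \mu_{+1} - \mu_{-1}^\top \Sigma^{-1} \mu_{-1})$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic two-class data with a shared covariance (made up for illustration)
mu_pos, mu_neg = np.array([1.0, 1.0]), np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
X_pos = rng.multivariate_normal(mu_pos, Sigma, size=300)
X_neg = rng.multivariate_normal(mu_neg, Sigma, size=200)

# Maximum likelihood estimates of theta = (means, priors, shared covariance)
n_pos, n_neg = len(X_pos), len(X_neg)
n = n_pos + n_neg
pi_pos, pi_neg = n_pos / n, n_neg / n
m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
R_pos, R_neg = X_pos - m_pos, X_neg - m_neg
S = (R_pos.T @ R_pos + R_neg.T @ R_neg) / n  # pooled covariance (MLE)

# Log-posterior odds a + b^T x implied by the fitted Gaussians
S_inv = np.linalg.inv(S)
b = S_inv @ (m_pos - m_neg)
a = np.log(pi_pos / pi_neg) - 0.5 * (m_pos @ S_inv @ m_pos - m_neg @ S_inv @ m_neg)

X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(n_pos), -np.ones(n_neg)])
pred = np.where(a + X @ b > 0, 1.0, -1.0)
accuracy = np.mean(pred == y)
print(a, b, accuracy)
```

Logistic regression would instead estimate $a$ and $b$ directly from the conditional likelihood, without modelling the class-conditional densities.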
Properties of logistic regression:
  Makes fewer modelling assumptions than generative classifiers.
  A simple example of a generalised linear model (GLM). Much statistical theory:
    assessment of fit via deviance and plots,
    interpretation of entries of $b$ as log odds-ratios,
    fitting categorical data (sometimes called multinomial logistic regression),
    well founded approaches to removing insignificant features (drop-in-deviance test, Wald test).
Model Complexity and Generalization
Generalization
Generalization ability: what is the out-of-sample error of learner $f$?
training error $\neq$ testing error.
We learn $f$ by minimizing some variant of empirical risk $R^{\mathrm{emp}}(f)$ - what can we say about the true risk $R(f)$?
Two important factors determining generalization ability:
  Model complexity
  Training data size
Learning Curves
[Figure: prediction error (training error and test error) against model complexity/flexibility, with regions marked underfit, just right and overfit.]
Fixed dataset size, varying model complexity.
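A curve of this kind can be produced with a small experiment. The sketch below assumes NumPy and synthetic data from a sine function with Gaussian noise, and uses polynomial degree as a stand-in for model complexity; all of these are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample(n):
    """Synthetic regression data: y = sin(2*pi*x) + Gaussian noise (assumed)."""
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)


x_train, y_train = sample(30)      # fixed, small training set
x_test, y_test = sample(1000)      # held-out data, not used for fitting

degrees = list(range(0, 10))
train_err, test_err = [], []
for d in degrees:
    # Least-squares polynomial fit = ERM under squared loss over degree-d polynomials
    coef = np.polynomial.polynomial.polyfit(x_train, y_train, d)
    fit_train = np.polynomial.polynomial.polyval(x_train, coef)
    fit_test = np.polynomial.polynomial.polyval(x_test, coef)
    train_err.append(np.mean((fit_train - y_train) ** 2))
    test_err.append(np.mean((fit_test - y_test) ** 2))

for d, tr, te in zip(degrees, train_err, test_err):
    print(d, round(tr, 4), round(te, 4))
```

Because the polynomial classes are nested, the training error cannot increase with the degree, whereas the test error has no such guarantee.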
[Figure: two panels of prediction error (training error and testing error) against training dataset size, each with an overfit region marked.]
Fixed model complexity, varying dataset size.
Two models: one simple, one complex. Which is which?
Bias-Variance Tradeoff
Where does the prediction error come from?
Example: Squared loss in regression: $\mathcal{X} = \mathbb{R}^p$, $\mathcal{Y} = \mathbb{R}$,
$$L(Y, f(X)) = (Y - f(X))^2$$
Optimal $f$ is the conditional mean:
$$f_*(x) = \mathbb{E}[Y \mid X = x]$$
Follows from $R(f) = \mathbb{E}_X \mathbb{E}\left[(Y - f(X))^2 \mid X\right]$ and
$$\mathbb{E}\left[(Y - f(X))^2 \mid X = x\right] = \mathbb{E}\left[Y^2 \mid X = x\right] - 2 f(x)\, \mathbb{E}[Y \mid X = x] + f(x)^2 = \mathrm{Var}[Y \mid X = x] + \left(\mathbb{E}[Y \mid X = x] - f(x)\right)^2.$$
Optimal risk is the intrinsic conditional variability of $Y$ (noise):
$$R(f_*) = \mathbb{E}_X[\mathrm{Var}[Y \mid X]]$$
Excess risk:
$$R(f) - R(f_*) = \mathbb{E}_X\left[(f(X) - f_*(X))^2\right]$$
How does the excess risk behave on average? Consider a randomly selected dataset $D = \{(X_i, Y_i)\}_{i=1}^n$ and $f^{(D)}$ trained on $D$ using a model (hypothesis class) $\mathcal{H}$.
$$\mathbb{E}_D\left[R(f^{(D)}) - R(f_*)\right] = \mathbb{E}_D \mathbb{E}_X\left[\left(f^{(D)}(X) - f_*(X)\right)^2\right] = \mathbb{E}_X \mathbb{E}_D\left[\left(f^{(D)}(X) - f_*(X)\right)^2\right].$$
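Denote $\bar{f}(x) = \mathbb{E}_D f^{(D)}(x)$ (average decision function over all possible datasets)
$$\mathbb{E}_D\left[\left(f^{(D)}(X) - f_*(X)\right)^2\right] = \underbrace{\mathbb{E}_D\left[\left(f^{(D)}(X) - \bar{f}(X)\right)^2\right]}_{\mathrm{Var}_X(\mathcal{H}, n)} + \underbrace{\left(\bar{f}(X) - f_*(X)\right)^2}_{\mathrm{Bias}^2_X(\mathcal{H}, n)}$$
Now,
$$\mathbb{E}_D R(f^{(D)}) = R(f_*) + \mathbb{E}_X \mathrm{Var}_X(\mathcal{H}, n) + \mathbb{E}_X \mathrm{Bias}^2_X(\mathcal{H}, n)$$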
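The decomposition can be approximated by simulation: draw many datasets, fit the same model on each, and average. The sketch below assumes NumPy, a sine regression function, uniform inputs, and cubic polynomials as the hypothesis class; these are illustrative assumptions, and the expectation over $X$ is replaced by an average over a grid.

```python
import numpy as np

rng = np.random.default_rng(0)


def f_star(x):
    """Assumed true regression function E[Y | X = x]."""
    return np.sin(2 * np.pi * x)


noise_sd, n, degree, n_datasets = 0.3, 20, 3, 500
x_grid = np.linspace(0.0, 1.0, 200)  # grid standing in for E_X with uniform X

preds = np.empty((n_datasets, x_grid.size))
for r in range(n_datasets):
    # One random dataset D and the learner f^(D) fitted on it
    x = rng.uniform(0.0, 1.0, size=n)
    y = f_star(x) + rng.normal(scale=noise_sd, size=n)
    coef = np.polynomial.polynomial.polyfit(x, y, degree)
    preds[r] = np.polynomial.polynomial.polyval(x_grid, coef)

f_bar = preds.mean(axis=0)                          # average learner over datasets
variance = ((preds - f_bar) ** 2).mean(axis=0)      # Var_X(H, n) at each grid point
bias_sq = (f_bar - f_star(x_grid)) ** 2             # Bias_X^2(H, n) at each grid point
excess = ((preds - f_star(x_grid)) ** 2).mean(axis=0)

print(variance.mean(), bias_sq.mean(), excess.mean())
```

With $\bar{f}$ estimated by the sample average, the identity excess = variance + bias squared holds exactly at every grid point, up to floating-point error.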
Where does the prediction error come from?
  Noise: Intrinsic difficulty of regression problem.
  Bias: How far away is the best learner in the model (average learner over all possible datasets) from the optimal one?
  Variance: How variable is our learning method if given different datasets?
Learning Curves
[Figure: two panels of prediction error (training error and testing error) against training dataset size, with bias, variance and overfit regions marked.]
Learning Curve
[Figure: prediction error (training error and test error) against model complexity/flexibility. Underfit: high bias, low variance. Just right. Overfit: low bias, high variance.]
Artificial Neural Networks
Biological inspiration
Basic computational elements: neurons.
Receives signals from other neurons via dendrites.
Sends processed signals via axons.
Axon-dendrite interactions at synapses.
$10^{10} - 10^{11}$ neurons.
$10^{14} - 10^{15}$ synapses.
Single Neuron Classifier
[Figure: single neuron with inputs $x^{(1)}, \dots, x^{(p)}$, weights $w_1, \dots, w_p$, bias $b$, summation $w^\top x + b$ and transfer function $s(\cdot)$, with output $P(Y = 1 \mid X = x)$.]
activation $w^\top x + b$ (linear in inputs $x$)
activation/transfer function $s$ gives the output/activity (potentially nonlinear in $x$)
common nonlinear activation function $s(a) = \frac{1}{1 + e^{-a}}$: logistic regression
learn $w$ and $b$ via gradient descent
[Figure: axes $x_{i1}$ and $x_{i2}$.]
Overfitting
[Figures: iterations 30, 80; iterations 500, 3000; iterations 10000, 40000. Figures from D. MacKay, Information Theory, Inference and Learning Algorithms.]
prevent overfitting by:
  early stopping: just halt the gradient descent
  regularization: L2-regularization called weight decay in neural networks literature.
Multilayer Networks
Data vectors $x_i \in \mathbb{R}^p$, binary labels $y_i \in \{0, 1\}$.
  inputs $x_{i1}, \dots, x_{ip}$
  output $\hat{y}_i = P(Y = 1 \mid X = x_i)$
  hidden unit activities $h_{i1}, \dots, h_{im}$
Compute hidden unit activities:
$$h_{il} = s\left(b^h_l + \sum_{j=1}^p w^h_{jl} x_{ij}\right)$$
Compute output probability:
$$\hat{y}_i = s\left(b^o + \sum_{l=1}^m w^o_l h_{il}\right)$$
[Figure: network with inputs $x_{i1}, \dots, x_{i4}$, hidden units $h_{i1}, \dots, h_{i7}$ and output $\hat{y}_i$.]
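The two computations on this slide amount to two matrix products followed by the logistic function. A minimal sketch, assuming NumPy and randomly initialised parameters; the array names and sizes are illustrative, with 4 inputs and 7 hidden units chosen to match the figure.

```python
import numpy as np


def s(z):
    """Entrywise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def forward(X, b_h, W_h, b_o, w_o):
    """Forward pass. Shapes: X (n, p), W_h (p, m), b_h (m,), w_o (m,), b_o scalar."""
    H = s(b_h + X @ W_h)        # hidden unit activities h_il, shape (n, m)
    y_hat = s(b_o + H @ w_o)    # output probabilities, shape (n,)
    return H, y_hat


rng = np.random.default_rng(0)
n, p, m = 5, 4, 7
X = rng.normal(size=(n, p))
b_h, W_h = np.zeros(m), rng.normal(scale=0.5, size=(p, m))
b_o, w_o = 0.0, rng.normal(scale=0.5, size=m)

H, y_hat = forward(X, b_h, W_h, b_o, w_o)
print(H.shape, y_hat)
```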
[Figure: axes $x_{i1}$ and $x_{i2}$.]
Training a Neural Network
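Objective function: L2-regularized log-loss
$$J = -\sum_{i=1}^n \left[y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)\right] + \frac{\lambda}{2}\left(\sum_{jl} (w^h_{jl})^2 + \sum_l (w^o_l)^2\right)$$
where
$$\hat{y}_i = s\left(b^o + \sum_{l=1}^m w^o_l h_{il}\right), \qquad h_{il} = s\left(b^h_l + \sum_{j=1}^p w^h_{jl} x_{ij}\right).$$
Optimize parameters $\theta = \{b^h, w^h, b^o, w^o\}$, where $b^h \in \mathbb{R}^m$, $w^h \in \mathbb{R}^{p \times m}$, $b^o \in \mathbb{R}$, $w^o \in \mathbb{R}^m$, with gradient descent.
$$\frac{\partial J}{\partial w^o_l} = \lambda w^o_l + \sum_{i=1}^n \frac{\partial J}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial w^o_l} = \lambda w^o_l + \sum_{i=1}^n (\hat{y}_i - y_i) h_{il},$$
$$\frac{\partial J}{\partial w^h_{jl}} = \lambda w^h_{jl} + \sum_{i=1}^n \frac{\partial J}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial h_{il}} \frac{\partial h_{il}}{\partial w^h_{jl}} = \lambda w^h_{jl} + \sum_{i=1}^n (\hat{y}_i - y_i) w^o_l h_{il}(1 - h_{il}) x_{ij}.$$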
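The two weight gradients can be written in vectorised form and compared against finite differences of $J$. This is a sketch assuming NumPy, random data and random parameters; the helper names (`objective`, `weight_gradients`, `finite_difference`) are hypothetical.

```python
import numpy as np


def s(z):
    return 1.0 / (1.0 + np.exp(-z))


def objective(params, X, y, lam):
    """L2-regularized log-loss J for a one-hidden-layer network."""
    b_h, W_h, b_o, w_o = params
    H = s(b_h + X @ W_h)
    y_hat = s(b_o + H @ w_o)
    log_lik = np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return -log_lik + 0.5 * lam * (np.sum(W_h ** 2) + np.sum(w_o ** 2))


def weight_gradients(params, X, y, lam):
    """Gradients of J with respect to the hidden and output weights."""
    b_h, W_h, b_o, w_o = params
    H = s(b_h + X @ W_h)
    y_hat = s(b_o + H @ w_o)
    delta_o = y_hat - y                               # (n,)
    grad_w_o = lam * w_o + H.T @ delta_o              # (m,)
    delta_h = delta_o[:, None] * w_o * H * (1 - H)    # (n, m)
    grad_W_h = lam * W_h + X.T @ delta_h              # (p, m)
    return grad_W_h, grad_w_o


rng = np.random.default_rng(0)
n, p, m, lam = 20, 3, 4, 0.1
X = rng.normal(size=(n, p))
y = (rng.uniform(size=n) < 0.5).astype(float)
params = [rng.normal(size=m), rng.normal(size=(p, m)), 0.2, rng.normal(size=m)]
grad_W_h, grad_w_o = weight_gradients(params, X, y, lam)


def finite_difference(array_pos, index, eps=1e-6):
    """Central difference of J with respect to one entry of params[array_pos]."""
    plus = [np.array(q, dtype=float) for q in params]
    minus = [np.array(q, dtype=float) for q in params]
    plus[array_pos][index] += eps
    minus[array_pos][index] -= eps
    return (objective(plus, X, y, lam) - objective(minus, X, y, lam)) / (2 * eps)


err_W_h = abs(finite_difference(1, (1, 2)) - grad_W_h[1, 2])
err_w_o = abs(finite_difference(3, (3,)) - grad_w_o[3])
print(err_W_h, err_w_o)
```

The bias gradients are omitted here because the slide does not list them.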
L2-regularization often called weight decay.
Multiple hidden layers: Backpropagation algorithm
Multiple hidden layers
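$$\hat{y}_i = h_i^{L+1}, \qquad h_i^{\ell+1} = s\left(W^{\ell+1} h_i^{\ell}\right)$$
$W^{\ell+1} = (w^{\ell+1}_{jk})_{jk}$: weight matrix at the $(\ell+1)$-th layer, weight $w^{\ell}_{jk}$ on the edge between $h^{\ell-1}_{ik}$ and $h^{\ell}_{ij}$
$s$: entrywise (logistic) transfer function
[Figure: network with hidden layers $h^1_{i1}, \dots, h^1_{im}$ through $h^L_{i1}, \dots, h^L_{im}$.]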
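The layer recursion $h_i^{\ell+1} = s(W^{\ell+1} h_i^{\ell})$ can be sketched as a simple loop over weight matrices. The code assumes NumPy, takes the input vector as $h^0 = x$ (an assumption, since the slide does not state the base case), and uses illustrative layer sizes and random weights.

```python
import numpy as np


def s(z):
    """Entrywise logistic transfer function."""
    return 1.0 / (1.0 + np.exp(-z))


def deep_forward(x, weights):
    """Apply h^{l+1} = s(W^{l+1} h^l) layer by layer, starting from h^0 = x.

    Biases are omitted, as in the recursion on the slide.
    Returns the list of activations [h^0, h^1, ..., h^{L+1}].
    """
    h = x
    activations = [h]
    for W in weights:
        h = s(W @ h)
        activations.append(h)
    return activations


rng = np.random.default_rng(0)
sizes = [4, 6, 6, 1]  # input, two hidden layers, single output (illustrative sizes)
weights = [rng.normal(scale=0.5, size=(sizes[k + 1], sizes[k]))
           for k in range(len(sizes) - 1)]
x = rng.normal(size=sizes[0])

activations = deep_forward(x, weights)
y_hat = activations[-1][0]
print([a.shape for a in activations], y_hat)
```

Keeping every intermediate activation, rather than only the output, is what a backpropagation pass would need in order to reuse them when computing gradients.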