Statistical Machine Learning: Introduction
Dino Sejdinovic, Department of Statistics, University of Oxford
22-24 June 2015, Novi Sad
Slides available at: http://www.stats.ox.ac.uk/~sejdinov/talks.html

What is Machine Learning?

Arthur Samuel, 1959: Field of study that gives computers the ability to learn without being explicitly programmed.

Tom Mitchell, 1997: Any computer program that improves its performance at some task through experience.

Kevin Murphy, 2012: To develop methods that can automatically detect patterns in data, and then to use the uncovered patterns to predict future data or other outcomes of interest.

[Figure: pipeline from data to information, structure, prediction, decisions, and actions; source: http://gureckislab.org]

Larry Page, about DeepMind's ML systems that can learn to play video games like humans.

[Figure: word cloud of the fields surrounding Machine Learning: statistics, computer science, business, finance, cognitive science, biology, genetics, psychology, physics, mathematics, engineering, operations research.]

What is Data Science?

"'Data Scientists' Meld Statistics and Software" (WSJ article)

Information Revolution

Traditional statistical inference problems:
- A well-formulated question that we would like to answer.
- Expensive data gathering and/or expensive computation.
- Specially designed experiments to collect high-quality data.

The information revolution:
- Improvements in data processing and data storage.
- Powerful, cheap, and easy data capturing.
- Lots of (low-quality) data with potentially valuable information inside.
Statistics and Machine Learning in the Age of Big Data

- ML is becoming a thorough blending of computer science, engineering, and statistics
- a unified framework of data, inferences, procedures, algorithms
- statistics taking computation seriously
- computing taking statistical risk seriously
- scale and granularity of data
- personalization, societal and business impact
- multidisciplinarity: and you are the interdisciplinary glue
- it's just getting started

Michael Jordan: On the Computational and Statistical Interface and "Big Data"

Applications of Machine Learning

Spam filtering, recommendation systems, fraud detection, self-driving cars, stock market analysis, image recognition.

ImageNet: Krizhevsky et al, 2012; Machine Learning is Eating the World: Forbes article

Types of Machine Learning

Supervised learning
- Data contains "labels": every example is an input-output pair.
- classification, regression
- Goal: prediction on new examples.

Unsupervised learning
- Extract key features of the "unlabelled" data.
- clustering, signal separation, density estimation
- Goal: representation, hypothesis generation, visualization.

Semi-supervised learning: a database of examples, only a small subset of which are labelled.

Multi-task learning: a database of examples, each of which has multiple labels corresponding to different prediction tasks.

Reinforcement learning: an agent acting in an environment, given rewards for performing appropriate actions, learns to maximize its reward.

Supervised Learning

We observe the input-output pairs $\{(x_i, y_i)\}_{i=1}^n$, with $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$.

Types of supervised learning:
- Classification: discrete responses, e.g. $\mathcal{Y} = \{+1, -1\}$ or $\{1, \ldots, K\}$.
- Regression: a numerical value is observed, and $\mathcal{Y} = \mathbb{R}$.

The goal is to accurately predict the response $Y$ on new observations of $X$, i.e., to learn a function $f : \mathbb{R}^p \to \mathcal{Y}$ such that $f(X)$ will be close to the true response $Y$.

Loss function

Suppose we made a prediction $\hat{Y} = f(X) \in \mathcal{Y}$ based on an observation of $X$. How good is the prediction? We can use a loss function $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ to formalize the quality of the prediction.

Typical loss functions:
- Misclassification loss (or 0-1 loss) for classification:
$$L(Y, f(X)) = \begin{cases} 0 & f(X) = Y \\ 1 & f(X) \neq Y. \end{cases}$$
- Squared loss for regression:
$$L(Y, f(X)) = (f(X) - Y)^2.$$

Many other choices are possible, e.g., weighted misclassification loss. In classification, the vector of estimated probabilities $(\hat{p}(k))_{k \in \mathcal{Y}}$ can be returned, in which case the log-likelihood loss (or log loss) $L(Y, \hat{p}) = -\log \hat{p}(Y)$ is often used.

Risk

The paired observations $\{(x_i, y_i)\}_{i=1}^n$ are viewed as i.i.d. realizations of a random variable $(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$ with joint distribution $P_{XY}$.

For a given loss function $L$, the risk $R$ of a learned function $f$ is given by the expected loss
$$R(f) = \mathbb{E}_{P_{XY}}[L(Y, f(X))],$$
where the expectation is with respect to the true (unknown) joint distribution of $(X, Y)$.

The risk is unknown, but we can compute the empirical risk:
$$R_n(f) = \frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i)).$$
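To make the definitions concrete, here is a minimal NumPy sketch (our own illustration, not from the slides; the helper names and toy labels are hypothetical) that evaluates the empirical risk for the 0-1 loss and the squared loss:

```python
import numpy as np

def zero_one_loss(y, y_hat):
    """Misclassification (0-1) loss: 0 if the prediction is correct, 1 otherwise."""
    return (y != y_hat).astype(float)

def squared_loss(y, y_hat):
    """Squared loss for regression: (f(X) - Y)^2."""
    return (y_hat - y) ** 2

def empirical_risk(loss, y, y_hat):
    """R_n(f) = (1/n) * sum_i L(y_i, f(x_i)): the average loss over the sample."""
    return np.mean(loss(y, y_hat))

# Toy usage with made-up labels and predictions:
y_true = np.array([1, -1, 1, 1, -1])
y_pred = np.array([1, 1, 1, -1, -1])
print(empirical_risk(zero_one_loss, y_true, y_pred))  # 0.4: two of five misclassified

y_reg = np.array([0.5, 2.0, -1.0])
f_reg = np.array([0.0, 2.5, -1.0])
print(empirical_risk(squared_loss, y_reg, f_reg))  # (0.25 + 0.25 + 0) / 3 = 0.1667
```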
The Bayes Classifier

What is the optimal classifier if the joint distribution of $(X, Y)$ were known?

The density $g$ of $X$ can be written as a mixture of $K$ components (corresponding to each of the classes):
$$g(x) = \sum_{k=1}^K \pi_k g_k(x),$$
where, for $k = 1, \ldots, K$, $\pi_k = P(Y = k)$ are the class probabilities and $g_k(x)$ is the conditional density of $X$ given $Y = k$.

The Bayes classifier $f_{\mathrm{Bayes}} : \mathcal{X} \to \{1, \ldots, K\}$ is the one with minimum risk:
$$R(f) = \mathbb{E}[L(Y, f(X))] = \mathbb{E}_X \mathbb{E}_{Y|X}[L(Y, f(X)) \mid X] = \int_{\mathcal{X}} \mathbb{E}[L(Y, f(X)) \mid X = x] \, g(x) \, dx.$$
The minimum risk attained by the Bayes classifier is called the Bayes risk. Since the risk integrates the conditional expected loss over $x$, it suffices to minimize $\mathbb{E}[L(Y, f(X)) \mid X = x]$ separately for each $x$.

Consider the 0-1 loss. The conditional risk simplifies to
$$\mathbb{E}[L(Y, f(X)) \mid X = x] = \sum_{k=1}^K L(k, f(x)) \, P(Y = k \mid X = x) = 1 - P(Y = f(x) \mid X = x).$$
The risk is therefore minimized by choosing the class with the greatest posterior probability:
$$f_{\mathrm{Bayes}}(x) = \arg\max_{k=1,\ldots,K} P(Y = k \mid X = x) = \arg\max_{k=1,\ldots,K} \frac{\pi_k g_k(x)}{\sum_{j=1}^K \pi_j g_j(x)} = \arg\max_{k=1,\ldots,K} \pi_k g_k(x).$$
The functions $x \mapsto \pi_k g_k(x)$ are called discriminant functions. The discriminant function with maximum value determines the predicted class of $x$.

The Bayes Classifier: Example

A simple two-Gaussians example: suppose $X \sim \mathcal{N}(\mu_Y, 1)$, where $\mu_1 = -1$ and $\mu_2 = 1$, and assume equal priors $\pi_1 = \pi_2 = 1/2$. Then
$$g_1(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x+1)^2}{2}\right) \quad \text{and} \quad g_2(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x-1)^2}{2}\right).$$

[Figure: the conditional densities $g_1$, $g_2$ (left) and the marginal density $g$ (right), plotted over $x \in [-3, 3]$.]

The optimal classification is
$$f_{\mathrm{Bayes}}(x) = \arg\max_{k=1,\ldots,K} \pi_k g_k(x) = \begin{cases} 1 & \text{if } x < 0, \\ 2 & \text{if } x \geq 0. \end{cases}$$

How do you classify a new observation $x$ if the standard deviation is still 1 for class 1, but $1/3$ for class 2?

[Figure: the two conditional densities on a linear scale (left) and on a log scale (right), over $x \in [-3, 3]$.]

Looking at the densities on a log scale, the optimal classification is to select class 2 if and only if $x \in [0.34, 2.16]$.
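That interval can be checked numerically. Below is a small sketch (our own illustration, assuming NumPy and SciPy are available) that evaluates the two discriminant functions $\pi_k g_k(x)$ on a fine grid; one could equally solve the quadratic equation $\log(\pi_2 g_2(x)) = \log(\pi_1 g_1(x))$ in closed form.

```python
import numpy as np
from scipy.stats import norm

# Two-class setup from the example: X | Y=1 ~ N(-1, 1), X | Y=2 ~ N(1, (1/3)^2),
# with equal priors pi_1 = pi_2 = 1/2.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sd = np.array([1.0, 1.0 / 3.0])

def bayes_classifier(x):
    """Choose the class with the largest discriminant pi_k * g_k(x)."""
    # Work on the log scale (as in the log-density plot) for numerical stability.
    log_disc = np.log(pi) + norm.logpdf(np.asarray(x, dtype=float)[:, None], loc=mu, scale=sd)
    return np.argmax(log_disc, axis=1) + 1  # classes are labelled 1 and 2

xs = np.linspace(-3.0, 3.0, 60001)               # grid with spacing 1e-4
in_class2 = bayes_classifier(xs) == 2
print(xs[in_class2].min(), xs[in_class2].max())  # ~0.335 and ~2.165
```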
Plug-in Classification

The Bayes classifier chooses the class with the greatest posterior probability,
$$f_{\mathrm{Bayes}}(x) = \arg\max_{k=1,\ldots,K} \pi_k g_k(x),$$
but we know neither the conditional densities $g_k$ nor the class probabilities $\pi_k$! The plug-in classifier chooses the class
$$f(x) = \arg\max_{k=1,\ldots,K} \hat{\pi}_k \hat{g}_k(x),$$
where we plug in estimates $\hat{\pi}_k$ of $\pi_k$ and estimates $\hat{g}_k(x)$ of the conditional densities, for $k = 1, \ldots, K$.

Linear Discriminant Analysis

LDA is the simplest example of plug-in classification. Assume a multivariate normal conditional density $g_k(x)$ for each class $k$:
$$X \mid Y = k \sim \mathcal{N}(\mu_k, \Sigma), \qquad g_k(x) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k)\right),$$
i.e., each class can have a different mean $\mu_k$, but all classes share the same covariance $\Sigma$.

For an observation $x$, the $k$-th log-discriminant function is
$$\log\left(\pi_k g_k(x)\right) = \mathrm{const} + \log \pi_k - \frac{1}{2}(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k) = \mathrm{const} + a_k + b_k^\top x,$$
so linear log-discriminant functions correspond to linear decision boundaries.

The quantity $(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k)$ is the squared Mahalanobis distance between $x$ and $\mu_k$. If $\Sigma = I_p$ and $\pi_k = \frac{1}{K}$, LDA simply chooses the class $k$ with the nearest (in the Euclidean sense) class mean.

Iris Dataset

[Figure: scatter plot of the Iris dataset, Petal.Width (0.5-2.5) against Petal.Length (1-7), as an example dataset for LDA; see the sketch at the end of this section.]

Generative vs Discriminative Learning

Generative learning: find parameters which explain all the available data.
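As a closing illustration of plug-in classification, the following sketch fits LDA to the two petal features of the Iris data plotted above. It assumes scikit-learn is available; the fitted attributes correspond to the plug-in quantities $\hat{\pi}_k$, $\hat{\mu}_k$, and $\hat{\Sigma}$ from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X = iris.data[:, 2:4]   # Petal.Length and Petal.Width, the two features in the figure
y = iris.target

# Plug-in classification: estimate the class priors pi_k, the class means mu_k,
# and a shared covariance Sigma, then classify by the largest discriminant function.
lda = LinearDiscriminantAnalysis(store_covariance=True)
lda.fit(X, y)

print(lda.priors_)      # estimated pi_k (1/3 each: the three classes are balanced)
print(lda.means_)       # estimated class means mu_k
print(lda.covariance_)  # estimated shared covariance Sigma
print(lda.score(X, y))  # training accuracy of the plug-in classifier
```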