Modern Methods of Statistical Learning sf2935
Lecture 1: Introduction to Learning Theory
Timo Koski


2018-08-28

Overview of the Lecture

The methods of statistical learning provide a heterogeneous collection of points of view, tasks, algorithms, models and probability distributions. This lecture introduces a few overarching concepts and presents the historically first instance of learning theory, the perceptron algorithm.

Your Learning Outcomes

Supervised learning
Bias-Variance Trade-off

E-learning

Statistical Learning = Machine Learning + Statistics

Machine Learning means the development of algorithms and techniques that allow computers to learn to
1. record observations about a phenomenon
2. build a model of this phenomenon
3. predict the future of this phenomenon

Statistics gives
- a formal definition of machine learning
- some guarantees of expected results
- suggestions for new or improved modelling tools

Statistical Learning = Machine Learning + Statistics

Machine Learning: a phenomenon is recorded via observations $\{z_i\}_{i=1}^n$ with $z_i \in \mathcal{Z}$. There are two generic situations:
1. Unsupervised learning: no predefined structure in $\mathcal{Z}$. The goal is to find some structure: clusters, association rules, estimates of probability distributions/densities.
2. Supervised learning: $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$; the components are not exchangeable, $z = (x, y) \in \mathcal{X} \times \mathcal{Y}$. Modelling: finding how $x$ and $y$ are related. The goal is to make predictions: given $x$, find a reasonable value for $y$ such that $z = (x, y)$ is compatible with the phenomenon.

Unsupervised learning

Learning what normally happens (no teacher)
No output
Clustering: grouping similar instances
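As a concrete illustration, here is a minimal sketch of clustering with k-means. The data, the three cluster centres, and the choice of k = 3 are made up for the example; scikit-learn's KMeans is used, but any clustering routine would do.

import numpy as np
from sklearn.cluster import KMeans

# Made-up unlabelled data: 300 points in the plane around three centres.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=centre, scale=0.5, size=(100, 2))
    for centre in ([0, 0], [4, 0], [2, 3])
])

# k-means groups similar instances; note that no outputs y are involved.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])      # cluster index assigned to each point
print(kmeans.cluster_centers_)  # estimated cluster centres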

Other types

Semi-Supervised Learning: a class of supervised learning techniques that also makes use of a large amount of unlabelled data $\{z_i\}_{i=1}^{n_1}$ for training, together with a typically small amount of labelled data $\{(x_i, y_i)\}_{i=1}^{n_2}$. Semi-supervised learning falls between unsupervised learning and supervised learning.

Active learning is said to be a special case of semi-supervised machine learning in which a learning algorithm is able to interactively query some information source to obtain the desired outputs at new data points. In statistics it is sometimes also called optimal experimental design.

Other types

Reinforcement Learning is (Wikipedia) an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.

Learning by Doing

Supervised learning

Several types of $\mathcal{Y}$ can be considered:
1. $\mathcal{Y} = \{1, 2, \ldots, q\}$: classification into $q$ classes, e.g., to tell whether a patient has a certain kind of disease.
2. $\mathcal{Y} = \mathbb{R}^q$: regression, e.g., to calculate the price $y$ of a house based on some characteristics $x$ (like the neighbourhood, year of construction, architectural style).
3. $\mathcal{Y}$ is something complex but structured.

The modelling difficulty increases with the complexity of $\mathcal{Y}$.
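A minimal sketch of the first two cases on made-up data (scikit-learn's LogisticRegression and LinearRegression are used purely for illustration; the thresholds and coefficients are invented):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=(200, 1))

# Case 1, classification: Y = {0, 1}, e.g. healthy / diseased.
y_class = (x[:, 0] + rng.normal(scale=1.0, size=200) > 5).astype(int)
clf = LogisticRegression().fit(x, y_class)
print(clf.predict([[2.0], [8.0]]))   # predicted class labels

# Case 2, regression: Y = R, e.g. the price of a house.
y_reg = 3.0 * x[:, 0] + rng.normal(scale=2.0, size=200)
reg = LinearRegression().fit(x, y_reg)
print(reg.predict([[2.0], [8.0]]))   # predicted real values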

Modelling: f from X to Y

Given a data set or training set $\{(x_i, y_i)\}_{i=1}^n$, a machine learning method builds up a machine $h$ from $\mathcal{X}$ to $\mathcal{Y}$.

Example: the conditional expectation or regression function:

$$h(x) = E(Y \mid X = x)$$

where $Y$ and $X$ are random variables with values in $\mathcal{Y}$ and $\mathcal{X}$, respectively.
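A standard fact, included here for completeness (it is not on the slide): the regression function minimizes the expected squared prediction error among all predictors $g$, because the cross term vanishes when $h(X) = E(Y \mid X)$:

$$
\begin{aligned}
E\big[(Y - g(X))^2\big]
&= E\big[(Y - h(X))^2\big] + 2\,E\big[(Y - h(X))(h(X) - g(X))\big] + E\big[(h(X) - g(X))^2\big] \\
&= E\big[(Y - h(X))^2\big] + E\big[(h(X) - g(X))^2\big] \;\geq\; E\big[(Y - h(X))^2\big],
\end{aligned}
$$

since $E\big[(Y - h(X))(h(X) - g(X))\big] = E\big[(h(X) - g(X))\,E[Y - h(X) \mid X]\big] = 0$.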

A Good Model should be such that for all $i = 1, \ldots, n$

$$h(x_i) \approx y_i.$$

In statistics, one often chooses to measure the quality of the modelling by the Mean Square Error (assuming $\mathcal{Y} \subseteq \mathbb{R}$):

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - h(x_i)\big)^2.$$

Modelling: f from X to Y

The MSE in

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - h(x_i)\big)^2$$

is computed using the training data that was used to fit the model, and so should more accurately be referred to as the training MSE. In general, we do not really care how well the method works on the training data. Rather, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data. This is called the generalization or test error.
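A minimal numerical sketch of this distinction, under an assumed data-generating process $y = \sin(x) + e$ and an illustrative polynomial model (numpy only; all names and constants are made up):

import numpy as np

rng = np.random.default_rng(2)

def f(x):                        # assumed true regression function
    return np.sin(x)

# Training data (used to fit the model) and test data (previously unseen).
x_train = rng.uniform(0, 6, 50)
y_train = f(x_train) + rng.normal(scale=0.3, size=50)
x_test = rng.uniform(0, 6, 1000)
y_test = f(x_test) + rng.normal(scale=0.3, size=1000)

# Fit a flexible model h: here a degree-9 polynomial.
h = np.poly1d(np.polyfit(x_train, y_train, deg=9))

train_mse = np.mean((y_train - h(x_train)) ** 2)
test_mse = np.mean((y_test - h(x_test)) ** 2)
print(train_mse, test_mse)       # the test MSE is typically the larger one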

Bias-Variance Trade-Off

Suppose that we have a training set consisting of points $x_1, \ldots, x_n$ and real values $y_i$ associated with each point $x_i$. We assume that there is a functional, but noisy relation

$$y_i = f(x_i) + e,$$

where the noise $e$ has zero mean and variance $\sigma^2$. We want to find a function $\hat{f}(x)$ that approximates the true function $y = f(x)$ as well as possible, by means of some learning algorithm. We make "as well as possible" precise by measuring the MSE between $y$ and $h = \hat{f}(x)$, which we want to be minimal both for $x_1, \ldots, x_n$ AND for points outside of our sample. Of course, we cannot hope to do so perfectly, since the $y_i$ contain noise $e$. This means we must be prepared to accept an irreducible error in any function we come up with.

Bias-Variance Decomposition

Finding an $\hat{f}$ that generalizes to points outside of the training set can be done with any of the countless algorithms used for supervised learning. It turns out that whichever function $\hat{f}$ we select, we can decompose its expected error on an unseen sample $x$ as follows:

$$E\Big[\big(y - \hat{f}(x)\big)^2\Big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2 \qquad (1)$$

where:

$$\mathrm{Bias}\big[\hat{f}(x)\big] = E\big[\hat{f}(x)\big] - f(x) \qquad (2)$$

and

$$\mathrm{Var}\big[\hat{f}(x)\big] = E\Big[\big(\hat{f}(x) - E[\hat{f}(x)]\big)^2\Big] \qquad (3)$$
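For completeness, a sketch of the standard derivation, using $E[e] = 0$, $E[e^2] = \sigma^2$, the fact that $f(x)$ is deterministic, and the independence of the noise $e$ from $\hat{f}(x)$ (whose randomness comes from the training set):

$$
\begin{aligned}
E\Big[\big(y - \hat{f}(x)\big)^2\Big]
&= E\Big[\big(f(x) + e - \hat{f}(x)\big)^2\Big] \\
&= E\Big[\big(f(x) - \hat{f}(x)\big)^2\Big] + 2\,E\Big[e\,\big(f(x) - \hat{f}(x)\big)\Big] + E[e^2] \\
&= \big(f(x) - E[\hat{f}(x)]\big)^2 + E\Big[\big(\hat{f}(x) - E[\hat{f}(x)]\big)^2\Big] + \sigma^2 \\
&= \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2.
\end{aligned}
$$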

Bias-Variance Trade-Off

The expectation ranges over different choices of the training set $x_1, \ldots, x_n, y_1, \ldots, y_n$, all sampled from the same distribution. The three terms represent:
- the square of the bias of the learning method, which can be thought of as the error caused by the simplifying assumptions built into the method. E.g., when approximating a non-linear function $f(x)$ using a learning method for linear models, there will be error in the estimates $\hat{f}(x)$ due to this assumption;
- the variance of the learning method, or, intuitively, how much the learning method $\hat{f}(x)$ will move around its mean;
- the irreducible error $\sigma^2$.
Since all three terms are non-negative, the irreducible error forms a lower bound on the expected error on unseen samples.

Bias-Variance Trade-Off

If the model is too simple, the solution it yields is biased and does not fit the data. If the model is too complex, it is too sensitive to small variations in the data.
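A minimal simulation of this effect, under assumed choices $f(x) = \sin x$, noise level $\sigma = 0.3$, and polynomial models of increasing degree (numpy only; the constants are illustrative). Repeatedly redrawing the training set and refitting lets us estimate the squared bias and the variance directly:

import numpy as np

rng = np.random.default_rng(3)
f = np.sin                       # assumed true function
sigma = 0.3                      # assumed noise level
x_grid = np.linspace(0, 6, 100)  # fixed points at which f-hat is evaluated

def fit_predict(degree):
    """Draw a fresh training set, fit a polynomial f-hat, evaluate on x_grid."""
    x = rng.uniform(0, 6, 30)
    y = f(x) + rng.normal(scale=sigma, size=30)
    return np.poly1d(np.polyfit(x, y, deg=degree))(x_grid)

for degree in (1, 3, 9):
    preds = np.array([fit_predict(degree) for _ in range(500)])
    bias2 = np.mean((preds.mean(axis=0) - f(x_grid)) ** 2)  # squared bias, averaged over x
    var = np.mean(preds.var(axis=0))                        # variance, averaged over x
    print(f"degree {degree}: bias^2 = {bias2:.3f}, variance = {var:.3f}")

A too simple model (degree 1) shows high bias and low variance; a too complex one (degree 9) shows low bias and higher variance.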

Classification

In classification, one measures the quality of a model by the training error rate

$$\frac{1}{n} \sum_{i=1}^{n} I\big(y_i \neq h(x_i)\big),$$

where $I(y_i \neq h(x_i)) = 1$ if $y_i \neq h(x_i)$ and $0$ otherwise. A bias-variance trade-off can be established here, too.
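The training error rate in code (a trivial sketch; the label vectors are made up):

import numpy as np

y = np.array([0, 1, 1, 0, 1, 0])      # true classes y_i
y_hat = np.array([0, 1, 0, 0, 1, 1])  # classifier outputs h(x_i)

# Mean of I(y_i != h(x_i)) over the training set.
train_error_rate = np.mean(y != y_hat)
print(train_error_rate)               # 2 errors out of 6 = 0.333...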

Inductive Learning

Inductive Learning means learning a general rule from a finite number of cases.

Google DeepMind's AlphaGo algorithm masters ancient game of Go. Deep-learning software defeats human professional for first time. Nature, 27 January 2016

The AlphaGo program applied deep learning in neural networks: brain-inspired programs in which connections between layers of simulated neurons are strengthened through examples and experience. It first studied 30 million positions from expert games, gleaning abstract information on the state of play from board data, much as other programmes categorize images from pixels. Then it played against itself across 50 computers, improving with each iteration, a technique known as reinforcement learning.

Visions

"the embryo of an electronic computer that will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." (The New York Times, 1958, reporting on Rosenblatt's perceptron)

Hannes Alfvén: "(Man's) real greatness lies in being the living creature that was intelligent enough to realize that the goal of (evolution) is the computer." (p. 17 in H. Alfvén, Sagan om den stora datamaskinen)

Concern for a Frankenstein-like world, where machines are programmed so perfectly that one day they surpass human intellect.
