COMPUTATIONAL STATISTICS

Luca Bortolussi

Department of Mathematics and Geosciences University of Trieste

Office 238, third floor, H2bis [email protected]

Trieste, Winter Semester 2015/2016

MARGINAL LIKELIHOOD

The marginal likelihood p(t | α, β), appearing in the denominator of Bayes' theorem, can be used to identify good values of α and β, known as hyperparameters. Intuitively, we can place a prior distribution over α and β, compute their posterior, and use it in a fully Bayesian treatment of the regression:

\[
p(\alpha, \beta \mid \mathbf{t}) \propto p(\mathbf{t} \mid \alpha, \beta)\, p(\alpha, \beta)
\]

If we assume the posterior is peaked around its mode, we can take the MAP estimate as an approximation of the full posterior over α and β. If the prior is flat, this boils down to the maximum likelihood solution.

MARGINAL LIKELIHOOD

Hence we need to optimise the marginal likelihood, which can be computed as:

\[
\log p(\mathbf{t} \mid \alpha, \beta) = \frac{M}{2}\log\alpha + \frac{N}{2}\log\beta - E(\mathbf{m}_N) - \frac{1}{2}\log\lvert \mathbf{S}_N^{-1}\rvert - \frac{N}{2}\log 2\pi
\]

with

\[
E(\mathbf{m}_N) = \frac{\beta}{2}\,\lVert \mathbf{t} - \boldsymbol{\Phi}\mathbf{m}_N \rVert^2 + \frac{\alpha}{2}\,\mathbf{m}_N^T \mathbf{m}_N
\]

This optimisation problem can be solved with any optimisation routine, or with specialised methods (see Bishop).
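As an illustration, here is a minimal NumPy/SciPy sketch that evaluates this log marginal likelihood and maximises it over α and β with a general-purpose optimiser; the polynomial basis, the toy data and the optimiser choice are illustrative assumptions, not part of the slides.

```python
import numpy as np
from scipy.optimize import minimize

def log_marginal_likelihood(log_alpha, log_beta, Phi, t):
    """Log evidence log p(t | alpha, beta) for Bayesian linear regression,
    parameterised on the log scale so that alpha and beta stay positive."""
    alpha, beta = np.exp(log_alpha), np.exp(log_beta)
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi        # A = S_N^{-1}
    m_N = beta * np.linalg.solve(A, Phi.T @ t)        # posterior mean
    E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    _, logdet_A = np.linalg.slogdet(A)
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * logdet_A - 0.5 * N * np.log(2 * np.pi))

# toy data: noisy sinusoid with a polynomial basis (illustrative choices)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)
Phi = np.vander(x, 6, increasing=True)

# maximise the evidence over (log alpha, log beta)
res = minimize(lambda p: -log_marginal_likelihood(p[0], p[1], Phi, t),
               x0=np.zeros(2), method="Nelder-Mead")
alpha_hat, beta_hat = np.exp(res.x)
print(alpha_hat, beta_hat)
```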

BAYESIAN MODEL COMPARISON

Consider two different models M_1 and M_2: which one is the best to explain the data D? In a Bayesian setting, we may place a prior p(M_j) on the models, and compute the posterior

\[
p(\mathcal{M}_j \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathcal{M}_j)\, p(\mathcal{M}_j)}{\sum_{j'} p(\mathcal{D} \mid \mathcal{M}_{j'})\, p(\mathcal{M}_{j'})}
\]

As we typically have additional parameters w, the term p(D | M_j) is the model evidence (or marginal likelihood), obtained by integrating w out. The ratio p(D | M_1)/p(D | M_2) is known as the Bayes factor.
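A small sketch of how the posterior model probabilities can be computed numerically from the log evidences; the numbers below are made-up placeholders, and working in log space with logsumexp avoids underflow.

```python
import numpy as np
from scipy.special import logsumexp

# made-up log evidences log p(D | M_j) for three candidate models
log_evidence = np.array([-45.2, -41.7, -43.9])
log_prior = np.log(np.array([0.5, 0.25, 0.25]))   # prior p(M_j) over models

# posterior p(M_j | D) proportional to p(D | M_j) p(M_j), normalised in log space
log_post = log_evidence + log_prior
log_post -= logsumexp(log_post)
posterior = np.exp(log_post)
print(posterior)    # weights summing to one
```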

BAYESIAN MODEL COMPARISON

In Bayesian model comparison, we can take two approaches. We can compute the predictive distribution of each model and average them by the posterior model probabilities:

\[
p(t \mid \mathcal{D}) = \sum_j p(t \mid \mathcal{M}_j, \mathcal{D})\, p(\mathcal{M}_j \mid \mathcal{D})
\]

Alternatively, we can choose the model with the larger Bayes factor. This will pick the correct model on average. In fact, the average log Bayes factor (assuming M_1 is the true model) is positive:

\[
\int p(\mathcal{D} \mid \mathcal{M}_1)\, \log \frac{p(\mathcal{D} \mid \mathcal{M}_1)}{p(\mathcal{D} \mid \mathcal{M}_2)}\, \mathrm{d}\mathcal{D} > 0
\]
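Continuing the same hypothetical numbers, a short sketch of the model-averaged prediction and of the log Bayes factor; the predictive densities are placeholder values.

```python
import numpy as np
from scipy.special import logsumexp

# same made-up log evidences as above, with a uniform prior over models
log_evidence = np.array([-45.2, -41.7, -43.9])
posterior = np.exp(log_evidence - logsumexp(log_evidence))   # p(M_j | D)

# placeholder predictive densities p(t* | M_j, D) at a new target value
predictive = np.array([0.35, 0.48, 0.29])
p_avg = np.sum(predictive * posterior)            # model-averaged p(t* | D)

# log Bayes factor of M_1 against M_2
log_BF_12 = log_evidence[0] - log_evidence[1]
print(p_avg, log_BF_12)
```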

OUTLINE

1 LINEAR REGRESSION MODELS

2 BAYESIAN LINEAR REGRESSION

3 DUAL REPRESENTATION AND KERNELS

DUAL REPRESENTATION

Consider a regression problem with data (x_i, y_i) and a linear model w^T φ(x). We can restrict the choice of w to the linear subspace spanned by φ(x_1), ..., φ(x_N), as any w_⊥ orthogonal to this subspace gives a contribution w_⊥^T φ(x_i) = 0 on the input points:

\[
\mathbf{w} = \sum_{j=1}^{N} a_j\, \boldsymbol{\phi}(\mathbf{x}_j)
\]

The coefficients a = (a_1, ..., a_N)^T are known as the dual variables. By defining the kernel k(x_i, x_j) := φ(x_i)^T φ(x_j), we can write

\[
\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i) = \mathbf{a}^T \mathbf{K}_i
\]

where K_i is the i-th column of the Gram matrix K, with K_ij = k(x_i, x_j).
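A quick numerical sanity check of this identity, assuming a toy feature map φ and random dual variables (both invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 2))                  # N = 5 inputs in R^2

def phi(x):
    """Toy feature map: (1, x1, x2, x1*x2)."""
    return np.array([1.0, x[0], x[1], x[0] * x[1]])

Phi = np.stack([phi(x) for x in X])              # N x M matrix of features
K = Phi @ Phi.T                                  # Gram matrix, K_ij = phi(x_i)^T phi(x_j)

a = rng.standard_normal(5)                       # dual variables
w = Phi.T @ a                                    # w = sum_j a_j phi(x_j)

# w^T phi(x_i) coincides with a^T K_i at every input point
print(np.allclose(Phi @ w, K @ a))               # True
```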

DUAL REGRESSION PROBLEM

In the dual variables, we have to optimise the following regression objective:

\[
E_d(\mathbf{a}) + E_W(\mathbf{a}) = \sum_{i=1}^{N} \left( t_i - \mathbf{a}^T \mathbf{K}_i \right)^2 + \lambda\, \mathbf{a}^T \mathbf{K}\, \mathbf{a}
\]

By differentiating with respect to a and setting the gradient to zero, we obtain the solution

\[
\hat{\mathbf{a}} = (\mathbf{K} + \lambda \mathbf{I})^{-1}\, \mathbf{t}
\]

At a new input x_*, the prediction will then be

\[
y(\mathbf{x}_*) = \mathbf{k}_*^T (\mathbf{K} + \lambda \mathbf{I})^{-1}\, \mathbf{t}
\]

with k_* = (k(x_*, x_1), ..., k(x_*, x_N))^T.

The computational cost of solving the primal problem is O(M³), while the dual costs O(N³). Both can be prohibitive when N and M are large. In this case, one can optimise the log likelihood directly, using gradient-based methods.
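A minimal sketch of the dual solution and prediction, here with the plain linear kernel k(x, x') = x^T x'; the data and the value of λ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))                      # N = 50 training inputs
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)
lam = 0.1                                             # regularisation parameter

K = X @ X.T                                           # Gram matrix (linear kernel)
a_hat = np.linalg.solve(K + lam * np.eye(len(X)), t)  # dual solution

x_star = np.array([0.3, -1.0, 2.0])                   # new input
k_star = X @ x_star                                   # k_* = (k(x_*, x_1), ..., k(x_*, x_N))
y_star = k_star @ a_hat                               # prediction y(x_*)
print(y_star)
```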


THE KERNEL TRICK

The dual objective function depends only on the scalar products of the input vectors. We can therefore replace the Euclidean scalar product with any (non-linear) scalar product. This is usually done by directly specifying a non-linear kernel function k(x_i, x_j) (the kernel trick). This enables us to work with more general sets of basis functions, even countably infinite ones (see Gaussian processes). The same dual procedure applies to other algorithms, notably linear classification and SVMs.
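For instance, the dual solution above can be reused unchanged with a non-linear kernel; the sketch below assumes a Gaussian (RBF) kernel and toy one-dimensional data, both chosen for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K_ij = exp(-gamma * ||a_i - b_j||^2) for the rows of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(40, 1))                  # training inputs
t = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)   # non-linear targets
lam = 0.1

K = rbf_kernel(X, X)                                  # Gram matrix under the RBF kernel
a_hat = np.linalg.solve(K + lam * np.eye(len(X)), t)

X_new = np.array([[0.5], [2.0]])
y_new = rbf_kernel(X_new, X) @ a_hat                  # predictions at new inputs
print(y_new)
```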

COMPUTATIONAL STATISTICS LINEAR CLASSIFICATION

Luca Bortolussi

Department of Mathematics and Geosciences University of Trieste

Office 238, third floor, H2bis [email protected]

Trieste, Winter Semester 2015/2016

OUTLINE

1 LINEAR CLASSIFIERS

2 LOGISTIC REGRESSION

3 LAPLACE APPROXIMATION

4 BAYESIAN LOGISTIC REGRESSION

5 CONSTRAINED OPTIMISATION

6 SUPPORT VECTOR MACHINES

INTRODUCTION

Data: (x_i, t_i). Outputs are discrete, either binary or multiclass (K classes), and are also denoted by y_i. Classes are denoted by C_1, ..., C_K.

Discriminant function: we construct a function f(x) ∈ {1, ..., K} associating a class with each input.

Generative approach: we consider a prior over the classes, p(C_k), and the class-conditional densities p(x | C_k), from a parametric family. We learn the class-conditional densities from data, and then compute the class posterior:

\[
p(\mathcal{C}_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{p(\mathbf{x})}
\]

Discriminative approach: we directly learn a model for the class posterior p(C_k | x), typically as p(C_1 | x) = f(w^T φ(x)). Here f is called the activation function (and f^{-1} the link function).
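As a small illustration of the discriminative approach for two classes, the sketch below models p(C_1 | x) = f(w^T φ(x)) using the logistic sigmoid as activation function (as in logistic regression, treated later); the weights, the input and the feature map are invented placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([-0.5, 2.0, -1.0])          # illustrative weight vector
x = np.array([0.8, 0.3])                 # a single input point
phi = np.concatenate(([1.0], x))         # feature map: bias term plus raw inputs

p_c1 = sigmoid(w @ phi)                  # p(C_1 | x) = f(w^T phi(x))
p_c2 = 1.0 - p_c1                        # p(C_2 | x)
print(p_c1, p_c2)
```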

ENCODING OF THE OUTPUT

For a binary classification problem, we usually choose t_n ∈ {0, 1}. The interpretation is that of a "probability" of belonging to class C_1.

In some circumstances (perceptron, SVM), we will prefer the encoding t_n ∈ {−1, 1}.

For a multiclass problem, we usually stick to a boolean (1-of-K) encoding: t_n = (t_{n,1}, ..., t_{n,K}), with t_{n,j} ∈ {0, 1}, and t_n is in class C_k if and only if t_{n,k} = 1 and t_{n,j} = 0 for j ≠ k.
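A short sketch of the 1-of-K encoding, assuming class labels given as integer indices in {0, ..., K−1} (an illustrative convention, not fixed by the slides).

```python
import numpy as np

labels = np.array([2, 0, 1, 2])          # class index of each of 4 examples
K = 3                                    # number of classes

T = np.zeros((labels.size, K))
T[np.arange(labels.size), labels] = 1.0  # t_{n,k} = 1 iff example n is in class C_k
print(T)
```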