
Linear Models for Regression

Sargur Srihari [email protected]

Topics in Linear Regression
• What is regression?
  – Curve Fitting with Scalar input
  – Linear Models
• Maximum Likelihood and Least Squares
• Stochastic Gradient Descent
• Regularized Least Squares

The regression task
• It is a supervised learning task
• Goal of regression:
  – predict the value of one or more target variables t
  – given a d-dimensional vector x of input variables
  – with a dataset of known inputs and outputs

• (x1,t1), ..., (xN,tN)

• Where xi is an input (possibly a vector) known as the predictor

• ti is the target output (or response) for case i, which is real-valued
  – Goal is to predict t from x for some future test case
• We are not trying to model the distribution of x
  – We don't expect the predictor to be a linear function of x
    • So ordinary linear regression on the inputs will not work
    • We need to allow for a nonlinear function of x

• We don't have a theory of what form this function should take

An example problem
• Fifty points generated (one-dimensional problem)
  – With x uniform from (0,1)
  – y generated from the formula y = sin(1 + x^2) + noise
    • Where the noise has a N(0, 0.03^2) distribution
• Noise-free true function and data points are as shown
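A minimal sketch of how such a dataset could be generated with NumPy; the function, sample size, and noise level follow the slide, while the variable names and random seed are illustrative:

import numpy as np

rng = np.random.default_rng(0)

N = 50
x = rng.uniform(0.0, 1.0, size=N)            # inputs drawn uniformly from (0, 1)
true_f = np.sin(1.0 + x**2)                  # noise-free true function
t = true_f + rng.normal(0.0, 0.03, size=N)   # additive N(0, 0.03^2) noise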

Applications of Regression

1. Expected claim amount an insured person will make (used to set insurance premiums), or prediction of future prices of securities
2. Also used for algorithmic trading

ML Terminology
• Regression
  – Predict a numerical value t given some input
  – Learning algorithm has to output a function f : R^n → R
    • where n = no. of input variables
• Classification
  – If the t value is a label (categories): f : R^n → {1,..,k}
• Ordinal Regression
  – Discrete values, ordered categories

Polynomial Curve Fitting with a Scalar

– With a single input variable x:
  y(x,w) = w0 + w1 x + w2 x^2 + … + wM x^M = Σ_{j=0}^{M} wj x^j
  – M is the order of the polynomial; x^j denotes x raised to the power j
  – (figure: training data set with N = 10, input x, target t)

Coefficients w0,…,wM are collectively denoted by vector w

– Task: Learn w from training data D = {(xi, ti)}, i = 1,..,N
• Can be done by minimizing an error function that measures the misfit between y(x,w), for any given w, and the training data
• One simple choice of error function is the sum of squares of the errors between the predictions y(xn,w) for each data point xn and the corresponding target values tn, so that we minimize
  E(w) = (1/2) Σ_{n=1}^{N} { y(xn,w) − tn }^2
• It is zero when the function y(x,w) passes exactly through each training data point

Results with polynomial basis
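As an illustration (not from the slides), a polynomial of order M can be fit by building a Vandermonde design matrix and solving the least-squares problem that minimizes E(w); the function names and the choice of M are illustrative:

import numpy as np

def fit_polynomial(x, t, M):
    """Least-squares fit of y(x,w) = sum_j w_j x^j, minimizing E(w) = 1/2 sum_n (y(x_n,w) - t_n)^2."""
    Phi = np.vander(x, M + 1, increasing=True)      # columns are x^0, x^1, ..., x^M
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def predict(x, w):
    return np.vander(x, len(w), increasing=True) @ w

# e.g. w = fit_polynomial(x, t, M=3); E = 0.5 * np.sum((predict(x, w) - t) ** 2)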

Regression with multiple inputs

• Generalization
  – Predict the value of a continuous target variable t given the value of D input variables x = [x1,..,xD]
  – t can also be a set of variables (multiple regression)
  – Linear functions of adjustable parameters
    • Specifically, linear combinations of nonlinear functions of the input variables
• Polynomial curve fitting is good only for:
  – A single input variable, scalar x
  – It cannot be easily generalized to several variables, as we will see

Simplest Linear Model with D inputs

• Regression with D input variables:
  y(x,w) = w0 + w1 x1 + .. + wD xD = w^T x
  – This differs from linear regression with one variable and from polynomial regression with one variable

– where x = (x1,..,xD)^T are the input variables

• Called Linear Regression since it is a linear function of

– parameters w0,..,wD

– input variables x1,..,xD
• Significant limitation, since it is a linear function of the input variables
  – In the one-dimensional case this amounts to a straight-line fit (degree-one polynomial)

– y(x,w) = w0 + w1 x

Fitting a Regression Plane

• Assume t is a function of inputs x1, x2,...,xD
  – Goal: find the best linear regressor of t on all inputs
  – Fitting a hyperplane through N input samples
  – (figure: for D = 2, a plane fit through points (x1, x2, t))

• Being a linear function of the input variables imposes limitations on the model
  – Can extend the class of models by considering fixed nonlinear functions of the input variables

Basis Functions
• In many applications, we apply some form of fixed preprocessing, or feature extraction, to the original data variables
• If the original variables comprise the vector x, then the features can be expressed in terms of basis functions {φj(x)}
  – By using nonlinear basis functions we allow the function y(x,w) to be a nonlinear function of the input vector x
• They are linear functions of the parameters (which gives them simple analytical properties), yet are nonlinear wrt the input variables

Linear Regression with M Basis Functions

• Extended by considering nonlinear functions of input variables

y(x,w) = w0 + Σ_{j=1}^{M−1} wj φj(x)

– where φj(x) are called Basis functions – We now need M weights for basis functions instead of D weights for features

– With a dummy basis function φ0(x)=1 corresponding to the bias parameter w0, we can write

y(x,w) = Σ_{j=0}^{M−1} wj φj(x) = w^T φ(x)

– where w = (w0, w1,.., wM−1)^T and φ = (φ0, φ1,.., φM−1)^T

• Basis functions allow non-linearity with D input variables

Choice of Basis Functions

• Many possible choices for basis functions:
  1. Polynomial basis functions
     • Good only if there is only one input variable
  2. Gaussian basis functions
  3. Sigmoidal basis functions
  4. Fourier basis functions
  5. Wavelets

1. Polynomial Basis for one variable

• Linear basis function model:
  y(x,w) = Σ_{j=0}^{M−1} wj φj(x) = w^T φ(x)
• Polynomial basis (for a single variable x): φj(x) = x^j, giving a degree M−1 polynomial
• Disadvantage
  – Global:
    • changes in one region of input space affect others
  – Difficult to formulate:
    • the number of terms increases exponentially with M
  – Can divide input space into regions
    • use different polynomials in each region:
    • equivalent to spline functions

Can we use a Polynomial with D variables? (Not practical!)
• Consider (for a vector x) the basis φj(x) = ||x||^j, where ||x||^2 = x1^2 + x2^2 + .. + xd^2
  – x = (2,1) and x = (1,2) have the same squared sum, so it is unsatisfactory
  – The vector is being converted into a scalar value, thereby losing information
• Better polynomial approach:
  – A polynomial of degree M−1 has terms with variables taken none, one, two, … M−1 at a time.

– Use a multi-index j = (j1, j2,.., jD) such that j1 + j2 + .. + jD ≤ M−1
– For a quadratic (M = 3) with three variables (D = 3):

y(x,w) = Σ_{(j1,j2,j3)} wj φj(x)
       = w0 + w1,0,0 x1 + w0,1,0 x2 + w0,0,1 x3 + w1,1,0 x1x2 + w1,0,1 x1x3 + w0,1,1 x2x3 + w2,0,0 x1^2 + w0,2,0 x2^2 + w0,0,2 x3^2
  – Number of quadratic terms is 1 + D + D(D−1)/2 + D
  – For D = 46, it is 1128
• Better to use a Gaussian kernel, discussed next
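A quick check of the term count (illustrative, not from the slides):

def num_quadratic_terms(D):
    # 1 constant + D linear terms + D(D-1)/2 cross terms + D squared terms
    return 1 + D + D * (D - 1) // 2 + D

print(num_quadratic_terms(3))    # 10, matching the D=3 expansion above
print(num_quadratic_terms(46))   # 1128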

Disadvantage of Polynomial
• Polynomials are global basis functions
  – Each affects the prediction over the whole input space
• Often local basis functions are more appropriate

2. Gaussian Radial Basis Functions
• Gaussian:
  φj(x) = exp( −(x − μj)^2 / (2σ^2) )
  – Does not necessarily have a probabilistic interpretation
  – The usual normalization term is unimportant
    • since the basis function is multiplied by a weight wj
• Choice of parameters

– μj govern the locations of the basis functions
  • Can be an arbitrary set of points within the range of the data
  • Can choose some representative data points
– σ governs the spatial scale
  • Could be chosen from the data set, e.g., average variance
• Several variables
  – A Gaussian kernel would be chosen for each dimension
  – For each j a different set of means would be needed, perhaps chosen from the data:
  φj(x) = exp( −(1/2) (x − μj)^T Σ^{−1} (x − μj) )
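A sketch of Gaussian basis functions on a one-dimensional input, with centers μj on a grid and a common scale σ; the grid range, spacing, and names are assumptions for illustration:

import numpy as np

def gaussian_basis(x, centers, sigma):
    """N x M matrix with entries phi_j(x_n) = exp(-(x_n - mu_j)^2 / (2 sigma^2))."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)          # N x 1
    mu = np.asarray(centers, dtype=float).reshape(1, -1)   # 1 x M
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))

s = 0.1
centers = np.arange(0.0, 1.0 + s, s)                       # centers mu_j on a grid with spacing s
# Phi = gaussian_basis(x, centers, sigma=s)                # x from the earlier example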

Result with Gaussian Basis Functions
Basis functions for s = 0.1, with the μj on a grid with spacing s

The wj values for the middle model:

6856.5  -3544.1  -2473.7  -2859.8  -2637.7  -2861.5  -2468.0  -3558.4

Biological Inspiration for RBFs
• The nervous system contains many examples
  – Local receptive fields in visual cortex
• In an RBF network, receptive fields overlap
  – So there is usually more than one unit active
  – But for a given input, the total no. of active units is small

Tiling the input space
• Determining centers
  – k-means clustering (a code sketch follows below)
    • Choose k cluster centers
    • Mark each training point as captured by the cluster to which it is closest
    • Move each cluster center to the mean of the points it captured
• Determining variance σ^2
  – Global: σ = mean distance between each unit j and its closest neighbor

  – P-nearest neighbor: set each σj so that there is a certain overlap with the P closest neighbors of unit j
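A minimal k-means sketch for choosing RBF centers, following the steps above, together with the global σ heuristic; the data shape, k, and iteration count are illustrative assumptions:

import numpy as np

def kmeans_centers(X, k, n_iters=100, seed=0):
    """Choose k RBF centers by k-means on data X of shape (N, D)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initialize with k random data points
    for _ in range(n_iters):
        # mark each training point as captured by the closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # move each center to the mean of the points it captured
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers

def global_sigma(centers):
    """Global heuristic: sigma = mean distance between each center and its closest neighbor."""
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1).mean()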

3. Sigmoidal Basis Function
• Sigmoid:
  φj(x) = σ( (x − μj) / s ),   where σ(a) = 1 / (1 + exp(−a))
• Equivalently, tanh, because it is related to the logistic sigmoid by

tanh(a) = 2σ(2a) − 1

(figure: logistic sigmoid basis functions for different μj)

4. Other Basis Functions

• Fourier
  – Expansion in sinusoidal functions
  – Infinite spatial extent
• Signal processing
  – Functions localized in both time and frequency
  – Called wavelets
  – Useful for lattices such as images and time series
• Further discussion is independent of the choice of basis, including φ(x) = x

Relationship between Maximum Likelihood and Least Squares
• Will show that minimizing the sum-of-squared errors is the same as the maximum likelihood solution under a Gaussian noise model
• The target variable is a scalar t given by a deterministic function y(x,w) with additive Gaussian noise ε:
  t = y(x,w) + ε
  – where ε is zero-mean Gaussian with precision β
• Thus the distribution of t is univariate normal:
  p(t | x,w,β) = N( t | y(x,w), β^{−1} )    (mean y(x,w), variance β^{−1})

Likelihood Function
• Data set:

– Input X={x1,..xN} with target t = {t1,..tN}

– Target variables tn are scalars forming a vector of size N

• Likelihood of the target data
  – It is the probability of observing the data, assuming the observations are independent
  – since p(t | x,w,β) = N( t | y(x,w), β^{−1} )
  – and y(x,w) = Σ_{j=0}^{M−1} wj φj(x) = w^T φ(x)
  p(t | X,w,β) = Π_{n=1}^{N} N( tn | w^T φ(xn), β^{−1} )

Log-Likelihood Function
• Likelihood:
  p(t | X,w,β) = Π_{n=1}^{N} N( tn | w^T φ(xn), β^{−1} )

• Log-likelihood:
  ln p(t | w,β) = Σ_{n=1}^{N} ln N( tn | w^T φ(xn), β^{−1} )

– Using the standard univariate Gaussian
  N(x | μ, σ^2) = (1 / (2πσ^2)^{1/2}) exp{ −(1/(2σ^2)) (x − μ)^2 }
  we get
  ln p(t | w,β) = (N/2) ln β − (N/2) ln 2π − β ED(w)
• Where
  ED(w) = (1/2) Σ_{n=1}^{N} { tn − w^T φ(xn) }^2    (sum-of-squares error function)
• With Gaussian basis functions

φj(x) = exp( −(x − μj)^T (x − μj) / (2 s^2) )

Maximizing the Log-Likelihood Function
• Log-likelihood:
  ln p(t | w,β) = Σ_{n=1}^{N} ln N( tn | w^T φ(xn), β^{−1} )
               = (N/2) ln β − (N/2) ln 2π − β ED(w)
  – where
  ED(w) = (1/2) Σ_{n=1}^{N} { tn − w^T φ(xn) }^2
• Therefore, maximizing the likelihood is equivalent to

minimizing ED(w) Machine Learning Srihari Determining max likelihood solution

• The likelihood function has the form
  ln p(t | w,β) = (N/2) ln β − (N/2) ln 2π − β ED(w)
• where
  ED(w) = (1/2) Σ_{n=1}^{N} { tn − w^T φ(xn) }^2
• We will show that the maximum likelihood solution has a closed form
  – Take the derivative of ln p(t | w,β) wrt w, set it equal to zero, and solve for w

– or equivalently just the derivative of ED(w)

Gradient of Log-likelihood wrt w

∇w ln p(t | w,β) = β Σ_{n=1}^{N} { tn − w^T φ(xn) } φ(xn)^T
– which is obtained from the log-likelihood expression and by using the calculus result

∇w [ (1/2) (a − wb)^2 ] = −(a − wb) b
• The gradient is set to zero and we solve for w

0 = Σ_{n=1}^{N} tn φ(xn)^T − w^T ( Σ_{n=1}^{N} φ(xn) φ(xn)^T )    as shown in the next slide

• Second derivative will be negative, making this a maximum

Max Likelihood Solution for w

• Solving for w we obtain:
  wML = Φ⁺ t
  – where X = {x1,..,xN} are the samples (vectors of d variables) and t = {t1,..,tN} are the targets (scalars)

• where Φ⁺ = (Φ^T Φ)^{−1} Φ^T is the Moore-Penrose pseudo-inverse of the N × M design matrix Φ, whose elements are given by Φnj = φj(xn)
• Known as the normal equations for the least squares problem

Design matrix (rows correspond to the N samples, columns to the M basis functions):
  Φ = [ φ0(x1)  φ1(x1)  ...  φM−1(x1)
        φ0(x2)    ...
          ...
        φ0(xN)    ...      φM−1(xN) ]
• Pseudo-inverse: a generalization of the notion of matrix inverse to non-square matrices
  – If the design matrix is square and invertible, then the pseudo-inverse is the same as the inverse

• φj(xn) are the M basis functions, e.g., Gaussians centered on M data points:
  φj(x) = exp( −(1/2) (x − μj)^T Σ^{−1} (x − μj) )

Design Matrix Φ

• Φ has N rows (data points) and M columns (basis functions); each column φj can be represented as an N-dimensional vector
• Note that φ0 corresponds to the bias, and is set to 1:
  Φ = [ φ0(x1)  φ1(x1)  ...  φM−1(x1)
        φ0(x2)    ...
          ...
        φ0(xN)    ...      φM−1(xN) ]

• Φ is an N × M matrix, thus Φ^T is an M × N matrix
• Thus Φ^T Φ is M × M, and so is [Φ^T Φ]^{−1}
• So [Φ^T Φ]^{−1} Φ^T is M × N
• Since t is N × 1, we have that wML = [Φ^T Φ]^{−1} Φ^T t is M × 1, which consists of the M weights (including the bias)
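A sketch of the closed-form solution via the pseudo-inverse; np.linalg.pinv computes the Moore-Penrose pseudo-inverse, and the tiny demo data and names are illustrative, not from the slides:

import numpy as np

def fit_ml(Phi, t):
    """Maximum likelihood / least squares weights: w_ML = (Phi^T Phi)^{-1} Phi^T t."""
    return np.linalg.pinv(Phi) @ t     # Moore-Penrose pseudo-inverse of the design matrix

# Tiny demo with a bias column phi_0(x) = 1 and one feature phi_1(x) = x
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.0, 3.1, 4.9, 7.2])
Phi = np.column_stack([np.ones_like(x), x])    # N x M design matrix, here M = 2
w_ml = fit_ml(Phi, t)                          # M weights (including the bias)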

What is the role of the Bias parameter w0?
• The sum-of-squares error function is:
  ED(w) = (1/2) Σ_{n=1}^{N} { tn − w^T φ(xn) }^2
• Substituting y(x,w) = w0 + Σ_{j=1}^{M−1} wj φj(x), we get:
  ED(w) = (1/2) Σ_{n=1}^{N} { tn − w0 − Σ_{j=1}^{M−1} wj φj(xn) }^2
• Setting the derivative wrt w0 equal to zero and solving for w0:

w0 = t̄ − Σ_{j=1}^{M−1} wj φ̄j,   where t̄ = (1/N) Σ_{n=1}^{N} tn and φ̄j = (1/N) Σ_{n=1}^{N} φj(xn)
• The first term is the average of the N values of t
• The second term is a weighted sum of the average basis function values over the N samples

• Thus the bias w0 compensates for the difference between the average target value and the weighted sum of the averages of the basis function values

Maximum Likelihood for the precision β
• We have determined the m.l.e. solution for w using a probabilistic formulation:
  p(t | x,w,β) = N( t | y(x,w), β^{−1} ),   wML = Φ⁺ t
• With log-likelihood

∇w ln p(t | w,β) = β Σ_{n=1}^{N} { tn − w^T φ(xn) } φ(xn)^T
ln p(t | w,β) = (N/2) ln β − (N/2) ln 2π − β ED(w)
• Taking the gradient wrt β gives

1/βML = (1/N) Σ_{n=1}^{N} { tn − wML^T φ(xn) }^2
  – Thus the inverse of the noise precision gives the residual variance of the target values around the regression function
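A small sketch of the precision estimate, assuming a design matrix Phi and fitted weights w_ml as in the earlier example (names illustrative):

import numpy as np

def precision_ml(Phi, t, w_ml):
    """Return beta_ML, where 1/beta_ML = (1/N) sum_n (t_n - w_ML^T phi(x_n))^2."""
    residuals = t - Phi @ w_ml
    inv_beta = np.mean(residuals ** 2)   # residual variance around the regression function
    return 1.0 / inv_beta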

Geometry of Least Squares
• A geometrical interpretation of the least-squares solution is instructive
• Consider an N-dimensional space with axes tn, so that t = (t1,…,tN)^T is a vector in this space

• Each basis function φj, evaluated at the N points, can also be represented as a vector in the same space
  – φj corresponds to the j-th column of Φ, whereas φ(xn) corresponds to the n-th row of Φ
• If the number of basis functions is smaller than the number of data points

– i.e., M < N, then the M vectors φj span a subspace of dimension M, and the least-squares solution is the orthogonal projection of t onto this subspace

• Direct solution of normal equations

wML = Φ⁺ t,   where Φ⁺ = (Φ^T Φ)^{−1} Φ^T
• This direct solution can lead to numerical difficulties
  – When Φ^T Φ is close to singular (determinant = 0)
  – When two basis functions are collinear, parameters can have large magnitudes
    • Not uncommon with real data sets
• Can be addressed using
  – Singular Value Decomposition
  – Addition of a regularization term, which ensures the matrix is non-singular

Method of Gradient Descent
• A criterion f(x) is minimized by moving from the current solution in the direction of the negative of the gradient f′(x)
• Steepest descent proposes a new point
  x′ = x − η f′(x)

– where η is the learning rate, a positive scalar
– Set to a small constant

Gradient with multiple inputs
• For multiple inputs we need partial derivatives:
  ∂f(x)/∂xi is how f changes as only xi increases
  – The gradient of f is the vector of partial derivatives ∇x f(x)
• Gradient descent proposes a new point
  x′ = x − η ∇x f(x)
  – where η is the learning rate, a positive scalar
    • Set to a small constant

(figure: direction in the w0-w1 plane producing steepest descent)

Stochastic Gradient Descent

• The error function
  ED(w) = (1/2) Σ_{n=1}^{N} { tn − w^T φ(xn) }^2
  sums over the data

– Denoting ED(w) = Σn En, SGD updates w using
  w^(τ+1) = w^(τ) − η ∇En
  • where τ is the iteration number, η is a learning rate parameter, and we update after presenting pattern n

• Substituting the derivative ∇En = −{ tn − w^T φ(xn) } φ(xn) gives
  w^(τ+1) = w^(τ) + η ( tn − w^(τ)T φn ) φn,   where φn = φ(xn)
• w is initialized to some starting vector w^(0)
• η must be chosen with care to ensure convergence
• Known as the Least Mean Squares (LMS) algorithm
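A sketch of the LMS update, presenting one pattern at a time; the learning rate, number of passes, and names are illustrative and must be chosen with care for convergence:

import numpy as np

def lms(Phi, t, eta=0.01, n_epochs=50, seed=0):
    """Least-mean-squares: present one pattern at a time and apply
       w <- w + eta * (t_n - w^T phi_n) * phi_n."""
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    w = np.zeros(M)                        # starting vector w^(0)
    for _ in range(n_epochs):
        for n in rng.permutation(N):       # present patterns in random order
            phi_n = Phi[n]
            w = w + eta * (t[n] - w @ phi_n) * phi_n
    return w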

Choosing the Learning rate
• Useful to reduce η as training progresses
• A constant learning rate is the default in Keras
  – Momentum and decay are set to 0 by default
  – keras.optimizers.SGD(lr=0.1, momentum=0.0, decay=0.0, nesterov=False)

(figure: constant learning rate)

• Time-based decay: decay_rate = learning_rate / epochs
  – SGD(lr=0.1, momentum=0.8, decay=decay_rate, nesterov=False)

Sequential (On-line) Learning
• Disadvantage of the ML solution, which is

wML = (Φ^T Φ)^{−1} Φ^T t
  – It is a batch technique
    • Processing the entire training set in one go
    • It is computationally expensive for large data sets
      – Due to the huge N × M design matrix Φ
  – Solution is to use a sequential algorithm where samples are presented one at a time (or a minibatch at a time)
    – Called stochastic gradient descent

Computational bottleneck

• A recurring problem in machine learning:
  – large training sets are necessary for good generalization
  – but large training sets are also computationally expensive
• SGD is an extension of gradient descent that offers a solution
  – Moreover, it is a way to generalize beyond the training set

Insight of SGD

• The gradient is an expectation:
  ∇θ ln p(y | X,θ,β) = β Σ_{i=1}^{m} { y^(i) − θ^T x^(i) } x^(i)T
  – The expectation may be approximated using a small set of samples
• In each step of SGD we can sample a minibatch of examples B = {x^(1),..,x^(m′)}
  – drawn uniformly from the training set
  – Minibatch size m′ is typically chosen to be small: 1 to a hundred
    • Crucially, m′ is held fixed even if the sample set is in the billions
• We may fit a training set with billions of examples using updates computed on only a hundred examples
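A sketch of one minibatch SGD step for the sum-of-squares error: the gradient expectation is approximated from a small batch drawn uniformly from the training set (batch size, step size, and names are illustrative):

import numpy as np

def sgd_minibatch_step(Phi, t, w, eta=0.01, batch_size=32, rng=None):
    """One SGD step using a minibatch estimate of the sum-of-squares gradient."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(t), size=min(batch_size, len(t)), replace=False)  # uniform minibatch
    Phi_b, t_b = Phi[idx], t[idx]
    grad = -(t_b - Phi_b @ w) @ Phi_b / len(idx)   # average per-example gradient
    return w - eta * grad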

Regularized Least Squares
• As model complexity increases, e.g., the degree of the polynomial or the number of basis functions, it is likely that we overfit
• One way to control overfitting is not to limit complexity, but to add a regularization term to the error function
• The error function to minimize takes the form

E(w) = ED(w) + λ EW(w)
• where λ is the regularization coefficient that controls the relative importance of the data-dependent error ED(w) and the regularization term EW(w)

Simplest Regularizer is weight decay
• Regularized least squares is

E(w) = ED(w) + λ EW(w)
• A simple form of the regularization term is
  EW(w) = (1/2) w^T w
  – Thus the total error function becomes
  E(w) = (1/2) Σ_{n=1}^{N} { tn − w^T φ(xn) }^2 + (λ/2) w^T w
• This regularizer is called weight decay
  – because in sequential learning weight values decay towards zero unless supported by the data
• Also, the error function remains a quadratic function of w, so the exact minimizer can be found in

closed form

Closed-form Solution with Regularizer

• Error function with quadratic regularizer is,

E(w) = (1/2) Σ_{n=1}^{N} { tn − w^T φ(xn) }^2 + (λ/2) w^T w
• Its exact minimizer can be found in closed form
  – By setting the gradient wrt w to zero and solving for w:
  w = (λI + Φ^T Φ)^{−1} Φ^T t
• This is a simple extension of the least-squares solution

wML = (Φ^T Φ)^{−1} Φ^T t
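A sketch of the regularized closed-form solution above; the function name and the choice of λ are illustrative:

import numpy as np

def fit_ridge(Phi, t, lam):
    """Regularized least squares: w = (lam*I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)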

Geometric Interpretation of Regularizer

• In the unregularized case we are trying to find w that minimizes the error function
  ED(w) = (1/2) Σ_{n=1}^{N} { tn − w^T φ(xn) }^2
  (figure: contours of the unregularized error function in the w1-w2 plane)

• Any value of w on a contour gives the same error
• In the regularized case: choose the value of w subject to the constraint

Σ_{j=1}^{M} |wj|^2 ≤ η
• We don't want the weights to become too large
• The two approaches are related by Lagrange multipliers

E(w) = (1/2) Σ_{n=1}^{N} { tn − w^T φ(xn) }^2 + (λ/2) w^T w

Minimization of Unregularized Error subject to constraint
• Blue: contours of the unregularized error function
• Minimization with quadratic regularizer, q = 2
• Constraint region
• w* is the optimum value


A more general regularizer

• Regularized error:
  (1/2) Σ_{n=1}^{N} { tn − w^T φ(xn) }^2 + (λ/2) Σ_{j=1}^{M} |wj|^q
• Where q = 2 corresponds to the quadratic regularizer; q = 1 is known as the lasso
• The lasso has the property that if λ is sufficiently large, some of the coefficients wj are driven to zero, leading to a sparse model in which the corresponding basis functions play no role

Contours of regularization term
(1/2) Σ_{n=1}^{N} { tn − w^T φ(xn) }^2 + (λ/2) Σ_{j=1}^{M} |wj|^q
• Contours of the regularization term |wj|^q for different values of q

• In the space of (w1, w2), any choice along a contour has the same value of the regularization term

(figure: contours for several q, e.g., |w1| + |w2| = const for the lasso (q = 1), w1^2 + w2^2 = const for the quadratic (q = 2), and w1^4 + w2^4 = const for q = 4)

Sparsity with Lasso constraint

• With q = 1, and λ sufficiently large, some of the coefficients wj are driven to zero
• This leads to a sparse model
  – where the corresponding basis functions play no role
• The origin of sparsity is illustrated here:
  – Minimization with the quadratic regularizer: a solution where w1* and w0* are nonzero
  – Minimization with the lasso regularizer: a sparse solution with w1* = 0

(figure: contours of the unregularized error function and the constraint region)
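If scikit-learn is available, the sparsity effect of the q = 1 penalty can be seen directly: with a sufficiently large regularization strength, coefficients of unused basis functions are driven exactly to zero. The data and alpha below are illustrative assumptions:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 10))                      # 100 samples, 10 basis functions
t = 3.0 * Phi[:, 0] - 2.0 * Phi[:, 1] + rng.normal(0, 0.1, size=100)

model = Lasso(alpha=0.1).fit(Phi, t)                  # alpha plays the role of lambda
print(model.coef_)                                    # coefficients of unused basis functions are exactly 0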

Regularization: Conclusion

• Regularization allows – complex models to be trained on small data sets – without severe over-fitting • It limits model complexity – i.e., how many basis functions to use? • Problem of limiting complexity is shifted to – one of determining suitable value of regularization coefficient

Linear Regression Summary
• Linear regression with M basis functions:
  y(x,w) = Σ_{j=0}^{M−1} wj φj(x) = w^T φ(x),   e.g., φj(x) = exp( −(1/2) (x − μj)^T Σ^{−1} (x − μj) )
• Objective function without/with regularization:

ED(w) = (1/2) Σ_{n=1}^{N} { tn − w^T φ(xn) }^2
ED(w) = (1/2) Σ_{n=1}^{N} { tn − w^T φ(xn) }^2 + (λ/2) w^T w
• Closed-form ML solutions (with design matrix Φ, Φnj = φj(xn)):
  wML = (Φ^T Φ)^{−1} Φ^T t
  wML = (λI + Φ^T Φ)^{−1} Φ^T t
• Gradient descent: w^(τ+1) = w^(τ) − η ∇En

∇En = −Σ_{n=1}^{N} { tn − w^T φ(xn) } φ(xn)^T
∇En = [ −Σ_{n=1}^{N} { tn − w^T φ(xn) } φ(xn)^T ] + λ w

Returning to LeToR Problem

• Try:
  – Several basis functions
  – Quadratic regularization

• Express results as ERMS – rather than as squared error E(w*) or as Error Rate with thresholded results

E_RMS = sqrt( 2 E(w*) / N )
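A small helper for reporting results this way (illustrative names):

import numpy as np

def rms_error(Phi, t, w):
    """E_RMS = sqrt(2 * E(w) / N), with E(w) the sum-of-squares error."""
    E = 0.5 * np.sum((t - Phi @ w) ** 2)
    return np.sqrt(2.0 * E / len(t))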


Multiple Outputs

• Several target variables t = (t1,..,tK), K > 1
• Can be treated as multiple (K) independent regression problems
  – Different basis functions for each component of t
• More common solution: use the same set of basis functions to model all components of the target vector
  y(x,w) = W^T φ(x)
  – where y is a K-dimensional column vector, W is an M × K matrix of weights, and φ(x) is an M-dimensional column vector with elements φj(x)

Solution for Multiple Outputs

• The set of observations t1,..,tN is combined into a matrix T of size N × K, such that the n-th row is given by tn^T

• Combine input vectors x1,..,xN into matrix X • Log-likelihood function is maximized

• Solution is similar: WML = (Φ^T Φ)^{−1} Φ^T T
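A sketch of the multiple-output solution, where the same design matrix is reused and T stacks the targets (names illustrative):

import numpy as np

def fit_ml_multi(Phi, T):
    """W_ML = (Phi^T Phi)^{-1} Phi^T T; one column of weights per output dimension."""
    return np.linalg.pinv(Phi) @ T   # Phi is N x M, T is N x K, result is M x K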
