Generative vs. Discriminative Classifiers

- Goal: we wish to learn f: X → Y, e.g., P(Y|X)

- Generative classifiers (e.g., Naïve Bayes):
  - Assume some functional form for P(X|Y) and P(Y) – this is a "generative" model of the data (graphical model: Y_i → X_i)
  - Estimate the parameters of P(X|Y) and P(Y) directly from training data
  - Use Bayes rule to calculate P(Y | X = x)

- Discriminative classifiers:
  - Directly assume some functional form for P(Y|X) – this is a "discriminative" model of the data (graphical model: X_i → Y_i)
  - Estimate the parameters of P(Y|X) directly from training data

Naïve Bayes vs Logistic Regression

- Consider Y boolean, X continuous, X = (X_1, ..., X_m)

- Number of parameters to estimate:

  NB:  p(y_k \mid x) = \frac{\pi_k \exp\{-\sum_j [\frac{1}{2\sigma_{k,j}^2}(x_j - \mu_{k,j})^2 + \log\sigma_{k,j} + C]\}}{\sum_{k'} \pi_{k'} \exp\{-\sum_j [\frac{1}{2\sigma_{k',j}^2}(x_j - \mu_{k',j})^2 + \log\sigma_{k',j} + C]\}}

  LR:  \mu(x) = \frac{1}{1 + e^{-\theta^T x}}

- Estimation method:
  - NB parameter estimates are uncoupled
  - LR parameter estimates are coupled
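To make the two functional forms above concrete, here is a minimal Python sketch (the toy parameter values are assumptions for illustration, not taken from the lecture): Gaussian NB builds p(y|x) from P(X|Y) and P(Y) via Bayes rule, while LR parameterizes p(y=1|x) directly.

```python
import numpy as np

def gnb_posterior(x, pi, mu, sigma):
    """Gaussian NB: combine class priors pi[k] with per-feature Gaussians
    N(mu[k, j], sigma[k, j]^2) through Bayes rule."""
    log_joint = np.log(pi) + np.sum(
        -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma), axis=1)
    log_joint -= log_joint.max()            # numerical stability; constants cancel
    joint = np.exp(log_joint)
    return joint / joint.sum()              # p(y = k | x) for each class k

def lr_posterior(x, theta):
    """Logistic regression: model p(y = 1 | x) directly as a sigmoid."""
    return 1.0 / (1.0 + np.exp(-theta @ x))

# Hypothetical toy parameters: 2 classes, 3 continuous features
x = np.array([0.5, -1.2, 0.3])
pi = np.array([0.6, 0.4])
mu = np.array([[0.0, 0.0, 0.0], [1.0, -1.0, 1.0]])
sigma = np.ones((2, 3))
theta = np.array([0.8, -0.5, 0.2])

print(gnb_posterior(x, pi, mu, sigma))   # via P(X|Y), P(Y) and Bayes rule
print(lr_posterior(x, theta))            # via P(Y|X) directly
```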

Naïve Bayes vs Logistic Regression

- Asymptotic comparison (# training examples → infinity)

  - when the model assumptions are correct
    - NB and LR produce identical classifiers

  - when the model assumptions are incorrect
    - LR is less biased – it does not assume conditional independence
    - therefore LR is expected to outperform NB

Naïve Bayes vs Logistic Regression

- Non-asymptotic analysis (see [Ng & Jordan, 2002])

  - convergence rate of parameter estimates – how many training examples are needed to assure good estimates?

    NB: order log m (where m = # of attributes in X)
    LR: order m

  - NB converges more quickly to its (perhaps less helpful) asymptotic estimates

Rate of convergence: logistic regression

- Let h_{Dis,n} be logistic regression trained on n examples in m dimensions. Then with high probability:

- Implication: if we want

  for some small constant ε_0, it suffices to pick order m examples

  → LR converges to its asymptotic classifier in order m examples

- The result follows from Vapnik's structural risk bound, plus the fact that the VC dimension of m-dimensional linear separators is m

Rate of convergence: naïve Bayes parameters

- Let any ε_1, δ > 0, and any n ≥ 0 be fixed.

  Assume that for some fixed ρ_0 > 0, we have that

- Let

- Then with probability at least 1 − δ, after n examples:

  1. For discrete inputs, for all i and b

  2. For continuous inputs, for all i and b

Some results from UCI data sets


Summary

- Naïve Bayes classifier
  - What is the assumption?
  - Why do we use it?
  - How do we learn it?

- Logistic regression
  - The functional form follows from the Naïve Bayes assumptions
    - For Gaussian Naïve Bayes, assuming class-independent variances
    - For discrete-valued Naïve Bayes too
  - But the training procedure picks parameters without the conditional independence assumption
  - Gradient ascent/descent
    - The general approach when closed-form solutions are unavailable

- Generative vs. Discriminative classifiers
  - Bias vs. variance tradeoff

Machine Learning

10-701/15-781, Fall 2011

Linear Regression and Sparsity

Eric Xing

Lecture 4, September 21, 2011

Reading:

Machine learning for apartment hunting

- Now you've moved to Pittsburgh!! And you want to find the most reasonably priced apartment satisfying your needs: square-ft., # of bedrooms, distance to campus, ...

  Living area (ft²)   # bedrooms   Rent ($)
  230                 1            600
  506                 2            1000
  433                 2            1100
  109                 1            500
  ...                 ...          ...
  150                 1            ?
  270                 1.5          ?

The learning problem

- Features:
  - Living area, distance to campus, # bedroom, ...
  - Denote as x = [x^1, x^2, ..., x^k]

- Target:
  - Rent
  - Denoted as y

- Training set:

  X = \begin{bmatrix} -\,x_1\,- \\ -\,x_2\,- \\ \vdots \\ -\,x_n\,- \end{bmatrix}
    = \begin{bmatrix} x_1^1 & x_1^2 & \cdots & x_1^k \\ x_2^1 & x_2^2 & \cdots & x_2^k \\ \vdots & \vdots & & \vdots \\ x_n^1 & x_n^2 & \cdots & x_n^k \end{bmatrix}

  Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
  \quad \text{or} \quad
  \begin{bmatrix} -\,y_1\,- \\ -\,y_2\,- \\ \vdots \\ -\,y_n\,- \end{bmatrix}

  (figure: rent plotted against living area and location)

Linear Regression

- Assume that Y (target) is a linear function of X (features):
  - e.g.: \hat{y} = \theta_0 + \theta_1 x^1 + \theta_2 x^2

- Let's assume a vacuous "feature" x^0 = 1 (this is the intercept term – why?), and define the feature vector to be x = [x^0, x^1, ..., x^k]

- Then we have the following general representation of the linear function:
  \hat{y} = \theta^T x

- Our goal is to pick the optimal θ. How?

- We seek the θ that minimizes the following cost function:
  J(\theta) = \frac{1}{2}\sum_{i=1}^n (\hat{y}_i(x_i) - y_i)^2

The Least-Mean-Square (LMS) method

- The cost function:
  J(\theta) = \frac{1}{2}\sum_{i=1}^n (x_i^T\theta - y_i)^2

- Consider a gradient descent algorithm:
  \theta_j^{t+1} = \theta_j^t - \alpha \frac{\partial}{\partial\theta_j} J(\theta^t)

The Least-Mean-Square (LMS) method

- Now we have the following descent rule:
  \theta_j^{t+1} = \theta_j^t + \alpha \sum_{i=1}^n (y_i - x_i^T\theta^t)\, x_i^j

- For a single training point, we have:
  \theta_j^{t+1} = \theta_j^t + \alpha\, (y_i - x_i^T\theta^t)\, x_i^j

- This is known as the LMS update rule, or the Widrow-Hoff learning rule

- This is actually a "stochastic", "coordinate" descent algorithm

- This can be used as an on-line algorithm (a small code sketch follows below)
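A minimal Python sketch of the on-line LMS rule above (the toy data, step size, and epoch count are assumptions for illustration):

```python
import numpy as np

def lms_online(X, y, alpha=0.01, epochs=50):
    """Widrow-Hoff / LMS: update theta one training point at a time."""
    n, k = X.shape
    theta = np.zeros(k)
    for _ in range(epochs):
        for i in range(n):
            error = y[i] - X[i] @ theta      # y_i - x_i^T theta
            theta += alpha * error * X[i]    # theta_j += alpha * error * x_i^j
    return theta

# Toy data: y is (approximately) a linear function of x, with an intercept column
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([0.5, 2.0, -1.0]) + 0.1 * rng.normal(size=100)
print(lms_online(X, y))   # close to [0.5, 2.0, -1.0]
```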

Geometry and Convergence of LMS

(figure: LMS trajectories for N = 1, 2, 3)

\theta^{t+1} = \theta^t + \alpha\, (y_i - x_i^T\theta^t)\, x_i

Claim: when the step size α satisfies certain conditions, and when certain other technical conditions are satisfied, LMS will converge to an "optimal region".

Steepest Descent and LMS

- Steepest descent

- Note that:
  \nabla_\theta J = \left[\frac{\partial J}{\partial\theta_1}, \ldots, \frac{\partial J}{\partial\theta_k}\right]^T = -\sum_{i=1}^n (y_i - x_i^T\theta)\, x_i

  \theta^{t+1} = \theta^t + \alpha \sum_{i=1}^n (y_i - x_i^T\theta^t)\, x_i

- This is a batch gradient descent algorithm (see the sketch below)
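For contrast with the on-line rule, a minimal Python sketch of the batch (steepest descent) update, under the same assumed toy setup as before:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.001, iters=2000):
    """Batch steepest descent on J(theta) = 0.5 * sum_i (x_i^T theta - y_i)^2."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = -X.T @ (y - X @ theta)   # gradient of J, summed over all examples
        theta -= alpha * grad           # theta^{t+1} = theta^t - alpha * grad
    return theta

# e.g. batch_gradient_descent(X, y) on the toy data above gives a similar theta,
# provided alpha is small enough for the problem at hand.
```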

The normal equations

- Write the cost function in matrix form:

  J(\theta) = \frac{1}{2}\sum_{i=1}^n (x_i^T\theta - y_i)^2
            = \frac{1}{2}(X\theta - \vec{y})^T (X\theta - \vec{y})
            = \frac{1}{2}\left(\theta^T X^T X \theta - \theta^T X^T \vec{y} - \vec{y}^T X \theta + \vec{y}^T \vec{y}\right)

  where X = \begin{bmatrix} -\,x_1\,- \\ -\,x_2\,- \\ \vdots \\ -\,x_n\,- \end{bmatrix} and \vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}

- To minimize J(θ), take the derivative and set it to zero:

  \nabla_\theta J = \frac{1}{2}\nabla_\theta\, \mathrm{tr}\left(\theta^T X^T X \theta - \theta^T X^T \vec{y} - \vec{y}^T X \theta + \vec{y}^T \vec{y}\right)
                  = \frac{1}{2}\left(\nabla_\theta \mathrm{tr}\,\theta^T X^T X \theta - 2\nabla_\theta \mathrm{tr}\,\vec{y}^T X \theta + \nabla_\theta \mathrm{tr}\,\vec{y}^T \vec{y}\right)
                  = \frac{1}{2}\left(X^T X \theta + X^T X \theta - 2 X^T \vec{y}\right)
                  = X^T X \theta - X^T \vec{y} = 0

  \Rightarrow \; X^T X \theta = X^T \vec{y}   (the normal equations)

  \Rightarrow \; \theta^* = (X^T X)^{-1} X^T \vec{y}
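A minimal Python sketch of solving the normal equations directly (toy data as before; in practice one prefers a least-squares solver over forming (X^T X)^{-1} explicitly):

```python
import numpy as np

def fit_normal_equations(X, y):
    """Solve X^T X theta = X^T y without forming the explicit inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([0.5, 2.0, -1.0]) + 0.1 * rng.normal(size=100)

print(fit_normal_equations(X, y))             # close to [0.5, 2.0, -1.0]
print(np.linalg.lstsq(X, y, rcond=None)[0])   # same answer, more robustly
```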

Some matrix derivatives

- For f : R^{m×n} → R, define:

  \nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}

- Trace:
  \mathrm{tr}\,a = a (for a scalar a), \quad \mathrm{tr}\,A = \sum_{i=1}^n A_{ii}, \quad \mathrm{tr}\,ABC = \mathrm{tr}\,CAB = \mathrm{tr}\,BCA

- Some facts about matrix derivatives (without proof):

  \nabla_A \mathrm{tr}\,AB = B^T, \quad \nabla_A \mathrm{tr}\,ABA^T C = CAB + C^T A B^T, \quad \nabla_A |A| = |A|\,(A^{-1})^T

Comments on the normal equation

- In most situations of practical interest, the number of data points N is larger than the dimensionality k of the input space, and the matrix X is of full column rank. If this condition holds, it is easy to verify that X^T X is necessarily invertible.

- The assumption that X^T X is invertible implies that it is positive definite, thus the critical point we have found is a minimum.

- What if X has less than full column rank? → regularization (later).

Direct and Iterative methods

- Direct methods: we can achieve the solution in a single step by solving the normal equations
  - Using Gaussian elimination or QR decomposition, we converge in a finite number of steps
  - It can be infeasible when data are streaming in in real time, or are of very large amount

- Iterative methods: stochastic or steepest gradient descent
  - Converging in a limiting sense
  - But more attractive for large practical problems
  - Caution is needed when deciding the learning rate α

Convergence rate

- Theorem: the steepest descent algorithm converges to the minimum of the cost characterized by the normal equations if:

- A formal analysis of LMS needs more math muscle; in practice, one can use a small α, or gradually decrease α.

A Summary:

- LMS update rule
  \theta_j^{t+1} = \theta_j^t + \alpha\,(y_i - x_i^T\theta^t)\, x_i^j
  - Pros: on-line, low per-step cost, fast convergence, and perhaps less prone to local optima
  - Cons: convergence to the optimum is not always guaranteed

- Steepest descent
  \theta^{t+1} = \theta^t + \alpha \sum_{i=1}^n (y_i - x_i^T\theta^t)\, x_i
  - Pros: easy to implement, conceptually clean, guaranteed convergence
  - Cons: batch, often slow to converge

- Normal equations
  \theta^* = (X^T X)^{-1} X^T \vec{y}
  - Pros: a single-shot algorithm! Easiest to implement.
  - Cons: need to compute the pseudo-inverse (X^T X)^{-1}, which is expensive and has numerical issues (e.g., the matrix may be singular), although there are ways to get around this ...

Geometric Interpretation of LMS

- The predictions on the training data are:
  \hat{\vec{y}} = X\theta^* = X(X^T X)^{-1} X^T \vec{y}

- Note that
  \hat{\vec{y}} - \vec{y} = \left(X(X^T X)^{-1} X^T - I\right)\vec{y}
  and
  X^T(\hat{\vec{y}} - \vec{y}) = X^T\left(X(X^T X)^{-1} X^T - I\right)\vec{y} = \left(X^T X (X^T X)^{-1} X^T - X^T\right)\vec{y} = 0 !!

  \hat{\vec{y}} is the orthogonal projection of \vec{y} onto the space spanned by the columns of X
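A quick numerical check of the orthogonality claim above (random toy data, assumed to have full column rank):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
y = rng.normal(size=50)

theta_star = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ theta_star                        # orthogonal projection of y onto col(X)
print(np.allclose(X.T @ (y_hat - y), 0))      # True, up to floating-point error
```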

Probabilistic Interpretation of LMS

- Let us assume that the target variable and the inputs are related by the equation:

  y_i = \theta^T x_i + \varepsilon_i

  where ε_i is an error term capturing unmodeled effects or random noise

- Now assume that ε_i follows a Gaussian N(0, σ²); then we have:

  p(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y_i - \theta^T x_i)^2}{2\sigma^2}\right)

- By the independence assumption:

  L(\theta) = \prod_{i=1}^n p(y_i \mid x_i; \theta) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n \exp\left(-\frac{\sum_{i=1}^n (y_i - \theta^T x_i)^2}{2\sigma^2}\right)

Probabilistic Interpretation of LMS, cont.

- Hence the log-likelihood is:

  \ell(\theta) = n \log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^n (y_i - \theta^T x_i)^2

- Do you recognize the last term?

  Yes, it is:  J(\theta) = \frac{1}{2}\sum_{i=1}^n (x_i^T\theta - y_i)^2

  Since the first term does not depend on θ, maximizing ℓ(θ) is the same as minimizing J(θ).

- Thus, under the independence assumption, LMS is equivalent to MLE of θ!

Case study: predicting gene expression

The genetic picture

(figure: a DNA sequence, CGTTTCACTGTACAATTT, with the causal SNPs marked)

- A univariate phenotype: i.e., the expression intensity of a gene

Association Mapping as Regression

                   Phenotype (BMI)   Genotype
  Individual 1     2.5               . . C . . . . . T . . C . . . . . . T . . .
                                     . . C . . . . . A . . C . . . . . . T . . .
  Individual 2     4.8               . . G . . . . . A . . G . . . . . . A . . .
                                     . . C . . . . . T . . C . . . . . . T . . .
  ...
  Individual N     4.7               . . G . . . . . T . . C . . . . . . T . . .
                                     . . G . . . . . T . . G . . . . . . T . . .

  (benign SNPs vs. the causal SNP are highlighted in the original figure)

Association Mapping as Regression

                   Phenotype (BMI)   Genotype (0/1/2 coding)
  Individual 1     2.5               . . 0 . . . . . 1 . . 0 . . . . . . 0 . . .
  Individual 2     4.8               . . 1 . . . . . 1 . . 1 . . . . . . 1 . . .
  ...
  Individual N     4.7               . . 2 . . . . . 2 . . 1 . . . . . . 0 . . .

  y_i = \sum_{j=1}^J x_{ij}\,\beta_j        SNPs with large |β_j| are relevant
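To connect this to the regression machinery above, here is a minimal Python sketch with a tiny made-up genotype matrix (not the real data): the phenotype is regressed on the 0/1/2-coded SNPs, and the fitted β holds the association strengths.

```python
import numpy as np

# Rows: individuals; columns: SNPs coded as 0/1/2 (made-up values for illustration)
X = np.array([[0, 1, 0],
              [1, 1, 1],
              [2, 2, 1],
              [0, 2, 0]], dtype=float)
y = np.array([2.5, 4.8, 4.7, 3.9])      # phenotype (e.g., BMI)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)   # SNPs with large |beta_j| are the candidate associations
```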

Experimental setup

- Asthma dataset
  - 543 individuals, genotyped at 34 SNPs
  - Diploid data was transformed into 0/1 (for homozygotes) or 2 (for heterozygotes)
  - X = 543 × 34 matrix
  - Y = phenotype variable (continuous)
  - A single phenotype was used for regression

- Implementation details
  - Iterative methods: batch update and online update implemented
  - For both methods, the step size α is chosen to be a small fixed value (10^-6). This choice is based on the data used for the experiments.
  - Both methods are run only to a maximum of 2000 epochs, or until the change in training MSE is less than 10^-4

Convergence Curves

- For the batch method, the training MSE is initially large due to the uninformed initialization

- In the online update, the N per-example updates within each epoch reduce the MSE to a much smaller value

The Learned Coefficients


Multivariate Regression for Trait Association Analysis

(figure: a trait vector y, a genotype matrix X over the SNP sequence A A C A G T T A C G A A G T, and an unknown vector of association strengths β)

  y = X β,   with β = ?

Multivariate Regression for Trait Association Analysis

(figure: the same y = X β picture, now with the estimated association strengths β filled in)

- Many non-zero associations: which SNPs are truly significant?

Sparsity

- One common assumption to make: sparsity.

- It makes biological sense: each phenotype is likely to be associated with a small number of SNPs, rather than all the SNPs.

- It makes statistical sense: learning is now feasible in high dimensions with a small sample size.

Sparsity: In a mathematical sense

- Consider the least squares linear regression problem:

  (figure: inputs x_1, ..., x_n connected to the output y through weights β_1, ..., β_n)

- Sparsity: most of the β's are zero.

- But this constraint is not convex!!! Many local optima, computationally intractable.

L1 Regularization (LASSO) (Tibshirani, 1996)

- A convex relaxation.

  Constrained form:  \min_\beta \|\vec{y} - X\beta\|_2^2 \;\; \text{s.t.} \;\; \|\beta\|_1 \le t

  Lagrangian form:   \min_\beta \|\vec{y} - X\beta\|_2^2 + \lambda\|\beta\|_1

- Still enforces sparsity!

Lasso for Reducing False Positives

(figure: the y = X β picture again, now fit with a lasso penalty)

- Lasso penalty for sparsity: add \lambda \sum_{j=1}^J |\beta_j| to the least-squares objective

- Many zero associations (sparse results), but what if there are multiple related traits?

Ridge Regression vs Lasso

Ridge Regression:                         Lasso: HOT!

(figure: in the (β_1, β_2) plane, level sets of constant J(β) are shown against the constraint sets: βs with constant l2 norm for ridge, βs with constant l1 norm for lasso; the solutions are marked X where a level set first touches each constraint set)

- Lasso (l1 penalty) results in sparse solutions – vectors with more zero coordinates

- Good for high-dimensional problems – you don't have to store all the coordinates!
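To see the sparsity difference numerically, here is a minimal sketch using scikit-learn's Ridge and Lasso (scikit-learn, the toy data, and the penalty strengths are my assumptions for illustration, not part of the lecture):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Toy problem where only a few coefficients are truly non-zero
rng = np.random.default_rng(0)
n, p = 100, 30
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[[2, 8, 19]] = [1.5, -2.0, 1.0]          # sparse ground truth
y = X @ beta_true + 0.1 * rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)   # l2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)   # l1 penalty

print((np.abs(ridge.coef_) < 1e-6).sum())   # ridge: essentially no exact zeros
print((np.abs(lasso.coef_) < 1e-6).sum())   # lasso: most coefficients exactly zero
```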

Bayesian Interpretation

- Treat the distribution parameters θ as a random variable as well

- The a posteriori distribution of θ after seeing the data is:

  p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}

  This is Bayes rule:

  \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}

- The prior p(·) encodes our prior knowledge about the domain

Regularized Least Squares and MAP

What if (X^T X) is not invertible?

MAP estimation: maximize (log likelihood + log prior)

I) Gaussian prior on β (zero mean)

   → Ridge Regression

   Closed form: HW

Prior belief that β is Gaussian with zero mean biases the solution towards "small" β

Regularized Least Squares and MAP

What if (X^T X) is not invertible?

MAP estimation: maximize (log likelihood + log prior)

II) Laplace prior on β (zero mean)

    → Lasso

    Closed form: HW

Prior belief that β is Laplace with zero mean biases the solution towards "small" β
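Written out as a hedged sketch of the correspondence these two slides describe (assuming Gaussian noise of variance σ² and prior scales τ and b, which the slides leave implicit):

\hat{\beta}_{\mathrm{MAP}} = \arg\max_{\beta}\;\big[\log p(\vec{y}\mid X,\beta) + \log p(\beta)\big]

\text{Gaussian prior } p(\beta) \propto e^{-\|\beta\|_2^2/(2\tau^2)} \;\Rightarrow\; \arg\min_{\beta}\;\|\vec{y}-X\beta\|_2^2 + \lambda\|\beta\|_2^2 \quad (\text{ridge, } \lambda = \sigma^2/\tau^2)

\text{Laplace prior } p(\beta) \propto e^{-\|\beta\|_1/b} \;\Rightarrow\; \arg\min_{\beta}\;\|\vec{y}-X\beta\|_2^2 + \lambda\|\beta\|_1 \quad (\text{lasso, } \lambda = 2\sigma^2/b)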

Beyond basic LR

- LR with non-linear basis functions

- Locally weighted linear regression

- Regression trees and Multilinear Interpolation

Non-linear functions:


LR with non-linear basis functions

- LR does not mean we can only deal with linear relationships

- We are free to design (non-linear) features under LR:

  y = \theta_0 + \sum_{j=1}^m \theta_j\, \phi_j(x) = \theta^T \phi(x)

  where the φ_j(x) are fixed basis functions (and we define φ_0(x) = 1).

- Example: polynomial regression:  \phi(x) := [1, x, x^2, x^3]  (see the sketch below)

- We will be concerned with estimating (distributions over) the weights θ and choosing the model order M.
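A minimal Python sketch of the polynomial-regression example (the 1-D toy data and the degree are assumptions for illustration): the non-linear basis expansion turns the problem back into ordinary linear regression.

```python
import numpy as np

def poly_features(x, degree=3):
    """phi(x) = [1, x, x^2, ..., x^degree] for each scalar input x."""
    return np.vander(x, N=degree + 1, increasing=True)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(3 * x) + 0.1 * rng.normal(size=x.size)     # a non-linear target

Phi = poly_features(x, degree=3)                      # design matrix of basis functions
theta = np.linalg.lstsq(Phi, y, rcond=None)[0]        # same least-squares machinery as before
y_hat = Phi @ theta                                   # fitted (cubic) curve
```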

Basis functions

- There are many basis functions, e.g.:

  - Polynomial:               \phi_j(x) = x^{j-1}

  - Radial basis functions:   \phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2 s^2}\right)

  - Sigmoidal:                \phi_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right)

  - Splines, Fourier, etc.

1D and 2D RBFs

- 1D RBF

- After fit:

Good and Bad RBFs

- A good 2D RBF

- Two bad 2D RBFs

Overfitting and underfitting

  y = \theta_0 + \theta_1 x \qquad\qquad y = \theta_0 + \theta_1 x + \theta_2 x^2 \qquad\qquad y = \sum_{j=0}^{5} \theta_j x^j

Bias and variance

- We define the bias of a model to be the expected generalization error even if we were to fit the model to a very (say, infinitely) large training set.

- By fitting "spurious" patterns in the training set, we might again obtain a model with large generalization error. In this case, we say the model has large variance.

Locally weighted linear regression

- The algorithm:

  Instead of minimizing
  J(\theta) = \frac{1}{2}\sum_{i=1}^n (x_i^T\theta - y_i)^2

  we now fit θ to minimize
  J(\theta) = \frac{1}{2}\sum_{i=1}^n w_i\, (x_i^T\theta - y_i)^2

- Where do the w_i's come from?

  w_i = \exp\left(-\frac{(x_i - x)^2}{2\tau^2}\right)

  - where x is the query point for which we'd like to know the corresponding y

- → Essentially we put higher weights on (errors on) training examples that are close to the query point (than on those that are further away from the query); a code sketch follows below
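A minimal Python sketch of locally weighted linear regression for a 1-D input with an intercept (the toy data and bandwidth τ are assumptions): one weighted least-squares fit is solved per query point.

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Predict y at x_query by a weighted least-squares fit around that point."""
    # weights depend on the distance between each training input and the query point
    w = np.exp(-((X[:, 1] - x_query) ** 2) / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # weighted normal equations
    return np.array([1.0, x_query]) @ theta

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 80)
X = np.column_stack([np.ones_like(x), x])
y = np.sin(x) + 0.1 * rng.normal(size=x.size)
print(lwr_predict(2.0, X, y))      # roughly sin(2.0)
```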

Parametric vs. non-parametric

- Locally weighted linear regression is the second example we have run into of a non-parametric algorithm. (What was the first?)

- The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm
  - because it has a fixed, finite number of parameters (the θ), which are fit to the data;
  - once we've fit the θ and stored them away, we no longer need to keep the training data around to make future predictions.

- In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around.

- The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis grows linearly with the size of the training set.

Robust Regression

- The best fit from a quadratic regression

- But this is probably better ...

(figure: the two fits compared on the same data)

How can we do this?

LOESS-based

- Remember what we do in "locally weighted linear regression"? → we "score" each point for its importance

- Now we score each point according to its "fitness"

(Courtesy of Andrew Moore)

Robust regression

- For k = 1 to R ...
  - Let (x_k, y_k) be the k-th datapoint
  - Let y_k^{est} be the predicted value of y_k
  - Let w_k be a weight for data point k that is large if the data point fits well and small if it fits badly:

    w_k = \phi\left((y_k - y_k^{est})^2\right)

- Then redo the regression using the weighted data points.

- Repeat the whole thing until converged! (a code sketch follows below)
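A minimal Python sketch of this iterative reweighting loop (the particular down-weighting function φ and the constant c are my assumptions; the lecture leaves φ abstract):

```python
import numpy as np

def robust_fit(X, y, iters=10, c=1.0):
    """Repeatedly: fit weighted least squares, then re-score points by fit quality."""
    w = np.ones(len(y))
    for _ in range(iters):
        W = np.diag(w)
        theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # weighted regression
        resid_sq = (y - X @ theta) ** 2                     # (y_k - y_k^est)^2
        w = 1.0 / (1.0 + resid_sq / c**2)                   # phi: large weight if the point fits well
    return theta
```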

Robust regression—probabilistic interpretation

- What regular regression does:

  Assume y_k was originally generated using the following recipe:

  y_k = \theta^T x_k + N(0, \sigma^2)

  The computational task is to find the maximum likelihood estimate of θ

Robust regression—probabilistic interpretation

- What LOESS robust regression does:

  Assume y_k was originally generated using the following recipe:

  with probability p:    y_k = \theta^T x_k + N(0, \sigma^2)
  but otherwise:         y_k \sim N(\mu, \sigma_{huge}^2)

  The computational task is to find the maximum likelihood estimates of θ, p, μ and σ_huge.

- The algorithm you saw, with iterative reweighting/refitting, does this computation for us. Later you will find that it is an instance of the famous E.M. algorithm

Regression Tree

- Decision tree for regression

  Gender   Rich?   Num. Children   # travel per yr.   Age
  F        No      2               5                  38
  M        No      0               2                  25
  M        Yes     1               0                  72
  :        :       :               :                  :

  (tree: split on "Gender?"; Female → predicted age = 39, Male → predicted age = 36)

A conceptual picture

- Assuming regular regression trees, can you sketch a graph of the fitted function y*(x) over this diagram?

How about this one?

- Multilinear Interpolation

- We wanted to create a continuous and piecewise linear fit to the data

Take home message

- Gradient descent
  - On-line
  - Batch

- Normal equations

- Equivalence of LMS and MLE

- LR does not mean fitting linear relations, but a linear combination of basis functions (which can be non-linear)

- Weighting points by importance versus by fitness
