Bayesian Linear Regression


Sargur Srihari
[email protected]

Topics in Bayesian Regression
• Recall maximum likelihood linear regression
• Parameter distribution
• Predictive distribution
• Equivalent kernel

Linear Regression: Model Complexity M
• Polynomial regression:

    y(x, w) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j

– [Figure: best fits (red lines) with M = 0, 1, 3, 9 and N = 10 points drawn from sin(2πx). M = 0 and M = 1 give poor representations of sin(2πx), M = 3 gives the best fit, and M = 9 over-fits.]

Maximum Likelihood Regression
• Input vector x and M basis functions φ_0(x), ..., φ_{M−1}(x), typically with φ_0(x) = 1:

    y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)

  For example, radial basis functions: \phi_j(x) = \exp\left( -\tfrac{1}{2} (x - \mu_j)^T \Sigma^{-1} (x - \mu_j) \right)
• Objective function: maximum likelihood with N examples {x_1, ..., x_N} (equivalent to the mean squared error objective) minimizes

    E(w) = \frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - w^T \phi(x_n) \right\}^2

  The regularized version, with regularization coefficient λ, is

    E(w) = \frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - w^T \phi(x_n) \right\}^2 + \frac{\lambda}{2} w^T w

• The closed-form ML solution is

    w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t

  where Φ is the N × M design matrix with elements Φ_{nj} = φ_j(x_n), and (Φ^T Φ)^{-1} Φ^T is the Moore-Penrose pseudoinverse. The regularized solution is

    w = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t

• Gradient descent: w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E, with

    \nabla E = -\sum_{n=1}^{N} \left\{ t_n - w^{(\tau)T} \phi(x_n) \right\} \phi(x_n)

  For the regularized objective the gradient gains the term λ w^{(τ)}:

    \nabla E = -\sum_{n=1}^{N} \left\{ t_n - w^{(\tau)T} \phi(x_n) \right\} \phi(x_n) + \lambda w^{(\tau)}
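As a concrete illustration of the closed-form solutions above, here is a minimal sketch assuming NumPy, a polynomial basis, and synthetic sin(2πx) data as in the model-complexity figure; the specific values of N, M, λ, and the random seed are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N = 10 noisy samples of sin(2*pi*x), as in the over-fitting example.
N, M = 10, 9                                   # illustrative: degree-9 polynomial basis
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, N)

# Design matrix Phi with Phi[n, j] = phi_j(x_n) = x_n**j.
Phi = np.vander(x, M + 1, increasing=True)

# Closed-form ML solution w_ML = (Phi^T Phi)^{-1} Phi^T t, via the Moore-Penrose pseudoinverse.
w_ml = np.linalg.pinv(Phi) @ t

# Regularized solution w = (lambda*I + Phi^T Phi)^{-1} Phi^T t.
lam = 1e-3
w_reg = np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)

print("w_ML :", np.round(w_ml, 2))
print("w_reg:", np.round(w_reg, 2))
```

Using the pseudoinverse rather than forming (Φ^T Φ)^{-1} explicitly is numerically safer when Φ^T Φ is ill-conditioned, which is exactly the regime (large M, small N) where regularization helps.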
Shortcomings of MLE
• The maximum likelihood estimate of w does not address M, the model complexity (how many basis functions?)
– Model complexity is controlled by the data size N: more data allow a better fit without overfitting
• Regularization also controls overfitting (λ controls the effect):

    E(w) = E_D(w) + \lambda E_W(w), \qquad E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - w^T \phi(x_n) \right\}^2, \qquad E_W(w) = \frac{1}{2} w^T w

• But M and the choice of the φ_j are still important
– M can be determined by holdout, but that is wasteful of data
• Model complexity and over-fitting are better handled using the Bayesian approach

Bayesian Linear Regression
• Using Bayes rule, the posterior is proportional to likelihood × prior:

    p(w | t) = \frac{p(t | w) \, p(w)}{p(t)}

– p(t | w) is the likelihood of the observed data
– p(w) is the prior distribution over the parameters
• We will look at:
– a normal distribution for the prior p(w)
– a likelihood p(t | w) that is a product of Gaussians based on the noise model
– and conclude that the posterior is also Gaussian

Gaussian Prior over Parameters
• Assume a multivariate Gaussian prior for w (which has components w_0, ..., w_{M−1}):

    p(w) = N(w | m_0, S_0)

  with mean m_0 and covariance matrix S_0.
• If we choose S_0 = α^{-1} I, the variances of the weights are all equal to α^{-1} and the covariances are zero.
– [Figure: p(w) in (w_0, w_1) space with zero mean (m_0 = 0) and isotropic covariance (same variance for every weight).]

Likelihood of the Data is Gaussian
• Assume a noise precision parameter β and the model t = y(x, w) + ε, where ε is Gaussian noise:

    p(t | x, w, \beta) = N(t | y(x, w), \beta^{-1})

  Note that the output t is a scalar.
• The likelihood of t = {t_1, ..., t_N} is then

    p(\mathbf{t} | X, w, \beta) = \prod_{n=1}^{N} N(t_n | w^T \phi(x_n), \beta^{-1})

– This is the probability of the target data t given the parameters w and the inputs X = {x_1, ..., x_N}
– Because the noise is Gaussian, the likelihood p(t | w) is also Gaussian

Posterior Distribution is also Gaussian
• Prior: p(w) = N(w | m_0, S_0), i.e., it is Gaussian
• The likelihood comes from Gaussian noise: p(t | X, w, β) = \prod_{n=1}^{N} N(t_n | w^T \phi(x_n), \beta^{-1})
• It follows that the posterior p(w | t) is also Gaussian
• Proof: use the standard result for Gaussians that if the marginal p(w) and the conditional p(t | w) have Gaussian forms, then the marginal p(t) and the conditional p(w | t) are also Gaussian:
– Let p(w) = N(w | μ, Λ^{-1}) and p(t | w) = N(t | Aw + b, L^{-1})
– Then p(t) = N(t | Aμ + b, L^{-1} + A Λ^{-1} A^T) and p(w | t) = N(w | Σ{A^T L (t − b) + Λμ}, Σ), where Σ = (Λ + A^T L A)^{-1}

Exact Form of the Posterior Distribution
• We have p(w) = N(w | m_0, S_0) and p(t | X, w, β) = \prod_{n=1}^{N} N(t_n | w^T \phi(x_n), \beta^{-1})
• The posterior is also Gaussian and can be written directly as

    p(w | \mathbf{t}) = N(w | m_N, S_N)

– where the mean of the posterior is m_N = S_N (S_0^{-1} m_0 + β Φ^T t)
– and the covariance of the posterior is given by S_N^{-1} = S_0^{-1} + β Φ^T Φ
– Φ is the design matrix defined earlier, with Φ_{nj} = φ_j(x_n)
– [Figure: prior p(w | α) = N(w | 0, α^{-1} I) and posterior in weight space (w_0, w_1) for a scalar input x and y(x, w) = w_0 + w_1 x.]

Properties of the Posterior
1. Since the posterior p(w | t) = N(w | m_N, S_N) is Gaussian, its mode coincides with its mean
– Thus the maximum-posterior weight vector is w_MAP = m_N
2. For an infinitely broad prior S_0 = α^{-1} I with precision α → 0, the mean m_N reduces to the maximum likelihood value, i.e., the solution vector w_ML = (Φ^T Φ)^{-1} Φ^T t
3. If N = 0, the posterior reverts to the prior
4. If data points arrive sequentially, the posterior at any stage acts as the prior distribution for the subsequent data points

Choosing a Simple Gaussian Prior
• For y(x, w) = w_0 + w_1 x, choose a zero-mean (m_0 = 0), isotropic (same variance) Gaussian prior with a single precision parameter α:

    p(w | \alpha) = N(w | 0, \alpha^{-1} I)

• The corresponding posterior distribution is p(w | t) = N(w | m_N, S_N), where

    m_N = \beta S_N \Phi^T \mathbf{t}, \qquad S_N^{-1} = \alpha I + \beta \Phi^T \Phi

– Note: β is the noise precision and α is the precision of the parameter prior (α^{-1} is its variance). With infinitely many samples the posterior collapses to a point estimate of w.

Equivalence to MLE with Regularization
• Since p(t | X, w, β) = \prod_{n=1}^{N} N(t_n | w^T \phi(x_n), \beta^{-1}) and p(w | α) = N(w | 0, α^{-1} I), we have

    p(w | \mathbf{t}) \propto \prod_{n=1}^{N} N(t_n | w^T \phi(x_n), \beta^{-1}) \; N(w | 0, \alpha^{-1} I)

• The log of the posterior is

    \ln p(w | \mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} \left\{ t_n - w^T \phi(x_n) \right\}^2 - \frac{\alpha}{2} w^T w + \text{const}

• Thus maximization of the posterior is equivalent to minimization of the sum-of-squares error

    E(w) = \frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - w^T \phi(x_n) \right\}^2 + \frac{\lambda}{2} w^T w

  with the addition of the quadratic regularization term w^T w and λ = α / β.

Bayesian Linear Regression Example (Straight-Line Fit)
• Single input variable x, single target variable t
• The goal is to fit the linear model y(x, w) = w_0 + w_1 x
• The goal of linear regression here is to recover w = [w_0, w_1] given the samples

Data Generation
• Synthetic data are generated from f(x, w) = w_0 + w_1 x with parameter values w_0 = −0.3 and w_1 = 0.5
– First choose x_n from U(x | −1, 1), then evaluate f(x_n, w)
– Add Gaussian noise with standard deviation 0.2 to get the target t_n
– The precision parameter is therefore β = (1/0.2)^2 = 25
• For the prior over w we choose α = 2, i.e., p(w | α) = N(w | 0, α^{-1} I)

Sampling p(w) and p(w|t)
• Each sample of w represents a straight line y(x, w) = w_0 + w_1 x in data space (modified as examples are observed)
– [Figure: the distribution p(w) with no examples, and p(w | t) with two examples, each shown next to six sampled regression lines.]
• Goal of Bayesian linear regression: determine p(w | t); a sketch of this computation follows.
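A minimal sketch of this straight-line example, assuming NumPy: it computes m_N and S_N from the formulas above for the synthetic data (w_0 = −0.3, w_1 = 0.5, noise standard deviation 0.2, β = 25, α = 2) and draws six lines from the posterior after 1, 2, and 20 observations. The sample count and random seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

w_true = np.array([-0.3, 0.5])      # true parameters of the generating line
alpha, beta = 2.0, 25.0             # prior precision and noise precision

def phi(x):
    """Basis for the straight-line model: phi(x) = [1, x]."""
    return np.stack([np.ones_like(x), x], axis=-1)

def posterior(x, t, alpha, beta):
    """Posterior N(w | m_N, S_N) for the zero-mean isotropic prior N(w | 0, alpha^{-1} I)."""
    Phi = phi(x)
    S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

# Synthetic data: x_n ~ U(-1, 1), t_n = f(x_n, w) + Gaussian noise with std 0.2.
x = rng.uniform(-1.0, 1.0, 20)
t = phi(x) @ w_true + rng.normal(0.0, 0.2, 20)

# Because the posterior after n points is the prior for point n + 1, the batch
# formula applied to the first n points reproduces the sequential update.
for n in (1, 2, 20):
    m_N, S_N = posterior(x[:n], t[:n], alpha, beta)
    lines = rng.multivariate_normal(m_N, S_N, size=6)   # six sampled regression lines
    print(f"n = {n:2d}   m_N = {np.round(m_N, 3)}   first sampled line w = {np.round(lines[0], 3)}")
```

As n grows, m_N approaches the true parameters (−0.3, 0.5) and S_N shrinks, which is the behaviour pictured in the sequential-learning figure described next.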
Sequential Bayesian Learning
• Since there are only two parameters, we can plot the prior and posterior distributions in parameter space (w_0, w_1)
• We look at the sequential update of the posterior as data points (x, t) are observed one at a time
– [Figure: each row shows the likelihood p(t | x, w) of the most recent data point alone (as a function of w), the current prior/posterior over w, and six regression functions y(x, w) with w drawn from that posterior. Rows correspond to: no data observed (prior only); after the first data point (x_1, t_1), where the band of posterior mass represents values of (w_0, w_1) whose lines pass near that point; after the second data point; and after twenty data points, where the posterior has concentrated around the true parameter values (white cross). With infinitely many points the posterior becomes a delta function centered at the true parameters.]

Generalization of the Gaussian Prior
• The Gaussian prior over the parameters is

    p(w | \alpha) = N(w | 0, \alpha^{-1} I)

  Maximization of the posterior ln p(w | t) is then equivalent to minimization of the sum-of-squares error

    E(w) = \frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - w^T \phi(x_n) \right\}^2 + \frac{\lambda}{2} w^T w

• Other priors yield the lasso and its variations:

    p(w | \alpha) = \left[ \frac{q}{2} \left( \frac{\alpha}{2} \right)^{1/q} \frac{1}{\Gamma(1/q)} \right]^{M} \exp\left( -\frac{\alpha}{2} \sum_{j=1}^{M} |w_j|^q \right)

– q = 2 corresponds to the Gaussian (q = 1 gives the lasso)
– This corresponds to minimization of the regularized error function

    \frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - w^T \phi(x_n) \right\}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q

Predictive Distribution
• Usually we are not interested in the value of w itself, but in predicting t for a new value of x: p(t | t, X, x), or p(t | t) for short
– We leave out the conditioning variables X and x for convenience
• Marginalizing over the parameter variable w is the standard Bayesian approach
– By the sum rule of probability, p(t) = ∫ p(t, w) dw = ∫ p(t | w) p(w) dw
– We can now write

    p(t | \mathbf{t}) = \int p(t | w) \, p(w | \mathbf{t}) \, dw

Predictive Distribution with α, β, x, t
• We can predict t for a new value of x using

    p(t | \mathbf{t}) = \int p(t | w) \, p(w | \mathbf{t}) \, dw

  where the conditioning variables X and x have again been left out for convenience.
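The material above stops before evaluating this integral. For the linear-Gaussian model used here it has the standard closed form p(t | x, t) = N(t | m_N^T φ(x), σ_N^2(x)) with σ_N^2(x) = 1/β + φ(x)^T S_N φ(x), a known result not derived in the text above. The following minimal sketch, assuming NumPy and reusing the straight-line example's settings, computes the predictive mean and standard deviation on a grid of new inputs.

```python
import numpy as np

# Sketch of the predictive distribution for the straight-line example, using the
# standard closed form for this linear-Gaussian model (not derived in the text above):
#   p(t | x, t) = N( t | m_N^T phi(x), 1/beta + phi(x)^T S_N phi(x) )
rng = np.random.default_rng(1)
alpha, beta = 2.0, 25.0
w_true = np.array([-0.3, 0.5])

def phi(x):
    """Straight-line basis phi(x) = [1, x]."""
    return np.stack([np.ones_like(x), x], axis=-1)

# Synthetic data and posterior, as in the previous sketch.
x = rng.uniform(-1.0, 1.0, 20)
t = phi(x) @ w_true + rng.normal(0.0, 0.2, 20)
S_N = np.linalg.inv(alpha * np.eye(2) + beta * phi(x).T @ phi(x))
m_N = beta * S_N @ phi(x).T @ t

# Predictive mean and variance on a grid of new inputs.
x_new = np.linspace(-1.0, 1.0, 5)
Phi_new = phi(x_new)
pred_mean = Phi_new @ m_N
pred_var = 1.0 / beta + np.sum((Phi_new @ S_N) * Phi_new, axis=1)

for xg, m, v in zip(x_new, pred_mean, pred_var):
    print(f"x = {xg:+.2f}   predictive mean = {m:+.3f}   std = {np.sqrt(v):.3f}")
```

The predictive variance has two parts: 1/β from the observation noise and φ(x)^T S_N φ(x) from the remaining uncertainty in w; the second part shrinks as more data are observed.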