Linear, Ridge Regression, and Principal Component Analysis


Jia Li
Department of Statistics, The Pennsylvania State University
Email: [email protected]
http://www.stat.psu.edu/∼jiali

Introduction to Regression

- Input vector: X = (X_1, X_2, ..., X_p).
- Output Y is real-valued.
- Predict Y from X by f(X) so that the expected loss E(L(Y, f(X))) is minimized.
- Square loss: L(Y, f(X)) = (Y − f(X))².
- The optimal predictor is f*(X) = argmin_{f(X)} E(Y − f(X))² = E(Y | X).
- The function E(Y | X) is called the regression function.

Example

The number of active physicians in a Standard Metropolitan Statistical Area (SMSA), denoted by Y, is expected to be related to total population (X_1, measured in thousands), land area (X_2, measured in square miles), and total personal income (X_3, measured in millions of dollars). Data are collected for 141 SMSAs, as shown in the following table.

    i  :     1      2      3   ...   139    140    141
    X1 :  9387   7031   7017  ...   233    232    231
    X2 :  1348   4069   3719  ...  1011    813    654
    X3 : 72100  52737  54542  ...  1337   1589   1148
    Y  : 25627  15389  13326  ...   264    371    140

Goal: predict Y from X_1, X_2, and X_3.

Linear Methods

- The linear regression model:
      f(X) = β_0 + Σ_{j=1}^p X_j β_j .
- What if the model is not true?
  - It is a good approximation.
  - Because of the lack of training data and/or smarter algorithms, it is the most we can extract robustly from the data.
- Comments on the inputs X_j:
  - Quantitative inputs.
  - Transformations of quantitative inputs, e.g., log(·), √(·).
  - Basis expansions: X_2 = X_1², X_3 = X_1³, X_3 = X_1 · X_2.

Estimation

- The issue of finding the regression function E(Y | X) is converted to estimating β_j, j = 0, 1, ..., p.
- Training data: {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_i = (x_{i1}, x_{i2}, ..., x_{ip}).
- Denote β = (β_0, β_1, ..., β_p)^T.
- The loss function E(Y − f(X))² is approximated by the empirical loss RSS(β)/N:
      RSS(β) = Σ_{i=1}^N (y_i − f(x_i))² = Σ_{i=1}^N (y_i − β_0 − Σ_{j=1}^p x_{ij} β_j)² .

Notation

- The input matrix X of dimension N × (p + 1):
      | 1  x_{1,1}  x_{1,2}  ...  x_{1,p} |
      | 1  x_{2,1}  x_{2,2}  ...  x_{2,p} |
      | .     .        .             .    |
      | 1  x_{N,1}  x_{N,2}  ...  x_{N,p} |
- Output vector: y = (y_1, y_2, ..., y_N)^T.
- The estimate of β is denoted β̂.
- The fitted values at the training inputs are
      ŷ_i = β̂_0 + Σ_{j=1}^p x_{ij} β̂_j ,
  collected in the vector ŷ = (ŷ_1, ŷ_2, ..., ŷ_N)^T.

Point Estimate

- The least squares estimate of β is
      β̂ = (X^T X)^{-1} X^T y .
- The fitted value vector is
      ŷ = X β̂ = X (X^T X)^{-1} X^T y .
- Hat matrix:
      H = X (X^T X)^{-1} X^T .

Geometric Interpretation

- Each column of X is a vector in an N-dimensional space (NOT the p-dimensional feature vector space):
      X = (x_0, x_1, ..., x_p).
- The fitted output vector ŷ is a linear combination of the column vectors x_j, j = 0, 1, ..., p.
- ŷ lies in the subspace spanned by x_j, j = 0, 1, ..., p.
- RSS(β̂) = ‖y − ŷ‖².
- y − ŷ is perpendicular to the subspace, i.e., ŷ is the projection of y onto the subspace.
- The geometric interpretation is very helpful for understanding coefficient shrinkage and subset selection (a numerical sketch follows below).
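The closed-form estimate, the hat matrix, and the projection picture can be checked numerically. Below is a minimal sketch in Python/NumPy on synthetic data (the sample size, coefficients, and noise level are made up for illustration; this is not the SMSA data set): it computes β̂ = (X^T X)^{-1} X^T y, the fitted vector ŷ = Hy, RSS(β̂), and verifies that the residual y − ŷ is orthogonal to every column of X.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N observations, p predictors (illustrative only).
N, p = 100, 3
X0 = rng.normal(size=(N, p))
y = 2.0 + X0 @ np.array([0.5, -1.0, 0.3]) + rng.normal(scale=0.5, size=N)

# Input matrix with a leading column of ones, dimension N x (p + 1).
X = np.column_stack([np.ones(N), X0])

# Least squares estimate: beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T and fitted values y_hat = H y.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

# Residual sum of squares RSS(beta_hat) = ||y - y_hat||^2.
rss = np.sum((y - y_hat) ** 2)

print(beta_hat)
print(rss)
# Geometric check: the residual is orthogonal to the column space of X.
print(np.allclose(X.T @ (y - y_hat), 0.0, atol=1e-8))
```

Solving the normal equations with np.linalg.solve avoids forming (X^T X)^{-1} explicitly; in practice a QR-based routine such as np.linalg.lstsq would be the more numerically stable choice.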
Example Results for the SMSA Problem

- Ŷ_i = −143.89 + 0.341 X_{i1} − 0.0193 X_{i2} + 0.255 X_{i3}.
- RSS(β̂) = 52942336.

If the Linear Model Is True

- E(Y | X) = β_0 + Σ_{j=1}^p X_j β_j.
- The least squares estimate of β is unbiased: E(β̂_j) = β_j, j = 0, 1, ..., p.
- To draw inferences about β, further assume Y = E(Y | X) + ε, where ε ∼ N(0, σ²) and ε is independent of X.
- The x_{ij} are regarded as fixed; the Y_i are random due to ε.
- Estimation accuracy: Var(β̂) = (X^T X)^{-1} σ².
- Under these assumptions, β̂ ∼ N(β, (X^T X)^{-1} σ²).
- Confidence intervals can be computed and significance tests can be performed.

Gauss-Markov Theorem

- Assume the linear model is true.
- For any linear combination of the parameters β_0, ..., β_p, denoted by θ = a^T β, the estimate a^T β̂ is unbiased since β̂ is unbiased.
- The least squares estimate of θ is θ̂ = a^T β̂ = a^T (X^T X)^{-1} X^T y ≜ ã^T y, which is linear in y.
- Suppose c^T y is another unbiased linear estimate of θ, i.e., E(c^T y) = θ.
- The least squares estimate yields the minimum variance among all linear unbiased estimates:
      Var(ã^T y) ≤ Var(c^T y) .
- The coefficients β_j, j = 0, 1, ..., p, are special cases of a^T β, where a has a single non-zero element equal to 1.

Subset Selection and Coefficient Shrinkage

- Biased estimation may yield better prediction accuracy.
- A toy example with squared loss: assume β̂ ∼ N(1, 1), so E(β̂ − 1)² = Var(β̂) = 1. For the shrunken estimate β̃ = β̂/a with a ≥ 1,
      E(β̃ − 1)² = Var(β̃) + (E(β̃) − 1)² = 1/a² + (1/a − 1)² .
  The squared error loss is reduced by shrinking the estimate (a numerical check of this calculation appears after this section).
- Practical consideration: interpretation. Sometimes we are not satisfied with a "black box".

Subset Selection

- To choose k predictor variables from the total of p variables, search for the subset yielding the minimum RSS(β̂).
- Forward stepwise selection: start with the intercept, then sequentially add to the model the predictor that most improves the fit.
- Backward stepwise selection: start with the full model and sequentially delete predictors.
- How to choose k: stop forward or backward stepwise selection when no predictor produces an F-ratio statistic greater than a threshold (a sketch of the forward procedure is also given below).
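As a quick numerical check of the shrinkage calculation above (β̂ ∼ N(1, 1), β̃ = β̂/a), the following sketch compares a Monte Carlo estimate of E(β̃ − 1)² with the formula 1/a² + (1/a − 1)². The sample size, seed, and grid of a values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
beta_hat = rng.normal(loc=1.0, scale=1.0, size=1_000_000)  # beta_hat ~ N(1, 1)

for a in [1.0, 1.5, 2.0, 3.0]:
    beta_tilde = beta_hat / a
    mc_risk = np.mean((beta_tilde - 1.0) ** 2)      # Monte Carlo E(beta_tilde - 1)^2
    formula = 1.0 / a**2 + (1.0 / a - 1.0) ** 2     # variance + squared bias
    print(f"a={a:.1f}  monte carlo={mc_risk:.4f}  formula={formula:.4f}")
```

For a = 1 the risk is 1; for moderate a it drops below 1, illustrating how a biased (shrunken) estimate can have smaller squared error loss.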
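A minimal sketch of forward stepwise selection as described in the Subset Selection discussion, again on synthetic data. The helper names and the use of a fixed number of steps k (rather than the F-ratio stopping rule mentioned above) are simplifying assumptions for illustration.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least squares fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return r @ r

def forward_stepwise(X0, y, k):
    """Greedily select k predictors (columns of X0), starting from the intercept."""
    N, p = X0.shape
    selected = []
    design = np.ones((N, 1))            # intercept-only model
    for _ in range(k):
        remaining = [j for j in range(p) if j not in selected]
        # Add the predictor that most improves the fit, i.e., gives the smallest RSS.
        scores = [rss(np.column_stack([design, X0[:, [j]]]), y) for j in remaining]
        best = remaining[int(np.argmin(scores))]
        selected.append(best)
        design = np.column_stack([design, X0[:, [best]]])
    return selected

# Synthetic example: only the first two predictors matter.
rng = np.random.default_rng(1)
X0 = rng.normal(size=(200, 6))
y = 1.0 + 2.0 * X0[:, 0] - 1.5 * X0[:, 1] + rng.normal(scale=0.5, size=200)
print(forward_stepwise(X0, y, k=3))
```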
Ridge Regression: Centered Inputs

- Suppose the inputs x_j, j = 1, ..., p, have had their means removed.
- Then β̂_0 = ȳ = Σ_{i=1}^N y_i / N.
- If we also remove the mean of the y_i, we can assume
      E(Y | X) = Σ_{j=1}^p β_j X_j .
- The input matrix X then has p (rather than p + 1) columns.
- β̂ = (X^T X)^{-1} X^T y and ŷ = X (X^T X)^{-1} X^T y.

Singular Value Decomposition (SVD)

- If the column vectors of X are orthonormal, i.e., the variables X_j, j = 1, 2, ..., p, are uncorrelated and have unit norm, then the β̂_j are the coordinates of y on the orthonormal basis X.
- In general,
      X = U D V^T .
- U = (u_1, u_2, ..., u_p) is an N × p orthogonal matrix; u_j, j = 1, ..., p, form an orthonormal basis for the space spanned by the column vectors of X.
- V = (v_1, v_2, ..., v_p) is a p × p orthogonal matrix; v_j, j = 1, ..., p, form an orthonormal basis for the space spanned by the row vectors of X.
- D = diag(d_1, d_2, ..., d_p), where d_1 ≥ d_2 ≥ ... ≥ d_p ≥ 0 are the singular values of X.

Principal Components

- The sample covariance matrix of X is S = X^T X / N.
- Eigen decomposition of X^T X:
      X^T X = (U D V^T)^T (U D V^T) = V D U^T U D V^T = V D² V^T .
- The eigenvectors v_j of X^T X are called the principal component directions of X.
- It is easy to see that z_j = X v_j = u_j d_j. Hence u_j is simply the projection of the row vectors of X, i.e., the input predictor vectors, on the direction v_j, scaled by d_j. For example,
      z_1 = ( X_{1,1} v_{1,1} + X_{1,2} v_{1,2} + ... + X_{1,p} v_{1,p} ,
              X_{2,1} v_{1,1} + X_{2,2} v_{1,2} + ... + X_{2,p} v_{1,p} ,
              ... ,
              X_{N,1} v_{1,1} + X_{N,2} v_{1,2} + ... + X_{N,p} v_{1,p} )^T .
- The principal components of X are z_j = d_j u_j, j = 1, ..., p (a numerical verification follows below).
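The SVD and principal-component relations above (X = UDV^T, X^T X = VD²V^T, z_j = Xv_j = d_j u_j) can be verified directly. The following is a sketch on synthetic, centered inputs; the sample size and the way correlation is induced are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 200, 4
X = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))  # correlated predictors (synthetic)
X = X - X.mean(axis=0)                                  # centered inputs

# Thin SVD: X = U D V^T, with U of size N x p and V of size p x p.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# Eigenvalues of X^T X equal the squared singular values, since X^T X = V D^2 V^T.
evals = np.linalg.eigvalsh(X.T @ X)
print(np.allclose(np.sort(d**2), np.sort(evals)))

# Principal components: z_j = X v_j = d_j u_j, column by column.
Z = X @ V
print(np.allclose(Z, U * d))

# Sample covariance S = X^T X / N is diagonalized by V: V^T S V = D^2 / N.
S = X.T @ X / N
print(np.allclose(V.T @ S @ V, np.diag(d**2 / N)))
```

The last check ties the singular values to the sample covariance: the j-th principal component carries variance d_j²/N.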