HST 190: Introduction to Biostatistics


Lecture 5: Multiple linear regression

Multiple linear regression
• Last time, we introduced linear regression in the simple case of a single outcome y and a single covariate x, with model y = α + βx + ε, ε ~ N(0, σ²)
• Now, consider multiple linear regression, predicting y via y = α + β₁x₁ + ⋯ + βₖxₖ + ε, ε ~ N(0, σ²)
• In this model, βⱼ represents the average increase in y corresponding to a one-unit increase in xⱼ when all other x’s are held constant
    § the βⱼ’s are called regression coefficients

Fitting multiple linear regression models
• Like simple linear regression, multiple linear regression can be fit by minimizing the sum of squared residuals ∑ᵢ₌₁ⁿ eᵢ² = ∑ᵢ₌₁ⁿ (yᵢ − α − β₁xᵢ₁ − ⋯ − βₖxᵢₖ)²
• Formally, letting β = (β₁, …, βₖ),
    (α̂, β̂) = argmin_(α,β) ∑ᵢ₌₁ⁿ (yᵢ − α − β₁xᵢ₁ − ⋯ − βₖxᵢₖ)²
    § unlike simple linear regression, there are no closed formulas for the individual multiple regression estimates
    § however, the solution for (α̂, β̂) simultaneously can be expressed using matrix notation
• We can estimate Var(y|x) by σ̂² = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / (n − k − 1)

• Consider researchers studying the relationship between systolic blood pressure (SBP), age, and gender using multiple linear regression. These variables are recorded in a sample:
    § note “gender” is a categorical variable with two categories. Therefore, we represent it by creating an indicator variable, x₂, that takes values 0 and 1 for men and women, respectively.
    § this is called a dummy variable, since its value (1 vs. 0) is arbitrarily chosen as a stand-in for a non-numeric quantity.

    patient #   y = SBP   x₁ = age   gender   x₂ = female
    1           120       45         Female   1
    2           135       40         Male     0
    3           132       49         Male     0
    4           140       35         Female   1
    etc.
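The least-squares fit above can be sketched directly. This is a minimal illustration in Python (standing in for the course's MATLAB), with made-up data; the variable names and values are hypothetical, not from the lecture. With k = 2 covariates, (α̂, β̂₁, β̂₂) solves the 3×3 normal equations (XᵀX)b = Xᵀy:

```python
# Hypothetical data (not from the lecture): y = SBP, x1 = age, x2 = female.
x1 = [45.0, 40.0, 49.0, 35.0, 52.0, 38.0]
x2 = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
y  = [120.0, 135.0, 132.0, 140.0, 128.0, 133.0]

def fit_mlr(y, x1, x2):
    """Minimize the sum of squared residuals by solving the 3x3 normal equations."""
    n = len(y)
    X = [[1.0, a, f] for a, f in zip(x1, x2)]   # design matrix with intercept column
    # Normal equations: (X'X) b = X'y
    XtX = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(3)] for r in range(3)]
    Xty = [sum(X[i][r] * y[i] for i in range(n)) for r in range(3)]
    # Solve by Gauss-Jordan elimination with partial pivoting
    A = [XtX[r] + [Xty[r]] for r in range(3)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(3):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[r][3] / A[r][r] for r in range(3)]  # (alpha-hat, beta1-hat, beta2-hat)

alpha, b1, b2 = fit_mlr(y, x1, x2)
resid = [yi - (alpha + b1 * a + b2 * f) for yi, a, f in zip(y, x1, x2)]
sigma2_hat = sum(e * e for e in resid) / (len(y) - 2 - 1)   # k = 2 covariates
```

At the minimizing solution the residuals are orthogonal to every column of the design matrix (they sum to zero and are uncorrelated with each covariate), which is exactly the first-order condition for minimizing the sum of squares.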
• If the resulting regression equation was estimated as ŷ = 110 + 0.3x₁ − 10x₂:
    § α̂ = 110 is the estimated average SBP for men (x₂ = 0) at ‘age 0’ (x₁ = 0). It is recommended to “center” continuous covariates around their sample averages to make the intercept more meaningful
    § β̂₁ = 0.3 means that, within each gender, a 1-year increase in age is associated with an increase of 0.3 in average SBP.
• How do we interpret the value of the dummy variable’s coefficient (β̂₂ = −10)? Consider the predicted SBP values for a man and a woman who are both age 40
    § for the man: ŷ = α̂ + β̂₁(40) + β̂₂(0)
    § for the woman: ŷ = α̂ + β̂₁(40) + β̂₂(1)
    § so, β̂₂ is the difference in average values between women and men, holding age constant. That is, β̂₂ = −10 is the difference in average SBP between genders (women minus men) when all other variables are held constant

Interaction terms
• In the regression models above, each explanatory variable is related to the outcome independently of all others
• If an explanatory variable’s effect depends on another explanatory variable, this is called an interaction effect
    § if we believe interaction effects are present, we include them in the model
• For example, if SBP changed with age differently for men and women, we might choose to model y = α + β₁x₁ + β₂x₂ + β₃x₁x₂ + ε
    § in this ‘fullest’ model, β₃x₁x₂ is called an interaction term

• These two proposed models predict SBP differently
    § the original model y = α + β₁x₁ + β₂x₂ + ε has the same SBP slope β₁ for men and women
    § the interaction model y = α + β₁x₁ + β₂x₂ + β₃x₁x₂ + ε has different SBP slopes for men (β₁) and for women (β₁ + β₃)
• Suppose our new fit is ŷ = 110 + 0.2x₁ − 10x₂ + 0.2x₁x₂
    § for men (x₂ = 0), we predict ŷ = 110 + 0.2x₁
    § for women (x₂ = 1), we predict ŷ = 100 + 0.4x₁
• In other words, β₃ represents the difference in the change in average SBP associated with a 1-year increase in age for women versus men
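The gender-specific prediction equations can be checked numerically. A small sketch (Python used for illustration) encodes the fitted interaction model ŷ = 110 + 0.2x₁ − 10x₂ + 0.2x₁x₂ from the slides and recovers the two slopes:

```python
# Fitted interaction model from the slides: age slope is 0.2 for men (x2 = 0)
# and 0.2 + 0.2 = 0.4 for women (x2 = 1).
def predict_sbp(age, female):
    return 110 + 0.2 * age - 10 * female + 0.2 * age * female

men_slope = predict_sbp(41, 0) - predict_sbp(40, 0)      # beta1-hat = 0.2
women_slope = predict_sbp(41, 1) - predict_sbp(40, 1)    # beta1-hat + beta3-hat = 0.4
gender_gap_at_40 = predict_sbp(40, 1) - predict_sbp(40, 0)
```

Note that with an interaction in the model, the woman-minus-man gap is no longer a single number: at age 40 it is −10 + 0.2(40) = −2, and it changes with age.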
Model specification
• We have seen several linear models relating SBP (y), age (x₁), and gender (x₂ = female). What linear regression models can we fit, in principle?
    § y = α + β₁x₁ + ε
    § y = α + β₁x₁ + β₂x₂ + ε
    § y = α + β₁x₁ + β₂x₂ + β₃x₁x₂ + ε
    § y = α + β₁x₁ + β₃x₁x₂ + ε
• In the first model (age only), men and women are constrained to the same slope and intercept
• Adding a “main effect” for gender (second model) allows the same slope with different intercepts
    § assumption that the age–SBP relationship is the same for men and women
• Adding the gender interaction (third model) allows different slopes and different intercepts
    § assumption that the age–SBP relationship differs for men and women
• Omitting the gender main effect (fourth model) allows different slopes but forces the same intercept
    § rarely sensible, so include main effects first before interactions

Categorical variables with multiple categories
• Suppose researchers want to study the relationship between systolic blood pressure, age, and US geographic region using multiple regression. Region is categorized as “Northeast”, “South”, “Midwest”, or “West”.
    § since there are four categories, we must create three dummy variables to uniquely characterize each patient:

    patient #   y = SBP   x₁ = age   region   x₂ = S   x₃ = MW   x₄ = W
    1           120       45         NE       0        0         0
    2           135       40         S        1        0         0
    3           132       49         MW       0        1         0
    4           140       35         W        0        0         1
    etc.
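The dummy coding in the table can be written as a small helper. This is a hypothetical sketch in Python (the function name and data rows are made up, not course code), mapping each region to the indicator triple (x₂, x₃, x₄) with NE as the reference level:

```python
# Four-level categorical variable coded with three dummies; "NE" is the reference.
REGION_LEVELS = ["NE", "S", "MW", "W"]

def region_dummies(region):
    """Return (x2, x3, x4): indicators for S, MW, W. NE maps to (0, 0, 0)."""
    if region not in REGION_LEVELS:
        raise ValueError("unknown region: " + region)
    return tuple(int(region == level) for level in REGION_LEVELS[1:])

# Hypothetical rows (region, age) expanded into design-matrix rows
# (intercept, x1, x2, x3, x4):
rows = [("NE", 45), ("S", 40), ("MW", 49), ("W", 35)]
design = [(1, age) + region_dummies(r) for r, age in rows]
```

Using one dummy per category (four instead of three) would make the columns sum to the intercept column, so the model would not be estimable; dropping one level is what makes it the reference.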
• The fitted regression model is ŷ = α̂ + β̂₁x₁ + β̂₂x₂ + β̂₃x₃ + β̂₄x₄
• Consider four 40-year-old patients, one from each region. Their predicted SBP values will be:
    § NE (x₂ = x₃ = x₄ = 0): ŷ = α̂ + 40β̂₁
    § S (x₂ = 1, x₃ = x₄ = 0): ŷ = α̂ + 40β̂₁ + β̂₂
    § MW (x₃ = 1, x₂ = x₄ = 0): ŷ = α̂ + 40β̂₁ + β̂₃
    § W (x₄ = 1, x₂ = x₃ = 0): ŷ = α̂ + 40β̂₁ + β̂₄
• (β̂₂, β̂₃, β̂₄) represent the differences in prediction between the categories (S, MW, W) and the “baseline” category (or reference level), NE
• For a categorical variable with multiple categories, you must use one fewer dummy variable than the number of categories. One category, by default, becomes the reference (baseline) category.

Inference about a single βⱼ
• Because each parameter βⱼ describes the relationship between xⱼ and y, inference about βⱼ tells us about the strength of that linear relationship
    § in multiple linear regression, this is holding all other included covariates constant
• We can do hypothesis testing and form confidence intervals for a particular βⱼ
• Both require the estimated standard error of β̂ⱼ, ŝe(β̂ⱼ)
    § in simple linear regression, the closed form is ŝe(β̂) = σ̂ / (sₓ√(n − 1))
    § in multiple linear regression, there is no closed form for an individual ŝe(β̂ⱼ)

• The 100(1 − α)% CI for a particular βⱼ is simply β̂ⱼ ± t_(n−k−1, 1−α/2) · ŝe(β̂ⱼ)
• To test the hypothesis H₀: βⱼ = 0 vs. Hₐ: βⱼ ≠ 0, holding all other parameters fixed, we use the test statistic t* = β̂ⱼ / ŝe(β̂ⱼ)
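Given an estimate and its standard error, the test statistic, two-sided p-value, and confidence interval can be sketched in Python. The numbers below are the age coefficient from the fitlm output in this lecture (estimate 1.01226, SE 0.3406004, error df 7); since the standard library has no t distribution, the p-value is computed by numerically integrating the t density with Simpson's rule, and the critical value t_(7, 0.975) ≈ 2.365 is taken from a standard t-table:

```python
import math

def t_density(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(tstat, df, steps=20000):
    """P(|T| > |t*|) via Simpson's rule on [0, |t*|] plus symmetry (steps must be even)."""
    b = abs(tstat)
    h = b / steps
    s = t_density(0.0, df) + t_density(b, df)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * t_density(i * h, df)
    return 2 * (0.5 - s * h / 3)

# Age coefficient from the fitlm output shown in this lecture
est, se, df = 1.01226, 0.3406004, 7
t_star = est / se                       # about 2.972, matching tStat
p_value = two_sided_p(t_star, df)       # about 0.021, matching pValue
t_crit = 2.365                          # t_(7, 0.975), from a t-table
ci = (est - t_crit * se, est + t_crit * se)   # 95% CI, roughly (0.21, 1.82)
```

This reproduces the reported tStat of 2.972 and pValue of about 0.021; the 95% CI excludes 0, consistent with rejecting H₀ at the 5% level.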
• If H₀ is true, t* follows a t distribution with n − k − 1 degrees of freedom, where k is the number of covariates included in the model
    § the testing procedure then follows exactly as before

Matlab example output
• Model output gives all necessary components for the model SBP = α + β₁AGE + β₂WEIGHT + ε

    > fit = fitlm(data, 'sbp ~ age + weight')

    Linear regression model:
        sbp ~ 1 + age + weight

    Estimated Coefficients:
                       Estimate     SE          tStat    pValue
        (Intercept)    60.56178     44.81431    1.351    0.219
        age            1.01226      0.3406004   2.972    0.021
        weight         0.2166858    0.2468829   0.878    0.409

    Number of observations: 10, Error degrees of freedom: 7
    Root Mean Squared Error: 11.583

• Reading the output: Estimate gives each β̂ⱼ, SE gives ŝe(β̂ⱼ), tStat gives t* = β̂ⱼ/ŝe(β̂ⱼ), and pValue gives P(|t*| > t); Number of observations is n, Error degrees of freedom is n − k − 1, and Root Mean Squared Error is σ̂

Model checking
• For the linear model, the assumptions (in order of importance) are:
    § linearity (debatable!)
    § independence of residuals
    § equal spread of points around the line
    § normality of residuals
    § [random sampling from a large population]

Linearity
• Linearity means validity of the model y = α + β₁x₁ + ⋯ + βₖxₖ + ε
• Check this assumption in two ways:
    § graphically (for simple linear regression): look for nonlinearity and outliers. It is more difficult to check linearity in multiple linear regression; consider pairwise scatterplots (plotting each xⱼ against y) and ‘trellis graphs’
    § conceptually: based on the phenomenon of interest and the chosen predictors (e.g., age is related linearly to body weight for children)
• If violated: estimators and predictions may be biased
• Strategies:
    § consider transformations (log y, x², log x, 1/x, etc.) or add interaction terms
    § use nonlinear functions of x: spline (or polynomial) regression or other generalized additive models
• Transformations allow linear regression to model non-linear relationships between x and y, as long as there is a transformation of x or y (or both) that makes the relationship linear. (Figure borrowed from Ramsey FL and Schafer DW, The Statistical Sleuth, 2002.)
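The transformation idea can be sketched with synthetic data. Assuming (hypothetically, not from the text) that y = 5·e^(0.3x) up to multiplicative noise, the (x, y) relationship is curved, but regressing log y on x is a simple linear regression whose slope recovers the growth rate:

```python
import math
import random

# Synthetic exponential-growth data with multiplicative noise
random.seed(0)
xs = [i / 2 for i in range(1, 41)]
ys = [5.0 * math.exp(0.3 * x + random.gauss(0, 0.05)) for x in xs]

# Simple linear regression of log(y) on x
log_ys = [math.log(v) for v in ys]
n = len(xs)
xbar = sum(xs) / n
lbar = sum(log_ys) / n
b1 = (sum((x - xbar) * (l - lbar) for x, l in zip(xs, log_ys))
      / sum((x - xbar) ** 2 for x in xs))
b0 = lbar - b1 * xbar
# b1 estimates the growth rate (0.3); exp(b0) estimates the scale (5.0)
```

On the original scale the fitted slope would have no single interpretation; after the log transformation, b1 is interpreted as the change in log y (roughly, the proportional change in y) per unit of x.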
Independence of Errors
• Independence of errors εᵢ means that:
    § residuals for any two observations are independent of one another (knowing one residual tells you nothing about any other)
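One simple numeric screen for this assumption (a sketch, not from the lecture) is the lag-1 autocorrelation of the residuals, which should be near 0 for independent errors; residuals that alternate in sign, for example, are strongly dependent:

```python
def lag1_autocorr(resid):
    """Sample lag-1 autocorrelation of the residuals; values near 0 are
    consistent with independent errors."""
    n = len(resid)
    m = sum(resid) / n
    num = sum((resid[i] - m) * (resid[i + 1] - m) for i in range(n - 1))
    den = sum((r - m) ** 2 for r in resid)
    return num / den

# Alternating residuals: adjacent residuals predict each other strongly
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
r1 = lag1_autocorr(alternating)   # strong negative lag-1 dependence
```

In practice this matters most for data collected in sequence (e.g., over time); a residuals-versus-order plot serves the same purpose graphically.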