Learn to Test for Heteroscedasticity in SPSS with Data from the China Health and Nutrition Survey (2006)

© 2015 SAGE Publications, Ltd. All Rights Reserved. This PDF has been generated from SAGE Research Methods Datasets.

SAGE Research Methods Datasets Part 1

Student Guide

Introduction

This dataset example introduces readers to testing for heteroscedasticity following a linear regression analysis. Linear regression rests on several assumptions, one of which is that the variance of the residuals from the model is constant and unrelated to the independent variable(s). Constant variance is called homoscedasticity, while non-constant variance is called heteroscedasticity. This example describes heteroscedasticity, discusses its consequences, and shows how to detect it using data on adults from the 2006 China Health and Nutrition Survey (CHNS) (http://www.cpc.unc.edu/projects/china). Specifically, we test whether systolic blood pressure is a linear function of a person’s age. After performing the regression, we show how to examine the results for evidence of heteroscedasticity. High blood pressure is associated with a number of negative health outcomes. Results from an analysis like this could therefore have implications for individual behavior and public health policy.

What Is Heteroscedasticity?

Linear regression analysis expresses a dependent variable as a linear function of one or more independent variables. Equation 1 shows an example of a simple linear model with a single independent variable:

(1) $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

Where:
• $Y_i$ = individual values of the dependent variable
• $X_i$ = individual values of the independent variable
• $\beta_0$ = the intercept, or constant, associated with the regression line
• $\beta_1$ = the slope of the regression line
• $\varepsilon_i$ = the unmodeled random, or stochastic, component of the dependent variable, often called the error term or the residual of the model.

Linear regression models are typically estimated using ordinary least squares (OLS) regression. OLS regression rests on several assumptions, one of which is that the variance of the residuals from the regression model ($\varepsilon_i$) is constant and unrelated to the independent variable(s). Heteroscedasticity means that the variance of the residuals is not constant, which means that an important assumption of OLS has been violated.

Consequences of Heteroscedasticity

The presence of heteroscedasticity does not affect the estimated values of the intercept or slope coefficients of a linear regression model. Those estimates remain unbiased. However, heteroscedasticity does affect the estimated standard errors for those coefficients. It can make them either too large or too small, but most often it makes them too small.

The standard error of any statistic is calculated to provide an estimate of how much that statistic might change if it were calculated again on another random sample of data of the same size taken from the same population. For regression, the standard error for each coefficient provides an estimate of uncertainty about that coefficient. We use both the coefficients and their standard errors when we test hypotheses about those coefficients.
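To make the roles of a coefficient and its standard error concrete, the following is a minimal sketch in Python (not SPSS) of fitting Equation 1 by OLS and forming the ratio of the slope to its standard error. The function name `simple_ols` and the toy age and blood pressure values are hypothetical and are not taken from the CHNS data. The standard error formula shown is the conventional one, which assumes constant residual variance.

```python
import math


def simple_ols(x, y):
    """Fit y = b0 + b1*x by ordinary least squares (one independent variable)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # slope estimate
    b0 = y_bar - b1 * x_bar             # intercept estimate
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # Conventional standard error of the slope; valid only if the
    # residual variance is constant (homoscedasticity).
    s2 = sum(e ** 2 for e in residuals) / (n - 2)
    se_b1 = math.sqrt(s2 / sxx)
    return b0, b1, se_b1, residuals


# Hypothetical toy data (not CHNS values): age and systolic blood pressure.
age = [25, 32, 41, 48, 55, 63, 70, 78]
sbp = [112, 118, 117, 126, 131, 129, 142, 150]

b0, b1, se_b1, residuals = simple_ols(age, sbp)
t_score = b1 / se_b1          # slope divided by its standard error
df = len(age) - 2             # sample size minus number of coefficients
print(f"slope={b1:.3f}, se={se_b1:.3f}, t={t_score:.2f}, df={df}")
```

If the residual variance is not constant, `se_b1` in this sketch is the quantity that becomes unreliable, while `b0` and `b1` remain unbiased.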
For example, we might estimate a regression model like the one presented in Equation 1 and produce an estimate of $\beta_1$ of 2.5. In order to determine whether 2.5 is statistically significantly different from zero, we need to perform a hypothesis test. Specifically, we would test the null hypothesis that $\beta_1 = 0$. To do so, we would:

1. Perform the regression analysis to produce our estimate of $\beta_1$ and its standard error.
2. Divide the estimate of $\beta_1$ by its standard error to produce a T-score.
3. Compare the T-score from the previous step to a Student’s T distribution, with degrees of freedom equal to the sample size minus the number of coefficients estimated as part of the original regression.
4. Determine the level of significance, or p-value, associated with the calculated T-score. Typically, if that p-value is less than 0.05, researchers would declare the result to be statistically significant.

Of course, statistical software generally performs all of these steps for us automatically. However, this process and those computer programs assume that the variance of the residuals is constant. As noted above, heteroscedasticity leads to incorrect estimates of standard errors. As a result, heteroscedasticity will confound subsequent hypothesis testing. Because heteroscedasticity typically produces standard errors that are smaller than they should be, we run the risk of being over-confident in our coefficient estimates and possibly declaring a coefficient estimate to be statistically significant when it is not.

Detecting Heteroscedasticity

There are two main strategies for detecting heteroscedasticity. The first approach is graphical.
For a simple regression model, a two-way scatter plot with the residuals from the regression model plotted on the Y-axis and the independent variable plotted on the X-axis is a good place to start. For a multiple regression model, you could produce separate plots like this for each independent variable, with the residuals plotted on the Y-axis and the independent variable in question plotted on the X-axis. For either a simple or a multiple regression model, it is quite common to plot the residuals on the Y-axis and the predicted value of the dependent variable based on the regression model on the X-axis. Often both the residuals and the predicted values of the dependent variable are standardized before being plotted. Figure 1 shows an example of this approach.

Figure 1: Illustration comparing what homoscedasticity and heteroscedasticity look like using two-way scatter plots with the standardized residual from the regression plotted on the Y-axis and the standardized predicted value of the dependent variable from the regression plotted on the X-axis.

Regardless of which figures you produce, if you have constant variance in the residuals you should see the same level of vertical spread among the residuals across all values plotted along the X-axis as you look at the plot from left to right. However, if you have heteroscedasticity, you will see changes in the vertical spread among the residuals as you look across the figure from left to right. That spread could be steadily growing, steadily shrinking, or showing a more complex pattern such as less variance in the center of the data than at both the lower and upper extremes.
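The left-to-right comparison of vertical spread can also be approximated numerically. The sketch below is an illustration under stated assumptions, not part of the SPSS procedure: it simulates data (hypothetical values, not CHNS) in which the error spread grows with the independent variable, fits the simple regression, sorts the residuals by X, and compares their standard deviation across three bins. The choice of three bins is arbitrary.

```python
import random
import statistics

random.seed(0)

# Simulated data (an assumption for illustration): error spread grows with x.
x = [random.uniform(20, 80) for _ in range(300)]
y = [90 + 0.6 * xi + random.gauss(0, 0.2 * xi) for xi in x]

# Fit the simple regression by OLS.
x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
sxx = sum((a - x_bar) ** 2 for a in x)
b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Pair each x with its residual and sort from left to right along the X-axis.
pairs = sorted((xi, yi - (b0 + b1 * xi)) for xi, yi in zip(x, y))

# Standard deviation of the residuals in three equal-sized bins.
k = len(pairs) // 3
spreads = [
    statistics.stdev([e for _, e in pairs[i * k:(i + 1) * k]])
    for i in range(3)
]
print([round(s, 2) for s in spreads])
```

Roughly equal values across the bins would correspond to the homoscedastic pattern; values that rise or fall from the first bin to the last correspond to the changing vertical spread described above.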
The panel on the left in Figure 1 shows how the vertical spread of the standardized residuals is roughly the same as you look from left to right. However, the panel on the right in Figure 1 shows a smaller vertical spread for the residuals initially that spreads out wider as you move from left to right. The panel on the left shows what homoscedasticity looks like. The panel on the right shows what one of the more common forms of heteroscedasticity looks like.

Graphical methods are useful for seeing the data, but they lack formal statistical precision. As a result, many researchers turn to formal statistical tests for heteroscedasticity. There are many tests of the null hypothesis of homoscedasticity, but the most common is the Breusch–Pagan test (sometimes called the Breusch–Pagan–Godfrey test; also developed independently as the Cook–Weisberg test). The Breusch–Pagan test unfolds in several steps:

1. Estimate the regression model of interest and save the residuals, $\varepsilon_i$, for each observation.
2. Compute a number we will label $\hat{\sigma}^2$ by:
   2.1 squaring each residual, $\varepsilon_i$
   2.2 summing up those squared residuals
   2.3 dividing the result by the size of your sample from the regression model you estimated.
3. Compute a new variable named $\rho_i$ as equal to $\varepsilon_i^2 / \hat{\sigma}^2$.
4. Run a new auxiliary regression with $\rho_i$ as the dependent variable and all of the same independent variables that were part of your original regression.
5. Compute the explained sum of squares from this auxiliary regression and divide it by 2.
6. Compare the result from the previous step to a Chi-squared probability distribution with degrees of freedom equal to the number of independent variables included in the auxiliary regression.
7. If the level of significance, or p-value, associated with the result from the previous step is small (typically below 0.05), you can reject the null hypothesis of homoscedasticity and declare that you do in fact have evidence of heteroscedasticity.

Fortunately, most (but not all) statistical software programs have this and other formal tests for heteroscedasticity built in.
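For readers who want to see the arithmetic, the seven steps above can be sketched in Python for the special case of a single independent variable. This is an illustration, not the SPSS implementation: the function names `ols` and `breusch_pagan_simple` are hypothetical, the data are simulated rather than taken from the CHNS, and the p-value shortcut `math.erfc(math.sqrt(stat / 2))` is valid only for a Chi-squared distribution with 1 degree of freedom.

```python
import math
import random


def ols(x, y):
    """Return intercept, slope, and fitted values for y = b0 + b1*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1, [b0 + b1 * a for a in x]


def breusch_pagan_simple(x, y):
    """Breusch-Pagan statistic and p-value for a one-predictor regression."""
    n = len(x)
    _, _, fitted = ols(x, y)
    resid = [b - f for b, f in zip(y, fitted)]            # step 1: residuals
    sigma2 = sum(e ** 2 for e in resid) / n               # step 2: sigma-hat squared
    rho = [e ** 2 / sigma2 for e in resid]                # step 3: scaled squared residuals
    _, _, rho_fitted = ols(x, rho)                        # step 4: auxiliary regression
    rho_bar = sum(rho) / n
    ess = sum((f - rho_bar) ** 2 for f in rho_fitted)     # step 5: explained sum of squares
    stat = ess / 2
    # Step 6: upper-tail Chi-squared probability, 1 degree of freedom only.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


random.seed(1)
# Simulated data (an assumption): error spread grows with x.
x = [random.uniform(20, 80) for _ in range(300)]
y = [90 + 0.6 * xi + random.gauss(0, 0.2 * xi) for xi in x]

stat, p_value = breusch_pagan_simple(x, y)
# Step 7: a small p-value is evidence against homoscedasticity.
print(f"BP statistic={stat:.2f}, p-value={p_value:.4g}")
```

With more than one independent variable, the auxiliary regression and the Chi-squared degrees of freedom would both need to be generalized; this sketch deliberately leaves that out.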