Applied Linear Regression
Total Page:16
File Type:pdf, Size:1020Kb
Applied Linear Regression Jamie DeCoster Department of Psychology University of Alabama 348 Gordon Palmer Hall Box 870348 Tuscaloosa, AL 35487-0348 Phone: (205) 348-4431 Fax: (205) 348-8648 March 8, 2004 Textbook references refer to Cohen, Cohen, West, & Aiken's (2003) Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. I would like to thank Angie Maitner for comments made on an earlier version of these notes. For future versions of these notes or help with data analysis visit http://www.stat-help.com ALL RIGHTS TO THIS DOCUMENT ARE RESERVED Contents 1 Introduction and Review 1 2 Bivariate Correlation and Regression 8 3 Multiple Correlation and Regression 19 4 Regression Assumptions and Basic Diagnostics 27 5 Sequential Regression, Stepwise Regression, and Analysis of IV Sets 35 6 Dealing with Nonlinear Relationships 43 7 Interactions Among Continuous IVs 49 8 Regression with Categorical IVs 57 9 Interactions with Categorical IVs 67 10 Outlier and Multicollinearity Diagnostics 73 i Chapter 1 Introduction and Review 1.1 Data, Data Sources, and Data Sets ² Most generally, data can be de¯ned as a list of numbers with meaningful relationships. We are inter- ested in data because understanding the relationships among the numbers can help us understand the relationships among the things that the numbers measure. ² The numbers that you collect from an experiment, survey, or archival source is known as a data source. Before you can learn anything from a data source, however, you must ¯rst translate it into a data set. A data set is a representation of a data source, de¯ning a set of \variables" that are measured on a set of \cases." ± A variable is simply a feature of an object that can categorized or measured by a number. A variable takes on di®erent values to reflect the particular nature of the object being observed. The values that a variable takes will change when measurements are made on di®erent objects at di®erent times. A data set will typically contain measurements on several di®erent variables. ± Each time that we record information about an object we create a case in the data set. Cases are also sometimes referred to as observations. Like variables, a data set will typically contain multiple cases. The cases should all be derived from observations of the same type of object, with each case representing a di®erent example of that type. The object \type" that de¯nes your cases is called your unit of analysis. Sometimes the unit of analysis in a data set will be very small and speci¯c, such as the individual responses on a questionnaire. Sometimes it will be very large, such as companies or nations. The most common unit of analysis in social science research is the participant or subject. ± Data sets are traditionally organized such that each column represents a di®erent variable and each row represents a di®erent case. The value in a particular cell of the table represents the value of the variable corresponding to the column for the case corresponding to the row. For example, if you give a survey to a bunch of di®erent people, you might choose to organize your data set so that each variable represents an item on the survey and each row represents a di®erent person who answered your survey. A cell inside the data set would hold the value that the person represented by the row gave to the item represented by the column. ² The distinction between a data source and a data set is important because you can create a number of di®erent data sets from a single data source by choosing di®erent de¯nitions for your variables or cases. However, all of those data sets would all still reflect the same relationships among the numbers found in the data source. ² If the value of a variable has actual numerical meaning (so that it measures the amount of something that a case has), it is called a continuous variable. If the value of a variable is used to indicate a group the case is in, it is called a categorical variable. In terms of the traditional categorizations given to scales, a continuous variable would have either an ordinal, interval, or ratio scale, while a categorical variable would have a nominal scale. 1 ² When describing a data set you should name your variables and state whether they are continuous or categorical. For each continuous variable you should state its units of measurement, while for each categorical variable you should state how the di®erent values correspond to di®erent groups. You should also report the unit of analysis of the data set. You typically would not list the speci¯c cases, although you might describe where they came from. 1.2 Order of operations ² For many mathematical and statistical formulas, you can often get di®erent results depending on the order in which you decide to perform operations inside the formula. Consider equation 1.1. X = 12 + 6 ¥ 3 (1.1) If we decide to do the addition ¯rst (12 + 6 = 18; 18 ¥ 3 = 6) we get a di®erent result than if we decide to do the division ¯rst (6 ¥ 3 = 2; 12 + 2 = 14). Mathematicians have therefore developed a speci¯c order in which you are supposed to perform the calculations found in an equation. According to this, you should perform your operations in the following order. 1. Resolve anything within parentheses. 2. Resolve any functions, such as squares or square roots. 3. Resolve any multiplication or division operations. 4. Resolve any addition or subtraction operations. This would mean that in equation 1.1 above we would perform the division before we would do the addition, giving us a result of 14. ² One of the most notable things is that addition and subtractionP are at the bottom of the list. This also includes the application of the summation function ( ). For example, in the equation X X = YiZi (1.2) we must ¯rst multiply the values of Y and Z together for each case before we add them up in the summation. ² It is also important to note that terms in the two parts of a fraction are assumed to have parentheses around them. For example, in the equation 5 + 1 a = (1.3) 7 ¡ 4 we must ¯rst resolve the numerator (5 + 1 = 6) and the denominator (7 ¡ 4 = 3) before we do the actual division (6 ¥ 3 = 2). 1.3 General information about statistical inference ² Most of the time that we collect data from a group of subjects we are interested in making inferences about some larger group of which our subjects were a part. We typically refer to the group we want to generalize our results to as the theoretical population. The group of all the people that could potentially be recruited to be in the study is the accessible population, which ideally is a representative subset of the theoretical population. The group of people that actually participate in our study is our sample, which ideally is a representative subset of the accessible population.. ² Very often we will calculate the value of a statistic in our sample and use that to draw conclusions about the value of a corresponding parameter in the underlying population. Populations are usually too large for us to measure their parameters directly, so we often calculate a statistic from a sample drawn from the population to give us information about the likely value of the parameter. The procedure of generalizing from sample statistics to population parameters is called statistical inference. 2 For example, let us say that you wanted to know the average height of 5th grade boys. What you might do is take a sample of 30 5th grade boys from a local school and measure their heights. You could use the average height of those 30 boys (a sample statistic) to make a guess about the average height of all 5th grade boys (a population parameter). ² We typically use Greek letters when referring to population parameters and normal letters when refer- ring to sample statistics. For example, the symbol ¹ is commonly used to represent mean value of a population, while the symbol X is commonly used to represent the mean value of a sample. ² One type of statistical inference you can make is called a hypothesis test. A hypothesis test uses the data from a sample to decide between a null hypothesis and an alternative hypothesis concerning the value of a parameters in the population. The null hypothesis usually makes a speci¯c claim about the parameters (like saying that the average height of 5th grade boys is 60 inches), while the alternative hypothesis makes a claim that the null hypothesis is false in some way. Sometimes the alternative hypothesis simply says that the null hypothesis is false (like saying that the average height of 5th grade boys is not 60 inches). Other times the alternative hypothesis says the null hypothesis is wrong in a particular way (like saying that the average height of 5th grade boys is less than 60 inches). For example, you might have a null hypothesis stating that the average height of 5th grade boys is equal to the average height of 5th grade girls, and an alternative hypothesis stating that the two heights are not equal. You could represent these hypotheses using the following notation. H0 : ¹fboysg = ¹fgirlsg Ha : ¹fboysg 6= ¹fgirlsg where H0 refers to the null hypothesis and Ha refers to the alternative hypothesis. ² To actually perform a hypothesis test you must collect data from a sample drawn from the population of interest.