Learn About Analysing Age in Survey Data Using Polynomial Regression in SPSS with Data from the British Crime Survey (2007)
© 2019 SAGE Publications Ltd. All Rights Reserved. This PDF has been generated from SAGE Research Methods Datasets.
Student Guide
Introduction
This SAGE Research Methods Dataset example explores how non-linear effects of the demographic variable age can be accounted for in the analysis of survey data. A previous dataset looked at including age as a categorical dummy variable in a regression model to explore a non-linear relationship between age and a response variable. This example explores the use of polynomial regression to account for non-linear effects. Specifically, the dataset illustrates the inclusion of the demographic variable age as a quadratic (squared) term in an ordinary least squares (OLS) regression, using a subset of data from the British Crime Survey 2007.
Analysing Age as a Non-Linear Predictor Variable
Age is a key demographic variable frequently recorded in survey data as part of a broader set of demographic variables, such as education, income, race, ethnicity, and gender. These help to identify the representativeness of a particular sample, as well as describing participants and providing valuable information to aid analysis.
Age can be considered as continuous, meaning that it can take on any value within a scale. It is also an interval-ratio variable, as each year represents the same interval of time and zero is a meaningful value.
Measurements of age can be of particular interest to social scientists in understanding the changes in behaviours, beliefs, attitudes, and lifestyles that coincide with age. A researcher might want to understand the relationship between age and some other outcome, for example, whether people consume more alcohol as they get older. In a simple regression model, the coefficient for age will identify linear trends between age and the outcome variable where they exist. In the example just given, people might on average drink a certain amount more (or less) with each additional year of age. This approach assumes that the relationship between the dependent variable and age is linear: it can be described with a straight line, and the effect of age is consistent across all ages. The coefficient for age in a regression analysis tells us the expected change in the dependent variable for each additional year of age. This allows us to consider changes in the dependent variable for 10 additional years of age, or 20, and so on, but it does not allow for effects of different magnitudes or different signs at different ages.
If we think about age, we can easily imagine many occasions where the relationship between age and some outcome might be non-linear. A common example is age and income, where income generally increases with age up to a certain point but then declines. Another is use of health services, which might be higher among the young and the old but less frequent in the years between. A scatterplot of the variables of interest might suggest that a curved relationship exists, but in the social sciences the data are often not that clear. A plot of the residuals from the linear regression model might also suggest to a researcher that the relationship between the dependent and independent variables is not a linear one.
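The point about residual plots can be illustrated with a small sketch. The Python below is not part of the SPSS workflow this guide follows, and the data are invented for illustration: a hypothetical outcome that rises with age and then falls. It fits a straight line by least squares and lists the residuals, which come out negative at the youngest and oldest ages and positive in between. A systematic pattern of this kind, rather than random scatter, is what would hint at a curved relationship.

```python
# Hypothetical data (an assumption for illustration only): an outcome
# that rises with age, peaks at 45, and then declines.
ages = list(range(20, 81, 5))                      # 20, 25, ..., 80
outcome = [100 - (a - 45) ** 2 / 10 for a in ages]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares straight line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(ages, outcome)

# Residual = observed value minus the value predicted by the straight line.
residuals = [y - (intercept + slope * a) for a, y in zip(ages, outcome)]

for a, r in zip(ages, residuals):
    print(f"age {a}: residual {r:+.1f}")
```

With real survey data the pattern would be noisier, so a residual plot is a prompt for further checks rather than proof of non-linearity.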
If a researcher has good theoretical reason to suspect a relationship might be non-linear, there are various ways to test that hypothesis. In another SAGE Dataset example, we look at how to turn a continuous age variable into a categorical variable for analysis, splitting age into coherent categories so that we can make comparisons between age groups. Another way of exploring non-linear relationships is through polynomial regression analysis, which we consider here. In this case, we are not comparing one age group with another age group but exploring how the relationship between age and the dependent variable changes at different values of the age variable (or, in other words, at different ages). We can model a non-linear relationship by fitting a quadratic function, including both age and age-squared (age multiplied by age) as independent variables in the regression model.
What Is Polynomial Regression?
Polynomial regression allows us to account for non-linear relationships between a dependent and independent variable, using linear regression methods. Typically, where a curvilinear relationship is hypothesised between some dependent and independent variable, the square of the independent variable can be added to the model to see whether it improves the model fit, in comparison with the original model. This is called a quadratic or second-order polynomial. Where the parameter estimate for the quadratic term is statistically significant (p < .05), we see evidence of a non-linear relationship, although the exact nature of that relationship is not confirmed. Additional polynomials can be tested, for example, a cubic or third-order polynomial (independent variable cubed).
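To see why adding a squared term lets the effect of age vary, it may help to write the quadratic model out. The notation here is an assumption made for illustration: Y is the dependent variable, Age is the age variable, the betas are regression coefficients, and e is the error term.

```latex
% Quadratic (second-order polynomial) model in age
Y_i = \beta_0 + \beta_1 \mathrm{Age}_i + \beta_2 \mathrm{Age}_i^2 + e_i

% Slope of the expected outcome with respect to age
\frac{\partial E[Y_i]}{\partial \mathrm{Age}_i} = \beta_1 + 2 \beta_2 \mathrm{Age}_i
```

The slope of the fitted curve is therefore not a single number but depends on age itself. If the coefficient on the squared term is negative, for instance, the slope becomes smaller as age increases and can eventually change sign.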
Higher order polynomials can additionally be tested, but the aim should be the lowest order polynomial that fits the data adequately, because adding a term can never reduce the variance explained in the sample. Regression models are typically estimated via OLS, which produces estimates of the slopes and intercept that minimize the sum of the squared differences between the observed values of the dependent variable and the values predicted by the regression model. In the simplest case, a dependent (or response) variable is expressed as a linear function of an independent (or explanatory) variable. This requires estimating an intercept (or constant) and a slope (or coefficient) for the independent variable that describes the change in the dependent variable for a one-unit increase in the independent variable. The coefficients of a regression model describe the strength and direction of the slope, and the relationship can be plotted as a straight line.
Polynomial regression analysis is a special type of multiple regression which is linear in the parameters but not in the variables. We enter the square of our independent variable (or its cube, or a higher power) in the model as if it were another independent variable in the analysis. This allows us to model a relationship where the effect of an independent variable on the dependent variable depends on the value of that independent variable. The model remains linear and additive in the coefficients, as in multiple regression, but the terms in the independent variable can be non-linear. We can see this more clearly when we consider the following equations. A simple linear regression model is often defined as in Equation 1.
$$Y_i = \beta_0 + \beta_1 X_i + e_i \tag{1}$$
Where:
• $Y_i$ = individual values of the dependent variable
• $X_i$ = individual values of the independent variable
• $\beta_0$ = the intercept, or constant, associated with the regression model
• $\beta_1$ = the coefficient operating on the independent variable
• $e_i$ = the unmodelled random, or stochastic, component of the dependent variable; often called the error term or the residual of the model
This model assumes a linear relationship, where X has the same effect on Y for all values of X. A polynomial regression model for an nth degree polynomial can be defined as in Equation 2.
$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \beta_3 X_i^3 + \cdots + \beta_n X_i^n + e_i \tag{2}$$
We see then that the values of X can be squared, cubed, or raised to the power of n ($X^2$, $X^3$, $X^n$), but the model remains linear in the regression coefficients ($\beta_0$, $\beta_1$, $\beta_2$, $\beta_3$, $\beta_n$). The effect of X on Y will now be different for each different value of X.
It is advisable to mean centre the X variable before including it in a polynomial regression, firstly to aid interpretation of the results and secondly to reduce the correlation between the X variable and the polynomial term, which is X multiplied by itself. A typical approach where a non-linear relationship is suspected is to start by including a quadratic term (X-squared) and, if this has a significant effect which improves the model, to include a cubed term to see whether it improves the model fit further.
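The second reason for mean centring can be checked numerically. The sketch below uses Python rather than SPSS, and the ages are a small invented sample, so treat it as an illustration only. It compares the correlation between age and age-squared before and after subtracting the mean age.

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical ages (an assumption for illustration, not survey data).
ages = [18, 23, 31, 37, 42, 48, 55, 61, 67, 74, 80]

mean_age = sum(ages) / len(ages)
centred = [a - mean_age for a in ages]          # mean-centred age

# Correlation between the variable and its square, raw versus centred.
r_raw = pearson(ages, [a ** 2 for a in ages])
r_centred = pearson(centred, [c ** 2 for c in centred])

print(f"raw age vs age squared:         r = {r_raw:.3f}")
print(f"centred age vs centred squared: r = {r_centred:.3f}")
```

In this roughly symmetric sample the raw correlation is close to 1, while the centred correlation is close to 0. When the age distribution is strongly skewed, centring still reduces the correlation but may not remove it entirely.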
As a general principle, it is advisable to keep to the simplest model that provides the best fit and to make a judgement about whether any additional terms enhance the model enough to warrant inclusion. Otherwise overfitting could occur, meaning that modelling results are over-dependent on the characteristics of the particular sample being analysed. All lower order terms must always be included in the model: for example, a model with age-squared must also include age, a model with age-cubed must include both age-squared and age, and so on. Most commonly, you would be including squared or cubed terms in the model.
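As a rough illustration of a quadratic model that keeps its lower order term, the sketch below fits a constant, centred age, and centred age-squared by OLS. It is written in Python rather than SPSS, the helper functions `solve` and `ols` are hypothetical names introduced here, and the data are invented and noise-free so that the coefficients used to generate them can be recovered.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                      # back substitution
        known = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical, noise-free data generated from known coefficients.
ages = list(range(20, 81, 5))
mean_age = sum(ages) / len(ages)
centred = [a - mean_age for a in ages]
y = [30 + 0.5 * c - 0.02 * c ** 2 for c in centred]

# The design includes the constant, age, AND age squared (all lower order terms).
design = [[1.0, c, c ** 2] for c in centred]
b0, b1, b2 = ols(design, y)

# Age at which the fitted curve turns, i.e. where b1 + 2 * b2 * c = 0.
turning_point_age = mean_age - b1 / (2 * b2)
print(b0, b1, b2, turning_point_age)
```

Because age is mean centred, the intercept is the expected outcome at the mean age and the coefficient on centred age is the slope of the curve at that age, which is one reason centring aids interpretation.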