
Learn About Analysing Age in Survey Data Using Polynomial Regression in SPSS With Data From the British Crime Survey (2007)

© 2019 SAGE Publications Ltd. All Rights Reserved. This PDF has been generated from SAGE Research Methods Datasets.

Student Guide

Introduction

This SAGE Research Methods Dataset example explores how non-linear effects of the demographic variable, age, can be accounted for in analysis of survey data. A previous dataset looked at including age as a categorical dummy variable in a regression model to explore a non-linear relationship between age and a response variable. This example explores the use of polynomial regression to account for non-linear effects. Specifically, the dataset illustrates the inclusion of the demographic variable age as a quadratic (squared) term in an ordinary least squares (OLS) regression, using a subset of data from the British Crime Survey 2007.

Analysing Age as a Non-Linear Predictor Variable

Age is a key demographic variable frequently recorded in survey data as part of a broader set of demographic variables, such as education, income, race, ethnicity, and gender. These help to identify the representativeness of a particular sample, as well as describing participants and providing valuable information to aid analysis.

Age can be considered as continuous, meaning that it can take on any value within a scale. It is also an interval-ratio variable, as each year represents the same interval of time and zero is a meaningful value.

Measurements of age can be of particular interest to social scientists in understanding those changes in behaviours, beliefs, attitudes, and lifestyles that coincide with age. A researcher might want to understand the relationship between age and some other outcome, for example, whether people consume more alcohol as they get older. In a simple regression model, the coefficient for age will identify linear trends between age and the outcome variable where they exist. In the example just given, people might on average drink a certain amount more (or less) with each additional year of age. This approach assumes that the relationship between the dependent variable and age is linear: it can be described with a straight line, and the effect of age is consistent across all ages. The coefficient for age in a simple regression model tells us the expected change in the dependent variable for each additional year of age. This allows us to consider changes in the dependent variable for 10 additional years of age, or 20, and so on, but it does not allow for effects of different magnitudes or different signs at different ages.

If we think about age, we can easily imagine many occasions where the relationship between age and some outcome might be non-linear. A common example is age and income, where income generally increases with age up to a certain point but then declines. Another is use of health services, which might be higher when young and old but less frequent in the years between. A scatterplot of the variables of interest might suggest that a curved relationship exists, but in the social sciences the data are often not that clear. A plot of the model residuals might also suggest to a researcher that the relationship between the dependent and independent variables is not a linear one. If a researcher has good theoretical reason to suspect a relationship might be non-linear, there are various ways to test that hypothesis.

In another SAGE Dataset example, we look at how to turn a continuous age variable into a categorical variable for analysis, splitting age into coherent categories so that we can make comparisons between age groups. Another way of exploring non-linear relationships is through polynomial regression analysis, which we consider here. In this case, we are not comparing one age group with another age group but exploring how the relationship between age and the dependent variable changes at different values of the age variable (in other words, at different ages). We can model a non-linear relationship by fitting a quadratic function, including both age and age-squared (age multiplied by age) as independent variables in the regression model.

What Is Polynomial Regression?

Polynomial regression allows us to account for non-linear relationships between a dependent and independent variable, using linear regression methods. Typically, where a curvilinear relationship is hypothesised between some dependent and independent variable, the square of the independent variable can be added to the model to see whether it improves the model fit, in comparison with the original model. This is called a quadratic or second-order polynomial. Where the estimate for the quadratic term is statistically significant (p < .05), we see evidence of a non-linear relationship, although the exact nature of that relationship is not confirmed. Additional polynomials can be tested, for example, a cubic or third-order polynomial (independent variable cubed). Higher-order polynomials can also be tested, but the aim should be the lowest-order model that adequately describes the relationship.

Regression models are typically estimated via OLS, which produces estimates of the slopes and intercept that minimize the sum of the squared differences between the observed values of the dependent variable and the values predicted based on the regression model. A dependent (or response) variable is expressed as a linear function of an independent (or explanatory) variable. This requires estimating an intercept (or constant) and a slope (or coefficient) for the independent variable that describes the change in the dependent variable for a one-unit increase in the independent variable. The coefficients of a regression model describe the strength and direction of the slope, and the relationship can be plotted as a straight line.

Polynomial regression analysis is a special type of multiple regression which is linear in the parameters but not in the variables. We enter the square of our independent variable (or its cube, or a higher power) in the model as if it were another independent variable in the analysis. This allows us to model a relationship where the effect of an independent variable on the dependent variable depends on the value of that independent variable. The coefficients themselves remain linear and additive as in multiple regression, but the variables themselves can be non-linear. We can see this more clearly when we consider the following equations:

A multiple regression model with two independent variables is often defined as in Equation 1.

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + e_i \tag{1}$$

Where:

• $Y_i$ = individual values of the dependent variable
• $X_{1i}$, $X_{2i}$ = individual values of the first and second independent variables
• $\beta_0$ = the intercept, or constant, associated with the regression model
• $\beta_1$ = the coefficient operating on the first independent variable
• $\beta_2$ = the coefficient operating on the second independent variable
• $e_i$ = the unmodelled random, or stochastic, component of the dependent variable; often called the error term or the residual of the model

This model assumes a linear relationship, where X has the same effect on Y for all values of X.

A polynomial regression model for an nth degree polynomial can be defined as in Equation 2.

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \beta_3 X_i^3 + \cdots + \beta_n X_i^n + e_i \tag{2}$$

We see then that the values of X can be squared, cubed, or raised to the power of n ($X^2$, $X^3$, $X^n$), but the model remains linear in the regression coefficients themselves (β0, β1, β2, β3, βn). The effect of X on Y will now be different for each different value of X.
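The point that a polynomial model is still an ordinary linear regression can be sketched in a few lines of Python. This is an illustrative sketch only, not the SPSS procedure used in this guide: the function name `fit_polynomial` and the small data set are made up here, and the fit uses the textbook normal equations rather than any particular statistics package.

```python
def fit_polynomial(x, y, degree=2):
    """Least-squares fit of y on 1, x, x^2, ..., x^degree (normal equations)."""
    # Design matrix: each power of x is simply another column (another "variable").
    rows = [[xi ** p for p in range(degree + 1)] for xi in x]
    k = degree + 1
    # Augmented matrix for the normal equations (X'X) b = X'y.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    # The matrix is now diagonal; read off the coefficients b0, b1, ..., b_degree.
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Made-up data generated exactly from y = 2 - 0.5x + 0.25x^2 (no error term).
x = [-3, -2, -1, 0, 1, 2, 3]
y = [2 - 0.5 * xi + 0.25 * xi ** 2 for xi in x]
b0, b1, b2 = fit_polynomial(x, y, degree=2)
print(b0, b1, b2)
```

The squared term enters the design matrix exactly as a second predictor would, which is why standard regression software can estimate the model without modification.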

It is advisable to centre the X variable before including it in a polynomial regression, firstly to aid in interpretation of the results and secondly to reduce correlation between the X variable and the polynomial term, which is X multiplied by itself.
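The effect of centring on the correlation between X and X-squared can be illustrated with a short Python sketch. The ages below are hypothetical (one respondent at each age from 16 to 101), not the survey data, and the helper `pearson` is written here only for illustration.

```python
def pearson(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical ages: one respondent at each age from 16 to 101.
ages = list(range(16, 102))
mean_age = sum(ages) / len(ages)
centred = [a - mean_age for a in ages]

# Correlation between age and age-squared, before and after centring.
r_raw = pearson(ages, [a ** 2 for a in ages])
r_centred = pearson(centred, [c ** 2 for c in centred])
print(round(r_raw, 3), round(r_centred, 3))
```

Because these hypothetical ages are perfectly symmetric around their mean, the centred correlation is essentially zero. With real, skewed data the correlation would usually be reduced rather than removed entirely.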

A typical approach where a non-linear relationship is suspected is to start by including a quadratic term (X-squared) and if this has a significant effect which improves the model, including a cubed term to see whether it improves the model fit further. As a general principle it is advisable to keep to the simplest model that provides the best fit and make a judgement about whether any additional terms enhance the model enough to warrant inclusion, otherwise overfitting could occur, meaning that modelling results are over-dependent on the characteristics of the particular sample being analysed. All lower order terms must always be included in the model, so for example, a model with age-squared must also include age, a model with age-cubed includes both age-squared and age, and so on. Most commonly you would be including squared or cubed terms in the model.


Interpreting Polynomial Regression Results

The results of a polynomial regression cannot be interpreted in the same way as a simple or multiple regression model. In the case of a second-order polynomial, which we explore in this example, we are most interested in the coefficient relating to the quadratic term. A statistically significant estimate suggests a non-linear relationship. The coefficient tells us about the strength of the rate of change and the direction. A plot of a polynomial regression model will show a curved line. The number of turns in the curve will be at most one less than the power that X has been raised to. In the case of a simple linear regression, we could think of X as being raised to the power of 1, with no turns in the line. When X is raised to the power of 2, we see one turn; when raised to the power of 3, up to two turns; and so on. A positive coefficient on the quadratic term describes a convex curve, with one turn, shaped like all or part of the letter U. A negative coefficient tells us the curve is concave instead.

We can use the parameter estimates to calculate the value of X at which the curve changes direction, using the following equation:

$$X = \frac{-\beta_1}{2\beta_2} \tag{3}$$

where β1 is the coefficient operating on X and β2 is the coefficient operating on X-squared (in this example, age-squared).

Where the X variable has been mean-centered prior to estimation, the resulting value of the equation should be added to the mean of X, to give the correct value of X at which this change occurs.

$$X = \frac{-\beta_1}{2\beta_2} + \bar{X} \tag{4}$$

Confidence intervals for the parameters will typically be quite large. It is advisable not to extrapolate outside of the range of the data, as you cannot be sure whether the curve may change direction for values outside of the data you have.

Assessing Model Fit

The most common way of assessing how well a regression model fits the data is to use a statistic called R², sometimes known as the “coefficient of determination.”

R² is a measure of how much better the model’s predictions of Y are, compared to simply taking the mean of Y. It is interpreted as the proportion of variance in the dependent variable that can be “explained” by the independent variables.
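The comparison with the mean of Y can be made concrete with a small Python sketch. The function name `r_squared` and the five data values are invented here for illustration; the formula is the usual one minus the ratio of residual to total sum of squares.

```python
def r_squared(y, y_hat):
    """R-squared: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up outcome values.
y = [1, 2, 2, 3, 4]

# Predictions that match the data exactly explain all of the variance.
perfect = r_squared(y, y)

# Predicting the mean of Y for everyone explains none of it.
baseline = r_squared(y, [sum(y) / len(y)] * len(y))
print(perfect, baseline)
```

A model whose predictions are no better than the mean scores 0, and one that predicts every value exactly scores 1.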

Further details of how to compute R² can be found in this SAGE Dataset on Multiple Regression. To assess whether a polynomial model is a better fit, an F test can be conducted to test the difference between the R² values of the two models.
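One common way to write the F test for a change in R² between nested models is sketched below in Python. This is an assumption about the form of the test rather than a description of the exact routine used to produce the output in this guide. The function name `nested_f` and the input values are hypothetical, and no p-value is computed.

```python
def nested_f(r2_reduced, r2_full, n, k_full, q):
    """F statistic for adding q terms to a nested regression model.

    n      = number of observations
    k_full = number of predictors in the larger (full) model
    q      = number of terms added to the smaller (reduced) model
    """
    numerator = (r2_full - r2_reduced) / q
    denominator = (1 - r2_full) / (n - k_full - 1)
    return numerator / denominator


# Hypothetical values chosen only to show the arithmetic:
# a linear model (R-squared = .10) versus a quadratic model (R-squared = .15).
f_stat = nested_f(r2_reduced=0.10, r2_full=0.15, n=103, k_full=2, q=1)
print(round(f_stat, 2))
```

The statistic is compared with an F distribution on q and n − k_full − 1 degrees of freedom. Because published R² values are usually rounded, plugging rounded figures into this formula will not exactly reproduce an F value reported by statistical software.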

Illustrative Example: Perceptions of Safety in Public Spaces and Age

This example uses the 2007 British Crime Survey (teaching subset) to explore non-linearity with respect to age. Specifically, in this case, we look at how safe respondents feel to walk alone after dark and whether sense of safety can be modelled as a linear function of age. Further, we will consider whether any effects of age on feelings of safety are consistent across the sample or whether these effects are different at different ages, by including a quadratic term within the model.

The research questions guiding this example are:

1. Is feeling safe to walk alone after dark related to age?
2. Does the effect of age on feeling safe differ at different ages?

We can also state this in the form of two null hypotheses:

H0a = There is no linear relationship between age and feeling safe to walk alone after dark.
H0b = There is no difference in the effect of age on feeling safe to walk alone after dark across different ages.

The Data

This example uses the 2007 British Crime Survey to explore whether people’s feelings of safety walking alone after dark can be modelled as a second-order polynomial function of age.

This example uses the following two variables from the 2007 British Crime Survey:

• Score on a scale measuring how safe respondents feel to walk alone after dark (walkdark)
• Respondent age (age)

The dependent variable walkdark was measured using a four-point Likert scale ranging from very safe to very unsafe and is being treated as continuous for the purposes of this example. It has a mean of 2.18 with a range of 1–4, with a higher score equating to greater fear of walking alone in the dark. The age variable is continuous, ranging from 16 to 101 with a mean age of 50.4.

Analysing the Data

The distributions of our dependent variable walkdark and of the continuous age variable are shown in Figures 1 and 2 below.


Figure 1: Distribution of Respondent Feeling of Safety Walking Alone After Dark, 2007 British Crime Survey.

Figure 2: Histogram Showing the Distribution of Age as a Continuous Variable, 2007 British Crime Survey.


Figures 1 and 2 show distributions that are close enough to normal not to warrant any concern.

Having explored the variables, we can estimate our first simple regression model. Figure 3 shows the results of a regression of feelings of safety when walking alone after dark on age.

Figure 3: Simple Regression of How Safe Respondent Feels to Walk Alone After Dark on Age, 2007 British Crime Survey.


The top section of the table provides an analysis of variance for the model as a whole. While these results are not the focus of this example, we note that the R-squared figure reported to the upper right of the table measures the proportion of the variance in the dependent variable explained by the model. In this case, the model consists of one independent variable. An R² of .02 means that only about 2% of the variance in feeling safe to walk alone after dark is accounted for by age. An applied researcher would want to develop a model with more explanatory variables to gain a better understanding of feelings of safety.

The bottom section of the table presents the estimates of the intercept, or constant (_cons), and the slope coefficients. It reports an estimate for the intercept, or constant, as equal to approximately 2.19. The constant of a simple regression model can be interpreted as the average expected value of the dependent variable when the independent variable equals zero. Having mean-centred the independent variable, zero is now the average age of our sample, and so the intercept of 2.19 is the expected level of feelings of safety for a person of average age (in this case 50.4).

The estimated value for the slope coefficient linking age to feelings of safety walking alone after dark is approximately .007. This represents the average marginal effect of age on feelings of safety and can be interpreted as the expected change in the dependent variable on average for a one-unit increase in the independent variable. A one-unit increase, in this case, is each additional year of age. The coefficient is statistically significant, based on a p-value of less than .001.

We note then that age is a statistically significant predictor of feelings of safety such that each additional year of age is associated with a .007 increase on the safety score which equates to increased fear of walking in the dark alone. This is a very small effect, but given the range of ages in the data, it can be helpful to think of this as an increase of .07 on a four-point scale, with each additional 10 years of age, or .35 of a point over 50 years.

These results assume that the effect of age is linear and, conditional on this being true, suggest that people are generally more fearful of walking alone in the dark as they get older. Our intuition, however, might lead us to suspect that younger people are also likely to be less confident and perhaps feel (or even be) more at risk, developing in confidence as they age before the physical effects of aging increase a sense of vulnerability.

To test this, we next look at the results from the polynomial regression model of Order 2, which includes both age and age-squared. Figure 4 presents the polynomial regression results.

Figure 4: Results of a Polynomial Regression of Order 2, Using Age and Age-Squared, 2007 British Crime Survey.


Interpretation of the results is not as straightforward as with the first model. First, we note that the R² is .04, only very slightly higher than in the first model. A nested F-test of whether the difference between the R² of the main effects model and the R² of the polynomial model is equal to zero gives a value of 195.97 with an associated p-value of < .0001, as shown in Figure 5. We can therefore reject the null hypothesis and conclude that there is evidence of a “non-linear” relationship between feelings of safety when walking alone after dark and age.

Figure 5: Results of a Nested F-Test Testing the Difference Between the R² Values for the Two Models, 2007–2008 British Crime Survey.

The coefficient for age now provides the rate of change in the relationship between age and feelings of safety at mean age (50.4 years old). The coefficient relating to the quadratic term (age²) provides information about the curve. The coefficient in this case is positive, which means the curve is convex. The value of .0003 governs the steepness of the curve: the larger its magnitude, the faster the effect of age changes with each additional year of age.
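A short derivation may make the roles of the two coefficients clearer. It assumes the quadratic model is fitted to mean-centred age, as described above; the notation $\hat{Y}_i$ for the predicted value is introduced here for illustration and does not appear in the software output.

```latex
\begin{aligned}
\hat{Y}_i &= \beta_0 + \beta_1 (X_i - \bar{X}) + \beta_2 (X_i - \bar{X})^2 \\
\frac{\partial \hat{Y}_i}{\partial X_i} &= \beta_1 + 2\beta_2 (X_i - \bar{X}) \\
\frac{\partial \hat{Y}_i}{\partial X_i} = 0 \;&\Longrightarrow\; X_i = \bar{X} - \frac{\beta_1}{2\beta_2}
\end{aligned}
```

At the mean age the second term vanishes, so the slope equals β1. Away from the mean, the slope changes by 2β2 for each additional year of age, and setting the slope to zero recovers the turning point given in Equation 4.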

We can use these values to calculate the age at which the curve changes direction.

This value is calculated by using x = −b / (2a), where b is the coefficient operating on x (age) and a is the coefficient operating on age-squared. If we plug the regression coefficients into this equation we get the following:

−.0065906 / (2 × .0003394) = −.0065906 / .0006788 = −9.7091927

Having mean-centred the age variable, we must then add this value to the mean, to get the age at which the change in direction occurs. If we add this to the mean of 50.4, we get an age of 40.7 years.
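The same arithmetic can be checked with a few lines of Python. This is only a sketch of the calculation described above: the function name `turning_point` is made up here, and the coefficients and mean are the values quoted in the text.

```python
def turning_point(b_linear, b_quadratic, mean_x=0.0):
    """Value of X at which a fitted quadratic curve changes direction.

    If X was mean-centred before estimation, pass its mean as mean_x
    so the result is expressed on the original scale.
    """
    return -b_linear / (2 * b_quadratic) + mean_x


# Coefficients and mean age as quoted in the text.
age_turn = turning_point(0.0065906, 0.0003394, mean_x=50.4)
print(round(age_turn, 1))  # about 40.7
```

Leaving `mean_x` at its default of zero gives the turning point on the centred scale, which here is roughly 9.7 years below the mean age.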

This is reflected by a plot of predicted scores for the polynomial regression model as shown in Figure 6.

Figure 6: Plot of Predicted Fear of Walking Alone After Dark Across Different Ages, 2007 British Crime Survey.


Although this is not really the focus of this example, having found a statistically significant effect in a polynomial regression model of Order 2, a researcher might decide to test a model of Order 3 to see whether it improves on the previous model. Results shown in Figure 7 suggest that a cubic model is not an improvement and a quadratic model provides a better fit of the data.

Figure 7: Results of a Polynomial Regression of Order 3, Including Age, Age-Squared, and Age-Cubed, 2007 British Crime Survey.


There are multiple diagnostic tests researchers might perform following the estimation of regression models to evaluate whether the model appears to violate any of the OLS assumptions or whether there are other kinds of problems such as particularly influential cases. Describing all of these diagnostic tests is beyond the scope of this example.

Presenting Results

“We used polynomial regression to examine the association between feelings of safety walking alone after dark and age, in a sample of British citizens.

We used a subset of data from the 2007–2008 British Crime Survey to test the following null hypotheses:

H0a = There is no linear relationship between age and feeling safe to walk alone after dark.
H0b = There is no difference in the effect of age on feeling safe to walk alone after dark across different ages.

The data include 11,609 individual respondents. Results presented in Figure 3 show that there is a positive and statistically significant relationship between age and feelings of safety walking alone after dark. Specifically, the results show that for each additional year of age, we would expect an increase in fear of about .007 of a scale point. The R² for the model is .02, which means that approximately 2% of the variance in feelings of safety is explained by age. Results presented in Figure 4 suggest that the relationship between age and feeling safe to walk alone after dark is different for different ages. Including a quadratic term improves the R² for the model; the improvement, although small, is statistically significant. The curvilinear relationship is convex, with feelings of unsafety declining until the age of approximately 41, after which we see an increase in feelings of unsafety walking alone after dark.

Thus, we conclude that on average people feel less safe to walk alone after dark as they get older, but that this relationship is non-linear with both the youngest and the oldest age groups feeling less safe to walk alone after dark than those in the middle age ranges. Further diagnostic tests should be explored to evaluate the robustness of this finding.”

Review

This dataset example has demonstrated basic ways of including the demographic variable age in analysis as a non-linear predictor within a polynomial regression model. A subsample of the British Crime Survey was analysed to demonstrate how a quadratic term could be included in a regression model to explore a non-linear relationship between age and feelings of safety walking alone after dark.

Your Turn

You can download this sample dataset along with a guide showing how to carry out the analysis using statistical software. The sample dataset also includes another variable called walkday, which is a scale measuring how safe respondents feel to walk alone during the day. See whether you can reproduce the analysis presented here with this variable replacing walkdark as the dependent variable.
