
Learn About Partial Correlation in SPSS With Data From Fisher’s Iris Dataset (1936)

© 2019 SAGE Publications, Ltd. All Rights Reserved.

Student Guide

Introduction

This example introduces the partial correlation (also called $r_{XY.Z}$ or partial r). Partial r allows researchers to quantify the linear association between two quantitative variables while removing the effects of one or more other variables, often described as the correlation between X and Y holding Z constant. Like Pearson’s r, partial r ranges between −1 and 1, with more extreme values implying stronger association. It is frequently used by researchers who want to quantify the association between two variables when one or more other variables are presumed to be confounders, that is, when X and Y are both a function of Z.

This example describes partial correlation, discusses the assumptions underlying it, and shows how to compute and interpret it. We illustrate partial correlation using Fisher’s Iris dataset (1936). Specifically, we quantify the partial linear association between two flower properties (sepal length and petal length), controlling for two other flower properties (sepal width and petal width). We also test a hypothesis that the partial correlation is zero. This page provides a link to this sample dataset and a guide to producing the partial correlation coefficient and testing a hypothesis that the population value is 0 using statistical software.

What Is Partial Correlation?

This example introduces readers to the partial correlation statistic, typically denoted $r_{XY.Z}$ or partial r, where r is an estimate of a population correlation value, typically denoted ρ. This statistic quantifies the linear association between two variables (X and Y) conditioning on (also called partialling out, removing the effects of, or holding constant) one or more other variables (Z). It allows researchers to quantify the linear association between two variables with the effects of specified variables removed. It is frequently used in conjunction with a theoretical model, for example, when the two variables of interest are each partially a function of another variable or set of variables but are also hypothesized to be related to each other. A similar concept, the semipartial (or part) correlation, is also discussed below. As the distributions of partial r and semipartial r are known under certain assumptions, a hypothesis test of ρ = 0 (or of equality to some other constant) is available under those assumptions. Before introducing the partial correlation coefficient, it is important to understand what is being quantified, under what typical assumptions, and, if applicable, what specific hypothesis is being tested.

The null hypothesis associated with the partial correlation coefficient is $\rho_{XY.Z} = k$: the correlation between X and Y conditioned on Z equals a constant k, and the most common value for k is 0, implying no partial linear association. The p-value for the null hypothesis $\rho_{XY.Z} = 0$ is the same as the p-value for $\beta_1 = 0$ in the regression model $y = \beta' Z + \beta_1 x + \varepsilon$. For $k \neq 0$, an approximate Z test is available.
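This equivalence is easy to verify numerically. The following is a minimal Python sketch (not part of the original SPSS guide) using simulated data with illustrative variable names; it applies the standard t test for partial r, whose degrees of freedom are n − 2 − q, where q is the number of control variables.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)                       # a single control variable Z
x = z + rng.normal(size=n)
y = 0.5 * z + 0.3 * x + rng.normal(size=n)

# Partial r: correlate the residuals of y ~ Z and x ~ Z.
Z = np.column_stack([np.ones(n), z])
res_y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
res_x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
r_p = np.corrcoef(res_x, res_y)[0, 1]

# t test of H0: rho_XY.Z = 0, with df = n - 2 - (number of controls).
df = n - 2 - 1
t_partial = r_p * np.sqrt(df) / np.sqrt(1 - r_p**2)
p_partial = 2 * stats.t.sf(abs(t_partial), df)

# t test of beta_1 in the single model y = b0 + b*z + b1*x + e.
X = np.column_stack([np.ones(n), z, x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se_b1 = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[2, 2])
t_slope = beta[2] / se_b1
p_slope = 2 * stats.t.sf(abs(t_slope), n - X.shape[1])

print(p_partial, p_slope)    # identical up to floating-point error
```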

For example, suppose we are interested in the association between age and spending among adults aged 18–40. We know that income increases with age, as does spending. We may use partial r to quantify the association between age and spending, adjusting for income, and test the hypothesis that there is no partial association, $\rho_{\text{age,spending}.\text{income}} = 0$. Partial r is also used with mediation models to quantify direct effects. Using the previous example, we may say that income increases with age and spending increases with income, so there is an indirect effect of age on spending. Removing the effect of income, we can obtain the direct effect of age on spending and quantify the association using partial r.

Calculations

Partial Correlation

Partial correlation is the Pearson product–moment correlation between the residuals of two regression models:

Model 1: $y = \alpha_1 + \beta_1 Z + \varepsilon_1$
Model 2: $x = \alpha_2 + \beta_2 Z + \varepsilon_2$

Partial $\rho_{XY.Z} = \mathrm{corr}(\varepsilon_1, \varepsilon_2)$, the correlation between the error terms. As $\beta_1$, $\beta_2$, $\varepsilon_1$, and $\varepsilon_2$ are unknown, we use the sample estimates from an ordinary least squares regression model and obtain $r_{XY.Z} = \mathrm{corr}(\hat{\varepsilon}_1, \hat{\varepsilon}_2)$. Statistical software packages use a more efficient algorithm that yields an identical result.
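As a concrete illustration, the sketch below (Python, simulated data, not from the original guide) computes partial r from the two sets of residuals and checks it against the standard closed-form identity for a single control variable, $r_{XY.Z} = (r_{XY} - r_{XZ} r_{YZ}) / \sqrt{(1 - r_{XZ}^2)(1 - r_{YZ}^2)}$, which is the kind of shortcut software packages exploit.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = z + 0.4 * x + rng.normal(size=n)

# Definition: correlate the residuals of x ~ Z and y ~ Z.
Z = np.column_stack([np.ones(n), z])
e_x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
e_y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
r_resid = np.corrcoef(e_x, e_y)[0, 1]

# Standard closed form for a single control variable.
r_xy = np.corrcoef(x, y)[0, 1]
r_xz = np.corrcoef(x, z)[0, 1]
r_yz = np.corrcoef(y, z)[0, 1]
r_formula = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

print(r_resid, r_formula)    # agree up to floating-point error
```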

Another conception of partial correlation is through a single model:

$$y = \beta' Z + \beta_1 x + \varepsilon$$

And the decomposition of the variance of y:

$$\sigma^2_y = \sigma^2_{\beta Z} + \sigma^2_{\beta(x \perp Z)} + \sigma^2_{\varepsilon}$$

where $x \perp Z$ is the part of x that is orthogonal to (uncorrelated with) Z.

$$\text{partial } \rho_{XY.Z} = \sqrt{\frac{\sigma^2_{\beta(x \perp Z)}}{\sigma^2_{\beta(x \perp Z)} + \sigma^2_{\varepsilon}}}$$

We use the model-based estimates of the variance components and obtain

$$\text{partial } r_{XY.Z} = \sqrt{\frac{SS(x \perp Z)}{SS(x \perp Z) + SS_{\hat{\varepsilon}}}}$$

where SS is the sum-of-squares. The partial correlation value and the associated hypothesis test are not affected by which variable is denoted x and which is denoted y.
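The sum-of-squares route can be sketched as follows (Python, simulated data, an illustration rather than the guide's SPSS workflow): fit the reduced model y ~ Z and the full model y ~ Z + x, take $SS(x \perp Z)$ as the drop in residual SS, and attach the sign of the fitted coefficient on x.

```python
import numpy as np

def fit_sse(X, y):
    """OLS fit of y on X; return (residual sum of squares, coefficients)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return resid @ resid, beta

rng = np.random.default_rng(2)
n = 400
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = z - 0.5 * x + rng.normal(size=n)     # true partial association is negative

ones = np.ones(n)
sse_reduced, _ = fit_sse(np.column_stack([ones, z]), y)       # y ~ Z
sse_full, beta = fit_sse(np.column_stack([ones, z, x]), y)    # y ~ Z + x

ss_x_given_z = sse_reduced - sse_full                         # SS(x ⊥ Z)
partial_r = np.sign(beta[2]) * np.sqrt(ss_x_given_z / (ss_x_given_z + sse_full))
print(partial_r)                         # matches the residual-based estimate
```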

Semipartial Correlation

Consider Model 2 as above:

Model 2: $x = \alpha_2 + \beta_2 Z + \varepsilon_2$

Semipartial $\rho_{YX.Z} = \mathrm{corr}(y, \varepsilon_2)$ and semipartial $r_{YX.Z} = \mathrm{corr}(y, \hat{\varepsilon}_2)$.

As a function of variance components from the regression model $y = \beta' Z + \beta_1 x + \varepsilon$:

$$\text{semipartial } \rho_{YX.Z} = \sqrt{\frac{\sigma^2_{\beta(x \perp Z)}}{\sigma^2_y}}$$

and

$$\text{semipartial } r_{YX.Z} = \sqrt{\frac{SS(x \perp Z)}{SS(y)}}$$

The value of the semipartial correlation coefficient is affected by the assignment of the variables to x and y.

When the partial or semipartial correlations are estimated using SS, the value takes the sign of the corresponding $\hat{\beta}$. As $\sigma^2_y \geq \sigma^2_{\beta(x \perp Z)} + \sigma^2_{\varepsilon}$, the partial correlation is always equal to or larger in absolute value than the semipartial correlation.
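A short sketch makes the contrast concrete (Python, simulated data, not from the original guide): the semipartial correlates raw y with the Z-adjusted part of x, whereas the partial adjusts both variables, so its denominator is smaller and its absolute value at least as large.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = z + 0.4 * x + rng.normal(size=n)

Z = np.column_stack([np.ones(n), z])
e_x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]   # x with Z partialled out
e_y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]   # y with Z partialled out

semipartial = np.corrcoef(y, e_x)[0, 1]   # corr(y, x ⊥ Z): only x is adjusted
partial = np.corrcoef(e_y, e_x)[0, 1]     # corr(y ⊥ Z, x ⊥ Z): both adjusted
print(semipartial, partial)               # |semipartial| <= |partial|
```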

Ordered Semipartial Correlation

Consider the linear regression model $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \varepsilon$ and

$$\sigma^2_y = \sigma^2_{\beta_1 x_1^*} + \sigma^2_{\beta_2 x_2^*} + \ldots + \sigma^2_{\beta_p x_p^*} + \sigma^2_{\varepsilon}$$

where $x_i^*$ is the portion of $x_i$ orthogonal to all $x_j$, $j < i$. Ordered semipartial correlations are then obtained:

$$\text{ordered semipartial correlation } \rho_{yx_i} = \sqrt{\frac{\sigma^2_{\beta_i x_i^*}}{\sigma^2_y}}$$

Estimates of the ordered semipartial correlations are obtained using the Type I sums-of-squares, $SS_1$:

$$\text{ordered semipartial correlation } r_{yx_i} = \sqrt{\frac{SS_1(x_i)}{SS(\text{Total})}}$$

As the $SS_1$ are computed sequentially, using $x_i$ or $x_i^*$ for modeling yields the same result. The squared ordered semipartial correlations sum to the model $R^2$, so they can be used to decompose the total variance of y into components due to each x variable, adjusting for all previous x variables. The order in which the variables are entered into the model is important and should be considered carefully.
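The decomposition can be checked numerically. The sketch below (Python, simulated data, not from the original guide) computes the Type I SS by entering predictors one at a time and confirms that the squared ordered semipartial correlations sum to the model $R^2$.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return resid @ resid

rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)        # predictors deliberately correlated
y = x1 + 0.5 * x2 + rng.normal(size=n)

ss_total = np.sum((y - y.mean()) ** 2)
X = np.ones((n, 1))                        # intercept-only model to start
prev_sse = sse(X, y)
squared_parts = []
for xi in (x1, x2):                        # enter predictors in a fixed order
    X = np.column_stack([X, xi])
    cur_sse = sse(X, y)
    squared_parts.append((prev_sse - cur_sse) / ss_total)  # SS1(xi)/SS(Total)
    prev_sse = cur_sse

r2_model = 1 - prev_sse / ss_total
print(sum(squared_parts), r2_model)        # equal: parts sum to the model R^2
```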

Requirements and Assumptions

Hypothesis tests have assumptions. If the assumptions are not met, the methods may still be applied, but error rates may be compromised. Understanding the assumptions will improve your research design and efforts. There are two ways of approaching partial r, with two sets of assumptions:

1. Two regression models, $x = \beta_1 Z + \varepsilon_1$ and $y = \beta_2 Z + \varepsilon_2$:

• Both models are correctly specified.
• The $\varepsilon_1$ are independent of each other.
• The $\varepsilon_2$ are independent of each other.
• The values are obtained so as to manifest properties of simple random sampling.
• $\varepsilon_1 \sim N(0, \sigma_1^2)$ and $\varepsilon_2 \sim N(0, \sigma_2^2)$, and $\sigma_1^2$ and $\sigma_2^2$ are both constant.

2. The regression model $y = \beta' Z + \beta_1 x + \varepsilon$:

• The model is correctly specified.
• All $\varepsilon$ are independent.
• The values are obtained so as to manifest properties of simple random sampling.
• $\varepsilon \sim N(0, \sigma^2)$, where $\sigma^2$ is constant.

These are the usual assumptions for hypothesis testing within the context of the linear regression model. The choice between the two sets of assumptions is a theoretical consideration only; the value of partial r and the associated p-value are the same under both approaches.

Illustrative Example: Partial Correlations Among Flower Properties in Iris

This example presents the use of partial correlation between two flower properties (sepal length and petal length), controlling for two other flower properties (sepal width and petal width), all obviously related to size. This is relevant because researchers may be interested in relations between biological properties controlling for other biological properties. In this example, we propose that the association between sepal length and petal length is a function only of overall size, which also manifests in sepal width and petal width, so the adjusted correlation should be low. Partial correlation, and the corresponding hypothesis test, will help to assess the tenability of this argument.


The Data

This example uses Fisher’s Iris dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data). The variables in the dataset comprise a series of measurements on iris plants: sepal width, sepal length, petal width, and petal length, each a function of plant size. We are interested in whether sepal length and petal length are correlated, controlling for sepal width and petal width.

Analysing the Data

There are several ways to perform this analysis. The first is regression based and is made simple in some software packages. The second is direct, using a software package that specifically computes partial correlations (in SPSS, the PARTIAL CORR procedure). The correlation between sepal length and petal length is high, 0.87. The partial correlation between sepal length and petal length, controlling for sepal width and petal width, is 0.72, p < .0001; the adjusted association is also large.
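For readers who want to replicate these numbers outside SPSS, here is a minimal Python sketch that reads the dataset from the UCI link given above and computes both correlations via the residual method; the column names and file layout are assumptions about that file.

```python
import numpy as np
import pandas as pd
from scipy import stats

url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "iris/iris.data")
iris = pd.read_csv(url, header=None, names=[
    "sepal_length", "sepal_width", "petal_length", "petal_width", "species"])

y = iris["sepal_length"].to_numpy()
x = iris["petal_length"].to_numpy()
Z = np.column_stack([np.ones(len(iris)),
                     iris["sepal_width"].to_numpy(),
                     iris["petal_width"].to_numpy()])

print(np.corrcoef(x, y)[0, 1])              # zero-order r, about 0.87

# Partial r: correlate the residuals after regressing each variable on Z.
e_y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
e_x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
r_p = np.corrcoef(e_x, e_y)[0, 1]

df = len(iris) - 2 - 2                      # n - 2 - (number of controls)
t = r_p * np.sqrt(df) / np.sqrt(1 - r_p ** 2)
p = 2 * stats.t.sf(abs(t), df)
print(r_p, p)                               # about 0.72, p < .0001
```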

Presenting Results

The results for the partial correlation may be reported as follows:

“We used Fisher’s Iris data to test the following null hypothesis:

H0: There is no association between sepal length and petal length, controlling for sepal width and petal width, in iris.

The dataset includes observations on 150 individual plants. The zero-order correlation between sepal length and petal length is high, 0.87. After controlling for sepal width and petal width, using a partial correlation, the association is still large (partial r = 0.72, p < .0001), so we reject the null hypothesis.”

Review

Partial correlation, partial r, is used to quantify the linear association between two variables controlling for one or more other variables. Under a null hypothesis that the partial correlation is zero, the distribution of partial r is known, so an exact hypothesis test is possible, under assumptions.

You should know:

• What partial correlation is.
• What semipartial correlation is.
• The differences between partial and semipartial correlation.
• What sequential semipartial correlation is and why it is useful.
• If applicable, what hypothesis is being tested.
• Assumptions associated with the hypothesis test.

Your Turn

Download this sample dataset to see whether you can replicate these results. There are four moderately to highly correlated variables in the dataset. Several partial and semipartial correlations are obtainable.
