
Learn to Use the Phi Coefficient Measure and Test in R With Data From the Welsh Health Survey (Teaching Dataset) (2009)

© 2019 SAGE Publications, Ltd. All Rights Reserved. This PDF has been generated from SAGE Research Methods Datasets.

Student Guide

Introduction

This example dataset introduces the Phi Coefficient, which allows researchers to measure and test the strength of association between two categorical variables, each of which has only two groups. This example describes the Phi Coefficient, discusses the assumptions underlying its validity, and shows how to compute and interpret it. We illustrate the Phi Coefficient measure and test using a subset of data from the 2009 Welsh Health Survey. Specifically, we measure and test the strength of association between sex and whether the respondent has visited the dentist in the last twelve months.

The Phi Coefficient can be used in its own right to assess the strength of association between two categorical variables, each with only two groups. Typically, however, it is used in conjunction with Pearson’s Chi-Squared test of association in tabular analysis. Pearson’s Chi-Squared test tells us whether there is an association between two categorical variables, but it does not tell us how important, or how strong, this association is. The Phi Coefficient provides a measure of the strength of association, which can also be used to test the statistical significance of that association (the confidence with which it can be distinguished from zero, or no association).

This page provides links to this dataset and a guide to producing the Phi Coefficient test using statistical software.

What Is a Phi Coefficient?

The Phi Coefficient is a measure of the strength of association between two categorical variables (e.g., sex, ethnicity, occupation), each of which is binary or is measured as binary, that is, each has only two groups (male/female or employed/unemployed). Also known as Pearson’s Phi Coefficient, the measure is designed for variables at the binary categorical level only.

When used as a formal statistical test, one must, as always, first define the null hypothesis (H0) to be tested. In this case, the standard null hypothesis is that there is no association between the two variables. Even if the variables are not associated in the population, some non-zero association would be expected in a sample simply because of sampling error, i.e., random chance in sampling. The Phi Coefficient test conducted here is designed to help us determine whether the departure from zero association observed in the sample is large enough to declare the association statistically significantly different from zero. “Large enough” is typically defined as a test with a level of statistical significance, or p-value, of less than .05, meaning that, if the null hypothesis were true, sample associations this large or larger would occur “just by random chance” in fewer than 5% of samples of this size. In that case, we would “reject the null hypothesis (H0) of no association between the two variables” at the .05 level.

Calculating a Phi Coefficient

The Phi Coefficient is derived from Pearson’s Chi-Square statistic of tabular association. The modifications restrict the resulting statistic to a range of −1.0 to 1.0, analogously to (although not the same as) Pearson’s correlation coefficient. If the variables are not associated, then the Phi Coefficient value should be 0; perfect positive (negative) association yields a Phi Coefficient of 1 (−1).

To illustrate, let’s imagine that we have surveyed 100 participants, whom we have categorised by whether they have children and asked whether they have a pet or not. Table 1 below shows the hypothetical results.

Table 1: Cross-Tabulation of Pet Ownership and Having a Child.

                                  Whether respondent has a child
Whether respondent has a pet      Yes           No            Total
Yes (n = 30)                      20 (66.7%)    10 (33.3%)    30
No (n = 70)                       10 (14.3%)    60 (85.7%)    70
Total                             30            70            100

The cross-tabulation suggests a possible positive association, as pet ownership appears to be greater amongst those who have children: 66.7% of people with children (20 of 30) had a pet, compared with 14.3% of people without children (10 of 70). However, we do not know whether this is statistically significant. Table 1 is also known as a 2 × 2 contingency table; two binary variables are considered positively associated if most of the data fall along the diagonal cells, that is, if a and d are larger than b and c. Conversely, if most of the data fall in the off-diagonal cells, the two variables are negatively associated. Table 2 below illustrates this, with each observed count labelled.

Table 2: Cross-Tabulation of Pet Ownership and Having a Child.

                                  Whether respondent has a child
Whether respondent has a pet      Yes               No                Total
Yes (n = 30)                      a = 20 (66.7%)    b = 10 (33.3%)    e = 30
No (n = 70)                       c = 10 (14.3%)    d = 60 (85.7%)    f = 70
Total                             g = 30            h = 70

If we look at Table 2, we can see that a and d appear larger than b and c. However, we need to calculate the Phi Coefficient, using Equation 1.

Equation 1 presents the formula for the Phi Coefficient (using the data in counts)

\[
\varphi = \frac{ad - bc}{\sqrt{efgh}} \tag{1}
\]

Equation 2 presents the formula populated with data from the example

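\[
\begin{aligned}
\varphi &= \frac{20 \times 60 - 10 \times 10}{\sqrt{30 \times 70 \times 30 \times 70}} \\
        &= \frac{1200 - 100}{\sqrt{4410000}} \\
        &= \frac{1100}{2100} \\
        &\approx 0.52
\end{aligned} \tag{2}
\]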

We have calculated the Phi Coefficient to be 0.52. We can interpret this figure using the same scale as that for Pearson’s Correlation coefficient.
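The arithmetic in Equations 1 and 2 can be checked with a few lines of code. The following is a minimal sketch in Python using only the standard library, offered purely as an illustration alongside the software guide; the function name phi_coefficient is our own and does not come from any package.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 table with cells a, b (top row) and c, d (bottom row)."""
    e, f = a + b, c + d   # row totals
    g, h = a + c, b + d   # column totals
    return (a * d - b * c) / math.sqrt(e * f * g * h)

# Hypothetical pet/child counts from Table 2
phi = phi_coefficient(20, 10, 10, 60)
print(round(phi, 2))  # prints 0.52
```

Swapping the two rows (or the two columns) of the table flips the sign of the result, which is why the direction of the association should always be read against the table itself.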

Table 3 presents the Phi Coefficient Scale.

Table 3: The Phi Coefficient Scale.

Phi Coefficient     Interpretation
−1.0 to −0.7        Strong negative association between the variables
−0.69 to −0.4       Medium negative association between the variables
−0.39 to −0.2       Weak negative association between the variables
−0.19 to −0.01      No or negligible association between the variables
0.00                No association between the two variables
0.01 to 0.19        No or negligible association between the variables
0.2 to 0.39         Weak positive association between the variables
0.4 to 0.69         Medium positive association between the variables
0.70 to 1.0         Strong positive association between the variables

In our example, the Phi Coefficient value is 0.52, which we can interpret as a medium (positive) association between our variables. The value is also statistically significant: for a 2 × 2 table, the chi-squared statistic equals nφ², here 100 × 0.52² ≈ 27, which with one degree of freedom gives p < .001. We can therefore reject H0; in other words, there is a statistically significant association between the two variables. Moreover, by reviewing the contingency table (Table 1), we can add that the association between having a child and owning a pet is a positive association.
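Both steps of this interpretation, reading the value against Table 3 and testing it against zero, can be sketched in code. The example below is a Python illustration using only the standard library. It relies on two facts: for a 2 × 2 table the Pearson chi-squared statistic equals nφ², and the upper tail of a chi-squared distribution with one degree of freedom can be written with the complementary error function. The function names interpret_phi and phi_p_value are our own, and the handling of values that fall between the printed bands of Table 3 (for example 0.195) is an assumption.

```python
import math

def interpret_phi(phi):
    """Label phi using the cut-offs in Table 3."""
    size = abs(phi)
    if size == 0:
        return "no association"
    if size < 0.2:
        return "no or negligible association"
    direction = "positive" if phi > 0 else "negative"
    if size >= 0.7:
        strength = "strong"
    elif size >= 0.4:
        strength = "medium"
    else:
        strength = "weak"
    return f"{strength} {direction} association"

def phi_p_value(phi, n):
    """Chi-squared (n * phi**2, 1 df) and its p-value for H0: phi = 0."""
    chi_sq = n * phi ** 2
    return chi_sq, math.erfc(math.sqrt(chi_sq / 2))

phi = 1100 / 2100                    # pet/child example, n = 100
chi_sq, p = phi_p_value(phi, 100)
print(interpret_phi(phi))            # prints medium positive association
print(round(chi_sq, 2), p < 0.001)   # prints 27.44 True
```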

Assumptions Behind the Method

All statistical tests rely on some underlying assumptions, and all are affected by the type of data that you have. The Phi Coefficient test can be run on its own to test the association between two variables. However, typically it is used as a post-test following a cross-tabulation and a Pearson’s Chi-Squared test, where it adds depth to the analysis by identifying the strength of association between two variables.

Assumptions of the Phi Coefficient test

• Both variables must have two categorical, independent groups.
• There must be independence of observations, so there is no relationship between the groups or between the observations in each group.
• All expected counts should be greater than 1, and no more than 20% of expected counts should be less than 5. In a 2 × 2 table, which has only four cells, this means that no expected count should be less than 5.

The first and second assumptions are not typically testable from the sample data and are related to the research design. The second assumption is only likely to be violated if the data were sampled by pairs rather than individuals (e.g., couples rather than individual persons). It is important to understand how your data were collected and categorized; this will help you avoid violating the first two assumptions. The third assumption can be tested easily in most statistical software programs.

Illustrative Example: Association Between Sex and Whether Respondent Visited the Dentist in the Last Twelve Months

This example presents a Phi Coefficient analysis using two variables from the 2009 Welsh Health Survey. Specifically, we test whether there is an association between sex and whether the respondent visited the dentist in the last twelve months.

Thus, this example addresses the following research question:

Does visiting the dentist in the last twelve months vary by an individual’s sex?

Stated in the form of a null hypothesis:

H0 = There will be no association between sex and whether the respondent has visited the dentist in the last twelve months.

It should be noted that this hypothesis is two-tailed.

The Data

This example uses a subset of data from the 2009 Welsh Health Survey. This extract includes 16,018 respondents, which is a large sample. It should be noted that the original dataset is larger still, but it has been “cleaned” to include only those who have responded to both our variables. The two variables we examine are:

• Respondent’s sex (sex).
• Whether respondent has visited the dentist in the last twelve months (denbi).

The first variable, Respondent’s sex (sex), is coded 1 if male and 2 if female. Whether the respondent has visited the dentist in the last twelve months (denbi) is coded 0 if “no” and 1 if “yes.” We treat both variables as categorical, in line with common practice in social science research. In addition, both variables are binary.

First, we should test our data to ensure that no expected counts are less than 5.

Table 4: Contingency Table for Sex and Whether the Respondent Visited the Dentist in the Last Twelve Months.

Whether the respondent has visited               Sex
the dentist in the last twelve months            Male       Female     Total
No       Count                                   2,329      2,058      4,387
         Expected count                          2,039.2    2,347.8    4,387.0
Yes      Count                                   4,684      6,016      10,700
         Expected count                          4,973.8    5,726.2    10,700.0
Total    Count                                   7,013      8,074      15,087
         Expected count                          7,013.0    8,074.0    15,087.0

We can see from Table 4 that no cells have an expected count less than 5, and all the expected counts are greater than 1. In addition, both our variables are categorical with only two groups each; therefore, the Phi Coefficient is appropriate for these data. Usually the Phi Coefficient is run as a post-test to tell us something about the strength of an association that a Pearson’s Chi-squared test has identified as significant.
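The expected counts in Table 4 follow from the usual independence rule: each expected count is the row total multiplied by the column total, divided by the number of valid cases. The sketch below is a Python illustration using only the standard library; the helper name expected_counts is our own.

```python
def expected_counts(table):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Observed counts from Table 4: rows = No / Yes, columns = Male / Female
observed = [[2329, 2058],
            [4684, 6016]]
expected = expected_counts(observed)
for row in expected:
    print([round(x, 1) for x in row])

# Third assumption: in a 2 x 2 table no expected count should be below 5
smallest = min(min(row) for row in expected)
print(smallest >= 5)  # prints True
```

The rounded values should match the "Expected count" rows of Table 4.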

Analysing the Data

Before conducting the Phi Coefficient test, we should first examine each variable in isolation. We start by presenting the distribution of sex in Table 5, which shows that there are slightly more females (53.7%) than males (46.3%) in the sample.

Table 5: Frequency Distribution of Sex.

                   Frequency    Percent    Valid percent    Cumulative percent
Valid    Male      7,412        46.3       46.3             46.3
         Female    8,606        53.7       53.7             100.0
         Total     16,018       100.0      100.0

Table 6 shows the frequency distribution of denbi. Just under a third of respondents (29.1%) did not visit a dentist in the last twelve months, while 70.9% did. It should be noted that 931 respondents did not answer the question.

Table 6: Frequency Distribution of denbi.

                                 Frequency    Percent    Valid percent    Cumulative percent
Valid      No                    4,387        27.4       29.1             29.1
           Yes                   10,700       66.8       70.9             100.0
           Total                 15,087       94.2       100.0
Missing    No answer/refused     931          5.8
Total                            16,018       100.0
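The difference between the "Percent" and "Valid percent" columns of Table 6 is only the denominator: the first divides by all 16,018 respondents, the second by the 15,087 who answered the question. A minimal Python sketch of that arithmetic (the variable names are our own):

```python
# Frequencies for denbi taken from Table 6
counts = {"No": 4387, "Yes": 10700}
missing = 931

valid_n = sum(counts.values())   # respondents who answered
total_n = valid_n + missing      # all respondents in the extract

rows = {}
for label, count in counts.items():
    percent = round(100 * count / total_n, 1)        # share of everyone
    valid_percent = round(100 * count / valid_n, 1)  # share of those who answered
    rows[label] = (percent, valid_percent)
    print(label, percent, valid_percent)
```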

Tables 5 and 6 show the distribution of each of these variables by themselves, but they cannot tell us whether the two variables are associated.

Calculating the Phi Coefficient and Conducting the Phi Coefficient Test

Tables 7 and 8 present the results of the Phi Coefficient analysis. Table 7 presents the Chi-square test result; this statistic also underlies the Phi Coefficient. The result is statistically significant (the reported significance of .000 means p < .001), which indicates that the variables are associated with each other.

Table 7: Results of the Phi Coefficient Analysis: The Chi-Squared Result.

                      Value      df    Asymptotic significance
Pearson chi-square    108.477    1     .000
N of valid cases      15,087

Table 8: Results of the Phi Coefficient Analysis: Phi Coefficient Measure and Test.

                            Value     Approximate significance
Nominal by nominal   Phi    0.085     .000
N of valid cases            15,087
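The values in Tables 7 and 8 can be reproduced, up to rounding, from the observed counts in Table 4. The sketch below is a Python illustration using only the standard library rather than the R workflow of the software guide. It applies Equation 1, then uses the identity that the Pearson chi-squared statistic for a 2 × 2 table (without continuity correction) equals nφ², and obtains the p-value for one degree of freedom from the complementary error function.

```python
import math

# Observed counts from Table 4 (rows: No / Yes; columns: Male / Female)
a, b = 2329, 2058
c, d = 4684, 6016
n = a + b + c + d

# Equation 1: (ad - bc) divided by the square root of the product of the margins
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
chi_sq = n * phi ** 2                        # Pearson chi-squared, 1 df
p_value = math.erfc(math.sqrt(chi_sq / 2))   # upper tail of chi-squared with 1 df

print(round(phi, 3), round(chi_sq, 3), p_value < 0.001)
```

With this ordering of rows and columns the sign of φ is positive, meaning that women were somewhat more likely than men to report a visit.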

Table 8 presents the Phi Coefficient value, which is 0.085; this suggests that there is at most a negligible positive association between the variables. However, the reported significance of .000 (p < .001) indicates a highly statistically significant association. Given the very low Phi Coefficient, we need to treat the p value with caution, as p values are very sensitive to sample size; in large samples (like our example), they often deem small differences significant. Because the Phi Coefficient is a measure of the strength of association that does not grow with sample size, we should use it as the basis for our substantive conclusion: although the association is statistically significant, it is too weak to be of practical importance, so for practical purposes we retain our null hypothesis that there is no association between sex and whether the respondent visited the dentist in the last twelve months.
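The sensitivity of the p value to sample size can be made concrete by holding the Phi Coefficient fixed at 0.085 and varying n. The sketch below is a Python illustration using only the standard library; the sample sizes of 100 and 1,000 are hypothetical and chosen only for comparison with the 15,087 valid cases analysed here, and the function name p_from_phi is our own.

```python
import math

def p_from_phi(phi, n):
    """p-value for H0: phi = 0, using chi-squared = n * phi**2 with 1 df."""
    return math.erfc(math.sqrt(n * phi ** 2 / 2))

phi = 0.085
for n in (100, 1000, 15087):   # 100 and 1000 are hypothetical sample sizes
    chi_sq = n * phi ** 2
    print(n, round(chi_sq, 2), p_from_phi(phi, n))
```

Under these assumptions the same negligible association is not significant at the .05 level with 100 cases, but becomes significant once the sample is large enough, which is why the size of φ rather than the p value should guide the substantive reading.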

Presenting Results

A Phi Coefficient test can be reported as follows:

“We used a subset of data from the 2009 Welsh Health Survey dataset to measure and test the association between sex and whether respondent visited the dentist in the last twelve months. We tested the following null hypothesis:

H0 = There is no association between sex and whether the respondent has visited the dentist in the last twelve months.

The data included 16,018 adult respondents, of whom 15,087 provided valid answers to both questions. The association between sex and whether the respondent visited the dentist in the last twelve months was statistically significant (p < .001) but negligible in size, φ = 0.085, so it was not substantively significant. Given the very large sample, we base our conclusion on the size of the Phi Coefficient and, for practical purposes, retain the null hypothesis of no association between sex and whether the respondent visited the dentist in the last twelve months.”

Review

The Phi Coefficient is a statistical measure used to evaluate the strength of association between two dichotomous variables.

You should know:

• What types of variables are suited for a Phi Coefficient test.
• The basic assumptions underlying this statistical test.
• How to compute and interpret a Phi Coefficient test.
• How to report the results of a Phi Coefficient test.

Your Turn

You can download this sample dataset along with a guide showing how to produce a Phi Coefficient test using statistical software. The sample dataset also includes another variable called teethbi, which relates to how many teeth the respondent has. See whether you can reproduce the results presented here for the sex variable, and then try producing your own Phi Coefficient analysis substituting teethbi for sex in the analysis.
