Learn to Use the Phi Coefficient Measure and Test in R with Data from the Welsh Health Survey (Teaching Dataset) (2009)
© 2019 SAGE Publications, Ltd. All Rights Reserved. This PDF has been generated from SAGE Research Methods Datasets.

Student Guide

Introduction

This example introduces the Phi Coefficient, which allows researchers to measure and test the strength of association between two categorical variables, each of which has only two groups. It describes the Phi Coefficient, discusses the assumptions underlying its validity, and shows how to compute and interpret it. We illustrate the Phi Coefficient measure and test using a subset of data from the 2009 Welsh Health Survey. Specifically, we measure and test the strength of association between sex and whether the respondent has visited the dentist in the last twelve months.

The Phi Coefficient can be used in its own right to assess the strength of association between two categorical variables, each with only two groups. Typically, however, it is used in conjunction with Pearson’s Chi-Squared test of association in tabular analysis. Pearson’s Chi-Squared test tells us whether there is an association between two categorical variables, but it does not tell us how strong that association is. The Phi Coefficient provides a measure of the strength of association, and it can also be used to test whether that association is statistically distinguishable from zero (no association).
This page provides links to this sample dataset and a guide to producing the Phi Coefficient test using statistical software.

What Is a Phi Coefficient?

The Phi Coefficient is a measure of the strength of association between two categorical variables (e.g., sex, ethnicity, occupation), each of which is binary or is measured as binary; that is, each has only two groups (e.g., male/female or employed/unemployed). Also known as Pearson’s Phi Coefficient, the measure is designed for variables at the binary categorical level only.

When the Phi Coefficient is used as a formal statistical test, one must, as always, first define the null hypothesis (H0) to be tested. In this case, the standard null hypothesis is that there is no association between the two variables. Even if the variables are not associated in the population, some non-zero association would be expected in a sample simply due to sampling error, i.e., random chance in sampling. The Phi Coefficient test conducted here is designed to help us determine whether the sample association differs from zero by enough to declare it statistically significant. “Large enough” is typically defined as a test statistic with a level of statistical significance, or p-value, of less than .05, meaning that, if the null hypothesis were true, sample associations this large or larger would occur “just by random chance” in fewer than 5% of samples of this size. In that case, we would “reject the null hypothesis (H0) of no association between the two variables” at the .05 level.

Calculating a Phi Coefficient

The Phi Coefficient is derived from Pearson’s Chi-Squared statistic of tabular association. The modifications restrict the resulting statistic to a range of −1.0 to 1.0, analogous to (although not the same as) Pearson’s Correlation Coefficient.
If the variables are not associated, then the Phi Coefficient value should be 0; perfect positive (negative) association yields a Phi Coefficient of 1 (−1).

To illustrate, let’s imagine that we have surveyed 100 participants, whom we have categorised by whether they have children and asked whether they have a pet or not. Table 1 shows the hypothetical results.

Table 1: Cross-Tabulation of Pet Ownership and Having a Child.

                                 Whether respondent has a child
Whether respondent has a pet     Yes           No            Total
Yes (n = 30)                     20 (66.7%)    10 (33.3%)    30
No (n = 70)                      10 (14%)      60 (86%)      70
Total                            30            70

The cross-tabulation suggests a possible positive association, as pet ownership and having children appear to go together: 66.7% of people with a pet also had children, compared with 14% of people without a pet. However, we do not know whether this is statistically significant.

Table 1 is also known as a 2 × 2 contingency table. Two binary variables are considered positively associated if most of the data fall along the diagonal cells, that is, if a and d are larger than b and c. Conversely, if most of the data fall in the off-diagonal cells, then the two variables are negatively associated. Table 2 below illustrates this, with each observed count labelled.

Table 2: Cross-Tabulation of Pet Ownership and Having a Child.

                                 Whether respondent has a child
Whether respondent has a pet     Yes               No                Total
Yes (n = 30)                     a = 20 (66.7%)    b = 10 (33.3%)    e = 30
No (n = 70)                      c = 10 (14%)      d = 60 (86%)      f = 70
Total                            g = 30            h = 70
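The hypothetical counts in Table 1 can be entered in R to reproduce the totals and percentages. This is a minimal sketch in base R; the object name `pets` and the dimension labels are illustrative choices, not part of the original guide.

```r
# Hypothetical counts from Table 1
# (rows: whether respondent has a pet; columns: whether respondent has a child).
pets <- matrix(
  c(20, 10,
    10, 60),
  nrow = 2, byrow = TRUE,
  dimnames = list(pet = c("Yes", "No"), child = c("Yes", "No"))
)

# Add the row and column totals (e, f, g and h in Table 2).
addmargins(pets)

# Row percentages: within each pet group, the share with and without a child.
round(100 * prop.table(pets, margin = 1), 1)

# Column percentages: within each child group, the share with and without a pet.
round(100 * prop.table(pets, margin = 2), 1)
```

Because the hypothetical table is symmetric, the row and column percentages tell the same story here; with real data they generally differ, so it is worth stating which one is being reported.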
If we look at Table 2, we can see that a and d appear larger than b and c. However, to quantify the association we need to calculate the Phi Coefficient, using Equation 1.

Equation 1 presents the formula for the Phi Coefficient (using the data in counts):

$$\phi = \frac{ad - bc}{\sqrt{efgh}} \tag{1}$$

Equation 2 presents the formula populated with data from the example:

$$\phi = \frac{20 \times 60 - 10 \times 10}{\sqrt{30 \times 70 \times 30 \times 70}} = \frac{1200 - 100}{\sqrt{4410000}} = \frac{1100}{2100} = 0.52 \tag{2}$$

We have calculated the Phi Coefficient to be 0.52. We can interpret this figure using the same scale as that for Pearson’s Correlation Coefficient. Table 3 presents the Phi Coefficient Scale.

Table 3: The Phi Coefficient Scale.

Phi Coefficient     Interpretation
−1.0 to −0.7        Strong negative association between the variables
−0.69 to −0.4       Medium negative association between the variables
−0.39 to −0.2       Weak negative association between the variables
−0.19 to −0.01      No or negligible association between the variables
0.00                No association between the two variables
0.01 to 0.19        No or negligible association between the variables
0.2 to 0.39         Weak positive association between the variables
0.4 to 0.69         Medium positive association between the variables
0.70 to 1.0         Strong positive association between the variables

In our example, the Phi Coefficient value is 0.52, which we can interpret as a medium (positive) association between our variables. The size of the coefficient alone does not establish statistical significance; that requires the test. For a 2 × 2 table, Pearson’s Chi-Squared statistic equals nφ², which here is 100 × 0.52² ≈ 27 on one degree of freedom, giving p < .001. We can therefore reject H0; in other words, there is a statistically significant association between the two variables. Moreover, the positive sign of the coefficient, like the pattern in the contingency table (Table 1), tells us that the association between having a child and owning a pet is positive.

Assumptions Behind the Method

All statistical tests rely on some underlying assumptions, and they all are affected by the type of data that you have.
The Phi Coefficient test can be run on its own to test the association between two variables. However, typically it is used as a post-test following a cross-tabulation and a Pearson’s Chi-Squared test, where it adds depth to the analysis by identifying the strength of association between two variables.

Assumptions of the Phi Coefficient test

• Each variable must consist of two categorical, independent groups.
• There must be independence of observations, so there is no relationship between the groups or between the observations in each group.
• All expected counts should be greater than 1, and no more than 20% of expected counts should be less than 5.

The first and second assumptions are not typically testable from the sample data and are related to the research design. The second assumption is only likely to be violated if the data were sampled by pairs rather than individuals (e.g., couples rather than individual persons). It is important to understand how your data were collected and categorized; this will help you avoid violating the first two assumptions. The third assumption can be tested easily in most statistical software programs.

Illustrative Example: Association Between Sex and Whether Respondent Visited the Dentist in the Last Twelve Months

This example presents a Phi Coefficient analysis using two variables from the 2009 Welsh Health Survey. Specifically, we test whether there is an association between sex and whether the respondent visited the dentist in the last twelve months. Thus, this example addresses the following research question: Does visiting the dentist in the last twelve months vary by an individual’s sex?
Stated as a null hypothesis (H0): there is no association between sex and whether the respondent has visited the dentist in the last twelve months.
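Before turning to the survey data, the calculation in Equation 1 and the expected-count assumption can be tried out in R on the hypothetical counts from Table 1. The sketch below is illustrative only: it uses the pet and child example rather than the Welsh Health Survey variables, the object names are assumptions of this sketch, and it relies on base R's `chisq.test()` with the continuity correction switched off.

```r
# Hypothetical counts from Table 1, NOT the Welsh Health Survey data.
pets <- matrix(c(20, 10, 10, 60), nrow = 2, byrow = TRUE,
               dimnames = list(pet = c("Yes", "No"), child = c("Yes", "No")))

# Cell counts labelled as in Table 2.
cell_a <- pets["Yes", "Yes"]
cell_b <- pets["Yes", "No"]
cell_c <- pets["No", "Yes"]
cell_d <- pets["No", "No"]

# Marginal totals: e and f are the row totals, g and h the column totals.
row_totals <- rowSums(pets)
col_totals <- colSums(pets)

# Equation 1: phi = (ad - bc) / sqrt(efgh).
phi <- (cell_a * cell_d - cell_b * cell_c) /
  sqrt(prod(row_totals) * prod(col_totals))
phi  # approximately 0.52

# Pearson's chi-squared test without Yates' continuity correction,
# so that the test statistic equals n * phi^2 for a 2 x 2 table.
test <- chisq.test(pets, correct = FALSE)
test$statistic
test$p.value

# Third assumption: all expected counts above 1,
# and no more than 20% of expected counts below 5.
test$expected
all(test$expected > 1)
mean(test$expected < 5) <= 0.20
```

Switching off the continuity correction is a choice made here so that the test statistic matches the coefficient from Equation 1; with `correct = TRUE` (the default for 2 × 2 tables) the statistic is slightly smaller.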