<<

Statistical Significance, p-Values and Confidence Intervals: A Brief Guide for Non-statisticians Using SPSS

© Colm McGuinness, 2015

Table of Contents

1. Introduction
2. Statistical Significance/P-values
3. Effect Size
4. The H0 Distribution/Sampling Distributions
5. Confidence Interval for a Population Statistic
6. Example 1: Student Marks
7. Example 2: Household Income by Churn
8. Example 3: Selecting Subgroups
9. Too Many Group Comparisons: Familywise Error Rates


1. Introduction

This document attempts to give a basic understanding of statistical significance values, p values, and confidence intervals. Various simplifications and assumptions are made throughout this document to make it accessible to as wide an audience of statistical practitioners as possible. It is assumed that you are already somewhat familiar with using SPSS and the telco.sav file from lab classes with me, gathering data, samples & populations, and basic statistics such as the mean, median, and standard deviation. A primary objective is to avoid technical details, and show how a basic understanding can be applied to a wide range of problems¹. The basic steps for interpreting any significance/p-value, which will be detailed in later sections, are, in summary:

- Determine the null hypothesis … This is the "nothing different/new/interesting" statement for some statistic derived from your data. The particular statistic will depend on the test being performed.
- Interpret the "sig."/p-value as the probability of the statistic derived from your data occurring, if the null hypothesis is true.
- If the p-value is low then reject the null hypothesis: The statistic derived from your data does not support the null hypothesis.
- If the p-value is high then do not reject the null hypothesis: The statistic derived from your data agrees with the null hypothesis.

Statistics can be used to compare two or more subgroups within a sample of data, for example comparing the marks of students from one year to marks obtained from another year, or the marks obtained by males against those obtained by females, or sales values before and after an advertising campaign: All sorts of things!

The main example for this document is the comparison of the means from two groups. For a given set of values, the mean represents the arithmetic centre of the values. The mean can be interpreted as an average, "typical" or indicative value for the whole of the data. So if we want to compare two groups, then we instead compare their means, and infer from any difference found (or not found) whether the groups differ or not. Comparing only group means can be less than ideal since means might not differ, but other characteristics might, or vice-versa. But it is a commonly used technique, and we will use it here as an example to discuss in the context of statistical significance.

The main example used in later sections has the following basic details:

The SPSS sample file telco.sav contains, amongst other things, information on gender and household income for a sample of 1000 telecoms customers. We might wonder if household income differs by gender, ie is there any evidence from this sample that male household income is different to female household income?

To test this requires an “Independent-Samples T Test” from the SPSS Analyze/Compare Means menu path. Why this is the case is not covered by this document, but it is a common test to use to compare two independent subgroups. It only works for two independent subgroups, ie male and female here. For more than two subgroups you would typically consider an ANOVA test, which is a separate matter. Or for dependent subgroups (eg “before” and “after” measurements on the same subjects) you might be able to use a “Paired-Samples T Test”.

The gender SPSS variable is coded as 0=Male, and 1=Female.

Without going into too much technical detail, what SPSS will do for this test is calculate the means for the two subgroups, say $\bar{x}_0$ and $\bar{x}_1$ for male and female respectively², and then it will calculate the difference in means, ie $\bar{x}_0 - \bar{x}_1$. It then calculates the significance associated with this specific difference answer, and that answer is the statistical significance that this document is mostly concerned with.

¹ In an ideal world one should not avoid technical details as these can be crucially important! However many people who find themselves needing to use some level of statistics will not themselves need to be overly familiar with technical details. For important statistical work I recommend engaging with a professional statistician, since ultimately ignoring technical details is bad, very bad‼

² Note that the subscript notation here will not be shown by SPSS in output. This type of notation is very commonly used in one format or another in statistical texts. A variable name, eg x, with a bar over it, eg $\bar{x}$, is used to indicate that it is the mean of the x's that we are referring to.

The initial output from this comparison, showing the descriptive statistics, is shown in Table 1.

Group Statistics

                                  Gender    N     Mean      Std. Deviation   Std. Error Mean
  Household income in thousands   Male      483   73.2505   92.85082         4.22486
                                  Female    517   81.5377   118.73355        5.22190

Table 1: Descriptive statistics from an Independent-Samples T Test on the telco.sav SPSS sample file, comparing household income by gender.

So here $\bar{x}_0$ is 73.2505 and $\bar{x}_1$ is 81.5377, and their difference (which we will see in Table 2) is $\bar{x}_0 - \bar{x}_1 = -8.2872$.

The main focus of this document is the understanding and interpretation of statistical significance and p values, which are described in the next section.

2. Statistical Significance/P-values

Many statistical tests result in a statistical significance ("sig.") value in SPSS (and other statistical packages). This is commonly known as the "p value" and is often quoted in research as, for example, "p=0.0819" or "p<0.01" or "p>0.05".

The Independent-Samples T Test from the introduction above produces the following results table³, which has been split to make it fit on the page. The significance answers have been highlighted with shading:

Levene's Test for Equality of Variances

                                                                F       Sig.
  Household income in thousands   Equal variances assumed      3.259   .071
                                  Equal variances not assumed

t-test for Equality of Means

                                  t        df        Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% CI of the Difference: Lower   Upper
  Equal variances assumed         -1.224   998       .221              -8.28720          6.77230                 -21.57678                         5.00238
  Equal variances not assumed     -1.234   968.412   .218              -8.28720          6.71697                 -21.46868                         4.89428

Table 2: Results of an Independent-Samples T Test in SPSS. Statistical significance is highlighted by the shaded cells.
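For readers who want to check the arithmetic outside SPSS, the following is a minimal sketch (assuming Python with scipy is available, which is not part of the SPSS workflow described in this document) that reproduces the "equal variances assumed" row of Table 2 from the summary statistics in Table 1:

```python
# Sketch only: reproducing the "equal variances assumed" row of Table 2
# from the Table 1 summary statistics, using scipy.
from scipy import stats

t, p = stats.ttest_ind_from_stats(
    mean1=73.2505, std1=92.85082, nobs1=483,    # males, from Table 1
    mean2=81.5377, std2=118.73355, nobs2=517,   # females, from Table 1
    equal_var=True)                             # "equal variances assumed" row

print(round(t, 3), round(p, 3))  # approximately -1.224 and 0.221, as in Table 2
```

Setting equal_var=False gives the "equal variances not assumed" (Welch) row instead.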

The (statistical) significance is the probability that your actual statistic (the difference in means here) would be found if the null hypothesis were true⁴.

Typically the null hypothesis is that “nothing different/new/interesting” has been found. Statistically this would typically be written as:

³ The result shown is from the TELCO.sav sample file, and compares the mean household incomes of males and females in the sample.

⁴ Technically it is that your actual statistic or "worse". So in this case not just your actual difference in means, but also any even larger difference in means, but the given text is a starting point.

5 6 H0: 12 , where 1 and 2 are the population means for the two groups being compared .

So the “nothing new/interesting” here would be that the means for both groups will be equal. The sig. value is then the probability of getting your actual difference (or worse) between the two sample means, if H0 is true. In Table 2 the actual difference, from the “mean difference” column, is -8.28720, and the two sig. values to the left of this are the probabilities of finding this particular difference (or worse) if in fact the two population means are equal.

Although it is often not explicitly stated, all "null hypothesis statistical test procedures" (NHSTP) involve a null hypothesis, and also an alternative hypothesis, commonly called H1 or Ha, for example:

H1: 12 .

The alternative hypothesis determines the type of test to be performed in terms of whether the test must be "1 tailed" or "2 tailed". For example the above H1 corresponds to a 2 tailed test, since we don't know/specify in advance whether $\mu_1$ might be bigger than $\mu_2$ or smaller than $\mu_2$. This is probably the more common type of alternative hypothesis. However a 1 tailed alternative is possible at times, for example:

H1: 12 .

The first sig. value shown in Table 2, ie p=.071, is a NHST for the hypotheses:

H0: 12 (variances are equal in both groups)

H1: 12 (variances are not equal in both groups) which is important for an independent samples T test, but is not something I will attempt to detail here. What is directly relevant is that p>0.05, which suggests that there is insufficient evidence to reject the null hypothesis, so (for now) we accept that the variances are indeed equal … they still may not be, but our evidence doesn’t detect whatever difference might in fact exist7.

What we’re (typically/generally) looking for is either:

p is "small": Typical values for regarding p as "small" would be p<0.05, p<0.01 or p<0.001. A "low" p value is then telling us that the probability of our actual statistic (be that a mean, a difference in means, a variance, or difference in variances, etc.) is low if the null is true, so we should consider rejecting the null hypothesis … our actual evidence does not support the null hypothesis.

p is “large”: Typical values for regarding p as “large” would be p>0.1 or p>0.05. A “high” p value is telling us that the probability of our actual statistic (be that a mean, a difference in means, a variance, or difference in variances, etc.) is high if the null is true, so we don’t have evidence to reject the null hypothesis … We do not generally “accept” the null hypothesis, which might seem odd: We cannot tell if the null is actually true! What we can say is that we did not find evidence for it not being true! For example a larger sample might detect that the null is not in fact true. Results will vary depending on various aspects of the sample, such as sample size, and level of randomness.

What is "small" and "large" also depends on the context, and the level of certainty required by the person doing the testing. If the experimenter wants to be fairly certain of reporting that something new/different has been found then they might set a strict limit such as p<0.001 before reporting that something new/different had been found. Alternatively an experimenter might be happy/keen to report any potentially new/different result, so might opt for p<0.1 for statistical significance, and p>0.1 for no significant difference. There are often standard practices for any given context.

⁵ Typically Greek letters are used to represent population statistics, so $\mu$ is typically used to represent a population mean, and $\sigma$ is typically used to represent a population standard deviation. The corresponding sample mean and sample standard deviation are typically represented by $\bar{x}$ and s, respectively. For a given population there is only one $\mu$ value, but there can be infinitely many $\bar{x}$ values, depending on the sample.

⁶ Using the notation from the introduction this would be: $H_0: \mu_0 = \mu_1$.

⁷ If you'd like to understand more about this then I'd suggest investigating the following statistical topics: Type I errors, Type II errors, (statistical) power, sample size, and effect sizes.

In reporting statistical results you will commonly see, for example:

No differences were found between male and female household incomes (p>0.05).

Video example: https://www.youtube.com/watch?v=-FtlH4svqx4. This is more technically detailed than you may want or understand, but it is worth watching as you’ll see the whole topic develop, and you’ll see the terms that have been introduced here being used. There are lots more videos, although some will be way off topic.

This determination and interpretation of a p value works similarly across a very wide array of statistical tests, so whenever you carry out a test in SPSS, you can now look straight for the “sig” or p-value, and attempt to interpret it in the context of the likely null and alternative hypotheses. Bear in mind that the null is usually the “nothing different/new/interesting”.

As a closing example for this section: Here is the SPSS output for a test of normality on the household income variable from the telco.sav sample file:

Tests of Normality

                                  Kolmogorov-Smirnov (a)            Shapiro-Wilk
                                  Statistic   df     Sig.           Statistic   df     Sig.
  Household income in thousands   .261        1000   .000           .486        1000   .000

a. Lilliefors Significance Correction

Table 3: SPSS output for a test of normality on the telco.sav household income variable.

So what might the null hypothesis be here?

Well, there is nothing new/interesting if the data are actually normally distributed, so this is the null. The alternative will be that the data are not normally distributed. The two “Sig.” values are below 0.001, ie p<0.001, so the data do not support the null hypothesis, so the household income data from the telco.sav sample file are probably not normally distributed.

Even if one has no idea what the "Statistic" or "df" columns are, or how they are calculated, one can still interpret and use the result⁸!
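For illustration only, the same style of normality check can be run outside SPSS. The sketch below assumes Python with scipy, and uses a made-up, skewed set of income values rather than the real telco.sav variable (which was tested by SPSS in Table 3):

```python
# Illustrative sketch: a Shapiro-Wilk normality test on hypothetical income data.
import numpy as np
from scipy import stats

incomes = np.random.lognormal(mean=4.0, sigma=0.8, size=1000)  # made-up, skewed data

stat, p = stats.shapiro(incomes)
if p < 0.001:
    print("p < 0.001: reject H0; the data are probably not normally distributed")
else:
    print("p =", round(p, 3), ": no evidence against normality")
```

The interpretation is exactly as in the text: the null is "the data are normally distributed", and a small p value is evidence against it.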

3. Effect Size

It is poor practice to mindlessly report statistically significant results without also considering the "effect size", and whether any effect is actually of any real relevance! Consideration of an effect size can simply be the experimenter thinking about what is a relevant and practically significant difference, rather than simply focussing on the fact that the "p value was low"!!

For example we could test two headache medicines and find a statistically significant difference in the levels of reported pain relief after one hour. However the difference would have little practical value if the difference found would apply to only a small percentage of the population, or if the two medicines were found to be of equal pain relief after a short additional period, and so on.

⁸ It is worth stressing again that it would be better to in fact know what the "statistic" is, and what "df" stands for! But it is still possible to calculate and interpret statistics at some level, by understanding p values and null/alternative hypotheses: Even for the non-expert/professional. If in doubt, or if the result is important then engage with a professional statistician!

In addition to experimenter judgement, there are statistical measures of effect size, such as Cohen’s d, which is used when effects are measured by the differences in means. See http://en.wikipedia.org/wiki/Effect_size for further details.
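For the difference-in-means setting used throughout this document, the usual textbook definition of Cohen's d (not something produced in the SPSS output above) is the difference in sample means divided by a pooled standard deviation:

$$ d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}}, \qquad s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} $$

As a rough worked example using the Table 1 values, $s_{\text{pooled}} \approx 107$ and $d \approx 8.29 / 107 \approx 0.08$, which by Cohen's conventional benchmarks (0.2 small, 0.5 medium, 0.8 large) is a very small effect, quite apart from any question of statistical significance.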

4. The H0 Distribution/Sampling Distributions

Someone might reasonably ask: If $H_0$ is $\mu_1 = \mu_2$, and we assume that $H_0$ is true, then how come we got the difference in means of -8.28720 in the example shown in Table 2? Surely the difference should be zero if $H_0$ is true!

In statistics you have to think about populations and samples, and realise that each time a sample is taken from a population a different result will probably be arrived at. So even if the two population means are in fact equal, ie $\mu_1 = \mu_2$, it can still happen that for any two samples from the two populations $\bar{x}_1 \neq \bar{x}_2$ (ie the sample means are not equal), simply because samples are only partial representations of the populations from which they are drawn.

To fully understand p values, and statistical significance for the NHSTP, it is necessary to understand sampling distributions, as these are the distributions from which p-values are obtained.

Say we are interested in statistically testing $H_0: \mu_1 = \mu_2$. Well, we don't have $\mu_1$ and $\mu_2$, but we can obtain $\bar{x}_1$ and $\bar{x}_2$ and calculate the difference $\bar{x}_2 - \bar{x}_1$. For all of the different possible samples from the two populations it is possible to consider plotting a histogram or distribution of these differences. It can be shown using some mathematical theory that the distribution of $\bar{x}_2 - \bar{x}_1$ will have a certain shape, which happens to be a normal distribution, ie a bell curve. Statistical packages have this built into them, and they use this to calculate the p value.

The distribution of a test statistic is called the sampling distribution for the statistic, since it is how the statistic will vary from sample to sample. Here the test statistic is the difference in the two means, and its sampling distribution will be normal with mean $\mu_2 - \mu_1$. It is a purely theoretical concept, and is never determined in practice. Luckily mathematical theory generally makes its actual calculation unnecessary!

While the difference of two means will have a normal distribution, other statistics can have other, quite different, distributions, but the principle of how the p value is calculated for NHSTP is the same every time.
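The idea of a sampling distribution can also be seen by simulation. The sketch below is purely illustrative (Python with numpy is assumed, and the two populations are made up): it repeatedly draws two samples from populations with equal means and records the difference in sample means, and those differences collectively form the bell curve described above.

```python
# Illustrative simulation of the sampling distribution of a difference in means.
# Both populations have the same mean (so H0 is true), yet individual sample
# differences are rarely exactly zero.
import numpy as np

rng = np.random.default_rng(0)
diffs = []
for _ in range(10_000):
    sample1 = rng.normal(loc=50, scale=10, size=30)   # population 1, mean 50
    sample2 = rng.normal(loc=50, scale=10, size=30)   # population 2, same mean 50
    diffs.append(sample2.mean() - sample1.mean())

diffs = np.array(diffs)
print(round(diffs.mean(), 2))   # close to 0, the true difference in means
print(round(diffs.std(), 2))    # close to sqrt(10**2/30 + 10**2/30), about 2.58
```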

Figure 1 (below) is derived from the example in section 8.2.3 of the main SPSS document from my web site: http://bbm.colmmcguinness.org/live/Advanced/SPSS%20BRM.pdf. It depicts the theoretical sampling distribution of the differences in means, along with an example of where our actually found difference might lie.


[Figure 1 annotations: the normal distribution of all possible differences of means, if H0 is true, comes from theory only. Our actual difference in means, for our given samples, is from somewhere in this sampling distribution of differences of means (ie the distribution of all possible differences in the two means). The population difference in means sits at the centre of the unknown sampling distribution of the sample difference in means; the actual sampling distribution is unknown to us, but we know from theory (the central limit theorem (CLT)) that the differences in means will be normally distributed, with a theoretically known standard deviation.]

Figure 1: Depiction of the theoretical sampling distribution of differences in means, and an actual difference in means.

The p value in this instance, for a 2 sided test, will be calculated from the red shaded area shown in Figure 2: It is the probability of achieving your specific difference in means or worse, on either side of the actual difference (which is theorised to be zero here if $H_0$ is true, ie $\mu_2 - \mu_1 = 0$).

[Figure 2 annotations: the p value is the (red) area under the sampling distribution from where your actual value occurred "outwards", on either side of the expected population difference if H0 is true, ie the population difference in means at the centre of the unknown sampling distribution. A two sided test includes both red shaded areas, whereas a one sided test would only include one of them, depending on how H1 was worded.]

Figure 2: Calculation of p value (ie sig) for a two sided test on the difference of two means.


5. Confidence Interval for a Population Statistic

Table 2 has the following subset of cells:

95% Confidence Interval of the Difference

  Lower        Upper
  -21.57678    5.00238
  -21.46868    4.89428

Table 4: Confidence interval from Table 2.

The first row, ie -21.57678 and 5.00238, tells us that we can be 95% confident⁹ that the actual population difference in means (ie $\mu_2 - \mu_1$), given only our sample information (ie $\bar{x}_2 - \bar{x}_1$), is between -21.57678 and 5.00238.

This is a very powerful form of information, as it includes both information for a hypothesis test, and also an effect size: If an interval is entirely above or below zero for a difference in means, then this would be the equivalent of the hypothesis test p value of p<0.05 (for a 95% confidence). Notice in Table 4 that both intervals include zero, which tells us that p>0.05 for both tests, which we can in fact see from Table 2, where we have the “sig.” values. Some research journals prefer confidence intervals (CIs) over and above p values. In fact some journals don’t allow general use of p values at all, and insist on CIs! The reason being that the CI contains more information, and in particular contains details of the effect size. Effect size is crucially important, as mentioned in Section 3.
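As a check on where the Table 4 numbers come from: the standard construction of a 95% confidence interval for a difference in means (not spelled out in the SPSS output) is the observed difference plus or minus a critical t value times the standard error of the difference. Using the first row of Table 2, where df = 998 and the critical value is roughly 1.962:

$$ -8.2872 \pm 1.962 \times 6.7723 \approx [-21.58,\ 5.00] $$

which matches the first row of Table 4 to rounding.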

⁹ This is technically not quite correct, but is a good starting point for a non-statistician. The technically correct version is: A 95% confidence interval, [a, b], from a sample statistic T for a population parameter θ means that 95% of such intervals, calculated from samples, will contain θ, the true population value. So a 95% confidence interval for a difference in means tells us that 95% of such intervals, calculated from different samples, will contain the true difference in means $\mu_2 - \mu_1$.

6. Example 1: Student Marks

A lecturer teaches two groups of students using different methods: Group 1 are taught using traditional lectures, and Group 2 are taught using online lectures. The marks recorded (out of 100) for both groups are shown in Table 5.

Group 1   Group 2
35        35
60        18
65        29
40        13
50        1
32        46
40        52
50        65
50        14
65        23
65        52

Table 5: Exam results for students receiving different teaching methods.

Create two appropriate variables in SPSS, as follows:

Enter the data as follows:


Carry out an Independent-Samples T Test from the Analyze/Compare Means/Independent-Samples T Test menu path as follows:


Drag the Mark variable into the “Test Variable(s)” box, and the Group variable into the “Grouping Variable”. Click the “Define Groups …” button, and enter 1 for group 1 and 2 for group 2. Click OK. You should now have the following:

Click OK to get the following output (which will not include shading of the columns):

There are two significance values to be interpreted, which is the main objective of this document:


- The "Levene's test for equality of variances¹⁰" (blue shaded cells) gives p=0.090 … What is the likely null hypothesis here? Well, a test for equality of variances is likely to have a null of: H0: Variance 1 = Variance 2, and an alternative of H1: Variance 1 not equal to Variance 2 … so now interpreting the "large" p value (p=0.090) means we do not reject H0. Our data do not suggest/evidence that H0 is incorrect, so we do not reject it¹¹. We can assume equal variances, and use the first row of T test results …

- The T Test result (grey shaded cells) gave p=0.007, which is “small”. Here we are testing for the equality of group means, so what is the likely null hypothesis? It is the nothing different/new/interesting, which here would be: H0: Group 1 mean = Group 2 mean, and an alternative of H1: Group 1 mean ≠ Group 2 mean. The small p value is telling us to reject H0: The gathered sample data do not agree with H0, so we should reject it in favour of H1. In real world terms this is now telling us that the means for the two groups seem to differ statistically significantly. If we look back at the actual mean mark values we see that Group 1 mean = 50.33 and Group 2 mean = 29.60, so Group 1 would appear to have done statistically significantly better than group 2.

- Note that this is a much stronger statement than simply looking at the means alone and saying that Group 1 mean is greater than the Group 2 mean: Why? If we did not perform the T Test then we only know that the Group 1 mean is better than Group 2 for this sample. After performing the T Test we now have information about the population!! Quite a different and far more important thing!!

- Another way to add a level of detail to the p value here would be to say that there is a 7 in 1000 chance (ie 0.007) of the given data, or worse, occurring by chance in our sample, if H0 is true.

Finally for this example, note the 95% confidence interval for the difference in population means, from the first row in the figure above is [6.478, 34.989]. The entire interval for the difference in means is above zero, so this tells us that we have a statistically significant effect at 95% confidence, and we also have the effect size for the population difference in means, so we could go on to interpret what the two possible extremes would mean for teaching practice. Using the confidence interval doesn’t give the full detail on the level of statistical significance, but does include effect size information, which is always an important factor, even if you are only using p values and not confidence intervals.
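If you wanted to check a comparison like this outside SPSS, a minimal sketch along the following lines would work. Python with scipy is assumed (not part of the SPSS workflow above), the two lists are simply the marks as listed in Table 5, and the exact t, p and confidence interval values will of course depend on the data exactly as entered in SPSS:

```python
# Sketch: Levene's test for equal variances, then an independent-samples
# t test, on two groups of exam marks.
from scipy import stats

group1 = [35, 60, 65, 40, 50, 32, 40, 50, 50, 65, 65]  # marks as listed in Table 5
group2 = [35, 18, 29, 13, 1, 46, 52, 65, 14, 23, 52]

# center='mean' matches the mean-centred Levene test reported by SPSS.
lev_stat, lev_p = stats.levene(group1, group2, center="mean")

# Use the pooled (equal variance) t test if Levene's p is "large", else Welch.
t, p = stats.ttest_ind(group1, group2, equal_var=(lev_p > 0.05))

print("Levene p =", round(lev_p, 3))
print("t =", round(t, 3), ", p =", round(p, 3))
```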

7. Example 2: Household Income by Churn

Open the SPSS telco.sav sample file.

¹⁰ It is an important consideration for an Independent Samples T Test whether the variances of the two groups are equal or not. This test checks that, which then allows the experimenter to judge which T Test answer to use.

¹¹ Remember that we don't accept H0 either! See Section 2 discussion of p small and p large.

Here it is assumed that you have completed and understood Example 1, so details are briefer.

There are many group comparisons we could carry out with this data, such as the Household Income by Gender, or Household Income by Retirement status, and so on. For this example we look at Household Income by Churn status¹². If you examine the churn variable you will see that it is coded as: 0=No, 1=Yes, for whether the customer has churned within the last month.

Carry out an Independent Samples T Test on Household Income by Churn. You should obtain the following output:

¹² People "churn" when they change supplier for a service.

- First look at the Levene's test, and we find that p=0.001. So we can reject the null hypothesis of equal variances, so for the rest of the table we should use the second row, which is the row that doesn't assume equal variances.

- The T Test significance is p=0.000, which is normally written for reports as p<0.001. This is a highly statistically significant result. It is VERY unlikely, by random chance alone, that we would obtain the mean difference in Household Income of 21.91083 (thousand) between those that churned in the last month and those that didn't, if the two population means were in fact equal.

- The relevant 95% confidence interval for the difference in means is [10.60170, 33.21996], so we are 95% confident that Household Income differs between the two population groups by between roughly 11,000 and 33,000.
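For illustration, the same comparison could be repeated outside SPSS along the following lines. This is a hedged sketch: it assumes Python with pandas, pyreadstat and scipy installed, and it assumes the telco.sav variables are named income and churn (check Variable View in SPSS for the actual names):

```python
# Sketch: Welch (unequal variance) t test on household income by churn status.
import pandas as pd
from scipy import stats

df = pd.read_spss("telco.sav")   # requires the pyreadstat package

# read_spss converts coded values to their labels by default; if it does not,
# compare against the numeric codes 0 (No) and 1 (Yes) instead.
stayed  = df.loc[df["churn"] == "No", "income"]
churned = df.loc[df["churn"] == "Yes", "income"]

# Levene suggested unequal variances here, so use the Welch version of the test.
t, p = stats.ttest_ind(stayed, churned, equal_var=False)
print(round(t, 3), p)
```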

8. Example 3: Selecting Subgroups

Open the SPSS telco.sav sample file.

Here it is assumed that you have completed and understood Example 1, so details are briefer. It is also assumed that you are familiar to some extent with the “Select Cases” command.

Sometimes you will want to compare across several subgroups, whereas a T test will only do two groups. One option is to use the SPSS Data/Select Cases menu path to select two groups at a time for comparison. As an example, say we wanted to compare the proportions churning¹³ across the 3 geographic regions held in the region variable of the telco.sav sample file. Although there are 5 Zones listed in the "Value labels" for the region variable, there are in fact zero entries for Zones 4 and 5, so we need only deal with Zones 1, 2 and 3.

To do a complete set of comparisons we must compare: 1 with 2, 1 with 3, then 2 with 3. We will need three T tests.

Use Data/Select Cases to select data from region=1 or region=2:

¹³ See the web site main SPSS document for information on comparing proportions. In essence, with a variable like churn, which consists of 0s and 1s, corresponding to No's and Yes's, if we calculate the mean we are dividing the total of the Yes's by the overall total number, which then gives the proportion churning. So testing the mean here is an approximate test on the proportion. It is not an exact test, and is particularly poor if the resulting proportion is near to either 0 or 1, or if the (group) sample size is small.

Now perform the usual Independent Samples T test on churn by region as follows:

To get output as follows:


Interpret the output.

Return to the Data/Select Cases menu path, and now change the filter to only include regions 1 or 3. Repeat the T Test, but now comparing Zones 1 and 3:


And output:

Interpret the output.

And finally repeat the process for Zones 2 and 3, to get output as follows:

Interpret the output.

Across all three comparisons there is no statistically significant evidence of any difference in the proportions who churn between zones. Hopefully having interpreted the 3 sets of output you can see why I can say this!

Another approach to this series of T Tests is to use ANOVA, which is a different test of differences in means. It carries with it some stricter assumptions, which we should ideally check before using the results from the ANOVA, but for the purpose of this document we will just focus on interpreting the output of the ANOVA test.

First cancel the “Select Cases” filter, by returning to “All cases”. Then select “One-Way ANOVA” as shown here:


Drag the variables as shown below:

Click OK to get the output:


Although this is perhaps a completely new test to you, and a lot of the output may mean nothing to you, notice that there is a “sig.” column!!

The null hypothesis here is just like before for the T tests, except it now includes all three means, from the three zones, so:

H0: The mean churn is the same across all zones.

H1: The mean churn is not the same across all zones.

Since p=0.817, p is “large” and we cannot reject H0, so for now we continue under the assumption that the means are equal across zones.
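The same kind of one-way ANOVA significance value can be produced outside SPSS. The sketch below is purely illustrative (Python with scipy assumed) and uses three made-up groups of 0/1 churn indicators rather than the real telco.sav zones:

```python
# Illustrative one-way ANOVA on three hypothetical groups of 0/1 churn values.
# The null hypothesis is that the mean churn (ie the churn proportion) is the
# same across all three groups.
from scipy import stats

zone1 = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]   # made-up churn indicators
zone2 = [0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
zone3 = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]

f, p = stats.f_oneway(zone1, zone2, zone3)
print("F =", round(f, 3), ", p =", round(p, 3))
# A "large" p means we cannot reject H0: no evidence of different churn proportions.
```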


9. Too Many Group Comparisons: Familywise Error Rates

Briefly …

Every time we conduct a statistical test and interpret the p value there is a chance that we are wrong. Say for some test p=0.02, and we reject the null hypothesis. There is still a 0.02 or 2% chance that we shouldn't have rejected the null, ie that we made a mistake … called a Type I error. Now say we conduct a number of tests in a sequence, such as in the last example, where we had 3 T tests in a sequence, and for the sequence here imagine that we reject the null hypothesis in each case, with p-values of 0.02, 0.04 and 0.03. Now the overall probability that we made at least one mistake can be as high as 0.02 + 0.04 + 0.03 = 0.09, which is now in the "large" arena for p values!
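To see why the chance of at least one mistake grows with the number of tests, the standard textbook result (not something taken from the SPSS output) for k independent tests, each carried out at significance level α, is:

$$ P(\text{at least one Type I error}) = 1 - (1 - \alpha)^k \approx k\alpha \ \text{for small } \alpha $$

For example, three independent tests each at α = 0.05 give $1 - 0.95^3 \approx 0.14$, noticeably larger than the 0.05 we intended for any single test.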

Basically it is not good practice to conduct too many individual tests to assert some overall hypothesis. Generally a sequence of individual tests can instead be replaced by a single test of a different type. The obvious example is from above: Replacing a series of Independent Samples T Tests with a single ANOVA test.

There are ways of coping with such families of tests, but all of that is beyond the scope of this document.
