
Public Health & Intelligence

Hypothesis Testing

Document Control
Version: 0.4
Date Issued: 29/11/2018
Authors: David Carr, Nicole Jarvie
Comments to: [email protected] or [email protected]

Version  Date        Comment                                                    Authors
0.1      24/08/2018  1st draft                                                  David Carr, Nicole Jarvie
0.2      11/09/2018  1st draft with formatting                                  David Carr, Nicole Jarvie
0.3      17/10/2018  Final draft with changes from Statistical Advisory Group   David Carr, Nicole Jarvie
0.4      29/11/2018  Final version                                              David Carr, Nicole Jarvie

Acknowledgements

The authors would like to acknowledge Prof. Chris Robertson and colleagues at the University of Strathclyde for allowing the use of the simulated dataset for the examples in this paper.

The simulated HAI data sets used in the worked examples were originally created by the Health Protection Scotland SHAIPI team in collaboration with the University of Strathclyde.


Table of Contents

1 Introduction ...... 1

2 Constructing a Hypothesis Test ...... 3
  2.1 Defining the Hypotheses ...... 3
    2.1.1 Null and Alternative Hypotheses ...... 3
    2.1.2 One-Sided and Two-Sided Tests ...... 4
  2.2 Significance Level and Statistical Power ...... 6
    2.2.1 Significance Level ...... 6
    2.2.2 Statistical Power ...... 6
    2.2.3 Type I and Type II Error ...... 7
  2.3 Test Statistic ...... 7
  2.4 Rejection Region ...... 8
  2.5 Determining Statistical Significance ...... 8
    2.5.1 P-values ...... 8
    2.5.2 Confidence Intervals ...... 9
    2.5.3 Comparing Results from One and Two-Sided Tests ...... 11
  2.6 Multiple Comparisons ...... 11
    2.6.1 The Bonferroni Correction ...... 12
  2.7 General Framework for Hypothesis Testing ...... 12

3 T-tests ...... 14
  3.1 One Sample t-test ...... 14
    3.1.1 Example ...... 16
  3.2 Two Sample t-test ...... 19
    3.2.1 Example ...... 20
  3.3 Paired t-test ...... 22

4 Non-Parametric Tests ...... 23
  4.1 Wilcoxon Signed Ranks Test ...... 23
    4.1.1 Example ...... 24
  4.2 Mann-Whitney U-test ...... 25
    4.2.1 Example ...... 25

5 Chi-Squared Tests ...... 27
  5.1 Goodness-of-Fit Test ...... 27
    5.1.1 Example ...... 28
  5.2 Test of Independence ...... 29
    5.2.1 Example ...... 30

6 Proportion Tests ...... 33
  6.1 One-Sample Test ...... 33
    6.1.1 Example ...... 35
  6.2 Two-Sample Test ...... 36
    6.2.1 Example ...... 38

7 F-tests ...... 40
  7.1 F-test for Equality of Variances ...... 40
    7.1.1 Example ...... 41
  7.2 F-test for Comparing Models ...... 42
    7.2.1 Example ...... 43

Bibliography ...... 47

Appendices ...... 48
  A Further Detail on Hypothesis Testing ...... 48
  B Further Detail on F-test for Comparing Linear Regression Models ...... 50
  C R Code for Examples ...... 52
  D SPSS Syntax for Examples ...... 57
  E SPSS Output for Examples ...... 63


1 Introduction

The purpose of this paper is to outline the theory behind hypothesis testing and to demonstrate how hypothesis testing can be used as part of a range of statistical methods. The paper will address the following statistical methods in the context of hypothesis testing: t-tests, non-parametric tests (the Wilcoxon Signed-Ranks test and the Mann-Whitney U-test), chi-squared tests, proportion tests and F-tests. Some preliminary mathematical and statistical knowledge is assumed.

Statistical hypothesis testing is about comparing two contradictory statements about one or more datasets and deciding which one is ‘correct’. For example, if an analyst was investigating if there was a difference in the average A&E waiting time between the Glasgow Royal Infirmary and the Royal Infirmary of Edinburgh, there are only two possible outcomes: either there is statistical evidence of a difference or there is not.

Statistical tests that are based in hypothesis testing usually involve comparing one dataset to another. The objective is often to see if there is any statistically-significant difference between the datasets based on a statistic of interest (e.g. a mean or proportion). Hypothesis testing theory is relevant here as the analyst is essentially investigating whether there is evidence of a difference and, if not, then concluding that there is no evidence of a difference.

Most of the tests discussed in this paper are examples of ‘univariate’ analysis, where only one variable of interest can be considered. If a multivariate analysis is required, where multiple contributing factors are taken into consideration, then Regression Modelling is usually more appropriate.

Table 1 summarises the tests that will be discussed in this paper. The theory behind hypothesis testing will be addressed first, before going on to discuss how the hypothesis-based tests in Table 1 can be used in practice, including showing examples in R (the equivalent SPSS syntax and output are shown in Appendices D and E, respectively).


Table 1: Summary of hypothesis tests

Test: One-sample t-test (Chapter 3)
When to use: Comparing whether the mean of a dataset is less than, greater than or equal to a specified number.
Major restrictions: Data must be normally-distributed and all subjects must be independent of each other.

Test: Two-sample t-test (Chapter 3)
When to use: Comparing whether the mean of a dataset is less than, greater than or equal to the mean of a different dataset.
Major restrictions: Both datasets must be normally-distributed and all subjects must be independent of each other.

Test: Paired t-test (Chapter 3)
When to use: Similar to the two-sample t-test except the two datasets have the same study subjects (e.g. a person's blood pressure before and after treatment).
Major restrictions: Data must be normally-distributed and each subject must have a value in both datasets (individual subjects must be independent of each other).

Test: Wilcoxon Signed-Ranks Test (Chapter 4)
When to use: Non-parametric equivalent of the one-sample t-test or paired t-test; can be used when the data are not normally-distributed.
Major restrictions: All subjects must be independent of each other.

Test: Mann-Whitney U-test (Chapter 4)
When to use: Non-parametric equivalent of the two-sample t-test; can be used when the data are not normally-distributed.
Major restrictions: All subjects must be independent of each other.

Test: Chi-squared test – goodness-of-fit (Chapter 5)
When to use: Tests whether the proportions of cases across categories in one variable are equal.
Major restrictions: The expected number of cases in each category must be sufficient (5 is a common rule-of-thumb).

Test: Chi-squared test – test of association (Chapter 5)
When to use: Tests if there is any association between two categorical variables.
Major restrictions: The expected number of cases in each category must be sufficient (5 is a common rule-of-thumb).

Test: One-sample proportion test (Chapter 6)
When to use: Comparing whether the proportion of times an event occurs in a dataset is less than, greater than or equal to a specified number.
Major restrictions: Variable of interest must be dichotomous (only two possible categories).

Test: Two-sample proportion test (Chapter 6)
When to use: Comparing whether the proportion of times an event occurs in one dataset is less than, greater than or equal to the proportion in a different dataset.
Major restrictions: Variables of interest must be dichotomous (only two possible categories).

Test: F-test for equality of variances (Chapter 7)
When to use: Comparing whether the variance of a dataset is less than, greater than or equal to the variance of a different dataset.
Major restrictions: Both datasets must be normally-distributed and all subjects must be independent of each other.

Test: F-test for comparing two linear regression models (Chapter 7)
When to use: Used as part of a model-selection process to decide whether a particular explanatory variable should be dropped from the model.
Major restrictions: Assumptions of linear regression models apply (discussed in the Regression Modelling paper).


2 Constructing a Hypothesis Test

This chapter will discuss the framework of a standard hypothesis test and how it can then be applied to the statistical tests which form the remainder of this paper.

A hypothesis test consists of four key components (Mendenhall and Beaver, 1994), all of which will be defined later in the chapter:

· A null hypothesis (usually denoted by the symbol 'H₀')

· An alternative hypothesis which contradicts the null hypothesis (usually denoted by the symbol 'H₁')

· A test statistic

· A rejection region

These components are used as part of a framework for designing and carrying out a hypothesis test. The framework will be outlined at the end of this chapter after a discussion of each of its fundamental elements.

2.1 Defining the Hypotheses

2.1.1 Null and Alternative Hypotheses

A hypothesis test essentially involves making a judgement about which one of two contradictory statements is 'true'. These statements are collectively known as 'study hypotheses' and individually as the null hypothesis (H₀) and the alternative hypothesis (H₁).

Ruling which hypothesis is 'true' involves making a judgement based on the results of a statistical test.

If the data are based only on a sample from a wider population, then it is not possible to conclude that H₀ is true, but simply that there is no evidence to reject H₀. It is generally safer to conclude that 'there is no evidence to reject H₀' or 'there is evidence to reject H₀ in favour of H₁', as opposed to 'H₀ is true' or 'H₁ is true', unless an analyst has access to the entire population in question. The hypothesis testing procedure assumes that H₀ is true unless proved otherwise at the conclusion, somewhat analogous to the legal maxim of "innocent until proven guilty, beyond a reasonable doubt".

When deciding how to define the null and alternative hypotheses in terms of the aims of a study, it should be remembered that the two hypotheses must be mutually exclusive, as only one can be true. The alternative hypothesis is almost always the outcome of interest being investigated; the null hypothesis is the inverse, usually being 'no change' or 'no difference'. Going back to the legal analogy, this means that the null hypothesis would mean "no evidence or not enough evidence of guilt", whereas the alternative hypothesis would mean "sufficient evidence of guilt". In statistical terms, 'sufficient evidence' would be determined by a p-value or confidence interval derived at the end of a hypothesis testing procedure (i.e. statistical significance). How much statistical 'evidence' is enough to be considered 'sufficient' is controlled by the significance level and power, discussed in Section 2.2.

2.1.2 One-Sided and Two-Sided Tests

When defining the null and alternative hypotheses, it must also be decided whether the test is to be one-sided or two-sided (sometimes referred to as ‘one-tailed’ or ‘two-tailed’). This depends on what the question of interest is. Going back to the A&E waiting times example introduced in Chapter 1, if one is comparing the average waiting time in Glasgow Royal Infirmary (GRI) to the Royal Infirmary of Edinburgh (RIE), there are three possible questions of interest that could be proposed:

1. Does GRI have a longer average waiting time than RIE? 2. Does GRI have a shorter average waiting time than RIE? 3. Does GRI have a different average waiting time than RIE (i.e. is it shorter or longer)?

If the analyst wishes to look at Question 1 or 2, then the test is one-sided, i.e. the "directional difference" (Mendenhall and Beaver, 1994) is only in one direction, either higher or lower. If the analyst is looking at Question 3, then the test is two-sided, i.e. the directional difference can be in either direction, higher or lower.

The null and alternative hypotheses for the above three possible tests are summarised in Table 2. As aforementioned, the alternative hypothesis is usually the outcome of interest and the null hypothesis contradicts this.

Table 2: Study hypotheses for A&E waiting times example (two-sample)

Question of Interest                                          Null Hypothesis      Alternative Hypothesis   Type of Test
Is the average waiting time (μ) in GRI greater than in RIE?   H₀: μ_GRI ≤ μ_RIE    H₁: μ_GRI > μ_RIE        One-sided
Is the average waiting time (μ) in GRI lower than in RIE?     H₀: μ_GRI ≥ μ_RIE    H₁: μ_GRI < μ_RIE        One-sided
Are the two average waiting times (μ) different?              H₀: μ_GRI = μ_RIE    H₁: μ_GRI ≠ μ_RIE        Two-sided

The example in Table 2 is a two-sample test, where one dataset is compared to a different one. It is also possible to construct a hypothesis test for comparing one dataset to some outcome of interest. For example, if an analyst wanted to test whether the average A&E waiting time in Glasgow Royal Infirmary was greater than 4 hours, less than 4 hours or not equal to 4 hours, then Table 2 would become Table 3. Table 3 is an example of a one-sample test.

Table 3: Study hypotheses for A&E waiting times example (one-sample)

Question of Interest                                            Null Hypothesis        Alternative Hypothesis   Type of Test
Is the average waiting time (μ) in GRI greater than 4 hours?    H₀: μ_GRI ≤ 4 hours    H₁: μ_GRI > 4 hours      One-sided
Is the average waiting time (μ) in GRI lower than 4 hours?      H₀: μ_GRI ≥ 4 hours    H₁: μ_GRI < 4 hours      One-sided
Is the average waiting time (μ) in GRI not equal to 4 hours?    H₀: μ_GRI = 4 hours    H₁: μ_GRI ≠ 4 hours      Two-sided


Although most hypothesis tests are two-sided, Table 3 is an example of where a one-sided test seems more sensible, as the aim is for A&E waiting times to be less than 4 hours (the second option). However, either Option 1 or 2 could be used as the final interpretation will be the same, albeit the results will be the inverse of each other.

2.2 Significance Level and Statistical Power

2.2.1 Significance Level

The significance level (usually denoted by the Greek letter alpha, α) is defined as the probability of wrongly rejecting the null hypothesis in favour of the alternative hypothesis when the null hypothesis is actually 'true'. α is usually set to 0.05; in other words, there is a 5% risk of rejecting H₀ in favour of H₁ when H₀ is 'true'. This is essentially equivalent to 95% confidence, which is the standard for most statistical testing. This concept can be expressed mathematically in the equation below:

    α = P(reject H₀ | H₀ is true)    (1)

As 95% confidence is the convention for most analyses, software such as R and SPSS usually set the significance level to 0.05 by default unless the user specifies otherwise.

2.2.2 Statistical Power

Statistical power is a concept closely related to the significance level and is the probability of correctly rejecting the null hypothesis when it is false. The power is typically denoted '1 − β', where β (the Greek letter beta) is the probability of not rejecting H₀ when H₀ is false. The power can be expressed mathematically as:

    1 − β = P(reject H₀ | H₀ is false)    (2)

Power is an important element in designed studies, such as clinical trials, since enough power is needed to detect statistically-significant differences. It is used to calculate the minimum sample size needed for the study to be robust. Ideally, the power should be as high as possible, but in most cases where the sample size is adequate, power does not generally need to be addressed.
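As an illustration of how power and minimum sample size relate, the sketch below works through a one-sided z-test with known variance. This is a hedged, simplified example (the effect size, sample size and Python/scipy code are all illustrative assumptions, not from this paper, whose worked examples use R and SPSS):

```python
import math

from scipy.stats import norm

alpha = 0.05  # significance level
d = 0.5       # hypothetical standardised effect size (difference / standard deviation)
n = 30        # sample size

# Power of a one-sided z-test: P(reject H0 | true standardised effect is d).
# The test rejects when Z > z_(1-alpha); under the alternative, Z ~ N(d*sqrt(n), 1).
power = norm.sf(norm.isf(alpha) - d * math.sqrt(n))

# Minimum n for 80% power: n >= ((z_(1-alpha) + z_(1-beta)) / d)^2.
beta = 0.20
n_min = math.ceil(((norm.isf(alpha) + norm.isf(beta)) / d) ** 2)
```

With these hypothetical inputs the power works out at roughly 0.86, and about 25 subjects would be needed for 80% power; in practice, designed studies use dedicated power-calculation routines (e.g. `power.t.test` in R) rather than this hand calculation.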

2.2.3 Type I and Type II Error

The decision process of choosing between the null and alternative hypotheses can result in one of two types of error (Mendenhall and Beaver, 1994), arising from the significance level (α) and statistical power (1 − β). These errors are known as 'Type I error' and 'Type II error', respectively. Ideally, the significance level should be small and the power should be large (Mendenhall and Beaver, 1994).

Type I error is the possible mistake arising from Equation 1, i.e. that the null hypothesis was rejected when it was true. Type II error is the possible mistake arising from Equation 2, i.e. that the null hypothesis was not rejected when it was false. These misclassifications can be summarised in this decision table, derived from Mendenhall and Beaver (1994):

Table 4: Decision table for study hypotheses

                     Null Hypothesis (H₀)
Decision             True                  False
Reject H₀            Type I error (α)      Correct decision
Accept H₀            Correct decision      Type II error (β)

2.3 Test Statistic

The test statistic is the ‘decision maker’ with regards to whether the null hypothesis is rejected or not. It is derived from the data and is used along with the rejection region (discussed next) to determine whether the values which make up the dataset being analysed are consistent with the null hypothesis being true. The formulation of the test statistic depends on the test (e.g. t-test, chi-squared test) and the various test statistics will be addressed in the following chapters. Different hypothesis-based statistical methods are ultimately distinguished from each other by their test statistic.


However, the test statistic is automatically calculated by software such as R and SPSS. It is often shown in software outputs as a number but it does not provide much meaningful information without the context provided by its position with regards to the rejection region. For more information on this, please see Appendix A.

2.4 Rejection Region

The decision whether to reject the null hypothesis or not is based on whether the test statistic falls within a 'rejection region' (sometimes called a 'critical region'). The rejection region is essentially a range of values for which it would be considered unlikely that the null hypothesis is correct. If the test statistic is within this range, then H₀ is rejected; if the test statistic is outside the rejection region, there is no evidence to reject H₀. For more information on this, please see Appendix A.

The rejection region is also defined automatically by statistical software and is ultimately based on the significance level and nature of the test that has been chosen.

2.5 Determining Statistical Significance

The test statistic and rejection region are fairly abstract concepts; when carrying out statistical tests in practice, statistical significance is determined by looking at the resulting p-value and/or confidence interval instead.

However, when using a p-value or confidence interval to interpret the results of the test in the context of the question of interest, any caveats must be acknowledged. These could include lower sample size than desired, data quality or completeness issues, the data not being entirely normally-distributed (for tests which assume this) etc.

2.5.1 P-values

The p-value is the probability of observing data at least as extreme as those actually observed, assuming the null hypothesis is true and all other assumptions underlying the hypothesis test are satisfied.


With a significance level of 0.05, the null hypothesis is typically rejected in favour of the alternative hypothesis when p-value < 0.05. The null hypothesis is not rejected if p-value ≥ 0.05. However, it is better to think of the p-value as a ‘spectrum of significance’ as opposed to a simple cut-off at 0.05. For example, p-value = 0.001 indicates much stronger evidence to support rejecting the null hypothesis than p-value = 0.04. In outputs, the exact p-value should be quoted and not just ‘< 0.05’, unless it is incredibly small e.g. < 0.001.

P-values, or rather the overreliance on and misunderstanding of p-values, have become a controversial topic in recent years. For example, one common misconception is that a p-value of 0.01 means that the null hypothesis has a 1% chance of being true. This is incorrect: the p-value says nothing about the probability that the null hypothesis is true. Rather, it means that if the null hypothesis were true, and all other test assumptions satisfied, then data as extreme as those observed would be expected in only 1% of repetitions of the study. Similarly, if the null hypothesis were true and the analysis was repeated 100 times at 95% confidence (α = 0.05), the null hypothesis would be wrongly rejected on about 5 occasions (Type I errors). Many authors recommend confidence intervals as being superior to p-values, or at least providing both in any outputs, as they should agree with each other.

For a much more in-depth discussion on p-values and what they mean, please refer to the following journal article.
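The long-run error rate described above can be checked by simulation. The sketch below (Python with scipy, purely illustrative; the paper's own examples use R and SPSS) repeatedly tests a true null hypothesis and confirms that roughly 5% of p-values fall below 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_sims, n = 10_000, 30

# Each simulated 'study' samples from N(0, 1) and tests H0: mu = 0,
# so the null hypothesis is true in every repetition.
rejections = 0
for _ in range(n_sims):
    sample = rng.normal(loc=0.0, scale=1.0, size=n)
    t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
    if p_value < alpha:
        rejections += 1

type_i_rate = rejections / n_sims  # should be close to alpha (0.05)
```

Any rejection in this simulation is, by construction, a Type I error, so the observed rejection rate estimates α directly.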

2.5.2 Confidence Intervals

A confidence interval often allows more detailed and useful interpretation of a test than a p-value. Since one of the main aims of statistical analysis is to quantify uncertainty, having a range of likely values is more advantageous than just a single number. A p-value is a probability between 0 and 1, whereas a confidence interval expresses the same information on statistical significance (or lack of statistical significance) on the same scale as the statistic being tested.

A confidence interval is a range of possible values for the statistic being tested (e.g. the difference between the means of two datasets). A 95% confidence interval can be


interpreted as follows: upon repeated sampling, there is a 95% probability that the confidence interval will contain the true value of the statistic in question. Like p-values, confidence intervals assume that the assumptions underlying the test are adequately satisfied.

A confidence interval is calculated using the formula below:

    Estimate of the statistic in question ± Critical value × Standard error    (3)

The formulae for the estimate and standard error are dependent on the test being used and are calculated directly from the data. The critical value is the threshold which separates the acceptance and rejection regions of the test statistic (Mendenhall and Beaver, 1994). It is a theoretical concept related to the probability distribution which underlies the test, but the analyst is not required to actually calculate it (this is done as a consequence of choosing a test and a significance level, and is performed automatically by statistical software).

The confidence interval for a one-sided test will almost always be bounded by either infinity (∞) or negative infinity (−∞), depending on the direction of the null hypothesis (i.e. greater or less than). The confidence interval for a two-sided test will be bounded by two numbers.

Returning to the one-sample A&E waiting times example in Table 3, if an analyst was carrying out a one-sided test for whether the average waiting time at Glasgow Royal Infirmary was less than 4 hours, and the 95% confidence interval was (−∞, 4.5), then the null hypothesis would not be rejected, as the value 4.0 is within this range. In other words, there is a 95% probability that the confidence interval would contain a value up to 4.5 if the analysis was repeated 100 times. If the confidence interval was (−∞, 3.5), then the null hypothesis would be rejected in favour of the alternative hypothesis, as 4.0 is not within this range, and it can be concluded that there is a 95% probability that the confidence interval would be bounded by a value up to 3.5 if the analysis was repeated 100 times.

For the two-sided test comparing the means of the two hospitals in Table 2, the null hypothesis is that the means are the same i.e. the difference between the means is zero. Therefore, if the confidence interval contains the value zero, then the null hypothesis is not


rejected as zero is a very plausible value of the difference between the two means; if it does not contain the value zero, then the null hypothesis is rejected.

2.5.3 Comparing Results from One and Two-Sided Tests

It should be noted that, if using a significance level of 0.05 for both a one-sided and a two-sided test, it is easier (although not necessarily desirable statistically) to return a statistically-significant result from the one-sided test. This is because the rejection region on the side of interest is larger for a one-sided test than for a two-sided test (see Appendix A for further details), so a statistically-significant result is more likely to occur. Therefore, if one carries out a one-sided test and a two-sided test on the same data, comparing the results directly should be avoided.

2.6 Multiple Comparisons

If we consider an example of a new blood pressure drug, it may be of interest to monitor the patients’ blood pressure over time (for example, at 1 month, 3 month and 6 month time periods) and consequently we would want to test for any statistically-significant changes over time simultaneously. We may think we could test each hypothesis separately using significance level α; however, if we were to test, say, 15 hypotheses with significance levels of 0.05, the probability of observing at least one significant result due to chance would be equal to:

    P(at least one significant result) = 1 − P(no significant results) = 1 − (1 − 0.05)¹⁵ ≈ 0.54    (4)

In the case of 15 tests, there is a 54% chance of observing at least one significant result by chance alone, even if every null hypothesis is in fact true. As the number of hypothesis tests increases, the probability of observing at least one significant result due to chance quickly rises well above the specified significance level.
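The arithmetic in Equation 4 is easy to reproduce (Python shown here for illustration; any calculator will do):

```python
alpha = 0.05   # per-test significance level
n_tests = 15   # number of hypothesis tests

# Probability of at least one false positive across all tests,
# assuming the tests are independent and every null hypothesis is true.
p_at_least_one = 1 - (1 - alpha) ** n_tests
```

This evaluates to about 0.54, matching the 54% quoted above; with 100 tests the same formula gives a probability above 0.99.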


2.6.1 The Bonferroni Correction

The Bonferroni correction method can be used to overcome issues with multiple comparisons. The method sets the significance cut-off to α/n, so in the above example with 15 tests and α = 0.05, the null hypothesis would only be rejected if the p-value is less than 0.05/15 ≈ 0.003.
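A minimal sketch of applying the correction to the example above (illustrative Python; the numbers come from the text):

```python
alpha = 0.05
n_tests = 15

# Bonferroni: compare each p-value against alpha / n rather than alpha.
cutoff = alpha / n_tests  # 0.05 / 15, roughly 0.0033

# For independent tests, the family-wise error rate under the corrected
# cut-off is 1 - (1 - alpha/n)^n, which is held at or below alpha.
fwer = 1 - (1 - cutoff) ** n_tests
```

Here `fwer` is about 0.049, back below the 0.05 target; the price is that each individual test becomes more conservative.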

2.7 General Framework for Hypothesis Testing

Taking all of these individual elements together, the general procedure for hypothesis testing is outlined below. Steps 5 to 8 in this framework are mostly theoretical and are usually performed automatically by the software (e.g. R, SPSS) being used in the analysis.

1. Plot the data (in order to gain an initial impression of the data and look for any obvious patterns) and look at summary statistics (e.g. mean, median, minimum and maximum)

2. Decide which test to use (based on the aim of the analysis and any inference from the plotting)

3. Check any assumptions which underlie the chosen test (e.g. are the data normally-distributed, if using a t-test; are the data independent, if required?)

4. Define the two mutually-exclusive study hypotheses, H₀ and H₁, and decide whether it is a one-sided or two-sided test

5. Set the significance level (denoted 'α' and usually 0.05, equivalent to 95% confidence)

6. Calculate the test statistic

7. Find the distribution of the test statistic when H₀ is true

8. Obtain the rejection region (RR) such that the probability of the test statistic (TS) being in the rejection region equals the significance level (e.g. 0.05), provided H₀ is true:

       P(TS ∈ RR | H₀ is true) = α    (5)

9. Make an assessment of the validity of H₀ (by means of a p-value or confidence interval)

10. Give a conclusion to the original question of interest using the inference gained from the test (and acknowledge any caveats)
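As a sketch of how this framework maps onto code, the following runs steps 4 to 10 for a two-sided, two-sample comparison. The waiting-time data are simulated and hypothetical, and Python/scipy is used purely for illustration (the paper's own worked examples use R and SPSS); steps 6 to 8 are handled internally by the software, exactly as described above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Steps 1-3 (plot, choose a test, check assumptions) are done interactively;
# here we simply simulate two hypothetical, normally-distributed samples.
gri = rng.normal(loc=4.0, scale=1.5, size=200)  # hypothetical GRI waiting times (hours)
rie = rng.normal(loc=3.0, scale=1.5, size=200)  # hypothetical RIE waiting times (hours)

# Step 4: H0: mu_GRI = mu_RIE versus H1: mu_GRI != mu_RIE (two-sided).
# Step 5: set the significance level.
alpha = 0.05

# Steps 6-8: the test statistic, its distribution under H0 and the
# rejection region are all computed internally by the software.
t_stat, p_value = stats.ttest_ind(gri, rie, equal_var=False)

# Steps 9-10: assess H0 via the p-value and state a conclusion.
reject_h0 = p_value < alpha
```

With a true difference of one hour built into the simulation, the test rejects H₀ here; if both groups were simulated with the same mean, it usually would not.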

The remainder of this paper will focus on several common hypothesis testing procedures and show examples of how to apply them to health data with software used in PHI.


3 T-tests

The t-test can be used to determine if the mean of a dataset significantly differs from a specific value or from the mean of another dataset.

For example, a one-sample t-test could be used to determine whether the mean waiting time in an A&E department differs from the Scottish Government’s set target of 4 hours.

A two-sample t-test could be used to determine whether the mean waiting time in A&E differs between males and females at the Glasgow Royal Infirmary.

3.1 One Sample t-test

The test can either be one-sided or two-sided. In the case of a one-sided test, the hypotheses can be written as follows (with mean μ and comparison value m):

    H₀: μ ≤ m,  H₁: μ > m    (6)
or
    H₀: μ ≥ m,  H₁: μ < m

And for a two-sided test, the hypotheses would be:

    H₀: μ = m,  H₁: μ ≠ m    (7)

The one-sample t-test has four main assumptions:

· The variable must be continuous.
· The observations must be independent of one another.
· The variable should be approximately normally distributed.
· The variable should not contain any outliers.

(Statistics Solutions – Manova Analysis One Sample T-test)


It is worth re-emphasising the importance of independence between the observations and how this should be considered carefully. For example, if we were interested in comparing the mean A&E waiting time between months in a particular hospital, some patients could have attended A&E in more than one month. As such, the assumption of independence would not be valid and this test would not be appropriate.

To test the null hypothesis, the following steps should be followed:

1. Calculate the sample mean:

       x̄ = (x₁ + x₂ + … + xₙ) / n    (8)

2. Calculate the sample standard deviation:

       s = √[((x₁ − x̄)² + (x₂ − x̄)² + … + (xₙ − x̄)²) / (n − 1)]    (9)

3. Calculate the test statistic:

       t = (x̄ − m) / (s / √n)    (10)

where:

· xᵢ is the i-th observation in the data
· n is the sample size
· m is the comparison value
· x̄ is the sample mean
· s is the sample standard deviation
· t is the t-statistic (the test statistic for a t-test)

4. Calculate the probability (p) of observing the test statistic under the null hypothesis. This is obtained by comparing t to a t-distribution with n − 1 degrees of freedom.

5. Set the significance level (α) and compare it to the p-value. If the p-value is less than or equal to α, reject the null hypothesis in favour of the alternative.
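The five steps can be carried out by hand and checked against a built-in routine. Below is a sketch in Python with scipy (toy numbers, not the HAI data; the paper's own examples use R and SPSS, and the `alternative=` argument needs a reasonably recent scipy):

```python
import numpy as np
from scipy import stats

x = np.array([5.1, 6.4, 7.2, 5.8, 6.9, 7.5, 6.1, 6.6])  # toy sample
m = 6.0                                                  # comparison value
n = len(x)

x_bar = x.mean()                    # Step 1: sample mean (Equation 8)
s = x.std(ddof=1)                   # Step 2: sample standard deviation (Equation 9)
t = (x_bar - m) / (s / np.sqrt(n))  # Step 3: test statistic (Equation 10)
p = stats.t.sf(t, df=n - 1)         # Step 4: one-sided p-value for H1: mu > m

# Step 5: compare p with alpha = 0.05; cross-check against scipy's routine.
t_check, p_check = stats.ttest_1samp(x, popmean=m, alternative="greater")

# A 95% two-sided confidence interval for the mean (see Equation 13 below):
half_width = stats.t.ppf(0.975, df=n - 1) * s / np.sqrt(n)
ci = (x_bar - half_width, x_bar + half_width)
```

The hand-calculated `t` and `p` agree exactly with the library values, which is a useful sanity check when learning the formulae.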


The t-test can also be used to construct a confidence interval for the population mean. This can be calculated as:

    ( x̄ − t₁₋α⁄₂(n − 1)·s/√n ,  x̄ + t₁₋α⁄₂(n − 1)·s/√n )    (11)

where:

    s² = (1/(n − 1)) Σᵢ₌₁ⁿ (xᵢ − x̄)²    (12)

In the case of a 95% confidence interval this can be simplified to:

    x̄ ± t₀.₉₇₅(n − 1)·s/√n    (13)

or, in more general terms, this can be thought of as (see Equation 3):

    estimate ± t-value × estimated standard error    (14)

3.1.1 Example

For the examples in this paper, a simulated dataset will be used. The dataset was created by the HPS SHAIPI team in collaboration with the University of Strathclyde. It consists of synthetic healthcare associated infection (HAI) data along with other related variables. The original analysis of these data was carried out by Prof. Chris Robertson and colleagues at the University of Strathclyde. The dataset, as well as some documentation on the variables in the dataset, can be found at this link (this can only be accessed when connected to the internal NHS National Services Scotland network). Some of the R code is included above specific plots in this section; however, the full R code used for this example is displayed in Appendix C (the equivalent SPSS syntax and output are shown in Appendices D and E, respectively).

After loading the data, patients with incomplete data on a number of variables are excluded from the analysis. There are a total of 8,555 complete cases. The aim of the analysis is to


test if the mean length of stay of the patients in the study is greater than ISD's reported average length of stay of 6.3 days or, in terms of a one-sided hypothesis:

    H₀: μ ≤ 6.3,  H₁: μ > 6.3    (15)

More information on ISD's reported average length of stay can be found by visiting: http://www.isdscotland.org/Health-Topics/Hospital-Care/Inpatient-and-Day-Case-Activity/.

Before carrying out a one-sided t-test we first need to check the underlying assumptions. The one-sample t-test requires the sample data to be numeric and continuous, as it is based on the mean. As the variable of interest is length of stay, the assumption of numeric and continuous sample data is satisfied.

Next it is of interest to check that the data are approximately normally distributed:

Figure 1: Histograms of total stay and the log transformation of total stay – (a) Length of Stay; (b) Log transformation of Length of Stay

[Figure: panel (a) shows a strongly right-skewed histogram of Total Stay; panel (b) shows an approximately symmetric histogram of log(Total Stay + 1)]


From Figure 1(a) it is clear that the data are right-skewed. Although the data are not normally-distributed, they can be transformed onto a different scale in order to use a t-test. The most commonly-used transformation is the log transformation, although many others are available (e.g. square root, inverse). For more information on transforming data to satisfy the assumption of normality, please refer to Section 2.3 of the Regression Modelling paper.

Since a patient cannot have a negative length of stay, a log-transformation may be appropriate. As the number zero cannot be log-transformed, each patient’s length of stay is shifted one whole number higher prior to the transformation (i.e. log(Total.Stay + 1)). Figure 1(b) shows the data on the logarithmic scale and is much closer to the normal distribution, so normality can be assumed.
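The effect of this transformation can be demonstrated on simulated right-skewed data. The sketch below uses hypothetical lognormal values, not the HAI dataset, and Python/scipy purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical right-skewed 'length of stay' values (lognormal distribution).
stay = rng.lognormal(mean=2.0, sigma=0.8, size=5000)

skew_raw = stats.skew(stay)              # strongly positive: long right tail
skew_log = stats.skew(np.log(stay + 1))  # near zero: roughly symmetric
```

The same `log(x + 1)` shift used in the text keeps the transformation defined when a value of zero occurs.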

We will also assume the observations are independent of one another and the data contain no outliers. If we were to look for possible outliers, the data could be plotted using a boxplot and, if extreme values were found, these would need to be removed before carrying out the analysis (as the mean is highly sensitive to outlying values).

A one-sided t-test can then be calculated for the transformed length of stay variable versus the transformed comparison value (6.3 + 1). The standard R output is displayed below:

> x <- log(6.3 + 1)
> t.test(log(HAI.data.subset$Total.Stay + 1), mu = x, alternative = "greater")

One Sample t-test

data:  log(HAI.data.subset$Total.Stay + 1)
t = 71.65, df = 8554, p-value < 2.2e-16
alternative hypothesis: true mean is greater than 1.987874
95 percent confidence interval:
 2.900431      Inf
sample estimates:
mean of x 
 2.921875

As the p-value is less than 0.05, we can therefore reject the null hypothesis H₀: μ ≤ 6.3 in favour of the alternative hypothesis H₁: μ > 6.3.


The 95% confidence interval is (2.90, ∞); however, to interpret this in terms of length of stay, the exponential needs to be taken and then one subtracted, as this is currently the interval for the log-transformed length of stay. The interval then becomes (17.18, ∞); therefore, it can be concluded that the population mean length of stay for patients in the dataset is highly likely to be at least 17.18 days, with a point estimate of 17.58 days.
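The back-transformation described above is simple arithmetic, sketched here in Python using the interval endpoints taken from the R output:

```python
import math

# Endpoints from the R output, on the log(Total.Stay + 1) scale
lower_log, point_log = 2.900431, 2.921875

# Reverse the transformation: exponentiate, then subtract the 1 added earlier
lower_days = math.exp(lower_log) - 1
point_days = math.exp(point_log) - 1

print(round(lower_days, 2))  # 17.18
print(round(point_days, 2))  # 17.58
```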

As the transformed interval does not contain the value 6.3, it provides the same conclusion as the p-value: there is sufficient evidence that the population mean is greater than 6.3.

3.2 Two Sample t-test

A two-sample t-test can be used to compare two independent groups to see if their means are different. The assumptions mentioned in Section 3.1 for the one sample t-test also apply to the two sample t-test along with the assumption that the two samples are independent of one another. For the two-sample t-test, the test statistic is:

t = (X̄₁ − X̄₂) / s_Δ    (16)

where:

· X̄₁ and X̄₂ are the respective sample means
· s_Δ is the standard error of the difference between the means:

s_Δ = √(s₁²/n₁ + s₂²/n₂)    (17)

· n₁ and n₂ are the respective sample sizes
· s₁ and s₂ are the respective sample standard deviations

In the case of a two-sample t-test, the probability of observing the test statistic under the null hypothesis (p) is obtained by comparing t to a t-distribution with n₁ + n₂ − 2 degrees of freedom.
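As a quick illustration of Equations 16 and 17, the statistic can be computed by hand for two small samples (the numbers below are made up purely for illustration):

```python
import statistics

# Two small made-up samples
x1 = [1, 2, 3]
x2 = [2, 4, 6]

mean1, mean2 = statistics.mean(x1), statistics.mean(x2)
var1, var2 = statistics.variance(x1), statistics.variance(x2)  # sample variances s^2
n1, n2 = len(x1), len(x2)

# Equation 17: standard error of the difference between the means
s_delta = (var1 / n1 + var2 / n2) ** 0.5

# Equation 16: two-sample t statistic
t = (mean1 - mean2) / s_delta

print(round(t, 4))  # -1.5492, compared to a t-distribution with n1 + n2 - 2 df
```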


The 95% confidence interval for a difference in population means is given by:

X̄₁ − X̄₂ ± t₀.₉₇₅(n₁ + n₂ − 2) × √(s_p²(1/n₁ + 1/n₂))    (18)

where s_p² is the pooled estimate of σ²:

s_p² = ((n₁ − 1)s₁² + (n₂ − 1)s₂²) / (n₁ + n₂ − 2)    (19)

3.2.1 Example

The aim of this analysis is to compare the mean length of stay of the patients in the study who have a hospital acquired infection (μ_Y) against those patients who do not have a hospital acquired infection (μ_N) or, in terms of a two-sided hypothesis test:

H₀: μ_Y = μ_N
H₁: μ_Y ≠ μ_N    (20)

Again, before carrying out the two-sample t-test, we first need to validate the assumptions.

We will again assume the observations are independent of one another, the data contain no outliers and the two samples are independent.

Figure 2(a) shows a histogram of the length of stay for each patient with a HAI and Figure 2(b) shows those patients without an infection.

Similar to before, as the data are right skewed, the length of stay variable is log transformed and each patient’s length of stay is shifted by one whole number (Figures 2(c) and 2(d)).


Figure 2: Histograms of Length of Stay by HAI – (a) Length of stay for patients with HAI equal to yes; (b) Length of stay for patients with HAI equal to no; (c) Log transformation of length of stay for patients with HAI equal to yes; (d) Log transformation of length of stay for patients with HAI equal to no

[Histograms omitted: (a) Total Stay, HAI = Yes; (b) Total Stay, HAI = No; (c) log(Total Stay + 1), HAI = Yes; (d) log(Total Stay + 1), HAI = No]

The variances of both populations are assumed to be unequal for the purposes of the test; however, this can be formally tested using an F-test, and further details can be found in Chapter 7.

The standard R output is displayed below:

> t.test(log(HAI.data.subset$Total.Stay[HAI.data.subset$HAI == "No"] + 1), + log(HAI.data.subset$Total.Stay[HAI.data.subset$HAI == "Yes"] + 1))

Welch Two Sample t-test

data:  log(HAI.data.subset$Total.Stay[HAI.data.subset$HAI == "No"] + 1) and log(HAI.data.subset$Total.Stay[HAI.data.subset$HAI == "Yes"] + 1)
t = -18.537, df = 590.77, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.9048984 -0.7315190
sample estimates:
mean of x mean of y 
 2.875202  3.693411


The 95% confidence interval for the difference in the population means (on the log scale) is (-0.90, -0.73) and, as this interval does not contain 0 and the p-value is less than 0.05, there is evidence of a statistically significant difference between the mean length of stay for patients with a HAI compared to those without. The mean difference is highly likely to lie between -0.90 and -0.73, with patients without an infection highly likely to have shorter stays.

3.3 Paired t-test

A paired t-test should be used in place of a two-sample t-test when the samples are dependent. For example, when one sample has been tested twice (repeated measures) or when two samples have been matched or “paired”.

If a group of patients with high blood pressure were trialling a new drug, a paired t-test could be used to compare the mean blood pressure of the group before and after using the drug for a specified time period.

The assumptions of the paired t-test are the same as those for the two-sample t-test; however, rather than assuming the samples are independent, they are assumed to be dependent.

The test statistic is calculated as:

t = d̄ / √(s²/n)    (21)

where:

· d̄ is the mean difference
· s² is the sample variance of the differences
· n is the sample size (number of pairs)

The p-value is then obtained by comparing t to a t-distribution with n − 1 degrees of freedom. The test is carried out and interpreted the same way as the two-sample t-test.
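A minimal sketch of Equation 21 in Python, using made-up before/after measurements (the blood pressure scenario above is the motivation; the numbers are invented):

```python
import statistics

# Made-up paired measurements (e.g. blood pressure before and after treatment)
before = [120, 130, 125, 140]
after = [115, 126, 124, 135]

# Differences between each pair
d = [b - a for b, a in zip(before, after)]  # [5, 4, 1, 5]

d_bar = statistics.mean(d)   # mean difference
s2 = statistics.variance(d)  # sample variance of the differences
n = len(d)

# Equation 21: paired t statistic, compared to a t-distribution with n - 1 df
t = d_bar / (s2 / n) ** 0.5

print(round(t, 3))  # 3.962
```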


4 Non-Parametric Tests

A non-parametric test, sometimes referred to as a distribution-free test, does not assume anything about the underlying data. A parametric test will make an assumption about the population's parameters or about the population data. For example, a t-test is a parametric test as it assumes the underlying data are normally distributed. Non-parametric tests are useful when the data are not normally distributed and cannot easily be transformed to be normally distributed either (e.g. with a log transformation, as shown in the examples in Chapter 3).

4.1 Wilcoxon Signed Ranks Test

If we only have one sample of data, a frequent question of interest is the average value of a specific variable in the population.

The Wilcoxon Signed Ranks test is an example of a non-parametric test that can be used to investigate the null hypothesis that the median (η) of the distribution has a specified value (m). The hypotheses can be written as:

H₀: η = m
H₁: η ≠ m    (22)

This is an alternative to the one-sample t-test and is based on the differences between each of the sampled data points and the specified value for the median. If half of the differences are positive and half of the differences are negative, we have no evidence to reject the null hypothesis. If this is not the case, we have evidence to reject the null hypothesis.

The test can also be used for paired data to test the null hypothesis that the median difference between paired data points is equal to 0. In contrast to the one-sample test the difference between each pair of points is computed and then, under the null hypothesis, we would expect half of the differences to be positive and half to be negative. If this is not the case we again have evidence to reject the null hypothesis.


The test has the following assumptions:

· The observations are random and independent
· The observations come from a symmetric distribution

The test statistic is calculated as follows:

· Let = where is the sample value 푡ℎ · Calculate푑푖 푥푖 −for 푚 each sample푥푖 value푖 and then rank these values · Compute 푑the푖 sum of the positive and negative ranks, + and respectively 푊 푊 − The test statistic is whichever is the smallest of W + and W and then the p-value can be calculated at the specified significance level. −

4.1.1 Example

Referring back to the HAI dataset, we can use the Wilcoxon Signed Ranks test to test the hypothesis that the median stay time for males in the study is equal to 10 days or in terms of the study hypotheses:

H₀: η_males = 10
H₁: η_males ≠ 10    (23)

We will assume the observations are random and independent and, after a log transformation, the distribution appears approximately symmetric (note that, because of the +1 shift, the comparison value becomes log(10 + 1) = log(11)):

Figure 3: Boxplot of Length of Stay for male patients


> wilcox.test(log(HAI.data.subset$Total.Stay[HAI.data.subset$Sex == "Male"] + 1), mu = log(11))

Wilcoxon signed rank test with continuity correction

data:  log(HAI.data.subset$Total.Stay[HAI.data.subset$Sex == "Male"] + 1)
V = 4905500, p-value < 2.2e-16
alternative hypothesis: true location is not equal to 2.397895

When running the test in R we can see the p-value is less than 0.05. We can reject H₀ and conclude there is sufficient evidence of a difference from 10 days.

4.2 Mann-Whitney U-test

The Mann-Whitney U-test can be used in place of the two sample t-test when data are not normally distributed. It can be used to derive a test that compares the medians of two independent populations.

The idea behind the test is that if there is no difference between populations, e.g. in terms of their medians, then it should be possible to collect all data together (ignoring the populations) and rank the magnitude of the values. If the ranks are then aggregated for each population, the answers should be roughly the same. If they are very different, then there is evidence that the values of the two populations are not the same.
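The rank-and-aggregate idea can be sketched in Python (toy data, and the sketch assumes no tied values for simplicity; the function name is illustrative):

```python
def rank_sum_u(a, b):
    """Rank the pooled data and return the U statistic for group a (no ties assumed)."""
    pooled = sorted(a + b)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # rank 1 = smallest value
    w_a = sum(rank[v] for v in a)                    # rank sum for group a
    # Mann-Whitney U for group a: rank sum minus its minimum possible value
    return w_a - len(a) * (len(a) + 1) / 2

print(rank_sum_u([1, 3, 5], [2, 4, 6]))  # 3.0
```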

The assumptions of the Mann-Whitney U-test are:

· The observations are independent
· The distribution of the variable of interest has the same shape and spread in the two populations

4.2.1 Example

Referring back to the HAI dataset, we can use the Mann-Whitney U-test to test the hypothesis that the median stay time for males in the study is equal to the median stay for females:


H₀: η_males = η_females
H₁: η_males ≠ η_females    (24)

We will assume the observations are random and independent and, again, after a log transformation of the stay variable the distribution appears approximately symmetric for both populations:

Figure 4: Boxplot of Length of Stay for male and female patients

> wilcox.test(log(HAI.data.subset$Total.Stay[HAI.data.subset$Sex == "Male"] + 1),log(HAI.data.subset$Total.Stay[HAI.data.subset$Sex == "Female"] + 1),alternative = "two.sided")

Wilcoxon rank sum test with continuity correction

data:  log(HAI.data.subset$Total.Stay[HAI.data.subset$Sex == "Male"] + 1) and log(HAI.data.subset$Total.Stay[HAI.data.subset$Sex == "Female"] + 1)
W = 8752400, p-value = 0.009582
alternative hypothesis: true location shift is not equal to 0

The results show a p-value of approximately 0.01 and, since this is less than 0.05, we can reject H₀. Therefore there is evidence that the median length of stay differs between males and females. The results provide the same conclusion as the two sample t-test in Chapter 3. It should be noted that the R function for the Mann-Whitney U-test is wilcox.test(), despite the different name of the test.


5 Chi-Squared Tests

The t-tests and non-parametric tests discussed in Chapters 3 and 4, respectively, can only be used for continuous (numeric) data.

The chi-squared (or χ²) test, named after the Greek letter chi, is a hypothesis test that has two common applications: to compare frequencies between the categories of one categorical (non-numeric) variable (a goodness-of-fit test) or to test if there is any association between two different categorical variables (a test of independence). Both tests involve comparing the number of cases which are expected to those which are actually observed. The chi-squared test can be used in other contexts, such as testing the goodness-of-fit of a model, but these will not be addressed here.

5.1 Goodness-of-Fit Test

For a goodness-of-fit chi-squared test, the study hypotheses would be defined as follows:

H₀: all frequencies are equal
H₁: at least some frequencies are not equal    (25)

For a theoretical categorical variable with three categories (as an example), the cell probabilities are essentially the frequency of cases which are attributed to each category.

Table 5: Example of data for chi-squared goodness-of-fit test

                     Category A   Category B   Category C   Total
Number of cases          A            B            C          N
Frequency of cases    a = A/N      b = B/N      c = C/N       1

where N is the total number of cases in the dataset. Under the null hypothesis, the expected frequency of cases in all categories (a, b and c) should be equal (i.e. a = b = c). If


one or more categories have a significantly higher or lower frequency of cases than expected under H₀, then the null hypothesis would be rejected.

The chi-squared test statistic is defined as follows, where o is the observed number of cases and e is the expected number of cases under the null hypothesis:

χ² = Σ (o − e)² / e    (26)

The chi-squared statistic combines all differences between the expected number of cases and observed number of cases in the data into one measure of the distance between the observed data and the expected data under the null hypothesis.

If the p-value for the test is less than 0.05, then the null hypothesis would be rejected and it would be concluded that at least one of the categories has a significantly higher or lower frequency of cases than expected. If the p-value is greater than 0.05, the null hypothesis would not be rejected and it can be concluded that there is no evidence that the frequencies differ from one another. Confidence intervals are not typically used for chi-squared tests, and the test does not have separate one-sided and two-sided versions.

Like most statistical techniques, chi-squared tests require a sufficient sample size to be robust. A widely-used rule of thumb is that all expected cell counts must be at least 5 (Mendenhall and Beaver, 1994). When the sample size is particularly small, a related test called Fisher's exact test may be appropriate, which is outwith the scope of this paper.

5.1.1 Example

The simulated healthcare associated infection (HAI) dataset will be used again for this example. This is a fairly straightforward example based on the number of people who develop a healthcare associated infection or not during their stay in hospital. The hypotheses can be defined as follows (where N is the number of patients):


H₀: N_HAI present = N_no HAI present
H₁: N_HAI present ≠ N_no HAI present    (27)

Since there are 8,555 cases, it would be 'expected' under the null hypothesis that half of all patients (4,277.5) would develop an HAI and the other 4,277.5 would not (i.e. equal frequencies). However, the observed data are as follows:

> chisq.test(table(HAI.data.subset$HAI))$observed

  No  Yes 
8067  488

From simply looking at the above output, it is clear that there is a substantial difference in the numbers, and the results of the chi-squared test below confirm this:

> chisq.test(table(HAI.data.subset$HAI))

Chi-squared test for given probabilities

data:  table(HAI.data.subset$HAI)
X-squared = 6714.3, df = 1, p-value < 2.2e-16

The χ² statistic (the test statistic) is derived as follows:

χ² = Σ (o − e)²/e = (8067 − 4277.5)²/4277.5 + (488 − 4277.5)²/4277.5 = 6714.347    (28)

The χ² statistic (6714.3) is very large and this is a consequence of the magnitude of the difference between the number of patients with an HAI and those without. The p-value is incredibly small, so the null hypothesis can be rejected and it can be concluded that the frequencies of people with and without HAIs are significantly different.
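The arithmetic in Equation 28 can be checked with a few lines of Python, using the observed counts from the R output above:

```python
observed = [8067, 488]              # patients without and with an HAI
expected = [sum(observed) / 2] * 2  # equal frequencies under H0: 4,277.5 each

# Equation 26: chi-squared statistic
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(round(chi2, 3))  # 6714.347, matching the R output (X-squared = 6714.3)
```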

5.2 Test of Independence

In a chi-squared test of independence, the study hypotheses would be defined as follows:


(29) 푯ퟎ ∶ 풏풐 풂풔풔풐풄풊풂풕풊풐풏 풃풆풕풘풆풆풏 풕풉풆 풕풘풐 풗풂풓풊풂풃풍풆풔 푯ퟏ ∶ 풔풐풎풆 풂풔풔풐풄풊풂풕풊풐풏 풃풆풕풘풆풆풏 풕풉풆 풕풘풐 풗풂풓풊풂풃풍풆풔 It should be emphasised that association does not imply causation. The data for two categorical variables can be summarised in a , an example of which is shown in Table 6. Variable A has two categories and Variable B has three categories.

Table 6: Example of data for chi-squared test of independence

                              Variable B
                 Category B.1  Category B.2  Category B.3          Total
Variable A
  Category A.1        a             b             c             a + b + c
  Category A.2        d             e             f             d + e + f
  Total             a + d         b + e         c + f   a + b + c + d + e + f

Under the null hypothesis, the expected number of cases in each cell (a, b, c, d, e and f) depends only on the row and column totals: each expected count is the cell's row total multiplied by its column total, divided by the overall total (see Equation 31).

The test statistic of the chi-squared test of independence is the same as for the goodness- of-fit variant.

5.2.1 Example

The simulated healthcare associated infection (HAI) dataset will be used again for this example. The question of interest is whether there is an association between the size of a hospital and the number of HAIs and the hypotheses can be defined as follows:

H₀: no association between hospital size and the number of HAIs
H₁: some association between hospital size and the number of HAIs    (30)

The 'Hospital Size' variable has three categories ('Small', 'Medium' and 'Large') and the 'HAI' variable has two categories ('Yes' and 'No'). The data can be summarised in this contingency table:


> table(HAI.data.subset$Hospital.Size, HAI.data.subset$HAI)

           No  Yes
  Large  5465  334
  Medium 2163  122
  Small   439   32

The percentage of cases falling into each possible combination of categories can also be calculated:

> round(prop.table(table(HAI.data.subset$Hospital.Size, HAI.data.subset$HAI), 1)*100, 1)

           No  Yes
  Large  94.2  5.8
  Medium 94.7  5.3
  Small  93.2  6.8

There does not appear to be much difference in the percentages between the rows (varying between 5.3% and 6.8%). The results of the chi-squared test are displayed below:

> chisq.test(HAI.data.subset$Hospital.Size, HAI.data.subset$HAI)

Pearson's Chi-squared test

data:  HAI.data.subset$Hospital.Size and HAI.data.subset$HAI
X-squared = 1.6392, df = 2, p-value = 0.4406

The p-value is 0.44, so it can be concluded that there is no evidence of an association between the size of a hospital and HAI rates. The expected number of cases can be displayed using the syntax below. These numbers are not that different from the observed data. It should also be noted that all expected cell counts are greater than 5, a recommended minimum.

> chisq.test(HAI.data.subset$Hospital.Size, HAI.data.subset$HAI)$expected

                             HAI.data.subset$HAI
HAI.data.subset$Hospital.Size        No      Yes
                       Large  5468.2096 330.7904
                       Medium 2154.6575 130.3425
                       Small   444.1329  26.8671

These expected numbers can be calculated by using the formula:


expected number = (row sum × column sum) / total sum    (31)

For example, the expected number of 'No' HAI cases in 'small' hospitals can be calculated as follows:

expected number = ((439 + 32) × (5465 + 2163 + 439)) / (5465 + 334 + 2163 + 122 + 439 + 32) = 444.1329    (32)

A final point is that if one attempts to carry out a chi-squared test on a 2x2 contingency table (i.e. the two variables have only two categories each), R and SPSS automatically apply Yates' continuity correction, which adjusts the chi-squared test statistic from Equation 26 as follows:

χ² = Σ (|o − e| − 0.5)² / e    (33)

This adjustment is made because a 2x2 table has only one degree of freedom, and without the correction the chi-squared approximation can overstate the evidence against the null hypothesis. However, this does not affect the interpretation of the output.
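As an illustration of Equations 31 and 33 together, the Yates-corrected statistic can be computed by hand for the 2x2 Sex and HAI table used in Chapter 6:

```python
# Observed 2x2 table: rows = Female, Male; columns = No HAI, HAI
observed = [[4469, 260], [3598, 228]]

row_sums = [sum(r) for r in observed]                       # [4729, 3826]
col_sums = [sum(r[j] for r in observed) for j in range(2)]  # [8067, 488]
total = sum(row_sums)                                       # 8555

chi2 = 0.0
for i in range(2):
    for j in range(2):
        e = row_sums[i] * col_sums[j] / total             # Equation 31
        chi2 += (abs(observed[i][j] - e) - 0.5) ** 2 / e  # Equation 33 (Yates)

print(round(chi2, 4))  # 0.7529, matching the Yates-corrected X-squared in Chapter 6
```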


6 Proportion Tests

The proportion test (also known as a 'binomial proportion test') is a closely-related method to the chi-squared test but relies on a different probability distribution. The dataset(s) of interest must be in a binary (dichotomous) format. Binary data typically follow the binomial distribution, but this can be approximated by the more common normal distribution when the sample size is large enough, and it is this approximation that proportion tests are based on. Proportion tests and chi-squared tests should produce comparable results when the sample size is sufficiently large. Comparing proportions was also considered in the PHI Trend Analysis paper.

There are one-sample and two-sample varieties of the test, which will be considered separately as they have slightly different formulations.

6.1 One-Sample Test

The purpose of the one-sample test is to compare the proportion of interest in a dataset to a comparison value. For example, if an analyst wished to check whether there was a gender difference in the number of patients being admitted to A&E in a particular hospital, they could test whether the proportion of male patients in the hospital differed from 0.5.

For a one-sided test, the study hypotheses would be defined as follows (for proportion p and comparison value m):

H₀: p ≥ m
H₁: p < m

or

H₀: p ≤ m
H₁: p > m    (34)

For a two-sided test, the study hypotheses would be defined as:


H₀: p = m
H₁: p ≠ m    (35)

The test statistic (TS) for the test is as follows:

TS = (p̂ − m) / √(p̂ × q̂ / n)    (36)

where:

· n is the number of data points
· p̂ is the proportion of times the event of interest occurs in the data (i.e. p̂ = x/n) and x is the number of times the event occurs
· q̂ is the proportion of times the event does not occur and is calculated as q̂ = 1 − p̂
· m is the comparison value that the proportion p̂ is being compared to

To compute a confidence interval, the standard error (SE) for the test is:

SE = √(p̂ × q̂ / n)    (37)

For a two-sided test, if the p-value for the test is less than 0.05, then the null hypothesis would be rejected and it would be concluded that the proportion p̂ is significantly different from the comparison value m. Alongside this, the confidence interval for the test would not include m (meaning it is not a likely value for p̂).

For a one-sided test, if the p-value is less than 0.05, then the null hypothesis would also be rejected and it would be concluded that the proportion p̂ is higher or lower than m, depending on how the study hypotheses are framed. A one-sided confidence interval for a proportion is bounded by either 0 or 1 instead of negative or positive infinity.
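Equations 36 and 37 can be sketched in Python using the counts from the example in Section 6.1.1 (3,826 males among 8,555 patients). Note that this plain statistic is only an approximation to R's prop.test() output, which applies a continuity correction, so the values will not match exactly:

```python
# Counts from Section 6.1.1: 3,826 males among 8,555 patients
x, n, m = 3826, 8555, 0.5

p_hat = x / n     # observed proportion, ~0.4472
q_hat = 1 - p_hat

se = (p_hat * q_hat / n) ** 0.5  # Equation 37
ts = (p_hat - m) / se            # Equation 36

print(round(ts, 2))  # about -9.82: far from 0, consistent with the tiny p-value
```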


6.1.1 Example

An example using the healthcare associated infection (HAI) data will be used to test whether the proportion of male patients in the data is significantly different from female patients (i.e. is the male proportion equal to 0.5). This is a two-sided test. The hypotheses can be defined as follows, where p is the proportion of patients:

H₀: p_male = 0.5
H₁: p_male ≠ 0.5    (38)

Some software, such as R, requires the data to be in a binary format. For this example, the 'Sex' variable must be recoded from 'Male' and 'Female' to '0' and '1'.

> table(HAI.data.subset$Sex)

Female   Male 
  4729   3826

> table(HAI.data.subset$Sex_Binary)

   0    1 
3826 4729

The proportion of males in the dataset is 0.4472238; the proportion of females is 1 minus this i.e. 0.5527762. The proportion test can be carried out using the code below and the results are also displayed. R automatically assumes the specified value being tested against the proportion is 0.5 unless otherwise stated.

> prop.test(table(HAI.data.subset$Sex_Binary))

1-sample proportions test with continuity correction

data:  table(HAI.data.subset$Sex_Binary), null probability 0.5
X-squared = 95.103, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.4366556 0.4578397
sample estimates:
        p 
0.4472238

Since the p-value is very small (< 0.001), the null hypothesis is rejected and it can be concluded that the proportion of males in the dataset is significantly different from 0.5. Here, the proportion of males is significantly lower, as the proportion from the data is less than 0.5 (approximately 0.45). The 95% confidence interval does not contain the comparison value (0.5), so it concurs with the p-value that the null hypothesis should be rejected.

6.2 Two-Sample Test

The purpose of a two-sample proportion test is to test whether the proportion of times an event occurs in a group is significantly different or higher/lower than in another group (two- sided and one-sided tests, respectively). Mendenhall and Beaver (1994) describe an example where 52 men in a sample of 1,000 men are admitted to a specific hospital due to heart disease compared to 23 women in a sample of 1,000 women. A proportion test can be used to test whether there is a higher rate of heart disease among men who are admitted to the hospital.

For a one-sided test, the study hypotheses would be defined as follows (where p₁ is the proportion in group 1 and p₂ is the proportion in group 2):

H₀: p₁ ≥ p₂
H₁: p₁ < p₂

or

H₀: p₁ ≤ p₂
H₁: p₁ > p₂    (39)

For a two-sided test, the study hypotheses would be defined as follows:

H₀: p₁ = p₂
H₁: p₁ ≠ p₂    (40)

The test statistic (TS) for the test is as follows:


TS = (p̂₁ − p̂₂ − D) / √(p₁q₁/n₁ + p₂q₂/n₂)    (41)

where:

· n₁ and n₂ are the number of data points in group 1 and 2, respectively
· p₁ is the proportion of times the event of interest occurs in group 1 (i.e. p₁ = x₁/n₁) and x₁ is the number of times the event occurs in group 1; p̂₁ is the estimate of p₁
· q₁ is the proportion of times the event does not occur in group 1
· p₂ is the proportion of times the event of interest occurs in group 2 (i.e. p₂ = x₂/n₂) and x₂ is the number of times the event occurs in group 2; p̂₂ is the estimate of p₂
· q₂ is the proportion of times the event does not occur in group 2
· D is the specified difference being tested (D = 0 when testing whether the two proportions are the same, i.e. p₁ = p₂ or, equivalently, p₁ − p₂ = 0)

However, if the null hypothesis is that there is no difference between the two groups (i.e. D = 0), it is essentially being hypothesised that p₁ = p₂ = p. Therefore, Equation 41 can be condensed into Equation 42 in such cases because "the best estimate of p is obtained by pooling the data from both samples" (Mendenhall and Beaver, 1994).

TS = (p̂₁ − p̂₂) / √(p̂ × q̂ × (1/n₁ + 1/n₂))    (42)

where:

p̂ = (x₁ + x₂) / (n₁ + n₂)    (43)

Equation 43 is called the 'pooled proportion' and q̂ = 1 − p̂. To compute a confidence interval, the standard error (SE) for the test is the denominator of the test statistic:


SE = √(p̂ × q̂ × (1/n₁ + 1/n₂))    (44)

For a two-sided test, if the p-value for the test is less than 0.05, then the null hypothesis would be rejected and it would be concluded that the proportions are different between the two groups. Alongside this, the confidence interval for the test would not include the value D (this is 0 when testing if there is no difference between the proportions). The confidence interval for a two-sided test will also give an indication as to which of the two groups has the higher proportion, depending on whether it is wholly positive or negative.

For a one-sided test, if the p-value is less than 0.05, then the null hypothesis would also be rejected and it would be concluded that one of the proportions is greater or lower than the other, depending on how the study hypotheses are framed. A one-sided confidence interval for a difference in proportions is bounded by either -1 or 1 instead of negative or positive infinity.
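A sketch of Equations 42 and 43 in Python, using the Sex and HAI counts from the example that follows (228 of 3,826 males and 260 of 4,729 females contracting an HAI). The pooled statistic here has no continuity correction, so it will differ slightly from R's X-squared output:

```python
# Counts from Section 6.2.1: HAIs among males and females
x1, n1 = 228, 3826  # males with an HAI, total males
x2, n2 = 260, 4729  # females with an HAI, total females

p1_hat, p2_hat = x1 / n1, x2 / n2

# Equation 43: pooled proportion under H0 (p1 = p2)
p_pool = (x1 + x2) / (n1 + n2)
q_pool = 1 - p_pool

# Equation 42: pooled test statistic
se = (p_pool * q_pool * (1 / n1 + 1 / n2)) ** 0.5
ts = (p1_hat - p2_hat) / se

print(round(ts, 2))  # about 0.91: small, consistent with no significant difference
```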

6.2.1 Example

The example in 6.1.1 will be extended to investigate patients contracting HAIs based on gender, like the heart disease example described above. The hypotheses can be defined as follows, where p is the proportion of patients contracting an HAI:

H₀: p_male = p_female
H₁: p_male ≠ p_female    (45)

For this example, the ‘Sex’ variable must be recoded in R from ‘Male’ and ‘Female’ to ‘0’ and ‘1’; the ‘HAI’ variable is also recoded from ‘No’ and ‘Yes’ to ‘0’ and ‘1’.

> table(HAI.data.subset$Sex, HAI.data.subset$HAI)

           No  Yes
  Female 4469  260
  Male   3598  228


> addmargins(table(HAI.data.subset$Sex_Binary, HAI.data.subset$HAI_Binary, + dnn = c("Sex", "HAI")))

     HAI
Sex      0    1  Sum
  0   3598  228 3826
  1   4469  260 4729
  Sum 8067  488 8555

The proportion test can be carried out using the code below and the results are also displayed:

> prop.test(table(HAI.data.subset$Sex_Binary, HAI.data.subset$HAI_Binary))

2-sample test for equality of proportions with continuity correction

data:  table(HAI.data.subset$Sex_Binary, HAI.data.subset$HAI_Binary)
X-squared = 0.75291, df = 1, p-value = 0.3856
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.014772136  0.005547431
sample estimates:
   prop 1    prop 2 
0.9404077 0.9450201

Since the p-value is much greater than 0.05 and the 95% confidence interval contains the value 0 (indicating no difference), then it can be concluded that there is no difference in the proportion of males and females contracting HAIs.

Since it is a similar test, a chi-squared test can be performed on these data as well, with the result displayed below. The same answer is obtained; notice that Yates' continuity correction (see Chapter 5) has been applied automatically here. For a chi-squared test, the original variables or the converted binary versions can be used, achieving the same outcome.

> chisq.test(HAI.data.subset$Sex, HAI.data.subset$HAI)

Pearson's Chi-squared test with Yates' continuity correction

data:  HAI.data.subset$Sex and HAI.data.subset$HAI
X-squared = 0.75291, df = 1, p-value = 0.3856


7 F-tests

F-tests are any statistical tests which have an F-distribution under the null hypothesis. F-tests can be used to test the hypothesis that two samples come from populations with the same variance. They can also be used in regression modelling to compare the fits of different linear models. F-tests assume the data are normally distributed, and this assumption should be tested prior to carrying out the test using the steps described in the t-test chapter (Chapter 3).

7.1 F-test for Equality of Variances

As mentioned in Chapter 3 and above, the F-test can be used to test if the variances of two populations are equal. For example, the F-test could be used to test whether or not the variance of waiting times at Glasgow Royal Infirmary (GRI) differs from that at the Royal Infirmary of Edinburgh (RIE).

In the example of testing for equality of variances, the F-test can be either one-sided or two- sided. The one-sided version tests in only one direction. For example, it would test whether the variance from one population is less or greater than the variance of the second population. In terms of hypotheses, this can be written as:

H₀: σ₁² ≤ σ₂²
H₁: σ₁² > σ₂²

or

H₀: σ₁² ≥ σ₂²
H₁: σ₁² < σ₂²    (46)

The two-sided version tests that the variances are not equal or, in terms of hypotheses:

H₀: σ₁² = σ₂²
H₁: σ₁² ≠ σ₂²    (47)


The F-statistic is equal to:

F = s₁² / s₂²    (48)

where s₁² and s₂² are the sample variances, which can be calculated as:

s² = Σ(x − x̄)² / (n − 1)    (49)

The hypothesis that the two variances are equal is rejected if:

F > F(α; N₁−1, N₂−1)        for an upper one-tailed test
F > F(1−α; N₁−1, N₂−1)      for a lower one-tailed test
F > F(1−α/2; N₁−1, N₂−1)    for a two-tailed test    (50)

where F(α; N₁−1, N₂−1) is the critical value of the F distribution with N₁−1 and N₂−1 degrees of freedom and significance level α.

It should be noted that σ² defines the population variance whereas, when calculating the F-statistic, we instead use the sample variance s². This is because it is often difficult, if not impossible, to calculate the population variance in practice, so we assume the sample variance to be similar to that of the population variance.
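The F statistic itself (Equation 48) is straightforward to compute; a small Python sketch with made-up samples:

```python
import statistics

# Two small made-up samples
x1 = [1, 2, 3, 4]
x2 = [2, 4, 6, 8]

s1_sq = statistics.variance(x1)  # Equation 49: sample variance of x1
s2_sq = statistics.variance(x2)  # sample variance of x2

# Equation 48: ratio of sample variances
f_stat = s1_sq / s2_sq

print(round(f_stat, 2))  # 0.25, compared against the F(n1-1, n2-1) critical value
```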

7.1.1 Example

Referring back to the variables used in the example in Chapter 3, the assumption of the population variances being equal can be formally tested using the F-test (the two-sided test defined in Equation 47).


> var.test(log(HAI.data.subset$Total.Stay[HAI.data.subset$HAI == "No"] + 1), log(HAI.data.subset$Total.Stay[HAI.data.subset$HAI == "Yes"] + 1))

F test to compare two variances

data:  log(HAI.data.subset$Total.Stay[HAI.data.subset$HAI == "No"] + 1) and log(HAI.data.subset$Total.Stay[HAI.data.subset$HAI == "Yes"] + 1)
F = 1.6818, num df = 8066, denom df = 487, p-value = 3.113e-13
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 1.472069 1.907127
sample estimates:
ratio of variances
           1.68185

For this test the null hypothesis is equal variances (as in Equation 47) and, as the p-value is less than 0.05, we reject the null hypothesis in favour of the alternative and conclude that the population variances are unequal, as was assumed in the t-test in Chapter 3.
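The p-value and the confidence interval reported by var.test() always agree: the two-sided test rejects at the 5% level exactly when the 95% confidence interval for the variance ratio excludes 1. A small sketch on simulated data (the samples and seed are illustrative assumptions):

```r
# Simulated samples with genuinely different variances
set.seed(3)
a <- rnorm(40, sd = 1)
b <- rnorm(40, sd = 2)

res <- var.test(a, b)
res$conf.int   # 95% CI for sigma1^2 / sigma2^2
res$p.value    # below 0.05 here, and the CI excludes 1
```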

7.2 F-test for Comparing Linear Regression Models

We recommend reading the Regression Modelling paper prior to this section.

F-tests can be used to test whether any of the explanatory variables in a multiple linear regression model are significant. In terms of hypotheses, this can be written as:

H₀: β₁ = β₂ = ⋯ = βₚ₋₁ = 0          (51)
H₁: βⱼ ≠ 0 for at least one value of j

They can also be used in model selection to compare the fits of two linear models, where the simpler of the two models is nested within the more complex model (i.e. the simpler model has exactly the same explanatory variables as the more complex model except that it is missing the one explanatory variable whose significance is being tested).

To test the null hypothesis, the following steps should be followed:

1. Calculate the test statistic as:


F = MSM / MSE          (52)

where:

· MSM (Mean of Squares for Model):

MSM = SSM / DFM          (53)

· MSE (Mean of Squares for Error):

MSE = SSE / DFE          (54)

The full derivation of MSM and MSE can be found in Appendix B.

2. Using the F-table, determine the acceptance region I for a (1 − α)100% confidence level with (DFM, DFE) degrees of freedom

3. Accept the null hypothesis if F ∈ I; reject it if F ∉ I

The F-test is often used (as in the example below) to decide whether a particular explanatory variable should be kept in or dropped from a linear regression model. If the p-value for the F-test (which compares the two nested models) is less than 0.05, then the model with fewer terms is significantly different from the larger model and should not be used; the explanatory variable that was dropped is an important predictor and should be retained. If the p-value is greater than 0.05, then the explanatory variable can be dropped, as it is not a significant predictor of the response variable.
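A self-contained sketch of this nested-model F-test (simulated data; the variables x1, x2 and the seed are illustrative assumptions, not the HAI data):

```r
# Nested-model F-test: does dropping x2 significantly worsen the fit?
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)      # x2 has no true effect on y

full    <- lm(y ~ x1 + x2)
reduced <- lm(y ~ x1)            # same model minus x2, so nested in 'full'

cmp <- anova(reduced, full)      # a large p-value suggests x2 can be dropped
cmp

# The F statistic is the drop in RSS per dropped term over the full model's MSE
rss.r    <- sum(residuals(reduced)^2)
rss.f    <- sum(residuals(full)^2)
F.byhand <- ((rss.r - rss.f) / 1) / (rss.f / full$df.residual)
```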

7.2.1 Example

This example will build on the linear model fitted in Section 2.6 of the Regression Modelling paper and show a simple example of using hypothesis testing as part of a model selection procedure for linear regression models.


A linear model was fitted to the transformed length of stay variable with a number of explanatory variables from the data. As the response variable had been transformed the assumption of normality holds.

The standard R output for the linear model is shown below. Some explanatory variables are statistically-significant predictors of length of stay (p-value < 0.05) but some others are not. In model selection, it is common to remove non-significant explanatory variables one at a time, starting with the ‘least’ significant one. From the output below, the Sex variable has the highest p-value (0.3849) of any variable. It should be noted that although the ‘Surgery (NHSN Surgery)’ category in the Surgery variable has the highest p-value, the entire Surgery variable has a smaller p-value than the Sex variable (see the anova table on Page 15 of the Regression Modelling paper).

> summary(model1)

Call:
lm(formula = log(Total.Stay + 1) ~ Age + factor(HAI) + Sex + Hospital.Size +
    Surgery + Prognosis + centralcatheter + peripheralcath + urinarycatheter,
    data = HAI.data.subset)

Residuals:
    Min      1Q  Median      3Q     Max
-3.3331 -0.7384 -0.0704  0.7045  4.7907

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                   2.3309918  0.0808469  28.832  < 2e-16 ***
Age                           0.0149926  0.0007019  21.360  < 2e-16 ***
factor(HAI)Yes                0.7254588  0.0509391  14.242  < 2e-16 ***
SexMale                      -0.0205254  0.0236233  -0.869   0.3849
Hospital.SizeMedium          -0.0476521  0.0269000  -1.771   0.0765 .
Hospital.SizeSmall            0.0597580  0.0519643   1.150   0.2502
SurgeryNo Surgery            -0.1066112  0.0515086  -2.070   0.0385 *
SurgeryNot Known             -0.1416653  0.1399424  -1.012   0.3114
SurgerySurgery (NHSN Surgery) 0.0324752  0.0571668   0.568   0.5700
PrognosisLife Limiting       -0.2558871  0.0433060  -5.909 3.58e-09 ***
PrognosisNone/Non-fatal      -0.4754648  0.0406985 -11.683  < 2e-16 ***
PrognosisNot Known           -0.1624948  0.1361281  -1.194   0.2326
centralcatheterYes            0.2783705  0.0608494   4.575 4.84e-06 ***
peripheralcathYes            -0.4672574  0.0250617 -18.644  < 2e-16 ***
urinarycatheterYes            0.6048488  0.0298676  20.251  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.079 on 8540 degrees of freedom
Multiple R-squared:  0.201,  Adjusted R-squared:  0.1997
F-statistic: 153.5 on 14 and 8540 DF,  p-value: < 2.2e-16


The above model can be re-fitted without the Sex variable (called model2), with its output displayed below.

> summary(model2)

Call:
lm(formula = log(Total.Stay + 1) ~ factor(HAI) + Age + Hospital.Size +
    Surgery + Prognosis + centralcatheter + peripheralcath + urinarycatheter,
    data = HAI.data.subset)

Residuals:
    Min      1Q  Median      3Q     Max
-3.3234 -0.7424 -0.0687  0.7067  4.8005

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                   2.3196209  0.0797795  29.075  < 2e-16 ***
factor(HAI)Yes                0.7252572  0.0509379  14.238  < 2e-16 ***
Age                           0.0150401  0.0006998  21.493  < 2e-16 ***
Hospital.SizeMedium          -0.0474927  0.0268990  -1.766   0.0775 .
Hospital.SizeSmall            0.0601824  0.0519613   1.158   0.2468
SurgeryNo Surgery            -0.1062390  0.0515061  -2.063   0.0392 *
SurgeryNot Known             -0.1434496  0.1399254  -1.025   0.3053
SurgerySurgery (NHSN Surgery) 0.0328630  0.0571642   0.575   0.5654
PrognosisLife Limiting       -0.2572204  0.0432782  -5.943 2.90e-09 ***
PrognosisNone/Non-fatal      -0.4763102  0.0406862 -11.707  < 2e-16 ***
PrognosisNot Known           -0.1641042  0.1361136  -1.206   0.2280
centralcatheterYes            0.2779882  0.0608469   4.569 4.98e-06 ***
peripheralcathYes            -0.4679391  0.0250490 -18.681  < 2e-16 ***
urinarycatheterYes            0.6033288  0.0298159  20.235  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.079 on 8541 degrees of freedom
Multiple R-squared:  0.2009,  Adjusted R-squared:  0.1997
F-statistic: 165.2 on 13 and 8541 DF,  p-value: < 2.2e-16

The estimates and p-values for the remaining explanatory variables have slightly changed as a result of Sex being dropped from the model. However, removing Sex from the model is only a valid action if model2 is not statistically different from model1. An F-test can be carried out using the anova() function in R to investigate this. This is a hypothesis test and the study hypotheses are defined as follows:

H₀: model1 = model2          (55)
H₁: model1 ≠ model2


> anova(model1, model2)
Analysis of Variance Table

Model 1: log(Total.Stay + 1) ~ Age + factor(HAI) + Sex + Hospital.Size + Surgery + Prognosis + centralcatheter + peripheralcath + urinarycatheter
Model 2: log(Total.Stay + 1) ~ factor(HAI) + Age + Hospital.Size + Surgery + Prognosis + centralcatheter + peripheralcath + urinarycatheter
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1   8540 9935.6
2   8541 9936.4 -1  -0.87829 0.7549 0.3849

From the output above, the p-value is approximately 0.39 and, since this is greater than 0.05, we cannot reject the null hypothesis. Therefore, model2 is not statistically different from model1 and the Sex variable can indeed be dropped from the model. This procedure would typically be repeated until only statistically-significant variables are included in the final model.

If the F-test above had produced a p-value less than 0.05, then the null hypothesis would have been rejected and the conclusion would have been that the two models were not similar. If this had occurred, the more complex model (model1) should have been used going forward.

Finally, it is emphasised that this model selection process is only valid for comparing 'nested' linear models, where the simpler model is a subset of the larger model, and only for testing the removal of one explanatory variable at a time.
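When removing one variable at a time in this way, R's drop1() function is a convenient shortcut: it performs the same nested-model F-test for every term in a single call. A sketch on simulated data (the data frame, variable names and seed are illustrative assumptions):

```r
# drop1() reproduces the one-term-at-a-time nested F-tests
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(100)   # x3 is pure noise

fit <- lm(y ~ x1 + x2 + x3, data = d)
drop1(fit, test = "F")   # each row matches anova() on the corresponding nested pair
```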


Bibliography

Dalgaard, P. (2008). Introductory Statistics with R. Second Edition. New York: Springer.

Greenland, S., Senn, S.J., Rothman, K.J., Carlin, J.B., Poole, C., Goodman, S.N. and Altman, D.G. (2016). “Statistical tests, P values, confidence intervals and power: a guide to misinterpretations.” European Journal of Epidemiology. Vol. 31 (4): 337 – 350. https://link.springer.com/article/10.1007/s10654-016-0149-3.

Institute for Digital Research and Education, University of California Los Angeles (UCLA) – https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-the-differences-between-one-tailed-and-two-tailed-tests/

Mendenhall, W. and Beaver, R.J. (1994). Introduction to Probability and Statistics. Ninth Edition. Belmont, California: Duxbury Press.

R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL www.R-project.org.

Statistics Solutions – http://www.statisticssolutions.com/manova-analysis-one-sample-t-test/

Rosie Shier (2004) – http://www.statstutor.ac.uk/resources/uploaded/mannwhitney.pdf

This paper has also made use of lecture notes from the School of Mathematics and Statistics, University of Glasgow.


Appendices

A Further Detail on Hypothesis Testing

The test statistic, rejection region and critical values for a test using the normal distribution (or another distribution which can be approximated by the normal distribution via the Central Limit Theorem) can be displayed graphically:

Figure A1: The normal distribution in two-sided hypothesis testing (https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-the-differences-between-one-tailed-and-two-tailed-tests/)

The rejection region is the dark blue area of the distribution (significance level = 0.05), with each tail accounting for 2.5% of the distribution (5% in total). The critical values for the normal distribution at this level are ±1.96. Therefore, if the test statistic is either less than -1.96 or greater than 1.96, the null hypothesis is rejected; under the null hypothesis there is a 5% probability of this occurring.

However, the critical values in Figure A1 are only suitable if a test is two-sided, since the rejection region covers both the lower and upper tails of the distribution. If an analyst wishes to carry out a one-sided test but retain the 95% confidence level, then the critical values must be shifted, as shown in Figure A2.

Figure A2: The normal distribution in one-sided hypothesis testing (https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-the-differences-between-one-tailed-and-two-tailed-tests/)

As can be seen, the critical values have moved from -1.96 and 1.96 to -1.645 and 1.645 respectively, making it easier to return a statistically-significant result in a one-sided test. Although this may seem advantageous, as a study is more likely to achieve a potentially-positive result, a one-sided test should only be used when the question of interest supports such a design. Two-sided tests are much more common and robust.

As mentioned in Section 2.5.3, one and two-sided tests at the same significance level (usually 0.05) are not directly comparable, since their critical values differ. It is possible, however, to design a comparable one-sided test by employing a significance level of 0.025 (half of 0.05) instead, which matches the two-sided test where 2.5% of the distribution lies in each tail. In such a case, a p-value less than 0.025 would indicate evidence of statistical significance, instead of 0.05.
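The critical values quoted above come straight from the quantiles of the standard normal distribution, e.g. in R:

```r
# Standard normal critical values behind Figures A1 and A2
qnorm(0.975)   # ~1.96:  two-sided test at the 5% significance level
qnorm(0.95)    # ~1.645: one-sided test at the 5% significance level
qnorm(0.975)   # a one-sided test at the 2.5% level reuses the 1.96 cut-off
```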

Although the discussion above has used the normal distribution as an example, the vast majority of these rules apply to any probability distribution which underlies a hypothesis test, although the shape and critical values of other distributions differ.


B Further Detail on F-test for Comparing Linear Regression Models

The equations in this Appendix follow on from Equation 52 in Chapter 7:

F = MSM / MSE

where:

· MSM (Mean of Squares for Model):

MSM = SSM / DFM          (B1)

· MSE (Mean of Squares for Error):

MSE = SSE / DFE          (B2)

· SSM – Corrected Sum of Squares for Model:

SSM = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)²          (B3)

· SSE – Sum of Squares for Error:

SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²          (B4)

· SST – Corrected Sum of Squares Total:

SST = Σᵢ₌₁ⁿ (yᵢ − ȳ)²          (B5)

For multiple regression models:

SSM + SSE = SST          (B6)
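The decomposition in Equation B6 can be checked numerically for any fitted linear model with an intercept; the sketch below uses simulated data (the seed and variables are illustrative assumptions):

```r
# Numerical check that SSM + SSE = SST for a fitted lm
set.seed(7)
x <- rnorm(50)
y <- 3 + 0.5 * x + rnorm(50)
fit <- lm(y ~ x)

SSM <- sum((fitted(fit) - mean(y))^2)   # Equation B3
SSE <- sum(residuals(fit)^2)            # Equation B4
SST <- sum((y - mean(y))^2)             # Equation B5

all.equal(SSM + SSE, SST)   # TRUE, up to floating-point tolerance
```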


In all of the equations below, n is the number of observations and p is the number of regression parameters.

· DFM (Corrected Degrees of Freedom for Model):

DFM = p − 1          (B7)

· DFE (Degrees of Freedom for Error):

DFE = n − p          (B8)

· DFT (Corrected Degrees of Freedom Total):

DFT = n − 1          (B9)

For multiple regression models:

DFM + DFE = DFT          (B10)


C R Code for Examples

## Read in data (available at this link – only accessible when connected to the internal NHS National Services Scotland network) ##
HAI.data <- read.csv("//stats/cl-out/Datasets for SAG Statistical Papers/Simulated_HAI_Prevalence_Data.csv", header = TRUE)

## Filter data ##
# Remove discharged patients
HAI.data.subset <- subset(HAI.data, Discharged == 1)

# Remove patients where gender is not recorded
HAI.data.subset <- subset(HAI.data.subset, Sex != "Unknown")

# Remove patients where specialty is not recorded (spaces required due to how data have been coded)
HAI.data.subset <- subset(HAI.data.subset, Specialty != "Not Known ")

# Remove non-acute and obstetrics hospitals (spaces required due to how data have been coded)
HAI.data.subset <- subset(HAI.data.subset, Hospital.category != "Non Acute ")
HAI.data.subset <- subset(HAI.data.subset, Hospital.category != "Obsetrics ")

# Remove patients where catheter information is not provided
HAI.data.subset <- subset(HAI.data.subset, centralcatheter != "Unknown")
HAI.data.subset <- subset(HAI.data.subset, peripheralcath != "Unknown")
HAI.data.subset <- subset(HAI.data.subset, urinarycatheter != "Unknown")

## Exploratory Analysis ##

## Summary of length of stay
summary(HAI.data.subset$Total.Stay)

## Boxplots
boxplot(log(HAI.data.subset$Total.Stay + 1), xlab = "Length of Stay",
        ylab = "log(Total_Stay + 1)", col = "lightblue")


## Boxplots – HAI variable included
boxplot(log(HAI.data.subset$Total.Stay + 1) ~ factor(HAI.data.subset$HAI),
        xlab = "HAI", ylab = "log(Total_Stay + 1)", col = "lightblue")

## Histograms
par(mfrow = c(1, 2))

hist(HAI.data.subset$Total.Stay, main = "Length of Stay",
     xlab = "Total Stay", col = "lightblue")
hist(log(HAI.data.subset$Total.Stay + 1), main = "Length of Stay - Log Transformation",
     xlab = "log(Total Stay + 1)", col = "lightblue")

## Histograms, HAI variable included
par(mfrow = c(2, 2))

hist(HAI.data.subset$Total.Stay[HAI.data.subset$HAI == "Yes"], main = "HAI = Yes",
     xlab = "Total Stay", col = "lightblue")
hist(HAI.data.subset$Total.Stay[HAI.data.subset$HAI == "No"], main = "HAI = No",
     xlab = "Total Stay", col = "lightblue")

hist(log(HAI.data.subset$Total.Stay[HAI.data.subset$HAI == "Yes"] + 1), main = "HAI = Yes",
     xlab = "log(Total Stay + 1)", col = "lightblue")
hist(log(HAI.data.subset$Total.Stay[HAI.data.subset$HAI == "No"] + 1), main = "HAI = No",
     xlab = "log(Total Stay + 1)", col = "lightblue")

## Chapter 3 - T-Test ##

## One sample t-test
x <- log(6.3 + 1)
t.test(log(HAI.data.subset$Total.Stay + 1), mu = x)

# Back-transform the confidence limits and mean estimate
exp(2.896322) - 1
exp(2.947428) - 1
exp(2.921875) - 1

## Two sample t-test
t.test(log(HAI.data.subset$Total.Stay[HAI.data.subset$HAI == "No"] + 1),
       log(HAI.data.subset$Total.Stay[HAI.data.subset$HAI == "Yes"] + 1))

# Back-transform the confidence limits
exp(-0.9048984) - 1
exp(-0.7315190) - 1

## Chapter 4 – Non-Parametric Tests ##

## Boxplot – male patients only
par(mfrow = c(1, 1))
boxplot(log(HAI.data.subset$Total.Stay[HAI.data.subset$Sex == "Male"] + 1),
        main = "Male - Stay Analysis", ylab = "log(Total_Stay + 1)", col = "lightblue")

## Wilcoxon signed ranks test
wilcox.test(log(HAI.data.subset$Total.Stay[HAI.data.subset$Sex == "Male"] + 1),
            mu = log(11))

## Boxplots by gender
boxplot(log(HAI.data.subset$Total.Stay + 1) ~ factor(HAI.data.subset$Sex),
        xlab = "Gender", ylab = "log(Total_Stay + 1)", col = "lightblue")

## Mann-Whitney U-test
x <- log(HAI.data.subset$Total.Stay[HAI.data.subset$Sex == "Male"] + 1)
y <- log(HAI.data.subset$Total.Stay[HAI.data.subset$Sex == "Female"] + 1)

wilcox.test(x, y, alternative = "two.sided")

## Chapter 5 - Chi-Squared Tests ##

## Chi-squared goodness-of-fit test

# Summary table for HAI variable
table(HAI.data.subset$HAI)

# Chi-squared test
chisq.test(table(HAI.data.subset$HAI))

## Chi-squared test of independence

# Summary table
table(HAI.data.subset$Hospital.Size, HAI.data.subset$HAI)

# Percentages for each combination of categories
round(prop.table(table(HAI.data.subset$Hospital.Size, HAI.data.subset$HAI), 1) * 100, 1)

# Chi-squared test (applies Yates' continuity correction automatically)
chisq.test(HAI.data.subset$Hospital.Size, HAI.data.subset$HAI)

# Expected number of cases
chisq.test(HAI.data.subset$Hospital.Size, HAI.data.subset$HAI)$expected

## Chapter 6 – Proportion Tests ##

## One-sample proportion test
# Recode Sex variable
HAI.data.subset$Sex_Binary <- ifelse(HAI.data.subset$Sex == "Male", 0, 1)

# Summary tables for original and recoded variable
table(HAI.data.subset$Sex)
table(HAI.data.subset$Sex_Binary)

# One-sample proportion test
prop.test(table(HAI.data.subset$Sex_Binary))

## Two-sample proportion test
# Recode Sex and HAI variables
HAI.data.subset$Sex_Binary <- ifelse(HAI.data.subset$Sex == "Male", 0, 1)
HAI.data.subset$HAI_Binary <- ifelse(HAI.data.subset$HAI == "No", 0, 1)

# Summary tables for original and recoded variables
table(HAI.data.subset$Sex, HAI.data.subset$HAI)
addmargins(table(HAI.data.subset$Sex_Binary, HAI.data.subset$HAI_Binary,
                 dnn = c("Sex", "HAI")))

# Two-sample proportion test
prop.test(table(HAI.data.subset$Sex_Binary, HAI.data.subset$HAI_Binary))

# Equivalent chi-squared test
chisq.test(HAI.data.subset$Sex, HAI.data.subset$HAI)

## Chapter 7 - F Test ##

## Equality of variances
var.test(log(HAI.data.subset$Total.Stay[HAI.data.subset$HAI == "No"] + 1),
         log(HAI.data.subset$Total.Stay[HAI.data.subset$HAI == "Yes"] + 1))

## Multiple linear regression model

# With the exception of 'HAI', other categorical variables are already designated as 'factors'
# Fit larger model with Sex variable
model1 <- lm(log(Total.Stay + 1) ~ Age + factor(HAI) + Sex + Hospital.Size + Surgery +
             Prognosis + centralcatheter + peripheralcath + urinarycatheter,
             data = HAI.data.subset)
summary(model1)

# Fit smaller nested model without Sex variable
model2 <- lm(log(Total.Stay + 1) ~ Age + factor(HAI) + Hospital.Size + Surgery +
             Prognosis + centralcatheter + peripheralcath + urinarycatheter,
             data = HAI.data.subset)
summary(model2)

# F-test to compare models
anova(model1, model2)


D SPSS Syntax for Examples

********** Read in data (available at this link – only accessible when connected to the internal NHS National Services Scotland network) and remove incomplete cases **********

GET DATA
  /TYPE = TXT
  /FILE = '//conf/linkage/output/Datasets for SAG Statistical Papers/Simulated_HAI_Prevalence_Data.csv'
  /ENCODING = 'UTF8'
  /DELCASE = LINE
  /DELIMITERS = ","
  /ARRANGEMENT = DELIMITED
  /FIRSTCASE = 2
  /IMPORTCASE = ALL
  /VARIABLES =
    counter F5.0
    Hospital.Type A50
    Hospital.category A50
    Hospital.size A50
    Sex A10
    Specialty A50
    Age F3.0
    Surgery A50
    Prognosis A50
    centralcatheter A50
    peripheralcath A50
    urinarycatheter A50
    intubation A50
    antimicrobials_2 A50
    HAI A3
    Total.Stay F4.0
    Discharged A1
    Time.To.Survey F10.0
    timetohai F10.0.
CACHE.
EXECUTE.

SELECT IF Discharged EQ '1'.
EXECUTE.

SELECT IF Sex <> 'Unknown'.
EXECUTE.

SELECT IF Specialty <> 'Not Known'.
EXECUTE.

SELECT IF Hospital.category <> 'Non Acute'.
EXECUTE.

SELECT IF Hospital.category <> 'Obsetrics'.
EXECUTE.

SELECT IF centralcatheter <> 'Unknown'.
EXECUTE.

SELECT IF peripheralcath <> 'Unknown'.
EXECUTE.

SELECT IF urinarycatheter <> 'Unknown'.
EXECUTE.

* Log-transform length of stay variable.
COMPUTE Total.Stay.Transformed = LN(Total.Stay + 1).
EXECUTE.

********** One-sample t-test (one-sided) **********

* To get the p-value for a one-sided t-test in SPSS, run a two-sided test with an adjusted significance level i.e.
* 5% one-sided significance = 10% two-sided significance.
* 2.5% one-sided significance = 5% two-sided significance.

* For the confidence interval, take the relevant lower or upper limit from the two-sided confidence interval.

* Equivalent to 5% significance.
T-TEST
  /TESTVAL = 0
  /MISSING = ANALYSIS
  /VARIABLES = Total.Stay.Transformed
  /CRITERIA = CI(.9).

* Equivalent to 2.5% significance.
T-TEST
  /TESTVAL = 0
  /MISSING = ANALYSIS
  /VARIABLES = Total.Stay.Transformed
  /CRITERIA = CI(.95).

********** Two-sample t-test (two-sided) **********

T-TEST GROUPS = HAI('No' 'Yes')
  /MISSING = ANALYSIS
  /VARIABLES = Total.Stay.Transformed
  /CRITERIA = CI(.95).
* Look at 'unequal variances' line.

********** Wilcoxon Signed Ranks Test **********

TEMPORARY.
SELECT IF Sex = 'Male'.

NPTESTS
  /ONESAMPLE TEST (Total.Stay.Transformed) WILCOXON(TESTVALUE=2.398)
  /MISSING SCOPE=ANALYSIS USERMISSING=EXCLUDE
  /CRITERIA ALPHA=0.05 CILEVEL=95.

********** Mann-Whitney U-test **********

NPTESTS
  /INDEPENDENT TEST (Total.Stay.Transformed) GROUP (Sex) MANN_WHITNEY
  /MISSING SCOPE=ANALYSIS USERMISSING=EXCLUDE
  /CRITERIA ALPHA=0.05 CILEVEL=95.

********** Chi-squared goodness-of-fit test **********

NPTESTS
  /ONESAMPLE TEST (HAI) CHISQUARE(EXPECTED=EQUAL)
  /MISSING SCOPE=ANALYSIS USERMISSING=EXCLUDE
  /CRITERIA ALPHA=0.05 CILEVEL=95.


********** Chi-squared test of association **********

CROSSTABS
  /TABLES = Hospital.size BY HAI
  /FORMAT = AVALUE TABLES
  /STATISTICS = CHISQ
  /CELLS = COUNT
  /COUNT ROUND CELL.
* Look at Pearson Chi-Square row.

********** One-sample proportion test **********

COMPUTE Sex_Binary = 0.
IF Sex = 'Female' Sex_Binary = 1.
EXECUTE.

NPAR TESTS
  /BINOMIAL (0.50)=Sex_Binary
  /MISSING ANALYSIS.

********** Two-sample proportion test **********

COMPUTE Sex_Binary = 0.
IF Sex = 'Female' Sex_Binary = 1.
EXECUTE.

COMPUTE HAI_Binary = 0.
IF HAI = 'Yes' HAI_Binary = 1.
EXECUTE.

* Two-sample proportion test is equivalent to the chi-squared test (shown below).
CROSSTABS
  /TABLES = Sex BY HAI
  /FORMAT = AVALUE TABLES
  /STATISTICS = CHISQ
  /CELLS = COUNT
  /COUNT ROUND CELL.


* Would ordinarily look at Pearson Chi-Square row but, since this is a 2x2 contingency table, look at Continuity Correction row.

********** F-test equality of variances **********

* This is provided in the output of the two-sample t-test but SPSS uses a different type of F-test (Levene's F-test) from R's var.test function.
* The function for Levene's F-test is in the 'car' R package.

********** F-test for comparing linear regression models **********

* Create indicator/dummy variables for string variables.
SPSSINC CREATE DUMMIES VARIABLE = HAI ROOTNAME1 = HAI
  /OPTIONS ORDER = A USEVALUELABELS = YES USEML = YES OMITFIRST = YES.

SPSSINC CREATE DUMMIES VARIABLE = Sex ROOTNAME1 = Sex
  /OPTIONS ORDER = A USEVALUELABELS = YES USEML = YES OMITFIRST = YES.

SPSSINC CREATE DUMMIES VARIABLE = Hospital.size ROOTNAME1 = Hospital_Size
  /OPTIONS ORDER = A USEVALUELABELS = YES USEML = YES OMITFIRST = YES.

SPSSINC CREATE DUMMIES VARIABLE = Surgery ROOTNAME1 = Surgery
  /OPTIONS ORDER = A USEVALUELABELS = YES USEML = YES OMITFIRST = YES.

SPSSINC CREATE DUMMIES VARIABLE = Prognosis ROOTNAME1 = Prognosis
  /OPTIONS ORDER = A USEVALUELABELS = YES USEML = YES OMITFIRST = YES.

SPSSINC CREATE DUMMIES VARIABLE = centralcatheter ROOTNAME1 = centralcatheter
  /OPTIONS ORDER = A USEVALUELABELS = YES USEML = YES OMITFIRST = YES.

SPSSINC CREATE DUMMIES VARIABLE = peripheralcath ROOTNAME1 = peripheralcath
  /OPTIONS ORDER = A USEVALUELABELS = YES USEML = YES OMITFIRST = YES.

SPSSINC CREATE DUMMIES VARIABLE = urinarycatheter ROOTNAME1 = urinarycatheter
  /OPTIONS ORDER = A USEVALUELABELS = YES USEML = YES OMITFIRST = YES.

* Build linear model from Regression Modelling paper with automatic backwards variable selection (the first level is excluded from each block as they are used as the reference category).
* This syntax is different from the syntax used to fit the linear model in the Regression Modelling paper but gives the same results (the commands used in the Regression Modelling paper cannot be used for variable selection).

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA = PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT Total.Stay.Transformed
  /METHOD = BACKWARD Age
  /METHOD = BACKWARD HAI_2
  /METHOD = BACKWARD Sex_2
  /METHOD = BACKWARD Hospital_Size_2 Hospital_Size_3
  /METHOD = BACKWARD Surgery_2 Surgery_3 Surgery_4
  /METHOD = BACKWARD Prognosis_2 Prognosis_3 Prognosis_4
  /METHOD = BACKWARD centralcatheter_2
  /METHOD = BACKWARD peripheralcath_2
  /METHOD = BACKWARD urinarycatheter_2.

* This fits model1 from Section 7.2.1.
* model2 in Section 7.2.1 is the second model fitted as part of the backwards selection process.
* Other models shown are what one would typically fit afterwards in order to continue the process until only statistically-significant variables are included in the model.


E SPSS Output for Examples

One-sample t-test (equivalent to R output on Page 18)

Two-sample t-test (equivalent to R output on Page 21)


Wilcoxon Signed Ranks Test (equivalent to R output on Page 25)

Mann-Whitney U-test (equivalent to R output on Page 26)


Chi-squared goodness-of-fit test (equivalent to R output on Page 29)

Chi-squared test of independence (equivalent to R output on Page 31)


One-sample proportion test (equivalent to R output on Page 35)

Two-sample proportion test (equivalent to R output on Page 39)

F-test for equality of variances (see output for two-sample t-test)


F-test for comparing linear regression models (equivalent to R output on Page 46) – only ‘model1’ and ‘model2’ are shown below
