
HST 190: Introduction to Biostatistics

Lecture 3: Two-sample comparisons, hypothesis testing and power/sample size calculations

1 HST 190: Intro to Biostatistics Estimating two-sample comparisons

• In Lecture 2, we introduced:
  § probability distributions
  § sampling from a population
  § estimation of the mean of some population
  § quantification of uncertainty via standard errors and confidence intervals
• Today, we focus on a common setting in public health and medicine, specifically the comparison of two groups
• Consider, for example, the question: “What is the difference in average birthweight between children born at a specific hospital in a disadvantaged area and children born at a specific hospital in a wealthy area?”

• This question moves beyond the estimation scenario we discussed in Lecture 2
  § that is, interest lies in more than a single mean $\mu$
• Now, we want to compare two unknown means
  § each corresponding to some (sub-)population
• How we achieve this, specifically estimation and the quantification of uncertainty, depends on how the data we have at our disposal were collected.

Paired samples

• Solish et al (2010, Clinical Ophthalmology) reported on a study of medical therapy for uncontrolled glaucoma
  § $n = 55$ patients
  § bimatoprost 0.03% once daily in one eye
  § either travoprost 0.004% or latanoprost 0.005% once daily in the fellow eye
  § compared intraocular pressure (IOP) at baseline and at 4-6 weeks follow-up
• This is an example of a paired study
  § $n$ is the number of pairs, not the total number of observations

• Let $X$ denote the change in IOP between baseline and follow-up in each of the two eyes:
  § $X_{11}, X_{12}, \dots, X_{1n}$ (bimatoprost-treated eyes)
  § $X_{21}, X_{22}, \dots, X_{2n}$ (fellow eyes)
• Solish et al report estimated mean reductions in IOP of:
  § $\bar{X}_1 = 2.7$ mmHg among bimatoprost-treated eyes
  § $\bar{X}_2 = 1.7$ mmHg among travoprost-treated eyes
• This suggests that treatment with bimatoprost is associated with greater improvements
• Q: How can we use the methods we’ve been exposed to so far to formally assess this comparison?

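• Define a new random variable $D_i$ to be the difference in the IOP change between the two eyes of patient $i$:
  § $D_i = X_{1i} - X_{2i}$
• Thus, we have exploited the special structure of the paired data to turn the two-sample problem into a one-sample problem
  § $E[D_i] = \Delta$
  § $\mathrm{Var}[D_i] = \sigma_D^2$
• We can now proceed just as in Lecture 2:
  § use the sample mean of the differences, $\bar{D} = \frac{1}{n}\sum_{i=1}^{n} D_i$, to estimate $\Delta$
  § use $S_D^2 = \frac{1}{n-1}\sum_{i=1}^{n} (D_i - \bar{D})^2$ to estimate the variance $\sigma_D^2$
    o note, this means that the standard deviation of $\bar{D}$ is $\mathrm{sd}[\bar{D}] = \sigma_D/\sqrt{n}$
  § use $\frac{\bar{D} - \Delta}{S_D/\sqrt{n}} \sim t_{n-1}$ to derive a $100(1-\alpha)\%$ confidence interval for $\Delta$:
    $\bar{D} \pm t_{n-1,\,1-\alpha/2} \cdot \frac{S_D}{\sqrt{n}}$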
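The paired recipe (form the differences, then treat them as a single sample) can be sketched in a few lines of Python. This is only an illustration: the function name `paired_ci`, the five pairs of values, and the choice to pass the t quantile in as `t_crit` (looked up from a table rather than computed) are all assumptions made for the example, not part of the Solish et al data.

```python
import math
import statistics


def paired_ci(first, second, t_crit):
    """Confidence interval for the mean of paired differences.

    t_crit is the t quantile t_{n-1, 1-alpha/2}, supplied by the caller
    (for example from a t table); it is not computed here.
    """
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    d_bar = statistics.mean(diffs)
    s_d = statistics.stdev(diffs)  # sample SD, divides by n - 1
    margin = t_crit * s_d / math.sqrt(n)
    return d_bar, (d_bar - margin, d_bar + margin)


# Hypothetical IOP reductions (mmHg) for five patients, one value per eye.
eye_a = [3.0, 2.5, 4.0, 1.5, 2.0]
eye_b = [2.0, 1.5, 2.5, 1.0, 2.0]

# Assumption: t_{4, 0.975} = 2.776, read from a standard t table.
d_bar, (lower, upper) = paired_ci(eye_a, eye_b, t_crit=2.776)
print(f"mean difference = {d_bar:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Note that the interval is built from the $n$ differences only; the two original columns never enter the standard error separately.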

Unpaired samples

• In many settings there may not be an obvious path for creating “pairs” on which to base the two-sample comparison
• As an example, Maciejewski et al (2016, JAMA Surgery) report on weight loss following bariatric surgery:
  § 1787 veterans who underwent Roux-en-Y gastric bypass (RYGB)
  § 5305 non-surgical patients from the same VA population

[Figure: page excerpt from Maciejewski et al, “Bariatric Surgery and Long-term Weight Loss Durability”, JAMA Surgery, published online August 31, 2016. Figure 1: Differences in Estimated Weight Changes Among Patients Undergoing Roux-en-Y Gastric Bypass (RYGB) and Nonsurgical Matches; x-axis: Years Since Baseline; y-axis: Weight Change From Baseline, %.]

• More generally, consider two independent (unpaired) samples of size $n_1$ and $n_2$
• Suppose:
  § $X_{11}, X_{12}, \dots, X_{1n_1} \sim \mathrm{Normal}(\mu_1, \sigma_1^2)$
  § $X_{21}, X_{22}, \dots, X_{2n_2} \sim \mathrm{Normal}(\mu_2, \sigma_2^2)$
• Now, the comparison will be framed in terms of estimating the difference in means:
  $\mu_1 - \mu_2$
  § contrast this with consideration of the mean of differences $\Delta$ when the data arose from a matched pairs design

• As a toy example, suppose interest lies in the difference in daily caloric consumption in two populations:
  § (1) seniors
  § (2) young adults
• We take two samples and compute the following:
  § $\bar{x}_1 = 2000$, $s_1 = 250$, $n_1 = 13$
  § $\bar{x}_2 = 2500$, $s_2 = 200$, $n_2 = 10$
• In the unpaired setting, we simply estimate the difference in means, here $\mu_2 - \mu_1$ (young adults minus seniors), using the difference in the sample means:
  $\bar{x}_2 - \bar{x}_1 = 500$
• Q: How do we quantify uncertainty in this estimate?
  § answer depends on what we are willing to assume about $\sigma_1^2$ and $\sigma_2^2$

Quantifying uncertainty assuming equal variances

• Unlike the paired setting, the variances of the two samples may not be equal, which affects our approach to quantifying uncertainty
• Consider the ratio $s_1/s_2$:
  § if this is relatively close to 1.0, then it may be reasonable to proceed as if $\sigma_1^2 = \sigma_2^2 = \sigma^2$
  § one rule of thumb is to consider whether the ratio is between 0.5 and 2.0
  § see Rosner Sec. 8.6
• If equality can be assumed, then one can estimate the common variance by the pooled estimator:
  $s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$

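• Then, with $n_1$ and $n_2$ “large”, we have that:
  $\frac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}$
• Use this to construct a $100(1-\alpha)\%$ confidence interval for $\mu_1 - \mu_2$:
  $\bar{x}_1 - \bar{x}_2 \pm t_{n_1+n_2-2,\,1-\alpha/2} \cdot s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$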

• Recall the details from the toy caloric comparison example
  § seniors: $\bar{x}_1 = 2000$, $s_1 = 250$, $n_1 = 13$
  § young adults: $\bar{x}_2 = 2500$, $s_2 = 200$, $n_2 = 10$
• We therefore have $s_1/s_2 = 250/200 = 1.25$, so that assuming equal variances may be reasonable

• The pooled variance estimate is:
  $s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2} = \frac{12 \cdot 250^2 + 9 \cdot 200^2}{21} = 52857$, so that $s_p = \sqrt{52857} = 229.9$
• Thus, a 95% CI for the difference is
  $2500 - 2000 \pm t_{21,\,0.975} \cdot 229.9 \sqrt{\frac{1}{13} + \frac{1}{10}} = 500 \pm 201$
  § with 95% confidence we conclude the difference in mean daily calorie intake between young adults and seniors is between 299 and 701
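The arithmetic for the toy caloric example can be checked with a short Python sketch. The summary statistics come from the example itself; the critical value $t_{21,\,0.975} = 2.080$ is an assumption taken from a standard t table, and the function name `pooled_variance` is just illustrative.

```python
import math

# Summary statistics from the toy caloric example.
x1_bar, s1, n1 = 2000.0, 250.0, 13   # seniors
x2_bar, s2, n2 = 2500.0, 200.0, 10   # young adults

# Assumption: t_{21, 0.975} = 2.080, read from a standard t table.
T_CRIT = 2.080


def pooled_variance(s1, n1, s2, n2):
    """Pooled estimate of the common variance of two samples."""
    return ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)


sp2 = pooled_variance(s1, n1, s2, n2)
se = math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2)  # SE of the difference
diff = x2_bar - x1_bar                            # young adults minus seniors
margin = T_CRIT * se

print(f"pooled variance = {sp2:.0f}, pooled sd = {math.sqrt(sp2):.1f}")
print(f"95% CI: {diff:.0f} +/- {margin:.0f}")
```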

Quantifying uncertainty assuming unequal variances

• If it is not reasonable to assume equal variances, then the ‘pooled’ estimate is no longer appropriate
• We can, however, directly quantify uncertainty in the estimate $\bar{x}_1 - \bar{x}_2$ by appealing to properties of random variables laid out on Slide 24 of Lecture 2
• If $\mathrm{Var}[X_{ji}] = \sigma_j^2$ for $j = 1, 2$, then assuming the two samples are independent:
  $\mathrm{Var}[\bar{x}_1 - \bar{x}_2] = \sigma_1^2/n_1 + \sigma_2^2/n_2$
  and it follows that
  $\mathrm{sd}[\bar{x}_1 - \bar{x}_2] = \sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$

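• Then, with $n_1$ and $n_2$ “large”, we have that:
  $\frac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} \sim t_{d'}$
• Note, the degrees of freedom for this $t$-distribution is the following rounded down to the nearest integer:
  $d' = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$
• This result again enables a $100(1-\alpha)\%$ confidence interval for $\mu_1 - \mu_2$ defined by
  $\bar{x}_1 - \bar{x}_2 \pm t_{d',\,1-\alpha/2} \cdot \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$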
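To see how the unequal-variance quantities behave, here is a small Python sketch that computes the standard error $\sqrt{s_1^2/n_1 + s_2^2/n_2}$ and the rounded-down degrees of freedom for the toy caloric example (seniors: $s = 250$, $n = 13$; young adults: $s = 200$, $n = 10$). The function name `welch_se_and_df` is an illustrative choice, not something from the lecture.

```python
import math


def welch_se_and_df(s1, n1, s2, n2):
    """Standard error and rounded-down degrees of freedom, unequal variances."""
    v1 = s1**2 / n1
    v2 = s2**2 / n2
    se = math.sqrt(v1 + v2)
    d_prime = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return se, math.floor(d_prime)


# Toy caloric example: seniors (s=250, n=13) and young adults (s=200, n=10).
se, df = welch_se_and_df(250.0, 13, 200.0, 10)
print(f"standard error = {se:.2f}, degrees of freedom = {df}")
```

For these data the degrees of freedom come out one below the $n_1 + n_2 - 2 = 21$ used under the equal-variance assumption, so the two approaches give similar intervals here.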

Hypothesis testing

• So far, focus has been exclusively on
  1) estimating quantities of interest
     o e.g. means or differences in means
  2) quantifying the uncertainty of those estimates
     o via standard errors and confidence intervals
• Sometimes, the scientific goal involves assessing evidence in relation to some specific hypothesis
  § e.g., test whether there is evidence that the difference in mean daily caloric intake between young adults and seniors is non-zero
  § refer to this as hypothesis testing

• As we cover hypothesis testing, it is worth having in the back of your mind an analogy with the legal system
• Consider an individual who has been accused of a crime and is undergoing a trial
  § at the outset, the individual is “presumed innocent”
  § evidence is presented regarding the guilt
  § a “guilty” or “not guilty” verdict is given
• A couple of relevant details:
  § “innocent” is not an option for the verdict
  § mistakes can be made

• Cluster-randomized trial to assess a quality improvement intervention:
  § Larson et al (2019), “Effect of a maternal and newborn quality improvement project on the use of facilities for childbirth: a cluster-randomised study in rural Tanzania”, Tropical Medicine & International Health 24(5): 636–646, doi:10.1111/tmi.13220
  § 12 primary care clinics and their catchment areas received the intervention, while 12 facilities and their catchment areas functioned as controls
  § intervention consisted of in-service training, mentoring and supportive supervision, infrastructure support, and peer outreach
• Primary question is whether the intervention resulted in an increase in the percentage of deliveries that were “in-facility”

General steps to hypothesis testing

• Q: How do we formally assess questions such as the one just posed?
• General steps:
  1) Check assumptions
  2) Formulate hypotheses
  3) Calculate the test statistic and the corresponding p-value
  4) State the conclusion

1) Checking assumptions

• Are units sampled randomly from the population(s) of interest?
  § allows the findings to be generalized
• Are study subjects independent? Watch out for
  § dependence between measurements within subgroups of similar units (cluster effect)
  § dependence between measurements collected close in time (serial effect)
  § interference across space (spatial effect)
• Are population distribution(s) approximately Normal? (e.g., do they come from a roughly symmetric distribution?)
  § In a paired test, assume the distribution of differences is normal
  § In an unpaired test, assume each population distribution is normal
  § Note that the CLT allows for deviations from normality for larger samples

2) Formulate hypotheses: the “Null hypothesis”

• Just as in the legal analogy, hypothesis testing begins with an initial “position” on the question of whether the intervention improves the outcome
  § referred to as the null hypothesis
  § reflects the status quo, in the absence of any information
  § use the notation “$H_0$”
• In the previous example:
  $H_0$: there is no difference in outcomes between clinics that received the intervention and those that did not

• Towards performing a statistical analysis, we usually translate this statement using parameters to reflect the quantities we are interested in
  § e.g., a null hypothesis that the mean difference is equal to zero would read: $H_0: \Delta = 0$
  § or we could write it in terms of the difference in means: $H_0: \mu_1 - \mu_2 = 0$
• Note, we could generalize the above with, say, $H_0: \Delta = \Delta_0$, for some $\Delta_0 \neq 0$
• Hypothesis testing then proceeds by assessing evidence to the contrary of the null hypothesis
  § “proof” by contradiction

2) Formulate hypotheses: the “Alternative hypothesis”

• In the legal analogy, the alternative to “not guilty” is “guilty”
• Q: What would the alternative hypothesis be for the Tanzania study?
• In contrast to the legal analogy, one could write down a number of such alternative hypotheses:
  § the intervention does something
  § the intervention improves outcomes
  § the intervention makes things worse

• Notationally, we write “$H_1$” and (again) usually translate them in terms of the same parameter(s) we used to write down the null hypothesis
• For example, for $H_0: \Delta = 0$, each of the following is a valid alternative:
  § $H_1: \Delta \neq 0$
  § $H_1: \Delta > 0$
  § $H_1: \Delta < 0$

• Strategy is to consider whether what we learn from a sample of data provides sufficient evidence to “reject” the null hypothesis, implicitly in favor of the alternative hypothesis
• Approach taken depends on the framing of the alternative
• Two-sided alternative:
  § $H_1: \Delta \neq \Delta_0$
    o Reject $H_0$ if $\bar{D}$ is either much lower than $\Delta_0$ or much greater than $\Delta_0$
• One-sided alternative:
  § $H_1: \Delta > \Delta_0$
    o Reject $H_0$ if $\bar{D}$ is much greater than $\Delta_0$
  § $H_1: \Delta < \Delta_0$
    o Reject $H_0$ if $\bar{D}$ is much lower than $\Delta_0$

3) Calculate a test statistic

• Once hypotheses are set, we need a means of quantifying the strength of evidence so that a decision can be made regarding whether or not to reject the null hypothesis
  § implicitly in favor of the alternative
• The first step is to calculate a test statistic
• One common choice is the statistic:
  $T = \frac{\text{sample statistic} - [\text{value under } H_0]}{[\text{est. standard error of statistic}]}$
  § roughly corresponds to a ratio of ‘signal’ to ‘noise’
• Other options correspond to different tests:
  § likelihood ratio test

3) Calculate a p-value

• Suppose that the observed value of the test statistic is $t^*$
  § value of $T$ based on the data we collected
• We need to assess whether $t^*$ is “large” or “small”
• While it is not without controversy, the standard approach to quantifying evidence in hypothesis testing is the p-value:
  The probability of observing a value of $T$ as extreme or more extreme than $t^*$ if the null hypothesis is actually true
  § “probability” in the frequentist sense
  § i.e. across hypothetical repetitions of the experiment/study

• To calculate a p-value we need to know the sampling distribution of the test statistic in the setting where the null hypothesis is true
  § say “under $H_0$”
• Crucially, statistical theory tells us what it is!
• Suppose, for example, we have a single sample and are testing a single mean: $H_0: \mu = \mu_0$
  § if $\sigma$ is known, we have under $H_0$ that: $Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0,1)$
  § if $\sigma$ is not known, we have under $H_0$ that: $T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{n-1}$

• Suppose we have paired two-sample data and are testing the mean difference:
  $H_0: \Delta = \Delta_0$
  § we have under $H_0$ that:
    $T = \frac{\bar{D} - \Delta_0}{S_D/\sqrt{n}} \sim t_{n-1}$
• Suppose we have unpaired two-sample data with equal variances and are testing the difference in means:
  $H_0: \mu_1 - \mu_2 = \mu_{1,0} - \mu_{2,0}$
  § we have under $H_0$ that:
    $T = \frac{\bar{x}_1 - \bar{x}_2 - (\mu_{1,0} - \mu_{2,0})}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1+n_2-2}$
    where $s_p^2$ is the pooled variance estimate (defined in the equal-variance section above)

• Suppose we have unpaired two-sample data with unequal variances and are testing the difference in means:
  $H_0: \mu_1 - \mu_2 = \mu_{1,0} - \mu_{2,0}$
  § we have under $H_0$ that:
    $T = \frac{\bar{x}_1 - \bar{x}_2 - (\mu_{1,0} - \mu_{2,0})}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} \sim t_{d'}$
    where $d'$ is the degrees of freedom given in the unequal-variance section above
• Note, if the sample size is “large”, then for each of the above settings, we have: $T \sim N(0,1)$
  § i.e. don’t need to worry about the degrees of freedom

• Whatever the sampling distribution is, we can now use it to quantify the probability of a value of $T$ that is as extreme or more extreme than $T = t^*$
• What is meant by “extreme” depends on the alternative hypothesis
• Suppose we are testing:
  $H_0: \Delta = 0$ vs $H_1: \Delta \neq 0$
  § “extreme” would correspond to any value far from 0
  § consider both large negative and positive values
  § referred to as a “two-sided alternative”
  § calculate: $p = P(|T| > |t^*|)$, the combined area in the two tails beyond $-|t^*|$ and $|t^*|$

• Suppose we are testing:
  $H_0: \Delta = 0$ vs $H_1: \Delta > 0$
  § “extreme” would correspond to values much greater than 0
  § only consider large positive values
  § referred to as a “one-sided alternative”
  § calculate: $p = P(T > t^*)$
• Analogously, we would calculate $p = P(T < t^*)$ if we are testing:
  $H_0: \Delta = 0$ vs $H_1: \Delta < 0$
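The three p-value definitions (two-sided, upper one-sided, lower one-sided) differ only in which tail area of the null distribution is used. A minimal Python sketch, assuming the null distribution is symmetric about 0 and is supplied as a CDF; the function name `p_value` and the labels "two-sided", "greater" and "less" are illustrative choices, and the standard normal is used here as the large-sample null distribution.

```python
from statistics import NormalDist


def p_value(t_star, cdf, alternative="two-sided"):
    """p-value for an observed test statistic t_star.

    cdf is the CDF of the test statistic under the null hypothesis,
    assumed symmetric about 0. The alternative labels are hypothetical
    names chosen for this sketch.
    """
    if alternative == "two-sided":
        return 2 * (1 - cdf(abs(t_star)))
    if alternative == "greater":
        return 1 - cdf(t_star)
    if alternative == "less":
        return cdf(t_star)
    raise ValueError(f"unknown alternative: {alternative}")


# Large-sample case: the null distribution is treated as N(0, 1).
std_normal_cdf = NormalDist().cdf
for alt in ("two-sided", "greater", "less"):
    print(alt, round(p_value(-1.96, std_normal_cdf, alt), 3))
```

The same observed statistic can give very different p-values depending on the alternative, which is why the alternative has to be fixed before looking at the data.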

4) State the conclusion

• Small p-value:
  § data would have been unlikely if $H_0$ were true
  § reject $H_0$
• Large p-value:
  § data not inconsistent with $H_0$
  § do not reject $H_0$
• Decision is made by comparing the p-value to the significance level
  § typically denoted by $\alpha$
  § corresponds to the largest p-value for which we’d reject $H_0$
  § commonly pre-specified to equal 0.05 or 0.01
  § connection to the $100(1-\alpha)\%$ confidence interval idea

• If we reject the null, we could say that the result (or test) is “statistically significant at the $100\alpha\%$ level”.
  § example: $\alpha = 0.05$ ⇒ “significant at 5% level”
• It is important (and often required) that $\alpha$ be chosen in advance
  § before conducting a test
  § ensures objectivity of one’s conclusions
• Choosing a lower value of $\alpha$ as the threshold for rejection corresponds to requiring stronger evidence against $H_0$ before believing $H_1$
  § considered “conservative”

Additional comments

• You can only form a hypothesis about a population quantity, such as $\mu$
  § it is meaningless to state a “hypothesis” about a sample quantity (e.g., $H_0: \bar{X} = 3$), because that’s a random variable whose value is different in different samples
  § “There’s a 30% chance it’s already raining” type statement
• Even if $H_0$ is not rejected, it is not “proven” or “accepted”
  § all this means is that the data failed to contradict it
  § “Not guilty” ≠ “innocent”!

• In the frequentist view, we do not consider the “probability that $H_0$ is true” (or false)
  § common misinterpretation of a p-value
  § $H_0$ either is true or it isn’t
  § inference is about deciding whether we have enough evidence to reject it or not

Example: a paired sample hypothesis test

• Recall: an experimental drug is administered to 20 patients. Each patient’s SBP before and after treatment is measured, and $\bar{d} = -5$ mmHg. We also compute $s_d = 10$.
• $H_0: \Delta = 0$ versus $H_1: \Delta \neq 0$
  § $t^* = \frac{\bar{d} - 0}{s_d/\sqrt{n}} = \frac{-5}{10/\sqrt{20}} = -2.23$ on 19 degrees of freedom
  § $p = P(|T| > |t^*|) = 2 \cdot P(T > 2.23) = 2(0.019) = 0.038 < 0.05$
• Thus, we reject $H_0$ at the 0.05 level, and conclude there is evidence that the “mean of differences is statistically significantly different from 0”

[Figure: $t_{19}$ density with an area of 0.019 shaded in each tail, below $-2.23$ and above $2.23$.]

• What if this test was just one-sided?
• $H_0: \Delta = 0$ versus $H_1: \Delta < 0$
  § $t^* = \frac{\bar{d} - 0}{s_d/\sqrt{n}} = \frac{-5}{10/\sqrt{20}} = -2.23$ on 19 degrees of freedom
  § $p = P(T < t^*) = P(T < -2.23) = 0.019 < 0.05$

[Figure: $t_{19}$ density with an area of 0.019 shaded in the left tail, below $-2.23$.]

• Thus, we again reject $H_0$ at the 0.05 level
  § conclude there is evidence of a statistically significant difference
  § do not include the right-tail area in the p-value because we wouldn’t have considered $T > 2.23$ to be evidence against $H_0$
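The tail areas in the SBP example can be reproduced without statistical software by integrating the $t$ density numerically. The sketch below is one way to do it using only the Python standard library; the helper names `t_pdf` and `t_upper_tail` and the use of Simpson's rule are choices made for this illustration, and a statistics package would normally be used instead.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


# SBP example: n = 20 patients, mean difference -5 mmHg, SD of differences 10.
n, d_bar, s_d = 20, -5.0, 10.0
t_star = (d_bar - 0.0) / (s_d / math.sqrt(n))

# By symmetry, P(T < t_star) equals P(T > |t_star|) when t_star is negative.
one_sided = t_upper_tail(abs(t_star), n - 1)
two_sided = 2 * one_sided
print(f"t* = {t_star:.3f}, one-sided p = {one_sided:.3f}, two-sided p = {two_sided:.3f}")
```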

T-test vs. Z-test

• Our focus has been on t-tests, computed assuming that population variances are unknown
  § Much more likely to be the case in reality
• When population variances are known, or if sample sizes are ‘large,’ the $t$-distribution can be replaced by $N(0,1)$, so the test statistic under the null can be treated as standard normal
• In all preceding discussion, simply replace the null $t$-distribution with a standard normal (i.e., quantiles, etc.), and the resulting test is called a z-test
  § Z-tests also form the basis for testing MLEs, using the ‘asymptotic normality’ property

Non-parametric hypothesis testing

• T-tests rely on the assumption that the underlying population is roughly normally distributed or the sample size is large (or, preferably, both).
• If these conditions are not fulfilled, then when $H_0$ is true the test statistics do not necessarily follow the t-distribution
• If one naively forged ahead with a t-distribution:
  § the p-value would be the wrong numerical value
  § the “correct” p-value should have been calculated on the basis of some other distribution
• In such settings, nonparametric methods may be a way forward

• Nonparametric tests address hypotheses about the population median, not the population mean
• Median = “middle number” of a set of values. If a few extreme values are present, the median is more representative of a “typical” observation than the mean. It is said to be “robust to outliers”.
• Silly example: if Mark Zuckerberg moves onto your block, the mean income of your neighborhood changes a lot (how?). The median income hardly changes at all.
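The neighborhood example is easy to try out. In the Python sketch below every income figure is made up purely for illustration; the point is only to compare how the mean and the median respond when one extreme value is added.

```python
import statistics

# Hypothetical annual incomes (in dollars) for a block of seven households.
incomes = [48_000, 52_000, 55_000, 60_000, 63_000, 70_000, 75_000]

# One extremely high-income household moves in (the value is made up).
with_outlier = incomes + [1_000_000_000]

for label, data in (("before", incomes), ("after", with_outlier)):
    print(label,
          "mean =", round(statistics.mean(data)),
          "median =", statistics.median(data))
```

The mean is dragged far above every original household's income, while the median moves only to the midpoint of the two central values.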

Nonparametric tests to know

• Wilcoxon Signed-Rank Test
  § replaces 1-sample or paired t-tests
  § tests whether population median is equal to a certain prespecified value (usually, 0)
  § a simplified version is called the sign test
• Wilcoxon Rank Sum Test
  § replaces unpaired two-sample t-test
  § assumes the two distributions have same general shape

Wilcoxon signed-rank test

• Wilcoxon signed-rank test replaces observations (or differences between paired observations) with their signs and ranks.
  § The ranks of the observations incorporate information about relative magnitude while still being robust to outliers.
  § Compute in Matlab: p = signrank(x)
• Example: 10 individuals have SBP measured before and after taking a dietary supplement for 1 month

  Individual   SBP before   SBP after
  1            118          118
  …            …            …
  10           120          124

Ranking procedure and test statistic

1) Remember the sign of each difference and arrange the differences in order of their absolute values.
2) Rank observations from 1 (lowest absolute value) to $n$ (highest absolute value). Ignore zeros.
3) If you come to $k > 1$ observations with the same absolute value, replace their ranks with the average of all $k$ ranks assigned to them initially.
4) $R$ = sum of ranks of positive differences

• $R = 1 + 3 + 3 + 6.5 + 9 = 22.5$

  individual   SBP before   SBP after   diff   |diff|
  1            118          118           0      0
  2            110          112           2      2
  3            140          135          -5      5
  4            120          121           1      1
  5            135          133          -2      2
  6            116          126          10     10
  7            125          127           2      2
  8            133          130          -3      3
  9             95           91          -4      4
  10           120          124           4      4
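The ranking steps can be traced on the SBP table with a short Python sketch. The data are the ten before/after values from the table; the helper name `average_ranks` is an illustrative choice. Zero differences are dropped before ranking and tied absolute values share the average of their ranks, as in the procedure described earlier.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # average of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks


before = [118, 110, 140, 120, 135, 116, 125, 133, 95, 120]
after = [118, 112, 135, 121, 133, 126, 127, 130, 91, 124]

diffs = [a - b for a, b in zip(after, before)]
nonzero = [d for d in diffs if d != 0]          # zeros are ignored
ranks = average_ranks([abs(d) for d in nonzero])
rank_sum_positive = sum(r for d, r in zip(nonzero, ranks) if d > 0)
print("sum of ranks of positive differences:", rank_sum_positive)
```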

Null distribution

• The Wilcoxon Signed-Rank test uses the hypotheses
  § $H_0$: median difference = 0
  § $H_1$: median difference ≠ 0
• If $H_0$ is true and $n \geq 16$, then the statistic
  $Z = \frac{R - \mu_R}{\sigma_R} \sim N(0,1)$
• Where $\mu_R = \frac{n(n+1)}{4}$ and $\sigma_R = \sqrt{\frac{n(n+1)(2n+1)}{24}}$
  § Note: Rosner uses a slightly modified version of this test!

• Continuing SBP example, where $R = 22.5$
  $n = 10 \Rightarrow \mu_R = \frac{10(11)}{4} = 27.5$ and $\sigma_R = \sqrt{\frac{10(11)(21)}{24}} = 9.81$
• Then $Z = \frac{22.5 - 27.5}{9.81} = -0.51$
• Then, $p = 2 \cdot P(Z < -0.51) = 0.61$
  § Conclude that there is no statistically significant evidence that the median difference is non-zero
  § Using Table 11 in Rosner appendix yields same conclusion
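The normal approximation for the signed-rank statistic uses a null mean of $n(n+1)/4$ and a null standard deviation of $\sqrt{n(n+1)(2n+1)/24}$. The Python sketch below applies it to the SBP example with a rank sum of 22.5 and $n = 10$, the value of $n$ used in the lecture's calculation. The function name `signed_rank_z` is illustrative, and no tie or continuity correction is applied, which is an assumption of this sketch.

```python
import math
from statistics import NormalDist


def signed_rank_z(rank_sum, n):
    """Normal approximation for the signed-rank statistic.

    No tie correction or continuity correction is applied.
    """
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (rank_sum - mean) / sd
    p_two_sided = 2 * NormalDist().cdf(-abs(z))
    return mean, sd, z, p_two_sided


# SBP example: rank sum of positive differences 22.5, n = 10.
mean, sd, z, p = signed_rank_z(22.5, 10)
print(f"mean = {mean}, sd = {sd:.2f}, Z = {z:.2f}, p = {p:.2f}")
```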

Wilcoxon rank-sum test

• Wilcoxon Rank Sum Test is a rank-based test for two independent samples
• Also referred to as Mann-Whitney Test
  § Compute in Matlab: p = ranksum(x,y)
• Example: compare length of patient stay (in days), for a particular condition, at two different hospitals.
  § Hospital #1: 21, 10, 32, 60, 8, 44, 29, 5, 13, 26, 33
  § Hospital #2: 86, 27, 10, 68, 87, 76, 125, 60, 35, 73, 96, 44, 238
• Why might a nonparametric test for the difference in lengths of stay between the two hospitals be more appropriate?

Ranking procedure and test statistic

• $n_L$ = size of larger sample
• $n_S$ = size of smaller sample
• Ranking Procedure:
  1) Combine samples into one group and order values from lowest to highest
  2) Assign ranks from lowest to highest; assign average rank to any tied values
  3) Test statistic $W$ = sum of ranks in smaller sample

Null distribution

• The Wilcoxon rank-sum test uses the hypotheses
  § $H_0$: median in group 1 = median in group 2
  § $H_1$: median in group 1 ≠ median in group 2
• If $H_0$ is true and $n_S \geq 10$, then the statistic
  $Z = \frac{W - \mu_W}{\sigma_W} \sim N(0,1)$
• Where $\mu_W = \frac{n_S (n_S + n_L + 1)}{2}$ and $\sigma_W = \sqrt{\frac{n_S n_L (n_S + n_L + 1)}{12}}$
  § Note: Rosner uses a slightly modified version of this test!

• Returning to our hospital example
  § Hospital #1: 21, 10, 32, 60, 8, 44, 29, 5, 13, 26, 33
  § Hospital #2: 86, 27, 10, 68, 87, 76, 125, 60, 35, 73, 96, 44, 238
• Ordered and ranked observations

  Ordered values:  5   8   10   10   13  21  26  27  29  32  33  35  44    44
  Ranks:           1   2   3.5  3.5  5   6   7   8   9   10  11  12  13.5  13.5

  Ordered values:  60    60    68  73  76  86  87  96  125  238
  Ranks:           15.5  15.5  17  18  19  20  21  22  23   24

• $n_S = 11$, so the normal approximation should hold. Also, $n_L = 13$

• $\mu_T = \frac{11(11 + 13 + 1)}{2} = 137.5$, $\sigma_T = \sqrt{\frac{11 \cdot 13 \cdot (11 + 13 + 1)}{12}} = 17.26$

• The Hospital #1 observations (the smaller sample) have ranks 1, 2, 3.5, 5, 6, 7, 9, 10, 11, 13.5, 15.5, so

$$T = 1 + 2 + 3.5 + 5 + 6 + 7 + 9 + 10 + 11 + 13.5 + 15.5 = 83.5$$

• Then $Z = \frac{T - \mu_T}{\sigma_T} = \frac{83.5 - 137.5}{17.26} = -3.13$ and

$$p = P(|Z| > 3.13) = 2 \cdot P(Z > 3.13) = 2(0.0009) = 0.0018$$

• Therefore, we reject $H_0$ with a p-value of 0.0018
§ Median stay at the two hospitals is not the same.
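The hospital calculation can be reproduced with the Python standard library. This is a sketch under the normal approximation without a tie correction; the function name `rank_sum_z_test` and its parameter names are illustrative, not part of any library.

```python
from math import sqrt
from statistics import NormalDist


def rank_sum_z_test(rank_sum, n_small, n_large):
    """Normal approximation to the Wilcoxon rank-sum test (no tie correction)."""
    total = n_small + n_large + 1
    mu = n_small * total / 2                     # mean of the rank sum under H0
    sigma = sqrt(n_small * n_large * total / 12) # SD of the rank sum under H0
    z = (rank_sum - mu) / sigma
    p_two_sided = 2 * NormalDist().cdf(-abs(z))
    return mu, sigma, z, p_two_sided


# Hospital example: rank sum 83.5 in the smaller sample (11 stays) versus 13 stays.
mu, sigma, z, p = rank_sum_z_test(rank_sum=83.5, n_small=11, n_large=13)
print(f"mu = {mu}, sigma = {sigma:.2f}, z = {z:.2f}, p = {p:.4f}")
```

For comparison, the Matlab call `ranksum(x,y)` may use an exact method or a corrected approximation for small samples, so its p-value need not match this figure to the last digit.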

Robustness of rank-based test

• For the same data, how does $T$ change if we change values:
§ 73 → 200 ($T$ does not change)
§ 8 → 1 ($T$ does not change)
§ 32 → 500 ($T$ only increases by 11)
• This is why such tests are called robust

Ordered values   5     8     10    10    13    21    26    27    29    32    33    35
Ranks            1     2     3.5   3.5   5     6     7     8     9     10    11    12

Ordered values   44    44    60    60    68    73    76    86    87    96    125   238
Ranks            13.5  13.5  15.5  15.5  17    18    19    20    21    22    23    24
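The three perturbations can be verified directly. The sketch below recomputes the rank sum of the smaller sample after each change; `rank_sum` and `replace_value` are hypothetical helpers written for this illustration.

```python
def rank_sum(smaller, larger):
    """Sum of pooled average ranks for the observations in `smaller`."""
    pooled = smaller + larger

    def avg_rank(x):
        below = sum(v < x for v in pooled)   # values strictly smaller than x
        ties = sum(v == x for v in pooled)   # values equal to x (including x itself)
        return below + (ties + 1) / 2        # average of ranks below+1 .. below+ties

    return sum(avg_rank(x) for x in smaller)


def replace_value(sample, old, new):
    """Return a copy of `sample` with every occurrence of `old` replaced by `new`."""
    return [new if v == old else v for v in sample]


hospital_1 = [21, 10, 32, 60, 8, 44, 29, 5, 13, 26, 33]
hospital_2 = [86, 27, 10, 68, 87, 76, 125, 60, 35, 73, 96, 44, 238]

baseline = rank_sum(hospital_1, hospital_2)
t_73_to_200 = rank_sum(hospital_1, replace_value(hospital_2, 73, 200))
t_8_to_1 = rank_sum(replace_value(hospital_1, 8, 1), hospital_2)
t_32_to_500 = rank_sum(replace_value(hospital_1, 32, 500), hospital_2)

print(baseline, t_73_to_200, t_8_to_1, t_32_to_500)
```

The change 32 → 500 moves one Hospital #1 observation from rank 10 to rank 24 (+14) while three other Hospital #1 observations each drop one rank (−3), which accounts for the net increase of 11 no matter how extreme the new value is.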

Summary

• Both tests formulate $H_0$ using medians
§ Wilcoxon Signed-Rank for paired two-sample or one-sample test
§ Wilcoxon Rank-Sum for unpaired two-sample test
• Positives of non-parametric testing:
§ robust to non-normality: still valid even if distributions are highly non-normal, or sample sizes are relatively small, or there are outliers, or data are ordinal, or observations are censored…
§ less sensitive to measurement error
§ have corresponding C.I.’s (computed by enumeration)
• Negatives of non-parametric testing:
§ reduce information in the data
§ loss of power is the price paid for robustness; if the data really do follow a normal distribution, it’s better to use parametric methods.

Power and sample size calculation

• Let’s return to the basic steps of hypothesis testing
1) Check assumptions
2) Formulate null & alternative hypotheses
3) Calculate test statistic
4) Find the p-value
5) State your conclusion
• Will our inference always be correct?

Errors in hypothesis testing

                       The truth
Our decision           $H_0$ true                       $H_0$ false
Reject $H_0$           Type I error, $\alpha$           Correct decision, $1 - \beta$
Do not reject $H_0$    Correct decision, $1 - \alpha$   Type II error, $\beta$

• Type I error is a “false alarm”: $P(\text{Type I error}) = P(\text{reject } H_0 \mid H_0 \text{ true}) = \alpha$
§ (same as the significance level of the test)
• Type II error is an “alarm failure”: $P(\text{Type II error}) = P(\text{don't reject } H_0 \mid H_0 \text{ false}) = \beta$
• Power $= 1 - \beta = P(\text{reject } H_0 \mid H_0 \text{ false})$
• In discussing $\alpha$, we focused on keeping the type I error rate low.
§ However, we also want a hypothesis test to be powerful (able to detect a difference if one exists)
§ Balance making both $\alpha$ and $\beta$ as small as possible

Significance and power cut-offs

• Significance level and desired power are important to study design. When you design a study to test a hypothesis, you need to know:
§ What value of the test statistic will you use as the threshold for calling a result “significant”?
§ Common thresholds: 0.01, 0.05, and 0.1. But there is nothing magical about these cut-off values!
• If the null hypothesis is false, what power will your test have to declare that it is false (at a given significance level)?
§ Be aware that the theoretical power can be made arbitrarily large if one sets the alternative hypothesis too far away from the null!

Power calculation: simplified setting

• To illustrate the fundamental concepts relating statistical power and sample size in study design, consider a simplified example where the standard deviation is known, and only two population means are possible:

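§ $\mu = \mu_0$ or $\mu = \mu_1$
• This means that our sample mean follows one of two distributions:

$$\bar{X} \sim N\left(\mu_0, \frac{\sigma^2}{n}\right), \text{ or } \bar{X} \sim N\left(\mu_1, \frac{\sigma^2}{n}\right)$$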

One-sided hypothesis testing

• In this world imagine testing $H_0: \mu = \mu_0$ versus $H_1: \mu > \mu_0$

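[Figure: distribution of $\bar{X}$ under $H_0$ (centered at $\mu_0$) and under $H_1$ (centered at $\mu_1$), with the critical value marked and regions labeled "Type I error probability" and "Power".]

↓ Type I error rate ⇒ ↓ Power

[Figure: two panels showing the distributions centered at $\mu_0$ and $\mu_1$, each with its critical value and Type I error region marked.]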

↑ Type I error rate ⇒ ↑ Power

[Figure: two panels showing the distributions centered at $\mu_0$ and $\mu_1$, each with its critical value and Type I error region marked.]

Power calculation: z-test

• The power of a one-sided z-test, defined by $H_0: \mu = \mu_0$ versus $H_1: \mu = \mu_1 > \mu_0$, is given by

$$\text{Power} = 1 - \beta = \Phi\left(-z_{1-\alpha} + \frac{|\mu_1 - \mu_0| \sqrt{n}}{\sigma}\right)$$

§ where $\Phi(c) = P(Z \leq c)$ for $Z \sim N(0,1)$

• For a two-sided test, we simply substitute $\alpha/2$ for $\alpha$ in the above

• To decide the threshold for rejecting the null, we need only specify the distribution of the test statistic under the null hypothesis.

• To determine power, you need to know the test statistic's distribution under the alternative, which is often much harder to specify, requiring information from prior studies or scientific intuition.

• As another example, researchers wish to test whether a drug is effective in reducing intraocular pressure by 5 mmHg (for glaucoma prevention).
§ They will conclude effectiveness if they see a result significant at the 5% level. The standard deviation of the pressure change, on the basis of previous data, is believed to be 20 mmHg.
• If they administer the drug to 100 patients, what is the probability they will detect a change if the drug is truly effective?
§ $H_0: \mu = \mu_0$ versus $H_1: \mu > \mu_0$
§ $n = 100$, $\sigma = 20$, $\alpha = 0.05$, $\mu_1 - \mu_0 = 5$

$$\text{Power} = \Phi\left(-1.645 + \frac{5 \sqrt{100}}{20}\right) = \Phi(0.855) = 0.804$$
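A short script makes the power calculation reproducible. This is a minimal sketch of the one-sided z-test power formula using the standard library's `statistics.NormalDist`; the function name `z_test_power` and its arguments are illustrative. It uses the exact standard normal quantile for $\alpha = 0.05$, which is approximately 1.645.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(delta, sigma, n, alpha, two_sided=False):
    """Power of a z-test with known sigma: Phi(-z_crit + |delta| * sqrt(n) / sigma)."""
    std_normal = NormalDist()
    tail = alpha / 2 if two_sided else alpha     # two-sided: substitute alpha/2
    z_crit = std_normal.inv_cdf(1 - tail)        # upper critical value
    return std_normal.cdf(-z_crit + abs(delta) * sqrt(n) / sigma)


# Intraocular pressure example: difference 5, SD 20, n = 100, one-sided alpha = 0.05.
power = z_test_power(delta=5, sigma=20, n=100, alpha=0.05)
print(f"power = {power:.3f}")
```

With these inputs the power comes out at roughly 0.80, so about one study in five of this size would miss a true 5 mmHg effect.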

Factors affecting power

• Power formula relates many design aspects:

$$\text{Power} = 1 - \beta = \Phi\left(-z_{1-\alpha} + \frac{|\mu_1 - \mu_0| \sqrt{n}}{\sigma}\right)$$

1) significance level $\alpha$
2) $|\mu_1 - \mu_0|$
3) sample size $n$
4) population standard deviation $\sigma$
o This is often not under the researcher’s control, but the researcher must be able to make a good guess of it for valid power estimation
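To see how each of the four factors moves the power, one can vary them one at a time. The sketch below does this for a hypothetical baseline design (difference 5, SD 20, n = 50, α = 0.05); these numbers are assumptions chosen for illustration, not values from the lecture.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(delta, sigma, n, alpha):
    """One-sided z-test power: Phi(-z_{1-alpha} + |delta| * sqrt(n) / sigma)."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf(-z_crit + abs(delta) * sqrt(n) / sigma)


# Hypothetical baseline design; each scenario changes exactly one input.
baseline = dict(delta=5, sigma=20, n=50, alpha=0.05)
scenarios = {
    "baseline": {},
    "larger alpha (0.10)": {"alpha": 0.10},
    "larger difference (8)": {"delta": 8},
    "larger n (100)": {"n": 100},
    "smaller sigma (15)": {"sigma": 15},
}

powers = {
    name: z_test_power(**{**baseline, **change})
    for name, change in scenarios.items()
}
for name, value in powers.items():
    print(f"{name:<24} power = {value:.3f}")
```

Every scenario other than the baseline yields higher power, matching the direction of each factor in the formula.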

↑ $\alpha$ ⇒ ↑ power

[Figure: two panels showing the distributions centered at $\mu_0$ and $\mu_1$.]

↑ $n$ OR ↓ $\sigma$ ⇒ ↑ power

[Figure: two panels showing the distributions centered at $\mu_0$ and $\mu_1$.]

↑ $|\mu_1 - \mu_0|$ ⇒ ↑ power

[Figure: two panels showing the distributions centered at $\mu_0$ and $\mu_1$.]

Sample size calculation

• Instead of first fixing $n$ and then calculating power, we can also choose the desired power and calculate the required $n$.

• To perform a one-sided z-test of $H_0: \mu = \mu_0$ vs. $H_1: \mu = \mu_1$ with significance level $\alpha$ and power $1 - \beta$, the required sample size is given by

$$n = \frac{\sigma^2 \left(z_{1-\alpha} + z_{1-\beta}\right)^2}{(\mu_1 - \mu_0)^2}$$

• For a two-sided test, simply substitute $\alpha/2$ for $\alpha$
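The sample size formula can be applied in a few lines. This sketch reuses the intraocular pressure inputs (difference 5, SD 20, α = 0.05) and assumes a target power of 0.80, which is an illustrative choice rather than a value from the lecture; the function name `z_test_sample_size` is likewise hypothetical. The result is rounded up, since a fractional participant is not possible.

```python
from math import ceil
from statistics import NormalDist


def z_test_sample_size(delta, sigma, alpha, power, two_sided=False):
    """Smallest n giving at least the requested power for a z-test with known sigma."""
    std_normal = NormalDist()
    tail = alpha / 2 if two_sided else alpha   # two-sided: substitute alpha/2
    z_alpha = std_normal.inv_cdf(1 - tail)     # z_{1-alpha}
    z_beta = std_normal.inv_cdf(power)         # z_{1-beta}, since power = 1 - beta
    return ceil(sigma**2 * (z_alpha + z_beta) ** 2 / delta**2)


n_one_sided = z_test_sample_size(delta=5, sigma=20, alpha=0.05, power=0.80)
n_two_sided = z_test_sample_size(delta=5, sigma=20, alpha=0.05, power=0.80, two_sided=True)
print(n_one_sided, n_two_sided)
```

The two-sided design needs more participants than the one-sided design because the critical value is larger when the significance level is split across both tails.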

PLOS Medicine, 2005.

§ From the Economist, Unreliable Research: Trouble at the lab

o https://www.youtube.com/watch?v=aMv8ZNwXTjQ
