
BIOST 514/517: Biostatistics I / Applied Biostatistics I
Lecture 16: Rank-based Tests
December 2-4, 2013

Kathleen Kerr, Ph.D.
Associate Professor of Biostatistics
University of Washington

Lecture Topics

• Mann-Whitney U Test (Wilcoxon rank sum test)
• Wilcoxon signed-rank test
• As with all tests, we need to look at what summary measure they are based on, and what assumptions they make

Sign Test

• The sign test is a test that the median of a population is equal to some specified value.
• It is most often used to test that the median difference in paired observations is zero
  – median difference, not difference in medians
• If the median of a population is m, then half the population is above m and half is below m. Recoding the variable to 1 if it is above m and 0 if it is below m gives n binary observations that will have mean ½ if m is the population median.
• The sign test compares the observed proportion to ½ using the binomial distribution.
• STATA provides signtest for testing zero median difference between paired observations.

Sign Test: Example

• Shoulder pain: Is pain higher or lower at the earliest time (time 1) than at the latest time (time 6)?
• Each person contributes pain scores at time 1 and time 6. Does the difference between them tend to be positive or negative?

    . signtest pain1=pain6

    Sign test

            sign |    observed    expected
    -------------+------------------------
        positive |          15         9.5
        negative |           4         9.5
            zero |          22          22
    -------------+------------------------
             all |          41          41

    One-sided tests:
      Ho: median of pain1 - pain6 = 0 vs.
      Ha: median of pain1 - pain6 > 0
          Pr(#positive >= 15) =
             Binomial(n = 19, x >= 15, p = 0.5) = 0.0096

      Ho: median of pain1 - pain6 = 0 vs.
      Ha: median of pain1 - pain6 < 0
          Pr(#negative >= 4) =
             Binomial(n = 19, x >= 4, p = 0.5) = 0.9978

    Two-sided test:
      Ho: median of pain1 - pain6 = 0 vs.
      Ha: median of pain1 - pain6 != 0
          Pr(#positive >= 15 or #negative >= 15) =
             min(1, 2*Binomial(n = 19, x >= 15, p = 0.5)) = 0.0192
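The binomial tail probabilities in this example can be checked by hand. The following is a minimal Python sketch using only the standard library; the helper name `binom_upper_tail` is an illustrative choice and is not part of STATA. It assumes, as the output indicates, that the 22 zero differences are dropped, leaving n = 19.

```python
from math import comb

def binom_upper_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

n_pos, n_neg = 15, 4          # counts from the signtest output; zeros are dropped
n = n_pos + n_neg             # 19 non-zero differences

p_greater = binom_upper_tail(n, n_pos)       # Ha: median difference > 0
p_less = binom_upper_tail(n, n_neg)          # Ha: median difference < 0
p_two_sided = min(1.0, 2 * min(p_greater, p_less))

print(round(p_greater, 4), round(p_less, 4), round(p_two_sided, 4))
# prints: 0.0096 0.9978 0.0192
```

The three values agree with the one-sided and two-sided p-values reported by signtest.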

Mann-Whitney U test

• The Mann-Whitney U test compares two groups. Unlike the two-sample t-test, the Mann-Whitney U test is based on a bivariate summary rather than the difference in a univariate summary.
• Write Xi (i=1,…,n1) for the first group and Yj (j=1,…,n2) for the second group. The U test is based on the statistic

    U = Σ I(Xi > Yj)

  where the sum is over all possible pairs (i, j).
• In other words, the test is based on the proportion of times an X is greater than a Y.
• The null hypothesis is that the proportion is ½.

Wilcoxon rank-sum test

• Another way to compute the same test involves replacing the data by ranks and computing the sum of ranks in each group.
• This leads to a second name for the test, “Wilcoxon rank-sum test.”
• They are the same test. The only reason to distinguish between the two versions is if you are doing the test by hand and have tables of U or the rank sum.
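To see why the two versions are the same test, the sketch below computes U by counting pairs and again from the rank sum, using the identity U = (rank sum of X) - n1(n1+1)/2. This is an illustrative Python sketch, not STATA's implementation; the data are hypothetical and the function names are made up for this example. It assumes there are no ties.

```python
def mann_whitney_u(x, y):
    """Count the pairs (xi, yj) with xi > yj; assumes no ties."""
    return sum(1 for xi in x for yj in y if xi > yj)

def rank_sum(x, y):
    """Sum of the ranks of x in the pooled sample; assumes no ties."""
    pooled = sorted(x + y)
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    return sum(rank[xi] for xi in x)

# Hypothetical data, chosen only for illustration (no ties).
x = [1.1, 3.4, 5.0, 7.2]
y = [0.5, 2.0, 4.1]

n1, n2 = len(x), len(y)
u = mann_whitney_u(x, y)
w = rank_sum(x, y)

print(u, w, w - n1 * (n1 + 1) // 2)   # prints: 9 19 9
print(u / (n1 * n2))                  # estimated P(X > Y); prints: 0.75
```

Because U and the rank sum differ only by a constant that depends on n1, any test based on one is equivalent to the same test based on the other.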

Issues

• There are two problems with the Mann-Whitney test
  – It is not based on any univariate summary, so it doesn’t lead to confidence intervals for a difference in location.
• It is often incorrectly described as a test for equality of medians.
  – The p-value is usually computed not under the assumption that P(X>Y)=1/2, but under the much stronger assumption that X and Y have exactly the same distribution.

Considerations

• Because of its use of ranks, the test is more sensitive to shifts in all observations and less sensitive to large shifts in a small number of observations.
• If the distributions of X and Y were known to have exactly the same shape, and differ only in location, then the Mann-Whitney U test/Wilcoxon rank sum test would be very useful. But we never know this.

Mann-Whitney U Test: Example

• PBC dataset: bilirubin by presence of edema

    . ranksum bilirubin, by(edema) porder

    Two-sample Wilcoxon rank-sum (Mann-Whitney) test

           edema |      obs    rank sum    expected
    -------------+---------------------------------
               0 |      275       40356     43037.5
               1 |       37        8472      5790.5
    -------------+---------------------------------
        combined |      312       48828       48828

    unadjusted variance   265397.92
    adjustment for ties     -410.96
                          ----------
    adjusted variance     264986.96

    Ho: biliru~n(edema==0) = biliru~n(edema==1)
                 z =  -5.209
        Prob > |z| =   0.0000

    P{biliru~n(edema==0) > biliru~n(edema==1)} = 0.236

• The p-value is <0.0001.
• The porder option asks for P(X>Y) to be given. We see that bilirubin is higher in the no-edema group only 23.6% of the time.
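The summary numbers in the ranksum output can be reproduced from the group sizes and the observed rank sum. The Python sketch below is an illustration under stated assumptions, not STATA's code: it takes the tie-adjusted variance directly from the output rather than recomputing it (that needs the raw data), and it uses the usual normal-approximation formulas for the rank sum. The variable names are chosen for this example.

```python
from math import sqrt

n0, n1 = 275, 37                 # group sizes from the ranksum output
rank_sum_0 = 40356               # observed rank sum for edema == 0
adjusted_variance = 264986.96    # copied from the output (includes the tie adjustment)

n = n0 + n1
expected_0 = n0 * (n + 1) / 2                    # expected rank sum under Ho
unadjusted_variance = n0 * n1 * (n + 1) / 12     # variance ignoring ties
z = (rank_sum_0 - expected_0) / sqrt(adjusted_variance)

# U statistic for group 0, then the proportion of (X, Y) pairs with X > Y.
u0 = rank_sum_0 - n0 * (n0 + 1) / 2
p_order = u0 / (n0 * n1)

print(expected_0, round(unadjusted_variance, 2), round(z, 3), round(p_order, 3))
# prints: 43037.5 265397.92 -5.209 0.236
```

With midranks, the U obtained from the rank sum counts each tied pair as one half, so `p_order` here is P(X > Y) plus half of P(X = Y).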

Signed rank test

• The Wilcoxon signed-rank test adds to the naming confusion.
• It is a test for difference in location between paired samples.
• Like the previous test, it is not based on any univariate summary.
• Compared to the sign test, the signed rank test is more sensitive to changes in extreme values, but it is less sensitive to them than the paired t-test.
• In large samples it has much more restrictive assumptions than either the sign test or paired t-test.
• Write ∆i = Xi - Yi.
• The Wilcoxon signed rank test statistic is based on the proportion of the pairwise sums ∆i + ∆j that are positive.
• It is a test for the median pairwise sum being zero.
• The p-value is usually computed under the additional assumption that the distribution of ∆i is symmetric about zero.
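The link between signed ranks and pairwise sums can be checked numerically. The Python sketch below assumes no zero differences and no tied absolute values, and counts pairs with i ≤ j (so each ∆i is also paired with itself). The data and function names are hypothetical, chosen only for illustration.

```python
def signed_rank_w_plus(d):
    """Sum of the ranks of |d| over positive d; assumes no zeros and no tied |d|."""
    ordered = sorted(d, key=abs)
    return sum(rank for rank, value in enumerate(ordered, start=1) if value > 0)

def positive_pairwise_sums(d):
    """Number of pairs i <= j with d[i] + d[j] > 0."""
    n = len(d)
    return sum(1 for i in range(n) for j in range(i, n) if d[i] + d[j] > 0)

# Hypothetical paired differences, chosen only for illustration.
d = [1.5, -0.5, 2.0, -3.0, 4.0]

w_plus = signed_rank_w_plus(d)
n_pairs = len(d) * (len(d) + 1) // 2
print(w_plus, positive_pairwise_sums(d), n_pairs)   # prints: 10 10 15
```

Here the sum of the positive signed ranks equals the number of positive pairwise sums (10 of 15 pairs), which is why the statistic can be read as a proportion of pairwise sums.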

logrank test

• The logrank test is a rank-based test that has been modified for survival data.
• The logrank test is much more useful than the other rank tests because there are fewer alternatives available in standard software for survival data.

STATA

• PBC dataset, survival by treatment group

    stset time, failure()
    sts test treatment

             failure _d:  censoring
       analysis time _t:  time

    Log-rank test for equality of survivor functions

              |   Events         Events
    treatment |  observed       expected
    ----------+-------------------------
    1         |        65          63.22
    2         |        60          61.78
    ----------+-------------------------
    Total     |       125         125.00

          chi2(1) =       0.10
          Pr>chi2 =     0.7498

STATA

• . sts graph, by(treatment)

[Figure: Kaplan-Meier survival estimates by treatment group (treatment = 1, treatment = 2); x-axis: analysis time, 0 to 5000; y-axis: 0.00 to 1.00]

Self-consistency

• Sometimes when comparing two groups X and Y, all univariate summaries give the same answer: the mean, median, all the quantiles are higher for X than for Y.
• Sometimes they don’t. Then the decision of which is larger depends on the definition of “large.”
• In the ambiguous case, any test based on a univariate summary is (thankfully) consistent with itself.
  – If group A has higher mean cost than B, and group B has higher mean cost than C, you know that A has higher mean cost than C
• This property, which may seem trivial, does not hold for rank-based tests.

Self-consistency: Example

• Suppose there are three illnesses with similar symptoms. Suppose all three take 3 days to recover when untreated.
• There are also treatments A and B, which may be beneficial or harmful depending on which underlying illness the patient actually has.

    Days of illness

    Illness   Proportion   untreated   A   B
    1         40%          3           2   0
    2         20%          3           2   4
    3         40%          3           5   4

• If you can’t tell which illness a patient actually has, which is the best treatment option?
• Base the choice on mean days of illness.
  – B (mean 2.4) is better than untreated (mean 3), which is better than A (mean 3.2)

Self-consistency: Example

• Base the choice on median days of illness.
  – A (median 2) is better than untreated (median 3), which is better than B (median 4)
• Base the choice on maximum days of illness.
  – untreated (maximum 3) is better than B (maximum 4), which is better than A (maximum 5)
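The three rankings (by mean, median, and maximum) can be verified directly from the table of days of illness. This is a small Python sketch; the function names are illustrative, and the median is taken as the smallest value whose cumulative proportion reaches one half, which is an assumption about the definition being used.

```python
# Days of illness by illness type, with the proportion of patients of each type.
proportions = [0.4, 0.2, 0.4]
days = {
    "untreated": [3, 3, 3],
    "A": [2, 2, 5],
    "B": [0, 4, 4],
}

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights))

def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches one half."""
    cumulative = 0.0
    for v, w in sorted(zip(values, weights)):
        cumulative += w
        if cumulative >= 0.5:
            return v

for name, values in days.items():
    print(name,
          round(weighted_mean(values, proportions), 1),
          weighted_median(values, proportions),
          max(values))
# prints:
# untreated 3.0 3 3
# A 3.2 2 5
# B 2.4 4 4
```

Each summary gives a complete, consistent ordering of the three options (fewer days is better), even though the three summaries disagree with one another.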

Self-consistency: Example

• The Wilcoxon rank-sum test leads in circles.
  – Choose A because it has the best chance of beating untreated (60% chance)
  – But then choose B because it has an 80% chance of beating A.
  – But no treatment beats B 60% of the time.
• NOTE: In the previous example there was no uncertainty.
• If rank comparisons cannot help us make decisions when there is no uncertainty, why would we use them in the presence of uncertainty?
• The same sort of problem can happen with other rank tests.
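The circular ordering can be reproduced from the table of days of illness. The Python sketch below assumes that each comparison is made within the same illness (the same patient under two options), which is the reading that gives the 60% and 80% figures quoted above. The function name `chance_of_beating` is illustrative.

```python
proportions = [0.4, 0.2, 0.4]
days = {"untreated": [3, 3, 3], "A": [2, 2, 5], "B": [0, 4, 4]}

def chance_of_beating(first, second):
    """Proportion of patients with fewer days of illness under `first` than under `second`."""
    return sum(p for p, a, b in zip(proportions, days[first], days[second]) if a < b)

for first, second in [("A", "untreated"), ("B", "A"), ("untreated", "B")]:
    print(first, "beats", second, round(chance_of_beating(first, second), 2))
# prints:
# A beats untreated 0.6
# B beats A 0.8
# untreated beats B 0.6
```

Every option is beaten more than half the time by some other option, so pairwise rank comparisons never settle on a best choice.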

Summary

• In large samples of uncensored data there is little need for rank-based tests. Tests and confidence intervals based on summary parameters (means, geometric means, medians, etc.) have clearer interpretations and fewer assumptions.
• In censored survival data, the logrank test and Cox regression models are widely used and there are fewer good alternatives. You should be aware that strange things can happen when an exposure improves survival for some and worsens it for others.
• In very small samples, any statistical analysis needs strong assumptions.