Non-parametric Statistics

Definition and Concepts

Yang Cao, PhD, Associate Professor

Clinical Epidemiology and Biostatistics, School of Medical Sciences, Örebro University. Email: [email protected]

1. Non-parametric statistics


Hypothesis Testing Procedures

Parametric: Z test, t test, one-way ANOVA, …

Nonparametric: Wilcoxon rank sum test, Kruskal-Wallis H test, …

What are Non-Parametric Statistics?

Non-parametric statistics are a special form of statistics which help with a problem occurring in parametric statistics.


What are Parametric Statistics?

Parametric statistics is a branch of statistics that assumes that the sample data come from a type of probability distribution and makes inferences about the parameters of the distribution.

• Normal distribution
• Binomial distribution
• Poisson distribution
• Chi-squared distribution

Assumptions of t-test

Null hypothesis of the two-sample t-test: the means of two normally distributed populations are equal.

In the t-test comparing the means of two independent samples, the following assumptions should be met:
• Each of the two populations being compared should follow a normal distribution.
• If using Student's original definition of the t-test, the two populations being compared should have the same variance.
• The data used to carry out the test should be sampled independently from the two populations being compared. (This is in general not testable from the data.)


Assumptions of ANOVA

One-way analysis of variance (ANOVA)

The results of a one-way ANOVA can be considered reliable as long as the following assumptions are met:
• Response variables are normally distributed (or approximately normally distributed).
• Samples are independent.
• Variances of populations are equal.
• Responses for a given group are independent and identically distributed normal random variables (not a simple random sample (SRS)).
*ANOVA is a relatively robust procedure with respect to violations of the normality assumption.

Why does lack of normality cause problems?

When we calculate the p-value for an inference test, we find the probability that the sample was different due to sampling variability alone. Basically, we are trying to see if a recorded value occurred by chance and chance alone. When we look for a p-value, we are assuming that the means of all samples of the given sample size are normally distributed around the population mean. This is why the test statistic, which is the number of standard deviations that the sample mean is away from the population mean, is able to be used. Therefore, without normality, no p-value can be found.
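In symbols, the test statistic described here is the one-sample z statistic (a standard textbook formula, shown for clarity; it is not spelled out on the original slide):

$z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$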


Calculations can always be derived no matter what the distribution is: the calculations are algebraic properties separating sums of squares. Normality is only needed for inference (the significance tests).

When are Parametric Statistics not useful?

When we do significance tests, we rely on the assumption that the distribution of the means of samples taken follows the t-distribution or the z-distribution, depending on the situation. When this assumption is not true, none of our tests, which are called "parametric statistical inference tests," are reliable.

Some non-normal distributions for which the t-statistic is invalid.


What are Non-Parametric Statistics?

The way in which statisticians deal with this problem of parametric statistics is the field of non-parametric statistics. These are tests that can be done without the assumption of normality, approximate normality, or symmetry. These tests do not require a mean and standard deviation. Since a standard deviation assumes symmetry, it is not useful for many distributions anyway.

What is different about Non-Parametric Statistics?

In non-parametric statistics, one deals with the median rather than the mean. Since a mean can be easily influenced by outliers or skewness, and we are not assuming normality, a mean no longer makes sense. The median is another judge of location, which makes more sense in a non-parametric test. The median is considered the "center" of a distribution.

Sometimes statisticians use what is called "ordinal" data. These data are obtained by taking the raw data and giving each value a rank. These ranks are then used to create test statistics.


There are non-parametric tests which are similar to the parametric tests, but each is slightly different. The following table shows how some of the tests match up.

Parametric Test | Goal for Parametric Test | Non-Parametric Test | Goal for Non-Parametric Test
Two-sample t-test | To see if two samples have identical population means | Mann-Whitney test | To see if two samples have identical population medians
One-sample t-test | To test a hypothesis about the mean of the population a sample was taken from | Wilcoxon signed ranks test | To test a hypothesis about the median of the population a sample was taken from
Chi-squared test for goodness of fit | To see if a sample fits a theoretical distribution, such as the normal curve | Kolmogorov-Smirnov test | To see if a sample could have come from a certain distribution
ANOVA | To see if two or more sample means are significantly different | Kruskal-Wallis test | To see if two or more sample medians are significantly different

Parametric Test Procedures

1. Involve population parameters (e.g., the mean)
2. Have stringent assumptions (e.g., normality)
Examples: Z test, t test, χ² test, F test

Nonparametric Test Procedures
1. Do not involve population parameters
2. Data measured on any scale (ratio or interval, ordinal or nominal)
Example: Wilcoxon rank sum test


Advantages of Nonparametric Tests

1. Robust: used with all scales
2. Easier to compute
3. Make fewer assumptions
4. Need not involve population parameters
5. Results may be as exact as parametric procedures

Disadvantages of Nonparametric Tests

1. May waste information: a parametric model is more efficient if the data permit it.
2. Lose power: there is a greater risk of accepting a false null hypothesis; in other words, the chances of committing a Type II error are considerable.
3. The null hypothesis is somewhat loosely formulated.


2. Normality test

Measurement of normality

Skewness: a normal distribution has skewness = 0.

Kurtosis: a normal distribution has kurtosis = 3.
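In Stata, both statistics are reported by summarize with the detail option (a minimal sketch, using the score variable from the worked example that follows):

. summarize score, detail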


A worked example

A psychologist wished to investigate whether a new relaxation technique for clinical patients was more effective than an existing one. One group of 10 patients suffering high levels of anxiety was taught the new technique, and another group, matched for age, sex and general anxiety level, was taught the old one. The patients were asked to practice the technique every day for a month. At the end of the month a standard questionnaire measuring anxiety (in which a lower score indicates a lower level of anxiety) was given to each participant and the scores were compared.

At first, we would like to check if the data are normally distributed.

Null hypothesis: the data are normally distributed. Alternative hypothesis: the data are not normally distributed.

The data to be tested are stored in the second column.


Graphical Methods

1. Histogram (not suitable for small samples)
2. Quantile-quantile plot (Q-Q plot)
3. Probability–probability plot or percent–percent plot (P-P plot)

Frequentist/numerical tests

1. D'Agostino's K-squared test
2. Jarque–Bera test
3. Anderson–Darling test
4. Cramér–von Mises criterion
5. Shapiro–Wilk test
6. Pearson's chi-squared test
7. Shapiro–Francia test
8. Kolmogorov–Smirnov test
9. Energy test and the ecf tests
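Stata's built-in sktest performs a skewness-and-kurtosis test in the spirit of D'Agostino's K-squared test; a minimal sketch on the example data (assuming the variable is named score, as on the later slides):

. sktest score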


Create a histogram in Stata

• histogram score

• histogram score, by(group)

Panel graph with normal curve

• histogram score, by(group sex) norm


Normal Q-Q Plot

In order to determine normality graphically we can use the output of a normal Quantile-Quantile Plot (Q-Q plot). A Q-Q plot is a plot of the quantiles of the first data set against the quantiles of the second data set. If the data are normally distributed then the data points will be close to the diagonal line. If the data points stray from the line in an obvious non-linear fashion then the data are not normally distributed.

* By a quantile, we mean the point below which a given fraction (or percent) of points lies. That is, the 0.3 (or 30%) quantile is the point at which 30% of the data fall below and 70% fall above that value.

Normal Q-Q plots of samples from normal populations

Normal Q-Q plots of samples from heavy-tailed (leptokurtic) populations

Normal Q-Q plots of samples from skewed populations


To create a normal Q-Q plot in Stata

. qnorm score

. qnorm score if group==1

Normal P-P Plot

A probability–probability plot (or percent–percent plot, P-P plot) plots a variable's cumulative proportions against the cumulative proportions of any of a number of test distributions. A P-P plot is generally used to determine whether the distribution of a variable matches a given distribution. If the selected variable matches the test distribution, the points cluster around a straight line.


Make a normal P-P plot in Stata.

• pnorm score

• pnorm score if group==1

• A Q-Q plot compares the quantiles of a data distribution with the quantiles of a standardized theoretical distribution from a specified family of distributions.

• You should use a Q-Q plot if your objective is to compare the data distribution with a family of distributions that vary only in location and scale, particularly if you want to estimate the location and scale parameters from the plot.

• A P-P plot compares the empirical cumulative distribution function (CDF) of a data set with a specified theoretical cumulative distribution function F(·).

• An advantage of P-P plots is that they are discriminating in regions of high probability density, since in these regions the empirical and theoretical cumulative distributions change more rapidly than in regions of low probability density.


P-P Plots tend to magnify deviations from the normal distribution in the middle.

Q-Q Plots tend to magnify deviations from the normal distribution on the tails.

Generally, P-P plots are better to spot non-normality around the mean, and Q-Q plots to spot non-normality in the tails.

Numerical tests

The Kolmogorov–Smirnov test (K–S test) is a nonparametric test for the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K–S test), or to compare two samples (two-sample K–S test).

The Kolmogorov–Smirnov test can be modified to serve as a goodness-of-fit test. In the special case of testing for normality of the distribution, samples are standardized and compared with a standard normal distribution.

Various studies have found that it is less powerful for testing normality than the Shapiro–Wilk test or the Anderson–Darling test.

KOLMOGOROV, Andrei Nikolaevich; SMIRNOV, Nikolai Vasil'yevich


Conduct K-S test in Stata.

. summarize score

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+----------------------------------------------------------
       score |         20       18.55    5.041668         10         27

. ksmirnov score=normal((score-18.55)/5.041668)

One-sample Kolmogorov-Smirnov test against theoretical distribution normal((score-18.55)/5.041668)

Smaller group         D       P-value
--------------------------------------
score:             0.0829       0.760
Cumulative:       -0.0865       0.741
Combined K-S:      0.0865       0.998

Note: Ties exist in dataset; there are 14 unique values out of 20 observations.

Conduct S-W test in Stata.

. swilk score

Shapiro-Wilk W test for normal data

    Variable |      Obs         W          V          z      Prob>z
-------------+--------------------------------------------------------
       score |       20    0.97457      0.602     -1.023     0.84676

For 2 or more samples, the normality test needs to be performed sample by sample.

. swilk score if group==1

Shapiro-Wilk W test for normal data

    Variable |      Obs         W          V          z      Prob>z
-------------+--------------------------------------------------------
       score |       10    0.95879      0.635     -0.745     0.77201


To create a boxplot in Stata

. graph box score

. graph box score, over(group)

Numerical tests
Advantage: allow objective judgment.
Disadvantage: sometimes not sensitive enough at small sample sizes, or overly sensitive at large sample sizes. As such, some statisticians prefer to use their experience to make a subjective judgment about the data from graphs.

Graphical interpretation
Advantage: allows good judgment when numerical tests might be over- or under-sensitive.
Disadvantage: lacks objectivity. If you do not have a great deal of experience interpreting normality graphically, then it is probably best to rely on the numerical methods.


3. Homogeneity of variance test

Why is homogeneity of variance needed?

Pooled variance:

$S_p^2 = \dfrac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$


Only when the assumption of homogeneity of variances is valid can variances be pooled across groups to yield an estimate of variance that is used in the calculation of the statistic in question. If this assumption is ignored, the results of the statistical test (i.e., t-test and ANOVA) are greatly distorted, leading to incorrect inferences based on the results (i.e., increased Type I error rates leading to invalid inferences).

Test Assumption of Homogeneity of Variance in t-test in Stata

. sdtest score, by(group)

Variance ratio test
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
New tech |      10        15.7    1.468559    4.643993    12.37789    19.02211
Old tech |      10        21.4    1.185093    3.747592    18.71913    24.08087
---------+--------------------------------------------------------------------
combined |      20       18.55    1.127351    5.041668    16.19043    20.90957
------------------------------------------------------------------------------
    ratio = sd(New tech) / sd(Old tech)                         f =     1.5356
Ho: ratio = 1                                 degrees of freedom =       9, 9

   Ha: ratio < 1               Ha: ratio != 1               Ha: ratio > 1
 Pr(F < f) = 0.7335         2*Pr(F > f) = 0.5330          Pr(F > f) = 0.2665

For 3 or more samples, use command: robvar.
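A minimal sketch of the syntax (robvar reports Levene's test statistic W0 together with the Brown-Forsythe variants W50 and W10):

. robvar score, by(group)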


Don't be too quick to switch to using the nonparametric Kruskal-Wallis ANOVA (or the Mann-Whitney test when comparing two groups). While nonparametric tests do not assume normal distributions, the Kruskal-Wallis and Mann-Whitney tests do assume that the shape of the data distribution is the same in each group. So if your groups have very different standard deviations and so are not appropriate for one-way ANOVA, they also should not be analyzed by the Kruskal-Wallis or Mann-Whitney test. Often the best approach is to transform the data. Often transforming to logarithms or reciprocals does the trick, restoring equal variance.

Much current opinion, but by no means all of it, is that, in general, you shouldn't worry a lot about normality and equal variance. Research seems to indicate that most of the parametric (that is, normal-curve-based) inference procedures are fairly well-behaved in the face of moderate departures from both normality and equality of variance. Tests and estimates that are relatively uninfluenced by violations in their assumptions are known as robust procedures, and a substantial literature has developed in the field of robust statistics.


Choosing between methods for checking the assumptions:
• Graphical methods: subjective; suitable for small samples.
• Numerical methods (e.g., Levene's test): objective; suitable for medium or large samples.

Normality | Equal variance | Choice of test
Yes | Yes | Parametric: t-test, ANOVA (a non-parametric test here would lose power and precision)
Yes | No | Non-parametric; or parametric alternatives: t'-test, Brown-Forsythe, Welch
No | No | Non-parametric

Transformation: log, reciprocal, …
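A hedged sketch of the transformation route in Stata (assuming all scores are positive; variable names follow the worked example):

. generate log_score = log(score)
. sdtest log_score, by(group)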

4. Mann-Whitney U test for 2 independent medians

Henry Berthold Mann; Donald Ransom Whitney


What kind of test is the Mann-Whitney U test, and what is it used for?

The Mann-Whitney U test is one of the non-parametric tests of difference and is used to test a null hypothesis that two independent samples of scores could have been drawn from the same population.

What does this mean?

The medians of any two independent samples will always differ to some degree. However, you cannot tell just by inspecting the scores whether the difference is meaningful, or not. On the one hand it could be the result of sampling from a single undifferentiated population. On the other hand, it might be because the two groups really come from different populations. The Mann-Whitney U test tells you whether the difference between the samples is so great as to make it unlikely that the null hypothesis (that they came from the same population) is correct.


When to use it?

• You require a test of difference between two samples of data.
• The samples are independent, i.e., each participant contributes only one value to only one of two groups.
• The values represent measures on either the ordinal or interval scales.
• The population distribution is either unknown or likely to be very non-normal.

How does it work?

A worked example of the Mann-Whitney U Test


1. The test requires only measurement on the ordinal scale and looks at the relative size (magnitude) of the values rather than at the exact differences between them. In other words, it asks only whether one value is greater or less than another.

The null hypothesis is that both samples were drawn from the same population, while the alternative hypothesis is that they came from different populations corresponding to the different relaxation techniques. As this hypothesis is nondirectional (the psychologist could not predict whether the new technique was going to be more or less effective than the old), we use a two-tailed test at a significance level of α = 0.05.


2. In the calculation of the test, the two groups of values are merged and the values are ranked, with the lowest value overall receiving the lowest rank. The ranks in each group are then summed independently. It follows that whichever group has the greater sum of ranks will also necessarily contain most of the higher scores, and the two medians will also be different. The greater the difference between the two sets of values, the greater the difference between their summed rankings and medians.

Rank all the items of data by placing a "1" in the rankings column against the lowest score in both sets of data, a "2" against the next lowest, and so on. Sum the rankings.

Score: 10   11    11    13   15    15    …   26   27
Rank:   1   2.5   2.5    4   5.5   5.5   …   19   20

The sum of all ranks for the 2 samples combined must equal N(N+1)/2. If this equality is not satisfied, you need to check your ranks and calculations. *If some scores are tied (equal), give them the average of the ranks.
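For the worked example, with N = 20 scores in total, the check is:

$1 + 2.5 + 2.5 + \dots + 20 = \dfrac{20 \cdot 21}{2} = 210$

(matching the combined rank sum in the Stata output further below).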


3. This difference is summarized as a value called "U". The value of U is determined by the number of times that a score from one set of data has been given a lower rank than a score from the other set. However, because there are two independently obtained sets of data, these relationships are different for each set of scores, and so there are actually two possible values for "U", called U1 and U2. You require whichever is the smaller of U1 and U2.


$U_1 = R_1 - \dfrac{n_1(n_1 + 1)}{2}$

where n1 is the sample size for sample 1, and R1 is the sum of the ranks in sample 1. Note that it doesn't matter which of the two samples is considered sample 1. An equally valid formula is:

$U_2 = R_2 - \dfrac{n_2(n_2 + 1)}{2}$

The smaller value of U1 and U2 is the one used when consulting significance tables.

4. Then consult a table of critical values of U for the required alpha level (usually α = 0.05) and sample size. If the obtained value for U is greater than the value shown in the table, the null hypothesis should be retained; if less, it may be rejected and the alternative hypothesis should be accepted at that level of significance.


In our example, U is 17.5 (the smaller one of U1 and U2) and smaller than the critical value 23, therefore we reject the null hypothesis and accept the alternative hypothesis for these data, concluding that the difference between the two samples is such that they are unlikely to have been drawn from the same population.
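Using the rank sums that appear in the Stata output below (72.5 and 137.5 for the two groups of n = 10), the arithmetic behind U = 17.5 is:

$U_1 = 72.5 - \dfrac{10 \cdot 11}{2} = 17.5, \qquad U_2 = 137.5 - 55 = 82.5, \qquad U_1 + U_2 = 100 = n_1 n_2$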



Conduct Mann-Whitney U Test in Stata

. ranksum score, by(group)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

       group |      obs    rank sum    expected
-------------+----------------------------------
New techniqu |       10        72.5         105
Old techniqu |       10       137.5         105
-------------+----------------------------------
    combined |       20         210         210

unadjusted variance      175.00
adjustment for ties       -1.05
                        --------
adjusted variance        173.95

Ho: score(group==New technique) = score(group==Old technique) z = -2.464 Prob > |z| = 0.0137

From these data it can be concluded that there is a statistically significant difference between the new and old technique groups' median anxiety scores at the end of the treatments (p = 0.0137). It can further be concluded that the new technique elicited statistically significantly lower anxiety scores than the old technique.


5. Wilcoxon test for 2 related medians

Frank Wilcoxon


The Wilcoxon signed ranks test is used to test for the median difference in paired data. It requires that the differences are approximately symmetric and that the data are measured on an ordinal, interval, or ratio scale. When the assumptions for the Wilcoxon signed ranks test are met but the assumptions of the t test are violated, the Wilcoxon signed ranks test is usually more powerful in detecting a difference between the two populations. Even under conditions appropriate to the paired t test, the Wilcoxon signed ranks test is almost as powerful.

Example: The table below shows the hours of relief provided by two analgesic drugs in 12 patients suffering from arthritis. Is there any evidence that one drug provides longer relief than the other?


Solution: 1. In this case our null hypothesis is that the median difference is zero.

2. Our actual differences (Drug B − Drug A) are: +1.5, +2.1, +0.3, −0.2, +2.6, −0.1, +1.8, −0.6, +1.5, +2.0, +2.3, +12.4. Our actual median difference is 1.65 hours.

3. Ranking the absolute values of differences and affixing a sign to each rank:

4. Calculating W+ and W− gives:
W− = 1 + 2 + 4 = 7
W+ = 3 + 5.5 + 5.5 + 7 + 8 + 9 + 10 + 11 + 12 = 71
W = min(W−, W+) = 7
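As a check, the two sums together must equal the total of ranks 1 through 12:

$W_- + W_+ = 7 + 71 = 78 = \dfrac{12 \cdot 13}{2}$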

5. Then consult a table of critical values of W for the required alpha level (usually α = 0.05) and the number of differences (NOT the sample size). If the obtained value for W is greater than the value shown in the table, the null hypothesis should be retained; if less, it may be rejected and the alternative hypothesis accepted at that level of significance.


Here, we got W = 7, which is smaller than the critical value 13, therefore we reject the null hypothesis and accept the alternative hypothesis. There is strong evidence that Drug B provides more relief than Drug A.

Wilcoxon Paired Signed-Rank Test in Stata

. signrank Drug_a = Drug_b

Wilcoxon signed-rank test

        sign |      obs    sum ranks    expected
-------------+-----------------------------------
    positive |        3            7          39
    negative |        9           71          39
        zero |        0            0           0
-------------+-----------------------------------
         all |       12           78          78

unadjusted variance      162.50
adjustment for ties       -0.13
adjustment for zeros       0.00
                        --------
adjusted variance        162.38

Ho: Drug_a = Drug_b
             z = -2.511
    Prob > |z| = 0.0120


6. Kruskal-Wallis test for >2 independent medians

William Kruskal; W. Allen Wallis

Step by step example of the Kruskal-Wallis test

Does physical exercise alleviate depression? We find some depressed people and check that they are all equivalently depressed to begin with. Then we allocate each person randomly to one of three groups: no exercise; 20 minutes of jogging per day; or 60 minutes of jogging per day. At the end of a month, we ask each participant to rate how depressed they now feel, on a Likert scale that runs from 1 ("totally miserable") through to 100 ("ecstatically happy").


Step 1: Rank all of the scores, ignoring which group they belong to. The procedure for ranking is as follows: the lowest score gets the lowest rank. If two or more scores are the same then they are "tied"; "tied" scores get the average of the ranks that they would have obtained, had they not been tied.

Step 2: Find "Tc", the total of the ranks for each group. Just add together all of the ranks for each group in turn. Here, Tc1 (the rank total for the "no exercise" group) is 76.5, Tc2 (the rank total for the "20 minutes" group) is 79.5, and Tc3 (the rank total for the "60 minutes" group) is 144.


Step 3: Find "H":

$H = \dfrac{12}{N(N+1)} \sum_{c} \dfrac{T_c^2}{n_c} - 3(N+1)$

N is the total number of participants (all groups combined), Tc is the rank total for each group, and nc is the number of participants in each group. For our data:

$H = \dfrac{12}{24 \cdot 25}\left(\dfrac{76.5^2}{8} + \dfrac{79.5^2}{8} + \dfrac{144^2}{8}\right) - 3 \cdot 25 = 0.02 \cdot 4113.56 - 75 \approx 7.27$


Step 4: The degrees of freedom equal the number of groups minus one. Here we have three groups, and so we have 2 d.f.

Step 5: Assessing the significance of H depends on the number of participants and the number of groups. If you have three groups, with five or fewer participants in each group, then you need to use the special table for small sample sizes. If you have more than five participants per group, then treat H as Chi-Square. H is statistically significant if it is equal to or larger than the critical value of Chi-Square for your particular d.f.
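In Stata, the corresponding upper-tail probability can be checked directly with the built-in chi2tail() function (a quick sketch, assuming H = 7.27 and 2 d.f.):

. display chi2tail(2, 7.27)

This returns approximately 0.0264, in line with the kwallis output shown later.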


Here, we have eight participants per group, and so we treat H as chi-square. H is 7.27, with 2 d.f. The relevant critical values of chi-square for 2 d.f. are 5.99 at α = 0.05, 9.21 at α = 0.01, and 13.82 at α = 0.001.


Look along the row that corresponds to your number of degrees of freedom and compare our obtained value of H to each of the critical values in that row of the table. Here, our obtained value of 7.27 is larger than 5.99 with 2 degrees of freedom. This tells us that our value of H will occur by chance with a probability of less than 0.05.

Step 6: We would conclude that there is a difference of some kind between our three groups.

We could write this up as follows: "A Kruskal-Wallis test revealed that there was a significant effect of exercise on depression levels (H(2) = 7.27, p < .05). Inspection of the group medians suggests that compared to the 'no exercise' control condition, depression was significantly reduced by 60 minutes of daily exercise, but not by 20 minutes of exercise." (NB: this should be confirmed by post-hoc tests.)


Test Procedure in Stata

. kwallis DepScale, by(Group)

Kruskal-Wallis equality-of-populations rank test

+----------------------------------+
| Group    | Obs |       Rank Sum  |
|----------+-----+-----------------|
| No exerc |   8 |           76.50 |
| Jogging  |   8 |           79.50 |
| Jogging  |   8 |          144.00 |
+----------------------------------+

(The two "Jogging" rows are the 20-minute and 60-minute groups; the labels are truncated in the Stata display.)

chi-squared = 7.271 with 2 d.f. probability = 0.0264

chi-squared with ties = 7.290 with 2 d.f. probability = 0.0261

Reporting the Output of the Kruskal-Wallis H Test

. graph box DepScale, over(Group)

In our example, we can report that there was a statistically significant difference between the different exercise groups (H(2) = 7.27, P = 0.026), with a median of 40.5 for no exercise, 42.5 for jogging 20 minutes, and 57.5 for jogging 60 minutes.


It is important to note that the Kruskal-Wallis H test is an omnibus test like its parametric alternative; that is, it tells you whether there are overall differences but does not pinpoint which groups in particular differ from each other. To do this you need to conduct post-hoc tests / pairwise multiple comparisons.


Pairwise Multiple Comparisons

To examine where the differences actually occur, you need to run separate Mann-Whitney U tests on the different combinations of groups. So, in this example, you would compare the following combinations:

Test 1: no exercise vs. jogging for 20 minutes
Test 2: no exercise vs. jogging for 60 minutes
Test 3: jogging for 20 minutes vs. jogging for 60 minutes

You need to use a Bonferroni adjustment on the results you get from the Mann-Whitney U tests, as you are making multiple comparisons, which makes it more likely that you will declare a result significant when you should not (a Type I error). Luckily, the Bonferroni adjustment is very easy to calculate: simply take the significance level you were initially using (in this case 0.05) and divide it by the number of tests you are running.

With 3 tests, we have to divide our α-level by 3: 0.05/3 = 0.0167. So we are doing our post-hoc tests at this more rigorous level. If a p-value is larger than 0.0167, then we do not have a statistically significant result.
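A minimal sketch of the three pairwise comparisons in Stata (assuming Group is coded 1 = no exercise, 2 = jogging 20 minutes, 3 = jogging 60 minutes; this coding is an assumption, not given in the original slides):

. ranksum DepScale if Group != 3, by(Group)   // Test 1: no exercise vs. 20 minutes
. ranksum DepScale if Group != 2, by(Group)   // Test 2: no exercise vs. 60 minutes
. ranksum DepScale if Group != 1, by(Group)   // Test 3: 20 minutes vs. 60 minutes

Each resulting p-value is then judged against the Bonferroni-adjusted level of 0.0167.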


7. Differences between several related groups: Friedman's ANOVA

Milton Friedman

• Friedman's ANOVA is the non-parametric analogue to a repeated measures ANOVA where the same subjects have been subjected to various conditions.
• Example: testing the effect of a new diet called 'Andikins diet' on n = 10 women. Their weight (in kg) was measured 3 times: at the start, after month 1, and after month 2.
• Would they lose weight in the course of the diet?


Steps of Friedman's ANOVA

Subjects' weights on each of the 3 dates are listed in separate columns. Then ranks for the 3 dates within each woman are determined and listed in separate columns, and the ranks are summed up for each date (Ri). The 3 scores are always ranked within each woman: the smallest gets 1, the next 2, and the biggest 3.

Diet data (weight in kg, with within-person ranks):

Person |  Start | Month 1 | Month 2 | Start (rank) | Month 1 (rank) | Month 2 (rank)
-------+--------+---------+---------+--------------+----------------+---------------
   1   |  63.75 |   65.38 |   81.34 |       1      |        2       |        3
   2   |  62.98 |   66.24 |   69.31 |       1      |        2       |        3
   3   |  65.98 |   67.70 |   77.89 |       1      |        2       |        3
   4   | 107.27 |  102.72 |   91.33 |       3      |        2       |        1
   5   |  66.58 |   69.45 |   72.87 |       1      |        2       |        3
   6   | 120.46 |  119.96 |  114.26 |       3      |        2       |        1
   7   |  62.01 |   66.09 |   68.01 |       1      |        2       |        3
   8   |  71.87 |   73.62 |   55.43 |       2      |        3       |        1
   9   |  83.01 |   75.81 |   71.63 |       3      |        2       |        1
  10   |  76.62 |   67.66 |   68.60 |       3      |        1       |        2
-------+--------+---------+---------+--------------+----------------+---------------
  Ri   |        |         |         |      19      |       20       |       21

The Fr test statistic

From the sum of ranks for each group, the test statistic Fr is derived:

$F_r = \dfrac{12}{N k (k+1)} \sum_{i=1}^{k} R_i^2 - 3N(k+1)$

$F_r = \dfrac{12}{10 \cdot 3 \cdot (3+1)}\,(19^2 + 20^2 + 21^2) - 3 \cdot 10 \cdot (3+1) = \dfrac{12}{120}(361 + 400 + 441) - 120 = 0.1 \cdot 1202 - 120 = 120.2 - 120 = 0.2$


The probability distribution of Fr can be approximated by that of a chi-square distribution. If N or k is small, the approximation to chi-square becomes poor and the p-value should be obtained from tables of Fr specially prepared for the Friedman test. If the p-value is significant, appropriate post-hoc multiple comparison tests would be performed.

Here, Fr has df = k − 1 = 2, where k is the number of groups. The statistic 0.2 is smaller than χ²(2, 0.05) = 5.99. There was no statistically significant difference among the three time points.

Friedman Test Procedure in Stata

1. Install the user-written friedman command first!

2. Transpose the data such that subjects are the columns and the variables are the rows!

. xpose, clear
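The friedman command is user-written, so it must be located and installed before use; one way to do this (a sketch, not from the original slides):

. findit friedman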


. friedman v1-v10

Friedman = 0.2000
Kendall  = 0.0100
P-value  = 0.9048

Reporting the Output of the Friedman Test (without post-hoc tests)

You can report the Friedman test result as follows: there was no statistically significant difference in weight across the three time points, χ²(2) = 0.200, P = 0.905. You could also include the median values for each of the related groups. Note that even if the test had been significant, you would only know that there were differences somewhere between the related groups; you would not know exactly where those differences lie.


Again:

While the test tells us whether there are overall differences, it doesn't tell us which groups in particular differ from each other. To know this we need to run post-hoc tests.

Remember though, that if your Friedman Test result was not statistically significant then you should not run post-hoc tests.
