
Hypothesis testing, part 2

With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal

1 CATEGORICAL IV, NUMERIC DV

2 Independent samples, one IV

# Conditions | Normal/Parametric | Non-parametric
Exactly 2    | T-test            | Mann-Whitney U, bootstrap
2+           | One-way ANOVA     | Kruskal-Wallis, bootstrap

3 Is your data normal?

• Skewness: asymmetry
• Kurtosis: “peakedness” rel. to normal
  – Both: within ±2 SE of the statistic is OK
• Or use Shapiro-Wilk (null = normal)
• Or look at Q-Q plot
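
A minimal sketch (not from the slides) of these checks in Python with scipy, on simulated data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical sample of a numeric DV
x = np.random.default_rng(1).normal(loc=10, scale=2, size=40)

print("skewness:", stats.skew(x))
print("excess kurtosis:", stats.kurtosis(x))   # 0 for a normal distribution

# Shapiro-Wilk: null hypothesis = data are normal
w, p = stats.shapiro(x)
print("Shapiro-Wilk p =", p)

# Q-Q (quantile-normal) plot
stats.probplot(x, dist="norm", plot=plt)
plt.show()
```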

4 T-test

• Already talked about
• Assumptions: normality, equal variance, independent samples
  – Can use Levene’s test to check the equal-variance assumption
• Post-test: check residuals for assumption fit
  – For a t-test this is the same pre or post
  – For other tests you check residual vs. fit post
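
A small illustration with made-up data for two conditions: Levene’s test for equal variances, then the independent-samples t-test in scipy:

```python
from scipy import stats

# Hypothetical measurements from two independent conditions
a = [12.1, 13.4, 11.8, 14.0, 12.6, 13.1]
b = [10.9, 11.5, 12.2, 10.4, 11.8, 11.1]

# Levene's test for the equal-variance assumption (null = equal variances)
print(stats.levene(a, b))

# Independent-samples t-test; set equal_var=False for Welch's version
print(stats.ttest_ind(a, b, equal_var=True))
```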

5 One-way ANOVA

• H0: m1 = m2 = m3
• H1: at least one doesn’t match
• NOT H1: m1 != m2 != m3
• Assumptions: normality, common variance, independent errors
• Intuition: F statistic
  – Variance between / variance within
  – Under the null, F should be near 1; F >> 1 rejects the null

6 One-way ANOVA

• F = MSb / MSw
• MSw = sum over conditions [ sum within condition [ (observation − condition mean)^2 ] ] / dfw
  – dfw = N − k, where k = number of conditions
  – Sum over all conditions; sum per condition
• MSb = sum [ (condition mean − grand mean)^2 ] / dfb
  – dfb = k − 1
  – Every observation goes in the sum
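
A sketch of that F computation by hand on hypothetical data, checked against scipy’s f_oneway:

```python
import numpy as np
from scipy import stats

# Hypothetical data, k = 3 conditions
groups = [np.array([10.0, 12, 11, 13]),
          np.array([14.0, 15, 13, 16]),
          np.array([9.0, 10, 8, 11])]

N = sum(len(g) for g in groups)
k = len(groups)
grand_mean = np.concatenate(groups).mean()

# MSw: within-condition squared deviations / dfw, dfw = N - k
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_within = ss_within / (N - k)

# MSb: each observation contributes (its condition mean - grand mean)^2, / dfb = k - 1
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (k - 1)

F = ms_between / ms_within
print("F by hand:", F)
print("scipy:", stats.f_oneway(*groups))   # should match
```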

7 (example from Vibha Sazawal)

8–9 F-distribution
[Figures: the F-distribution under the null, with the rejection region marked]

10 Now what? (Contrasts)

• So we rejected the null. What did we learn?
  – What *didn’t* we learn?
  – At least one is different ... Which? All?
  – This is called an “omnibus test”
• To answer our actual research question, we usually need pairwise contrasts

11 The trouble with contrasts

• Contrasts mess with your Type I bounds
  – One test: 95% confident
  – Three tests: 85.7% confident
  – 5 conditions, all pairs: 4 + 3 + 2 + 1 = 10 tests: 59.9%
  – UH OH
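
The percentages above follow from (1 − alpha)^m for m independent tests; a quick check:

```python
# Overall "confidence" (chance of no false positive) for m independent tests
alpha = 0.05
for m in (1, 3, 10):
    print(f"{m:2d} tests -> overall confidence {(1 - alpha) ** m:.3f}")
# 1 -> 0.950, 3 -> 0.857, 10 -> 0.599
```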

12 Planned vs. post hoc

• Planned: You have a theory.
  – Really, no cheating
  – You get n−1 pairwise comparisons for free
  – In theory, should not be control vs. all, but prob. OK
  – NO COMPARISONS unless omnibus passes
• Post hoc
  – Anything unplanned
  – More than n−1
  – Requires correction!
  – Doesn’t necessarily require omnibus first

13 Correction

• Adjust {p-values, alpha} to compensate for multiple testing post hoc
• Bonferroni (most conservative)
  – Assume all possible pairs: m = k(k−1)/2 (combinations)
  – alpha_c = alpha / m
  – Once you have looked, the implication is you did all the comparisons implicitly!
• Holm-Bonferroni is less conservative
  – Stepwise adjustment of alpha as you go

• Dunnett specifically for all vs. control; others exist
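
A sketch of Bonferroni and Holm-Bonferroni adjustment with statsmodels’ multipletests, using hypothetical post-hoc p-values:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from post-hoc pairwise comparisons
pvals = [0.003, 0.021, 0.049, 0.30]

for method in ("bonferroni", "holm"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, p_adj.round(3), reject)
```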

14 Independent samples, one IV

# Conditions | Normal/Parametric | Non-parametric
Exactly 2    | T-test            | Mann-Whitney U, bootstrap
2+           | One-way ANOVA     | Kruskal-Wallis, bootstrap

15 Non-parametrics: MWU and K-W

• Good for non-normal data, Likert data (ordinal, not actually numeric)
• Assumptions: independent, at least ordinal
• Null: P(X > Y) = P(Y > X) where X, Y are observations from the 2 distributions (MWU)
  – If we assume the same continuous distribution shape, this can be seen as comparing medians

16 MWU and K-W continued

• Essentially: rank order all data (both conditions)
  – Total the ranks for condition 1, compare to “expected”
  – Various procedures to correct for ties
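
A minimal example of both tests in scipy, with made-up ordinal (Likert-style) responses:

```python
from scipy import stats

# Hypothetical ordinal responses for two (then three) conditions
c1 = [1, 2, 2, 3, 4, 2, 3]
c2 = [3, 4, 4, 5, 3, 4, 5]
c3 = [2, 3, 3, 4, 2, 3, 4]

print(stats.mannwhitneyu(c1, c2, alternative="two-sided"))  # exactly 2 conditions
print(stats.kruskal(c1, c2, c3))                            # 2+ conditions
```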

17 Bootstrap

• Resampling technique(s)
• Intuition:
  – Create a “null” distribution by e.g. subtracting means so mA = mB = 0
    • Now you have shifted samples A-hat and B-hat
  – Combine these to make a null distribution
  – Draw a sample of size N, with replacement
    • Do it 1000 (or 10k) times
  – Use this to determine the critical value (alpha = 0.05)
  – Compare this critical value to your real data for the test
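
A sketch of this procedure on hypothetical data (shift each sample to mean 0, pool, resample, compare the observed statistic to the bootstrap critical value); the difference in means is used as the test statistic here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data for two independent conditions
a = np.array([5.1, 6.3, 5.8, 7.2, 6.0, 5.5])
b = np.array([6.9, 7.4, 8.1, 6.6, 7.8, 7.0])

observed = a.mean() - b.mean()

# Shift each sample so both means are 0 (builds the "null" world)
a_hat, b_hat = a - a.mean(), b - b.mean()
pooled = np.concatenate([a_hat, b_hat])

# Resample with replacement many times and recompute the statistic
diffs = np.empty(10_000)
for i in range(10_000):
    ra = rng.choice(pooled, size=len(a), replace=True)
    rb = rng.choice(pooled, size=len(b), replace=True)
    diffs[i] = ra.mean() - rb.mean()

# Two-sided critical value at alpha = 0.05
crit = np.quantile(np.abs(diffs), 0.95)
print("observed diff:", observed, "critical:", crit,
      "reject H0:", abs(observed) > crit)
```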

18 Paired samples, one IV

# Conditions | Normal/Parametric | Non-parametric
Exactly 2    | Paired T-test     | Wilcoxon signed-rank
2+           | Two-way ANOVA w/ subject as random factor; mixed models (later) | Friedman

19 Paired T-test

• Two samples per participant/item
• Test subtracts them
• Then uses a one-sample T-test with H0: m = 0 and H1: m != 0
• Regular T-test assumptions, plus: does subtraction make sense here?
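
A small illustration with made-up before/after scores, showing that the paired t-test matches a one-sample t-test on the differences:

```python
from scipy import stats

# Hypothetical before/after scores for the same participants
before = [20.1, 18.4, 22.0, 19.5, 21.3, 20.7]
after  = [21.0, 19.2, 23.1, 19.9, 22.0, 21.5]

# Paired t-test
print(stats.ttest_rel(before, after))

# Equivalent: one-sample t-test on the differences against 0
diffs = [a - b for a, b in zip(after, before)]
print(stats.ttest_1samp(diffs, popmean=0))
```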

20 Wilcoxon S.R. / Friedman

• H0: difference between pairs is symmetric around 0
• H1: … or not
• Excludes no-change items
• Essentially: rank by absolute difference; compare signs × ranks
• (Friedman = 3+ generalization)
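
A minimal scipy example with hypothetical paired ratings:

```python
from scipy import stats

# Hypothetical paired scores per participant under 2 (then 3) conditions
c1 = [4, 3, 4, 2, 4, 3, 2, 4]
c2 = [5, 4, 5, 3, 5, 5, 4, 5]
c3 = [3, 2, 3, 1, 3, 2, 1, 3]

print(stats.wilcoxon(c1, c2))               # 2 paired conditions
print(stats.friedmanchisquare(c1, c2, c3))  # 3+ paired conditions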

21 SIMPLE: One numeric IV, numeric DV

22 Simple linear regression

• E(Y|x) = b0 + b1x … looks at populations
  – Population mean at this value of x
• Key H0: b1 = 0 (vs. H1: b1 != 0)
  – b0 usually not important for significance (obviously important in model fit)
• b1: slope → change in Y per unit X
• Best fit: least squares, or maximum likelihood
  – LSq: minimize the sum of squared residuals
  – ML: maximize the probability of seeing this data under this model
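
A sketch of a least-squares fit on simulated data with scipy’s linregress:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical data: Y roughly linear in X plus noise
x = np.linspace(0, 10, 30)
y = 2.0 + 0.8 * x + rng.normal(scale=1.0, size=x.size)

res = stats.linregress(x, y)   # least-squares fit
print("b0 =", res.intercept, " b1 =", res.slope)
print("p-value for H0: b1 = 0 ->", res.pvalue)
print("R^2 =", res.rvalue ** 2)
```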

23 Assumptions, caveats

• Assumes:
  – linearity in Y ~ X
  – normally distributed error for each x, with constant variance at all x
  – Error measuring X is small compared to var. Y (fixed X)
• Independent errors!
  – Serial correlation, data that is grouped, etc. (later)
• Don’t interpret widely outside available x vals
• Can transform for linearity!

  – log(y), sqrt(y), 1/y, y^2

24 Assumption/residual checking

• Before: use a scatterplot to check plausible linearity
• After: residual vs. fit
  – Residual on Y vs. predicted on X
  – Should be relatively evenly distributed around 0 (linear)
  – Should have relatively even vertical spread (equal variance)
• After: quantile-normal plot of residuals
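
A sketch of the residual-vs-fit and quantile-normal checks on simulated data (matplotlib for the plots):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)
y = 1.0 + 0.5 * x + rng.normal(scale=0.8, size=x.size)   # hypothetical data

res = stats.linregress(x, y)
fitted = res.intercept + res.slope * x
residuals = y - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(fitted, residuals)     # residual vs. fit: look for an even band around 0
ax1.axhline(0, color="gray")
ax1.set(xlabel="fitted", ylabel="residual")
stats.probplot(residuals, dist="norm", plot=ax2)   # quantile-normal of residuals
plt.tight_layout()
plt.show()
```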

25 Model interpretation

• Interpret b1, interpret the p-value
• CI: if it crosses 0, it’s not significant
• R^2: fraction of total variation accounted for
  – Intuitively: explained variance / total variance
  – Explained = var(Y) − residual errors
• f^2 = R^2 / (1 − R^2); small/medium/large: 0.02, 0.15, 0.35 (Cohen)
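
A quick worked example of that conversion, with a hypothetical R^2:

```python
# Cohen's f^2 from R^2; small/medium/large thresholds ~ 0.02 / 0.15 / 0.35
r2 = 0.25                      # hypothetical R^2 from a fitted model
f2 = r2 / (1 - r2)
print(round(f2, 3))            # 0.333 -> a medium-to-large effect
```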

26 Robustness

• Brittle to violations of linearity and of independent errors
• Somewhat brittle to the fixed-X assumption
• Fairly robust to violations of equal variance
• Quite robust to violations of normality

27 CATEGORICAL OUTCOMES

28 One Cat. IV, Cat. DV, independent

• Contingency tables: how many people in each combination of categories

29 Chi-square test of independence

• H0: distribution of Var1 is the same at every level of Var2 (and vice versa)
  – Null distribution approaches X^2 as sample size grows
  – Heuristic: no expected cells < 5
  – Can use Fisher’s exact test (FET) instead

• Intuition:
  – Sum over all cells: (observed − expected)^2 / expected
  – Expected: marginal % × count in the other margin
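
A minimal example with a hypothetical contingency table in scipy:

```python
import numpy as np
from scipy import stats

# Hypothetical 2x3 contingency table: counts per combination of categories
table = np.array([[30, 15, 25],
                  [20, 25, 15]])

chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p, dof)
print(expected)          # check the heuristic: no expected cells < 5

# For a 2x2 table with small counts, Fisher's exact test instead:
print(stats.fisher_exact([[8, 2], [1, 9]]))
```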

30 Paired 2x2 tables

• Use McNemar’s test
  – Contingency table: matches and mismatches for each option
• H0: marginals are the same

           | Cond1: Yes | Cond1: No |
Cond2: Yes | a          | b         | a + b
Cond2: No  | c          | d         | c + d
           | a + c      | b + d     | N

• Essentially a X^2 test on the agreement

  – Test stat: (b − c)^2 / (b + c)
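
A sketch using statsmodels’ mcnemar on a hypothetical paired table; exact=False, correction=False gives the (b − c)^2 / (b + c) statistic above:

```python
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical paired 2x2 table: [[a, b], [c, d]]
table = [[30, 12],
         [5, 20]]

print(mcnemar(table, exact=False, correction=False))
```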

31 Paired, continued

• Cochran’s Q: extended for more than two conditions
• Other similar extensions for related tasks

32 Critiques

• Choose a paper that has one (or more) empirical experiments as a central contribution
  – Doesn’t have to be human subjects, but can be
  – Does have to have enough description of the experiment
• 10–12 minute presentation
• Briefly: research questions, necessary background
• Main: describe and critique methods
  – Experimental design, data collection, analysis
  – Good, bad, ugly, missing
• Briefly, results?

33 Logistic regression (logit)

• Numeric IV, binary DV (or ordinal)

• log( E(Y) / (1 − E(Y)) ) = log( Pr(Y=1) / Pr(Y=0) ) = b0 + b1x
• Log odds of success = linear function
  – Odds: 0 to inf., 1 is the middle
  – e.g.: odds = 5 = 5:1 … five successes for every one failure
  – Log odds: −inf to inf w/ 0 in the middle: good for regression
• Modeled as a binomial distribution
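
A sketch of fitting a logistic regression with statsmodels on simulated data (the true b0 and b1 below are made up):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)

# Hypothetical data: binary outcome whose log-odds rise with x
x = rng.uniform(0, 10, 200)
p = 1 / (1 + np.exp(-(-3 + 0.6 * x)))     # true b0 = -3, b1 = 0.6
y = rng.binomial(1, p)

model = sm.Logit(y, sm.add_constant(x)).fit()
print(model.params)            # b0, b1 on the log-odds scale
print(np.exp(model.params))    # exponentiate for odds ratios
```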

34 Interpreting logistic regression

• Take exp(coef) to get interpretable odds.

• For each unit increase in x, the odds are multiplied by exp(b1)
  – Note that this can make small coefficients important!
• Use e.g. the Hosmer-Lemeshow test for goodness of fit
  – null = data fit the model
  – But not a lot of power!

35 MULTIVARIATE

36 Multiple regression

• Linear/logistic regression with more variables!
  – At least one numeric, 0+ categorical
• Still: fixed x, normal errors w/ equal variance, independent errors (linear)
• Linear relationship in E(Y) and one x, when other inputs are held constant
  – Effects of each x are independent!
• Still check q-n of residuals, residual vs. fit
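
A minimal statsmodels example with two made-up numeric covariates:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)

# Hypothetical dataset with two numeric covariates
n = 100
df = pd.DataFrame({"x1": rng.uniform(0, 10, n),
                   "x2": rng.normal(size=n)})
df["y"] = 1.0 + 0.5 * df["x1"] - 1.2 * df["x2"] + rng.normal(scale=1.0, size=n)

fit = smf.ols("y ~ x1 + x2", data=df).fit()
print(fit.summary())   # coefficients, p-values, R^2

# Residual checks as before: fit.fittedvalues vs. fit.resid
```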

37 Model selection

• Which covariates to keep? (more on this in a bit)

38 Adding categorical vars

• Indicator variables (everything is 0 or 1)
• Need one fewer indicator than conditions
  – One condition is true; or none are true (baseline)
  – Coefs are *relative to baseline*!
• Model selection: keep all or none for one factor

• Called “ANCOVA” when there is at least one numeric and one categorical covariate
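
A sketch of indicator (dummy) coding via the formula interface, with a hypothetical three-condition factor; C(cond) builds the indicators and the first level becomes the baseline:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)

n = 90
df = pd.DataFrame({
    "x": rng.uniform(0, 10, n),
    "cond": rng.choice(["A", "B", "C"], n),   # 3 conditions -> 2 indicators
})
effect = df["cond"].map({"A": 0.0, "B": 1.5, "C": 3.0})   # made-up true effects
df["y"] = 2.0 + 0.4 * df["x"] + effect + rng.normal(scale=1.0, size=n)

fit = smf.ols("y ~ x + C(cond)", data=df).fit()
print(fit.params)   # coefficients for B and C are *relative to the baseline A*
```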

39 Interaction

• What if your covariates *aren’t* independent?

• E(Y) = b0 + b1x1 + b2x2 + b12x1x2
  – Slope for x1 is different for each value of x2
• Superadditive: all in same direction, interaction makes effects stronger
• Subadditive: interaction is in opposite direction
• For indicator vars, all or none
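
A sketch with a simulated interaction; in the formula, x1 * x2 expands to both main effects plus the x1:x2 term:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)

n = 120
df = pd.DataFrame({"x1": rng.uniform(0, 10, n),
                   "x2": rng.uniform(0, 10, n)})
# Hypothetical superadditive interaction: the slope of x1 grows with x2
df["y"] = (1.0 + 0.3 * df["x1"] + 0.2 * df["x2"]
           + 0.1 * df["x1"] * df["x2"] + rng.normal(size=n))

fit = smf.ols("y ~ x1 * x2", data=df).fit()
print(fit.params)
```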

40 Model selection!

• Which covariates to keep?
• From theory
• Keep interaction only if it’s significant?
  – If you keep an interaction, you should keep the corresponding mains
• “Adjusted” R^2?
  – Regular R^2 is always higher w/ more covariates
• BIC and AIC
  – Take model likelihood and penalize for more params
  – Absolute value not interpretable; lower is better
• All combinations? Stepwise?
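
A small illustration of comparing candidate models by adjusted R^2, AIC, and BIC on simulated data (x2 is deliberately irrelevant):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)

n = 150
df = pd.DataFrame({"x1": rng.uniform(0, 10, n),
                   "x2": rng.normal(size=n)})
df["y"] = 1.0 + 0.5 * df["x1"] + rng.normal(size=n)   # x2 has no real effect

for formula in ("y ~ x1", "y ~ x1 + x2", "y ~ x1 * x2"):
    fit = smf.ols(formula, data=df).fit()
    print(f"{formula:12s} adj.R2={fit.rsquared_adj:.3f} "
          f"AIC={fit.aic:.1f} BIC={fit.bic:.1f}")   # lower AIC/BIC is better
```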

41 THINGS WE ARE ONLY GOING TO MENTION BRIEFLY
• Know they exist; look them up if relevant

42 Multi-way ANOVA

• >1 categorical IVs, 1 numeric DV
• Normality, equal variance, independent errors
• With interaction: every combo of factor levels has its own population mean
• Without interaction (additive): the effect of changing one variable is consistent at all fixed values of the others
• Works basically like standard ANOVA, etc.

43 Mixed models regression

• Explicitly model correlations in the data
• Fixed effects: affect the outcome for everyone
• Random effects: deviations per data item that we don’t want to model individually
• Simplest example: repeated measures
  – Y ~ b0 + b1x1 + b2x2 + … + random ID intercept
  – Each participant has their own intercept adjustment
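
A sketch of a random-intercept model with statsmodels’ mixedlm on simulated repeated-measures data (participant counts and effect sizes are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)

# Hypothetical repeated-measures data: 20 participants x 6 trials each
n_subj, n_trials = 20, 6
subj = np.repeat(np.arange(n_subj), n_trials)
x = rng.uniform(0, 10, n_subj * n_trials)
subj_intercept = rng.normal(scale=2.0, size=n_subj)[subj]   # per-person shift
y = 3.0 + 0.5 * x + subj_intercept + rng.normal(size=n_subj * n_trials)

df = pd.DataFrame({"y": y, "x": x, "participant": subj})

# Random intercept per participant; x is a fixed effect
fit = smf.mixedlm("y ~ x", data=df, groups=df["participant"]).fit()
print(fit.summary())
```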

44 POWER ANALYSIS

45 What is power?

• Null distribution: designed so that we’d only see a test statistic this extreme 5% of the time
• This bounds type I but not type II
• Power = 1 − type II error rate
• Heuristic: 80% is “good enough”

46 Alternative scenarios

• One null, but infinitely many alternatives!
• Alternative distribution: given some n, underlying variance, and underlying difference in population means, what is the distribution of the test statistic?
• You know the critical value, so this tells you how often your p will be above 0.05 when the “true” scenario is as you model it

47 Calculating power

• A priori, to think about sample size and judge the value of an experiment
• Inherently requires estimating the alternative scenario!
  – Maybe try a few
• Statistic-specific, but in general:
  – Sample size, effect size, power, alpha
• “Consider the smallest effect size that you consider interesting and try to achieve reasonable power for that effect size”
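
A sketch of a priori power calculations with statsmodels’ power module; the effect sizes below are assumptions, and the ANOVA call assumes statsmodels’ convention that nobs is the total sample size:

```python
from statsmodels.stats.power import TTestIndPower, FTestAnovaPower

# Required sample size per group for an independent-samples t-test,
# assuming a "medium" effect (Cohen's d = 0.5), alpha = 0.05, power = 0.8
print(TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8))

# Power of a one-way ANOVA with k = 3 groups and 50 people per group
# (nobs = 150 total), assuming effect size f = 0.25
print(FTestAnovaPower().solve_power(effect_size=0.25, nobs=150,
                                    alpha=0.05, k_groups=3))
```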

48 Example from Seltman book

• F statistic (ANOVA)
• 3 treatments
• 50 people each
• Red: sigma = 10, means: 10, 12, 14
• Blue: sigma = 10, means: 10, 13, 16

49 Promoting power

• (Review from earlier)
• Raise sample size; reduce variance; aim for bigger effects

50 Walkthrough: linear regression

• u = model df → number of params
• v = F-test error df → N − u − 1
• f^2 = R^2 / (1 − R^2) … R^2 = f^2 / (1 + f^2)
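
A worked version of this walkthrough via the noncentral F distribution in scipy, assuming the usual noncentrality lambda = f^2 (u + v + 1) and hypothetical u, N, and R^2:

```python
from scipy import stats

# Hypothetical scenario: linear regression with u = 3 predictors, N = 80
u, N = 3, 80
v = N - u - 1                 # error df
r2 = 0.15
f2 = r2 / (1 - r2)            # Cohen's f^2
nc = f2 * (u + v + 1)         # noncentrality parameter lambda

f_crit = stats.f.ppf(0.95, u, v)             # critical F at alpha = 0.05
power = 1 - stats.ncf.cdf(f_crit, u, v, nc)  # P(reject | alternative scenario)
print("power ≈", round(power, 3))
```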

51 Retrospective power

• Somewhat controversial
• Calculate the observed effect size, then determine what sample size would be needed
  – Whole new experiment, not just collecting more data
• Not a good idea:
  – “We didn’t find a significant effect, but if we had studied 12 more people …”
