Hypothesis testing, part 2
With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal
1 CATEGORICAL IV, NUMERIC DV
2 Independent samples, one IV
# Conditions | Normal/Parametric | Non-parametric
Exactly 2    | T-test            | Mann-Whitney U, bootstrap
2+           | One-way ANOVA     | Kruskal-Wallis, bootstrap
3 Is your data normal?
• Skewness: asymmetry
• Kurtosis: “peakedness” relative to normal
– Both: within ±2 SE of the statistic is OK
• Or use Shapiro-Wilk (null = normal)
• Or look at a Q-Q plot
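A minimal sketch of these checks in Python (scipy and matplotlib assumed available; the sample `x` is hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=5, scale=2, size=40)  # hypothetical sample

# Skewness and (excess) kurtosis: both ~0 for a normal distribution
print("skew:", stats.skew(x), "kurtosis:", stats.kurtosis(x))

# Shapiro-Wilk: null hypothesis = data are normal
w, p = stats.shapiro(x)
print("Shapiro-Wilk p =", p)  # small p => evidence against normality

# Q-Q plot: points should fall near the line if the data are normal
stats.probplot(x, dist="norm", plot=plt)
plt.show()
```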
4 T-test
• Already talked about
• Assumptions: normality, equal variances, independent samples
– Can use Levene to test the equal-variance assumption
• Post-test: check residuals for assumption fit
– For a t-test this is the same pre or post
– For other tests you check residual vs. fit post
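A quick sketch with hypothetical samples `a` and `b`; Welch's correction (equal_var=False) is shown as a common fallback if Levene rejects:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(10, 2, size=30)  # hypothetical condition A
b = rng.normal(11, 2, size=30)  # hypothetical condition B

# Levene's test: null = equal variances
print("Levene p =", stats.levene(a, b).pvalue)

# Standard t-test assumes equal variances; if Levene rejects,
# Welch's t-test (equal_var=False) is a common fallback
t, p = stats.ttest_ind(a, b, equal_var=True)
print("t =", t, "p =", p)
```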
5 One-way ANOVA
• H0: m1 = m2 = m3
• H1: at least one doesn’t match
– NOT H1: m1 != m2 != m3
• Assumptions: normality, common variance, independent errors
• Intuition: F statistic
– Variance between / variance within
– Under the null, F ≈ 1; F >> 1 rejects the null
6 One-way ANOVA
• F = MSb / MSw
• MSw = sum over conditions [ sum within condition of (diff from condition mean)^2 ] / dfw
– dfw = N − k, where k = number of conditions
• MSb = sum [ (condition mean − grand mean)^2 ] / dfb
– dfb = k − 1
– Every observation goes in the sum
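A sketch computing F by hand from these definitions on made-up data, then checking against scipy's built-in:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
groups = [rng.normal(m, 3, size=20) for m in (10, 11, 13)]  # hypothetical data
k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = np.mean(np.concatenate(groups))

# MSw: squared deviations from each condition's own mean
ss_w = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_w = ss_w / (N - k)

# MSb: each observation contributes (its condition mean - grand mean)^2
ss_b = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ms_b = ss_b / (k - 1)

F = ms_b / ms_w
p = stats.f.sf(F, k - 1, N - k)   # upper tail of the F distribution
print(F, p)
print(stats.f_oneway(*groups))    # should match
```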
7 (example from Vibha Sazawal)
8 (example, continued)
9 F-distribution
(figure: F-distribution with the rejection region marked)
10 Now what? (Contrasts)
• So we rejected the null. What did we learn?
– What *didn’t* we learn?
– At least one is different ... Which? All?
– This is called an “omnibus test”
• To answer our actual research question, we usually need pairwise contrasts
11 The trouble with contrasts
• Contrasts mess with your Type I bounds
– One test: 95% confident
– Three tests: 85.7% confident (0.95^3 ≈ 0.857)
– 5 conditions, all pairs: 4 + 3 + 2 + 1 = 10 tests: 59.9% (0.95^10 ≈ 0.599)
– UH OH
12 Planned vs. post hoc
• Planned: you have a theory.
– Really, no cheating
– You get k − 1 pairwise comparisons for free (k = number of conditions)
– In theory, should not be control vs. all, but probably OK
– NO COMPARISONS unless the omnibus test passes
• Post hoc
– Anything unplanned
– More than k − 1
– Requires correction!
– Doesn’t necessarily require omnibus first

13 Correction
• Adjust {p-values, alpha} to compensate for multiple testing post hoc
• Bonferroni (most conservative)
– Assume all possible pairs: m = k(k−1)/2 (combinations)
– alpha_c = alpha / m
– Once you have looked, the implication is that you did all the comparisons implicitly!
• Holm-Bonferroni is less conservative
– Stepwise: adjust alpha as you go
• Dunnett specifically for all vs. control; others exist
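A sketch of p-value correction using statsmodels' multipletests, on hypothetical post-hoc p-values:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from post-hoc pairwise comparisons
pvals = [0.004, 0.020, 0.035, 0.300]

for method in ("bonferroni", "holm"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, reject, p_adj.round(3))
```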
14 Independent samples, one IV
# Conditions | Normal/Parametric | Non-parametric
Exactly 2    | T-test            | Mann-Whitney U, bootstrap
2+           | One-way ANOVA     | Kruskal-Wallis, bootstrap
15 Non-parametrics: MWU and K-W
• Good for non-normal data, Likert data (ordinal, not actually numeric)
• Assumptions: independent, at least ordinal
• Null: P(X > Y) = P(Y > X), where X, Y are observations from the 2 distributions (MWU)
– If we assume the same continuous distribution shape, this can be seen as comparing medians
16 MWU and K-W continued
• Essentially: rank-order all data (both conditions)
– Total the ranks for condition 1, compare to “expected”
– Various procedures to correct for ties
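Both tests are one-liners in scipy (ties are corrected automatically); the Likert-style data here are made up:

```python
from scipy import stats

# Hypothetical Likert responses (ordinal) from three conditions
a = [1, 2, 2, 3, 3, 4, 5, 5]
b = [2, 3, 3, 4, 4, 4, 5, 5]
c = [1, 1, 2, 2, 3, 3, 3, 4]

print(stats.mannwhitneyu(a, b))   # exactly 2 independent samples
print(stats.kruskal(a, b, c))     # 3+ independent samples
```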
17 Bootstrap
• Resampling technique(s)
• Intuition:
– Create a “null” distribution by e.g. subtracting means so mA = mB = 0
• Now you have shifted samples A-hat and B-hat
– Combine these to make a null distribution
– Draw samples of size N, with replacement
• Do it 1000 (or 10k) times
– Use this to determine the critical value (alpha = 0.05)
– Compare this critical value to your real data for the test
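A sketch of this shift-and-resample procedure on hypothetical samples, using 10,000 resamples and a two-sided test at alpha = 0.05:

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.normal(10.0, 2.0, size=30)  # hypothetical condition A
b = rng.normal(11.5, 2.0, size=30)  # hypothetical condition B
observed = a.mean() - b.mean()

# Impose the null by shifting both samples to a common mean (here 0)
a0, b0 = a - a.mean(), b - b.mean()

# Resample each shifted sample (with replacement) many times
diffs = np.empty(10_000)
for i in range(diffs.size):
    diffs[i] = (rng.choice(a0, size=a0.size, replace=True).mean()
                - rng.choice(b0, size=b0.size, replace=True).mean())

# Two-sided critical values at alpha = 0.05
lo, hi = np.percentile(diffs, [2.5, 97.5])
print("reject null:", observed < lo or observed > hi)
```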
18 Paired samples, one IV
# Conditions | Normal/Parametric | Non-parametric
Exactly 2    | Paired T-test     | Wilcoxon signed-rank
2+           | 2-way ANOVA w/ subject random factor; mixed models (later) | Friedman
19 Paired T-test
• Two samples per participant/item
• Test subtracts them
• Then uses a one-sample T-test with H0: m = 0 and H1: m != 0
• Regular T-test assumptions, plus: does subtraction make sense here?
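A sketch with hypothetical before/after measurements, showing the equivalence to a one-sample test on the differences:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
before = rng.normal(50, 5, size=25)         # hypothetical measurements
after = before + rng.normal(2, 3, size=25)  # paired second measurements

print(stats.ttest_rel(before, after))
# Same p-value as a one-sample t-test on the differences (t flips sign)
print(stats.ttest_1samp(after - before, popmean=0))
```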
20 Wilcoxon S.R. / Friedman
• H0: difference between pairs is symmetric around 0
• H1: … or not
• Excludes no-change items
• Essentially: rank by absolute difference; compare signs * ranks
• (Friedman = 3+ generalization)
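A sketch on made-up paired ratings (scipy's wilcoxon drops zero-difference pairs by default):

```python
from scipy import stats

before = [5, 7, 7, 6, 8, 4, 6, 7]   # hypothetical paired ratings
after  = [6, 8, 7, 8, 9, 5, 6, 9]
third  = [6, 7, 8, 7, 9, 6, 7, 8]

# Wilcoxon signed-rank: exactly 2 related samples
print(stats.wilcoxon(before, after))

# Friedman: 3+ related samples
print(stats.friedmanchisquare(before, after, third))
```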
21 One numeric IV, numeric DV
SIMPLE LINEAR REGRESSION
22 Simple linear regression
• E(Y|x) = b0 + b1·x … talks about populations
– Population mean of Y at this value of x
• Key H0: b1 = 0 (H1: b1 != 0)
– b0 usually not important for significance (obviously important for model fit)
• b1: slope → change in Y per unit X
• Best fit: least squares, or maximum likelihood
– LSq: minimize the sum of squares of the residuals
– ML: maximize the probability of seeing this data with this model
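A minimal least-squares fit in Python on simulated data (statsmodels assumed; the true b0 and b1 here are made up):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=50)                # hypothetical predictor
y = 2.0 + 0.8 * x + rng.normal(0, 1, size=50)  # hypothetical response

X = sm.add_constant(x)        # adds the intercept column (b0)
model = sm.OLS(y, X).fit()    # least-squares fit
print(model.params)           # b0, b1
print(model.pvalues[1])       # p-value for H0: b1 = 0
print(model.rsquared)
```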
23 Assumptions, caveats
• Assumes:
– Linearity in Y ~ X
– Normally distributed error for each x, with constant variance at all x
– Error measuring X is small compared to var. Y (fixed X)
• Independent errors!
– Serial correlation, grouped data, etc. (later)
• Don’t interpret widely outside the available x values
• Can transform for linearity!
– log(y), sqrt(y), 1/y, y^2

24 Assumption/residual checking
• Before: use a scatterplot to check plausible linearity
• After: residual vs. fit
– Residual on Y vs. predicted on X
– Should be relatively evenly distributed around 0 (linearity)
– Should have relatively even vertical spread (equal variance)
• After: quantile-normal plot of residuals
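A sketch of both post-fit checks, re-fitting the same kind of hypothetical model as in the earlier sketch:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.8 * x + rng.normal(0, 1, size=50)   # hypothetical data
model = sm.OLS(y, sm.add_constant(x)).fit()

fig, (ax1, ax2) = plt.subplots(1, 2)

# Residual vs. fit: look for even scatter around 0 with even vertical spread
ax1.scatter(model.fittedvalues, model.resid)
ax1.axhline(0, color="gray")
ax1.set_xlabel("fitted")
ax1.set_ylabel("residual")

# Quantile-normal (Q-Q) plot of the residuals
sm.qqplot(model.resid, line="s", ax=ax2)
plt.show()
```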
25 Model interpretation
• Interpret b1, interpret the p-value
• CI: if it crosses 0, it’s not significant
• R^2: fraction of total variation accounted for
– Intuitively: explained variance / total variance
– Explained = var(Y) − residual error
• f^2 = R^2 / (1 − R^2); small/medium/large: 0.02, 0.15, 0.35 (Cohen)
26 Robustness
• Brittle to linearity, independent errors
• Somewhat brittle to fixed-X
• Fairly robust to equal variance
• Quite robust to normality
27 CATEGORICAL OUTCOMES
28 One Cat. IV, Cat. DV, independent
• Contingency tables: how many people in each combination of categories
29 Chi-square test of independence
• H0: distribution of Var1 is the same at every level of Var2 (and vice versa)
– Null distribution approaches X^2 as sample size grows
– Heuristic: no cells < 5
– Can use Fisher’s exact test (FET) instead
• Intuition:
– Sum over rows/columns: (observed − expected)^2 / expected
– Expected: marginal % × count in the other margin (= row total × column total / N)
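A sketch on a hypothetical 2x2 table; scipy returns the expected counts, which is handy for checking the no-cells-under-5 heuristic:

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical 2x2 contingency table (counts)
table = np.array([[20, 30],
                  [35, 15]])

chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)
print(expected)  # expected = (row total * column total) / N

# If any expected cell count is < 5, prefer Fisher's exact test (2x2 only)
print(fisher_exact(table))
```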
30 Paired 2x2 tables
• Use McNemar’s test
– Contingency table: matches and mismatches for each option
• H0: marginals are the same
            | Cond 1: Yes | Cond 1: No |
Cond 2: Yes | a           | b          | a + b
Cond 2: No  | c           | d          | c + d
            | a + c       | b + d      | N
• Essentially a X^2 test on the agreement
– Test stat: (b − c)^2 / (b + c)
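A sketch using statsmodels' mcnemar on a hypothetical paired table; exact=False with correction=False reproduces the uncorrected statistic above:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical paired yes/no outcomes: rows = cond 2, cols = cond 1
table = np.array([[40, 12],   # [a, b]
                  [5, 43]])   # [c, d]

# Chi-square version: statistic is (b - c)^2 / (b + c)
print(mcnemar(table, exact=False, correction=False))
```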
31 Paired, continued
• Cochran’s Q: extension for more than two conditions
• Other similar extensions for related tasks
32 Critiques
• Choose a paper that has one (or more) empirical experiments as a central contribution
– Doesn’t have to be human subjects, but can be
– Does have to have enough description of the experiment
• 10-12 minute presentation
• Briefly: research questions, necessary background
• Main: describe and critique the methods
– Experimental design, data collection, analysis
– Good, bad, ugly, missing
• Briefly, results?

33 Logistic regression (logit)
• Numeric IV, binary DV (or ordinal)
• log( E(Y) / (1 − E(Y)) ) = log( Pr(Y=1) / Pr(Y=0) ) = b0 + b1·x
• Log odds of success = linear function
– Odds: 0 to inf., 1 is the middle
– e.g.: odds = 5 = 5:1 … five successes for every one failure
– Log odds: −inf to inf with 0 in the middle: good for regression
• Modeled as a binomial distribution
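A minimal logit fit on simulated data (statsmodels assumed; the true coefficients are made up):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.normal(0, 1, size=200)           # hypothetical predictor
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))   # true log odds: 0.5 + 1.2x
y = rng.binomial(1, p)                   # binary outcome

model = sm.Logit(y, sm.add_constant(x)).fit()
print(model.params)          # b0, b1 on the log-odds scale
print(np.exp(model.params))  # odds ratios (see next slide)
```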
34 Interpreting logistic regression
• Take exp(coef) to get interpretable odds ratios
• For each unit increase in x, the odds are multiplied by exp(b1)
– Note that this can make small coefficients important!
• Use e.g. the Hosmer-Lemeshow test for goodness of fit
– Null = data fit the model
– But not a lot of power!
35 MULTIVARIATE
36 Multiple regression
• Linear/logistic regression with more variables!
– At least one numeric, 0+ categorical
• Still: fixed x, normal errors with equal variance, independent errors (linear)
• Linear relationship between E(Y) and each x, when other inputs are held constant
– Effects of each x are independent!
• Still check q-n of residuals, residual vs. fit
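A sketch using statsmodels' formula interface on a tiny made-up dataset:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical dataset with two numeric predictors
df = pd.DataFrame({"y":  [3.1, 4.2, 5.0, 6.1, 7.3, 8.0],
                   "x1": [1, 2, 3, 4, 5, 6],
                   "x2": [2.0, 1.5, 3.0, 2.5, 4.0, 3.5]})

model = smf.ols("y ~ x1 + x2", data=df).fit()
print(model.summary())  # each coefficient: effect of that x, others held constant
```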
37 Model selection
• Which covariates to keep? (more on this in a bit)
38 Adding categorical vars
• Indicator variables (everything is 0 or 1)
• Need one fewer indicator than conditions
– One condition is true; or none are true (baseline)
– Coefs are *relative to baseline*!
• Model selection: keep all or none for one factor
• Called “ANCOVA” when there is at least one numeric and one categorical covariate
39 Interaction
• What if your covariates *aren’t* independent?
• E(Y) = b0 + b1·x1 + b2·x2 + b12·x1·x2
– Slope for x1 is different for each value of x2
• Superadditive: all in the same direction, interaction makes effects stronger
• Subadditive: interaction is in the opposite direction
• For indicator vars, all or none
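A sketch of both ideas in formula notation on made-up data: C() builds the indicator variables with one level held out as baseline, and x1 * x2 expands to main effects plus the interaction:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"y":  [3.1, 4.2, 5.0, 6.1, 7.3, 8.0, 6.5, 7.9],
                   "x1": [1, 2, 3, 4, 5, 6, 7, 8],
                   "x2": [2.0, 1.5, 3.0, 2.5, 4.0, 3.5, 4.5, 5.0],
                   "cond": ["a", "a", "b", "b", "c", "c", "a", "b"]})

# C() builds indicators, with one condition held out as the baseline
m_cat = smf.ols("y ~ x1 + C(cond)", data=df).fit()

# x1 * x2 expands to x1 + x2 + x1:x2 (main effects plus interaction)
m_int = smf.ols("y ~ x1 * x2", data=df).fit()
print(m_int.params["x1:x2"])  # this is b12
```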
40 Model selection!
• Which covariates to keep?
• From theory
• Keep an interaction only if it’s significant?
– If you keep an interaction, keep the corresponding main effects
• ”Adjusted” R^2?
– Regular R^2 is always higher with more covariates
• BIC and AIC
– Take the model likelihood and penalize for more parameters
– Absolute value not interpretable; lower is better
• All combinations? Stepwise?
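A sketch comparing additive and interaction models by AIC/BIC (same kind of made-up data as above):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"y":  [3.1, 4.2, 5.0, 6.1, 7.3, 8.0, 6.5, 7.9],
                   "x1": [1, 2, 3, 4, 5, 6, 7, 8],
                   "x2": [2.0, 1.5, 3.0, 2.5, 4.0, 3.5, 4.5, 5.0]})

additive = smf.ols("y ~ x1 + x2", data=df).fit()
interact = smf.ols("y ~ x1 * x2", data=df).fit()

# Absolute values aren't interpretable; lower is better
print("additive  AIC:", additive.aic, "BIC:", additive.bic)
print("interact  AIC:", interact.aic, "BIC:", interact.bic)
```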
41 THINGS WE ARE ONLY GOING TO MENTION BRIEFLY
Know they exist; look them up if relevant

42 Multi-way ANOVA
• >1 categorical IVs, 1 numeric DV
• Normality, equal variance, independent errors
• With interaction: every combination of factor levels has its own population mean
• Without interaction (additive): the change in one variable is consistent across all fixed values of the others
• Works basically like standard ANOVA, etc.
43 Mixed models regression
• Explicitly model correlations in the data
• Fixed effects: affect the outcome for everyone
• Random effects: deviations per data item that we don’t want to model individually
• Simplest example: repeated measures
– Y ~ b0 + b1·x1 + b2·x2 + … + random ID intercept
– Each participant has their own intercept adjustment
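A sketch of a random-intercept model with statsmodels' mixedlm, on simulated repeated-measures data (participant IDs and effect sizes are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical repeated measures: 10 participants, 4 trials each
rng = np.random.default_rng(7)
ids = np.repeat(np.arange(10), 4)
x = np.tile(np.arange(4), 10)
y = 5 + 0.5 * x + rng.normal(0, 1, 10)[ids] + rng.normal(0, 0.5, 40)

df = pd.DataFrame({"y": y, "x": x, "pid": ids})

# x is a fixed effect; each participant gets a random intercept
m = smf.mixedlm("y ~ x", data=df, groups=df["pid"]).fit()
print(m.summary())
```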
44 POWER ANALYSIS
45 What is power?
• Null distribution: designed so that we’d only see a test statistic this extreme 5% of the time
• This bounds Type I but not Type II error
• Power = 1 − Type II error rate
• Heuristic: 80% is “good enough”
46 Alternative scenarios
• One null, but infinitely many alternatives!
• Alternative distribution: given some n, underlying variance, and underlying difference in population means, what is the distribution of the test statistic?
• You know the critical value, so this tells you how often your p will be above 0.05 when the “true” scenario is as you model it
47 Calculating power
• A priori, to think about sample size and judge the value of an experiment
• Inherently requires estimating the alternative scenario!
– Maybe try a few
• Statistic-specific, but in general involves:
– Sample size, effect size, power, alpha
• “Consider the smallest effect size that you consider interesting and try to achieve reasonable power for that effect size”
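A sketch using statsmodels' power calculators; the effect sizes here are Cohen's conventional values, chosen purely for illustration:

```python
from statsmodels.stats.power import FTestAnovaPower, TTestIndPower

# Sample size for a two-group t-test: d = 0.5 (medium),
# alpha = 0.05, target power = 0.8
n = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print("n per group:", n)

# Power of a 3-group one-way ANOVA: Cohen's f = 0.25,
# 150 participants total (50 per group)
pw = FTestAnovaPower().solve_power(effect_size=0.25, nobs=150,
                                   alpha=0.05, k_groups=3)
print("power:", pw)
```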
48 Example from Seltman book
• F statistic (ANOVA)
• 3 treatments
• 50 people each
• Red: sigma = 10, means: 10, 12, 14
• Blue: sigma = 10, means: 10, 13, 16
49 Promoting power
• (Review from earlier)
• Raise sample size; reduce variance; aim for bigger effects
50 Walkthrough: linear regression
• u = model df → number of parameters (predictors)
• v = F-test error df → N − u − 1
• f^2 = R^2 / (1 − R^2) … R^2 = f^2 / (1 + f^2)
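A sketch of this walkthrough via the noncentral F distribution; the scenario (3 predictors, N = 50, expected R^2 = 0.15) is made up, and lambda = f^2 * N is one common convention for the noncentrality parameter:

```python
from scipy.stats import f as f_dist, ncf

u = 3                    # model df: number of predictors
N = 50
v = N - u - 1            # error df
f2 = 0.15 / (1 - 0.15)   # f^2 = R^2 / (1 - R^2)

crit = f_dist.ppf(0.95, u, v)   # critical F at alpha = 0.05
nc = f2 * N                     # noncentrality parameter (lambda = f^2 * N)
power = 1 - ncf.cdf(crit, u, v, nc)
print("power:", power)
```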
51 Retrospective power
• Somewhat controversial
• Calculate the observed effect size, then determine what sample size would be needed
– A whole new experiment, not just collecting more data
• Not a good idea:
– “We didn’t find a significant effect, but if we had studied 12 more people …”