Hypothesis Testing, Part 2
With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal

CATEGORICAL IV, NUMERIC DV

Independent samples, one IV

  # Conditions   Normal/Parametric   Non-parametric
  Exactly 2      T-test              Mann-Whitney U, bootstrap
  2+             One-way ANOVA       Kruskal-Wallis, bootstrap

Is your data normal?
• Skewness: asymmetry
• Kurtosis: "peakedness" relative to normal
  – Both: within ±2 SE is OK
• Or use Shapiro-Wilk (null hypothesis = data are normal)
• Or look at a Q-Q plot

T-test
• Already talked about
• Assumptions: normality, equal variances, independent samples
  – Can use Levene's test to check the equal-variance assumption
• Post-test: check residuals for assumption fit
  – For a t-test this is the same pre or post
  – For other tests you check residual vs. fit post

One-way ANOVA
• H0: m1 = m2 = m3
• H1: at least one mean doesn't match
• NOT H1: m1 != m2 != m3
• Assumptions: normality, common variance, independent errors
• Intuition: the F statistic
  – Variance between conditions / variance within conditions
  – Under the exact null, F = 1; F >> 1 rejects the null

One-way ANOVA
• F = MSb / MSw
• MSw = sum over conditions [ sum within each condition [ (diff from condition mean)^2 ] ] / dfw
  – dfw = N - k, where k = number of conditions
• MSb = sum [ (diff from grand mean)^2 ] / dfb
  – dfb = k - 1
  – Every observation goes in the sum

(Worked example from Vibha Sazawal)

[Figure: the F-distribution, with the rejection region marked]

Now what? (Contrasts)
• So we rejected the null. What did we learn?
  – What *didn't* we learn?
  – At least one mean is different ... Which? All?
  – This is called an "omnibus test"
• To answer our actual research question, we usually need pairwise contrasts

The trouble with contrasts
• Contrasts mess with your Type I bounds
  – One test: 95% confident
  – Three tests: 0.95^3 ≈ 85.7% confident
  – 5 conditions, all pairs: 4 + 3 + 2 + 1 = 10 tests: 59.9% confident
  – UH OH

Planned vs. post hoc
• Planned: you have a theory.
  – Really, no cheating
  – You get k - 1 pairwise comparisons for free
  – In theory these should not be control vs. all, but that's probably OK
  – NO COMPARISONS unless the omnibus test passes
• Post hoc
  – Anything unplanned
  – More than k - 1 comparisons
  – Requires correction!
  – Doesn't necessarily require the omnibus test first

Correction
• Adjust {p-values, alpha} to compensate for multiple testing post hoc
• Bonferroni (most conservative)
  – Assume all possible pairs: m = k(k-1)/2 (combinations)
  – alpha_c = alpha / m
  – Once you have looked, the implication is that you did all the comparisons implicitly!
• Holm-Bonferroni is less conservative
  – Stepwise: adjust alpha as you go
• Dunnett specifically for all vs. control; others exist

Independent samples, one IV

  # Conditions   Normal/Parametric   Non-parametric
  Exactly 2      T-test              Mann-Whitney U, bootstrap
  2+             One-way ANOVA       Kruskal-Wallis, bootstrap

(Code sketches for both rows follow the bootstrap slide below.)

Non-parametrics: MWU and K-W
• Good for non-normal data and Likert data (ordinal, not actually numeric)
• Assumptions: independent samples, at least ordinal data
• Null (MWU): P(X > Y) = P(Y > X), where X, Y are observations from the two distributions
  – If we assume the distributions have the same continuous shape, this can be seen as comparing medians

MWU and K-W continued
• Essentially: rank-order all the data (both conditions together)
  – Total the ranks for condition 1; compare to the "expected" total
  – Various procedures correct for ties

Bootstrap
• Resampling technique(s)
• Intuition:
  – Create a "null" distribution, e.g., by subtracting each sample's mean so mA = mB = 0
    • Now you have shifted samples A-hat and B-hat
  – Combine these to make a null distribution
  – Draw samples of size N, with replacement
    • Do it 1000 (or 10,000) times
  – Use the result to determine a critical value (alpha = 0.05)
  – Compare your real test statistic to this critical value
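A minimal sketch of that shift-combine-resample procedure, assuming a difference in means as the test statistic; the samples a and b are made-up placeholders, and only numpy is needed:

```python
import numpy as np

rng = np.random.default_rng(7)
a = rng.normal(5.0, 1.0, 30)   # placeholder: condition A measurements
b = rng.normal(5.8, 1.2, 30)   # placeholder: condition B measurements

observed = a.mean() - b.mean()

# Shift both samples to mean 0 (enforce the null), then combine
null_pool = np.concatenate([a - a.mean(), b - b.mean()])

# Draw bootstrap samples with replacement, recomputing the statistic each time
reps = 10_000
null_stats = np.empty(reps)
for i in range(reps):
    a_star = rng.choice(null_pool, size=len(a), replace=True)
    b_star = rng.choice(null_pool, size=len(b), replace=True)
    null_stats[i] = a_star.mean() - b_star.mean()

# Two-sided critical value at alpha = 0.05; compare to the real statistic
crit = np.quantile(np.abs(null_stats), 0.95)
print(f"observed = {observed:.3f}, critical = {crit:.3f}",
      "reject null" if abs(observed) > crit else "cannot reject")
```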
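Stepping back to the "exactly 2" row of the table above, a hedged sketch of the check-then-test flow with scipy.stats; the data, the 0.05 cutoffs, and the strict normal/non-normal branching are illustrative choices, not the only defensible ones:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(5.0, 1.0, 30)   # placeholder: condition A
b = rng.normal(5.5, 1.0, 30)   # placeholder: condition B

# Normality: Shapiro-Wilk (null hypothesis = data are normal)
normal = stats.shapiro(a).pvalue > 0.05 and stats.shapiro(b).pvalue > 0.05

if normal:
    # Levene's test for the equal-variance assumption (null = equal variances)
    print(stats.levene(a, b))
    print(stats.ttest_ind(a, b, equal_var=True))
else:
    # Non-parametric fallback: Mann-Whitney U (rank-based; handles ties internally)
    print(stats.mannwhitneyu(a, b))
```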
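And the "2+" row: an omnibus one-way ANOVA followed by Holm-corrected post-hoc pairwise contrasts. multipletests comes from statsmodels; the three groups are placeholders, and stats.kruskal would be the non-parametric stand-in:

```python
from itertools import combinations
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
groups = {"A": rng.normal(5.0, 1.0, 30),   # placeholder conditions
          "B": rng.normal(5.5, 1.0, 30),
          "C": rng.normal(6.0, 1.0, 30)}

# Omnibus test first: one-way ANOVA across all conditions
f_stat, p_omnibus = stats.f_oneway(*groups.values())
print(f"F = {f_stat:.2f}, p = {p_omnibus:.4f}")

if p_omnibus < 0.05:
    # Post-hoc pairwise contrasts, Holm-Bonferroni corrected
    pairs = list(combinations(groups, 2))
    pvals = [stats.ttest_ind(groups[x], groups[y]).pvalue for x, y in pairs]
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
    for pair, p, rej in zip(pairs, p_adj, reject):
        print(pair, f"adjusted p = {p:.4f}", "reject" if rej else "keep null")
```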
Paired samples, one IV

  # Conditions   Normal/Parametric                            Non-parametric
  Exactly 2      Paired t-test                                Wilcoxon signed-rank
  2+             Two-way ANOVA w/ subject as random factor;   Friedman
                 mixed models (later)

Paired T-test
• Two samples per participant (or item)
• The test subtracts them
• Then uses a one-sample t-test with H0: m = 0 and H1: m != 0
• Regular t-test assumptions, plus: does subtraction make sense here?

Wilcoxon S.R. / Friedman
• H0: the difference between pairs is symmetric around 0
• H1: ... or not
• Excludes no-change items
• Essentially: rank pairs by absolute difference; compare signs * ranks
• (Friedman = generalization to 3+ conditions)
• (Both paired tests are sketched in code after the regression slides below.)

SIMPLE LINEAR REGRESSION
One numeric IV, numeric DV

Simple linear regression
• E(Y|x) = b0 + b1x ... looks at populations
  – The population mean at this value of x
• Key H0: b1 = 0 (the alternative is b1 != 0)
  – b0 usually not important for significance (obviously important to model fit)
• b1: slope → change in Y per unit of X
• Best fit: least squares, or maximum likelihood
  – LSq: minimize the sum of the squared residuals
  – ML: maximize the probability of seeing this data under this model

Assumptions, caveats
• Assumes:
  – Linearity in Y ~ X
  – Normally distributed error at each x, with constant variance across all x
  – Error in measuring X is small compared to the variance of Y (fixed X)
• Independent errors!
  – Serial correlation, grouped data, etc. (later)
• Don't interpret widely outside the available x values
• Can transform for linearity!
  – log(Y), sqrt(Y), 1/Y, Y^2

Assumption/residual checking
• Before: use a scatterplot to check for plausible linearity
• After: residual vs. fit
  – Residuals on the Y axis vs. predicted values on the X axis
  – Should be relatively evenly distributed around 0 (linearity)
  – Should have relatively even vertical spread (equal variance)
• After: quantile-normal plot of the residuals

Model interpretation
• Interpret b1; interpret its p-value
• CI: if it crosses 0, it's not significant
• R^2: fraction of the total variation accounted for
  – Intuitively: explained variance / total variance
  – Explained = var(Y) - residual error
• Cohen's f^2 = R^2 / (1 - R^2); small/medium/large thresholds: 0.02, 0.15, 0.35 (Cohen)

Robustness
• Brittle to linearity, independent errors
• Somewhat brittle to fixed-X
• Fairly robust to equal variance
• Quite robust to normality
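A sketch of fitting and checking a simple linear regression as described above, using statsmodels on placeholder data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 100)                  # placeholder predictor
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)    # placeholder outcome

X = sm.add_constant(x)            # adds the intercept column (b0)
model = sm.OLS(y, X).fit()
print(model.summary())            # b1, its p-value and CI, and R^2 in one place

# Residual vs. fit check: residuals should scatter evenly around 0,
# with roughly constant vertical spread across the fitted values
resid, fitted = model.resid, model.fittedvalues
```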
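And, going back to the paired tests at the start of this span, a minimal sketch with scipy.stats on made-up before/after measurements:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
before = rng.normal(10.0, 2.0, 25)           # placeholder: condition 1, per participant
after = before + rng.normal(0.5, 1.0, 25)    # placeholder: condition 2, same participants

# Paired t-test: a one-sample t-test on the differences, H0: mean difference = 0
print(stats.ttest_rel(after, before))

# Non-parametric alternative: Wilcoxon signed-rank
# (zero-difference pairs are excluded by default)
print(stats.wilcoxon(after, before))
```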
CATEGORICAL OUTCOMES

One cat. IV, cat. DV, independent
• Contingency tables: how many people fall into each combination of categories

Chi-square test of independence
• H0: the distribution of Var1 is the same at every level of Var2 (and vice versa)
  – The null distribution approaches chi-squared as the sample size grows
  – Heuristic: no cells with expected count < 5
  – Can use Fisher's exact test instead
• Intuition:
  – Sum over rows/columns: (observed - expected)^2 / expected
  – Expected: marginal % * count in the other margin
• (Sketched in code at the end of this section.)

Paired 2x2 tables
• Use McNemar's test
  – Contingency table: matches and mismatches for each option
• H0: the marginals are the same

                 Cond1: Yes   Cond1: No
    Cond2: Yes       a            b        a + b
    Cond2: No        c            d        c + d
                   a + c        b + d        N

• Essentially a chi-squared test on the agreement
  – Test statistic: (b - c)^2 / (b + c)

Paired, continued
• Cochran's Q: extension to more than two conditions
• Other similar extensions exist for related tasks

Critiques
• Choose a paper that has one (or more) empirical experiments as a central contribution
  – Doesn't have to be human subjects, but can be
  – Does have to describe the experiment in enough detail
• 10-12 minute presentation
• Briefly: research questions, necessary background
• Main: describe and critique the methods
  – Experimental design, data collection, analysis
  – Good, bad, ugly, missing
• Briefly, results?

Logistic regression (logit)
• Numeric IV, binary DV (or ordinal)
• log(E(Y) / (1 - E(Y))) = log(Pr(Y=1) / Pr(Y=0)) = b0 + b1x
• Log odds of success = a linear function
  – Odds: 0 to infinity, with 1 in the middle
  – e.g., odds = 5 = 5:1 ... five successes per failure
  – Log odds: -inf to inf, with 0 in the middle: good for regression
• Modeled as a binomial distribution

Interpreting logistic regression
• Take exp(coef) to get interpretable odds ratios
• For each unit increase in x, the odds are multiplied by exp(b1)
  – Note that this can make small coefficients important!
• Use, e.g., the Hosmer-Lemeshow test for goodness of fit
  – Null = the data fit the model
  – But it doesn't have a lot of power!
• (Sketched in code at the end of this section.)

MULTIVARIATE

Multiple regression
• Linear/logistic regression with more variables!
  – At least one numeric, 0+ categorical
• Still: fixed x, normal errors with equal variance, independent errors (linear)
• Linear relationship between E(Y) and each x, when the other inputs are held constant
  – The effects of each x are independent!
• Still check quantile-normal of residuals, residual vs. fit

Model selection
• Which covariates to keep? (more on this in a bit)

Adding categorical vars
• Indicator variables (everything is 0 or 1)
• Need one fewer indicator than conditions
  – One condition's indicator is true; or none are (the baseline)
  – Coefficients are *relative to the baseline*!
• Model selection: keep all of a factor's indicators, or none
• Called "ANCOVA" when there is at least one numeric and one categorical IV

Interaction
• What if your covariates *aren't* independent?
• E(Y) = b0 + b1x1 + b2x2 + b12x1x2
  – The slope for x1 is different at each value of x2
• Superadditive: all in the same direction; the interaction makes the effects stronger
• Subadditive: the interaction is in the opposite direction
• For indicator vars: all or none

Model selection!
• Which covariates to keep?
• From theory
• Keep an interaction only if it's significant?
  – If you keep an interaction, you should keep the corresponding main effects
• "Adjusted" R^2?
  – Regular R^2 is always higher with more covariates
• BIC and AIC
  – Take the model likelihood and penalize for more parameters
  – The absolute value is not interpretable; lower is better
• All combinations? Stepwise?

THINGS WE ARE ONLY GOING TO MENTION BRIEFLY
Know they exist; look them up if relevant

Multi-way ANOVA
• More than one categorical IV, one numeric DV
• Normality, equal variance, independent errors
• With interaction: every combination of factor levels has its own population mean
• Without interaction (additive): the change in one variable is consistent across all fixed values of the others
• Works basically like standard ANOVA, etc.

Mixed models regression
• Explicitly model correlations in the data
• Fixed effects: affect the outcome for everyone
• Random effects: per-item deviations we don't want to model individually
• Simplest example: repeated measures
  – Y ~ b0 + b1x1 + b2x2 ... + a random per-participant intercept
  – Each participant has their own intercept adjustment
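A sketch of this simplest mixed model, a random per-participant intercept, via the statsmodels formula interface; the data frame and its column names are invented for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n_subj, n_rep = 20, 5
subject = np.repeat(np.arange(n_subj), n_rep)     # placeholder participant IDs
intercepts = rng.normal(0, 1, n_subj)[subject]    # each participant's own shift
x = rng.uniform(0, 10, n_subj * n_rep)
y = 2.0 + 0.5 * x + intercepts + rng.normal(0, 1, n_subj * n_rep)
df = pd.DataFrame({"y": y, "x": x, "subject": subject})

# Fixed effect for x; random intercept per participant (the `groups` argument)
result = smf.mixedlm("y ~ x", df, groups=df["subject"]).fit()
print(result.summary())
```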
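Returning to the categorical-outcome tests above: a sketch of the chi-square test of independence, its Fisher's-exact fallback, and McNemar's test, on made-up 2x2 tables:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.contingency_tables import mcnemar

# Made-up independent-samples 2x2 contingency table
table = np.array([[30, 10],
                  [20, 25]])

# Chi-square test of independence; also returns the expected counts
chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p, expected)

# If any expected cell is small (heuristic: < 5), use Fisher's exact test
print(stats.fisher_exact(table))

# Made-up paired 2x2 table; McNemar's test uses the off-diagonal (b, c) cells
paired = np.array([[40, 12],
                   [5, 43]])
print(mcnemar(paired, exact=False, correction=True))
```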
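And the logistic regression slides, sketched with statsmodels on placeholder data; exponentiating the fitted coefficients gives the odds-ratio interpretation described above:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 200)                 # placeholder predictor
p = 1 / (1 + np.exp(-(-2.0 + 0.5 * x)))     # true model: log odds = -2 + 0.5x
y = rng.binomial(1, p)                      # placeholder binary outcome

X = sm.add_constant(x)
result = sm.Logit(y, X).fit()
print(result.summary())
print(np.exp(result.params))   # odds ratios: odds multiply by this per unit x
```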
POWER ANALYSIS

What is power?
• The null distribution is designed so that we'd only see a test statistic this extreme 5% of the time
• This bounds Type I error but not Type II
• Power = 1 - Type II error rate
• Heuristic: 80% power is "good enough"

Alternative scenarios
• One null, but infinitely many alternatives!
• Alternative distribution: given some n, an underlying variance, and an underlying difference
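As a concrete illustration, a sketch of a prospective power analysis for an independent-samples t-test using statsmodels; the Cohen's-d effect size of 0.5 is an assumed placeholder:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# How many participants per condition for 80% power,
# assuming a medium standardized effect (Cohen's d = 0.5)?
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"n per group: {n:.1f}")

# Conversely: the power achieved with 30 per group at the same effect size
power = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=30)
print(f"power: {power:.2f}")
```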