
Gov 2002: 2. Randomized Experiments

Matthew Blackwell

September 10, 2015

Where are we (going)?

• Last time: defined potential outcomes and causal estimands/quantities of interest.
• This time: how we identify these quantities in randomized experiments.
• Later: what if randomization only happens conditional on covariates?
• Or, what if we weren't able to randomize?

What is the selection problem?

• First pass at the ATT: the prima facie or naive difference in means:

$$
\begin{aligned}
E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]
&= E[Y_i(1) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0] \quad \text{(consistency)} \\
&= E[Y_i(1) \mid D_i = 1] - E[Y_i(0) \mid D_i = 1] + E[Y_i(0) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0] \\
&= \underbrace{E[Y_i(1) - Y_i(0) \mid D_i = 1]}_{\text{ATT}} + \underbrace{E[Y_i(0) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]}_{\text{selection bias}}
\end{aligned}
$$

• Naive difference = ATT + selection bias.
• Selection bias: how different the treated and control groups are in terms of their potential outcome under control.

Selection bias = unidentified ATT

$$
E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] = \underbrace{E[Y_i(1) - Y_i(0) \mid D_i = 1]}_{\text{ATT}} + \underbrace{E[Y_i(0) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]}_{\text{selection bias}}
$$

• ATT is unidentified here unless selection bias equals 0.

▶ Both the ATT and the selection bias (SB) are unobserved.
▶ No amount of data will help us distinguish between them.
• Example: effect of negativity on vote shares.

▶ Naive estimate: negative candidates do worse than positive candidates.
▶ Could be that the ATT is negative, OR the ATT is positive and there is large negative selection bias.
▶ SB here: candidates who go negative would do worse than those who stay positive, even if they ran the same campaigns.

• With an unbounded $Y_i$, we can't even bound the ATT because, in principle, the SB could be anywhere from $-\infty$ to $\infty$.
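• A minimal simulation sketch of this decomposition in R (the data-generating process is invented for illustration): units with worse control outcomes select into treatment, so the naive difference combines the ATT with negative selection bias.

```r
## Sketch: naive difference = ATT + selection bias (hypothetical DGP).
set.seed(42)
n  <- 1e5
y0 <- rnorm(n)                        # potential outcome under control
y1 <- y0 + 0.5                        # treatment adds a constant 0.5
d  <- rbinom(n, 1, plogis(-2 * y0))   # low-y0 units select into treatment
y  <- d * y1 + (1 - d) * y0           # consistency: observed outcome

naive <- mean(y[d == 1]) - mean(y[d == 0])
att   <- mean(y1[d == 1]) - mean(y0[d == 1])
sb    <- mean(y0[d == 1]) - mean(y0[d == 0])
c(naive = naive, att_plus_sb = att + sb)  # identical by construction
```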

Notation

• We'll need some notation for the entire vectors of treatments, outcomes, etc.:

▶ $\mathbf{D} = (D_1, D_2, \ldots, D_N)$
▶ $\mathbf{X}$, $\mathbf{Y}(1)$, and $\mathbf{Y}(0)$ are similarly defined for $X_i$, $Y_i(1)$, and $Y_i(0)$.

Experiments

• An experiment is a study where assignment to treatment is controlled by the researcher.

▶ Let $p_i = \Pr[D_i = 1]$ be the treatment assignment probability.

▶ $p_i$ is controlled and known by the researcher in an experiment.
• A randomized experiment is an experiment with the following properties:

1. Positivity: assignment is probabilistic: $0 < p_i < 1$.
▶ No deterministic assignment.

2. Unconfoundedness: $\Pr[D_i = 1 \mid \mathbf{Y}(1), \mathbf{Y}(0)] = \Pr[D_i = 1]$.
▶ Treatment assignment does not depend on any potential outcomes.

▶ Sometimes written as $D_i \perp\!\!\!\perp (\mathbf{Y}(1), \mathbf{Y}(0))$.

• Natural experiment: an experiment where treatment is randomized, but that randomization was not under the control of the researcher.
• Randomization has to be justified in these cases since it wasn't directly implemented.
• Hyde paper on syllabus:

▶ election observers were assigned to polling stations "using a method that approximates randomization"

Randomization

• What does randomization (positivity + unconfoundedness) buy us?

▶ treatment group is a random sample from the population.
▶ control group is a random sample from the population.
• ⇝ sample control mean is unbiased for the population control mean:

$$
E[Y_i \mid D_i = 0] = E[Y_i(0) \mid D_i = 0] = E[Y_i(0)] = E[Y_i(0) \mid D_i = 1]
$$

• Not the same as the observed outcomes being independent of treatment ($Y_i \perp\!\!\!\perp D_i$).
• Randomization eliminates selection bias:

$$
E[Y_i(0) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0] = E[Y_i(0)] - E[Y_i(0)] = 0
$$

Identification by randomization

• Goal: show that we can identify a causal effect under a randomization assumption.
• Use the selection bias result with the naive difference in means:

$$
\begin{aligned}
E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]
&= \underbrace{E[Y_i(1) \mid D_i = 1] - E[Y_i(0) \mid D_i = 1]}_{\text{ATT}} + \underbrace{0}_{\text{selection bias}} \\
&= E[Y_i(1) \mid D_i = 1] - E[Y_i(0) \mid D_i = 1] \\
&= E[Y_i(1)] - E[Y_i(0)] \quad \text{(unconfoundedness)}
\end{aligned}
$$

• $E[Y_i(1) - Y_i(0)] = \tau$ is just the ATE.
• Thus, if we can estimate the conditional expectations $E[Y_i \mid D_i = 1]$ and $E[Y_i \mid D_i = 0]$, we can estimate the ATE.
• Result: the ATE is identified in a randomized experiment.

Types of randomizations/experiments

• Let $N_t = \sum_{i=1}^N D_i$ and $N_c = N - N_t$.
• Bernoulli trials:

▶ flip a coin for each person in the population, with probability $q$ of treatment
▶ $\Pr[\mathbf{D}] = q^{N_t}(1 - q)^{N_c}$
▶ Downside: could end up with all treated or all control
• Completely randomized experiment:

▶ Randomly sample $N_t$ units from the population to be treated
▶ Equal probability of any assignment with $\sum_{i=1}^N D_i = N_t$
▶ Each possible assignment has probability $\binom{N}{N_t}^{-1}$
▶ Each unit has probability $p_i = N_t/N$ of being selected into treatment, but treatment assignment is not independent between units.

Bernoulli assignment

[Diagram: four units, each independently assigned to T with probability $q$ or to C with probability $1 - q$.]

Completely randomized design

[Diagram: six units with realized assignment C, T, C, T, T, C.]

• Start with $N = 6$ and say we want to have $N_t = 3$.
• Randomly pick 3 units from $\{1, 2, 3, 4, 5, 6\}$: here, 2, 4, 5.
• Not independent: knowing unit 2 is treated means unit 3 is less likely to be treated.
• Fixed number of treatment spots induces dependence:

$$
E[D_i \cdot D_j] \neq E[D_i]E[D_j]
$$

$$
E[D_i \cdot D_j] = \Pr[D_i = 1] \cdot \Pr[D_j = 1 \mid D_i = 1] = \frac{N_t}{N} \cdot \frac{N_t - 1}{N - 1}
$$
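• A small R sketch of both assignment mechanisms, using the slide's $N = 6$, $N_t = 3$ example (variable names are illustrative):

```r
## Bernoulli trials: independent coin flips, so N_t is random and
## all-treated or all-control samples are possible.
N <- 6; q <- 0.5; N_t <- 3
d_bernoulli <- rbinom(N, 1, q)

## Completely randomized: exactly N_t treated, which makes
## assignments dependent across units.
d_complete <- sample(rep(c(1, 0), c(N_t, N - N_t)))

## The dependence from the formula above:
(N_t / N) * ((N_t - 1) / (N - 1))  # E[D_i D_j] = 0.3 ...
(N_t / N)^2                        # ... != E[D_i] E[D_j] = 0.25
```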

Stratified designs

• Stratified randomized experiment:

▶ form $J$ blocks, $b_j$, $j = 1, \ldots, J$, based on the covariates
▶ completely randomized assignment within each block.

▶ Randomization depends on the block variable, $B_i$
▶ Conditional unconfoundedness: $D_i \perp\!\!\!\perp (Y_i(1), Y_i(0)) \mid B_i$.
• Pair randomized experiments:

▶ Stratified randomized experiment where each block has 2 units.
▶ 1 unit in each pair receives treatment.
▶ Extreme version of the stratified/blocked randomized experiment.
▶ Also called a "matched pair" design.
• Both of these seek to remove "bad randomizations" where covariates are related to treatment assignment by chance.
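• A sketch of generating such a blocked assignment in R (the block labels and sizes are hypothetical):

```r
## Complete randomization within each block: half of each block is
## treated, so balance on the block variable holds by construction.
set.seed(2)
block <- rep(c("incumbent", "challenger"), each = 4)
d <- integer(length(block))
for (b in unique(block)) {
  idx <- which(block == b)
  d[idx] <- sample(rep(c(1, 0), length(idx) / 2))
}
table(block, d)  # exactly 2 treated and 2 control per block
```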

Identification under stratification

• Generally, stratified designs mean that the probability of treatment depends on a covariate, $X_i$:

$$
p_i(x) = \Pr[D_i = 1 \mid X_i = x]
$$

• Conditional randomization assumptions:

1. Positivity: $0 < p_i(x) < 1$ for all $i$ and $x$.
2. Unconfoundedness: $\Pr[D_i = 1 \mid \mathbf{X}, \mathbf{Y}(1), \mathbf{Y}(0)] = \Pr[D_i = 1 \mid X_i]$

▶ Also written as $D_i \perp\!\!\!\perp (\mathbf{Y}(1), \mathbf{Y}(0)) \mid X_i$

Stratification and the ATE

• Can we identify the ATE under these stratified designs? Yes!

$$
\begin{aligned}
E[Y_i(1) - Y_i(0)] &= E_X\big[E[Y_i(1) - Y_i(0) \mid X_i]\big] \quad \text{(iterated expectations)} \\
&= E_X\big[E[Y_i(1) \mid X_i] - E[Y_i(0) \mid X_i]\big] \\
&= E_X\big[E[Y_i(1) \mid D_i = 1, X_i] - E[Y_i(0) \mid D_i = 0, X_i]\big] \quad \text{(unconfoundedness)} \\
&= E_X\big[E[Y_i \mid D_i = 1, X_i] - E[Y_i \mid D_i = 0, X_i]\big] \quad \text{(consistency)}
\end{aligned}
$$

• The ATE is just the average of the within-strata differences in means.
• Identified because the last line is a function of observables.
• The averaging is over the distribution of the strata ⇝ size of the blocks.

Stratification example

• Stratified by incumbency, where $X_i = 1$ is a Democratic incumbent and $X_i = 0$ is a Democratic challenger.
• Then we have:

$$
\begin{aligned}
E_X\big[E[Y_i \mid D_i = 1, X_i] &- E[Y_i \mid D_i = 0, X_i]\big] \\
&= \underbrace{\Pr[X_i = 1]}_{\text{share of incumbents}} \underbrace{\big(E[Y_i \mid D_i = 1, X_i = 1] - E[Y_i \mid D_i = 0, X_i = 1]\big)}_{\text{diff-in-means for incumbents}} \\
&\quad + \underbrace{\Pr[X_i = 0]}_{\text{share of challengers}} \underbrace{\big(E[Y_i \mid D_i = 1, X_i = 0] - E[Y_i \mid D_i = 0, X_i = 0]\big)}_{\text{diff-in-means for challengers}}
\end{aligned}
$$

• We call this "averaging over $X_i$".
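• A sketch of this weighted average in R on simulated data (the stratum shares and effects are invented for illustration):

```r
## Stratified estimator: within-stratum differences in means,
## weighted by the share of units in each stratum.
strat_ate <- function(y, d, x) {
  sum(sapply(unique(x), function(s) {
    in_s <- x == s
    mean(in_s) * (mean(y[d == 1 & in_s]) - mean(y[d == 0 & in_s]))
  }))
}

set.seed(3)
x <- rbinom(1000, 1, 0.3)                       # 1 = incumbent (made up)
d <- rbinom(1000, 1, ifelse(x == 1, 0.7, 0.3))  # p_i(x) varies by stratum
y <- rnorm(1000, mean = (1 + x) * d + 2 * x)    # true ATE = 1.3
strat_ate(y, d, x)                    # close to 1.3
mean(y[d == 1]) - mean(y[d == 0])     # raw diff-in-means is biased here
```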

Effect modification

• Averaging over $X_i$ might hide some interesting variation in the treatment effects:

▶ Effect of negativity might vary by incumbency status?
▶ Effect of clientelistic messages varies by gender of recipient?
▶ Effect of having daughters varies by gender?
• This means the conditional ATE (CATE) is non-constant:

$$
\tau(x) \equiv E[Y_i(1) - Y_i(0) \mid X_i = x] \neq E[Y_i(1) - Y_i(0) \mid X_i = x^*] \equiv \tau(x^*)
$$

• The difference between $\tau(x)$ and $\tau(x^*)$ might be causal or not.
• Under randomization or conditional unconfoundedness, the CATE is identified from the within-strata difference in means (see last slide):

$$
\tau(x) = E[Y_i \mid D_i = 1, X_i = x] - E[Y_i \mid D_i = 0, X_i = x]
$$
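• A sketch of estimating CATEs in R, re-creating the simulated stratified design from above (all quantities hypothetical):

```r
## CATE sketch: per-stratum differences in means recover tau(x).
set.seed(3)
x <- rbinom(1000, 1, 0.3)                       # hypothetical stratum
d <- rbinom(1000, 1, ifelse(x == 1, 0.7, 0.3))  # stratified assignment
y <- rnorm(1000, mean = (1 + x) * d + 2 * x)    # tau(0) = 1, tau(1) = 2
tapply(seq_along(x), x, function(idx)
  mean(y[idx][d[idx] == 1]) - mean(y[idx][d[idx] == 0]))
```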

Estimation and Inference

• Up until now, we've talked about identification.
• Now that we know that the ATE is identified, how will we estimate it?
• Remember: identification first, then estimation.

Samples versus Populations

• Remember the differences between the population, $U$, of size $N$, with the PATE:

$$
\text{PATE} = \tau = E[Y_i(1) - Y_i(0)]
$$

• And the sample, $S$, from the population, of size $n$, with the SATE:

$$
\text{SATE} = \tau_S = \frac{1}{n} \sum_{i \in S} [Y_i(1) - Y_i(0)]
$$

• Today, we will focus on the Neyman approach to estimation and inference:

▶ derive estimators for these quantities and,
▶ derive the properties of these estimators under repeated samples/randomizations.
• Next week, we'll discuss an alternative approach proposed by Fisher.

Finite sample results

• Finite sample results take the observed sample as the target of interest.
• Let $n_t$ be the number of treated units in the sample and $n_c = n - n_t$.
• Once we assign some units to treatment and some to control, we do not observe both $Y_i(1)$ and $Y_i(0)$ for any unit, so we cannot calculate the SATE. We can, however, estimate it:

$$
\hat{\tau}_S = \underbrace{\frac{1}{n_t} \sum_{i=1}^n D_i Y_i}_{\text{mean among treated}} - \underbrace{\frac{1}{n_c} \sum_{i=1}^n (1 - D_i) Y_i}_{\text{mean among control}}
$$

• Conditional on the sample, the only variation in $\hat{\tau}_S$ is from the treatment assignment.
• Unconditionally, there are two sources of variation: the treatment assignment and the sampling procedure.

Repeated samples/randomizations

randomization 1:  C T C T T C  →  $\hat{\tau}_S^{(1)}$
randomization 2:  T T C C T C  →  $\hat{\tau}_S^{(2)}$
randomization 3:  C T T C C T  →  $\hat{\tau}_S^{(3)}$
randomization 4:  T C C T C T  →  $\hat{\tau}_S^{(4)}$
⋮
⇝ randomization distribution

• The randomization distribution is a special version of the sampling distribution of this estimator.
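• A simulation sketch of this picture in R (the potential outcomes are invented): draw many complete randomizations of one fixed sample and recompute the estimate each time.

```r
## Randomization distribution sketch for a fixed 6-unit sample.
y1 <- c(5, 3, 8, 6, 2, 7)   # hypothetical potential outcomes
y0 <- c(4, 1, 6, 5, 2, 3)
taus <- replicate(10000, {
  d <- sample(rep(c(1, 0), each = 3))  # one complete randomization
  y <- d * y1 + (1 - d) * y0           # reveal observed outcomes
  mean(y[d == 1]) - mean(y[d == 0])
})
mean(taus)     # approximately the SATE, mean(y1 - y0): unbiasedness
var(taus)      # the finite-sample sampling variance (next slides)
```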

Finite-sample properties

• What are the properties of $\hat{\tau}_S$ in repeated samples/randomizations? What does the distribution look like?
• Unbiasedness: is the mean of the randomization distribution equal to the true SATE?
• Sampling variance: what is the variance of the randomization distribution?
• Knowing these will allow us to construct confidence intervals, conduct tests, etc.

Unbiasedness

• In a completely randomized experiment, $\hat{\tau}_S$ is unbiased for $\tau_S$.
• Let $\mathbf{O} = \{\mathbf{Y}(1), \mathbf{Y}(0)\}$ be the potential outcomes.

$$
\begin{aligned}
E[\hat{\tau}_S \mid S, \mathbf{O}] &= \frac{1}{n_t} \sum_{i=1}^n E[D_i Y_i \mid S, \mathbf{O}] - \frac{1}{n_c} \sum_{i=1}^n E[(1 - D_i) Y_i \mid S, \mathbf{O}] \\
&= \frac{1}{n} \sum_{i=1}^n \left( \frac{n}{n_t} \cdot E[D_i Y_i \mid S, \mathbf{O}] - \frac{n}{n_c} \cdot E[(1 - D_i) Y_i \mid S, \mathbf{O}] \right) \\
&= \frac{1}{n} \sum_{i=1}^n \left( \frac{n}{n_t} \cdot E[D_i Y_i(1) \mid S, \mathbf{O}] - \frac{n}{n_c} \cdot E[(1 - D_i) Y_i(0) \mid S, \mathbf{O}] \right) \\
&= \frac{1}{n} \sum_{i=1}^n \left( \frac{n}{n_t} \cdot E[D_i \mid S, \mathbf{O}] \cdot Y_i(1) - \frac{n}{n_c} \cdot E[(1 - D_i) \mid S, \mathbf{O}] \cdot Y_i(0) \right) \\
&= \frac{1}{n} \sum_{i=1}^n \left( \frac{n}{n_t} \cdot \frac{n_t}{n} \cdot Y_i(1) - \frac{n}{n_c} \cdot \frac{n_c}{n} \cdot Y_i(0) \right) \\
&= \frac{1}{n} \sum_{i \in S} \big(Y_i(1) - Y_i(0)\big) = \tau_S
\end{aligned}
$$

Finite-sample sampling variance

• It turns out that the sampling variance of the difference-in-means estimator is:

$$
\mathbb{V}(\hat{\tau}_S \mid S) = \frac{S_c^2}{n_c} + \frac{S_t^2}{n_t} - \frac{S_\tau^2}{n}
$$

• $S_c^2$ and $S_t^2$ are the in-sample variances of $Y_i(0)$ and $Y_i(1)$, respectively:

$$
S_c^2 = \frac{1}{n - 1} \sum_{i=1}^n \big(Y_i(0) - \bar{Y}(0)\big)^2 \qquad S_t^2 = \frac{1}{n - 1} \sum_{i=1}^n \big(Y_i(1) - \bar{Y}(1)\big)^2
$$

• Here, $\bar{Y}(d) = (1/n) \sum_{i=1}^n Y_i(d)$.
• The last term is the in-sample variance of the individual treatment effects:

$$
S_\tau^2 = \frac{1}{n - 1} \sum_{i=1}^n \big(Y_i(1) - Y_i(0) - \tau_S\big)^2
$$

Finite-sample sampling variance

$$
\mathbb{V}(\hat{\tau}_S \mid S) = \frac{S_c^2}{n_c} + \frac{S_t^2}{n_t} - \frac{S_\tau^2}{n}
$$

• If the treatment effects are constant across units, then $S_\tau^2 = 0$.
• ⇝ the in-sample variance is largest when treatment effects are constant.
• Intuition from two-unit samples:

Setup 1 (constant effects):          Setup 2 (offsetting effects):

            i = 1   i = 2   Avg.                 i = 1   i = 2   Avg.
  Y_i(0)      10     -10      0        Y_i(0)     -10      10      0
  Y_i(1)      10     -10      0        Y_i(1)      10     -10      0
  tau_i        0       0      0        tau_i       20     -20      0

• Both have $\tau_S = 0$; the first has constant effects.
• In the first setup, $\hat{\tau}_S = 20$ or $\hat{\tau}_S = -20$ depending on the randomization.
• In the second setup, $\hat{\tau}_S = 0$ in either randomization.
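• A quick R check of the two-unit intuition (potential outcomes copied from the table above):

```r
## For n = 2 with one treated unit t, the estimator is
## Y_t(1) minus the other unit's Y(0).
est <- function(y1, y0, t) y1[t] - y0[-t]
y0_a <- c(10, -10); y1_a <- c(10, -10)      # setup 1: constant effects
y0_b <- c(-10, 10); y1_b <- c(10, -10)      # setup 2: offsetting effects
c(est(y1_a, y0_a, 1), est(y1_a, y0_a, 2))   #  20 -20: high variance
c(est(y1_b, y0_b, 1), est(y1_b, y0_b, 2))   #   0   0: zero variance
```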

Estimating the sampling variance

• We can use the sample variances within levels of $D_i$ to estimate $S_c^2$ and $S_t^2$:

$$
s_c^2 = \frac{1}{n_c - 1} \sum_{i : D_i = 0} (Y_i - \bar{Y}_c)^2 \qquad s_t^2 = \frac{1}{n_t - 1} \sum_{i : D_i = 1} (Y_i - \bar{Y}_t)^2
$$

• Here, $\bar{Y}_c = (1/n_c) \sum_{i=1}^n (1 - D_i) Y_i$ and $\bar{Y}_t = (1/n_t) \sum_{i=1}^n D_i Y_i$.
• But what about $S_\tau^2$?

$$
S_\tau^2 = \frac{1}{n - 1} \sum_{i=1}^n \big(\underbrace{Y_i(1) - Y_i(0)}_{???} - \tau_S\big)^2
$$

• What to do?

Conservative variance estimation

$$
\mathbb{V}(\hat{\tau}_S \mid S) = \frac{S_c^2}{n_c} + \frac{S_t^2}{n_t} - \frac{S_\tau^2}{n}
$$

• We will estimate this quantity with the so-called Neyman (or robust) variance estimator:

$$
\widehat{\mathbb{V}} = \frac{s_c^2}{n_c} + \frac{s_t^2}{n_t}
$$

• Notice that $\widehat{\mathbb{V}}$ is biased for $\mathbb{V}$, but that the bias is always positive.
• Construct CIs and conduct hypothesis tests as usual.
• Leads to conservative inferences:

▶ Standard errors, $\sqrt{\widehat{\mathbb{V}}}$, will be at least as big as they should be.
▶ Confidence intervals using $\sqrt{\widehat{\mathbb{V}}}$ will be at least as wide as they should be.
▶ Type I error rates will still be correct; power will be lower.
▶ Both will be exactly right if treatment effects are constant.
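• A sketch of the Neyman estimator in R on simulated data (the helper name neyman_var is illustrative):

```r
## Conservative (Neyman) variance estimator and a normal-approximation
## 95% CI; var() already uses the n - 1 denominator.
neyman_var <- function(y, d) {
  var(y[d == 1]) / sum(d) + var(y[d == 0]) / sum(1 - d)
}

set.seed(4)
d <- sample(rep(c(1, 0), each = 50))  # completely randomized, n = 100
y <- rnorm(100, mean = 2 * d)         # true effect of 2
tau_hat <- mean(y[d == 1]) - mean(y[d == 0])
se <- sqrt(neyman_var(y, d))
tau_hat + c(-1.96, 1.96) * se         # conservative 95% CI
```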

Population estimands

• Now imagine we want to estimate the PATE, $\tau$.
• Implied DGP: simple random sample (SRS) from the population, then a randomized experiment within the sample.

▶ ⇝ the sample mean is unbiased for the population mean: $E_S[\tau_S] = \tau$
▶ $E_S[\cdot]$ is the expectation over repeated samples from the population.
• How does our difference-in-means estimator do?

$$
E_S[\hat{\tau}_S] = \underbrace{E_S\big[E[\hat{\tau}_S \mid S]\big]}_{\text{iterated expectations}} = \underbrace{E_S[\tau_S]}_{\text{SATE unbiasedness}} = \tau
$$

• $\hat{\tau}_S$ is unbiased for the PATE!

Population sampling variance

• What about the sampling variance of $\hat{\tau}_S$ when estimating the PATE?
• It turns out that the sampling variance of the estimator is simply:

$$
\mathbb{V}(\hat{\tau}_S) = \frac{\sigma_c^2}{N_c} + \frac{\sigma_t^2}{N_t}
$$

• Here, $\sigma_c^2$ and $\sigma_t^2$ are the population-level variances of $Y_i(0)$ and $Y_i(1)$, respectively.
• The third term drops out ⇝ higher variance for the PATE than the SATE.

Estimating pop. sampling variance

$$
\mathbb{V}(\hat{\tau}_S) = \frac{\sigma_c^2}{N_c} + \frac{\sigma_t^2}{N_t}
$$

• Notice that the Neyman estimator $\widehat{\mathbb{V}}$ is now unbiased for $\mathbb{V}(\hat{\tau}_S)$:

$$
\widehat{\mathbb{V}} = \frac{s_c^2}{n_c} + \frac{s_t^2}{n_t}
$$

• Two interpretations of $\widehat{\mathbb{V}}$:
1. An unbiased estimator for the sampling variance of the traditional estimator of the PATE.
2. A conservative estimator for the sampling variance of the traditional estimator of the SATE.

Analyzing experiments with regression?

• Can we just use regression to estimate the ATE in this case?

▶ lm(y ~ d)?

• Call the coefficient on $D_i$ the regression estimator: $\hat{\tau}_{\text{ols}}$.
• We can justify this using the consistency relationship:

$$
\begin{aligned}
Y_i &= D_i Y_i(1) + (1 - D_i) Y_i(0) \\
&= D_i Y_i(1) + (1 - D_i) Y_i(0) + E[Y_i(0)] - E[Y_i(0)] \\
&\qquad + D_i E[Y_i(1) - Y_i(0)] - D_i E[Y_i(1) - Y_i(0)] \\
&= E[Y_i(0)] + D_i E[Y_i(1) - Y_i(0)] + \big(Y_i(0) - E[Y_i(0)]\big) \\
&\qquad + D_i \cdot \big((Y_i(1) - Y_i(0)) - E[Y_i(1) - Y_i(0)]\big) \\
&= \alpha + D_i \tau + \epsilon_i
\end{aligned}
$$

• See that $\alpha = E[Y_i(0)]$ and remember that $\tau = E[Y_i(1) - Y_i(0)]$.
• The residual here is the deviation of the control potential outcome from its mean, plus the treatment effect heterogeneity.

Independent errors

$$
\epsilon_i = \big(Y_i(0) - E[Y_i(0)]\big) + D_i \cdot \big((Y_i(1) - Y_i(0)) - E[Y_i(1) - Y_i(0)]\big)
$$

• Let's check whether the errors here are mean-independent of the treatment, which would imply that the regression estimator $\hat{\tau}_{\text{ols}}$ is unbiased for $\tau$:

$$
\begin{aligned}
E[\epsilon_i \mid D_i = 0] &= E\big[Y_i(0) - E[Y_i(0)] \mid D_i = 0\big] \\
&= E[Y_i(0) \mid D_i = 0] - E[Y_i(0)] = 0
\end{aligned}
$$

• and for $D_i = 1$:

$$
\begin{aligned}
E[\epsilon_i \mid D_i = 1] &= E\big[Y_i(1) - E[Y_i(0)] - E[Y_i(1) - Y_i(0)] \mid D_i = 1\big] \\
&= E[Y_i(1) \mid D_i = 1] - E[Y_i(1)] = 0
\end{aligned}
$$

• Thus, just using the randomization assumption, we have justified the use of regression.
• No functional form assumptions at all, only consistency.
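• A sketch in R of these equivalences on simulated data (the HC2/Neyman match is a known design-based result; assumes the sandwich package is available):

```r
## With a single binary regressor, OLS reproduces the difference in
## means, and HC2 robust standard errors match the Neyman estimator.
set.seed(5)
d <- rbinom(500, 1, 0.5)
y <- rnorm(500, mean = 1.5 * d)
fit <- lm(y ~ d)
coef(fit)["d"]                        # regression estimator ...
mean(y[d == 1]) - mean(y[d == 0])     # ... equals diff-in-means
sqrt(sandwich::vcovHC(fit, type = "HC2")["d", "d"])          # HC2 SE ...
sqrt(var(y[d == 1]) / sum(d) + var(y[d == 0]) / sum(1 - d))  # ... Neyman SE
```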

Including covariates

• Completely randomized design ⇝ no need to control for covariates.
• Adding covariates won't matter for unbiasedness/consistency.

▶ (Not true for stratified designs!)

• Still consistent even if the functional form for $X_i$ is misspecified.
• Effect of conditioning on covariates: reduced uncertainty in the effect estimates.

Next week

• More experiments, this time under Fisherian inference.
• Randomization inference: even fewer assumptions.
• Back to the …!
• Then: regression, …, etc.!