Methods Notes

Apoorva Lal ∗

Thursday 2nd September, 2021 11:34

∗ Originally written as notes for Comprehensive Exam in Spring 2020. Now maintained as a stats/metrics brain-dump for reference.

Contents

1 Brand-name Distributions
2 Probability and Mathematical Statistics
   2.1 Basic Concepts and Distribution Theory
   2.2 Densities and Distributions
      2.2.1 Multivariate Distributions
   2.3 Moments
   2.4 Transformations of Random Variables
      2.4.1 Useful Inequalities
   2.5 Transformations and Conditional Distributions
      2.5.1 Distributions facts and links between them
   2.6 Statistical Decision Theory
   2.7 Estimation
   2.8 Hypothesis Testing
   2.9 Some Asymptotics results
   2.10 Parametric Models
   2.11 Robustness
3 Linear Regression
   3.1 Simple Linear Regression
      3.1.1 OLS in Summation Form
      3.1.2 Prediction
   3.2 Classical Linear Model
      3.2.1 Gauss Markov Assumptions
      3.2.2 OLS Derivation
   3.3 Geometry of OLS
      3.3.1 Partitioned Regression
   3.4 Relationships between Exogeneity Assumptions
   3.5 Finite and Large Sample Properties of β̂, σ̂²
   3.6 Residuals and Diagnostics
   3.7 Other Least-Squares Estimators
   3.8 Measures of Goodness of Fit
      3.8.1 Model Selection
   3.9 Multiple Testing Corrections
   3.10 Quantile Regression
      3.10.1 Interpreting Quantile Regression Models
   3.11 Measurement Error
   3.12 Missing Data
   3.13 Inference on functions of parameters
      3.13.1 (Nonparametric) Bootstrap
      3.13.2 Propagation of Error / Delta Method
      3.13.3 MVN Simulations
4 Causal Inference
   4.1 Identification
   4.2
      4.2.1 Potential Outcomes (Rubin 1974, Holland 1986)
      4.2.2 Treatment Effects
      4.2.3 Regression Formulation
      4.2.4 Difference in Means
      4.2.5 Randomisation Inference
      4.2.6
      4.2.7 Power Calculations
      4.2.8 Partial Identification
   4.3 Selection On Observables
      4.3.1 Regression Anatomy / FWL
      4.3.2 Treatment Effects under Unconfoundedness
      4.3.3 Regression Adjustment
      4.3.4 Subclassification
      4.3.5 Matching
      4.3.6 Propensity Score
      4.3.7 Double Robust Estimators
      4.3.8 Sensitivity Analysis
   4.4 Instrumental Variables
      4.4.1 Traditional IV Framework (Constant Treatment Effects)
      4.4.2 Weak Instruments
      4.4.3 IV with Heterogeneous Treatment Effects / LATE Theorem
      4.4.4 Characterising Compliers
      4.4.5 Marginal Treatment Effects: Treatment effects under self selection
      4.4.6 High Dimensional IV selection
      4.4.7 Principal Stratification
   4.5 Regression Discontinuity Design
      4.5.1 Estimators
      4.5.2 Fuzzy RD
   4.6 Differences-in-Differences
      4.6.1 DiD
      4.6.2 DiD Estimators
   4.7 Panel Data
      4.7.1 Fixed Effects Regression
      4.7.2 Random Effects
      4.7.3 Hausman Test: Choosing between FE and RE
      4.7.4 Time Trends
      4.7.5 Distributed Lag
      4.7.6 ‘Generalised’ Difference in Differences: staggered adoption
      4.7.7 Synthetic Control
   4.8 Generalised Method of Moments
   4.9 Decomposition Methods
      4.9.1 Oaxaca-Blinder Decomposition
      4.9.2 Distributional Regression
   4.10 Causal Directed Acyclic Graphs
      4.10.1 Basics / Terminology
      4.10.2 Mediation Analysis
5 Semiparametrics and Nonparametrics
   5.1 Semiparametric Theory
      5.1.1 Empirical Processes Background
      5.1.2 Influence Functions
      5.1.3 Tangent Spaces
   5.2 Semiparametric Theory for Causal Inference
   5.3 Nonparametric
      5.3.1 Conditional Density and Distribution Function Estimation
   5.4
   5.5 Semiparametric Regression
      5.5.1 Index Models
   5.6 Splines
      5.6.1 Reproducing Kernel Hilbert Spaces
   5.7 Gaussian Processes
      5.7.1 Bayesian Linear Regression
6 Maximum Likelihood
   6.1 Properties of Maximum Likelihood Estimators
   6.2 QMLE / Misspecification / Information Theory
      6.2.1 Robust Standard Errors
   6.3 Computation
   6.4 Testing
   6.5 Binary Choice
   6.6 Discrete Choice
      6.6.1 Ordered
      6.6.2 Unordered
   6.7 Counts and Rates
      6.7.1 Counts
      6.7.2 Rates
   6.8 Truncation and Censored Regressions
      6.8.1 Truncated Regression
      6.8.2 Tobit Regression
   6.9 Generalised Linear Models Theory
      6.9.1 ML estimation
7 Machine Learning
   7.1 General Setup
   7.2
      7.2.1 Regularised Regression
      7.2.2 Classification
      7.2.3 Goodness of Fit for Classification
   7.3
   7.4 Double Machine Learning for ATEs and CATEs
      7.4.1 Average Treatment Effects
      7.4.2 Heterogeneous Treatment Effects with selection on Observables
8 Bayesian Statistics
   8.1 Setup
   8.2 Conjugate Priors and Updating
   8.3 Computation / Markov Chains
   8.4 Hierarchical Models
      8.4.1 Empirical Bayes
      8.4.2 Hierarchy of Bayesianity
   8.5 Graphical Models
      8.5.1 Empirical Bayes
9 Dependent Data: Time Series and spatial statistics
   9.1 Time Series
      9.1.1 Regression with time series
   9.2 Spatial Statistics
      9.2.1 Spatial Linear Regression
      9.2.2 Spatial Modelling

A Mathematical Background
   A.1 Proof Techniques
   A.2 Set Theory
      A.2.1 Relations
      A.2.2 Intervals and Contour Sets
      A.2.3 Algebra
   A.3 Analysis and Topology
      A.3.1 Metric Spaces
   A.4 Functions
      A.4.1 Fixed Points
   A.5 Measure
   A.6 Integration
   A.7 Probability Theory
      A.7.1 Densities
      A.7.2 Moments
      A.7.3 Random vectors
      A.7.4 Product Measures and Independence
      A.7.5 Conditional Expectations
      A.7.6 Order Statistics
   A.8 Linear Functions and Linear Algebra
      A.8.1 Linear Functions
      A.8.2 Projection
      A.8.3 Matrix Decompositions
      A.8.4 Matrix Identities
      A.8.5 Partitioned Matrices
   A.9 Function Spaces
      A.9.1 Lp spaces
      A.9.2 o(), O() Notation
   A.10 Optimisation
   A.11 Calculus
      A.11.1 Differentiation Rules
      A.11.2 Matrix Derivatives
      A.11.3 Constrained Maximisation
B Bibliography

1 Brand-name Distributions

For each distribution: notation, support, CDF/pdf (or pmf), E[X], V[X], rationale, and relations to other distributions.

• Bernoulli: X ∼ Bern(p) on {0, 1}; pmf p^x (1 − p)^{1−x}; E[X] = p, V[X] = p(1 − p). Rationale: single coin flip. Relations: Bern(p) = Bin(1, p).

• Binomial: X ∼ Bin(n, p) on {0, 1, …, n}; CDF I_{1−p}(n − x, x + 1); pmf (n choose x) p^x (1 − p)^{n−x}; E[X] = np, V[X] = np(1 − p). Rationale: sum of n IID Bernoulli rvs. Relations: Bin(n, p) ≈ N(np, np(1 − p)) when np and n(1 − p) are both large; Bin(n, p) ≈ Poi(np) when n is large and p is small.

• Multinomial: X ∼ Mult(n, p) on nonnegative integer vectors with Σ_i x_i = n; pmf n!/(x_1! ⋯ x_k!) p_1^{x_1} ⋯ p_k^{x_k}; E[X_i] = np_i, V[X_i] = np_i(1 − p_i), Cov[X_i, X_j] = −np_i p_j. Rationale: multivariate analogue of the binomial. Relations: marginals are binomial.

• Poisson: X ∼ Poi(λ) on {0} ∪ Z^+; CDF e^{−λ} Σ_{i=0}^{x} λ^i / i!; pmf e^{−λ} λ^x / x!; E[X] = V[X] = λ. Rationale: counts of events in a Poisson process. Relations: Poi(λ) ≈ N(λ, λ) when λ is large.

• Negative Binomial: X ∼ NBin(r, p) on {0} ∪ Z^+; CDF I_p(r, x + 1); pmf (x + r − 1 choose x) p^r (1 − p)^x; E[X] = r(1 − p)/p, V[X] = r(1 − p)/p². Rationale: sum of r IID geometric rvs. Relations: normal approximation when r(1 − p) is large.

• Uniform: X ∼ Unif(a, b) on [a, b] ⊂ R; CDF 0 for x < a, (x − a)/(b − a) for a < x < b, 1 for x > b; pdf 1(a < x < b)/(b − a); E[X] = (a + b)/2, V[X] = (b − a)²/12. Rationale: equally likely outcomes. Relations: Beta(1, 1) = Unif(0, 1).

• Normal: X ∼ N(µ, σ²) on R; CDF Φ(x) = ∫_{−∞}^{x} φ(t) dt; pdf φ(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)); E[X] = µ, V[X] = σ². Rationale: limiting distribution (CLT). Relations: Z ∼ N(0, 1) ⇒ Z² ∼ χ²_1 = Gamma(1/2, 1/2).

• Multivariate Normal: X ∼ N_p(µ, Σ) on R^k; pdf (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)′Σ^{−1}(x − µ)); E[X] = µ, V[X] = Σ. Rationale: multivariate analogue of N(µ, σ²). Relations: marginals are N(µ_i, σ_i²).

• Student's t: X ∼ Student(ν) on R; pdf Γ((ν+1)/2) / (√(νπ) Γ(ν/2)) · (1 + x²/ν)^{−(ν+1)/2}; E[X] = 0 for ν > 1; V[X] = ν/(ν − 2) for ν > 2, ∞ for 1 < ν ≤ 2. Rationale: distribution of the studentised mean √n(X̄ − µ)/S. Relations: X ∼ N(0, 1), Y ∼ χ²_ν ⇒ X/√(Y/ν) ∼ t(ν).

• Chi-square: X ∼ χ²_k on (0, ∞); CDF γ(k/2, x/2)/Γ(k/2); pdf (1/(2^{k/2} Γ(k/2))) x^{k/2 − 1} e^{−x/2}; E[X] = k, V[X] = 2k. Rationale: sampling distribution of the sample variance; sum of squares of IID normals. Relations: X ∼ N(0, 1) ⇒ X² ∼ χ²_1.

• F: X ∼ F(d_1, d_2) on (0, ∞); CDF I_{d_1 x/(d_1 x + d_2)}(d_1/2, d_2/2); E[X] = d_2/(d_2 − 2) for d_2 > 2; V[X] = 2d_2²(d_1 + d_2 − 2)/(d_1(d_2 − 2)²(d_2 − 4)) for d_2 > 4. Rationale: ratio of sums of squares. Relations: X ∼ χ²_{d_1}, Y ∼ χ²_{d_2} ⇒ (X/d_1)/(Y/d_2) ∼ F(d_1, d_2).

• Exponential: X ∼ Exp(λ) on (0, ∞); CDF 1 − e^{−λx}; pdf λe^{−λx}; E[X] = 1/λ, V[X] = 1/λ². Rationale: wait time of a Poisson process. Relations: Exp(λ) = Gamma(1, λ).

• Gamma: X ∼ Gamma(α, λ) on (0, ∞); CDF γ(α, λx)/Γ(α); pdf (λ^α/Γ(α)) x^{α − 1} e^{−λx}; E[X] = α/λ, V[X] = α/λ². Rationale: sum of IID exponential rvs. Relations: ≈ N(α/λ, α/λ²) when α is large.

• Beta: X ∼ Beta(α, β) on (0, 1); CDF I_x(α, β); pdf Γ(α + β)/(Γ(α)Γ(β)) x^{α − 1}(1 − x)^{β − 1}; E[X] = α/(α + β), V[X] = αβ/((α + β)²(α + β + 1)). Rationale: ratio of gamma rvs. Relations: Unif(0, 1) = Beta(1, 1).

• Dirichlet: X ∼ Dir(α) on the unit simplex; pdf Γ(Σ_i α_i)/Π_i Γ(α_i) · Π_i x_i^{α_i − 1}; E[X_i] = α_i/Σ_i α_i, V[X_i] = E[X_i](1 − E[X_i])/(Σ_i α_i + 1). Rationale: multivariate analogue of the Beta. Relations: marginals are Beta.

• Weibull: X ∼ Weibull(λ, k) on [0, ∞); CDF 1 − e^{−(x/λ)^k}; pdf (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k}; E[X] = λΓ(1 + 1/k), V[X] = λ²Γ(1 + 2/k) − µ² where µ = E[X]. Rationale: failure rates.

• Pareto: X ∼ Pareto(x_m, α) on x ≥ x_m; CDF 1 − (x_m/x)^α; pdf αx_m^α / x^{α+1}; E[X] = αx_m/(α − 1) for α > 1, V[X] = x_m²α/((α − 1)²(α − 2)) for α > 2. Rationale: long tails, e.g. income.
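A minimal base-R simulation sketch of two of the relations in the table above: the normal approximation to Bin(n, p) and a Gamma variable as a sum of IID exponentials. The sample sizes and parameter values are arbitrary choices for illustration.

```r
## Check Bin(n, p) ~ N(np, np(1-p)) and Gamma(a, rate) as a sum of a exponentials.
set.seed(42)

n <- 200; p <- 0.3
x <- rbinom(1e5, size = n, prob = p)
c(mean = mean(x), var = var(x))          # compare with np = 60, np(1 - p) = 42

a <- 5; rate <- 2
g1 <- rgamma(1e5, shape = a, rate = rate)                    # direct Gamma draws
g2 <- colSums(matrix(rexp(a * 1e5, rate = rate), nrow = a))  # sums of a exponentials
round(c(mean(g1), mean(g2), var(g1), var(g2)), 3)            # both near a/rate, a/rate^2
```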

2 Probability and Mathematical Statistics

2.1 Basic Concepts and Distribution Theory

Defn 2.1 (Probability). Given a measurable space (Ω, F), if P[Ω] = 1, P[·] is called a probability measure and (Ω, F, P) is a probability space. Sets f ∈ F are called events, points ω ∈ Ω are called outcomes, and P(f) is called the probability of f.

Defn 2.2 (Kolmogorov Axioms). The triple (Ω, S, P) is a probability space if it satisfies the following
• Unitarity: Pr(Ω) = 1
• Non-negativity: ∀a ∈ S, Pr(a) ≥ 0 and Pr(a) ∈ R ∧ Pr(a) < ∞
• Countable Additivity: if A_1, A_2, … ∈ S are pairwise disjoint [i.e. ∀i ≠ j, A_i ∩ A_j = ∅], then
  P(∪_{i=1}^{∞} A_i) = Σ_{i=1}^{∞} P(A_i)
Other properties, for any events A, B:
• A ⊂ B =⇒ Pr(A) ≤ Pr(B)
• Pr(A) ≤ 1
• Pr(A) = 1 − Pr(A^c)
• Pr(∅) = 0

Fact 2.1 (Properties of Probability). For events A, B ∈ E,
1. 0 ≤ Pr(A) ≤ 1 : events range from never happening to always happening
2. Pr(E) = 1 : something must happen
3. Pr(∅) = 0 : nothing never happens
4. Pr(A) + Pr(A^c) = 1 : A must either happen or not happen

Defn 2.3 (Random Variable). X : Ω → R s.t. ∀x ∈ R, {ω : X(ω) ≤ x} ∈ F, where Ω is the sample space and F is the event space; i.e. a RV is a mapping/function from the sample space (or, per some authors, the event space) to the real line.

Example 2.2 (Continuous Random Variable).
• Sample space is R
• Event space is B(R): the Borel σ-algebra on the real line
• P_x is defined so that ∀A ∈ B(R),
  P_x(A) = P_ω(ω ∈ Ω : x(ω) ∈ A) =: P_ω(x^{−1}(A))

Defn 2.4 (De Morgan's Laws, Conditional Probability).
• DM: (A ∩ B) = (A^C ∪ B^C)^C ; (A ∪ B) = (A^C ∩ B^C)^C
• Inclusion-Exclusion Rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
• Conditional Probability: P(A|B) = P(A ∩ B)/P(B)

Theorem 2.3 (Bayes Rule).
  P(A|B) = P(B|A)P(A) / P(B)
Equivalently,
  f(θ|x) = f(x|θ) f(θ) / ∫_{θ′∈Θ} f(x|θ′) f(θ′) dθ′ = (likelihood × prior) / f(x), where f(x) is the marginal likelihood of the data.

Defn 2.5 (Statistical Independence). A ⊥⊥ B ⇔ P(A ∩ B) = P(A)P(B), P(A|B) = P(A)

2.2 Densities and Distributions

Defn 2.6 ((Cumulative) Distribution Function). F : R → [0, 1],
  F_X(x) = Pr(X ≤ x) = ∫_{−∞}^{x} p(t) dt
Properties of CDFs:
1. Bounded on [0, 1]: lim_{x→∞} F(x) = 1; lim_{x→−∞} F(x) = 0
2. Nondecreasing: if x_1 < x_2, then F(x_1) ≤ F(x_2)
3. Right continuous: lim_{h→0+} F(x + h) = F(x)
4. lim_{h→0+} F(x − h) = F(x−) = F(x) − Pr(X = x) = Pr(X < x)
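A small sketch of Bayes' rule (Theorem 2.3) on a discrete grid, assuming a flat prior over a coin's success probability and a binomial likelihood; the grid and the data (7 successes in 10 flips) are made up for illustration.

```r
## Posterior over theta after observing k successes in n flips: posterior ∝ likelihood × prior.
theta <- seq(0.01, 0.99, by = 0.01)              # candidate parameter values
prior <- rep(1 / length(theta), length(theta))   # flat prior f(theta)
k <- 7; n <- 10
lik  <- dbinom(k, size = n, prob = theta)        # f(x | theta)
post <- lik * prior / sum(lik * prior)           # f(theta | x), normalised
theta[which.max(post)]                           # posterior mode, near k/n = 0.7
sum(theta * post)                                # posterior mean
```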

← ToC 5 Suppose F′(x) exists ∀x ∈ R and ∫ ∞ F′(x)dx < ∞ −∞ then F is absolutely continuous with density function F′ = f (·) Defn 2.7 (Probability Density / mass Function). f : R→R A density is such that ∫ x F(x) = p(t)dt , −∞ < x < ∞ −∞ which defines the density / PMF : F′ ≡ f (x) = | {z(x}) |Pr (X{z= x}) Continuous Version Discrete version wherever F′(·) exists. Since F is nondecreasing, f is nonnegative and must have Figure 1: CDF and Quantile function. Rotate and flip CDF to get QF ∫ ∞ f(x)dx = 1 −∞ If the distribution of Y is continuous, one can show that the τ−th quantile of the distribution of Yi =: Qτ minimises the distance between Yi and y ∈ R, where the Fact 2.4 (Integration w.r.t. a distribution function). distance is defined as the check function. Suppose X is a random-variable with distribution function F. Then we expect that A ⊂ R for any set ∫ Properties of quantile functions ∈ A F Pr (X ) = d (x) Q(F(x)) ≤ x , −∞ < x < ∞ A 1. If X has an absolutely continuous distribution, this integral simplifies to the famil- 2. F(Q(t)) ≥ 1 , 0 < t < 1 iar form ∫ ∫ x x 3. Q(t) ≤ x ⇔ F(x) ≥ t F F (x) = d (t) = p(t)dt − − −∞ −∞ 4. If F 1 exists, then Q(t) = F 1(t) and more generally Q ≤ Q ∫ ∞ ∫ ∞ 5. if t1 < t2, (t1) (t2) g(x)dF(x) = g(x)f (x) dx −∞ −∞ Fact 2.5 (Equivariance of quantiles under monotone transformations). Defn 2.8 (Quantile Function / Inverse-CDF). Let g(.) be a nondecreasing function. Then, for a r.v. Y , Since real valued R.V. can be characterised by its F s.t. F (x) := Pr (X ≤ x), we can Qτ [g(Y )] = g[Qτ (Y )] ≤ invert it. In other words, we can ask for the point xp s.t. Pr (X xp) = τ for any i.e. the quantiles of g(Y ) coincide with transformed quantiles of Y . τ ∈ [0, 1]. This defines the quantile function Q : (0, 1)→R where Fact 2.6 (Lorenz Curve). − let Y be a positive random variable (e.g. income) with distribution function F Q (τ) ≡ F 1(τ) := inf {x : F (x) ≥ τ} Y x and µ < ∞; then the Lorenz curve mayb be written in terms of the quantile is called the τth quantile of F . The associated loss-function is the check function function Q (τ) Y ∫ ρ (u) = u(τ − 1 ≤ ) = 1 τ |u| + 1 ≤ (1 − τ) |u| t τ u 0 u>0 u 0 −1 λ(t) = µ QY (τ)dτ 0
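A base-R sketch of the claim above that the τ-th quantile minimises expected check-function loss ρ_τ(u) = u(τ − 1{u ≤ 0}): the minimiser of the empirical check loss is compared with the sample quantile. The lognormal sample and τ = 0.25 are arbitrary choices.

```r
## The tau-th quantile as the minimiser of mean check loss.
set.seed(1)
y   <- rlnorm(1e5)                        # any continuous rv
tau <- 0.25
rho <- function(u, tau) u * (tau - (u <= 0))   # check function
obj <- function(c) mean(rho(y - c, tau))
opt <- optimize(obj, interval = range(y))
c(check_minimiser = opt$minimum, sample_quantile = unname(quantile(y, tau)))
```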

← ToC 6 which describes the proportion of total wealth owned by the poorest t proportion • Joint density can be factored into marginals: fX,Y (X,Y ) = fx(X)fy(Y ) of the population. Gini’s mean difference (‘gini coefficient’) can be expressed as ∫ [X,Y ] = 0 1 • Cov − γ = 1 2 λ(t)dt • ρ(X,Y ) = 0 0 which is twice the area between the 45◦ line and the Lorenz curve. • V [X + Y ] = V [X] + V [Y ]
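A sketch of the Lorenz-curve and Gini expressions above, using the empirical quantile function of a simulated "income" distribution and crude grid integration; the lognormal parameters are arbitrary.

```r
## Lorenz curve lambda(t) = mu^{-1} * integral_0^t Q_Y(tau) dtau and Gini = 1 - 2 * integral of lambda.
set.seed(2)
y  <- rlnorm(1e5, meanlog = 0, sdlog = 1)     # simulated incomes
tt <- seq(0, 1, length.out = 1001)
Q  <- quantile(y, tt)                          # empirical quantile function on a grid
lorenz <- cumsum(Q) / sum(Q)                   # lambda(t) approximated on the grid
gini   <- 1 - 2 * mean(lorenz)                 # grid approximation of 1 - 2 * integral
gini                                           # roughly 0.52 for sdlog = 1
```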

2.2.1 Multivariate Distributions 2.3 Moments Defn 2.9 (Random Vectors). p ′ For a random variable x with support [x, x¯] A p−random vector is a map X :Ω→R , X(ω) := (X1(ω),...,Xp(ω)) such that each Xi is a random variable. Defn 2.13 (N-th raw ). ′ E n Joint CDF of X is µj =:= x F : P ≤ : P ≤ ≤ (x) = [X x] = [X1 x1,...,Xp xp] Defn 2.14[ (N-th central] moment). j If X is continuous, the joint pdf is µj := E (X − E [X]) ∂p f (x) = F(X) Defn 2.15 (Expectation and Variance). ∂x1 . . . ∂xp ∫ ∫ x¯ x¯ F The marginals of and f are E [X] := xdF(x) |{z}≡ xf (x) dx x x If Absolute Continuity holds FX (xi) := P [Xi ≤ xi] = F(∞,..., ∞, xi, ∞,..., ∞) ∫ i ∫ x¯ ∂ F V [X] := (X − E(X))2dx fXi (xi) := Xi (xi) = f(x)dx−i ∂xi Rp−1 x = E[(X − EX)2] = E(X2) − (EX)2 | The conditional CDF and PDF of X1 (X2,...,Xp) are defined as Defn 2.16 (Variance Covariance Matrix). For vector-valued RVs, this translates to F P ≤ | X1|X−1 (x1) := [X1 x1 X−1 = x−1] V E ′ − E E ′ f(x) [X] = (XX ) (X) (X) f | (x ) := ′ X1 X−1=x−1 1 Cov [X, Y] = E[(X − EX)(Y − EY) ] = E [XY] − E [X] E [Y] fX−1 (x−1) Defn 2.17 ( and ). Defn 2.10 (Marginalization of f(x, y)). [ ] ∫ E (X − µ)3 ∞ ≡ : Skewness γ = 3 fx(x) = f(x, y)dy [ σ ] −∞ E (X − µ)4 Kurtosis ≡ κ := − 3 Defn 2.11 (Conditional Distribution). σ4 f(x, y) Fact 2.7 (Linear functions of a vector valued RV). f(y|x) = ′ fx(x) For any (well-behaved) vector rv x, Cov [Ax + b] = AΣA where Σ is the covari- To get marginal for x, we can integrate out y. ance matrix of the random vector x. ∫ For Normal quantities where X ∼ N (µ, Σ), ′ fx(x) = fx|y(x|y)fy(y)dy Ax + y ∼ N (Aµ + y, AΣA ) Σ−1/2 ∼ N (0, I) Defn 2.12 (Independent Random Variables). − ′ −1 − ∼ 2 two r.v.s X and Y are said to be independent if (x µ) Σ (x µ) χn

← ToC 7 Theorem 2.8 (Law of the Uncoscious Statistician (LOTUS)). Moments from MGF The kth derivative of the m.g.f. evaluated at t = 0 is the kth Y = r(X) X let is a transformation of a∫ random variable∫ . Then, (uncentered) moment of X. ∂kM [ ] E [Y ] = E [r(X)] = r(x)dF (x) = r(x)f(x)dx x | = E Xk ∂tk t=0 Example 2.9 (Properties of combinations of RV). Defn 2.19 (Cumulant + Cumulant Generating Function). Let X be a real-valued scalar rv and Mx(t) be its moment generating function. The • E [aX + bY ] = aE [X] + bE [y] ; Expectation is a linear operator cumulant-generating function of X is defined as ∑ ∑ KX (t) = log Mx(t), |t| < δ – E [( a X )] = a E [X ] ∏ i i i ∏ i i i the CGF may be expanded to the form – E [( X )] = E [X ] ∞ i i i i ∑ κ − E E × K (t) = j tj , |t| < δ – For p random vector X, [AX + b] = A [X] + b for q p matrix A x j! and b ∈ Rq j=1 • Variance where κ1, κ2,... are constants that depend on the distribution of X and are called Cumulants. Cumulants can be obtained by differentiating the CGF V 2V – [aX] = a [X] j ∂ 2 2 κ = K (t) ; j = 1, 2,... – V [aX + bY ] = a V [X] + b V [Y ] + 2abCov [X,Y ] j j x ∂t t=0 – Cov [X,X] = V [X] Defn 2.20 (Characteristic Function). – ∀a, b, c, d ∈ R, Cov [aX + c, bY + d] = abCov [X,Y ] The characteristic function of a random variable X is the function – Cov [X + W, Y + Z] = Cov [X,Y ]+Cov [X,Z]+Cov [W, Y ]+Cov [W, Z] √ ϕ(t) := E [exp(itX)] , −∞ < t < ∞ , i = −1 X V [AX + b] = AV [X] A⊤ q × p A – For random vector , for matrix and ϕ(t) = E [cos(tX)] + iE [sin(tX)] b ∈ Rq If X has a moment generating function M, then it can be shown that M(it) = ϕ(t). Defn 2.18 (Moment Generating Function). ϕ, unlike the MGF, is always well defined, and shares properties of the MGF. If X is nonnegative, we know E [exp(tX)] < ∞∀t ≤ 0. Then, we define the Laplace Transform Defn 2.21 (Order Statistics). X ,...,X ∼ f (x) F (x) X k− L(t) = E [exp(−tX)] ; t ≥ t 1 n iid x with X . (k) is the th order (in ascending order) Since this is limited to nonnegative RVs, one generalises them to Moment Gener- n! − ating Functions f (x) = F (x)k 1 (1 − F (x))n−kf (x) X(k) (k − 1)!(n − k)! X X MX (t) = E [exp(tX)] 2 e.g. Standard Normal MGF et /2 Defn 2.22 (Correlation Coefficient). | | E n For a given r.v. with MGF Mx(t), t < δ for some δ > 0, [X ] exists and is finite Cov [X,Y ] ∈ − ∀ ρ[X,Y ] = [ 1, 1] by Cauchy Schwarz n = 1, 2,... and σxσy ∞ [ ] ∑ E Xj M (t) = tj • ρ[X,Y ] = 1 ⇔ ∃a, b ∈ R with b > 0 s.t. Y = a + bX x j! j=0 • ρ[X,Y ] = −1 ⇔ ∃a, b ∈ R with b > 0 s.t. Y = a − bX E [Xn] = M (n)(0) and x . Defn 2.23 (Entropy). For a random variable X with pmf f (x ), the entropy H(X) is ∑ i ∑ − − H(X) := f (x) logb (f (xi)) = pi log(pi) i i
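A simulation sketch checking the order-statistic density stated in Defn 2.21, here for the k-th smallest of n IID standard normals (n = 10, k = 3 are arbitrary choices): the formula is overlaid on a histogram of simulated order statistics.

```r
## f_{X_(k)}(x) = n!/((k-1)!(n-k)!) * F(x)^(k-1) * (1 - F(x))^(n-k) * f(x)
set.seed(3)
n <- 10; k <- 3
xk <- apply(matrix(rnorm(n * 2e4), ncol = n), 1, function(z) sort(z)[k])
f_ord <- function(x) {
  factorial(n) / (factorial(k - 1) * factorial(n - k)) *
    pnorm(x)^(k - 1) * (1 - pnorm(x))^(n - k) * dnorm(x)
}
hist(xk, breaks = 60, freq = FALSE, main = "3rd order statistic of 10 N(0,1) draws")
curve(f_ord, add = TRUE, lwd = 2)   # stated density tracks the simulated histogram
```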

← ToC 8 where pi is Pr (X = x) ∀i ∈ Supp [X] Theorem 2.12 (Markov Inequality). Defn 2.24 (Copula). E [X] G Pr (|X| ≥ t) ≤ ∀t > 0 Equivalent For a pair of r.v.s X,Y with joint distribution (x, y) and marginal distribution t F H functions (x) and (y), a copula function E [ψ(X)] − − Pr (ψ(X) ≥ t) ≤ C : [0, 1]2→[0, 1]; C(u, v) = G(F 1(u), H 1(v)) t where C has the following properties Where ψ(.) is a nonnegative, nondecreasing function; in the basic form ψ = I. ϵ, r > 0 C C ∀ ∈ Equivalently, for 1. (u, 0) = (0, v) = 0 u, v [0, 1] E [|X|r] P [|X| > ϵ] ≤ 2. C(u, 1) = C(1, v) = 1 ∀u, v ∈ [0, 1] ϵr Theorem 2.13 (Chebychev’s Inequality). 3. ∀ u1 < u2 ∧ v1 < v2 ∈ [0, 1], C(u2, v2) − C(u2, v1) − C(u1, v2) + C(u1, v1) ≥ 0 Let X be any r.v. w E [X] = µ < ∞ and V [X] = σ2 < ∞. Then, ∀ϵ > 0 Defn 2.25 (Frechet Bounds). V [X] σ2 Pr (|X − µ| > ϵ) ≤ = max {F (x) , H(y) − 1, 0} ≤ G(x, y) ≤ min {F (x) , H(y)} ϵ2 ϵ2 Important in partial identification lit. The upper bound his occurs when X and This implies that averages of random variables with finite variance converge to Y are comonotonic, that is, when Y can be expressed as a deterministic, non- their mean. decreasing function of X. The lower bound is achieved when X and Y are coun- Theorem 2.14 (Holder’s Inequality). termonotonic, so Y is a deterministic, non-increasing function of X. These two Let X be a r.v. with range X and let g1, g2 denote real valued functions on X . Let very special cases correspond to the situations in which all of the mass of the cop- p, q > 1 s.t. 1 + 1 = 1. Then ula function is concentrated on a curve connecting opposite corners of the unit p q p 1/p q 1/q square. These special cases correspond to of +1 and −1 respec- E [|g1(X)g2(X)|] ≤ E [|g1(X)| ] E [|g2(X)| ] tively. The other important special case is independent X and Y , which obviously corresponds to C(u, v) = uv. Theorem 2.15 (Chernoff Bounds). Let Z be any random variable. Then, ∀t ≥ 0, 2.4 Transformations of Random Variables [ ] λ(Z−E[Z]) −λt −λt Pr (Z ≥ E [Z] + t) = min E e e = min MZ−E[Z](λ)e 2.4.1 Useful Inequalities λ≥0 λ≥0 Basic question [as framed by Duchi’s notes], given a random variable X with ex- pectation E [X], how likely is X to be close to its expectation, and how close is it Where M is the MGF of Z. Pr (X ≥ E [X]  t) t ≥ likely to be? This implies putting bounds on quantities of the form Theorem 2.16 (Hoeffding’s Inequality). 0. Let X be a random variable with a ≥ X ≥ b. Then, ∀s ∈ R, Theorem 2.10 (Cauchy Schwartz Inequality). [ ] [ ] [ ] s2(b − a)2 2 2 2 E sX ≤ E Cov [X,Y ] ≤ σxσy Equivalently, for RVs X,Y with E X < ∞ ∧ E Y < ∞ log e s [X] + √ 8 E [|XY |] ≤ E [X2] E [Y 2] Theorem 2.17 (Law of the Unconscious Statistician (LOTUS)). If g(X) is a bijective transformation of a random variable with density f(x), Theorem 2.11 (Jensen’s Inequality). ∫ ∞ Let Y be a random function and g(·) be a concave function. If E [Y ] and E [g(Y )] E [g(X)] = g(x)f(x)dx exist, then E [g(Y )] ≤ g(E [Y ]). −∞ Similarly, if f : R→R is convex, E [f(X)] ≥ f(E [X]). 2.5 Transformations and Conditional Distributions
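A quick numeric sketch of Chebyshev's inequality from the page above, on an exponential sample (which has finite mean and variance); the bound is valid but, as expected, loose.

```r
## Chebyshev: Pr(|X - mu| > eps) <= sigma^2 / eps^2, checked for X ~ Exp(1).
set.seed(4)
x   <- rexp(1e6, rate = 1)            # mu = 1, sigma^2 = 1
eps <- 2
c(empirical_tail = mean(abs(x - 1) > eps), chebyshev_bound = 1 / eps^2)
# bound = 0.25, while the empirical tail probability is about 0.05
```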

Goal: express the distribution of a transformation Y = g(X) in terms of f_X and F_X.

← ToC 9 Defn 2.26 (CDF of transformation). 2.5.1 Distributions facts and links between them ( ) ( ) −1 −1 FY (y) = Pr (Y ≤ y) = Pr (g(X) ≤ y) = Pr X ≤ g (y) = FX g (y) Fact 2.20 (Normal( ) Distribution( Facts).) ∼ N 2 ∼ N 2 Let X µx, σx and Y µy, σy Defn 2.27 (Change of Variables technique for PDF of transformation). ( ) y = g(X) x ∀ ∈ R ̸ ⇒ ∼ N 2 2 Density of a transformation of a random variable : • a, b a = 0,W = aX + b = W aµx + b, a σX ) ( ) ( ) −1 d −1 −1 1 2 2 fY (y) = fx(g (y)) g (y) = fx g (y) • If X ⊥⊥ Y and Z = X + Y , then Z = N µ + µ , σ + σ dy g′(g−1(y)) x y x y If X ∈ Rn, Fact 2.21 (Multivariate Normal). ( ) ( ) N − × −1 −1 −1 −1 p (µ, Σ) with µ a p vector of means and Σ a p p symmetric and positive definite fy (y) = fx g (y) detJg−1(y) = fx g (y) detJg(g (y)) matrix. PDF of Np (µ, Σ) is Example 2.18 (finding pdf of transformation). ( ) 2 2 1 1 ⊤ −1 Let X have f (x) = 3x , and we want to find Y = X . ϕΣ = exp − (x − µ) Σ (x − µ) −1 1/2 −1 1/2 ∂g (y) −1/2 (2π)p/2 |Σ| 2 Then, g (y) = y , and ∂y = (1/2)y Plugging into the expression above, we get It also inherits the linearity property of the( form ) ⊤ ANp (µ, Σ) = Nq Aµ + b, AΣA −1 d −1 fY (y) = fx(g (y)) g (y) × dy where A is q p √ 1/2 2 1 −1/2 3 −1/2 3 1/2 ∼ 2 → ∼ = 3(y ) y = y × y = y Fact 2.22 (Special Cases of RVs). • t : let x χn,Y = z/ x/n y tn 2 2 2 2 2 x1/n1 • F : let x1 ∼ χ , and x2 ∼ χ ; y = → Y ∼ F n1, n2 n1 n2 x2/n2 Defn 2.28 (Conditional Expectations). ∼ N For jointly continuous X,Y with joint pdf f, the conditional expectation of Y given • let x (0,I) and A is symmetric, then X = x ′ 2 is a function ∫ ∫ p(x Ax) ∼ χ (K) where K := tr (A) E | F | | [Y X = x] = yd Y |x (y X) = y fY |X (y x) dy • Bin(1, p) ∼ Bern(p) Conditional Expectation of function h of random variables is • Beta(1, 1) ∼ Unif(0,1) ∫ ∞ ∼ E [h(X,Y )|X = x] = h(x, y) fY |X (y|x) dy ∀x ∈ Supp[X] • Gamma(1, λ) Expo(λ) −∞ 2 ∼ • χn Gamma(n/2, 1/2) Defn 2.29 (Conditional Variance). For r.v.s X,Y , conditional variance V [Y |X = x] is • NBin(1, p) ∼ Geom(p) [ ] [ ] V | E − E | 2| E 2| − E | 2 [Y X] = (Y [Y X]) X = Y X [ [Y X]] Fact 2.23 (Misc Distribution Facts). Theorem 2.19 (Law of Iterated Expectations/ Adam’s Law). • Exponential distribution is E [Y ] = E [E [Y |X]] – memoryless: Pr (X > s + t|X > s) = Pr (X > t) Defn 2.30 (Law of Total Variance / ANOVA Theorem / Eve’s Law). – Gaps between Poisson realisations is exponential V [Y ] = E [V [Y |X]] + V [E [Y |X]] – Scales (Y ∼ Expo(λ) =⇒ λY = Expo(1)) – Order statistics (min, max) of expos are also expo
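A simulation sketch verifying Example 2.18 above: with f_X(x) = 3x² on (0, 1) and Y = X², the change-of-variables formula gives f_Y(y) = (3/2)√y on (0, 1). Draws from F_X(x) = x³ are generated by inverse-CDF sampling.

```r
## Verify f_Y(y) = 1.5 * sqrt(y) for Y = X^2, X with density 3x^2 on (0,1).
set.seed(5)
u <- runif(2e5)
x <- u^(1/3)              # inverse-CDF draw: F_X(x) = x^3
y <- x^2
hist(y, breaks = 60, freq = FALSE, main = "Y = X^2")
curve(1.5 * sqrt(x), from = 0, to = 1, add = TRUE, lwd = 2)  # curve() uses 'x' as its argument name
```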

← ToC 10 • if X ∼ Gamma(a, λ), it is the distribution of the wait time for the a realisations Example 2.24 (Estimation Problem). of a Expo(λ) process The action space A = Θ. The DM needs to decide what is the parameter θ that generated the data. The decision rule for this problem is called an estimator.A ∼ • Discrete distributions: If X typical is quadratic loss := L(a, θ) = (a − θ)2. – Bernoilli, coin flip, Example 2.25 (Testing Problem). – Binomial: n coin flips, Partition the parameter space into Θ0 [Null hypothesis] and Θ1 [Alternative hy- A { } – Geometric: number of tails before first head when success probability is pothesis]. Action space = a0, a1 defined as choosing null or alternative. De- p cision rules are called a test. Loss function is typically zero-one loss L(a1, θ0) = L – Negative binomial: the number of tails until rth head (a0, θ1) = 1; 0 otherwise. – Poisson: if rare events occur at rate λ per unit time, the number events Example 2.26 (Inference Problem). that occur in a unit or space of time is X A ⊆ R where each action a[] is an interval containing the best candidate values for – Multinomial: n items that can fall into k buckets independently w.p. p = θ. Decision rules here are confidence sets. (p1, . . . , pk). Defn 2.34 (Risk Function). of decision rule d is Defn 2.31 (Exchangeability). R(θ; d) = EP [L(d(X), θ)] Real-valued r.v.s X1,...,Xn are said to have an exchangeable distribution if the θ A decision rule d is dominated by d′ if R(θ; d′) ≤ R(θ; d). Decision rules that are distribution of X1,...,Xn) is the same as the distribution of Xi1 ,...,Xin for any permutation i1, . . . , in of 1, . . . , n. not dominated are called admissable. Defn 2.32 (Martingales). Example 2.27 (James-Stein Estimator). Given (possibly correlated) jointly normal r.v’s Y ,...,Y with y ∼ N (µ , 1), and Consider a sequence of random variables {X ,X ,...} s.t. E [|X |] < ∞ ∀ n = 1. 1 n i i 1 2 n would like to estimate the n− vector µ under squared loss The sequence {X1,X2,...} is said to be martingale if ∑n ∀n, E [Xn+1|X1,...,Xn] = Xn L b b − 2 ∥b − ∥ (µ, µ) = (µi µ) = µ µ 2 i=1 2.6 Statistical Decision Theory µ Y The MLE for each is just the (unbiased)( vector) itself, but the estimator n − 2 Define a statistical decision problem as a game involving ‘nature’ and ‘decision maker’(DM). µb = 1 − ∑ Y In the first stage (data-generation), nature selects a parameter θ ∈ Θ and uses it to i Y 2 i P i i generate data according to the distribution θ. In the second stage (decision mak- has better L than the MLE whenever n ≥ 3. ing), the DM observes the data but not θ, but knows the used by nature. Based on realised data, the DM would like to take an action a ∈ A Defn 2.35 (Bayes Risk). whose payoff depends on the parameter drawn by nature , which can be modeled given π on Θ (defined as a prior), we define the Bayes risk L d by endowing DM with a state-contingent utility u(a, θ) or loss (a, θ). The DM’s of a decision rule as ∫ decision problem is the selection of an action depending on the realisation of the E data. r(π, d) = R(θ, d)dπ(θ) = π [R(θ; R)] Θ Defn 2.33 (Statistical Problem). A decision rule d∗ is said to be Bayes Rule w.r.t. prior π and class of decision rules ∗ is a tuple D if r(π, d ) = infd r(π, d) [i.e. it minimises Bayes Risk]. (Θ, A, u(.), {P }) θ Defn 2.36 (Minimax). containing a parameter space, action space, utility function, and statistical model. a decision rule d0 is said to be minimax (relative to a class D of decisions) if A decision rule d is a function d : X →A. 
sup_{θ∈Θ} R(θ, d_0) = inf_{d∈D} sup_{θ∈Θ} R(θ, d)

A minimax rule protects the DM against worst-case situations.
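A minimal simulation sketch of Example 2.27 above, comparing the squared-error risk of the MLE (the vector Y itself) with the James-Stein estimator for y_i ∼ N(µ_i, 1); the choice of n = 10 and of the fixed mean vector is arbitrary.

```r
## James-Stein vs MLE under squared-error loss, n >= 3.
set.seed(6)
n   <- 10
mu  <- rnorm(n)                            # a fixed mean vector
sim <- replicate(5e3, {
  y  <- rnorm(n, mu, 1)
  js <- (1 - (n - 2) / sum(y^2)) * y       # James-Stein shrinkage
  c(mle = sum((y - mu)^2), js = sum((js - mu)^2))
})
rowMeans(sim)    # MLE risk is about n = 10; James-Stein risk is smaller
```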

← ToC 11 2.7 Estimation 2.8 Hypothesis Testing Defn 2.37 (Sample Statistic). X ,X ,...,X Null For iid random variables 1 2 n, a sample statistic is a function True False T = h (X ,X ,...,X ) (n) (n) 1 2 n α n Reject Power where h(n) : R →R ∀n ∈ N Type 1 error ¬ 1 − α 1 - Power Defn 2.38 (Unbiasedness). [ ] Reject Type 2 Error ˆ E ˆ An estimator θ is unbiased if θ = θ Defn 2.43 (Test statistic). T := T (X ,...,X ) Defn 2.39 (Consistency). A test statistic, similar to an estimator, is a real valued function n 1 n p of the data sample. It is a random variable. A test t : Domain(Tn)→{0, 1}. An estimator θˆ is consistent if θˆ → θ. standard normal test statistic: Defn 2.40 (Asymptotic Normality). θˆ − θ An estimator θˆ is asymptotically normal iff 0 ≤ ( [ ]) s = z(1−α/2) √ d ω n(θˆ(X) − θ) → N 0, V θˆ √ ∑ 2 1 2 where ω = σ /n = uˆi [ ]−1 n−k where V θˆ is called the asymptotic . Defn 2.44 (Two-sided Normal Approximation-Based Confidence interval). Defn 2.41 (Sampling Variance of an Estimator).[ ] ˆ CI1−α(β) := {β ∈ Rk : Pr (β ∈ CI1−α) = 1 − α} For an estimator θˆ, the sampling variance is V θˆ . [ √ √ ] b b = θˆ − z − V [θˆ], θˆ + z − V [θˆ] Defn 2.42 (Mean Squared Error(MSE) of an estimator). (1 α/2) (1 α/2) For an estimator θˆ, the MSE in estimating θ is [ ] [ [ ] ] [ ] 2 th 2 z c Φ Φ(z ) = c ∀c ∈ MSE = E (θˆ − θ) = E θˆ − θ + V θˆ where c denotes the quantile of the standard normal s.t. c | {z } | {z } (0, 1). [ ]Bias Variance Defn 2.45 (Asymptotically valid two-tailed P-value). E − 2 E argmin (X c) = [X] P-value = 2[1 − Φ(|s|)]; in words - smallest critical value under which H0 would be c∈R rejected Fact 2.28 (Properties of Mean and Variance). ∑ ∑ Defn 2.46 (Asymptotically valid one-tailed P-value). ¯ 1 n 2 1 n − ¯ 2 One sided P-value = 1 − Φ(s) or Φ(s) • X = n i=1 Xi and SX = n−1 i=1(Xi X) are unbiased estimators of E [X] and V [X] respectively. Defn 2.47 (Power). H H π(θ) = P [|z + s| > z − ] • Both are asymptotically normal Pr(reject 0 | 1 is true) : 1 α/2 ( ) 2 • If Xi ∼ N µ, σ ( ) 2.9 Some Asymptotics results ¯ ∼ N σ2 b – X µ, n We estimate θ from data, and hope that it is close to true parameter θ0. How close n−1 2 2 b – 2 S ∼ χ − is θ to θ0? Basic idea of asymptotics is to take the taylor expansions and show σ X n 1 √ b ¯ 2 asymptotic normality: which is that the distribution of n(θ − θ0)→N (0, 1) as – X and SX are independent →∞ – n . ¯ X − µ Defn 2.48 (Modes of Convergence: Probability, Mean Squared, Distribution). √ ∼ t − 2 n 1 A sequence of r.v.s {X }, n = 1, 2,... is said to SX /n n

← ToC 12 → d • Xn p X - converge in probability if • Xn/Yn → X/c given c invertible ∀ϵ, δ > 0, ∃N s.t. ∀n > N, Pr (|Xn − X| < ϵ) < 1 − δ. Equivalent notation: plim (Xn) = X Theorem 2.34 ((Lindberg-Levy) Central Limit Theorem). 2 X1,...,Xn are IID with E [Xi] = µ, V [Xi] = σ , for a general class of Xn, → E − 2 √ • Xn a.s. X - converge in mean square if limn→∞ [Xn X] = 0. − n(√Xn µ) →d N → F (0, 1) • Xn d X - converge in distribution if n of Xn converges to the distribution σ2 function F of X at every continuity point of F. We call F the limit distribution of {Xn} Example 2.35 (CLT + Slutzky for asymptotic distribution of test statistic). Z test: Under the null that E [x] = θ , √0 Theorem 2.29 ((Chebyshev) Weak Law of Large Numbers). (ˆµ − θ ) 2 2 √n 0 →d N X1,...,X∑ n are independent r.v.’s with means µ1, . . . , µn and σ1, . . . , σn. Zn(θ0) := (0, 1) − 2 2 2→ →∞ b→ sn If n σi 0 as n , then X = µ µ ( ) 2 p because s → V [X]. Reject if Z (θ ) ̸∈ z , z − . Theorem 2.30 ((Khinichine) Law of Large Numbers). n n 0 α/2 1 α/2 p X1,...,Xn are IID; E[X1] < µ. Then, X → µ Example 2.36 (Multiplicative Noise and Lognormal). Suppose we have iid rvs Xi with finite mean and variance, and instead of taking Theorem{ 2.31} ((Kolmogorov)∑ Strong Law of large Numbers). nth 2 2 2 the , we take the (multiply them and take the For Xi, µi, σ finite and σ /i < ∞, then X→µ a.s. i i i root). ( ) N 1/N n Theorem 2.32 (Glivenko-Cantelli). ∏ 1 ∑ F R Sn = Xi =⇒ log Sn = log Xi Let Xi, i = 1, . . . , n be an iid sequence with distribution on . The empirical n distribution function is the funcction of x defined by i=1 i=1 ∑n Assuming log Xi has finite mean and variance, log Sn will be normally distributed, c 1 S p(x) = F (x) = 1 ≤ and n will be log-normal. This is not a power-law, which is of the form n n Xi x −α ⇒ − i=1 kx = log p = log k α log x. for a given x ∈ R, apply the SLLN to the sequence 1X ≤x to assert Theorem 2.37 (Continuous Mapping Theorem I). i → ⇒ → b a.s. Xn d X; h(.) is continuous ; h(Xn) d h(x) Fn(x) → F (x) Theorem 2.38 (Continuous Mapping Theorem II). Similarly, G.C. asserts : Xn →p X; h(.) is continuous ⇒; h(Xn) →p h(x) b a.s. sup Fn(x) − F (x) → 0 x∈R Theorem 2.39 (Delta Method). k Let g : R →R, let θ be a point in the domain of g, and let {tn} be a sequence of In words, for random samples from a continuous distribution F, the empirical dis- random vectors in Rk. If b b− tribution F is consistent. By extension, so are the sample quantiles F 1(τ). This is √ − →d N important for inference in quantile regression. • n(tn θ) (0, Σ) ∇ Theorem 2.33 (Slutsky’s Theorem). • g(θ) exists and is continuous   →d →d ′ Let Xn,Yn be sequences of scalar/vector random elements. If Xn X and Y c, gn(θ)1 then where ∇g(θ) :=  .  ′ . d gn(θ)n • Xn + Yn → X + c Scalar version: √ ( ) d ′ 2 2 d n{g(tn) − g(θ)} → N 0, g (θ) σ • XnYn → Xc Equivalently,

← ToC 13 √ • op(1) + Op(1) = Op(1); op(1) × Op(1) = op(1) n(g(ˆµ ) − g(µ)) n√ →d N (0, 1) |g′(µ)| V [X] Example 2.40 (Consistency of MM Estimator). √ ∑n ∑n 1 ′ 1 ′ p Defn 2.49 (Stochastic Orders of Magnitude: o () , O () notation, N consistency). βˆ = β + x x x u = β + O (1) × o (1) → β n i i n i i | p {z p } (White, 2014) | i=1{z } | i=1{z } { } { } op(1) For deterministic sequences an , bn , p p → (Q)−1 → 0

1. ∃∆ < ∞ s.t. an/bn→0 for sufficiently large n, we say an = o (bn) [tending to zero in probability; an is smaller than bn asymptotically] 2.10 Parametric Models a a = o (b ):⇔ lim n = 0 Defn 2.50 (Parametric Model). n n →∞ n bn For r.v Y and random vector X of length K, a parametric model is a set of functions: P : RK+1→R indexed by a parameter vector θ of length τ 2. ∃∆ < ∞ s.t. an/bn ≤ ∆ for sufficiently large n, we say an = O (bn) [bounded P { ∈ } in probability, not larger than bn asymptotically; an does not decrease slower = P (y, x; θ): θ Θ τ than bn] where Θ ⊂ R is called the parameter space [this guarantees existence as long as Θ an ∃ ∈ an = O (bn):⇔ lim ≤ C for C > 0 is a compact set]. The model is said to be true if θ Θ s.t. n→∞ b n | ( ) fY |X (y X) = g(y, X; θ) { } λ λ −λ • A sequence an is O n ( read - at most of order( n ), if )n an is bounded. A parametrisation is identifiable if there is a unique θ ∈ Θ that corresponds with λ ⇔ ∀ ∃ ∞ ∧ ∗ Zn ∀ ≥ ∗ Zn = Op(n ) δ, ∆(δ) < n (δ) s.t. Pr nλ > ∆ < δ n (δ) each P ∈ P. Equivalently, θ1 ≠ θ2 =⇒ P (·, θ1) ≠ P (·, θ2) λ = 0, {a } a = O (1) When n is bounded and we write n . Defn 2.51 (Regression Model). Y ⊂ Rd P { · ∈ } – If w = O (1), {w } is stochastically bounded, i.e. not explosive as Consider a parametric model where and := P ( , λ): λ Λ . Y1,...,Yn N p N ∀ ∈ P N→∞. Formally, for any constant ϵ > 0, ∃δϵ s.t. are independent such that j = 1, . . . , n, the distribution of Yj corresponding λ P | | with parameter j (i.e. independent but not identically distributed). sup [ wN > δϵ] < ϵ x , x ,..., x ∀j = 1, . . . , n N 1 2 n is a known sequence of nonrandom vectors such that , ∃h(·) s.t. λj = h(xj, θ) for some θ ∈ Θ. Thus, the distribution of Yj depends on the – Any random∑ sequence converging in distribution is Op(1), which im- value of xj, θ, and the function h. plies N −1/2 n {z − E [z]} = O (1). In a regression, h is known/assumed, while θ is an unknown parameter. i=1 i p √ − – For an estimator aN for√ a parameter α, in most cases, we have N(aN Example 2.41 (Classical Linear Model). α) = Op(1) : aN is N-consistent. The convergence rate is therefore the parametric setup is −1/2 N . g(y, X, (β, σ)) = ϕ(y;(Xβ, σ2)) ( ) { } { } λ λ −λ → • A( sequence) an is (o n) ( read - of order smaller than n ) if n an 0. bn = and the parameter space Θ = (β, σ) ∈ RK+2 : σ ≥ 0 λ λ o n =⇒ bn = O n Example 2.42. →p For a binary choice model where y ∈ {0, 1}, the parametric model is – In other words, when wN 0, it is also denoted as wN = op(1). For zN , { by LLN we thus have zN − E [z] = op(1). − g(y, X; β) = 1 h(Xβ) if y = 0 h(Xβ) if y = 1 Sums and Products and the mean function h : R→[0, 1] is known. Common choices of h are logit • op(1) + op(1) = op(1); Op(1) + Op(1) = Op(1) (h(Xβ) = exp(Xβ)/(1 + exp(Xβ))) or normal CDF Φ.

• op(1) × op(1) = op(1); Op(1) × Op(1) = Op(1);

← ToC 14 Theorem 2.43 (Sufficient Statistic / Fisher-Neyman Theorem). Defn 2.55 (Influence function). Let X have pdf p(x, θ). Then, the statistics ϕ(x) are sufficient for θ IFF the density also called the Frechet derivative of the functional θn. can be written as θn(Fε) − θn(F) IFθn,F(x) := lim p(x|θ) = h(x)gθ(ϕ (x)) ε→0 ε where h(x) is a distribution independent of θ and gθ captures all the dependence where Fε = (1 − ε)F + εδx. on θ via sufficient statistics ϕ(x). Equivalently, the bayesian interpretation is that ϕ(x) p(θ|x) = p(θ|ϕ(x)) Example 2.44 (Nonrobustness∫ of mean and variance). is sufficient if the posterior . − − Mean: T (F ) = xdF∫ (x), T (Fε) = (1 ε)µ(F ) + εδx, so IF(F )(x) = x µ(F ) Defn 2.52 (Extremum (m) estimator). Variance T (F ) = (x − µ)2dF (x). θˆ is an estimator that maximises a scalar objective function that is a sum of N sub- (1 − ε)σ2(F )t(x − µ)2σ2(F ) functions IF (x) = lim = (x − µ)2 − σ2 T,F → ∑N t 0 ε 1 →∞ →∞ arg max QN (θ) = q(wi, θ) Since IF(x) as x , ε contamination noises things up enormously. ∈ N θ Θ i=1 MLE is a type of extremum estimator. So is GMM.

2.11 Robustness ∑ b 1 n F F 1 ≤ Write estimators as θ = θn( n) where n(y) = n i=1 yi y is the empirical dis- tribution fn. In this setup, ∫ • Mean: θn ydFn(y) F−1 1 • : θn = n ( 2 ) • Trimmed Mean = ∫ 1−α 1 F−1 θn = − n (u)du 1 2α α b The mapping θn(·) induces a probability distribution for the estimator θn under F b which we denote LF(θn). Fn→F ∧ θn→θ∞(F) Defn 2.53 (Prokhorov Distance). Metric on the collection of probability measures on a{ given metric space. } ε Let A denote the Borel sets on R ∀A ∈ A and A := x ∈ R| infy∈A|x−y|≤ε The prokhorov distance between F and G distributions is given by π(F, G) = inf {ε|F[A] ≤ G[Aε] + ε ∀A ∈ A} Defn 2.54 (Hempel Robustness). The sequence of estimators {θn} is robust at F iff ∀ ε > 0 ∃δ > 0 s.t. ∀n,

π(F, G) < δ =⇒ π(LF(θn),LG(θn)) < ϵ This is a continuity requirement on the mapping θ(.). An estimator is robust at F if b small departures from F induce small departures in the distribution of θ measured by the Prokhorov distance.

← ToC 15 3 Linear Regression With more than 1 predictor, the variance is [ ] σ2 V βˆ = 3.1 Simple Linear Regression j − 2 TSSj(1 Rj ) 3.1.1 OLS in Summation Form 2 2 where Rj∑is the R from the regression of Xj on the other Xs and an intercept, and Y = β + β X + ε E [ε |X ] = 0, V [ε |X ] = σ2 − 2 Stipulate model i 0 1 i i, where i i i i . TSSj = i(xij x¯j) . This denominator is called the Variance Inflation Factor. Fact 3.1 (BLP Least Squares Estimands and Estimators). { }N 3.1.2 Prediction Consistent inference for β with only assumption being yi, xi i=1 is IID with well ˆ ˆ defined moments. Say we have a model rˆ(x) = β0 + β1X from data (X1,Y1,...Xn,Yn). We see a ∑ X = x y − ¯ − ¯ − · new observation 0 and want to predict 0. An estimate of the outcome Cov [X,Y ] ˆ i(Xi X)(Yi Y ) XY X Y ˆ ˆ β1 := =⇒ β1 = ∑ = yˆ0 = β0 + β1x0. Variance of the prediction is V [X] (X − X¯)2 2 − 2 i i X X [ ] [ ] [ ] [ ] [ ] 1 (x − x¯)2 Cov [X,Y ] ˆ ˆ V ˆ ˆ V ˆ 2V ˆ ˆ ˆ 2 0 β0 := E [Y ] − E [X] =⇒ β0 = Y¯ − β1X¯ β0 + β1x0 = β0 + x0 β1 + 2x0Cov β0, β1 = σ + V [X] n (n − 1)sxx ∑ 2 εˆi Theorem 3.3 ( for OLS). σˆ2 = i e := y − yˆ n − 2 Variance of prediction error f 0 0 is V V V E | − 2 V Yˆ = βˆ + βXˆ [e0] = [ε0] + [ [y0 x0] yˆ0] = σ + [ˆy0] As an application of the law of total variance, we can construct 0 and ( ∑ ) ( ) U = Y − Yˆ . Then, the variance decomposition for V [Y ] is (x − x )2 1 (x − X¯)2 ˆ2 2 i i 0 ˆ2 0 ξ = V [ˆy0 − y0] =σ ˆ 1 + ∑ = σ 1 + + [ ] [ ] n (x − x¯)2 n S Cov(X,Y ) 2 ρ2 σ2 σ2 i i xx V Yb = (βˆ )2V [X] = V [X] = XY X Y = ρ2 σ2 1 V 2 XY Y Example 3.4 (Simple linear regression in matrix form). [X] σX [ ] Partition design matrix s.t. X = [1 x] V 2 − V b − 2 2 [ ] [U] = σY Y = (1 ρXY )σY [ ′ ′ ] ′ 1 1 1 ′ −1 Sxx/∆ −Sx/∆ X X = ′ ′ =⇒ (X X) = x 1 x x −Sx/∆ n/∆ ∑ ∑ ∑ ∑ ( ) ∑ Theorem 3.2 (Properties of Least Squares Estimators). 2 − 2 2 − 2 where Sxx = xi ,Sx = xi, ∆ = n (xi x¯) = N xi ( (xi)) . ˆ⊤ ˆ ˆ T i i Let β = (β0, β1) denote least squares estimators. The conditional means and Equivalent expression variances are [ ] [ ] ′ −1 1 V[ 2 − ′ ny [ ] ( ) (X X) = n [x]x nx ; X y = d 2V[ −nx n nCov(x, y) + nxy E βˆ|X = β0 n [x] β1 [ ] ( ∑ ) σ2 1 X2 −X¯ V βˆ|X = n i i 3.2 Classical Linear Model ns2 −X¯ 1 ∑ X 3.2.1 Gauss Markov Assumptions 2 −1 − ¯ 2 −1 where sX = n i(Xi X) =: n Sxx. This simplifies to [ ] 1. Linearity : Y = Xβ + ε V ˆ 2 2 • β0 = σ (1/n +x ¯ /Sxx). 2. Strict Exogeneity : E [εi|X] = 0 [ ] [ ] 2 V − E [ε ] = 0 when estimating V ˆ 2 σ V ˆ [[xi x¯]ui] • Equivalent to unconditional mean of error i • β1 = σ /Sxx = n·V[X] ; under Heteroskedasticity, this is β1 = ¯ 2 nV[(Xi−X)] with ’fixed’ regressors [ ] • Cross moment of residuals and regressors is zero, X is orthogonal to ε : ˆ ˆ 2 − • Cov β0, β1 = σ ( x/S¯ xx) E [Xiεi] = 0

← ToC 16 [ ] E 2| 2 E ′| 2 − 3. Spherical error variance : εi X = σ ; [εε X] = σ In – p eigenvalues of 1 and n p zero eigenvalues – 0 ≤ h ≤ 1 4. Full Rank: No multicollinearity - rank(X) = k ii – Prediction for observation i is simply Mi·y where Mi· is the ithe row of 2 5. Spherical Errors: ε|X ∼ N(0, σ In) the hat matrix − − ′ −1 ′ 6. (Yi,Xi): i = 1, ..., n are i.i.d. • Mx = In Px = In X(X X) X - Annihilator Matrix - projector into span(X)⊥ (orthogonal subspace of X) (A1-A4) sufficient for unbiasedness + Gauss-Markov. – rank(M) = trace(M) = trace(I − P ) = trace(I) − trace(P ) = n − k 3.2.2 OLS Derivation This generates OLS minimises ⟨ε, ε⟩ • Fitted Value: yˆ = PxY ′ min (y − Xβ) (y − Xβ) e = M Y β∈RK • Residual : x ′ − ′ − ′ ′ − ′ ′ = y y |y Xβ {zβ X y} β X Xβ Theorem 3.5 (Frisch-Waugh-Lovell Theorem). Let X1,X2 be partitions of X containing first K1 and K −K1 columns respectively, scalar, combined and β1, β2 be the corresponding coefficients in β. Further, let M1,P1 be the projec- tion and residualiser matrices for X1. Then, Taking FOCs and solving yields ( )− ˆ ⊤ 1 ⊤ β2 = X2 M1X2 X2 M1y ∂ ⟨ε, ε⟩ ′ ′ ′ − ′ = −2βX y − 2X Xβ =⇒ βˆ = (X X) 1X y IoW, one can estimate coefficients for X by first residualising X and y on X ∂β 2 2 1 With fixed regressors, ′ − 3.3.1 Partitioned Regression V (β) = σ2(X X) 1 ′ b2 e e − Choose k1, k2; k1 + k2 = k s.t. X = [X1 X2]. Then, the normal equation is where, under homoskedasticity, σ = n−k , where e = y Xβ [ ] [ ] ′ ′ [ ] ′ X1X1 X1X2 β1 X1y ′ ′ β = ′ 3.3 Geometry of OLS X2X1 X2X2 2 X2y yields the FWL solutions Define 2 matrices that are ( ) −1 ( ) −1 ′ −1 ′ ′ ≥ ∀ ∈ Rk ˆ ′ ′ ′ ′ ′ ′ ∗ ∗ ∗ • positive semidifinite x Ax 0 x (conformable x) β1 = (X1M2X1) X1M2y = X| 1{zM}2 M2X1 X1M2y = X 1X 1 X 1y • symmetric A′ = A 3.4 Relationships between Exogeneity Assumptions • idempotent AA = A main model: yi = β0 + β1x1 + ui. Defn 3.1 (Projection Matrices). Matrices that project a vector y into a subspace S. For OLS, 1. E [u] = 0 is technical assumption; not meeting it only affects the constant { } term β0. • L := span(X) := Xb : b ∈ Rk is the linear space spanned by the columns of X. 2. Zero Covariance: Key assumption for consistency of β1: Cov [u, x] = 0. As- [u, x] = E [ux] − E [u] E [x] =⇒ E [ux] = 0 ′ −1 ′ sumption (1) implies that Cov • Px = X(X X) X - Hat Matrix - projector into columns space of X 3. Mean independence: E [u|x] = E [u] = 0 – rank(P ) = trace(P ) = p.

← ToC 17 • Mean independence implies zero covariance E [ux] = E [xE [u|x]] = OLS coefficient efficient in the class of LUEs. For any other Linear unbiased esti- E [xE [u]] = E [x] E [u]. Since Cov [u, x] = E [ux] − E [u] E [x], E [u|x] = mator b, ( [ ] ) E [u] =⇒ E [ux] = E [u] E [x] =⇒ Cov [x, u] = 0 a′ V βˆ|X − V [b|X] a ≥ 0 ∀ a ∈ Rk • Zero covariance does not imply mean independence p Property 3.9 (Large: Consistent - βˆ → β). ⊥⊥ 4. u x if f(u, x) = f(u)f(x). No longer need fixed regressors assumption. ( )−1 Violations of the zero conditional mean assumption E [ε|x] usually arise one of 1 ∑n 1 ∑n βb − β = x x′ x u three ways: n i i n i i i=1 i=1

• Omitted Variables Bias: if unobserved variable q is correlated with both y ∵ E [xiui] = 0, we can apply Slutsky’s Theorem and LLN to write and x, failing to include it in the regression results in Cov(x, ε) ≠ 0 ( )− ∑n 1 ∑n b 1 ′ 1 ′ −1 • Measurement Error plim →∞(β−β) = plim x x plim x u = (E [x x ]) E [x u ] = 0 n n i i n i i i i i i • Simultaneity i=1 i=1 Property 3.10 (Large: Asymptotically Normal).   b 2 3.5 Finite and Large Sample Properties of β, σb ( ) √ − −   ˆ− →d N ′ 1 ′ ′ 1 ≡ N −1 −1 Property 3.6 (Finite: Unbiased). n(β β) 0, (X X) X ΩX (X X) 0, (MXX ) |MX{zΩX} (MXX )  | {z } ∑ | {z } X −1 ′ −1 ′ −1 ′ - under fixed regressor assumption - that ’s are nonrandom. Otherwise, the N X X N i uˆixixi N X X conditioning is ill-defined. Under A(1-4), [ ] [ ] Defn 3.2 (Huber-White Sandwich ‘Robust’ SEs). ( )−1 E βˆ|X = E X⊤X X⊤(Xβ + ε) This a plug-in estimate of an asymptotic approximation of the . [ ] ( )− ( )− ( ) ( )− E ⊤ 1 ⊤ [ ] ∑ 1 ∑ ∑ 1 = β + X X X ε = β 2nd term 0 by A2 1 ′ ′ ′ Asym V βˆ := x x eˆ2x x x x n i i i i i i i | i {z } | i {z } | i {z } Otherwise, the expectation operator cannot pass through a ratio. A−1 B A−1 ( ) ( ) Alternate statement of bias without fixed regressors: ⊤ −1 ⊤ ˆ ⊤ −1   = X X X ΩX X X ( )− [ ] ∑n 1 ∑n where E b E  ′  ˆ 2 2 β = xixi xiyi Ω := diag(ˆe1,..., eˆn) i=1 i=1 ( ) Property 3.11 (σˆ2 is unbiased). −1 ′ ∑n ∑n σˆ2 = e e M e = Mε ̸ E ′ E Since n−k , this follows from trace of x. Recall that , so = [xixi] [xiyi] = β ′ ′ 2 2 i=1 i=1 E [e e] = E [ε Mxε] = tr((Mx)Iσ ) = (n − k)σ Theorem 3.12 (Wierstrass Approximation Theorem). Property 3.7 (Finite: Variance). Let f :[a, b]→R be continuous. Then ∀ε > 0, ∃p s.t. ∀x ∈ [a, b], |f(x) − p(x)| < ε [ ] ( )−1 ( )−1 ( )−1 V βˆ|X = X⊤X X⊤ E [εε′|X] X X⊤X = σ2 X⊤X Defn 3.3 (Polynomial Approximation of CEF). | {z } Let X,Y be r.v.s and suppose E [Y |X = x] is continuous and supp[X] = [a, b]. Then, 0 by A2 ∀ε > 0, ∃K ∈ N ∀ K′ ≥ K s.t. [ , ] ( )2 Property 3.8 (Finite: Best Linear Unbiased Estimator (BLUE) - Gauss Markov E E | − 2 K′ Thm). [Y X] g(X,X ,...X ) < ε

← ToC 18 ′ g(X,X2,...XK ) Y X where is the BLP of given and higher orders. ′ ⊤ d (X X)di − D = i where d = βˆ( i) − βˆ Defn 3.4 (Polynomial least squares Sieve Estimator). i ks2 i For iid r.v.s (Y , X ),..., (Y , X ), the polynomial least squares sieve estimator of 1 1 n n and p is the rank of M = k and s2 =σ ˆ2. D > 1 is often interpreted as an the CEF is x i influential point. ∑Jn b ˆ k E[Y |X = x] = βkx k=1 3.7 Other Least-Squares Estimators where Defn 3.9 (). ( ) ∑n ∑Jn When the data has some very high-leverage points, Huber’s Robust Regression is an b − k alternative to OLS. β = argmin Yi bkXi ∈RJn+1 n b i=1 k=0 ∑ e 1 − ′ →∞ 1 → β = argmin ρ(yi xib) where Jn and n Jn 0. Asymptotics of the estimator allow for increasing flex- b n ibility; as n grows, so does flexibility. As long as ‘flexibility’ grows slowly relative i=1 ρ(.) to n, the estimator will be consistent. where the term is Huber’s Loss{ Function u2 |u| < c Defn 3.5 (Inference for conditional mean mb (X)). ρ(u) = 2c |u| − c2 |u| ≥ c ( )−1 mb (X) = X⊤ X⊤X X⊤y which looks like square error for ‘small’ errors and absolute error for ‘large’ er- ( ) ( )− rors. It is continuous in c and has a continuous first derivative, which helps with 2 ⊤ ⊤ 1 ⊤ 2 mb (X) = MVN Xβ, σ X X X X = MVN(Xβ, σ PX ) optimisation. c is a tuning parameter. Implemented in R using MASS:rlm. Defn 3.10 (Weighted Least Squares). 3.6 Residuals and Diagnostics Minimise WMSE ∑n Defn 3.6 (Leverage).∑ 1 − ′ n WMSE(β, ω1, . . . , ωn) = ωi(yi xiβ) Since yˆi = hijyj, hij can be interpreted as the weight associated with the n j=1 i=1 datum (x , y ). Diagonal elements h measures how much impact y has on yˆ , j j ii i i which yields the estimator and is therefore called leverage. ( ) b ⊤ −1 βW LS = X WX XWy Fact 3.13 (Variance of εˆi). − V 2 − Since εˆ = My = (I H)y, [ˆεi] = σ (1 hi). Defn 3.11 (Generalised least squares). If covariance matrix of errors is known: E(εε′|X) = Ω Defn 3.7 (Standardised and Studentised Residuals). ˆ ′ −1 −1 ′ −1 y − yˆ βGLS = (X Ω X) X Ω y estd = √i i i − s 1 hii ˆ ′ −1 −1 V(βGLS) = (X Ω X) Studentised residuals often omit the observation in question and estimate y − yˆ Defn 3.12 (Restricted OLS). estu = i√ i maximise i (−i) ′ s 1 − hii L(b, λ) = (Y − Xβ) (Y − Xβ) + 2λ(Rβ − r) Defn 3.8 (Cook’s Distance). where R and r are the restriction matrix and vector respectively. ˆ(−i) Let β be the estimate of β with (xi, yi) omitted. Cook’s distance of xi, yi is defined as

← ToC 19 3.8 Measures of Goodness of Fit 3.8.1 Model Selection • TSS = Total Sum of Squares := ∥y∥2 Defn 3.20[ (Generalisation] Error). G = E (Y − mb (x))2 2 • ESS = Explained Sum of Squares := ∥P y∥ for a new data point (x, y). This is different from in-sample training error := ∥My∥2 • RSS = Residual Sum of Squares ∑n 1 2 T = (yi − mb (xi)) ′ n ⟨y, y⟩ = ⟨yˆ + e⟩ = (Xβ + e) (Xβ + e) i=1 ′ ′ ′ = β X Xβ + e e Other terms disappear bc xβ⊥e Usually, T < G. ′ 2 ′ ′ 2 ′ (y y − ny¯ ) = β X Xβ − ny¯ + e e Defn 3.21 (Leave-one-out Cross-Validation). TSS = ESS + RSS b−(i) Let yi be the predicted value when we leave out (xi, yi) from the dataset. The Defn 3.13 (R2). LOOCV is ∑ ∑ ( ) n n ∑n ( )2 ∑n 2 ˆ − ¯ 2 ˆ − 2 1 −(i) 1 yi − ybi 2 ESS ∑i=1(Y Y ) RSS ∑i=1(Y Y ) = y − yb = R = = n = 1 − = 1 − n LOOCV i i − ¯ 2 − ¯ 2 n n 1 − Hii TSS i=1(Y Y ) TSS i=1(Y Y ) i=1 i=1 R2 since tr(H) = p + 1, the average of Hii = (p + 1)/n =: γ. Then, Defn 3.14 (Adjusted ). ( ) − ∑n 2 2 n 1 1 yi − ybi 2σb R¯2 = 1 − (1 − R2) LOOCV ≈ ≈ training error + (p + 1) n − k n 1 − γ n i=1 Defn 3.15 (Mallow’s Cp). RSS + 2(k + 1)σb2 Cp = 3.9 Multiple Testing Corrections k Fact 3.14 (Probability of rejecting any nulls when k independent true nulls are Defn 3.16 (Akaike Information Criterion (AIC)). ( ) tested). e′e 2k AIC = ln + n n • Test size = 0.05 Defn 3.17 (Bayesian Information Criterion (BIC)). k ( ) • Probability (No rejections) = 0.95 e′e k ln(n) BIC = ln + • Probability (Any (incorrect) rejections) = 1 − 0.95k→1 as k gets big n n Defn 3.18 (F statistic). Defn 3.22 (Family Wise Error Rate (FWER) adjustments). ′ 2 ′ −1 ′ −1 M0 { } R { } F Stat = (Rβˆ − r) (s R(X X) R ) (Rβˆ − r)/q Define := i : Hi is true and := i : Hi is rejected . The FWER is defined as follows Equivalently, ( ) FWER = Pr M0 ∩ R ̸= ∅ (TSS − RSS)/(k − 1) R2/(k − 1) F Stat = = ∼ Fk−1,n−k ≤ α RSS/(n − k) (1 − R2)/(n − k) We want procedures for which FWER . Defn 3.19 (Wald Statistic). τ = α/J ( ) • Bonferroni correction - If testing J hypotheses : Critical value ∂h(βˆ ) ∂h(βˆ )′ W = nh(βˆ )′ n Vˆ n nh(βˆ ) • Holms-Bonferroni stepdown method n n ∂β′ n ∂β n – Order k p-values from smallest to largest p(1), . . . p(k) 2 reject H0 if Wq > χq,1−α = F/q

← ToC 20 – If p(1) > α/k, stop. Fail to reject all. Else, Reject H(1) if p(1) < α/k. Proceed. – If p(2) > α/(k − 1), stop.

– rinse and repeat until you stop rejecting because p(j) > α/(k − (j − 1)) or all k have been rejected. • Computational methods such as Romano and Wolf (2005), Westfall and Young

These methods all modify test sizes; instead we might want an adjusted p-value for each hypothesis. Defn 3.23 (False Discovery Proportion (FDP)/ Rate (FDR)).

• setup – data X ∼ P ∈ P – nulls H1,...,Hm ⊆ P

– p− values p1(X), . . . , pm(X). Not independent. H { } • 0 = i : Hi is true Figure 2: BH Procedure visual - data-dependent slope gray line

• R = {i : Hi is rejected}; R = |R|

• V = |R ∩ H0| 3.10 Quantile Regression Notes based on Koenker (2005). V FDP = FDR = E [FDP] ≤ α R ∨ 1 Defn 3.24 (Conditional Quantile Function). Benjamini-Hochberg Procedure Instead of CEF, we may be interested in the conditional quantile function at quan- tile τ. Define the conditional CDF of yi

• Order p− values p(1) ≤ · · · ≤ p(m) F (y|xi) = Pr (Yi < y|xi) The quantile regression model assumes that the τ−th conditional quantile of yi • BH(α) rejects R hypotheses | ′ { } given xi is a parametric function of xi and is given by Qτ (τ x) = xiβτ , where βτ αr R(X) = max r : p ≤ tells us the impact of x on a conditional quantile. (r) | F−1 | m The conditional quantile function at quantile τ is Qτ (yi xi) = y (τ xi).

• Data-dependent rejection threshold Fact 3.15 (Relation to Heteroskedasticity).( ) ′ | ∼ N 2 − αR(X) Let yi = xiβ + εi ; εi xi 0, σ (xi) . The τ th conditional quantile function Reject Hi ⇔ pi ≤ =: τ(α; X) m Qτ (xi) satisfies τ = Pr (yi ≤ Qτ (xi)|xi) = F (Qτ (xi)|xi) q (X) = min {α : X α } Adjusted P-value / BH q-value i i is rejected by BH( ) . Let zτ denote the τ−the quantile of the standard normal distribution. Since p.adjust(pvals, method = "BH") Anderson (2008) adjustment - Rescale p values by number of hypotheses / p-value rank, and adjust for non-monotonicity.

← ToC 21 − ′ yi xiβ |xi ∼ N (0, 1) we have σ(xi) ( ) − ′ yi xiβ τ = Pr ≤ zτ |xi σ(xi) ≤ ′ | = Pr (Yi xiβ + σ(xi)zτ xi)

This implies that the τ−th conditional quantile of the distribution of yi is given by ′ Qτ (xi) = xiβ + σ(xi)zτ The marginal effect of xi on the τ−th quantile of yi is therefore given by

∂Qτ (xi) σ(xi) = β + zτ ∂xi xi If the errors are homoskedastic (i.e. σ(xi = σ), the effect of xi is the same for all τ ∈ (0, 1) and coincides with the effect on the conditional mean of yi. Moreover, since zτ < 0 if τ < 0.5 ∧ zτ > 0 if τ > 0.5, the contribution of the second term ∂σ(xi) has opposite effects on the upper and lower quantiles. ∂xi ˆ Defn 3.25 (Quantile Regression Estimator βτ ). b β(τ) solves E − ′ min R(β) := [ρτ (yi xiβ)] β∈Rp This objective function is piecewise linear and continuous, and differentiable ex- − ′ cept at the points at which one or more residuals yi xiβ are zero. At these points, only Gateaux derivatives exist, see details in Koenker (2005, Chap 2-3). The objective function (reframed as a linear program and solved numerically) is Figure 3: Examples of Treatment Effects on CDF, QF, and QTE from Koenker (2005) ∑N ∑N | − ′ | − | − ′ | Qn(βτ ) = τ yi xiβτ + (1 τ) yi xiβτ ′ ′ Define ∆(x) as athe ‘horizontal distance’ between F and G, such that F(x) = G(x+ i:yi≥x β i:yi

← ToC 22 The L-D quantile treatment effect is the response necessary to keep a respondent Fact 3.20 (IVs solve measurement error problem). ∃ ∗ ̸ ∗ at the same quantile under both control and treatment regimes. Zi s.t. Cov [Zi,Xi ] = 0, Cov [Zi, mi] = 0, where Zi is the instrument, X is the signal, and m is the measurement error. in the bivariate regression, ∗ ∗ 3.10.1 Interpreting Quantile Regression Models ˆ Cov [Y,Z] Cov [α + βXi + ei,Zi] βCov [Xi ,Zi] βIV = = ∗ = ∗ = β ′ Cov [X,Z] Cov [Xi + m, Zi] Cov [Xi ,Zi] In a transformed model Qh(y)(τ|X = x) = h(QY (τ|X = x)) = x β(τ), for mono- tone transforms h(.), we get 3.12 Missing Data ∂Q (τ|X = x) ∂h−1(x′β) Y = ∂xj ∂xj Defn 3.28 (Missing Data Categories).
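A sketch of quantile-regression slopes under the heteroskedastic model of Fact 3.15, y = x′β + σ(x)ε, assuming the quantreg package (Koenker's implementation of the estimator discussed above); with σ(x) = 0.5 + x the slope at quantile τ should be roughly β + z_τ.

```r
## Quantile regression slopes vary with tau under heteroskedasticity.
library(quantreg)
set.seed(8)
n <- 5e3
x <- runif(n, 0, 2)
y <- 1 + 2 * x + (0.5 + x) * rnorm(n)             # sigma(x) = 0.5 + x
fit <- rq(y ~ x, tau = c(0.1, 0.25, 0.5, 0.75, 0.9))
coef(fit)    # slope on x increases across tau, roughly 2 + z_tau
```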

∂QY (.) ′ If we specify Q , then = exp(x β)βj. log(Y ) ∂xj • Missing at Random (MAR) : missingness in xi does not depend on its value For practical purposes, suppose we observe two CDFs F and G for treated and un- but may depend on values of xj (j ≠ i) treated groups. Under randomisation, the two CDFs are identical by assumption (since the treatment was randomly administered), so the difference between their • Missing Completely at Random (MCAR): Xobs is a simple random sample is the median treatment effect. of all potentially observable data values. ignorable for likelihood inference if this is the case. 3.11 Measurement Error • Not missing at Random (NMAR) if neither of the above applies. Fact 3.17 (Classical Measurement error in the outcome does not Bias OLS). ∗ let observed y = y + uy be the true value plus noise. We estimate the regression 3.13 Inference on functions of parameters y = xβ + ε + uy. The coefficient is 3.13.1 (Nonparametric) Bootstrap Cov(y, x) Cov(x + ε + uy, x) Cov(ε + uy, x) plim (βˆ) = = = β + = β Var(x) Var(x) Var(x) For a bootstrap statistic calculated on M boostrap samples, The last equality holds iff measurement error in y is orthogonal to x. This means ∑M m ∗ h(β) this rarely holds in practice. E [h (β )] ≈ M Also means OLS estimates are more imprecise. The (homoscedastic) variance is m=1 − now (X′X) 1 [σ2 + σ2 ]. and further moments. ε uy ˆ Key idea: sample from Empirical CDF∫ Fn(x), (first approximation), then approxi- Defn 3.27 (Classical measurement error / Error in variables). mate MeanF (Y ) ≈ Mean ˆ (Y )) = ydFˆ(y). True data X∗ is measured with error, X. Let y = X∗β +U, and X = X∗ +V . Then, F y = Xβ + (u − V β). The error term is correlated with X through measurement ∑M h (Y ) h (Y ) ≈ h (Y ) ≈ m error V , so OLS is inconsistent. F Fˆ M m=1 Example 3.18 (Measurement Error with a Scalar Regressor).( ) General Bootstrap Algorithm ∗ 2 ∼ N 2 True regressor: x , variance σx∗ measured with v 0, σv . Then, we under- estimate the true coefficient [Attentuation Bias] ∗ ∗ ( ) 1. Given data w1, . . . , wN , draw a bootstrap sample of size N, denoted as w1, . . . , wN 2 − ˆ σx∗ 1 s ˆ ∗ ∗ ∗ plim β = 2 2 β = β 2. compute test-statistic h(θ)n(w1, . . . , wN ) σx∗ + σv 1 + s 2 2 where s = σv/σx∗ is the noise-to-signal ratio. Repeat steps (1) and (2) B independent times, obtaining B bootstrap replications ˆ of the statistic θn. Compute quantiles / variance of empirical distribution of Fact 3.19 (Measurement error with correlated regressors). (1) (M) . Adding correlated errors makes attenuation bias from measurement error worse. h(β)N , . . . , h(β)N .

← ToC 23 Example 3.21 (Bootstrap Standard Error).

$$s^2_{\hat{\theta},\text{Boot}} = \frac{1}{B-1}\sum_{b=1}^{B}\left(\hat{\theta}^*_b - \bar{\theta}^*\right)^2 \quad \text{where} \quad \bar{\theta}^* = B^{-1}\sum_{b=1}^{B}\hat{\theta}^*_b$$

Defn 3.29 (Asymptotically Pivotal Statistic).
A statistic whose limit distribution does not depend on unknown parameters is said to be asymptotically pivotal. Estimators are generally not asymptotically pivotal, while standard normal or chi-squared test statistics typically are.

Defn 3.30 (Cluster Wild Bootstrap).
With a 'small' number of clusters, conventional clustered bootstrap errors yield over-optimistic variances. CGM algorithm:
• Estimate the main model imposing the null, e.g. to test the statistical significance of a single variable, regress $y_{ig}$ on all components of $x_{ig}$ except the variable that has coefficient zero under the null. Construct $\tilde{u}_{ig} = y_{ig} - x_{ig}'\tilde{\beta}_{H_0}$
• for each resampling
  – Randomly assign cluster $g$ the weight $d_g = 1$ w.p. $0.5$ and $d_g = -1$ w.p. $0.5$. All observations in cluster $g$ get the same Rademacher weight
  – Generate new pseudo-residuals $u^*_{ig} = d_g \times \tilde{u}_{ig} \implies y^*_{ig} = x_{ig}'\tilde{\beta}_{H_0} + u^*_{ig}$
  – Regress $y^*_{ig}$ on $x_{ig}$ [not imposing the null] and compute $w^*$ [the t-stat with clustered SEs]
• The p-value for this test is the proportion of times that $|w^*_b| > |w|$ [where $|w|$ is the original sample statistic]

3.13.2 Propagation of Error / Delta Method

Propagation of error for generated quantities: say we estimate $\hat{\theta}$, which is a function of some intermediate quantities $\hat{\phi}_1, \ldots, \hat{\phi}_p$, which are themselves estimated. Differences in group means and marginal effects in nonlinear models are examples of such generated quantities.
Since $\theta = f(\phi_1, \ldots, \phi_p)$, we derive standard errors for the generated quantity by writing a Taylor expansion:
$$f(\hat{\phi}_1, \ldots, \hat{\phi}_p) \approx f(\phi^*_1, \ldots, \phi^*_p) + \sum_{i=1}^{p} \left.\frac{\partial f}{\partial \phi_i}\right|_{\phi = \hat{\phi}} (\hat{\phi}_i - \phi^*_i)
\approx f(\phi^*_1, \ldots, \phi^*_p) + \sum_{i=1}^{p} (\hat{\phi}_i - \phi^*_i) f'(\hat{\phi})
\approx \theta^* + \sum_{i=1}^{p} (\hat{\phi}_i - \phi^*_i) f'_i(\hat{\phi})$$
The variance for $\hat{\theta}$ can be written using a general analogue to $\mathbb{V}[a + bX + cY] = b^2\mathbb{V}[X] + c^2\mathbb{V}[Y] + 2bc\,Cov[X, Y]$. Allowing for covariance between any two parameters $\hat{\phi}_i, \hat{\phi}_j$ in the vector, we can write the variance as
$$\mathbb{V}[\hat{\theta}] \approx \sum_{i=1}^{p}\left(f'_i(\hat{\phi})\right)^2 \mathbb{V}[\hat{\phi}_i] + 2\sum_{i=1}^{p-1}\sum_{j=i+1}^{p} f'_i(\hat{\phi})\, f'_j(\hat{\phi})\, Cov[\hat{\phi}_i, \hat{\phi}_j]$$
The second term is zero if the quantities are uncorrelated.

General statement: works by considering a Taylor expansion of the quantity of interest $h(x_i, \theta)$:
$$h(z) \approx h(z_0) + h'(z_0)(z - z_0) + o(\|z - z_0\|)$$
If $\tau = h(\beta)$ and $h'(\theta) \neq 0$, by Slutsky's theorem,
$$h(\beta^*) \xrightarrow{d} MVN\left(h(\beta),\ \nabla h(\beta^*)'\, \Sigma\, \nabla h(\beta^*)\right)$$
where the gradient is evaluated at the MLE estimates and $\Sigma$ is the covariance matrix of the MLE, with
$$\nabla h(\beta^*) = \left(\left.\frac{\partial h(\beta)}{\partial \beta_1}\right|_{\beta^*}, \left.\frac{\partial h(\beta)}{\partial \beta_2}\right|_{\beta^*}, \ldots, \left.\frac{\partial h(\beta)}{\partial \beta_K}\right|_{\beta^*}\right)$$
For the scalar case,
$$\frac{\hat{\tau}_n - \tau}{\widehat{se}(\hat{\tau})} \xrightarrow{d} \mathcal{N}(0, 1) \quad \text{where} \quad \widehat{se}(\hat{\tau}_n) = g'(\hat{\theta})\, \widehat{se}(\hat{\theta}_n)$$

3.13.3 MVN Simulations

Assuming the point estimates and variances of the parameters are correct, draw from the distribution of the parameters and construct the quantity of interest (e.g. a marginal effect) for each draw. For m of M simulations,

← ToC 24 ( ) m ∗ −1 4 Causal Inference β ∼ MVN β ,IN (β) h(β)m = h (βm) 4.1 Identification Then, average them or take their quantiles A data generating process(DGP) is a complete specification of the stochastic process generating the observed data. Knowledge of the DGP allows one to compute the ∑M f (h(β)m) likelihood of any realisation of the data but is conceptually distinct since it provides f [h (β∗)] = M a description of the structure (/ mechanism) that gives rise to the distribution. m=1 A Model M is a family of theoretically possible DGPs. A model can be 1. fully parametric: indexed by a finite number of parameters 2. non-parametric: indexed by an infinite dimensional parameter (ie an unknown function) 3. semi-parametric: indexed by a finite-dimensional vector of parameters and an infinite-dimensional nuisance function Example 4.1 (OLS as a semiparametric model). The simplest semiparametric model is ′ yi = xiβ + εi E [εi|xi] = 0 β ∈ Rk This model is deemed ‘semi-parametric’ because it contains a finite dimensional parameter β and an infinite dimensional joint distribution for εi, xi left unspecified other than for the conditional mean assumption E [εi|xi] = 0. Example 4.2 (Index Models - Canonical semi-parametric models). ′ yi = g(xiβ) + εi E [εi|xi] = 0 β ∈ Rk g(.): R→R is monotone increasing { } where we are interested in β. Functions g(.), Fε|X (.), FX are nuisance Example 4.3 (Generic Non-parametric model).

$$Y_i = g(x_i, \varepsilon_i) \qquad x_i \perp \varepsilon_i \qquad g(\cdot, \cdot): \mathbb{R}^2 \to [0, 1] \text{ is monotone increasing in both arguments}$$

Unrestricted marginals of $\varepsilon_i, x_i$. Interested in the function $g(\cdot, \cdot)$, or functionals of it such as $h(x) = \mathbb{E}_\varepsilon\left[g(x_i, \varepsilon_i)\right]$.
Most models can be written in the form
$$Y_i = g(U_i), \quad U_i \overset{iid}{\sim} F_U(u)$$

← ToC 25 where Yi is a vector of observables, Ui is a vector of unobserved r.v.s, with dis- 4.2 Experiments tribution function FU (u) and g(.) is a vector-valued function. We call the pair of functions 4.2.1 Potential Outcomes (Rubin 1974, Holland 1986)

θ := (g(.), FU (·)) Athey and G. Imbens (2016b) and G. W. Imbens and D. B. Rubin (2015) Y is the observed outcome, D is the treatment the structure. There is a 1:1 mapping between a particular DGP and a particular i i M We observe choice of structure. A model space can be represented in terms of a family Θ of Yi = Di · Y1i + (1 − Di) · Y0i structures. Notationally, define a potential outcome Defn 4.1 (Identification). d While structure θ uniquely identifies the distribution of observed variables, the re- Yi = φ(d, Xi, Ui) verse isn’t necessarily true. Identification is the study of which structures are consis- where Xi is a vector of observed covariates and Ui is a vector of unobservables, tent with the joint distribution of observed variables. Let Fy(y denote the distribu- and φ is an unknown measurable function. Typically, we are interested in non- tion function governing the observed variables and Fθ(y) denote the distribution parametric identification of φ or some features of it. function implied by a particular structure θ. The identified set of structures is Defn 4.5 (Fundamental Problem of Causal Inference). Ω(FY , Θ) = {θ ∈ Θ: Fθ(.) = Fy(.)} We never see both potential outcomes for any given unit. point identified Ω(F , Θ) E | − E | E | − E | The structure is if Y is a singleton. | [Yi Di = 1] {z [Yi Di = 0]} = | [Y1i Di = 1] {z [Y0i Di = 1]} Defn 4.2 (Observational Equivalence). observed effect ATT ′ ′′ E | − E | Two structures θ , θ are said to be observationally equivalent if + | [Y0i Di = 1] {z [Y0i Di = 0]} k Fθ′ (y) = Fθ′′ (y) ∀y ∈ R Selection Bias E − E E | − E | i.e. the distribution function implied by the two structures is identical. . The struc- = [Y1] [Y0] + [Y0i Di = 1] [Y0i Di = 0] + ′ | {z } | {z } ture θ is globally point identified if there is no other θ in the model space with ATE Selection Bias which it is observationall equivalent. − − |(1 π)(AT{z T AT U}) Defn 4.3 (Partial Identification). Heterogeneous Treatment Bias e ∈ F If for some θ Θ, Ω( θe(.), Θ) is a subset of the family of Θ but not a singleton, e where π = E [D] is the share of the sample treated. the structure θ is said to be partially identified because some (but not all) competing structures have been ruled out. The identified set for a feature µ(θ) can be written Assumption 1 (Identification Assumption: Complete Randomisation). as { } (Y1i,Y0i) ⊥⊥ Di µ(θ): θ ∈ Ω(Fe(.), Θ) θ This is a Missing Completely at Random (MCAR) assumption on potential out- comes. Defn 4.4 (Ceteris Paribus Effect). Consider the model Yi = f(Xi,Ui; ϕ) where (Yi,Xi) are observed scalars, Ui is an Assumption 2 (Stable Unit Treatment Value Assumption (SUTVA)). unobserved scalar, ϕ is a parameter vector, and f(., .; ϕ) is a function. The model Potential outcomes for any unit do not vary with the treatment assigned to other implies a set of counterfactual values f(x, u) that the outcome Yi would take under units. In practice, this is equivalent to a no spillovers assumption. various realisations of the random variables X ,U . The causal effect of changing X i i i (Y ,Y ) ⊥⊥ D− from x′ to x′′ for individual i can be written as 1i 0i i ′′ ′ ′′ ′ D N Y (D) ∆ (x , x ) = f(x ,U , ϕ) − f(x ,U ; ϕ) Equivalently, let denote a treatment vector for units, and be the poten- i i i tial outcome vector that would be observed if was based on allocation D. Then, ′ If we can identify ϕ and establish that Xi ⊥⊥ Ui, we can identify a distribution of D, D ′′ ′ SUTVA requires that for allocations , causal effects δi(x , x ). 
$$Y_i(\mathbf{D}) = Y_i(\mathbf{D}') \quad \text{if} \quad D_i = D_i'$$
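To make the selection-bias decomposition above concrete, here is a small simulated example (all numbers are invented for illustration) in which the treated have higher $Y_0$, so the naive difference in means overstates the ATT.

```r
# Simulated illustration of: observed difference = ATT + selection bias
set.seed(7)
n  <- 1e5
u  <- rnorm(n)                      # latent characteristic driving selection
d  <- rbinom(n, 1, plogis(u))       # treatment take-up depends on u (self-selection)
y0 <- 1 + u + rnorm(n)              # untreated potential outcome
y1 <- y0 + 2                        # constant unit-level treatment effect of 2
y  <- ifelse(d == 1, y1, y0)        # observed outcome

naive <- mean(y[d == 1]) - mean(y[d == 0])       # E[Y|D=1] - E[Y|D=0]
att   <- mean(y1[d == 1] - y0[d == 1])           # ATT (knowable only in simulation)
selection_bias <- mean(y0[d == 1]) - mean(y0[d == 0])

round(c(naive = naive, ATT = att, selection_bias = selection_bias), 3)
# naive is approximately ATT + selection_bias
```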

← ToC 26 4.2.2 Treatment Effects Including controls: ′ Estimands Yi = α + τDi + Xiβ + ηi Corrects for chance covariate imbalances, improves precision by removing variation • τAT E := E (Y1i − Y0i) in outcome accounted for by pre-treatment characteristics. τ := E (Y − Y |D = 1) • AT T 1i 0i i Fact 4.5. Freedman (2008) Critique Under randomisation, τAT E = τAT T , since the treated are a random sample of the Regression of the form Yi = α + τregDi + β1Xi + ϵi population. Under weak(er) assumption of Y0i ⊥⊥ Di, only τAT T is identified. • τˆreg is consistent for ATE but has small sample bias (unless model is true); 4.2.3 Regression Formulation bias is on the order of 1/n

Yi = α + τREGDi + ηi • τˆreg precision does not improve through the inclusion of controls; including ( {( ) [( ) ( )]} controls is harmful to precision if more than 3/4 units are assigned to one = Y + Y − Y )D + Y − Y + D · Y − Y − Y − Y |{z}0 | 1{z 0} i | i0 0 i {zi1 1 i0 0 } treatment condition α τ η REG i Theorem 4.6 (Lin (2013) fix / response to Freedman critique). Recommends fitting • α = E[Y0i] Yi = α + τinteractDi + β · (Xi − X¯) + β2 · Di · (Xi − X¯) + ϵi τ = E [Y − Y ] • 1i 0i which has same small sample bias, but cannot hurt asymptotic precision even if

• ηi = Y0i − E(Y0i) [extra terms above come from allowing for heterogeneous the model is incorrect and will likely increase precision if covariates are predictive TEs] of the outcomes.
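A hedged sketch contrasting the additive-controls regression criticised by Freedman with the Lin (2013) interacted specification discussed above; the DGP is simulated and the variable names are made up. In practice one would pair these with heteroskedasticity-robust (HC2/HC3) standard errors, e.g. via the sandwich package.

```r
# Sketch: difference in means vs. additive controls vs. Lin (2013) interacted estimator
set.seed(3)
n <- 500
x <- rnorm(n)
d <- rbinom(n, 1, 0.5)                        # completely randomised treatment
y <- 1 + 0.5 * x + 1 * d + 0.3 * d * x + rnorm(n)

xc <- x - mean(x)                             # centre covariates before interacting

fit_unadj    <- lm(y ~ d)                     # difference in means
fit_additive <- lm(y ~ d + xc)                # additive controls (Freedman critique)
fit_lin      <- lm(y ~ d * xc)                # treatment interacted with demeaned covariates

# with centred x, the coefficient on d in fit_lin estimates the ATE
sapply(list(unadj = fit_unadj, additive = fit_additive, lin = fit_lin),
       function(m) coef(m)["d"])
```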

Selection bias: Cov(Di, ηi) ≠ 0 4.2.4 Difference in Means Example 4.4 (Matrix formulae for randomized regression). Suppose 50 percent of the population gets the treatment. Let Xi = [Di 1]. Then, Defn 4.6 (Difference in Means point estimate). [ ] [ ] [ ] ( ) N N N − 2 − ∑N · − · ∑ ∑ X′X = 2 2 = 1 1 =⇒ (X′X) 1 = 2 1 b 1 D Y1 − (1 D) Y0 1 − 1 − N N 1 2 −1 1 τ = = DiYi (1 Di)Yi 2 2 N N N1/N N0/N N1 N0 i=1 N N Similarly, [ ∑ ] Defn 4.7 ((Conservative) Variance estimator for difference in means). ′ ∑ T Y∑i √ X y = ( ) 1 T Yi + c Yi \ \ 2 2 2 σb | σb | d V ar [Y1i] V ar [Y0i] Y Di=1 Y Di=0 [ ] SE [ = + = + Therefore AT E N N N N − ¯ − ¯ 1 0 1 0 βˆ = (X′X) 1 X′y = YT Yc Y¯c τˆ t := Generalise to p fraction treated d SEAT[ E VCV under ( ) ( ) [ ] ∑ ∑ 1 1 −p 2 2 4.2.5 Randomisation Inference − uˆ + uˆ p(1 − p)N(N − 2) p p T C Defn 4.8 (Fisher’s Exact Test). sharp null: Y = Y ∀i. Implies H : E [Y ] = E [Y ]; H : E [Y ] ≠ E [Y ]. VCV under heteroskedasticity 1i 0i 0 i 0 1 i 0 ( )[ ∑ ∑ ∑ ] To test sharp null, set Y1 = Y0 for all units and( re-randomize) treatment. Complete 1 − 2 2 2 2 − 2 2 2N (1 p) T ∑uˆ + p C uˆ p ∑C uˆ randomisation of 4N units with N treated. assignment vectors. P value can 2 2 − 2 2 − 2 2 ( ) N p (1 − p) N(N − 2) p C uˆ p C uˆ 1/ be as small as 2NN .

Ω is the full set of randomisation realisations, and ω is an element in the set, with associated probability $1/\binom{2N}{N}$.
$$\mathbb{V}[\hat{\tau}_{BR}] < \mathbb{V}[\hat{\tau}_{CR}] \Leftrightarrow \frac{SSR_{\hat{\epsilon}^*}}{n-k-1} < \frac{SSR_{\hat{\epsilon}}}{n-2} \qquad \text{(from the blocking comparison below)}$$
$$\text{P-value}: \quad \Pr\left(\hat{\alpha}(\omega) \geq \hat{\tau}_{ATE}\right)$$
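A minimal sketch of the Fisherian randomisation test above: re-randomise treatment under the sharp null and compare the observed statistic with the randomisation distribution. The data are simulated, and a large random subset of assignments stands in for enumerating all of Ω.

```r
# Randomisation inference for the sharp null Y1i = Y0i
set.seed(11)
N <- 50                                          # 2N units, N treated
y <- rnorm(2 * N) + rep(c(0.4, 0), each = N)     # treated outcomes shifted by 0.4
d <- rep(c(1, 0), each = N)

tau_obs <- mean(y[d == 1]) - mean(y[d == 0])

R <- 10000                                       # random draws from the assignment set
tau_perm <- replicate(R, {
  d_star <- sample(d)                            # re-randomise under the sharp null
  mean(y[d_star == 1]) - mean(y[d_star == 0])
})

p_value <- mean(tau_perm >= tau_obs)             # one-sided: Pr(alpha_hat(omega) >= tau_hat)
c(tau_hat = tau_obs, p = p_value)
```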

4.2.7 Power Calculations 4.2.6 Blocking [ ] σ2 σ2 V ¯ − ¯ ≈ 1 0 Stratify randomisation to ensure that groups start out with identical observable Basic idea: With large enough samples, Y1 Y0 pN + (1−p)N [where p = characteristics on blocked factors. N1/N is the share of sample treated]. Set p to minimise overall variance. Yields SSRb∗ SSR ∗ b b ε εˆ ˆ ∗ σ1 1 1 V [τBR] < V [τCR] if n−k−1 < n−2 where ϵˆ and ϵ are errors from specification p = . With homoskedasticity, this is Treatment, control. σ1+σ0 2 2 omitting and including block dummies respectively. For J blocks, Defn 4.9 (Power Function). Point estimate τ = µ1 − µ0 () ∑J N Test for τ > (t − + t SE(βˆ). τˆ = j τˆ 1 κ α/2 B N j For common variance σ, j=1 ( √ ) ( ( √ )) τ N τ N Variance Randomisations within each block are independent, so the variances are π = Pr (|t| > 1.96) = Φ −1.96 − + 1 − Φ 1.96 − simple means (with squared weights). 2σ 2σ ( ) ∑J 2 N This yields Var (ˆτ ) = j Var (ˆτ ) B N j j=1 Defn 4.10 (minimum detectable effect). Common variance (assumed) Regression Formulation √ 2 ∑J σ MDE(τ) = Mn−2 yi = τDi + βj · Bij + ϵi Np(1 − p) j=2 where Mn−2 = t(1−α/2) + t1−κ = Critical t-value to reject null + t-value for alterna- If treatment probabilities vary by block, then weight by tive (where 1 − κ) is power. General case with unequal variances ( ) ( )      1 1 − wij = Di + − (1 Di) δ δ pij 1 pij π = Φ 1.96 − √  + 1 − Φ 1.96 − √  2 2 2 2 σ1 σ0 σ1 σ0 Efficiency Gains from Blocking pN + (1−p)N pN + (1−p)N

• Complete Randomisation : Yi = α + τCRDi + ϵi Fact 4.7 (Typical MDE for α = 0.05, κ = 0.8, N large). ∑ MDE ≈ (0.84 + 1.96)SE(ˆτ) ≈ 2.8SE(ˆτ) ⇔ J ∗ • Block Randomisation: Yi = α + τBRDi + j=2 βjBij + ϵi Rearrange to get necessary sample size for any given hypothesised MDE and ex- pected variance. ∑ ( ) 2 n b2 2 σε 2 i=1 εi SSRεb 2 1 σ Var [τbCR] = ∑ ( ) with σb = = N = (z1−κ + z ) · · n − 2 ε n − 2 n − 2 α/2 p(1 − p) 2 i=1 Di D MDE ∑ 2 n ∗2 σ ∗ εb SSRb∗ MDES for Blocking b ε b2 i=1 i ε Var [τBR] = ∑ ( ) ( ) with σε∗ = = √ n 2 2 n − k − 1 n − k − 1 Di − D 1 − R − 2 i=1 j 1 RB MDE(τ ) = M − − 2 2 ≈ BR n k 1 − Where Rj is the fit from regressing D on all Bj dummies. Since Rj 0 by ran- Np(1 p) domisation, 2 where RB is the R-squared from regressing Y on block dummies.
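A small sketch translating the MDE and sample-size formulas above into code; the default values of alpha, power, sigma, and the treated share p are the worked-example values and are assumptions for illustration.

```r
# Minimum detectable effect and required sample size for a two-arm trial
mde <- function(N, p = 0.5, sigma = 1, alpha = 0.05, power = 0.8) {
  M <- qnorm(1 - alpha / 2) + qnorm(power)        # ~ 1.96 + 0.84 = 2.8
  M * sqrt(sigma^2 / (N * p * (1 - p)))
}

# required N for a hypothesised effect size tau (rearranging the MDE formula)
n_required <- function(tau, p = 0.5, sigma = 1, alpha = 0.05, power = 0.8) {
  M <- qnorm(1 - alpha / 2) + qnorm(power)
  ceiling(M^2 * sigma^2 / (tau^2 * p * (1 - p)))
}

mde(N = 1000)           # smallest effect detectable with ~80% power at N = 1000
n_required(tau = 0.2)   # sample size needed to detect an effect of 0.2 SD
```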

← ToC 28 Fact 4.8 (Required Sample Size for rejection probability β, size α, treatment Standard difference in means estimate yields share γ, effect size τ). To test H : E [Y (1) − Y (0)] = 0 against the alternative, we look at the T Statistic βITT = E [Y |D = 1]−E [Y |D = 0] = β+E [U|D = 1,V ≥ −δ2 − γ] − E [U|D = 0,V ≥ δ2] 0 i i ( ) | {z } obs obs Y − Y τ Selection Bias T = √ t c ≈ N √ , 1 2 2 S2/N + S2/N σ /Nt + σ /Nc Defn 4.11 (Manski Bounds). y t y c Assume bounded support for the outcome. Replace missing values with maxi- α/2 mum or minimum of support. Usually yields CIs that are too wide. Inverting this for size gives us a required( sample size ) −1 −1 Φ (β) + Φ (1 − α/2) Defn 4.12 (Lee Bounds). Required Sample Size = N = (τ/σ)2 · γ · (1 − γ) Monotonicity Assumption: treatment can only affect sample selection in one di- rection. typically, β = 0.8, α = 0.05, γ = 0.5, so by subtitution ( ) Each individual characterised by − − Φ 1(0.8) + Φ 1(0.975) ∗ ∗ N = • Y1 ,Y0 potential outcomes 2 · 2 (τ/σ) 0.5 ∗ ∗ • S1 ,S0 potential outcomes for attrition ∗ ∗ − 4.2.8 Partial Identification – Observed in sample when S = S1 D + S0 (1 D) = 1 ∗ ∗ the ATE can be decomposed as – never attritors: S1 = S0 = 1 – marginal types: S∗ = 1,S∗ = 0. z }|PI { z Observed}| { 1 0 E | ∗ ≥ ATE = E [Yi(1)|Di = 1] Pr (Di = 1) + E [Yi(1)|Di = 0]Pr (Di = 0) Then [Y D = 1,Z 0] = − E [Y (0)|D = 1]Pr (D = 1) + E [Y (0)|D = 0] Pr (D = 0) − E ∗| ≥ − E ∗| − − ≤ − i i i | i {z i } i (1 p) | [Y D ={z 1,V δ2}] +p | [Y D = 1, δ{z2 γ V < δ2}] PI outcome for outcome among marginals never attriters p = Pr (−δ − γ ≤ V < −δ ) /Pr (V ≥ −δ − γ) The terms in red are counterfactual outcomes for which the data contains no in- where 2 2 2 formation. Bounding approaches involve estimators for these missing quantities. Theorem 4.9 (Kolmogorov’s Conjecture - Sharp bounds on treatment effects). 4.3 Selection On Observables Let τi := Y1i − y0i denote the treatment effect and F denote its distribution, and let 4.3.1 Regression Anatomy / FWL F1, F0 denote the distributions of outcomes for the two potential outcomes. Then, FL ≤ F ≤ FU Cov(Yi, x˜ki) (b) (b) (b) where βk = { } V (˜xki) L where x˜ki is the residual from a regression of xki on all other covariates. F (b) = max max F1(y) − F0(y − b), 0 y { } Fact 4.10 (Omitted-Variables Bias Formula). ′ ′ U If structural (long) equation is Yi = α + τDi + W γ + ϵi, with W vector of unob- F (b) = 1 + min min F1(y) − F0(y − b), 0 i i y served, and we estimate short Yi = α+ρDi +ϵi, then we can write the specification ′ as y = τDi + |Wi {zγ + }ϵ

Attrition as Selection Bias Notional : business training (D), Profits νi ≥ (Y ). We only observe profits for businesses that still exist Z 0. True model Cov [Yi,Di] ′ ∗ ρ = V = τ + γ δWD Y = βD + δ1 + U [D] ∗ Z = γD + δ2 + V equivalently, ( ) Y = 1Z∗≥0 −1 ′ −1 −1 ′ plim τˆOLS = τ + δγ = τ + plim [ N D D N D W ]γ

← ToC 29 Coefficient in Short Regression = Coefficient in long regression + effect ofomit- Define conditional response surfaces as ted × regression of omitted on included µ(d)(x) := E [Yi|Xi = x, Di = d] First pass regression adjustment estimator (using OLS) 4.3.2 Treatement Effects under Unconfoundedness 1 ∑n [ ] τb = µb (X ) − µb (X ) n (1) i (0) i Assumption 3 (Conditional Independence and Overlap). i=1 where µb(d)(x) is obtained via OLS. This generically doesn’t work for regularised • unconfoundedness / Selection on Observables / Ignorability: (Y0,Y1) ⊥⊥ D|X regression. • common support 0 < P r(D = 1|X) < 1 Fact 4.11 (Consistency of Regression estimation of ATE). Additional Assumptions for consistent estimate of ATE from OLS: Weakening assumptions to uncover ATT only 1) Constant treatment effects • unconfoundedness for controls: (Y0) ⊥⊥ D|X 2) Outcomes linear in X • weak overlap: P r(D = 1|X) < 1 =⇒ τ will provide unbiased and consistent estimates of ATE. Given unconfoundedness, we can write : • (2) fails - τOLS is Best Linear Approximation of average causal response func- E [Y1 − Y0|X] = E [Y1 − Y0|X,D = 1] tion E [Y |D = 1,X] − E [Y |D = 0,X]. = E [Y |X,D = 1] − E [Y |X,D = 0] Estimands: • (1) fails - τOLS is conditional variance weighted average of underlying τs. Discrete Case (x has finite values indexed by 1, . . . , m ( ) ∑m E 2 σD|X τX τAT E = (E[Y |D = 1,X = xk] − E[Y |D = 0,X = kk])P r(X = xk) ( ) E 2 k=1 σD|X ∑m ∑m E | − E | τAT T = (E[Y |D = 1,X = xk] − E[Y |D = 0,X = kk]) Pr (X = xk|D = 1) τOLS = ( [Y X = xk,D = 1] [Y X = xk,D = 0]) wk k=1 k=1 Estimation where V [(D|X = x ] Pr (X = x ) w = ∑ k k 4.3.3 Regression Adjustment k m V | r=1 [D X = xr] Pr (X = xr) A single regression with controls X is potentially problematic because of Simp- τOLS weighs up groups where the size of the treated and untreated population are son’s paradox. To account for this in a parametric setup, assume a set of iid subjects roughly equal, and weighs down groups with large imbalances in the size of these i = 1, . . . n we observe a tuple (Xi,Yi,Di), comprised of two groups. τOLS is true effect IFF constant treatment effects holds. p • feature vector Xi ∈ R 4.3.4 Subclassification • response Yi ∈ R

• treatment assignment Di ∈ {0, 1} Weighted combination of K subclasses, which partition the population ( ) ∑K ( ) k k k N τˆ = Y − Y · AT E 1 0 N k=1
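A hedged sketch of the regression-adjustment (imputation) estimator introduced above, $\hat{\tau} = n^{-1}\sum_i [\hat{\mu}_{(1)}(X_i) - \hat{\mu}_{(0)}(X_i)]$: fit separate outcome regressions by arm and average the predicted contrasts. The simulated data and the use of plain OLS for the response surfaces are assumptions for illustration.

```r
# Regression adjustment estimator for the ATE under unconfoundedness
set.seed(5)
n <- 2000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(0.8 * x))             # treatment depends on x (selection on observables)
y <- 1 + x + d * (1 + 0.5 * x) + rnorm(n)
dat <- data.frame(y, d, x)

mu1 <- lm(y ~ x, data = subset(dat, d == 1))   # conditional response surface for treated
mu0 <- lm(y ~ x, data = subset(dat, d == 0))   # ... and for controls

tau_hat <- mean(predict(mu1, newdata = dat) - predict(mu0, newdata = dat))
tau_hat                                         # true ATE in this DGP is 1 + 0.5 * E[x] = 1
```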

← ToC 30 ( ) For matching without replacement, ∑K ( ) k k k N ( ) − · 1 ∑ ∑M 2 τˆAT T = Y 1 Y 0 1 1 N1 b2 − − ˆ k=1 σAT T = Yi Yjm(i) δAT T NT M Di=1 m=1 4.3.5 Matching For matching with replacement, X τ ( ) When is continuous, we can estimate AT T by ‘imputing’ potential outcome ∑ ∑M 2 using the observed outcome from ‘closest’ control unit. b2 1 − 1 − ˆ σAT T = Yi Yjm(i) δAT T + NT M Defn 4.13 (Matching Estimators). Di(=1 m)=1 ∑ ( ) ∑ − 1 − 1 Ki(Ki 1) V | τˆAT T = Yi Yj(i) 2 [ϵ Xi,Di = 0] N1 NT M D=1 Di=0 Or use average of M closest matches. K i { ( )} where i is the number of times observation is used in a match, and the last ∑ ∑M error variance term is estimated by matching also. the bootstrap doesn’t work for 1 1 matching. τˆAT T = Yi − Yjm(i) N1 M D=1 m=1 Bias-corrected (Abadie-Imbens) 4.3.6 Propensity Score ∑ ( ) ( ( )) \ 1 π(X) := Pr (D = 1|X). Can be estimated π(X) using various parametric and τˆAT T = Yi − Yj(i) − µˆ0 (Xi) − µˆ0 Xj(i) N1 non-parametric regression methods. D=1 PScore is a balancing score - Conditioning on Propensity score is equivalent to | Where µ0(x) = E[Y X = x, D = 0] is the population regression function under conditioning on covariates: the control. Pr (D = 1|Y ,Y , π(X)) = Pr (D = 1|π(X)) = π(X) Metrics 0 1 • Euclidian Distance Equivalently, palancing property of propensity score (Rosenbaum and D. B. Rubin, √ 1983) ∥ − ∥ − ′ − Xi Xj = ED (Xi,Xj) = (Xi Xj) (Xi Xj) Yi(0),Yi(1) ⊥⊥ Di|Xi ≡ Yi(0),Yi(1) ⊥⊥ Di|π(Xi) • Stata diagonal distance Defn 4.15 (Weighting on the Propensity Score: Horvitz-Thompson Estimands). √ [ ] [ ] ( )− ∗ D − π(X ) Y D Y (1 − D ) ′ 1 IPW E · i E i i − i − b − τ ate = Y = StataD (Xi,Xj) = (Xi Xj) diag ΣX (Xi Xj) π(Xi)(1 − π(Xi)) π(Xi) (1 − π(Xi)) Analogous estimand for ATT, with population[ analogue] as estimator where the normalisation factor is the diagonal element of Σˆ, the estimated 1 D − π(Xi) variance covariance matrix. τatt = E Y · π(Xi) (1 − π(Xi) • Mahalanobis distance (scale-invariant) √ Defn 4.16 (Inverse Probability-Weighted Estimator). ′ −1   MD (Xi,Xj) = (Xi − Xj) Σ (Xi − Xj)   ( ) 1 ∑n  Y D Y (1 − D )  1 ∑n D 1 − D Where Σˆ is the variance-covariance matrix. τbate =  i i − i  = Y i − i ipw n πˆ(X ) (1 − πˆ(X )) n i πb(X ) 1 − πb(X ) Defn 4.14 (Variance Estimators for Matching). i=1 | {z i } | {z i } i=1 i i E E Matching estimators have a normal distribution in large samples provided that [Y1] [Y0] bias is small.
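Continuing the simulated-example pattern, a minimal sketch of the Horvitz-Thompson weighting estimand in Defn 4.15/4.16, with the propensity score fit by a logit (an assumption; any estimator of $\pi(X)$ could be substituted).

```r
# Inverse probability weighting for the ATE
set.seed(6)
n <- 5000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(0.8 * x))
y <- 1 + x + 1 * d + rnorm(n)                   # true ATE = 1

ps <- fitted(glm(d ~ x, family = binomial))     # estimated propensity score pi_hat(X)

# Horvitz-Thompson form: E[YD/pi] - E[Y(1-D)/(1-pi)]
tau_ht <- mean(y * d / ps) - mean(y * (1 - d) / (1 - ps))

# Hajek (normalised-weights) variant, usually more stable
w1 <- d / ps
w0 <- (1 - d) / (1 - ps)
tau_hajek <- weighted.mean(y, w1) - weighted.mean(y, w0)

c(HT = tau_ht, Hajek = tau_hajek)
```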

← ToC 31 Horvitz-Thompson Estimator as Regression ∑ Yi = α + τDi + ϵi with weights √ wicri(Xi) = mr r ∈ 1,...,R − i:D=0 Di − 1 Di ∑ λi = ≥ ∀{ } π(Xi) 1 − π(Xi) wi = 1 and wi 0 i : di = 0 i:D=0 Defn 4.17 (Overlap Weights (Li, Morgan, and Zaslavsky, 2018)). n p ∝ The above problem is convex but has dimensionality of 0 (nonnegativity) + (mo- Sample drawn from f(X), and can represent a target population as g(X) f(X)h(X) 1 · ment conditions) + (normalisation). The dual, on the other hand, only has di- where h( ) is the tilting function. mensionality p + 1 and unconstrained, which is considerably easier to solve using A Weighted Average Treatment Effect (WATE) over a target population g is Newton-Raphson-like techniques. E − h h − h [h(X)[µ1(X µ0(X))]] E − τ = τ1 τ0 = = g [Y (1) Y (0)] E [h(X)] 4.3.7 Double Robust Estimators | ∝ ∝ Define fd(x) = Pr (X = x D = d), which gives f1(x) f(x)π(x); f0(x) A Double Robust estimate is consistent if one gets either the propensity score πˆ or − f(x)(1 π(x) the regression µˆ right. For a given tilting function, to estimate τh, weight fd(x) Defn 4.19 (Double Robust Estimators for ATE). h(x) h(x) Oracle AIPW w (x), w (x) = , 1 0 π(x) 1 − π(x) [ ] ∑n y − µ (X ) y − µ (X ) b∗ 1 − i (1) i − i (0) i Target h(x) Estimand ( w1, w0) τAIP W = µ(1)(Xi) µ(0)(Xi) + Di + (1 Di) n e(Xi) 1 − e(Xi) 1 , 1 i=1 Combined 1 ATE π(x)( 1−π(x) )[IPW] ( ) π(x) ∑n Treated π(x) ATT 1, − 1 D (Y − µˆ (X )) (1 − D )(Y − µˆ (X )) ( 1 π(x) ) τb = i i 1 i − i i 0 i + {µˆ (X ) − µˆ (X )} 1−π(x) DR − 1 i 0 i Control 1 − π(x) ATC , 1 N πˆ(Xi) 1 πˆ(Xi) π(x) i=1  Overlap π(x)(1 − π(x)) ATO (1 − π(x), π(x)) IPW Regression z }| { n  z }| {  Overlap weights are defined by choosing h(x) that minimises asymptotic variance 1 ∑  D (Y − µˆ (X )) h i i 1 i of τb =  µˆ1(Xi) +  − . The achieve exact balance on covariates included in the propensity score n  πˆ(X )  estimation. i=1 | {z i } estimator for E[Y (1)] ∑N w (x )D Y w (x )(1 − D )Y [ i ] τb = µbh − µbh = 1 i i i − 0 i i i ∑n − − 1 0 w (x )D w (x )(1 − D ) 1 (1 Di)(Yi µˆ0(Xi)) i 1 i i 0 i i µˆ0(Xi) + n 1 − πˆ(Xi) OW i=1 | {z } τ can be interpreted as treatment effect among population that have good bal- E ance on observables. estimator for [Yi(0)] Implemented in PSweight. This is the Augmented-Inverse-Propensity Weighting Estimator (AIPW) intro- Defn 4.18 (Entropy Balancing (Hainmueller, 2012)). duced by J. M. Robins, Rotnitzky, and Zhao (1994). Additional overviews:(Bang and J. M. Robins, 2005; Glynn and Quinn, 2010). General double-robustness prop- Entropy weights wi for each control unit are chosen by a reweighting scheme ∑ erty also shared by targeted maximum-likelihood estimators(TMLE) - due to Van Der max H(w) = − wi log(wi) Laan and D. Rubin (2006). wi i:D=0 Defn 4.20 (Cross-Fit AIPW). subject to balance/moment-condition and normalising constraints The Cross-fit version can be stated as

← ToC 32 the outcome. Then, if we use a post-LASSO selection to estimate the treatment ∑n effect, the effect will be contaminated with an omitted variable bias. 1 − − τb = µb k(i)(X ) − µb k(i)(X ) recommended two-step approach IPW n (1) i (0) i i=1 ′ −k(i) −k(i) 1. Estimate yi = ciβ + νi with LASSO, select predictive variables (i.e. those yi − µb (Xi) yi − µb (Xi) A + D (1) − (1 − D ) (0) with nonzero coefficients) in i −k(i) i −k(i) πb (Xi) 1 − πb (Xi) ′ 2. Estimate di = ciβ + νi with LASSO, select predictive variables (i.e. those where k(i) is a mapping that takes an observation and puts it into one of the k with nonzero coefficients) in B b−k(i) th folds. µ(1) is an estimator excluding the k fold. ′ A ∪ B 3. Estimate yi = τDi + eiκ + υi where ci := [i.e. control for variables Define individual treatment effect estimator as that are selected in either the first or second regression] D (Y − µb (X )) (1 − D )(Y − µb (X )) b b − b i i (1) i − i i (0) i Γi = µ(1)(Xi) µ(0)(Xi) + b − b Defn 4.22 (Post-double-selection estimator). π(Xi) 1 π(Xi) b b ∑ Let l1, l2 ⊂ {1, . . . , p} be the indices of the selected controls for the outcome and b 1 n b Then, τ = n i=1 Γi treatment respectively. We can form level-α CIs Iα : The post-double-selection estimator is { [ ] } ∑n ′ 1 ˇ 2 b b b b 1 b 2 (ˇτ, β) = argmin En (yi − diτ − x β) : βj = 0 ∀j ̸∈ l1 ∪ l2 I = τb  z − V 2 ; V = (Γ − τb) i α 1 α/2 n(n − 1) i τ∈R,β∈Rk i=1 grf Can use plugin estimator for variance based on residuals has a forest-based implementation of AIPW [ ] √ E v2ψ2 cf = causal_forest(X, Y, D) σ−1 τˇ − τ →d N (0, 1) =⇒ σ2 = i i ate_hat = average_treatment_effect(cf) n n E 2 2 [vi ] GAM-based version in causalgam. where √ Defn 4.21 (Double Selection Estimator for High-Dimensional Controls). b − − ′ ˇ n ψi = (yi diτˇ xiβ) − b− Belloni, Chernozhukov, and Hansen (2014) and Chernozhukov, Chetverikov, Demirer, n s 1 { } Duflo, Hansen, Newey, and J. Robins (2018) partially-linear setup [ ] b − ′ b b E − ′ 2 ∀ ̸∈ b vi = di xiβ β = argmin n (di xiβ) : βj = 0 j l β∈Rp yi = diτ + g(xi) + εi E [εi|xi, di] = 0 di = m(xi) + ηi E [ηi|xi] = 0 Implemented in hdm::rlassoEffect(., 'double selection') where di is a scalar treatment indicator. Observations are independent but not Theorem 4.12 (Efficiency bound and efficient score ). necessarily identically distributed. We are interested in inference about τ that is Semiparametric Efficiency Bound for ATE (Hahn 1998) states that the asymptotic robust to mistakes in model-selection. variance of any regular estimator of τ is bounded from below by Approximate g and m with linear combinations of control terms ci = P (xi), which [ ] [ ] τ(Xi) may contain interactions and non-linear transformations. 2 2 z }| { Assume approximate sparsity (:= there are only a small number of relevant controls, σ1(X) σ0(X) 2 Veff = E + E + (µ(1, Xi) − µ(0, Xi) − τ) and irrelevant controls have a high probability of being small). π(Xi) 1 − π(Xi) | {z } Naive (incorrect) approach: use LASSO on an eqn of the form Var[τ(X)] ∑ [ ] [ ] ′ ∥ ∥ | | where σ2(X) = V Y d|X and τ(X) ≡ E Y 1 − Y 0|X . yi = τDi + xiβ + εi with penalty h(β) = β 1 = β j d j Any regular∑ estimator whose asymptotic√ variance achieves this efficiency bound 1 n ψ (µ) + O ( n) where the treatment τ is not penalised. This will mean we drop any control that is equal to n i=1 i P , where is highly correlated with the treatment if the control is moderately correlated with

← ToC 33 2 • RD,par(γ): Residual variation in treatment assignment explained by U (after D (Y − µ(1, X )) (1 − D )(Y − µ(0, X )) partialling out X). ψ (µ) = µ(1, X ) − µ(0, X ) + i i i − i i i i i i π(X ) 1 − π(X ) i i Draw threshold contours, should expect most covariates to be clustered around is the Efficient Influence Function for estimating τ. origin. Rosenbaum (2002) 4.3.8 Sensitivity Analysis Tuning parameter Γ ≥ 1 that measures departure from zero hidden bias. i j X = X Defn 4.23 (Standardised Difference). For any two observations and with identical covariate values i j, under Check balance by computing SDiff for observable confounders unconfoundedness, probability of assignment into treatment should be identical π(Xi) = π(Xj) Xt − Xc Standardised Difference = √ Treatment assignment probability may differ due to unobserved binary confounder 2 2 (st + sc )/2 U. We can bound this by the ratio: 1 e (1 − e ) Example 4.13. ≤ i j ≤ Γ Multiple control groups Γ (1 − ei)ej ∈ {− } Three valued treatment indicator: Ti 1, 0, 1 corresponding with ineligibles, γ = 1 =⇒ No bias. Γ = 2 =⇒ i is twice as likely to be treated than j despite eligible nonparticipants, and participants. We can test unconfoundedness by com- identical x. paring ineligibles with eligible nonparticipants, i.e. test ⊥⊥ { } | ∈ {− } Defn 4.25 (Coefficient Stability Approaches). Yi 1 Ti = 0 Xi,Ti 1, 0 Altonji, Elder, Taber (2005) Placebo Outcomes Only informative if selection on observables is informative about selection on Covariates included lagged outcomes Yi,−1,...Yi,−T . Test unobservables. ⊥⊥ | How much does treatment effect move when controls are added? Estimate model Yi,−1 Di Yi,−2,Yi,−T ,Xi with and without controls: e.g. Earnings in 1975 in Lalonde F • Yi = α Di + Xβ + ϵ Defn 4.24 (Parametric Sensitivity Analysis (Imbens (2003)). Y = αRD + ϵ U is a nuisance parameter. • i i F Y1,Y0 ⊥ D|X,U αˆ AET ratio: ρ = αˆR−αˆF Where U ∼ B(π = 0.5), and U ⊥ X. P (U = 1) = P (U = 0) = 0.5. Want ρ to be as big as possible (i.e. αˆR − αˆF →0 under unconfoundedness). Propensity score is Logistic: exp(Xθ + γU) Defn 4.26 (Oster (2019) - Proportional selection coefficient). P (D = 1|X,U) = Define proportional selection coefficient 1 + exp(Xθ + γU) Cov [ϵ, D] Cov [X′γ] γ indicates strength of relationship between U and D|X. δ = / V [ϵ] V [X′γ] Y is conditionally normal Then, Y |X,U ∼ N (αD + Xβ + δU, σ2) [ ] R − R˜ p | β∗ ≈ β˜ − δ β˙ − β˜ max → β δ indicates strength of relationship between U and Y X. ˜ − ˙ MLE setup R R Construct grid of (γ, δ) and calculate the MLE for αˆ(γ, δ) by maximising l(α, β, θ, γ, δ) where over (γ, δ). • β,˙ R˙ are from a univariate regression of Y on T Use 2 partial R2s: ˜ ˜ 2 • β, R are from a regression including controls • RY,par(δ): Residual variation in outcome explained by U (after partialling 2 out X). • Rmax is maximum achievable R

← ToC 34 4.4 Instrumental Variables Defn 4.30 (k-Class estimation). ˆ ′ −1 ′ SOO Fails/E [Xiϵi] ≠ 0 because of OVB, then βOLS is no longer consistent. Use Z αbk = (X (I − kPz)X) X (I − kPz)y as instrument for D which isolates variation unrelated to the omitted variable. which nests 2SLS, LIML, and Fuller’s estimator as special cases. Specifically, ⇒ b 4.4.1 Traditional IV Framework (Constant Treatment Effects) • k = 0 = αk is OLS ⇒ b Setup • k = 1 = αk is 2SLS

• k = k =⇒ αbk is LIML • Second Stage: Y = α0 + α1D + u2 LIML b • k = kLIML − − − ; b > 0 =⇒ αbk is Fuller’s estimator • First Stage: D = π0 + π1Z + u1 n L p k k • Reduced Form: here, LIML is the minimum( value of that satisfies ) ⊤ ⊤ Y = γ0 + γ1Z + u3 y (I − kPz)y y (I − kPz)X det ⊤ − ⊤ − = 0 = α0 + α1(π0 + π1Z + u1) + u2 X (I kPz)y X (I kPz)X ivmodel AER::ivreg = (α0 + α1π0) + |(α{z1π1}) Z + (α1u1 + u2) Implemented in , which takes model fits from and computes γ LIML / k-class estimates. 1 Asymptotically, all k− class estimators are consistent for α when k→1, n→∞. Assumption 4 (IV Assumptions). Inference 2 ′ −1 • Under homoscedasticity, V [ˆα2SLS] = σ (X PzX) • Exogeneity (as good as random conditional on covariates): Cov [u1,Z] = 0 • Under heteroskedasticity, • Exclusion Restriction: Cov [u2,D] = 0, Z has no effect on Y except through D. V ˆ ′ −1 ′ ˆ ′ −1 ˆ 2 (βIV ) = (Z PzX) PzX ΩPzX (X PzX) ; Ω = Diag[ˆui ] • Relevance: Z affects D Defn 4.31 (Hausman test for exogeneity). With the above assumptions, we can write Test statistic and null distribution (βˆ − βˆ )2 Defn 4.27 (Instrumental Variables Estimator). H := 2sls ols ∼ χ2 b ˆ − b ˆ 1 ˆ ′ −1 ′ V (β2sls) V (βols) βIV = (Z X) Z y Equivalently, Assuming the instrument Z is valid, we can test for whether x is This is equivalent to endogenous by estimating the following regression ′ Defn 4.28 (Wald Estimator). yi = Ziπ + xiα1 +v ˆiρ1 + ϵi With binary treatment and binary instrument, one can write the IV effect as where vˆ are the (fitted) residuals from estimating the first stage regression xi = ′ Ziψ + vi. A standard t-test for ρ tests whether x is exogenous assuming Zi is a valid γ1 Cov [Y,Z] E [Y |Z = 1] − E [Y |Z = 0] α1 = = = set of instruments. [means this test is not that useful in practice] α1 Cov [Y,D] E [D|Z = 1] − E [D|Z = 0] Defn 4.29 (2SLS Estimator). 4.4.2 Weak Instruments With multiple instruments or endogenous variables, Cov [Y,Z] Cov [Z, u2] Cov [Z, u2] ′ −1 ′ plim αIV = + = αD + αb2SLS = (X PzX) X Pzy Cov [Z,D] Cov [Z,D] Cov [Z,D] ′ −1 ′ where Pz = Z (Z Z) Z is X projected in the column space of Z. Second term non-zero if instrument is not exogenous. Let σu1,u2 = Cov [u1, u2] σ2 = V [u ] F and u1 2 [variance of first stage error] and be F statistic of the first-stage. Then, bias in IV is

← ToC 35 • Defiers: D1 < D0 σu1u2 1 E [ˆαIV − α] = Assumption 5 (LATE Thm Assumptions). σ2 F + 1 u2 σ u1u2 →∞ → {Y ,Y ,D ,D } ⊥⊥ Z If first stage is weak, bias approaches σ2 . As F , BIV 0. • A1: Independence of Instrument : 0 1 0 1 u2 • A2: Exclusion restriction : Y (d, 0) = Y (d, 1) ≡ Y for d = 0, 1 Defn 4.32 (Anderson-Rubin Robust Confidence Intervals). i i di When instruments are weak, AR Confidence intervals are preferable to eyeballing • A3: First Stage: E [D1i − D0i] ≠ 0 F-statistics. Let M be a n × 2 matrix of (y X), and let a0 = (β0, 1), b0 = (1, −β0) • A4: Monotonicity / No defiers: D − D ≥ 0 ∀ i or vice versa (where β0 is typically 0), and 1i 0i M⊤P M Theorem 4.14 (LATE Theorem (Angrist and Imbens (1994))). Σb = z n − L − p Under A1-A4, E | − E | be an estimator for the covariance matrix for the errors. [Y Z = 1] [Y Z = 0] αIV = = E [Y1i − Y0i|D1i > D0i] and let bs,bt be two-dimensional vectors defined as E [D|Z = 1] − E [D|Z = 0] ⊤ 1 ⊤ ⊤ − 1 b 2 b 2 If A1:A4 are satisfied, the IV estimate is the Local Average Treatment Effect for s := (Z Z) Z Mb0(b0 Σb0) the compliers. and ⊤ 1 ⊤ − ⊤ − 1 Cov [β1i, π1i] b 2 b 1 b 2 = + t := (Z Z) Z MΣ a0(a0 Σa0) LATE ATE E [π1i] b b⊤b b b⊤b b⊤b Define the scalars Q1 = s s, Q2 = s t,Q3 = t t So, late is weighted average for people with large π1i; i.e. treatment effect for those based on these scalars, two tests that are fully robust to weak instruments for test- whosle probability of treatment is most influenced by Zi. ing H0 : β = β0 - Anderson Rubin test (AR1949) and Conditional Likelihood Test (Moriera 2003) Theorem 4.15 (Bloom Result). b IV in Randomized Trials with one-sided noncompliance. Conditional on A1:A4 hold- Q1 ing, and E [D|Zi = 0] = Pr (D = 1|Z = 0) = 0. Then, AR (β0) = L E | − E | ( ) √( ) ( ) [Y Z = 1] [Y Z = 0] ITT E − | 1 1 2 | = = [Y1 Y0 D = 1] = AT T CLR (β ) = Qb − Qb + Qb + Qb − 4 Qb Qb − Qb2 Pr (D = 1 Z = 1) Compliance 0 2 1 3 2 1 3 1 3 2 Precision for LATE Estimation

4.4.3 IV with Heterogeneous Treatment Effects / LATE Theorem SEITTd SE \ ≈ LAT E Compliance • binary instrument Zi ∈ {0, 1}

• binary treatment Dz ∈ {0, 1} is potential treatment status given Z = z 4.4.4 Characterising Compliers

• potential outcomes: Yi(D,Z) = {Y (1, 1),Y (1, 0),Y (0, 1),Y (0, 0)} PO Model of IV allows for heterogeneous treatment effects but does not formally iden- tify LATE conditional on X. • heterogeneous treatment effects βi = Yi(1) − Yi(0) Abadie (2003) extends methods by allowing the treatment inducer to be random- ized conditionally on the covariates and by allowing the outcome to depend on Defn 4.33 (IV Subpopulations). the covariates besides the treatment intake. Abadie (2003) also provided semi- parametric estimations of the probability of receiving the treatment inducement, • Compliers: D1 > D0,D0 = 0,D1 = 0 which helps to identify the treatment effects in a more robust way. We need an additional assumption of Common Support : 0 < Pr (Z = 1, X) < 1. • Always takers: D0 = D1 = 1 Specifically, when the treatment inducer Z is as good as randomized after condi- tioning on covariates X, Abadie proposed a two-stage procedure to estimate treat- • Never Takers : D0 = D1 = 0 ment effects.

← ToC 36 • it estimates the probability of receiving the treatment inducement P(Z = 1|X) Kappa can be applied to any characteristic or outcome and get its mean for compliers in order to provide a set of pseudo-weights. by removing the means for never and always takers. Angrist and Pischke (2008, p 181-183) provides overview of estimation. Trick is to construct a weighting scheme • Second, the pseudo-weights are used to estimate the local average response with positive weights so that κi, which is negative for always-takers and never- function (LARF) of the outcome conditional on the treatment and covariates. takers, can be replaced with the always-positive average weight 1 − E [Z|X,D,Y ] (1 − D)E [Z|X,D,Y ] The estimated coefficient for the treatment intake D reflects the conditional treat- E [κ|X,D,Y ] = 1 − − ment effect. 1 − Pr (Z = 1|X) Pr (Z = 1|X) Fact 4.16 (Size of Strata). To compute κ, we need Pr (Z = 1|X), which can be computed using a standard Given monotonicity, we can identify the proportion of compliers, never-takers, and logit/probit or a power-series. always-takers respectively. Likelihood that Complier has a given value of (Bernoulli distributed) character- | E | − E | istic X relative to the rest of the population is given by πcompliers = Pr (D1 > D0 X) = [D X,Z = 1] [D X,Z = 1] | E | E [D|Z = 1,X = 1] − E [D|Z = 0,X = 1] FS in Subgroup πalways-takers = Pr (D1 = D0 = 1 X) = [D X,Z = 0] = | − E | E [D|Z = 1] − E [D|Z = 0] Overall FS πnever-takers = Pr (D1 = D0 = 0 X) = 1 [D X,Z = 1] If nobody in the treatment group has access to the treatment (i.e. E [D|Z = 0] = Theorem 4.19 (Average Causal Response). { } 0), the LATE = ATT. Assume A1-A4 from LATE. Generalise D to take values in the set 0, 1,..., Dˇ ; Let Y := f (d) denote the potential (or latent) outcome for person i for treatment level Fact 4.17 (Proportion of treatment group that are compliers). di i d. Then, By Bayes rule, E | − E | ∑Dˇ | [Y Z = 1] [Y Z = 0] Pr (D = 1 D1 > D0) Pr (D1 > D0) = ωdE [Ydi − Yd−1,i|d1i ≥ d > d0i] Pr (D1 > D0|D = 1) = E [D|Z = 1] − E [D|Z = 1] Pr (D = 1) d=1 Pr (Z = 1) [E [D|Z = 1] − E [D|Z = 0]] where the weights = Pr (D = 1) Pr (d1i > d > d0i) ωd = ∑ Dˇ ≥ ≥ j=1 Pr (d1i j d0i) Theorem 4.18 (Abadie’s Kappa). are non-negative and sum to 1. Suppose assumptions of LATE thm hold conditional on covariates X. Let g(·) be Defn 4.34 (Local Average Response Function (LARF)). any measurable real function of Y,D,X with finite expectation. We can show that | E | the expectation of g is a weighted sum of the expectation in the three groups CEF of Y X,D for the subpopulation of compliers: [Y X,D,D1 > D0] E [κY |X,D] E | E | | E | | E [Y |X,D,D > D ] = [g X] = | [g X,D1 > D0{z] Pr (D1 > D0 X}) + | [g X,D1 = D0 = 1]{zPr (D1 = D0 = 1 X}) 1 0 E [κ] Compliers Always takers E | • Estimate κ + | [g X,D1 = D0 = 0]{zPr (D1 = D0 = 0X}) • Estimate E [Y |X,D] in the whole population, weighting by κ Never Takers Rearranging terms gives us implemented in LARF::larf in R. Then, E · E · 4.4.5 Marginal Treatment Effects: Treatment effects under self selection E | [κ g(Y,D, X)] [κ g(Y,D, X)] [g(Y,D, X) D1 > D0] = = E Pr (D1 > D0) [κ] Heckman and Vytlacil (2007) propose the marginal treatment effect (MTE) setup that where generalises the IV approach for continuous instruments and nests many estimands D(1 − Z) (1 − D)Z (and is a generalisation of the Roy (1951) model). It also has a clearer treatment of κ = 1 − − i 1 − Pr (Z = 1|X) Pr (Z = 1|X) self-selection.

← ToC 37 Exposition based on Cornelissen et al. (2016). Define potential outcomes Fact 4.20 (Estimation with Binary Instrument). The covariate-specific Wald estimator is Y0i = µ0(xi) + υ0i E [Y |z = z, x = x] − E [Y |z = z, x = x] (x) = i i i i i i Y1i = µ1(xi) + υ1i Wald E [Di|zi = z, xi = x] − E [Di|zi = z, xi = x] Under the standard A1-A4 from AIR96, where µj(·) is the conditional mean function and υji captures deviations, with E − | E [υji|xi] = 0. LATE(x) := [Y1i Y0i D1i > D0i, xi = x] Treatment assignment assumes a weakly separable choice model = µ1(xi) − µ0(xi) + E [υ1i − υ0i|D1i > D0i, xi = x] ∗ These can be aggregated using the ‘saturate and weight’ theorem (Angrist and D = µd(xi, zi) + vi i Imbens) D = 1 ∗≥ ∑ i Di 0 ∗ IV = ω(x)LATE(x) where di is the latent propensity to take the treatment, and is interpreted as the net ∗ ≥ x∈X gain from treatment since treatment is only taken up if Di 0. zi is an instrument. with weights vi enters the selection equation negatively, and thus represents latent resistance to   E [D |x = x, z = z] treatment. ∗ i i i ≥ ≥ Share with xi = x z}|{ The condition Di 0 can be rewritten as µd(xi, zi) vi. Applying the CDF of v z}|{   V b | Fv to both sides yields px  Di xi, zi F (µ (x , z )) ≥ F (v ) | v d{z i i } | v{z i} ω(x ) = [ ] i b Propensity score =: P (xi, zi) Quantiles of distaste distribution =: υdi V Di Both RHS and LHS are distributed on [0, 1]. The treatment decision can now be For a continuous instrument, for a pair of instrument values z, z′, LATE(z, z′, x) = written as 1 E [Y1i − Y0i|Dzi > Dz′i, xi = x]. Di = P (xi,zi)≥υdi . Now, we define treatment effects Defn 4.35 (Marginal Treatment Effect (MTE)).

Yi = (1 − Di)Y0i + DiY1i MTE(xi = x,UDi = ui) := E [Y1i − Y0i|xi = x,UDi = ui] − = Y0i + Di |(Y1i {z Y0i}) ∂E [Y |x = x, p(z ) = p] = i i i =: ∆i ∂p ∆(xi) Idiosyncratic gain z }| { z }| { MTE is defined as a continuum of treatment effects along the distribution of υD. − − = µ0(xi) + Di [|µ1(xi) µ0(xi{z) + υ1i υ0i }] +υ0i Define two marginal treatment response (MTR) functions E | E | ≡ ∆i m0(u, x) = [Y0 U = u, X = x]; m1(u, x) = [Y1 U = u, X = x] Many useful parameters are identified using the following expression [∫ ] [∫ ] Aggregating over different parts of the covariate distribution yields different esti- 1 1 ⋆ ≡ ⋆ ⋆ mates. β E m0(u, X)ω0 (u, X, Z)du + E m1(u, X)ω1 (u, X, Z)du 0 0 ATE(x) := E [∆i|xi = x] = µ1(x) − µ0(x) with weights specified in 4. ATT(x) := E [∆i|xi = x,Di = 1] = µ1(x) − µ0(x) + E [υ1i − υ0i|Di = 1] Parametric Model: Assuming joint normality for U0,U1,V , E | − E − | ( ) ATU(x) := [∆i xi = x,Di = 0] = µ1(x) µ0(x) + [υ1i υ0i Di = 0] ϕ ((X ,Z ) β ) E [U | D = 0,X ,Z ] = E [U | V ≥ (X ,Z ) β ,X ,Z ] = ρ i i d 0i i i i 0i i i i d i i 0 1 − Φ ((X ,Z ) β ) ( i i )d Integrating these over x yields the conventional estimators. With self-selection −ϕ ((Xi,Zi) βd) 1 ∗ | | based on Di = di≥0 typically means ATT > ATE > ATU. E [U1i Di = 1,Xi,Zi] = E [U1i Vi < (Xi,Zi) βd,Xi,Zi] = ρ1 Φ ((Xi,Zi) βd)

← ToC 38 ′ yi = τdi + xiβ + εi ′ ′ di = xiγ0 + ziδ0 + υi

where x 1. xi is a vector of pn exogenous controls, including a constant. z 2. zi is a vector of pn instruments

3. di is an endogenous variable x z Figure 4: MTE weights from Mogstad and Torgovitsky (2018) 4. pn >> n and pn >> n b 1. Run (post)LASSO of di on xi, zi to obtain γ,b δ where ρ is the correlation ρ [U ,V ], and ρ = ρ [U ,V ]. 0 0i i 1 1i i b yields MTE estimator 2. Run (post)LASSO of yi on xi to get θ. MTE (x, u ) = E (Y − Y | X = x, U = u ) = x (β − β )+(ρ − ρ )Φ−1 (u ) b ′ b ′b b D 1i 0i i Di D 1 0 1 0 D 3. Run (post)LASSO of di = xiγ + ziδ on xi to get ϑ. Defn 4.36 (Control Function IV). by − b bd − ′ b b b ′b ′ b 4. Construct ρi := yi xiθ, ρi := di xiϑ and vi := xiγ + ziδ + xiϑ. Let xei = xi − x. Write ⊤ e′ τb ρby ρbd vb Yi = xi α + Dixiθ + Diδi + εi 5. Estimate by using standard IV regression of i on i with i as instrument. Perform inference using score stastics or conventional heteroskedasticity-robust where δi is a random effect that captures treatment effect heterogeneity .Wecan e SEs. rewrite this and by demeaning δi = δ − δi. z}|{υ0i implemented in hdm::rlassoIV(., select.X = T, select.Z = T). ⊤ e ⊤ Yi = xi α + Dixi θ + Diδ + εi (1) Discussion in https://cran.r-project.org/web/packages/hdm/vignettes/hdm.pdf. where δ captures the ATE at means of X, which is the unconditional ATE under the linear specification. 4.4.7 Principal Stratification Write the selection equation Treatment comparisons often need to be adjusted for post-treatment variables. ⊤ ∈ { } ∈ { } Di = xiπ + ziπ2 + νi with E [νi|xi, zi] = 0 Binary treatment Zi 0, 1 . post-treatment Intermediate variable Si(zi) 0, 1 , 1 ∈ { } Assumptions Outcome Yi 0, 1 . For each individual, the treatment assumes a single value, so only one of the two potential intermediate values are observed. Based on joint • E [εi|νi] = ηνi: Conventional selection bias. potential outcomes of the intermediate variable (Si, (0),Si(1)), we have 4 strata [ ] e e 00 = {i : S (0) = 0,S (1) = 0} • E δi|νi = ψνi: unobservable part of treatment effect δi depends linearly on i i Never Takers { } the unobservables that affect treatment selection. 10 = i : Si(0) = 1,Si(1) = 0 Defiers 01 = {i : Si(0) = 0,Si(1) = 1} Compliers Including νbi and νbiDi in eqn 1 yields a consistent estimate of the ATE : δ. 11 = {i : Si(1) = 0,Si(1) = 1} Always takers 4.4.6 High Dimensional IV selection Defn 4.37 (Principal Stratification Frangakis and D. B. Rubin (2002)). The basic principal stratification P0 w.r.t post treatment variable S is the partition Chernozhukov, Hansen, and Spindler (2015) setup: of units i = 1, . . . , n such that, forall units in any set of P0, all units have the same vector of (Si(0),Si(1)). The principal stratum Gi ∈ {00, 10, 01, 11} to which unit i

← ToC 39 belongs is not affected by treatment assignment for any principal stratification, so 4.5.1 Estimators can be considered pre-treatment. Normalise running variable c := x0. Then, the linear regression implementation is the following: • Treatment Ignorability implies Y = αl + τD + βlf(X − c) + (βr − βl) × D × g(X − c) + ϵ (Yi(0),Yi(1)) ⊥⊥ Zi|Si(0),Si(1), X where f and g are local or global polynomials. Since the design relies on identifi- (i.e. treatment and control units can be compared conditional on stratum) cation at infinity (i.e. at the cutoff), choice of polynomial / functional form matters a lot. • Average Principal Causal Effect (APCE) Calonico, Cattaneo, Titiunik (2014) recommend local-linear regressions. Older PCE := E [Yi(1) − Yi(0)|Si(0) = sC ,Si(1) = sT ]; sC , sT ∈ {0, 1} literature relies on global higher-order polynomials, which often yields strange estimates. Complier Average Causal Effect (CACE) = Causal Effect on Principal Stratum of Defn 4.39 (Local Linear RD Estimator). Compliers { ( ) } ∑n | − | = E [Y (1) − Y (0)|S (0) = 0,S (1) = 1] Xi c CACE i i i i τb = argmin K × (Y − a − τD − β (Z − c)− − β (Z − c) ) c h i i (0) i (1) i + i=1 n Direct and Indirect Effects via Principal Stratification Direct effect of Z con- Where K(·) is a kernel function. Common choices are the window function K(x) = S Z Y ditional on exists if there is a causal effect of on for observations for whom 1| |≤ or the triangular kernel K(x) = (1 − |x|) the treatment does not affect selection S, i.e. principal strata 00, 11. This is a zero- x 1 + first-stage sample in IV-terms. The Indirect Effect is mediated through S. Assumptions for Local Linear Estimator Loosely, we need CEFs µ(w) to be smooth. More precisely, we need µ(w)(x) to be twice-differentiable with uniformly bounded second derivative. 4.5 Regression Discontinuity Design d2 Setup ≤ ∀ ∈ R ∧ ∈ { } 2 µ(w)(x) B x w 0, 1 Treatment (D) changes discontinuously at some particular value x0 in x [and noth- dx ing else does], so Taking a taylor expansion around c, we can write the CEFs as { 0 if xi < x0 1 2 Di = ≥ µ(w)(x) = a(w) + β(w)(x − c) + ρ(w)(x − c) ρ(w)(x) ≤ Bz 1 if xi x0 2 Standard identification assumptions violated by definition because although un- with τc = a(1) − a(0). The local linear regression with a window kernel can be 1 confoundedness holds trivially since we have Di = xi≥c, this also means overlap is solved in closed form always violated . Need to invoke continuity to do causal inference. ∑ b 2 b E(1)[(Xi − c) ] − E(1)[Xi − c] · (Xi − c) ba(1) = γiYi , γi = Defn 4.38 (Sharp Regression Discontinuity Estimand (Hahn et al 2001)). Eb − 2 − Eb − 2 c≤X ≤c+h (1)[(Xi c) ] (1)[Xi c] Identified at x = c, i.e. τc = µ(1)(c) − µ(1)(c) via i n b τc := E [Y1 − Y0|X = c] = lim E [Y |X = c] − lim E [Y |X = c] where E(·) denote sample averages over the regression window. Then, the error ↓ ↑ x c x c term can be written as ∑ ∑ lim E [y|X] − lim E [y|X] = τSRDD + lim E [u|X] − lim E [u|X] ↓ ↑ ↓ ↑ b − − x c x c |x c {z x c } a(1) = a(1) + γρ(1)(Xi c) + γi(Yi µ(1)(Xi)) ≤ ≤ ≤ ≤ ≈ 0 |c Xi c+hn {z } |c Xi c+hn {z } Assumption 6 (Smoothness of Unobservables). Curvature Bias Sampling Noise 2 Curvature bias bounded by Bhn. E | ( ) • Conditional mean function [u X] is continuous at c −2/5 −1/5 τbc = τc + O n withhn ∼ n • Mean Treatment effect function E [τi|X] is right continuous at c

← ToC 40 This rate is a consequence of working with the 2nd derivative. In general, if we { − assume µ (·) has a bounded k−th derivative, we can achive en k/(2k+1) rate obs Yit(0) if Dit = 0 (w) Yit = Yit(Dit) = using local polynomial regression of order k − 1 with a bandwidth scaling as Yit(1) if Dit = 1 ∼ −1/(2k+1) hn n . Defn 4.41 (Causal Effects in DiD).

Defn 4.40 (Minimax Linear Estimation (G. Imbens and Wager, 2017)). τAT T := E [Y1(1) − Y0(1)|D = 1] The local linear regression estimator for τc ( ) Naive Estimation Strategies ∑n | − | Zi c 2 τbc = argmin K (Yi − a − τWi − β(0)(Zi − c)− − β(1)(Zi − c)+) τ = E [Y (1)|D = 1] − E [Y (0)|D = 1] h • Before-After Comparison: ; i=1 n ∑ E | E b n – assumes [Y0(1) D = 1] = [Y0(0) D = 1] (No trending) which can be written as a local linear estimator τc = i=1 γiYi where weights γi only depend on the running variable Z. G. Imbens and Wager (2017) show that • Post Treatment-Control Comparison: τ = E [Y (1)] − E [Y (1)|D = 0] local linear regression is not the best estimator in this class. E | E | ′′ ≤ | { } – Assumes [Y0(1) D = 1] = [Y0(1) D = 0] () Under an assumption that µ(w)(z) B Z1,...,Zn , the minimax linear estima- b | { } ≤ ∥ ∥2 2 tor is the one that minimises the MSE MSE(τc(γ) Z1,...,Zn ) σ γ 2 + IB(γ) Both typically untenable in practice, so we need and is given by Defn 4.42 (Parallel Trends Assumption). ∑n { } E − | E − | b B B B ∥ ∥2 2 [Y0(1) Y0(0) D = 1] = [Y0(1) Y0(0) D = 0] τc(γ ) = γi Yi ; γ = argmin σ γ 2 + IB(γ) | {z } | {z } i=1 Trend in control PO for Treated Trend in control PO for Control These weights can be solved for using quadratic programming. Often justified using a figure [with transformed y if necessary], or control for time trends [which relies on a strong functional form assumption], or a clear falsifica- tion test [on a placebo group]. 4.5.2 Fuzzy RD Equivalently, for a two-period difference, we can also write the standard OLS exo- Discontinuity doesn’t deterministically change treatment, but affects probability of geneity condition in differences form treatment. Analogue of IV with one-sided non-compliance. { E [∆x′∆ϵ] = 0 | g0(xi) if xi < x0 E ′ E ′ − E ′ − E ′ P [Di = 1 xi] = [x2ϵ2] + [x1ϵ1] [x1ϵ2] [x2ϵ1] = 0 g1(xi) if xi ≥ x0 | {z } No feedback loop g0(xi) ≠ g1(xi). Assuming g1(x0) > g0(x0), the probability of treatment relates to xi via: Which makes a direct link with the strong exogeneity assumption in panel data models that asserts that ϵt ⊥⊥ x1,... xt. E[Di|xi] = P [Di = 1|xi] = g0(xi) + [g1(xi) − g0(xi)]Ti 1 where Ti = xi≥x0 := point of discontinuity Given this, we can write Defn 4.43 (DiD Estimand). 4.6 Differences-in-Differences E − | E | − E | − 4.6.1 DiD [Y1(1) Y0(1) D = 1] = | [Y (1) D = 1] {z [Y (0) D = 1]} ∆Y |D=1 Observe n0+1 units for t = 1,...,T periods. E | − E | | [Y (1) D = 0] {z [Y (0) D = 0]} • Y (t) PO that unit i attains in period t when treated between t and t − 1 1i ∆Y |D=0

• Y0i(t) PO that unit i attains in period t with control between t and t − 1

← ToC 41 4.6.2 DiD Estimators and define the same for each time period for panel data. Sample Means Estimator If Dit is as good as randomly assigned conditional on Ai: { } E[Y |A ,X , t, D ] = E[Y |A ,X , t] 1 ∑ 1 ∑ 0it i it it 0it i it τDiD = {Yi(1) − Yi(0)} − {Yi(1) − Yi(0)} Then, assuming Ai enter linearly, N1 N0 ′ ′ Di=1 Di=0 | E[Y0it Ai,Xit, t, Dit] = α + λt + Aiγ + Xitβ N N Where 1 and 0 are the number of treated and control units respectively. Assuming the causal effect of the treatment is additive and constant, Regression Estimator E[Y1it|Ai,Xit, t] = E[Y0it|Ai,Xit, t] + ρ Yist = α + γT reats + λP ostt + τ(T reats × P ostt) + ϵist where ρ is the causal effect of interest. Triple Differences (DDD) Estimator Then, we can write: Regular Diff-in-Diff estimate - Diff-in-diff estimate for placebo group ′ Yit = αi + λt +ρ Dit + Xitβ + ϵit − | 4.7 Panel Data ϵit = Y0it E[Y0it Ai,Xit, t] Error Term α = α + A γ Setup: We observe a sample of i = 1,...,N cross-sectional units for t = 1,...,T i i Fixed effect ⇒ { ′ }T Restrictions time periods = Data : (yit, xit): t = 1,...,T t=1 One-way fixed effects and Random effects both use theform • Linear ′ yit = xitβ + |θi +{zϵit} (2) • Additive functional form

eit • Variation in Dit, over time, for i, must be as good as random although they make different assumptions about the error. Error assumptions for panel regressions Defn 4.44 (Within Estimator). (1) FE: E [ϵit|xi, θi] = 0 ⇔ θi ̸⊥⊥ xi. Estimate the specification E | ′ (2) RE: (1) and [eit xi] = 0 [Absorb unobserved unit effect into error term, impose y¨ = x¨ β + ϵ ⇒ ⊥⊥ i i i orthogonality it] = θi xi. Equivalent to Pooled OLS with FGLS. ¨ where ki = Miki individual demeaned values from pre-multiplying by the Indi- − M := I − 1 (1′ 1 ) 1 1′ 4.7.1 Fixed Effects Regression vidual specific demeaning operator i i i i i i with every compo- nent in eqn 2, which removes the fixed effect θi. Identification Assumption Defn 4.45 (First Differences Estimator). • Strict Exogeneity - errors are uncorrelated with lags and leads of x Lag eqn 2 1 period and subtracting gives ′ ′ E | E | ··· ⇔ E ∀ ∆yit = ∆x β + ∆ϵit [ϵit xi] = [ϵit xi1, xiT ] = 0 [xisϵit] = 0 s, t = 1,...T it where ∆yit = yit − yi,t−1 and so on. This naturally eliminates the time-invariant Equivalent statement for yit is fixed effect θi. The pooled OLS estimation of β in the above regression is called the E | E | ′ b [yit xi1,..., xiT ] = [yit xit] = xitβ first differences (FD) estimator βFD.

– Rules out feedback loops i.e. xit correlated with ϵi,t−1 because Xs are Fact 4.21 (Efficiency of FE and FD Estimators). set in response to prior error, e.g. Policing and crime. • FE estimator is more efficient under the assumption that ϵit are serially un- • regressors vary over time for at least some i. E ′ | 2 correlated [ [eiei xi, θi] = σe IT ] Setup ϵ Define an individual fixed effect for individual i • FD more efficient when it follows a random-walk. { 1 A = if the observation involves unit i i 0 otherwise

← ToC 42 Fact 4.22 (Equivalence between Within/FE and first differences for 2 periods). Under these assumptions, the FGLS matrix Ω takes a special form For Individual Fixed Effects/Within estimation, using the regression anatomy for- 2 2 ′ Ω = σν IT + σθ jT jT mula, write: ′ where jT j is a T × T matrix of 1s. Estimators for the variance components are in Cov(Y¨ , X¨ ) T ρˆ = it it Wooldridge (2010, c 10, pp 260-61). A robust estimator of Ωˆ is constructed using fe ¨ Var(Xit) pooled OLS residuals vˆi ∆Yi,t ∆Xi,t ∑n Since t = 2, Y i = Yi,t + and Xi = Xi,t + 1 2 2 Ωb = vˆ vˆ′ Thus, n i i i=1 Cov(Y¨ , X¨ ) With this, we can apply the FGLS estimator ρˆ = it it fe ( )−1 Var(X¨it) b ′ −1 ′ −1 βRE = X Ωˆ X X Ωˆ y Cov(Y − Y ,X − X ) = it i it i Var(X − X )) it i 4.7.3 Hausman Test: Choosing between FE and RE − − ∆Yit − − ∆Xit Cov(Yit Yit 2 ,Xit Xit 2 ) = βFE is assumed to be consistent. Oft-abused test as a result. − − ∆Xit Var(Xit Xit 2 ) β − β = 0 − Cov(∆Yit, ∆Xit) Cov(∆Yit, ∆Xit) • H0: FE RE = − = Var(∆Xit) Var(∆Xit) • H0: βFE − βRE ≠ 0 =ρ ˆfd ( )′ ( ] [ ] ( ) βb − βb Vardβb − Vard βb )−1 βb − βb →d χ2 4.7.2 Random Effects FE RE FE RE FE RE k Identification Assumption If the error component θ is correlated with x, RE estimates are not consistent. Per- form Hausman test for random vs fixed effects (where under the null, Cov(θi, xit) = Assume θi ⊥⊥ Xi ⇔ E [θi|xi] = E [θi] = 0 - strong assumption 0) In other words, entire error term eit = νit + θi is independent of X. This assumes OLS is consistent but inefficient, which is why it is of limited use in observational σˆ2 T σˆ2 λ→0 settings. • When the idiosyncratic error variance ν is large relative to i θ , and ˆ ≈ ˆ When there is in time series (i.e. ϵt s are correlated over time ), βRE βpool. In words, the individual effect is relatively small, so Pooled GLS estimates can be obtained by estimating OLS on quasi-differenced data. This OLS is suitable. allows us to estimate the effects of time-invariant characteristics (assuming the in- 2 2 → • When the idiosyncratic error variance σˆν is small relative to Tiσˆθ , λ 1 and dependence condition is met). ˆ ˆ βRE ≈ βFE. Individual effects are relatively large, so FE is suitable. yit − λy¯i = (xit − λx¯i)β + (1 − λ)θi + νit − λν¯i where [ ] 1 4.7.4 Time Trends σ2 2 λ = 1 − ν Linear Time Trend σ2 + T σ2 ν θ yit = xitβ + ci + t + εit, t = 1, 2,...,T Assumption 7 (RE FGLS Assumptions). Time Fixed Effects (a.k.a. Two-way Fixed Effects) [ ] yit = xitβ + ci + tt + εit, t = 1, 2,...,T E 2 2 • Idiosyncratic errors νit have constant finite variance: νit = σν Unit Specific Time Trends y = x β + c + g · t + t + ε , t = 1, 2,...,T • Idiosyncratic errors νit are serially uncorrelated: E [νitνis] = 0∀t ≠ s. it it i i t it [ ] E 2| 2 • θi xi = σθ

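A sketch of the RE (FGLS quasi-demeaning) estimator and the Hausman test on the same hypothetical pdat; plm::phtest compares the FE and RE fits:

re <- plm(y ~ x, data = pdat, model = "random")  # RE via FGLS quasi-demeaning by lambda
fe <- plm(y ~ x, data = pdat, model = "within")
phtest(fe, re)  # Hausman test; H0: beta_FE - beta_RE = 0 (RE consistent), reject => prefer FE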
4.7.5 Distributed Lag

Define a switching indicator $D_{it}$ equal to 1 if $i$ switched from control to treatment between $t-1$ and $t$.
\[ Y_{ist} = \gamma_s + \lambda_t + \sum_{\tau=0}^{m} \delta_{-\tau} D_{s,t-\tau} + \sum_{\tau=1}^{q} \delta_{+\tau} D_{s,t+\tau} + X'_{ist}\beta + \epsilon_{ist} \]
where the sums on the RHS allow for $m$ lags / post-treatment effects and $q$ leads / pre-treatment effects. The lead coefficients should be close to 0.

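A minimal sketch of the lead/lag (event-study) specification above, for a hypothetical balanced panel pdat with a switching indicator D (two lags and two leads shown; column names are illustrative):

library(dplyr)
pdat <- pdat %>% arrange(id, year) %>% group_by(id) %>%
  mutate(D_lag1 = lag(D, 1), D_lag2 = lag(D, 2),
         D_lead1 = lead(D, 1), D_lead2 = lead(D, 2)) %>% ungroup()
es <- lm(y ~ D + D_lag1 + D_lag2 + D_lead1 + D_lead2 +
           factor(id) + factor(year), data = pdat)   # unit and time fixed effects
coef(es)[c("D_lead1", "D_lead2")]                     # lead coefficients should be close to 0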
4.7.6 ‘Generalised’ Difference in Differences: staggered adoption

Theorem 4.23 (DiD Decomposition Theorem (Goodman-Bacon, 2018)).

Consider a dataset comprising K timing groups ordered by the time at which they first receive treatment, and at most one never-treated group U. The OLS estimate from a two-way fixed effects regression is
\[ \hat\beta_{DD} = \sum_{k \ne U} s_{kU}\,\hat\beta^{DD}_{kU} + \sum_{k \ne U}\sum_{j > k}\Big[ s_{kj}\,\hat\beta^{DD}_{kj} + \underbrace{s_{jk}\,\hat\beta^{DD}_{jk}}_{\text{DD estimated with already-treated group as control}} \Big] \]
where the weights depend on sample size and the variance of treatment within each DD. This maximises the weights of groups treated in the middle of the panel. The Late vs Early comparison is particularly problematic (and is typically incorrect when treatment effects are heterogeneous in time).
Visually, this involves decomposing the setup in Figure 5 into its constituent two-way parts in Figure 6.

[Figure 5: Some Staggered Difference in Differences data — outcomes $y_{it}$ for an early-treated group $k$ (treated at $t^*_k$), a late-treated group $l$ (treated at $t^*_l$), and a never-treated group $U$, plotted over time.]

Fact 4.24 (Group-time average treatment effects (Callaway and Sant'Anna, 2020)).
Estimand: group-time average treatment effect
\[ ATT(g, t) = E\left[ Y_t(g) - Y_t(\infty) \mid G_g = 1 \right], \quad \forall\, t \ge g \]
where $Y_t(g)$ is the potential outcome at $t$ for the group first treated at $g$. Separate (1) identification, (2) estimation and inference, and (3) aggregation.
• A1: No anticipation: $\forall\, i, t$ and $t < g, g'$, $Y_{i,t}(g) = Y_{i,t}(g')$
• A2: Parallel trends based on the ‘never treated’ group: $\forall\, t \in \{2, \dots, T\}$, $g \in \mathcal{G}$ s.t. $t \ge g$,
\[ \underbrace{E\left[ Y_t(0) - Y_{t-1}(0) \mid G_g = 1 \right]}_{\text{trend in group treated at } g} = \underbrace{E\left[ Y_t(0) - Y_{t-1}(0) \mid C = 1 \right]}_{\text{trend in never treated}} \]
Aggregation: event-study type estimand
\[ \theta_D(e) = \sum_{g=2}^{T} 1\{g + e \le T\}\, ATT(g, g+e)\, \Pr\left( G = g \mid G + e \le T,\, C \ne 1 \right) \]
Implemented in did and DRDID.

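A minimal sketch of the Callaway-Sant'Anna estimator with the did package, assuming a hypothetical data frame dat where first_treat holds the first treatment period (0 for never-treated units):

library(did)
gt <- att_gt(yname = "y", tname = "year", idname = "id", gname = "first_treat",
             data = dat, control_group = "nevertreated")
summary(gt)                        # ATT(g, t) for each group-time pair
es <- aggte(gt, type = "dynamic")  # event-study aggregation theta_D(e)
summary(es)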
Estimators for Group-time ATEs
\[ ATT^{never}_{unc}(g, t) = E\left[ Y_t - Y_{g-1} \mid G_g = 1 \right] - E\left[ Y_t - Y_{g-1} \mid C = 1 \right] \]
\[ ATT^{notyet}_{unc}(g, t) = E\left[ Y_t - Y_{g-1} \mid G_g = 1 \right] - E\left[ Y_t - Y_{g-1} \mid D_t = 0, C = 1 \right] \]

[Figure 6: Constituent 2-way Differences in Differences Comparisons — A. Early Group vs. Untreated Group; B. Late Group vs. Untreated Group; C. Early Group vs. Late Group, before $t^*_l$; D. Late Group vs. Early Group, after $t^*_k$.]

The observed data matrix is
\[ \mathbf{Y}^{obs} := \left( Y^{obs}_{it} \right)_{t = T, \dots, 1;\; i = 1, \dots, n_0 + 1} =
\begin{bmatrix}
Y_{1,T}(1) & Y_{2,T}(0) & \dots & Y_{n_0+1,T}(0) \\
\vdots & \vdots & & \vdots \\
Y_{1,T_0+1}(1) & Y_{2,T_0+1}(0) & \dots & Y_{n_0+1,T_0+1}(0) \\
Y_{1,T_0}(0) & Y_{2,T_0}(0) & \dots & Y_{n_0+1,T_0}(0) \\
\vdots & \vdots & & \vdots \\
Y_{1,1}(0) & Y_{2,1}(0) & \dots & Y_{n_0+1,1}(0)
\end{bmatrix} \]
FPCI applies; the potential outcome matrices are:
\[ \mathbf{Y}(0) =
\begin{bmatrix}
? & Y_{2,T}(0) & \dots & Y_{n_0+1,T}(0) \\
\vdots & \vdots & & \vdots \\
? & Y_{2,T_0+1}(0) & \dots & Y_{n_0+1,T_0+1}(0) \\
Y_{1,T_0}(0) & Y_{2,T_0}(0) & \dots & Y_{n_0+1,T_0}(0) \\
\vdots & \vdots & & \vdots \\
Y_{1,1}(0) & Y_{2,1}(0) & \dots & Y_{n_0+1,1}(0)
\end{bmatrix}
\qquad
\mathbf{Y}(1) =
\begin{bmatrix}
Y_{1,T}(1) & ? & \dots & ? \\
\vdots & \vdots & & \vdots \\
Y_{1,T_0+1}(1) & ? & \dots & ? \\
? & ? & \dots & ? \\
\vdots & \vdots & & \vdots \\
? & ? & \dots & ?
\end{bmatrix} \]

4.7.7 Synthetic Control

Original Abadie, Diamond, and Hainmueller (2010) setup. Observe $n_0 + 1$ units in periods $t = 1, \dots, T$. Unit 1 is treated starting from period $T_0 + 1$, while units $2, \dots, n_0 + 1$ are never treated and are therefore called the donor pool.
\[ Y^{obs}_{it} = Y_{it}(D_{it}) = \begin{cases} Y_{it}(0) & \text{if } D_{it} = 0 \\ Y_{it}(1) & \text{if } D_{it} = 1 \end{cases} \]
Since there is only one treated unit, the effect of interest is
\[ \tau_t := Y_{1t}(1) - Y_{1t}(0), \qquad t = T_0 + 1, \dots, T \]
The observed data matrix (Doudchenko and G. W. Imbens, 2016) is displayed above.
Let $X_{treat}$ be a $p$-vector of pre-intervention characteristics of the treated unit, and let $X_c$ be a $p \times n_0$ matrix containing the same values for the control units. This typically includes pre-treatment outcomes, in which case $p = T_0$, but other predictors (even time-invariant ones, $Z_i$) are usually available:
\[ X_{treat} := \left( Y^{obs}_{1,1}, \dots, Y^{obs}_{1,T_0}, Z_1' \right)' \]

Defn 4.46 (Synthetic Control Estimator).
For some $p \times p$ PSD matrix $V$, define $\|X\|_V = \sqrt{X'VX}$, where $V$ is typically

diagonal. Consider weights $\omega = (\omega_2, \dots, \omega_{n_0+1})$ satisfying
\[ \omega_i \ge 0, \quad i = 2, \dots, n_0 + 1 \qquad \text{(Non-negativity)} \]
\[ \sum_{i \ge 2} \omega_i = 1 \qquad \text{(Sum to 1)} \]

This forces interpolation, i.e. the counterfactual cannot take a value greater than the maximal value or smaller than the minimal value of the outcome for a control unit. The synthetic control solution $\omega^*$ solves
\[ \min_{\omega} \left\| X_{treat} - X_c\, \omega \right\|_V^2 \quad \text{s.t. Non-negativity, Sum to 1} \]
The Synthetic Control Estimator is then
\[ \hat\tau_t := Y^{obs}_{1,t} - \sum_{i=2}^{n_0+1} \omega^*_i\, Y^{obs}_{it} \]
In contrast, a simple difference-in-differences estimator gives
\[ \hat\tau^{DID}_t := \left( Y^{obs}_{1,t} - Y^{obs}_{1,T_0} \right) - \frac{1}{n_0}\sum_{i=2}^{n_0+1}\left( Y^{obs}_{it} - Y^{obs}_{i,T_0} \right) \]
Abadie, Diamond, and Hainmueller (2010) choose $V = \mathrm{diag}(v_1, \dots, v_p)$ using a nested minimisation of the Mean Square Prediction Error (MSPE) over the pre-treatment period:
\[ MSPE(V) := \sum_{t=1}^{T_0}\left( Y^{obs}_{1,t} - \sum_{i=2}^{n_0+1} \omega_i(V)\, Y^{obs}_{it} \right)^2 \]
Methods differ in how $\mu$ and $\omega$ are chosen as a function of $Y_{c,post}, Y_{t,pre}, Y_{c,pre}$.
• Impose four constraints:
  1. No intercept: $\mu = 0$. Stronger than parallel trends in DiD.
  2. Adding up: $\sum_i \omega_i = 1$. Common to DiD and SC.
  3. Non-negativity: $\omega_i \ge 0\ \forall\, i$. Ensures uniqueness via ‘coarse’ regularisation + precision control. Negative weights may improve out-of-sample prediction.
  4. Constant weights: $\omega_i = \bar\omega\ \forall\, i$.
• DiD imposes 2-4.
• ADH (2010, 2014) impose 1-3.
  – 1 + 2 imply ‘No Extrapolation’.

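A minimal sketch of the ADH weight problem in R, assuming hypothetical matrices X1 ($p \times 1$, treated) and X0 ($p \times n_0$, controls) and taking $V = I$ (not the nested $V$ optimisation); parameterising the weights through a softmax keeps them non-negative and summing to one, so unconstrained optim() suffices:

sc_weights <- function(X1, X0) {
  n0 <- ncol(X0)
  obj <- function(theta) {              # theta in R^{n0} mapped to the simplex via softmax
    w <- exp(theta) / sum(exp(theta))
    sum((X1 - X0 %*% w)^2)              # ||X_treat - X_c w||^2 with V = I
  }
  opt <- optim(rep(0, n0), obj, method = "BFGS")
  exp(opt$par) / sum(exp(opt$par))
}
# post-period effects: tau_hat_t = Y1[t] - Y0[t, ] %*% sc_weights(X1, X0)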
Defn 4.47 (Imbens and Doudchenko representation).
Doudchenko and G. W. Imbens (2016) setup:
\[ \mathbf{Y}^{obs} = \begin{bmatrix} Y^{obs}_{t,post} & \mathbf{Y}^{obs}_{c,post} \\ \mathbf{Y}^{obs}_{t,pre} & \mathbf{Y}^{obs}_{c,pre} \end{bmatrix} = \begin{bmatrix} Y_{t,post}(1) & \mathbf{Y}_{c,post}(0) \\ \mathbf{Y}_{t,pre}(0) & \mathbf{Y}_{c,pre}(0) \end{bmatrix} \qquad T \times (N+1) \]
\[ \mathbf{Y}(0) = \begin{bmatrix} ? & \mathbf{Y}_{c,post}(0) \\ \mathbf{Y}_{t,pre}(0) & \mathbf{Y}_{c,pre}(0) \end{bmatrix} \]
Relaxing these assumptions:
• Negative weights
  – If treated units are outliers on important covariates, negative weights might improve fit
  – Bias reduction: negative weights increase the bias-reduction rate
• When $N \gg T_0$, (1-3) alone might not result in a unique solution. Choose by
  – Matching on pre-treatment outcomes: one good control unit is better than a synthetic one comprised of disparate units
  – Constant weights, implicit in DiD
• The relative magnitudes of $T$ and $N$ might dictate whether we impute the missing potential outcome ? using one comparison or the other:

  – Many Units and Multiple Periods: $N \gg T_0$, $\mathbf{Y}(0)$ is ‘fat’, and one comparison becomes challenging relative to the other, so matching methods are attractive.
  – $T_0 \gg N$: $\mathbf{Y}(0)$ is ‘tall’, and matching becomes infeasible, so it might be easier to estimate the dependence structure over time.
  – Finally, if $T_0 \approx N$, a regularisation strategy for limiting the number of control units that enter into the estimation of $Y_{0,T_0+1}(0)$ may be important.
• Given many pairs of $(\mu, \omega)$, prefer values such that:
  – the synthetic control unit is similar to the treated unit in terms of lagged outcomes
  – low dispersion of weights
  – few control units with non-zero weights

Optimisation Problem
• Focus on the last period for now: $\tau_{0,T} = Y_{0,T}(1) - Y_{0,T}(0) = Y^{obs}_{0,T} - Y_{0,T}(0)$
• Many estimators impute $\hat Y_{0,T}(0)$ with the linear structure $\hat Y_{0,T}(0) = \mu + \sum_{i=1}^{N} \omega_i \cdot Y^{obs}_{i,T}$
Ingredients of the objective function:
• Balance: difference between pre-treatment outcomes for the treated unit and a linear combination of pre-treatment outcomes for controls:
\[ \left\| Y_{t,pre} - \mu - \omega^\top Y_{c,pre} \right\|_2^2 = \left( Y_{t,pre} - \mu - \omega^\top Y_{c,pre} \right)^\top \left( Y_{t,pre} - \mu - \omega^\top Y_{c,pre} \right) \]

← ToC 46 ∑ bS bS · N 1 ≤ • Sparse and small weights: (µ , ω ) = argminµ,ω Q( ; λ = 0, α) with i=1 ωi̸=0 k (=1 for OtO) ∥ ∥ Synthetic Control – sparsity : ω 1 ∥ ∥ µ = 0 – magnitude: ω 2 • assume (1-3) (i.e. ) • For M × M PSD diagonal matrix V en en (µb (λ, α), ωb (λ, α)) = argmin Q(µ, ω|Yt, pre, Yc, pre; λ, α) µ,ω ⊤ ⊤ ⊤ 2 (ωb(V), µb(V)) = argmin{(X − µ − ω X) V where Q(µ, ω|Y , Y ; λ, α) = Y − µ − ω Y t t, pre c, pre ( t, pre c, pre) 2 ω,µ 1 − α − − ⊤ } + λ ∥ω∥2 + α ∥ω∥ (Xt µ ω X) 2 2 1 b ⊤ ⊤ V = argmin {(Yt, pre − ωb(V) Yc, pre) Tailored Regularisation V=diag(v1,...,vM ) ⊤ (Yt, pre − ωb(V) Yc, pre)} • don’t want to scale covariates Yc, pre to preserve interpretability of weights Constrained regression: When Xi = Yi,t; 1 ≤ t ≤ T0 (Lagged Outcomes only) b V = I λ = 0 • Instead, treat each∑ control unit as a ‘pseudo-treated’ unit and compute Yj,T (0) = N and ben b · obs µ (j; α, λ) + i≠ j ωi(j; α, λ) Yi,T where Defn 4.48 (Interactive Fixed Effects (Bai, 2009)). ′ ′ Yit = δitDit + x β + λ ft + εit  2 it i T ∑0 ∑ Where D is the treatment, δit is the heterogeneous treatment effect for unit i at time ben ben  − −  ′ (µ (j; λ, α), ω (j; λ, α)) = argmin Yj,t µ ωiYi,t + t, xit is a p− vector of time-varying controls. ft = [f1t, . . . , frt] is a k × 1 vector of µ,ω ′ t=1 i=0̸ ,j λ = [λ , . . . , λ ] r × 1 ( ) unknown common factors, i i1 ir is a vector of unknown factor loadings. This factor component nests standard functional forms 1 − α 2 λ ∥ω∥ + α ∥ω∥ × 2 2 1 |{z}Uit = |{z}λi |{z}ft en en Confounders Loadings factors pick the value of the tuning parameters (αopt, λopt) that minimises b Yj,T (0) • ft = 1 =⇒ λi × 1 = λi unit FEs ∑N z ∑}| { en 1 en en CV (α, λ) = (Y − µb (j; α, λ) − ωb (j; α, λ) · Y ) • λi = 1 =⇒ 1 × ft = ft time FEs N j,T i i,T j=1 i=0̸ ,j • f1t = 1, f2t = ξt, λi1 = αi, λi2 = 1 =⇒ ft × λi = αi + ξt two-way FEs. Difference in Differences • ft = t =⇒ λi × ft = λi × t Unit-specific linear time trends • assume (2-4) • λi = yi0, ft = αt =⇒ λi × ft = αyi,t−1 − νit Lagged dependent variable 1 • No unique µ, ω solution for T = 2, so fix ω = N Steps 1 b ωdid = ∀i ∈ {1,...N} 1. Get initial value of β using within estimator i N ∑T0 ∑T0 ∑N b b b did 1 1 2. Estimate λi, ft using β µb = Y0,s − Yi,s T NT ′ 0 s=1 0 s=1 i=1 b b b 3. Re-estimate β using λi ft Best Subset; One-to-one Matching 4. Iterate

← ToC 47 Drawback - constant effect Sample Analogue: Defn 4.49 (Generalized Synthetic Control (Xu, 2017)). 1 ∑n E [g(w , θ)] ≈ g(w , θ) =: g (θ) With, NCO control units and NTR treated units, Write DGP for individual unit as i n i N ′ i=1 Yi = Di ◦ δi + X β + Fλi + εi i ∈ 1, 2,...NCO,NCO + 1,...,N i Define Jacobian [ ] Y = [y , y , . . . , y ]′ D = [D ,...,D ]′ X = [x ,..., x ]′ T × k Where i i1 i2 iT , i i1 iT , i i1 iT is , ∂g(wi, θ) F = [f , . . . , f ]′ T × r D(θ) := E , q × k 1 T is . ∂θ Stack controls together gives ′ problem is under-identified if rank(D) < k, just identified if rank(D) = k, and |Y{zCO} = |X{zCO} |{z}β + |{z}F |Λ{zCO} +εCO over-identified when rank(D) > k. × Evaluated at the maximum, T ×NCO T ×NCO ×p p×1 T NCO NCO ×r ∑n GSC for treatment effects is an out-of-sample prediction method: the treatment 1 d √ g(wi, θ0) → N (0, S) effect for unit i at time t is the difference between teh actual outcome and its esti- N b − b c i=1 mated coutnerfactual δit = Yit(1) Yit(0), where Yit(0) is imputed in three steps. ′′ where S = E [g(wi, θ0)g(wi, θ) ] is q ×q. Can insert a positive definite q ×q weight- 1. Estimate an IFE model using only the control group data and estimate ing matrix W that tells us how much to penalise violations of one moment condi- b b b tion relative to another. β, F, ΛCO ˆ ∑ So, GMM estimator θGMM solves b b b e e e ′ e e e β, F, Λ = argmin (Y − X β − FΛ ) (Y − X β − FΛ ) ′ CO i i i i i i θb = argmin Q (θ) := g (θ) W g (θ) βe,Fe,Λe ∈C N N N N i θ e′ e e ′ e s.t. F F = Ir; ΛCOΛCO = Diagonal Asymptotic Distribution √ b d 2. Estimate Factor loadings for each treated unit by minimising mean-squared N(θ − θ) → N (0, Vθ) error of the predicted treated outcome in pretreatment periods Where − V = (DWD)−1(DWSW′D′)(DWD) 1 b 0 − 0 b − b0 e ′ 0 − 0 b − b0 e ′ θ Λi = argmin(Yi Xi β F Λi) (Yi Xi β F Λi) e Λi Defn 4.51 (Efficient GMM). ( )− b −1 ′ 1 ′ The weight matrix W that minimises the variance of θ is W = S , which b0 b0 b0 0 − 0 b ∈ T N GMM N = F F F (Yi Xi β) i produces ( ) −1 ′ −1 where 0 superscripts denote the pretreatment periods. Vθ = DS D b b b Want S to be small (sampling variation / noise of the moments) to be small and D 3. Calculate Treated Counterfactuals based on β, F, Λi to be large (objective function steep around θ0). c ′ b b′ b ∈ T E ′ Yit(0) = xitβ + λift ; i ; t > T0 Problem is S = [g(wi, θ0)g(wi, θ0) ] is unobserved. So use sample analogue ( )− ∑n ( )( ) 1 Choose the number of factors r by cross-validation. Implemented in gsynth. − 1 ′ Wc = Sb 1 = g(w , θb) − g (θb) g(w , θb) − g (θb) n i N i N i=1 4.8 Generalised Method of Moments Steps Defn 4.50 (GMM Estimator). W = I Assume existence (from theoretical model) of r moment conditions for q parame- 1. Pick some initial guess 0 q ters b ′ E [g(wi, θ0)] = 0 2. Solve θ = argminθ gN (θ) W0gN (θ) ( ( )( )) where θ is a q × 1 vector, g· is a r × 1 vector function with r ≥ q, and θ denotes ∑ −1 0 0 c 1 n b − b b − b ′ the value of θ in the DGP. w includes all observables. 3. Update W = n i=1 g(wi, θ) gN (θ) g(wi, θ) gN (θ)

← ToC 48 b ′ c 4. Solve θGMM = argminθ gN (θ) WgN (θ) ′ ′ b b R = E [YA] − E [YB] = E [xA] βA − E [xB] βB 5. Compute D(θGMM ) and S(θGMM ) Aggregate Decomposition Example 4.25 (Standard Methods nested in GMM). ′ ′ = (E [xA] − E [xB]) βA + E [xB] (βA − βB) Most standard methods can be re-expressed as GMM. | {z } | {z } Explained Unexplained ′ E ′ • OLS: yi = xiβ +ϵi. Exogeneity implies [xiϵi] = 0. Can write in terms of ob- Decomposition from B’s PoV (Threefold Decomposition) E ′ − ′ servables and parameters as [xi(yi xiβ)] = 0. Yields moment condition ′ ′ z discrimination}| { g(yi, xi, β) = xi(yixiβ) ′ ′ ′ = {E [xA] − E [xB]} βB + E [xB] (βA − βB) + (E [xA] − E [xB]) (βA − βB) E ′ ̸ ∃ E ′ | {z } | {z } • IV: xi is endogenous, so [xiϵi] = 0. However, zi such that [ziϵi] = 0, so ′ ′ endowments moment condition for exclusion restriction is g(yi, xi, zi, β) = z (yi − x β) ∗ i i Stipulating a non-discriminatory coefficient β (Twofold Decomposition) • Maximum Likelihood: g is simply the score function, so g(w , θ) = ∂ log f(wi,θ) Unexplained i ∂θ {z }| }{ {E − E }′ ∗ E ′ − ∗ E ′ ∗ − = | [xA] {z [xB] β} + [xA] (βA β ) + [xB] (β βB) 4.9 Decomposition Methods | {z } Explained E ′ −E ′ [xA] |{z}δA [xB ] δB − ∗ Basic idea of decomposition :=βA β ∫ ∫ Detailed Decomposition F − F F | − F | M (y) F (y) = M (y x)fM (x)dx F (y x)fF (x)dx To examine the ‘contribution’ of each variable to the observed gap, estimate ∫ ∫ ∑k ∑k { F | − F | F | − 1 i ∈ B = [ M (y x) F (y x)]fM (x)dx + M (y x)[fM (x) fF (x)]dx y = x β + d x δ + ε ; d := if i ji j i ji j i i 0 otherwise j=1 j=1

so, βj is the coefficient for group A, and βj + δj is the coefficient for group B.A 4.9.1 Oaxaca-Blinder Decomposition t-test for δj is used to establish whether a variable is a source of the observed gap. The contribution of each variable to th explained part is − ′ − ′ ′ − − ′ yA yB = xAβA xBβB = xB(βA βB) + (xA xB) βA (xA − xB)βbA We consider two groups, A and B, and an outcome Y , and a vector of predictors ∗ k k k ck = b x. Main question for decomposition is how much of the mean outcome differ- (xA − xB)βA ence [or another summary statistic / quantile of CDF] is accounted for by group differences in the predictors x. The Oaxaca-Blinder decomposition refers to the 4.9.2 Distributional Regression following decompositions: Section based on Chernozhukov, Fernández-Val, and Melly (2013). Reference pa- pers:

• https://arxiv.org/abs/0904.0951 • https://ocw.mit.edu/courses/economics/14-382-econometrics-spring-2017/lecture- notes/MIT14_382S17_lec7.pdf • https://cran.r-project.org/web/packages/Counterfactual/vignettes/vignette.pdf

Let FXk denote the distribution of job-relevant characteristics (education, experi-

ence, etc.) for men when k = m and for women when k = w. Let FYj |Xj denote

← ToC 49 the conditional distribution of wages given job-relevant characteristics for group j ∈ {w, m}, which describes the stochastic wage schedule that a given group faces. Using these distributions, we can construction F, the distribution of wages for group k facing group j’s wage schedule as ∫ | ∈ F⟨j|k⟩(y) = FYj |Xj (y x)dFXk (x), y T

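A sketch of the plug-in counterfactual distribution $F_{\langle 0|1 \rangle}$ via distribution regression with a logit link, assuming hypothetical data frames men and women with an outcome wage and illustrative covariates educ and exper: at each threshold $y$, fit a binary model for $1\{wage \le y\}$ on the men's sample and average its predictions over the women's covariates.

ygrid <- quantile(men$wage, probs = seq(0.05, 0.95, by = 0.05))
F_cf <- sapply(ygrid, function(y) {
  d   <- transform(men, below = as.numeric(wage <= y))
  fit <- glm(below ~ educ + exper, family = binomial("logit"), data = d)  # F_{Y0|X0}(y|x)
  mean(predict(fit, newdata = women, type = "response"))                  # integrate over dF_{X1}
})
# F_cf approximates F_{Y<0|1>} on ygrid; ecdf(women$wage)(ygrid) gives the observed F_{Y<1|1>}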
For example, F⟨Y0|X0⟩ is the distribution of wages for men who face men’s wage

schedule, and FY1|Y1 is the distribution of wages for women who face women’s wage schedule, which are both observed distributions. We can also study F⟨0|1⟩, the counterfactual distribution of wage for women if they would face the men’s wage schedule F | . Y0 X0 ∫ ≡ | FY ⟨0|1⟩(y) FY0|X0 (y x)dFX1 (x) X1 is the counterfactual distribution constructed by integrating the conditional dis- tribution of wages for men with respect to the distribution of characteristics for women. We can Interpret FY ⟨0|1⟩ as the distribution of wages for women in the absence of gender discrimination, although it is predictive and cannot be interpreted as causal without further (strong) assumptions. [ ] [ ] ← − ← ← − ← ← − ← FY ⟨1|1⟩ FY ⟨0|0⟩ = FY ⟨1|1⟩ FY ⟨0|1⟩ + FY ⟨0|1⟩ FY ⟨0|0⟩ | {z } | {z } structure composition Assumptions for Causal Interpretation Under conditional exogeneity / selection on observables, CE can be interpreted as ∗ ∈ J causal effects. Sec 2.3 in ECTA 2013 paper spells this out in detail. Let (Yj : j ) be the vector of potential outcomes for various values of a policy j ∈ J , and X be a vector of covariates. Let∗ J denote the random variable that describes the realised policy and let Y := YJ be denote the realised outcome variable. When J is not randomly assigned, the distribution of Y |J = j may differ from the distribution ∗ | of Yj . However, under conditional exogeneity, the distribution of Y X,J = j and ∗| Yj X agree, and the observed conditional distributions have a causal interpreta- Figure 7: Oaxaca decomposition where D1 is the ‘discrimination’ piece (Bazen, tion, and so do counterfactual distributions generated from these conditionals by 2011). D1 ≠ D2 generically unless two groups have the same slope (which is prac- integrating out X. ∗ F ∗| (y|k) Y tically never the case) Let Yj J denote the distribution of the potential outcome j in the popu- lation with J = k ∈ J . The causal effect of exogenously changing the policy from l to j on the distribution of the potential outcome in the population with the J = k F ∗| (y|k) − F ∗| (y|k) realised policy is Yj J Yl J . Under conditional exogeneity, for any j, k, ∈ J , the counterfactual distribution FY ⟨j|k⟩(y) exactly corresponds to F ∗| (y|k) Yj J , and hence the causal effect of exogenously changing the policy from l to j in the population with J = k corresponds to the CE of changing the condi- tional distribution from l to j, that is

← ToC 50 nj FY ∗|J (y|k) − FY ∗|J (y|k) = FY ⟨j|k⟩(y) − FY ⟨l|k⟩(y) ∑ j l ˆ − { ≤ ′ } − ′ β (u) = argmin d [u 1 Y X b ][Y X b] Conditional exogeneity assumption for this section: j b∈R x ji ji ji ji i=1 ∗ ∈ J ⨿ | (Yj : j J X • method = "logit" implements the distribution regression estimator of the d K groups that partition the sample. For each population k, ∃Xk ∈ R and outcome conditional distribution with the logistic link function Yk. Covariate vector is observable in all populations, but the otucome is only ob- j ∈ J ⊂ K F servatble in populations . Let Xk denote the covariate distribution ˆ | ′ ˆ ∈ FYj |Xj (y x) = Λ(x β(y)) in the population k K, and FYj |Xj and QYj |Xj denote the conditional distribu- tion and quantile functions in population j ∈ J . We denote the support of Xk by where Λ is the standard logistic CDF and βˆ(y) is the distribution regression esti- dx Xk ⊂ R and the region of interest Yj by Yj ⊆ R. We refer to j as the reference mator population and k as the counterfactual population. n The reference and counterfactual populations in the wage example correspond to ∑j βˆ(y) ≡ [1{Y ≤ y} log Λ(X′ b) + 1{Y > y} log Λ(−X′ b)] different groups. We can also generate coutnerfactual populations by artificially argmaxb∈Rdx ji ij ij ji i=1 transforming a reference population. We can think of Xk as being created through a known transformation of Xj: 4.10 Causal Directed Acyclic Graphs Xk = gk(Xj), where gk : Xj→Xk Counterfactual distribution and quantile functions are formed by combining the based on http://www.stat.cmu.edu/ cshalizi/350/lectures/31/lecture-31.pdf ,Pearl conditional distribution in population j with the covariate distribution in popula- (2009), Morgan and Winship (2014), Cunningham (2020). tion k, namely: For an undirected graph between X,Y, and Z, there are four possible directed ∫ graphs: ≡ | ∈ Y FY ⟨j|k⟩(y) FYj |Xj (y x)dFXk (x), y | • X → Y → Z (a chain) Xk ≡ ← ∈ X ← Y ← Z QY ⟨j|k⟩(τ) FY ⟨j|k⟩(τ), τ (0, 1) • (another chain) ∈ J K ← { ∈ Y ≥ } X ← Y → Z where (j, k) and FY ⟨j|k⟩(τ) = inf y j : FY ⟨j|k⟩(y) τ is the left-inverse • (a fork on Y) function of FY ⟨j|k⟩. • X → Y ← Z (collision on Y) The main interest lies in the quantile effect (QE) function, defined as the difference of the two counterfactual quantile functions over a set of quantile indexes T ⊂ With the fork or either chain, we have X ⊥⊥ Z|Y . However, With a collider, X ̸⊥⊥ (0, 1) Z|Y . Causal effect of X on Y is written Pr (Y | do(X = x)). Basic idea is condition on ∆(τ) = QY ⟨j|k⟩(τ) − QY ⟨j|j⟩(τ), τ ∈ T adequate controls (i.e. not every observed control). Here, controlling for U is unnec- Estimation of Conditional distribution ∫ essary and would bias the estimate of Pr (Y | do(X = x)). | ≡ { | ≤ } FYj |Xj (y x) 1 QYj |XJ (u x) y du (0,1) U A Y

• method = "qr" default implements 4.10.1 Basics / Terminology ∫ Defn 4.52 (Backdoor Path; Confounder ≈ Omitted Variable). ˆ | { ′ ˆ ≤ } FYj |Xj (y x) = ε + 1 x βj(u) y du A backdoor path is a non-causal path from A to Y . They are ‘backdoor’ because (ϵ,1−ϵ) they flow backwards out of A: all of these paths point into A. where ε is a small constant that avoids estimation of tail quantiles, and βˆ(u) is the quantile regression estimator

← ToC 51 U ∑ ∑ Pr (Y |do(A)) = Pr (M = m|A = a) Pr (Y |A = a′,M = M) P r(A = a′) |M {z } |a′ {z } A Y Pr (M|do(A)) Pr (Y |M, do(A)) Here, A ← X → Y , where X is a common cause for treatment and the outcome. A M y So, X is a confounder. γ δ A worse problem arises with the following DAG, where dotted lines indecate that U is unobserved. Because U is unobserved, this backdoor path is open. U U The above DAG in words 1. The only way A influences Y is through M, so there is no arrow bypassing A Y M between X and Y . In other words, M intercepts all directed paths from A to Y . Defn 4.53 (Collider). U 2. Relationship between A and M is not confounded by unobservables - i.e. no back-door paths between A and M. 3. Conditional on A, the relationship between M and Y is not confounded, i.e. every backdoor path between M and Y has to be blocked by A. A Y Colliders, when left alone, always close a backdoor path. Conditioning on them, With a single mediator M that is not caused by U, the ATE can be estimated by b however, opens a backdoor path, and yields biased estimates of the causal effect multiplying estimates γb × δ (Bellemare, Bloem, and Wexler, 2020). A Y of on . The FDC estimates the ATE because it decomposes a reduced-form relationship • Common colliders are post-treatment controls A → C ← Y that is not causally identified into two causally identified relationships. Implementation through linear regressions: • Another insidious type of collider is of the form A ← · · · → C ← · · · → Y , where C is typically a lagged outcome. Mi = κ + γAi + ωi (3) Yi = α + δMi + ψAi + νi (4) Defn 4.54 (Back Door Criterion). E [M|A] = γ [ω A ] = 0 [M, ν] = 0 Vector of measured controls S satisfies the backdoor criterion if (i) S blocks every Since is identified, Cov i i in 3. Cov in 4. Assume ψ = 0 path from A to Y that has an arrow into A (i.e. blocks the back door) and (ii) no . Then, write node in S is a descendent of A. Then, τ = E [Y |do(A)] = δb× γb ∑ FDC Pr (Y |do(A = a)) = Pr (Y |A = a, S = s) Pr (S = s) s 4.10.2 Mediation Analysis Which is the same as the subclassification estimator. The conditional Expectation (Imai, Keele, and Yamamoto, 2010) Pearl (2001), Robins(2003) E | [Y A = a, S = s] can be computed using a nonparametric regression / ML algo- Consider SRS where we observe (Di,Mi, Xi,Yi), where Di is a treatment indicator, rithm of choice. Mi is a mediator, Xi is a vector of pre-treatment controls, and Yi is the outcome. M X Y Defn 4.55 (Frontdoor Criterion). The supports are , , respectively. M satisfies the frontdoor criterion if (i) M blocks all directed paths from A to Y , Let Mi(d) denote potential value for the mediator under treatment status Di = d. (ii) there are no unblocked back-door paths from A to M, and (iii) A blocks all The outcome Yi(d, m) is the potential outcome for unit i when Di = d, Mi = m. backdoor paths from M to Y . The observed variables can be written as Mi = Mi(Di),Yi = Yi(Di,Mi(Di)). Then,

← ToC 52 M ′ ′ CDEi(d, d , m) = Yi(d, m) − Yi(d , m) m ∈ M Difference between NDE and CDE is what value mediator is fixed at. Restated: ψ (d, d′, m) = Y (d, m) − Y (d′, m) D Y i i i d, a used interchangeably for treatment. Effect of changing the treatment while fixing the value of the mediator at some level m. ψ(d, d′, m) = E [Y (d, m) − Y (d′, m)] Assumption 8 (Sequential Unconfoundedness of Treatment, Mediator). i i { } ⊥⊥ | Theorem 4.26 (ATE decomposition (VanderWeele and Tchetgen-Tchetgen(2014))). Yi(d, m),Mi(d) Di Xi = x Random assignment of D (5) Decomposing total effect with binary mediator Y (d, m) ⊥⊥ M (d)|D ,X = x i i i i No outcome mediation (6) ′ ′ ′ ′ τ(d, d ) = ACDE(d, d , 0) + ANIE(d, d ) ∀d , d ∈ {0, 1} and (m, x) ∈ M × X | {z } | {z } This requires the treatment to be conditionally independent of the potential me- Direct Effect Indirect Effect E ′ ′ − ′ diator states and outcomes given X, ruling out unobserved confounders jointly + | [M(a )[CDE(d, d{z, 1) CDE(d, d , 0)]]} affecting the treatment on the one hand and the mediator and/or the outcome on Interaction the other hand conditional on the covariates. (5) postulates independence between the counterfactual outcome and mediator values ‘across-worlds’. Effectively, Need M to be randomly assigned (approx). Fact 4.27 (Parametric Setup for ACME estimation). Assume linear models for mediator M = ψT + Um and Y = βT + γM + UY . Defn 4.56 (Natural Indirect Effect). Then fit the following regressions NIEi(d) ≡ δi(d) := Yi(d, Mi(1)) − Yi(d, Mi(0)) Y = α + τ D + ε (7) Difference in Y holding treatment status constant, and varying the mediator. Sample i 1 |{z} i i1 Average: Average Causal Mediation Effect (ACME) Total effect Mi = α2 + ψDi + εi2 (8) δ(d) := E [δi(d)] = E [Yi(d, Mi(1)) − Yi(d, Mi(0))] Yi = α3 + |{z}β Di + γMi + εi3 (9) Defn 4.57 (Natural Direct Effect). Direct Effect ≡ − NDEi(d) θi(d) := Yi(1,Mi(d)) Yi(0,Mi(d)) Baron and Kenny (1986) suggest testing τ = ψ = β = 0. If all nulls rejected, Difference in Y holding mediator constant, and varying the treatment. Mediation effect δ = ψγ. Equivalently, mediation effect is τ − β = ψ × γ. Estimate variance using bootstrap / delta method. Defn 4.58 (Total Causal Effect / Treatment Effect Decomposition). − Fact 4.28 (Semiparametric Estimation). τi = Yi(1,Mi(1)) Yi(0,Mi(0)) Assume selection on observables w.r.t. D, M. − − − = |Yi(1,Mi(1)){z Y (0,Mi(1))} Y| i(0,Mi(1)){z Y (0,Mi(0))} Huber(2014) Average direct effect identified by θi(1) δi(0) [( ) ] − − − · · − | = |Yi(1,Mi(0)){z Y (0,Mi(0))} Y| i(1,Mi(1)){z Y (1,Mi(0))} E Y D − Y (1 D) · Pr ((D = d M,X)) θ(d) = | − | | θi(0) δi(1) Pr (D = 1 M,X) 1 Pr (D = 1 M,X) Pr (D = d X) − = |δi{z(d}) + |θi(1{z d}) Average Indirect Effect identified by [ ( )] indirect effect direct effect Y · 1D=d Pr (D = 1|M,X) 1 − Pr (D = 1|M,X) = δ(d) = E − NDE + NIE defined on opposite treatment states Pr (D = d|M,X) Pr (D = 1) 1 − Pr (D = 1|X) Defn 4.59 (Controlled Direct Effect (Acharya, Blackwell, and Sen, 2016)). implemented in causalweight::medweight. NDE conditions on potential mediator effects.For CDE, we set mediator at a pre- https://cran.r-project.org/web/packages/causalweight/vignettes/bodory-huber.pdf scribed value m.

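A sketch of the parametric ACME from Fact 4.27 (product of coefficients $\delta = \psi\gamma$) with a nonparametric bootstrap for its standard error, assuming a hypothetical data frame dat with treatment D, mediator M, and outcome Y; the mediation package's mediate() offers a fuller implementation.

med_fit <- function(d) {
  psi   <- coef(lm(M ~ D, data = d))["D"]       # eqn (8): mediator on treatment
  gamma <- coef(lm(Y ~ D + M, data = d))["M"]   # eqn (9): outcome on treatment and mediator
  unname(psi * gamma)                           # ACME = psi * gamma
}
acme  <- med_fit(dat)
boots <- replicate(999, med_fit(dat[sample(nrow(dat), replace = TRUE), ]))
c(acme = acme, se = sd(boots))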
← ToC 53 5 Semiparametrics and Nonparametrics Defn 5.2 (Empirical Process). Given an empirical distribution Fn, the corresponding empirical process is the rescaled gap based on Tsiatis (2007), Wasserman (2006), and Kennedy (2015) √ Zn(x) = n(Fn(x) − F(x)) 5.1 Semiparametric Theory Theorem 5.1 (Glivenko Cantelli Theorem). Observations Z1,...,Zn that take values in a measurable space (Z, B) with dis- the Kolmogorov-Smirnov statistic Dn tribution P0 . A statistical model P is a collection of probability measures on the a.s. Dn := ∥Fn − F∥∞ ≡ sup |Fn(t) − F(t)| → 0 sample space, which is assumed to contain the data distribution P0. t∈R ∈ The general goal is estimation and inference for some target parameter ψ0 = ψ(P0) and √ Rp where ψ = ψ(P ) can be viewed ∥Fn − F∥∞ = Op (log n/n) = Op( 1/n) F f : X 7→ R P − • A nonparametric model P is a collection of all probability distributions A class of measureable functions is said to be a Glivenko-Cantelli class if ∫ • A parametric model is a model that can be smoothly indexed by a Euclidian |P − | a.s.→ θ ∈ Rq ψ ⊆ θ sup nf P f 0 ; P f := f(x)P (dx) vector with . f∈F X • A semiparametric model is one that contains both parametric and nonpara- Theorem 5.2 (Dvoretzky-Kiefer Wolfowitz (DKW) inequality). metric parts. [ ] b 2 ∀ε > 0, P sup F(x) − Fn(x) > ε ≤ 2 exp(−2nε ) 5.1.1 Empirical Processes Background x 2 log(2/α) A stochastic process is a collection of random variables {X(t), t ∈ T } on the same This allows us to construct a confidence set. For example, let εn = 2n . The probability space indexed by an arbitrary index set T . An empirical process is a nonparametric DKW confidence band is stochastic process based on a random sample. b b (Fn − εn, Fn + εn) Defn 5.1 (empirical distribution function). Defn 5.3 (Statistical Functional and Plug-in Estimator). 1 ∑n A∫ statistical functional T (F) is any function of F. Examples include the mean, µ := F (t) = 1 ≤ F 2 − F F−1 n n Xi t xd (x), variance σ := (x µ)d (x), and the median m = (1/2) i=1 b b A plug-in estimator of θ =: T (F) is defined by θn := T (Fn). Fn(t) is unbiased and has variance [ ] [ ] F − F ∫ b 1 b (x)(1 (x)) a(x)dF(x) V Fn(t) = V nFn(t) = Plugin estimator for a linear functional A functional of the form is n2 n called a linear functional. A plug-in estimator for it is ∫ ∑n This can be generalised to an empirical measure over a random sample X1,...,Xn 1 T (Fb ) = a(x)dFb (x) = a(X ) of independent draws from a probability measure P on an arbitrary sample space n n n i X . The empirical measure is defined as i=1 1 ∑n P = δ 5.1.2 Influence Functions n n Xi i=1 Defn 5.4 (influence∑ function). δ 1 x 0 P 1 where x is a dirac delta that assigns mass at and otherwise.∑ Let n = n i δZi denote the empirical distribution of the data.∑ This means For a measurable function f : X 7→ R, we denote P f = 1 n f(X ). 1 = f(Z ) = n n i=1 i ∫for example that empirical averages can simply be written as n i i X R F {P ∈ F} F b b Setting = , n can be re-expressed as the empirical process nf, f where := f(z)dP = P {f(Z)}. An estimator ψ = ψ(P ) is Regular and Asymptotically {1 ∈ R} n n n x≤t, t . Linear (RAL) with influence function ϕ if it can be approximated by an empirical average

← ToC 54 ∑n √ ∑n b 1 b 1 ψ − ψ = ϕ(Z) + o (1/ n) θ = θ + ϕ (x ) + higher order terms 0 n p n θ i i=1 [ ] i=1 where ϕ has mean zero (E [ϕ(Z)] = 0) and finite variance (E ϕ(Z)⊗2 < ∞). By the higher order terms converge in probability to zero, which implies b √ ∑n CLT, an estimator ψ with influence function ϕ is asymptotically normal with b 1 √ ( [ ]) n(θ − θ) = √ ϕθ(xi) + op(1) b d ⊗2 n n(ψ − ψ0) → N 0, E ϕ(Z) i=1 Gateaux derivative is defined Practically, one can compute variances of complicated structural problems by com- puting the empirical equivalents of influence functions and stacking. Given esti- θ((1 − ε)F + εδx) − θ(F) LF(x) = lim mators θ1, . . . , θM and observations i = 1,...,N, create matrix with rows corre- ε→0 ε sponding to observations and columns corresponding to estimators The empirical influence function uses the empirical distribution function Φ = [ϕθ1 , . . . , ϕθN ]N×M =:F (x) ∑ z }|ϵ { √1 n Since the distribution of each estimator is the same as n i=1 ϕθj (xi), the vari- θb((1 − ε)Fb + εδ ) − θb(Fb) ϕ(x) = lim n x ance can be computed as ε→0 ε 1 ⊤ LF(x) V = (Φ Φ) the influence function behaves like the score function[ in parametric] ∫ estimation N 2 b 2 because of the MLE analogues LF(x)dF(x) = 0 and V T (Fn) ≈ LF(x)dF(x)/n. 5.1.3 Tangent Spaces Fact 5.3 (Robust estimators have bounded influence functions.). Assume target parameter ψ is scalar. Influence functions∫ reside[ in Hilber] space 2 2 2 L2(P ) of measureable functions g : Z→R with P g = g dP = E g(Z) < ∞ Example 5.4 (Examples of Influence functions). ⟨ ⟩ ⟨ ⟩ mean: ϕ(x) = x − E [X] equipped with covariance inner product g1, g2 = P g1g2 . variance: ϕ(x) = (x − E [X])2 − V [X] − E − E − Defn 5.5 (Tangent Space for parametric models). covariance: ϕ(x) = (x [X])(y [y]) Cov [X,Y ] the tangent space T for parametric models indexed by the real-valued parameter linear regression : θ ∈ Rq+1 is the linear subspace of L (P ) spanned by the score vector { 2 } x − E [X] T ⊤ ∈ Rq+1 ϕ(x, y) = {(y − E [Y ]) − β(x − E [x])} = bSθ (Z; θ0): b V [X] ∂ log p(z;θ) | M-estimators solve E [g(X, θ)] = 0 where g is a score function. The influence func- where Sθ(Z, θ0) = ∂θ θ=θ0 is the score-function. If we can decompose θ = tion is (ψ, η), we can decompose the tangent space as well and write T = Tψ ⊕ Tη. In this − T ψ ϕ (x) = E [∇ g(X, θ)] 1 g(x, θ) formulation, η is known as the nuisance tangent space. Influence functions for θ θ reside in the orthogonal complement of the nuisance tangent space denoted by which nests both MLE and GMM estimators. T ⊥ { ∈ ∀ ∈ T } https://j-kahn.com/files/influencefunctions.pdf η := g L2(P ): P (gh) = 0 h η { ∈ − | T ∈ } Theorem 5.5 (Influence Functions and Variance). = g L2(P ): h Π(h η), h L2(P ) E If [ϕθ(x)] = 0, we can write where Π(g|S) denotes projections of g on the space S. E Cov [ϕθ1 (x), ϕθ2 (x)] = [ϕθ1 (x)ϕθ2 (x)] b • The subspace of influence functions is the set of elements ϕ ∈ T ⊥ that satisfy Say we have an estimate θ from a random sample. We can look at this sample as a η − 1 P (ϕSψ) = 1. series of ε contaminations to the true distribution, each of which puts n weight on the derivative. Then, for large enough N, we can represent the difference between • The efficient influence function is that with the smallest covariance P (ϕ2) and b θ and θ as a Taylor expansion 2 −1 is given by ϕeff = P (Seff) Seff

← ToC 55 − |T |T ⊥ − – Seff is the efficient score Seff = Sψ Π(Sψ η) = Π(Sψ η ). • Cn is a finite-sample 1 α confidence set if

T ⊥ S inf PF (θ ∈ Cn) ≥ 1 − α ∀ n • All RAL estimators have influence functions ϕ that reside in η with P (ϕ ψ) = F ∈F 1 C 1 − α Defn 5.6 (Tangent spaces for semiparametric models). • n is a uniform asymptotic confidence set if A parametric submodel Pε is indexed by a real valued parameter ε (Pε = {Pε : ε ∈ R}) lim inf inf PF (θ ∈ Cn) ≥ 1 − α →∞ ∈ is a set of distributions contained in the larger model P, which also contains the n F F P ∈ P truth 0 ε. A typical example of a parametric submodel is • Cn is a pointwise asymptotic 1 − α confidence set is pε(z) = p0(z) {1 + εg(z)} ∀F ∈ F lim inf PF (θ ∈ Cn) ≥ 1 − α E | | | | ≥ n→∞ where [g(Z)] = 0 and supz g(z) < M, ε < 1/M , so pε(z) 0. The parametric submodel is sometimes indexed by g such that Pε = Pε,g Finite sample confidence set ≻ Uniform asymptotic confidence sets succ pointwise The tangent space T for semiparametric models is defined as the closure of the asymptotic confidence set. linear span of scores of the parametric submodels. IoW, we define scores on para- Informally, a true confidence band comprises of two functions that bracket the c.d.f. ∂ log pε(z) | 1 − α metric submodels Pε{as Sε(z) = ∂ε} ε=0, and construct parametric submodel at all points with probability . This should be contrasted to the pointwise ⊤ 1 − α tangent spaces Tε = b Sε(Z): b ∈ R . bands in common use, which bracket the c.d.f. with probability at any given point. Similarly, the nuisance tangent space Tη is the set of scores in T that do not vary the target parameter Defn 5.7 (Gaussian Process). { } G A {G } A gaussian process indexed by a set is a collection (x) x∈A of Gaussian ∂ψ(Pε,g) ′ Tη = g ∈ T : |ε=0 = 0 ∀x , . . . , x ∈ A, (G(x ),...,G(x )) ∼ ∂ε random variables such that 1 k 1 k MVN. The function m(x) := E [G(x)] is a mean function and C(x, y) := Cov [G(x),G(y)] is In nonparametric models, the tangent space is the whole Hilbert space of mean- the covariance function. A centered gaussian process is one whose mean function is zero functions. identically zero. As before, the efficient influence function is the influence function with thesmall- [ ] |T 2 est covariance and is defined as the projection ϕeff = Π(ϕ ). It can also be de- ∀x, y ∈ A, E |G(x) − G(y)| = C(x, x) + C(y, y) − 2C(x, y) fined as the pathwise derivative of the target parameter in the sense that P (ϕSε) = ∂ψ(Pε) |ε=0. Example 5.8 (Brownian Motion / Wiener Process). ∂ε is a centered gaussian process indexed by [0, ∞) whose covariance function is given Fact 5.6 (M− and Z− estimation). by Distinction from Kosorok(2008). C(s, t) := min(s, t) , s, t ≥ 0 • approximate Maximisers of data-dependent processes are known as M-estimators. Theorem 5.9 (Donsker Theorem). Z → Z ≡ U R ∥·∥ U • approximate Zeroes of data-dependent processes are known as Z-estimators. n (F ); D( , ∞) where is a standard Brownian bridge process on ′ e.g. Un(β) = Pn[X(Y − X β)]. Z ⊂ M [0, 1], which means it is a zero-mean Gaussian process with covariance function E [U(s)U(t)] = s ∧ t − st , s, t ∈ [0, 1] R ∥·∥ →R Nonparametric Delta Method which means for any bounded continuous function g : D( , ∞) , we have E Z →d E T (Fb ) − T (F) [g( n)] [g(Z)] n ≈ N (0, 1) ∑n √ and 1 2 L (Xi) / n d n g(Zn) → g(Z) | i=1{z } b τ 5.2 Semiparametric Theory for Causal Inference Fact 5.7 (Confidence Bands). Let F be a class of distribution functions F, let θ be the quantity of interest, and Cn notes from Kennedy (2015) (epi-ish notation). Treatement denoted by A (’action’) be a set of possible values of θ which depends on the data X1,...,Xn. 
and controls denoted by L.

← ToC 56 [ ] [ ] [ ] Target parameter (ψ) : which may be an ATE E Y 1 − Y 0 or a risk ratio E Y 1 /E Y 0 ( ) ψ (Z; ψ, α) and so on. m(z; θ) = ipw Assumptions S(Z; α) 1. Consistency/SUTVA: A = a =⇒ Y = Y a. Under standard regularity conditions, we have ( [ ] ) −1 √ a ⊥⊥ | ∂m(Z; θ0) 2. Unconfoundedness: Y A L θb− θ = P E m(Z; θ ) + o (1/ n) 0 n ∂θ 0 p 3. Overlap: Pr (A = a | L = l) ≥ δ > 0 ∀ p(L = l) > 0. Example 5.11 (Efficient Influence function for ATE). Then the ATE can be written [ ] ∫ For ψ = E Y 1 − Y 0 = E [µ(L, 1) − µ(L, 0)]. Under a nonparametric model where the distribution of P is left unrestricted, the efficient influence function for ψ is ψ = (E [Y | L = l, A = 1] − E [Y | L = l, A = 0]) dFL L given by

This is the outcome regression (), subclassification estimator (statis- ϕ(Z; ψ, η) = m1(Z; η) − m0(Z; η) − ψ tics), g-formula(), backdoor criterion estimator (DAGology). where Z := ⟨L, A, Y ⟩ P We suppose the data is and its distribution 0 admits to the follow- 1 (Y − µ(L, a)) ing factorisation: m (Z, η) = m (Z; π, µ) = A=a + µ(L, a) a a aπ(L) + (1 − a)(1 − π(L)) p(z) = p(y | l, a) p(a | l) p(l) Suppose the estimator ηbconverges to some η = (π, µ). Then, P(m(Z, η)) = P(m(Z; η )) = In semiparametric causal settings, one typically imposes parametric assumptions 0 ψ0 ∫ on the treatment mechanism leaving the outcome mechanism unspecified. For Given P(f(Z)) = f(z)dP to denote expectations of f(Z) for a new observation Z example, for the ATE, one might write and the decomposition zParametric}| { b ψ − ψ = (P − P)m(Z; ηb) − P(m(Z; ηb) − m(Z; η )) (10) p(z; η, α) = p(y | l, a; η ) p(a | l; α) p(l; η ) 0 n 0 | {z y} l the first term can be shown to admit to the following result Nonparametric √ (Pn − P)m(Z; ηb) = (Pn − P)m(Z; η0) + op(1/ n) where α ∈ Rq but η = ⟨η , η ⟩ represents an infinite-dimensional parameter that y l so that (P − P)m(Z; ηb) is asymptotically equivalent to its limiting version (P − does not restrict the conditional distribution of the outcome given covariates and n n P)m(Z; η ). This requires that M = {m(; η): η ∈ H} is a Donsker class, where H treatment p(y | l, a) and the marginal covariate distribution p(l). 0 is a function class containing the nuisance estimator ηb, or that H itself is Donsker. We can then expand 10 to Example 5.10 (Influence functions for causal estimands). √ For IPW estimator b ψ − ψ0 = (Pn − P) m(Z;η ¯) + P{m(Z;η ˆ) − m(Z;η ¯)} + op(1/ n) 1 ∑n AY (1 − A)L b ψb = − By the fact that π is bounded away from 0, 1 and Cauchy Schwartz, the middle IPW n π(L) 1 − π(L) term |P(m(Z; ηb) − m(Z; η))| is bounded from above by i=1 ∑ The influence function is clearly ∥π0(L) − πb(L)∥ ∥µ0(L, a) − µb(L, a)∥ AY (1 − A)Y a∈{0,1} ϕIPW (Z) = − − ψ0 π(L) 1 − π(L) So, if πb is based on a correctly specified model, we √only need µb to be consistent to ∑ n make the product term P(m(Z; ηb) − m(Z; η)) = o (1/ n) asymptotically negligi- ψb − ψ = 1 ϕ (Z) p since IPW 0 n i=1 ipw exactly. ble. In an , π(l) needs to be estimated, and suppose we do so with a corectly specified parametric model π(l; α) , α ∈ Rq so that αb solves some mo- P b b b∗ b⊤ ⊤ 5.3 Nonparametric Density Estimation ment( condition) n (S(Z; α)) = 0. Then, we have θ = (ψIPW , α ) which solves b Defn 5.8 ( Estimator). Pn m(Z; θ) where Given a vector of mutually exclusive bins B1,...,Bj that partition supp X, and

← ToC 57 ν , . . . , ν be the corresponding counts in each bin, the histogram estimator is de- Conditional Quantile Function 1 j { } fined as b Fb | ≥ b−1 | ∑n qα(x) = inf y : (y x) α =: F (α x) νj/n fˆ(x) = 1 ∈ h x Bj i=1 Conditional mb (x) = max gb(y|x) Defn 5.9 ((Rosenblatt-Parzen) Kernel Density Estimator). y K : R→R Given a Smoothing kernel that satisfies the following properties where gb(y|x) is the kernel estimator of the conditional density. 1. K(x) ≥ 0 ∫ 5.4 Nonparametric Regression 2. xK(x)dx = 0 ∫ Given a random pair (x, y) ∈ Rd × R, the regression function is 3. K(x)dx = 1 ∫ m0(x) = E [Y |X = x] 4. 0 < x2K(0, x)dx < ∞ [ We aim to approximate this with m(·) when estimating nonparametric regressions. and a bandwidth h > 0, the kernel density estimator is ∑n ( ) ∑n mb (x) = wi(x)yi 1 1 Xi − x fˆ(x) = K i=1 n h h i=1 where the weights w are estimated using different methods. [ ] ( ) i E ˆ 2 fk(x) = f(x) + O h ; bias decreases as h gets smaller. Defn 5.10 (K-Nearest Neighbours Regression). [ ] ∫ ∫ ≥ f(x) K(x)2 Fix an integer k 1. V ˆ 1 ≈ f(x) 2 ∑ fk(x) = + o( ) K(x) dx 1 nh nh nh mb (x) = y k i vanishes as nh→∞. i∈N (x) Higher density implies higher variance - more data in a neighbourhood makes k the density estimation problem harder. where Nk(x) is the k-neighbourhood of x which contains the indices of the k clos- Haerdle et al (2004) find ‘optimal’ bin-width h for n observations is est points. ( √ ) 24 π 1/3 Defn 5.11 (Nadaraya-Watson / Kernel regression). h = opt n ∑N b ′ mh(x) = Sxy = wi(x)yi 5.3.1 Conditional Density and Distribution Function Estimation i=1 K w (x) where is a kernel and the weights i are( given) by Conditional Density x−x b K i f(x, y) w (x) := (h ) b | i ∑ − g(y x) = b n K x xj f(x) j=1 h

where K(·) is a kernel-function that assigns a value that is lower the closer xi is to Conditional CDF ∑ x, and h is the bandwidth. The estimate of E [Y |X = x] at x is a weighted average 1 n y−yi G( )Kh(xi, x) of yi’s ‘near x’. Fb | n i=1 h0 (y x) = Stated differently, fb(x) ∑ n x−xi K( )yi where G(.) is a kernel CDF (typically Normal), h0 is a the smoothing parameter mb (x) = ∑i=1 h n x−xi associated with y, and Kh(xi, x) is a product kernel. i=1 K( h ) This can be inverted to get the

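A minimal sketch of the Nadaraya-Watson estimator with a Gaussian kernel, for hypothetical vectors x, y and an illustrative bandwidth h = 0.5; stats::ksmooth gives a built-in version.

nw <- function(x0, x, y, h) {
  w <- dnorm((x - x0) / h)   # kernel weights K((x_i - x0)/h)
  sum(w * y) / sum(w)        # weighted average of y_i near x0
}
grid <- seq(min(x), max(x), length.out = 100)
mhat <- sapply(grid, nw, x = x, y = y, h = 0.5)
# built-in: ksmooth(x, y, kernel = "normal", bandwidth = 0.5, x.points = grid)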
← ToC 58 mb (x) is consistent and asymptotically normal. Assuming X has density f, Its vari- Klein and Spady’s Semiparametric Binary Model ance is ∫ ( ) ∑n 2 ′ ′ σ (x) 1 ℓ(β, h) = ((1 − yi) log(1 − gb−i(x β)) + yi log(gb−i(x β) )) V [mb (x)] = K2(u)du + o i i f(x) · nh nh i=1 Where σ2(x) = V [Y |X = x] 5.6 Splines the bias for this estimator is ( ) ′ 2 C1 ′′ ′ f (x) Defn 5.15 (Regression Splines). bias[mch(x)] = E [mb h(x)] − m(x) ∼ h m (x) + C2m (x) 2 f(x) Given inputs x1, . . . , xn and responses y1, . . . , yn, fit functions f that are k−th order splines with knots at some chosen locations t1, . . . , tp. This means expressing A Nadaraya-Watson estimator with a uniform kernel is called a Regressogram. p+∑k+1 Defn 5.12 (Local Linear Regression / loess). f(x) = βjgj(x) Define the loss function j=1 ( ) ∑n − x − x where β1, . . . , βp+k+1 are coefficients and g1, . . . , gp+k+1 are basis functions for k L(β) = (y − m − (x − x)′β)2K i i i h splines over the knots t1, . . . , tp. i=1 √ This is equivalent to the standard regression problem when we define − mb (x) is obtained by regressing Y on X −x, with weights equal to K( x xi ). This h G = gj(xi), i = 1, . . . , n , j = 1, . . . , p + k + 1 E | estimator is consistent for [Y X = x], with the same rate of convergence as NW. b ∥ − ∥2 β = argmin y Gβ 2 β∈Rp+k+1 ( ) NW fits a Kernel-weighted constant (0-th order polynomial) near x; LLR fits a b ⊤ −1 ⊤ straight line. which gives the standard OLS solution β = G G G y. Regression splines are linear smoothers. 5.5 Semiparametric Regression Regression splines exhibit erratic boundary behaviour . One solution to this is A partially linear model is given by to force lower degrees at the extremities. ′ yi = xiβ + g(zi) + εi i = 1, . . . , n • f is of order k on each [t1, t2],..., [tp−1, tp] where xi is a p− vector, zi is a scalar, and g(.) is not specified. Standard exogeneity E | • f is of degree (k − 1)/2 on (−∞, t1] and [tp, ∞) assumption [ε xi, zi] = 0. √ Robinson (1988) estimator is n-consistent. • f is continuous and has continuous derivatives of orders 1, . . . , k − 1 Defn 5.13 (Robinson Estimator). Defn 5.16 (Smoothing Splines). − E | − E | y = g(x )+ε x ∈ [0, 1] |yi {z[yi zi}] = (|xi {z[xi zi])} β + εi Given a simple non-parametric regression problem i i i, with i , e e the problem to be solved is yi xi ∫ ∑n 1 where yei and xi are estimated using Kernel regression. 2 ′′ 2 min (yi − g(xi)) + λ g (x) dx g∈G i=1 | 0 {z } 5.5.1 Index Models roughness penalty Defn 5.14 (Single Index Model). The solution to the above problem is a cubic smoothing spline, which is a piece- Y = g(x′β) + ε wise cubic polynomial: a function with continuous first and second derivatives, whose third derivative may take discrete jumps at designated points, called ‘knots’. with E [ε|x] = 0. The term x′β is called a single index because it is a scalar. g(.) is On each segment [xi, xi+1), we may write s(x) as a cubic polynomial. left unspecified, hence ‘semiparametric’.

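A sketch of regression splines and a smoothing spline for hypothetical x, y: splines::bs builds the B-spline basis for the OLS formulation of Defn 5.15, and smooth.spline solves the roughness-penalised problem of Defn 5.16 directly.

library(splines)
fit_rs <- lm(y ~ bs(x, knots = quantile(x, c(.25, .5, .75)), degree = 3))  # cubic regression spline
fit_ss <- smooth.spline(x, y)   # cubic smoothing spline; penalty chosen by (generalised) cross-validation
plot(x, y)
lines(sort(x), predict(fit_ss, sort(x))$y)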
← ToC 59 Defn 5.17 (Generalised Additive Model (GAM)). Theorem 5.12∑ (Representer Theorem / The Reproducing Property of RKHS). Semiparametric model of the form y = f(x) + ε, where f(x) is typically imple- Let f(x) = α K (x). Then, i i xi ∑ mented using basis-splines ⟨f, Kx⟩ = αiK(xi, x) = f(x) ∑J i E [y|x] = g (x ) j ij This also implies j ⟨K ,K ⟩ = K(x, y) or ‘thin-plate’ splines. Estimated using backfitting [mgcv in R]. x y This implies that Kx is the representer of the evaluation functional. The completion H ∥·∥ H 5.6.1 Reproducing Kernel Hilbert Spaces of 0 with respect to K is denoted by K and is called the RKHS generated by K. A RKHS is a Hilbert space (complete inner product space) with extra structure. Defn 5.19 (RKHS Regression). Most common example is L space 2 { ∫ } Define mb to minimise →R 2 ∞ ∑n L2[0, 1] = f : [0, 1] : f < − 2 ∥ ∥2 R := (yi m(xi)) + λ m K endowed with the inner product i=1 ∑ ∫ b n By the representer thm, m(x) = i=1 αiK(xi, x). Plugging this into the above ⟨f, g⟩ = f(x)g(x)dx definition, R = ∥Y − Kα∥2 + λα⊤Kα with corresponding norm √∫ The minimiser over α is √ − || || ⟨ ⟩ 2 b K 1 f = f, f = f (x)dx ∑ α = ( + λI) Y b b and m(x) = j αjK(Xi, x) Defn 5.18 (Mercer Kernel). A RKHS is defined by a Mercer kernel K(x, y) that is symmetric and positive def- Defn 5.20 (Support Vector Machines). ∈ {− } inite, which means that for any function f, Support Vector Machines. Suppose Yi 1, +1 . Recall the the linear SVM min- ∫ ∫ imizes the penalized hinge loss: ≥ K(x, y)f(x)f(y)dx dy 0 ∑ [ ( )] λ J = 1 − Y β + βT X + ∥β∥2 i 0 i + 2 2 The main example is the Gaussian kernel( ) i ∥x − y∥2 The dual is to maximize K(x, y) = exp − ∑ ∑ 2 1 σ α − α α Y Y ⟨X ,X ⟩ i 2 i j i j i j Given a kernel K, let Kx(·) be the function obtained by fixing the first coordinate i i,j at x s.t. Kx(y) = K(x, y). For the Gaussian kernel, Kx is a Normal, centered at x. 0 ≤ α ≤ C H subject to i The RKHS version is to minimize We can create functions by taking linear combinations of the kernel. Let 0 denote ∑ all such functions λ 2   J = [1 − Yif (Xi)] + ∥f∥ + 2 K  ∑k  i H = f : α K (x) 0  j xj  Defn 5.21 (Linear Smoothers).∑ j=1 b Estimators of the form m(x) = i wi(x)Yi with weights wi(x) that don’t depend which can be used to define an inner product on Yi are known as linear smoothers. Fittedvalues are µb = Sy for some matrix ∑ ∑ × S ∈ Rn n depending on inputs and tuning parameters. ⟨f, g⟩ = ⟨f, g⟩ = α β K(x , y ) K i j i j The effective degrees of freedom of a linear smoother is i j

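A minimal sketch of RKHS / kernel ridge regression with a Gaussian kernel, following the representer-theorem solution $\hat\alpha = (K + \lambda I)^{-1}y$ above; x, y are hypothetical and sigma, lambda are illustrative tuning parameters.

gauss_k <- function(a, b, sigma = 1)
  exp(-outer(a, b, function(u, v) (u - v)^2) / (2 * sigma^2))
lambda <- 0.1
K      <- gauss_k(x, x)
alpha  <- solve(K + lambda * diag(length(x)), y)            # alpha-hat = (K + lambda I)^{-1} y
mhat   <- function(x0) as.vector(gauss_k(x0, x) %*% alpha)  # m-hat(x0) = sum_j alpha_j K(x_j, x0)
mhat(seq(min(x), max(x), length.out = 100))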
← ToC 60 Realistic case where σ2 is unknown we can show that posterior has the form ∑n [Murphy pp 237] ν = df(µb) = Sii = tr S i=1 2 2 p(β, σ , D) = NIG(β, σ |βN , ΣN , aN , bN ) −1 ′ Defn 5.22 (). βN = ΣN ((Σ0) β0 + X y) A wavelet is a function ψ such that ( ) { } −1 ′ −1 ΣN = Σ0 X X 2j/2ψ(2j · −k); j, k ∈ Z Defn 5.23 (Kernel Trick). L ψ is an Orthonormal basis for 2 space. is called a ‘mother wavelet’, which can be Let ϕ(x): RD→RM be a mapping from D dimensional input space to M dimen- φ constructed from a ‘father wavelet’ . sional feature / basis space. Let Φ(X) be the aggregation of columns ϕ(x) for all observations. Then, the same formulation as above applies, with X, x replaced by 5.7 Gaussian Processes ϕ, Φ, which gives the posterior predictive distribution

Based on Williams and Rasmussen (2006), Murphy (2012), and Scholkopf and Smola ⊤ 2 −1 f∗ | x∗, X, y ∼ N (ϕ Σ Φ(K + σ I) y, (2018). ∗ p n ⊤ − ⊤ 2 −1 ⊤ ϕ∗ Σpϕ∗ ϕ∗ ΣpΦ(K + σnI) Φ Σpϕ∗) 5.7.1 Bayesian Linear Regression ⊤ where K = Φ ΣpΦ. ( ) ⊤ y, X y = y −y¯1 p(y|X, β, σ2) = N y|Xβ, σ2I In the above expression, the feature space always enters in the form or Φ ΣpΦ, Center such that N . Likelihood is N . ⊤ ′ ⊤ ′ | 2 ∝ − 1 ∥ − ∥ ϕ∗Σϕ, or ϕ∗ Σϕ∗. This means we can define a kernel k(x, x ) = ϕ(x) Σpϕ(x ), Choose normal prior p(y X, β, σ ) exp( 2 y Xβ ) or 2σ 2 ′ · ′ Conjugate prior is also Gaussian, which we denote by p(β) = N (β|β , Σ ). Then, which gives us an equivalent dot-product representation k(x, x ) = ψ(x) ψ(x ) 0 0 1/2 the posterior is given by where ψ = Σp ϕ(x). This can be decomposed as follows If an algorithm is defined solely in terms of inner products in input space, thenit ( ) can be ‘lifted’ into feature space by replacing the occurrences of those inner prod- 2 2 ′ p(β|X, y, σ ) = N (β|β0, Σ0) N y|Xβ, σ IN = N (β|βN , ΣN ) Where ucts by k(x, x ). This is called the kernel trick. By replacing ⟨x , x ⟩ with K(x , x ), we turn a linear procedure into a nonlinear −1 1 ′ i j i j βN = ΣN (Σ0) β0 + ΣN X y one without adding much computation. σ2 2 2 −1 ′ −1 ΣN = σ (σ Σ0 + X X) Defn 5.24 (Gaussian Process). A Gaussian Process is a collection of random variables, any finite number of which have Fact 5.13 (Bayes - Ridge Equivalence).( ) a joint Gaussian Distribution. | N 2 ∼ N Let likelihood be p(y X, β) = Xβ, σ I and let prior on coefficients β (0, Σp). A GP is completely specified by its mean function m(x) and covariance function ( ) k(x, x′). A GP of a real process f(x) is specified as follows 1 − − p(β|X, y) =N (A) 1 Xy, (A) 1 σ2 m(x) = E [f(x)] −2 ′ −1 ′ E − ′ − ′ where A :=σ XX + Σp k(x, x ) = [f(x m(x))(f(x ) m(x ))] then f(x) ∼ GP(m(x), k(x, x′)) The predictive distribution is ∫ Example 5.14 (Bayesian Linear Regression as a GP). e e e ⊤ ∼ N p(f∗|x∗, X, y) = p(f∗|x∗, β)p(β|X, y)dβ let f(x) = ψ(x) β with prior β (0, Σp). Then, the mean and covariance ( ) functions are 1 ′ −1 ′ −1 ⊤ = N x∗ (A) Xy, x∗ (A) x∗ E [f(x)] = ϕ(x) E [β] = 0 σ2 [ ] ′ ⊤ ′ ′ E [f(x)f(x) ] = ϕ(x)E ββ ϕ(x ) = ϕ(x)Σpϕ(x )

← ToC 61 Square-Exponential covariance / Gaussian Kernel( ) 6 Maximum Likelihood 1 2 { }N | ∗ Cov [f(xp), f(xq)] = k(xp, xq) = exp − |xp − xq| Setup Let Z i=1 be a sequence of iid rv’s with common CDF F (z θ ). We want 2 ∗ to estimate θ ∈ Θ ⊂ Rk. Since the rv’s are IID, we can write ∏N ∗ ∗ f(y|θ ) = f(yi|θ ) i=1 Defn 6.1 (). L: Θ→R ∏N L(θ|y) := f(yi|θ) i=1 ∑ | N | We usually work with the log of this ℓ(θ y) := i=1 log f(yi θ), and drop the con- ditioning on y though strictly speaking f(y, X|θ) = f(y|X|θ)f(X|θ) Defn 6.2 (Maximum Likelihood Estimator). is the estimator that maximises the conditional log-likelihood estimator. ∑ ˆ θMLE := arg max L(θ) = arg max log f(Zi|θ) ∈ ∈ θ Θ θ Θ i=1 solves the first-order conditions 1 ∂L(θ) 1 ∑ ∂ℓ(y |x , θ) = i i = 0 N ∂θ N ∂θ i Example 6.1 (Linear Regression). conditional density: ( ) ∏N − ′ 2 2 1 (yi xiβ) f(yi|xi, |β, σ ) = √ exp − 2 2σ2 i=1 2πσ Log likelihood: n n 1 ℓ(β, σ2) = − log(2π) − log σ2 − (y − Xβ)′(y − Xβ) 2 2 2σ2 N N = − RSS(β) − log(2πσ2) = 2 2

∑ − n uˆ2 2 ˆ ′ 1 ′ 2 i=1 i Maximising this w.r.t. b, s yields β = (X X) X y, σˆ = N . Defn 6.3 (Score Function). S ∇ ∂ℓ(θ) gradient vector (θ) := ℓ(θ) = ∂θ ∂ℓ(θ) S(z, θ) := (z; θ) ∂θ

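A sketch of direct numerical maximum likelihood for the linear regression example above (Example 6.1), maximising $\ell(\beta, \sigma^2)$ with optim and comparing to the closed form; x and y are hypothetical.

negll <- function(theta, y, X) {
  k    <- ncol(X)
  beta <- theta[1:k]
  sig2 <- exp(theta[k + 1])   # parametrise log(sigma^2) to keep it positive
  0.5 * length(y) * log(2 * pi * sig2) + sum((y - X %*% beta)^2) / (2 * sig2)
}
X   <- cbind(1, x)   # design matrix with intercept
fit <- optim(rep(0, ncol(X) + 1), negll, y = y, X = X, method = "BFGS", hessian = TRUE)
fit$par[1:ncol(X)]                         # beta-hat: equals solve(crossprod(X), crossprod(X, y))
sqrt(diag(solve(fit$hessian)))[1:ncol(X)]  # SEs from the inverse observed information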
← ToC 62 ∗ Evaluated at θ , this is the efficient score. Equivalently, ( ) S [ ]− local maximum that solves the FOCs (z, θ) = 0, IOW ∂2ℓ(θ) 1 θˆ ∼a N θ, −E 1 ∑N log f(Z |θ) MLE ∂θ∂θ′ i = 0 N ∂θ i=1 conditions: (1) θ ∈ Θ , (2) ℓ is twice-differentiable Defn 6.4 (Fisher Information / Information Matrix Equality). Property 6.5 (Efficiency). [ ] [ ] Cramer-Rao Bound ∂ℓ(θ) ∂ℓ(θ) ∂2ℓ(θ) Variance of MLE is the ; the asymptotic variance of the MLE is I(θ) := E [S(θ)S(θ)′] = E = −E at least as small as that of any other consistent estimator. ∂θ ∂θ′ ∂θ∂θ′ [ ] Theorem 6.6 (Cramer-Rao Inequality / Bound). ∗ ∗ ∗ ′ ∂ ∗ | ∈ b A = B := V [si(θ )] = E [si(θ )si(θ ) ] = E si(θ ) Let the pdf of the r.v. X be fX (x θ) for some θ0 Θ. Let θ be an unbiased∫ estimator ∂θ [ ] for θ∫0. Suppose the derivative ∂/∂θ can be passed under the integral f(x|θ)dx V θˆ = 1 A−1 and θ(x)f(x|θ)dx and suppose the fisher information Variance estimate n ; [ ] 2 Estimated as I −E ∂ log f | ∑N 2 (θ) = ′ (X θ) 1 ∂ ℓi(θ) ∂θ∂θ I(ˆθ) = − N ∂θ∂θ′ is finite. Then, i=1 θ=θMLE [ ] b −1 Example 6.2 (IM for OLS). V θ(X) ≥ I(θ0) ′ 2 ′ ′ ′ | For OLS, parameter vector θ = (β , σ ) ) = (β , γ) . If X1,X2,...,X[n are] iid with common density fx(x θ), the implied bound on the Scores: b −1 variance is NV θ(X) ≥ I(θ0) ∂ℓ 1 = X′(y − Xβ) ∂β γ Defn 6.5 (Asymptotic Efficiency). Let X1,X2,... be iid random variables with common density fx(x|θ). A sequence ∂ℓ n 1 ′ b = − + (y − Xβ) (y − xβ) of estimates θN , a function X1,X2,...,XN that satisfies ∂γ 2γ 2γ2 | {z } √ ( ) b d −1 :=s N(θml − θ) → N 0, I(θ) ∑n 1 ′ θ ∈ Θ γ = s/n = (y − x β) whatever the true value of is, is said to be asymptotically efficient. n i i i=1 Defn 6.6 (Semiparametric Efficiency Bound). ∼ | Information Matrix / Variance Suppose X1,X2,... are iid with density X f(x θ, h(.)) where h(.) is an unknown ( ) ( ) h() ′ ′ −1 function. Next, we pretend to know the infinite dimensional parameter up 1 − 2 2 X X 0 1 σ (X X) 0 γ I(θ) = σ ′ =⇒ CRLB := (I(θ)) = 4 to a finite dimensional parameter , in which we have a fully parametric finite- 0 n ′ 2σ 2σ4 0 n dimensional parameter θ, thus we can calculate the Cramer-Rao Bound. f(x|θ, γ) = f(x|θ, h(γ)) 6.1 Properties of Maximum Likelihood Estimators Partitioning the information matrix for (θ′, γ′)′ and its inverse in [ ] [ ] θθ′ θγ′ Property 6.3 (Consistency). Iθθ′ Iθγ′ −1 I I I(θ, γ) = and I(θ, γ) = ′ ′ Iγ′θ Iγγ′ Iγθ Iγγ As N→∞, probability of missing the true parameter goes to zero. The Cramer-Rao bound implies that |ˆ − | →p ∀ P ( θn θ > ϵ) 0 ϵ > 0 b θθ′ −1 −1 ASV (θ) ≥ I = (Iθθ′ − Iθγ′ (Iγγ′ ) Iγθ′ ) Property 6.4 (Asymptotic Normality). √ This is true for any parametrisation of the unknown function h(.). The lowest pos- d − N(θˆ − θ) → N (0, I(θ) 1) sible variance for any estimator for θ that does not use knowledge of h(.) has to

← ToC 63 be at least as high as the lowest variance we can get if we know more, that is, the 6.2.1 Robust Standard Errors Cramer-Rao bound for any parametric submodel. So, the semiparametric efficiency Asymptotic distribution of QMLE bound is the largest lower-bound we can get for any parametric submodel. √ ( ) ( ) b ˆ − ⋆ →d N ˆ−1 ˆ ˆ−1 Suppose we have a candidate Estimator θ and a given parametrisation h(x; γ). n θ θ 0, A BA Then, where [ ] [ ] ∑n 2 −1 −1 b ∂Si(θ) 1 ∂ ℓi(θ) (I ′ − I ′ (I ′ ) I ′ ) ≤ Semiparametric Efficiency Bound ≤ ASV (θ) Aˆ = −E H(βˆ) = E | = | θθ θγ γγ γθ ∂θ′ θˆ n ∂θ∂θ′ θˆ For any estimator we can calculate the left hand side, for any parametrization we i=1 ∑n can calculate the right hand side, so if we find an estimator and a parametrization [ ′] 1 ∂ℓ ∂ℓ Bˆ = E S (θ⋆) S (θ⋆) = i × i | that the two are equal we have found the efficiency bound. i i n ∂θ ∂θ′ θˆ i=1 Theorem 6.7 (Equivariance of the MLE). ˆ Defn 6.9 (Kullback-Liebler Distance). Let τ = g(θ) where g is bijective, continuous, and differentiable. Let θn be the MLE let f(y|θ) be the assumped joint density, and let h(y) be the true density. Then, ˆ [ ( )] ∫ ( ) of θ. Then, τˆn = g(θn is the MLE of τ. h(y) ∞ h(t) · || · : E E | KL[h( ) f( )] = h log = h(t) log dt ∂ [y x] f(y|θ) −∞ f(t) Defn 6.7 (Marginal Effect ∂x ). ′ For a regression of the form E [y|x] = g(x β), one can estimate multiple ‘marginal Minimised when ∃θ0 s.t. h(y) = f(y|θ0). QMLE minimises distance between f(y|θ) E | effects’. For the special case where E [y|x] = x′β, ∂ [y x] = β, but this is not gener- and h(y). Notation KL(h, f) denotes ‘information lost when f is used to approxi- ∂x mate h’. ically true. Discrete version illustrates links to Entropy ∑ E | • Average Marginal Effect (AME): := 1 ∂ [yi xi] ∑J ∑J ∑J N i ∂xi pj KL[p||q] := pj log = pj log pj − pj log qj = −H(p) + H(p, q) ∂E[y|x] q | {z } | j=1 j j=1 j=1 • Marginal Effect at Mean (MEM) ∂x x¯ cross entropy || ≥ Defn 6.8 (Factorisation Theorem / Sufficient Statistics). KL(p q) 0 and with equality IFF p = q. Suppose t(x) is a for θ. Then, KL ‘distance’, unlike Euclidian Distance, is not the same between f, g as g, f; i.e. ∏ it is directional. f(x |θ) = g (t(x), θ) h(x) i Defn 6.10 (Akaike Information Criterion (AIC)). i Akaike showed that using K-L model selection entails finding a good estimator for ˆ [ [ ]] θ(x) depends on the data x only through t(x), the sufficient statistic. b Ey,h Ex,h log(f(x|θ(y))) 6.2 QMLE / Misspecification / Information Theory where x, y are independent, random sampels from the same distribution and ex- pectations are taken w.r.t. the true distribution h. Estimating this quantity for If model is misspecified, f (·|x , θ) ≠ p (·|x ) ∀ θ ∈ Θ i 0 i each model fi is biased upwards. An approximately unbiased estimator of the The MLE converges to the best fitting θ for the population (pseudo-true value) above target quantity is ∑N For a general class of maximum-likelihood models, ⋆ 1 θ = arg max plim ℓi(θ) b ∈ N AIC = −2 log L(θ|y) + 2K θ Θ i=1 For the linear , the quasi-MLE is consistent even when the den- For linear regression models, this simplifies to ∑ sity is partially misspecified. n εb2 AIC = n log σb2 + 2K ; σb2 = i=1 n

← ToC 64 Defn 6.11 (Bayesian Information Criterion (BIC)). Defn 6.16 (). ( ) ′ Unrestricted estimates of θ0 should be close to zero. e e k ln(n) ( ) BIC = ln + −1 n n · ˆ′ Iˆ00 ˆ W := N θ0u θ0u Iˆ00 6.3 Computation Where is the top-left of the information matrix (corresponding with the re- W ∼ χ2 stricted parameters). Under the null, K0 . − · ∂ℓ ˆ ′ −1 ˆ General version of update rule: θk+1 = θk λk Ak ∂θ (θk). Step length λ = 1 for alternatively, W = h(θu) Ω h(θu) both N-R and BHHH. where h1(θ) . . . hK0 (θ) are restrictions, ( )′ ( ) Defn 6.12 (Newton Raphson). ∂h(θ) ∂h(θ) − 1 Ω = V [θu] set Ak = (H(θ)) ′ ∂θ ∂θ − f (x0) Update rule: x1 = x0 ′′ . For Log-likelihood, ˆ (f (x0) ) evaluated at θu. 2 −1 ∂ ℓ ∂ℓ − − ≡ − 1 2 θk+1 = θk ′ (θk) (θk) θk (H(θk)) s(θk) Defn 6.17 (Pseudo-R ). ∂θ∂θ ∂θ McFadden’s Pseudo-R2 Defn 6.13 (Berndt-Hall-Hall-Hausman (BHHH)). ℓ(βˆ) 1 ′ 2 − A = (S(θ )S(θ ) ) Rbin := 1 Uses Information-matrix equality. Set k N k k to be outer product of ℓ(¯y) scores ( )−1 1 ∑N ∂ℓ ∂ℓ A := (θ ) (θ ) 6.5 Binary Choice k N ∂θ k ∂θ′ k i=1 Defn 6.18 (Linear Probability Model). Estimate the probability using OLS Pr (y = 1|x) = Xβ. V [y|x] = Xβ(1 − Xβ), so 6.4 Testing heteroskedasticity is mechanically present unless all coefficients are zero. Defn 6.19 (Logit and Logistic Functions). To test the hypothesis H0 : α = 0 against the alternative, there are three classical ( ) tests. ′ ′ ′ p We partition the parameter K-vector θ into two parts (θ , θ ) s.t. the dimensions 0 1 Logit(p) = log − of the two sub-vectors s.t. K0 + K1 = K. θ1 is a nuisance parameter: its value is 1 p not restricted under the null. 1 ex ˆ ˆ ˆ Logistic(x) = = Let θu := (θ0u, θ1u) be the unrestricted MLEs. If we are testing the restriction θ0 = 1 + exp(−x) 1 + ex ˆ ˆ 0, then the restricted parameter vector is θR := (0, θ1r). IOW, test h(θ0) = 0. ′ fits logit(pi) = xiβ, where logit is the link function that scales ′ Defn 6.14 (Likelihood Ratio Test). xiβ onto the probability scale. Alternatively, one can use Φ(0, 1). If null is true, ℓ at restricted model ((0, θ ) should not be much smaller than ℓ at 1r Defn 6.20 (Logit Parametrisation). the unrestricted model ((θ0u, θ1u). ∗ ′ Y ∈ {0, 1} Y = X β + ϵ Y := 1 ∗ Y For i , assume latent index model i i i; i Yi >0. i is × ˆ − ˆ ∏ − LR := 2 (ℓ(θu) ℓ(θR)) L N Yi − 1 Yi bernoulli, so = i=1 πi (1 πi) . LR ∼ χ2 K E | | F ′ − F − ′ Under the null, K0 (where 0 is the number of restrictions being tested). Symmetric CDFs. Let πi = [Yi Xi] = Pr (Yi = 1 Xi) = (Xiβ) = 1 ( Xiβ) Defn 6.15 (Lagrange Multiplier Test / ). • Probit: F(u) = Φ(u) If the limiting ℓ is maximised at θ0 = 0, the derivative of the ℓ wrt θ0 at that point should be close to zero. • Logit: S ˆ ′ I−1 ˆ S ˆ LM := (θR) [ (θR)] (θR) F 1 exp(u) – (u) = Λ(u) = 1+exp(−u) = 1+exp(u) LR ∼ χ2 K Under the null, K0 (where 0 is the number of restrictions being tested).
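A minimal sketch (illustrative, simulated data) of the Newton-Raphson update of Defn 6.12 applied to the logit log-likelihood of Section 6.5: the hand-rolled iterations theta_{k+1} = theta_k - H^{-1} s converge to the same coefficients as glm().

# Hand-rolled Newton-Raphson for a logit; matches glm() at convergence.
set.seed(2)
n <- 500
X <- cbind(1, rnorm(n))
y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1)))
beta <- c(0, 0)
for (k in 1:25) {
  p <- plogis(X %*% beta)
  score <- t(X) %*% (y - p)                     # s(theta)
  H <- -t(X) %*% (X * as.vector(p * (1 - p)))   # Hessian of the log-likelihood
  beta <- beta - solve(H, score)                # Newton step
}
cbind(newton = beta, glm = coef(glm(y ~ X[, 2], family = binomial)))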

← ToC 65 – f(u) = Λ′(u) = (1 − e−u)−2e−u Defn 6.21 (Gumbel Distribution). f (ϵ) = exp (−ϵ) exp (− exp(−ϵ)) − ∞ < ϵ < ϵ ∂p Model πi = Pr (y = 1|x) Marginal Effect: ∂xi F − − ′ and (ϵ) = exp ( exp( ϵ)). ′ exp(x β) ′ − ′ Logit Λ(x β) = 1+exp(x′β) Λ(x β) [1 Λ(x β)]βj ′ ′ Probit Φ(x β) ϕ(x β)βj 6.6.1 Ordered ′ ′ ′ ′ Clog-log C(x β) = 1 − exp(− exp(x β)) exp(− exp(x β)) exp(x β)βj ′ Random utility with multiple cutoffs ϕ . . . ϕ , where ϕ = 0, ϕ = ∞. LPM x β β 1 J 1 J ∑ j 1 ′ ′ Defn 6.22 (Ordered Logit). ℓ(β) = (Y log F(X β) + (1 − Y ) log(1 − F(X β))) ∗ i i i i y n Define i latent variable, and 0 if − ∞(= ψ ) < y∗ ≤ ψ Fact 6.8 (Score and QoIs for binary choice).  0 ∗i 1 ′ ′ ′  ≤ F 1 if ψ1 < yi ψ2 Let fi := f (xiβ); Fi = (xiβ) be the density and CDF evaluated at xiβ. yi = . . ′ − . . fixi[yi Fi] . . ∗ si(θ) = J if ψJ−1 < y ≤ ∞(= ψJ ) Fi(1 − Fi) i Sample Score solves which means ( ) ′ N exp (ψ − x β) ∑ y 1 − y ≤ | (j i ) i i Pr (yi j xi) = ⊤ fixi − fixi = 0 − F 1 − F 1 + exp ψj xi β i=1 i i Variance: Defn 6.23 (Ordered Probit). ( )− ′ 1 Pr (y ≤ j|x ) = Φ (ψ − x β) ⇔ Pr (y = k − 1|x ) = Φ(α − Xβ) − Φ(α − − Xβ) ∑N f 2x x′ i i j i i k k 1 Var[d βˆ] = i i i Fi(1 − Fi) Both specifications yield a likelihood that is simply the product of binary logit/probit i=1 models that switch between adjacent categories for each observation. V [yi|xi] = Fi(1 − Fi) ∑N ∑J Marginal effect: ′ ′ ℓ(β, ψ|Y, X) = 1y =j log (F (ψj − X β) − F (ψj−1X β)) ∂Pr (yi = 1|xi) ′ i i i = f (xiβ) β i=1 j=1 ∂xi Marginal effects are of the form Example 6.9 (Logistic Regression). ( ( ) ( )) ∂Pr (Y = j) ′ ′ ∑ ˆ ˆ − ˆ − ˆ − − ˆ 1 ′ ′ = βj f ψj x¯ β f ψj 1 x¯ β Q(θ) = ℓ(θ) = (y log Λ(x θ) + (1 − y ) log[1 − Λ(x θ)]) ∂xj x¯ n i i i i i Since for the logistic CDF, Λ′(v) = λ(v) = Λ(v)(1 − Λ(v)), the score and hessian 6.6.2 Unordered ∏ 1 can be written as J yj =j Multinomial distribution := p(yi) = j=1 πj − ′ S(θ) = [yi Λ(xiθ)]xi ∑N ∑J − ′ − ′ ′ − ′ ′ | 1 H(θ) = Λ(xiθ)[1 Λ(xiθ)]xixi = λ(xiθ)xixi ℓ(π Y ) = ij log πj i=1 j=1 6.6 Discrete Choice Defn 6.24 (Multinomial Logit). In many Additive Random Utility Models (ARUMs), F (ϵ − ϵ ) is logistic for mul- ′ ′ 1 0 [ exp(xiβ) ] exp(xiβj) πij = Pr (yi = j|xi) = ∑ = ∑ tivariate extensions to logit. This assumes that the errors themselves are distributed J ′ J exp (x′ β ) Gumbel/ type 1 extreme-value distribution 1 + j=1 exp(xiβ) k=2 i k
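Connecting back to Defn 6.7 and the marginal-effects table above, a small sketch (simulated data, illustrative only) computing the Average Marginal Effect and the Marginal Effect at the Mean for a logit by hand, using dlogis() for Lambda'(.).

# AME vs MEM for a simple logit on simulated data.
set.seed(3)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.2 + 0.7 * x))
b <- coef(glm(y ~ x, family = binomial))
xb <- cbind(1, x) %*% b
ame <- mean(dlogis(xb)) * b["x"]                 # average of f(x_i'b) * b
mem <- dlogis(sum(c(1, mean(x)) * b)) * b["x"]   # f(xbar'b) * b
c(AME = ame, MEM = mem)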

← ToC 66 Pr (y = 0|x′ β) = ∑ 1 where we adopt normalisation i 1+ J exp(x′ β) for identifica- Hessian [ j=1 i ] ( )− ∑n 1 tion. ∂s(β) ′ ′ ′ ′ Coefficient interpretation: H(β) = = − exp (x β) x x =⇒ Avar(βˆ) = exp (x β) x x [ ] ∂β i i i i i i i=1 pj(xi, β) ⇔ pj(xi, β) ′ − ∀ ∈ = exp(xβj) log = xi(βj βh) j, h 1,...J ′ E | V | p0(x, β) ph(x, β) Assumes λ := exp(xiβ) = [Y X] = [Y X]. ′ ∂E[y|x] x Marginal Effect: Since E [y|x] = exp x β for poisson, = E [y|x] βj. Parame- which implies that the log-odds ratio is linear in . ∂xj ters can be interpreted as semi elasticities, since Defn 6.25 (Conditional Logit). X ∂E [y|x] 1 ∂ log E [y|x] Permits incorporation of choice-varying predictors ( ij, nests) MNL. β = × = ′ ∂x E [y|x] ∂x exp Xijβ πij = Pr (Yi = j|Xij) = ∑ J ′ Defn 6.29 (Overdispersed poisson). k=2 exp (Xikβ) ′ 2 2 Log likelihood of the form E(Yi|Xi) = λi = exp (X β) ; Var (Yi|Xi) = Vi = σ λi; σ > 1 ( ( )) i ∑n ∑M ∑M Defn 6.30 (Zero Inflated Poisson (ZIP)). 1 ′ − ′ ℓ = ij[xiβh] log exp xiβl define a bernoulli πi = 1w.p.θi for y = 0 observation, and specify separate models i=1 h=1 l=1 for zero and nonzero data, with potentially different covariates on θ and λ. Yields the following (difficult to maximise) likelihood Defn 6.26 (IIA). ( ) ( ) − ¬{ } ∏N yi 1 πi Relative risk πij/πik independent of other choices j, k ; choices are series of pair- exp(−λ )λ − ⊥⊥ ̸ L = θ + (1 − θ ) exp(−λ ) (1 − θ ) i i wise comparisons. pj(xj)/ph(xh) = exp[(xj xh)β]. IoW ϵij ϵik for j = k. i i i i y ! i=1 i Defn 6.27 (Multinomial probit). ∼ Defn 6.31 (Negative ). ϵi iid MVN (0, ΣJ ) ( ) ∫ − ¨ ⊤ ∫ − ¨ ⊤ λ ( ) X β X β yi 1j Jj Γ σ2−1 + yi σ2 − 1 ( ) −λ ··· ··· ( ) 2 σ2−1 πij = ϕ (¨ϵ1j,..., ϵ¨Jj) dϵ¨1j dϵ¨Jj p (yi) = σ −∞ −∞ λ σ2 yi!Γ σ2−1 X¨ = X − X ;ϵ ¨ = ϵ − ϵ where kl ik il kl ik il 2 E [Yi] = λ; V [Y ] = λσ . 6.7 Counts and Rates Fact 6.10 (NB2 Likelihood). ′ 2−p 6.7.1 Counts Let µi = exp(xiβ), ri = α/(α + µi), qi = αµi Defn 6.28 (). ∑n y f(y|λ) = λ exp(−λ)/y! ℓ = log Γ(yi + qi) − log Γ(qi)− ′ Poisson specification: λ = exp(xiβ). Yields log density i=1 | − ′ ′ − log Γ(yi + 1) + qi + log ri + yi log(1 − ri) log f(y x, β) = exp(xiβ) + yi exp(xiβ) log y! Score: 6.7.2 Rates s (θ) = −exp(x′ β)x′ y x′ = x′ (y − exp(x′ β)) i i i i i i i i • Survival: S(y) := 1 − F(y) solves ∑ ′ ′ f(y) f(y) − • Hazard: λ(y) = h(y) := − = ; xi (yi exp(xiβ)) = 0 1 F (y) S(y) ∫ y − • Cumulative Hazard Λ(y) := 0 λ(s)ds = log S(y).
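Returning to the count models above, a minimal sketch (simulated data; assumes MASS for glm.nb) that fits a Poisson to overdispersed counts, checks the Pearson dispersion statistic, and refits a negative binomial.

# Poisson vs negative binomial on overdispersed simulated counts; assumes MASS.
library(MASS)
set.seed(4)
n <- 1000
x <- rnorm(n)
y <- rnbinom(n, mu = exp(1 + 0.5 * x), size = 1.5)
pois <- glm(y ~ x, family = poisson)
sum(residuals(pois, type = "pearson")^2) / df.residual(pois)  # >> 1 signals overdispersion
nb <- glm.nb(y ~ x)
cbind(poisson = coef(pois), negbin = coef(nb))                # similar means, different SEs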

← ToC 67 E | V | 2{ − − } Defn 6.32 (Kaplan-Meier). [y y > c] = µ0 + σ0λ(v) , [y y > c] = σ0 1 λ(v)[λ(v) v] ϕ(v) J ( ) J − ∏ ∏ − where v = (c µ0)/σ0 and λ(v) = 1−Φ(v) is the inverse Mills ratio / Hazard b − ˆ rk dk S (tj) = 1 λ (tk) = function. rk k=j k=j Proportional Hazard Models 6.8.1 Truncated Regression Conditional hazard rate λ(t|x) can be factored as ( ) Consider a model y = x′ β + ϵ ; ϵ |x ∼ N 0, σ2 and y is not observed if y > c. | i i i i i i i λ(t x, β) = |λ0{z(t}) |ϕ(x{z, β}) Yields log-likelihood as follows baseline hazard exp(x′β) ( ( ) ) ( ( )) ′ 2 ′ − − − = 1 αyα 1 2 1 1 2 1 yi xiβ c xiβ baseline hazard ( for exponential and for weibull). ℓ(yi|xi; β, σ ) = − log(2π) − log(σ ) − −log 1 − Φ Parametric Model Hazard Survival 2 2 2 σ σ Exponential γ exp(−γt) Weibull γαtα−1 exp(−γtα α−1 − α 1/µ 6.8.2 Tobit Regression Generalised Weibull γαt [1 µγt ] { γ exp(αt) exp(−(γ/α)(eαt−1)) ( ) ∗ ∗ Gompertz ∗ ′ ∼ N 2 yi if yi > c Censored Yi s.t. yi = β xi + ϵi ϵi 0, σ and yi = ∗ ≤ [i.e. y is For survival models with censoring, Likelihood is often written as c{ if yi c ∏ ∗ ≥ − 0 if yi c L | di | 1 di censored from below at c]. Define censoring indicator Di = ∗ . (θ) = f(ti θ) S(ti θ) 1 if yi < c i f(y|x) = f ∗(y|x)dF ∗(c|x)1−d Yields likelihood [ ( )] . log-likelihood( ( )) d t ′ − ′ where i is a right-censoring indicator and i is the observed time. 2 xiβ 1 yi xiβ ℓ(yi|xi, β, σ ) = Di log 1 − Φ + (1 − Di) log ϕ Example 6.11 (Weibull Example). σ σ σ Weibull Density: f(y) = γαyα−1 exp (−γyα) , y, α, γ > 0. E [y] = γ−1/αΓ(α−1 + 1). Specify γ = exp(x′β), so E [y|x] = exp(−x′β/α)Γ(α−1 + 1). Then, the log- 6.9 Generalised Linear Models Theory likelihood is Semi-robust likelihoods belong to the Linear exponential Family of the following 1 ∑ ℓ(θ) = {x′ β + log α + (α − 1) log y − exp(x′ β)yα} form: N i i i i i Defn 6.33 (Random Component). y Y FOCs are ∑ Response observations i are realisations of random variables i with densities of −1 { − ′ α} the form N 1 exp(xiβ)yi xi = 0 ( ) yθ − b(θ) i f(y|θ, ϕ) = exp + c(y, ϕ) −1{ −1 − ′ α } a(ϕ) N α + log yi exp (xiβ) yi log yi = 0 θ ⊂ R is called the canonical / natural parameter, ϕ ⊂ R+ is the dispersion parameter. E [Y |θ, ϕ] = b′(θ), V [Y |θ, ϕ] = a(ϕ)b′′(θ) Model needs to be correctly specified to be consistent. Unlike OLS or poisson. { } y µ − 1 µ2 y2 1 ( ) f (y ) = exp i i 2 i − i − log 2πσ2 6.8 Truncation and Censored Regressions i σ2 2σ2 2 Fact 6.12 (Truncated Distribution Density). Defn 6.34 (Systematic Component). ∼ ′ If a c.r.v y f(y) and is truncated at c, Linear predictor ηi := Xiβ specifying the variation in Y accounted for by known covariates. | f(y) f(y) f(y y > c) = = − F Pr (y > c) 1 ( (c) ) Defn 6.35 (Link Function). ∼ N 2 g is a transformation of the mean that addresses scaling. It is so called because it For the truncated normal distribution where y µ0, σ0 is truncated at c,

← ToC 68 ′ links the expected value of the response variable E [Y |θ, ϕ] = µi = b (θi) to the 1. conditional on a level of complexity, choose best in-sample loss-minimising explanatory covariates. function ′ ⇒ −1 ′ ∑n g(µi) = ηi = Xiβ = µi = g (Xiβ) min L(f(xi), yi) over f ∈ F subject to R(f) ≤ c Since µ = b′(θ ), under a canonical link ( g(µ ) = θ (µ )), θ = X′β. | {z } | {z } i i i i i i i i=1 | {z } function class complexity restriction 6.9.1 ML estimation in-sample loss log likelihood 2. Estimate the ‘optimal’ level of complexity using empirical tuning ( ) ∑N y θ − b(θ ) L(θ, ϕ|y) = i i i + c(y , ϕ) Fact 7.1 (Discriminative vs Generative ML). a (ϕ) i i=1 i Discriminative Model Goal Directly estimate E [y|x] Estimate Pr (x|y) to deduce Pr (y|x) Score function What is Learned Decision Boundary Probability Distribution of the data Examples Regressions, SVM GDA, Naive Bayes ∑N ∂ℓ ∑N ∂ℓ ∂θ ∂µ ∂η S(β, y) = i = i i i i β ∂θ ∂µ ∂η ∂β Defn 7.1 (Loss Functions). i=1 j i=1 i i i j L :(z, y) ∈ R × Y 7→ L(z, y) ∈ R that takes as inputs predicted value z and real −1 yi − µi 1 ∂g (ηi) data value y and outputs how different they are. = xij ai(ϕ) V [µi] ∂ηi • Least Squares: 1 (y − z)2 The FoC can be written as 2 − ∂ log L(θ, ϕ|y) − • Logistic: log(1 + exp( yz)) = X′ (W) 1 [y − yb] = 0 ∂β • Hinge: max(0, 1 − yz) where W is a weight matrix (which depends on β). The fitted value − ′ • Cross Entropy: −[y log z + (1 − y) log(1 − z)] yb = m(x) = E [y|X = x] = g 1(x β) By a first-order taylor expansion, define z = g(yb) + (y − yb)∇g(yb) This gives us an update rule ( )−1 b −1 −1 βk+1 = X (Wk) X X (Wk) zk b repeat until convergence β∞. Model Density Link OLS Gaussian Identity Logistic Binomial Logistic Logistic Binomial Normal Poisson Poisson Log
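A small sketch of the IRLS update beta_{k+1} = (X'WX)^{-1} X'Wz above for a Poisson GLM with the canonical log link, where W = diag(mu) and z = eta + (y - mu)/mu; illustrative code on simulated data that matches glm() at convergence.

# Hand-rolled IRLS for a Poisson GLM with log link.
set.seed(6)
n <- 500
X <- cbind(1, runif(n))
y <- rpois(n, exp(X %*% c(0.3, 1.2)))
beta <- c(0, 0)
for (k in 1:25) {
  eta <- X %*% beta
  mu  <- exp(eta)
  z   <- eta + (y - mu) / mu                       # working response
  W   <- as.vector(mu)                             # weights under the log link
  beta <- solve(t(X) %*% (X * W), t(X) %*% (W * z))
}
cbind(irls = beta, glm = coef(glm(y ~ X[, 2], family = poisson)))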

7 Machine Learning

7.1 General Setup

Every ML algorithm essentially involves a function class F and a regulariser R(f) that expresses the complexity of the representation. Estimation then proceeds in two steps: (1) conditional on a level of complexity, choose the best in-sample loss-minimising function in F; (2) estimate the 'optimal' level of complexity using empirical tuning.

Table 1: Handy Dandy reference from Mullainathan and Spiess (2017, table 2)

Function class F and Regulariser / Tuning Parameters R(f)

Global / Parametric Predictors
• Linear and generalisations (β′x):
  – Subset selection: ∥β∥0 = Σ_{j=1}^k 1{βj ≠ 0}
  – LASSO: ∥β∥1 = Σ_{j=1}^k |βj|
  – Ridge: ∥β∥2² = Σ_{j=1}^k βj²
  – Elastic Net: α∥β∥1 + (1 − α)∥β∥2²

Local / Nonparametric Predictors
• Decision / Regression trees: depth, number of nodes/leaves, minimal leaf size, information gain at splits
• Random forest: number of trees, number of variables used in each tree, size of bootstrap sample, complexity of individual trees (above)
• Nearest Neighbours: number of neighbours
• Kernel Regression: kernel bandwidth

Mixed Predictors
• Neural Networks (including Deep, Convolutional): number of layers, number of neurons per layer, connectivity between neurons
• Splines: number of knots, order

Combining Predictors
• Bagging (unweighted average of predictors from bootstrap draws): number of draws, size of bootstrap samples, individual tuning parameters
• Boosting (linear combination of predictions of residuals): learning rate, number of iterations, individual tuning parameters
• Ensemble (weighted combination of different predictions): ensemble weights, individual tuning parameters

← ToC 70 Defn 7.2 (Curse of Dimensionality). take a unit hypercube in dimension p and we put another hypercube within it that ∑N ∑J − ′ ∥ ∥ captures a fraction r of observations within the cube. Each edge will be ep(r) = J(β, X, y) = (yi xiβ) + λ βj 1 1/p i=1 j=1 r . For moderately high dimensions p = 10, e10(0.01) = 0.63; e10(0.1) = 0.8. Need 80% data to cover 10% of sample. fit using sequential coordinate descent.( Coefficient ) vector is soft-thresholded: Define d(p,N) as distance from the origin to the closest point. n = 500, p = 10 =⇒ βlasso = (βˆ ) βˆ − λ, 0 d = 0.52 [closest point closer to the boundary than to the origin]. j sgn j max j ( ) ( ) 1/p where X is a standardized design matrix [s.t. all Xs have unit variance], and || is 1 1/N d(p, N) = 1 − the l1 norm. 2 In both cases, pick tuning parameter λ using cross-validation. 7.2 Supervised Learning Defn 7.5 (Penalised Maximum Likelihood Regression). ML analogue to LASSO. Define ( ) 7.2.1 Regularised Regression b − | P θ = argmin ℓ(θ Y , X) + λ θ 1 In general, we want to impose a penalty for model complexity in order to minimise θ∈Θ MSE [trade off some bias for lower variance]. where | P | ∑θ Defn 7.3 (Ridge Regression). θP = θP Estimate the following regression 1 k k ∑N ∑J − ′ 2 f(β, X, y) = (yi Xiβ) + λ β Defn 7.6 (Elastic Net (Zou and Hastie 2005)). i=1 j=1 Combines ridge regression and the lasso by adding a ℓ2 penalty to the LASSO’s objective function − βˆ ˆRidge ′ 1 ′ ≡ ˆRidge λ β = (X X + λIk) X y βj = 2 2 2 1 + λ J(β) = ∥y − Xβ∥ + λ1 ∥β∥ + ∥β∥ 2 1 2 2 where X is a standardized design matrix [s.t. all Xs have unit variance]. Let X = USV′ be the SVD of X. Then, ridge coefficients can also be written as Defn 7.7 (Principal Components LASSO (Tay, Friedman, Tibshirani 2018)). − ′ ℓ βˆRidge = V(S2λI) 1SU y. Generalise the 2 penalty to a class of penalty functions of the form The solution vector is ‘biased’ towards the leading right singular vectors of X, βT VZVT β which gives it the property of a ‘smoothed’ Principal Components regression. where Z is a diagonal matrix whose diagonal elements are functions of the squared Fact 7.2 (Degrees of Freedom for Ridge (and other semi-parametric meth- singular values. ods)). Define the following objective function For Ridge regression, ∑ 1 ∥ − ∥2 ∥ ∥ θ T λj J(β) = y Xβ + λ β + VDd2−d2 V β dof(λ) = 2 2 1 2 1 j λj + λ 2 j Where D 2− 2 is a m × m diagonal matrix with diagonal entries equalling d − d1 dj 1 2 2 − 2 Where λjs are the eigenvalues of the Covariance Matrix. d1, d1 d2 ... . This penalty term gives no weight to the component of β that c c More generally, for any smoother matrix W, df(ˆµ) = tr(W), which may not be an aligns with the first right singular vector of X (i.e. the first principal component. integer for semi/non-parametric smoothers. In the special parametric case of OLS, This gives it better predictive accuracy in some settings. c ′ −1 ′ Comparing principal-coordinate predictions of ridge and pcLASSO: W = X (X X) X , so the DoF is simply k. Defn 7.4 (Lasso Regression). Consider the objective function

← ToC 71 Let y ∈ {−1, 1}. A can then be written as h(x) = sgn(H(x)) where m 2 ∑ d b dj T ∑ XβRidge = ujuj y d2 + θ H(x) = a0 + aixi j=1 j i=1 ∑m 2 b dj T ∃ ≥ ∀ Xβ = u u y Suppose a hyperplane H(x)∑s.t. YiH(xi) 1 i. pcL d2 + θ(d2 − d2) j j ˆ N j=1 j 1 j The hyperplane H(x) =a ˆ0 + i=1 aˆ∑ixi that separates the data and maximises the d 2 ≥ ‘margin’ is given by minimising 1/2 j=1 aj subject to YiH(xi) 1. The latter corresponds to a more aggressive form of shrinkage towards the leading Defn 7.10 (Boosting and Sequential Learning). singular vectors. Typically, the function space M is large and complex, so a natural idea is to learn iteratively. Loosely, estimate a model m1 for y from X, which produces error ε1. 7.2.2 Classification Next, estimate m2 for ε1 from X, which produces ε2, and so on. So, after k steps, ∈ Y {− } k Training sample (xi, yi) where y := 1, +1 (can relabel to Bernoulli). A m (·) = m1(·) + m2(·) + . . . mk(·) predictor m : X →Y, where the labels are produced by an (unknown) classifier f. | {z } | {z } | {z } ∼ y ∼ ε ∼ ε − Let P be an (unknown) distribution on X . The error of m w.r.t. f is defined by 1 k 1 where the first error is y − m(x) and so on, and can also be seen as the gradient as- RP,f (m) = P [m(X) ≠ f(X)] = P [{x ∈ X : m(x ≠ f(x))}] whereX ∼ P sociated with the quadratic loss function, ε = ∇ℓ. So, an equivalent representation is The empirical risk is defined as      ∑n ∑n  b 1 −  −  1 ̸ (k) (k 1) (k 1) R(m) = m(xi)=yi m = m + argmin ℓ yi − m (xi), h(xi) n ∈H  | {z }  i=1 h i=1 εk,i RP (m) = 0) A perfect classifier (in the sense that ,f does not exist, so we aim for H RP (m) ≤ ε 1−δ where is a space of ‘weak learners’ (typically step functions). To ensure ‘slow’ Probably Approximately Correct (PAC) learners that have ,f w.p. . − ∈ The space of models m is restricted to be in finite set M. It can be shown that learning, one typically applies a shrinkage parameter ε1 = y αm1(x1) α (0, 1). −1 −1 ∗ ∀ε, δ, P, f, if n ≥ ε log[(δ) |M]|, then RP,f (m ) ≤ ε w.p. ≥ 1 − δ where Arthur Charpantier’s series on the probabilistic foundations of econometrics and [ ] ∑n machine learning cover this and more and have an excellent bibliography ∗ ∈ 1 1 m argmin m(xi)=yi ∈M n • Econometrics m i=1 – https://freakonometrics.hypotheses.org/57649 Defn 7.8 (Vapnik-Chervonenkis Dimension). – https://freakonometrics.hypotheses.org/57674 loosely represents the expressive capacity of a set of functions. Consider k points {x1, . . . , xk} and the set – https://freakonometrics.hypotheses.org/57693 k – https://freakonometrics.hypotheses.org/57703 Ek = {m(x1), . . . , m(xk): for m ∈ M} ≡ {−1, +1} k • ML we say that m shatters all the points if |Ek| = 2 , i.e. all combinations are possible. Linear functions can shatter 2 points. – https://freakonometrics.hypotheses.org/57705 M The VC dimension of is – https://freakonometrics.hypotheses.org/57745 M { M { }} VC( ) := sup k s.t. shatters x1, . . . , xk – https://freakonometrics.hypotheses.org/57782 Defn 7.9 (Support Vector Machines). – https://freakonometrics.hypotheses.org/57790 – https://freakonometrics.hypotheses.org/57813 • Bibliography https://freakonometrics.hypotheses.org/57737
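A minimal sketch of ridge and LASSO with the tuning parameter lambda chosen by cross-validation, as recommended above (illustrative simulated data; assumes the glmnet package, where alpha = 1 is the LASSO and alpha = 0 is ridge): the LASSO coefficients are soft-thresholded to exact zeros while ridge only shrinks.

# Ridge vs LASSO with cross-validated lambda; assumes glmnet.
library(glmnet)
set.seed(7)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(2, 5), rep(0, p - 5))                     # sparse truth
y <- X %*% beta + rnorm(n)
cv_lasso <- cv.glmnet(X, y, alpha = 1)
cv_ridge <- cv.glmnet(X, y, alpha = 0)
head(as.vector(coef(cv_lasso, s = "lambda.min")), 6)    # many exact zeros
head(as.vector(coef(cv_ridge, s = "lambda.min")), 6)    # shrunk, none exactly zero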

7.2.3 Goodness of Fit for Classification

Defn 7.11 (Calibration and Discrimination).
• Calibration: bin the predicted probabilities ŷ into bins {g_k}, and within each bin compute Ŷ_{g_k} (the average predicted probability) and Ȳ_{g_k} (the average observed outcome). Plot the two averages against each other. In a well-calibrated model, the binned averages trace the identity line.
• Discrimination: a measure of whether Y = 1 observations have high Ŷ, and correspondingly Y = 0 observations have low Ŷ. Many measures; listed below.

Defn 7.12 (Confusion Matrix).
                              Observed Y = 1          Observed Y = 0
Predicted positive (Ŷ > c)    True Positive (TP)      False Positive (FP)
Predicted negative (Ŷ < c)    False Negative (FN)     True Negative (TN)
                              Total Positive (P)      Total Negative (N)

• Accuracy = (TP + TN)/(P + N) - overall performance
• Precision = TP/(TP + FP) - how accurate positive predictions are
• Sensitivity = Recall = True Positive Rate = TP/P - coverage of the actual positive sample
• Specificity = True Negative Rate = TN/N - coverage of the actual negative sample
• Brier Score = (1/N) Σ_i (Ŷ_i − Y_i)² = (1/N) Σ_{k=1}^K (Ŷ_k − Ȳ_k)²  [Calibration]  +  (1/N) Σ_{k=1}^K n_k Ȳ_k(1 − Ȳ_k)  [Refinement]
• F1 Score = 2TP/(2TP + FP + FN): hybrid metric for unbalanced classes

Fact 7.3 (Receiver Operating Curve (ROC) / Area Under the Curve (AUC)).
The ROC is the plot of TPR vs FPR obtained by varying the threshold c; the AUC is the area under this curve.

Figure 8: AUC

Defn 7.13 (Random Forests).
Suppose we have a training set {(X_i, Y_i, D_i)}_{i=1}^N, a test point x, and a tree predictor μ̂(x) = T(x; {(X_i, Y_i, D_i)}_{i=1}^N). Equivalently,
μ̂(x) = Σ_{i=1}^n α_i(x) Y_i   where   α_i(x) = 1{x_i ∈ L(x)} / |{i : x_i ∈ L(x)}|
where X is partitioned into leaves L(x), and leaves are constructed to maximise heterogeneity between nodes; do this until all leaves have 2× the minimum leaf size observations. Regression trees overfit, so we need to use cross-validation + other tricks.
Random forests build and average many different trees T* by
• Bagging / subsampling the training set (Breiman)
• Selecting the splitting variable at each step from m out of p randomly drawn features (Amit and Geman)
τ̂(x) = (1/B) Σ_{b=1}^B T*(x; {(X_i, Y_i, D_i)}_{i=1}^N)

Defn 7.14 (Neural Network).
Generalised nonparametric regression with many 'layers', with components outlined in Figure 10. For the ith layer of the network and the jth hidden unit, we have
z_j^{[i]} = w_j^{[i]T} x + b_j^{[i]}
where w, b, z are the weight (coefficient), bias (intercept), and output respectively.
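A base-R sketch (simulated data, illustrative only) of the confusion-matrix metrics above and of the ROC/AUC obtained by sweeping the threshold c, as in Fact 7.3.

# Confusion-matrix metrics at a 0.5 cutoff, plus a trapezoidal AUC from a threshold sweep.
set.seed(9)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x))
p_hat <- fitted(glm(y ~ x, family = binomial))
pred <- as.numeric(p_hat > 0.5)
TP <- sum(pred == 1 & y == 1); FP <- sum(pred == 1 & y == 0)
FN <- sum(pred == 0 & y == 1); TN <- sum(pred == 0 & y == 0)
c(accuracy = (TP + TN) / n, precision = TP / (TP + FP),
  recall = TP / (TP + FN), specificity = TN / (TN + FP))
cuts <- seq(1, 0, by = -0.01)
tpr <- sapply(cuts, function(thr) mean(p_hat[y == 1] > thr))
fpr <- sapply(cuts, function(thr) mean(p_hat[y == 0] > thr))
sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)   # AUC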

Figure 9: Wikipedia table for confusion matrix

Figure 10: Neural Network Components
Figure 11: Activation Functions

Defn 7.15 (Activation Function).
Activation functions are used at the end of a hidden layer to introduce non-linearities into the model; common ones are shown in Figure 11. Neural networks frequently use the cross-entropy loss function.

Fact 7.4 (Fitting Neural Networks).
The learning rate is denoted by η: the pace at which the weights get updated. It can be fixed or adaptively changed using ADAM. Back-propagation is a method to update the weights in the neural net by taking into account the actual output and the desired output. The derivative with respect to a weight w is computed using the chain rule and is of the form
∂L(z, y)/∂w = [∂L(z, y)/∂a] · [∂a/∂z] · [∂z/∂w]
so the weight is updated as w ← w − η ∂L(z, y)/∂w.
1. Take a batch of training data
2. Perform forward propagation to compute the corresponding loss
3. Perform back-propagation to compute gradients
4. Use the gradients to update the weights over the network

7.3 Unsupervised Learning

Defn 7.16 (Principal Components Analysis).
Original data x_i in R^k. We approximate orthogonal unit vectors w_l ∈ R^k and associated scores [L ≤ k weights z_il] to minimise the reconstruction error
J(X, θ) = (1/n) Σ_{i=1}^n ∥x_i − x̂_i∥² = (1/n) Σ_{i=1}^n ∥ x_i − Σ_{l=1}^L z_il w_l ∥²
where x̂_i = W z_i, subject to the constraint that the smoother matrix W is orthonormal. Equivalently, the objective function can be written as
J(W, Z) = ∥X − W Zᵀ∥_Fr
where Z is N × L with z_i in its rows. The optimal solution sets each w_l to be the l-th eigenvector of the empirical covariance matrix. Equivalently, Ŵ = V_L, which contains the L eigenvectors with the largest eigenvalues of the empirical covariance matrix Σ̂ = (1/n) Σ_{i=1}^n x_i x_i'.

Defn 7.17 (Truncated SVD).
If we rank the singular values of the data matrix X, we can construct a rank-L approximation, the truncated SVD
X ≈ U_{:,1:L} S_{1:L,1:L} V'_{:,1:L}
This is identical to the optimal reconstruction X̂ = Z Ŵ'.

Fact 7.5 (Dimension selection for PCA).
error(L) = Σ_{l=L+1}^J λ_l
The error is the sum of the remaining eigenvalues of the covariance matrix. Total variance explained = (sum of included eigenvalues)/(sum of all eigenvalues).
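A small sketch of PCA via the SVD (simulated data, illustrative only): the truncated SVD gives the rank-L reconstruction, and the share of variance explained is the ratio of retained to total eigenvalues, as in Fact 7.5.

# PCA via SVD on centred data; rank-L reconstruction and variance explained.
set.seed(10)
n <- 200; k <- 6
X <- scale(matrix(rnorm(n * k), n, k) %*% matrix(runif(k * k), k, k), scale = FALSE)
s <- svd(X)
L <- 2
X_hat <- s$u[, 1:L] %*% diag(s$d[1:L]) %*% t(s$v[, 1:L])   # truncated SVD
eig <- s$d^2 / (n - 1)                                      # eigenvalues of cov(X)
sum(eig[1:L]) / sum(eig)                                    # share of variance explained
mean((X - X_hat)^2)                                         # reconstruction error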

← ToC 75 7.4 Double Machine Learning for ATEs and CATEs which yields the Orthogonalised Moment Condition E [ψ(Zi, τ0, η0)] = 0 for some real-valued condition ψ(.). Notes based on papers and https://github.com/jlhourENSAE/hdmetrics. Using a Neyman-orthogonal score eliminates first-order biases arising from the replacement of η0 with ηb0. 7.4.1 Average Treatment Effects Defn 7.21 (Orthogonal Scores). Defn 7.18 (Double-Selection (Belloni, Chernozhukov, and Hansen, 2014)). Reference: Bach et al. (2021) ′ E ⊥ Let Yi = Diτ + Xiβ0 + εi with [εi] = 0; εi (Di,Xi). β0 is a nuisance parameter, Consider data W := (Y,D,X) with D ∈ {0, 1} ∈ Rp and we allow Xi ; p > n. Assume further that the treatment is generated Partial linear setup Y = Dθ + g (X) + U; D = m (X) + V . ⊥ ⊥ 0 0 0 byDi = Xiδ0 + ξi, where Xi ξi, ξi εi. The DML method is Score function is W − − − − bL b ψ( ; θ, η) = (Y l(X) θ(D m(X)))(D m(X)) • {(Treatment Selection)} Regress D on X using a Lasso, obtain δ . Define SD := |{z} | {z } b E [Y |X] E [D|X] j = 1, . . . , p, δL ≠ 0 the set of selected variables Partially Linear IV bL b • (Outcome{ Selection) Regress} Y on X using a Lasso, obtain β . Define SY := Y − Dθ0 = g0(X) + ζ, E(ζ | Z,X) = 0 bL ̸ j = 1, . . . , p : βj = 0 Z = m0(X) + V, E(V | X) = 0

Score is b b ∪ b • (Estimation) Regress Y on D and s := SD SY elements in X which corre- W : − − − − ψ( , θ, η) = (Y |l({zX}) θ(D |r({zX}) ))(Z |m{z(X})) ∈ b ∪ b spond with the indices j SD SY using OLS E [Y |X] E [D|X] E [Z|X] Interactive Regression Defn 7.19 (Double-ML in a partially-linear model (general case for other ML E | tools)(Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, and Newey, 2017)). Y = |g0(D,X{z }) +ε, [ε D,X] = 0 Let Y = Dθ0 + g0(X) + U; E [U|X,D] = 0, where D is a treatment indicator and E [Y |D,X] θ0 is the target parameter of interest, and X is a set of confounders / predictors in D = m (X) +ξ, E [ξ|X] = 0 the sense that D = m0(X) + V ; E [V |X] = 0. m0 ≠ 0 generically in observational | 0{z } studies. P [D = 1|X] Double ML - FWL version \ b \ 1. Predict Y and D using X to compute E [Y |X] =: l0(X), E [D|X] =: mc0(X) Here, the estimands are obtained using RF or other ‘best-performing’ tools AT E E − θ0 = [g0(1,X) g0(0,X)] e \ e \ AT T E − | 2. Residualise Y = Y − E [Y |X] and D = D − E [D|X] θ0 = [g0(1,X) g0(0,X) D = 1] e e e 3. Regress Y on D to get θ0. FWL-style approach. The score function for ATE (Hahn (1998)) Defn 7.20 (Double-ML - General case with moment conditions). D(Y − g(1,X)) (1 − D)(Y − g(0,X)) E AT E − − − Let the target parameter τ0 solve the equation [m(Zi, τ0, β0)] = 0 for known score ψ (Zi, θ, η) = (g(1,X) g(0,X))+ − θ { }n m(X) 1 m(X) function m, vector of observables Zi := Xi,Di,Yi i=1, and nuisance parameter η = (g , m ) β0. In fully parametric models, m is simply the score function [derivative of the The nuisance parameter true( value is 0 0 0 .) For ATET, − − ′ log-likelihood]. For ATE, m(Zi, τ, β) := (Yi Diτ Xiβ)Di. 1 m(X) D ψAT T (Z , θ, η) = D − (1 − D) (Y − g(0,X)) − θ In naive double-ML settings, E [∂βm(Zi, τ0, β0) = π0µ1 ≠ 0]. So, we replace m with i π0 1 − m(X) π0 the Neyman-orthogonal score ψ s.t.
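A minimal sketch of the ATE score ψ^ATE above combined with 2-fold cross-fitting (illustrative simulated data). For brevity the nuisance functions g(d, X) and m(X) are fit with glm()/lm() rather than ML learners; in practice substitute lasso/forests, or use the DoubleML package mentioned below.

# 2-fold cross-fitting with the AIPW / doubly-robust score; glm/lm stand in for ML learners.
set.seed(11)
n <- 2000
X <- rnorm(n)
D <- rbinom(n, 1, plogis(0.8 * X))
Y <- 1 + 2 * D + X + rnorm(n)
folds <- sample(rep(1:2, length.out = n))
psi <- numeric(n)
for (k in 1:2) {
  tr <- folds != k; te <- folds == k
  m_hat  <- predict(glm(D ~ X, family = binomial, subset = tr),
                    newdata = data.frame(X = X), type = "response")[te]
  g1_hat <- predict(lm(Y ~ X, subset = tr & D == 1), newdata = data.frame(X = X))[te]
  g0_hat <- predict(lm(Y ~ X, subset = tr & D == 0), newdata = data.frame(X = X))[te]
  psi[te] <- g1_hat - g0_hat +
    D[te] * (Y[te] - g1_hat) / m_hat -
    (1 - D[te]) * (Y[te] - g0_hat) / (1 - m_hat)
}
c(ATE = mean(psi), SE = sd(psi) / sqrt(n))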

E [∂ηψ(Zi, τ0, η0) = 0] . Defn 7.22 (Cross-fitting Double-ML). 1. Take a K-fold random partition (Ik)k=1,...,K

← ToC 76 of observation indices {1, . . . , n} s.t. each fold Ik has size n/k. For each k, de- 2. conditional mean regression use the fact that under SOO C fine I := {1, . . . , n} Ik as the complement / auxilliary sample. k τ(x) = E [Y1|X = x] − E [Y0|X = x] = µ1(x) − µ0(x)

2. For each k ∈ {1,...,K}, construct a ML estimator of η0 using only the aux- C b (1) typically inefficient because of pscore in denominator, so most focus ison(2). illiary sample I ; ηk =η ˆ((Zi)i∈IC ) k k Random forests are a flexible method that is widely liked. ∈ { } 3. For each k 1,...,K , using the main sample Ik, construct the estimator Defn 7.23 (Robinson Semiparametric Setup (Robinson, 1988)). τˇk as the solution of Consider a model for τ(x) where 1 ∑ ψ(Zi, τˇk, ηˆk) = 0 Yi(d) = f(Xi) + d · τ(Xi) + ε(d), P [di|Xi] = e(x) n/K ∈ i Ik where τ(x) = ψ(x)β for some pre-determined set of basis functions ψ : X →Rk. ∑ K We allow for non-parametric relationships between X , y , d , but the treatment 4. Aggregate the estimators τˇ on each main sample τˇ = 1 τˇ i i i k K k=1 k effect function itself is parametrised by β ∈ Rk. Robinson (1988) showed that under Example 7.6 (Sample-Splitting for Treatment Effects). unconfoundedness, we can rewrite the semiparametric setup above as Simple implementation of Cross-fitting for Treatment effects Yi − m(Xi) = (di − e(Xi))ψ(Xi) · β + εi where

1. Partition the data in two, such that each fold I1,I2 has size n/2. m(x) = E [Yi|Xi = x] = f(Xi) + e(Xi)τ(Xi) I g(0,X) m(X) 2. Using only sample 1, construct a ML estimator of and ,e.g. a e ∗ e∗ c The oracle algorithm for estimating β is (1) define Y = Yi − m(Xi) and Z = feedforward nnet of Yi on Xi, denoted as gI1 (x), and logit-lasso of Di on Xi, i i d (di − e(Xi)ψ(Xi), then estimate residuals-on-residual regression. This procedure denoted by mI1 (x). √ is n-consistent and asymptotically normal. 3. Use the estimators on the hold-out sample I2 to compute the T.E Use cross-fitting to emulate the Oracle. [ ] b 1 mI1 (Xi) ∼ ∼ b b τˇ = ∑ D − (1 − D ) (Y − gb (X )) 1. Run non-parametric regressions Y X and D X to get m(x), e(x) I2 i − b i i I1 i i∈I Di 1 mI1 (Xi) 2 e −k(i) e −k(i) 2. define transformed features Yi = Yi−mb (Xi), Z = (Di−eb (Xi)ψ(Xi) 4. Repeat (2,3) swapping the roles of I and I to get τˇ 1 2 I1 b e e 3. Estimate ζb by regressing Yi ∼ Zi 5. Aggregate the estimators: τˇ +τ ˇ τˇ = I1 I2 Defn 7.24 (R-Loss). 2 To define R-Loss (Athey, J. Tibshirani, and Wager, 2019), under more general setup restate unconfoundedness as follows Implemented in DoubleML E [εi(di)|Xi, di] = 0 where εi(d) := Yi(d) − (µ(0)(Xi) + wτ(Xi)) and follow Robinson’s approach to write 7.4.2 Heterogeneous Treatment Effects with selection on Observables [ ] Yi − m(Xi) = (Di − e(Xi))τ(Xi) + εi E 1 − 0| Conditional Average Treatment Effects (CATEs) (τ(x) = Y Y X = x ) are R-loss is then written often of great policy interest for targeting those who have largest potential gains. { [ ]} ′ 2 However, severe risk of fishing. Use nonparametric estimators to find subgroups, τ(·) = argmin E ((Yi − m(Xi) − (di − e(Xi))τ (Xi)) ′ use sample-splitting for honesty. τ Defn 7.25 (R-Learner, Athey, J. Tibshirani, and Wager (2019)). 1. transformed outcome regression use outcome transformed w pscore Define e(x) = Pr (D = 1|X = x) and m(x) = E [Y |X = x] The R-learner consists DY (1 − D)Y of the following steps H = − p(X) 1 − p(X) 1. Use any method to estimate the response functions eb(x), mb (x)

← ToC 77 2. Minimise R-loss using cross-fitting for nuisance components 8 Bayesian Statistics

2 τb(.) = argmin ((Yi − mb (Xi)) − τ(Xi)(Di − eb(Xi))) + Λn(τ(·)) 8.1 Setup τ D := {(x , y )}N where Λn is some regulariser. Notation: per the Murphy textbook, some statements use notation i i i=1 Causal forest as implemented by grf starts by fitting two separate trees to estimate as shorthand for data. m,b eb, makes out-of-bag predictions [using cross-fitting] using the two first-stage Theorem 8.1 (Bayes Theorem). forests, then grow causal forest via ∑ f(X|θ)f(θ) n ∫ α (x)(Y − mb (X )(D − eb(X ))) f(θ|X) = ∝ L(θ) f(θ) i=1 ∑i i i i i | {z } f(X|θ)f(θ)dθ | {z } |{z} τb(x) = n − b 2 posterior prior i=1 αi(x)(Di e(Xi)) likelihood where ∑ 1 Example 8.2 (Bayesian Updating Steps). 1. Use Bayes Rule to come up with 1 Xi∈Lb(x),i∈B αi(x) = a of some hypothesis Hu given event E. Your prior is B |i : Xi ∈ Lb(x), i ∈ B| b P (Hu). are the learned adaptive weights. P (E|H )P (H ) P (H |E) = u u u | | c c Defn 7.26 (Double-sample / Honest trees (Athey and G. Imbens, 2016a)). P (E Hu)P (Hu) + P (E Hu)P (Hu) ′ Call the posterior probability P (Hu). 1. Draw a subsample of size s from the sample with replacement and divide it ′ into disjoint sets I, J ; |I| = |J | = n/2. 2. Given second event E , use posterior probability from step 1 as your prior in the second update step. J 2. Grow a tree via recursive partitioning, with splits chosen from (i.e. with- ′ ′ I ′ P (E |Hu)P (H ) out using Y observations from sample) P (H |E ) = u u ′| ′ ′| c ′ c P (E Hu)P (Hu) + P (E Hu)P ((Hu) ) 3. Estimate leaf responses using only I sample Defn 8.1 (Exchangeability). Finally, aggregate all trees over subsamples of size s A sequence of random variables y1, . . . , yn is finitely exchangeable if their joint ( ) ∑ density remains the same under any re-ordering or re-labeling of the indices of b n −1 E µ(x, Zi,..., Zn) = s ξ∈Ξ [T (x, ξ, Zi1 ,..., Zis )] the data.

1≤i1<...,

← ToC 78 Fact 8.3 (Posterior Quantities of Interest). the mode of the posterior density is the θ that maximizes the likelihood function. With the full posterior, one can compute Posterior Mean, median, and mode (the An informative prior on θ yields a posterior mean that is a precision-weighted latter is sometimes called the Maximum A Posteriori estimate). average of the prior mean and the MLE. One can also compute Stan dev team recommendations: https://github.com/stan-dev/stan/wiki/Prior- Defn 8.2 (Highest Posterior Density Region R(θ)). Choice-Recommendations , which is a region such that the the parameter lies in the region with probability 1 − α Theorem 8.5 (Bernstein-Von Mises Theorem / Bayesian CLT). ∫ ( ) | ∼a N ˆ I ˆ −1 1 − α = Pr (θ ∈ R(θ)|y) = p(θ|y)dθ θ y θ, (θ) R(θ) As N→∞, the likelihood component of the posterior becomes dominant and as Defn 8.3 (Posterior Predictive Density). a result frequentist and bayesian inferences will be based on the same limiting Consider out-of-sample prediction for a single observation y˜. The posterior pre- multivariate normal distribution. dictive density is ∫ Defn 8.5 (Bayesian Model Selection). ∞ To choose between Bayesian models, we compute the posterior over models | | | p(˜y y1, . . . , yn) = p(˜y θ, y1, . . . , yn)p(θ y1, . . . , yn)dθ D| −∞ |D ∑p( m)p(m) p(m ) = D Because y˜ is independent of y conditional on θ (exchangeability), we can simplify m∈M p(m, ) b |D this as ∫ which allows us to pick the MAP model m = arg max p(m ). If we use a uniform ∞ p(m) ∝ 1 prior over models , this amounts∫ to picking the model wich maximises p(˜y|y1, . . . , yn) = p(˜y|θ, y1, . . . , yn)p(θ|y1, . . . , yn)dθ −∞ D| D| | ∫ ∞ p( m) = p( θ)p(θ m)dθ = p(˜y|θ)p(θ|y1, . . . , yn)dθ −∞ which is called the marginal likelihood / integrated likelihood / evidence for model m. This is just the data density for y multiplied by the posterior density for θ. Example 8.4 (Posterior predictive density for Bernoulli trial). 8.2 Conjugate Priors and Updating Consider y˜ ∼ Bernoulli(θ). The posterior predictive density is Defn 8.6 (Improper Priors). ∫ ∫ 1 Priors of the form f(θ) ∝ c c > 0 are improper because f(θ)dθ = ∞. Improper p(˜y|y) = p(˜y|θ)p(θ|y)dθ priors generally not a problem as long as resulting posterior is well defined. ∫0 1 − Fact 8.6 (Flat priors are not invariant). = θy˜(1 − θ)1 y˜p(θ|y)dθ Suppose X ∼ Bernoulli (p), and we choose prior f(p) = 1. Define transformation 0 − eψ ψ = log(p/(1 p)). Resulting distribution of ψ is fψ(ψ) = ψ 2 So if we want to know the posterior predictive probability p(˜y = 1|θ), we can com- (1+e ) pute it as ∫ Defn 8.7 (Jeffreys’ Prior). 1 Method of constructing invariant priors. | | p(˜y = 1 θ) = θp(θ y)dθ f(θ) ∝ I(θ)1/2. For multiparameter model, f(θ) ∝ |I(θ)|1/2 0 | = E[θ y] Example 8.7 (Jeffreys’ Prior). L L which is the posterior mean. Given γ = h(θ), ∂ = ∂ = ∂θ and ∂γ ∂θ ∂γ ( ) ∂2L ∂2L ∂θ 2 ∂L ∂2θ Defn 8.4 (Uninformative Prior vs. Informative Prior). = + An uninformative prior on θ produces a posterior density that is proportional to ∂γ2 ∂θ2 ∂γ ∂θ ∂γ2 the likelihood (differing only by the constant of proportionality). This implies that

← ToC 79 [ ] E ∂L Quantity Formula Taking expectations wrt sample density sends second piece to zero (since ∂θ = 0), so ( ) ∑ ∑ 2 α′ x + 1 x + 1 ∂θ 1/2 1/2 ∂θ ∑ i ∑ i I(γ) = I(θ) =⇒ |I(γ)| = |I(θ)| Posterior Mean ′ ′ = = ∂γ ∂γ α + β xi + 1 + n − xi + 1 n + 2 ∑ ′ Defn 8.8 (Conjugate Prior). α − 1 xi Posterior mode = Analytically tractable expressions for the posterior are derived when sample and α′ + β′ − 2 n prior densities form a natural conjugate pair, defined as having the property that sample, prior, and posterior densities all lie in the same class of densities. α′β′ Exponential family Posterior variance is essentially the only class of densities to have natural conju- (α′ + β′)2(α′ + β′ + 1) gate priors. A one parameter member of the exponential family has density for N obs that can ∼ be expressed as Posterior predictive distribution Beta-Binomial(n, a, b)∑ ∑ ( ) ∼ Beta-Binomial(n, α + xi, β + n − xi) ∏ ∑ library(extraDistr) L(y|θ) = exp ((a(θ)) + b(y) + c (θ) u (y)) ∝ exp Na(θ) + c(θ) u(y) rbbinom(n, size, alpha, beta) i

Example 8.8 (Beta-Binomial Updating). Example 8.9 (How to find Beta hyper-parameters using a prior proportion and Let X1,...,Xn ∼ Bernoulli (p), and we take prior f(p) = 1. By Bayes thm, the variance). posterior is of the form Suppose we have a proportion from a previous study θ , with variance V (θ ; α, β). ∑ ∑ 0 0 n s n−s xi n− xi Then we can create a constant f(p|x ) ∝ f(p)Ln(p) = p (1 − p) = p (1 − p) f(p) = Beta (α, β) α = β = 1 − Instead we take . Uniform prior is a special case with . θ0(1 θ0) − In general, the posterior is of the form γ = 1 V (θ0; α, β) ∑ ∑ Γ(n + 2) − And compute the hyper-parameters α and β for our prior distribution as f(p|xn) = p xi (1 − p)n xi Γ(s + 1)Γ(n − s + 1) (∑ ∑ ) α = γθ0 n β = γ(1 − θ ) p|x ∼ Beta xi + 1, n − xi + 1 0 ∼ Beta (α′, β′) Surprisingly this works! See Jackman p.55 for a worked out example. rbeta(n, shape1, shape2) Example 8.10 (Gamma-Poisson Updating). Let Y1,...,Yn ∼ Poisson(λ). This means that ∏N exp(−λ)λyi p(y|λ) = y ! i=1 i ∑ ∝ λ yi exp(−nλ) We specifiy a Gamma prior on λ, which has density ba p(λ; a, b) = λa−1 exp(−bλ) Γ(a)
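A quick numerical companion to Example 8.8 (illustrative, simulated Bernoulli data): with a Beta(a, b) prior and s successes in n trials the posterior is Beta(a + s, b + n − s), so posterior draws give the mean and a 95% credible interval directly.

# Beta-Binomial conjugate updating, analytic posterior vs Monte Carlo draws.
set.seed(12)
a <- 2; b <- 2                     # prior hyper-parameters
y <- rbinom(50, 1, 0.3)
s <- sum(y); n <- length(y)
draws <- rbeta(1e5, a + s, b + n - s)
c(post_mean = (a + s) / (a + b + n), mc_mean = mean(draws))
quantile(draws, c(0.025, 0.975))   # equal-tailed 95% credible interval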

← ToC 80 ( ) So then the posterior for λ is y ∼ N µ, σ2 ., where σ2 is known but mean µ is not known. The joint density of p(λ|y) ∝ p(λ)p(y|λ) y is ∑ N ( ) ( ) a−1 y ∏ − 2 ∝ λ exp(−bλ)λ i exp(−nλ) 2 −1/2 (yi θ) N 2 ∑ L(y|θ) = (2πσ ) exp − ∝ exp − (¯y − θ) − 2 2 yi+a 1 − 2σ 2σ = λ exp( λ(b + n)) i=1 ( ) ∑ ( ) 2 ∼ ∼ N 2 ⇒ ∝ − (θ−µ) Gamma( yi + a, b + n) Given a normal prior θ µ, τ = f(µ) exp 2τ 2 , we can write the ′ ′ ∼ Gamma(a , b ) posterior density of the form ( ) ( ) ( ( )) rgamma(n, shape, rate) N (θ − µ)2 1 (θ − µ )2 f(θ|y) ∝ exp − (θ − y¯)2 exp ∝ exp − 1 a = b = 0 2 2 2 A flat prior is . 2σ 2τ 2 τ1 where µ = τ 2(Ny/σ¯ 2 + µτ 2) and τ 2 = (N/σ2 + 1/τ 2)−1. Posterior mean is a Quantity Formula 1 1 1 weighted sum of prior mean µ and sample mean y¯ with weights that reflect the ∑ precision of the likelihood via Nσ2 and prior τ 2. ′ a yi + a Posterior Mean = Three cases (ref. Jackman p.80-94): b′ b + n ∑ 1. Variance known, mean unknown. Model: ′ a − 1 yi + a − 1 ∼ N 2 Posterior mode = µ (µ0, σ0) b′ b + n y ∼ N (µ, σ2) ∑ ′ a yi + a Posterior variance = (b′)2 (b + n)2 Quantity Formula

Posterior predictive distribution ∼ Negative Binomial(y, θ) 1 n µ0( 2 ) +y ¯( 2 ) 2 2 1 σ0 σ nσ0 σ ∼ − Posterior Mean = y + µ Negative Binomial(a, 1 ) 1 n nσ2 + σ2 0 nσ2 + σ2 b + 1 σ2 + σ2 0 0 rnbinom(rnbinom(n, size, prob) 0 ( )−1 2 2 Example 8.11 (Dirichlet-Multinomial Updating). 1 n σ0σ Posterior variance 2 + 2 = 2 2 σ0 σ σ + nσ0 ∏K − p(θ|α , . . . α ) ∝ θαj 1 1 k j Posterior predictive distribution ∼ N (˜µ, σ˜2) j=1 n0µ0 + ny¯ ∏K µ˜ = n0 +(n ) p(y|θ) ∝ θyj −1 j 1 n j=1 2 2 σ˜ = σ + 2 + 2 σ0 σ where yj is the count of observations in category j. For 3 categories, the posterior is: 2. Variance and mean both unknown. Prior densities: − − − − − | ∝ α1+y1 1 α2+y2 1 − − α3+y3 1 2 | 2 2 p(θ1, θ2, 1 θ1 θ2 y) θ1 θ2 (1 θ1 θ2) p(µ, σ ) = p(µ σ )p(σ ) ∼ 2 2 Dirichlet(α1 + y1, α2 + y2, α3 + y3) p(µ|σ ) ∼ Normal(µ0, σ /n0) 2 ∼ 2 2 Example 8.12 (Normal-Normal updating). p(σ ) Scaled-Invese-χ (ν0/2, ν0σ0/2)

← ToC 81 Conditional posterior densities: 2 2 p(µ, σ2) ∝ 1/σ2 µ|σ , y ∼ N (µ1, σ /n1) 2| ∼ 2 2 σ y Scaled-Invese-χ (ν1/2, ν1σ1/2) Posterior densities where µ|σ2, y ∼ N (¯y, σ2/n) ∑ n1 = n0 + n − N − 2 2 2 n 1 i=1(yi y¯) n0µ0 + ny¯ σ |y ∼ Scaled-Invese-χ ( , ) µ1 = 2 2 n1 ν1 = ν0 + n which implies ∑N 2 2 − 2 n0n − 2 µ − y¯ ν1σ1 = v0σ0 + (yi y¯) + (µ0 y¯) √ ∼ n + n tn−1 i=1 0 S/((n − 1)n)

Marginal posterior density of µ: Posterior predictive distribution √ √ p(µ) ∼ Student-T(µ , σ2/n , v ) n + 1 1 1 1 1 p(˜y|y) ∼ Student-T(¯y, s , n − 1) ∼ brms::rstudent_t(n, df, mu = 0, sigma = 1) n where where ∑n n = n + n 1 1 0 s2 = (y − y¯)2 n0µ0 + ny¯ n − 1 µ1 = i=1 n1 ∑n ν = ν + n 1 1 0 y¯ = yi 2 n σ1 = S1/ν1 i=1 ∑N n n S = ν σ2 + (n − 1) (y − y¯)2 + 0 (¯y − µ )2 Fact 8.13 (Simulation from Posterior). 1 0 0 i n 0 Posterior can often be approximated by simulation. i=1 1 ∼ | n Posterior predictive distribution for y˜: • Draw θ1, . . . , θB p(θ x ) √ n • Histogram of θ1, . . . , θB approximates posterior density p(θ|x ) p(˜y|y) ∼ Student-T(µ1, σ1 (n1 + 1)/n1, v1) where Methods for this: Markov-Chain Monte-Carlo, Metropolis-Hastings, Hamilto- n1 = n0 + n nian Monte-Carlo n0µ0 + ny¯ µ1 = Fact 8.14 (Conjugacy for Discrete Distributions). n1 ν1 = ν0 + n 2 σ1 = S1/ν1 ∑N n n S = ν σ2 + (n − 1) (y − y¯)2 + 0 (¯y − µ )2 1 0 0 i n 0 i=1 1 3. Improper reference prior. Prior densities:

← ToC 82 Likelihood Conjugate prior Posterior hyperparameters Defn 8.11 (Monte Carlo Integration). ∫ b ∑n ∑n First rewrite the integral to be evaluated I = h(x)dx as follows ∫ ∫ a Bern (p) Beta (α, β) α + xi, β + n − xi b b i=1 i=1 I = h(x)dx = w(x)f(x)dx ∑n ∑n ∑n a a Bin (p) Beta (α, β) α + xi, β + Ni − xi − 1 where w(x) = h(x)(b a) and f(x) = b−1 . Since f is the probability density for i=1 i=1 i=1 E ∼ ∑n a uniform r.v. over (a, b), we can write I = f [w(X)] where X U [a, b]. If we X ,...,X ∼ U [a, b] NBin (p) Beta (α, β) α + rn, β + xi generate 1 N , by LLN, i=1 ∑N ∑n 1 p Iˆ = w(Xi) → E [w(X)] = I Po (λ) Gamma (α, β) α + x , β + n N i i=1 i=1 ∑n Defn 8.12 (Reversibility / Detailed Balance). (i) Multinomial(p) Dir (α) α + x The goal is the generate sequences θ1, θ2,... from f(θ|y). An MCMC scheme will i=1 generate samples from the conditional if | | | | Fact 8.15 (Conjugacy for Continuous Distributions). P (θj θj−1)f(θj−1 y) = P (θj−1 θj)f(θj y) Likelihood Conjugate prior Posterior hyperparameters where P (θi|θj) is the pdf of θi given θj. The LHS is the joint pdf of θj, θj−1 from { } the chain, if θj−1 is from f(θ|y). Integrating RHS over dθj−1 yields f(θj|y), so Unif (0, θ) Pareto(xm, k) max x(n), xm , k + n the result states that given θ − is from the correct posterior distribution, the chain ∑n j 1 generates θj also from the posterior f(θ|y). Exp (λ) Gamma (α, β) α + n, β + xi ( ∑ i=1 ) ( ) Defn 8.13 (Gibbs Sampling). ( ) ( ) µ n x 1 n Basic idea - turn high dimensional problem into several one-dimensional prob- N µ, σ2 N µ , σ2 0 + i=1 i / + c 0 0 2 2 2 2 , lems. Suppose (X,Y ) has joint density fX,Y (x, y). Suppose it is possible to sim- σ0 σc σ0 σc ( )−1 f | (x|y) f | (y|x) (X ,Y ) 1 n ulate from conditional distributions X Y and Y X . Let 0 0 be start- (X ,Y ),..., (X ,Y ) (X ,Y ) 2 + 2 ing values. Assuming we have drawn 0 0 i i , we generate i+1 i+1 σ0 σc ∑ as follows ( ) 2 n − 2 2 νσ0 + i=1(xi µ) N µc, σ Scaled Inverse ν + n, • X ∼ f | (x|Y ) 2 ν + n n+1 X Y n Chi-square(ν, σ0) • Y ∼ f | (y|X ) ( ) νλ + nx¯ n n+1 Y X n+1 N µ, σ2 Normal- , ν + n, α + , scaled Inverse ν + n 2 • ... for multiple parameters Gamma(λ, ν, α, β) ∑n − 2 1 − 2 γ(¯x λ) β + (xi x¯) + Example 8.16( (Gibbs) for Univariate Normal). 2 2(n + γ) iid i=1 2 2 . Let yi ∼ N µ, σ . Define precision τ = 1/σ ( ∑ ) n/2 1 n 2 • Likelihood: f (y|µ, τ) ∼ τ exp τ (yi − µ) 8.3 Computation / Markov Chains 2 i=1 • (Noninformative) Prior: πµ, τ ∼ τ Defn 8.9 (Stochastic Process {Xt : t ∈ T }). is a collection of random variables. Posterior Distribution ( ) ∑n Defn 8.10 (Markov Chains). (n/2)+1 1 2 π (µ, τ|y) ∼ τ exp − τ (yi − µ) The process {Xn : n ∈ T } is a Markov Chain if 2 i=1 Pr (X = X|X ,...,X − ) = Pr (X = x|X − ) n 0 n 1 n n 1 full conditionals:

← ToC 83 ( ) • π(µ|τ, y) = N y,¯ (nτ)−1 ( ∑ ) ℓ(z, β|Σ ) = log p(y|z) + log N (z|Xβ, I) + log N (β|0, Σ ) | n 1 n − 2 0 0 • π(τ µ, y) = Γ 2 , n i=1(yi µ) ∑n 1 ′ 1 ′ − = log p(y |z ) − (z − Xβ) (z − Xβ) − β (Σ ) 1 β + const However, it is typically impossible to write out or sample from full-conditionals. i i 2 2 0 i=1 Defn 8.14 (Metropolis Algorithm). Let q(y|x) be an arbitrary, friendly distribution we can sample from. The condi- The posterior in the E step is a truncated Gaussian q(y|x) { tional density is called the proposal distribution. MH creates a sequence of ϕ(µi) µi + if yi = 1 observations X0,... as follows Choose X0 arbitrarily. Suppose we have generated p(z |x , β) = Φ(µi) i i ϕ(µi) X0,X1,...,Xi. Generate Xi+1 as follows µi − if yi = 0 Φ(µi) ′ E • Generate proposal Y ∼ q(y|Xi) where µi = xiβ. In the M step, we estimate β using ridge, where µ = [z]. ( )−1 −1 ′ ′ • Evaluate r := r(Xi,Y ) where βˆ = (Σ ) + X X X µ { } 0 f(y) q(x|y) r(x, y) = min , 1 f(x) q(y|x) 8.4 Hierarchical Models • Set { Defn 8.16 (Heirarchical Priors). Y w.p r Parameters in a prior are modeled as having a distribution that depends on hyper- Xi+1 = − Xi w.p 1 r parameters. This results in joint posteriors of the form | ∝ L | | Defn 8.15 (Expectation Maximisation). f(θ, τ y) | ({zy θ}) |f({zθ τ}) |f{z(τ}) Let xi be observed and zi missing. The goal is to maximise the log-likelihood of likelihood parameter prior hyperparameter prior the observed data [ ] Represented by the τ → θ→D. ∑n ∑n ∑ We are typically interested in the marginal posterior of θ, which is obtained by | | ℓ(θ) = log p(xi θ) = p(xi, zi θ) integrating the joint posterior w.r.t τ. i=1 i=1 zi τ Cannot push log inside the sum because of unobserved variables. EM∑ tackles the By treating as a latent variable, we allow data-poor observations to borrow strength n | from data rich ones. problem as follows. Define complete data log likelihood as ℓc(θ) := i=1 log p(xi, zi θ). This cannot be computed, since[ zi is unknown.] t−1 t−1 8.4.1 Empirical Bayes Instead, define Q(θ, θ ) = E ℓc(θ|D, θ ) where t is the iteration number and Q is called the auxiliary function. In hierarchical models, we need to compute the posterior on multiple layers of latent variables. For example, for a two-level model, we need Q(θ, θt−1) • Expectation (E) Step: Compute , which is an expectation wrt old p(η, θ|D) ∝ p(D|θ)p(θ|η)p(η) params θt−1. We can employ a computational shortcut by approximating the posterior on the • Maximisation (M) Step: Optimise the Q function wrt θ. hyper-parameters with a point-estimate p(η|D) ≈ δηˆ(η), where ηˆ = arg max p(η|D). − − Since η is usually much smaller than θ in dimensionality, we can safely use a uni- Compute θt = arg max Q(θ, θt 1). For MAP estimation, θt = arg max Q(θ, θt 1)+ θ θ form prior on η. Then, the estimate becomes log p(θ) [∫ ] ηˆ = arg max p(D|η) = arg max p(D|θ)p(θ|η)dθ Example 8.17 (EM for probit regression). | 1 ∼ N ′ | {z } Probit has the form p(yi = 1 zi) = zi>0 where zi (xiβ, 1) is the latent vari- marginal likelihood able. The complete data log likelihood, assuming a N (0, Σ0) prior on β.

← ToC 84 This violates the principle that the prior should be chosen independently of the The joint distribution p(x) = p(x1, . . . , xK ) can be written as data, but is a cheap computational trick. This produces a hierarchy of Bayesian ∏K methods in increasing order of the number of integrals performed. p(x) = p(xk|Pak) Example 8.18 (Cancer Rates across cities). k=1 Suppose we measure the number of people in various cities Ni and the number of where Pak denotes the parent nodes of xk, which are nodes that have arrows point- x people who died of cancer xi. We assume xi ∼ Bin (Ni, θi) and want to estimate the ing to k. θ cancer rates i. The MLE solution would be to either estimate them all separately, Defn 8.18 (Conditional Independence). or estimate a single θ for all cities. X Y ∼ and are said to be conditionally independent iff the conditional joint can be The hierarchical approach is to model θi Beta (a, b), and write a joint distribution written as the product of the conditional marginal. ∏N X ⊥⊥ Y |Z ⇔ p(X,Y |Z) = p(X|Z)p(Y |Z) p(D, θ, η|N) = p(η) Bin (xi|Ni, θi) Beta (θi, η) Defn 8.19 (d-separation). i=1 If the condition A ⊥⊥B|C, it must be the case that all paths are blocked. All paths θ analytically integrate out i are blocked iff ∏N ∫ = Bin (x |N , θ ) Beta (θ , η) dθ • Arrows on the path meet either head to tail or tail to tail at the node, and the i i i i i C i=1 note is in the set ∏N Beta (a + xi, b + Ni − xi) • Arrows meet head to head at the node, and neither the node nor any of its = descendents is in the set C Beta (a, b) i=1 ′ where η := (a, b). We can also put covariates on θi = f(xiβ). 8.5.1 Empirical Bayes Example 8.19 (James-Stein Estimator for batting averages (Efron and Mor- 8.4.2 Hierarchy of Bayesianity ris)). We suppose that each player’s MLE value pi (his batting average in the first 90 tries) Method Definition is a binomial proportion, ∼ Maximum Likelihood θˆ = arg max p(D|θ) pi Bi (90,Pi) /90 θ Here Pi is his true average, how he would perform over an infinite number of tries; MAP Estimation θˆ = arg max p(D|θ)p(θ|η) θ ∫ TRUTH i is itself a binomial proportion, taken over an average of 370 more tries per player. D| | D| Empirical Bayes ηˆ = arg max p( θ)p(θ η)dθ = arg max p( η) At this point there are two ways to proceed. The simplest uses a normal approxi- η ∫ η mation to (7.17) ( ) ηˆ = arg max = p(D|θ)p(θ|η)p(η)dθ = arg max p(D|η)p(η) ∼N 2 MAP-II pi ˙ Pi, σ0 η η |D ∝ D| | 2 Full Bayes p(θ, η ) p( θ)p(θ η)p(η) where σ0 is the binomial variance 2 − σ0 =p ¯(1 p¯)/90 8.5 Graphical Models with p¯ = 0.254 the average of the pi ’s. Letting xi = pi/σ0, applying (7.13), and JS JS Defn 8.17 (Chain rule of probability). transforming back to pˆi = σ0µˆi , gives James-Stein estimates Any joint distribution can be represented as follows [ ] − 2 JS (N 3)σ0 p(x1:v) = p(x1)p(x2|x1)p(x3|x2, x1) . . . p(xv|x1:V −1) pˆ =p ¯ + 1 − ∑ (p − p¯) i − 2 i where V is the number of variables [and we have dropped the parameter vector (pi p¯) θ]. It follows that

A second approach begins with the arcsin transformation
x_i = 2(n + 0.5)^{1/2} sin⁻¹( [ (n p_i + 0.375) / (n + 0.75) ]^{1/2} )

9 Dependent Data: Time series and spatial statistics

9.1 Time Series

A time series is a sequence of data points {w_t}_{t=1}^T observed over time. In a random sample, points are iid, so the joint distribution is f(w_1, ..., w_T) = Π_{t=1}^T f(w_t). In time series this is clearly violated, since observations that are temporally close to each other tend to be more similar.

Defn 9.1 (Stochastic Process).
is a sequence of random variables {..., Y_{−1}, Y_0, Y_1, ...} indexed by the elements of a set of indices, {Y_t : t ∈ T}. Hypothetical repeated realisations of a stochastic process look like {w_t^{(1)}, w_t^{(2)}, ..., w_t^{(n)}}_{t=−∞}^{∞}. The index set T may be either countable, in which case we get a discrete-time process, or uncountable, in which case we get a continuous-time process.

State Space: We assume there exists a set Y ⊆ R s.t. ∀ t ∈ T, Y_t ∈ Y. Then Y is called the state space of the stochastic process.

Defn 9.2 (Martingales).
Consider a random process {Y_t}_{t=1}^∞ and an increasing sequence of information sets {F_t}_{t=1}^∞, i.e. a collection of σ-fields s.t. F_0 ⊂ F_1 ⊂ ... ⊂ F_∞. If Y_t belongs to the information set F_t and is absolutely integrable [i.e. Y_t ∈ L_0(F_t) ∩ L_1(F)], and E[Y_{t+1} | F_t] = Y_t ∀ t < ∞, then {Y_t}_{t=0}^∞ is called a martingale. In words, the conditional expected value of the next observation, given all the past observations, is equal to the most recent observation.

Defn 9.3 (Autocovariance).
The autocovariance of Y_t is the covariance between Y_t and its jth lagged value

γ_jt := E[Y_t − μ_t][Y_{t−j} − μ_{t−j}]

The variance-covariance matrix of y = {y_t} is given by

V[y] =
( γ_0       γ_1     ...     γ_{T−1} )
( γ_1       γ_0     ...     ...     )
( ...       ...     ...     γ_1     )
( γ_{T−1}   ...     γ_1     γ_0     )

The jth-order correlation coefficient is ρ_j := γ_j / γ_0.

Defn 9.4 (Stationarity ≡ I(0)).

A random process is said to be stationary if the distribution functions of (X_{t_1}, X_{t_2}, ..., X_{t_k}) and (X_{t_1+j}, X_{t_2+j}, ..., X_{t_k+j}) are the same ∀ t_1, ..., t_k, j ∈ Z. A process is said to be covariance (or weakly) stationary if

1. E [Yt] = µ ∀ t ∈ T

← ToC 86 2. γjt = E [Yt − µt][Yt−j − µt−j] = γj ∀t ∈ T 1. X(0) = 0 { } i.e. neither the mean nor the autocovariances depend on the date t; stationary 2. X(si + ti) − X(Si) over an arbitrary collection of disjoint intervals (si, si + expectation, variance, and covariance. Most relevant variables aren’t stationary, ti are independent r.v.s but their detrended or first-differenced versions may be. 3. ∀ s, t ≥ 0,X(s + t) − X(s) ∼ N (0, t) Defn 9.5 (Markov Process). If X0,X1,... is a Markov Process, Defn 9.8 (White Noise). { } 2 Pr (Xn+1 ≤ x|X1,...,Xn) = Pr (Xn+1 ≤ x|Xn) White noise is a sequence εt whose elements have mean zero and variance σ , and for which εt’s are uncorrelated over time that is, the conditional distribution of Xn+1 given X1,...,Xn does not depend on X1,...,Xn−1. 1. E [εt] = 0 [ ] 2. E ε2 = σ2 Markov Chain A Markov chain is simply a Markov process in which the state- t space is a countable set. Since a Markov chain is a markov process, the conditional 3. E [εtεt−j] = 0 ∀ j ≠ 0 distribution of Xt+1|X1,...,Xt depends only on Xt. The conditional distribution is often represented by a Transition matrix where Defn 9.9 (Moving average : MA(q)). q MA(q) q t,t+1 | A moving average of order , is a weighted average of the most recent Pij = Pr (Xt+1 = j Xt = i); i, j = 1,...J values of a white noise defined as ∀ If P is the same t, we say the Markov chain has stationary transition probabilities. Yt = µ + εt + θ1εt−1 + ··· + θq + εt−q Defn 9.6 (Ergodic Processes). Defn 9.10 (Autoregressive : AR(p)). A is ergodic if any two variables positioned far apart in the se- An autoregressive process of order p, AR(p) is given by Yt as a linear combination quence are almost independently distributed. of p lags of itself and one white noise {x } is ergodic if, for any two bounded functions f(.) in k + 1 variables and g(.) in t Y = µ + ϕ Y − + ··· + ϕ Y − + ε l + 1 variables, t 1 t 1 p t p t Defn 9.11 (Autoregressive Moving Average: ARMA(p, q)). |E | − lim [f(xt, . . . , xt+k)g(xt+N , . . . , xt+l+N )] ARMA(p, q) combines AR(p) and MA(q) N→∞ Yt = µ + εt + εt + θ1εt−1 + ··· + θq + εt−q + ϕ1Yt−1 + ··· + ϕpYt−p |E [f(xt, . . . , xt+k)]| |E [g(xt+N , . . . , xt+l+N )]| = 0 | {z } | {z } MA(q) AR(p) i.e. limj→∞ γj = 0 ∑ ∞ | | Sufficient condition for ergodicity is xt be covariance stationary and γj < Theorem 9.1 (Wold Theorem). ∞ j=0 Consider AR(1): Y = ρY − + ε . Since this holds at t, it holds at t − 1 =⇒ Y − = Ergodic processes have the following property t t 1 t t 1 [ ] ρYt−2 + εt−1. Substitute into original to get Yt = ρ(ρYt−2 + εt−1) + εt. Repeat ad ∑T T∑−1 infinitum to obtain, as long as ρ < 1 V − | | xt = (T j )γj ∑∞ t=1 j=1−T s Yt = ρ εt−s this result implies that s=0 [ ] T ∞ In other words, AR(1) ≡ MA(∞) ; they are different representations of the same 1 ∑ ∑ lim V √ xt = γj < ∞ underlying stochastic process. T →∞ T t=1 j=−∞ Wold Representation: All covariance-stationary time series processes can be rep- MA(∞) This permits us to swap is for ts and derive Asymptotic theory with dependent resented by / decomposed into a deterministic component and a observations, such as LLN and CLT. Defn 9.12 (Time Trends). E Defn 9.7 (Brownian Motion). In a stationary process, [xt] = µ, which is seldom true. A less restrictive assump- tion that allows for nonstationarity is to specify the mean as a function of time. A family of r.v.s {Xt} indexed by a continous variable t over [0, ∞) is a Brownian Motion iff E [xt] = α + βt specify xt = α + βt + εt; ε stationary
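A small base-R sketch of the AR(1) and unit-root material above (simulated data; the ADF test assumes the tseries package): an AR(1) with ϕ = 0.7 has geometrically decaying autocorrelations and is recovered by arima(), while a random walk fails to reject the unit-root null.

# AR(1): simulate, inspect autocorrelations, and recover phi; then an ADF test on a random walk.
set.seed(14)
y <- arima.sim(model = list(ar = 0.7), n = 500)
acf(y, lag.max = 10, plot = FALSE)       # gamma_j / gamma_0 decays roughly as 0.7^j
arima(y, order = c(1, 0, 0))             # phi_hat close to 0.7
library(tseries)                         # assumed available for the ADF test
adf.test(cumsum(rnorm(500)))             # random walk: large p-value, unit root not rejected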

← ToC 87 Defn 9.13 (Random walk ≡ I(1)). E | ( ) is a a process such that ([xt x)t−1, xt−2,... ] = xt−1. ∑m ∑T ∼ N 2 b b j b b′ b 1 ′ xt = xt−1 + εt, εt 0, σ = AR(1) process with ϕ = 1 =: Unit Root. Rewrite V = Γ + 1 − (Γ + Γ ) where Γ = εb εb − x x − 0 m + 1 j j j T − j j t j t t j as i=1 t=j+1 ∑t with variance estimated the normal way xt = xt−1 + εt = x0 + εj ( )−1 ( )−1 j 1 ∑T 1 ∑T Wc = x x′ Vb x x′ Random walk with drift T t t T t t t=1 t=1 ∑t xt = xt−1 + δ + εt = δt + εt + εt−1 . . . ε1 + x0 = x0 + δt + εj Defn 9.17 (Error Correction Mechanism). j Consider the model y = δ + αy − β x + β x − + ε Defn 9.14 (Unit Root Tests). ( ) t t 1 0 t 1 t 1 t 2 − For the following model xt = ρxt−1 + εt, εt ∼ N 0, σ = AR(1) Subtract yt−1 and add β0xt−1 + β0xt−1 to the l.h.s. we get b test ρ = 1. Distribution of ρ under the null ρ = 1 is non-standard: CLT not valid. yt − yt−1 = δ − (1 − α)yt−1 + β0(xt − xt−1) + (β0 + β1)xt−1 + εt test to use: Dickey Fuller, Augmented Dickey Fuller, Phillips-Perron. and

Defn 9.17 (Error Correction Mechanism). Consider the model

y_t = δ + α y_{t−1} + β_0 x_t + β_1 x_{t−1} + ε_t

Subtract y_{t−1} from both sides and add and subtract β_0 x_{t−1} on the right to get

y_t − y_{t−1} = δ − (1 − α) y_{t−1} + β_0 (x_t − x_{t−1}) + (β_0 + β_1) x_{t−1} + ε_t

and

∆y_t = δ + β_0 ∆x_t − (1 − α)(y_{t−1} − γ x_{t−1}) + ε_t

where γ = (β_0 + β_1)/(1 − α) is the long-run effect.

Defn 9.18 (Testing for trend-breaks - sequential Chow Test). A Quandt Likelihood Ratio test begins with no knowledge of when the trend break occurs [although researchers typically know of the timing for substantive reasons], and sequentially estimates the following model:

∆Y_t = log Y_t − log Y_{t−1} = α + δ_0 D_t(τ) + ε_t

where ∆Y_t is the first difference of the outcome, and D_t(τ) is an indicator variable equal to zero for all years before τ and one for all subsequent years. The researcher varies τ and tests the null that δ_0 = 0, and the largest F-statistic is used to determine the best possible break point. Use Andrews (2003) critical values to account for multiple testing.

9.2 Spatial Statistics

Defn 9.19 (Spatial Stochastic Process, Autocorrelation). A spatial stochastic process is a collection of random variables y indexed by location i: {y_i, i ∈ D}, where D is either a continuous surface or a finite set of discrete locations.
Spatial autocorrelation is expressed as

Cov[y_i, y_j] = E[y_i y_j] − E[y_i] E[y_j] ≠ 0 ∀ i ≠ j

This can be modelled in 3 basic ways:

1. Specify a particular functional form on the stochastic process generating the random variables {y_i, i ∈ D}, from which the covariance structure follows
2. Model the covariance structure directly, typically as a function of a small number of parameters
3. Leave the covariance unspecified and estimate it nonparametrically

Spatial Autoregressive Processes:

y − µι = ρW(y − µι) + ε ⟹ y − µι = (I − ρW)^{−1} ε

where W is an N × N weight matrix. A spatial lag for y_i is

Wy_i = Σ_j w_ij y_j

Defn 9.20 (Direct Representation of Spatial Autocorrelation). A parsimonious specification of a small number of parameters for the covariance matrix is typically presumed:

Cov[ε_i, ε_j] = σ² f(d_ij, φ)

where ε_i, ε_j are residuals, σ² is the error variance, d_ij is the distance between i and j, and f is a distance-decay function such that ∂f/∂d < 0 and |f(d_ij, φ)| ≤ 1, with φ ∈ Φ being a p × 1 vector.

Defn 9.21 (Moran's I). The generalised Moran's I is a weighted, scaled cross-product

I := [n / (Σ_{i=1}^n Σ_{j≠i} w_ij)] · [Σ_{i=1}^n Σ_{j≠i} w_ij (y_i − ȳ)(y_j − ȳ)] / [Σ_i (y_i − ȳ)²]

Its expected value under the null of no spatial autocorrelation is −1/(n−1).
A test for Moran's I involves shuffling the locations of points and computing I S times. This produces a randomization distribution under H_0. A Monte Carlo p-value is

p̂ = (1 + Σ_{s=1}^S 1{I*_s ≥ I_obs}) / (S + 1)

9.2.1 Spatial Linear Regression

A simple spatial regression is

y = ρWy + Xβ + ε

the solution is

β̂ = (X′X)^{−1} X′(I − ρW)y

Its reduced form is

y = (I − ρW)^{−1} Xβ + (I − ρW)^{−1} ε

The spatial lag term induces correlation between the error and explanatory variables, and thus must be treated as an endogenous variable.
A spatial error model is simply a linear model with a non-spherical but typically parametric structure in the error covariance matrix:

y = Xβ + λWξ + η, η ∼ N(0, σ²I), with composite error ε = λWξ + η and E[εε′] = Ω(θ)

Example 9.2 (Kelly (2020)'s 'Direct' standard errors). A covariance function decomposes into a systematic part and idiosyncratic noise as follows:

Σ_ij = σ² C(∥i − j∥, π) + τ² 1_ij ≡ σ² C_UU + τ² I

where C is a correlation function and ∥i − j∥ is the distance between points i and j. Kelly recommends using a Whittle-Matérn function, defined next. These parameters can be fitted on the error distribution to estimate the covariance matrix.

Defn 9.22 (Covariance Function). A covariance function Cov(Y(x), Y(x′)) describes the joint variability between a stochastic process Y(·) at two locations x and x′. This covariance function is vital in spatial prediction. The fields package includes common parametric covariance families (e.g. exponential and Matérn) as well as nonparametric models (e.g. radial and tensor basis functions).
When modelling Cov(Y(x), Y(x′)), we are often forced to make simplifying assumptions.

• Stationarity assumes we can represent the covariance function as Cov(Y(x + h), Y(x)) = C(h) for some function C : R^d → R, where dim(x) = d.
• Isotropy assumes we can represent the covariance function as Cov(Y(x + h), Y(x)) = C(∥h∥) for some function C : R → R, where ∥·∥ is a vector norm.

Exponential:

Cov(Y(x), Y(x′)) = C(r) = ρ e^{−r/θ} + σ² 1{x = x′}

Matérn:

Cov(Y(x), Y(x′)) = C(r) = ρ [2^{1−ν}/Γ(ν)] (r/θ)^ν K_ν(r/θ) + σ² 1{x = x′}

where K_ν is a modified Bessel function of the second kind, of order ν.
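A sketch of the permutation test for the Moran's I statistic defined earlier on this page, in plain numpy; the function names, the line-graph contiguity weights, and the toy outcome are illustrative, not from the notes.

```python
import numpy as np

def morans_I(y, W):
    """Generalised Moran's I for values y and a spatial weight matrix W with zero diagonal."""
    z = y - y.mean()
    return len(y) / W.sum() * (W * np.outer(z, z)).sum() / (z @ z)

def moran_permutation_test(y, W, S=999, seed=0):
    """Monte-Carlo p-value: permute locations S times, compare to the observed statistic."""
    rng = np.random.default_rng(seed)
    I_obs = morans_I(y, W)
    exceed = sum(morans_I(rng.permutation(y), W) >= I_obs for _ in range(S))
    return I_obs, (1 + exceed) / (S + 1)

# toy example: contiguity weights on a line of 20 points, spatially smooth outcome
n = 20
W = np.zeros((n, n))
idx = np.arange(n - 1)
W[idx, idx + 1] = W[idx + 1, idx] = 1.0
y = np.linspace(0, 1, n) + 0.1 * np.random.default_rng(1).standard_normal(n)
print(moran_permutation_test(y, W))
```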

The Matérn covariance depends on (ρ, θ, ν, σ²), while the exponential depends on (ρ, θ, σ²), where

• θ: the range of the process, the distance at which observations become uncorrelated
• ρ: marginal variance / 'sill'
• σ²: small-scale variation such as measurement error
• ν: smoothness

Defn 9.23 ((Simple) Kriging). The additive spatial model is

Y(x) = Z(x)d + g(x) + ϵ(x)

where Y(·) is an observation, Zd is a deterministic (nonrandom) product of covariates Z with a weight vector d that acts as a mean function, g(·) ∼ N(0, ρK) is a spatial process, and ϵ(·) ∼ N(0, σ²I) is error (i.e. white noise). Thus, we assume that our observations Y(·) are Gaussian; in particular, Y ∼ N(Zd, ρK + σ²I).
For now, we'll assume Zd ≡ 0 and focus on the spatial aspect g(x). This is known as simple kriging, where the mean of the stochastic process is known. Denote our locations of observed points as x_1, ..., x_n.
Suppose we want to predict Y(·) at a new location x_0. The kriging estimate Ŷ(x_0) of our mean-zero Gaussian process Y(·) is given by

Ŷ(x_0) = Σ_0 Σ^{−1} Y

where

• Y = (Y(x_1), ..., Y(x_n))′ is of dimension n × 1
• Σ_0 = (Cov[Y(x_0), Y(x_1)], Cov[Y(x_0), Y(x_2)], ..., Cov[Y(x_0), Y(x_n)]) is of dimension 1 × n
• Σ = (Cov[Y(x_i), Y(x_j)])_{i,j=1}^n is the n × n covariance matrix of Z(·), with ij-th entry equal to Cov[Y(x_i), Y(x_j)]

Now, the question is how to determine the covariance function of Y(·). For this example, we'll consider the (isotropic) exponential covariance: Cov[Y(x), Y(x′)] = ρ e^{−∥x−x′∥/θ}. Here, θ is a range parameter that corresponds to the length of spatial autocorrelation of our process, and ρ is the overall variance of the process.
The covariance matrix Σ thus has i, j-th entry

Σ_{i,j} = e^{−∥x_i−x_j∥/θ} + σ² 1{x_i = x_j}, where 1{x_i = x_j} = 1 if x_i = x_j and 0 else

Fact 9.3 (Workhorse Spatial Regression).

y = Xγ + βWy + WXθ + Wνλ + ε

where βWy captures autocorrelation in Y, WXθ autocorrelation in X, and Wνλ autocorrelation in the errors. Here, W is a weight matrix (typically row-standardised), so WM is a spatial lag of M. In spatial econometrics, the above form nests many popular regressions:

• Spatially Autoregressive (SAR) Model: λ = θ = 0
• Spatially lagged x: β = λ = 0
• Spatial Durbin Model: λ = 0
• Spatial Error Model: β = θ = 0

Fact 9.4 (The Reflection Problem). In the Social Interactions literature (e.g. Manski (1993)), the above expression is written in the form of conditional expectations

y_i = x_i′γ + E[y | w_i] β + E[x_i | w_i] θ + E[ν | w_i] λ + ε_i

where the second, third, and fourth terms are the endogenous, contextual, and correlated effects respectively. In practice, the expectations are replaced with empirical counterparts Ê(y | w_i) = Wy and so on, so the estimation steps are isomorphic.
Define unobservables as υ = Wν + ε, and assume they are uncorrelated with observables x; that is, there is no sorting and no omitted spatial variables. Then we can write

y = Xγ + Wyβ + WXθ + υ

Premultiplying by W gives

Wy = WXγ + WWyβ + WWXθ + Wυ

This shows that Wy is correlated with υ, i.e. E[υ | Wy] ≠ 0, and least-squares estimates of the above regression are biased.
If we assume W is idempotent (by constructing a block-diagonal, transitive matrix), we can simplify the above expression to

Wy = WX (γ + θ)/(1 − β) + Wυ/(1 − β)

Plugging in the definition for Wy,

y = X γ/(1 − β) + WX (γβ + θ)/(1 − β) + υ + Wυβ/(1 − β) =: X γ̃ + WX θ̃ + υ̃

In summary, β, θ cannot be separately identified from the composite parameters γ̃, θ̃. This is the reflection problem of Manski (1993).
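A minimal sketch of the simple-kriging predictor Ŷ(x_0) = Σ_0 Σ^{−1} Y described above, using the isotropic exponential covariance; the locations, parameter values, and the jitter added to the diagonal for numerical stability are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
rho, theta = 1.0, 0.3                      # marginal variance ("sill") and range

# observed locations on [0,1]^2 and one draw of the mean-zero GP at those sites
n = 100
X = rng.uniform(size=(n, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Sigma = rho * np.exp(-D / theta) + 1e-8 * np.eye(n)   # exponential covariance + jitter
Y = np.linalg.cholesky(Sigma) @ rng.standard_normal(n)

# simple kriging at a new point x0: Yhat(x0) = Sigma_0 Sigma^{-1} Y
x0 = np.array([0.5, 0.5])
Sigma_0 = rho * np.exp(-np.linalg.norm(X - x0, axis=1) / theta)
Y_hat = Sigma_0 @ np.linalg.solve(Sigma, Y)
print(Y_hat)
```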

9.2.2 Spatial Modelling

Based on Rue and Held (2005) and various lecture notes.

Defn 9.24 (Conditional / Markov Independence). x_1, x_2 are conditionally independent given x_3 if, for a given value of x_3, learning x_2 gives one no additional information about x_1. The density representation is therefore

f(x) = f(x_1 | x_3) f(x_2 | x_3) f(x_3)

which is a simplification of the general representation

f(x) = f(x_1 | x_2, x_3) f(x_2 | x_3) f(x_3)

Theorem 9.5 (Factorisation Criterion for Conditional Independence).

x ⊥⊥ y | z ⇔ f(x, y, z) = g(x, z) h(y, z)

for some functions g, h and all z with f(z) > 0.

Defn 9.25 (Spatial Gaussian Process (GP)). A spatial process Y(s), s ∈ D ⊂ R², is said to follow a Gaussian process if any realisation Y = (Y(s_1), ..., Y(s_n))′ at a finite number of locations s_1, ..., s_n follows an N-variate Gaussian. More precisely, let µ(s) : D → R denote a mean function returning a mean at location s (typically assumed to be linear in covariates X(s) = (1, X_1(s), ..., X_p(s))′) and C(s_1, s_2) : D² → R^+ denote a covariance function. Then Y(s) follows a spatial Gaussian process, and has density

f_Y(y) = (2π)^{−N/2} |Σ|^{−1/2} exp{−(1/2)(y − µ)′ Σ^{−1} (y − µ)}

where µ = (µ(s_1), ..., µ(s_N))′ is the mean vector and Σ = {C(s_i, s_j)}_{ij} is the N × N covariance matrix. Evaluating this density requires O(N³) operations and O(N²) memory, which means it does not scale well with large datasets. See Heaton et al. (2019) for an overview of alternatives.

Defn 9.26 (Conditional Autoregressions (Besag 1974)). Let x be associated with some property of points (typically location), with no natural ordering of the indices. The joint density of a zero-mean GMRF is specified by each of the full conditionals

x_i | x_{−i} ∼ N(Σ_{j: j≠i} β_ij x_j, κ_i^{−1})

these are called CAR models. The associated precision matrix is

Q = {Q_ij}, with Q_ij = κ_i if i = j and Q_ij = −κ_i β_ij if i ≠ j

which is symmetric and positive-definite.

Defn 9.27 (Gaussian Markov Random Field (GMRF)). A random vector x = (x_1, ..., x_n)′ ∈ R^n is called a GMRF wrt a labelled graph G = (V, E) with mean µ and precision matrix Q > 0 iff its density has the form

f(x) = (2π)^{−n/2} |Q|^{1/2} exp{−(1/2)(x − µ)′ Q (x − µ)}

and Q_ij ≠ 0 ⇔ {i, j} ∈ E ∀ i ≠ j. If Q is completely dense, G is completely connected. In spatial settings, Q is typically sparse [depending on how neighbours are defined].

Example 9.6 (AR1 GMRF).

x_t = ϕ x_{t−1} + ε_t; ε_t ∼iid N(0, 1), |ϕ| < 1

This can be re-expressed as

x_t | x_1, ..., x_{t−1} ∼ N(ϕ x_{t−1}, 1) ∀ t = 2, ..., n

So, for x_s, x_t, 1 ≤ s < t ≤ n,

x_s ⊥⊥ x_t | {x_{s+1}, ..., x_{t−1}} if t − s > 1

In addition to the conditional distributions, also assume the marginal distribution x_1 ∼ N(0, 1/(1 − ϕ²)), which is the stationary distribution of this process. Then the joint distribution of x is

f(x) = f(x_1) f(x_2 | x_1) ··· f(x_n | x_{n−1}) = (2π)^{−n/2} |Q|^{1/2} exp{−(1/2) x′Qx}

where Q is a tridiagonal precision matrix, with 1 in the first and last diagonal entries, 1 + ϕ² in the interior diagonal entries, and −ϕ on the first off-diagonals.
This tridiagonal form is due to the fact that x_i ⊥⊥ x_j given the rest of the sequence whenever |i − j| > 1. This is generally true for any GMRF: Q_ij = 0, i ≠ j ⟹ x_i ⊥⊥ x_j | {x_k : k ≠ i, j}.
While the conditional independence structure is readily apparent from the precision matrix, it isn't evident in the covariance matrix Σ = Q^{−1}, which is completely dense, with entries

σ_ij = ϕ^{|i−j|} / (1 − ϕ²)

Entries of the covariance matrix Σ only give direct information about the marginal dependence structure, not the conditional one.
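A small numerical check of the AR(1) GMRF example above: the tridiagonal precision matrix inverts to the dense covariance with entries ϕ^|i−j|/(1 − ϕ²); the helper name and the values of n, ϕ are illustrative.

```python
import numpy as np

def ar1_precision(n, phi):
    """Tridiagonal AR(1) precision: 1 at the two corners, 1 + phi^2 inside, -phi off-diagonal."""
    Q = np.diag(np.full(n, 1 + phi**2))
    Q[0, 0] = Q[-1, -1] = 1.0
    idx = np.arange(n - 1)
    Q[idx, idx + 1] = Q[idx + 1, idx] = -phi
    return Q

n, phi = 6, 0.7
Q = ar1_precision(n, phi)
Sigma = np.linalg.inv(Q)

# Q is sparse but Sigma is dense, with entries phi^|i-j| / (1 - phi^2)
i, j = np.indices((n, n))
print(np.allclose(Sigma, phi ** np.abs(i - j) / (1 - phi**2)))   # True
```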

Key summary quantities:

• E[x_i | x_{−i}] = µ_i − (1/Q_ii) Σ_{j: j∼i} Q_ij (x_j − µ_j)
• Prec(x_i | x_{−i}) = Q_ii and Corr(x_i, x_j | x_{−ij}) = −Q_ij / √(Q_ii Q_jj), i ≠ j

Fact 9.7 (Markov Properties of GMRFs). Let x be a GMRF wrt G = (V, E). The following are equivalent:

1. Pairwise Markov property: x_i ⊥ x_j | x_{−ij} if {i, j} ∉ E ∧ i ≠ j
2. Local Markov property: x_i ⊥ x_{−{i, ne(i)}} | x_{ne(i)} ∀ i ∈ V
3. Global Markov property: x_A ⊥ x_B | x_C for disjoint sets A, B, C where C separates A, B and A and B are nonempty.

Defn 9.28 (Linear Gaussian Process Models). Let the spatial process at location s ∈ D be

Z(s) = X(s)β + w(s), ∀ s ∈ D

where X(s) collects a p-vector of covariates for site s, and β is a p-vector of coefficients. Spatial dependence can be imposed by modelling {w(s) : s ∈ D} as a zero-mean stationary Gaussian process. Distributionally, this implies that for any s_1, ..., s_n ∈ D, if we let w = (w(s_1), ..., w(s_n))′ and Θ be the parameters of the model,

w | Θ ∼ N(0, Σ(Θ))

where Σ(Θ) is the covariance matrix of an n-dimensional normal density. We need Σ(Θ) to be symmetric and PD for this distribution to be proper. Special cases:

• Exponential covariance matrix: Θ = (ψ, ϕ, κ), Σ(Θ) = ψI + κH(ϕ), where the i, j-th element of H(ϕ) is exp(−∥s_i − s_j∥/ϕ). The 'nugget' ψ is the variance of the non-spatial error, κ dictates the scale, and ϕ dictates the range of the spatial dependence.
• Matérn covariance: Θ = (ψ, κ, ϕ, ν) > 0, for distance x := ∥s_i − s_j∥,

Cov[x; ϕ, ψ, κ, ν] = [κ / (2^{ν−1} Γ(ν))] (2√ν x/ϕ)^ν K_ν(2√ν x/ϕ) if x > 0, and ψ + κ if x = 0

where K_ν(x) is a modified Bessel function of order ν.

Defn 9.29 (Linear GMRF Models). Specifying Σ directly can be awkward when dealing with irregular spatial data [i.e. every real use case]. So, random effects w are modelled conditionally. Let w_{−i} denote the vector of w excluding w(s_i). Model w(s_i) in terms of its full conditional:

w(s_i) | w_{−i}, Θ ∼ N(Σ_{j=1}^n c_ij w(s_j), κ_i^{−1}), i = 1, ..., n

where c_ij describes the neighbourhood structure.

1. Besag (1974) proved that if Q is symmetric PD, with κ_i on the diagonal and −κ_i c_ij off the diagonal, then w | Θ ∼ N(0, Q^{−1}). The simplest version assumes a common precision parameter κ_i = τ.
2. Intrinsic GMRF: f(w | Θ) ∝ τ^{(N−1)/2} exp(−w′Q(τ)w). When c_ij = 1 for neighbours (i.e. an adjacency matrix instead of distances), it simplifies further to

f(w | Θ) ∝ τ^{(N−1)/2} exp(−(1/2) Σ_{i∼j} (w(s_i) − w(s_j))²)

Defn 9.30 (Gaussian Process Spatial GLMs). Let {Z(s) : s ∈ D} and {w(s) : s ∈ D} be two spatial processes on D ⊂ R^d (d ∈ Z^+). Assume the Z(s_i) are conditionally independent given random effects w(s_1), ..., w(s_n), and that the Z(s_i) follow some common distributional form, with

E[Z(s_i) | w] = µ(s_i) ∀ i = 1, ..., n

Let η(s) = h(µ(s)) for some known link function h(·), e.g. h(x) = log(x/(1 − x)) for the logit. Assume a linear form for the projection η(s) = X(s)β + w(s). Spatial dependence enters via w | Θ ∼ N(0, Σ(Θ)), where Σ is often Matérn.
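A sketch of the conditionally specified random effects in Defn 9.29, using a proper-CAR precision Q = τ(D − αA) (a standard positive-definite variant of the intrinsic GMRF written above, which has a singular precision) and sampling w ∼ N(0, Q^{−1}) via the Cholesky factor of Q. The lattice, α, and function names are illustrative assumptions.

```python
import numpy as np

def car_precision(A, tau=1.0, alpha=0.9):
    """Proper-CAR precision Q = tau * (D - alpha * A); alpha < 1 keeps Q positive definite."""
    D = np.diag(A.sum(axis=1))
    return tau * (D - alpha * A)

def sample_gmrf(Q, rng):
    """Draw w ~ N(0, Q^{-1}) from the Cholesky factor of the precision matrix."""
    L = np.linalg.cholesky(Q)                  # Q = L L'
    return np.linalg.solve(L.T, rng.standard_normal(Q.shape[0]))

# rook-neighbour adjacency of a 5 x 5 lattice
m = 5
n = m * m
A = np.zeros((n, n))
for i in range(m):
    for j in range(m):
        k = i * m + j
        if j + 1 < m:
            A[k, k + 1] = A[k + 1, k] = 1.0
        if i + 1 < m:
            A[k, k + m] = A[k + m, k] = 1.0

w = sample_gmrf(car_precision(A), np.random.default_rng(3))
print(w.shape, round(w.std(), 3))
```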

← ToC 92 A Mathematical Background A.1 Proof Techniques 1. Direct Proof / modus ponens: If R is a true statement and R =⇒ S is a true conditional statement, then S is a true statement. Direct proofs typically involve backwards-forwards reasoning - take all statements that follow from R that might relate to S, list them in R. Then, take all statements that follow from S, list them in S. Then, look for statements r, s ∈ R × S that have a straightforward proof, and write proof of the form R =⇒ r =⇒ s =⇒ S. 2. Contrapositive: Since every conditional statement is equivalent to its con- trapositive, proving ¬Q =⇒ ¬P is equivalent to proving P =⇒ Q. 3. Proof by contradiction: Assume P is true, and assume Q is false [i.e. ¬Q is true], and show that ¬Q =⇒ S (using P and other possible intermediate results) where S is known to be false. Conclude that ¬Q must be false, so Q must be true, and we have proved that P =⇒ Q. 4. Induction [only applies to statements pertaining to well ordered sets] N • Assume a base case - P (0) is a true statement • Prove whenever P (k) is true, P (k + 1) is true • Therefore P (n) is true for every n ∈ N Figure 12: Types of Relations A.2 Set Theory A set is a collection of objects. E.g. R, Q, Z, N. A.2.1 Relations Set operations: Given two sets X and Y , any subset of their Cartesian product X × Y is called a binary relation. For any pair of elements (x, y) ∈ R ⊆ RX × Y =⇒ xRy. • Intersection : A ∩ B Properties of binary relations: • Union : A ∪ B • reflexive xRx ∀ x ∈ X • Difference: A \ B := {x : x ∈ A ∧ x∈ / B} • transitive if xRy ∧ yRz =⇒ xRz • cartesian product: A × B := {(a, b): a ∈ A ∧ b ∈ B} • symmetric If xRy =⇒ yRx Defn A.1 (Power set). • antisymmetric If xRy ∧ yRx =⇒ x = y Set of all subsets of S is itself a set. Denoted as P(S) • asymmetric if xRy =⇒ ¬(yRx) |C| > |R| > |Q| > |Z| > |N| • complete if either xRy or yRx or both ∀x, y, z ∈ X Defn A.2 (Equivalence Relations). An equivalence relation R on a set X is a relation that is reflexive, transitive, and symmetric. Given an equivalence relation ∼, the set of elements that are related to a given element a :

← ToC 93 4. Inverse element: ∀x ∈ G, ∃y ∈ G : x ⊗ y = e ∧ y ⊗ x = e. ∼ (a) := {x ∈ X : x ∼ a} ∀ ∈ G ⊗ ⊗ a ∼ 5. if additionally x, y : x y = y x, then G is an Abelian/Commutative is called the equivalence class of . e.g. Indifference preference relation is an group equivalence relation, but the preference relation ⪰ is not because it isn’t symmetric. Z Rm×n R\{ } Defn A.3 (Order Relations). ( , +), ( , +)( 0 ,.) are all groups A relation that is reflexive and transitive but not symmetric is called an order re- x ⪰ y ≻ Defn A.6 (Vector Spaces). lation: . This is also called a weak order. is not an order relation because it V = (V, +, ·) is not reflexive (and is called a strong order). Every order relation also induces an A real valued vector space is a vector space with two operations equivalence relation: x ∼ y ⇔ x ⪰ y ∧ y ⪰ x + : V × V→V · R × V→V An ordered set (X, ⪰) consists of a set X together with an order relation ⪰ defined : on X. Where A.2.2 Intervals and Contour Sets 1. (V, +) is an Abelian Group Given an ordered set and two elements a, b ∈ X s.t. b ⪰ a, we can define 2. Distributivity • The open interval (a, b) : set of all elements strictly between a and b. • ∀λ ∈ R, x, y ∈ V : λ · (x + y) = λ · x + λ · y { ∈ • The closed interval [a, b] : set of all elements between a and b s.t. [a, b] x • ∀λ, ψ ∈ R, x ∈ V :(λ + ψ) · x = λ · x + ψ · x X : a ≼ x ≼ b} 3. Associativity : ∀λ, ψ ∈ R, x ∈ V : λ · (ψ · x) = (λψ) · x Analogously, for arbitrary ordered sets, (X, ⪰) we can define 4. Neutral Item wrt outer operation: ∀x ∈ V : 1 · x = x • Upper contour set ⪰ (a) := {x ∈ X : x ⪰ a} : set of all elements that follow or dominate a A.3 Analysis and Topology • Lower contour set ≼ (a) := {x ∈ X : x ≼ a} : set of all elements that preced a in the order ⪰ Preliminaries: Vectors : x := (x1, ··· , xk), where xi ∈ R A partial order is a relation that is reflexive, transitive, and antisymmetric. A.3.1 Metric Spaces Defn A.4 (Meet and Join). ∨ The join of a partially ordered set S is the supremum and is denoted S. max(a, b) Defn A.7 (Euclidian Distance). is sometimes written a ∨ b. ∧ ( )1/2 The meet of a poset is the infimum and is denoted S. min(a, b) is sometimes ∑k ∥ − ∥ − 2 written a ∧ b. d2(x, y) := x y 2 := (xi yi) i=1

A.2.3 Algebra 2 k Requirements for a metric(e.g. d2 : R × R→R ∀x, y, v ∈ R ): Defn A.5 (Groups). A set G and an operation ⊗ : G × G→G defined on G. Then G := (G, ⊗) is called a • d2(x, y) = 0 ⇔ x = y : a point is at zero distance from itself group if the following conditions hold: • d2(x, y) = d2(y, x) : distance is symmetric 1. Closure of G under ⊗: ∀x, y ∈ G, x ⊗ y ∈ G • d2(x, y) ≤ d2(x, v) + d2(v, y) : triangle inequality 2. Associativity: ∀x, y, z ∈ G, (x ⊗ y) ⊗ z = x ⊗ (y ⊗ z) We can generalise this definition to arbitrary nonempty sets S. 3. Neutral element: ∃e ∈ G∀x ∈ G s.t. x ⊗ e = e ⊗ x = x

← ToC 94 Defn A.8 (Metric Spaces). • ∞: Chebychev A metric space is a nonempty set S and a metric of distance ρ : S ×S→R∀x, y, v ∈ S s.t. Defn A.12 (Frobinius Norm). Frobinius Norm of a matrix A is v • ρ(x, y) = 0 ⇔ x = y u u∑M ∑N √ ∥ ∥ t 2 ′ • ρ(x, y) = ρ(y, x) A Fr = aij = trace(A A) i=1 j • ρ(x, y) ≤ ρ(x, v) + ρ(v, y) Defn A.13 (Sup, Inf). k k For example, (R , d2) is a metric space. Many additional metric spaces in R are a, b ∈ R, [a, b] denotes the set of real numbers satisfying a ≤ x ≤ b. ( or ) denotes generated by a norm. a strict inequality (i.e. closed from above or below). If S ⊂ R is bounded from above, ∃y s.t. x ≤ y ∀x ∈ S.. Then y is the least upper Defn A.9 (Norm). bound or supremum of sup {x : x ∈ S}. If S is not bounded from above, we write ⊆ Rk ∋ 7→ ∥ ∥ ∈ R ∀ ∈ Rk ∈ R ∞ A norm on X is a mapping X x x s.t. x, y and γ supx∈S = . satisfying Similarly, the greatest lower bound of a set or infimum is denoted infx∈S (x)∨inf {x : x ∈ S}

• Nonnegativity: ∥x∥ ≥ 0 ∀x inX Defn A.14 (Sequences, Liminf, Limsup). { }∞ { } A sequence x1, x2, . . . xn is denoted by xi i=1 or xi when the range of the indices • Non degeneracy: ∥x∥ = 0 ⇔ x = 0 is clear. Let {x } be an infinite sequence of real numbers and ∃S s.t. (1)∀ε > 0, ∃N s.t. ∀n > ∥ ∥ | | ∥ ∥ i • Homogeneity: γx = γ x N, xn < S + ε and (2)∀ε > 0 and M > 0, ∃n > M s.t. xn > S − ε. Then, S is the lim sup {x }. • Triangle Inequality: ∥x + y∥ ≤ ∥x∥ + ∥y∥ n If {xn} is not Bounded from above, lim sup xn = ∞. ∥ ∥ Rk Rk ∥ − ∥ Each norm . ∑on generates a metric ρ on via ρ(x, y) := x y . Defn A.15 (Cauchy Criterion). ∥ ∥ k 2 1/2 ∀ E.g. x 2 := ( i=1 xi ) generates euclidian distance d2. A sequence (xn) in a metric space (S, ρ) is said to be a Cauchy sequence if, ϵ > The pair (X, ∥·∥) consisting of a vector space X together with a norm ∥·∥ is called 0, ∃N ∈ N s.t. ρ(xj, xk) < ϵ whenever j, k ≥ N (intuitively, points in a Cauchy a normed linear space. sequence get tighter together).

k Defn A.10 (Banach Space). Let (xn) be a sequence of vectors in R . Suppose for any ϵ > 0, ∃n ∈ N s.t. ∀p, q > ∥·∥ p q A banach space(X, ) is a normed linear space that is a complete(in the Cauchy- n, ρ(x , x ) < ϵ. Then, (xn) has a limit. { }∞ → ∀ ∃ ∀ ≥ | − | convergence sense) metric space with respect to the metric derived from its norm. More basic definition: an n=1 A if ϵ > 0, N s.t. n N, an A < ϵ

Defn A.11 (p-norm). • {anbn}→AB Also known as Minkowski Norm ∥ ∥ ∥ ∥ A class of norms that includes . 2 as a special case is the . p defined by • {a + b }→A + B ( ) n n ∑k 1/p ∥ ∥ | |p ∈ Rk Sequences x p := xi x Let S = (S, ρ) be a metric space. A sequence (xn) ⊂ S is said to converge to i=1 x ∈ S > 0, ∃N ∈ N s.t. n ≥ N =⇒ ρ(x , x) < ϵ. ∥ ∥ Rk ∥ − ∥ ∀ ∈ n . ps give rise to a class of metric spaces ( , dp) where dp(x, y) := x y p x, y Rk. Theorem A.1. A sequence in (S, ρ) can have at most one limit Examples: Defn A.16 (ϵ ball). • 1: Taxicab centered on x ∈ S with radius ϵ > 0 is the set { ∈ } • 2: Euclidian B(ϵ, x) := z S : ρ(z, x) < ϵ

← ToC 95 Set Definitions Theorem A.2 (Bolzano-Weierstrass). (Rk, d ) Defn A.17 (Bounded Set). Every bounded sequence in euclidian space 2 has at least one convergent A subset E of S is called bounded if E ⊂ B(n, x) for some x ∈ S and some suitably subsequence. ∈ N large n (intuition - some arbitrarily large ϵ ball can fit E inside it). Theorem A.3 (Heine-Borel). A sequence (xn) in S is called bounded if its range {xn : n ∈ N} is a bounded set. k A subset (R , d2) is precompact in the same iff it is bounded and compact. ⇔ ∧ Defn A.18 (Closed Set). IOW : Compact Closed Bounded F ⊂ S F A set is closed IFF for every convergent sequence contained in , the limit Theorem A.4. of the sequence is also in F . All metrics on Rk induced by a norm are equivalent. A closed set contains all its limit points. That is , if (xk) is a convergent sequence of points in S, then lim →∞ x is in S as well. k k A.4 Functions Defn A.19 (Open Set). A function f from set A to B, written as A ∋ x 7→ f(x) ∈ B or f : A→B is a rule A subset of an arbitrary metric space S is open iff its complement is closed, and associating every element in A to one and only one element in B. The point b is closed iff its complement is open. − also written as f(a), and is called the image of a under f. For D ⊂ B, the set f 1(D) A set S ∈ Rk is called open if, ∀x ∈ S∃ϵ > 0 s.t. y ∈ B(ϵ, x), ρ(x, y) < ϵ is in S. is the set of all points in A that map into D under F, and is called the preimage of D −1 If F is a closed, bounded subset of (R, ∥.∥), then sup F ∈ F . under F. f (D) := {a ∈ A : f(a) ∈ D} a function f : A→B is called • A set S ⊂ Rk is open iff its complement is closed. • injective / one-to-one if distinct elements of A are always mapped into distinct • the union of any number of open sets is open elements of B f • the intersection of a finite number of open sets is open. • surjective / onto if every element of B is the image under of at least one point in A • the intersection of any number of closed sets is closed • bijective if a function is both injective and surjective • the union of a finite number of closed sets is closed. Defn A.24 (Continuous functions). A real valued function f on Rk is continous at point a if ∀ϵ > 0, ∃δ > 0 s.t. ρ(x, a) < Defn A.20 (Boundary and Closure). δ =⇒ |f(x) − f(a)| < ϵ. A point x ∈ S is called an interior point of S if the set {y : ρ(y, x) < ϵ} is contained Equivalently, lim → f(x) = f(a) in S for all ϵ > 0 sufficiently small. A point is called a boundary point if {y : x a S ⊂ Rk ∀a ∈ S ∧ ∀ϵ > 0, ∃δ > ρ(y, x) < ϵ}∩Sc is non-empty for all ϵ > 0 sufficiently small. The set of all boundary A function is said to be continuous on the set if, 0 s.t. ∀{x : ρ(x, a) < δ}, |f(x) − f(a)| < ϵ. Equivalently, in lim → f(x), we require points in A is denoted by ∂A. x a the sequence of points that converge to a to be entirely in S. The closure of a set S is the set S combined with all points that are the limits of S sequence of points in . • The sum of two continuous functions is continuous Defn A.21 (Complete Set). • The product of two continuous functions is continuous A subset A ⊂ S is said to be complete iff every cauchy sequence in A converges to some point in A. • The quotient of two continuous functions is continuous at any point where the denominator is nonzero Defn A.22 (Compact Set). The set K ⊂ S is called compact if every sequence contained in K has a subse- Defn A.25 (ϵ, δ definition of limit). K quence that converges to a point in . lim f(x) = L ⇐⇒ ∀ε > 0, ∃δ > 0, s.t. 
0 < |x − c| < δ ⇒ |f(x) − L| < ε, where the limit is taken as x → c.

Defn A.23 (Convex Set). A set S ⊂ R^k is called convex if, ∀ λ ∈ [0, 1] and a, a′ ∈ S, we have λa + (1 − λ)a′ ∈ S (i.e. all convex combinations of two points in the set are also in the set).

← ToC 96 Defn A.26 (Lipshitz Continuity). Fact A.9 (Restrictiveness of Function Classes). ⊂ ⊂ ∃ Given two metric spaces (X , ρX ), (Y, ρY ), a function f : X →Y is called Lipshitz Differentiability Continuity Limit i.e. not all functions with limits are continuous if ∃K ∈ R s.t. ∀x1, x2 ∈ X , continuous, not all continuous functions are differentiable. More generally, ρ (f(x ), f(x )) ≤ Kρ (x , x ) Y 1 2 X 1 2 Continuously Differentiable ⊂ Lipshitz Continuous ⊂ α− Holder Continuous such a K is referred to as a Lipshitz constant for the function f. ⊂ Uniformly Continuous ⊂ Continuous A Real valued function f : R→R is Lipschitz if ∃K > 0 such that Fact A.10 (Properties of differentiable functions). |f(x1) − f(x2)| ≤ K|x1 − x2| This limits how fast a function can change. Every function that has bounded first- • Linearity: f, g : X→Y are differentiable at x, then f + g and αf are differ- derivatives is Lipshitz continuous. A differentiable function is Lipshitz if and entiable at x with ∇ (f + g)(x) = ∇ f (x) + ∇ g (x); ∇ αf (x) = α∇ f (x) only if it has a bounded derivative. • Chain Rule: g · f differentiable with ∇ g · f (x) = ∇ g (f(x)) · ∇ f (x) Defn A.27 (Holder Continuity). A function defined on X is said to be Holder of order α > 0 if ∃M ≥ 0 such that Theorem A.11 (Rolle’s theorem). f :[a, b]→R f f(a) = f(b) =⇒ ∃c ∈ ρ (f(x), f(y)) ≤ Mρ (x, y)α ∀x, y ∈ X Let , is continuous and differentiable. Y X [a, b] s.t. f ′(c) = 0. this is also called Uniform Lipshitz. Theorem A.12 (Mean Value theorem). Theorem A.5 (Continuity). f :[a, b]→R, f is continuous and differentiable. Then, S→Y f −1(G) G ⊂ Y A function f is continuous iff the preimage of every open set ′ f(b) − f(a) is open in S. f (c) = b − a Defn A.28 (Continuous function). Theorem A.13 (Taylor’s theorem). f ∀ϵ > 0 ∃δ > 0 |x − x | < δ ∀ x =⇒ |f(x) − f(x )| < ϵ is continuous if , s.t. 0 0 . f : R→R admits to Taylor expansion around a such that ∞ Theorem A.6. f ′(a) f ′′(a) ∑ f n(a) Let function f S→Y , where S, Y are metric spaces and f is continuous. If K ⊂ S f(x) = f(a) + (x − a) + (x − a)2 ··· = (x − a)n 1! 2! n! is compact, then so is f(K), the image of K under f. n=0 Rk→R Example A.7 (Gamma Function). For a function with multiple arguments f : , the second-order Taylor ex- ∫ ∫ pansion around the point x0 is ∞ 1 − Γ[α] := tα−1e−tdt = (log(1/t))α 1 dt 1 f(x) ≈ f(x0) + (x − x0) · ∇f(x0) + (x − x0)H(x)(x − x0) 0 0 2 Beta function : B(α, β) = Γ(α) · Γ(β)/Γ(α + β) Defn A.30 (Epigraph of a function). Theorem A.8 (Weirstrass Maximum Theorem). The epigraph of a function f is epif := {(x, t):} f(x) ≤ t (i.e. area above the Let f : k→R, where K ⊂ (S, ρ) (an arbitrary metric space). If f is continuous and function). K is compact, then f attains its supremum and infimum on K. In case of continuous functions on compact domains, optima always exist. Defn A.31 (Concavity and Convexity for Real-valued functions). Let f :[a, b]→R, x, y ∈ [a, b]; t ∈ (0, 1). Then, Defn A.29 (Differentiability). − ≤ − ′′ ≥ The function f : R→R is differentiable at x0 if • F is convex if f((1 t)x + ty) (1 t)f(x) + tf(x). f 0: i.e. the epigraph − of f is a convex set. ∃ (f(x) f(x0)) ′ lim R(x) = = f (x0) − ≥ − ′′ ≤ (x − x0) • F is concave if f((1 t)x + ty) (1 t)f(x) + tf(x). f 0

, i.e. (f(x) − f(x_0))/(x − x_0) has a limit as x → x_0. The derivative of f at x_0 is this limit and is denoted f′(x_0) or ∂f/∂x |_{x=x_0}.
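A small numerical illustration of Taylor's theorem from this page: the remainder of a second-order expansion shrinks like O(h³). Using f = exp around a = 0.5 keeps the derivatives trivial; the helper taylor2 and the step sizes are illustrative.

```python
import numpy as np

f, a = np.exp, 0.5   # all derivatives of exp at a equal exp(a)

def taylor2(x):
    """Second-order Taylor expansion of exp around a."""
    return f(a) * (1 + (x - a) + 0.5 * (x - a) ** 2)

for h in [0.5, 0.25, 0.125, 0.0625]:
    err = abs(f(a + h) - taylor2(a + h))
    print(h, err, err / h**3)    # err / h^3 is roughly constant: the remainder is O(h^3)
```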

← ToC 97 A.4.1 Fixed Points • Null empty-set: µ(∅) = 0 Defn A.32 (Fixed Point). • Non-Negativity: ∀A ∈ A, 0 ≤ µ(A) ≤ ∞ =⇒ µ : A→[0, ∞] Let T : S→S, where S is any set. An x∗ ∈ S is called a fixed point of T on S if ∗ ∗ T x = x . • Countable Additivity: If A1,A2,... are disjoint elements of A (i.e. Ai ∩Aj = ⊂ R ∅ ∀i ≠ j), ( ) If S , then fixed points of T are those points in S where T meets the 45 degree ∪∞ ∑∞ line. µ Ai = µ(Ai) Theorem A.14 (Brouwer’s Fixed Point Theorem). i=1 i=1 Consider the space (Rk, d), where d is the metric induced by any norm. Let S ⊂ Rk, and let T : S→S. If T is continuous and S is both compact and convex, then T has Existence ensured by Caratheodory’s Extension Theorem. at least one fixed point in S. Examples of measures µ: Defn A.33 (Mapping Categories). • If X is countable, let µ(A) = #A = number of points in A. This counting Let (S, ρ) be a metric space. T : S→S is a map. it is called measure can be defined for any subset A ⊂ X , then the σ− field A is the collection of all subsets of X =: X = 2X , the power set of X . • nonexpansive if ρ(T x, T y) ≤ ρ(x, y) ∀x, y ∈ S ∫ ∫ k • If X = R , define µ(A) = ...A dx1 . . . dxk • contracting if ρ(T x, T y) < ρ(x, y) ∀x, y ∈ S, x ≠ y Defn A.36 (Borel Set). • uniformly contracting with modulus λ ∈ [0, 1) if ρ(T x, T y) < λρ(x, y) ∀x, y ∈ Given a topology on Ω, a Borel σ-field is a σ field generated by the family of open S, x ≠ y subsets of Ω, i.e. the smallest σ field that contains all the open sets. The Lebesgue measure of a set A can be defined implicitly for any set B ina σ− field Theorem A.15 (Hahn-Banach Fixed Point Theorem). B called the Borel sets of Rn. B is the smallest σ− field that contains all ‘rectangles’ T : S→S (S, ρ) T Let , where is a complete metric space. If is a uniform contraction × · · · × { ∈ Rn } on S with modulus λ, then T has a unique fixed point x∗ ∈ S. Moreover for every (a1, b1) (an, bn) := x : ai < xi < bi, i = 1, . . . , n n ∗ n ∗ n ∗ x ∈ S and n ∈ N, we have ρ(T x, x ) ≤ λ ρ(x, x ) =⇒ T x→x as n→∞ Example A.16 (Uncountable Sample Spaces). Suppose Ω = R. We say that I ⊂ R is a bounded interval if ∃a, b, a < b s.t. I ∈ A.5 Measure {[a, b], (a, b), (a, b], [a, b)}. Define C1 := {I ⊂ R,I is a bounded interval}, the small- est σ− algebra that contains C1 is denoted by B1 and is called the Borelian σ−algebra Defn A.34 (σ− field / Event Space). σ A σ-algebra (also σ-field) is a collection F of subsets of Ω that Thus, a countable union of open intervals belongs to the Borelean -algebra. Since every open subset of R can be written as a countable union of open intervals, it is Ω ∈ F ∧ ∅ ∈ F Ω therefore also a Borelean set. The closed subsets are Borelean since a closed set is • : includes itself and the null set the complement of an open set. ∈ F ⇒ − : C ∈ F • A = Ω A = A : is closed under complement Defn A.37 (Lebesgue Measure). ∪ ∞ Basic problem: how to assign each subset of Rk, i.e. each element of P(Rk) a real • A1,A2, · · · ∈ F =⇒ Ai ∈ F : is closed under countable unions. i=1 number that will represent its ‘size’. Rn This is effectively the definition of the event space S for a sample space Ω. With , n = 1, 2, 3, µ(A) is the length, area, or volume of A, respectively. µ is a Lebesgue measure on Rk. Defn A.35 (Measure). A measure µ on a set X assigns a nonnegative value µ(A) to many subsets of X . Defn A.38 (Measure Space). 
A − X X A For a collection F subsets of Ω, a measure is a map If is a σ field of subsets of , the pair ( , ) is called a measurable space, and µ A (X , A, µ) µ : F→[0, ∞] if is a measure on , the triple is called a measure space. Given A ∈ F, µ(A) is a measure of the ‘size’ of set A. Defn A.39 (Probability Space). A function µ on a σ− field A of X is a measure of A measure µ is called a probability measure if µ(X ) = 1, and then the triple (X , A, µ) is called a probability space.
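A sketch of the uniform-contraction fixed-point result stated earlier on this page: iterating T(x) = cos x on [0, 1], which maps the interval into itself with |T′(x)| = |sin x| ≤ sin 1 < 1, converges to the unique fixed point. The helper name and tolerance are illustrative.

```python
import numpy as np

def fixed_point(T, x0, tol=1e-12, max_iter=1_000):
    """Iterate x <- T(x); for a uniform contraction this converges to the unique fixed point."""
    x = x0
    for _ in range(max_iter):
        x_new = T(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

x_star = fixed_point(np.cos, 0.5)
print(x_star, np.cos(x_star) - x_star)   # ~0.739085..., residual ~0
```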

← ToC 98 Defn A.40 (Measurable functions). ∫ ( ) If (X , A) is a measurable space and f is a real-valued function on X , f if measure- b ∑N j b − a f(x)dx = lim f a + (b − a) able if →∞ a N N N f −1(B) := {x ∈ X : f(x) ∈ B} ∈ A j=1 for every Borel set B. Theorem A.17 (Fundamental Theorem of∫ Calculus). x If f is continuous on [a, b] then F (x) := a f(t)dt is differentiable on the open interval (a, b) and F ′(x) = f(x) ∀x ∈ (a, b). F is called the anti-derivative. A.6 Integration ∫ b An integral is a map assigning a number to a function, where the number is viewed f(t)dt = F (b) − F (a) as the area/volume ‘under’ the function. Given∫ a measure space (Ω, F, µ) and a a →R measureable function f :Ω , an integral fdµ is a map from f to number such Defn A.43 (Indicator Function). that the following three properties hold Given a measurable space (Ω, F) and a set E ∈ F, we define an indicator function ∫ f :Ω→R defined by ≥ ≥ e • If f 0, then fdµ 0 1 ∫ ∫ fE(ω) = ω∈E • ∀a ∈ R, aϕdµ = a ϕdµ This function is measurable. ∫ ∫ ∫ (f + g)dµ = fdµ + gdµ Defn A.44 (Integration of Simple∑ Functions). • n 1 ∀ ∈ R∧ E ∈ F Any function f of the form f(ω) i=1 ai ω∈Ei ai E1,... n constitutes Defn A.41 (Riemann Sums). a finite partition of Ω. A countable sum of measurable functions is measurable, Suppose f is a bounded function defined on [a, b]. An increasing sequence P := which implies that f is measurable. Then we define ∫ n {a = x0 < x1 < x2 < ··· < xn = b} defines a partition of the interval. The mesh ∑ size of the partition is defined to be fdµ := aiµ(Ei) i=1 |P | = max {|xi − xi−1| : i = 1,...,N} To each partition we associate two approximations of the area under the graph of Defn A.45 (Lebesgue Integral / Integration of Measurable Functions). f, by the rules For any measurable function f, define f + := max {f, 0} and f − := max {−f, 0}, + − + − ∑N which∫ are also measurable.∫ We also have f = f − f and |f| = f − f . When − f −dµ f −dµ U(f, P ) := sup f(x)(xj xj−1) either or is∫ finite, we∫ define∫ the integral ∈ j=1 x [xj−1,xj ] fdµ := f +dµ − f −dµ ∑N ∫ ∫ − − − L(f, P ) := inf f(x)(xj xj−1) When both f dµ and f dµ are finite, we say f is integrable w.r.t. µ. Lebesgue x∈[x − ,x ] j=1 j 1 j integrals intuitively slice the function horizontally, while Reimann integrals slice these are called the upper and lower Riemann Sums For any partition, U(f, P ) ≥ vertically. L(f, P ). Defn A.42 (Riemann Integrability). A.7 Probability Theory A bounded function f defined on an interval [a, b] is said to be Riemann integrable Defn A.46 (Kolmogorov Axioms). if (Ω, S,P ) ∫ The triple is a probability space if it satisfies the following b inf U(f, P ) = sup L(f, P ); f(x)dx =: Reimann Integral • Unitarity: Pr (Ω) = 1 P P a Suppose f is piecewise continuous, defined on [a, b]. Then, f is Reimann integrable • Non Negativity: ∀s ∈ S, Pr (a) ≥ 0 Pr (a) ∈ R ∧ Pr (a) < ∞ and

← ToC 99 • Countable Additivity: If A1,A2,..., ∈ S are pairwise disjoint[i.e. ∀i ≠ j, Ai ∩ Defn A.50 (Cumulative Distribution Function). ∅ F R→ Aj = ], Then ( ) Map : [0, 1] such that ∪∞ ∑∞ FX (x) = P (X ≤ x) = P ({e ∈ E : X(e) ≤ x}) =: Px((−∞, x]) P A = P(A ) i i for x ∈ R i=1 i=1 Properties of CDF F (u) Other properties for any event A, B • Boundary property: limu→−∞ F (u) = 0, limu→∞ F (u) = 1 • A ⊂ B =⇒ Pr (A) ≤ Pr (B) • Nondecreasing: F (x) ≤ F (y) if x ≤ y • Pr (A) ≤ 1 • Right continuous: limu↓x F (u) = F (x) • Pr (A) = 1 − Pr (Ac) A.7.1 Densities • Pr (∅) = 0 Theorem A.18 (Radon-Nikodym). Stated differently: If a finite measure P is absolutely continuous wrt a σ− finite measure µ, then ∃ a f Defn A.47 (Probability). nonnegative measurable function ∫s.t. ∫ Let P be a probability measure on a measurable space (E, B), so (E, B,P ) is a prob- 1 ability space. Sets B ∈ B are called events, points e ∈ E are called outcomes, and P (A) = fdµ =: f Adµ A P (B) is called the probability of B. The function f in this theorem is called the Radon-Nikodym derivative of P with Let (E, B) be a measurable space. Let P : B→[0, 1] be a ‘set’ function mapping the respect to µ, or the density of P with respect to µ, denoted σ− algebra of subsets of E into the real line. We say P is a probability measure if, ∈ E dP for events A, B , f = dµ 1. 0 ≤ Pr (A) ≤ 1: Events range from never happening to always happening Defn A.51 (Absolutely Continuous Random Variables). E 2. Pr ( ) = 1: Something must happen If a random variable has density p wrt Lebesgue measure on R, then X or its dis- P p 3. Pr (∅) = 0: Nothing never happens tribution X is called absolutely continuous with density∫ . Then, from R-N , x c ≤ −∞ 4. Pr (A) + Pr (A ) = 1: A must either happen or not happen FX (x) = P (X x) = PX (( , x]) = p(u)du ∑ −∞ ∪∞ ∞ − 5. Pr ( n=1An) = n=1 Pr (An): σ additivity for countable disjoint events Using the fundamental theorem of calculus, p can be found from the CDF FX by ∑ p(x) = F ′ (x) ∪∞ ≤ ∞ differentiation, X . • Boole’s Inequality Pr ( n=1An) n=1 for any sequence of events Defn A.52 (Discrete Random Variables:). ⊆ ⇒ ≤ 6. Monotonicity: for events A, B; A B = Pr (A) Pr (B) Let X0 be a countable subset of R. The measure µ := µ(B) = #(X ∩ B) for borel sets B is also called counting measure on X . Then, Defn A.48 (Random Variable). ∫ 0 →R ∀ ∈ R { ∈ ≤ } ∈ E ∑ A measurable function X :Ω s.t. r , ω Ω: X(ω) r (event space) fdµ = f(x) is called a random variable. In other words, a random variable is a function from ∈X the sample space to the real line R, and the probability of its value being in a given x 0 ∈ X X interval is well defined. Suppose X is a random variable s.t. P (X 0) = PX ( 0 = 1). Then, X is called a discrete random variable. Defn A.49 (Probability Distribution). p P µ The density of X w.r.t. satisfies ∫ The probability measure PX living on (R, B(R)) such that for any A ∈ B(R), ∑ ∈ 1 PX (A) = P ({e ∈ E : X(e) ∈ A}) =: P (X ∈ A) P (X A) = PX (A) = pdµ = p(x) A(x) A ∈X for Borel sets A is called the distribution of X. The notation X ∼ Q is used to x 0 indicate that X has distribution Q =⇒ PX = Q. In particular, if A = {y} s.t. y ∈ X0, then X ∈ A ⇔ X = y, and so

← ToC 100 ∑   E P (X = y) = p(x)1{ }(x) = p(y) [X1] y   ∈X E [X] = . x 0 . The density p is called the mass function for X. E [Xn] A random vector is said to be absolutely∫ ∫ continuous∫ if the CDF can be written as A.7.2 Moments x1 x2 xn F (x1, x2, . . . , xn) = ··· f(z1, . . . , zn)dz1 . . . dzn Defn A.53 (Expectation:). −∞ −∞ −∞ If X is a random variable on a probability space (E, B,P ), then the expectation of X ∼ PX (i.e. density p), is defined as Defn A.55 (Random Matrices). ∫ ∫ ∫ A matrix W is called a random matrix if the entries Wij are random variables. E [X] := XdP = xdP (x) = xp(x)d(x) X Defn A.56 (Covariance). ∈ X X The covariance of a random vector X is the matrix of covariances of the variables in For discrete RV X with P (X 0) = 1 for a countable set 0, if µ is counting X measure on X0, and p is the mass function given by p(x) = P (X = x), ∫ ∫ ∑ [Cov(X)]ij = Cov(Xi,Xj) ′ E [X] := xdPX (x) = xp(x)dµ(x) = xp(x) If µ = E [X] and (X − µ) is the transpose of the mean deviation, then ∈X ′ x 0 E − E − − Cov(Xi,Xj) := [Xi µi](Xjµj) = [(X µ)(X µ) ]ij Defn A.54 (Variance). so E − − ′ E ′ − ′ The variance of a random variable X with finite expectation is defined as Cov(X) = [(X µ)(X µ) ] = [XX ] µµ V [X] = E [X − E [X]]2 A.7.4 Product Measures and Independence X p If is absolutely continuous with∫ density , Let (X , A, µ) and (Y, B, ν) be measure spaces. Then ∃ a unique product measure V [X] = (x − E [X])2p(x)dx µ × ν on (X × Y, A ∨ B) such that (µ × ν)(A × B) = µ(A)ν(B) ∀A ∈ A,B ∈ B. The σ-field A ∨ B is defined formally as the smallest σ− field containing all sets A × B If X is discrete with mass function p, ∈ A ∈ B ∑ with A , B . V − E 2 [X] = (x [X]) p(x) Theorem A.19 (Fubini). ∈X x 0 Integration against the product measure µ × ν can be accomplished by iterated integration against µ and ν, in either order ∫ ∫ [∫ ] ∫ [∫ ] A.7.3 Random vectors d(µ × ν) = f(x, y)dν(y) dµ(x) = f(x, y)dµ(x) dν(y) If X ,...,X are random variables, then the function X : E→Rn defined by 1 n   X (e) 1 Defn A.57 (Independence of random variables). X(e) =  .  , e ∈ E →R ≤ ≤ . Suppose Xi :Ω , 1 i m are random variables. Xn(e) X1,X2,...,Xm are indepepndent for all B1,B2,...,Bm Borel subsets of R, it is is called a random vector. The definitions above extend naturally to random vectors, true that e.g. the distribution PX of X is Pr (Xi ∈ Bi, ∀i 1 ≤ i ≤ m) = Pr (X1 ∈ B1) ... Pr (Xm ∈ Bm)

PX (B) = P (X ∈ B) := P ({e ∈ E : X(e) ∈ B}) for Borel sets B ∈ Rn. The expectation of a random vector X is the vector of A.7.5 Conditional Expectations expectations Defn A.58 (Conditional Probability). Given two r.v.s X,Y with finite second moments, E [Y |X] is defined as a σ(X)− measurable function m(X) such that

← ToC 101 ( ∑ ) 2 F n m(.) argmin EP [Y − m(X)] (m)(t) = i=m h niF(t)i(1 − F(t))n−i , −∞ < t < ∞ For continuous X,Y ∈ R2 with pdf f(x, y), For any y∗ ∈ R, the conditional proba- ∗ ∗ Severini (2005, chap 7). bility of the event {Y ≤ y } := PY |X (Y ≤ y |x) is defined as a function satisfying ∫ ∗ x Example A.21 (Max of n iid U [0, 1] vars). ∗ ∗ ∗ ∗ n−1 PY |X (Y ≤ y |x)fX (dx) = PXY (X ≤ x ,Y ≤ y ) ∀x ∈ R has a distribution with denisty g(r) = nr −∞ The conditional CDF is Example A.22 (Distribution of second highest value Y 2 ). ∫ ∗ y (Useful for Vickrey auctions) ∗ ∗ f(x, y) F | (y ) = P | (Y ≤ y |x) = dy m m−1 Y X Y X F 2 (r) = F (r) + m(1 − F(r))F (r) −∞ fX (x) Y | {z } m−2 f 2 (r) = m(m − 1)(1 − F(r))F (r)f(r) conditional density f(y|x) Y Defn A.59 (Bounded Lipshitz Distance). A.8 Linear Functions and Linear Algebra Let X ∼ FX ; Y ∼ FY . Define the the class of ‘Bounded Lipshitz functions’ with A.8.1 Linear Functions Lipshitz constant 1{ as } R→R | − | ≤ | − | ∧ | | ≤ Defn A.60 (Linear Function / Homomorphism). BL(1) := h : s.t. h(x) h(x) x y sup h(x) 1 f : X→Y X,Y x∈R A function between two linear spaces is linear if it preserves th elinearity of sets X and Y through the following properties (∀ x , x ∈ X) Then the Bounded Lipshitz Distance is 1 2 |E − E | dBL(FX ,FY ) := sup FX [h(X)] FY [h(Y )] • Additivity f(x1 + x2) = f(x1) + f(x2) h∈BL(1) • Homogeneity f(αx1) = αf(x1) FX and FY are said to be ‘close’ if dBL(FX ,FY ) is small. • f : V →W linear and bijective is called an isomorphism A.7.6 Order Statistics R Fact A.20 (Distribution of Order Statistics). • A linear function that maps onto is called a linear functional. F ∈ Suppose X1,X2,...,Xn are iid r.v.s with distribution x (). To each ω Ω define • A linear function that maps onto itself is called a linear operator / automorphism. max {X1,...,Xn} (ω) = max {X1(ω),...,Xn(ω)}. We want the distribution F of { } max X1,...,Xn Every linear function mapping from a finite-dimensional domain X can be rep- resented by a matrix. G(r) = Pr ({ω ∈ Ω; max {X1,...,Xn} ≤ r}) ∏N Fact A.23 (Typology of Linear functions). ∩n ≤ ≤ − − = Pr ( i=1[Xi r]) = Pr (Xi r) Linear Functions f(αx1 + (1 α)x2) = αf(x1 + (1 α)f(x2) Imply i=1 • Additivity : f(x + x = f(x ) + f(x ) generalises to ∏N 1 2 1 2 Fn = F (r) = (r) – Convex Functions f(αx1 + (1 − α)x2) ≤ αf(x1) + (1 − α)f(x2) gener- i=1 alises to – Quasiconvex functions f(αx1 + (1 − α)x2) ≤ min(f(x1, x2) If F has a density, G has a density too • Homogeneity: f(αx) = αf(x) generalises to g(r) = G′(r) = nFn−1(r)f (r) k More generally, the distribution function of X(m) is given by F(m) – Homogeneous functions f(αx) = α f(x) , α > 0 generalises to – Homothetic functions f(x1) = f(x2) =⇒ f(αx1) = f(αx2) α > 0
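A Monte Carlo check of the order-statistic results above: for n iid U[0, 1] draws, the maximum has CDF r^n, density n r^{n−1}, and mean n/(n + 1). The sample sizes and seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, draws = 5, 200_000

U = rng.uniform(size=(draws, n))
mx = U.max(axis=1)

# P(max <= r) = r^n
for r in [0.2, 0.5, 0.8, 0.95]:
    print(r, round((mx <= r).mean(), 4), round(r**n, 4))

# density n r^{n-1} implies E[max] = n / (n + 1)
print(round(mx.mean(), 4), n / (n + 1))
```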

← ToC 102 Defn A.61 (Kernel, Rank Nullity). Defn A.66 (Euclidian Norm). x ∈ RN For Φ: V →W , we define the kernel/null space as Euclidian norm of a vector is defined√ as −1 ∥x∥ := ⟨x, x⟩ ker(Φ) := Φ (0W ) = {v ∈ V : Φ(v) = 0W } and the image/range Fact A.24 (Angle between two vectors ). Angle between u and v is given by Im(Φ) := Φ(V ) = {w ∈ W |∃v ∈ V : Φ(v) = w} ⟨u, v⟩ cos θ = The dimension of the image is called the rank of Φ. The dimension of the kernel is ∥u∥ ∥v∥ called the nullity of Φ. If X is finite dimensional, rankΦ + nullityΦ = dimX. This is the rank-nullity result Defn A.67 (Cauchy-Schwarz). A linear function Φ has full rank if rankΦ(X) = min{rankX, rankY } ∥⟨x, y⟩∥ ≤ ∥x∥ ∥y∥ Defn A.62 (Inner Product / Dot Product). Defn A.68 (Trace). An inner product on a vector space V is a mapping ⟨·, ·⟩ : V × V →R that satisfies, ∀x, y, z ∈ V ∧ a inR ∑N Trace(A) = a ⟨ ⟩ ≥ ⟨ ⟩ ⇔ nn • x, x 0 and x, x = 0 x = 0 n=1 • ⟨x, y + z⟩ = ⟨x, y⟩ + ⟨x, z⟩ For conformable matrices A, B; tr(AB) = tr(BA) • ⟨x, ay⟩ = a ⟨x, y⟩ Defn A.69 (Eigenvalues and Eigenvectors). For a square matrix A, scalar λ and vector x that satisfies Ax = λx constitute an ⟨ ⟩ ⟨ ⟩ • x, y = y, x eigenvalue and eigenvector respectively. ∑n ′ Defn A.70 (Orthogonal / Orthonormal Matrix). ⟨ ⟩ × x y = x, y := xiyi A square matrix A ∈ Rn n is orthogonal iff its columns are orthonormal so that i=1 AA′ = I = A′A =⇒ A−1 = A′ This generalises to an inner product of two functionals u, v : R→R where ∫ b ⟨u, v⟩ = u(x)v(x)dx A.8.2 Projection a √ Defn A.71 (Projection). An inner product defines a norm ∥v∥ = ⟨v, v⟩. This gives us a restatement of Let V be a vector space and U ⊆ V is a subspace of V . A linear mapping π : V →U the Cauchy-Schwartz inequality |⟨x, y⟩| ≤ ∥x∥ ∥y∥ is called a projection if π2 = π ◦ π = π. Since homeomorphisms can be expressed by a transformation matrix, projections can be represented as a projection matrix Defn A.63 (Orthogonal / Orthonormal Vectors). 2 Pπ with the property Pπ = P. Projection matrices are always symmetric. 1. ⟨x, y⟩ = 0 =⇒ x⊥y (Orthogonality). Furthermore, if ∥x∥ = 1 = ∥y∥, then Example A.25 (Projection onto general subspaces). they are said to be orthonormal. We look at orthogonal projections of vectors y ∈ Rn onto lower dimensional sub- ⊆ Rn ≥ 2. ⟨x, y⟩ = 1 =⇒ x parallel to y spaces X with dim(X) = m 1. Assume∑ (x1,..., xm) is an ordered basis X π (x) = m λ x = Xλ Defn A.64 (Symmetric, Positive Definite Matrices). of . Therefore, any projection U i=1 i i . The problem, then, × ′ λ , . . . , λ X A ∈ Rn n ∀x ∈ V : x Ax is to find 1 m coordinates of the projection (with respect to basis ) where is symmetric, positive definite if > 0. n×m T m πU (x) = Bλ given X = [x1,..., xm] ∈ R and λ = [λ1, . . . , λm] ∈ R . The Defn A.65 (Hadamard Product). solution is the familiar OLS coef vector For conformable matrices A and B with identical dimensions, the element-wise ′ − ′ product λ = (X X) 1X y ′ −1 ′ cij = aij × bij ∀i, j ∈ dim(A) ≡ A ⊙ B where (X X) X is also called the pseudo-inverse of B, which can be computed − is called the hadamard product. as long as (X′X) 1 is full rank.

← ToC 103 The projection matrix is therefore Example A.27 (QR Decomposition for OLS). ′ −1 ′ ′ −1 ′ βˆ = (X X) X y is often numerically unstable, so we can define X = QR. The, Pπ = X (X X) X −1 ′ the[ OLS] estimate can be written as βˆ = (R) Q y. The homoscedastic variance is Defn A.72 (Gram-Schmidt Orthogonalisation). ′ −1 ′ −1 V βˆ = σ2 (X X) = (R R) σ2. Any basis (b1,..., bn) of an n− dimensional vector space V can be transformed into ˆ ′ an orthogonal/orthonormal basis (u1,..., un) where span[b1,..., bn] = span[u1,..., un]Using the same decomposition, yˆ = Xβ = QQ y. as follows Defn A.76 (Singular Value Decomposition). u1 := b1 Any n × p matrix Z may be written as − ′ uk := bk πspan[u1,...,uk−1](bk) k = 2, . . . , n Z = UΣV U n × n V p × p Σ where the kth basis vector b is projected onto the subspace spanned by the first where is a orthogonal matrix, is a orthogonal matrix, and is a k n × p diagnonal matrix with non-negative elements. k − 1 constructed orthogonal vectors u1,..., uk−1. This is the same as FWL Theorem, but older. Example A.28 (SVD of covariance matrix equivalence with spectral decompo- sition). A.8.3 Matrix Decompositions For a square covariance matrix X′X, if X = USV′, then ′ T T T ′ Defn A.73 (Spectral / Eigenvalue Decomposition). X X = VS U USV = VDV A square matrix A admits to an eigen-decomposition if it can be factorised as A = where D = S2 contains the square singular values. In other words, −1 ′ ′ ′ ′ QΛQ where U = evec(XX ), V = evec(X X), S2 = eval(X X) = eval(XX )

• Q is a n × n matrix whose ith column is the eigenvector qi of A (orthogonal matrix) A.8.4 Matrix Identities For conformable matrices A, B, C, • Λ is a diagonal matrix with corresponding eigenvalues Λii = λi Fact A.26 (Orthogonal Matrices). A(B + C) = AB + AC If Q, N are N × N orthogonal matrices A + B⊤ = A⊤ + B⊤ ⊤ ⊤ ⊤ • QT = Q−1 is also orthogonal AB = B A − − − (AB) 1 = (B) 1 (A) 1 • QN is orthogonal trace(ABC) = trace(CBA) = trace(BCA) • det(Q) ∈ {−1, 1} A.8.5 Partitioned Matrices Defn A.74 (Cholesky Decomposition). If A is positive definite, then it admits to Defn A.77 (Partitioned Matrices). It can be useful to partition a matrix as follows A = RT R R • where is non-singular upper triangular X11 ··· X1n [ ]  . .  X11 X12 T X = . .. . = • A = LL where L is non-singular lower triangular . . . X21 X22 Xm1 ··· Xmn Defn A.75 (QR Decomposition). If A is a N × K matrix with full column rank, ∃ a factorisation A = QR where Multiplying a partitioned matrix with a stacked vector c [ ] [ ] [ ] Xc = X11 X12 c1 = X11c1 + X12c2 • Q is an orthogonal matrix X21 X22 c2 X21c1 + X22c2 • R is K × K upper triangular and nonsingular (invertible)
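A numpy sketch of Example A.27 above: computing OLS through the thin QR factorisation rather than the normal equations, and recovering fitted values from the projection QQ′y. The simulated design and coefficients are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.standard_normal(n)

# thin QR: X = QR with Q'Q = I and R upper triangular
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)            # solve R beta = Q'y; avoids forming X'X
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)      # normal equations, less stable numerically
print(np.allclose(beta_qr, beta_ne))

# fitted values equal the projection Q Q' y
print(np.allclose(X @ beta_qr, Q @ (Q.T @ y)))
```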

← ToC 104 Fact A.29 (Inverse of 2 × 2 partitioned matrix). every z ∈ L is denoted L⊥. Every v ∈ L can be written as v = w + z where z is the [ ] ⊥ [ ] − − − − ∈ −1 X 1 + X 1X F X (X ) 1 − (X ) 1 X F projection of v onto L and w L . X11 X12 = 11 11 12 2 21 11 11 12 2 X X − −1 21 22 F2X21 (X11) F2 Example A.30 (R). ( )− − 1 , the set of random variables defined on a common probability space {Σ, F, µ} where F = X − X (X ) 1 X 2 22 21 11 12 is√ a Hilbert space with inner product ⟨X,Y ⟩ = E [XY ], associated norm ∥X∥ = E [X2] and metric ∥X − Y ∥. A.9 Function Spaces Example A.31 (L2(w)). R Almost all of this is based on / stolen from Larry Wasserman ∫the space of Borel-measurable real functions f on √given density w(x) satisfying ∞ 2 ∞ ∥ ∥ ⟨ ⟩ ∥ − ∥ http://www.stat.cmu.edu/ larry/=sml/functionspaces.pdf and Racine, Su, and −∞ f(x) w(x)dx < and associated norm f = f, f and metric f g is Ullah (2013). a hilbert space. Defn A.78 (Function spaces). Defn A.82 (Orthogonal / Direct Sum). U bU f : U→R Let be any set, let be the collection of all bounded functions s.t. (i.e. If L and M are spaces such that every ℓ ∈ L is orthogonal to every m ∈ M, then sup |f(x)| < ∞ x∈U and let we define the orthogonal sum as ∥ − ∥ | − | d∞(f, g) := f g ∞ := sup f(x) g(x) ⊕ { ∈ ∈ } x∈U L M = l + m : l L, m M A set of vectors {et, t ∈ T } is orthonormal if ⟨es, et⟩ = 0 when s ≠ t and ∥et∥ = 1∀T . Spaces of functions can be treated as linear vector spaces This is also called an orthonormal basis. Every hilbert space has an orthonor- Defn A.79 (Inner Product and Norm in function spaces). mal basis. A Hilbert space is said to be separable if there exists a countable ∫ orthonormal basis. ⟨f, g⟩ = f(x)g(x)dx A.9.1 Lp spaces which leads to a norm for functions ∫ Let F be a collection of functions [a, b] 7→ R. The Lp norm on F is defined by ( ) ∥ ∥2 2 ∫ 1/p f 2 = f (x)dx b ∥ ∥ | |p f p = f(x) dx Defn A.80 (Eigenvalues and Eigenfunctions). a ∑ O ∞ ∥ ∥ | | An operator is a higher-order function that maps from one function to another. For p = , we define the sup norm f ∞ = x f(x) . A derivative and integral are both operators. The space Lp(a, b) is defined as Operators can have eigenvalues and eigenfunctions such that { } L (a, b) := f :[a, b]→R : ∥f∥ < ∞ Of = λf p p exp ax is an eigenfunction for both differentiation and integration. Every Lp space is a Banach Space. (∫ ) ∫ ∫ H 2 Defn A.81 (Hilbert Space ). • Cauchy Schwartz: f(x)g(x)dx ≤ f 2(x)dx g2(x)dx is a complete (:= every Cauchy sequence in the space converges to a point in it), inner product space. Equivalently, it is a vector space endowed with an inner product ∥ ∥ ≤ ∥ ∥ ∥ ∥ • Minkowski : f + g p f p + g p where p > 1 and an associated norm and metric such that every Cauchy sequence has a limit in H. Q ∥ ∥ ≤ ∥ ∥ ∥ ∥ Intuitively,√ it means it doesn’t have any ‘holes’ in it ( is not a complete space be- • Holder: fg 1 f p g q where (1/p) + (1/q) = 1 cause 2 is missing from it). Every Hilbert space is a Banach space but the reverse is not true in general. In a Example A.32 (L2 space). ∥ − ∥ → →∞ 2 hilbert space, fn f 0 as n . Functions where ∥f∥ < ∞ are said to be square-integrable, and the space of square- ∀ ∃ ∈ 2 If V is a hilbert space and L is a closed subspace then v, y L called a projection integrable functions is called L2. Many familiar results from vector spaces carries ∥ − ∥ ∈ of v onto L that minimises v z over z L. The set of elements orthogonal over into L2.

← ToC 105 ( ) ∈ j/2 j − L2∫(a, b) is a Hilbertspace. The inner product between∫ two functions f, g L2(a, b) ψjk(x) = 2 ψ 2 x k and b ∥ ∥2 b 2  is a f(x)g(x)dx and the norm of f is f = a f (x)dx. With this inner product,  1 −1 if 0 ≤ x ≤ L2(a, b) is a separable Hilbert space; that is,∫ we can find a countable orthonormal 2 b ψ(x) = basis ϕ1, ϕ2,... ; , that is ∥ϕj∥ = 1 ∀j, and ϕi(x)ϕj(x) = 0 ∀i ≠ j. It follows that  1 a 1 if < x ≤ 1 if f ∈ L2(a, b), 2 ∫ ∑∞ b This is a doubly indexed set of functions so when f is expanded in this basis we f(x) = θjϕj(x) where θj = f(x)ϕj(x)dx write ∞ j − j=1 a ∑ 2∑1 ∫ b ∑∞ f(x) = αϕ(x) + β ψ (x) are the coefficients. Parseval’s identity f 2(x)dx = θ2. jk jk a j=1 j j=1 k=1 ∫ ∫ The span of L2 is 1 1   where α = f(x)ϕ(x)dx and β = f(x)ψ (x)dx. The Haar basis is an exam- ∑∞  0 jk 0 jk ∈ R ple of a wavelet basis.  ajϕj(x): a1, . . . , an  j=1 Defn A.84 (Holder Spaces). ∑∞ ∑ Let β be a positive integer. Let T ⊂ R. The Holder space H(β, L) is the set of f = θ ϕ (x) {ϕ , . . . , ϕ } f = n θ ϕ (x) The projection of j=1 j j onto the span 1 n is n j=1 j j , functions g : T → R such that n-term linear approximation of f which we call the . − − g(β 1)(y) − g(β 1)(x) ≤ L|x − y|, for all x, y ∈ T Defn A.83 (Bases in function space). A sequence of functions ψ1,... can be considered a basis. An orthonormal basis The special case β = 1 is sometimes called the Lipschitz space. If β = 2 then we is one that admits to have ′ ′ ∑∞ |g (x) − g (y)| ≤ L|x − y|, for all x, y ⟨ ⟩ f = f, ψj ψj Roughly speaking, this means that the functions have bounded second derivatives. j=1 2 R Mononomials 1, x, x ,... are a basis for L2 on [0, 1] and , but they aren’t orthog- Multivariate version There is also a multivariate version of Holder spaces. Let onal. d s T ⊂ R . Given a vector s = (s1, . . . , sd) , define |s| = s1+···+sd, s! = s1! ··· sd!, x = s1 ··· sd Example A.33 (Famous Bases). x1 xd and ··· A popular basis for L2 on [0, 1] are the sines and cosines, which may be written ∂s1+ +sd ϕ = 1 ϕ = sin 2kπx ϕ = cos 2kπx Ds = as 1 , 2k , 2k+1 . Coefficients in this expansion are ∂xs1 ··· ∂xsd referred to as the Fourier transform of the original function. 1 d → R A cosine basis on [0, 1] is The Hölder class H(β, L) is the set of functions g : T such that √ |Dsg(x) − Dsg(y)| ≤ L∥x − y∥β−|s| ϕ0(x) = 1, ϕj(x) = 2 cos(2πjx) , j = 1, 2,... | | − Legendre basis on (−1, 1) is for all x, y and all s such that s = β 1 ( ) ( ) Defn A.85 (Sobolev Space). 1 2 1 3 P0(x) = 1,P1(x) = x, P2(x) = 3x − 1 ,P3(x) = 5x − 3x ,... A Sobolev space is a space of functions possessing sufficiently many derivatives 2 2 for some application domain. Formally, The Haar basis on [0,1] consists of functions Let f be integrable on every bounded interval. Then f is weakly differentiable if { } ′ ϕ(x), ψ (x): j = 0, 1, . . . , k = 0, 1,..., 2j − 1 there∫ exists a function f that is integrable on every bounded interval, such that jk y ′ − ≤ ′ x f (s)ds = f(y) f(x) whenever x y. We call f the weak derivative of f. Let where { j th ≤ D f denote the j weak derivative of f ϕ(x) = 1 if 0 x < 1 0 otherwise The Sobolev space of order m is defined by m Wm,p = {f ∈ Lp(0, 1) : ∥D f∥ ∈ Lp(0, 1)}
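A small numerical illustration of n-term linear approximation in L2(0, 1) and of Parseval's identity from this page, using the orthonormal cosine basis ϕ_0 = 1, ϕ_j(x) = √2 cos(πjx) (a minor variant of the basis written above so that the expansion is complete for a generic smooth f). The grid-based integration and the target function are illustrative assumptions.

```python
import numpy as np

# orthonormal cosine basis on [0,1]: phi_0 = 1, phi_j(x) = sqrt(2) cos(pi j x)
N = 20_000
x = (np.arange(N) + 0.5) / N            # midpoint grid, so integrals are plain means
f = np.exp(x)

def phi(j):
    return np.ones_like(x) if j == 0 else np.sqrt(2) * np.cos(np.pi * j * x)

theta = np.array([(f * phi(j)).mean() for j in range(200)])   # theta_j = <f, phi_j>

for n_terms in [1, 3, 10, 50, 200]:
    f_n = sum(theta[j] * phi(j) for j in range(n_terms))      # n-term linear approximation
    print(n_terms, np.sqrt(((f - f_n) ** 2).mean()))          # L2(0,1) error shrinks with n

# Parseval: sum_j theta_j^2 approaches the integral of f^2
print((theta ** 2).sum(), (f ** 2).mean())
```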

The Sobolev ball of order m and radius c is defined by
W_{m,p}(c) = { f : f ∈ W_{m,p}, ∥D^m f∥ₚ ≤ c }.

Defn A.86 (Mercer kernel and theorem). A Mercer kernel is a continuous function K : [a, b] × [a, b] → ℝ such that K(x, y) = K(y, x), and such that K is positive semidefinite, meaning that
Σ_{i=1}^n Σ_{j=1}^n K(xᵢ, xⱼ) cᵢ cⱼ ≥ 0
for all finite sets of points x₁, …, xₙ ∈ [a, b] and all real numbers c₁, …, cₙ. The function
K(x, y) = Σ_{k=1}^{m−1} (x^k y^k)/k! + ∫₀^{x∧y} (x − u)^{m−1}(y − u)^{m−1} / ((m − 1)!)² du
is an example of a Mercer kernel. The most commonly used kernel is the Gaussian kernel
K(x, y) = exp( −∥x − y∥² / σ² ).

Defn A.87 (Reproducing kernel Hilbert spaces). Given a kernel K, let Kₓ(·) be the function obtained by fixing the first coordinate, i.e. Kₓ(y) = K(x, y). For the Gaussian kernel, Kₓ is a Normal density (up to scale) centered at x. We can create functions by taking linear combinations of the kernel:
f(x) = Σ_{j=1}^k αⱼ K_{xⱼ}(x).
Let H₀ denote all such functions:
H₀ = { f : f(x) = Σ_{j=1}^k αⱼ K_{xⱼ}(x) }.
Given two such functions f(x) = Σⱼ αⱼ K_{xⱼ}(x) and g(x) = Σⱼ βⱼ K_{yⱼ}(x), we define an inner product
⟨f, g⟩ = ⟨f, g⟩_K = Σᵢ Σⱼ αᵢ βⱼ K(xᵢ, yⱼ).
In general, f (and g) might be representable in more than one way; one can check that ⟨f, g⟩_K is independent of how f (or g) is represented. The inner product defines a norm
∥f∥_K = √⟨f, f⟩ = √( Σⱼ Σₖ αⱼ αₖ K(xⱼ, xₖ) ) = √(αᵀ 𝕂 α),
where α = (α₁, …, αₖ)ᵀ and 𝕂 is the k × k matrix with 𝕂_{jk} = K(xⱼ, xₖ).

The reproducing property. Let f(x) = Σᵢ αᵢ K_{xᵢ}(x). Note the following crucial property:
⟨f, Kₓ⟩ = Σᵢ αᵢ K(xᵢ, x) = f(x).
This follows from the definition of ⟨f, g⟩ where we take g = Kₓ. It implies that ⟨Kₓ, Kₓ⟩ = K(x, x). This is called the reproducing property; it also implies that Kₓ is the representer of the evaluation functional.
The completion of H₀ with respect to ∥·∥_K is denoted by H_K and is called the RKHS generated by K.

Evaluation functionals. A key property of RKHSs is the behavior of the evaluation functional. The evaluation functional δₓ assigns a real number to each function; it is defined by δₓ f = f(x). In general, the evaluation functional is not continuous: we can have fₙ → f while δₓ fₙ does not converge to δₓ f. For example, let f(x) = 0 and fₙ(x) = √n · 1{x < 1/n²}. Then ∥fₙ − f∥ = 1/√n → 0, but δ₀ fₙ = √n, which does not converge to δ₀ f = 0. Intuitively, this is because Hilbert spaces can contain very unsmooth functions.
But in an RKHS, the evaluation functional is continuous. Intuitively, this means that the functions in the space are well behaved. To see this, suppose that fₙ → f. Then
δₓ fₙ = ⟨fₙ, Kₓ⟩ → ⟨f, Kₓ⟩ = f(x) = δₓ f,
so the evaluation functional is continuous. A Hilbert space is an RKHS if and only if the evaluation functionals are continuous.

Theorem A.34 (Representer theorem). Let ℓ be a loss function depending on (X₁, Y₁), …, (Xₙ, Yₙ) and on f(X₁), …, f(Xₙ). Let f̂ minimise
ℓ + g(∥f∥²_K),
where g is any monotone increasing function. Then f̂ has the form
f̂(x) = Σ_{i=1}^n αᵢ K(xᵢ, x)
for some α₁, …, αₙ.
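An illustration of how the representer theorem is used in practice (a sketch of my own, not from the notes): with squared-error loss Σᵢ (Yᵢ − f(Xᵢ))² and penalty λ∥f∥²_K, the minimiser is kernel ridge regression, whose coefficients solve (𝕂 + λI)α = Y with 𝕂ᵢⱼ = K(Xᵢ, Xⱼ):

import numpy as np

def gaussian_kernel(a, b, sigma=0.5):
    # K(x, y) = exp(-||x - y||^2 / sigma^2), the Gaussian kernel above
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / sigma**2)

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, 50))
Y = np.sin(2 * np.pi * X) + rng.normal(0, 0.2, 50)

lam = 0.1
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), Y)   # representer coefficients

x_grid = np.linspace(0, 1, 200)
f_hat = gaussian_kernel(x_grid, X) @ alpha             # f_hat(x) = sum_i alpha_i K(x_i, x)

The fitted function is, by construction, a finite kernel expansion around the observed Xᵢ — exactly the form the theorem guarantees.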

A.9.2 o(·), O(·) Notation

A function f(n) is of constant order, or of order 1, if there exists c > 0 such that f(n)/c → 1 as n → ∞; equivalently, if f(n) → c as n → ∞. We can then write f(n) = O(1) (read "of the same order as").

o(·) (read "is ultimately smaller than"): f(n) = o(1) means f(n) → 0 as n → ∞.

A.10 Optimisation

General results from optimisation theory (Luenberger 1997):

• Projection theorem: in ℝᵏ, the shortest line from a point to a plane is furnished by the perpendicular from the point to the plane. The core idea carries through to higher dimensions and to infinite-dimensional Hilbert space.
• Hahn–Banach theorem: given a sphere and a point not in the sphere, there exists a hyperplane separating the point and the sphere.
• Duality: the shortest distance from a point to a convex set equals the maximum of the distances from the point to a hyperplane separating the point from the convex set.
• Differentials: set the derivative of the objective function to zero.

A.11 Calculus

Defn A.88 (Derivative). The derivative of a function f at a point x, when defined, is the slope of the tangent to the function at x:
∂f/∂x = lim_{h→0} [f(x + h) − f(x)] / h.

Defn A.89 (Gradient and Jacobian). For a multivariate function f : ℝⁿ → ℝ, we define the gradient
∇ₓ f = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ)ᵀ,
which collects the partial derivatives in a column vector. For a vector-valued function f : ℝⁿ → ℝᵐ, we can construct the Jacobian, which collects all m × n partial derivatives:
∂f/∂x = (∇f₁(x)ᵀ ; ∇f₂(x)ᵀ ; … ; ∇fₘ(x)ᵀ),
i.e. the m × n matrix whose (i, j) entry is ∂fᵢ(x)/∂xⱼ.

Theorem A.35 (Inverse function theorem). Let U ⊆ ℝᵈ. If φ : U → ℝᵈ is differentiable at a and Dφₐ is invertible, then ∃ U′, V′ such that a ∈ U′ ⊆ U, φ(a) ∈ V′, and φ : U′ → V′ is bijective. Further, the inverse function ψ : V′ → U′ is differentiable.

Theorem A.36 (Implicit function theorem). Let U ⊆ ℝ^{d+1} be a domain and f : U → ℝ a differentiable function. If x ∈ ℝᵈ and y ∈ ℝ, we'll concatenate the two vectors and write (x, y) ∈ ℝ^{d+1}. Suppose c = f(a, b) and ∂_y f(a, b) ≠ 0. Then there exist a neighbourhood U′ ∋ a and a differentiable function g : U′ → ℝ such that g(a) = b and f(x, g(x)) = c ∀x ∈ U′. Further, ∃ V′ ∋ b such that {(x, y) | x ∈ U′, y ∈ V′, f(x, y) = c} = {(x, g(x)) | x ∈ U′}. In other words, for every x ∈ U′, f(x, y) = c has a unique solution y = g(x) ∈ V′.

Fact A.37 (Differentiating implicit functions using tangent planes). Let f : ℝ² → ℝ be differentiable and consider the implicitly defined curve
Γ := {(x, y) ∈ ℝ² | f(x, y) = c}
(i.e. a level set of f). Pick (a, b) ∈ Γ and suppose ∂_y f(a, b) ≠ 0. By the implicit function theorem, the y-coordinate of this curve can locally be expressed as a differentiable function of x. Directly differentiating f(x, y) = c w.r.t. x gives
∂ₓf + ∂_y f · (dy/dx) = 0  ⟺  dy/dx = −∂ₓf(a, b) / ∂_y f(a, b).
(A numerical check of this formula appears after the differentiation rules below.)

A.11.1 Differentiation Rules

Function        Derivative
x^a             a x^{a−1}
e^x             e^x
log x           1/x
a f + b g       a f′(x) + b g′(x)                      (linear rule)
f · g           f′(x) g(x) + f(x) g′(x)                (product rule)
f / g           [f′(x) g(x) − f(x) g′(x)] / g(x)²      (quotient rule)
f(g(x))         (∂f/∂g)(∂g/∂x)                         (chain rule)
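A numerical check of Fact A.37 (my own sketch): for the level set x² + y² = c, the formula gives dy/dx = −x/y, which should match a finite-difference slope of the locally solved branch y = g(x) = √(c − x²):

import numpy as np

c, a = 2.0, 0.6
g = lambda x: np.sqrt(c - x**2)     # local solution y = g(x) on the upper branch
b = g(a)

slope_ift = -(2 * a) / (2 * b)      # dy/dx = -f_x / f_y with f(x, y) = x^2 + y^2

h = 1e-6
slope_fd = (g(a + h) - g(a - h)) / (2 * h)   # central finite difference

print(slope_ift, slope_fd)          # the two slopes agree to high precision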

A.11.2 Matrix Derivatives

Let a, x ∈ ℝⁿ and let A be a conformable matrix.

• ∂(a′x)/∂x = a
• ∂(a′x)/∂x′ = a′
• ∂(Ax)/∂x′ = A
• ∂(x′A)/∂x = A
• ∂(x′Ax)/∂x = (A + A′)x
• ∂(x′Ax)/∂A = xx′
• ∂ log|A| / ∂A = (A⁻¹)′

(A numerical check of the quadratic-form rule appears below, after Fact A.38.)

A.11.3 Constrained Maximisation

Fact A.38 (General proposition). We want to maximise f(x) subject to g(x) = c (the constraint surface being S := {x : g(x) = c}). Suppose ∇g(x) ≠ 0 ∀x ∈ S. If f attains a constrained local maximum (or minimum) at a on the surface S, then ∃ λ ∈ ℝ s.t. ∇f(a) = λ∇g(a).
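A finite-difference check of the quadratic-form rule ∂(x′Ax)/∂x = (A + A′)x, together with a reminder that the definiteness of a symmetric matrix can be read off its eigenvalues (an illustrative sketch; the matrix and vector are arbitrary):

import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))          # not necessarily symmetric
x = rng.normal(size=4)

analytic = (A + A.T) @ x             # d(x'Ax)/dx = (A + A')x

h = 1e-6
numeric = np.array([
    ((x + h * e) @ A @ (x + h * e) - (x - h * e) @ A @ (x - h * e)) / (2 * h)
    for e in np.eye(4)
])
print(np.max(np.abs(analytic - numeric)))   # tiny: formula matches finite differences

# definiteness: all eigenvalues > 0 => positive definite, all < 0 => negative definite
print(np.linalg.eigvalsh(A + A.T))          # here typically indefinite (mixed signs)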

Defn A.90 (Lagrangian). The generic problem is of the form

max_{x₁, x₂} f(x₁, x₂)  s.t.  g(x₁, x₂) = b.

First, write the Lagrangian
L(x₁, x₂, λ) = f(x₁, x₂) + λ [g(x₁, x₂) − b].
Differentiating w.r.t. x₁, x₂, λ yields the first-order conditions

[x₁]: ∂L/∂x₁ = f₁(x₁, x₂) + λ g₁(x₁, x₂) = 0
[x₂]: ∂L/∂x₂ = f₂(x₁, x₂) + λ g₂(x₁, x₂) = 0
[λ]:  ∂L/∂λ = g(x₁, x₂) − b = 0,

which gives three (potentially nonlinear) equations in three unknowns (x₁, x₂, λ) that can be solved simultaneously. To check sufficiency, the second-order-condition analogue is the determinant of the bordered Hessian matrix

BH(x₁, x₂, λ) = [ 0  −g₁(x₁, x₂)  −g₂(x₁, x₂) ; −g₁(x₁, x₂)  f₁₁(x₁, x₂)  f₁₂(x₁, x₂) ; −g₂(x₁, x₂)  f₂₁(x₁, x₂)  f₂₂(x₁, x₂) ].

If det BH > 0 at the candidate solution, the quadratic form defined by the Hessian of f is negative definite on the constraint set, which implies that the (x₁*, x₂*) that solves the system is indeed a local maximum.

Defn A.91 (Hessian, definiteness). The Hessian of a C² (twice continuously differentiable) function f : ℝᵈ → ℝ is the d × d matrix

H_f = ( ∂₁∂₁f  ∂₂∂₁f  …  ∂_d∂₁f ; ∂₁∂₂f  ∂₂∂₂f  …  ∂_d∂₂f ; … ; ∂₁∂_df  ∂₂∂_df  …  ∂_d∂_df ).

For a square matrix A:
• If (Av) · v ≤ 0 ∀v ∈ ℝᵈ, A is said to be negative semi-definite.
• If (Av) · v < 0 for all v ≠ 0 in ℝᵈ, A is said to be negative definite.
• If (Av) · v ≥ 0 ∀v ∈ ℝᵈ, A is said to be positive semi-definite.
• If (Av) · v > 0 for all v ≠ 0 in ℝᵈ, A is said to be positive definite.
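A minimal sketch of solving the FOC system numerically and checking the bordered-Hessian condition (the example f(x₁, x₂) = x₁x₂ with constraint x₁ + x₂ = b is my own, not from the notes):

import numpy as np
from scipy.optimize import fsolve

b = 10.0

def focs(z):
    x1, x2, lam = z
    return [x2 + lam,            # dL/dx1 = f1 + lam * g1 = 0  (f1 = x2, g1 = 1)
            x1 + lam,            # dL/dx2 = f2 + lam * g2 = 0  (f2 = x1, g2 = 1)
            x1 + x2 - b]         # dL/dlam = g - b = 0

x1, x2, lam = fsolve(focs, x0=[1.0, 1.0, -1.0])
print(x1, x2, lam)               # x1 = x2 = 5, lam = -5

# bordered Hessian with g1 = g2 = 1, f11 = f22 = 0, f12 = f21 = 1
BH = np.array([[ 0, -1, -1],
               [-1,  0,  1],
               [-1,  1,  0]])
print(np.linalg.det(BH))         # 2 > 0, so (5, 5) is a local maximum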

B Bibliography

Abadie, Alberto (Apr. 2003). "Semiparametric instrumental variable estimation of treatment response models". Journal of Econometrics 113.2, pp. 231–263 (cit. on p. 36).
Abadie, Alberto, Alexis Diamond, and Jens Hainmueller (2010). "Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program". Journal of the American Statistical Association 105.490, pp. 493–505 (cit. on pp. 45, 46).
Acharya, Avidit, Matthew Blackwell, and Maya Sen (Aug. 2016). "Explaining Causal Findings Without Bias: Detecting and Assessing Direct Effects". The American Political Science Review 110.3, pp. 512–529 (cit. on p. 53).
Angrist, Joshua D. and Jörn-Steffen Pischke (2008). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press (cit. on p. 37).
Aronow, Peter M and Benjamin T Miller (2019). Foundations of Agnostic Statistics. Cambridge University Press.
Athey, Susan and Guido Imbens (2016a). "Recursive partitioning for heterogeneous causal effects". Proceedings of the National Academy of Sciences 113.27, pp. 7353–7360 (cit. on p. 78).
— (July 2016b). "The Econometrics of Randomized Experiments". arXiv: 1607.00698 [stat.ME] (cit. on p. 26).
Athey, Susan, Julie Tibshirani, and Stefan Wager (2019). "Generalized random forests". The Annals of Statistics 47.2, pp. 1148–1178 (cit. on p. 77).
Austen-Smith, David and Jeffrey S Banks (2000). Positive Political Theory I: Collective Preference. Vol. 1. University of Michigan Press.
Bach, Philipp et al. (2021). "DoubleML – An Object-Oriented Implementation of Double Machine Learning in R". arXiv preprint arXiv:2103.09603 (cit. on p. 76).
Bai, Jushan (2009). "Panel data models with interactive fixed effects". Econometrica 77.4, pp. 1229–1279 (cit. on p. 47).
Bang, Heejung and James M Robins (2005). "Doubly robust estimation in missing data and causal inference models". Biometrics 61.4, pp. 962–973 (cit. on p. 32).
Bazen, Stephen (2011). Econometric Methods for Labour Economics. Oxford University Press (cit. on p. 50).
Bellemare, Marc F, Jeffrey R Bloem, and Noah Wexler (2020). "The Paper of How: Estimating Treatment Effects Using the Front-Door Criterion". Working paper (cit. on p. 52).
Belloni, Alexandre, Victor Chernozhukov, and Christian Hansen (May 2014). "High-Dimensional Methods and Inference on Structural and Treatment Effects". Journal of Economic Perspectives 28.2, pp. 29–50 (cit. on pp. 33, 76).
Billingsley, Patrick (2008). Probability and Measure. John Wiley & Sons.
Boyd, Stephen and J Duchi (2012). "EE364b: Convex Optimization II". Course notes, http://www.stanford.edu/class/ee364b.
Callaway, Brantly and Pedro H. C. Sant'Anna (2020). "Difference-in-differences with multiple time periods". Journal of Econometrics (cit. on p. 44).
Cameron, A Colin and Pravin K Trivedi (2005). Microeconometrics: Methods and Applications. Cambridge University Press.
Carter, Michael (2001). Foundations of Mathematical Economics. MIT Press.
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, and Whitney Newey (2017). "Double/debiased/Neyman machine learning of treatment effects". American Economic Review 107.5, pp. 261–65 (cit. on p. 76).
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins (Feb. 2018). "Double/debiased machine learning for treatment and structural parameters". The Econometrics Journal 21.1, pp. C1–C68 (cit. on p. 33).
Chernozhukov, Victor, Iván Fernández-Val, and Blaise Melly (2013). "Inference on counterfactual distributions". Econometrica 81.6, pp. 2205–2268 (cit. on p. 49).
Chernozhukov, Victor, Christian Hansen, and Martin Spindler (2015). "Post-selection and post-regularization inference in linear models with many controls and instruments". American Economic Review 105.5, pp. 486–90 (cit. on p. 39).
Cornelissen, Thomas et al. (Aug. 2016). "From LATE to MTE: Alternative methods for the evaluation of policy interventions". Labour Economics 41, pp. 47–60 (cit. on p. 38).
Deisenroth, Marc Peter, A Aldo Faisal, and Cheng Soon Ong (2020). Mathematics for Machine Learning. Cambridge University Press.
Doudchenko, Nikolay and Guido W Imbens (2016). Balancing, Regression, Difference-in-Differences and Synthetic Control Methods: A Synthesis. Tech. rep. National Bureau of Economic Research (cit. on pp. 45, 46).
Frangakis, Constantine E and Donald B Rubin (2002). "Principal stratification in causal inference". Biometrics 58.1, pp. 21–29 (cit. on p. 39).
Glynn, Adam N and Kevin M Quinn (2010). "An introduction to the augmented inverse propensity weighted estimator". Political Analysis, pp. 36–56 (cit. on p. 32).
Goodman-Bacon, Andrew (2018). Difference-in-Differences with Variation in Treatment Timing. Tech. rep. National Bureau of Economic Research (cit. on p. 44).
Györfi, László et al. (2006). A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media.
Hainmueller, Jens (2012). "Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies". Political Analysis, pp. 25–46 (cit. on p. 32).
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman (2009). The Elements of Statistical Learning. Springer, NY.
Heaton, Matthew J et al. (2019). "A case study competition among methods for analyzing large spatial data". Journal of Agricultural, Biological and Environmental Statistics 24.3, pp. 398–425 (cit. on p. 91).
Heckman, James J and Edward J Vytlacil (2007). "Econometric evaluation of social programs, part I: Causal models, structural models and econometric policy evaluation". Handbook of Econometrics 6, pp. 4779–4874 (cit. on p. 37).
Imai, Kosuke, Luke Keele, and Teppei Yamamoto (Feb. 2010). "Identification, Inference and Sensitivity Analysis for Causal Mediation Effects". Statistical Science 25.1, pp. 51–71 (cit. on p. 52).
Imbens, Guido W and Donald B Rubin (2015). Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press (cit. on p. 26).
Imbens, Guido and Stefan Wager (May 2017). "Optimized Regression Discontinuity Designs". arXiv: 1705.01677 [stat.ME] (cit. on p. 41).
Jann, Ben (2008). "A Stata implementation of the Blinder–Oaxaca decomposition". Stata Journal 8.4, pp. 453–479.
Keener, Robert W (2011). Theoretical Statistics: Topics for a Core Course. Springer.
Kelly, Morgan (2020). "Direct Standard Errors for Regressions with Spatially Autocorrelated Residuals" (cit. on p. 89).
Kennedy, Edward H (Oct. 2015). "Semiparametric theory and empirical processes in causal inference". arXiv: 1510.04740 [math.ST] (cit. on pp. 54, 56).
Koenker, Roger (2005). Quantile Regression (Econometric Society Monographs No. 38). Cambridge University Press (cit. on pp. 21, 22).
Lee, Myoung-jae (2016). Matching, Regression Discontinuity, Difference in Differences, and Beyond. Oxford University Press.
Li, Fan, Kari Lock Morgan, and Alan M Zaslavsky (2018). "Balancing covariates via propensity score weighting". Journal of the American Statistical Association 113.521, pp. 390–400 (cit. on p. 32).
Luenberger, David G (1997). Optimization by Vector Space Methods. John Wiley & Sons (cit. on p. 108).
Manski, Charles F (1993). "Identification of endogenous social effects: The reflection problem". The Review of Economic Studies 60.3, pp. 531–542 (cit. on p. 90).
Menezes, Flavio M and Paulo Klinger Monteiro (2005). An Introduction to Auction Theory. OUP Oxford.
Mogstad, Magne and Alexander Torgovitsky (Aug. 2018). "Identification and Extrapolation of Causal Effects with Instrumental Variables". Annual Review of Economics 10.1, pp. 577–613 (cit. on p. 39).
Mullainathan, Sendhil and Jann Spiess (2017). "Machine learning: an applied econometric approach". Journal of Economic Perspectives 31.2, pp. 87–106 (cit. on p. 70).
Murphy, Kevin P (2012). Machine Learning: A Probabilistic Perspective. MIT Press (cit. on p. 61).
Oster, Emily (2019). "Unobservable selection and coefficient stability: Theory and evidence". Journal of Business & Economic Statistics 37.2, pp. 187–204 (cit. on p. 34).
Racine, Jeffrey, Liangjun Su, and Aman Ullah (2013). The Oxford Handbook of Applied Nonparametric and Semiparametric Econometrics and Statistics. Oxford University Press (cit. on p. 105).
Robins, James M, Andrea Rotnitzky, and Lue Ping Zhao (1994). "Estimation of regression coefficients when some regressors are not always observed". Journal of the American Statistical Association 89.427, pp. 846–866 (cit. on p. 32).
Robinson, Peter M (1988). "Root-N-consistent semiparametric regression". Econometrica: Journal of the Econometric Society, pp. 931–954 (cit. on pp. 59, 77).
Rosenbaum, Paul R and Donald B Rubin (1983). "The central role of the propensity score in observational studies for causal effects". Biometrika 70.1, pp. 41–55 (cit. on p. 31).
Rue, Havard and Leonhard Held (2005). Gaussian Markov Random Fields: Theory and Applications. CRC Press (cit. on p. 91).
Ruppert, David, Matt P Wand, and Raymond J Carroll (2003). Semiparametric Regression. Vol. 12. Cambridge University Press.
Scholkopf, Bernhard and Alexander J Smola (2018). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning series (cit. on p. 61).
Schrimpf, Paul (2018). Mathematics for Economics – Lecture Notes.
Severini, Thomas A (2005). Elements of Distribution Theory. Cambridge University Press (cit. on p. 102).
Shalizi, Cosma (2019). Advanced Data Analysis from an Elementary Point of View.
Stachurski, John (2009). Economic Dynamics: Theory and Computation. MIT Press.
— (2016). A Primer in Econometric Theory. MIT Press.
Tsiatis, Anastasios (2007). Semiparametric Theory and Missing Data. Springer Science & Business Media (cit. on p. 54).
Van Der Laan, Mark J and Daniel Rubin (2006). "Targeted maximum likelihood learning". The International Journal of Biostatistics 2.1 (cit. on p. 32).
Vohra, Rakesh V (2004). Advanced Mathematical Economics. Routledge.
Ward, Michael D and John S Ahlquist (2018). Maximum Likelihood for Social Science: Strategies for Analysis. Cambridge University Press.
Wasserman, Larry (2006). All of Nonparametric Statistics. Springer Science & Business Media (cit. on p. 54).
— (2013). All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media.
White, Halbert (2014). Asymptotic Theory for Econometricians. Academic Press (cit. on p. 14).
Williams, Christopher KI and Carl Edward Rasmussen (2006). Gaussian Processes for Machine Learning. MIT Press (cit. on p. 61).
Wooldridge, Jeffrey M (2010). Econometric Analysis of Cross Section and Panel Data. MIT Press (cit. on p. 43).
Xu, Yiqing (Jan. 2017). "Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models". Political Analysis 25.1, pp. 57–76 (cit. on p. 48).
