Regression, Causality and Identification Issues Kamiljon T
Total Page:16
File Type:pdf, Size:1020Kb
Regression, Causality and Identification Issues Kamiljon T. Akramov, Ph.D. IFPRI, Washington, DC, USA Training Course on Applied Econometric Analysis June 4, 2015, WIUT, Tashkent, Uzbekistan Standard OLS Model: Summary • Consider a simple regression model Y = β0 + β1X1 + β2X2 + u • It is assumed to satisfy the following four conditions • The model is linear in the parameters β • n observations are drawn randomly from populations of interest • E(u|X1, X2)=0, i.e., no endogeniety in the true model • No perfect collinearity • These four assumptions guarantee that the OLS estimates of parameters will be unbiased Standard OLS model: Summary (cont.) • Standard OLS model provides an estimate of the effect on Y of arbitrary changes in independent variables (∆X) • It resolves the problem of omitted variable bias, if an omitted variable can be measured and included • It can handle certain nonlinear relations (effects that vary with the X’s) • Still, standard OLS might yield a biased estimator of the true causal effect Motivation • The most challenging empirical questions in economics involve causal-effect relationships: • Will mandatory health insurance really make people healthier? • How does an additional year of education change earnings? • What is the effect of farm size on agricultural productivity? • How does agricultural diversity impact nutritional outcomes? Does Chocolate Consumption enhance cognitive function? Source: Messerli (2012) Regression analysis • “Essentially, all (statistical) models are wrong, but some are useful” George E. P. Box (1987) • All regression (statistical) models are description of real world phenomenon using mathematical concepts, i.e., they are just simplifications of reality • Regression analysis can be very useful if it is carefully designed • In accordance with current good practice guidelines, and • A thorough understanding of the limitations of the methods used • If not, it can be not only inaccurate but also potentially damaging by misleading policymakers, practitioners and public • Example 1: impact of hormone replacement therapy on the risk of coronary heart diseases (Hully et al. 1998; Velickovic 2015) • Example 2: Relationship between levels of government debt and rates of economic growth (Reinhart & Rogoff controversy) Regression and causality • The aim of standard regression analysis is to infer parameters of a distribution from samples drawn of that distribution • With the help of such parameters, one can: • Infer association among variables • Estimate the likelihood of past and future events • As well as update the likelihood of events in light of new evidence or new measurement • Causal analysis goes one step further: • Its aim is to infer aspects of the data generation process • With the help of such aspects, one can deduce not only the likelihood of events under static conditions, but also the dynamics of events under changing conditions 7 Causal analysis: schooling and earnings • Causal relationship between schooling and earnings tells us what people would earn, on average, if we could either • Change their schooling in a perfectly controlling environment or • Change their schooling randomly so that those with different levels of schooling would otherwise comparable • Conditional independence assumption (CIA) requires that we must hold a variety of control variables fixed for causal inferences to be valid • Selection on observables • Covariates to be fixed are assumed to be known and observed Causal analysis: schooling and earnings • Assume schooling is a binary decision, • Two potential earnings variables Y1i if ci =1 Outcome = Y0i if ci = 0 • We would like to know the difference between , which is causal effect of schooling on individual 1 2 Yi = Y0i + (Y1i −Y0i )ci Causal analysis: schooling and earnings • Comparison of average earning conditional on schooling status is formally linked to the average causal effect, i.e., Observed difference=Average treatment effect on treated + Selection bias E[Yi | Ci =1]− E[Yi | Ci = 0]= E[Y1i −Y0i | Ci =1] + E[Y0i | Ci =1]− E[Y0i | Ci = 0] • If selection bias is positive, the naïve comparison of earnings exaggerates the benefits of schooling • CIA asserts that conditional on observed characteristics selection bias disappears and comparisons of average earnings across schooling levels have a causal interpretation Fundamental problem of causal inference • It is impossible to observe the value of Y1i and Y0i on the same individual and, therefore, it is impossible to directly observe the effect of schooling on earnings • Another way to express this problem is to say that we cannot infer the effect of schooling because we do not have the counterfactual evidence, i.e., what would have happened in the absence of schooling • Given that the causal effect for a single individual cannot be observed, we aim to identify the average causal effect for the entire population or for sub-populations Fundamental problem of causal inference: solution • The econometric solution replaces the impossible-to-observe causal effect of treatment on a specific unit with the possible-to-estimate average causal effect of treatment over a population of units • Although E(Y1i) and E(Y0i) cannot both be calculated, they can be estimated. • Most econometrics methods attempt to construct from observational data consistent estimates of Y1i and Y0i Causal analysis: additional issues • In most circumstances, there is simply no information available on how those in the control group would have reacted if they had received the treatment instead • This is the basis for an important insight into another potential biase of standard regression analysis – treatment heterogeneity • Thus, two sources of biases need to be eliminated from estimates of causal effects from observational studies 1. Selection Bias: Baseline difference 2. Treatment Heterogeneity • Most of the methods available only deal with selection bias, simply assuming that the treatment effect is constant in the population or by redefining the parameter of interest in the population Macro example • What explains income differences across countries? • Hypothesis: the quality of institutions explains the variation in per capita income across countries • How would you establish causal link between institutions and income? • Higher levels of economic development may cause higher levels of institutional quality • Unobserved variable may jointly determine both high levels of institutional quality and high levels of income Internal Validity • Recall our definition of internal validity: • An econometric analysis is internally valid if the statistical inferences about causal effects are valid for the population being studied • We know that internal validity hinges on two things: • The estimator of the causal effect should be consistent (unbiased would be nice too, but it’s not always feasible) • Hypothesis tests should have the desired significance level (i.e. you should be using the correct standard errors) Threats to Internal Validity • Omitted variable bias • Model misspecification or wrong functional form • Measurement error • Selection bias • Simultaneous causality bias • All of these imply that E(ui|X1,X2) ≠ 0 Omitted Variable Bias • The bias in the OLS estimator that occurs as a result of an omitted factor is called omitted variable bias • For omitted variable bias to occur, the omitted factor “Z” must be: • a determinant of Y; and • correlated with the regressor X but unobserved, so cannot be included in the regression • Both conditions must hold for the omission of Z to result in omitted variable bias Omitted Variable Bias Formula • Regression of wages on schooling = α + ρ + + where α, ρ, and are population regression coefficients and is a regression residual that is uncorrelated with all regressor • What are the consequences of leaving ability out of regression? • OVB formula ( , ) ( ) = ρ + Where is the vector of coefficients from regressions of the elements of and Potential Solutions to Omitted Variable Bias • If the variable can be measured and observed, include it as a regressor in multiple regression • But be careful about bad control variables • Possibly, use panel data in which each entity (individual) is observed more than once • Fixed effects to control for unobserved heterogeneity • If the variable cannot be measured, use instrumental variables regression • Run a randomized controlled experiment OVB Example: estimates of the returns to education for men in the NLSY Controls (1) (2) (3) (4) (5) None Age dummies Col. (2) and Col. (4) and Col. (4) and additional AFQT score occupation control variables dummies (mother’s and father’s years of schooling, and dummies for race and census region) 0.132 0.131 0.114 0.087 0.066 (0.007) (0.007) (0.007) (0.009) (0.010) Table reports the coefficient on years of schooling in a regression of log wages on years of schooling and the indicated controls. Source: Angrist and Pischke (2009). Misspecification or Wrong Functional Form • Arises if the functional form is incorrect • If an interaction term is incorrectly omitted, then inferences on causal effects will be biased • Variable transformations (logarithms) • Discrete dependent variables • For example, the effect of dietary diversity on nutritional outcomes may depend on children’s age • Other examples? Measurement Error • In reality, economic data often have measurement error • Data entry errors in administrative data • Recollection errors in surveys • when did you start your current job? • Ambiguous questions problems