<<

Research Design

Christine Pal Chee March 25, 2015

Health Services Research

 Many questions in health services research aim to establish – Does the adoption of electronic medical records reduce health care costs or improve quality of care? – Did the transition to Patient Aligned Care Teams (PACT) improve quality of care and health outcomes? – What effect will the Affordable Care Act (ACA) have on the demand for VHA services?  Ideally studied through randomized controlled trials  When can regression of observational answer these questions?

2 Poll: Familiarity with Regressions

 How would you describe your familiarity with regression analysis? – Regression is my middle name. – I’ve run a few regressions and get the gist of how they work. – I took a class many years ago. – What is a regression?

3

Objectives

 Provide a conceptual framework for research design  Review the model  Define exogeneity and endogeneity  Discuss three forms of endogeneity – Omitted variable bias – Sample selection – Simultaneous causality

4 Research Question

 Start with a research question: – What is the effect of on ?  For example: 𝑋𝑋 𝑌𝑌 – What effect does education have on health?

5 Linear Regression Model

= + + +

 : outcome variable𝑌𝑌𝑖𝑖 𝛽𝛽of0 interest𝛽𝛽1𝑋𝑋 1𝑖𝑖 𝛽𝛽2𝑋𝑋2𝑖𝑖 𝑒𝑒𝑖𝑖  : explanatory variable of interest  𝑌𝑌 : control variable 1  𝑋𝑋: error term 2 𝑋𝑋– is the difference between the observed and predicted values of Y 𝑒𝑒– contains all other factors besides and that determine the value of  𝑒𝑒 : the change in associated with1 a unit2 change in , holding constant𝑒𝑒 𝑋𝑋 𝑋𝑋 𝑌𝑌 𝛽𝛽–1 is our estimate𝑌𝑌 of 𝑋𝑋1 2 1 𝑋𝑋 1  Model𝛽𝛽̂ specifies all meaningful𝛽𝛽 determinants of

𝑌𝑌 6 Linear Regression Model (2)

 In our example:

= + +

– : dependent𝑖𝑖 variable0 1 𝑖𝑖 𝑖𝑖 – ℎ𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒: independent� 𝛽𝛽 variable𝛽𝛽 𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒 𝑒𝑒 – ℎ𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒: error� term 𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒. contains all other factors besides education that determine health – 𝑒𝑒 : the change in health associated with an increase in education 𝑒𝑒 𝛽𝛽1  When does estimate the causal effect of education on health? 𝛽𝛽̂1

7 Exogeneity

 Assumption: = 0 – Conditional of given is zero . Conditional mean𝑖𝑖 𝑖𝑖 independence 𝐸𝐸 𝑒𝑒 𝑋𝑋 𝑖𝑖 𝑖𝑖 – is “exogenous” 𝑒𝑒 𝑋𝑋

𝑋𝑋  Knowing does not help us predict – is the difference between the observed and predicted values of – contains𝑖𝑖 other factors besides that determine𝑖𝑖 the value of 𝑖𝑖 𝑋𝑋 𝑒𝑒 𝑖𝑖 – 𝑒𝑒 other than does not tell us anything more about 𝑌𝑌 𝑖𝑖 𝑖𝑖 𝑖𝑖 𝑒𝑒 𝑋𝑋 𝑌𝑌 𝑖𝑖 𝑖𝑖  Implies that and cannot𝑋𝑋 be correlated 𝑌𝑌

𝑋𝑋𝑖𝑖 𝑒𝑒𝑖𝑖

8 Exogeneity (2)

 In the context of a randomized controlled trial:

= + +

– can include things like age, gender, pre-existing conditions, income,𝑜𝑜𝑜𝑜𝑜𝑜𝑜𝑜𝑜𝑜 education,𝑜𝑜𝑜𝑜𝑖𝑖 etc. 𝛽𝛽0 𝛽𝛽1𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑖𝑖 𝑒𝑒𝑖𝑖 𝑒𝑒𝑖𝑖  Because treatment is randomly assigned, and are independent – This implies is exogenous 𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 𝑒𝑒

 In observational𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 studies, is not randomly assigned – The best we can hope for is𝑡𝑡𝑡𝑡 that𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 is as if randomly assigned 𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡

9 Exogeneity (3)

 In our example: = + +  In order for to estimate the causal effect of ℎ𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒�𝑖𝑖 𝛽𝛽0 𝛽𝛽1𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑖𝑖 𝑒𝑒𝑖𝑖 education on health, must be exogenous ̂1 – All factors 𝛽𝛽other than education do not tell us anything more about health 𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒  In the context of a randomized controlled trial, is exogenous – Is the same true in the context of observational studies? 𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒

10 Endogeneity

 Violation of the exogeneity assumption – is endogenous – Always true when is correlated with 𝑋𝑋  is biased 𝑋𝑋𝑖𝑖 𝑒𝑒𝑖𝑖 – is unbiased if the expected value of is equal to 𝛽𝛽̂1the true value of 𝛽𝛽̂1 𝛽𝛽̂1  will not estimate a causal effect of on 𝛽𝛽1 – is a measure of the correlation between and 1 𝛽𝛽̂– Correlation does not imply causation 𝑋𝑋 𝑌𝑌 ̂1 𝛽𝛽 𝑋𝑋 𝑌𝑌 11 Forms of Endogeneity

 Omitted variable bias  Sample selection  Simultaneous causality

12 Omitted Variable Bias

 Arises when: – A variable omitted from the regression model is a determinant of the dependent variable, – The omitted variable is correlated with the regressor, 𝑌𝑌

 Leads to 𝑋𝑋be biased – also captures the correlation between the ̂1 omitted𝛽𝛽 variable and the dependent variable 𝛽𝛽̂1 13 Omitted Variable Bias (2)

 Regression model: = + +  Say another factor, , determines 𝑖𝑖 0 1 𝑖𝑖 𝑖𝑖 – is included in the𝑌𝑌 error𝛽𝛽 term,𝛽𝛽 𝑋𝑋 𝑒𝑒 𝑖𝑖 𝑖𝑖  If and are correlated𝑊𝑊 𝑌𝑌 𝑊𝑊𝑖𝑖 𝑒𝑒𝑖𝑖 – and are correlated 𝑖𝑖 𝑖𝑖  𝑋𝑋 is endogenous𝑊𝑊 𝑋𝑋𝑖𝑖 𝑒𝑒𝑖𝑖 – is biased 𝑖𝑖 𝑋𝑋 . also captures the correlation between and 𝛽𝛽̂1 ̂1 𝑖𝑖 𝑖𝑖 𝛽𝛽 𝑊𝑊 𝑌𝑌 14 Omitted Variable Bias: Example

 In our example: = + +  Two questions: 𝑖𝑖 0 1 𝑖𝑖 𝑖𝑖 – Besidesℎ𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒 �education,𝛽𝛽 do𝛽𝛽 any𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒 other factors𝑒𝑒 determine health? – Are those factors correlated with education?  Consider two factors: income and social networks

15 Omitted Variable Bias: Example (2)

 Income – Individuals with higher incomes may be more able to afford goods and services that promote good health – Individuals with higher levels of education tend to have higher incomes  Social networks – Individuals with greater social networks may be more able to obtain better health care – Individuals with higher levels of education are more likely to have social networks that facilitate access to better health care

16 Omitted Variable Bias: Solutions

 Multiple linear regression – Include all relevant factors in the regression model so that we have conditional mean independence – Often not possible to include all omitted variables in the regression  Randomized controlled trial  Natural – More on this in the Natural and Difference-in-Differences lecture on April 8

17 Omitted Variable Bias: Solutions (2)

 Utilize panel data (same observational unit observed at different points in time) – Fixed effects regression: control for unobserved omitted variables that do not change over time – For more information: Stock and , Chapter 10  Instrumental variables regression – Utilize an instrumental variable that is correlated with the independent variable of interest but is uncorrelated with the omitted variables – More on this in the Instrumental Variables Regression lecture on April 22

18 Sample Selection

 Arises when: – A selection process influences the availability of data – The selection process is related to the dependent variable, , beyond depending on

𝑌𝑌  Leads𝑋𝑋 to be biased ̂1 𝛽𝛽 19 Sample Selection (2)

 Form of omitted variable bias – The selection process is captured by the error term – Induces correlation between the regressor, , and the error term,

𝑋𝑋 𝑒𝑒

20 Sample Selection: Examples

 Want to evaluate the effect of a new tobacco cessation program (offered to all patients) on quitting – = + + – Problem: Individuals who participate in the program may𝑞𝑞𝑞𝑞𝑞𝑞𝑞𝑞 𝑖𝑖be more𝛽𝛽0 likely𝛽𝛽1𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 to quit to𝑖𝑖 begin𝑒𝑒𝑖𝑖 with

 Want to evaluate the effect of a new primary care model (rolled out for some patients at a facility) on patient satisfaction – = + + – Problem: Patients who don’t like the new program stop coming𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 to the 𝑖𝑖facility𝛽𝛽0 and𝛽𝛽1 𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚receive𝑖𝑖 their𝑒𝑒𝑖𝑖 care elsewhere

21 Sample Selection: Solutions

 Randomized controlled trial  – More on this in the Natural Experiments and Difference- in-Differences lecture on April 8  Sample selection and treatment effect models – For more information: . Greene, 2000 Chapter 20 . Wooldridge, 2010, Chapter 17  Instrumental variables regression – More on this in the Instrumental Variables Regression lecture on April 22

22 Simultaneous Causality

 Arises when: – There is a causal link from X to Y – There is also a causal link from Y to X  Also called simultaneous equations bias  Leads to be biased – Reverse causality leads to pick up both ̂1 effects𝛽𝛽 𝛽𝛽̂1 23 Simultaneous Causality: Example

 We want to estimate the effect of primary care visits on health

= + +

 If a person’s health𝑖𝑖 0 also affects1 the𝑖𝑖 likelihood𝑖𝑖 of makingℎ𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒 a primary� 𝛽𝛽 care visit:𝛽𝛽 𝑝𝑝𝑝𝑝 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑠𝑠 𝑒𝑒

= + +

 Both equations 𝑖𝑖are necessary0 1 to understand𝑖𝑖 𝑖𝑖 the relationship𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 between𝑠𝑠 𝛾𝛾 primary𝛾𝛾 ℎ𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒 care� visits𝜀𝜀 and health

24

Simultaneous Causality: Example (2)

 We now have two simultaneous equations:

= + + (1) = + + (2) ℎ𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒�𝑖𝑖 𝛽𝛽0 𝛽𝛽1 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑖𝑖 𝑒𝑒𝑖𝑖  Suppose a positive error𝑝𝑝𝑝𝑝𝑝𝑝 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 leads𝑠𝑠𝑖𝑖 𝛾𝛾to0 a higher𝛾𝛾1ℎ𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒 value�𝑖𝑖 𝜀𝜀 of𝑖𝑖 (better health)

𝑖𝑖 𝑖𝑖 𝑒𝑒 = + + (1)ℎ𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒 �

 If < 0, then a higherℎ𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒 value�𝑖𝑖 of𝛽𝛽 0 𝛽𝛽1𝑝𝑝𝑝𝑝 𝑝𝑝leads𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑠𝑠𝑖𝑖 to a𝑒𝑒𝑖𝑖 lower value of

1 <0 𝑖𝑖 𝑖𝑖 𝛾𝛾 = ℎ𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒+ � + (2) 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑠𝑠

 Therefore, a positive error𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 𝑠𝑠 leads𝑖𝑖 𝛾𝛾0 to a𝛾𝛾 1lowerℎ𝑒𝑒𝑒𝑒𝑒𝑒𝑒𝑒� value𝑖𝑖 𝜀𝜀𝑖𝑖 of – ↑ → ↓ – and are correlated𝑒𝑒𝑖𝑖 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑖𝑖 𝑖𝑖 𝑖𝑖 – 𝑒𝑒 is biased𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑠𝑠𝑖𝑖 𝑒𝑒𝑖𝑖 𝛽𝛽̂1

25 Simultaneous Causality: Solutions

 Randomized controlled trial where the reverse causality channel is eliminated  Natural experiment – More on this in the Natural Experiments and Difference-in-Differences lecture on April 8  Instrumental variables regression – Utilize an instrumental variable that is correlated with but is uncorrelated with the error term (does not otherwise determine ) – More𝑋𝑋 on this in the Instrumental Variables Regression lecture on April𝑌𝑌 22

26 Summary

 Good research design requires an understanding of how the dependent variable is determined  Need to ask: is the explanatory variable of interest exogenous? – Are there omitted variables? – Is there sample selection? – Is there simultaneous causality?  Exogeneity is necessary for the estimation of a causal treatment effect  Understanding sources of endogeneity can: – Help us understand what our regression estimates actually estimate and the limitations of our analyses – Can point us to appropriate methods to use to answer our research question

27 Resources

 Stock and Watson, Introduction to , 3rd edition (2011)  Green, Econometric Analysis, 7th edition (2012)  Wooldridge, Econometric Analysis of Cross Section and Panel Data, 2nd edition (2010)

28