Applied Econometrics for Health Economists

Exercise 0 – Preliminaries

The data file hals1class.dta contains the following variables:

age       age in years
male      dummy for men
white     dummy for ethnic group = white
aglsch    age at which R left school (proxy for education)
rheuma    dummy whether R suffers from arthritis/rheuma
prheuma   dummy whether either of R’s parents suffered from arthritis/rheuma
ownh      self-rated general health
breakhot  how often R eats cooked breakfast
tea       how many cups of tea R drinks per day
teasug    how many spoons of sugar R puts into tea
coffee    how many cups of coffee R drinks per day

Stata syntax needed: use, save

1 Exercise 1 – Is rheumatism inherited?

1. Load the data, restrict the sample to respondents aged 40 and over, and compute summary statistics for the first six variables. What percentage of respondents suffer from rheuma?

2. Compute a linear probability model (with heteroskedasticity-robust standard errors) for rheuma using age, age squared, sex, ethnic group, education, and parental rheuma as explanatory variables.

• Interpret your results with respect to all explanatory variables.
• Show the predicted values in a histogram. How do you interpret the predicted values? Which problem of the linear probability model do you see in your histogram?
• Show the estimated residuals in a histogram. Which OLS assumptions are violated in the linear probability model?

3. Compute the above as a probit model – how do you interpret the estimated regression coefficients?

4. Test whether the deterministic part of your model is "correctly" specified.

5. Compute marginal/average effects at the median of the explanatory variables – comment on the size of the average effect for prheuma.

6. Compute the marginal/average effects for aglsch and prheuma as the median of the individual marginal effects – do they differ from the effects computed above?

7. Compute the above as a logit model and report odds ratios. How much does the "risk" of rheumatism change when people get one year older? How much does it change when people get two years older?

8. Assume that prheuma was a genetic marker of rheumatism and you wanted to study its effect.

• Draw a random sample of the general population (aged 40+) of the same size as your sample of respondents who reported having rheuma.
• Compute a probit model and report marginal effects at the median. Compare the results with those based on the full HALS sample.
• Compute a logit model and report odds ratios. Compare the results with those based on the full HALS sample.

Stata syntax needed: summarize, regress, generate, histogram, predict, probit, fracgen, testparm, mfx, logistic, functions normalden, normal
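
The commands above can be combined roughly as follows. This is only a sketch of one possible workflow, not a model solution: the names agesq, phat_lpm, res_lpm, xb_probit, me_aglsch and me_prheuma are illustrative and do not appear in the data file. Note that mfx comes from older Stata releases; in Stata 11 and later, margins is its replacement.

```stata
* Sketch for Exercise 1 (all generated variable names are illustrative)
use hals1class.dta, clear
keep if age >= 40 & !missing(age)
summarize age male white aglsch rheuma prheuma
generate agesq = age^2

* Linear probability model with heteroskedasticity-robust standard errors
regress rheuma age agesq male white aglsch prheuma, vce(robust)
predict phat_lpm, xb
histogram phat_lpm
predict res_lpm, residuals
histogram res_lpm

* Probit and marginal effects evaluated at the median of the regressors
probit rheuma age agesq male white aglsch prheuma
mfx compute, at(median)

* Individual marginal effects: phi(x'b)*b for the continuous regressor aglsch,
* and a discrete change in Phi(x'b) for the dummy prheuma
predict xb_probit, xb
generate me_aglsch = normalden(xb_probit) * _b[aglsch]
generate me_prheuma = normal(xb_probit + _b[prheuma]*(1 - prheuma)) ///
    - normal(xb_probit - _b[prheuma]*prheuma)
summarize me_aglsch me_prheuma, detail

* Logit reported as odds ratios
logistic rheuma age agesq male white aglsch prheuma
```

The median of the individual effects is shown in the p50 row of the summarize, detail output.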

2 Exercise 2 – Who eats a good breakfast?

The variable breakhot indicates how often R has a cooked (English) breakfast (first meal after getting up):

0 never
1 less than once a week
2 once or twice a week
3 most days (3-6)
4 every day

1. Show frequencies of breakhot. The value labels are wrong and there are missing values. Please correct both problems and show the table of frequencies again.

2. Estimate a regression model to explain breakfast eating habits by age, age squared, sex, education, and general health – which method is appropriate?

3. Compute the full set of marginal effects for education – if possible in a forval loop. Where does education have its impact on eating habits?

4. Describe the relationship between age and eating habits – when do people have cooked breakfast most often?

Stata syntax needed: label define, recode, oprobit, forval
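
A possible outline of the steps in this exercise is sketched below. It rests on two assumptions that should be checked against the data: that every value of breakhot outside 0 to 4 is a missing-value code, and that ownh can enter the model as it is coded. The names breakhot_lbl and agesq are illustrative.

```stata
* Sketch for Exercise 2 (assumption: values outside 0-4 are missing codes)
use hals1class.dta, clear
tabulate breakhot, missing
replace breakhot = . if !inrange(breakhot, 0, 4)

* Attach corrected value labels (label name is illustrative)
label define breakhot_lbl 0 "never" 1 "less than once a week" ///
    2 "once or twice a week" 3 "most days (3-6)" 4 "every day"
label values breakhot breakhot_lbl
tabulate breakhot, missing

* Ordered probit is one natural candidate for an ordinal outcome
generate agesq = age^2
oprobit breakhot age agesq male aglsch ownh

* Marginal effects of education on the probability of each outcome
forvalues k = 0/4 {
    mfx compute, predict(pr outcome(`k')) varlist(aglsch)
}
```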

3 Exercise 3 – Tea with sugar?

The variable tea indicates how many cups of tea R has per day:

0 none
1 one or two
2 three or four
3 five or six
4 more than six

The variable teasug indicates how many spoons of sugar R has in their tea:

0 none
1 1 or less
2 more than 1, up to 2
3 more than 2

Again, note that the value labels are wrong.

1. Using tea and teasug, generate a new variable that has the value 0 if R doesn’t drink tea, 1 if R drinks tea without sugar, and 2 if R drinks tea with sugar. Show the distribution of this variable in a table.

2. Estimate a regression model to explain tea drinking habits by age, age squared, sex, education, and whether they have cooked breakfast – which method is appropriate?

3. When using non-tea drinkers as baseline category, how do you interpret your regression parameters?

4. Compute the full set of marginal/average effects for all variables – again in a forval loop. What happens if you add up marginal effects for a variable across outcomes? How do you interpret the marginal/average effects?

5. In your own words, describe the meaning of the IIA assumption implied by the multinomial logit model for the present application. How do you test this assumption? Explain the logic of the test. Does the IIA assumption hold in the present example?
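
As a hint for question 4, the adding-up property of the marginal effects follows directly from the fact that the outcome probabilities sum to one. The notation below (outcomes j = 0, 1, 2, regressor x_k) is introduced here for illustration only.

```latex
\sum_{j=0}^{2} P(y = j \mid x) = 1
\quad\Longrightarrow\quad
\sum_{j=0}^{2} \frac{\partial P(y = j \mid x)}{\partial x_k}
  = \frac{\partial}{\partial x_k} \sum_{j=0}^{2} P(y = j \mid x)
  = 0 .
```

A regressor can therefore only shift probability mass between outcomes: an increase in one probability must be offset by decreases in the others.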

Stata syntax needed: mlogit, hausman, suest, estimates store, test
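
One way the listed commands might fit together is sketched below. It assumes that missing values in tea, teasug and breakhot have already been set to Stata missing, and the names teatype, cooked, agesq, full and restricted are illustrative. The Hausman comparison shown is only one of the possible IIA tests; it re-estimates the model without one outcome and compares the remaining coefficients.

```stata
* Sketch for Exercise 3 (assumption: missing codes already set to .)
use hals1class.dta, clear
generate agesq = age^2
generate cooked = breakhot > 0 if !missing(breakhot)

* 0 = no tea, 1 = tea without sugar, 2 = tea with sugar
generate teatype = .
replace teatype = 0 if tea == 0
replace teatype = 1 if tea > 0 & !missing(tea) & teasug == 0
replace teatype = 2 if tea > 0 & !missing(tea) & teasug > 0 & !missing(teasug)
tabulate teatype

* Multinomial logit with non-tea drinkers as baseline category
mlogit teatype age agesq male aglsch cooked, baseoutcome(0)
estimates store full

* Marginal effects on the probability of each outcome
forvalues k = 0/2 {
    mfx compute, predict(pr outcome(`k'))
}

* Hausman-type IIA test: drop outcome 2 and compare the common coefficients
mlogit teatype age agesq male aglsch cooked if teatype != 2, baseoutcome(0)
estimates store restricted
hausman restricted full, alleqs constant
```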

4 Exercise 4 – Tea or coffee?

The variable coffee indicates how many cups of coffee R has per day (0=none, 1=one or two, 2=three or four, 3=five or six, 4=more than six). Value labels are again wrong.

1. Generate dummy variables for tea and coffee drinkers and show a crosstabulation of the two variables. Based on this table, would you say that tea and coffee are complements or substitutes?

2. Estimate a regression model explaining the consumption of tea and coffee jointly (using the same explanatory variables as before). How does education affect preference for tea and coffee? How do you interpret the parameter rho?

3. Compute marginal effects of education and answer the following questions. How does education affect the probability of drinking ...

• ... tea?
• ... neither tea nor coffee?
• ... tea but no coffee?
• ... tea when R drinks coffee?

Stata syntax needed: biprobit
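
The four probabilities asked for in question 3 correspond to different predict statistics after biprobit. A minimal sketch follows, assuming missing values in tea, coffee and breakhot are already coded as Stata missing; teadrk, cofdrk, cooked and agesq are illustrative names.

```stata
* Sketch for Exercise 4 (generated variable names are illustrative)
use hals1class.dta, clear
generate agesq = age^2
generate cooked = breakhot > 0 if !missing(breakhot)
generate teadrk = tea > 0 if !missing(tea)
generate cofdrk = coffee > 0 if !missing(coffee)
tabulate teadrk cofdrk, row

* Bivariate probit: the same regressors enter both equations
biprobit teadrk cofdrk age agesq male aglsch cooked

* Marginal effects of education on different probabilities
mfx compute, predict(pmarg1) varlist(aglsch)   // Pr(tea = 1)
mfx compute, predict(p00) varlist(aglsch)      // Pr(tea = 0, coffee = 0)
mfx compute, predict(p10) varlist(aglsch)      // Pr(tea = 1, coffee = 0)
mfx compute, predict(pcond1) varlist(aglsch)   // Pr(tea = 1 | coffee = 1)
```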

5 Exercise 5 – Education and Health

Continue using hals1class.dta.

1. You want to estimate the causal effect of education on health. What are the problems with estimating a simple regression of health on education and how can you solve them?

2. Recode aglsch so that it becomes a binary treatment (low education = up to 14 years, high education = 15 or more years). Show a simple cross-tabulation of self-assessed health and education. What is the average effect of education on health? Confirm your guess with a simple probit regression of health on education.

3. Show a scatterplot of the proportion of respondents with high education by age. Which identification strategy does this graph suggest?

4. Compare the results of a simple probit regression of self-assessed health on age, age squared, sex, ethnicity and education with the results of a recursive bivariate probit model. What is the average treatment effect? What is the average treatment effect on the treated? How do you interpret the parameter ρ?

Stata syntax needed: tabulate, biprobit, egen, twoway scatter
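
The treatment dummy and the scatterplot of questions 2 and 3 could be produced along these lines. The names highedu and p_highedu are illustrative.

```stata
* Sketch for Exercise 5, questions 2 and 3 (names are illustrative)
use hals1class.dta, clear

* Binary treatment: 1 = left school at 15 or later, 0 = at 14 or earlier
generate highedu = aglsch >= 15 if !missing(aglsch)
tabulate ownh highedu, column

* Proportion of respondents with high education at each age
egen p_highedu = mean(highedu), by(age)
twoway scatter p_highedu age
```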

6 Exercise 6 – Tea and coffee: harmful to your health?

The data file hals2class.dta contains the following variables:

age2    age in years in wave 2 of HALS
visits  number of GP visits in last month
sah2    self-rated general health in wave 2
dis     number of conditions R ever had
sym     number of health symptoms in last month

1. Merge the HALS wave 2 data with the wave 1 data. How many respondents have been "lost" between the first and second wave of HALS?

2. Draw spikeplots of the count variables in your data.

3. Compute mean and variance for all count variables. Discuss the results.

4. You want to estimate the effect of tea and coffee consumption in wave 1 on number of symptoms in wave 2 using age (in wave 2), sex, education, and ethnicity as control variables. Choose the most appropriate estimator from the following list: OLS, poisson, negbin, zero-inflated negbin.

5. Report and interpret marginal/average effects for the most appropriate model.

Stata syntax needed: use, save, merge, spikeplot, summarize, poisson, nbreg, zinb
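
A rough sketch of the merge and the first model comparisons is given below. It assumes that both files contain the person identifier serno, which the variable lists above do not state explicitly, and it uses the merge 1:1 syntax of Stata 11 and later. The categorical variables tea and coffee enter as coded, purely for brevity.

```stata
* Sketch for Exercise 6 (assumption: serno identifies persons in both files)
use hals1class.dta, clear
merge 1:1 serno using hals2class.dta
tabulate _merge            // _merge == 1: observed in wave 1 only
keep if _merge == 3

* Distribution, mean and variance of the count variables
spikeplot sym
tabstat visits dis sym, statistics(mean variance)

* Poisson versus negative binomial; the nbreg output includes
* a likelihood-ratio test of alpha = 0 (no overdispersion)
poisson sym tea coffee age2 male aglsch white
nbreg sym tea coffee age2 male aglsch white
```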

7 Exercise 7 – Is subjective health a good predictor of mortality?

The data file hals3class.dta contains the following variables:

serno     HALS person identifier
hals1age  age in years at HALS wave 1
dead      vital status (1=dead)
lifespan  survival until wave 3 in years

1. Merge HALS wave 1, 2 and 3 data. For how many wave 1 respondents is the vital status unknown? Drop all cases with unknown vital status.

2. Define the data set as survival data.

3. Compare Kaplan-Meier survival rates by self-rated health in wave 1 and sex. Which group has the highest life expectancy?

4. Compare hazard rates of dying for men and women. Are hazard rates dependent on duration? If so, how?

5. Only looking at respondents who took part in wave 2: estimate a parametric survival model using sex, education, ethnicity, self-rated health in wave 2, the number of diseases, symptoms, and doctor visits. Interpret the results. Comment on the value of subjective health as a predictor of mortality.

6. Re-estimate the previous model using a semi-parametric Cox model. Why is this model called semi-parametric?

Stata syntax needed: merge, stset, sts, streg, stcox
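
The sequence of commands might look as follows. The sketch assumes that serno is present in all three files and that unknown vital status is stored as a missing value of dead; m_w2 and m_w3 are illustrative names, and the Weibull distribution is just one of the parametric choices streg offers.

```stata
* Sketch for Exercise 7 (assumptions: serno in all files, unknown status = .)
use hals1class.dta, clear
merge 1:1 serno using hals2class.dta, generate(m_w2)
merge 1:1 serno using hals3class.dta, generate(m_w3)
count if missing(dead)
drop if missing(dead)

* Declare survival data: duration in years, failure event is death
stset lifespan, failure(dead == 1)

* Kaplan-Meier survivor functions and smoothed hazard estimates
sts graph, by(ownh male)
sts graph, hazard by(male)

* Parametric and semi-parametric models for wave 2 participants only
streg male aglsch white sah2 dis sym visits if m_w2 == 3, distribution(weibull)
stcox male aglsch white sah2 dis sym visits if m_w2 == 3
```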

8 Exercise 8 – The dynamics of weight and obesity

The data file halspanel.dta contains the following variables:

serno    HALS person identifier
wave     wave identifier: 0=wave 1, 1=wave 2
age      age
male     dummy for sex
weightm  measured weight in kg
heightm  measured height in cm (measured in wave 1)
aglsch   age left school
unempl   dummy for unemployment

1. Declare the data to be panel data. Which variables are time-invariant? Drop all respondents who were younger than 65 in wave 1.

2. For each respondent and wave compute the body mass index (weight in kg divided by squared height in m; note that heightm is recorded in cm). Show the distribution in wave 1 in a histogram.

3. In order to estimate regressions of bmi on age, sex, education, and unemployment, generate an indicator variable for balanced panel data.

4. Estimate a pooled OLS model (for the balanced panel) with clustered standard errors. Comment on the results.

5. Estimate random effects and fixed effects models. Use a Hausman test to determine the correct specification. Comment on the results.

6. Generate an indicator variable for obesity (bmi > 30). How has the prevalence of obesity increased over time? Show a transition matrix for obesity.

7. Re-estimate the previous panel model as a random effects probit and conditional logit model. Why does the conditional logit model have so few observations left?

Stata syntax needed: xtset, xtreg, xttest0, hausman, xttrans, xtprobit, clogit
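
The main steps can be sketched as below. The names bmi, nobs, balanced, obese, fe and re are illustrative, and the line that carries the wave 1 height forward assumes heightm may be missing in the wave 2 records; it can be dropped if height is already filled in for both waves.

```stata
* Sketch for Exercise 8 (generated variable names are illustrative)
use halspanel.dta, clear
xtset serno wave

* Assumption: height was measured in wave 1 only, so carry it forward
bysort serno (wave): replace heightm = heightm[1] if missing(heightm)

* BMI = weight in kg / (height in m)^2; heightm is in cm
generate bmi = weightm / (heightm/100)^2
histogram bmi if wave == 0

* Balanced-panel indicator: BMI observed in both waves
bysort serno: egen nobs = count(bmi)
generate balanced = nobs == 2

* Pooled OLS with standard errors clustered on the person
regress bmi age male aglsch unempl if balanced, vce(cluster serno)

* Fixed versus random effects, compared with a Hausman test
xtreg bmi age male aglsch unempl if balanced, fe
estimates store fe
xtreg bmi age male aglsch unempl if balanced, re
estimates store re
hausman fe re

* Obesity indicator and transition matrix
generate obese = bmi > 30 if !missing(bmi)
xttrans obese
```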