AFRICAN ECONOMIC RESEARCH CONSORTIUM (AERC)

COLLABORATIVE MASTERS DEGREE PROGRAMME (CMAP) IN ECONOMICS FOR SUB-SAHARAN AFRICA

JOINT FACILITY FOR ELECTIVES

Teaching Module Materials ECON 562: Theory and Practice II (Microeconometrics)

(Revised: August, 2020)


Copyright © 2020 African Economic Research Consortium (AERC), All Rights Reserved

Our mailing address is: African Economic Research Consortium (AERC) 3rd Floor, Middle East Bank Towers, Jakaya Kikwete Road P. O. Box 62882 00200 Nairobi Kenya

TOPIC 2.1. LINEAR PANEL MODELS (16 HOURS)

Introduction

The wide availability of panel data presents unique opportunities for researchers. It is essential to understand the intuition and implications of the estimators and tests that are currently available to successfully take advantage of panel data. This Topic will focus on core linear panel data techniques, building a strong foundation before moving into more cutting-edge methods that have been recently developed.

Objectives of the Topic:

After completing this Topic students will be able to:

• Understand the nuances of linear panel data estimators and the empirical implications that manifest;
• Understand the strengths and weaknesses of alternative approaches to estimation and testing of linear panel data models;
• Specify and estimate linear panel data models and perform relevant diagnostic tests; and
• Successfully integrate data into STATA and construct appropriate linear panel data models which they can estimate, conduct inference on and rigorously interpret to provide sound policy insights.

2.1.1. Regression with Pooled/Cross-Section Data

a. Types of Data

i. Cross-section data

These are data for which values of variables are collected for several sample units/economic entities (workers, households, firms, cities, countries, etc.) at the same point in time. Order of data does not matter. These are usually survey data.


Examples of well-known cross-sectional databases can be found at the following link: https://microdata.worldbank.org/index.php/home. They include, among others:

• Living Standards Measurement Study surveys,
• UNICEF Multiple Indicator Cluster Surveys (MICS),
• Demographic and Health Surveys (DHS),
• Enterprise Surveys,
• Integrated living conditions surveys, etc.

ii. Time series data

Data for a single entity (person, firm, country) collected at multiple time periods. Repeated observations of the same variables (GDP, prices, etc.). Order of data is important. Observations are typically not independent over time.

iii. Panel data (cross-section and time series)

Data for multiple entities (individuals, firms, countries) in which outcomes and characteristics of each entity are observed at multiple points in time. Panel data are also called "pooled data" (pooling of time series and cross-section observations). An example of the structure of a panel data set is as follows, where id is the variable identifying the individual that we follow over time; yr92, yr93 and yr94 are time dummies constructed from the year variable; x1 is an example of a time-varying variable and x2 is an example of a time-invariant variable:

Table 1.1. Example of panel data

id    year    yr92    yr93    yr94    x1    x2
1     1992    1       0       0       8     1
1     1993    0       1       0       12    1
1     1994    0       0       1       10    1
2     1992    1       0       0       7     0
2     1993    0       1       0       5     0
2     1994    0       0       1       3     0
(…)   (…)     (…)     (…)     (…)     (…)   (…)

Regressions based on such data are called panel data regression models or panel data models.

Examples

Examples of panel data include: Gravity model of trade, where you observe trade figures for different countries/products over time; investment model, where your cross-sections are the firms observed over time; and examining the determinants of economic growth for Sub-Saharan African countries over the period 1960-2019, etc.

b. Structures of panel data

Panel data may have the following structure:

Cross-section oriented panel data or micro-panel data: a panel for which the time dimension T is much smaller than the individual dimension N: T ≪ N. An example of this type of data is the University of Michigan's Panel Study of Income Dynamics (PSID), with 15,000 individuals observed since 1968.

Time-series oriented panel data or macro-panel data: A macro-panel data set is a panel for which the time dimension T is similar to the individual dimension N: T~N. This is quite common in macroeconomics.


Balanced and unbalanced panel data: A panel is said to be balanced if we have the same time periods, t = 1, 2, …, T, for each cross-section observation. For an unbalanced panel, the time dimension, denoted T_i, is specific to each individual.

c. Advantages of panel data

Panel data sets for economic research possess several major advantages over conventional cross- sectional or time-series data sets. These include:

Advantage 1: More informative data, more variability

Panel data give more informative data, more variability, less collinearity among the variables, more degrees of freedom and more efficiency. It is well known that time-series studies are plagued with multicollinearity; this is less likely with panel data since the cross-section dimension adds a lot of variability, adding more informative data. In fact, because of the double dimension, the variation in the data can be decomposed into variation between cross-section units, and variation within cross-section units. With additional, more informative data one can produce more reliable parameter estimates.

Advantage 2: New economic questions

Panel data are able to identify and measure effects that are simply not detectable in pure cross- section or pure time-series data. They allow a researcher to analyze a number of important economic questions that cannot be addressed using cross-sectional or time-series data sets. For instance, with panel data, it’s possible to ask how certain effects evolve over time (e.g. time trend in dependent variable; or changes in the coefficients of the model). Panel data also enable us to estimate dynamic equations (e.g. specifications with lagged dependent variables on the right-hand side).


Advantage 3: Panel data allow us to control for unobservable components

Panel data allow us to control for omitted variables (variables that are unobserved or measured with error). Thus, panel data models are able to control for individual-specific, time-invariant, unobserved heterogeneity, the presence of which could lead to bias in standard estimators like OLS.

Advantage 4: Panel data are better able to study dynamics of adjustment. For example, spells of unemployment, job turnover, residential and income mobility are better studied with panels. Panel data are also well suited to study the duration of economic states like unemployment and poverty, and if these panels are long enough, they can shed light on the speed of adjustments to economic policy changes.

d. Issues involved in using panel data

The main issue involved in utilizing panel data is heterogeneity. Individual- or time-specific effects that exist among cross-sectional or time-series units but are not captured by the included explanatory variables give rise to parameter heterogeneity in the model specification. Ignoring such heterogeneity could lead to inconsistent or meaningless estimates of parameters.

Figure 1.1: Homogeneous slope but heterogeneous intercepts


Figure 1.2: Heterogeneous intercepts and slopes

In these graphs, the broken-line ellipses represent the point scatter for an individual over time, and the broken straight lines represent the individual regressions. Solid lines serve the same purpose for the least-squares regression using all NT observations. All of these figures depict situations in which biases arise in pooled least-squares estimates because of heterogeneous intercepts and/or heterogeneous slopes. In these cases, a pooled regression model that ignores the heterogeneous intercepts should not be used.

To understand the heterogeneity issue in panel data model specification and estimation, let us consider a production function (Cobb-Douglas) with two factors (labor and capital) in logarithms. We have N countries and T periods. Let us denote:

푦푖푡 = 훼푖 + 훽푖푘푖푡 + 훿푖푙푖푡 + 휀푖푡

In this specification, the elasticities 훽푖 and 훿푖 are specific to each country. Several alternative specifications can be considered. First, we can assume that the production function is the same for all countries; in this case we have a homogeneous specification:

푦푖푡 = 훼 + 훽푘푖푡 + 훿푙푖푡 + 휀푖푡, where 훼푖 = 훼, 훽푖 = 훽, and 훿푖 = 훿.


However, a homogeneous specification of the production function for macro aggregated data is meaningless. Alternatively, we can have a specification with individual effects 훼푖 and common slope parameters (elasticities 훽 and 훿):

푦푖푡 = 훼푖 + 훽푘푖푡 + 훿푙푖푡 + 휀푖푡, where 훽푖 = 훽, and 훿푖 = 훿.

Finally, we can assume that the labor and/or capital elasticities are different across countries. In this case, we will have a heterogeneous specification of the panel data model (heterogeneous panel):

푦푖푡 = 훼푖 + 훽푖푘푖푡 + 훿푖푙푖푡 + 휀푖푡

In this case of a heterogeneous panel, there are two solutions for estimating the parameters. The first solution consists in estimating N separate time-series models to produce group-specific estimates of the elasticities. The second solution consists in considering a random coefficient model. In this case, we assume that the parameters β_i and δ_i are randomly distributed, with for instance:

\beta_i \sim \mathcal{N}(\bar{\beta}, \sigma_\beta^2), and \delta_i \sim \mathcal{N}(\bar{\delta}, \sigma_\delta^2).

It is important to note that ignoring heterogeneity (in slope and/or constant) in panel data regression models could lead to inconsistent or meaningless estimates of parameters.

e. Specification tests

Assume the following linear panel data regression model:

y_{it} = \alpha_i + \beta_i' x_{it} + \varepsilon_{it}, \quad i = 1, 2, \ldots, N; \; t = 1, 2, \ldots, T

where \beta_i' = (\beta_{1i}, \beta_{2i}, \ldots, \beta_{ki}) is a vector of parameters, and x_{it}' = (x_{1it}, x_{2it}, \ldots, x_{kit}) is a vector of k explanatory variables.


According to Hsiao (2014), two aspects of the estimated regression coefficients can be tested:
i. The homogeneity of regression slope coefficients; and
ii. The homogeneity of regression intercept coefficients.

The test procedure proposed by Hsiao (2014) has three main steps:

Step 1: Test whether or not slopes and intercepts simultaneously are homogeneous among different individuals at different times. This is the hypothesis of overall homogeneity, and the model estimated under this hypothesis is the pooled model:

푦푖푡 = 훼 + 훽′푥푖푡 + 휀푖푡,

Step 2: Test whether or not the regression slopes collectively are the same, given that intercepts vary across individuals. Under this hypothesis, the model estimated is called an individual effects model.

푦푖푡 = 훼푖 + 훽′푥푖푡 + 휀푖푡

Step 3: Test whether or not the regression intercepts are the same, given that the slopes of the regression are also the same.

It is obvious that if the hypothesis of overall homogeneity (Step 1) is accepted, the testing procedure will go no further. However, should the overall homogeneity hypothesis be rejected, the second step of the analysis is to decide if the regression slopes are the same. If this hypothesis of homogeneity is not rejected, one then proceeds to the third and final test to determine the equality of regression intercepts.

In step 1, the null and alternative hypotheses are written as:

H_0^1: \alpha_1 = \alpha_2 = \cdots = \alpha_N, \quad \beta_1 = \beta_2 = \cdots = \beta_N

H_1: \alpha_i \neq \alpha_j \;\text{and}\; \beta_i \neq \beta_j


With the null hypothesis of common intercept and slope, we have (푘 + 1)(푁 − 1) linear restrictions. The Fisher test is given as:

F_1 = \frac{(RSS_R^1 - RSS_U)/[(N-1)(k+1)]}{RSS_U/[NT - N(k+1)]}

The restricted residual sum of squares (RSS_R^1) is from the estimation of the pooled model under the hypothesis of overall homogeneity. The unrestricted residual sum of squares is given by RSS_U = \sum_{i=1}^{N} RSS_i, where RSS_i is the residual sum of squares for each cross-section i using a time-series model.

If F_1, with (N-1)(k+1) and N(T-k-1) degrees of freedom, is not significant, we pool the data and estimate a pooled model. If the F ratio is significant, a further attempt is usually made to find out if the non-homogeneity can be attributed to heterogeneous slopes or heterogeneous intercepts.
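A minimal sketch (not part of the original module) of how F_1 might be computed in Stata, written as a do-file fragment and assuming a balanced panel with hypothetical variables id (numeric), y, x1 and x2 (so k = 2); the restricted RSS comes from the pooled regression and the unrestricted RSS from N separate time-series regressions:

quietly regress y x1 x2               // restricted (pooled) model
scalar rss_r  = e(rss)
scalar nt_obs = e(N)
quietly tabulate id                   // number of cross-section units
scalar n_units = r(r)
scalar rss_u = 0                      // unrestricted RSS: sum over unit-by-unit regressions
levelsof id, local(units)
foreach i of local units {
    quietly regress y x1 x2 if id == `i'
    scalar rss_u = rss_u + e(rss)
}
scalar F1 = ((rss_r - rss_u)/((n_units - 1)*3)) / (rss_u/(nt_obs - n_units*3))
display "F1 = " F1 "   p-value = " Ftail((n_units - 1)*3, nt_obs - n_units*3, F1)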

If the null hypothesis in step 1 is rejected, in step 2, we test the hypothesis of homogeneous slopes, without putting any restriction on the intercepts (heterogeneous intercepts).

In step 2, the null and alternative hypotheses are written as:

H_0^2: \beta_1 = \beta_2 = \cdots = \beta_N \;\text{given}\; \alpha_i \neq \alpha_j

H_1: \beta_i \neq \beta_j \;\text{given}\; \alpha_i \neq \alpha_j

Under the hypothesis of homogeneous slopes but heterogeneous intercepts, we have (N-1)k linear restrictions. The Fisher statistic is then given by the following, where RSS_R^2 is the residual sum of squares from the individual effects model; RSS_U remains defined as in Step 1.

F_2 = \frac{(RSS_R^2 - RSS_U)/[(N-1)k]}{RSS_U/[NT - N(k+1)]}


If F_2 with (N-1)k and [NT - N(k+1)] degrees of freedom is significant, the test sequence is naturally halted. If F_2 is not significant, we can then determine the extent to which non-homogeneities can arise in the intercepts.

In step 3, the null hypothesis of homogeneity of intercepts is tested as follows:

H_0^3: \alpha_1 = \alpha_2 = \cdots = \alpha_N \;\text{given}\; \beta_1 = \beta_2 = \cdots = \beta_N. This gives (N-1) linear restrictions.

H_1: \alpha_i \neq \alpha_j \;\text{given}\; \beta_1 = \beta_2 = \cdots = \beta_N

The Fisher statistic is given by the following, where RSS_R^3 is the residual sum of squares from the pooled model and RSS_U is now the residual sum of squares from the individual effects model:

F_3 = \frac{(RSS_R^3 - RSS_U)/(N-1)}{RSS_U/[N(T-1) - k]}

A significant 퐹3 would confirm that heterogeneity arises in the intercepts, and that the correct specification is:

y_{it} = \alpha_i + \beta' x_{it} + \varepsilon_{it}

f. The pooled data model

A very general model for panel data is written as:

y_{it} = \alpha_i + \beta_i' x_{it} + \varepsilon_{it}, \quad i = 1, 2, \ldots, N; \; t = 1, 2, \ldots, T, where \beta_i' = (\beta_{1i}, \beta_{2i}, \ldots, \beta_{ki}) is a vector of parameters, and x_{it}' = (x_{1it}, x_{2it}, \ldots, x_{kit}) is a vector of k explanatory variables.

A pooled data model is the most restrictive model: it specifies common intercepts and slopes across individuals (cross-sections). The pooled data model ignores the heterogeneity dimension of cross-section units. Under the hypothesis of overall homogeneity, the pooled model is specified as:

푦푖푡 = 훼 + 훽′푥푖푡 + 휀푖푡,

It is assumed that errors are homoskedastic and serially independent both within and between individuals (cross-sections):

Var(\varepsilon_{it}) = \sigma^2

Cov(\varepsilon_{it}, \varepsilon_{js}) = 0 \;\text{when}\; i \neq j \;\text{and/or}\; t \neq s.

If the model is correctly specified and regressors are uncorrelated with the error term, the pooled OLS will produce consistent and efficient estimates for the parameters. The pooled OLS estimates are given by:

\hat{\beta} = \frac{\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar{\bar{x}})(y_{it} - \bar{\bar{y}})}{\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar{\bar{x}})^2}

where \bar{\bar{x}} = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} x_{it} and \bar{\bar{y}} = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} y_{it} are overall means, and \hat{\alpha} = \bar{\bar{y}} - \hat{\beta}\bar{\bar{x}}.
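In Stata, the pooled model is simply OLS on the stacked data; a minimal sketch (not from the original text) with hypothetical variables y, x1, x2 and panel identifier id — the second command, clustering the standard errors by unit, is a common way to relax the assumption of independence within individuals:

. regress y x1 x2
. regress y x1 x2, vce(cluster id)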

2.1.2. Static Panel Models: Fixed and Random Effects Models

In this subtopic, we introduce you to panel data models in which the regression slopes are the same, but intercepts vary across individuals:

푦푖푡 = 훼푖 + 훽′푥푖푡 + 휀푖푡, where 훽′ = (훽1, 훽2, … , 훽푘) is a vector of slope parameters.

This formulation of the model assumes that differences across units can be captured in differences in the constant term. 훼푖 represents individual-specific effects, and each 훼푖 is treated as an unknown parameter to be estimated.


Our problem is that we do not observe \alpha_i; we only know that it is constant over time but varies across individuals. We have two alternative models depending on the assumptions made concerning \alpha_i. The first assumption is that the individual-specific effects \alpha_i are correlated with the explanatory variables, that is, Cov(x_{it}, \alpha_i) \neq 0. This is the assumption made by the Fixed Effects model. The second assumption is that the individual-specific effects \alpha_i are not correlated with the explanatory variables, that is, Cov(x_{it}, \alpha_i) = 0, which is the assumption made by the Random Effects model. We look at these two models in this section.

2.1.2.1. The fixed effects model

Assume this model with individual-specific effects.

푦푖푡 = 훼푖 + 훽푥푖푡 + 휀푖푡

If \alpha_i is uncorrelated with x_{it}, then \alpha_i is just another unobserved factor making up the residual. The fixed effects model, however, assumes that the individual-specific effects \alpha_i are correlated with the explanatory variables. In this case, leaving \alpha_i in the error term can cause serious problems. This, of course, is an omitted variables problem, so we can use some familiar results to understand the nature of the problem. For the single-regressor model, we have:

\hat{\beta}^{OLS} = \beta + \frac{\sum_{i=1}^{N}\sum_{t=1}^{T} x_{it}(\alpha_i + \varepsilon_{it})}{\sum_{i=1}^{N}\sum_{t=1}^{T} x_{it}^2}

Hence, \text{plim}\,\hat{\beta}^{OLS} = \beta + \frac{Cov(x_{it}, \alpha_i)}{\sigma_x^2}, which shows that the OLS estimator is inconsistent unless Cov(x_{it}, \alpha_i) = 0. If x_{it} is positively correlated with the unobserved effect, then there is an upward bias. If the correlation is negative, we get a negative bias. In what follows, we discuss the approaches used by the fixed effects model to account for the individual-specific effects \alpha_i.


i. Least Squares Dummy Variable (LSDV) model

The LSDV model captures the individual-specific effects, 훼푖 with dummy variables. A dummy variable 푑푖 is constructed for each cross-section i. 푑푖 = 1 for individual i, and 0 otherwise.

The LSDV model is written as:

푌 = 푋훽 + 퐷훼 + 휀, where 퐷 = [푑1, 푑2, … , 푑푁] is a (푁푇 × 푁) matrix of dummy variables.

This model is a classical regression model. If the number of cross-sections N is small enough, the model can be estimated by ordinary least squares with K regressors in X and N columns in D, as a multiple regression with K + N parameters. However, if N is very large, this requires estimating a huge number of dummy-variable coefficients: imagine, for example, having 1,000 cross-sections (individuals), which means creating 1,000 dummy variables (and including all N dummies together with an overall constant would create the dummy variable trap). The LSDV model is therefore not practical when the number of cross-sections N is big.
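A minimal Stata sketch (not from the original text) of the LSDV regression, together with two equivalent ways of obtaining the same slope estimates without explicitly creating N dummies, assuming hypothetical variables y, x1, x2 and a numeric panel identifier id:

. regress y x1 x2 i.id            // LSDV: one dummy per cross-section unit
. areg y x1 x2, absorb(id)        // same slopes, with the unit dummies "absorbed"
. xtreg y x1 x2, fe               // same slopes via the within (fixed effects) estimator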

The least squares estimator of 훽 is obtained using a partitioned regression as:

\hat{\beta} = [X' M_D X]^{-1}[X' M_D Y]

where M_D = I - D(D'D)^{-1}D' and I is an identity matrix.

The dummy variable coefficients can be recovered from the other normal equation in the partitioned regression: 퐷′퐷훼 + 퐷′푋훽̂ = 퐷′푌

퐷′퐷훼 = 퐷′푌 − 퐷′푋훽̂

퐷′퐷훼 = 퐷′(푌 − 푋훽̂)

훼 = (퐷′퐷)−1퐷′(푌 − 푋훽̂)


This implies that for each cross-section i: \hat{\alpha}_i = \bar{y}_{i.} - \hat{\beta}'\bar{x}_{i.}

Testing the significance of the individual effects

Assume the model with individual-specific effects:

푦푖푡 = 훼푖 + 훽푥푖푡 + 휀푖푡

We can test the overall significance of the individual-specific effects using a Fisher test. If we are interested in differences across groups, then we can test the hypothesis that the constant terms \alpha_i are all equal with an F test. Under the null hypothesis of equality of the \alpha_i terms, H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_N, the efficient estimator is pooled least squares. The F-ratio used for this test is as follows, where LSDV indicates the dummy variable model and Pooled indicates the pooled model with only a single overall constant term (overall homogeneity hypothesis). Under the null hypothesis, it is the pooled model; under the alternative of distinct \alpha_i terms, we have the fixed effects model.

F = \frac{(R^2_{LSDV} - R^2_{Pooled})/(N-1)}{(1 - R^2_{LSDV})/(NT - N - K)}

If 퐹 is significant, then 훼푖 terms vary across cross-section units. Pooled least squares model should not be used in this case.
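This F test is reported automatically at the foot of Stata's xtreg, fe output ("F test that all u_i=0", as seen in the application later in this Topic). A hedged sketch of obtaining the ingredients explicitly, or the equivalent joint test directly, with hypothetical variables y, x1, x2 and id:

. quietly regress y x1 x2            // pooled model: R-squared stored in e(r2)
. scalar r2_pooled = e(r2)
. quietly regress y x1 x2 i.id       // LSDV model
. scalar r2_lsdv = e(r2)
. testparm i.id                      // joint test that all unit dummies are zero
. * F could also be computed by hand as ((r2_lsdv - r2_pooled)/(N-1)) / ((1 - r2_lsdv)/(NT - N - K))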

ii. The within- and between-group estimators

Assume again the panel data model:

푦푖푡 = 훼푖 + 훽′푥푖푡 + 휀푖푡, where 퐶표푣(푥푖푡, 훼푖) ≠ 0, and 퐶표푣(푥푖푡, 휀푖푡) = 0

In terms of the group means, this model can be written as follows, where \bar{y}_{i.} = \frac{1}{T}\sum_{t=1}^{T} y_{it} and \bar{x}_{i.} = \frac{1}{T}\sum_{t=1}^{T} x_{it} are individual (group) means:

푦̅푖. = 훼푖 + 훽′푥̅푖. + 휀̅푖.


We can also write the model in terms of deviation from the group means:

(푦푖푡 − 푦̅푖.) = 훽′(푥푖푡 − 푥̅푖.) + (휀푖푡 − 휀̅푖.)

Notice that, by taking deviation from the group means, the individual-specific effects, 훼푖 have been wiped out from the equation. This transformation of the original equation is known as the within transformation. We can estimate 훽 consistently by using OLS on the transformed equation. This is called the within estimator or the Fixed Effects estimator. It should be noted however that in the within transformation, all time-invariant explanatory variables are also wiped out.
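A minimal sketch (not from the original text) of the within transformation done "by hand" in Stata, with hypothetical variables y, x and id; the slope should match xtreg, fe, although the standard errors of the manual regression do not adjust for the N estimated group means:

. egen ybar = mean(y), by(id)
. egen xbar = mean(x), by(id)
. gen y_w = y - ybar
. gen x_w = x - xbar
. regress y_w x_w, noconstant      // within (fixed effects) slope
. xtreg y x, fe                    // same slope, with correct degrees of freedom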

Let’s define the matrices of sums of squares and cross products that would be used in each case, where we focus only on estimation of 훽.

The total sums of squares and cross products accumulate variation about the overall means \bar{\bar{x}} and \bar{\bar{y}}:

S_{xx}^{total} = \sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar{\bar{x}})(x_{it} - \bar{\bar{x}})'

S_{xy}^{total} = \sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar{\bar{x}})(y_{it} - \bar{\bar{y}})

If now we use deviations from the group means, the matrices we get, are within-groups sums of squares and cross products:

S_{xx}^{within} = \sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar{x}_{i.})(x_{it} - \bar{x}_{i.})'

S_{xy}^{within} = \sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar{x}_{i.})(y_{it} - \bar{y}_{i.})

Lastly, the moment matrices are the between-groups sums of squares and cross products, if we consider the variation of the group means around the overall means:


S_{xx}^{between} = \sum_{i=1}^{N} T(\bar{x}_{i.} - \bar{\bar{x}})(\bar{x}_{i.} - \bar{\bar{x}})'

S_{xy}^{between} = \sum_{i=1}^{N} T(\bar{x}_{i.} - \bar{\bar{x}})(\bar{y}_{i.} - \bar{\bar{y}})

The between-group estimator is then given by the following, where \bar{\bar{x}} = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} x_{it} and \bar{\bar{y}} = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} y_{it} are the overall means:

\hat{\beta}^{between} = \left[\sum_{i=1}^{N} T(\bar{x}_{i.} - \bar{\bar{x}})(\bar{x}_{i.} - \bar{\bar{x}})'\right]^{-1}\left[\sum_{i=1}^{N} T(\bar{x}_{i.} - \bar{\bar{x}})(\bar{y}_{i.} - \bar{\bar{y}})\right]

The within-group estimator is given by:

\hat{\beta}^{within} = \left[\sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar{x}_{i.})(x_{it} - \bar{x}_{i.})'\right]^{-1}\left[\sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar{x}_{i.})(y_{it} - \bar{y}_{i.})\right]

The least squares estimator is given by:

\hat{\beta}^{total} = \left[\sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar{\bar{x}})(x_{it} - \bar{\bar{x}})'\right]^{-1}\left[\sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar{\bar{x}})(y_{it} - \bar{\bar{y}})\right]

Since total variation is the sum of within-group variation and between-group variation, that is,

S_{xx}^{total} = S_{xx}^{within} + S_{xx}^{between} and S_{xy}^{total} = S_{xy}^{within} + S_{xy}^{between}, we can rewrite the least squares estimator as:

\hat{\beta}^{total} = [S_{xx}^{total}]^{-1} S_{xy}^{total}

\hat{\beta}^{total} = [S_{xx}^{within} + S_{xx}^{between}]^{-1}[S_{xy}^{within} + S_{xy}^{between}]

It can be verified that \hat{\beta}^{within} = [S_{xx}^{within}]^{-1} S_{xy}^{within} \Rightarrow S_{xy}^{within} = S_{xx}^{within}\hat{\beta}^{within}, and that \hat{\beta}^{between} = [S_{xx}^{between}]^{-1} S_{xy}^{between} \Rightarrow S_{xy}^{between} = S_{xx}^{between}\hat{\beta}^{between}. Substituting these into the expression for \hat{\beta}^{total}, we get:

\hat{\beta}^{total} = [S_{xx}^{within} + S_{xx}^{between}]^{-1}[S_{xx}^{within}\hat{\beta}^{within} + S_{xx}^{between}\hat{\beta}^{between}]

\hat{\beta}^{total} = F^{within}\hat{\beta}^{within} + F^{between}\hat{\beta}^{between}, where F^{within} = [S_{xx}^{within} + S_{xx}^{between}]^{-1}S_{xx}^{within} = I - F^{between}.

It can be seen that the least squares estimator is a matrix weighted average of the within- and between-groups estimators.

iii. Panel data model with both individual- and time-specific effects

The least squares dummy variable approach can be extended to include a time-specific effect \delta_t as well. Dummy variables are created for the time periods, in addition to the dummy variables for individuals, and the model is specified as:

푦푖푡 = 휇 + 훼푖 + 훿푡 + 훽′푥푖푡 + 휀푖푡

Least squares estimates of the slopes in this model are obtained by regression of the following variables:

y_{it}^{*} = y_{it} - \bar{y}_{i.} - \bar{y}_{.t} + \bar{\bar{y}} and x_{it}^{*} = x_{it} - \bar{x}_{i.} - \bar{x}_{.t} + \bar{\bar{x}}, where \bar{x}_{.t} = \frac{1}{N}\sum_{i=1}^{N} x_{it} and \bar{y}_{.t} = \frac{1}{N}\sum_{i=1}^{N} y_{it} are period means.

\hat{\beta} = \left[\sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar{x}_{i.} - \bar{x}_{.t} + \bar{\bar{x}})(x_{it} - \bar{x}_{i.} - \bar{x}_{.t} + \bar{\bar{x}})'\right]^{-1}\left[\sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar{x}_{i.} - \bar{x}_{.t} + \bar{\bar{x}})(y_{it} - \bar{y}_{i.} - \bar{y}_{.t} + \bar{\bar{y}})\right]


The overall constant and the dummy variable coefficients can then be recovered from the normal equations as:

휇̂ = 푦̿ − 푥̿′훽̂

\hat{\alpha}_i = (\bar{y}_{i.} - \bar{\bar{y}}) - (\bar{x}_{i.} - \bar{\bar{x}})'\hat{\beta}

\hat{\delta}_t = (\bar{y}_{.t} - \bar{\bar{y}}) - (\bar{x}_{.t} - \bar{\bar{x}})'\hat{\beta}

iv. The first difference estimator

Assume again this panel data model.

y_{it} = \alpha_i + x_{it}\beta + \varepsilon_{it}, where \alpha_i and \varepsilon_{it} are unobserved, with the following assumptions:

퐶표푣(푥푖푡, 훼푖) ≠ 0, and 퐶표푣(푥푖푡, 휀푖푠) = 0 for 푠 = 푡, 푡 − 1 (푤푒푎푘 푒푥표푔푒푛푒푖푡푦)

Instead of time-demeaning the data (which gives the Fixed Effect estimator), we now difference the data:

푦푖푡 − 푦푖푡−1 = (푥푖푡 − 푥푖푡−1)훽 + (휀푖푡 − 휀푖푡−1)

∆푦푖푡 = ∆푥푖푡훽 + ∆휀푖푡

Clearly this removes the individual fixed effects, and so we can obtain consistent estimates of \beta by estimating the equation in first differences by OLS. However, the differencing also wipes out the time-invariant explanatory variables. Note the exogeneity requirement: the differenced equation contains the residuals \varepsilon_{it} and \varepsilon_{i,t-1}, whereas the vector of transformed explanatory variables contains x_{it} and x_{i,t-1}. Hence, we need at least Cov(x_{it}, \varepsilon_{is}) = 0 for s = t, t-1, as assumed above; otherwise there will be endogeneity bias if we estimate the differenced equation using OLS (strict exogeneity is sufficient for this).
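A minimal Stata sketch (not from the original text) of the first-difference estimator using time-series operators, with hypothetical variables y, x, id and year; the constant and any time-invariant regressors drop out of the differenced equation:

. xtset id year
. regress D.y D.x, noconstant
. * the user-written xtivreg2 command used later in this Topic gives the same estimator:
. * ssc install xtivreg2
. * xtivreg2 y x, fd nocons small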

2.1.2.2. The random effects model

Assume again this panel data model.

푦푖푡 = 훼푖 + 푥푖푡훽 + 휀푖푡, with the following assumptions on unobserved terms;


• \alpha_i are assumed to be uncorrelated with x_{it}, that is, Cov(x_{it}, \alpha_i) = 0, and

• 퐶표푣(푥푖푡, 휀푖푠) = 0 for 푠 = 1, 2, … , 푇 (푠푡푟푖푐푡 푒푥표푔푒푛푒푖푡푦)

With these two assumptions, we can put 훼푖 together with 휀푖푡 to form a new error term:

푣푖푡 = 훼푖 + 휀푖푡

We further assume that:

E(\varepsilon_{it}) = E(\alpha_i) = 0

E(\varepsilon_{it}^2) = \sigma_\varepsilon^2

E(\alpha_i^2) = \sigma_\alpha^2

E(\varepsilon_{it}\alpha_j) = 0 for all i, t, and j,

E(\varepsilon_{it}\varepsilon_{js}) = 0 if t \neq s or i \neq j,

E(\alpha_i\alpha_j) = 0 if i \neq j.

E(v_{it}^2) = \sigma_\varepsilon^2 + \sigma_\alpha^2,

E(v_{it}v_{is}) = \sigma_\alpha^2, \quad t \neq s

E(v_{it}v_{js}) = 0 for all t and s if i \neq j.

It can be shown that the newly formed error term is serially correlated:

Cov(v_{it}, v_{i,t-s}) = E[(\alpha_i + \varepsilon_{it})(\alpha_i + \varepsilon_{i,t-s})]

Cov(v_{it}, v_{i,t-s}) = E(\alpha_i\alpha_i) + E(\varepsilon_{it}\varepsilon_{i,t-s}) + E(\alpha_i\varepsilon_{it}) + E(\alpha_i\varepsilon_{i,t-s})

Cov(v_{it}, v_{i,t-s}) = \sigma_\alpha^2 \neq 0


And the correlation between v_{it} and v_{i,t-s} is given by:

corr(v_{it}, v_{i,t-s}) = \frac{Cov(v_{it}, v_{i,t-s})}{\sqrt{\sigma_{v_t}^2}\,\sqrt{\sigma_{v_{t-s}}^2}} = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\varepsilon^2}

The Random Effects (RE) estimator is a Generalized Least Squares (GLS) estimator. GLS estimator involves transforming the original equation, so that the transformed equation fulfils the assumptions underlying the classical model.

The random effects (quasi-demeaning) transformation is done as follows:

• From this model 푦푖푡 = 푥푖푡훽 + 푣푖푡, where 푣푖푡 = 훼푖 + 휀푖푡,

• Define:

\lambda = 1 - \left(\frac{\sigma_\varepsilon^2}{T\sigma_\alpha^2 + \sigma_\varepsilon^2}\right)^{1/2}

• Multiply the individual (time-averaged) version of the equation by \lambda:

휆푦̅푖. = 휆푥̅푖.훽 + 휆푣̅푖.

• Subtract this expression from the original equation:

푦푖푡 − 휆푦̅푖. = (푥푖푡 − 휆푥̅푖.)훽 + (푣푖푡 − 휆푣̅푖.)

Since (v_{it} - \lambda\bar{v}_{i.}) is now serially uncorrelated, using OLS on the transformed equation gives the efficient GLS estimator.
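In Stata the feasible version of this GLS transformation is carried out by xtreg, re. A minimal sketch (hypothetical variables y, x1, x2); the theta option displays the estimated quasi-demeaning parameter, which corresponds to \lambda in the notation used above:

. xtreg y x1 x2, re theta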

The similarity of this procedure to the computation in the within-group model can be noticed, which uses 휆 = 1. It can be shown that the GLS estimator is, like the OLS estimator, a matrix weighted average of the within- and between-units estimators, as earlier described:

훽̂ = 퐹푤푖푡ℎ푖푛훽̂푤푖푡ℎ푖푛 + (퐼 − 퐹푤푖푡ℎ푖푛)훽̂푏푒푡푤푒푒푛


where F^{within} = [S_{xx}^{within} + \theta S_{xx}^{between}]^{-1}S_{xx}^{within}

\theta = \frac{\sigma_\varepsilon^2}{T\sigma_\alpha^2 + \sigma_\varepsilon^2} = (1 - \lambda)^2

If \theta equals 1, then generalized least squares is identical to ordinary least squares. This situation would occur if \sigma_\alpha^2 were zero, in which case a classical regression model would apply. If \theta equals zero, then the estimator is the dummy variable estimator we used in the fixed effects setting. There are two possibilities. If \sigma_\varepsilon^2 were zero, then all variation across units would be due to the different \alpha_i, which, because they are constant across time, would be equivalent to the dummy variables we used in the fixed-effects model. The other possibility arises as T grows large, since \theta then tends to zero and GLS approaches the within estimator.

Let’s now show that the new error term (푣푖푡 − 휆푣̅푖.), is serially uncorrelated, that is:

퐸[(푣푖푡 − 휆푣̅푖.)(푣푖푡−푠 − 휆푣̅푖.)] = 0

It follows that:

E(v_{it}v_{i,t-s}) - \lambda E(v_{it}\bar{v}_{i.}) - \lambda E(v_{i,t-s}\bar{v}_{i.}) + \lambda^2 E(\bar{v}_{i.}^2) = 0

Let’s examine one term at a time:

The first term:

E(v_{it}v_{i,t-s}) = E[(\alpha_i + \varepsilon_{it})(\alpha_i + \varepsilon_{i,t-s})] = E(\alpha_i^2) + E(\varepsilon_{i,t-s}\alpha_i) + E(\varepsilon_{it}\alpha_i) + E(\varepsilon_{it}\varepsilon_{i,t-s}) = \sigma_\alpha^2

The second term:

-\lambda E(v_{it}\bar{v}_{i.}) = -\lambda E[(\alpha_i + \varepsilon_{it})(\alpha_i + \bar{\varepsilon}_{i.})] = -\lambda[E(\alpha_i^2) + E(\alpha_i\bar{\varepsilon}_{i.}) + E(\varepsilon_{it}\alpha_i) + E(\varepsilon_{it}\bar{\varepsilon}_{i.})] = -\lambda\left[E(\alpha_i^2) + E\left(\varepsilon_{it}\frac{\varepsilon_{i1} + \varepsilon_{i2} + \cdots + \varepsilon_{iT}}{T}\right)\right] = -\lambda\left(\sigma_\alpha^2 + \frac{\sigma_\varepsilon^2}{T}\right)

The third term:

You are advised to check this:

-\lambda E(v_{i,t-s}\bar{v}_{i.}) = -\lambda E(v_{it}\bar{v}_{i.}) = -\lambda\left(\sigma_\alpha^2 + \frac{\sigma_\varepsilon^2}{T}\right)


The fourth term:

\lambda^2 E(\bar{v}_{i.}^2) = \lambda^2 E[(\alpha_i + \bar{\varepsilon}_{i.})(\alpha_i + \bar{\varepsilon}_{i.})] = \lambda^2[E(\alpha_i^2) + E(\alpha_i\bar{\varepsilon}_{i.}) + E(\alpha_i\bar{\varepsilon}_{i.}) + E(\bar{\varepsilon}_{i.}^2)] = \lambda^2[E(\alpha_i^2) + E(\bar{\varepsilon}_{i.}^2)] = \lambda^2\left(\sigma_\alpha^2 + \frac{\sigma_\varepsilon^2}{T}\right)

Putting the terms together we have:

\sigma_\alpha^2 - 2\lambda\left(\sigma_\alpha^2 + \frac{\sigma_\varepsilon^2}{T}\right) + \lambda^2\left(\sigma_\alpha^2 + \frac{\sigma_\varepsilon^2}{T}\right) = 0

Or

\sigma_\alpha^2 - 2\lambda\left(\sigma_\alpha^2 + \frac{\sigma_\varepsilon^2}{T}\right) + \lambda^2\left(\sigma_\alpha^2 + \frac{\sigma_\varepsilon^2}{T}\right) = \sigma_\alpha^2 + \left(\sigma_\alpha^2 + \frac{\sigma_\varepsilon^2}{T}\right)(\lambda^2 - 2\lambda) = 0

With \lambda = 1 - \left(\frac{\sigma_\varepsilon^2}{T\sigma_\alpha^2 + \sigma_\varepsilon^2}\right)^{1/2}:

\lambda^2 - 2\lambda = \left(1 - \left(\frac{\sigma_\varepsilon^2}{T\sigma_\alpha^2 + \sigma_\varepsilon^2}\right)^{\frac{1}{2}}\right)^2 - 2\left(1 - \left(\frac{\sigma_\varepsilon^2}{T\sigma_\alpha^2 + \sigma_\varepsilon^2}\right)^{\frac{1}{2}}\right)

= 1 - 2\left(\frac{\sigma_\varepsilon^2}{T\sigma_\alpha^2 + \sigma_\varepsilon^2}\right)^{\frac{1}{2}} + \frac{\sigma_\varepsilon^2}{T\sigma_\alpha^2 + \sigma_\varepsilon^2} - 2 + 2\left(\frac{\sigma_\varepsilon^2}{T\sigma_\alpha^2 + \sigma_\varepsilon^2}\right)^{\frac{1}{2}} = \frac{\sigma_\varepsilon^2}{T\sigma_\alpha^2 + \sigma_\varepsilon^2} - 1 = \frac{-T\sigma_\alpha^2}{T\sigma_\alpha^2 + \sigma_\varepsilon^2}

Putting this back, we have:

\sigma_\alpha^2 + \left(\sigma_\alpha^2 + \frac{\sigma_\varepsilon^2}{T}\right)\left(\frac{-T\sigma_\alpha^2}{T\sigma_\alpha^2 + \sigma_\varepsilon^2}\right) = \sigma_\alpha^2 - \sigma_\alpha^2 = 0

Hence, we have shown that (푣푖푡 − 휆푣̅푖.), is serially uncorrelated.


a. Testing for random effects

This is a Lagrange multiplier test advanced by Breusch and Pagan (1980), used to test for the presence of an unobserved effect. It is usually used to choose between a pooled OLS model and a random effects model. The idea behind the test is that, if the regressors are strictly exogenous and \varepsilon_{it} is non-autocorrelated and homoskedastic, then pooled OLS is efficient when there are no unobserved effects, i.e. \sigma_\alpha^2 = 0. If \sigma_\alpha^2 > 0, then the random effects (GLS) estimator is efficient (provided, of course, that \alpha_i is uncorrelated with the explanatory variables).

The null hypothesis is H_0: \sigma_\alpha^2 = 0.

The LM test statistic is given as follows, where 푒̂푖푡 is the estimated pooled OLS residual:

LM = \frac{NT}{2(T-1)}\left[\frac{\sum_{i=1}^{N}\left(\sum_{t=1}^{T}\hat{e}_{it}\right)^2}{\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{e}_{it}^2} - 1\right]^2 = \frac{NT}{2(T-1)}\left[\frac{\sum_{i=1}^{N}(T\bar{e}_{i.})^2}{\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{e}_{it}^2} - 1\right]^2

Under the null hypothesis, LM statistic is distributed as chi-squared with one degree of freedom.
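In Stata the Breusch-Pagan LM test is available through xttest0 after fitting the random-effects model; a minimal sketch with hypothetical variables y, x1 and x2:

. quietly xtreg y x1 x2, re
. xttest0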

b. Hausman Test

Hausman (1978) test is usually used to choose between the random and fixed effects models. An important consideration when choosing between a random effects and fixed effects approach is whether 훼푖 is correlated with 푥푖푡. To test the hypothesis that 훼푖 is uncorrelated with 푥푖푡, we can use a Hausman test.

Hausman test in general involves comparing one estimator which is consistent regardless of whether the null hypothesis is true or not, to another estimator which is only consistent under the null hypothesis. In the present case, the fixed effects estimator is consistent regardless of whether \alpha_i is or is not correlated with x_{it}, while the random effects estimator requires this correlation to be zero in order to be consistent. Strict exogeneity is assumed for both models. The null hypothesis is that both models are consistent, and the two estimates should not differ systematically. A statistically significant difference is therefore interpreted as evidence against the random effects model. Under the alternative, Hausman considers that the fixed effects model is consistent but the random effects model is not. If we cannot reject the null, we may decide to use the random effects model in the analysis on the grounds that this model is efficient.

H_0: \hat{\beta}_{FE} - \hat{\beta}_{RE} = 0

H_1: \hat{\beta}_{FE} - \hat{\beta}_{RE} \neq 0

H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})'[Var(\hat{\beta}_{FE}) - Var(\hat{\beta}_{RE})]^{-1}(\hat{\beta}_{FE} - \hat{\beta}_{RE})

Under the null hypothesis, the test statistic follows a limiting chi-squared distribution with degrees of freedom equal to the number of coefficients being compared (the K slope coefficients).

Application of Static Panel Data Models in STATA

We use the traffic fatality dataset (fatality.dta) from Stock and Watson, Introduction to Econometrics. To obtain the dataset, connect to the internet and download it by typing "ssc install bcuse" in the STATA command window, followed by the command "bcuse fatality". You can then save the dataset on your laptop. The dataset contains state-level data on traffic fatality rates (deaths per 10,000) for 48 U.S. states over the period 1982-1988. In this application, we model the highway fatality rate as a function of several common factors: beertax, the tax on a case of beer; spircons, a measure of spirits consumption; and two economic factors: the state unemployment rate (unrate) and state per capita personal income in thousands of dollars (perincK). Let's first transform two variables, Vehicle Fatality Rate (mrall) and Per Capita Personal Income (perinc), as follows:

. gen fatal=mrall*10000
. gen perincK=perinc/1000


We present descriptive statistics for these variables.

. summarize fatal beertax spircons unrate perincK

    Variable |       Obs        Mean    Std. Dev.        Min        Max
       fatal |       336    2.040444     .5701938     .82121    4.21784
     beertax |       336     .513256     .4778442   .0433109   2.720764
    spircons |       336     1.75369     .6835745        .79        4.9
      unrate |       336    7.346726     2.533405        2.4         18
     perincK |       336    13.88018     2.253046   9.513762   22.19345

Tell STATA that we want to use panel data:

. xtset state year
       panel variable:  state (strongly balanced)
        time variable:  year, 1982 to 1988
                delta:  1 unit

The results of the one-way fixed effects model are as follows:

. xtreg fatal beertax spircons unrate perincK, fe

Fixed-effects (within) regression               Number of obs      =       336
Group variable: state                           Number of groups   =        48

R-sq:  within  = 0.3526                         Obs per group: min =         7
       between = 0.1146                                        avg =       7.0
       overall = 0.0863                                        max =         7

                                                F(4,284)           =     38.68
corr(u_i, Xb)  = -0.8804                        Prob > F           =    0.0000

       fatal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
     beertax |  -.4840728   .1625106    -2.98   0.003    -.8039508   -.1641948
    spircons |   .8169652   .0792118    10.31   0.000     .6610484    .9728819
      unrate |  -.0290499   .0090274    -3.22   0.001    -.0468191   -.0112808
     perincK |   .1047103   .0205986     5.08   0.000      .064165    .1452555
       _cons |   -.383783   .4201781    -0.91   0.362    -1.210841    .4432754

     sigma_u |  1.1181913
     sigma_e |  .15678965
         rho |  .98071823   (fraction of variance due to u_i)

F test that all u_i=0: F(47, 284) = 59.77 Prob > F = 0.0000

All explanatory factors are highly significant, with the unemployment rate (unrate) having a negative effect on the fatality rate (perhaps since those who are unemployed are income-constrained and drive fewer miles), and income (perincK) a positive effect (as expected because driving is a normal good). Note the empirical correlation labeled corr(u_i, Xb) of -0.8804. This correlation indicates that the unobserved heterogeneity term, proxied by the estimated fixed effect, is strongly correlated with a linear combination of the included regressors. That is not a problem for the fixed effects model.

We also estimate a two-way fixed effects model by adding time effects to the model of the previous example. Let’s first generate the time effects dummy variables as follows; we drop the last one yr7 to avoid perfect multicollinearity.

. quietly tabulate year, generate(yr)
. drop yr7

After generating the time effects, we estimate the two-way fixed effects model as:

. xtreg fatal beertax spircons unrate perincK yr*, fe

Fixed-effects (within) regression               Number of obs      =       336
Group variable: state                           Number of groups   =        48

R-sq:  within  = 0.4528                         Obs per group: min =         7
       between = 0.1090                                        avg =       7.0
       overall = 0.0770                                        max =         7

                                                F(10,278)          =     23.00
corr(u_i, Xb)  = -0.8728                        Prob > F           =    0.0000

       fatal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
     beertax |  -.4347195   .1539564    -2.82   0.005    -.7377878   -.1316511
    spircons |    .805857   .1126425     7.15   0.000     .5841163    1.027598
      unrate |  -.0549084   .0103418    -5.31   0.000    -.0752666   -.0345502
     perincK |   .0882636   .0199988     4.41   0.000     .0488953    .1276319
         yr1 |    .134057   .0677696     1.98   0.049     .0006503    .2674638
         yr2 |   .0806858   .0639253     1.26   0.208    -.0451535     .206525
         yr3 |  -.0309258   .0503737    -0.61   0.540    -.1300882    .0682365
         yr4 |  -.0656806   .0441832    -1.49   0.138    -.1526568    .0212956
         yr5 |   .0832536   .0354834     2.35   0.020     .0134033     .153104
         yr6 |   .0339842   .0314024     1.08   0.280    -.0278324    .0958009
       _cons |  -.0050003   .4096507    -0.01   0.990    -.8114115     .801411

sigma_u 1.0987683 sigma_e .14570531 rho .98271904 (fraction of variance due to u_i)

F test that all u_i=0: F(47, 278) = 64.52 Prob > F = 0.0000



The four variables included in the one-way fixed effects model retain their sign and significance in the two-way fixed effects model. We can test if the time effects are jointly significant by typing:

. test yr1 yr2 yr3 yr4 yr5 yr6


 ( 1)  yr1 = 0
 ( 2)  yr2 = 0
 ( 3)  yr3 = 0
 ( 4)  yr4 = 0
 ( 5)  yr5 = 0
 ( 6)  yr6 = 0

F( 6, 278) = 8.48 Prob > F = 0.0000

The above test indicates that time effects are jointly significant, suggesting that they should be included in a properly specified model. Otherwise, the model is qualitatively similar to the earlier model, with a sizable amount of variation explained by the individual (state) fixed effect.
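An equivalent way to obtain the two-way fixed effects model and the joint test of the time effects, without creating the yr* dummies by hand, is Stata's factor-variable notation; results should match the table above up to the choice of the base year:

. xtreg fatal beertax spircons unrate perincK i.year, fe
. testparm i.year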

Between Estimator

As discussed, another estimator that may be defined for a panel data set is the between estimator, in which the group means of y are regressed on the group means of X in a regression of N observations. This estimator ignores all of the individual-specific variation in y and X that is considered by the within estimator, replacing each observation for an individual with their mean behavior. The between estimator is not widely used, but it can sometimes be applied in cross-country studies where the time series data for each individual are thought to be somewhat inaccurate, or when they are assumed to contain random deviations from long-run means. If you assume that the inaccuracy has mean zero over time, a solution to this measurement error problem can be found by averaging the data over time and retaining only one observation per unit.

The estimation results for the traffic fatality dataset are as follows:


. xtreg fatal beertax spircons unrate perincK, be

Between regression (regression on group means)  Number of obs      =       336
Group variable: state                           Number of groups   =        48

R-sq:  within  = 0.0479                         Obs per group: min =         7
       between = 0.4565                                        avg =       7.0
       overall = 0.2583                                        max =         7

                                                F(4,43)            =      9.03
sd(u_i + avg(e_i.)) = .4209489                  Prob > F           =    0.0000

       fatal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
     beertax |   .0740362   .1456333     0.51   0.614    -.2196614    .3677338
    spircons |   .2997517   .1128135     2.66   0.011     .0722417    .5272618
      unrate |   .0322333    .038005     0.85   0.401    -.0444111    .1088776
     perincK |  -.1841747   .0422241    -4.36   0.000    -.2693277   -.0990218
       _cons |   3.796343   .7502025     5.06   0.000     2.283415    5.309271

The results indicate that the cross-sectional (interstate) variation in beertax and unrate has no explanatory power in this specification, whereas they are highly significant when the within estimator is employed.

Random Effects Estimator

As discussed, as an alternative to considering the individual-specific intercept as a “fixed effect” of that unit, we might consider that the individual effect may be viewed as a random draw from a distribution. The application of the random effects model is as follows:

. xtreg fatal beertax spircons unrate perincK, re

Random-effects GLS regression                   Number of obs      =       336
Group variable: state                           Number of groups   =        48

R-sq:  within  = 0.2263                         Obs per group: min =         7
       between = 0.0123                                        avg =       7.0
       overall = 0.0042                                        max =         7

                                                Wald chi2(4)       =     49.90
corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =    0.0000

       fatal |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
     beertax |   .0442768   .1204613     0.37   0.713     -.191823    .2803765
    spircons |   .3024711   .0642954     4.70   0.000     .1764546    .4284877
      unrate |  -.0491381   .0098197    -5.00   0.000    -.0683843   -.0298919
     perincK |  -.0110727   .0194746    -0.57   0.570    -.0492423    .0270968
       _cons |   2.001973   .3811247     5.25   0.000     1.254983    2.748964

sigma_u .41675665 sigma_e .15678965 rho .87601197 (fraction of variance due to u_i)


In comparison to the fixed effects model, where all four regressors were significant, we see that the beertax and perincK variables do not have significant effects on the fatality rate. The latter variable’s coefficient switched sign. The corr(u_i, X) in this context is assumed to be zero: a necessary condition for the random effects estimator to yield consistent estimates. Recall that when the fixed effect estimator was used, this correlation was reported as −0.8804.

Hausman Test

A Hausman test may be used to test the null hypothesis that the extra orthogonality conditions imposed by the random effects estimator are valid. The fixed effects estimator, which does not impose those conditions, is consistent regardless of the independence of the individual effects. The fixed effects estimates are inefficient if that assumption of independence is warranted. The random effects estimator is efficient under the assumption of independence, but inconsistent otherwise.

To illustrate the Hausman test with the two forms of the motor vehicle fatality equation, use the following commands:

. quietly xtreg fatal beertax spircons unrate perincK, fe

. estimates store fix

. quietly xtreg fatal beertax spircons unrate perincK, re

. estimates store re

. hausman fix re



                 ---- Coefficients ----
             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             |      fix          re          Difference          S.E.
     beertax |   -.4840728     .0442768       -.5283495        .1090815
    spircons |    .8169652     .3024711         .514494        .0462668
      unrate |   -.0290499    -.0491381        .0200882               .
     perincK |    .1047103    -.0110727         .115783        .0067112

                           b = consistent under Ho and Ha; obtained from xtreg
            B = inconsistent under Ha, efficient under Ho; obtained from xtreg

    Test:  Ho:  difference in coefficients not systematic

                  chi2(4) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                          =      130.93
                Prob>chi2 =      0.0000
                (V_b-V_B is not positive definite)

The results indicate that the Hausman test’s null hypothesis that the random effects estimator is consistent, is soundly rejected. The sizable estimated correlation reported in the fixed effects estimator also supports this rejection.

The First Difference Estimator

The within transformation used by fixed effects models removes unobserved heterogeneity at the unit level. The same can be achieved by first differencing the original equation (which removes the constant term). We illustrate the first difference estimator with the traffic data set.

First, type the following command to install “xtivreg2” STATA command:

. ssc install xtivreg2

The results of the first difference estimators are:


. xtivreg2 fatal beertax spircons unrate perincK, fd nocons small

FIRST DIFFERENCES ESTIMATION

Number of groups =        48
Obs per group:  min =       6
                avg =     6.0
                max =       6

OLS estimation

Estimates efficient for homoskedasticity only
Statistics consistent for homoskedasticity only

                                                      Number of obs =      288
                                                      F(  4,   284) =     6.29
                                                      Prob > F      =   0.0001
Total (centered) SS   = 11.21286023                   Centered R2   =   0.0812
Total (uncentered) SS = 11.21590589                   Uncentered R2 =   0.0814
Residual SS           = 10.30276586                   Root MSE      =    .1905

D.fatal Coef. Std. Err. t P>|t| [95% Conf. Interval]

beertax D1. .1187701 .2728036 0.44 0.664 -.4182035 .6557438

spircons D1. .523584 .1408249 3.72 0.000 .2463911 .800777

unrate D1. .003399 .0117009 0.29 0.772 -.0196325 .0264304

perincK D1. .1417981 .0372814 3.80 0.000 .0684152 .215181

Included instruments: D.beertax D.spircons D.unrate D.perincK

It can be observed that, as in the between estimation results, the beertax and unrate variables are not significant. The larger Root MSE for the first difference equation, compared to that for fixed effects, illustrates the relative inefficiency of the first difference estimator when there are more than two time periods.


2.1.3. Dynamic Panel Models

2.1.3.1. Introduction

Many economic issues are dynamic by nature, and the panel data structure can be used to understand adjustment. Examples include demand (present demand depends on past demand), dynamic wage equations (the empirical macroeconomic wage equation implies that the expected log real wage depends on the lagged log real wage), employment models (costs of hiring and firing), and the investment behaviour of firms.

In the context of panel data, we usually deal with unobserved heterogeneity by applying the within (demeaning) transformation, as in one-way fixed effects models, or by taking first differences. The ability of first differencing to remove unobserved heterogeneity also underlies the family of estimators that have been developed for dynamic panel data (DPD) models. A dynamic panel data incorporates a lagged dependent variable (with or without other exogenous variables), allowing for the modelling of a partial adjustment mechanism.

The inclusion of exogenous variables brings only minor complications with respect to the estimation of the parameters. These complications pertain to the number of instruments (in instrumental variable estimation) or the number of moment conditions (in GMM estimation). There are also complications arising from the time dimensions of the panel datasets. Most dynamic panel estimation methods are designed for panel datasets with large N (the cross-section dimension) and relatively small T (the time dimension). Panel datasets with small N and large T may require more specialized techniques (e.g. SUR) for estimation.

For simplicity, let us consider a one-way error component model:

y_{it} = \gamma y_{i,t-1} + \beta' x_{it} + \alpha_i + \varepsilon_{it}


for i = 1, …, N and t = 1, …, T. \alpha_i is the (unobserved) individual-specific effect and \varepsilon_{it} the idiosyncratic error term, with E(\varepsilon_{it}) = 0, E(\varepsilon_{it}\varepsilon_{js}) = \sigma_\varepsilon^2 if j = i and t = s, and E(\varepsilon_{it}\varepsilon_{js}) = 0 otherwise.

In a dynamic panel model, the choice between a fixed-effects formulation and a random-effects formulation has implications for estimation that are of a different nature than those associated with the static model.

2.1.3.2. Dynamic panel issues

❑ If lagged dependent variables appear as explanatory variables, strict exogeneity of the regressors no longer holds.

a) The FE estimator is no longer consistent when N tends to infinity and T is fixed.

Within transformation sweeps out the individual effect (훼푖).

(y_{i,t-1} - \bar{y}_{i,-1}), where \bar{y}_{i,-1} = \frac{1}{T-1}\sum_{t=2}^{T} y_{i,t-1}, is correlated with (\varepsilon_{it} - \bar{\varepsilon}_{i}).

b) RE GLS estimator is biased and inconsistent

Quasi-demeaning transforms the data to (y_{i,t-1} - \theta\bar{y}_{i,-1}), and accordingly for the other terms. (y_{i,t-1} - \theta\bar{y}_{i,-1}) is correlated with (\varepsilon_{it} - \theta\bar{\varepsilon}_{i}) because \bar{\varepsilon}_{i} contains \varepsilon_{i,t-1}, which is correlated with y_{i,t-1}.

c) y_{it} is correlated with \alpha_i^{*}, implying that y_{i,t-1} is also correlated with \alpha_i^{*}; hence OLS is biased and inconsistent even if the \varepsilon_{it} are not serially correlated.

❑ With a random-effects formulation, the interpretation of the model depends on the assumptions made about the initial observations.

❑ The consistency properties of the MLE and the GLS estimator depend on the way in which T and N tend to infinity.

A serious difficulty arises with the one-way fixed effects model in the context of a dynamic panel data (DPD) model, particularly in the "small T, large N" context. This arises because the demeaning process, which subtracts the individual's mean value of y and each X from the respective variable, creates a correlation between regressor and error. The mean of the lagged dependent variable contains observations 0 through (T − 1) on y, and the mean error, which is being conceptually subtracted from each \varepsilon_{it}, contains contemporaneous values of \varepsilon for t = 1, …, T. The resulting correlation creates a bias in the estimate of the coefficient of the lagged dependent variable which is not mitigated by increasing N, the number of individual units.

The demeaning operation creates a regressor which cannot be distributed independently of the error term. The inconsistency of 훾̂ as N → ∞ is of order 1/T, which may be quite sizable in a “small T " context. If 훾 > 0, the bias is invariably negative, so that the persistence of y will be underestimated.

For reasonably large values of T, the limit of 훾̂ − 훾 as N → ∞ will be approximately −(1 + 훾)/(T − 1): a sizable value, even if T = 10. With 훾 = 0.5, the bias will be -0.167, or about 1/3 of the true value. The inclusion of additional regressors does not remove this bias. Indeed, if the regressors are correlated with the lagged dependent variable to some degree, their coefficients may be seriously biased as well. Note also that this bias is not caused by an autocorrelated error process 휀. The bias arises even if the error process is i.i.d. If the error process is autocorrelated, the problem is even more severe given the difficulty of deriving a consistent estimate of the AR parameters in that context. The same problem affects the one-way random effects model. The 훼푖 error component enters every value of 푦푖푡 by assumption, so that the lagged dependent variable cannot be independent of the composite error process. One solution to this problem involves taking first differences of the original model.

Consider a model containing a lagged dependent variable and a single regressor X:

y_{it} = \beta_1 + \gamma y_{i,t-1} + x_{it}\beta_2 + \alpha_i + \varepsilon_{it}

The first difference transformation removes both the constant term and the individual effect:

\Delta y_{it} = \gamma\Delta y_{i,t-1} + \Delta x_{it}\beta_2 + \Delta\varepsilon_{it}

There is still correlation between the differenced lagged dependent variable and the disturbance process (which is now a first-order process, or MA(1)): the former contains y_{i,t-1}


and the latter contains 휀푖,푡−1. But with the individual fixed effects swept out, a straightforward instrumental variables estimator is available. We may construct instruments for the lagged dependent variable from the second and third lags of y, either in the form of differences or lagged levels. If 휀 is i.i.d., those lags of y will be highly correlated with the lagged dependent variable (and its difference) but uncorrelated with the composite error process.

Even if we had reason to believe that \varepsilon might be following an AR(1) process, we could still follow this strategy, "backing off" one period and using the third and fourth lags of y (presuming that the time series for each unit is long enough to do so). This approach is the Anderson–Hsiao (AH) estimator, implemented by the Stata command xtivreg with the fd option.
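A minimal sketch (not from the original text) of the Anderson–Hsiao estimator for a hypothetical dynamic model y_it = γ y_{i,t−1} + β x_it + α_i + ε_it, with hypothetical variables y, x, id and year; the differenced lag of y is instrumented by the second lag in levels:

. xtset id year
. * first-differenced 2SLS "by hand"
. ivregress 2sls D.y D.x (LD.y = L2.y), vce(cluster id)
. * the same idea via the built-in first-difference IV estimator mentioned above
. xtivreg y x (L.y = L2.y), fd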

2.1.3.3. Dynamic panel bias / Nickell's bias

The LSDV estimator is consistent for the static model whether the effects are fixed or random. On the contrary, the LSDV estimator is inconsistent for a dynamic panel data model with individual effects, whether the effects are fixed or random. The bias of the LSDV estimator in a dynamic model is generally known as dynamic panel bias or Nickell's bias (Nickell, 1981).

Consider the simple AR(1) model:

y_{it} = \gamma y_{i,t-1} + \alpha_i^{*} + \varepsilon_{it}, \quad i = 1, \ldots, N; \; t = 1, \ldots, T.

For simplicity, let \alpha_i^{*} = \alpha + \alpha_i, to avoid imposing the restriction that \sum_{i=1}^{N}\alpha_i = 0, or E(\alpha_i) = 0 in the case of random individual effects.

Assumptions:

i) The autoregressive parameter \gamma satisfies |\gamma| < 1;

ii) The initial condition y_{i0} is observable;

iii) The error term satisfies E(\varepsilon_{it}) = 0, E(\varepsilon_{it}\varepsilon_{js}) = \sigma_\varepsilon^2 if j = i and t = s, and E(\varepsilon_{it}\varepsilon_{js}) = 0 otherwise.

In this AR(1) panel data model, the dynamic panel bias is such that:

\text{plim}_{N\to\infty}\,\hat{\gamma}_{LSDV} \neq \gamma

The LSDV estimator is defined by: 훼̂푖 = 푦̅푖 − 훾̂퐿푆퐷푉푦̅푖,푡−1


\hat{\gamma}_{LSDV} = \left(\sum_{i=1}^{N}\sum_{t=1}^{T}(y_{i,t-1} - \bar{y}_{i,t-1})^2\right)^{-1}\left(\sum_{i=1}^{N}\sum_{t=1}^{T}(y_{i,t-1} - \bar{y}_{i,t-1})(y_{it} - \bar{y}_{i})\right)

\bar{y}_{i,t-1} = \frac{1}{T}\sum_{t=1}^{T} y_{i,t-1}; \quad \bar{y}_{i} = \frac{1}{T}\sum_{t=1}^{T} y_{it}; \quad \bar{x}_{i} = \frac{1}{T}\sum_{t=1}^{T} x_{it}

The bias of the LSDV estimator is defined by:

\hat{\gamma}_{LSDV} - \gamma = \left(\sum_{i=1}^{N}\sum_{t=1}^{T}(y_{i,t-1} - \bar{y}_{i,t-1})^2\right)^{-1}\left(\sum_{i=1}^{N}\sum_{t=1}^{T}(y_{i,t-1} - \bar{y}_{i,t-1})(\varepsilon_{it} - \bar{\varepsilon}_{i})\right)

The bias of the LSDV estimator can be rewritten as:

\hat{\gamma}_{LSDV} - \gamma = \frac{\left(\sum_{i=1}^{N}\sum_{t=1}^{T}(y_{i,t-1} - \bar{y}_{i,t-1})(\varepsilon_{it} - \bar{\varepsilon}_{i})\right)/(NT)}{\left(\sum_{i=1}^{N}\sum_{t=1}^{T}(y_{i,t-1} - \bar{y}_{i,t-1})^2\right)/(NT)}

\text{plim}_{N\to\infty}\,\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}(y_{i,t-1} - \bar{y}_{i,t-1})(\varepsilon_{it} - \bar{\varepsilon}_{i}) = -\text{plim}_{N\to\infty}\,\frac{1}{N}\sum_{i=1}^{N}\bar{y}_{i,t-1}\bar{\varepsilon}_{i}

If this plim is not null, then the LSDV estimator \hat{\gamma}_{LSDV} is biased when N tends to infinity and T is fixed. The Nickell bias is given by:

\text{plim}_{N\to\infty}(\hat{\gamma}_{LSDV} - \gamma) = -\frac{1+\gamma}{T-1}\left(1 - \frac{1}{T}\,\frac{1-\gamma^{T}}{1-\gamma}\right)\left[1 - \frac{2\gamma}{(1-\gamma)(T-1)}\left(1 - \frac{1-\gamma^{T}}{T(1-\gamma)}\right)\right]^{-1}

Note the following:

• When T is large, the bias disappears: the demeaned right-hand-side variable becomes asymptotically uncorrelated with the demeaned error.
• For small T, this bias is always negative if \gamma > 0.
• The bias does not go to zero as \gamma goes to zero.

What are the solutions?

A consistent estimator of \gamma can be obtained by using:

❑ ML or FIML (but additional assumptions on y_{i0} are necessary)

❑ Feasible GLS (but additional assumptions on y_{i0} are necessary)

❑ LSDV bias corrected (Kiviet, 1995)

❑ IV approach (Anderson and Hsiao, 1982)

❑ GMM approach (Arellano and Bond, 1991)

We shall discuss the following two ways of estimating dynamic panel data:

(1) The classic Anderson-Hsiao estimator (which is a first difference instrumental variable estimator); and

(2) The Arellano-Bond estimator (which is a first difference GMM estimator).

Note that the exogeneity requirements for the Anderson-Hsiao estimator are less restrictive compared to the FE estimator.

IV estimation, Anderson-Hsiao

Instrumental variable (IV) estimators have been proposed by Anderson and Hsiao, as they are consistent for N \to \infty and finite T. As with any instrumental variable technique, the choice of instruments must be made. We would like to find something that is correlated with the endogenous variable, in this case (y_{i,t-1} - y_{i,t-2}), but not correlated with the error (u_{i,t} - u_{i,t-1}). For instance, the level y_{i,t-2} used as an instrument is correlated with y_{i,t-1} - y_{i,t-2} but not with u_{i,t} - u_{i,t-1} (provided the u_{it} are not serially correlated), while using the lagged difference y_{i,t-2} - y_{i,t-3} (that is, \Delta y_{i,t-2}) as an instrument for y_{i,t-1} - y_{i,t-2} leads to the loss of one more sample period.

Constructing the instrument matrix

In standard 2SLS, including the Anderson–Hsiao approach, the twice-lagged level appears in the instrument matrix as:

Z_i = \begin{pmatrix} . \\ y_{i,1} \\ y_{i,2} \\ \vdots \\ y_{i,T-2} \end{pmatrix}


Where the first row corresponds to t = 2, given that the first observation is lost in applying the FD transformation. The missing value in the instrument for t = 2 causes that observation for each panel unit to be removed from the estimation.

If we also included the thrice-lagged level y_{i,t-3} as a second instrument in the Anderson–Hsiao approach, we would lose another observation per panel:

Z_i = \begin{pmatrix} . & . \\ y_{i,1} & . \\ y_{i,2} & y_{i,1} \\ \vdots & \vdots \\ y_{i,T-2} & y_{i,T-3} \end{pmatrix}

so that the first observation available for the regression is that dated t = 4. To avoid this loss of degrees of freedom, Holtz-Eakin et al. construct a set of instruments from the second lag of y, one instrument pertaining to each time period:

Z_i = \begin{pmatrix} 0 & 0 & \cdots & 0 \\ y_{i,1} & 0 & \cdots & 0 \\ 0 & y_{i,2} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & y_{i,T-2} \end{pmatrix}

The inclusion of zeros in place of missing values prevents the loss of additional degrees of freedom, in that all observations dated t = 2 and later can now be included in the regression. Although the inclusion of zeros might seem arbitrary, the columns of the resulting instrument matrix will be orthogonal to the transformed errors. The resulting moment conditions correspond to an expectation we believe should hold: E(y_{i,t-2}\, s_{it}^{*}) = 0, where s_{it}^{*} refers to the FD-transformed errors.


It would also be valid to 'collapse' the columns of this Z matrix into a single column, which embodies the same expectation, but conveys less information as it will only produce a single moment condition. In this context, the collapsed instrument set will be the same implied by standard IV, with a zero replacing the missing value in the first usable observation:

Z_i = \begin{pmatrix} 0 \\ y_{i,1} \\ \vdots \\ y_{i,T-2} \end{pmatrix}

This is specified in Roodman's xtabond2 software by giving the collapse option. Given this solution to the tradeoff between lag length and sample length, we can now adopt Holtz-Eakin et al.'s suggestion and include all available lags of the untransformed variables as instruments. For endogenous variables, lags 2 and higher are available. For predetermined variables that are not strictly exogenous, lag 1 is also valid, as its value is only correlated with errors dated t − 2 or earlier. Using all available instruments gives rise to an instrument matrix such as:

Z_i = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 & \cdots \\ y_{i,1} & 0 & 0 & 0 & 0 & 0 & \cdots \\ 0 & y_{i,2} & y_{i,1} & 0 & 0 & 0 & \cdots \\ 0 & 0 & 0 & y_{i,3} & y_{i,2} & y_{i,1} & \cdots \\ \vdots & & & & & & \ddots \end{pmatrix}

In this setup, we have different numbers of instruments available for each time period: one for t = 2, two for t = 3, and so on. As we move to the later time periods in each panel's time series, additional orthogonality conditions become available, and taking these additional conditions into account improves the efficiency of the AB estimator.


One disadvantage of this strategy is that the number of instruments produced is quadratic in T, the length of the time series available. This implies that for a longer time series (T > 10), it may be necessary to restrict the number of past lags used. Both the official Stata commands and Roodman's xtabond2 allow the particular lags to be included in estimation to be specified, rather than relying on the default strategy.
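For instance, official Stata's xtabond caps the number of lagged levels of the dependent variable used as instruments through its maxldep() option. A minimal sketch on the abdata panel used later in this Topic follows; the specific lag choices are illustrative assumptions, not a recommended specification:

webuse abdata, clear
xtabond n L(0/1).w L(0/2).(k ys) yr*, lags(2) maxldep(3) twostep vce(robust)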

Arellano–Bond (AB) estimator Arellano and Bond argue that the Anderson–Hsiao estimator, while consistent, fails to take all of the potential orthogonality conditions into account. A key aspect of the AB strategy, echoing that of AH, is the assumption that the necessary instruments are ‘internal’: that is, based on lagged values of the instrumented variable(s). The estimators allow the inclusion of external instruments as well.

Consider the equations:

푦푖푡 = 푋푖푡훽1 + 푊푖푡훽2 + 푣푖푡

푣푖푡 = 푢푖 + 푠푖푡

Where 푋푖푡 includes strictly exogenous regressors, 푊푖푡 are predetermined regressors (which may include lags of y) and endogenous regressors, all of which may be correlated with 푢푖, the unobserved individual effect. First-differencing the equation removes the 푢푖 and its associated omitted-variable bias.

The AB approach, and its extension to the ‘System GMM’ context, is an estimator designed for situations with: ❑ ‘Small T, large N’ panels: few time periods and many individual units ❑ A linear functional relationship ❑ One left-hand variable that is dynamic, depending on its own past realizations ❑ Right-hand variables that are not strictly exogenous: correlated with past and possibly current realizations of the error


❑ Fixed individual effects, implying unobserved heterogeneity
❑ Heteroskedasticity and autocorrelation within individual units' errors, but not across them

The Arellano–Bond estimator sets up a generalized method of moments (GMM) problem in which the model is specified as a system of equations, one per time period, where the instruments applicable to each equation differ (for instance, in later time periods, additional lagged values of the instruments are available).

NB: This estimator is available in Stata as xtabond. A more general version, allowing for autocorrelated errors, is available as xtdpd. An excellent alternative to Stata’s built-in commands is David Roodman’s xtabond2, available from SSC (findit xtabond2). It is very well documented in his paper, included in your materials. The xtabond2 routine provides several additional features—such as the orthogonal deviations transformation discussed below—not available in official Stata’s commands.

The System GMM estimator

A potential weakness in the Arellano–Bond DPD estimator was revealed in later work by Arellano and Bover (1995) and Blundell and Bond (1998). The lagged levels are often rather poor instruments for first-differenced variables, especially if the variables are close to a random walk. Their modification of the estimator includes lagged levels as well as lagged differences. The original estimator is often termed difference GMM, while the expanded estimator is commonly termed System GMM. The System GMM estimator involves a set of additional restrictions on the initial conditions of the process generating y. This estimator is available in Stata as xtdpdsys.
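As a minimal sketch of system GMM with official Stata (again using the abdata panel introduced later in this Topic; the lag choices are illustrative assumptions):

xtdpdsys n L(0/1).w L(0/2).(k ys) yr*, lags(2) twostep vce(robust)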

Let us consider the dynamic panel data model:

푦푖푡 = 훾푦푖,푡−1 + 훽′푥푖푡 + 휗′푤푖 + 훼푖 + 휀푖푡

훼푖 are the (unobserved) individual effects, 푥푖푡 is a vector of 퐾1 time-varying explanatory variables,

푤푖 is a vector of 퐾2 time-invariant variables.


Assumptions: the composite error term is 푣푖푡 = 훼푖 + 휀푖푡, with:

❑ E(휀푖푡) = 0, E(훼푖) = 0
❑ E(휀푖푡휀푗푠) = 휎휀² if j = i and t = s, 0 otherwise
❑ E(훼푖훼푗) = 휎훼² if j = i, 0 otherwise

❑ E(훼푖푥푖푡)=0, E(훼푖푤푖)= 0 (exogeneity assumption for 푤푖)

The GMM estimation method is based on the model in first differences, in order to sweep out the individual effects 훼푖 and the time-invariant variables 푤푖:

(푦푖푡 − 푦푖,푡−1) = 훾(푦푖,푡−1 − 푦푖,푡−2) + 훽′(푥푖푡 − 푥푖,푡−1) + 휀푖푡 − 휀푖,푡−1, for t = 3, …, T.

Intuition of the moment conditions:

❑ Notice that 푦푖,푡−2 and (푦푖,푡−2 − 푦푖,푡−3) are not the only valid instruments for (푦푖,푡−1 − 푦푖,푡−2).

❑ All the lagged variables 푦푖,푡−2−푗, for j > 0, satisfy:

퐸(푦푖,푡−2−푗(휀푖푡 − 휀푖,푡−1)) = 0; Exogeneity property

퐸(푦푖,푡−2−푗(푦푖,푡−1 − 푦푖,푡−2)) ≠ 0; Relevance property

❑ Therefore, they are all legitimate instruments for (푦푖,푡−1 − 푦푖,푡−2).

The m + 1 conditions 퐸(푦푖,푡−2−푗(휀푖푡 − 휀푖,푡−1)) = 0 for j = 0, 1, …, m can be used as moment conditions in order to estimate 휃 = (훽, 훾, 휗, 휎훼², 휎휀²).

The Arellano-Bond (also Arellano-Bover) method of moments estimator is consistent. The moment conditions use the property of the instruments 푦푖,푡−푗, 푗 ≥ 2, to be uncorrelated with the errors 휀푖푡 and 휀푖,푡−1. We obtain an increasing number of moment conditions for t = 3, 4, . . . , T.

t = 3; E(ui3 − ui2 )yi1  = 0

t = 4; E(ui4 − ui3 )yi2  = 0 ; E(ui4 − ui3 )yi1  = 0

t = 5 ; E(ui5 − ui4 )yi3  = 0; E(ui5 − ui4 )yi2  = 0 ; E(ui5 − ui4 )yi1  = 0 etc.
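In compact form, these conditions (a restatement of the list above, not an additional assumption) can be written as:

E\left[y_{i,t-s}\,(u_{it}-u_{i,t-1})\right]=0,\qquad s\ge 2,\quad t=3,\dots,T,

which gives 1 + 2 + \dots + (T-2) = (T-1)(T-2)/2 linear moment conditions per cross-section unit.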


We define the (T − 2) × 1 vector:

Δ푢푖 = [(푢푖,3 − 푢푖,2), …, (푢푖,푇 − 푢푖,푇−1)]′ and a (T − 2) × (T − 2) matrix of instruments

Z_i' = \begin{pmatrix} y_{i,1} & y_{i,1} & \cdots & y_{i,1} \\ 0 & y_{i,2} & \cdots & y_{i,2} \\ 0 & 0 & \ddots & \vdots \\ 0 & \cdots & 0 & y_{i,T-2} \end{pmatrix}

푍푖 is the matrix of instruments. Each row of 푍푖 contains the instruments that are valid for a given period, such that E[푍푖′Δ푢푖] = 0.

Recall that: Δ푦푖,푡 = 훾Δ푦푖,푡−1 + Δ푢푖,푡, so that

Δ푢푖,푡 = Δ푦푖,푡 − 훾Δ푦푖,푡−1

Therefore: E[푍푖′Δ푢푖] = E[푍푖′(Δ푦푖 − 훾Δ푦푖,−1)] = 0

Under GMM, the number of moment conditions exceeds the number of unknown coefficients, so γ is estimated by minimizing the quadratic expression

J_N(\gamma)=\left(\frac{1}{N}\sum_{i=1}^{N}\Delta u_i' Z_i\right) W_N \left(\frac{1}{N}\sum_{i=1}^{N} Z_i'\Delta u_i\right),

with W_N a positive definite weighting matrix.

Differentiating the quadratic form with respect to γ and solving the first-order condition yields the GMM estimator

\hat{\gamma}_{GMM}=\left[\left(\sum_{i=1}^{N}\Delta y_{i,-1}' Z_i\right) W_N \left(\sum_{i=1}^{N} Z_i'\Delta y_{i,-1}\right)\right]^{-1}\left(\sum_{i=1}^{N}\Delta y_{i,-1}' Z_i\right) W_N \left(\sum_{i=1}^{N} Z_i'\Delta y_i\right),

where Δ푦푖,−1 stacks the lagged differences Δ푦푖,푡−1.


The optimal weighting matrix 푊푁, yielding an asymptotically efficient estimator, is the inverse of the covariance matrix of the sample moments.

• This matrix can be estimated directly from the data after a first consistent estimation step (the two-step estimator; see the sketch below).
• Under weak regularity conditions the GMM estimator is asymptotically normal for N → ∞ with T fixed, T > 2, using our instruments.
• It is also consistent for N → ∞ and T → ∞, though the number of moment conditions tends to ∞ as T → ∞.
• In practice, it is advisable to limit the number of moment conditions.
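A minimal two-step sketch with xtabond2 (a user-written command installed from SSC); when twostep and robust are combined, xtabond2 reports Windmeijer-corrected standard errors. The variables are those of the abdata example introduced later in this Topic, so treat the specification as illustrative only:

ssc install xtabond2
webuse abdata, clear
xtabond2 n L.n w k yr*, gmm(L.n) iv(w k yr*) nolevel twostep robust small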

GMM estimation, Arellano-Bond, with exogenous variables (x's)

푦푖푡 = 푥푖푡′훽 + 훾푦푖,푡−1 + 훼푖 + 푢푖푡,  푢푖푡 ~ iid(0, 휎푢²)

When exogenous x's are also included in the model, additional moment conditions can be formulated:

• For strictly exogenous variables: E[푥푖푠푢푖푡] = 0 for all s, t, so that E[푥푖푠Δ푢푖푡] = 0.

• For predetermined (not strictly exogenous) variables: E[푥푖푠푢푖푡] = 0 for s ≤ t, so that E[푥푖,푡−푗Δ푢푖푡] = 0 for j = 1, …, t − 1.

Clearly, there are a lot of possible moment restrictions, both for differences and for levels, and so a variety of GMM estimators. GMM estimation may be combined with both fixed effects and random effects. Here also, the RE estimator is identical to the FE estimator as T → ∞.

Diagnostic tests

❑ As the DPD estimators are instrumental variables methods, it is particularly important to evaluate the Sargan–Hansen test results when they are applied.
❑ Roodman's xtabond2 provides C tests (as in ivreg2) for groups of instruments.


❑ In his routine, instruments can be either "GMM-style" or "IV-style". The former are constructed per the Arellano–Bond logic, making use of multiple lags; the latter are included as is in the instrument matrix.
❑ For the system GMM estimator (the default in xtabond2), instruments may be specified as applying to the differenced equations, the level equations, or both.

Another important diagnostic in DPD estimation is the AR test for autocorrelation of the residuals. By construction, the residuals of the differenced equation should exhibit first-order serial correlation (AR(1)), but if the assumption of serial independence in the original errors is warranted, the differenced residuals should not exhibit significant AR(2) behaviour. These statistics are produced in the xtabond and xtabond2 output. If a significant AR(2) statistic is encountered, the second lags of endogenous variables will not be appropriate instruments for their current values.
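With official Stata's xtabond, these diagnostics are available as postestimation commands; a minimal sketch on the abdata panel (note that estat sargan is not reported after vce(robust) estimation):

xtabond n L(0/1).w L(0/2).(k ys) yr*, lags(2)
estat sargan
estat abond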

A useful feature of xtabond2 is the ability to specify, for GMM-style instruments, the limits on how many lags are to be included. If T is fairly large (more than 7–8) an unrestricted set of lags will introduce a huge number of instruments, with a possible loss of efficiency. By using the lag limits options, you may specify, for instance, that only lags 2–5 are to be used in constructing the GMM instruments.

An Empirical Exercise

To illustrate the performance of the several estimators, we make use of the original AB dataset, available within Stata with webuse abdata. This is an unbalanced panel of annual data from 140 UK firms for 1976–1984. In their original paper, Arellano and Bond modelled firms' employment n using a partial adjustment model to reflect the costs of hiring and firing, with two lags of employment. Other variables included were the current and lagged wage level w, the current, once- and twice-lagged capital stock (k) and the current, once- and twice-lagged output in the firm's sector (ys). All variables are expressed as logarithms. A set of time dummies is also included to capture business cycle effects. If we were to estimate this model ignoring its dynamic panel nature, we could merely apply regress with panel-clustered standard errors:

regress n nL1 nL2 w wL1 k kL1 kL2 ys ysL1 ysL2 yr*, cluster(id)


One obvious difficulty with this approach is the likely importance of firm-level unobserved heterogeneity. We have accounted for potential correlation between firms’ errors over time with the cluster-robust VCE, but this does not address the potential impact of unobserved heterogeneity on the conditional mean. We can apply the within transformation to take account of this aspect of the data:

xtreg n nL1 nL2 w wL1 k kL1 kL2 ys ysL1 ysL2 yr*, fe cluster(id)

NB: t statistics in parentheses; ∗ p < 0.05, ∗∗ p < 0.01, ∗∗∗ p < 0.001

The fixed effects estimates will suffer from Nickell bias, which may be severe given the short time series available. In the original OLS regression, the lagged dependent variable was positively correlated with the error, biasing its coefficient upward. In the fixed effects regression, its coefficient is biased downward due to the negative sign on 푣푡−1 in the transformed error. The OLS estimate of the first lag of n is 1.045; the fixed effects estimate is 0.733.

        OLS (t-values)       FE (t-values)
nL1     1.045∗∗∗ (20.17)     0.733∗∗∗ (12.28)
nL2     -0.0765 (-1.57)      -0.139 (-1.78)
w       -0.524∗∗ (-3.01)     -0.560∗∗∗ (-3.51)
k       0.343∗∗∗ (7.06)      0.388∗∗∗ (6.82)
ys      0.433∗ (2.42)        0.469∗∗ (2.74)
N       751                  751

Given the opposite directions of bias present in these estimates, consistent estimates should lie between these values, which may be a useful check. As the coefficient on the second lag of n cannot be distinguished from zero, the first lag coefficient should be below unity for dynamic stability. To deal with these two aspects of the estimation problem, we might apply the


Anderson–Hsiao estimator to the first-differenced equation, instrumenting the lagged dependent variable with the twice-lagged level:

ivregress 2sls D.n (D.nL1 = nL2) D.(nL2 w wL1 k kL1 kL2 ///
    ys ysL1 ysL2 yr1979 yr1980 yr1981 yr1982 yr1983)

A-H

D.nL1 2.308 (1.17)

D.nL2 -0.224 (-1.25)

D.w -0.810∗∗ (-3.10)

D.k 0.253 (1.75)

D.ys 0.991∗ (2.14)

N 611

NB: t statistics in parentheses; ∗ p < 0.05, ∗∗ p < 0.01, ∗∗∗ p < 0.001

Although these results should be consistent, they are quite disappointing. The coefficient on lagged n is outside the bounds of its OLS and FE counterparts, and much larger than unity, a value inconsistent with dynamic stability. It is also very imprecisely estimated. The difference GMM approach deals with this inherent endogeneity by transforming the data to remove the fixed effects.

The standard approach applies the first difference (FD) transformation, which as discussed earlier removes the fixed effect at the cost of introducing a correlation between ∆푦푖,푡−1 and ∆푣푖푡 , both of which have a term dated (t−1). This is preferable to the application of the within transformation, as that transformation makes every observation in the transformed data endogenous to every other for a given individual.

The one disadvantage of the first difference transformation is that it magnifies gaps in unbalanced panels. If some value of 푦푖푡 is missing, then both ∆푦푖푡 and ∆푦푖,푡−1 will be missing in the transformed data. This motivates an alternative transformation: the forward orthogonal deviations (FOD) transformation, proposed by Arellano and Bover (1995).


In contrast to the within transformation, which subtracts the average of all observations’ values from the current value, and the FD transformation, that subtracts the previous value from the current value, the FOD transformation subtracts the average of all available future observations from the current value. While the FD transformation drops the first observation on each individual in the panel, the FOD transformation drops the last observation for each individual. It is computable for all periods except the last period, even in the presence of gaps in the panel.
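Formally, the FOD transformation can be stated as follows (the standard Arellano–Bover form, given here for reference):

x^{*}_{it}=\sqrt{\frac{T_i-t}{T_i-t+1}}\left[x_{it}-\frac{1}{T_i-t}\sum_{s>t}x_{is}\right],

where T_i is the last period observed for unit i; the scaling factor keeps the transformed errors homoskedastic and serially uncorrelated when the original errors are i.i.d.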

The FOD transformation is not available in any of official Stata’s DPD commands, but it is available in David Roodman’s xtabond2 implementation of the DPD estimator, available from SSC. To illustrate the use of the AB estimator, we may re-estimate the model with xtabond2, assuming that the only endogeneity present is that involving the lagged dependent variable.

xtabond2 n L(1/2).n L(0/1).w L(0/2).(k ys) yr*, gmm(L.n) ///
    iv(L(0/1).w L(0/2).(k ys) yr*) nolevel robust small

Note that in xtabond2 syntax, every right-hand variable generally appears twice in the command, as instruments must be explicitly specified when they are instrumenting themselves. In this example, all explanatory variables except the lagged dependent variable are taken as “IV-style” instruments, entering the Z matrix as a single column. The lagged dependent variable is specified as a “GMM-style” instrument, where all available lags will be used as separate instruments. The noleveleq option is needed to specify the AB estimator.

        A-B
L.n     0.686∗∗∗ (4.67)
L2.n    -0.0854 (-1.50)
w       -0.608∗∗ (-3.36)
k       0.357∗∗∗ (5.95)
ys      0.609∗∗∗ (3.47)

N 611

NB: t statistics in parentheses ∗ p < 0.05, ∗∗ p < 0.01, ∗∗∗ p < 0.001


In these results, 41 instruments have been created, with 17 corresponding to the "IV-style" regressors and the rest computed from lagged values of n. Note that the coefficient on the lagged dependent variable now lies within the range consistent with dynamic stability. In contrast to that produced by the Anderson–Hsiao estimator, the coefficient is quite precisely estimated.

There are 25 over-identifying restrictions in this instance, as shown in the first column below. The hansen_df represents the degrees of freedom for the Hansen J test of overidentifying restrictions. The p-value of that test is shown as hansenp.

            All lags            Lags 2-5            Lags 2-4
L.n         0.686∗∗∗ (4.67)     0.835∗ (2.59)       1.107∗∗∗ (3.94)
L2.n        -0.0854 (-1.50)     0.262 (1.56)        0.231 (1.32)
w           -0.608∗∗ (-3.36)    -0.671∗∗ (-3.18)    -0.709∗∗ (-3.26)
k           0.357∗∗∗ (5.95)     0.325∗∗∗ (4.95)     0.309∗∗∗ (4.55)
ys          0.609∗∗∗ (3.47)     0.640∗∗ (3.07)      0.698∗∗∗ (3.45)
hansen_df   25                  16                  13
hansenp     0.177               0.676               0.714

In this table, we can examine the sensitivity of the results to the choice of "GMM-style" lag specification. In the first column, all available lags of the level of n are used. In the second column, the lag(2 5) option is used to restrict the maximum lag to 5 periods, while in the third column, the maximum lag is set to 4 periods. Fewer instruments are used in those instances, as shown by the smaller values of hansen_df. The p-value of Hansen's J is also considerably larger for the restricted-lag cases. On the other hand, the estimate of the lagged dependent variable's coefficient appears to be quite sensitive to the choice of lag length.
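For instance, the 'Lags 2-5' column can be reproduced by adding the lag(2 5) suboption to the earlier xtabond2 specification (a sketch, under the same assumptions as before):

xtabond2 n L(1/2).n L(0/1).w L(0/2).(k ys) yr*, gmm(L.n, lag(2 5)) ///
    iv(L(0/1).w L(0/2).(k ys) yr*) nolevel robust small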

We illustrate estimating this equation with both the FD transformation and the forward orthogonal deviations (FOD) transformation:


            First diff          FOD
L.n         0.686∗∗∗ (4.67)     0.737∗∗∗ (5.14)
L2.n        -0.0854 (-1.50)     -0.0960 (-1.38)
w           -0.608∗∗ (-3.36)    -0.563∗∗∗ (-3.47)
k           0.357∗∗∗ (5.95)     0.384∗∗∗ (6.85)
ys          0.609∗∗∗ (3.47)     0.469∗∗ (2.72)
hansen_df   25                  25
hansenp     0.177               0.170

The results appear reasonably robust to the choice of transformation, with slightly more precise estimates for most coefficients when the FOD transformation is employed. We might reasonably consider, as did Blundell and Bond (1998), that wages and the capital stock should not be taken as strictly exogenous in this context, as we have in the above models. Re-estimating the equation producing “GMM-style” instruments for all three variables, with both one-step and two-step VCE:

xtabond2 n L(1/2).n L(0/1).w L(0/2).(k ys) yr*, gmm(L.n w k) ///
    iv(L(0/2).ys yr*) nolevel robust small

            One-step             Two-step
L.n         0.818∗∗∗ (9.51)      0.824∗∗∗ (8.51)
L2.n        -0.112∗ (-2.23)      -0.101 (-1.90)
w           -0.682∗∗∗ (-4.78)    -0.711∗∗∗ (-4.67)
k           0.353∗∗ (2.89)       0.377∗∗ (2.79)
ys          0.651∗∗∗ (3.43)      0.662∗∗∗ (3.89)
hansen_df   74                   74
hansenp     0.487                0.487


The results from both one-step and two-step estimation appear reasonable. Interestingly, only the coefficient on ys appears to be more precisely estimated by the two-step VCE. With no restrictions on the instrument set, 74 over-identifying restrictions are defined, with 90 instruments in total. The 2-step estimator is generally more efficient. It uses first-step estimation to obtain the optimal weighting matrix used at the second step. The optimal GMM estimator is obtained by using the ‘twostep’ option. [See Cameron & Trivedi, Microeconometrics Using Stata, pp. 180-181 & p. 295]

To illustrate system GMM, we follow Blundell and Bond, who used the same abdata dataset on a somewhat simpler model, dropping the second lags and removing sectoral demand. We consider wages and capital as potentially endogenous, with GMM-style instruments.

Estimate the one-step BB model:

xtabond2 n L.n L(0/1).(w k) yr*, gmm(L.(n w k)) iv(yr*, equation(level)) ///
    robust small

We indicate here with the equation(level) suboption that the year dummies are only to be considered instruments in the level equation. As the default for xtabond2 is the BB estimator, we omit the noleveleq option that called for the AB estimator in earlier examples.

            n
L.n         0.936∗∗∗ (35.21)
w           -0.631∗∗∗ (-5.29)
k           0.484∗∗∗ (8.89)
hansen_df   100
hansenp     0.218

We find that the coefficient on lagged employment is much higher than in the AB estimates, although it may still be distinguished from unity. 113 instruments are created, with 100 degrees of freedom in the test of overidentifying restrictions.


2.1.4. Non-stationarity, Unit Roots and Cointegration in Panels

Testing the unit root and cointegration hypotheses by using panel data instead of individual time series involves several additional complications:

❑ Panel data generally introduce a substantial amount of unobserved heterogeneity, rendering the parameters of the model cross-section specific.

• If T is large enough, we can estimate each time series separately and test for heterogeneity. This raises a question as to what the parameters of interest are: the coefficients of the individual units, say 훽푖 for i = 1, ..., N, or the expected values ('means') and the variances of the coefficients over the groups, E[훽푖] and Var[훽푖]?

Consider the following example data generating process (read: the true process driving the data), taken from Smith and Fuertes (2007): let

푦푖푡 = 휇푖 + 휀푖푡, 퐸[휀푖푡] = 0, 푣푎푟[휀푖푡] = 휎²

For each group i there is zero-mean variation in y around a constant group-specific mean 휇푖. Furthermore, these group-specific means also vary across groups:

휇푖 = 휇 + 휂푖; 퐸[휂푖] = 0, 푣푎푟[휂푖] = 퐸[휂푖²] = 휎휂²

We can now consider the different means (푦̅) we can estimate for this very simple example:

\bar{y}=\frac{1}{NT}\sum_{i}\sum_{t}y_{it},\quad E[\bar{y}]=\mu; \qquad \bar{y}_{t}=\frac{1}{N}\sum_{i}y_{it},\quad E[\bar{y}_{t}]=\mu; \qquad \bar{y}_{i}=\frac{1}{T}\sum_{t}y_{it},\quad E[\bar{y}_{i}]=\mu_{i}

Even in such a simple model setup, we thus obtain very different results: two of the averages are estimates for the population average (휇), whereas the third is for the group-specific average (휇푖), with variances of the estimators differing across all three. All of them are unbiased and {NT, N, T}-consistent estimators of something; the question is just whether this something is interesting at all.

Lesson: what is the statistic of interest? What is the research question?

In many empirical applications, it is inappropriate to assume that the cross-section units are independent. To overcome these difficulties, variants of panel unit root tests have been developed that allow for different forms of cross-sectional dependence: variable and/or residual correlation across


panel members, due to common shocks (e.g. a recession) or spillover effects. Standard panel estimators assume cross-section independence. If neglected, cross-section dependence (CSD) can lead to imprecise estimates and, at worst, to a serious identification problem.

Example: Agro-climatic ‘distance’ — how similar or different is the climatic environment in agriculture?

❑ The panel test outcomes are often difficult to interpret if the null of the unit root or cointegration is rejected. The best that can be concluded is that "a significant fraction of the cross-section units is stationary or cointegrated". The panel tests do not provide explicit guidance as to the size of this fraction or the identity of the cross-section units that are stationary or cointegrated.

❑ With unobserved I(1) (i.e. integrated of order unity) common factors affecting some or all the variables in the panel, it is also necessary to consider the possibility of cointegration between the variables across the groups (cross-section cointegration) as well as within-group cointegration.

❑ The asymptotic theory is considerably more complicated due to the fact that the design involves a time as well as a cross-section dimension. For example, applying the usual Dickey-Fuller test to a panel data set introduces a bias that is not present in the case of a univariate test. Furthermore, a proper limit theory has to take into account the relationship between the increasing number of time periods and cross-section units (cf. Phillips and Moon, 1999).

Spatial econometrics: Econometrician ‘knows’ how panel members are associated/correlated (e.g. neighbourhood), models this association explicitly employing a weight matrix (‘spatially lagged dependent variable’).

Common factor models: Models dependence with unobserved common factors ft with heterogeneous impact γi. The trick is to estimate common factors or blend out their impact on estimation.


In general, the commonly-used unit root tests, such as the Dickey-Fuller (DF) and the Augmented DF (ADF) test, have non-standard limiting distributions which depend on whether deterministic components are included in the regression equation. Moreover, in finite samples, unit root tests based on a single time series have little power in distinguishing the unit root from stationary alternatives with highly persistent deviations from equilibrium. The reason for the poor performance of standard unit root tests in the panel framework may be the different null hypothesis tested in this case.

Review of the Dickey Fuller Test for time series The DF test considers three sets of equations:

yt = ρyt-1 + t (i) No constant and no trend

yt = 0 + ρyt-1 + t (ii) A constant and no trend

yt =  0+ 2t + ρyt-1 + t (iii) Both the constant and trend Thus, the test hypothesis;

Ho: ρ = 0 (non-stationary or with unit root)

H1: ρ < 0 (stationary or no unit root).

The Dickey-Fuller test statistic has a non-standard distribution (i.e. not normal, not t, not chi-squared), which means that special tables are required. Usually, the critical values of the Dickey-Fuller test statistic (and other non-standard test statistics) are determined by simulation. The DF test is generalized into the Augmented DF (ADF) test to accommodate more general ARMA dynamics in the errors. With the ADF we assume the error term is independently and identically distributed. The ADF test is thus specified as follows:

∆y_{t} = \alpha_{0} + \alpha_{2}t + \rho\, y_{t-1} + \sum_{i=1}^{k}\beta_{i}\,\Delta y_{t-i} + u_{t}

The equation is more general: it allows for the presence of a non-zero mean and a constant deterministic drift. The maximum lag length k is determined using one of the information criteria. The distribution of the DF/ADF test statistic depends on the inclusion/exclusion of a constant and/or a time trend.
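As a quick Stata reminder of the time-series counterpart, a minimal sketch (gdp and year are hypothetical variables in a tsset dataset; the trend term and four lags are illustrative choices):

tsset year
dfuller gdp, trend lags(4) regress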


Extending the ADF Test to Panel Unit Roots

Panel dataset has two dimensions, i (denoting cross-section units) and t (denoting the time period).

In the single equation case, we are interested in testing 휙1 = 0 against the alternative hypothesis

휙1 < 0, and we apply a unit root test to a single time series. Instead, in the panel data case, the hypothesis we are interested in is: 휙푖 = 0 against the alternative hypothesis 휙푖 < 0. The ADF can be extended to panel data more generally as:

∆y푖푡 = α푖0 + α푖2푡 + 휌 y푖푡−1 + 휕푖휃푡 + 푢푖푡

휃푡 is called common factor as it introduces cross-section dependence via factor loadings 휕푖

퐻0: 휌 = 0 for all i (the unit root null; this is the null in all panel unit root tests)

퐻푎: depends on whether it is a homogeneous panel or a heterogeneous panel

Note that in panel unit root testing, we have both homogeneous and heterogeneous panels. Homogeneous panels are assumed to have common unit root processes for all the cross-section units while heterogeneous panels are assumed to have (possibly) different unit root processes. Some unit root tests are applicable to homogeneous panels; others are applicable to heterogeneous panels. Two generations of panel unit root tests have been developed:

First Generation: cross-sectional independence

Non-stationarity tests:
• Levin and Lin (1992, 1993) and Levin, Lin and Chu (2000): pooled ADF test (levinlin)
• Im, Pesaran and Shin (1997, 2003): averaged unit root test for heterogeneous panels (IPS) (ipshin)
• Maddala and Wu (1999): Fisher combination test (MW) (xtfisher); and Choi (1999, 2001)


Stationarity tests:
• Choi's (2001) extension; Hadri (2000)
• Breitung (2000), Hadri (2000), Harris & Tzavalis (1999) (xtunitroot with options breitung, hadri, ht, respectively)

Second Generation: cross-sectional dependence

Factor structure:
• Pesaran (2003) (pescadf)
• Moon and Perron (2004)
• Bai and Ng (2002, 2004)
• Choi (2002)

Other approaches:
• O'Connell (1998)
• Chang (2002, 2004)
• Pesaran, Smith, and Yamagata (2009) panel unit root test (xtcipsm, under construction)

NB: the names in parentheses are the corresponding Stata commands.

The first-generation unit root tests are applicable to homogeneous panels. Several of these tests are straightforward extensions of the classic (Augmented) Dickey-Fuller unit root tests for pure time series data. On the other hand, the second-generation unit root tests are applicable to heterogeneous panels. These tests allow the unit root processes to vary among the cross-section units, and are obtained by combining individual unit root test processes to devise a panel-specific test. Some of these tests are straightforward averages of the classic (Augmented) Dickey-Fuller unit root tests for pure time series data.

First Generation Panel Unit Root Tests (PURTs)

The first generation of tests includes Levin, Lin and Chu's test (2002), Im, Pesaran and Shin (2003) and the Fisher-type test proposed first by Maddala and Wu (1999), then developed by Choi (2001). The main limit of these tests is that they are all constructed under the assumption that the individual time series in the panel are cross-sectionally independently distributed, when on the contrary a large body of literature provides evidence of co-movements between economic variables.


The Basic Model

Assume that time series {푦푖0, … , 푦푖푇} , on the cross-section units i = 1, 2, … , N are generated for each i by a simple first-order autoregressive, AR(1), process:

푦푖푡 = (1 − 훼푖)휇푖 + 훼푖푦푖,푡−1 + 휀푖푡

where the initial values, 푦푖0, are given, and the errors 휀푖푡 are independently and identically distributed (i.i.d.) across i and t with E(휀푖푡) = 0, E(휀푖푡²) = 휎푖² < ∞ and E(휀푖푡⁴) < ∞. These processes can also be written equivalently as simple Dickey-Fuller (DF) regressions:

∆푦푖푡 = −휙푖휇푖 + 휙푖푦푖,푡−1 + 휀푖푡, where 휙푖 = 훼푖 − 1.

In further developments of the model, it is also helpful to write (1) or (2) in mean-deviations forms

푦̅푖푡 = 훼푖푦̅푖,푡−1 + 휀푖푡, where 푦̅푖푡 = 푦푖푡 − 휇푖. The corresponding DF regression in 푦̅푖푡 is given by:

∆푦̅푖푡 = 휙푖푦̅푖,푡−1 + 휀푖푡

The null hypothesis of interest is:

퐻0: 휙1 = ⋯ = 휙푁 = 0

That is, all time series are independent random walks. We will consider two alternatives:

퐻1푎: 휙1 = ⋯ = 휙푁 ≡ 휙 and 휙 < 0

퐻1푏: 휙1 < 0, …, 휙푁0 < 0, 푁0 ≤ 푁

Under 퐻1푎 it is assumed that the autoregressive parameter is identical for all cross-section units (see, for example, Levin and Lin (1993, LL), and Levin, Lin and Chu (2002)). This is called the homogeneous alternative. 퐻1푏 assumes that 푁0 of the N (0 < 푁0 ≤ 푁) panel units are stationary with individual-specific autoregressive coefficients. This is referred to as the heterogeneous alternative (see, for example, Im, Pesaran and Shin (2003, IPS)). For the consistency of the test it is assumed that 푁0/푁 → 푘 > 0 as 푁 → ∞. Different panel testing procedures can be developed depending on which of the two alternatives is being considered. The panel unit root statistics motivated by the first alternative,


퐻1푎, pool the observations across the different cross-section units before forming the "pooled" statistic, whilst the tests developed against the heterogeneous alternative, 퐻1푏, operate directly on the test statistics for the individual cross-section units, using (standardized) simple averages of the underlying individual statistics or their suitable transformations, such as rejection probabilities. Despite the differences in the way the two tests view the alternative hypothesis, both tests can be consistent against both types of alternatives. Also, interpretation of the outcomes of both tests is subject to similar considerations discussed in the introduction. When the null hypothesis is rejected, one can only conclude that a significant fraction of the AR(1) processes in the panel does not contain unit roots.

Testing for unit roots within homogeneous dynamic panels is considered, for example, by Harris and Tzavalis (1999) and Levin, Lin, and Chu (2002). One drawback of tests based on such an alternative hypothesis is that they usually have power even if not all units are stationary, and hence a rejection is not convincing evidence that all series are indeed stationary. In particular, Westerlund and Breitung (2009) show that the local power of the Levin, Lin, and Chu (2002) test is greater than that of the Im, Pesaran, and Shin (2003) test, based on a less restrictive alternative, also when not all individual series are stationary. A further drawback in using 퐻1푎 is that it is likely to be unduly restrictive, particularly for cross-country studies involving differing short-run dynamics. For example, such a homogeneous alternative seems particularly inappropriate in the case of the purchasing power parity (PPP) hypothesis, where 푦푖푡 is taken to be the real exchange rate. There are no theoretical grounds for the imposition of the homogeneity hypothesis, 휙푖 = 휙, under PPP.

The alternative hypothesis 퐻1푏 is at the basis of panel unit root tests proposed by Chang (2002) and Chang (2004). It is only appropriate when N is finite, namely within the multivariate model with a fixed number of variables analyzed in the time series literature. On the contrary, in the case of large N and T, panel unit root tests will lack power if the alternative, 퐻1푏, is adopted. For large N and T panels it is reasonable to entertain alternatives that lie somewhere between the two extremes of 퐻1푎 and 퐻1푏. In this context, a more appropriate alternative is given by:

휙푖 < 0 for 푖 = 1, …, 푁1; 휙푖 = 0 for 푖 = 푁1 + 1, 푁1 + 2, …, 푁,

such that: lim_{푁→∞} 푁1/푁 = 푘, 0 < 푘 ≤ 1


Using the above specification, the null hypothesis is: 퐻0: 푘 = 0, while the alternative hypothesis can be written as: 퐻1푐: 푘 > 0

In other words, rejection of the unit root null hypothesis can be interpreted as providing evidence in favour of rejecting the unit root hypothesis for a non-zero fraction of panel members as 푁 → ∞. In cases where T is sufficiently large, 푘 can be estimated by application of univariate unit root tests to all the individual time series in the panel. With N and T sufficiently large, a consistent estimate of 푘 will then be given by the proportion of the cross-section units for which the individual unit root tests are rejected. In applications where T is not sufficiently large, 푘 cannot be estimated consistently, although the panel unit root test outcome could still be valid, in the sense that it has the correct size under the null hypothesis, 퐻0.

Residual-Based LM Test: The test proposed by Hadri (2000) is based on the null hypothesis of stationarity. It is an extension of the stationarity test developed by Kwiatkowski et al. (1992) in the time series context. Hadri proposes a residual-based Lagrange multiplier test for the null hypothesis that the individual series are stationary around a deterministic level or around a deterministic trend, against the alternative of unit root.

푦푖푡 = 푟푖푡 + 푢푖푡; 푟푖푡 = 푟푖,푡−1 + 푒푖푡

퐻0: 휎푒² = 0 (the random-walk component has zero variance, so each series is stationary)

The Hadri test allows for heteroskedasticity adjustments. Its empirical size is close to its nominal size if N and T are large. The Stata command for the Hadri test is: xtunitroot hadri varname [if] [in] [, Hadri options]

Summary

Both the Im-Pesaran-Shin and Fisher-type tests relax the restrictive assumption of Levin-Lin-Chu that ρi must be the same for all series under the alternative hypothesis. Also, when N is small, the empirical size of both tests is close to their nominal size of 5 percent (the Fisher test shows some distortions at N = 100). With respect to the size-adjusted power, the Fisher-type test outperforms


the Im-Pesaran-Shin test. It should be mentioned that in the presence of a linear time trend, the power of all tests decreases considerably. The Levin-Lin-Chu test has high power if the time dimension T is large. This can be problematic, because one might infer stationarity for the whole panel even if it is only true for a few individuals. On the other hand, it has low power for a small time dimension. In this case, one can conclude non-stationarity when in fact most series display stationary behaviour. Thus, it is advisable to analyze the outcome of both the Levin-Lin-Chu and the Im-Pesaran-Shin test. All in all, there is no dominant performance of one particular test. That is, the econometrician must actively think about the benefits and drawbacks of the different tests (and might compare the outcomes of different tests). A caveat is that all tests discussed here assume cross-sectional independence. Therefore, they cannot be applied to panels of the second and third generation. There are, however, tests that can deal with cross-sectional dependence, for instance Choi (2002), Chang (2002, 2004), and Pesaran (2007).

Second generation PURTs The assumption of independence across members of the panel is rather restrictive, particularly in the context of cross-country regressions. Moreover, this cross-sectional correlation may affect the finite sample properties of panel unit root test. Cross section dependence can arise due to a variety of factors, such as omitted observed common factors, spatial spillover effects, unobserved common factors, or general residual interdependence that could remain even when all the observed and unobserved common effects are taken into account.

To overcome the difficulty in first generation tests, a second generation of tests relaxing the cross-sectional independence hypothesis has been proposed. Within this second generation of tests, two main approaches are distinguished. The first one consists in imposing few or no restrictions on the residual covariance matrix and has been adopted notably by Chang (2002, 2004), who proposed the use of nonlinear instrumental variable methods or the use of bootstrap approaches to solve the nuisance parameter problem due to cross-sectional dependency. The second approach relies on the factor structure approach and includes contributions by Bai and Ng (2004), Phillips and Sul (2003), Moon and Perron (2004), Choi (2002) and Pesaran (2003) among others.


The Bai and Ng tests

Bai and Ng (2001, 2004) proposed the first test of the unit root null hypothesis taking into account potential cross-sectional correlation. The problem consists in the specification of a particular form for these dependencies. Bai and Ng suggest a rather simple approach and consider a factor analytic model:

푦푖푡 = 퐷푖푡 + 휆푖′퐹푡 + 푒푖푡

where 퐷푖푡 is a deterministic polynomial trend function; 퐹푡 is an (r, 1) vector of common factors, and 휆푖 is a vector of factor loadings. Thus, the individual series 푦푖푡 is decomposed into a heterogeneous deterministic component 퐷푖푡, a common component 휆푖′퐹푡 and a largely idiosyncratic error term 푒푖푡.

It is worth noting that it is the presence of the common factors 퐹푡, to which each individual has a specific loading 휆푖, which is at the origin of the cross-sectional dependencies.

In this case, 푦푖푡 is said to be nonstationary if at least one common factor of the vector 퐹푡 is nonstationary and/or the idiosyncratic term 푒푖푡 is nonstationary. Nothing guarantees that these two terms have the same dynamic properties: one could be stationary, the other nonstationary; some components of 퐹푡 could be I(0), others I(1); 퐹푡 and 푒푖푡 could be integrated of different orders, etc. However, it is well known that a series defined as the sum of two components with different dynamic properties has dynamic properties that are very different from those of its components. Thus, it may be difficult to check the stationarity of 푦푖푡 if this series contains a large stationary component. This is why, rather than directly testing the non-stationarity of 푦푖푡, Bai and Ng (2004) suggest separately testing for the presence of a unit root in the common and individual components. This procedure is called PANIC (Panel Analysis of Non-stationarity in the Idiosyncratic and Common components) by the authors.

What is the advantage of this procedure with regard to cross-sectional dependencies? It lies in the fact that the idiosyncratic component 푒푖푡 can be considered as only weakly correlated across individuals, while, at the same time, the complete series 푦푖푡 can display high cross-sectional correlations. One of the main criticisms addressed to the first generation of unit root tests, principally in the context of macroeconomic series, is thus dropped.


The implementation of the Bai and Ng (2004) test clearly illustrates the importance of co-movements across individuals. Indeed, for panels characterized by strong cross-sectional dependency, the Bai and Ng tests, by taking into account the common factors across series, accept the null hypothesis of a unit root in the factors, leading to the conclusion that the series is nonstationary.

The Phillips and Sul (2003) and Moon and Perron (2004) tests

In contrast to Bai and Ng (2001), Phillips and Sul (2003) and Moon and Perron (2004) directly test the presence of a unit root in the observable series 푦푖푡: they do not proceed to separate tests on the individual and common components. Beyond this fundamental difference, there exist some similarities between the two approaches due to the use of a factor model. The underlying idea of the Moon and Perron (2004) approach is the following: it consists in transforming the model in order to eliminate the common components of the 푦푖푡 series, and in applying the unit root test to the de-factored series. Applying this procedure removes the cross-sectional dependencies, and it is then possible to derive normal asymptotic distributions. Thus, normal distributions are obtained, as for Im et al. (2003) or Levin and Lin (1992, 1993), but the fundamental difference here is that the test statistics are computed from de-factored data. Thus, they are independent in the individual dimension.

The Choi tests

Like Moon and Perron (2004), Choi (2002) tests the unit root hypothesis using a modified observed series 푦푖푡 that allows the elimination of the cross-sectional correlations and the potential deterministic trend components. However, while the principle of the Choi (2002) approach is broadly similar to that of Moon and Perron, it differs in two main respects. In this model, in contrast to Bai and Ng (2001) and Moon and Perron (2004), there exists only one common factor (r = 1). More fundamentally, the Choi model assumes that the individual variables 푦푖푡 are equally affected by the time effect (i.e. the unique common factor). The second difference with the Moon and Perron (2004) approach, linked to a certain extent to the model specification, lies in the


orthogonalization of the individual series 푦푖푡 that will be used for a unit root test only on the individual component.

The Pesaran tests

Pesaran (2003) proposes a different approach to deal with the problem of cross-sectional dependencies. He considers a one-factor model with heterogeneous factor loadings for the residuals, as in Phillips and Sul (2003). However, instead of basing the unit root tests on deviations from the estimated common factors, he augments the standard Dickey-Fuller or Augmented Dickey-Fuller regressions with the cross-section averages of lagged levels and first-differences of the individual series. The Pesaran test is based on these individual cross-sectionally augmented ADF statistics, denoted CADF. A truncated version, denoted CADF*, is also considered to avoid undue influence of extreme outcomes that could arise for small T samples.

Second generation tests: other approaches

There is a second approach to modelling the cross-sectional dependencies, which is more general than those based on dynamic factor models or error component models. It consists in imposing few or no restrictions on the covariance matrix of residuals. It is in particular the solution adopted in O'Connell (1998), Maddala and Wu (1999), Taylor and Sarno (1998), and Chang (2002, 2004). Such an approach raises some important technical problems. With cross-sectional dependencies, the usual Wald-type unit root tests based on standard estimators have limit distributions that depend in a very complicated way upon various nuisance parameters defining the correlations across individual units. There does not exist any simple way to eliminate these nuisance parameters.

The first attempt to deal with this problem was made in O'Connell (1998). He considers a covariance matrix similar to that which would arise in an error component model with mutually independent random time effects and random individual effects. However, this specification of cross-sectional correlations remains too specific to be widely used. Another attempt was proposed in Maddala and Wu (1999). They propose a way out of the problem by using a bootstrap method to get the empirical distributions of the LL, IPS or Fisher-type test statistics in order to make


inferences. Their approach is technically difficult to implement since it requires bootstrap methods for panel data. Besides, as pointed out by Maddala and Wu, the bootstrap methods result in a decrease of the size distortions due to the cross-sectional correlations, although they do not eliminate them. So, the bootstrap versions of the first-generation tests perform much better, but a formal proof of the validity of the bootstrap methodology in this setting is not provided. More recently, a second-generation bootstrap unit root test has been proposed by Chang (2004), who considers a general framework in which each panel is driven by a heterogeneous linear process, approximated by a finite-order autoregressive process. In order to take into account the dependency among the innovations, Chang proposes a unit root test based on the estimation of the entire system of N equations. The critical values are then computed by a bootstrap method.

Another solution consists in using instrumental variables (IV hereafter) to solve the nuisance parameter problem due to cross-sectional dependency. This is the solution adopted in Chang (2002). The Chang (2002) testing procedure is as follows. In a first step, for each cross-section unit, the autoregressive coefficient is estimated from a usual ADF regression using instruments generated by an integrable transformation of the lagged values of the endogenous variable. N individual t-statistics for testing the unit root are then constructed based on these N nonlinear IV estimators. For each unit, this t-statistic has a limiting standard normal distribution under the null hypothesis. In a second step, a cross-sectional average of these individual unit root test statistics is considered, as in IPS.

Panel Unit Root Tests in Stata

Stata implements a variety of tests for unit roots or stationarity in panel datasets with xtunitroot. The Levin–Lin–Chu (2002), Harris–Tzavalis (1999), Breitung (2000; Breitung and Das 2005), Im–Pesaran–Shin (2003), and Fisher-type (Choi 2001) tests have as the null hypothesis that all the panels contain a unit root. The Hadri (2000) Lagrange multiplier (LM) test has as the null hypothesis that all the panels are (trend) stationary. Options allow you to include fixed effects and time trends in the model of the data-generating process.


Example (Syntax):

Levin-Lin-Chu test: xtunitroot llc varname [if] [in] [, LLC options]
Im, Pesaran and Shin test: xtunitroot ips varname [if] [in] [, IPS options]
Fisher-type test: xtunitroot fisher varname [if] [in], {dfuller | pperron} lags(#)

The assorted tests make different asymptotic assumptions regarding the number of panels in your dataset and the number of time periods in each panel. xtunitroot has all the bases covered, including tests appropriate for datasets with a large number of panels and few time periods, datasets with few panels but many time periods, and datasets with many panels and many time periods. The majority of the tests assume that you have a balanced panel dataset, but the Im–Pesaran–Shin and Fisher-type tests allow for unbalanced panels.
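A minimal sketch using the Penn World Table real exchange rate panel that accompanies Stata's xtunitroot documentation (the variable name lnrxrate and the particular lag choices are taken as assumptions from that example dataset):

webuse pennxrate, clear
xtunitroot llc lnrxrate, lags(aic 10)
xtunitroot ips lnrxrate, lags(aic 10)
xtunitroot fisher lnrxrate, dfuller lags(2)
xtunitroot hadri lnrxrate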

Cointegration Testing

The estimation of long-run relationships has been the focus of extensive research in time series econometrics. In the case of variables on a single cross-section unit, the existence and the nature of long-run relations are investigated using cointegration techniques developed by Engle and Granger (1987), Johansen (1991, 1995) and Phillips (1991). As in time-series cointegration, there are three types of cointegration tests:

❑ Residual-based tests with the null of no cointegration: residuals from (potentially) cointegrating regressions are subjected to unit-root tests. If the unit root is rejected, then cointegration is supported.

❑ System tests with the null of no (or less) cointegration: VAR models are estimated, and the rank of the impact matrix determines the cointegrating rank. Such ideas only work if T is large and N remains small.

❑ Tests with the null of cointegration: extensions of the KPSS idea. These have been found to perform poorly in panels.

There have been several panel data cointegration tests. They all allow for heterogeneity in the cointegrating coefficients. But the null and alternatives imply that either all the relationships are


cointegrated or all the relationships are not cointegrated. There is no allowance for some relationships to be cointegrated and others not (the Fisher test allows for this, as the p-values can differ across units).

Kao (1997) and Pedroni (1997) proposed the original tests for cointegration in panels under the null of no cointegration. The Kao (1997) test is used for homogeneous panels. Pedroni (1997) gives two sets of statistics: the first set is for testing cointegration in homogeneous panels and the second set of statistics is for testing cointegration in heterogeneous panels. McCoskey and Kao (1998) proposed the use of the average of the Augmented Dickey-Fuller (ADF) statistics over cross-sections based on Im et al. (1997) to test the hypothesis of no cointegration in heterogeneous panels. Pedroni (1997) suggests seven test statistics whose null is no cointegration. Kao (1999) suggests tests for no cointegration as the null and McCoskey and Kao (1998) suggest tests for the null of cointegration. The methodology for the derivation of the test statistics in the two cases is different.
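In recent versions of Stata (15 and later), the Kao and Pedroni residual-based tests are available through the xtcointtest command; a minimal sketch with hypothetical xtset panel variables y and x, both assumed I(1):

xtcointtest kao y x
xtcointtest pedroni y x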

Fisher Test for Cointegration

This test uses the same ideas as in the Maddala-Wu and Choi tests for unit roots. The Johansen Fisher-based test combines the p-values from Johansen's cointegration test across the different cross-section units. Thus, it combines information from the individual cointegration tests (Johansen's cointegration test is implemented in EViews).

The main Fisher result is that the p-values 휋푖 from independent tests are independent 푈(0,1) variables under the null,

and 휆 = −2 ∑_{i=1}^{N} ln(휋푖) → 휒²_{2N} as 푇 → ∞ (i.e. the time dimension tends to infinity), not 푁 → ∞.

Residual-based tests for panel cointegration

Most panel cointegration tests assume a common cointegrating relationship across i. A fixed-effects (possibly 'cointegrating') regression 푦푖푡 = 푋푖푡′훽 + 휇푖 + 휀푖푡 is estimated by OLS, and unit-root tests are applied to the residuals 휀̂. Dickey-Fuller test statistics on these 휀̂ can then be calculated by a pooled regression or averaged over the i individual statistics (Kao and Pedroni tests).


Kao (1999): Kao suggests a total of 5 tests, all rather restrictive on the cointegrating vector (common) and dynamics (common). Kao proposes four DF-type statistics and an ADF statistic. The first two DF statistics are based on assuming strict exogeneity of the regressors with respect to the errors in the equation, while the remaining two DF statistics allow for endogeneity of the regressors. The DF statistic which allows for endogeneity, and the ADF statistic, involve deriving some nuisance parameters from the long-run conditional variances.

Run a static fixed effects model of the variables assumed cointegrated, get the residuals and apply a pooled ADF regression (analogous to the Engle-Granger procedure in time series). We get a Dickey-Fuller test of cointegration.
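A hedged sketch of this two-step logic in Stata, with hypothetical xtset panel variables y and x; note that the usual Dickey-Fuller critical values do not apply to unit-root tests on estimated residuals, so this is illustrative only and not a substitute for the dedicated panel cointegration tests:

xtreg y x, fe
predict ehat, e
xtunitroot fisher ehat, dfuller lags(1)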

The DF type tests from Kao can be calculated from the estimated residuals as:

푒̂푖푡 = 휌푒̂푖,푡−1 + 푣푖푡

where 푒̂푖푡 = 푦푖푡 − 푥푖푡′훽̂

The null of no cointegration is represented as 퐻0 : 휌 = 1.

(if 푒̂푖푡 ∼ I(1), do not reject 퐻0; if 푒̂푖푡 ∼ I(0), reject 퐻0).

The OLS estimate of 휌 and the t-statistic are given as:

\hat{\rho}=\frac{\sum_{i=1}^{N}\sum_{t=2}^{T}\hat{e}_{it}\,\hat{e}_{i,t-1}}{\sum_{i=1}^{N}\sum_{t=2}^{T}\hat{e}_{i,t-1}^{2}}

t_{\hat{\rho}}=\frac{(\hat{\rho}-1)\sqrt{\sum_{i=1}^{N}\sum_{t=2}^{T}\hat{e}_{i,t-1}^{2}}}{s_{e}}

For the ADF tests, the following regression is considered:

\hat{e}_{it}=\rho\,\hat{e}_{i,t-1}+\sum_{j=1}^{p}\varphi_{j}\,\Delta\hat{e}_{i,t-j}+v_{it}


McCoskey and Kao (1998): McCoskey and Kao (1998) derived a residual-based LM test for the null of cointegration in panels. For the residual-based test, it is necessary to use an efficient estimation technique for the cointegrated variables. Several methods have been shown to be asymptotically efficient in the literature. These include the fully modified estimator of Phillips and Hansen (1990) and the dynamic least squares estimator of Saikkonen (1991) and Stock and Watson (1993). The model considered in McCoskey and Kao (1998) allows for varying slopes and intercepts:

푦푖푡 = α푖 + 푥푖푡′β + 푒푖푡

푥푖푡 = 푥푖푡−1 + 휖푖푡

e푖푡 = 푟푖푡 + 푢푖푡

푟푖푡 = 푟푖,푡−1 + 휃푢푖푡

where 푢푖푡 are i.i.d. (0, 휎푢²). The null hypothesis of cointegration is equivalent to 휃 = 0.

The LM test statistic proposed by McCoskey and Kao (1998) is defined as:

LM=\frac{\frac{1}{N}\sum_{i=1}^{N}\frac{1}{T^{2}}\sum_{t=1}^{T}S_{it}^{2}}{\hat{\sigma}_{e}^{2}}

where 푆푖푡 is the partial sum process of the residuals:

S_{it}=\sum_{j=1}^{t}\hat{e}_{ij}

The asymptotic result for the test is:

√푁(퐿푀 − 휇푣) ⇒ N(0, 휎푣²)

The moments 휇푣 and 휎푣² can be found through Monte Carlo simulation. The limiting distribution of LM is then free of nuisance parameters and robust to heteroskedasticity.

Pedroni (1999) and Pedroni (2004): Pedroni (1999) extends the procedure of residual-based panel cointegration tests that he introduced in Pedroni (1995) to models where there is more than one independent variable. He proposes several residual-based, null-of-no-cointegration panel


cointegration test statistics. The Pedroni DF test dominates tests based on Phillips-Perron statistics or on variance bounds. It introduces flexibility/heterogeneity in terms of the cointegrating vector and the dynamics. Pedroni (1999) considers, among others, two ideas for test statistics:

❑ Weighted averaging (because of heteroskedasticity) of N augmented Dickey-Fuller test statistics on residuals from the cointegrating regressions. After adjusting for tabulated correction factors, this statistic converges to N(0,1) under its null of no cointegration (sequential limits). ❑ Pooled evaluation of a joint weighted augmented DF statistic. After adjusting using tabulated values, this statistic converges to N(0,1) under its null of no cointegration (sequential limits).

The first version has power against the alternative that some individuals are cointegrated, while the second version has more power against the alternative that all individuals are cointegrated.

The Pedroni tests assume that the variables being tested are all I(1), so it is important to establish that the variables are I(1) before proceeding with the test. Under the null hypothesis of no cointegration, the residuals are then I(1). The estimation method underlying the tests is an extension of Engle and Granger's methodology.

The starting point of the residual-based panel cointegration test statistics of Pedroni (1999) is the computation of the residuals of the hypothesized cointegrating regression, after which the relevant panel cointegration test statistics are computed. Pedroni proposes seven different test statistics: three group-mean (also called "between") test statistics and four pooled (also called "within") test statistics. The stationarity of the estimated residuals is examined with either a technique similar to the Dickey-Fuller tests or one based on the correction terms of the single-equation Phillips-Perron tests. The null hypothesis of no cointegration for the panel cointegration test is the same for each test statistic,

H0: ρi = 1 for all i,

whereas the alternative hypothesis differs between the between-dimension-based and within-dimension-based panel cointegration tests. The (heterogeneous) alternative hypothesis for the between-dimension-based statistics is:


H1: ρi < 1 for all i, where a common value ρi = ρ is not required.

For within-dimension-based statistics, the alternative hypothesis (homogeneous) is:

H1: ρ = ρi < 1 for all i, which assumes a common value ρi = ρ.

2nd Generation Cointegration Tests

Westerlund (2007)

Many studies fail to reject the no-cointegration null, even in cases where cointegration is strongly suggested by theory. One explanation for this failure to reject centers on the fact that most residual- based cointegration tests, both in pure time series and in panels, require that the long-run parameters for the variables in their levels are equal to the short-run parameters for the variables in their differences. Banerjee, Dolado, and Mestre (1998) and Kremers, Ericsson, and Dolado (1992) refer to this as a common-factor restriction and show that its failure can cause a significant loss of power for residual-based cointegration tests. As a response to this, Westerlund (2007) developed four new panel cointegration tests that are based on structural rather than residual dynamics and, therefore, do not impose any common-factor restriction.

The idea is to test the null hypothesis of no cointegration by inferring whether the error-correction term in a conditional panel error-correction model is equal to zero. The tests are all normally distributed and are general enough to accommodate unit-specific short-run dynamics, unit-specific trend and slope parameters, and cross-sectional dependence. The error-correction tests assume the following data-generating process:

Δyit = δ′i δt + γi (yi,t−1 − β′i xi,t−1) + Σ_{j=1}^{pi} αij Δyi,t−j + Σ_{j=−qi}^{pi} θij Δxi,t−j + eit,

where t = 1,...,T and i = 1,...,N index the time-series and cross-sectional units, respectively, while δt contains the deterministic components, for which there are three cases: in the first case, δt = 0 (no deterministic terms); in the second case, δt = 1 (a constant); and in the third case, δt = (1, t)′ (both a constant and a trend). Two tests are designed to test the alternative hypothesis that the panel is


cointegrated as a whole, while the other two test the alternative that at least one unit is cointegrated. If the null hypothesis of no error correction is rejected, then the null hypothesis of no cointegration is also rejected.

NB: The tests are implemented in Stata as the community-contributed command xtwest, which uses the AIC/SBC to choose the optimal lag and lead lengths. To avoid misleading inference in the presence of cross-member correlation, xtwest also provides a bootstrap() option.
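A minimal sketch of the command, assuming hypothetical panel variables y, x1 and x2 and that the community-contributed command has been installed, is:

. ssc install xtwest
. * error-correction tests with a constant, user-set lag/lead ranges and
. * bootstrapped p-values that are robust to cross-sectional dependence
. xtwest y x1 x2, constant lags(1 3) leads(0 3) lrwindow(3) bootstrap(100)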

Gengenbach, Urbain, and Westerlund (2009)

An ECM approach: A common factor structure for cross-section dependence (CSD) is assumed and accounted for in the test regressions:

Δyit = γi yi,t−1 + α1i xi,t−1 + α2i Fi,t−1 + Σ_{s=1}^{pi} π1is Δyi,t−s + Σ_{s=0}^{pi} π2is Δxi,t−s + Σ_{s=0}^{pi} π3is ΔFi,t−s + eit

     = γi yi,t−1 + α1i xi,t−1 + ζ1i ŷi,t−1 + ζ2i x̂i,t−1 + Σ_{s=1}^{pi} π1is Δyi,t−s + Σ_{s=0}^{pi} π2is Δxi,t−s + Σ_{s=1}^{pi} ξ1is Δŷi,t−s + Σ_{s=0}^{pi} ξ2is Δx̂i,t−s + eit

The equation is estimated for each i individually with the 'ideal' lag length (chosen by the AIC or BIC selection criterion), and the results are then averaged, using either the t-ratios of γ̂i or a joint statistic based on γ̂i and α̂1i. The test distributions are nonstandard, so critical values obtained from simulations must be used. In practice, a truncation rule is applied to limit the influence of outliers.

Estimating the cointegrating vector

The direct cointegrating regression estimated by FE/OLS yields poor coefficient estimates. Estimation methods used include:

A) D-OLS (dynamic OLS, Kao & Chiang, 2000) augments the cointegrating regression by lags and leads:

yt = β0 + Σ_{i=1}^n βi xi,t + Σ_{i=1}^n Σ_{j=−k1}^{+k2} γi,j Δxi,t−j + εt


DOLS yields unbiased and asymptotically efficient estimates of the long-run relationship even when some regressors are endogenous, thus allowing us to control for potential endogeneity. Additionally, in panel data samples with a small time dimension the DOLS estimator performs better than other available estimators, such as the non-parametric fully modified ordinary least squares (FMOLS) estimator.
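A minimal DOLS sketch using only official Stata commands, assuming an xtset panel with hypothetical variables y and x and a single lead and lag of Δx, is:

. gen dx   = D.x      // first difference of the regressor
. gen dx_f = F.dx     // one lead of the differenced regressor
. gen dx_l = L.dx     // one lag of the differenced regressor
. * fixed-effects cointegrating regression augmented with leads and lags of Δx
. xtreg y x dx dx_f dx_l, fe

The coefficient on x is the DOLS estimate of the long-run parameter; note that the conventional fixed-effects standard errors are not the asymptotically correct ones for this long-run coefficient.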

B) Breitung's two-step panel cointegration estimator: Breitung suggests a two-step estimator that extends the Johansen VAR procedure to the panel case. This method is preferable if there is more than one cointegrating vector. The Johansen maximum-likelihood system cointegration procedure for time series relies on an algebraic transformation of the system using an estimate of the loading matrix α. Breitung assumes that the cointegrating space is identical for all i, whereas αi may be heterogeneous. Thus, the two steps are:

1. Run Johansen-type procedures for each i = 1, ..., N. Transform (rotate) the data using the estimated αi, and determine the cointegrating rank.

2. Run pooled OLS on all transformed observations. This yields an estimate of the matrix β that contains the cointegrating vectors.

The estimator is asymptotically normally distributed. The statistic for the rank test is also distributed N(0,1) in large samples; it uses, however, correction factors for mean and variance bias.

NB:
• Cointegration tests and estimators can be sensitive to cross-section dependence, particularly unmodeled cross-section cointegration (units i and j connected by a cointegrating vector);
• Among the residual-based cointegration tests, the Pedroni tests based on the Dickey-Fuller concept are the most reliable;
• The best estimator for a single cointegrating relation is the DOLS method;
• If the cointegrating rank r is larger than one, Breitung's two-step algorithm performs best, but it tends to overstate r.


Having established a cointegration relationship, the long-run parameters can be estimated efficiently using techniques similar to those proposed for single time-series models. Specifically, fully modified OLS procedures, the dynamic OLS estimator and estimators based on a vector error-correction representation have been adapted to panel data structures. Most approaches employ a homogeneous framework, that is, the cointegration vectors are assumed to be identical for all panel units, whereas the short-run parameters are panel specific. Although such an assumption seems plausible for some economic relationships, there are other behavioral relationships (like the consumption function or money demand) where a homogeneous framework seems overly restrictive. On the other hand, allowing all parameters to be individual specific would substantially reduce the appeal of a panel data study. It is therefore important to identify parameters that are likely to be similar across panel units while at the same time allowing for sufficient heterogeneity of other parameters. This requires the development of appropriate techniques for testing the homogeneity of a subset of parameters across the cross-section units. When N is small relative to T, standard likelihood ratio-based statistics can be used.

The procedures described above only indicate whether or not the variables are cointegrated, i.e. whether a long-run (LR) relationship exists between them. To identify the direction of causality, we estimate a panel-based VECM and use it to conduct tests on the energy consumption-GDP relationship, following Engle and Granger's (1987) procedure.

• Step 1: We estimate the LR model in order to obtain the estimated residuals. Let the empirical model for this test be as follows:

gdpit = αi + δi t + β enit + γ pit + eit,

where gdp, en and p are the natural logs of GDP, energy consumption and prices, respectively.

• Step 2: We estimate a Granger causality model with a dynamic error correction term as follows:

Δgdpit = θ1j + Σ_{k=1}^m θ11ik Δgdpi,t−k + Σ_{k=1}^m θ12ik Δeni,t−k + Σ_{k=1}^m θ13ik Δpi,t−k + λ1i êi,t−1 + u1it

Δenit = θ2j + Σ_{k=1}^m θ21ik Δeni,t−k + Σ_{k=1}^m θ22ik Δgdpi,t−k + Σ_{k=1}^m θ23ik Δpi,t−k + λ2i êi,t−1 + u2it

Δpit = θ3j + Σ_{k=1}^m θ31ik Δeni,t−k + Σ_{k=1}^m θ32ik Δgdpi,t−k + Σ_{k=1}^m θ33ik Δpi,t−k + λ3i êi,t−1 + u3it

where Δ denotes first differences, m is the optimal lag length determined by the Schwarz Bayesian Criterion, and êi,t−1 is the lagged residual from the Step 1 regression.
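A minimal Stata sketch of this two-step procedure, assuming an xtset panel with hypothetical variables gdp, en and p (in logs) and two lags, is:

. * Step 1: long-run relation by fixed effects; keep the residuals as the error-correction term
. xtreg gdp en p, fe
. predict ect, e
. * Step 2: error-correction (Granger causality) equation for GDP
. xtreg D.gdp L(1/2).D.gdp L(1/2).D.en L(1/2).D.p L.ect, fe
. test LD.en L2D.en        // short-run causality from energy to GDP
. test L.ect               // long-run causality through the error-correction term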

Questions:

1) Consider the dynamic, linear, cross-country, random-effects regression model yit = α + βxit + δizit + γyi,t−1 + ui + εit, t = 1,...,4 (and yi,0 is observed data), in which i is a country and t is a year; yit is national income per capita, zit is domestic investment and xit is a measure of national labor input. You have 30 countries and 4 years of data. Note that the coefficient on zit is allowed to differ across countries.

i) Assuming for the moment that δi is constant across countries, show that the pooled ordinary least squares estimator is inconsistent.

ii) Continuing to assume that δi is the same for all countries, show how two approaches, (1) Anderson and Hsiao and (2) Hausman and Taylor, could be used to obtain consistent estimators of β, δ and γ.

2) You have a sample of N individuals for T years. Suppose you estimate by OLS the annual income equation: yit = α0 + α1edi + α2ageit + α3(edi × ageit) + γyi,t−1 + uit, where edi represents the years of education of the ith individual, ageit represents the age of individual i in period t and uit represents all unobservables. Suppose you estimate γ as 0.82 with a standard error of 0.12. State a set of sufficient assumptions for the consistency of the OLS estimator in this context.


3) Suppose you wish to estimate a dynamic model of the form yit = βxit + fi + uit, with uit = ρui,t−1 + eit, where fi is an unobserved fixed effect and the unobservables eit are independent and identically distributed over time. The single regressor xit may be correlated with fi, is uncorrelated with eit, but is not strictly exogenous.

i) Derive a consistent estimator for β. State carefully any assumptions you might have to make and also the minimum number of observations required for estimation.

ii) What is the covariance matrix of your estimator?

iii) Suggest a way of testing the hypothesis that ρ = 0 and describe a consistent estimator for β under the hypothesis that ρ = 0. State carefully any assumptions you might have to make and also the minimum number of observations required for estimation.

Further Readings

Dynamic Panel Data: Theoretical/review papers/books

Arellano, M. (2003), Panel Data Econometrics, Oxford University Press, Oxford.
Baltagi, B.H. (2008), Econometric Analysis of Panel Data, 4th Edition, John Wiley: New York.
Anderson, T.W. and C. Hsiao (1982), "Formulation and Estimation of Dynamic Models Using Panel Data," Journal of Econometrics, 18: 47–82.
Arellano, M. and S. Bond (1991), "Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations," Review of Economic Studies, 58: 277–297.
Arellano, M. and O. Bover (1995), "Another Look at the Instrumental Variable Estimation of Error-Components Models," Journal of Econometrics, 68: 29–51.
Judson, R.A. and A.L. Owen (1999), "Estimating Dynamic Panel Data Models: A Guide for Macroeconomists," Economics Letters, 65: 9–15.


TOPIC 2.2. DISCRETE CHOICE MODELS (12 HOURS)

Objectives of the Topic After completing this Topic, students will be able to:

• Identify common features of discrete choice models; • Identify the limitations and advantages of each model; • Understand the methodological principles of discrete choice modeling; • Fit, interpret, and use discrete choice models; and • Apply these models to practical problems in economics using statistical software such as STATA.

Introduction

Individuals face choices every day, which underscores the importance of discrete choice models for analyzing individual choice behavior. Discrete choice models are used to solve problems in many fields, such as agriculture, economics, accounting, health, engineering, environmental management, urban planning, tourism and transportation. For example, discrete choice modeling is used in agriculture to identify the technologies or innovations most beneficial to farmers, in health to study preferences for healthcare, and in marketing to guide product positioning and pricing.

Discrete choice models are statistical procedures that model choices made by people among a finite set of alternatives. These models statistically relate the choice made by each individual to the attributes of the individual and the attributes of the alternatives available to him/her. For instance, the choice of which car a person buys is statistically related to the person's income and age, and to the price, fuel efficiency, size, and other attributes of each available car. The models estimate the probability that a person chooses a particular alternative. Discrete choice models take many forms, including: binary choice models (logit and probit), unordered choice of multiple categories (multinomial logit, conditional logit, mixed logit, multinomial probit, nested logit and multivariate probit) and


ordered choice of multiple categories (ordered logit and ordered probit). In this Topic we will discuss these models in detail.

2.2.1. Binary Choice Models

2.2.1.1. Introduction

The models discussed thus far in the first semester course and Topic 2.1 are well suited for modelling a continuous variable - e.g. economic growth, log of value-added or output, log of earnings, etc. Many economic phenomena of interest, however, concern variables that are not continuous, or perhaps not even quantitative. What characteristics (e.g. parental) affect the likelihood that a JFE student scores an excellent grade in econometrics? What determines labor force participation (employed vs not employed)? What factors drive the incidence of COVID-19? Under this sub-topic, we discuss binary choice or dummy variable models, which are central models in applied econometrics. We start with Ordinary Least Squares (OLS) applied to a binary dependent variable, known as the Linear Probability Model (LPM), and show its drawbacks. We then study alternative methods for analyzing data with dichotomous response variables, the logit and probit models, which overcome the limitations of the LPM. Binary choice models are useful not only when the dependent variable of interest is binary, but also as an ingredient in other models. For example, in propensity score models (Topic 2.4), we identify the average treatment effect by comparing outcomes of treated and non-treated individuals who, a priori, have similar probabilities of being treated; the probability of being treated is typically modelled using probit or logit models. In Heckman's selection model (Topic 2.3), we use a probit model in the first stage to predict the likelihood that someone is included in the sample, and then control for the likelihood of being selected when estimating the equation of interest (e.g. a wage equation). Binary choice models are also a good starting point for studying more complicated models such as models for multinomial or ordered responses (Topic 2.2), and models combining continuous and discrete outcomes such as Tobit and Heckman models (Topic 2.3).


2.2.1.2. Linear Probability Model (LPM)

Whenever the variable that we want to model is binary, it is natural to think in terms of probabilities, e.g. 'What is the probability that a farmer with such and such characteristics adopts a technology?' or 'If the extension contact of a farmer increases by one, by how much does the probability of adopting the technology change?' When the dependent variable y is binary, it is typically equal to one for all observations in the data for which the event of interest has happened ('success') and zero for the remaining observations ('failure'). Provided we have a random sample, the sample mean of this binary variable is an unbiased estimate of the unconditional probability that the event happens. That is, if y denotes a binary dependent variable, then:

Pr(y = 1) = E(y) = (1/N) Σi yi,

where N is the number of observations in the sample.

Estimating the unconditional probability is trivial, but it is usually not the most interesting exercise. Suppose instead that we want to analyze the factors that determine changes in the probability that y equals one.

Activity question: Can we use OLS to estimate a model with a binary dependent variable? Can you prove your answer?

Let us consider the linear regression model:

y = β1 + β2x2 + ... + βKxK + Ꜫ
  = xβ + Ꜫ   (2.2.1)

where y is a binary dependent variable, β is a K×1 vector of parameters, x is an N×K matrix of explanatory variables, and Ꜫ is the error term.

For now, let us assume that the error term is uncorrelated with the predictors, i.e. there is no endogeneity problem. This allows us to use OLS to estimate the parameters of interest. To interpret the results, note that if we take expectations on both sides of Equation (2.2.1) we obtain:

E (y|x; β) = xβ


Now, just like the unconditional probability that y equals one is equal to the unconditional expected value of y, i.e. E(y) = Pr(y = 1), the conditional probability that y equals one is equal to the conditional expected value of y, i.e. Pr(y = 1|x) = E(y|x; β). Hence:

Pr(y = 1|x) = xβ   (2.2.2)

Since probabilities must sum to one, we also have:

Pr(y = 0|x) = 1 − xβ

Equation (2.2.2) is a binary choice model. In this particular model the probability of success (i.e. y = 1) is a linear function of the explanatory variables in the vector x. This is why using OLS with a binary dependent variable is called the Linear Probability Model (LPM). Notice that in the

LPM the parameter βj measures the change in the probability of ‘success’, resulting from a unit increase in the variable xj, holding other factors constant. That is,

ΔPr(y = 1|x) = βj Δxj

This can be interpreted as a partial effect on the probability of 'success'.

Example 2.2.1: Let us see how we can model the probability of being in excellent health condition using the data hyperlinked herewith..\..\bivariate_health.dta.

Using the Stata command for linear regression “regress hlthe age ndisease linc dmdu” we will get the following results.

Table 2.2.1a. Linear regression estimation of excellent health

      Source |        SS       df        MS        Number of obs =   5,574
-------------+-------------------------------      F(4, 5569)    = 154.020
       Model |    137.854       4     34.463       Prob > F      =   0.000
    Residual |   1246.156   5,569      0.224       R-squared     =   0.100
-------------+-------------------------------      Adj R-squared =   0.099
       Total |   1384.010   5,573      0.248       Root MSE      =   0.473


       hlthe |     Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
         age |    -0.007       0.000   -17.04    0.000       -0.007      -0.006
    ndisease |    -0.012       0.001   -12.19    0.000       -0.014      -0.010
        linc |     0.046       0.005     8.87    0.000        0.036       0.057
        dmdu |     0.016       0.014     1.20    0.232       -0.011       0.044
       _cons |     0.428       0.047     9.13    0.000        0.336       0.520

. predict hlthehat
(option xb assumed; fitted values)

. su hlthehat

    Variable |     Obs        Mean    Std. Dev.       Min        Max
    hlthehat |   5,574       0.541       0.157     -0.277      0.796

. count if hlthehat > 1
  0

. count if hlthehat < 0
  14

Limitations of the LPM

Clearly the LPM is straightforward to estimate; however, it has some important shortcomings.

1. One undesirable property of the LPM is that, if we plug certain combinations of values for the independent variables into Equation (2.2.2), we can get predictions either less than zero or greater than one. Of course, a probability by definition falls within the [0,1] interval, so predictions outside this range are meaningless and somewhat embarrassing. This is not an unusual result; for instance, based on the above LPM results, there are 14 observations for which the predicted probability is less than zero. That is, a non-negligible proportion of the predictions falls outside the [0,1] interval in this application, as the summary of the fitted values reported above under Table 2.2.1a shows.


Angrist and Pischke (2009) argued that linear regression may generate fitted values outside the LDV boundaries which bothers some researchers and has generated a lot of bad press for the linear probability model.

2. A related problem is that, conceptually, it does not make sense to say that a probability is linearly related to a continuous independent variable for all possible values. If it were, then continually increasing this explanatory variable would eventually drive P(y = 1|x) above one or below zero.

For example, the estimates above imply that a one-unit increase in log income raises the probability of excellent health by about 4.6 percentage points. This may be a reasonable approximation for individuals with average incomes, but for the very rich or the very poor the effect is probably smaller. In fact, if the linear specification is extrapolated far enough (the log income variable ranges from 0 to 10.3 in the data), it will eventually produce predicted probabilities above one or below zero, which is, of course, impossible.

3. A third problem with the LPM - arguably less serious than those above - is that the error term is heteroskedastic by definition. Why is this? Since y takes the value of 1 or 0, the error terms in

Equation (2.2.1) can take only two values, conditional on x: 1 − xiβ and −xiβ, with respective probabilities xiβ and 1 − xiβ. Hence,

var(Ꜫ|x) = Pr(y = 1|xi)[1 − xiβ]² + Pr(y = 0|xi)[−xiβ]²

         = xiβ[1 − xiβ]² + (1 − xiβ)[xiβ]²

         = xiβ[1 − xiβ],

which clearly varies with the explanatory variables xi. The OLS estimator is still unbiased, but the conventional formula for estimating the standard errors, and hence the t-values, will be wrong. The easiest way of solving this problem is to obtain estimates of the standard errors that are robust to heteroskedasticity.


Example 2.2.1 continued: The LPM results with robust standard errors (Table 2.2.1b) can be compared with the LPM results with non-robust standard errors (Table 2.2.1a).

Table 2.2.1b. Linear regression with robust standard errors . regress hlthe age ndisease linc dmdu, robust

Linear regression                               Number of obs =  5,574
                                                F(4, 5569)    = 192.17
                                                Prob > F      =  0.000
                                                R-squared     = 0.0996
                                                Root MSE      =  0.473

       hlthe |     Coef.     Robust       t     P>|t|     [95% Conf. Interval]
             |             Std. Err.
         age |    -0.007       0.000   -17.69    0.000       -0.007      -0.006
    ndisease |    -0.012       0.001   -12.98    0.000       -0.014      -0.010
        linc |     0.046       0.005     8.83    0.000        0.036       0.057
        dmdu |     0.016       0.014     1.18    0.236       -0.011       0.044
       _cons |     0.428       0.047     9.01    0.000        0.335       0.521

4. A fourth and related problem is that, because the residual can take only two values, it cannot be normally distributed. Non-normality does not bias the OLS estimates, but it does mean that inference in small samples cannot be based on the usual suite of normality-based distributions such as the t distribution.

In sum, these limitations indicate that applying OLS to a binary dependent variable, while simple, can produce predictions outside the unit interval, imposes constant partial effects, and yields heteroskedastic, non-normal errors; it is therefore generally not the most appropriate approach.


2.2.1.3. Logit and Probit Models

The two main problems with the LPM were: nonsense predictions are possible (there is nothing to bind the value of Y to the (0,1) range); and linearity doesn’t make much sense conceptually. To address these problems, we abandon the use of LPM or OLS for estimating binary response models. Consider instead a class of binary response models of the form:

Pr(y = 1|x) = G(β0 + β1x1 + ... + βKxK)

Pr(y = 1|x) = G(xβ)   (2.2.3)

where G is a function taking on values strictly between zero and one: 0 < G(z) < 1, for all real numbers z. The model specified in Equation (2.2.3) is often referred to in general terms as an index model, because Pr(y = 1|x) is a function of the vector x only through the index:

xβ = β0 + β1x1 + ... + βKxK,

which is simply a scalar. Notice that 0 < G(xβ) < 1 ensures that the estimated response probabilities are strictly between zero and one, which thus addresses the main worries of using the LPM. G is usually a cumulative distribution function (cdf), monotonically increasing in the index z (i.e. xβ), with

Pr (y = 1|x) → 1 as xβ → ∞

Pr (y = 1|x) → 0 as xβ→ -∞

It follows that G must be a non-linear function, and hence we cannot use OLS. Various non-linear functions for G have been suggested in the literature. By far the most common ones are the logistic distribution, yielding the logit model, and the standard normal distribution, yielding the probit model. In the logit model,

G (xβ) = exp (xβ)/(1 + exp (xβ)) = Λ(xβ) (2.2.4) which is between zero and one for all values of xβ (recall that xβ is a scalar). This is the cumulative distribution function (CDF) for a logistic variable. In the probit model, G is the standard normal CDF, expressed as an integral:


G(xβ) = Φ(xβ) = ∫_{−∞}^{xβ} φ(v) dv,  where  φ(v) = (1/√(2π)) exp(−v²/2)   (2.2.5)

is the standard normal density. This choice of G also ensures that the probability of success is strictly between zero and one for all values of the parameters and the explanatory variables.

The logit and probit functions are both increasing in xβ. Both functions increase relatively quickly at xβ = 0, while the effect on G at extreme values of xβ tends to zero. The latter result ensures that the partial effects of changes in explanatory variables are not constant, a concern we had with the LPM.

Also notice that the standard normal CDF has a shape very similar to that of the logistic CDF, suggesting that it does not matter much which of the two we choose to use in our analysis.

The latent variable framework

As we have seen, the probit and logit models resolve some of the limitations of the LPM model. The key, really, is the specification:

Pr(y = 1|x) = G(xβ), where G is the cdf of either the standard normal or the logistic distribution, because with either of these models we have a functional form that is easier to defend than the linear model. This, essentially, is how Wooldridge motivates the use of these models.

The traditional way of introducing probit and logit models in econometrics, however, is not as a response to a functional form problem. Instead, probit and logit are traditionally viewed as models suitable for estimating parameters of interest when the dependent variable is not fully observed. Let us have a look at this perspective.


Let y* be a continuous variable that we do not observe - a latent variable - and assume y* is determined by the model:

y* = β0 + β1x1 + ... + βKxK + Ꜫ = xβ + Ꜫ   (2.2.6)

where Ꜫ is the error term, assumed uncorrelated with x (i.e. x is not endogenous). While we do not observe y*, we do observe the binary choice made by the individual, according to the following choice rule:

y = 1 if y* > 0 and y = 0 if y* ≤ 0   (2.2.7)

Why is y* unobserved? Think of y* as representing the net utility of, say, buying a car. The individual undertakes a cost-benefit analysis and decides to purchase the car if the net utility is positive. We do not observe (because we cannot measure) the 'amount' of net utility; all we observe is the actual outcome of whether or not the individual buys the car. (If we had data on y* we could estimate the model with OLS as usual.)

Now, we want to model the probability that a ‘positive’ choice is made (e.g. buying, as distinct from not buying, a car). Assuming that Ꜫ follows a logistic distribution,

λ(Ꜫ) = exp(−Ꜫ)/[1 + exp(−Ꜫ)]²   (density),

Λ(Ꜫ) = exp(Ꜫ)/[1 + exp(Ꜫ)]   (CDF),

We have:

Pr (y = 1|x) = Pr (y*> 0|x)

= Pr (xβ+Ꜫ>0|x)

= Pr (Ꜫ > -xβ)

= 1 − Λ(−xβ)   (integrate)

= Λ(xβ)   (exploit symmetry)

Notice that the last step exploits the fact that the logistic distribution is symmetric, so that G(z) = 1 − G(−z) for all z. This is exactly the binary response model (2.2.3) with G equal to the logistic CDF in (2.2.4), i.e. the logit model. This is how the binary response model can be derived from an underlying latent variable model.


We can follow the same rule to derive the probit model. Assume Ꜫ follows a standard normal distribution:

Pr (y = 1|x) = Pr (y*> 0|x)

= Pr (xβ+Ꜫ>0|x)

= Pr (Ꜫ > -xβ)

= 1 − Φ(−xβ/σ)   (integrate)

= Φ(xβ)   (exploit symmetry)

where again we exploit symmetry and use σ = 1, as implied by the standard normal distribution. This is the binary response model (2.2.3) with G equal to the standard normal CDF, i.e. the probit model. Note that the assumption σ = 1 may appear restrictive; in fact, it is a necessary normalization, because σ cannot be estimated in a binary choice model.

Estimation of logit and probit models

To estimate the LPM we use OLS method. Because of the non-linear nature of the probit and logit models, however, we cannot apply linear estimators. Instead we use Maximum Likelihood (ML) estimation technique. The principle of ML is very general and not confined to probit and logit models. Before we go into using ML to estimate probit and logit models, here is an informal recap of ML.

Recap of the Maximum Likelihood (ML) method

Suppose that, in the population, there is a variable w which is distributed according to some distribution f(w; θ), where θ is a vector of unknown parameters. Suppose we have a random sample

{w1, w2, …, Wn} drawn from the population distribution f (w; θ) where θ is unknown. Our objective is to estimate θ. Our sample is more likely to have come from a population characterized by one particular set of parameter values, say 휃̂, than from another set of parameter values, say 휃̃.


The maximum likelihood estimate (MLE) of θ is simply the particular vector 휃̂ML that gives the greatest likelihood (or, if you prefer, probability) of observing the sample {w1, w2, …, Wn}.

Random sampling (an assumption) implies that w1, w2, …, Wn are independent of each other, hence the likelihood of observing {w1, w2, …, Wn} (i.e. the sample) is simply: . L (θ; w1, w2, …, Wn) = f (w1; θ) f (w2; θ) … f (wN; θ), or, in more compact notation, L (θ; w1, w2, 푁 …, Wn) = ∏푖=1 f(푤푖; θ). i.e. the product of the individual likelihoods. The equation just defined is a function of θ: for some values of θ the resulting L will be relatively high while for other values of θ it will be low. This is why we refer to equations of this form as likelihood functions. The value of θ that gives the maximum value of the is the maximum likelihood estimate of θ.

For computational reasons, it is much more convenient to work with the log-likelihood function:

ln L(θ; w1, w2, …, wN) = Σ_{i=1}^N ln f(wi; θ)   (2.2.8)

The value of θ that maximizes the log-likelihood function is θ̂ML.
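As a simple worked illustration, suppose each wi is a binary variable with f(w; θ) = θ^w (1 − θ)^(1−w), so that θ = Pr(w = 1). Then:

ln L(θ; w1, …, wN) = Σ_{i=1}^N [wi ln θ + (1 − wi) ln(1 − θ)]

Setting the derivative with respect to θ to zero, Σi wi/θ − Σi (1 − wi)/(1 − θ) = 0, gives θ̂ML = (1/N) Σi wi, the sample proportion of ones. This is the same quantity used earlier as the estimate of the unconditional probability Pr(y = 1).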

Maximum likelihood estimation of logit and probit models

We now return to the logit and probit models. How can ML be used to estimate the parameter of interest in these models, i.e. β? Assume that we have a random sample of size N. The ML estimate of β is the particular vector β̂ML that gives the greatest likelihood of observing the sample {y1, y2, …, yN}, conditional on the explanatory variables x.

By assumption, the probability of observing yi = 1 is G(xiβ) while the probability of observing yi = 0 is 1 − G(xiβ). It follows that the probability of observing the entire sample is given by:

L(y|x; β) = ∏_{i∈l} G(xiβ) ∏_{i∈m} [1 − G(xiβ)],

where l refers to the set of observations for which y = 1 and m refers to the set of observations for which y = 0. We can rewrite this as:


L(y|x; β) = ∏_{i=1}^N G(xiβ)^{yi} [1 − G(xiβ)]^{1−yi},

because when yi = 1 we get G(xiβ) and when yi = 0 we get [1 − G(xiβ)].

The log likelihood for the sample is given by:

ln L(y|x; β) = Σ_{i=1}^N { yi ln G(xiβ) + (1 − yi) ln[1 − G(xiβ)] }.

The MLE of β maximizes this log-likelihood function.

If G is the logistic CDF, we obtain the logit log likelihood:

ln L(y|x; β) = Σ_{i=1}^N { yi ln Λ(xiβ) + (1 − yi) ln[1 − Λ(xiβ)] }
             = Σ_{i=1}^N { yi ln[exp(xiβ)/(1 + exp(xiβ))] + (1 − yi) ln[1/(1 + exp(xiβ))] },

which simplifies to:

ln L(y|x; β) = Σ_{i=1}^N { yi xiβ − ln(1 + exp(xiβ)) }   (2.2.9)

If G is the standard normal CDF, we get the probit log likelihood:

ln L(y|x; β) = Σ_{i=1}^N { yi ln Φ(xiβ) + (1 − yi) ln[1 − Φ(xiβ)] }   (2.2.10)

How do we maximize the log-likelihood function? The sample log likelihood is given by:

ln L(y|x; β) = Σ_{i=1}^N { yi ln G(xiβ) + (1 − yi) ln[1 − G(xiβ)] }

Because the objective is to maximize the log likelihood with respect to the parameters of interest, at the maximum the following K first-order conditions must hold:

Σ_{i=1}^N { yi g(xiβ)/G(xiβ) − (1 − yi) g(xiβ)/[1 − G(xiβ)] } xi = 0

It is not possible to solve for β analytically here. Instead, to obtain parameter estimates, we rely on some sophisticated iterative ‘trial and error’ technique. There are many algorithms that can be


used, but this is not our concern here. The most common ones are based on first and sometimes second derivatives of the log likelihood function. Think of a blind man walking up a hill, and whose only knowledge of the hill comes from what passes under his feet. Provided the hill is strictly concave, the man should have no trouble finding the top. Luckily, the log likelihood functions for logit and probit are concave, but this is not always the case for other models.

Interpreting logit and probit results

In most cases the main objective is to determine the effects on the response probability Pr (y = 1|x) resulting from a one-unit increase of the explanatory variables, say xj.

Case 1: The explanatory variable is continuous

In linear models the marginal effect of a unit change in some explanatory variable on the dependent variable is simply the associated coefficient on the relevant explanatory variable. However, for logit and probit models obtaining measures of the marginal effect is more complicated (which should come as no surprise, as these models are non-linear). When xj is a continuous variable, its partial effect on Pr (y = 1|x) is obtained from the partial derivative:

∂Pr(y = 1|x)/∂xj = ∂G(xβ)/∂xj = g(xβ) βj,

where g(z) ≡ dG(z)/dz is the probability density function associated with G.

Because the density function is non-negative, the partial effect of xj will always have the same sign as βj. Notice that the partial effect depends on g (xβ); i.e. for different values of x1, x2, …, xk the partial effect will be different.
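In Stata, one way to obtain these partial effects after fitting the model of Example 2.2.1 (a minimal sketch, using the same variables as above) is:

. probit hlthe age ndisease linc dmdu
. * average marginal effects: g(xi*b)*bj averaged over the sample
. margins, dydx(*)
. * marginal effects evaluated at the sample means of the regressors instead
. margins, dydx(*) atmeans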

Activity question: Can you see at which values of xβ the partial (marginal) effect will be relatively small/large?


Case 2: The explanatory variable is discrete

If xj is a discrete variable then we should not rely on calculus in evaluating the effect on the response probability. To keep things simple, suppose x2 is binary. In this case the partial effect from changing x2 from zero to one, holding all other variables fixed, is given by:

G(β1 + β2·1 + β3x3 + ... + βKxK) − G(β1 + β2·0 + β3x3 + ... + βKxK)

Again, this depends on all the values of the other explanatory variables and the values of all the other coefficients. Again, knowing the sign of β2 is sufficient for determining whether the effect is positive or not, but to find the magnitude of the effect we have to use the formula above.

The Stata command 'mfx compute' automatically detects dummy explanatory variables and, in that case, uses the discrete-change formula above to estimate the partial effect.

Case 3: Non-linear explanatory variables

Suppose the model is given by:

Pr(y = 1|x) = G(β1 + β2x2 + β3x3 + β22x2²),

where x2 is a continuous variable.

Activity question: What is the marginal effect of x2 on the response probability?

Hypotheses testing in binary choice models

For any regression model estimated by ML, there are three choices for testing hypotheses about the unknown parameter estimates: (1) Wald test statistic, (2) Likelihood ratio test, or (3) Lagrange Multiplier test. We consider them in turn.

The Wald test

The Wald test is the most commonly used test in econometric models. Indeed, it is the one that most statistics students learn in their introductory courses. Consider the following hypothesis test:

H0: β1=β

H1: β1≠β (2.2.11)


Quite often in this test we are interested in the case when β = 0. That is, in testing if the independent variable’s estimated parameter is statistically different from zero. However, β can be any value. Moreover, this test can be used to test multiple restrictions on the slope parameters for multiple independent variables. In the case of a hypothesis test on a single parameter, the t-ratio is the appropriate test statistic. The t-statistic is given by:

t = (bi − β)/s.e.(bi) ~ t(n−k−1)   (2.2.12)

where bi is the estimator of βi and k is the number of parameters estimated in the model. The F-statistic is the appropriate test statistic when the null hypothesis places restrictions on multiple parameters. According to Hauck and Donner (1977), the Wald test may exhibit awkward behavior when the sample size is small. For this reason, this test must be used with some care.
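In Stata, a Wald test of one or several restrictions can be carried out with the test command after estimation. A sketch, anticipating the low birth weight model of Example 2.2.2 and testing that the coefficients on age, lwt, ptl and ht are jointly zero, is:

. logistic low age lwt raceb raceo smoke ptl ht ui
. * Wald (chi-squared) test of the joint restriction
. test age lwt ptl ht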

The likelihood ratio test

The likelihood ratio test is based on a comparison of the maximum log of likelihood function for the unrestricted model with the maximum log of likelihood function for the model with the restrictions implied by the null hypothesis. Consider the null hypothesis given in (2.2.11).

Let L(β) be the value of the likelihood function when β1 is restricted to being equal to β, and let L(b1) be the value of the likelihood function when there is no restriction on the value of β1. Then the appropriate test statistic is given by:

LR = −2[ln L(β) − ln L(b1)]   (2.2.13)

The likelihood ratio statistic has the chi-squared distribution χ²(r), where r is the number of restrictions. Thus, using a likelihood ratio test involves two estimations: one with no restrictions on the model and one with the restrictions implied by the null hypothesis. Since the likelihood ratio test does not appear to exhibit perverse behavior with small sample sizes, it is an attractive test. Thus, we will run through an example of how to execute the test using Stata.

Example 2.2.2: Let us consider an example using data from Stata linked herewith..\..\lowbrithweight.dta


In this model, we estimate a model that explains the likelihood that a child will be born with a weight under 2,500 grams (low). The eight explanatory variables used in the model are listed in Table 2.2.2. The model to be estimated is:

ln[Pr(Low)/(1 − Pr(Low))] = β1Age + β2Lwt + β3RaceB + β4RaceO + β5Smoke + β6Ptl + β7Ht + β8Ui + ε.

Also, we want to test the null hypothesis that the coefficients on Age, Lwt, Ptl, and Ht are all zero. The first step is to estimate the unrestricted regression using the Stata command:

. logistic low age lwt raceb raceo smoke ptl ht ui

Table 2.2.2. Definition of the explanatory variables

Variable name   Definition
Age             Age of mother
Lwt             Weight at last menstrual period
RaceB           Dummy variable = 1 if mother is black; 0 otherwise
RaceO           Dummy variable = 1 if mother is neither white nor black; 0 otherwise
Smoke           Dummy variable = 1 if mother smoked during pregnancy; 0 otherwise
Ptl             Number of times mother had premature labor
Ht              Dummy variable = 1 if mother has a history of hypertension; 0 otherwise
Ui              Dummy variable = 1 if there is presence in mother of uterine irritability; 0 otherwise
Ftv             Number of visits to physician during first trimester

The results of this estimation are shown in column 2 of Table 2.2.3. Next, we save the results of this regression with the command:

. estimates store full

where "full" is the name that we will refer to when we want to recall the estimation results from this regression. Now we estimate the restricted model, omitting the variables whose parameters are to be restricted to being equal to zero:

. logistic low raceb raceo smoke ui


The results of this estimation are reported in column 3 of Table 2.2.3. Finally, we run the likelihood ratio test with the command:

. lrtest full .

Notice that we refer to the first regression with the name "full" and to the second (most recent) regression with the period (.). The results of this command are as follows:

Likelihood-ratio test                        LR chi2(4)  =  14.42
(Assumption: . nested in full)               Prob > chi2 =  0.006

The interpretation of these results is that the omitted variables are statistically significant at the 0.6 percent level.

Table 2.2.3. Estimation results

Explanatory variable                                               Unrestricted model   Restricted model
Age of mother                                                      -0.97 (-0.74)        —
Weight at last menstrual period                                    -0.99 (-2.19)        —
Dummy = 1 if mother is black; 0 otherwise                           3.54 (2.40)         3.05 (2.27)
Dummy = 1 if mother is neither white nor black; 0 otherwise         2.37 (1.96)         2.92 (2.64)
Dummy = 1 if mother smoked during pregnancy; 0 otherwise            2.52 (2.30)         2.95 (2.89)
Number of times mother had premature labor                          1.72 (1.56)         —
Dummy = 1 if mother has a history of hypertension; 0 otherwise      6.25 (2.64)         —
Dummy = 1 if there is presence in mother of uterine irritability;   2.12 (1.65)         2.42 (2.04)
  0 otherwise
Log likelihood                                                     -100.72             -107.93
Number of observations                                              189                 189
Pseudo-R2                                                           0.14                0.08
Note: Parameter estimates are odds ratios; z statistics are shown in parentheses.


The Lagrange multiplier test

The intuition behind the Lagrange multiplier (LM) test (or score test) is that the gradient of the log-likelihood function is equal to zero at the maximum of the likelihood function. If the null hypothesis in (2.2.11) is correct, then maximizing the log likelihood for the restricted model is equivalent to maximizing the log-likelihood function subject to the constraint specified by the null hypothesis. The LM test measures how close the Lagrange multipliers of this constrained maximization problem are to zero: the further they are from zero, the more likely it is that the null hypothesis will be rejected.

Economists generally do not make use of the LM test because the test is complicated to compute and the LR test is a reasonable alternative. Thus, as a practical matter the Wald test and the LR test are reasonable alternative test statistics to use to test most linear restrictions on the parameters. Moreover, since the calculations are relatively easy, it may make sense to calculate both test statistics to be sure they produce consistent conclusions. However, when the sample size is small, the LM test probably is preferred.

Goodness-of-fit measures

The standard measure of goodness-of-fit in the linear OLS regression model is R2. No such measure exists for non-linear models like the logit model. Several potential alternatives have been developed in the literature and are known collectively as pseudo-R2. Many of these measures are discussed in McFadden (1974), Amemiya (1981) and Maddala (1983). In case any researcher really cares about the pseudo-R2, a practical approach is to report the value that the computer program reports.

One additional measure of goodness-of-fit is the percentage correctly predicted. This measure can be computed in several ways. One way is to use the observed values of the independent variables to forecast the probability that the dependent variable equals one. Then, if the


predicted probability is above some critical value, you assume that the predicted value of the dependent variable is one. If it is below this value, you assume the predicted value of the dependent variable is zero. Then you construct a table that compares the predicted values of the dependent variable with its actual values, as shown in Table 2.2.4.

Table 2.2.4: Percent correctly predicted

                    Predicted
Actual          Ŷ = 0          Ŷ = 1
Y = 0            n00            n01
Y = 1            n10            n11

The percentage correctly predicted is equal to the sum of the diagonal elements, n00 + n11, divided by the sample size. The main problem with this measure is that the choice of the cutoff point is arbitrary. Traditionally, a cutoff point of 0.5 has been used, but there is no reason why this cutoff is the appropriate one. Cramer (2003, p. 67) suggests that a more appropriate cutoff point is the sample frequency of Y = 1, that is, (n10 + n11)/(n00 + n01 + n10 + n11). The bottom line is that the uncertainty about the proper choice of cutoff point is a major problem with using the percentage correctly predicted as a measure of goodness-of-fit.
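In Stata, the percentage correctly predicted (and the full 2×2 classification table) can be obtained with estat classification after a logit or probit fit. A sketch using the Example 2.2.2 variables, first with the default 0.5 cutoff and then with the sample frequency of the outcome as the cutoff, is:

. logit low age lwt raceb raceo smoke ptl ht ui
. * classification table with the default cutoff of 0.5
. estat classification
. * use the sample frequency of low = 1 as the cutoff instead
. summarize low
. estat classification, cutoff(`r(mean)')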

2.2.2. Unordered Choice Models

2.2.2.1. Introduction

Unordered choice models have been widely implemented in the social sciences since McFadden (1973)’s seminal paper on transportation choice. With such models, individuals choose among several options, but each individual has a different choice set, as a function of their characteristics and/or attributes of the choices. The unordered choice models have a dependent variable that is a categorical, unordered variable. The choices/categories are called alternatives (coded as 1, 2, 3, 4…) and only one alternative can be selected. Examples include the type of insurance contract


(health, travel, fire, etc.) that an individual selects, and a farmer's choice of climate change adaptation strategy (soil and water conservation, early planting, early maturing varieties, irrigation). Models under this category include the multinomial logit, multinomial probit, nested logit and multivariate probit models.

2.2.2.2. Multinomial Logit/Probit Models

Multinomial outcome examples include: The type of insurance contract that an individual selects, climate change adaptation strategies (adjusting planting dates, improved crop varieties, crop diversification, small scale irrigation, and soil and water conservation practices), occupational choice by an individual (business, academic, non-profit organization) and the choice of fishing mode (beach, pier, private boat, charter boat).

Multinomial outcome dependent variable

The dependent variable y is a categorical, unordered variable. An individual may select only one alternative. The choices/categories are called alternatives and are coded as j =1, 2, …, m. The numbers are only codes and their magnitude cannot be interpreted (use frequency for each category instead of means to summarize the dependent variable). The data are usually recorded in two formats: a wide format and a long format. When using the wide format, the data for each individual i is recorded on one row. The dependent variable is: yi = j

When using the long format, the data for each individual i is recorded on j rows, where j is the number of alternatives. The dependent variable is: 1 if y = j y j =  0 if y  j 

Therefore, yj = 1 if alternative j is the observed outcome and the remaining yk = 0 for k ≠ j. For each observation only one of y1, y2, …, ym will be non-zero.


Table 2.2.5. Example for multinomial data in wide form

Person ID (i)   Dependent variable (y)          Codes for y   wi (income)   xi1 (price of alternative 1)   xi2 (price of alternative 2)
1               apple juice (alternative 1)     y = 1         40,000        2.5                            1.5
2               orange juice (alternative 2)    y = 2         38,000        2.7                            1.7
3               orange juice (alternative 2)    y = 2         50,000        2.9                            1.6

Table 2.2.6. Example for multinomial data in long form

Person ID (i)   Dependent variable (yj)         Codes for yj   wi (income)   xij (price)

1 apple juice (alternative 1) y1 = 1 40,000 2.5

1 orange juice (alternative 2) y2 = 0 40,000 1.5

2 apple juice (alternative 1) y1 = 0 38,000 2.7

2 orange juice (alternative 2) y2 = 1 38,000 1.7

3 apple juice (alternative 1) y1 = 0 50,000 2.9

3 orange juice (alternative 2) y2 = 1 50,000 1.6

The multinomial density for one observation is defined as:

f(y) = p1^{y1} · p2^{y2} · ... · pm^{ym} = ∏_{j=1}^m pj^{yj}

The probability that individual i chooses the jth alternative is:

pij = Pr[yi = j] = Fj(x′iβ)

The functional form of Fj should be selected so that the probabilities lie between 0 and 1 and sum over j to one. Different functional forms of Fj lead to multinomial logit model, multinomial probit model, conditional, mixed and nested models.


Independent variables

Two types of independent variables are worth mentioning here:

1. Alternative-invariant or case-specific regressors: The regressors wi vary over the individual i but do not vary over the alternative j. For example, income, age, and education are different for each individual but they do not vary based on the type of a product that the individual selects. In this case, multinomial logit model is used.

2. Alternative-variant or alternative-specific regressors: The regressors xij vary over the individual i and the alternative j. Examples include, prices for products vary for each product and individuals may also pay different prices and salaries for occupation may be different between occupations and also for each individual. In this case, the conditional and mixed logit models are used.

Multinomial logit model

The multinomial logit model is used with alternative-invariant regressors. The probability that individual i will select alternative j is given by:

pij = Pr(yi = j) = exp(xiβj) / Σ_{k=1}^m exp(xiβk)

This model is a generalization of the binary logit model. The probabilities for choosing each alternative sum up to 1, i.e.,

Σ_{j=1}^m pij = 1

In the MNL model, one set of coefficients needs to be normalized to zero to identify the model

(usually β1 = 0), so there are (m − 1) sets of coefficients estimated. The coefficients of the other alternatives are interpreted in reference to the base outcome. For alternative j, the coefficient is


interpreted in comparison to the base alternative: an increase in the independent variable makes the selection of alternative j more or less likely relative to the base.

Marginal effects

Depending on which alternative we select as a base category, the coefficients will be different (in reference to the base category) but the marginal effects will be the same regardless of the base category. Note that the marginal effects of each variable on the different alternatives sum up to zero. As to the interpretation of marginal effects, each unit increase in the independent variable increases/decreases the probability of selecting alternative j by the marginal effect expressed as a percent. The STATA command for marginal effect for mlogit is: margins, dydx(_all).

Independence from Irrelevant Alternatives (IIA) property

The IIA property states that, for a specific individual, the ratio of the choice probabilities of any two alternatives is entirely unaffected by the systematic utilities of any other alternatives. This property arises from the assumption in the derivation of the logit model that the error terms Ꜫn are independent across alternatives. In other words, it is assumed that the unobserved attributes (error terms) of the alternatives are independent. The odds ratios in the multinomial logit model are therefore independent of the other alternatives: for choices j and k, the odds ratio depends only on the coefficients for choices j and k.

For the MNL model, the odds ratio is given by:

pij /pik = exp [xi (βj - βk)]

Example 2.2.3: The best-known illustration of this weakness of the multinomial logit model (the IIA assumption) is the red bus-blue bus problem. Suppose the initial choice is between walking and taking a red bus; according to the IIA assumption, introducing a blue bus will not change the ratio of the choice probabilities. The problem is that this assumption is invalid in many situations. Let us assume that initially there are only red buses. Moreover, an individual chooses to walk with probability 2/3 and the probability


of taking a red bus is 1/3. Hence, the probability ratio is 2:1. With the introduction of blue buses, it is rational to believe that the probability of walking will not change. If the number of red buses equals the number of blue buses, the probability that a person walks is 4/6, that a person takes a red bus is 1/6 and that a person takes a blue bus is 1/6. The new probability ratio for walking vs. the red bus is therefore 4:1, which is not possible under IIA.

The following probabilities result from the IIA assumption:

P(by foot) = 2/4, P(red bus) = 1/4, P(blue bus) = 1/4, such that P(walk)/P(red bus) = 2/1

The problem is that, under IIA, the probability of walking decreases from 2/3 to 2/4 with the introduction of blue buses, which is not plausible. The root of the IIA assumption is that the error terms are assumed to be independently distributed across all alternatives. The IIA property causes no problems when the alternatives are clearly distinct, but it becomes problematic when some alternatives are close substitutes: for example, the unobserved factors affecting the choice of a red bus are highly correlated with those affecting the choice of a blue bus, so the model's implied substitution patterns are unrealistic.

Hausman test for IIA

The Hausman test statistic, where the subscript s denotes estimates from a restricted subset of choices and f denotes estimates from the model with the full set of choices, is given by:

(β̂s − β̂f)′ [V̂s − V̂f]⁻¹ (β̂s − β̂f) ~ χ²

In the Hausman test, the null hypothesis is H0: IIA is valid ("odds ratios" are independent of additional alternatives). By "omitting" a category, we then see whether the estimated coefficients change significantly. If we reject H0, we cannot apply the multinomial logit and should choose a nested logit or multinomial probit model instead.
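A common way to implement this test in Stata (a sketch with a hypothetical categorical outcome y and regressor x) is to fit the full model, refit it after dropping one alternative from the choice set, and compare the two sets of estimates with hausman:

. mlogit y x, baseoutcome(1)
. estimates store full
. mlogit y x if y != 4, baseoutcome(1)     // drop alternative 4 from the choice set
. estimates store partial
. * Hausman test of IIA: restricted (subset) versus full estimates
. hausman partial full, alleqs constant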


Example 2.2.4: The dataset for this example comes from Herriges and Kling (1999) and Cameron and Trivedi (2005). It is linked herewith ..\..\multinomial_fishing.dta. We want to study how income affects the fishing choice of individuals. The multinomial dependent variable has four categories/alternatives: beach, pier, private, and charter boat. Independent variable is income. It is alternative invariant. Data are in a wide form, with one row for every individual.

Table 2.2.5. Tabulation of fishing mode

Fishing mode   Code for alternative   Freq.   Percent
beach          1                      134     11.34
pier           2                      178     15.06
private        3                      418     35.36
charter        4                      452     38.24
Total                                 1,182   100.00

There will be three sets of coefficients for income (one set is normalized to zero – the reference/base category) but there will be four sets of marginal effects for income (normalization of coefficients does not matter).

Table 2.2.6a. Multinomial logit coefficients (with coefficients for charter boat fishing normalized to zero)

            Beach    Pier     Private boat   Charter boat
Income      0.03    -0.11*    0.12*          0
Constant   -1.34*   -0.53*   -0.60*          0

With regards to the interpretation of the coefficient of income, in comparison to charter boat fishing, higher income is associated with a lower likelihood of pier fishing and a higher likelihood of private boat fishing.


Table 2.2.6b. Multinomial logit coefficients (with coefficients for pier fishing normalized to zero)

            Beach    Pier   Private boat   Charter boat
Income      0.14*    0      0.23*          0.11*
Constant   -0.81*    0     -0.08           0.53*

With regards to the interpretation of the coefficient of income, in comparison to pier fishing, higher income is associated with higher likelihoods of beach, private boat, and charter boat fishing.

Table 2.2.7. Marginal effects of MNL model

          Beach     Pier     Private boat   Charter boat
Income    0.00008  -0.02*    0.03*         -0.01*

Note that marginal effects are the same regardless of whether alternative 4 or 2 is the base category. As to the interpretation of marginal effects, a one-unit increase in income (corresponding to a thousand dollars) is associated with pier fishing being 2% less likely, private fishing being 3% more likely, and charter fishing being 1% less likely. The marginal effects sum up to zero.
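As a minimal sketch (the names of the choice and income variables are assumptions about the dataset), the results in Tables 2.2.6a, 2.2.6b and 2.2.7 could be produced in STATA along the following lines:

* coefficients with charter (4) and then pier (2) as the base category
mlogit mode income, baseoutcome(4)
mlogit mode income, baseoutcome(2)
* marginal effects of income on the probability of each alternative
margins, dydx(income) predict(outcome(1))
margins, dydx(income) predict(outcome(2))
margins, dydx(income) predict(outcome(3))
margins, dydx(income) predict(outcome(4))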

Multinomial probit model

The multinomial probit model is similar to multinomial logit model, just like the binary probit model is similar to the binary logit model. The difference is that it uses the standard normal cdf. The probability that observation i will select alternative j is:

pij = p(yi = j) = Φ(xi′β)

Estimating a multinomial probit model takes longer than estimating a multinomial logit model. As in the binary case, the coefficients differ from the logit coefficients by a scale factor. The marginal effects are calculated in a similar way.
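A minimal sketch of the corresponding STATA commands (the variable names are hypothetical) is:

* multinomial probit with alternative 1 as the base category
mprobit choice x1 x2, baseoutcome(1)
margins, dydx(_all) predict(outcome(1))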


Conditional logit model

The conditional logit model is used with alternative-invariant and alternative-varying regressors. The probability that observation i chooses alternative j is:

pij = p(yi = j) = exp(xij′β) / Σ_{k=1}^{m} exp(xik′β)

where xij are alternative-varying regressors.

The conditional logit model has only one set of coefficients (β) for the alternative-specific regressors. The probabilities of choosing each alternative sum to 1. The coefficients for the alternative-invariant regressors (γj) are treated in the same way as in the multinomial logit model. That is, one set of coefficients for the alternative-invariant regressors is normalized to zero (say γ1 = 0); this is the base outcome, and the remaining coefficients are interpreted relative to this base category. There are (j − 1) sets of such coefficients (the number of alternatives minus 1 for the base). The coefficient for alternative j is interpreted as follows: in comparison to the base alternative, an increase in the independent variable makes the selection of alternative j more or less likely.

For the coefficients on the alternative-specific regressors (β), no normalization is needed; there is only one set of coefficients across all alternatives. The coefficient is interpreted as follows: an increase in the price of one alternative decreases the probability of choosing that alternative and increases the probability of choosing the other alternatives.

Marginal effects

The marginal effect of a one-unit increase in a regressor on the probability of selecting alternative j is given by:

∂pij/∂xik = pij (δijk − pik) β

where δijk = 1 if j = k and 0 otherwise.


There are j sets of marginal effects for both the alternative-specific and case specific regressors.

For each alternative-specific variable xij, there are j × j sets of marginal effects. The marginal effects of each variable on the different alternatives sum to zero. Marginal effects are interpreted as follows: each unit increase in the independent variable increases the probability of selecting the kth alternative and decreases the probability of the other alternatives, by the marginal effect expressed as a percent. The STATA command for estimating the conditional logit model is asclogit Y X1...XK, case(id) alternatives(altvar) casevars(varlist), and the command for marginal effects is estat mfx.
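As a minimal sketch (the data set-up and the variable names mode, price, crate and income are assumptions for illustration), with the data in long form the conditional logit model could be estimated as:

* d = 1 for the chosen alternative, 0 otherwise; one row per alternative per case
asclogit d price crate, case(id) alternatives(mode) casevars(income)
* marginal effects for the alternative-specific and case-specific regressors
estat mfx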

Example 2.2.5: Using the dataset linked in Example 2.2.4, we want to study how income and the price and catch rate of each alternative affect the fishing choice of individuals. The dependent variable has four categories/alternatives: beach, pier, private, and charter boat. The independent variables are price and catch rate (alternative specific) and income (alternative invariant). The data are in long form, with four rows (alternatives) for each individual.

There will be three sets of coefficients (one set is normalized to zero – the reference/base category) for the alternative-invariant regressors (income). There will be one set of coefficients for the alternative-specific regressors (price and catch rate). There will be four sets of marginal effects (normalization of coefficients does not matter) for all regressors.

Table 2.2.8a. Conditional logit coefficients (with coefficients for charter boat fishing normalized to zero)

             Beach    Pier     Private boat   Charter boat
Income       0.03    -0.09     0.12*          0
Constant    -1.69*   -0.92*   -1.16*          0
Price       -0.03*
Catch rate   0.36*


Coefficient for alternative-invariant regressors is interpreted as: in comparison to charter boat fishing, higher income is associated with a higher likelihood of private boat fishing. Coefficient for alternative-specific regressors is interpreted as: when the price of an alternative increases, this alternative is less likely to be chosen; when the catch rate of an alternative increases, this alternative is more likely to be chosen. Note here that the coefficients on income in the conditional logit model are similar to the ones in the multinomial logit model.

Table 2.2.8b. Conditional logit coefficients (with coefficients for pier fishing normalized to zero)

             Beach      Pier   Private boat   Charter boat
Income       0.13**     0      0.22***        0.09*
Constant    -0.78***    0     -0.25           0.91***
Price       -0.03*
Catch rate   0.36*

The coefficient for the alternative-invariant regressor is interpreted as follows: in comparison to pier fishing, higher income is associated with higher likelihoods of beach, private boat, and charter boat fishing. The coefficients for the alternative-specific regressors are interpreted as follows: when the price of an alternative increases, that alternative is less likely to be chosen; when the catch rate of an alternative increases, that alternative is more likely to be chosen. Note that the coefficients on income in the conditional logit model are very similar to those in the multinomial logit model; the coefficients on the alternative-specific regressors are the same regardless of which category is selected as the base; and the marginal effects are the same regardless of whether alternative 4 or 2 is the base category.

Table 2.2.9. Conditional logit marginal effects

                      Beach      Pier       Private boat   Charter boat
Income               -0.0007    -0.010*     0.032*        -0.022*
Price Beach          -0.0013*    0.00009*   0.0006*        0.0006*
Price Pier            0.00009*  -0.0016*    0.0007*        0.0008*
Price Private         0.0006*    0.0007*   -0.0061*        0.0049*
Price Charter         0.0006*    0.0008*    0.0049*       -0.0062*
Catch Rate Beach      0.0178*   -0.0012*   -0.0079*       -0.0087*
Catch Rate Pier      -0.0012*    0.022*    -0.01*         -0.011*
Catch Rate Private   -0.0079*   -0.01*      0.087*        -0.069*
Catch Rate Charter   -0.0087*   -0.011*    -0.069*         0.089*

The marginal effect for the alternative-invariant regressor is interpreted as follows: a one-unit increase in income (corresponding to a thousand dollars) is associated with pier fishing being 1% less likely, private boat fishing being 3% more likely, and charter boat fishing being 2% less likely. The marginal effects for the alternative-specific regressors are interpreted as follows: a one-unit increase in the price of beach fishing is associated with beach fishing being 0.13% less likely, pier fishing being 0.009% more likely, private boat fishing being 0.06% more likely, and charter boat fishing being 0.06% more likely. Note also that when the price of an alternative increases, that alternative is less likely to be chosen and the other alternatives are more likely to be chosen, and that the marginal effects sum to zero for both the alternative-invariant and the alternative-specific variables.

Mixed logit model

Conditional logit model cannot account for preference heterogeneity among respondents (unless it is related to observables) and the IIA property can lead to unrealistic predictions. This has led researchers in various disciplines to consider more flexible alternatives. The mixed logit model overcomes these limitations by allowing the coefficients in the model to vary across decision makers. That is, the mixed logit model extends the standard conditional logit model by allowing one or more of the parameters in the model to be randomly distributed.

The probability that individual i selects alternative j in a mixed logit model is:

Pij = exp(xi βj + zij α) / Σ_{k=1}^{J} exp(xi βk + zik α)


The mixed logit model relaxes the IIA assumption by allowing parameters in the conditional logit model to be normally (or log-normally) distributed. When estimating the mixed logit model, the researcher needs to specify which parameters are to be treated as random. If a parameter is random, the effect of that regressor on the chosen alternative varies across individuals. The mixed logit model produces coefficients for both the regressor (xi) and the standard deviation of the regressor (sd(xi)).

With regard to the interpretation of the coefficient on a regressor (xi): when the independent variable increases, consumers are more or less likely to choose that alternative. The coefficient on the standard deviation of a regressor (sd(xi)) indicates that there is heterogeneity across individuals in the effect of the independent variable on the alternative chosen. The STATA command for estimating the mixed logit model is: mixlogit CV X1...XK, group(id) rand(Xm).
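As a minimal sketch (mixlogit is a user-written command that must be installed first; the variable names are hypothetical), a mixed logit with a random price coefficient could be estimated as:

* install the user-written command (run once)
ssc install mixlogit
* d = 1 for the chosen alternative; price has a random (normally distributed) coefficient
mixlogit d crate, group(id) rand(price)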

Example 2.2.6: We want to study how the price of each alternative affects the fishing choice of individuals when there is heterogeneity across individuals in the effect of price. The dependent variable has three categories/alternatives: beach, pier, and private. The independent variables are catch rate and price; they are alternative specific. We also include price and the standard deviation of price in the model, as well as dummy variables for two of the three alternatives: beach and private. Note that the data are in long form, with three rows (alternatives) for each observation.

Table 2.2.10. Mixed logit model results

Mixed logit model coefficients
Catch rate                             0.78
Dummy if beach is chosen              -0.77*
Dummy if private is chosen            -0.21
Dummy if beach is chosen * income      0.12*
Dummy if private is chosen * income    0.17*
Price                                 -0.11*
Sd (price)                             0.06*


Coefficient interpretation: the negative and significant coefficient on the beach dummy indicates that, all else equal, consumers are less likely to choose the beach alternative; the insignificant coefficient on the private boat dummy indicates that consumers are indifferent between the private boat alternative and the omitted alternative. The coefficient on price can be interpreted as follows: when the price of an alternative increases, consumers are less likely to choose it. The coefficient on the standard deviation of price indicates that there is considerable variation/heterogeneity across consumers in the effect of price.

Nested logit model

To overcome the IIA assumption, one may think of a multi-dimensional choice context as one with an inherent structure, or hierarchy. This notion helps the analyst to visualize the nested logit model, although the nested logit model is not inherently a hierarchical model. Consider the case of four travel alternatives: auto, bus with walk access, bus with auto access, and carpool. This might be thought of as a nested choice structure: the first decision is made between public transit and private transport, and then between the alternatives within the chosen nest. Mathematically, this nested structure allows subsets of alternatives to share unobserved components of utility, which is a strict violation of the IIA property in the MNL model. For example, if the transit alternatives are nested together, then these alternatives can share unobserved utility components such as comfort, ride quality, safety, and other attributes of transit that were omitted from the systematic utility functions. This work-around for the IIA assumption in the MNL model is a feasible and relatively easy solution. In a nutshell, the analyst groups alternatives that share unobserved attributes at different levels of a nest, so as to allow the error terms within a nest to be correlated.
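As a minimal sketch under several assumptions (the alternative variable mode and its categories, the cost and time regressors, the income variable, and the case identifier id are all hypothetical, and the syntax is illustrative), the nesting described above could be set up in STATA roughly as follows:

* group the alternatives into a transit nest and a private nest
nlogitgen nest = mode(transit: bus_walk | bus_auto, private: auto | carpool)
* alternative-specific regressors at the bottom level; income enters at the nest level
nlogit choice cost time || nest: income, base(private) || mode:, noconst case(id)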

Activity question: Did you understand how nested logit model overcomes the IIA assumption of MNL model?


2.2.3. Ordered Choice Models

2.2.3.1. Introduction

The choice options in multinomial logit/probit models have no natural ordering or ranking. Those models may be applied to ordinal data as well, but they make no explicit use of the fact that the response categories are ordered. Ordered choice models are designed specifically for the analysis of responses measured on an ordinal scale. Examples include the results of opinion surveys in which responses can be strongly disagree, disagree, neutral, agree or strongly agree, and the assignment of grades or work performance ratings. Students receive grades A, B, C, D, F, which are ordered on the basis of a teacher's evaluation of their performance. Employees are often given evaluations on scales such as outstanding, very good, good, fair and poor, which are similar in spirit. When modeling these types of outcomes, numerical values are assigned to the outcomes, but the numerical values are ordinal and reflect only the ranking of the outcomes. The distance between the values is not meaningful. The usual linear regression model is not appropriate for such data, because in linear regression we would treat the dependent variable as having numerical meaning when it does not.

2.2.3.2. Ordered Logit/Probit models

Some discrete outcomes can be ordered to elicit more robust and representative information about the subject under consideration.

Examples include: rating systems (excellent, very good, good, fair, and poor), for instance self-assessed English proficiency: (i) excellent, (ii) very good, (iii) good, (iv) fair, (v) poor. Another example is the question "Do you agree with the following statement?", where the answer could be strongly agree, agree, disagree, or strongly disagree. For example, suppose you are asked: "Is the Kenyan banking system efficient in promoting SMEs?" Your answer is one of the following: (i) strongly agree, (ii) agree, (iii) disagree, (iv) strongly disagree. Other examples include grades (A, B, C, D, F), employment status (unemployed, part time, full time), economic/social status (low, medium and high) and educational status (elementary school graduate, high school graduate, college graduate).

With regard to coding the responses, the options excellent, very good, good, fair, poor can be coded as 5, 4, 3, 2, 1. They can also be coded as 1, 2, 3, 4, 5, but the former is easier to explain, since the highest to lowest values then represent the highest to lowest ratings. Similarly, the options strongly agree, agree, disagree, strongly disagree may be coded as 4, 3, 2, 1.

Here note that the numbers 1-5/5-1 mean nothing in terms of their value, just an ordering to show you the lowest to highest/highest to lowest. Even though we can order these from lowest to highest, the spacing between the values may not be the same across the categories of the ordered variable. In essence, these categories are not equally spaced. For instance, if we are modelling the predictors of differences in economic/social status at the household level in Kenya, we can assign scores 1, 2, and 3 to the low, medium and high levels of economic and social status. However, the difference between categories one and two (low and medium) is probably much bigger than the difference between categories two and three (medium and high). In Statistics, variables described this way are classified as ordinal variables/ordered outcomes/polychotomous responses (as opposed to dichotomous responses in the case of binary outcomes).

Like the binary choice models, when such a variable appears on the left-hand side (as the dependent variable) of a regression equation, Least Squares regression suffers from short-comings such as heteroskedasticity, predicted probabilities lying outside the unit interval, etc. Thus, the appropriate models in such a situation are the logit and probit models. The logit and probit models involving ordered outcomes are described as ordered logit and ordered probit models respectively.

Just like in the binary choice models, the central idea behind ordinal outcomes is that there is a latent continuous metric (denoted y*) underlying the responses observed by the analyst. As previously explained, y* is an unobserved variable; we only observe which thresholds it crosses. For instance, if we are modelling the predictors of bank performance, once y* crosses a certain cut-off point we report poor, then good, then very good, then excellent performance.

As a theoretical explanation, let us consider a latent variable model given by:

Yi * = 0 + 1 X i1 + 2 X i2 +...+ k X ik + i Y * = X '  + i i i

and Yi = j if cj−1 < Yi* ≤ cj, where i = 1, ..., N and the cj are the cut-off points.

In general, for ordered choice data, the probability that an individual i will select alternative j is given by:

pij = p(yi = j) = p(cj−1 < yi* ≤ cj)
= F(cj − xi′β) − F(cj−1 − xi′β)

For the ordered logit, F is the logistic cdf given by: F(z) = e^z / (1 + e^z)

For the Ordered probit, F is the standard normal cdf given by:

F(z) = ∫ from −∞ to z of (1/√(2π)) e^(−t²/2) dt

2.2.3.3. Estimation of Ordered Logit/Probit models

As an illustration, let us assume yi = (1,2,3,4,5) for (poor, fair, good, very good and excellent alternatives). Then the choice rule is given by:

yi = 1 if yi* ≤ c1
yi = 2 if c1 < yi* ≤ c2
yi = 3 if c2 < yi* ≤ c3
yi = 4 if c3 < yi* ≤ c4
yi = 5 if yi* > c4


Using the generic representation, the respective probabilities for the five categories are derived as:

Pr (yi = 1) = F(c1 − xi’β)

Pr (yi = 2) = F(c2 − xi’β)−F(c1 − xi’β)

Pr (yi = 3) = F(c3 − xi’β)−F(c2 − xi’β)

Pr (yi= 4) = F(c4 − xi’β)−F(c3 − xi’β)

Pr (yi= 5) = 1−F(c4 − xi’β)

Now, how did we arrive at these probabilities? Let us show this for some categories. For category 1,

Pr (yi = 1) = Pr(yi∗≤c1 )

Recall that yi* = xi’β + εi and substituting this into the above expression gives us:

Pr (yi =1) = Pr(xi’β +εi≤c1)

= Pr(εi≤c1−xi’β)

= F(c1 − xi′β)

For category 2, we have:

Pr (yi = 2) = Pr(c1 < yi* ≤ c2)

= Pr(c1 < xi′β + εi ≤ c2)

= Pr(c1 − xi′β < εi ≤ c2 − xi′β)

= F(c2 − xi′β) − F(c1 − xi′β)

Similar proofs can be made for the other categories. Note that the form of F(∙) is determined by the assumed distribution of εi.

Note that when estimating ordered logit/probit models:

1. The ordered logit/probit model with j alternatives will have one set of coefficients with (j − 1) intercepts. You can recognize an ordered choice model by the multiple intercepts. 2. The ordered logit/probit model with j alternatives will have j sets of marginal effects.


Because ordered logit and probit are non-linear models, their coefficients have no direct quantitative interpretation, but we should note the following:

1. The sign of βk shows whether the latent variable y* increases with the regressor (xk).

2. The magnitude of βk will be different by a scale factor between the probit and logit models.

2.2.3.4. Marginal effects for the ordered logit/probit models

We consider two cases:

Case 1: The regressors are continuous variables

∂pij/∂xik = {F′(cj−1 − xi′β) − F′(cj − xi′β)} βk

Consider the case where yi = 1. Then marginal effect is given by:

∂Pr(yi = 1)/∂xi = ∂F(c1 − xi′β)/∂xi

= −F′(c1 − xi′β) β

The same procedure applies to the remaining outcomes.

Case 2: Regressors are discrete variables

The procedure is similar to that for the binary logit/probit models. Let us assume that: xi′β = β0 + β1x1i + β2x2i + ⋯ + βkxki

If x2i is a binary independent variable, the marginal effect is computed as:

∂Pr(yi = 1)/∂x2i = change in the probability when x2i moves from 0 to 1

= F(c1 − xi′β evaluated at x2i = 1) − F(c1 − xi′β evaluated at x2i = 0)

The marginal effect is interpreted as: Each unit increase in the independent variable increases/decreases the probability of selecting alternative j by the marginal effect expressed as a percent.


Example 2.2.7: Consider the dataset linked herewith ..\..\ordered_health.dta to identify the factors influencing the health status of an individual. Health status, being ordinal data, can be modelled using either an ordered logit or an ordered probit model as:

yij = ƒ(Agei, Incomei, Diseasesi), where i = 1, 2, 3, … , 5574; j = 1, 2, 3; i indexes observations and j indexes the alternative outcomes. The dependent variable is coded as: 1 = fair, 2 = good and 3 = excellent. We estimate the model by maximum likelihood in STATA. Here, it is good to note that: (i) there will be one set of coefficients with two intercepts, and (ii) there will be three sets of marginal effects, one for each category.

Step 1: Descriptive statistics

It is good to start by computing relevant descriptive statistics for all the variables in the model.

(a) Compute summary statistics for the continuous variables
. su age logincome numberdiseases

Table 2.2.11. Descriptive statistics of continuous variables

Variable        Obs    Mean    Std. Dev.   Min    Max
age             5574   25.58   16.73       0.03   63.28
logincome       5574   8.70    1.22        0      10.28
numberdise~s    5574   11.21   6.79        0      58.60

(b1) Tabulate the discrete variables rather than summarizing them. Why?
. tab healthstatus

Table 2.2.12. Tabulation of discrete variable

Health status   Freq.   Percent
Fair            523     9.38
Good            2,034   36.49
Excellent       3,017   54.13
Total           5,574   100.00


(b2) Still on discrete variables
. tab healthstatus1

Table 2.2.13. Tabulation of coded discrete variable

Healthstatus1   Freq.   Percent
1               523     9.38
2               2,034   36.49
3               3,017   54.13
Total           5,574   100.00

Results presented in Table 2.2.13 show that about 9.4% (523), 36.5% (2,034) and 54% (3,017) of the respondents report fair, good and excellent health status respectively. In other words, more than half of the respondents report excellent health status.

You may also probe further to determine the distribution of age, income and number of diseases across the three categories.

Case I: Category 1
. su age logincome numberdiseases if healthstatus==1

Table 2.2.14. Descriptive statistics of continuous variables for category 1

Variable        Obs   Mean     Std. Dev.   Min     Max
age             523   34.731   17.919      1.337   62.938
logincome       523   8.020    2.059       0       10.093
numberdise~s    523   15.064   9.676       0       58.6


Case II: Category 2
. su age logincome numberdiseases if healthstatus==2

Table 2.2.15. Descriptive statistics of continuous variables for category 2

Variable        Obs    Mean     Std. Dev.   Min     Max
age             2034   28.945   16.472      1.157   63.023
logincome       2034   8.706    1.126       0       10.283
numberdise~s    2034   12.103   7.109       0       44.8

Case III: Category 3
. su age logincome numberdiseases if healthstatus==3

Table 2.2.16. Descriptive statistics of continuous variables for category 3

Variable        Obs    Mean     Std. Dev.   Min     Max
age             3017   21.717   15.54       0.025   63.27
logincome       3017   8.808    1.036       0       10.28
numberdise~s    3017   9.931    5.490       0       41.4

Results of the breakdown show that people with excellent health status have the lowest mean values of disease incidence and age but the highest mean value of income. The reverse is the case for fair health status, while good health status is in between. Thus, health status seems to improve with lower incidence of diseases, higher income and lower age.

Step 2: Estimation of the model

(a) 1. Estimation of the ordered logit model
. ologit healthstatus age logincome numberdiseases


Table 2.2.17. Ordered logit model results

Iteration 0: log likelihood = -5140.0463
Iteration 1: log likelihood = -4776.008
Iteration 2: log likelihood = -4769.8693
Iteration 3: log likelihood = -4769.8525
Iteration 4: log likelihood = -4769.8525

Ordered logistic regression          Number of obs = 5574
                                     LR chi2(3)    = 740.39
                                     Prob > chi2   = 0.0000
Log likelihood = -4769.8525          Pseudo R2     = 0.0720

healthstatus      Coef.    Std. Err.      z      P>|z|   [95% Conf. Interval]
age              -0.029    0.002      -17.43     0.000   -0.033   -0.026
logincome         0.284    0.023       12.27     0.000    0.238    0.329
numberdiseases   -0.055    0.004      -13.51     0.000   -0.063   -0.047
cut1             -1.396    0.206                          -1.800   -0.992
cut2              0.951    0.205                           0.549    1.354

Here, it is good to note that the results are log odds. Results presented in the table above show that each additional year of age decreases the log odds of reporting better health status (from fair to good to excellent) by 0.029 points, holding other variables constant. Also, the log odds in favour of improved health status increase by 0.284 points with a unit increase in income, after controlling for age and the number of diseases.

Similarly, a unit increase in the incidence of diseases reduces the log odds by 0.055 points after controlling for age and income. In sum, health status is better (from fair to good to excellent) with lower age, higher income and a lower number of diseases. You will find the cut-off points below the regression coefficients. They are statistically different from each other, so the three categories should not be combined into one. We can also compute the odds ratios as shown below.

2. Computation of the odds ratios
. ologit healthstatus age logincome numberdiseases, or

Table 2.2.18. Ordered logit model results (odds ratio)

Iteration 0: log likelihood = -5140.046
Iteration 1: log likelihood = -4776.008
Iteration 2: log likelihood = -4769.869
Iteration 3: log likelihood = -4769.853
Iteration 4: log likelihood = -4769.853

Ordered logistic regression         Number of obs = 5574
                                    LR chi2(3)    = 740.39
                                    Prob > chi2   = 0.0000
Log likelihood = -4769.853          Pseudo R2     = 0.0720

healthstatus      Odds Ratio   Std. Err.      z      P>|z|   [95% Conf. Interval]
age               0.97         0.002      -17.43     0.000   0.968   0.974
logincome         1.33         0.030       12.27     0.000   1.269   1.390
numberdiseases    0.947        0.004      -13.51     0.000   0.939   0.954
cut1             -1.396        0.206                         -1.800  -0.992
cut2              0.951        0.205                          0.549   1.354

Note that the ordered logit model estimates a single equation (one set of regression coefficients) over the levels of the dependent variable. Now, if we view the change in levels in a cumulative sense and interpret the coefficients in odds, we are comparing people who are in groups greater than k with those who are in groups less than or equal to k, where k is the level of the response variable. The interpretation is that for a one-unit increase in the predictor variable, the odds for cases in a group greater than k versus a group less than or equal to k are multiplied by the proportional odds ratio.

As to the interpretation of the results, first note that the results are proportional odds ratios. For a one-year increase in age, the odds of excellent health status versus the combined good and fair health status categories are 0.97 times as large (about 3% lower), holding the other variables in the model constant.

We are done with the coefficients and odds ratios. Let us now determine the slope (marginal effect) for each of the regressors across the three categories. In fact, it is more convenient to interpret the marginal effects than the coefficients (log odds) and odds ratios.

(b) Marginal effects for the ordered logit (fair health status)
. margins, dydx(*) atmeans predict(outcome(1))

Table 2.2.19. Marginal effects for ordered logit model results (fair health status)

Conditional marginal effects        Number of obs = 5574
Model VCE: OIM
Expression: Pr(healthstatus==1), predict(outcome(1))
dy/dx w.r.t.: age logincome numberdiseases
at: age = 25.57613 (mean)
    logincome = 8.696929 (mean)
    numberdise~s = 11.20526 (mean)

Variable          dy/dx    Delta-method Std. Err.      z      P>|z|   [95% Conf. Interval]
age                0.002   0.0001                   15.44     0.000    0.002    0.002
logincome         -0.020   0.002                   -11.49     0.000   -0.023   -0.017
numberdiseases     0.004   0.0003                   12.64     0.000    0.003    0.005


Results presented in the above table show that each additional year of age increases the probability of reporting fair health status by 0.2%. Each additional unit increase in the number of diseases increases the probability of reporting fair by 0.4%. However, for income, the probability of reporting fair declines by 2% for every additional unit increase in income.

(c) Marginal effects for the ordered logit (good health status)
. margins, dydx(*) atmeans predict(outcome(2))

Table 2.2.20. Marginal effects for ordered logit model results (good health status)

Conditional marginal effects        Number of obs = 5574
Model VCE: OIM
Expression: Pr(healthstatus==2), predict(outcome(2))
dy/dx w.r.t.: age logincome numberdiseases
at: age = 25.57613 (mean)
    logincome = 8.696929 (mean)
    numberdise~s = 11.20526 (mean)

Variable          dy/dx    Delta-method Std. Err.      z      P>|z|   [95% Conf. Interval]
age                0.005   0.0003                   16.04     0.000    0.005    0.006
logincome         -0.051   0.004                   -11.75     0.000   -0.059   -0.042
numberdiseases     0.010   0.001                    12.77     0.000    0.008    0.011

Note that the results for category 2 are quite similar to those for category 1, particularly in terms of signs and significance. Thus, the coefficients are interpreted the same way.

(d) Marginal effects for the ordered logit (excellent health status)
. margins, dydx(*) atmeans predict(outcome(3))

Table 2.2.21. Marginal effects for ordered logit model results (excellent health status)

Conditional marginal effects        Number of obs = 5574
Model VCE: OIM
Expression: Pr(healthstatus==3), predict(outcome(3))
dy/dx w.r.t.: age logincome numberdiseases
at: age = 25.576 (mean)
    logincome = 8.697 (mean)
    numberdise~s = 11.205 (mean)

Variable          dy/dx    Delta-method Std. Err.      z      P>|z|   [95% Conf. Interval]
age               -0.007   0.0004                  -17.43     0.000   -0.008   -0.007
logincome          0.071   0.006                    12.26     0.000    0.059    0.082
numberdiseases    -0.014   0.001                   -13.50     0.000   -0.016   -0.012

The results for category 3 (excellent health status) differ from those for categories 1 and 2, particularly in terms of the signs. Each additional year of age reduces the probability of reporting excellent health by 0.7%. A unit decrease in the number of diseases increases the probability of reporting excellent by 1.4%, while for every additional unit increase in income the probability of reporting excellent increases by 7.1%.

(e) Computation of predicted probabilities
(1) Predicted probabilities for the three categories of the ordered logit model
. predict p1ologit p2ologit p3ologit, pr
. su p1ologit p2ologit p3ologit


Table 2.2.22. Predicted probabilities for the ordered logit model

Variable   Obs    Mean    Std. Dev.   Min     Max
p1ologit   5574   0.095   0.084       0.023   0.859
p2ologit   5574   0.365   0.095       0.126   0.528
p3ologit   5574   0.540   0.164       0.016   0.800

Results presented in Table 2.2.22 show that, given the mean values of the regressors, the average probabilities of reporting fair, good and excellent health status are 9%, 37% and 54% respectively. Let us compare the predicted values with the actual values.

(2) Actual percentage distribution
. tab healthstatus

Table 2.2.23. Results on actual percentage distribution

Health status   Freq.   Percent
Fair            523     9.38
Good            2,034   36.49
Excellent       3,017   54.13
Total           5,574   100.00

Results show that the model reasonably fits the data. The predicted probabilities are similar to the actual probabilities.

(f) Estimation of the ordered probit model
. oprobit healthstatus age logincome numberdiseases


Table 2.2.24. Ordered probit model results

Iteration 0: log likelihood = -5140.046
Iteration 1: log likelihood = -4771.756
Iteration 2: log likelihood = -4771.030
Iteration 3: log likelihood = -4771.030

Ordered probit regression           Number of obs = 5574
                                    LR chi2(3)    = 738.03
                                    Prob > chi2   = 0.000
Log likelihood = -4771.030          Pseudo R2     = 0.072

Variable          Coef.    Std. Err.      z      P>|z|   [95% Conf. Interval]
age              -0.017    0.001      -17.3      0.000   -0.019   -0.015
logincome         0.165    0.013       12.86     0.000    0.140    0.191
numberdiseases   -0.032    0.002      -13.2      0.000   -0.036   -0.027
cut1             -0.795    0.115                          -1.020   -0.569
cut2              0.546    0.115                           0.321    0.771

Like in binary probit models, the results here are z-scores. Thus, the interpretation here is not different from the binary probit models except that the ordering is reflected in this case. For instance, using income, the result shows that a unit increase in income will lead to an increase in the z-score in favour of better health status by 0.165 points. Note that the z-scores are similar to the log odds in terms of sign and significance (income is positive while age and number of diseases are negative). Therefore, the option of reporting better health status increases with a higher income, lower age and declining incidence of diseases.


(g) Marginal effects for the ordered probit (outcome (1) - fair health status)
. margins, dydx(*) atmeans predict(outcome(1))

Table 2.2.25. Marginal effects for ordered probit model results (fair health status)

Conditional marginal effects        Number of obs = 5574
Model VCE: OIM
Expression: Pr(healthstatus==1), predict(outcome(1))
dy/dx w.r.t.: age logincome numberdiseases
at: age = 25.576 (mean)
    logincome = 8.697 (mean)
    numberdise~s = 11.205 (mean)

Variable          dy/dx    Delta-method Std. Err.      z      P>|z|   [95% Conf. Interval]
age                0.002   0.0002                   15.70     0.000    0.002    0.003
logincome         -0.023   0.002                   -12.11     0.000   -0.027   -0.020
numberdiseases     0.005   0.0004                   12.42     0.000    0.004    0.005

(h) Marginal effects for the ordered probit (good health status)
. margins, dydx(*) atmeans predict(outcome(2))

Table 2.2.26. Marginal effects for ordered probit model results (good health status)

Conditional marginal effects        Number of obs = 5574
Model VCE: OIM
Expression: Pr(healthstatus==2), predict(outcome(2))
dy/dx w.r.t.: age logincome numberdiseases
at: age = 25.576 (mean)
    logincome = 8.697 (mean)
    numberdise~s = 11.205 (mean)

Variable          dy/dx    Delta-method Std. Err.      z      P>|z|   [95% Conf. Interval]
age                0.004   0.0003                   15.75     0.000    0.004    0.005
logincome         -0.042   0.003                   -12.18     0.000   -0.049   -0.036
numberdiseases     0.008   0.001                    12.48     0.000    0.007    0.009

(i) Marginal effects for the ordered probit (excellent health status)
. margins, dydx(*) atmeans predict(outcome(3))

Table 2.2.27. Marginal effects for ordered probit model results (excellent health status)

Conditional marginal effects        Number of obs = 5574
Model VCE: OIM
Expression: Pr(healthstatus==3), predict(outcome(3))
dy/dx w.r.t.: age logincome numberdiseases
at: age = 25.57613 (mean)
    logincome = 8.696929 (mean)
    numberdise~s = 11.20526 (mean)

Variable          dy/dx    Delta-method Std. Err.      z      P>|z|   [95% Conf. Interval]
age               -0.007   0.0004                  -17.34     0.000   -0.008   -0.006
logincome          0.066   0.005                    12.86     0.000    0.056    0.076
numberdiseases    -0.013   0.001                   -13.21     0.000   -0.014   -0.011


Note that the marginal effects for ordered probit models are quite similar to ordered logit models. Also, their slopes are interpreted the same way. Thus, the interpretation of the latter suffices for the former here.

(j) Computation of predicted probabilities
(1) Predicted probabilities for the three categories of the ordered probit model
. predict p1oprobit p2oprobit p3oprobit, pr
. summarize p1oprobit p2oprobit p3oprobit

Table 2.2.28. Predicted probabilities for ordered probit model

Variable    Obs    Mean    Std. Dev.   Min     Max
p1oprobit   5574   0.094   0.090       0.016   0.856
p2oprobit   5574   0.364   0.080       0.136   0.497
p3oprobit   5574   0.542   0.157       0.008   0.793

(2) Actual percentage distribution
. tab healthstatus

Table 2.2.29. Actual percentage distribution for ordered probit model

Health status   Freq.   Percent
fair            523     9.38
good            2,034   36.49
excellent       3,017   54.13
Total           5,574   100.00

Results show that the model reasonably fits the data. The predicted probabilities are similar to the actual probabilities.


Step 3: Some scenario analyses

Note that the mean values were used in the computation of the marginal effects for both ordered logit and probit models. However, one may be interested in computing marginal effects for specific values of one or more of the regressors. You will recall that we did something similar to that in binary logit and probit models.
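As a minimal sketch (the chosen values of age are arbitrary, for illustration only), marginal effects at specific ages, with the other regressors held at their means, could be obtained after ologit or oprobit with:

margins, dydx(*) at(age=(25 30 35)) predict(outcome(3)) atmeans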

2.2.3.5. The proportional odds assumption in ordered logit/probit models

The proportional odds assumption states that the relationship between each pair of outcome groups is the same. In other words, ordered logistic regression assumes that the coefficients that describe the relationship between, say, the lowest versus all higher categories of the response variable are the same as those that describe the relationship between the next lowest category and all higher categories, etc. This is called the proportional odds assumption or the parallel regression assumption.

In practice, violating this assumption may or may not alter your substantive conclusions, so you need to test whether it holds. There are several tests for verifying this assumption, among them the Wolfe Gould, Brant, Score, Likelihood ratio and Wald tests. The underlying null hypothesis is that the relationship is proportional, that is, parallel. To test it in Stata, use the 'oparallel' command after estimating an ordered logit model. The test results for the parallel regression assumption for the ordered logit model are given in Table 2.2.30 below.

Table 2.2.30. Test results for the parallel regression assumption for ordered logit model

Test               Chi2    df   P>Chi2
Wolfe Gould        21.33   3    0.000
Brant              17.45   3    0.001
Score              26.66   3    0.000
Likelihood ratio   22.05   3    0.000
Wald               28.88   3    0.000

All five test statistics reject the null hypothesis of proportionality, so the parallel regression (proportional odds) assumption is violated here.


Dealing with violations of proportionality assumption

Option 1: Do nothing. Use ordered logistic regression because the practical implications of violating this assumption are minimal.

Option 2: Use a multinomial logit model. This frees you of the proportionality assumption, but it is less parsimonious and often dubious on substantive grounds.

Option 3: Dichotomize the outcome and use binary logistic regression. This is common, but you lose information and it could alter your substantive conclusions. For instance, in the case of our example, one can merge categories 1 and 2 since their marginal effects are similar in both sign and significance. The Stata command to dichotomize is:
recode healthstatus (1 2 = 0) (3 = 1), gen(health)
and the command for estimation is:
logit health age logincome numberdiseases

Option 4: Use a model that does not assume proportionality. Increasingly, this is common. Two user-written Stata commands fit these kinds of models: “gologit2” – generalized ordered logit models (see Williams 2007, Stata Journal) and “oglm” – heterogeneous choice models (see Williams 2010, Stata Journal).

Note that these are user-written programs; you have to install them before estimation, as shown below. The Stata estimation commands after installation are:
oglm healthstatus age logincome numberdiseases
gologit2 healthstatus age logincome numberdiseases
Note that the results obtained here are log odds. For the odds ratios, use:
oglm healthstatus age logincome numberdiseases, or
gologit2 healthstatus1 age logincome numberdiseases, or
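As a minimal sketch of the installation step (both packages are distributed through SSC, so the commands below assume an internet connection):

* install the user-written packages (run once)
ssc install gologit2
ssc install oglm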

For the marginal effects, use the Stata commands below immediately after estimating with oglm/gologit2:
margins, dydx(*) predict(outcome(1)) atmeans
margins, dydx(*) predict(outcome(2)) atmeans
margins, dydx(*) predict(outcome(3)) atmeans


Like the ologit command, you can also perform relevant scenario analyses with the oglm and gologit2 commands. For instance, you may be interested in evaluating how certain fixed values of age affect the probability of reporting excellent health status. The Stata commands are:
margins, dydx(*) at(age=(25 30 35)) predict(outcome(3)) atmeans
margins, dydx(*) at(age=25/30) predict(outcome(3)) atmeans

2.2.4. Multivariate Probit Model

2.2.4.1. Introduction

In econometrics, the multivariate probit model is a generalization of the probit model used to estimate several correlated binary outcomes jointly. For example, if it is believed that an individual's decision to work or not and the decision whether to have children or not are correlated (both decisions are binary), then the bivariate probit model would be appropriate for jointly predicting these two choices on an individual-specific basis. If there are multiple binary choices, such as climate change adaptation strategies (adjusting planting dates, improved crop varieties, crop diversification, small-scale irrigation, and soil and water conservation practices), which are likely to be correlated and where one can adopt more than one adaptation strategy, the multivariate probit model would be appropriate for jointly predicting these multiple choices. We start with the two-binary-outcome case, the bivariate probit model, followed by the multivariate probit model, in which we can jointly predict multiple outcomes that are likely to be correlated.

2.2.4.2. Bivariate probit model

It is a two-equation probit model. That is, the bivariate probit model is a joint model for two binary outcomes. These outcomes may be correlated, with correlation coefficient ρ. If the correlation turns out to be insignificant, we can estimate two separate probit models; otherwise we have to use a bivariate probit model. In the ordinary probit model there is only one binary dependent variable Y, and so only one latent variable Y* is used. In contrast, in the bivariate probit model there are two binary dependent variables Y1 and Y2, so there are two latent variables, Y1* and Y2*. It is assumed that each observed variable takes on the value 1 if and only if its underlying continuous latent variable takes on a positive value:

Y1 = 1 if Y1* > 0 and 0 otherwise

Y2 = 1 if Y2* > 0 and 0 otherwise with

Y1* =X1β1 + Ꜫ1

Y2* = X2β2 + Ꜫ2

with (Ꜫ1, Ꜫ2)′ | X ~ N( (0, 0)′, [1  ρ; ρ  1] )

X1 and X2 may be the same or different. That is, there is no need for each equation to have its ‘own variable’. ρ is the conditional tetrachoric correlation between Y1 and Y2.

Fitting the bivariate probit model involves estimating the values of β1, β2, and ρ. To do so, the likelihood of the model has to be maximized. This likelihood function is given by:

Substituting the latent variables Y1* and Y2* in the probability functions and taking logs gives:

After some rewriting, the log-likelihood function becomes:


Note that Φ2 is the cumulative distribution function of the bivariate normal distribution. Y1 and Y2 in the log-likelihood function are the observed variables, equal to one or zero.
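For reference, a standard textbook way of writing this log-likelihood compactly (the notation qji is introduced here for convenience) is:

ln L = Σ_{i=1}^{N} ln Φ2( q1i X1i β1, q2i X2i β2, q1i q2i ρ ), where qji = 2Yji − 1 for j = 1, 2,

and Φ2(·, ·, ρ) denotes the bivariate standard normal cdf with correlation ρ.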

Marginal effects

Marginal effects and predicted values can be estimated similarly to those for the binary probit models. Marginal effects for the joint probability, say P (Y1 = 1 and Y2 = 1) are also available.

Example 2.2.8: We study the factors influencing the joint outcome of being in an excellent health status (Y1) and visiting the doctor (Y2). Data are from Rand Health Insurance (bivariate_health.dta).

The mean (proportion) for excellent health status (Y1) is 0.54 and the mean (proportion) for visiting the doctor (Y2) is 0.67. The correlation coefficient is -0.01, so the two outcomes are practically uncorrelated (higher correlation is needed to apply bivariate probit).
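A minimal sketch of the corresponding STATA commands (the outcome and regressor names are assumptions about the dataset) is:

* joint probit model for excellent health (y1) and visiting the doctor (y2)
biprobit y1 y2 age logincome numberchronic
* marginal effects on the joint probability P(Y1 = 1, Y2 = 1)
margins, dydx(*) predict(p11)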

Table 2.2.31. Bivariate outcome model coefficients

Variable                     Probit coef. for Y1   Probit coef. for Y2     Bivariate probit        Bivariate probit
                             (excellent health)    (visiting the doctor)   coef. for Y1            coef. for Y2
Age                          -0.01*                 0.002                  -0.01*                   0.002
Log income                    0.13*                 0.12*                   0.13*                   0.12*
Number of chronic diseases   -0.03*                 0.03*                  -0.03*                   0.03*
Constant                     -0.23                 -1.03*                  -0.23                   -1.03*
Rho                                                                         0.02

Interpretation of the model results

Coefficient interpretation: younger individuals, individuals with higher incomes, and those with a lower number of chronic diseases are more likely to be in excellent health. Individuals with higher incomes and those with a higher number of chronic diseases are more likely to visit the doctor. The correlation coefficient between the bivariate outcomes is 0.02 and is not significant. Therefore, we can proceed by estimating separate probit models instead of a bivariate probit model; the decisions are not interrelated and can be estimated independently. The results from the separate probit models are almost identical to those from the bivariate probit model, so in this case there is no need to estimate the bivariate probit model. The marginal effects for the joint probabilities are provided in Table 2.2.32 below.

Table 2.2.32. Marginal effects for the joint probabilities of the bivariate probit model

Variable                     P(Y1=0, Y2=0)   P(Y1=0, Y2=1)   P(Y1=1, Y2=0)   P(Y1=1, Y2=1)
Age                           0.001*          0.005*         -0.003*         -0.004*
Log income                   -0.037*         -0.015*         -0.006           0.059*
Number of chronic diseases   -0.012*          0.015*         -0.011*         -0.003*

The marginal effects are interpreted similarly to those of the binary probit and logit models, but the effect is on the joint probability of the two outcomes. The marginal effects sum to zero across the four joint probabilities. When age increases by one year, the probability P(Y1 = 0, Y2 = 1) of a person not being in excellent health and visiting the doctor increases by 0.5%. Notice that the marginal effects of age on the joint probabilities involving excellent health (Y1 = 1) are negative.


2.2.4.3. The MVP model

The MVP model is typically specified as:

Yij* = Xiβj + Ꜫij

Yij = 1(Yij* > 0)

Ꜫi = (Ꜫi1,...,ꜪiM) ∼ MVN(0, R) or Yi* = (Yi1,...,YiM) ∼ MVN(XiB, R), where i = 1,...,N indexes observations, j = 1,...,M indexes outcomes, Xi is a K-vector of exogenous covariates, the Ꜫi are assumed to be independent and identically distributed across i but correlated across j for any i, and MVN denotes the multivariate normal distribution. (Henceforth, the i subscripts will be suppressed.) The standard normalization sets the diagonal elements of R equal to 1, so that R is a correlation matrix with off-diagonal elements ρpq, {p, q} ∈ {1,...,M}, p ≠ q.1 With standard full-rank conditions on the X’s and each |ρpq| < 1, B = (β1,..., βM) and R will be identified and estimable with sufficient sample variation in the X’s.

2.2.4.3.1. Estimation and inference

Estimation of the M-outcome MVP model using mvprobit requires simulation of the MVN probabilities (Cappellari and Jenkins, 2003), with mvprobit computation time increasing in M, K, N, and D (the number of simulation draws).2 However, all the parameters (B, R) can be estimated consistently using bivariate probit—implemented as Stata’s biprobit command—while consistent inferences about all of these parameters are afforded via Stata’s suest command. Because the proposed approach proves significantly faster in terms of computation time with no obvious disadvantages, this strategy may merit consideration in applied work.

1 This normalization rules out cases like heteroskedastic errors (Wooldridge, 2010). While this normalization is common (for instance, normalizing each univariate marginal to be a standard probit), it is not the only possible normalization of the covariance matrix.
2 Specifically, in the empirical exercises reported below as well as in some other simulations not reported here, mvprobit computation time increases trivially in K, essentially proportionately in D, slightly more than proportionately in N, and at a rate between 2^M and 3^M in M. Greene and Hensher (2010) suggest that MVP computation time would increase with 2^M, but the results obtained in the simulations here suggest a somewhat greater rate of increase.


The key result for the proposed estimation strategy is that the multivariate normal distribution is fully characterized by the mean vector XB and the correlation matrix R. For present purposes, the key feature of the multivariate (conditional) normal distribution F(Y1*,...,YM*|X) is that all of its bivariate marginals—F(Yj*, Ym*|X)—are bivariate normal, with mean vectors and correlation matrices corresponding to the respective sub-matrices of XB and R (Rao, 1973).

Under the normalization that the diagonal elements of R are all one, the B parameters are identified using all M (conditional) univariate marginals F(Yj*|x); there is no need to appeal to the multivariate features of F(Y1*,...,YM*|x) to identify B. The 0.5M(M − 1) bivariate marginals provide the additional information about the ρpq parameters. As such, identifying the parameters of all the bivariate marginals implies identification3 of the parameters of the full multivariate joint distribution, so that consistent estimation of all the bivariate marginal probit models Pr(yp = tp, yq = tq|x) provides consistent estimates of all the parameters (B, R) of the full MVP model for Pr(Y1 = t1,...,YM = tM|X), for tj ∈ {0, 1}, j = 1,...,M.

Estimation via bivariate probit

The proposed approach, which can be implemented using the Mata function bvpmvp(), is as follows. First, corresponding to each possible outcome pair, 0.5M(M − 1) bivariate probit models4 are fit using biprobit, yielding one estimate of each ρpq and M − 1 estimates of each βj, where j = 1,...,M.

Each of the M − 1 estimates of βj is consistent because each biprobit specification uses the same normalization on the relevant submatrices of R. Each of these estimates (βp, βq, ρpq)b, where b = 1,..., 0.5M(M − 1), is stored and then combined using Stata’s suest command, which provides a consistent estimate of the joint variance–covariance matrix of all M(M − 1)(0.5 + K) parameters estimated with the 0.5M(M − 1) biprobit estimates. We denote this vector of parameter estimates and its estimated variance–covariance matrix as α̂ and Ω̂, respectively.5
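As a minimal sketch for M = 3 outcomes (the outcome and covariate names are hypothetical), the pairwise bivariate probit models could be fit and combined as follows:

* fit every outcome pair with the same covariates
biprobit y1 y2 x1 x2
estimates store b12
biprobit y1 y3 x1 x2
estimates store b13
biprobit y2 y3 x1 x2
estimates store b23
* combine the stored results for joint inference on all parameters
suest b12 b13 b23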

3 As discussed below, identification of all the bivariate marginals implies overidentification of B.
4 biprobit directly estimates the inverse hyperbolic tangent of ρpq, or 0.5 ln{(1 + ρpq)/(1 − ρpq)}.
5 α̂ and Ω̂ are the suest-stored matrix results e(b) (a row vector) and e(V), respectively.

Second, we compute the simple averages β̂jA = {1/(M − 1)} Σm β̂jm. This gives a K × M matrix of estimated averaged coefficients, denoted B̂A = (β̂1A,..., β̂MA). Because a weighted average of consistent estimators is generally a consistent estimator, the resulting B̂A will be consistent for B. This averaging is used because the B parameters in the proposed approach are overidentified; that is, there are M − 1 consistent estimates of each βj, j = 1,..., M. Some other rule could be used to compute one consistent estimate of each βj from among the M − 1 candidates, but unless alternative strategies could boast significant precision gains, computational simplicity recommends the simple average as an obvious solution. See the appendix for further discussion.

Finally, we let Q̂ denote the 0.5M(M − 1) vector of the tanh^(−1)(ρ̂jk) estimated in each biprobit specification, and we define the M{0.5(M − 1) + K} × 1 vector Θ̂ = [vec(B̂A)′, Q̂′]′. We define H as the M{0.5(M − 1) + K} × M(M − 1)(0.5 + K) averaging and selection matrix that maps α̂ to Θ̂; that is, Θ̂ = Hα̂; the elements of H are 1/(M − 1), 1, or 0. The estimated variance–covariance matrix of Θ̂, useful for inference, is given by var̂(Θ̂) = HΩ̂H′.

bvpmvp(): A Mata function to implement the proposed estimation approach

The function bvpmvp() returns the M{K + 0.5(M − 1)} × [M{K + 0.5(M − 1)} + 1] matrix whose first column is Θ̂ and whose remaining elements are the elements of the M{K + 0.5(M − 1)}-dimensional symmetric square matrix var̂(Θ̂). bvpmvp() takes six arguments: 1) a string containing the names of the M outcomes; 2) a string containing the names of the K − 1 non-constant covariates; 3) a (possibly null) string containing any “if” conditions for estimation; 4) a scalar indicating whether to display the interim estimation results; 5) a scalar indicating the rounding level of presented results; and 6) a scalar indicating whether to display the final results. For example:

bv1 = bvpmvp("Y1 Y2 Y3 Y4","X1 X2 X3 X4","if _n<=10000",0,.001,1)
bv2 = bvpmvp(Yn,Xn,ic,0,.001,1)

bvpmvp()’s summary report displays the 퐵̂퐴 estimates, their estimated standard errors, and the estimated correlation matrix 푅̂; an exercise is provided at the end of this Topic. Of course, suppressing these results may be useful, for instance, in simulation or bootstrapping exercises.


Topic Summary

Discrete choice models statistically relate the choice made by each person to the attributes of the person and the attributes of the alternatives available to the person. Discrete choice models specify the probability that an individual chooses an option among a set of alternatives. They have played an important role in many fields of study for modeling the alternatives faced by an individual. They are notably used to provide a detailed representation of the complex aspects of transportation demand, based on strong theoretical justifications. Moreover, several packages and tools are available to help practitioners use these models in real applications, making discrete choice models more and more popular. Discrete choice models are powerful but complex. The art of finding the appropriate model for a particular application requires the analyst to have both a close familiarity with the reality of interest and a strong understanding of the methodological and theoretical background of the model. The main theoretical aspects of discrete choice models are presented in this Topic. The main assumptions used to derive discrete choice models in general, and latent or index-based models in particular, are covered in detail. The specification, estimation and interpretation of the binary logit/probit models, multinomial logit/probit models, the conditional, mixed and nested logit models, ordered logit/probit models and the multivariate probit model are also briefly discussed.

Exercises for the Topic

You are expected to complete these exercises within three weeks and submit them to your facilitator by uploading them to the learning management system.

1. Compare and contrast the linear probability, logit and probit models for binary choice variables.

2. (a) Describe the intuition behind the maximum likelihood estimation technique used for limited dependent variable models.


(b) Why do we need to exercise caution when interpreting the coefficients of a probit or logit model?
(c) How can we measure whether a logit model that we have estimated fits the data well or not?
(d) What is the difference, in terms of the model setup, between binary choice and multiple choice problems?

3. The dataset ..\..\binary_data.dta hyperlinked herewith has a binary response (outcome, dependent) variable called admit. There are three predictor variables: gre (a standardized test that is an admissions requirement for many graduate schools), gpa (grade point average, calculated by dividing the total number of grade points earned by the total number of credit hours attempted) and rank. We will treat the variables gre and gpa as continuous. The variable rank takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest.

(a) Do some basic descriptive statistics for these data:
i. How many students have been admitted?
ii. Is there a relationship between admission and the rank of the school? Try to explain this point using the command table and some graphs. Use an appropriate association index and test to formally verify the possible relationship between the two variables.
iii. What is the average gre score and gpa score for the admitted students? And for the not admitted ones?
iv. Are gre and gpa correlated?
(b) Fit a logistic regression model with explanatory variables gre and gpa.
i. Interpret the coefficients.
ii. Compare the probability of being admitted for two students with GRE scores of 320 and 420, respectively.
iii. Compare the probability of being admitted for two students with GPAs of 2.5 and 3.5, respectively.
iv. Compute the marginal effects of GRE and GPA and interpret them.
(c) Fit a logistic regression model with explanatory variables GRE and rank and interpret the coefficients.


(d) What is the probability of being admitted for a student with an average GRE coming from a school with rank equal to 1? And for a student coming from a school with rank 3?
(e) Draw the predicted probabilities for each school rank.
(f) Fit a logistic regression model with all explanatory variables and interpret the coefficients.
(g) Evaluate the fit of the model by computing the pseudo-R2.
(h) Evaluate the predictive capability of the model using a cross-tabulation of the predicted and observed values.
(i) Fit the full model using a probit model: are there any differences with the logit model?

4. The dataset ..\..\unordered_data.dta hyperlinked herewith consists of the brand choices data and the travel mode data. Altogether, there are 12,800 observations in the brand choices data. The travel mode data appear in the first 840 rows of the data area. These computations will be based only on the shoe brand data.

The model analyzes 4 choices, brand1, brand2, brand3, and ‘none of the brands.’ ASC4 is the constant term that appears only in the ‘NONE’ choice.

There are two characteristic variables, gender and age. Gender is coded as MALE = 1 for men and 0 for women. Age is categorized as AGE25 = 1 if age < 25 and 0 otherwise, AGE39 = 1 if 25 < age < 39 and 0 otherwise, and AGE40 = 1 if age > 40 and 0 otherwise. For modeling purposes, drop AGE40.

(a) Fit the basic model. Is pricesq significant? Use the Wald test based on estimates of the basic model. Then fit the model without pricesq and use a likelihood ratio test.
(b) Do age and sex matter? Add age24, age39 and male to the basic model and use a likelihood ratio test.
(c) Estimate the marginal effect (of price) in the MNL model. What are the estimates of the own and cross elasticities across the three brands? What is the evidence on the IIA assumption in these results?
(d) One aspect of choice theory that has attracted some attention is the possibility that in a choice study, some individuals in the sample may be ignoring some of the attributes. This is called


‘attribute nonattendance’ in the recent literature. Individuals sometimes appear to reveal this kind of behavior, but there is no definitive observed indicator. The latent class model provides a way to analyze this possibility. Which model is appropriate to allow the two attributes, fashion and quality, to be separately or jointly non-attended? Run this model and interpret the results.
(e) Do the multinomial logit and multinomial probit models give similar results? You cannot tell directly from the coefficient estimates because of scaling and normalization, so you have to rely on other indicators such as marginal effects. Fit a multinomial probit and a multinomial logit model, and compare the results.

5. A study looks at factors that influence the decision of whether to apply to graduate school. College juniors are asked if they are unlikely, somewhat likely, or very likely to apply to graduate school. Hence, our outcome variable has three categories. Data on parental educational status, whether the undergraduate institution is public or private, and current GPA are also collected. The researchers have reason to believe that the "distances" between these three points are not equal. For example, the "distance" between "unlikely" and "somewhat likely" may be shorter than the distance between "somewhat likely" and "very likely".

This hypothetical data set has a three-level variable called apply (coded 0, 1, 2), which we will use as our outcome variable. We also have three variables that we will use as predictors: pared, which is a 0/1 variable indicating whether at least one parent has a graduate degree; public, which is a 0/1 variable where 1 indicates that the undergraduate institution is public and 0 private; and gpa, which is the student's grade point average.

Use the dataset ..\..\ordered_data.dta hyperlinked herewith to answer the following questions:
(a) Do descriptive statistics of these variables and make interpretations.
(b) If the dependent variable of the model is "apply", which econometric model is appropriate to identify the factors affecting it? Why?
(c) What is the underlying assumption of this model? How do you test it? Do the test in STATA and report the result.


(d) Run this model in STATA and interpret all the results including the different specifications underlying the model.

References

Books

Cameron, A.C. and P.K. Trivedi (2005). Microeconometrics: Methods and Applications, Ch. 14, 15.
Greene, W.H. (2018). Econometric Analysis, 8th Edition, Ch. 17.
Wooldridge, J. (2002). Econometric Analysis of Cross Section and Panel Data, Ch. 15.1-7.
Angrist, J. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion, Ch. 3.4.2.

Articles

Martins, M.F.O. (2001). Parametric and Semiparametric Estimation of Sample Selection Models: An Empirical Application to the Female Labour Force in Portugal. Journal of Applied Econometrics, 16: 23-39.
Pagan, A. (2002). Learning about Models and their Fit to Data. International Economic Journal, 16(2): 1-18.
Pagan, A. and Vella, F. (1989). Diagnostic Tests for Models Based on Individual Data: A Survey. Journal of Applied Econometrics, 4: S29-S59.


TOPIC 2.3. LIMITED DEPENDENT VARIABLE MODELS (8 HOURS)

Objectives of the Topic

At the end of this Topic, students will be able to:

• Estimate and interpret Tobit and sample selection models as well as marginal effects;
• Know the areas of application of these models;
• Understand the methods for checking the adequacy of these models with real data; and
• Apply these models to analyze economic data using statistical software such as Stata.

Introduction

Limited dependent variable models are econometric models in which the economic agent’s response is limited in some way. The dependent variable, rather than being continuous on the real line (or half–line), is restricted. In Topic 2.2, we dealt with discrete choice models where the response variable may be restricted to two or more outcomes, indicating that an individual can choose from two or more alternatives which are nominal or ordered in certain ways. The dependent variable may also take on only integer values, such as the number of children per family (Topic 2.4). Alternatively, it may appear to be a continuous variable with a number of responses at a threshold value. For example, the response to the question “how many hours did you work last month?” will be recorded as zero for the non-working respondents. None of these measures are amenable to being modeled by the linear regression methods we have discussed. The models covered here include the Tobit and Heckman models, which enable us to deal with censored and truncated samples and with sample selection bias, respectively.

2.3.1. The Tobit Model

2.3.1.1. Introduction

The Tobit models are a family of regression models that describe the relationship between a censored (or truncated, in an even broader sense) continuous dependent variable yi and a vector of independent variables xi. The model was originally proposed by James Tobin (1958) to model non-


negative continuous variables with several observations taking value 0 (household expenditure).

Generally, the Tobit models assume there is a latent continuous variable yi*, which is not observed over its entire range. This can happen due to truncation or censoring. When truncation occurs, individuals in a certain range of the variable yi* are not included in the dataset: we only observe individuals whose values of the variable fall in a restricted range; individuals outside that range have been excluded from the dataset or in fact do not exist. When censoring occurs, no individuals are excluded for certain ranges of yi*, but yi* is still not observed over its entire range; instead of observing this variable, we only observe yi, which is a censored version of yi*. Unlike in the case of truncation, the variable takes values over its entire range for the individuals in the database; however, for some ranges it is not correctly observed and is recorded at a given value instead. For example, measuring heights with a 2-m tape measure could lead to individuals taller than 2 m being censored at that value; such individuals would remain in the database, but all we would know is that they are censored (their recorded value of "2" indicates they are at least 2 m tall). Tobit models carry the idea of the censored normal distribution into regression models, assuming that this latent variable yi* depends linearly on xi via a parameter vector β.
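To make the distinction concrete, the following short simulation is a sketch only (the variable names are invented for illustration); it generates a latent variable and then constructs a censored version and a truncated version of it:
. clear
. set obs 1000
. set seed 12345
. generate x = rnormal()
. generate ystar = 1 + 2*x + rnormal()     // latent variable y*
. generate y_cens = max(ystar, 0)          // censoring at 0: every unit is kept, values below 0 are recorded as 0
. generate y_trunc = ystar if ystar > 0    // truncation at 0: units with y* <= 0 are recorded as missing, i.e., effectively excluded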

2.3.1.2. Truncated regression model

In truncated or censored regression models, we must fully understand the context in which the data were generated. That is, it is quite important that we identify situations of truncated or censored response variables. Utilizing these variables as the dependent variable in a regression equation without consideration of these qualities will be misleading.

In the case of truncation, the sample is drawn from a subset of the population so that only certain values are included in the sample. We lack observations on both the dependent variable and independent variables. For instance, we might have a sample of individuals who have a high school diploma, some college experience, or one or more college degrees. The sample has been generated by interviewing those who completed high school. This is a truncated sample, relative to the population, in that it excludes all individuals who have not completed high school. The characteristics of those excluded individuals are not likely to be the same as those in our sample.


For instance, we might expect that the average income of dropouts is lower than that of graduates. The effect of truncating the distribution of a random variable is clear: the expected value of the truncated random variable moves away from the truncation point, and the variance is reduced.

Descriptive statistics on the level of education in our sample should make that clear: with the minimum years of education set to 12, the mean education level is higher than it would be if high school dropouts were included, and the variance will be smaller. In the subpopulation defined by a truncated sample, we have no information about the characteristics of those who were excluded. For instance, we do not know whether the proportion of minority high school dropouts exceeds the proportion of minorities in the population.

A sample from this truncated population cannot be used to make inferences about the entire population without correction for the fact that those excluded individuals are not randomly selected from the population at large. While it might appear that we could use these truncated data to make inferences about the subpopulation, we cannot even do that.

A regression estimated from the subpopulation will yield coefficients that are biased toward zero—or attenuated—as well as an estimate of σ²Ꜫ that is biased downward.

If we are dealing with a truncated normal distribution, where y = Xβ + Ꜫ is only observed if it exceeds τ, we may define:

αi = (τ − Xiβ)/σꜪ,   λ(αi) = φ(αi)/{1 − Φ(αi)} (2.3.1)

where σꜪ is the standard deviation of the untruncated disturbance Ꜫ, φ(·) is the normal density function (PDF) and Φ(·) is the normal CDF. The expression λ(αi) is termed the inverse Mills ratio, or IMR.

If a regression is estimated from the truncated sample, we find that:

E[yi | yi > τ, Xi] = Xiβ + σꜪλ(αi), i.e., yi = Xiβ + σꜪλ(αi) + Ꜫi in the truncated sample (2.3.2)

These regression estimates suffer from the exclusion of the term λ(αi). The regression is mis-specified, and the effect of that misspecification will differ across observations, with a


heteroskedastic error term whose variance depends on Xi. To deal with these problems, we include the IMR as an additional predictor. This allows us to use a truncated sample to make consistent inferences about the subpopulation. If we can justify making the assumption that the regression errors in the population are normally distributed, then we can estimate an equation for a truncated sample with the Stata command truncreg.

Under the assumption of normality, inferences for the population may be made from the truncated regression model. The estimator used in this command assumes that the regression errors are normal.

The truncreg option ll(#) is used to indicate that values of the response variable less than or equal to # are truncated. We might have a sample of college students with yearsEduc truncated from below at 12 years. Upper truncation can be handled by the ul(#) option: for instance, we may have a sample of individuals whose income is recorded up to $200,000. Both lower and upper truncation can be specified by combining the options.
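As a minimal sketch of the syntax (y, x1 and x2 are hypothetical variable names, not variables used elsewhere in this Topic):
. truncreg y x1 x2, ll(12)             // y truncated from below at 12
. truncreg y x1 x2, ul(200000)         // y truncated from above at 200,000
. truncreg y x1 x2, ll(0) ul(200000)   // truncation from below and above combined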

The coefficient estimates and marginal effects from truncreg may be used to make inferences about the entire population, whereas the results from the mis-specified regression model should not be used for any purpose.

Example 2.3.1: We consider a sample of married women from the laborsub dataset, which can be downloaded from “http://www.stata-press.com/data/r10/laborsub.dta”, whose hours of work are truncated from below at zero.
. use laborsub, clear
. summarize whrs kl6 k618 wa we


Table 2.3.1: Summary statistics of the variables

Variable   Obs   Mean     Std. Dev.   Min   Max
whrs       250   799.84   915.60      0     4950
kl6        250   0.24     0.51        0     3
k618       250   1.36     1.37        0     8
wa         250   42.92    8.43        30    60
we         250   12.35    2.17        5     17

To illustrate the consequences of ignoring truncation, we estimate a model of hours worked with OLS, including only working women. The regressors include measures of the number of preschool children (kl6), number of school-age children (k618), age (wa) and years of education (we).
. regress whrs kl6 k618 wa we if whrs>0

Table 2.3.2: OLS estimation results for truncated data

Source     SS           df    MS           Number of obs = 150
Model      7326995.15   4     1831748.79   F(4, 145)     = 2.80
Residual   94793104.2   145   653745.55    Prob > F      = 0.028
Total      102120099    149   685369.79    R-squared     = 0.072
                                           Adj R-squared = 0.046
                                           Root MSE      = 808.55

whrs    Coef.      Std. Err.   t       P>|t|   [95% Conf. Interval]
kl6     -421.482   167.973     -2.51   0.013   -753.475   -89.490
k618    -104.457   54.186      -1.93   0.056   -211.554   2.640
wa      -4.785     9.691       -0.49   0.622   -23.938    14.368
we      9.353      31.238      0.30    0.765   -52.387    71.094
_cons   1629.817   615.130     2.65    0.009   414.037    2845.597


We now re-estimate the model with truncreg, taking into account that 100 of the 250 observations have zero recorded whrs:
. truncreg whrs kl6 k618 wa we, ll(0) nolog
(note: 100 obs. truncated)

Table 2.3.3: Truncated regression model results

Limit: lower = 0              Number of obs = 150
       upper = +inf           Wald chi2(4)  = 10.05
Log likelihood = -1200.9157   Prob > chi2   = 0.0395

whrs          Coef.      Std. Err.   t       P>|t|   [95% Conf. Interval]
eq1
  kl6         -803.004   321.361     -2.50   0.012   -1432.861   -173.147
  k618        -172.875   88.729      -1.95   0.051   -346.781    1.031
  wa          -8.821     14.369      -0.61   0.539   -36.983     19.341
  we          16.529     46.504      0.36    0.722   -74.617     107.674
  _cons       1586.26    912.355     1.74    0.082   -201.923    3374.442
sigma
  _cons       1586.26    912.355     1.74    0.082   -201.923    3374.442

The effect of truncation in the subsample is quite clear. Some of the attenuated coefficient estimates from regress are no more than half as large as their counterparts from truncreg. The parameter sigma _cons, comparable to Root MSE in the OLS regression, is considerably larger in the truncated regression reflecting its downward bias in a truncated sample.

2.3.1.3. Censored regression model

Let us now turn to another commonly encountered issue with the data: censoring. Unlike truncation, in which the distribution from which the sample was drawn is a non-randomly selected subpopulation, censoring occurs when a response variable is set to an arbitrary value above or


below a certain value: the censoring point. In contrast to the truncated case, we have observations on the explanatory variables for these individuals. The problem of censoring is that we do not observe the true value of the response variable for certain individuals. For instance, we may have full demographic information on a set of individuals, but only observe the number of hours worked per week for those who are employed.

As another example of a censored variable, consider that the numeric response to the question “How much did you spend on a new car last year?” may be zero for many individuals, but that should be considered as the expression of their choice not to buy a car. Such a censored response variable should be considered as being generated by a mixture of distributions: the binary choice to purchase a car or not, and the continuous response of how much to spend conditional on choosing to purchase. Although it would appear that the variable car outlay could be used as the dependent variable in a regression, it should not be employed in that manner, since it is generated by a censored distribution.

A solution to this problem was first proposed by Tobin (1958) as the censored regression model; it became known as “Tobin’s probit” or the tobit model. The model can be expressed in terms of a latent variable:

yi* = Xiβ + Ꜫi

yi = 0 if yi* ≤ 0 (2.3.3)

yi = yi* if yi* > 0

As in the prior example, our variable yi contains either zeros for non-purchasers or a dollar amount for those who chose to buy a car last year. The model combines aspects of the binomial probit for the distinction of yi = 0 versus yi > 0 and the regression model for [yi|yi > 0].

Of course, we could collapse all positive observations on yi and treat this as a binomial probit (or logit) estimation problem, but that would discard the information on the dollar amounts spent by purchasers. Likewise, we could throw away the yi = 0 observations, but we would then be left with a truncated distribution, with the various problems that creates.


To take into account all the information in yi properly, we must estimate the model with the tobit estimation method, which employs maximum likelihood to combine the probit and linear regression components of the log-likelihood function.
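The following simulation is a sketch (simulated data, invented parameter values) of the censored data-generating process in Equation (2.3.3) and of its estimation by tobit:
. clear
. set obs 1000
. set seed 2020
. generate x = rnormal()
. generate ystar = 0.5 + 1.5*x + rnormal()   // latent y*
. generate y = cond(ystar > 0, ystar, 0)     // observed y, censored at zero as in (2.3.3)
. tobit y x, ll(0)                           // maximum likelihood estimation of the tobit model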

Tobit models may be defined with a threshold other than zero. Censoring from below may be specified at any point on the y scale with the ll(#) option for left censoring. Similarly, the standard tobit formulation may employ an upper threshold (censoring from above, or right censoring) using the ul(#) option to specify the upper limit. Stata’s tobit also supports the two-limit tobit model where observations on y are censored from both left and right by specifying both the ll(#) and ul(#) options.

Even in the case of a single censoring point, predictions from the tobit model are quite complex, since one may want to calculate the regression-like xb with predict, but could also compute the predicted probability that [y|X] falls within a particular interval (which may be open-ended on left or right). This may be specified with the pr(a,b) option, where arguments a, b specify the limits of the interval; the missing value code (.) is taken to mean infinity (of either sign).

Another predict option, e(a,b), calculates the expectation E(y) = E[Xβ + Ꜫ] conditional on [y|X] being in the a, b interval. Last, the ystar(a,b) option computes the prediction from Equation (2.3.3): a censored prediction, where the threshold is taken into account.
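As a hedged illustration of these post-estimation options (hypothetical variable names y, x1 and x2):
. tobit y x1 x2, ll(0)
. predict xbhat, xb               // linear prediction Xb
. predict p_pos, pr(0,.)          // Pr(y* in (0, +inf)), i.e., probability of an uncensored outcome
. predict e_pos, e(0,.)           // E(y | y* in (0, +inf))
. predict ystarhat, ystar(0,.)    // censored prediction, with the threshold taken into account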

The marginal effects of the tobit model are also quite complex. The estimated coefficients are the marginal effects of a change in Xj on y*, the unobservable latent variable:

∂E(y*|Xj)/∂Xj = βj (2.3.4) but that is not very useful. If instead we evaluate the effect on the observable y, we find that:

∂E(y|Xj)/∂Xj = βj × Pr[a < yi*< b] (2.3.5) where a and b are defined as above for predict. For instance, for left-censoring at zero, a = 0, b = +∞. Since that probability is at most unity (and will be reduced by a larger proportion of censored observations), the marginal effect of Xj is attenuated from the reported coefficient toward zero.
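As a purely illustrative calculation (the numbers are assumed, not taken from any estimation in this Topic): if β̂j = 0.40 and the estimated probability of being uncensored is Pr[a < yi* < b] = 0.67, then

∂E(y|Xj)/∂Xj ≈ 0.40 × 0.67 ≈ 0.27,

noticeably smaller than the reported coefficient.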


An increase in an explanatory variable with a positive coefficient will imply that a left-censored individual is less likely to be censored. Their predicted probability of a nonzero value will increase.

For a non-censored individual, an increase in Xj will imply that E[y|y > 0] will increase. So, for instance, a decrease in the mortgage interest rate will allow more people to be homebuyers (since many borrowers’ income will qualify them for a mortgage at lower interest rates), and allow prequalified homebuyers to purchase a more expensive home.

The marginal effect captures the combination of those effects. Since the newly-qualified homebuyers will be purchasing the cheapest homes, the effect of the lower interest rate on the average price at which homes are sold will incorporate both effects. We expect that it will increase the average transactions price, but due to attenuation, by a smaller amount than the regression function component of the model would indicate.

Example 2.3.2: Here we use the womenwk dataset, which can be downloaded from “http://www.stata-press.com/data/r9/womenwk.dta”. We generate the log of the wage (lw) for working women and set lwf equal to lw for working women and to zero for non-working women. This could be problematic if recorded wages below $1.00 were present in the data, but in these data the minimum wage recorded is $5.88. We first estimate the model with OLS, ignoring the censored nature of the response variable.
. use womenwk, clear
. regress lwf age married children education
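As an aside, one way lw and lwf could be constructed from the wage and work variables (described below in Table 2.3.9) is sketched here; the commands are illustrative only:
. generate lw = ln(wage) if work == 1
. generate lwf = cond(work == 1, lw, 0)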

Table 2.3.4: OLS estimation results for censored data

Source     SS         df     MS        Number of obs = 2000
Model      937.873    4      234.468   F(4, 1995)    = 134.21
Residual   3485.341   1995   1.747     Prob > F      = 0.000
Total      4423.215   1999   2.213     R-squared     = 0.212
                                       Adj R-squared = 0.211
                                       Root MSE      = 1.322


lwf         Coef.    Std. Err.   t       P>|t|   [95% Conf. Interval]
age         0.036    0.004       9.42    0.000   0.029    0.044
married     0.319    0.069       4.62    0.000   0.183    0.454
children    0.331    0.021       15.51   0.000   0.289    0.372
education   0.084    0.010       8.24    0.000   0.064    0.104
_cons       -1.078   0.170       -6.33   0.000   -1.412   -0.744

Re-estimating the model as a tobit and indicating with the ll option that lwf is left-censored at zero yields:
. tobit lwf age married children education, ll(0)

Table 2.3.5: Tobit regression estimation results for censored data

Tobit regression                 Number of obs = 2000
                                 LR chi2(4)    = 461.85
                                 Prob > chi2   = 0.000
Log likelihood = -3349.969       Pseudo-R2     = 0.065

lwf         Coef.    Std. Err.   t         P>|t|   [95% Conf. Interval]
age         0.052    0.006       9.080     0.000   0.041    0.063
married     0.484    0.104       4.680     0.000   0.281    0.687
children    0.486    0.032       15.330    0.000   0.424    0.548
education   0.115    0.015       7.620     0.000   0.085    0.145
_cons       -2.808   0.263       -10.670   0.000   -3.324   -2.291
/sigma      1.873    0.040                         1.794    1.951

Obs. summary:    657 left-censored observations at lwf<=0
                1343 uncensored observations
                   0 right-censored observations


The tobit estimates of lwf show positive, significant effects of age, marital status, the number of children and the number of years of education. Each of these factors is expected both to increase the probability that a woman will work and to increase her wage conditional on her being employed.

Following tobit estimation, we first generate the marginal effects of each explanatory variable on the probability that an individual will have a positive log(wage): the pr(a,b) option of predict.
. margins, predict(pr(0,.)) dydx(_all)

Table 2.3.6: Marginal effects of the Tobit model

Average marginal effects        Number of obs = 2000
Model VCE: OIM
Expression: Pr(lwf>0), predict(pr(0,.))
dy/dx w.r.t.: age married children education

Variable    dy/dx   Delta-method Std. Err.   z        P>|z|   [95% Conf. Interval]
age         0.007   0.001                    9.080    0.000   0.006   0.009
married     0.066   0.014                    4.670    0.000   0.039   0.094
children    0.067   0.004                    14.910   0.000   0.058   0.075
education   0.016   0.002                    7.610    0.000   0.012   0.020

We then calculate the marginal effect of each explanatory variable on the expected log wage, given that the individual has not been censored (i.e., was working). These effects, unlike the estimated coefficients from regress, properly take into account the censored nature of the response variable.
. margins, predict(e(0,.)) dydx(_all)


Average marginal effects        Number of obs = 2000
Model VCE: OIM
Expression: E(lwf|lwf>0), predict(e(0,.))
dy/dx w.r.t.: age married children education

Table 2.3.7: Marginal effects of the Tobit model

Variable    dy/dx   Delta-method Std. Err.   z        P>|z|   [95% Conf. Interval]
age         0.032   0.003                    9.080    0.000   0.025   0.038
married     0.293   0.063                    4.680    0.000   0.170   0.415
children    0.294   0.019                    15.490   0.000   0.257   0.331
education   0.069   0.009                    7.610    0.000   0.052   0.087

Note, for instance, the much smaller marginal effects associated with number of children and level of education in tobit vs. regress.

2.3.2. Sample Selection Model

2.3.2.1. Introduction

This subtopic presents how to handle one of the more common problems that arise in economic analyses: sample selection bias. Essentially, sample selection bias can arise whenever some potential observations cannot be observed. For instance, the students enrolled in an intermediate microeconomics course are not a random sample of all undergraduates. Students self-select when they enroll in any class or choose a major. While we do not know all of the reasons for this self-selection, we suspect that students choosing to take advanced economics courses have stronger quantitative skills than students choosing courses in the humanities. Since students who did not enroll in the intermediate microeconomics class never sit its exams, we can never observe the grades they would have made had they enrolled. Under certain


circumstances the omission of potential members of a sample will cause ordinary least squares (OLS) to give biased estimates of the parameters of a model.

In the 1970s, James Heckman developed techniques that correct the bias introduced by sample selection (Heckman, 1979). Since then, most econometric computer programs have included a command that automatically implements Heckman's method. However, blind use of these commands can lead to errors that would be avoided by a better understanding of his correction technique. This module is intended to provide that understanding.

In this subtopic we will first discuss the sources of sample selection bias by examining the basic economic model used to understand the problem. Second, the estimation strategy first developed by Heckman will be presented. Third, we will present how to estimate the Heckman model in Stata. Finally, an extended example of the technique will be presented.

2.3.2.2. The Heckman model

The Heckman two-stage procedure to be specified below is called the Type II Tobit model (Amemiya) or the probit selection model (Wooldridge). Assume that there is an unobserved latent variable, yi*, and an unobserved latent index, di*, such that:

yi* = x′iβ+εi where i=1,…,N (2.3.6)

di* = z′iγ+νi where i=1,…,N (2.3.7)

di =1 if di*>0 and di =0 if di*≤0 (2.3.8)

yi = yi*di (2.3.9) Substituting (2.3.6), (2.3.7) and (2.3.8) into (2.3.9) gives:

yi = x′iβ+εi if z′iγ+νi>0 and yi = 0 if z′iγ+νi ≤0 (2.3.10)

Note that N is the total sample size and n is the number of observations for which di = 1. Since yi* is not observed for the remaining (N−n) observations, the question becomes why these observations are missing. A concrete example of such a model is a model of female wage determination. Equation (2.3.6) would model the wage rate earned by women in the labor force and Equation (2.3.7) would model the decision


by a female to enter the labor force. In this case, yi, the wage rate woman i receives, is a function of the variables in xi; however, women not in the labor force are not included in the sample. If these missing observations are drawn randomly from the population, there is no need for concern. Selectivity bias arises if the (N−n) omitted observations have unobserved characteristics that affect the likelihood that di = 1 and are correlated with the wage the woman would receive had she entered the labor force. For instance, a mentally unstable woman is likely to earn relatively low wages and may also be less likely to enter the labor force. In this framework, the error terms (εi, νi) are assumed to be independently and identically distributed across individuals as bivariate normal, N(0, φ), where:

φ = [ σ²ε   σεν ]
    [ σνε   σ²ν ]     (2.3.11)

and (εi, νi) are independent of zi. The selectivity bias arises because σεν ≠ 0. In effect, the residual εi includes the same unobserved characteristics as the residual νi, causing the two error terms to be correlated. OLS estimation of Equation (2.3.6) would therefore suffer from an omitted variable: the bias created by the missing observations (wage data are not available for women not in the work force). As in other cases of omitted variables, the estimates of the parameters of the model, β̂, would be biased. Heckman (1979) noted the following in his seminal article on selectivity bias:

“One can also show that the least squares estimator of the population variance is downward biased. Second, a symptom of selection bias is that variables that do not belong in the true structural equation (variables in zi not in xi) may appear to be statistically significant determinants of yi when regressions are fit on selected samples. Third, the model just outlined contains a variety of previous models as special cases. For a more complete development of the relationship between the model developed here and previous models for limited dependent variables, censored samples and truncated samples, see Heckman (1976). Fourth, multivariate extensions of the preceding analysis, while mathematically straightforward, are of considerable substantive interest. One example is offered. Consider migrants choosing among K possible regions of residence. If the self-selection rule is to choose to migrate to that region with the highest income, both the self-selection rule and the subsample regression functions can be simply characterized by a direct extension of the previous analysis.”


2.3.2.3. Estimation of the sample selection model

There are two methods for estimating the sample selection model: the two-step procedure and the maximum likelihood method. Heckman (1979) suggests a two-step estimation strategy. In the first step, a probit estimate of Equation (2.3.7) is used to construct a variable that measures the bias. This variable is known as the “inverse Mills ratio.” Heckman and others demonstrate that:

E[εi | zi, di = 1] = (σεν/σ²ν){ϕ(zi′γ)/Φ(zi′γ)} (2.3.12)

where ϕ(zi′γ) and Φ(zi′γ) are the probability density function and the cumulative distribution function, respectively, evaluated at zi′γ. The ratio in the braces in Equation (2.3.12) is known as the inverse Mills ratio. We will use an estimate of the inverse Mills ratio in the estimation of Equation (2.3.10) to measure the sample selectivity bias.

The Heckman two-step estimator is relatively easy to implement. In the first step, called the selection equation, you use a maximum likelihood probit regression on the whole sample to estimate γ̂ from Equation (2.3.7). You then use γ̂ to compute the inverse Mills ratio:

λ̂i = ϕ(zi′γ̂)/Φ(zi′γ̂) (2.3.13)

In the second step called the outcome equation, we estimate:

yi = xi′β + μλ̂i + ηi

by OLS, where E(μ̂) = σεν/σ²ν. Thus, a t-ratio test of the null hypothesis H0: μ = 0 is equivalent to testing the null hypothesis H0: σεν = 0, and is a test of the existence of sample selectivity bias.
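As a rough sketch of these two steps done by hand (all variable names here are hypothetical; in practice Stata's heckman command, introduced below, does this automatically):
. probit d z1 z2 z3                        // step 1: selection equation (2.3.7)
. predict zg, xb                           // fitted index zi'γ-hat
. generate imr = normalden(zg)/normal(zg)  // inverse Mills ratio, Equation (2.3.13)
. regress y x1 x2 imr if d == 1            // step 2: outcome equation; the t-test on imr tests H0: σεν = 0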

An alternative approach to the sample selectivity problem is to use a maximum likelihood estimator. Heckman (1974) originally suggested estimating the parameters of the model by maximizing the average log likelihood function:

L = (1/N) Σ_{i=1}^{N} { di ln[ ∫_{−zi′γ}^{+∞} ϕεν(yi − xi′β, v) dv ] + (1 − di) ln[ ∫_{−∞}^{−zi′γ} ∫_{−∞}^{+∞} ϕεν(ε, v) dε dv ] } (2.3.14)

where ϕεν is the probability density function of the bivariate normal distribution. Fortunately, Stata offers a single command for calculating either the two-step or the maximum likelihood estimators.


2.3.2.4. Estimation of the sample selection model in Stata

Estimation of the two versions of the Heckman sample selectivity bias model is straightforward in Stata. The command is:
. heckman depvar [varlist], select(varlist_s) [twostep]
or
. heckman depvar [varlist], select(depvar_s = varlist_s) [twostep]

The syntax for maximum-likelihood estimates is:
. heckman depvar [varlist] [weight] [if exp] [in range], select([depvar_s =] varlist_s [, offset(varname) noconstant]) [robust cluster(varname) score(newvarlist|stub*) nshazard(newvarname) mills(newvarname) offset(varname) noconstant constraints(numlist) first noskip level(#) iterate(0) nolog maximize_options]

The predict command has these options, among others:
• xb, the default, calculates the linear prediction from the underlying regression equation.
• ycond calculates the expected value of the dependent variable conditional on the dependent variable being observed/selected: E(y | y observed).
• yexpected calculates the expected value of the dependent variable (y*), where that value is taken to be 0 when it is expected to be unobserved; y* = P(y observed) × E(y | y observed). The assumption of 0 is valid for many cases where non-selection implies non-participation (e.g., unobserved wage levels, insurance claims from those who are uninsured, etc.) but may be inappropriate for some problems (e.g., unobserved disease incidence).

Examples of these two commands are:
. heckman wage educ age, select(married children educ age)
. predict yhat

These two commands estimate Equation (2.3.6) by maximum likelihood, with the wage as a function of education and age, using a selection equation in which marital status, number of children, education level and age explain which individuals participate in the labor force.
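Following such a fit, the alternative predictions described above can be requested as well; a brief sketch (the new variable names are arbitrary):
. predict wage_cond, ycond        // E(wage | wage observed)
. predict wage_exp, yexpected     // P(wage observed) x E(wage | wage observed)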


The help file in Stata provides additional information on the structure of the Heckman command and is well worth printing out if you are dealing with a sample selectivity bias problem.

Example 2.3.3: We will illustrate various issues of selection bias using the data set available from the Stata Press site. Retrieve the data set by entering:
. use http://www.stata-press.com/data/imeus/womenwk, clear

This data set has 2,000 observations of 15 variables. We can use the describe command (.describe) to get a brief description of the data set:

Table 2.3.8: Description of variables included in the dataset

obs: 2,000
vars: 15
size: 142,000 (86.5% of memory free)

Variable name   Storage type   Display format   Value label   Variable label
c1              double         %10.0g
c2              double         %10.0g
u               double         %10.0g
v               (7,2)          %10.0g
country         float          %9.0g
age             int            %8.0g
education       int            %8.0g
married         byte           %8.0g
children        int            %8.0g
select          float          %9.0g
wageful         float          %9.0g
wage            float          %9.0g
lw              float          %9.0g
work            float          %9.0g
lwf             float          %9.0g


We are interested in only a subset of these data. Table 2.3.9 reports the definitions of variables that are relevant for our analysis. We can get further insight into the data set using the summarize command. Table 2.3.10 reports the summary statistics for the data set.

Table 2.3.9: Definition of the relevant variable in the data set

Variable name   Definition
country         Country of residence (equal to 0, 1, ..., 9)
age             Age of the woman
education       Number of years of education of the woman
married         Dummy variable equal to 1 if the woman is married and 0 otherwise
children        Number of children that the woman has in her household
wage            Hourly wage rate of the woman
lw              Natural logarithm of the hourly wage rate
work            Dummy variable equal to 1 if the woman is in the workforce and 0 otherwise

Use the Stata command “summarize age education married children wage lw work” for descriptive statistics of the variables.

Table 2.3.10: Summary statistics of the relevant variables in the data set

Variable    Obs    Mean     Std. Dev.   Min     Max
age         2000   36.208   8.287       20      59
education   2000   13.084   3.046       10      20
married     2000   0.671    0.470       0       1
children    2000   1.645    1.399       0       5
wage        1343   23.692   6.305       5.885   45.810
lw          1343   3.127    0.287       1.772   3.825
work        2000   0.672    0.470       0       1


Here, we are interested in modeling two things: (1) the decision of the woman to enter the labor force and (2) determinants of the female wage rate. It might be reasonable to assume that the decision to enter the labor force by a woman is a function of age, marital status, the number of children, and her level of education. Also, the wage rate a woman earns should be a function of her age and education.

The decision to enter the labor force

We can use a probit regression to model the decision of a woman to enter the labor force. The results of this estimation are reported in Table 2.3.11. We can also use the predict command to produce some results that help confirm that we understand what the regression results mean. In particular, type the following two commands:
. predict zbhat, xb
. predict phat, p

These two commands will predict (1) the linear prediction (zbhat) and (2) the predicted probability that the woman will be in the workforce (phat). Table 2.3.12 reports the values of these two variables for observations 1 through 10.

Use the following Stata command to identify the factors affecting the decision to enter the labor force:
. probit work age education married children

Table 2.3.11: Probit estimation of the decision to enter the labor force

Iteration 0: log likelihood = -1266.223
Iteration 4: log likelihood = -1027.062

Number of obs  = 2000
LR chi2(4)     = 478.32
Prob > chi2    = 0.000
Log likelihood = -1027.062
Pseudo-R2      = 0.189


work        Coef.    Std. Err.   z        P>|z|   [95% Conf. Interval]
Age         0.035    0.004       8.21     0.000   0.026    0.043
Education   0.058    0.011       5.32     0.000   0.037    0.080
Married     0.431    0.074       5.81     0.000   0.285    0.576
Children    0.447    0.029       15.56    0.000   0.391    0.504
Constant    -2.467   0.193       -12.81   0.000   -2.845   -2.090

The predicted values of zbhat and phat for some of the observations are provided in Table 2.3.12.

Table 2.3.12: Predicted values of zbhat and phat for observations 1 through 10.

Observation   zbhat    phat
1             -0.689   0.245
2             -0.203   0.420
3             -0.481   0.315
4             -0.168   0.433
5             0.349    0.636
6             0.588    0.722
7             0.974    0.835
8             0.460    0.677
9             0.018    0.507
10            0.326    0.628

The interpretation of the numbers in Table 2.3.12 is straightforward. Consider individual 1. The z-value predicted for this individual is -0.689. Using the standard normal tables found in any statistics book, it is easy to see that:

Pr(Individual 1 is in the labor force) = Φ(−0.69) = 0.5 − Pr(0 ≤ z ≤ 0.69) ≈ 0.5 − 0.2549 ≈ 0.2451


The difference between this number and the value reported for phat in Table 2.3.12 is due to rounding error. Next, we will calculate the inverse Mills ratio. As noted in Equation (2.3.13), the formula for the inverse Mills ratio is:

λ̂i = ϕ(zi′γ̂)/Φ(zi′γ̂)

The variable phat is equal to Φ(zi′γ̂). Stata offers an easy way to calculate ϕ(zi′γ̂) with the function normden(zbhat), as follows:
. generate imratio = normden(zbhat)/phat

Table 2.3.13: Calculation of the inverse Mills ratio for the first 10 observations

Observation   zbhat    phat    Inverse Mills ratio
1             -0.689   0.245   1.282
2             -0.203   0.420   0.931
3             -0.481   0.315   1.127
4             -0.168   0.433   0.908
5             0.349    0.636   0.590
6             0.588    0.722   0.465
7             0.974    0.835   0.298
8             0.460    0.677   0.530
9             0.018    0.507   0.787
10            0.326    0.628   0.602

Comparison of the two Heckman estimates

One of the great advantages of using an econometrics program like Stata is that the authors quite often have created a command that does all of the work for the user. In our case, the commands we need to run to generate the maximum likelihood estimate of the Heckman model are:
. global wage_eqn wage educ age
. global seleqn married children age education
. heckman $wage_eqn, select($seleqn)


Notice that we have used the global command to create a shortcut for referring to each of the two equations in the estimation. The commands for the Heckman two-step estimate are:
. heckman $wage_eqn, select($seleqn) twostep
. predict mymills, mills

Table 2.3.14: Comparison of Heckman maximum-likelihood and the Heckman two-step estimates with the probit estimates of the selection equation

(1) Explanatory variable   (2) Maximum likelihood estimate   (3) Heckman two-step   (4) Probit estimate of the selection equation

Wage equation
Education     0.990 (18.59)     0.983 (18.23)     —
Age           0.213 (10.34)     0.212 (9.61)      —
Intercept     0.486 (0.45)      0.734 (0.59)      —

Selection equation
Married       0.445 (6.61)      0.431 (5.81)      0.431 (5.81)
Children      0.439 (15.79)     0.447 (15.56)     0.447 (15.56)
Age           0.037 (8.79)      0.035 (8.21)      0.035 (8.21)
Education     0.056 (5.19)      0.058 (5.32)      0.058 (5.32)
Intercept     -2.491 (-13.16)   -2.467 (-12.81)   -2.467 (-12.81)
ρ             0.704             0.673             —
σ             6.005             5.947             —
λ (Mills)     4.224             4.002 (6.60)      —
Observations                2000        2000   2000
Number of women not working 657         657    657
Number of women working     1343        1343   1343
Log likelihood              -5178.304   —      -1027.062


Wald χ2(2)          508.44   —        —
Probability > χ2    0.000    —        —
Wald χ2(4)          —        551.37   —
Probability > χ2    —        0.000    —

LR test of independent equations (ρ = 0):
χ2(1)               61.20    —        478.32
Probability > χ2    0.000    —        0.000

Note: figures in parentheses are z statistics.

The second command reports the estimates of the inverse Mills ratio; we have retrieved these values in order to check our earlier calculations. Table 2.3.14 reports the results of these two estimations. Column 2 reports the maximum-likelihood estimates; Column 3 reports the Heckman two-step estimates; and Column 4 reports the probit estimate of the selection equation as reported in Table 2.3.11. The estimates for the two methods are very similar. Of course, the probit estimates in Column 4 exactly match the results reported for the selection equation in Column 3. As a final check, Table 2.3.15 compares the values of the inverse Mills ratio reported in Table 2.3.13 with the values of the inverse Mills ratio calculated by the Heckman two-step method. The two estimates are identical except for some rounding errors.

Table 2.3.15: Inverse Mills ratio comparison

Observation   As calculated from the probit estimate   As reported by the Heckman two-step
1             1.282                                     1.282
2             0.931                                     0.931
3             1.127                                     1.127
4             0.908                                     0.908
5             0.590                                     0.590
6             0.465                                     0.465


7             0.297                                     0.297
8             0.530                                     0.530
9             0.786                                     0.786
10            0.602                                     0.602

Topic Summary

This Topic covers regression models for truncated or censored data. It begins by defining truncation, censoring, and a special variant of truncation—incidental truncation—that gives rise to sample-selection bias. The Topic then introduces the truncated regression model and shows how it is estimated via maximum likelihood. The discussion then moves to the more commonly used tobit model; estimation via maximum likelihood, various ways of interpreting the model coefficients, and an analogue of R2 are all considered for this model. The Topic concludes with a discussion of the sample selection model, presenting both two-step and maximum-likelihood estimation techniques for this model.

Exercises on the Topic

1. Explain briefly why OLS fails for a truncated or censored dependent variable.
2. Explain the difference between the Tobit and Heckman selection models.
3. The dataset ..\..\livestockdata.dta is a Stata dataset on the determinants of livestock market participation and volume of sales in smallholder farm production in Ethiopia. Perform the following analysis using this dataset.
(a) Do descriptive statistics of these variables and make interpretations.
(b) Run OLS on the full sample and interpret the results. Why is OLS inappropriate for such a dependent variable? Explain.
(c) Run OLS for market participants only. Test for the endogeneity of livestock owned in this model.
(d) Run a probit model where marketpart is the dependent variable and interpret the results.


(e) Run Tobit, truncated regression and Heckman models and interpret the results. Compare and contrast the results of these models with OLS and among themselves.
(f) Estimate the marginal effects for each model and interpret the results.

Reading materials

Empirical Applications Readings: Cameron and Trivedi (2005) Ch. 16; Greene (2012) Ch. 19; Verbeek (2012) Ch. 7.


TOPIC 2.4. TREATMENT EFFECTS MODELS (14 HOURS)

Topic Objectives

After completing this Topic, students will be able to:

• Identify the major problems of impact evaluation or causal inference;
• Discuss and understand different approaches to program evaluation;
• Differentiate between quantitative and qualitative approaches to evaluation, as well as ex ante versus ex post approaches;
• Understand ways in which selection bias in participation can confound the treatment effect; and
• Apply different methodologies in impact evaluation, including randomization, propensity score matching, double differences, instrumental variable methods and regression discontinuity approaches, to practical problems in economics using statistical software such as STATA.

Introduction

Treatment effect or impact evaluation assesses the changes that can be attributed to a particular intervention, such as a project, program or policy, including both the intended effects and, ideally, the unintended ones. In contrast to outcome monitoring, which examines whether targets have been achieved, impact evaluation is structured to answer the question: how would outcomes such as participants’ well-being have changed if the intervention had not been undertaken? This involves counterfactual analysis, that is, “a comparison between what actually happened and what would have happened in the absence of the intervention.” Impact evaluations seek to answer cause-and-effect questions. In other words, they look for the changes in outcomes that are directly attributable to a program. Impact evaluation helps us to answer key questions for evidence-based policy making: what works, what doesn’t, where, why and for how much? It has received increasing attention in policy making in recent years in both developed and developing country contexts. It is an important component of the armory of evaluation tools and approaches and is integral to global efforts to improve the effectiveness of aid delivery and public spending more generally in


improving living standards. Originally oriented more towards the evaluation of social sector programs in developing countries, notably conditional cash transfers, impact evaluation is now increasingly applied in other areas such as agriculture, energy and transport.

2.4.1. The Evaluation Problem

Causal inference or impact evaluation or treatment effect estimation is problematic due to the fundamental problem of causal inference. A causal inference is a statement about counterfactuals: it is a statement about the difference between what did and did not happen. The core puzzle of causal inference is how you get information about what did not happen. The difference between prediction and causal inference is the intervention on the system under study. Like it or not, social science theories are almost always expressed as causal claims: e.g. “an increase in X causes an increase in Y”. The study of causal inference helps us understand the assumptions we need to make this kind of claim. The term identification also plays an important role in causal inference. A quantity of interest is identified when (given stated assumptions) access to infinite data would result in the estimate taking on only a single value. For example, a linear model that includes an intercept together with a full set of dummy variables for every category is not statistically identified, because the dummies cannot be distinguished from the intercept (the dummy variable trap).
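A quick illustration of this identification failure, sketched with Stata's shipped auto dataset (the variable domestic is created here purely for the example):
. sysuse auto, clear
. generate domestic = 1 - foreign    // together, foreign and domestic exhaust all categories
. regress price foreign domestic     // Stata drops one regressor: it is perfectly collinear with the constant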

Causal identification concerns what we can learn about a causal effect from the available data. If an effect is not identified, no estimation method will recover it. This means the relevant question is “what’s your identification strategy?”, that is, what set of assumptions lets you claim that you will be able to estimate a causal effect from the observed data? In addition, the difference between identification and estimation should be made clear. Identification refers to how much you can learn about the estimand if you have an infinite amount of data, while estimation refers to how much you can learn about the estimand from a finite sample. That is, identification precedes estimation.

Often identification requires (hopefully minimal) assumptions. Even when identification is possible, estimation may impose additional assumptions (e.g., that the linear approximation to the Conditional Expectation Function (CEF) is good enough). Note also Manski’s Law of Decreasing Credibility, which states that the credibility of inference decreases with the strength of the assumptions maintained.


Causal inference is hard due to the fundamental problem of causal inference. To address the fundamental problem of causal inference we have to make assumptions. The approach to be used to address this fundamental problem is known as the Rubin Causal Model or the Potential Outcomes framework. Towards this, we make the following definitions:

Let Yi0 be the outcome of interest under control (i.e., no treatment), Yi1 be the outcome of interest under treatment. Then the causal effect we are after is Yi1 - Yi0. And Yi0 and Yi1 are the potential outcomes. Let’s look at an example.

In an ideal world, we would see this:

Unit i   Xi1   Xi2   Xi3   Ti   Yi0   Yi1   Yi1 − Yi0
1        2     1     50    0    69    75    6
2        3     1     98    0    111   108   -3
3        2     2     80    1    92    102   10
4        3     1     98    1    112   111   -1

But in the real world, we see this:

Unit i   Xi1   Xi2   Xi3   Ti   Yi0   Yi1   Yi1 − Yi0
1        2     1     50    0    69    ?     ?
2        3     1     98    0    111   ?     ?
3        2     2     80    1    ?     102   ?
4        3     1     98    1    ?     111   ?

Hence, the fundamental problem of causal inference is that at most one of the two potential outcomes Yi0 or Yi1 can be observed for each unit i. For control units, Yi1 is the counterfactual (i.e., unobserved) potential outcome. For treatment units, Yi0 is the counterfactual. For this reason, some people (including Rubin) call causal inference a missing data problem.


Before we discuss solutions to this problem, we also define the following quantities of interest:

1. The individual treatment effect: Yi1 − Yi0

2. The average treatment effect (ATE): E[Y1 − Y0] = E[Y1] − E[Y0]

3. The average treatment effect on the treated (ATT): E[Y1 − Y0 | T = 1] = E[Y1 | T = 1] − E[Y0 | T = 1]

4. The average treatment effect on the untreated (ATU): E[Y1 − Y0 | T = 0] = E[Y1 | T = 0] − E[Y0 | T = 0]
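Purely as an illustration, using the hypothetical ideal-world table above (where both potential outcomes are visible):

ATE = (6 + (−3) + 10 + (−1))/4 = 3
ATT = (10 + (−1))/2 = 4.5   (units 3 and 4, for which Ti = 1)
ATU = (6 + (−3))/2 = 1.5    (units 1 and 2, for which Ti = 0)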

At this point, we cannot estimate all the above quantities of interest because of the above-mentioned fundamental problem. To estimate causal effects from observations, economists use substitutes for the missing data. These substitutes can provide the correct answer if the potential outcomes of the substitutes are the same as the potential outcomes of the people for whom you want to estimate a causal effect. While we generally cannot hope to find substitutes for the potential outcomes of a particular individual, we can hope to find substitutes that represent the distribution of potential outcomes among a group of individuals (Greenland, 2002). This is referred to as the exchangeability or no-confounding assumption (Greenland and Robins, 1986) and is well known, well accepted and well studied in economics.

However, an additional assumption is necessary to estimate the precise causal effect defined from a potential outcomes perspective - the Stable Unit Treatment Value Assumption (SUTVA) (Rubin, 1986).

Ignorability/Unconfoundedness assumption

It states that treatment assignment is independent of the potential outcomes. Mathematically, it is stated as:

(Y(1), Y(0)) ⊥ T (2.4.1)


Ignorability and unconfoundedness are often used interchangeably; technically, unconfoundedness is a stronger assumption. The major violation of this assumption is omitted variable bias, which mainly reflects the difficulty of capturing all the variables that affect treatment assignment and are also related to the outcome of interest. It is well known that the omission of an important variable from the model leads to biased and inconsistent estimation of the causal effect.

The Stable Unit Treatment Value Assumption (SUTVA)

SUTVA, as developed by Rubin (Rubin, 1980; 1986), is the assumption that each individual has only one potential outcome under each exposure condition. This is necessary to ensure that the causal effect for each individual is stable.

SUTVA has two elements: 1. the effect of treatment on an individual is independent of treatment of other individuals (non- interference), and 2. treatment has the same effect on an individual regardless of how the individual came to be treated (no hidden variation in treatment).

SUTVA is violated if exposure of the same individual to the same treatment condition at the same moment in time could result in different outcomes. Examples of violations of these assumptions include job training for too many people, which may flood the market with qualified job applicants (interference), or some patients receiving extra-strength aspirin (variation in treatment). Violation of either aspect of SUTVA creates unstable estimates of the causal effect. By this we mean that there is no unique potential outcome for each individual under each exposure condition. In general, the instability arises because there are multiple "versions of treatment": although the treatment was defined as a single construct, it really represents different versions of treatment that were not recognized and delineated. Each version of the treatment may influence a particular individual in a different way, so an individual may have more than one potential outcome for what is considered in the study to be a single treatment condition. These different versions of treatment arise from ambiguities in the measurement or operationalization of the treatment (violation of the second aspect of SUTVA) or from effects of the treatments received by others (violation of the first aspect of SUTVA).

SUTVA is a useful assumption, but as with all assumptions, there are circumstances in which it is not credible. What can be done in these circumstances? When researchers suspect that there may be spillover between units in different treatment groups, they can change their unit of analysis. Students assigned to attend a tutoring program to improve their grades might interact with other students in their school who were not assigned to the tutoring program and influence the grades of these control students. To enable causal inference, the analysis might be completed at the school level rather than the individual level. SUTVA would then require no interference across schools, a more plausible assumption than no interference across students. However, this approach is somewhat unsatisfactory. It generally entails a sharp reduction in sample size. More importantly, it changes the question that we can answer: no longer can we learn about the performance of individual students, we can only learn about the performance of schools.

There is no fully satisfactory statistical solution for circumstances in which SUTVA is violated. Manski provides some bounds on treatment effects in the presence of social interactions. Unfortunately, these bounds are often uninformative, since when SUTVA is violated random assignment to treatment arms does not identify treatment effects. Sinclair suggests using multi-level experimental designs to identify spillover effects empirically. This approach (which relies on multiple rounds of randomization to test whether treatment effects are over-identified, as we would expect if there were no spillovers) is appealing, as the process of diffusion within networks is of great scientific interest. However, it does not help identify treatment effects when spillovers are present. Neither can we simply assume that effects estimated under SUTVA represent upper bounds on the true effects, because it is possible that interference across units intensifies the treatment effects rather than diluting them.


In the following subtopic, we present the different treatment effects models (rigorous impact evaluation methods) taking into account the above limitations of impact evaluations.

2.4.2. Impact Evaluation Methods

Impact evaluation designs are identified by the type of method used to generate the counterfactual and can be broadly classified into three categories: experimental, quasi-experimental and non-experimental designs. These vary in feasibility, cost, the degree of involvement required during the design or implementation phase of the intervention, and the degree of selection bias. We will look at each of them in detail.

Experimental design or randomization: Under these evaluations, the treatment and comparison groups are selected randomly, and the comparison group is isolated from the intervention as well as from any other interventions that may affect the outcome of interest. These evaluation designs are referred to as randomized control trials (RCTs). In experimental evaluations the comparison group is called a control group. When randomization is implemented over a sufficiently large sample with no contamination by the intervention, the only difference between the treatment and control groups on average is that the latter does not receive the intervention. Random sample surveys, in which the sample for the evaluation is chosen on a random basis, should not be confused with randomized control trial evaluation designs, which require the random assignment of the treatment.

The RCT approach is often held up as the 'gold standard' of evaluation, and it is the only evaluation design that can conclusively account for selection bias in establishing a causal relationship between intervention and outcomes. Randomization and isolation from interventions might not be practicable in the realm of social policy, and may also be ethically difficult to defend (Ravallion, 2009), although there may be opportunities to use natural experiments. Bamberger and White (2007) highlight some of the limitations of applying RCTs to development interventions. Methodological critiques have been made by Scriven (2008) on account of the biases introduced because social interventions cannot be triple blinded, and Deaton (2009) has pointed out that in practice the analysis of RCTs falls back on the regression-based approaches they seek to avoid, and so is subject to the same potential biases. Other problems include the often heterogeneous and changing contexts of interventions, logistical and practical challenges, difficulties with monitoring service delivery, access to the intervention by the comparison group, and changes in selection criteria and/or the intervention over time.

Quasi-experimental design: Quasi-experimental approaches can remove bias arising from selection on observables and, where panel data are available, time-invariant unobservables. The most widely used quasi-experimental methods include matching, regression discontinuity design, differencing and instrumental variables, and are usually carried out using multivariate regression analysis.

If selection characteristics are known and observed then they can be controlled for to remove the bias. Matching involves comparing program participants with non-participants based on observed selection characteristics. Propensity score matching (PSM) uses a statistical model to calculate the probability of participating on the basis of a set of observable characteristics, and matches participants and non-participants with similar probability scores. Regression discontinuity design exploits a decision rule as to who does and does not get the intervention to compare outcomes for those just either side of this cut-off. Difference-in-differences or double differences, which use data collected at baseline and end-line for intervention and comparison groups, can be used to account for selection bias under the assumption that unobservable factors determining selection are fixed over time (time invariant). Instrumental variables estimation accounts for selection bias by modelling participation using factors (‘instruments’) that are correlated with selection but not the outcome, thus isolating the aspects of program participation which can be treated as exogenous.

Non-experimental design: Non-experimental impact evaluations are so called because they do not involve a comparison group without access to the intervention. Instead, they compare the intervention group before and after implementation of the intervention. Interrupted time-series (ITS) evaluations require multiple data points on treated individuals both before and after the intervention, while before-versus-after (or pre-test post-test) designs require only a single data point before and after. Post-test analyses include data after the intervention from the intervention group only. Non-experimental designs are the weakest evaluation design, because in order to show a causal relationship between intervention and outcomes convincingly, the evaluation must demonstrate that any likely alternative explanations for the outcomes are irrelevant. However, there remain applications for which this design is relevant, for example in calculating time savings from an intervention which improves access to amenities. In addition, there may be cases where non-experimental designs are the only feasible impact evaluation design, such as universally implemented programmes or national policy reforms in which no isolated comparison groups are likely to exist.

2.4.2.1. Randomized Controlled Trial (RCT)

Randomized controlled trials (RCTs), or randomized impact evaluations, are a type of impact evaluation that uses randomized access to social interventions as a means of limiting bias and generating an internally valid impact estimate. An RCT randomizes who receives an intervention (the treatment group) and who does not (the control group), and then compares outcomes between the two groups; this comparison gives us the impact of the intervention. RCTs do not necessarily require a "no treatment" control: randomization can just as easily be used to compare different versions of the same intervention, or different interventions trying to tackle the same problem. In this way, the control mimics the counterfactual, which is defined as what would have happened to the same individuals at the same time had the intervention not been implemented. It is, by definition, impossible to observe the counterfactual, since the same individuals cannot exist in two states of the world at once.

Many times, evaluations compare groups that are quite different to the group receiving the intervention. For example, if we compare the outcomes for women who take up microcredit to those that do not, it could be that women who choose not to take up microcredit were different in important ways that would affect the outcomes. For example, women who do not take up microcredit might be less motivated, or less aware of financial products.

Using a randomization approach means that a target population is first identified by the intervention implementer, and then intervention access is randomized within that population.


Instead of randomizing individuals, randomization can be done at cluster levels, such as villages, or schools, or health clinics. These are known as cluster randomized control trials.

There are two main reasons to randomize at a level larger than the individual. First, it can address contamination: treated individuals may mix and chat and potentially "share" treatment with individuals in the control group. This would "contaminate" our impact, and our control group would no longer be a good comparison. Randomizing at the village level may minimize the risk of this happening. Second, we might want to randomize at the level at which the intervention would actually be implemented: for example, an intervention which provides electrification to schools. It is logistically impractical, if not impossible, to randomize electricity access over individual schoolchildren.

RCT and causal inference assumptions

1. Independence or ignorability assumption

The RCT is the best available study design for exploring causal effects. Random assignment means that the treatment has been assigned to units independently of their potential outcomes. Thus, the mean potential outcomes of the treatment group and control group are the same for a given state of the world.

2. Random assignment solves the selection problem

Randomized experiments solve the problem of selection bias by generating an experimental control group of people who would have participated in a program but who were randomly denied access to the program or treatment. The random assignment does not remove selection bias but instead balances the bias between the participant and nonparticipant samples. To see this formally, the simple difference in mean outcomes (SDO) can be decomposed as:

EN[yi | ti = 1] − EN[yi | ti = 0] = (E[Y1] − E[Y0]) + (E[Y0|T = 1] − E[Y0|T = 0]) + (1 − π)(ATT − ATU)      (2.4.2)


Note that E[Y1] − E[Y0] is the ATE, E[Y0|T = 1] − E[Y0|T = 0] is the selection bias, and the last term, (1 − π)(ATT − ATU), is the heterogeneous treatment effect bias, where π is the share of treated units. Consider first the selection bias term of the decomposition of the SDO:

E[Y0|T = 1] − E[Y0|T = 0]

If treatment is independent of the potential outcomes, then E[Y0|T = 1] = E[Y0|T = 0], and the selection bias zeroes out:

E[Y0|T = 1] − E[Y0|T = 0] = E[Y0|T = 0] − E[Y0|T = 0] = 0

3. Random assignment eliminates heterogeneous treatment effect bias

How does randomization affect the heterogeneous treatment effect bias? Rewriting the term that multiplies (1 − π):

ATT − ATU = E[Y1|T = 1] − E[Y0|T = 1] − E[Y1|T = 0] + E[Y0|T = 0] = 0

which equals zero because, under independence, E[Y1|T = 1] = E[Y1|T = 0] and E[Y0|T = 1] = E[Y0|T = 0]. Hence, if treatment is independent of the potential outcomes, EN[yi|ti = 1] − EN[yi|ti = 0] = E[Y1] − E[Y0], which implies that SDO = ATE.
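A minimal simulation sketch (hypothetical values, continuing the notation above) illustrates this result: when treatment is randomly assigned, the simple difference in mean observed outcomes recovers the ATE, because the selection bias and heterogeneous treatment effect terms vanish in expectation.

* Illustrative check that SDO = ATE under random assignment (hypothetical data).
clear
set obs 10000
set seed 2020
generate y0 = 50 + rnormal(0, 5)
generate y1 = y0 + 5 + rnormal(0, 2)        // true ATE is 5
generate t  = (runiform() > 0.5)            // randomized treatment
generate y  = cond(t, y1, y0)               // only one potential outcome is observed
regress y t                                 // coefficient on t is close to 5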

4. Stable Unit Treatment Value Assumption (SUTVA)

SUTVA means that average treatment effects are parameters that assume (1) a homogeneous dosage and (2) potential outcomes that are invariant to who else (and how many others) is treated.

i. Homogeneous dose

SUTVA constrains what the treatment can be. Individuals receive the same treatment, i.e., the "dose" of the treatment is the same for each member of the treatment group. That is the "stable unit" part. If we are estimating the effect of hospitalization on health status, we assume everyone gets the same dose of the hospitalization treatment. It is easy to imagine violations, though, if hospital quality varies across individuals. But that just means we have to be careful about what we are and are not defining as the treatment.


ii. No externalities

What if hospitalizing Jack (hospitalized, T = 1) actually amounts to vaccinating Jack against smallpox? If Jack is vaccinated against smallpox, then Jill's potential health status without vaccination may be higher than when he is not vaccinated. In other words, Jill's potential outcomes may vary with what Jack does, regardless of whether she herself receives treatment. SUTVA means that there is no problem like this: if there are no externalities from treatment, then the outcome of each unit i is stable regardless of whether someone else also receives the treatment.

iii. Partial equilibrium only

This is easier to imagine with a different example. Let's say we are estimating the effect of some technological innovation that lowers the cost functions of firms in competitive markets. A decrease in cost raises profits in the short run, but positive profits lead to firm entry in the long run. Firm entry in the long run causes the supply curve to shift right, pushing market prices down until price equals average total cost. The first effect, the short-run response to the decrease in cost, is the only thing we can estimate with potential outcomes.

2.4.2.2. Propensity Score Matching (PSM)

Randomization is often not an option for practical reasons. For example, implementing agencies may not be willing to accept randomization. Networked infrastructure may not be possible to roll out effectively in a random sequence. Large investment projects may not be able to be altered in location or sequencing. Or interest in the impact evaluation may have arisen only after the program is already under way or even completed. When randomization is not possible, impact often can still be estimated through a range of quasi-experimental designs. Quasi-experimental methods form a comparison group by statistical means rather than by random assignment; among these, matching methods have gained popularity in recent years as a tool for impact evaluation.


It assumes that selection can be explained purely in terms of observable characteristics. For every unit in the treatment group a matching unit (twin) is found among the non-treatment group. The choice of the match is dictated by observable characteristics. What is required is to match each unit exposed to the treatment with one or more non-treated units sharing similar observable characteristics.

The degree of similarity between different units is measured on the basis of the probability of being exposed to the intervention given a set of observable characteristics not affected by the programme, the so-called propensity score. The idea is to find, from a large group of non-participants, units that are observationally similar to participants in terms of characteristics not affected by the intervention. The mean effect of treatment can then be calculated as the average difference in outcomes between the treated and non-treated units after matching.

More specifically, if T =1 for treated group and T = 0 for comparison group, then the average treatment effect on the treated (ATT) on an outcome variable Y is given by:

ATT= E(Y1-Y0|T=1) = E(Y1|T=1)-E(Y0|T=1) (2.4.3)

PSM implementation steps

Impact evaluation using PSM can be implemented as follows:

Step 1: Propensity score estimation

The propensity score is the probability that a unit in the target group (treated and control units) receives treatment, given its observed characteristics Xi; formally:

Pr(Ti = 1 | Xi)      (2.4.4)

The propensity score is a balancing score in the sense that, as demonstrated by Rosenbaum and Rubin (1983), if two units have similar propensity scores, then they are also similar with respect to the set of covariates X used for its estimation. The probability is usually obtained from a probit/logistic regression and is then used to create a counterfactual group. The predicted probability is the propensity score.
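As a minimal sketch of Step 1 (the treatment indicator treat and covariates x1-x3 are hypothetical placeholders), the score can be estimated with a logit model followed by predict:

* Step 1 sketch: estimate the propensity score and store the predicted probability.
logit treat x1 x2 x3
predict pscore, pr                 // pscore = Pr(T = 1 | X)
summarize pscore if treat == 1
summarize pscore if treat == 0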

Step 2: Check overlap and common support

Comparing the incomparable must be avoided, i.e. only the subset of the comparison group that is comparable to the treatment group should be used in the analysis. Hence, an important step is to check that there is at least one treated unit and one non-treated unit at each value of the propensity score. Several methods are suggested in the literature, but the most straightforward one is a visual analysis of the density distribution of the propensity score in the two groups.

Another possible method is based on comparing the minima and maxima of the propensity score in the treated and in the non-treated group. Both approaches require deleting all the observations whose propensity score is smaller than the minimum or larger than the maximum in the opposite group.
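A minimal sketch of both checks, assuming the propensity score pscore and treatment indicator treat from the previous step (all names hypothetical):

* Visual check of overlap: compare the propensity-score densities by treatment status.
twoway (kdensity pscore if treat == 1) ///
       (kdensity pscore if treat == 0), ///
       legend(order(1 "Treated" 2 "Non-treated")) xtitle("Propensity score")

* Min-max common-support rule: keep only the overlap region.
summarize pscore if treat == 1
scalar min_t = r(min)
scalar max_t = r(max)
summarize pscore if treat == 0
scalar min_c = r(min)
scalar max_c = r(max)
drop if pscore < max(min_t, min_c) | pscore > min(max_t, max_c)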

Figure 2.4.1: Example of distribution of propensity scores


Step 3: Choose a matching algorithm

The next step consists of matching treated and non-treated units that have similar propensity scores using an appropriate algorithm. The matching algorithms differ not only in the way they measure the degree of similarity between treated and non-treated units but also with respect to the weight they assign to the matched units. Three major matching algorithms used in the literature are: Nearest Neighbor matching (NNM), Caliper matching (CM) and Kernel matching (KM).

Nearest neighbor matching: The treated unit is matched with the unit in the comparison group that presents the closest estimated propensity score. Two variants are possible: matching with replacement (an untreated unit can be used more than once as a match) and matching without replacement (an untreated unit can be used only once as a match).

Caliper matching: Nearest-neighbor matching faces the risk of bad matches, if the closest neighbor is not sufficiently similar. This can be avoided by imposing the condition that, in order to be matched, the propensity score of treated and non-treated units should not differ, for example, by more than 5%. This tolerance level (5%) is called the caliper.

Kernel matching: The two matching algorithms discussed above have in common that only some observations from the comparison group are used to construct the counterfactual outcome of a treated unit. KM uses weighted averages of all individuals in the control group to construct the counterfactual outcome. Weights depend on the distance between each individual from the control group and the unit exposed to the treatment for which the counterfactual is estimated. The kernel function assigns higher weight to observations close in terms of propensity score to a treated individual and lower weight to more distant observations.

According to Caliendo and Kopeinig (2008), the best matching algorithm is the one with: a large matched sample size, a large number of insignificant covariates after matching, a low standardized mean bias (3-5%), and a low pseudo-R2.


Step 4: Matching quality

The quality of the matching procedure is evaluated on the basis of its capacity to balance the control and treatment groups with respect to the covariates used for the propensity score estimation. Several procedures exist for testing matching quality: the standardized bias, t-tests, joint significance tests and the pseudo-R-squared. The basic idea of all approaches is to compare the distribution of these covariates in the two groups before and after matching on the propensity score. If significant differences remain after matching, then matching on the propensity score was not (completely) successful in making the groups comparable and remedial measures have to be taken. The Stata command for testing matching quality is: pstest x1 x2 x3 … xk.
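A minimal balance-checking sketch using the pstest command mentioned above, which belongs to the community-contributed psmatch2 package (install once with: ssc install psmatch2); the variables treat, x1-x3 and outcome y are hypothetical placeholders:

* Step 4 sketch: match, then compare covariate balance before and after matching.
psmatch2 treat x1 x2 x3, outcome(y) neighbor(1) common
pstest x1 x2 x3, both          // standardized bias and t-tests, before vs after matching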

Step 5: Effect estimation

After the match has been judged to be of acceptable quality, computing the effect becomes a fairly easy task: it amounts to calculating the average difference in the outcome variable between the treated units and their matched non-treated units.

Step 6: Sensitivity analysis

Matching methods are not robust against hidden bias arising from unobserved variables that simultaneously affect assignment to treatment and the outcome variable. Rosenbaum bounds are the most widely used sensitivity analysis technique. The Stata command for sensitivity analysis using Rosenbaum bounds is: rbounds difscore, gamma(1 (0.1) 2).

What is needed to conduct PSM?

Propensity score matching requires data on both a treatment group and an untreated group from which the comparison group is drawn. The data should include the community, household and individual characteristics that determine programme participation, arising from both programme placement and self-selection. Both samples need to be larger than the sample size suggested by simple power calculations, since observations outside the region of common support are discarded. In practice, the researcher does not need to perform the above steps manually: statistical packages provide a single command to conduct the analysis, such as the "teffects psmatch" command in STATA.


Advantages and disadvantages of PSM

The advantage of propensity models is that they do not necessarily require a baseline or panel survey (especially for the outcome variables), although the observed covariates entering the logit/probit model for the propensity score estimation would have to satisfy the conditional mean independence assumption by reflecting observed characteristics that are not affected by participation. The drawback of propensity methods is that they rely on the degree to which observed characteristics drive programme participation, and do not control for unobservable variables that could lead to bias in the estimates. Thus, it is important to test the sensitivity of results with respect to such bias and small changes to the matched or weighted samples. With PSM, it is recommended to adopt the Rosenbaum (2002) bounds test, which suggests how strongly an unmeasured variable must influence the selection process to undermine implications of matching analysis. In all cases, if the results are sensitive and there are doubts about the conditional independence assumption, alternative identifying assumptions should be considered.

Example 2.4.3: We will illustrate the use of teffects psmatch by using data from a study of the effect of a mother’s smoking status during pregnancy (mbsmoke) on infant birthweight (bweight) as reported by Cattaneo (2010). This dataset also contains information about each mother’s age (mage), education level (medu), marital status (mmarried), whether the first prenatal exam occurred in the first trimester (prenatal1), whether this baby was the mother’s first birth (fbaby), and the father’s age (fage).

Estimating the ATE

We begin by using teffects psmatch to estimate the average treatment effect of mbsmoke on bweight. We use a logistic model (the default) to predict each subject's propensity score, using covariates mage, medu, mmarried, and fbaby. Because the performance of PSM hinges upon how well we can predict the propensity scores, we will use factor-variable notation to include both linear and quadratic terms for mage, the only continuous variable in our model:

• use http://www.stata-press.com/data/r13/cattaneo2
(Excerpt from Cattaneo (2010) Journal of Econometrics 155: 138-154)


• teffects psmatch (bweight) (mbsmoke mmarried c.mage##c.mage fbaby medu)

Table 2.4.1. Treatment-effects estimation result

Treatment-effects estimation              Number of obs = 4642
Estimator: propensity-score matching      Matches: requested = 1
Outcome model: matching                            min = 1
Treatment model: logit                             max = 74

ATE of mbsmoke (smoker vs nonsmoker) on bweight: Coef. = -210.97; AI robust std. err. = 32.02; z = -6.59; P>|z| = 0.00; 95% conf. interval = [-273.73, -148.21]

Smoking causes infants' birthweights to be reduced by an average of 211 grams. By default, teffects psmatch estimates the ATE by matching each subject to a single subject with the opposite treatment whose propensity score is closest. Sometimes, however, we may want to ensure that matching occurs only when the propensity scores of a subject and a match differ by less than a specified amount. To do that, we use the caliper() option. If a match within the distance specified in the caliper() option cannot be found, teffects psmatch exits.

Specifying the caliper

Here we reconsider the previous example, first specifying that we only want to consider a pair of observations a match if the absolute difference in their propensity scores is less than 0.03:

• teffects psmatch (bweight) (mbsmoke mmarried c.mage##c.mage fbaby medu), caliper(0.03)
No nearest-neighbor matches for observation 2209 within caliper 0.03; this is not allowed
r(459);


The error arose because there is not a smoking mother whose propensity score is within 0.03 of the propensity score of the nonsmoking mother in observation 2209. If we instead raise the caliper to 0.10, we have matches for all subjects and therefore obtain the same results as in example 1: • teffects psmatch (bweight) (mbsmoke mmarried c.mage##c.mage fbaby medu), caliper(0.1)

Table 2.4.2. Treatment-effects estimation result using caliper (0.1) estimator

Treatment-effects estimation              Number of obs = 4642
Estimator: propensity-score matching      Matches: requested = 1
Outcome model: matching                            min = 1
Treatment model: logit                             max = 74

ATE of mbsmoke (smoker vs nonsmoker) on bweight: Coef. = -210.97; AI robust std. err. = 32.02; z = -6.59; P>|z| = 0.00; 95% conf. interval = [-273.73, -148.21]

The above result highlights that estimating the ATE requires finding matches for both the treated and control subjects. In contrast, estimating the ATET only requires finding matches for the treated subjects. Because subject 2209 is a control subject, we can estimate the ATET using caliper (0.03). We must also specify vce(iid) because the default robust standard errors for the estimated ATET require viable matches for both treated subjects and control subjects. This requirement comes from the nonparametric method derived by Abadie and Imbens (2012). • teffects psmatch (bweight) (mbsmoke mmarried c.mage##c.mage fbaby medu), atet vce(iid) caliper(0.03)


Table 2.4.3. Treatment-effects estimation result using caliper (0.03) estimator

Treatment-effects estimation              Number of obs = 4642
Estimator: propensity-score matching      Matches: requested = 1
Outcome model: matching                            min = 1
Treatment model: logit                             max = 74

ATET of mbsmoke (smoker vs nonsmoker) on bweight: Coef. = -236.79; std. err. = 26.12; z = -9.07; P>|z| = 0.00; 95% conf. interval = [-287.97, -185.60]

In the previous examples, each subject was matched to at least one other subject, which is the default behavior for teffects psmatch. However, we can request that teffects psmatch match each subject to multiple subjects with the opposite treatment level by specifying the nneighbor() option. Matching on more distant neighbors can reduce the variance of the estimator at a cost of an increase in bias.

Finally, we request that teffects psmatch match a mother to four mothers in the opposite treatment group: • teffects psmatch (bweight) (mbsmoke mmarried c.mage##c.mage fbaby medu), nneighbor(4)

Table 2.4.4. Treatment-effects estimation result using nneighbor (4) estimator

Treatment-effects estimation              Number of obs = 4642
Estimator: propensity-score matching      Matches: requested = 1
Outcome model: matching                            min = 1
Treatment model: logit                             max = 74

ATE of mbsmoke (smoker vs nonsmoker) on bweight: Coef. = -224.01; AI robust std. err. = 29.89; z = -7.50; P>|z| = 0.00; 95% conf. interval = [-282.58, -165.43]

2.4.2.3. Regression Discontinuity Design (RDD)

Regression Discontinuity Design (RDD) is a key method in the toolkit of any applied researcher interested in analyzing the causal effects of interventions. The method was first used in 1960 by Thistlethwaite and Campbell, who were interested in identifying the causal impacts of merit awards, assigned based on observed test scores, on future academic outcomes (Lee and Lemieux, 2010). The use of RDD has increased exponentially in the last few years. Researchers have used it to evaluate electoral accountability; small and medium enterprise policies; social protection programs such as conditional cash transfers; and educational programs such as school grants. RDD is a quasi-experimental impact evaluation method used to evaluate programs that have a cutoff point determining who is eligible to participate. It allows researchers to compare the people immediately above and below the cutoff point to identify the impact of the program on a given outcome. In RDD, assignment of treatment and control is not random, but rather based on some clear-cut threshold (or cutoff point) of an observed variable such as age, income and score. RDD requires a continuous eligibility score on which the population of interest is ranked and a clearly defined cutoff point above or below which the population is determined eligible for a program. Causal inference is then made comparing individuals on both sides of the cutoff point. This subtopic will cover when to use RDD, sharp vs. fuzzy design, how to interpret results, and methods of treatment effect estimation.


Conditions and Assumptions for RDD

Conditions

There are two main conditions that are needed in order to apply RDD:

1. A continuous eligibility index: a continuous measure on which the population of interest is ranked (i.e. test score, poverty score, age).

2. A clearly defined cutoff point: a point on the index above or below which the population is determined to be eligible for the program. For example, students with a test score of at least 80 of 100 might be eligible for a scholarship, households with a poverty score less than 60 out of 100 might be eligible for food stamps, and individuals age 67 and older might be eligible for pension. The cutoff points in these examples are 80, 60, and 67, respectively. The cutoff point may also be referred to as the threshold.

Assumptions

1. The eligibility index should be continuous around the cutoff point. There should be no jumps in the eligibility index at the cutoff point, nor any other sign of individuals manipulating their eligibility index in order to increase their chances of being included in or excluded from the program. The McCrary density test checks this assumption by testing the density function of the eligibility index for discontinuities around the cutoff point (a minimal command sketch is given after this list).

2. Individuals close to the cutoff point should be very similar, on average, in observed and unobserved characteristics. In the RD framework, this means that the distribution of the observed and unobserved variables should be continuous around the threshold. Even though researchers can check similarity between observed covariates, the similarity between unobserved characteristics has to be assumed. This is considered a plausible assumption to make for individuals very close to the cutoff point, that is, for a relatively narrow window.
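A minimal sketch of the density (manipulation) test mentioned in the first assumption, using the community-contributed rddensity command (install with: ssc install rddensity); the running variable score and the cutoff of 60 are hypothetical placeholders:

* McCrary-type manipulation test: a discontinuity in the density of the
* running variable at the cutoff suggests manipulation of eligibility.
rddensity score, c(60)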


Fuzzy vs Sharp RDD

The assignment rule indicates how people are assigned or selected into the program. In practice, the assignment rule can be deterministic or probabilistic (see Hahn et al., 2001). If deterministic, the regression discontinuity takes a sharp design; if probabilistic, the regression discontinuity takes a fuzzy design.

Figure 2.4.2: Sharp and fuzzy RDD examples

Sharp RDD

In a sharp RDD, the probability of treatment changes from 0 to 1 at the cutoff. There are no crossovers and no no-shows. For example, if a scholarship award is given to all students above a threshold test score of 80%, then the assignment rule defines treatment status deterministically, with probabilities of 0 or 1. Thus, the design is sharp.

Fuzzy RDD

In fuzzy designs, the probability of treatment is discontinuous at the cutoff, but not with a definitive jump from 0 to 1. For example, if food stamp eligibility is given to all households below a certain income, but not all eligible households receive the food stamps, then the assignment rule defines treatment status probabilistically but not perfectly. Thus, the design is fuzzy. The fuzziness may result from imperfect compliance with the law/rule/program; imperfect implementation that treated some non-eligible units or neglected to treat some eligible units; spillover effects; or manipulation of the eligibility index.

A fuzzy design assumes that, in the absence of the assignment rule, some of those who take up the treatment would not have participated in the program; the eligibility index acts as a nudge. The subgroup that participates in the program because of the selection rule is called the compliers (see e.g. Angrist and Imbens (1994), and Imbens et al. (1996)); under the fuzzy RDD, treatment effects are estimated only for this group of compliers. The estimates of the causal effect under the fuzzy design require more assumptions than under the sharp design, but these assumptions are weaker than those of a generic IV approach.

Interpretation of RDD estimates

RDD estimates local average treatment effects around the cutoff point, where treatment and comparison units are most similar. The units to the left and right of the cutoff look more and more similar as they approach the cutoff. Given that the design meets all the assumptions and conditions outlined above, the units directly to the left and right of the cutoff point should be so similar that they provide as good a basis for comparison as randomized assignment of the treatment would.

Because the RDD estimates the local average treatment effect around the cutoff point, the estimate does not necessarily apply to units with scores further away from the cutoff point. These units may not be as similar to each other as the eligible and ineligible units close to the cutoff. RDD's inability to compute an average treatment effect for all programme participants is both a strength and a limitation, depending on the question of interest. If the evaluation primarily seeks to answer whether the program should exist or not, then the RDD will not provide a sufficient answer: the average treatment effect for the entire eligible population would be the most relevant parameter in this case. However, if the policy question of interest is whether the program should be cut or expanded at the margin, then the RDD produces precisely the local estimate of interest to inform this important policy decision.

Note that the most recent advances in the RDD literature suggest that it is not very accurate to interpret a discontinuity design as a local experiment. To be considered as good as a local experiment for the units close enough to the cutoff point, one must use a very narrow bandwidth and drop the assignment variable (or a function of it) from the regression equation.

Treatment effect estimation in RDD

Two approaches are widely used for estimating treatment effects in RDD: parametric and non-parametric methods.

Parametric methods

The estimation of the treatment effects can be performed parametrically as follows:

Yi = ρ + γXi + h(Zi) + εi (2.4.5)

where Yi is the outcome of interest for individual i, Xi is an indicator that takes the value 1 for individuals assigned to the treatment and 0 otherwise, Zi is the assignment variable with an observable, clearly defined cutoff point, and h(Zi) is a flexible function of Z. The identification strategy hinges on the exogeneity of Z at the threshold. It is standard to centre the assignment variable at the cutoff point; in this case one would use h(Zi − Z0) instead, with Z0 being the cutoff. Under that assumption, the parameter of interest, γ, provides the treatment effect estimate.

In the case of a sharp design with perfect compliance, the parameter γ identifies the average treatment effect on the treated (ATT). In the case of a fuzzy design, γ corresponds to the intent-to-treat effect, i.e. the effect of eligibility, rather than of the treatment itself, on the outcomes of interest. The LATE can be estimated using an IV approach, as follows:


First stage: Pi = ρ + πXi + h(Zi) + εi      (2.4.6)

Second stage: Yi = θ + γP̂i + h(Zi) + vi      (2.4.7)

where Pi is a dummy variable that identifies actual participation of individual i in the program/intervention. Notice that with a parametric specification, the researcher should specify h(Zi) the same way in both regressions (Imbens and Lemieux, 2008).
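A minimal parametric sketch under these definitions. The variables y, z and p, the cutoff of 60 and the bandwidth of 10 are hypothetical placeholders, and h(·) is taken to be linear in the centred score:

* Sharp design: separate linear slopes on each side of the cutoff;
* the coefficient on 1.x is the treatment effect at the threshold.
generate zc = z - 60                       // centre the assignment variable at the cutoff
generate x  = (zc >= 0)                    // eligibility indicator
regress y i.x##c.zc if abs(zc) <= 10

* Fuzzy design: eligibility instruments actual participation p (2SLS);
* the coefficient on p is the LATE for compliers at the cutoff.
ivregress 2sls y c.zc (p = i.x) if abs(zc) <= 10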

Despite the natural appeal of a parametric approach such as the one just outlined, the method has some practical limitations. First, the right functional form of h(Zi) is never known.

Researchers are thus encouraged to fit the model with different specifications of h(Zi) (Lee and Lemieux, 2010), particularly when they must consider data farther away from the cutoff point to have enough statistical power.

Even though some authors test the sensitivity of results using higher-order polynomials, recent work argues against the use of high-order polynomials, given that they assign too much weight to observations away from the cutoff point (Gelman and Imbens, 2014).

Non-parametric methods

Another way of estimating treatment effects with RDD is via non-parametric methods. In fact, the use of non-parametric methods has been growing in recent years, both to estimate treatment effects and to check the robustness of estimates obtained parametrically. This might be partially explained by the increasing number of available Stata commands, but perhaps more importantly by some attractive properties of the method compared to parametric ones (see, e.g., Gelman and Imbens (2014) on this point). In particular, non-parametric methods provide estimates based on data closer to the cutoff, reducing the bias that may otherwise result from using data farther away from the cutoff to estimate local treatment effects.


The use of non-parametric methods does not come without costs: the researcher must make an array of decisions about the kernel function, the algorithm for selecting the optimal bandwidth size, and the specification. Gelman and Imbens (2014) suggest the use of local linear and at most local quadratic polynomials.
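A minimal non-parametric sketch using the community-contributed rdrobust command (install with: ssc install rdrobust), which implements local polynomial estimation with data-driven bandwidth selection; y, z and the cutoff of 60 are hypothetical placeholders:

* Local linear RD estimate with a triangular kernel and an
* automatically selected bandwidth.
rdrobust y z, c(60) p(1) kernel(triangular)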

Bandwidth size

In practice, the bandwidth size depends on data availability. Ideally, one would like to have enough sample to run the regressions using information very close to the cutoff. The main advantage of using a very narrow bandwidth is that the functional form h(Zi) becomes much less of a worry and treatment effects can be obtained with parametric regression using a linear or piecewise linear specification of the assignment variable (see Lee and Lemieux, 2010). However, Schochet (2008) points out two disadvantages to a narrow versus wider bandwidth:

1. For a given sample size, a narrower bandwidth could yield less precise estimates if the outcome-score relationship can be correctly modelled using a wider range of scores.

2. External validity: extrapolating results to units further away from the threshold using the estimated parametric regression lines may be more defensible if you have a wider range of scores to fit these lines over.

Placebo tests

Falsification (or placebo) tests are very important when using RDD as an identification strategy. The researcher needs to convince the reader (and referees!) that the discontinuity exploited to identify the causal impact of an intervention was most likely caused by the assignment rule of the intervention. In practice, researchers use fake cutoffs or different cohorts to run these tests.

Advantages and disadvantages of RDD

RDD controls for unobservables more completely than other quasi-experimental methods such as matching. It can also utilize administrative data to a large extent, thus reducing the need for primary data collection, though the outcome data for the rejected applicants may need to be collected.


The limits of the technique are that there need to have been clear assignment criteria and sufficient sample sizes for the analysis. A challenge for RDD is often to obtain sufficient observations on either side of the threshold. The Mongolia Food Stamps Program study collected data from a purposive sample around the PMT threshold to avoid this problem.

A further limitation is that the impact is estimated only for the population close to the threshold. The estimate is called a local average treatment effect (LATE), rather than an average treatment effect for the whole treated population. In principle, this limitation restricts the external validity of the approach. Still, it may be argued that LATE gives information on the effect at the margin of eligibility, and thus is a good proxy for what would be expected if the program were expanded.

Example 2.4.4: The study analyzed the impact of a scholarship program, the Cambodia Education Sector Support Program (CESSP), on school enrollment, selection, and test scores using a sharp regression discontinuity design (RDD). Eligibility for the $45 scholarship depended on an independently calculated dropout risk score created using household characteristics. The data used included the composite dropout risk scores, mathematics and vocabulary test scores, and a household survey. The cut-off score varied by school size, and the estimates were weighted averages. The sample used in the analysis consisted of children within a ten-point bandwidth range around the cut-off score. The evaluation found that school enrollment and attendance increased 20-25% due to the scholarships. Years of schooling increased by 0.21 years, while annual educational expenses paid by the family rose by $9. The actual program impact was likely to be higher since recipients were, on average, poorer than non-recipients. No significant impacts on learning outcomes were found. However, recipients had better knowledge of HIV/AIDS, and the program had a positive effect on the recipients’ mental health (Filmer and Schady, 2009).


Figure 2.4.3: Program effects on school attendance

2.4.2.4. Difference-in-Differences (DiD) method

DiD estimates are based on the difference in the changes in the outcome between treatment and comparison groups over time. Fixed effects models combine differencing with multivariate models that can account for differences in observed variables over time. The method takes the trajectory of the comparison group as the counterfactual trajectory for the treatment group. That is, the change in the outcome that takes place in the comparison group is taken as what would have happened to the treatment group in the absence of the intervention. Therefore, subtracting the change in the outcome observed in the comparison group from that observed in the treatment group gives the measure of impact. The effects of all factors that do not change over time or that do not affect changes over time are thereby eliminated from the impact estimate. Many determinants of program placement or participation can be expected to be rather time invariant, hence the attractiveness of this approach.


The DiD method exploits the time dimension of the data to define the counterfactual. It requires data on both treated and control groups, before and after the treatment takes place. The ATT_DID is estimated by comparing the difference in outcomes between treated and control groups in some period after the participants have completed the programme with the difference that existed before the programme. The treatment effect is thus obtained by taking two differences between group means. In a first step, the before-after mean difference is computed for each group: (Ȳ_a^T − Ȳ_b^T | D = 1) and (Ȳ_a^C − Ȳ_b^C | D = 0), where the subscripts a and b denote "after" and "before" the policy intervention, and the superscripts T and C indicate the treatment and the control group, respectively.

The average treatment effect on the treated is the difference of these two differences:

ATT_DID = (Ȳ_a^T − Ȳ_b^T | D = 1) − (Ȳ_a^C − Ȳ_b^C | D = 0)      (2.4.8)

This method relies on three main assumptions:

1. The unobserved heterogeneity is time invariant, so that it cancels out when comparing the before and after situations.

2. The Stable Unit Treatment Value Assumption (SUTVA) holds: there should be no spillover effects between the treatment and control groups, as the treatment effect would then not be identified (Duflo, Glennerster and Kremer, 2008). Furthermore, the control variables should be exogenous, i.e. unaffected by the treatment; otherwise the estimated coefficient β̂ in the DiD regression (2.4.10) will be biased. A typical approach is to use covariates that predate the intervention itself, although this does not fully rule out endogeneity concerns, as there may be anticipation effects. In some DiD studies and data sets, the controls may be available for each time period, which is fine as long as they are not affected by the treatment. These assumptions also imply that there should be no compositional changes over time. For example, if individuals in poor health move from the control region to the treatment region, the impact of a health reform would likely be underestimated.

3. The validity of the DiD approach relies on the parallel trends assumption, that is, the assumption that no time-varying differences exist between the treatment and control groups. In other words, DiD provides an unbiased estimate of programme impact if the outcome variable would have followed the same trajectory over time in the treatment and comparison groups in the absence of the intervention. The parallel trends assumption can be tested with pre-intervention data if they are available, though such a test only lends support to the assumption rather than demonstrating that it is valid. The assumption is more likely to hold if a matching method, such as propensity score matching (described in 2.4.2.2), has been used to control for observable causes of differences in trajectory.

Fixed effects models combine DiD approaches with multivariate regression. This allows other factors that may influence the outcome to be controlled for, giving a stronger estimate than the simple DiD of mean outcomes. By including observable changes in covariates over time in the regression, the parallel trends assumption needs to hold only with respect to unobservable factors.

DiD implementation steps

DiD estimation requires three major steps:

Step 1: Choose the periods before and after the intervention
Step 2: Discuss and test the common trend assumption
Step 3: Effect estimation

Step 1: Choose the periods before and after the intervention

This choice is usually limited by data availability.

Step 2: Discuss and test the common trend assumption

If data on the outcome of both groups are available for pre-programme periods other than the one used to estimate ATT_DID, the validity of the common trend assumption can be tested: when applied to pre-programme periods only, the estimated ATT_DID should be zero, as no programme has yet been implemented. If, due to some unobservable characteristics, the treated and control groups respond differently to a common shock (e.g. a macroeconomic shock), the common trend assumption is violated and DiD will either under- or overestimate the ATT. To overcome this problem, Bell et al. (1999) suggest a 'trend adjustment' technique: a triple difference method that accounts for these differential time-trend effects. In this case, data on both treated and control groups are needed not only in the before and after periods but also in two other periods, say t and t′ (with t′ < t), both prior to the programme.

The triple differences estimator, DDD (difference-in-differences-in-differences), is:

ATT_DDD = [(Ȳ_a^T − Ȳ_b^T | D = 1) − (Ȳ_a^C − Ȳ_b^C | D = 0)] − [(Ȳ_t^T − Ȳ_t′^T | D = 1) − (Ȳ_t^C − Ȳ_t′^C | D = 0)]      (2.4.9)

where the last term is the trend differential between the two groups, measured between t and t′.
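A minimal placebo sketch of the common trend test, assuming a hypothetical panel with outcome y, treatment-group indicator treat, cluster identifier id and year, and a programme that starts in 2015: estimating the DiD interaction on pre-programme years only, with a fake "post" period, should give a coefficient close to zero.

* Placebo DiD on pre-programme years only (fake post period = 2014).
generate placebo_post = (year == 2014)
regress y i.treat##i.placebo_post if year < 2015, vce(cluster id)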

Step 3: Effect estimation

The ATT_DID is usually estimated within a regression framework. The key independent variables are: an indicator for members of the treated group, T_i; an indicator of the post-treatment period, t; and the interaction of treatment group status with the post-programme period, T_i × t:

y_it = α + β(T_i × t) + γT_i + πt + ε_it      (2.4.10)

The coefficient on the interaction term, β, is the ATT_DID, as it measures the difference between the two groups in the post-programme period relative to the pre-programme period.

This parametric approach is convenient for two reasons: i) for the estimation of standard errors; and ii) because it is a more flexible approach that allows including other explanatory variables, namely those that reflect differences between the groups’ initial conditions and those that would lead to differential time trends.
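A minimal sketch of the regression in (2.4.10), with hypothetical variables: outcome y, a 0/1 treatment-group indicator treat, a 0/1 post-programme indicator post, covariates x1 and x2, and cluster identifier id.

* DiD regression: the coefficient on the interaction term is the ATT_DID.
generate treat_post = treat * post
regress y treat_post treat post x1 x2, vce(cluster id)

* Equivalent factor-variable form.
regress y i.treat##i.post x1 x2, vce(cluster id)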

2.4.2.5. Instrumental variables (IV) method

In a traditional linear regression model, the outcome may be regressed on either an intervention dummy or a measure of participation in the intervention, such as the duration of training attended or the distance to a road. Under certain conditions, including that participation is driven by observable, measured characteristics, OLS will be unbiased. However, selection on unobservables introduces endogeneity into such an approach, leading to biased and inconsistent estimates. To correct this, the instrumental variables (IV) method can be used to obtain consistent estimates by using one or more variables that affect treatment, but not outcomes, as a proxy for the intervention (Reiersol, 1945).

The IV method

The instrumental variable method deals directly with selection on unobservables. The ATT_IV is identified if the researcher finds a variable, the instrument, which affects selection into treatment but is not directly related to the outcome of interest or to the unobserved variables that determine it. The instrument is a source of exogenous variation that is used to approximate randomization.

The choice of the instrument is the most crucial step in the implementation of the IV method, and should be carefully motivated by economic intuition or theory. The estimation of the ATTIV is achieved through a linear regression model:

yi = α + βDi + γXi + εi      (2.4.11)

where the parameter of interest is β, which indicates the effect of the treatment on the outcome, keeping other pre-determined variables constant.

In general, in non-experimental settings there is selection bias into treatment, namely selection on unobservable variables. Therefore, variables that affect simultaneously the outcome and the selection into treatment are often unobservable by the researcher (for instance innate ability or motivation). If the role of these unobservable variables is not taken into account, the ATT estimate will either be over or underestimated.

In the traditional simple OLS approach for dichotomous treatments, the outcome is regressed on a dummy variable for participation T (T = 1 for the treatment group and T = 0 for the comparison group), along with other variables which affect the outcome. The coefficient on T is the measure of impact. The problem with this approach is that selection bias can cause the estimate of the impact coefficient to be biased. If selection is entirely on observables, and the regression includes variables on all those observables, then OLS will indeed yield a valid impact estimate; this can rarely be assumed to be the case. If the unobservables are time invariant, then differencing removes their effect, so estimating the impact equation in differences will be unbiased. However, if there are time-varying unobservables, difference-in-differences will also yield biased impact estimates. Instrumental variable estimation is a technique that can remove this bias.

Instrumental variable estimation is a regression in which the variable that is the source of the endogeneity problem (i.e., T, because of selection bias) is replaced by an instrument (Z). This instrument has to satisfy two conditions:
1. It is correlated with T (termed "relevance").
2. It is not correlated with the outcome (Y) except through its effect on T; i.e., there is no direct relationship between Z and Y (termed the "exclusion restriction").

Example 2.4.5: A simple example might be an instrument for the effect of smoking on lung cancer. Those who smoke may have other characteristics that differ from those who do not, such as exercise or other risk taking, so that a direct regression on smoking is biased. Yet, taxation of cigarettes affects smoking, but does not affect lung cancer other than through effects on smoking, so that it can be used as an instrument for the effect of smoking. As another example, the impact of electricity access on households was estimated by using the distances from electricity poles as an instrument. Proximity to electricity poles determines the electricity connection fee, but does not affect outcomes directly, as poorer households tend to be closer to the poles. Generally, the challenge is to find a valid instrument that meets both conditions. Two methods already discussed can be seen as examples of IV in which valid instruments were described: RCTs and RDD.

It is common to estimate impact from RCT studies using a regression rather than simply comparing the means of the treatment and control groups. In that case, random assignment is being used as an instrument. The random assignment is correlated with participation (but is not the same variable if there are crossovers), and it is not correlated with the outcome by design. Fuzzy RDD is a special case of IV in which the instrument is an indicator for crossing the cutoff of the assignment variable.

Selection of the instruments is best undertaken by specifying the underlying structural model, which is derived from the theory of change. It will usually be the case that more than one instrument is identified. When there is more than one instrument, instrumental variables is often implemented as two-stage least squares: (i) in stage one, regress the endogenous variable (that measuring program participation) on the instruments and calculate the fitted value; and (ii) in stage two, estimate the outcome equation, replacing the endogenous variable with the fitted values from the first stage. The impact estimate is the coefficient on the fitted values.

What is needed for IV?

In practice, these two stages are not performed manually: software packages perform the calculations and report correct standard errors (the standard errors from a manually estimated second-stage regression would not be correct). For example, in STATA the command for instrumental variables is "ivregress", and "ivreg2" offers useful additional diagnostics (Baum et al., 2010). IV estimation requires data on treated and untreated observations, including the outcome and the instruments, as well as other confounding variables. If data are being collected for the study, it is important to have determined the instruments beforehand, so that the relevant questions are included in the survey instruments.
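As an illustration, the following is a minimal sketch of a two-stage least squares estimation in Stata. The variable names are hypothetical placeholders and are not taken from any example in this module: y is the outcome, d the endogenous participation dummy, z1 and z2 the instruments, and x1 and x2 exogenous controls; the data are assumed to be already in memory.

* Hypothetical variables: outcome y, endogenous treatment d, instruments z1 z2, controls x1 x2
ivregress 2sls y x1 x2 (d = z1 z2), vce(robust)
estat firststage      // instrument relevance (weak-instrument diagnostics)
estat endogenous      // test whether d is in fact endogenous
estat overid          // over-identification test (needs more instruments than endogenous regressors)

With a single instrument the model is exactly identified and the over-identification test is not available.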

Advantages and disadvantages of IV method

The advantage of IV is that, given a valid instrument, both observable and unobservable sources of selection bias are controlled for. The main disadvantage is that it may be difficult to find a valid instrument, as many factors that affect treatment also affect outcomes in some way. The approach also yields a LATE (local average treatment effect), which may be difficult for policy audiences to understand.


Topic Summary

Increasingly, people working in government and public policy are debating the best methods for evaluating government policies and programmes. How can we get the evidence we need to assess the impact of government investment in a way that is both transparent and defensible? How can we determine what works, in what circumstances and why, to the benefit of current and future policies and programmes? Impact evaluation seeks to determine the longer-term results that are generated by policy decisions, often through interventions, projects or programmes. Impacts may be positive or negative, intended or unintended, direct or indirect.

The choice of methods and designs for impact evaluation of policies and programmes in different sectors is not straightforward, and comes with a unique set of challenges. Policies and programmes may depend on contributions from other agencies and other actors, or take many years to emerge. Measuring direct cause and effect can be difficult. There is not one right way to conduct an impact evaluation. What is needed is a combination of methods and designs that suit the particular situation. When choosing these methods and designs, three issues need to be taken into account: the available resources and constraints; the nature of what is being evaluated; and the intended use of the evaluation. Additionally, impact evaluation involves three different types of questions— descriptive (the way things are or were), causal (how the programme has caused these things to change) and evaluative (overall value judgement of the merit or worth of the changes brought about).

With regards to the choice of evaluation methods, the most rigorous method for answering causal questions is a randomized controlled trial (RCT), where individuals, organizations or sites are randomly assigned either to receive a 'treatment' (participate in a programme) or not (in some cases receiving nothing, and in others receiving the current programme), and changes are compared. However, RCTs are not always possible or appropriate, since conducting an RCT requires reproducing what has been tested; undertaking and maintaining random allocation into treatment and control groups; and a sample size sufficient to detect differences between treatment and control groups. This Topic also discusses the use of quasi-experimental methods, such as propensity score matching, regression discontinuity, difference-in-differences and IV methods, which have their own advantages and limitations in the impact evaluation process.

Exercises on the Topic

1. Discuss in general: a) what we mean by impact evaluation and why it is important; b) what are the building blocks of an impact evaluation?
2. Discuss the impact evaluation problem and the requirements of counterfactual impact evaluation.
3. Explain briefly the advantages and disadvantages of experimental and quasi-experimental impact evaluation methods.
4. Given the dataset ..\..\rainharvest.dta, analyze the impact of participation in rain water harvesting on annual income using PSM.

Reading materials

Cameron and Trivedi (2010), “Microeconometrics Using Stata”, Stata Press.

Khandker, S., Koolwal, G. and Samad, H. (2010), “Handbook on Impact Evaluation: Quantitative Methods and Practices”, The World Bank, Washington DC.

Gertler, P., Martinez, S., Premand, P., Rawlings, L. and Vermeersch, C. (2016), “Impact Evaluation in Practice”, Second Edition, World Bank Group.

Journal Articles and Chapters

Blundell, R. and Dias, C. (2007), Alternative Approaches to Evaluation in Empirical Micro- econometrics, The IFS.


Burtless, G. (1995), The Case for Randomized Field Trials in Economic Policy Research, The Journal of Economic Perspectives, Vol. 9, No. 2, 63-84.

Heckman, J., LaLonde, R. and Smith, J. (1999), The Economics and Econometrics of Active Labor Market Programs, Handbook of Labor Economics (Ch. 31), Volume 3, Part A, 1865-2097.


TOPIC 2.5. INTRODUCTION TO FURTHER TOPICS (10 HOURS)

Topic Objectives

Having successfully completed this Topic students will be able to:

• Understand the basic concepts and ideas of duration and count data analysis;
• Understand the limitations of OLS for analyzing duration and count data;
• Demonstrate the range of analyses available for duration and count data regression models;
• Apply and interpret non-parametric, semi-parametric and parametric regression models for duration data using STATA software;
• Differentiate OLS from Quantile Regression;
• Apply Quantile Regression to analyze real world data using STATA software; and
• Critically review articles in which basic survival, count data and quantile regression modelling are applied.

Introduction

Duration (or survival) analysis is a class of statistical methods commonly used in the social and behavioral sciences to study both the occurrence and the timing of events. It is also called 'time to event' analysis. The technique is called survival analysis because it was primarily developed by medical researchers, who were most interested in the expected lifetime of patients in different cohorts. A unique feature of survival data is that typically not all patients experience the event (e.g., death) by the end of the observation period, so the actual survival times for some patients are unknown. This phenomenon, referred to as censoring, must be accounted for in the analysis to allow for valid inferences. In addition, survival times are usually skewed, limiting the usefulness of analysis methods that assume a normal data distribution. In economics, the duration of unemployment is the concept most widely analyzed using duration models. In this Topic, appropriate econometric methods for the analysis of time-to-event data are provided, including non-parametric and semi-parametric methods—specifically the Kaplan-Meier estimator and the Cox proportional hazards model, which are by far the most commonly used techniques in the economics literature.


Count data as dependent variables in a micro-econometric analysis are quantitative variables that are discrete, restricted to non-negative integers, and refer to events within a fixed time interval. Examples of economic variables that use a basic counting scale are number of children as an indicator of fertility, number of doctor visits as an indicator of health care demand, and number of days absent from work as an indicator of employee shirking. Several econometric methods are available for analyzing such data, including the Poisson and negative binomial models and their variants. They can provide useful insights that cannot be obtained from standard linear regression models.

Sub-topic 3 presents quantile regression which, as opposed to linear regression, gives a more comprehensive picture of the effect of the independent variables on the dependent variable. Instead of estimating the average effect using the OLS method, quantile regression produces different effects along the quantiles of the dependent variable. The dependent variable should be continuous, with no zeros and not too many repeated values. Examples include estimating the effects of household income on food expenditures for low- and high-expenditure households, and determining the factors affecting student scores along the score distribution.

2.5.1. Duration Models

2.5.1.1. Introduction

Depending on the discipline, duration analysis is also referred to as event history analysis or survival analysis. Duration models are used to analyze the duration in a particular state or the time until a particular event, such as the duration of unemployment or the time until first marriage. Duration data can be thought of as being generated by what is called a 'failure time process'. A failure time process consists of units – individuals, governments, countries, and so on – that are observed at some starting point in time. These units are in some state – the individual is healthy, the government is in power, the individual is employed, and so on – and are then observed over time. At any given point in time, these units are 'at risk' of experiencing some event, where an 'event' essentially represents a change or transition to another state – the individual dies, the government falls from power, the individual loses his job, and so on. After the event is experienced, the unit is either no longer observed or it is at risk of experiencing another kind of event. In some circumstances, units are not observed experiencing an event; that is, no transition is made from one state to another while the unit is being observed – the individual remains healthy, the government remains in power, the individual remains employed, and so on. As we will see, we call these cases 'censored' since we do not observe the subsequent history of the unit after the last observation point. This process is called a 'failure time process' because (a) units are observed at an initial point in time, (b) the unit survives for some length of time (or spell), and (c) the unit then 'fails' or is 'censored'. Examples of survival analysis include finance (borrowers obtain loans and then either default or continue to repay) and economics (firm survival and exit, time to retirement, time to finding a new job, and adoption of new technology).

2.5.1.2. Duration or survival analysis

Duration or survival analysis is the study of the distribution of time-to-event data, that is, the times from an initiating event (birth, start of treatment, employment in a given job) to some terminal event (death, relapse, disability pension). In survival analysis, we may encounter three cases:

Case 1: Subjects are tracked until an event happens (failure) or we lose them from the sample (censored observations).
Case 2: We are interested in how long they stay in the sample (survival).
Case 3: We are also interested in their risk of failure (hazard rates).

In survival data analysis, the dependent variable is duration (time to event or time to being censored) so it is a combination of time and event/censoring. The time variable refers to the length of time until the event happened or as long as they are in the study and the event variable assumes a value of 1 if the event happened or 0 if the event has not yet happened. Instead of an event variable, a censored variable can be defined. The censored variable assumes a value of 1 if the event has not happened yet, and 0 if the event has happened.


Table 2.5.1: Explanation for event and censored variables

Time   Event/Failure   Censored   Explanation
15     0               1          Event hasn't happened yet (censored)
22     1               0          Event happened (not censored)
78     0               1          Event hasn't happened yet (censored)
34     1               0          Event happened (not censored)
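As an illustration, the sketch below shows how such duration data could be declared in Stata before any survival analysis; the variable names time and event are hypothetical placeholders, and the data are assumed to be already loaded in memory.

* Hypothetical variables: time = duration, event = 1 if the event occurred, 0 if censored
stset time, failure(event)
stdescribe      // number of subjects, time at risk and number of failures

Once the data are stset, subsequent survival commands (sts, stcox, streg) use this declaration.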

The hazard rate is the probability that the event will happen at time t given that the individual is at risk at time t. Hazard rates usually change over time. For example, the probability of defaulting on a loan may be low in the beginning but increases over the time of the loan.

The basic survival analysis can be extended to:

1. Multiple occurrences of events (multiple observations per individual). For example, a borrower may have repeated restructurings of the loan, or a firm may adopt technology in some years but not others.
2. More than one type of event (include codes for events, e.g. 1, 2, 3, 4). For instance, a borrower may default (one type of event) or repay the loan early (a second type of event), or firms may adopt different types of technologies.
3. Two groups of participants. An example could be the effect of two types of educational programs on technology adoption rates.
4. Time-varying covariates: a borrower's income may have changed during the study period, which caused the default.
5. Discrete instead of continuous transition times: exits are measured in intervals (such as every month). There may be different starting times, and hence we need to measure time from the beginning time to the event.


2.5.1.3. Survival, hazard and cumulative hazard functions

Duration data are usually described in terms of two related probabilities, namely survival and hazard. Let us first consider the continuous time case; we will then extend it to the discrete time case.

The survival function

The length of a spell for a subject (person, firm, etc.) is a realization of a continuous random variable T with a cumulative distribution function (cdf), F(t), and probability density function (pdf), f(t). F(t) is also known in the survival analysis literature as the failure function.

The survival probability, also called the survival function, S (t), is the probability that an individual survives from a specified time point (e.g. the diagnosis of cancer) to a specified future time t. Or, it is the probability that a subject survives longer than time t. The survivor function is given by: S(t) = 1 - F(t); t is the elapsed time since entry to the state at time 0.

But the failure function (cdf) is by definition:

F(t) = Prob(T ≤ t) = ∫0^t f(s) ds    (2.5.1)

Hence, the survival function, which is the probability that the duration will be at least t, is given by:

S(t) = 1 − F(t) = Prob(T > t)    (2.5.2)

The hazard function

An alternative characterization of the distribution of T is given by the hazard function, or instantaneous rate of occurrence of the event, which is defined as:

h(t) = lim_{Δt→0} Pr(t ≤ T < t + Δt | T ≥ t) / Δt

The numerator of this expression is the conditional probability that the event will occur in the interval [t, t+Δt) given that it has not occurred before, and the denominator is the width of the interval. Dividing one by the other, we obtain a rate of event occurrence per unit of time. Taking the limit as the width of the interval goes to zero, we obtain an instantaneous rate of occurrence. The conditional probability in the numerator may be written as the ratio of the joint probability that T is in the interval [t, t+Δt) and T ≥ t (which is, of course, the same as the probability that T is in the interval), to the probability of the condition T ≥ t. The former may be written as f(t)Δt for small Δt, where f(t) = ∂F(t)/∂t = −∂S(t)/∂t is the density, while the latter is S(t) by definition. Dividing by Δt and passing to the limit gives the useful result for the hazard rate:

h(t) = f(t)/S(t)    (2.5.3)

The hazard rate is the probability that an individual will experience the event at time t while that individual is at risk of experiencing the event.

2.5.1.4. Methods for analyzing survival data

Several methods exist for doing survival analysis. Broadly, these can be categorized into three: non-parametric, semi-parametric and parametric methods.

1. Non-parametric methods, which neither impose assumptions on the distribution of survival times (a specific shape of the survival function or hazard function) nor assume a specific relationship between covariates and the survival time. The most widely used method for survival data analysis is the Kaplan-Meier estimator.

2. Semi-parametric methods also make no assumptions regarding the distribution of survival times but do assume a specific relationship between covariates and the hazard function—and hence, the survival time. The widely used method, Cox Proportional Hazard model is a semi-parametric method.

3. Parametric methods assume a distribution of the survival times and a functional form of the covariates.


The difference between the methods lies in the assumptions that we make regarding the distribution of the survival data. However, all survival analysis methods can handle censored data.

Non-parametric models

Non-parametric estimation is useful for descriptive purposes and to see the shape of the hazard or survival function before a parametric model with regressors is introduced.

To carry out survival analysis using non-parametric procedure, we follow the following steps:

Step 1: Sort the observations based on duration from smallest to largest t1 ≤ t2 ≤ ⋯ ≤ tn.

Step 2: For each duration tj, determine the number of observations at risk nj (those still in the sample), the number of events dj and the number of censored observations cj.

Step 3: Calculate the hazard function as the number of events as a proportion of the number of observations at risk:

λ(tj) = dj / nj    (2.5.4)

Step 4: The Nelson-Aalen estimator of the cumulative hazard function is calculated by summing the hazard contributions over time:

Λ(t) = Σ_{tj ≤ t} dj / nj    (2.5.5)

Step 5: The Kaplan-Meier estimator of the survival function multiplies, over time, the ratios of those without events to those at risk:

S(t) = Π_{tj ≤ t} (nj − dj) / nj    (2.5.6)

The Kaplan-Meier survival function has the following drawbacks:

1. It is a decreasing step function with a jump at each discrete event time.
2. Without censoring, the Kaplan-Meier estimator is just the empirical distribution of the data.
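A minimal sketch of how these non-parametric estimators are typically obtained in Stata, assuming the data have already been declared with stset as sketched earlier, is the following:

sts list             // Kaplan-Meier survivor function estimates
sts graph            // Kaplan-Meier survival curve
sts graph, cumhaz    // Nelson-Aalen cumulative hazard estimate
sts graph, hazard    // smoothed hazard estimate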


Example 2.5.1: Consider the data presented in Table 2.5.1 on time to first partnership obtained from National Child Development Study (1958 birth cohort). Note that respondents were interviewed at age 33, so there is no censoring before then.

Table 2.5.1. Time to first partnership t r(t) d(t) w(t) h(t) S(t) 16 500 9 0 0.02 1 17 491 20 0 0.04 0.98 18 471 32 0 0.07 0.94 19 439 52 0 0.12 0.88 20 387 49 0 0.13 0.77 …… …… 32 39 3 0 0.08 0.08 33 36 1 35 0.03 0.07

With regard to the results presented in Table 2.5.1, first note that the event is partnering for the first time and 'survival' here is remaining single. Hence, h(16) = 0.02 means that only 2% of the sample partnered before age 17. Moreover, h(20) = 0.13 means that, of those who were unpartnered at their 20th birthday, 13% partnered before age 21, and S(20) = 0.77 shows that 77% had not partnered by age 20.

Figure 2.5.1 below is the graph of the hazard function against age.


Figure 2.5.1: Hazard of first partnership

Results in the above figure indicate that if an individual has not partnered by their late 20s, their chance of partnering declines thereafter. Figure 2.5.2 is the graph of the survival function against age.

Figure 2.5.2: Survivor function: Probability of remaining unpartnered

Note that the survivor function will always decrease with time and that the hazard function may go up and down.

Semi-parametric models

The semi-parametric method is the most commonly used type of survival data analysis because of its general flexibility. This methodology originated from the famous article by Sir David Cox (1972). The British statistician, knighted in 1985 for his contributions to survival analysis, popularized the proportional hazards estimator that is now often called Cox modeling. The Cox proportional hazard model is the most commonly used technique for survival data analysis that simultaneously allows one to include and to assess the effect of multiple covariates. The semi-parametric Cox model is a safe and proven method that does not require the specification of a particular data distribution, which is why this model is the most common in analyzing survival data. The Cox proportional hazard model is a semi-parametric model in which the hazard rate is defined by:

h(t|xi, β) = h0(t) exp(βxi) (2.5.7)


Parametric models

Parametric models assume a specific distribution of the survival times. Advantages of a parametric model include higher efficiency (i.e., greater power), which can be particularly useful with smaller sample sizes. Furthermore, a variety of parametric techniques can model survival times when the proportional hazards assumption is not met.

However, it can be quite challenging to identify the most appropriate data distribution, and parametric models have the drawback of providing misleading inferences if the distributional assumptions are not met. The parametric models can assume different parametric forms for the hazard and survival functions. Table 2.5.2 presents some of the functional forms that the hazard and survival functions can assume.

Table 2.5.2. Functional forms for the hazard and survival functions

Parametric model   Hazard function h(t)                Survival function S(t)
Exponential        β                                   exp(−βt)
Weibull            βk(βt)^(k−1)                        exp(−(βt)^k)
Gompertz           β exp(kt)                           exp(−(β/k)(e^(kt) − 1))
Log-logistic       βk(βt)^(k−1) / (1 + (βt)^k)         1 / (1 + (βt)^k)

Note that the exponential model has a constant hazard rate over time.

Estimation of the parametric and semi-parametric models

The parametric and semi-parametric models report both coefficients and hazard ratios. The coefficients are interpreted as follows: a positive coefficient means that as the independent variable increases the time-to-event decreases (lower duration, or the event is more likely to happen). Hazard ratios are interpreted as follows: a hazard ratio of 2 (0.5) means that for a one-unit increase in the x variable, the hazard rate (the probability of the event happening) increases by 100% (decreases by 50%). A hazard ratio greater than 1 means that the event is more likely to happen.

Table 2.5.3: Coefficients and hazard rates of parametric and semi-parametric models

Coefficient   Hazard rate   Conclusion
Positive      >1            Lower duration, higher hazard rates (more likely for the event to happen)
Negative      (0,1)         Higher duration, lower hazard rates (less likely for the event to happen)

In summary, there are many types of duration models, which vary according to:
1. Assumptions about the shape of the hazard function
2. Whether time is treated as continuous or discrete
3. Whether the effects of covariates can be assumed constant over time (proportional hazards)

Continuous-time models

The most commonly applied model for continuous-time survival data is the Cox proportional hazards model, which makes no assumptions about the shape of the hazard function and assumes that the effects of covariates are constant over time (although this can be modified). Note that in the Cox model specified in equation 2.5.7, hi(t) is the hazard for individual i at time t, xi is a vector of covariates (for now assumed fixed over time) with coefficients β, and h0(t) is the baseline hazard, i.e. the hazard when xi = 0. In addition, an individual's hazard depends on t only through h0(t), which is left unspecified, so there is no need to make assumptions about the shape of the hazard.

With regards to the interpretation of the Cox proportional hazards model there are three important points to mention:

1. Covariates have a multiplicative effect on the hazard.


2. For each 1-unit increase in x the hazard is multiplied by exp(β). To see this, consider a binary x coded as 0 and 1, xi = 0 implies that hi (t) = h0 (t) and xi = 1 implies that hi (t) = h0 (t) exp(β). Hence, exp(β) is the ratio of the hazard for x = 1 to the hazard for x = 0, called the relative risk or hazard ratio.

3. exp(β) = 1 implies that x has no effect on the hazard. exp(β) > 1 implies that x has a positive effect on the hazard, i.e. higher values of x are associated with shorter durations; for example, exp(β) = 2.5 implies an increase in h(t) of (2.5 − 1) × 100 = 150% for a 1-unit increase in x. Moreover, exp(β) < 1 implies that x has a negative effect on the hazard, i.e. higher values of x are associated with longer durations; for instance, exp(β) = 0.6 implies a decrease in h(t) of (1 − 0.6) × 100 = 40% for a 1-unit increase in x.

Example 2.5.2: Gender effects on age at first partnership

Table 2.5.4: Cox proportional hazards model results

_t       Coef.   Std. Err.   z      P>|z|   [95% Conf. Interval]
female   0.401   0.093       4.29   0.000   0.218    0.583

Results presented in Table 2.5.4 indicate that the log-hazard of forming the first partnership at age t is 0.4 points higher for women than for men; the hazard of forming the first partnership at age t is exp (0.40) = 1.49 times higher for women than for men; and women partner at a younger age than men.

The proportional hazards assumption

Consider a model with a single covariate x and two individuals with different values denoted by x1 and x2. The Cox proportional hazards model is given by hi(t) = h0(t) exp(βxi). Hence, the ratio of the hazards for individual 1 to individual 2 is h1(t)/h2(t) = exp(βx1)/exp(βx2), which does not depend on t, i.e. the effect of x is the same at all durations t.

Examples of (a) proportional and (b) non-proportional hazards for a binary x are given in the figure below.

Figure 2.5.3: Graphs for (a) proportional and (b) non-proportional hazards for binary x

Estimation of the Cox model

All statistical software packages have built-in procedures for estimating the Cox model. The input data are each individual's duration yi and censoring indicator σi. The data are restructured before estimation (although this is hidden from the user), and the Cox model is then estimated by maximum partial likelihood. We will look at this data restructuring to better understand the model and its relationship with the discrete-time approach. But note that you do not have to do this restructuring yourself!
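For instance, a minimal sketch of fitting the Cox model for the partnership example in Stata might look as follows, assuming the data have been stset with age at first partnership as the analysis time and that the covariates are named female and fulltime as in the text:

stcox female fulltime          // reports hazard ratios exp(b) by default
stcox female fulltime, nohr    // reports coefficients on the log-hazard scale
estat phtest, detail           // test of the proportional-hazards assumption (Schoenfeld residuals)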


Creation of risks sets

A risk set is defined for each observed event time and contains all individuals at risk of the event at that time.

Suppose there are K distinct uncensored event times and denote the ordered times by t(1), t(2), . . . , t(K). Consider the following example.

Example 2.5.4. Suppose the ordered uncensored event times (age at marriage) are as given in Table 2.5.5.

Table 2.5.5. Ordered uncensored age at marriage k 1 2 3 4 5 6 t(k) 16 17 18 21 22 24

The event time ranges from 16 to 24, so there are potentially 9 event times (taking 16 as the origin). But there are 6 risk sets because no events were observed at t = 19, 20, 23.

Risk set based file

Consider records for three individuals:

Individual i   yi   σi
1              21   1
2              18   0
3              16   1

Individual i   Risk set k   t(k)   yki (event at t(k))
1              1            16     0
1              2            17     0
1              3            18     0
1              4            21     1
2              1            16     0
2              2            17     0
2              3            18     0
3              1            16     1

Results from fitting Cox model

The Cox model results reporting coefficients and hazard ratios for the data provided in example 2.5.1 are provided in Tables 2.5.6 and 2.5.7, respectively.

Table 2.5.6. Cox model results reporting coefficients

_t         Coef.    Std. Err.   z       P>|z|   [95% Conf. Interval]
female     0.394    0.093       4.21    0.000   0.211    0.577
fulltime   -1.031   0.190       -5.42   0.000   -1.404   -0.658

Table 2.5.7. Cox model results reporting hazard ratios

_t         Haz. Ratio   Std. Err.   z       P>|z|   [95% Conf. Interval]
female     1.483        0.139       4.21    0.000   1.234   1.781
fulltime   0.357        0.068       -5.42   0.000   0.246   0.518

Results from fitting the Cox model (Table 2.5.7) show that the hazard of partnering at age t is (1.48 − 1) × 100 = 48% higher for women than for men (i.e. women partner more quickly than men). Moreover, being in full-time education decreases the hazard by (1 − 0.36) × 100 = 64%.


Example 2.5.5: We want to study unemployment duration – the length of time it takes someone to find a full-time job. Data come from the January Current Population Survey's Displaced Workers Supplements (DWS) for the years 1986, 1988, 1990 and 1992. The dependent variable is the duration (number of periods unemployed) together with the event indicator (finding a job); the independent variables are log wage, whether the individual claims unemployment insurance, and age.

Summary statistics: Subjects are tracked for 1 to 28 periods. They either find a job (event) or are still looking (censored). The number of subjects is 3,343; the time at risk (periods summed over the subjects) is 20,887. The number of failures is 1,073, i.e. 32% of the sample has failed. The incidence rate, the number of failures divided by the time at risk, is 5.13%.

Table 2.5.8. Survival function table

Time   Number of subjects   Failure/event   Net lost/censored   Survival function
1      3343                 294             246                 0.91
2      2803                 178             304                 0.85
3      2321                 119             305                 0.81
4      1897                 56              165                 0.79
5      1676                 104             233                 0.74
6      1339                 32              111                 0.72
7      1196                 85              178                 0.67
…      …                    …               …                   …
25     58                   0               10                  0.38
26     48                   2               13                  0.37
27     33                   5               24                  0.31
28     4                    0               4                   0.31

The hazard rate in Figure 2.5.3 shows that the probability of experiencing the event (finding a job) declines from about 4% to 3% over 25 time periods.


Figure 2.5.3: Smoothed hazard estimate

Results presented in Figure 2.5.4 indicate that the Nelson-Aalen cumulative hazard estimate is non-decreasing.

Figure 2.5.4: Nelson-Aalen cumulative hazard estimate

The Kaplan-Meier survival function shows that survival probabilities go down to 31% over 28 time periods. This means that 31% of the individuals still have not found a job after 28 time periods.


Figure 2.5.5: Kaplan-Meier survival estimate

The survival functions for the group that claims unemployment insurance (ui = 1) and the group that does not (ui = 0) are presented in Figure 2.5.6. The survival functions show that, at any point in time, claiming unemployment insurance is associated with a higher survival rate. This means that if someone receives unemployment benefits, he/she is more likely to still be unemployed.

Figure 2.5.6: Kaplan-Meier survival estimates, by unemployment insurance status


Table 2.5.9. Parametric regression model coefficients

Duration of unemployment       Exponential coefficients   Weibull coefficients   Gompertz coefficients   Cox proportional hazard coefficients
Log wage                       0.48*                      0.49*                  0.48*                   0.46*
Claim unemployment insurance   -1.08*                     -1.11*                 -1.06*                  -0.98*
Age                            -0.01*                     -0.01*                 -0.01*                  -0.01*

The results reported here are Stata results. The coefficients are interpreted as follows: individuals with higher wages have lower unemployment durations, meaning they terminate unemployment faster; individuals who claim unemployment insurance have higher unemployment durations, meaning they terminate unemployment more slowly.

Table 2.5.10. Parametric regression model hazard rates

Duration of unemployment       Exponential hazard rates   Weibull hazard rates   Gompertz hazard rates   Cox proportional hazard model hazard rates
Log wage                       1.62*                      1.63*                  1.61*                   1.58*
Claim unemployment insurance   0.34*                      0.33*                  0.35*                   0.37*
Age                            0.99*                      0.99*                  0.99*                   0.98*

Interpretation of the hazard rates: A unit increase in the log wage is associated with a 1.62 − 1 = 62% increase in the hazard rate. For individuals who claim unemployment insurance, the hazard rate is 1 − 0.34 = 66% lower. In other words, individuals with higher wages are more likely to find a job, and those who claim unemployment insurance are less likely to find a job.
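The four sets of estimates compared in Tables 2.5.9 and 2.5.10 could be produced in Stata roughly as sketched below; the variable names logwage, ui and age are hypothetical stand-ins for log wage, claiming unemployment insurance and age, and the data are assumed to be stset already.

streg logwage ui age, distribution(exponential) nohr   // exponential model, coefficients
streg logwage ui age, distribution(weibull) nohr       // Weibull model
streg logwage ui age, distribution(gompertz) nohr      // Gompertz model
stcox logwage ui age, nohr                             // Cox proportional hazards model
* omit the nohr option to report hazard ratios instead of coefficients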


Discrete-time models

In social research, duration data are usually collected retrospectively in a cross-sectional survey, where dates are recorded to the nearest month or year, or prospectively in waves of a panel study (e.g. annually). Both give rise to discretely-measured durations. It is also called interval-censored because we only know that an event occurred at some point during an interval of time.

For carrying out discrete-duration analysis, we must first restructure the data. Towards this, we expand the event times and censoring indicator (yi, σi) to a sequence of binary responses {yti} where yti indicates whether an event has occurred in time interval [t, t + 1).

The required structure is very similar to the risk set-based file for the Cox model, but the user has to do the restructuring rather than the software. Also, we now have a record for every time interval (not risk sets, i.e. intervals where events occur).

Table 2.5.11. Data structure: The person-period file

Individual i   yi   σi
1              21   1
2              33   0
↓

Individual i   t    yti
1              16   0
1              17   0
…              …    …
1              20   0
1              21   1
2              16   0
2              17   0
…              …    …
2              32   0
2              33   0


Discrete-time hazard function

Denote by pti the probability that individual i has an event during interval t, given that no event has occurred before the start of t. That is, pti = Pr(yti = 1|yt−1, i = 0). Note that pti is a discrete-time approximation to the continuous-time hazard function hi (t). We call pti the discrete-time hazard function.

Discrete-time logit model

After expanding the data, we fit a binary response model to yti, e.g. a logit model given by:

log(pti / (1 − pti)) = αDti + βxti    (2.5.8)

where pti is the probability of an event during interval t, Dti is a vector of functions of the cumulative duration by interval t with coefficients α, and xti is a vector of covariates (time-varying or constant over time) with coefficients β.

Modelling the time-dependency of the hazard

Changes in pti with t are captured in the model by αDti, the baseline hazard function. Dti has to be specified by the user, where the options include (a Stata sketch of this set-up is given after the list):

i. Polynomial of order p: αDti = α0 + α1t + . . . + αp t^p
ii. Step function: αDti = α1D1 + α2D2 + . . . + αqDq

where D1, . . . , Dq are dummies for the time intervals t = 1, . . . , q and q is the maximum observed event time. If q is large, categories may be grouped to give a piecewise constant hazard model.
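A minimal sketch of the person-period expansion and a step-function discrete-time logit in Stata is the following; the variable names id, y, event, female and fulltime are hypothetical, and the data are assumed to contain one record per individual before expansion.

* declare the data and split each spell into one record per year at risk
stset y, failure(event) id(id)
stsplit period, every(1)
* after stsplit, the failure indicator _d equals 1 only in the interval where the event occurs
logit _d i.period female fulltime
* a complementary log-log link gives coefficients comparable to the Cox model
cloglog _d i.period female fulltime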

Example 2.5.5: Discrete-time analysis of age at first partnership

In these data, we have two covariates, namely FEMALE and FULLTIME (time-varying). We consider two forms of αDti:
i. Step function: a dummy variable for each year of age, 16-33
ii. Quadratic function: include t and t^2 as explanatory variables


Table 2.5.12. Duration effects fitted as a step function

Reference category for t is age 16 (could have fitted dummies for all ages, t1-t18, and omitted intercept)

Table 2.5.13. Comparison of Cox and logit estimates for age at 1st partnership

Variable      Cox β̂     Cox se(β̂)   Logit β̂   Logit se(β̂)
Female        0.394     0.093       0.468     0.102
Fulltime(t)   −1.031    0.190       −1.133    0.197

Same substantive conclusions, but:
• Cox estimates are effects on the log scale, and exp(β) are hazard ratios (relative risks)
• Logit estimates are effects on the log-odds scale, and exp(β) are hazard-odds ratios


When will Cox and logit estimates be similar?

In general, Cox and logit estimates will get closer as the hazard function becomes smaller, because log(h(t)) ≈ log(h(t)/(1 − h(t))) as h(t) → 0. The discrete-time hazard will get smaller as the width of the time intervals becomes smaller. Note that a discrete-time model with a complementary log-log link, log(−log(1 − pt)), is an approximation to the Cox proportional hazards model, and its coefficients are directly comparable.

Table 2.5.14. Duration effects fitted as a quadratic

Results presented in Table 2.5.14 show that approximating the step function by a quadratic leads to little change in the estimated covariate effects. That is, the estimates from the step function model were 0.468 (SE = 0.102) for female and −1.133 (SE = 0.197) for fulltime.

2.5.2. Count Data Models

2.5.2.1. Introduction

Often, economic policies are directed toward outcomes that are measured as counts. Examples of economic variables that use a basic counting scale are the number of children as an indicator of fertility, the number of doctor visits as an indicator of health care demand, and the number of days absent from work as an indicator of employee shirking. Several econometric methods are available for analyzing such data, including the Poisson and negative binomial models and their variants. They can provide useful insights that cannot be obtained from standard linear regression models. Estimation and interpretation are illustrated with an empirical example.


2.5.2.2. The Poisson model

The Poisson model predicts the number of occurrences of an event. The Poisson model states that the probability that the dependent variable Y will be equal to a certain number y is given by:

P(Y = y) = e^(−µ) µ^y / y!    (2.5.9)

For the Poisson model, µ is the intensity or rate parameter.

µ = exp (xi′β) (2.5.10)

For the Poisson model, the coefficients are interpreted as follows: a one-unit increase in x will increase/decrease the average number of events (the conditional mean of the dependent variable) by the coefficient, expressed as a percentage.

Properties of the Poisson distribution

The Poisson distribution has the following properties:

1. Equidispersion property of the Poisson distribution: the equality of mean and variance.

E(y|x) = var(y|x) = μ (2.5.11)

This is a restrictive property and often fails to hold in practice, i.e., there is “overdispersion” in the data. In this case, use the negative binomial model.

2. Excess zeros problem of the Poisson distribution: there are usually more zeros in the data than a Poisson model predicts. In this case, use the zero-inflated Poisson model.

Marginal effects for the Poisson model

The marginal effect of a variable xj on the average number of events is given by:

∂E(y|x)/∂xj = βj exp(xi′β)    (2.5.12)

The marginal effect of the Poisson regression is interpreted as: A one-unit increase in x will increase/decrease the average number of the dependent variable by the marginal effect.
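As an illustration, a Poisson regression and its average marginal effects might be obtained in Stata as sketched below; the variable names (docvis for the count outcome, and private, medicaid, age, educ for the regressors) are hypothetical placeholders, and the data are assumed to be in memory.

poisson docvis private medicaid age educ, vce(robust)
margins, dydx(*)     // average marginal effects on the expected number of events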

2.5.2.3. Negative binomial model

The negative binomial model is used with count data instead of the Poisson model if there is overdispersion in the data. Unlike the Poisson model, the negative binomial model has the less restrictive property that the variance is not equal to the mean (µ). There are cases where the variance of the dependent variable can have the following form: var(y|x) = µ + αµ². Another functional form for the variance of the dependent variable is var(y|x) = µ + αµ, but this form is less commonly used. The negative binomial model also estimates the overdispersion parameter α.

Test for overdispersion

The test for overdispersion estimates the negative binomial model, which includes the overdispersion parameter α, and tests whether α is significantly different from zero. That is, H0: α = 0 against Ha: α ≠ 0. There are three cases:
Case 1: If α = 0, then use the Poisson model.
Case 2: If α > 0, then we have overdispersion, which frequently holds with real data.
Case 3: If α < 0, then we have what we call under-dispersion, which is not very common.

Incidence rate ratios (irr)

For the Poisson and negative binomial models, in addition to reporting the coefficients and marginal effects, we can also report the incidence rate ratios. The incidence rate ratios report exp(b) rather than b. The incidence rate ratio is interpreted as: irr = 2 means that for each unit increase in x, the expected number of y will double.
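A sketch of the negative binomial model and of incidence-rate-ratio reporting in Stata, using the same hypothetical variable names as above, is:

nbreg docvis private medicaid age educ        // output includes alpha and a LR test of alpha = 0
nbreg docvis private medicaid age educ, irr   // report incidence rate ratios exp(b)
poisson docvis private medicaid age educ, irr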


2.5.2.4. Hurdle or two-part models

The two-part model relaxes the assumption that the zeros (whether or not there are any events) and the positives (how many events) come from the same data-generating process. For example, different factors may affect whether or not you practice a particular sport and how many times you practice your sport in a month. We can estimate the two-part model in a way similar to the truncated regression models. If the process generating the zeros is f1(.) and the process generating the positive responses is f2(.), then the two-part hurdle model is defined by the following probabilities:

g(y) = f1(0)                                 if y = 0
g(y) = [(1 − f1(0)) / (1 − f2(0))] f2(y)     if y ≥ 1    (2.5.13)

If the two processes are the same, then it is the standard count data model. The model for the zero versus positive responses is a binary model with the specified distribution, but we usually estimate it with the probit/logit model.
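One common way to implement a count-data hurdle in Stata is to combine a logit for the zero/positive decision with a zero-truncated count model for the positive counts, as sketched below with the same hypothetical variable names; this illustrates the general idea rather than a single built-in command.

* part 1: whether any visit occurs
generate any = (docvis > 0) if !missing(docvis)
logit any private medicaid age educ
* part 2: how many visits, conditional on at least one (zero-truncated models)
tpoisson docvis private medicaid age educ if docvis > 0, ll(0)
tnbreg docvis private medicaid age educ if docvis > 0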

2.5.2.5. Zero-inflated models

The zero-inflated model is used with count data when there is an excess zeros problem. The zero-inflated model lets the zeros occur in two different ways: as a realization of the binary process (z = 0) and as a realization of the count process when the binary variable z = 1. For example, you either like playing football or you do not. If you like playing football, the number of trips you take to the football field may be 0, 1, 2, 3, etc. So, you may like playing football but not take a trip this year. In this way the model is able to generate more zeros in the data. If the process generating the zeros is f1(.) and the process generating the positive responses is f2(.), then the zero-inflated model is:

g(y) = f1(0) + (1 − f1(0)) f2(0)     if y = 0
g(y) = (1 − f1(0)) f2(y)             if y ≥ 1    (2.5.14)


The zero-inflated model is less frequently used than the hurdle model. The zero-inflated models can handle the excess zeros problem.
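In Stata, zero-inflated models can be sketched as follows, again with hypothetical variable names; the inflate() option lists the covariates of the binary process that generates the excess zeros, which need not be the same as the covariates of the count process.

zip docvis private medicaid age educ, inflate(private medicaid age educ)
zinb docvis private medicaid age educ, inflate(private medicaid age educ)
margins, dydx(*)     // average marginal effects on the expected count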

2.5.2.6. Advantages and disadvantages of count data models

Advantages

The advantages of using count data models are the following:

1. Count data regressions provide an appropriate, rich, and flexible modeling environment for non-negative integers, 0, 1, 2, etc.
2. Poisson regression is the workhorse model for estimating constant relative policy effects.
3. Hurdle and related models allow distinguishing between extensive margin effects (outcome probability of a zero) and intensive margin effects (probability of one or more counts).
4. With count data, policy evaluations can move beyond the consideration of mean effects and determine the effect on the entire distribution of outcomes instead.

Disadvantages

The disadvantages of using count data models are the following:

1. Count data models impose parametric assumptions that, if invalid, can lead to incorrect policy conclusions.
2. While many software packages implement standard count models, such as the Poisson and negative binomial models, more elaborate models may require some programming by the researcher.
3. A count data approach does not solve the fundamental evaluation problem: absent a randomized controlled experiment, identifying policy effects from observational data can be marred by selection bias, requiring plausibly exogenous variation in the form of a quasi-experiment.


Example 2.5.6: Suppose we want to study the factors influencing the number of doctor visits. Data are from the U.S. Medical Expenditure Panel Survey for 2003. The dependent variable is the number of doctor visits. The independent variables include: has private insurance, has Medicaid insurance, age and education. The results of the Poisson and negative binomial models are presented in Table 2.5.15.

Table 2.5.15. Poisson and negative binomial model coefficients and marginal effects

Number of doctor visits            Poisson coefficients   Poisson marginal effects   Negative binomial coefficients   Negative binomial marginal effects
Has private insurance              0.15*                  1.04*                      0.16*                            1.08*
Has Medicaid insurance             0.29*                  1.96*                      0.28*                            1.96*
Age                                0.01*                  0.07*                      0.01*                            0.07*
Education                          0.02*                  0.17*                      0.02*                            0.16*
Intercept                          0.74*                                             0.75*
Alpha (overdispersion parameter)                                                     0.81*

Interpretation of the results presented on Table 2.5.15

The coefficients for both models are interpreted as: individuals who have private insurance are expected to have a 15% and 16% increase in the number of doctor visits for the Poisson and negative binomial models respectively. The marginal effects results are interpreted as: individuals who have private insurance are expected to have 1.04 (1.08) additional doctor visits for the Poisson (negative binomial) model. The overdispersion parameter α is significantly different from zero, therefore we should use the negative binomial model instead of the Poisson model.

The results for the Truncated Poisson and negative binomial models are presented in Table 2.5.16.


Table 2.5.16. Truncated Poisson and negative binomial model coefficients and marginal effects

Number of doctor visits        Logit coefficients   Zero-truncated Poisson coefficients   Zero-truncated Poisson marginal effects   Zero-truncated negative binomial coefficients   Zero-truncated negative binomial marginal effects
Has private insurance          0.64*                0.09*                                 0.67*                                     0.10*                                           0.71*
Has Medicaid insurance         0.46*                0.24*                                 1.83*                                     0.26*                                           1.83*
Age                            0.04*                0.01*                                 0.05*                                     0.01*                                           0.05*
Education                      0.04*                0.02*                                 0.16*                                     0.02*                                           0.15*
Intercept                      -1.71*               1.25*                                                                           1.10*
Alpha (dispersion parameter)                                                                                                        0.77*

Interpretation of the results presented on Table 2.5.16

Interpretation of the coefficients for the logit model: individuals who have private insurance, have Medicaid insurance, are older, or have higher education are more likely to have a positive number of doctor visits. As to the interpretation of the coefficients and marginal effects for the truncated Poisson (truncated negative binomial) model: among individuals with a positive number of doctor visits, those who have private insurance have a 9% (10%) higher number of doctor visits, or 0.67 (0.71) additional doctor visits. Note that similar results are obtained for the two-step and one-step models, but that does not have to be the case.

Regression results for the zero-inflated Poisson and negative binomial models are presented on Table 2.5.17.


Table 2.5.17. Zero-inflated Poisson and negative binomial model coefficients and marginal effects

Number of doctor visits        Inflated model coefficients   Zero-inflated Poisson coefficients   Zero-inflated Poisson marginal effects   Inflated model coefficients   Zero-inflated negative binomial coefficients   Zero-inflated negative binomial marginal effects
Has private insurance          -0.64*                        0.09*                                1.05*                                    -2.56*                        0.10*                                          0.95*
Has Medicaid insurance         -0.46*                        0.24*                                1.95*                                    -0.88*                        0.26*                                          1.85*
Age                            -0.04*                        0.01*                                0.07*                                    -0.13*                        0.01*                                          0.06*
Education                      -0.04*                        0.02*                                0.17*                                    0.05*                         0.02*                                          0.16*
Intercept                      1.70*                         1.25*                                                                         7.42*                         1.05*
Alpha (dispersion parameter)                                                                                                                                             0.75*

Interpretation of the results presented on Table 2.5.17

Interpretation of the coefficients and marginal effects for the zero-inflated Poisson (zero-inflated negative binomial) model: among individuals who are inclined to go to the doctor, those who have private insurance have a 9% (10%) higher number of doctor visits, or 1.05 (0.95) additional doctor visits. The inflation equation is similar to a logit model (a binary model), but its coefficients are different. Note that the truncated and zero-inflated models give similar results. In this case, the estimated number of additional doctor visits is higher if we use inflated models than if we use truncated models. Note also that we do not have to use the same variables in the binary model (zero vs. positive outcome) and in the truncated or zero-inflated second-step model (for the positive outcomes).

2.5.3. Quantile Regression

2.5.3.1. Introduction

Quantile regression, as introduced by Koenker and Bassett (1978), seeks to complement classical linear regression analysis. Central hereby is the extension of "ordinary quantiles from a location model to a more general class of linear models in which the conditional quantiles have a linear form". In Ordinary Least Squares (OLS), the primary goal is to determine the conditional mean of a random variable Y, given some explanatory variable Xi, that is, the expected value E[Y|Xi]. Quantile regression goes beyond this and enables one to pose such a question at any quantile of the conditional distribution function. This subtopic introduces you to the ideas behind quantile regression. First, we will look at what quantiles are, followed by a short review of OLS. Finally, quantile regression analysis is presented, along with an example.

2.5.3.2. What are quantiles?

Gilchrist (2001) describes a quantile as “simply the value that corresponds to a specified proportion of an (ordered) sample of a population”. For instance, a very commonly used quantile is the median M, which is equal to a proportion of 0.5 of the ordered data. This corresponds to a quantile with a probability of 0.5 of occurrence. Quantiles hereby mark the boundaries of equally sized, consecutive subsets.

More formally stated, let Y be a continuous random variable with a distribution function FY(y) such that:

FY(y) = P(Y≤ y)= τ (2.5.15)


which states that the distribution function FY(y) gives, for a value y, its probability of occurrence τ. When dealing with quantiles, one wants to do the opposite: determine, for a given probability τ, the corresponding value y in the sample data. The τth quantile is the value yτ such that:

FY(yτ) = τ (2.5.16)

Another way of expressing the τth quantile mathematically is the following:

yτ = FY^(−1)(τ)    (2.5.17)

that is, yτ is obtained by inverting the distribution function FY at the probability τ.

Note that there are two possible scenarios. On the one hand, if the distribution function FY(y) is monotonically increasing, quantiles are well defined for every τ ∈ (0,1). However, if the distribution function FY(y) is not strictly monotonically increasing, there are some τ for which a unique quantile cannot be defined. In this case one uses the smallest value that y can take on for the given probability τ.

Both cases, with and without a strictly monotonically increasing function, can be described as follows:

yτ = FY^(−1)(τ) = inf{ y | FY(y) ≥ τ }    (2.5.18)

That is, yτ is equal to the inverse of the distribution function at τ, which in turn is equal to the infimum of y such that the distribution function FY(y) is greater than or equal to the given probability τ, i.e. the τth quantile (Handl, 2000).

However, a problem that frequently occurs is that an empirical distribution function is a step function. Handl (2000) describes a solution to this problem. As a first step, one reformulates equation (2.5.18) by replacing the continuous random variable Y with the n observations in the distribution function FY(y), resulting in the empirical distribution function Fn(y). This gives the following equation:

ŷτ = inf{ y | Fn(y) ≥ τ }    (2.5.19)

The empirical distribution function can be separated into equally sized, consecutive subsets via the number of observations n, which leads to the following step:

ŷτ = y(i)    (2.5.20)

with i = 1, ..., n and y(1), ..., y(n) the sorted observations. Hereby, of course, the range of values that yτ can take on is limited simply by the observations yi and their nature. However, what if one wants to implement a different subset, i.e. different quantiles than those that can be derived from the number of observations n?

Therefore, a further step necessary to solve the problem of a step function is to smooth the empirical distribution function by replacing it with a continuous, piecewise linear function F̂Y. There are several algorithms available for doing this, which are well described in Handl (2000), and described in more detail, with an evaluation of the different algorithms and their efficiency in computer packages, in Hyndman and Fan (1996). Only then can one apply any division of the dataset into quantiles that is suitable for the purpose of the analysis (Handl, 2000).

2.5.3.3. Ordinary Least Squares (OLS)

In regression analysis, we are interested in analyzing the behavior of a dependent variable yi given the information contained in a set of explanatory variables xi. OLS is a standard approach that specifies a linear regression model and estimates its unknown parameters by minimizing the sum of squared residuals. This leads to an approximation of the mean function of the conditional distribution of the dependent variable. OLS is BLUE—the best linear unbiased estimator—if the following four assumptions hold:


1. The explanatory variable xi is non-stochastic;
2. The expectation of the error term εi is zero, i.e. E[εi] = 0;
3. The variance of the error term is constant (homoscedasticity), i.e. var(εi) = σ²;
4. No autocorrelation, i.e. cov(εi, εj) = 0 for i ≠ j.

However, frequently one or more of these assumptions are violated, so that OLS is no longer the best linear unbiased estimator. Quantile regression can address the following issues: (i) the error variance is often not constant across the distribution, violating the assumption of homoscedasticity; (ii) by focusing on the mean as a measure of location, information about the tails of the distribution is lost; and (iii) OLS is sensitive to extreme outliers, which can distort the results significantly (Montenegro, 2001).

2.5.3.4. Quantile regression

Quantile regression essentially transforms a conditional distribution function into a conditional quantile function by slicing it into segments. These segments describe the cumulative distribution of a conditional dependent variable Y given the explanatory variable xi with the use of quantiles as defined in equation (2.5.16).

For a dependent variable Y given the explanatory variable X = x and a fixed τ, 0 < τ < 1, the conditional quantile function QY|X(τ|x) is defined as the τth quantile of the conditional distribution function FY|X(y|x). For the estimation of the location of the conditional distribution, the conditional median QY|X(0.5|x) can be used as an alternative to the conditional mean (Lee, 2005).

One can illustrate quantile regression nicely by comparing it with OLS. In OLS, modeling the conditional distribution function of a random sample (y1, ..., yn) with a parametric function μ(xi, β), where xi represents the independent variables, β the corresponding parameters and μ the conditional mean, one obtains the following minimization problem:

min_{β} Σ_{i=1}^{n} (yi − μ(xi, β))²   (2.5.21)


One thereby obtains the conditional expectation function E[Y|xi]. One can proceed in a similar fashion in quantile regression; the central feature is ρτ, the so-called check function:

ρτ(x) = τ·x          if x ≥ 0
        (τ − 1)·x    if x < 0          (2.5.22)

This check function ensures that (i) all values of ρτ are positive and (ii) they are scaled according to the probability τ. Such a two-part function is necessary when dealing with L1 distances, since residuals can be negative.

In quantile regression, we minimize the following function:

min_{β} Σ_{i=1}^{n} ρτ(yi − ξ(xi, β))   (2.5.23)

Here, as opposed to OLS, the minimization is done for each subsection defined by ρτ, and the estimate of the τth quantile function is obtained with the parametric function ξ(xi, β) (Koenker and Hallock, 2001).
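A minimal Stata sketch of this minimization is given below. Stata's qreg solves a problem of the form (2.5.23) with a linear specification ξ(xi, β) = xiβ; the auto data and the variable names used here are illustrative assumptions only, not part of the example above.

* Sketch: conditional mean (OLS) versus conditional quantiles (quantile regression)
sysuse auto, clear
regress price weight length                                  // OLS: conditional mean, cf. eq. (2.5.21)
qreg price weight length, quantile(.50)                      // median regression, tau = 0.5
qreg price weight length, quantile(.10)                      // lower tail, tau = 0.1
qreg price weight length, quantile(.90)                      // upper tail, tau = 0.9
sqreg price weight length, quantile(.10 .50 .90) reps(100)   // several quantiles jointly, bootstrapped SEs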

Features of quantile regression that differentiate it from other regression methods are the following:

1. The entire conditional distribution of the dependent variable Y can be characterized through different values of τ.
2. Heteroscedasticity can be detected.
3. If the data are heteroscedastic, median regression estimators can be more efficient than mean regression estimators.
4. The minimization problem in equation (2.5.23) can be solved efficiently by linear programming methods, making estimation easy.
5. Quantile functions are equivariant to monotone transformations, that is, Qh(Y)|X(τ|x) = h(QY|X(τ|x)) for any monotone function h.
6. Quantiles are robust with regard to outliers (Lee, 2005).


A graphical illustration of quantile regression

Before proceeding to a numerical example, the following subsection seeks to illustrate the concept of quantile regression graphically. As a starting point, consider Figure 2.5.7. For a given value of the explanatory variable xi, the density of the conditional dependent variable Y is indicated by the size of the balloon: the bigger the balloon, the higher the density, so that the mode, i.e. the point where the density is highest for a given xi, corresponds to the biggest balloon. Quantile regression essentially connects the equally sized balloons, i.e. equal probabilities, across the different values of xi, thereby allowing one to focus on the relationship between the explanatory variable xi and the dependent variable Y at different quantiles, as can be seen in Figure 2.5.8. These subsets, marked by the quantile lines, reflect the probability density of the dependent variable Y given xi.

Figure 2.5.7: Probabilities of occurrence for individual explanatory variables

The example used in Figure 2.5.7 is originally from Koenker and Hallock (2000) and illustrates a classical empirical application, Ernst Engel's (1857) investigation into the relationship between household food expenditure, the dependent variable, and household income, the explanatory variable. In quantile regression, the conditional quantile function QY|X(τ|x) is computed for selected values of τ. In the analysis, the quantiles τ ∈ {0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95}, indicated by the thin blue lines that separate the different color sections, are superimposed on the data points. The conditional median (τ = 0.5) is indicated by a thick dark blue line, the conditional mean by a light-yellow line. The color sections thereby represent the subsections of the data as generated by the quantiles.

Figure 2.5.8: Engel curve, with the median highlighted in dark blue and the mean in yellow

Figure 2.5.8 can be understood as a contour plot representing a 3-D graph, with food expenditure and income on the y and x axes respectively. The third dimension arises from the probability density of the respective values. The density of a value is indicated by the darkness of the shade of blue: the darker the color, the higher the probability of occurrence. For instance, on the outer bounds, where the blue is very light, the probability density for the given data set is relatively low, as these regions are marked by the quantiles 0.05 to 0.1 and 0.9 to 0.95. It is important to note that Figure 2.5.8 represents the individual probability of occurrence for each subsection, whereas quantiles use the cumulative probability of the conditional distribution. For example, a τ of 0.05 means that 5% of the observations are expected to fall below that line, while a τ of 0.25 means that 25% of the observations are expected to fall below the 0.25 line (including those below the 0.05 and 0.1 lines).

The graph in Figure 2.5.8 suggests that the error variance is not constant across the distribution: the dispersion of food expenditure increases as household income goes up. In addition, the data are skewed to the left, indicated by the spacing of the quantile lines, which decreases above the median, and by the position of the median, which lies above the mean. This suggests that the assumption of homoscedasticity, on which OLS relies, is violated. The analyst is therefore well advised to use an alternative method such as quantile regression, which is able to deal with heteroscedasticity.

Example 2.5.7: To give a numerical example of the analytical power of quantile regression and to compare it with OLS, we analyze some selected variables of the Boston Housing dataset, which is available at the md-base website. The data were first analyzed by Belsley et al. (1980). The original data comprise 506 observations on 14 variables stemming from census tracts of the Boston metropolitan area.

This analysis uses as the dependent variable the median value of owner-occupied homes (a metric variable, abbreviated H) and investigates the effects of four independent variables, shown in Table 2.5.18. These variables were selected because they best illustrate the difference between OLS and quantile regression. For the sake of simplicity, potential difficulties related to finding the correct specification of a parametric model are neglected for now, and a linear model is simply assumed. For the estimation of asymptotic standard errors see, for example, Buchinsky (1998), who illustrates the design-matrix bootstrap estimator, or alternatively Powell (1986) for kernel-based estimation of asymptotic standard errors.

Table 2.5.18. The explanatory variables

Name           Short   Variable                                                 Type
NonrTail       T       Proportion of non-retail business acres                  metric
NoorOoms       O       Average number of rooms per dwelling                     metric
Age            A       Proportion of owner-occupied units built prior to 1940   metric
PupilTeacher   P       Pupil-teacher ratio                                      metric


First, an OLS model was estimated. Estimates are reported to three decimal places, as some of them turned out to be very small.

E[Hi|Ti,Oi,Ai,Pi] = α + βTi + δOi + γAi + λPi   (2.5.24)

Computing this in Stata, one obtains the results reported in Table 2.5.19.

Table 2.5.19. OLS estimates

α̂        β̂        δ̂        γ̂        λ̂
36.459   0.021    38.010   0.001    -0.953
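A hedged sketch of how an estimate of this kind might be produced in Stata follows; the file name boston.dta and the variable names medval, nonretail, rooms, age and ptratio are hypothetical stand-ins for H, T, O, A and P, since the names in the md-base file may differ.

* Sketch only: OLS estimation of eq. (2.5.24) with hypothetical variable names
use boston.dta, clear
regress medval nonretail rooms age ptratio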

Analyzing this data set via quantile regression, using the quantiles τ ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, the model can be specified as follows:

QH[τ|Ti,Oi,Ai,Pi] = ατ + βτTi + δτOi + γτAi + λτPi   (2.5.25)

Just for illustrative purposes, and to further foster the reader's understanding of quantile regression, the minimization problem for the 0.1th quantile is briefly written out; all other quantiles follow analogously:

min_{β} [ρ0.1(y1 − x1β) + ρ0.1(y2 − x2β) + ... + ρ0.1(yn − xnβ)]   (2.5.26)

with

ρ0.1(yi − xiβ) = 0.1·(yi − xiβ)          if (yi − xiβ) ≥ 0
                 (0.1 − 1)·(yi − xiβ)    if (yi − xiβ) < 0

Table 2.5.20. Quantile regression estimates

τ      α̂τ       β̂τ       δ̂τ       γ̂τ       λ̂τ
0.1    23.442   0.087    29.606   -0.022   -0.443
0.3    15.713   -0.001   45.281   -0.037   -0.617
0.5    14.850   0.022    53.252   -0.031   -0.737
0.7    20.791   -0.021   50.999   -0.003   -0.925
0.9    34.031   -0.067   51.353   0.004    -1.257
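Estimates of this kind could be obtained in Stata with qreg or sqreg; the sketch below again uses the hypothetical variable names introduced for the OLS example above.

* Sketch only: quantile regression of eq. (2.5.25) at tau = 0.1, 0.3, 0.5, 0.7, 0.9
foreach t of numlist .1 .3 .5 .7 .9 {
    qreg medval nonretail rooms age ptratio, quantile(`t')
}
* Or estimate all five quantiles jointly with bootstrapped standard errors:
sqreg medval nonretail rooms age ptratio, quantile(.1 .3 .5 .7 .9) reps(200)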


Comparing the OLS estimates in Table 2.5.19 with the quantile regression estimates in Table 2.5.20, one finds that the latter method allows much more subtle inferences about the effect of the explanatory variables on the dependent variable. Of particular interest are quantile estimates that differ markedly from those at other quantiles for the same coefficient.

Probably the most interesting and most illustrative result, pointing to the differences between quantile regression and OLS, concerns the proportion of non-retail business acres (Ti). OLS indicates that this variable has a positive influence on the dependent variable, the value of homes, with an estimate of β̂ = 0.021; that is, in the Boston Housing data house values increase as the proportion of non-retail business acres (Ti) increases.

Looking at the output that quantile regression provides, one finds a more differentiated picture. For the 0.1 quantile, we find an estimate of β̂0.1 = 0.087, which suggests that for this low quantile the effect is even stronger than suggested by OLS: here house prices go up when the proportion of non-retail business acres (Ti) goes up. However, at the other quantiles this effect is not as strong anymore, and for the 0.7th and 0.9th quantiles it even appears to be reversed, as indicated by the estimates β̂0.7 = -0.021 and β̂0.9 = -0.067. These values indicate that in these quantiles the house price is negatively influenced by an increase in non-retail business acres (Ti). The influence of non-retail business acres (Ti) on housing prices is therefore quite ambiguous, depending on which quantile one looks at. The general conclusion from OLS, that house prices increase as the proportion of non-retail business acres (Ti) increases, obviously cannot be generalized. A policy recommendation based on the OLS estimate alone could therefore be grossly misleading.

One would intuitively expect the statement that the average number of rooms of a property (Oi) positively influences the value of a house to be true. This is also suggested by OLS, with an estimate of δ̂ = 38.010. Quantile regression confirms this statement, but it also allows for much subtler conclusions. There seems to be a marked difference between the 0.1 quantile and the rest of the quantiles, in particular the 0.9th quantile. For the lowest quantile the estimate is δ̂0.1 = 29.606, whereas for the 0.9th quantile it is δ̂0.9 = 51.353. The other quantiles show values similar to the 0.9th, with estimates of δ̂0.3 = 45.281, δ̂0.5 = 53.252 and δ̂0.7 = 50.999 respectively. Hence, for the lowest quantile the influence of an additional room (Oi) on the house price appears considerably smaller than for all the other quantiles.

Another illustrative example is provided by the proportion of owner-occupied units built prior to 1940 (Ai) and its effect on the value of homes. Whereas OLS indicates that this variable has hardly any influence, with an estimate of γ̂ = 0.001, quantile regression gives a different impression. For the 0.1th quantile, age has a negative influence on the value of the home, with γ̂0.1 = -0.022. Comparing this with the highest quantile, where the estimate is γ̂0.9 = 0.004, one finds that the value of the house is now positively influenced by its age. The negative influence is confirmed by all the other quantiles besides the highest, the 0.9th quantile.

Last but not least, looking at the pupil-teacher ratio (Pi) and its influence on the value of houses, one finds that the tendency indicated by OLS, with an estimate of λ̂ = -0.953, is also reflected in the quantile regression analysis. However, quantile regression shows that the negative influence of the pupil-teacher ratio (Pi) on the housing price gradually increases in magnitude across the quantiles, from λ̂0.1 = -0.443 at the 0.1 quantile to λ̂0.9 = -1.257 at the 0.9th quantile.

This analysis makes clear that quantile regression allows much more differentiated statements than OLS. OLS estimates can even be misleading about the true relationship between an explanatory variable and the dependent variable, as the effects can differ markedly across different subsections of the sample.


Topic Summary

Duration or survival analysis is concerned with “time to event” data. When the outcome of a study is the time to an event, it is often not possible to wait until the event in question has happened to all the subjects, for example until all have died. In addition, subjects may leave the study prematurely, which leads to what are called censored observations, since complete information is not available for these subjects. The data set is thus an assemblage of times to the event in question and times after which no more information on the individual is available. Survival analysis methods are the only techniques capable of handling censored observations without treating them as missing data; they also make no assumption of a normal distribution for time-to-event data. The Topic covered descriptive methods for exploring survival times in a sample, the Kaplan–Meier technique, and Cox's proportional hazards model, the most general of the regression methods, which allows the hazard function to be modeled on a set of explanatory variables without making restrictive assumptions about the nature or shape of the underlying survival distribution.

As the saying goes, “Not everything that can be counted counts”. True, but often it does. Economists are often interested in the distributional effects of a policy reform on count outcomes. For instance, does a policy have a disproportionate impact on heavy users of health care services compared with occasional users? With continuous outcomes, such a research question would typically be addressed using quantile regression. With counts, such asymmetric responses can be modeled directly. Under the assumptions of the Poisson model, these effects depend on a single parameter, λ. Simple departures, such as zero-inflated models, hurdle models and other two-part models, allow for more flexible effects at various values of the count outcome.

For a distribution function FY(y) one can determine, for a given value of y, the probability τ of occurrence. Quantiles do exactly the opposite: one determines, for a given probability τ of the sample dataset, the corresponding value y. In OLS, the primary goal is to determine the conditional mean of the random variable Y given some explanatory variable xi, E[Y|xi]. Quantile regression goes beyond this and enables us to pose the same question at any quantile of the conditional distribution function. It focuses on the relationship between a dependent variable and its explanatory variables at a given quantile. Quantile regression thereby overcomes several problems that OLS is confronted with, such as heteroscedasticity, loss of information about the tails of the distribution, and sensitivity to extreme outliers.

Exercises on the Topic

1. In the period 1962-77 a total of 205 patients with malignant melanoma were operated on at Odense University Hospital in Denmark. The patients were followed up until death or until censoring at the end of the study on 31/12/1977. The data are available at the course web-page. The coding of the variables is as follows:
1) status: 1 = death from disease, 2 = censored, 4 = death from other cause
2) lifetime: time from operation to death or censoring (in years)
3) ulcer: ulceration (1 = present, 2 = absent)
4) thickn: tumor thickness in mm
5) sex: 1 = female, 2 = male
6) age: age at operation (in years)
7) grthick: grouped tumor thickness (1: 0-1 mm, 2: 2-5 mm, 3: 5+ mm)
8) logthick: logarithm of tumor thickness

As we are interested in mortality from malignant melanoma, we will in this exercise treat death from other causes as censored observations in the analysis.
a) Make Kaplan-Meier plots for each of the two genders, and test the difference using the log-rank test. Discuss the results.
b) Repeat the analysis in a) for each of the three groups of tumor thickness: 0-1 mm, 2-5 mm, and 5+ mm.
c) Repeat the analysis in a) for ulceration. (Ulceration is present if the surface of the melanoma viewed in a microscope shows signs of ulcers, and absent otherwise.)
d) Make a univariate Cox regression analysis of each of the covariates sex, tumor thickness and ulceration. For the numeric covariate tumor thickness, you should consider whether you ought to use the original version “thickn”, the log-transformed version “logthick”, or the grouped version “grthick” (as a factor). Interpret the results and compare with the results in a)-c).
e) Make a multivariate Cox regression analysis with the covariates mentioned in question d). Interpret the fitted model.
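A possible Stata starting point for this exercise is sketched below; it is not a full solution and it assumes the melanoma variables carry exactly the names listed above.

* Sketch: declare survival data, treating only death from disease (status==1) as the event
stset lifetime, failure(status==1)
sts graph, by(sex)                 // Kaplan-Meier curves by gender
sts test sex                       // log-rank test for equality of survivor functions
sts graph, by(grthick)             // repeat by grouped tumor thickness
sts test grthick
sts graph, by(ulcer)               // repeat by ulceration
sts test ulcer
stcox i.sex                        // univariate Cox regressions
stcox logthick
stcox i.ulcer
stcox i.sex logthick i.ulcer       // multivariate Cox model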


2. Compare and contrast Poisson and Negative Binomial models.

3. Compare and contrast OLS and quantile regression.

4. The survival time we will work with is the time that heroin addicts stay in a clinic for methadone maintenance treatment. The data were first analyzed by Caplehorn and Bell (1991) and appeared in a handbook of small datasets by Hand et al. (1994). They have also been analyzed by Rabe-Hesketh and Everitt (2006) in various editions of their Handbook of Statistical Analyses Using Stata. The data are available in the datasets section in a Stata file called heroin.dta.

The data come from two clinics coded 1 and 2; status is 1 if the event occurred (the person left the clinic) and 0 otherwise, and time is in days. The other two variables are an indicator of whether the person had a prison record, and the dose of methadone in mg.

(A) Models with no covariates
(a) Verify that we have 150 failures in a total of 95,812 days of exposure. It will be convenient to reckon time in months, treating a month as 30 days. Assume the hazard is constant over time. What is the event rate per month? What is the probability that someone would still be in treatment after one, two and three years?
(b) Split the dataset so that we have separate observations for 0-3, 3-6, 6-12, 12-18, 18-24 and 24-36 months. (In other words, split the first year into two quarters and a semester, and the second year into two semesters.) There are no exits after 3 years. Fit a piece-wise exponential model and describe the shape of the hazard. Test the hypothesis that the hazard is constant. Estimate the probability that someone would be in treatment after one, two and three years.
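One hedged way to set this up in Stata is sketched below, assuming the variables are named as described (time, status); other routes, such as tabulating events and exposure by hand, are equally valid.

* Sketch: piece-wise exponential model via stsplit plus an exponential/Poisson fit
use heroin.dta, clear
gen months = time/30                          // reckon time in 30-day months
stset months, failure(status==1)
stsplit interval, at(3 6 12 18 24 36)         // episodes for 0-3, 3-6, 6-12, 12-18, 18-24, 24-36 months
gen exposure = _t - _t0                       // exposure within each interval
streg ibn.interval, dist(exponential) noconstant nohr   // piece-wise constant hazards
* Equivalent Poisson formulation on the split data:
poisson _d ibn.interval, exposure(exposure) noconstant irr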

(B) Clinic effects
(a) Introduce a dummy variable to identify clinic 1 and add it to the model. Interpret the exponentiated coefficient and test its significance. (Note that clinic is coded 1 and 2; we want to treat clinic 2 as the reference. You may want to save this model for part (c).)


(b) Let us add an interaction between clinic and duration. To save d.f. we will group time in years for purposes of the interaction. For full credit, present your parameter estimates in terms of the effect of clinic 1 compared to clinic 2 in the first, second and third year of treatment. Comment on the point estimates.
(c) Test the hypothesis that the clinic effect does indeed vary by year using a likelihood ratio test and a Wald test.

(C) Prison and methadone
(a) Starting from the model of part B(b), where the effect of clinic varies by year, add a dummy variable for prison history and interpret the resulting estimate.
(b) Add a linear effect of dose and interpret the estimate. The effect of 1 mg of methadone may not be of interest because doses vary by several mg. In fact, the standard deviation is 14.45. What is the effect on the hazard of increasing the dose by one standard deviation?
(c) The original paper by Caplehorn and Bell treated dose as a categorical variable with three levels: < 60, 60-79, and 80+. Compare this specification with the linear term in part (b) in terms of parsimony and goodness of fit.

(D) Survival probabilities
(a) Use the piece-wise exponential model of part C(b), with effects of clinic, dose and prison history, to predict the probability of remaining in treatment after one, two and three years for someone with no prison record receiving the average dose of 60.4 mg of methadone in clinic 1.
(b) Repeat the calculations for someone with a prison record who receives 60.4 mg in clinic 1.
(c) Finally, repeat the calculations for someone without a prison record who receives the average dose of 60.4 mg, but in clinic 2. Please present the results for parts (a)-(c) in a single table.
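For part (D), recall that under a piece-wise exponential model the survival probability accumulates the interval-specific hazards: S(t) = exp(-Σj λj·Δtj), where Δtj is the time spent in interval j. The hazard values in the sketch below are placeholders for illustration only, not estimates from the data.

* Sketch: survival to 12 months given piece-wise constant hazards for 0-3, 3-6 and 6-12 months
* The lam values are placeholders; in practice they come from the fitted model
scalar lam1 = .05
scalar lam2 = .04
scalar lam3 = .03
scalar S12 = exp(-(3*lam1 + 3*lam2 + 6*lam3))
display S12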

5. Cameron and Trivedi (2009) have some interesting data on the number of office-based doctor visits by adults aged 25-64 based on the 2002 Medical Expenditure Panel Survey. We will use data for the most recent wave, available in the datasets section of the website as docvis.dta.


(A) A Poisson model
(a) Fit a Poisson regression model with the number of doctor visits (docvis) as the outcome. We will use the same predictors as Cameron and Trivedi, namely health insurance status (private), health status (chronic), gender (female) and income (income), but will add two indicators of ethnicity (black and hispanic). There are many more variables one could add, but we'll keep things simple.
(b) Interpret the coefficient of black and test its significance using a Wald test and a likelihood ratio test.
(c) Compute a 95% confidence interval for the effect of private insurance and interpret this result in terms of doctor visits.
(d) Compute the deviance and Pearson chi-squared statistics for this model. Does the model fit the data? Is there evidence of overdispersion?
(e) Predict the proportion expected to have exactly zero doctor visits and compare with the observed proportion. You will find the formula for Poisson probabilities in the notes. The probability of zero is simply e^(-μ).
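A hedged Stata sketch for part (A) follows, assuming the variables carry the names given above:

* Sketch: Poisson regression for doctor visits (part A)
use docvis.dta, clear
poisson docvis private chronic female income black hispanic
test black                                    // Wald test for the black coefficient
estat gof                                     // deviance and Pearson goodness-of-fit statistics
predict mu, n                                 // fitted means
gen p0hat = exp(-mu)                          // Poisson probability of zero visits
gen zero = (docvis==0)
summarize p0hat zero                          // predicted vs observed proportion of zeros
estimates store full
quietly poisson docvis private chronic female income hispanic   // model without black
lrtest full .                                 // likelihood-ratio test for black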

(B) Poisson overdispersion
(a) Suppose the variance is proportional to the mean rather than equal to the mean. Estimate the proportionality parameter using Pearson's chi-squared and use this estimate to correct the standard errors.
(b) What happens to the significance of the black coefficient once we allow for extra-Poisson variation? Could we test this coefficient using a likelihood ratio test? Explain.
(c) Compare the standard errors adjusted for over-dispersion with the robust or "sandwich" estimator of the standard errors.
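For part (B), one hedged route in Stata is to refit the same model as a GLM and scale the standard errors by Pearson's chi-squared divided by its degrees of freedom, and then compare with robust ("sandwich") standard errors:

* Sketch: over-dispersed Poisson via scaled SEs, and robust SEs for comparison
glm docvis private chronic female income black hispanic, family(poisson) scale(x2)
poisson docvis private chronic female income black hispanic, vce(robust)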

(C) A negative binomial model
(a) Fit a negative binomial model using the same outcome and predictors as in part A(a). Comment on any remarkable changes in the coefficients.
(b) Interpret the coefficient of black and test its significance using a Wald test and a likelihood ratio test. Compare your results with parts A(b) and B(b).


(c) Predict the percent of respondents with zero doctor visits according to this model and compare with part A(e). You will find a formula for negative binomial probabilities in the addendum to the notes. The probability of zero is given by [β/(μ + β)]^α, where α = β = 1/σ².
(d) Interpret the estimate of σ² in this model and test its significance, noting carefully the distribution of the test criterion.
(e) Use predicted values from this model to divide the sample into twenty groups of about equal size, compute the mean and variance of docvis in each group, and plot these values. Superimpose curves representing the over-dispersed Poisson and negative binomial variance functions and comment.

(D) A zero-inflated Poisson model
(a) Try a zero-inflated Poisson model with the same predictors as in part A(a) in both the Poisson and inflate equations.
(b) Predict the proportion of respondents with zero doctor visits according to this model and compare with parts A(e) and C(c). (Don't forget that there are two ways of having an outcome of zero in this model.)
(c) Interpret the coefficients of black in the two equations. Is the effect related to whether blacks visit the doctor at all, or to how often they visit?
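For parts (C) and (D), a minimal sketch along the same lines (still assuming these variable names; the zero-probability formula follows the expression quoted in part C(c), with Stata's alpha playing the role of σ²):

* Sketch: negative binomial and zero-inflated Poisson counterparts
nbreg docvis private chronic female income black hispanic      // NB2 model; e(alpha) is the dispersion parameter
predict munb, n
gen p0nb = (1/(1 + e(alpha)*munb))^(1/e(alpha))                 // NB probability of zero visits
summarize p0nb
zip docvis private chronic female income black hispanic, ///
    inflate(private chronic female income black hispanic)      // zero-inflated Poisson, same predictors in both parts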

Further Readings: Cameron and Trivedi (2005), Ch. 20, 23 & 25; Greene (2018), Ch. 18; Baltagi (2008), Ch. 10; Blundell and Dias (2000).

References

Baker, M., and Melino, A. (2000): “Duration Dependence and Nonparametric Heterogeneity: A Monte Carlo Study”. Journal of Econometrics, 96, 357-393.

Belsley, D. A., E. Kuh, and R. E. Welsch (1980): Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.

Buchinsky, M. (1998): “Recent Advances in Quantile Regression Models: A Practical Guideline for Empirical Research”. Journal of Human Resources, 33(1), 88–126.


Cade, B.S. and B.R. Noon (2003): “A gentle introduction to quantile regression for ecologists”. Frontiers in Ecology and the Environment, 1(8): 412-420. http://www.fort.usgs.gov/products/publications/21137/21137.pdf

Cleves, Mario A., William W. Gould and Roberto G. Gutierrez (2004): “An Introduction to Survival Analysis Using STATA”. Texas: STATA Corporation.

Coviello, V. and Boggess, M. (2004): “Cumulative Incidence Estimation in the Presence of Competing Risks”. The Stata Journal, 4, 103-112.

Hougaard, P. (2000): “Analysis of Multivariate Survival Data.” New York: Springer-Verlag.

Hilbe, J. M. (2011): Negative Binomial Regression. 2nd edn. Cambridge, UK: Cambridge University Press.

Kaplan E.L., Meier P. (1958): “Non-parametric Estimation from Incomplete Observations”. Journal of the American Statistical Association, 53, 457-481.

Koenker, R., and K. F. Hallock (2000): “Quantile Regression: An Introduction”, available at http://www.econ.uiuc.edu/~roger/research/intro/intro.html

Koenker, R., and K. F. Hallock (2001): “Quantile Regression”. Journal of Economic Perspectives, 15(4), 143–156.

Lee, S. (2005): “Lecture Notes for MECT1 Quantile Regression,” available at http://www.homepages.ucl.ac.uk/~uctplso/Teaching/MECT/lecture8.pdf

Mullahy, J. (1986): “Specification and Testing of Some Modified Count Data Models”. Journal of Econometrics, 33, 341–365.


ACKNOWLEDGEMENT

The African Economic Research Consortium (AERC) wishes to acknowledge and express its immense gratitude to the following resource persons, for their tireless efforts and valuable contribution in the development and compilation of this teaching module and other associated learning materials.

1. Dr. Getinet Haile, University of Nottingham, UK (Email: [email protected]);
2. Prof. Jema Haji Mohamed, Haramaya University, Ethiopia (Email: [email protected]);
3. Dr. Dianah M. Ngui-Muchai, Kenyatta University, Kenya (Email: [email protected]; [email protected]);
4. Prof. Arcade Ndoricimpa, University of Burundi, Burundi (Email: [email protected]).

Thank you.
